Introducing Fineweb-Edu-Fortified, an open dataset of high-quality educational web content

Josh Bauer·8/14/2024

Today, we are proud to publish Fineweb-Edu-Fortified, an enhanced dataset based on the popular Fineweb-Edu dataset.

This dataset is tailored for NLP tasks and helps streamline model training by offering a deduplicated dataset with an embedding for each row. It is well suited to startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models.

The dataset is derived from the Fineweb-Edu subset of the large Fineweb dataset and includes:

  • Exact-match deduplication across all crawls
  • Embeddings for each row using the TaylorAI/bge-micro model
  • Count column indicating duplication frequency
  • Data from 95 Common Crawl crawls (2013-2024)
  • A reduction from 1.279B rows to 0.324B rows after deduplication
  • ~375B tokens (down from 1,320B in Fineweb-Edu)

Explore 500k randomly selected Fineweb-Edu-Fortified rows in the Airtrain Dataset Explorer.

Or access the entire Fineweb-Edu-Fortified dataset on Hugging Face: airtrain-ai/fineweb-edu-fortified.

What is it?

Fineweb-Edu-Fortified is a dataset derived from Fineweb-Edu by applying exact-match deduplication across the whole dataset and producing an embedding for each row. The number of times the text from each row appears is also included as a count column. The embeddings were produced using TaylorAI/bge-micro.

Fineweb and Fineweb-Edu were obtained by processing data from 95 crawls of Common Crawl, covering a time period from 2013 to 2024. More information about the original datasets can be found in their respective dataset cards on Hugging Face.

The contents of a randomly selected 500k rows from this dataset can be interactively explored in this Airtrain Dataset Explorer.

Deduplication

Deduplication in original Fineweb and Fineweb-Edu

During creation of the original Fineweb dataset, a variety of deduplication strategies were explored. Each strategy was evaluated by training ablation models on randomly selected subsets of the data, using a subset of up to ~350 billion tokens, and comparing the resulting model performance.

Using this mechanism, the Fineweb authors selected a MinHash algorithm, with parameters chosen so that documents with approximately 75% similarity or higher are considered duplicates. This deduplication was performed within each Common Crawl crawl. For example, it would have removed all approximate duplicates within the 20th crawl from 2013, but would have retained an identical record that showed up in both the 2013-20 crawl and the 2013-48 crawl. The authors note that applying the deduplication across crawls reduced the evaluation performance of the ablation models used for assessment. The proposed reason for this performance degradation is that data duplicated across crawls is more likely to be high-quality than data that is not, so leaving in the duplicates effectively upsamples the higher-quality data.
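
To make the idea of MinHash-based near-duplicate detection concrete, here is a minimal sketch in pure Python. It is an illustration only, not the Fineweb pipeline: the 3-word shingles, the 64 hash functions, the salted-MD5 hashing scheme, and all function names are assumptions chosen for brevity. Only the ~75% similarity threshold comes from the description above.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of n-word shingles in `text` (n=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text, num_hashes=64, n=3):
    """Build a MinHash signature: for each salted hash function, keep the minimum hash."""
    shingle_set = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.75):
    """Treat two documents as duplicates at roughly 75% estimated similarity or higher."""
    sim = estimated_similarity(minhash_signature(text_a), minhash_signature(text_b))
    return sim >= threshold


doc = "the water cycle describes how water evaporates from the surface of the earth"
print(is_near_duplicate(doc, doc))  # identical documents always match
```

Production pipelines typically avoid comparing every pair of signatures and instead group signatures into bands to find candidate pairs; that step is omitted here.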

Following deduplication in Fineweb, Fineweb-Edu was extracted using a model-based quality classifier targeting educational content. It thus inherited the same intra-crawl (per-crawl) deduplication strategy as Fineweb.

Deduplication in this dataset

Motivation

Given the findings that cross-crawl deduplication reduced ablation model performance, one might ask what the motivation is for producing a dataset that uses it. Our motivation was threefold:

  • Reduce the number of rows that needed to be embedded by avoiding embedding of exact-match content
  • Enable easier filtering of the dataset for subsets-of-interest
  • Provide a version of the dataset for users whose training goals include avoiding training on non-unique tokens

For use cases that would benefit from "re-hydrating" or filtering the rows based on how frequently the text appeared in the original dataset, the new count column retains the number of appearances of the associated text.
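
As a rough illustration of how the count column could be used, the sketch below "re-hydrates" rows by repeating each one count times, and filters rows by how often their text appeared. The rows are plain dictionaries with made-up text; the helper names rehydrate and filter_by_count are hypothetical and not part of any library.

```python
from typing import Iterable, Iterator, Optional


def rehydrate(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield each row `count` times, approximating the original duplication."""
    for row in rows:
        for _ in range(row["count"]):
            yield row


def filter_by_count(rows: Iterable[dict], min_count: int = 1,
                    max_count: Optional[int] = None) -> Iterator[dict]:
    """Keep rows whose text appeared between min_count and max_count times."""
    for row in rows:
        if row["count"] >= min_count and (max_count is None or row["count"] <= max_count):
            yield row


# Toy rows standing in for dataset rows (the text values are invented).
toy_rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "count": 3},
    {"text": "A page that appeared only once.", "count": 1},
]

rehydrated = list(rehydrate(toy_rows))                      # 3 + 1 = 4 rows
frequent = list(filter_by_count(toy_rows, min_count=2))     # only the first row
print(len(rehydrated), len(frequent))
```

Because both helpers are generators, they should also work on a streamed dataset without loading it into memory, assuming each streamed row behaves like a dictionary.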

Procedure

The overall procedure was to remove exact matches that appeared in multiple crawls (also referred to as "dumps"). This was achieved by computing an MD5 hash of the text column and removing rows with duplicate hashes. To make this tractable at scale, we first grouped all rows by the first two hex digits of their hashes, then looked for exact hash matches within each of the resulting 256 buckets of data. Note that unlike the intra-crawl deduplication, we only eliminated exact matches across crawls. For duplicated rows, we strongly preferred keeping the metadata (e.g., dump, url) from the oldest crawl where the text appeared. Following deduplication and embedding, the data were grouped by the "dump" column, mirroring the organization of the original Fineweb-Edu dataset.
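
The procedure above can be sketched in a few lines of Python. This is a small in-memory illustration, not the code used to build the dataset: the function name dedup_exact is hypothetical, the example rows are invented, and "oldest crawl" is approximated by assuming that dump names of the form CC-MAIN-YYYY-WW sort chronologically as strings.

```python
import hashlib
from collections import defaultdict


def dedup_exact(rows):
    """Remove exact-match duplicates of `text`, recording a `count` per kept row."""
    # Stage 1: bucket rows by the first two hex digits of the MD5 hash of the
    # text, which yields at most 256 buckets.
    buckets = defaultdict(list)
    for row in rows:
        digest = hashlib.md5(row["text"].encode("utf-8")).hexdigest()
        buckets[digest[:2]].append((digest, row))

    # Stage 2: within each bucket, keep one row per full hash.
    deduped = []
    for bucket in buckets.values():
        kept = {}
        for digest, row in bucket:
            if digest not in kept:
                kept[digest] = {**row, "count": 1}
                continue
            current = kept[digest]
            count = current["count"] + 1
            # Assumption: an earlier dump name means an older crawl, so prefer
            # that row's metadata (dump, url).
            if row["dump"] < current["dump"]:
                current = dict(row)
            kept[digest] = {**current, "count": count}
        deduped.extend(kept.values())
    return deduped


rows = [
    {"text": "Plants make sugar from light.", "dump": "CC-MAIN-2024-10", "url": "https://example.org/b"},
    {"text": "Plants make sugar from light.", "dump": "CC-MAIN-2013-20", "url": "https://example.org/a"},
    {"text": "A page seen only once.", "dump": "CC-MAIN-2019-04", "url": "https://example.org/c"},
]
for row in dedup_exact(rows):
    print(row["dump"], row["count"], row["text"])
```

Bucketing by hash prefix means each bucket can be processed independently, which is what makes the approach practical to parallelize at larger scale.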

Deduplication stats

Deduplication removed approximately 74.7% of rows from the original dataset (from 1.279 billion in Fineweb-Edu to 0.324 billion rows in Fineweb-Edu-Fortified). This indicates that a substantial amount of data in Fineweb-Edu is present across multiple crawls.

The total token count in the deduplicated dataset is approximately 375 billion, compared to the 1,320 billion tokens in Fineweb-Edu.
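
The reduction figures can be checked with a few lines of arithmetic. The row and token counts below are the ones quoted above; the token reduction percentage is derived from them rather than stated in the original text.

```python
rows_before, rows_after = 1.279e9, 0.324e9      # rows in Fineweb-Edu vs. Fineweb-Edu-Fortified
tokens_before, tokens_after = 1320e9, 375e9     # tokens in Fineweb-Edu vs. Fineweb-Edu-Fortified

row_reduction = 1 - rows_after / rows_before
token_reduction = 1 - tokens_after / tokens_before

print(f"rows removed:   {row_reduction:.1%}")    # 74.7%
print(f"tokens removed: {token_reduction:.1%}")  # 71.6%
```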

[Figure: duplication count distribution]

Embeddings

To support use cases with Fineweb-Edu such as classification, clustering, and semantic search, we have produced an embedding vector for each row in the dataset. The embedding model TaylorAI/bge-micro was selected for its strong performance on MTEB benchmarks relative to its small size (17 million parameters). The model's embedding space has 384 dimensions. The context window of the model is 512 tokens (roughly several paragraphs of text); each row is embedded using only the first 512 tokens of its text field. Producing the embeddings took approximately 412 GPU-hours on Nvidia T4 GPUs.
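
As a small illustration of the semantic search use case, the sketch below ranks rows by cosine similarity between a query vector and each row's embedding. It makes several assumptions: the column name "embedding" is a guess at how the vectors are stored, the toy vectors have 4 dimensions rather than 384, and the query vector would in practice come from embedding the query text with the same model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=3):
    """Return the k (score, row) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, row["embedding"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows with 4-dimensional vectors standing in for 384-dimensional embeddings.
toy_rows = [
    {"text": "intro to algebra", "embedding": [0.9, 0.1, 0.0, 0.0]},
    {"text": "cell biology basics", "embedding": [0.0, 0.8, 0.2, 0.0]},
    {"text": "linear equations", "embedding": [0.8, 0.2, 0.1, 0.0]},
]
query = [1.0, 0.0, 0.0, 0.0]
for score, row in top_k(query, toy_rows, k=2):
    print(f"{score:.3f}  {row['text']}")
```

A linear scan like this is only suitable for small subsets; at the scale of the full dataset, a vector index would be the more practical choice.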

Usage via datasets

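from datasets import load_dataset

# Stream a single crawl ("dump") rather than downloading the whole dataset.
fw = load_dataset(
    "airtrain-ai/fineweb-edu-fortified",
    name="CC-MAIN-2024-10",  # config name selects the Common Crawl dump
    split="train",
    streaming=True,  # iterate over rows lazily
)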

Acknowledgements

Airtrain would like to thank the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets, as well as for their support during work on Fineweb-Edu-Fortified.

We'd also like to thank @underspirit for pointing out the amount of reduction in dataset size that could be achieved via deduplication.

We owe gratitude to TaylorAI for the bge-micro embedding model.

Finally, thank you to the Hugging Face community for fostering a thriving ecosystem of models, datasets, and tools to support open-source AI.
