LLM Dataset Construction
The quality of an LLM is determined more by its training data than its architecture. Pre-training data determines what the model knows; fine-tuning data determines how it behaves. Both require deliberate construction pipelines: web crawls need quality filters, deduplication, and safety scrubbing; instruction datasets need format, diversity, and quality control that automated metrics can't fully capture.
How It Works
A typical pre-training data pipeline takes 10B+ raw tokens and discards 60–80% of them through successive filtering. Each filter trades data volume for quality — the goal is a smaller, cleaner dataset that trains faster and to higher quality than the unfiltered version.
Raw web data is noisy enough that its quality ceiling is low: templated pages, spam, low-information repetition, and multilingual noise all dilute what the model can learn per token. Pre-training data pipelines are expensive because filtering is the work — the 30 billion tokens that survive from 100 billion raw tokens are worth more for training than all 100 billion unfiltered tokens combined.
Pre-training data pipeline
Collection: web crawl data (Common Crawl), books (Project Gutenberg, Books3), code (GitHub), Wikipedia, domain-specific corpora (PubMed, ArXiv). Mix matters — more code improves reasoning; more books improves coherence.
Deduplication: MinHash Locality-Sensitive Hashing (LSH) finds near-duplicate documents by hashing n-gram shingles. Documents with Jaccard similarity above 0.8 are considered duplicates; keep one. Exact substring deduplication removes repeated paragraphs across documents. Deduplication reduces memorization and improves generalization.
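To make the 0.8 Jaccard threshold concrete, here is a minimal pure-Python sketch (with invented toy documents) of the exact shingle-based Jaccard similarity that MinHash approximates. A real pipeline never computes this pairwise over billions of documents; avoiding that is the point of LSH.

```python
def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity; MinHash estimates this without pairwise comparison."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

doc1 = "the quick brown fox jumps over the lazy dog near the river"
doc2 = "the quick brown fox jumps over the lazy dog near the bridge"
print(jaccard(doc1, doc2))  # ~0.82, just above the 0.8 duplicate threshold
```

Two documents differing only in their final word score about 0.82 here, over the cutoff, which is why near-duplicate detection catches templated variants that exact matching misses.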
Quality filtering: a classifier trained on curated (Wikipedia, books) vs random web text distinguishes high-quality from low-quality documents. Language detection filters non-target-language content. Perplexity filtering using a small n-gram language model (KenLM) removes low-perplexity documents (highly templated, not informative) and very high-perplexity documents (garbled text).
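The perplexity-filtering logic can be sketched with a toy add-one-smoothed unigram model standing in for KenLM. The thresholds and reference corpus below are invented for illustration; real pipelines train KenLM on curated text and tune both bounds per corpus.

```python
import math
from collections import Counter

def unigram_perplexity(text: str, counts: Counter, total: int) -> float:
    """Toy unigram LM perplexity with add-one smoothing; KenLM plays this role in practice."""
    words = text.lower().split()
    if not words:
        return float("inf")
    log_prob = sum(
        math.log((counts[w] + 1) / (total + len(counts))) for w in words
    )
    return math.exp(-log_prob / len(words))

# Tiny "reference" corpus standing in for a model trained on clean text
reference = "the cat sat on the mat the dog sat on the rug".split()
counts, total = Counter(reference), len(reference)

LOW, HIGH = 5.0, 500.0  # assumed bounds; tune against your corpus
def keep(text: str) -> bool:
    """Drop both templated (too-low) and garbled (too-high) perplexity text."""
    return LOW < unigram_perplexity(text, counts, total) < HIGH
```

A highly repetitive string like "the the the the the" scores below the low bound and is filtered, mirroring how templated boilerplate is dropped at scale.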
Safety filtering: a toxic content classifier removes hate speech, violence, and adult content. PII scrubbing removes emails, phone numbers, SSNs, and addresses using regex and NER. CSAM detection uses perceptual hash matching against known databases.
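A regex-only sketch of the PII scrubbing step. The patterns below are deliberately simple and illustrative; production scrubbing combines much broader patterns with NER, as noted above.

```python
import re

# Illustrative patterns only; real pipelines use broader regexes plus NER
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_pii("Contact jane@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```

Typed placeholders (rather than deletion) preserve sentence structure, so the surrounding text remains useful training signal.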
Formatting: tokenize with the target tokenizer, shuffle globally, pack multiple documents into fixed-length context windows with separator tokens. Store in sharded binary format.
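The packing step can be sketched as follows, assuming already-tokenized documents and a single separator ID; a real pipeline does this over shards, but the slicing logic is the same.

```python
def pack_documents(
    token_lists: list[list[int]], context_len: int, sep_id: int
) -> list[list[int]]:
    """Concatenate tokenized documents with a separator token, then slice
    into fixed-length windows. Tail tokens that don't fill a window are dropped."""
    stream: list[int] = []
    for tokens in token_lists:
        stream.extend(tokens)
        stream.append(sep_id)  # document boundary (e.g. an end-of-text token)
    return [
        stream[i:i + context_len]
        for i in range(0, len(stream) - context_len + 1, context_len)
    ]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
windows = pack_documents(docs, context_len=4, sep_id=0)
print(windows)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that documents routinely straddle window boundaries; the separator token is what lets the model learn where one document ends and the next begins.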
Instruction dataset construction
For fine-tuning (SFT and RLHF), datasets take the form of (system prompt, user turn, assistant response) triples. Three construction approaches:
Human-written: highest quality, highest cost. Used for the "gold" examples in the dataset. Necessary for sensitive domains where automated quality is unreliable.
Distillation from a stronger model: generate responses to user prompts using GPT-4 or Claude, then fine-tune a smaller model on those responses. This is how most open-source instruction models are built. Quality is bounded by the teacher model, but scales cheaply. Note that OpenAI's terms of use prohibit using GPT-4 outputs to train competing models; check provider terms before distilling.
Self-instruct / synthetic generation: generate both prompts AND responses from an existing model using seed tasks. Alpaca used 175 seed tasks to generate 52k instruction-response pairs from text-davinci-003. Quality varies widely; diversity of the seed tasks drives coverage.
Data format for fine-tuning
Fine-tuning data is structured as chat templates matching the model's expected format:
{
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a string."},
{"role": "assistant", "content": "```python\ndef reverse_string(s: str) -> str:\n return s[::-1]\n```"}
]
}
In SFT, the loss mask applies only to assistant tokens — the model learns to produce responses, not to predict user turns or system prompts.
The loss mask must exclude user turns and system prompts from the training objective because the goal of fine-tuning is to teach the model how to respond, not to predict what users will say. Computing loss on user turns would reward a model that memorizes common user phrasings, which has no value at inference. Including system prompts in the loss would teach the model to "predict" system instructions — but at inference time, the system prompt is provided as input, not generated. Masking to assistant tokens only ensures gradient updates improve only the behavior the model is responsible for producing.
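A minimal sketch of mask construction, using an invented toy tokenizer in place of the model's real one (the role markers here are illustrative, not any specific chat template):

```python
def build_loss_mask(messages: list[dict], tokenize) -> tuple[list[int], list[int]]:
    """Tokenize a conversation and mask the loss to assistant tokens only.
    `tokenize` is a stand-in for the model's tokenizer."""
    input_ids: list[int] = []
    mask: list[int] = []
    for m in messages:
        ids = tokenize(f"<|{m['role']}|>{m['content']}")
        input_ids.extend(ids)
        # 1 -> token contributes to the loss; 0 -> context only
        mask.extend([1] * len(ids) if m["role"] == "assistant" else [0] * len(ids))
    return input_ids, mask

# Toy whitespace "tokenizer": each word becomes one fake token ID (its length)
toy_tokenize = lambda s: [len(w) for w in s.split()]
ids, mask = build_loss_mask(
    [{"role": "user", "content": "hi there"},
     {"role": "assistant", "content": "hello"}],
    toy_tokenize,
)
print(mask)  # [0, 0, 1] -- only the assistant token is trained on
```

During training, the mask zeroes out per-token losses before the reduction, so user and system tokens still condition the model but never receive gradient.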
Design Tradeoffs
Where Your Intuition Breaks
More data is not always better for LLMs. The Chinchilla scaling law showed that for a given compute budget, smaller models trained on more tokens outperform larger models trained on fewer tokens — but this assumes the additional tokens are of comparable quality. Adding low-quality tokens can hurt model performance: the model trains on noise and learns spurious patterns. Quality filtering that discards 70% of raw web data improves the effective training signal per token even though it reduces volume. The counterintuitive implication is that investing in data quality pipelines (better deduplication, better quality classifiers) often produces larger capability gains than collecting more raw data — a shift from "collect more" to "filter better".
Data mixture ratios
Pre-training data mixture — proportions of web, books, code, Wikipedia — significantly affects model capabilities:
| Data source | Effect if increased |
|---|---|
| Code | Better reasoning, structured output, math |
| Books | Better coherence, narrative, long-form text |
| Web | Broader knowledge, more up-to-date |
| Wikipedia | Factual accuracy, structured knowledge |
| Math | Arithmetic, formal reasoning |
Optimal mixtures are dataset- and task-specific. Experiments with held-out benchmarks guide mixing decisions.
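One way a mixture is consumed at training time is per-batch source sampling, drawing each document's source in proportion to its weight. The ratios below are invented for illustration; as noted above, real mixtures are tuned against held-out benchmarks.

```python
import random

# Assumed mixture weights, for illustration only
MIXTURE = {"web": 0.50, "code": 0.20, "books": 0.15, "wikipedia": 0.10, "math": 0.05}

def sample_source(rng: random.Random) -> str:
    """Draw a data source in proportion to its mixture weight."""
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print({s: draws.count(s) / len(draws) for s in MIXTURE})
```

Sampling (rather than concatenating fixed-size chunks) makes it cheap to re-weight the mixture between training runs without rebuilding the shards.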
Instruction data quality vs quantity
For instruction fine-tuning, quality dominates quantity past a surprisingly small scale. Datasets of 1,000–10,000 high-quality examples often outperform 100k noisily-filtered examples. The LIMA paper showed that 1,000 carefully curated examples produce competitive instruction-following behavior.
Quality signals for instruction data:
- Diversity: does the dataset cover many task types, styles, and domains?
- Response quality: are responses correct, helpful, and appropriately detailed?
- Instruction difficulty: easy prompts produce trivially correct responses that don't improve the model
- Format consistency: chat template, length, and tone match the target deployment context
Contamination and benchmark leakage
If training data includes documents from evaluation benchmarks, benchmark scores are inflated and misleading. Benchmark contamination is detected by checking for exact or near-exact matches between training documents and benchmark items.
Deduplication against held-out benchmarks should happen BEFORE quality filtering — quality filters won't catch benchmark contamination that passes quality thresholds.
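A simple n-gram overlap check is one way to sketch the contamination detection described above (the 13-gram default follows common practice; window size and normalization vary by lab):

```python
def ngram_set(text: str, n: int = 13) -> set[str]:
    """Word n-grams; 13-grams are a commonly used contamination-check unit."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_items: list[str], n: int = 13) -> bool:
    """Flag a training document if it shares any n-gram with a benchmark item."""
    doc_grams = ngram_set(document, n)
    return any(doc_grams & ngram_set(item, n) for item in benchmark_items)
```

At scale, the benchmark n-grams go into a set (or Bloom filter) built once, so each training document is checked in a single pass rather than against every benchmark item.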
In Practice
Building an instruction dataset with quality filters
from datasets import Dataset
import json
def filter_instruction_quality(example):
"""Keep examples with sufficient response length and no refusals."""
response = example["messages"][-1]["content"]
# Filter too-short responses (unhelpful)
if len(response.split()) < 20:
return False
# Filter refusals (model couldn't complete the task)
refusal_phrases = ["I cannot", "I'm unable to", "As an AI, I don't"]
if any(p in response for p in refusal_phrases):
return False
return True
# Load raw dataset
with open("raw_instructions.jsonl") as f:
data = [json.loads(line) for line in f]
dataset = Dataset.from_list(data)
filtered = dataset.filter(filter_instruction_quality)
print(f"Retained {len(filtered)} / {len(dataset)} examples ({len(filtered)/len(dataset):.1%})")
Deduplication with MinHash
from datasketch import MinHash, MinHashLSH
def get_minhash(text, num_perm=128):
m = MinHash(num_perm=num_perm)
for word in text.lower().split():
m.update(word.encode("utf-8"))
return m
lsh = MinHashLSH(threshold=0.8, num_perm=128)
unique_docs = []
for i, doc in enumerate(documents):  # `documents`: list of {"text": ...} records
m = get_minhash(doc["text"])
if not lsh.query(m): # no near-duplicate found
lsh.insert(str(i), m)
unique_docs.append(doc)
print(f"Kept {len(unique_docs)} / {len(documents)} after dedup")
Evaluating data quality before training
Don't train on a new dataset without spot-checking:
- Sample 200 random examples and read them manually
- Compute basic statistics: length distribution, vocabulary coverage, language distribution
- Run an automated quality classifier and check the score distribution
- Check for contamination against your held-out benchmarks
If spot-checking finds 10–20% problematic examples, the filters need tuning before training starts.
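The basic statistics from the checklist above can be sketched as follows (a minimal version over response lengths only; vocabulary and language coverage would be added the same way):

```python
import statistics

def dataset_stats(examples: list[dict]) -> dict:
    """Quick sanity statistics over assistant response lengths (in words)."""
    lengths = [len(ex["messages"][-1]["content"].split()) for ex in examples]
    return {
        "n": len(lengths),
        "mean_len": statistics.mean(lengths),
        "median_len": statistics.median(lengths),
        "p95_len": sorted(lengths)[int(0.95 * len(lengths))],
    }
```

A mean far below the median, or a collapsed p95, is a cheap early warning that a generation pipeline started truncating responses.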
Production Patterns
Conversation format validation
Before any filtering or deduplication, validate that every example is structurally correct. Malformed records that pass earlier stages cause cryptic training failures.
from datasets import Dataset
import json
from typing import Any
VALID_ROLES = {"system", "user", "assistant"}
def validate_conversation(example: dict[str, Any]) -> tuple[bool, str]:
"""Return (is_valid, reason). Reasons are used for rejection statistics."""
messages = example.get("messages")
if not isinstance(messages, list) or len(messages) < 2:
return False, "too_few_messages"
roles = [m.get("role") for m in messages]
if any(r not in VALID_ROLES for r in roles):
return False, f"invalid_role:{set(roles) - VALID_ROLES}"
# Must start with system or user, never assistant
if roles[0] == "assistant":
return False, "starts_with_assistant"
# Must end with assistant turn (the target response)
if roles[-1] != "assistant":
return False, "missing_assistant_turn"
# All turns must have non-empty string content
for i, m in enumerate(messages):
if not isinstance(m.get("content"), str) or not m["content"].strip():
return False, f"empty_content_at_turn_{i}"
return True, "ok"
def filter_and_report(dataset: Dataset) -> Dataset:
rejection_counts: dict[str, int] = {}
valid_indices = []
for i, ex in enumerate(dataset):
ok, reason = validate_conversation(ex)
if ok:
valid_indices.append(i)
else:
rejection_counts[reason] = rejection_counts.get(reason, 0) + 1
print(f"Retained {len(valid_indices)} / {len(dataset)} examples")
for reason, count in sorted(rejection_counts.items(), key=lambda x: -x[1]):
print(f" {reason}: {count}")
return dataset.select(valid_indices)
with open("raw_sft_data.jsonl") as f:
raw = [json.loads(line) for line in f]
dataset = Dataset.from_list(raw)
clean = filter_and_report(dataset)
clean.save_to_disk("validated_sft_data")
Log rejection counts by reason to a metrics store (MLflow, W&B) alongside the run that consumed the dataset. A sudden spike in missing_assistant_turn usually signals a data generation pipeline change, not a real quality shift.
LLM-as-judge scoring pipeline
Automated quality scoring with an LLM judge lets you score thousands of examples per hour. Use async API calls to saturate rate limits and keep wall-clock time reasonable.
import asyncio
import json
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
JUDGE_PROMPT = """\
You are evaluating the quality of an AI assistant response.
User request:
{user_turn}
Assistant response:
{assistant_turn}
Score the response on each dimension from 1–5:
- helpfulness: Does it fully address what was asked?
- accuracy: Is the information correct?
- clarity: Is it well-written and easy to follow?
- safety: Does it avoid harmful content?
Respond with JSON only, no prose:
{{"helpfulness": N, "accuracy": N, "clarity": N, "safety": N, "overall": N}}"""
async def score_example(example: dict, semaphore: asyncio.Semaphore) -> dict:
user_turn = next(
m["content"] for m in example["messages"] if m["role"] == "user"
)
assistant_turn = next(
m["content"] for m in example["messages"] if m["role"] == "assistant"
)
async with semaphore:
response = await client.messages.create(
model="claude-opus-4-5",
max_tokens=256,
messages=[{
"role": "user",
"content": JUDGE_PROMPT.format(
user_turn=user_turn, assistant_turn=assistant_turn
),
}],
)
scores = json.loads(response.content[0].text)
return {**example, "judge_scores": scores}
async def score_dataset(
examples: list[dict], concurrency: int = 20
) -> list[dict]:
semaphore = asyncio.Semaphore(concurrency)
tasks = [score_example(ex, semaphore) for ex in examples]
return await asyncio.gather(*tasks)
# Run and filter to overall >= 4
scored = asyncio.run(score_dataset(examples, concurrency=20))
high_quality = [ex for ex in scored if ex["judge_scores"]["overall"] >= 4]
print(f"Kept {len(high_quality)} / {len(scored)} after judge filtering")
Calibrate the judge score threshold against a sample of human ratings before applying it at scale. An uncalibrated judge can silently discard the most creative or unconventional examples, biasing the dataset toward bland responses.
Deduplication with MinHash LSH at scale
The datasketch implementation is sufficient for datasets up to ~10M documents. For larger corpora, run the same algorithm on Spark using pyspark.ml.feature.MinHashLSH.
from datasketch import MinHash, MinHashLSH
from datasets import Dataset
import re
NUM_PERM = 128
JACCARD_THRESHOLD = 0.8
NGRAM_SIZE = 5 # character 5-grams are more robust than word-level for short texts
def text_to_ngrams(text: str, n: int = NGRAM_SIZE) -> list[str]:
"""Normalize and extract character n-grams."""
text = re.sub(r"\s+", " ", text.lower().strip())
return [text[i:i+n] for i in range(len(text) - n + 1)]
def build_minhash(text: str) -> MinHash:
m = MinHash(num_perm=NUM_PERM)
for gram in text_to_ngrams(text):
m.update(gram.encode("utf-8"))
return m
def deduplicate(dataset: Dataset, text_field: str = "text") -> Dataset:
lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
kept_indices = []
duplicates = 0
for i, example in enumerate(dataset):
text = " ".join(
m["content"] for m in example["messages"]
) if "messages" in example else example[text_field]
m = build_minhash(text)
candidates = lsh.query(m)
if candidates:
duplicates += 1
else:
lsh.insert(str(i), m)
kept_indices.append(i)
print(f"Removed {duplicates} near-duplicates ({duplicates/len(dataset):.1%})")
    return dataset.select(kept_indices)
Run deduplication before LLM-as-judge scoring — scoring duplicates wastes API budget. For instruction datasets sourced from multiple generators (human-written + synthetic), deduplication also catches cases where the same prompt was independently generated twice with slightly different wording.