Evals Framework

You can't improve what you can't measure. Teams that ship LLM products without rigorous evals are flying blind — they don't know if the new prompt is better, if the model update degraded a capability, or if the product is even achieving its stated goal. At mature AI labs, every model change is gated on eval results; a regression in a single eval benchmark can block a release. The challenge with LLM evaluation is that outputs are open-ended text — you can't just check for equality. McNemar's test, bootstrap CIs, and Cohen's κ give rigorous statistical grounding to pairwise comparisons, and LLM-as-judge unlocks scalable evaluation for tasks where automated metrics fail. This lesson covers the statistical machinery of eval pipelines and implements a complete LLM judge framework.

Theory

Judge Calibration

Expected Calibration Error (ECE), as used here, measures the mean absolute deviation between judge scores and human scores. A biased judge over-scores: it is sycophantic toward text that looks high-quality even when the content is flawed.
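One plausible reading of that definition, sketched under assumptions: group examples by the score the judge gave, compare each group's judge score with the mean human score for the same examples, and average the absolute gaps weighted by group size. The function name `judge_calibration_gap` and the sample scores are hypothetical, not part of the lesson; the signed gap is included because it shows the direction of the bias (positive means the judge over-scores).

```python
from collections import defaultdict

def judge_calibration_gap(judge_scores, human_scores):
    """Return (ece, bias) for paired judge and human scores on the same examples."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")

    # Bucket the human scores by the score the judge assigned.
    buckets = defaultdict(list)
    for j, h in zip(judge_scores, human_scores):
        buckets[j].append(h)

    n = len(judge_scores)
    ece = 0.0   # weighted mean of |judge - mean human| per bucket
    bias = 0.0  # weighted mean of (judge - mean human); > 0 means over-scoring
    for j, humans in buckets.items():
        gap = j - sum(humans) / len(humans)
        weight = len(humans) / n
        ece += weight * abs(gap)
        bias += weight * gap
    return ece, bias

# Made-up 1-5 ratings for six responses.
judge = [5, 5, 4, 4, 3, 5]
human = [4, 5, 4, 3, 3, 3]
ece, bias = judge_calibration_gap(judge, human)
print(f"ECE: {ece:.3f}, signed bias: {bias:+.3f}")
```

A large ECE with a bias near zero would suggest a noisy judge rather than a systematically generous one.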

You can't trust your own judgment about whether your model is improving — confirmation bias is too strong. Evals are the forcing function: they measure what the model actually does on a representative sample, not what you hope it does. The statistical machinery below (McNemar's test, bootstrap CIs, Cohen's κ) exists for one reason: to tell you whether an observed difference is real or noise, so you can make shipping decisions you can defend.

Evaluation as Hypothesis Testing

The core question: "Is model A better than model B?" This is a paired hypothesis test.

Null hypothesis $H_0$ : models A and B have the same performance. For $n$ eval examples, the test statistic under McNemar's test (for binary outcomes):

$\chi^2 = \frac{(b - c)^2}{b + c}$

where $b$ = number of examples where A is correct and B is wrong; $c$ = the reverse. Under $H_0$ the statistic approximately follows a chi-squared distribution with one degree of freedom, so $\chi^2 > 3.84$ corresponds to $p < 0.05$ and rejects $H_0$ .

McNemar's test uses only the discordant pairs ( $b$ and $c$ ), the examples where A and B disagree, because concordant pairs carry no information about which model is better. If both models get an example right, that tells you nothing about their relative quality; the same holds when both get it wrong. The statistic is computed from the off-diagonal cells of the 2×2 table of paired outcomes. Because both models are scored on the same examples, the observations are paired, which is why McNemar's is the right test here rather than a standard two-sample proportion test, which assumes independent samples.
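As a minimal sketch of the test described above, the function below counts the discordant pairs from two per-example correctness lists and converts the statistic to a p-value. The name `mcnemar_test` and the example counts are illustrative assumptions. The p-value uses the identity that a chi-squared variable with one degree of freedom is a squared standard normal, so no statistics library is needed.

```python
import math

def mcnemar_test(a_correct, b_correct):
    """McNemar's test (no continuity correction) on paired binary outcomes."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same examples")

    # b: A right, B wrong.  c: A wrong, B right.  Concordant pairs are ignored.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    if b + c == 0:
        return {"b": 0, "c": 0, "chi2": 0.0, "p_value": 1.0}

    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"b": b, "c": c, "chi2": chi2, "p_value": p_value}

# Hypothetical run: 80 examples both get right, 15 only A, 5 only B.
a_correct = [True] * 80 + [True] * 15 + [False] * 5
b_correct = [True] * 80 + [False] * 15 + [True] * 5
result = mcnemar_test(a_correct, b_correct)
print(result)
```

When $b + c$ is small (a common rule of thumb is below about 25), the chi-squared approximation is rough and an exact binomial test on the discordant pairs is the safer choice.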

Bootstrap Confidence Intervals

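For any metric $M$ (accuracy, BLEU, win rate), the bootstrap gives confidence intervals without assuming a parametric distribution for the data: resample the eval set with replacement, recompute $M$ on each resample, and use the spread of the resampled values. With the bootstrap standard error $\text{se}_{\text{boot}}$ (the standard deviation of the resampled metric), the normal-approximation interval is:

$\text{CI}_{95\%} = [\hat{M} - 1.96 \cdot \text{se}_{\text{boot}},\ \hat{M} + 1.96 \cdot \text{se}_{\text{boot}}]$

The `bootstrap_ci` function below uses the percentile variant instead, taking the 2.5th and 97.5th percentiles of the resampled means directly. The two agree closely when the resampled distribution is roughly symmetric.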

python

import numpy as np
 
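def bootstrap_ci(scores: list[float], n_boot: int = 1000, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of `scores`."""
    # Resample with replacement (same size as the original sample) n_boot times.
    means = [np.mean(np.random.choice(scores, len(scores))) for _ in range(n_boot)]
    # The alpha/2 and 1 - alpha/2 percentiles of the resampled means bound the interval.
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

scores = [0.85, 0.92, 0.78, 0.91, 0.88, 0.76, 0.95, 0.83]
lo, hi = bootstrap_ci(scores)
print(f"Mean: {np.mean(scores):.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
# Mean: 0.860. The bounds vary slightly between runs because the resampling
# is unseeded (roughly [0.82, 0.90] for this data).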
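For comparison, here is a sketch of the normal-approximation interval from the formula above, which plugs the bootstrap standard error into $\hat{M} \pm 1.96 \cdot \text{se}_{\text{boot}}$ . The function name `bootstrap_se_ci`, the fixed seed, and the choice of the standard library over NumPy are assumptions made to keep the example reproducible and dependency-free.

```python
import random
import statistics

def bootstrap_se_ci(scores, n_boot=2000, z=1.96, seed=0):
    """Normal-approximation CI for the mean, using the bootstrap standard error."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(scores)
    # Mean of each resample drawn with replacement from the original scores.
    boot_means = [statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)]
    se_boot = statistics.stdev(boot_means)
    m_hat = statistics.fmean(scores)
    return m_hat - z * se_boot, m_hat + z * se_boot

scores = [0.85, 0.92, 0.78, 0.91, 0.88, 0.76, 0.95, 0.83]
lo_se, hi_se = bootstrap_se_ci(scores)
print(f"Mean: {statistics.fmean(scores):.3f}, 95% CI: [{lo_se:.3f}, {hi_se:.3f}]")
```

This interval is symmetric around the point estimate by construction, whereas the percentile interval can be asymmetric, which matters for metrics bounded near 0 or 1.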

Inter-Rater Reliability

When using human annotators or LLM judges, measure agreement with Cohen's κ:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

where $P_o$ is observed agreement and $P_e$ is expected chance agreement. Interpretation:

$\kappa < 0.2$ : slight (judges disagree significantly)
$0.2$ – $0.4$ : fair
$0.4$ – $0.6$ : moderate
$0.6$ – $0.8$ : substantial
$> 0.8$ : near-perfect

LLM-as-judge vs human annotators typically achieves $\kappa \approx 0.65$ – $0.75$ for clear quality dimensions.
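The κ formula can be computed directly from two lists of labels. The sketch below assumes categorical labels (here "good" and "bad") and uses made-up ratings; `cohens_kappa` is an illustrative name, not a function from the lesson.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # P_o: fraction of items where the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # P_e: agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from a human annotator and an LLM judge on ten responses.
human = ["good", "good", "bad", "good", "bad", "bad", "good", "good", "bad", "good"]
judge = ["good", "good", "bad", "bad", "bad", "bad", "good", "good", "good", "good"]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 8 of 10 items ( $P_o = 0.8$ ), but chance agreement is already $P_e = 0.52$ , so κ lands in the moderate band rather than looking like 80% agreement.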

💡 Why CIs matter more than point estimates

"Model A gets 87.3% vs Model B's 86.1%" is meaningless without confidence intervals. On 200 examples, each accuracy carries a 95% CI of roughly ±5 percentage points, so the two intervals overlap almost entirely and the 1.2-point difference is not statistically significant. Always report CIs, not just means.

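To make the 87.3% vs 86.1% comparison concrete, the sketch below computes a normal-approximation (Wald) interval for each accuracy at $n = 200$ . The helper `accuracy_ci` is a hypothetical name, and the Wald interval is only one of several reasonable choices for a proportion.

```python
import math

def accuracy_ci(accuracy, n, z=1.96):
    """Wald confidence interval for an accuracy measured on n examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

a_lo, a_hi = accuracy_ci(0.873, 200)
b_lo, b_hi = accuracy_ci(0.861, 200)
overlap = a_lo <= b_hi and b_lo <= a_hi

print(f"Model A: [{a_lo:.3f}, {a_hi:.3f}]")
print(f"Model B: [{b_lo:.3f}, {b_hi:.3f}]")
print(f"Intervals overlap: {overlap}")
```

Overlapping intervals are a quick screen, not a formal test. When both models are scored on the same examples, a paired test such as McNemar's is the more sensitive check.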
Walkthrough

Building an Eval Harness

python

import anthropic
import json
from pathlib import Path
from dataclasses import dataclass
 
client = anthropic.Anthropic()
 
@dataclass
class EvalExample:
    question: str
    reference: str | None = None
    context: str | None = None
 
@dataclass
class EvalResult:
    question: str
    response: str
    score: float | None = None
    judgment: str | None = None
 
def run_eval(
    examples: list[EvalExample],
    system_prompt: str,
    model: str = "claude-sonnet-4-6",
) -> list[EvalResult]:
    results = []
    for ex in examples:
        messages = [{"role": "user", "content": ex.question}]
        if ex.context:
            messages[0]["content"] = f"Context: {ex.context}\n\nQuestion: {ex.question}"
 
        response = client.messages.create(
            model=model, max_tokens=1024,
            system=system_prompt, messages=messages,
        )
        results.append(EvalResult(
            question=ex.question,
            response=response.content[0].text,
        ))
    return results

LLM-as-Judge

Figure: LLM judge win rates and correlation with human judgments across evaluation dimensions.

The left panel shows pairwise win rates across evaluation dimensions (helpfulness, accuracy, safety). The right panel shows human-judge correlation: well-calibrated LLM judges (GPT-4, Claude) achieve 0.8+ Spearman correlation with human preferences on most tasks — sufficient for catching regressions, though not for final safety decisions.
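Spearman correlation between judge and human scores is the Pearson correlation of their ranks. A dependency-free sketch follows; the function names and the five sample ratings are assumptions for illustration, and tied scores receive their average rank, which matters for 1-5 scales where ties are common.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for five responses.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 3, 5, 4]
rho = spearman(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f}")
```

With only a handful of examples the estimate is very noisy, so the correlation is worth reporting alongside a bootstrap interval.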

python

JUDGE_PROMPT = """You are evaluating an AI assistant's response.
 
Question: {question}
Reference answer: {reference}
Model response: {response}
 
Score the response 1-5 on:
- Accuracy (is it factually correct?)
- Completeness (does it fully answer?)
- Conciseness (is it appropriately brief?)
 
Respond with JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}}"""
 
def judge_responses(
    results: list[EvalResult],
    examples: list[EvalExample],
) -> list[EvalResult]:
    for result, ex in zip(results, examples):
        if not ex.reference:
            continue
        judgment = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=200,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=ex.question,
                reference=ex.reference,
                response=result.response,
            )}]
        )
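        raw = judgment.content[0].text
        try:
            scores = json.loads(raw)
            result.score = (scores["accuracy"] + scores["completeness"] + scores["conciseness"]) / 3
            result.judgment = scores.get("reasoning", "")
        except (json.JSONDecodeError, KeyError, TypeError):
            # Unparseable judgment: leave the score unset instead of inventing a
            # neutral 3.0, which would silently bias the average. Keep the raw
            # text so the failure can be inspected.
            result.score = None
            result.judgment = raw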
    return results
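`judge_responses` passes the judge's reply straight to `json.loads`, which fails whenever the reply contains anything besides the JSON object. A more tolerant parser, sketched below under assumptions, pulls out the outermost `{...}` span and validates the three 1-5 scores before averaging. The helper name `parse_judge_output` and its return shape are hypothetical, not part of the lesson's framework.

```python
import json
import re

REQUIRED_KEYS = ("accuracy", "completeness", "conciseness")

def parse_judge_output(text):
    """Return {"score", "reasoning"} from a judge reply, or None if it is unusable."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        scores = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

    values = []
    for key in REQUIRED_KEYS:
        value = scores.get(key)
        # Reject missing, non-numeric, boolean, or out-of-range scores.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 1 <= value <= 5:
            return None
        values.append(value)

    return {"score": sum(values) / 3, "reasoning": str(scores.get("reasoning", ""))}

reply = 'Sure! {"accuracy": 5, "completeness": 4, "conciseness": 3, "reasoning": "ok"} Hope this helps.'
parsed = parse_judge_output(reply)
print(parsed)
```

Returning `None` for unusable replies lets the caller count and report parse failures separately instead of folding them into the average.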

MMLU-Style Benchmark

python

# Multiple-choice evaluation (exact match)
def eval_mcq(
    questions: list[dict],  # {"question": str, "choices": list, "answer": str}
    model: str = "claude-sonnet-4-6",
) -> dict:
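    outcomes = []  # 1.0 if the prediction matched the answer key, else 0.0
    for q in questions:
        choices_str = "\n".join(f"{chr(65+i)}. {c}" for i, c in enumerate(q["choices"]))
        prompt = f"{q['question']}\n\n{choices_str}\n\nAnswer with just the letter (A/B/C/D):"

        r = client.messages.create(
            model=model, max_tokens=5,
            messages=[{"role": "user", "content": prompt}]
        )
        # Use the first character of the reply as the predicted letter; an
        # empty reply counts as wrong instead of raising IndexError.
        text = r.content[0].text.strip()
        pred = text[0].upper() if text else ""
        outcomes.append(1.0 if pred == q["answer"] else 0.0)

    accuracy = sum(outcomes) / len(outcomes)
    # Bootstrap over per-question outcomes. The question dicts themselves are
    # always truthy, so resampling them would give a degenerate [1.0, 1.0] interval.
    lo, hi = bootstrap_ci(outcomes)
    print(f"Accuracy: {accuracy:.3f} [{lo:.3f}, {hi:.3f}] on {len(questions)} questions")
    return {"accuracy": accuracy, "ci": (lo, hi), "n": len(questions)}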

Analysis & Evaluation

Where Your Intuition Breaks

It is tempting to read a high eval score as proof that the model is good at the task. But eval scores measure the model's behavior on the eval distribution, not on your production distribution, and the gap between the two is where eval-passing models fail in production. A model can score 95% on your held-out set because the held-out set was drawn from the same distribution as the training prompts; it may score 60% on real user queries that differ in phrasing, domain, or intent. Goodhart's Law applies: once a measure becomes a target, it ceases to be a good measure. Eval suites must evolve as the model and product evolve.

Benchmark Landscape

Benchmark	Focus	Metric	Size
Massive Multitask Language Understanding (MMLU)	World knowledge (57 subjects)	Accuracy	14k multiple-choice questions
HumanEval	Python coding	Pass@1	164 problems
GSM8K	Grade school math	Accuracy	1.3k test problems
MT-Bench	Multi-turn instruction	GPT-4 score (1–10)	80 questions
Holistic Evaluation of Language Models (HELM)	Multi-axis (accuracy + bias + efficiency)	Composite	~42 scenarios
BIG-Bench Hard	Hard reasoning tasks	Accuracy	6.5k examples

Goodhart's Law in Practice

⚠️ When the metric becomes the target

Once a benchmark is used in training data selection or reward model training, models optimize for it specifically. A headline number such as GPT-4's 86.4% on MMLU is hard to interpret if MMLU-like data appeared in pretraining, because models can "overfit" to question formats without gaining the underlying knowledge. Maintain private holdout evals that are never shared publicly.

Eval Design Principles

Task-specific: general benchmarks don't predict performance on your actual use case
Representative: sample from your real input distribution, not just obvious cases
Adversarial: include edge cases, ambiguous inputs, and known failure modes
Versioned: track eval dataset version alongside model version
Blinded: hide which model produced each response from the rater, score with multiple methods, and compare LLM judge scores to human ratings

python

# Eval versioning
eval_config = {
    "eval_id": "v1.2.0",
    "date": "2026-02-28",
    "model": "claude-sonnet-4-6",
    "dataset": "support_tickets_v3",
    "n_examples": 500,
    "metrics": ["accuracy", "f1", "auc"],
    "results": {"accuracy": 0.923, "f1": 0.918, "auc": 0.971},
}
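# Create the output directory first; write_text fails if it does not exist.
out_path = Path("eval_results") / f"{eval_config['eval_id']}.json"
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(eval_config, indent=2))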
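Versioned results make it possible to gate a release on a comparison with the previous run. The sketch below assumes two result records shaped like `eval_config` and flags any metric that drops by more than a fixed tolerance. The function `find_regressions`, the 0.01 tolerance, and the "v1.3.0" numbers are all hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return {metric: (old, new)} for metrics that dropped by more than `tolerance`."""
    regressions = {}
    for metric, old in baseline["results"].items():
        new = candidate["results"].get(metric)
        if new is None:
            # A metric missing from the new run is treated as a regression.
            regressions[metric] = (old, None)
        elif new < old - tolerance:
            regressions[metric] = (old, new)
    return regressions

baseline = {
    "eval_id": "v1.2.0",
    "results": {"accuracy": 0.923, "f1": 0.918, "auc": 0.971},
}
# Made-up numbers for a later run, for illustration only.
candidate = {
    "eval_id": "v1.3.0",
    "results": {"accuracy": 0.931, "f1": 0.894, "auc": 0.969},
}

regressions = find_regressions(baseline, candidate)
if regressions:
    print(f"Blocked: {regressions}")
else:
    print("No regressions beyond tolerance")
```

A fixed tolerance is a placeholder. A threshold derived from the confidence interval of each metric would avoid blocking releases on noise.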
