Neural-Path/Notes
25 min

Evals Framework

You can't improve what you can't measure. Teams that ship LLM products without rigorous evals are flying blind — they don't know if the new prompt is better, if the model update degraded a capability, or if the product is even achieving its stated goal. At mature AI labs, every model change is gated on eval results; a regression in a single eval benchmark can block a release. The challenge with LLM evaluation is that outputs are open-ended text — you can't just check for equality. McNemar's test, bootstrap CIs, and Cohen's κ give rigorous statistical grounding to pairwise comparisons, and LLM-as-judge unlocks scalable evaluation for tasks where automated metrics fail. This lesson covers the statistical machinery of eval pipelines and implements a complete LLM judge framework.

Theory

Judge Calibration
12345quality score (1–5)0%10%20%30%40%freqECE = 0.012✓ well calibratedhumancalibrated judge

ECE measures mean absolute deviation between judge and human score distributions. A biased judge over-scores — sycophantic toward high-quality-looking text even when content is flawed.

You can't trust your own judgment about whether your model is improving — confirmation bias is too strong. Evals are the forcing function: they measure what the model actually does on a representative sample, not what you hope it does. The statistical machinery below (McNemar's test, bootstrap CIs, Cohen's κ) exists for one reason: to tell you whether an observed difference is real or noise, so you can make shipping decisions you can defend.

Evaluation as Hypothesis Testing

The core question: "Is model A better than model B?" This is a paired hypothesis test.

Null hypothesis H0H_0: models A and B have the same performance. For nn eval examples, the test statistic under McNemar's test (for binary outcomes):

χ2=(bc)2b+c\chi^2 = \frac{(b - c)^2}{b + c}

where bb = number of examples where A correct, B wrong; cc = reverse. p<0.05p < 0.05 rejects H0H_0.

McNemar's test uses only the discordant pairs (bb and cc) — examples where A and B disagree — because concordant pairs carry no information about which model is better. If both models get an example right, that tells you nothing about their relative quality; same for both getting it wrong. The test statistic is a chi-squared test on the 2×2 table of disagreements. This is why McNemar's is the right test for comparing two classifiers on the same evaluation set, rather than a standard two-sample proportion test.

Bootstrap Confidence Intervals

For any metric MM (accuracy, BLEU, win rate), bootstrap gives distribution-free confidence intervals:

CI95%=[M^1.96seboot, M^+1.96seboot]\text{CI}_{95\%} = [\hat{M} - 1.96 \cdot \text{se}_{\text{boot}},\ \hat{M} + 1.96 \cdot \text{se}_{\text{boot}}]

python
import numpy as np
 
def bootstrap_ci(scores: list[float], n_boot: int = 1000, alpha: float = 0.05) -> tuple:
    means = [np.mean(np.random.choice(scores, len(scores))) for _ in range(n_boot)]
    return np.percentile(means, [100*alpha/2, 100*(1-alpha/2)])
 
scores = [0.85, 0.92, 0.78, 0.91, 0.88, 0.76, 0.95, 0.83]
lo, hi = bootstrap_ci(scores)
print(f"Mean: {np.mean(scores):.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
# Mean: 0.860, 95% CI: [0.796, 0.918]

Inter-Rater Reliability

When using human annotators or LLM judges, measure agreement with Cohen's κ:

κ=PoPe1Pe\kappa = \frac{P_o - P_e}{1 - P_e}

where PoP_o is observed agreement and PeP_e is expected chance agreement. Interpretation:

  • κ<0.2\kappa < 0.2: slight (judges disagree significantly)
  • 0.20.20.40.4: fair
  • 0.60.60.80.8: substantial
  • >0.8> 0.8: near-perfect

LLM-as-judge vs human annotators typically achieves κ0.65\kappa \approx 0.650.750.75 for clear quality dimensions.

💡Why CI matters more than point estimates

"Model A gets 87.3% vs Model B's 86.1%" is meaningless without confidence intervals. On 200 examples, these would have overlapping 95% CIs — the difference is not statistically significant. Always report CIs, not just means.

Walkthrough

Building an Eval Harness

python
import anthropic
import json
from pathlib import Path
from dataclasses import dataclass
 
client = anthropic.Anthropic()
 
@dataclass
class EvalExample:
    question: str
    reference: str | None = None
    context: str | None = None
 
@dataclass
class EvalResult:
    question: str
    response: str
    score: float | None = None
    judgment: str | None = None
 
def run_eval(
    examples: list[EvalExample],
    system_prompt: str,
    model: str = "claude-sonnet-4-6",
) -> list[EvalResult]:
    results = []
    for ex in examples:
        messages = [{"role": "user", "content": ex.question}]
        if ex.context:
            messages[0]["content"] = f"Context: {ex.context}\n\nQuestion: {ex.question}"
 
        response = client.messages.create(
            model=model, max_tokens=1024,
            system=system_prompt, messages=messages,
        )
        results.append(EvalResult(
            question=ex.question,
            response=response.content[0].text,
        ))
    return results

LLM-as-Judge

LLM judge win rates and correlation with human judgments across evaluation dimensions

The left panel shows pairwise win rates across evaluation dimensions (helpfulness, accuracy, safety). The right panel shows human-judge correlation: well-calibrated LLM judges (GPT-4, Claude) achieve 0.8+ Spearman correlation with human preferences on most tasks — sufficient for catching regressions, though not for final safety decisions.

python
JUDGE_PROMPT = """You are evaluating an AI assistant's response.
 
Question: {question}
Reference answer: {reference}
Model response: {response}
 
Score the response 1-5 on:
- Accuracy (is it factually correct?)
- Completeness (does it fully answer?)
- Conciseness (is it appropriately brief?)
 
Respond with JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}}"""
 
def judge_responses(
    results: list[EvalResult],
    examples: list[EvalExample],
) -> list[EvalResult]:
    for result, ex in zip(results, examples):
        if not ex.reference:
            continue
        judgment = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=200,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=ex.question,
                reference=ex.reference,
                response=result.response,
            )}]
        )
        try:
            scores = json.loads(judgment.content[0].text)
            result.score = (scores["accuracy"] + scores["completeness"] + scores["conciseness"]) / 3
            result.judgment = scores.get("reasoning", "")
        except:
            result.score = 3.0
    return results

MMLU-Style Benchmark

python
# Multiple-choice evaluation (exact match)
def eval_mcq(
    questions: list[dict],  # {"question": str, "choices": list, "answer": str}
    model: str = "claude-sonnet-4-6",
) -> dict:
    correct = 0
    for q in questions:
        choices_str = "\n".join(f"{chr(65+i)}. {c}" for i, c in enumerate(q["choices"]))
        prompt = f"{q['question']}\n\n{choices_str}\n\nAnswer with just the letter (A/B/C/D):"
 
        r = client.messages.create(
            model=model, max_tokens=5,
            messages=[{"role": "user", "content": prompt}]
        )
        pred = r.content[0].text.strip()[0].upper()
        if pred == q["answer"]:
            correct += 1
 
    accuracy = correct / len(questions)
    lo, hi = bootstrap_ci([1.0 if q else 0.0 for q in questions])
    print(f"Accuracy: {accuracy:.3f} [{lo:.3f}, {hi:.3f}] on {len(questions)} questions")
    return {"accuracy": accuracy, "ci": (lo, hi), "n": len(questions)}

Analysis & Evaluation

Where Your Intuition Breaks

A high eval score means the model is good at the task. Eval scores measure the model's behavior on the eval distribution — not on your production distribution. The gap between these two distributions is where eval-passing models fail in production. A model can score 95% on your held-out set because the held-out set was drawn from the same distribution as the training prompts; it may score 60% on real user queries that differ in phrasing, domain, or intent. Goodhart's Law applies: once a measure becomes a target, it ceases to be a good measure. Eval suites must evolve as the model and product evolve.

Benchmark Landscape

BenchmarkFocusMetricSize
Massive Multitask Language Understanding (MMLU)World knowledge (57 subjects)Accuracy14k MCQ
HumanEvalPython codingPass@1164 problems
GSM8KGrade school mathAccuracy1.3k problems
MT-BenchMulti-turn instructionGPT-4 score (1–10)80 questions
Holistic Evaluation of Language Models (HELM)Multi-axis (accuracy + bias + efficiency)Composite~42 scenarios
BIG-Bench HardHard reasoning tasksAccuracy6.5k

Goodhart's Law in Practice

⚠️When the metric becomes the target

Once a benchmark is used in training data selection or reward model training, models optimize for it specifically. GPT-4 scored 86.4% on MMLU — but when MMLU-like data appears in pretraining, models can "overfit" to question formats without gaining underlying knowledge. Maintain private holdout evals not shared publicly.

Eval Design Principles

  1. Task-specific: general benchmarks don't predict performance on your actual use case
  2. Representative: sample from your real input distribution, not just obvious cases
  3. Adversarial: include edge cases, ambiguous inputs, and known failure modes
  4. Versioned: track eval dataset version alongside model version
  5. Blinded: score with multiple methods; compare LLM judge to human ratings
python
# Eval versioning
eval_config = {
    "eval_id": "v1.2.0",
    "date": "2026-02-28",
    "model": "claude-sonnet-4-6",
    "dataset": "support_tickets_v3",
    "n_examples": 500,
    "metrics": ["accuracy", "f1", "auc"],
    "results": {"accuracy": 0.923, "f1": 0.918, "auc": 0.971},
}
Path("eval_results/v1.2.0.json").write_text(json.dumps(eval_config, indent=2))

Enjoying these notes?

Get new lessons delivered to your inbox. No spam.