Neural-Path/Notes

LLM-as-Judge & Regression Evals

Human evaluation of LLM outputs is the gold standard — and completely unscalable. LLM-as-Judge replaces a human rater with a powerful LLM, enabling evaluation at thousands of examples per hour. The technique is now standard at every lab for evaluating model changes, prompt revisions, and RAG pipeline upgrades. This lesson covers how to design a judge that correlates with human judgment, how to detect and correct for common biases, and how to build a regression eval suite that catches quality regressions before they reach production.

Theory

LLM-as-Judge — pairwise evaluation & bias correction
[Interactive demo: position bias. The judge compares a verbose definition of overfitting (Position 1, A) against a concise one (Position 2, B). In the original order the judge prefers A, while human raters prefer B. Applying the swap-and-average correction, score(A, B) = [score(A first) + (1 − score(B first))] / 2, the two passes disagree, so the verdict is recorded as a tie.]

An LLM judge has its own biases, and those biases are systematic, not random, so they do not wash out with more samples. The demo above shows position bias in action: in one ordering the judge prefers the verbose response, while human raters prefer the concise one. The math below explains how to measure whether your judge actually correlates with human judgment, and how to correct for the biases it introduces.

Judge Correlation with Human Raters

The goal of an LLM judge is to approximate human judgment. Quality is measured by Spearman rank correlation \rho or Cohen's kappa \kappa between judge scores and human scores:

\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}

Rank correlation is used instead of Pearson correlation because judge scores are ordinal, not interval-scaled. The difference between a score of 4 and 5 is not necessarily the same magnitude as the difference between 1 and 2 — judges tend to cluster scores in the middle of the scale with inconsistent spacing. Spearman's \rho only uses rank positions, making it robust to arbitrary score distributions and calibration drift. A judge that reliably ranks responses in the right order is useful even if its absolute scores are miscalibrated.

where d_i is the rank difference for example i. An acceptable LLM judge typically achieves \rho > 0.7 with human raters on the same rubric.
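The formula above can be computed directly. A minimal dependency-free sketch (the helper name is illustrative): ties are handled by average ranks, which makes this equivalent to the Pearson correlation of the two rank vectors — the closed-form d_i expression assumes no ties.

```python
def spearman_rho(judge_scores: list[float], human_scores: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.

    Ties receive the average of the ranks they span, so this also
    works on clustered ordinal judge scores.
    """
    def ranks(vals: list[float]) -> list[float]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        out = [0.0] * len(vals)
        i = 0
        while i < len(vals):
            j = i
            # Extend the window over equal values (a tie group)
            while j + 1 < len(vals) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank for the group
            for k in range(i, j + 1):
                out[order[k]] = avg_rank
            i = j + 1
        return out

    rx, ry = ranks(judge_scores), ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Run this on a held-out set where you have both judge and human scores; \rho > 0.7 is the working bar from the text.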

Pairwise vs pointwise evaluation:

  • Pointwise: judge scores each response independently (e.g., 1–5 scale). Pros: fast, parallelizable. Cons: scores drift without calibration (what is a "4" vs "5"?).
  • Pairwise: judge picks which of two responses is better. Pros: more reliable (relative comparison is easier than absolute). Cons: O(n^2) pairs for ranking n responses; use efficient tournament brackets.
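The O(n^2) round-robin case can be sketched as a win-count ranking. The `compare` callable here is an assumption standing in for a real pairwise judge call that returns 'A', 'B', or 'tie':

```python
from itertools import combinations

def rank_by_wins(responses: list[str], compare) -> list[str]:
    """Round-robin ranking: judge every pair once, sort by win count.

    compare(a, b) -> 'A' | 'B' | 'tie' is a placeholder for an actual
    pairwise judge; ties award no points to either side.
    """
    wins = {i: 0 for i in range(len(responses))}
    for i, j in combinations(range(len(responses)), 2):
        verdict = compare(responses[i], responses[j])
        if verdict == 'A':
            wins[i] += 1
        elif verdict == 'B':
            wins[j] += 1
    order = sorted(wins, key=wins.get, reverse=True)
    return [responses[k] for k in order]
```

For large n, a single-elimination bracket cuts the number of judge calls from n(n−1)/2 to roughly n, at the cost of a noisier ranking below the top spots.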

Position Bias and Verbosity Bias

LLM judges exhibit systematic biases that inflate agreement with incorrect judgments:

Position bias: when comparing two responses, judges prefer the response in position A (first) at rates significantly above 50%. To correct: swap positions and average scores.

\text{score}(A, B) = \frac{\text{score}(A \text{ first}, B \text{ second}) + \big(1 - \text{score}(B \text{ first}, A \text{ second})\big)}{2}

Verbosity bias: longer responses receive higher scores independent of quality. Correlation between response length and judge score is often 0.3–0.5 even for irrelevant verbosity. Correct by including explicit rubric criteria that penalize unnecessary length.
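One way to check whether your own judge has verbosity bias is to correlate response length with judge score on a diverse eval set. A sketch, with illustrative names, using plain Pearson correlation:

```python
def length_score_correlation(responses: list[str], scores: list[float]) -> float:
    """Pearson correlation between response word count and judge score.

    On a quality-diverse eval set, values in the 0.3-0.5 range (as cited
    in the text) suggest the judge is rewarding length, not quality.
    """
    lengths = [len(r.split()) for r in responses]
    n = len(scores)
    ml, ms = sum(lengths) / n, sum(scores) / n
    cov = sum((l - ml) * (s - ms) for l, s in zip(lengths, scores))
    sl = sum((l - ml) ** 2 for l in lengths) ** 0.5
    ss = sum((s - ms) ** 2 for s in scores) ** 0.5
    return cov / (sl * ss) if sl and ss else 0.0
```

Note the caveat: some correlation is legitimate (harder questions genuinely need longer answers), so interpret this against examples where length is clearly padding.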

Calibration

A well-calibrated judge produces scores whose distribution matches the actual quality distribution of your model's outputs. To measure calibration, compare judge score distributions against human score distributions using a calibration plot (judge score vs average human score, binned).
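The binning step for a calibration plot is mechanical. A minimal sketch, assuming integer-ish judge scores on the same scale as human scores (names are illustrative):

```python
from collections import defaultdict

def calibration_bins(judge_scores: list[float], human_scores: list[float]) -> dict:
    """Group examples by rounded judge score; return the average human
    score per bin. For a well-calibrated judge, each bin's average human
    score is close to the bin's judge score."""
    bins = defaultdict(list)
    for j, h in zip(judge_scores, human_scores):
        bins[round(j)].append(h)
    return {b: sum(hs) / len(hs) for b, hs in sorted(bins.items())}
```

Plot bin value against average human score; points above the diagonal mean the judge is overestimating quality in that score range.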

Recalibration: if the judge overestimates quality (common), add calibration examples — cases the judge historically got wrong — to the judge prompt.

Walkthrough

Building a Regression Eval Suite

Design a rubric-based judge prompt:

python
JUDGE_PROMPT = """You are evaluating a response from an AI assistant.
 
Rubric:
- Accuracy (0-3): Is the response factually correct and complete?
- Relevance (0-2): Does it directly address the question?
- Conciseness (0-2): Is it appropriately concise without unnecessary padding?
- Format (0-1): Is it properly formatted and readable?
 
Total score: 0-8. Higher is better.
 
Question: {question}
 
Response to evaluate:
{response}
 
Output JSON: {{"accuracy": int, "relevance": int, "conciseness": int, "format": int, "total": int, "reasoning": str}}"""  # braces doubled so str.format leaves the JSON template intact
 
def judge(question: str, response: str) -> dict:
    import anthropic, json
    client = anthropic.Anthropic()
    result = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response
        )}]
    )
    return json.loads(result.content[0].text)

Position-bias-corrected pairwise comparison:

python
def compare_responses(question: str, response_a: str, response_b: str) -> str:
    """Returns 'A', 'B', or 'tie'"""
    # Forward comparison
    forward = judge_pairwise(question, response_a, response_b)  # returns 'A'/'B'/'tie'
    # Reverse comparison (swap positions to cancel position bias)
    reverse = judge_pairwise(question, response_b, response_a)  # A and B swapped
 
    # Map reverse result back to original labels
    reverse_mapped = {'A': 'B', 'B': 'A', 'tie': 'tie'}[reverse]
 
    if forward == reverse_mapped:
        return forward
    return 'tie'  # disagree → call it a tie

Regression eval harness:

python
import json
from pathlib import Path
 
def run_regression_eval(model_fn, eval_set_path: str, threshold: float = 6.5):
    """Run eval and flag regressions vs baseline scores.

    An example counts as a regression if it drops more than 1 point
    below its baseline or falls under the absolute `threshold` floor.
    """
    eval_set = json.loads(Path(eval_set_path).read_text())
    results = []

    for ex in eval_set:
        response = model_fn(ex["question"])
        score = judge(ex["question"], response)["total"]
        delta = score - ex["baseline_score"]
        results.append({
            "id": ex["id"],
            "score": score,
            "baseline": ex["baseline_score"],
            "delta": delta,
            "regression": delta < -1 or score < threshold,
        })
 
    regressions = [r for r in results if r["regression"]]
    avg_score = sum(r["score"] for r in results) / len(results)
    print(f"Avg score: {avg_score:.2f} | Regressions: {len(regressions)}/{len(results)}")
    return results

Analysis & Evaluation

Where Your Intuition Breaks

The intuition says a consistent judge is a reliable judge. It is wrong: self-consistency and accuracy are independent properties. A judge can be consistently wrong — always preferring verbose responses regardless of quality, always favoring the first option in pairwise comparisons, always overscoring outputs that use technical vocabulary. These are systematic biases, and they produce high self-consistency precisely because they are systematic. The right check is correlation with human judgments on a calibration set, not internal consistency. A judge that correlates poorly with humans (\rho < 0.5) but scores very consistently is worse than one with moderate correlation and moderate consistency.

LLM Judge Quality Checklist

Criterion | How to verify
Rubric specificity | Each criterion has a 3+ point scale with concrete examples
Calibration examples | 5–10 "anchor" examples with expected scores in the prompt
Position bias corrected | Swap response order and average
Judge model vs evaluated model | Judge should be more capable (e.g., Opus judging Sonnet outputs)
Agreement rate with humans | Measure \kappa > 0.6 on a held-out human-labeled set
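The \kappa in the last row is Cohen's kappa: raw agreement corrected for the agreement two raters would reach by chance. A minimal sketch for categorical labels (helper name is illustrative):

```python
from collections import Counter

def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Cohen's kappa between judge and human labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the chance agreement implied by each rater's marginals.
    """
    n = len(judge_labels)
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    cj, ch = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_labels) | set(human_labels)
    p_e = sum(cj[l] * ch[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

Chance correction matters here: if 90% of outputs are "good", two raters agreeing 90% of the time may be agreeing entirely by accident, and kappa will be near zero.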

When LLM-as-Judge Fails

Self-preference bias: a model judging its own outputs gives systematically inflated scores. Use a different model family as the judge whenever possible.

Rubric gaming: if the evaluated model is trained or prompted with knowledge of the judge criteria, it can optimize for the rubric rather than actual quality (Goodhart's law). Keep judge criteria internal; evaluate on surprise examples.

Low task difficulty: on tasks where nearly all responses are good, judge scores cluster near the top of the scale and small quality differences are undetectable. Design evals with adversarial examples where the model is likely to fail.

🚀 Production

LLM-as-Judge in production:

  • Build your eval set incrementally. Start with 50 gold-standard examples you evaluate manually. Expand to 500+ over time. The eval set is a long-term asset — treat it like test code.
  • Version your eval set. When you add examples, keep old baselines. Regressions on older examples often reveal capability trade-offs from new training.
  • Separate accuracy from format evals. Run LLM-as-Judge for quality/accuracy; use deterministic checks (regex, JSON schema validation, length limits) for format compliance. Don't let format issues contaminate quality scores.
  • Judge latency matters. A full pairwise eval suite takes 2–5 minutes with position-bias correction. Build this into your CI pipeline, not as an afterthought.
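The deterministic format checks mentioned in the bullets above can be sketched as a small gate that runs before the LLM judge. Names and limits here are illustrative, not a standard API:

```python
import json
import re

def format_checks(response: str, max_words: int = 200) -> dict:
    """Deterministic format-compliance checks, kept separate from the
    LLM quality judge so format failures don't contaminate quality scores.

    All names and thresholds are illustrative for this sketch.
    """
    checks = {"within_length": len(response.split()) <= max_words}
    # Valid JSON, for tasks that demand structured output
    try:
        json.loads(response)
        checks["valid_json"] = True
    except json.JSONDecodeError:
        checks["valid_json"] = False
    # No leftover template placeholders such as {question}
    checks["no_placeholders"] = re.search(r"\{[a-z_]+\}", response) is None
    return checks
```

In CI, fail fast on these booleans and only spend judge calls on responses that pass; the quality score then measures quality alone.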
