LLM-as-Judge & Regression Evals
Human evaluation of LLM outputs is the gold standard — and completely unscalable. LLM-as-Judge replaces a human rater with a powerful LLM, enabling evaluation at thousands of examples per hour. The technique is now standard at every lab for evaluating model changes, prompt revisions, and RAG pipeline upgrades. This lesson covers how to design a judge that correlates with human judgment, how to detect and correct for common biases, and how to build a regression eval suite that catches quality regressions before they reach production.
Theory
Position-bias correction for pairwise judging: score(A, B) = [score(A shown first) + (1 − score(B shown first))] / 2; if the two orderings disagree, record a tie.
The catch: an LLM judge has its own biases, and those biases are systematic, not random. The pipeline itself is three steps: generate outputs, run them through a judge prompt, aggregate scores. The rest of this section explains how to measure whether your judge actually correlates with human judgment, and how to correct for the biases it introduces.
Judge Correlation with Human Raters
The goal of an LLM judge is to approximate human judgment. Quality is measured by Spearman rank correlation or Cohen's kappa between judge scores and human scores:
Rank correlation is used instead of Pearson correlation because judge scores are ordinal, not interval-scaled. The difference between a score of 4 and 5 is not necessarily the same magnitude as the difference between 1 and 2 — judges tend to cluster scores in the middle of the scale with inconsistent spacing. Spearman's ρ uses only rank positions, making it robust to arbitrary score distributions and calibration drift. A judge that reliably ranks responses in the right order is useful even if its absolute scores are miscalibrated.
ρ = 1 − (6 Σᵢ dᵢ²) / (n(n² − 1))

where dᵢ is the rank difference for example i and n is the number of examples. An acceptable LLM judge typically achieves a rank correlation of roughly 0.8 or higher with human raters on the same rubric.
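The Spearman correlation can be computed directly from the rank-difference formula. A minimal sketch (pure Python, no dependencies; the tie-handling via average ranks is a standard convention, not specified in the lesson):

```python
def spearman_rho(judge_scores, human_scores):
    """Spearman rank correlation: rho = 1 - 6*sum(d_i^2) / (n*(n^2 - 1)).

    Tied values receive the average of the ranks they span.
    """
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            # extend j over a run of tied values
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # ranks are 1-based
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rj, rh = ranks(judge_scores), ranks(human_scores)
    n = len(rj)
    d2 = sum((a - b) ** 2 for a, b in zip(rj, rh))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A judge whose scores are any monotone transform of the human scores gets ρ = 1.0, which is exactly the property that makes rank correlation robust to miscalibrated absolute scores.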
Pairwise vs pointwise evaluation:
- Pointwise: judge scores each response independently (e.g., 1–5 scale). Pros: fast, parallelizable. Cons: scores drift without calibration (what is a "4" vs "5"?).
- Pairwise: judge picks which of two responses is better. Pros: more reliable (relative comparison is easier than absolute). Cons: a full ranking of n responses needs O(n²) pairs; use efficient tournament brackets to cut the number of comparisons.
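The cost gap between exhaustive pairwise comparison and a tournament bracket is plain arithmetic. A back-of-envelope sketch (no judge calls involved):

```python
def round_robin_pairs(n: int) -> int:
    # every pair compared once: n choose 2, quadratic growth
    return n * (n - 1) // 2

def single_elim_comparisons(n: int) -> int:
    # single-elimination bracket to find the best response:
    # each comparison eliminates exactly one loser
    return n - 1

for n in (8, 64, 512):
    print(f"n={n}: round-robin={round_robin_pairs(n)}, "
          f"single-elim={single_elim_comparisons(n)}")
```

At 512 responses a round-robin needs 130,816 judge calls versus 511 for a single-elimination bracket, which is why exhaustive pairwise evaluation is reserved for small candidate sets.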
Position Bias and Verbosity Bias
LLM judges exhibit systematic biases that inflate agreement with incorrect judgments:
Position bias: when comparing two responses, judges prefer the response in position A (first) at rates significantly above 50%. To correct: swap positions and average scores.
Verbosity bias: longer responses receive higher scores independent of quality. Correlation between response length and judge score is often 0.3–0.5 even for irrelevant verbosity. Correct by including explicit rubric criteria that penalize unnecessary length.
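One cheap diagnostic for verbosity bias is the correlation between response length and judge score on a batch of already-judged examples. A sketch, assuming you have parallel lists of responses and their judge scores (the helper names here are illustrative, not from the lesson):

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def verbosity_bias(responses, scores):
    """Correlation between response length (chars) and judge score.

    Values in the 0.3-0.5 range on content-matched responses suggest
    the judge is rewarding length itself rather than quality.
    """
    return pearson_r([len(r) for r in responses], scores)
```

The signal is only meaningful if quality is roughly controlled across the batch; otherwise longer responses may genuinely be better and a positive correlation is expected.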
Calibration
A well-calibrated judge produces scores whose distribution matches the actual quality distribution of your model's outputs. To measure calibration, compare judge score distributions against human score distributions using a calibration plot (judge score vs average human score, binned).
Recalibration: if the judge overestimates quality (common), add calibration examples — cases the judge historically got wrong — to the judge prompt.
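The binned calibration plot described above reduces to a small table: group examples by judge score, then compare each bin's value to the mean human score inside it. A minimal sketch, assuming integer judge scores (function name is illustrative):

```python
from collections import defaultdict

def calibration_table(judge_scores, human_scores):
    """Bin examples by judge score; report the mean human score per bin.

    For a well-calibrated judge, each bin's mean human score sits close
    to the bin value itself. Bins that sit consistently below the judge
    score indicate the judge is overestimating quality.
    """
    bins = defaultdict(list)
    for j, h in zip(judge_scores, human_scores):
        bins[j].append(h)
    return {j: sum(hs) / len(hs) for j, hs in sorted(bins.items())}
```

Plotting these (judge score on x, mean human score on y) against the diagonal gives the calibration plot; points below the diagonal are the overestimation pattern the recalibration step targets.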
Walkthrough
Building a Regression Eval Suite
Design a rubric-based judge prompt:
JUDGE_PROMPT = """You are evaluating a response from an AI assistant.
Rubric:
- Accuracy (0-3): Is the response factually correct and complete?
- Relevance (0-2): Does it directly address the question?
- Conciseness (0-2): Is it appropriately concise without unnecessary padding?
- Format (0-1): Is it properly formatted and readable?
Total score: 0-8. Higher is better.
Question: {question}
Response to evaluate:
{response}
Output JSON: {{"accuracy": int, "relevance": int, "conciseness": int, "format": int, "total": int, "reasoning": str}}"""
def judge(question: str, response: str) -> dict:
    import anthropic, json
    client = anthropic.Anthropic()
    result = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response
        )}]
    )
    return json.loads(result.content[0].text)

Position-bias-corrected pairwise comparison:
def compare_responses(question: str, response_a: str, response_b: str) -> str:
    """Returns 'A', 'B', or 'tie'"""
    # Forward comparison
    forward = judge_pairwise(question, response_a, response_b)  # returns 'A'/'B'/'tie'
    # Reverse comparison (swap positions to cancel position bias)
    reverse = judge_pairwise(question, response_b, response_a)  # A and B swapped
    # Map reverse result back to original labels
    reverse_mapped = {'A': 'B', 'B': 'A', 'tie': 'tie'}[reverse]
    if forward == reverse_mapped:
        return forward
    return 'tie'  # disagree → call it a tie

Regression eval harness:
import json
from pathlib import Path

def run_regression_eval(model_fn, eval_set_path: str, threshold: float = 6.5):
    """Run eval and flag regressions vs baseline scores."""
    eval_set = json.loads(Path(eval_set_path).read_text())
    results = []
    for ex in eval_set:
        response = model_fn(ex["question"])
        score = judge(ex["question"], response)["total"]
        delta = score - ex["baseline_score"]
        results.append({
            "id": ex["id"],
            "score": score,
            "baseline": ex["baseline_score"],
            "delta": delta,
            "regression": delta < -1,  # more than 1 point drop = regression
        })
    regressions = [r for r in results if r["regression"]]
    avg_score = sum(r["score"] for r in results) / len(results)
    print(f"Avg score: {avg_score:.2f} | Regressions: {len(regressions)}/{len(results)}")
    return results

Analysis & Evaluation
Where Your Intuition Breaks
Intuition says a consistent judge is a reliable judge. In reality, self-consistency and accuracy are independent properties. A judge can be consistently wrong: always preferring verbose responses regardless of quality, always favoring the first option in pairwise comparisons, always overscoring outputs that use technical vocabulary. These are systematic biases, and they produce high self-consistency precisely because they are systematic. The right check is correlation with human judgments on a calibration set, not internal consistency. A judge with low correlation to humans but very consistent scores is worse than one with moderate correlation and moderate consistency.
LLM Judge Quality Checklist
| Criterion | How to verify |
|---|---|
| Rubric specificity | Each criterion has 3+ point scale with concrete examples |
| Calibration examples | 5–10 "anchor" examples with expected scores in the prompt |
| Position bias corrected | Swap response order and average |
| Judge model vs evaluated model | Judge should be more capable (e.g., Opus judging Sonnet outputs) |
| Agreement rate with humans | Measure on a held-out human-labeled set |
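The last checklist item, agreement with humans on a held-out labeled set, can be measured with raw agreement rate and Cohen's kappa, which discounts agreement expected by chance. A minimal sketch for categorical labels (e.g., pairwise 'A'/'B'/'tie' verdicts):

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where judge and human agree exactly."""
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected from the two raters'
    marginal label distributions alone."""
    n = len(judge_labels)
    labels = set(judge_labels) | set(human_labels)
    p_o = agreement_rate(judge_labels, human_labels)
    p_e = sum((judge_labels.count(l) / n) * (human_labels.count(l) / n)
              for l in labels)
    if p_e == 1:  # degenerate case: both raters always emit one label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Kappa matters when labels are imbalanced: a judge that answers 'A' 90% of the time can reach high raw agreement with a similarly skewed human set while carrying near-zero kappa.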
When LLM-as-Judge Fails
Self-preference bias: a model judging its own outputs gives systematically inflated scores. Use a different model family as the judge whenever possible.
Rubric gaming: if the evaluated model is trained or prompted with knowledge of the judge criteria, it can optimize for the rubric rather than actual quality (Goodhart's law). Keep judge criteria internal; evaluate on surprise examples.
Low task difficulty: on tasks where nearly all responses are good, judge scores cluster near the top of the scale and small quality differences are undetectable. Design evals with adversarial examples where the model is likely to fail.
LLM-as-Judge in production:
- Build your eval set incrementally. Start with 50 gold-standard examples you evaluate manually. Expand to 500+ over time. The eval set is a long-term asset — treat it like test code.
- Version your eval set. When you add examples, keep old baselines. Regressions on older examples often reveal capability trade-offs from new training.
- Separate accuracy from format evals. Run LLM-as-Judge for quality/accuracy; use deterministic checks (regex, JSON schema validation, length limits) for format compliance. Don't let format issues contaminate quality scores.
- Judge latency matters. A full pairwise eval suite takes 2–5 minutes with position-bias correction. Build this into your CI pipeline, not as an afterthought.
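The deterministic format checks mentioned above (regex, JSON validity, length limits) can run as a cheap pre-pass before any judge call. A sketch; the specific checks and the 2000-character limit are illustrative, not from the lesson:

```python
import json
import re

def format_checks(text: str) -> dict:
    """Deterministic format compliance checks, kept separate from
    LLM-judged quality so format failures don't contaminate quality scores."""
    checks = {
        "valid_json": False,
        "under_length_limit": len(text) <= 2000,  # illustrative limit
        "no_placeholder_text": re.search(r"\bTODO\b|\blorem ipsum\b",
                                         text, re.I) is None,
    }
    try:
        json.loads(text)
        checks["valid_json"] = True
    except json.JSONDecodeError:
        pass
    return checks
```

Responses failing a format check can be logged and excluded (or zeroed on the format criterion) before the quality judge runs, keeping the two signals independent.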