Agent Evals & Reliability
Evaluating agents is harder than evaluating single-turn models: the output is a multi-step trajectory, not a single response. Success may require all steps to be correct; a single wrong tool call can invalidate a correct final answer. This lesson covers the measurement frameworks for agent evaluation, how to build eval suites that catch reliability regressions, and the specific reliability patterns — retries, timeouts, guardrails — that make agents production-ready.
Theory
Episode success = (1 − ε)^n. With a 10% per-step error rate and 10 steps, expected success is only 0.9^10 ≈ 35%.
You can't tell if your agent is reliable by running it once — a 90% success rate looks like success if you try it three times and it happens to work all three. Agent evaluation requires measuring success rates over many episodes, with statistical confidence. The metrics below (task success rate, step success rate, error budget) give you the language to set reliability targets and diagnose where your agent is failing.
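To make "statistical confidence" concrete, here is a minimal sketch of a Wilson score interval for an observed success rate. The function name and the sample counts are illustrative, not from the lesson; the point is that three clean runs still leave a huge confidence interval.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# 3/3 successes: the interval still spans roughly 44%..100%
print(wilson_interval(3, 3))
# 90/100 successes: roughly 83%..94% -- many episodes narrow the estimate
print(wilson_interval(90, 100))
```

Three successful runs are compatible with a true success rate below 50%; only a larger eval run pins the number down.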
Task Success Rate and Step Success Rate
Two complementary metrics for agent evaluation:
Task success rate (TSR): fraction of episodes that fully complete the task correctly. The primary metric — what users care about.
Step success rate (SSR): average fraction of steps that are correct across all episodes. Diagnostic metric — helps identify where in the pipeline errors occur.
A high SSR with low TSR indicates errors cluster at the final steps. A low SSR with moderate TSR indicates early errors that happen to cancel out — a fragile system.
Error Budget Allocation
Given a target task success rate T and episode length n, the maximum allowable per-step error rate ε satisfies (1 − ε)^n ≥ T, i.e. ε ≤ 1 − T^(1/n).
The error budget inequality is forced by the compounding structure of multi-step episodes: per-step error rate and episode success rate are related by an exponential, not a linear function. This means that halving the per-step error rate more than doubles the episode success rate for long episodes — improvements compound just as errors do. It also means that the per-step target becomes extremely demanding for long episodes: a 10-step episode requires 99% per-step success to hit 90% episode success. There is no way to meet aggressive episode-level targets without first establishing aggressive per-step reliability.
For T = 0.90, n = 10: ε ≤ 1 − 0.90^(1/10) ≈ 0.0105 — each step must succeed 99%+ of the time. This is the reliability budget: allocate it across tool calls, parsing, and model decisions.
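The budget calculation is a one-liner, sketched here to show how fast it tightens as episodes grow (the function name is ours, not from the lesson):

```python
def max_step_error(target_tsr: float, n_steps: int) -> float:
    """Largest per-step error rate that still meets the episode-level target."""
    return 1 - target_tsr ** (1 / n_steps)

# A fixed 90% episode target; the per-step budget shrinks with episode length
for n in (5, 10, 20):
    print(f"n={n:2d}: per-step error budget = {max_step_error(0.90, n):.4f}")
```

At 20 steps the per-step budget is around half a percent — every tool call, parse, and model decision has to be nearly perfect.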
Trajectory Similarity
For tasks with multiple valid solution paths, exact-match evaluation fails. Trajectory similarity compares the sequence of tool calls taken:
sim(A, B) = |A ∩ B| / |A ∪ B|, a Jaccard similarity computed over tool-call multisets. Two trajectories that use the same tools in different orders but reach the same result should score high.
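A minimal sketch of multiset Jaccard over tool-call lists, using `collections.Counter` (which supports multiset intersection and union directly); the tool names are placeholders:

```python
from collections import Counter

def trajectory_similarity(a: list[str], b: list[str]) -> float:
    """Jaccard similarity over tool-call multisets (order-insensitive)."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # multiset intersection size
    union = sum((ca | cb).values())  # multiset union size
    return inter / union if union else 1.0

# Same tools, different order -> perfect score
print(trajectory_similarity(["search", "calculator"], ["calculator", "search"]))  # 1.0
# One redundant search -> partial score (2/3)
print(trajectory_similarity(["search", "search", "calculator"], ["search", "calculator"]))
```

Because it ignores order, this metric rewards "same work done" rather than "same script followed" — exactly the property you want when multiple step orderings are valid.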
For open-ended tasks (research, code generation), use LLM-as-Judge on the final output rather than trajectory matching — trajectories diverge too much across valid approaches.
Walkthrough
Building an Agent Eval Suite
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentEval:
    id: str
    task: str
    expected_tools: list[str]          # tools that should be called
    success_fn: Callable[[str], bool]  # function to check final answer
    difficulty: str = "medium"         # easy/medium/hard

EVAL_SUITE = [
    AgentEval(
        id="math_01",
        task="What is 15% of 840, rounded to the nearest integer?",
        expected_tools=["calculator"],
        success_fn=lambda ans: "126" in ans,
    ),
    AgentEval(
        id="search_01",
        task="Who is the current CEO of Anthropic?",
        expected_tools=["search"],
        success_fn=lambda ans: "dario" in ans.lower() or "amodei" in ans.lower(),
    ),
    AgentEval(
        id="multi_01",
        task="Search for the population of Tokyo, then calculate how many people that is per square kilometer given Tokyo is 2,194 km².",
        expected_tools=["search", "calculator"],
        success_fn=lambda ans: any(c.isdigit() for c in ans),  # has a number
    ),
]
def evaluate_agent(agent_fn: Callable, eval_suite: list[AgentEval]) -> dict:
    results = []
    for ev in eval_suite:
        trajectory, answer = agent_fn(ev.task)
        tools_used = {t for t, _ in trajectory}
        result = {
            "id": ev.id,
            "success": ev.success_fn(answer),
            "expected_tools": set(ev.expected_tools),
            "used_tools": tools_used,
            "tool_coverage": len(set(ev.expected_tools) & tools_used) / len(set(ev.expected_tools)),
            "trajectory_len": len(trajectory),
            "difficulty": ev.difficulty,
        }
        results.append(result)
    tsr = sum(r["success"] for r in results) / len(results)
    avg_tool_coverage = sum(r["tool_coverage"] for r in results) / len(results)
    return {
        "task_success_rate": tsr,
        "avg_tool_coverage": avg_tool_coverage,
        "n_total": len(results),
        "n_success": sum(r["success"] for r in results),
        "by_difficulty": {
            d: sum(r["success"] for r in results if r["difficulty"] == d) /
               max(1, sum(1 for r in results if r["difficulty"] == d))
            for d in ["easy", "medium", "hard"]
        },
        "results": results,
    }

Reliability Patterns
import time
from functools import wraps

def with_timeout(seconds: int):
    """Decorator to enforce a per-tool timeout.

    Note: SIGALRM is Unix-only and must be set from the main thread.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            import signal

            def handler(signum, frame):
                raise TimeoutError(f"Tool timed out after {seconds}s")

            signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return fn(*args, **kwargs)
            finally:
                signal.alarm(0)  # always cancel the pending alarm
        return wrapper
    return decorator

def with_retry(max_retries: int = 3, backoff: float = 1.0):
    """Decorator for exponential backoff retry on tool errors."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except (TimeoutError, ConnectionError) as e:
                    last_error = e
                    if attempt < max_retries - 1:
                        time.sleep(backoff * (2 ** attempt))
            return f"Error after {max_retries} retries: {last_error}"
        return wrapper
    return decorator

# Apply to tools
@with_timeout(10)
@with_retry(max_retries=3)
def search(query: str) -> str:
    # ... implementation
    pass

Analysis & Evaluation
Where Your Intuition Breaks
The intuition: a high task success rate means the agent is reliable. In reality, TSR captures average performance but hides variance. An agent that succeeds 90% of the time but fails catastrophically (corrupts data, sends incorrect messages, loops indefinitely) on the remaining 10% is not acceptable for production — even if its TSR looks good. Reliability in production requires both high TSR and bounded worst-case behavior: max cost per session, max latency per session, and graceful degradation when the agent can't complete the task. TSR is a necessary condition for production readiness, not a sufficient one.
Reliability Checklist
| Risk | Mitigation |
|---|---|
| Tool timeout | Per-tool timeout + graceful error string return |
| API rate limit | Exponential backoff with jitter on 429 errors |
| Infinite loop | Step counter + loop detection (same action, same input) |
| Context overflow | Token budget tracking; summarize old turns when approaching limit |
| Cascading errors | Quality gate between pipeline stages; stop-on-critical-failure |
| Prompt injection | Sanitize tool outputs before inserting into context; treat tool results as untrusted |
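The loop-detection row in the checklist can be sketched with a simple heuristic. This assumes the trajectory is recorded as (tool, input) pairs, as in the walkthrough; the window size and function name are illustrative:

```python
def detect_loop(trajectory: list[tuple[str, str]], window: int = 3) -> bool:
    """Flag a stuck agent: the same (tool, input) pair repeated
    `window` times in a row at the end of the trajectory."""
    if len(trajectory) < window:
        return False
    tail = trajectory[-window:]
    return all(step == tail[0] for step in tail)

traj = [("search", "tokyo population")] * 3
print(detect_loop(traj))  # True: identical call three times in a row
```

Combine this with a hard step counter (e.g. abort after N steps regardless), since an agent can also loop through a cycle of different actions that a same-action check won't catch.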
Regression Testing
Agent eval suites should run in CI to catch regressions from:
- Model updates (new model version may have different tool-use behavior)
- Prompt changes (even small wording changes can affect multi-step behavior)
- Tool API changes (new schema, new error format)
- Context window size changes
Minimum viable regression suite: 20–50 examples covering the main task types, with deterministic success functions where possible. Regression threshold: alert if TSR drops > 5% vs baseline.
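The regression gate itself is trivial to implement; a sketch, assuming TSR is compared as an absolute drop against a stored baseline (the threshold and names are ours):

```python
def tsr_regressed(current: float, baseline: float, threshold: float = 0.05) -> bool:
    """True if TSR dropped more than `threshold` below the baseline."""
    return (baseline - current) > threshold

print(tsr_regressed(0.80, 0.88))  # True: an 8-point drop exceeds the 5% budget
print(tsr_regressed(0.86, 0.88))  # False: within budget
```

In CI this would fail the build on a regression; whether "5%" means absolute points or relative change is a choice worth pinning down explicitly in your pipeline.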
Agent evals in production:
- Instrument before you launch. Add logging for every tool call (name, inputs, outputs, latency) before your agent reaches users. Debugging production failures without traces is nearly impossible.
- Separate task success from output quality. TSR measures whether the agent completed the task at all. Output quality measures how well. Both metrics are needed — a task can succeed with a mediocre answer.
- Human review on failure cases. Review 10–20 failed agent traces per week. Failure patterns cluster: 80% of failures often come from 2–3 root causes. Fix those first.
- Shadow mode before rollout. Run the new agent version in parallel (shadow mode) without affecting users. Compare TSR vs the current version over 100+ tasks before switching traffic.