
Agent Evals & Reliability

Evaluating agents is harder than evaluating single-turn models: the output is a multi-step trajectory, not a single response. Success may require all steps to be correct; a single wrong tool call can invalidate a correct final answer. This lesson covers the measurement frameworks for agent evaluation, how to build eval suites that catch reliability regressions, and the specific reliability patterns — retries, timeouts, guardrails — that make agents production-ready.

Theory

Reliability Budget
[Interactive chart: episode success (1−ε)^n vs. steps per episode (1–20), plotted for per-step error rates ε = 2%, 5%, 10%, 20%]

Episode success = (1−ε)^n. With 10% per-step error and 10 steps, expected success is only 35%.

You can't tell if your agent is reliable by running it once — a 90% success rate looks like success if you try it three times and it happens to work all three. Agent evaluation requires measuring success rates over many episodes, with statistical confidence. The metrics below (task success rate, step success rate, error budget) give you the language to set reliability targets and diagnose where your agent is failing.

Task Success Rate and Step Success Rate

Two complementary metrics for agent evaluation:

Task success rate (TSR): fraction of episodes that fully complete the task correctly. The primary metric — what users care about.

\text{TSR} = \frac{N_{\text{success}}}{N_{\text{total}}}

Step success rate (SSR): average fraction of steps that are correct across all episodes. Diagnostic metric — helps identify where in the pipeline errors occur.

\text{SSR} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\{t : a_t^{(i)} \text{ correct}\}|}{T^{(i)}}

A high SSR with a low TSR indicates that errors cluster in the final steps. A low SSR with a moderate TSR indicates early errors the agent happens to recover from — a fragile system.
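Both metrics can be computed directly from logged episodes. A minimal sketch, assuming each episode record carries an overall success flag and a per-step correctness list (the record shape here is an illustrative assumption, not a standard format):

```python
def tsr_ssr(episodes: list[dict]) -> tuple[float, float]:
    """Compute task success rate and step success rate.

    Each episode is assumed to look like:
      {"success": bool, "steps_correct": [bool, ...]}
    """
    tsr = sum(e["success"] for e in episodes) / len(episodes)
    ssr = sum(
        sum(e["steps_correct"]) / len(e["steps_correct"]) for e in episodes
    ) / len(episodes)
    return tsr, ssr

episodes = [
    {"success": True,  "steps_correct": [True, True, True, True]},
    {"success": False, "steps_correct": [True, True, True, False]},
]
tsr, ssr = tsr_ssr(episodes)  # tsr = 0.5, ssr = 0.875
```

Note that SSR averages the per-episode step accuracy, so short and long episodes contribute equally.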

Error Budget Allocation

Given a target TSR τ and an episode length of N steps, the maximum allowable per-step error rate ε satisfies:

(1 - \epsilon)^N \geq \tau \implies \epsilon \leq 1 - \tau^{1/N}

The error budget inequality is forced by the compounding structure of multi-step episodes: per-step error rate and episode success rate are related by an exponential, not a linear function. This means that halving the per-step error rate more than doubles the episode success rate for long episodes — improvements compound just as errors do. It also means that the per-step target becomes extremely demanding for long episodes: a 10-step episode requires 99% per-step success to hit 90% episode success. There is no way to meet aggressive episode-level targets without first establishing aggressive per-step reliability.

For τ = 0.90, N = 10: ε ≤ 1 − 0.90^{1/10} ≈ 0.0105 — each step must succeed 99%+ of the time. This is the reliability budget: allocate it across tool calls, parsing, and model decisions.
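The budget inequality is easy to evaluate for your own targets; a quick sketch:

```python
def per_step_budget(target_tsr: float, n_steps: int) -> float:
    """Max per-step error rate ε that still satisfies (1 - ε)^n >= target_tsr."""
    return 1 - target_tsr ** (1 / n_steps)

# How the budget tightens as episodes get longer, for a 90% episode target
for n in (5, 10, 20):
    eps = per_step_budget(0.90, n)
    print(f"n={n:2d}  per-step error budget: {eps:.4f}")
```

For n = 10 this reproduces the ≈ 0.0105 figure above; at n = 20 the budget roughly halves again.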

Trajectory Similarity

For tasks with multiple valid solution paths, exact-match evaluation fails. Trajectory similarity compares the sequence of tool calls taken:

\text{sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}

(Jaccard similarity over tool-call multisets.) Two trajectories that use the same tools in different orders but reach the same result should score high.
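This can be sketched with `collections.Counter`, which supports multiset intersection (`&`) and union (`|`):

```python
from collections import Counter

def trajectory_similarity(t1: list[str], t2: list[str]) -> float:
    """Jaccard similarity over tool-call multisets (order-insensitive)."""
    c1, c2 = Counter(t1), Counter(t2)
    union = c1 | c2
    if not union:
        return 1.0  # two empty trajectories are trivially identical
    return sum((c1 & c2).values()) / sum(union.values())

# Same tools, different order → similarity 1.0
trajectory_similarity(["search", "calculator"], ["calculator", "search"])  # 1.0
# One extra call → 2 shared calls out of 3 total → 2/3
trajectory_similarity(["search", "calculator"], ["search", "calculator", "search"])
```

Using multisets rather than sets means repeated calls to the same tool count toward the score, so a trajectory that loops on one tool is penalized.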

For open-ended tasks (research, code generation), use LLM-as-Judge on the final output rather than trajectory matching — trajectories diverge too much across valid approaches.

Walkthrough

Building an Agent Eval Suite

python
from dataclasses import dataclass
from typing import Callable
 
@dataclass
class AgentEval:
    id: str
    task: str
    expected_tools: list[str]       # tools that should be called
    success_fn: Callable[[str], bool]  # function to check final answer
    difficulty: str = "medium"      # easy/medium/hard
 
EVAL_SUITE = [
    AgentEval(
        id="math_01",
        task="What is 15% of 840, rounded to the nearest integer?",
        expected_tools=["calculator"],
        success_fn=lambda ans: "126" in ans,
    ),
    AgentEval(
        id="search_01",
        task="Who is the current CEO of Anthropic?",
        expected_tools=["search"],
        success_fn=lambda ans: "dario" in ans.lower() or "amodei" in ans.lower(),
    ),
    AgentEval(
        id="multi_01",
        task="Search for the population of Tokyo, then calculate how many people that is per square kilometer given Tokyo is 2,194 km².",
        expected_tools=["search", "calculator"],
        success_fn=lambda ans: any(c.isdigit() for c in ans),  # has a number
    ),
]
 
def evaluate_agent(agent_fn: Callable, eval_suite: list[AgentEval]) -> dict:
    results = []
    for ev in eval_suite:
        # agent_fn returns (trajectory, final_answer); the trajectory is a
        # list of (tool_name, tool_input) tuples
        trajectory, answer = agent_fn(ev.task)
        tools_used = {t for t, _ in trajectory}
 
        result = {
            "id": ev.id,
            "success": ev.success_fn(answer),
            "expected_tools": set(ev.expected_tools),
            "used_tools": tools_used,
            "tool_coverage": len(set(ev.expected_tools) & tools_used) / len(set(ev.expected_tools)),
            "trajectory_len": len(trajectory),
            "difficulty": ev.difficulty,
        }
        results.append(result)
 
    tsr = sum(r["success"] for r in results) / len(results)
    avg_tool_coverage = sum(r["tool_coverage"] for r in results) / len(results)
 
    return {
        "task_success_rate": tsr,
        "avg_tool_coverage": avg_tool_coverage,
        "n_total": len(results),
        "n_success": sum(r["success"] for r in results),
        "by_difficulty": {
            d: sum(r["success"] for r in results if r["difficulty"] == d) /
               max(1, sum(1 for r in results if r["difficulty"] == d))
            for d in ["easy", "medium", "hard"]
        },
        "results": results,
    }
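A usage sketch with a hypothetical stub agent, to make the `agent_fn` contract concrete (the stub is invented for illustration; any callable with this signature works):

```python
def stub_agent(task: str) -> tuple[list[tuple[str, str]], str]:
    """Hypothetical agent matching the evaluate_agent contract:
    returns (trajectory, final_answer), where trajectory is a
    list of (tool_name, tool_input) tuples."""
    trajectory = [("calculator", "0.15 * 840")]
    return trajectory, "The answer is 126."

trajectory, answer = stub_agent("What is 15% of 840?")
tools_used = {t for t, _ in trajectory}  # {"calculator"}
success = "126" in answer                # what math_01's success_fn checks
```

In practice `agent_fn` wraps your real agent loop; the important part is that it exposes both the trajectory and the final answer, so TSR and tool coverage can be scored from the same run.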

Reliability Patterns

python
import time
from functools import wraps
 
def with_timeout(seconds: int):
    """Decorator to enforce per-tool timeout."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            import signal
            def handler(signum, frame):
                raise TimeoutError(f"Tool timed out after {seconds}s")
            signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return fn(*args, **kwargs)
            finally:
                signal.alarm(0)
        return wrapper
    return decorator
 
def with_retry(max_retries: int = 3, backoff: float = 1.0):
    """Decorator for exponential backoff retry (with jitter) on tool errors."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            import random  # jitter spreads out retries under shared rate limits
            last_error = None
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except (TimeoutError, ConnectionError) as e:
                    last_error = e
                    if attempt < max_retries - 1:
                        time.sleep(backoff * (2 ** attempt) * (0.5 + random.random()))
            return f"Error after {max_retries} retries: {last_error}"
        return wrapper
    return decorator
 
# Apply to tools. Retry must wrap timeout so each attempt gets its own alarm;
# with the order reversed, the single alarm is consumed by the first attempt
# and retries run unbounded. Note: signal-based timeouts only work on the
# main thread on Unix.
@with_retry(max_retries=3)
@with_timeout(10)
def search(query: str) -> str:
    # ... implementation
    pass

Analysis & Evaluation

Where Your Intuition Breaks

The tempting assumption: a high task success rate means the agent is reliable. But TSR captures average performance and hides variance. An agent that succeeds 90% of the time yet fails catastrophically (corrupts data, sends incorrect messages, loops indefinitely) in the remaining 10% is not acceptable for production, however good its TSR looks. Reliability in production requires both a high TSR and bounded worst-case behavior: max cost per session, max latency per session, and graceful degradation when the agent can't complete the task. TSR is a necessary condition for production readiness, not a sufficient one.

Reliability Checklist

| Risk | Mitigation |
| --- | --- |
| Tool timeout | Per-tool timeout + graceful error string return |
| API rate limit | Exponential backoff with jitter on 429 errors |
| Infinite loop | Step counter + loop detection (same action, same input) |
| Context overflow | Token budget tracking; summarize old turns when approaching limit |
| Cascading errors | Quality gate between pipeline stages; stop-on-critical-failure |
| Prompt injection | Sanitize tool outputs before inserting into context; treat tool results as untrusted |
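The step-counter and loop-detection mitigations can be sketched as a small guard the agent loop consults before executing each action (a minimal sketch; the limits of 30 steps and 3 repeats are arbitrary defaults, not recommendations):

```python
class LoopGuard:
    """Abort when the step budget is exhausted or the same
    (action, input) pair repeats too many times in a row."""

    def __init__(self, max_steps: int = 30, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.history: list[tuple[str, str]] = []

    def check(self, action: str, action_input: str) -> None:
        self.history.append((action, action_input))
        if len(self.history) > self.max_steps:
            raise RuntimeError(f"Step budget exceeded ({self.max_steps} steps)")
        tail = self.history[-self.max_repeats:]
        if len(tail) == self.max_repeats and len(set(tail)) == 1:
            raise RuntimeError(
                f"Loop detected: {action!r} repeated {self.max_repeats} times"
            )

guard = LoopGuard(max_repeats=3)
guard.check("search", "tokyo population")  # fine
guard.check("search", "tokyo population")  # fine
# a third identical call would raise RuntimeError
```

Raising an exception lets the outer loop turn the failure into graceful degradation (a partial answer or an explicit "could not complete") rather than an unbounded session.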

Regression Testing

Agent eval suites should run in CI to catch regressions from:

  • Model updates (new model version may have different tool-use behavior)
  • Prompt changes (even small wording changes can affect multi-step behavior)
  • Tool API changes (new schema, new error format)
  • Context window size changes

Minimum viable regression suite: 20–50 examples covering the main task types, with deterministic success functions where possible. Regression threshold: alert if TSR drops by more than 5 percentage points vs. the baseline.
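The alert rule reduces to a one-line CI gate comparing the new run's TSR against a stored baseline (the 5-point threshold is the one stated above):

```python
REGRESSION_THRESHOLD = 0.05  # 5 percentage points

def tsr_regressed(baseline_tsr: float, current_tsr: float) -> bool:
    """True if the new run's TSR dropped past the threshold vs. baseline."""
    return (baseline_tsr - current_tsr) > REGRESSION_THRESHOLD

tsr_regressed(0.92, 0.85)  # True  — 7-point drop, fail the build
tsr_regressed(0.92, 0.90)  # False — within budget
```

With a 20–50 example suite, a single flipped example moves TSR by 2–5 points, so keep the threshold at or above one example's worth of noise, or average over multiple runs before gating.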

🚀Production

Agent evals in production:

  • Instrument before you launch. Add logging for every tool call (name, inputs, outputs, latency) before your agent reaches users. Debugging production failures without traces is nearly impossible.
  • Separate task success from output quality. TSR measures whether the agent completed the task at all. Output quality measures how well. Both metrics are needed — a task can succeed with a mediocre answer.
  • Human review on failure cases. Review 10–20 failed agent traces per week. Failure patterns cluster: 80% of failures often come from 2–3 root causes. Fix those first.
  • Shadow mode before rollout. Run the new agent version in parallel (shadow mode) without affecting users. Compare TSR vs the current version over 100+ tasks before switching traffic.
