Agent Patterns
An LLM agent is any system where a language model drives a loop: observe, reason, act, repeat. The shift from single-pass inference to multi-step loops is what makes agents qualitatively different from chatbots. This lesson covers the foundational patterns — ReAct, plan-and-execute, reflection — the control flow abstractions that compose them, and the failure modes that make agent reliability an engineering problem, not just a prompting problem.
Theory
*Figure: the ReAct loop. The model generates a chain-of-thought reasoning step before acting; this scratchpad improves task decomposition and recovery from bad observations. It is not shown to users.*
An LLM agent is a loop: observe the environment, reason about what to do, take an action, observe the result, repeat. You already run this loop when you debug code — read the error, form a hypothesis, try a fix, read the new output. The diagram above shows the ReAct loop, the most common agent pattern: Thought → Action → Observation, cycling until the task is done. The key insight is that the model's "working memory" is its context window — every observation accumulates there, and the loop ends when the context runs out or the task completes.
The Agent as a Markov Decision Process
Formally, an LLM agent operates over a sequence of states $s_t$, observations $o_t$, and actions $a_t$:

$$s_{t+1} = T(s_t, a_t), \qquad a_t = \pi(o_{1:t})$$

where $T$ is the environment transition function and $\pi$ is the LLM policy. Unlike RL agents, $\pi$ is not trained online — the "policy" is the prompt plus the model weights. The key insight: context window = working memory. Everything the agent knows about the current state must fit in the context.
Episode length and failure compounding: if each step has independent error probability $\epsilon$, the probability of a complete $n$-step episode succeeding is:

$$P(\text{success}) = (1 - \epsilon)^n$$
The geometric compounding is not a pessimistic model — it is the exact result when step errors are independent under a fixed policy. Correlated errors change the picture: a systematic misunderstanding that propagates through every step makes failure near-certain regardless of horizon, so no amount of per-step retrying helps. This is why agent reliability engineering targets the per-step error rate directly: halving $\epsilon$ from 0.10 to 0.05 nearly doubles the success rate of a 10-step episode (from roughly 0.35 to 0.60).
For $\epsilon = 0.05$ (5% per-step error rate) and $n = 10$ steps: $0.95^{10} \approx 0.60$. Over longer horizons it collapses further (e.g., $0.95^{50} \approx 0.08$). Error rates that are acceptable for single-step tasks become catastrophic over long horizons.
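The compounding arithmetic is worth checking directly. A minimal sketch (the function name is mine):

```python
def episode_success(step_error: float, n_steps: int) -> float:
    """P(an n-step episode succeeds), assuming independent per-step errors."""
    return (1 - step_error) ** n_steps

print(round(episode_success(0.05, 10), 2))  # 0.6
print(round(episode_success(0.05, 50), 2))  # 0.08
print(round(episode_success(0.10, 10), 2))  # 0.35
```

Note how steep the curve is: the same 5% error rate that succeeds 60% of the time over 10 steps succeeds less than 10% of the time over 50.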
ReAct: Reasoning + Acting Interleaved
ReAct (Yao et al., 2022) interleaves thought and action in the context:

```text
Thought: I need to find the population of Tokyo.
Action: search("Tokyo population 2024")
Observation: Tokyo metropolitan area: 37.4 million (2024)
Thought: I have the answer.
Action: finish("37.4 million")
```
The thought step is not just decorative — it provides a scratchpad that allows the model to decompose the task, track progress, and recover from bad observations. Ablation studies show ReAct outperforms action-only baselines (no thought) by 15–30% on multi-step retrieval tasks.
Context growth: ReAct grows the context linearly with steps. For $n$ steps with an average of $k$ thought/observation tokens each, context size ≈ $nk$ tokens. At $n = 20$, $k = 500$: 10K tokens. Context limits bound agent horizon.
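A back-of-envelope helper makes the horizon bound concrete (function names are mine; this ignores the system prompt and task description):

```python
def react_context_size(n_steps: int, tokens_per_step: int) -> int:
    """Approximate ReAct context growth: each step appends one
    thought/action/observation group of roughly tokens_per_step tokens."""
    return n_steps * tokens_per_step

def max_horizon(context_limit: int, tokens_per_step: int) -> int:
    """Largest number of steps that fits in the context window."""
    return context_limit // tokens_per_step

print(react_context_size(20, 500))   # 10000
print(max_horizon(200_000, 500))     # 400
```

The theoretical horizon is rarely the practical one — retrieval quality over a bloated context degrades well before the hard limit — but the linear growth itself is why long-horizon agents need summarization or context compression.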
Plan-and-Execute
Rather than interleaving reasoning with each action, plan-and-execute separates planning from execution:
- Plan: LLM generates a full task plan $[a_1, a_2, \dots, a_k]$ before executing anything
- Execute: each action runs; observations may update the remaining plan
Advantage over ReAct: the planner sees the full task before any execution, enabling better decomposition. Disadvantage: upfront plan may be invalidated by early observations — requires a re-plan trigger.
Re-plan condition: if an action fails or returns unexpected results, trigger re-planning from the current state. Re-planning cost is one additional LLM call, typically worth it for tasks > 5 steps.
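The re-plan trigger can be sketched as a loop. Here `plan_fn` and `exec_fn` are hypothetical stand-ins for the LLM planner call and the tool executor:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    output: str
    failed: bool = False

def plan_and_execute(
    task: str,
    plan_fn: Callable[[str, list], list],   # stand-in for the LLM planner
    exec_fn: Callable[[str], StepResult],   # stand-in for tool execution
    max_replans: int = 3,
) -> list:
    """Generate a full plan upfront, execute it step by step, and
    re-plan from the current state whenever a step fails."""
    history = []
    plan = plan_fn(task, history)
    replans = 0
    while plan:
        result = exec_fn(plan.pop(0))
        history.append(result)
        if result.failed and replans < max_replans:
            # discard the invalidated remainder of the plan and rebuild
            # it from the observation history: one extra LLM call
            plan = plan_fn(task, history)
            replans += 1
    return history
```

On failure the remaining plan is thrown away rather than patched; rebuilding from the full observation history is simpler and avoids executing steps whose preconditions no longer hold.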
Reflection and Self-Critique
Reflexion (Shinn et al., 2023) adds a reflection step after task failure: the model reviews its failed trajectory and writes a short diagnosis of what went wrong and what to do differently.
The reflection is stored in a memory buffer and prepended to the next episode's context. Over multiple trials, the agent accumulates a "lesson log" of its mistakes. On coding and decision-making benchmarks, Reflexion improves task success rate by 10–20% over single-attempt ReAct.
Walkthrough
Implementing a ReAct Agent Loop
```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "search",
        "description": "Search the web for current information.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "calculator",
        "description": "Evaluate a mathematical expression.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Python math expression"}
            },
            "required": ["expression"]
        }
    }
]

def run_tool(name: str, inputs: dict) -> str:
    """Dispatch tool calls to implementations."""
    if name == "search":
        return f"[Search results for '{inputs['query']}': ...]"  # real: call search API
    elif name == "calculator":
        try:
            return str(eval(inputs["expression"]))  # real: use a safe evaluator
        except Exception as e:
            return f"Error: {e}"
    return f"Unknown tool: {name}"

def react_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages
        )
        # Append assistant turn
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Extract final text response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task complete."
        if response.stop_reason == "tool_use":
            # Process all tool calls in this turn
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = run_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
    return "Max steps reached without completing task."
```

Adding a Reflection Loop
```python
def reflexion_agent(task: str, max_trials: int = 3) -> str:
    memory = []  # accumulated reflections from prior trials
    for trial in range(max_trials):
        # Build context with reflection memory
        context = task
        if memory:
            context += "\n\nPrevious attempts and lessons:\n" + "\n".join(memory)
        # assumed: a react_agent variant that also returns a success flag
        result, success = react_agent_with_status(context)
        if success:
            return result
        # Ask the model to reflect on the failure
        reflection = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Task: {task}\nResult: {result}\n"
                           "What went wrong and what should be done differently?"
            }]
        ).content[0].text
        memory.append(f"Trial {trial + 1}: {reflection}")
    return result  # return best attempt after max trials
```

Analysis & Evaluation
Where Your Intuition Breaks
A common intuition: agent failures happen because the model isn't smart enough. In practice, most agent failures are architectural, not cognitive. The three most common failure modes — infinite loops, context overflow, and compounding errors — are all engineering problems with engineering solutions: loop detection, context compression, and per-step validation. A frontier model in a poorly designed agent scaffold will fail reliably on tasks a smaller model handles with the right guardrails. The model's reasoning capability matters for individual steps; the scaffold determines whether those steps compose into reliable multi-step task completion.
Pattern Selection Guide
| Pattern | Best for | Failure mode |
|---|---|---|
| ReAct | Interleaved search/reasoning, uncertain environment | Context bloat at > 15 steps |
| Plan-execute | Structured tasks with known action space | Brittle to early failures that invalidate plan |
| Reflection | Tasks with clear success/failure signal | Doesn't help if reflection diagnoses wrong root cause |
| Hierarchical agents | Long-horizon tasks that decompose cleanly | Coordination overhead, error propagation between agents |
Agent Reliability Checklist
Per-step error handling: every tool call should have a timeout and fallback. Don't let a single failed tool crash the episode.
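One way to implement the timeout-and-fallback rule, sketched with a thread pool (the function name is mine):

```python
import concurrent.futures

def call_with_timeout(fn, inputs: dict, timeout_s: float = 10.0,
                      fallback: str = "[tool timed out]") -> str:
    """Run a single tool call with a hard timeout; return a fallback
    string instead of letting a hung or crashing tool kill the episode."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, **inputs).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    except Exception as e:
        return f"[tool error: {e}]"
    finally:
        pool.shutdown(wait=False)
```

Caveat: the timed-out worker thread keeps running in the background; if the tool must actually be cancelled, run it in a subprocess instead.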
Loop detection: if the agent takes the same action twice in a row with the same inputs, it's stuck. Add a deduplication check and force a different strategy.
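The deduplication check itself is tiny (a sketch; names are mine):

```python
def is_stuck(action_log: list) -> bool:
    """True when the last two actions are the same tool called
    with identical inputs -- the most common stall pattern."""
    return len(action_log) >= 2 and action_log[-1] == action_log[-2]

log = [("search", {"query": "tokyo"}), ("search", {"query": "tokyo"})]
print(is_stuck(log))  # True
```

When the check fires, inject a corrective message into the context ("you repeated the same action; try a different strategy") rather than silently retrying.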
Max steps enforcement: always set a hard step limit. An agent that loops forever is worse than an agent that gives up and returns an error.
Observation truncation: tool outputs can be arbitrarily large (e.g., a full web page). Truncate to the first N characters and include a hint: "response truncated at 2000 chars — request a more specific query for full content."
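The truncation hint can live right next to the cut (a sketch; the function name is mine):

```python
def truncate_observation(text: str, limit: int = 2000) -> str:
    """Cap tool output size and tell the model why it was cut
    and how to get more."""
    if len(text) <= limit:
        return text
    return (text[:limit]
            + f"\n[response truncated at {limit} chars -- "
              "request a more specific query for full content]")

print(truncate_observation("short"))  # short
```

Embedding the hint in the observation matters: without it, the model often retries the same over-broad query expecting a different result.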
Agent patterns in production:
- Start with the simplest pattern that works. ReAct is sufficient for most tool-use tasks. Move to plan-execute only when task structure is complex enough that upfront decomposition clearly helps.
- Instrument every step. Log which tools were called, their inputs and outputs, elapsed time, and token counts. Agents fail in surprising ways — observability is your primary debugging tool.
- Error rate × episode length = reliability. Before deploying a 20-step agent with a 5% per-step error rate, compute the expected success rate: $0.95^{20} \approx 0.36$. Either improve per-step reliability or reduce episode length.
- Human-in-the-loop for high-stakes actions. Pause and request approval before any action that is hard to reverse (file deletion, email sending, API writes). The pause can be conditional on confidence scores.
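A conditional approval gate for the last point can be sketched as follows (the action names and threshold are illustrative, not from the source):

```python
# Illustrative set of hard-to-reverse actions
HIGH_STAKES = {"delete_file", "send_email", "api_write"}

def needs_approval(tool_name: str, confidence: float,
                   threshold: float = 0.95) -> bool:
    """Pause for human approval before any hard-to-reverse action,
    unless the agent's confidence clears the threshold."""
    return tool_name in HIGH_STAKES and confidence < threshold

print(needs_approval("send_email", 0.5))  # True
print(needs_approval("search", 0.5))      # False
```

In the agent loop, a `True` result would route the pending tool call to a human queue instead of executing it; read-only tools never pause.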