Agent Patterns
An LLM agent is any system where a language model drives a loop: observe, reason, act, repeat. The shift from single-pass inference to multi-step loops is what makes agents qualitatively different from chatbots. This lesson covers the foundational patterns — ReAct, plan-and-execute, reflection — the control flow abstractions that compose them, and the failure modes that make agent reliability an engineering problem, not just a prompting problem.
Theory
*Figure: the ReAct loop. The model generates a chain-of-thought reasoning step before acting; this scratchpad improves task decomposition and recovery from bad observations. It is not shown to users.*
An LLM agent is a loop: observe the environment, reason about what to do, take an action, observe the result, repeat. You already run this loop when you debug code — read the error, form a hypothesis, try a fix, read the new output. The diagram above shows the ReAct loop, the most common agent pattern: Thought → Action → Observation, cycling until the task is done. The key insight is that the model's "working memory" is its context window — every observation accumulates there, and the loop ends when the context runs out or the task completes.
The Agent as a Markov Decision Process
Formally, an LLM agent operates over a sequence of states $s_t$, observations $o_t$, and actions $a_t$:

$$s_{t+1} = T(s_t, a_t), \qquad a_t = \pi(o_{1:t})$$

where $T$ is the environment transition function and $\pi$ is the LLM policy. Unlike RL agents, $\pi$ is not trained online — the "policy" is the prompt plus the model weights. The key insight: context window = working memory. Everything the agent knows about the current state must fit in the context.
Episode length and failure compounding: if each step has independent error probability $\epsilon$, the probability of a complete $n$-step episode succeeding is:

$$P(\text{success}) = (1 - \epsilon)^n$$
The geometric compounding is not a pessimistic model — it is the exact result when step errors are independent under a fixed policy. Correlated errors change the picture: a systematic misunderstanding that propagates through every step makes failure near-certain regardless of horizon, so no amount of per-step retrying helps. This is why agent reliability engineering targets the per-step error rate directly: halving $\epsilon$ from 0.10 to 0.05 nearly doubles the success rate of a 10-step episode (from roughly 0.35 to 0.60).
For $\epsilon = 0.05$ (5% per-step error rate) and $n = 10$ steps: $0.95^{10} \approx 0.60$. Over longer horizons it collapses further (e.g., $0.95^{50} \approx 0.08$). Error rates that are acceptable for single-step tasks become catastrophic over long horizons.
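The compounding arithmetic is worth checking directly. A minimal sketch (the function name is mine):

```python
def episode_success(step_error: float, n_steps: int) -> float:
    """P(an n-step episode succeeds), assuming independent per-step errors."""
    return (1 - step_error) ** n_steps

print(round(episode_success(0.05, 10), 2))  # 0.6
print(round(episode_success(0.05, 50), 2))  # 0.08
print(round(episode_success(0.10, 10), 2))  # 0.35
```

Note how steep the curve is: the same 5% error rate that succeeds 60% of the time over 10 steps succeeds less than 10% of the time over 50.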
ReAct: Reasoning + Acting Interleaved
ReAct (Yao et al., 2022) interleaves thought and action in the context:

```text
Thought: I need to find the population of Tokyo.
Action: search("Tokyo population 2024")
Observation: Tokyo metropolitan area: 37.4 million (2024)
Thought: I have the answer.
Action: finish("37.4 million")
```
The thought step is not just decorative — it provides a scratchpad that allows the model to decompose the task, track progress, and recover from bad observations. Ablation studies show ReAct outperforms action-only baselines (no thought) by 15–30% on multi-step retrieval tasks.
Context growth: ReAct grows the context linearly with steps. For $n$ steps with an average of $k$ thought/observation tokens each, context size ≈ $nk$ tokens. At $n = 20$, $k = 500$: 10K tokens. Context limits bound agent horizon.
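A back-of-envelope helper makes the horizon bound concrete (function names are mine; this ignores the system prompt and task description):

```python
def react_context_size(n_steps: int, tokens_per_step: int) -> int:
    """Approximate ReAct context growth: each step appends one
    thought/action/observation group of roughly tokens_per_step tokens."""
    return n_steps * tokens_per_step

def max_horizon(context_limit: int, tokens_per_step: int) -> int:
    """Largest number of steps that fits in the context window."""
    return context_limit // tokens_per_step

print(react_context_size(20, 500))   # 10000
print(max_horizon(200_000, 500))     # 400
```

The theoretical horizon is rarely the practical one — retrieval quality over a bloated context degrades well before the hard limit — but the linear growth itself is why long-horizon agents need summarization or context compression.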
Plan-and-Execute
Rather than interleaving reasoning with each action, plan-and-execute separates planning from execution:
- Plan: LLM generates a full task plan $[a_1, a_2, \dots, a_k]$ before executing anything
- Execute: each action runs; observations may update the remaining plan
Advantage over ReAct: the planner sees the full task before any execution, enabling better decomposition. Disadvantage: upfront plan may be invalidated by early observations — requires a re-plan trigger.
Re-plan condition: if an action fails or returns unexpected results, trigger re-planning from the current state. Re-planning cost is one additional LLM call, typically worth it for tasks > 5 steps.
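The re-plan trigger can be sketched as a loop. Here `plan_fn` and `exec_fn` are hypothetical stand-ins for the LLM planner call and the tool executor:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    output: str
    failed: bool = False

def plan_and_execute(
    task: str,
    plan_fn: Callable[[str, list], list],   # stand-in for the LLM planner
    exec_fn: Callable[[str], StepResult],   # stand-in for tool execution
    max_replans: int = 3,
) -> list:
    """Generate a full plan upfront, execute it step by step, and
    re-plan from the current state whenever a step fails."""
    history = []
    plan = plan_fn(task, history)
    replans = 0
    while plan:
        result = exec_fn(plan.pop(0))
        history.append(result)
        if result.failed and replans < max_replans:
            # discard the invalidated remainder of the plan and rebuild
            # it from the observation history: one extra LLM call
            plan = plan_fn(task, history)
            replans += 1
    return history
```

On failure the remaining plan is thrown away rather than patched; rebuilding from the full observation history is simpler and avoids executing steps whose preconditions no longer hold.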
Reflection and Self-Critique
Reflexion (Shinn et al., 2023) adds a reflection step after task failure: the model reviews its failed trajectory and writes a short diagnosis of what went wrong and what to do differently.
The reflection is stored in a memory buffer and prepended to the next episode's context. Over multiple trials, the agent accumulates a "lesson log" of its mistakes. On coding and decision-making benchmarks, Reflexion improves task success rate by 10–20% over single-attempt ReAct.
Walkthrough
Implementing a ReAct Agent Loop
```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "search",
        "description": "Search the web for current information.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "calculator",
        "description": "Evaluate a mathematical expression.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Python math expression"}
            },
            "required": ["expression"]
        }
    }
]

def run_tool(name: str, inputs: dict) -> str:
    """Dispatch tool calls to implementations."""
    if name == "search":
        return f"[Search results for '{inputs['query']}': ...]"  # real: call search API
    elif name == "calculator":
        try:
            return str(eval(inputs["expression"]))  # real: use a safe evaluator
        except Exception as e:
            return f"Error: {e}"
    return f"Unknown tool: {name}"

def react_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages
        )
        # Append assistant turn
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Extract final text response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task complete."
        if response.stop_reason == "tool_use":
            # Process all tool calls in this turn
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = run_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
            messages.append({"role": "user", "content": tool_results})
    return "Max steps reached without completing task."
```

Adding a Reflection Loop
```python
def reflexion_agent(task: str, max_trials: int = 3) -> str:
    memory = []  # accumulated reflections from prior trials
    for trial in range(max_trials):
        # Build context with reflection memory
        context = task
        if memory:
            context += "\n\nPrevious attempts and lessons:\n" + "\n".join(memory)
        # assumed: a react_agent variant that also returns a success flag
        result, success = react_agent_with_status(context)
        if success:
            return result
        # Ask the model to reflect on the failure
        reflection = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Task: {task}\nResult: {result}\n"
                           "What went wrong and what should be done differently?"
            }]
        ).content[0].text
        memory.append(f"Trial {trial + 1}: {reflection}")
    return result  # return best attempt after max trials
```

Analysis & Evaluation
Where Your Intuition Breaks
A common intuition: agent failures happen because the model isn't smart enough. In practice, most agent failures are architectural, not cognitive. The three most common failure modes — infinite loops, context overflow, and compounding errors — are all engineering problems with engineering solutions: loop detection, context compression, and per-step validation. A frontier model in a poorly designed agent scaffold will fail reliably on tasks a smaller model handles with the right guardrails. The model's reasoning capability matters for individual steps; the scaffold determines whether those steps compose into reliable multi-step task completion.
Pattern Selection Guide
| Pattern | Best for | Failure mode |
|---|---|---|
| ReAct | Interleaved search/reasoning, uncertain environment | Context bloat at > 15 steps |
| Plan-execute | Structured tasks with known action space | Brittle to early failures that invalidate plan |
| Reflection | Tasks with clear success/failure signal | Doesn't help if reflection diagnoses wrong root cause |
| Hierarchical agents | Long-horizon tasks that decompose cleanly | Coordination overhead, error propagation between agents |
Agent Reliability Checklist
Per-step error handling: every tool call should have a timeout and fallback. Don't let a single failed tool crash the episode.
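One way to implement the timeout-and-fallback rule, sketched with a thread pool (the function name is mine):

```python
import concurrent.futures

def call_with_timeout(fn, inputs: dict, timeout_s: float = 10.0,
                      fallback: str = "[tool timed out]") -> str:
    """Run a single tool call with a hard timeout; return a fallback
    string instead of letting a hung or crashing tool kill the episode."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, **inputs).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    except Exception as e:
        return f"[tool error: {e}]"
    finally:
        pool.shutdown(wait=False)
```

Caveat: the timed-out worker thread keeps running in the background; if the tool must actually be cancelled, run it in a subprocess instead.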
Loop detection: if the agent takes the same action twice in a row with the same inputs, it's stuck. Add a deduplication check and force a different strategy.
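The deduplication check itself is tiny (a sketch; names are mine):

```python
def is_stuck(action_log: list) -> bool:
    """True when the last two actions are the same tool called
    with identical inputs -- the most common stall pattern."""
    return len(action_log) >= 2 and action_log[-1] == action_log[-2]

log = [("search", {"query": "tokyo"}), ("search", {"query": "tokyo"})]
print(is_stuck(log))  # True
```

When the check fires, inject a corrective message into the context ("you repeated the same action; try a different strategy") rather than silently retrying.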
Max steps enforcement: always set a hard step limit. An agent that loops forever is worse than an agent that gives up and returns an error.
Observation truncation: tool outputs can be arbitrarily large (e.g., a full web page). Truncate to the first N characters and include a hint: "response truncated at 2000 chars — request a more specific query for full content."
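The truncation hint can live right next to the cut (a sketch; the function name is mine):

```python
def truncate_observation(text: str, limit: int = 2000) -> str:
    """Cap tool output size and tell the model why it was cut
    and how to get more."""
    if len(text) <= limit:
        return text
    return (text[:limit]
            + f"\n[response truncated at {limit} chars -- "
              "request a more specific query for full content]")

print(truncate_observation("short"))  # short
```

Embedding the hint in the observation matters: without it, the model often retries the same over-broad query expecting a different result.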
Agent patterns in production:
- Start with the simplest pattern that works. ReAct is sufficient for most tool-use tasks. Move to plan-execute only when task structure is complex enough that upfront decomposition clearly helps.
- Instrument every step. Log which tools were called, their inputs and outputs, elapsed time, and token counts. Agents fail in surprising ways — observability is your primary debugging tool.
- Error rate × episode length = reliability. Before deploying a 20-step agent with a 5% per-step error rate, compute the expected success rate: $0.95^{20} \approx 0.36$. Either improve per-step reliability or reduce episode length.
- Human-in-the-loop for high-stakes actions. Pause and request approval before any action that is hard to reverse (file deletion, email sending, API writes). The pause can be conditional on confidence scores.
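A conditional approval gate for the last point can be sketched as follows (the action names and threshold are illustrative, not from the source):

```python
# Illustrative set of hard-to-reverse actions
HIGH_STAKES = {"delete_file", "send_email", "api_write"}

def needs_approval(tool_name: str, confidence: float,
                   threshold: float = 0.95) -> bool:
    """Pause for human approval before any hard-to-reverse action,
    unless the agent's confidence clears the threshold."""
    return tool_name in HIGH_STAKES and confidence < threshold

print(needs_approval("send_email", 0.5))  # True
print(needs_approval("search", 0.5))      # False
```

In the agent loop, a `True` result would route the pending tool call to a human queue instead of executing it; read-only tools never pause.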