Neural-Path/Notes

Agents & Tool Use

LLM agents with tool use represent the transition from question-answering to task completion. A single-turn model gives you an answer; an agent with tools executes a Python script, queries a database, browses a URL, and synthesizes the results — all in one workflow. In production deployments, agents with tool use power internal research pipelines; in coding assistants, an agent edits code files, runs tests, and debugs failures autonomously. The architecture is straightforward — give the model a list of tool schemas and let it decide which tools to call based on context — but the engineering challenge is in reliability: agents need retry logic, timeout handling, and guardrails to prevent runaway tool chains. This lesson derives the ReAct (reason + act) loop, implements multi-agent architectures, and covers production reliability patterns.

Theory

Tool Call Flow

  1. User Message: "What is 42 × 73?"
  2. LLM: reads the message and the available tool schemas (e.g. inputSchema: {a, b: number}) and decides whether to respond directly or invoke a tool.
  3. Tool Call: the model emits multiply({a: 42, b: 73}).
  4. Execution: the host dispatches the call; the tool runs outside the model.
  5. Observation: "result: 3066" is returned into the model's context.
  6. Final Response: "42 × 73 = 3066"

Tool execution happens outside the model. The model cannot fake results — it only sees the observation the host injects back into context.

An agent is just a loop: observe the world, decide what to do, act, observe the result, repeat. You already understand this — it's how you make a phone call to book a reservation, or debug code by running it and reading the output. The model's "world" is its context window; tools are the actions it can take; the loop runs until the task is done or a budget runs out. What makes this powerful is that the model decides which tools to call and in what order — you define the tools, the model does the orchestration.

An agent is a policy $\pi$ that maps observations to actions, executing a loop until a terminal state or max-steps is reached:

$$a_t = \pi(\text{action} \mid s_t,\ \text{history}_{<t},\ \text{tools})$$

where $s_t = (q, o_1, \ldots, o_{t-1})$ accumulates the query and all prior observations.
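A minimal sketch of this policy loop in code, with hypothetical `llm_decide` and `execute` helpers standing in for the model call and the tool dispatcher (any real implementation supplies both):

```python
# Minimal sketch of the agent policy loop: a_t = pi(action | s_t, history, tools).
# `llm_decide` and `execute` are hypothetical stand-ins for the model call
# and the tool dispatcher.

def agent_loop(query, llm_decide, execute, max_steps=10):
    state = [query]                      # s_t = (q, o_1, ..., o_{t-1})
    for _ in range(max_steps):
        action = llm_decide(state)       # a_t drawn from the policy
        if action["type"] == "final":    # terminal state
            return action["answer"]
        observation = execute(action)    # o_t from the environment
        state.append(observation)        # history accumulates in context
    return None                          # max-steps budget exhausted
```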

ReAct Framework

ReAct interleaves Reasoning and Acting at each step:

$$\text{Thought}_t \to \text{Action}_t \to \text{Observation}_t \to \text{Thought}_{t+1} \to \ldots$$

This improves on pure chain-of-thought (which never interacts with the environment) and pure reactive agents (which act without reasoning first). ReAct reduces hallucination by grounding each reasoning step in actual tool outputs.

Tool Call as Function

A tool call is formally:

$$o_t = f(\text{tool\_name},\ \text{args}_t) + \epsilon_{\text{tool}}$$

where $\epsilon_{\text{tool}}$ captures tool errors (network failures, invalid inputs, timeouts). A robust agent must handle non-zero $\epsilon_{\text{tool}}$.
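One common way to absorb transient $\epsilon_{\text{tool}}$ is retry with exponential backoff; a sketch, where the final failure is returned as a structured error string so the agent observes the failure instead of crashing:

```python
import time

# Handle non-zero epsilon_tool: retry transient failures with exponential
# backoff, and surface a structured error after the last attempt so the
# agent sees a usable observation rather than an exception.

def call_tool_with_retry(fn, args, retries=3, base_delay=1.0):
    for attempt in range(retries):
        try:
            return fn(**args)
        except Exception as e:
            if attempt == retries - 1:
                return f"Tool failed after {retries} attempts: {e}"
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```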

Termination and Safety

An agent terminates when:

  1. stop_reason == "end_turn" (model signals done)
  2. Max steps reached (hard limit)
  3. Budget exceeded (token/cost guard)
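The three conditions above fold naturally into one guard function; a sketch, where the token count would come from the API's usage field in practice:

```python
# The three termination conditions, combined into a single guard.
# Returns the reason for stopping, or None to keep looping.

def should_stop(stop_reason, step, tokens_used,
                max_steps=10, token_budget=100_000):
    if stop_reason == "end_turn":   # 1. model signals done
        return "done"
    if step >= max_steps:           # 2. hard step limit
        return "max_steps"
    if tokens_used >= token_budget: # 3. token/cost guard
        return "budget"
    return None
```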

The expected cost of an agent run:

$$\mathbb{E}[\text{cost}] = \sum_{t=1}^{T} \left( c_{\text{input}} \cdot |s_t| + c_{\text{output}} \cdot |a_t| \right) + \sum_{\text{tool calls}} c_{\text{tool}}$$

The history accumulation in $s_t = (q, o_1, \ldots, o_{t-1})$ is not optional — it is forced by the architecture. A transformer has no implicit memory between forward passes; every piece of context the model needs to reason about must be present in the current input. This is why agent context windows grow with each step, and why cost grows quadratically with depth: each step adds tokens that all subsequent steps must process. Long-horizon agents are expensive not because tool calls are expensive, but because context accumulates.
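The quadratic growth is easy to verify with toy numbers: if every step appends $k$ observation tokens to an initial prompt of $q$ tokens, step $t$ processes roughly $q + tk$ input tokens, so total input over $T$ steps is $qT + kT(T-1)/2$:

```python
# Toy model of context accumulation: each step appends k observation tokens,
# and step t reprocesses everything added so far, so total input tokens
# grow quadratically in the number of steps T.

def total_input_tokens(q=500, k=300, T=10):
    return sum(q + t * k for t in range(T))  # step t sees q + t*k tokens

print(total_input_tokens(T=10))  # 18500
print(total_input_tokens(T=20))  # 67000 — ~3.6x the cost for 2x the steps
```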

⚠️Infinite loop risk

Agents can get stuck retrying failed tool calls or oscillating between states. Always implement: (1) max_steps hard limit, (2) tool call deduplication check, (3) total token budget guard. Log every tool call for debugging.
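Guard (2) can be as simple as a set keyed on tool name plus canonicalized arguments; a sketch:

```python
import json

# Deduplicate repeated tool calls: if the model issues the same tool with
# the same arguments twice, short-circuit with a canned observation instead
# of re-executing — this breaks retry oscillation.

class DedupGuard:
    def __init__(self):
        self.seen = set()

    def check(self, name, args):
        key = (name, json.dumps(args, sort_keys=True))
        if key in self.seen:
            return "Duplicate tool call blocked: this exact call was already tried."
        self.seen.add(key)
        return None  # new call, allow it
```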

Walkthrough

Task: Build a research agent that searches, reads, and synthesizes information using the Anthropic tool use API directly.

Define Tools

python
import anthropic
import httpx
from typing import Any
 
client = anthropic.Anthropic()
 
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web for current information. Returns top 3 results with titles, URLs, and snippets.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_url",
        "description": "Read the text content of a web page.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Full URL to fetch"},
                "max_chars": {"type": "integer", "default": 3000},
            },
            "required": ["url"],
        },
    },
    {
        "name": "python_repl",
        "description": "Execute Python code and return stdout. Use for calculations and data analysis.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"},
            },
            "required": ["code"],
        },
    },
]
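Before dispatching, it is worth validating the model's arguments against the declared `input_schema`. A minimal stdlib-only check for required keys and basic types (a sketch sized to the schemas above, not a full JSON Schema validator — the `jsonschema` package covers far more):

```python
# Minimal input_schema check: required keys present, basic JSON Schema
# types match. Returns a list of error strings (empty = valid).

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "object": dict}

def validate_input(schema: dict, args: dict) -> list[str]:
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required field: {key}")
    for key, value in args.items():
        spec = schema.get("properties", {}).get(key)
        if spec and not isinstance(value, TYPE_MAP.get(spec["type"], object)):
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

Running the check before the registry lookup turns malformed model output into a tool_result error the model can correct, rather than a Python exception.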

Tool Implementations

python
import subprocess
import json
 
def web_search(query: str) -> str:
    """Mock search — in production use SerpAPI or Brave Search API"""
    return json.dumps([
        {"title": f"Result for: {query}", "url": "https://example.com/1",
         "snippet": f"Relevant information about {query}..."},
    ])
 
def read_url(url: str, max_chars: int = 3000) -> str:
    try:
        r = httpx.get(url, timeout=10, follow_redirects=True)
        # Strip HTML in production with BeautifulSoup
        return r.text[:max_chars]
    except Exception as e:
        return f"Error fetching {url}: {e}"
 
def python_repl(code: str) -> str:
    try:
        result = subprocess.run(
            ["python3", "-c", code],
            capture_output=True, text=True, timeout=10
        )
        return result.stdout or result.stderr
    except subprocess.TimeoutExpired:
        return "Timeout after 10s"
 
TOOL_REGISTRY = {
    "web_search": web_search,
    "read_url": read_url,
    "python_repl": python_repl,
}

Agent Loop

python
def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    step = 0
 
    print(f"Task: {task}\n{'='*50}")
 
    while step < max_steps:
        step += 1
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system="You are a research agent. Use tools to gather information before answering. Think step-by-step.",
            tools=TOOLS,
            messages=messages,
        )
 
        # Accumulate assistant's response
        messages.append({"role": "assistant", "content": response.content})
 
        # Print reasoning
        for block in response.content:
            if hasattr(block, "text"):
                print(f"\n[Step {step}] Thought: {block.text[:200]}")
 
        # Terminal: model is done
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
 
        # Execute tool calls — guard against unknown tool names and tool
        # exceptions so one bad call becomes an error observation, not a crash
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  → Tool: {block.name}({json.dumps(block.input)[:80]})")
                    fn = TOOL_REGISTRY.get(block.name)
                    is_error = False
                    try:
                        result = fn(**block.input) if fn else f"Unknown tool: {block.name}"
                        is_error = fn is None
                    except Exception as e:
                        result, is_error = f"Tool error: {e}", True
                    print(f"  ← Result: {str(result)[:100]}...")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                        "is_error": is_error,
                    })
            messages.append({"role": "user", "content": tool_results})
 
    return "Max steps reached"
 
# Run it
answer = run_agent("What is the current state of the art on RLHF? Find 2 recent papers and summarize.")
print(f"\nFinal answer:\n{answer}")

Code Implementation

20_langchain_agent/
python
# alignment/20_langchain_agent/train/train.py
# LangChain agent version for comparison
from langchain_anthropic import ChatAnthropic
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain.tools import tool
 
@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    # NOTE: eval on model-supplied input is unsafe outside a demo; use a
    # restricted evaluator (e.g. a math expression parser) in production.
    return str(eval(expression))
 
@tool
def get_current_date() -> str:
    """Get today's date."""
    from datetime import date
    return str(date.today())
 
llm = ChatAnthropic(model="claude-sonnet-4-6")
tools = [calculate, get_current_date]
 
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant with tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
 
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=10)
 
result = executor.invoke({"input": "What is 2^32, and what is today's date?"})
print(result["output"])

Analysis & Evaluation

Where Your Intuition Breaks

Intuition says a smarter model means a more reliable agent. In practice, model capability and agent reliability are different properties. A highly capable model can still fail as an agent because agentic reliability depends on tool design, error handling, loop termination, and context management — not just reasoning quality. Common agent failures (infinite loops, misinterpreted tool outputs, context overflow) are engineering problems, not intelligence problems. A mediocre model with well-designed tools and tight guardrails often outperforms a frontier model in an under-engineered agent scaffold.

Failure Mode Analysis

python
# Track agent metrics
from dataclasses import dataclass
 
@dataclass
class AgentRun:
    task: str
    steps: int
    tool_calls: list[str]
    success: bool
    cost_usd: float
    error: str | None = None
 
def evaluate_agent(tasks: list[str]) -> dict:
    runs = []
    for task in tasks:
        try:
            answer = run_agent(task, max_steps=10)
            runs.append(AgentRun(task=task, steps=0,  # step count requires instrumenting run_agent
                                 tool_calls=[], success=bool(answer), cost_usd=0.0))
        except Exception as e:
            runs.append(AgentRun(task=task, steps=0, tool_calls=[], success=False, cost_usd=0, error=str(e)))
 
    return {
        "success_rate": sum(r.success for r in runs) / len(runs),
        "avg_steps": sum(r.steps for r in runs) / len(runs),
        "avg_cost": sum(r.cost_usd for r in runs) / len(runs),
    }

Multi-Agent Architecture

The orchestrator agent decomposes the task and dispatches to specialist agents. Each specialist has a focused toolset, reducing hallucination risk.
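A minimal shape for that dispatch pattern, with hypothetical `decompose` and specialist-agent callables standing in for real LLM calls (a sketch of the routing logic, not a full implementation):

```python
# Orchestrator -> specialist dispatch. Each specialist is an agent function
# with its own narrow toolset; the hypothetical `decompose` helper splits
# the task into (role, subtask) pairs and the orchestrator routes by role.

def orchestrate(task, specialists, decompose):
    results = []
    for role, subtask in decompose(task):  # e.g. [("search", ...), ("math", ...)]
        agent = specialists[role]          # focused toolset per specialist
        results.append((role, agent(subtask)))
    # A final synthesis step (often another LLM call) merges the pieces.
    return results
```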

🚀Cost estimation before running

For expensive agent runs, estimate cost upfront: count input tokens in the initial prompt, multiply by expected steps and avg output per step. Add a soft warning at 50% of budget and hard stop at 100%. Track per-task costs to identify expensive tasks for optimization.
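That estimate can be sketched directly, modeling the per-step context growth from the cost formula above; the prices here are illustrative placeholders, not current API rates:

```python
# Upfront cost estimate: price the initial prompt, then grow the context
# each step by the expected output and observation tokens.
# price_in/price_out are placeholder $/token values, not real rates.

def estimate_cost(input_tokens, expected_steps, avg_output_tokens,
                  price_in=3e-6, price_out=15e-6, avg_obs_tokens=500):
    total, ctx = 0.0, input_tokens
    for _ in range(expected_steps):
        total += ctx * price_in + avg_output_tokens * price_out
        ctx += avg_output_tokens + avg_obs_tokens  # observations re-enter context
    return total

budget = 0.50
est = estimate_cost(input_tokens=2000, expected_steps=8, avg_output_tokens=400)
if est >= budget:
    raise RuntimeError(f"Estimated ${est:.2f} exceeds ${budget:.2f} budget")
if est >= 0.5 * budget:
    print(f"Warning: estimate ${est:.2f} is past half the ${budget:.2f} budget")
```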
