Planning & Reasoning
Planning and reasoning are the cognitive bottleneck for LLM agents. A model that can't plan multi-step tasks or that reasons incorrectly will fail even with perfect tools and a bug-free agent loop. This lesson covers the reasoning techniques that have strong empirical support — chain-of-thought, tree-of-thought, and extended thinking — and when each adds enough value to justify its cost. The goal is a principled understanding of when to invest in better reasoning, not a survey of every prompting technique.
Theory
Humans solve hard problems by talking through them — writing notes, drawing diagrams, working examples. Chain-of-thought gives the model a scratchpad to do the same thing: instead of jumping to an answer, it produces intermediate reasoning steps that carry information the next step depends on. The math below formalizes why this works and when to invest in more expensive reasoning variants like tree-of-thought and extended thinking.
Chain-of-Thought as Computation Graph
Chain-of-thought (Wei et al., 2022) prompts the model to produce intermediate reasoning steps before the final answer. Formally, instead of generating $p(a \mid q)$ directly, the model generates:

$$p(a \mid q) = \sum_{z} p(a \mid z, q)\, p(z \mid q)$$

where $z = (z_1, \ldots, z_k)$ is the chain of intermediate reasoning steps. The key result: CoT dramatically improves performance on tasks that require more than one reasoning step — arithmetic, commonsense reasoning, multi-hop retrieval. For single-step tasks (classification, short extraction), CoT adds latency with no benefit.
Marginalizing over intermediate reasoning steps is the right formulation because the model cannot know in advance which chain of thought will lead to the correct answer — it must consider the distribution over reasoning paths and implicitly average over them. In practice, the model generates a single $z$ (greedy or sampled), which approximates the marginalization. This is why CoT works better with temperature slightly above 0 for reasoning tasks: sampling diverse reasoning chains and majority-voting the answers (self-consistency) approximates the full marginalization better than a single greedy chain.
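The marginalization can be made concrete with a toy example. This is a sketch with made-up probabilities, not anything the model computes explicitly: two possible reasoning chains, each with a probability of being sampled and a probability of reaching the correct answer.

```python
def marginal_answer_prob(p_z: dict[str, float], p_a_given_z: dict[str, float]) -> float:
    """p(a|q) = sum over chains z of p(a|z,q) * p(z|q), for a toy set of chains."""
    return sum(p_a_given_z[z] * p_z[z] for z in p_z)

# A careful chain (sampled 60% of the time, correct 90% of the time) and a
# sloppy chain (40%, correct 30%): p(correct) = 0.6*0.9 + 0.4*0.3 = 0.66
p_correct = marginal_answer_prob(
    {"careful": 0.6, "sloppy": 0.4},
    {"careful": 0.9, "sloppy": 0.3},
)
```

A single greedy sample commits to one chain; averaging over many samples recovers the 0.66 marginal, which is what self-consistency exploits.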
When does CoT help? Tasks whose complexity exceeds one reasoning step, where intermediate results are needed to compute the next step. The model's context window acts as a scratchpad: longer chains of thought mean more computation, but also more tokens consumed.
Zero-shot CoT: appending "Let's think step by step" to a prompt induces CoT without few-shot examples. Effective for many reasoning tasks; few-shot CoT is stronger but requires example curation.
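The self-consistency variant mentioned above is simple enough to sketch without any API details: sample several CoT chains at temperature above 0, extract each final answer, and take a majority vote. Here `sample_chain` is a hypothetical callable standing in for a model call.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer across sampled chains."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency(problem: str, sample_chain, n: int = 5) -> str:
    """Run n independent CoT samples (temperature > 0) and vote.

    sample_chain(problem) -> final answer string; a stand-in for a model call.
    """
    return majority_vote([sample_chain(problem) for _ in range(n)])
```

The vote is over final answers only, not the reasoning text, so chains that reach the same answer by different routes still agree.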
Tree-of-Thought
Tree-of-Thought (Yao et al., 2023) frames problem solving as a tree search where each node is an intermediate thought state:
- Breadth-first (BFS): generate $b$ candidate next thoughts at each step, evaluate all, keep the top $k$, continue until terminal. Best when early errors are recoverable.
- Depth-first (DFS): explore one path deeply; backtrack on failure. Best when the search space is structured with clear dead-ends.
Cost: $b$ thoughts × $d$ steps × evaluation calls per thought. For $b = 4$, $d = 5$, 1 eval each: 20 generation calls + 20 evaluation calls vs 1 for standard inference. ToT is expensive — reserve it for tasks where a single chain of thought is reliably insufficient (combinatorial puzzles, complex multi-step planning).
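The call-count arithmetic is worth making explicit before committing to ToT. A small budgeting helper (hypothetical, for cost estimation only):

```python
def tot_call_count(breadth: int, depth: int, evals_per_thought: int = 1) -> tuple[int, int]:
    """(generation calls, evaluation calls) for a BFS tree-of-thought run
    that expands `breadth` candidate thoughts at each of `depth` steps."""
    generations = breadth * depth
    evaluations = generations * evals_per_thought
    return generations, evaluations

# breadth=4, depth=5, 1 eval each -> (20, 20): 40 API calls vs 1 for standard inference
calls = tot_call_count(4, 5)
```

Multiply by per-call token costs to get a dollar estimate before running the search.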
Extended Thinking
Claude's extended thinking mode allocates additional tokens to internal reasoning before generating the response. Unlike CoT (where reasoning is part of the visible output), extended thinking uses a separate thinking block that is consumed during generation but not shown in the final response.
Budget tokens: the model is given a budget_tokens parameter (e.g., 10,000 tokens) for the thinking block. The model may use up to this budget; unused budget tokens are not charged. Extended thinking improves performance on tasks requiring:
- Long multi-step deduction
- Mathematical reasoning with many intermediate steps
- Complex code generation where planning upfront matters
Cost: thinking tokens are billed at the output token rate. A fully used 10K thinking budget plus a 1K response at Sonnet output pricing ($15/MTok) adds ≈ $0.17 to the request cost.
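The cost estimate above can be packaged as a helper. The default rate is an assumed Sonnet output price in USD per million tokens — check current pricing before relying on it.

```python
def thinking_cost_usd(thinking_tokens: int, response_tokens: int,
                      output_rate_per_mtok: float = 15.0) -> float:
    """Estimate the output-token cost of an extended-thinking request.

    Thinking tokens are billed as output tokens. The default rate is an
    assumed Sonnet output price (USD per million tokens); verify against
    current pricing.
    """
    return (thinking_tokens + response_tokens) * output_rate_per_mtok / 1_000_000

# Fully used 10K thinking budget + 1K response -> ~$0.17
cost = thinking_cost_usd(10_000, 1_000)
```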
Walkthrough
Chain-of-Thought Prompting
import json

import anthropic

client = anthropic.Anthropic()

def cot_solve(problem: str) -> dict:
    """Zero-shot CoT with structured output."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""{problem}
Think step by step. Show your reasoning, then give your final answer.
Format: {{"reasoning": "...", "answer": "..."}}"""
        }]
    )
    # Raises if the model wraps the JSON in prose; validate in production
    return json.loads(response.content[0].text)
# Example: multi-step arithmetic
result = cot_solve("A store sells 3 items: A at $12, B at $18, C at $7. A customer buys 2 of A, 1 of B, and 4 of C. What is the total?")
# reasoning: 2×12=24, 1×18=18, 4×7=28, total=70
# answer: $70

Extended Thinking
def extended_thinking_solve(problem: str, budget: int = 10_000) -> str:
    """Use extended thinking for complex multi-step reasoning."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16_000,  # must exceed the thinking budget
        thinking={
            "type": "enabled",
            "budget_tokens": budget
        },
        messages=[{"role": "user", "content": problem}]
    )
    # Response contains a thinking block followed by a text block
    for block in response.content:
        if block.type == "thinking":
            print(f"[thinking: {len(block.thinking)} chars]")
        elif block.type == "text":
            return block.text
    return ""
# Best for: complex math, multi-step planning, hard coding problems
result = extended_thinking_solve(
    "Design an algorithm to find all prime numbers up to N using the Sieve of Eratosthenes. "
    "Analyze its time and space complexity, then implement it in Python."
)

Lightweight Tree-of-Thought
def tot_solve(problem: str, n_thoughts: int = 3) -> str:
    """Simplified ToT: generate multiple approaches, select best."""
    # Generate candidate approaches sequentially (parallelize with threads/async
    # in production to cut latency)
    candidates = []
    for i in range(n_thoughts):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"Approach {i+1}: Think of one way to solve this problem and outline the key steps.\n\n{problem}"
            }]
        )
        candidates.append(response.content[0].text)

    # Evaluate and select the best approach
    eval_prompt = f"""Problem: {problem}

Candidate approaches:
{chr(10).join(f"{i+1}. {c}" for i, c in enumerate(candidates))}

Which approach is most likely to lead to a correct solution, and why?
Select one and execute it to produce the final answer."""
    final = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    return final.content[0].text

Analysis & Evaluation
Where Your Intuition Breaks
Intuition: more reasoning tokens always improve accuracy. Reality: reasoning tokens improve accuracy for tasks that require multi-step deduction but add latency and cost with no benefit for tasks that are inherently single-step. For classification, extraction, or simple lookup tasks, a 10,000-token thinking budget produces the same answer as a 100-token response — the model "already knows" the answer and fills the thinking budget with restatements. The correct question is not "how many reasoning tokens should I allocate?" but "does this task require multi-step computation where each step depends on the previous?" If the answer is no, reasoning tokens are waste.
Reasoning Technique Selection
| Technique | When to use | Cost multiplier | Improvement range |
|---|---|---|---|
| Standard inference | Simple tasks, classification, extraction | 1× | — |
| Zero-shot CoT | Multi-step reasoning, math, logic | 1.5–2× | +10–30% on reasoning tasks |
| Few-shot CoT | Same as above, harder tasks | 1.5–2× + example prep | +20–40% |
| Extended thinking | Hard math, complex code, long-horizon planning | 5–20× | +20–50% on hardest tasks |
| Tree-of-Thought | Combinatorial, ill-structured, high-stakes | 5–50× | +10–30% vs CoT |
Default to CoT for reasoning tasks. Extended thinking and ToT are useful but expensive — measure baseline CoT performance before escalating. Most multi-step reasoning tasks don't need ToT.
Failure Modes in Reasoning
Reasoning collapse: the model generates plausible-looking reasoning steps that are internally inconsistent. CoT doesn't guarantee correct reasoning — it just makes the reasoning visible so you can check it. Always validate CoT answers on tasks where correctness can be verified.
Step-by-step over-trust: CoT can give false confidence in wrong answers. A 10-step chain that ends with a wrong answer can look more convincing than a 2-step chain. Add automatic verification where possible (run code, check against known facts).
Thinking budget saturation: with extended thinking, more budget doesn't always mean better answers — after a point, additional thinking tokens are spent on redundant reasoning. Monitor thinking block length vs answer quality; often 4K–8K thinking tokens gives most of the benefit.
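Budget tuning can be operationalized with a small sweep over an eval set. A sketch, where the accuracy numbers are hypothetical measurements you would collect yourself:

```python
def smallest_sufficient_budget(accuracy_by_budget: dict[int, float],
                               tolerance: float = 0.02) -> int:
    """Smallest thinking budget whose measured accuracy is within
    `tolerance` of the best observed accuracy."""
    best = max(accuracy_by_budget.values())
    return min(b for b, acc in accuracy_by_budget.items() if acc >= best - tolerance)

# Hypothetical sweep results: diminishing returns past 4K thinking tokens
sweep = {2_000: 0.71, 4_000: 0.86, 8_000: 0.88, 16_000: 0.88}
budget = smallest_sufficient_budget(sweep)  # 4000
```

Re-run the sweep when the task mix or model changes; the saturation point is task-dependent.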
Planning and reasoning in production:
- Use CoT by default for any multi-step task. Appending "Think step by step" to a prompt is free (marginally more output tokens) and regularly improves accuracy by 10–30% on tasks with more than one reasoning step.
- Extended thinking for hard tasks, not all tasks. Use it when standard CoT gives 70–80% accuracy and you need 90%+. The 5–20× cost premium is rarely justified for extraction or classification.
- Cache reasoning patterns. If the same complex reasoning chain runs on many similar inputs (e.g., a contract review prompt), prompt caching can reduce the thinking overhead to near-zero on cache hits.
- Verify, don't just trust. For consequential decisions (code execution, financial calculations), add a verification step: ask the model (or a second model) to check the reasoning independently. Self-consistency: run 3× and take majority vote if cost allows.
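The verify-then-retry pattern above can be sketched independently of any model API. Here `solve` and `verify` are hypothetical callables wrapping a model call or a deterministic check (running code, checking known facts):

```python
def verified_answer(problem: str, solve, verify, max_retries: int = 2):
    """Solve, independently verify, and retry on failed verification.

    solve(problem) -> answer; verify(problem, answer) -> bool. Both are
    hypothetical callables wrapping a model call or a deterministic check.
    Returns None if every attempt fails verification.
    """
    for _ in range(max_retries + 1):
        answer = solve(problem)
        if verify(problem, answer):
            return answer
    return None
```

Using a second, independent model (or a code runner) for `verify` avoids the failure mode where the same model rubber-stamps its own reasoning.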