
Multi-Agent Systems

A single-agent loop hits two ceilings: context window limits cap the amount of information it can track, and sequential execution caps throughput. Multi-agent systems address both by decomposing tasks across multiple specialized agents that run in parallel or in a pipeline. The tradeoff is coordination overhead and amplified failure modes — errors propagate across agent boundaries. This lesson covers the topology options, the coordination patterns that actually work, and how to avoid the pitfalls that make multi-agent systems harder to debug than single-agent ones.

Theory

Multi-agent topologies

Orchestrator dispatches to workers simultaneously. Latency = max worker latency.

[Diagram: parallel fan-out — the Orchestrator decomposes the task into sub-tasks A, B, C; Workers 1–3 run them in parallel; an Aggregator merges the results.]

Latency: T = T_orch + max(T_w1, T_w2, T_w3) + T_agg
Strength: 3× throughput vs. sequential; scales with workers
Weakness: aggregation logic needed; more complex orchestration

The diagram above shows parallel fan-out, one of four topology options covered below: sequential, parallel fan-out, hierarchical, and peer-to-peer. Whatever the topology, the core tradeoff is the same: errors compound across agent boundaries, so a mistake in one agent becomes noise in the next.

Network Topologies

Multi-agent systems can be characterized by their communication graph G = (V, E), where V is the set of agents and E is the set of communication edges:

Sequential (pipeline): A_1 → A_2 → ⋯ → A_N. Each agent's output is the next agent's input. Latency = Σ_i T_i. Useful for staged processing: retrieve → rerank → synthesize → format.

Parallel fan-out: the orchestrator sends the task to K workers simultaneously, then aggregates results. Latency ≈ max_k T_k (assuming full parallelism). Throughput scales linearly with K up to API rate limits.

Hierarchical: the orchestrator spawns sub-orchestrators, which spawn workers. A depth-d hierarchy with branching factor b has b^d leaf workers. Good for tasks with two-level decomposition (e.g., research across topics, then synthesis).

Fully connected (peer-to-peer): any agent can communicate with any other. O(N²) edges mean high coordination overhead. Rarely used in practice.
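
The latency formulas above are easy to sanity-check numerically. This is a minimal sketch (function names are my own, not from any library) comparing pipeline latency against fan-out latency for the same three sub-task durations:

```python
def sequential_latency(stage_times: list[float]) -> float:
    """Pipeline latency: stages run one after another, so times sum."""
    return sum(stage_times)

def fanout_latency(worker_times: list[float], t_orch: float = 1.0, t_agg: float = 1.0) -> float:
    """Fan-out latency: orchestration, then the slowest worker, then aggregation."""
    return t_orch + max(worker_times) + t_agg

print(sequential_latency([4, 5, 6]))        # 15
print(fanout_latency([4, 5, 6], 1.0, 1.0))  # 8.0
```

With identical sub-task times, fan-out wins whenever the orchestration and aggregation overhead is smaller than the sum of all but the slowest worker.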

Error Propagation in Pipelines

If agent A_i has per-task error rate ε_i, a pipeline of N agents has compound success rate:

P(success) = ∏_{i=1}^{N} (1 − ε_i)

The product formula is exact when agent errors are independent — each agent either succeeds or fails without knowing what the others are doing. Independence holds in parallel fan-out architectures but is violated in pipelines, where agent A_i receives A_{i−1}'s output as input. In pipelines, errors are correlated: a wrong fact extracted in stage 1 causes wrong inferences in stage 3, and the true failure probability is higher than ∏_i (1 − ε_i) suggests. Per-agent quality gates exist precisely to break this correlation: catching errors at each stage prevents them from propagating.

For N = 4 agents each with ε_i = 0.05: P = 0.95^4 ≈ 0.81. Error rates multiply — a 4-agent pipeline with individually acceptable 5% error rates fails 19% of the time. This motivates per-agent quality gates.

Error amplification with context: downstream agents inherit upstream errors embedded in their inputs. Unlike independent errors, inherited errors can compound non-linearly — a wrong fact extracted in step 1 may cause two wrong inferences in step 3.
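
The compound success rate is a one-liner to verify. A minimal sketch (the function name is my own) reproducing the N = 4, ε = 0.05 example:

```python
def pipeline_success(error_rates: list[float]) -> float:
    """Compound success rate of a pipeline, assuming independent errors:
    P(success) = product over stages of (1 - eps_i)."""
    p = 1.0
    for eps in error_rates:
        p *= (1 - eps)
    return p

print(round(pipeline_success([0.05] * 4), 4))  # 0.8145
```

Remember the caveat from above: in a real pipeline, correlated errors make this an optimistic upper bound.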

Aggregation Strategies

When parallel workers produce multiple answers for the same question, aggregation determines the final output:

Majority voting: take the most common answer across K workers. For binary correctness with each worker correct with probability p > 0.5, majority-vote accuracy improves with K:

P(majority correct) = ∑_{j=⌈K/2⌉}^{K} C(K, j) p^j (1 − p)^{K−j}

For p = 0.7, K = 5: majority-vote accuracy ≈ 0.84. For K = 7: ≈ 0.87. Diminishing returns — more workers give marginal gains once K > 5.
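
The binomial sum above translates directly into code. A sketch (function name is my own) using the standard library's `math.comb`:

```python
from math import ceil, comb

def majority_accuracy(p: float, k: int) -> float:
    """Probability that at least ceil(k/2) of k independent workers
    (each correct with probability p) answer correctly."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(ceil(k / 2), k + 1))

print(round(majority_accuracy(0.7, 5), 2))  # 0.84
print(round(majority_accuracy(0.7, 7), 2))  # 0.87
```

Plotting this for increasing odd K makes the diminishing returns visible: the jump from K = 1 to K = 3 is much larger than from K = 5 to K = 7.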

Best-of-K with a judge: run K workers, use an LLM judge to score all outputs, and return the best. More expensive (K + 1 LLM calls) but handles non-binary tasks (long-form generation) where majority voting is undefined.
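
Structurally, best-of-K is just an argmax over a scoring function. A minimal sketch — `score_fn` stands in for an LLM judge here; the toy scorer below is purely illustrative, not a real quality metric:

```python
def best_of_k(outputs: list[str], score_fn) -> str:
    """Best-of-K: score every candidate and return the highest-scoring one.
    In practice score_fn would wrap an LLM judge call; any callable
    mapping output -> float works."""
    return max(outputs, key=score_fn)

# Toy usage with length as a stand-in scorer (illustration only)
candidates = ["short", "a longer answer", "mid one"]
print(best_of_k(candidates, len))  # a longer answer
```

The single-judge variant shown here scores candidates independently; a pairwise-comparison judge is more reliable but needs O(K²) comparisons.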

Walkthrough

Orchestrator-Workers Pattern

python
import anthropic
import asyncio
import json

client = anthropic.Anthropic()

def worker(task: str, context: str) -> str:
    """Individual worker agent — handles a sub-task.

    Deliberately synchronous: the orchestrator runs each worker in its own
    thread via asyncio.to_thread, so workers still execute in parallel."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheaper model for sub-tasks
        max_tokens=512,
        messages=[{"role": "user", "content": f"Context: {context}\n\nTask: {task}"}]
    )
    return response.content[0].text

async def orchestrator(main_task: str) -> str:
    """Orchestrator: decompose → dispatch → aggregate."""

    # Step 1: decompose task into sub-tasks
    decomp_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Break this task into 3-5 independent sub-tasks that can be done in parallel.
Return JSON: {{"subtasks": ["...", "...", ...]}}

Task: {main_task}"""
        }]
    )
    # Assumes the model returned bare JSON as instructed
    subtasks = json.loads(decomp_response.content[0].text)["subtasks"]

    # Step 2: fan out to workers (parallel threads)
    worker_results = await asyncio.gather(*[
        asyncio.to_thread(worker, subtask, main_task)
        for subtask in subtasks
    ])

    # Step 3: synthesize results
    synthesis_prompt = f"""Original task: {main_task}

Sub-task results:
{chr(10).join(f"- {r}" for r in worker_results)}

Synthesize these results into a comprehensive answer."""

    final_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    return final_response.content[0].text

# Run with: asyncio.run(orchestrator("your task here"))

Quality Gates Between Agents

python
import json

def quality_gate(output: str, criteria: str, threshold: float = 0.7) -> tuple[bool, str]:
    """LLM-based quality check between pipeline stages."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=128,
        messages=[{
            "role": "user",
            "content": f"""Rate this output on the following criteria. Return JSON: {{"score": 0.0-1.0, "issues": "..."}}

Criteria: {criteria}

Output: {output[:1000]}"""
        }]
    )
    result = json.loads(response.content[0].text)
    return result["score"] >= threshold, result.get("issues", "")

# Usage in pipeline
extracted = extraction_agent(raw_document)
passed, issues = quality_gate(extracted, "Extracted data is complete and correctly formatted JSON")
if not passed:
    extracted = extraction_agent(raw_document, error_context=issues)  # retry with feedback
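
The retry-with-feedback pattern generalizes to any stage. A minimal sketch (names are my own) that bounds retries so a persistently failing stage cannot loop forever:

```python
def run_with_gate(stage_fn, gate_fn, max_retries: int = 2) -> str:
    """Run a pipeline stage, re-invoking it with the gate's feedback until
    it passes or retries are exhausted. Assumed interfaces:
      stage_fn(feedback: str) -> str
      gate_fn(output: str) -> (passed: bool, issues: str)"""
    feedback = ""
    for _ in range(max_retries + 1):
        output = stage_fn(feedback)
        passed, issues = gate_fn(output)
        if passed:
            return output
        feedback = issues  # feed the gate's critique back into the stage
    raise RuntimeError(f"stage failed quality gate after retries: {issues}")
```

Raising on exhaustion (rather than passing a known-bad output downstream) is what actually stops silent error propagation.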

Analysis & Evaluation

Where Your Intuition Breaks

"More agents means more capability and more parallelism." In reality, agent count is a cost multiplier, not a capability multiplier. Adding agents adds API calls, coordination latency, error surfaces, and debugging complexity. A well-designed single-agent loop with good tools often outperforms a multi-agent system on the same task, at lower cost and with simpler failure modes. Multi-agent systems are justified when task decomposition is genuinely parallel (the sub-tasks are independent), when context window limits bind (no single agent can hold all the information), or when specialization is needed (different agents with different system prompts and tool sets). Otherwise, multi-agent adds complexity without adding capability.

When to Use Multi-Agent Systems

Use multi-agent when:

  • Task naturally decomposes into independent sub-tasks (parallelism helps)
  • Single context window is insufficient to hold all relevant information
  • Different sub-tasks benefit from different models (cost optimization: cheap workers, expensive orchestrator)
  • Parallel execution time savings > coordination overhead

Prefer single-agent when:

  • Task requires continuous shared context (each step depends on all prior steps)
  • Latency of additional LLM calls for orchestration exceeds any savings
  • Debugging complexity is not justified by the use case

Debugging Multi-Agent Systems

Multi-agent systems fail in ways single-agent systems don't:

Silent failure propagation: agent 2 receives a wrong answer from agent 1 but doesn't know it's wrong — it produces a confidently wrong output that looks correct to agent 3. Add explicit "sanity check" steps between agents for critical pipelines.

Coordination loops: agent A asks agent B a question, agent B asks agent A for clarification, deadlock. Use directed communication graphs (no cycles) or timeout-based fallbacks.
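
Enforcing a directed acyclic communication graph can be done statically before any agent runs. A minimal sketch (my own helper, not from a library) using DFS cycle detection on an adjacency-list graph:

```python
def has_cycle(graph: dict[str, list[str]]) -> bool:
    """DFS cycle detection over an agent communication graph.
    A cycle-free graph rules out A-asks-B-asks-A deadlocks by construction."""
    visiting: set[str] = set()  # nodes on the current DFS path
    done: set[str] = set()      # nodes fully explored

    def dfs(u: str) -> bool:
        visiting.add(u)
        for v in graph.get(u, []):
            if v in visiting:
                return True  # back edge → cycle
            if v not in done and dfs(v):
                return True
        visiting.discard(u)
        done.add(u)
        return False

    return any(n not in done and dfs(n) for n in graph)

print(has_cycle({"A": ["B"], "B": ["A"]}))        # True
print(has_cycle({"A": ["B"], "B": ["C"], "C": []}))  # False
```

Run this check at configuration time; combined with per-call timeouts it covers both static and runtime coordination loops.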

Observability: log the full input/output of every agent invocation, not just the final answer. Debugging a 5-agent system requires being able to trace exactly what each agent received and produced.
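
One lightweight way to get that trace is a decorator that records every agent invocation. A sketch under the assumption that agents are plain Python callables; production systems would ship these records to a tracing backend instead of printing:

```python
import functools
import json
import time
import uuid

def traced(agent_name: str):
    """Decorator: log the full input and output of each agent call,
    tagged with a unique trace id and elapsed time."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            record = {
                "trace_id": str(uuid.uuid4()),
                "agent": agent_name,
                "input": {"args": [str(a) for a in args],
                          "kwargs": {k: str(v) for k, v in kwargs.items()}},
                "output": str(result),
                "elapsed_s": round(time.time() - start, 3),
            }
            print(json.dumps(record))  # stand-in for a real log sink
            return result
        return inner
    return wrap
```

Applied to every agent function, this yields one structured log line per invocation, which is exactly what you need to replay a 5-agent trace.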

🚀Production

Multi-agent systems in production:

  • Don't over-architect. A 3-agent pipeline that you understand is better than an 8-agent graph you can't debug. Add agents only when you have a measured performance reason to do so.
  • Use different model tiers per role. The orchestrator and critic roles benefit from more capable models; workers doing structured extraction or classification can use cheaper models. A Sonnet orchestrator + Haiku workers can reduce costs 3–5× vs. all-Sonnet.
  • Define agent interfaces as contracts. Each agent should have a documented input schema and output schema. Treat agent outputs like API responses — validate before passing downstream.
  • Idempotency matters at scale. If an agent call fails mid-pipeline, you need to be able to retry it without re-running upstream agents. Cache intermediate results keyed by input hash.
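
Caching by input hash is the core of that idempotency story. A minimal sketch — in-memory dict and my own function names; a production system would use a durable store like Redis or a database:

```python
import hashlib

_cache: dict[str, str] = {}  # stand-in for a durable store

def cached_call(agent_fn, payload: str) -> str:
    """Idempotent agent invocation: key the result by a hash of the input,
    so retrying a failed pipeline skips already-completed upstream stages."""
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = agent_fn(payload)  # only pay for the call once per input
    return _cache[key]
```

Note the key must cover everything that affects the output (prompt template and model version included), or a config change will silently serve stale results.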
