Multi-Agent Systems
A single-agent loop hits two ceilings: context window limits cap the amount of information it can track, and sequential execution caps throughput. Multi-agent systems address both by decomposing tasks across multiple specialized agents that run in parallel or in a pipeline. The tradeoff is coordination overhead and amplified failure modes — errors propagate across agent boundaries. This lesson covers the topology options, the coordination patterns that actually work, and how to avoid the pitfalls that make multi-agent systems harder to debug than single-agent ones.
Theory
[Diagram: the four multi-agent topologies. In the fan-out panel, the orchestrator dispatches to workers simultaneously, so latency = max worker latency. Legend: amber = orchestrator · cyan = worker · green = aggregator · violet = sub-orchestrator]
The diagram above shows the four topology options — sequential, parallel fan-out, hierarchical, and peer-to-peer. Whichever topology you pick, the central tradeoff is the same: errors compound across agent boundaries, so a mistake in one agent becomes noise in the next.
Network Topologies
Multi-agent systems can be characterized by their communication graph G = (V, E), where V is the set of agents and E is the set of communication edges:
Sequential (pipeline): A₁ → A₂ → … → Aₙ. Each agent's output is the next agent's input. Latency = sum of stage latencies, ℓ₁ + ℓ₂ + … + ℓₙ. Useful for staged processing: retrieve → rerank → synthesize → format.
Parallel fan-out: an orchestrator sends tasks to k workers simultaneously, then aggregates results. Latency ≈ max worker latency (assuming full parallelism). Throughput scales linearly with k up to API rate limits.
Hierarchical: the orchestrator spawns sub-orchestrators, which spawn workers. A depth-d hierarchy with branching factor b has bᵈ leaf workers. Good for tasks with two-level decomposition (e.g., research across topics, then synthesis).
Fully connected (peer-to-peer): any agent can communicate with any other. n(n − 1)/2 edges — coordination overhead grows quadratically. Rarely used in practice.
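The latency claims above can be made concrete with a toy model (the per-agent latencies are made-up numbers, and real deployments add network and queueing overhead):

```python
def pipeline_latency(latencies: list[float]) -> float:
    """Sequential: stages run one after another, so latencies add."""
    return sum(latencies)

def fanout_latency(latencies: list[float]) -> float:
    """Parallel fan-out: workers run concurrently, so the slowest dominates."""
    return max(latencies)

def hierarchy_leaf_workers(branching: int, depth: int) -> int:
    """Depth-d hierarchy with branching factor b spawns b**d leaf workers."""
    return branching ** depth

workers = [2.0, 3.5, 1.8, 2.7]  # seconds per agent (hypothetical)
print(pipeline_latency(workers))      # 10.0 — a pipeline pays every stage
print(fanout_latency(workers))        # 3.5 — fan-out pays only the slowest
print(hierarchy_leaf_workers(4, 2))   # 16 leaf workers
```

The same four calls make the topology choice quantitative: if your sub-tasks are independent, fan-out turns a 10-second pipeline into a 3.5-second dispatch.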
Error Propagation in Pipelines
If agent i has per-task error rate εᵢ, a pipeline of n agents has compound success rate:

P(success) = (1 − ε₁)(1 − ε₂) ⋯ (1 − εₙ)
The product formula is exact when agent errors are independent — each agent either succeeds or fails without knowing what the others are doing. Independence holds in parallel fan-out architectures but is violated in pipelines, where agent i receives agent i − 1's output as input. In pipelines, errors are correlated: a wrong fact extracted in stage 1 causes wrong inferences in stage 3, and the true failure probability is higher than the product formula suggests. Per-agent quality gates exist precisely to break this correlation: catching errors at each stage prevents them from propagating.
For n = 4 agents each with ε = 0.05: P(success) = 0.95⁴ ≈ 0.81. Error rates multiply — a 4-agent pipeline with individually-acceptable 5% error rates fails 19% of the time. This motivates per-agent quality gates.
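A quick sketch of the compound-success computation, assuming independent per-agent errors:

```python
def pipeline_success(error_rates: list[float]) -> float:
    """Probability that every stage of the pipeline succeeds,
    assuming each stage fails independently with its own rate."""
    p = 1.0
    for eps in error_rates:
        p *= (1.0 - eps)
    return p

# A 4-agent pipeline where each agent has a 5% error rate:
print(round(pipeline_success([0.05] * 4), 3))  # 0.815 — fails ~19% of the time
```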
Error amplification with context: downstream agents inherit upstream errors embedded in their inputs. Unlike independent errors, inherited errors can compound non-linearly — a wrong fact extracted in step 1 may cause two wrong inferences in step 3.
Aggregation Strategies
When parallel workers produce multiple answers for the same question, aggregation determines the final output:
Majority voting: take the most common answer across workers. For binary correctness with each worker correct with probability p, majority-vote accuracy improves with K (odd K, so there are no ties):

P(majority correct) = Σⱼ₌⌈K/2⌉..K C(K, j) · pʲ · (1 − p)^(K−j)
For p = 0.7, K = 5: majority-vote accuracy ≈ 0.84. For K = 7: ≈ 0.87. Diminishing returns — more workers give only marginal gains once K reaches 5 or so.
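The ≈ 0.84 and ≈ 0.87 figures come from the binomial sum; a few lines of Python reproduce them:

```python
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """P(majority of k independent workers is correct), each correct
    with probability p. Assumes odd k so there are no ties."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

print(round(majority_vote_accuracy(0.7, 5), 2))  # 0.84
print(round(majority_vote_accuracy(0.7, 7), 2))  # 0.87
```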
Best-of-K with a judge: run K workers, use an LLM judge to score all K outputs and return the best. More expensive (K + 1 LLM calls) but handles non-binary tasks (long-form generation) where majority voting is undefined.
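A minimal majority-vote aggregator over worker outputs. Note the assumption: exact string matching after normalization is a crude equivalence test, and real systems often canonicalize answers further or fall back to a judge when no answer repeats.

```python
from collections import Counter

def aggregate_majority(answers: list[str]) -> str:
    """Pick the most common answer; normalize whitespace and case so
    trivially different phrasings of the same answer still match."""
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

print(aggregate_majority(["Paris", "paris ", "Lyon"]))  # paris
```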
Walkthrough
Orchestrator-Workers Pattern
import anthropic
import asyncio
import json

client = anthropic.Anthropic()

def worker(task: str, context: str) -> str:
    """Individual worker agent — handles one sub-task.

    Deliberately synchronous: the orchestrator runs it in a thread via
    asyncio.to_thread, so the sync client is fine here.
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheaper model for sub-tasks
        max_tokens=512,
        messages=[{"role": "user", "content": f"Context: {context}\n\nTask: {task}"}]
    )
    return response.content[0].text

async def orchestrator(main_task: str) -> str:
    """Orchestrator: decompose → dispatch → aggregate."""
    # Step 1: decompose task into sub-tasks
    decomp_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Break this task into 3-5 independent sub-tasks that can be done in parallel.
Return JSON: {{"subtasks": ["...", "...", ...]}}

Task: {main_task}"""
        }]
    )
    # Assumes the model returns bare JSON; production code should validate this.
    subtasks = json.loads(decomp_response.content[0].text)["subtasks"]

    # Step 2: fan out to workers (parallel — each sync call runs in its own thread)
    worker_results = await asyncio.gather(*[
        asyncio.to_thread(worker, subtask, main_task)
        for subtask in subtasks
    ])

    # Step 3: synthesize results
    synthesis_prompt = f"""Original task: {main_task}

Sub-task results:
{chr(10).join(f"- {r}" for r in worker_results)}

Synthesize these results into a comprehensive answer."""
    final_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    return final_response.content[0].text

Quality Gates Between Agents
def quality_gate(output: str, criteria: str, threshold: float = 0.7) -> tuple[bool, str]:
    """LLM-based quality check between pipeline stages."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=128,
        messages=[{
            "role": "user",
            "content": f"""Rate this output on the following criteria. Return JSON: {{"score": 0.0-1.0, "issues": "..."}}

Criteria: {criteria}

Output: {output[:1000]}"""
        }]
    )
    import json
    result = json.loads(response.content[0].text)
    return result["score"] >= threshold, result.get("issues", "")

# Usage in pipeline
extracted = extraction_agent(raw_document)
passed, issues = quality_gate(extracted, "Extracted data is complete and correctly formatted JSON")
if not passed:
    extracted = extraction_agent(raw_document, error_context=issues)  # retry with feedback

Analysis & Evaluation
Where Your Intuition Breaks
The intuition says more agents means more capability and more parallelism. In practice, agent count is a cost multiplier, not a capability multiplier. Adding agents adds API calls, coordination latency, error surfaces, and debugging complexity. A well-designed single-agent loop with good tools often outperforms a multi-agent system on the same task, with lower cost and simpler failure modes. Multi-agent systems are justified when the task decomposition is genuinely parallel (the sub-tasks are independent), when context window limits bind (no single agent can hold all the information), or when specialization is needed (different agents with different system prompts and tool sets). Otherwise, multi-agent adds complexity without adding capability.
When to Use Multi-Agent Systems
Use multi-agent when:
- Task naturally decomposes into independent sub-tasks (parallelism helps)
- Single context window is insufficient to hold all relevant information
- Different sub-tasks benefit from different models (cost optimization: cheap workers, expensive orchestrator)
- Parallel execution time savings > coordination overhead
Prefer single-agent when:
- Task requires continuous shared context (each step depends on all prior steps)
- Latency of additional LLM calls for orchestration exceeds any savings
- Debugging complexity is not justified by the use case
Debugging Multi-Agent Systems
Multi-agent systems fail in ways single-agent systems don't:
Silent failure propagation: agent 2 receives a wrong answer from agent 1 but doesn't know it's wrong — it produces a confidently wrong output that looks correct to agent 3. Add explicit "sanity check" steps between agents for critical pipelines.
Coordination loops: agent A asks agent B a question, agent B asks agent A for clarification, deadlock. Use directed communication graphs (no cycles) or timeout-based fallbacks.
Observability: log the full input/output of every agent invocation, not just the final answer. Debugging a 5-agent system requires being able to trace exactly what each agent received and produced.
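One way to get that per-invocation trace is a small wrapper. This is a sketch: the `rerank` stage and the log schema are hypothetical, and a production system would write to a log store rather than stdout.

```python
import json
import time
import uuid

def traced(agent_fn):
    """Wrap an agent function so every invocation logs its full input and
    output under a shared trace id — the raw material for pipeline debugging."""
    def wrapper(*args, **kwargs):
        trace_id = uuid.uuid4().hex[:8]
        start = time.time()
        print(json.dumps({"trace": trace_id, "agent": agent_fn.__name__,
                          "input": {"args": args, "kwargs": kwargs}}, default=str))
        result = agent_fn(*args, **kwargs)
        print(json.dumps({"trace": trace_id, "agent": agent_fn.__name__,
                          "output": result,
                          "seconds": round(time.time() - start, 3)}, default=str))
        return result
    return wrapper

@traced
def rerank(docs):  # hypothetical pipeline stage
    return sorted(docs)

rerank([3, 1, 2])  # emits an input record and an output record
```

Because input and output share a trace id, reconstructing what a 5-agent run actually did becomes a grep rather than guesswork.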
Multi-agent systems in production:
- Don't over-architect. A 3-agent pipeline that you understand is better than an 8-agent graph you can't debug. Add agents only when you have a measured performance reason to do so.
- Use different model tiers per role. The orchestrator and critic roles benefit from more capable models; workers doing structured extraction or classification can use cheaper models. A sonnet orchestrator + haiku workers can reduce costs 3–5× vs all-sonnet.
- Define agent interfaces as contracts. Each agent should have a documented input schema and output schema. Treat agent outputs like API responses — validate before passing downstream.
- Idempotency matters at scale. If an agent call fails mid-pipeline, you need to be able to retry it without re-running upstream agents. Cache intermediate results keyed by input hash.
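The caching point above can be sketched in a few lines — the in-memory dict, the `summarize` stand-in, and the payload shape are all placeholders for a persistent store and real agent calls:

```python
import hashlib
import json

_cache: dict[str, str] = {}  # stand-in for a persistent cache

def cached_call(agent_fn, payload: dict) -> str:
    """Key results by a hash of the exact input, so retrying a failed
    pipeline skips agents whose work already succeeded."""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = agent_fn(payload)
    return _cache[key]

calls = 0
def summarize(payload):  # stand-in for a real agent call
    global calls
    calls += 1
    return f"summary of {payload['doc']}"

print(cached_call(summarize, {"doc": "report.pdf"}))
print(cached_call(summarize, {"doc": "report.pdf"}))  # served from cache
print(calls)  # 1 — the agent ran only once
```

Sorting keys before hashing matters: two dicts with the same content but different insertion order must map to the same cache entry.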