Latency & Cost Optimization
LLM inference costs can be 10–100× higher than necessary for most production workloads. The opportunities are stacked: prompt caching eliminates redundant input processing, model selection matches capability to task, batching improves GPU utilization, and quantization shrinks serving costs. Each of these is independent and additive. This lesson gives you a mental model for the latency-cost-quality triangle and a prioritized list of where to look first.
Theory
[Diagram: the LLM cost model. cost = n_in · p_in + n_out · p_out; cache reads ≈ 10% of p_in; illustrative pricing based on the Sonnet tier.]
LLM inference costs break down into two line items: what you send in, and what the model generates back. Output tokens are 3–5× more expensive than input tokens because each one requires a full sequential forward pass — you can't parallelize output generation the way you can parallelize reading a long prompt. The diagram above shows the cost-latency trade-off across optimization levers: prompt caching, model selection, and output length. Start with the biggest lever first.
The LLM Cost Model
API pricing is denominated in tokens. Total cost for a request with $n_{\text{in}}$ input tokens and $n_{\text{out}}$ output tokens:

$$\text{cost} = n_{\text{in}} \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}$$
Output tokens cost more than input tokens because of how autoregressive generation works: each output token requires a full forward pass through the model with the updated KV cache, which is sequential and cannot be batched across positions. Input tokens, by contrast, are processed in parallel during the prefill phase — the transformer processes all input positions simultaneously via self-attention. This architectural asymmetry (parallel prefill vs sequential decoding) directly maps to the pricing asymmetry. It is not a pricing choice; it reflects real compute cost.
where $p_{\text{in}}$ and $p_{\text{out}}$ are per-token prices (output tokens are typically 3–5× more expensive than input tokens because they require sequential decoding passes).
Key insight: output tokens dominate costs at scale. At illustrative Sonnet pricing ($3/M input, $15/M output), a 2000-token system prompt adds $0.006 per request, but a response that averages 500 output tokens costs $0.0075. At these prices, cutting an output token has 5× the cost leverage of cutting an input token.
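The cost model above can be sketched as a small helper. The function name and the default prices are illustrative (the Sonnet-tier figures used later in this lesson); check current pricing before relying on them.

```python
def request_cost(n_in: int, n_out: int,
                 p_in: float = 3e-6, p_out: float = 15e-6) -> float:
    """cost = n_in * p_in + n_out * p_out (prices in USD per token)."""
    return n_in * p_in + n_out * p_out

# A 2000-token system prompt adds ~$0.006 per request;
# a 500-token response adds ~$0.0075.
prompt_cost = request_cost(2000, 0)    # ≈ 0.006
response_cost = request_cost(0, 500)   # ≈ 0.0075
```

Instrumenting this per request (as the walkthrough below does with real `usage` counts) is the first step before any optimization.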
Prompt Caching
When multiple requests share a common prefix (e.g., a large system prompt or document), prompt caching avoids recomputing the KV cache for that prefix:

$$\text{cost}_{\text{cached}} = n_{\text{prefix}} \cdot p_{\text{cache}} + n_{\text{rest}} \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}$$

where $p_{\text{cache}} \approx 0.1 \cdot p_{\text{in}}$ (cache reads are ~10% of the full input token price on Anthropic's API). For a 10K-token system prompt reused across 1000 requests, caching saves approximately $999 \times 10\text{K} \times 0.9 \approx 9\text{M}$ tokens-worth of computation.
Cache hit rate $h$ is the fraction of requests that hit the cached prefix. At steady state, the effective input price on the cached portion is:

$$p_{\text{eff}} = (1 - h) \cdot p_{\text{in}} + h \cdot 0.1 \cdot p_{\text{in}}$$

For $h = 0.9$: $p_{\text{eff}} = 0.19 \cdot p_{\text{in}}$ — an 81% reduction on the cached portion.
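The steady-state formula is easy to sanity-check in code. This is a minimal sketch, assuming the ~10% cache-read price from above; the function name is made up for illustration.

```python
def effective_input_price(p_in: float, hit_rate: float,
                          cache_read_frac: float = 0.10) -> float:
    """p_eff = (1 - h) * p_in + h * (cache_read_frac * p_in)."""
    return (1 - hit_rate) * p_in + hit_rate * cache_read_frac * p_in

# With a 90% hit rate, effective input price drops to 0.19 * p_in,
# i.e. an 81% reduction on the cached portion.
p_eff = effective_input_price(1.0, hit_rate=0.9)  # ≈ 0.19
```

Plugging in your measured hit rate (from `cache_read_input_tokens` in the API response, shown in the walkthrough below) tells you what caching is actually saving you.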
Time-to-First-Token (TTFT) vs Throughput
Latency has two dimensions:
TTFT (time to first token): latency from request to first streamed token. Dominated by input processing (prefill). Long prompts → higher TTFT. Caching dramatically reduces TTFT for cached prefixes.
Inter-token latency (ITL): time between successive output tokens. Dominated by model size and hardware. Smaller models → lower ITL.
For interactive applications (chat, copilots): optimize TTFT first. For batch processing: optimize throughput (total tokens/second across requests) instead.
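Both dimensions can be measured from one streamed response by recording a timestamp per received chunk. The helper below is a sketch: `latency_stats` is a made-up name, and the commented collection loop assumes the Anthropic SDK's `messages.stream()` helper, which you would adapt to your client.

```python
import time
from typing import Dict, List

def latency_stats(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT and mean inter-token latency from per-chunk timestamps."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "mean_itl_s": sum(gaps) / len(gaps) if gaps else 0.0,
    }

# Collecting timestamps while streaming (sketch, Anthropic SDK):
# start = time.time()
# token_times = []
# with client.messages.stream(model=..., max_tokens=256, messages=[...]) as s:
#     for _ in s.text_stream:
#         token_times.append(time.time())
# print(latency_stats(start, token_times))
```

Tracking these two numbers separately tells you which lever to pull: high TTFT points at prompt length and caching, high ITL points at model size.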
Model Selection: The Quality-Cost Curve
Different model tiers offer different quality-cost trade-offs, tracing an efficient frontier: each step down in capability buys a much larger step down in price.
For most classification and extraction tasks, a smaller model (e.g., Haiku) achieves 90%+ of a larger model's accuracy at 10–20× lower cost. The quality gap only matters for complex reasoning, long-form generation, and nuanced judgment.
Walkthrough
Profiling and Optimizing a RAG Pipeline
Step 1 — Measure baseline costs:
import anthropic, time
def timed_request(prompt: str, context: str) -> dict:
client = anthropic.Anthropic()
start = time.time()
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {prompt}"
}]
)
elapsed = time.time() - start
usage = response.usage
return {
"input_tokens": usage.input_tokens,
"output_tokens": usage.output_tokens,
"latency_ms": elapsed * 1000,
        "cost_usd": usage.input_tokens * 3e-6 + usage.output_tokens * 15e-6,  # Sonnet: $3/M in, $15/M out
    }

Step 2 — Enable prompt caching for the system prompt:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[{
"type": "text",
"text": LARGE_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # cache this block
}],
messages=[{"role": "user", "content": user_query}]
)
# Second+ requests with same system prompt: cache_read_input_tokens shows hits
print(response.usage.cache_read_input_tokens)  # > 0 on cache hit

Step 3 — Route by task complexity:
def route_query(query: str, context: str) -> str:
"""Use a cheap model for simple lookups, expensive for complex reasoning."""
# Heuristic: short context + factual question → small model
is_simple = len(context) < 1000 and "?" in query and len(query.split()) < 20
model = "claude-haiku-4-5-20251001" if is_simple else "claude-sonnet-4-6"
    return generate(query, context, model=model)

Analysis & Evaluation
Where Your Intuition Breaks
"A long system prompt makes every request more expensive." Not once prompt caching is enabled: a long system prompt that is identical across requests is processed once and cached — subsequent requests pay only the cache read cost, approximately 10% of the full input token price. A 10,000-token system prompt reused across 1,000 requests costs roughly as much as ~101 uncached reads of that prompt (1 full compute + 999 cache reads at 10% each). Without caching, cost scales linearly with system prompt length per request; with caching, the marginal cost of the system prompt approaches zero at steady state. The expensive prompt is the one that changes per request, not the one that stays constant.
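The arithmetic behind that comparison, spelled out (the ~10% cache-read price is the Anthropic figure used throughout this lesson):

```python
# 10K-token system prompt reused across 1,000 requests.
full_computes = 1                       # first request builds the prefix cache
cache_reads = 999 * 0.10                # each later request pays ~10% of p_in
uncached_equivalents = full_computes + cache_reads   # ≈ 100.9 full reads

# Versus 1,000 full reads without caching: ~90% saved on the prefix.
savings_frac = 1 - uncached_equivalents / 1000       # ≈ 0.90
```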
Optimization Priority Order
| Optimization | Complexity | Typical savings | Apply when |
|---|---|---|---|
| Prompt caching | Low | 50–90% input cost | Shared prefix > 1K tokens |
| Model downgrade (small tasks) | Medium | 80–95% per request | Classification, extraction, routing |
| Output length control | Low | 20–50% output cost | Responses longer than needed |
| Batching | High | 30–50% throughput | Async/offline workloads |
| Quantization (self-hosted) | Very high | 40–60% serving cost | Self-hosted inference at scale |
Start here: prompt caching + model routing. Both are API-level changes with no infrastructure work and frequently achieve 5–10× cost reduction.
Latency Optimization Patterns
Streaming: start rendering before generation completes. Perceived latency drops even when total generation time is unchanged. Use stream=True in the SDK.
Prefetch: for applications with predictable follow-up queries (e.g., pagination, wizard flows), start the next LLM call before the user clicks.
Speculative decoding (self-hosted): a small draft model generates tokens quickly; a large model verifies them in parallel. Achieves large-model quality at near-small-model latency. Available in vLLM and SGLang.
Latency-cost-quality trade-offs in practice:
- Measure before optimizing. Instrument token counts, cache hit rates, and latency percentiles (p50, p95, p99). Most optimization effort should go to the 5% of requests at p95+ latency.
- Output tokens are the lever. Cutting output token count (via tighter prompts, max_tokens limits, structured output schemas that prevent padding) has 3–5× the cost impact of input token reduction.
- Prompt caching requires prefix stability. Dynamic content inserted before a stable suffix breaks caching. Keep the large stable content (system prompt, reference documents) first; append dynamic content last.
- Model routing is the highest-leverage change. Routing 70% of requests to a smaller model while sending only the 30% that need it to a larger model often matches full-large-model quality at 30–40% of the cost.
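To make "measure before optimizing" concrete, here is a minimal nearest-rank percentile over recorded request latencies. The helper name and sample values are hypothetical; in production you would feed it your logged per-request latencies.

```python
import math
from typing import List

def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile: q=0.95 returns the p95 value."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(q * len(ranked)) - 1)
    return ranked[idx]

latencies_ms = [120, 135, 150, 900, 140, 160, 2400, 130]  # hypothetical log
p50 = percentile(latencies_ms, 0.50)   # 140 — typical request
p95 = percentile(latencies_ms, 0.95)   # 2400 — the tail worth optimizing
```

The p50/p95 gap is the signal: a healthy median with a heavy tail points at a few pathological requests (huge contexts, long outputs, cache misses) rather than a uniformly slow system.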