Neural-Path/Notes

Latency & Cost Optimization

LLM inference costs can be 10–100× higher than necessary for most production workloads. The opportunities are stacked: prompt caching eliminates redundant input processing, model selection matches capability to task, batching improves GPU utilization, and quantization shrinks serving costs. Each of these is independent and additive. This lesson gives you a mental model for the latency-cost-quality triangle and a prioritized list of where to look first.

Theory

[Interactive calculator: latency & cost model — token pricing. Example inputs: 2,000 input tokens, 500 output tokens, a 1,500-token cached prefix, 80% cache hit rate. Outputs: uncached cost $13.50/1K requests; cached cost $10.26/1K requests; savings 24.0%; effective p_in $0.84/M tokens on the cached prefix. Notes from the widget: cache reads only help if the prefix is large and requests are frequent; output tokens are 56% of the uncached cost, and since output is 5× more expensive per token, cutting output length has 5× the leverage of input reduction. Underlying formula: cost = n_in · p_in + n_out · p_out, with cache reads ≈ 10% of p_in; illustrative pricing based on Sonnet 3-tier.]

LLM inference costs break down into two line items: what you send in, and what the model generates back. Output tokens are 3–5× more expensive than input tokens because each one requires a full sequential forward pass — you can't parallelize output generation the way you can parallelize reading a long prompt. The diagram above shows the cost-latency trade-off across optimization levers: prompt caching, model selection, and output length. Start with the biggest lever first.

The LLM Cost Model

API pricing is denominated in tokens. Total cost for a request with $n_{\text{in}}$ input tokens and $n_{\text{out}}$ output tokens:

\text{cost} = n_{\text{in}} \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}

where $p_{\text{in}}$ and $p_{\text{out}}$ are per-token prices; output tokens are typically 3–5× more expensive than input tokens because they require sequential decoding passes.

Output tokens cost more because of how autoregressive generation works: each output token requires a full forward pass through the model with the updated KV cache, and these passes are sequential, so they cannot be parallelized across positions. Input tokens, by contrast, are processed in parallel during the prefill phase, where the transformer attends over all input positions simultaneously. This architectural asymmetry (parallel prefill vs sequential decode) maps directly onto the pricing asymmetry: it is not a pricing choice, it reflects real compute cost.

Key insight: output tokens dominate costs at scale. A 2,000-token system prompt adds $2000 \cdot p_{\text{in}}$ per request, but a response averaging 500 output tokens costs $500 \cdot p_{\text{out}} \approx 500 \cdot 5 p_{\text{in}} = 2500 \cdot p_{\text{in}}$ at Sonnet-style pricing, more than the entire input cost. Token for token, output reduction has 3–5× the cost leverage of input reduction.
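To make that leverage concrete, a small sketch using illustrative Sonnet-style rates ($3/M input, $15/M output; the `request_cost` helper is ours):

```python
def request_cost(n_in: int, n_out: int,
                 p_in: float = 3e-6, p_out: float = 15e-6) -> float:
    """cost = n_in * p_in + n_out * p_out (prices in USD per token)."""
    return n_in * p_in + n_out * p_out

base = request_cost(2000, 500)                 # $0.0135 per request
save_input = base - request_cost(1900, 500)    # trim 100 input tokens
save_output = base - request_cost(2000, 400)   # trim 100 output tokens
print(save_output / save_input)                # ~5: output cuts are 5x more valuable
```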

Prompt Caching

When multiple requests share a common prefix (e.g., a large system prompt or document), prompt caching avoids recomputing the KV cache for that prefix:

\text{cost}_{\text{cached}} = n_{\text{cached}} \cdot p_{\text{cache\_read}} + n_{\text{new}} \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}

where $p_{\text{cache\_read}} \approx 0.1 \cdot p_{\text{in}}$ (cache reads cost ~10% of the full input-token price on Anthropic's API). For a 10K-token system prompt reused across 1,000 requests, caching saves approximately $10{,}000 \times 999 \times 0.9 \cdot p_{\text{in}}$ worth of input-token cost.

Cache hit rate $h = N_{\text{cache\_hits}} / N_{\text{total}}$. At steady state, the effective input price is

p_{\text{effective}} = h \cdot p_{\text{cache\_read}} + (1 - h) \cdot p_{\text{in}}

For $h = 0.9$: $p_{\text{effective}} \approx 0.19 \cdot p_{\text{in}}$, an 81% reduction on the cached portion.
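The steady-state formula is easy to sanity-check in code (a minimal sketch; `effective_input_price` is our name, and the 10% cache-read discount follows the assumption above):

```python
def effective_input_price(hit_rate: float, p_in: float,
                          cache_discount: float = 0.10) -> float:
    """p_effective = h * p_cache_read + (1 - h) * p_in,
    with p_cache_read = cache_discount * p_in."""
    p_cache_read = cache_discount * p_in
    return hit_rate * p_cache_read + (1 - hit_rate) * p_in

# h = 0.9 at $3/M input: 0.9 * $0.30 + 0.1 * $3.00 = $0.57/M, i.e. 0.19 * p_in
print(effective_input_price(0.9, 3.00))  # ~0.57
```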

Time-to-First-Token (TTFT) vs Throughput

Latency has two dimensions:

TTFT (time to first token): latency from request to first streamed token. Dominated by input processing (prefill). Long prompts → higher TTFT. Caching dramatically reduces TTFT for cached prefixes.

Inter-token latency (ITL): time between successive output tokens. Dominated by model size and hardware. Smaller models → lower ITL.

For interactive applications (chat, copilots): optimize TTFT first. For batch processing: optimize throughput (total tokens/second across requests) instead.
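Streaming makes both metrics directly measurable: record the request start time and each token's arrival time, then derive TTFT and mean inter-token latency. A minimal sketch (the helper function and timestamps are illustrative, not SDK API):

```python
def latency_metrics(t_start: float, token_times: list[float]) -> dict:
    """TTFT = first token arrival - request start; ITL = mean gap between tokens."""
    ttft = token_times[0] - t_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "itl_s": itl}

# e.g. request at t=0, first token at 0.8s, then one every 30ms:
m = latency_metrics(0.0, [0.80, 0.83, 0.86, 0.89])
# m["ttft_s"] is 0.80 (prefill-dominated); m["itl_s"] is ~0.03 (decode-dominated)
```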

Model Selection: The Quality-Cost Curve

Different model tiers offer different quality-cost trade-offs. The efficient frontier:

\text{select model } m^* = \arg\min_m \, \text{cost}(m) \quad \text{s.t.} \quad \text{quality}(m) \geq \text{threshold}

For most classification and extraction tasks, a smaller model (e.g., Haiku) achieves 90%+ of a larger model's accuracy at 10–20× lower cost. The quality gap only matters for complex reasoning, long-form generation, and nuanced judgment.
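The constrained arg-min can be sketched over a small quality/cost table (model names, prices, and accuracies below are illustrative, not benchmark results):

```python
# (model, cost per 1K requests in USD, measured task accuracy) -- illustrative
MODELS = [
    ("small", 0.9, 0.91),
    ("medium", 4.5, 0.94),
    ("large", 13.5, 0.97),
]

def cheapest_meeting(threshold: float) -> str:
    """argmin over cost, subject to quality >= threshold."""
    eligible = [(cost, name) for name, cost, q in MODELS if q >= threshold]
    if not eligible:
        raise ValueError("no model meets the quality threshold")
    return min(eligible)[1]

print(cheapest_meeting(0.90))  # "small": 15x cheaper than "large"
```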

Walkthrough

Profiling and Optimizing a RAG Pipeline

Step 1 — Measure baseline costs:

python
import anthropic, time
 
def timed_request(prompt: str, context: str) -> dict:
    client = anthropic.Anthropic()
    start = time.time()
 
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {prompt}"
        }]
    )
 
    elapsed = time.time() - start
    usage = response.usage
    return {
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "latency_ms": elapsed * 1000,
        "cost_usd": usage.input_tokens * 3e-6 + usage.output_tokens * 15e-6,
    }

Step 2 — Enable prompt caching for system prompt:

python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}  # cache this block
    }],
    messages=[{"role": "user", "content": user_query}]
)
# Second+ requests with same system prompt: cache_read_input_tokens shows hits
print(response.usage.cache_read_input_tokens)  # > 0 on cache hit

Step 3 — Route by task complexity:

python
def route_query(query: str, context: str) -> str:
    """Use a cheap model for simple lookups, expensive for complex reasoning."""
    # Heuristic: short context + factual question → small model
    is_simple = len(context) < 1000 and "?" in query and len(query.split()) < 20
 
    model = "claude-haiku-4-5-20251001" if is_simple else "claude-sonnet-4-6"
    return generate(query, context, model=model)

Analysis & Evaluation

Where Your Intuition Breaks

A long system prompt makes every request more expensive. Not with caching: a long system prompt that is identical across requests is processed once and cached; subsequent requests pay only the cache-read cost, roughly 10% of the full input-token price. A 10,000-token system prompt reused across 1,000 requests costs about as much as 100 uncached prompt computations (1 full pass plus 999 cache reads at ~10% each ≈ 101 full-price equivalents, ignoring the small one-time cache-write surcharge). Without caching, cost scales linearly with system prompt length on every request; with caching, the marginal cost of the system prompt approaches zero at steady state. The expensive prompt is the one that changes per request, not the one that stays constant.
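The arithmetic is worth checking directly, counting full-price prompt-equivalents under the ~10% cache-read assumption (a quick sketch; the initial cache write is counted at the full input price for simplicity):

```python
REQUESTS = 1_000
CACHE_READ_FRACTION = 0.10  # cache read ~ 10% of full input price

# Full-price prompt-equivalents paid across all requests:
uncached = REQUESTS * 1.0                            # 1000 full-price passes
cached = 1.0 + (REQUESTS - 1) * CACHE_READ_FRACTION  # 1 write + 999 cheap reads
print(cached)  # ~100.9 full-price equivalents: roughly 10x cheaper
```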

Optimization Priority Order

| Optimization | Complexity | Typical savings | Apply when |
|---|---|---|---|
| Prompt caching | Low | 50–90% input cost | Shared prefix > 1K tokens |
| Model downgrade (small tasks) | Medium | 80–95% per request | Classification, extraction, routing |
| Output length control | Low | 20–50% output cost | Responses longer than needed |
| Batching | High | 30–50% throughput | Async/offline workloads |
| Quantization (self-hosted) | Very high | 40–60% serving cost | Self-hosted inference at scale |

Start here: prompt caching + model routing. Both are API-level changes with no infrastructure work and frequently achieve 5–10× cost reduction.

Latency Optimization Patterns

Streaming: start rendering before generation completes. Perceived latency drops even when total generation time is unchanged. Use stream=True in the SDK.

Prefetch: for applications with predictable follow-up queries (e.g., pagination, wizard flows), start the next LLM call before the user clicks.

Speculative decoding (self-hosted): a small draft model generates tokens quickly; a large model verifies them in parallel. Achieves large-model quality at near-small-model latency. Available in vLLM and SGLang.

🚀 Production

Latency-cost-quality trade-offs in practice:

  • Measure before optimizing. Instrument token counts, cache hit rates, and latency percentiles (p50, p95, p99). Most optimization effort should go to the 5% of requests at p95+ latency.
  • Output tokens are the lever. Cutting output token count (via tighter prompts, max_tokens limits, structured output schemas that prevent padding) has 3–5× the cost impact of input token reduction.
  • Prompt caching requires prefix stability. Dynamic content inserted before a stable suffix breaks caching. Keep the large stable content (system prompt, reference documents) first; append dynamic content last.
  • Model routing is the highest-leverage change. Routing 70% of requests to a smaller model while sending only the 30% that need it to a larger model often matches full-large-model quality at 30–40% of the cost.
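The routing claim above follows from a simple cost blend. A sketch assuming the small model costs one-tenth of the large model's price (an illustrative ratio, not a quoted rate):

```python
def blended_cost(frac_small: float, small_rel_price: float) -> float:
    """Cost relative to sending every request to the large model (price = 1.0)."""
    return frac_small * small_rel_price + (1 - frac_small) * 1.0

# 70% of traffic to a model at 1/10 the price:
print(blended_cost(0.70, 0.10))  # ~0.37, i.e. 37% of full-large-model cost
```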
