Neural-Path/Notes

Claude API & SDK

The Claude API is the entry point for building AI-powered applications. Unlike raw model access, the API abstracts token management, streaming, and tool use into a clean interface — letting you focus on the product rather than the infrastructure. In production, teams use the Claude API to power contract review, fraud investigation, and writing assistant workflows. Understanding the underlying mechanics — token probability distributions, prompt caching, streaming architecture — lets you optimize for both quality and cost. A well-structured system prompt can cut input-token costs by 40% or more on repeated similar requests through prefix caching. This lesson covers the SDK from basics through production patterns including streaming, tool use, and cost optimization.

Theory

Next Token Distribution (example): sat 34% · is 21% · was 15% · jumped 12% · ran 9% · meowed 5% · other… 4%

softmax(logits) → probability distribution over vocabulary

Every token the model produces is drawn from a probability distribution over the vocabulary — not looked up in a fixed table. Temperature is the dial: turn it toward 0 and the model always picks its top choice; turn it up and it samples proportionally, occasionally picking lower-probability alternatives. The example distribution above lets you feel what temperature actually does to that distribution before you read the math.

Token Probability and Sampling

At each generation step, the model produces a logit vector $\mathbf{z} \in \mathbb{R}^{|V|}$ over the vocabulary. Temperature $T$ scales the distribution:

$$P(w_i \mid \text{context}) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Dividing the logits by $T$ before the softmax uniformly rescales confidence without changing the relative ordering of tokens. Multiplying probabilities by a constant after the softmax would break normalization, and ad-hoc per-token scaling would distort the shape arbitrarily. Division before the exponential gives a clean parameter with interpretable limits: $T \to 0$ is deterministic, $T = 1$ is the model's trained distribution, and $T \to \infty$ is uniform.

Low $T \to 0$: the distribution sharpens to the argmax (greedy, deterministic). High $T \to \infty$: it approaches uniform (maximally random).
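A minimal numerical sketch of the scaling (the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits divided by temperature T (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 1.0, 0.5]  # hypothetical logits for four tokens

for T in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
# Low T concentrates mass on the top logit; high T flattens toward uniform.
```

Note that the relative ordering of the four tokens never changes — only how peaked the distribution is.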

Nucleus Sampling (Top-p)

Sample from the smallest set $S$ such that cumulative probability meets or exceeds $p$:

$$S = \arg\min_{S'} \left\{ |S'| : \sum_{w \in S'} P(w) \geq p \right\}$$

With $p = 0.95$, we retain the most probable tokens covering 95% of the probability mass and renormalize. This adapts dynamically: for high-confidence predictions it samples from very few tokens; for uncertain predictions, from many.
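Applied to the example distribution from the diagram, nucleus selection can be sketched as:

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (descending order), then renormalize."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

dist = {"sat": 0.34, "is": 0.21, "was": 0.15, "jumped": 0.12,
        "ran": 0.09, "meowed": 0.05, "other": 0.04}
print(top_p_filter(dist, 0.95))
# "other" (4%) falls outside the nucleus; the remaining six tokens are renormalized.
```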

KV Cache and Context Window

Memory for the key-value cache scales as:

$$\text{Memory}_{\text{KV}} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times L \times \text{bytes}$$

For a hypothetical model at roughly this scale: 2 × 32 layers × 32 heads × 128 dim × 200k tokens × 2 bytes ≈ 105 GB per request (the leading 2 covers keys and values). This is why long-context requests are expensive and why prompt caching (reusing the KV cache across requests) is a significant optimization.
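Plugging the formula into code makes the scale concrete (the architecture numbers are illustrative, not published specs):

```python
def kv_cache_bytes(n_layers: int, n_heads: int, d_head: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size per request. The factor of 2 is one cache each
    for keys and values; bytes_per_value=2 assumes fp16/bf16."""
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value

gb = kv_cache_bytes(32, 32, 128, 200_000) / 1e9
print(f"{gb:.0f} GB")  # → 105 GB
```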

Walkthrough

Basic Messages

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

# Single-turn
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain backpropagation in 3 sentences."}]
)
print(message.content[0].text)
print(f"\nUsage: {message.usage}")
# Usage: Usage(input_tokens=17, output_tokens=142)
```

Streaming

```python
# Stream for low latency in interactive applications
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a Python quicksort implementation."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Get the final message with usage stats (inside the context manager,
    # while the stream is still open)
    final = stream.get_final_message()

print(f"\nTokens: {final.usage}")
```

System Prompts and Multi-turn

```python
SYSTEM = "You are a concise ML tutor. Explain concepts in under 100 words with one concrete example."

messages = [{"role": "user", "content": "What is gradient descent?"}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=SYSTEM,
    messages=messages,
)
answer = response.content[0].text

# Continue the conversation (the system prompt is resent on every call)
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Now explain the learning rate hyperparameter."})

followup = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=512,
    system=SYSTEM,
    messages=messages,
)
```

Tool Use

```python
tools = [{
    "name": "get_stock_price",
    "description": "Get the current stock price for a ticker symbol",
    "input_schema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Stock ticker (e.g., AAPL)"},
            "currency": {"type": "string", "enum": ["USD", "EUR"], "default": "USD"},
        },
        "required": ["ticker"],
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the current price of Apple stock?"}]
)

# Check if Claude wants to use a tool
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}")
        print(f"Input: {block.input}")
        # → Tool: get_stock_price
        # → Input: {'ticker': 'AAPL', 'currency': 'USD'}

        # Execute the tool (your implementation)
        result = {"price": 189.42, "currency": "USD"}

        # Return the result to Claude
        final = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=[
                {"role": "user", "content": "What's the current price of Apple stock?"},
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": [{"type": "tool_result", "tool_use_id": block.id, "content": str(result)}]},
            ]
        )
        print(final.content[0].text)
```

Vision

```python
import base64
from pathlib import Path

# Encode image as base64
img_b64 = base64.standard_b64encode(Path("chart.png").read_bytes()).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "What trend does this chart show? Give me the key takeaway in one sentence."},
        ],
    }]
)
```

Code Implementation

train.py
```python
# alignment/20_langchain_agent/train/train.py
import anthropic
from typing import Callable

client = anthropic.Anthropic()

def run_agent(
    task: str,
    tools: list[dict],
    tool_executor: dict[str, Callable],
    max_turns: int = 10,
    model: str = "claude-sonnet-4-6",
) -> str:
    """Run an agentic loop with tool use."""
    messages = [{"role": "user", "content": task}]

    for _ in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=4096,
            tools=tools, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Extract the final text answer
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = tool_executor[block.name](**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "user", "content": tool_results})

    return "Max turns reached"
```
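A usage sketch for `run_agent` — the `get_weather` tool and its schema here are hypothetical, and the commented-out call requires an `ANTHROPIC_API_KEY`:

```python
def get_weather(city: str) -> str:
    # Hypothetical local implementation; a real one would call a weather API
    return f"Sunny, 22C in {city}"

weather_tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# answer = run_agent("What's the weather in Paris?", weather_tools,
#                    {"get_weather": get_weather})
```

The executor dict maps each tool name in the schema to the Python function that actually implements it.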

Analysis & Evaluation

Where Your Intuition Breaks

"Lower temperature produces higher-quality outputs." In fact, temperature controls predictability, not quality. A well-calibrated model at $T = 1$ already assigns the highest probability to good continuations — that's what training optimized. Lowering temperature makes the model more predictable and reduces variance, which feels like a quality improvement on short, formulaic tasks (code, JSON, factual answers). For creative writing or open-ended reasoning, low temperature produces repetitive, "safe" outputs that are technically probable but informationally thin. The right temperature is task-dependent, not universally low.

Token Pricing (approximate, check docs for current rates)

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context |
|---|---|---|---|
| claude-haiku-4-5 | $0.80 | $4.00 | 200k |
| claude-sonnet-4-6 | $3.00 | $15.00 | 200k |
| claude-opus-4-6 | $15.00 | $75.00 | 200k |
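At these rates, per-request cost is easy to estimate (prices copied from the table; treat them as approximate and check current docs):

```python
PRICES = {  # $ per million tokens: (input, output)
    "claude-haiku-4-5": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A 10k-token prompt with a 1k-token reply on Sonnet:
print(f"${request_cost('claude-sonnet-4-6', 10_000, 1_000):.3f}")  # → $0.045
```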

Prompt Caching

```python
# Cache the system prompt (saves ~90% on repeated calls with the same system prompt)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,  # up to 200k tokens
        "cache_control": {"type": "ephemeral"},  # cached for 5 min
    }],
    messages=[{"role": "user", "content": user_query}],
)
# Cache write: ~25% extra on cached tokens (one-time)
# Cache hit:  ~10% of base input price
```
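A break-even sketch using the approximate multipliers quoted above (~25% write surcharge, ~10% hit price; verify against current pricing docs):

```python
def caching_cost(base_price_per_mtok: float, cached_tokens: int, n_requests: int,
                 write_mult: float = 1.25, hit_mult: float = 0.10) -> float:
    """Input cost for a cached prefix across n_requests:
    one cache write, then (n_requests - 1) cache hits."""
    per_tok = base_price_per_mtok / 1e6
    write = cached_tokens * per_tok * write_mult
    hits = cached_tokens * per_tok * hit_mult * (n_requests - 1)
    return write + hits

def plain_cost(base_price_per_mtok: float, tokens: int, n_requests: int) -> float:
    """Input cost if the same prefix is sent uncached every time."""
    return tokens * base_price_per_mtok / 1e6 * n_requests

# 50k-token system prompt, 100 requests, Sonnet input at $3/MTok:
print(caching_cost(3.00, 50_000, 100), "vs", plain_cost(3.00, 50_000, 100))
# Caching comes out roughly 9x cheaper in this scenario.
```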

Latency Benchmarks

| Operation | Typical Time To First Token (TTFT) | Throughput |
|---|---|---|
| Short query (100 input, 100 output) | 0.3–0.8 s | ~60 tokens/s |
| Long context (100k input, 500 output) | 3–8 s | ~50 tokens/s |
| Streaming first token | 0.2–0.5 s | — |

Production-Ready Code

```python
import anthropic
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="Claude API Wrapper")
client = anthropic.Anthropic()

class ChatRequest(BaseModel):
    messages: list[dict]
    system: str = ""
    model: str = "claude-sonnet-4-6"
    max_tokens: int = 2048
    stream: bool = False

@app.post("/chat")
async def chat(req: ChatRequest):
    if req.stream:
        def generate():
            # Server-sent events: one data frame per text delta
            with client.messages.stream(
                model=req.model, max_tokens=req.max_tokens,
                system=req.system, messages=req.messages,
            ) as stream:
                for text in stream.text_stream:
                    yield f"data: {json.dumps({'text': text})}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(generate(), media_type="text/event-stream")
    else:
        response = client.messages.create(
            model=req.model, max_tokens=req.max_tokens,
            system=req.system, messages=req.messages,
        )
        return {"content": response.content[0].text, "usage": response.usage.model_dump()}

@app.get("/health")
def health():
    return {"status": "ok"}
```
Rate Limiting and Retry

Anthropic rate limits by requests per minute (RPM) and tokens per minute (TPM). Implement exponential backoff with jitter for 429 responses: wait `min(base * 2^attempt + random(0, 1), max_wait)` seconds. The SDK's built-in retry handles most cases; add custom retry or circuit-breaker logic for high-scale deployments.
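The backoff policy above as a standalone sketch; the attempt limits are assumptions, and the SDK call in the comment is illustrative:

```python
import random
import time

def with_backoff(fn, max_attempts=5, base=1.0, max_wait=30.0,
                 retryable=(Exception,)):
    """Call fn, retrying retryable errors with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the error
            # base * 2^attempt plus up to 1s of jitter, capped at max_wait
            wait = min(base * 2 ** attempt + random.random(), max_wait)
            time.sleep(wait)

# Usage with the SDK (anthropic.RateLimitError is raised on 429 responses):
# result = with_backoff(lambda: client.messages.create(...),
#                       retryable=(anthropic.RateLimitError,))
```

The jitter spreads out retries from concurrent clients so they don't all hammer the API in lockstep.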
