Claude API & SDK
The Claude API is the entry point for building AI-powered applications. Unlike raw model access, the API abstracts token management, streaming, and tool use into a clean interface, letting you focus on the product rather than the infrastructure. In production, teams use the Claude API to power contract review, fraud investigation, and writing-assistant workflows. Understanding the underlying mechanics — token probability distributions, prompt caching, streaming architecture — lets you optimize for both quality and cost. A well-structured, stable system prompt, for example, can cut input-token costs by 40% or more on repeated, similar requests through prefix caching. This lesson covers the SDK from basics through production patterns, including streaming, tool use, and cost optimization.
Theory
softmax(logits) → probability distribution over vocabulary
Every token the model produces is drawn from a probability distribution over the vocabulary — not a fixed lookup table. Temperature is the dial: turn it toward 0 and the model always picks its top choice; turn it toward 1 and it samples proportionally, occasionally picking lower-probability alternatives.
Token Probability and Sampling
At each generation step, the model produces a logit vector z over the vocabulary. Temperature T scales the distribution before the softmax:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

Dividing the logits by T before the softmax is the only transformation that uniformly rescales confidence without breaking relative ordering. Multiplying by a constant after softmax would distort probabilities into non-normalizable values; any per-token scaling would change the shape arbitrarily. Division before the exponential gives a clean parameter with interpretable limits: T → 0 is deterministic, T = 1 is the model's trained distribution, T → ∞ is uniform.

Low T: the distribution sharpens to argmax (greedy, deterministic). High T: it approaches uniform (maximally random).
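The scaling above can be sketched in a few lines of Python (a toy logit vector stands in for the real vocabulary):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))   # the trained distribution
print(softmax_with_temperature(logits, 0.1))   # sharpens toward argmax
print(softmax_with_temperature(logits, 10.0))  # flattens toward uniform
```

Running it shows the top token's probability rising toward 1 as T falls and the distribution flattening as T grows, exactly the two limits described above.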
Nucleus Sampling (Top-p)
Sample from the smallest set V_p such that the cumulative probability meets the threshold p:

Σ_{x ∈ V_p} P(x) ≥ p

With p = 0.95, we retain the top 95% of probability mass and renormalize. This adapts dynamically — for high-confidence predictions it samples from very few tokens; for uncertain predictions it samples from many.
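A sketch of the selection step, assuming we already have a token-to-probability map:

```python
def top_p_filter(probs: dict[str, float], p: float = 0.95) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass >= p, then renormalize."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    # Walk tokens from most to least probable until the mass threshold is met
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

print(top_p_filter({"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}, p=0.9))
```

With p = 0.9 the tail token "dog" is dropped and the remaining three probabilities are rescaled to sum to 1.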
KV Cache and Context Window
Memory for the key-value cache scales as:

KV bytes ≈ 2 (keys and values) × n_layers × n_heads × d_head × seq_len × bytes_per_value

For Claude Sonnet 4.6 (architecture approximate): 2 × 32 layers × 32 heads × 128 dim × 200k tokens × 2 bytes (fp16) ≈ 105 GB per request. This is why long-context requests are expensive and why prompt caching (reusing the KV cache across requests) is a significant optimization.
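A helper for back-of-envelope KV-cache sizing (the factor of 2 counts both keys and values; the architecture numbers are illustrative, not published):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys AND values (hence the factor of 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

gb = kv_cache_bytes(32, 32, 128, 200_000) / 1e9
print(f"{gb:.0f} GB")  # full-context cache for the illustrative config above
```

Halving precision (fp8) or reducing KV heads (grouped-query attention) shrinks this linearly, which is why those techniques matter at long context.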
Walkthrough
Basic Messages
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from env
# Single-turn
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain backpropagation in 3 sentences."}]
)
print(message.content[0].text)
print(f"\nUsage: {message.usage}")
# Usage: Usage(input_tokens=17, output_tokens=142)

Streaming
# Stream for low latency in interactive applications
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a Python quicksort implementation."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # Get the final message with usage stats (inside the stream context)
    final = stream.get_final_message()
print(f"\nTokens: {final.usage}")

System Prompts and Multi-turn
messages = [{"role": "user", "content": "What is gradient descent?"}]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a concise ML tutor. Explain concepts in under 100 words with one concrete example.",
    messages=messages,
)
answer = response.content[0].text
# Continue the conversation
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Now explain the learning rate hyperparameter."})
followup = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=512,
    system="You are a concise ML tutor. Explain concepts in under 100 words with one concrete example.",
    messages=messages,
)

Tool Use
tools = [{
    "name": "get_stock_price",
    "description": "Get the current stock price for a ticker symbol",
    "input_schema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Stock ticker (e.g., AAPL)"},
            "currency": {"type": "string", "enum": ["USD", "EUR"], "default": "USD"},
        },
        "required": ["ticker"],
    },
}]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the current price of Apple stock?"}]
)
# Check if Claude wants to use a tool
tool_block = None
for block in response.content:
    if block.type == "tool_use":
        tool_block = block
        print(f"Tool: {block.name}")
        print(f"Input: {block.input}")
# → Tool: get_stock_price
# → Input: {'ticker': 'AAPL', 'currency': 'USD'}

# Execute the tool (your implementation)
result = {"price": 189.42, "currency": "USD"}

# Return result to Claude
final = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the current price of Apple stock?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{"type": "tool_result", "tool_use_id": tool_block.id, "content": str(result)}]},
    ]
)
print(final.content[0].text)

Vision
import base64
from pathlib import Path
# Encode image as base64
img_b64 = base64.standard_b64encode(Path("chart.png").read_bytes()).decode()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "What trend does this chart show? Give me the key takeaway in one sentence."},
        ],
    }]
)

Code Implementation
train.py

# alignment/20_langchain_agent/train/train.py
import anthropic
from collections.abc import Callable

client = anthropic.Anthropic()

def run_agent(
    task: str,
    tools: list[dict],
    tool_executor: dict[str, Callable],
    max_turns: int = 10,
    model: str = "claude-sonnet-4-6",
) -> str:
    """Run an agentic loop with tool use."""
    messages = [{"role": "user", "content": task}]
    for turn in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=4096,
            tools=tools, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Extract the final text answer
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = tool_executor[block.name](**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "user", "content": tool_results})
    return "Max turns reached"

Analysis & Evaluation
Where Your Intuition Breaks
Claim: "Lower temperature produces higher-quality outputs." In reality, temperature controls predictability, not quality. A well-calibrated model at T = 1 already assigns the highest probability to good continuations — that's what training optimized. Lowering temperature makes the model more predictable and reduces variance, which feels like a quality improvement on short, formulaic tasks (code, JSON, factual answers). For creative writing or open-ended reasoning, lower temperature produces repetitive, "safe" outputs that are technically probable but informationally thin. The right temperature is task-dependent, not universally low.
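One way to operationalize this is a per-task default passed to the API's temperature parameter. The mapping below is an illustrative heuristic, not an official recommendation:

```python
def default_temperature(task_type: str) -> float:
    """Illustrative per-task temperature defaults; tune against your own evals."""
    defaults = {
        "code": 0.0,        # deterministic, formulaic output
        "extraction": 0.0,  # structured JSON / field extraction
        "qa": 0.3,          # factual answers with mild variation
        "creative": 1.0,    # sample the full trained distribution
    }
    return defaults.get(task_type, 0.7)

# Passed through to the API call:
# client.messages.create(..., temperature=default_temperature("code"))
```

A lookup like this keeps the temperature decision explicit and reviewable instead of buried in individual call sites.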
Token Pricing (approximate, check docs for current rates)
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context |
|---|---|---|---|
| claude-haiku-4-5 | $0.80 | $4.00 | 200k |
| claude-sonnet-4-6 | $3.00 | $15.00 | 200k |
| claude-opus-4-6 | $15.00 | $75.00 | 200k |
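A small estimator over the table above (rates are approximate and may change; verify against the current pricing docs):

```python
PRICES_PER_MTOK = {  # (input, output) $ per 1M tokens — approximate
    "claude-haiku-4-5": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (15.00, 75.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES_PER_MTOK[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(f"${estimate_cost('claude-sonnet-4-6', 50_000, 2_000):.3f}")  # → $0.180
```

Note the 5× input/output asymmetry: for long-context workloads, input tokens dominate the bill, which is what makes prompt caching worthwhile.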
Prompt Caching
# Cache the system prompt (saves ~90% on repeated calls with the same system prompt)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,  # up to 200k tokens
        "cache_control": {"type": "ephemeral"},  # cached for 5 min
    }],
    messages=[{"role": "user", "content": user_query}],
)
# Cache write: ~25% extra on cached tokens (one-time)
# Cache hit: ~10% of base input price

Latency Benchmarks
| Operation | Typical Time To First Token (TTFT) | Throughput |
|---|---|---|
| Short query (100 input, 100 output) | 0.3–0.8s | ~60 tokens/s |
| Long context (100k input, 500 output) | 3–8s | ~50 tokens/s |
| Streaming first token | 0.2–0.5s | — |
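Combining the pricing table with the caching multipliers above (~25% write premium, ~10% hit price — both approximate), a sketch of when caching a large prompt prefix pays off:

```python
def prompt_cost(n_calls: int, prompt_tokens: int, base_per_mtok: float = 3.00,
                cached: bool = False, write_mult: float = 1.25,
                hit_mult: float = 0.10) -> float:
    """Input-token cost of sending the same prompt prefix n_calls times."""
    per_token = base_per_mtok / 1_000_000
    if not cached:
        return n_calls * prompt_tokens * per_token
    # One cache write at the premium rate, then hits at the discounted rate
    return prompt_tokens * per_token * (write_mult + hit_mult * (n_calls - 1))

print(prompt_cost(10, 100_000))               # uncached
print(prompt_cost(10, 100_000, cached=True))  # cached: cheaper from the 2nd call on
```

Under these multipliers, caching breaks even by the second call, so any prompt prefix reused within the cache TTL is worth caching.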
Production-Ready Code
import anthropic
import time
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import json
app = FastAPI(title="Claude API Wrapper")
client = anthropic.Anthropic()
class ChatRequest(BaseModel):
    messages: list[dict]
    system: str = ""
    model: str = "claude-sonnet-4-6"
    max_tokens: int = 2048
    stream: bool = False

@app.post("/chat")
async def chat(req: ChatRequest):
    if req.stream:
        def generate():
            with client.messages.stream(
                model=req.model, max_tokens=req.max_tokens,
                system=req.system, messages=req.messages,
            ) as stream:
                for text in stream.text_stream:
                    yield f"data: {json.dumps({'text': text})}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(generate(), media_type="text/event-stream")
    else:
        response = client.messages.create(
            model=req.model, max_tokens=req.max_tokens,
            system=req.system, messages=req.messages,
        )
        return {"content": response.content[0].text, "usage": dict(response.usage)}

@app.get("/health")
def health():
    return {"status": "ok"}

Anthropic rate limits by requests per minute (RPM) and tokens per minute (TPM). Implement exponential backoff with jitter for 429 responses: wait min(base * 2^attempt + random(0, 1), max_wait) seconds. The SDK's built-in retry handles most cases; add custom retry logic or circuit breaking for high-scale deployments.