Structured Outputs
LLMs generate free-form text. Production systems need structured data — JSON objects, typed fields, enum values. The gap between these two is where many LLM integrations break. Structured outputs (JSON mode, tool calling schemas, constrained generation) enforce format at the API level, so your parsing code never sees malformed output. This lesson covers the three enforcement mechanisms, how to design schemas that minimize hallucination, and the reliability patterns that make structured extraction production-grade.
Theory
Tool calling: schema enforced at the API level (field names, types, and required fields are validated before the response is returned). JSON mode: valid JSON syntax only; field validation is your responsibility.
A language model generates tokens one at a time, sampling from a probability distribution over the vocabulary. Structured outputs work by restricting that distribution at each step to only tokens that keep the output valid according to a schema — after {"name": , the model can only generate a string-opening quote, not a number or brace. The diagram above shows the three enforcement mechanisms: JSON mode (valid JSON required), tool calling (full schema enforcement), and constrained decoding (grammar-level token masking). Each gives stronger guarantees at the cost of more setup.
Constrained Decoding
Standard autoregressive sampling picks the next token from the full vocabulary distribution:

$$P(y_t = v \mid y_{<t}) = \operatorname{softmax}(W h_t / \tau)_v$$

where $h_t$ is the hidden state and $\tau$ is temperature. Constrained decoding masks the distribution to allow only tokens that are valid continuations of the current structured format:

$$P_{\text{masked}}(y_t = v \mid y_{<t}) \propto P(y_t = v \mid y_{<t}) \cdot \mathbb{1}[v \in \mathcal{V}_t]$$

where $\mathcal{V}_t$ is the set of valid next tokens given the current output state (e.g., after {"name": , valid tokens are string-opening quotes, not numbers or braces).

The token mask approach is forced by the model's architecture: the probability distribution was trained on unconstrained natural language, and we cannot retrain it for every possible schema at inference time. Applying a binary mask over the learned distribution preserves the model's relative preferences among valid tokens while hard-blocking invalid ones. The alternative — prompting the model to follow a schema without enforcement — produces valid outputs most of the time but fails unpredictably on complex schemas or adversarial inputs. Token-level enforcement is the only mechanism that provides hard guarantees.
This is how Anthropic's tool_use API, OpenAI's response_format, and libraries like Outlines enforce structure — they compile a schema to a token mask and apply it at each decoding step.
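The masking step itself is simple enough to sketch directly. The four-token vocabulary and probabilities below are invented for illustration; a real implementation compiles the schema into a per-state mask over the full tokenizer vocabulary:

```python
def masked_distribution(probs: dict[str, float], valid: set[str]) -> dict[str, float]:
    """Zero out invalid tokens, then renormalize the survivors.

    This preserves the model's relative preferences among valid tokens
    while hard-blocking everything else.
    """
    masked = {tok: p for tok, p in probs.items() if tok in valid}
    total = sum(masked.values())
    return {tok: p / total for tok, p in masked.items()}

# Suppose the model has just emitted {"name":  — per the JSON grammar,
# only a string-opening quote (or whitespace) may follow.
probs = {'"': 0.6, "1": 0.25, "{": 0.1, " ": 0.05}
valid = {'"', " "}
constrained = masked_distribution(probs, valid)
# Invalid tokens get probability 0; valid ones are renormalized.
```

The key property: among valid tokens, the relative ordering of probabilities is unchanged, so the model's learned preferences still drive the choice.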
Schema Design and Hallucination
The probability of hallucination in a structured field scales with:
- Enum size — more valid values → more opportunities to pick a wrong one
- Nested depth — each level of nesting multiplies opportunities for format errors
- Optional vs required fields — optional fields with no natural value tend to be filled with plausible-sounding guesses
Minimum effective schema: include only fields you actually need. Every additional field is an opportunity for hallucination. Flat schemas outperform deeply nested ones on reliability.
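As a concrete contrast (field names are illustrative), the flat schema below asks only for what downstream code consumes, while the nested one multiplies the failure surface:

```python
# Flat, minimal: two required fields, one level deep.
flat_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["vendor", "total_amount"],
}

# Deeply nested with optional fields: each extra level adds format-error
# opportunities, and each optional field with no natural value in the
# source tends to be filled with a plausible-sounding guess.
nested_schema = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {  # optional: often hallucinated when absent
                    "type": "object",
                    "properties": {
                        "street": {"type": "string"},
                        "country": {"type": "string"},
                    },
                },
            },
        },
    },
}
```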
Tool Calling vs JSON Mode
Two mechanisms for structured output:
JSON mode (response_format: { type: "json_object" }): the model is constrained to output valid JSON but there is no schema enforcement on field names or types. The model decides which fields to include. Good for flexible extraction where you want the model to decide structure.
Tool calling (function definitions with typed parameters): the model must populate a specific schema. Field names, types, and required vs optional are all enforced at the API level. Better for cases where you need an exact, predictable shape.
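The two request shapes side by side, as plain payload dicts. Parameter names follow the OpenAI and Anthropic Python SDKs; the model names are placeholders, so verify both against current docs:

```python
# JSON mode (OpenAI-style): syntax guaranteed, structure up to the model.
json_mode_kwargs = {
    "model": "gpt-4o",  # placeholder model name
    "response_format": {"type": "json_object"},
    "messages": [{"role": "user", "content": "Return the vendor name as JSON."}],
}

# Tool calling (Anthropic-style): exact schema enforced, tool use forced.
tool_call_kwargs = {
    "model": "claude-opus-4-6",
    "max_tokens": 256,
    "tools": [{
        "name": "extract_vendor",
        "description": "Extract the vendor name",
        "input_schema": {
            "type": "object",
            "properties": {"vendor": {"type": "string"}},
            "required": ["vendor"],
        },
    }],
    "tool_choice": {"type": "tool", "name": "extract_vendor"},
    "messages": [{"role": "user", "content": "Vendor: Acme Corp"}],
}
```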
Walkthrough
Extracting Structured Data from Documents
Goal: extract invoice data (vendor, amount, date, line items) from unstructured invoice text.
Tool definition:
import anthropic
client = anthropic.Anthropic()
invoice_tool = {
"name": "extract_invoice",
"description": "Extract structured invoice data from text",
"input_schema": {
"type": "object",
"properties": {
"vendor": {"type": "string", "description": "Vendor or supplier name"},
"total_amount": {"type": "number", "description": "Total invoice amount in USD"},
"invoice_date": {
"type": "string",
"description": "Invoice date in YYYY-MM-DD format"
},
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"amount": {"type": "number"}
},
"required": ["description", "amount"]
}
}
},
"required": ["vendor", "total_amount", "invoice_date"]
}
}
def extract_invoice(text: str) -> dict:
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
tools=[invoice_tool],
tool_choice={"type": "tool", "name": "extract_invoice"}, # force tool use
messages=[{
"role": "user",
"content": f"Extract invoice data:\n\n{text}"
}]
)
    # With tool_choice forced, the response contains a tool_use block;
    # locate it by type rather than assuming it sits at content[0]
    tool_block = next(b for b in response.content if b.type == "tool_use")
    return tool_block.input

Validation and retry pattern:
from pydantic import BaseModel, ValidationError
class LineItem(BaseModel):
description: str
amount: float
class Invoice(BaseModel):
vendor: str
total_amount: float
invoice_date: str # validate format separately
line_items: list[LineItem] = []
def extract_with_validation(text: str, max_retries: int = 2) -> Invoice:
for attempt in range(max_retries + 1):
raw = extract_invoice(text)
try:
return Invoice(**raw)
except ValidationError as e:
if attempt == max_retries:
raise
# Feed validation error back to the model for self-correction
text = f"{text}\n\nPrevious attempt failed validation: {e}. Please fix."
    raise RuntimeError("Extraction failed after retries")

Analysis & Evaluation
Where Your Intuition Breaks
The broken intuition: JSON mode and tool calling are equivalent structured output mechanisms. They are not. JSON mode requires valid JSON syntax but does not enforce field names, types, or required fields; the model decides what to include. Tool calling enforces a specific schema at the API level: field names, types, and required vs optional are all constrained. They solve different problems: JSON mode for flexible extraction where the model chooses structure, tool calling for exact reproducible shapes where downstream code depends on specific fields. Using JSON mode when you need tool calling is the most common structured output bug: the output parses, but the fields are wrong or missing.
Choosing the Right Mechanism
| Scenario | Recommendation |
|---|---|
| Exact schema required, all fields typed | Tool calling with tool_choice forced |
| Flexible structure, model decides fields | JSON mode |
| Simple enum classification | Tool calling or constrained enum in prompt |
| Streaming required | JSON mode streams text directly; tool-call input arrives as partial JSON deltas you must accumulate |
| Complex nested schema (3+ levels) | Flatten schema; use multiple extraction calls |
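One way to handle the 3+-level case: run one flat extraction call per sub-object, then assemble the nesting in code, where it cannot be hallucinated. A sketch with hypothetical helpers, stubbed here with sample data:

```python
# extract_header / extract_line_items would each be a separate forced
# tool call with its own flat schema (hypothetical helper names).

def merge_extractions(header: dict, line_items: list[dict]) -> dict:
    """Assemble the nested invoice in code rather than in the model."""
    return {**header, "line_items": line_items}

result = merge_extractions(
    {"vendor": "Acme", "total_amount": 120.0},
    [{"description": "widget", "amount": 120.0}],
)
```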
Reliability Patterns
Force the tool: set tool_choice: { type: "tool", name: "my_tool" } rather than letting the model decide whether to call a tool. Without forcing, the model may respond in prose when it's uncertain.
Mark required fields clearly: fields marked required in the schema are almost always populated. Optional fields are often omitted or hallucinated. If a field matters, make it required.
Retry with error context: if Pydantic validation fails, the most effective fix is to send the validation error back to the model in a follow-up turn. Models are generally good at self-correcting specific field errors when shown the error message.
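The validation layer can be exercised in isolation, with no API call. The models below mirror the walkthrough's Invoice/LineItem definitions, and the ValidationError text is exactly what is worth feeding back to the model on retry:

```python
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    vendor: str
    total_amount: float
    invoice_date: str
    line_items: list[LineItem] = []

# A well-formed payload validates; line_items falls back to its default.
inv = Invoice(**{"vendor": "Acme", "total_amount": 99.5,
                 "invoice_date": "2024-01-31"})

# A malformed payload raises ValidationError naming the offending field.
error_text = ""
try:
    Invoice(**{"vendor": "Acme", "total_amount": "not a number",
               "invoice_date": "2024-01-31"})
except ValidationError as e:
    error_text = str(e)  # names total_amount and the type failure
```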
Avoid unbounded arrays: "type": "array" with no maxItems can produce hundreds of items for complex documents. Set "maxItems" or paginate extraction.
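A bounded version of the walkthrough's line_items definition; the cap of 50 is an arbitrary illustration:

```python
# Same line_items schema as the walkthrough, with a hard item cap.
line_items_schema = {
    "type": "array",
    "maxItems": 50,  # arbitrary cap; paginate extraction past this
    "items": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"},
        },
        "required": ["description", "amount"],
    },
}
```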
Structured output reliability in production:
- Monitor parse success rate in addition to correctness. A schema that works 95% of the time will generate exceptions at scale. Target 99%+.
- Log raw model output before parsing. When parsing fails in production, you need the raw output to debug. Don't discard it.
- Start simpler than you think you need. A 5-field flat schema is almost always more reliable than a 15-field nested one. Add fields incrementally as you confirm each one is reliably extracted.
- Temperature 0 for extraction tasks. Structured extraction is a deterministic task — there's a right answer in the source text. Low temperature reduces variability and improves consistency.