
Structured Outputs

LLMs generate free-form text. Production systems need structured data — JSON objects, typed fields, enum values. The gap between these two is where many LLM integrations break. Structured outputs (JSON mode, tool calling schemas, constrained generation) enforce format at the API level, so your parsing code never sees malformed output. This lesson covers the three enforcement mechanisms, how to design schemas that minimize hallucination, and the reliability patterns that make structured extraction production-grade.

Theory

Structured output enforcement

Schema enforced at API level. Field names, types, and required fields are validated before the response is returned.

  1. Prompt + tool schema → schema defines field names, types, and required fields
  2. LLM (constrained decoding) → only valid schema continuations are sampled
  3. API validates schema → field types and required fields enforced by the API
  4. Pydantic / app validation → business logic validation (range checks, formats)
  5. Structured result ✓ (a business logic failure loops back: retry with the error as context)

tool calling = schema enforced at the API · JSON mode = valid syntax only, field validation is your responsibility

A language model generates tokens one at a time, sampling from a probability distribution over the vocabulary. Structured outputs work by restricting that distribution at each step to only tokens that keep the output valid according to a schema — after {"name": , the model can only generate a string-opening quote, not a number or brace. The diagram above shows the three enforcement mechanisms: JSON mode (valid JSON required), tool calling (full schema enforcement), and constrained decoding (grammar-level token masking). Each gives stronger guarantees at the cost of more setup.

Constrained Decoding

Standard autoregressive sampling picks the next token from the full vocabulary distribution:

p(t) = \text{softmax}(W h_t / T)

where h_t is the hidden state and T is the temperature. Constrained decoding masks the distribution to allow only tokens that are valid continuations of the current structured format:

p_{\text{constrained}}(t) \propto p(t) \cdot \mathbf{1}[t \in \mathcal{V}_{\text{valid}}(s_t)]

where \mathcal{V}_{\text{valid}}(s_t) is the set of valid next tokens given the current output state s_t (e.g., after {"name": , valid tokens are string-opening quotes, not numbers or braces).

The token mask approach is forced by the model's architecture: the probability distribution was trained on unconstrained natural language, and we cannot retrain it for every possible schema at inference time. Applying a binary mask over the learned distribution preserves the model's relative preferences among valid tokens while hard-blocking invalid ones. The alternative, prompting the model to follow a schema without enforcement, produces valid outputs most of the time but fails unpredictably on complex schemas or adversarial inputs. Token-level enforcement is the only mechanism that provides hard guarantees.

This is how Anthropic's tool_use API, OpenAI's response_format, and libraries like Outlines enforce structure — they compile a schema to a token mask and apply it at each decoding step.
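The masking step can be sketched in miniature. This is a toy illustration only: real implementations operate on tokenizer vocabularies and grammars compiled from the schema, not three-token dictionaries.

```python
import math

def constrained_distribution(logits, valid_tokens):
    """Mask a next-token distribution to schema-valid continuations only.

    logits: dict mapping token -> raw score (toy stand-in for W h_t / T)
    valid_tokens: set of tokens allowed by the schema at this step
    """
    exp = {t: math.exp(s) for t, s in logits.items()}
    # 1[t in V_valid]: zero out invalid tokens, renormalize over the rest
    total = sum(v for t, v in exp.items() if t in valid_tokens)
    return {t: (exp[t] / total if t in valid_tokens else 0.0) for t in logits}

# After {"name": the schema only allows a string-opening quote, even if the
# unconstrained model would happily emit a number or a brace.
logits = {'"': 2.0, '3': 1.5, '{': 0.5}
p = constrained_distribution(logits, valid_tokens={'"'})
```

Because only one token is valid at this step, all probability mass collapses onto the quote; when several tokens are valid, their relative preferences from the learned distribution are preserved.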

Schema Design and Hallucination

The probability of hallucination in a structured field scales with:

  1. Enum size — more valid values → more opportunities to pick a wrong one
  2. Nested depth — each level of nesting multiplies opportunities for format errors
  3. Optional vs required fields — optional fields with no natural value tend to be filled with plausible-sounding guesses

Minimum effective schema: include only fields you actually need. Every additional field is an opportunity for hallucination. Flat schemas outperform deeply nested ones on reliability.
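A sketch of the flattening advice, with a hypothetical `schema_depth` helper for measuring nesting (the field names are made up for illustration):

```python
def schema_depth(schema: dict) -> int:
    """Nesting depth of a JSON Schema object (1 = flat, 0 = leaf)."""
    props = schema.get("properties", {})
    if not props:
        return 0
    return 1 + max(schema_depth(p) for p in props.values())

# Nested: each level multiplies opportunities for format errors
nested_schema = {
    "type": "object",
    "properties": {
        "customer": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                },
            },
        }
    },
}

# Flat: same information, one level, every field required
flat_schema = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "customer_city": {"type": "string"},
    },
    "required": ["customer_name", "customer_city"],
}
```

The flat version carries the same information at depth 1 instead of depth 3, with both fields marked required rather than buried in optional sub-objects.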

Tool Calling vs JSON Mode

Two mechanisms for structured output:

JSON mode (response_format: { type: "json_object" }): the model is constrained to output valid JSON but there is no schema enforcement on field names or types. The model decides which fields to include. Good for flexible extraction where you want the model to decide structure.

Tool calling (function definitions with typed parameters): the model must populate a specific schema. Field names, types, and required vs optional are all enforced at the API level. Better for cases where you need an exact, predictable shape.
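The practical consequence of the JSON mode side: field validation is your job. A minimal sketch of what that looks like (function and field names are illustrative):

```python
import json

def parse_json_mode(raw: str, required: set) -> dict:
    """JSON mode guarantees syntax, not fields: check names yourself."""
    data = json.loads(raw)  # will not raise on well-formed JSON mode output
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

# Valid JSON with the expected field passes:
invoice = parse_json_mode('{"vendor": "Acme"}', required={"vendor"})
# '{"supplier": "Acme"}' is equally valid JSON but would raise ValueError,
# because nothing stopped the model from choosing its own field name.
```

With tool calling, the API performs this check before you ever see the response; with JSON mode, forgetting it is exactly the "parses fine, fields wrong" failure mode.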

Walkthrough

Extracting Structured Data from Documents

Goal: extract invoice data (vendor, amount, date, line items) from unstructured invoice text.

Tool definition:

python
import anthropic
 
client = anthropic.Anthropic()
 
invoice_tool = {
    "name": "extract_invoice",
    "description": "Extract structured invoice data from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string", "description": "Vendor or supplier name"},
            "total_amount": {"type": "number", "description": "Total invoice amount in USD"},
            "invoice_date": {
                "type": "string",
                "description": "Invoice date in YYYY-MM-DD format"
            },
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    },
                    "required": ["description", "amount"]
                }
            }
        },
        "required": ["vendor", "total_amount", "invoice_date"]
    }
}
 
def extract_invoice(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        tools=[invoice_tool],
        tool_choice={"type": "tool", "name": "extract_invoice"},  # force tool use
        messages=[{
            "role": "user",
            "content": f"Extract invoice data:\n\n{text}"
        }]
    )
    # A forced tool_choice guarantees a tool_use block in the response;
    # find it by type rather than assuming its position in the content list.
    for block in response.content:
        if block.type == "tool_use":
            return block.input
    raise ValueError("no tool_use block in response")

Validation and retry pattern:

python
from pydantic import BaseModel, ValidationError
 
class LineItem(BaseModel):
    description: str
    amount: float
 
class Invoice(BaseModel):
    vendor: str
    total_amount: float
    invoice_date: str  # validate format separately
    line_items: list[LineItem] = []
 
def extract_with_validation(text: str, max_retries: int = 2) -> Invoice:
    for attempt in range(max_retries + 1):
        raw = extract_invoice(text)
        try:
            return Invoice(**raw)
        except ValidationError as e:
            if attempt == max_retries:
                raise
            # Feed validation error back to the model for self-correction
            text = f"{text}\n\nPrevious attempt failed validation: {e}. Please fix."
    raise RuntimeError("Extraction failed after retries")
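The invoice_date field above is typed as a plain string with a note to validate the format separately. One minimal way to do that with the standard library (the function name is illustrative; it could equally be wired into the Invoice model as a Pydantic field validator):

```python
from datetime import datetime

def check_invoice_date(value: str) -> str:
    """Raise ValueError unless value matches the YYYY-MM-DD format
    the schema description asks the model for."""
    datetime.strptime(value, "%Y-%m-%d")
    return value

checked = check_invoice_date("2024-03-15")
```

A failed format check fits the same retry-with-error-context loop as the Pydantic errors: feed "invoice_date must be YYYY-MM-DD" back to the model and re-extract.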

Analysis & Evaluation

Where Your Intuition Breaks

A common assumption is that JSON mode and tool calling are interchangeable structured output mechanisms. They are not. JSON mode requires valid JSON syntax but does not enforce field names, types, or required fields; the model decides what to include. Tool calling enforces a specific schema at the API level: field names, types, and required vs optional are all constrained. They solve different problems: JSON mode for flexible extraction where the model chooses the structure, tool calling for exact, reproducible shapes where downstream code depends on specific fields. Using JSON mode when you need tool calling is the most common structured output bug: the output parses, but the fields are wrong or missing.

Choosing the Right Mechanism

Scenario → recommendation:

  • Exact schema required, all fields typed → tool calling with tool_choice forced
  • Flexible structure, model decides fields → JSON mode
  • Simple enum classification → tool calling, or a constrained enum in the prompt
  • Streaming required → JSON mode streams naturally; tool-call arguments arrive as partial JSON that is harder to consume incrementally
  • Complex nested schema (3+ levels) → flatten the schema, or split into multiple extraction calls

Reliability Patterns

Force the tool: set tool_choice: { type: "tool", name: "my_tool" } rather than letting the model decide whether to call a tool. Without forcing, the model may respond in prose when it's uncertain.

Mark required fields clearly: fields marked required in the schema are almost always populated. Optional fields are often omitted or hallucinated. If a field matters, make it required.

Retry with error context: if Pydantic validation fails, the most effective fix is to send the validation error back to the model in a follow-up turn. Models are generally good at self-correcting specific field errors when shown the error message.

Avoid unbounded arrays: "type": "array" with no maxItems can produce hundreds of items for complex documents. Set "maxItems" or paginate extraction.

🚀 Production

Structured output reliability in production:

  • Monitor parse success rate in addition to correctness. A schema that works 95% of the time will generate exceptions at scale. Target 99%+.
  • Log raw model output before parsing. When parsing fails in production, you need the raw output to debug. Don't discard it.
  • Start simpler than you think you need. A 5-field flat schema is almost always more reliable than a 15-field nested one. Add fields incrementally as you confirm each one is reliably extracted.
  • Temperature 0 for extraction tasks. Structured extraction is a deterministic task — there's a right answer in the source text. Low temperature reduces variability and improves consistency.
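A minimal sketch combining the first two bullets: track parse success rate while retaining raw output on failure (the function and counter names are illustrative):

```python
import json

def parse_with_monitoring(raw_output: str, stats: dict):
    """Attempt to parse a raw model response, updating success counters
    and keeping the raw text whenever parsing fails."""
    stats["attempts"] = stats.get("attempts", 0) + 1
    try:
        parsed = json.loads(raw_output)
        stats["parsed"] = stats.get("parsed", 0) + 1
        return parsed
    except json.JSONDecodeError:
        # Keep the raw output: you cannot debug what you discarded
        stats.setdefault("raw_failures", []).append(raw_output)
        return None

stats = {}
parse_with_monitoring('{"vendor": "Acme"}', stats)
parse_with_monitoring('{"vendor": ', stats)  # truncated output
success_rate = stats["parsed"] / stats["attempts"]
```

In production the stats dict would be a metrics client and the failure list a log sink, but the shape is the same: count every attempt, and never drop the raw text that failed.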
