Constitutional AI
Human preference labeling at scale is expensive, slow, and inconsistent. RLHF needs thousands of preference pairs, each requiring a human to read two model responses and decide which is better. Constitutional AI (CAI), introduced in Anthropic's 2022 paper, replaces this human labeling with AI-generated feedback: the same model that generates responses is asked to critique them against explicit principles and then revise them. This is cheaper and faster than human labeling, and the resulting labels are more consistent. The key insight is that a model's "evaluation mode" (judging whether a response is harmful) is often more reliable than its "generation mode": a model that can generate harmful content can also recognize it as harmful when asked directly. This lesson derives both SL-CAI and RL-CAI, and implements the critique-revision loop using the Anthropic SDK.
Theory
Constitutional AI: the model critiques its own output against each constitutional principle, then revises. Multiple critique-revision passes progressively improve safety before any human feedback.
Human preference labeling at scale is expensive and inconsistent. Constitutional AI asks: can the model evaluate its own outputs against a set of principles? It turns out that a model capable of generating harmful content is also capable of recognizing that content as harmful when asked directly. CAI builds the safety pipeline from this asymmetry — using the model's own "evaluator mode" to generate the preference labels that train the safety behavior.
Constitutional AI (CAI) replaces expensive human preference labeling with AI-generated feedback guided by a set of principles (the "constitution"). It has two phases: SL-CAI (supervised) and RL-CAI (reinforcement learning with AI feedback / RLAIF).
Phase 1: SL-CAI (Critique and Revision)
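Given a potentially harmful response $y_0$ to prompt $x$:
- Critique: sample $c_k \sim \pi(\cdot \mid x, y_{k-1}, p_k)$, where $p_k$ is a principle drawn from the constitution
- Revision: sample $y_k \sim \pi(\cdot \mid x, y_{k-1}, c_k)$
Repeat this critique-revision loop $K$ times. Fine-tune on the resulting $(x, y_K)$ pairs using standard supervised learning:
$$\mathcal{L}_{\text{SL}}(\theta) = -\mathbb{E}_{(x,\, y_K)}\left[\log \pi_\theta(y_K \mid x)\right]$$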
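The data-construction step of SL-CAI can be sketched without committing to any particular model API. The function below is a minimal illustration, not the paper's implementation: `build_sl_cai_dataset`, the `generate` callable, and the prompt wording are all hypothetical names introduced here, and drawing one principle at random per round is an assumption of this sketch.

```python
import random
from typing import Callable, List, Tuple


def build_sl_cai_dataset(
    prompts: List[str],
    generate: Callable[[str], str],  # hypothetical: any text-in, text-out model call
    principles: List[str],
    n_rounds: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, final_revision) pairs for supervised fine-tuning."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        response = generate(prompt)  # initial, possibly harmful, response
        for _ in range(n_rounds):
            principle = rng.choice(principles)  # assumption: one principle per round
            critique = generate(
                f"Prompt: {prompt}\nResponse: {response}\n"
                f"Critique the response using this principle: {principle}"
            )
            response = generate(
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                "Rewrite the response to address the critique."
            )
        # Only the final revision is kept as the fine-tuning target
        dataset.append((prompt, response))
    return dataset
```

Each prompt costs one initial generation plus two model calls per round, so the number of rounds directly multiplies the cost of building the dataset.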
Phase 2: RL-CAI (AI Feedback)
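Train a preference model using AI feedback instead of human labels. For a pair of responses $(y_A, y_B)$ to prompt $x$, ask the AI which better adheres to the constitution:
$$P(y_A \succ y_B \mid x, \mathcal{C})$$
where $\mathcal{C}$ is the constitution. Then train a reward model $r_\phi$ on these AI-generated preferences (same Bradley-Terry objective as standard Reinforcement Learning from Human Feedback (RLHF)) and run PPO:
$$\max_\theta \; \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{SL}}\right)$$
where $\pi_{\text{SL}}$ is the SL-CAI model from Phase 1 and $\beta$ sets the strength of the KL penalty.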
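To make the Bradley-Terry objective concrete, here is a small pure-Python sketch of the loss a reward model would minimize on AI-labeled pairs. The function names are illustrative, and the scores are plain floats standing in for reward-model outputs; a real training loop would compute the same quantity on tensors.

```python
import math
from typing import List, Tuple


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen is preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def preference_loss(score_pairs: List[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (chosen_score, rejected_score) pairs."""
    total = sum(math.log(bradley_terry_prob(c, r)) for c, r in score_pairs)
    return -total / len(score_pairs)
```

The loss depends only on the score difference within each pair, so it falls as the reward model scores the AI-preferred response further above the rejected one. Equal scores give a probability of 0.5 and a loss of log 2.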
Constitutional Objective
CAI can be viewed as incorporating a constitutional prior over response quality. The constitution specifies a set of principles $p_1, \dots, p_n$ with weights $w_1, \dots, w_n$, and scores a response by a weighted sum of binary indicators:
$$R_{\mathcal{C}}(x, y) = \sum_{i=1}^{n} w_i \, \mathbb{1}\left[y \text{ satisfies } p_i \text{ given } x\right]$$
The weighted sum over principles is an approximation of a much harder problem: defining a complete ordering over all possible behaviors. In practice, principles conflict: "be helpful" and "avoid harm" produce opposite signals for dual-use requests. The weights encode a priority ordering, but they don't resolve conflicts automatically. This is why the objective is approximated with an AI feedback model rather than evaluated rule by rule: a language model can weigh conflicting principles in context in ways that a sum of binary indicators cannot.
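The limits of a weighted sum of binary indicators are easy to see in a toy example. The sketch below is purely illustrative: the two keyword checks and their weights are made-up stand-ins for real principle evaluations, not anything from the CAI paper.

```python
from typing import Callable, List, Tuple

# (name, weight, check) where check returns True if the response satisfies the principle
Principle = Tuple[str, float, Callable[[str], bool]]


def constitutional_score(response: str, principles: List[Principle]) -> float:
    """Weighted sum of binary indicators, one per principle."""
    return sum(weight for _, weight, check in principles if check(response))


# Hypothetical keyword checks, far cruder than a real evaluator
PRINCIPLES: List[Principle] = [
    ("helpful", 1.0, lambda r: "step" in r.lower()),
    ("harmless", 2.0, lambda r: "tension wrench" not in r.lower()),
]

detailed = "Step 1: insert the tension wrench into the keyway."
refusal = "I can't help with that."

detailed_score = constitutional_score(detailed, PRINCIPLES)
refusal_score = constitutional_score(refusal, PRINCIPLES)
```

With these weights the refusal outscores the detailed answer regardless of who is asking or why. The ranking is fixed by the weights alone, which is the context-blindness that motivates using a language model as the evaluator.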
A model that can generate harmful content can also recognize that content as harmful when asked directly. The key insight: generation and evaluation are different tasks — a model's "evaluator mode" is often more reliable than its "generator mode" at detecting policy violations.
Walkthrough
The Constitution
Anthropic's CAI paper (2022) uses 16 principles, including:
- "Choose the response that is least likely to contain harmful or unethical content"
- "Choose the response that a thoughtful, senior Anthropic employee would consider optimal"
- "Which response is less likely to contain misinformation, fake news, or conspiracy theories?"
Critique-Revision Loop Example
Original prompt: "How do I pick a lock?"
Initial harmful response (from early fine-tuning):
To pick a lock, you'll need a tension wrench and pick. Insert the tension
wrench into the bottom of the keyway, apply light rotational pressure...
[continues with detailed instructions]
Critique generation:
Critique: This response provides detailed instructions for picking locks,
which could facilitate illegal entry into homes or businesses. While there
are legitimate uses (locksport, locked-out scenarios), providing step-by-step
instructions without context creates more harm than good.
Revision:
If you're locked out of your own property, the best approach is to call a
licensed locksmith — they're trained, insured, and can help quickly. If
you're interested in lock mechanisms as a hobby (locksport), there are
legitimate competitions and communities that teach this skill responsibly
through proper channels.
Implementing the Critique Loop
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CRITIQUE_PROMPT = """Here is a user prompt and a potentially harmful AI response:
<prompt>{prompt}</prompt>
<response>{response}</response>
Identify specific ways this response could be harmful, unethical, or against
Anthropic's guidelines. Be specific about what is problematic."""

REVISION_PROMPT = """Here is a user prompt, an AI response, and a critique of that response:
<prompt>{prompt}</prompt>
<response>{response}</response>
<critique>{critique}</critique>
Revise the original response to address these concerns while still being
helpful where possible."""


def constitutional_revise(prompt: str, initial_response: str, n_rounds: int = 1) -> str:
    """Run n_rounds of critique and revision and return the final response."""
    response = initial_response
    for _ in range(n_rounds):
        # Critique the current response in the context of the original prompt
        critique = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": CRITIQUE_PROMPT.format(prompt=prompt, response=response),
            }],
        ).content[0].text

        # Revise: the model must see the response itself, not only the critique
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": REVISION_PROMPT.format(
                    prompt=prompt, response=response, critique=critique
                ),
            }],
        ).content[0].text
    return response

AI Feedback for Preference Labeling
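FEEDBACK_PROMPT = """Which of these two responses to the prompt better follows Anthropic's principles?
Prompt: {prompt}
Principle: {principle}
Response A: {response_a}
Response B: {response_b}
Reply with exactly "A" or "B"."""


def get_ai_preference(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Ask the model which response better follows the principle; returns "A" or "B"."""
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": FEEDBACK_PROMPT.format(
                prompt=prompt,
                principle=principle,
                response_a=response_a,
                response_b=response_b,
            ),
        }],
    )
    label = result.content[0].text.strip()
    # Reject anything other than a clean "A" or "B" so malformed labels stay out of the dataset
    if label not in ("A", "B"):
        raise ValueError(f"Unexpected preference label: {label!r}")
    return label

# Scale this to generate 100k+ preference pairs without human labelers

Analysis & Evaluation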
Where Your Intuition Breaks
It is tempting to assume that a well-specified constitution fully determines what the model should do in any situation. It does not: constitutions are necessarily incomplete and ambiguous. "Be harmless" does not specify whether to discuss historical atrocities, provide dual-use chemistry information, or generate dark fiction. Constitutional compliance is also evaluated by the same model that generates the responses, so if the model has blind spots or biases in its "evaluator mode," those propagate into the training signal. CAI reduces the need for human labeling, but it doesn't eliminate the need for human judgment about which principles to include, how to weight conflicts, and where the model's self-evaluation fails.
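One simple way to look for evaluator blind spots is to compare AI labels against a small human-labeled audit sample. The helpers below are a hypothetical sketch of that check; the function names and the idea of an audit set are assumptions added here, not part of the CAI paper's procedure.

```python
from typing import List


def agreement_rate(ai_labels: List[str], human_labels: List[str]) -> float:
    """Fraction of audit items where the AI label matches the human label."""
    if not ai_labels or len(ai_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(ai_labels)


def disagreement_indices(ai_labels: List[str], human_labels: List[str]) -> List[int]:
    """Indices worth reading by hand: items where the two labelers differ."""
    return [i for i, (a, h) in enumerate(zip(ai_labels, human_labels)) if a != h]
```

A high overall rate can still hide systematic failures in one category, so the disagreeing items are usually more informative than the single number.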
CAI vs RLHF Comparison
| Aspect | RLHF | Constitutional AI |
|---|---|---|
| Feedback source | Human labelers | AI (guided by constitution) |
| Cost | High ($5–50/comparison) | Low (API cost ~$0.001/comparison) |
| Scale | Limited by human bandwidth | Scales arbitrarily |
| Consistency | Variable inter-rater | Consistent (same model) |
| Cultural bias | Human biases | Model biases |
| Transparency | Implicit | Explicit principles |
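The cost row is easier to appreciate at dataset scale. The snippet below simply multiplies the per-comparison figures from the table by 100,000 comparisons. The per-comparison costs are the table's rough estimates, and the dataset size is an illustrative choice.

```python
def labeling_cost(n_comparisons: int, cost_per_comparison: float) -> float:
    """Total labeling cost in dollars."""
    return n_comparisons * cost_per_comparison


N_COMPARISONS = 100_000  # illustrative dataset size

human_low = labeling_cost(N_COMPARISONS, 5.0)     # low end of the human range
human_high = labeling_cost(N_COMPARISONS, 50.0)   # high end of the human range
ai_cost = labeling_cost(N_COMPARISONS, 0.001)     # approximate API cost

print(f"Human labels: ${human_low:,.0f} to ${human_high:,.0f}")
print(f"AI labels:    ${ai_cost:,.0f}")
```

At these rates, 100,000 comparisons cost between $500,000 and $5,000,000 with human labelers and about $100 with AI feedback, a gap of more than three orders of magnitude.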
Empirical Results
Anthropic's 2022 paper showed:
- SL-CAI + RL-CAI models were preferred by humans 49% of the time vs RLHF models (near parity)
- CAI models were rated as significantly less harmful while maintaining similar helpfulness
- Removing harmful content via critique-revision is ~90% effective for the top harmful categories
CAI reduces harmful outputs but doesn't eliminate them. The AI critique model can miss subtle harms, and the revision model can reintroduce harmful content in modified form. CAI is a layer in a defense-in-depth strategy, not a complete solution.
The Helpfulness-Harmlessness Frontier
There is an inherent tension: a maximally helpful assistant will sometimes help with harmful requests; a maximally safe assistant will over-refuse. CAI attempts to move the Pareto frontier rather than just trade off along it:
$$\max_\theta \; H(\pi_\theta) + \lambda\, S(\pi_\theta)$$
where $H$ is helpfulness, $S$ is safety, and $\lambda$ reflects the desired tradeoff. CAI shifts both $H$ and $S$ upward by encoding the tradeoff explicitly in the constitution rather than leaving it implicit in reward model training data.
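The difference between trading off along a frontier and moving the frontier can be shown with a few made-up points. In this sketch each model is a hypothetical (helpfulness, safety) pair. The scalarized objective, helpfulness plus a tradeoff weight `lam` times safety, is an assumed form chosen for illustration, and none of the numbers are measured results.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (helpfulness, safety), both hypothetical scores


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point matches or beats on both axes."""
    def dominates(q: Point, p: Point) -> bool:
        return q[0] >= p[0] and q[1] >= p[1] and q != p
    return [p for p in points if not any(dominates(q, p) for q in points)]


def best_scalarized(points: List[Point], lam: float) -> Point:
    """Pick the point maximizing helpfulness + lam * safety."""
    return max(points, key=lambda p: p[0] + lam * p[1])


baseline = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
front = pareto_front(baseline)               # (0.4, 0.4) is dominated by (0.5, 0.5)
helpful_pick = best_scalarized(front, 0.1)   # small lam favors helpfulness
safe_pick = best_scalarized(front, 10.0)     # large lam favors safety
shifted = pareto_front(baseline + [(0.9, 0.9)])  # a point that beats all the others
```

Changing `lam` only selects a different point on the same frontier. Adding a point that is better on both axes replaces the frontier entirely, which is the kind of improvement the text attributes to CAI.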
The CAI paper's constitution includes "which response would a thoughtful, senior employee consider best?" This heuristic captures the desired behavior more richly than binary harmful/harmless labels — it includes helpfulness, accuracy, and epistemic humility alongside safety.