25 min
Requires: RLHF & PPO

Constitutional AI

Human preference labeling at scale is expensive, slow, and inconsistent. For RLHF to work, you need thousands of preference pairs, each requiring a human to read two model responses and decide which is better. Constitutional AI (CAI), introduced in Anthropic's 2022 paper, replaces this with AI-generated feedback: the same model that generates responses is asked to critique them according to explicit principles, then revise them. Cheaper, faster, and more consistent. The key insight is that a model's "evaluation mode" (judging whether a response is harmful) is more reliable than its "generation mode": a model that can generate harmful content can also recognize it as harmful when asked directly. This lesson derives both SL-CAI and RL-CAI, and implements the critique-revision loop using the Anthropic SDK.

Theory

[Interactive diagram: the CAI critique-revision loop. A user prompt ("How do I pick a lock?") yields an initial response (harmful or over-compliant; no safety filtering yet). A constitutional principle ("Does this promote illegal activity?") guides a self-critique ("This response could facilitate illegal en…"), which produces a revised response ("I can explain the mechanics of locks for …"). The critique-revision step repeats N times.]

Constitutional AI: the model critiques its own output against each constitutional principle, then revises. Multiple critique-revision passes progressively improve safety before any human feedback.

Constitutional AI asks: can the model evaluate its own outputs against a set of principles? It turns out that a model capable of generating harmful content is also capable of recognizing that content as harmful when asked directly. CAI builds the safety pipeline from this asymmetry, using the model's own "evaluator mode" to generate the preference labels that train the safety behavior.

Constitutional AI (CAI) replaces expensive human preference labeling with AI-generated feedback guided by a set of principles (the "constitution"). It has two phases: SL-CAI (supervised) and RL-CAI (reinforcement learning with AI feedback / RLAIF).

Phase 1: SL-CAI (Critique and Revision)

Given a potentially harmful response $y_{\text{harm}}$ to prompt $x$:

  1. Critique: sample $c \sim \pi_{\text{base}}(\cdot \mid x, y_{\text{harm}}, \text{critique\_prompt})$
  2. Revision: sample $y' \sim \pi_{\text{base}}(\cdot \mid x, y_{\text{harm}}, c, \text{revision\_prompt})$

Repeat this critique-revision loop $n$ times (typically $n = 1$; we use $n$ here to avoid colliding with $K$, the number of principles, below). Fine-tune on the resulting $(x, y')$ pairs using standard supervised learning:

$$\mathcal{L}_{\text{SL-CAI}} = -\mathbb{E}_{(x, y')} \left[ \log \pi_\theta(y' \mid x) \right]$$
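
A minimal sketch of this fine-tuning step, assuming a HuggingFace causal LM (gpt2 here is an illustrative stand-in, not the model from the paper). The loss is plain next-token cross-entropy, masked so that only the revision tokens $y'$ are scored:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sl_cai_step(prompt: str, revised_response: str) -> float:
    """One gradient step on -log pi_theta(y' | x) for a single (x, y') pair."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + revised_response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt: loss covers y' only
    loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL over y' tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()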

Phase 2: RL-CAI (AI Feedback)

Train a preference model using AI feedback instead of human labels. For a pair of responses $(y_1, y_2)$, ask the AI which better adheres to the constitution:

$$P_{\text{AI}}(y_1 \succ y_2 \mid x, \mathcal{C}) = \pi_{\text{feedback}}\left(\text{"first"} \mid x, y_1, y_2, \mathcal{C}\right)$$

where $\mathcal{C}$ is the constitution. Then train a reward model on these AI-generated preferences (the same Bradley-Terry objective as in standard RLHF) and run PPO:

$$\mathcal{J}_{\text{RL-CAI}}(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta}\left[ r_{\text{AI}}(x, y) - \beta \cdot \text{KL}[\pi_\theta \| \pi_{\text{SFT}}] \right]$$
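
The reward-model training step is identical to RLHF's Bradley-Terry fit; only the labels come from the AI judge. A minimal sketch, assuming r_chosen and r_rejected are scalar scores from a reward model (names illustrative):

python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(x, y_w) - r(x, y_l)), averaged over a batch of AI-labeled pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# e.g. loss = bradley_terry_loss(reward_model(x, y_preferred), reward_model(x, y_rejected))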

Constitutional Objective

CAI can be viewed as incorporating a constitutional prior over response quality. The constitution specifies:

$$r_{\text{CAI}}(x, y) = \sum_{k=1}^{K} w_k \cdot \mathbb{1}[\text{principle}_k \text{ satisfied by } y]$$

The weighted sum over principles is an approximation of a much harder problem: defining a complete ordering over all possible behaviors. In practice, principles conflict — "be helpful" and "avoid harm" produce opposite signals for dual-use requests. The weights wkw_k encode a priority ordering, but they don't resolve conflicts automatically. This is why the AI feedback model (rather than rule-by-rule evaluation) is used in practice: a language model can weigh conflicting principles in context in ways that a sum of binary indicators cannot.

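For intuition, the indicator-sum reward is trivial to write down, and the code makes its limitation obvious: the binary checks and fixed weights (hypothetical placeholders here) have no way to weigh conflicting principles in context.

python
def r_cai(principle_checks: list[bool], weights: list[float]) -> float:
    # weighted sum of binary "principle k satisfied by y" indicators
    return sum(w * float(ok) for w, ok in zip(weights, principle_checks))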

💡Why AI feedback works at all

A model that can generate harmful content can also recognize that content as harmful when asked directly. The key insight: generation and evaluation are different tasks — a model's "evaluator mode" is often more reliable than its "generator mode" at detecting policy violations.

Walkthrough

The Constitution

Anthropic's CAI paper (2022) uses 16 principles, including:

  • "Choose the response that is least likely to contain harmful or unethical content"
  • "Choose the response that a thoughtful, senior Anthropic employee would consider optimal"
  • "Which response is less likely to contain misinformation, fake news, or conspiracy theories?"

Critique-Revision Loop Example

Original prompt: "How do I pick a lock?"

Initial harmful response (sampled from a helpful-only model, before any safety training):

To pick a lock, you'll need a tension wrench and pick. Insert the tension
wrench into the bottom of the keyway, apply light rotational pressure...
[continues with detailed instructions]

Critique generation:

Critique: This response provides detailed instructions for picking locks,
which could facilitate illegal entry into homes or businesses. While there
are legitimate uses (locksport, locked-out scenarios), providing step-by-step
instructions without context creates more harm than good.

Revision:

If you're locked out of your own property, the best approach is to call a
licensed locksmith — they're trained, insured, and can help quickly. If
you're interested in lock mechanisms as a hobby (locksport), there are
legitimate competitions and communities that teach this skill responsibly
through proper channels.

Implementing the Critique Loop

python
import anthropic
 
client = anthropic.Anthropic()
 
CRITIQUE_PROMPT = """Here is a potentially harmful AI response:
<response>{response}</response>
 
Identify specific ways this response could be harmful, unethical, or against
Anthropic's guidelines. Be specific about what is problematic."""
 
REVISION_PROMPT = """Here is a critique of a response:
<critique>{critique}</critique>
 
Revise the original response to address these concerns while still being
helpful where possible."""
 
def constitutional_revise(prompt: str, initial_response: str, n_rounds: int = 1) -> str:
    response = initial_response
 
    for _ in range(n_rounds):
        # Critique
        critique = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": CRITIQUE_PROMPT.format(response=response)}]
        ).content[0].text
 
        # Revise
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": REVISION_PROMPT.format(critique=critique)
            }]
        ).content[0].text
 
    return response
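
Usage, with the lock-picking example from the walkthrough (the initial response stands in for a sample from a helpful-only model):

python
revised = constitutional_revise(
    prompt="How do I pick a lock?",
    initial_response="To pick a lock, you'll need a tension wrench and pick...",
)
print(revised)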

AI Feedback for Preference Labeling

python
FEEDBACK_PROMPT = """Which of these two responses better follows Anthropic's principles?
Principle: {principle}
 
Response A: {response_a}
Response B: {response_b}
 
Reply with exactly "A" or "B"."""
 
def get_ai_preference(prompt, response_a, response_b, principle):
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": FEEDBACK_PROMPT.format(
                principle=principle,
                response_a=response_a,
                response_b=response_b,
            )
        }]
    )
    return result.content[0].text.strip()
 
# Scale this to generate 100k+ preference pairs without human labelers
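
One practical wrinkle, assumed here rather than prescribed by the paper: LLM judges are sensitive to the order responses are presented in, so a common mitigation is to query both orderings and keep only self-consistent votes:

python
def debiased_preference(prompt, response_a, response_b, principle):
    # query both orderings to cancel position bias
    first = get_ai_preference(prompt, response_a, response_b, principle)
    swapped = get_ai_preference(prompt, response_b, response_a, principle)
    if first == "A" and swapped == "B":
        return "A"
    if first == "B" and swapped == "A":
        return "B"
    return None  # judge disagreed with itself; discard or re-sample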

Analysis & Evaluation

Where Your Intuition Breaks

The intuition to discard: a well-specified constitution fully determines what the model should do in any situation. In reality, constitutions are necessarily incomplete and ambiguous. "Be harmless" does not specify whether to discuss historical atrocities, provide dual-use chemistry information, or generate dark fiction. A second catch: constitutional compliance is evaluated by the same model that generates the responses, so if the model has blind spots or biases in its "evaluator mode," those propagate into the training signal. CAI reduces the need for human labeling, but it doesn't eliminate the need for human judgment about which principles to include, how to weight conflicts, and where the model's self-evaluation fails.

CAI vs RLHF Comparison

| Aspect | RLHF | Constitutional AI |
| --- | --- | --- |
| Feedback source | Human labelers | AI (guided by constitution) |
| Cost | High ($5–50 per comparison) | Low (~$0.001 in API cost) |
| Scale | Limited by human bandwidth | Scales arbitrarily |
| Consistency | Variable inter-rater agreement | Consistent (same model) |
| Cultural bias | Human biases | Model biases |
| Transparency | Implicit | Explicit principles |

Empirical Results

Anthropic's 2022 paper showed:

  • SL-CAI + RL-CAI models were preferred by humans 49% of the time vs RLHF models (near parity)
  • CAI models were rated as significantly less harmful while maintaining similar helpfulness
  • Removing harmful content via critique-revision is ~90% effective for the top harmful categories

⚠️Constitutional compliance isn't guaranteed

CAI reduces harmful outputs but doesn't eliminate them. The AI critique model can miss subtle harms, and the revision model can reintroduce harmful content in modified form. CAI is a layer in a defense-in-depth strategy, not a complete solution.

The Helpfulness-Harmlessness Frontier

There is an inherent tension: a maximally helpful assistant will sometimes help with harmful requests; a maximally safe assistant will over-refuse. CAI attempts to move the Pareto frontier rather than just trade off:

$$\text{Objective} = \max_\theta \left[ \alpha \cdot H(\theta) + (1-\alpha) \cdot S(\theta) \right]$$

where $H$ is helpfulness, $S$ is safety, and $\alpha$ reflects the desired tradeoff. CAI shifts both $H$ and $S$ upward by encoding the tradeoff explicitly in the constitution rather than leaving it implicit in reward model training data.

💡The 'thoughtful senior employee' heuristic

The CAI paper's constitution includes "which response would a thoughtful, senior employee consider best?" This heuristic captures the desired behavior more richly than binary harmful/harmless labels — it includes helpfulness, accuracy, and epistemic humility alongside safety.
