Neural-Path/Notes
40 min

RLHF & PPO

RLHF is what turned GPT-3 into ChatGPT. A raw language model trained on next-token prediction knows a lot, but it doesn't know what humans want — it optimizes for predicting internet text, not for being helpful. RLHF closes this gap by learning a reward model from human preference comparisons, then using PPO to fine-tune the LM to maximize that reward while staying close to the original distribution. Every major production LLM — Claude, GPT-4, Gemini — uses some form of this pipeline. The KL penalty and PPO clip objective are the critical engineering details that prevent "reward hacking" (generating nonsense that fools the reward model). This lesson derives all three stages from first principles and implements the training loop using TRL.

Theory

RLHF Pipeline
📚
Supervised Fine-Tuning

Fine-tune base LLM on high-quality (prompt, response) demonstration pairs from human labelers. Teaches instruction-following.

  • Base model + demonstrations → SFT model
  • Typically 10k–100k examples
  • Cross-entropy loss on responses

Reinforcement Learning from Human Feedback (RLHF) is a three-stage handoff: first teach the model to follow instructions (SFT checkpoint), then learn what humans prefer by training a reward model on ranked response pairs, then use RL to optimize the policy against that reward while a KL penalty keeps it from drifting too far from the SFT baseline. The diagram above shows this pipeline — each stage produces the artifact the next stage needs.

Stage 1: Supervised Fine-Tuning

Starting from a pretrained Large Language Model (LLM) $\pi_{\text{base}}$, fine-tune on demonstration pairs $(x, y^*)$:

$$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, y^*) \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log \pi_{\theta}(y^*_t \mid x, y^*_{<t}) \right]$$

This gives $\pi_{\text{SFT}}$ — instruction-following, but not optimized for human preference.
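
The SFT loss is plain token-level cross-entropy, with the one RLHF-specific detail that prompt tokens are masked out so only the response contributes. A minimal sketch in PyTorch (function and variable names here are illustrative, not from any library):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, prompt_len):
    """Cross-entropy on response tokens only; prompt tokens are masked.

    logits:     (T, V) next-token logits
    target_ids: (T,)   the full token sequence x + y*
    prompt_len: number of prompt tokens excluded from the loss
    """
    # position t predicts token t+1, so shift logits and targets by one
    shift_logits = logits[:-1]           # (T-1, V)
    labels = target_ids[1:].clone()      # (T-1,)
    labels[: prompt_len - 1] = -100      # ignore positions that predict prompt tokens
    return F.cross_entropy(shift_logits, labels, ignore_index=-100)

# toy check: sequence of 10 tokens, vocab of 50, first 4 tokens are the prompt
logits = torch.randn(10, 50)
targets = torch.randint(0, 50, (10,))
loss = sft_loss(logits, targets, prompt_len=4)
```

The `-100` label is the conventional `ignore_index` for Hugging Face and PyTorch cross-entropy, which is why masking the prompt this way composes directly with standard causal-LM training code.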

Stage 2: Reward Model

Collect preference pairs $(y_w, y_l)$ for prompt $x$ where $y_w$ is preferred. The Bradley-Terry model gives the preference probability:

$$P(y_w \succ y_l \mid x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$$

Minimize negative log-likelihood:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \right]$$
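
In code this objective is essentially one line. A minimal sketch, assuming `r_w` and `r_l` are the scalar reward-model outputs for the chosen and rejected responses in a batch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: -log sigma(r_w - r_l)."""
    # logsigmoid is numerically stabler than log(sigmoid(...))
    return -F.logsigmoid(r_w - r_l).mean()

# toy batch: chosen responses score higher than rejected ones, so loss is small
r_w = torch.tensor([2.0, 1.5, 0.8])
r_l = torch.tensor([0.5, -0.2, 0.1])
loss = reward_model_loss(r_w, r_l)
```

Note the loss depends only on the *difference* of rewards, so the reward model's absolute scale is unconstrained; implementations often normalize rewards before PPO for this reason.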

Stage 3: PPO Optimization

Maximize reward while staying close to $\pi_{\text{SFT}}$ via a Kullback-Leibler (KL) divergence penalty:

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \, \text{KL}\left[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\right] \right]$$

The KL penalty is necessitated by the reward model's limited coverage. A reward model trained on human preferences only covers the output distribution of the SFT model; it has no signal about outputs far from that distribution. Without the KL term, the policy will exploit the reward model: it discovers that certain token patterns (repetition, specific phrases, unusual formatting) score high even though they are incoherent, because the reward model never saw them during training and cannot evaluate them reliably. The KL constraint keeps the policy inside the region where the reward model is calibrated, and this is precisely what prevents reward hacking.
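
In practice, PPO-based RLHF implementations estimate this KL per token from the log-probabilities of the sampled tokens under the policy and the frozen SFT reference, then fold it into the per-token reward. A minimal sketch of that bookkeeping (illustrative, not TRL's exact internals):

```python
import torch

def kl_penalized_rewards(logp_policy, logp_ref, reward, beta=0.2):
    """Fold the KL penalty into per-token rewards.

    logp_policy, logp_ref: (T,) log-probs of the sampled tokens under each model
    reward: scalar reward-model score for the whole response
    """
    kl = logp_policy - logp_ref     # single-sample per-token KL estimate
    per_token = -beta * kl          # penalty applied at every position
    per_token[-1] += reward         # sequence-level RM score lands on the last token
    return per_token

logp_policy = torch.tensor([-1.0, -0.5, -2.0])
logp_ref = torch.tensor([-1.2, -0.6, -1.0])
r = kl_penalized_rewards(logp_policy, logp_ref, reward=1.0)
```

Placing the reward-model score only at the final token is the standard trick for turning a sequence-level reward into the per-step rewards that GAE and PPO expect.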

PPO clipped objective (applied per token position):

$$\mathcal{L}_{\text{PPO}} = \mathbb{E}_t \left[ \min\left( \rho_t \hat{A}_t,\ \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right]$$

where $\rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$ and $\epsilon \approx 0.2$.

💡Why clip the probability ratio?

Without clipping, large policy updates can cause catastrophic forgetting or reward model exploitation. The clip keeps updates in a "trust region" without the expensive Hessian computation that Trust Region Policy Optimization (TRPO) requires — PPO is the practical approximation to TRPO.
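
A quick numeric check makes the clip concrete: for positive advantage, the objective stops growing once the ratio exceeds $1+\epsilon$, so there is no incentive to push the probability further in one update. Sketch:

```python
import torch

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token PPO objective: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.minimum(ratio * advantage, clipped * advantage)

# A > 0: the gain is capped at (1+eps) * A, discarding credit beyond the clip range
capped = ppo_clip_term(torch.tensor(1.5), torch.tensor(2.0))   # 1.2 * 2.0 = 2.4, not 3.0
# A < 0: the min keeps the pessimistic branch, so the bound never overshoots
pessimistic = ppo_clip_term(torch.tensor(0.5), torch.tensor(-1.0))  # min(-0.5, -0.8) = -0.8
```

The asymmetry is deliberate: `min` makes the surrogate a pessimistic lower bound, so gradients vanish exactly when a larger step would only be rewarded by moving outside the trust region.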

Generalized Advantage Estimation (GAE)

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda \in [0,1]$ trades off bias ($\lambda \to 0$) vs. variance ($\lambda \to 1$). Typical: $\gamma = 1.0$, $\lambda = 0.95$.
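
Computationally, GAE is a single backward pass over the trajectory, accumulating discounted TD residuals. A minimal sketch with the typical $\gamma = 1.0$, $\lambda = 0.95$ (illustrative, not TRL's exact code):

```python
import torch

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: (T,)   per-step rewards
    values:  (T+1,) value estimates; values[T] bootstraps the final state
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                  # geometric accumulation
        advantages[t] = running
    return advantages

rewards = torch.tensor([0.0, 0.0, 1.0])        # sparse reward at the final token
values = torch.tensor([0.2, 0.4, 0.6, 0.0])    # V(s_T) = 0 at the terminal state
adv = gae(rewards, values)
```

The backward recursion is why GAE is cheap: one pass gives every $\hat{A}_t$ without materializing the infinite sum.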

Walkthrough

Simplified RLHF on TL;DR summarization using TRL (Transformer Reinforcement Learning).

python
from datasets import load_dataset
from transformers import AutoTokenizer
 
# Reddit posts + human-ranked summaries
dataset = load_dataset("CarperAI/openai_summarize_comparisons")
# {"prompt": reddit_post, "chosen": good_summary, "rejected": bad_summary}
 
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
 
print(f"Train: {len(dataset['train'])}")   # 92,858 pairs
print(f"Val:   {len(dataset['valid1'])}")  # 83,144 pairs

SFT Training

python
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer
 
model = GPT2LMHeadModel.from_pretrained("gpt2")
 
def preprocess(examples):
    combined = [f"{p}\n\nTL;DR: {c}"
                for p, c in zip(examples["prompt"], examples["chosen"])]
    tokens = tokenizer(combined, truncation=True, max_length=512,
                       padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: model shifts labels internally
    return tokens
 
sft_data = dataset["train"].map(preprocess, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        "sft_model", num_train_epochs=1,
        per_device_train_batch_size=8, learning_rate=1e-5,
    ),
    train_dataset=sft_data,
)
trainer.train()

Reward Model Training

python
from trl import RewardTrainer, RewardConfig
from transformers import GPT2ForSequenceClassification
 
rm = GPT2ForSequenceClassification.from_pretrained("sft_model", num_labels=1)
rm.config.pad_token_id = tokenizer.eos_token_id
 
reward_trainer = RewardTrainer(
    model=rm,
    args=RewardConfig(
        output_dir="reward_model", num_train_epochs=1,
        per_device_train_batch_size=4, learning_rate=1e-5, max_length=512,
    ),
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
)
reward_trainer.train()
# Validation accuracy ~72% (human preference agreement)

PPO Training

python
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
 
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft_model")
 
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=16,
    ppo_epochs=4,
    init_kl_coef=0.2,   # β
    target_kl=6.0,       # adaptive KL target
    cliprange=0.2,       # ε
    gamma=1.0,
    lam=0.95,
)
trainer = PPOTrainer(config, ppo_model, ref_model, tokenizer)
 
for batch in trainer.dataloader:
    queries = batch["input_ids"]
    responses = trainer.generate(queries, max_new_tokens=100, do_sample=True, top_k=50)
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    rewards = [rm(tokenizer(t, return_tensors="pt").input_ids).logits[0, 0] for t in texts]
    stats = trainer.step(queries, responses, rewards)
    trainer.log_stats(stats, batch, rewards)

Code Implementation

alignment/17_rlhf/train/train.py
python
# alignment/17_rlhf/train/train.py
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
import torch
 
def train_rlhf(
    sft_model_path: str,
    reward_model_path: str,
    output_dir: str = "rlhf_model",
    kl_coef: float = 0.2,
    lr: float = 1.41e-5,
):
    tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
    tokenizer.pad_token = tokenizer.eos_token
 
    model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)
    ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)
 
    reward_pipe = pipeline(
        "text-classification", model=reward_model_path,
        device=0 if torch.cuda.is_available() else -1,
    )
    config = PPOConfig(
        output_dir=output_dir, learning_rate=lr, batch_size=32,
        ppo_epochs=4, init_kl_coef=kl_coef, target_kl=6.0,
        cliprange=0.2, gamma=1.0, lam=0.95,
    )
    trainer = PPOTrainer(config, model, ref_model, tokenizer)
 
    num_epochs = 1  # PPOConfig has no epoch count for this loop; set passes explicitly
    for epoch in range(num_epochs):
        for batch in trainer.dataloader:
            queries = batch["input_ids"]
            responses = trainer.generate(queries, max_new_tokens=100, do_sample=True)
            texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
            rewards = [torch.tensor(r["score"]) for r in reward_pipe(texts)]
            trainer.step(queries, responses, rewards)
 
    trainer.save_pretrained(output_dir)

Analysis & Evaluation

Where Your Intuition Breaks

Intuition says the reward model accurately captures human preferences; after all, it was trained on human labels. In reality, reward models are trained on a narrow preference distribution from a specific annotator pool and fail silently out of distribution. A common artifact: reward models trained for text quality learn that "longer response" correlates with "better response", because annotators tend to rate more elaborate answers higher. The policy then generates verbose output, which scores well on the reward model but not in actual human evaluation. This reward hacking is not adversarial; the policy is doing exactly what it was trained to do. The bug is in the reward model's blind spots, not the training algorithm.

KL Divergence as a Health Monitor

| KL (nats) | Interpretation | Action |
|---|---|---|
| < 1 | Policy barely changed | Decrease β or train longer |
| 2–6 | Healthy range | Maintain |
| > 10 | Drifting far from SFT | Increase β or reduce LR |
| > 20 | Catastrophic divergence | Stop, debug |

Common Failure Modes

Reward hacking: model finds degenerate strategies — very long responses (if reward model correlates length with quality), repetitive high-reward phrases, or language switching.

Value function lag: value estimates don't track rewards → poor advantage estimates → slow convergence. Fix: pretrain the value head separately before RL.

⚠️Reward model out-of-distribution

If your reward model was trained on a different distribution than your RL prompts, reward scores will be unreliable. Keep the RL prompts in-distribution with the preference data, or train the reward model on more diverse data.

Production-Ready Code

serve_api/app.py
python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
 
app = FastAPI(title="RLHF Model API")
model = AutoModelForCausalLM.from_pretrained("rlhf_model")
tokenizer = AutoTokenizer.from_pretrained("rlhf_model")
reward_scorer = pipeline("text-classification", model="reward_model")
 
class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200
    num_return_sequences: int = 4    # best-of-N
    temperature: float = 0.7
 
@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=req.max_new_tokens,
            num_return_sequences=req.num_return_sequences,
            temperature=req.temperature, do_sample=True,
        )
    responses = [tokenizer.decode(o, skip_special_tokens=True) for o in out]
    scores = [r["score"] for r in reward_scorer(responses)]
    best_idx = scores.index(max(scores))
    return {"responses": responses, "scores": scores, "best": responses[best_idx]}
🚀Best-of-N sampling

A compute-cheap alternative to full PPO: generate N responses per prompt, score all with the reward model, return the best. Best-of-64 can match PPO quality at much lower training cost. Used in production at several frontier labs as a post-training step.
