RLHF & PPO
RLHF is what turned GPT-3 into ChatGPT. A raw language model trained on next-token prediction knows a lot, but it doesn't know what humans want — it optimizes for predicting internet text, not for being helpful. RLHF closes this gap by learning a reward model from human preference comparisons, then using PPO to fine-tune the LM to maximize that reward while staying close to the original distribution. Every major production LLM — Claude, GPT-4, Gemini — uses some form of this pipeline. The KL penalty and PPO clip objective are the critical engineering details that prevent "reward hacking" (generating nonsense that fools the reward model). This lesson derives all three stages from first principles and implements the training loop using TRL.
Theory
Fine-tune base LLM on high-quality (prompt, response) demonstration pairs from human labelers. Teaches instruction-following.
- Base model + demonstrations → SFT model
- Typically 10k–100k examples
- Cross-entropy loss on responses
Reinforcement Learning from Human Feedback (RLHF) is a three-stage handoff: first teach the model to follow instructions (SFT checkpoint), then learn what humans prefer by training a reward model on ranked response pairs, then use RL to optimize the policy against that reward while a KL penalty keeps it from drifting too far from the SFT baseline. Each stage produces the artifact the next stage needs.
Stage 1: Supervised Fine-Tuning
Starting from a pretrained Large Language Model (LLM) $\pi^{\text{base}}$, fine-tune on demonstration pairs $(x, y)$:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\Big]$$

This gives $\pi^{\text{SFT}}$ — instruction-following, but not optimized for human preference.
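The SFT loss only counts response tokens; the prompt is masked out. A minimal PyTorch sketch (`sft_loss` is an illustrative helper, not a library API):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits: (batch, seq, vocab); labels: (batch, seq);
    response_mask: (batch, seq), 1 where the token belongs to the response.
    """
    # Shift so position t predicts token t+1
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = response_mask[:, 1:].float()
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return -(token_logp * mask).sum() / mask.sum()
```

Masking matters: without it, the model also learns to reproduce prompts, diluting the instruction-following signal.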
Stage 2: Reward Model
Collect preference pairs $(y_w, y_l)$ for prompt $x$, where $y_w$ is preferred. The Bradley–Terry model gives the probability:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Minimize the negative log-likelihood:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$$
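The Bradley–Terry loss is one line given the two scalar scores (an illustrative sketch; `reward_model_loss` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Negative log-likelihood of the Bradley-Terry model:
    -log sigma(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Note that only the score *margin* matters — the reward model's absolute scale is unconstrained, which is why reward values from different RMs are not comparable.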
Stage 3: PPO Optimization
Maximize reward while staying close to $\pi^{\text{SFT}}$ via a Kullback–Leibler divergence (KL) penalty:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi^{\text{SFT}}(\cdot \mid x)\big)$$
The KL penalty is dictated by the reward model's limited coverage. A reward model trained on human preferences only covers the output distribution of the SFT model — it has no signal about outputs far from that distribution. Without the KL term, the policy exploits the reward model: it discovers that certain token patterns (repetition, specific phrases, unusual formatting) score high even though they're incoherent, because the reward model never saw them during training and can't evaluate them reliably. The KL constraint keeps the policy inside the region where the reward model is calibrated.
The KL term prevents reward hacking — without it, the model learns to output strings that fool the reward model while being incoherent.
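In practice the KL penalty is applied per token as a reward-shaping term: the sequence-level RM score lands on the final token, and every token pays a penalty proportional to the log-probability gap against the reference model. A minimal sketch (illustrative, not TRL's exact implementation):

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, reward, beta=0.2):
    """Per-token rewards for RLHF PPO: each token pays
    -beta * (log pi_theta - log pi_ref), the standard per-token KL
    estimate, and the scalar RM score is added at the final token."""
    kl = policy_logprobs - ref_logprobs   # (seq_len,)
    rewards = -beta * kl
    rewards[-1] += reward                 # RM score on last token
    return rewards
```

When the policy matches the reference exactly, the penalty vanishes and only the RM score remains.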
PPO clipped objective (applied per token position):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ and $\epsilon = 0.2$.
Without clipping, large policy updates can cause catastrophic forgetting or reward model exploitation. The clip keeps updates in a "trust region" without the expensive Hessian computation that Trust Region Policy Optimization (TRPO) requires — PPO is the practical approximation to TRPO.
Generalized Advantage Estimation (GAE)

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda$ trades off bias ($\lambda = 0$) vs variance ($\lambda = 1$). Typical: $\gamma = 1.0$, $\lambda = 0.95$.
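The GAE sum is computed with a single backward pass over the trajectory (an illustrative helper, bootstrapping with $V(s_T) = 0$ at the end of the episode):

```python
import torch

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); advantages are the
    (gamma * lam)-discounted sum of deltas, computed right to left."""
    T = len(rewards)
    advantages = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # V(s_T) = 0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages
```

At $\lambda = 0$ this reduces to the one-step TD error (low variance, biased by the value function); at $\lambda = 1$ it becomes the full Monte Carlo return minus the baseline.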
Walkthrough
Simplified RLHF on TL;DR summarization using TRL (Transformer Reinforcement Learning).
from datasets import load_dataset
from transformers import AutoTokenizer
# Reddit posts + human-ranked summaries
dataset = load_dataset("CarperAI/openai_summarize_comparisons")
# {"prompt": reddit_post, "chosen": good_summary, "rejected": bad_summary}
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
print(f"Train: {len(dataset['train'])}") # 92,858 pairs
print(f"Val: {len(dataset['valid1'])}") # 83,144 pairsSFT Training
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer
model = GPT2LMHeadModel.from_pretrained("gpt2")
def preprocess(examples):
    combined = [f"{p}\n\nTL;DR: {c}"
                for p, c in zip(examples["prompt"], examples["chosen"])]
    return tokenizer(combined, truncation=True, max_length=512,
                     padding="max_length", return_tensors="pt")
sft_data = dataset["train"].map(preprocess, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        "sft_model", num_train_epochs=1,
        per_device_train_batch_size=8, learning_rate=1e-5,
    ),
    train_dataset=sft_data,
)
trainer.train()
Reward Model Training
from trl import RewardTrainer, RewardConfig
from transformers import GPT2ForSequenceClassification
rm = GPT2ForSequenceClassification.from_pretrained("sft_model", num_labels=1)
rm.config.pad_token_id = tokenizer.eos_token_id
reward_trainer = RewardTrainer(
    model=rm,
    args=RewardConfig(
        output_dir="reward_model", num_train_epochs=1,
        per_device_train_batch_size=4, learning_rate=1e-5, max_length=512,
    ),
    tokenizer=tokenizer,
    train_dataset=dataset["train"],  # expects "chosen"/"rejected" columns
)
reward_trainer.train()
# Validation accuracy ~72% (human preference agreement)
PPO Training
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft_model")
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=16,
    ppo_epochs=4,
    init_kl_coef=0.2,  # β
    target_kl=6.0,     # adaptive KL target
    cliprange=0.2,     # ε
    gamma=1.0,
    lam=0.95,
)
trainer = PPOTrainer(config, ppo_model, ref_model, tokenizer)
for batch in trainer.dataloader:
    queries = batch["input_ids"]
    responses = trainer.generate(queries, max_new_tokens=100, do_sample=True, top_k=50)
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    rewards = [rm(tokenizer(t, return_tensors="pt").input_ids).logits[0] for t in texts]
    stats = trainer.step(queries, responses, rewards)
    trainer.log_stats(stats, batch, rewards)
Code Implementation
# alignment/17_rlhf/train/train.py
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
import torch
def train_rlhf(
    sft_model_path: str,
    reward_model_path: str,
    output_dir: str = "rlhf_model",
    kl_coef: float = 0.2,
    lr: float = 1.41e-5,
    num_epochs: int = 1,
):
    tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)
    ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_path)
    reward_pipe = pipeline(
        "text-classification", model=reward_model_path,
        device=0 if torch.cuda.is_available() else -1,
    )
    config = PPOConfig(
        output_dir=output_dir, learning_rate=lr, batch_size=32,
        ppo_epochs=4, init_kl_coef=kl_coef, target_kl=6.0,
        cliprange=0.2, gamma=1.0, lam=0.95,
    )
    trainer = PPOTrainer(config, model, ref_model, tokenizer)
    for epoch in range(num_epochs):
        for batch in trainer.dataloader:
            queries = batch["input_ids"]
            responses = trainer.generate(queries, max_new_tokens=100, do_sample=True)
            texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
            rewards = [torch.tensor(r["score"]) for r in reward_pipe(texts)]
            trainer.step(queries, responses, rewards)
    trainer.save_pretrained(output_dir)
Analysis & Evaluation
Where Your Intuition Breaks
"The reward model accurately captures human preferences — after all, it was trained on human labels." In reality, reward models are trained on a narrow preference distribution from a specific annotator pool and fail silently out of distribution. A common artifact: reward models trained on text quality correlate "longer response" with "better response" because annotators tend to rate more elaborate answers higher. The policy learns to generate verbose output, which scores well on the reward model but poorly in actual human evaluation. This reward hacking is not adversarial — the policy is doing exactly what it's trained to do. The bug is in the reward model's blind spots, not the training algorithm.
KL Divergence as a Health Monitor
| KL (nats) | Interpretation | Action |
|---|---|---|
| < 1 | Policy barely changed | Decrease β or train longer |
| 2–6 | Healthy range | Maintain |
| > 10 | Drifting far from SFT | Increase β or reduce LR |
| > 20 | Catastrophic divergence | Stop, debug |
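TRL's `init_kl_coef` / `target_kl` pair drives an adaptive controller that nudges β toward the target range automatically. A simplified sketch of the idea (illustrative, not the exact library code — the update is proportional and clipped to ±20% per step):

```python
class AdaptiveKLController:
    """Adaptive KL coefficient, simplified: raise beta when measured KL
    exceeds the target, lower it when below. The proportional error is
    clipped to [-0.2, 0.2] so beta moves gradually."""
    def __init__(self, init_kl_coef=0.2, target_kl=6.0, horizon=10000):
        self.beta = init_kl_coef
        self.target = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        error = max(min(current_kl / self.target - 1, 0.2), -0.2)
        self.beta *= 1 + error * n_steps / self.horizon
        return self.beta
```

This automates the table above: persistent KL above target raises β (stronger pull toward the SFT model); persistent KL below target relaxes it.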
Common Failure Modes
Reward hacking: model finds degenerate strategies — very long responses (if reward model correlates length with quality), repetitive high-reward phrases, or language switching.
Value function lag: value estimates don't track rewards → poor advantage estimates → slow convergence. Fix: pretrain the value head separately before RL.
If your reward model was trained on a different distribution than your RL prompts, reward scores will be unreliable. Keep RL prompts in-distribution for the reward model, or train the reward model on more diverse data.
Production-Ready Code
serve_api/app.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
app = FastAPI(title="RLHF Model API")
model = AutoModelForCausalLM.from_pretrained("rlhf_model")
tokenizer = AutoTokenizer.from_pretrained("rlhf_model")
reward_scorer = pipeline("text-classification", model="reward_model")
class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200
    num_return_sequences: int = 4  # best-of-N
    temperature: float = 0.7
@app.post("/generate")
def generate(req: GenerateRequest):
inputs = tokenizer(req.prompt, return_tensors="pt")
with torch.no_grad():
out = model.generate(
**inputs, max_new_tokens=req.max_new_tokens,
num_return_sequences=req.num_return_sequences,
temperature=req.temperature, do_sample=True,
)
responses = [tokenizer.decode(o, skip_special_tokens=True) for o in out]
scores = [r["score"] for r in reward_scorer(responses)]
best_idx = scores.index(max(scores))
return {"responses": responses, "scores": scores, "best": responses[best_idx]}A compute-cheap alternative to full PPO: generate N responses per prompt, score all with the reward model, return the best. Best-of-64 can match PPO quality at much lower training cost. Used in production at several frontier labs as a post-training step.