DPO: Direct Preference Optimization
Direct Preference Optimization (DPO) (Rafailov et al., 2023) collapsed the expensive three-stage RLHF pipeline into a single supervised loss. The key insight: the optimal RLHF policy can be expressed analytically in terms of log-probabilities, which lets you eliminate the separate reward model and the policy gradient algorithm entirely. DPO trains directly on preference pairs — chosen and rejected responses — and has become the dominant alignment method for open-source LLMs.
Theory
[Diagram] DPO loss = −log σ(β · margin): training pushes the chosen log-ratio ↑ and the rejected log-ratio ↓; the margin grows and the loss falls.
RLHF trains a reward model, then optimizes against it. DPO asks: what if we skipped the reward model entirely? Given a preference pair — one response the human prefers, one they don't — DPO directly adjusts the model's probabilities to favor the chosen response over the rejected one. The diagram above shows the log-ratio bars: training pushes the chosen bar up and the rejected bar down. No separate reward model, no RL training loop.
The RLHF Objective
RLHF fine-tunes a policy $\pi_\theta$ to maximize a reward $r(x, y)$ while staying close to a reference policy $\pi_{\mathrm{ref}}$:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$
The KL term prevents reward hacking: without it, the policy degenerates into outputs that score high on $r$ but are nonsensical.
The Optimal Policy
The closed-form solution to this KL-constrained optimization is:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big)$$

where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x)\, \exp\big(r(x, y)/\beta\big)$ is a normalizing partition function.
The DPO Derivation
Solving for $r$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

Now plug $r$ into the Bradley-Terry preference model $p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$. The $\beta \log Z(x)$ terms cancel (they appear identically in both $r(x, y_w)$ and $r(x, y_l)$):

$$p(y_w \succ y_l \mid x) = \sigma\!\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)$$

Replace $\pi^*$ with the parameterized model $\pi_\theta$ and take the negative log-likelihood over a preference dataset $\mathcal{D}$. This is the DPO loss:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$
The $Z(x)$ cancellation is what makes DPO tractable. The optimal RLHF policy contains a partition function $Z(x)$ that is intractable to compute: it requires summing over all possible responses. When you express the reward in terms of the log-ratio and then subtract the rejected reward from the chosen reward, $\beta \log Z(x)$ appears identically in both and cancels exactly. This is not an approximation; it's an algebraic identity that falls out of the Bradley-Terry model structure.
DPO loss decreases when the chosen log-ratio exceeds the rejected log-ratio by a larger margin. Training pushes the model to assign relatively more probability to the preferred response and relatively less to the rejected response — all without ever learning an explicit reward function.
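To make the per-pair loss concrete, here is a minimal pure-Python sketch. All log-probability values are made up for illustration; real values come from summing token log-probs under the policy and the frozen reference.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin) for one preference pair,
    computed from summed sequence log-probabilities."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # log1p(exp(-z)) == -log(sigmoid(z)), numerically stable for z >= 0
    return math.log1p(math.exp(-beta * margin))

# Toy numbers: chosen gained probability vs. the reference, rejected lost it.
loss = dpo_pair_loss(logp_chosen=-12.0, ref_logp_chosen=-13.0,
                     logp_rejected=-15.0, ref_logp_rejected=-14.0)
# A zero margin gives loss = log 2 ≈ 0.693; any positive margin gives less.
```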
The Role of $\beta$
$\beta$ controls the KL constraint strength:
- Small $\beta$ (e.g., 0.01): the model can shift far from $\pi_{\mathrm{ref}}$ to maximize the preference margin; risk of style drift or capability loss
- Large $\beta$ (e.g., 0.5): the model stays close to $\pi_{\mathrm{ref}}$; harder to achieve large margins but safer
Typical values: $\beta \in [0.01, 0.5]$. Default: $\beta = 0.1$.
Computing the Log-Ratios
The log-ratio for a response $y$ is the sum of per-token log-probability differences:

$$\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} = \sum_{t=1}^{|y|} \Big[\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t})\Big]$$
This requires a forward pass through both the current model and the reference model for every training step. The reference model is the SFT checkpoint, frozen for the duration of DPO training.
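A dependency-free sketch of that per-token sum, using a tiny hypothetical vocabulary and hand-written logits in place of real model outputs. Production code would work on tensors and mask out the prompt positions so only response tokens contribute.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_logprob(per_position_logits, token_ids):
    """Sum of log-probs of the response tokens under one model.
    per_position_logits[t] is the vocab-sized logit vector predicting
    token_ids[t]."""
    return sum(log_softmax(per_position_logits[t])[token_ids[t]]
               for t in range(len(token_ids)))

# Toy example: 3 response tokens over a vocabulary of 5.
tokens = [2, 0, 4]
policy_logits = [[0.1, 0.2, 1.5, 0.0, -0.3],   # favors token 2
                 [1.0, 0.0, 0.0, 0.2, 0.1],    # favors token 0
                 [0.0, 0.3, 0.1, 0.0, 2.0]]    # favors token 4
ref_logits = [[0.0] * 5 for _ in range(3)]     # uniform reference

log_ratio = (sequence_logprob(policy_logits, tokens)
             - sequence_logprob(ref_logits, tokens))
# Positive: the policy assigns this response more probability than the reference.
```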
Walkthrough
Training on a Preference Dataset
Dataset format: triples of (prompt, chosen, rejected):
```python
examples = [
    {
        "prompt": "Explain gradient descent in one sentence.",
        "chosen": "Gradient descent iteratively adjusts parameters in the direction "
        "that most reduces the loss.",
        "rejected": "Gradient descent is a machine learning algorithm that is very "
        "commonly used by data scientists.",
    },
    # ... thousands more
]
```

Training with TRL's DPOTrainer:
```python
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])

trainer = DPOTrainer(
    model=model,          # SFT checkpoint (π_θ, will be trained)
    ref_model=ref_model,  # frozen SFT checkpoint (π_ref)
    args=DPOConfig(
        beta=0.1,
        output_dir="./dpo-output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    peft_config=lora_cfg,
)
trainer.train()
```

Key metrics during training: `rewards/chosen` should increase; `rewards/rejected` should decrease; `rewards/margins` (their difference) should grow. Flat margins signal no preference learning.
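These logged rewards are just scaled log-ratios. A sketch of how such metrics can be derived from batch statistics (the numeric values are hypothetical, and TRL's exact implementation may differ in detail):

```python
beta = 0.1
# Hypothetical batch-mean summed log-probs under policy and frozen reference:
policy_chosen, ref_chosen = -40.0, -44.0
policy_rejected, ref_rejected = -55.0, -50.0

reward_chosen = beta * (policy_chosen - ref_chosen)        # implicit reward, chosen
reward_rejected = beta * (policy_rejected - ref_rejected)  # implicit reward, rejected
reward_margin = reward_chosen - reward_rejected            # should grow during training
```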
Analysis & Evaluation
Where Your Intuition Breaks
The intuition: DPO is strictly better than RLHF, since it's simpler and achieves the same result. The reality: DPO cannot perform exploration. PPO generates new responses during training and receives reward signal on those novel outputs; DPO trains only on responses already in the preference dataset. For tasks where the best outputs require generating responses outside the training distribution (complex reasoning, long-horizon planning, tasks requiring diverse strategies), RLHF with PPO can discover better solutions. DPO is the right choice when a high-quality preference dataset exists; RLHF is necessary when you need the model to explore its own output space during training.
DPO vs. PPO
| Property | DPO | PPO (RLHF) |
|---|---|---|
| Reward model | Not needed | Required |
| Online sampling | No (offline dataset) | Yes (generates during training) |
| Compute per step | ~2× forward passes | ~4× (policy + RM + value + ref) |
| Stability | Generally stable | Can diverge (reward hacking) |
| Preference coverage | Limited to dataset | Can explore unseen responses |
| Best when | Curated preference data available | Broad alignment with continuous feedback |
When PPO outperforms DPO: tasks requiring exploration — the model must discover better responses not in the preference dataset. DPO is bounded by its offline data.
DPO Failure Modes
Distribution mismatch: chosen and rejected responses are too similar (low human agreement scores). DPO loss stays near zero throughout training; margins don't grow. Fix: filter preference pairs below a confidence threshold.
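The suggested fix can be sketched as a simple threshold filter. The `agreement` field and the 0.7 cutoff are assumptions; use whatever annotator-confidence signal your dataset records.

```python
# Hypothetical filter: keep only pairs whose annotator agreement clears
# a confidence threshold (field name and cutoff are assumptions).
def filter_pairs(pairs, min_agreement=0.7):
    return [p for p in pairs if p.get("agreement", 0.0) >= min_agreement]

pairs = [
    {"prompt": "p1", "chosen": "a", "rejected": "b", "agreement": 0.9},
    {"prompt": "p2", "chosen": "c", "rejected": "d", "agreement": 0.55},
]
kept = filter_pairs(pairs)  # only the high-agreement pair survives
```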
Length bias: longer responses are systematically chosen (raters conflate verbosity with quality). DPO amplifies this, producing increasingly verbose outputs. Fix: length-normalize log-probabilities or control for length in preference collection.
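The length-normalization fix amounts to scoring responses per token instead of by raw sum, so sequence length alone cannot decide a comparison. A toy illustration with made-up log-prob sums:

```python
# Length-normalized scoring: divide the summed log-prob by token count
# so raw length can't decide the comparison (toy numbers below).
def per_token_logp(logp_sum, num_tokens):
    return logp_sum / max(num_tokens, 1)

# The short response wins on raw sum (fewer negative terms added),
# but the long response is actually better per token.
short_sum, short_len = -12.0, 10
long_sum, long_len = -20.0, 25
raw_winner = "short" if short_sum > long_sum else "long"
norm_winner = ("short" if per_token_logp(short_sum, short_len)
               > per_token_logp(long_sum, long_len) else "long")
```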
Reference drift: if $\pi_\theta$ diverges from $\pi_{\mathrm{ref}}$ early (aggressive LR), log-ratios become numerically unstable. Fix: small LR (1e-5 to 5e-5) and $\beta \geq 0.1$.
DPO Variants
| Variant | Key change | Use when |
|---|---|---|
| IPO | Squared margin loss (no sigmoid saturation) | Preference data has noisy labels |
| KTO | Individual good/bad labels, no pairs needed | Easier data collection |
| ORPO | Merges SFT + DPO loss in one pass; no ref model forward pass | Memory-constrained training |
| SimPO | Removes reference model; uses length-normalized log-prob as implicit reward | Simpler setup, competitive quality |
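To illustrate the first row of the table, here is a schematic contrast between the DPO and IPO-style penalties as functions of the same margin z. This shows the shape of the two objectives, not the exact IPO parameterization.

```python
import math

def dpo_term(z):
    """-log sigmoid(z): flattens toward 0 for large z, so the gradient
    vanishes once a pair's margin is big enough."""
    return math.log1p(math.exp(-z))

def ipo_style_term(z, tau=0.1):
    """Squared distance to a target margin 1/(2*tau): never saturates,
    and it even penalizes overshooting the target."""
    return (z - 1.0 / (2.0 * tau)) ** 2
```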
DPO in practice:
- Dataset size: 5K–100K preference pairs. Quality dominates quantity — a carefully curated 5K dataset outperforms a noisy 100K dataset. Filter pairs with low inter-annotator agreement.
- Beta: start at 0.1. Increase (0.2–0.5) if the model drifts from its base style. Decrease (0.01–0.05) if margins aren't growing after the first epoch.
- Reference model: must be the exact SFT checkpoint used to initialize training. Using a different checkpoint breaks the DPO derivation.
- Epochs: 1–2 is standard. DPO on preference data overfits faster than SFT; stop when `val/rewards_margins` flattens.
- LoRA is fine: LoRA is standard for DPO. Full fine-tuning works but risks forgetting SFT behavior if LR is too high.