Neural-Path/Notes
30 min
Requires: RLHF & PPO

DPO: Direct Preference Optimization

Direct Preference Optimization (DPO) (Rafailov et al., 2023) collapsed the expensive three-stage RLHF pipeline into a single supervised loss. The key insight: the optimal RLHF policy can be expressed analytically in terms of log-probabilities, which lets you eliminate the separate reward model and the policy gradient algorithm entirely. DPO trains directly on preference pairs — chosen and rejected responses — and has become the dominant alignment method for open-source LLMs.

Theory

[Interactive widget: DPO log-ratio dynamics — sliders set the chosen (y_w) and rejected (y_l) log-ratios log π_θ/π_ref and display the margin, β·margin, σ(β·margin), and the resulting DPO loss (0.693 at zero margin).]

DPO loss = −log σ(β · margin). Training pushes the chosen log-ratio up and the rejected one down; the margin grows and the loss falls.

RLHF trains a reward model, then optimizes against it. DPO asks: what if we skipped the reward model entirely? Given a preference pair — one response the human prefers, one they don't — DPO directly adjusts the model's probabilities to favor the chosen response over the rejected one. The diagram above shows the log-ratio bars: training pushes the chosen bar up and the rejected bar down. No separate reward model, no RL training loop.

The RLHF Objective

RLHF fine-tunes a policy $\pi_\theta$ to maximize a reward $r(x, y)$ while staying close to a reference policy $\pi_{\text{ref}}$:

$$\max_{\pi_\theta} \mathbb{E}_{x,\, y \sim \pi_\theta(y|x)} [r(x, y)] - \beta\, \text{KL}[\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)]$$

The KL term prevents reward hacking: without it, the policy degenerates into outputs that score high on $r$ but are nonsensical.

The Optimal Policy

The closed-form solution to this KL-constrained optimization is:

$$\pi^*(y|x) = \frac{\pi_{\text{ref}}(y|x)}{Z(x)} \exp\!\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ is a normalizing partition function.
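To make the partition function concrete, here is a toy sketch (three possible responses with made-up rewards, not from the notes) where $Z(x)$ can be brute-forced — something impossible over the exponentially large space of real LLM responses:

```python
import math

# Toy setting: a prompt with only three possible responses, so Z(x) is exact.
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}   # reference probabilities (made up)
reward = {"a": 1.0, "b": 0.5, "c": -1.0}  # hypothetical rewards r(x, y)
beta = 0.1

# Z(x) = sum_y pi_ref(y|x) * exp(r(x, y) / beta)
Z = sum(pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref)

# Optimal policy: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z
pi_star = {y: pi_ref[y] * math.exp(reward[y] / beta) / Z for y in pi_ref}

print(pi_star)  # at small beta, the highest-reward response dominates
```

With $\beta = 0.1$ the exponential sharpens the reference distribution heavily toward the highest-reward response; try $\beta = 1.0$ to see a much softer shift.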

The DPO Derivation

Solving the optimal-policy expression for $r$:

$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Now plug into the Bradley-Terry preference model $p(y_w \succ y_l \mid x) = \sigma(r(x,y_w) - r(x,y_l))$. The $\log Z(x)$ terms cancel (they appear identically in both $r(x,y_w)$ and $r(x,y_l)$):

$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

Replace $\pi^*$ with the parameterized model $\pi_\theta$ and take the negative log-likelihood over a preference dataset. This is the DPO loss:

$$\boxed{\mathcal{L}_{\text{DPO}}(\pi_\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left(\beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)\right) \right]}$$

The $\log Z(x)$ cancellation is what makes DPO tractable. The optimal RLHF policy contains a partition function $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ that is intractable to compute — it requires summing over all possible responses. When you express the reward in terms of the log-ratio and then subtract the rejected reward from the chosen one, $\log Z(x)$ appears identically in both and cancels exactly. This is not an approximation; it's an algebraic identity that falls out of the Bradley-Terry model structure.
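A quick numeric sanity check of the cancellation, with toy numbers standing in for real model outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
# Hypothetical log-ratios log pi/pi_ref for a chosen and a rejected response.
logratio_w, logratio_l = 2.0, -1.0
log_Z = 7.3  # any value: it is shared by both rewards for the same prompt x

# Rewards via r = beta * log-ratio + beta * log Z
r_w = beta * logratio_w + beta * log_Z
r_l = beta * logratio_l + beta * log_Z

# Bradley-Terry probability with and without the log Z terms
p_with_Z = sigmoid(r_w - r_l)
p_without_Z = sigmoid(beta * (logratio_w - logratio_l))
print(p_with_Z, p_without_Z)  # identical: log Z cancels in the difference
```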

💡Intuition

DPO loss decreases when the chosen log-ratio exceeds the rejected log-ratio by a larger margin. Training pushes the model to assign relatively more probability to the preferred response and relatively less to the rejected response — all without ever learning an explicit reward function.
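The per-pair loss is simple enough to write out directly. This helper is illustrative (not a library API); the log-ratios would come from two model forward passes in practice:

```python
import math

def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """DPO loss for one preference pair, given the two sequence-level
    log-ratios log pi_theta / pi_ref (illustrative, not a library API)."""
    margin = logratio_chosen - logratio_rejected
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At zero margin the loss is log 2 ≈ 0.693 (the widget's starting state);
# it falls toward 0 as the chosen log-ratio pulls ahead of the rejected one.
print(dpo_loss(0.0, 0.0))    # ≈ 0.693
print(dpo_loss(5.0, -5.0))   # ≈ 0.313 with beta = 0.1
```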

The Role of $\beta$

$\beta$ controls the KL constraint strength:

  • Small $\beta$ (e.g., 0.01): the model can shift far from $\pi_{\text{ref}}$ to maximize the preference margin; risk of style drift or capability loss
  • Large $\beta$ (e.g., 0.5): the model stays close to $\pi_{\text{ref}}$; harder to achieve large margins but safer

Typical values: $\beta \in [0.01, 0.5]$. Default: $\beta = 0.1$.
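A small sketch (toy margin, not real training numbers) showing how the same margin is penalized very differently across this $\beta$ range:

```python
import math

def dpo_loss(margin, beta):
    # -log sigma(beta * margin) for a fixed chosen-minus-rejected margin
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With a small beta, even a sizable margin barely reduces the loss,
# so training keeps pushing the policy away from pi_ref.
for beta in (0.01, 0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss(4.0, beta):.3f}")
```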

Computing the Log-Ratios

The log-ratio for a response $y$ is the sum of per-token log-probability differences:

$$\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} = \sum_{t=1}^{T} \left[\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t})\right]$$

This requires a forward pass through both the current model and the reference model for every training step. The reference model is the SFT checkpoint, frozen for the duration of DPO training.
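The sum itself is trivial once the two forward passes are done. A sketch with made-up per-token log-probs standing in for the two models' outputs:

```python
# Made-up per-token log-probs for one 4-token response under each model.
logp_theta = [-1.2, -0.8, -2.1, -0.5]  # log pi_theta(y_t | x, y_<t)
logp_ref   = [-1.5, -1.0, -2.0, -0.9]  # log pi_ref(y_t | x, y_<t)

# Sequence-level log-ratio = sum of per-token differences
log_ratio = sum(t - r for t, r in zip(logp_theta, logp_ref))
print(log_ratio)  # positive here: pi_theta assigns this response more mass
```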

Walkthrough

Training on a Preference Dataset

Dataset format: triples of (prompt, chosen, rejected):

```python
examples = [
    {
        "prompt": "Explain gradient descent in one sentence.",
        "chosen": "Gradient descent iteratively adjusts parameters in the direction "
                  "that most reduces the loss.",
        "rejected": "Gradient descent is a machine learning algorithm that is very "
                    "commonly used by data scientists.",
    },
    # ... thousands more
]
```

Training with TRL's DPOTrainer:

```python
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])

trainer = DPOTrainer(
    model=model,     # SFT checkpoint (π_θ, will be trained)
    ref_model=None,  # with a peft_config, TRL uses the base model with
                     # adapters disabled as the frozen π_ref
    args=DPOConfig(
        beta=0.1,
        output_dir="./dpo-output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    peft_config=lora_cfg,
)
trainer.train()
```

Key metrics during training: rewards/chosen should increase; rewards/rejected should decrease; rewards/margins (their difference) should grow. Flat margins signal no preference learning.
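These logged rewards are the implicit rewards $\beta \log \pi_\theta/\pi_{\text{ref}}$ from the derivation. A sketch of how they relate, with made-up log-ratios for a batch of three pairs:

```python
beta = 0.1
# Made-up sequence-level log-ratios for three preference pairs
logratios_chosen   = [1.8, 2.4, 0.9]
logratios_rejected = [-0.5, 0.3, -1.1]

# Implicit rewards: beta * log-ratio (what TRL logs as rewards/chosen etc.)
rewards_chosen   = [beta * lr for lr in logratios_chosen]
rewards_rejected = [beta * lr for lr in logratios_rejected]
margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]

print(sum(margins) / len(margins))  # mean margin; should grow over training
```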

Analysis & Evaluation

Where Your Intuition Breaks

"DPO is strictly better than RLHF — it's simpler and achieves the same result." Not quite: DPO cannot perform exploration. PPO generates new responses during training and receives reward signal on those novel outputs; DPO trains only on responses already in the preference dataset. For tasks where the best outputs require generating responses outside the training distribution — complex reasoning, long-horizon planning, tasks requiring diverse strategies — RLHF with PPO can discover better solutions. DPO is the right choice when a high-quality preference dataset exists; RLHF is necessary when you need the model to explore its own output space during training.

DPO vs. PPO

| Property | DPO | PPO (RLHF) |
| --- | --- | --- |
| Reward model | Not needed | Required |
| Online sampling | No (offline dataset) | Yes (generates during training) |
| Compute per step | ~2× forward passes | ~4× (policy + RM + value + ref) |
| Stability | Generally stable | Can diverge (reward hacking) |
| Preference coverage | Limited to dataset | Can explore unseen responses |
| Best when | Curated preference data available | Broad alignment with continuous feedback |

When PPO outperforms DPO: tasks requiring exploration — the model must discover better responses not in the preference dataset. DPO is bounded by its offline data.

DPO Failure Modes

Distribution mismatch: chosen and rejected responses are too similar (low human agreement scores). DPO loss stays near zero throughout training; margins don't grow. Fix: filter preference pairs below a confidence threshold.
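The fix is a one-line filter. A hedged sketch (the `agreement` field and threshold are illustrative, not a standard dataset schema):

```python
# Drop preference pairs whose annotator agreement is below a threshold.
pairs = [
    {"prompt": "p1", "chosen": "a", "rejected": "b", "agreement": 0.95},
    {"prompt": "p2", "chosen": "c", "rejected": "d", "agreement": 0.55},
    {"prompt": "p3", "chosen": "e", "rejected": "f", "agreement": 0.80},
]

MIN_AGREEMENT = 0.75  # illustrative threshold; tune on your own data
filtered = [p for p in pairs if p["agreement"] >= MIN_AGREEMENT]
print(len(filtered))  # the low-agreement pair is dropped
```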

Length bias: longer responses are systematically chosen (raters conflate verbosity with quality). DPO amplifies this, producing increasingly verbose outputs. Fix: length-normalize log-probabilities or control for length in preference collection.
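A sketch of why length normalization helps, using made-up per-token log-ratios (the SimPO variant below builds its implicit reward this way):

```python
# Made-up per-token log-ratios (log pi_theta - log pi_ref) for two responses.
ratios_long  = [0.1] * 40   # verbose response: small gain per token
ratios_short = [0.3] * 10   # concise response: larger gain per token

raw_long, raw_short = sum(ratios_long), sum(ratios_short)
print(raw_long, raw_short)  # the raw sum rewards sheer length

# Length-normalized: divide by token count so verbosity alone doesn't win.
norm_long = raw_long / len(ratios_long)
norm_short = raw_short / len(ratios_short)
print(norm_long, norm_short)  # now the concise response scores higher
```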

Reference drift: if $\pi_\theta$ diverges from $\pi_{\text{ref}}$ early (aggressive LR), log-ratios become numerically unstable. Fix: small LR (1e-5 to 5e-5) and $\beta \geq 0.05$.

DPO Variants

| Variant | Key change | Use when |
| --- | --- | --- |
| IPO | Squared margin loss (no sigmoid saturation) | Preference data has noisy labels |
| KTO | Individual good/bad labels, no pairs needed | Easier data collection |
| ORPO | Merges SFT + DPO loss in one pass; no ref model forward pass | Memory-constrained training |
| SimPO | Removes reference model; uses length-normalized log-prob as implicit reward | Simpler setup, competitive quality |
🚀Production

DPO in practice:

  • Dataset size: 5K–100K preference pairs. Quality dominates quantity — a carefully curated 5K dataset outperforms a noisy 100K dataset. Filter pairs with low inter-annotator agreement.
  • Beta: start at 0.1. Increase (0.2–0.5) if the model drifts from its base style. Decrease (0.01–0.05) if margins aren't growing after the first epoch.
  • Reference model: must be the exact SFT checkpoint used to initialize training. Using a different checkpoint breaks the DPO derivation.
  • Epochs: 1–2 is standard. DPO on preference data overfits faster than SFT — stop when val/rewards_margins flattens.
  • LoRA is fine: standard for DPO. Full fine-tuning works but risks forgetting SFT behavior if LR is too high.
