Neural-Path/Notes
30 min
Requires: RLHF & PPO

DPO: Direct Preference Optimization

Direct Preference Optimization (DPO) (Rafailov et al., 2023) collapsed the expensive three-stage RLHF pipeline into a single supervised loss. The key insight: the optimal RLHF policy can be expressed analytically in terms of log-probabilities, which lets you eliminate the separate reward model and the policy gradient algorithm entirely. DPO trains directly on preference pairs — chosen and rejected responses — and has become the dominant alignment method for open-source LLMs.

Theory

[Interactive widget: DPO log-ratio dynamics — sliders set the chosen (y_w) and rejected (y_l) log-ratios log π_θ/π_ref and display the margin, β·margin, σ(β·margin), and the resulting DPO loss (0.693 at zero margin).]

DPO loss = −log σ(β · margin). Training pushes the chosen log-ratio up and the rejected one down; the margin grows and the loss falls.

RLHF trains a reward model, then optimizes against it. DPO asks: what if we skipped the reward model entirely? Given a preference pair — one response the human prefers, one they don't — DPO directly adjusts the model's probabilities to favor the chosen response over the rejected one. The diagram above shows the log-ratio bars: training pushes the chosen bar up and the rejected bar down. No separate reward model, no RL training loop.

The RLHF Objective

RLHF fine-tunes a policy $\pi_\theta$ to maximize a reward $r(x, y)$ while staying close to a reference policy $\pi_{\text{ref}}$:

$$\max_{\pi_\theta} \mathbb{E}_{x,\, y \sim \pi_\theta(y|x)} [r(x, y)] - \beta\, \text{KL}[\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)]$$

The KL term prevents reward hacking: without it, the policy degenerates into outputs that score high on $r$ but are nonsensical.

The Optimal Policy

The closed-form solution to this KL-constrained optimization is:

$$\pi^*(y|x) = \frac{\pi_{\text{ref}}(y|x)}{Z(x)} \exp\!\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ is a normalizing partition function.
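To make the partition function concrete, here is a toy sketch (three possible responses with made-up rewards, not from the notes) where $Z(x)$ can be brute-forced — something impossible over the exponentially large space of real LLM responses:

```python
import math

# Toy setting: a prompt with only three possible responses, so Z(x) is exact.
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}   # reference probabilities (made up)
reward = {"a": 1.0, "b": 0.5, "c": -1.0}  # hypothetical rewards r(x, y)
beta = 0.1

# Z(x) = sum_y pi_ref(y|x) * exp(r(x, y) / beta)
Z = sum(pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref)

# Optimal policy: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z
pi_star = {y: pi_ref[y] * math.exp(reward[y] / beta) / Z for y in pi_ref}

print(pi_star)  # at small beta, the highest-reward response dominates
```

With $\beta = 0.1$ the exponential sharpens the reference distribution heavily toward the highest-reward response; try $\beta = 1.0$ to see a much softer shift.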

The DPO Derivation

Solving the optimal-policy expression for $r$:

$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Now plug into the Bradley-Terry preference model $p(y_w \succ y_l \mid x) = \sigma(r(x,y_w) - r(x,y_l))$. The $\log Z(x)$ terms cancel (they appear identically in both $r(x,y_w)$ and $r(x,y_l)$):

$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

Replace $\pi^*$ with the parameterized model $\pi_\theta$ and take the negative log-likelihood over a preference dataset. This is the DPO loss:

$$\boxed{\mathcal{L}_{\text{DPO}}(\pi_\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left(\beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)\right) \right]}$$

The $\log Z(x)$ cancellation is what makes DPO tractable. The optimal RLHF policy contains a partition function $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$ that is intractable to compute — it requires summing over all possible responses. When you express the reward in terms of the log-ratio and then subtract the rejected reward from the chosen one, $\log Z(x)$ appears identically in both and cancels exactly. This is not an approximation; it's an algebraic identity that falls out of the Bradley-Terry model structure.
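A quick numeric sanity check of the cancellation, with toy numbers standing in for real model outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
# Hypothetical log-ratios log pi/pi_ref for a chosen and a rejected response.
logratio_w, logratio_l = 2.0, -1.0
log_Z = 7.3  # any value: it is shared by both rewards for the same prompt x

# Rewards via r = beta * log-ratio + beta * log Z
r_w = beta * logratio_w + beta * log_Z
r_l = beta * logratio_l + beta * log_Z

# Bradley-Terry probability with and without the log Z terms
p_with_Z = sigmoid(r_w - r_l)
p_without_Z = sigmoid(beta * (logratio_w - logratio_l))
print(p_with_Z, p_without_Z)  # identical: log Z cancels in the difference
```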

💡Intuition

DPO loss decreases when the chosen log-ratio exceeds the rejected log-ratio by a larger margin. Training pushes the model to assign relatively more probability to the preferred response and relatively less to the rejected response — all without ever learning an explicit reward function.
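The per-pair loss is simple enough to write out directly. This helper is illustrative (not a library API); the log-ratios would come from two model forward passes in practice:

```python
import math

def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """DPO loss for one preference pair, given the two sequence-level
    log-ratios log pi_theta / pi_ref (illustrative, not a library API)."""
    margin = logratio_chosen - logratio_rejected
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At zero margin the loss is log 2 ≈ 0.693 (the widget's starting state);
# it falls toward 0 as the chosen log-ratio pulls ahead of the rejected one.
print(dpo_loss(0.0, 0.0))    # ≈ 0.693
print(dpo_loss(5.0, -5.0))   # ≈ 0.313 with beta = 0.1
```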

The Role of $\beta$

$\beta$ controls the KL constraint strength:

  • Small $\beta$ (e.g., 0.01): the model can shift far from $\pi_{\text{ref}}$ to maximize the preference margin; risk of style drift or capability loss
  • Large $\beta$ (e.g., 0.5): the model stays close to $\pi_{\text{ref}}$; harder to achieve large margins but safer

Typical values: $\beta \in [0.01, 0.5]$. Default: $\beta = 0.1$.
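A small sketch (toy margin, not real training numbers) showing how the same margin is penalized very differently across this $\beta$ range:

```python
import math

def dpo_loss(margin, beta):
    # -log sigma(beta * margin) for a fixed chosen-minus-rejected margin
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With a small beta, even a sizable margin barely reduces the loss,
# so training keeps pushing the policy away from pi_ref.
for beta in (0.01, 0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss(4.0, beta):.3f}")
```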

Computing the Log-Ratios

The log-ratio for a response $y$ is the sum of per-token log-probability differences:

$$\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} = \sum_{t=1}^{T} \left[\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t})\right]$$

This requires a forward pass through both the current model and the reference model for every training step. The reference model is the SFT checkpoint, frozen for the duration of DPO training.
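The sum itself is trivial once the two forward passes are done. A sketch with made-up per-token log-probs standing in for the two models' outputs:

```python
# Made-up per-token log-probs for one 4-token response under each model.
logp_theta = [-1.2, -0.8, -2.1, -0.5]  # log pi_theta(y_t | x, y_<t)
logp_ref   = [-1.5, -1.0, -2.0, -0.9]  # log pi_ref(y_t | x, y_<t)

# Sequence-level log-ratio = sum of per-token differences
log_ratio = sum(t - r for t, r in zip(logp_theta, logp_ref))
print(log_ratio)  # positive here: pi_theta assigns this response more mass
```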

Walkthrough

Training on a Preference Dataset

Dataset format: triples of (prompt, chosen, rejected):

```python
examples = [
    {
        "prompt": "Explain gradient descent in one sentence.",
        "chosen": "Gradient descent iteratively adjusts parameters in the direction "
                  "that most reduces the loss.",
        "rejected": "Gradient descent is a machine learning algorithm that is very "
                    "commonly used by data scientists.",
    },
    # ... thousands more
]
```

Training with TRL's DPOTrainer:

```python
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])

trainer = DPOTrainer(
    model=model,     # SFT checkpoint (π_θ, will be trained)
    ref_model=None,  # with a peft_config, TRL uses the base model with
                     # adapters disabled as the frozen π_ref
    args=DPOConfig(
        beta=0.1,
        output_dir="./dpo-output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    peft_config=lora_cfg,
)
trainer.train()
```

Key metrics during training: rewards/chosen should increase; rewards/rejected should decrease; rewards/margins (their difference) should grow. Flat margins signal no preference learning.
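These logged rewards are the implicit rewards $\beta \log \pi_\theta/\pi_{\text{ref}}$ from the derivation. A sketch of how they relate, with made-up log-ratios for a batch of three pairs:

```python
beta = 0.1
# Made-up sequence-level log-ratios for three preference pairs
logratios_chosen   = [1.8, 2.4, 0.9]
logratios_rejected = [-0.5, 0.3, -1.1]

# Implicit rewards: beta * log-ratio (what TRL logs as rewards/chosen etc.)
rewards_chosen   = [beta * lr for lr in logratios_chosen]
rewards_rejected = [beta * lr for lr in logratios_rejected]
margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]

print(sum(margins) / len(margins))  # mean margin; should grow over training
```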

Analysis & Evaluation

Where Your Intuition Breaks

"DPO is strictly better than RLHF — it's simpler and achieves the same result." Not quite: DPO cannot perform exploration. PPO generates new responses during training and receives reward signal on those novel outputs; DPO trains only on responses already in the preference dataset. For tasks where the best outputs require generating responses outside the training distribution — complex reasoning, long-horizon planning, tasks requiring diverse strategies — RLHF with PPO can discover better solutions. DPO is the right choice when a high-quality preference dataset exists; RLHF is necessary when you need the model to explore its own output space during training.

DPO vs. PPO

| Property | DPO | PPO (RLHF) |
| --- | --- | --- |
| Reward model | Not needed | Required |
| Online sampling | No (offline dataset) | Yes (generates during training) |
| Compute per step | ~2× forward passes | ~4× (policy + RM + value + ref) |
| Stability | Generally stable | Can diverge (reward hacking) |
| Preference coverage | Limited to dataset | Can explore unseen responses |
| Best when | Curated preference data available | Broad alignment with continuous feedback |

When PPO outperforms DPO: tasks requiring exploration — the model must discover better responses not in the preference dataset. DPO is bounded by its offline data.

DPO Failure Modes

Distribution mismatch: chosen and rejected responses are too similar (low human agreement scores). DPO loss stays near zero throughout training; margins don't grow. Fix: filter preference pairs below a confidence threshold.
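The fix is a one-line filter. A hedged sketch (the `agreement` field and threshold are illustrative, not a standard dataset schema):

```python
# Drop preference pairs whose annotator agreement is below a threshold.
pairs = [
    {"prompt": "p1", "chosen": "a", "rejected": "b", "agreement": 0.95},
    {"prompt": "p2", "chosen": "c", "rejected": "d", "agreement": 0.55},
    {"prompt": "p3", "chosen": "e", "rejected": "f", "agreement": 0.80},
]

MIN_AGREEMENT = 0.75  # illustrative threshold; tune on your own data
filtered = [p for p in pairs if p["agreement"] >= MIN_AGREEMENT]
print(len(filtered))  # the low-agreement pair is dropped
```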

Length bias: longer responses are systematically chosen (raters conflate verbosity with quality). DPO amplifies this, producing increasingly verbose outputs. Fix: length-normalize log-probabilities or control for length in preference collection.
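A sketch of why length normalization helps, using made-up per-token log-ratios (the SimPO variant below builds its implicit reward this way):

```python
# Made-up per-token log-ratios (log pi_theta - log pi_ref) for two responses.
ratios_long  = [0.1] * 40   # verbose response: small gain per token
ratios_short = [0.3] * 10   # concise response: larger gain per token

raw_long, raw_short = sum(ratios_long), sum(ratios_short)
print(raw_long, raw_short)  # the raw sum rewards sheer length

# Length-normalized: divide by token count so verbosity alone doesn't win.
norm_long = raw_long / len(ratios_long)
norm_short = raw_short / len(ratios_short)
print(norm_long, norm_short)  # now the concise response scores higher
```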

Reference drift: if $\pi_\theta$ diverges from $\pi_{\text{ref}}$ early (aggressive LR), log-ratios become numerically unstable. Fix: small LR (1e-5 to 5e-5) and $\beta \geq 0.05$.

DPO Variants

| Variant | Key change | Use when |
| --- | --- | --- |
| IPO | Squared margin loss (no sigmoid saturation) | Preference data has noisy labels |
| KTO | Individual good/bad labels, no pairs needed | Easier data collection |
| ORPO | Merges SFT + DPO loss in one pass; no ref model forward pass | Memory-constrained training |
| SimPO | Removes reference model; uses length-normalized log-prob as implicit reward | Simpler setup, competitive quality |
🚀Production

DPO in practice:

  • Dataset size: 5K–100K preference pairs. Quality dominates quantity — a carefully curated 5K dataset outperforms a noisy 100K dataset. Filter pairs with low inter-annotator agreement.
  • Beta: start at 0.1. Increase (0.2–0.5) if the model drifts from its base style. Decrease (0.01–0.05) if margins aren't growing after the first epoch.
  • Reference model: must be the exact SFT checkpoint used to initialize training. Using a different checkpoint breaks the DPO derivation.
  • Epochs: 1–2 is standard. DPO on preference data overfits faster than SFT — stop when val/rewards_margins flattens.
  • LoRA is fine: standard for DPO. Full fine-tuning works but risks forgetting SFT behavior if LR is too high.
