KL Divergence, f-Divergences & Total Variation Distance
KL divergence quantifies how much one probability distribution differs from another and is the central object in variational inference, policy optimization, and training objectives. Its asymmetry gives rise to a family of divergence measures — f-divergences — each with distinct geometric and statistical properties.
Concepts
KL divergence is asymmetric: KL(P‖Q) diverges when Q assigns zero probability where P does not, while JS divergence is symmetric and bounded by log 2. For two Gaussians P = N(μ₁, σ₁²) and Q = N(μ₂, σ₂²), making σ₂ small while shifting μ₂ away from μ₁ causes KL(P‖Q) to explode, while KL(Q‖P) grows far more slowly; JS always stays ≤ log 2 ≈ 0.693.
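A minimal numerical sketch of this asymmetry, using Gaussians discretized on a grid (the parameters, grid, and helper names `kl` and `gaussian_pmf` are illustrative choices, not from the original):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence sum_x p(x) log(p(x)/q(x)), with 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def gaussian_pmf(x, mu, sigma):
    """Gaussian density discretized on grid x and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

x = np.linspace(-10, 10, 4001)
p = gaussian_pmf(x, mu=0.0, sigma=1.0)   # P = N(0, 1)
q = gaussian_pmf(x, mu=3.0, sigma=0.5)   # Q = narrow, shifted away from P

m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)     # Jensen-Shannon divergence

print(kl(p, q))   # large: Q has almost no mass where P lives
print(kl(q, p))   # much smaller: P still covers Q's support
print(js)         # bounded by log 2 ~ 0.693 no matter how far apart
```
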
Training a language model by minimizing cross-entropy loss is exactly minimizing the KL divergence from the true data distribution to the model's output distribution. KL divergence is the workhorse quantity of machine learning: it appears in variational inference (the ELBO), policy optimization (trust regions), knowledge distillation, and RLHF, always measuring how much one probability distribution deviates from another, and in which direction.
KL Divergence
For distributions P and Q on the same alphabet 𝒳, the KL divergence (Kullback–Leibler divergence) is:

KL(P‖Q) = Σ_x P(x) log [P(x)/Q(x)]

For continuous distributions with densities p and q: KL(p‖q) = ∫ p(x) log [p(x)/q(x)] dx.

Convention: 0 · log(0/q) = 0; KL(P‖Q) = ∞ when P(x) > 0 but Q(x) = 0 for some x.

Non-negativity (Gibbs' inequality): KL(P‖Q) ≥ 0, with equality iff P = Q almost everywhere.

Proof: −KL(P‖Q) = Σ_x P(x) log [Q(x)/P(x)] ≤ log Σ_x P(x) · Q(x)/P(x) = log 1 = 0, by Jensen's inequality applied to the concave function log.
Non-negativity follows from Jensen's inequality applied to the concave log — this is not a special property of KL but a consequence of convexity. The asymmetry is not a defect to be fixed: forward and reverse KL encode genuinely different optimization criteria with different practical consequences, and the specific direction used in VAEs, RLHF, and knowledge distillation shapes the learned solution in qualitatively different ways.
Asymmetry: KL(P‖Q) ≠ KL(Q‖P) in general. The two directions encode different penalties:
- KL(P‖Q) (forward KL, the "M-projection"): Q must cover all regions where P has mass; otherwise the divergence is infinite. Minimizing over Q yields mean-seeking approximations (cover all modes).
- KL(Q‖P) (reverse KL, the "I-projection"): Q is penalized for having mass where P has none, but not for missing mass of P. Minimizing over Q yields mode-seeking approximations (concentrate on one mode).
KL divergence for exponential families: for P = p_{θ₁} and Q = p_{θ₂} in the same exponential family with log-partition function A:

KL(p_{θ₁} ‖ p_{θ₂}) = A(θ₂) − A(θ₁) − (θ₂ − θ₁)ᵀ ∇A(θ₁)

This is the Bregman divergence B_A(θ₂, θ₁) generated by the log-partition function A.
Gaussian–Gaussian KL: for P = N(μ₁, Σ₁) and Q = N(μ₂, Σ₂) on ℝᵈ:

KL(P‖Q) = ½ [ tr(Σ₂⁻¹ Σ₁) + (μ₂ − μ₁)ᵀ Σ₂⁻¹ (μ₂ − μ₁) − d + ln(det Σ₂ / det Σ₁) ]

For scalar Gaussians: KL(N(μ₁, σ₁²) ‖ N(μ₂, σ₂²)) = ln(σ₂/σ₁) + (σ₁² + (μ₁ − μ₂)²) / (2σ₂²) − ½.
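The closed forms above can be sanity-checked in a few lines; `kl_gauss_scalar` and `kl_gauss_mv` are hypothetical helper names, and the check exploits the fact that a diagonal-covariance multivariate KL decomposes into a sum of per-coordinate scalar KLs:

```python
import numpy as np

def kl_gauss_scalar(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)), scalar closed form."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_gauss_mv(mu1, S1, mu2, S2):
    """KL between multivariate Gaussians N(mu1, S1) and N(mu2, S2)."""
    d = len(mu1)
    S2inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + diff @ S2inv @ diff - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

# Sanity check: diagonal multivariate KL = sum of per-coordinate scalar KLs.
mu1, mu2 = np.array([0.0, 1.0]), np.array([2.0, -1.0])
s1, s2 = np.array([1.0, 0.5]), np.array([2.0, 1.5])
mv = kl_gauss_mv(mu1, np.diag(s1**2), mu2, np.diag(s2**2))
sc = sum(kl_gauss_scalar(mu1[i], s1[i], mu2[i], s2[i]) for i in range(2))
print(mv, sc)  # the two agree
```
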
Connection to Entropy and Cross-Entropy
KL(P‖Q) = H(P, Q) − H(P), where H(P) = −Σ_x P(x) log P(x) is the entropy of P and H(P, Q) = −Σ_x P(x) log Q(x) is the cross-entropy of Q relative to P.
Cross-entropy loss in classification: if P is the true label distribution (one-hot) and Q is the model's predicted distribution, then H(P, Q) = −log Q(y) for the true class y. The KL divergence measures how much worse the model is than the best possible predictor. Since H(P) is constant (zero, because the label is deterministic), minimizing cross-entropy is equivalent to minimizing KL divergence.
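A small sketch of this equivalence; the logits and class index are made up:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])
q = np.exp(logits - logits.max())
q /= q.sum()                      # model's predicted distribution (softmax)

y = 0                             # true class index; P is one-hot at y
p = np.zeros_like(q)
p[y] = 1.0

cross_entropy = -np.sum(p * np.log(q))                  # H(P, Q)
kl = np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0]))     # KL(P || Q)

print(cross_entropy, -np.log(q[y]))  # identical: CE reduces to -log q(true class)
print(kl)                            # equals CE, since H(P) = 0 for one-hot labels
```
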
f-Divergences
A general f-divergence is defined for a convex function f : (0, ∞) → ℝ with f(1) = 0:

D_f(P‖Q) = Σ_x Q(x) f(P(x)/Q(x))

By Jensen's inequality (f convex): D_f(P‖Q) ≥ f(Σ_x Q(x) · P(x)/Q(x)) = f(1) = 0.
Special cases:
| Generator f(t) | f-divergence |
|---|---|
| t log t | KL(P‖Q) |
| −log t | KL(Q‖P) (reverse KL) |
| (t − 1)² | χ² divergence |
| (√t − 1)² | Squared Hellinger distance |
| ½\|t − 1\| | Total variation distance |
| (t/2) log t − ((t + 1)/2) log((t + 1)/2) | Jensen–Shannon divergence |
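One way to check the table is to evaluate the generic f-divergence formula with a few generators and compare against the direct formulas; `f_divergence` and the two example distributions are illustrative:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions."""
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

kl_via_f  = f_divergence(p, q, lambda t: t * np.log(t))        # f(t) = t log t
rkl_via_f = f_divergence(p, q, lambda t: -np.log(t))           # f(t) = -log t
tv_via_f  = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))  # f(t) = |t-1|/2

print(kl_via_f, np.sum(p * np.log(p / q)))     # both compute KL(P||Q)
print(rkl_via_f, np.sum(q * np.log(q / p)))    # both compute KL(Q||P)
print(tv_via_f, 0.5 * np.sum(np.abs(p - q)))   # both compute TV
```
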
Total Variation Distance
TV(P, Q) = ½ Σ_x |P(x) − Q(x)| = sup_A |P(A) − Q(A)| is the maximum advantage with which any statistical test can distinguish P from Q from a single sample.
Pinsker's inequality: TV(P, Q) ≤ √(½ KL(P‖Q)).
Hellinger distance: H²(P, Q) = ½ Σ_x (√P(x) − √Q(x))². It sandwiches TV: H²(P, Q) ≤ TV(P, Q) ≤ √2 · H(P, Q).
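A quick empirical check of Pinsker's inequality and the Hellinger sandwich on random discrete distributions (the random family, sizes, and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(k):
    """A random distribution on k outcomes, bounded away from zero."""
    w = rng.random(k) + 1e-3
    return w / w.sum()

violations = 0
for _ in range(1000):
    p, q = random_dist(5), random_dist(5)
    tv = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))
    h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))  # Hellinger distance
    if tv > np.sqrt(0.5 * kl) + 1e-12:  violations += 1  # Pinsker
    if h ** 2 > tv + 1e-12:             violations += 1  # lower sandwich bound
    if tv > np.sqrt(2) * h + 1e-12:     violations += 1  # upper sandwich bound
print(violations)  # 0: all three bounds hold on every random pair
```
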
Jensen-Shannon Divergence
JSD(P‖Q) = ½ KL(P‖M) + ½ KL(Q‖M), where M = ½(P + Q) is the mixture.
Properties: symmetric, bounded by log 2 (or 1 bit), zero iff P = Q. The JS distance √JSD is a metric.
GAN connection: the Jensen–Shannon divergence arises naturally in the original GAN objective. The optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)) makes the GAN value function equal to 2 · JSD(p_data ‖ p_G) − log 4. Minimizing the GAN objective is therefore equivalent to minimizing the JS divergence between data and generator distributions.
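The identity V(D*) = 2 · JSD − log 4 can be verified directly on a toy discrete example (the two distributions are made up):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])   # "data" distribution
g = np.array([0.2, 0.3, 0.5])   # "generator" distribution

def kl(a, b):
    return np.sum(a * np.log(a / b))

m = 0.5 * (p + g)
jsd = 0.5 * kl(p, m) + 0.5 * kl(g, m)

d_star = p / (p + g)            # optimal discriminator D*(x) = p(x) / (p(x) + g(x))
value = np.sum(p * np.log(d_star)) + np.sum(g * np.log(1 - d_star))

print(value, 2 * jsd - np.log(4))  # identical: V(D*) = 2 JSD(P||G) - log 4
```
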
Worked Example
Example 1: Mode-Seeking vs Mean-Seeking
Approximate a bimodal P (a mixture of two Gaussians) with a single unimodal Gaussian Q.
Minimizing KL(P‖Q) (forward, as in maximum likelihood): Q must cover all of P's mass. The optimal Q has mean near the average of the two modes and variance large enough to cover both, yielding a broad distribution spanning the modes.
Minimizing KL(Q‖P) (reverse, as in mean-field variational inference): Q is penalized only for having mass where P is near zero. The optimal Q collapses onto one mode (mode-seeking). Standard mean-field variational inference minimizes reverse KL, which explains why it tends to underestimate uncertainty.
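A brute-force sketch of both behaviors: grid-search the single Gaussian q minimizing each KL direction against a bimodal target (the grid ranges and mixture parameters are illustrative):

```python
import numpy as np

x = np.linspace(-12, 12, 2401)

def normalize(w):
    return w / w.sum()

def gauss(mu, s):
    return normalize(np.exp(-0.5 * ((x - mu) / s) ** 2))

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Bimodal target: equal-weight mixture with modes at -3 and +3.
p = normalize(0.5 * gauss(-3.0, 0.7) + 0.5 * gauss(3.0, 0.7))

best_fwd = best_rev = None
for mu in np.linspace(-4.0, 4.0, 81):
    for s in np.linspace(0.5, 5.0, 46):
        q = gauss(mu, s)
        f, r = kl(p, q), kl(q, p)          # forward and reverse KL
        if best_fwd is None or f < best_fwd[0]: best_fwd = (f, mu, s)
        if best_rev is None or r < best_rev[0]: best_rev = (r, mu, s)

print("forward KL picks mu=%.2f sigma=%.2f" % best_fwd[1:])  # broad, between modes
print("reverse KL picks mu=%.2f sigma=%.2f" % best_rev[1:])  # narrow, on one mode
```
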
Example 2: Cross-Entropy in Language Models
For a language model predicting next-token probabilities over a vocabulary 𝒱:
The cross-entropy per token is H(P, Q) = −E_{x∼P}[log Q(x)], where P is the true data distribution and Q is the model. Since KL(P‖Q) = H(P, Q) − H(P) and H(P) is fixed, minimizing cross-entropy = minimizing KL. A model achieving a cross-entropy of 2 bits/token on English text is 2 − H(P) bits/token above the true entropy.
Perplexity = 2^(cross-entropy in bits) = e^(cross-entropy in nats). Perplexity 8 means the model is as uncertain as a uniform distribution over 8 equally likely tokens, a usefully interpretable metric.
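A sketch of the bookkeeping, starting from made-up per-token negative log-likelihoods:

```python
import numpy as np

# Token-level negative log-likelihoods (in nats) for a short evaluation
# sequence; the values are invented for illustration.
nll_nats = np.array([2.1, 0.9, 3.0, 1.4, 1.6])

ce_nats = nll_nats.mean()                 # cross-entropy per token, in nats
ce_bits = ce_nats / np.log(2)             # converted to bits
perplexity = np.exp(ce_nats)              # equivalently 2 ** ce_bits

print(ce_bits)
print(perplexity, 2 ** ce_bits)           # the two forms agree
```
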
Example 3: f-Divergence in Generative Models
Different GAN variants correspond to different f-divergences:
- Standard GAN: Jensen-Shannon
- f-GAN (Nowozin et al.): any f-divergence via the variational dual form D_f(P‖Q) ≥ sup_T E_P[T(x)] − E_Q[f*(T(x))], where f* is the Fenchel conjugate of f
- Wasserstein GAN: not an f-divergence, but an optimal transport distance — more stable training for distributions with disjoint support (where KL = ∞ and JS = log 2 are useless)
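The disjoint-support pathology in the last bullet can be seen on a three-letter toy alphabet (the encoding is illustrative):

```python
import numpy as np

# P and Q are point masses at different locations: disjoint supports,
# encoded on a three-letter alphabet {0, theta, elsewhere}.
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])

def kl(a, b):
    m = a > 0
    if np.any(b[m] == 0):
        return np.inf                      # support mismatch: KL blows up
    return float(np.sum(a[m] * np.log(a[m] / b[m])))

mix = 0.5 * (p + q)
jsd = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
tv = 0.5 * np.sum(np.abs(p - q))

print(kl(p, q))  # inf, regardless of how close the point masses are
print(jsd)       # log 2: saturated, so it carries no gradient signal
print(tv)        # 1: maximal
# By contrast, the Wasserstein-1 distance between point masses at 0 and
# theta is just |theta|, which shrinks smoothly as Q approaches P.
```
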
Where Your Intuition Breaks
The "direction" of KL divergence matters more than the magnitude. Minimizing KL(Q‖P) (reverse KL) and minimizing KL(P‖Q) (forward KL) solve different problems and can give radically different solutions. Reverse KL forces Q to be near zero wherever P is near zero, producing mode-seeking approximations that can collapse to a single mode even when P is multimodal. Forward KL forces Q to spread mass wherever P has mass, producing mean-seeking approximations that average over modes. In variational inference, reverse KL is standard because it is tractable, but the mode-seeking consequence means VI systematically underestimates posterior variance in complex posteriors. Choosing the KL direction is a modeling decision, not a mathematical convention.
KL divergence is the "cost" of using the wrong code. If the true distribution is P but you design an optimal code for Q, you use H(P, Q) bits per symbol instead of the optimal H(P). The KL divergence KL(P‖Q) = H(P, Q) − H(P) is exactly this overhead. Shannon's source coding theorem says that no code for P can compress below H(P) bits/symbol. This makes cross-entropy loss the correct training objective for any classification problem where you are trying to learn the true conditional distribution: you are minimizing the coding overhead from using the model distribution instead of the true distribution.
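A worked instance of this coding overhead, using a dyadic source so the ideal code lengths are exact:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # distribution the code was designed for

h_p = -np.sum(p * np.log2(p))             # optimal rate: H(P) bits/symbol
h_pq = -np.sum(p * np.log2(q))            # achieved rate: H(P, Q) bits/symbol
overhead = np.sum(p * np.log2(p / q))     # KL(P || Q) in bits

print(h_p)                    # 1.75 bits/symbol
print(h_pq)                   # 2.0 bits/symbol
print(h_pq - h_p, overhead)   # identical: the penalty is exactly the KL divergence
```
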
Reverse KL is the basis of variational inference. Variational inference minimizes KL(Q‖P) over Q, which is tractable because Q is restricted to a simple family (Gaussian, mean-field). The reverse direction means Q avoids regions where P is small but ignores regions of P that Q doesn't cover. This systematically underestimates posterior uncertainty in multimodal posteriors. Forward KL (as in expectation propagation) covers all modes but is harder to optimize. The choice between forward and reverse KL is a fundamental tradeoff in approximate inference.
KL divergence is infinite when the supports don't match. KL(P‖Q) = ∞ whenever P assigns positive probability to an event to which Q assigns zero probability. This makes KL impractical for comparing distributions with different supports, a common situation in generative modeling when the real and generated distributions occupy different manifolds. This is the original motivation for Wasserstein GANs: the Wasserstein distance is always finite even for distributions with disjoint support, because it uses the geometry of the underlying space rather than pointwise ratios.