Bridge: ELBO & VAEs, Contrastive Learning & Rate-Distortion as Compression
The ELBO in VAEs is a reconstruction term minus a KL divergence; contrastive learning maximizes mutual information; rate-distortion theory formalizes the compression-quality tradeoff. Information theory unifies these three pillars of modern generative and self-supervised learning.
Concepts
Rate-distortion curve R(D): the minimum rate (bits/symbol) needed to compress a source to within distortion D. Practical compressors operate above the Shannon limit.
The region above the R(D) curve is achievable; the region below it is not. Every compressor lives on or above the curve. For a Gaussian source with variance $\sigma^2$ under squared-error distortion, $R(0) = \infty$ (lossless coding of a continuous source needs infinite rate) and $R(\sigma^2) = 0$ (just output the mean).
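A quick numerical check of these limiting cases, as a minimal sketch assuming a unit-variance Gaussian source under squared-error distortion, where $R(D) = \max\{\tfrac{1}{2}\log(\sigma^2/D),\, 0\}$:

```python
import numpy as np

def gaussian_rate_distortion(D, sigma2=1.0):
    """R(D) for a Gaussian source with variance sigma2 under MSE distortion.

    R(D) = 0.5 * log(sigma2 / D) nats for 0 < D < sigma2, and 0 otherwise.
    """
    return max(0.5 * np.log(sigma2 / D), 0.0)

for D in (0.01, 0.1, 0.5, 1.0):
    R = gaussian_rate_distortion(D)
    print(f"D = {D:4.2f}  ->  R(D) = {R:.3f} nats ({R / np.log(2):.3f} bits)")
# As D -> 0 the required rate blows up (lossless coding of a continuous
# source needs infinite rate); at D = sigma^2 the rate is zero (output the mean).
```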
When a VAE trains, it solves a compression problem: the encoder squeezes an input into a compact latent code, and the decoder reconstructs from that code. Rate-distortion theory formalizes this tradeoff precisely — any compression scheme faces a fundamental limit between the rate (bits needed to describe the code) and distortion (reconstruction quality). The VAE objective is literally a rate-distortion objective, connecting generative modeling directly to information theory.
The ELBO as Rate-Distortion
Variational Autoencoder (VAE): encoder $q_\phi(z \mid x)$ maps input $x$ to latent $z$; decoder $p_\theta(x \mid z)$ maps back to a reconstruction $\hat{x}$. The evidence lower bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
The ELBO decomposes as reconstruction term minus KL divergence — not as two separate loss terms but as two sides of a rate-distortion tradeoff. The KL is the rate (how many bits the encoder uses beyond the prior); the reconstruction loss is the distortion. The Lagrange multiplier $\beta$ in $\beta$-VAE is exactly the rate-distortion tradeoff parameter: higher $\beta$ enforces stronger compression at the cost of worse reconstruction, pushing the latent space toward disentangled structure.
Rate-distortion interpretation: the ELBO is exactly a rate-distortion tradeoff.
- Distortion term $-\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$: the expected reconstruction loss (cross-entropy for discrete $x$, MSE for a Gaussian decoder). Lower is better.
- Rate term $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$: the KL divergence measures how many bits the encoder uses to describe $z$ given $x$ beyond the prior $p(z)$. This is the compression of $x$ into the bottleneck $z$.
$\beta$-VAE: introduces a weighting $\beta$ on the rate term:

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta\, D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
This is the information bottleneck objective with the prior $p(z)$ as the reference distribution. Larger $\beta$ forces the latent code to be more compressed, learning disentangled representations where each latent dimension captures an independent factor.
Connection to rate-distortion: maximizing $\mathcal{L}_\beta$ is the same as minimizing $D + \beta R$, where $D = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ is the distortion and $R = D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ is the rate. The VAE training problem is: minimize distortion subject to an (implicit) rate constraint determined by the KL weight $\beta$.
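To make the split concrete, here is a minimal PyTorch-style sketch (names like `beta_vae_loss`, `mu`, and `logvar` are illustrative, assuming a Gaussian encoder with a standard-normal prior and a Bernoulli decoder) that returns the distortion and rate terms separately:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative beta-ELBO split into distortion (reconstruction) and rate (KL).

    Assumes q(z|x) = N(mu, diag(exp(logvar))) and prior p(z) = N(0, I),
    so the KL term has a closed form; x and x_recon are in [0, 1].
    """
    batch = x.size(0)

    # Distortion: expected reconstruction loss, in nats per example
    # (Bernoulli cross-entropy here; use MSE for a Gaussian decoder).
    distortion = F.binary_cross_entropy(x_recon, x, reduction="sum") / batch

    # Rate: KL( q(z|x) || N(0, I) ), in nats per example.
    rate = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch

    return distortion + beta * rate, distortion, rate
```

Logging `rate` and `distortion` separately over training (and across several values of `beta`) traces out the model's operational rate-distortion curve.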
Mutual Information Maximization in Self-Supervised Learning
Contrastive learning (SimCLR, MoCo, CLIP) trains representations by maximizing mutual information between different views of the same input.
InfoNCE bound (Oord et al. 2018): given a positive pair $(x, y^{+})$ and $N - 1$ negative samples $\{y_i^{-}\}$, the InfoNCE loss is:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp f(x, y^{+})}{\sum_{i=1}^{N} \exp f(x, y_i)}\right]$$

where $f$ is a learned scoring function (e.g., temperature-scaled cosine similarity of the two embeddings) and the sum runs over the positive and the $N - 1$ negatives.
Mutual information bound: $I(x; y) \geq \log N - \mathcal{L}_{\mathrm{InfoNCE}}$.
Minimizing $\mathcal{L}_{\mathrm{InfoNCE}}$ maximizes a lower bound on the mutual information between the two views. As $N \to \infty$, the bound becomes tight: $\log N - \mathcal{L}_{\mathrm{InfoNCE}} \to I(x; y)$.
Why this works: a good encoder should map the same underlying content to nearby representations (high MI between views), while mapping different content to distant representations. The InfoNCE objective learns this structure without requiring labels.
CLIP interpretation: CLIP maximizes the score $f(\text{image}, \text{text})$ for matching image-text pairs while minimizing it for non-matching pairs, using the InfoNCE framework at scale — implicitly maximizing $I(\text{image}; \text{text})$.
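A minimal sketch of this objective with in-batch negatives and a temperature-scaled cosine critic (the helper `info_nce` and its defaults are illustrative, not the exact SimCLR or CLIP implementation):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).

    Every z2[j] with j != i acts as a negative for z1[i], so N is the batch
    size. Returns the loss and the implied MI lower bound log N - loss (nats).
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # [N, N] similarity scores
    labels = torch.arange(z1.size(0))            # positives sit on the diagonal
    loss = F.cross_entropy(logits, labels)       # -log softmax at the positive
    mi_lower_bound = torch.log(torch.tensor(float(z1.size(0)))) - loss
    return loss.item(), mi_lower_bound.item()

# Random features give a bound near zero; a trained encoder pushes it up.
loss, bound = info_nce(torch.randn(256, 128), torch.randn(256, 128))
print(f"loss = {loss:.3f} nats, certified MI >= {bound:.3f} nats")
```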
Minimum Description Length and Bayesian Model Selection
Minimum Description Length (MDL) principle: the best model is the one that minimizes the total code length needed to describe both the data and the model.
Two-part code: total description length = description of model + description of data given model:

$$L(\mathcal{D}, \mathcal{M}) = L(\mathcal{M}) + L(\mathcal{D} \mid \mathcal{M}) = -\log P(\mathcal{M}) - \log P(\mathcal{D} \mid \mathcal{M})$$
Minimizing this total code length is equivalent to MAP estimation: $\hat{\mathcal{M}} = \arg\max_{\mathcal{M}} P(\mathcal{M})\, P(\mathcal{D} \mid \mathcal{M})$.
Refined MDL (Rissanen): use the normalized maximum likelihood (NML) code rather than a specific prior. The NML code length is:

$$L_{\mathrm{NML}}(\mathcal{D}) = -\log p\left(\mathcal{D} \mid \hat{\theta}(\mathcal{D})\right) + \log \sum_{\mathcal{D}'} p\left(\mathcal{D}' \mid \hat{\theta}(\mathcal{D}')\right)$$
The second term is the parametric complexity — the extra bits needed by a model with $k$ free parameters: asymptotically $\frac{k}{2}\log\frac{n}{2\pi} + \log \int \sqrt{\det I(\theta)}\, d\theta$ (from Fisher information geometry). Keeping the leading term recovers the BIC penalty: $\frac{k}{2}\log n$.
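The following sketch shows the tradeoff numerically on synthetic, truly linear data (the helper `description_length` and the noise level are assumptions for illustration, using the BIC approximation rather than the exact NML sum):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.5 * x + 0.3 + rng.normal(scale=0.2, size=x.size)   # truly linear data

def description_length(y, y_hat, k, n):
    """Two-part code length in nats: -max log-likelihood + (k/2) ln n.

    Assumes Gaussian residuals with the MLE noise variance.
    """
    sigma2 = np.mean((y - y_hat) ** 2)
    neg_loglik = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return neg_loglik + 0.5 * k * np.log(n)

for degree in (1, 5):
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    L = description_length(y, y_hat, k=degree + 1, n=x.size)
    print(f"degree {degree}: description length ~ {L:.1f} nats")
# The degree-5 fit shaves a little off the residuals but pays a larger
# (k/2) ln n penalty, so the two-part code typically prefers the linear model.
```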
Information-Theoretic Analysis of Generalization
PAC-Bayes meets information theory: the mutual information $I(S; W)$ between the training data $S$ and the learned hypothesis $W$ (the trained model) controls generalization. For a $\sigma$-sub-Gaussian loss (Xu & Raginsky, 2017):

$$\left|\mathbb{E}\left[\text{generalization gap}\right]\right| \leq \sqrt{\frac{2\sigma^2\, I(S; W)}{n}}$$
Interpretation: if training reveals little about the specific dataset (small $I(S; W)$), the model generalizes well. Stochastic training (dropout, noise injection) limits $I(S; W)$ by introducing randomness — one reason stochastic regularizers improve generalization.
Compressibility and generalization: the number of bits needed to describe the learned model (its algorithmic complexity) bounds generalization. A model that can be compressed to $k$ bits generalizes with $O\!\left(\sqrt{k/n}\right)$ excess risk.
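A back-of-the-envelope sketch of the mutual-information bound above; the plugged-in numbers are illustrative assumptions, not measurements of any real training run:

```python
import numpy as np

def info_generalization_bound(mi_nats, n, sigma=1.0):
    """Bound on the expected generalization gap: sqrt(2 * sigma^2 * I(S; W) / n).

    Assumes the per-example loss is sigma-sub-Gaussian; mi_nats is I(S; W)
    in nats between the training set S and the learned weights W.
    """
    return np.sqrt(2 * sigma**2 * mi_nats / n)

# If training leaks ~100 nats about a 50,000-example dataset, the gap is small.
print(f"{info_generalization_bound(mi_nats=100.0, n=50_000):.3f}")   # ~0.063
```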
Worked Examples
Example 1: VAE Latent Space Geometry
For a VAE trained on MNIST digits ($28 \times 28$ grayscale images, 10 classes) with a 2D latent space:
Before training (random encoder): $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z)) \approx 0$ (posterior ≈ prior — no information compressed).
After training: different digits form separate clusters in latent space. The KL term is around 2 nats/digit (roughly the information needed to encode which digit, out of $\ln 10 \approx 2.3$ nats maximum for class identity). The decoder achieves near-perfect reconstruction of digit identity using these ~2 nats of compressed information.
For $\beta$-VAE with $\beta > 1$: higher compression. The latent space is better disentangled (separate dimensions for stroke width, tilt, style) but the reconstruction loss is higher. The rate-distortion tradeoff is directly visible in the two ELBO components.
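To see where per-example KL numbers like these come from, here is a small sketch of the closed-form rate term for a diagonal Gaussian posterior against a standard-normal prior (the means and variances below are made up for illustration, not trained MNIST values):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats, per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

# Untrained encoder: posterior ~= prior, so the rate is ~0 nats.
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))          # 0.0

# Trained encoder: a tight cluster centered away from the origin pays a few
# nats of rate -- on the order of the cost of encoding the digit identity.
print(kl_to_standard_normal(np.array([2.0, 0.0]),
                            np.log(np.array([0.1, 0.1]))))      # ~3.4
```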
Example 2: InfoNCE Sample Efficiency
For SimCLR with $N$ negatives per positive pair: the MI lower bound satisfies $I(z_1; z_2) \geq \log N - \mathcal{L}_{\mathrm{InfoNCE}}$, so it can never certify more than $\log_2 N$ bits.
A trained model that drives the InfoNCE loss well below $\log N$ therefore certifies a correspondingly large mutual information between the two augmented views of the same image. This high mutual information means the encoder successfully ignores augmentation-specific details (crop position, color jitter) while retaining content information.
Increasing $N$ tightens the bound and raises this ceiling: at $N = 65{,}536$ (the size of MoCo's momentum queue), the bound can certify up to $\log_2 65{,}536 = 16$ bits — more powerful but harder to optimize.
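The $\log N$ ceiling is easy to tabulate; a quick sketch (the values of $N$ are chosen for illustration):

```python
import numpy as np

# The InfoNCE estimate can never exceed log N, so the number of negatives
# caps how much mutual information the bound can certify.
for n_negatives in (256, 4_096, 65_536):
    print(f"N = {n_negatives:6d}  ->  ceiling = {np.log2(n_negatives):4.1f} bits"
          f" ({np.log(n_negatives):5.2f} nats)")
```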
Example 3: MDL for Model Selection
Compare a linear model (2 parameters) vs a degree-5 polynomial model (6 parameters) on $n = 50$ data points:
Log-likelihoods at MLE: the degree-5 polynomial fits better than the linear model by about 8 nats.
BIC penalty ($\tfrac{k}{2} \ln n$): linear: $\tfrac{2}{2} \ln 50 \approx 3.9$ nats; poly5: $\tfrac{6}{2} \ln 50 \approx 11.7$ nats (an extra complexity cost of about 7.8 nats).
BIC scores ($-\hat{\ell} + \tfrac{k}{2} \ln n$): the polynomial's score is lower (better) by roughly $8 - 7.8 \approx 0.2$ nats.
BIC selects the polynomial: the 8-nat improvement in fit outweighs the 7.8-nat complexity cost. Note that with only 50 points and 6 parameters, the polynomial barely earns its complexity.
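The penalty arithmetic behind this comparison, as a quick check (the ~8-nat fit improvement is taken from the example above, not recomputed):

```python
import numpy as np

n = 50
penalty_linear = 0.5 * 2 * np.log(n)   # (k/2) ln n with k = 2  -> ~3.9 nats
penalty_poly5 = 0.5 * 6 * np.log(n)    # (k/2) ln n with k = 6  -> ~11.7 nats
extra_cost = penalty_poly5 - penalty_linear
print(f"extra complexity cost: {extra_cost:.1f} nats vs ~8.0 nats better fit")
# 7.8 < 8.0, so BIC picks the degree-5 polynomial -- but only just.
```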
Connections
Where Your Intuition Breaks
"Maximizing mutual information" sounds theoretically grounded as a self-supervised learning objective, but the InfoNCE lower bound is tight only when is large. With small batch sizes (small ), the bound is loose and the gradient signal comes primarily from estimating the bound rather than from maximizing MI. This is why SimCLR requires batch sizes of 65,536 to work well: the bound only becomes informative at scale. Contrastive learning methods that claim to "maximize mutual information" in small batches are not maximizing MI — they are minimizing a noise-dominated lower bound that has little relationship to the true MI at small .
VAE training is automatic compression to the information bottleneck. The KL term in the ELBO enforces that the latent code stays close to the prior — it limits how many bits the encoder can use per example. The reconstruction term pushes the encoder to use those bits wisely. Together they find the optimal tradeoff on the rate-distortion curve, without explicitly computing $R(D)$. The parameter $\beta$ in $\beta$-VAE slides along this curve: high $\beta$ forces a lower rate (more compression), low $\beta$ allows a higher rate (lower distortion, less disentanglement).
Contrastive learning is self-supervised MI maximization. Without labels, the only training signal is the structural assumption that different views of the same input share semantic content. InfoNCE formalizes this: learn an encoder that maximizes MI between views. The key insight is that maximizing $I(z_1; z_2)$ forces the encoder to extract view-invariant features — exactly the semantic content we want. This is why representations learned via contrastive objectives transfer well: they have captured genuinely high-MI structure in the data.
Rate-distortion bounds do not constrain model quality — only compression. A language model with 70B parameters and 2-bit quantization uses about 17.5 GB of storage. The rate-distortion bound for Gaussian weight distributions bounds the minimum number of bits to represent the weights to within some MSE on the weights — not the model quality. Weight quantization performance is not well-described by rate-distortion theory because the loss (perplexity on text) is a highly nonlinear function of the weights, not simply their MSE. Better compression bounds for neural network inference use task loss as the distortion measure, leading to mixed-precision quantization schemes.