Neural-Path/Notes

Bridge: ELBO & VAEs, Contrastive Learning & Rate-Distortion as Compression

The ELBO in VAEs is a KL divergence plus a reconstruction term; contrastive learning maximizes mutual information; rate-distortion theory formalizes the compression-quality tradeoff. Information theory unifies these three pillars of modern generative and self-supervised learning.

Concepts

Rate-distortion curve R(D): the minimum rate (bits/symbol) needed to compress a source to within distortion D. Practical compressors operate above the Shannon limit.

[Figure: rate-distortion curves. The Gaussian source follows $R(D) = \tfrac{1}{2}\log_2(\sigma^2/D)$; a Bernoulli(0.5) source is shown for comparison, with a practical compressor plotted a gap above the limit. For a unit-variance Gaussian at target $D = 0.300$, the limit is $R(D) \approx 0.868$ bits, and the plotted practical scheme sits about $0.800$ bits above it.]

Area below R(D) is achievable; above is not. Every compressor lives above the curve. R(0) = ∞ (lossless), R(σ²) = 0 (just output 0).
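The Gaussian curve from the figure can be evaluated directly. A minimal sketch (the function name is illustrative):

```python
import math

def gaussian_rate_distortion(variance: float, distortion: float) -> float:
    """Shannon rate-distortion function for a Gaussian source under
    squared-error distortion: R(D) = 1/2 log2(sigma^2 / D) for D < sigma^2."""
    if distortion >= variance:
        return 0.0  # just output the mean: zero bits already meet the target
    return 0.5 * math.log2(variance / distortion)

# Unit-variance source at target distortion D = 0.3:
print(f"{gaussian_rate_distortion(1.0, 0.3):.3f} bits")  # 0.868 bits
```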

When a VAE trains, it solves a compression problem: the encoder squeezes an input into a compact latent code, and the decoder reconstructs from that code. Rate-distortion theory formalizes this tradeoff precisely — any compression scheme faces a fundamental limit between the rate (bits needed to describe the code) and distortion (reconstruction quality). The VAE objective is literally a rate-distortion objective, connecting generative modeling directly to information theory.

The ELBO as Rate-Distortion

Variational Autoencoder (VAE): encoder $q_\phi(z|x)$ maps input $x$ to latent $z$; decoder $p_\theta(x|z)$ maps $z$ back to $x$. The evidence lower bound (ELBO):

$$\text{ELBO}(\phi, \theta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z)).$$

The ELBO decomposes as a reconstruction term minus a KL divergence — not two arbitrary loss terms but two sides of a rate-distortion tradeoff. The KL $\text{KL}(q_\phi(z|x) \,\|\, p(z))$ is the rate (how many bits the encoder uses beyond the prior); the reconstruction loss is the distortion. The Lagrange multiplier $\beta$ in $\beta$-VAE is exactly the rate-distortion tradeoff parameter: higher $\beta$ enforces stronger compression at the cost of worse reconstruction, pushing the latent space toward disentangled structure.

Rate-distortion interpretation: the ELBO is exactly a rate-distortion tradeoff.

  • Distortion term $-\mathbb{E}[\log p_\theta(x|z)]$: the expected reconstruction loss (cross-entropy for discrete $x$, MSE for a Gaussian decoder). Lower is better.
  • Rate term $\text{KL}(q_\phi(z|x) \,\|\, p(z))$: how many bits the encoder uses to describe $z$ given $x$ beyond the prior $p(z)$. This is the compression of $x$ into the bottleneck.
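For the standard diagonal-Gaussian encoder with a standard-normal prior, the rate term has a closed form. A sketch (NumPy; the function name is illustrative):

```python
import numpy as np

def kl_rate(mu: np.ndarray, log_var: np.ndarray) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats -- the per-example
    rate term of the ELBO, summed over latent dimensions."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# A posterior equal to the prior costs zero extra nats...
print(kl_rate(np.zeros(2), np.zeros(2)))  # 0.0
# ...while a shifted, sharpened posterior pays a positive rate.
print(round(kl_rate(np.array([1.0, 0.0]), np.array([-1.0, 0.0])), 3))  # 0.684
```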

$\beta$-VAE: introduces a weighting $\beta > 1$ on the rate term:

$$\text{ELBO}_\beta = \mathbb{E}[\log p_\theta(x|z)] - \beta \cdot \text{KL}(q_\phi(z|x) \,\|\, p(z)).$$

This is the information bottleneck objective with the prior as the reference distribution. Larger $\beta$ forces the latent code to be more compressed, encouraging disentangled representations where each latent dimension captures an independent factor.

Connection to rate-distortion: $\text{ELBO} = -(D + R)$, where $D$ is the distortion and $R = \text{KL}(q \,\|\, p)$ is the rate. The VAE training problem is: minimize distortion subject to an (implicit) rate constraint determined by the KL weight.

Mutual Information Maximization in Self-Supervised Learning

Contrastive learning (SimCLR, MoCo, CLIP) trains representations by maximizing mutual information between different views of the same input.

InfoNCE bound (Oord et al. 2018): given a positive pair $(x, x^+)$ and $K-1$ negative samples $x_1^-, \ldots, x_{K-1}^-$, the InfoNCE loss is:

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\!\left[\log\frac{e^{f(x)^T g(x^+)}}{e^{f(x)^T g(x^+)} + \sum_{k=1}^{K-1} e^{f(x)^T g(x_k^-)}}\right].$$

Mutual information bound: $I(X; X^+) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$.

Minimizing $\mathcal{L}_{\text{InfoNCE}}$ therefore maximizes a lower bound on the mutual information between the two views. As $K \to \infty$, the bound becomes tight: $I(X; X^+) = \log K - \mathcal{L}_{\text{InfoNCE}} + o(1)$.
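The loss and the bound are easy to compute for a single anchor. A sketch with raw dot-product scores (taking $f$ and $g$ as identity maps, which the general form does not require):

```python
import numpy as np

def info_nce(anchor, positive, negatives):
    """InfoNCE loss in nats for one anchor: -log softmax score of the
    positive among the positive plus K-1 negatives, dot-product similarity."""
    scores = np.concatenate(([anchor @ positive], negatives @ anchor))
    scores -= scores.max()  # subtract max for numerical stability
    return float(-np.log(np.exp(scores[0]) / np.exp(scores).sum()))

rng = np.random.default_rng(0)
d, K = 8, 64
x = rng.normal(size=d)
x_pos = x + 0.1 * rng.normal(size=d)   # correlated "second view"
x_neg = rng.normal(size=(K - 1, d))    # independent negatives
loss = info_nce(x, x_pos, x_neg)
print(f"loss = {loss:.3f} nats, MI bound I >= {np.log(K) - loss:.3f} nats")
```

Since the softmax probability of the positive is at most 1, the loss is nonnegative, so the bound can never exceed $\log K$ — a point the section below returns to.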

Why this works: a good encoder $f$ should map the same underlying content to nearby representations (high MI between views) while mapping different content to distant representations. The InfoNCE objective learns this structure without requiring labels.

CLIP interpretation: CLIP maximizes $I(\text{image}; \text{text})$ across matching image-text pairs, using the non-matching pairs in each batch as InfoNCE negatives — the same framework at scale.

Minimum Description Length and Bayesian Model Selection

Minimum Description Length (MDL) principle: the best model is the one that minimizes the total code length needed to describe both the data and the model.

Two-part code: total description length = description of model + description of data given model:

$$L(M) + L(X \mid M) = -\log P(M) - \log P(X \mid M) = -\log\big(P(M)\,P(X \mid M)\big) = -\log P(M, X).$$

Minimizing this is equivalent to MAP estimation: $\hat M = \arg\max_M P(M \mid X)$, where $P(M \mid X) \propto P(X \mid M)\,P(M)$.

Refined MDL (Rissanen): use the normalized maximum likelihood (NML) code rather than a specific prior. The NML code length is:

$$-\log p(x^n; \hat\theta(x^n)) + \log \int p(y^n; \hat\theta(y^n))\,dy^n.$$

The second term is the parametric complexity — the extra description length incurred by a model with $k$ free parameters, asymptotically $\frac{k}{2}\log\frac{n}{2\pi} + O(1)$ (the constant involves the Fisher information geometry). Keeping only the leading term recovers the BIC: $\text{BIC} = -2\log p(x \mid M, \hat\theta) + k\log n$.

Information-Theoretic Analysis of Generalization

PAC-Bayes meets information theory: the mutual information between the training data $S$ and the learned hypothesis $W$ (the trained model) controls generalization. For a loss bounded in $[0, 1]$,

$$\mathbb{E}[L(W) - \hat L(W)] \leq \sqrt{\frac{I(S; W)}{2n}}.$$

Interpretation: if training reveals little about the specific dataset $S$ (small $I(S;W)$), the model generalizes well. Stochastic training (dropout, noise injection) limits $I(S;W)$ by introducing randomness — one reason stochastic regularizers improve generalization.
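As a quick sketch of the bound's scale (function name illustrative; loss assumed bounded in $[0,1]$):

```python
import math

def mi_generalization_bound(mi_nats: float, n: int) -> float:
    """Bound on the expected generalization gap: sqrt(I(S; W) / (2n)),
    valid for losses bounded in [0, 1]."""
    return math.sqrt(mi_nats / (2 * n))

# Training that leaks 2 nats about a dataset of 10,000 examples:
print(mi_generalization_bound(2.0, 10_000))  # ≈ 0.01
```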

Compressibility and generalization: the number of bits needed to describe the learned model (its description length) also bounds generalization. Models that can be compressed to $k$ bits generalize with $O(k/n)$ excess risk in the realizable case.

Worked Example

Example 1: VAE Latent Space Geometry

For a VAE trained on MNIST digits ($x \in \mathbb{R}^{784}$, 10 classes) with a 2D latent space:

Before training (encoder initialized near the prior): $\text{KL}(q(z|x) \,\|\, p(z)) \approx 0$ — the posterior matches the prior, so no information is compressed.

After training: different digits form separate clusters in latent space. The KL term is $\approx 2$ nats/digit — roughly the information needed to encode which digit, out of a maximum of $\log 10 \approx 2.3$ nats. The decoder achieves near-perfect reconstruction from these 2 nats of compressed information.

For $\beta$-VAE with $\beta = 4$: higher compression. The latent space is better disentangled (separate dimensions for stroke width, tilt, style) but reconstruction loss is higher. The rate-distortion tradeoff is visible directly in the ELBO components.

Example 2: InfoNCE Sample Efficiency

For SimCLR with $K = 256$ negatives per positive pair, the MI lower bound (converted to bits, with $\mathcal{L}$ measured in nats) is $I(\text{view}_1; \text{view}_2) \geq \log_2 256 - \mathcal{L}/\ln 2 = 8 - \mathcal{L}/\ln 2$ bits.

A trained model achieving $\mathcal{L}_{\text{InfoNCE}} = 0.5$ nats at $K = 256$ implies $I \geq 8 - 0.5/\ln 2 \approx 7.3$ bits between the two augmented views of the same image. This high mutual information means the encoder successfully ignores augmentation-specific details (crop position, color jitter) while retaining content information.

Increasing $K$ tightens the bound: at $K = 65536$ (MoCo's momentum queue), the bound is $\geq 16 - \mathcal{L}/\ln 2$ bits — more informative, but the loss is harder to optimize.
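The unit bookkeeping above (loss in nats, bound in bits) is worth making explicit in a helper (name illustrative):

```python
import math

def infonce_mi_bound_bits(loss_nats: float, K: int) -> float:
    """InfoNCE MI lower bound in bits: log2(K) - loss_in_nats / ln 2."""
    return math.log2(K) - loss_nats / math.log(2)

print(f"{infonce_mi_bound_bits(0.5, 256):.2f} bits")    # 7.28 bits
print(f"{infonce_mi_bound_bits(0.5, 65536):.2f} bits")  # 15.28 bits
```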

Example 3: MDL for Model Selection

Compare a linear model (2 parameters) against a degree-5 polynomial (6 parameters) on $n = 50$ data points:

Log-likelihoods at the MLE: $\log p(x \mid \hat\theta_{\text{linear}}) = -80$ nats, $\log p(x \mid \hat\theta_{\text{poly}}) = -72$ nats.

Complexity penalties $(k/2)\log n$: linear: $(2/2)\log 50 \approx 3.9$ nats; poly5: $(6/2)\log 50 \approx 11.7$ nats.

MDL scores (fit plus penalty, in nats): linear: $80 + 3.9 = 83.9$; poly5: $72 + 11.7 = 83.7$.

MDL selects the polynomial, but only just: the 8-nat improvement in fit barely outweighs the 7.8-nat extra complexity cost. With only 50 points and 6 parameters, the polynomial scarcely earns its complexity.
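The arithmetic can be checked with a two-line scoring function (half-BIC / two-part-MDL form, in nats; name illustrative):

```python
import math

def mdl_score(neg_log_lik_nats: float, k: int, n: int) -> float:
    """Fit plus complexity penalty: -log p(x | theta_hat) + (k/2) ln n."""
    return neg_log_lik_nats + 0.5 * k * math.log(n)

linear = mdl_score(80.0, k=2, n=50)   # 80 + 3.9  ≈ 83.9 nats
poly5  = mdl_score(72.0, k=6, n=50)   # 72 + 11.7 ≈ 83.7 nats
print(f"linear: {linear:.1f}, poly5: {poly5:.1f}")
assert poly5 < linear  # the polynomial wins, by only ~0.2 nats
```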

Connections

Where Your Intuition Breaks

"Maximizing mutual information" sounds theoretically grounded as a self-supervised objective, but the InfoNCE lower bound $I(X;X^+) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$ is capped at $\log K$ and is tight only when $K$ is large. With small batch sizes (small $K$), the bound is loose and the gradient signal comes primarily from estimating the bound rather than from maximizing MI. This is why contrastive methods lean on scale: SimCLR benefits from batch sizes in the thousands (up to 8192 in the original paper), and MoCo maintains a queue of 65,536 negatives. Contrastive methods that claim to "maximize mutual information" with small batches are not maximizing MI — they are minimizing a loose lower bound that bears little relationship to the true MI at small $K$.

💡Intuition

VAE training is automatic compression to the information bottleneck. The KL term in the ELBO enforces that the latent code stays close to the prior — it limits how many bits the encoder can use per example. The reconstruction term pushes the encoder to use those bits wisely. Together they find the optimal tradeoff on the rate-distortion curve without explicitly computing $R(D)$. The $\beta$ parameter in $\beta$-VAE slides along this curve: high $\beta$ forces a lower rate (more compression), low $\beta$ allows a higher rate (lower distortion, less disentanglement).

💡Intuition

Contrastive learning is self-supervised MI maximization. Without labels, the only training signal is the structural assumption that different views of the same input share semantic content. InfoNCE formalizes this: learn an encoder that maximizes MI between views. The key insight is that maximizing $I(\text{view}_1; \text{view}_2)$ forces the encoder to extract view-invariant features — exactly the semantic content we want. This is why representations learned via contrastive objectives transfer well: they capture genuinely high-MI structure in the data.

⚠️Warning

Rate-distortion bounds constrain compression, not model quality. A language model with 70B parameters at 2-bit quantization uses $\approx 17.5$ GB of storage (140 gigabits). The rate-distortion function for a Gaussian weight distribution bounds the minimum number of bits needed to represent the weights to within some MSE on the weights — it says nothing about model quality. Quantized-model performance is not well described by classical rate-distortion theory because the loss that matters (perplexity on text) is a highly nonlinear function of the weights, not their MSE. Better compression analyses for neural network inference use task loss as the distortion measure, which motivates mixed-precision quantization schemes.
