Neural-Path/Notes

Bridge: ELBO & VAEs, Contrastive Learning & Rate-Distortion as Compression

The ELBO in VAEs is a KL divergence plus a reconstruction term; contrastive learning maximizes mutual information; rate-distortion theory formalizes the compression-quality tradeoff. Information theory unifies these three pillars of modern generative and self-supervised learning.

Concepts

Rate-distortion curve R(D): the minimum rate (bits/symbol) needed to compress a source to within distortion D. Practical compressors operate above the Shannon limit.

[Figure: rate-distortion curves. The Gaussian source follows $R(D) = \tfrac{1}{2}\log_2(\sigma^2/D)$; a Bernoulli(0.5) source is shown for comparison, with a practical compressor plotted a gap above the limit. For a unit-variance Gaussian at target $D = 0.300$, the limit is $R(D) \approx 0.868$ bits, and the plotted practical scheme sits about $0.800$ bits above it.]

Area below R(D) is achievable; above is not. Every compressor lives above the curve. R(0) = ∞ (lossless), R(σ²) = 0 (just output 0).
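The Gaussian curve from the figure can be evaluated directly. A minimal sketch (the function name is illustrative):

```python
import math

def gaussian_rate_distortion(variance: float, distortion: float) -> float:
    """Shannon rate-distortion function for a Gaussian source under
    squared-error distortion: R(D) = 1/2 log2(sigma^2 / D) for D < sigma^2."""
    if distortion >= variance:
        return 0.0  # just output the mean: zero bits already meet the target
    return 0.5 * math.log2(variance / distortion)

# Unit-variance source at target distortion D = 0.3:
print(f"{gaussian_rate_distortion(1.0, 0.3):.3f} bits")  # 0.868 bits
```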

When a VAE trains, it solves a compression problem: the encoder squeezes an input into a compact latent code, and the decoder reconstructs from that code. Rate-distortion theory formalizes this tradeoff precisely — any compression scheme faces a fundamental limit between the rate (bits needed to describe the code) and distortion (reconstruction quality). The VAE objective is literally a rate-distortion objective, connecting generative modeling directly to information theory.

The ELBO as Rate-Distortion

Variational Autoencoder (VAE): encoder $q_\phi(z|x)$ maps input $x$ to latent $z$; decoder $p_\theta(x|z)$ maps $z$ back to $x$. The evidence lower bound (ELBO):

$$\text{ELBO}(\phi, \theta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z)).$$

The ELBO decomposes as a reconstruction term minus a KL divergence — not two arbitrary loss terms but two sides of a rate-distortion tradeoff. The KL $\text{KL}(q_\phi(z|x) \,\|\, p(z))$ is the rate (how many bits the encoder uses beyond the prior); the reconstruction loss is the distortion. The Lagrange multiplier $\beta$ in $\beta$-VAE is exactly the rate-distortion tradeoff parameter: higher $\beta$ enforces stronger compression at the cost of worse reconstruction, pushing the latent space toward disentangled structure.

Rate-distortion interpretation: the ELBO is exactly a rate-distortion tradeoff.

  • Distortion term $-\mathbb{E}[\log p_\theta(x|z)]$: the expected reconstruction loss (cross-entropy for discrete $x$, MSE for a Gaussian decoder). Lower is better.
  • Rate term $\text{KL}(q_\phi(z|x) \,\|\, p(z))$: how many bits the encoder uses to describe $z$ given $x$ beyond the prior $p(z)$. This is the compression of $x$ into the bottleneck.
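For the standard diagonal-Gaussian encoder with a standard-normal prior, the rate term has a closed form. A sketch (NumPy; the function name is illustrative):

```python
import numpy as np

def kl_rate(mu: np.ndarray, log_var: np.ndarray) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats -- the per-example
    rate term of the ELBO, summed over latent dimensions."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# A posterior equal to the prior costs zero extra nats...
print(kl_rate(np.zeros(2), np.zeros(2)))  # 0.0
# ...while a shifted, sharpened posterior pays a positive rate.
print(round(kl_rate(np.array([1.0, 0.0]), np.array([-1.0, 0.0])), 3))  # 0.684
```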

$\beta$-VAE: introduces a weighting $\beta > 1$ on the rate term:

$$\text{ELBO}_\beta = \mathbb{E}[\log p_\theta(x|z)] - \beta \cdot \text{KL}(q_\phi(z|x) \,\|\, p(z)).$$

This is the information bottleneck objective with the prior as the reference distribution. Larger $\beta$ forces the latent code to be more compressed, encouraging disentangled representations where each latent dimension captures an independent factor.

Connection to rate-distortion: $\text{ELBO} = -(D + R)$, where $D$ is the distortion and $R = \text{KL}(q \,\|\, p)$ is the rate. The VAE training problem is: minimize distortion subject to an (implicit) rate constraint determined by the KL weight.

Mutual Information Maximization in Self-Supervised Learning

Contrastive learning (SimCLR, MoCo, CLIP) trains representations by maximizing mutual information between different views of the same input.

InfoNCE bound (Oord et al. 2018): given a positive pair $(x, x^+)$ and $K-1$ negative samples $x_1^-, \ldots, x_{K-1}^-$, the InfoNCE loss is:

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\!\left[\log\frac{e^{f(x)^T g(x^+)}}{e^{f(x)^T g(x^+)} + \sum_{k=1}^{K-1} e^{f(x)^T g(x_k^-)}}\right].$$

Mutual information bound: $I(X; X^+) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$.

Minimizing $\mathcal{L}_{\text{InfoNCE}}$ therefore maximizes a lower bound on the mutual information between the two views. As $K \to \infty$, the bound becomes tight: $I(X; X^+) = \log K - \mathcal{L}_{\text{InfoNCE}} + o(1)$.
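The loss and the bound are easy to compute for a single anchor. A sketch with raw dot-product scores (taking $f$ and $g$ as identity maps, which the general form does not require):

```python
import numpy as np

def info_nce(anchor, positive, negatives):
    """InfoNCE loss in nats for one anchor: -log softmax score of the
    positive among the positive plus K-1 negatives, dot-product similarity."""
    scores = np.concatenate(([anchor @ positive], negatives @ anchor))
    scores -= scores.max()  # subtract max for numerical stability
    return float(-np.log(np.exp(scores[0]) / np.exp(scores).sum()))

rng = np.random.default_rng(0)
d, K = 8, 64
x = rng.normal(size=d)
x_pos = x + 0.1 * rng.normal(size=d)   # correlated "second view"
x_neg = rng.normal(size=(K - 1, d))    # independent negatives
loss = info_nce(x, x_pos, x_neg)
print(f"loss = {loss:.3f} nats, MI bound I >= {np.log(K) - loss:.3f} nats")
```

Since the softmax probability of the positive is at most 1, the loss is nonnegative, so the bound can never exceed $\log K$ — a point the section below returns to.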

Why this works: a good encoder $f$ should map the same underlying content to nearby representations (high MI between views) while mapping different content to distant representations. The InfoNCE objective learns this structure without requiring labels.

CLIP interpretation: CLIP maximizes $I(\text{image}; \text{text})$ across matching image-text pairs, using the non-matching pairs in each batch as InfoNCE negatives — the same framework at scale.

Minimum Description Length and Bayesian Model Selection

Minimum Description Length (MDL) principle: the best model is the one that minimizes the total code length needed to describe both the data and the model.

Two-part code: total description length = description of model + description of data given model:

$$L(M) + L(X \mid M) = -\log P(M) - \log P(X \mid M) = -\log\big(P(M)\,P(X \mid M)\big) = -\log P(M, X).$$

Minimizing this is equivalent to MAP estimation: $\hat M = \arg\max_M P(M \mid X)$, where $P(M \mid X) \propto P(X \mid M)\,P(M)$.

Refined MDL (Rissanen): use the normalized maximum likelihood (NML) code rather than a specific prior. The NML code length is:

$$-\log p(x^n; \hat\theta(x^n)) + \log \int p(y^n; \hat\theta(y^n))\,dy^n.$$

The second term is the parametric complexity — the extra description length incurred by a model with $k$ free parameters, asymptotically $\frac{k}{2}\log\frac{n}{2\pi} + O(1)$ (the constant involves the Fisher information geometry). Keeping only the leading term recovers the BIC: $\text{BIC} = -2\log p(x \mid M, \hat\theta) + k\log n$.

Information-Theoretic Analysis of Generalization

PAC-Bayes meets information theory: the mutual information between the training data $S$ and the learned hypothesis $W$ (the trained model) controls generalization. For a loss bounded in $[0, 1]$,

$$\mathbb{E}[L(W) - \hat L(W)] \leq \sqrt{\frac{I(S; W)}{2n}}.$$

Interpretation: if training reveals little about the specific dataset $S$ (small $I(S;W)$), the model generalizes well. Stochastic training (dropout, noise injection) limits $I(S;W)$ by introducing randomness — one reason stochastic regularizers improve generalization.
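As a quick sketch of the bound's scale (function name illustrative; loss assumed bounded in $[0,1]$):

```python
import math

def mi_generalization_bound(mi_nats: float, n: int) -> float:
    """Bound on the expected generalization gap: sqrt(I(S; W) / (2n)),
    valid for losses bounded in [0, 1]."""
    return math.sqrt(mi_nats / (2 * n))

# Training that leaks 2 nats about a dataset of 10,000 examples:
print(mi_generalization_bound(2.0, 10_000))  # ≈ 0.01
```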

Compressibility and generalization: the number of bits needed to describe the learned model (its description length) also bounds generalization. Models that can be compressed to $k$ bits generalize with $O(k/n)$ excess risk in the realizable case.

Worked Example

Example 1: VAE Latent Space Geometry

For a VAE trained on MNIST digits ($x \in \mathbb{R}^{784}$, 10 classes) with a 2D latent space:

Before training (encoder initialized near the prior): $\text{KL}(q(z|x) \,\|\, p(z)) \approx 0$ — the posterior matches the prior, so no information is compressed.

After training: different digits form separate clusters in latent space. The KL term is $\approx 2$ nats/digit — roughly the information needed to encode which digit, out of a maximum of $\log 10 \approx 2.3$ nats. The decoder achieves near-perfect reconstruction from these 2 nats of compressed information.

For $\beta$-VAE with $\beta = 4$: higher compression. The latent space is better disentangled (separate dimensions for stroke width, tilt, style) but reconstruction loss is higher. The rate-distortion tradeoff is visible directly in the ELBO components.

Example 2: InfoNCE Sample Efficiency

For SimCLR with $K = 256$ negatives per positive pair, the MI lower bound (converted to bits, with $\mathcal{L}$ measured in nats) is $I(\text{view}_1; \text{view}_2) \geq \log_2 256 - \mathcal{L}/\ln 2 = 8 - \mathcal{L}/\ln 2$ bits.

A trained model achieving $\mathcal{L}_{\text{InfoNCE}} = 0.5$ nats at $K = 256$ implies $I \geq 8 - 0.5/\ln 2 \approx 7.3$ bits between the two augmented views of the same image. This high mutual information means the encoder successfully ignores augmentation-specific details (crop position, color jitter) while retaining content information.

Increasing $K$ tightens the bound: at $K = 65536$ (MoCo's momentum queue), the bound is $\geq 16 - \mathcal{L}/\ln 2$ bits — more informative, but the loss is harder to optimize.
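The unit bookkeeping above (loss in nats, bound in bits) is worth making explicit in a helper (name illustrative):

```python
import math

def infonce_mi_bound_bits(loss_nats: float, K: int) -> float:
    """InfoNCE MI lower bound in bits: log2(K) - loss_in_nats / ln 2."""
    return math.log2(K) - loss_nats / math.log(2)

print(f"{infonce_mi_bound_bits(0.5, 256):.2f} bits")    # 7.28 bits
print(f"{infonce_mi_bound_bits(0.5, 65536):.2f} bits")  # 15.28 bits
```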

Example 3: MDL for Model Selection

Compare a linear model (2 parameters) against a degree-5 polynomial (6 parameters) on $n = 50$ data points:

Log-likelihoods at the MLE: $\log p(x \mid \hat\theta_{\text{linear}}) = -80$ nats, $\log p(x \mid \hat\theta_{\text{poly}}) = -72$ nats.

Complexity penalties $(k/2)\log n$: linear: $(2/2)\log 50 \approx 3.9$ nats; poly5: $(6/2)\log 50 \approx 11.7$ nats.

MDL scores (fit plus penalty, in nats): linear: $80 + 3.9 = 83.9$; poly5: $72 + 11.7 = 83.7$.

MDL selects the polynomial, but only just: the 8-nat improvement in fit barely outweighs the 7.8-nat extra complexity cost. With only 50 points and 6 parameters, the polynomial scarcely earns its complexity.
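The arithmetic can be checked with a two-line scoring function (half-BIC / two-part-MDL form, in nats; name illustrative):

```python
import math

def mdl_score(neg_log_lik_nats: float, k: int, n: int) -> float:
    """Fit plus complexity penalty: -log p(x | theta_hat) + (k/2) ln n."""
    return neg_log_lik_nats + 0.5 * k * math.log(n)

linear = mdl_score(80.0, k=2, n=50)   # 80 + 3.9  ≈ 83.9 nats
poly5  = mdl_score(72.0, k=6, n=50)   # 72 + 11.7 ≈ 83.7 nats
print(f"linear: {linear:.1f}, poly5: {poly5:.1f}")
assert poly5 < linear  # the polynomial wins, by only ~0.2 nats
```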

Connections

Where Your Intuition Breaks

"Maximizing mutual information" sounds theoretically grounded as a self-supervised objective, but the InfoNCE lower bound $I(X;X^+) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$ is capped at $\log K$ and is tight only when $K$ is large. With small batch sizes (small $K$), the bound is loose and the gradient signal comes primarily from estimating the bound rather than from maximizing MI. This is why contrastive methods lean on scale: SimCLR benefits from batch sizes in the thousands (up to 8192 in the original paper), and MoCo maintains a queue of 65,536 negatives. Contrastive methods that claim to "maximize mutual information" with small batches are not maximizing MI — they are minimizing a loose lower bound that bears little relationship to the true MI at small $K$.

💡Intuition

VAE training is automatic compression to the information bottleneck. The KL term in the ELBO enforces that the latent code stays close to the prior — it limits how many bits the encoder can use per example. The reconstruction term pushes the encoder to use those bits wisely. Together they find the optimal tradeoff on the rate-distortion curve without explicitly computing $R(D)$. The $\beta$ parameter in $\beta$-VAE slides along this curve: high $\beta$ forces a lower rate (more compression), low $\beta$ allows a higher rate (lower distortion, less disentanglement).

💡Intuition

Contrastive learning is self-supervised MI maximization. Without labels, the only training signal is the structural assumption that different views of the same input share semantic content. InfoNCE formalizes this: learn an encoder that maximizes MI between views. The key insight is that maximizing $I(\text{view}_1; \text{view}_2)$ forces the encoder to extract view-invariant features — exactly the semantic content we want. This is why representations learned via contrastive objectives transfer well: they capture genuinely high-MI structure in the data.

⚠️Warning

Rate-distortion bounds constrain compression, not model quality. A language model with 70B parameters at 2-bit quantization uses $\approx 17.5$ GB of storage (140 gigabits). The rate-distortion function for a Gaussian weight distribution bounds the minimum number of bits needed to represent the weights to within some MSE on the weights — it says nothing about model quality. Quantized-model performance is not well described by classical rate-distortion theory because the loss that matters (perplexity on text) is a highly nonlinear function of the weights, not their MSE. Better compression analyses for neural network inference use task loss as the distortion measure, which motivates mixed-precision quantization schemes.
