
Data Processing Inequality & Sufficient Statistics

The data processing inequality states that no transformation can increase the information a random variable carries about another — a fundamental limit with deep consequences for representation learning, bottleneck architectures, and the theory of sufficient statistics.

Concepts

Every intermediate layer of a neural network is a deterministic transformation of the previous layer. The data processing inequality says something fundamental about this: no layer can increase the information the representation carries about the target label. The only way to preserve information is to use invertible transformations — giving an information-theoretic justification for why representation learning is fundamentally about finding the right compression, not finding the right enrichment.

The Data Processing Inequality

Markov chain notation: X \to Y \to Z means Z is conditionally independent of X given Y: P(Z \mid X, Y) = P(Z \mid Y).

Data Processing Inequality (DPI): if X \to Y \to Z is a Markov chain, then

I(X; Z) \leq I(X; Y).

Processing Y through any channel to produce Z cannot increase the information about X.

The proof is algebraic: the Markov condition X \to Y \to Z means I(X; Z \mid Y) = 0 (knowing Y makes Z independent of X). This forces I(X; Z) \leq I(X; Y) via the chain rule. The result is structural: it does not depend on the complexity of the transformation from Y to Z. No matter how computationally elaborate, no deterministic function of Y can extract information about X that was not already present in Y.

Proof:

I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) = I(X; Y) + 0 = I(X; Y),

using the Markov condition I(X; Z \mid Y) = 0. By the chain rule and non-negativity of mutual information:

I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z).

Combining: I(X; Y) \geq I(X; Z). \square

Corollary (DPI for KL divergence): for any channel P_{Y|X} and distributions P, Q on X:

\text{KL}(P_Y \| Q_Y) \leq \text{KL}(P_X \| Q_X),

where P_Y = P_X \cdot P_{Y|X} (passing distributions through a channel reduces their distinguishability).

Corollary (DPI for total variation): \text{TV}(P_Y, Q_Y) \leq \text{TV}(P_X, Q_X).

This says: any processing step makes distributions harder (or equally hard) to distinguish. A classifier cannot increase information; it can only lose it.
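The DPI can be checked numerically on a small discrete chain. This sketch (not from the notes; the alphabet sizes and Dirichlet-random channels are arbitrary illustrative choices) builds a Markov chain X \to Y \to Z from two random channels and verifies I(X; Z) \leq I(X; Y):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint table p_xy[i, j] = P(X=i, Y=j)."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px * py)[mask])).sum())

rng = np.random.default_rng(0)

# Random source P(X) and two random channels: X -> Y and Y -> Z.
p_x = rng.dirichlet(np.ones(4))              # P(X), 4 states
ch_xy = rng.dirichlet(np.ones(5), size=4)    # P(Y|X), one row per x
ch_yz = rng.dirichlet(np.ones(3), size=5)    # P(Z|Y), one row per y

p_xy = p_x[:, None] * ch_xy                  # joint P(X, Y)
p_xz = p_xy @ ch_yz                          # joint P(X, Z): marginalize Y via the Markov property

# DPI: I(X;Z) <= I(X;Y) for the chain X -> Y -> Z.
assert mutual_information(p_xz) <= mutual_information(p_xy) + 1e-12
```

The inequality holds for any choice of source and channels, not just this seed; the randomness is only to show the bound is not an artifact of a special construction.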

Sufficient Statistics Characterized by DPI

Recall (from Module 07): a statistic T = T(X) is sufficient for parameter \theta if the conditional distribution of X given T does not depend on \theta; equivalently, any inference based on T alone is as powerful as inference based on the full data X.

Information-theoretic characterization: T is sufficient for \theta iff I(X; \theta) = I(T; \theta) for all prior distributions on \theta.

Proof: since T = T(X) is a function of X, \theta \to X \to T is always a Markov chain, so DPI gives I(\theta; T) \leq I(\theta; X). If T is sufficient, the distribution of X given T does not depend on \theta, so \theta \to T \to X is also a Markov chain, and DPI in the reverse direction gives I(\theta; X) \leq I(\theta; T). Combining: I(\theta; T) = I(\theta; X); T preserves all the mutual information. \square

Intuition: the sufficient statistic compresses X to T without discarding any information about \theta. It is the ideal compression of X for the purpose of inferring \theta.
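The equality I(\theta; T) = I(\theta; X) can be verified exactly by enumeration in a toy model (a sketch, not from the notes; the two hypothesis values and n = 3 flips are arbitrary): for Bernoulli data, the number of heads T = \sum_i X_i is sufficient for \theta.

```python
import itertools
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits from a joint table p_joint[i, j] = P(A=i, B=j)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (pa * pb)[mask])).sum())

# Two hypotheses theta in {0.3, 0.7} with a uniform prior; n coin flips.
thetas, prior, n = [0.3, 0.7], [0.5, 0.5], 3

# Joint P(theta, X) over all 2^n binary sequences X.
seqs = list(itertools.product([0, 1], repeat=n))
p_theta_x = np.array([[pr * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                      for th, pr in zip(thetas, prior)])

# T = number of heads is sufficient for theta; joint P(theta, T).
p_theta_t = np.zeros((2, n + 1))
for j, s in enumerate(seqs):
    p_theta_t[:, sum(s)] += p_theta_x[:, j]

# Sufficiency <=> no information about theta is lost: I(theta; X) = I(theta; T).
assert np.isclose(mutual_information(p_theta_x), mutual_information(p_theta_t))
```

Note that T lives on n + 1 values while X lives on 2^n: the compression is exponential yet lossless with respect to \theta.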

Fano's Inequality

Fano's inequality lower-bounds the probability of error in estimation problems.

Setup: parameter X \in \mathcal{X} with |\mathcal{X}| = m, observation Y, estimator \hat X = g(Y), error probability P_e = P(\hat X \neq X).

Fano's inequality:

H(X \mid Y) \leq H(P_e) + P_e \log(m-1),

where H(P_e) = -P_e \log P_e - (1-P_e)\log(1-P_e) is the binary entropy. Equivalently:

P_e \geq \frac{H(X \mid Y) - 1}{\log(m-1)} = \frac{H(X) - I(X;Y) - 1}{\log(m-1)}.

Interpretation: if the conditional entropy H(X \mid Y) is large (the observation Y leaves much uncertainty about X), the error rate must be large. No estimator can achieve low error when the posterior uncertainty is high.

Proof sketch: introduce the error indicator E = \mathbf{1}[\hat X \neq X]. Since \hat X = g(Y), H(X \mid Y) \leq H(X \mid \hat X). Expanding H(X, E \mid \hat X) two ways gives H(X \mid \hat X) \leq H(E) + H(X \mid E, \hat X) \leq H(P_e) + P_e \log(m-1), since when E = 1 the variable X can take at most m-1 values different from \hat X.

Application to minimax lower bounds: Fano's inequality is the standard tool for proving lower bounds on statistical estimation. To show that no estimator can achieve error below \varepsilon^2:

  1. Construct m distributions in the parameter space that are 2\varepsilon apart in the metric
  2. Show these distributions satisfy I(X; Y) \leq n \cdot I(\theta_i; Y_1) (information accumulates over samples)
  3. Apply Fano to get P_e \geq 1 - O(I/\log m)
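Fano's inequality itself can be checked numerically (a sketch, not from the notes; the alphabet sizes are arbitrary): draw a random joint P(X, Y), compute H(X \mid Y), and compare against the bound evaluated at the error of the MAP estimator, which minimizes P_e.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 5, 7                                          # |X| = m hypotheses, |Y| = k observation values
p_xy = rng.dirichlet(np.ones(m * k)).reshape(m, k)   # random joint P(X, Y), all entries > 0

# Conditional entropy H(X|Y) in bits.
p_y = p_xy.sum(axis=0)
h_cond = -(p_xy * np.log2(p_xy / p_y)).sum()

# MAP estimator g(y) = argmax_x P(x|y); its error probability P_e.
p_e = 1.0 - p_xy.max(axis=0).sum()

# Fano: H(X|Y) <= h(P_e) + P_e * log2(m - 1), with h the binary entropy.
h_e = -p_e * np.log2(p_e) - (1 - p_e) * np.log2(1 - p_e)
assert h_cond <= h_e + p_e * np.log2(m - 1) + 1e-12
```

Because the MAP rule achieves the smallest possible P_e, the bound holds a fortiori for every other estimator g.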

Information Bottleneck

The information bottleneck principle (Tishby, Pereira, Bialek 1999) frames representation learning as a compression problem: find a representation Z of input X that maximally retains information about a target Y while compressing X:

\min_{P(Z|X)} I(Z; X) - \beta I(Z; Y).

The Lagrange multiplier \beta controls the tradeoff:

  • \beta = 0: compress X maximally (Z retains nothing about X), discarding all information about Y
  • \beta \to \infty: retain all information about Y (a sufficient statistic), with no compression constraint

DPI constraint: since Z is computed from X, Y \to X \to Z is a Markov chain, so I(Z; Y) \leq I(X; Y) by DPI. The maximum information about Y that any representation Z can achieve is I(X; Y), the information between the raw input and the label.

Optimal representation: the optimal Z^* is a sufficient statistic of X for Y: it preserves I(X; Y) while minimizing I(Z; X).

Implied Markov chain in a deep network: Y \to X \to Z_1 \to Z_2 \to \ldots \to \hat Y, where each layer is Markovian given the previous. By repeated DPI application:

I(Y; Z_k) \leq I(Y; Z_{k-1}) \leq \ldots \leq I(Y; X).

Each layer can only lose or preserve information about YY.
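As an illustrative check (not from the notes; all sizes and the random maps f1, f2 are arbitrary choices), even arbitrary deterministic "layers" applied to a discrete input obey this chain:

```python
import numpy as np

def mi(p_ab):
    """I(A;B) in bits from a joint table p_ab[i, j] = P(A=i, B=j)."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / (pa * pb)[mask])).sum())

rng = np.random.default_rng(3)

# Label Y (3 classes) generates input X (8 states) through a noisy channel.
p_y = rng.dirichlet(np.ones(3))
ch_yx = rng.dirichlet(np.ones(8), size=3)    # P(X|Y), one row per class
p_yx = p_y[:, None] * ch_yx                  # joint P(Y, X)

# Two deterministic "layers": Z1 = f1(X) with 4 states, Z2 = f2(Z1) with 2 states.
f1 = rng.integers(0, 4, size=8)              # arbitrary deterministic maps
f2 = rng.integers(0, 2, size=4)

p_yz1 = np.zeros((3, 4))
for x in range(8):
    p_yz1[:, f1[x]] += p_yx[:, x]
p_yz2 = np.zeros((3, 2))
for z in range(4):
    p_yz2[:, f2[z]] += p_yz1[:, z]

# Repeated DPI: I(Y;Z2) <= I(Y;Z1) <= I(Y;X).
assert mi(p_yz2) <= mi(p_yz1) + 1e-12
assert mi(p_yz1) <= mi(p_yx) + 1e-12
```

A deterministic layer is just a degenerate channel, so the DPI applies to it with no modification; training the maps f1, f2 changes which information is kept, never how much can be kept.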

Worked Example

Example 1: DPI in a Deep Classifier

Network: input X \in \mathbb{R}^{1000} (image pixels), layer 1 output Z_1 \in \mathbb{R}^{512}, layer 2 output Z_2 \in \mathbb{R}^{128}, logits \hat Y \in \mathbb{R}^{10}.

By DPI: I(Y; \hat Y) \leq I(Y; Z_2) \leq I(Y; Z_1) \leq I(Y; X) \leq H(Y) = \log_2 10 \approx 3.32 bits (for balanced 10-class classification, with equality in the last step when the label is determined by the image).

The maximum information any intermediate representation, of whatever dimension, can retain about a 10-class label is I(Y;X) \leq 3.32 bits. Since I(Y;X) is fixed by the task, every intermediate representation is limited by this ceiling. Good representations are those that saturate the bound: they retain all task-relevant information while discarding irrelevant details.

Example 2: Fano's Inequality for Hypothesis Testing

Test H_0: \theta = \theta_0 vs H_1: \theta = \theta_1 from n i.i.d. observations X^n, with a uniform prior on the two hypotheses. For m = 2 the \log(m-1) form of Fano is vacuous, so use the binary-entropy form directly:

H(P_e) \geq H(\theta \mid X^n) \geq H(\theta) - I(\theta; X^n) \geq 1 - n \cdot I(\theta; X_1),

working in bits, with H(\theta) = 1 and the subadditivity bound I(\theta; X^n) \leq n \cdot I(\theta; X_1) for conditionally i.i.d. samples.

Since the binary entropy is increasing on [0, 1/2], achieving P_e \leq \alpha \leq 1/2 requires H(\alpha) \geq 1 - n \cdot I(\theta; X_1), i.e. n \geq (1 - H(\alpha)) / I(\theta; X_1): an information-theoretic lower bound on the sample complexity of distinguishing the two hypotheses.

The Chernoff-Stein analysis gives the exact exponential decay rate of the error probabilities; Fano gives only this cruder linear-in-n threshold, but applies far beyond binary testing.
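A quick numeric instance of the binary-entropy form of Fano, H(P_e) \geq 1 - n \cdot I(\theta; X_1) in bits (the hypothesis values 0.4 and 0.6 and the target error \alpha = 0.05 below are illustrative choices, not from the notes):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

# Two hypotheses: X1 ~ Bernoulli(0.4) vs Bernoulli(0.6), uniform prior on theta.
p0, p1 = 0.4, 0.6
mix = 0.5 * p0 + 0.5 * p1                  # marginal P(X1 = 1)

# I(theta; X1) = H(X1) - H(X1 | theta), in bits per sample.
info_per_sample = h2(mix) - 0.5 * h2(p0) - 0.5 * h2(p1)

# Binary Fano: h2(P_e) >= 1 - n * I(theta; X1), so P_e <= alpha requires
# n >= (1 - h2(alpha)) / I(theta; X1).
alpha = 0.05
n_min = (1 - h2(alpha)) / info_per_sample
print(f"I(theta; X1) = {info_per_sample:.4f} bits/sample, need n >= {n_min:.1f}")
```

The per-sample information here is only about 0.03 bits, so a few dozen samples are provably necessary before any test can reach 5% error.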

Example 3: Information Bottleneck in VAEs

A VAE encodes input X to latent Z, then decodes to \hat X. The ELBO (from Module 07, Bayesian lesson):

\text{ELBO} = \mathbb{E}[\log p(X \mid Z)] - \text{KL}(q(Z \mid X) \| p(Z)).

Averaged over the data, the KL term satisfies \mathbb{E}_X[\text{KL}(q(Z \mid X) \| p(Z))] = I(X; Z) + \text{KL}(q(Z) \| p(Z)) \geq I(X; Z), so it penalizes mutual information between input and latent (compression). The reconstruction term encourages I(Z; Y) for Y = X (reconstruction). The ELBO thus matches the information bottleneck objective with \beta = 1 and Y = X, up to the aggregate-posterior gap: a special case where we want to compress X while retaining enough information to reconstruct it.
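The relation between the ELBO's KL penalty and I(X; Z), namely \mathbb{E}_X[\text{KL}(q(Z \mid X) \| p(Z))] = I(X; Z) + \text{KL}(q(Z) \| p(Z)) with q(Z) the aggregate posterior, can be verified exactly on a discrete toy model (a sketch, not from the notes; all sizes and distributions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, nz = 4, 3
p_x = rng.dirichlet(np.ones(nx))                    # data distribution P(X)
q_z_given_x = rng.dirichlet(np.ones(nz), size=nx)   # encoder q(Z|X), one row per x
p_z = rng.dirichlet(np.ones(nz))                    # prior p(Z)

def kl(p, q):
    """KL(p || q) in bits for strictly positive probability vectors."""
    return float((p * np.log2(p / q)).sum())

# Expected KL term of the ELBO, averaged over the data.
expected_kl = sum(p_x[i] * kl(q_z_given_x[i], p_z) for i in range(nx))

# Decomposition: E_X KL(q(Z|X) || p(Z)) = I_q(X;Z) + KL(q(Z) || p(Z)).
q_z = p_x @ q_z_given_x                             # aggregate posterior q(Z)
i_xz = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(nx))

assert np.isclose(expected_kl, i_xz + kl(q_z, p_z))  # identity holds exactly
assert expected_kl >= i_xz                           # KL term upper-bounds I(X;Z)
```

So the ELBO's KL penalty is an upper bound on I(X; Z), tight exactly when the aggregate posterior matches the prior.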

Connections

Where Your Intuition Breaks

The information bottleneck principle — train a representation to compress X while retaining only information relevant to label Y — is theoretically elegant. Tishby and colleagues conjectured that SGD implicitly optimizes this objective, with DNNs first memorizing the training data (high I(Z;X)) then compressing (decreasing I(Z;X) while maintaining I(Z;Y)). This compression phase was not empirically replicated in most subsequent work: the mutual information estimates used in the original experiments depended sensitively on activation binning, and different estimation methods give qualitatively different pictures of whether compression actually occurs. The theoretical appeal of the information bottleneck is real; the claim that SGD implements it is not established.

💡Intuition

DPI says information cannot be created from nothing. Any transformation, compression, or processing of data can only lose information — it cannot add information about the input that was not already there. This is the information-theoretic analog of the second law of thermodynamics. For ML: the information about the label Y in a model's final output \hat Y is at most the information in the raw input X. No architecture, no matter how deep or complex, can exceed I(X;Y) — the information the problem itself contains. Good ML is about preserving this information efficiently through the computation graph.

💡Intuition

The sufficient statistic is the ideal compression. Among all statistics T that preserve I(\theta;T) = I(\theta;X), the minimal sufficient statistic achieves this while being maximally compressed. This is the information bottleneck at \beta \to \infty: compress maximally while retaining all task-relevant information. The exponential family sufficient statistic (e.g., sample mean and sum of squares for Gaussian data) is exactly this ideal compression — it discards all irrelevant ordering information while retaining everything about \theta.

⚠️Warning

The information bottleneck theory of deep learning is contested. Shwartz-Ziv and Tishby (2017) proposed that neural networks learn in two distinct phases — first fitting (increasing I(Z;Y)), then compressing (decreasing I(Z;X)) — visible as a "compression phase" in training trajectories. Subsequent work showed these results depend heavily on activation functions and binning procedures for estimating mutual information; for networks with ReLU activations, the compression phase often does not appear. The information bottleneck is a useful conceptual framework, but the empirical claims about training dynamics remain disputed.
