
Data Processing Inequality & Sufficient Statistics

The data processing inequality states that no transformation can increase the information a random variable carries about another — a fundamental limit with deep consequences for representation learning, bottleneck architectures, and the theory of sufficient statistics.

Concepts

Every intermediate layer of a neural network is a deterministic transformation of the previous layer. The data processing inequality says something fundamental about this: no layer can increase the information the representation carries about the target label. The only way to preserve information is to use invertible transformations — giving an information-theoretic justification for why representation learning is fundamentally about finding the right compression, not finding the right enrichment.

The Data Processing Inequality

Markov chain notation: X \to Y \to Z means Z is conditionally independent of X given Y: P(Z \mid X, Y) = P(Z \mid Y).

Data Processing Inequality (DPI): if X \to Y \to Z is a Markov chain, then

I(X; Z) \leq I(X; Y).

Processing Y through any channel to produce Z cannot increase the information about X.

The proof is algebraic: the Markov condition X \to Y \to Z means I(X; Z \mid Y) = 0 (knowing Y makes Z independent of X). This forces I(X; Z) \leq I(X; Y) via the chain rule. The result is structural: it does not depend on the complexity of the transformation from Y to Z. No matter how computationally elaborate, no deterministic function of Y can extract information about X that was not already present in Y.

Proof:

I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) = I(X; Y) + 0 = I(X; Y),

using the Markov condition I(X; Z \mid Y) = 0. By the chain rule and non-negativity of mutual information:

I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z).

Combining: I(X; Y) \geq I(X; Z). \square

Corollary (DPI for KL divergence): for any channel P_{Y|X} and distributions P, Q on X:

\text{KL}(P_Y \| Q_Y) \leq \text{KL}(P_X \| Q_X),

where P_Y = P_X \cdot P_{Y|X} (passing distributions through a channel reduces their distinguishability).

Corollary (DPI for total variation): \text{TV}(P_Y, Q_Y) \leq \text{TV}(P_X, Q_X).

This says: any processing step makes distributions harder (or equally hard) to distinguish. A classifier cannot increase information; it can only lose it.
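The DPI can be checked numerically on a small discrete chain. This sketch (not from the notes; the alphabet sizes and Dirichlet-random channels are arbitrary illustrative choices) builds a Markov chain X \to Y \to Z from two random channels and verifies I(X; Z) \leq I(X; Y):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint table p_xy[i, j] = P(X=i, Y=j)."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px * py)[mask])).sum())

rng = np.random.default_rng(0)

# Random source P(X) and two random channels: X -> Y and Y -> Z.
p_x = rng.dirichlet(np.ones(4))              # P(X), 4 states
ch_xy = rng.dirichlet(np.ones(5), size=4)    # P(Y|X), one row per x
ch_yz = rng.dirichlet(np.ones(3), size=5)    # P(Z|Y), one row per y

p_xy = p_x[:, None] * ch_xy                  # joint P(X, Y)
p_xz = p_xy @ ch_yz                          # joint P(X, Z): marginalize Y via the Markov property

# DPI: I(X;Z) <= I(X;Y) for the chain X -> Y -> Z.
assert mutual_information(p_xz) <= mutual_information(p_xy) + 1e-12
```

The inequality holds for any choice of source and channels, not just this seed; the randomness is only to show the bound is not an artifact of a special construction.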

Sufficient Statistics Characterized by DPI

Recall (from Module 07): a statistic T = T(X) is sufficient for parameter \theta if the conditional distribution of X given T does not depend on \theta; equivalently, any inference based on T alone is as powerful as inference based on the full data X.

Information-theoretic characterization: T is sufficient for \theta iff I(X; \theta) = I(T; \theta) for all prior distributions on \theta.

Proof: since T = T(X) is a function of X, \theta \to X \to T is always a Markov chain, so DPI gives I(\theta; T) \leq I(\theta; X). If T is sufficient, the distribution of X given T does not depend on \theta, so \theta \to T \to X is also a Markov chain, and DPI in the reverse direction gives I(\theta; X) \leq I(\theta; T). Combining: I(\theta; T) = I(\theta; X); T preserves all the mutual information. \square

Intuition: the sufficient statistic compresses X to T without discarding any information about \theta. It is the ideal compression of X for the purpose of inferring \theta.
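The equality I(\theta; T) = I(\theta; X) can be verified exactly by enumeration in a toy model (a sketch, not from the notes; the two hypothesis values and n = 3 flips are arbitrary): for Bernoulli data, the number of heads T = \sum_i X_i is sufficient for \theta.

```python
import itertools
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits from a joint table p_joint[i, j] = P(A=i, B=j)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (pa * pb)[mask])).sum())

# Two hypotheses theta in {0.3, 0.7} with a uniform prior; n coin flips.
thetas, prior, n = [0.3, 0.7], [0.5, 0.5], 3

# Joint P(theta, X) over all 2^n binary sequences X.
seqs = list(itertools.product([0, 1], repeat=n))
p_theta_x = np.array([[pr * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                      for th, pr in zip(thetas, prior)])

# T = number of heads is sufficient for theta; joint P(theta, T).
p_theta_t = np.zeros((2, n + 1))
for j, s in enumerate(seqs):
    p_theta_t[:, sum(s)] += p_theta_x[:, j]

# Sufficiency <=> no information about theta is lost: I(theta; X) = I(theta; T).
assert np.isclose(mutual_information(p_theta_x), mutual_information(p_theta_t))
```

Note that T lives on n + 1 values while X lives on 2^n: the compression is exponential yet lossless with respect to \theta.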

Fano's Inequality

Fano's inequality lower-bounds the probability of error in estimation problems.

Setup: parameter X \in \mathcal{X} with |\mathcal{X}| = m, observation Y, estimator \hat X = g(Y), error probability P_e = P(\hat X \neq X).

Fano's inequality:

H(X \mid Y) \leq H(P_e) + P_e \log(m-1),

where H(P_e) = -P_e \log P_e - (1-P_e)\log(1-P_e) is the binary entropy. Equivalently:

P_e \geq \frac{H(X \mid Y) - 1}{\log(m-1)} = \frac{H(X) - I(X;Y) - 1}{\log(m-1)}.

Interpretation: if the conditional entropy H(X \mid Y) is large (the observation Y leaves much uncertainty about X), the error rate must be large. No estimator can achieve low error when the posterior uncertainty is high.

Proof sketch: introduce the error indicator E = \mathbf{1}[\hat X \neq X]. Since \hat X = g(Y), H(X \mid Y) \leq H(X \mid \hat X). Expanding H(X, E \mid \hat X) two ways gives H(X \mid \hat X) \leq H(E) + H(X \mid E, \hat X) \leq H(P_e) + P_e \log(m-1), since when E = 1 the variable X can take at most m-1 values different from \hat X.

Application to minimax lower bounds: Fano's inequality is the standard tool for proving lower bounds on statistical estimation. To show that no estimator can achieve error below \varepsilon^2:

  1. Construct m distributions in the parameter space that are 2\varepsilon apart in the metric
  2. Show these distributions satisfy I(X; Y) \leq n \cdot I(\theta_i; Y_1) (information accumulates over samples)
  3. Apply Fano to get P_e \geq 1 - O(I/\log m)
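Fano's inequality itself can be checked numerically (a sketch, not from the notes; the alphabet sizes are arbitrary): draw a random joint P(X, Y), compute H(X \mid Y), and compare against the bound evaluated at the error of the MAP estimator, which minimizes P_e.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 5, 7                                          # |X| = m hypotheses, |Y| = k observation values
p_xy = rng.dirichlet(np.ones(m * k)).reshape(m, k)   # random joint P(X, Y), all entries > 0

# Conditional entropy H(X|Y) in bits.
p_y = p_xy.sum(axis=0)
h_cond = -(p_xy * np.log2(p_xy / p_y)).sum()

# MAP estimator g(y) = argmax_x P(x|y); its error probability P_e.
p_e = 1.0 - p_xy.max(axis=0).sum()

# Fano: H(X|Y) <= h(P_e) + P_e * log2(m - 1), with h the binary entropy.
h_e = -p_e * np.log2(p_e) - (1 - p_e) * np.log2(1 - p_e)
assert h_cond <= h_e + p_e * np.log2(m - 1) + 1e-12
```

Because the MAP rule achieves the smallest possible P_e, the bound holds a fortiori for every other estimator g.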

Information Bottleneck

The information bottleneck principle (Tishby, Pereira, Bialek 1999) frames representation learning as a compression problem: find a representation Z of input X that maximally retains information about a target Y while compressing X:

\min_{P(Z|X)} I(Z; X) - \beta I(Z; Y).

The Lagrange multiplier \beta controls the tradeoff:

  • \beta = 0: compress X maximally (Z retains nothing about X), discarding all information about Y
  • \beta \to \infty: retain all information about Y (a sufficient statistic), with no compression constraint

DPI constraint: since Z is computed from X, Y \to X \to Z is a Markov chain, so I(Z; Y) \leq I(X; Y) by DPI. The maximum information about Y that any representation Z can achieve is I(X; Y), the information between the raw input and the label.

Optimal representation: the optimal Z^* is a sufficient statistic of X for Y: it preserves I(X; Y) while minimizing I(Z; X).

Implied Markov chain in a deep network: Y \to X \to Z_1 \to Z_2 \to \ldots \to \hat Y, where each layer is Markovian given the previous. By repeated DPI application:

I(Y; Z_k) \leq I(Y; Z_{k-1}) \leq \ldots \leq I(Y; X).

Each layer can only lose or preserve information about YY.
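As an illustrative check (not from the notes; all sizes and the random maps f1, f2 are arbitrary choices), even arbitrary deterministic "layers" applied to a discrete input obey this chain:

```python
import numpy as np

def mi(p_ab):
    """I(A;B) in bits from a joint table p_ab[i, j] = P(A=i, B=j)."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / (pa * pb)[mask])).sum())

rng = np.random.default_rng(3)

# Label Y (3 classes) generates input X (8 states) through a noisy channel.
p_y = rng.dirichlet(np.ones(3))
ch_yx = rng.dirichlet(np.ones(8), size=3)    # P(X|Y), one row per class
p_yx = p_y[:, None] * ch_yx                  # joint P(Y, X)

# Two deterministic "layers": Z1 = f1(X) with 4 states, Z2 = f2(Z1) with 2 states.
f1 = rng.integers(0, 4, size=8)              # arbitrary deterministic maps
f2 = rng.integers(0, 2, size=4)

p_yz1 = np.zeros((3, 4))
for x in range(8):
    p_yz1[:, f1[x]] += p_yx[:, x]
p_yz2 = np.zeros((3, 2))
for z in range(4):
    p_yz2[:, f2[z]] += p_yz1[:, z]

# Repeated DPI: I(Y;Z2) <= I(Y;Z1) <= I(Y;X).
assert mi(p_yz2) <= mi(p_yz1) + 1e-12
assert mi(p_yz1) <= mi(p_yx) + 1e-12
```

A deterministic layer is just a degenerate channel, so the DPI applies to it with no modification; training the maps f1, f2 changes which information is kept, never how much can be kept.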

Worked Example

Example 1: DPI in a Deep Classifier

Network: input X \in \mathbb{R}^{1000} (image pixels), layer 1 output Z_1 \in \mathbb{R}^{512}, layer 2 output Z_2 \in \mathbb{R}^{128}, logits \hat Y \in \mathbb{R}^{10}.

By DPI: I(Y; \hat Y) \leq I(Y; Z_2) \leq I(Y; Z_1) \leq I(Y; X) \leq H(Y) = \log_2 10 \approx 3.32 bits (for balanced 10-class classification, with equality in the last step when the label is determined by the image).

The maximum information any intermediate representation, of whatever dimension, can retain about a 10-class label is I(Y;X) \leq 3.32 bits. Since I(Y;X) is fixed by the task, every intermediate representation is limited by this ceiling. Good representations are those that saturate the bound: they retain all task-relevant information while discarding irrelevant details.

Example 2: Fano's Inequality for Hypothesis Testing

Test H_0: \theta = \theta_0 vs H_1: \theta = \theta_1 from n i.i.d. observations X^n, with a uniform prior on the two hypotheses. For m = 2 the \log(m-1) form of Fano is vacuous, so use the binary-entropy form directly:

H(P_e) \geq H(\theta \mid X^n) \geq H(\theta) - I(\theta; X^n) \geq 1 - n \cdot I(\theta; X_1),

working in bits, with H(\theta) = 1 and the subadditivity bound I(\theta; X^n) \leq n \cdot I(\theta; X_1) for conditionally i.i.d. samples.

Since the binary entropy is increasing on [0, 1/2], achieving P_e \leq \alpha \leq 1/2 requires H(\alpha) \geq 1 - n \cdot I(\theta; X_1), i.e. n \geq (1 - H(\alpha)) / I(\theta; X_1): an information-theoretic lower bound on the sample complexity of distinguishing the two hypotheses.

The Chernoff-Stein analysis gives the exact exponential decay rate of the error probabilities; Fano gives only this cruder linear-in-n threshold, but applies far beyond binary testing.
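A quick numeric instance of the binary-entropy form of Fano, H(P_e) \geq 1 - n \cdot I(\theta; X_1) in bits (the hypothesis values 0.4 and 0.6 and the target error \alpha = 0.05 below are illustrative choices, not from the notes):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

# Two hypotheses: X1 ~ Bernoulli(0.4) vs Bernoulli(0.6), uniform prior on theta.
p0, p1 = 0.4, 0.6
mix = 0.5 * p0 + 0.5 * p1                  # marginal P(X1 = 1)

# I(theta; X1) = H(X1) - H(X1 | theta), in bits per sample.
info_per_sample = h2(mix) - 0.5 * h2(p0) - 0.5 * h2(p1)

# Binary Fano: h2(P_e) >= 1 - n * I(theta; X1), so P_e <= alpha requires
# n >= (1 - h2(alpha)) / I(theta; X1).
alpha = 0.05
n_min = (1 - h2(alpha)) / info_per_sample
print(f"I(theta; X1) = {info_per_sample:.4f} bits/sample, need n >= {n_min:.1f}")
```

The per-sample information here is only about 0.03 bits, so a few dozen samples are provably necessary before any test can reach 5% error.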

Example 3: Information Bottleneck in VAEs

A VAE encodes input X to latent Z, then decodes to \hat X. The ELBO (from Module 07, Bayesian lesson):

\text{ELBO} = \mathbb{E}[\log p(X \mid Z)] - \text{KL}(q(Z \mid X) \| p(Z)).

Averaged over the data, the KL term satisfies \mathbb{E}_X[\text{KL}(q(Z \mid X) \| p(Z))] = I(X; Z) + \text{KL}(q(Z) \| p(Z)) \geq I(X; Z), so it penalizes mutual information between input and latent (compression). The reconstruction term encourages I(Z; Y) for Y = X (reconstruction). The ELBO thus matches the information bottleneck objective with \beta = 1 and Y = X, up to the aggregate-posterior gap: a special case where we want to compress X while retaining enough information to reconstruct it.
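The relation between the ELBO's KL penalty and I(X; Z), namely \mathbb{E}_X[\text{KL}(q(Z \mid X) \| p(Z))] = I(X; Z) + \text{KL}(q(Z) \| p(Z)) with q(Z) the aggregate posterior, can be verified exactly on a discrete toy model (a sketch, not from the notes; all sizes and distributions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, nz = 4, 3
p_x = rng.dirichlet(np.ones(nx))                    # data distribution P(X)
q_z_given_x = rng.dirichlet(np.ones(nz), size=nx)   # encoder q(Z|X), one row per x
p_z = rng.dirichlet(np.ones(nz))                    # prior p(Z)

def kl(p, q):
    """KL(p || q) in bits for strictly positive probability vectors."""
    return float((p * np.log2(p / q)).sum())

# Expected KL term of the ELBO, averaged over the data.
expected_kl = sum(p_x[i] * kl(q_z_given_x[i], p_z) for i in range(nx))

# Decomposition: E_X KL(q(Z|X) || p(Z)) = I_q(X;Z) + KL(q(Z) || p(Z)).
q_z = p_x @ q_z_given_x                             # aggregate posterior q(Z)
i_xz = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(nx))

assert np.isclose(expected_kl, i_xz + kl(q_z, p_z))  # identity holds exactly
assert expected_kl >= i_xz                           # KL term upper-bounds I(X;Z)
```

So the ELBO's KL penalty is an upper bound on I(X; Z), tight exactly when the aggregate posterior matches the prior.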

Connections

Where Your Intuition Breaks

The information bottleneck principle — train a representation to compress X while retaining only information relevant to label Y — is theoretically elegant. Tishby and colleagues conjectured that SGD implicitly optimizes this objective, with DNNs first memorizing the training data (high I(Z;X)) then compressing (decreasing I(Z;X) while maintaining I(Z;Y)). This compression phase was not empirically replicated in most subsequent work: the mutual information estimates used in the original experiments depended sensitively on activation binning, and different estimation methods give qualitatively different pictures of whether compression actually occurs. The theoretical appeal of the information bottleneck is real; the claim that SGD implements it is not established.

💡Intuition

DPI says information cannot be created from nothing. Any transformation, compression, or processing of data can only lose information — it cannot add information about the input that was not already there. This is the information-theoretic analog of the second law of thermodynamics. For ML: the information about the label Y in a model's final output \hat Y is at most the information in the raw input X. No architecture, no matter how deep or complex, can exceed I(X;Y) — the information the problem itself contains. Good ML is about preserving this information efficiently through the computation graph.

💡Intuition

The sufficient statistic is the ideal compression. Among all statistics T that preserve I(\theta;T) = I(\theta;X), the minimal sufficient statistic achieves this while being maximally compressed. This is the information bottleneck at \beta \to \infty: compress maximally while retaining all task-relevant information. The exponential family sufficient statistic (e.g., sample mean and sum of squares for Gaussian data) is exactly this ideal compression — it discards all irrelevant ordering information while retaining everything about \theta.

⚠️Warning

The information bottleneck theory of deep learning is contested. Shwartz-Ziv and Tishby (2017) proposed that neural networks learn in two distinct phases — first fitting (increasing I(Z;Y)), then compressing (decreasing I(Z;X)) — visible as a "compression phase" in training trajectories. Subsequent work showed these results depend heavily on activation functions and binning procedures for estimating mutual information; for networks with ReLU activations, the compression phase often does not appear. The information bottleneck is a useful conceptual framework, but the empirical claims about training dynamics remain disputed.
