Data Processing Inequality & Sufficient Statistics
The data processing inequality states that no transformation can increase the information a random variable carries about another — a fundamental limit with deep consequences for representation learning, bottleneck architectures, and the theory of sufficient statistics.
Concepts
Every intermediate layer of a neural network is a transformation of the previous layer's output. The data processing inequality says something fundamental about this: no layer can increase the information the representation carries about the target label. Only invertible transformations are guaranteed to preserve all of it, which gives an information-theoretic justification for why representation learning is fundamentally about finding the right compression, not finding the right enrichment.
The Data Processing Inequality
Markov chain notation: $X \to Y \to Z$ means $Z$ is conditionally independent of $X$ given $Y$: $p(z \mid x, y) = p(z \mid y)$.
Data Processing Inequality (DPI): if $X \to Y \to Z$ is a Markov chain, then
$$I(X; Z) \le I(X; Y).$$
Processing $Y$ through any channel to produce $Z$ cannot increase the information about $X$.
The proof is algebraic: the Markov condition means $I(X; Z \mid Y) = 0$ (knowing $Y$ makes $Z$ independent of $X$). This forces $I(X; Z) \le I(X; Y)$ via the chain rule. The result is structural: it does not depend on the complexity of the transformation from $Y$ to $Z$. No matter how computationally elaborate, no function of $Y$, deterministic or stochastic, can extract information about $X$ that was not already present in $Y$.
Proof: expand $I(X; Y, Z)$ by the chain rule in two ways:
$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y),$$
using the Markov condition $I(X; Z \mid Y) = 0$. By non-negativity of mutual information, $I(X; Y \mid Z) \ge 0$.
Combining: $I(X; Z) = I(X; Y) - I(X; Y \mid Z) \le I(X; Y)$. $\blacksquare$
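As a sanity check, the inequality can be verified numerically on a randomly generated discrete Markov chain. This is an illustrative NumPy sketch: the alphabet sizes and Dirichlet-sampled channels are arbitrary choices, not anything from the lesson.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint distribution matrix p[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
# Random Markov chain X -> Y -> Z over small alphabets.
px = rng.dirichlet(np.ones(4))                  # p(x)
py_x = rng.dirichlet(np.ones(5), size=4)        # channel p(y|x), shape (4, 5)
pz_y = rng.dirichlet(np.ones(3), size=5)        # channel p(z|y), shape (5, 3)

pxy = px[:, None] * py_x                        # joint p(x, y)
pxz = pxy @ pz_y                                # joint p(x, z) = sum_y p(x,y) p(z|y)

ixy = mutual_information(pxy)
ixz = mutual_information(pxz)
assert ixz <= ixy + 1e-12                       # DPI: I(X;Z) <= I(X;Y)
print(f"I(X;Y) = {ixy:.4f} bits, I(X;Z) = {ixz:.4f} bits")
```

Any random channels sampled here will satisfy the inequality; the proof above guarantees it for every Markov chain, not just this one.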
Corollary (DPI for KL divergence): for any channel $p(y \mid x)$ and distributions $P_X, Q_X$ on $X$:
$$D_{\mathrm{KL}}(P_Y \,\|\, Q_Y) \le D_{\mathrm{KL}}(P_X \,\|\, Q_X),$$
where $P_Y(y) = \sum_x p(y \mid x) P_X(x)$ (passing distributions through a channel reduces their distinguishability).
Corollary (DPI for total variation): $\mathrm{TV}(P_Y, Q_Y) \le \mathrm{TV}(P_X, Q_X)$.
This says: any processing step makes distributions harder (or equally hard) to distinguish. A classifier cannot increase information; it can only lose it.
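The same kind of numeric check works for both corollaries: push two distributions through a random channel and confirm that both divergences shrink. A minimal sketch, with an arbitrary Dirichlet-sampled channel standing in for the processing step:

```python
import numpy as np

def kl(p, q):
    """D(p || q) in nats for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance between discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(1)
p_x = rng.dirichlet(np.ones(6))
q_x = rng.dirichlet(np.ones(6))
channel = rng.dirichlet(np.ones(4), size=6)   # p(y|x), rows sum to 1

p_y = p_x @ channel                           # push P through the channel
q_y = q_x @ channel                           # push Q through the channel

assert kl(p_y, q_y) <= kl(p_x, q_x) + 1e-12   # KL can only shrink
assert tv(p_y, q_y) <= tv(p_x, q_x) + 1e-12   # TV can only shrink
print(f"KL: {kl(p_x, q_x):.4f} -> {kl(p_y, q_y):.4f}")
print(f"TV: {tv(p_x, q_x):.4f} -> {tv(p_y, q_y):.4f}")
```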
Sufficient Statistics Characterized by DPI
Recall (from Module 07): a statistic $T(X)$ is sufficient for parameter $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$; any inference based on $T(X)$ is as powerful as any inference based on $X$.
Information-theoretic characterization: $T(X)$ is sufficient for $\theta$ iff $I(\theta; T(X)) = I(\theta; X)$ for all prior distributions on $\theta$.
Proof: $T$ is a function of $X$, so $\theta \to X \to T(X)$ is always a Markov chain, and DPI gives $I(\theta; T(X)) \le I(\theta; X)$. If $T$ is sufficient, then $\theta \to T(X) \to X$ is also a Markov chain (the distribution of the raw data given $T(X)$ does not depend on $\theta$), and DPI applied to this chain gives $I(\theta; X) \le I(\theta; T(X))$. Combining the two inequalities: $I(\theta; X) = I(\theta; T(X))$; a sufficient statistic preserves all the mutual information.
Intuition: the sufficient statistic compresses $X$ to $T(X)$ without discarding any information about $\theta$. It is the ideal compression of $X$ for the purpose of inferring $\theta$.
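The equality $I(\theta; X) = I(\theta; T(X))$ can be verified exactly for a small Bernoulli model, where $T(X) = \sum_i X_i$ is the classic sufficient statistic. A sketch with a two-point prior on $\theta$ (the values 0.3 and 0.7 and $n = 3$ flips are illustrative choices):

```python
import numpy as np
from itertools import product

def mi_bits(joint):
    """Mutual information in bits from a joint distribution matrix."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

thetas = np.array([0.3, 0.7])          # discrete prior support for theta
prior = np.array([0.5, 0.5])           # uniform prior
n = 3                                  # number of coin flips

# Joint p(theta, x) over all 2^n binary sequences x.
seqs = list(product([0, 1], repeat=n))
p_theta_x = np.zeros((2, len(seqs)))
for i, th in enumerate(thetas):
    for j, x in enumerate(seqs):
        k = sum(x)
        p_theta_x[i, j] = prior[i] * th**k * (1 - th)**(n - k)

# Joint p(theta, t) where T = sum(x) is the sufficient statistic.
p_theta_t = np.zeros((2, n + 1))
for j, x in enumerate(seqs):
    p_theta_t[:, sum(x)] += p_theta_x[:, j]

# Sufficiency: compressing X to T loses nothing about theta.
assert abs(mi_bits(p_theta_x) - mi_bits(p_theta_t)) < 1e-9
print(f"I(theta; X) = {mi_bits(p_theta_x):.6f} bits = I(theta; T)")
```

The ordering of the flips is exactly the "irrelevant" information the statistic discards: $2^3 = 8$ outcomes collapse to $4$ values of $T$ with no loss about $\theta$.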
Fano's Inequality
Fano's inequality lower-bounds the probability of error in estimation problems.
Setup: parameter $\theta$ uniform on $\{1, \dots, M\}$ with $H(\theta) = \log_2 M$ bits, observation $X$, estimator $\hat{\theta} = g(X)$, error probability $P_e = P(\hat{\theta} \ne \theta)$.
Fano's inequality:
$$H(\theta \mid X) \le h(P_e) + P_e \log_2(M - 1),$$
where $h(p) = -p \log_2 p - (1 - p) \log_2(1 - p)$ is the binary entropy. Equivalently:
$$P_e \ge \frac{H(\theta \mid X) - 1}{\log_2(M - 1)} \quad\text{or}\quad P_e \ge 1 - \frac{I(\theta; X) + 1}{\log_2 M}.$$
Interpretation: if the conditional entropy $H(\theta \mid X)$ is large (the observation leaves much uncertainty about $\theta$), the error rate must be large. No estimator can achieve low error when the posterior uncertainty is high.
Proof sketch: introduce the error indicator $E = \mathbb{1}[\hat{\theta} \ne \theta]$. Expand $H(\theta, E \mid \hat{\theta})$ by the chain rule in two ways:
$$H(\theta \mid \hat{\theta}) + \underbrace{H(E \mid \theta, \hat{\theta})}_{=0} = H(E \mid \hat{\theta}) + H(\theta \mid E, \hat{\theta}).$$
Bound $H(E \mid \hat{\theta}) \le h(P_e)$ and $H(\theta \mid E, \hat{\theta}) \le P_e \log_2(M - 1)$ (when $E = 1$, $\theta$ ranges over the $M - 1$ wrong values; when $E = 0$, $\theta = \hat{\theta}$). Finally $H(\theta \mid X) \le H(\theta \mid \hat{\theta})$ by DPI.
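The inequality can be checked numerically against the MAP estimator on a random joint distribution. A NumPy sketch (the alphabet sizes are arbitrary; Fano holds for any joint over $(\theta, X)$ and any estimator, so a random one suffices):

```python
import numpy as np

def h_bits(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

rng = np.random.default_rng(2)
M, K = 5, 8                                           # |theta| = M, |X| = K (arbitrary)
joint = rng.dirichlet(np.ones(M * K)).reshape(M, K)   # random joint p(theta, x)

# MAP estimator: theta_hat(x) = argmax_theta p(theta | x).
theta_hat = joint.argmax(axis=0)
p_e = 1.0 - sum(joint[theta_hat[x], x] for x in range(K))

# H(theta | X) = sum_x p(x) * H(theta | X = x)
px = joint.sum(axis=0)
h_cond = sum(px[x] * h_bits(joint[:, x] / px[x]) for x in range(K))

# Fano: H(theta | X) <= h(P_e) + P_e * log2(M - 1)
bound = binary_entropy(p_e) + p_e * np.log2(M - 1)
assert h_cond <= bound + 1e-12
print(f"H(theta|X) = {h_cond:.4f} bits <= Fano bound {bound:.4f} bits")
```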
Application to minimax lower bounds: Fano's inequality is the standard tool for proving lower bounds on statistical estimation. To show that no estimator can achieve error below $\delta$:
- Construct $M$ distributions in the parameter space that are $2\delta$-apart in the loss metric
- Show these distributions satisfy $I(\theta; X^n) \le n \max_{j \ne k} D_{\mathrm{KL}}(P_j \,\|\, P_k)$ (information accumulates at most linearly over $n$ i.i.d. samples)
- Apply Fano to get $P_e \ge 1 - \frac{I(\theta; X^n) + 1}{\log_2 M}$
Information Bottleneck
The information bottleneck principle (Tishby, Pereira, Bialek 1999) frames representation learning as a compression problem: find a representation $T$ of input $X$ that maximally retains information about a target $Y$ while compressing away the rest of $X$:
$$\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y).$$
The Lagrange multiplier $\beta$ controls the tradeoff:
- $\beta \to 0$: compress maximally ($T$ collapses to a constant), discarding all information about $Y$
- $\beta \to \infty$: retain all information about $Y$ (a sufficient statistic), with no compression constraint
DPI constraint: since $Y \to X \to T$ is a Markov chain, $I(T; Y) \le I(X; Y)$ by DPI. The maximum information about $Y$ that any representation $T$ can achieve is $I(X; Y)$: the information between the raw input and the label.
Optimal representation: the optimal $T$ is a minimal sufficient statistic of $X$ for $Y$: it preserves $I(T; Y) = I(X; Y)$ while minimizing $I(X; T)$.
Implied Markov chain in a deep network: $Y \to X \to h_1 \to h_2 \to \cdots \to h_L$, where each layer is Markovian given the previous. By repeated application of DPI:
$$I(Y; X) \ge I(Y; h_1) \ge I(Y; h_2) \ge \cdots \ge I(Y; h_L).$$
Each layer can only lose or preserve information about $Y$.
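This monotone chain of inequalities can be verified exactly on a small discrete stand-in for a network, with each "layer" modeled as a stochastic channel. An illustrative sketch, not a real network (alphabet sizes and Dirichlet-sampled channels are arbitrary):

```python
import numpy as np

def mi_bits(joint):
    """Mutual information in bits from a joint distribution matrix."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(3)
p_y = rng.dirichlet(np.ones(3))               # label distribution p(y)
px_y = rng.dirichlet(np.ones(6), size=3)      # "input" channel p(x|y)
h1_x = rng.dirichlet(np.ones(5), size=6)      # layer 1 as a channel p(h1|x)
h2_h1 = rng.dirichlet(np.ones(4), size=5)     # layer 2 as a channel p(h2|h1)

p_yx = p_y[:, None] * px_y                    # joint p(y, x)
p_yh1 = p_yx @ h1_x                           # joint p(y, h1)
p_yh2 = p_yh1 @ h2_h1                         # joint p(y, h2)

vals = [mi_bits(p_yx), mi_bits(p_yh1), mi_bits(p_yh2)]
assert vals[0] + 1e-12 >= vals[1] >= vals[2] - 1e-12
print("I(Y;X) >= I(Y;h1) >= I(Y;h2):", [round(v, 4) for v in vals])
```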
Worked Example
Example 1: DPI in a Deep Classifier
Network: input $X$ (image pixels), layer 1 output $h_1$, layer 2 output $h_2 \in \mathbb{R}^{128}$, logits $z$ over 10 classes.
By DPI: $I(z; Y) \le I(h_2; Y) \le I(h_1; Y) \le I(X; Y) \le H(Y) = \log_2 10 \approx 3.32$ bits (for balanced 10-class classification).
The maximum information any 128-dimensional representation can retain about a 10-class label is $\log_2 10 \approx 3.32$ bits. Since $I(X; Y)$ is fixed by the task, every intermediate representation is limited by this ceiling. Good representations are those that saturate this bound: they retain all task-relevant information while discarding irrelevant details.
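A small computation makes the ceiling concrete. The confusion structure below is invented for illustration: a hypothetical classifier with 90% balanced accuracy and errors spread uniformly extracts well under the $\log_2 10$ bound:

```python
import numpy as np

def mi_bits(joint):
    """Mutual information in bits from a joint distribution matrix."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

# Joint p(Y, Yhat) for a hypothetical classifier: 90% accuracy on balanced
# 10-class data, with errors spread uniformly over the 9 wrong labels.
acc = 0.9
joint = np.full((10, 10), (1 - acc) / 9 / 10)
np.fill_diagonal(joint, acc / 10)

ceiling = np.log2(10)                          # H(Y) for balanced 10 classes
assert mi_bits(joint) <= ceiling
print(f"I(Y; Yhat) = {mi_bits(joint):.3f} bits, ceiling = {ceiling:.3f} bits")
```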
Example 2: Fano's Inequality for Hypothesis Testing
Test $H_0: X_i \sim P$ vs $H_1: X_i \sim Q$ from $n$ i.i.d. observations, with a uniform prior over the two hypotheses. By Fano (binary case, in bits):
$$h(P_e) \ge H(\theta \mid X^n) = 1 - I(\theta; X^n) \ge 1 - \frac{n}{4}\left(D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)\right),$$
so $P_e$ is bounded away from zero unless $n \cdot D_{\mathrm{KL}}(P \,\|\, Q) \gtrsim 1$.
This shows: to achieve $P_e \le \epsilon$, we need $n \gtrsim 1 / D_{\mathrm{KL}}(P \,\|\, Q)$ samples, an information-theoretic lower bound on the sample complexity of distinguishing the two hypotheses.
The Chernoff-Stein exponent gives the exact exponential rate of error decay; Fano gives the leading-order sample-complexity factor.
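Plugging in numbers for a hypothetical pair of hypotheses (the Bernoulli parameters 0.5 and 0.6 are an illustrative choice, not from the lesson):

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Bern(p) || Bern(q)) in bits."""
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

# Hypothetical hypotheses: a fair coin vs a slightly biased coin.
p, q = 0.5, 0.6
d = kl_bernoulli(p, q)

# Fano-style heuristic: reliably distinguishing P from Q needs on the
# order of n ~ 1 / D(P||Q) samples (up to constants).
n_required = 1.0 / d
print(f"D(P||Q) = {d:.4f} bits -> n on the order of {n_required:.0f} samples")
```

Closer hypotheses have smaller KL divergence, so the required sample count grows as the coins become harder to tell apart.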
Example 3: Information Bottleneck in VAEs
A VAE encodes input $x$ to latent $z$, then decodes $z$ to a reconstruction $\hat{x}$. The ELBO (from Module 07's Bayesian lesson):
$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z)).$$
The KL term penalizes mutual information between input and latent (compression): averaged over the data, $\mathbb{E}_x\left[D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))\right]$ upper-bounds $I(X; Z)$. The reconstruction term encourages $Z$ to retain enough information about $X$ to rebuild it. The ELBO is the information bottleneck objective with $\beta = 1$ and $Y = X$: a special case where we want to compress $X$ while retaining enough information to reconstruct it.
Connections
DPI says information cannot be created from nothing. Any transformation, compression, or processing of data can only lose information; it cannot add information about the input that was not already there. This is the information-theoretic analog of the second law of thermodynamics. For ML: the information about the label $Y$ in a model's final output $\hat{Y}$ is at most the information in the raw input, $I(\hat{Y}; Y) \le I(X; Y)$. No architecture, no matter how deep or complex, can exceed $I(X; Y)$, the information the problem itself contains. Good ML is about preserving this information efficiently through the computation graph.
The sufficient statistic is the ideal compression. Among all statistics that preserve $I(\theta; X)$, the minimal sufficient statistic achieves this while being maximally compressed. This is the information bottleneck at $\beta \to \infty$: compress maximally while retaining all task-relevant information. The exponential family sufficient statistic (e.g., sample mean and sum of squares for Gaussian data) is exactly this ideal compression: it discards all irrelevant ordering information while retaining everything about $\theta$.
Where Your Intuition Breaks
The information bottleneck theory of deep learning is contested. Tishby et al. (2017) conjectured that SGD implicitly optimizes the bottleneck objective: networks first fit the training data (increasing $I(T; Y)$) and then compress (decreasing $I(X; T)$ while maintaining $I(T; Y)$), with the compression phase visible as a distinct training phase. Subsequent work largely failed to replicate this: the mutual information estimates in the original experiments depended sensitively on activation binning, different estimation methods give qualitatively different pictures, and for networks with ReLU activations the compression phase does not appear. The information bottleneck is a useful conceptual framework; the claim that SGD implements it is not established.