
Entropy, Mutual Information & the Information Hierarchy

Entropy measures the average uncertainty in a random variable; mutual information measures how much knowing one variable reduces uncertainty about another — two concepts that appear throughout ML in loss functions, information bottleneck, contrastive learning, and feature selection.

Concepts

Binary entropy H(p) = −p log₂ p − (1−p) log₂(1−p) peaks at 1 bit for a fair coin and collapses to 0 for a deterministic outcome. For a binary symmetric channel with crossover p, capacity = 1 − H(p).

[Interactive plot: binary entropy $H(p)$ vs. $p$ on $[0,1]$. At $p = 1/2$: $H = 1.0000$ bit $= 0.6931$ nats, the maximum of $H(p)$, and BSC capacity $= 0$.]

General discrete entropy H(X) = −∑ pᵢ log₂ pᵢ over 6 outcomes. Maximum entropy is log₂ 6 = 2.585 bits (uniform distribution). Adjust weights or pick a preset — watch H(X) respond.

[Interactive demo: six-outcome distribution $x_1, \ldots, x_6$ with adjustable weights. At the uniform preset ($p = 0.167$ each): $H(X) = 2.5850$ bits $=$ max $H = \log_2 6$, so $H / H_{\max} = 100\%$.]

Every time a model outputs a probability distribution — a softmax over class labels, or a language model's token probabilities — the cross-entropy loss measures how far that distribution is from certainty about the right answer. Shannon entropy is the foundation: it quantifies the average uncertainty in a distribution and gives the theoretical lower bound on how compactly that uncertainty can be described without losing information.

Shannon Entropy

For a discrete random variable $X$ with PMF $p(x) = P(X = x)$ over alphabet $\mathcal{X}$:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}[-\log p(X)].$$

The $-\log p(x)$ term is not an arbitrary choice: it is the unique function satisfying (1) more-likely events are less surprising, (2) surprises add when independent events combine (log converts product to sum), and (3) the function is continuous. Entropy is then the expected surprise under the distribution itself — the average code length for optimal lossless compression. This is why minimizing cross-entropy is equivalent to maximizing log-likelihood: both minimize the expected code length of the true labels under the model's distribution.

Logarithms are base 2 (unit: bits) or base $e$ (unit: nats). By convention, $0 \log 0 \stackrel{\text{def}}{=} 0$.

Properties of entropy:

  1. Non-negativity: $H(X) \geq 0$, with equality iff $X$ is deterministic ($p(x^*) = 1$ for some $x^*$).

  2. Maximum entropy: $H(X) \leq \log|\mathcal{X}|$, achieved uniquely by the uniform distribution.

  3. Concavity: $H(\lambda p + (1-\lambda)q) \geq \lambda H(p) + (1-\lambda) H(q)$.

  4. Data processing: $H(f(X)) \leq H(X)$ for any function $f$ (applying a function cannot increase uncertainty).

Binary entropy function: for $X \sim \text{Bernoulli}(p)$:

$$H_b(p) = -p\log_2 p - (1-p)\log_2(1-p), \qquad H_b(0) = H_b(1) = 0, \quad H_b(1/2) = 1.$$
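The endpoints of $H_b$ and the BSC capacity formula from the top of the section are easy to check numerically; a minimal sketch in plain Python (standard library only):

```python
import math

def binary_entropy(p: float) -> float:
    """H_b(p) in bits, with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

print(binary_entropy(0.5))   # 1.0 bit: a fair coin
print(binary_entropy(1.0))   # 0.0 bits: deterministic outcome
print(bsc_capacity(0.11))    # ~0.5 bits per channel use
```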

Entropy of key distributions:

| Distribution | Entropy |
| --- | --- |
| Bernoulli($p$) | $H_b(p)$ bits |
| Uniform on $\{1,\ldots,n\}$ | $\log_2 n$ bits |
| Geometric($p$) | $H_b(p)/p = (-p\log_2 p - (1-p)\log_2(1-p))/p$ bits |
| Poisson($\lambda$) | $\lambda(1-\log\lambda) + e^{-\lambda}\sum_k \frac{\lambda^k \log(k!)}{k!}$ nats |
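The closed forms in the table can be sanity-checked against a direct evaluation of $-\sum_i p_i \log_2 p_i$; a sketch (the truncation point 200 is an arbitrary choice that makes the geometric tail negligible):

```python
import math

def entropy_bits(probs):
    """Plug-in entropy: -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform on {1,...,6}: H = log2 6 ~ 2.585 bits
print(entropy_bits([1/6] * 6))

# Geometric(p) on {1,2,...}: closed form H_b(p)/p vs. a truncated direct sum
p = 0.3
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 200)]
closed_form = (-p * math.log2(p) - (1 - p) * math.log2(1 - p)) / p
print(entropy_bits(pmf), closed_form)   # both ~ 2.938 bits
```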

Differential Entropy

For a continuous random variable $X$ with PDF $f_X$:

$$h(X) = -\int f_X(x) \log f_X(x)\,dx.$$

Differential entropy can be negative (unlike discrete entropy). Key values:

| Distribution | Differential entropy (nats) |
| --- | --- |
| $\mathcal{N}(\mu, \sigma^2)$ | $\frac{1}{2}\log(2\pi e \sigma^2)$ |
| Uniform$[a,b]$ | $\log(b-a)$ |
| Exponential($\lambda$) | $1 - \log\lambda$ |
| Multivariate $\mathcal{N}(\mu, \Sigma)$ | $\frac{1}{2}\log\big((2\pi e)^d \det\Sigma\big)$ |

Maximum entropy theorem: among all distributions on $\mathbb{R}$ with variance $\sigma^2$, the Gaussian maximizes differential entropy. This is why the Gaussian is the "worst case" noise distribution in many bounds.
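The Gaussian closed form can be checked by Monte Carlo, since $h(X) = \mathbb{E}[-\log f_X(X)]$ when $X \sim f_X$; a sketch using only the standard library (sample size and seed are arbitrary choices):

```python
import math, random

sigma = 2.0
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)   # nats

def neg_log_pdf(x: float) -> float:
    """-log f(x) for N(0, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + x ** 2 / (2 * sigma ** 2)

# Monte Carlo estimate of E[-log f(X)] under samples from f
random.seed(0)
n = 200_000
h_mc = sum(neg_log_pdf(random.gauss(0.0, sigma)) for _ in range(n)) / n
print(h_closed, h_mc)   # both ~ 2.112 nats
```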

The Information Hierarchy

For a joint pair $(X, Y)$:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y) \quad \text{(joint entropy)},$$

$$H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y) = \mathbb{E}_Y[H(X \mid Y=y)] \quad \text{(conditional entropy)}.$$

Chain rules:

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y),$$

$$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}).$$

Mutual information:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

$I(X;Y)$ measures the reduction in uncertainty about $X$ from knowing $Y$ (and vice versa). Always:

$$I(X;Y) \geq 0, \qquad I(X;Y) = I(Y;X), \qquad I(X;Y) = 0 \iff X \perp Y.$$

The information diagram (I-diagram): arrange $H(X)$ and $H(Y)$ as overlapping circles. The overlap is $I(X;Y)$; the non-overlapping regions are $H(X \mid Y)$ and $H(Y \mid X)$.
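These identities are easy to verify numerically from a joint PMF. The table below is a made-up "noisy copy" example chosen for illustration, not data from the text:

```python
import math

def H(probs):
    """Discrete entropy -sum p log2 p in bits, skipping zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint PMF over {0,1} x {0,1}: Y is a noisy copy of X
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

H_X, H_Y, H_XY = H(px), H(py), H(joint.values())
I1 = H_X + H_Y - H_XY       # I = H(X) + H(Y) - H(X,Y)
I2 = H_X - (H_XY - H_Y)     # I = H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y)
print(I1, I2)               # both ~ 0.278 bits
```

All three expressions for $I(X;Y)$ in the displayed identity give the same number, which is a useful check when implementing these quantities by hand.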

Asymptotic Equipartition Property (AEP)

AEP (weak form): for iid $X_1, X_2, \ldots \sim p(x)$:

$$-\frac{1}{n}\log p(X_1, \ldots, X_n) \xrightarrow{P} H(X).$$

The $\varepsilon$-typical set $A_\varepsilon^{(n)}$: sequences $x^n$ with $\left|-\frac{1}{n}\log p(x^n) - H\right| \leq \varepsilon$.

Properties of the typical set:

  1. $P(A_\varepsilon^{(n)}) \geq 1-\varepsilon$ for large $n$
  2. $|A_\varepsilon^{(n)}| \leq 2^{n(H+\varepsilon)}$
  3. $|A_\varepsilon^{(n)}| \geq (1-\varepsilon)\,2^{n(H-\varepsilon)}$

Source coding theorem (Shannon 1948): the minimum number of bits to losslessly compress $n$ iid draws from a source with entropy $H$ approaches $nH$, i.e., $H$ bits per symbol. Huffman coding achieves an average length within 1 bit of $H$ per symbol (and within $\varepsilon$ when coding blocks of symbols), and no code can compress below $H$ bits/symbol.
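The AEP convergence can be watched in simulation: for a Bernoulli source, the per-symbol log-probability of a sample path approaches $H_b(p)$ as $n$ grows. A sketch (the source parameter and seed are arbitrary choices):

```python
import math, random

p = 0.2                                               # Bernoulli(p) source
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)    # H_b(0.2) ~ 0.722 bits

def neg_log_prob_rate(n: int) -> float:
    """-(1/n) log2 p(X_1,...,X_n) for one iid Bernoulli(p) sample path."""
    xs = (random.random() < p for _ in range(n))
    return -sum(math.log2(p) if x else math.log2(1 - p) for x in xs) / n

random.seed(1)
for n in (10, 1_000, 100_000):
    print(n, neg_log_prob_rate(n))   # converges toward H ~ 0.722
```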

Worked Example

Example 1: English Text Entropy

Shannon estimated the entropy of English at roughly 0.6–1.3 bits/character by having humans predict the next character in English text. Compare:

  • Random ASCII: $\log_2 128 = 7$ bits/character
  • Letter frequencies only (26 letters): $H \approx 4.2$ bits/character (from actual frequencies)
  • Accounting for digrams (letter pairs): $H \approx 3.6$ bits/character
  • Accounting for long-range context: $H \approx 1.2$ bits/character

Each step incorporates more statistical structure, reducing entropy. Modern LLMs compress English text to roughly 1 bit per byte or less, consistent with Shannon's estimate.
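The frequency-only estimate is easy to reproduce: count character frequencies in a text and compute the plug-in entropy. The sample string below is illustrative; a zeroth-order estimate ignores all context, so it upper-bounds the true entropy rate:

```python
import math
from collections import Counter

def char_entropy_bits(text: str) -> float:
    """Zeroth-order entropy: characters treated as iid with empirical frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 50
print(char_entropy_bits(sample))   # ~4.3 bits/character for this toy text
```

Running the same function on a large English corpus (letters plus space) lands near the 4.2 bits/character figure quoted above.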

Example 2: Mutual Information for Feature Selection

Random variable $Y$ (label) and features $X_1, X_2$:

$$I(X_1; Y) = H(Y) - H(Y \mid X_1)$$

computes how much X1X_1 reduces uncertainty about YY. Features are ranked by mutual information with the label — a model-free measure of relevance that captures nonlinear dependencies.

For discrete $X_1$ (binned) and binary $Y$: if $Y$ is perfectly predictable from $X_1$ (i.e., $H(Y \mid X_1) = 0$), then $I(X_1;Y) = H(Y)$, the maximum possible. If $X_1 \perp Y$, then $I(X_1;Y) = 0$.
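A plug-in estimator for discrete features makes this ranking concrete. The toy dataset below is hypothetical, built so that $X_1$ determines $Y$ exactly while $X_2$ is independent of it:

```python
import math
from collections import Counter

def mi_bits(xs, ys) -> float:
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

y  = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 0, 1, 1, 0, 0, 1, 1]   # perfect predictor of y
x2 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of y
print(mi_bits(x1, y))   # 1.0 bit = H(Y): maximal relevance
print(mi_bits(x2, y))   # 0.0 bits: no relevance
```

On real data this plug-in estimate is biased upward for small samples and many bins, so rankings should be read with that caveat in mind.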

MINE (Mutual Information Neural Estimation) uses the Donsker–Varadhan dual representation of mutual information to estimate $I(X;Y)$ from samples when the joint distribution is unknown:

$$I(X;Y) = \sup_{T:\,\mathcal{X}\times\mathcal{Y}\to\mathbb{R}} \ \mathbb{E}_{p(x,y)}[T(x,y)] - \log\mathbb{E}_{p(x)p(y)}\big[e^{T(x,y)}\big].$$

Example 3: Joint Entropy Computation

$X \in \{0,1\}$, $Y \in \{0,1\}$, joint distribution:

|  | $Y=0$ | $Y=1$ |
| --- | --- | --- |
| $X=0$ | 1/4 | 1/4 |
| $X=1$ | 1/4 | 1/4 |

$H(X) = 1$, $H(Y) = 1$, $H(X,Y) = 2$, $H(X \mid Y) = 1$, $I(X;Y) = 0$: the variables are independent.

Now change to: $p(0,0) = p(1,1) = 1/2$, $p(0,1) = p(1,0) = 0$.

$H(X) = 1$, $H(Y) = 1$, $H(X,Y) = 1$, $H(X \mid Y) = 0$, $I(X;Y) = 1$: perfectly correlated, so knowing $Y$ tells you everything about $X$.
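Both cases can be checked with a few lines of code; a sketch that computes the full set of quantities from any 2×2 joint PMF:

```python
import math

def H(probs):
    """Discrete entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def summarize(joint):
    """joint[x][y] -> (H(X), H(Y), H(X,Y), H(X|Y), I(X;Y)) in bits."""
    px = [sum(joint[x].values()) for x in (0, 1)]
    py = [sum(joint[x][y] for x in (0, 1)) for y in (0, 1)]
    flat = [joint[x][y] for x in (0, 1) for y in (0, 1)]
    H_X, H_Y, H_XY = H(px), H(py), H(flat)
    return H_X, H_Y, H_XY, H_XY - H_Y, H_X + H_Y - H_XY

independent = {0: {0: 0.25, 1: 0.25}, 1: {0: 0.25, 1: 0.25}}
correlated  = {0: {0: 0.50, 1: 0.00}, 1: {0: 0.00, 1: 0.50}}
print(summarize(independent))   # (1.0, 1.0, 2.0, 1.0, 0.0)
print(summarize(correlated))    # (1.0, 1.0, 1.0, 0.0, 1.0)
```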

Connections

Where Your Intuition Breaks

Entropy is often described as measuring "randomness" or "disorder" — but discrete and differential entropy are fundamentally different objects. Discrete entropy is always non-negative ($H(X) \geq 0$) and operationally meaningful (bits per symbol). Differential entropy $h(X) = -\int f(x)\log f(x)\,dx$ can be negative, is not invariant under change of variables ($h(aX) = h(X) + \log|a|$), and depends on the reference measure chosen. Only differences of differential entropies, like mutual information $I(X;Y) = h(X) - h(X \mid Y)$, are invariant and have direct operational meaning. Treating differential entropy as "continuous entropy in bits" and plugging it into formulas designed for discrete entropy leads to sign errors and scale-dependent results.

💡Intuition

Entropy measures irreducible randomness. $H(X)$ is the expected number of bits needed to describe $X$, regardless of encoding. For a fair die ($H = \log_2 6 \approx 2.58$ bits): no code can describe the outcome in fewer than 2.58 bits per roll on average. Huffman codes approach this limit. The entropy is "irreducible" — it cannot be compressed away because it is real randomness in the source, not just poor representation. In ML: cross-entropy loss $H(p, q) = -\sum_x p(x)\log q(x) = H(p) + \mathrm{KL}(p \,\|\, q)$ decomposes into the irreducible entropy $H(p)$ and the KL divergence from the model $q$ to the true distribution $p$. The training objective can reduce KL but not $H(p)$.
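The decomposition is a one-liner to verify numerically; $p$ and $q$ below are arbitrary illustrative distributions, not from the text:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" label distribution (illustrative)
q = [0.5, 0.3, 0.2]   # model's predicted distribution
print(cross_entropy(p, q))     # equals H(p) + KL(p||q)
print(entropy(p) + kl(p, q))   # same value; only the KL term is reducible
```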

💡Intuition

Mutual information is the reduction in entropy from observation. $I(X;Y) = H(X) - H(X \mid Y)$: how many bits of uncertainty about $X$ are resolved by observing $Y$. Conditioning can never increase entropy on average: $H(X \mid Y) \leq H(X)$. When $X$ and $Y$ are independent, knowing $Y$ tells you nothing: $I(X;Y) = 0$. When $X = Y$, knowing $Y$ tells you everything: $I(X;Y) = H(X)$. Mutual information is symmetric ($I(X;Y) = I(Y;X)$) — the amount of information $Y$ contains about $X$ equals the amount $X$ contains about $Y$.

⚠️Warning

Differential entropy is not a direct analog of discrete entropy. It can be negative (e.g., Uniform$[0, 0.1]$ has $h = \log(0.1) < 0$), is not invariant under change of variables ($h(aX) = h(X) + \log|a|$), and depends on the choice of reference measure. Only differences of differential entropies (like mutual information $I(X;Y) = h(X) + h(Y) - h(X,Y)$) are invariant under reparameterization and have direct operational meaning. When computing mutual information for continuous distributions, work with differences such as $I(X;Y) = h(X) - h(X \mid Y)$; the subtraction cancels the problematic reference-measure dependence.
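Both failure modes show up immediately with the closed forms from the differential-entropy table; a sketch using the Gaussian and uniform entries (the values of sigma and a are arbitrary):

```python
import math

def h_gauss(sigma: float) -> float:
    """Differential entropy of N(0, sigma^2) in nats: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Not scale-invariant: h(aX) = h(X) + log|a|, since aX ~ N(0, a^2 sigma^2)
sigma, a = 1.0, 3.0
print(h_gauss(a * sigma))               # ~2.518 nats
print(h_gauss(sigma) + math.log(a))     # same value

# Can be negative: Uniform[0, 0.1] has h = log(0.1)
print(math.log(0.1))                    # ~ -2.303 nats
```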
