
Entropy, Mutual Information & the Information Hierarchy

Entropy measures the average uncertainty in a random variable; mutual information measures how much knowing one variable reduces uncertainty about another — two concepts that appear throughout ML in loss functions, information bottleneck, contrastive learning, and feature selection.

Concepts

Binary entropy H(p) = −p log₂ p − (1−p) log₂(1−p) peaks at 1 bit for a fair coin and collapses to 0 for a deterministic outcome. For a binary symmetric channel with crossover p, capacity = 1 − H(p).

[Interactive plot: binary entropy $H(p)$ vs. $p$ on $[0,1]$. At $p = 1/2$: $H = 1.0000$ bit $= 0.6931$ nats, the maximum of $H(p)$, and BSC capacity $= 0$.]

General discrete entropy H(X) = −∑ pᵢ log₂ pᵢ over 6 outcomes. Maximum entropy is log₂ 6 = 2.585 bits (uniform distribution). Adjust weights or pick a preset — watch H(X) respond.

[Interactive demo: six-outcome distribution $x_1, \ldots, x_6$ with adjustable weights. At the uniform preset ($p = 0.167$ each): $H(X) = 2.5850$ bits $=$ max $H = \log_2 6$, so $H / H_{\max} = 100\%$.]

Every time a model outputs a probability distribution — a softmax over class labels, or a language model's token probabilities — the cross-entropy loss measures how far that distribution is from certainty about the right answer. Shannon entropy is the foundation: it quantifies the average uncertainty in a distribution and gives the theoretical lower bound on how compactly that uncertainty can be described without losing information.

Shannon Entropy

For a discrete random variable $X$ with PMF $p(x) = P(X = x)$ over alphabet $\mathcal{X}$:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}[-\log p(X)].$$

The $-\log p(x)$ term is not an arbitrary choice: it is the unique function satisfying (1) more-likely events are less surprising, (2) surprises add when independent events combine (log converts product to sum), and (3) the function is continuous. Entropy is then the expected surprise under the distribution itself — the average code length for optimal lossless compression. This is why minimizing cross-entropy is equivalent to maximizing log-likelihood: both minimize the expected code length of the true labels under the model's distribution.

Logarithms are base 2 (unit: bits) or base $e$ (unit: nats). By convention, $0 \log 0 \stackrel{\text{def}}{=} 0$.

Properties of entropy:

  1. Non-negativity: $H(X) \geq 0$, with equality iff $X$ is deterministic ($p(x^*) = 1$ for some $x^*$).

  2. Maximum entropy: $H(X) \leq \log|\mathcal{X}|$, achieved uniquely by the uniform distribution.

  3. Concavity: $H(\lambda p + (1-\lambda)q) \geq \lambda H(p) + (1-\lambda) H(q)$.

  4. Data processing: $H(f(X)) \leq H(X)$ for any function $f$ (applying a function cannot increase uncertainty).

Binary entropy function: for $X \sim \text{Bernoulli}(p)$:

$$H_b(p) = -p\log_2 p - (1-p)\log_2(1-p), \qquad H_b(0) = H_b(1) = 0, \quad H_b(1/2) = 1.$$
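The endpoints of $H_b$ and the BSC capacity formula from the top of the section are easy to check numerically; a minimal sketch in plain Python (standard library only):

```python
import math

def binary_entropy(p: float) -> float:
    """H_b(p) in bits, with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

print(binary_entropy(0.5))   # 1.0 bit: a fair coin
print(binary_entropy(1.0))   # 0.0 bits: deterministic outcome
print(bsc_capacity(0.11))    # ~0.5 bits per channel use
```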

Entropy of key distributions:

| Distribution | Entropy |
| --- | --- |
| Bernoulli($p$) | $H_b(p)$ bits |
| Uniform on $\{1,\ldots,n\}$ | $\log_2 n$ bits |
| Geometric($p$) | $H_b(p)/p = (-p\log_2 p - (1-p)\log_2(1-p))/p$ bits |
| Poisson($\lambda$) | $\lambda(1-\log\lambda) + e^{-\lambda}\sum_k \frac{\lambda^k \log(k!)}{k!}$ nats |
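The closed forms in the table can be sanity-checked against a direct evaluation of $-\sum_i p_i \log_2 p_i$; a sketch (the truncation point 200 is an arbitrary choice that makes the geometric tail negligible):

```python
import math

def entropy_bits(probs):
    """Plug-in entropy: -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform on {1,...,6}: H = log2 6 ~ 2.585 bits
print(entropy_bits([1/6] * 6))

# Geometric(p) on {1,2,...}: closed form H_b(p)/p vs. a truncated direct sum
p = 0.3
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 200)]
closed_form = (-p * math.log2(p) - (1 - p) * math.log2(1 - p)) / p
print(entropy_bits(pmf), closed_form)   # both ~ 2.938 bits
```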

Differential Entropy

For a continuous random variable $X$ with PDF $f_X$:

$$h(X) = -\int f_X(x) \log f_X(x)\,dx.$$

Differential entropy can be negative (unlike discrete entropy). Key values:

| Distribution | Differential entropy (nats) |
| --- | --- |
| $\mathcal{N}(\mu, \sigma^2)$ | $\frac{1}{2}\log(2\pi e \sigma^2)$ |
| Uniform$[a,b]$ | $\log(b-a)$ |
| Exponential($\lambda$) | $1 - \log\lambda$ |
| Multivariate $\mathcal{N}(\mu, \Sigma)$ | $\frac{1}{2}\log\big((2\pi e)^d \det\Sigma\big)$ |

Maximum entropy theorem: among all distributions on $\mathbb{R}$ with variance $\sigma^2$, the Gaussian maximizes differential entropy. This is why the Gaussian is the "worst case" noise distribution in many bounds.
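The Gaussian closed form can be checked by Monte Carlo, since $h(X) = \mathbb{E}[-\log f_X(X)]$ when $X \sim f_X$; a sketch using only the standard library (sample size and seed are arbitrary choices):

```python
import math, random

sigma = 2.0
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)   # nats

def neg_log_pdf(x: float) -> float:
    """-log f(x) for N(0, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + x ** 2 / (2 * sigma ** 2)

# Monte Carlo estimate of E[-log f(X)] under samples from f
random.seed(0)
n = 200_000
h_mc = sum(neg_log_pdf(random.gauss(0.0, sigma)) for _ in range(n)) / n
print(h_closed, h_mc)   # both ~ 2.112 nats
```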

The Information Hierarchy

For a joint pair $(X, Y)$:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y) \quad \text{(joint entropy)},$$

$$H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y) = \mathbb{E}_Y[H(X \mid Y=y)] \quad \text{(conditional entropy)}.$$

Chain rules:

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y),$$

$$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}).$$

Mutual information:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

$I(X;Y)$ measures the reduction in uncertainty about $X$ from knowing $Y$ (and vice versa). Always:

$$I(X;Y) \geq 0, \qquad I(X;Y) = I(Y;X), \qquad I(X;Y) = 0 \iff X \perp Y.$$

The information diagram (I-diagram): arrange $H(X)$ and $H(Y)$ as overlapping circles. The overlap is $I(X;Y)$; the non-overlapping regions are $H(X \mid Y)$ and $H(Y \mid X)$.
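These identities are easy to verify numerically from a joint PMF. The table below is a made-up "noisy copy" example chosen for illustration, not data from the text:

```python
import math

def H(probs):
    """Discrete entropy -sum p log2 p in bits, skipping zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint PMF over {0,1} x {0,1}: Y is a noisy copy of X
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

H_X, H_Y, H_XY = H(px), H(py), H(joint.values())
I1 = H_X + H_Y - H_XY       # I = H(X) + H(Y) - H(X,Y)
I2 = H_X - (H_XY - H_Y)     # I = H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y)
print(I1, I2)               # both ~ 0.278 bits
```

All three expressions for $I(X;Y)$ in the displayed identity give the same number, which is a useful check when implementing these quantities by hand.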

Asymptotic Equipartition Property (AEP)

AEP (weak form): for iid $X_1, X_2, \ldots \sim p(x)$:

$$-\frac{1}{n}\log p(X_1, \ldots, X_n) \xrightarrow{P} H(X).$$

The $\varepsilon$-typical set $A_\varepsilon^{(n)}$: sequences $x^n$ with $\left|-\frac{1}{n}\log p(x^n) - H\right| \leq \varepsilon$.

Properties of the typical set:

  1. $P(A_\varepsilon^{(n)}) \geq 1-\varepsilon$ for large $n$
  2. $|A_\varepsilon^{(n)}| \leq 2^{n(H+\varepsilon)}$
  3. $|A_\varepsilon^{(n)}| \geq (1-\varepsilon)\,2^{n(H-\varepsilon)}$

Source coding theorem (Shannon 1948): the minimum number of bits to losslessly compress $n$ iid draws from a source with entropy $H$ approaches $nH$, i.e., $H$ bits per symbol. Huffman coding achieves an average length within 1 bit of $H$ per symbol (and within $\varepsilon$ when coding blocks of symbols), and no code can compress below $H$ bits/symbol.
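The AEP convergence can be watched in simulation: for a Bernoulli source, the per-symbol log-probability of a sample path approaches $H_b(p)$ as $n$ grows. A sketch (the source parameter and seed are arbitrary choices):

```python
import math, random

p = 0.2                                               # Bernoulli(p) source
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)    # H_b(0.2) ~ 0.722 bits

def neg_log_prob_rate(n: int) -> float:
    """-(1/n) log2 p(X_1,...,X_n) for one iid Bernoulli(p) sample path."""
    xs = (random.random() < p for _ in range(n))
    return -sum(math.log2(p) if x else math.log2(1 - p) for x in xs) / n

random.seed(1)
for n in (10, 1_000, 100_000):
    print(n, neg_log_prob_rate(n))   # converges toward H ~ 0.722
```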

Worked Example

Example 1: English Text Entropy

Shannon estimated the entropy of English at roughly 0.6–1.3 bits/character by having humans predict the next character in English text. Compare:

  • Random ASCII: $\log_2 128 = 7$ bits/character
  • Letter frequencies only (26 letters): $H \approx 4.2$ bits/character (from actual frequencies)
  • Accounting for digrams (letter pairs): $H \approx 3.6$ bits/character
  • Accounting for long-range context: $H \approx 1.2$ bits/character

Each step incorporates more statistical structure, reducing entropy. Modern LLMs compress English text to roughly 1 bit per byte or less, consistent with Shannon's estimate.
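The frequency-only estimate is easy to reproduce: count character frequencies in a text and compute the plug-in entropy. The sample string below is illustrative; a zeroth-order estimate ignores all context, so it upper-bounds the true entropy rate:

```python
import math
from collections import Counter

def char_entropy_bits(text: str) -> float:
    """Zeroth-order entropy: characters treated as iid with empirical frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 50
print(char_entropy_bits(sample))   # ~4.3 bits/character for this toy text
```

Running the same function on a large English corpus (letters plus space) lands near the 4.2 bits/character figure quoted above.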

Example 2: Mutual Information for Feature Selection

Random variable $Y$ (label) and features $X_1, X_2$:

$$I(X_1; Y) = H(Y) - H(Y \mid X_1)$$

computes how much X1X_1 reduces uncertainty about YY. Features are ranked by mutual information with the label — a model-free measure of relevance that captures nonlinear dependencies.

For discrete $X_1$ (binned) and binary $Y$: if $Y$ is perfectly predictable from $X_1$ (i.e., $H(Y \mid X_1) = 0$), then $I(X_1;Y) = H(Y)$, the maximum possible. If $X_1 \perp Y$, then $I(X_1;Y) = 0$.
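A plug-in estimator for discrete features makes this ranking concrete. The toy dataset below is hypothetical, built so that $X_1$ determines $Y$ exactly while $X_2$ is independent of it:

```python
import math
from collections import Counter

def mi_bits(xs, ys) -> float:
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

y  = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 0, 1, 1, 0, 0, 1, 1]   # perfect predictor of y
x2 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of y
print(mi_bits(x1, y))   # 1.0 bit = H(Y): maximal relevance
print(mi_bits(x2, y))   # 0.0 bits: no relevance
```

On real data this plug-in estimate is biased upward for small samples and many bins, so rankings should be read with that caveat in mind.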

MINE (Mutual Information Neural Estimation) uses the Donsker–Varadhan dual representation of mutual information to estimate $I(X;Y)$ from samples when the joint distribution is unknown:

$$I(X;Y) = \sup_{T:\,\mathcal{X}\times\mathcal{Y}\to\mathbb{R}} \ \mathbb{E}_{p(x,y)}[T(x,y)] - \log\mathbb{E}_{p(x)p(y)}\big[e^{T(x,y)}\big].$$

Example 3: Joint Entropy Computation

$X \in \{0,1\}$, $Y \in \{0,1\}$, joint distribution:

|  | $Y=0$ | $Y=1$ |
| --- | --- | --- |
| $X=0$ | 1/4 | 1/4 |
| $X=1$ | 1/4 | 1/4 |

$H(X) = 1$, $H(Y) = 1$, $H(X,Y) = 2$, $H(X \mid Y) = 1$, $I(X;Y) = 0$: the variables are independent.

Now change to: $p(0,0) = p(1,1) = 1/2$, $p(0,1) = p(1,0) = 0$.

$H(X) = 1$, $H(Y) = 1$, $H(X,Y) = 1$, $H(X \mid Y) = 0$, $I(X;Y) = 1$: perfectly correlated, so knowing $Y$ tells you everything about $X$.
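Both cases can be checked with a few lines of code; a sketch that computes the full set of quantities from any 2×2 joint PMF:

```python
import math

def H(probs):
    """Discrete entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def summarize(joint):
    """joint[x][y] -> (H(X), H(Y), H(X,Y), H(X|Y), I(X;Y)) in bits."""
    px = [sum(joint[x].values()) for x in (0, 1)]
    py = [sum(joint[x][y] for x in (0, 1)) for y in (0, 1)]
    flat = [joint[x][y] for x in (0, 1) for y in (0, 1)]
    H_X, H_Y, H_XY = H(px), H(py), H(flat)
    return H_X, H_Y, H_XY, H_XY - H_Y, H_X + H_Y - H_XY

independent = {0: {0: 0.25, 1: 0.25}, 1: {0: 0.25, 1: 0.25}}
correlated  = {0: {0: 0.50, 1: 0.00}, 1: {0: 0.00, 1: 0.50}}
print(summarize(independent))   # (1.0, 1.0, 2.0, 1.0, 0.0)
print(summarize(correlated))    # (1.0, 1.0, 1.0, 0.0, 1.0)
```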

Connections

Where Your Intuition Breaks

Entropy is often described as measuring "randomness" or "disorder" — but discrete and differential entropy are fundamentally different objects. Discrete entropy is always non-negative ($H(X) \geq 0$) and operationally meaningful (bits per symbol). Differential entropy $h(X) = -\int f(x)\log f(x)\,dx$ can be negative, is not invariant under change of variables ($h(aX) = h(X) + \log|a|$), and depends on the reference measure chosen. Only differences of differential entropies, like mutual information $I(X;Y) = h(X) - h(X \mid Y)$, are invariant and have direct operational meaning. Treating differential entropy as "continuous entropy in bits" and plugging it into formulas designed for discrete entropy leads to sign errors and scale-dependent results.

💡Intuition

Entropy measures irreducible randomness. $H(X)$ is the expected number of bits needed to describe $X$, regardless of encoding. For a fair die ($H = \log_2 6 \approx 2.58$ bits): no code can describe the outcome in fewer than 2.58 bits per roll on average. Huffman codes approach this limit. The entropy is "irreducible" — it cannot be compressed away because it is real randomness in the source, not just poor representation. In ML: cross-entropy loss $H(p, q) = -\sum_x p(x)\log q(x) = H(p) + \mathrm{KL}(p \,\|\, q)$ decomposes into the irreducible entropy $H(p)$ and the KL divergence from the model $q$ to the true distribution $p$. The training objective can reduce KL but not $H(p)$.
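The decomposition is a one-liner to verify numerically; $p$ and $q$ below are arbitrary illustrative distributions, not from the text:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" label distribution (illustrative)
q = [0.5, 0.3, 0.2]   # model's predicted distribution
print(cross_entropy(p, q))     # equals H(p) + KL(p||q)
print(entropy(p) + kl(p, q))   # same value; only the KL term is reducible
```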

💡Intuition

Mutual information is the reduction in entropy from observation. $I(X;Y) = H(X) - H(X \mid Y)$: how many bits of uncertainty about $X$ are resolved by observing $Y$. Conditioning can never increase entropy on average: $H(X \mid Y) \leq H(X)$. When $X$ and $Y$ are independent, knowing $Y$ tells you nothing: $I(X;Y) = 0$. When $X = Y$, knowing $Y$ tells you everything: $I(X;Y) = H(X)$. Mutual information is symmetric ($I(X;Y) = I(Y;X)$) — the amount of information $Y$ contains about $X$ equals the amount $X$ contains about $Y$.

⚠️Warning

Differential entropy is not a direct analog of discrete entropy. It can be negative (e.g., Uniform$[0, 0.1]$ has $h = \log(0.1) < 0$), is not invariant under change of variables ($h(aX) = h(X) + \log|a|$), and depends on the choice of reference measure. Only differences of differential entropies (like mutual information $I(X;Y) = h(X) + h(Y) - h(X,Y)$) are invariant under reparameterization and have direct operational meaning. When computing mutual information for continuous distributions, work with differences such as $I(X;Y) = h(X) - h(X \mid Y)$; the subtraction cancels the problematic reference-measure dependence.
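Both failure modes show up immediately with the closed forms from the differential-entropy table; a sketch using the Gaussian and uniform entries (the values of sigma and a are arbitrary):

```python
import math

def h_gauss(sigma: float) -> float:
    """Differential entropy of N(0, sigma^2) in nats: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# Not scale-invariant: h(aX) = h(X) + log|a|, since aX ~ N(0, a^2 sigma^2)
sigma, a = 1.0, 3.0
print(h_gauss(a * sigma))               # ~2.518 nats
print(h_gauss(sigma) + math.log(a))     # same value

# Can be negative: Uniform[0, 0.1] has h = log(0.1)
print(math.log(0.1))                    # ~ -2.303 nats
```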
