Entropy, Mutual Information & the Information Hierarchy
Entropy measures the average uncertainty in a random variable; mutual information measures how much knowing one variable reduces uncertainty about another — two concepts that appear throughout ML in loss functions, information bottleneck, contrastive learning, and feature selection.
Concepts
Binary entropy H(p) = −p log₂ p − (1−p) log₂(1−p) peaks at 1 bit for a fair coin and collapses to 0 for a deterministic outcome. For a binary symmetric channel with crossover p, capacity = 1 − H(p).
General discrete entropy H(X) = −∑ᵢ pᵢ log₂ pᵢ. Over 6 outcomes, the maximum entropy is log₂ 6 ≈ 2.585 bits, achieved by the uniform distribution; any deviation from uniform weights lowers H(X).
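These values can be checked numerically. A minimal sketch (the `entropy` helper is illustrative, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform6 = [1/6] * 6
print(round(entropy(uniform6), 3))   # ≈ 2.585 bits, the maximum for 6 outcomes

skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
print(round(entropy(skewed), 3))     # ≈ 2.061 bits — strictly below the maximum
```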
Every time a model outputs a probability distribution — a softmax over class labels, or a language model's token probabilities — the cross-entropy loss measures how far that distribution is from certainty about the right answer. Shannon entropy is the foundation: it quantifies the average uncertainty in a distribution and gives the theoretical lower bound on how compactly that uncertainty can be described without losing information.
Shannon Entropy
For a discrete random variable X with PMF p(x) over alphabet 𝒳:

H(X) = −∑ₓ p(x) log₂ p(x)
The surprise term −log₂ p(x) is not an arbitrary choice: it is the unique function (up to the base of the logarithm) satisfying (1) more-likely events are less surprising, (2) surprises add when independent events combine (log converts product to sum), and (3) the function is continuous. Entropy is then the expected surprise under the distribution itself — the average code length for optimal lossless compression. This is why minimizing cross-entropy is equivalent to maximizing log-likelihood: both minimize the expected code length of the true labels under the model's distribution.
Logarithms are base 2 (unit: bits) or base e (unit: nats). 1 nat = log₂ e ≈ 1.443 bits.
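Both the bits/nats conversion and the cross-entropy decomposition above can be verified on a small example (a sketch; the distributions and helper names are mine):

```python
import math

def entropy(p, log=math.log2):
    """Entropy in bits by default; pass log=math.log for nats."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length (bits) of labels ~ p under a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model distribution

# entropy in bits = entropy in nats / ln 2
assert abs(entropy(p) - entropy(p, math.log) / math.log(2)) < 1e-12

# cross-entropy = entropy + KL(p || q): minimizing CE over q minimizes KL
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```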
Properties of entropy:

- Non-negativity: H(X) ≥ 0, with equality iff X is deterministic (p(x) = 1 for some x).
- Maximum entropy: H(X) ≤ log₂ |𝒳|, achieved uniquely by the uniform distribution.
- Concavity: H is concave in the distribution — mixing distributions never decreases entropy: H(λp + (1−λ)q) ≥ λH(p) + (1−λ)H(q).
- Data processing: H(f(X)) ≤ H(X) for any deterministic function f (applying a function cannot increase uncertainty).
Binary entropy function, for p ∈ [0, 1]:

H(p) = −p log₂ p − (1−p) log₂(1−p), with H(0) = H(1) = 0 by continuity.
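A quick check of the binary entropy function and the BSC capacity formula 1 − H(p) from the opening section (a sketch; the crossover value 0.11 is just a convenient example where H(p) ≈ 0.5):

```python
import math

def binary_entropy(p):
    """H(p) in bits; H(0) = H(1) = 0 by continuity."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))              # 1.0 bit: fair coin is maximally uncertain
print(round(binary_entropy(0.11), 3))   # ≈ 0.5 bits
print(1 - binary_entropy(0.11))         # BSC capacity at crossover 0.11: ≈ 0.5 bits/use
```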
Entropy of key distributions:
| Distribution | Entropy |
|---|---|
| Bernoulli(p) | H(p) = −p log₂ p − (1−p) log₂(1−p) bits |
| Uniform on {1, …, n} | log₂ n bits |
| Geometric(p) on {1, 2, …} | H(p)/p bits |
| Poisson(λ) | ≈ ½ log₂(2πeλ) bits for large λ |
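The geometric entry can be cross-checked by summing the definition directly, assuming the support-{1, 2, …} parameterization P(X = k) = (1−p)^(k−1) p (a sketch; truncating the sum at k = 2000 makes the tail negligible):

```python
import math

def hb(p):
    """Binary entropy H(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.3
# Direct entropy of Geometric(p): -sum P(X=k) log2 P(X=k) over k = 1, 2, ...
direct = -sum((1 - p)**(k - 1) * p * math.log2((1 - p)**(k - 1) * p)
              for k in range(1, 2000))

# Matches the closed form H(p)/p from the table
assert abs(direct - hb(p) / p) < 1e-9
```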
Differential Entropy
For a continuous random variable X with PDF f(x):

h(X) = −∫ f(x) log₂ f(x) dx
Differential entropy can be negative (unlike discrete entropy). Key values:
| Distribution | Differential entropy |
|---|---|
| Uniform[a, b] | log₂(b − a) bits |
| Exponential(λ) | log₂(e/λ) bits |
| Multivariate Gaussian 𝒩(μ, Σ) in d dimensions | ½ log₂((2πe)ᵈ det Σ) bits |
Maximum entropy theorem: among all distributions on ℝ with variance σ², the Gaussian maximizes differential entropy. This is why the Gaussian is the "worst case" noise distribution in many bounds.
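The theorem can be sanity-checked with the closed forms from the table, matching each distribution's variance to σ² (a sketch; the specific parameterizations — Uniform of width √12·σ, Exponential with rate 1/σ — are my choices to equalize the variances):

```python
import math

sigma = 1.0  # fix the variance at sigma**2

h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)  # ≈ 2.047 bits
h_unif = math.log2(math.sqrt(12) * sigma)  # Uniform of width sqrt(12)*sigma: variance sigma**2
h_expo = math.log2(math.e * sigma)         # Exponential with rate 1/sigma: variance sigma**2

# The Gaussian dominates both at equal variance
assert h_gauss > h_unif > h_expo
```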
The Information Hierarchy
For joint (X, Y):

H(X, Y) = −∑ p(x, y) log₂ p(x, y) and H(Y|X) = −∑ p(x, y) log₂ p(y|x), with sums over all pairs (x, y).
Chain rules:

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
Mutual information:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)
I(X; Y) measures the reduction in uncertainty about X from knowing Y (and vice versa). Always: I(X; Y) ≥ 0, with equality iff X and Y are independent.
The information diagram (I-diagram): draw H(X) and H(Y) as overlapping circles. The overlap is I(X; Y); each circle's non-overlapping region is H(X|Y) and H(Y|X), respectively.
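All the I-diagram regions can be computed from a joint table using only the identities above (a sketch; the joint distribution here is a hypothetical example, not from the text):

```python
import math

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical joint distribution p(x, y) over {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

Hxy = H(joint.values())
Hx, Hy = H(px), H(py)
I = Hx + Hy - Hxy          # the overlap of the two circles
Hx_given_y = Hxy - Hy      # X's circle minus the overlap
Hy_given_x = Hxy - Hx      # Y's circle minus the overlap

# The three expressions for mutual information agree, and I >= 0
assert abs(I - (Hx - Hx_given_y)) < 1e-12
assert abs(I - (Hy - Hy_given_x)) < 1e-12
assert I >= 0
```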
Asymptotic Equipartition Property (AEP)
AEP (weak form): for iid X₁, …, Xₙ ~ p:

−(1/n) log₂ p(X₁, …, Xₙ) → H(X) in probability.
The ε-typical set A_ε⁽ⁿ⁾: sequences with |−(1/n) log₂ p(x₁, …, xₙ) − H(X)| ≤ ε.
Properties of the typical set:

- P(A_ε⁽ⁿ⁾) > 1 − ε for large n
- |A_ε⁽ⁿ⁾| ≤ 2^(n(H(X)+ε)) — roughly 2^(nH(X)) sequences carry almost all the probability mass
- each typical sequence has probability ≈ 2^(−nH(X))
Source coding theorem (Shannon 1948): the minimum number of bits to losslessly compress iid draws from a source with entropy H(X) is H(X) bits per symbol. Specifically, H(X) + ε bits per symbol suffice for large block lengths (e.g., by indexing the typical set, or via Huffman coding over long blocks), and no compression below H(X) bits/symbol is possible.
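For a Bernoulli source the typical set can be counted exactly, since every sequence with k ones has the same probability (a sketch; the parameters n = 500, p = 0.2, ε = 0.08 are my choices for a quick demonstration):

```python
import math

p, n, eps = 0.2, 500, 0.08
Hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # H(X) ≈ 0.722 bits

def per_symbol_surprise(k):
    """-(1/n) log2 p(sequence) for a Bernoulli(p) sequence with k ones."""
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

typical_ks = [k for k in range(n + 1) if abs(per_symbol_surprise(k) - Hb) <= eps]
typical_count = sum(math.comb(n, k) for k in typical_ks)
typical_prob = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in typical_ks)

# |A| is at most 2^{n(H+eps)} -- a vanishing fraction of all 2^n sequences --
# yet it already carries almost all of the probability mass.
assert math.log2(typical_count) <= n * (Hb + eps)
assert typical_prob > 0.9
```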
Worked Example
Example 1: English Text Entropy
Shannon estimated the entropy of English at approximately 1–1.5 bits/character by having humans predict the next character in English text. Compare:
- Random ASCII (128 equiprobable symbols): log₂ 128 = 7 bits/character
- Letter frequencies only (26 letters): ≈ 4.1 bits/character (from actual frequencies)
- Accounting for digrams (letter pairs): ≈ 3.6 bits/character
- Accounting for long-range context: ≈ 1–1.5 bits/character
Each step incorporates more statistical structure, reducing entropy. Modern LLMs compress English text to roughly 1–2 bits per character, consistent with Shannon's estimate.
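The unigram (letter-frequency) step is easy to reproduce: estimate letter probabilities from a corpus and plug them into the entropy formula. A sketch using a tiny illustrative text (a real corpus would give the ≈ 4.1-bit figure; this toy sample only demonstrates that non-uniform frequencies push H below log₂ 26):

```python
import math
from collections import Counter

# Tiny illustrative "corpus"; swap in real text for a meaningful estimate
text = ("the quick brown fox jumps over the lazy dog "
        "pack my box with five dozen liquor jugs")

letters = [c for c in text if c.isalpha()]
counts = Counter(letters)
total = len(letters)

# Plug-in unigram entropy estimate, in bits per character
H1 = -sum(n / total * math.log2(n / total) for n in counts.values())

print(round(H1, 2))
assert H1 < math.log2(26)   # any non-uniform letter distribution lowers entropy
```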
Example 2: Mutual Information for Feature Selection
Random variable Y (the label) and features X₁, …, Xₖ:
I(Xᵢ; Y) = H(Y) − H(Y|Xᵢ) computes how much Xᵢ reduces uncertainty about Y. Features are ranked by mutual information with the label — a model-free measure of relevance that captures nonlinear dependencies.
For discrete (binned) Xᵢ and binary Y: if Y is perfectly predictable from Xᵢ (i.e., H(Y|Xᵢ) = 0), then I(Xᵢ; Y) = H(Y) — maximum mutual information. If Xᵢ ⊥ Y, then I(Xᵢ; Y) = 0.
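A plug-in estimator for discrete features makes both extremes concrete (a sketch; the two synthetic "features" — one that copies the label, one independent of it — are hypothetical examples):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits from (x, y) samples, via plug-in empirical frequencies."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical (feature, label) samples for a balanced binary label
perfect = [(x % 2, x % 2) for x in range(100)]          # feature = label exactly
useless = [(x % 2, (x // 2) % 2) for x in range(100)]   # feature independent of label

assert abs(mutual_information(perfect) - 1.0) < 1e-9    # I = H(Y) = 1 bit
assert abs(mutual_information(useless)) < 1e-9          # I = 0
```

Ranking features by this score is the model-free selection described above; note that plug-in estimates are biased upward for small samples, which is one motivation for neural estimators like MINE.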
MINE (Mutual Information Neural Estimation) uses the Donsker–Varadhan dual representation of KL divergence to estimate I(X; Y) from samples when the joint distribution is unknown:

I(X; Y) ≥ supθ { E_{p(x,y)}[Tθ(x, y)] − log E_{p(x)p(y)}[e^{Tθ(x, y)}] },

where Tθ is a neural network (the critic) trained to maximize the bound.
Example 3: Joint Entropy Computation
X ∈ {0, 1}, Y ∈ {0, 1}, joint distribution:

|  | Y = 0 | Y = 1 |
|---|---|---|
| X = 0 | 1/4 | 1/4 |
| X = 1 | 1/4 | 1/4 |
H(X, Y) = 2 bits, H(X) = 1 bit, H(Y) = 1 bit, H(Y|X) = 1 bit, I(X; Y) = 0 — independent.
Now change to: p(0, 0) = p(1, 1) = 1/2, p(0, 1) = p(1, 0) = 0.
H(X, Y) = 1 bit, H(X) = 1 bit, H(Y) = 1 bit, H(Y|X) = 0, I(X; Y) = 1 bit — perfectly correlated: knowing X tells you everything about Y.
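Both cases of this example can be verified in a few lines:

```python
import math

def H(probs):
    """Entropy in bits; zero cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Case 1: all four cells 1/4 -> X and Y independent
assert H([1/4, 1/4, 1/4, 1/4]) == 2.0   # H(X,Y) = 2 bits
# I(X;Y) = H(X) + H(Y) - H(X,Y) = 1 + 1 - 2 = 0

# Case 2: mass only on the diagonal -> perfectly correlated
assert H([1/2, 0, 0, 1/2]) == 1.0       # H(X,Y) = 1 bit
# I(X;Y) = 1 + 1 - 1 = 1 bit = H(X)
```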
Where Your Intuition Breaks
Entropy is often described as measuring "randomness" or "disorder" — but discrete and differential entropy are fundamentally different objects. Discrete entropy is always non-negative and operationally meaningful (bits per symbol). Differential entropy can be negative, is not invariant under change of variables (h(aX) = h(X) + log₂|a|), and depends on the reference measure chosen. Only differences of differential entropies — like mutual information — are invariant and have direct operational meaning. Treating differential entropy as "continuous entropy in bits" and plugging it into formulas designed for discrete entropy leads to sign errors and scale-dependent results.
Entropy measures irreducible randomness. H(X) is the expected number of bits needed to describe X, regardless of encoding. For a fair die (H = log₂ 6 ≈ 2.585 bits): no code can describe the outcome in fewer than 2.585 bits per roll on average. Huffman codes over long blocks approach this limit. The entropy is "irreducible" — it cannot be compressed away because it is real randomness in the source, not just poor representation. In ML: cross-entropy loss decomposes as H(p, q) = H(p) + KL(p ‖ q) — the irreducible entropy of the true distribution p plus the KL divergence from the model q to p. The training objective can reduce the KL term but not H(p).
Mutual information is the reduction in entropy from a joint observation. I(X; Y) = H(X) − H(X|Y): how many bits of uncertainty about X are resolved by observing Y. Conditioning can never increase entropy: H(X|Y) ≤ H(X) (on average — a particular observation can still surprise you). When X and Y are independent, knowing Y tells you nothing: I(X; Y) = 0. When Y = X, knowing Y tells you everything: I(X; X) = H(X). Mutual information is symmetric (I(X; Y) = I(Y; X)) — the amount of information X contains about Y equals the amount Y contains about X.
Differential entropy is not a direct analog of discrete entropy. Differential entropy can be negative (e.g., Uniform[0, 1/2] has h = log₂(1/2) = −1 bit), is not invariant under change of variables (h(aX) = h(X) + log₂|a|), and depends on the choice of reference measure. Only differences of differential entropies (like mutual information I(X; Y) = h(X) − h(X|Y)) are invariant under reparameterization and have direct operational meaning. When computing mutual information for continuous distributions, always use the difference form I(X; Y) = h(X) − h(X|Y) — the subtraction cancels the problematic parts.
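The negative value and the change-of-variables rule are easy to confirm with the Uniform closed form h(Uniform[a, b]) = log₂(b − a), since scaling X by c scales the support width by |c| (a sketch; `h_uniform` is my helper name):

```python
import math

def h_uniform(width):
    """Differential entropy (bits) of a Uniform over an interval of this width."""
    return math.log2(width)

# Differential entropy can be negative: Uniform[0, 1/2] has h = -1 bit
assert h_uniform(0.5) == -1.0

# Change of variables: h(cX) = h(X) + log2|c|
c = 3.0
assert abs(h_uniform(c * 0.5) - (h_uniform(0.5) + math.log2(c))) < 1e-12
```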