Estimation Theory: MLE, Sufficiency, Fisher Information & Cramér-Rao
Maximum likelihood estimation asks: given data, which parameter value makes the observations most probable? Fisher information and the Cramér-Rao bound characterize how much a dataset can tell us — and the MLE achieves this limit asymptotically, making it the canonical estimator.
Concepts
Fisher information $I(\theta)$ measures the sharpness of the log-likelihood peak. The Cramér–Rao bound says no unbiased estimator can achieve variance below $1/(n\,I(\theta))$.
For the Bernoulli MLE, the variance $\theta(1-\theta)/n$ equals the CR bound exactly — the Bernoulli MLE is efficient. At $\theta = 0.5$, $I(\theta) = 4$ is at its minimum: the fair coin is the hardest parameter to estimate.
When you train a neural network by minimizing cross-entropy loss, you are computing the maximum likelihood estimate of the weights — MLE is not a special technique for statisticians, it is the mathematical justification for gradient descent on any loss derived from a probability model. The Fisher information then measures the fundamental limit on estimation precision: how precisely any unbiased method can extract information about the parameter from observations.
Maximum Likelihood Estimation
For iid observations $x_1, \dots, x_n$ with density $p_\theta(x)$, the maximum likelihood estimator (MLE) is:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i) = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$
The log transformation — replacing the product of likelihoods with a sum — is not merely a computational convenience. Products of many small densities are numerically unstable for large $n$; but more fundamentally, the log-likelihood is a sum of iid terms, so the law of large numbers and the CLT apply to it directly. The MLE's consistency and asymptotic normality follow from applying the LLN and CLT to the per-sample score — the algebraic structure of the log is precisely what makes the whole asymptotic theory tractable.
The score function is the gradient of the log-likelihood:

$$s(\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta} \log p_\theta(x_i)$$

Setting $s(\hat{\theta}) = 0$ gives the MLE via the score equation.
Key fact: the score has zero mean under the true parameter. For a single observation:

$$\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log p_\theta(X)\right] = \int \frac{\partial \log p_\theta(x)}{\partial\theta}\, p_\theta(x)\, dx = \int \frac{\partial p_\theta(x)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int p_\theta(x)\, dx = 0$$

This is the starting point for both Fisher information and the Cramér–Rao bound.
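The zero-mean property can be checked exactly for a single Bernoulli observation (the parameter value below is arbitrary):

```python
# Exact check that the Bernoulli score has zero mean under the true parameter.
# Per-observation score: s(p) = x/p - (1-x)/(1-p).
p = 0.3
score = lambda x: x / p - (1 - x) / (1 - p)

# E_p[s(p)] = p * s(1) + (1 - p) * s(0)
mean_score = p * score(1) + (1 - p) * score(0)
# The two terms cancel: 1 - 1 = 0.
```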
Properties of the MLE:
- **Consistency:** $\hat{\theta}_n \xrightarrow{p} \theta_0$ under mild regularity (identifiability + LLN applied to the log-likelihood).
- **Asymptotic normality:** $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\!\big(0,\, I(\theta_0)^{-1}\big)$, where $I(\theta)$ is the Fisher information matrix. Proof sketch: Taylor-expand the score equation around $\theta_0$ and apply the CLT.
- **Asymptotic efficiency:** the MLE achieves the Cramér–Rao lower bound asymptotically — no consistent estimator has smaller asymptotic variance.
- **Invariance:** if $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any measurable $g$.
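A simulation sketch of asymptotic normality for the Bernoulli MLE (sample size and trial count are illustrative choices): the scaled error $\sqrt{n}\,(\hat{p} - p)$ should have standard deviation near $\sqrt{I(p)^{-1}} = \sqrt{p(1-p)}$.

```python
import random
import statistics

# Simulate the Bernoulli MLE many times and check the scaled error's spread.
random.seed(0)
p, n, trials = 0.3, 2000, 2000
scaled = []
for _ in range(trials):
    p_hat = sum(random.random() < p for _ in range(n)) / n  # MLE = sample mean
    scaled.append(n ** 0.5 * (p_hat - p))

sd = statistics.stdev(scaled)
target = (p * (1 - p)) ** 0.5  # theory: sqrt(0.21) ~ 0.458
```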
Sufficient Statistics
A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution $p_\theta(x \mid T(x) = t)$ does not depend on $\theta$ — $T$ captures all the information about $\theta$ in the data.
Factorization theorem (Fisher–Neyman): $T$ is sufficient for $\theta$ if and only if

$$p_\theta(x) = g_\theta(T(x))\, h(x)$$
Exponential family: $p_\theta(x) = h(x)\exp\!\big(\eta(\theta)^\top T(x) - A(\theta)\big)$. Here $T(x)$ is the natural sufficient statistic, $\eta(\theta)$ is the natural parameter, and $A(\theta)$ is the log-partition function.
Key examples:
- Bernoulli($p$): $T = \sum_i x_i$ is sufficient for $p$
- Poisson($\lambda$): $T = \sum_i x_i$ is sufficient for $\lambda$
- Gaussian $\mathcal{N}(\mu, \sigma^2)$: $\big(\sum_i x_i,\ \sum_i x_i^2\big)$ are jointly sufficient for $(\mu, \sigma^2)$
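Sufficiency of the Bernoulli sum can be verified by brute-force enumeration for small $n$: given $T = t$, every sequence with $t$ ones has conditional probability $1/\binom{n}{t}$, regardless of $p$.

```python
from itertools import product

# Enumerate all Bernoulli sequences of length 3 and compute the conditional
# distribution of the sequence given its sum, for two different p values.
def conditional(p, n=3):
    probs = {x: p ** sum(x) * (1 - p) ** (n - sum(x))
             for x in product([0, 1], repeat=n)}
    out = {}
    for x, pr in probs.items():
        t = sum(x)
        p_t = sum(pr2 for x2, pr2 in probs.items() if sum(x2) == t)  # P(T = t)
        out[x] = pr / p_t  # P(x | T = t)
    return out

c1, c2 = conditional(0.2), conditional(0.7)
# c1 and c2 agree (up to rounding): the conditional law does not depend on p,
# and each sequence with t ones gets probability 1 / C(3, t).
```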
Fisher Information
The Fisher information measures how much a single observation tells us about $\theta$:

$$I(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^{\!2}\right] = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta(X)\right]$$

The two expressions are equal under regularity (differentiate the identity $\mathbb{E}_\theta[s(\theta)] = 0$ with respect to $\theta$). The second form — the negative expected Hessian of the log-likelihood — shows that Fisher information equals the curvature of the log-likelihood: a sharply peaked likelihood contains more information.
For $n$ iid observations: $I_n(\theta) = n\,I(\theta)$.
Fisher information matrix (multiparameter):

$$I(\theta)_{jk} = \mathbb{E}_\theta\!\left[\frac{\partial \log p_\theta(X)}{\partial \theta_j}\,\frac{\partial \log p_\theta(X)}{\partial \theta_k}\right]$$

$I(\theta)$ is always positive semidefinite: it is the covariance matrix of the score vector.
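Both formulas can be checked exactly for a single Bernoulli observation (the parameter value below is arbitrary); each equals the closed form $1/(p(1-p))$:

```python
# Exact check of the two Fisher information formulas for Bernoulli(p):
# E[score^2] vs. negative expected second derivative of the log-likelihood.
p = 0.3
score = lambda x: x / p - (1 - x) / (1 - p)            # d/dp log p_theta(x)
hessian = lambda x: -x / p**2 - (1 - x) / (1 - p) ** 2  # d^2/dp^2 log p_theta(x)

i_score = p * score(1) ** 2 + (1 - p) * score(0) ** 2  # E[s^2]
i_hess = -(p * hessian(1) + (1 - p) * hessian(0))      # -E[Hessian]
closed_form = 1 / (p * (1 - p))
```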
Cramér-Rao Lower Bound
Theorem: For any unbiased estimator $\hat{\theta}$ of $\theta$ (with $\mathbb{E}_\theta[\hat{\theta}] = \theta$):

$$\mathrm{Var}_\theta(\hat{\theta}) \ge \frac{1}{n\,I(\theta)}$$
Proof via Cauchy–Schwarz. Let $S = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log p_\theta(X_i)$ be the full-data score. Then:

- $\mathrm{Cov}(\hat{\theta}, S) = 1$ (by unbiasedness + differentiation under the integral).
- $\mathrm{Var}(S) = n\,I(\theta)$.
- Cauchy–Schwarz: $\mathrm{Cov}(\hat{\theta}, S)^2 \le \mathrm{Var}(\hat{\theta})\,\mathrm{Var}(S)$.
- Therefore $\mathrm{Var}(\hat{\theta}) \ge \dfrac{1}{n\,I(\theta)}$.
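The two covariance facts in the proof can be checked by Monte Carlo for the Bernoulli model (the settings below are illustrative): $\mathrm{Cov}(\hat{p}, S)$ should land near 1, and $\mathrm{Var}(S)$ near $n\,I(p)$.

```python
import random
import statistics

# Monte Carlo check of the Cauchy-Schwarz proof ingredients for Bernoulli(p).
random.seed(1)
p, n, trials = 0.4, 50, 20000
est, scores = [], []
for _ in range(trials):
    t = sum(random.random() < p for _ in range(n))  # number of successes
    est.append(t / n)                               # unbiased estimator p_hat
    scores.append(t / p - (n - t) / (1 - p))        # full-data score at true p

m_e, m_s = statistics.fmean(est), statistics.fmean(scores)
cov = sum((e - m_e) * (s - m_s) for e, s in zip(est, scores)) / (trials - 1)
var_s = statistics.variance(scores)
# Theory: cov = 1 exactly; var_s = n / (p * (1 - p)) ~ 208.3
```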
Multiparameter: $\mathrm{Cov}(\hat{\theta}) \succeq I_n(\theta)^{-1}$ in the PSD (positive semidefinite) order — meaning $\mathrm{Cov}(\hat{\theta}) - I_n(\theta)^{-1}$ is PSD.
An estimator achieving the CR bound is called efficient. The MLE is asymptotically efficient.
When is the CR bound achievable exactly? Only for exponential families: the bound is tight iff the score can be written as $s(\theta) = a(\theta)\,\big(T(x) - \theta\big)$ for some function $a$ — satisfied precisely by the natural exponential family with sufficient statistic $T$ (note the zero-mean score forces $\mathbb{E}_\theta[T] = \theta$).
Rao-Blackwell Theorem and UMVUE
Rao-Blackwell theorem: If $\hat{\theta}$ is any unbiased estimator and $T$ is a sufficient statistic, then $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$ satisfies:
- $\tilde{\theta}$ is unbiased: $\mathbb{E}[\tilde{\theta}] = \mathbb{E}[\hat{\theta}] = \theta$.
- $\mathrm{Var}(\tilde{\theta}) \le \mathrm{Var}(\hat{\theta})$, with equality iff $\hat{\theta} = \tilde{\theta}$ a.s.
Proof: by the law of total variance, $\mathrm{Var}(\hat{\theta}) = \mathbb{E}[\mathrm{Var}(\hat{\theta} \mid T)] + \mathrm{Var}(\mathbb{E}[\hat{\theta} \mid T]) \ge \mathrm{Var}(\tilde{\theta})$. Sufficiency guarantees that $\tilde{\theta}$ does not depend on $\theta$, so it is a genuine estimator.
The UMVUE (uniformly minimum variance unbiased estimator) is the best unbiased estimator at every $\theta$. By the Lehmann-Scheffé theorem: if $T$ is a complete sufficient statistic and $\tilde{\theta} = h(T)$ is unbiased for $\theta$, then $\tilde{\theta}$ is the unique UMVUE.
A sufficient statistic $T$ is complete if $\mathbb{E}_\theta[g(T)] = 0$ for all $\theta$ implies $g(T) = 0$ a.s. — the exponential family with natural sufficient statistic $T$ is complete under mild conditions.
Worked Example
Example 1: Gaussian MLE is Efficient
$X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known; estimate $\mu$.
Log-likelihood: $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$.
Score: $s(\mu) = \frac{1}{\sigma^2}\sum_i (x_i - \mu)$, so $\hat{\mu} = \bar{x}$.
Fisher info: $I(\mu) = 1/\sigma^2$.
CR bound: $\mathrm{Var}(\hat{\mu}) \ge \sigma^2/n$.
MLE variance: $\mathrm{Var}(\bar{X}) = \sigma^2/n$. Exact equality — $\bar{X}$ is efficient.
The asymptotic normal approximation: $\sqrt{n}\,(\hat{\mu} - \mu) \xrightarrow{d} \mathcal{N}(0, I(\mu)^{-1})$. The Fisher information here is $1/\sigma^2$, so $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/n)$ — exact at every $n$ for the Gaussian. ✓
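A simulation sketch confirming that the sample mean's variance sits at the CR bound ($\mu$, $\sigma$, $n$ below are arbitrary choices):

```python
import random
import statistics

# Simulate the Gaussian MLE (sample mean, sigma known) many times and
# compare its variance to the CR bound sigma^2 / n.
random.seed(2)
mu, sigma, n, trials = 1.0, 2.0, 100, 20000
means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(trials)
]
var_mle = statistics.variance(means)
cr_bound = sigma ** 2 / n  # theory: 0.04
```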
Example 2: Bernoulli CR Bound
$X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$. Score: $s(p) = \sum_i \left(\dfrac{x_i}{p} - \dfrac{1 - x_i}{1 - p}\right)$.
$I(p) = \dfrac{1}{p(1-p)}$. CR bound: $\mathrm{Var}(\hat{p}) \ge \dfrac{p(1-p)}{n}$. MLE variance: $\mathrm{Var}(\bar{X}) = \dfrac{p(1-p)}{n}$. Tight again.
The CR bound is largest near $p = 0.5$, where $I(p)$ is minimized — small Fisher information means high variance is unavoidable. At $p = 0.9$, $I(p) = \frac{1}{0.09} \approx 11.1$, so far less variance is needed to estimate $p$ accurately.
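A simulation sketch comparing the Bernoulli MLE's variance at $p = 0.5$ against a skewed $p$ (settings are illustrative):

```python
import random
import statistics

# The Bernoulli MLE's variance p(1-p)/n is worst at p = 0.5, where Fisher
# information I(p) = 1/(p(1-p)) is minimized.
random.seed(3)
n, trials = 100, 20000

def mle_variance(p):
    return statistics.variance(
        sum(random.random() < p for _ in range(n)) / n for _ in range(trials)
    )

v_fair, v_skewed = mle_variance(0.5), mle_variance(0.9)
# Theory: 0.25 / n = 0.0025 vs 0.09 / n = 0.0009.
```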
Example 3: Rao-Blackwell in Action
Estimate $p$ in $\mathrm{Bernoulli}(p)$ with the naive estimator $\hat{p} = X_1$ (just the first observation).
$X_1$ is unbiased: $\mathbb{E}[X_1] = p$. Variance: $\mathrm{Var}(X_1) = p(1-p)$.
Sufficient statistic: $T = \sum_i X_i$. By exchangeability:

$$\mathbb{E}[X_1 \mid T = t] = \frac{t}{n}$$

Rao-Blackwellized estimator: $\tilde{p} = T/n = \bar{X}$, with variance $\frac{p(1-p)}{n}$. Ratio of improvement: $\mathrm{Var}(X_1)/\mathrm{Var}(\bar{X}) = n$ — the variance shrinks by a factor of $n$. The UMVUE is the sample mean, as expected.
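The factor-of-$n$ variance reduction can be seen directly in simulation (the settings below are illustrative):

```python
import random
import statistics

# Rao-Blackwellization for Bernoulli(p): the naive estimator X_1 vs. its
# conditional expectation given T = sum(X_i), which is T/n.
random.seed(4)
p, n, trials = 0.3, 20, 40000
naive, rb = [], []
for _ in range(trials):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    naive.append(xs[0])     # X_1: unbiased, variance p(1-p)
    rb.append(sum(xs) / n)  # E[X_1 | T] = T/n: variance p(1-p)/n

ratio = statistics.variance(naive) / statistics.variance(rb)
# Theory: the ratio is n = 20.
```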
Connections
Where Your Intuition Breaks
The MLE is consistent and asymptotically efficient — optimal in the limit of large $n$. The dangerous misconception is extending "optimal" to all finite-sample settings. In dimension $d \ge 3$, the MLE for estimating the mean of a multivariate Gaussian is inadmissible: the James-Stein estimator is biased yet has strictly lower mean squared error at every true mean. The Cramér-Rao bound only constrains unbiased estimators; biased estimators can trade bias for variance and achieve lower MSE. In ML, this is not a curiosity — ridge regression, Lasso, and Bayesian MAP estimators are biased versions of MLE that routinely outperform it on finite-sample test data in high dimensions. "Maximum likelihood" is asymptotically optimal, not universally dominant.
Fisher information is the curvature of the log-likelihood. Equivalently, $I(\theta) = \mathrm{Var}_\theta(s(\theta))$ measures how volatile the score is. A high-information parameter is one where small changes in $\theta$ produce large changes in the likelihood — the likelihood is steep. For Bernoulli at $\theta = 0.99$, shifting $\theta$ by 0.01 doubles the likelihood of each observed 0: $1 - \theta = 0.01$ vs $0.02$. The information is high near the boundary. At $\theta = 0.5$, the distribution is symmetric and the same 0.01 shift barely changes the likelihood shape.
The CR bound is achievable only for exponential families. For the Poisson, Gaussian, Bernoulli, and Gamma families, the score factors as $a(\theta)\big(T(x) - \mathbb{E}_\theta[T]\big)$, making the Cauchy-Schwarz inequality tight. For non-exponential families (like the uniform on $[0, \theta]$), the MLE $\max_i x_i$ converges at rate $n$ rather than $\sqrt{n}$ — the CR bound does not apply because the support depends on $\theta$ (regularity conditions fail). In such cases, much faster rates are achievable.
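The rate-$n$ behavior of the uniform MLE can be seen by checking that doubling $n$ roughly halves the mean error, since $\mathbb{E}[\theta - \max_i x_i] = \theta/(n+1)$ (the value of $\theta$ below is arbitrary):

```python
import random
import statistics

# For Uniform[0, theta], the MLE max(x_i) has mean error theta/(n+1):
# doubling n halves the error (rate n, not sqrt(n)).
random.seed(5)
theta, trials = 2.0, 20000

def mean_error(n):
    return statistics.fmean(
        theta - max(random.uniform(0, theta) for _ in range(n))
        for _ in range(trials)
    )

e100, e200 = mean_error(100), mean_error(200)
# Theory: e100 ~ 2/101, e200 ~ 2/201, so e100/e200 ~ 2 (a sqrt(n) rate
# would give a ratio near sqrt(2) ~ 1.41).
```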
Unbiasedness is not always desirable. The UMVUE minimizes variance among unbiased estimators, but biased estimators can have strictly lower mean squared error. The James-Stein estimator for the mean of a multivariate Gaussian in $d \ge 3$ dimensions is biased but dominates the MLE in MSE — the MLE is inadmissible. The CR bound only constrains unbiased estimators; biased ones (including ridge regression, Lasso, and Bayes estimators) can violate it. In high-dimensional ML, regularized (biased) estimators almost always outperform the MLE.