
Estimation Theory: MLE, Sufficiency, Fisher Information & Cramér-Rao

Maximum likelihood estimation asks: given data, which parameter value makes the observations most probable? Fisher information and the Cramér-Rao bound characterize how much a dataset can tell us — and the MLE achieves this limit asymptotically, making it the canonical estimator.

Concepts

Fisher information $I(\theta)$ measures the sharpness of the log-likelihood peak. The Cramér-Rao bound says no unbiased estimator can achieve variance below $1/(n \cdot I(\theta))$.

[Figure: Fisher information $I(\theta)$ as a function of $\theta$ for Bernoulli ($1/(\theta(1-\theta))$), $\mathcal{N}(\theta, 1)$ (constant $1$), and Poisson ($1/\theta$). For Bernoulli at $\theta = 0.5$: $I(\theta) = 4$, CR bound $0.25$ for $n = 1$ and $0.0025$ for $n = 100$.]

MLE variance $\theta(1-\theta)/n$ equals the CR bound exactly — the Bernoulli MLE is efficient. At $\theta = 0.5$, $I(\theta) = 4$ is at its minimum: the fair coin is the hardest to estimate.

When you train a neural network by minimizing cross-entropy loss, you are computing the maximum likelihood estimate of the weights — MLE is not a special technique for statisticians, it is the mathematical justification for gradient descent on any loss derived from a probability model. The Fisher information then measures the fundamental limit on estimation precision: how precisely any unbiased method can extract information about the parameter from $n$ observations.

Maximum Likelihood Estimation

For iid observations $X_1, \ldots, X_n \sim p(x; \theta)$ with $\theta \in \Theta \subseteq \mathbb{R}^k$, the maximum likelihood estimator (MLE) is:

$$\hat\theta_{\text{MLE}} = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p(x_i; \theta) = \arg\max_{\theta \in \Theta} \ell(\theta), \quad \ell(\theta) = \sum_{i=1}^n \log p(x_i; \theta).$$

The log transformation — replacing the product of likelihoods with a sum — is not merely a computational convenience. Products are numerically unstable for large $n$; but more fundamentally, the log-likelihood is a sum of iid terms, so the law of large numbers and CLT apply directly to it. The MLE's consistency and asymptotic normality follow from applying the LLN and CLT to the per-sample score $\nabla_\theta \log p(x_i; \theta)$ — the algebraic structure of the log is precisely what makes the whole asymptotic theory tractable.
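A tiny numerical sketch of the stability point (assuming NumPy; not part of the original derivation): for even a moderate Bernoulli sample, the raw product of per-point likelihoods underflows double precision to exactly zero, while the sum of log-likelihoods stays perfectly usable.

```python
import numpy as np

# For a Bernoulli(0.3) sample of n = 2000 points, each likelihood factor is
# at most 0.7, so the raw product underflows float64 to 0.0 -- but the
# log-likelihood is just a moderate negative number.
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=2000)

per_point = theta**x * (1 - theta)**(1 - x)   # p(x_i; theta) for each i
raw_product = np.prod(per_point)              # underflows to exactly 0.0
log_lik = np.sum(np.log(per_point))           # finite, well behaved

print(raw_product, log_lik)
```

Any optimizer handed `raw_product` would see a flat zero everywhere; `log_lik` preserves the full shape of the likelihood surface.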

The score function is the gradient of the log-likelihood:

$$s(\theta) = \nabla_\theta \ell(\theta) = \sum_{i=1}^n \nabla_\theta \log p(x_i; \theta).$$

Setting $s(\hat\theta) = 0$ gives the MLE via the score equation.

Key fact: the score has zero mean under the true parameter. For a single observation:

$$\mathbb{E}_\theta[\nabla_\theta \log p(X; \theta)] = \int \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)}\, p(x;\theta)\,dx = \nabla_\theta \int p(x;\theta)\,dx = \nabla_\theta 1 = 0.$$

This is the starting point for both Fisher information and the Cramér-Rao bound.
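A Monte Carlo sanity check of the zero-mean property, sketched for Bernoulli($\theta$) with NumPy (the `score` helper is just the per-observation derivative written out; the specific $\theta$ values are illustrative):

```python
import numpy as np

# The per-observation Bernoulli score s(t; x) = x/t - (1 - x)/(1 - t)
# averages to ~0 when evaluated at the TRUE parameter, and not otherwise.
rng = np.random.default_rng(1)
theta = 0.3
x = rng.binomial(1, theta, size=200_000).astype(float)

def score(t, x):
    return x / t - (1 - x) / (1 - t)

mean_at_truth = score(theta, x).mean()   # ~ 0
mean_off = score(0.5, x).mean()          # ~ 4*0.3 - 2 = -0.8, clearly nonzero
print(mean_at_truth, mean_off)
```

The nonzero score away from the truth is exactly what gradient-based maximization exploits: the score points back toward the true parameter on average.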

Properties of the MLE:

  1. Consistency: $\hat\theta_{\text{MLE}} \xrightarrow{P} \theta_0$ under mild regularity (identifiability + LLN applied to the log-likelihood).

  2. Asymptotic normality: $\sqrt{n}(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information matrix. Proof sketch: Taylor-expand the score equation around $\theta_0$ and apply the CLT.

  3. Asymptotic efficiency: the MLE achieves the Cramér-Rao lower bound asymptotically — no regular consistent estimator has smaller asymptotic variance.

  4. Invariance: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any measurable $g$.
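Property 2 is easy to check by simulation — a sketch assuming NumPy, using the Bernoulli MLE (the sample mean), for which $I(\theta)^{-1} = \theta(1-\theta)$:

```python
import numpy as np

# Asymptotic normality check: sqrt(n) * (mle - theta) should have variance
# close to I(theta)^{-1} = theta * (1 - theta) for Bernoulli. reps and n
# are arbitrary illustrative choices.
rng = np.random.default_rng(2)
theta, n, reps = 0.3, 400, 20_000

x = rng.binomial(1, theta, size=(reps, n))
mle = x.mean(axis=1)
scaled = np.sqrt(n) * (mle - theta)

empirical_var = scaled.var()
inv_fisher = theta * (1 - theta)    # = 0.21
print(empirical_var, inv_fisher)
```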

Sufficient Statistics

A statistic $T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution $p(X_1, \ldots, X_n \mid T = t; \theta)$ does not depend on $\theta$ — $T$ captures all the information about $\theta$ in the data.

Factorization theorem (Fisher-Neyman): $T$ is sufficient for $\theta$ if and only if

$$p(x_1, \ldots, x_n; \theta) = g(T(x_1,\ldots,x_n); \theta) \cdot h(x_1,\ldots,x_n).$$

Exponential family: $p(x; \eta) = h(x)\exp(\eta^T T(x) - A(\eta))$. Here $T(x)$ is the natural sufficient statistic, $\eta$ is the natural parameter, and $A(\eta) = \log \int h(x)e^{\eta^T T(x)}\,dx$ is the log-partition function.
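As a concrete instance, here is Bernoulli($p$) rewritten in natural form, with a small numeric check of the standard log-partition identity $A'(\eta) = \mathbb{E}[T]$ (the example and its variable names are illustrative, not from the notes):

```python
import numpy as np

# Bernoulli(p) in natural exponential-family form:
#   p(x; eta) = exp(eta * x - A(eta)),  eta = log(p / (1 - p))  (the logit),
#   A(eta)    = log(1 + e^eta)          (log-partition function).
# Standard identity checked below: A'(eta) = E[T] = p.
p = 0.3
eta = np.log(p / (1 - p))      # natural parameter
A = np.log1p(np.exp(eta))      # log-partition

# the density should reproduce p(1) = p and p(0) = 1 - p
p1 = np.exp(eta * 1 - A)
p0 = np.exp(eta * 0 - A)

# central finite difference for A'(eta)
h = 1e-6
A_prime = (np.log1p(np.exp(eta + h)) - np.log1p(np.exp(eta - h))) / (2 * h)
print(p1, p0, A_prime)
```

The same pattern ($A'$ gives the mean of $T$, $A''$ its variance) holds for every exponential family, which is why the log-partition function does so much work in this theory.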

Key examples:

  • Bernoulli($p$): $T = \sum x_i$ is sufficient for $p$
  • Poisson($\lambda$): $T = \sum x_i$ is sufficient for $\lambda$
  • Gaussian $\mathcal{N}(\mu, \sigma^2)$: $(T_1, T_2) = (\sum x_i, \sum x_i^2)$ are jointly sufficient for $(\mu, \sigma^2)$

Fisher Information

The Fisher information measures how much a single observation tells us about $\theta$:

$$I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \log p(X;\theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\!\left[\frac{\partial^2 \log p(X;\theta)}{\partial \theta^2}\right].$$

The two expressions are equal under regularity (differentiate $\mathbb{E}[s(\theta)] = 0$ with respect to $\theta$). The second form — the negative expected Hessian of the log-likelihood — shows that Fisher information equals the curvature of the log-likelihood: a sharply peaked likelihood contains more information.

For $n$ iid observations: $I_n(\theta) = n \cdot I(\theta)$.
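Both expressions for $I(\theta)$ can be evaluated exactly for Bernoulli, since the expectation is just a two-point sum over $x \in \{0, 1\}$ — a numeric sketch assuming NumPy:

```python
import numpy as np

# Two routes to Fisher information for Bernoulli(theta):
#   (a) E[score^2], and (b) -E[second derivative of log p],
# both computed exactly over x in {0, 1}. Closed form: 1/(theta(1-theta)).
theta = 0.3

def logp(t, x):
    return x * np.log(t) + (1 - x) * np.log(1 - t)

xs = np.array([0.0, 1.0])
weights = np.array([1 - theta, theta])   # P(X = 0), P(X = 1)

score = xs / theta - (1 - xs) / (1 - theta)
info_score = np.sum(weights * score**2)           # route (a)

h = 1e-5                                          # finite-difference Hessian
hess = (logp(theta + h, xs) - 2 * logp(theta, xs) + logp(theta - h, xs)) / h**2
info_hess = -np.sum(weights * hess)               # route (b)

closed_form = 1 / (theta * (1 - theta))
print(info_score, info_hess, closed_form)
```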

Fisher information matrix (multiparameter):

$$[I(\theta)]_{jk} = \mathbb{E}\!\left[\frac{\partial \log p}{\partial \theta_j}\frac{\partial \log p}{\partial \theta_k}\right] = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta_j \partial \theta_k}\right].$$

$I(\theta)$ is always positive semidefinite. It is the covariance matrix of the score vector.

Cramér-Rao Lower Bound

Theorem: For any unbiased estimator $\hat\theta$ of $\theta$ (with $\mathbb{E}_\theta[\hat\theta] = \theta$):

$$\operatorname{Var}_\theta(\hat\theta) \geq \frac{1}{n \cdot I(\theta)}.$$

Proof via Cauchy-Schwarz. Let $s_n(\theta) = \sum_i \frac{\partial \log p(x_i;\theta)}{\partial\theta}$ be the full-data score. Then:

  1. $\operatorname{Cov}(\hat\theta, s_n) = \mathbb{E}[\hat\theta \cdot s_n] = \frac{\partial}{\partial\theta}\mathbb{E}_\theta[\hat\theta] = 1$ (the first equality uses $\mathbb{E}[s_n] = 0$; the second uses unbiasedness + differentiation under the integral).

  2. $\operatorname{Var}(s_n) = n \cdot I(\theta)$.

  3. Cauchy-Schwarz: $\operatorname{Var}(\hat\theta) \cdot \operatorname{Var}(s_n) \geq [\operatorname{Cov}(\hat\theta, s_n)]^2 = 1$.

  4. Therefore $\operatorname{Var}(\hat\theta) \geq 1/(n \cdot I(\theta))$. $\square$

Multiparameter: $\operatorname{Cov}(\hat\theta) \succeq [n \cdot I(\theta)]^{-1}$ in the PSD (positive semidefinite) order — meaning $\operatorname{Cov}(\hat\theta) - [n \cdot I(\theta)]^{-1}$ is PSD.

An estimator achieving the CR bound is called efficient. The MLE is asymptotically efficient.

When is the CR bound achievable exactly? Only for exponential families: the bound is tight iff the score can be written as $s(\theta) = c(\theta)(T - \theta)$ — satisfied precisely by the natural exponential family with sufficient statistic $T$.
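A simulation sketch (assuming NumPy) of exact achievability for another exponential family, Poisson($\lambda$), where $I(\lambda) = 1/\lambda$ and the MLE is the sample mean:

```python
import numpy as np

# Poisson(lam) is a natural exponential family, so the MLE (the sample mean)
# should match the CR bound 1/(n I(lam)) = lam/n exactly, up to Monte Carlo
# noise. lam, n, reps are illustrative choices.
rng = np.random.default_rng(3)
lam, n, reps = 4.0, 50, 40_000

x = rng.poisson(lam, size=(reps, n))
mle = x.mean(axis=1)

empirical_var = mle.var()
cr_bound = lam / n              # = 0.08
print(empirical_var, cr_bound)
```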

Rao-Blackwell Theorem and UMVUE

Rao-Blackwell theorem: If $\hat\theta$ is any unbiased estimator and $T$ is a sufficient statistic, then $\tilde\theta = \mathbb{E}[\hat\theta \mid T]$ satisfies:

  1. $\tilde\theta$ is unbiased: $\mathbb{E}[\tilde\theta] = \theta$.
  2. $\operatorname{Var}(\tilde\theta) \leq \operatorname{Var}(\hat\theta)$, with equality iff $\hat\theta = \tilde\theta$ a.s.

Proof (law of total variance): $\operatorname{Var}(\hat\theta) = \operatorname{Var}(\mathbb{E}[\hat\theta \mid T]) + \mathbb{E}[\operatorname{Var}(\hat\theta \mid T)] = \operatorname{Var}(\tilde\theta) + \mathbb{E}[\operatorname{Var}(\hat\theta \mid T)] \geq \operatorname{Var}(\tilde\theta)$.

The UMVUE (uniformly minimum variance unbiased estimator) is the best unbiased estimator at every $\theta$. By the Lehmann-Scheffé theorem: if $T$ is a complete sufficient statistic and $\tilde\theta = g(T)$ is unbiased for $\theta$, then $\tilde\theta$ is the unique UMVUE.

A sufficient statistic $T$ is complete if $\mathbb{E}_\theta[f(T)] = 0$ for all $\theta$ implies $f(T) = 0$ a.s. — the exponential family with natural sufficient statistic is complete under mild conditions.

Worked Example

Example 1: Gaussian MLE is Efficient

$X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known; estimate $\mu$.

Log-likelihood: $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i-\mu)^2$.

Score: $s(\mu) = \frac{1}{\sigma^2}\sum(x_i - \mu) = 0 \Rightarrow \hat\mu = \bar x$.

Fisher info (per observation): $I(\mu) = -\mathbb{E}[\partial^2\ell/\partial\mu^2]/n = 1/\sigma^2$.

CR bound: $\operatorname{Var}(\hat\mu) \geq \sigma^2/n$.

MLE variance: $\operatorname{Var}(\bar x) = \sigma^2/n$. Exact equality — $\bar x$ is efficient.

The asymptotic normal approximation: $\sqrt{n}(\hat\mu - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$. The Fisher information here is $1/\sigma^2$, so $I(\mu)^{-1} = \sigma^2$. ✓
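A quick simulation of this example (a sketch assuming NumPy; the specific $\mu$, $\sigma$, $n$ are arbitrary choices):

```python
import numpy as np

# Var(xbar) should match the CR bound sigma^2/n, since xbar is efficient.
rng = np.random.default_rng(7)
mu, sigma, n, reps = 2.0, 1.5, 30, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
var_mle = x.mean(axis=1).var()   # empirical Var(xbar)
cr = sigma**2 / n                # = 0.075
print(var_mle, cr)
```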

Example 2: Bernoulli CR Bound

$X_i \stackrel{\text{iid}}{\sim} \text{Bernoulli}(p)$. Score: $s(p) = k/p - (n-k)/(1-p) = 0 \Rightarrow \hat p = k/n$, where $k = \sum_i x_i$.

$I(p) = 1/(p(1-p))$. CR bound: $p(1-p)/n$. MLE variance: $p(1-p)/n$. Tight again.

Estimation is hardest near $p = 0.5$, where $I(p) = 4$ is at its minimum — small Fisher information means high variance is unavoidable. At $p = 0.9$, $I(p) = 1/(0.9 \cdot 0.1) \approx 11.1$, so the variance floor for estimating $p$ is far lower.
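The variance floor across $p$ can be tabulated directly (an illustrative sketch assuming NumPy; $n = 100$ is an arbitrary choice):

```python
import numpy as np

# CR bound p(1-p)/n across p: highest at p = 0.5, where Fisher information
# is smallest, and dropping toward the boundary.
n = 100
ps = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
fisher = 1 / (ps * (1 - ps))   # I(p)
bound = ps * (1 - ps) / n      # CR bound on Var of any unbiased estimator
for p, i_p, b in zip(ps, fisher, bound):
    print(f"p={p:.1f}  I(p)={i_p:6.2f}  CR bound={b:.5f}")
```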

Example 3: Rao-Blackwell in Action

Estimate $p$ in Bernoulli($p$) with the naive estimator $\hat\theta = X_1$ (just the first observation).

$\hat\theta$ is unbiased: $\mathbb{E}[X_1] = p$. Variance: $p(1-p)$.

Sufficient statistic: $T = \sum_{i=1}^n X_i$. By exchangeability:

$$\tilde\theta = \mathbb{E}[X_1 \mid T = k] = \frac{k}{n}.$$

Rao-Blackwellized estimator: $\tilde\theta = \bar X$, variance $p(1-p)/n$. Ratio of improvement: $n$ — the variance shrinks by a factor of $n$. The UMVUE is the sample mean, as expected.
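The improvement is easy to see by simulation — a sketch assuming NumPy, with $p = 0.3$ and $n = 20$ as illustrative choices:

```python
import numpy as np

# Rao-Blackwell in action: the naive unbiased estimator X_1 has variance
# p(1-p); conditioning on T = sum(X_i) gives the sample mean, with variance
# p(1-p)/n -- a factor-of-n reduction.
rng = np.random.default_rng(4)
p, n, reps = 0.3, 20, 100_000

x = rng.binomial(1, p, size=(reps, n)).astype(float)
naive = x[:, 0]                  # X_1
rao_blackwell = x.mean(axis=1)   # E[X_1 | T] = T/n

v_naive = naive.var()            # ~ p(1-p) = 0.21
v_rb = rao_blackwell.var()       # ~ p(1-p)/n = 0.0105
print(v_naive, v_rb, v_naive / v_rb)
```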

Connections

Where Your Intuition Breaks

The MLE is consistent and asymptotically efficient — optimal in the limit of large $n$. The dangerous misconception is extending "optimal" to all finite-sample settings. In dimension $d \geq 3$, the MLE for estimating the mean of a multivariate Gaussian is inadmissible: the James-Stein estimator is biased yet has strictly lower mean squared error at every true $\mu$. The Cramér-Rao bound only constrains unbiased estimators; biased estimators can trade bias for variance and achieve lower MSE. In ML, this is not a curiosity — ridge regression, Lasso, and Bayesian MAP estimators are biased versions of MLE that routinely outperform it on finite-sample test data in high dimensions. "Maximum likelihood" is asymptotically optimal, not universally dominant.

💡Intuition

Fisher information is the curvature of the log-likelihood. Equivalently, $I(\theta) = \mathbb{E}[s^2]$ measures how volatile the score is. A high-information parameter is one where small changes in $\theta$ produce large changes in the likelihood — the likelihood is steep. For Bernoulli at $p = 0.01$, changing $p$ by 0.01 is a huge change relative to its scale, and it noticeably shifts the likelihood of each observed 0: $\log(1-0.01)$ vs $\log(1-0.02)$. The information is high near the boundary. At $p = 0.5$, small changes in $p$ barely change the likelihood of a typical sample, so each observation carries less information.

💡Intuition

The CR bound is achievable only for exponential families. For the Poisson, Gaussian, Bernoulli, and Gamma families, the score factors as $c(\theta)(T - \mathbb{E}[T])$, making the Cauchy-Schwarz inequality tight. For non-exponential families (like the uniform $U(0,\theta)$), the MLE $\hat\theta = X_{(n)}$ converges at rate $n$ rather than $\sqrt{n}$ — the CR bound does not apply because the support depends on $\theta$ (regularity conditions fail). In those cases, much faster rates are achievable.
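The faster rate is visible in simulation: for $U(0,\theta)$ the MLE satisfies $\mathbb{E}[\theta - X_{(n)}] = \theta/(n+1)$, i.e. the error shrinks like $1/n$. A NumPy sketch (sample sizes are arbitrary choices):

```python
import numpy as np

# Error of the uniform MLE X_(n) shrinks like 1/n, not 1/sqrt(n):
# E[theta - X_(n)] = theta / (n + 1).
rng = np.random.default_rng(5)
theta, reps = 1.0, 10_000

errs = {}
for n in (10, 100, 1000):
    x = rng.uniform(0, theta, size=(reps, n))
    errs[n] = (theta - x.max(axis=1)).mean()   # ~ theta / (n + 1)

print(errs)
```

Growing $n$ by 10x shrinks the error by roughly 10x, the $1/n$ signature, where a $\sqrt{n}$-rate estimator would only improve by about 3.2x.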

⚠️Warning

Unbiasedness is not always desirable. The UMVUE minimizes variance among unbiased estimators, but biased estimators can have strictly lower mean squared error. The James-Stein estimator for estimating the mean of a multivariate Gaussian in $d \geq 3$ dimensions is biased but dominates the MLE in MSE — the MLE is inadmissible. The CR bound only bounds unbiased estimators; biased ones (including ridge regression, Lasso, Bayes estimators) can fall below it. In high-dimensional ML, regularized (biased) estimators almost always outperform the MLE.
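A minimal James-Stein demonstration — a simulation sketch assuming NumPy, where $d = 10$ and $\mu = (1, \ldots, 1)$ are arbitrary choices (any $d \geq 3$ and any true mean exhibits the domination):

```python
import numpy as np

# James-Stein vs MLE for the mean of N(mu, I_d) from one observation x:
# shrink x toward 0 by the data-dependent factor (1 - (d-2)/||x||^2).
# The JS estimator is biased yet has lower total squared error.
rng = np.random.default_rng(6)
d, reps = 10, 50_000
mu = np.full(d, 1.0)

x = rng.normal(mu, 1.0, size=(reps, d))          # one observation per repeat
norms2 = np.sum(x**2, axis=1, keepdims=True)
js = (1 - (d - 2) / norms2) * x                  # James-Stein shrinkage

mse_mle = np.mean(np.sum((x - mu)**2, axis=1))   # = d = 10 in expectation
mse_js = np.mean(np.sum((js - mu)**2, axis=1))   # strictly smaller
print(mse_mle, mse_js)
```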
