
Estimation Theory: MLE, Sufficiency, Fisher Information & Cramér-Rao

Maximum likelihood estimation asks: given data, which parameter value makes the observations most probable? Fisher information and the Cramér-Rao bound characterize how much a dataset can tell us — and the MLE achieves this limit asymptotically, making it the canonical estimator.

Concepts

Fisher information $I(\theta)$ measures the sharpness of the log-likelihood peak. The Cramér-Rao bound says no unbiased estimator can achieve variance below $1/(n \cdot I(\theta))$.

[Figure: Fisher information $I(\theta)$ as a function of $\theta$ for Bernoulli ($1/(\theta(1-\theta))$), $\mathcal{N}(\theta, 1)$ (constant $1$), and Poisson ($1/\theta$). For Bernoulli at $\theta = 0.5$: $I(\theta) = 4$, CR bound $0.25$ for $n = 1$ and $0.0025$ for $n = 100$.]

MLE variance $\theta(1-\theta)/n$ equals the CR bound exactly — the Bernoulli MLE is efficient. At $\theta = 0.5$, $I(\theta) = 4$ is at its minimum: the fair coin is the hardest to estimate.

When you train a neural network by minimizing cross-entropy loss, you are computing the maximum likelihood estimate of the weights — MLE is not a special technique for statisticians, it is the mathematical justification for gradient descent on any loss derived from a probability model. The Fisher information then measures the fundamental limit on estimation precision: how precisely any unbiased method can extract information about the parameter from $n$ observations.

Maximum Likelihood Estimation

For iid observations $X_1, \ldots, X_n \sim p(x; \theta)$ with $\theta \in \Theta \subseteq \mathbb{R}^k$, the maximum likelihood estimator (MLE) is:

$$\hat\theta_{\text{MLE}} = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p(x_i; \theta) = \arg\max_{\theta \in \Theta} \ell(\theta), \quad \ell(\theta) = \sum_{i=1}^n \log p(x_i; \theta).$$

The log transformation — replacing the product of likelihoods with a sum — is not merely a computational convenience. Products are numerically unstable for large $n$; but more fundamentally, the log-likelihood is a sum of iid terms, so the law of large numbers and CLT apply directly to it. The MLE's consistency and asymptotic normality follow from applying the LLN and CLT to the per-sample score $\nabla_\theta \log p(x_i; \theta)$ — the algebraic structure of the log is precisely what makes the whole asymptotic theory tractable.
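A tiny numerical sketch of the stability point (assuming NumPy; not part of the original derivation): for even a moderate Bernoulli sample, the raw product of per-point likelihoods underflows double precision to exactly zero, while the sum of log-likelihoods stays perfectly usable.

```python
import numpy as np

# For a Bernoulli(0.3) sample of n = 2000 points, each likelihood factor is
# at most 0.7, so the raw product underflows float64 to 0.0 -- but the
# log-likelihood is just a moderate negative number.
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=2000)

per_point = theta**x * (1 - theta)**(1 - x)   # p(x_i; theta) for each i
raw_product = np.prod(per_point)              # underflows to exactly 0.0
log_lik = np.sum(np.log(per_point))           # finite, well behaved

print(raw_product, log_lik)
```

Any optimizer handed `raw_product` would see a flat zero everywhere; `log_lik` preserves the full shape of the likelihood surface.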

The score function is the gradient of the log-likelihood:

$$s(\theta) = \nabla_\theta \ell(\theta) = \sum_{i=1}^n \nabla_\theta \log p(x_i; \theta).$$

Setting $s(\hat\theta) = 0$ gives the MLE via the score equation.

Key fact: the score has zero mean under the true parameter. For a single observation:

$$\mathbb{E}_\theta[\nabla_\theta \log p(X; \theta)] = \int \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)}\, p(x;\theta)\,dx = \nabla_\theta \int p(x;\theta)\,dx = \nabla_\theta 1 = 0.$$

This is the starting point for both Fisher information and the Cramér-Rao bound.
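A Monte Carlo sanity check of the zero-mean property, sketched for Bernoulli($\theta$) with NumPy (the `score` helper is just the per-observation derivative written out; the specific $\theta$ values are illustrative):

```python
import numpy as np

# The per-observation Bernoulli score s(t; x) = x/t - (1 - x)/(1 - t)
# averages to ~0 when evaluated at the TRUE parameter, and not otherwise.
rng = np.random.default_rng(1)
theta = 0.3
x = rng.binomial(1, theta, size=200_000).astype(float)

def score(t, x):
    return x / t - (1 - x) / (1 - t)

mean_at_truth = score(theta, x).mean()   # ~ 0
mean_off = score(0.5, x).mean()          # ~ 4*0.3 - 2 = -0.8, clearly nonzero
print(mean_at_truth, mean_off)
```

The nonzero score away from the truth is exactly what gradient-based maximization exploits: the score points back toward the true parameter on average.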

Properties of the MLE:

  1. Consistency: $\hat\theta_{\text{MLE}} \xrightarrow{P} \theta_0$ under mild regularity (identifiability + LLN applied to the log-likelihood).

  2. Asymptotic normality: $\sqrt{n}(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information matrix. Proof sketch: Taylor-expand the score equation around $\theta_0$ and apply the CLT.

  3. Asymptotic efficiency: the MLE achieves the Cramér-Rao lower bound asymptotically — no regular consistent estimator has smaller asymptotic variance.

  4. Invariance: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any measurable $g$.
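Property 2 is easy to check by simulation — a sketch assuming NumPy, using the Bernoulli MLE (the sample mean), for which $I(\theta)^{-1} = \theta(1-\theta)$:

```python
import numpy as np

# Asymptotic normality check: sqrt(n) * (mle - theta) should have variance
# close to I(theta)^{-1} = theta * (1 - theta) for Bernoulli. reps and n
# are arbitrary illustrative choices.
rng = np.random.default_rng(2)
theta, n, reps = 0.3, 400, 20_000

x = rng.binomial(1, theta, size=(reps, n))
mle = x.mean(axis=1)
scaled = np.sqrt(n) * (mle - theta)

empirical_var = scaled.var()
inv_fisher = theta * (1 - theta)    # = 0.21
print(empirical_var, inv_fisher)
```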

Sufficient Statistics

A statistic $T(X_1, \ldots, X_n)$ is sufficient for $\theta$ if the conditional distribution $p(X_1, \ldots, X_n \mid T = t; \theta)$ does not depend on $\theta$ — $T$ captures all the information about $\theta$ in the data.

Factorization theorem (Fisher-Neyman): $T$ is sufficient for $\theta$ if and only if

$$p(x_1, \ldots, x_n; \theta) = g(T(x_1,\ldots,x_n); \theta) \cdot h(x_1,\ldots,x_n).$$

Exponential family: $p(x; \eta) = h(x)\exp(\eta^T T(x) - A(\eta))$. Here $T(x)$ is the natural sufficient statistic, $\eta$ is the natural parameter, and $A(\eta) = \log \int h(x)e^{\eta^T T(x)}\,dx$ is the log-partition function.
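As a concrete instance, here is Bernoulli($p$) rewritten in natural form, with a small numeric check of the standard log-partition identity $A'(\eta) = \mathbb{E}[T]$ (the example and its variable names are illustrative, not from the notes):

```python
import numpy as np

# Bernoulli(p) in natural exponential-family form:
#   p(x; eta) = exp(eta * x - A(eta)),  eta = log(p / (1 - p))  (the logit),
#   A(eta)    = log(1 + e^eta)          (log-partition function).
# Standard identity checked below: A'(eta) = E[T] = p.
p = 0.3
eta = np.log(p / (1 - p))      # natural parameter
A = np.log1p(np.exp(eta))      # log-partition

# the density should reproduce p(1) = p and p(0) = 1 - p
p1 = np.exp(eta * 1 - A)
p0 = np.exp(eta * 0 - A)

# central finite difference for A'(eta)
h = 1e-6
A_prime = (np.log1p(np.exp(eta + h)) - np.log1p(np.exp(eta - h))) / (2 * h)
print(p1, p0, A_prime)
```

The same pattern ($A'$ gives the mean of $T$, $A''$ its variance) holds for every exponential family, which is why the log-partition function does so much work in this theory.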

Key examples:

  • Bernoulli($p$): $T = \sum x_i$ is sufficient for $p$
  • Poisson($\lambda$): $T = \sum x_i$ is sufficient for $\lambda$
  • Gaussian $\mathcal{N}(\mu, \sigma^2)$: $(T_1, T_2) = (\sum x_i, \sum x_i^2)$ are jointly sufficient for $(\mu, \sigma^2)$

Fisher Information

The Fisher information measures how much a single observation tells us about $\theta$:

$$I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \log p(X;\theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\!\left[\frac{\partial^2 \log p(X;\theta)}{\partial \theta^2}\right].$$

The two expressions are equal under regularity (differentiate $\mathbb{E}[s(\theta)] = 0$ with respect to $\theta$). The second form — the negative expected Hessian of the log-likelihood — shows that Fisher information equals the curvature of the log-likelihood: a sharply peaked likelihood contains more information.

For $n$ iid observations: $I_n(\theta) = n \cdot I(\theta)$.
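Both expressions for $I(\theta)$ can be evaluated exactly for Bernoulli, since the expectation is just a two-point sum over $x \in \{0, 1\}$ — a numeric sketch assuming NumPy:

```python
import numpy as np

# Two routes to Fisher information for Bernoulli(theta):
#   (a) E[score^2], and (b) -E[second derivative of log p],
# both computed exactly over x in {0, 1}. Closed form: 1/(theta(1-theta)).
theta = 0.3

def logp(t, x):
    return x * np.log(t) + (1 - x) * np.log(1 - t)

xs = np.array([0.0, 1.0])
weights = np.array([1 - theta, theta])   # P(X = 0), P(X = 1)

score = xs / theta - (1 - xs) / (1 - theta)
info_score = np.sum(weights * score**2)           # route (a)

h = 1e-5                                          # finite-difference Hessian
hess = (logp(theta + h, xs) - 2 * logp(theta, xs) + logp(theta - h, xs)) / h**2
info_hess = -np.sum(weights * hess)               # route (b)

closed_form = 1 / (theta * (1 - theta))
print(info_score, info_hess, closed_form)
```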

Fisher information matrix (multiparameter):

$$[I(\theta)]_{jk} = \mathbb{E}\!\left[\frac{\partial \log p}{\partial \theta_j}\frac{\partial \log p}{\partial \theta_k}\right] = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta_j \partial \theta_k}\right].$$

$I(\theta)$ is always positive semidefinite. It is the covariance matrix of the score vector.

Cramér-Rao Lower Bound

Theorem: For any unbiased estimator $\hat\theta$ of $\theta$ (with $\mathbb{E}_\theta[\hat\theta] = \theta$):

$$\operatorname{Var}_\theta(\hat\theta) \geq \frac{1}{n \cdot I(\theta)}.$$

Proof via Cauchy-Schwarz. Let $s_n(\theta) = \sum_i \frac{\partial \log p(x_i;\theta)}{\partial\theta}$ be the full-data score. Then:

  1. $\operatorname{Cov}(\hat\theta, s_n) = \mathbb{E}[\hat\theta \cdot s_n] = \frac{\partial}{\partial\theta}\mathbb{E}_\theta[\hat\theta] = 1$ (the first equality uses $\mathbb{E}[s_n] = 0$; the second uses unbiasedness + differentiation under the integral).

  2. $\operatorname{Var}(s_n) = n \cdot I(\theta)$.

  3. Cauchy-Schwarz: $\operatorname{Var}(\hat\theta) \cdot \operatorname{Var}(s_n) \geq [\operatorname{Cov}(\hat\theta, s_n)]^2 = 1$.

  4. Therefore $\operatorname{Var}(\hat\theta) \geq 1/(n \cdot I(\theta))$. $\square$

Multiparameter: $\operatorname{Cov}(\hat\theta) \succeq [n \cdot I(\theta)]^{-1}$ in the PSD (positive semidefinite) order — meaning $\operatorname{Cov}(\hat\theta) - [n \cdot I(\theta)]^{-1}$ is PSD.

An estimator achieving the CR bound is called efficient. The MLE is asymptotically efficient.

When is the CR bound achievable exactly? Only for exponential families: the bound is tight iff the score can be written as $s(\theta) = c(\theta)(T - \theta)$ — satisfied precisely by the natural exponential family with sufficient statistic $T$.
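A simulation sketch (assuming NumPy) of exact achievability for another exponential family, Poisson($\lambda$), where $I(\lambda) = 1/\lambda$ and the MLE is the sample mean:

```python
import numpy as np

# Poisson(lam) is a natural exponential family, so the MLE (the sample mean)
# should match the CR bound 1/(n I(lam)) = lam/n exactly, up to Monte Carlo
# noise. lam, n, reps are illustrative choices.
rng = np.random.default_rng(3)
lam, n, reps = 4.0, 50, 40_000

x = rng.poisson(lam, size=(reps, n))
mle = x.mean(axis=1)

empirical_var = mle.var()
cr_bound = lam / n              # = 0.08
print(empirical_var, cr_bound)
```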

Rao-Blackwell Theorem and UMVUE

Rao-Blackwell theorem: If $\hat\theta$ is any unbiased estimator and $T$ is a sufficient statistic, then $\tilde\theta = \mathbb{E}[\hat\theta \mid T]$ satisfies:

  1. $\tilde\theta$ is unbiased: $\mathbb{E}[\tilde\theta] = \theta$.
  2. $\operatorname{Var}(\tilde\theta) \leq \operatorname{Var}(\hat\theta)$, with equality iff $\hat\theta = \tilde\theta$ a.s.

Proof (law of total variance): $\operatorname{Var}(\hat\theta) = \operatorname{Var}(\mathbb{E}[\hat\theta \mid T]) + \mathbb{E}[\operatorname{Var}(\hat\theta \mid T)] = \operatorname{Var}(\tilde\theta) + \mathbb{E}[\operatorname{Var}(\hat\theta \mid T)] \geq \operatorname{Var}(\tilde\theta)$.

The UMVUE (uniformly minimum variance unbiased estimator) is the best unbiased estimator at every $\theta$. By the Lehmann-Scheffé theorem: if $T$ is a complete sufficient statistic and $\tilde\theta = g(T)$ is unbiased for $\theta$, then $\tilde\theta$ is the unique UMVUE.

A sufficient statistic $T$ is complete if $\mathbb{E}_\theta[f(T)] = 0$ for all $\theta$ implies $f(T) = 0$ a.s. — the exponential family with natural sufficient statistic is complete under mild conditions.

Worked Example

Example 1: Gaussian MLE is Efficient

$X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known; estimate $\mu$.

Log-likelihood: $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i-\mu)^2$.

Score: $s(\mu) = \frac{1}{\sigma^2}\sum(x_i - \mu) = 0 \Rightarrow \hat\mu = \bar x$.

Fisher info (per observation): $I(\mu) = -\mathbb{E}[\partial^2\ell/\partial\mu^2]/n = 1/\sigma^2$.

CR bound: $\operatorname{Var}(\hat\mu) \geq \sigma^2/n$.

MLE variance: $\operatorname{Var}(\bar x) = \sigma^2/n$. Exact equality — $\bar x$ is efficient.

The asymptotic normal approximation: $\sqrt{n}(\hat\mu - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$. The Fisher information here is $1/\sigma^2$, so $I(\mu)^{-1} = \sigma^2$. ✓
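A quick simulation of this example (a sketch assuming NumPy; the specific $\mu$, $\sigma$, $n$ are arbitrary choices):

```python
import numpy as np

# Var(xbar) should match the CR bound sigma^2/n, since xbar is efficient.
rng = np.random.default_rng(7)
mu, sigma, n, reps = 2.0, 1.5, 30, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
var_mle = x.mean(axis=1).var()   # empirical Var(xbar)
cr = sigma**2 / n                # = 0.075
print(var_mle, cr)
```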

Example 2: Bernoulli CR Bound

$X_i \stackrel{\text{iid}}{\sim} \text{Bernoulli}(p)$. Score: $s(p) = k/p - (n-k)/(1-p) = 0 \Rightarrow \hat p = k/n$, where $k = \sum_i x_i$.

$I(p) = 1/(p(1-p))$. CR bound: $p(1-p)/n$. MLE variance: $p(1-p)/n$. Tight again.

Estimation is hardest near $p = 0.5$, where $I(p) = 4$ is at its minimum — small Fisher information means high variance is unavoidable. At $p = 0.9$, $I(p) = 1/(0.9 \cdot 0.1) \approx 11.1$, so the variance floor for estimating $p$ is far lower.
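The variance floor across $p$ can be tabulated directly (an illustrative sketch assuming NumPy; $n = 100$ is an arbitrary choice):

```python
import numpy as np

# CR bound p(1-p)/n across p: highest at p = 0.5, where Fisher information
# is smallest, and dropping toward the boundary.
n = 100
ps = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
fisher = 1 / (ps * (1 - ps))   # I(p)
bound = ps * (1 - ps) / n      # CR bound on Var of any unbiased estimator
for p, i_p, b in zip(ps, fisher, bound):
    print(f"p={p:.1f}  I(p)={i_p:6.2f}  CR bound={b:.5f}")
```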

Example 3: Rao-Blackwell in Action

Estimate $p$ in Bernoulli($p$) with the naive estimator $\hat\theta = X_1$ (just the first observation).

$\hat\theta$ is unbiased: $\mathbb{E}[X_1] = p$. Variance: $p(1-p)$.

Sufficient statistic: $T = \sum_{i=1}^n X_i$. By exchangeability:

$$\tilde\theta = \mathbb{E}[X_1 \mid T = k] = \frac{k}{n}.$$

Rao-Blackwellized estimator: $\tilde\theta = \bar X$, variance $p(1-p)/n$. Ratio of improvement: $n$ — the variance shrinks by a factor of $n$. The UMVUE is the sample mean, as expected.
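The improvement is easy to see by simulation — a sketch assuming NumPy, with $p = 0.3$ and $n = 20$ as illustrative choices:

```python
import numpy as np

# Rao-Blackwell in action: the naive unbiased estimator X_1 has variance
# p(1-p); conditioning on T = sum(X_i) gives the sample mean, with variance
# p(1-p)/n -- a factor-of-n reduction.
rng = np.random.default_rng(4)
p, n, reps = 0.3, 20, 100_000

x = rng.binomial(1, p, size=(reps, n)).astype(float)
naive = x[:, 0]                  # X_1
rao_blackwell = x.mean(axis=1)   # E[X_1 | T] = T/n

v_naive = naive.var()            # ~ p(1-p) = 0.21
v_rb = rao_blackwell.var()       # ~ p(1-p)/n = 0.0105
print(v_naive, v_rb, v_naive / v_rb)
```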

Connections

Where Your Intuition Breaks

The MLE is consistent and asymptotically efficient — optimal in the limit of large $n$. The dangerous misconception is extending "optimal" to all finite-sample settings. In dimension $d \geq 3$, the MLE for estimating the mean of a multivariate Gaussian is inadmissible: the James-Stein estimator is biased yet has strictly lower mean squared error at every true $\mu$. The Cramér-Rao bound only constrains unbiased estimators; biased estimators can trade bias for variance and achieve lower MSE. In ML, this is not a curiosity — ridge regression, Lasso, and Bayesian MAP estimators are biased versions of MLE that routinely outperform it on finite-sample test data in high dimensions. "Maximum likelihood" is asymptotically optimal, not universally dominant.

💡Intuition

Fisher information is the curvature of the log-likelihood. Equivalently, $I(\theta) = \mathbb{E}[s^2]$ measures how volatile the score is. A high-information parameter is one where small changes in $\theta$ produce large changes in the likelihood — the likelihood is steep. For Bernoulli at $p = 0.01$, changing $p$ by 0.01 is a huge change relative to its scale, and it noticeably shifts the likelihood of each observed 0: $\log(1-0.01)$ vs $\log(1-0.02)$. The information is high near the boundary. At $p = 0.5$, small changes in $p$ barely change the likelihood of a typical sample, so each observation carries less information.

💡Intuition

The CR bound is achievable only for exponential families. For the Poisson, Gaussian, Bernoulli, and Gamma families, the score factors as $c(\theta)(T - \mathbb{E}[T])$, making the Cauchy-Schwarz inequality tight. For non-exponential families (like the uniform $U(0,\theta)$), the MLE $\hat\theta = X_{(n)}$ converges at rate $n$ rather than $\sqrt{n}$ — the CR bound does not apply because the support depends on $\theta$ (regularity conditions fail). In those cases, much faster rates are achievable.
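The faster rate is visible in simulation: for $U(0,\theta)$ the MLE satisfies $\mathbb{E}[\theta - X_{(n)}] = \theta/(n+1)$, i.e. the error shrinks like $1/n$. A NumPy sketch (sample sizes are arbitrary choices):

```python
import numpy as np

# Error of the uniform MLE X_(n) shrinks like 1/n, not 1/sqrt(n):
# E[theta - X_(n)] = theta / (n + 1).
rng = np.random.default_rng(5)
theta, reps = 1.0, 10_000

errs = {}
for n in (10, 100, 1000):
    x = rng.uniform(0, theta, size=(reps, n))
    errs[n] = (theta - x.max(axis=1)).mean()   # ~ theta / (n + 1)

print(errs)
```

Growing $n$ by 10x shrinks the error by roughly 10x, the $1/n$ signature, where a $\sqrt{n}$-rate estimator would only improve by about 3.2x.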

⚠️Warning

Unbiasedness is not always desirable. The UMVUE minimizes variance among unbiased estimators, but biased estimators can have strictly lower mean squared error. The James-Stein estimator for estimating the mean of a multivariate Gaussian in $d \geq 3$ dimensions is biased but dominates the MLE in MSE — the MLE is inadmissible. The CR bound only bounds unbiased estimators; biased ones (including ridge regression, Lasso, Bayes estimators) can fall below it. In high-dimensional ML, regularized (biased) estimators almost always outperform the MLE.
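A minimal James-Stein demonstration — a simulation sketch assuming NumPy, where $d = 10$ and $\mu = (1, \ldots, 1)$ are arbitrary choices (any $d \geq 3$ and any true mean exhibits the domination):

```python
import numpy as np

# James-Stein vs MLE for the mean of N(mu, I_d) from one observation x:
# shrink x toward 0 by the data-dependent factor (1 - (d-2)/||x||^2).
# The JS estimator is biased yet has lower total squared error.
rng = np.random.default_rng(6)
d, reps = 10, 50_000
mu = np.full(d, 1.0)

x = rng.normal(mu, 1.0, size=(reps, d))          # one observation per repeat
norms2 = np.sum(x**2, axis=1, keepdims=True)
js = (1 - (d - 2) / norms2) * x                  # James-Stein shrinkage

mse_mle = np.mean(np.sum((x - mu)**2, axis=1))   # = d = 10 in expectation
mse_js = np.mean(np.sum((js - mu)**2, axis=1))   # strictly smaller
print(mse_mle, mse_js)
```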
