
Limit Theorems: LLN, CLT & Berry-Esseen

The Law of Large Numbers and Central Limit Theorem are the two pillars of classical probability. The LLN says sample means concentrate around the true mean; the CLT says the fluctuations around that mean are Gaussian, regardless of the original distribution. The Berry-Esseen theorem quantifies how fast the Gaussian approximation kicks in. Together they justify essentially every estimator used in statistical ML.

Concepts

[Interactive figure: 3000 simulated sample means $\bar{X}_n$ with the $\mathcal{N}(\mu, \sigma^2/n)$ density overlaid; caption: "CLT: still converging — $\sqrt{n}(\bar{X}_n - \mu)/\sigma \to \mathcal{N}(0,1)$". Increasing $n$ shows that even highly non-Gaussian distributions (Bernoulli, exponential, bimodal) produce sample means that converge to a Gaussian shape; by $n \geq 30$, the $\mathcal{N}(\mu, \sigma^2/n)$ approximation is excellent.]

Roll one die: you might get a 6. Roll ten dice and average them: the distribution of averages looks roughly bell-shaped. Roll a thousand dice: the bell shape is extremely tight around 3.5. This is the Central Limit Theorem in action: no matter what the underlying distribution (fair die, biased coin, anything with finite variance), the average of enough independent samples is approximately Gaussian. This is why confidence intervals work, why gradient estimates in SGD are interpretable, and why hypothesis tests based on the Gaussian distribution are valid even for non-Gaussian data.
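The dice experiment is easy to simulate. A minimal sketch (assuming NumPy; the trial count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def dice_mean_samples(n_dice, n_trials=20_000):
    """Distribution of the average of n_dice fair-die rolls."""
    rolls = rng.integers(1, 7, size=(n_trials, n_dice))  # faces 1..6
    return rolls.mean(axis=1)

# Spread of the averages shrinks like sigma/sqrt(n), sigma = sqrt(35/12) ~ 1.71.
for n in (1, 10, 1000):
    means = dice_mean_samples(n)
    print(f"n={n:5d}  mean of averages={means.mean():.3f}  std={means.std():.4f}")
```

At n = 1000 the averages cluster tightly around 3.5 with spread roughly $\sigma/\sqrt{1000} \approx 0.054$.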

The Law of Large Numbers

Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = \mu < \infty$.

**Weak Law of Large Numbers (WLLN).** $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$.

Proof via Chebyshev (requires $\text{Var}(X) < \infty$):

$$P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\text{Var}(X)}{n\varepsilon^2} \to 0.$$
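The Chebyshev bound can be checked by simulation. A quick sketch (assuming NumPy; Exp(1) is an arbitrary choice with $\mu = \text{Var}(X) = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)

def deviation_prob(n, eps=0.1, n_trials=5_000):
    """Empirical P(|Xbar_n - mu| > eps) for X ~ Exp(1), where mu = Var(X) = 1."""
    xbar = rng.exponential(1.0, size=(n_trials, n)).mean(axis=1)
    return float(np.mean(np.abs(xbar - 1.0) > eps))

for n in (10, 100, 1000):
    bound = min(1.0, 1.0 / (n * 0.1**2))  # Chebyshev: Var(X)/(n eps^2)
    print(f"n={n:4d}  empirical={deviation_prob(n):.4f}  Chebyshev bound={bound:.4f}")
```

The empirical probability sits well below the bound — Chebyshev is crude but sufficient to force convergence.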

**Strong Law of Large Numbers (SLLN).** $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$.

**Proof sketch** (requires only $\mathbb{E}[|X|] < \infty$). Truncate: $X_i' = X_i \mathbf{1}_{|X_i| \leq i}$. Show that $\bar{X}_n - \bar{X}_n' \to 0$ a.s. using Borel–Cantelli, then show $\bar{X}_n' \to \mu$ a.s. using Kronecker's lemma and $L^2$ convergence. (A shortcut proof assumes a finite fourth moment and uses a fourth-moment bound along subsequences plus interpolation.)

**The a.s. vs. in-probability gap.** SLLN gives $P(\bar{X}_n \to \mu) = 1$ — for almost every realization of the sequence, the sample mean converges. WLLN only gives $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$ for each fixed $\varepsilon$ — the probability of a deviation at time $n$ vanishes, but this alone does not rule out a single realization making infinitely many excursions.

**Glivenko–Cantelli.** The empirical CDF $F_n(x) = \frac{1}{n}\sum_i \mathbf{1}[X_i \leq x]$ satisfies:

$$\sup_x |F_n(x) - F(x)| \xrightarrow{\text{a.s.}} 0.$$

This uniform SLLN justifies plug-in estimators and empirical risk minimization: the empirical distribution converges to the true distribution uniformly over all thresholds.
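The uniform convergence is directly observable: compute the Kolmogorov–Smirnov distance $\sup_x |F_n(x) - F(x)|$ for growing $n$. A sketch assuming NumPy, with Uniform(0,1) samples so that $F(x) = x$:

```python
import numpy as np

rng = np.random.default_rng(2)

def ks_distance(n):
    """sup_x |F_n(x) - F(x)| for n Uniform(0,1) samples (so F(x) = x)."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The sup is attained at an order statistic, approached from left or right.
    return float(max(np.max(i / n - x), np.max(x - (i - 1) / n)))

for n in (100, 10_000, 1_000_000):
    print(f"n={n:8d}  sup|F_n - F| = {ks_distance(n):.5f}")  # shrinks like O(1/sqrt(n))
```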

The Central Limit Theorem

Let $X_1, \ldots, X_n$ be iid with $\mathbb{E}[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 \in (0,\infty)$. Define $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$.

**CLT.** $Z_n \xrightarrow{d} \mathcal{N}(0,1)$.

**Proof via characteristic functions.** WLOG $\mu = 0$, $\sigma^2 = 1$. Then $Z_n = \frac{1}{\sqrt{n}}\sum_i X_i$ has CF:

$$\varphi_{Z_n}(t) = \left(\varphi_X(t/\sqrt{n})\right)^n.$$

Taylor expand $\varphi_X$ at 0: $\varphi_X(s) = 1 + i\mu s - (\sigma^2 + \mu^2)\frac{s^2}{2} + o(s^2)$, so with $s = t/\sqrt{n}$ (and $\mu = 0$, $\sigma^2 = 1$), $\varphi_X(t/\sqrt{n}) = 1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)$.

Hence

$$\varphi_{Z_n}(t) = \left(1 - \frac{t^2}{2n} + o(1/n)\right)^n \to e^{-t^2/2} \quad \text{as } n \to \infty.$$

Since $e^{-t^2/2}$ is the CF of $\mathcal{N}(0,1)$, Lévy's continuity theorem gives $Z_n \xrightarrow{d} \mathcal{N}(0,1)$. The characteristic function is the right tool here because it always exists (unlike the MGF, which requires finite exponential moments), and convergence of CFs to a limit continuous at 0 implies convergence in distribution (Lévy's theorem). The Gaussian CF $e^{-t^2/2}$ appears naturally as the limit of $(1 - t^2/2n)^n$ because the Gaussian is the unique fixed point of the rescale-and-sum map underlying the CLT.
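Both steps of the proof can be checked numerically: the deterministic limit $(1 - t^2/2n)^n \to e^{-t^2/2}$, and the empirical CF of the standardized sum. A sketch assuming NumPy, with Exp(1) summands as an arbitrary non-Gaussian choice ($\mu = \sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Deterministic part: the n-th power converges to the Gaussian CF.
t = 1.5
for n in (10, 100, 10_000):
    print(f"n={n:6d}  (1 - t^2/2n)^n = {(1 - t**2/(2*n))**n:.6f}"
          f"   e^(-t^2/2) = {np.exp(-t**2/2):.6f}")

def empirical_cf(t, n, n_trials=20_000):
    """Monte Carlo estimate of E[exp(i t Z_n)] for Z_n = sqrt(n)(Xbar_n - 1), X ~ Exp(1)."""
    z = np.sqrt(n) * (rng.exponential(1.0, size=(n_trials, n)).mean(axis=1) - 1.0)
    return np.mean(np.exp(1j * t * z))

print("empirical CF at t=1, n=200:", empirical_cf(1.0, 200))  # near e^{-1/2} ~ 0.607
```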

**Practical form.** For a statistical estimator $\hat\theta_n$:

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma_\theta^2),$$

where $\sigma_\theta^2 = \text{Var}(X_i)$ for the sample mean. This gives approximate confidence intervals: $\theta \in \hat\theta_n \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ with probability $\approx 1 - \alpha$.

Berry-Esseen Theorem

The CLT tells us convergence happens but not how fast. The Berry-Esseen theorem quantifies the rate:

**Theorem.** For iid $X_i$ with $\mathbb{E}[X_i] = 0$, $\mathbb{E}[X_i^2] = \sigma^2$, and $\rho = \mathbb{E}[|X_i|^3] < \infty$:

$$\sup_x \left|P(Z_n \leq x) - \Phi(x)\right| \leq \frac{C\rho}{\sigma^3\sqrt{n}},$$

where $\Phi$ is the standard normal CDF and $C \leq 0.4748$ (the best known constant).

**Interpretation.** The CDF error decays as $O(1/\sqrt{n})$. For $n = 100$ and a distribution with moderate $\rho/\sigma^3$ (near 1), the bound is already below 5% — the Gaussian approximation is quite good.
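The bound can be compared against the actual CDF error by simulation. A sketch assuming NumPy, using Exp(1) summands ($\mu = \sigma = 1$; $\rho = \mathbb{E}|X-1|^3 = 12/e - 2 \approx 2.41$):

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def sup_cdf_error(n, n_trials=50_000):
    """Empirical sup_x |P(Z_n <= x) - Phi(x)| for Exp(1) summands."""
    z = np.sort(np.sqrt(n) * (rng.exponential(1.0, size=(n_trials, n)).mean(axis=1) - 1.0))
    phi = np.array([normal_cdf(v) for v in z])
    hi = np.arange(1, n_trials + 1) / n_trials   # empirical CDF just after each point
    lo = np.arange(0, n_trials) / n_trials       # empirical CDF just before each point
    return float(max(np.max(hi - phi), np.max(phi - lo)))

rho = 12 / math.e - 2                             # E|X-1|^3 for Exp(1)
for n in (10, 100):
    print(f"n={n:4d}  observed error={sup_cdf_error(n):.4f}  "
          f"Berry-Esseen bound={0.4748 * rho / math.sqrt(n):.4f}")
```

The observed error sits an order of magnitude below the worst-case bound for this well-behaved distribution.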

**When Berry–Esseen is tight.** For Bernoulli($p$) with small $p$: $\rho/\sigma^3 = \big(p(1-p)^3 + (1-p)p^3\big)/(p(1-p))^{3/2} \approx p/p^{3/2} = p^{-1/2}$ for small $p$. So the bound is $O(1/\sqrt{pn})$ — the Gaussian approximation is poor when the success probability is small. This is why the Poisson approximation is preferred for rare events.

Lindeberg-Feller CLT (Non-iid Case)

Let $X_{n1}, \ldots, X_{nn}$ be independent (but not identically distributed) with $\mathbb{E}[X_{nk}] = 0$, $\text{Var}(X_{nk}) = \sigma_{nk}^2$, and $s_n^2 = \sum_k \sigma_{nk}^2$.

**Lindeberg condition**: for all $\varepsilon > 0$:

$$\frac{1}{s_n^2}\sum_{k=1}^n \mathbb{E}\big[X_{nk}^2 \mathbf{1}_{|X_{nk}| > \varepsilon s_n}\big] \to 0.$$

**Lindeberg–Feller CLT.** If the Lindeberg condition holds:

$$\frac{1}{s_n}\sum_{k=1}^n X_{nk} \xrightarrow{d} \mathcal{N}(0,1).$$

**Intuition.** The Lindeberg condition says that no single term dominates the sum — no individual $X_{nk}$ contributes a non-vanishing fraction of the total variance. This is the right generalization of the iid assumption to non-identically-distributed sums.

ML use: SGD iterates, mini-batch gradients with varying per-sample loss distributions, heterogeneous federated learning — all involve sums of non-iid terms where Lindeberg-Feller justifies Gaussian approximations.
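The condition is easy to evaluate numerically for a concrete triangular array. A sketch assuming NumPy, with $X_{nk} \sim \text{Uniform}(-\sqrt{k}, \sqrt{k})$ — bounded terms with growing but non-dominating variances, so Lindeberg holds:

```python
import numpy as np

rng = np.random.default_rng(5)

def lindeberg_sum(n, eps=0.5, n_trials=2_000):
    """Empirical (1/s_n^2) sum_k E[X_nk^2 1{|X_nk| > eps s_n}] for the array above."""
    a = np.sqrt(np.arange(1, n + 1))       # X_nk ~ Uniform(-a_k, a_k), a_k = sqrt(k)
    s2 = np.sum(a**2 / 3)                  # s_n^2 = sum_k a_k^2 / 3
    x = rng.uniform(-1.0, 1.0, size=(n_trials, n)) * a
    tail = x**2 * (np.abs(x) > eps * np.sqrt(s2))
    return float(tail.sum() / (n_trials * s2))

for n in (5, 10, 100):
    print(f"n={n:4d}  Lindeberg sum ~ {lindeberg_sum(n):.4f}")
# Once n is large, every |X_nk| <= sqrt(n) << eps * s_n, so the sum is exactly 0.
```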

Multivariate CLT and Cramér-Wold Device

**Multivariate CLT.** For iid $\mathbf{X}_i \in \mathbb{R}^d$ with mean $\boldsymbol\mu$ and covariance $\Sigma$:

$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol\mu) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma).$$

**Cramér–Wold device.** $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$ in $\mathbb{R}^d$ iff $t^T\mathbf{X}_n \xrightarrow{d} t^T\mathbf{X}$ for all $t \in \mathbb{R}^d$. So multivariate convergence reduces to univariate convergence of all linear projections.
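A simulated check of the multivariate CLT (assuming NumPy; the linear map `A` and centered Exp(1) coordinates are an arbitrary non-Gaussian example with covariance $\Sigma = AA^T$):

```python
import numpy as np

rng = np.random.default_rng(6)

A = np.array([[1.0, 0.0], [0.5, 1.0]])
Sigma = A @ A.T                           # covariance of each X_i

def scaled_mean_deviations(n, n_trials=5_000):
    """Samples of sqrt(n)(Xbar_n - mu) for X = A e, e ~ centered Exp(1)^2."""
    e = rng.exponential(1.0, size=(n_trials, n, 2)) - 1.0   # mean 0, identity cov
    return np.sqrt(n) * (e @ A.T).mean(axis=1)

z = scaled_mean_deviations(200)
print("empirical covariance:\n", np.cov(z.T))     # ~ Sigma
t = np.array([1.0, -1.0])                          # a Cramer-Wold projection
print("Var(t'Z) =", np.var(z @ t), "  t'Sigma t =", t @ Sigma @ t)
```

Every linear projection of the scaled deviations matches the predicted univariate Gaussian variance $t^T\Sigma t$, illustrating the Cramér–Wold reduction.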

**Law of the iterated logarithm (LIL).** A sharpening of the SLLN:

$$\limsup_{n\to\infty}\frac{\bar{X}_n - \mu}{\sigma\sqrt{2\log\log n/n}} = 1 \quad \text{a.s.}$$

The LIL gives the exact (up to constants) fluctuation size of the sample mean — it oscillates on the scale $\sigma\sqrt{2\log\log n/n}$, decaying slightly more slowly than the CLT's $\sigma/\sqrt{n}$ scale.

Worked Example

Example 1: CLT for the Empirical Risk

Consider training with a loss $\ell(f_w(x), y)$ and empirical risk $\hat{L}(w) = \frac{1}{n}\sum_i \ell_i$. Under the CLT:

$$\sqrt{n}(\hat{L}(w) - L(w)) \xrightarrow{d} \mathcal{N}(0, \text{Var}(\ell_i)).$$

This gives a confidence interval for the true risk: $L(w) \in \hat{L}(w) \pm z_{0.025}\,\hat\sigma/\sqrt{n}$ at 95% confidence, where $\hat\sigma^2 = \frac{1}{n}\sum_i(\ell_i - \hat{L})^2$.

**Practical implication:** On 10,000 test examples with average cross-entropy 0.5 and standard deviation 0.4, the 95% CI half-width is $1.96 \cdot 0.4/\sqrt{10000} = 0.0078$. The test loss is reported to 4 decimal places, but only about 2 are statistically meaningful.
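The arithmetic above is a two-liner in practice. A sketch (assuming NumPy; the synthetic Gaussian losses stand in for a real per-example loss vector):

```python
import numpy as np

def risk_confidence_interval(losses, z=1.96):
    """Approximate 95% CI for the true risk from per-example losses, via the CLT."""
    losses = np.asarray(losses)
    half = z * losses.std(ddof=1) / np.sqrt(losses.size)
    return losses.mean() - half, losses.mean() + half

# Synthetic stand-in for the example in the text: n = 10,000, mean 0.5, std 0.4.
rng = np.random.default_rng(7)
losses = rng.normal(0.5, 0.4, size=10_000)
lo, hi = risk_confidence_interval(losses)
print(f"test loss = {losses.mean():.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```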

Example 2: LLN for Monte Carlo Estimation

Monte Carlo estimator: $\hat\mu = \frac{1}{N}\sum_{i=1}^N f(X_i)$ for $X_i \sim P$.

**SLLN**: $\hat\mu \xrightarrow{\text{a.s.}} \mathbb{E}[f(X)] = \mu$ (if $\mathbb{E}[|f(X)|] < \infty$).

**CLT**: $\sqrt{N}(\hat\mu - \mu)/\sigma_f \xrightarrow{d} \mathcal{N}(0,1)$ where $\sigma_f^2 = \text{Var}(f(X))$.

**MSE**: $\mathbb{E}[(\hat\mu - \mu)^2] = \sigma_f^2/N$ — the $O(1/N)$ rate is dimension-independent. This is why Monte Carlo (and by extension stochastic gradient methods) scales to high-dimensional problems, unlike numerical quadrature, whose error degrades exponentially with dimension.
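A quick illustration of dimension independence (assuming NumPy; the product-form integrand is a hypothetical test function with $\mathbb{E}[f] = 1$ in every dimension):

```python
import numpy as np

rng = np.random.default_rng(8)

def mc_estimate(d, N=100_000):
    """MC estimate of E[prod_k (1 + (x_k - 1/2)/2)] = 1 for x ~ Uniform[0,1]^d."""
    x = rng.uniform(size=(N, d))
    return float(np.prod(1 + (x - 0.5) / 2, axis=1).mean())

for d in (1, 10, 50):
    print(f"d={d:3d}  estimate = {mc_estimate(d):.4f}  (true value 1.0)")
# Error is sigma_f / sqrt(N) in every dimension; only Var(f) changes with d,
# not the 1/sqrt(N) rate itself.
```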

Example 3: Berry-Esseen Applied to A/B Testing

An A/B test with $n = 1000$ users per variant and binary conversion (Bernoulli($p$) with $p = 0.05$):

$\rho = \mathbb{E}[|X - p|^3] = p(1-p)^3 + (1-p)p^3 = p(1-p)(1 - 2p + 2p^2) \approx p(1-p)$ for small $p$.

$\rho/\sigma^3 \approx p(1-p)/(p(1-p))^{3/2} = (p(1-p))^{-1/2} = 1/\sqrt{0.05 \cdot 0.95} \approx 4.6$.

Berry–Esseen error bound: $C\rho/(\sigma^3\sqrt{n}) \approx 0.475 \cdot 4.6/31.6 \approx 0.069$.

The normal approximation has CDF error up to ~7% — not negligible for rare events. This is why practitioners often use exact binomial tests or bootstrap for low-conversion-rate A/B tests.
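The exact-vs-approximate comparison is cheap to compute directly (standard library only; the recurrence builds binomial pmf values iteratively):

```python
import math

def binom_cdf(k, n, p):
    """Exact Binomial(n, p) CDF via a running pmf product (fine for n = 1000)."""
    prob = (1 - p) ** n                       # P(S = 0)
    total = prob
    for i in range(1, k + 1):
        prob *= (n - i + 1) / i * p / (1 - p)  # pmf(i) from pmf(i-1)
        total += prob
    return total

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 1000, 0.05
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
for k in (40, 50, 60):
    exact = binom_cdf(k, n, p)
    approx = normal_cdf((k - mu) / sigma)     # a continuity-corrected version would use k + 0.5
    print(f"k={k}: exact={exact:.4f}  normal={approx:.4f}  error={abs(exact - approx):.4f}")
```

The pointwise errors are smaller than the worst-case 0.069 bound but still far from negligible at percent scale, which is the case for exact tests.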

Connections

Where Your Intuition Breaks

The CLT requires finite variance, but the convergence rate (Berry–Esseen: $O(1/\sqrt{n})$) also depends on the third absolute moment $\rho = \mathbb{E}[|X|^3]$. For distributions with heavy tails — power laws, Pareto, or even a moderately heavy-tailed distribution like the $t$-distribution with 3 degrees of freedom — $n = 100$ may not be nearly enough for the Gaussian approximation to be accurate. A common mistake in ML: running an A/B test with 100 samples per variant and applying a $z$-test, not realizing that the outcome metric (e.g., revenue per user) has heavy enough tails that $n = 100$ gives a Gaussian approximation with significant CDF error. The Berry–Esseen bound tells you the worst-case error; you should check whether your metric has a small $\rho/\sigma^3$ before trusting the normal approximation at small sample sizes.

💡Intuition

The CLT works because higher cumulants vanish after normalization. From the CF proof: the cumulant generating function satisfies $K_{S_n}(t) = n K_X(t/\sqrt{n})$ for $S_n = \frac{1}{\sqrt{n}}\sum X_i$, so the $k$-th cumulant of $S_n$ is $n \cdot \kappa_k/(\sqrt{n})^k = \kappa_k/n^{k/2-1}$. For $k \geq 3$ this goes to zero as $n \to \infty$. Only $\kappa_1 = \mu$ and $\kappa_2 = \sigma^2$ survive — which are exactly the cumulants of the Gaussian. The Gaussian is the attractor of the CLT precisely because it has no higher cumulants.
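The cumulant decay is visible in simulation: the skewness (third standardized cumulant) of $Z_n$ shrinks like $\kappa_3/\sqrt{n}$. A sketch assuming NumPy, with Exp(1) summands ($\kappa_3 = 2$; the sum is Gamma($n$), so the skewness of $Z_n$ is exactly $2/\sqrt{n}$):

```python
import numpy as np

rng = np.random.default_rng(9)

def skewness_of_Zn(n, n_trials=100_000):
    """Empirical skewness of Z_n = sqrt(n)(Xbar_n - 1) for X ~ Exp(1)."""
    z = np.sqrt(n) * (rng.exponential(1.0, size=(n_trials, n)).mean(axis=1) - 1.0)
    return float(np.mean(z**3) / np.mean(z**2) ** 1.5)

for n in (4, 16, 64):
    print(f"n={n:3d}  skewness ~ {skewness_of_Zn(n):.3f}   kappa_3/sqrt(n) = {2/np.sqrt(n):.3f}")
```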

💡Intuition

Monte Carlo's $O(1/\sqrt{N})$ rate is dimension-free. Numerical quadrature in $d$ dimensions using $N$ grid points achieves error $O(N^{-r/d})$ for a degree-$r$ rule — exponentially worse in high $d$ (the curse of dimensionality). Monte Carlo achieves $O(1/\sqrt{N})$ by the CLT — completely independent of $d$. This makes MC and its variants (MCMC, importance sampling, SGD) the only practical approaches to integration in high dimensions. The price is the slow $O(1/\sqrt{N})$ rate, whereas quadrature achieves $O(N^{-r})$ in low dimensions.

⚠️Warning

The CLT's Gaussian approximation can be very poor for extreme quantiles. Berry–Esseen bounds the CDF error at $O(1/\sqrt{n})$ — but this is a uniform bound. The relative error in tail probabilities $P(Z_n > t)$ for large $t$ (like $t = 4$, four standard deviations out) can be much worse: the Gaussian underestimates heavy-tailed probabilities. For rare-event simulation (financial risk, safety-critical ML), the CLT's Gaussian approximation fails precisely where it matters most. Large deviation theory gives exponentially sharp estimates for $P(\bar{X}_n - \mu > a)$ at fixed $a$ as $n \to \infty$.
