
Limit Theorems: LLN, CLT & Berry-Esseen

The Law of Large Numbers and Central Limit Theorem are the two pillars of classical probability. The LLN says sample means concentrate around the true mean; the CLT says the fluctuations around that mean are Gaussian, regardless of the original distribution. The Berry-Esseen theorem quantifies how fast the Gaussian approximation kicks in. Together they justify essentially every estimator used in statistical ML.

Concepts

[Interactive figure: 3000 simulated sample means $\bar{X}_n$ with the $\mathcal{N}(\mu, \sigma^2/n)$ density overlaid; caption: "CLT: still converging — $\sqrt{n}(\bar{X}_n - \mu)/\sigma \to \mathcal{N}(0,1)$". Increasing $n$ shows that even highly non-Gaussian distributions (Bernoulli, exponential, bimodal) produce sample means that converge to a Gaussian shape; by $n \geq 30$, the $\mathcal{N}(\mu, \sigma^2/n)$ approximation is excellent.]

Roll one die: you might get a 6. Roll ten dice and average them: the distribution of averages looks roughly bell-shaped. Roll a thousand dice: the bell shape is extremely tight around 3.5. This is the Central Limit Theorem in action: no matter what the underlying distribution (fair die, biased coin, anything with finite variance), the average of enough independent samples is approximately Gaussian. This is why confidence intervals work, why gradient estimates in SGD are interpretable, and why hypothesis tests based on the Gaussian distribution are valid even for non-Gaussian data.
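The dice experiment is easy to simulate. A minimal sketch (assuming NumPy; the trial count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def dice_mean_samples(n_dice, n_trials=20_000):
    """Distribution of the average of n_dice fair-die rolls."""
    rolls = rng.integers(1, 7, size=(n_trials, n_dice))  # faces 1..6
    return rolls.mean(axis=1)

# Spread of the averages shrinks like sigma/sqrt(n), sigma = sqrt(35/12) ~ 1.71.
for n in (1, 10, 1000):
    means = dice_mean_samples(n)
    print(f"n={n:5d}  mean of averages={means.mean():.3f}  std={means.std():.4f}")
```

At n = 1000 the averages cluster tightly around 3.5 with spread roughly $\sigma/\sqrt{1000} \approx 0.054$.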

The Law of Large Numbers

Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = \mu < \infty$.

**Weak Law of Large Numbers (WLLN).** $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$.

Proof via Chebyshev (requires $\text{Var}(X) < \infty$):

$$P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\text{Var}(X)}{n\varepsilon^2} \to 0.$$
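The Chebyshev bound can be checked by simulation. A quick sketch (assuming NumPy; Exp(1) is an arbitrary choice with $\mu = \text{Var}(X) = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)

def deviation_prob(n, eps=0.1, n_trials=5_000):
    """Empirical P(|Xbar_n - mu| > eps) for X ~ Exp(1), where mu = Var(X) = 1."""
    xbar = rng.exponential(1.0, size=(n_trials, n)).mean(axis=1)
    return float(np.mean(np.abs(xbar - 1.0) > eps))

for n in (10, 100, 1000):
    bound = min(1.0, 1.0 / (n * 0.1**2))  # Chebyshev: Var(X)/(n eps^2)
    print(f"n={n:4d}  empirical={deviation_prob(n):.4f}  Chebyshev bound={bound:.4f}")
```

The empirical probability sits well below the bound — Chebyshev is crude but sufficient to force convergence.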

**Strong Law of Large Numbers (SLLN).** $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$.

**Proof sketch** (requires only $\mathbb{E}[|X|] < \infty$). Truncate: $X_i' = X_i \mathbf{1}_{|X_i| \leq i}$. Show that $\bar{X}_n - \bar{X}_n' \to 0$ a.s. using Borel–Cantelli, then show $\bar{X}_n' \to \mu$ a.s. using Kronecker's lemma and $L^2$ convergence. (A shortcut proof assumes a finite fourth moment and uses a fourth-moment bound along subsequences plus interpolation.)

**The a.s. vs. in-probability gap.** SLLN gives $P(\bar{X}_n \to \mu) = 1$ — for almost every realization of the sequence, the sample mean converges. WLLN only gives $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$ for each fixed $\varepsilon$ — the probability of a deviation at time $n$ vanishes, but this alone does not rule out a single realization making infinitely many excursions.

**Glivenko–Cantelli.** The empirical CDF $F_n(x) = \frac{1}{n}\sum_i \mathbf{1}[X_i \leq x]$ satisfies:

$$\sup_x |F_n(x) - F(x)| \xrightarrow{\text{a.s.}} 0.$$

This uniform SLLN justifies plug-in estimators and empirical risk minimization: the empirical distribution converges to the true distribution uniformly over all thresholds.
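The uniform convergence is directly observable: compute the Kolmogorov–Smirnov distance $\sup_x |F_n(x) - F(x)|$ for growing $n$. A sketch assuming NumPy, with Uniform(0,1) samples so that $F(x) = x$:

```python
import numpy as np

rng = np.random.default_rng(2)

def ks_distance(n):
    """sup_x |F_n(x) - F(x)| for n Uniform(0,1) samples (so F(x) = x)."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # The sup is attained at an order statistic, approached from left or right.
    return float(max(np.max(i / n - x), np.max(x - (i - 1) / n)))

for n in (100, 10_000, 1_000_000):
    print(f"n={n:8d}  sup|F_n - F| = {ks_distance(n):.5f}")  # shrinks like O(1/sqrt(n))
```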

The Central Limit Theorem

Let $X_1, \ldots, X_n$ be iid with $\mathbb{E}[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 \in (0,\infty)$. Define $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$.

**CLT.** $Z_n \xrightarrow{d} \mathcal{N}(0,1)$.

**Proof via characteristic functions.** WLOG $\mu = 0$, $\sigma^2 = 1$. Then $Z_n = \frac{1}{\sqrt{n}}\sum_i X_i$ has CF:

$$\varphi_{Z_n}(t) = \left(\varphi_X(t/\sqrt{n})\right)^n.$$

Taylor expand $\varphi_X$ at 0: $\varphi_X(s) = 1 + i\mu s - (\sigma^2 + \mu^2)\frac{s^2}{2} + o(s^2)$, so with $s = t/\sqrt{n}$ (and $\mu = 0$, $\sigma^2 = 1$), $\varphi_X(t/\sqrt{n}) = 1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)$.

Hence

$$\varphi_{Z_n}(t) = \left(1 - \frac{t^2}{2n} + o(1/n)\right)^n \to e^{-t^2/2} \quad \text{as } n \to \infty.$$

Since $e^{-t^2/2}$ is the CF of $\mathcal{N}(0,1)$, Lévy's continuity theorem gives $Z_n \xrightarrow{d} \mathcal{N}(0,1)$. The characteristic function is the right tool here because it always exists (unlike the MGF, which requires finite exponential moments), and convergence of CFs to a limit continuous at 0 implies convergence in distribution (Lévy's theorem). The Gaussian CF $e^{-t^2/2}$ appears naturally as the limit of $(1 - t^2/2n)^n$ because the Gaussian is the unique fixed point of the rescale-and-sum map underlying the CLT.
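Both steps of the proof can be checked numerically: the deterministic limit $(1 - t^2/2n)^n \to e^{-t^2/2}$, and the empirical CF of the standardized sum. A sketch assuming NumPy, with Exp(1) summands as an arbitrary non-Gaussian choice ($\mu = \sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Deterministic part: the n-th power converges to the Gaussian CF.
t = 1.5
for n in (10, 100, 10_000):
    print(f"n={n:6d}  (1 - t^2/2n)^n = {(1 - t**2/(2*n))**n:.6f}"
          f"   e^(-t^2/2) = {np.exp(-t**2/2):.6f}")

def empirical_cf(t, n, n_trials=20_000):
    """Monte Carlo estimate of E[exp(i t Z_n)] for Z_n = sqrt(n)(Xbar_n - 1), X ~ Exp(1)."""
    z = np.sqrt(n) * (rng.exponential(1.0, size=(n_trials, n)).mean(axis=1) - 1.0)
    return np.mean(np.exp(1j * t * z))

print("empirical CF at t=1, n=200:", empirical_cf(1.0, 200))  # near e^{-1/2} ~ 0.607
```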

**Practical form.** For a statistical estimator $\hat\theta_n$:

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma_\theta^2),$$

where $\sigma_\theta^2 = \text{Var}(X_i)$ for the sample mean. This gives approximate confidence intervals: $\theta \in \hat\theta_n \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ with probability $\approx 1 - \alpha$.

Berry-Esseen Theorem

The CLT tells us convergence happens but not how fast. The Berry-Esseen theorem quantifies the rate:

**Theorem.** For iid $X_i$ with $\mathbb{E}[X_i] = 0$, $\mathbb{E}[X_i^2] = \sigma^2$, and $\rho = \mathbb{E}[|X_i|^3] < \infty$:

$$\sup_x \left|P(Z_n \leq x) - \Phi(x)\right| \leq \frac{C\rho}{\sigma^3\sqrt{n}},$$

where $\Phi$ is the standard normal CDF and $C \leq 0.4748$ (the best known constant).

**Interpretation.** The CDF error decays as $O(1/\sqrt{n})$. For $n = 100$ and a distribution with moderate $\rho/\sigma^3$ (near 1), the bound is already below 5% — the Gaussian approximation is quite good.
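The bound can be compared against the actual CDF error by simulation. A sketch assuming NumPy, using Exp(1) summands ($\mu = \sigma = 1$; $\rho = \mathbb{E}|X-1|^3 = 12/e - 2 \approx 2.41$):

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def sup_cdf_error(n, n_trials=50_000):
    """Empirical sup_x |P(Z_n <= x) - Phi(x)| for Exp(1) summands."""
    z = np.sort(np.sqrt(n) * (rng.exponential(1.0, size=(n_trials, n)).mean(axis=1) - 1.0))
    phi = np.array([normal_cdf(v) for v in z])
    hi = np.arange(1, n_trials + 1) / n_trials   # empirical CDF just after each point
    lo = np.arange(0, n_trials) / n_trials       # empirical CDF just before each point
    return float(max(np.max(hi - phi), np.max(phi - lo)))

rho = 12 / math.e - 2                             # E|X-1|^3 for Exp(1)
for n in (10, 100):
    print(f"n={n:4d}  observed error={sup_cdf_error(n):.4f}  "
          f"Berry-Esseen bound={0.4748 * rho / math.sqrt(n):.4f}")
```

The observed error sits an order of magnitude below the worst-case bound for this well-behaved distribution.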

**When Berry–Esseen is tight.** For Bernoulli($p$) with small $p$: $\rho/\sigma^3 = \big(p(1-p)^3 + (1-p)p^3\big)/(p(1-p))^{3/2} \approx p/p^{3/2} = p^{-1/2}$ for small $p$. So the bound is $O(1/\sqrt{pn})$ — the Gaussian approximation is poor when the success probability is small. This is why the Poisson approximation is preferred for rare events.

Lindeberg-Feller CLT (Non-iid Case)

Let $X_{n1}, \ldots, X_{nn}$ be independent (but not identically distributed) with $\mathbb{E}[X_{nk}] = 0$, $\text{Var}(X_{nk}) = \sigma_{nk}^2$, and $s_n^2 = \sum_k \sigma_{nk}^2$.

**Lindeberg condition**: for all $\varepsilon > 0$:

$$\frac{1}{s_n^2}\sum_{k=1}^n \mathbb{E}\big[X_{nk}^2 \mathbf{1}_{|X_{nk}| > \varepsilon s_n}\big] \to 0.$$

**Lindeberg–Feller CLT.** If the Lindeberg condition holds:

$$\frac{1}{s_n}\sum_{k=1}^n X_{nk} \xrightarrow{d} \mathcal{N}(0,1).$$

**Intuition.** The Lindeberg condition says that no single term dominates the sum — no individual $X_{nk}$ contributes a non-vanishing fraction of the total variance. This is the right generalization of the iid assumption to non-identically-distributed sums.

ML use: SGD iterates, mini-batch gradients with varying per-sample loss distributions, heterogeneous federated learning — all involve sums of non-iid terms where Lindeberg-Feller justifies Gaussian approximations.
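The condition is easy to evaluate numerically for a concrete triangular array. A sketch assuming NumPy, with $X_{nk} \sim \text{Uniform}(-\sqrt{k}, \sqrt{k})$ — bounded terms with growing but non-dominating variances, so Lindeberg holds:

```python
import numpy as np

rng = np.random.default_rng(5)

def lindeberg_sum(n, eps=0.5, n_trials=2_000):
    """Empirical (1/s_n^2) sum_k E[X_nk^2 1{|X_nk| > eps s_n}] for the array above."""
    a = np.sqrt(np.arange(1, n + 1))       # X_nk ~ Uniform(-a_k, a_k), a_k = sqrt(k)
    s2 = np.sum(a**2 / 3)                  # s_n^2 = sum_k a_k^2 / 3
    x = rng.uniform(-1.0, 1.0, size=(n_trials, n)) * a
    tail = x**2 * (np.abs(x) > eps * np.sqrt(s2))
    return float(tail.sum() / (n_trials * s2))

for n in (5, 10, 100):
    print(f"n={n:4d}  Lindeberg sum ~ {lindeberg_sum(n):.4f}")
# Once n is large, every |X_nk| <= sqrt(n) << eps * s_n, so the sum is exactly 0.
```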

Multivariate CLT and Cramér-Wold Device

**Multivariate CLT.** For iid $\mathbf{X}_i \in \mathbb{R}^d$ with mean $\boldsymbol\mu$ and covariance $\Sigma$:

$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol\mu) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma).$$

**Cramér–Wold device.** $\mathbf{X}_n \xrightarrow{d} \mathbf{X}$ in $\mathbb{R}^d$ iff $t^T\mathbf{X}_n \xrightarrow{d} t^T\mathbf{X}$ for all $t \in \mathbb{R}^d$. So multivariate convergence reduces to univariate convergence of all linear projections.
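A simulated check of the multivariate CLT (assuming NumPy; the linear map `A` and centered Exp(1) coordinates are an arbitrary non-Gaussian example with covariance $\Sigma = AA^T$):

```python
import numpy as np

rng = np.random.default_rng(6)

A = np.array([[1.0, 0.0], [0.5, 1.0]])
Sigma = A @ A.T                           # covariance of each X_i

def scaled_mean_deviations(n, n_trials=5_000):
    """Samples of sqrt(n)(Xbar_n - mu) for X = A e, e ~ centered Exp(1)^2."""
    e = rng.exponential(1.0, size=(n_trials, n, 2)) - 1.0   # mean 0, identity cov
    return np.sqrt(n) * (e @ A.T).mean(axis=1)

z = scaled_mean_deviations(200)
print("empirical covariance:\n", np.cov(z.T))     # ~ Sigma
t = np.array([1.0, -1.0])                          # a Cramer-Wold projection
print("Var(t'Z) =", np.var(z @ t), "  t'Sigma t =", t @ Sigma @ t)
```

Every linear projection of the scaled deviations matches the predicted univariate Gaussian variance $t^T\Sigma t$, illustrating the Cramér–Wold reduction.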

**Law of the iterated logarithm (LIL).** A sharpening of the SLLN:

$$\limsup_{n\to\infty}\frac{\bar{X}_n - \mu}{\sigma\sqrt{2\log\log n/n}} = 1 \quad \text{a.s.}$$

The LIL gives the exact (up to constants) fluctuation size of the sample mean — it oscillates on the scale $\sigma\sqrt{2\log\log n/n}$, decaying slightly more slowly than the CLT's $\sigma/\sqrt{n}$ scale.

Worked Example

Example 1: CLT for the Empirical Risk

Consider training with a loss $\ell(f_w(x), y)$ and empirical risk $\hat{L}(w) = \frac{1}{n}\sum_i \ell_i$. Under the CLT:

$$\sqrt{n}(\hat{L}(w) - L(w)) \xrightarrow{d} \mathcal{N}(0, \text{Var}(\ell_i)).$$

This gives a confidence interval for the true risk: $L(w) \in \hat{L}(w) \pm z_{0.025}\,\hat\sigma/\sqrt{n}$ at 95% confidence, where $\hat\sigma^2 = \frac{1}{n}\sum_i(\ell_i - \hat{L})^2$.

**Practical implication:** On 10,000 test examples with average cross-entropy 0.5 and standard deviation 0.4, the 95% CI half-width is $1.96 \cdot 0.4/\sqrt{10000} = 0.0078$. The test loss is reported to 4 decimal places, but only about 2 are statistically meaningful.
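The arithmetic above is a two-liner in practice. A sketch (assuming NumPy; the synthetic Gaussian losses stand in for a real per-example loss vector):

```python
import numpy as np

def risk_confidence_interval(losses, z=1.96):
    """Approximate 95% CI for the true risk from per-example losses, via the CLT."""
    losses = np.asarray(losses)
    half = z * losses.std(ddof=1) / np.sqrt(losses.size)
    return losses.mean() - half, losses.mean() + half

# Synthetic stand-in for the example in the text: n = 10,000, mean 0.5, std 0.4.
rng = np.random.default_rng(7)
losses = rng.normal(0.5, 0.4, size=10_000)
lo, hi = risk_confidence_interval(losses)
print(f"test loss = {losses.mean():.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```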

Example 2: LLN for Monte Carlo Estimation

Monte Carlo estimator: $\hat\mu = \frac{1}{N}\sum_{i=1}^N f(X_i)$ for $X_i \sim P$.

**SLLN**: $\hat\mu \xrightarrow{\text{a.s.}} \mathbb{E}[f(X)] = \mu$ (if $\mathbb{E}[|f(X)|] < \infty$).

**CLT**: $\sqrt{N}(\hat\mu - \mu)/\sigma_f \xrightarrow{d} \mathcal{N}(0,1)$ where $\sigma_f^2 = \text{Var}(f(X))$.

**MSE**: $\mathbb{E}[(\hat\mu - \mu)^2] = \sigma_f^2/N$ — the $O(1/N)$ rate is dimension-independent. This is why Monte Carlo (and by extension stochastic gradient methods) scales to high-dimensional problems, unlike numerical quadrature, whose error degrades exponentially with dimension.
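A quick illustration of dimension independence (assuming NumPy; the product-form integrand is a hypothetical test function with $\mathbb{E}[f] = 1$ in every dimension):

```python
import numpy as np

rng = np.random.default_rng(8)

def mc_estimate(d, N=100_000):
    """MC estimate of E[prod_k (1 + (x_k - 1/2)/2)] = 1 for x ~ Uniform[0,1]^d."""
    x = rng.uniform(size=(N, d))
    return float(np.prod(1 + (x - 0.5) / 2, axis=1).mean())

for d in (1, 10, 50):
    print(f"d={d:3d}  estimate = {mc_estimate(d):.4f}  (true value 1.0)")
# Error is sigma_f / sqrt(N) in every dimension; only Var(f) changes with d,
# not the 1/sqrt(N) rate itself.
```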

Example 3: Berry-Esseen Applied to A/B Testing

An A/B test with $n = 1000$ users per variant and binary conversion (Bernoulli($p$) with $p = 0.05$):

$\rho = \mathbb{E}[|X - p|^3] = p(1-p)^3 + (1-p)p^3 = p(1-p)(1 - 2p + 2p^2) \approx p(1-p)$ for small $p$.

$\rho/\sigma^3 \approx p(1-p)/(p(1-p))^{3/2} = (p(1-p))^{-1/2} = 1/\sqrt{0.05 \cdot 0.95} \approx 4.6$.

Berry–Esseen error bound: $C\rho/(\sigma^3\sqrt{n}) \approx 0.475 \cdot 4.6/31.6 \approx 0.069$.

The normal approximation has CDF error up to ~7% — not negligible for rare events. This is why practitioners often use exact binomial tests or bootstrap for low-conversion-rate A/B tests.
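The exact-vs-approximate comparison is cheap to compute directly (standard library only; the recurrence builds binomial pmf values iteratively):

```python
import math

def binom_cdf(k, n, p):
    """Exact Binomial(n, p) CDF via a running pmf product (fine for n = 1000)."""
    prob = (1 - p) ** n                       # P(S = 0)
    total = prob
    for i in range(1, k + 1):
        prob *= (n - i + 1) / i * p / (1 - p)  # pmf(i) from pmf(i-1)
        total += prob
    return total

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 1000, 0.05
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
for k in (40, 50, 60):
    exact = binom_cdf(k, n, p)
    approx = normal_cdf((k - mu) / sigma)     # a continuity-corrected version would use k + 0.5
    print(f"k={k}: exact={exact:.4f}  normal={approx:.4f}  error={abs(exact - approx):.4f}")
```

The pointwise errors are smaller than the worst-case 0.069 bound but still far from negligible at percent scale, which is the case for exact tests.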

Connections

Where Your Intuition Breaks

The CLT requires finite variance, but the convergence rate (Berry–Esseen: $O(1/\sqrt{n})$) also depends on the third absolute moment $\rho = \mathbb{E}[|X|^3]$. For distributions with heavy tails — power laws, Pareto, or even a moderately heavy-tailed distribution like the $t$-distribution with 3 degrees of freedom — $n = 100$ may not be nearly enough for the Gaussian approximation to be accurate. A common mistake in ML: running an A/B test with 100 samples per variant and applying a $z$-test, not realizing that the outcome metric (e.g., revenue per user) has heavy enough tails that $n = 100$ gives a Gaussian approximation with significant CDF error. The Berry–Esseen bound tells you the worst-case error; you should check whether your metric has a small $\rho/\sigma^3$ before trusting the normal approximation at small sample sizes.

💡Intuition

The CLT works because higher cumulants vanish after normalization. From the CF proof: the cumulant generating function satisfies $K_{S_n}(t) = n K_X(t/\sqrt{n})$ for $S_n = \frac{1}{\sqrt{n}}\sum X_i$, so the $k$-th cumulant of $S_n$ is $n \cdot \kappa_k/(\sqrt{n})^k = \kappa_k/n^{k/2-1}$. For $k \geq 3$ this goes to zero as $n \to \infty$. Only $\kappa_1 = \mu$ and $\kappa_2 = \sigma^2$ survive — which are exactly the cumulants of the Gaussian. The Gaussian is the attractor of the CLT precisely because it has no higher cumulants.
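The cumulant decay is visible in simulation: the skewness (third standardized cumulant) of $Z_n$ shrinks like $\kappa_3/\sqrt{n}$. A sketch assuming NumPy, with Exp(1) summands ($\kappa_3 = 2$; the sum is Gamma($n$), so the skewness of $Z_n$ is exactly $2/\sqrt{n}$):

```python
import numpy as np

rng = np.random.default_rng(9)

def skewness_of_Zn(n, n_trials=100_000):
    """Empirical skewness of Z_n = sqrt(n)(Xbar_n - 1) for X ~ Exp(1)."""
    z = np.sqrt(n) * (rng.exponential(1.0, size=(n_trials, n)).mean(axis=1) - 1.0)
    return float(np.mean(z**3) / np.mean(z**2) ** 1.5)

for n in (4, 16, 64):
    print(f"n={n:3d}  skewness ~ {skewness_of_Zn(n):.3f}   kappa_3/sqrt(n) = {2/np.sqrt(n):.3f}")
```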

💡Intuition

Monte Carlo's $O(1/\sqrt{N})$ rate is dimension-free. Numerical quadrature in $d$ dimensions using $N$ grid points achieves error $O(N^{-r/d})$ for a degree-$r$ rule — exponentially worse in high $d$ (the curse of dimensionality). Monte Carlo achieves $O(1/\sqrt{N})$ by the CLT — completely independent of $d$. This makes MC and its variants (MCMC, importance sampling, SGD) the only practical approaches to integration in high dimensions. The price is the slow $O(1/\sqrt{N})$ rate, whereas quadrature achieves $O(N^{-r})$ in low dimensions.

⚠️Warning

The CLT's Gaussian approximation can be very poor for extreme quantiles. Berry–Esseen bounds the CDF error at $O(1/\sqrt{n})$ — but this is a uniform bound. The relative error in tail probabilities $P(Z_n > t)$ for large $t$ (like $t = 4$, four standard deviations out) can be much worse: the Gaussian underestimates heavy-tailed probabilities. For rare-event simulation (financial risk, safety-critical ML), the CLT's Gaussian approximation fails precisely where it matters most. Large deviation theory gives exponentially sharp estimates for $P(\bar{X}_n - \mu > a)$ at fixed $a$ as $n \to \infty$.
