Limit Theorems: LLN, CLT & Berry-Esseen
The Law of Large Numbers and Central Limit Theorem are the two pillars of classical probability. The LLN says sample means concentrate around the true mean; the CLT says the fluctuations around that mean are Gaussian, regardless of the original distribution. The Berry-Esseen theorem quantifies how fast the Gaussian approximation kicks in. Together they justify essentially every estimator used in statistical ML.
Concepts
Roll one die: you might get a 6. Roll ten dice and average them: the distribution of averages looks roughly bell-shaped. Roll a thousand dice: the bell shape is extremely tight around 3.5. This is the Central Limit Theorem in action: no matter what the underlying distribution (fair die, biased coin, anything with finite variance), the average of enough independent samples is approximately Gaussian. This is why confidence intervals work, why gradient estimates in SGD are interpretable, and why hypothesis tests based on the Gaussian distribution are valid even for non-Gaussian data.
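The dice experiment above can be run directly. A minimal numpy sketch (seed and trial counts are arbitrary choices): averaging more dice keeps the mean near 3.5 while the spread shrinks like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def die_averages(n_dice, n_trials=100_000):
    """Average n_dice fair-die rolls, repeated n_trials times."""
    rolls = rng.integers(1, 7, size=(n_trials, n_dice))
    return rolls.mean(axis=1)

for n in (1, 10, 1000):
    avgs = die_averages(n)
    print(f"n={n:5d}  mean={avgs.mean():.3f}  std={avgs.std():.3f}")
```

A single die has standard deviation $\sqrt{35/12} \approx 1.71$; averaging 1000 dice shrinks this by $\sqrt{1000}$, to about 0.054.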
The Law of Large Numbers
Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = \mu$.
Weak Law of Large Numbers (WLLN). $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu$, i.e. $\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) \to 0$ for every $\varepsilon > 0$.
Proof via Chebyshev (requires $\sigma^2 = \operatorname{Var}(X_i) < \infty$): $\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) \le \operatorname{Var}(\bar{X}_n)/\varepsilon^2 = \sigma^2/(n\varepsilon^2) \to 0$.
Strong Law of Large Numbers (SLLN). $\bar{X}_n \to \mu$ almost surely: $\mathbb{P}\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1$.
Proof sketch (requires only $\mathbb{E}|X_1| < \infty$). Truncate: $Y_i = X_i \mathbf{1}\{|X_i| \le i\}$. Show that $\frac{1}{n}\sum_{i \le n}(X_i - Y_i) \to 0$ a.s. using Borel-Cantelli. Show $\frac{1}{n}\sum_{i \le n}(Y_i - \mathbb{E}Y_i) \to 0$ a.s. using the Kronecker lemma and convergence of $\sum_i \operatorname{Var}(Y_i)/i^2$. The full proof uses a 4th-moment bound along geometric subsequences and interpolation between them.
The a.s. vs in-probability gap. SLLN gives $\mathbb{P}(\lim_n \bar{X}_n = \mu) = 1$ — for almost every realization of the sequence, the sample mean converges. WLLN only gives $\lim_n \mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) = 0$ for each fixed $\varepsilon$ — the exceptional event may differ for each $n$, so along a single realization convergence could still fail.
Glivenko-Cantelli. The empirical CDF $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$ satisfies:

$$\sup_{x \in \mathbb{R}} \left|\hat{F}_n(x) - F(x)\right| \xrightarrow{a.s.} 0.$$
This uniform SLLN justifies plug-in estimators and empirical risk minimization: the empirical distribution converges to the true distribution uniformly over all thresholds.
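The uniform convergence is easy to see numerically. A sketch (assumed setup: standard normal data, sup-distance approximated on a fine grid): the gap $\sup_x |\hat{F}_n(x) - F(x)|$ shrinks as $n$ grows.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def sup_cdf_gap(n):
    """Sup-distance between the empirical CDF of n N(0,1) draws and Phi,
    approximated on a fine grid."""
    x = np.sort(rng.standard_normal(n))
    grid = np.linspace(-4.0, 4.0, 801)
    ecdf = np.searchsorted(x, grid, side="right") / n   # fraction of sample <= g
    phi = np.array([0.5 * (1 + math.erf(g / math.sqrt(2))) for g in grid])
    return float(np.max(np.abs(ecdf - phi)))

for n in (100, 10_000):
    print(f"n={n:6d}  sup|F_n - F| ~ {sup_cdf_gap(n):.4f}")
```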
The Central Limit Theorem
Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_i] = \mu$ and $\operatorname{Var}(X_i) = \sigma^2 \in (0, \infty)$. Define $S_n = \sum_{i=1}^n X_i$.
CLT: $\dfrac{S_n - n\mu}{\sigma\sqrt{n}} = \sqrt{n}\,\dfrac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)$.
Proof via characteristic functions. WLOG $\mu = 0$, $\sigma^2 = 1$. Then $Z_n = S_n/\sqrt{n}$ has CF:

$$\varphi_{Z_n}(t) = \left[\varphi_X(t/\sqrt{n})\right]^n.$$
Taylor expand at 0: $\varphi_X(s) = 1 - \frac{s^2}{2} + o(s^2)$.
Hence $\varphi_{Z_n}(t) = \left(1 - \frac{t^2}{2n} + o(1/n)\right)^n \to e^{-t^2/2}$ as $n \to \infty$.
Since $e^{-t^2/2}$ is the CF of $\mathcal{N}(0,1)$, by Lévy's continuity theorem: $Z_n \xrightarrow{d} \mathcal{N}(0,1)$. The characteristic function is the right tool here because it always exists (unlike the MGF, which can fail to exist for heavy-tailed distributions), and convergence of CFs to a continuous limit implies convergence in distribution (Lévy's theorem). The Gaussian CF appears naturally as the limit of $\left(1 - \frac{t^2}{2n}\right)^n$ because the Gaussian is the unique distribution whose CF is stable under this repeated product — the fixed point of the CLT map.
Practical form. For a statistical estimator $\hat{\theta}_n$:

$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma_\theta^2),$$

where $\sigma_\theta^2 = \sigma^2$ for the sample mean. This gives approximate confidence intervals: $\theta \in \hat{\theta}_n \pm z_{\alpha/2}\,\sigma_\theta/\sqrt{n}$ with probability $\approx 1 - \alpha$.
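A coverage check makes the "approximate" concrete. In this sketch (assumed setup: Exp(1) data, $n = 200$, 2000 repetitions), the nominal 95% CLT interval covers the true mean close to 95% of the time even though the data are strongly non-Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_ci(x, z=1.96):
    """CLT interval for E[X]: xbar +/- z * s / sqrt(n)."""
    half = z * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

# Coverage check on a decidedly non-Gaussian distribution: Exp(1), true mean 1.
trials, hits = 2_000, 0
for _ in range(trials):
    lo, hi = mean_ci(rng.exponential(1.0, size=200))
    hits += (lo <= 1.0 <= hi)
print(f"empirical coverage of the nominal 95% interval: {hits / trials:.3f}")
```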
Berry-Esseen Theorem
The CLT tells us convergence happens but not how fast. The Berry-Esseen theorem quantifies the rate:
Theorem. For iid $X_i$ with $\mathbb{E}[X_i] = \mu$, $\operatorname{Var}(X_i) = \sigma^2$, and $\rho = \mathbb{E}|X_i - \mu|^3 < \infty$:

$$\sup_{x \in \mathbb{R}} \left|\mathbb{P}\!\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\right) - \Phi(x)\right| \le \frac{C\rho}{\sigma^3\sqrt{n}},$$
where $\Phi$ is the standard normal CDF and $C \le 0.4748$ (the best known constant in the iid case).
Interpretation. The CDF error decays as $O(n^{-1/2})$. For moderate $n$ (say $n = 100$) and a symmetric distribution ($\rho/\sigma^3$ moderate, around 1), the error is at most roughly $0.4748/\sqrt{100} \approx 0.05$ — the Gaussian approximation is already quite good.
When Berry-Esseen is tight. For Bernoulli($p$) with small $p$: $\sigma^2 = p(1-p)$ and $\rho/\sigma^3 \approx 1/\sqrt{p}$ for small $p$. So the bound is $O(1/\sqrt{np})$ — the Gaussian approximation is poor when the expected number of successes $np$ is small. This is why the Poisson approximation is preferred for rare events.
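Both the bound and the actual error are computable for the binomial case. A sketch (assumed parameters: $n = 200$, $C = 0.4748$; the "actual" error is evaluated at the integer jump points of the binomial CDF, which lower-bounds the true sup):

```python
import math

def berry_esseen_bound(p, n, C=0.4748):
    """Berry-Esseen bound C * rho / (sigma^3 * sqrt(n)) for Bernoulli(p)."""
    sigma2 = p * (1 - p)
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)   # E|X - p|^3
    return C * rho / (sigma2 ** 1.5 * math.sqrt(n))

def actual_max_cdf_error(p, n):
    """max_k |P(Bin(n,p) <= k) - Phi((k - np)/sd)|, via the exact binomial CDF."""
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        phi = 0.5 * (1 + math.erf((k - mu) / (sd * math.sqrt(2))))
        worst = max(worst, abs(cdf - phi))
    return worst

for p in (0.5, 0.01):
    print(f"p={p}: bound={berry_esseen_bound(p, 200):.3f}, "
          f"actual={actual_max_cdf_error(p, 200):.3f}")
```

At $p = 0.5$ the bound is small and the approximation good; at $p = 0.01$ (only $np = 2$ expected successes) the bound blows up, matching the rare-event discussion above.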
Lindeberg-Feller CLT (Non-iid Case)
Let $X_1, \ldots, X_n$ be independent (but not necessarily identically distributed) with $\mathbb{E}[X_i] = \mu_i$, $\operatorname{Var}(X_i) = \sigma_i^2$, and $s_n^2 = \sum_{i=1}^n \sigma_i^2$.
Lindeberg condition: for all $\varepsilon > 0$:

$$\frac{1}{s_n^2}\sum_{i=1}^n \mathbb{E}\left[(X_i - \mu_i)^2\,\mathbf{1}\{|X_i - \mu_i| > \varepsilon s_n\}\right] \to 0.$$
Lindeberg-Feller CLT. If the Lindeberg condition holds:

$$\frac{1}{s_n}\sum_{i=1}^n (X_i - \mu_i) \xrightarrow{d} \mathcal{N}(0, 1).$$
Intuition: The Lindeberg condition says no single term dominates the sum (no single $X_i$ contributes a non-vanishing fraction of the total variance $s_n^2$). This is the correct generalization of the iid assumption to non-iid sums.
ML use: SGD iterates, mini-batch gradients with varying per-sample loss distributions, heterogeneous federated learning — all involve sums of non-iid terms where Lindeberg-Feller justifies Gaussian approximations.
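A simulation sketch of the non-iid case (the growing-spread schedule for the uniform terms is an assumption for illustration): each term has a different variance, yet the standardized sum is very close to $\mathcal{N}(0,1)$ because no single term dominates $s_n^2$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Independent but NOT identically distributed: X_i ~ Uniform(-a_i, a_i) with a
# slowly growing spread a_i. No term carries a non-vanishing share of s_n^2,
# so the Lindeberg condition holds.
n = 500
a = 1.0 + 0.1 * np.sqrt(np.arange(1, n + 1))   # half-widths (assumed schedule)
s_n = np.sqrt((a**2 / 3.0).sum())              # s_n^2 = sum Var(X_i), Var = a_i^2/3

trials = 20_000
X = rng.uniform(-a, a, size=(trials, n))       # column i ~ Uniform(-a_i, a_i)
Z = X.sum(axis=1) / s_n                        # Lindeberg-Feller: Z ~ N(0, 1)

print(f"mean={Z.mean():.3f}  std={Z.std():.3f}  "
      f"P(|Z| < 1.96)={np.mean(np.abs(Z) < 1.96):.3f}")
```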
Multivariate CLT and Cramér-Wold Device
Multivariate CLT. For iid $X_i \in \mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma).$$
Cramér-Wold device. $Y_n \xrightarrow{d} Y$ in $\mathbb{R}^d$ iff $a^\top Y_n \xrightarrow{d} a^\top Y$ for all $a \in \mathbb{R}^d$. So multivariate convergence reduces to univariate convergence of all linear projections.
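Both statements can be checked numerically. In this sketch the 2-d vectors are built from exponentials (a construction chosen for illustration: $X = (E_1, E_1 + E_2)$ with $E_1, E_2 \sim \mathrm{Exp}(1)$, giving mean $(1,2)$ and covariance $\Sigma = \begin{pmatrix}1&1\\1&2\end{pmatrix}$); the scaled mean has covariance close to $\Sigma$, and a projection has the variance Cramér-Wold predicts.

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 1.0], [1.0, 2.0]])

def draw(n):
    """n iid copies of X = (E1, E1 + E2), E1, E2 ~ Exp(1)."""
    e = rng.exponential(1.0, size=(n, 2))
    return np.column_stack([e[:, 0], e[:, 0] + e[:, 1]])

n, trials = 400, 10_000
S = np.array([np.sqrt(n) * (draw(n).mean(axis=0) - mu) for _ in range(trials)])
print("empirical covariance of sqrt(n)(xbar - mu):\n", np.cov(S.T).round(2))

# Cramér-Wold: every projection a^T S should be approximately N(0, a^T Sigma a).
a = np.array([0.6, -0.8])
proj = S @ a
print(f"projection std: {proj.std():.3f}  "
      f"vs sqrt(a^T Sigma a) = {np.sqrt(a @ Sigma @ a):.3f}")
```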
Law of the iterated logarithm (LIL). Sharpening of the SLLN:

$$\limsup_{n\to\infty} \frac{S_n - n\mu}{\sqrt{2n\log\log n}} = \sigma \quad \text{a.s.} \qquad \text{(and the } \liminf \text{ is } -\sigma\text{)}.$$

The LIL gives the exact (up to constants) fluctuation size of the sample mean — it oscillates on the scale $\sigma\sqrt{2\log\log n / n}$, which shrinks slightly more slowly than the CLT's $\sigma/\sqrt{n}$ scale.
Worked Example
Example 1: CLT for the Empirical Risk
Consider training with a loss $\ell$ and empirical risk $\hat{R}_n = \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)$, with true risk $R = \mathbb{E}[\ell(f(X), Y)]$. Under the CLT:

$$\sqrt{n}\,(\hat{R}_n - R) \xrightarrow{d} \mathcal{N}(0, \sigma_\ell^2), \qquad \sigma_\ell^2 = \operatorname{Var}(\ell(f(X), Y)).$$
This gives a confidence interval for the true risk: $R \in \hat{R}_n \pm 1.96\,\hat{\sigma}_\ell/\sqrt{n}$ at 95% confidence, where $\hat{\sigma}_\ell$ is the sample standard deviation of the per-example losses.
Practical implication: On 10,000 test examples with average cross-entropy 0.5 and std dev 0.4, the 95% CI half-width is $1.96 \times 0.4/\sqrt{10{,}000} \approx 0.008$. The test loss is reported to 4 decimal places, but only about 2 are statistically meaningful.
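The arithmetic above, using the numbers from the text:

```python
import math

# Reported test loss: mean 0.5, std 0.4, over n = 10,000 examples (from the text).
n, mean_loss, std_loss = 10_000, 0.5, 0.4
half_width = 1.96 * std_loss / math.sqrt(n)
print(f"95% CI: {mean_loss:.4f} +/- {half_width:.4f}")   # +/- ~0.008
```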
Example 2: LLN for Monte Carlo Estimation
Monte Carlo estimator: $\hat{I}_n = \frac{1}{n}\sum_{i=1}^n f(X_i)$ for $I = \mathbb{E}[f(X)]$, with $X_i \sim P$ iid.
SLLN: $\hat{I}_n \xrightarrow{a.s.} I$ (if $\mathbb{E}|f(X)| < \infty$).
CLT: $\sqrt{n}\,(\hat{I}_n - I) \xrightarrow{d} \mathcal{N}(0, \sigma_f^2)$ where $\sigma_f^2 = \operatorname{Var}(f(X))$.
MSE: $\mathbb{E}[(\hat{I}_n - I)^2] = \sigma_f^2/n$ — the $O(n^{-1/2})$ RMSE rate is dimension-independent! This is why Monte Carlo (and by extension stochastic gradient methods) scales to high-dimensional problems — unlike numerical quadrature, whose error scales exponentially with dimension.
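A sketch of the rate (the integrand $\int_{[0,1]^d} \sum_i x_i^2\,dx = d/3$ and $d = 10$ are assumed choices): quadrupling the sample size should roughly halve the RMSE, with no reference to the dimension.

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo for I = integral over [0,1]^d of (x_1^2 + ... + x_d^2) = d/3, d = 10.
d = 10
I_true = d / 3.0

def mc_rmse(n, trials=400):
    """Empirical RMSE of the n-sample MC estimate over many independent runs."""
    ests = np.array([(rng.uniform(size=(n, d)) ** 2).sum(axis=1).mean()
                     for _ in range(trials)])
    return np.sqrt(np.mean((ests - I_true) ** 2))

r1, r2 = mc_rmse(1_000), mc_rmse(4_000)
print(f"RMSE(n=1000)={r1:.4f}  RMSE(n=4000)={r2:.4f}  ratio={r1 / r2:.2f}")
```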
Example 3: Berry-Esseen Applied to A/B Testing
An A/B test with $n$ users per variant and binary conversion (Bernoulli($p$) with $p \ll 1$):

$\sigma^2 = p(1-p) \approx p$ and $\rho/\sigma^3 \approx 1/\sqrt{p}$ for small $p$.

So the Berry-Esseen bound scales as $1/\sqrt{np}$: it is controlled by the expected number of conversions $np$, not the raw sample size.

Berry-Esseen error bound: $\sup_x |F_n(x) - \Phi(x)| \le 0.4748/\sqrt{np}$ (approximately, for small $p$); with on the order of $np \approx 50$ expected conversions this is about $0.07$.
The normal approximation has CDF error up to ~7% — not negligible for rare events. This is why practitioners often use exact binomial tests or bootstrap for low-conversion-rate A/B tests.
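A direct comparison of the exact binomial CDF and its normal approximation in the rare-event regime (the specific values $n = 5000$, $p = 0.001$ are illustrative, not from the text):

```python
import math

def binom_cdf(k, n, p):
    """Exact P(Bin(n, p) <= k), summed in log-space for numerical stability."""
    return sum(math.exp(math.lgamma(n + 1) - math.lgamma(j + 1)
                        - math.lgamma(n - j + 1)
                        + j * math.log(p) + (n - j) * math.log(1 - p))
               for j in range(k + 1))

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Rare conversions: n = 5000 users, p = 0.001, i.e. only ~5 expected conversions.
n, p = 5000, 0.001
mu, sd = n * p, math.sqrt(n * p * (1 - p))
for k in (2, 5, 8):
    print(f"P(X <= {k}): exact = {binom_cdf(k, n, p):.3f}, "
          f"normal approx = {normal_cdf((k - mu) / sd):.3f}")
```

At the mean ($k = 5$) the normal approximation is off by more than 0.1 in CDF value, which is exactly the failure mode an exact binomial test avoids.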
Connections
Where Your Intuition Breaks
The CLT requires finite variance, but the convergence rate (Berry-Esseen: $O(\rho/(\sigma^3\sqrt{n}))$) also depends on the third absolute moment $\rho = \mathbb{E}|X - \mu|^3$. For distributions with heavy tails — power laws, Pareto, or even a moderately heavy-tailed distribution like the $t$-distribution with 3 degrees of freedom (finite variance but infinite third absolute moment, so Berry-Esseen gives no rate at all) — $n$ in the hundreds may not be nearly enough for the Gaussian approximation to be accurate. A common mistake in ML: running an A/B test with 100 samples per variant and applying a $t$-test, not realizing that the outcome metric (e.g., revenue per user) has heavy enough tails that $n = 100$ gives a Gaussian approximation with significant CDF error. The Berry-Esseen bound tells you the worst-case error; you should check whether your metric has a small $\rho/\sigma^3$ before trusting the normal approximation at small sample sizes.
The CLT works because higher cumulants vanish after normalization. From the CF proof: the cumulant generating function satisfies $\log \varphi_{Z_n}(t) = n \log \varphi_X(t/\sqrt{n})$. At order $k$: the $k$-th cumulant of $Z_n$ is $\kappa_k(Z_n) = n \cdot \kappa_k(X)/n^{k/2} = \kappa_k(X)\,n^{1 - k/2}$. For $k \ge 3$: this goes to zero as $n \to \infty$. Only $\kappa_1$ and $\kappa_2$ survive — which are exactly the cumulants of the Gaussian. The Gaussian is the attractor of the CLT precisely because it has no higher cumulants.
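The scaling $\kappa_k(Z_n) = \kappa_k(X)\,n^{1-k/2}$ is easy to tabulate for a concrete distribution. For $\mathrm{Exp}(1)$ the cumulants are $\kappa_k = (k-1)!$, so for the standardized sum the skewness-cumulant decays like $n^{-1/2}$ and the fourth cumulant like $n^{-1}$:

```python
import math

# Cumulants of Exp(1): kappa_k = (k-1)!.  For Z_n = (S_n - n) / sqrt(n),
# kappa_k(Z_n) = kappa_k(X) * n^(1 - k/2): higher cumulants die off with n.
def kappa_Zn(k, n):
    return math.factorial(k - 1) * n ** (1 - k / 2)

for n in (10, 100, 10_000):
    print(f"n={n:6d}  kappa3(Z_n)={kappa_Zn(3, n):.4f}  "
          f"kappa4(Z_n)={kappa_Zn(4, n):.6f}")
```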
Monte Carlo's rate is dimension-free. Numerical quadrature in $d$ dimensions using $N$ grid points achieves error $O(N^{-r/d})$ for a degree-$r$ rule — exponentially worse in high $d$ (the curse of dimensionality). Monte Carlo achieves $O(N^{-1/2})$ by the CLT — completely independent of $d$. This makes MC and its variants (MCMC, importance sampling, SGD) the only practical approaches to integration in high dimensions. The price is the slow $O(N^{-1/2})$ rate, whereas quadrature achieves much faster rates in low dimensions.
The CLT's Gaussian approximation can be very poor for extreme quantiles. Berry-Esseen bounds the CDF error at $O(n^{-1/2})$ — but this is a uniform (absolute) bound. The relative error in tail probabilities $\mathbb{P}(Z_n > x)$ for large $x$ can be much worse: the Gaussian underestimates heavy-tailed probabilities. For rare event simulation (financial risk, safety-critical ML), the CLT's Gaussian approximation fails precisely where it matters most. Large deviation theory gives exponentially accurate bounds for tail probabilities $\mathbb{P}(\bar{X}_n - \mu > \varepsilon)$ at fixed $\varepsilon$ as $n \to \infty$.
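A simulation sketch of the tail failure (assumed setup: sums of $n = 20$ right-skewed $\mathrm{Exp}(1)$ variables; sample sizes arbitrary): the empirical probability of a "3-sigma" excursion is several times larger than the Gaussian tail would suggest.

```python
import math
import numpy as np

rng = np.random.default_rng(6)

# Right tail of a standardized sum of n = 20 Exp(1) variables (skewed right)
# versus the Gaussian tail the CLT would suggest.
n, trials = 20, 200_000
S = rng.exponential(1.0, size=(trials, n)).sum(axis=1)
Z = (S - n) / math.sqrt(n)

empirical_tail = np.mean(Z > 3.0)
gaussian_tail = 0.5 * math.erfc(3.0 / math.sqrt(2))   # 1 - Phi(3) ~ 0.00135
print(f"P(Z_n > 3): empirical ~ {empirical_tail:.4f}, Gaussian = {gaussian_tail:.5f}")
```

The uniform CDF error here is small, but the *ratio* of the two tail probabilities is large, which is what matters when the tail itself is the quantity of interest.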