Modes of Convergence: Almost Sure, In Probability, Lp & In Distribution
There are four distinct notions of convergence for sequences of random variables. Understanding their relationships — which mode implies which — is essential for rigorously analyzing stochastic algorithms, proving sample complexity bounds, and understanding why the law of large numbers, central limit theorem, and delta method work the way they do.
Concepts
"The algorithm converges" — but what does that mean? In classical analysis, convergence is unambiguous. For sequences of random variables, there are four distinct and inequivalent answers: convergence on every single sample path (almost sure), convergence in the probability of deviations (in probability), convergence of expected errors (), or convergence of the whole distribution (in distribution). These are not interchangeable. The SLLN gives almost sure convergence; the CLT gives convergence in distribution. Understanding the hierarchy — which implies which — tells you when you can treat a stochastic limit like a deterministic one and when you cannot.
The Four Modes
Let $X_1, X_2, \dots$ and $X$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$.
Almost sure (a.s.) convergence: $X_n \xrightarrow{\text{a.s.}} X$ if $P\big(\{\omega : X_n(\omega) \to X(\omega)\}\big) = 1$.
The sequence converges pointwise on a set of probability 1. The exceptional set (of measure 0) is harmless.
Convergence in probability: $X_n \xrightarrow{P} X$ if for every $\varepsilon > 0$, $P(|X_n - X| > \varepsilon) \to 0$ as $n \to \infty$.
For each $\varepsilon$, the probability of a large deviation goes to zero.
$L^p$ convergence ($p \ge 1$): $X_n \xrightarrow{L^p} X$ if $E|X_n - X|^p \to 0$.
For $p = 2$: convergence in mean square (MSE $\to 0$).
Convergence in distribution (weak convergence): $X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at every continuity point $x$ of $F_X$.
Equivalently: $E[f(X_n)] \to E[f(X)]$ for all bounded continuous $f$.
Implications
The hierarchy a.s. $\Rightarrow$ in probability $\Rightarrow$ in distribution exists because each step strips away information: almost sure convergence constrains every sample path, in-probability convergence tolerates "occasionally bad" sample paths, and in-distribution convergence discards the coupling entirely (only the marginal distribution matters). $L^p$ convergence also implies convergence in probability, by Markov's inequality. Each relaxation is genuine: the implications cannot be reversed, as the counterexamples below show.
None of the converses holds in general. Notable exceptions and partial converses:
- $X_n \xrightarrow{P} X$ does not imply $X_n \to X$ a.s. in general (but some subsequence converges a.s.)
- $X_n \xrightarrow{P} X$ does not imply $X_n \xrightarrow{L^p} X$ (need uniform integrability)
- In probability + monotone $\Rightarrow$ a.s.; in probability alone gives a.s. convergence only along a subsequence
- $X_n \xrightarrow{d} c$ for a constant $c$ $\Rightarrow$ $X_n \xrightarrow{P} c$
Counterexamples showing the limits are strict:
In probability but not a.s.: Let $\Omega = [0,1]$ with Lebesgue measure. The "typewriter sequence": $X_1 = \mathbf{1}_{[0,1]}$, $X_2 = \mathbf{1}_{[0,1/2]}$, $X_3 = \mathbf{1}_{[1/2,1]}$, $X_4 = \mathbf{1}_{[0,1/4]}$, $X_5 = \mathbf{1}_{[1/4,1/2]}$, $\dots$ For each $\omega$, $X_n(\omega)$ oscillates between 0 and 1 infinitely often — no pointwise convergence. But $P(|X_n| > \varepsilon) \to 0$ for $\varepsilon \in (0,1)$, since the indicator intervals shrink.
In distribution but not in probability: Let $X \sim N(0,1)$ and $X_n = -X$ for all $n$. Then $X_n \xrightarrow{d} X$ (same distribution), but $P(|X_n - X| > \varepsilon) = P(2|X| > \varepsilon) \not\to 0$.
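The typewriter construction is easy to probe numerically. The sketch below (an illustration, not part of the original construction) evaluates the sequence at one fixed $\omega$: the probability $P(X_n = 1)$ shrinks to zero, yet the fixed path keeps hitting 1 in every dyadic block, so it never converges pointwise.

```python
import numpy as np

rng = np.random.default_rng(0)

def typewriter(omega, n):
    """X_n = indicator of the n-th dyadic interval (the 'typewriter' sequence).
    Block k (k = 0, 1, 2, ...) holds the 2^k intervals of length 2^-k."""
    k = int(np.floor(np.log2(n)))        # block index: n in [2^k, 2^(k+1))
    j = n - 2**k                         # position within block
    lo, hi = j / 2**k, (j + 1) / 2**k
    return 1.0 if lo <= omega < hi else 0.0

omega = rng.uniform()                    # one fixed sample point
xs = [typewriter(omega, n) for n in range(1, 2**12)]

# In probability: P(X_n = 1) = 2^-k -> 0, so late X_n are mostly 0 ...
print("fraction of ones in last block:", np.mean(xs[2**11 - 1:]))
# ... yet the path hits 1 exactly once in EVERY block: no pointwise limit.
for k in range(12):
    block = xs[2**k - 1 : 2**(k + 1) - 1]
    assert sum(block) == 1               # omega lies in one dyadic interval per level
print("path hits 1 in every dyadic block: no a.s. convergence")
```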
Uniform Integrability
A family $\{X_n\}$ is uniformly integrable (UI) if:
$\lim_{M \to \infty} \sup_n E\big[|X_n| \mathbf{1}_{\{|X_n| > M\}}\big] = 0.$
Key theorem. $X_n \xrightarrow{L^1} X$ iff $X_n \xrightarrow{P} X$ and $\{X_n\}$ is UI.
UI holds if: (a) $\{X_n\}$ is dominated by an integrable $Y$: $|X_n| \le Y$ a.s. with $E[Y] < \infty$; (b) $\{X_n\}$ is bounded in $L^{1+\delta}$ for some $\delta > 0$: $\sup_n E|X_n|^{1+\delta} < \infty$.
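A standard example of UI failure (an illustration consistent with the theorem above, not from the original text) is $X_n = n\,\mathbf{1}_{[0,1/n]}$ on $[0,1]$: it converges to 0 in probability but not in $L^1$, because the mass escapes along a shrinking set. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(size=200_000)            # omega ~ Uniform[0, 1]

for n in [10, 100, 1000]:
    Xn = n * (U < 1.0 / n)               # X_n = n * 1_{[0, 1/n]}
    # In probability: P(|X_n| > eps) = 1/n -> 0 ...
    # ... but E[X_n] = 1 for every n, so X_n does NOT converge to 0 in L^1.
    # UI fails: E[X_n * 1{X_n > M}] = 1 for all n > M, so the sup never -> 0.
    print(f"n={n:5d}  P(X_n > 0) ~ {np.mean(Xn > 0):.4f}  E[X_n] ~ {np.mean(Xn):.3f}")
```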
Slutsky's Theorem and Continuous Mapping
Continuous mapping theorem. If $X_n \xrightarrow{d} X$ and $g$ is continuous on a set $C$ with $P(X \in C) = 1$, then $g(X_n) \xrightarrow{d} g(X)$. The same statement holds for convergence in probability and a.s. convergence.
Slutsky's theorem. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (constant), then:
- $X_n + Y_n \xrightarrow{d} X + c$
- $Y_n X_n \xrightarrow{d} cX$
- $X_n / Y_n \xrightarrow{d} X / c$ (if $c \ne 0$)
Application (t-statistic). For iid $X_i$ with mean $\mu$, variance $\sigma^2$: $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1)$ (CLT). Since $S_n \xrightarrow{P} \sigma$ (sample std, LLN + CMT), Slutsky gives $\sqrt{n}(\bar{X}_n - \mu)/S_n \xrightarrow{d} N(0,1)$ — the basis of the $t$-test.
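The Slutsky argument can be sanity-checked by simulation. The sketch below (illustrative; the Exp(1) population and sample size are arbitrary choices) draws many samples from a skewed distribution and checks that the studentized mean has roughly standard normal tails:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 50_000
mu = 1.0                                 # Exp(1) has mean 1: a non-Gaussian parent

X = rng.exponential(scale=1.0, size=(reps, n))
xbar = X.mean(axis=1)
s = X.std(axis=1, ddof=1)                # sample std: S_n ->P sigma by LLN + CMT
t = np.sqrt(n) * (xbar - mu) / s         # Slutsky: t ->d N(0, 1)

# Compare tail mass with the standard normal value P(|Z| > 1.96) = 0.05
print("P(|t| > 1.96) ~", np.mean(np.abs(t) > 1.96))
```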
The Delta Method
Delta method. If $\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\theta$:
$\sqrt{n}\big(g(X_n) - g(\theta)\big) \xrightarrow{d} N\big(0,\, g'(\theta)^2 \sigma^2\big).$
Proof sketch. By Taylor expansion: $g(X_n) = g(\theta) + g'(\theta)(X_n - \theta) + o_P(n^{-1/2})$, so $\sqrt{n}\big(g(X_n) - g(\theta)\big) = g'(\theta)\,\sqrt{n}(X_n - \theta) + o_P(1)$. Apply Slutsky.
Multivariate delta method. If $\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \Sigma)$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $\theta$:
$\sqrt{n}\big(g(X_n) - g(\theta)\big) \xrightarrow{d} N\big(0,\, J_g(\theta)\, \Sigma\, J_g(\theta)^\top\big),$
where $J_g(\theta)$ is the Jacobian of $g$ at $\theta$.
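A quick numeric check of the univariate statement (an illustration with arbitrary choices $g(x) = x^2$, $\theta = 2$, $\sigma = 1$): the variance of $\sqrt{n}\big(g(\bar{X}_n) - g(\theta)\big)$ should be close to $g'(\theta)^2 \sigma^2 = 16$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2_000, 100_000
theta, sigma = 2.0, 1.0
g = lambda x: x**2                            # toy smooth map; g'(theta) = 2*theta = 4

# For Gaussian data, xbar ~ N(theta, sigma^2/n) exactly, so sample it directly.
xbar = rng.normal(theta, sigma / np.sqrt(n), size=reps)
Z = np.sqrt(n) * (g(xbar) - g(theta))         # delta method: ->d N(0, g'(theta)^2 sigma^2)

print("empirical var :", round(Z.var(), 3))
print("predicted var :", (2 * theta)**2 * sigma**2)
```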
Borel-Cantelli Lemmas
Borel-Cantelli I. If $\sum_n P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$ ("infinitely often" events are negligible).
Borel-Cantelli II. If the events $A_n$ are independent and $\sum_n P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.
Use in a.s. convergence. To show $X_n \to X$ a.s.: for each $\varepsilon > 0$, let $A_n = \{|X_n - X| > \varepsilon\}$. If $\sum_n P(A_n) < \infty$ for every $\varepsilon > 0$, then by BC-I, $P(|X_n - X| > \varepsilon \text{ i.o.}) = 0$, proving a.s. convergence.
Worked Example
Example 1: Diagnosing a.s. vs in-probability
Let $X_n \sim \text{Bernoulli}(1/n)$ independently. Does $X_n \to 0$?
In probability: $P(|X_n| > \varepsilon) = P(X_n = 1) = 1/n \to 0$. Yes, $X_n \xrightarrow{P} 0$.
Almost surely: $\sum_n P(X_n = 1) = \sum_n 1/n = \infty$. By BC-II (independence), $P(X_n = 1 \text{ i.o.}) = 1$. So $X_n \not\to 0$ a.s. — almost surely the sequence hits 1 infinitely often.
Now change: $X_n \sim \text{Bernoulli}(1/n^2)$ independently. Then $\sum_n 1/n^2 = \pi^2/6 < \infty$. By BC-I: $P(X_n = 1 \text{ i.o.}) = 0$, so $X_n \to 0$ a.s.
Lesson: a.s. convergence requires summable probabilities of exceedance; in-probability convergence only requires they go to zero.
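A simulation (with illustrative cutoffs) makes Example 1 concrete: with $p_n = 1/n$ almost every path keeps exceeding long after any fixed cutoff, while with $p_n = 1/n^2$ late exceedances essentially never happen.

```python
import numpy as np

rng = np.random.default_rng(4)
paths, N, burn = 300, 20_000, 500
k = np.arange(1, N + 1)

results = {}
for label, p in [("1/n", 1.0 / k), ("1/n^2", 1.0 / k**2)]:
    X = rng.random((paths, N)) < p             # X_n ~ Bernoulli(p_n), independent in n
    hit_late = X[:, burn:].any(axis=1)         # does the path hit 1 after n = burn?
    results[label] = hit_late.mean()
    print(f"p_n = {label:6s} fraction of paths with X_n = 1 after n={burn}:",
          round(results[label], 3))

# BC-II: sum 1/n diverges  -> almost every path hits 1 infinitely often.
# BC-I:  sum 1/n^2 < inf   -> almost every path is eventually 0 forever.
```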
Example 2: Delta Method for Log-Odds
Let $\hat{p}_n$ be the sample proportion of successes ($X_i \sim \text{Bernoulli}(p)$ iid). The log-odds: $g(p) = \log\frac{p}{1-p}$, with $g'(p) = \frac{1}{p(1-p)}$.
CLT: $\sqrt{n}(\hat{p}_n - p) \xrightarrow{d} N\big(0,\, p(1-p)\big)$.
Delta method: $\sqrt{n}\big(g(\hat{p}_n) - g(p)\big) \xrightarrow{d} N\big(0,\, g'(p)^2\, p(1-p)\big) = N\big(0,\, \tfrac{1}{p(1-p)}\big)$.
This gives a CLT for the log-odds estimate with asymptotic variance $\frac{1}{n\,p(1-p)}$ — the basis for confidence intervals in logistic regression output.
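The asymptotic variance can be verified by simulation (illustrative choices $p = 0.3$, $n = 5000$): the sample variance of $\sqrt{n}\big(g(\hat{p}_n) - g(p)\big)$ should be close to $1/(p(1-p)) \approx 4.76$.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 0.3, 5_000, 50_000

# Sample proportions; endpoints 0 and 1 have negligible probability at this n.
phat = rng.binomial(n, p, size=reps) / n
logit = lambda q: np.log(q / (1 - q))          # g(p) = log-odds

Z = np.sqrt(n) * (logit(phat) - logit(p))      # delta method: ->d N(0, 1/(p(1-p)))
print("empirical var :", round(Z.var(), 3))
print("predicted var :", round(1 / (p * (1 - p)), 3))
```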
Example 3: SGD Convergence in ML Theory
In stochastic gradient descent, the iterates are random. What mode of convergence is the goal?
In probability: Often $\theta_n \xrightarrow{P} \theta^\star$ (or the distance to the set of stationary points tends to 0 in probability). This is the standard result for convex objectives with appropriate step sizes.
Almost surely: Stronger. The Strong LLN proves $\bar{X}_n \to \mu$ a.s., and some SGD analyses achieve a.s. convergence to stationary points.
$L^2$: $E\|\theta_n - \theta^\star\|^2 \to 0$ — mean-squared convergence. Requires controlling the variance of gradient estimates; typically achievable with variance reduction (SVRG/SAGA).
In distribution: The SGD iterate does not in general converge in distribution to a point mass — with constant step size it oscillates in a neighborhood of $\theta^\star$. But suitably rescaled fluctuations converge in distribution to an Ornstein-Uhlenbeck process.
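A tiny simulation (illustrative setup: quadratic objective $f(\theta) = \theta^2/2$ with Gaussian gradient noise, arbitrary step sizes) shows the two regimes: a constant step size leaves the iterate oscillating at a noise floor around $\theta^\star = 0$, while a decreasing step size $\alpha_t = 1/t$ drives it toward $\theta^\star$.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 20_000

def sgd(step):
    """Noisy gradient descent on f(theta) = theta^2 / 2, so grad f = theta."""
    theta = 5.0
    for t in range(1, T + 1):
        g = theta + rng.normal()                  # stochastic gradient: grad f + noise
        theta -= step(t) * g
    return theta

const = [abs(sgd(lambda t: 0.1)) for _ in range(20)]      # constant step size
decay = [abs(sgd(lambda t: 1.0 / t)) for _ in range(20)]  # Robbins-Monro 1/t step

print("constant step: mean |theta_T| ~", round(np.mean(const), 4))  # noise floor
print("1/t step:      mean |theta_T| ~", round(np.mean(decay), 4))  # near 0
```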
Connections
Where Your Intuition Breaks
The most practically dangerous confusion: convergence in distribution does NOT imply convergence of the random variables themselves. When you say "the empirical loss converges to the true loss," you typically mean in-probability or almost-sure convergence — a strong statement about a single realization of the algorithm. When the CLT says "the normalized sample mean converges to Gaussian," it only gives convergence in distribution: the distribution looks Gaussian, but the actual sample mean on any given run can still be far from the true mean. This is why CLT-based confidence intervals are approximate (not exact) and why tail probability bounds (Chernoff, Hoeffding) that give in-probability guarantees are strictly stronger for risk analysis than CLT approximations.
Almost sure convergence is sample-path convergence; in-distribution convergence is just law-to-law. A.s. convergence says the trajectory $X_n(\omega)$ converges pointwise for almost every $\omega$ — strong. Convergence in distribution says the histograms of $X_n$ converge to the histogram of $X$ — it says nothing about the coupling between $X_n$ and $X$ on the same probability space. The random variables $X$ and $-X$ (where $X \sim N(0,1)$) have the same distribution, but $|X - (-X)| = 2|X|$ deterministically — convergence in distribution doesn't know about this coupling.
Uniform integrability bridges $L^1$ and in-probability convergence. Convergence in probability lets the tail of $X_n$ escape to infinity while $X_n$ is still "usually small." Uniform integrability is exactly the condition that prevents this escape — it says the tails of the $X_n$ are uniformly controlled. With UI, in-probability convergence upgrades to $L^1$ convergence. In ML, uniform integrability of the loss sequence is often the key condition that allows swapping limit and expectation in convergence proofs.
Convergence in distribution does not imply convergence of moments. Even if $X_n \xrightarrow{d} X$, it can happen that $E[X_n] \not\to E[X]$. Moment convergence requires additional conditions (e.g., UI, or a uniform bound on $E|X_n|^{1+\delta}$). In practice: when you prove a CLT for a statistic and want to say its variance converges, you need a separate argument. The Portmanteau theorem (characterizing weak convergence via bounded continuous functions) explains why: $E[f(X_n)] \to E[f(X)]$ holds for bounded continuous $f$, but $x \mapsto x$ (and $x \mapsto x^2$) are unbounded.