Neural-Path/Notes

Modes of Convergence: Almost Sure, In Probability, Lp & In Distribution

There are four distinct notions of convergence for sequences of random variables. Understanding their relationships — which mode implies which — is essential for rigorously analyzing stochastic algorithms, proving sample complexity bounds, and understanding why the law of large numbers, central limit theorem, and delta method work the way they do.

Concepts

"The algorithm converges" — but what does that mean? In classical analysis, convergence is unambiguous. For sequences of random variables, there are four distinct and inequivalent answers: convergence on every single sample path (almost sure), convergence in the probability of deviations (in probability), convergence of expected errors (LpL^p), or convergence of the whole distribution (in distribution). These are not interchangeable. The SLLN gives almost sure convergence; the CLT gives convergence in distribution. Understanding the hierarchy — which implies which — tells you when you can treat a stochastic limit like a deterministic one and when you cannot.

The Four Modes

Let $X_n, X$ be random variables on $(\Omega, \mathcal{F}, P)$.

Almost sure (a.s.) convergence:

$$X_n \xrightarrow{\text{a.s.}} X \iff P\!\left(\lim_{n\to\infty} X_n = X\right) = 1 \iff P\!\left(\left\{\omega : X_n(\omega) \to X(\omega)\right\}\right) = 1.$$

The sequence converges pointwise on a set of probability 1. The exceptional set (of measure 0) is harmless.

Convergence in probability:

$$X_n \xrightarrow{P} X \iff \forall\varepsilon > 0: \; P(|X_n - X| > \varepsilon) \to 0 \;\; \text{as } n\to\infty.$$

For each $\varepsilon > 0$, the probability of a large deviation goes to zero.

$L^p$ convergence ($p \geq 1$):

$$X_n \xrightarrow{L^p} X \iff \mathbb{E}[|X_n - X|^p] \to 0.$$

For $p=2$: convergence in mean square (MSE $\to 0$).

Convergence in distribution (weak convergence):

$$X_n \xrightarrow{d} X \iff F_{X_n}(x) \to F_X(x) \text{ at all continuity points of } F_X.$$

Equivalently: $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for all bounded continuous $g$.
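The test-function characterization can be checked numerically. A minimal Monte Carlo sketch, where the choice of $X_n$ (a standardized Binomial, whose CLT limit is $\mathcal{N}(0,1)$), the bounded continuous $g = \arctan$, and all parameters are our own illustrative assumptions:

```python
import numpy as np

# Weak convergence via test functions: E[g(X_n)] -> E[g(X)] for bounded
# continuous g. Here X_n is a standardized Binomial(n, p) with CLT limit
# X ~ N(0, 1), and g = arctan; E[arctan(X)] = 0 by symmetry. The setup
# (p, sample counts, the choice of g) is an illustrative assumption.
rng = np.random.default_rng(0)
p, trials = 0.3, 200_000

def e_g_xn(n):
    """Monte Carlo estimate of E[arctan(X_n)], X_n = standardized Binomial."""
    xn = (rng.binomial(n, p, size=trials) - n * p) / np.sqrt(n * p * (1 - p))
    return np.arctan(xn).mean()

for n in (10, 100, 10_000):
    print(n, round(e_g_xn(n), 4))   # limit value is E[arctan(Z)] = 0
```

Only boundedness and continuity of $g$ are used here; an unbounded $g$ (like $g(x) = x^2$) carries no such guarantee, as the warning at the end of these notes explains.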

Implications

$$X_n \xrightarrow{\text{a.s.}} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X.$$

$$X_n \xrightarrow{L^p} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X.$$
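The step $L^p \Rightarrow P$ in the second chain is just Markov's inequality, $P(|X_n - X| > \varepsilon) \le \mathbb{E}[|X_n - X|^p]/\varepsilon^p$. A quick numeric sanity check on an illustrative sequence of our own choosing, $X_n = Z/\sqrt{n}$:

```python
import numpy as np

# Markov's inequality, P(|X_n - X| > eps) <= E[|X_n - X|^p] / eps^p, is the
# one-line proof that L^p convergence implies convergence in probability.
# Numeric sanity check on an illustrative sequence X_n = Z / sqrt(n) -> 0.
rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
eps, p = 0.1, 2

for n in (1, 10, 100, 1_000):
    xn = z / np.sqrt(n)                         # E|X_n|^2 = 1/n -> 0
    lhs = np.mean(np.abs(xn) > eps)             # P(|X_n| > eps)
    rhs = np.mean(np.abs(xn) ** p) / eps ** p   # Markov upper bound
    assert lhs <= rhs + 1e-12                   # bound holds sample-by-sample
    print(n, round(lhs, 4), round(rhs, 4))
```

Both columns shrink toward zero, tracing $L^2 \Rightarrow P$ numerically.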

The hierarchy a.s. → in probability → in distribution exists because each step strips away information: almost sure convergence is about every sample path, in-probability strips out the "occasionally bad" sample paths, and in-distribution strips out the coupling entirely (only the marginal distribution matters). Each relaxation is genuine: the implications cannot be reversed, as the counterexamples below show.

None of the converses holds in general. Notable exceptions and partial converses:

  • $L^p \not\Rightarrow$ a.s. in general (but a subsequence converges a.s.)
  • a.s. $\not\Rightarrow$ $L^p$ (need uniform integrability)
  • In probability $\Rightarrow$ a.s. along a subsequence (no extra conditions needed)
  • $X_n \xrightarrow{d} c$ (a constant) $\iff X_n \xrightarrow{P} c$

Counterexamples showing the limits are strict:

In probability but not a.s.: Work on $[0,1]$ with Lebesgue measure. The "typewriter sequence": $X_1 = \mathbf{1}_{[0,1]}$, $X_2 = \mathbf{1}_{[0,1/2]}$, $X_3 = \mathbf{1}_{[1/2,1]}$, $X_4 = \mathbf{1}_{[0,1/4]}$, $X_5 = \mathbf{1}_{[1/4,1/2]}, \ldots$ For each $\omega$, $X_n(\omega)$ oscillates between 0 and 1 infinitely often, so there is no pointwise convergence. But $P(|X_n| > \varepsilon) \to 0$ for $0 < \varepsilon < 1$, since the interval lengths shrink to zero.

In distribution but not in probability: Let $X \sim \mathcal{N}(0,1)$ and $X_n = -X$ for all $n$. Then $X_n \xrightarrow{d} X$ (same distribution), but $P(|X_n - X| > \varepsilon) = P(2|X| > \varepsilon) \not\to 0$.
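The typewriter sequence is easy to simulate. A sketch, where the `typewriter` helper and its block indexing are our own illustrative construction:

```python
import numpy as np

# Typewriter sequence: block k (n in [2^k, 2^{k+1})) sweeps the 2^k dyadic
# intervals of length 2^-k across [0, 1]. The `typewriter` helper is an
# illustrative construction, not from any library.
def typewriter(n):
    """Return (a, b) such that X_n is the indicator of [a, b)."""
    k = int(np.floor(np.log2(n)))    # block index
    j = n - 2**k                      # position within the block
    return j / 2**k, (j + 1) / 2**k

# In probability: interval lengths (hence P(|X_n| > eps)) shrink to 0.
a, b = typewriter(2**11)
print("interval length at n = 2^11:", b - a)

# Not a.s.: every omega is covered once per block, so X_n(omega) = 1
# infinitely often along any fixed sample point omega.
omega = 0.3
xs = [1.0 if a <= omega < b else 0.0
      for a, b in (typewriter(n) for n in range(1, 2**12))]
print("hits of 1 at omega = 0.3 with n > 1024:", sum(xs[2**10:]))
```

The deviation probability decays like $2^{-k}$, yet the fixed point $\omega = 0.3$ keeps getting hit in every block, exactly the split between the two modes.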

Uniform Integrability

A family $\{X_\alpha\}$ is uniformly integrable (UI) if:

$$\lim_{M\to\infty}\sup_\alpha \mathbb{E}\bigl[|X_\alpha|\,\mathbf{1}_{|X_\alpha|>M}\bigr] = 0.$$

Key theorem. $X_n \xrightarrow{L^1} X$ iff $X_n \xrightarrow{P} X$ and $\{X_n\}$ is UI.

UI holds if: (a) $\{X_n\}$ is dominated by an integrable $Y$: $|X_n| \leq Y$ a.s.; or (b) $\{X_n\}$ is bounded in $L^{1+\varepsilon}$ for some $\varepsilon > 0$.
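A classic example of in-probability convergence without $L^1$ convergence (i.e., of UI failing) is $X_n = n\,\mathbf{1}_{U < 1/n}$ with $U \sim \mathrm{Uniform}[0,1]$. A Monte Carlo sketch (sample size and seed are illustrative choices):

```python
import numpy as np

# X_n = n * 1{U < 1/n}, U ~ Uniform[0,1]: X_n -> 0 in probability since
# P(X_n != 0) = 1/n, yet E[X_n] = 1 for every n, so there is no L^1
# convergence. The family is not UI: the expectation escapes along an
# event of shrinking probability.
rng = np.random.default_rng(2)
u = rng.uniform(size=1_000_000)

for n in (10, 1_000, 100_000):
    xn = n * (u < 1 / n)
    print(n, "P(|X_n| > 0.5) ~", np.mean(xn > 0.5), "  E[X_n] ~", round(xn.mean(), 3))
```

The deviation probability column collapses while the expectation column stays pinned near 1: exactly the escape of mass that UI rules out.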

Slutsky's Theorem and Continuous Mapping

Continuous mapping theorem. If $X_n \xrightarrow{d} X$ and $g$ is continuous at every point of a set $C$ with $P(X \in C) = 1$, then $g(X_n) \xrightarrow{d} g(X)$.

Slutsky's theorem. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (a constant), then:

  • $X_n + Y_n \xrightarrow{d} X + c$
  • $X_n Y_n \xrightarrow{d} cX$
  • $X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

Application (t-statistic). For iid $X_i$ with mean $\mu$ and variance $\sigma^2$: $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ (CLT). Since $S_n \xrightarrow{P} \sigma$ (sample standard deviation; LLN + continuous mapping), Slutsky gives $\sqrt{n}(\bar{X}_n - \mu)/S_n \xrightarrow{d} \mathcal{N}(0,1)$ — the basis of the $t$-test.
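The studentized mean can be checked by simulation. A sketch using a deliberately skewed exponential population; the sample size, repetition count, and seed are illustrative choices:

```python
import numpy as np

# Slutsky in action: the studentized mean sqrt(n) * (Xbar - mu) / S_n is
# asymptotically N(0, 1) even though S_n is random, because S_n -> sigma
# in probability. Monte Carlo with a deliberately skewed Exp(1) population.
rng = np.random.default_rng(3)
n, reps, mu = 1_000, 5_000, 1.0              # Exp(1) has mean 1

x = rng.exponential(scale=1.0, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

print("mean:", round(t.mean(), 3), " std:", round(t.std(), 3))
print("P(|T| > 1.96):", round(np.mean(np.abs(t) > 1.96), 4))  # near 0.05 for large n
```

Despite the skewed population, the rejection rate at $\pm 1.96$ lands near the nominal 5%, which is what licenses the $t$-test far beyond Gaussian data.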

The Delta Method

Delta method. If $\sqrt{n}(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$:

$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{d} \mathcal{N}(0, \sigma^2 [g'(\theta)]^2).$$

Proof sketch. Taylor expand: $g(T_n) - g(\theta) = g'(\theta)(T_n - \theta) + o_P(|T_n - \theta|)$. Since $T_n \xrightarrow{P} \theta$, the remainder vanishes after scaling by $\sqrt{n}$; Slutsky finishes the argument.

Multivariate delta method. If $\sqrt{n}(\mathbf{T}_n - \boldsymbol\theta) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $\boldsymbol\theta$:

$$\sqrt{n}(g(\mathbf{T}_n) - g(\boldsymbol\theta)) \xrightarrow{d} \mathcal{N}(\mathbf{0}, J_g \Sigma J_g^T),$$

where $J_g$ is the Jacobian of $g$ at $\boldsymbol\theta$.

Borel-Cantelli Lemmas

Borel-Cantelli I. If $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$ ("infinitely often" events are negligible).

Borel-Cantelli II. If the events $A_n$ are independent and $\sum_{n=1}^\infty P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.

$$P(A_n \text{ i.o.}) = P\left(\limsup_n A_n\right) = P\left(\bigcap_{N}\bigcup_{n\geq N} A_n\right).$$

Use in a.s. convergence. To show $X_n \xrightarrow{\text{a.s.}} X$: for each $\varepsilon > 0$, let $A_n^\varepsilon = \{|X_n - X| > \varepsilon\}$. If $\sum_n P(A_n^\varepsilon) < \infty$ for all $\varepsilon > 0$, then by BC-I, $P(A_n^\varepsilon \text{ i.o.}) = 0$ for each $\varepsilon$; intersecting over $\varepsilon = 1/k$, $k \in \mathbb{N}$, proves a.s. convergence.

Worked Example

Example 1: Diagnosing a.s. vs in-probability

Let $X_n \sim \text{Bernoulli}(1/n)$ independently. Does $X_n \to 0$?

In probability: $P(|X_n| > \varepsilon) = P(X_n = 1) = 1/n \to 0$. Yes, $X_n \xrightarrow{P} 0$.

Almost surely: $\sum_n P(X_n = 1) = \sum_n 1/n = \infty$. By BC-II (independence), $P(X_n = 1 \text{ i.o.}) = 1$. So $X_n$ does not converge a.s. to 0: almost surely the sequence hits 1 infinitely often.

Now change: $X_n \sim \text{Bernoulli}(1/n^2)$ independently. Then $\sum_n P(X_n=1) = \sum_n 1/n^2 = \pi^2/6 < \infty$. By BC-I: $P(X_n = 1 \text{ i.o.}) = 0$, so $X_n \xrightarrow{\text{a.s.}} 0$.

Lesson: for independent exceedance events, a.s. convergence holds exactly when the exceedance probabilities are summable; in-probability convergence only requires that they go to zero.
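Example 1 can also be seen numerically by counting "late" ones on simulated paths. A sketch; the path count, horizon, and lateness cutoff are illustrative choices:

```python
import numpy as np

# Example 1 on simulated paths: with p_n = 1/n (divergent sum) almost every
# path keeps showing 1s arbitrarily late; with p_n = 1/n^2 (summable sum)
# late 1s essentially stop appearing.
rng = np.random.default_rng(4)
paths, n_max, late = 2_000, 5_000, 1_000

ns = np.arange(1, n_max + 1)
u = rng.uniform(size=(paths, n_max))

results = {}
for label, p in [("1/n", 1.0 / ns), ("1/n^2", 1.0 / ns**2)]:
    hit_late = (u < p)[:, late:].any(axis=1)   # any X_n = 1 with n > late?
    results[label] = hit_late.mean()
    print(label, "fraction of paths with a late 1:", results[label])
```

For $p_n = 1/n$ the no-late-hit probability telescopes to $1000/5000 = 0.2$, so about 80% of paths show a late 1; for $p_n = 1/n^2$ the tail sum is about $0.0008$, so late ones almost never appear, exactly the Borel-Cantelli dichotomy at finite horizon.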

Example 2: Delta Method for Log-Odds

Let $\hat{p}_n = \bar{X}_n$ be the sample proportion of successes ($X_i \sim \text{Bernoulli}(p)$ iid). The log-odds: $g(p) = \log(p/(1-p))$, with $g'(p) = 1/(p(1-p))$.

CLT: $\sqrt{n}(\hat{p}_n - p) \xrightarrow{d} \mathcal{N}(0, p(1-p))$.

Delta method: $\sqrt{n}(g(\hat{p}_n) - g(p)) \xrightarrow{d} \mathcal{N}(0, p(1-p)\,[g'(p)]^2)$.

$$g'(p) = \frac{1}{p(1-p)}, \quad \text{so the variance is} \quad p(1-p) \cdot \frac{1}{[p(1-p)]^2} = \frac{1}{p(1-p)}.$$

This gives a CLT for the log-odds estimate with asymptotic variance $1/(p(1-p))$, the basis for confidence intervals in logistic regression output.
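A Monte Carlo sanity check of the $1/(p(1-p))$ asymptotic variance; $p$, the sample size, repetition count, and seed are illustrative choices:

```python
import numpy as np

# Delta-method check for the log-odds: the variance of
# sqrt(n) * (logit(p_hat) - logit(p)) should approach 1 / (p (1 - p)).
rng = np.random.default_rng(5)
p, n, reps = 0.3, 5_000, 50_000

logit = lambda q: np.log(q / (1 - q))
p_hat = rng.binomial(n, p, size=reps) / n    # reps independent sample proportions
z = np.sqrt(n) * (logit(p_hat) - logit(p))   # rescaled log-odds error

print("empirical variance:", round(z.var(), 3))
print("delta-method prediction 1/(p(1-p)):", round(1 / (p * (1 - p)), 3))
```

For $p = 0.3$ the prediction is $1/0.21 \approx 4.76$, and the simulated variance lands close to it at this sample size.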

Example 3: SGD Convergence in ML Theory

In stochastic gradient descent, the iterates $\theta_k$ are random. What mode of convergence is the goal?

In probability: Often $\theta_k \xrightarrow{P} \theta^*$ (or convergence to the set of stationary points). This is the standard result for convex objectives with appropriate step sizes.

Almost surely: Stronger. The strong LLN proves $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$, and some SGD analyses achieve a.s. convergence to stationary points.

$L^2$: $\mathbb{E}[\|\theta_k - \theta^*\|^2] \to 0$, i.e. mean-squared convergence. This requires controlling the variance of the gradient estimates; it is typically achievable with diminishing step sizes or variance reduction (SVRG/SAGA).

In distribution: The SGD iterate does not in general converge in distribution to a point mass; with a constant step size $\eta$ it oscillates in a neighborhood of $\theta^*$ and settles into a stationary distribution. The rescaled fluctuations $(\theta_k - \theta^*)/\sqrt{\eta}$ converge in distribution (as $\eta \to 0$) to the stationary law of an Ornstein-Uhlenbeck process.
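The constant-step-size picture can be sketched on a 1-D quadratic, a toy model with parameters of our own choosing, where the linear update below has stationary variance exactly $\eta\sigma^2/(2-\eta)$:

```python
import numpy as np

# Constant-step-size SGD on f(theta) = 0.5 * (theta - theta_star)^2 with
# additive N(0, sigma^2) gradient noise (toy model; parameters illustrative).
# The iterate does NOT converge to theta_star; it settles into a stationary
# distribution. For the linear recursion e_{k+1} = (1 - eta) e_k - eta * xi_k
# the stationary variance is exactly eta * sigma^2 / (2 - eta).
rng = np.random.default_rng(6)
theta_star, eta, sigma, steps = 2.0, 0.1, 1.0, 200_000

noise = sigma * rng.standard_normal(steps)
theta = 0.0
trace = np.empty(steps)
for k in range(steps):
    grad = (theta - theta_star) + noise[k]   # unbiased stochastic gradient
    theta -= eta * grad
    trace[k] = theta

tail = trace[steps // 2:]                    # discard burn-in
print("mean of late iterates:", round(tail.mean(), 3))
print("empirical stationary variance:", round(tail.var(), 4))
print("theory eta*sigma^2/(2-eta):", round(eta * sigma**2 / (2 - eta), 4))
```

Shrinking $\eta$ shrinks the stationary variance linearly, which is why step-size decay (or averaging) is needed to get actual convergence of the iterate rather than convergence of its distribution to a fixed Gaussian blur around $\theta^*$.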

Connections

Where Your Intuition Breaks

The most practically dangerous confusion: convergence in distribution does NOT imply convergence of the random variables themselves. When you say "the empirical loss converges to the true loss," you typically mean in-probability or almost-sure convergence, a strong statement about a single realization of the algorithm. When the CLT says "the normalized sample mean converges to a Gaussian," it only gives convergence in distribution: the distribution looks Gaussian, but the actual sample mean on any given run can still be far from the true mean. This is why CLT-based confidence intervals are approximate rather than exact, and why finite-sample tail bounds (Chernoff, Hoeffding), which control deviation probabilities directly, are strictly stronger for risk analysis than CLT approximations.

💡Intuition

Almost sure convergence is sample-path convergence; in-distribution convergence is just law-to-law. A.s. convergence says the trajectory $X_1(\omega), X_2(\omega), \ldots$ converges pointwise for almost all $\omega$, which is strong. Convergence in distribution says the histograms of $X_n$ converge to the histogram of $X$; it says nothing about the coupling between $X_n$ and $X$ on the same probability space. Two random variables $X$ and $Y = -X$ (where $X \sim \mathcal{N}(0,1)$) have the same distribution, but $X + Y = 0$ deterministically; convergence in distribution is blind to this coupling.

💡Intuition

Uniform integrability bridges $L^1$ and in-probability convergence. Convergence in probability lets the tail of $|X_n|$ escape to infinity while the variable remains "usually small." Uniform integrability is exactly the condition that prevents this escape: it says the tails of $|X_n|$ contribute uniformly little expectation, no matter how large they get. With UI, in-probability convergence upgrades to $L^1$ convergence. In ML, uniform integrability of the loss sequence is often the key condition that allows swapping limit and expectation in convergence proofs.

⚠️Warning

Convergence in distribution does not imply convergence of moments. Even if $X_n \xrightarrow{d} X$, it can happen that $\mathbb{E}[X_n^2] \not\to \mathbb{E}[X^2]$. Moment convergence requires additional conditions (e.g., UI of $\{X_n^2\}$, or a uniform bound on $\mathbb{E}[|X_n|^{2+\varepsilon}]$). In practice: when you prove a CLT for a statistic and want to say its variance converges, you need a separate argument. The portmanteau theorem (characterizing weak convergence via bounded continuous functions) explains why: $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for bounded continuous $g$, but $g(x) = x^2$ is unbounded.
