Neural-Path/Notes

Modes of Convergence: Almost Sure, In Probability, Lp & In Distribution

There are four distinct notions of convergence for sequences of random variables. Understanding their relationships — which mode implies which — is essential for rigorously analyzing stochastic algorithms, proving sample complexity bounds, and understanding why the law of large numbers, central limit theorem, and delta method work the way they do.

Concepts

"The algorithm converges" — but what does that mean? In classical analysis, convergence is unambiguous. For sequences of random variables, there are four distinct and inequivalent answers: convergence on every single sample path (almost sure), convergence in the probability of deviations (in probability), convergence of expected errors (LpL^p), or convergence of the whole distribution (in distribution). These are not interchangeable. The SLLN gives almost sure convergence; the CLT gives convergence in distribution. Understanding the hierarchy — which implies which — tells you when you can treat a stochastic limit like a deterministic one and when you cannot.

The Four Modes

Let $X_n, X$ be random variables on $(\Omega, \mathcal{F}, P)$.

Almost sure (a.s.) convergence:

$$X_n \xrightarrow{\text{a.s.}} X \iff P\!\left(\lim_{n\to\infty} X_n = X\right) = 1 \iff P\!\left(\left\{\omega : X_n(\omega) \to X(\omega)\right\}\right) = 1.$$

The sequence converges pointwise on a set of probability 1. The exceptional set (of measure 0) is harmless.

Convergence in probability:

$$X_n \xrightarrow{P} X \iff \forall\varepsilon > 0: \; P(|X_n - X| > \varepsilon) \to 0 \;\; \text{as } n\to\infty.$$

For each $\varepsilon > 0$, the probability of a large deviation goes to zero.

$L^p$ convergence ($p \geq 1$):

$$X_n \xrightarrow{L^p} X \iff \mathbb{E}[|X_n - X|^p] \to 0.$$

For $p=2$: convergence in mean square (MSE $\to 0$).

Convergence in distribution (weak convergence):

$$X_n \xrightarrow{d} X \iff F_{X_n}(x) \to F_X(x) \text{ at all continuity points of } F_X.$$

Equivalently: $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for all bounded continuous $g$.
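The test-function characterization can be checked numerically. A minimal Monte Carlo sketch, where the choice of $X_n$ (a standardized Binomial, whose CLT limit is $\mathcal{N}(0,1)$), the bounded continuous $g = \arctan$, and all parameters are our own illustrative assumptions:

```python
import numpy as np

# Weak convergence via test functions: E[g(X_n)] -> E[g(X)] for bounded
# continuous g. Here X_n is a standardized Binomial(n, p) with CLT limit
# X ~ N(0, 1), and g = arctan; E[arctan(X)] = 0 by symmetry. The setup
# (p, sample counts, the choice of g) is an illustrative assumption.
rng = np.random.default_rng(0)
p, trials = 0.3, 200_000

def e_g_xn(n):
    """Monte Carlo estimate of E[arctan(X_n)], X_n = standardized Binomial."""
    xn = (rng.binomial(n, p, size=trials) - n * p) / np.sqrt(n * p * (1 - p))
    return np.arctan(xn).mean()

for n in (10, 100, 10_000):
    print(n, round(e_g_xn(n), 4))   # limit value is E[arctan(Z)] = 0
```

Only boundedness and continuity of $g$ are used here; an unbounded $g$ (like $g(x) = x^2$) carries no such guarantee, as the warning at the end of these notes explains.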

Implications

$$X_n \xrightarrow{\text{a.s.}} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X.$$

$$X_n \xrightarrow{L^p} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X.$$
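The step $L^p \Rightarrow P$ in the second chain is just Markov's inequality, $P(|X_n - X| > \varepsilon) \le \mathbb{E}[|X_n - X|^p]/\varepsilon^p$. A quick numeric sanity check on an illustrative sequence of our own choosing, $X_n = Z/\sqrt{n}$:

```python
import numpy as np

# Markov's inequality, P(|X_n - X| > eps) <= E[|X_n - X|^p] / eps^p, is the
# one-line proof that L^p convergence implies convergence in probability.
# Numeric sanity check on an illustrative sequence X_n = Z / sqrt(n) -> 0.
rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
eps, p = 0.1, 2

for n in (1, 10, 100, 1_000):
    xn = z / np.sqrt(n)                         # E|X_n|^2 = 1/n -> 0
    lhs = np.mean(np.abs(xn) > eps)             # P(|X_n| > eps)
    rhs = np.mean(np.abs(xn) ** p) / eps ** p   # Markov upper bound
    assert lhs <= rhs + 1e-12                   # bound holds sample-by-sample
    print(n, round(lhs, 4), round(rhs, 4))
```

Both columns shrink toward zero, tracing $L^2 \Rightarrow P$ numerically.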

The hierarchy a.s. → in probability → in distribution exists because each step strips away information: almost sure convergence is about every sample path, in-probability strips out the "occasionally bad" sample paths, and in-distribution strips out the coupling entirely (only the marginal distribution matters). Each relaxation is genuine: the implications cannot be reversed, as the counterexamples below show.

None of the converses holds in general. Notable exceptions and partial converses:

  • $L^p \not\Rightarrow$ a.s. in general (but a subsequence converges a.s.)
  • a.s. $\not\Rightarrow$ $L^p$ (need uniform integrability)
  • In probability $\Rightarrow$ a.s. along a subsequence (no extra conditions needed)
  • $X_n \xrightarrow{d} c$ (a constant) $\iff X_n \xrightarrow{P} c$

Counterexamples showing the limits are strict:

In probability but not a.s.: Work on $[0,1]$ with Lebesgue measure. The "typewriter sequence": $X_1 = \mathbf{1}_{[0,1]}$, $X_2 = \mathbf{1}_{[0,1/2]}$, $X_3 = \mathbf{1}_{[1/2,1]}$, $X_4 = \mathbf{1}_{[0,1/4]}$, $X_5 = \mathbf{1}_{[1/4,1/2]}, \ldots$ For each $\omega$, $X_n(\omega)$ oscillates between 0 and 1 infinitely often, so there is no pointwise convergence. But $P(|X_n| > \varepsilon) \to 0$ for $0 < \varepsilon < 1$, since the interval lengths shrink to zero.

In distribution but not in probability: Let $X \sim \mathcal{N}(0,1)$ and $X_n = -X$ for all $n$. Then $X_n \xrightarrow{d} X$ (same distribution), but $P(|X_n - X| > \varepsilon) = P(2|X| > \varepsilon) \not\to 0$.
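The typewriter sequence is easy to simulate. A sketch, where the `typewriter` helper and its block indexing are our own illustrative construction:

```python
import numpy as np

# Typewriter sequence: block k (n in [2^k, 2^{k+1})) sweeps the 2^k dyadic
# intervals of length 2^-k across [0, 1]. The `typewriter` helper is an
# illustrative construction, not from any library.
def typewriter(n):
    """Return (a, b) such that X_n is the indicator of [a, b)."""
    k = int(np.floor(np.log2(n)))    # block index
    j = n - 2**k                      # position within the block
    return j / 2**k, (j + 1) / 2**k

# In probability: interval lengths (hence P(|X_n| > eps)) shrink to 0.
a, b = typewriter(2**11)
print("interval length at n = 2^11:", b - a)

# Not a.s.: every omega is covered once per block, so X_n(omega) = 1
# infinitely often along any fixed sample point omega.
omega = 0.3
xs = [1.0 if a <= omega < b else 0.0
      for a, b in (typewriter(n) for n in range(1, 2**12))]
print("hits of 1 at omega = 0.3 with n > 1024:", sum(xs[2**10:]))
```

The deviation probability decays like $2^{-k}$, yet the fixed point $\omega = 0.3$ keeps getting hit in every block, exactly the split between the two modes.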

Uniform Integrability

A family $\{X_\alpha\}$ is uniformly integrable (UI) if:

$$\lim_{M\to\infty}\sup_\alpha \mathbb{E}\bigl[|X_\alpha|\,\mathbf{1}_{|X_\alpha|>M}\bigr] = 0.$$

Key theorem. $X_n \xrightarrow{L^1} X$ iff $X_n \xrightarrow{P} X$ and $\{X_n\}$ is UI.

UI holds if: (a) $\{X_n\}$ is dominated by an integrable $Y$: $|X_n| \leq Y$ a.s.; or (b) $\{X_n\}$ is bounded in $L^{1+\varepsilon}$ for some $\varepsilon > 0$.
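A classic example of in-probability convergence without $L^1$ convergence (i.e., of UI failing) is $X_n = n\,\mathbf{1}_{U < 1/n}$ with $U \sim \mathrm{Uniform}[0,1]$. A Monte Carlo sketch (sample size and seed are illustrative choices):

```python
import numpy as np

# X_n = n * 1{U < 1/n}, U ~ Uniform[0,1]: X_n -> 0 in probability since
# P(X_n != 0) = 1/n, yet E[X_n] = 1 for every n, so there is no L^1
# convergence. The family is not UI: the expectation escapes along an
# event of shrinking probability.
rng = np.random.default_rng(2)
u = rng.uniform(size=1_000_000)

for n in (10, 1_000, 100_000):
    xn = n * (u < 1 / n)
    print(n, "P(|X_n| > 0.5) ~", np.mean(xn > 0.5), "  E[X_n] ~", round(xn.mean(), 3))
```

The deviation probability column collapses while the expectation column stays pinned near 1: exactly the escape of mass that UI rules out.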

Slutsky's Theorem and Continuous Mapping

Continuous mapping theorem. If $X_n \xrightarrow{d} X$ and $g$ is continuous at every point of a set $C$ with $P(X \in C) = 1$, then $g(X_n) \xrightarrow{d} g(X)$.

Slutsky's theorem. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (a constant), then:

  • $X_n + Y_n \xrightarrow{d} X + c$
  • $X_n Y_n \xrightarrow{d} cX$
  • $X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

Application (t-statistic). For iid $X_i$ with mean $\mu$ and variance $\sigma^2$: $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ (CLT). Since $S_n \xrightarrow{P} \sigma$ (sample standard deviation; LLN + continuous mapping), Slutsky gives $\sqrt{n}(\bar{X}_n - \mu)/S_n \xrightarrow{d} \mathcal{N}(0,1)$ — the basis of the $t$-test.
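The studentized mean can be checked by simulation. A sketch using a deliberately skewed exponential population; the sample size, repetition count, and seed are illustrative choices:

```python
import numpy as np

# Slutsky in action: the studentized mean sqrt(n) * (Xbar - mu) / S_n is
# asymptotically N(0, 1) even though S_n is random, because S_n -> sigma
# in probability. Monte Carlo with a deliberately skewed Exp(1) population.
rng = np.random.default_rng(3)
n, reps, mu = 1_000, 5_000, 1.0              # Exp(1) has mean 1

x = rng.exponential(scale=1.0, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

print("mean:", round(t.mean(), 3), " std:", round(t.std(), 3))
print("P(|T| > 1.96):", round(np.mean(np.abs(t) > 1.96), 4))  # near 0.05 for large n
```

Despite the skewed population, the rejection rate at $\pm 1.96$ lands near the nominal 5%, which is what licenses the $t$-test far beyond Gaussian data.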

The Delta Method

Delta method. If $\sqrt{n}(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$:

$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{d} \mathcal{N}(0, \sigma^2 [g'(\theta)]^2).$$

Proof sketch. Taylor expand: $g(T_n) - g(\theta) = g'(\theta)(T_n - \theta) + o_P(|T_n - \theta|)$. Since $T_n \xrightarrow{P} \theta$, the remainder vanishes after scaling by $\sqrt{n}$; Slutsky finishes the argument.

Multivariate delta method. If $\sqrt{n}(\mathbf{T}_n - \boldsymbol\theta) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $\boldsymbol\theta$:

$$\sqrt{n}(g(\mathbf{T}_n) - g(\boldsymbol\theta)) \xrightarrow{d} \mathcal{N}(\mathbf{0}, J_g \Sigma J_g^T),$$

where $J_g$ is the Jacobian of $g$ at $\boldsymbol\theta$.

Borel-Cantelli Lemmas

Borel-Cantelli I. If $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$ ("infinitely often" events are negligible).

Borel-Cantelli II. If the events $A_n$ are independent and $\sum_{n=1}^\infty P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.

$$P(A_n \text{ i.o.}) = P\left(\limsup_n A_n\right) = P\left(\bigcap_{N}\bigcup_{n\geq N} A_n\right).$$

Use in a.s. convergence. To show $X_n \xrightarrow{\text{a.s.}} X$: for each $\varepsilon > 0$, let $A_n^\varepsilon = \{|X_n - X| > \varepsilon\}$. If $\sum_n P(A_n^\varepsilon) < \infty$ for all $\varepsilon > 0$, then by BC-I, $P(A_n^\varepsilon \text{ i.o.}) = 0$ for each $\varepsilon$; intersecting over $\varepsilon = 1/k$, $k \in \mathbb{N}$, proves a.s. convergence.

Worked Example

Example 1: Diagnosing a.s. vs in-probability

Let $X_n \sim \text{Bernoulli}(1/n)$ independently. Does $X_n \to 0$?

In probability: $P(|X_n| > \varepsilon) = P(X_n = 1) = 1/n \to 0$. Yes, $X_n \xrightarrow{P} 0$.

Almost surely: $\sum_n P(X_n = 1) = \sum_n 1/n = \infty$. By BC-II (independence), $P(X_n = 1 \text{ i.o.}) = 1$. So $X_n$ does not converge a.s. to 0: almost surely the sequence hits 1 infinitely often.

Now change: $X_n \sim \text{Bernoulli}(1/n^2)$ independently. Then $\sum_n P(X_n=1) = \sum_n 1/n^2 = \pi^2/6 < \infty$. By BC-I: $P(X_n = 1 \text{ i.o.}) = 0$, so $X_n \xrightarrow{\text{a.s.}} 0$.

Lesson: for independent exceedance events, a.s. convergence holds exactly when the exceedance probabilities are summable; in-probability convergence only requires that they go to zero.
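Example 1 can also be seen numerically by counting "late" ones on simulated paths. A sketch; the path count, horizon, and lateness cutoff are illustrative choices:

```python
import numpy as np

# Example 1 on simulated paths: with p_n = 1/n (divergent sum) almost every
# path keeps showing 1s arbitrarily late; with p_n = 1/n^2 (summable sum)
# late 1s essentially stop appearing.
rng = np.random.default_rng(4)
paths, n_max, late = 2_000, 5_000, 1_000

ns = np.arange(1, n_max + 1)
u = rng.uniform(size=(paths, n_max))

results = {}
for label, p in [("1/n", 1.0 / ns), ("1/n^2", 1.0 / ns**2)]:
    hit_late = (u < p)[:, late:].any(axis=1)   # any X_n = 1 with n > late?
    results[label] = hit_late.mean()
    print(label, "fraction of paths with a late 1:", results[label])
```

For $p_n = 1/n$ the no-late-hit probability telescopes to $1000/5000 = 0.2$, so about 80% of paths show a late 1; for $p_n = 1/n^2$ the tail sum is about $0.0008$, so late ones almost never appear, exactly the Borel-Cantelli dichotomy at finite horizon.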

Example 2: Delta Method for Log-Odds

Let $\hat{p}_n = \bar{X}_n$ be the sample proportion of successes ($X_i \sim \text{Bernoulli}(p)$ iid). The log-odds: $g(p) = \log(p/(1-p))$, with $g'(p) = 1/(p(1-p))$.

CLT: $\sqrt{n}(\hat{p}_n - p) \xrightarrow{d} \mathcal{N}(0, p(1-p))$.

Delta method: $\sqrt{n}(g(\hat{p}_n) - g(p)) \xrightarrow{d} \mathcal{N}(0, p(1-p)\,[g'(p)]^2)$.

$$g'(p) = \frac{1}{p(1-p)}, \quad \text{so the variance is} \quad p(1-p) \cdot \frac{1}{[p(1-p)]^2} = \frac{1}{p(1-p)}.$$

This gives a CLT for the log-odds estimate with asymptotic variance $1/(p(1-p))$, the basis for confidence intervals in logistic regression output.
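A Monte Carlo sanity check of the $1/(p(1-p))$ asymptotic variance; $p$, the sample size, repetition count, and seed are illustrative choices:

```python
import numpy as np

# Delta-method check for the log-odds: the variance of
# sqrt(n) * (logit(p_hat) - logit(p)) should approach 1 / (p (1 - p)).
rng = np.random.default_rng(5)
p, n, reps = 0.3, 5_000, 50_000

logit = lambda q: np.log(q / (1 - q))
p_hat = rng.binomial(n, p, size=reps) / n    # reps independent sample proportions
z = np.sqrt(n) * (logit(p_hat) - logit(p))   # rescaled log-odds error

print("empirical variance:", round(z.var(), 3))
print("delta-method prediction 1/(p(1-p)):", round(1 / (p * (1 - p)), 3))
```

For $p = 0.3$ the prediction is $1/0.21 \approx 4.76$, and the simulated variance lands close to it at this sample size.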

Example 3: SGD Convergence in ML Theory

In stochastic gradient descent, the iterates $\theta_k$ are random. What mode of convergence is the goal?

In probability: Often $\theta_k \xrightarrow{P} \theta^*$ (or convergence to the set of stationary points). This is the standard result for convex objectives with appropriate step sizes.

Almost surely: Stronger. The strong LLN proves $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$, and some SGD analyses achieve a.s. convergence to stationary points.

$L^2$: $\mathbb{E}[\|\theta_k - \theta^*\|^2] \to 0$, i.e. mean-squared convergence. This requires controlling the variance of the gradient estimates; it is typically achievable with diminishing step sizes or variance reduction (SVRG/SAGA).

In distribution: The SGD iterate does not in general converge in distribution to a point mass; with a constant step size $\eta$ it oscillates in a neighborhood of $\theta^*$ and settles into a stationary distribution. The rescaled fluctuations $(\theta_k - \theta^*)/\sqrt{\eta}$ converge in distribution (as $\eta \to 0$) to the stationary law of an Ornstein-Uhlenbeck process.
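The constant-step-size picture can be sketched on a 1-D quadratic, a toy model with parameters of our own choosing, where the linear update below has stationary variance exactly $\eta\sigma^2/(2-\eta)$:

```python
import numpy as np

# Constant-step-size SGD on f(theta) = 0.5 * (theta - theta_star)^2 with
# additive N(0, sigma^2) gradient noise (toy model; parameters illustrative).
# The iterate does NOT converge to theta_star; it settles into a stationary
# distribution. For the linear recursion e_{k+1} = (1 - eta) e_k - eta * xi_k
# the stationary variance is exactly eta * sigma^2 / (2 - eta).
rng = np.random.default_rng(6)
theta_star, eta, sigma, steps = 2.0, 0.1, 1.0, 200_000

noise = sigma * rng.standard_normal(steps)
theta = 0.0
trace = np.empty(steps)
for k in range(steps):
    grad = (theta - theta_star) + noise[k]   # unbiased stochastic gradient
    theta -= eta * grad
    trace[k] = theta

tail = trace[steps // 2:]                    # discard burn-in
print("mean of late iterates:", round(tail.mean(), 3))
print("empirical stationary variance:", round(tail.var(), 4))
print("theory eta*sigma^2/(2-eta):", round(eta * sigma**2 / (2 - eta), 4))
```

Shrinking $\eta$ shrinks the stationary variance linearly, which is why step-size decay (or averaging) is needed to get actual convergence of the iterate rather than convergence of its distribution to a fixed Gaussian blur around $\theta^*$.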

Connections

Where Your Intuition Breaks

The most practically dangerous confusion: convergence in distribution does NOT imply convergence of the random variables themselves. When you say "the empirical loss converges to the true loss," you typically mean in-probability or almost-sure convergence, a strong statement about a single realization of the algorithm. When the CLT says "the normalized sample mean converges to a Gaussian," it only gives convergence in distribution: the distribution looks Gaussian, but the actual sample mean on any given run can still be far from the true mean. This is why CLT-based confidence intervals are approximate rather than exact, and why finite-sample tail bounds (Chernoff, Hoeffding), which control deviation probabilities directly, are strictly stronger for risk analysis than CLT approximations.

💡Intuition

Almost sure convergence is sample-path convergence; in-distribution convergence is just law-to-law. A.s. convergence says the trajectory $X_1(\omega), X_2(\omega), \ldots$ converges pointwise for almost all $\omega$, which is strong. Convergence in distribution says the histograms of $X_n$ converge to the histogram of $X$; it says nothing about the coupling between $X_n$ and $X$ on the same probability space. Two random variables $X$ and $Y = -X$ (where $X \sim \mathcal{N}(0,1)$) have the same distribution, but $X + Y = 0$ deterministically; convergence in distribution is blind to this coupling.

💡Intuition

Uniform integrability bridges $L^1$ and in-probability convergence. Convergence in probability lets the tail of $|X_n|$ escape to infinity while the variable remains "usually small." Uniform integrability is exactly the condition that prevents this escape: it says the tails of $|X_n|$ contribute uniformly little expectation, no matter how large they get. With UI, in-probability convergence upgrades to $L^1$ convergence. In ML, uniform integrability of the loss sequence is often the key condition that allows swapping limit and expectation in convergence proofs.

⚠️Warning

Convergence in distribution does not imply convergence of moments. Even if $X_n \xrightarrow{d} X$, it can happen that $\mathbb{E}[X_n^2] \not\to \mathbb{E}[X^2]$. Moment convergence requires additional conditions (e.g., UI of $\{X_n^2\}$, or a uniform bound on $\mathbb{E}[|X_n|^{2+\varepsilon}]$). In practice: when you prove a CLT for a statistic and want to say its variance converges, you need a separate argument. The portmanteau theorem (characterizing weak convergence via bounded continuous functions) explains why: $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for bounded continuous $g$, but $g(x) = x^2$ is unbounded.
