
Martingales: Optional Stopping & Doob's Inequalities

A martingale formalizes the concept of a "fair game" — a stochastic process whose expected future value, given the present, equals its current value. Martingale theory provides the sharpest tools for controlling random processes: optional stopping determines when you can stop a process without changing its expected value, and Doob's inequalities bound the maximum of the process over time.

Concepts

A fair coin-flip game: you start with $100, win $1 on heads, lose $1 on tails. After any number of flips, your expected future wealth equals your current wealth — the game has no memory advantage, and no clever strategy changes your expectation. This is a martingale. The optional stopping theorem makes this rigorous: no betting rule, however elaborate, can turn a fair game into a profitable one by choosing when to stop.
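
A minimal Monte Carlo sketch of this claim (the target of +10 and the 1,000-flip cap are arbitrary illustration values): because the stopping rule below is bounded, expected final wealth stays at $100 no matter how the rule is tuned.

```python
import random

# Sketch: a bounded stopping rule on a fair +/-1 game.
# Target and flip cap are arbitrary illustration values.
def play_until_stop(start=100, target=110, max_flips=1000):
    wealth = start
    for _ in range(max_flips):
        if wealth >= target:          # stopping rule: quit while ahead
            break
        wealth += 1 if random.random() < 0.5 else -1
    return wealth

random.seed(0)
trials = 20_000
mean_wealth = sum(play_until_stop() for _ in range(trials)) / trials
print(mean_wealth)                    # ~100.0: stopping early doesn't help
```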

Filtrations and Adapted Processes

A filtration $(\mathcal{F}_t)_{t \geq 0}$ is an increasing sequence of sigma-algebras representing the information available at time $t$: $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s \leq t$. A process $(X_t)$ is adapted to $(\mathcal{F}_t)$ if $X_t$ is $\mathcal{F}_t$-measurable for each $t$.

The natural filtration of $X$ is $\mathcal{F}_t^X = \sigma(X_s : s \leq t)$ — the information generated by the process itself.

Martingales: Definition and Examples

An adapted integrable process $(M_t)$ is a martingale with respect to $(\mathcal{F}_t)$ if:

$$E[M_t \mid \mathcal{F}_s] = M_s \quad \text{for all } s \leq t.$$

The filtration $(\mathcal{F}_t)$ in this definition is not a technicality — it is the precise specification of what "knowing the present" means. The same process can be a martingale with respect to one filtration and not another: $B_t^2 - t$ is a martingale with respect to the natural filtration of Brownian motion, but enlarging the filtration to include the future destroys this property. Stating which filtration makes a process fair is not optional; it is the entire content of the definition.

A supermartingale satisfies $E[M_t \mid \mathcal{F}_s] \leq M_s$ (expected to decrease); a submartingale satisfies $E[M_t \mid \mathcal{F}_s] \geq M_s$ (expected to increase).

Canonical examples:

| Process | Type |
| --- | --- |
| $M_n = \sum_{i=1}^n X_i$ with $E[X_i] = 0$ and $X_i \perp \mathcal{F}_{i-1}$ | Martingale |
| Brownian motion $B_t$ | Martingale |
| $B_t^2 - t$ | Martingale |
| $e^{\theta B_t - \theta^2 t/2}$ (exponential martingale) | Martingale |
| $-M_n$ for a martingale $(M_n)$ | Martingale |
| $f(M_n)$ for convex $f$ (Jensen) | Submartingale |

Martingale difference sequences: $(D_n)$ where $D_n = M_n - M_{n-1}$ satisfies $E[D_n \mid \mathcal{F}_{n-1}] = 0$. With $M_0$ constant, the variance of $M_n = M_0 + \sum_{k=1}^n D_k$ satisfies $\text{Var}[M_n] = \sum_{k=1}^n E[D_k^2]$ (orthogonality of increments).
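
A quick numerical check of the orthogonality identity, as a sketch: the bet size below depends on the past (a hypothetical strategy chosen only to break independence), yet $\text{Var}[M_n]$ still matches $\sum_k E[D_k^2]$.

```python
import numpy as np

# Sketch: M_0 = 0, bet size depends on the past (hypothetical strategy
# chosen only to create dependence); increments still satisfy
# E[D_k | F_{k-1}] = 0, so Var[M_n] = sum_k E[D_k^2].
rng = np.random.default_rng(0)
n, trials = 50, 200_000
M = np.zeros(trials)
sum_ED2 = 0.0
for k in range(n):
    bet = np.where(M >= 0, 1.0, 2.0)            # F_{k-1}-measurable bet size
    eps = rng.choice([-1.0, 1.0], size=trials)  # fair +/-1 sign
    D = bet * eps                               # E[D_k | past] = 0
    sum_ED2 += np.mean(D**2)
    M += D
print(np.var(M), sum_ED2)                       # the two numbers nearly agree
```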

Optional Stopping Theorem

A stopping time $\tau$ is a random variable with $\{\tau \leq t\} \in \mathcal{F}_t$ for all $t$ — the decision to stop can only depend on current and past information, not the future.

Optional Stopping Theorem (OST): let $(M_n)$ be a martingale and $\tau$ a stopping time. If any of the following conditions hold:

  1. $\tau$ is bounded: $\tau \leq N$ a.s. for some fixed $N$
  2. $|M_{n \wedge \tau}| \leq C$ uniformly
  3. $E[\tau] < \infty$ and the increments are bounded: $|M_n - M_{n-1}| \leq C$ a.s.

then $E[M_\tau] = E[M_0]$.

Consequence: in any fair game, no stopping strategy can create an expected profit. The Gambler's Ruin is a direct application: a gambler starting at $x$ betting $\pm 1$ on a fair coin, stopping at $0$ or $N$, has $E[M_\tau] = x$, so the ruin probability from $x$ is $(N-x)/N$ (solving $E[M_\tau] = x = (1 - p_{\text{ruin}}) \cdot N + p_{\text{ruin}} \cdot 0$).
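
A simulation sketch of the fair-coin case ($x = 3$ and $N = 10$ are arbitrary illustration values):

```python
import random

# Sketch: fair-coin gambler's ruin; x = 3, N = 10 are illustration values.
def ruined(x=3, N=10):
    while 0 < x < N:
        x += 1 if random.random() < 0.5 else -1
    return x == 0

random.seed(1)
trials = 100_000
p_hat = sum(ruined() for _ in range(trials)) / trials
print(p_hat, (10 - 3) / 10)           # simulated vs exact (N - x)/N
```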

Doob's Inequalities

Doob's maximal inequality: for a non-negative submartingale $(M_n)_{n=0}^N$:

$$P\!\left(\max_{0 \leq k \leq N} M_k \geq \lambda\right) \leq \frac{E[M_N]}{\lambda}.$$

For a martingale: $P(\max_{k \leq N} |M_k| \geq \lambda) \leq E[M_N^2] / \lambda^2$ (by Jensen, $M_k^2$ is a submartingale).

Doob's $L^p$ inequality: for $p > 1$:

$$E\!\left[\left(\max_{k \leq N} |M_k|\right)^p\right] \leq \left(\frac{p}{p-1}\right)^p E[|M_N|^p].$$

The constant $(p/(p-1))^p$ is sharp (Doob, 1953). For $p=2$: $E[\max_k M_k^2] \leq 4E[M_N^2]$.
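
Both bounds are easy to probe numerically. A sketch on a simple random walk ($N = 100$ and $\lambda = 15$ are arbitrary choices):

```python
import numpy as np

# Sketch: both Doob bounds on a simple random walk M_n
# (N = 100 and lambda = 15 are arbitrary).
rng = np.random.default_rng(0)
N, trials, lam = 100, 100_000, 15.0
M = np.cumsum(rng.choice([-1.0, 1.0], size=(trials, N)), axis=1)
max_abs = np.abs(M).max(axis=1)
E_MN2 = np.mean(M[:, -1] ** 2)        # E[M_N^2] = N for this walk

# maximal inequality (martingale form): P(max |M_k| >= lam) <= E[M_N^2]/lam^2
print(np.mean(max_abs >= lam), E_MN2 / lam**2)

# Doob L^2 inequality: E[(max |M_k|)^2] <= 4 E[M_N^2]
print(np.mean(max_abs**2), 4 * E_MN2)
```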

Martingale Convergence Theorems

Upcrossings: an upcrossing of $[a,b]$ by $(M_n)$ is a passage from below $a$ to above $b$. Let $U_N(a,b)$ be the number of upcrossings by time $N$.

Doob's upcrossing inequality: $E[U_N(a,b)] \leq \frac{E[(M_N - a)^+]}{b-a}$.
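
A sketch that counts upcrossings of a simple random walk and compares against the bound (the interval $[-2, 2]$ and horizon $N = 200$ are arbitrary choices):

```python
import numpy as np

# Sketch: count upcrossings of [a, b] by a simple random walk and compare
# with E[U_N(a,b)] <= E[(M_N - a)^+] / (b - a). Parameters are arbitrary.
def count_upcrossings(path, a, b):
    count, armed = 0, False
    for x in path:
        if not armed and x <= a:
            armed = True              # path has dipped to or below a
        elif armed and x >= b:
            count += 1                # ...and then risen to or above b
            armed = False
    return count

rng = np.random.default_rng(0)
a, b, N, trials = -2.0, 2.0, 200, 20_000
paths = np.cumsum(rng.choice([-1.0, 1.0], size=(trials, N)), axis=1)
U = np.array([count_upcrossings(p, a, b) for p in paths])
bound = np.mean(np.maximum(paths[:, -1] - a, 0.0)) / (b - a)
print(U.mean(), bound)                # empirical mean sits below the bound
```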

Martingale convergence theorem: if $(M_n)$ is a supermartingale with $\sup_n E[M_n^-] < \infty$, then $M_\infty = \lim_{n\to\infty} M_n$ exists and is finite a.s.

$L^2$ convergence: if $\sup_n E[M_n^2] < \infty$, then $M_n \to M_\infty$ both a.s. and in $L^2$.

Azuma-Hoeffding inequality: let $(M_n)$ be a martingale with bounded differences $|M_k - M_{k-1}| \leq c_k$ a.s. Then:

$$P(M_n - M_0 \geq t) \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).$$

This is the fundamental concentration inequality for dependent variables and is the basis for proving generalization bounds in learning theory.
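
A numerical sketch on the simplest case, a $\pm 1$ random walk with $c_k = 1$ (the horizon $n = 100$ and the thresholds are arbitrary illustration values):

```python
import numpy as np

# Sketch: Azuma-Hoeffding for a +/-1 walk (c_k = 1), n = 100.
rng = np.random.default_rng(0)
n, trials = 100, 200_000
M_n = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)
for t in (10, 20, 30):
    empirical = np.mean(M_n >= t)
    bound = np.exp(-t**2 / (2 * n))   # sum of c_k^2 is n here
    print(t, empirical, bound)        # empirical tail stays below the bound
```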

Worked Example

Example 1: Azuma-Hoeffding for Learned Models

Let $f(x_1, \ldots, x_n)$ be the empirical risk of a learned classifier. Define $M_k = E[f \mid x_1, \ldots, x_k]$. Then $(M_k)$ is a martingale (by the tower property), and if replacing one training point changes $f$ by at most $c/n$ (bounded sensitivity), then $|M_k - M_{k-1}| \leq c/n$.

Azuma-Hoeffding gives:

$$P(f - E[f] \geq t) \leq \exp\!\left(-\frac{t^2}{2n(c/n)^2}\right) = \exp\!\left(-\frac{nt^2}{2c^2}\right).$$

This is the McDiarmid inequality for functions of independent variables. It implies that the empirical risk concentrates around its mean at rate $O(1/\sqrt{n})$, matching VC theory but with sharper constants.
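
As a sketch, take the simplest bounded-difference statistic: the mean of $n$ losses in $[0, 1]$, so $c = 1$ in the bound above (uniform losses and $n = 200$ are stand-in illustration choices, not a real learned model):

```python
import numpy as np

# Sketch: f = mean of n losses in [0, 1]; changing one sample moves f by
# at most 1/n, so c = 1 in the notes' bound. Uniform losses are a stand-in.
rng = np.random.default_rng(0)
n, trials, c = 200, 200_000, 1.0
f = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
for t in (0.05, 0.10):
    print(t, np.mean(f - 0.5 >= t),   # E[f] = 0.5 for uniform losses
          np.exp(-n * t**2 / (2 * c**2)))
```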

Example 2: Gambler's Ruin with Biased Coin

A gambler starts with fortune $x$ and bets $\pm 1$ repeatedly, with win probability $p \neq 1/2$ and $q = 1 - p$. Define $M_n = \left(\frac{q}{p}\right)^{S_n}$ where $S_n$ is the current fortune. Then $M_n$ is a martingale: $E[M_{n+1} \mid S_n = s] = p(q/p)^{s+1} + q(q/p)^{s-1} = (q/p)^s = M_n$.

Apply OST with $\tau = \min\{n : S_n = 0 \text{ or } S_n = N\}$: $E[M_\tau] = E[M_0] = (q/p)^x$. Since $M_\tau = (q/p)^0 = 1$ at ruin or $(q/p)^N$ at success:

$$P(\text{ruin}) \cdot 1 + (1 - P(\text{ruin})) \cdot (q/p)^N = (q/p)^x.$$

Solving: $P(\text{ruin}) = \frac{(q/p)^N - (q/p)^x}{(q/p)^N - 1}$. For $p > 1/2$ (favorable game), $q/p < 1$, so $(q/p)^N \to 0$ as $N \to \infty$ and $P(\text{ruin}) \to (q/p)^x$: a favorable game still has positive ruin probability from finite capital.
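
A simulation sketch of the formula and its $N \to \infty$ limit ($p = 0.55$, $x = 3$, $N = 10$ are arbitrary illustration values):

```python
import random

# Sketch: biased-coin ruin probability vs the OST formula.
# p = 0.55, x = 3, N = 10 are arbitrary illustration values.
def ruined(x, N, p):
    while 0 < x < N:
        x += 1 if random.random() < p else -1
    return x == 0

random.seed(2)
p, x, N, trials = 0.55, 3, 10, 100_000
r = (1 - p) / p                       # r = q/p < 1 for a favorable game
exact = (r**N - r**x) / (r**N - 1)
p_hat = sum(ruined(x, N, p) for _ in range(trials)) / trials
print(p_hat, exact, r**x)             # simulated vs exact vs N -> inf limit
```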

Example 3: Martingales in SGD Analysis

Define the loss $L(\theta_t)$ at step $t$ of SGD. The stochastic gradient $g_t$ has $E[g_t \mid \theta_t] = \nabla L(\theta_t)$ and noise $\xi_t = g_t - \nabla L(\theta_t)$ satisfying $E[\xi_t \mid \theta_t] = 0$.

The accumulated noise term $M_t = \sum_{k=0}^{t-1} \eta_k \xi_k$ is a martingale. Azuma-Hoeffding controls its fluctuations when the gradient noise is bounded. The convergence rate of SGD — the $O(1/\sqrt{T})$ rate for convex problems — follows from balancing the deterministic progress term $-\eta_t \|\nabla L\|^2$ against the martingale noise term $\eta_t \xi_t$.
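
A toy sketch of this decomposition, assuming a 1-d quadratic loss $L(\theta) = \theta^2/2$ with bounded additive gradient noise and constant step size (this setup is an assumption for illustration, not a general SGD analysis):

```python
import numpy as np

# Sketch: SGD on L(theta) = theta^2 / 2 with bounded additive noise.
# The accumulated noise M_T = sum_k eta * xi_k is a martingale with
# |increments| <= eta, so Azuma-Hoeffding bounds its fluctuations.
rng = np.random.default_rng(0)
T, eta, trials = 1000, 0.01, 20_000
theta = np.full(trials, 5.0)
M = np.zeros(trials)                  # accumulated noise term
for t in range(T):
    xi = rng.uniform(-1.0, 1.0, size=trials)   # E[xi_t | theta_t] = 0
    g = theta + xi                    # stochastic gradient of theta^2 / 2
    theta -= eta * g
    M += eta * xi
s = 2 * eta * np.sqrt(T)              # threshold, a few noise std devs
# two-sided Azuma: P(|M_T| >= s) <= 2 exp(-s^2 / (2 T eta^2))
print(np.mean(np.abs(M) >= s), 2 * np.exp(-s**2 / (2 * T * eta**2)))
```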

Connections

Where Your Intuition Breaks

Optional stopping says you cannot profit in a fair game by choosing a clever stopping time — but the theorem requires the stopping time to be bounded, or the stopped process to be uniformly integrable. The gambler's doubling strategy (double the bet after each loss, stop after the first win) appears to always profit: the stopping time $\tau$ is almost surely finite. The flaw is not in the math but in the conditions: the maximum loss before stopping is unbounded, the process is not uniformly integrable, and the OST does not apply. More subtly, simple random walk on $\mathbb{Z}$ stopped when it first hits $1$ has bounded increments and an almost-surely-finite stopping time — yet $E[\tau] = \infty$, so no OST condition holds, and indeed $E[M_\tau] = 1 \neq 0 = E[M_0]$. The conditions of optional stopping are not formalities; they are exactly the hypotheses that prevent infinite-horizon exploitation.
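
A sketch of the doubling strategy under a finite bankroll (the 1023-unit bankroll is an arbitrary cap): once losses are bounded, the stopping time is bounded too, OST applies, and the "sure profit" evaporates.

```python
import random

# Sketch: doubling strategy with a finite bankroll (1023 units, arbitrary).
# The stopping time becomes bounded, OST applies, and E[profit] = 0.
def doubling(bankroll=1023):
    bet, lost = 1, 0
    while lost + bet <= bankroll:     # can still cover the next bet
        if random.random() < 0.5:
            return 1                  # first win recoups all losses, +1
        lost += bet
        bet *= 2
    return -lost                      # wiped out before the first win

random.seed(3)
trials = 1_000_000
print(sum(doubling() for _ in range(trials)) / trials)   # ~0.0, not +1
```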

💡Intuition

The optional stopping theorem formulates "no free lunch" precisely. Any stopping rule — doubling bets, stopping after wins, timing the market — cannot change the expected value of a fair game. This is not just an informal principle but a theorem with precise conditions. The conditions matter: if you allow unbounded stopping times and the process can grow, OST may fail (the St. Petersburg paradox involves an unbounded martingale). The theorem works exactly when you cannot "gain information from the future."

💡Intuition

Martingale differences are the right generalization of independence. Many concentration inequalities (Chernoff, Hoeffding) assume independence. Azuma-Hoeffding replaces independence with the martingale difference structure $E[D_k \mid \mathcal{F}_{k-1}] = 0$, which allows arbitrarily complex dependence between $D_k$ and past variables. This is why martingale methods appear throughout machine learning theory: training examples are not always independent (active learning, online learning, Markov chain-sampled minibatches), but the error increments can still form a martingale difference sequence.

⚠️Warning

Doob's $L^p$ inequality has a sharp constant that blows up at $p=1$. The constant $(p/(p-1))^p \to \infty$ as $p \to 1$, so no analog exists for $p=1$: there is no uniform bound on $E[\max_k |M_k|]$ in terms of $E[|M_N|]$ — the maximum can be much larger than the terminal value. This failure at $p=1$ is not a gap in the theorem; it reflects a genuine phenomenon: there exist martingales $(M_n)$ with $E[|M_n|] = 1$ for all $n$ but $E[\sup_n |M_n|] = \infty$. Processes with heavy tails require different tools (e.g., Burkholder-Davis-Gundy inequalities using the quadratic variation).
