
Measure Theory Primer: σ-Algebras, Measures & Lebesgue Integration

Measure theory provides the rigorous foundation for probability, replacing informal notions of "area under a curve" with a precise framework that handles continuous and discrete distributions uniformly. Without it, statements like "the sample mean converges almost surely" or "the derivative of an expectation is the expectation of the derivative" lack precise meaning. This lesson builds the essential machinery — $\sigma$-algebras, measures, and the Lebesgue integral — that underlies all of learning theory.

Concepts

The Riemann integral you learned in calculus can't handle the indicator function of the rationals: $\mathbf{1}_\mathbb{Q}(x) = 1$ if $x \in \mathbb{Q}$, $0$ otherwise. On $[0,1]$, the lower Riemann sum is 0 (every subinterval contains irrationals, where the function is 0) and the upper Riemann sum is 1 (every subinterval contains rationals, where it is 1) — they never agree. The Lebesgue integral handles this immediately: the rationals have measure zero, so $\int_{[0,1]} \mathbf{1}_\mathbb{Q}\,d\lambda = 1 \cdot 0 + 0 \cdot 1 = 0$ (the value on the rationals times their measure, plus the value on the irrationals times theirs). This is not a contrived example — it is the prototype for why learning theory needs the Lebesgue integral: empirical averages over "almost all" training examples, convergence "almost surely," and expectations of indicator functions of events all require Lebesgue measure to be well-defined.

Sigma-Algebras

A $\sigma$-algebra (or $\sigma$-field) on a set $\Omega$ is a collection $\mathcal{F} \subseteq 2^\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$
  2. $A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}$ (closed under complement)
  3. $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closed under countable union)

It follows that $\emptyset \in \mathcal{F}$ and $\mathcal{F}$ is also closed under countable intersection (by De Morgan). The pair $(\Omega, \mathcal{F})$ is called a measurable space. The three axioms are precisely what is needed to assign consistent sizes to sets: if you can measure $A$ and $A^c$ separately, you need their measures to sum to the measure of $\Omega$; if you can measure each $A_n$, you need to be able to measure their union (for countable unions, not just finite ones — this is what makes infinite series of probabilities well-defined).
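For a finite $\Omega$ the three axioms can be checked mechanically, since countable closure reduces to closure under pairwise union. A minimal sketch (the function name is mine):

```python
def is_sigma_algebra(omega, F):
    """Check the sigma-algebra axioms for a collection F of frozensets
    over a finite set omega. For finite omega, countable-union closure
    reduces to closure under pairwise union."""
    F = set(F)
    if frozenset(omega) not in F:                  # axiom 1: omega in F
        return False
    for A in F:
        if frozenset(omega) - A not in F:          # axiom 2: complements
            return False
    for A in F:
        for B in F:
            if A | B not in F:                     # axiom 3: unions
                return False
    return True

omega = {1, 2, 3, 4}
trivial = {frozenset(), frozenset(omega)}
partial_ = {frozenset(), frozenset({1, 2}), frozenset(omega)}  # missing {3,4}

print(is_sigma_algebra(omega, trivial))    # True
print(is_sigma_algebra(omega, partial_))   # False: complement of {1,2} absent
```

The failing example shows why the axioms bite: once $\{1,2\}$ is declared measurable, its complement $\{3,4\}$ must be measurable too.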

Examples:

| $\Omega$ | $\mathcal{F}$ | Name |
|---|---|---|
| Any $\Omega$ | $\{\emptyset, \Omega\}$ | Trivial $\sigma$-algebra |
| Any $\Omega$ | $2^\Omega$ | Discrete $\sigma$-algebra |
| $\mathbb{R}$ | $\sigma(\text{open sets})$ | Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ |
| $\mathbb{R}^n$ | $\sigma(\text{open sets in } \mathbb{R}^n)$ | Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^n)$ |

The Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ contains all open intervals, closed intervals, half-open intervals, countable sets, and their complements and countable unions. It is by far the most common $\sigma$-algebra in probability.

Generated $\sigma$-algebra. For a collection $\mathcal{C}$ of subsets, $\sigma(\mathcal{C})$ is the smallest $\sigma$-algebra containing $\mathcal{C}$. Since arbitrary intersections of $\sigma$-algebras are again $\sigma$-algebras, $\sigma(\mathcal{C}) = \bigcap \{\mathcal{F} : \mathcal{F} \text{ is a } \sigma\text{-algebra and } \mathcal{C} \subseteq \mathcal{F}\}$.
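For finite $\Omega$, $\sigma(\mathcal{C})$ can also be computed constructively, by closing $\mathcal{C}$ under complement and union until a fixed point is reached. A sketch (the helper name is mine):

```python
def generate_sigma_algebra(omega, C):
    """Compute sigma(C) for a finite set omega by closing C under
    complement and pairwise union until nothing new appears."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in C}
    while True:
        new = {omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:           # fixed point: F is closed, so F = sigma(C)
            return F
        F |= new

F = generate_sigma_algebra({1, 2, 3}, [{1}])
print(sorted(tuple(sorted(A)) for A in F))
# the four sets of sigma({{1}}): {}, {1}, {2,3}, {1,2,3}
```

Note this brute force only works because $\Omega$ is finite; for $\mathcal{B}(\mathbb{R})$ the intersection characterization above is the only handle.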

Why $\sigma$-algebras? The Vitali set construction (using the Axiom of Choice) shows that there exists a subset $V \subseteq [0,1]$ with no consistent notion of length: assigning any value to $\lambda(V)$ leads to either $\lambda([0,1]) \neq 1$ or non-additivity. The $\sigma$-algebra is the fence that excludes pathological sets, keeping measure theory consistent.

$\sigma$-algebras as information. In stochastic processes, the filtration $\mathcal{F}_t$ (an increasing family of $\sigma$-algebras) represents information up to time $t$. A random variable $X_t$ is $\mathcal{F}_t$-measurable iff its value can be determined from information up to time $t$ — the key concept in martingale theory and reinforcement learning.

Measures

A measure on $(\Omega, \mathcal{F})$ is a function $\mu : \mathcal{F} \to [0, +\infty]$ with:

  1. $\mu(\emptyset) = 0$
  2. $\sigma$-additivity: for pairwise disjoint $A_1, A_2, \ldots \in \mathcal{F}$:

$$\mu\!\left(\bigsqcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty \mu(A_n).$$

Key measures:

| Measure | Definition | Role |
|---|---|---|
| Lebesgue $\lambda$ | $\lambda([a,b]) = b-a$ | Length/area/volume |
| Counting $\#$ | $\#(A) = \lvert A \rvert$ | Discrete distributions (sums as integrals) |
| Dirac $\delta_x$ | $\delta_x(A) = \mathbf{1}[x \in A]$ | Point mass |
| Probability $P$ | $\mu$ with $\mu(\Omega) = 1$ | Probability theory |
| Product $\mu_1 \otimes \mu_2$ | $(\mu_1\otimes\mu_2)(A_1\times A_2) = \mu_1(A_1)\mu_2(A_2)$ | Joint distributions |

Continuity of measure. If $A_1 \subseteq A_2 \subseteq \ldots$ (increasing), then $\mu(\bigcup_n A_n) = \lim_n \mu(A_n)$. If $B_1 \supseteq B_2 \supseteq \ldots$ (decreasing) and $\mu(B_1) < \infty$, then $\mu(\bigcap_n B_n) = \lim_n \mu(B_n)$. These continuity properties follow from $\sigma$-additivity and are central to proving convergence theorems.
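A quick numeric illustration of continuity from below, using the Exp(1) probability measure with its analytic CDF standing in for $\mu$ (my choice of example):

```python
import math

# Continuity from below: A_n = [0, n] increase to [0, infinity),
# and for the Exp(1) probability measure, P(A_n) = 1 - e^{-n}.
P = lambda n: 1 - math.exp(-n)   # P([0, n]) for the Exp(1) distribution

vals = [P(n) for n in range(1, 31)]
assert all(a <= b for a, b in zip(vals, vals[1:]))  # monotone in n
print(vals[-1])  # approaches P([0, inf)) = 1, as continuity demands
```

The finiteness condition for decreasing sets is not cosmetic: with counting measure on the naturals, $B_n = \{n, n+1, \ldots\}$ decrease to $\emptyset$, yet every $\#(B_n) = \infty$, so the limit of the measures is not $\#(\emptyset) = 0$.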

Measurable Functions

A function $f : (\Omega_1, \mathcal{F}_1) \to (\Omega_2, \mathcal{F}_2)$ is measurable if $f^{-1}(B) \in \mathcal{F}_1$ for all $B \in \mathcal{F}_2$.

For $f : \mathbb{R} \to \mathbb{R}$ with Borel $\sigma$-algebras, $f$ is measurable iff $\{x : f(x) \leq c\} \in \mathcal{B}(\mathbb{R})$ for all $c \in \mathbb{R}$.

Preserved under operations:

  • Continuous functions are measurable.
  • Sums, products, and compositions of measurable functions are measurable.
  • Pointwise limits ($f = \lim_n f_n$) of measurable functions are measurable.
  • $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, $\liminf_n f_n$ are measurable.

The Lebesgue Integral

The Lebesgue integral of $f$ against $\mu$ is built in four stages:

Stage 1: Indicator functions. $\int \mathbf{1}_A \, d\mu = \mu(A)$.

Stage 2: Simple functions. For a simple function $\phi = \sum_{k=1}^n a_k \mathbf{1}_{A_k}$ (nonnegative, finite range, disjoint $A_k$):

$$\int \phi \, d\mu = \sum_{k=1}^n a_k \mu(A_k).$$

Stage 3: General nonnegative functions. For $f \geq 0$ measurable:

$$\int f \, d\mu = \sup\left\{\int \phi \, d\mu : 0 \leq \phi \leq f, \; \phi \text{ simple}\right\}.$$

Stage 4: Signed functions. Write $f = f^+ - f^-$; then $\int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu$ when at least one term is finite.
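Stages 2 and 3 can be seen numerically: approximate $f$ from below by the dyadic simple functions $\phi_n = \min(n, \lfloor 2^n f \rfloor / 2^n)$ and apply the Stage-2 formula, estimating $\lambda(A_k)$ by the fraction of a fine grid landing in each level set. A sketch under those grid assumptions (function name is mine):

```python
import numpy as np

def simple_function_integral(f, n, grid):
    """Integrate the dyadic simple approximation phi_n of f >= 0
    against Lebesgue measure on [0, 1], with lambda(A_k) estimated
    by the fraction of grid points falling in each level set A_k."""
    vals = f(grid)
    # phi_n(x) = min(n, floor(2^n f(x)) / 2^n): round f down to dyadic levels
    levels = np.minimum(np.floor(vals * 2**n) / 2**n, n)
    total = 0.0
    for a_k in np.unique(levels):
        measure_A_k = np.mean(levels == a_k)  # empirical lambda(A_k)
        total += a_k * measure_A_k            # Stage-2 formula
    return total

grid = np.linspace(0, 1, 200_001)
for n in (2, 4, 8):
    print(n, simple_function_integral(lambda x: x**2, n, grid))
# estimates increase with n toward 1/3, approaching the Stage-3 supremum
```

Because $\phi_n \leq \phi_{n+1} \leq f$ pointwise, the estimates climb monotonically toward $\int_0^1 x^2\,dx = 1/3$, which is exactly the Monotone Convergence Theorem in miniature.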

Lebesgue vs Riemann. For bounded continuous $f$ on $[a,b]$: they agree. For discontinuous $f$: Lebesgue handles it (the Dirichlet function $\mathbf{1}_\mathbb{Q}$ has Lebesgue integral 0 but no Riemann integral). For limits: Lebesgue allows $\lim\int f_n = \int\lim f_n$ under mild conditions (MCT, DCT); Riemann requires uniform convergence.

Convergence Theorems

Monotone Convergence Theorem (MCT). Let $0 \leq f_1 \leq f_2 \leq \ldots$ with $f_n \to f$ pointwise. Then:

$$\int f \, d\mu = \lim_{n\to\infty} \int f_n \, d\mu.$$

Proof idea. Since $\int f_n$ is increasing, the limit exists (possibly $\infty$). Use the approximation of $f$ by simple functions to show the limit equals $\int f$.

Fatou's Lemma. For $f_n \geq 0$:

$$\int \liminf_{n} f_n \, d\mu \leq \liminf_{n} \int f_n \, d\mu.$$

Dominated Convergence Theorem (DCT). If $f_n \to f$ $\mu$-a.e. and $|f_n| \leq g$ for all $n$ with $\int g \, d\mu < \infty$, then:

$$\lim_{n\to\infty} \int f_n \, d\mu = \int f \, d\mu.$$

DCT is used constantly: to differentiate through integrals, exchange sums and integrals, and pass limits inside expectations.
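The domination hypothesis is not decorative. The classic escaping-mass example $f_n = n\,\mathbf{1}_{(0,1/n)}$ converges to 0 pointwise while every integral equals 1; a grid sketch:

```python
import numpy as np

# f_n = n * 1_{(0, 1/n)} on [0, 1]: f_n -> 0 pointwise, yet every
# integral equals 1. sup_n f_n behaves like 1/x near 0, which is not
# integrable, so no dominating g exists and DCT does not apply.
x = np.linspace(0, 1, 1_000_001)
dx = x[1] - x[0]

for n in (10, 100, 1000):
    f_n = n * ((x > 0) & (x < 1 / n))
    print(n, np.sum(f_n) * dx)   # ~1.0 for every n

# The integral of the pointwise limit is 0, while the limit of the
# integrals is 1: Fatou's inequality 0 <= 1 holds strictly here.
```

This is also the picture to keep in mind for Fatou's Lemma: mass can escape in the limit, which is why the inequality only goes one way.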

Radon-Nikodym Theorem

If $\nu$ and $\mu$ are $\sigma$-finite measures on $(\Omega,\mathcal{F})$ and $\nu \ll \mu$ (absolutely continuous: $\mu(A)=0 \Rightarrow \nu(A)=0$), there exists a $\mu$-a.e. unique nonnegative measurable $p$ with:

$$\nu(A) = \int_A p \, d\mu \quad \forall A \in \mathcal{F}.$$

The function $p = d\nu/d\mu$ is the Radon-Nikodym derivative or density.

In probability: the PDF of a continuous distribution is the Radon-Nikodym derivative w.r.t. Lebesgue measure. The KL divergence is $\mathrm{KL}(Q\|P) = \mathbb{E}_Q[\log(dQ/dP)]$.

Importance sampling: $\mathbb{E}_P[f] = \mathbb{E}_Q[f \cdot dP/dQ]$ for $P \ll Q$.
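The importance-sampling identity is directly checkable by simulation; a minimal sketch with Gaussians (the choice $P = \mathcal{N}(0,1)$, $Q = \mathcal{N}(0,2^2)$, and the test function are mine):

```python
import numpy as np

# E_P[f] = E_Q[f * dP/dQ] for P << Q, with P = N(0, 1) and Q = N(0, 2^2).
# Here dP/dQ is simply the ratio of the two Gaussian densities.
rng = np.random.default_rng(0)

def normal_pdf(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x**2                           # E_P[X^2] = 1 under P = N(0, 1)
x = rng.normal(0.0, 2.0, size=200_000)       # sample from Q
w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)  # Radon-Nikodym derivative dP/dQ
print(np.mean(f(x) * w))                     # ~1.0
```

Note the direction of absolute continuity matters: sampling from the wider $Q$ keeps the weights $dP/dQ$ bounded, whereas sampling from a narrower $Q$ would make the weights heavy-tailed.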

Worked Example

Example 1: DCT for Gradient of Expectation

Claim. Under mild conditions, $\frac{d}{d\theta}\mathbb{E}[f(X;\theta)] = \mathbb{E}[\partial_\theta f(X;\theta)]$, where the distribution of $X$ is fixed and the $\theta$-dependence sits in the integrand. (When the distribution itself depends on $\theta$, write the expectation as an integral against the density $p_\theta$ and the same argument applies.)

Formal requirement. As $h \to 0$, the difference quotient $(f(X;\theta+h) - f(X;\theta))/h \to \partial_\theta f(X;\theta)$ pointwise (differentiability). If there exists $g$ with $|\partial_\theta f(X;\theta')| \leq g(X)$ for all $\theta'$ near $\theta$ (so that, by the mean value theorem, every difference quotient is also dominated by $g(X)$), and $\mathbb{E}[g(X)] < \infty$, then DCT gives:

$$\frac{d}{d\theta}\mathbb{E}[f(X;\theta)] = \lim_{h\to 0}\frac{\mathbb{E}[f(X;\theta+h)] - \mathbb{E}[f(X;\theta)]}{h} = \mathbb{E}\!\left[\lim_{h\to 0}\frac{f(X;\theta+h)-f(X;\theta)}{h}\right] = \mathbb{E}[\partial_\theta f(X;\theta)].$$

Application: the REINFORCE gradient estimator $\nabla_\theta \mathbb{E}[r(\tau)] = \mathbb{E}[r(\tau)\nabla_\theta \log p_\theta(\tau)]$ is derived by applying DCT after the log-derivative trick.
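The score-function estimator can be sanity-checked on a toy problem where the gradient is known in closed form. Assuming $x \sim \mathcal{N}(\theta, 1)$ and reward $r(x) = x^2$ (my choices), $\nabla_\theta \log p_\theta(x) = x - \theta$ and the true gradient of $\mathbb{E}[x^2] = \theta^2 + 1$ is $2\theta$:

```python
import numpy as np

# Score-function (REINFORCE) estimator for d/dtheta E_{x ~ N(theta,1)}[x^2].
# Here grad_theta log p_theta(x) = x - theta, and since E[x^2] = theta^2 + 1
# the analytic gradient is 2 * theta.
rng = np.random.default_rng(1)
theta = 1.5

x = rng.normal(theta, 1.0, size=500_000)
grad_est = np.mean(x**2 * (x - theta))   # E[r(x) * grad log p_theta(x)]
print(grad_est)                          # ~3.0 = 2 * theta
```

The estimator is unbiased because the DCT hypotheses hold here: the reward has all moments under the Gaussian, so the limit/expectation swap is valid.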

Example 2: Lebesgue Integration Unifies Discrete and Continuous

For a discrete $X$ taking values $\{x_k\}$ with probabilities $\{p_k\}$: $\mathbb{E}[f(X)] = \int f \, dP = \sum_k f(x_k)p_k$ (using counting measure as reference). For continuous $X$ with PDF $p(x)$: $\mathbb{E}[f(X)] = \int f(x)p(x)\,dx$. Both are Lebesgue integrals — the sum formula is the integral against the counting measure, and the integral formula is the integral against Lebesgue measure. No separate derivation needed for discrete vs continuous; the same theory covers both.
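The two formulas are one functional with two reference measures; a sketch (function names and test distributions are mine):

```python
import numpy as np

def expect_discrete(f, xs, ps):
    """E[f(X)] = integral of f dP, reference measure = counting measure."""
    return sum(f(x) * p for x, p in zip(xs, ps))

def expect_continuous(f, pdf, grid):
    """E[f(X)] = integral of f(x) p(x) dx, reference = Lebesgue measure."""
    return float(np.sum(f(grid) * pdf(grid)) * (grid[1] - grid[0]))

# Same functional E[f(X)], two reference measures:
print(expect_discrete(lambda x: x**2, [0, 1, 2], [0.25, 0.5, 0.25]))  # 1.5
grid = np.linspace(-10, 10, 100_001)
gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(expect_continuous(lambda x: x**2, gauss, grid))                 # ~1.0
```

Both functions are the same Lebesgue integral $\int f\,dP$; only the way $P$ decomposes against its reference measure differs.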

Example 3: Why Not All Subsets Are Measurable

The Vitali construction: partition $[0,1]$ into equivalence classes under the relation $x \sim y$ iff $x - y \in \mathbb{Q}$. By the Axiom of Choice, pick one representative from each class to form a set $V$. Suppose $\lambda(V) = c \geq 0$. The rational translates $V + q = \{v+q : v \in V\}$ for $q \in \mathbb{Q}\cap[-1,1]$ are pairwise disjoint, their union contains $[0,1]$, and they all lie inside $[-1,2]$. By translation invariance and $\sigma$-additivity, $1 = \lambda([0,1]) \leq \sum_{q}\lambda(V+q) = \sum_q c \leq \lambda([-1,2]) = 3$. If $c = 0$ the sum is $0 < 1$; if $c > 0$ the sum is $\infty > 3$. Contradiction either way. So $V$ has no consistent length — it is non-measurable.

Connections

Where Your Intuition Breaks

Measure theory looks like overkill for practical ML — you can compute expectations with Riemann integrals and never invoke Lebesgue's name. The hidden dependency is the dominated convergence theorem: you can interchange a limit and an integral ($\lim_n \int f_n = \int \lim_n f_n$) whenever there exists an integrable dominating function. Without this, the claim that "the expected gradient equals the gradient of the expected loss" — $\nabla_\theta \mathbb{E}[L] = \mathbb{E}[\nabla_\theta L]$ — doesn't follow automatically, and Riemann integration has no clean analogue of dominated convergence. When you take gradients through expectations in neural network theory, policy gradient derivations, or score function estimators, you are implicitly invoking DCT. If its hypothesis fails (e.g., unbounded gradients, heavy-tailed losses), the interchange is invalid and you need to either bound the gradient explicitly or use a different estimator.

💡Intuition

The Lebesgue integral unifies discrete and continuous probability. The expectation $\mathbb{E}[f(X)]$ is $\int f\,dP$ regardless of whether $X$ is discrete or continuous — the choice of reference measure (counting vs Lebesgue) is the only difference. This is why probability theorems (LLN, CLT, etc.) hold for both cases without separate proofs. In ML, mixed distributions (continuous embeddings with discrete structure, or energy functions over graphs) fit naturally into this framework.

💡Intuition

Absolute continuity is why PDFs exist. A continuous distribution $P$ on $\mathbb{R}$ has a density $p(x) = dP/d\lambda$ because $P \ll \lambda$: any set of zero length has zero probability. For a discrete distribution, $P$ is absolutely continuous w.r.t. counting measure and the PMF is the Radon-Nikodym derivative. Distributions with mixed components (e.g., having both atoms and a continuous part) require careful handling — they are not absolutely continuous w.r.t. either reference measure alone.

⚠️Warning

DCT requires finding a dominating function. When differentiating through an expectation or swapping limit and integral, you must exhibit an integrable $g$ with $|f_n| \leq g$. For bounded losses (e.g., cross-entropy with clipped logits, bounded reward), this is automatic. For heavy-tailed distributions (e.g., policy gradient with unbounded rewards), it can fail — gradients may have infinite variance and the swap is invalid, leading to biased gradient estimates or divergence.
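A quick way to feel this failure mode is to compare light- and heavy-tailed samples. A Cauchy "reward" has no mean at all, so no integrable dominating function exists and averaging guarantees evaporate; a sketch (the Cauchy-vs-Gaussian comparison is my illustration):

```python
import numpy as np

# Heavy tails break the DCT hypothesis: a Cauchy reward has no mean,
# so sample statistics never stabilize the way Gaussian ones do.
rng = np.random.default_rng(0)
n = 100_000
cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

print("gaussian sample std:", np.std(normal))   # ~1
print("cauchy sample std:  ", np.std(cauchy))   # outlier-driven, far larger
```

In a policy-gradient setting the analogous symptom is gradient estimates whose empirical variance keeps growing with the batch, a sign that the swap behind the estimator was never justified.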
