Measure Theory Primer: σ-Algebras, Measures & Lebesgue Integration
Measure theory provides the rigorous foundation for probability, replacing informal notions of "area under a curve" with a precise framework that handles continuous and discrete distributions uniformly. Without it, statements like "the sample mean converges almost surely" or "the derivative of an expectation is the expectation of the derivative" lack precise meaning. This lesson builds the essential machinery — $\sigma$-algebras, measures, and the Lebesgue integral — that underlies all of learning theory.
Concepts
The Riemann integral you learned in calculus can't handle the indicator function of the rationals: $f(x) = 1$ if $x \in \mathbb{Q}$, 0 otherwise. The lower Riemann sum is 0 (the irrationals are dense, so every subinterval contains points where $f = 0$) and the upper Riemann sum is 1 (the rationals are dense too) — they never agree. The Lebesgue integral handles this immediately: the rationals have measure zero, so $\int_{[0,1]} f \, d\lambda = 1 \cdot \lambda(\mathbb{Q} \cap [0,1]) + 0 \cdot \lambda([0,1] \setminus \mathbb{Q}) = 1 \cdot 0 + 0 \cdot 1 = 0$ (measure of the rationals times their value, plus measure of the irrationals times theirs). This is not a contrived example — it is the prototype for why learning theory needs the Lebesgue integral: empirical averages over "almost all" training examples, convergence "almost surely," and expectations of indicator functions of events all require Lebesgue measure to be well-defined.
Sigma-Algebras
A $\sigma$-algebra (or $\sigma$-field) on a set $\Omega$ is a collection $\mathcal{F} \subseteq 2^\Omega$ satisfying:
- $\Omega \in \mathcal{F}$
- $A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}$ (closed under complement)
- $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closed under countable union)
It follows that $\emptyset \in \mathcal{F}$ and that $\mathcal{F}$ is also closed under countable intersection (by De Morgan). The pair $(\Omega, \mathcal{F})$ is called a measurable space. The three axioms are precisely what is needed to assign consistent sizes to sets: if you can measure $A$ and $A^c$ separately, you need their measures to sum to the measure of $\Omega$; if you can measure each $A_n$, you need to be able to measure their union (for countable unions, not just finite ones — this is what makes infinite series of probabilities well-defined).
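On a finite ground set the three axioms can be checked directly. Below is a minimal sketch (the helper `is_sigma_algebra` is illustrative, not a library function); on a finite $\Omega$, closure under countable unions reduces to closure under pairwise unions.

```python
def is_sigma_algebra(omega, F):
    """Check the sigma-algebra axioms for a collection F of subsets of a
    finite set omega. Countable unions reduce to pairwise unions here."""
    F = {frozenset(A) for A in F}
    omega = frozenset(omega)
    if omega not in F:
        return False                      # axiom 1: omega itself is measurable
    if any(omega - A not in F for A in F):
        return False                      # axiom 2: closed under complement
    if any(A | B not in F for A in F for B in F):
        return False                      # axiom 3: closed under union
    return True

omega = {1, 2, 3, 4}
print(is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega]))  # True
print(is_sigma_algebra(omega, [set(), {1}, omega]))             # False: {2,3,4} missing
```

The second collection fails because the complement of $\{1\}$ is absent — exactly the kind of gap the axioms forbid.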
Examples:
| Name | ||
|---|---|---|
| Any | Trivial -algebra | |
| Any | Discrete -algebra | |
| Borel -algebra | ||
| Borel -algebra |
The Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ contains all open intervals, closed intervals, half-open intervals, countable sets, and their complements and countable unions. It is by far the most common $\sigma$-algebra in probability.
Generated $\sigma$-algebra. For a collection $\mathcal{C}$ of subsets of $\Omega$, $\sigma(\mathcal{C})$ is the smallest $\sigma$-algebra containing $\mathcal{C}$. Since arbitrary intersections of $\sigma$-algebras are again $\sigma$-algebras, $\sigma(\mathcal{C}) = \bigcap \{\mathcal{G} : \mathcal{G} \supseteq \mathcal{C},\ \mathcal{G} \text{ a } \sigma\text{-algebra}\}$.
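For a finite $\Omega$, $\sigma(\mathcal{C})$ can be computed by iterating closure under complement and union to a fixed point — a minimal sketch (the helper name `generated_sigma_algebra` is illustrative):

```python
def generated_sigma_algebra(omega, C):
    """sigma(C) on a finite omega: close C (plus the empty set and omega)
    under complement and pairwise union until nothing new appears."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in C}
    while True:
        new = {omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:            # fixed point reached: F is a sigma-algebra
            return F
        F |= new

omega = {1, 2, 3, 4}
F = generated_sigma_algebra(omega, [{1}, {2}])
print(len(F))   # 8: the atoms are {1}, {2}, {3,4}, giving 2^3 sets
```

The generated $\sigma$-algebra is exactly all unions of the atoms $\{1\}$, $\{2\}$, $\{3,4\}$ — points 3 and 4 can never be distinguished by sets built from $\mathcal{C}$.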
Why $\sigma$-algebras? The Vitali set construction (using the Axiom of Choice) shows that there exists a subset $V \subseteq [0,1)$ with no consistent notion of length: assigning any value to $\lambda(V)$ forces $\lambda([0,1))$ to be either $0$ or $\infty$, contradicting $\lambda([0,1)) = 1$. The $\sigma$-algebra is the fence that excludes pathological sets, keeping measure theory consistent.
$\sigma$-algebras as information. In stochastic processes, the filtration $(\mathcal{F}_t)_{t \geq 0}$ (an increasing family of $\sigma$-algebras) represents information up to time $t$. A random variable is $\mathcal{F}_t$-measurable iff its value can be determined from information up to time $t$ — the key concept in martingale theory and reinforcement learning.
Measures
A measure $\mu$ on $(\Omega, \mathcal{F})$ is a function $\mu: \mathcal{F} \to [0, \infty]$ with:
- $\mu(\emptyset) = 0$
- $\sigma$-additivity: for pairwise disjoint $A_1, A_2, \ldots \in \mathcal{F}$:
$$\mu\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty \mu(A_n)$$
Key measures:
| Measure | Definition | Role |
|---|---|---|
| Lebesgue $\lambda$ | Length/area/volume | Continuous distributions |
| Counting | $\#(A) = \lvert A \rvert$ | Discrete distributions |
| Dirac $\delta_x$ | $\delta_x(A) = \mathbf{1}[x \in A]$ | Point mass |
| Probability $P$ | $P$ with $P(\Omega) = 1$ | Probability theory |
| Product $\mu \otimes \nu$ | $(\mu \otimes \nu)(A \times B) = \mu(A)\,\nu(B)$ | Joint distributions |
Continuity of measure. If $A_1 \subseteq A_2 \subseteq \cdots$ with $A_n \uparrow A$ (increasing), then $\mu(A_n) \to \mu(A)$. If $A_n \downarrow A$ (decreasing) and $\mu(A_1) < \infty$, then $\mu(A_n) \to \mu(A)$. These continuity properties follow from $\sigma$-additivity and are central to proving convergence theorems.
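A minimal numeric sketch of both directions, using interval lengths as a stand-in for Lebesgue measure (the helper `lam` is illustrative):

```python
def lam(a, b):
    """Lebesgue measure of the interval [a, b] (0 if degenerate)."""
    return max(b - a, 0.0)

# Increasing: A_n = [0, 1 - 1/n] increases to [0, 1); measures -> 1.
up = [lam(0, 1 - 1/n) for n in (1, 10, 100, 1000)]
print(up)     # [0.0, 0.9, 0.99, 0.999] -> 1

# Decreasing: B_n = [0, 1/n] decreases to {0}; measures -> lam({0}) = 0.
down = [lam(0, 1/n) for n in (1, 10, 100, 1000)]
print(down)   # [1.0, 0.1, 0.01, 0.001] -> 0

# The finiteness hypothesis matters: C_n = [n, infinity) decreases to the
# empty set, yet every C_n has infinite measure -- downward continuity
# fails without mu(A_1) < infinity.
```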
Measurable Functions
A function $f: (\Omega, \mathcal{F}) \to (\Omega', \mathcal{F}')$ is measurable if $f^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{F}'$.
For $f: \mathbb{R} \to \mathbb{R}$ with Borel $\sigma$-algebras, $f$ is measurable iff $\{x : f(x) \leq c\} \in \mathcal{B}(\mathbb{R})$ for all $c \in \mathbb{R}$.
Preserved under operations:
- Continuous functions are measurable.
- Sums, products, and compositions of measurable functions are measurable.
- Pointwise limits ($\lim_n f_n$, when it exists) of measurable functions are measurable.
- $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, $\liminf_n f_n$ are measurable.
The Lebesgue Integral
The Lebesgue integral of $f$ against $\mu$, written $\int f \, d\mu$, is built in four stages:
Stage 1: Indicator functions. $\int \mathbf{1}_A \, d\mu = \mu(A)$.
Stage 2: Simple functions. A simple function $s = \sum_{i=1}^n a_i \mathbf{1}_{A_i}$ (nonneg $a_i$, finite range, disjoint $A_i$): $\int s \, d\mu = \sum_{i=1}^n a_i \, \mu(A_i)$.
Stage 3: General nonneg functions. For $f \geq 0$ measurable: $\int f \, d\mu = \sup\left\{ \int s \, d\mu : 0 \leq s \leq f,\ s \text{ simple} \right\}$.
Stage 4: Signed functions. $\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu$ where $f^{\pm} = \max(\pm f, 0)$, defined when at least one term is finite.
Lebesgue vs Riemann. For bounded continuous $f$ on $[a,b]$: they agree. For discontinuous $f$: Lebesgue handles it (the Dirichlet function has Lebesgue integral 0 but no Riemann integral). For limits: Lebesgue allows $\lim_n \int f_n \, d\mu = \int \lim_n f_n \, d\mu$ under mild conditions (MCT, DCT); Riemann requires uniform convergence.
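The staged construction can be sketched numerically. `integrate_simple` below is Stage 2 verbatim; the Stage 3 approximation uses the layer-cake identity $\int f \, d\lambda = \int_0^\infty \lambda(\{f > t\}) \, dt$, with level sets of $f(x) = x^2$ on $[0,1]$ measured exactly (both helpers are illustrative, not library functions):

```python
import math

def integrate_simple(parts, mu):
    """Stage 2: integral of s = sum_i a_i * 1_{A_i} (disjoint A_i)
    is sum_i a_i * mu(A_i). parts is a list of (a_i, A_i) pairs."""
    return sum(a * mu(A) for a, A in parts)

# mu = Lebesgue measure, with sets encoded as unions of disjoint intervals.
length = lambda intervals: sum(b - a for a, b in intervals)

# s = 2 on [0,1) plus 5 on [1,3): integral = 2*1 + 5*2 = 12.
s = [(2.0, [(0.0, 1.0)]), (5.0, [(1.0, 3.0)])]
print(integrate_simple(s, length))   # 12.0

# Stage 3: approximate int_0^1 x^2 dx = 1/3 from below via the layer-cake
# sum (1/N) * sum_k lambda({x in [0,1] : x^2 > k/N}), where that level set
# is (sqrt(k/N), 1] with measure 1 - sqrt(k/N).
N = 4096
approx = sum((1 - math.sqrt(k / N)) / N for k in range(1, N + 1))
print(approx)   # close to 1/3, from below
```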
Convergence Theorems
Monotone Convergence Theorem (MCT). Let $0 \leq f_1 \leq f_2 \leq \cdots$ be measurable with $f_n \uparrow f$ pointwise. Then:
$$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$$
Proof idea. Since the sequence $\int f_n \, d\mu$ is increasing, the limit exists (possibly $\infty$). Use the approximation of $f$ by simple functions to show the limit equals $\int f \, d\mu$.
Fatou's Lemma. For measurable $f_n \geq 0$:
$$\int \liminf_{n \to \infty} f_n \, d\mu \leq \liminf_{n \to \infty} \int f_n \, d\mu$$
Dominated Convergence Theorem (DCT). If $f_n \to f$ $\mu$-a.e. and $|f_n| \leq g$ for all $n$ with $\int g \, d\mu < \infty$, then:
$$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$$
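A minimal sketch of the classic escaping-mass example $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$ on $[0,1]$, which shows Fatou's inequality can be strict and why DCT needs a dominating function:

```python
# f_n = n * 1_{(0, 1/n)}: every f_n integrates to n * (1/n) = 1, yet
# f_n(x) -> 0 for every fixed x. The mass escapes into a spike at 0:
#   int(liminf f_n) = 0  <  1 = liminf int(f_n)   (strict Fatou)
# No integrable g dominates all f_n (g would have to exceed n near 0
# for every n), so DCT does not apply -- consistent with the failed swap.
def f(n, x):
    return float(n) if 0 < x < 1 / n else 0.0

for n in (1, 10, 100, 1000):
    exact_integral = n * (1 / n)     # lambda((0, 1/n)) = 1/n, value n
    print(n, exact_integral)         # always 1.0

# the pointwise limit is 0 at every fixed x in (0, 1]
print(all(f(10**6, x) == 0.0 for x in (0.25, 0.5, 0.9)))   # True
```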
DCT is used constantly: to differentiate through integrals, exchange sums and integrals, pass limits inside expectations.
Radon-Nikodym Theorem
If $\mu$ and $\nu$ are $\sigma$-finite measures on $(\Omega, \mathcal{F})$ and $\nu \ll \mu$ (absolutely continuous: $\mu(A) = 0 \Rightarrow \nu(A) = 0$), there exists a $\mu$-a.e. unique nonneg measurable $f$ with:
$$\nu(A) = \int_A f \, d\mu \quad \text{for all } A \in \mathcal{F}$$
The function $f = \frac{d\nu}{d\mu}$ is the Radon-Nikodym derivative or density.
In probability: The PDF of a continuous distribution is the Radon-Nikodym derivative $\frac{dP}{d\lambda}$ w.r.t. Lebesgue measure. The KL divergence is $D_{\mathrm{KL}}(P \,\|\, Q) = \int \log \frac{dP}{dQ} \, dP$ for $P \ll Q$.
Importance sampling: $\mathbb{E}_p[f(X)] = \mathbb{E}_q\!\left[f(X)\,\frac{p(X)}{q(X)}\right]$ for $p \ll q$.
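A minimal importance-sampling sketch, with illustrative choices $p = \mathcal{N}(0,1)$, $q = \mathcal{N}(0,2^2)$, and $f(x) = x^2$ (so the true value is $\mathbb{E}_p[X^2] = 1$); only the standard library is assumed:

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Estimate E_p[X^2] = 1 for p = N(0,1) by sampling from q = N(0,4) and
# reweighting each sample by the density ratio dp/dq = p(x)/q(x).
n = 200_000
xs = [random.gauss(0.0, 2.0) for _ in range(n)]
weights = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in xs]
estimate = sum(w * x * x for w, x in zip(weights, xs)) / n
print(estimate)   # close to 1.0
```

Note that $q$ must have heavier tails than $p$ here ($p \ll q$ with a bounded ratio), which keeps the weights, and hence the estimator's variance, under control.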
Worked Examples
Example 1: DCT for Gradient of Expectation
Claim. Under mild conditions, $\nabla_\theta \, \mathbb{E}_X[f_\theta(X)] = \mathbb{E}_X[\nabla_\theta f_\theta(X)]$.
Formal requirement. Let $F(\theta) = \mathbb{E}[f_\theta(X)] = \int f_\theta(x) \, d\mu(x)$. The difference quotient $\frac{f_{\theta+h}(x) - f_\theta(x)}{h} \to \partial_\theta f_\theta(x)$ pointwise (differentiability). If there exists $g$ with $\left| \frac{f_{\theta+h}(x) - f_\theta(x)}{h} \right| \leq g(x)$ for all $h$ near 0, and $\int g \, d\mu < \infty$, then DCT gives:
$$F'(\theta) = \int \partial_\theta f_\theta(x) \, d\mu(x) = \mathbb{E}[\partial_\theta f_\theta(X)]$$
Application: The REINFORCE gradient estimator is derived by applying DCT after the log-derivative trick.
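A minimal sketch of a score-function (REINFORCE-style) estimator, with illustrative choices $X \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f(X)] = \theta^2 + 1$ and the true gradient is $2\theta$; the Gaussian score is $\partial_\theta \log p_\theta(x) = x - \theta$:

```python
import random

random.seed(1)

# Log-derivative trick: d/dtheta E_{x ~ p_theta}[f(x)]
#   = E[f(x) * d/dtheta log p_theta(x)] = E[f(x) * (x - theta)]  for a
# Gaussian. DCT justifies moving d/dtheta inside the integral: the
# integrand and its theta-derivative are dominated by integrable
# Gaussian-tail bounds on a neighborhood of theta.
theta = 1.0
n = 500_000
xs = [random.gauss(theta, 1.0) for _ in range(n)]
grad_est = sum(x * x * (x - theta) for x in xs) / n
print(grad_est)   # close to 2 * theta = 2.0
```

If $f$ were unbounded enough that no dominating function exists (heavy-tailed rewards), the single-sample terms $f(x)(x-\theta)$ can have infinite variance and the estimate above stops being trustworthy — the failure mode discussed below.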
Example 2: Lebesgue Integration Unifies Discrete and Continuous
For a discrete $X$ taking values $x_1, x_2, \ldots$ with probabilities $p_1, p_2, \ldots$: $\mathbb{E}[X] = \sum_i x_i \, p_i$ (using counting measure as reference). For continuous $X$ with PDF $f$: $\mathbb{E}[X] = \int x \, f(x) \, dx$. Both are Lebesgue integrals — the sum formula is the integral against the counting measure, and the integral formula is the integral against Lebesgue measure. No separate derivation needed for discrete vs continuous; the same theory covers both.
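A minimal sketch of the same expectation $\mathbb{E}[X] = \int x \, dP_X$ computed against both reference measures, with illustrative distributions (a three-point PMF, and Exponential(1) whose mean is exactly 1); the continuous case uses a plain midpoint rule as a stand-in for the Lebesgue integral:

```python
import math

# Discrete: reference measure = counting measure, density = PMF.
xs, ps = [0, 1, 2], [0.2, 0.5, 0.3]
e_discrete = sum(x * p for x, p in zip(xs, ps))
print(e_discrete)   # 1.1

# Continuous: reference measure = Lebesgue, density = PDF.
# Exponential(1) has pdf e^{-x} on [0, inf) and E[X] = 1; truncating at
# 50 loses only ~51 * e^{-50} of mass, far below our tolerance.
N, hi = 200_000, 50.0
h = hi / N
e_continuous = sum(((i + 0.5) * h) * math.exp(-(i + 0.5) * h) * h
                   for i in range(N))
print(e_continuous)   # close to 1.0
```

The only thing that changed between the two computations is the reference measure (and hence sum vs integral); the formula $\int x \, \text{(density)} \, d(\text{reference})$ is identical.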
Example 3: Why Not All Subsets Are Measurable
The Vitali construction: partition $[0,1)$ into equivalence classes under the relation $x \sim y$ iff $x - y \in \mathbb{Q}$. By the Axiom of Choice, pick one representative from each class to form a set $V$. Suppose $\lambda(V) = c$. The rational translates $V_q = \{(v + q) \bmod 1 : v \in V\}$ for $q \in \mathbb{Q} \cap [0,1)$ are pairwise disjoint and cover $[0,1)$; since Lebesgue measure is translation invariant, $\lambda(V_q) = c$ for every $q$. By $\sigma$-additivity: $1 = \lambda([0,1)) = \sum_q \lambda(V_q) = \sum_q c$. If $c = 0$: the sum is $0 \neq 1$. If $c > 0$: the sum is $\infty \neq 1$. Contradiction. So $V$ has no consistent length — it is non-measurable.
Connections
Where Your Intuition Breaks
Measure theory looks like overkill for practical ML — you can compute expectations with Riemann integrals and never invoke Lebesgue's machinery. What you actually need is the dominated convergence theorem: you can interchange a limit and an integral ($\lim_n \int f_n = \int \lim_n f_n$) whenever there exists an integrable dominating function. Without this, the claim that "the expected gradient equals the gradient of the expected loss" — $\nabla_\theta \, \mathbb{E}[L(\theta, X)] = \mathbb{E}[\nabla_\theta L(\theta, X)]$ — doesn't follow automatically. Riemann integration has no clean analogue of dominated convergence. When you take gradients through expectations in neural network theory, policy gradient derivations, or score function estimators, you are implicitly invoking dominated convergence. If it fails (e.g., unbounded gradients, heavy-tailed losses), the interchange is invalid and you need to either bound the gradient explicitly or use a different estimator.
The Lebesgue integral unifies discrete and continuous probability. The expectation is $\mathbb{E}[g(X)] = \int g \, dP_X$ regardless of whether $X$ is discrete or continuous — the choice of reference measure (counting vs Lebesgue) is the only difference. This is why probability theorems (LLN, CLT, etc.) hold for both cases without separate proofs. In ML, mixed distributions (continuous embeddings with discrete structure, or energy functions over graphs) fit naturally into this framework.
Absolute continuity is why PDFs exist. A continuous distribution $P$ on $\mathbb{R}$ has a density because $P \ll \lambda$: any set of zero length has zero probability. For a discrete distribution, $P$ is absolutely continuous w.r.t. counting measure and the PMF is the Radon-Nikodym derivative. Distributions with mixed components (e.g., having both atoms and a continuous part) require careful handling — they are not absolutely continuous w.r.t. either reference measure alone.
DCT requires finding a dominating function. When differentiating through an expectation or swapping limit and integral, you must exhibit an integrable $g$ with $|f_n| \leq g$ for all $n$. For bounded losses (e.g., cross-entropy with clipped logits, bounded reward), this is automatic. For heavy-tailed distributions (e.g., policy gradient with unbounded rewards), it can fail — gradients may have infinite variance and the swap is invalid, leading to biased gradient estimates or divergence.