
Measure Theory Primer: σ-Algebras, Measures & Lebesgue Integration

Measure theory provides the rigorous foundation for probability, replacing informal notions of "area under a curve" with a precise framework that handles continuous and discrete distributions uniformly. Without it, statements like "the sample mean converges almost surely" or "the derivative of an expectation is the expectation of the derivative" lack precise meaning. This lesson builds the essential machinery — $\sigma$-algebras, measures, and the Lebesgue integral — that underlies all of learning theory.

Concepts

The Riemann integral you learned in calculus can't handle the indicator function of the rationals: $\mathbf{1}_\mathbb{Q}(x) = 1$ if $x \in \mathbb{Q}$, $0$ otherwise. On $[0,1]$, the lower Riemann sum is 0 (every subinterval contains irrationals, where the function is 0) and the upper Riemann sum is 1 (every subinterval contains rationals, where it is 1) — they never agree. The Lebesgue integral handles this immediately: the rationals have measure zero, so $\int_{[0,1]} \mathbf{1}_\mathbb{Q}\,d\lambda = 1 \cdot 0 + 0 \cdot 1 = 0$ (the value on the rationals times their measure, plus the value on the irrationals times theirs). This is not a contrived example — it is the prototype for why learning theory needs the Lebesgue integral: empirical averages over "almost all" training examples, convergence "almost surely," and expectations of indicator functions of events all require Lebesgue measure to be well-defined.

Sigma-Algebras

A $\sigma$-algebra (or $\sigma$-field) on a set $\Omega$ is a collection $\mathcal{F} \subseteq 2^\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$
  2. $A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}$ (closed under complement)
  3. $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closed under countable union)

It follows that $\emptyset \in \mathcal{F}$ and $\mathcal{F}$ is also closed under countable intersection (by De Morgan). The pair $(\Omega, \mathcal{F})$ is called a measurable space. The three axioms are precisely what is needed to assign consistent sizes to sets: if you can measure $A$ and $A^c$ separately, you need their measures to sum to the measure of $\Omega$; if you can measure each $A_n$, you need to be able to measure their union (for countable unions, not just finite ones — this is what makes infinite series of probabilities well-defined).
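For a finite $\Omega$ the three axioms can be checked mechanically, since countable closure reduces to closure under pairwise union. A minimal sketch (the function name is mine):

```python
def is_sigma_algebra(omega, F):
    """Check the sigma-algebra axioms for a collection F of frozensets
    over a finite set omega. For finite omega, countable-union closure
    reduces to closure under pairwise union."""
    F = set(F)
    if frozenset(omega) not in F:                  # axiom 1: omega in F
        return False
    for A in F:
        if frozenset(omega) - A not in F:          # axiom 2: complements
            return False
    for A in F:
        for B in F:
            if A | B not in F:                     # axiom 3: unions
                return False
    return True

omega = {1, 2, 3, 4}
trivial = {frozenset(), frozenset(omega)}
partial_ = {frozenset(), frozenset({1, 2}), frozenset(omega)}  # missing {3,4}

print(is_sigma_algebra(omega, trivial))    # True
print(is_sigma_algebra(omega, partial_))   # False: complement of {1,2} absent
```

The failing example shows why the axioms bite: once $\{1,2\}$ is declared measurable, its complement $\{3,4\}$ must be measurable too.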

Examples:

| $\Omega$ | $\mathcal{F}$ | Name |
|---|---|---|
| Any $\Omega$ | $\{\emptyset, \Omega\}$ | Trivial $\sigma$-algebra |
| Any $\Omega$ | $2^\Omega$ | Discrete $\sigma$-algebra |
| $\mathbb{R}$ | $\sigma(\text{open sets})$ | Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ |
| $\mathbb{R}^n$ | $\sigma(\text{open sets in } \mathbb{R}^n)$ | Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^n)$ |

The Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ contains all open intervals, closed intervals, half-open intervals, countable sets, and their complements and countable unions. It is by far the most common $\sigma$-algebra in probability.

Generated $\sigma$-algebra. For a collection $\mathcal{C}$ of subsets, $\sigma(\mathcal{C})$ is the smallest $\sigma$-algebra containing $\mathcal{C}$. Since arbitrary intersections of $\sigma$-algebras are again $\sigma$-algebras, $\sigma(\mathcal{C}) = \bigcap \{\mathcal{F} : \mathcal{F} \text{ is a } \sigma\text{-algebra and } \mathcal{C} \subseteq \mathcal{F}\}$.
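For finite $\Omega$, $\sigma(\mathcal{C})$ can also be computed constructively, by closing $\mathcal{C}$ under complement and union until a fixed point is reached. A sketch (the helper name is mine):

```python
def generate_sigma_algebra(omega, C):
    """Compute sigma(C) for a finite set omega by closing C under
    complement and pairwise union until nothing new appears."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(A) for A in C}
    while True:
        new = {omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:           # fixed point: F is closed, so F = sigma(C)
            return F
        F |= new

F = generate_sigma_algebra({1, 2, 3}, [{1}])
print(sorted(tuple(sorted(A)) for A in F))
# the four sets of sigma({{1}}): {}, {1}, {2,3}, {1,2,3}
```

Note this brute force only works because $\Omega$ is finite; for $\mathcal{B}(\mathbb{R})$ the intersection characterization above is the only handle.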

Why $\sigma$-algebras? The Vitali set construction (using the Axiom of Choice) shows that there exists a subset $V \subseteq [0,1]$ with no consistent notion of length: assigning any value to $\lambda(V)$ leads to either $\lambda([0,1]) \neq 1$ or non-additivity. The $\sigma$-algebra is the fence that excludes pathological sets, keeping measure theory consistent.

$\sigma$-algebras as information. In stochastic processes, the filtration $\mathcal{F}_t$ (an increasing family of $\sigma$-algebras) represents information up to time $t$. A random variable $X_t$ is $\mathcal{F}_t$-measurable iff its value can be determined from information up to time $t$ — the key concept in martingale theory and reinforcement learning.

Measures

A measure on $(\Omega, \mathcal{F})$ is a function $\mu : \mathcal{F} \to [0, +\infty]$ with:

  1. $\mu(\emptyset) = 0$
  2. $\sigma$-additivity: for pairwise disjoint $A_1, A_2, \ldots \in \mathcal{F}$:

$$\mu\!\left(\bigsqcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty \mu(A_n).$$

Key measures:

| Measure | Definition | Role |
|---|---|---|
| Lebesgue $\lambda$ | $\lambda([a,b]) = b-a$ | Length/area/volume |
| Counting $\#$ | $\#(A) = \lvert A \rvert$ | Discrete distributions (sums as integrals) |
| Dirac $\delta_x$ | $\delta_x(A) = \mathbf{1}[x \in A]$ | Point mass |
| Probability $P$ | $\mu$ with $\mu(\Omega) = 1$ | Probability theory |
| Product $\mu_1 \otimes \mu_2$ | $(\mu_1\otimes\mu_2)(A_1\times A_2) = \mu_1(A_1)\mu_2(A_2)$ | Joint distributions |

Continuity of measure. If $A_1 \subseteq A_2 \subseteq \ldots$ (increasing), then $\mu(\bigcup_n A_n) = \lim_n \mu(A_n)$. If $B_1 \supseteq B_2 \supseteq \ldots$ (decreasing) and $\mu(B_1) < \infty$, then $\mu(\bigcap_n B_n) = \lim_n \mu(B_n)$. These continuity properties follow from $\sigma$-additivity and are central to proving convergence theorems.
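A quick numeric illustration of continuity from below, using the Exp(1) probability measure with its analytic CDF standing in for $\mu$ (my choice of example):

```python
import math

# Continuity from below: A_n = [0, n] increase to [0, infinity),
# and for the Exp(1) probability measure, P(A_n) = 1 - e^{-n}.
P = lambda n: 1 - math.exp(-n)   # P([0, n]) for the Exp(1) distribution

vals = [P(n) for n in range(1, 31)]
assert all(a <= b for a, b in zip(vals, vals[1:]))  # monotone in n
print(vals[-1])  # approaches P([0, inf)) = 1, as continuity demands
```

The finiteness condition for decreasing sets is not cosmetic: with counting measure on the naturals, $B_n = \{n, n+1, \ldots\}$ decrease to $\emptyset$, yet every $\#(B_n) = \infty$, so the limit of the measures is not $\#(\emptyset) = 0$.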

Measurable Functions

A function $f : (\Omega_1, \mathcal{F}_1) \to (\Omega_2, \mathcal{F}_2)$ is measurable if $f^{-1}(B) \in \mathcal{F}_1$ for all $B \in \mathcal{F}_2$.

For $f : \mathbb{R} \to \mathbb{R}$ with Borel $\sigma$-algebras, $f$ is measurable iff $\{x : f(x) \leq c\} \in \mathcal{B}(\mathbb{R})$ for all $c \in \mathbb{R}$.

Preserved under operations:

  • Continuous functions are measurable.
  • Sums, products, and compositions of measurable functions are measurable.
  • Pointwise limits ($f = \lim_n f_n$) of measurable functions are measurable.
  • $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, $\liminf_n f_n$ are measurable.

The Lebesgue Integral

The Lebesgue integral of $f$ against $\mu$ is built in four stages:

Stage 1: Indicator functions. $\int \mathbf{1}_A \, d\mu = \mu(A)$.

Stage 2: Simple functions. For a simple function $\phi = \sum_{k=1}^n a_k \mathbf{1}_{A_k}$ (nonnegative, finite range, disjoint $A_k$):

$$\int \phi \, d\mu = \sum_{k=1}^n a_k \mu(A_k).$$

Stage 3: General nonnegative functions. For $f \geq 0$ measurable:

$$\int f \, d\mu = \sup\left\{\int \phi \, d\mu : 0 \leq \phi \leq f, \; \phi \text{ simple}\right\}.$$

Stage 4: Signed functions. Write $f = f^+ - f^-$; then $\int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu$ when at least one term is finite.
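Stages 2 and 3 can be seen numerically: approximate $f$ from below by the dyadic simple functions $\phi_n = \min(n, \lfloor 2^n f \rfloor / 2^n)$ and apply the Stage-2 formula, estimating $\lambda(A_k)$ by the fraction of a fine grid landing in each level set. A sketch under those grid assumptions (function name is mine):

```python
import numpy as np

def simple_function_integral(f, n, grid):
    """Integrate the dyadic simple approximation phi_n of f >= 0
    against Lebesgue measure on [0, 1], with lambda(A_k) estimated
    by the fraction of grid points falling in each level set A_k."""
    vals = f(grid)
    # phi_n(x) = min(n, floor(2^n f(x)) / 2^n): round f down to dyadic levels
    levels = np.minimum(np.floor(vals * 2**n) / 2**n, n)
    total = 0.0
    for a_k in np.unique(levels):
        measure_A_k = np.mean(levels == a_k)  # empirical lambda(A_k)
        total += a_k * measure_A_k            # Stage-2 formula
    return total

grid = np.linspace(0, 1, 200_001)
for n in (2, 4, 8):
    print(n, simple_function_integral(lambda x: x**2, n, grid))
# estimates increase with n toward 1/3, approaching the Stage-3 supremum
```

Because $\phi_n \leq \phi_{n+1} \leq f$ pointwise, the estimates climb monotonically toward $\int_0^1 x^2\,dx = 1/3$, which is exactly the Monotone Convergence Theorem in miniature.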

Lebesgue vs Riemann. For bounded continuous $f$ on $[a,b]$: they agree. For discontinuous $f$: Lebesgue handles it (the Dirichlet function $\mathbf{1}_\mathbb{Q}$ has Lebesgue integral 0 but no Riemann integral). For limits: Lebesgue allows $\lim\int f_n = \int\lim f_n$ under mild conditions (MCT, DCT); Riemann requires uniform convergence.

Convergence Theorems

Monotone Convergence Theorem (MCT). Let $0 \leq f_1 \leq f_2 \leq \ldots$ with $f_n \to f$ pointwise. Then:

$$\int f \, d\mu = \lim_{n\to\infty} \int f_n \, d\mu.$$

Proof idea. Since $\int f_n$ is increasing, the limit exists (possibly $\infty$). Use the approximation of $f$ by simple functions to show the limit equals $\int f$.

Fatou's Lemma. For $f_n \geq 0$:

$$\int \liminf_{n} f_n \, d\mu \leq \liminf_{n} \int f_n \, d\mu.$$

Dominated Convergence Theorem (DCT). If $f_n \to f$ $\mu$-a.e. and $|f_n| \leq g$ for all $n$ with $\int g \, d\mu < \infty$, then:

$$\lim_{n\to\infty} \int f_n \, d\mu = \int f \, d\mu.$$

DCT is used constantly: to differentiate through integrals, exchange sums and integrals, and pass limits inside expectations.
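The domination hypothesis is not decorative. The classic escaping-mass example $f_n = n\,\mathbf{1}_{(0,1/n)}$ converges to 0 pointwise while every integral equals 1; a grid sketch:

```python
import numpy as np

# f_n = n * 1_{(0, 1/n)} on [0, 1]: f_n -> 0 pointwise, yet every
# integral equals 1. sup_n f_n behaves like 1/x near 0, which is not
# integrable, so no dominating g exists and DCT does not apply.
x = np.linspace(0, 1, 1_000_001)
dx = x[1] - x[0]

for n in (10, 100, 1000):
    f_n = n * ((x > 0) & (x < 1 / n))
    print(n, np.sum(f_n) * dx)   # ~1.0 for every n

# The integral of the pointwise limit is 0, while the limit of the
# integrals is 1: Fatou's inequality 0 <= 1 holds strictly here.
```

This is also the picture to keep in mind for Fatou's Lemma: mass can escape in the limit, which is why the inequality only goes one way.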

Radon-Nikodym Theorem

If $\nu$ and $\mu$ are $\sigma$-finite measures on $(\Omega,\mathcal{F})$ and $\nu \ll \mu$ (absolutely continuous: $\mu(A)=0 \Rightarrow \nu(A)=0$), there exists a $\mu$-a.e. unique nonnegative measurable $p$ with:

$$\nu(A) = \int_A p \, d\mu \quad \forall A \in \mathcal{F}.$$

The function $p = d\nu/d\mu$ is the Radon-Nikodym derivative or density.

In probability: the PDF of a continuous distribution is the Radon-Nikodym derivative w.r.t. Lebesgue measure. The KL divergence is $\mathrm{KL}(Q\|P) = \mathbb{E}_Q[\log(dQ/dP)]$.

Importance sampling: $\mathbb{E}_P[f] = \mathbb{E}_Q[f \cdot dP/dQ]$ for $P \ll Q$.
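The importance-sampling identity is directly checkable by simulation; a minimal sketch with Gaussians (the choice $P = \mathcal{N}(0,1)$, $Q = \mathcal{N}(0,2^2)$, and the test function are mine):

```python
import numpy as np

# E_P[f] = E_Q[f * dP/dQ] for P << Q, with P = N(0, 1) and Q = N(0, 2^2).
# Here dP/dQ is simply the ratio of the two Gaussian densities.
rng = np.random.default_rng(0)

def normal_pdf(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x**2                           # E_P[X^2] = 1 under P = N(0, 1)
x = rng.normal(0.0, 2.0, size=200_000)       # sample from Q
w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)  # Radon-Nikodym derivative dP/dQ
print(np.mean(f(x) * w))                     # ~1.0
```

Note the direction of absolute continuity matters: sampling from the wider $Q$ keeps the weights $dP/dQ$ bounded, whereas sampling from a narrower $Q$ would make the weights heavy-tailed.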

Worked Example

Example 1: DCT for Gradient of Expectation

Claim. Under mild conditions, $\frac{d}{d\theta}\mathbb{E}[f(X;\theta)] = \mathbb{E}[\partial_\theta f(X;\theta)]$, where the distribution of $X$ is fixed and the $\theta$-dependence sits in the integrand. (When the distribution itself depends on $\theta$, write the expectation as an integral against the density $p_\theta$ and the same argument applies.)

Formal requirement. As $h \to 0$, the difference quotient $(f(X;\theta+h) - f(X;\theta))/h \to \partial_\theta f(X;\theta)$ pointwise (differentiability). If there exists $g$ with $|\partial_\theta f(X;\theta')| \leq g(X)$ for all $\theta'$ near $\theta$ (so that, by the mean value theorem, every difference quotient is also dominated by $g(X)$), and $\mathbb{E}[g(X)] < \infty$, then DCT gives:

$$\frac{d}{d\theta}\mathbb{E}[f(X;\theta)] = \lim_{h\to 0}\frac{\mathbb{E}[f(X;\theta+h)] - \mathbb{E}[f(X;\theta)]}{h} = \mathbb{E}\!\left[\lim_{h\to 0}\frac{f(X;\theta+h)-f(X;\theta)}{h}\right] = \mathbb{E}[\partial_\theta f(X;\theta)].$$

Application: the REINFORCE gradient estimator $\nabla_\theta \mathbb{E}[r(\tau)] = \mathbb{E}[r(\tau)\nabla_\theta \log p_\theta(\tau)]$ is derived by applying DCT after the log-derivative trick.
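The score-function estimator can be sanity-checked on a toy problem where the gradient is known in closed form. Assuming $x \sim \mathcal{N}(\theta, 1)$ and reward $r(x) = x^2$ (my choices), $\nabla_\theta \log p_\theta(x) = x - \theta$ and the true gradient of $\mathbb{E}[x^2] = \theta^2 + 1$ is $2\theta$:

```python
import numpy as np

# Score-function (REINFORCE) estimator for d/dtheta E_{x ~ N(theta,1)}[x^2].
# Here grad_theta log p_theta(x) = x - theta, and since E[x^2] = theta^2 + 1
# the analytic gradient is 2 * theta.
rng = np.random.default_rng(1)
theta = 1.5

x = rng.normal(theta, 1.0, size=500_000)
grad_est = np.mean(x**2 * (x - theta))   # E[r(x) * grad log p_theta(x)]
print(grad_est)                          # ~3.0 = 2 * theta
```

The estimator is unbiased because the DCT hypotheses hold here: the reward has all moments under the Gaussian, so the limit/expectation swap is valid.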

Example 2: Lebesgue Integration Unifies Discrete and Continuous

For a discrete $X$ taking values $\{x_k\}$ with probabilities $\{p_k\}$: $\mathbb{E}[f(X)] = \int f \, dP = \sum_k f(x_k)p_k$ (using counting measure as reference). For continuous $X$ with PDF $p(x)$: $\mathbb{E}[f(X)] = \int f(x)p(x)\,dx$. Both are Lebesgue integrals — the sum formula is the integral against the counting measure, and the integral formula is the integral against Lebesgue measure. No separate derivation needed for discrete vs continuous; the same theory covers both.
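The two formulas are one functional with two reference measures; a sketch (function names and test distributions are mine):

```python
import numpy as np

def expect_discrete(f, xs, ps):
    """E[f(X)] = integral of f dP, reference measure = counting measure."""
    return sum(f(x) * p for x, p in zip(xs, ps))

def expect_continuous(f, pdf, grid):
    """E[f(X)] = integral of f(x) p(x) dx, reference = Lebesgue measure."""
    return float(np.sum(f(grid) * pdf(grid)) * (grid[1] - grid[0]))

# Same functional E[f(X)], two reference measures:
print(expect_discrete(lambda x: x**2, [0, 1, 2], [0.25, 0.5, 0.25]))  # 1.5
grid = np.linspace(-10, 10, 100_001)
gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(expect_continuous(lambda x: x**2, gauss, grid))                 # ~1.0
```

Both functions are the same Lebesgue integral $\int f\,dP$; only the way $P$ decomposes against its reference measure differs.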

Example 3: Why Not All Subsets Are Measurable

The Vitali construction: partition $[0,1]$ into equivalence classes under the relation $x \sim y$ iff $x - y \in \mathbb{Q}$. By the Axiom of Choice, pick one representative from each class to form a set $V$. Suppose $\lambda(V) = c \geq 0$. The rational translates $V + q = \{v+q : v \in V\}$ for $q \in \mathbb{Q}\cap[-1,1]$ are pairwise disjoint, their union contains $[0,1]$, and they all lie inside $[-1,2]$. By translation invariance and $\sigma$-additivity, $1 = \lambda([0,1]) \leq \sum_{q}\lambda(V+q) = \sum_q c \leq \lambda([-1,2]) = 3$. If $c = 0$ the sum is $0 < 1$; if $c > 0$ the sum is $\infty > 3$. Contradiction either way. So $V$ has no consistent length — it is non-measurable.

Connections

Where Your Intuition Breaks

Measure theory looks like overkill for practical ML — you can compute expectations with Riemann integrals and never invoke Lebesgue's name. The hidden dependency is the dominated convergence theorem: you can interchange a limit and an integral ($\lim_n \int f_n = \int \lim_n f_n$) whenever there exists an integrable dominating function. Without this, the claim that "the expected gradient equals the gradient of the expected loss" — $\nabla_\theta \mathbb{E}[L] = \mathbb{E}[\nabla_\theta L]$ — doesn't follow automatically, and Riemann integration has no clean analogue of dominated convergence. When you take gradients through expectations in neural network theory, policy gradient derivations, or score function estimators, you are implicitly invoking DCT. If its hypothesis fails (e.g., unbounded gradients, heavy-tailed losses), the interchange is invalid and you need to either bound the gradient explicitly or use a different estimator.

💡Intuition

The Lebesgue integral unifies discrete and continuous probability. The expectation $\mathbb{E}[f(X)]$ is $\int f\,dP$ regardless of whether $X$ is discrete or continuous — the choice of reference measure (counting vs Lebesgue) is the only difference. This is why probability theorems (LLN, CLT, etc.) hold for both cases without separate proofs. In ML, mixed distributions (continuous embeddings with discrete structure, or energy functions over graphs) fit naturally into this framework.

💡Intuition

Absolute continuity is why PDFs exist. A continuous distribution $P$ on $\mathbb{R}$ has a density $p(x) = dP/d\lambda$ because $P \ll \lambda$: any set of zero length has zero probability. For a discrete distribution, $P$ is absolutely continuous w.r.t. counting measure and the PMF is the Radon-Nikodym derivative. Distributions with mixed components (e.g., having both atoms and a continuous part) require careful handling — they are not absolutely continuous w.r.t. either reference measure alone.

⚠️Warning

DCT requires finding a dominating function. When differentiating through an expectation or swapping limit and integral, you must exhibit an integrable $g$ with $|f_n| \leq g$. For bounded losses (e.g., cross-entropy with clipped logits, bounded reward), this is automatic. For heavy-tailed distributions (e.g., policy gradient with unbounded rewards), it can fail — gradients may have infinite variance and the swap is invalid, leading to biased gradient estimates or divergence.
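A quick way to feel this failure mode is to compare light- and heavy-tailed samples. A Cauchy "reward" has no mean at all, so no integrable dominating function exists and averaging guarantees evaporate; a sketch (the Cauchy-vs-Gaussian comparison is my illustration):

```python
import numpy as np

# Heavy tails break the DCT hypothesis: a Cauchy reward has no mean,
# so sample statistics never stabilize the way Gaussian ones do.
rng = np.random.default_rng(0)
n = 100_000
cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

print("gaussian sample std:", np.std(normal))   # ~1
print("cauchy sample std:  ", np.std(cauchy))   # outlier-driven, far larger
```

In a policy-gradient setting the analogous symptom is gradient estimates whose empirical variance keeps growing with the batch, a sign that the swap behind the estimator was never justified.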
