
Integration in Rⁿ: Fubini, Change of Variables & Surface Integrals

Integration in multiple dimensions underpins the computation of expectations, partition functions, marginal distributions, and the change-of-variables formula that makes normalizing flows tractable. Fubini's theorem tells us when iterated integrals can be exchanged; the change-of-variables formula with its Jacobian determinant is the mathematical foundation for density transformations in generative modeling.

Concepts

Change of Variables — the Jacobian determinant $\det J = r$

[Figure] Left: uniform grid in polar parameter space $(\theta, r)$. Right: the same cells after the map $(r,\theta) \mapsto (r\cos\theta, r\sin\theta)$. Cells near the center (small $r$) shrink — the Jacobian determinant $\det J = r$ corrects for this in the change-of-variables formula. Darker = larger $\det J$ = more area stretching.

Every probability density you work with in ML — Gaussian, softmax output, normalizing flow — must integrate to 1 over its domain. When you transform variables (reparameterize a VAE, apply a normalizing flow, change to polar coordinates to compute a Gaussian normalization constant), the integral doesn't just follow the map — it must be corrected by how much the map stretches or squishes volume. That correction factor is the Jacobian determinant. Integration in $\mathbb{R}^n$ makes this precise.

The Lebesgue Integral in $\mathbb{R}^n$

The Lebesgue integral of a measurable function $f : \mathbb{R}^n \to \mathbb{R}$ over a measurable set $E$ is defined via measure theory. For practical purposes, it agrees with the Riemann integral whenever the Riemann integral exists, but handles pathological functions and limit exchanges more gracefully.

Key properties:

  • Linearity: $\int_E (af + bg) = a\int_E f + b\int_E g$
  • Monotonicity: $f \leq g$ a.e. $\implies \int_E f \leq \int_E g$
  • Triangle inequality: $\left|\int_E f\right| \leq \int_E |f|$

Fubini-Tonelli Theorem

Theorem (Fubini). Let $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ be integrable (i.e., $\int |f| < \infty$). Then:

$$\int_{\mathbb{R}^m \times \mathbb{R}^n} f(\mathbf{x}, \mathbf{y}) \, d(\mathbf{x}, \mathbf{y}) = \int_{\mathbb{R}^m} \left(\int_{\mathbb{R}^n} f(\mathbf{x}, \mathbf{y}) \, d\mathbf{y}\right) d\mathbf{x} = \int_{\mathbb{R}^n} \left(\int_{\mathbb{R}^m} f(\mathbf{x}, \mathbf{y}) \, d\mathbf{x}\right) d\mathbf{y}.$$

Tonelli's extension: if $f \geq 0$, the iterated integrals are equal (possibly $+\infty$) even without the integrability assumption.

Why it matters for ML. Computing joint distributions and marginalizing: $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$. Fubini guarantees you can do this in any order. The ELBO (Evidence Lower BOund) in variational inference involves $\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x},\mathbf{z})]$ — an integral over latent variables, computable as an iterated integral under Fubini.

Counterexample (why we need integrability). The function $f(x,y) = (x^2 - y^2)/(x^2+y^2)^2$ on $[0,1]^2$ has $\int_0^1\left(\int_0^1 f\,dy\right)dx = \pi/4$ but $\int_0^1\left(\int_0^1 f\,dx\right)dy = -\pi/4$ — the iterated integrals are unequal because $\int |f| = \infty$.
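The two iterated integrals can be checked numerically. A sketch using the inner antiderivatives computed by hand ($\int_0^1 f\,dy = 1/(1+x^2)$ and $\int_0^1 f\,dx = -1/(1+y^2)$, from $\partial_y[y/(x^2+y^2)] = f$ and $\partial_x[-x/(x^2+y^2)] = f$):

```python
import numpy as np
from scipy.integrate import quad

# Inner integrals evaluated analytically:
#   ∫₀¹ f(x,y) dy = [y/(x²+y²)]₀¹ = 1/(1+x²)
#   ∫₀¹ f(x,y) dx = [-x/(x²+y²)]₀¹ = -1/(1+y²)
I_dy_dx, _ = quad(lambda x: 1.0 / (1.0 + x**2), 0, 1)   # dy first, then dx
I_dx_dy, _ = quad(lambda y: -1.0 / (1.0 + y**2), 0, 1)  # dx first, then dy

print(I_dy_dx, I_dx_dy)  # ≈ +π/4 ≈ 0.7854 and ≈ −π/4 ≈ −0.7854
```

Both orders converge, but to different values — Fubini fails because $\int|f| = \infty$ near the origin.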

Change of Variables Formula

Theorem. Let $\phi : U \to V$ be a $C^1$ diffeomorphism between open sets in $\mathbb{R}^n$, and $f : V \to \mathbb{R}$ integrable. Then:

$$\int_V f(\mathbf{y}) \, d\mathbf{y} = \int_U f(\phi(\mathbf{x})) \, |\det J_\phi(\mathbf{x})| \, d\mathbf{x}.$$

The Jacobian determinant $|\det J_\phi(\mathbf{x})|$ is the local volume scaling factor — how much $\phi$ expands or contracts area/volume near $\mathbf{x}$. Without this correction, probability mass would not be conserved under the transformation: stretched regions would appear to carry more probability and compressed regions less, and the total would no longer be 1. The determinant is precisely the factor needed to make the integral invariant under smooth reparameterization.

Polar coordinates ($n=2$): $\phi(r,\theta) = (r\cos\theta, r\sin\theta)$, $J_\phi = \begin{pmatrix}\cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta\end{pmatrix}$, $\det J_\phi = r$. Thus:

$$\int_{\mathbb{R}^2} f(x,y) \, dx\, dy = \int_0^\infty \int_0^{2\pi} f(r\cos\theta, r\sin\theta) \cdot r \, d\theta \, dr.$$

The extra $r$ factor prevents cells near the origin (which map to tiny wedges) from being over-counted.
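The identity $\det J_\phi = r$ can be checked without any by-hand differentiation: a finite-difference Jacobian of the polar map recovers it at any chosen point (the point $(r,\theta) = (0.7, 1.2)$ below is an arbitrary test value):

```python
import numpy as np

def phi(p):
    # polar map (r, θ) ↦ (r cos θ, r sin θ)
    r, theta = p
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def jacobian_fd(f, p, eps=1e-6):
    # central finite-difference Jacobian of f at p
    n = len(p)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(p + e) - f(p - e)) / (2 * eps)
    return J

r, theta = 0.7, 1.2
J = jacobian_fd(phi, np.array([r, theta]))
print(np.linalg.det(J))  # ≈ r = 0.7
```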

Spherical coordinates ($n=3$): $\phi(r,\theta,\varphi) = (r\sin\theta\cos\varphi, r\sin\theta\sin\varphi, r\cos\theta)$, $|\det J_\phi| = r^2\sin\theta$. The volume element is $r^2\sin\theta\,dr\,d\theta\,d\varphi$.

Important Integrals via Change of Variables

Gaussian integral.

$$\int_{-\infty}^\infty e^{-x^2} \, dx = \sqrt{\pi}.$$

Proof: square the integral and change to polar coordinates — $\left(\int e^{-x^2}dx\right)^2 = \iint e^{-(x^2+y^2)}\,dx\,dy = \int_0^\infty e^{-r^2} \cdot 2\pi r\,dr = \pi$. The Jacobian determinant $r$ is essential.
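A quick numeric sanity check of the result with standard adaptive quadrature:

```python
import numpy as np
from scipy.integrate import quad

# ∫ e^{-x²} dx over the whole real line
I, err = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(I, np.sqrt(np.pi))  # both ≈ 1.7724538509
```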

Multivariate Gaussian normalization.

$$\int_{\mathbb{R}^n} \exp\!\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\right)d\mathbf{x} = (2\pi)^{n/2}\sqrt{\det\Sigma}.$$

Proof: substitute $\mathbf{x} = \Sigma^{1/2}\mathbf{y}$, with Jacobian $|\det J| = \det(\Sigma^{1/2}) = \sqrt{\det\Sigma}$, which reduces the integral to a product of $n$ standard Gaussian integrals.
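The formula is easy to verify by brute force in $n = 2$: a Riemann sum over a wide grid against the closed form, for one illustrative SPD covariance (the matrix below is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])   # example SPD covariance
Sigma_inv = np.linalg.inv(Sigma)

# brute-force Riemann sum on a grid wide enough to capture all the mass
xs = np.linspace(-12, 12, 801)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
pts = np.stack([X, Y], axis=-1)              # shape (801, 801, 2)
quad_form = np.einsum('...i,ij,...j->...', pts, Sigma_inv, pts)
numeric = np.exp(-0.5 * quad_form).sum() * dx**2

closed_form = (2 * np.pi) ** (2 / 2) * np.sqrt(np.linalg.det(Sigma))
print(numeric, closed_form)   # both ≈ 2π·√(det Σ) ≈ 8.046
```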

Surface Integrals and Differential Forms

A surface integral generalizes the line integral over curves in $\mathbb{R}^3$ to integration over surfaces. For a surface $S$ parameterized by $\phi : U \subset \mathbb{R}^2 \to \mathbb{R}^3$:

$$\iint_S f \, dS = \iint_U f(\phi(u,v)) \left\|\frac{\partial\phi}{\partial u} \times \frac{\partial\phi}{\partial v}\right\| du\,dv.$$

The norm of the cross product, $\|\partial\phi/\partial u \times \partial\phi/\partial v\|$, plays the role of the Jacobian determinant for surfaces in 3D: it measures how much the parameterization stretches area.
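As a concrete check, the surface area of the unit sphere can be computed from this formula: evaluate the tangent vectors of the standard spherical parameterization on a grid, take the cross-product norm, and sum. The result should be $4\pi R^2$.

```python
import numpy as np

R = 1.0
n = 400
thetas = (np.arange(n) + 0.5) * np.pi / n        # midpoint rule in θ ∈ (0, π)
phis = (np.arange(n) + 0.5) * 2 * np.pi / n      # midpoint rule in φ ∈ (0, 2π)
dA = (np.pi / n) * (2 * np.pi / n)
T, P = np.meshgrid(thetas, phis, indexing='ij')

# tangent vectors of φ(θ,φ) = R(sinθ cosφ, sinθ sinφ, cosθ)
d_theta = np.stack([R*np.cos(T)*np.cos(P), R*np.cos(T)*np.sin(P), -R*np.sin(T)], axis=-1)
d_phi   = np.stack([-R*np.sin(T)*np.sin(P), R*np.sin(T)*np.cos(P), np.zeros_like(T)], axis=-1)

# ‖∂φ/∂θ × ∂φ/∂φ‖ recovers the area element R² sinθ
cross = np.cross(d_theta, d_phi)
area = np.linalg.norm(cross, axis=-1).sum() * dA
print(area)   # ≈ 4πR² ≈ 12.566
```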

Differential forms provide the coordinate-free version: a $k$-form $\omega$ integrates over $k$-dimensional oriented manifolds without choosing coordinates. The key fact: $\int_M d\omega = \int_{\partial M} \omega$ (Stokes' theorem), which generalizes the Fundamental Theorem of Calculus, Green's theorem, and the Divergence theorem.

The Laplacian and Harmonic Functions

The Laplacian $\Delta f = \operatorname{div}(\nabla f) = \sum_{i=1}^n \partial^2 f/\partial x_i^2$ is the divergence of the gradient — it measures how much the average value of $f$ near a point exceeds its value at the point.

Harmonic functions: $\Delta f = 0$. By the mean value property, $f(\mathbf{x}_0) = \frac{1}{|B(\mathbf{x}_0,r)|}\int_{B(\mathbf{x}_0,r)} f$ — the value equals the average over any ball. Harmonic functions are the equilibrium solutions of diffusion processes.
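The mean value property is easy to test by Monte Carlo on a harmonic function such as $f(x,y) = x^2 - y^2$ (the center and radius below are arbitrary illustrative choices):

```python
import numpy as np

# f(x, y) = x² − y² is harmonic: f_xx + f_yy = 2 − 2 = 0
f = lambda x, y: x**2 - y**2

x0, y0, r = 0.3, -0.5, 0.25
n = 200_000
rng = np.random.default_rng(0)
# uniform samples in the disk B((x0, y0), r): √u makes the radius uniform in area
u = rng.random(n)
rad = r * np.sqrt(u)
ang = rng.random(n) * 2 * np.pi
avg = f(x0 + rad * np.cos(ang), y0 + rad * np.sin(ang)).mean()
print(avg, f(x0, y0))   # both ≈ 0.09 − 0.25 = −0.16
```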

Laplacian in ML:

  • Graph Laplacian $L = D - W$ (discrete analogue): eigenvectors give spectral clustering embeddings
  • Laplacian regularization: minimize $\mathbf{f}^T L \mathbf{f} = \frac{1}{2}\sum_{i,j} w_{ij}(f_i-f_j)^2$ — smooths predictions across graph edges
  • Score matching: the score function $\nabla_\mathbf{x}\log p(\mathbf{x})$ drives score-based generative models (DDPM, score matching), and Hyvärinen's implicit score-matching objective contains the Laplacian $\Delta_\mathbf{x}\log p(\mathbf{x})$
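The graph-Laplacian quadratic-form identity is an exact algebraic fact and can be verified on a random symmetric weight matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
W = rng.random((n, n)); W = (W + W.T) / 2   # symmetric edge weights
np.fill_diagonal(W, 0.0)                    # no self-loops
D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # graph Laplacian

f = rng.standard_normal(n)
quad = f @ L @ f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j])**2
                     for i in range(n) for j in range(n))
print(quad, pairwise)   # equal up to floating point
```

Large `quad` means `f` changes sharply across heavy edges, which is exactly what Laplacian regularization penalizes.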

Worked Example

Example 1: Gaussian KL Divergence

For $p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$, $q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$:

$$\operatorname{KL}(p\|q) = \int p(\mathbf{x})\log\frac{p(\mathbf{x})}{q(\mathbf{x})}\,d\mathbf{x} = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^T\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - n + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right].$$

This closed-form integral uses change of variables (to diagonalize $\Sigma_1$) and the Gaussian normalization formula. It appears in the ELBO for VAEs, Gaussian process inference, and information-theoretic analysis of representation learning.
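One way to trust the closed form is to compare it against a direct Monte Carlo estimate of $\mathbb{E}_p[\log p - \log q]$ (the means and covariances below are arbitrary example values):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu2, S2 = np.array([0.5, -0.2]), np.array([[1.5, 0.0], [0.0, 0.8]])
n = 2

# closed form
S2_inv = np.linalg.inv(S2)
diff = mu2 - mu1
kl_closed = 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - n
                   + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

# Monte Carlo: KL(p‖q) = E_p[log p(x) − log q(x)]
rng = np.random.default_rng(0)
xs = rng.multivariate_normal(mu1, S1, size=200_000)
kl_mc = np.mean(multivariate_normal(mu1, S1).logpdf(xs)
                - multivariate_normal(mu2, S2).logpdf(xs))
print(kl_closed, kl_mc)   # agree to ~2 decimal places
```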

Example 2: Expectation by Change of Variables

If $\mathbf{z} \sim \mathcal{N}(\mathbf{0},I)$ and $\mathbf{x} = \mu + L\mathbf{z}$ (with $\Sigma = LL^T$), then $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$.

$$\mathbb{E}_\mathbf{x}[f(\mathbf{x})] = \mathbb{E}_\mathbf{z}[f(\mu + L\mathbf{z})].$$

This is the reparameterization trick in VAEs: the randomness is isolated in $\mathbf{z}$, so gradients flow through $\mu$ and $L$ and the sampling step becomes differentiable. The change-of-variables formula justifies the density transformation.
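A quick empirical check that $\mathbf{x} = \mu + L\mathbf{z}$ really has the claimed mean and covariance, with $L$ from a Cholesky factorization (the $\mu$ and $\Sigma$ below are arbitrary example values):

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)             # Σ = L Lᵀ

rng = np.random.default_rng(0)
z = rng.standard_normal((500_000, 2))     # z ~ N(0, I)
x = mu + z @ L.T                          # x = μ + L z, applied row-wise

print(x.mean(axis=0))                     # ≈ μ
print(np.cov(x.T))                        # ≈ Σ
```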

Example 3: Normalizing Flow Log-Likelihood

For an invertible map $\mathbf{x} = f(\mathbf{z})$ with $\mathbf{z} \sim p_z$:

$$p_x(\mathbf{x}) = p_z(f^{-1}(\mathbf{x})) \cdot |\det J_{f^{-1}}(\mathbf{x})| = p_z(f^{-1}(\mathbf{x})) \cdot \frac{1}{|\det J_f(f^{-1}(\mathbf{x}))|}.$$

Log-likelihood: $\log p_x(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) - \log|\det J_f(f^{-1}(\mathbf{x}))|$.

Training maximizes this over data $\{\mathbf{x}_i\}$. The Jacobian log-determinant is the correction for how $f$ stretches or squishes volume — exactly what the change-of-variables diagram illustrates.
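The smallest possible flow makes the formula concrete: for an affine map $x = az + b$ with standard normal base, the flow density must coincide with $\mathcal{N}(b, a^2)$, which we can check pointwise (the values of `a` and `b` are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# affine flow x = f(z) = a·z + b with base z ~ N(0, 1)
a, b = 2.0, 1.0

def log_px(x):
    z = (x - b) / a                            # f⁻¹(x)
    return norm.logpdf(z) - np.log(abs(a))     # log p_z(f⁻¹(x)) − log|det J_f|

# an affine flow of a Gaussian is Gaussian: x ~ N(b, a²)
xs = np.linspace(-5, 7, 9)
print(np.max(np.abs(log_px(xs) - norm.logpdf(xs, loc=b, scale=abs(a)))))  # ≈ 0
```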

Connections

Where Your Intuition Breaks

The most seductive shortcut: treating expectation and differentiation as freely interchangeable. In many ML derivations you want to differentiate under the integral sign: $\nabla_\theta \mathbb{E}_{p_\theta}[f(\mathbf{x})] = \nabla_\theta \int p_\theta(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x} = \int \nabla_\theta\left[p_\theta(\mathbf{x}) f(\mathbf{x})\right]d\mathbf{x}$. This exchange is valid when a dominating-function condition holds (Leibniz integral rule / dominated convergence theorem) — but not in general, and when the density itself depends on $\theta$ the gradient does not simply pass onto $f$: $\nabla_\theta \mathbb{E}_{p_\theta}[f(\mathbf{x})] \neq \mathbb{E}_{p_\theta}[\nabla_\theta f(\mathbf{x})]$ in general. The REINFORCE estimator is one rigorous way to handle the gradient of an expectation when the distribution depends on $\theta$; the reparameterization trick is another. Naive differentiation under the integral without checking these conditions leads to gradient estimates that can be silently wrong in sparse reward or heavy-tailed settings.
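Both estimators can be demonstrated on a toy objective where the true gradient is known in closed form. A sketch with $J(\theta) = \mathbb{E}_{x\sim\mathcal{N}(\theta,1)}[x^2] = \theta^2 + 1$, so $\nabla_\theta J = 2\theta$ (the value $\theta = 1.5$ is an arbitrary test point):

```python
import numpy as np

theta = 1.5
rng = np.random.default_rng(0)
x = rng.normal(theta, 1.0, size=1_000_000)

# REINFORCE / score-function estimator: ∇θ J = E[f(x) ∇θ log p(x;θ)]
# For N(θ, 1): ∇θ log p(x;θ) = (x − θ)
grad_reinforce = np.mean(x**2 * (x - theta))

# Reparameterization: x = θ + ε with ε ~ N(0,1), so ∇θ J = E[2(θ + ε)]
eps = rng.standard_normal(1_000_000)
grad_reparam = np.mean(2 * (theta + eps))

print(grad_reinforce, grad_reparam, 2 * theta)  # all ≈ 3.0
```

Note that the reparameterized estimate has far lower variance here, which is one reason VAEs prefer it when a differentiable sampling path exists.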

💡 Intuition

Intractable integrals and VI. Most interesting probability integrals in ML are intractable: $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$ requires integrating over all latent configurations. Variational inference replaces this with an optimization: $\max_{q \in \mathcal{Q}} \operatorname{ELBO}(q)$ where $\mathcal{Q}$ is a tractable family. Monte Carlo integration replaces it with sampling. Both approaches sidestep the intractable integral but introduce approximation error — the gap between these methods is a central theme of probabilistic ML.

⚠️ Warning

Improper integrals and unnormalized densities. Many score-based and energy-based models define $p(\mathbf{x}) \propto \exp(-E(\mathbf{x}))$ without computing the normalizing constant $Z = \int \exp(-E(\mathbf{x}))\,d\mathbf{x}$. Integrability must be checked to ensure $Z < \infty$; distributions that don't normalize properly lead to nonsensical samples. This is why score matching trains on $\nabla_\mathbf{x}\log p$ (the score function) without needing $Z$ — since $Z$ is constant in $\mathbf{x}$, $\nabla_\mathbf{x}\log Z = 0$ and the score reduces to $-\nabla_\mathbf{x}E(\mathbf{x})$.
