
Integration in Rⁿ: Fubini, Change of Variables & Surface Integrals

Integration in multiple dimensions underpins the computation of expectations, partition functions, marginal distributions, and the change-of-variables formula that makes normalizing flows tractable. Fubini's theorem tells us when iterated integrals can be exchanged; the change-of-variables formula with its Jacobian determinant is the mathematical foundation for density transformations in generative modeling.

Concepts

Change of Variables — the Jacobian determinant $\det J = r$

[Figure] Left: uniform grid in polar parameter space $(\theta, r)$. Right: the same cells after the map $(r,\theta) \mapsto (r\cos\theta, r\sin\theta)$. Cells near the center (small $r$) shrink — the Jacobian determinant $\det J = r$ corrects for this in the change-of-variables formula. Darker = larger $\det J$ = more area stretching.

Every probability density you work with in ML — Gaussian, softmax output, normalizing flow — must integrate to 1 over its domain. When you transform variables (reparameterize a VAE, apply a normalizing flow, change to polar coordinates to compute a Gaussian normalization constant), the integral doesn't just follow the map — it must be corrected by how much the map stretches or squishes volume. That correction factor is the Jacobian determinant. Integration in $\mathbb{R}^n$ makes this precise.

The Lebesgue Integral in $\mathbb{R}^n$

The Lebesgue integral of a measurable function $f : \mathbb{R}^n \to \mathbb{R}$ over a measurable set $E$ is defined via measure theory. For practical purposes, it agrees with the Riemann integral whenever the Riemann integral exists, but handles pathological functions and limit exchanges more gracefully.

Key properties:

  • Linearity: $\int_E (af + bg) = a\int_E f + b\int_E g$
  • Monotonicity: $f \leq g$ a.e. $\implies \int_E f \leq \int_E g$
  • Triangle inequality: $\left|\int_E f\right| \leq \int_E |f|$

Fubini-Tonelli Theorem

Theorem (Fubini). Let $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ be integrable (i.e., $\int |f| < \infty$). Then:

$$\int_{\mathbb{R}^m \times \mathbb{R}^n} f(\mathbf{x}, \mathbf{y}) \, d(\mathbf{x}, \mathbf{y}) = \int_{\mathbb{R}^m} \left(\int_{\mathbb{R}^n} f(\mathbf{x}, \mathbf{y}) \, d\mathbf{y}\right) d\mathbf{x} = \int_{\mathbb{R}^n} \left(\int_{\mathbb{R}^m} f(\mathbf{x}, \mathbf{y}) \, d\mathbf{x}\right) d\mathbf{y}.$$

Tonelli's extension: if $f \geq 0$, the iterated integrals are equal (possibly $+\infty$) even without the integrability assumption.

Why it matters for ML. Computing joint distributions and marginalizing: $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$. Fubini guarantees you can do this in any order. The ELBO (Evidence Lower BOund) in variational inference involves $\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x},\mathbf{z})]$ — an integral over latent variables, computable as an iterated integral under Fubini.

Counterexample (why we need integrability). The function $f(x,y) = (x^2 - y^2)/(x^2+y^2)^2$ on $[0,1]^2$ has $\int_0^1\left(\int_0^1 f\,dy\right)dx = \pi/4$ but $\int_0^1\left(\int_0^1 f\,dx\right)dy = -\pi/4$ — the iterated integrals are unequal because $\int |f| = \infty$.
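The two iterated integrals can be checked numerically. A sketch using the inner antiderivatives computed by hand ($\int_0^1 f\,dy = 1/(1+x^2)$ and $\int_0^1 f\,dx = -1/(1+y^2)$, from $\partial_y[y/(x^2+y^2)] = f$ and $\partial_x[-x/(x^2+y^2)] = f$):

```python
import numpy as np
from scipy.integrate import quad

# Inner integrals evaluated analytically:
#   ∫₀¹ f(x,y) dy = [y/(x²+y²)]₀¹ = 1/(1+x²)
#   ∫₀¹ f(x,y) dx = [-x/(x²+y²)]₀¹ = -1/(1+y²)
I_dy_dx, _ = quad(lambda x: 1.0 / (1.0 + x**2), 0, 1)   # dy first, then dx
I_dx_dy, _ = quad(lambda y: -1.0 / (1.0 + y**2), 0, 1)  # dx first, then dy

print(I_dy_dx, I_dx_dy)  # ≈ +π/4 ≈ 0.7854 and ≈ −π/4 ≈ −0.7854
```

Both orders converge, but to different values — Fubini fails because $\int|f| = \infty$ near the origin.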

Change of Variables Formula

Theorem. Let $\phi : U \to V$ be a $C^1$ diffeomorphism between open sets in $\mathbb{R}^n$, and $f : V \to \mathbb{R}$ integrable. Then:

$$\int_V f(\mathbf{y}) \, d\mathbf{y} = \int_U f(\phi(\mathbf{x})) \, |\det J_\phi(\mathbf{x})| \, d\mathbf{x}.$$

The Jacobian determinant $|\det J_\phi(\mathbf{x})|$ is the local volume scaling factor — how much $\phi$ expands or contracts area/volume near $\mathbf{x}$. Without this correction, probability mass would not be conserved under the transformation: stretched regions would appear to carry more probability and compressed regions less, and the total would no longer be 1. The determinant is precisely the factor needed to make the integral invariant under smooth reparameterization.

Polar coordinates ($n=2$): $\phi(r,\theta) = (r\cos\theta, r\sin\theta)$, $J_\phi = \begin{pmatrix}\cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta\end{pmatrix}$, $\det J_\phi = r$. Thus:

$$\int_{\mathbb{R}^2} f(x,y) \, dx\, dy = \int_0^\infty \int_0^{2\pi} f(r\cos\theta, r\sin\theta) \cdot r \, d\theta \, dr.$$

The extra $r$ factor prevents cells near the origin (which map to tiny wedges) from being over-counted.
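The identity $\det J_\phi = r$ can be checked without any by-hand differentiation: a finite-difference Jacobian of the polar map recovers it at any chosen point (the point $(r,\theta) = (0.7, 1.2)$ below is an arbitrary test value):

```python
import numpy as np

def phi(p):
    # polar map (r, θ) ↦ (r cos θ, r sin θ)
    r, theta = p
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def jacobian_fd(f, p, eps=1e-6):
    # central finite-difference Jacobian of f at p
    n = len(p)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(p + e) - f(p - e)) / (2 * eps)
    return J

r, theta = 0.7, 1.2
J = jacobian_fd(phi, np.array([r, theta]))
print(np.linalg.det(J))  # ≈ r = 0.7
```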

Spherical coordinates ($n=3$): $\phi(r,\theta,\varphi) = (r\sin\theta\cos\varphi, r\sin\theta\sin\varphi, r\cos\theta)$, $|\det J_\phi| = r^2\sin\theta$. The volume element is $r^2\sin\theta\,dr\,d\theta\,d\varphi$.

Important Integrals via Change of Variables

Gaussian integral.

$$\int_{-\infty}^\infty e^{-x^2} \, dx = \sqrt{\pi}.$$

Proof: square the integral and change to polar coordinates — $\left(\int e^{-x^2}dx\right)^2 = \iint e^{-(x^2+y^2)}\,dx\,dy = \int_0^\infty e^{-r^2} \cdot 2\pi r\,dr = \pi$. The Jacobian determinant $r$ is essential.
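A quick numeric sanity check of the result with standard adaptive quadrature:

```python
import numpy as np
from scipy.integrate import quad

# ∫ e^{-x²} dx over the whole real line
I, err = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(I, np.sqrt(np.pi))  # both ≈ 1.7724538509
```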

Multivariate Gaussian normalization.

$$\int_{\mathbb{R}^n} \exp\!\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}\right)d\mathbf{x} = (2\pi)^{n/2}\sqrt{\det\Sigma}.$$

Proof: substitute $\mathbf{x} = \Sigma^{1/2}\mathbf{y}$, with Jacobian $|\det J| = \det(\Sigma^{1/2}) = \sqrt{\det\Sigma}$, which reduces the integral to a product of $n$ standard Gaussian integrals.
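The formula is easy to verify by brute force in $n = 2$: a Riemann sum over a wide grid against the closed form, for one illustrative SPD covariance (the matrix below is an arbitrary example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])   # example SPD covariance
Sigma_inv = np.linalg.inv(Sigma)

# brute-force Riemann sum on a grid wide enough to capture all the mass
xs = np.linspace(-12, 12, 801)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
pts = np.stack([X, Y], axis=-1)              # shape (801, 801, 2)
quad_form = np.einsum('...i,ij,...j->...', pts, Sigma_inv, pts)
numeric = np.exp(-0.5 * quad_form).sum() * dx**2

closed_form = (2 * np.pi) ** (2 / 2) * np.sqrt(np.linalg.det(Sigma))
print(numeric, closed_form)   # both ≈ 2π·√(det Σ) ≈ 8.046
```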

Surface Integrals and Differential Forms

A surface integral generalizes the line integral over curves in $\mathbb{R}^3$ to integration over surfaces. For a surface $S$ parameterized by $\phi : U \subset \mathbb{R}^2 \to \mathbb{R}^3$:

$$\iint_S f \, dS = \iint_U f(\phi(u,v)) \left\|\frac{\partial\phi}{\partial u} \times \frac{\partial\phi}{\partial v}\right\| du\,dv.$$

The norm of the cross product, $\|\partial\phi/\partial u \times \partial\phi/\partial v\|$, plays the role of the Jacobian determinant for surfaces in 3D: it measures how much the parameterization stretches area.
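As a concrete check, the surface area of the unit sphere can be computed from this formula: evaluate the tangent vectors of the standard spherical parameterization on a grid, take the cross-product norm, and sum. The result should be $4\pi R^2$.

```python
import numpy as np

R = 1.0
n = 400
thetas = (np.arange(n) + 0.5) * np.pi / n        # midpoint rule in θ ∈ (0, π)
phis = (np.arange(n) + 0.5) * 2 * np.pi / n      # midpoint rule in φ ∈ (0, 2π)
dA = (np.pi / n) * (2 * np.pi / n)
T, P = np.meshgrid(thetas, phis, indexing='ij')

# tangent vectors of φ(θ,φ) = R(sinθ cosφ, sinθ sinφ, cosθ)
d_theta = np.stack([R*np.cos(T)*np.cos(P), R*np.cos(T)*np.sin(P), -R*np.sin(T)], axis=-1)
d_phi   = np.stack([-R*np.sin(T)*np.sin(P), R*np.sin(T)*np.cos(P), np.zeros_like(T)], axis=-1)

# ‖∂φ/∂θ × ∂φ/∂φ‖ recovers the area element R² sinθ
cross = np.cross(d_theta, d_phi)
area = np.linalg.norm(cross, axis=-1).sum() * dA
print(area)   # ≈ 4πR² ≈ 12.566
```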

Differential forms provide the coordinate-free version: a $k$-form $\omega$ integrates over $k$-dimensional oriented manifolds without choosing coordinates. The key fact: $\int_M d\omega = \int_{\partial M} \omega$ (Stokes' theorem), which generalizes the Fundamental Theorem of Calculus, Green's theorem, and the Divergence theorem.

The Laplacian and Harmonic Functions

The Laplacian $\Delta f = \operatorname{div}(\nabla f) = \sum_{i=1}^n \partial^2 f/\partial x_i^2$ is the divergence of the gradient — it measures how much the average value of $f$ near a point exceeds its value at the point.

Harmonic functions: $\Delta f = 0$. By the mean value property, $f(\mathbf{x}_0) = \frac{1}{|B(\mathbf{x}_0,r)|}\int_{B(\mathbf{x}_0,r)} f$ — the value equals the average over any ball. Harmonic functions are the equilibrium solutions of diffusion processes.
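The mean value property is easy to test by Monte Carlo on a harmonic function such as $f(x,y) = x^2 - y^2$ (the center and radius below are arbitrary illustrative choices):

```python
import numpy as np

# f(x, y) = x² − y² is harmonic: f_xx + f_yy = 2 − 2 = 0
f = lambda x, y: x**2 - y**2

x0, y0, r = 0.3, -0.5, 0.25
n = 200_000
rng = np.random.default_rng(0)
# uniform samples in the disk B((x0, y0), r): √u makes the radius uniform in area
u = rng.random(n)
rad = r * np.sqrt(u)
ang = rng.random(n) * 2 * np.pi
avg = f(x0 + rad * np.cos(ang), y0 + rad * np.sin(ang)).mean()
print(avg, f(x0, y0))   # both ≈ 0.09 − 0.25 = −0.16
```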

Laplacian in ML:

  • Graph Laplacian $L = D - W$ (discrete analogue): eigenvectors give spectral clustering embeddings
  • Laplacian regularization: minimize $\mathbf{f}^T L \mathbf{f} = \frac{1}{2}\sum_{i,j} w_{ij}(f_i-f_j)^2$ — smooths predictions across graph edges
  • Score matching: the score function $\nabla_\mathbf{x}\log p(\mathbf{x})$ drives score-based generative models (DDPM, score matching), and Hyvärinen's implicit score-matching objective contains the Laplacian $\Delta_\mathbf{x}\log p(\mathbf{x})$
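The graph-Laplacian quadratic-form identity is an exact algebraic fact and can be verified on a random symmetric weight matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
W = rng.random((n, n)); W = (W + W.T) / 2   # symmetric edge weights
np.fill_diagonal(W, 0.0)                    # no self-loops
D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # graph Laplacian

f = rng.standard_normal(n)
quad = f @ L @ f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j])**2
                     for i in range(n) for j in range(n))
print(quad, pairwise)   # equal up to floating point
```

Large `quad` means `f` changes sharply across heavy edges, which is exactly what Laplacian regularization penalizes.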

Worked Example

Example 1: Gaussian KL Divergence

For $p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$, $q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$:

$$\operatorname{KL}(p\|q) = \int p(\mathbf{x})\log\frac{p(\mathbf{x})}{q(\mathbf{x})}\,d\mathbf{x} = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^T\Sigma_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - n + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right].$$

This closed-form integral uses change of variables (to diagonalize $\Sigma_1$) and the Gaussian normalization formula. It appears in the ELBO for VAEs, Gaussian process inference, and information-theoretic analysis of representation learning.
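One way to trust the closed form is to compare it against a direct Monte Carlo estimate of $\mathbb{E}_p[\log p - \log q]$ (the means and covariances below are arbitrary example values):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
mu2, S2 = np.array([0.5, -0.2]), np.array([[1.5, 0.0], [0.0, 0.8]])
n = 2

# closed form
S2_inv = np.linalg.inv(S2)
diff = mu2 - mu1
kl_closed = 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - n
                   + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

# Monte Carlo: KL(p‖q) = E_p[log p(x) − log q(x)]
rng = np.random.default_rng(0)
xs = rng.multivariate_normal(mu1, S1, size=200_000)
kl_mc = np.mean(multivariate_normal(mu1, S1).logpdf(xs)
                - multivariate_normal(mu2, S2).logpdf(xs))
print(kl_closed, kl_mc)   # agree to ~2 decimal places
```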

Example 2: Expectation by Change of Variables

If $\mathbf{z} \sim \mathcal{N}(\mathbf{0},I)$ and $\mathbf{x} = \mu + L\mathbf{z}$ (with $\Sigma = LL^T$), then $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$.

$$\mathbb{E}_\mathbf{x}[f(\mathbf{x})] = \mathbb{E}_\mathbf{z}[f(\mu + L\mathbf{z})].$$

This is the reparameterization trick in VAEs: the randomness is isolated in $\mathbf{z}$, so gradients flow through $\mu$ and $L$ and the sampling step becomes differentiable. The change-of-variables formula justifies the density transformation.
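A quick empirical check that $\mathbf{x} = \mu + L\mathbf{z}$ really has the claimed mean and covariance, with $L$ from a Cholesky factorization (the $\mu$ and $\Sigma$ below are arbitrary example values):

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)             # Σ = L Lᵀ

rng = np.random.default_rng(0)
z = rng.standard_normal((500_000, 2))     # z ~ N(0, I)
x = mu + z @ L.T                          # x = μ + L z, applied row-wise

print(x.mean(axis=0))                     # ≈ μ
print(np.cov(x.T))                        # ≈ Σ
```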

Example 3: Normalizing Flow Log-Likelihood

For an invertible map $\mathbf{x} = f(\mathbf{z})$ with $\mathbf{z} \sim p_z$:

$$p_x(\mathbf{x}) = p_z(f^{-1}(\mathbf{x})) \cdot |\det J_{f^{-1}}(\mathbf{x})| = p_z(f^{-1}(\mathbf{x})) \cdot \frac{1}{|\det J_f(f^{-1}(\mathbf{x}))|}.$$

Log-likelihood: $\log p_x(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) - \log|\det J_f(f^{-1}(\mathbf{x}))|$.

Training maximizes this over data $\{\mathbf{x}_i\}$. The Jacobian log-determinant is the correction for how $f$ stretches or squishes volume — exactly what the change-of-variables diagram illustrates.
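The smallest possible flow makes the formula concrete: for an affine map $x = az + b$ with standard normal base, the flow density must coincide with $\mathcal{N}(b, a^2)$, which we can check pointwise (the values of `a` and `b` are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# affine flow x = f(z) = a·z + b with base z ~ N(0, 1)
a, b = 2.0, 1.0

def log_px(x):
    z = (x - b) / a                            # f⁻¹(x)
    return norm.logpdf(z) - np.log(abs(a))     # log p_z(f⁻¹(x)) − log|det J_f|

# an affine flow of a Gaussian is Gaussian: x ~ N(b, a²)
xs = np.linspace(-5, 7, 9)
print(np.max(np.abs(log_px(xs) - norm.logpdf(xs, loc=b, scale=abs(a)))))  # ≈ 0
```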

Connections

Where Your Intuition Breaks

The most seductive shortcut: treating expectation and differentiation as freely interchangeable. In many ML derivations you want to differentiate under the integral sign: $\nabla_\theta \mathbb{E}_{p_\theta}[f(\mathbf{x})] = \nabla_\theta \int p_\theta(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x} = \int \nabla_\theta\left[p_\theta(\mathbf{x}) f(\mathbf{x})\right]d\mathbf{x}$. This exchange is valid when a dominating-function condition holds (Leibniz integral rule / dominated convergence theorem) — but not in general, and when the density itself depends on $\theta$ the gradient does not simply pass onto $f$: $\nabla_\theta \mathbb{E}_{p_\theta}[f(\mathbf{x})] \neq \mathbb{E}_{p_\theta}[\nabla_\theta f(\mathbf{x})]$ in general. The REINFORCE estimator is one rigorous way to handle the gradient of an expectation when the distribution depends on $\theta$; the reparameterization trick is another. Naive differentiation under the integral without checking these conditions leads to gradient estimates that can be silently wrong in sparse reward or heavy-tailed settings.
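Both estimators can be demonstrated on a toy objective where the true gradient is known in closed form. A sketch with $J(\theta) = \mathbb{E}_{x\sim\mathcal{N}(\theta,1)}[x^2] = \theta^2 + 1$, so $\nabla_\theta J = 2\theta$ (the value $\theta = 1.5$ is an arbitrary test point):

```python
import numpy as np

theta = 1.5
rng = np.random.default_rng(0)
x = rng.normal(theta, 1.0, size=1_000_000)

# REINFORCE / score-function estimator: ∇θ J = E[f(x) ∇θ log p(x;θ)]
# For N(θ, 1): ∇θ log p(x;θ) = (x − θ)
grad_reinforce = np.mean(x**2 * (x - theta))

# Reparameterization: x = θ + ε with ε ~ N(0,1), so ∇θ J = E[2(θ + ε)]
eps = rng.standard_normal(1_000_000)
grad_reparam = np.mean(2 * (theta + eps))

print(grad_reinforce, grad_reparam, 2 * theta)  # all ≈ 3.0
```

Note that the reparameterized estimate has far lower variance here, which is one reason VAEs prefer it when a differentiable sampling path exists.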

💡 Intuition

Intractable integrals and VI. Most interesting probability integrals in ML are intractable: $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$ requires integrating over all latent configurations. Variational inference replaces this with an optimization: $\max_{q \in \mathcal{Q}} \operatorname{ELBO}(q)$ where $\mathcal{Q}$ is a tractable family. Monte Carlo integration replaces it with sampling. Both approaches sidestep the intractable integral but introduce approximation error — the gap between these methods is a central theme of probabilistic ML.

⚠️ Warning

Improper integrals and unnormalized densities. Many score-based and energy-based models define $p(\mathbf{x}) \propto \exp(-E(\mathbf{x}))$ without computing the normalizing constant $Z = \int \exp(-E(\mathbf{x}))\,d\mathbf{x}$. Integrability must be checked to ensure $Z < \infty$; distributions that don't normalize properly lead to nonsensical samples. This is why score matching trains on $\nabla_\mathbf{x}\log p$ (the score function) without needing $Z$ — since $Z$ is constant in $\mathbf{x}$, $\nabla_\mathbf{x}\log Z = 0$ and the score reduces to $-\nabla_\mathbf{x}E(\mathbf{x})$.
