
Brownian Motion & Gaussian Processes

Brownian motion is the canonical continuous-time stochastic process — continuous paths, independent increments, and Gaussian marginals. Gaussian processes generalize it to distributions over functions indexed by arbitrary inputs, giving a principled Bayesian framework for function estimation that unifies kernel methods, spline interpolation, and kriging.

Concepts

Brownian motion: independent increments; paths are continuous but nowhere differentiable. The $\pm 2\sigma\sqrt{t}$ envelope contains about 95% of paths at each time.

Key moments of Brownian motion with scale $\sigma$:

| Quantity | Value |
| --- | --- |
| $E[B(t)]$ | $0$ |
| $\mathrm{Var}[B(t)]$ | $\sigma^2 t$ |
| $\mathrm{Cov}[B(s), B(t)]$ | $\sigma^2 \min(s, t)$ |

$B(t) - B(s) \perp B(s)$ (independent increments). Paths are a.s. continuous but have infinite total variation, so integrals with respect to them cannot be defined in the Riemann-Stieltjes sense.

A Gaussian Process is a distribution over functions. The kernel encodes prior beliefs about smoothness and structure; adding observations collapses the posterior uncertainty at the data points.

RBF: infinitely smooth. Matérn-3/2: once differentiable. Periodic: repeating. Linear: Bayesian linear regression prior. Larger $\ell$ → longer-range correlations.

A particle suspended in liquid jitters continuously under the microscope: continuous motion, but with no well-defined velocity at any point. Brownian motion is the precise model: continuous paths, nowhere differentiable, with variance growing linearly in time. Gaussian processes generalize this from one random trajectory to a distribution over entire functions: instead of asking where a particle is at time $t$, ask what the function value $f(x)$ is at input $x$, with the correlation structure determined by a kernel.

Standard Brownian Motion

Standard Brownian motion (the Wiener process) $B = (B_t)_{t \geq 0}$ is a stochastic process satisfying:

  1. $B_0 = 0$ a.s.
  2. Independent increments: for $0 \leq t_0 < t_1 < \cdots < t_k$, the increments $B_{t_1} - B_{t_0}, \ldots, B_{t_k} - B_{t_{k-1}}$ are independent
  3. Gaussian increments: $B_t - B_s \sim \mathcal{N}(0, t - s)$ for $s < t$
  4. Continuous paths: $t \mapsto B_t$ is continuous a.s.

Conditions 1–3 determine all finite-dimensional distributions: they make $B$ a Gaussian process with $E[B_t] = 0$ and $\mathrm{Cov}(B_s, B_t) = \min(s, t)$. Condition 4 is the analytic requirement, and Kolmogorov's continuity theorem guarantees it can be met: processes sharing these finite-dimensional distributions agree only up to modification, and condition 4 selects a version whose paths are genuinely continuous.

Non-differentiability: although continuous, Brownian paths are nowhere differentiable a.s. More precisely, $\limsup_{h \to 0} |B_{t+h} - B_t| / |h| = \infty$ for all $t$ a.s. The total variation over $[0, T]$ is infinite a.s., so integrals with respect to $B$ cannot be defined pathwise as Riemann-Stieltjes integrals; Itô integration is a separate theory built for this purpose.

Quadratic variation: $[B, B]_T = \lim_{\|\Pi\| \to 0} \sum_k (B_{t_{k+1}} - B_{t_k})^2 = T$, with convergence in probability (and a.s. along nested dyadic partitions). The quadratic variation is deterministic and equals $T$: the fundamental identity underlying Itô's lemma.
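A minimal simulation sketch of this identity (grid size, horizon, and seed are illustrative choices): on a fine partition of $[0, T]$, the sum of squared increments of one simulated path lands near $T$, while the sum of absolute increments grows without bound.

```python
import numpy as np

# One Brownian path on [0, T], sampled on a uniform grid of n steps.
rng = np.random.default_rng(0)
T, n = 2.0, 200_000
dt = T / n
increments = rng.normal(0.0, np.sqrt(dt), size=n)  # B_{t_{k+1}} - B_{t_k}

quadratic_variation = np.sum(increments**2)    # -> T as the grid refines
total_variation = np.sum(np.abs(increments))   # grows like sqrt(n): unbounded
print(quadratic_variation, total_variation)
```

The quadratic variation concentrates tightly around $T = 2$ while the total variation is already in the hundreds at this resolution, illustrating why pathwise Riemann-Stieltjes integration fails.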

Martingale properties: $B_t$ is a martingale; $B_t^2 - t$ is a martingale; the exponential process $e^{\theta B_t - \theta^2 t/2}$ (Doléans-Dade exponential) is a martingale for all $\theta \in \mathbb{R}$.

Brownian Motion as Scaled Random Walk

The Donsker invariance principle: let $X_1, X_2, \ldots$ be iid with mean 0 and variance $\sigma^2$. Define the rescaled process $B^{(n)}_t = \frac{1}{\sigma\sqrt{n}} \sum_{k=1}^{\lfloor nt \rfloor} X_k$. Then $B^{(n)} \Rightarrow B$ (converges in distribution as processes) to standard Brownian motion. Brownian motion is the universal scaling limit of random walks; it is to stochastic processes what the Gaussian is to distributions.
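The rescaling can be sketched with Rademacher steps (a simple choice with mean 0 and variance 1; the sizes and seed below are illustrative): the scaled partial sums reproduce the $\mathcal{N}(0, 1)$ marginal at $t = 1$ and the $\min(s, t)$ covariance.

```python
import numpy as np

# Rescaled random walk B^(n)_t = n^{-1/2} * sum of the first floor(nt) steps.
rng = np.random.default_rng(1)
n, paths = 5_000, 1_000
steps = rng.choice([-1.0, 1.0], size=(paths, n))   # iid Rademacher steps
B_n = np.cumsum(steps, axis=1) / np.sqrt(n)        # B^(n) at t = k/n

final = B_n[:, -1]                              # marginal at t = 1: ~ N(0, 1)
cov_half = np.mean(B_n[:, n // 2 - 1] * final)  # Cov(B_0.5, B_1) -> min(0.5, 1)
print(final.mean(), final.var(), cov_half)
```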

Gaussian Processes

A Gaussian process $f \sim \mathcal{GP}(\mu, k)$ is a collection of random variables $(f(x))_{x \in \mathcal{X}}$ such that any finite marginal $(f(x_1), \ldots, f(x_n))$ is multivariate Gaussian with mean $(\mu(x_1), \ldots, \mu(x_n))$ and covariance $K_{ij} = k(x_i, x_j)$.

A GP is a distribution over functions: a sample from a GP is a function $f: \mathcal{X} \to \mathbb{R}$.

Kernels $k(x, x')$ must be positive semidefinite: $\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0$ for every finite set of inputs $x_1, \ldots, x_n$ and coefficients $c_1, \ldots, c_n$.

| Kernel | $k(x, x')$ | Properties |
| --- | --- | --- |
| RBF (squared exponential) | $\exp\bigl(-\frac{\lVert x-x'\rVert^2}{2\ell^2}\bigr)$ | Infinitely differentiable, universal |
| Matérn-3/2 | $(1+\frac{\sqrt{3}r}{\ell})e^{-\sqrt{3}r/\ell}$, $r=\lVert x-x'\rVert$ | Once differentiable |
| Matérn-1/2 | $e^{-r/\ell}$ | Corresponds to the Ornstein-Uhlenbeck process; continuous but not differentiable |
| Periodic | $\exp\bigl(-\frac{2\sin^2(\pi\lvert x-x'\rvert/p)}{\ell^2}\bigr)$ | Exactly periodic with period $p$ |
| Linear | $x^T x' + \sigma_b^2$ | Bayesian linear regression prior |
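Two kernels from the table can be sampled from directly, since any finite marginal is just a multivariate Gaussian with covariance $K$; the grid, length scale, jitter, and seed below are illustrative choices.

```python
import numpy as np

# Draw one GP prior sample per kernel: f ~ N(0, K) via the Cholesky factor.
def rbf(x1, x2, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-d**2 / (2 * ell**2))

def matern32(x1, x2, ell=1.0):
    r = np.abs(x1[:, None] - x2[None, :])
    return (1.0 + np.sqrt(3) * r / ell) * np.exp(-np.sqrt(3) * r / ell)

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 200)
samples = {}
for name, kernel in (("rbf", rbf), ("matern32", matern32)):
    K = kernel(x, x) + 1e-6 * np.eye(len(x))  # jitter for numerical stability
    L = np.linalg.cholesky(K)
    samples[name] = L @ rng.standard_normal(len(x))  # one draw f ~ N(0, K)
```

Plotting the two draws would show the RBF sample as visibly smoother than the rougher, once-differentiable Matérn-3/2 sample.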

Mercer's theorem: every PSD kernel $k$ on a compact domain corresponds to a feature map $\phi$ and an RKHS $\mathcal{H}_k$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}_k}$.

GP Regression (Kriging)

Given observations $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the joint distribution of $(f(x_*), \mathbf{y})$ is Gaussian. The posterior $f(x_*) \mid \mathbf{y}$ is Gaussian with:

$$\mu_*(x_*) = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*,$$

where $K_{ij} = k(x_i, x_j)$ and $\mathbf{k}_{*i} = k(x_*, x_i)$.

The posterior mean is a linear smoother with weights $\alpha = (K + \sigma_n^2 I)^{-1} \mathbf{y}$. The posterior variance quantifies uncertainty: $\sigma_*^2 = 0$ at training points (when $\sigma_n^2 \to 0$) and grows toward the prior variance away from data.

Marginal likelihood for kernel hyperparameter optimization:

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^T (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K_\theta + \sigma_n^2 I| - \frac{n}{2} \log 2\pi.$$

The first term rewards data fit; the second penalizes model complexity (the log determinant is the sum of log eigenvalues). Maximizing over $\theta$ balances fit against complexity: an automatic Occam's razor.
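The marginal likelihood above is usually computed via a Cholesky factorization of $K + \sigma_n^2 I$ rather than an explicit inverse or determinant; a minimal sketch, where the RBF kernel, toy inputs, and fixed hyperparameters are illustrative assumptions:

```python
import numpy as np

# Stable log marginal likelihood: log|K + s2 I| = 2 * sum(log(diag(L))).
def log_marginal_likelihood(X, y, ell=1.0, sigma_n2=0.1):
    d = X[:, None] - X[None, :]
    K = np.exp(-d**2 / (2 * ell**2))          # RBF kernel, unit signal variance
    n = len(X)
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s2 I)^{-1} y
    fit = -0.5 * y @ alpha                     # data-fit term
    complexity = -np.sum(np.log(np.diag(L)))   # -0.5 * log|K + s2 I|
    return fit + complexity - 0.5 * n * np.log(2 * np.pi)

X = np.linspace(0, 5, 10)
y = np.sin(X)
lml = log_marginal_likelihood(X, y)
print(lml)
```

Evaluating this over a grid of $\ell$ and $\sigma_n^2$ values is the simplest form of hyperparameter optimization; gradient-based optimizers use the same Cholesky machinery.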

Worked Example

Example 1: GP Regression for 1D Time Series

Fit a GP to 10 noisy observations of $f(x) = \sin(x)$ on $[0, 5]$ with $\sigma_n^2 = 0.1$, using an RBF kernel with $\ell = 1$, $\sigma_f^2 = 1$.

Build $K \in \mathbb{R}^{10 \times 10}$ with $K_{ij} = \exp(-(x_i - x_j)^2 / 2)$, and solve $(K + 0.1 I)\alpha = \mathbf{y}$. The posterior mean at a new point $x_*$ is $\mu_* = \mathbf{k}_*^T \alpha$.

At a training point: $\sigma_*^2 \approx 0.1$ (noise level only). Far from the training points ($|x_* - x_i| \gg \ell$): $\sigma_*^2 \approx \sigma_f^2 = 1$ (the posterior collapses to the prior). The uncertainty profile visualizes where the model is confident.
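A minimal sketch of Example 1, with an illustrative seed and noise draws; it checks the two limiting behaviors of the posterior variance directly.

```python
import numpy as np

# Exact GP regression: RBF kernel (ell = 1, sigma_f^2 = 1), noise var 0.1.
rng = np.random.default_rng(3)
X = np.linspace(0, 5, 10)
y = np.sin(X) + np.sqrt(0.1) * rng.standard_normal(10)

def rbf(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2 / 2)

K = rbf(X, X)
A = K + 0.1 * np.eye(10)       # K + sigma_n^2 I
alpha = np.linalg.solve(A, y)  # posterior-mean weights

def predict(x_star):
    k_star = rbf(x_star, X)    # cross-covariances k_*
    mean = k_star @ alpha
    var = 1.0 - np.diag(k_star @ np.linalg.solve(A, k_star.T))
    return mean, var

mean_train, var_train = predict(X[:1])         # at a training point
mean_far, var_far = predict(np.array([50.0]))  # far from all data
print(var_train[0], var_far[0])  # small near data; reverts to prior (1) far away
```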

Computational cost: $O(n^3)$ for the Cholesky factorization. For $n = 1000$: $\sim 10^9$ operations, feasible. For $n = 10^6$: requires sparse GP approximations (inducing points, Nyström, KISS-GP).

Example 2: Brownian Motion and the Heat Equation

The heat equation $\partial_t u = \frac{1}{2}\Delta u$ with initial condition $u(x, 0) = g(x)$ has the Brownian-motion solution:

$$u(x, t) = E_x[g(B_t)] = \int g(y) \frac{e^{-(y-x)^2/(2t)}}{\sqrt{2\pi t}} \, dy.$$

The solution is a Gaussian convolution of the initial condition. Boundary value problems: $u(x) = E_x[g(B_\tau)]$, where $\tau$ is the first exit time from the domain. Brownian motion literally solves PDEs; this is the Feynman-Kac connection exploited in option pricing and neural PDE solvers.
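The representation $u(x, t) = E_x[g(B_t)]$ can be checked by Monte Carlo. A minimal sketch with the hypothetical initial condition $g(y) = y^2$, chosen because the exact answer is known in closed form: $E_x[B_t^2] = x^2 + t$.

```python
import numpy as np

# Monte Carlo solution of the heat equation at one point (x, t).
rng = np.random.default_rng(4)
x, t = 1.5, 0.8
B_t = x + np.sqrt(t) * rng.standard_normal(1_000_000)  # B_t started at x
u_mc = np.mean(B_t**2)      # Monte Carlo estimate of u(x, t) = E_x[B_t^2]
u_exact = x**2 + t          # closed-form value for g(y) = y^2
print(u_mc, u_exact)
```

The same recipe extends to boundary value problems by simulating paths until their first exit time $\tau$, which is the basis of walk-on-spheres style solvers.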

Example 3: Kernel Selection and the Matérn Family

The Matérn family parameterizes sample path smoothness by $\nu$: sample paths are $(k-1)$-times differentiable when $\nu = k - 1/2$.

For GP regression on weather temperature (plausibly a $\nu = 5/2$, twice-differentiable signal), compare RBF (infinitely smooth) against Matérn-3/2 (once differentiable) by predictive likelihood on held-out days. RBF may oversmooth, producing predictions that miss sharp weather transitions; Matérn-3/2 allows more realistic variability. The marginal likelihood selects an appropriate $\nu$ from the data, penalizing RBF's excess smoothness when the data are not infinitely smooth.
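A minimal sketch of this comparison on synthetic data (the data, noise level, and fixed hyperparameters are illustrative assumptions; which kernel scores higher depends on the data's actual smoothness):

```python
import numpy as np

# Score two fixed kernels on the same data by log marginal likelihood.
rng = np.random.default_rng(6)
X = np.linspace(0, 10, 40)
y = np.sin(X) + 0.3 * rng.standard_normal(40)   # hypothetical observations

def lml(K, y, sigma_n2=0.09):
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

r = np.abs(X[:, None] - X[None, :])
K_rbf = np.exp(-r**2 / 2)                                   # RBF, ell = 1
K_matern = (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)   # Matern-3/2, ell = 1
print(lml(K_rbf, y), lml(K_matern, y))
```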

Connections

Where Your Intuition Breaks

GP regression provides exact posterior uncertainty — but only if the kernel encodes the correct assumptions about the function. An RBF kernel (infinitely smooth prior) will be confidently wrong near sparse data with sharp transitions: the posterior variance will be small in regions the kernel considers smooth, even when the true function is not. The marginal likelihood can optimize kernel hyperparameters, but it has local optima and can overfit the length scale to noise, making predictions appear certain where they should not be. GP uncertainty is calibrated only when the kernel family contains the true covariance structure — and in practice, choosing the kernel is a modeling assumption the data cannot fully resolve.

💡Intuition

The kernel is the prior on function structure. Choosing a kernel is not a technical detail; it encodes strong assumptions about the function. RBF assumes infinite smoothness (no sharp edges), the periodic kernel assumes exact periodicity, the linear kernel assumes a linear relationship. Kernel choice should be guided by domain knowledge. Composition rules allow rich priors: $k_1 + k_2$ models additive structure (e.g., trend + seasonality), $k_1 \times k_2$ models multiplicative interactions. The expressive power of GP models comes entirely from kernel design.

💡Intuition

GP regression is equivalent to kernel ridge regression. The GP posterior mean $\hat f = K(K + \sigma_n^2 I)^{-1}\mathbf{y}$ is the same as the kernel ridge regression solution $\hat f = \arg\min_{f \in \mathcal{H}_k} \sum_i (y_i - f(x_i))^2 + \sigma_n^2 \|f\|_{\mathcal{H}_k}^2$. GPs add uncertainty quantification (the posterior variance) on top of the point estimate. This equivalence means GP regression and KRR have identical computational cost and predictions; the GP framework adds calibrated uncertainty intervals at little extra cost.

⚠️Warning

GP scaling is cubic in $n$; sparse approximations are essential at scale. The exact GP requires factorizing an $n \times n$ matrix: $O(n^3)$ time and $O(n^2)$ memory. At $n = 10^4$, storing the kernel matrix alone takes $\sim 0.8$ GB in double precision. The standard fix is inducing point approximations (sparse GPs, FITC, VFE): select $m \ll n$ inducing points $Z$ and approximate the full covariance by $K_{nm} K_{mm}^{-1} K_{mn}$. Cost drops to $O(nm^2)$. With $m = 1000$, approximate GP inference on $n = 10^6$ becomes feasible. GPyTorch exploits GPU parallelism and Cholesky-free solvers to scale GPs to millions of points.
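A minimal sketch of the low-rank approximation $K \approx K_{nm} K_{mm}^{-1} K_{mn}$ described above (sizes, length scale, and inducing locations are illustrative choices; the dense $n \times n$ matrices appear only for the error check, which a real sparse GP never forms):

```python
import numpy as np

# Nystrom-style low-rank approximation of an RBF kernel matrix.
def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(5)
n, m = 500, 50
X = np.sort(rng.uniform(0, 10, n))
Z = np.linspace(0, 10, m)            # inducing point locations

K_nm = rbf(X, Z)
K_mm = rbf(Z, Z) + 1e-6 * np.eye(m)  # jitter for a stable solve
K_approx = K_nm @ np.linalg.solve(K_mm, K_nm.T)  # rank-m surrogate for K
K_exact = rbf(X, X)
rel_err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(rel_err)
```

With inducing points spaced a fraction of a length scale apart, the relative error is tiny; downstream solves then work with the $m \times m$ factor at $O(nm^2)$ cost.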
