Brownian Motion & Gaussian Processes
Brownian motion is the canonical continuous-time stochastic process — continuous paths, independent increments, and Gaussian marginals. Gaussian processes generalize it to distributions over functions indexed by arbitrary inputs, giving a principled Bayesian framework for function estimation that unifies kernel methods, spline interpolation, and kriging.
Concepts
Brownian motion: independent increments, paths are continuous but nowhere differentiable. The ±2σ√t envelope contains ≈95% of paths at each time.
B(t) − B(s) ⊥ B(s) for s ≤ t (independent increments). Paths are a.s. continuous but have infinite total variation, so integrals against dB cannot be defined as Riemann–Stieltjes integrals.
A Gaussian Process is a distribution over functions. The kernel encodes prior beliefs about smoothness and structure. Conditioning on observations makes the uncertainty collapse at the data points.
A particle suspended in liquid jitters continuously under the microscope: continuous motion, but with no well-defined velocity at any point. Brownian motion is the precise model: continuous paths, nowhere differentiable, with variance growing linearly in time. Gaussian processes generalize this from one random trajectory to a distribution over entire functions: instead of asking where a particle is at time t, ask what the function value f(x) is at input x, with the correlation structure determined by a kernel.
Standard Brownian Motion
Standard Brownian motion (the Wiener process) is a stochastic process {B(t) : t ≥ 0} satisfying:
- B(0) = 0 a.s.
- Independent increments: for 0 ≤ t₀ < t₁ < ⋯ < tₙ, the increments B(t₁) − B(t₀), …, B(tₙ) − B(tₙ₋₁) are independent
- Gaussian increments: B(t) − B(s) ~ N(0, t − s) for 0 ≤ s < t
- Continuous paths: t ↦ B(t) is continuous a.s.
Conditions 1–3 fully determine the finite-dimensional distributions: B is a centered Gaussian process with Cov(B(s), B(t)) = min(s, t). Condition 4 is the analytic requirement, guaranteed by Kolmogorov's continuity theorem: among the many processes sharing these finite-dimensional distributions (which agree only up to modification on null sets), it selects the version whose paths are genuinely continuous.
Non-differentiability: although continuous, Brownian paths are nowhere differentiable a.s. More precisely, limsup_{h→0} |B(t+h) − B(t)|/|h| = ∞ for all t, a.s. The total variation over [0, t] is infinite a.s., so integrals against dB cannot be defined path-by-path as Riemann–Stieltjes integrals; Itô integration is the separate theory built for exactly this.
Quadratic variation: over partitions of [0, t] with mesh → 0, Σᵢ (B(tᵢ₊₁) − B(tᵢ))² → t in probability (and a.s. along nested dyadic partitions). The quadratic variation is deterministic and equals t; informally, (dB)² = dt, the fundamental identity underlying Itô's lemma.
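Both facts are easy to see numerically. A minimal NumPy sketch (horizon, grid size, and seed are arbitrary choices): simulate one path on a fine grid, then sum squared increments (which should land near t) and absolute increments (which blow up as the mesh shrinks).

```python
import numpy as np

rng = np.random.default_rng(0)

T, n = 1.0, 100_000              # horizon and number of grid steps (arbitrary)
dt = T / n
# Simulate B on a fine grid: increments are iid N(0, dt)
dB = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])

# Quadratic variation: sum of squared increments over the partition
qv = np.sum(np.diff(B) ** 2)
# Total variation: sum of absolute increments; grows like sqrt(n)
tv = np.sum(np.abs(np.diff(B)))

print(f"quadratic variation ≈ {qv:.3f} (theory: {T})")
print(f"total variation ≈ {tv:.1f} (diverges as the mesh shrinks)")
```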
Martingale properties: B(t) is a martingale; B(t)² − t is a martingale; and the exponential process exp(λB(t) − λ²t/2) (the Doléans-Dade exponential of λB) is a martingale for every λ ∈ ℝ.
Brownian Motion as Scaled Random Walk
The Donsker invariance principle: let X₁, X₂, … be iid with mean 0 and variance 1. Define the rescaled process Wₙ(t) = (1/√n) Σ_{i ≤ ⌊nt⌋} Xᵢ. Then Wₙ converges in distribution, as a process, to standard Brownian motion. Brownian motion is the universal scaling limit of random walks: it is to stochastic processes what the Gaussian is to distributions.
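A quick NumPy sketch of the marginal consequence of Donsker's theorem (step distribution, n, and path count are arbitrary choices): rescaled ±1 walks evaluated at t = 1 should be approximately N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(1)

# Rescaled random walk: W_n(1) = S_n / sqrt(n) with iid ±1 steps (mean 0, variance 1)
n, n_paths = 1_000, 4_000
steps = rng.choice([-1.0, 1.0], size=(n_paths, n))
W1 = steps.sum(axis=1) / np.sqrt(n)   # one W_n(1) value per simulated path

# Donsker at the marginal level: W_n(1) is approximately standard normal
print("mean:", W1.mean(), " variance:", W1.var())
```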
Gaussian Processes
A Gaussian process f ~ GP(m, k) is a collection of random variables {f(x) : x ∈ X} such that every finite marginal (f(x₁), …, f(xₙ)) is multivariate Gaussian with mean vector (m(xᵢ))ᵢ and covariance matrix (k(xᵢ, xⱼ))ᵢⱼ.
A GP is a distribution over functions: a sample from a GP is a function f : X → ℝ.
Kernels must be positive semidefinite: Σᵢ Σⱼ cᵢcⱼ k(xᵢ, xⱼ) ≥ 0 for every finite choice of points x₁, …, xₙ and coefficients c₁, …, cₙ.
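Positive semidefiniteness can be checked numerically for a specific kernel and point set: every Gram matrix must have nonnegative eigenvalues. A sketch using an RBF kernel (the grid and hyperparameters are arbitrary illustrative choices):

```python
import numpy as np

def rbf(x, xp, ell=1.0, sf=1.0):
    """Squared-exponential kernel k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2))."""
    return sf**2 * np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / ell**2)

x = np.linspace(-3, 3, 50)
K = rbf(x, x)

# PSD check: all eigenvalues of the Gram matrix are >= 0 (up to float error)
eigs = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigs.min())
```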
| Kernel | k(x, x′) | Properties |
|---|---|---|
| RBF (squared exponential) | σ² exp(−(x − x′)²/(2ℓ²)) | Infinitely differentiable, universal |
| Matérn-3/2 | σ²(1 + √3 r/ℓ) exp(−√3 r/ℓ), r = \|x − x′\| | Once differentiable |
| Matérn-1/2 | σ² exp(−r/ℓ) | Corresponds to Ornstein–Uhlenbeck process; continuous but not differentiable |
| Periodic | σ² exp(−2 sin²(π\|x − x′\|/p)/ℓ²) | Exactly periodic with period p |
| Linear | σ² x·x′ | Bayesian linear regression prior |
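To see what these kernels mean as priors, draw sample functions from a zero-mean GP under each one (unit variance and length scale are assumed; the jitter term is a standard numerical stabilizer). The Matérn-1/2 draw is visibly rougher than the RBF draw:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 200)
r = np.abs(x[:, None] - x[None, :])   # pairwise distances

kernels = {
    "rbf":      np.exp(-0.5 * r**2),                              # infinitely smooth
    "matern32": (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r),   # once differentiable
    "matern12": np.exp(-r),                                       # OU: continuous, not differentiable
}

samples = {}
for name, K in kernels.items():
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))  # jitter for numerical stability
    samples[name] = L @ rng.standard_normal(len(x))    # one draw f ~ N(0, K)

# Average absolute increment as a crude roughness measure
for name, f in samples.items():
    print(name, np.abs(np.diff(f)).mean())
```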
Mercer's theorem: every PSD kernel on a compact domain corresponds to a feature map φ and an RKHS H such that k(x, x′) = ⟨φ(x), φ(x′)⟩_H.
GP Regression (Kriging)
Given observations y = (y₁, …, yₙ) with yᵢ = f(xᵢ) + εᵢ, εᵢ ~ N(0, σ²), the joint distribution of (y, f(x*)) is Gaussian. The posterior f(x*) | y is Gaussian with:

μ(x*) = k*ᵀ (K + σ²I)⁻¹ y
σ²(x*) = k(x*, x*) − k*ᵀ (K + σ²I)⁻¹ k*

where Kᵢⱼ = k(xᵢ, xⱼ) and (k*)ᵢ = k(xᵢ, x*).
The posterior mean is a linear smoother, μ(x*) = Σᵢ αᵢ k(xᵢ, x*), with weights α = (K + σ²I)⁻¹ y. The posterior variance quantifies uncertainty: it shrinks toward zero at training points (exactly zero when σ = 0) and grows back toward the prior variance k(x*, x*) away from data.
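The posterior equations translate directly into a few lines of NumPy. A sketch (the kernel form, hyperparameter values, and the sin target are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def rbf(a, b, ell=0.2, sf=1.0):
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(X, y, Xs, noise=0.1, ell=0.2, sf=1.0):
    """Posterior mean and variance at test points Xs, via Cholesky."""
    K = rbf(X, X, ell, sf) + noise**2 * np.eye(len(X))
    Ks = rbf(X, Xs, ell, sf)                              # n x m cross-covariance
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s^2 I)^{-1} y
    mu = Ks.T @ alpha                                     # posterior mean k*^T alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs, ell, sf)) - np.sum(v**2, axis=0)  # k** - k*^T K^{-1} k*
    return mu, var

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 10)                      # 10 training inputs (assumed)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(10)
Xs = np.linspace(0, 1, 50)
mu, var = gp_posterior(X, y, Xs)
```

Far outside the data (e.g. x* = 5), the variance returns to the prior value sf² = 1, exactly as the text describes.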
Marginal likelihood for kernel hyperparameter optimization:

log p(y | θ) = −½ yᵀ(K + σ²I)⁻¹ y − ½ log det(K + σ²I) − (n/2) log 2π
The first term measures data fit; the second penalizes model complexity (the log determinant is the sum of log eigenvalues). Maximizing over the hyperparameters θ balances fit against complexity: an automatic Occam's razor.
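The marginal likelihood is cheap to evaluate from a Cholesky factor, since log det(K + σ²I) = 2 Σᵢ log Lᵢᵢ. A sketch with an RBF kernel and a grid search over the length scale (data, kernel, and grid are illustrative assumptions):

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, sf, noise):
    """log p(y) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log(2 pi)."""
    n = len(X)
    K = sf**2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2)
    Ky = K + noise**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = -0.5 * y @ alpha                       # data-fit term
    complexity = -np.sum(np.log(np.diag(L)))     # -1/2 log|Ky| via Cholesky
    return fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(4)
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)

# Grid over length scales: the marginal likelihood trades off fit vs complexity
ells = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
lmls = [log_marginal_likelihood(X, y, ell, 1.0, 0.1) for ell in ells]
best = ells[int(np.argmax(lmls))]
print("best length scale on grid:", best)
```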
Worked Example
Example 1: GP Regression for 1D Time Series
Fit a GP to 10 noisy observations of an unknown smooth 1D function, with observation noise σ, using an RBF kernel with length scale ℓ and signal variance σ_f².
Build Kᵢⱼ = k(xᵢ, xⱼ), add σ²I, and solve α = (K + σ²I)⁻¹ y via Cholesky. The posterior mean at a new point is μ(x*) = k*ᵀ α = Σᵢ αᵢ k(xᵢ, x*).
At a training point: σ²(x*) ≈ σ² (the noise level only). Far from training points (|x* − xᵢ| ≫ ℓ): σ²(x*) ≈ σ_f², reverting to the prior. The uncertainty profile visualizes where the model is confident.
Computational cost: O(n³) for the Cholesky factorization. For n = 10 this is trivial, and n in the low thousands remains feasible on a laptop; much beyond that, exact inference requires sparse GP approximations (inducing points, Nyström, KISS-GP).
Example 2: Brownian Motion and the Heat Equation
The heat equation ∂u/∂t = ½ ∂²u/∂x² with initial condition u(0, x) = f(x) has the Brownian motion solution:

u(t, x) = E[f(x + B(t))]
The solution is a Gaussian convolution of the initial condition: f convolved with the N(0, t) density. Boundary value problems work similarly: u(x) = E[g(B(τ))], where τ is the first exit time from the domain and g the boundary data. Brownian motion literally solves PDEs; this is the Feynman–Kac connection exploited in option pricing and neural-network PDE solvers.
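The representation u(t, x) = E[f(x + B(t))] is easy to test by Monte Carlo. For f(x) = x², E[(x + B(t))²] = x² + t exactly, so simulation should recover it (sample count and seed are arbitrary choices in this sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

def heat_solution_mc(f, x, t, n_samples=200_000):
    """u(t, x) = E[f(x + B(t))] by Monte Carlo, using B(t) ~ N(0, t)."""
    return f(x + np.sqrt(t) * rng.standard_normal(n_samples)).mean()

# For f(x) = x^2 the exact solution is u(t, x) = x^2 + t
u = heat_solution_mc(lambda z: z**2, x=1.0, t=0.5)
print(u)   # ≈ 1.5
```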
Example 3: Kernel Selection and the Matérn Family
The Matérn family parameterizes sample path smoothness by ν: sample paths are k-times differentiable precisely when ν > k.
For GP regression on weather temperature (plausibly about twice differentiable), compare RBF (infinitely smooth) vs Matérn-3/2 (once differentiable) by marginal likelihood on held-out days. RBF may produce overly smooth predictions that miss sharp weather transitions; Matérn-3/2 allows more realistic variability. The marginal likelihood selects an appropriate ν from the data: it penalizes RBF's excess smoothness when the data is not infinitely smooth.
Connections
Where Your Intuition Breaks
GP regression provides exact posterior uncertainty — but only if the kernel encodes the correct assumptions about the function. An RBF kernel (infinitely smooth prior) will be confidently wrong near sparse data with sharp transitions: the posterior variance will be small in regions the kernel considers smooth, even when the true function is not. The marginal likelihood can optimize kernel hyperparameters, but it has local optima and can overfit the length scale to noise, making predictions appear certain where they should not be. GP uncertainty is calibrated only when the kernel family contains the true covariance structure — and in practice, choosing the kernel is a modeling assumption the data cannot fully resolve.
The kernel is the prior on function structure. Choosing a kernel is not a technical detail: it encodes strong assumptions about the function. RBF assumes infinite smoothness (no sharp edges), the periodic kernel assumes exact periodicity, the linear kernel assumes a linear relationship. Kernel choice should be guided by domain knowledge. Composition rules allow rich priors: k₁ + k₂ models additive structure (e.g., trend + seasonality), k₁ × k₂ models multiplicative interactions. The expressive power of GP models comes entirely from kernel design.
GP regression is equivalent to kernel ridge regression. The GP posterior mean coincides with the KRR solution with regularization λ = σ²: μ(x) = Σᵢ αᵢ k(xᵢ, x), α = (K + σ²I)⁻¹ y. GPs add uncertainty quantification (the posterior variance) on top of the point estimate. This equivalence means GP mean prediction and KRR have identical computational cost and identical predictions; the GP framework adds calibrated uncertainty intervals for little extra work.
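The equivalence is a one-liner to verify: the KRR weights from the ridge formula with λ = σ² and the GP posterior-mean weights are the same vector. A sketch (kernel, data, and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(15)
Xs = np.linspace(0, 1, 40)

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

noise = 0.1
K, Ks = rbf(X, X), rbf(X, Xs)

# GP posterior mean: k*^T (K + sigma^2 I)^{-1} y
gp_mean = Ks.T @ np.linalg.solve(K + noise**2 * np.eye(len(X)), y)

# Kernel ridge regression with lambda = sigma^2: alpha = (K + lambda I)^{-1} y
krr_alpha = np.linalg.solve(K + noise**2 * np.eye(len(X)), y)
krr_pred = Ks.T @ krr_alpha

print(np.allclose(gp_mean, krr_pred))   # identical point predictions
```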
GP scaling is cubic in n; sparse approximations are essential at scale. The exact GP requires factorizing an n × n matrix: O(n³) time and O(n²) memory. At n = 10⁵, storing the kernel matrix alone takes 80 GB of RAM (10¹⁰ entries at 8 bytes each). The standard fix is inducing-point approximations (sparse GPs, FITC, VFE): select m ≪ n inducing points z₁, …, z_m and approximate the full covariance by K ≈ K_nm K_mm⁻¹ K_mn. Cost drops to O(nm²). With m in the hundreds, n in the hundreds of thousands becomes feasible. GPyTorch exploits GPU parallelism and Cholesky-free solvers to scale GPs to millions of points.
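The low-rank step at the heart of inducing-point methods is the Nyström approximation. A sketch (kernel, data, and m are illustrative choices): with m inducing points covering the input range, the rank-m matrix K_nm K_mm⁻¹ K_mn closely matches the full n × n Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(7)

def rbf(a, b, ell=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

n, m = 500, 30
X = np.sort(rng.uniform(0, 1, n))
Z = np.linspace(0, 1, m)               # inducing points covering the input range

Kmm = rbf(Z, Z) + 1e-8 * np.eye(m)     # jitter for numerical stability
Knm = rbf(X, Z)

# Nystrom approximation K ≈ Knm Kmm^{-1} Kmn: rank m, so solves cost O(n m^2)
Q = Knm @ np.linalg.solve(Kmm, Knm.T)
err = np.abs(rbf(X, X) - Q).max()
print("max |K - Q| =", err)            # small when m covers the input range
```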