
Brownian Motion & Gaussian Processes

Brownian motion is the canonical continuous-time stochastic process — continuous paths, independent increments, and Gaussian marginals. Gaussian processes generalize it to distributions over functions indexed by arbitrary inputs, giving a principled Bayesian framework for function estimation that unifies kernel methods, spline interpolation, and kriging.

Concepts

Brownian motion: independent increments; paths are continuous but nowhere differentiable. The $\pm 2\sigma\sqrt{t}$ envelope contains about 95% of paths at each time.

Key moments of Brownian motion with scale $\sigma$:

| Quantity | Value |
| --- | --- |
| $E[B(t)]$ | $0$ |
| $\mathrm{Var}[B(t)]$ | $\sigma^2 t$ |
| $\mathrm{Cov}[B(s), B(t)]$ | $\sigma^2 \min(s, t)$ |

$B(t) - B(s) \perp B(s)$ (independent increments). Paths are a.s. continuous but have infinite total variation, so integrals with respect to them cannot be defined in the Riemann-Stieltjes sense.

A Gaussian Process is a distribution over functions. The kernel encodes prior beliefs about smoothness and structure; adding observations collapses the posterior uncertainty at the data points.

RBF: infinitely smooth. Matérn-3/2: once differentiable. Periodic: repeating. Linear: Bayesian linear regression prior. Larger $\ell$ → longer-range correlations.

A particle suspended in liquid jitters continuously under the microscope: continuous motion, but with no well-defined velocity at any point. Brownian motion is the precise model: continuous paths, nowhere differentiable, with variance growing linearly in time. Gaussian processes generalize this from one random trajectory to a distribution over entire functions: instead of asking where a particle is at time $t$, ask what the function value $f(x)$ is at input $x$, with the correlation structure determined by a kernel.

Standard Brownian Motion

Standard Brownian motion (the Wiener process) $B = (B_t)_{t \geq 0}$ is a stochastic process satisfying:

  1. $B_0 = 0$ a.s.
  2. Independent increments: for $0 \leq t_0 < t_1 < \cdots < t_k$, the increments $B_{t_1} - B_{t_0}, \ldots, B_{t_k} - B_{t_{k-1}}$ are independent
  3. Gaussian increments: $B_t - B_s \sim \mathcal{N}(0, t - s)$ for $s < t$
  4. Continuous paths: $t \mapsto B_t$ is continuous a.s.

Conditions 1–3 determine all finite-dimensional distributions: they make $B$ a Gaussian process with $E[B_t] = 0$ and $\mathrm{Cov}(B_s, B_t) = \min(s, t)$. Condition 4 is the analytic requirement, and Kolmogorov's continuity theorem guarantees it can be met: processes sharing these finite-dimensional distributions agree only up to modification, and condition 4 selects a version whose paths are genuinely continuous.

Non-differentiability: although continuous, Brownian paths are nowhere differentiable a.s. More precisely, $\limsup_{h \to 0} |B_{t+h} - B_t| / |h| = \infty$ for all $t$ a.s. The total variation over $[0, T]$ is infinite a.s., so integrals with respect to $B$ cannot be defined pathwise as Riemann-Stieltjes integrals; Itô integration is a separate theory built for this purpose.

Quadratic variation: $[B, B]_T = \lim_{\|\Pi\| \to 0} \sum_k (B_{t_{k+1}} - B_{t_k})^2 = T$, with convergence in probability (and a.s. along nested dyadic partitions). The quadratic variation is deterministic and equals $T$: the fundamental identity underlying Itô's lemma.
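A minimal simulation sketch of this identity (grid size, horizon, and seed are illustrative choices): on a fine partition of $[0, T]$, the sum of squared increments of one simulated path lands near $T$, while the sum of absolute increments grows without bound.

```python
import numpy as np

# One Brownian path on [0, T], sampled on a uniform grid of n steps.
rng = np.random.default_rng(0)
T, n = 2.0, 200_000
dt = T / n
increments = rng.normal(0.0, np.sqrt(dt), size=n)  # B_{t_{k+1}} - B_{t_k}

quadratic_variation = np.sum(increments**2)    # -> T as the grid refines
total_variation = np.sum(np.abs(increments))   # grows like sqrt(n): unbounded
print(quadratic_variation, total_variation)
```

The quadratic variation concentrates tightly around $T = 2$ while the total variation is already in the hundreds at this resolution, illustrating why pathwise Riemann-Stieltjes integration fails.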

Martingale properties: $B_t$ is a martingale; $B_t^2 - t$ is a martingale; the exponential process $e^{\theta B_t - \theta^2 t/2}$ (Doléans-Dade exponential) is a martingale for all $\theta \in \mathbb{R}$.

Brownian Motion as Scaled Random Walk

The Donsker invariance principle: let $X_1, X_2, \ldots$ be iid with mean 0 and variance $\sigma^2$. Define the rescaled process $B^{(n)}_t = \frac{1}{\sigma\sqrt{n}} \sum_{k=1}^{\lfloor nt \rfloor} X_k$. Then $B^{(n)} \Rightarrow B$ (converges in distribution as processes) to standard Brownian motion. Brownian motion is the universal scaling limit of random walks; it is to stochastic processes what the Gaussian is to distributions.
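The rescaling can be sketched with Rademacher steps (a simple choice with mean 0 and variance 1; the sizes and seed below are illustrative): the scaled partial sums reproduce the $\mathcal{N}(0, 1)$ marginal at $t = 1$ and the $\min(s, t)$ covariance.

```python
import numpy as np

# Rescaled random walk B^(n)_t = n^{-1/2} * sum of the first floor(nt) steps.
rng = np.random.default_rng(1)
n, paths = 5_000, 1_000
steps = rng.choice([-1.0, 1.0], size=(paths, n))   # iid Rademacher steps
B_n = np.cumsum(steps, axis=1) / np.sqrt(n)        # B^(n) at t = k/n

final = B_n[:, -1]                              # marginal at t = 1: ~ N(0, 1)
cov_half = np.mean(B_n[:, n // 2 - 1] * final)  # Cov(B_0.5, B_1) -> min(0.5, 1)
print(final.mean(), final.var(), cov_half)
```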

Gaussian Processes

A Gaussian process $f \sim \mathcal{GP}(\mu, k)$ is a collection of random variables $(f(x))_{x \in \mathcal{X}}$ such that any finite marginal $(f(x_1), \ldots, f(x_n))$ is multivariate Gaussian with mean $(\mu(x_1), \ldots, \mu(x_n))$ and covariance $K_{ij} = k(x_i, x_j)$.

A GP is a distribution over functions: a sample from a GP is a function $f: \mathcal{X} \to \mathbb{R}$.

Kernels $k(x, x')$ must be positive semidefinite: $\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0$ for every finite set of inputs $x_1, \ldots, x_n$ and coefficients $c_1, \ldots, c_n$.

| Kernel | $k(x, x')$ | Properties |
| --- | --- | --- |
| RBF (squared exponential) | $\exp\bigl(-\frac{\lVert x-x'\rVert^2}{2\ell^2}\bigr)$ | Infinitely differentiable, universal |
| Matérn-3/2 | $(1+\frac{\sqrt{3}r}{\ell})e^{-\sqrt{3}r/\ell}$, $r=\lVert x-x'\rVert$ | Once differentiable |
| Matérn-1/2 | $e^{-r/\ell}$ | Corresponds to the Ornstein-Uhlenbeck process; continuous but not differentiable |
| Periodic | $\exp\bigl(-\frac{2\sin^2(\pi\lvert x-x'\rvert/p)}{\ell^2}\bigr)$ | Exactly periodic with period $p$ |
| Linear | $x^T x' + \sigma_b^2$ | Bayesian linear regression prior |
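Two kernels from the table can be sampled from directly, since any finite marginal is just a multivariate Gaussian with covariance $K$; the grid, length scale, jitter, and seed below are illustrative choices.

```python
import numpy as np

# Draw one GP prior sample per kernel: f ~ N(0, K) via the Cholesky factor.
def rbf(x1, x2, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-d**2 / (2 * ell**2))

def matern32(x1, x2, ell=1.0):
    r = np.abs(x1[:, None] - x2[None, :])
    return (1.0 + np.sqrt(3) * r / ell) * np.exp(-np.sqrt(3) * r / ell)

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 200)
samples = {}
for name, kernel in (("rbf", rbf), ("matern32", matern32)):
    K = kernel(x, x) + 1e-6 * np.eye(len(x))  # jitter for numerical stability
    L = np.linalg.cholesky(K)
    samples[name] = L @ rng.standard_normal(len(x))  # one draw f ~ N(0, K)
```

Plotting the two draws would show the RBF sample as visibly smoother than the rougher, once-differentiable Matérn-3/2 sample.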

Mercer's theorem: every PSD kernel $k$ on a compact domain corresponds to a feature map $\phi$ and an RKHS $\mathcal{H}_k$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}_k}$.

GP Regression (Kriging)

Given observations $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the joint distribution of $(f(x_*), \mathbf{y})$ is Gaussian. The posterior $f(x_*) \mid \mathbf{y}$ is Gaussian with:

$$\mu_*(x_*) = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*,$$

where $K_{ij} = k(x_i, x_j)$ and $\mathbf{k}_{*i} = k(x_*, x_i)$.

The posterior mean is a linear smoother with weights $\alpha = (K + \sigma_n^2 I)^{-1} \mathbf{y}$. The posterior variance quantifies uncertainty: $\sigma_*^2 = 0$ at training points (when $\sigma_n^2 \to 0$) and grows toward the prior variance away from data.

Marginal likelihood for kernel hyperparameter optimization:

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^T (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K_\theta + \sigma_n^2 I| - \frac{n}{2} \log 2\pi.$$

The first term rewards data fit; the second penalizes model complexity (the log determinant is the sum of log eigenvalues). Maximizing over $\theta$ balances fit against complexity: an automatic Occam's razor.
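The marginal likelihood above is usually computed via a Cholesky factorization of $K + \sigma_n^2 I$ rather than an explicit inverse or determinant; a minimal sketch, where the RBF kernel, toy inputs, and fixed hyperparameters are illustrative assumptions:

```python
import numpy as np

# Stable log marginal likelihood: log|K + s2 I| = 2 * sum(log(diag(L))).
def log_marginal_likelihood(X, y, ell=1.0, sigma_n2=0.1):
    d = X[:, None] - X[None, :]
    K = np.exp(-d**2 / (2 * ell**2))          # RBF kernel, unit signal variance
    n = len(X)
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s2 I)^{-1} y
    fit = -0.5 * y @ alpha                     # data-fit term
    complexity = -np.sum(np.log(np.diag(L)))   # -0.5 * log|K + s2 I|
    return fit + complexity - 0.5 * n * np.log(2 * np.pi)

X = np.linspace(0, 5, 10)
y = np.sin(X)
lml = log_marginal_likelihood(X, y)
print(lml)
```

Evaluating this over a grid of $\ell$ and $\sigma_n^2$ values is the simplest form of hyperparameter optimization; gradient-based optimizers use the same Cholesky machinery.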

Worked Example

Example 1: GP Regression for 1D Time Series

Fit a GP to 10 noisy observations of $f(x) = \sin(x)$ on $[0, 5]$ with $\sigma_n^2 = 0.1$, using an RBF kernel with $\ell = 1$, $\sigma_f^2 = 1$.

Build $K \in \mathbb{R}^{10 \times 10}$ with $K_{ij} = \exp(-(x_i - x_j)^2 / 2)$, and solve $(K + 0.1 I)\alpha = \mathbf{y}$. The posterior mean at a new point $x_*$ is $\mu_* = \mathbf{k}_*^T \alpha$.

At a training point: $\sigma_*^2 \approx 0.1$ (noise level only). Far from the training points ($|x_* - x_i| \gg \ell$): $\sigma_*^2 \approx \sigma_f^2 = 1$ (the posterior collapses to the prior). The uncertainty profile visualizes where the model is confident.
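A minimal sketch of Example 1, with an illustrative seed and noise draws; it checks the two limiting behaviors of the posterior variance directly.

```python
import numpy as np

# Exact GP regression: RBF kernel (ell = 1, sigma_f^2 = 1), noise var 0.1.
rng = np.random.default_rng(3)
X = np.linspace(0, 5, 10)
y = np.sin(X) + np.sqrt(0.1) * rng.standard_normal(10)

def rbf(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2 / 2)

K = rbf(X, X)
A = K + 0.1 * np.eye(10)       # K + sigma_n^2 I
alpha = np.linalg.solve(A, y)  # posterior-mean weights

def predict(x_star):
    k_star = rbf(x_star, X)    # cross-covariances k_*
    mean = k_star @ alpha
    var = 1.0 - np.diag(k_star @ np.linalg.solve(A, k_star.T))
    return mean, var

mean_train, var_train = predict(X[:1])         # at a training point
mean_far, var_far = predict(np.array([50.0]))  # far from all data
print(var_train[0], var_far[0])  # small near data; reverts to prior (1) far away
```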

Computational cost: $O(n^3)$ for the Cholesky factorization. For $n = 1000$: $\sim 10^9$ operations, feasible. For $n = 10^6$: requires sparse GP approximations (inducing points, Nyström, KISS-GP).

Example 2: Brownian Motion and the Heat Equation

The heat equation $\partial_t u = \frac{1}{2}\Delta u$ with initial condition $u(x, 0) = g(x)$ has the Brownian-motion solution:

$$u(x, t) = E_x[g(B_t)] = \int g(y) \frac{e^{-(y-x)^2/(2t)}}{\sqrt{2\pi t}} \, dy.$$

The solution is a Gaussian convolution of the initial condition. Boundary value problems: $u(x) = E_x[g(B_\tau)]$, where $\tau$ is the first exit time from the domain. Brownian motion literally solves PDEs; this is the Feynman-Kac connection exploited in option pricing and neural PDE solvers.
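The representation $u(x, t) = E_x[g(B_t)]$ can be checked by Monte Carlo. A minimal sketch with the hypothetical initial condition $g(y) = y^2$, chosen because the exact answer is known in closed form: $E_x[B_t^2] = x^2 + t$.

```python
import numpy as np

# Monte Carlo solution of the heat equation at one point (x, t).
rng = np.random.default_rng(4)
x, t = 1.5, 0.8
B_t = x + np.sqrt(t) * rng.standard_normal(1_000_000)  # B_t started at x
u_mc = np.mean(B_t**2)      # Monte Carlo estimate of u(x, t) = E_x[B_t^2]
u_exact = x**2 + t          # closed-form value for g(y) = y^2
print(u_mc, u_exact)
```

The same recipe extends to boundary value problems by simulating paths until their first exit time $\tau$, which is the basis of walk-on-spheres style solvers.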

Example 3: Kernel Selection and the Matérn Family

The Matérn family parameterizes sample path smoothness by $\nu$: sample paths are $(k-1)$-times differentiable when $\nu = k - 1/2$.

For GP regression on weather temperature (plausibly a $\nu = 5/2$, twice-differentiable signal), compare RBF (infinitely smooth) against Matérn-3/2 (once differentiable) by predictive likelihood on held-out days. RBF may oversmooth, producing predictions that miss sharp weather transitions; Matérn-3/2 allows more realistic variability. The marginal likelihood selects an appropriate $\nu$ from the data, penalizing RBF's excess smoothness when the data are not infinitely smooth.
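A minimal sketch of this comparison on synthetic data (the data, noise level, and fixed hyperparameters are illustrative assumptions; which kernel scores higher depends on the data's actual smoothness):

```python
import numpy as np

# Score two fixed kernels on the same data by log marginal likelihood.
rng = np.random.default_rng(6)
X = np.linspace(0, 10, 40)
y = np.sin(X) + 0.3 * rng.standard_normal(40)   # hypothetical observations

def lml(K, y, sigma_n2=0.09):
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

r = np.abs(X[:, None] - X[None, :])
K_rbf = np.exp(-r**2 / 2)                                   # RBF, ell = 1
K_matern = (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)   # Matern-3/2, ell = 1
print(lml(K_rbf, y), lml(K_matern, y))
```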

Connections

Where Your Intuition Breaks

GP regression provides exact posterior uncertainty — but only if the kernel encodes the correct assumptions about the function. An RBF kernel (infinitely smooth prior) will be confidently wrong near sparse data with sharp transitions: the posterior variance will be small in regions the kernel considers smooth, even when the true function is not. The marginal likelihood can optimize kernel hyperparameters, but it has local optima and can overfit the length scale to noise, making predictions appear certain where they should not be. GP uncertainty is calibrated only when the kernel family contains the true covariance structure — and in practice, choosing the kernel is a modeling assumption the data cannot fully resolve.

💡Intuition

The kernel is the prior on function structure. Choosing a kernel is not a technical detail; it encodes strong assumptions about the function. RBF assumes infinite smoothness (no sharp edges), the periodic kernel assumes exact periodicity, the linear kernel assumes a linear relationship. Kernel choice should be guided by domain knowledge. Composition rules allow rich priors: $k_1 + k_2$ models additive structure (e.g., trend + seasonality), $k_1 \times k_2$ models multiplicative interactions. The expressive power of GP models comes entirely from kernel design.

💡Intuition

GP regression is equivalent to kernel ridge regression. The GP posterior mean $\hat f = K(K + \sigma_n^2 I)^{-1}\mathbf{y}$ is the same as the kernel ridge regression solution $\hat f = \arg\min_{f \in \mathcal{H}_k} \sum_i (y_i - f(x_i))^2 + \sigma_n^2 \|f\|_{\mathcal{H}_k}^2$. GPs add uncertainty quantification (the posterior variance) on top of the point estimate. This equivalence means GP regression and KRR have identical computational cost and predictions; the GP framework adds calibrated uncertainty intervals at little extra cost.

⚠️Warning

GP scaling is cubic in $n$; sparse approximations are essential at scale. The exact GP requires factorizing an $n \times n$ matrix: $O(n^3)$ time and $O(n^2)$ memory. At $n = 10^4$, storing the kernel matrix alone takes $\sim 0.8$ GB in double precision. The standard fix is inducing point approximations (sparse GPs, FITC, VFE): select $m \ll n$ inducing points $Z$ and approximate the full covariance by $K_{nm} K_{mm}^{-1} K_{mn}$. Cost drops to $O(nm^2)$. With $m = 1000$, approximate GP inference on $n = 10^6$ becomes feasible. GPyTorch exploits GPU parallelism and Cholesky-free solvers to scale GPs to millions of points.
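A minimal sketch of the low-rank approximation $K \approx K_{nm} K_{mm}^{-1} K_{mn}$ described above (sizes, length scale, and inducing locations are illustrative choices; the dense $n \times n$ matrices appear only for the error check, which a real sparse GP never forms):

```python
import numpy as np

# Nystrom-style low-rank approximation of an RBF kernel matrix.
def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(5)
n, m = 500, 50
X = np.sort(rng.uniform(0, 10, n))
Z = np.linspace(0, 10, m)            # inducing point locations

K_nm = rbf(X, Z)
K_mm = rbf(Z, Z) + 1e-6 * np.eye(m)  # jitter for a stable solve
K_approx = K_nm @ np.linalg.solve(K_mm, K_nm.T)  # rank-m surrogate for K
K_exact = rbf(X, X)
rel_err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(rel_err)
```

With inducing points spaced a fraction of a length scale apart, the relative error is tiny; downstream solves then work with the $m \times m$ factor at $O(nm^2)$ cost.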
