
Bridge: MCMC, Langevin Dynamics & Diffusion Models as SDEs

MCMC, Langevin dynamics, and diffusion-based generative models are three faces of the same stochastic process theory. Markov chain Monte Carlo samples from a target by constructing a chain with the right stationary distribution; Langevin dynamics speeds this up with gradient information; diffusion models invert a noise-adding SDE to generate data — all three are unified by the same SDE framework.

Concepts

Langevin dynamics adds Gaussian noise to gradient descent: x ← x − η∇U + √(2ηT) ε. Noise lets particles escape local minima and sample the Boltzmann distribution — the foundation of SGLD and diffusion models.

[Figure: a double-well potential $U(x)$.]
Langevin: samples $\propto e^{-U(x)/T}$ — explores both wells at $T \geq 1$. Pure GD: converges to the nearest minimum, can't escape. Diffusion models: run reverse Langevin from noise → data.
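The figure's behavior is easy to reproduce. A minimal sketch, assuming the double-well potential $U(x) = (x^2 - 1)^2$ and temperature $T = 1$ (illustrative choices, not specified above):

```python
import numpy as np

def grad_U(x):
    # Illustrative double-well: U(x) = (x^2 - 1)^2, so U'(x) = 4x(x^2 - 1).
    return 4 * x * (x**2 - 1)

rng = np.random.default_rng(0)
eta, T = 0.01, 1.0

# Pure gradient descent from x = -0.5: converges to the nearest minimum, x = -1.
x_gd = -0.5
for _ in range(2000):
    x_gd -= eta * grad_U(x_gd)

# Langevin dynamics: the same gradient step plus sqrt(2*eta*T) Gaussian noise.
x = -0.5
samples = np.empty(200_000)
for t in range(200_000):
    x += -eta * grad_U(x) + np.sqrt(2 * eta * T) * rng.standard_normal()
    samples[t] = x

frac_right = (samples > 0).mean()  # fraction of time spent in the right well
```

With noise, the chain repeatedly crosses the barrier and spends a substantial fraction of its time in each well; gradient descent from the same start never leaves the left well.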

Add noise to an image step by step until it becomes pure static — then learn to reverse each step. This is how diffusion-based generative models work, and it succeeds because the noise-adding process is a well-understood SDE with a tractable reverse. Langevin dynamics runs the same SDE for Bayesian sampling: gradient steps toward high-probability regions, plus noise to prevent collapse. The same stochastic differential equation governs sampling from posteriors and generating images — MCMC, Langevin, and diffusion models are one framework at three different scales.

MCMC as Discrete-Time Markov Chains

Metropolis-Hastings: to sample from $\pi(x) \propto e^{-U(x)}$, propose $x' \sim q(x' \mid x)$ and accept with probability $\min\!\bigl(1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr)$. This satisfies detailed balance $\pi(x)P(x, x') = \pi(x')P(x', x)$ by construction, ensuring $\pi$ is stationary.
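A minimal random-walk Metropolis sketch. The Gaussian proposal is symmetric, so the Hastings ratio reduces to $\pi(x')/\pi(x)$; the target here is a standard normal, $U(x) = x^2/2$ (an illustrative choice):

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis with symmetric proposal q(x'|x) = N(x, step^2)."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.standard_normal()
        # Accept with probability min(1, pi(x')/pi(x)), computed in log space.
        if np.log(rng.uniform()) < log_pi(x_prop) - log_pi(x):
            x = x_prop
        chain[t] = x  # on rejection, the current state is repeated
    return chain

# Target pi(x) ∝ e^{-U(x)} with U(x) = x^2/2, i.e. a standard normal.
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_steps=50_000)
```

The empirical mean and variance of `chain` converge to 0 and 1, the moments of the target.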

The detailed balance condition $\pi(x)P(x,x') = \pi(x')P(x',x)$ is both sufficient for stationarity and the precise formulation of time-reversibility: a chain satisfying it looks statistically identical run forward or backward. The Metropolis correction — the accept/reject step — is exactly engineered to restore detailed balance whenever the proposal $q$ would violate it. Without the correction, the chain would still explore state space but would converge to a biased stationary distribution.

Gibbs sampling: for a joint $\pi(x_1, \ldots, x_d)$, cycle through dimensions, sampling each $x_i \mid x_{-i}$ from its conditional. No accept/reject step is needed: Gibbs is a special case of MH with acceptance probability 1.
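A sketch of Gibbs sampling for a bivariate normal with correlation $\rho$ (an illustrative target where both full conditionals are one-dimensional Gaussians):

```python
import numpy as np

# Bivariate standard normal with correlation rho: each conditional is
# x_i | x_{-i} ~ N(rho * x_{-i}, 1 - rho^2), sampled exactly — no accept/reject.
rho = 0.8
rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0
samples = np.empty((20_000, 2))
for t in range(20_000):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples[t] = (x1, x2)

emp_corr = np.corrcoef(samples.T)[0, 1]  # should approach rho
```

The empirical correlation recovers $\rho$; note that mixing slows as $\rho \to 1$, since each conditional move becomes nearly deterministic.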

Key limitation: local proposals mix poorly in high dimensions. The mixing time of random-walk MH in $\mathbb{R}^d$ scales as $O(d)$ (with optimal step size $\sim d^{-1/2}$ and optimal acceptance rate $\approx 0.234$) — each coordinate takes $O(d)$ steps to move by $O(1)$.

Langevin Monte Carlo

Unadjusted Langevin Algorithm (ULA): discretize the Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$:

$$x_{t+1} = x_t - \eta \nabla U(x_t) + \sqrt{2\eta}\,\xi_t, \quad \xi_t \sim \mathcal{N}(0, I).$$

The stationary distribution of the continuous-time SDE is the Gibbs measure $\pi \propto e^{-U(x)}$. The discretized ULA introduces an $O(\eta)$ bias — it does not exactly sample $\pi$.

Metropolis-Adjusted Langevin Algorithm (MALA): correct the discretization error by using Langevin steps as MH proposals. The proposal is $q(x' \mid x) = \mathcal{N}\bigl(x - \eta\nabla U(x),\, 2\eta I\bigr)$, accepted or rejected by the MH rule. MALA is exact (its stationary distribution is exactly $\pi$) but requires an accept step.
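A MALA sketch. Because the Langevin proposal is not symmetric, the MH ratio must include the forward and reverse proposal densities; the target below is a standard normal (illustrative):

```python
import numpy as np

def mala(U, grad_U, x0, eta, n_steps, seed=0):
    """MALA: Langevin proposal N(x - eta*grad U(x), 2*eta*I) with MH correction."""
    rng = np.random.default_rng(seed)

    def log_q(x_to, x_from):
        # Log density of the Langevin proposal q(x_to | x_from), up to a constant.
        mu = x_from - eta * grad_U(x_from)
        return -np.sum((x_to - mu) ** 2) / (4 * eta)

    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_steps,) + x.shape)
    n_accept = 0
    for t in range(n_steps):
        prop = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        # log MH ratio: [-U(x') + log q(x|x')] - [-U(x) + log q(x'|x)]
        log_a = (-U(prop) + log_q(x, prop)) - (-U(x) + log_q(prop, x))
        if np.log(rng.uniform()) < log_a:
            x = prop
            n_accept += 1
        chain[t] = x
    return chain, n_accept / n_steps

# Standard normal target: U(x) = |x|^2 / 2, grad U(x) = x.
chain, acc = mala(lambda x: 0.5 * np.sum(x**2), lambda x: x,
                  x0=np.zeros(2), eta=0.2, n_steps=20_000)
```

Dropping the `log_q` terms and the accept/reject step recovers ULA, which runs faster per step but carries the $O(\eta)$ stationary bias discussed above.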

Mixing time improvement: by using gradient information to make directed proposals, MALA mixes in $O(d^{1/3})$ steps versus $O(d)$ for random-walk MH.

Preconditioned Langevin: $x_{t+1} = x_t - \eta M^{-1}\nabla U(x_t) + \sqrt{2\eta}\, M^{-1/2}\xi_t$ uses a preconditioning matrix $M$ (e.g., the Fisher information matrix or a Hessian estimate) to adapt to the geometry of $U$. This is the stochastic analog of Newton's method.

Hamiltonian Monte Carlo

Hamiltonian dynamics augment the state with a momentum variable $p \sim \mathcal{N}(0, M)$:

$$\frac{dq}{dt} = M^{-1}p, \qquad \frac{dp}{dt} = -\nabla U(q).$$

The Hamiltonian $H(q, p) = U(q) + \frac{1}{2}p^\top M^{-1}p$ is conserved along trajectories. The joint distribution $\pi(q, p) \propto e^{-H(q,p)}$ factors as $\pi(q)\,\mathcal{N}(p; 0, M)$.

HMC algorithm: (1) sample $p \sim \mathcal{N}(0, M)$; (2) run $L$ leapfrog steps to get $(q', p')$; (3) accept with MH probability $\min\bigl(1, e^{H(q,p) - H(q',p')}\bigr)$; (4) discard $p'$.

The leapfrog integrator (Störmer–Verlet) is symplectic — it exactly conserves volume in $(q, p)$ space — so the acceptance rate stays near 1 for small step size $\varepsilon$. The optimal acceptance rate for HMC in $d$ dimensions is $\approx 0.65$.
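The four steps above can be sketched directly. A minimal HMC with identity mass matrix $M = I$, targeting a 5-dimensional standard normal (illustrative choices throughout):

```python
import numpy as np

def hmc_step(U, grad_U, q, eps, L, rng):
    """One HMC transition with identity mass matrix M = I (a sketch)."""
    p = rng.standard_normal(q.shape)              # (1) resample momentum
    q_new, p_new = q.copy(), p.copy()
    # (2) L leapfrog steps: half-kick, then drifts with full kicks between,
    # then a final half-kick.
    p_new -= 0.5 * eps * grad_U(q_new)
    for i in range(L):
        q_new += eps * p_new
        if i < L - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)
    # (3) accept with probability min(1, exp(H(q,p) - H(q',p'))).
    H_old = U(q) + 0.5 * np.sum(p**2)
    H_new = U(q_new) + 0.5 * np.sum(p_new**2)
    if np.log(rng.uniform()) < H_old - H_new:
        return q_new, True                        # (4) p' is discarded
    return q, False

rng = np.random.default_rng(0)
q = np.zeros(5)
samples = np.empty((5_000, 5))
n_accept = 0
for t in range(5_000):
    q, ok = hmc_step(lambda x: 0.5 * np.sum(x**2), lambda x: x,
                     q, eps=0.2, L=10, rng=rng)
    n_accept += ok
    samples[t] = q
```

Because leapfrog nearly conserves $H$ at this step size, the acceptance rate stays close to 1, as claimed above.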

No-U-Turn Sampler (NUTS): automatically sets the trajectory length $L$ by detecting when the trajectory starts to double back, eliminating that tuning parameter. NUTS is the default sampler in Stan, PyMC, and NumPyro.

Score-Based Generative Models and Diffusion

Forward SDE (noise injection): $dX_t = f(X_t, t)\,dt + g(t)\,dB_t$, with $X_0 \sim p_0$ (the data distribution). A common choice is the variance-preserving SDE (VP-SDE, DDPM) with $f = -\frac{1}{2}\beta(t)X_t$ and $g = \sqrt{\beta(t)}$, so that $X_t \mid X_0 \sim \mathcal{N}\bigl(\alpha(t)X_0, \sigma(t)^2 I\bigr)$ with closed-form $\alpha(t), \sigma(t)$.

Anderson's reverse SDE (1982): the time-reversal of an Itô SDE $dX_t = f(X_t, t)\,dt + g(t)\,dB_t$ is:

$$d\bar X_t = \bigl[f(\bar X_t, t) - g(t)^2 \nabla_x \log p_t(\bar X_t)\bigr]\,dt + g(t)\,d\bar B_t.$$

The reverse drift requires the score function $\nabla_x \log p_t(x)$ — the gradient of the log-density at time $t$.

Score matching: train a neural network $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ by minimizing the denoising score matching objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, x_t}\!\left[\bigl\|s_\theta(x_t, t) - \nabla_{x_t}\log p(x_t \mid x_0)\bigr\|^2\right].$$

Since $p(x_t \mid x_0)$ is Gaussian (known in closed form for the VP-SDE), $\nabla_{x_t}\log p(x_t \mid x_0) = -\frac{x_t - \alpha(t)x_0}{\sigma(t)^2}$. Learning the score is equivalent to learning to denoise.
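The equivalence can be checked numerically. For $x_t = \alpha x_0 + \sigma\epsilon$, the score target $-(x_t - \alpha x_0)/\sigma^2$ equals $-\epsilon/\sigma$, so regressing the score is noise regression up to a scale factor (illustrative values of $\alpha(t), \sigma(t)$ below):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.7, 0.5            # illustrative values of alpha(t), sigma(t)
x0 = rng.standard_normal(4)        # a "clean data" vector
eps = rng.standard_normal(4)       # the injected Gaussian noise
xt = alpha * x0 + sigma * eps      # forward-process sample x_t | x_0

# DSM target: grad_{x_t} log p(x_t | x_0) = -(x_t - alpha*x0)/sigma^2 ...
score_target = -(xt - alpha * x0) / sigma**2
# ... which simplifies to -eps/sigma, i.e. (scaled) noise prediction.
noise_target = -eps / sigma
```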

DDPM formulation: parameterize $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sigma(t)$, where $\epsilon_\theta$ predicts the noise that was added. The training objective becomes $\mathbb{E}\bigl[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\bigr]$ — simple noise prediction.

Sampling: run the reverse SDE from $X_T \sim \mathcal{N}(0, I)$ backward to $t = 0$. Discretize with DDIM (deterministic) or DDPM (stochastic); fewer steps are possible with flow matching (ODE-based) or consistency models.

Worked Example

Example 1: Stochastic Gradient Langevin Dynamics (SGLD)

For training with $n$ data points and minibatch size $m$, SGLD uses:

$$\theta_{t+1} = \theta_t + \frac{\eta}{2}\left(\nabla \log p(\theta_t) + \frac{n}{m}\sum_{i \in \mathcal{B}_t}\nabla \log p(x_i \mid \theta_t)\right) + \sqrt{\eta}\,\xi_t.$$

For step sizes $\eta_t \to 0$ with $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ (the Robbins–Monro conditions), SGLD asymptotically samples from the posterior $p(\theta \mid \mathcal{D}) \propto p(\theta)\prod_i p(x_i \mid \theta)$.
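A toy check of SGLD, assuming a conjugate Gaussian model (prior $\theta \sim \mathcal{N}(0,1)$, likelihood $x_i \sim \mathcal{N}(\theta, 1)$) where the exact posterior $\mathcal{N}\bigl(\sum_i x_i/(n+1),\, 1/(n+1)\bigr)$ is known. A fixed small $\eta$ is used for simplicity, so the samples carry a modest discretization and minibatch-noise bias in the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 1000, 32, 1e-4
data = rng.normal(2.0, 1.0, size=n)
post_mean = data.sum() / (n + 1)      # exact posterior mean

theta = 0.0
chain = []
for t in range(20_000):
    batch = rng.choice(data, size=m, replace=False)
    grad_log_prior = -theta                          # d/dtheta log N(theta; 0, 1)
    grad_log_lik = (n / m) * np.sum(batch - theta)   # rescaled minibatch gradient
    theta += (eta / 2) * (grad_log_prior + grad_log_lik) \
             + np.sqrt(eta) * rng.standard_normal()
    chain.append(theta)
chain = np.array(chain[5_000:])       # discard burn-in
```

The chain mean matches the exact posterior mean closely; the chain variance is somewhat inflated relative to $1/(n+1)$ by the minibatch gradient noise, which vanishes only as $\eta \to 0$.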

Early in training, the noise term $\sqrt{\eta}\,\xi_t$ is negligible relative to the gradient, so SGLD behaves like SGD. Near convergence, the gradient shrinks, the noise dominates, and the iterates sample from the posterior. SGLD thus interpolates between optimization and Bayesian inference by annealing $\eta$.

Example 2: Diffusion Model Reverse Process

For DDPM with 1000 timesteps, $\beta_t$ increases linearly from $10^{-4}$ to $0.02$:

$$X_t \mid X_0 \sim \mathcal{N}\bigl(\sqrt{\bar\alpha_t}\,X_0,\, (1-\bar\alpha_t)I\bigr), \quad \text{where } \bar\alpha_t = \prod_{s=1}^t (1-\beta_s).$$

At $t = 1000$: $\bar\alpha_{1000} \approx 0$, so $X_{1000} \approx \mathcal{N}(0, I)$ — almost pure noise.
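The schedule is three lines of NumPy, and the claim about $\bar\alpha_{1000}$ can be verified directly:

```python
import numpy as np

# DDPM linear schedule from the text: beta_t from 1e-4 to 0.02 over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)    # alpha_bar_t = prod_{s<=t} (1 - beta_s)

# Signal-to-noise ratio of x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps:
snr = alpha_bar / (1.0 - alpha_bar)    # decays monotonically toward 0
```

`alpha_bar` decreases monotonically from nearly 1 to essentially 0, so the endpoint $X_{1000}$ is statistically indistinguishable from $\mathcal{N}(0, I)$.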

The reverse step (with $\alpha_t = 1 - \beta_t$): $$X_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(X_t, t)\right) + \sqrt{\tilde\beta_t}\,\xi_t,$$

where $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ is the posterior variance. The U-Net $\epsilon_\theta$ is trained to predict the noise from $(X_t, t)$. Generation: start at $X_{1000} \sim \mathcal{N}(0, I)$ and run 1000 reverse steps.

DDIM reduces this to 20–50 deterministic steps using an ODE solver instead of the stochastic reverse SDE.

Example 3: Score Matching Connects to Energy-Based Models

An energy-based model defines $p_\theta(x) \propto e^{-E_\theta(x)}$. The exact score is $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ — the gradient of the energy, computable by backprop without the intractable partition function.
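Why $Z$ drops out: $\log p(x) = -E(x) - \log Z$, and $\log Z$ is constant in $x$. A finite-difference check on a Gaussian energy (illustrative, since here the normalized density is known in closed form):

```python
import numpy as np

# Energy E(x) = (x - mu)^2 / (2 s^2), so p(x) ∝ e^{-E(x)} is N(mu, s^2).
mu, s = 1.5, 0.7

def log_p(x):
    # Normalized log-density: -E(x) - log Z. We know log Z only because
    # this toy example is Gaussian; the score never needs it.
    return -((x - mu) ** 2) / (2 * s**2) - 0.5 * np.log(2 * np.pi * s**2)

def neg_grad_E(x):
    # Score computed from the energy alone: -E'(x) = -(x - mu)/s^2.
    return -(x - mu) / s**2

x, h = 0.3, 1e-5
fd_score = (log_p(x + h) - log_p(x - h)) / (2 * h)  # numerical grad log p
```

The finite-difference score of the normalized density agrees with $-\nabla_x E(x)$: the $\log Z$ term contributes nothing to the gradient.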

Explicit score matching minimizes $\mathbb{E}\bigl[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\bigr]$ — but this requires the true score, which is unknown.

Denoising score matching (Vincent, 2011) perturbs the data, $\tilde x = x + \sigma\xi$, and matches the tractable conditional score $\nabla_{\tilde x}\log p_\sigma(\tilde x \mid x) = -(\tilde x - x)/\sigma^2$ instead. DDPM's noise-prediction objective is exactly this, summed over noise levels $t$ — diffusion training is multi-scale denoising score matching.

Connections

Where Your Intuition Breaks

Diffusion models learn the score $\nabla_x \log p_t(x)$ at each noise level $t$ — apparently requiring knowledge of the data density $p_t$, which is intractable in high dimensions. Score matching sidesteps this by expressing the score in terms of the clean data point: for Gaussian noise, $\nabla_{x_t}\log p(x_t \mid x_0) = -(x_t - x_0)/\sigma_t^2$, which is computable without $p_t$. This tractability relies critically on the Gaussian structure of the noise; non-Gaussian forward processes require substantially different estimators. The apparent simplicity of "learn to denoise" is not a general principle — it is a consequence of the specific mathematical properties of Gaussian noise.

💡Intuition

MCMC, Langevin, and diffusion models are the same SDE at different granularities. Metropolis-Hastings is a discrete-time Markov chain that converges to $\pi$. Langevin MCMC is the continuous-time limit, using the score $\nabla\log\pi$ to guide proposals. Diffusion models run a fixed forward SDE (Gaussian noise injection) and learn the reverse score with a neural network. The unifying object: the SDE with drift $-\nabla U$ and noise $\sqrt{2T}\,dB_t$ has stationary distribution $\propto e^{-U/T}$. MCMC discretizes this; diffusion models learn to run it backwards from random noise.

💡Intuition

Score matching avoids the partition function entirely. Training energy-based models by maximum likelihood requires computing $Z = \int e^{-E(x)}\,dx$ — intractable in high dimensions. Score matching sidesteps this: $\nabla_x \log p(x) = -\nabla_x E(x)$ — no $Z$! Training by matching scores requires only backpropagation through the energy, not the normalization constant. This is why score-based generative models and EBMs trained with score matching scale to image dimensions while maximum-likelihood EBMs do not: the likelihood requires $Z$; the score does not.

⚠️Warning

ULA is biased; MALA is not — but MALA requires tuning. ULA (Langevin without the Metropolis correction) has a stationary distribution that differs from the target by $O(\eta)$ in total variation. For $\eta = 0.01$ in $d = 100$ dimensions, this bias can be substantial. MALA corrects it via accept/reject but requires tuning the acceptance rate to $\approx 0.574$ (optimal for MALA in high dimensions). Practical advice: use NUTS (a form of HMC) for posteriors with tractable gradients; use ULA for energy-based sampling where exact MCMC is not needed; use diffusion models when you want generation, not posteriors.
