Bridge: MCMC, Langevin Dynamics & Diffusion Models as SDEs
MCMC, Langevin dynamics, and diffusion-based generative models are three faces of the same stochastic process theory. Markov chain Monte Carlo samples from a target by constructing a chain with the right stationary distribution; Langevin dynamics speeds this up with gradient information; diffusion models invert a noise-adding SDE to generate data — all three are unified by the same SDE framework.
Concepts
Langevin dynamics adds Gaussian noise to gradient descent: x ← x − η∇U + √(2ηT) ε. Noise lets particles escape local minima and sample the Boltzmann distribution — the foundation of SGLD and diffusion models.
Add noise to an image step by step until it becomes pure static — then learn to reverse each step. This is how diffusion-based generative models work, and it succeeds because the noise-adding process is a well-understood SDE with a tractable reverse. Langevin dynamics runs the same SDE for Bayesian sampling: gradient steps toward high-probability regions, plus noise to prevent collapse. The same stochastic differential equation governs sampling from posteriors and generating images — MCMC, Langevin, and diffusion models are one framework at three different scales.
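The Langevin update from the concept above is easy to run directly. A minimal NumPy sketch, using an illustrative double-well energy U(x) = (x² − 1)² (my choice, not from the text) to show the noise carrying the chain over the barrier between the two minima:

```python
import numpy as np

def langevin_sample(grad_U, x0, eta=0.01, T=1.0, n_steps=50_000, seed=0):
    """Unadjusted Langevin: x <- x - eta * grad U(x) + sqrt(2 * eta * T) * noise."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_steps)
    for k in range(n_steps):
        x = x - eta * grad_U(x) + np.sqrt(2 * eta * T) * rng.standard_normal()
        samples[k] = x
    return samples

# Double-well energy U(x) = (x^2 - 1)^2: two minima at x = +1 and x = -1.
grad_U = lambda x: 4 * x * (x**2 - 1)
samples = langevin_sample(grad_U, x0=-1.0)

# The noise lets the chain cross the barrier at x = 0 and visit both modes;
# pure gradient descent from x0 = -1 would stay in the left well forever.
frac_right = np.mean(samples > 0)
```

With the noise term deleted, `frac_right` is exactly 0; with it, the chain spends time in both wells, approximating the Boltzmann distribution e^(−U(x)/T).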
MCMC as Discrete-Time Markov Chains
Metropolis-Hastings: to sample from π(x), propose x′ ~ q(x′ | x) and accept with probability min(1, [π(x′) q(x | x′)] / [π(x) q(x′ | x)]). This satisfies detailed balance by construction, ensuring π is stationary.
The detailed balance condition is both sufficient for stationarity and the precise formulation of time-reversibility: a chain satisfying it looks statistically identical run forward or backward. The Metropolis correction — the accept/reject step — is exactly engineered to restore detailed balance whenever the proposal would violate it. Without the correction, the chain would still explore state space but would converge to a biased stationary distribution.
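A minimal random-walk Metropolis-Hastings sketch in NumPy; with a symmetric Gaussian proposal the Hastings ratio reduces to π(x′)/π(x). The standard-normal target and step size are illustrative choices:

```python
import numpy as np

def metropolis_hastings(log_pi, x0, step=0.5, n_steps=20_000, seed=0):
    """Random-walk MH: symmetric Gaussian proposal, so the Hastings
    ratio reduces to pi(x') / pi(x)."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_pi(x0)
    samples = np.empty(n_steps)
    n_accept = 0
    for k in range(n_steps):
        x_prop = x + step * rng.standard_normal()
        lp_prop = log_pi(x_prop)
        # Accept with probability min(1, pi(x') / pi(x)), in log space.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = x_prop, lp_prop
            n_accept += 1
        samples[k] = x
    return samples, n_accept / n_steps

# Target: standard normal, log pi(x) = -x^2/2 up to a constant --
# only the unnormalized density is ever needed.
samples, acc_rate = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
```

Note that the normalizing constant of π never appears: it cancels in the ratio, which is what makes MH usable for unnormalized posteriors.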
Gibbs sampling: for a joint π(x₁, …, x_d), cycle through dimensions, sampling each x_i from the full conditional π(x_i | x₋ᵢ). No accept/reject step needed. Gibbs is a special case of MH with acceptance rate 1.
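A small Gibbs sketch for an illustrative bivariate standard normal with correlation ρ, where each full conditional is itself a 1-D Gaussian, x₁ | x₂ ~ N(ρx₂, 1 − ρ²); note the absence of any accept/reject step:

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_steps=20_000, seed=0):
    """Gibbs sampling for a bivariate standard normal with correlation rho.
    Each full conditional is Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    x1 = x2 = 0.0
    out = np.empty((n_steps, 2))
    s = np.sqrt(1 - rho**2)
    for k in range(n_steps):
        x1 = rho * x2 + s * rng.standard_normal()  # sample x1 | x2
        x2 = rho * x1 + s * rng.standard_normal()  # sample x2 | x1
        out[k] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.8)
corr = np.corrcoef(samples.T)[0, 1]   # empirical correlation, near rho
```

As ρ → 1 the conditionals become nearly deterministic and the chain mixes slowly along the diagonal — a one-line illustration of the poor-mixing caveat below.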
Key limitation: local proposals have poor mixing in high dimensions. The mixing time of random-walk MH in d dimensions scales as O(d) (with optimal step size ∝ d^(−1/2)) — each coordinate takes O(d) steps to move an O(1) distance.
Langevin Monte Carlo
Unadjusted Langevin Algorithm (ULA): discretize the Langevin SDE dx = −∇U(x) dt + √2 dW_t:

x_{k+1} = x_k − η ∇U(x_k) + √(2η) ε_k,  ε_k ~ N(0, I)
The stationary distribution of the continuous-time SDE is π(x) ∝ e^(−U(x)) (the Gibbs distribution). The discretized ULA introduces an O(η) bias — it does not exactly sample π.
Metropolis-Adjusted Langevin Algorithm (MALA): correct the discretization error by using Langevin steps as proposals in MH. The proposal is x′ = x − η ∇U(x) + √(2η) ε, accepted/rejected by MH. MALA is exact (its stationary distribution is exactly π) but requires an accept step.
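A MALA sketch in NumPy. Because the Langevin proposal is asymmetric, the accept step must include the proposal densities q in both directions; the 2-D standard-normal target and step size are illustrative choices:

```python
import numpy as np

def mala(U, grad_U, x0, eta=0.1, n_steps=10_000, seed=0):
    """MALA: Langevin proposal x' = x - eta * grad U(x) + sqrt(2*eta) * eps,
    corrected by Metropolis-Hastings so the stationary distribution is
    exactly pi(x) proportional to exp(-U(x))."""
    rng = np.random.default_rng(seed)

    def log_q(x_to, x_from):
        # Proposal density N(x_from - eta * grad U(x_from), 2 * eta * I),
        # up to a constant that cancels in the Hastings ratio.
        mu = x_from - eta * grad_U(x_from)
        return -np.sum((x_to - mu) ** 2) / (4 * eta)

    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x_prop = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal(x.size)
        # log of pi(x') q(x | x') / (pi(x) q(x' | x))
        log_alpha = (U(x) - U(x_prop)) + log_q(x, x_prop) - log_q(x_prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        samples[k] = x
    return samples

# Target: 2-D standard normal, U(x) = ||x||^2 / 2, grad U(x) = x.
samples = mala(lambda x: 0.5 * np.sum(x**2), lambda x: x, x0=np.zeros(2))
```

Dropping the accept/reject block turns this into ULA — same proposals, but with the O(η) stationary bias described above.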
Mixing time improvement: Langevin (MALA) mixes in O(d^(1/3)) steps vs O(d) for random-walk MH, using gradient information to make directed proposals.
Preconditioned Langevin: uses a preconditioning matrix M (e.g., the Fisher information matrix or a Hessian estimate) in the update x_{k+1} = x_k − η M ∇U(x_k) + √(2η) M^(1/2) ε_k to adapt to the geometry of π. This is the stochastic analog of Newton's method.
Hamiltonian Monte Carlo
Hamiltonian dynamics augment the state x with momentum p:

dx/dt = M⁻¹ p,  dp/dt = −∇U(x)
The Hamiltonian H(x, p) = U(x) + ½ pᵀ M⁻¹ p is conserved along trajectories. The joint distribution factors as π(x, p) ∝ e^(−H(x, p)) = e^(−U(x)) · e^(−½ pᵀ M⁻¹ p).
HMC algorithm: (1) sample p ~ N(0, M); (2) run L leapfrog steps to get (x′, p′); (3) accept with MH probability min(1, e^(H(x, p) − H(x′, p′))); (4) discard p′.
The leapfrog integrator (Störmer-Verlet) is symplectic — it exactly conserves volume in (x, p) phase space, ensuring the acceptance rate stays near 1 for small step size ε. The optimal acceptance rate for HMC in high dimensions is ≈ 0.65, with step size scaling as d^(−1/4).
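The four steps above, with an explicit leapfrog integrator, can be sketched as follows (the standard-normal target, M = I, and the values of ε and L are illustrative choices):

```python
import numpy as np

def leapfrog(x, p, grad_U, eps, L):
    """Stoermer-Verlet: half step in p, L full steps in x interleaved with
    full steps in p, final half step in p. Symplectic (volume-preserving)."""
    p = p - 0.5 * eps * grad_U(x)
    for _ in range(L - 1):
        x = x + eps * p
        p = p - eps * grad_U(x)
    x = x + eps * p
    p = p - 0.5 * eps * grad_U(x)
    return x, p

def hmc(U, grad_U, x0, eps=0.2, L=10, n_steps=5_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        p = rng.standard_normal(x.size)                # (1) sample momentum
        x_new, p_new = leapfrog(x, p, grad_U, eps, L)  # (2) simulate dynamics
        # (3) accept with probability min(1, exp(H(x, p) - H(x', p')))
        dH = (U(x_new) + 0.5 * p_new @ p_new) - (U(x) + 0.5 * p @ p)
        if np.log(rng.uniform()) < -dH:
            x = x_new
        samples[k] = x                                 # (4) momentum discarded
    return samples

# Target: 2-D standard normal, U(x) = ||x||^2 / 2.
samples = hmc(lambda x: 0.5 * np.sum(x**2), lambda x: x, x0=np.zeros(2))
```

Because leapfrog nearly conserves H, dH stays small and almost every trajectory is accepted, even though each one moves a distance of roughly ε·L.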
No-U-Turn Sampler (NUTS): automatically sets the trajectory length L by detecting when the trajectory starts to double back on itself, eliminating the main tuning parameter. NUTS is the default sampler in Stan, PyMC, and NumPyro.
Score-Based Generative Models and Diffusion
Forward SDE (noise injection): dx = f(x, t) dt + g(t) dW_t, x₀ ~ p_data. Common choice: variance-preserving (VP-SDE, DDPM) with f(x, t) = −½ β(t) x and g(t) = √β(t), so x_t | x₀ ~ N(√ᾱ_t x₀, (1 − ᾱ_t) I) with closed-form ᾱ_t = exp(−∫₀ᵗ β(s) ds).
Anderson's reverse SDE (1982): the time-reversal of an Itô SDE is:

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dW̄_t

where time runs backward and W̄ is a reverse-time Brownian motion.
The reverse drift requires the score function ∇_x log p_t(x) — the gradient of the log-density of the noised data at time t.
Score matching: train a neural network s_θ(x, t) by minimizing the denoising score matching objective:

E_{t, x₀, x_t} [ λ(t) ‖ s_θ(x_t, t) − ∇_{x_t} log p(x_t | x₀) ‖² ]
Since p(x_t | x₀) is a Gaussian (known in closed form for the VP-SDE), ∇_{x_t} log p(x_t | x₀) = −(x_t − √ᾱ_t x₀)/(1 − ᾱ_t) = −ε/√(1 − ᾱ_t). Learning the score is equivalent to learning to denoise.
DDPM formulation: parameterize s_θ(x_t, t) = −ε_θ(x_t, t)/√(1 − ᾱ_t), where ε_θ predicts the noise ε that was added. The training objective becomes E ‖ε − ε_θ(x_t, t)‖² — simple noise prediction.
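A sketch of this objective for one training step, with a stand-in zero predictor in place of a trained U-Net (the toy x₀, the single ᾱ_t value, and the predictor are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_loss(x0, alpha_bar_t, eps_model):
    """One DDPM training step: noise x0 to level alpha_bar_t, then score
    the predictor on ||eps - eps_theta(x_t, t)||^2."""
    eps = rng.standard_normal(x0.shape)                       # sampled noise
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
    return np.mean((eps - eps_model(x_t, alpha_bar_t)) ** 2)

x0 = rng.standard_normal((64, 2))                             # toy "data" batch
zero_predictor = lambda x_t, a: np.zeros_like(x_t)            # stand-in network
loss = ddpm_loss(x0, alpha_bar_t=0.5, eps_model=zero_predictor)
# A zero predictor gives loss near E||eps||^2 = 1 per coordinate;
# a trained eps_theta drives this toward the irreducible denoising error.
```

In a real implementation `eps_model` is a U-Net and t is sampled uniformly each step; everything else is exactly this computation.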
Sampling: run the reverse SDE from t = T backward to t = 0. Discretize with DDIM (deterministic) or DDPM (stochastic). Fewer steps are possible with flow matching (ODE-based) or consistency models.
Worked Examples
Example 1: Stochastic Gradient Langevin Dynamics (SGLD)
For training with N data points and minibatch size m, SGLD uses:

θ_{k+1} = θ_k + (η_k / 2) [ ∇ log p(θ_k) + (N/m) Σ_{i ∈ batch} ∇ log p(x_i | θ_k) ] + √η_k ε_k,  ε_k ~ N(0, I)
For step sizes η_k with Σ_k η_k = ∞ and Σ_k η_k² < ∞ (the Robbins-Monro conditions), SGLD asymptotically samples from the posterior p(θ | x_{1:N}).
At early training: the noise term √η_k ε_k is negligible vs the gradient term — SGLD behaves like SGD. At convergence: the noise dominates and SGLD samples from the posterior. SGLD interpolates between optimization and Bayesian inference by annealing η_k.
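An SGLD sketch on an illustrative conjugate toy model (x_i ~ N(θ, 1) with prior θ ~ N(0, 10) — my choice, not from the text). The minibatch gradient is rescaled by N/m and the step size follows a Robbins-Monro decay; all constants are illustrative:

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_batch, N, m, eta, rng):
    """One SGLD update: (eta/2) times the minibatch estimate of the
    log-posterior gradient (batch term rescaled by N/m), plus sqrt(eta) noise."""
    grad = grad_log_prior(theta) + (N / m) * grad_log_lik_batch(theta)
    return theta + 0.5 * eta * grad + np.sqrt(eta) * rng.standard_normal(theta.shape)

rng = np.random.default_rng(0)
N, m = 1000, 32
data = rng.standard_normal(N) + 2.0           # x_i ~ N(2, 1): true theta = 2
theta = np.zeros(1)
samples = []
for k in range(5_000):
    eta = 1e-3 / (1 + k) ** 0.55              # Robbins-Monro decay
    batch = rng.choice(data, size=m)
    theta = sgld_step(
        theta,
        grad_log_prior=lambda t: -t / 10.0,                        # theta ~ N(0, 10)
        grad_log_lik_batch=lambda t: np.sum(batch - t, keepdims=True),
        N=N, m=m, eta=eta, rng=rng,
    )
    samples.append(theta[0])

late = np.array(samples[-1000:])  # late samples hover near the posterior mean (~2)
```

Early iterations take large, nearly deterministic steps toward θ ≈ 2 (the SGD phase); once η_k is small, the √η_k noise dominates the shrinking gradient signal and the iterates wander around the posterior mode.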
Example 2: Diffusion Model Reverse Process
For DDPM with 1000 timesteps, β_t increases linearly from β₁ = 10⁻⁴ to β₁₀₀₀ = 0.02:

x_t = √ᾱ_t x₀ + √(1 − ᾱ_t) ε

where ᾱ_t = ∏_{s=1}^{t} (1 − β_s).
At t = 1000: ᾱ_t ≈ 4 × 10⁻⁵, so x₁₀₀₀ ≈ ε — almost pure noise.
The reverse step:

x_{t−1} = (1/√(1 − β_t)) ( x_t − (β_t / √(1 − ᾱ_t)) ε_θ(x_t, t) ) + σ_t z,  z ~ N(0, I)

where σ_t² = β_t (1 − ᾱ_{t−1}) / (1 − ᾱ_t) (the posterior variance). The U-Net ε_θ is trained to predict the noise ε from (x_t, t). Generation: start with x_T ~ N(0, I), run 1000 reverse steps.
DDIM reduces this to 20–50 deterministic steps using an ODE solver instead of the stochastic reverse SDE.
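The linear schedule and the claim that x₁₀₀₀ is essentially pure noise can be checked in a few lines:

```python
import numpy as np

# Linear DDPM schedule: beta_t from 1e-4 to 0.02 over T = 1000 steps.
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)     # alpha_bar_t = prod_{s<=t} (1 - beta_s)

# The signal coefficient sqrt(alpha_bar_t) decays monotonically toward zero;
# at t = T the remaining signal is negligible and x_T is essentially N(0, I).
print(alpha_bar[-1])   # ~4e-5
```

Since log ᾱ_T ≈ −Σβ_s ≈ −10 for this schedule, ᾱ₁₀₀₀ ≈ e⁻¹⁰ ≈ 4 × 10⁻⁵, matching the worked value above.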
Example 3: Score Matching Connects to Energy-Based Models
An energy-based model defines p_θ(x) = e^(−E_θ(x)) / Z(θ). Exact score: ∇_x log p_θ(x) = −∇_x E_θ(x) — the gradient of the energy, computable by backprop without the intractable partition function Z(θ).
Explicit score matching minimizes E_{p_data} ‖ s_θ(x) − ∇_x log p_data(x) ‖² — this requires the true score, which is unknown.
Denoising score matching (Vincent 2011) perturbs data with noise and matches the score of the perturbation kernel q(x̃ | x), which is tractable. DDPM's noise prediction objective is exactly this, summed over noise levels — diffusion training is multi-scale denoising score matching.
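Vincent's result can be checked numerically on a 1-D toy: regressing a score model onto the tractable per-sample target ∇ log q(x̃ | x) recovers the score of the perturbed marginal. For data ~ N(0, 1) with noise std σ, the perturbed marginal is N(0, 1 + σ²), whose score is −x/(1 + σ²); the setup below (linear model, closed-form least squares) is an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean data ~ N(0, 1), perturbed with Gaussian noise of std sigma.
sigma = 0.5
x = rng.standard_normal(100_000)
x_tilde = x + sigma * rng.standard_normal(x.size)

# Tractable DSM target: grad log q(x_tilde | x) = -(x_tilde - x) / sigma^2.
# No access to the (unknown) density of x_tilde is needed.
target = -(x_tilde - x) / sigma**2

# Fit a linear score model s(x) = a * x by least squares (closed form).
a = np.sum(x_tilde * target) / np.sum(x_tilde**2)
print(a)   # approaches -1 / (1 + sigma^2) = -0.8, the true marginal score slope
```

The regression never sees the intractable marginal score, yet its minimizer matches it — the same mechanism that lets DDPM train on per-sample noise targets.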
Connections
Where Your Intuition Breaks
Diffusion models learn the score ∇_x log p_t(x) at each noise level — apparently requiring knowledge of the data density p(x), which is intractable in high dimensions. Score matching sidesteps this by expressing the score of the noisy conditional in terms of the clean data point: for Gaussian noise, ∇_{x_t} log p(x_t | x₀) = −(x_t − √ᾱ_t x₀)/(1 − ᾱ_t), which is computable without p(x). This tractability relies critically on the Gaussian structure of the noise; non-Gaussian forward processes require substantially different estimators. The apparent simplicity of "learn to denoise" is not a general principle — it is a consequence of the specific mathematical properties of Gaussian noise.
MCMC, Langevin, and diffusion models are the same SDE at different granularities. Metropolis-Hastings is a discrete-time Markov chain that converges to π. Langevin MCMC is the continuous-time limit, using the score ∇ log π to guide proposals. Diffusion models run a fixed forward SDE (Gaussian noise injection) and learn the reverse score with a neural network. The unifying object is the SDE dx = ∇ log π(x) dt + √2 dW_t, whose stationary distribution is π. MCMC discretizes this; diffusion models learn to run it backwards from random noise.
Score matching avoids the partition function entirely. Training energy-based models by maximum likelihood requires computing Z(θ) = ∫ e^(−E_θ(x)) dx — intractable in high dimensions. Score matching sidesteps this: ∇_x log p_θ(x) = −∇_x E_θ(x) — no Z(θ)! Training by matching scores requires only backpropagation through the energy, not the normalization constant. This is why score-based generative models and EBMs trained with score matching scale to image dimensions while maximum likelihood EBMs do not — the likelihood requires Z(θ); the score does not.
ULA is biased; MALA is not — but MALA requires tuning. ULA (Langevin without the Metropolis correction) has a stationary distribution that differs from the target by O(η) in TV distance. For a fixed step size η in high dimensions, this bias can be substantial. MALA corrects it via accept/reject but requires the step size to be tuned so the acceptance rate is ≈ 0.574 (optimal for MALA in high dimensions). Practical advice: use NUTS (a form of HMC) for posteriors with tractable gradients; use ULA for energy-based sampling where exact MCMC is not needed; use diffusion models when you want generation, not posteriors.