
Bridge: MCMC, Langevin Dynamics & Diffusion Models as SDEs

MCMC, Langevin dynamics, and diffusion-based generative models are three faces of the same stochastic process theory. Markov chain Monte Carlo samples from a target by constructing a chain with the right stationary distribution; Langevin dynamics speeds this up with gradient information; diffusion models invert a noise-adding SDE to generate data — all three are unified by the same SDE framework.

Concepts

Langevin dynamics adds Gaussian noise to gradient descent: x ← x − η∇U + √(2ηT) ε. Noise lets particles escape local minima and sample the Boltzmann distribution — the foundation of SGLD and diffusion models.

[Figure: a double-well potential $U(x)$.]
Langevin: samples $\propto e^{-U(x)/T}$ — explores both wells at $T \geq 1$. Pure GD: converges to the nearest minimum, can't escape. Diffusion models: run reverse Langevin from noise → data.
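The figure's behavior is easy to reproduce. A minimal sketch, assuming the double-well potential $U(x) = (x^2 - 1)^2$ and temperature $T = 1$ (illustrative choices, not specified above):

```python
import numpy as np

def grad_U(x):
    # Illustrative double-well: U(x) = (x^2 - 1)^2, so U'(x) = 4x(x^2 - 1).
    return 4 * x * (x**2 - 1)

rng = np.random.default_rng(0)
eta, T = 0.01, 1.0

# Pure gradient descent from x = -0.5: converges to the nearest minimum, x = -1.
x_gd = -0.5
for _ in range(2000):
    x_gd -= eta * grad_U(x_gd)

# Langevin dynamics: the same gradient step plus sqrt(2*eta*T) Gaussian noise.
x = -0.5
samples = np.empty(200_000)
for t in range(200_000):
    x += -eta * grad_U(x) + np.sqrt(2 * eta * T) * rng.standard_normal()
    samples[t] = x

frac_right = (samples > 0).mean()  # fraction of time spent in the right well
```

With noise, the chain repeatedly crosses the barrier and spends a substantial fraction of its time in each well; gradient descent from the same start never leaves the left well.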

Add noise to an image step by step until it becomes pure static — then learn to reverse each step. This is how diffusion-based generative models work, and it succeeds because the noise-adding process is a well-understood SDE with a tractable reverse. Langevin dynamics runs the same SDE for Bayesian sampling: gradient steps toward high-probability regions, plus noise to prevent collapse. The same stochastic differential equation governs sampling from posteriors and generating images — MCMC, Langevin, and diffusion models are one framework at three different scales.

MCMC as Discrete-Time Markov Chains

Metropolis-Hastings: to sample from $\pi(x) \propto e^{-U(x)}$, propose $x' \sim q(x' \mid x)$ and accept with probability $\min\!\bigl(1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr)$. This satisfies detailed balance $\pi(x)P(x, x') = \pi(x')P(x', x)$ by construction, ensuring $\pi$ is stationary.
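A minimal random-walk Metropolis sketch. The Gaussian proposal is symmetric, so the Hastings ratio reduces to $\pi(x')/\pi(x)$; the target here is a standard normal, $U(x) = x^2/2$ (an illustrative choice):

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis with symmetric proposal q(x'|x) = N(x, step^2)."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.standard_normal()
        # Accept with probability min(1, pi(x')/pi(x)), computed in log space.
        if np.log(rng.uniform()) < log_pi(x_prop) - log_pi(x):
            x = x_prop
        chain[t] = x  # on rejection, the current state is repeated
    return chain

# Target pi(x) ∝ e^{-U(x)} with U(x) = x^2/2, i.e. a standard normal.
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_steps=50_000)
```

The empirical mean and variance of `chain` converge to 0 and 1, the moments of the target.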

The detailed balance condition $\pi(x)P(x,x') = \pi(x')P(x',x)$ is both sufficient for stationarity and the precise formulation of time-reversibility: a chain satisfying it looks statistically identical run forward or backward. The Metropolis correction — the accept/reject step — is exactly engineered to restore detailed balance whenever the proposal $q$ would violate it. Without the correction, the chain would still explore state space but would converge to a biased stationary distribution.

Gibbs sampling: for a joint $\pi(x_1, \ldots, x_d)$, cycle through dimensions, sampling each $x_i \mid x_{-i}$ from its conditional. No accept/reject step is needed: Gibbs is a special case of MH with acceptance probability 1.
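A sketch of Gibbs sampling for a bivariate normal with correlation $\rho$ (an illustrative target where both full conditionals are one-dimensional Gaussians):

```python
import numpy as np

# Bivariate standard normal with correlation rho: each conditional is
# x_i | x_{-i} ~ N(rho * x_{-i}, 1 - rho^2), sampled exactly — no accept/reject.
rho = 0.8
rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0
samples = np.empty((20_000, 2))
for t in range(20_000):
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples[t] = (x1, x2)

emp_corr = np.corrcoef(samples.T)[0, 1]  # should approach rho
```

The empirical correlation recovers $\rho$; note that mixing slows as $\rho \to 1$, since each conditional move becomes nearly deterministic.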

Key limitation: local proposals mix poorly in high dimensions. The mixing time of random-walk MH in $\mathbb{R}^d$ scales as $O(d)$ (with optimal step size $\sim d^{-1/2}$ and optimal acceptance rate $\approx 0.234$) — each coordinate takes $O(d)$ steps to move by $O(1)$.

Langevin Monte Carlo

Unadjusted Langevin Algorithm (ULA): discretize the Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$:

$$x_{t+1} = x_t - \eta \nabla U(x_t) + \sqrt{2\eta}\,\xi_t, \quad \xi_t \sim \mathcal{N}(0, I).$$

The stationary distribution of the continuous-time SDE is the Gibbs measure $\pi \propto e^{-U(x)}$. The discretized ULA introduces an $O(\eta)$ bias — it does not exactly sample $\pi$.

Metropolis-Adjusted Langevin Algorithm (MALA): correct the discretization error by using Langevin steps as MH proposals. The proposal is $q(x' \mid x) = \mathcal{N}\bigl(x - \eta\nabla U(x),\, 2\eta I\bigr)$, accepted or rejected by the MH rule. MALA is exact (its stationary distribution is exactly $\pi$) but requires an accept step.
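A MALA sketch. Because the Langevin proposal is not symmetric, the MH ratio must include the forward and reverse proposal densities; the target below is a standard normal (illustrative):

```python
import numpy as np

def mala(U, grad_U, x0, eta, n_steps, seed=0):
    """MALA: Langevin proposal N(x - eta*grad U(x), 2*eta*I) with MH correction."""
    rng = np.random.default_rng(seed)

    def log_q(x_to, x_from):
        # Log density of the Langevin proposal q(x_to | x_from), up to a constant.
        mu = x_from - eta * grad_U(x_from)
        return -np.sum((x_to - mu) ** 2) / (4 * eta)

    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_steps,) + x.shape)
    n_accept = 0
    for t in range(n_steps):
        prop = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        # log MH ratio: [-U(x') + log q(x|x')] - [-U(x) + log q(x'|x)]
        log_a = (-U(prop) + log_q(x, prop)) - (-U(x) + log_q(prop, x))
        if np.log(rng.uniform()) < log_a:
            x = prop
            n_accept += 1
        chain[t] = x
    return chain, n_accept / n_steps

# Standard normal target: U(x) = |x|^2 / 2, grad U(x) = x.
chain, acc = mala(lambda x: 0.5 * np.sum(x**2), lambda x: x,
                  x0=np.zeros(2), eta=0.2, n_steps=20_000)
```

Dropping the `log_q` terms and the accept/reject step recovers ULA, which runs faster per step but carries the $O(\eta)$ stationary bias discussed above.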

Mixing time improvement: by using gradient information to make directed proposals, MALA mixes in $O(d^{1/3})$ steps versus $O(d)$ for random-walk MH.

Preconditioned Langevin: $x_{t+1} = x_t - \eta M^{-1}\nabla U(x_t) + \sqrt{2\eta}\, M^{-1/2}\xi_t$ uses a preconditioning matrix $M$ (e.g., the Fisher information matrix or a Hessian estimate) to adapt to the geometry of $U$. This is the stochastic analog of Newton's method.

Hamiltonian Monte Carlo

Hamiltonian dynamics augment the state with a momentum variable $p \sim \mathcal{N}(0, M)$:

$$\frac{dq}{dt} = M^{-1}p, \qquad \frac{dp}{dt} = -\nabla U(q).$$

The Hamiltonian $H(q, p) = U(q) + \frac{1}{2}p^\top M^{-1}p$ is conserved along trajectories. The joint distribution $\pi(q, p) \propto e^{-H(q,p)}$ factors as $\pi(q)\,\mathcal{N}(p; 0, M)$.

HMC algorithm: (1) sample $p \sim \mathcal{N}(0, M)$; (2) run $L$ leapfrog steps to get $(q', p')$; (3) accept with MH probability $\min\bigl(1, e^{H(q,p) - H(q',p')}\bigr)$; (4) discard $p'$.

The leapfrog integrator (Störmer–Verlet) is symplectic — it exactly conserves volume in $(q, p)$ space — so the acceptance rate stays near 1 for small step size $\varepsilon$. The optimal acceptance rate for HMC in $d$ dimensions is $\approx 0.65$.
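The four steps above can be sketched directly. A minimal HMC with identity mass matrix $M = I$, targeting a 5-dimensional standard normal (illustrative choices throughout):

```python
import numpy as np

def hmc_step(U, grad_U, q, eps, L, rng):
    """One HMC transition with identity mass matrix M = I (a sketch)."""
    p = rng.standard_normal(q.shape)              # (1) resample momentum
    q_new, p_new = q.copy(), p.copy()
    # (2) L leapfrog steps: half-kick, then drifts with full kicks between,
    # then a final half-kick.
    p_new -= 0.5 * eps * grad_U(q_new)
    for i in range(L):
        q_new += eps * p_new
        if i < L - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)
    # (3) accept with probability min(1, exp(H(q,p) - H(q',p'))).
    H_old = U(q) + 0.5 * np.sum(p**2)
    H_new = U(q_new) + 0.5 * np.sum(p_new**2)
    if np.log(rng.uniform()) < H_old - H_new:
        return q_new, True                        # (4) p' is discarded
    return q, False

rng = np.random.default_rng(0)
q = np.zeros(5)
samples = np.empty((5_000, 5))
n_accept = 0
for t in range(5_000):
    q, ok = hmc_step(lambda x: 0.5 * np.sum(x**2), lambda x: x,
                     q, eps=0.2, L=10, rng=rng)
    n_accept += ok
    samples[t] = q
```

Because leapfrog nearly conserves $H$ at this step size, the acceptance rate stays close to 1, as claimed above.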

No-U-Turn Sampler (NUTS): automatically sets the trajectory length $L$ by detecting when the trajectory starts to double back, eliminating that tuning parameter. NUTS is the default sampler in Stan, PyMC, and NumPyro.

Score-Based Generative Models and Diffusion

Forward SDE (noise injection): $dX_t = f(X_t, t)\,dt + g(t)\,dB_t$, with $X_0 \sim p_0$ (the data distribution). A common choice is the variance-preserving SDE (VP-SDE, DDPM) with $f = -\frac{1}{2}\beta(t)X_t$ and $g = \sqrt{\beta(t)}$, so that $X_t \mid X_0 \sim \mathcal{N}\bigl(\alpha(t)X_0, \sigma(t)^2 I\bigr)$ with closed-form $\alpha(t), \sigma(t)$.

Anderson's reverse SDE (1982): the time-reversal of an Itô SDE $dX_t = f(X_t, t)\,dt + g(t)\,dB_t$ is:

$$d\bar X_t = \bigl[f(\bar X_t, t) - g(t)^2 \nabla_x \log p_t(\bar X_t)\bigr]\,dt + g(t)\,d\bar B_t.$$

The reverse drift requires the score function $\nabla_x \log p_t(x)$ — the gradient of the log-density at time $t$.

Score matching: train a neural network $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ by minimizing the denoising score matching objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, x_t}\!\left[\bigl\|s_\theta(x_t, t) - \nabla_{x_t}\log p(x_t \mid x_0)\bigr\|^2\right].$$

Since $p(x_t \mid x_0)$ is Gaussian (known in closed form for the VP-SDE), $\nabla_{x_t}\log p(x_t \mid x_0) = -\frac{x_t - \alpha(t)x_0}{\sigma(t)^2}$. Learning the score is equivalent to learning to denoise.
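The equivalence can be checked numerically. For $x_t = \alpha x_0 + \sigma\epsilon$, the score target $-(x_t - \alpha x_0)/\sigma^2$ equals $-\epsilon/\sigma$, so regressing the score is noise regression up to a scale factor (illustrative values of $\alpha(t), \sigma(t)$ below):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.7, 0.5            # illustrative values of alpha(t), sigma(t)
x0 = rng.standard_normal(4)        # a "clean data" vector
eps = rng.standard_normal(4)       # the injected Gaussian noise
xt = alpha * x0 + sigma * eps      # forward-process sample x_t | x_0

# DSM target: grad_{x_t} log p(x_t | x_0) = -(x_t - alpha*x0)/sigma^2 ...
score_target = -(xt - alpha * x0) / sigma**2
# ... which simplifies to -eps/sigma, i.e. (scaled) noise prediction.
noise_target = -eps / sigma
```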

DDPM formulation: parameterize $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sigma(t)$, where $\epsilon_\theta$ predicts the noise that was added. The training objective becomes $\mathbb{E}\bigl[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\bigr]$ — simple noise prediction.

Sampling: run the reverse SDE from $X_T \sim \mathcal{N}(0, I)$ backward to $t = 0$. Discretize with DDIM (deterministic) or DDPM (stochastic); fewer steps are possible with flow matching (ODE-based) or consistency models.

Worked Example

Example 1: Stochastic Gradient Langevin Dynamics (SGLD)

For training with $n$ data points and minibatch size $m$, SGLD uses:

$$\theta_{t+1} = \theta_t + \frac{\eta}{2}\left(\nabla \log p(\theta_t) + \frac{n}{m}\sum_{i \in \mathcal{B}_t}\nabla \log p(x_i \mid \theta_t)\right) + \sqrt{\eta}\,\xi_t.$$

For step sizes $\eta_t \to 0$ with $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ (the Robbins–Monro conditions), SGLD asymptotically samples from the posterior $p(\theta \mid \mathcal{D}) \propto p(\theta)\prod_i p(x_i \mid \theta)$.
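A toy check of SGLD, assuming a conjugate Gaussian model (prior $\theta \sim \mathcal{N}(0,1)$, likelihood $x_i \sim \mathcal{N}(\theta, 1)$) where the exact posterior $\mathcal{N}\bigl(\sum_i x_i/(n+1),\, 1/(n+1)\bigr)$ is known. A fixed small $\eta$ is used for simplicity, so the samples carry a modest discretization and minibatch-noise bias in the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 1000, 32, 1e-4
data = rng.normal(2.0, 1.0, size=n)
post_mean = data.sum() / (n + 1)      # exact posterior mean

theta = 0.0
chain = []
for t in range(20_000):
    batch = rng.choice(data, size=m, replace=False)
    grad_log_prior = -theta                          # d/dtheta log N(theta; 0, 1)
    grad_log_lik = (n / m) * np.sum(batch - theta)   # rescaled minibatch gradient
    theta += (eta / 2) * (grad_log_prior + grad_log_lik) \
             + np.sqrt(eta) * rng.standard_normal()
    chain.append(theta)
chain = np.array(chain[5_000:])       # discard burn-in
```

The chain mean matches the exact posterior mean closely; the chain variance is somewhat inflated relative to $1/(n+1)$ by the minibatch gradient noise, which vanishes only as $\eta \to 0$.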

Early in training, the noise term $\sqrt{\eta}\,\xi_t$ is negligible relative to the gradient, so SGLD behaves like SGD. Near convergence, the gradient shrinks, the noise dominates, and the iterates sample from the posterior. SGLD thus interpolates between optimization and Bayesian inference by annealing $\eta$.

Example 2: Diffusion Model Reverse Process

For DDPM with 1000 timesteps, $\beta_t$ increases linearly from $10^{-4}$ to $0.02$:

$$X_t \mid X_0 \sim \mathcal{N}\bigl(\sqrt{\bar\alpha_t}\,X_0,\, (1-\bar\alpha_t)I\bigr), \quad \text{where } \bar\alpha_t = \prod_{s=1}^t (1-\beta_s).$$

At $t = 1000$: $\bar\alpha_{1000} \approx 0$, so $X_{1000} \approx \mathcal{N}(0, I)$ — almost pure noise.
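The schedule is three lines of NumPy, and the claim about $\bar\alpha_{1000}$ can be verified directly:

```python
import numpy as np

# DDPM linear schedule from the text: beta_t from 1e-4 to 0.02 over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)    # alpha_bar_t = prod_{s<=t} (1 - beta_s)

# Signal-to-noise ratio of x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps:
snr = alpha_bar / (1.0 - alpha_bar)    # decays monotonically toward 0
```

`alpha_bar` decreases monotonically from nearly 1 to essentially 0, so the endpoint $X_{1000}$ is statistically indistinguishable from $\mathcal{N}(0, I)$.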

The reverse step (with $\alpha_t = 1 - \beta_t$): $$X_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(X_t, t)\right) + \sqrt{\tilde\beta_t}\,\xi_t,$$

where $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ is the posterior variance. The U-Net $\epsilon_\theta$ is trained to predict the noise from $(X_t, t)$. Generation: start at $X_{1000} \sim \mathcal{N}(0, I)$ and run 1000 reverse steps.

DDIM reduces this to 20–50 deterministic steps using an ODE solver instead of the stochastic reverse SDE.

Example 3: Score Matching Connects to Energy-Based Models

An energy-based model defines $p_\theta(x) \propto e^{-E_\theta(x)}$. The exact score is $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ — the gradient of the energy, computable by backprop without the intractable partition function.
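Why $Z$ drops out: $\log p(x) = -E(x) - \log Z$, and $\log Z$ is constant in $x$. A finite-difference check on a Gaussian energy (illustrative, since here the normalized density is known in closed form):

```python
import numpy as np

# Energy E(x) = (x - mu)^2 / (2 s^2), so p(x) ∝ e^{-E(x)} is N(mu, s^2).
mu, s = 1.5, 0.7

def log_p(x):
    # Normalized log-density: -E(x) - log Z. We know log Z only because
    # this toy example is Gaussian; the score never needs it.
    return -((x - mu) ** 2) / (2 * s**2) - 0.5 * np.log(2 * np.pi * s**2)

def neg_grad_E(x):
    # Score computed from the energy alone: -E'(x) = -(x - mu)/s^2.
    return -(x - mu) / s**2

x, h = 0.3, 1e-5
fd_score = (log_p(x + h) - log_p(x - h)) / (2 * h)  # numerical grad log p
```

The finite-difference score of the normalized density agrees with $-\nabla_x E(x)$: the $\log Z$ term contributes nothing to the gradient.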

Explicit score matching minimizes $\mathbb{E}\bigl[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\bigr]$ — but this requires the true score, which is unknown.

Denoising score matching (Vincent, 2011) perturbs the data, $\tilde x = x + \sigma\xi$, and matches the tractable conditional score $\nabla_{\tilde x}\log p_\sigma(\tilde x \mid x) = -(\tilde x - x)/\sigma^2$ instead. DDPM's noise-prediction objective is exactly this, summed over noise levels $t$ — diffusion training is multi-scale denoising score matching.

Connections

Where Your Intuition Breaks

Diffusion models learn the score $\nabla_x \log p_t(x)$ at each noise level $t$ — apparently requiring knowledge of the data density $p_t$, which is intractable in high dimensions. Score matching sidesteps this by expressing the score in terms of the clean data point: for Gaussian noise, $\nabla_{x_t}\log p(x_t \mid x_0) = -(x_t - x_0)/\sigma_t^2$, which is computable without $p_t$. This tractability relies critically on the Gaussian structure of the noise; non-Gaussian forward processes require substantially different estimators. The apparent simplicity of "learn to denoise" is not a general principle — it is a consequence of the specific mathematical properties of Gaussian noise.

💡Intuition

MCMC, Langevin, and diffusion models are the same SDE at different granularities. Metropolis-Hastings is a discrete-time Markov chain that converges to $\pi$. Langevin MCMC is the continuous-time limit, using the score $\nabla\log\pi$ to guide proposals. Diffusion models run a fixed forward SDE (Gaussian noise injection) and learn the reverse score with a neural network. The unifying object: the SDE with drift $-\nabla U$ and noise $\sqrt{2T}\,dB_t$ has stationary distribution $\propto e^{-U/T}$. MCMC discretizes this; diffusion models learn to run it backwards from random noise.

💡Intuition

Score matching avoids the partition function entirely. Training energy-based models by maximum likelihood requires computing $Z = \int e^{-E(x)}\,dx$ — intractable in high dimensions. Score matching sidesteps this: $\nabla_x \log p(x) = -\nabla_x E(x)$ — no $Z$! Training by matching scores requires only backpropagation through the energy, not the normalization constant. This is why score-based generative models and EBMs trained with score matching scale to image dimensions while maximum-likelihood EBMs do not: the likelihood requires $Z$; the score does not.

⚠️Warning

ULA is biased; MALA is not — but MALA requires tuning. ULA (Langevin without the Metropolis correction) has a stationary distribution that differs from the target by $O(\eta)$ in total variation. For $\eta = 0.01$ in $d = 100$ dimensions, this bias can be substantial. MALA corrects it via accept/reject but requires tuning the acceptance rate to $\approx 0.574$ (optimal for MALA in high dimensions). Practical advice: use NUTS (a form of HMC) for posteriors with tractable gradients; use ULA for energy-based sampling where exact MCMC is not needed; use diffusion models when you want generation, not posteriors.
