
Diffusion Models

Diffusion models took over generative modeling by being the first approach to reliably generate high-fidelity, diverse images at scale. The core idea is elegant: gradually add Gaussian noise to training images until they become pure noise, then learn to reverse this process step by step. The forward process has a closed-form solution — you can jump to any noise level in one step — which makes training tractable. The reverse process is learned by a neural network that predicts the noise added at each step. Despite the apparent simplicity, diffusion models require careful understanding of noise schedules, score matching, and the connection between the training objective and likelihood maximization.

Theory

Diffusion process — forward (noising) & reverse (denoising)

Cosine schedule: ᾱ_t = cos²(πt/2T). Adds noise more slowly — preserves structure longer. Standard in DDPM/DDIM improvements.

forward q(x_t | x₀): x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, ε∼N(0,I) — analytic, no iteration needed
reverse p_θ(x_{t−1} | x_t): ε_θ(x_t, t) predicts the noise; remove it step by step — iterative, T steps to denoise

DDPM (Ho et al., 2020): T=1000 steps · training minimizes E[‖ε − ε_θ(x_t, t)‖²]

Diffusion models learn to undo a process of gradual noise addition. Think of rubbing out a drawing with an eraser, step by step, until only static remains — diffusion learns to reverse this erasure, starting from static and reconstructing the drawing. The diagram above shows both directions: the forward process (image → noise) and the reverse process (noise → image).

Forward diffusion process

The forward process q gradually corrupts data x₀ ∼ p_data by adding Gaussian noise over T steps: q(x_t | x_{t−1}) = N(x_t; √(1−β_t)·x_{t−1}, β_t·I)

Gaussian noise at each step is the only choice that produces a closed-form marginal q(x_t | x₀) — the property that lets you jump directly from a clean image to any noise level without iterating through all t steps. This closed form is what makes DDPM training efficient: sample a random timestep t, compute the noisy image in one step using x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, and train the denoiser. Non-Gaussian noise distributions lack this closed-form marginal and would require iterating through all steps to generate a training example.
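As a quick numerical check (not part of the original derivation), the one-step closed form can be compared against literally iterating the forward kernel — both produce the same marginal statistics:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule, as in DDPM
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

t = 200
x0 = torch.full((100_000,), 0.7)           # a batch of scalar "pixels"

# Iterative forward process: apply q(x_t | x_{t-1}) one step at a time
x = x0.clone()
for s in range(t + 1):
    x = (1 - betas[s]).sqrt() * x + betas[s].sqrt() * torch.randn_like(x)

# Closed-form jump: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps
x_direct = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * torch.randn_like(x0)

print(x.mean(), x_direct.mean())   # both ≈ √(ᾱ_t) · 0.7
print(x.std(), x_direct.std())     # both ≈ √(1 − ᾱ_t)
```

The iterative path needs t+1 sampling steps; the closed form needs one — which is exactly why training samples timesteps this way.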

where β_t ∈ (0,1) is the noise schedule. Define α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^t α_s.

Key property — closed-form marginal: you can jump directly from x₀ to any noisy x_t without iterating: q(x_t | x₀) = N(x_t; √(ᾱ_t)·x₀, (1−ᾱ_t)·I)

or equivalently: x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, ε ∼ N(0, I)

This closed form is what makes DDPM training efficient — sample any t uniformly, add noise at that level in one step, and train.

Noise schedules

Linear schedule (DDPM original): β_t increases linearly from β₁ = 10⁻⁴ to β_T = 0.02. Corrupts structure too aggressively early in the process.

Cosine schedule (Nichol and Dhariwal, 2021): defines ᾱ_t directly: ᾱ_t = f(t)/f(0), f(t) = cos²((t/T + s)/(1 + s) · π/2)

The cosine schedule keeps ᾱ_t near 1 for longer (preserves structure), then drops steeply near T. Improves log-likelihood and sample quality.
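A small sketch makes the difference concrete — computing ᾱ_t under both schedules directly from the formulas above (T = 1000, s = 0.008 as in the papers):

```python
import torch

T, s = 1000, 0.008

# Linear: beta_t from 1e-4 to 0.02, abar_t = cumulative product of (1 - beta)
betas = torch.linspace(1e-4, 0.02, T)
abar_linear = torch.cumprod(1 - betas, dim=0)

# Cosine: define abar_t directly via f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)
t = torch.arange(0, T + 1) / T
f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
abar_cosine = (f / f[0])[1:]

for frac in (0.25, 0.5, 0.75):
    i = int(frac * T) - 1
    print(f"t/T={frac:.2f}  linear abar={abar_linear[i].item():.3f}"
          f"  cosine abar={abar_cosine[i].item():.3f}")
```

At every fraction of the trajectory the cosine ᾱ_t stays higher, i.e. more of the image survives to later timesteps.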

Reverse process and training objective

The reverse process p_θ inverts the forward process: p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t²·I)

Training objective: the negative ELBO reduces to a simple noise-prediction loss (Ho et al., 2020): L_simple = E_{t, x₀, ε}[‖ε − ε_θ(x_t, t)‖²]

"Predict the noise ε that was added to get x_t from x₀." The noise predictor ε_θ takes the noisy image and timestep t as inputs.

Sampling (DDPM): given ε_θ, reconstruct x̂₀, compute the posterior mean, then step back: x̂₀ = (x_t − √(1−ᾱ_t)·ε_θ(x_t, t)) / √(ᾱ_t)

x_{t−1} = (√(ᾱ_{t−1})·β_t / (1−ᾱ_t))·x̂₀ + (√(α_t)·(1−ᾱ_{t−1}) / (1−ᾱ_t))·x_t + σ_t·z, z ∼ N(0, I)
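The two equations above translate into a single sampling step. A minimal sketch, assuming `eps_model` is any trained noise predictor and `betas`/`alpha_bar` follow the definitions in this section:

```python
import torch

def ddpm_step(eps_model, x_t, t, betas, alpha_bar):
    """Sample x_{t-1} from p_theta(x_{t-1} | x_t) for a batch at integer timestep t."""
    alphas = 1.0 - betas
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t)
    # Reconstruct x0_hat = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()

    # Posterior mean: the two coefficients from the formula above
    coef_x0 = ab_prev.sqrt() * betas[t] / (1 - ab_t)
    coef_xt = alphas[t].sqrt() * (1 - ab_prev) / (1 - ab_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t

    # sigma_t^2 = posterior variance beta_tilde_t; no noise added at t = 0
    var = betas[t] * (1 - ab_prev) / (1 - ab_t)
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + var.sqrt() * z
```

Running this for t = T−1, …, 0 is the full (slow) DDPM ancestral sampler; the DDIM walkthrough below replaces it with a step-skipping update.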

Score matching connection

The noise prediction objective is equivalent to denoising score matching (Vincent, 2011). The score function ∇_{x_t} log p(x_t) points toward regions of higher data density. The noise predictor approximates: ε_θ(x_t, t) ≈ −√(1−ᾱ_t)·∇_{x_t} log p_t(x_t)

This connection to score matching provides theoretical grounding and links DDPM to continuous-time SDE formulations (Song et al., 2021).
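The relation is easy to verify in a toy case where everything is analytic: if x₀ is zero-mean Gaussian, the marginal p_t is also Gaussian, its score is known in closed form, and the identity above reproduces the Bayes-optimal noise prediction. (The `abar` and variance values here are arbitrary illustrative choices.)

```python
import torch

abar = torch.tensor(0.5)     # \bar{alpha}_t at some timestep (illustrative)
s0_sq = torch.tensor(2.0)    # variance of the clean data x0 ~ N(0, s0_sq)

var_t = abar * s0_sq + (1 - abar)          # marginal variance of x_t
x = torch.linspace(-3, 3, 7)               # a few query points

score = -x / var_t                          # exact score of the Gaussian marginal
eps_opt = -(1 - abar).sqrt() * score        # noise prediction implied by the identity

# Cross-check against E[eps | x_t] computed from the Gaussian posterior:
# E[x0 | x_t] = sqrt(abar) * s0_sq / var_t * x_t, and eps = (x_t - sqrt(abar) x0)/sqrt(1-abar)
x0_mean = abar.sqrt() * s0_sq / var_t * x
eps_check = (x - abar.sqrt() * x0_mean) / (1 - abar).sqrt()
print(torch.allclose(eps_opt, eps_check))   # → True
```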

DDIM — deterministic sampling

DDPM requires T = 1000 steps for high-quality samples. DDIM (Song et al., 2020) shows that the same trained model can be sampled deterministically in far fewer steps by skipping timesteps:

x_{t−1} = √(ᾱ_{t−1})·x̂₀ + √(1−ᾱ_{t−1})·ε_θ, where x̂₀ = (x_t − √(1−ᾱ_t)·ε_θ)/√(ᾱ_t)

With η = 0 (deterministic), this produces the same image for the same noise sample — enabling:

  • 50-step sampling with quality comparable to 1000-step DDPM
  • DDIM inversion: given a real image, find the noise that generates it (useful for editing)
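DDIM inversion is just the deterministic update run in the opposite direction (image → noise). A sketch, assuming the same `eps_model` noise-predictor interface and `alpha_bar` buffer used in the walkthrough:

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, alpha_bar, steps=50):
    """Map a clean image to the DDIM latent noise that regenerates it."""
    T = alpha_bar.shape[0]
    ts = torch.linspace(0, T - 1, steps).long().tolist()   # ascending timesteps
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t_cur)
        ab_cur, ab_next = alpha_bar[t_cur], alpha_bar[t_next]
        # Same DDIM update as sampling, but stepping toward higher noise
        x0_pred = (x - (1 - ab_cur).sqrt() * eps) / ab_cur.sqrt()
        x = ab_next.sqrt() * x0_pred + (1 - ab_next).sqrt() * eps
    return x   # approximately the noise DDIM maps back to x0
```

The approximation holds because each deterministic step is (to first order) reversible; editing pipelines invert, modify the conditioning, then re-run the sampler.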

Walkthrough

DDPM training loop

python
import torch
import torch.nn.functional as F
 
class GaussianDiffusion:
    def __init__(self, T=1000, schedule='cosine', device='cpu'):
        self.T = T
        betas = self._make_schedule(T, schedule).to(device)
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        self.sqrt_alpha_bar = alpha_bar.sqrt()
        self.sqrt_one_minus_alpha_bar = (1 - alpha_bar).sqrt()
 
    def _make_schedule(self, T, schedule):
        if schedule == 'linear':
            return torch.linspace(1e-4, 0.02, T)
        s = 0.008
        t = torch.linspace(0, T, T + 1) / T
        f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
        alpha_bar = f / f[0]
        return torch.clamp(1 - alpha_bar[1:] / alpha_bar[:-1], max=0.999)
 
    def q_sample(self, x0, t, noise=None):
        """Forward process: add noise at timestep t (closed form)."""
        if noise is None:
            noise = torch.randn_like(x0)
        sqrt_ab = self.sqrt_alpha_bar[t].view(-1, 1, 1, 1)
        sqrt_1_ab = self.sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
        return sqrt_ab * x0 + sqrt_1_ab * noise, noise
 
    def training_loss(self, model, x0):
        B = x0.size(0)
        t = torch.randint(0, self.T, (B,), device=x0.device)
        xt, noise = self.q_sample(x0, t)
        pred_noise = model(xt, t)
        return F.mse_loss(pred_noise, noise)
 
 
# Training step
diffusion = GaussianDiffusion(T=1000, schedule='cosine', device='cuda')
# model = UNet(...)  — standard U-Net with time conditioning
# for batch in dataloader:
#     loss = diffusion.training_loss(model, batch.to('cuda'))
#     optimizer.zero_grad(); loss.backward(); optimizer.step()

DDIM sampling

python
@torch.no_grad()
def ddim_sample(model, diffusion, shape, steps=50, device='cuda'):
    T = diffusion.T
    timesteps = torch.linspace(T - 1, 0, steps).long().tolist()  # plain ints index safely on any device
    x = torch.randn(*shape, device=device)
 
    for i, t_cur in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        t_batch = torch.full((shape[0],), t_cur, device=device, dtype=torch.long)
        pred_noise = model(x, t_batch)
 
        ab_t    = diffusion.sqrt_alpha_bar[t_cur] ** 2
        ab_prev = diffusion.sqrt_alpha_bar[t_prev] ** 2 if t_prev >= 0 else torch.tensor(1.0)
 
        x0_pred = (x - (1 - ab_t).sqrt() * pred_noise) / ab_t.sqrt()
        x0_pred = x0_pred.clamp(-1, 1)
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * pred_noise
 
    return x

Analysis & Evaluation

Where Your Intuition Breaks

Misconception: "Diffusion always requires 1,000 denoising steps, making it too slow for real-time applications." In reality, DDPM requires 1,000 steps because it uses a Markov chain with small step sizes — but DDIM (Song et al., 2020) showed that the same trained model can generate high-quality samples with 20–50 deterministic steps by skipping intermediate timesteps, no retraining required. Consistency models (Song et al., 2023) reduce this further to 1–4 steps by training on a self-consistency objective. The number of steps is a property of the sampler, not the model: a DDPM-trained model can be sampled with any compatible sampler, and the quality–speed trade-off is controlled at inference time.

Diffusion vs GAN vs VAE

                     GAN                VAE                 Diffusion
Sample quality       High (mode drop)   Lower (blurry)      Highest
Mode coverage        Poor               Good                Good
Training stability   Unstable           Stable              Stable
Inference speed      Fast (1 pass)      Fast (1 pass)       Slow (T steps)
Editability          Limited            Latent arithmetic   DDIM inversion

Evaluation metrics

FID (Fréchet Inception Distance): distance between feature distributions of real and generated images. Lower is better. State of the art on ImageNet 256×256: FID under 2.

Precision / Recall: precision measures sample fidelity; recall measures diversity. A single FID number can hide trade-offs between the two.

CLIP score: cosine similarity between generated image and conditioning text — measures text alignment for text-conditioned models.
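For intuition, the Fréchet distance behind FID is just a distance between two Gaussians fit to feature sets: ‖μ₁−μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A self-contained sketch — random features stand in for the Inception activations used by the real metric:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fit to two (N, D) feature arrays."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):    # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(2000, 16))
b = rng.normal(0.5, 1.0, size=(2000, 16))
print(frechet_distance(a, a[:1000]))   # small: same distribution
print(frechet_distance(a, b))          # larger: shifted means
```

Real FID implementations differ only in where the features come from (Inception-v3 pool features over tens of thousands of images).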

Key design choices

U-Net backbone: standard noise predictor with skip connections and time conditioning via AdaGN or cross-attention. DiT (Diffusion Transformer) replaces the U-Net for better scaling — covered in Latent Diffusion.

Classifier-free guidance: the primary quality lever in text-to-image models — covered in Latent Diffusion & Guided Generation.

v-parameterization: predict v = √(ᾱ_t)·ε − √(1−ᾱ_t)·x₀ instead of ε directly. Improves numerical stability at extreme noise levels.
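The conversion is pure algebra and easy to sanity-check: v is an orthogonal combination of ε and x₀, so both are exactly recoverable from (x_t, v). A sketch with illustrative shapes and an arbitrary ᾱ_t:

```python
import torch

ab = torch.tensor(0.3)                     # \bar{alpha}_t (illustrative value)
x0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x0)

x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # forward process
v = ab.sqrt() * eps - (1 - ab).sqrt() * x0            # v-target

# Invert the rotation: both x0 and eps come back exactly
x0_rec = ab.sqrt() * x_t - (1 - ab).sqrt() * v
eps_rec = (1 - ab).sqrt() * x_t + ab.sqrt() * v
print(torch.allclose(x0_rec, x0, atol=1e-5),
      torch.allclose(eps_rec, eps, atol=1e-5))        # → True True
```

Because (x_t, v) is a rotation of (x₀, ε), the v-target stays well-conditioned even as ᾱ_t → 0 or 1, where ε- or x₀-prediction respectively degenerate.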
