
Diffusion Models

Diffusion models took over generative modeling by being the first approach to reliably generate high-fidelity, diverse images at scale. The core idea is elegant: gradually add Gaussian noise to training images until they become pure noise, then learn to reverse this process step by step. The forward process has a closed-form solution — you can jump to any noise level in one step — which makes training tractable. The reverse process is learned by a neural network that predicts the noise added at each step. Despite the apparent simplicity, diffusion models require careful understanding of noise schedules, score matching, and the connection between the training objective and likelihood maximization.

Theory

Diffusion process — forward (noising) & reverse (denoising)

Cosine schedule: ᾱ_t = cos²(πt/2T). Adds noise more slowly — preserves structure longer. Standard in DDPM/DDIM improvements.

forward q(x_t | x₀): x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, ε∼N(0,I) — analytic, no iteration needed
reverse p_θ(x_{t−1} | x_t): ε_θ(x_t, t) predicts the noise; remove it step by step — iterative, T steps to denoise

DDPM (Ho et al., 2020): T=1000 steps · training minimizes E[‖ε − ε_θ(x_t, t)‖²]

Diffusion models learn to undo a process of gradual noise addition. Think of rubbing out a drawing with an eraser, step by step, until only static remains — diffusion learns to reverse this erasure, starting from static and reconstructing the drawing. The diagram above shows both directions: the forward process (image → noise) and the reverse process (noise → image).

Forward diffusion process

The forward process q gradually corrupts data x₀ ∼ p_data by adding Gaussian noise over T steps: q(x_t | x_{t−1}) = N(x_t; √(1−β_t)·x_{t−1}, β_t·I)

Gaussian noise at each step is the only choice that produces a closed-form marginal q(x_t | x₀) — the property that lets you jump directly from a clean image to any noise level without iterating through all t steps. This closed form is what makes DDPM training efficient: sample a random timestep t, compute the noisy image in one step using x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, and train the denoiser. Non-Gaussian noise distributions lack this closed-form marginal and would require iterating through all steps to generate a training example.
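As a quick numerical check (not part of the original derivation), the one-step closed form can be compared against literally iterating the forward kernel — both produce the same marginal statistics:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule, as in DDPM
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

t = 200
x0 = torch.full((100_000,), 0.7)           # a batch of scalar "pixels"

# Iterative forward process: apply q(x_t | x_{t-1}) one step at a time
x = x0.clone()
for s in range(t + 1):
    x = (1 - betas[s]).sqrt() * x + betas[s].sqrt() * torch.randn_like(x)

# Closed-form jump: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps
x_direct = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * torch.randn_like(x0)

print(x.mean(), x_direct.mean())   # both ≈ √(ᾱ_t) · 0.7
print(x.std(), x_direct.std())     # both ≈ √(1 − ᾱ_t)
```

The iterative path needs t+1 sampling steps; the closed form needs one — which is exactly why training samples timesteps this way.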

where β_t ∈ (0,1) is the noise schedule. Define α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^t α_s.

Key property — closed-form marginal: you can jump directly from x₀ to any noisy x_t without iterating: q(x_t | x₀) = N(x_t; √(ᾱ_t)·x₀, (1−ᾱ_t)·I)

or equivalently: x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε, ε ∼ N(0, I)

This closed form is what makes DDPM training efficient — sample any t uniformly, add noise at that level in one step, and train.

Noise schedules

Linear schedule (DDPM original): β_t increases linearly from β₁ = 10⁻⁴ to β_T = 0.02. Corrupts structure too aggressively early in the process.

Cosine schedule (Nichol and Dhariwal, 2021): defines ᾱ_t directly: ᾱ_t = f(t)/f(0), f(t) = cos²((t/T + s)/(1 + s) · π/2)

The cosine schedule keeps ᾱ_t near 1 for longer (preserves structure), then drops steeply near T. Improves log-likelihood and sample quality.
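A small sketch makes the difference concrete — computing ᾱ_t under both schedules directly from the formulas above (T = 1000, s = 0.008 as in the papers):

```python
import torch

T, s = 1000, 0.008

# Linear: beta_t from 1e-4 to 0.02, abar_t = cumulative product of (1 - beta)
betas = torch.linspace(1e-4, 0.02, T)
abar_linear = torch.cumprod(1 - betas, dim=0)

# Cosine: define abar_t directly via f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)
t = torch.arange(0, T + 1) / T
f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
abar_cosine = (f / f[0])[1:]

for frac in (0.25, 0.5, 0.75):
    i = int(frac * T) - 1
    print(f"t/T={frac:.2f}  linear abar={abar_linear[i].item():.3f}"
          f"  cosine abar={abar_cosine[i].item():.3f}")
```

At every fraction of the trajectory the cosine ᾱ_t stays higher, i.e. more of the image survives to later timesteps.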

Reverse process and training objective

The reverse process p_θ inverts the forward process: p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t²·I)

Training objective: the negative ELBO reduces to a simple noise-prediction loss (Ho et al., 2020): L_simple = E_{t, x₀, ε}[‖ε − ε_θ(x_t, t)‖²]

"Predict the noise ε that was added to get x_t from x₀." The noise predictor ε_θ takes the noisy image and timestep t as inputs.

Sampling (DDPM): given ε_θ, reconstruct x̂₀, compute the posterior mean, then step back: x̂₀ = (x_t − √(1−ᾱ_t)·ε_θ(x_t, t)) / √(ᾱ_t)

x_{t−1} = (√(ᾱ_{t−1})·β_t / (1−ᾱ_t))·x̂₀ + (√(α_t)·(1−ᾱ_{t−1}) / (1−ᾱ_t))·x_t + σ_t·z, z ∼ N(0, I)
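The two equations above translate into a single sampling step. A minimal sketch, assuming `eps_model` is any trained noise predictor and `betas`/`alpha_bar` follow the definitions in this section:

```python
import torch

def ddpm_step(eps_model, x_t, t, betas, alpha_bar):
    """Sample x_{t-1} from p_theta(x_{t-1} | x_t) for a batch at integer timestep t."""
    alphas = 1.0 - betas
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t)
    # Reconstruct x0_hat = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()

    # Posterior mean: the two coefficients from the formula above
    coef_x0 = ab_prev.sqrt() * betas[t] / (1 - ab_t)
    coef_xt = alphas[t].sqrt() * (1 - ab_prev) / (1 - ab_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t

    # sigma_t^2 = posterior variance beta_tilde_t; no noise added at t = 0
    var = betas[t] * (1 - ab_prev) / (1 - ab_t)
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + var.sqrt() * z
```

Running this for t = T−1, …, 0 is the full (slow) DDPM ancestral sampler; the DDIM walkthrough below replaces it with a step-skipping update.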

Score matching connection

The noise prediction objective is equivalent to denoising score matching (Vincent, 2011). The score function ∇_{x_t} log p(x_t) points toward regions of higher data density. The noise predictor approximates: ε_θ(x_t, t) ≈ −√(1−ᾱ_t)·∇_{x_t} log p_t(x_t)

This connection to score matching provides theoretical grounding and links DDPM to continuous-time SDE formulations (Song et al., 2021).
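The relation is easy to verify in a toy case where everything is analytic: if x₀ is zero-mean Gaussian, the marginal p_t is also Gaussian, its score is known in closed form, and the identity above reproduces the Bayes-optimal noise prediction. (The `abar` and variance values here are arbitrary illustrative choices.)

```python
import torch

abar = torch.tensor(0.5)     # \bar{alpha}_t at some timestep (illustrative)
s0_sq = torch.tensor(2.0)    # variance of the clean data x0 ~ N(0, s0_sq)

var_t = abar * s0_sq + (1 - abar)          # marginal variance of x_t
x = torch.linspace(-3, 3, 7)               # a few query points

score = -x / var_t                          # exact score of the Gaussian marginal
eps_opt = -(1 - abar).sqrt() * score        # noise prediction implied by the identity

# Cross-check against E[eps | x_t] computed from the Gaussian posterior:
# E[x0 | x_t] = sqrt(abar) * s0_sq / var_t * x_t, and eps = (x_t - sqrt(abar) x0)/sqrt(1-abar)
x0_mean = abar.sqrt() * s0_sq / var_t * x
eps_check = (x - abar.sqrt() * x0_mean) / (1 - abar).sqrt()
print(torch.allclose(eps_opt, eps_check))   # → True
```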

DDIM — deterministic sampling

DDPM requires T = 1000 steps for high-quality samples. DDIM (Song et al., 2020) shows that the same trained model can be sampled deterministically in far fewer steps by skipping timesteps:

x_{t−1} = √(ᾱ_{t−1})·x̂₀ + √(1−ᾱ_{t−1})·ε_θ, where x̂₀ = (x_t − √(1−ᾱ_t)·ε_θ)/√(ᾱ_t)

With η = 0 (deterministic), this produces the same image for the same noise sample — enabling:

  • 50-step sampling with quality comparable to 1000-step DDPM
  • DDIM inversion: given a real image, find the noise that generates it (useful for editing)
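DDIM inversion is just the deterministic update run in the opposite direction (image → noise). A sketch, assuming the same `eps_model` noise-predictor interface and `alpha_bar` buffer used in the walkthrough:

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, alpha_bar, steps=50):
    """Map a clean image to the DDIM latent noise that regenerates it."""
    T = alpha_bar.shape[0]
    ts = torch.linspace(0, T - 1, steps).long().tolist()   # ascending timesteps
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t_cur)
        ab_cur, ab_next = alpha_bar[t_cur], alpha_bar[t_next]
        # Same DDIM update as sampling, but stepping toward higher noise
        x0_pred = (x - (1 - ab_cur).sqrt() * eps) / ab_cur.sqrt()
        x = ab_next.sqrt() * x0_pred + (1 - ab_next).sqrt() * eps
    return x   # approximately the noise DDIM maps back to x0
```

The approximation holds because each deterministic step is (to first order) reversible; editing pipelines invert, modify the conditioning, then re-run the sampler.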

Walkthrough

DDPM training loop

python
import torch
import torch.nn.functional as F
 
class GaussianDiffusion:
    def __init__(self, T=1000, schedule='cosine', device='cpu'):
        self.T = T
        betas = self._make_schedule(T, schedule).to(device)
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        self.sqrt_alpha_bar = alpha_bar.sqrt()
        self.sqrt_one_minus_alpha_bar = (1 - alpha_bar).sqrt()
 
    def _make_schedule(self, T, schedule):
        if schedule == 'linear':
            return torch.linspace(1e-4, 0.02, T)
        s = 0.008
        t = torch.linspace(0, T, T + 1) / T
        f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
        alpha_bar = f / f[0]
        return torch.clamp(1 - alpha_bar[1:] / alpha_bar[:-1], max=0.999)
 
    def q_sample(self, x0, t, noise=None):
        """Forward process: add noise at timestep t (closed form)."""
        if noise is None:
            noise = torch.randn_like(x0)
        sqrt_ab = self.sqrt_alpha_bar[t].view(-1, 1, 1, 1)
        sqrt_1_ab = self.sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
        return sqrt_ab * x0 + sqrt_1_ab * noise, noise
 
    def training_loss(self, model, x0):
        B = x0.size(0)
        t = torch.randint(0, self.T, (B,), device=x0.device)
        xt, noise = self.q_sample(x0, t)
        pred_noise = model(xt, t)
        return F.mse_loss(pred_noise, noise)
 
 
# Training step
diffusion = GaussianDiffusion(T=1000, schedule='cosine', device='cuda')
# model = UNet(...)  — standard U-Net with time conditioning
# for batch in dataloader:
#     loss = diffusion.training_loss(model, batch.to('cuda'))
#     optimizer.zero_grad(); loss.backward(); optimizer.step()

DDIM sampling

python
@torch.no_grad()
def ddim_sample(model, diffusion, shape, steps=50, device='cuda'):
    T = diffusion.T
    timesteps = torch.linspace(T - 1, 0, steps).long().tolist()  # plain ints index safely on any device
    x = torch.randn(*shape, device=device)
 
    for i, t_cur in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        t_batch = torch.full((shape[0],), t_cur, device=device, dtype=torch.long)
        pred_noise = model(x, t_batch)
 
        ab_t    = diffusion.sqrt_alpha_bar[t_cur] ** 2
        ab_prev = diffusion.sqrt_alpha_bar[t_prev] ** 2 if t_prev >= 0 else torch.tensor(1.0)
 
        x0_pred = (x - (1 - ab_t).sqrt() * pred_noise) / ab_t.sqrt()
        x0_pred = x0_pred.clamp(-1, 1)
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * pred_noise
 
    return x

Analysis & Evaluation

Where Your Intuition Breaks

Misconception: "Diffusion always requires 1,000 denoising steps, making it too slow for real-time applications." In reality, DDPM requires 1,000 steps because it uses a Markov chain with small step sizes — but DDIM (Song et al., 2020) showed that the same trained model can generate high-quality samples with 20–50 deterministic steps by skipping intermediate timesteps, no retraining required. Consistency models (Song et al., 2023) reduce this further to 1–4 steps by training on a self-consistency objective. The number of steps is a property of the sampler, not the model: a DDPM-trained model can be sampled with any compatible sampler, and the quality–speed trade-off is controlled at inference time.

Diffusion vs GAN vs VAE

                     GAN                VAE                 Diffusion
Sample quality       High (mode drop)   Lower (blurry)      Highest
Mode coverage        Poor               Good                Good
Training stability   Unstable           Stable              Stable
Inference speed      Fast (1 pass)      Fast (1 pass)       Slow (T steps)
Editability          Limited            Latent arithmetic   DDIM inversion

Evaluation metrics

FID (Fréchet Inception Distance): distance between feature distributions of real and generated images. Lower is better. State of the art on ImageNet 256×256: FID under 2.

Precision / Recall: precision measures sample fidelity; recall measures diversity. A single FID number can hide trade-offs between the two.

CLIP score: cosine similarity between generated image and conditioning text — measures text alignment for text-conditioned models.
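For intuition, the Fréchet distance behind FID is just a distance between two Gaussians fit to feature sets: ‖μ₁−μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A self-contained sketch — random features stand in for the Inception activations used by the real metric:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fit to two (N, D) feature arrays."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):    # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(2000, 16))
b = rng.normal(0.5, 1.0, size=(2000, 16))
print(frechet_distance(a, a[:1000]))   # small: same distribution
print(frechet_distance(a, b))          # larger: shifted means
```

Real FID implementations differ only in where the features come from (Inception-v3 pool features over tens of thousands of images).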

Key design choices

U-Net backbone: standard noise predictor with skip connections and time conditioning via AdaGN or cross-attention. DiT (Diffusion Transformer) replaces the U-Net for better scaling — covered in Latent Diffusion.

Classifier-free guidance: the primary quality lever in text-to-image models — covered in Latent Diffusion & Guided Generation.

v-parameterization: predict v = √(ᾱ_t)·ε − √(1−ᾱ_t)·x₀ instead of ε directly. Improves numerical stability at extreme noise levels.
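The conversion is pure algebra and easy to sanity-check: v is an orthogonal combination of ε and x₀, so both are exactly recoverable from (x_t, v). A sketch with illustrative shapes and an arbitrary ᾱ_t:

```python
import torch

ab = torch.tensor(0.3)                     # \bar{alpha}_t (illustrative value)
x0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x0)

x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # forward process
v = ab.sqrt() * eps - (1 - ab).sqrt() * x0            # v-target

# Invert the rotation: both x0 and eps come back exactly
x0_rec = ab.sqrt() * x_t - (1 - ab).sqrt() * v
eps_rec = (1 - ab).sqrt() * x_t + ab.sqrt() * v
print(torch.allclose(x0_rec, x0, atol=1e-5),
      torch.allclose(eps_rec, eps, atol=1e-5))        # → True True
```

Because (x_t, v) is a rotation of (x₀, ε), the v-target stays well-conditioned even as ᾱ_t → 0 or 1, where ε- or x₀-prediction respectively degenerate.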
