Diffusion Models
Diffusion models took over generative modeling by being the first approach to reliably generate high-fidelity, diverse images at scale. The core idea is elegant: gradually add Gaussian noise to training images until they become pure noise, then learn to reverse this process step by step. The forward process has a closed-form solution — you can jump to any noise level in one step — which makes training tractable. The reverse process is learned by a neural network that predicts the noise added at each step. Despite the apparent simplicity, diffusion models require careful understanding of noise schedules, score matching, and the connection between the training objective and likelihood maximization.
Theory
Cosine schedule: ᾱ_t ≈ cos²(πt/2T). Adds noise more slowly than the linear schedule — preserves structure longer. Standard in DDPM/DDIM improvements.
DDPM (Ho et al., 2020): T=1000 steps · training minimizes E[‖ε − ε_θ(x_t, t)‖²]
Diffusion models learn to undo a process of gradual noise addition. Think of rubbing out a drawing with an eraser, step by step, until only static remains — diffusion learns to reverse this erasure, starting from static and reconstructing the drawing. The diagram above shows both directions: the forward process (image → noise) and the reverse process (noise → image).
Forward diffusion process
The forward process gradually corrupts data by adding Gaussian noise over T steps:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)

where β_t is the noise schedule. Define α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s.

Gaussian noise at each step is the only choice that yields a closed-form marginal — the property that lets you jump directly from a clean image to any noise level without iterating through the intervening steps. Non-Gaussian noise distributions lack this closed form and would require simulating every step just to generate a single training example.
Key property — closed-form marginal: you can jump directly from x_0 to any noisy x_t without iterating:

q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I)

or equivalently:

x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε,   ε ∼ N(0, I)
This closed form is what makes DDPM training efficient — sample any t uniformly, add noise at that level in one step, and train.
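To make the closed-form jump concrete, here is a small numerical sketch (NumPy, illustrative values only) comparing the iterated per-step corruption against the one-step formula — both routes produce samples with the same mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)       # linear schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)

x0, n, t = 1.5, 100_000, 150             # clean scalar "pixel", sample count, timestep

# Iterative forward process: apply t+1 single-step corruptions
x = np.full(n, x0)
for s in range(t + 1):
    x = np.sqrt(1 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(n)

# Closed-form jump: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
direct = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * rng.standard_normal(n)

# Both routes yield the same marginal distribution
print(x.mean(), direct.mean())
print(x.std(), direct.std())
```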
Noise schedules
Linear schedule (DDPM original): β_t increases linearly from β_1 = 10⁻⁴ to β_T = 0.02. Destroys structure faster than necessary — images are close to pure noise well before t = T.
Cosine schedule (Nichol and Dhariwal, 2021): defines ᾱ_t directly:

ᾱ_t = f(t)/f(0),   f(t) = cos²((t/T + s)/(1 + s) · π/2),   s = 0.008

The cosine schedule keeps ᾱ_t near 1 for longer (preserves structure), then drops steeply near t = T. Improves log-likelihood and sample quality.
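A quick numerical comparison of the two schedules (a sketch; the linear endpoints 10⁻⁴ and 0.02 are the DDPM defaults):

```python
import numpy as np

T = 1000

# Linear schedule: beta_t linear, alpha_bar via cumulative product
betas = np.linspace(1e-4, 0.02, T)
ab_linear = np.cumprod(1.0 - betas)

# Cosine schedule: alpha_bar defined directly (s = 0.008 offset)
s = 0.008
t = np.arange(T + 1) / T
f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
ab_cosine = (f / f[0])[1:]

# Halfway through, the cosine schedule has preserved far more signal
q = T // 2
print(ab_linear[q], ab_cosine[q])        # roughly 0.08 vs 0.49
```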
Reverse process and training objective
The reverse process inverts the forward chain with a learned Gaussian at each step:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t² I)
Training objective: the negative ELBO reduces to a simple noise-prediction loss (Ho et al., 2020):

L_simple = E_{t, x_0, ε} [ ‖ε − ε_θ(x_t, t)‖² ]
"Predict the noise ε that was added to go from x_0 to x_t." The noise predictor ε_θ takes the noisy image x_t and the timestep t as inputs.
Sampling (DDPM): given x_t, predict the noise, compute the posterior mean, then step back with fresh noise:

x_{t−1} = (1/√α_t) (x_t − (β_t/√(1 − ᾱ_t)) ε_θ(x_t, t)) + σ_t z,   z ∼ N(0, I)
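The ancestral sampling loop can be sketched as follows (a minimal sketch, assuming a `model(x, t)` that predicts noise and the common variance choice σ_t² = β_t; `ddpm_sample` is an illustrative helper, not from the paper):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, shape, device='cpu'):
    """Ancestral DDPM sampling: start from pure noise, step t = T-1 ... 0."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(*shape, device=device)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at t=0
        x = mean + betas[t].sqrt() * noise
    return x
```

Ho et al. (2020) report that both variance choices, σ_t² = β_t and the posterior variance β̃_t, work in practice.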
Score matching connection
The noise-prediction objective is equivalent to denoising score matching (Vincent, 2011). The score function ∇_{x_t} log p(x_t) points toward regions of higher data density. The noise predictor approximates a scaled negative score:

ε_θ(x_t, t) ≈ −√(1 − ᾱ_t) ∇_{x_t} log p(x_t)
This connection to score matching provides theoretical grounding and links DDPM to continuous-time SDE formulations (Song et al., 2021).
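A one-dimensional sanity check of this identity (a Monte Carlo sketch, not a training procedure; assumes unit-variance Gaussian data, for which the marginal score is analytic):

```python
import numpy as np

rng = np.random.default_rng(1)
n, a_bar = 200_000, 0.5

# Data ~ N(0, 1), so the marginal of x_t is also N(0, 1) and its score is -x_t
x0 = rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps

# Best linear prediction of eps from xt (what a trained eps_theta approximates)
slope = np.cov(eps, xt)[0, 1] / np.var(xt)

# Identity: eps*(x_t) = -sqrt(1 - a_bar) * score(x_t); here score(x_t) = -x_t,
# so the regression slope should equal sqrt(1 - a_bar)
print(slope, np.sqrt(1 - a_bar))
```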
DDIM — deterministic sampling
DDPM requires T ≈ 1000 steps for high-quality samples. DDIM (Song et al., 2020) shows that the same trained model can be sampled deterministically in far fewer steps by skipping timesteps:

x̂_0 = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t)) / √ᾱ_t
x_{t−1} = √ᾱ_{t−1} x̂_0 + √(1 − ᾱ_{t−1} − σ_t²) ε_θ(x_t, t) + σ_t z
With σ_t = 0 (deterministic), the same initial noise always produces the same image — enabling:
- 50-step sampling with quality comparable to 1000-step DDPM
- DDIM inversion: given a real image, find the noise that generates it (useful for editing)
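DDIM inversion can be sketched by running the same deterministic update from t = 0 up to t = T − 1 (a minimal sketch; `ddim_invert` and its arguments are illustrative, and assume a noise-predicting `model(x, t)` plus a precomputed ᾱ table):

```python
import torch

@torch.no_grad()
def ddim_invert(model, alpha_bar, x0, steps=50):
    """Deterministic DDIM update run forward: image -> latent noise."""
    T = len(alpha_bar)
    ts = torch.linspace(0, T - 1, steps).long()
    x = x0.clone()
    for i in range(len(ts) - 1):
        t_cur, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((x.size(0),), int(t_cur), dtype=torch.long)
        eps = model(x, t_batch)
        # Estimate x0 at the current noise level, then re-noise it to the next level
        x0_hat = (x - (1 - alpha_bar[t_cur]).sqrt() * eps) / alpha_bar[t_cur].sqrt()
        x = alpha_bar[t_next].sqrt() * x0_hat + (1 - alpha_bar[t_next]).sqrt() * eps
    return x
```

The returned latent, fed back through deterministic DDIM sampling, approximately reproduces the input image — the basis of inversion-based editing.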
Walkthrough
DDPM training loop
import torch
import torch.nn.functional as F
class GaussianDiffusion:
def __init__(self, T=1000, schedule='cosine', device='cpu'):
self.T = T
betas = self._make_schedule(T, schedule).to(device)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)
self.sqrt_alpha_bar = alpha_bar.sqrt()
self.sqrt_one_minus_alpha_bar = (1 - alpha_bar).sqrt()
def _make_schedule(self, T, schedule):
if schedule == 'linear':
return torch.linspace(1e-4, 0.02, T)
s = 0.008
t = torch.linspace(0, T, T + 1) / T
f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
alpha_bar = f / f[0]
return torch.clamp(1 - alpha_bar[1:] / alpha_bar[:-1], max=0.999)
def q_sample(self, x0, t, noise=None):
"""Forward process: add noise at timestep t (closed form)."""
if noise is None:
noise = torch.randn_like(x0)
sqrt_ab = self.sqrt_alpha_bar[t].view(-1, 1, 1, 1)
sqrt_1_ab = self.sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
return sqrt_ab * x0 + sqrt_1_ab * noise, noise
def training_loss(self, model, x0):
B = x0.size(0)
t = torch.randint(0, self.T, (B,), device=x0.device)
xt, noise = self.q_sample(x0, t)
pred_noise = model(xt, t)
return F.mse_loss(pred_noise, noise)
# Training step
diffusion = GaussianDiffusion(T=1000, schedule='cosine', device='cuda')
# model = UNet(...) — standard U-Net with time conditioning
# for batch in dataloader:
# loss = diffusion.training_loss(model, batch.to('cuda'))
# optimizer.zero_grad(); loss.backward(); optimizer.step()

DDIM sampling
@torch.no_grad()
def ddim_sample(model, diffusion, shape, steps=50, device='cuda'):
T = diffusion.T
    timesteps = torch.linspace(T - 1, 0, steps).long()
x = torch.randn(*shape, device=device)
for i, t_cur in enumerate(timesteps):
t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
t_batch = torch.full((shape[0],), t_cur, device=device, dtype=torch.long)
pred_noise = model(x, t_batch)
ab_t = diffusion.sqrt_alpha_bar[t_cur] ** 2
ab_prev = diffusion.sqrt_alpha_bar[t_prev] ** 2 if t_prev >= 0 else torch.tensor(1.0)
x0_pred = (x - (1 - ab_t).sqrt() * pred_noise) / ab_t.sqrt()
x0_pred = x0_pred.clamp(-1, 1)
x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * pred_noise
    return x

Analysis & Evaluation
Where Your Intuition Breaks
Misconception: "Diffusion always requires 1,000 denoising steps, making it too slow for real-time applications." In fact, DDPM requires 1,000 steps because it uses a Markov chain with small step sizes. DDIM (Song et al., 2020) showed that the same trained model can generate high-quality samples with 20–50 deterministic steps by skipping intermediate timesteps — no retraining required. Consistency models (Song et al., 2023) reduce this further to 1–4 steps by training on a self-consistency objective. The number of steps is a property of the sampler, not the model: a DDPM-trained model can be sampled with any compatible sampler, and the quality-speed trade-off is controlled at inference time.
Diffusion vs GAN vs VAE
| | GAN | VAE | Diffusion |
|---|---|---|---|
| Sample quality | High (mode drop) | Lower (blurry) | Highest |
| Mode coverage | Poor | Good | Good |
| Training stability | Unstable | Stable | Stable |
| Inference speed | Fast (1 pass) | Fast (1 pass) | Slow (T steps) |
| Editability | Limited | Latent arithmetic | DDIM inversion |
Evaluation metrics
FID (Fréchet Inception Distance): distance between feature distributions of real and generated images. Lower is better. State of the art on ImageNet 256×256: FID under 2.
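As a sketch of the underlying formula, here is the Fréchet distance between two Gaussians in the simplified diagonal-covariance case (real FID uses full covariances of Inception features and a matrix square root; `frechet_distance_diag` is an illustrative helper):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between diagonal-covariance Gaussians.

    With diagonal covariances, the trace/matrix-sqrt term of the general
    formula reduces to a per-dimension sum of (sigma1 - sigma2)^2.
    """
    return np.sum((mu1 - mu2) ** 2) + np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)

mu = np.zeros(4)
var = np.ones(4)
print(frechet_distance_diag(mu, var, mu, var))        # identical distributions -> 0.0
print(frechet_distance_diag(mu, var, mu + 1.0, var))  # mean shift of 1 per dim -> 4.0
```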
Precision / Recall: precision measures sample fidelity; recall measures diversity. A single FID number can hide trade-offs between the two.
CLIP score: cosine similarity between generated image and conditioning text — measures text alignment for text-conditioned models.
Key design choices
U-Net backbone: standard noise predictor with skip connections and time conditioning via AdaGN or cross-attention. DiT (Diffusion Transformer) replaces the U-Net for better scaling — covered in Latent Diffusion.
Classifier-free guidance: the primary quality lever in text-to-image models — covered in Latent Diffusion & Guided Generation.
v-parameterization: predict v = √ᾱ_t ε − √(1 − ᾱ_t) x_0 instead of ε directly. Improves numerical stability at extreme noise levels.
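The v-target and its inversions can be checked numerically (a sketch; the value of ᾱ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a_bar = 0.3                      # example noise level (arbitrary)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)

sa, sb = np.sqrt(a_bar), np.sqrt(1 - a_bar)
xt = sa * x0 + sb * eps          # forward process sample
v = sa * eps - sb * x0           # v-target

# Both x0 and eps are recoverable from (x_t, v):
x0_rec = sa * xt - sb * v
eps_rec = sb * xt + sa * v
print(np.allclose(x0_rec, x0), np.allclose(eps_rec, eps))  # True True
```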