
Latent Diffusion & Guided Generation

Pixel-space diffusion models like DDPM produce high-quality images but are computationally expensive — running 1000 denoising steps on a 512×512 image at full resolution is prohibitive. Latent diffusion models (Rombach et al., 2022) solve this by compressing images into a low-dimensional latent space via a pre-trained VAE, then running diffusion entirely in that compressed space. This 8–16× spatial compression reduces the computational cost by orders of magnitude while preserving perceptual quality. Combined with classifier-free guidance and cross-attention conditioning on text embeddings, latent diffusion became the foundation for Stable Diffusion and most modern text-to-image systems.

Theory

Latent Diffusion Pipeline
[Figure: a 512×512×3 image is encoded by the VAE (×8 downsampling) into a 64×64×4 latent z; a U-Net denoises the latent over T steps, conditioned on CLIP/T5 text embeddings; the VAE decoder maps the denoised latent back to pixels. Pixel space: 786K values; latent space: 16K values (48× smaller).]

Running diffusion in latent space (64×64×4) instead of pixel space (512×512×3) gives a 48× reduction in values per step — the key computational saving in Stable Diffusion.

Standard diffusion on full-resolution images is impractically slow — a 512×512 image has 262K pixels (786K values across three channels), and self-attention over spatial positions scales quadratically. Latent diffusion first compresses the image to a small latent (64×64 = 4K positions), runs all the expensive diffusion steps there, then decodes back to pixels in one pass. This is the architecture behind Stable Diffusion, FLUX, and most production text-to-image systems.

VAE compression to latent space

A pre-trained encoder $\mathcal{E}$ and decoder $\mathcal{D}$ define the latent space. For an image $x \in \mathbb{R}^{H \times W \times 3}$:

$$z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}, \quad \hat{x} = \mathcal{D}(z)$$

The compression to latent space is architecturally required, not optional: without it, a single forward pass of the U-Net noise predictor on a 512×512 image would require self-attention over 262K spatial positions, which is computationally infeasible. The 8× spatial downsampling to 64×64 is a deliberate engineering choice — small enough to make attention tractable (4K positions vs 262K), large enough to retain the spatial structure that text conditioning needs to steer. The VAE training objective (perceptual + adversarial + KL) is also forced by this choice: pixel MSE alone would not preserve the fine texture needed for photorealistic decoding from a lossy compressed representation.

The standard Stable Diffusion VAE uses a downsampling factor of 8: a 512×512 image maps to a 64×64×4 latent. This 8× spatial compression means diffusion operates on $64 \times 64 = 4096$ spatial positions instead of $512 \times 512 = 262144$ — a 64× reduction in the number of tokens for attention.
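To make these numbers concrete, here is a quick bookkeeping sketch (plain Python arithmetic, no model weights needed):

```python
# SD 1.x shapes: 512x512 RGB image, VAE downsample factor 8, 4 latent channels
H, W, C = 512, 512, 3
f, c_lat = 8, 4
h, w = H // f, W // f

pixel_vals = H * W * C       # values the diffusion model would denoise in pixel space
latent_vals = h * w * c_lat  # values it actually denoises in latent space
print(pixel_vals, latent_vals, pixel_vals // latent_vals)  # 786432 16384 48

# Attention cost is driven by the number of spatial positions (tokens)
tokens_pixel = H * W   # 262,144
tokens_latent = h * w  # 4,096
print(tokens_pixel // tokens_latent)  # 64
```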

The VAE is trained with a perceptual loss and patch-based adversarial loss (not just pixel MSE) to preserve fine texture. The KL regularization term keeps the latent distribution close to $\mathcal{N}(0, \mathbf{I})$, which is where diffusion's noise process starts:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}}$$

Latent diffusion training

With a frozen VAE, latent diffusion trains the noise predictor $\varepsilon_\theta$ in latent space:

$$\mathcal{L}_{\text{LDM}} = \mathbb{E}_{z \sim \mathcal{E}(x),\, \varepsilon \sim \mathcal{N}(0, \mathbf{I}),\, t} \left[ \|\varepsilon - \varepsilon_\theta(z_t, t, \tau_\theta(y))\|^2 \right]$$

where $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$ is the noisy latent and $\tau_\theta(y)$ is the text conditioning signal from a text encoder.
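The forward noising step can be illustrated with a toy sketch (NumPy stand-ins for the latent and a linear beta schedule; a real model would learn to undo this):

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.standard_normal((4, 8, 8))   # toy stand-in for a 64x64x4 latent
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def add_noise(z, t, eps):
    """z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps (the forward process)."""
    return np.sqrt(alpha_bar[t]) * z + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.standard_normal(z.shape)
z_early = add_noise(z, 10, eps)   # early t: barely perturbed, still correlated with z
z_late = add_noise(z, 990, eps)   # late t: almost pure noise, alpha_bar near 0
print(np.corrcoef(z.ravel(), z_early.ravel())[0, 1])  # close to 1
print(alpha_bar[990])                                 # close to 0
```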

Classifier-free guidance (CFG)

The primary lever for controlling quality vs. diversity in text-to-image generation. During training, the text condition $y$ is randomly dropped with probability $p_{\text{drop}} \approx 0.1$, training the model to also predict $\varepsilon_\theta(z_t, t, \varnothing)$ (unconditional).

At inference, the guided score is a weighted interpolation:

$$\tilde{\varepsilon}_\theta(z_t, t, y) = \varepsilon_\theta(z_t, t, \varnothing) + w \cdot \big[ \varepsilon_\theta(z_t, t, y) - \varepsilon_\theta(z_t, t, \varnothing) \big]$$

where $w$ is the guidance scale:

  • $w = 1$: standard conditional sampling (no guidance boost)
  • $w = 7.5$: typical value for Stable Diffusion (high quality, reduced diversity)
  • $w > 10$: oversaturation and artifacts

Intuition: the term $\varepsilon_\theta(z_t, t, y) - \varepsilon_\theta(z_t, t, \varnothing)$ points in the direction that makes the sample more consistent with the text prompt. Scaling it by $w > 1$ amplifies this direction, trading diversity for prompt adherence.
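The CFG combination itself is a one-liner; a sketch with NumPy arrays standing in for the two model predictions (a real model would produce these from the noisy latent, timestep, and prompt):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two noise predictions at one denoising step
eps_uncond = rng.standard_normal((4, 64, 64))  # eps_theta(z_t, t, empty)
eps_cond = rng.standard_normal((4, 64, 64))    # eps_theta(z_t, t, y)

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: uncond + w * (cond - uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the plain conditional prediction; w = 0 the unconditional one
assert np.allclose(cfg(eps_uncond, eps_cond, 1.0), eps_cond)
assert np.allclose(cfg(eps_uncond, eps_cond, 0.0), eps_uncond)

# w > 1 extrapolates past the conditional prediction, away from the unconditional
guided = cfg(eps_uncond, eps_cond, 7.5)
print(np.linalg.norm(guided - eps_uncond) / np.linalg.norm(eps_cond - eps_uncond))
```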

Cross-attention for text conditioning

The noise predictor (U-Net or DiT) integrates text via cross-attention. At each spatial layer, image features $Q$ attend over text token embeddings $K, V$:

$$\text{CrossAttn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

where $Q = W_Q \cdot \phi_{\text{spatial}}$, $K = W_K \cdot \tau_\theta(y)$, $V = W_V \cdot \tau_\theta(y)$.

Each spatial position independently decides which text tokens to attend to. This is what enables spatial-semantic correspondence — "a red ball on the left" — though it struggles with fine-grained spatial reasoning.
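A minimal NumPy sketch of this cross-attention pattern, with toy dimensions matching SD 1.x (4096 latent positions, 77 CLIP text tokens; the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32                        # attention head dim
n_spatial, n_text = 4096, 77  # 64x64 latent positions, CLIP token count
phi = rng.standard_normal((n_spatial, 64))  # spatial features
tau = rng.standard_normal((n_text, 768))    # text embeddings tau_theta(y)
W_Q = rng.standard_normal((64, d)) * 0.1
W_K = rng.standard_normal((768, d)) * 0.1
W_V = rng.standard_normal((768, d)) * 0.1

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Q comes from image features; K, V come from text tokens
Q, K, V = phi @ W_Q, tau @ W_K, tau @ W_V
attn = softmax(Q @ K.T / np.sqrt(d))  # (n_spatial, n_text): per-position weights over tokens
out = attn @ V                        # (n_spatial, d): text-informed spatial features
print(attn.shape, out.shape)
```

Each row of `attn` is one spatial position's distribution over the 77 text tokens, which is exactly the "each position decides which tokens to attend to" behavior described above.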

DiT: Diffusion Transformer backbone

Peebles and Xie (2023) replaced the U-Net with a Diffusion Transformer (DiT), treating the latent as a sequence of patches and applying transformer blocks throughout:

  1. Patchify the latent $z_t \in \mathbb{R}^{h \times w \times c}$ into a sequence of tokens
  2. Condition each block via adaptive layer norm (adaLN-Zero): $\text{adaLN}(x, t, y) = \gamma(t, y) \cdot \text{LN}(x) + \beta(t, y)$, where $\gamma, \beta$ are predicted from the timestep $t$ and class label/text embedding $y$
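The adaLN modulation in step 2 reduces to a few lines. A toy NumPy sketch, where the projection `W` is a hypothetical stand-in for the model's learned conditioning head:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

n_tokens, d = 256, 384         # e.g. a 16x16 patchified latent, DiT hidden size
x = rng.standard_normal((n_tokens, d))
cond = rng.standard_normal(d)  # pooled embedding of (timestep t, condition y)

# A small head predicts per-channel scale/shift from the conditioning vector.
# (adaLN-Zero additionally zero-initializes a gating scale so every block
# starts out as the identity function.)
W = rng.standard_normal((d, 2 * d)) * 0.02
gamma, beta = np.split(cond @ W, 2)

out = gamma * layer_norm(x) + beta  # conditioning modulates the normalization
print(out.shape)
```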

DiT scales better than U-Net with compute: DiT-XL/2 achieves FID 2.27 on ImageNet 256×256, outperforming the best U-Net baselines. Stable Diffusion 3 and FLUX use transformer-based backbones.

Flow matching

An alternative to DDPM's noise schedule. Flow matching (Lipman et al., 2022; Esser et al., 2024 in SD3) defines a straight-line probability path from noise to data:

$$z_t = (1 - t) \cdot \varepsilon + t \cdot z_0, \quad \varepsilon \sim \mathcal{N}(0, \mathbf{I})$$

The velocity field $v_\theta$ predicts the direction to move:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, z_0, \varepsilon} \left[ \|v_\theta(z_t, t) - (z_0 - \varepsilon)\|^2 \right]$$

The optimal trajectory is a straight line from $\varepsilon$ to $z_0$ — no curved path needed. This allows fewer sampling steps (10–20 instead of 50+) because the learned flow is approximately linear. SD3 and FLUX are both trained with flow-matching (rectified-flow) objectives.
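A toy NumPy sketch of why straight paths enable few-step sampling: with an oracle velocity (which a trained $v_\theta$ only approximates), a single Euler step from pure noise recovers the data exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

z0 = rng.standard_normal((4, 8, 8))  # "data" latent
eps = rng.standard_normal(z0.shape)  # noise sample

def interpolate(t):
    """z_t = (1 - t) * eps + t * z0: straight-line path, t=0 noise, t=1 data."""
    return (1.0 - t) * eps + t * z0

# The regression target for v_theta is constant along the whole path
target_velocity = z0 - eps

# Euler integration with the true (constant) velocity: one step with dt = 1
# walks the entire straight line from noise to data.
z = interpolate(0.0)          # start at pure noise
z = z + 1.0 * target_velocity
print("one-step recovery error:", np.abs(z - z0).max())
```

A learned flow is only approximately linear, which is why practical samplers still take 10 to 20 steps rather than one.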

Walkthrough

Stable Diffusion inference

python
from diffusers import StableDiffusionPipeline
import torch
 
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
 
image = pipe(
    prompt="a photorealistic mountain lake at sunset, 4k",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("output.png")

CFG guidance scale sweep

python
import torch
from diffusers import StableDiffusionPipeline
 
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
 
prompt = "a golden retriever in a forest, oil painting"
 
for scale in [1.0, 3.0, 7.5, 12.0, 20.0]:
    image = pipe(
        prompt=prompt,
        guidance_scale=scale,
        num_inference_steps=50,
        # Re-seed per run so every scale starts from identical initial noise
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]
    ).images[0]
    image.save(f"cfg_{scale}.png")
    print(f"Saved cfg_{scale}.png")

DDIM inversion for image editing

python
from diffusers import DDIMScheduler, StableDiffusionPipeline
from PIL import Image
import torch
 
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
 
def encode_image(pipe, image):
    """Encode a PIL image into the scaled VAE latent space."""
    image = pipe.image_processor.preprocess(image).to("cuda", dtype=torch.float16)
    with torch.no_grad():
        # Use the distribution mean (not a sample) so the inversion is deterministic
        latent = pipe.vae.encode(image).latent_dist.mean * pipe.vae.config.scaling_factor
    return latent
 
# Invert real image to noise space (for editing)
image = Image.open("original.jpg").convert("RGB").resize((512, 512))
latent = encode_image(pipe, image)
 
# DDIM inversion: add noise step by step
# Then re-denoise with edited prompt to change content
# while preserving structure
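The inversion loop itself is elided above. Its core update is the deterministic DDIM map between timesteps, which is exactly invertible when the same noise estimate is reused. A toy NumPy sketch (scalar schedule, with an oracle noise value standing in for the U-Net prediction):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)  # cumulative alpha_bar schedule

z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)  # oracle noise (a real model predicts this)

def ddim_step(z_t, e, t_from, t_to):
    """Deterministic DDIM map between timesteps, given noise estimate e."""
    x0 = (z_t - np.sqrt(1 - abar[t_from]) * e) / np.sqrt(abar[t_from])
    return np.sqrt(abar[t_to]) * x0 + np.sqrt(1 - abar[t_to]) * e

z_t = np.sqrt(abar[500]) * z0 + np.sqrt(1 - abar[500]) * eps
z_inv = ddim_step(z_t, eps, 500, 700)     # inversion: step toward more noise
z_back = ddim_step(z_inv, eps, 700, 500)  # denoise back with the same estimate
print(np.abs(z_back - z_t).max())         # round trip is exact with fixed eps
```

In practice the model's noise predictions differ slightly between the forward and reverse passes, so real inversions are approximate; that error is what structure-preserving editing methods work to control.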

Latent diffusion training loop (minimal)

python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL
 
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")
vae.requires_grad_(False)
 
def training_step(unet, vae, text_encoder, batch, noise_scheduler):
    pixel_values, input_ids = batch["pixel_values"], batch["input_ids"]
 
    # Encode to latent
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
 
        text_embeds = text_encoder(input_ids)[0]
 
    # Sample noise and timestep
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
 
    # CFG dropout: with prob 0.1, replace the condition so the model also
    # learns an unconditional prediction (SD drops to the empty-prompt
    # embedding; zeroing is a common simplification)
    if torch.rand(1).item() < 0.1:
        text_embeds = torch.zeros_like(text_embeds)
 
    pred = unet(noisy_latents, t, encoder_hidden_states=text_embeds).sample
    return F.mse_loss(pred, noise)

Analysis & Evaluation

Where Your Intuition Breaks

Misconception: higher guidance scale always produces higher-quality, more prompt-faithful images. In reality, guidance scale controls a trade-off between prompt adherence and sample diversity: higher values increase adherence but also increase overexposure, oversaturation, and anatomical artifacts. Guidance scales above 10 typically produce visually degraded results — oversaturated colors, distorted anatomy, loss of fine texture. The optimal guidance scale for a given model and prompt type is empirically determined; w = 7.5 is a typical starting point for Stable Diffusion, but optimal values vary across model versions. The correct mental model is not "higher is better" but "higher means more adherent and less diverse, with quality degrading past a model-specific ceiling."

Architecture comparison

|                  | Pixel diffusion      | Latent diffusion          | Flow matching (SD3/Flux) |
|------------------|----------------------|---------------------------|--------------------------|
| Denoising space  | H×W×3                | h×w×c (8× smaller)        | Latent (same)            |
| Steps (quality)  | 1000 DDPM / 50 DDIM  | 50 DDIM / 20 DPM++        | 10–20                    |
| Memory (512px)   | High                 | Moderate                  | Moderate                 |
| Quality ceiling  | High                 | Higher (perc. loss VAE)   | Highest (SD3, FLUX)      |
| Editability      | DDIM inversion       | DDIM inversion            | Still maturing           |

Guidance scale tradeoffs

| Guidance scale | Effect                                 | Use case                |
|----------------|----------------------------------------|-------------------------|
| 1.0            | No guidance — pure conditional sample  | Maximum diversity       |
| 3–5            | Soft guidance                          | Creative exploration    |
| 7–8            | Standard (SD default)                  | Balanced quality        |
| 10–15          | High guidance                          | Strict prompt adherence |
| 20+            | Oversaturation, artifacts              | Rarely useful           |

Key design choices

VAE quality: the VAE determines the upper bound on reconstruction quality. The SD-VAE-FT-MSE checkpoint (fine-tuned with extra MSE weighting) produces cleaner, smoother reconstructions than the original SD VAE, especially for faces.

Text encoder: CLIP ViT-L/14 (SD 1.x), OpenCLIP ViT-H (SD 2.x), T5-XXL + CLIP (SD3 / FLUX). Larger text encoders understand more complex prompts and compositional descriptions.

Negative prompting: providing a negative prompt improves CFG by steering away from unwanted attributes ("blurry, low quality") rather than only toward positives. The guidance becomes:

$$\tilde{\varepsilon} = \varepsilon_\theta(z_t, y_{\text{neg}}) + w \cdot \big[ \varepsilon_\theta(z_t, y_{\text{pos}}) - \varepsilon_\theta(z_t, y_{\text{neg}}) \big]$$

Common failure modes:

  • Prompt following on complex scenes: "A above B to the left of C" — cross-attention lacks explicit spatial reasoning
  • Consistent multi-object generation: two different people with specified attributes often blend characteristics
  • Text rendering: SD 1.x/2.x cannot render legible text; SD3/FLUX substantially improved this via better text encoders and flow matching
