
Latent Diffusion & Guided Generation

Pixel-space diffusion models like DDPM produce high-quality images but are computationally expensive — running 1000 denoising steps on a 512×512 image at full resolution is prohibitive. Latent diffusion models (Rombach et al., 2022) solve this by compressing images into a low-dimensional latent space via a pre-trained VAE, then running diffusion entirely in that compressed space. This 8–16× spatial compression reduces the computational cost by orders of magnitude while preserving perceptual quality. Combined with classifier-free guidance and cross-attention conditioning on text embeddings, latent diffusion became the foundation for Stable Diffusion and most modern text-to-image systems.

Theory

Latent Diffusion Pipeline
[Figure: a 512×512×3 image is encoded by the VAE (×8 downsampling) into a 64×64×4 latent z; a U-Net denoises the latent over T steps, conditioned on CLIP/T5 text embeddings; the VAE decoder maps the denoised latent back to pixels. Pixel space: 786K values; latent space: 16K values (48× smaller).]

Running diffusion in latent space (64×64×4) instead of pixel space (512×512×3) gives a 48× reduction in values per step — the key computational saving in Stable Diffusion.

Standard diffusion on full-resolution images is impractically slow — a 512×512 image has 262K pixels (786K values across three channels), and self-attention over spatial positions scales quadratically. Latent diffusion first compresses the image to a small latent (64×64 = 4K positions), runs all the expensive diffusion steps there, then decodes back to pixels in one pass. This is the architecture behind Stable Diffusion, FLUX, and most production text-to-image systems.

VAE compression to latent space

A pre-trained encoder $\mathcal{E}$ and decoder $\mathcal{D}$ define the latent space. For an image $x \in \mathbb{R}^{H \times W \times 3}$:

$$z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}, \quad \hat{x} = \mathcal{D}(z)$$

The compression to latent space is architecturally required, not optional: without it, a single forward pass of the U-Net noise predictor on a 512×512 image would require self-attention over 262K spatial positions, which is computationally infeasible. The 8× spatial downsampling to 64×64 is a deliberate engineering choice — small enough to make attention tractable (4K positions vs 262K), large enough to retain the spatial structure that text conditioning needs to steer. The VAE training objective (perceptual + adversarial + KL) is also forced by this choice: pixel MSE alone would not preserve the fine texture needed for photorealistic decoding from a lossy compressed representation.

The standard Stable Diffusion VAE uses a downsampling factor of 8: a 512×512 image maps to a 64×64×4 latent. This 8× spatial compression means diffusion operates on $64 \times 64 = 4096$ spatial positions instead of $512 \times 512 = 262144$ — a 64× reduction in the number of tokens for attention.
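To make these numbers concrete, here is a quick bookkeeping sketch (plain Python arithmetic, no model weights needed):

```python
# SD 1.x shapes: 512x512 RGB image, VAE downsample factor 8, 4 latent channels
H, W, C = 512, 512, 3
f, c_lat = 8, 4
h, w = H // f, W // f

pixel_vals = H * W * C       # values the diffusion model would denoise in pixel space
latent_vals = h * w * c_lat  # values it actually denoises in latent space
print(pixel_vals, latent_vals, pixel_vals // latent_vals)  # 786432 16384 48

# Attention cost is driven by the number of spatial positions (tokens)
tokens_pixel = H * W   # 262,144
tokens_latent = h * w  # 4,096
print(tokens_pixel // tokens_latent)  # 64
```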

The VAE is trained with a perceptual loss and patch-based adversarial loss (not just pixel MSE) to preserve fine texture. The KL regularization term keeps the latent distribution close to $\mathcal{N}(0, \mathbf{I})$, which is where diffusion's noise process starts:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}}$$

Latent diffusion training

With a frozen VAE, latent diffusion trains the noise predictor $\varepsilon_\theta$ in latent space:

$$\mathcal{L}_{\text{LDM}} = \mathbb{E}_{z \sim \mathcal{E}(x),\, \varepsilon \sim \mathcal{N}(0, \mathbf{I}),\, t} \left[ \|\varepsilon - \varepsilon_\theta(z_t, t, \tau_\theta(y))\|^2 \right]$$

where $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$ is the noisy latent and $\tau_\theta(y)$ is the text conditioning signal from a text encoder.
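The forward noising step can be illustrated with a toy sketch (NumPy stand-ins for the latent and a linear beta schedule; a real model would learn to undo this):

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.standard_normal((4, 8, 8))   # toy stand-in for a 64x64x4 latent
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def add_noise(z, t, eps):
    """z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps (the forward process)."""
    return np.sqrt(alpha_bar[t]) * z + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.standard_normal(z.shape)
z_early = add_noise(z, 10, eps)   # early t: barely perturbed, still correlated with z
z_late = add_noise(z, 990, eps)   # late t: almost pure noise, alpha_bar near 0
print(np.corrcoef(z.ravel(), z_early.ravel())[0, 1])  # close to 1
print(alpha_bar[990])                                 # close to 0
```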

Classifier-free guidance (CFG)

The primary lever for controlling quality vs. diversity in text-to-image generation. During training, the text condition $y$ is randomly dropped with probability $p_{\text{drop}} \approx 0.1$, training the model to also predict $\varepsilon_\theta(z_t, t, \varnothing)$ (unconditional).

At inference, the guided score is a weighted interpolation:

$$\tilde{\varepsilon}_\theta(z_t, t, y) = \varepsilon_\theta(z_t, t, \varnothing) + w \cdot \big[ \varepsilon_\theta(z_t, t, y) - \varepsilon_\theta(z_t, t, \varnothing) \big]$$

where $w$ is the guidance scale:

  • $w = 1$: standard conditional sampling (no guidance boost)
  • $w = 7.5$: typical value for Stable Diffusion (high quality, reduced diversity)
  • $w > 10$: oversaturation and artifacts

Intuition: the term $\varepsilon_\theta(z_t, t, y) - \varepsilon_\theta(z_t, t, \varnothing)$ points in the direction that makes the sample more consistent with the text prompt. Scaling it by $w > 1$ amplifies this direction, trading diversity for prompt adherence.
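The CFG combination itself is a one-liner; a sketch with NumPy arrays standing in for the two model predictions (a real model would produce these from the noisy latent, timestep, and prompt):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two noise predictions at one denoising step
eps_uncond = rng.standard_normal((4, 64, 64))  # eps_theta(z_t, t, empty)
eps_cond = rng.standard_normal((4, 64, 64))    # eps_theta(z_t, t, y)

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: uncond + w * (cond - uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 1 recovers the plain conditional prediction; w = 0 the unconditional one
assert np.allclose(cfg(eps_uncond, eps_cond, 1.0), eps_cond)
assert np.allclose(cfg(eps_uncond, eps_cond, 0.0), eps_uncond)

# w > 1 extrapolates past the conditional prediction, away from the unconditional
guided = cfg(eps_uncond, eps_cond, 7.5)
print(np.linalg.norm(guided - eps_uncond) / np.linalg.norm(eps_cond - eps_uncond))
```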

Cross-attention for text conditioning

The noise predictor (U-Net or DiT) integrates text via cross-attention. At each spatial layer, image features $Q$ attend over text token embeddings $K, V$:

$$\text{CrossAttn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

where $Q = W_Q \cdot \phi_{\text{spatial}}$, $K = W_K \cdot \tau_\theta(y)$, $V = W_V \cdot \tau_\theta(y)$.

Each spatial position independently decides which text tokens to attend to. This is what enables spatial-semantic correspondence — "a red ball on the left" — though it struggles with fine-grained spatial reasoning.
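A minimal NumPy sketch of this cross-attention pattern, with toy dimensions matching SD 1.x (4096 latent positions, 77 CLIP text tokens; the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32                        # attention head dim
n_spatial, n_text = 4096, 77  # 64x64 latent positions, CLIP token count
phi = rng.standard_normal((n_spatial, 64))  # spatial features
tau = rng.standard_normal((n_text, 768))    # text embeddings tau_theta(y)
W_Q = rng.standard_normal((64, d)) * 0.1
W_K = rng.standard_normal((768, d)) * 0.1
W_V = rng.standard_normal((768, d)) * 0.1

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Q comes from image features; K, V come from text tokens
Q, K, V = phi @ W_Q, tau @ W_K, tau @ W_V
attn = softmax(Q @ K.T / np.sqrt(d))  # (n_spatial, n_text): per-position weights over tokens
out = attn @ V                        # (n_spatial, d): text-informed spatial features
print(attn.shape, out.shape)
```

Each row of `attn` is one spatial position's distribution over the 77 text tokens, which is exactly the "each position decides which tokens to attend to" behavior described above.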

DiT: Diffusion Transformer backbone

Peebles and Xie (2023) replaced the U-Net with a Diffusion Transformer (DiT), treating the latent as a sequence of patches and applying transformer blocks throughout:

  1. Patchify the latent $z_t \in \mathbb{R}^{h \times w \times c}$ into a sequence of tokens
  2. Condition each block via adaptive layer norm (adaLN-Zero): $\text{adaLN}(x, t, y) = \gamma(t, y) \cdot \text{LN}(x) + \beta(t, y)$, where $\gamma, \beta$ are predicted from the timestep $t$ and class label/text embedding $y$
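The adaLN modulation in step 2 reduces to a few lines. A toy NumPy sketch, where the projection `W` is a hypothetical stand-in for the model's learned conditioning head:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

n_tokens, d = 256, 384         # e.g. a 16x16 patchified latent, DiT hidden size
x = rng.standard_normal((n_tokens, d))
cond = rng.standard_normal(d)  # pooled embedding of (timestep t, condition y)

# A small head predicts per-channel scale/shift from the conditioning vector.
# (adaLN-Zero additionally zero-initializes a gating scale so every block
# starts out as the identity function.)
W = rng.standard_normal((d, 2 * d)) * 0.02
gamma, beta = np.split(cond @ W, 2)

out = gamma * layer_norm(x) + beta  # conditioning modulates the normalization
print(out.shape)
```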

DiT scales better than U-Net with compute: DiT-XL/2 achieves FID 2.27 on ImageNet 256×256, outperforming the best U-Net baselines. Stable Diffusion 3 and FLUX use transformer-based backbones.

Flow matching

An alternative to DDPM's noise schedule. Flow matching (Lipman et al., 2022; Esser et al., 2024 in SD3) defines a straight-line probability path from noise to data:

$$z_t = (1 - t) \cdot \varepsilon + t \cdot z_0, \quad \varepsilon \sim \mathcal{N}(0, \mathbf{I})$$

The velocity field $v_\theta$ predicts the direction to move:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, z_0, \varepsilon} \left[ \|v_\theta(z_t, t) - (z_0 - \varepsilon)\|^2 \right]$$

The optimal trajectory is a straight line from $\varepsilon$ to $z_0$ — no curved path needed. This allows fewer sampling steps (10–20 instead of 50+) because the learned flow is approximately linear. SD3 and FLUX are both trained with flow-matching (rectified-flow) objectives.
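A toy NumPy sketch of why straight paths enable few-step sampling: with an oracle velocity (which a trained $v_\theta$ only approximates), a single Euler step from pure noise recovers the data exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

z0 = rng.standard_normal((4, 8, 8))  # "data" latent
eps = rng.standard_normal(z0.shape)  # noise sample

def interpolate(t):
    """z_t = (1 - t) * eps + t * z0: straight-line path, t=0 noise, t=1 data."""
    return (1.0 - t) * eps + t * z0

# The regression target for v_theta is constant along the whole path
target_velocity = z0 - eps

# Euler integration with the true (constant) velocity: one step with dt = 1
# walks the entire straight line from noise to data.
z = interpolate(0.0)          # start at pure noise
z = z + 1.0 * target_velocity
print("one-step recovery error:", np.abs(z - z0).max())
```

A learned flow is only approximately linear, which is why practical samplers still take 10 to 20 steps rather than one.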

Walkthrough

Stable Diffusion inference

python
from diffusers import StableDiffusionPipeline
import torch
 
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
 
image = pipe(
    prompt="a photorealistic mountain lake at sunset, 4k",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("output.png")

CFG guidance scale sweep

python
import torch
from diffusers import StableDiffusionPipeline
 
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
 
prompt = "a golden retriever in a forest, oil painting"
 
for scale in [1.0, 3.0, 7.5, 12.0, 20.0]:
    image = pipe(
        prompt=prompt,
        guidance_scale=scale,
        num_inference_steps=50,
        # Re-seed per run so every scale starts from identical initial noise
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]
    ).images[0]
    image.save(f"cfg_{scale}.png")
    print(f"Saved cfg_{scale}.png")

DDIM inversion for image editing

python
from diffusers import DDIMScheduler, StableDiffusionPipeline
from PIL import Image
import torch
 
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
 
def encode_image(pipe, image):
    """Encode a PIL image into the scaled VAE latent space."""
    image = pipe.image_processor.preprocess(image).to("cuda", dtype=torch.float16)
    with torch.no_grad():
        # Use the distribution mean (not a sample) so the inversion is deterministic
        latent = pipe.vae.encode(image).latent_dist.mean * pipe.vae.config.scaling_factor
    return latent
 
# Invert real image to noise space (for editing)
image = Image.open("original.jpg").convert("RGB").resize((512, 512))
latent = encode_image(pipe, image)
 
# DDIM inversion: add noise step by step
# Then re-denoise with edited prompt to change content
# while preserving structure
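The inversion loop itself is elided above. Its core update is the deterministic DDIM map between timesteps, which is exactly invertible when the same noise estimate is reused. A toy NumPy sketch (scalar schedule, with an oracle noise value standing in for the U-Net prediction):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)  # cumulative alpha_bar schedule

z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)  # oracle noise (a real model predicts this)

def ddim_step(z_t, e, t_from, t_to):
    """Deterministic DDIM map between timesteps, given noise estimate e."""
    x0 = (z_t - np.sqrt(1 - abar[t_from]) * e) / np.sqrt(abar[t_from])
    return np.sqrt(abar[t_to]) * x0 + np.sqrt(1 - abar[t_to]) * e

z_t = np.sqrt(abar[500]) * z0 + np.sqrt(1 - abar[500]) * eps
z_inv = ddim_step(z_t, eps, 500, 700)     # inversion: step toward more noise
z_back = ddim_step(z_inv, eps, 700, 500)  # denoise back with the same estimate
print(np.abs(z_back - z_t).max())         # round trip is exact with fixed eps
```

In practice the model's noise predictions differ slightly between the forward and reverse passes, so real inversions are approximate; that error is what structure-preserving editing methods work to control.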

Latent diffusion training loop (minimal)

python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL
 
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")
vae.requires_grad_(False)
 
def training_step(unet, vae, text_encoder, batch, noise_scheduler):
    pixel_values, input_ids = batch["pixel_values"], batch["input_ids"]
 
    # Encode to latent
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
 
        text_embeds = text_encoder(input_ids)[0]
 
    # Sample noise and timestep
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
 
    # CFG dropout: with prob 0.1, replace the condition so the model also
    # learns an unconditional prediction (SD drops to the empty-prompt
    # embedding; zeroing is a common simplification)
    if torch.rand(1).item() < 0.1:
        text_embeds = torch.zeros_like(text_embeds)
 
    pred = unet(noisy_latents, t, encoder_hidden_states=text_embeds).sample
    return F.mse_loss(pred, noise)

Analysis & Evaluation

Where Your Intuition Breaks

Misconception: higher guidance scale always produces higher-quality, more prompt-faithful images. In reality, guidance scale controls a trade-off between prompt adherence and sample diversity: higher values increase adherence but also increase overexposure, oversaturation, and anatomical artifacts. Guidance scales above 10 typically produce visually degraded results — oversaturated colors, distorted anatomy, loss of fine texture. The optimal guidance scale for a given model and prompt type is empirically determined; w = 7.5 is a typical starting point for Stable Diffusion, but optimal values vary across model versions. The correct mental model is not "higher is better" but "higher means more adherent and less diverse, with quality degrading past a model-specific ceiling."

Architecture comparison

|                  | Pixel diffusion      | Latent diffusion          | Flow matching (SD3/Flux) |
|------------------|----------------------|---------------------------|--------------------------|
| Denoising space  | H×W×3                | h×w×c (8× smaller)        | Latent (same)            |
| Steps (quality)  | 1000 DDPM / 50 DDIM  | 50 DDIM / 20 DPM++        | 10–20                    |
| Memory (512px)   | High                 | Moderate                  | Moderate                 |
| Quality ceiling  | High                 | Higher (perc. loss VAE)   | Highest (SD3, FLUX)      |
| Editability      | DDIM inversion       | DDIM inversion            | Still maturing           |

Guidance scale tradeoffs

| Guidance scale | Effect                                 | Use case                |
|----------------|----------------------------------------|-------------------------|
| 1.0            | No guidance — pure conditional sample  | Maximum diversity       |
| 3–5            | Soft guidance                          | Creative exploration    |
| 7–8            | Standard (SD default)                  | Balanced quality        |
| 10–15          | High guidance                          | Strict prompt adherence |
| 20+            | Oversaturation, artifacts              | Rarely useful           |

Key design choices

VAE quality: the VAE determines the upper bound on reconstruction quality. The SD-VAE-FT-MSE checkpoint (fine-tuned with extra MSE weighting) produces cleaner, smoother reconstructions than the original SD VAE, especially for faces.

Text encoder: CLIP ViT-L/14 (SD 1.x), OpenCLIP ViT-H (SD 2.x), T5-XXL + CLIP (SD3 / FLUX). Larger text encoders understand more complex prompts and compositional descriptions.

Negative prompting: providing a negative prompt improves CFG by steering away from unwanted attributes ("blurry, low quality") rather than only toward positives. The guidance becomes:

$$\tilde{\varepsilon} = \varepsilon_\theta(z_t, y_{\text{neg}}) + w \cdot \big[ \varepsilon_\theta(z_t, y_{\text{pos}}) - \varepsilon_\theta(z_t, y_{\text{neg}}) \big]$$

Common failure modes:

  • Prompt following on complex scenes: "A above B to the left of C" — cross-attention lacks explicit spatial reasoning
  • Consistent multi-object generation: two different people with specified attributes often blend characteristics
  • Text rendering: SD 1.x/2.x cannot render legible text; SD3/FLUX substantially improved this via better text encoders and flow matching
