Video Generation
Generating a coherent video is far harder than generating a single image: each frame must be realistic, frames must be temporally consistent (no flickering, teleporting objects, or lighting discontinuities), and the generated motion must respect physical plausibility. Modern video generation models extend image diffusion to the temporal dimension — either by fine-tuning image diffusion models with temporal attention layers, or by training video DiTs end-to-end on compressed video latents. The core challenges are temporal consistency, computational cost (video has far more tokens than images), and conditioning (text, image, video continuation). Understanding these architectures is essential context for any work involving generative video, video editing, or multimodal generation pipelines.
Theory
Spatial attention captures appearance context within a frame. Temporal attention propagates information across time at the same spatial position. Full joint attention (used in Video Transformers) allows any patch to attend to any other across all frames and spatial locations.
Video generation extends image diffusion along the time axis. The core challenge is that video is enormous: a 4-second 512×512 clip at 24 fps contains roughly 75 million pixel values, making pixel-space diffusion infeasible. Modern video generation models solve this with temporal attention (add time-modeling layers to pretrained image diffusion models) and spatiotemporal compression (use a 3D VAE to compress video to a 192× smaller latent representation).
Extending image diffusion to video
The most practical approach to video generation: start from a pretrained image diffusion model and add temporal modeling.
Temporal attention insertion: add temporal attention layers (as in TimeSformer) into each U-Net or DiT block. For a feature map x of shape (B, T, H, W, C):
- Apply the existing spatial attention over the (H, W) positions independently per frame
- Apply the new temporal attention over the T frames for each spatial position
Training: freeze all pretrained spatial weights, train only the temporal layers on video data. This transfers image quality while teaching temporal coherence.
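The factorized spatial-then-temporal pattern above can be sketched in PyTorch. This is a minimal illustration, not any specific model's implementation: the class name and the use of `nn.MultiheadAttention` are my own choices; real models use the block's pretrained attention for the spatial path.

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Sketch: spatial attention per frame, then temporal attention per position."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)   # pretrained, frozen
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)  # new, trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) where N = H*W spatial tokens per frame
        B, T, N, C = x.shape
        # Spatial attention: fold time into the batch, attend over N tokens per frame
        xs = x.reshape(B * T, N, C)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: fold space into the batch, attend over T frames per position
        xt = xs.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 64, 128)  # 2 videos, 8 frames, 8x8 patch grid, dim 128
out = FactorizedSTAttention(128)(x)  # same shape as the input
```

Folding time into the batch for spatial attention is what lets pretrained image weights be reused unchanged; only the temporal path sees multiple frames.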
3D convolutions in U-Net: replace 2D conv layers with 3D or pseudo-3D (R(2+1)D) convolutions. Spatiotemporal downsampling in the encoder, upsampling in the decoder.
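The R(2+1)D factorization can be sketched as a spatial-only 3D conv followed by a temporal-only conv; the class name and kernel sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Sketch of an R(2+1)D-style factorized conv: spatial 2D then temporal 1D."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # (1, 3, 3): spatial-only kernel, can inherit pretrained 2D conv weights
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # (3, 1, 1): temporal-only kernel, newly trained on video data
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

x = torch.randn(1, 8, 16, 32, 32)
y = Pseudo3DConv(8, 16)(x)  # (1, 16, 16, 32, 32): channels change, T/H/W preserved
```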
Video latent representation
Pixel-space video diffusion is infeasible: a 512×512 video at 24 fps for 4 seconds has 96 frames × 786K values per frame (512 × 512 × 3) ≈ 75M dimensions.
This is not a resource constraint that can be solved with more GPUs — it is a fundamental architectural limit. Self-attention in the noise predictor scales quadratically with the number of spatial positions: a 512×512 frame has 262K spatial positions, which makes full attention computationally infeasible even for a single frame, before accounting for the temporal dimension. Spatial and spatiotemporal VAE compression are the only mechanisms that reduce this to a tractable size, which is why all production video generation models use latent diffusion rather than pixel-space diffusion.
Two compression approaches:
Spatial VAE: compress each frame independently with the same 2D VAE used for image diffusion. A 512×512 frame becomes 64×64×4. A 16-frame video: 16 × 64 × 64 × 4 = 262K dimensions — still large.
Spatiotemporal VAE (3D VAE): add temporal downsampling to the VAE encoder. A 3D VAE with temporal stride 4 and spatial stride 8 compresses a 512×512×16 clip to a 4×64×64×4 latent — a 192× reduction. Wan, Cosmos, and Sora-style models use 3D VAEs.
The diffusion model then operates on the compressed spatiotemporal latent z of shape (T′, H′, W′, C).
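The compression arithmetic is worth working through once. A quick sketch of the two ratios, assuming a 16-frame 512×512 RGB clip and a 4-channel latent:

```python
# Pixel-space size for a 16-frame 512x512 RGB clip
pixels = 16 * 512 * 512 * 3               # 12,582,912 values

# Spatial-only VAE: each frame -> 64x64x4 latent, frames kept separate
spatial_latent = 16 * 64 * 64 * 4         # 262,144 values

# 3D VAE: temporal stride 4, spatial stride 8 -> 4x64x64x4 latent
spatiotemporal_latent = 4 * 64 * 64 * 4   # 65,536 values

print(pixels // spatial_latent)           # 48x reduction
print(pixels // spatiotemporal_latent)    # 192x reduction
```

The extra 4× from temporal downsampling is what makes full 3D attention over the latent tractable in current video DiTs.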
Temporal consistency objective
Beyond the standard denoising loss, video models enforce temporal consistency through:
Flow-based regularization: optical flow between consecutive frames should be smooth. Some models add a flow prediction auxiliary loss.
Frame conditioning: condition on the first frame (for video continuation) or multiple anchor frames. The model learns to complete the video while matching the conditions.
Temporal attention with causal masking (for autoregressive generation): frame t can attend to frames ≤ t but not future frames. Enables streaming generation.
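The causal frame mask can be sketched directly; the helper name is mine, and the boolean convention follows PyTorch's `attn_mask`, where `True` marks positions that may not be attended to.

```python
import torch

def causal_frame_mask(T: int) -> torch.Tensor:
    """Mask where frame t may attend only to frames <= t.
    True = disallowed, matching nn.MultiheadAttention's attn_mask convention."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

mask = causal_frame_mask(4)
# Row t is frame t's view: frame 0 sees only itself, frame 3 sees frames 0..3
```

Passing this mask to the temporal attention layers (and caching past-frame keys/values) is what allows frames to be emitted one at a time during streaming generation.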
Video DiT architecture
Following the success of DiT for images, video DiTs process the full spatiotemporal latent as a sequence of 3D patches:
- 3D patchify: divide the latent into patches of size (p_t, p_h, p_w), project each patch to embedding dimension d
- 3D RoPE: rotary position embeddings extended to 3D coordinates, enabling generalization to different aspect ratios and frame counts
- Full 3D self-attention: every spatiotemporal token attends to every other (computationally expensive but necessary for global coherence)
- Text conditioning via cross-attention or AdaLN
For Wan 2.1 (14B params): 3D VAE with 4× temporal and 8× spatial compression, then DiT with 3D RoPE and full attention over the latent sequence. Generates 480p or 720p video at 16 fps.
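The 3D patchify step can be sketched as follows. This is a minimal illustration with assumed patch sizes (1, 2, 2) and an assumed embedding dimension; a real DiT keeps the projection as a learned layer rather than creating it per call.

```python
import torch
import torch.nn as nn

def patchify_3d(z: torch.Tensor, pt: int = 1, ph: int = 2, pw: int = 2,
                dim: int = 64) -> torch.Tensor:
    """Split a latent (B, C, T, H, W) into 3D patches and project each to `dim`."""
    B, C, T, H, W = z.shape
    proj = nn.Linear(C * pt * ph * pw, dim)  # in a real DiT this is a learned module
    # Cut into (pt, ph, pw) blocks, then flatten each block into one token
    z = z.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    z = z.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(B, -1, C * pt * ph * pw)
    return proj(z)  # (B, num_tokens, dim)

z = torch.randn(1, 4, 4, 64, 64)   # e.g. a 3D VAE latent
tokens = patchify_3d(z)            # 4 * 32 * 32 = 4096 tokens of dim 64
```

Even after 3D VAE compression the token count grows quickly with duration and resolution, which is why full 3D self-attention dominates the compute budget.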
Noise schedules for video
Standard diffusion adds the same noise level to all frames simultaneously. This works but doesn't reflect how video information is structured.
Independent noise per frame: each frame gets independently sampled noise. Forces the model to generate each frame coherently from scratch given context. Used in SVD (Stable Video Diffusion).
Temporal noise correlation: correlated noise across frames — adjacent frames share noise — initializes the model closer to natural video statistics. Used in some video consistency models.
Flow matching for video: rectified flow with straight-line trajectories reduces required denoising steps from 50 to 4–8, making video generation practical for real-time or near-real-time applications.
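Temporal noise correlation, mentioned above, can be sketched as a mix of one shared base noise and independent per-frame noise; the function name and the mixing parameter `alpha` are illustrative assumptions.

```python
import torch

def correlated_video_noise(T: int, shape: tuple, alpha: float = 0.5) -> torch.Tensor:
    """Noise with inter-frame correlation alpha**2; each frame stays unit-variance."""
    shared = torch.randn(1, *shape).expand(T, *shape)  # identical across frames
    per_frame = torch.randn(T, *shape)                 # independent per frame
    # alpha^2 + (1 - alpha^2) = 1, so variance per frame is preserved
    return alpha * shared + (1 - alpha**2) ** 0.5 * per_frame

noise = correlated_video_noise(16, (4, 64, 64), alpha=0.7)
```

With `alpha = 0` this reduces to the independent-per-frame scheme; with `alpha = 1` all frames start from the same noise, which biases sampling toward static video.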
Conditioning modalities
| Conditioning type | Description | Use case |
|---|---|---|
| Text | CLIP or T5 text embedding | Text-to-video generation |
| Image (first frame) | Encode first frame, condition all frames | Image-to-video animation |
| Video continuation | Condition on N past frames | Extension, prediction |
| Camera pose | Camera trajectory (rotation, translation) | Camera-controlled generation |
| Depth / flow | Dense geometric conditioning | Controlled motion |
Image-to-video conditioning: encode the first frame with the VAE, concatenate its latent with the noisy video latent channel-wise, and let the model learn to animate from it. SVD (Stability AI) uses this architecture.
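The channel-wise concatenation can be sketched with stand-in tensors (no VAE is actually run here; the latent shapes are assumed):

```python
import torch

# Assumed latent layout (B, C, T, H, W) with C = 4 latent channels
B, C, T, H, W = 1, 4, 16, 64, 64
noisy_video = torch.randn(B, C, T, H, W)

# Stand-in for the VAE-encoded first frame, repeated across all T positions
first_frame_latent = torch.randn(B, C, 1, H, W).expand(B, C, T, H, W)

# Channel-wise concat: the denoiser's input conv sees 2*C channels
model_input = torch.cat([noisy_video, first_frame_latent], dim=1)
```

The only architectural change this requires is widening the first conv layer from C to 2·C input channels; everything downstream is unchanged.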
Walkthrough
Stable Video Diffusion inference
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt",
torch_dtype=torch.float16,
variant="fp16",
)
pipe.enable_model_cpu_offload() # saves VRAM
# Animate from a single image
image = load_image("starting_frame.png").resize((1024, 576))
frames = pipe(
image,
num_frames=25, # 25 frames at ~6 fps ≈ 4 seconds
num_inference_steps=25,
decode_chunk_size=8, # decode 8 frames at a time to save memory
motion_bucket_id=127, # 0-255: controls motion intensity
noise_aug_strength=0.02,
generator=torch.manual_seed(42),
).frames[0]
export_to_video(frames, "output.mp4", fps=7)
Text-to-video with CogVideoX
from diffusers import CogVideoXPipeline
import torch
pipe = CogVideoXPipeline.from_pretrained(
"THUDM/CogVideoX-5b",
torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing() # process video in slices to save VRAM
pipe.vae.enable_tiling() # tile spatial dimensions
prompt = (
"A time-lapse of a sunflower field from dawn to dusk. "
"Golden hour lighting, gentle breeze, bees visiting flowers. "
"Cinematic, 4K."
)
video = pipe(
prompt=prompt,
num_videos_per_prompt=1,
num_inference_steps=50,
num_frames=49,
guidance_scale=6,
generator=torch.Generator("cpu").manual_seed(42),
).frames[0]
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)
Temporal consistency evaluation
import numpy as np
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

def temporal_consistency(frames: torch.Tensor) -> float:
    """
    Measure temporal consistency via CLIP frame similarity.
    frames: (T, C, H, W) in [0, 1]
    Returns: mean cosine similarity between consecutive frames.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    T = frames.shape[0]
    # Convert (C, H, W) float tensor to (H, W, 3) uint8 for the processor
    to_uint8 = lambda t: (t.permute(1, 2, 0).numpy() * 255).astype("uint8")
    sims = []
    for i in range(T - 1):
        imgs = [to_uint8(frames[i]), to_uint8(frames[i + 1])]
        inputs = processor(images=imgs, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = F.normalize(feats, dim=-1)
        sims.append((feats[0] @ feats[1]).item())
    return float(np.mean(sims))
Analysis & Evaluation
Where Your Intuition Breaks
More denoising steps always produce better video quality. For image generation, more steps generally improve quality up to a point. For video, the relationship is more nuanced: temporal consistency depends on the sampler and the temporal attention structure, not just step count. Flow matching samplers (used in Wan 2.1, CogVideoX) achieve high temporal coherence at 20–50 steps — adding more steps produces diminishing returns and can introduce temporal flickering if the noise schedule is not well-tuned for video. The quality ceiling is set by the model architecture and training data, not the step count.
Video generation model comparison (2024–2025)
| Model | Architecture | Resolution | Duration | Open |
|---|---|---|---|---|
| SVD (Stability) | U-Net + temporal attn | 1024×576 | 4s | Yes |
| CogVideoX-5B | 3D DiT | 720×480 | 6s | Yes |
| Wan 2.1 | 3D DiT | 720p | 5–10s | Yes |
| HunyuanVideo | DiT | 720p | 5s | Yes |
| Sora (OpenAI) | DiT (est.) | 1080p | 60s | No |
| Veo 2 (Google) | Unknown | 4K | 60s+ | No |
Autoregressive vs. diffusion for video
| Diffusion-based | Autoregressive | |
|---|---|---|
| Temporal coherence | Strong (denoise all frames jointly) | Weaker (frame-by-frame error accumulation) |
| Streaming generation | Hard (requires all frames) | Natural |
| Max video length | Fixed by context window | Unlimited (with KV cache) |
| Text alignment | Good (CFG) | Very good (RLHF from image models) |
| State of art | Yes (Sora, Wan, HunyuanVideo) | Emerging (VideoPoet) |
Key evaluation metrics
FVD (Fréchet Video Distance): extends FID to video using I3D features. Captures both visual quality and temporal realism. Lower is better.
CLIP-SIM: average CLIP cosine similarity between video frames and the text prompt. Measures text-video alignment.
Warping error: use optical flow to warp frame t toward frame t+1, then measure pixel error against the actual frame t+1. Captures motion consistency.
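Given a precomputed flow field (e.g. from RAFT), the warping error can be sketched with backward warping via `grid_sample`; the function name and the flow layout (channel 0 = x-displacement in pixels) are assumptions.

```python
import torch
import torch.nn.functional as F

def warping_error(frame_t: torch.Tensor, frame_t1: torch.Tensor,
                  flow: torch.Tensor) -> float:
    """Backward-warp frame_t1 onto frame_t using flow t -> t+1, return L1 error.
    frame_*: (1, C, H, W); flow: (1, 2, H, W) in pixels, channel 0 = x, 1 = y."""
    _, _, H, W = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()       # (H, W, 2), pixel coords
    tgt = grid + flow[0].permute(1, 2, 0)              # where each pixel moved to
    # Normalize coordinates to [-1, 1] for grid_sample
    tgt[..., 0] = 2 * tgt[..., 0] / (W - 1) - 1
    tgt[..., 1] = 2 * tgt[..., 1] / (H - 1) - 1
    warped = F.grid_sample(frame_t1, tgt.unsqueeze(0), align_corners=True)
    return F.l1_loss(warped, frame_t).item()
```

With zero flow and identical frames the error is zero; a high warping error on smooth real-world motion usually indicates flickering or object drift.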
Human evaluation: still the gold standard — automated metrics poorly capture physical plausibility, scene coherence, and aesthetic quality.
Failure modes
Temporal flickering: per-frame noise in appearance without consistent lighting or texture. Fixed by stronger temporal attention and higher temporal noise correlation.
Object drift: objects gradually move or deform unintentionally across frames. Harder to fix — requires stronger spatial consistency constraints or test-time guidance.
Physics violations: objects pass through each other, float, or exhibit non-physical dynamics. Current models learn physics implicitly from data; small models fail more often.
Motion blur artifacts: fast motion produces blurry intermediate frames rather than sharp motion blur. Training data with natural motion blur helps.