Video Generation
Generating a coherent video is far harder than generating a single image: each frame must be realistic, frames must be temporally consistent (no flickering, teleporting objects, or lighting discontinuities), and the generated motion must respect physical plausibility. Modern video generation models extend image diffusion to the temporal dimension — either by fine-tuning image diffusion models with temporal attention layers, or by training video DiTs end-to-end on compressed video latents. The core challenges are temporal consistency, computational cost (video has far more tokens than images), and conditioning (text, image, video continuation). Understanding these architectures is essential context for any work involving generative video, video editing, or multimodal generation pipelines.
Theory
Spatial attention captures appearance context within a frame. Temporal attention propagates information across time at the same spatial position. Full joint attention (used in Video Transformers) allows any patch to attend to any other across all frames and spatial locations.
Video generation extends image diffusion along the time axis. The core challenge is that video is enormous: a 4-second 512×512 clip at 24 fps contains roughly 75 million pixel values, making pixel-space diffusion infeasible. Modern video generation models solve this with temporal attention (add time-modeling layers to pretrained image diffusion models) and spatiotemporal compression (use a 3D VAE to compress video to a 192× smaller latent representation).
Extending image diffusion to video
The most practical approach to video generation: start from a pretrained image diffusion model and add temporal modeling.
Temporal attention insertion: add temporal attention layers (as in TimeSformer) into each U-Net or DiT block. For a feature map x of shape (B, T, H, W, C):
- Apply the existing spatial attention over the (H, W) positions independently per frame
- Apply the new temporal attention over the T frames for each spatial position
Training: freeze all pretrained spatial weights, train only the temporal layers on video data. This transfers image quality while teaching temporal coherence.
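The factorized spatial-then-temporal pattern above can be sketched in PyTorch. This is a minimal illustration, not any specific model's implementation: the class name and the use of `nn.MultiheadAttention` are my own choices; real models use the block's pretrained attention for the spatial path.

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Sketch: spatial attention per frame, then temporal attention per position."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)   # pretrained, frozen
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)  # new, trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) where N = H*W spatial tokens per frame
        B, T, N, C = x.shape
        # Spatial attention: fold time into the batch, attend over N tokens per frame
        xs = x.reshape(B * T, N, C)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: fold space into the batch, attend over T frames per position
        xt = xs.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 64, 128)  # 2 videos, 8 frames, 8x8 patch grid, dim 128
out = FactorizedSTAttention(128)(x)  # same shape as the input
```

Folding time into the batch for spatial attention is what lets pretrained image weights be reused unchanged; only the temporal path sees multiple frames.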
3D convolutions in U-Net: replace 2D conv layers with 3D or pseudo-3D (R(2+1)D) convolutions. Spatiotemporal downsampling in the encoder, upsampling in the decoder.
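The R(2+1)D factorization can be sketched as a spatial-only 3D conv followed by a temporal-only conv; the class name and kernel sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Sketch of an R(2+1)D-style factorized conv: spatial 2D then temporal 1D."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # (1, 3, 3): spatial-only kernel, can inherit pretrained 2D conv weights
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # (3, 1, 1): temporal-only kernel, newly trained on video data
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

x = torch.randn(1, 8, 16, 32, 32)
y = Pseudo3DConv(8, 16)(x)  # (1, 16, 16, 32, 32): channels change, T/H/W preserved
```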
Video latent representation
Pixel-space video diffusion is infeasible: a 512×512 video at 24 fps for 4 seconds has 96 frames × 786K values per frame (512 × 512 × 3) ≈ 75M dimensions.
This is not a resource constraint that can be solved with more GPUs — it is a fundamental architectural limit. Self-attention in the noise predictor scales quadratically with the number of spatial positions: a 512×512 frame has 262K spatial positions, which makes full attention computationally infeasible even for a single frame, before accounting for the temporal dimension. Spatial and spatiotemporal VAE compression are the only mechanisms that reduce this to a tractable size, which is why all production video generation models use latent diffusion rather than pixel-space diffusion.
Two compression approaches:
Spatial VAE: compress each frame independently with the same 2D VAE used for image diffusion. A 512×512 frame becomes 64×64×4. A 16-frame video: 16 × 64 × 64 × 4 = 262K dimensions — still large.
Spatiotemporal VAE (3D VAE): add temporal downsampling to the VAE encoder. A 3D VAE with temporal stride 4 and spatial stride 8 compresses a 512×512×16 clip to a 4×64×64×4 latent — a 192× reduction. Wan, Cosmos, and Sora-style models use 3D VAEs.
The diffusion model then operates on the compressed spatiotemporal latent z of shape (T′, H′, W′, C).
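The compression arithmetic is worth working through once. A quick sketch of the two ratios, assuming a 16-frame 512×512 RGB clip and a 4-channel latent:

```python
# Pixel-space size for a 16-frame 512x512 RGB clip
pixels = 16 * 512 * 512 * 3               # 12,582,912 values

# Spatial-only VAE: each frame -> 64x64x4 latent, frames kept separate
spatial_latent = 16 * 64 * 64 * 4         # 262,144 values

# 3D VAE: temporal stride 4, spatial stride 8 -> 4x64x64x4 latent
spatiotemporal_latent = 4 * 64 * 64 * 4   # 65,536 values

print(pixels // spatial_latent)           # 48x reduction
print(pixels // spatiotemporal_latent)    # 192x reduction
```

The extra 4× from temporal downsampling is what makes full 3D attention over the latent tractable in current video DiTs.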
Temporal consistency objective
Beyond the standard denoising loss, video models enforce temporal consistency through:
Flow-based regularization: optical flow between consecutive frames should be smooth. Some models add a flow prediction auxiliary loss.
Frame conditioning: condition on the first frame (for video continuation) or multiple anchor frames. The model learns to complete the video while matching the conditions.
Temporal attention with causal masking (for autoregressive generation): frame t can attend to frames ≤ t but not future frames. Enables streaming generation.
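The causal frame mask can be sketched directly; the helper name is mine, and the boolean convention follows PyTorch's `attn_mask`, where `True` marks positions that may not be attended to.

```python
import torch

def causal_frame_mask(T: int) -> torch.Tensor:
    """Mask where frame t may attend only to frames <= t.
    True = disallowed, matching nn.MultiheadAttention's attn_mask convention."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

mask = causal_frame_mask(4)
# Row t is frame t's view: frame 0 sees only itself, frame 3 sees frames 0..3
```

Passing this mask to the temporal attention layers (and caching past-frame keys/values) is what allows frames to be emitted one at a time during streaming generation.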
Video DiT architecture
Following the success of DiT for images, video DiTs process the full spatiotemporal latent as a sequence of 3D patches:
- 3D patchify: divide the latent into patches of size (p_t, p_h, p_w), project each patch to embedding dimension d
- 3D RoPE: rotary position embeddings extended to 3D coordinates, enabling generalization to different aspect ratios and frame counts
- Full 3D self-attention: every spatiotemporal token attends to every other (computationally expensive but necessary for global coherence)
- Text conditioning via cross-attention or AdaLN
For Wan 2.1 (14B params): 3D VAE with 4× temporal and 8× spatial compression, then DiT with 3D RoPE and full attention over the latent sequence. Generates 480p or 720p video at 16 fps.
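The 3D patchify step can be sketched as follows. This is a minimal illustration with assumed patch sizes (1, 2, 2) and an assumed embedding dimension; a real DiT keeps the projection as a learned layer rather than creating it per call.

```python
import torch
import torch.nn as nn

def patchify_3d(z: torch.Tensor, pt: int = 1, ph: int = 2, pw: int = 2,
                dim: int = 64) -> torch.Tensor:
    """Split a latent (B, C, T, H, W) into 3D patches and project each to `dim`."""
    B, C, T, H, W = z.shape
    proj = nn.Linear(C * pt * ph * pw, dim)  # in a real DiT this is a learned module
    # Cut into (pt, ph, pw) blocks, then flatten each block into one token
    z = z.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    z = z.permute(0, 2, 4, 6, 1, 3, 5, 7).reshape(B, -1, C * pt * ph * pw)
    return proj(z)  # (B, num_tokens, dim)

z = torch.randn(1, 4, 4, 64, 64)   # e.g. a 3D VAE latent
tokens = patchify_3d(z)            # 4 * 32 * 32 = 4096 tokens of dim 64
```

Even after 3D VAE compression the token count grows quickly with duration and resolution, which is why full 3D self-attention dominates the compute budget.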
Noise schedules for video
Standard diffusion adds the same noise level to all frames simultaneously. This works but doesn't reflect how video information is structured.
Independent noise per frame: each frame gets independently sampled noise. Forces the model to generate each frame coherently from scratch given context. Used in SVD (Stable Video Diffusion).
Temporal noise correlation: correlated noise across frames — adjacent frames share noise — initializes the model closer to natural video statistics. Used in some video consistency models.
Flow matching for video: rectified flow with straight-line trajectories reduces required denoising steps from 50 to 4–8, making video generation practical for real-time or near-real-time applications.
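Temporal noise correlation, mentioned above, can be sketched as a mix of one shared base noise and independent per-frame noise; the function name and the mixing parameter `alpha` are illustrative assumptions.

```python
import torch

def correlated_video_noise(T: int, shape: tuple, alpha: float = 0.5) -> torch.Tensor:
    """Noise with inter-frame correlation alpha**2; each frame stays unit-variance."""
    shared = torch.randn(1, *shape).expand(T, *shape)  # identical across frames
    per_frame = torch.randn(T, *shape)                 # independent per frame
    # alpha^2 + (1 - alpha^2) = 1, so variance per frame is preserved
    return alpha * shared + (1 - alpha**2) ** 0.5 * per_frame

noise = correlated_video_noise(16, (4, 64, 64), alpha=0.7)
```

With `alpha = 0` this reduces to the independent-per-frame scheme; with `alpha = 1` all frames start from the same noise, which biases sampling toward static video.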
Conditioning modalities
| Conditioning type | Description | Use case |
|---|---|---|
| Text | CLIP or T5 text embedding | Text-to-video generation |
| Image (first frame) | Encode first frame, condition all frames | Image-to-video animation |
| Video continuation | Condition on N past frames | Extension, prediction |
| Camera pose | Camera trajectory (rotation, translation) | Camera-controlled generation |
| Depth / flow | Dense geometric conditioning | Controlled motion |
Image-to-video conditioning: encode the first frame with the VAE, concatenate its latent with the noisy video latent channel-wise, and let the model learn to animate from it. SVD (Stability AI) uses this architecture.
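The channel-wise concatenation can be sketched with stand-in tensors (no VAE is actually run here; the latent shapes are assumed):

```python
import torch

# Assumed latent layout (B, C, T, H, W) with C = 4 latent channels
B, C, T, H, W = 1, 4, 16, 64, 64
noisy_video = torch.randn(B, C, T, H, W)

# Stand-in for the VAE-encoded first frame, repeated across all T positions
first_frame_latent = torch.randn(B, C, 1, H, W).expand(B, C, T, H, W)

# Channel-wise concat: the denoiser's input conv sees 2*C channels
model_input = torch.cat([noisy_video, first_frame_latent], dim=1)
```

The only architectural change this requires is widening the first conv layer from C to 2·C input channels; everything downstream is unchanged.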
Walkthrough
Stable Video Diffusion inference
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt",
torch_dtype=torch.float16,
variant="fp16",
)
pipe.enable_model_cpu_offload() # saves VRAM
# Animate from a single image
image = load_image("starting_frame.png").resize((1024, 576))
frames = pipe(
image,
num_frames=25, # 25 frames at ~6 fps ≈ 4 seconds
num_inference_steps=25,
decode_chunk_size=8, # decode 8 frames at a time to save memory
motion_bucket_id=127, # 0-255: controls motion intensity
noise_aug_strength=0.02,
generator=torch.manual_seed(42),
).frames[0]
export_to_video(frames, "output.mp4", fps=7)
Text-to-video with CogVideoX
from diffusers import CogVideoXPipeline
import torch
pipe = CogVideoXPipeline.from_pretrained(
"THUDM/CogVideoX-5b",
torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing() # process video in slices to save VRAM
pipe.vae.enable_tiling() # tile spatial dimensions
prompt = (
"A time-lapse of a sunflower field from dawn to dusk. "
"Golden hour lighting, gentle breeze, bees visiting flowers. "
"Cinematic, 4K."
)
video = pipe(
prompt=prompt,
num_videos_per_prompt=1,
num_inference_steps=50,
num_frames=49,
guidance_scale=6,
generator=torch.Generator("cpu").manual_seed(42),
).frames[0]
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)
Temporal consistency evaluation
import numpy as np
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

def temporal_consistency(frames: torch.Tensor) -> float:
    """
    Measure temporal consistency via CLIP frame similarity.
    frames: (T, C, H, W) in [0, 1]
    Returns: mean cosine similarity between consecutive frames.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    T = frames.shape[0]
    # Convert (C, H, W) float tensor to (H, W, 3) uint8 for the processor
    to_uint8 = lambda t: (t.permute(1, 2, 0).numpy() * 255).astype("uint8")
    sims = []
    for i in range(T - 1):
        imgs = [to_uint8(frames[i]), to_uint8(frames[i + 1])]
        inputs = processor(images=imgs, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = F.normalize(feats, dim=-1)
        sims.append((feats[0] @ feats[1]).item())
    return float(np.mean(sims))
Analysis & Evaluation
Where Your Intuition Breaks
More denoising steps always produce better video quality. For image generation, more steps generally improve quality up to a point. For video, the relationship is more nuanced: temporal consistency depends on the sampler and the temporal attention structure, not just step count. Flow matching samplers (used in Wan 2.1, CogVideoX) achieve high temporal coherence at 20–50 steps — adding more steps produces diminishing returns and can introduce temporal flickering if the noise schedule is not well-tuned for video. The quality ceiling is set by the model architecture and training data, not the step count.
Video generation model comparison (2024–2025)
| Model | Architecture | Resolution | Duration | Open |
|---|---|---|---|---|
| SVD (Stability) | U-Net + temporal attn | 1024×576 | 4s | Yes |
| CogVideoX-5B | 3D DiT | 720×480 | 6s | Yes |
| Wan 2.1 | 3D DiT | 720p | 5–10s | Yes |
| HunyuanVideo | DiT | 720p | 5s | Yes |
| Sora (OpenAI) | DiT (est.) | 1080p | 60s | No |
| Veo 2 (Google) | Unknown | 4K | 60s+ | No |
Autoregressive vs. diffusion for video
| Diffusion-based | Autoregressive | |
|---|---|---|
| Temporal coherence | Strong (denoise all frames jointly) | Weaker (frame-by-frame error accumulation) |
| Streaming generation | Hard (requires all frames) | Natural |
| Max video length | Fixed by context window | Unlimited (with KV cache) |
| Text alignment | Good (CFG) | Very good (RLHF from image models) |
| State of art | Yes (Sora, Wan, HunyuanVideo) | Emerging (VideoPoet) |
Key evaluation metrics
FVD (Fréchet Video Distance): extends FID to video using I3D features. Captures both visual quality and temporal realism. Lower is better.
CLIP-SIM: average CLIP cosine similarity between video frames and the text prompt. Measures text-video alignment.
Warping error: use optical flow to warp frame t toward frame t+1, then measure pixel error against the actual frame t+1. Captures motion consistency.
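Given a precomputed flow field (e.g. from RAFT), the warping error can be sketched with backward warping via `grid_sample`; the function name and the flow layout (channel 0 = x-displacement in pixels) are assumptions.

```python
import torch
import torch.nn.functional as F

def warping_error(frame_t: torch.Tensor, frame_t1: torch.Tensor,
                  flow: torch.Tensor) -> float:
    """Backward-warp frame_t1 onto frame_t using flow t -> t+1, return L1 error.
    frame_*: (1, C, H, W); flow: (1, 2, H, W) in pixels, channel 0 = x, 1 = y."""
    _, _, H, W = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()       # (H, W, 2), pixel coords
    tgt = grid + flow[0].permute(1, 2, 0)              # where each pixel moved to
    # Normalize coordinates to [-1, 1] for grid_sample
    tgt[..., 0] = 2 * tgt[..., 0] / (W - 1) - 1
    tgt[..., 1] = 2 * tgt[..., 1] / (H - 1) - 1
    warped = F.grid_sample(frame_t1, tgt.unsqueeze(0), align_corners=True)
    return F.l1_loss(warped, frame_t).item()
```

With zero flow and identical frames the error is zero; a high warping error on smooth real-world motion usually indicates flickering or object drift.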
Human evaluation: still the gold standard — automated metrics poorly capture physical plausibility, scene coherence, and aesthetic quality.
Failure modes
Temporal flickering: per-frame noise in appearance without consistent lighting or texture. Fixed by stronger temporal attention and higher temporal noise correlation.
Object drift: objects gradually move or deform unintentionally across frames. Harder to fix — requires stronger spatial consistency constraints or test-time guidance.
Physics violations: objects pass through each other, float, or exhibit non-physical dynamics. Current models learn physics implicitly from data; small models fail more often.
Motion blur artifacts: fast motion produces blurry intermediate frames rather than sharp motion blur. Training data with natural motion blur helps.