Video Understanding
Images are static snapshots; video adds time. Understanding video requires learning what changes between frames and what stays the same — motion, causality, temporal order. Early approaches stacked 2D CNNs on individual frames, ignoring temporal structure entirely. 3D convolutions and two-stream networks introduced motion modeling, but required careful architectural choices and were expensive. Transformer-based video models (TimeSformer, Video Swin, VideoMAE) remove the architectural constraints and instead learn to attend across both space and time — scaling to hundreds of frames and achieving state-of-the-art results across action recognition, temporal grounding, and dense video understanding.
Theory
2D conv processes each frame independently — no temporal relationships captured. 3D conv extends the kernel across T frames, learning motion patterns like optical flow implicitly.
Video understanding adds a time axis to image understanding. A static image model sees a cup and a hand separately; a video model can recognize "picking up the cup" because it tracks how objects and their relationships change across frames. The architectures below progress from 3D convolutions (local spatiotemporal neighborhoods) to transformers (global temporal attention), trading computational cost for longer-range temporal reasoning.
From images to video
A video clip is a 4D tensor $x \in \mathbb{R}^{T \times H \times W \times C}$, where $T$ is the number of frames. Naive processing applies image models frame-independently, ignoring temporal structure — fine for static frames but blind to motion.
3D convolution (C3D, Tran et al., 2015) extends 2D kernels to include a temporal dimension:

$$y(t, i, j) = \sum_{\tau=0}^{k_t - 1} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} w(\tau, m, n)\, x(t + \tau,\, i + m,\, j + n)$$
The 3D convolution sum directly extends 2D spatial convolution by adding a temporal kernel dimension — this is the minimal modification that allows a convolutional network to detect motion patterns (an edge moving across frames) rather than just static textures. The temporal kernel size bounds the temporal receptive field: a kernel of size 3 can detect motion over 3 consecutive frames but not the long-range dependency between the first and last frame of a 10-second clip. This is why 3D CNNs were superseded for long-form video understanding: the local inductive bias that makes CNNs efficient for images becomes a liability when temporal dependencies span many frames.
A spatiotemporal kernel of size $k_t \times k \times k$ processes space and time jointly. This allows the network to detect motion patterns (edges moving across frames), but parameter count and compute now scale with the full kernel volume: cubically in $k$ for a $k \times k \times k$ kernel.
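The same scaling can be seen directly in PyTorch's `nn.Conv3d` (channel sizes and clip shape here are illustrative, not from a specific model):

```python
import torch
import torch.nn as nn

# A 3x3x3 spatiotemporal kernel over a 16-frame RGB clip.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)
clip = torch.randn(1, 3, 16, 112, 112)  # (B, C, T, H, W)
out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112])

# Parameters scale with the product of all three kernel dims:
params = sum(p.numel() for p in conv3d.parameters())
print(params)  # 3 * 64 * 3*3*3 weights + 64 biases = 5248
```

Going from a 3×3 2D kernel to a 3×3×3 3D kernel triples the weights per filter; larger temporal extents grow the cost linearly in $k_t$ on top of the spatial $k^2$.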
R(2+1)D factorized convolution
Instead of a full $t \times d \times d$ kernel, R(2+1)D (Tran et al., 2018) factors it into a $1 \times d \times d$ 2D spatial convolution followed by a $t \times 1 \times 1$ 1D temporal convolution.
This doubles the number of non-linearities (each stage has its own ReLU) and separates spatial and temporal learning; with the intermediate channel width $M$ chosen appropriately, the factorized pair matches the 3D kernel's parameter budget. R(2+1)D outperformed C3D on Kinetics-400 at comparable cost.
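The parameter bookkeeping can be checked with the mid-channel rule from the R(2+1)D paper, which picks $M$ so the factorized pair matches the full 3D kernel (channel counts here are illustrative):

```python
# A full t x d x d kernel mapping n_in -> n_out channels is replaced by a
# 1 x d x d spatial conv into M mid-channels plus a t x 1 x 1 temporal conv.
n_in, n_out, t, d = 64, 64, 3, 3

full_3d = n_in * n_out * t * d * d  # weights in the full 3D kernel

# M chosen so the factorized pair matches the 3D parameter budget:
m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
factorized = (d * d * n_in * m) + (t * m * n_out)
print(m, full_3d, factorized)  # 144 110592 110592
```

Same parameter count, but two ReLUs instead of one and a cleaner separation of appearance and motion learning.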
Optical flow and two-stream networks
Optical flow models the apparent motion field between consecutive frames. The brightness constancy constraint:

$$I(x, y, t) = I(x + \Delta x,\, y + \Delta y,\, t + \Delta t)$$

gives, after a first-order Taylor expansion, the optical flow constraint equation:

$$I_x u + I_y v + I_t = 0$$

where $(u, v)$ are the flow velocities and $I_x, I_y, I_t$ are spatial and temporal image gradients.
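The constraint is one equation in two unknowns per pixel. A classical way to resolve this is Lucas–Kanade: assume constant flow over a small window and solve the stacked constraints by least squares. A synthetic sketch (gradients are random placeholders constructed to be consistent with a known flow):

```python
import numpy as np

# Lucas-Kanade: assume constant flow (u, v) in a small window and solve
# the stacked constraint I_x u + I_y v = -I_t by least squares.
rng = np.random.default_rng(0)
u_true, v_true = 1.5, -0.5

Ix = rng.normal(size=25)           # spatial gradients in a 5x5 window
Iy = rng.normal(size=25)
It = -(Ix * u_true + Iy * v_true)  # temporal gradient consistent with the flow

A = np.stack([Ix, Iy], axis=1)     # (25, 2) design matrix
b = -It
(u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
print(round(float(u), 3), round(float(v), 3))  # 1.5 -0.5
```

Modern estimators like RAFT (used in the walkthrough below) learn this matching end-to-end instead of relying on the linearized constraint.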
Two-stream networks (Simonyan and Zisserman, 2014) train two separate CNNs:
- Spatial stream: processes RGB frames → scene/object understanding
- Temporal stream: processes stacked optical flow fields → motion understanding
Late fusion (averaging or SVM) combines both streams. Two-stream networks dominated action recognition benchmarks until transformer-based models surpassed them by learning motion implicitly.
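Late fusion by averaging can be sketched in a few lines (logits are random placeholders standing in for the two trained CNNs; 101 classes follows UCF-101):

```python
import torch
import torch.nn.functional as F

# Late fusion: average class probabilities from the RGB (spatial) and
# optical-flow (temporal) streams.
num_classes = 101
spatial_logits = torch.randn(1, num_classes)   # from the RGB CNN
temporal_logits = torch.randn(1, num_classes)  # from the flow CNN

fused = 0.5 * F.softmax(spatial_logits, dim=-1) + 0.5 * F.softmax(temporal_logits, dim=-1)
pred = fused.argmax(dim=-1)
print(pred.shape)  # torch.Size([1])
```

The streams are trained independently; only their predictions interact, which is what made the approach simple but also limited compared to models that fuse features earlier.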
Divided space-time attention (TimeSformer)
TimeSformer (Bertasius et al., 2021) extends ViT to video by applying factorized attention — separate spatial and temporal attention blocks instead of joint 3D self-attention:
Temporal attention (across frames, same spatial position):

$$\alpha^{\text{time}}_{(p,t)} = \mathrm{softmax}_{t'}\!\left(\frac{q_{(p,t)} \cdot k_{(p,t')}}{\sqrt{d}}\right)$$

Each patch $p$ at frame $t$ attends to its counterpart at the same spatial position across all frames $t'$.

Spatial attention (within each frame, all patches):

$$\alpha^{\text{space}}_{(p,t)} = \mathrm{softmax}_{p'}\!\left(\frac{q_{(p,t)} \cdot k_{(p',t)}}{\sqrt{d}}\right)$$

Each frame applies standard ViT attention among its patches $p'$.
Complexity comparison:
- Joint 3D attention: $O(T^2 N^2)$ pairs
- Divided space-time: $O(TN(T + N))$ pairs

For $T = 8$ frames and $N = 196$ patches: joint $= (8 \cdot 196)^2 \approx 2.46$M pairs vs. divided $= 8 \cdot 196 \cdot (8 + 196) \approx 0.32$M pairs — an 8× reduction.
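The comparison can be checked numerically (assuming 8 frames and the standard 14×14 = 196 patch grid of ViT-B/16):

```python
# Attention-pair counts for T frames and N spatial patches.
T, N = 8, 196

joint = (T * N) ** 2        # every spatiotemporal token attends to every token
divided = T * N * (T + N)   # per token: T time neighbors + N space neighbors
print(joint, divided, round(joint / divided, 1))  # 2458624 319872 7.7
```

The gap widens as $T$ grows, which is what lets divided attention scale to longer clips.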
VideoMAE masked pretraining
VideoMAE (Tong et al., 2022) applies masked autoencoding to video with an extremely high masking ratio (90–95%, versus 75% for image MAE).
Tube masking: instead of random per-frame masking, VideoMAE masks entire spatiotemporal tubes — all frames at a given spatial position are masked or unmasked together. This prevents trivial copying from adjacent frames.
With 90–95% of tokens masked:
- The encoder processes only 5–10% of the video
- Training is 3–5× faster than processing all tokens
- The model must learn rich spatiotemporal representations to reconstruct masked tubes
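Tube masking can be sketched as follows (simplified to one token per frame per spatial patch, ignoring VideoMAE's temporal tubelet grouping; sizes are illustrative):

```python
import numpy as np

# Tube masking: choose one mask over the spatial patch grid and repeat it
# across all frames, so a masked position is hidden in every frame.
rng = np.random.default_rng(0)
T, N = 16, 196          # frames x spatial patches
mask_ratio = 0.9

n_masked = int(N * mask_ratio)
spatial_mask = np.zeros(N, dtype=bool)
spatial_mask[rng.choice(N, n_masked, replace=False)] = True
tube_mask = np.tile(spatial_mask, (T, 1))   # (T, N): same mask in every frame

visible_tokens = int((~tube_mask).sum())
print(visible_tokens, T * N)  # 320 visible of 3136 tokens (~10%)
```

Because the mask is identical across frames, the decoder cannot reconstruct a masked patch by copying the same location in a neighboring frame — it must model motion and appearance jointly.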
VideoMAE ViT-H pretrained on Kinetics achieves 86.6% on UCF-101 with linear probing — stronger than two-stream networks with full fine-tuning.
Temporal grounding
Video question answering and temporal grounding require localizing events in time. The core task: given a query (text or visual), predict the temporal segment where the query occurs.
Contrastive language-video alignment (CLIP4Clip, VideoCLIP): extend CLIP's InfoNCE loss to video-text pairs by mean-pooling or using a temporal transformer over frame features:

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(v_i, t_j)/\tau)}$$

where $v_i$ is the aggregated video embedding and $t_i$ is the text embedding for the $i$-th video-caption pair.
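A minimal sketch of the symmetric InfoNCE objective over a batch (random embeddings stand in for pooled frame features and caption features; the 0.07 temperature follows CLIP):

```python
import torch
import torch.nn.functional as F

# Symmetric InfoNCE over a batch of video/text embedding pairs.
B, D, temperature = 4, 512, 0.07
video_emb = F.normalize(torch.randn(B, D), dim=-1)  # aggregated video embeddings
text_emb = F.normalize(torch.randn(B, D), dim=-1)   # caption embeddings

logits = video_emb @ text_emb.T / temperature       # (B, B) similarity matrix
targets = torch.arange(B)                           # i-th video matches i-th caption
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
print(loss.item())
```

The diagonal of the similarity matrix holds the matched pairs; both the video-to-text and text-to-video directions are averaged, as in CLIP.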
Walkthrough
VideoMAE feature extraction
from transformers import VideoMAEImageProcessor, VideoMAEModel
import torch
import numpy as np
from PIL import Image
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
# Simulate loading 16 frames (T=16, H=224, W=224, C=3)
# In practice: use decord or torchvision.io.read_video
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
frames_pil = [Image.fromarray(f) for f in frames]
inputs = processor(frames_pil, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state: (1, T*N_patches, D) — spatiotemporal token sequence
print(outputs.last_hidden_state.shape)  # e.g. (1, 1568, 768) for ViT-B

Action recognition with TimeSformer
from transformers import TimesformerModel, AutoImageProcessor
import torch
processor = AutoImageProcessor.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
model = TimesformerModel.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
# Load 8 video frames (RGB, 224×224)
import torchvision
video_path = "action_clip.mp4"
vframes, _, info = torchvision.io.read_video(video_path, pts_unit="sec")
indices = torch.linspace(0, len(vframes) - 1, 8).long()
frames = [vframes[i].numpy() for i in indices]
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLS token for classification
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # (1, 768)

Optical flow with RAFT
import torch
import torchvision.transforms.functional as TF
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
weights = Raft_Large_Weights.DEFAULT
transforms = weights.transforms()
model = raft_large(weights=weights).eval().to("cuda")
from PIL import Image
import numpy as np
frame1 = TF.to_tensor(Image.open("frame_001.jpg")).unsqueeze(0).to("cuda")
frame2 = TF.to_tensor(Image.open("frame_002.jpg")).unsqueeze(0).to("cuda")
img1_batch, img2_batch = transforms(frame1, frame2)
with torch.no_grad():
    # Returns a list of flow predictions (coarse to fine); take the last
    flow_predictions = model(img1_batch, img2_batch)

flow = flow_predictions[-1]  # (1, 2, H, W) — u, v components
flow_magnitude = (flow[:, 0] ** 2 + flow[:, 1] ** 2).sqrt()
print(f"Max flow magnitude: {flow_magnitude.max().item():.2f} pixels")

Video classification benchmark evaluation
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
def evaluate_top1_top5(model, dataloader, device):
    model.eval()
    top1_correct, top5_correct, total = 0, 0, 0
    with torch.no_grad():
        for frames, labels in dataloader:
            frames, labels = frames.to(device), labels.to(device)
            logits = model(frames).logits
            probs = F.softmax(logits, dim=-1)
            top5 = probs.topk(5, dim=-1).indices
            top1_correct += (top5[:, 0] == labels).sum().item()
            top5_correct += (top5 == labels.unsqueeze(1)).any(dim=1).sum().item()
            total += labels.size(0)
    return top1_correct / total, top5_correct / total

# top1, top5 = evaluate_top1_top5(model, val_loader, "cuda")
# print(f"Top-1: {top1:.3f}  Top-5: {top5:.3f}")

Analysis & Evaluation
Where Your Intuition Breaks
The intuition: sampling more frames always improves video understanding accuracy. In reality, for trimmed action recognition (short clips with a single action), accuracy saturates quickly — most of the useful temporal information is captured in 8–16 uniformly sampled frames, and additional frames add compute without improving accuracy. Long-form video understanding is genuinely harder and requires different strategies: sparse sampling (a few frames spread across the video), hierarchical attention, or dedicated long-context architectures. The mistake is applying the dense-sampling approach used for 2-second clips to 10-minute videos — the temporal structure is fundamentally different and demands a different modeling strategy.
Architecture comparison
| | Two-stream CNN | 3D CNN (C3D/I3D) | TimeSformer | VideoMAE ViT-H |
|---|---|---|---|---|
| Temporal modeling | Optical flow | 3D kernels | Divided attn | Tube masked AE |
| Kinetics-400 Top-1 | ~75% | ~80% | 80.7% | 86.6% |
| Pretraining | ImageNet | ImageNet | ImageNet-21k | K400/K700 |
| Compute | Moderate (flow) | High (3D) | Moderate | Low (masked) |
| Long-range | Limited | Very limited | Configurable | Strong |
Benchmark landscape
| Benchmark | What it tests |
|---|---|
| Kinetics-400/700 | Broad action recognition (400–700 classes) |
| Something-Something v2 | Temporal reasoning (directional actions) |
| UCF-101 / HMDB-51 | Transfer learning evaluation |
| ActivityNet-Captions | Dense captioning and temporal grounding |
| EgoSchema | Long-form egocentric video QA |
Key pitfalls
Temporal leakage: in trimmed clip datasets (Kinetics), the model can often predict the action from a single frame — scene/object context dominates motion cues. Something-Something v2 was designed to require temporal reasoning ("moving X away from Y").
Frame sampling rate matters: subsampling too aggressively loses fast motions; using all frames is prohibitive. VideoMAE uses 16 frames at 0.5s stride for Kinetics (covers 8 seconds of action).
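The stride arithmetic can be sketched in a few lines (the 30 fps source frame rate here is an assumption for illustration, not from the text):

```python
# Strided sampling: 16 frames at a 0.5 s stride from a 30 fps video.
fps, num_frames, stride_s = 30, 16, 0.5

indices = [int(i * stride_s * fps) for i in range(num_frames)]
coverage_s = indices[-1] / fps  # time spanned from first to last sampled frame
print(indices[:4], coverage_s)  # [0, 15, 30, 45] 7.5
```

Sixteen frames at 0.5 s spacing span roughly 8 seconds; halving the stride would halve the temporal coverage for the same token budget.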
Long video understanding: most models process 8–16 frames. An hour-long video has ~86,400 frames at 24fps. Long-form video understanding (Ego4D, EgoSchema) requires hierarchical or sliding-window approaches — an active research area.
Domain shift: models trained on curated YouTube clips (Kinetics) often degrade on egocentric video, surveillance footage, or medical video with different motion statistics and viewpoints.