
Video Understanding

Images are static snapshots; video adds time. Understanding video requires learning what changes between frames and what stays the same — motion, causality, temporal order. Early approaches stacked 2D CNNs on individual frames, ignoring temporal structure entirely. 3D convolutions and two-stream networks introduced motion modeling, but required careful architectural choices and were expensive. Transformer-based video models (TimeSformer, Video Swin, VideoMAE) remove the architectural constraints and instead learn to attend across both space and time — scaling to hundreds of frames and achieving state-of-the-art results across action recognition, temporal grounding, and dense video understanding.

Theory

Convolution Dimensions
[Diagram: convolution dimensions. A 2D conv (spatial only) operates on a single frame t with a 3×3 kernel, producing spatial features only — no motion information is captured. A 3D conv (spatiotemporal) spans space and time across frames t, t−1, t−2 with a kernel of shape T×H×W (e.g. 3×3×3), producing motion + appearance features.]

2D conv processes each frame independently — no temporal relationships captured. 3D conv extends the kernel across T frames, learning motion patterns like optical flow implicitly.

Video understanding adds a time axis to image understanding. A static image model sees a cup and a hand separately; a video model can recognize "picking up the cup" because it tracks how objects and their relationships change across frames. The architectures below progress from 3D convolutions (local spatiotemporal neighborhoods) to transformers (global temporal attention), trading computational cost for longer-range temporal reasoning.

From images to video

A video clip is a 4D tensor $V \in \mathbb{R}^{T \times H \times W \times C}$ where $T$ is the number of frames. Naive processing applies image models frame-independently, ignoring temporal structure — fine for static frames but blind to motion.

3D convolution (C3D, Tran et al., 2015) extends 2D kernels to include a temporal dimension:

$$y_{t,h,w} = \sum_{\tau=-k_t/2}^{k_t/2} \sum_{i,j} x_{t+\tau,\, h+i,\, w+j} \cdot k_{\tau,i,j}$$

The 3D convolution sum directly extends 2D spatial convolution by adding a temporal kernel dimension $\tau$ — this is the minimal modification that allows a convolutional network to detect motion patterns (an edge moving across frames) rather than just static textures. The temporal kernel size $k_t$ bounds the temporal receptive field: a kernel of size 3 can detect motion over 3 consecutive frames but not the long-range dependency between the first and last frame of a 10-second clip. This is why 3D CNNs were superseded for long-form video understanding: the local inductive bias that makes CNNs efficient for images becomes a liability when temporal dependencies span many frames.

A $3 \times 3 \times 3$ spatiotemporal kernel processes space and time jointly. This allows the network to detect motion patterns (edges moving across frames), but the kernel now spans three axes, so parameter count and compute grow with the full $k_t \times k_h \times k_w$ kernel volume rather than just $k_h \times k_w$.
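As a concrete check of these shapes and costs, here is a minimal PyTorch sketch (filter counts and clip size chosen arbitrarily) comparing a 3D conv with frame-wise 2D conv on the same clip:

```python
import torch
import torch.nn as nn

# A short clip: batch=1, channels=3, T=16 frames, 112x112 spatial
clip = torch.randn(1, 3, 16, 112, 112)

# 3D conv: 3x3x3 kernel spans time and space jointly
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)
feat3d = conv3d(clip)
print(feat3d.shape)  # torch.Size([1, 64, 16, 112, 112])

# 2D conv applied frame-by-frame sees no temporal context
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
feat2d = torch.stack([conv2d(clip[:, :, t]) for t in range(16)], dim=2)
print(feat2d.shape)  # torch.Size([1, 64, 16, 112, 112])

# Parameter counts: the 3D kernel is k_t times larger per filter
print(sum(p.numel() for p in conv3d.parameters()))  # 3*64*27 + 64 = 5248
print(sum(p.numel() for p in conv2d.parameters()))  # 3*64*9  + 64 = 1792
```

The output shapes match, but only the 3D features depend on neighboring frames — and the parameter count already reflects the extra kernel axis.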

R(2+1)D factorized convolution

Instead of a full $t \times h \times w$ kernel, R(2+1)D (Tran et al., 2018) factors into a 2D spatial convolution followed by a 1D temporal convolution:

$$y = \text{Conv1D}_t \circ \text{Conv2D}_{h,w}(x)$$

This doubles the number of non-linearities (each stage has its own ReLU), reduces parameters vs. 3D conv, and separates spatial and temporal learning. R(2+1)D outperformed C3D with fewer FLOPs on Kinetics-400.
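A minimal sketch of one (2+1)D block in PyTorch — the intermediate channel count `mid` is chosen arbitrarily here, whereas the paper sizes it so that total parameters match the corresponding full 3D kernel:

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """(2+1)D factorization: 2D spatial conv -> ReLU -> 1D temporal conv.

    Minimal sketch. Implemented as two Conv3d layers with degenerate kernel
    axes: (1, 3, 3) acts purely spatially, (3, 1, 1) purely temporally.
    """
    def __init__(self, c_in, c_out, mid):
        super().__init__()
        self.spatial = nn.Conv3d(c_in, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)  # the extra non-linearity between stages
        self.temporal = nn.Conv3d(mid, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

block = R2Plus1DBlock(3, 64, mid=45)  # mid chosen arbitrarily for illustration
out = block(torch.randn(1, 3, 16, 112, 112))
print(out.shape)  # torch.Size([1, 64, 16, 112, 112])
```

The extra ReLU between the two stages is exactly the doubled non-linearity mentioned above.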

Optical flow and two-stream networks

Optical flow models the apparent motion field between consecutive frames. The brightness constancy constraint:

$$I(x, y, t) = I(x + u\Delta t,\ y + v\Delta t,\ t + \Delta t)$$

gives the optical flow constraint equation:

$$I_x u + I_y v + I_t = 0$$

where $u, v$ are the flow velocities and $I_x, I_y, I_t$ are spatial and temporal image gradients.
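This is one equation in two unknowns per pixel, so it is classically solved by least squares over a small window (Lucas-Kanade). A self-contained sketch on a synthetic blob shifted one pixel to the right — window size and blob parameters are arbitrary choices:

```python
import numpy as np

def lucas_kanade_flow(I1, I2, y, x, win=7):
    """Solve I_x*u + I_y*v = -I_t by least squares over a (win x win) window.

    Sketch of the classic Lucas-Kanade estimator; assumes small motion
    and brightness constancy.
    """
    Ix = np.gradient(I1, axis=1)   # spatial gradient along x (columns)
    Iy = np.gradient(I1, axis=0)   # spatial gradient along y (rows)
    It = I2 - I1                   # temporal gradient
    r = win // 2
    sl = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # (win^2, 2)
    b = -It[sl].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic test: a smooth Gaussian blob shifted 1 pixel right between frames
yy, xx = np.mgrid[0:64, 0:64].astype(float)
I1 = np.exp(-((xx - 30) ** 2 + (yy - 32) ** 2) / 50)
I2 = np.exp(-((xx - 31) ** 2 + (yy - 32) ** 2) / 50)
u, v = lucas_kanade_flow(I1, I2, y=32, x=30)
print(f"u={u:.2f}, v={v:.2f}")  # u close to 1, v close to 0
```

The recovered flow points right with near-unit magnitude, matching the true shift.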

Two-stream networks (Simonyan and Zisserman, 2014) train two separate CNNs:

  1. Spatial stream: processes RGB frames → scene/object understanding
  2. Temporal stream: processes stacked optical flow fields → motion understanding

Late fusion (averaging or SVM) combines both streams. Two-stream networks dominated action recognition benchmarks until transformer-based models surpassed them by learning motion implicitly.
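Late fusion by probability averaging can be sketched in a few lines — the 0.5 weight and the 400-class logits are illustrative assumptions (the original paper also evaluates SVM fusion over stacked scores):

```python
import torch
import torch.nn.functional as F

def late_fusion(rgb_logits, flow_logits, w_rgb=0.5):
    """Two-stream late fusion: weighted average of per-stream class probabilities."""
    p_rgb = F.softmax(rgb_logits, dim=-1)    # spatial stream (RGB frames)
    p_flow = F.softmax(flow_logits, dim=-1)  # temporal stream (stacked flow)
    return w_rgb * p_rgb + (1 - w_rgb) * p_flow

rgb_logits = torch.randn(4, 400)   # e.g. a Kinetics-400-sized head
flow_logits = torch.randn(4, 400)
probs = late_fusion(rgb_logits, flow_logits)
pred = probs.argmax(dim=-1)
print(probs.sum(dim=-1))  # each row sums to 1
```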

Divided space-time attention (TimeSformer)

TimeSformer (Bertasius et al., 2021) extends ViT to video by applying factorized attention — separate spatial and temporal attention blocks instead of joint 3D self-attention:

Temporal attention (across frames, same spatial position):

$$A^{\text{time}}_{t,p} = \text{Softmax}\!\left(\frac{Q_{t,p} K_{:,p}^\top}{\sqrt{d}}\right) V_{:,p}$$

Each patch $p$ attends over its counterpart across all $T$ frames.

Spatial attention (within each frame, all patches):

$$A^{\text{space}}_{t,p} = \text{Softmax}\!\left(\frac{Q_{t,p} K_{t,:}^\top}{\sqrt{d}}\right) V_{t,:}$$

Each frame applies standard ViT attention among its $N$ patches.
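In code, the divided scheme reduces to reshaping the token grid so that standard attention runs over one axis at a time. A minimal sketch using `torch.nn.MultiheadAttention`, omitting the CLS token and the residual/LayerNorm plumbing of a full transformer block:

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal then spatial self-attention over a (B, T, N, D) token grid.

    Sketch of TimeSformer's divided scheme via axis reshaping; a real block
    adds residual connections, LayerNorm, and a CLS token.
    """
    def __init__(self, dim, heads):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, N, D)
        B, T, N, D = x.shape
        # Temporal: each patch position attends across the T frames
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial: each frame attends among its N patches
        xs = x.reshape(B * T, N, D)
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(B, T, N, D)

attn = DividedSpaceTimeAttention(dim=64, heads=4)
out = attn(torch.randn(2, 8, 196, 64))
print(out.shape)  # torch.Size([2, 8, 196, 64])
```

Each attention call sees sequences of length $T$ or $N$ rather than $TN$, which is where the complexity savings below come from.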

Complexity comparison:

  • Joint 3D attention: $O((TN)^2)$
  • Divided space-time: $O(T^2 N + TN^2)$

For $T=8$ frames and $N=196$ patches: joint $\approx$ 2.5M pairs vs. divided $\approx$ 12.5K + 307K $\approx$ 320K pairs — an 8× reduction.
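The arithmetic can be checked directly:

```python
T, N = 8, 196                   # frames and patches per frame
joint = (T * N) ** 2            # every token attends to every other token
divided = T**2 * N + T * N**2   # temporal pairs + spatial pairs
print(joint, divided)           # 2458624 319872
print(joint / divided)          # ~7.7x fewer attention pairs
```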

VideoMAE masked pretraining

VideoMAE (Tong et al., 2022) applies masked autoencoding to video with an extremely high masking ratio:

$$\hat{V} = \text{Decoder}(\text{Encoder}(V \odot M)), \quad M \sim \text{Bernoulli}(0.9)$$

Tube masking: instead of random per-frame masking, VideoMAE masks entire spatiotemporal tubes — all frames at a given spatial position are masked or unmasked together. This prevents trivial copying from adjacent frames.

With 90–95% of tokens masked:

  • The encoder processes only 5–10% of the video
  • Training is 3–5× faster than processing all tokens
  • The model must learn rich spatiotemporal representations to reconstruct masked tubes
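Tube masking itself is a few lines: sample one spatial mask and repeat it across time. A sketch over a token grid of 8 temporal steps × 196 patches (grid sizes are illustrative):

```python
import torch

def tube_mask(t_tokens, n_patches, mask_ratio=0.9, generator=None):
    """Tube masking: one spatial mask, repeated across every temporal step.

    Sketch of VideoMAE-style masking over a (t_tokens x n_patches) token
    grid; True = masked.
    """
    n_mask = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches, generator=generator)
    spatial_mask = torch.zeros(n_patches, dtype=torch.bool)
    spatial_mask[perm[:n_mask]] = True
    # Same spatial pattern at every time step -> no visible neighbor to copy from
    return spatial_mask.unsqueeze(0).expand(t_tokens, n_patches)

mask = tube_mask(t_tokens=8, n_patches=196, mask_ratio=0.9)
print(mask.shape, mask[0].sum().item())  # torch.Size([8, 196]) 176
visible = (~mask[0]).sum().item()
print(visible)  # 20 of 196 positions stay visible in every frame
```

Because the mask is identical at every time step, a masked position has no unmasked copy in adjacent frames to trivially interpolate from.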

VideoMAE ViT-H pretrained on Kinetics reaches 86.6% top-1 on Kinetics-400 — well beyond what fully supervised two-stream networks achieved.

Temporal grounding

Video question answering and temporal grounding require localizing events in time. The core task: given a query $q$ (text or visual), predict the temporal segment $[t_{\text{start}}, t_{\text{end}}]$ where the query occurs.

Contrastive language-video alignment (CLIP4Clip, VideoCLIP): extend CLIP's InfoNCE loss to video-text pairs by mean-pooling or using a temporal transformer over frame features:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(V_i \cdot T_i / \tau)}{\sum_j \exp(V_i \cdot T_j / \tau)}$$

where $V_i$ is the aggregated video embedding and $T_i$ is the text embedding for the $i$-th video-caption pair.
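A sketch of this objective in PyTorch — here in the symmetric (video→text plus text→video) form that CLIP-style models typically train with; batch size and embedding dimension are arbitrary:

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of matched video/text embeddings.

    video_emb is assumed to be an aggregated (e.g. mean-pooled) per-clip
    embedding; diagonal entries of the similarity matrix are the positives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau               # (N, N) cosine similarities / temperature
    targets = torch.arange(len(v))
    # v->t and t->v cross-entropy, averaged
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = video_text_infonce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())  # roughly log(8) ≈ 2.08 for random embeddings
```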

Walkthrough

VideoMAE feature extraction

python
from transformers import VideoMAEImageProcessor, VideoMAEModel
import torch
import numpy as np
from PIL import Image
 
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
 
# Simulate loading 16 frames (T=16, H=224, W=224, C=3)
# In practice: use decord or torchvision.io.read_video
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
frames_pil = [Image.fromarray(f) for f in frames]
 
inputs = processor(frames_pil, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
 
# outputs.last_hidden_state: (1, T*N_patches, D) — spatiotemporal token sequence
print(outputs.last_hidden_state.shape)  # e.g. (1, 1568, 768) for ViT-B

Action recognition with TimeSformer

python
from transformers import TimesformerModel, AutoImageProcessor
import torch
 
processor = AutoImageProcessor.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
model = TimesformerModel.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)
 
# Load 8 uniformly spaced RGB frames; the processor handles resizing to 224×224
import torchvision
video_path = "action_clip.mp4"
vframes, _, _ = torchvision.io.read_video(video_path, pts_unit="sec")  # (T, H, W, C) uint8
indices = torch.linspace(0, len(vframes) - 1, 8).long()
frames = [vframes[i].numpy() for i in indices]
 
inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
 
# CLS token for classification
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # (1, 768)

Optical flow with RAFT

python
import torch
import torchvision.transforms.functional as TF
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
 
weights = Raft_Large_Weights.DEFAULT
transforms = weights.transforms()
model = raft_large(weights=weights).eval().to("cuda")
 
from PIL import Image

# RAFT expects spatial dimensions divisible by 8; resize beforehand if needed
frame1 = TF.to_tensor(Image.open("frame_001.jpg")).unsqueeze(0).to("cuda")
frame2 = TF.to_tensor(Image.open("frame_002.jpg")).unsqueeze(0).to("cuda")
 
img1_batch, img2_batch = transforms(frame1, frame2)
 
with torch.no_grad():
    # Returns list of flow predictions (coarse to fine); take the last
    flow_predictions = model(img1_batch, img2_batch)
    flow = flow_predictions[-1]  # (1, 2, H, W) — u, v components
 
flow_magnitude = (flow[:, 0] ** 2 + flow[:, 1] ** 2).sqrt()
print(f"Max flow magnitude: {flow_magnitude.max().item():.2f} pixels")

Video classification benchmark evaluation

python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
 
def evaluate_top1_top5(model, dataloader, device):
    model.eval()
    top1_correct, top5_correct, total = 0, 0, 0
 
    with torch.no_grad():
        for frames, labels in dataloader:
            frames, labels = frames.to(device), labels.to(device)
            logits = model(frames).logits
            probs = F.softmax(logits, dim=-1)
 
            top5 = probs.topk(5, dim=-1).indices
            top1_correct += (top5[:, 0] == labels).sum().item()
            top5_correct += (top5 == labels.unsqueeze(1)).any(dim=1).sum().item()
            total += labels.size(0)
 
    return top1_correct / total, top5_correct / total
 
# top1, top5 = evaluate_top1_top5(model, val_loader, "cuda")
# print(f"Top-1: {top1:.3f}  Top-5: {top5:.3f}")

Analysis & Evaluation

Where Your Intuition Breaks

Intuition says that sampling more frames always improves video understanding accuracy. In practice, for trimmed action recognition (short clips with a single action), accuracy saturates quickly — most of the useful temporal information is captured in 8–16 uniformly sampled frames, and adding more frames adds compute without improving accuracy. Long-form video understanding is genuinely harder and requires different strategies: sparse sampling (a few frames spread across the video), hierarchical attention, or dedicated long-context architectures. The mistake is applying the dense-sampling approach used for 2-second clips to 10-minute videos — the temporal structure is fundamentally different and requires a different modeling strategy.

Architecture comparison

|                    | Two-stream CNN  | 3D CNN (C3D/I3D) | TimeSformer  | VideoMAE ViT-H |
|--------------------|-----------------|------------------|--------------|----------------|
| Temporal modeling  | Optical flow    | 3D kernels       | Divided attn | Tube masked AE |
| Kinetics-400 Top-1 | ~75%            | ~80%             | 80.7%        | 86.6%          |
| Pretraining        | ImageNet        | ImageNet         | ImageNet-21k | K400/K700      |
| Compute            | Moderate (flow) | High (3D)        | Moderate     | Low (masked)   |
| Long-range         | Limited         | Very limited     | Configurable | Strong         |

Benchmark landscape

| Benchmark              | What it tests                              |
|------------------------|--------------------------------------------|
| Kinetics-400/700       | Broad action recognition (400–700 classes) |
| Something-Something v2 | Temporal reasoning (directional actions)   |
| UCF-101 / HMDB-51      | Transfer learning evaluation               |
| ActivityNet-Captions   | Dense captioning and temporal grounding    |
| EgoSchema              | Long-form egocentric video QA              |

Key pitfalls

Temporal leakage: in trimmed clip datasets (Kinetics), the model can often predict the action from a single frame — scene/object context dominates motion cues. Something-Something v2 was designed to require temporal reasoning ("moving X away from Y").

Frame sampling rate matters: subsampling too aggressively loses fast motions; using all frames is prohibitive. VideoMAE uses 16 frames at 0.5s stride for Kinetics (covers 8 seconds of action).
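Strided sampling is easy to get subtly wrong, so a small helper makes the trade-off explicit (function name and defaults are illustrative):

```python
import torch

def sample_clip_indices(num_frames, clip_len=16, stride_s=0.5, fps=30.0, start=0):
    """Sample clip_len frame indices at a fixed temporal stride (in seconds).

    Sketch of strided sampling as described above; indices are clamped
    to the video length.
    """
    step = max(1, round(stride_s * fps))
    idx = start + torch.arange(clip_len) * step
    return idx.clamp(max=num_frames - 1)

# 10-second video at 30 fps: 16 frames at 0.5 s stride
idx = sample_clip_indices(num_frames=300, clip_len=16, stride_s=0.5, fps=30.0)
print(idx.tolist())                       # 0, 15, 30, ..., 225
print((idx[-1] - idx[0]).item() / 30.0)   # 7.5 s spanned by the clip
```

A larger `stride_s` covers more of the video with the same 16 frames but skips over fast motions; a smaller one captures fine motion but sees only a short window.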

Long video understanding: most models process 8–16 frames. An hour-long video has ~86,400 frames at 24fps. Long-form video understanding (Ego4D, EgoSchema) requires hierarchical or sliding-window approaches — an active research area.

Domain shift: models trained on curated YouTube clips (Kinetics) often degrade on egocentric video, surveillance footage, or medical video with different motion statistics and viewpoints.
