
CLIP & Contrastive Learning

Supervised image classifiers are brittle: they recognize only the categories they were trained on, and retraining for new classes requires labeled data. CLIP (Contrastive Language-Image Pre-Training) takes a different approach — instead of predicting fixed labels, it learns a joint embedding space where matching image-text pairs are close and non-matching pairs are far apart. Trained on 400 million image-text pairs from the internet, CLIP achieves zero-shot classification competitive with supervised ImageNet models by framing classification as text retrieval. The contrastive objective that makes this possible — InfoNCE loss — is now the backbone of self-supervised learning across images, audio, video, and code.

Theory

CLIP Similarity Matrix
[Figure: 5×5 CLIP similarity matrix for five image–text pairs (dog, cat, car, tree, sky). Diagonal entries (0.89–0.93) are positive pairs; off-diagonal entries (0.12–0.65) are in-batch negatives.]

CLIP maximizes cosine similarity of matched image–text pairs (diagonal) while pushing negatives apart. Temperature τ controls how sharply the softmax concentrates on the diagonal.

CLIP trains two encoders — one for images, one for text — to produce matching vectors for matching image-caption pairs. The key insight is that you don't need manual labels: internet image-caption pairs (ALT text, photo captions) provide billions of supervision examples for free. Every image is described by its matching caption and must be distinguished from all other captions in the batch — with 32,768 images per batch, each image has 32,767 negative text candidates.

Contrastive learning setup

CLIP jointly trains an image encoder $f_I$ and a text encoder $f_T$. For a batch of $N$ (image, text) pairs:

  • Image embeddings: $\mathbf{I}_i = f_I(x_i^{\text{img}}) / \|f_I(x_i^{\text{img}})\|$
  • Text embeddings: $\mathbf{T}_i = f_T(x_i^{\text{txt}}) / \|f_T(x_i^{\text{txt}})\|$

Both are L2-normalized to the unit sphere. Similarity is measured by dot product (equivalent to cosine similarity after normalization): $S_{ij} = \mathbf{I}_i \cdot \mathbf{T}_j$

The L2 normalization is not optional: without it, the similarity scores are dominated by the magnitude of the vectors rather than their direction. Magnitude has no cross-modal semantic meaning — a large image embedding and a large text embedding are not necessarily more similar than small ones. Normalizing both encoders' outputs to the unit sphere forces the similarity signal to come entirely from the angle between representations, which is what the contrastive objective is actually training.

This produces an $N \times N$ similarity matrix. Diagonal entries ($S_{ii}$) are matched pairs; off-diagonal entries are negatives from the same batch.
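The setup above is small enough to sketch directly. Here random vectors stand in for encoder outputs (an assumption — these are not real CLIP features); the point is the shape and the effect of normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8

# Random vectors stand in for the image/text encoder outputs
I = rng.normal(size=(N, D))
T = rng.normal(size=(N, D))

# L2-normalize each row onto the unit sphere
I = I / np.linalg.norm(I, axis=1, keepdims=True)
T = T / np.linalg.norm(T, axis=1, keepdims=True)

# S[i, j] = I_i . T_j -- the full N x N similarity matrix
S = I @ T.T
```

After normalization every entry of `S` is a cosine similarity in $[-1, 1]$; the diagonal holds the matched-pair scores the loss will push up.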

InfoNCE loss

The InfoNCE (Noise-Contrastive Estimation) loss maximizes the likelihood of the correct pairing for each image:

$$\mathcal{L}_I = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^N \exp(S_{ij} / \tau)}$$

and symmetrically for text:

$$\mathcal{L}_T = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^N \exp(S_{ji} / \tau)}$$

Total loss: $\mathcal{L} = (\mathcal{L}_I + \mathcal{L}_T) / 2$.

The temperature $\tau$ is a learnable scalar. Small $\tau$ sharpens the distribution (confident predictions); large $\tau$ softens it. CLIP initializes $\tau = 0.07$.
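The effect of $\tau$ is easy to see numerically. A small sketch (the similarity values are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One row of the similarity matrix: the match (0.9) plus three negatives
sims = np.array([0.9, 0.6, 0.2, 0.1])

sharp = softmax(sims / 0.07)  # CLIP's initial temperature
soft = softmax(sims / 1.0)    # high temperature, for comparison
```

At $\tau = 0.07$ nearly all probability mass lands on the matched pair, while at $\tau = 1$ the distribution stays diffuse — which is why the learned temperature matters so much to how hard the loss pushes on near-miss negatives.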

Intuition: for each image $i$, the loss is a softmax cross-entropy where the "correct class" is its matching text. With $N = 32768$ (CLIP's batch size), there are 32,767 negatives per positive — making the task very challenging and the representations very discriminative.

Why large batches matter

InfoNCE uses in-batch negatives. More negatives = harder task = better representations. With $N$ samples per batch, each positive is contrasted against $N - 1$ negatives:

$$\mathcal{L}_I = -\mathbb{E}_i\left[\log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^N \exp(S_{ij} / \tau)}\right]$$

CLIP used 32,768 samples per batch across 256 GPUs. This is why CLIP-scale training was infeasible at smaller compute budgets before batching tricks (gradient caching, memory-efficient attention) were developed.

Gradient caching: compute embeddings in small chunks with gradients disabled, cache them, compute the loss over the full $N \times N$ similarity matrix, then re-encode each chunk with gradients enabled and backpropagate using the cached loss gradients. This decouples batch size from per-GPU memory.
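A minimal sketch of the caching side of this idea, with tiny linear layers standing in for the real image/text transformers (an assumption for brevity); the second, gradient-carrying pass is omitted:

```python
import torch
import torch.nn.functional as F

# Toy linear "encoders" stand in for the real ViT / text transformer
img_enc = torch.nn.Linear(16, 8)
txt_enc = torch.nn.Linear(16, 8)

images = torch.randn(64, 16)
texts = torch.randn(64, 16)

# Pass 1: embed in small chunks with gradients off, caching only embeddings.
# Peak activation memory scales with the chunk size (8), not the batch (64).
with torch.no_grad():
    img_emb = torch.cat([F.normalize(img_enc(c), dim=-1) for c in images.split(8)])
    txt_emb = torch.cat([F.normalize(txt_enc(c), dim=-1) for c in texts.split(8)])

# The full similarity matrix is built from the cached embeddings
logits = img_emb @ txt_emb.T / 0.07

# (Pass 2, omitted: re-encode each chunk with gradients enabled and
# backpropagate the cached per-embedding loss gradients into the encoders.)
```

This is only a sketch of the memory argument, not a full gradient-caching implementation.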

Zero-shot classification

CLIP converts any classification task into an image-text matching problem. For a dataset with classes $\{c_1, \ldots, c_K\}$:

  1. Encode each class as a text prompt: $\mathbf{T}_k = f_T(\text{"a photo of a } c_k\text{"})$
  2. Encode the query image: $\mathbf{I} = f_I(x)$
  3. Predict: $\hat{k} = \arg\max_k \, \mathbf{I} \cdot \mathbf{T}_k$

Prompt engineering matters: "a photo of a [class]" outperforms just "[class]" by 3–4% on ImageNet zero-shot. Ensembling 80 different prompt templates further improves by 1–2%.
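The ensembling mechanics are simple: embed each template filled with the class name, average, and re-normalize. A minimal sketch — `embed_text` here is a hypothetical stand-in for the CLIP text encoder (a deterministic random unit vector per string), not a real model:

```python
import numpy as np

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def embed_text(s, D=8):
    # Hypothetical stand-in for the CLIP text encoder (assumption):
    # a deterministic-per-run random unit vector for each string
    rng = np.random.default_rng(abs(hash(s)) % (2**32))
    v = rng.normal(size=D)
    return v / np.linalg.norm(v)

def class_embedding(name):
    # Ensemble: average the per-template embeddings, then re-normalize
    e = np.mean([embed_text(t.format(name)) for t in templates], axis=0)
    return e / np.linalg.norm(e)

cat_vec = class_embedding("cat")
```

The re-normalization step matters: the mean of unit vectors is not itself a unit vector, and downstream dot products assume embeddings on the unit sphere.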

Zero-shot ImageNet accuracy: CLIP ViT-L/14 achieves 75.5% top-1 — matching a supervised ResNet-50 trained on 1.2M labeled images, with zero ImageNet training examples.

Embedding geometry

The CLIP embedding space has useful geometric properties:

Linear separability: despite no supervision for specific classes, linear probing (training a linear classifier on frozen CLIP features) achieves 85.4% ImageNet accuracy — near the state of the art for fully supervised models.

Compositionality: "a red cube on a blue sphere" and "a blue cube on a red sphere" produce different embeddings, enabling spatial and relational reasoning.

Distribution shift robustness: CLIP's zero-shot accuracy on ImageNet distribution shifts (ImageNet-V2, -Sketch, -A, -R) drops far less than supervised models, because it was trained on diverse internet data rather than ImageNet's specific distribution.

Walkthrough

CLIP zero-shot classification

python
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
 
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
 
class_names = ["a photo of a cat", "a photo of a dog",
               "a photo of a bird", "a photo of a car"]
 
inputs = processor(text=class_names, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)
 
for cls, prob in zip(class_names, probs[0]):
    print(f"{cls}: {prob:.3f}")

InfoNCE loss implementation

python
import torch
import torch.nn.functional as F
 
def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    image_embeddings: (N, D) L2-normalized
    text_embeddings:  (N, D) L2-normalized
    """
    logits = torch.matmul(image_embeddings, text_embeddings.T) / temperature
    labels = torch.arange(len(logits), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.T, labels)
    return (loss_i + loss_t) / 2
 
# At random initialization, loss ~= log(N)
N, D = 64, 512
img_emb = F.normalize(torch.randn(N, D), dim=-1)
txt_emb = F.normalize(torch.randn(N, D), dim=-1)
print(f"Random loss: {clip_loss(img_emb, txt_emb):.3f}  (expected ~{torch.log(torch.tensor(float(N))):.3f})")

Linear probe evaluation

python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression
 
def extract_clip_features(model, processor, images, batch_size=64):
    all_features = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i+batch_size]
        inputs = processor(images=batch, return_tensors="pt")
        with torch.no_grad():
            features = model.get_image_features(**inputs)
            features = F.normalize(features, dim=-1)
        all_features.append(features.cpu().numpy())
    return np.concatenate(all_features)
 
# train_features: (N, D), train_labels: (N,)
clf = LogisticRegression(C=0.316, max_iter=1000)
clf.fit(train_features, train_labels)
print(f"Linear probe: {clf.score(test_features, test_labels):.3f}")

Analysis & Evaluation

Where Your Intuition Breaks

A tempting assumption: CLIP understands image content because it can describe what's in images. In reality, CLIP learns statistical correlations between visual patterns and text tokens, not compositional understanding. It can match "a dog playing fetch" to an image because that pattern appeared in training data, but it fails systematically on negations ("a dog NOT playing fetch"), unusual compositions, and counting ("exactly three dogs"). These are not edge cases — they are evidence of the underlying representational model: CLIP encodes global image-text co-occurrence statistics, not scene-graph semantics. This is why CLIP-based zero-shot classification underperforms on attributes, relationships, and spatial reasoning relative to tasks that rely on surface texture and object category.

CLIP vs supervised models

| | Supervised (ImageNet) | CLIP zero-shot | CLIP linear probe |
|---|---|---|---|
| Training data | 1.2M labeled | 400M image-text | 400M image-text |
| ImageNet Top-1 | 85%+ (best supervised models) | 75.5% (ViT-L/14) | 85.4% |
| Distribution shift | Drops 5–15% | Drops 2–5% | Drops 2–5% |
| New classes | Retrain required | Zero-shot | Linear probe only |

Limitations

Compositional failures: CLIP struggles with counting ("two dogs"), relational reasoning ("A above B"), and fine-grained attributes. Text encoders operate at coarse semantic granularity.

Bias amplification: trained on internet text-image pairs, CLIP inherits societal biases from that distribution. Probe studies show demographic and occupational biases.

Long-tail recognition: zero-shot accuracy drops sharply on fine-grained datasets (CUB-200 birds, Stanford Cars) where class differences are subtle and internet captions are noisy.

Successors and variants

| Model | Change from CLIP | Key improvement |
|---|---|---|
| ALIGN | 1.8B noisier pairs | Scale compensates for noise |
| SigLIP | Sigmoid loss instead of softmax | No global softmax normalization across the batch |
| OpenCLIP | Open replication | Reproducible, multiple scales |
| BLIP-2 | Q-Former bridge to LLM | Enables VQA and captioning |
| CoCa | Captioning + contrastive | Unified generative and contrastive |
