
CLIP & Contrastive Learning

Supervised image classifiers are brittle: they recognize only the categories they were trained on, and retraining for new classes requires labeled data. CLIP (Contrastive Language-Image Pre-Training) takes a different approach — instead of predicting fixed labels, it learns a joint embedding space where matching image-text pairs are close and non-matching pairs are far apart. Trained on 400 million image-text pairs from the internet, CLIP achieves zero-shot classification competitive with supervised ImageNet models by framing classification as text retrieval. The contrastive objective that makes this possible — InfoNCE loss — is now the backbone of self-supervised learning across images, audio, video, and code.

Theory

CLIP Similarity Matrix
[Figure: 5×5 CLIP similarity matrix for five image–text pairs (dog, cat, car, tree, sky). Diagonal entries (0.89–0.93) are positive pairs; off-diagonal entries (0.12–0.65) are in-batch negatives.]

CLIP maximizes cosine similarity of matched image–text pairs (diagonal) while pushing negatives apart. Temperature τ controls how sharply the softmax concentrates on the diagonal.

CLIP trains two encoders — one for images, one for text — to produce matching vectors for matching image-caption pairs. The key insight is that you don't need manual labels: internet image-caption pairs (ALT text, photo captions) provide billions of supervision examples for free. Every image is described by its matching caption and must be distinguished from all other captions in the batch — with 32,768 images per batch, each image has 32,767 negative text candidates.

Contrastive learning setup

CLIP jointly trains an image encoder $f_I$ and a text encoder $f_T$. For a batch of $N$ (image, text) pairs:

  • Image embeddings: $\mathbf{I}_i = f_I(x_i^{\text{img}}) / \|f_I(x_i^{\text{img}})\|$
  • Text embeddings: $\mathbf{T}_i = f_T(x_i^{\text{txt}}) / \|f_T(x_i^{\text{txt}})\|$

Both are L2-normalized to the unit sphere. Similarity is measured by dot product (equivalent to cosine similarity after normalization): $S_{ij} = \mathbf{I}_i \cdot \mathbf{T}_j$

The L2 normalization is not optional: without it, the similarity scores are dominated by the magnitude of the vectors rather than their direction. Magnitude has no cross-modal semantic meaning — a large image embedding and a large text embedding are not necessarily more similar than small ones. Normalizing both encoders' outputs to the unit sphere forces the similarity signal to come entirely from the angle between representations, which is what the contrastive objective is actually training.

This produces an $N \times N$ similarity matrix. Diagonal entries ($S_{ii}$) are matched pairs; off-diagonal entries are negatives from the same batch.
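The setup above is small enough to sketch directly. Here random vectors stand in for encoder outputs (an assumption — these are not real CLIP features); the point is the shape and the effect of normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8

# Random vectors stand in for the image/text encoder outputs
I = rng.normal(size=(N, D))
T = rng.normal(size=(N, D))

# L2-normalize each row onto the unit sphere
I = I / np.linalg.norm(I, axis=1, keepdims=True)
T = T / np.linalg.norm(T, axis=1, keepdims=True)

# S[i, j] = I_i . T_j -- the full N x N similarity matrix
S = I @ T.T
```

After normalization every entry of `S` is a cosine similarity in $[-1, 1]$; the diagonal holds the matched-pair scores the loss will push up.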

InfoNCE loss

The InfoNCE (Noise-Contrastive Estimation) loss maximizes the likelihood of the correct pairing for each image:

$$\mathcal{L}_I = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^N \exp(S_{ij} / \tau)}$$

and symmetrically for text:

$$\mathcal{L}_T = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^N \exp(S_{ji} / \tau)}$$

Total loss: $\mathcal{L} = (\mathcal{L}_I + \mathcal{L}_T) / 2$.

The temperature $\tau$ is a learnable scalar. Small $\tau$ sharpens the distribution (confident predictions); large $\tau$ softens it. CLIP initializes $\tau = 0.07$.
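The effect of $\tau$ is easy to see numerically. A small sketch (the similarity values are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One row of the similarity matrix: the match (0.9) plus three negatives
sims = np.array([0.9, 0.6, 0.2, 0.1])

sharp = softmax(sims / 0.07)  # CLIP's initial temperature
soft = softmax(sims / 1.0)    # high temperature, for comparison
```

At $\tau = 0.07$ nearly all probability mass lands on the matched pair, while at $\tau = 1$ the distribution stays diffuse — which is why the learned temperature matters so much to how hard the loss pushes on near-miss negatives.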

Intuition: for each image $i$, the loss is a softmax cross-entropy where the "correct class" is its matching text. With $N = 32768$ (CLIP's batch size), there are 32,767 negatives per positive — making the task very challenging and the representations very discriminative.

Why large batches matter

InfoNCE uses in-batch negatives. More negatives = harder task = better representations. With $N$ samples per batch, each positive is contrasted against $N - 1$ negatives:

$$\mathcal{L}_I = -\mathbb{E}_i\left[\log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^N \exp(S_{ij} / \tau)}\right]$$

CLIP used 32,768 samples per batch across 256 GPUs. This is why CLIP-scale training was infeasible at smaller compute budgets before batching tricks (gradient caching, memory-efficient attention) were developed.

Gradient caching: compute embeddings in small chunks with gradients disabled, cache them, compute the loss over the full $N \times N$ similarity matrix, then re-encode each chunk with gradients enabled and backpropagate using the cached loss gradients. This decouples batch size from per-GPU memory.
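A minimal sketch of the caching side of this idea, with tiny linear layers standing in for the real image/text transformers (an assumption for brevity); the second, gradient-carrying pass is omitted:

```python
import torch
import torch.nn.functional as F

# Toy linear "encoders" stand in for the real ViT / text transformer
img_enc = torch.nn.Linear(16, 8)
txt_enc = torch.nn.Linear(16, 8)

images = torch.randn(64, 16)
texts = torch.randn(64, 16)

# Pass 1: embed in small chunks with gradients off, caching only embeddings.
# Peak activation memory scales with the chunk size (8), not the batch (64).
with torch.no_grad():
    img_emb = torch.cat([F.normalize(img_enc(c), dim=-1) for c in images.split(8)])
    txt_emb = torch.cat([F.normalize(txt_enc(c), dim=-1) for c in texts.split(8)])

# The full similarity matrix is built from the cached embeddings
logits = img_emb @ txt_emb.T / 0.07

# (Pass 2, omitted: re-encode each chunk with gradients enabled and
# backpropagate the cached per-embedding loss gradients into the encoders.)
```

This is only a sketch of the memory argument, not a full gradient-caching implementation.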

Zero-shot classification

CLIP converts any classification task into an image-text matching problem. For a dataset with classes $\{c_1, \ldots, c_K\}$:

  1. Encode each class as a text prompt: $\mathbf{T}_k = f_T(\text{"a photo of a } c_k\text{"})$
  2. Encode the query image: $\mathbf{I} = f_I(x)$
  3. Predict: $\hat{k} = \arg\max_k \, \mathbf{I} \cdot \mathbf{T}_k$

Prompt engineering matters: "a photo of a [class]" outperforms just "[class]" by 3–4% on ImageNet zero-shot. Ensembling 80 different prompt templates further improves by 1–2%.
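The ensembling mechanics are simple: embed each template filled with the class name, average, and re-normalize. A minimal sketch — `embed_text` here is a hypothetical stand-in for the CLIP text encoder (a deterministic random unit vector per string), not a real model:

```python
import numpy as np

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def embed_text(s, D=8):
    # Hypothetical stand-in for the CLIP text encoder (assumption):
    # a deterministic-per-run random unit vector for each string
    rng = np.random.default_rng(abs(hash(s)) % (2**32))
    v = rng.normal(size=D)
    return v / np.linalg.norm(v)

def class_embedding(name):
    # Ensemble: average the per-template embeddings, then re-normalize
    e = np.mean([embed_text(t.format(name)) for t in templates], axis=0)
    return e / np.linalg.norm(e)

cat_vec = class_embedding("cat")
```

The re-normalization step matters: the mean of unit vectors is not itself a unit vector, and downstream dot products assume embeddings on the unit sphere.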

Zero-shot ImageNet accuracy: CLIP ViT-L/14 achieves 75.5% top-1 — matching a supervised ResNet-50 trained on 1.2M labeled images, with zero ImageNet training examples.

Embedding geometry

The CLIP embedding space has useful geometric properties:

Linear separability: despite no supervision for specific classes, linear probing (training a linear classifier on frozen CLIP features) achieves 85.4% ImageNet accuracy — near the state of the art for fully supervised models.

Compositionality: "a red cube on a blue sphere" and "a blue cube on a red sphere" produce different embeddings, enabling spatial and relational reasoning.

Distribution shift robustness: CLIP's zero-shot accuracy on ImageNet distribution shifts (ImageNet-V2, -Sketch, -A, -R) drops far less than supervised models, because it was trained on diverse internet data rather than ImageNet's specific distribution.

Walkthrough

CLIP zero-shot classification

python
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
 
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
 
class_names = ["a photo of a cat", "a photo of a dog",
               "a photo of a bird", "a photo of a car"]
 
inputs = processor(text=class_names, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)
 
for cls, prob in zip(class_names, probs[0]):
    print(f"{cls}: {prob:.3f}")

InfoNCE loss implementation

python
import torch
import torch.nn.functional as F
 
def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    image_embeddings: (N, D) L2-normalized
    text_embeddings:  (N, D) L2-normalized
    """
    logits = torch.matmul(image_embeddings, text_embeddings.T) / temperature
    labels = torch.arange(len(logits), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.T, labels)
    return (loss_i + loss_t) / 2
 
# At random initialization, loss ~= log(N)
N, D = 64, 512
img_emb = F.normalize(torch.randn(N, D), dim=-1)
txt_emb = F.normalize(torch.randn(N, D), dim=-1)
print(f"Random loss: {clip_loss(img_emb, txt_emb):.3f}  (expected ~{torch.log(torch.tensor(float(N))):.3f})")

Linear probe evaluation

python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression
 
def extract_clip_features(model, processor, images, batch_size=64):
    all_features = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i+batch_size]
        inputs = processor(images=batch, return_tensors="pt")
        with torch.no_grad():
            features = model.get_image_features(**inputs)
            features = F.normalize(features, dim=-1)
        all_features.append(features.cpu().numpy())
    return np.concatenate(all_features)
 
# train_features: (N, D), train_labels: (N,)
clf = LogisticRegression(C=0.316, max_iter=1000)
clf.fit(train_features, train_labels)
print(f"Linear probe: {clf.score(test_features, test_labels):.3f}")

Analysis & Evaluation

Where Your Intuition Breaks

A tempting assumption: CLIP understands image content because it can describe what's in images. In reality, CLIP learns statistical correlations between visual patterns and text tokens, not compositional understanding. It can match "a dog playing fetch" to an image because that pattern appeared in training data, but it fails systematically on negations ("a dog NOT playing fetch"), unusual compositions, and counting ("exactly three dogs"). These are not edge cases — they are evidence of the underlying representational model: CLIP encodes global image-text co-occurrence statistics, not scene-graph semantics. This is why CLIP-based zero-shot classification underperforms on attributes, relationships, and spatial reasoning relative to tasks that rely on surface texture and object category.

CLIP vs supervised models

| | Supervised (ImageNet) | CLIP zero-shot | CLIP linear probe |
|---|---|---|---|
| Training data | 1.2M labeled | 400M image-text | 400M image-text |
| ImageNet Top-1 | 85%+ (best supervised models) | 75.5% (ViT-L/14) | 85.4% |
| Distribution shift | Drops 5–15% | Drops 2–5% | Drops 2–5% |
| New classes | Retrain required | Zero-shot | Linear probe only |

Limitations

Compositional failures: CLIP struggles with counting ("two dogs"), relational reasoning ("A above B"), and fine-grained attributes. Text encoders operate at coarse semantic granularity.

Bias amplification: trained on internet text-image pairs, CLIP inherits societal biases from that distribution. Probe studies show demographic and occupational biases.

Long-tail recognition: zero-shot accuracy drops sharply on fine-grained datasets (CUB-200 birds, Stanford Cars) where class differences are subtle and internet captions are noisy.

Successors and variants

| Model | Change from CLIP | Key improvement |
|---|---|---|
| ALIGN | 1.8B noisier pairs | Scale compensates for noise |
| SigLIP | Sigmoid loss instead of softmax | No global softmax normalization across the batch |
| OpenCLIP | Open replication | Reproducible, multiple scales |
| BLIP-2 | Q-Former bridge to LLM | Enables VQA and captioning |
| CoCa | Captioning + contrastive | Unified generative and contrastive |
