Neural-Path/Notes

Vision-Language Models

CLIP showed that images and text can share an embedding space. Vision-language models go further: they feed visual tokens directly into a language model, enabling free-form visual question answering, image captioning, chart interpretation, multi-image reasoning, and visual instruction following. The key design question is how to connect the visual encoder to the language model — via a linear projection, a cross-attention mechanism, a learned query bottleneck, or deep interleaving. Each architecture makes different tradeoffs between parameter efficiency, visual detail preserved, and the cost of training.

Theory

VLM Architecture
[Figure: VLM architecture. Image (H×W×3) → vision encoder → 256 patch tokens → MLP projection → 256 LLM-dimension embeddings → LLM decoder over the combined token sequence (image, then text) → "a cat sitting on a table".]
LLaVA-style: a simple MLP projection maps vision tokens directly into the LLM embedding space.

Vision encoder (ViT) extracts patch embeddings. The connector bridges vision and language embedding spaces. LLM autoregressively generates text attending over both image and text tokens.

When you show a photo to someone and ask "what's written on that sign?", they process the image and the question together. Vision-language models replicate this: a vision encoder converts the image into patch tokens, a connector projects them into the LLM's embedding space, and the language model generates outputs conditioned on what it sees. The design choices that matter most are in the connector — the bridge between visual and linguistic representations.

The architecture design space

Every VLM has three components:

  1. Vision encoder: encodes images into patch tokens (typically a CLIP-pretrained ViT)
  2. Connector: projects visual tokens into the LLM's token space
  3. Language model: processes the combined visual and text token sequence


Linear projection (LLaVA-style)

The simplest connector: a learned 2-layer MLP mapping each visual token to the LLM's embedding dimension.

h_i = W_2 \, \text{GELU}(W_1 \, v_i + b_1) + b_2, \quad v_i \in \mathbb{R}^{D_{\text{vis}}}

The two-layer MLP projection is necessary because image patch features and text token embeddings live in completely different vector spaces trained on different objectives: the vision encoder (typically CLIP ViT) was trained on image-text contrastive learning, while the LLM embedding space was trained on next-token prediction. Without the projection, visual features have no meaningful geometric relationship to the LLM's vocabulary — the learned alignment between these two representational spaces is what the connector encodes. A single linear layer is often insufficient because the alignment between modalities is nonlinear; the GELU non-linearity in the two-layer MLP provides the capacity to learn this mapping.

For a ViT-L encoder producing 256 patch tokens at 1024 dimensions, this creates 256 visual tokens prepended to the text tokens: [\, h_1, \ldots, h_{256}, \, t_1, \ldots, t_n \,]

The LLM attends over this combined sequence with normal causal attention. Simple, fast, and surprisingly effective — LLaVA-1.5 achieves competitive performance with this approach.

Cost: 256 tokens per image is expensive for long conversations. High-resolution images (with dynamic tiling) can produce 2000+ visual tokens.
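These counts follow directly from the patch geometry: a ViT with patch size p on an s×s input yields (s/p)² tokens. A quick sanity check of the numbers used in this section:

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Visual token count for a square ViT input: (s / p) ** 2."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

print(num_patch_tokens(224, 14))       # ViT-L/14 at 224 px -> 256 tokens
print(num_patch_tokens(336, 14))       # at 336 px -> 576 tokens
# A 2x2 tile grid plus one global thumbnail at 336 px:
print(5 * num_patch_tokens(336, 14))   # 2880 tokens
```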

Q-Former bottleneck (BLIP-2)

BLIP-2 introduces a Querying Transformer (Q-Former) that uses Q learned query tokens to extract fixed-length representations from the image, regardless of image resolution:

Z = \text{QFormer}(q_1, \ldots, q_Q; \, V_{\text{frozen}})

The Q-Former uses cross-attention to attend over all visual patch tokens and distills them into Q = 32 output tokens. This bottleneck is then projected to the LLM dimension.

Advantage: the LLM sees only Q tokens regardless of image resolution — efficient and resolution-agnostic.

Disadvantage: the bottleneck may discard fine-grained visual detail needed for OCR or counting.
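A minimal sketch of the query-bottleneck idea, simplified to a single cross-attention layer (the real Q-Former stacks self-attention and cross-attention blocks and shares weights with a text branch; names and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    """Q learned queries cross-attend over patch tokens,
    yielding a fixed-length output whatever the input length."""
    def __init__(self, num_queries=32, vis_dim=1024, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.kv_proj = nn.Linear(vis_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens):            # (B, N, vis_dim), any N
        B = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        kv = self.kv_proj(patch_tokens)
        out, _ = self.attn(q, kv, kv)           # (B, num_queries, dim)
        return out

# Output length is 32 no matter how many patch tokens come in:
z = QueryBottleneck()(torch.randn(2, 256, 1024))
print(z.shape)  # torch.Size([2, 32, 768])
```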

Flamingo-style cross-attention interleaving

Flamingo (Alayrac et al., DeepMind, 2022) interleaves gated cross-attention layers into the frozen LLM, one before every k-th transformer layer:

h_l = h_l^{\text{LLM}} + \tanh(\alpha) \cdot \text{CrossAttn}(h_l^{\text{LLM}}, V)

The tanh gate \alpha is initialized to 0, so training starts from the pretrained LLM without disruption. Visual tokens V come from a Perceiver Resampler (similar in spirit to the Q-Former).

Key property: Flamingo handles interleaved images and text in the context window — enabling multi-shot visual reasoning across multiple images.
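The gating equation above can be sketched as a single PyTorch module (dimensions are illustrative; the actual Flamingo block also adds a gated feed-forward sublayer):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """h <- h + tanh(alpha) * CrossAttn(h, V), with alpha initialized to 0
    so the block is an exact no-op at the start of training."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity

    def forward(self, h, visual_tokens):
        attended, _ = self.attn(h, visual_tokens, visual_tokens)
        return h + torch.tanh(self.alpha) * attended

h = torch.randn(1, 10, 512)     # LLM hidden states
v = torch.randn(1, 32, 512)     # resampled visual tokens
block = GatedCrossAttention()
print(torch.allclose(block(h, v), h))  # True: the gate starts closed
```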

Training recipe

VLMs are typically trained in stages:

Stage 1 — Connector pretraining: freeze both the vision encoder and LLM; train only the connector on image-caption pairs. Goal: align the visual representation to the LLM's token space.

Stage 2 — Instruction fine-tuning: unfreeze the LLM (and optionally the vision encoder); train on visual instruction-following data (VQA pairs, visual conversations). Goal: develop task-following capability.

Data quality over quantity: LLaVA showed that 150K GPT-4-generated visual conversations outperform millions of shorter caption pairs.
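The staged freezing can be sketched with `requires_grad` flags; toy `nn.Linear` modules stand in for the real components here:

```python
import torch.nn as nn

def set_stage(vision_encoder, connector, llm, stage: int):
    """Stage 1 trains only the connector; stage 2 also unfreezes the LLM.
    (The vision encoder stays frozen in both stages in this sketch.)"""
    for p in vision_encoder.parameters():
        p.requires_grad_(False)
    for p in connector.parameters():
        p.requires_grad_(True)
    for p in llm.parameters():
        p.requires_grad_(stage >= 2)

enc, conn, llm = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
set_stage(enc, conn, llm, stage=1)
print(any(p.requires_grad for p in llm.parameters()))   # False
set_stage(enc, conn, llm, stage=2)
print(any(p.requires_grad for p in llm.parameters()))   # True
```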

High-resolution handling

ViT processes images at a fixed resolution (e.g., 224×224 or 448×448). High-resolution inputs require dynamic tiling:

  1. Split the input image into overlapping tiles (e.g., 2×2 or 3×3 grid)
  2. Encode each tile independently with the ViT
  3. Concatenate tile tokens + a global thumbnail
  4. Project through the connector

LLaVA-NeXT's tiling (up to a 2×2 grid of 336 px crops plus a global thumbnail, 576 tokens each) produces up to 2880 visual tokens from a single image — sufficient for OCR and fine-grained detail, but expensive to process.
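A minimal sketch of the tiling step, assuming square tiles and bilinear resizing (real implementations choose the grid from the image's aspect ratio and may pad instead of stretch):

```python
import torch
import torch.nn.functional as F

def make_tiles(img, grid=2, tile=336):
    """Resize to a grid x grid mosaic, slice it into tiles, and append a
    downscaled global thumbnail. img: (3, H, W) float tensor."""
    mosaic = F.interpolate(img.unsqueeze(0), size=(grid * tile, grid * tile),
                           mode="bilinear", align_corners=False)[0]
    tiles = [mosaic[:, r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(grid) for c in range(grid)]
    thumb = F.interpolate(img.unsqueeze(0), size=(tile, tile),
                          mode="bilinear", align_corners=False)[0]
    tiles.append(thumb)                  # global context view
    return torch.stack(tiles)            # (grid*grid + 1, 3, tile, tile)

batch = make_tiles(torch.rand(3, 768, 1024), grid=2)
print(batch.shape)  # torch.Size([5, 3, 336, 336]) -> 5 crops, 5 x 576 tokens
```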

Walkthrough

LLaVA inference

```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

image = Image.open(requests.get("https://example.com/chart.jpg", stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
response = processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

Minimal VLM connector

```python
import torch
import torch.nn as nn

class VLMConnector(nn.Module):
    """2-layer MLP projecting CLIP features to the LLM dimension."""
    def __init__(self, clip_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, clip_features):
        # clip_features: (B, N_patches, D_clip)
        return self.proj(clip_features)   # (B, N_patches, D_llm)


def build_input_embeds(text_ids, image_patches, connector, embed_layer, image_token_id):
    """Replace image-token positions with projected visual tokens.

    Assumes each sample contains one contiguous run of image placeholder
    tokens. Returns a list of (L_b, D_llm) tensors, since sequence lengths
    differ across the batch after substitution.
    """
    B, L = text_ids.shape
    text_embeds = embed_layer(text_ids)
    visual_tokens = connector(image_patches)

    combined = []
    for b in range(B):
        pos = (text_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        combined.append(torch.cat([
            text_embeds[b, :pos[0]],          # text before the image
            visual_tokens[b],                 # projected visual tokens
            text_embeds[b, pos[-1] + 1:],     # text after the image
        ], dim=0))
    return combined
```

Analysis & Evaluation

Where Your Intuition Breaks

"A better vision encoder always produces a better VLM." In practice, encoder quality is only one factor; the connector architecture and the instruction-tuning data matter at least as much. LLaVA-1.5 (2023) demonstrated this directly: a simple 2-layer MLP connector with carefully curated instruction-tuning data outperformed models with significantly more complex Q-Former architectures on standard VLM benchmarks. The Q-Former bottleneck (32 tokens regardless of image resolution) is more efficient but can discard fine-grained spatial detail needed for OCR, counting, and precise localization, tasks where simple MLP connectors with higher token counts perform better. The right connector depends on the downstream task, not on architectural sophistication.

Architecture comparison

| Property | LLaVA (linear proj) | BLIP-2 (Q-Former) | Flamingo (cross-attn) |
| --- | --- | --- | --- |
| Visual tokens to LLM | 256–2880 | 32 (fixed) | Via cross-attention |
| Fine-grained detail | High | Low–medium | Medium |
| Training cost | Low | Medium | High (new layers) |
| Few-shot visual | Limited | Limited | Strong |
| Resolution handling | Tiling | Resolution-agnostic | Resolution-agnostic |

Benchmark landscape

| Benchmark | What it tests |
| --- | --- |
| VQAv2 | General visual question answering |
| TextVQA | OCR and reasoning on scene text |
| MMBench | Multi-task structured evaluation |
| MMMU | College-level multimodal reasoning |
| ChartQA | Chart and plot understanding |

Common failure modes

Hallucination: VLMs confidently describe objects not present in the image. CLIP-based encoders optimize similarity, not accuracy — they can match images to plausible-but-wrong descriptions.

Counting and spatial reasoning: "How many red circles are to the left of the blue square?" remains difficult — requires precise token-level localization that patch-level visual encoders don't naturally provide.

Document and chart understanding: requires recognizing small text within images — needs high resolution (448px or more per tile) and OCR-specific training data.
