Vision-Language Models
CLIP showed that images and text can share an embedding space. Vision-language models go further: they feed visual tokens directly into a language model, enabling free-form visual question answering, image captioning, chart interpretation, multi-image reasoning, and visual instruction following. The key design question is how to connect the visual encoder to the language model — via a linear projection, a cross-attention mechanism, a learned query bottleneck, or deep interleaving. Each architecture makes different tradeoffs between parameter efficiency, visual detail preserved, and the cost of training.
Theory
A vision encoder (ViT) extracts patch embeddings; the connector bridges the vision and language embedding spaces; the LLM autoregressively generates text, attending over both image and text tokens.
When you show a photo to someone and ask "what's written on that sign?", they process the image and the question together. Vision-language models replicate this: a vision encoder converts the image into patch tokens, a connector projects them into the LLM's embedding space, and the language model generates outputs conditioned on what it sees. The design choices that matter most are in the connector — the bridge between visual and linguistic representations.
The architecture design space
Every VLM has three components:
- Vision encoder: encodes images into patch tokens (typically a CLIP-pretrained ViT)
- Connector: projects visual tokens into the LLM's token space
- Language model: processes the combined visual and text token sequence
Of these, the connector admits the most design freedom; the next sections compare the dominant approaches.
Linear projection (LLaVA-style)
The simplest connector: a learned 2-layer MLP mapping each visual token to the LLM's embedding dimension.
The two-layer MLP projection is necessary because image patch features and text token embeddings live in completely different vector spaces trained on different objectives: the vision encoder (typically CLIP ViT) was trained on image-text contrastive learning, while the LLM embedding space was trained on next-token prediction. Without the projection, visual features have no meaningful geometric relationship to the LLM's vocabulary — the learned alignment between these two representational spaces is what the connector encodes. A single linear layer is often insufficient because the alignment between modalities is nonlinear; the GELU non-linearity in the two-layer MLP provides the capacity to learn this mapping.
For a ViT-L encoder producing 256 patch tokens at 1024 dimensions, this creates 256 visual tokens prepended to the text tokens.
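The shape bookkeeping can be sketched with a toy tensor pass — the dimensions below are illustrative defaults, not tied to any specific checkpoint:

```python
import torch
import torch.nn as nn

# Illustrative dimensions: ViT-L patch features (1024-dim, 256 patches)
# projected into a 4096-dim LLM embedding space.
clip_dim, llm_dim, n_patches, n_text = 1024, 4096, 256, 20

projector = nn.Sequential(          # LLaVA-style 2-layer MLP connector
    nn.Linear(clip_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_feats = torch.randn(1, n_patches, clip_dim)   # vision encoder output
text_embeds = torch.randn(1, n_text, llm_dim)       # LLM token embeddings

visual_tokens = projector(patch_feats)              # (1, 256, 4096)
sequence = torch.cat([visual_tokens, text_embeds], dim=1)
print(sequence.shape)                               # torch.Size([1, 276, 4096])
```

Every image therefore adds `n_patches` positions to the sequence the LLM must attend over, which is where the per-image token cost comes from.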
The LLM attends over this combined sequence with normal causal attention. Simple, fast, and surprisingly effective — LLaVA-1.5 achieves competitive performance with this approach.
Cost: 256 tokens per image is expensive for long conversations. High-resolution images (with dynamic tiling) can produce 2000+ visual tokens.
Q-Former bottleneck (BLIP-2)
BLIP-2 introduces a Querying Transformer (Q-Former) that uses a fixed set of 32 learned query tokens to extract a fixed-length representation from the image, regardless of image resolution.
The Q-Former's queries cross-attend over all visual patch tokens and distill them into 32 output tokens, which are then projected to the LLM's embedding dimension.
Advantage: the LLM sees only 32 tokens regardless of image resolution — efficient and resolution-agnostic.
Disadvantage: the bottleneck may discard fine-grained visual detail needed for OCR or counting.
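A minimal sketch of the query bottleneck, assuming a single cross-attention layer (the real Q-Former also stacks self-attention and feed-forward blocks, and shares weights with a text branch):

```python
import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    """Simplified Q-Former-style bottleneck: a fixed set of learned queries
    cross-attends over all patch tokens and emits a fixed-length output."""
    def __init__(self, n_queries=32, dim=768, n_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N_patches, dim) — N_patches varies with resolution
        B = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        return out  # (B, n_queries, dim) regardless of N_patches

bottleneck = QueryBottleneck()
for n_patches in (256, 1024):
    out = bottleneck(torch.randn(2, n_patches, 768))
    print(out.shape)  # (2, 32, 768) for both inputs
```

The output length is set by the number of learned queries, which is exactly why the bottleneck is resolution-agnostic — and why detail beyond 32 tokens' worth of capacity is lost.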
Flamingo-style cross-attention interleaving
Flamingo (Alayrac et al., DeepMind, 2022) interleaves gated cross-attention layers into the frozen LLM, inserted at regular intervals between its transformer layers.
The tanh gate is initialized to 0, so training starts from the pretrained LLM without disruption. Visual tokens come from a pretrained Perceiver Resampler (similar in spirit to Q-Former).
Key property: Flamingo handles interleaved images and text in the context window — enabling multi-shot visual reasoning across multiple images.
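A sketch of the gated cross-attention idea (simplified — Flamingo's actual blocks also include a gated feed-forward sublayer):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention sketch: the tanh gate starts at 0,
    so at initialization the layer is an identity mapping and the frozen LLM's
    behavior is undisturbed."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (B, L_text, dim); visual_tokens: (B, L_vis, dim)
        attended, _ = self.cross_attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended

layer = GatedCrossAttention()
text = torch.randn(1, 10, 512)
vis = torch.randn(1, 64, 512)
out = layer(text, vis)
print(torch.allclose(out, text))  # True: the zero gate passes text through unchanged
```

As training progresses the gate opens, gradually letting visual information flow into the language stream.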
Training recipe
VLMs are typically trained in stages:
Stage 1 — Connector pretraining: freeze both the vision encoder and LLM; train only the connector on image-caption pairs. Goal: align the visual representation to the LLM's token space.
Stage 2 — Instruction fine-tuning: unfreeze the LLM (and optionally the vision encoder); train on visual instruction-following data (VQA pairs, visual conversations). Goal: develop task-following capability.
Data quality over quantity: LLaVA showed that 158K GPT-4-generated visual instruction-following examples outperform millions of shorter caption pairs.
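The two-stage freezing schedule can be sketched with stand-in modules (the modules below are placeholders for the real ViT, connector, and LLM, not an actual training framework):

```python
import torch.nn as nn

# Stand-ins for illustration: a CLIP ViT, the MLP connector, and the LLM.
vision_encoder = nn.Linear(1024, 1024)
connector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))
llm = nn.Linear(4096, 4096)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 — connector pretraining: only the connector learns.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(connector, True)

# Stage 2 — instruction fine-tuning: unfreeze the LLM as well.
set_trainable(llm, True)

n_trainable = sum(p.numel() for m in (vision_encoder, connector, llm)
                  for p in m.parameters() if p.requires_grad)
print(n_trainable)  # connector + LLM parameters; vision encoder stays frozen
```

Because stage 1 optimizes only the small connector, it is cheap and stable; stage 2 is where the bulk of the compute goes.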
High-resolution handling
ViT processes images at a fixed resolution (e.g., 224×224 or 448×448). High-resolution inputs require dynamic tiling:
- Split the input image into tiles (e.g., a 2×2 or 3×3 grid)
- Encode each tile independently with the ViT
- Concatenate tile tokens + a global thumbnail
- Project through the connector
LLaVA-NeXT with a 2×2 tile grid plus a global thumbnail produces up to 2880 visual tokens from a single image — sufficient for OCR and fine-grained detail, but expensive to process.
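The token arithmetic behind tiling, assuming 576 patch tokens per 336×336 tile (a ViT-L/14 at 336px; exact counts vary by checkpoint):

```python
# Each 336x336 tile encoded by a ViT with 14x14 patches yields (336/14)^2 tokens.
patches_per_tile = (336 // 14) ** 2  # 576

def visual_token_count(grid_rows, grid_cols, include_thumbnail=True):
    """Total visual tokens for a tiled image: one ViT pass per tile,
    plus an optional global low-resolution thumbnail pass."""
    total = grid_rows * grid_cols * patches_per_tile
    if include_thumbnail:
        total += patches_per_tile
    return total

print(visual_token_count(2, 2))  # 2880: four tiles + thumbnail, 5 x 576
print(visual_token_count(3, 3))  # 5760: nine tiles + thumbnail
```

Token count grows linearly with the number of tiles, so high-resolution inputs quickly dominate the LLM's context budget.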
Walkthrough
LLaVA inference
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

image = Image.open(requests.get("https://example.com/chart.jpg", stream=True).raw)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    # Greedy decoding; temperature is irrelevant when do_sample=False.
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
Minimal VLM connector
import torch
import torch.nn as nn

class VLMConnector(nn.Module):
    """2-layer MLP projecting CLIP features to LLM dimension."""
    def __init__(self, clip_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, clip_features):
        # clip_features: (B, N_patches, D_clip)
        return self.proj(clip_features)  # (B, N_patches, D_llm)

def build_input_embeds(text_ids, image_patches, connector, embed_layer, image_token_id):
    """Replace image-token positions with projected visual tokens.

    Assumes each sequence contains one contiguous run of image placeholder tokens.
    Returns a list because sequences may differ in length after splicing.
    """
    B, L = text_ids.shape
    text_embeds = embed_layer(text_ids)
    visual_tokens = connector(image_patches)
    combined = []
    for b in range(B):
        pos = (text_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        combined.append(torch.cat([
            text_embeds[b, :pos[0]],
            visual_tokens[b],
            text_embeds[b, pos[-1] + 1:]
        ], dim=0))
    return combined
Analysis & Evaluation
Where Your Intuition Breaks
"A better vision encoder always produces a better VLM" — not quite. Encoder quality is one factor, but the connector architecture and the instruction-tuning data are at least as important. LLaVA-1.5 (2023) demonstrated this directly: a simple 2-layer MLP connector with carefully curated instruction-tuning data outperformed models with significantly more complex Q-Former architectures on standard VLM benchmarks. The Q-Former bottleneck (32 tokens regardless of image resolution) is more efficient but can discard fine-grained spatial detail needed for OCR, counting, and precise localization — tasks where simple MLP connectors with higher token counts perform better. The right connector depends on the downstream task, not on architectural sophistication alone.
Architecture comparison
| | LLaVA (linear proj) | BLIP-2 (Q-Former) | Flamingo (cross-attn) |
|---|---|---|---|
| Visual tokens to LLM | 256–2880 | 32 (fixed) | Via cross-attention |
| Fine-grained detail | High | Low-medium | Medium |
| Training cost | Low | Medium | High (new layers) |
| Few-shot visual reasoning | Limited | Limited | Strong |
| Resolution handling | Tiling | Resolution-agnostic | Resolution-agnostic |
Benchmark landscape
| Benchmark | What it tests |
|---|---|
| VQAv2 | General visual question answering |
| TextVQA | OCR and reasoning on scene text |
| MMBench | Multi-task structured evaluation |
| MMMU | College-level multimodal reasoning |
| ChartQA | Chart and plot understanding |
Common failure modes
Hallucination: VLMs confidently describe objects not present in the image. CLIP-based encoders optimize similarity, not accuracy — they can match images to plausible-but-wrong descriptions.
Counting and spatial reasoning: "How many red circles are to the left of the blue square?" remains difficult — requires precise token-level localization that patch-level visual encoders don't naturally provide.
Document and chart understanding: requires recognizing small text within images — needs high resolution (448px or more per tile) and OCR-specific training data.