Neural-Path/Notes

Object Detection & Segmentation

Classification asks "what is in this image?" Detection adds "where?" and segmentation adds "exactly which pixels?" These tasks demand that the model produce precise spatial outputs — bounding boxes or masks — not just a single label. The evolution from anchor-based detectors to anchor-free approaches to transformer-based detection reflects a broader shift: from handcrafted spatial priors to learned representations. Understanding DETR's query-based detection and SAM's promptable segmentation shows how the same transformer machinery that powers language models and ViT can be adapted for dense spatial prediction.

Theory

Detection Paradigms
[Figure: Grid-based detection (YOLO-style) predicts at each grid cell and applies NMS to remove duplicates; DETR performs set prediction with N queries and Hungarian matching, yielding exactly one prediction per object with no NMS.]

Grid methods predict at every cell and require NMS post-processing. DETR treats detection as set prediction — N learned queries are matched to ground truth objects via Hungarian algorithm, producing exactly one prediction per object with no duplicate removal needed.

Object detection solves two problems at once: localization (where is each object?) and classification (what is it?). Traditional detectors handled this with sliding windows, anchors, and non-maximum suppression — many engineering choices that each require careful tuning. DETR reframes the whole problem as set prediction: output a fixed set of predictions, then find the best one-to-one assignment to ground truth using the Hungarian algorithm.
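To make the NMS step concrete, here is a minimal greedy NMS sketch in NumPy. The boxes and scores are invented for illustration:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression (the post-processing step DETR avoids).
    boxes: (N, 4) array of [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top-scoring box against the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # drop everything that overlaps the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 140., 140.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```

Every threshold and tie-breaking choice here is a tuning knob, which is exactly the kind of handcrafted machinery DETR's set prediction removes.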

Detection formulations

Two-stage (R-CNN family): first propose candidate regions, then classify each. High accuracy, slow inference.

One-stage (YOLO family): directly predict boxes and classes from a feature grid. Fast, less accurate on small objects.

Anchor-free: predict box offsets from object centers rather than predefined anchor aspect ratios. FCOS, CenterNet.

Query-based (DETR): learn a fixed set of object queries that attend over image features; each query predicts one box. No anchors, no NMS.

Bipartite matching loss (DETR)

DETR (Carion et al., 2020) frames detection as a set prediction problem. Given $N$ learned object queries, DETR outputs $N$ (class, box) predictions, then finds the optimal one-to-one assignment between predictions and ground-truth objects.

Hungarian algorithm: find the permutation $\hat{\sigma}$ minimizing the total matching cost:

$$\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i=1}^N \mathcal{C}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$$

where $\mathcal{C}_{\text{match}}$ combines classification probability and box distance.

The Hungarian matching is forced by the one-to-one assignment requirement: each ground-truth object should be predicted by exactly one query, and each query should be responsible for exactly one ground-truth object. Any one-to-many assignment would produce duplicate predictions, which is precisely what non-maximum suppression (NMS) was designed to remove in traditional detectors. By encoding the one-to-one constraint directly in the training loss via bipartite matching, DETR eliminates the need for NMS entirely — the loss function ensures duplicates are penalized, not post-processed away.
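The matching itself is just a minimum-cost assignment. A brute-force sketch over a made-up 3×3 cost matrix shows the idea (real implementations use `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

# Toy matching cost: cost[i][j] = C_match(ground-truth i, prediction j).
# Values are invented for illustration.
cost = [
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.9],
    [0.8, 0.8, 0.3],
]
n = len(cost)

# Try every permutation sigma and keep the one minimizing the total
# matching cost (exponential, but fine for tiny n; Hungarian is O(n^3)).
best = min(permutations(range(n)),
           key=lambda sigma: sum(cost[i][sigma[i]] for i in range(n)))
total = sum(cost[i][best[i]] for i in range(n))
print(best, round(total, 3))  # (0, 1, 2) 0.6
```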

Once matched, the set prediction loss is:

$$\mathcal{L} = \sum_{i=1}^N \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}[c_i \neq \varnothing]\, \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

Box regression loss combines an L1 term with Generalized IoU (GIoU), whose scale-invariance compensates for L1's sensitivity to box size:

$$\mathcal{L}_{\text{box}} = \lambda_1 \|b_i - \hat{b}\|_1 + \lambda_2\, \mathcal{L}_{\text{GIoU}}(b_i, \hat{b})$$

GIoU penalizes non-overlapping boxes more smoothly than vanilla IoU.
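A minimal GIoU sketch makes the difference visible (box values are assumptions for illustration): for disjoint boxes, IoU is flat at zero and gives no gradient signal, while GIoU goes negative as the boxes move apart:

```python
def giou(a, b):
    """Generalized IoU for boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest box C enclosing both a and b
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c

print(round(giou([0, 0, 10, 10], [5, 0, 15, 10]), 3))   # 0.333 (overlapping: equals IoU here)
print(round(giou([0, 0, 10, 10], [20, 0, 30, 10]), 3))  # -0.333 (disjoint: enclosing-box penalty)
```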

DETR architecture

  1. CNN/ViT backbone: extracts spatial features $F \in \mathbb{R}^{H' \times W' \times C}$
  2. Transformer encoder: flatten $F$ to a sequence, add 2D positional encodings, apply self-attention
  3. Transformer decoder: $N = 100$ learned object queries cross-attend over encoder output
  4. FFN prediction heads: each query outputs (class distribution, normalized box coordinates)

Each query specializes to detect objects in specific regions or of specific sizes — this emerges from training, not from handcrafted anchors.
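The four stages can be traced as a shape walkthrough. This is an illustrative sketch with untrained layers and invented dimensions, not the real DETR implementation:

```python
import torch
import torch.nn as nn

C, Hp, Wp, N, num_classes = 256, 25, 34, 100, 91   # assumed dims (COCO-like)

features = torch.randn(1, C, Hp, Wp)               # 1. backbone output F
src = features.flatten(2).permute(2, 0, 1)         # 2. flatten to (H'*W', batch, C)
pos = torch.randn(Hp * Wp, 1, C)                   #    stand-in for 2D positional encodings

transformer = nn.Transformer(d_model=C, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2)
queries = torch.randn(N, 1, C)                     # 3. learned object queries (random here)

with torch.no_grad():
    hs = transformer(src + pos, queries)               #    one embedding per query
    class_logits = nn.Linear(C, num_classes + 1)(hs)   # 4. heads: +1 class for "no object"
    boxes = nn.Linear(C, 4)(hs).sigmoid()              #    normalized (cx, cy, w, h)

print(hs.shape, class_logits.shape, boxes.shape)
# torch.Size([100, 1, 256]) torch.Size([100, 1, 92]) torch.Size([100, 1, 4])
```

Note the decoder output is indexed by query, not by spatial location: each of the 100 queries commits to at most one object, which is what makes the bipartite matching loss well-defined.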

IoU-based metrics

Intersection over Union for predicted box $\hat{B}$ and ground-truth box $B$:

$$\text{IoU} = \frac{|\hat{B} \cap B|}{|\hat{B} \cup B|}$$

Average Precision (AP): area under the precision-recall curve at an IoU threshold. COCO AP averages over thresholds 0.5:0.05:0.95.

mAP: mean over all classes. Primary metric for detection benchmarks.
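A quick numeric check shows how the threshold bites (box values are invented for illustration): a prediction offset by about 5 px from a 100×100 ground-truth box still scores a fairly high IoU:

```python
def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# ground truth vs a prediction shifted by ~5 px
val = iou([100, 100, 200, 200], [105, 95, 205, 205])
print(round(val, 3))  # 0.826: a true positive at both the 0.5 and 0.75 thresholds
```

Under COCO's 0.5:0.05:0.95 averaging, this prediction counts as a hit at thresholds up to 0.80 but a miss at 0.85 and above, which is why small localization errors still cost AP.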

Promptable segmentation: SAM

SAM (Segment Anything Model, Kirillov et al., 2023) separates segmentation from semantics. It segments any object described by:

  • Points: positive/negative clicks on the object
  • Boxes: bounding box prompts
  • Masks: coarse masks refined by the model

Architecture:

  • Heavyweight image encoder (ViT-H, 632M params): runs once per image, caches the embedding
  • Lightweight prompt encoder: encodes points/boxes as sparse embeddings
  • Mask decoder (2-layer transformer): cross-attends prompt to image embedding, outputs masks and confidence scores

Ambiguity handling: for an ambiguous prompt, SAM outputs multiple mask hypotheses at different granularities (object part, full object, containing region), ranked by confidence.

Walkthrough

DETR inference

```python
from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image
import requests

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# rescale normalized boxes to original image coordinates, keep scores > 0.9
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"{model.config.id2label[label.item()]}: {score:.3f} at {box}")
```

SAM with point prompts

```python
from segment_anything import sam_model_registry, SamPredictor
import numpy as np
import cv2

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

image_rgb = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)   # runs image encoder once, cached

input_point = np.array([[500, 375]])
input_label = np.array([1])   # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,   # 3 masks at different granularities
)
print(f"Masks: {masks.shape}, scores: {scores.round(3)}")
```

mAP evaluation

```python
from torchmetrics.detection.mean_ap import MeanAveragePrecision
import torch

metric = MeanAveragePrecision(iou_type="bbox")

preds = [{
    "boxes":  torch.tensor([[100., 100., 200., 200.], [50., 50., 150., 180.]]),
    "labels": torch.tensor([0, 1]),
    "scores": torch.tensor([0.95, 0.87]),
}]
targets = [{
    "boxes":  torch.tensor([[105., 95., 205., 205.]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
result = metric.compute()
print(f"mAP: {result['map']:.3f}  AP50: {result['map_50']:.3f}")
```

Analysis & Evaluation

Where Your Intuition Breaks

A tempting conclusion: end-to-end detectors like DETR have made two-stage detectors obsolete. In practice, two-stage detectors (Faster R-CNN family) still outperform DETR-style models on small objects and dense scenes — crowded pedestrian detection, satellite imagery, and medical imaging. DETR's attention-based queries converge slowly and struggle with small objects because the attention heads distribute capacity across all object scales, while anchor-based methods can be tuned per scale. The right detector depends on the task: DETR and its variants (DINO, RT-DETR) excel at general object detection; specialized two-stage detectors remain competitive in domains with high object density or extreme scale variation.

Detector comparison

| | Faster R-CNN | YOLOv8 | DETR | RT-DETR |
|---|---|---|---|---|
| Paradigm | Two-stage, anchor | One-stage, anchor-free | Query-based | Query-based, fast |
| Speed (FPS) | ~15 | 160+ | 28 | 114 |
| COCO AP | 37.4 | 53.9 | 42.0 | 53.1 |
| NMS required | Yes | Yes | No | No |
| Training | 12 epochs | Fast | 500 epochs | Moderate |

When to use what

| Task | Recommended |
|---|---|
| Real-time detection (edge) | YOLOv8n/s |
| High-accuracy detection | DINO (improved DETR) |
| Instance segmentation | Mask R-CNN, YOLOv8-seg |
| Interactive segmentation | SAM / SAM 2 |
| Open-vocabulary detection | Grounding DINO |
| Video object tracking | SAM 2 |

Key pitfalls

DETR slow convergence: bipartite matching requires ~500 epochs on COCO vs 12 for Faster R-CNN. Deformable DETR fixes this with sparse cross-attention, converging in 50 epochs.

SAM hallucination: SAM segments regions without semantic grounding — it will segment any coherent region, including backgrounds and reflections. Always pair with a semantic classifier for meaningful segmentation.

mAP sensitivity: a model with AP50 = 72 but AP75 = 40 localizes coarsely but not precisely — it finds the right regions but its boxes are loose. Track both metrics, not just AP50.
