Neural-Path/Notes

Object Detection & Segmentation

Classification asks "what is in this image?" Detection adds "where?" and segmentation adds "exactly which pixels?" These tasks demand that the model produce precise spatial outputs — bounding boxes or masks — not just a single label. The evolution from anchor-based detectors to anchor-free approaches to transformer-based detection reflects a broader shift: from handcrafted spatial priors to learned representations. Understanding DETR's query-based detection and SAM's promptable segmentation shows how the same transformer machinery that powers language models and ViT can be adapted for dense spatial prediction.

Theory

Detection Paradigms
[Figure: Grid-based detection (YOLO-style) predicts at each grid cell and applies NMS to remove duplicates; DETR performs set prediction with N queries and Hungarian matching, yielding exactly one prediction per object with no NMS.]

Grid methods predict at every cell and require NMS post-processing. DETR treats detection as set prediction — N learned queries are matched to ground truth objects via Hungarian algorithm, producing exactly one prediction per object with no duplicate removal needed.

Object detection solves two problems at once: localization (where is each object?) and classification (what is it?). Traditional detectors handled this with sliding windows, anchors, and non-maximum suppression — many engineering choices that each require careful tuning. DETR reframes the whole problem as set prediction: output a fixed set of predictions, then find the best one-to-one assignment to ground truth using the Hungarian algorithm.
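To make the NMS step concrete, here is a minimal greedy NMS sketch in NumPy. The boxes and scores are invented for illustration:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression (the post-processing step DETR avoids).
    boxes: (N, 4) array of [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top-scoring box against the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # drop everything that overlaps the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 140., 140.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```

Every threshold and tie-breaking choice here is a tuning knob, which is exactly the kind of handcrafted machinery DETR's set prediction removes.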

Detection formulations

Two-stage (R-CNN family): first propose candidate regions, then classify each. High accuracy, slow inference.

One-stage (YOLO family): directly predict boxes and classes from a feature grid. Fast, less accurate on small objects.

Anchor-free: predict box offsets from object centers rather than predefined anchor aspect ratios. FCOS, CenterNet.

Query-based (DETR): learn a fixed set of object queries that attend over image features; each query predicts one box. No anchors, no NMS.

Bipartite matching loss (DETR)

DETR (Carion et al., 2020) frames detection as a set prediction problem. Given $N$ learned object queries, DETR outputs $N$ (class, box) predictions, then finds the optimal one-to-one assignment between predictions and ground-truth objects.

Hungarian algorithm: find the permutation $\hat{\sigma}$ minimizing the total matching cost:

$$\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i=1}^N \mathcal{C}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$$

where $\mathcal{C}_{\text{match}}$ combines classification probability and box distance.

The Hungarian matching is forced by the one-to-one assignment requirement: each ground-truth object should be predicted by exactly one query, and each query should be responsible for exactly one ground-truth object. Any one-to-many assignment would produce duplicate predictions, which is precisely what non-maximum suppression (NMS) was designed to remove in traditional detectors. By encoding the one-to-one constraint directly in the training loss via bipartite matching, DETR eliminates the need for NMS entirely — the loss function ensures duplicates are penalized, not post-processed away.
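The matching itself is just a minimum-cost assignment. A brute-force sketch over a made-up 3×3 cost matrix shows the idea (real implementations use `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

# Toy matching cost: cost[i][j] = C_match(ground-truth i, prediction j).
# Values are invented for illustration.
cost = [
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.9],
    [0.8, 0.8, 0.3],
]
n = len(cost)

# Try every permutation sigma and keep the one minimizing the total
# matching cost (exponential, but fine for tiny n; Hungarian is O(n^3)).
best = min(permutations(range(n)),
           key=lambda sigma: sum(cost[i][sigma[i]] for i in range(n)))
total = sum(cost[i][best[i]] for i in range(n))
print(best, round(total, 3))  # (0, 1, 2) 0.6
```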

Once matched, the set prediction loss is:

$$\mathcal{L} = \sum_{i=1}^N \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}[c_i \neq \varnothing]\, \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

Box regression loss combines an L1 term with Generalized IoU (GIoU), whose scale-invariance compensates for L1's sensitivity to box size:

$$\mathcal{L}_{\text{box}} = \lambda_1 \|b_i - \hat{b}\|_1 + \lambda_2\, \mathcal{L}_{\text{GIoU}}(b_i, \hat{b})$$

GIoU penalizes non-overlapping boxes more smoothly than vanilla IoU.
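A minimal GIoU sketch makes the difference visible (box values are assumptions for illustration): for disjoint boxes, IoU is flat at zero and gives no gradient signal, while GIoU goes negative as the boxes move apart:

```python
def giou(a, b):
    """Generalized IoU for boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest box C enclosing both a and b
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c

print(round(giou([0, 0, 10, 10], [5, 0, 15, 10]), 3))   # 0.333 (overlapping: equals IoU here)
print(round(giou([0, 0, 10, 10], [20, 0, 30, 10]), 3))  # -0.333 (disjoint: enclosing-box penalty)
```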

DETR architecture

  1. CNN/ViT backbone: extracts spatial features $F \in \mathbb{R}^{H' \times W' \times C}$
  2. Transformer encoder: flatten $F$ to a sequence, add 2D positional encodings, apply self-attention
  3. Transformer decoder: $N = 100$ learned object queries cross-attend over encoder output
  4. FFN prediction heads: each query outputs (class distribution, normalized box coordinates)

Each query specializes to detect objects in specific regions or of specific sizes — this emerges from training, not from handcrafted anchors.
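The four stages can be traced as a shape walkthrough. This is an illustrative sketch with untrained layers and invented dimensions, not the real DETR implementation:

```python
import torch
import torch.nn as nn

C, Hp, Wp, N, num_classes = 256, 25, 34, 100, 91   # assumed dims (COCO-like)

features = torch.randn(1, C, Hp, Wp)               # 1. backbone output F
src = features.flatten(2).permute(2, 0, 1)         # 2. flatten to (H'*W', batch, C)
pos = torch.randn(Hp * Wp, 1, C)                   #    stand-in for 2D positional encodings

transformer = nn.Transformer(d_model=C, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2)
queries = torch.randn(N, 1, C)                     # 3. learned object queries (random here)

with torch.no_grad():
    hs = transformer(src + pos, queries)               #    one embedding per query
    class_logits = nn.Linear(C, num_classes + 1)(hs)   # 4. heads: +1 class for "no object"
    boxes = nn.Linear(C, 4)(hs).sigmoid()              #    normalized (cx, cy, w, h)

print(hs.shape, class_logits.shape, boxes.shape)
# torch.Size([100, 1, 256]) torch.Size([100, 1, 92]) torch.Size([100, 1, 4])
```

Note the decoder output is indexed by query, not by spatial location: each of the 100 queries commits to at most one object, which is what makes the bipartite matching loss well-defined.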

IoU-based metrics

Intersection over Union for predicted box $\hat{B}$ and ground-truth box $B$:

$$\text{IoU} = \frac{|\hat{B} \cap B|}{|\hat{B} \cup B|}$$

Average Precision (AP): area under the precision-recall curve at an IoU threshold. COCO AP averages over thresholds 0.5:0.05:0.95.

mAP: mean over all classes. Primary metric for detection benchmarks.
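A quick numeric check shows how the threshold bites (box values are invented for illustration): a prediction offset by about 5 px from a 100×100 ground-truth box still scores a fairly high IoU:

```python
def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# ground truth vs a prediction shifted by ~5 px
val = iou([100, 100, 200, 200], [105, 95, 205, 205])
print(round(val, 3))  # 0.826: a true positive at both the 0.5 and 0.75 thresholds
```

Under COCO's 0.5:0.05:0.95 averaging, this prediction counts as a hit at thresholds up to 0.80 but a miss at 0.85 and above, which is why small localization errors still cost AP.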

Promptable segmentation: SAM

SAM (Segment Anything Model, Kirillov et al., 2023) separates segmentation from semantics. It segments any object described by:

  • Points: positive/negative clicks on the object
  • Boxes: bounding box prompts
  • Masks: coarse masks refined by the model

Architecture:

  • Heavyweight image encoder (ViT-H, 632M params): runs once per image, caches the embedding
  • Lightweight prompt encoder: encodes points/boxes as sparse embeddings
  • Mask decoder (2-layer transformer): cross-attends prompt to image embedding, outputs masks and confidence scores

Ambiguity handling: for an ambiguous prompt, SAM outputs multiple mask hypotheses at different granularities (object part, full object, containing region), ranked by confidence.

Walkthrough

DETR inference

```python
from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image
import requests

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# rescale normalized boxes to original image coordinates, keep scores > 0.9
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"{model.config.id2label[label.item()]}: {score:.3f} at {box}")
```

SAM with point prompts

```python
from segment_anything import sam_model_registry, SamPredictor
import numpy as np
import cv2

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

image_rgb = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)   # runs image encoder once, cached

input_point = np.array([[500, 375]])
input_label = np.array([1])   # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,   # 3 masks at different granularities
)
print(f"Masks: {masks.shape}, scores: {scores.round(3)}")
```

mAP evaluation

```python
from torchmetrics.detection.mean_ap import MeanAveragePrecision
import torch

metric = MeanAveragePrecision(iou_type="bbox")

preds = [{
    "boxes":  torch.tensor([[100., 100., 200., 200.], [50., 50., 150., 180.]]),
    "labels": torch.tensor([0, 1]),
    "scores": torch.tensor([0.95, 0.87]),
}]
targets = [{
    "boxes":  torch.tensor([[105., 95., 205., 205.]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
result = metric.compute()
print(f"mAP: {result['map']:.3f}  AP50: {result['map_50']:.3f}")
```

Analysis & Evaluation

Where Your Intuition Breaks

A tempting conclusion: end-to-end detectors like DETR have made two-stage detectors obsolete. In practice, two-stage detectors (Faster R-CNN family) still outperform DETR-style models on small objects and dense scenes — crowded pedestrian detection, satellite imagery, and medical imaging. DETR's attention-based queries converge slowly and struggle with small objects because the attention heads distribute capacity across all object scales, while anchor-based methods can be tuned per scale. The right detector depends on the task: DETR and its variants (DINO, RT-DETR) excel at general object detection; specialized two-stage detectors remain competitive in domains with high object density or extreme scale variation.

Detector comparison

| | Faster R-CNN | YOLOv8 | DETR | RT-DETR |
|---|---|---|---|---|
| Paradigm | Two-stage, anchor | One-stage, anchor-free | Query-based | Query-based, fast |
| Speed (FPS) | ~15 | 160+ | 28 | 114 |
| COCO AP | 37.4 | 53.9 | 42.0 | 53.1 |
| NMS required | Yes | Yes | No | No |
| Training | 12 epochs | Fast | 500 epochs | Moderate |

When to use what

| Task | Recommended |
|---|---|
| Real-time detection (edge) | YOLOv8n/s |
| High-accuracy detection | DINO (improved DETR) |
| Instance segmentation | Mask R-CNN, YOLOv8-seg |
| Interactive segmentation | SAM / SAM 2 |
| Open-vocabulary detection | Grounding DINO |
| Video object tracking | SAM 2 |

Key pitfalls

DETR slow convergence: bipartite matching requires ~500 epochs on COCO vs 12 for Faster R-CNN. Deformable DETR fixes this with sparse cross-attention, converging in 50 epochs.

SAM hallucination: SAM segments regions without semantic grounding — it will segment any coherent region, including backgrounds and reflections. Always pair with a semantic classifier for meaningful segmentation.

mAP sensitivity: a model with AP50 = 72 but AP75 = 40 localizes coarsely but not precisely — it finds the right regions but its boxes are loose. Track both metrics, not just AP50.
