DETR


End-to-end object detection with transformers using bipartite matching


DETR treats object detection as a direct set prediction problem, which removes the need for region proposals, anchors, and non-maximum suppression (NMS). An encoder-decoder transformer predicts all objects at once and is trained with a bipartite matching loss.

Set Prediction Loss

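The Hungarian algorithm finds the optimal one-to-one assignment (a permutation $\sigma$ of the $N$ prediction slots) between the ground-truth set, padded with $\varnothing$ (no object) to size $N$, and the $N$ predictions: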

\[\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^N \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})\]
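To make the arg min over permutations concrete, here is a minimal sketch in plain Python. It is not the Hungarian algorithm itself: it brute-forces every permutation, which is only feasible for tiny $N$, and the matching cost used here (negative class probability plus an L1 box distance, zero for padded slots) is a simplified stand-in chosen for illustration. The names `match_cost` and `best_assignment` are hypothetical.

```python
from itertools import permutations

def match_cost(target, pred):
    """Illustrative matching cost (an assumption): -p(class) + L1 box distance."""
    if target is None:                      # padded "no object" slot costs nothing
        return 0.0
    cls, box = target
    probs, pred_box = pred
    l1 = sum(abs(a - b) for a, b in zip(box, pred_box))
    return -probs[cls] + l1

def best_assignment(targets, preds):
    """Return sigma such that targets[i] is matched to preds[sigma[i]] at minimal total cost."""
    n = len(preds)
    padded = list(targets) + [None] * (n - len(targets))   # pad ground truth to size N
    return min(
        permutations(range(n)),
        key=lambda sigma: sum(match_cost(padded[i], preds[sigma[i]]) for i in range(n)),
    )

# Two ground-truth objects, N = 3 predictions; boxes are (cx, cy, w, h).
targets = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.7, 0.7, 0.2, 0.2))]
preds = [
    ([0.1, 0.8, 0.1], (0.7, 0.7, 0.2, 0.2)),    # resembles target 1
    ([0.1, 0.1, 0.8], (0.5, 0.5, 0.5, 0.5)),    # background-like
    ([0.9, 0.05, 0.05], (0.2, 0.2, 0.1, 0.1)),  # resembles target 0
]
sigma = best_assignment(targets, preds)
print(sigma)  # (2, 0, 1)
```

For realistic $N$, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the $N \times N$ cost matrix would be the usual choice.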

Given the optimal assignment $\hat{\sigma}$, the Hungarian loss is:

\[\mathcal{L}_{\text{Hungarian}} = \sum_{i=1}^N \left[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)})\right]\]
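The sum above can be evaluated directly once an assignment is fixed. The sketch below assumes targets are already padded to length $N$ with a no-object class, and it takes the box loss as a callable so that any definition can be plugged in. The function name `hungarian_loss`, the data layout, and the plain L1 box loss in the usage lines are assumptions made for illustration.

```python
import math

def hungarian_loss(targets, preds, sigma, box_loss):
    """Sum class and box terms over all N slots for a fixed assignment sigma.

    targets[i] = (class_index, box), with box None for the no-object class.
    preds[j]   = (class_probabilities, box).
    """
    total = 0.0
    for i, (cls, box) in enumerate(targets):
        probs, pred_box = preds[sigma[i]]
        total += -math.log(probs[cls])       # classification term, applied to every slot
        if box is not None:                  # indicator: box term only for real objects
            total += box_loss(box, pred_box)
    return total

def l1(b, b_hat):
    """Placeholder box loss for this example only."""
    return sum(abs(x - y) for x, y in zip(b, b_hat))

NO_OBJECT = 2  # assumed index of the no-object class
targets = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.7, 0.7, 0.2, 0.2)), (NO_OBJECT, None)]
preds = [
    ([0.1, 0.8, 0.1], (0.7, 0.7, 0.2, 0.2)),
    ([0.1, 0.1, 0.8], (0.5, 0.5, 0.5, 0.5)),
    ([0.9, 0.05, 0.05], (0.2, 0.2, 0.1, 0.1)),
]
loss = hungarian_loss(targets, preds, sigma=(2, 0, 1), box_loss=l1)
```

Here both matched boxes coincide with their targets, so only the three classification terms contribute.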

Box loss: $\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\text{iou}}\mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}}\|b_i - \hat{b}_{\sigma(i)}\|_1$
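As a worked instance of the box loss, the sketch below assumes that $\mathcal{L}_{\text{iou}}$ is a generalized IoU loss, $1 - \text{GIoU}$, and that boxes are stored as normalized `(cx, cy, w, h)` with positive area. The default weights `lambda_iou=2.0` and `lambda_l1=5.0` are assumed values for illustration, and the helper names are hypothetical.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def giou(a, b):
    """Generalized IoU of two boxes with positive area, in [-1, 1]."""
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def box_loss(b, b_hat, lambda_iou=2.0, lambda_l1=5.0):
    """lambda_iou * (1 - GIoU) + lambda_l1 * ||b - b_hat||_1 (weights are assumptions)."""
    l1 = sum(abs(x - y) for x, y in zip(b, b_hat))
    return lambda_iou * (1.0 - giou(b, b_hat)) + lambda_l1 * l1

same = box_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))          # ~0.0
disjoint = box_loss((0.25, 0.25, 0.5, 0.5), (0.75, 0.75, 0.5, 0.5))  # 2*1.5 + 5*1.0 = 8.0
```

The GIoU term still gives a useful signal when boxes do not overlap, whereas a plain IoU would be zero for every disjoint pair.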

Architecture

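CNN Backbone: extracts a low-resolution feature map from the image
Transformer Encoder: self-attention over the flattened feature map, with fixed positional encodings
Transformer Decoder: decodes N object queries in parallel (not autoregressively)
FFN: prediction heads applied to each decoder output; a 3-layer perceptron predicts the box, and a linear layer predicts the class (including "no object")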
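One way to read the pipeline above is to follow tensor shapes through it. The sketch below does only that bookkeeping and contains no model code. The function name `detr_shapes` and all default values (stride 32, 2048 backbone channels, model width 256, 100 queries) are assumptions matching a typical ResNet-50 style configuration, not values stated in this note.

```python
def detr_shapes(height, width, num_classes, num_queries=100, d_model=256,
                backbone_stride=32, backbone_channels=2048):
    """Trace per-image tensor shapes through the four stages (all defaults are assumptions)."""
    # Ceil division: assumes the backbone pads so each stride-2 stage rounds up.
    h = -(-height // backbone_stride)
    w = -(-width // backbone_stride)
    return {
        "backbone_features": (backbone_channels, h, w),   # CNN output
        "encoder_tokens": (h * w, d_model),               # flattened, projected to d_model
        "decoder_outputs": (num_queries, d_model),        # one embedding per object query
        "class_logits": (num_queries, num_classes + 1),   # +1 for the no-object class
        "boxes": (num_queries, 4),                        # (cx, cy, w, h) per query
    }

shapes = detr_shapes(800, 1216, num_classes=91)
print(shapes["encoder_tokens"])  # (950, 256)
print(shapes["class_logits"])    # (100, 92)
```

The number of predictions is fixed by `num_queries` and does not depend on image size, which is why the ground-truth set has to be padded with no-object entries before matching.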

Variants

RT-DETR, Fast-DETR, SAM-Det, ULTRA-DETR