


A direct set prediction approach that bypasses region proposals, anchors, and non-maximum suppression (NMS). An encoder-decoder transformer predicts all objects at once and is trained with a bipartite matching loss.
Set Prediction Loss
Finds the optimal permutation of the $N$ predictions via the Hungarian algorithm:
\[\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^N \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})\]
Hungarian loss:
\[\mathcal{L}_{\text{Hungarian}} = \sum_{i=1}^N \left[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)})\right]\]
Box loss: $\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\text{iou}}\mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}}\|b_i - \hat{b}_{\sigma(i)}\|_1$
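The matching step and the loss above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: boxes are normalized (cx, cy, w, h) tuples with positive area, the IoU term is taken to be generalized IoU, the matching cost is assumed to be the negative class probability plus the box loss for real objects (and zero for padded "no object" targets), and the weights `lam_iou=2` and `lam_l1=5` are illustrative. The Hungarian algorithm is replaced here by a brute-force search over permutations, which is only feasible for tiny N; a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` is the usual choice in practice. All function names are hypothetical.

```python
import math
from itertools import permutations

NUM_CLASSES = 2
NO_OBJECT = NUM_CLASSES  # assumption: last class index stands for "no object"

def cxcywh_to_xyxy(b):
    cx, cy, w, h = b
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def giou_loss(b, b_hat):
    """1 - generalized IoU of two (cx, cy, w, h) boxes with positive area."""
    ax0, ay0, ax1, ay1 = cxcywh_to_xyxy(b)
    bx0, by0, bx1, by1 = cxcywh_to_xyxy(b_hat)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return 1.0 - (inter / union - (hull - union) / hull)

def box_loss(b, b_hat, lam_iou=2.0, lam_l1=5.0):
    """lam_iou * L_iou + lam_l1 * ||b - b_hat||_1 (weights are illustrative)."""
    l1 = sum(abs(x - y) for x, y in zip(b, b_hat))
    return lam_iou * giou_loss(b, b_hat) + lam_l1 * l1

def match_cost(target, pred):
    """Assumed matching cost: zero for padded 'no object' targets."""
    c, b = target
    probs, b_hat = pred
    if c == NO_OBJECT:
        return 0.0
    return -probs[c] + box_loss(b, b_hat)

def best_permutation(targets, preds):
    """Brute-force stand-in for the Hungarian algorithm (O(N!), tiny N only)."""
    n = len(targets)
    return min(
        permutations(range(n)),
        key=lambda s: sum(match_cost(targets[i], preds[s[i]]) for i in range(n)),
    )

def hungarian_loss(targets, preds, sigma):
    """Sum of -log p(c_i) plus the box loss for targets that are real objects."""
    total = 0.0
    for i, (c, b) in enumerate(targets):
        probs, b_hat = preds[sigma[i]]
        total += -math.log(probs[c])
        if c != NO_OBJECT:
            total += box_loss(b, b_hat)
    return total

# Two ground-truth objects padded with "no object" up to N = 3 predictions.
targets = [
    (0, (0.30, 0.30, 0.20, 0.20)),
    (1, (0.70, 0.60, 0.30, 0.40)),
    (NO_OBJECT, None),
]
# Each prediction: (class probabilities incl. no-object, predicted box).
preds = [
    ([0.05, 0.05, 0.90], (0.50, 0.50, 0.10, 0.10)),
    ([0.10, 0.80, 0.10], (0.68, 0.62, 0.30, 0.38)),
    ([0.85, 0.05, 0.10], (0.31, 0.29, 0.20, 0.22)),
]

sigma = best_permutation(targets, preds)
loss = hungarian_loss(targets, preds, sigma)
print(sigma)  # → (2, 1, 0)
print(loss)
```

Padding the targets with "no object" entries up to N is what lets a single permutation cover every prediction: predictions left without a real object are pushed toward the no-object class by the log-probability term alone.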
Architecture
- CNN Backbone — feature extraction
- Transformer Encoder — with fixed positional encodings
- Transformer Decoder — decodes N objects in parallel (not autoregressive)
- FFN — 3-layer perceptron for final predictions
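To make the data flow through the four stages concrete, the sketch below traces tensor shapes (batch dimension omitted) from the input image to the final predictions. The numbers are assumptions not stated in these notes: a backbone with total stride 32 and 2048 output channels, a model width of 256, 100 object queries, and 80 classes plus one extra "no object" class. The function name `detr_shapes` is hypothetical, and input sizes are assumed to be multiples of the stride.

```python
import math

def detr_shapes(height, width, num_queries=100, num_classes=80,
                stride=32, backbone_channels=2048, d_model=256):
    """Return the assumed tensor shape (no batch dim) after each stage."""
    h = math.ceil(height / stride)
    w = math.ceil(width / stride)
    return {
        # CNN backbone: downsampled feature map.
        "backbone_features": (backbone_channels, h, w),
        # Channel reduction to the transformer width.
        "projected_features": (d_model, h, w),
        # Encoder: feature map flattened to h*w tokens, with positional encodings.
        "encoder_tokens": (h * w, d_model),
        # Decoder: one output embedding per object query, all decoded in parallel.
        "decoder_outputs": (num_queries, d_model),
        # Prediction heads: class scores (+1 for "no object") and boxes.
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }

shapes = detr_shapes(800, 1216)
for stage, shape in shapes.items():
    print(f"{stage:20s} {shape}")
```

The number of predictions is fixed by `num_queries`, not by the image content, which is why the loss needs "no object" padding on the target side.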
Variants
- RT-DETR, Fast-DETR, SAM-Det, ULTRA-DETR