YOLO-World


Open-vocabulary YOLO with vision-language modeling and RepVL-PAN

Figure: RepVL-PAN

Figure: YOLO-World architecture

Figure: Comparison with detection paradigms

YOLO-World adds open-vocabulary detection to YOLO through vision-language modeling and pre-training on large-scale datasets.

Architecture

YOLO Detector — YOLOv8
Text Encoder — CLIP
Text Contrastive Head — object-text similarity: $s_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(w_j)^T + \beta$
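
A minimal sketch of the object-text similarity above, in plain Python. It treats $\alpha$ and $\beta$ as ordinary floats (in a trained model they would be learned), and the function names `l2_normalize` and `object_text_similarity` are made up for illustration, not taken from any YOLO-World codebase.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def object_text_similarity(object_embeddings, text_embeddings, alpha=1.0, beta=0.0):
    """Return s[k][j] = alpha * <L2-Norm(e_k), L2-Norm(w_j)> + beta.

    object_embeddings: K vectors e_k, one per predicted object.
    text_embeddings:   C vectors w_j, one per vocabulary entry.
    alpha, beta:       scale and shift; plain floats here (assumption).
    """
    objects = [l2_normalize(e) for e in object_embeddings]
    texts = [l2_normalize(w) for w in text_embeddings]
    return [
        [alpha * sum(a * b for a, b in zip(e, w)) + beta for w in texts]
        for e in objects
    ]

# Toy example: 2 object embeddings against 2 text embeddings, dimension 2.
scores = object_text_similarity(
    [[3.0, 4.0], [0.0, 2.0]],
    [[6.0, 8.0], [1.0, 0.0]],
)
```

Because both sides are L2-normalized, the dot product is a cosine similarity, so the first object (parallel to the first text embedding) scores close to `alpha + beta` for that entry regardless of vector length.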

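RepVL-PAN, text-guided CSPLayer (max-sigmoid attention that gates the image features $X_l$ by their best-matching text embedding $W_j$): $X_l' = X_l \cdot \delta(\max_{j \in \{1..C\}} (X_l W_j^T))$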
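
The max-sigmoid gating in this formula can be sketched in plain Python as follows. The sketch assumes the feature map $X_l$ has been flattened to a list of per-position vectors and that $\delta$ is the sigmoid; `max_sigmoid_attention` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def max_sigmoid_attention(features, text_embeddings):
    """Gate each position of X_l by its best match among the text embeddings.

    features:        N vectors of length D (an H x W grid flattened; assumption).
    text_embeddings: C vectors W_j of length D.
    Returns N vectors: x * sigmoid(max_j <x, W_j>).
    """
    gated = []
    for x in features:
        best_match = max(sum(a * b for a, b in zip(x, w)) for w in text_embeddings)
        gate = sigmoid(best_match)
        gated.append([gate * a for a in x])
    return gated

# Toy example: one position aligned with a text embedding, one all-zero position.
gated = max_sigmoid_attention(
    [[1.0, 0.0], [0.0, 0.0]],
    [[2.0, 0.0], [0.0, 5.0]],
)
```

Positions that match some vocabulary entry strongly get a gate near 1 and pass through almost unchanged; positions that match nothing are damped.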

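RepVL-PAN, image-pooling attention (updates the text embeddings $W$ with pooled image features $\tilde{X}$): $W' = W + \text{MultiHead-Attention}(W, \tilde{X}, \tilde{X})$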
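
A rough sketch of this residual update in plain Python. To stay short it omits the learned query/key/value/output projections of real multi-head attention (they are treated as identity, which is an assumption), and it takes $\tilde{X}$ as an already-pooled list of image tokens. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(queries, keys, values, num_heads):
    """Scaled dot-product attention per head, heads concatenated.

    Learned projections are omitted (identity) to keep the sketch short.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "embedding size must divide evenly into heads"
    head_dim = dim // num_heads
    outputs = []
    for q in queries:
        merged = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in keys
            ]
            weights = softmax(scores)
            for d in range(lo, hi):
                merged.append(sum(wt * v[d] for wt, v in zip(weights, values)))
        outputs.append(merged)
    return outputs

def image_pooling_attention(text_embeddings, image_tokens, num_heads=2):
    """W' = W + MultiHead-Attention(W, X~, X~): text queries attend to image tokens."""
    attended = multi_head_attention(text_embeddings, image_tokens, image_tokens, num_heads)
    return [
        [w + a for w, a in zip(w_row, a_row)]
        for w_row, a_row in zip(text_embeddings, attended)
    ]

# Toy example: one text embedding, one pooled image token, dimension 4.
updated = image_pooling_attention([[1.0, 0.0, 0.0, 0.0]], [[0.5, 0.5, 0.5, 0.5]])
```

The text embeddings act as queries and the pooled image tokens as keys and values, so each vocabulary entry picks up image-specific context while the residual keeps its original meaning.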

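Training loss for an image $I$ (region-text contrastive loss plus IoU loss and distribution focal loss): $\mathcal{L}(I) = \mathcal{L}_{con} + \lambda_I \cdot (\mathcal{L}_{iou} + \mathcal{L}_{dfl})$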
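
The loss combination can be sketched as below. It assumes $\lambda_I$ acts as a 0/1 indicator that switches the box-regression terms on only for images whose boxes are trusted (for example detection or grounding data rather than pseudo-labeled image-text pairs). The function name and the `has_reliable_boxes` flag are hypothetical.

```python
def total_loss(l_con, l_iou, l_dfl, has_reliable_boxes):
    """L(I) = L_con + lambda_I * (L_iou + L_dfl).

    lambda_I is modeled as a 0/1 indicator (assumption): box losses count
    only when the image's box annotations are considered reliable.
    """
    lambda_i = 1.0 if has_reliable_boxes else 0.0
    return l_con + lambda_i * (l_iou + l_dfl)

# Same per-term losses, with and without the box-regression terms.
loss_with_boxes = total_loss(0.5, 0.25, 0.25, has_reliable_boxes=True)
loss_without_boxes = total_loss(0.5, 0.25, 0.25, has_reliable_boxes=False)
```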