
It has no region proposal network. The fully connected layers are also absent from YOLOv2 onward, although the original YOLOv1 still ends in two of them.
Process


- Divide the image into grid ($S \times S$ cells)
- Predict $B$ bounding boxes for each cell, each with a confidence score; a cell is responsible for objects whose center falls inside it
- Predict $C$ classes for each grid cell
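
The grid step can be illustrated with a minimal sketch that finds which cell is responsible for an object, assuming the box center is given in pixels. The function name `responsible_cell` and the 448 x 448 image size are illustrative assumptions, not part of the notes above.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box center (cx, cy)."""
    # Scale the pixel coordinates to grid units, then truncate to a cell index.
    # min(...) keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# Hypothetical 448x448 image with an object centered at (224, 100) pixels.
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```
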
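YOLO output tensor: $S \times S \times (B \times 5 + C)$
where each of the $B$ boxes carries five values $(P_c, b_x, b_y, b_h, b_w)$: the confidence score, the box center, and the box height and width.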
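
As a quick check of the tensor shape, the sketch below builds a dummy output and splits it into box and class parts. The values $S = 7$, $B = 2$, $C = 20$ are assumed here for illustration, and placing the box values before the class scores along the last axis is also an assumption about the layout.

```python
import numpy as np

S, B, C = 7, 2, 20            # assumed grid size, boxes per cell, number of classes
depth = B * 5 + C             # 5 values per box plus C class scores
pred = np.zeros((S, S, depth))

# Assumed layout: the first B*5 channels hold the boxes, the rest the class scores.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # (P_c, b_x, b_y, b_h, b_w) per box
class_probs = pred[..., B * 5:]                 # C class scores per grid cell

print(pred.shape, boxes.shape, class_probs.shape)
# (7, 7, 30) (7, 7, 2, 5) (7, 7, 20)
```
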
Loss Function

$L_{total} = L_{localization} + L_{confidence} + L_{classification}$
$L_{localization} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2]$
$L_{confidence} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2$
$L_{classification} = \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2$
$\lambda_{coord} = 5, \quad \lambda_{noobj} = 0.5$
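
The three terms can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `yolo_loss`, the array layout (boxes stored as `(S, S, B, 5)` in the order $(P_c, b_x, b_y, b_h, b_w)$), and the precomputed responsibility mask `obj_mask` are all assumptions. Widths and heights are assumed non-negative so the square roots are defined.

```python
import numpy as np

def yolo_loss(pred_boxes, pred_cls, true_boxes, true_cls, obj_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of localization, confidence and classification terms.

    pred_boxes, true_boxes: (S, S, B, 5), last axis (P_c, b_x, b_y, b_h, b_w)
    pred_cls, true_cls:     (S, S, C) class scores per cell
    obj_mask:               (S, S, B), 1 where box j of cell i is responsible
    """
    noobj_mask = 1.0 - obj_mask
    cell_obj = obj_mask.max(axis=-1)  # (S, S): 1 if the cell contains an object

    # Localization: squared error on the center, and on sqrt of height/width.
    xy_err = np.sum((true_boxes[..., 1:3] - pred_boxes[..., 1:3]) ** 2, axis=-1)
    hw_err = np.sum((np.sqrt(true_boxes[..., 3:5])
                     - np.sqrt(pred_boxes[..., 3:5])) ** 2, axis=-1)
    loc = lambda_coord * np.sum(obj_mask * (xy_err + hw_err))

    # Confidence: full weight for responsible boxes, lambda_noobj for the rest.
    conf_err = (true_boxes[..., 0] - pred_boxes[..., 0]) ** 2
    conf = np.sum(obj_mask * conf_err) + lambda_noobj * np.sum(noobj_mask * conf_err)

    # Classification: only cells that contain an object contribute.
    cls = np.sum(cell_obj * np.sum((true_cls - pred_cls) ** 2, axis=-1))
    return loc + conf + cls


# Tiny example: one cell, one box, two classes; only the confidence is wrong.
S, B, C = 1, 1, 2
true_boxes = np.array([1.0, 0.5, 0.5, 0.25, 0.25]).reshape(S, S, B, 5)
pred_boxes = true_boxes.copy()
pred_boxes[..., 0] = 0.0
cls_scores = np.array([1.0, 0.0]).reshape(S, S, C)
obj = np.ones((S, S, B))
print(yolo_loss(pred_boxes, cls_scores, true_boxes, cls_scores, obj))  # 1.0
```

If the same cell is marked as containing no object (`obj` set to zeros), the confidence error is weighted by $\lambda_{noobj}$ and the result drops to 0.5, which shows how the down-weighting keeps the many empty cells from dominating the loss.
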