🧠 YOLO Object Detection: Deep Dive (Training vs Prediction)
📋 Table of Contents
- 1. What is YOLO?
- 2. One Single Pass of a Neural Network
- 3. How YOLO Works
- 4. Anchor Boxes: How Do We Get Them?
- 5. Training Phase
- 6. Prediction Phase
- 7. Training vs Prediction Summary
- 8. Example: Dog and Ball
- 9. Label Formats
- 10. Conclusion
1. What is YOLO?
YOLO (You Only Look Once) is a family of real-time object detection models. It detects and classifies multiple objects in an image using a single forward pass through a neural network.
2. One Single Pass of a Neural Network
In traditional computer vision pipelines, object detection was multi-step: extract regions → classify each region.
But YOLO simplifies this by doing everything in one go:
YOLO takes an image, processes it through a convolutional neural network (CNN) once, and outputs bounding boxes + class probabilities directly.
There's no separate stage for proposing regions (like in older models). The entire image is processed end-to-end.
2.1 Two-Stage vs One-Stage Detectors
| Type | Example Models | Description |
|-----------|---------------------------------|-------------|
| Two-Stage | R-CNN, Fast R-CNN, Faster R-CNN | First generate region proposals → then classify + refine boxes |
| One-Stage | YOLO, SSD, RetinaNet | Directly predict bounding boxes + classes in a single step |
💡 YOLO is faster but might sacrifice a bit of accuracy compared to some two-stage models. That's why it's widely used in real-time applications.
3. How YOLO Works
3.1 Grid and Anchor Boxes
- Input image (e.g., 416×416) is divided into an S×S grid (e.g., 13×13, 19×19).
- Each cell predicts multiple bounding boxes using anchor boxes (e.g., 3, 5, 9).
So for S = 13 and A = 3 anchors:
13 × 13 × 3 = 507 predicted boxes
Each prediction includes:
- Bounding box (x, y, w, h)
- Objectness score (is there an object?)
- Class probabilities
3.2 Predictions per Grid Cell
Each grid cell + anchor box predicts:
[tx, ty, tw, th, to, class1, class2, ..., classN]
- (tx, ty) → center of the box (relative to grid cell)
- (tw, th) → width and height (adjusted from anchor box)
- to → object confidence score
- class1...classN → probabilities for each class
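To make the arithmetic concrete, here is a minimal sketch (plain Python; the S, A, and N values are illustrative) of how the output volume follows from the grid size, anchor count, and class count:

```python
# Minimal sketch: size of the YOLO output volume.
S, A, N = 13, 3, 80                  # grid size, anchors per cell, classes (e.g., COCO)

values_per_box = 5 + N               # tx, ty, tw, th, to + N class probabilities
total_boxes = S * S * A              # 13 * 13 * 3 = 507 predicted boxes
print(total_boxes)                   # 507
print(total_boxes * values_per_box)  # 43095 raw output values per image
```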
4. Anchor Boxes: How Do We Get Them?
4.1 Why Anchor Boxes?
Object shapes vary:
- A person has a tall vertical box.
- A car has a wide horizontal box.
- A ball has a square box.
Using multiple anchor boxes with different aspect ratios allows the model to better match these variations.
4.2 K-Means Clustering for Anchors
To choose good anchor box sizes, we use K-Means clustering on the training set bounding boxes.
📌 Steps:
- Collect all bounding boxes from your labeled dataset.
- Normalize their widths and heights.
- Apply K-Means clustering on (width, height) pairs.
- The cluster centers are your anchor box dimensions.
❓ Why K-Means?
Because it groups boxes with similar shapes together, helping the model generalize better.
🧠 YOLOv3 used IoU distance as the clustering metric instead of Euclidean distance:
distance = 1 - IoU(box, cluster_center)
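Below is a minimal NumPy sketch of this clustering, assuming boxes are normalized (width, height) pairs compared as if they shared a common center; function names are illustrative, not from any YOLO codebase:

```python
import numpy as np

def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, treating all boxes as sharing one center."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with the 1 - IoU distance; returns k anchor sizes."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Minimizing (1 - IoU) is the same as maximizing IoU.
        assign = np.argmax(wh_iou(boxes, centers), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

# Usage: boxes = np.array of normalized (w, h) from your labels
# anchors = kmeans_anchors(boxes, k=9)
```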
5. Training Phase
5.1 Ground Truth and Manual Labeling
- Use tools like LabelImg or Roboflow to label each image.
- Ground truth includes:
- Class
- Bounding box coordinates (x, y, w, h)
YOLO Format:
class_id center_x center_y width height
# All values are normalized (0–1)
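As an illustration, a small helper (hypothetical, not part of any labeling tool) that converts a pixel-space box (xmin, ymin, xmax, ymax) into one line of this format:

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space box to the normalized YOLO .txt format."""
    cx = (xmin + xmax) / 2 / img_w   # normalized center x
    cy = (ymin + ymax) / 2 / img_h   # normalized center y
    w = (xmax - xmin) / img_w        # normalized width
    h = (ymax - ymin) / img_h        # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(to_yolo_line(0, 100, 150, 300, 400, img_w=640, img_h=480))
# -> "0 0.312500 0.572917 0.312500 0.520833"
```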
5.2 Assigning Anchors to Ground Truth
- For each labeled object, calculate IoU with all anchor boxes.
- Assign that object to the anchor box with the best IoU.
- Train that anchor to predict the object.
- Use a loss function that includes:
- Bounding box regression loss
- Confidence/objectness loss
- Classification loss
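A minimal sketch of the assignment step, using the same shape-only IoU (centers ignored) as the clustering above; names and anchor values are illustrative:

```python
import numpy as np

def best_anchor(gt_wh, anchors_wh):
    """Pick the anchor whose (w, h) best matches a ground-truth box."""
    inter = np.minimum(gt_wh[0], anchors_wh[:, 0]) * \
            np.minimum(gt_wh[1], anchors_wh[:, 1])
    union = gt_wh[0] * gt_wh[1] + anchors_wh[:, 0] * anchors_wh[:, 1] - inter
    return int(np.argmax(inter / union))  # index of the best-IoU anchor

anchors = np.array([[0.05, 0.08], [0.15, 0.30], [0.40, 0.35]])  # illustrative
print(best_anchor(np.array([0.2, 0.3]), anchors))  # -> 1 (the tall-ish anchor)
```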
6. Prediction Phase
At inference time, the model is given only raw images (no labels). The process:
6.1 Decoding Predictions
Apply transformations to map model outputs into actual bounding box values.
bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * exp(tw)
bh = ph * exp(th)
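In NumPy, the same decode for a single prediction might look like this sketch, where cx, cy are the grid cell's top-left offsets and pw, ph the assigned anchor's width and height (here in grid units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw outputs (tx, ty, tw, th) to a box in grid units."""
    bx = sigmoid(tx) + cx   # center x, offset from the cell's top-left corner
    by = sigmoid(ty) + cy   # center y
    bw = pw * np.exp(tw)    # width, scaled from the anchor prior
    bh = ph * np.exp(th)    # height, scaled from the anchor prior
    return bx, by, bw, bh   # divide bx, by by S for normalized image coords

print(decode_box(0.2, -0.1, 0.3, 0.1, cx=6, cy=4, pw=2.5, ph=3.0))
```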
6.2 Filtering and Non-Maximum Suppression (NMS)
- Discard predictions with low confidence (e.g., < 0.5).
- Apply NMS per class:
- Keep the highest confidence box.
- Remove overlapping boxes with IoU > threshold (e.g., 0.5).
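A minimal greedy NMS sketch in NumPy, with boxes given as (x1, y1, x2, y2) corners (thresholds are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop heavy overlaps, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```

Run it once per class on the boxes that survive the confidence filter.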
7. Training vs Prediction Summary
| Feature | Training | Prediction |
|-------------------|--------------------------|------------------------------|
| Input | Image + Labels | Image only |
| Ground Truth | ✅ Yes | ❌ No |
| Anchor Assignment | Based on IoU | Not needed |
| Output | Loss → backpropagation | Boxes, classes, confidences |
| Post-processing | Not needed | ✅ NMS + filtering |
📉 Loss Function in YOLO
The YOLO loss function is made up of three key components:
Total Loss = Coordinate Loss + Class Probability Loss + Object Confidence Loss
1. 📍 Coordinate Loss
Purpose: Measures how far off the predicted bounding box is from the actual object in terms of position and size.
Equation (plain text):
loss_xywh = lambda_coord * sum_over_cells_and_boxes(
indicator_obj(i,j) * [
(x - x_hat)^2 + (y - y_hat)^2 +
(sqrt(w) - sqrt(w_hat))^2 +
(sqrt(h) - sqrt(h_hat))^2
]
)
Explanation:
- We calculate the squared difference between the predicted and actual center (x, y) and dimensions (width and height).
- Width and height differences are taken in square root space to reduce the impact of large boxes.
- This loss is only applied if there's an object in the cell (indicator_obj(i,j) = 1).
- lambda_coord is a constant (usually 5) used to give more importance to localization accuracy.
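As a sketch, the same sum in NumPy over the M matched predictions (i.e., only the cell/anchor slots where the indicator is 1; assumes w and h are already positive):

```python
import numpy as np

def coord_loss(pred, truth, lambda_coord=5.0):
    """pred, truth: arrays of shape (M, 4) = (x, y, w, h) for matched boxes."""
    xy = np.sum((truth[:, :2] - pred[:, :2]) ** 2, axis=1)
    wh = np.sum((np.sqrt(truth[:, 2:]) - np.sqrt(pred[:, 2:])) ** 2, axis=1)
    return lambda_coord * np.sum(xy + wh)
```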
2. 🧠 Class Probability Loss
Purpose: Measures how well the model predicts the correct class for the object.
Equation (plain text):
loss_class = -lambda_class * sum_over_cells_and_boxes(
indicator_obj(i,j) * sum_over_classes(
true_class_prob * log(predicted_class_prob)
)
)
Explanation:
- We compare the true class probabilities (usually one-hot encoded) with the predicted probabilities using cross-entropy.
- Only calculated if there's an object in the cell.
- lambda_class is a weight term that controls how much we care about classification accuracy.
- C is the total number of object classes.
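In code, the matched-cell cross-entropy might look like this sketch (rows of true_probs are one-hot over the C classes):

```python
import numpy as np

def class_loss(true_probs, pred_probs, lambda_class=1.0, eps=1e-9):
    """Cross-entropy over C classes for the M cells that contain an object.
    true_probs, pred_probs: arrays of shape (M, C)."""
    return -lambda_class * np.sum(true_probs * np.log(pred_probs + eps))
```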
3. ✅ Object Confidence Loss
Purpose: Evaluates how confident the model is about the presence or absence of an object.
Equation (plain text):
loss_conf = (lambda_obj / N_conf) * sum_over_cells_and_boxes(
indicator_obj(i,j) * (IOU(truth, pred) - predicted_confidence)^2
)
+ (lambda_noobj / N_conf) * sum_over_cells_and_boxes(
indicator_noobj(i,j) * (0 - predicted_confidence)^2
)
Explanation:
- The first term penalizes the model when it is not confident enough about real objects (low confidence for true boxes).
- The second term penalizes the model for being too confident in background areas (high confidence in empty cells).
- lambda_obj and lambda_noobj are balancing parameters.
- lambda_noobj is usually small (e.g., 0.5) to avoid punishing the model too much for predicting objects where there are none.
- N_conf is a normalization factor (the total number of confidence predictions).
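And a sketch of the confidence term, with obj_mask marking the anchor slots responsible for a ground-truth box (all arrays flattened to one entry per slot; names are illustrative):

```python
import numpy as np

def conf_loss(pred_conf, iou_truth_pred, obj_mask,
              lambda_obj=1.0, lambda_noobj=0.5):
    """pred_conf, iou_truth_pred, obj_mask: flat arrays, one entry per slot."""
    n = pred_conf.size                                      # N_conf normalizer
    obj = np.sum(obj_mask * (iou_truth_pred - pred_conf) ** 2)
    noobj = np.sum((1 - obj_mask) * (0 - pred_conf) ** 2)
    return (lambda_obj / n) * obj + (lambda_noobj / n) * noobj
```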
🎨 Definitions and Symbols from the Loss Diagram
| Symbol | Meaning |
|---------------------|----------------------------------------------------------|
| ( S^2 ) | Number of grid cells (e.g., 13×13) |
| ( B ) | Number of anchor boxes per cell |
| ( \mathbb{1}_{i,j}^{obj} ) | Indicator (1 if object exists in cell i and anchor j) |
| ( x, y, w, h ) | Coordinates of ground truth box |
| ( \hat{x}, \hat{y}, \hat{w}, \hat{h} ) | Predicted box values |
| ( C_{i,j} ) | Ground truth confidence (1 or 0) |
| ( \hat{C}_{i,j} ) | Predicted confidence score |
8. Example: Dog and Ball
Training Image:
- 🐶 Dog at (x=0.3, y=0.5, w=0.2, h=0.3)
- 🎾 Ball at (x=0.8, y=0.7, w=0.1, h=0.1)
YOLO assigns these objects to the closest anchor boxes and trains those anchors in the matching grid cells.
Prediction Image:
- Model sees image → outputs 1000+ predictions.
- Filter + NMS → outputs:
- Box 1: class=dog, conf=0.89
- Box 2: class=ball, conf=0.75
9. Label Formats
| Format | Description | Example Tool |
|----------------|-------------------------------------------|----------------------|
| YOLO .txt      | Class + (x, y, w, h), normalized          | LabelImg             |
| Pascal VOC XML | XML file with box coordinates | CVAT |
| COCO JSON | JSON format with rich annotations | Labelme, Roboflow |