YOLO Object Detection: From Training to Prediction

April 16, 2025

🧠 YOLO Object Detection: Deep Dive (Training vs Prediction)

πŸ“Œ Table of Contents


1. What is YOLO?

YOLO (You Only Look Once) is a family of real-time object detection models. It detects and classifies multiple objects in an image using a single forward pass through a neural network.


2. One Single Pass of a Neural Network

In traditional computer vision pipelines, object detection was multi-step: extract regions β†’ classify each region.

But YOLO simplifies this by doing everything in one go:

YOLO takes an image, processes it through a convolutional neural network (CNN) once, and outputs bounding boxes + class probabilities directly.

There’s no separate stage for proposing regions (as in older two-stage models); the entire image is processed end-to-end in one network.


2.1 Two-Stage vs One-Stage Detectors

| Type | Example Models | Description |
|------------------|---------------------------------|-------------|
| Two-Stage | R-CNN, Fast R-CNN, Faster R-CNN | First generate region proposals → then classify + refine boxes |
| One-Stage | YOLO, SSD, RetinaNet | Directly predict bounding boxes + classes in a single step |

πŸ’‘ YOLO is faster but might sacrifice a bit of accuracy compared to some two-stage models. That’s why it’s widely used in real-time applications.


3. How YOLO Works

3.1 Grid and Anchor Boxes

YOLO divides the input image into an S × S grid, and each grid cell predicts A anchor boxes. So for S=13 and A=3 anchors:

13 × 13 × 3 = 507 predicted boxes

Each prediction includes box offsets (tx, ty, tw, th), an objectness score (to), and one probability per class.


3.2 Predictions per Grid Cell

Each grid cell + anchor box predicts:

[tx, ty, tw, th, to, class1, class2, ..., classN]
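
To make the shapes concrete, here is a minimal sketch (assuming S=13, A=3, and N=80 classes; the random array is just a stand-in for a real network output):

```python
import numpy as np

S, A, N = 13, 3, 80                      # grid size, anchors per cell, classes
raw = np.random.randn(S, S, A, 5 + N)    # stand-in for the network's output tensor

tx, ty = raw[..., 0], raw[..., 1]        # center offsets (pre-sigmoid)
tw, th = raw[..., 2], raw[..., 3]        # size offsets (pre-exp)
to = raw[..., 4]                         # objectness logit
class_scores = raw[..., 5:]              # one score per class

print(raw.reshape(-1, 5 + N).shape)      # (507, 85) -> 13 * 13 * 3 = 507 boxes
```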

4. Anchor Boxes: How Do We Get Them?

4.1 Why Anchor Boxes?

Object shapes vary: a pedestrian is tall and narrow, a car is short and wide, a ball is roughly square.

Using multiple anchor boxes with different aspect ratios allows the model to better match these variations.


4.2 K-Means Clustering for Anchors

To choose good anchor box sizes, we use K-Means clustering on the training set bounding boxes.

πŸ“Š Steps:

  1. Collect all bounding boxes from your labeled dataset.
  2. Normalize their widths and heights.
  3. Apply K-Means clustering on (width, height) pairs.
  4. The cluster centers are your anchor box dimensions.

βœ… Why K-Means?

Because it groups boxes with similar shapes together, helping the model generalize better.

🧠 YOLOv2 introduced (and YOLOv3 kept) IoU distance as the clustering metric instead of Euclidean distance, so cluster quality reflects box overlap rather than raw size differences:

distance = 1 - IoU(box, cluster_center)
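
A minimal NumPy sketch of this clustering, assuming boxes is an (M, 2) float array of normalized (width, height) pairs (the function names here are illustrative, not from any library):

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, treating every box as centered at the origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance = 1 - IoU; the centers become anchors."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centers), axis=1)
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)  # some variants use the median
    return centers
```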

5. Training Phase

5.1 Ground Truth and Manual Labeling

YOLO Format:

class_id center_x center_y width height
# All values are normalized (0–1)
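
For example, the line 0 0.48 0.63 0.36 0.42 describes a class-0 object centered at (0.48, 0.63) that spans 36% of the image width and 42% of its height. A minimal parser for this format might look like this (the file path is hypothetical):

```python
def load_yolo_labels(path):
    """Read one YOLO .txt label file: 'class_id cx cy w h' per line, all in [0, 1]."""
    boxes = []
    with open(path) as f:
        for line in f:
            class_id, cx, cy, w, h = line.split()
            boxes.append((int(class_id), float(cx), float(cy), float(w), float(h)))
    return boxes

# e.g. load_yolo_labels("labels/dog.txt") -> [(0, 0.48, 0.63, 0.36, 0.42), ...]
```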

5.2 Assigning Anchors to Ground Truth
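
Each ground-truth box is assigned to the grid cell containing its center, and within that cell to the anchor whose shape gives the highest IoU with the box. Only that anchor is trained to predict the object; the rest are treated as background. A minimal sketch of the shape-matching step (anchor sizes are illustrative):

```python
import numpy as np

def best_anchor(gt_wh, anchors):
    """Index of the anchor whose (w, h) has the highest IoU with a ground-truth box,
    comparing shapes only (both boxes imagined centered at the origin)."""
    inter = np.minimum(gt_wh[0], anchors[:, 0]) * np.minimum(gt_wh[1], anchors[:, 1])
    union = gt_wh[0] * gt_wh[1] + anchors[:, 0] * anchors[:, 1] - inter
    return int(np.argmax(inter / union))

anchors = np.array([[0.10, 0.15], [0.30, 0.30], [0.60, 0.45]])  # illustrative sizes
print(best_anchor(np.array([0.28, 0.33]), anchors))             # -> 1
```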


6. Prediction Phase

At inference time, the model is given only raw images (no labels). The process has two steps, covered below: decode the raw outputs into boxes, then filter them with confidence thresholding and NMS.

6.1 Decoding Predictions

Apply transformations to map the raw outputs into actual bounding box values, where (cx, cy) is the grid cell's offset from the image's top-left corner and (pw, ph) is the matched anchor's width and height:

bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * exp(tw)
bh = ph * exp(th)
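
A minimal sketch of this decoding for one prediction (dividing by S at the end, a common convention, normalizes the center back to [0, 1] image coordinates):

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, S=13):
    """Turn raw offsets into a box: (cx, cy) is the cell's grid index,
    (pw, ph) the matched anchor's normalized width and height."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = (sigmoid(tx) + cx) / S   # center x in [0, 1]
    by = (sigmoid(ty) + cy) / S   # center y in [0, 1]
    bw = pw * np.exp(tw)          # width, scaled from the anchor prior
    bh = ph * np.exp(th)          # height, scaled from the anchor prior
    return bx, by, bw, bh

# e.g. decode_box(0.2, -0.1, 0.3, 0.0, cx=6, cy=4, pw=0.30, ph=0.30)
```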

6.2 Filtering and Non-Maximum Suppression (NMS)

  1. Discard predictions with low confidence (e.g., < 0.5).
  2. Apply NMS per class (a minimal sketch follows this list):
    • Keep the highest confidence box.
    • Remove overlapping boxes with IoU > threshold (e.g., 0.5).
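
A minimal per-class NMS sketch, with boxes given as corner coordinates and thresholds as in the steps above:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop heavily overlapping ones, repeat.
    boxes: (M, 4) array of (x1, y1, x2, y2); scores: (M,) confidences."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining candidate
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[best:best + 1])[0] + area(boxes[rest]) - inter)
        order = rest[iou <= iou_thresh]   # survivors for the next round
    return keep
```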

7. Training vs Prediction Summary

| Feature | Training | Prediction |
|-----------------------|----------------------------------|--------------------------------|
| Input | Image + Labels | Image only |
| Ground Truth | ✅ Yes | ❌ No |
| Anchor Assignment | Based on IoU | Not needed |
| Output | Loss → backpropagation | Boxes, classes, confidences |
| Post-processing | Not needed | ✅ NMS + filtering |


πŸ“‰ Loss Function in YOLO

The YOLO loss function is made up of three key components:

Total Loss = Coordinate Loss + Class Probability Loss + Object Confidence Loss


1. πŸ“ Coordinate Loss

Purpose: Measures how far off the predicted bounding box is from the actual object in terms of position and size.

Equation (plain text):

loss_xywh = lambda_coord * sum_over_cells_and_boxes(
  indicator_obj(i,j) * [ 
    (x - x_hat)^2 + (y - y_hat)^2 + 
    (sqrt(w) - sqrt(w_hat))^2 + 
    (sqrt(h) - sqrt(h_hat))^2 
  ]
)

Explanation: the squared errors on (x, y) penalize misplaced centers, while taking square roots of w and h before differencing stops large boxes from dominating small ones. The indicator restricts the sum to the anchor responsible for each object, and lambda_coord (typically 5) up-weights localization.


2. 🧠 Class Probability Loss

Purpose: Measures how well the model predicts the correct class for the object.

Equation (plain text):

loss_class = -lambda_class * sum_over_cells_and_boxes(
  indicator_obj(i,j) * sum_over_classes(
    true_class_prob * log(predicted_class_prob)
  )
)

Explanation: this is a cross-entropy term between the true class distribution and the predicted one (note the leading minus sign), computed only for cells that contain an object; assigning low probability to the true class is penalized heavily.


3. βœ… Object Confidence Loss

Purpose: Evaluates how confident the model is about the presence or absence of an object.

Equation (plain text):

loss_conf = (lambda_obj / N_conf) * sum_over_cells_and_boxes(
  indicator_obj(i,j) * (IOU(truth, pred) - predicted_confidence)^2
)
+ (lambda_noobj / N_conf) * sum_over_cells_and_boxes(
  indicator_noobj(i,j) * (0 - predicted_confidence)^2
)

Explanation: for the anchor responsible for an object, the predicted confidence is pushed toward the IoU between the predicted and ground-truth boxes; for all other anchors it is pushed toward 0. Because background anchors vastly outnumber object anchors, lambda_noobj down-weights the second term so it does not swamp the loss.
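
Putting the three pieces together, here is a toy sketch that evaluates the total loss over already-matched predictions (the array layout and weights are illustrative assumptions, not a specific YOLO implementation):

```python
import numpy as np

def yolo_loss(pred, truth, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """pred, truth: (M, 5 + N) rows of [x, y, w, h, conf, class probs...];
    obj_mask: (M,) with 1 where an object was assigned to that slot, else 0."""
    noobj_mask = 1.0 - obj_mask
    # 1. Coordinate loss: squared error on centers, sqrt-scaled sizes
    loss_xy = np.sum(obj_mask * ((pred[:, 0] - truth[:, 0]) ** 2 +
                                 (pred[:, 1] - truth[:, 1]) ** 2))
    loss_wh = np.sum(obj_mask * ((np.sqrt(pred[:, 2]) - np.sqrt(truth[:, 2])) ** 2 +
                                 (np.sqrt(pred[:, 3]) - np.sqrt(truth[:, 3])) ** 2))
    # 2. Class loss: cross-entropy, only where an object exists
    loss_cls = -np.sum(obj_mask *
                       np.sum(truth[:, 5:] * np.log(pred[:, 5:] + 1e-9), axis=1))
    # 3. Confidence loss: push conf toward the target (IoU or 1) for object slots,
    #    toward 0 everywhere else, with the background term down-weighted
    loss_conf = np.sum(obj_mask * (truth[:, 4] - pred[:, 4]) ** 2) + \
                lambda_noobj * np.sum(noobj_mask * pred[:, 4] ** 2)
    return lambda_coord * (loss_xy + loss_wh) + loss_cls + loss_conf
```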


🟨 Definitions and Symbols from the Loss Diagram

| Symbol | Meaning |
|---------------------|----------------------------------------------------------|
| \( S^2 \) | Number of grid cells (e.g., 13×13) |
| \( B \) | Number of anchor boxes per cell |
| \( \mathbb{1}_{i,j}^{obj} \) | Indicator (1 if object exists in cell i and anchor j) |
| \( x, y, w, h \) | Coordinates of ground truth box |
| \( \hat{x}, \hat{y}, \hat{w}, \hat{h} \) | Predicted box values |
| \( C_{i,j} \) | Ground truth confidence (1 or 0) |
| \( \hat{C}_{i,j} \) | Predicted confidence score |


8. Example: Dog and Ball

Training Image: the image contains a dog and a ball, each hand-labeled with a class ID and a bounding box.

YOLO assigns these boxes to the closest anchor boxes and trains those anchors in the matching grid cells.

Prediction Image: a new, unlabeled image. The model outputs candidate boxes for the dog and the ball, which are then thresholded by confidence and cleaned up with NMS.


9. Label Formats

| Format | Description | Example Tool |
|----------------|-------------------------------------------|----------------------|
| YOLO .txt | Class + (x, y, w, h), normalized | LabelImg |
| Pascal VOC XML | XML file with box coordinates | CVAT |
| COCO JSON | JSON format with rich annotations | Labelme, Roboflow |