🧠 YOLO Object Detection: Deep Dive (Training vs Prediction)
📋 Table of Contents
- 1. What is YOLO?
- 2. One Single Pass of a Neural Network
- 3. How YOLO Works
- 4. Anchor Boxes: How Do We Get Them?
- 5. Training Phase
- 6. Prediction Phase
- 7. Training vs Prediction Summary
- 8. Example: Dog and Ball
- 9. Label Formats
- 10. Conclusion
1. What is YOLO?
YOLO (You Only Look Once) is a family of real-time object detection models. It detects and classifies multiple objects in an image using a single forward pass through a neural network.
2. One Single Pass of a Neural Network
In traditional computer vision pipelines, object detection was multi-step: extract regions → classify each region.
But YOLO simplifies this by doing everything in one go:
YOLO takes an image, processes it through a convolutional neural network (CNN) once, and outputs bounding boxes + class probabilities directly.
There's no separate stage for proposing regions (like in older models). The entire image is processed end-to-end.
2.1 Two-Stage vs One-Stage Detectors
| Type | Example Models | Description |
|-----------|---------------------------------|-------------|
| Two-Stage | R-CNN, Fast R-CNN, Faster R-CNN | First generate region proposals → then classify + refine boxes |
| One-Stage | YOLO, SSD, RetinaNet | Directly predict bounding boxes + classes in a single step |
💡 YOLO is faster but might sacrifice a bit of accuracy compared to some two-stage models. That's why it's widely used in real-time applications.
3. How YOLO Works
3.1 Grid and Anchor Boxes
- Input image (e.g., 416×416) is divided into an S×S grid (e.g., 13×13, 19×19).
- Each cell predicts multiple bounding boxes using anchor boxes (e.g., 3, 5, 9).
So for S = 13 and A = 3 anchors:
13 × 13 × 3 = 507 predicted boxes
Each prediction includes:
- Bounding box (x, y, w, h)
- Objectness score (is there an object?)
- Class probabilities
3.2 Predictions per Grid Cell
Each grid cell + anchor box predicts:
[tx, ty, tw, th, to, class1, class2, ..., classN]
- (tx, ty) → center of the box (relative to grid cell)
- (tw, th) → width and height (adjusted from anchor box)
- to → object confidence score
- class1...classN → probabilities for each class
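To make the arithmetic concrete, here is a minimal sketch (plain Python; the S, A, and N values are illustrative) of how the output volume follows from the grid size, anchor count, and class count:

```python
# Minimal sketch: size of the YOLO output volume.
S, A, N = 13, 3, 80                  # grid size, anchors per cell, classes (e.g., COCO)

values_per_box = 5 + N               # tx, ty, tw, th, to + N class probabilities
total_boxes = S * S * A              # 13 * 13 * 3 = 507 predicted boxes
print(total_boxes)                   # 507
print(total_boxes * values_per_box)  # 43095 raw output values per image
```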
4. Anchor Boxes: How Do We Get Them?
4.1 Why Anchor Boxes?
Object shapes vary:
- A person has a tall vertical box.
- A car has a wide horizontal box.
- A ball has a square box.
Using multiple anchor boxes with different aspect ratios allows the model to better match these variations.
4.2 K-Means Clustering for Anchors
To choose good anchor box sizes, we use K-Means clustering on the training set bounding boxes.
📌 Steps:
- Collect all bounding boxes from your labeled dataset.
- Normalize their widths and heights.
- Apply K-Means clustering on (width, height) pairs.
- The cluster centers are your anchor box dimensions.
❓ Why K-Means?
Because it groups boxes with similar shapes together, helping the model generalize better.
🧠 YOLOv3 used IoU distance as the clustering metric instead of Euclidean distance:
distance = 1 - IoU(box, cluster_center)
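Below is a minimal NumPy sketch of this clustering, assuming boxes are normalized (width, height) pairs compared as if they shared a common center; function names are illustrative, not from any YOLO codebase:

```python
import numpy as np

def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, treating all boxes as sharing one center."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with the 1 - IoU distance; returns k anchor sizes."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Minimizing (1 - IoU) is the same as maximizing IoU.
        assign = np.argmax(wh_iou(boxes, centers), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

# Usage: boxes = np.array of normalized (w, h) from your labels
# anchors = kmeans_anchors(boxes, k=9)
```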
5. Training Phase
5.1 Ground Truth and Manual Labeling
- Use tools like LabelImg or Roboflow to label each image.
- Ground truth includes:
- Class
- Bounding box coordinates (x, y, w, h)
YOLO Format:
class_id center_x center_y width height
# All values are normalized (0–1)
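As an illustration, a small helper (hypothetical, not part of any labeling tool) that converts a pixel-space box (xmin, ymin, xmax, ymax) into one line of this format:

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space box to the normalized YOLO .txt format."""
    cx = (xmin + xmax) / 2 / img_w   # normalized center x
    cy = (ymin + ymax) / 2 / img_h   # normalized center y
    w = (xmax - xmin) / img_w        # normalized width
    h = (ymax - ymin) / img_h        # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(to_yolo_line(0, 100, 150, 300, 400, img_w=640, img_h=480))
# -> "0 0.312500 0.572917 0.312500 0.520833"
```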
5.2 Assigning Anchors to Ground Truth
- For each labeled object, calculate IoU with all anchor boxes.
- Assign that object to the anchor box with the best IoU.
- Train that anchor to predict the object.
- Use a loss function that includes:
- Bounding box regression loss
- Confidence/objectness loss
- Classification loss
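A minimal sketch of the assignment step, using the same shape-only IoU (centers ignored) as the clustering above; names and anchor values are illustrative:

```python
import numpy as np

def best_anchor(gt_wh, anchors_wh):
    """Pick the anchor whose (w, h) best matches a ground-truth box."""
    inter = np.minimum(gt_wh[0], anchors_wh[:, 0]) * \
            np.minimum(gt_wh[1], anchors_wh[:, 1])
    union = gt_wh[0] * gt_wh[1] + anchors_wh[:, 0] * anchors_wh[:, 1] - inter
    return int(np.argmax(inter / union))  # index of the best-IoU anchor

anchors = np.array([[0.05, 0.08], [0.15, 0.30], [0.40, 0.35]])  # illustrative
print(best_anchor(np.array([0.2, 0.3]), anchors))  # -> 1 (the tall-ish anchor)
```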
6. Prediction Phase
At inference time, the model is given only raw images (no labels). The process:
6.1 Decoding Predictions
Apply transformations to map model outputs into actual bounding box values.
bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * exp(tw)
bh = ph * exp(th)
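In NumPy, the same decode for a single prediction might look like this sketch, where cx, cy are the grid cell's top-left offsets and pw, ph the assigned anchor's width and height (here in grid units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw outputs (tx, ty, tw, th) to a box in grid units."""
    bx = sigmoid(tx) + cx   # center x, offset from the cell's top-left corner
    by = sigmoid(ty) + cy   # center y
    bw = pw * np.exp(tw)    # width, scaled from the anchor prior
    bh = ph * np.exp(th)    # height, scaled from the anchor prior
    return bx, by, bw, bh   # divide bx, by by S for normalized image coords

print(decode_box(0.2, -0.1, 0.3, 0.1, cx=6, cy=4, pw=2.5, ph=3.0))
```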
6.2 Filtering and Non-Maximum Suppression (NMS)
- Discard predictions with low confidence (e.g., < 0.5).
- Apply NMS per class:
- Keep the highest confidence box.
- Remove overlapping boxes with IoU > threshold (e.g., 0.5).
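A minimal greedy NMS sketch in NumPy, with boxes given as (x1, y1, x2, y2) corners (thresholds are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop heavy overlaps, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```

Run it once per class on the boxes that survive the confidence filter.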
7. Training vs Prediction Summary
| Feature | Training | Prediction |
|-------------------|--------------------------|------------------------------|
| Input | Image + Labels | Image only |
| Ground Truth | ✅ Yes | ❌ No |
| Anchor Assignment | Based on IoU | Not needed |
| Output | Loss → backpropagation | Boxes, classes, confidences |
| Post-processing | Not needed | ✅ NMS + filtering |
📉 Loss Function in YOLO
The YOLO loss function is made up of three key components:
Total Loss = Coordinate Loss + Class Probability Loss + Object Confidence Loss
1. 📍 Coordinate Loss
Purpose: Measures how far off the predicted bounding box is from the actual object in terms of position and size.
Equation (plain text):
loss_xywh = lambda_coord * sum_over_cells_and_boxes(
indicator_obj(i,j) * [
(x - x_hat)^2 + (y - y_hat)^2 +
(sqrt(w) - sqrt(w_hat))^2 +
(sqrt(h) - sqrt(h_hat))^2
]
)
Explanation:
- We calculate the squared difference between the predicted and actual center (x, y) and dimensions (width and height).
- Width and height differences are taken in square root space to reduce the impact of large boxes.
- This loss is only applied if there's an object in the cell (indicator_obj(i,j) = 1).
- lambda_coord is a constant (usually 5) used to give more importance to localization accuracy.
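As a sketch, the same sum in NumPy over the M matched predictions (i.e., only the cell/anchor slots where the indicator is 1; assumes w and h are already positive):

```python
import numpy as np

def coord_loss(pred, truth, lambda_coord=5.0):
    """pred, truth: arrays of shape (M, 4) = (x, y, w, h) for matched boxes."""
    xy = np.sum((truth[:, :2] - pred[:, :2]) ** 2, axis=1)
    wh = np.sum((np.sqrt(truth[:, 2:]) - np.sqrt(pred[:, 2:])) ** 2, axis=1)
    return lambda_coord * np.sum(xy + wh)
```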
2. 🧠 Class Probability Loss
Purpose: Measures how well the model predicts the correct class for the object.
Equation (plain text):
loss_class = -lambda_class * sum_over_cells_and_boxes(
indicator_obj(i,j) * sum_over_classes(
true_class_prob * log(predicted_class_prob)
)
)
Explanation:
- We compare the true class probabilities (usually one-hot encoded) with the predicted probabilities using cross-entropy.
- Only calculated if there's an object in the cell.
- lambda_class is a weight term that controls how much we care about classification accuracy.
- C is the total number of object classes.
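In code, the matched-cell cross-entropy might look like this sketch (rows of true_probs are one-hot over the C classes):

```python
import numpy as np

def class_loss(true_probs, pred_probs, lambda_class=1.0, eps=1e-9):
    """Cross-entropy over C classes for the M cells that contain an object.
    true_probs, pred_probs: arrays of shape (M, C)."""
    return -lambda_class * np.sum(true_probs * np.log(pred_probs + eps))
```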
3. ✅ Object Confidence Loss
Purpose: Evaluates how confident the model is about the presence or absence of an object.
Equation (plain text):
loss_conf = (lambda_obj / N_conf) * sum_over_cells_and_boxes(
indicator_obj(i,j) * (IOU(truth, pred) - predicted_confidence)^2
)
+ (lambda_noobj / N_conf) * sum_over_cells_and_boxes(
indicator_noobj(i,j) * (0 - predicted_confidence)^2
)
Explanation:
- The first term penalizes the model when it is not confident enough about real objects (low confidence for true boxes).
- The second term penalizes the model for being too confident in background areas (high confidence in empty cells).
- lambda_obj and lambda_noobj are balancing parameters.
- lambda_noobj is usually small (e.g., 0.5) to avoid punishing the model too much for predicting objects where there are none.
- N_conf is a normalization factor (the total number of confidence predictions).
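And a sketch of the confidence term, with obj_mask marking the anchor slots responsible for a ground-truth box (all arrays flattened to one entry per slot; names are illustrative):

```python
import numpy as np

def conf_loss(pred_conf, iou_truth_pred, obj_mask,
              lambda_obj=1.0, lambda_noobj=0.5):
    """pred_conf, iou_truth_pred, obj_mask: flat arrays, one entry per slot."""
    n = pred_conf.size                                      # N_conf normalizer
    obj = np.sum(obj_mask * (iou_truth_pred - pred_conf) ** 2)
    noobj = np.sum((1 - obj_mask) * (0 - pred_conf) ** 2)
    return (lambda_obj / n) * obj + (lambda_noobj / n) * noobj
```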
🎨 Definitions and Symbols from the Loss Diagram
| Symbol | Meaning |
|---------------------|----------------------------------------------------------|
| ( S^2 ) | Number of grid cells (e.g., 13×13) |
| ( B ) | Number of anchor boxes per cell |
| ( \mathbb{1}_{i,j}^{obj} ) | Indicator (1 if object exists in cell i and anchor j) |
| ( x, y, w, h ) | Coordinates of ground truth box |
| ( \hat{x}, \hat{y}, \hat{w}, \hat{h} ) | Predicted box values |
| ( C_{i,j} ) | Ground truth confidence (1 or 0) |
| ( \hat{C}_{i,j} ) | Predicted confidence score |
8. Example: Dog and Ball
Training Image:
- 🐶 Dog at (x=0.3, y=0.5, w=0.2, h=0.3)
- 🎾 Ball at (x=0.8, y=0.7, w=0.1, h=0.1)
YOLO assigns these objects to the closest anchor boxes and trains those anchors in the matching grid cells.
Prediction Image:
- Model sees image → outputs 1000+ predictions.
- Filter + NMS → outputs:
- Box 1: class=dog, conf=0.89
- Box 2: class=ball, conf=0.75
9. Label Formats
| Format | Description | Example Tool |
|----------------|-------------------------------------------|----------------------|
| YOLO .txt      | Class + (x, y, w, h), normalized          | LabelImg             |
| Pascal VOC XML | XML file with box coordinates | CVAT |
| COCO JSON | JSON format with rich annotations | Labelme, Roboflow |