Computer Vision

Object Detection — Teaching Computers to Find and Identify Objects

Object detection localizes and classifies objects in images, predicting both bounding boxes and class labels. It is one of the most impactful applications of deep learning.

YOLO is Real-Time — Single-shot, grid-based prediction enables real-time detection at 30+ FPS
Faster R-CNN is Accurate — Two-stage detector with Region Proposal Network for high-precision detection
mAP is the Metric — Mean Average Precision over IoU thresholds is the standard evaluation measure

Object Detection — YOLO, Faster R-CNN, Anchor Boxes and mAP

Detection vs. Classification

Definition: Object Detection

Object detection extends image classification by predicting:

Bounding box: $(x, y, w, h)$ coordinates for each object
Class label: What the object is
Confidence score: How certain the model is

Input: Image → Output: Set of $\{(x, y, w, h, \text{class}, \text{confidence})\}$
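As a minimal sketch of that output format, each detection can be held in a small record and filtered by confidence. The `Detection` class, the `filter_by_confidence` helper, the 0.5 cutoff, and the reading of $(x, y)$ as the box center are illustrative assumptions, not something the text above fixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One predicted object (hypothetical container for illustration)."""
    x: float           # box center x, assumed convention
    y: float           # box center y, assumed convention
    w: float           # box width
    h: float           # box height
    label: str         # predicted class
    confidence: float  # score in [0, 1]


def filter_by_confidence(detections, min_confidence=0.5):
    """Keep only detections whose score reaches the cutoff."""
    return [d for d in detections if d.confidence >= min_confidence]


# Example detector output for one image (made-up values).
detections = [
    Detection(120.0, 80.0, 60.0, 40.0, "dog", 0.92),
    Detection(300.0, 150.0, 50.0, 90.0, "person", 0.35),
]
confident = filter_by_confidence(detections)
```

The output is a set of variable size: an image may yield zero, one, or many such records.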

IoU (Intersection over Union)

Definition: IoU Metric

IoU measures the overlap between predicted and ground truth bounding boxes:

$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}$$

IoU = 1: Perfect overlap
IoU = 0: No overlap
IoU ≥ 0.5: Common threshold for "correct" detection

In the formula above:

$B_{\text{pred}}$ = predicted bounding box
$B_{\text{gt}}$ = ground truth bounding box
$|\cdot|$ = area of the region
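The IoU formula can be sketched in a few lines of plain Python. This version assumes boxes are given as corner coordinates `(x1, y1, x2, y2)`. That format and the `to_corners` helper for converting from center format `(cx, cy, w, h)` are choices made here for illustration.

```python
def to_corners(box):
    """Convert a (cx, cy, w, h) box to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes shifted by one unit: intersection 2, union 6.
overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))
```

The union is computed as the sum of the two areas minus the intersection, so the overlap region is not counted twice.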

Anchor Boxes

Definition: Anchor Boxes

Anchor boxes are predefined reference boxes with fixed scales and aspect ratios (width/height ratios). Instead of predicting absolute coordinates, the network predicts offsets relative to each anchor box:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}$$

$$t_w = \log\left(\frac{w}{w_a}\right), \quad t_h = \log\left(\frac{h}{h_a}\right)$$

where $(x, y, w, h)$ is the box being described, $(x_a, y_a, w_a, h_a)$ is the anchor box, and $(t_x, t_y, t_w, t_h)$ are the offsets the network learns to predict.
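The offset equations and their inverse can be sketched as follows. The function names `encode` and `decode` are hypothetical, and boxes are assumed to be `(x, y, w, h)` tuples with $(x, y)$ at the box center.

```python
import math


def encode(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return (
        (x - xa) / wa,
        (y - ya) / ha,
        math.log(w / wa),
        math.log(h / ha),
    )


def decode(offsets, anchor):
    """Invert encode: recover the box from predicted offsets."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (
        xa + tx * wa,
        ya + ty * ha,
        wa * math.exp(tw),
        ha * math.exp(th),
    )


anchor = (100.0, 100.0, 50.0, 100.0)
box = (110.0, 90.0, 100.0, 50.0)
offsets = encode(box, anchor)
recovered = decode(offsets, anchor)
```

Dividing the center shift by the anchor size and taking the log of the size ratio keeps the regression targets on a similar scale for small and large anchors.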

Non-Maximum Suppression (NMS)

Definition: NMS

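NMS removes duplicate detections of the same object:

Sort all boxes by confidence score
Select the remaining box with the highest confidence and keep it
Remove every remaining box whose IoU with the selected box exceeds a threshold $\theta$ (commonly 0.5)
Repeat the selection and removal steps until no boxes remain

$$\text{Keep box } i \text{ if } \text{IoU}(B_i, B_j) \le \theta \text{ for every higher-scoring kept box } B_j$$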
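The NMS steps above can be sketched as a greedy loop in plain Python. Boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and the `nms` and `iou` names are illustrative. In practice NMS is usually run separately for each class, which this sketch leaves out.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Indices sorted from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.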

Two-Stage: Faster R-CNN

Definition: Faster R-CNN

Faster R-CNN is a two-stage detector:

Stage 1: Region Proposal Network (RPN)

Slides a small window over feature map
At each position, predicts k anchor box offsets and objectness scores
Outputs region proposals (potential object locations)
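To make the "k anchors at each position" idea concrete, here is a rough sketch that places anchors on a feature-map grid. The `generate_anchors` helper is hypothetical. The stride of 16 and the 3 scales by 3 aspect ratios (giving k = 9) follow the commonly cited Faster R-CNN defaults, but should be treated as assumptions here.

```python
import math


def generate_anchors(feature_h, feature_w, stride,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors for every feature-map position."""
    anchors = []
    for row in range(feature_h):
        for col in range(feature_w):
            # Center of this feature-map cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio is h / w; the anchor area stays at scale ** 2.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(feature_h=2, feature_w=3, stride=16)
```

The RPN then predicts one objectness score and one set of four offsets for each of these anchors.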

Stage 2: Fast R-CNN

RoI Pooling extracts fixed-size features from each proposal
Classification head predicts class + refined bounding box

$$\mathcal{L} = \mathcal{L}_{\text{RPN cls}} + \mathcal{L}_{\text{RPN box}} + \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}}$$

Single-Stage: YOLO

Definition: YOLO (You Only Look Once)

YOLO treats detection as a single regression problem:

Divide image into $S \times S$ grid
Each grid cell predicts $B$ bounding boxes, each with 4 coordinates and 1 confidence score (5 values per box), plus $C$ class probabilities
Output tensor: $S \times S \times (B \times 5 + C)$

$$\text{Output: } 7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30 \text{ (for PASCAL VOC)}$$
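The output-size arithmetic can be checked with a short sketch. The `yolo_output_shape` and `split_cell` helpers are hypothetical, and the layout of a cell vector (all box values first, then class probabilities) is an assumption made for illustration. Real implementations may order the values differently.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, B*5 + C values per cell."""
    return (S, S, B * 5 + C)


def split_cell(cell, B, C):
    """Split one cell's vector into B boxes of 5 values and C class probabilities.

    Assumed layout: [box_0 (x, y, w, h, conf), ..., box_{B-1}, class_0, ..., class_{C-1}].
    """
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:B * 5 + C])
    return boxes, class_probs


shape = yolo_output_shape(S=7, B=2, C=20)
# A dummy cell vector 0..29, just to show where each slice lands.
boxes, class_probs = split_cell(list(range(30)), B=2, C=20)
```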

Successive versions from YOLOv1 through YOLOv8 have progressively improved accuracy and speed through better backbones, necks, and training strategies.

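Two-Stage vs Single-Stage

Two-stage detectors such as Faster R-CNN are accurate but slower, while single-stage detectors such as YOLO are faster and real-time capable.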

Evaluation: mAP

Definition: Mean Average Precision (mAP)

mAP is the standard metric for object detection:

Precision-Recall curve: Plot precision vs recall at different confidence thresholds
AP (Average Precision): Area under precision-recall curve for each class
mAP: Mean AP across all classes

$$\text{mAP} = \frac{1}{C}\sum_{c=1}^{C} \text{AP}_c$$

Common variants:

mAP@0.5: IoU threshold = 0.5
mAP@0.5:0.95: Average over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO standard)

mAP Calculation

$$\text{mAP} = \frac{1}{C}\sum_{c=1}^{C} \int_0^1 P_c(r) \, dr$$

where $P_c(r)$ is the precision for class $c$ at recall $r$.
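One way to approximate the integral is to sum rectangles under the precision-recall curve, as in the sketch below. It assumes each detection has already been marked as a true or false positive by matching it to ground truth at a chosen IoU threshold. The function names are hypothetical. The monotone precision envelope is a common convention, not the only one: benchmarks such as PASCAL VOC and COCO each define their own interpolation.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    if num_ground_truth == 0:
        return 0.0
    # Walk through detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (envelope, right to left).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas: recall step times precision at that step.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of per-class AP values given as a {class_name: AP} dict."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Three detections for one class, two ground-truth objects (made-up values).
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
```

For mAP@0.5:0.95, the same computation would be repeated with true-positive flags derived at each IoU threshold and the results averaged.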

Summary

Object detection predicts bounding boxes + class labels + confidence scores
IoU measures overlap between predicted and ground truth boxes
Anchor boxes provide reference shapes for regression
NMS removes duplicate detections
Two-stage (Faster R-CNN): accurate but slower
Single-stage (YOLO): faster, real-time capable
mAP is the standard evaluation metric

Next: Semantic Segmentation
