Computer Vision
Object Detection — Teaching Computers to Find and Identify Objects
Object detection localizes and classifies objects in images, predicting both bounding boxes and class labels. It is one of the most impactful applications of deep learning.
- YOLO is Real-Time — Single-shot, grid-based prediction enables real-time detection at 30+ FPS
- Faster R-CNN is Accurate — Two-stage detector with Region Proposal Network for high-precision detection
- mAP is the Metric — Mean Average Precision over IoU thresholds is the standard evaluation measure
Object Detection — YOLO, Faster R-CNN, Anchor Boxes and mAP
Detection vs. Classification
Definition: Object Detection
Object detection extends image classification by predicting:
- Bounding box: coordinates for each object
- Class label: What the object is
- Confidence score: How certain the model is
Input: Image → Output: Set of (bounding box, class label, confidence score) tuples
IoU (Intersection over Union)
Definition: IoU Metric
IoU measures the overlap between predicted and ground truth bounding boxes:
- IoU = 1: Perfect overlap
- IoU = 0: No overlap
- IoU ≥ 0.5: Common threshold for "correct" detection
IoU (Intersection over Union)
$$\text{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$
Here,
- $B_p$ = Predicted bounding box
- $B_{gt}$ = Ground truth bounding box
- $|\cdot|$ = Area of the box
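A minimal sketch of the IoU computation for two axis-aligned boxes. The corner format `(x1, y1, x2, y2)` and the function name `iou` are assumptions made for illustration; other box formats need converting first.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the intersection rectangle, clamped to 0
    # when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so their IoU is 1/7, well below the common 0.5 threshold.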
Anchor Boxes
Definition: Anchor Boxes
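Anchor boxes are pre-defined reference boxes with fixed shapes (width/height ratios and scales). Instead of predicting absolute coordinates, the network predicts offsets relative to each anchor box. One common parameterization, used by Faster R-CNN, is:
$$x = x_a + w_a t_x, \quad y = y_a + h_a t_y, \quad w = w_a e^{t_w}, \quad h = h_a e^{t_h}$$
where $(x_a, y_a, w_a, h_a)$ is the anchor box (center and size) and $(t_x, t_y, t_w, t_h)$ are the learned offsets.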
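As a hedged illustration, the sketch below decodes offsets into a box using the Faster R-CNN-style parameterization: the center shifts in proportion to the anchor size, and the width and height are scaled exponentially. The function name `decode_box` and the `(center_x, center_y, width, height)` format are assumptions; other detectors use different offset encodings.

```python
import math


def decode_box(anchor, offsets):
    """Turn predicted offsets into a box, relative to one anchor.

    anchor  = (xa, ya, wa, ha): anchor center and size
    offsets = (tx, ty, tw, th): values predicted by the network
    """
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets

    x = xa + wa * tx        # shift center by a fraction of anchor width
    y = ya + ha * ty        # shift center by a fraction of anchor height
    w = wa * math.exp(tw)   # exp keeps the decoded width positive
    h = ha * math.exp(th)   # exp keeps the decoded height positive
    return x, y, w, h
```

With all offsets at zero, the decoded box is the anchor itself, so the network only has to learn small corrections to a reasonable starting shape.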
Non-Maximum Suppression (NMS)
Definition: NMS
NMS removes duplicate detections:
- Sort all boxes by confidence score
- Select the box with the highest confidence and keep it as a final detection
- Remove all remaining boxes whose IoU with the selected box exceeds a threshold (commonly 0.5)
- Repeat with the remaining boxes until none are left
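The greedy procedure above can be sketched in plain Python as follows. The names `nms` and `box_iou`, the `(x1, y1, x2, y2)` box format, and the choice to return indices are assumptions for illustration. In practice NMS is usually run separately per class and implemented with vectorized operations.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Candidate indices, highest confidence first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

A short usage example: of two heavily overlapping boxes and one distant box, the lower-scoring overlapping box is suppressed.

```python
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices 0 and 2 survive
```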
Two-Stage: Faster R-CNN
Definition: Faster R-CNN
Faster R-CNN is a two-stage detector:
Stage 1: Region Proposal Network (RPN)
- Slides a small window over the feature map
- At each position, predicts box offsets and an objectness score for each of k anchor boxes
- Outputs region proposals (potential object locations)
Stage 2: Fast R-CNN
- RoI Pooling extracts fixed-size features from each proposal
- Classification head predicts class + refined bounding box
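To make the "k anchor boxes at each position" idea in Stage 1 concrete, here is a minimal sketch that lays out anchors over a feature map. The function name `generate_anchors` is hypothetical. The defaults (stride 16, three scales and three aspect ratios, so k = 9) are values commonly cited for the original Faster R-CNN and should be treated as assumptions, not as the settings of any particular implementation.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors in input-image pixels.

    Each feature-map cell gets k = len(scales) * len(ratios) anchors.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, mapped back to image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep area close to scale**2 while setting h / w = ratio.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors
```

For a 2×3 feature map this yields 2 · 3 · 9 = 54 anchors. The RPN then scores each one for objectness and regresses offsets to it.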
Single-Stage: YOLO
Definition: YOLO (You Only Look Once)
YOLO treats detection as a single regression problem:
- Divide the image into an $S \times S$ grid
- Each grid cell predicts $B$ bounding boxes + confidence + $C$ class probabilities
- Output tensor: $S \times S \times (B \cdot 5 + C)$
YOLOv1-v8 have progressively improved with better backbones, necks, and training strategies.
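A small sketch of the YOLOv1-style output layout may help here. It assumes S = 7, B = 2, and C = 20 (the values usually quoted for YOLOv1 on Pascal VOC) and a hypothetical per-cell layout of B blocks of `[x, y, w, h, confidence]` followed by C class probabilities. Real implementations may order these values differently, and later YOLO versions use different output heads.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Prediction tensor shape: 5 values per box plus C class scores per cell."""
    return (S, S, B * 5 + C)


def best_detection_in_cell(cell, B=2, C=20):
    """Pick the highest-scoring (box, class_id, score) from one cell vector.

    Assumed layout: B blocks of [x, y, w, h, confidence], then C class
    probabilities shared by all boxes in the cell.
    """
    class_probs = cell[B * 5:B * 5 + C]
    class_id = max(range(C), key=lambda c: class_probs[c])

    best = None
    for b in range(B):
        x, y, w, h, conf = cell[b * 5:b * 5 + 5]
        # Class-specific confidence = box confidence * class probability.
        score = conf * class_probs[class_id]
        if best is None or score > best[2]:
            best = ((x, y, w, h), class_id, score)
    return best
```

With the assumed defaults the output shape is 7 × 7 × 30, so a single forward pass yields 98 candidate boxes that are then filtered by score and NMS.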
Two-Stage vs Single-Stage
Evaluation: mAP
Definition: Mean Average Precision (mAP)
mAP is the standard metric for object detection:
- Precision-Recall curve: Plot precision vs recall at different confidence thresholds
- AP (Average Precision): Area under precision-recall curve for each class
- mAP: Mean AP across all classes
Common variants:
- mAP@0.5: IoU threshold = 0.5
- mAP@0.5:0.95: Average over IoU thresholds 0.5 to 0.95 (COCO standard)
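The following sketch computes AP for one class at a single IoU threshold, then averages over classes. It assumes each detection has already been matched against the ground truth and flagged as a true or false positive, and it uses all-point interpolation of the precision-recall curve; benchmark toolkits differ in the exact interpolation (COCO, for instance, samples the curve at fixed recall points). The function names are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    # Walk through detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum the rectangles under the curve wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: unweighted mean of per-class AP values (dict: class -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

To approximate mAP@0.5:0.95, the same computation would be repeated with the true-positive flags recomputed at each IoU threshold from 0.5 to 0.95 in steps of 0.05, and the resulting mAP values averaged.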
Summary
- Object detection predicts bounding boxes + class labels + confidence scores
- IoU measures overlap between predicted and ground truth boxes
- Anchor boxes provide reference shapes for regression
- NMS removes duplicate detections
- Two-stage (Faster R-CNN): accurate but slower
- Single-stage (YOLO): faster, real-time capable
- mAP is the standard evaluation metric