Object Detection — YOLO, Faster R-CNN, Anchor Boxes and mAP

Object Detection — Teaching Computers to Find and Identify Objects

Object detection localizes and classifies objects in images, predicting both bounding boxes and class labels. It is one of the most impactful applications of deep learning.

  • YOLO is Real-Time — Single-shot, grid-based prediction enables real-time detection at 30+ FPS
  • Faster R-CNN is Accurate — Two-stage detector with Region Proposal Network for high-precision detection
  • mAP is the Metric — Mean Average Precision over IoU thresholds is the standard evaluation measure


Detection vs. Classification

Definition: Object Detection

Object detection extends image classification by predicting:

  1. Bounding box: $(x, y, w, h)$ coordinates for each object
  2. Class label: What the object is
  3. Confidence score: How certain the model is

Input: Image → Output: Set of $\{(x, y, w, h, \text{class}, \text{confidence})\}$
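One way to picture a single element of that output set is as a small record. The sketch below is illustrative only: the `Detection` class and its `corners` helper are hypothetical names, and it assumes that $(x, y)$ is the box center, which the lesson does not state at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    x: float           # box center x in pixels (assumed convention)
    y: float           # box center y in pixels (assumed convention)
    w: float           # box width in pixels
    h: float           # box height in pixels
    label: str         # predicted class
    confidence: float  # model certainty in [0, 1]

    def corners(self):
        """Return the box as (x1, y1, x2, y2) corner coordinates."""
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)


# A detector's output for one image is then just a list of such records.
detections = [
    Detection(50, 40, 20, 10, "car", 0.95),
    Detection(120, 80, 10, 30, "person", 0.88),
]
```

The corner form is convenient for overlap computations, while the center form matches the offset equations used with anchor boxes.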

[Figure: Object Detection Pipeline. Input Image → Backbone (CNN): feature extraction → Neck (FPN): multi-scale features → Head (class + box): prediction → Output: bounding boxes + classes, e.g. 🚗 (0.95), 🚶 (0.88)]

IoU (Intersection over Union)

Definition: IoU Metric

IoU measures the overlap between predicted and ground truth bounding boxes:

$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}$$
  • IoU = 1: Perfect overlap
  • IoU = 0: No overlap
  • IoU ≥ 0.5: Common threshold for "correct" detection

In the IoU formula,

  • $B_{\text{pred}}$ = Predicted bounding box
  • $B_{\text{gt}}$ = Ground truth bounding box
  • $|\cdot|$ = Area of the box
[Figure: IoU Visualization. Ground truth and predicted boxes with their intersection; Union = Area(GT) + Area(Pred) − Intersection. Examples: IoU = 0.9 (excellent), IoU = 0.5 (acceptable), IoU = 0.1 (poor)]
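The IoU definition translates almost directly into code. The following is a minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates (a representation chosen here for convenience, not one the lesson prescribes); the function name `iou` is likewise illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = Area(A) + Area(B) - Intersection.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

For example, two 2×2 boxes shifted by one unit horizontally overlap in an area of 2 and cover a union of 6, so `iou((0, 0, 2, 2), (1, 0, 3, 2))` is 1/3, which falls below the common 0.5 threshold.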

Anchor Boxes

Definition: Anchor Boxes

Anchor boxes are pre-defined reference boxes of fixed size and aspect ratio (width/height). Instead of predicting absolute coordinates, the network predicts offsets relative to an anchor box:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}$$
$$t_w = \log\left(\frac{w}{w_a}\right), \quad t_h = \log\left(\frac{h}{h_a}\right)$$

where $(x_a, y_a, w_a, h_a)$ is the anchor box, $(x, y, w, h)$ is the box being regressed (center coordinates, width, height), and $(t_x, t_y, t_w, t_h)$ are the learned offsets.
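The offset equations can be checked with a short round trip. This is a sketch under the assumption that boxes and anchors are `(x, y, w, h)` tuples with `(x, y)` at the center; `encode` and `decode` are hypothetical helper names.

```python
import math


def encode(box, anchor):
    """Turn a box (x, y, w, h) into offsets (tx, ty, tw, th) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa,
            (y - ya) / ha,
            math.log(w / wa),
            math.log(h / ha))


def decode(offsets, anchor):
    """Invert encode(): recover the box (x, y, w, h) from offsets and an anchor."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (xa + tx * wa,
            ya + ty * ha,
            wa * math.exp(tw),
            ha * math.exp(th))


anchor = (50.0, 50.0, 20.0, 40.0)
box = (55.0, 60.0, 40.0, 40.0)
offsets = encode(box, anchor)      # tx = 0.25, ty = 0.25, tw = log(2), th = 0
recovered = decode(offsets, anchor)
```

Dividing the center shift by the anchor size and taking the log of the size ratio keeps the regression targets on a similar scale for small and large anchors.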


Non-Maximum Suppression (NMS)

Definition: NMS

NMS removes duplicate detections:

  1. Sort all boxes by confidence score
  2. Take the box with the highest confidence and add it to the kept set
  3. Remove every remaining box whose IoU with this box exceeds the threshold $\theta$ (commonly 0.5)
  4. Repeat from step 2 until no boxes remain
$$\text{Keep box } i \text{ if } \text{IoU}(B_i, B_j) \le \theta \text{ for all kept boxes } B_j$$
[Figure: Non-Maximum Suppression. Before NMS: Car (0.9), Car (0.8), Car (0.7), Person (0.85), Person (0.6). After NMS: Car (0.9) ✓, Person (0.85) ✓]
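The NMS steps can be sketched as a greedy loop. The code below is a plain-Python illustration rather than an optimized implementation; it assumes `(x1, y1, x2, y2)` corner boxes, and the names `iou` and `nms` are chosen here for the example.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest confidence first."""
    # Step 1: sort box indices by confidence, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: keep the most confident remaining box.
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop every remaining box that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step 4: loop until nothing is left.
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.85]
kept = nms(boxes, scores)  # the second box overlaps the first (IoU ≈ 0.68) and is dropped
```

In practice NMS is usually run separately for each class, so that a car box does not suppress an overlapping person box, as in the figure above.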

Two-Stage: Faster R-CNN

Definition: Faster R-CNN

Faster R-CNN is a two-stage detector:

Stage 1: Region Proposal Network (RPN)

  • Slides a small window over the feature map
  • At each position, predicts box offsets and an objectness score for each of k anchor boxes
  • Outputs region proposals (potential object locations)
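To make "k anchor boxes at each position" concrete, the sketch below lays out anchors over a feature map. The defaults (three scales, three aspect ratios, stride 16, giving k = 9) follow the configuration commonly associated with Faster R-CNN, but they are assumptions for illustration; `anchors_at` and `grid_anchors` are hypothetical names.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors (xa, ya, wa, ha) centered at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:               # r is height / width
            w = s / math.sqrt(r)       # chosen so that w * h == s * s
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


def grid_anchors(feat_h, feat_w, stride=16, **kwargs):
    """Place the k anchors at the image-space center of every feature-map cell."""
    out = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            out.extend(anchors_at(cx, cy, **kwargs))
    return out


all_anchors = grid_anchors(2, 3)  # 2 x 3 positions, 9 anchors each
```

The RPN then outputs, for each of these anchors, four offsets and one objectness score, and the highest-scoring decoded boxes become the region proposals.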

Stage 2: Fast R-CNN

  • RoI Pooling extracts fixed-size features from each proposal
  • Classification head predicts class + refined bounding box
$$\mathcal{L} = \mathcal{L}_{\text{RPN cls}} + \mathcal{L}_{\text{RPN box}} + \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}}$$
[Figure: Faster R-CNN Architecture. Image → Backbone (CNN) → Feature Map → RPN (region proposals) → RoI Pool (fixed size) → Class + Box Regression → Detections. Stage 1: RPN (proposals) → Stage 2: classification + box regression. Two-stage: accurate but slower]
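RoI Pooling is what lets proposals of different sizes feed the same classification head. Below is a simplified single-channel sketch, assuming the proposal has already been mapped to integer feature-map cells given as a half-open range `(r0, c0, r1, c1)`; the function name and the 2×2 output size are illustrative choices.

```python
def roi_max_pool(feature, roi, out_size=2):
    """Max-pool the region roi of a 2D feature map into an out_size x out_size grid."""
    r0, c0, r1, c1 = roi
    height, width = r1 - r0, c1 - c0

    def bin_edges(i, length):
        # Bin i covers [floor(i * length / n), ceil((i + 1) * length / n)).
        start = (i * length) // out_size
        end = -(-((i + 1) * length) // out_size)  # ceiling division
        return start, end

    pooled = []
    for i in range(out_size):
        rs, re = bin_edges(i, height)
        row = []
        for j in range(out_size):
            cs, ce = bin_edges(j, width)
            # Take the maximum activation inside this bin.
            row.append(max(feature[r][c]
                           for r in range(r0 + rs, r0 + re)
                           for c in range(c0 + cs, c0 + ce)))
        pooled.append(row)
    return pooled


feature = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
full = roi_max_pool(feature, (0, 0, 4, 4))   # 4x4 region -> 2x2 output
crop = roi_max_pool(feature, (1, 1, 4, 4))   # 3x3 region -> 2x2 output
```

Both calls return a 2×2 grid even though the regions differ in size, which is the property the second stage relies on.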

Single-Stage: YOLO

Definition: YOLO (You Only Look Once)

YOLO treats detection as a single regression problem:

  1. Divide the image into an $S \times S$ grid
  2. Each grid cell predicts $B$ bounding boxes, each with 4 coordinates and 1 confidence score, plus $C$ class probabilities
  3. Output tensor: $S \times S \times (B \times 5 + C)$
$$\text{Output: } 7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30 \text{ (for PASCAL VOC)}$$
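The tensor arithmetic and the grid assignment can be sketched in a few lines. This assumes the original YOLO convention that the grid cell containing an object's center is the one responsible for predicting it; the function names and the 448×448 image size in the example are illustrative.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: B boxes of 5 values plus C class scores per cell."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Grid cell (row, col) that contains the object center (cx, cy), given in pixels."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers that sit on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers that sit on the bottom edge
    return row, col


shape = yolo_output_shape()                   # (7, 7, 30) for PASCAL VOC
cell = responsible_cell(224, 100, 448, 448)   # center at x=224, y=100
```

Changing `C` to 80 (the number of COCO classes) with the same `S` and `B` would give a 7×7×90 tensor under the same formula.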

Successive versions, from YOLOv1 through YOLOv8, have improved on this design with better backbones, necks, and training strategies.

[Figure: YOLO Single-Shot Detection. Input → 7×7 grid → Predictions → Detections, e.g. 🚗 (0.95), 🚶 (0.88). Speed: YOLO 30+ FPS vs. Faster R-CNN 5 FPS]

Two-Stage vs Single-Stage

Two-Stage

  • R-CNN, Fast R-CNN, Faster R-CNN
  • Higher accuracy (mAP)
  • Slower (2-5 FPS)
  • Better for small objects
  • Complex pipeline
  • Mask R-CNN adds instance segmentation

Single-Stage

  • YOLO, SSD, RetinaNet
  • Faster (30+ FPS)
  • Lower latency
  • Real-time applications
  • Focal loss for class imbalance
  • YOLOv8: state-of-the-art speed/accuracy
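The comparison mentions focal loss as the single-stage answer to class imbalance. As a rough illustration, here is the binary form $\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ for one prediction; the defaults $\alpha = 0.25$ and $\gamma = 2$ are the values usually quoted for RetinaNet and are assumptions here, as is the function name.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, strictly between 0 and 1.
    y: ground-truth label, 1 (object) or 0 (background).
    """
    p_t = p if y == 1 else 1.0 - p               # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = focal_loss(0.01, 0)  # confident, correct: contributes almost nothing
hard_background = focal_loss(0.90, 0)  # confident, wrong: dominates the loss
```

Because a single-stage detector scores every grid location, the vast majority of examples are easy background; down-weighting them keeps the rare foreground and hard examples from being drowned out.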

Evaluation: mAP

Definition: Mean Average Precision (mAP)

mAP is the standard metric for object detection:

  1. Precision-Recall curve: Plot precision vs recall at different confidence thresholds
  2. AP (Average Precision): Area under precision-recall curve for each class
  3. mAP: Mean AP across all classes
$$\text{mAP} = \frac{1}{C}\sum_{c=1}^{C} \text{AP}_c$$
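The three steps can be sketched as follows. This version sums precision over each increase in recall, without the precision interpolation that benchmark toolkits such as PASCAL VOC and COCO apply, so treat it as an approximation of the idea rather than a reference implementation. It assumes each detection has already been marked as a true or false positive at some IoU threshold; the function names are illustrative.

```python
def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    num_gt: number of ground-truth objects of this class.
    """
    if num_gt == 0:
        return 0.0
    ap, prev_recall = 0.0, 0.0
    tp = fp = 0
    # Walking down the confidence ranking traces out the precision-recall curve.
    for _, is_tp in sorted(scored_hits, key=lambda t: t[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the curve
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


car_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
person_ap = average_precision([(0.85, True)], num_gt=1)
m_ap = mean_average_precision({"car": car_ap, "person": person_ap})
```

In the car example the false positive ranked second lowers precision to 2/3 by the time the second car is found, giving AP = 0.5 × 1 + 0.5 × 2/3 ≈ 0.833.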

Common variants:

  • mAP@0.5: IoU threshold = 0.5
  • mAP@0.5:0.95: Average over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO standard)
mAP Calculation
$$\text{mAP} = \frac{1}{C}\sum_{c=1}^{C} \int_0^1 P_c(r) \, dr$$

where $P_c(r)$ is the precision of class $c$ at recall $r$ and $C$ is the number of classes.
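To show how the IoU threshold enters the calculation, the sketch below computes AP for one class on one image at a given threshold and then averages over the ten thresholds 0.50, 0.55, ..., 0.95. It is a simplified stand-in for the COCO procedure (no precision interpolation, a single image), and `ap_at` and `coco_style_ap` are hypothetical names; boxes are assumed to be `(x1, y1, x2, y2)` corners.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ap_at(dets, gts, thr):
    """AP for one class at IoU threshold thr. dets: list of (confidence, box)."""
    if not gts:
        return 0.0
    used = set()                      # ground-truth boxes already matched
    tp, ap, prev_recall = 0, 0.0, 0.0
    ranked = sorted(dets, key=lambda d: d[0], reverse=True)
    for n, (_, box) in enumerate(ranked, start=1):
        # Find the best still-unmatched ground truth with IoU >= thr.
        best_j, best_iou = None, thr
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in used and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:        # true positive; otherwise a false positive
            used.add(best_j)
            tp += 1
        recall = tp / len(gts)
        ap += (tp / n) * (recall - prev_recall)
        prev_recall = recall
    return ap


def coco_style_ap(dets, gts):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    return sum(ap_at(dets, gts, t) for t in thresholds) / len(thresholds)


gts = [(0, 0, 10, 10)]
loose = coco_style_ap([(0.9, (0, 0, 9, 8))], gts)    # IoU = 0.72 with the ground truth
tight = coco_style_ap([(0.9, (0, 0, 10, 10))], gts)  # IoU = 1.0
```

The loosely fitting box counts as correct only at thresholds up to 0.70, so it scores 0.5 here even though it would score 1.0 under mAP@0.5, which is why the averaged variant rewards tighter localization.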

Summary

  • Object detection predicts bounding boxes + class labels + confidence scores
  • IoU measures overlap between predicted and ground truth boxes
  • Anchor boxes provide reference shapes for regression
  • NMS removes duplicate detections
  • Two-stage (Faster R-CNN): accurate but slower
  • Single-stage (YOLO): faster, real-time capable
  • mAP is the standard evaluation metric

Next: Semantic Segmentation
