SSD:
Single Shot Multibox Detector
NamHyuk Ahn
Object Detection
- mean Average Precision (mAP)
• Popular evaluation metric
• Compute average precision (AP)
for each class, then average
over all classes
• A detection is a true positive
if its box overlaps a ground-truth
box by more than some threshold
(usually 0.5)
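The true-positive test above can be sketched in a few lines. This assumes "overlap" means intersection over union (IoU), as is standard in PASCAL VOC evaluation, and that boxes are given as (x1, y1, x2, y2) corners. The function names are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(det_box, gt_box, threshold=0.5):
    """A detection counts as a true positive if IoU exceeds the threshold."""
    return iou(det_box, gt_box) > threshold
```

A full mAP computation would additionally rank detections by confidence and match each ground-truth box at most once; that part is omitted here.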
Object Detection
- R-CNN Family
• Most popular detection method in deep learning
• Relies on a region proposal method, which makes the model slow
• Good accuracy (Faster R-CNN: 73.2% mAP), but very slow
• R-CNN: 50 sec/img, Fast R-CNN: 2 sec/img, Faster R-CNN: 0.2 sec/img (7 FPS)
- YOLO (You Only Look Once)
• Real-time (45 FPS), but lower accuracy (63.4% mAP)
YOLO:
You Only Look Once
- Single shot detector model
• Does not separate classification and bbox regression
- Divide image into an S x S grid (7x7 in paper)
• Each grid cell predicts a vector of size (4+1)*B + C
• B: # of boxes per grid cell (2 in paper)
• C: # of classes (20 in paper)
• (4+1): 4 box coords + 1 box confidence
- Direct prediction using a CNN with regression loss
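As a quick check of the arithmetic, the per-cell vector length and the total output size follow directly from S, B, and C. The function name is illustrative only.

```python
def yolo_output_size(S=7, B=2, C=20):
    """Return (total output values, values per grid cell) for YOLO."""
    per_cell = (4 + 1) * B + C  # B boxes with 4 coords + 1 confidence, plus C class scores
    return S * S * per_cell, per_cell


print(yolo_output_size())  # (1470, 30) with the paper's S=7, B=2, C=20
```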
YOLO:
You Only Look Once
- Operates on a single-scale feature map (last pool)
• Poor accuracy on very large or very small objects
- Predicts bboxes using an fc layer
- Heavy data augmentation, 448x448 input image
- Uses a customized CNN architecture
SSD:
Single Shot Multibox Detector
- Multi-scale feature maps for detection
• Add conv layers at the end of the base network, decreasing in size progressively
• Concatenate the outputs of the multi-scale feature maps at the last layer
- Convolutional predictors for detection
• YOLO uses an fc layer, but SSD uses 3x3 conv kernels
SSD:
Single Shot Multibox Detector
- Default boxes and aspect ratios
• Set default boxes at each location and predict offsets relative to
the corresponding default box
• Output dims: (C+4)*K*M*N, where
K = # of default boxes, C = # of classes, M x N = feature map size
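A minimal sketch of both ideas on this slide. The output-size function restates the formula above. The decoding function assumes the center-size parameterization from the SSD paper (center shifts scaled by the default box size, log-scaled width and height) and leaves out the variance scaling that some implementations add. All names and the example values (21 classes, 6 boxes, a 19x19 map) are illustrative only.

```python
import math


def ssd_output_dims(num_classes, num_boxes, m, n):
    """(C + 4) values per default box, K boxes per cell, M x N cells."""
    return (num_classes + 4) * num_boxes * m * n


def decode_offsets(default_box, offsets):
    """Turn predicted offsets into a box, both as (cx, cy, w, h).

    Assumed encoding: t_cx = (cx - d_cx) / d_w, t_w = log(w / d_w),
    and likewise for cy and h.
    """
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w
    cy = d_cy + t_cy * d_h
    w = d_w * math.exp(t_w)
    h = d_h * math.exp(t_h)
    return (cx, cy, w, h)


print(ssd_output_dims(21, 6, 19, 19))  # 54150
```

Zero offsets return the default box unchanged, which is why a good set of default boxes makes the regression target small.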
SSD:
Single Shot Multibox Detector
- Default boxes and aspect ratios
• Use 6 default boxes at each feature map cell
• Aspect ratios { 1, 2, 3, 1/2, 1/3 } + 1 extra box with aspect ratio 1
• Use only 3 boxes on conv4_3 to reduce computation
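The six box shapes per cell can be sketched as follows. This assumes the sizing rule from the SSD paper, which the slide does not spell out: for a layer scale s and aspect ratio a, width = s*sqrt(a) and height = s/sqrt(a), and the extra ratio-1 box uses the scale sqrt(s * s_next), where s_next is the scale of the next feature map. The function name and the example scales are illustrative only.

```python
import math


def default_box_shapes(scale, next_scale, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Return (width, height) pairs for the default boxes of one cell."""
    shapes = []
    for ar in aspect_ratios:
        # Same area (scale**2) for every aspect ratio
        shapes.append((scale * math.sqrt(ar), scale / math.sqrt(ar)))
    # Extra square box at an intermediate scale (assumed rule, see lead-in)
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes


for w, h in default_box_shapes(0.2, 0.35):
    print(round(w, 3), round(h, 3))
```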
SSD:
Single Shot Multibox Detector
- Output feature (final layer)
• Given the output boxes from the multi-scale features, sort them
by class confidence
• Pick the top 200 boxes and represent each box as a 7-dim vector
• [ batch_idx, class_confidence, label, box offset… ]
• Output feature dim is 7x200
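The sort-and-truncate step can be sketched like this. Each detection is assumed to be a 7-element list laid out as on the slide, with class_confidence at index 1; the last four entries are treated as opaque box values, and the function name is illustrative only.

```python
def top_k_detections(detections, k=200):
    """Keep the k detections with the highest class confidence.

    Each detection: [batch_idx, class_confidence, label, b0, b1, b2, b3]
    where b0..b3 are the four box values (passed through untouched).
    """
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    return ranked[:k]


# Hypothetical input: 300 candidate boxes with increasing confidence
candidates = [[0, i / 300.0, 1, 0.0, 0.0, 1.0, 1.0] for i in range(300)]
kept = top_k_detections(candidates)
print(len(kept), len(kept[0]))  # 200 7
```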
Model analysis
- Data augmentation is very important
- More feature maps are better
• Lower-level feature maps capture fine-grained details of objects
- More default box shapes are better
• With only 4 boxes, performance drops by 0.9%
• A variety of default box shapes makes box prediction easier
- Atrous VGG is better and faster
Result
- Accuracy is comparable to the state of the art, at
real-time speed
Reference
- Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint
arXiv:1512.02325 (2015).
