YOLOv2 improvements:
Batch normalization. 2% more in mAP.
High resolution classifier. 4% more in mAP.
Convolutional with anchor boxes. From 69.5 mAP, 81% recall to 69.2 mAP, 88% recall.
Dimension clusters. Better anchor box priors. 60.9% to 67.2% in Avg IOU.
Direct location prediction. Solves model instability.
Fine-grained features. 1% more in mAP.
Multi-scale training.
Deep learning for object detection
*Created in March 2017; may be outdated by the time you read this.
Slide credit: CS231n
2. Common methods
Region proposal based methods
R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, Mask R-CNN
Single shot based methods
YOLO, YOLOv2, SSD
one image -> one label
one image -> labels + bounding boxes
Region based methods - R-CNN
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
Region based methods - Fast R-CNN
Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE International Conference on Computer Vision. 2015.
Region based methods - Faster R-CNN
Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." Advances in neural information processing systems.
Region based methods - R-FCN
Dai, Jifeng, Yi Li, Kaiming He, and Jian Sun. "R-fcn: Object detection via region-based fully convolutional networks." Advances in Neural Information Processing Systems.
Region based methods - Mask R-CNN
He, Kaiming, et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
Object instance segmentation:
Extends Faster R-CNN by adding a branch for predicting segmentation masks on each RoI.
Runs at 5 fps.
Without tricks, outperforms all existing single-model entries on every task in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection.
Single shot based method - YOLO
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
1. Resize input image to 448*448.
2. Run a single convolutional network.
Predicts B bounding boxes (4 coordinates + confidence) and C class probabilities for S*S grids, encoded as an S*S*(B*5+C) tensor.
3. Non-maximum suppression.
S*S*B bounding boxes per image and C class probabilities for each box.
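The output size in step 2 and the suppression in step 3 can be sketched in a few lines. This is a minimal illustration and not the authors' code: the helper names `output_shape`, `iou` and `nms` are hypothetical, the values S=7, B=2, C=20 in the usage note are the PASCAL VOC settings from the YOLO paper (they do not appear on the slide), and the 0.5 threshold is an assumption. The suppression here is class-agnostic for brevity; detectors usually run it per class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format, an assumption


def output_shape(S: int, B: int, C: int) -> Tuple[int, int, int]:
    """Each of the S*S cells holds B boxes (4 coords + 1 confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy non-maximum suppression; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

With the paper's VOC settings, `output_shape(7, 2, 20)` gives a 7*7*30 tensor, i.e. 98 candidate boxes per image before suppression.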
Single shot based method - YOLOv2
Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." arXiv preprint arXiv:1612.08242 (2016).
Shortcomings of YOLO that YOLOv2 targets:
1. Significant number of localization errors.
2. Low recall compared to region proposal based methods.
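Two of the YOLOv2 changes listed at the top of these slides, dimension clusters and direct location prediction, are easier to follow with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it runs k-means on box (width, height) pairs with the distance d = 1 - IOU described in the YOLO9000 paper, and decodes a prediction with b_x = sigmoid(t_x) + c_x, b_y = sigmoid(t_y) + c_y, b_w = p_w * exp(t_w), b_h = p_h * exp(t_h). The function names, the mean-based centroid update, the random initialisation and the iteration cap are all assumptions.

```python
import math
import random
from typing import List, Tuple

WH = Tuple[float, float]  # (width, height) of a box, assumed positive


def iou_wh(a: WH, b: WH) -> float:
    """IOU of two boxes sharing the same center, so only width and height matter."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_iou(boxes: List[WH], k: int, iters: int = 50, seed: int = 0) -> List[WH]:
    """k-means on box shapes using d = 1 - IOU, i.e. assign each box to the highest-IOU centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: initialise from k random boxes
    for _ in range(iters):
        clusters: List[List[WH]] = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda c: iou_wh(b, centroids[c]))
            clusters[best].append(b)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:  # assumption: update with the mean width and height
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:  # keep an empty cluster's centroid unchanged
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def avg_iou(boxes: List[WH], centroids: List[WH]) -> float:
    """Average IOU between each box and its closest prior (the 'Avg IOU' metric)."""
    return sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)


def decode_box(t: Tuple[float, float, float, float], cell: Tuple[int, int], prior: WH):
    """Direct location prediction: the center stays inside its grid cell, the size scales the prior."""
    tx, ty, tw, th = t
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return (sigmoid(tx) + cell[0], sigmoid(ty) + cell[1],
            prior[0] * math.exp(tw), prior[1] * math.exp(th))
```

Because the sigmoid bounds the center offset to (0, 1), a predicted box cannot drift to an arbitrary location in the image, which is the instability the slide refers to.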
Single shot based method - SSD
Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer International Publishing, 2016.
1. Use a small convolutional filter to predict object categories and offsets in bounding box locations.
2. Use multiple layers for prediction at different scales.
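A rough sketch of what these two points imply for the size of the prediction head. The layer list below is the SSD300 configuration as described in the SSD paper (feature map sizes and default boxes per location); it is not on the slide and is stated here as an assumption, as are the helper names and the scale range 0.2 to 0.9.

```python
from typing import List, Tuple

# (feature map size, default boxes per location) for SSD300.
# Assumption: values taken from the SSD paper, not from these slides.
SSD300_LAYERS: List[Tuple[int, int]] = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def head_channels(boxes_per_location: int, num_classes: int) -> int:
    """Output channels of the small conv filter on one layer:
    for every default box, num_classes scores plus 4 box offsets."""
    return boxes_per_location * (num_classes + 4)


def total_default_boxes(layers: List[Tuple[int, int]]) -> int:
    """Number of default boxes summed over all prediction layers."""
    return sum(size * size * k for size, k in layers)


def default_box_scales(m: int, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Linearly spaced box scales, one per prediction layer (requires m >= 2):
    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k = 1..m."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]
```

For example, with 21 classes (20 PASCAL VOC classes plus background) a layer with 4 default boxes per location needs `head_channels(4, 21)` = 100 output channels, and `total_default_boxes(SSD300_LAYERS)` counts the boxes predicted per image across all scales.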
Figures: from YOLOv2 and from SSD.