2. Improving Object Detection With One Line of Code
● Non-Maximum Suppression is a greedy
process.
○ It worked well enough in 2007 but it doesn’t
anymore.
● High-scoring detections can be suppressed
just like low-scoring ones.
○ Overlap with a stronger detection is the only
criterion.
● Should one detection completely suppress
another detection, or simply reduce its
confidence?
3. Improving Object Detection With One Line of Code
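● NMS: a detection's score is set to zero once its overlap
with a stronger detection reaches the NMS threshold.
● Linear Soft-NMS: above the threshold, the score is
scaled down in proportion to the overlap.
● Gaussian Soft-NMS:
○ Linear Soft-NMS is not continuous in terms of
overlap: a sudden penalty is applied when the
NMS threshold is reached.
○ Instead, we can use a continuous function of the
overlap (a Gaussian penalty).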
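The three rescoring rules named on this slide can be sketched in plain Python as below. This is a minimal illustration, not the authors' code: the box format (x1, y1, x2, y2), the parameter names nt, sigma and min_score, and their default values are all assumptions made for the example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, method="gaussian", nt=0.3, sigma=0.5, min_score=0.001):
    """Return (index, rescored score) pairs in the order they are selected.

    method: "hard" (classic NMS), "linear" or "gaussian" (Soft-NMS).
    nt, sigma, min_score are illustrative defaults (assumptions).
    """
    scores = list(scores)               # work on a copy
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        # Greedy step: pick the highest-scoring remaining detection.
        m = max(remaining, key=lambda i: scores[i])
        remaining.remove(m)
        kept.append((m, scores[m]))
        for i in remaining:
            o = iou(boxes[m], boxes[i])
            if method == "hard":
                if o >= nt:
                    scores[i] = 0.0                 # full suppression
            elif method == "linear":
                if o >= nt:
                    scores[i] *= 1.0 - o            # penalty jumps in at the threshold
            else:
                scores[i] *= math.exp(-(o * o) / sigma)  # continuous in the overlap
        # Discard detections whose score has decayed to (almost) zero.
        remaining = [i for i in remaining if scores[i] > min_score]
    return kept
```

With two heavily overlapping boxes, the "hard" setting drops the weaker one entirely, while the two Soft-NMS settings keep it with a reduced score, so a later confidence threshold decides its fate.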
6. Learning Non-Maximum Suppression
● Object detectors are mostly trained
end-to-end, except for the NMS.
○ NMS is still fully hand-crafted, and forces a
trade-off between recall and precision.
● Training loss is not evaluation loss.
○ Training is performed without NMS.
○ During evaluation, multiple detections for the same
object count as false positives.
● Instead, train the network to include the
suppression process.
○ Only output one bounding box per object.
○ Learn how to handle close objects.
7. Learning Non-Maximum Suppression
● Additional blocks that:
○ Encode pairwise information.
○ For each detection, pool information from all
pairings.
○ Update feature vector.
○ Repeat.
● New loss:
○ Only one positive candidate per object.
○ This replaces the current practice of treating every
detection with IoU > 50% as a positive.
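One way to picture the "one positive candidate per object" rule is a greedy, score-ordered matching like the one detection benchmarks use at evaluation time. The sketch below is an assumption about how such labels could be produced, not the paper's code; assign_training_labels, min_iou and the (box, score) layout are hypothetical names chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_training_labels(detections, objects, min_iou=0.5):
    """Label at most one detection per ground-truth object as positive.

    detections: list of (box, score); objects: list of ground-truth boxes.
    Returns a list of 0/1 labels aligned with `detections`.
    """
    labels = [0] * len(detections)
    taken = set()  # ground-truth objects that already own a positive
    order = sorted(range(len(detections)),
                   key=lambda i: detections[i][1], reverse=True)
    for i in order:  # highest-scoring detections choose first
        box, _ = detections[i]
        best_j, best_iou = None, min_iou
        for j, obj in enumerate(objects):
            if j in taken:
                continue
            o = iou(box, obj)
            if o > best_iou:
                best_j, best_iou = j, o
        if best_j is not None:
            labels[i] = 1
            taken.add(best_j)
    return labels
```

Under this labelling, a second detection on an already matched object becomes a negative even if its IoU exceeds 50%, which is exactly how the evaluation counts it.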
10. Multi-Scale Object Detection
● Multi-scale object detection using an image pyramid.
○ Predict different scales by applying the same model at different image resolutions.
● Classic method.
● Also used in OverFeat.
● Slow: requires multiple evaluations of the same model.
11. Multi-Scale Object Detection
● Predict objects at multiple scales from a single feature map.
● Same as Faster R-CNN.
● Fast.
● Single model (same in training as in testing).
● Poor feature resolution for small objects.
12. Multi-Scale Object Detection
● Predict different object sizes at different feature scales.
● Same as SSD.
● Good feature resolution for small objects.
● But the shallow-layer features are much weaker than those from deeper layers.
13. Feature Pyramid Network (FPN)
● Single model (same in training as in testing).
● Good feature resolution for small objects.
● Strong features in all layers.
● Almost no overhead over SSD, so it stays fast.
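FPN gets strong features at every resolution by upsampling a coarse, deep feature map and adding it to a finer, shallower one through a lateral connection. The sketch below shows that single merge step on one-channel maps stored as nested lists. It is a simplification under stated assumptions: the lateral 1x1 convolution is reduced to a scalar lateral_weight, and the function names are made up for the example.

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, fine, lateral_weight=1.0):
    """Merge a coarse (deep) map into a fine (shallow) map.

    `fine` must have twice the height and width of `coarse`.
    lateral_weight stands in for the lateral 1x1 convolution (assumption).
    """
    up = upsample2x(coarse)
    return [[u + lateral_weight * f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]
```

Applying the merge repeatedly from the deepest level down yields one enriched map per scale, which is what the prediction heads then consume.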
15. Feature Pyramid Network (FPN)
● How important is top-down enrichment?
● How important are lateral connections?
● How important are pyramid representations?
17. Focal Loss for Dense Object Detection
● Can we train a single-stage detector to be as accurate as two-stage detectors?
● Contributions:
○ RetinaNet: a single-stage object detector built on an FPN backbone.
○ New loss.
18. Focal Loss for Dense Object Detection
● Class imbalance is an important issue for object detection.
● Previous solutions:
○ Random resampling at a 1:3 ratio.
○ Hard negative resampling at a 1:3 ratio.
● Both solutions mean that, at each step, only a few samples actually matter
to the loss function.
● Instead, include all samples but use a different weight for each class.
○ Regular cross entropy: every sample has the same weight.
○ Weighted cross entropy: each class has its own weight.
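For reference, the two losses named above can be written as follows for binary classification. The symbols are assumptions borrowed from common usage rather than from the slide: p is the predicted probability of the positive class, y the ground-truth label, and α_t the weight of the sample's class (α for positives, 1 - α for negatives).

```latex
p_t =
\begin{cases}
  p     & \text{if } y = 1 \\
  1 - p & \text{otherwise}
\end{cases}
\qquad
\mathrm{CE}(p_t) = -\log(p_t)
\qquad
\mathrm{CE}_{\alpha}(p_t) = -\alpha_t \log(p_t)
```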
19. Focal Loss for Dense Object Detection
● Using weighted CE as the baseline:
○ Can we do better?
○ Can we use a different weight for each sample?
● Focal loss: every sample is weighted according to its error.
○ We want to focus on samples that are
misclassified.
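A per-sample version of the focal loss can be sketched as below for the binary case. The modulating factor (1 - p_t)^gamma shrinks the loss of well-classified samples, so training concentrates on the hard ones. The function name, the eps clamp, and the default values gamma=2.0 and alpha=0.25 are assumptions for this example.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class; y: label (1 or 0).
    gamma controls how strongly easy samples are down-weighted;
    alpha is the class weight for positives (1 - alpha for negatives).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0 - eps)  # keep log() finite
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting gamma=0 recovers the class-weighted cross entropy, which makes it easy to compare the two losses on the same predictions.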
25. YouTube Videos
● CS231n
○ Lecture 11 - Detection and segmentation https://youtu.be/nDPWywWRIRo
● Deep Learning for Objects and Scenes (CVPR 2017 Workshop)
○ Lecture 1: Learning Deep Representations for Visual Recognition, by Kaiming He
https://youtu.be/jHv37mKAhV4
○ Lecture 2: Deep Learning for Instance-level Object Understanding, by Ross Girshick
https://youtu.be/jHv37mKAhV4?t=39m4s
28. Mask R-CNN
● Instance segmentation with pose
estimation for people.
● Extends Faster R-CNN by adding a new
branch for the instance mask task.
● Pose estimation can be added by simply
adding an additional branch.
● SOTA accuracy on detection, segmentation
and pose estimation at 5 FPS on GPU.
● https://arxiv.org/abs/1703.06870
● Girshick won young researcher award.
32. Mask R-CNN
● RoiPool
○ Quantization breaks pixel-to-pixel alignment.
○ Too coarse for the fine spatial
information that masks require.
● RoiAlign
○ Bilinearly sample the proposal region and avoid
the quantization.
○ Smoothly normalize features and predictions
into a coordinate frame free of scale and aspect
ratio.
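The idea behind RoiAlign, sampling the feature map at real-valued positions instead of rounding the region to the grid, can be sketched in a few lines. This is a single-channel illustration under assumptions: the feature map is a nested list, cell centres sit at integer coordinates, each output bin averages a samples x samples grid of points, and all function and parameter names are hypothetical.

```python
import math


def bilinear_sample(fm, y, x):
    """Bilinearly interpolate a 2D feature map at a real-valued (y, x)."""
    h, w = len(fm), len(fm[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fm[y0][x0] * (1 - dx) + fm[y0][x1] * dx
    bottom = fm[y1][x0] * (1 - dx) + fm[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fm, roi, out_size=2, samples=2):
    """Pool a region into an out_size x out_size grid without rounding.

    roi = (y1, x1, y2, x2) in feature-map coordinates, kept as floats.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin.
                    yy = y1 + (by + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(fm, yy, xx)
            row.append(total / (samples * samples))
        out.append(row)
    return out
```

Because neither the region borders nor the sample points are rounded, shifting the region by a fraction of a cell changes the output smoothly, which is the alignment property the mask branch relies on.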
34. Mask R-CNN
● Backbone architecture
○ ResNet
○ ResNeXt
○ FPN
● Mask representation
○ FC vs. Convolutional
○ Multinomial vs. Independent Masks: softmax
vs. sigmoid
○ Class-Specific vs. Class-Agnostic Masks:
almost same accuracy
● Multi-task learning
○ Mask task improves object detection accuracy.
○ Keypoint task reduces object detection
accuracy.
35. Mask R-CNN
● Pose estimation
○ Simply add an additional branch.
○ Model a keypoint’s location as a one-hot mask,
and adopt Mask R-CNN to predict K masks.
○ Experiments are mainly to demonstrate the
generality of the Mask R-CNN framework.
○ RoiAlign improves this task’s accuracy as well.