Object detection
Agenda
 Selective search
 R-CNN family (two-stage)
 RetinaNet (one-stage)
 Anchors
 Losses
 Stats
 mAP
Classification vs Detection
Classification
Object Detection
Goal: locate and classify the objects in an image.
Problem: Where do we look in the image for the object?
(Example image: kitten)
Segmentation
Idea: If we correctly segment the image before running object
recognition, we can use our segmentations as candidate objects.
Advantages: Can be efficient, makes no assumptions about object
sizes or shapes.
Selective search
• Start by oversegmenting the input image
("Efficient graph-based image segmentation", Felzenszwalb and Huttenlocher, IJCV 2004)
Image gradients
Similarity measures
 Color: a 25-bin color histogram for each channel = 75-dim vector (RGB)
 Texture: HOG-like Gaussian derivatives of the image in 8 directions for each channel; a 10-bin histogram per region = 240-dim vector
 Size: size similarity encourages smaller regions to merge early. It ensures that region proposals at all scales are formed at all parts of the image.
 Shape: measures how well two regions ri and rj fit into each other. If ri fits into rj, merge them to fill gaps. (A sketch of these measures follows below.)
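A minimal NumPy sketch of these similarity terms, following the definitions in the selective search paper (color/texture use histogram intersection; the slide's "Shape" measure is the paper's "fill"; how the four terms are weighted and combined is a configuration choice):

```python
import numpy as np

def hist_similarity(h_i, h_j):
    # Color/texture similarity: intersection of L1-normalized histograms.
    return np.minimum(h_i, h_j).sum()

def size_similarity(size_i, size_j, size_im):
    # Encourages small regions to merge early.
    return 1.0 - (size_i + size_j) / size_im

def fill_similarity(size_i, size_j, size_bbox, size_im):
    # The slide's "Shape": how well the two regions fit into each other,
    # measured via the tight bounding box around both (size_bbox).
    return 1.0 - (size_bbox - size_i - size_j) / size_im
```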
Selective search
1. Merge the two most similar regions based on S.
2. Update the similarities between the new region and its neighbors.
3. Go back to step 1 until the whole image is a single region.
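A toy sketch of this merge loop, assuming each region carries a size and an L1-normalized histogram. For brevity every pair is treated as adjacent, whereas the real algorithm only scores spatially neighboring regions:

```python
import numpy as np

def merge_regions(ri, rj):
    # The merged histogram is the size-weighted combination of the parts.
    size = ri["size"] + rj["size"]
    hist = (ri["hist"] * ri["size"] + rj["hist"] * rj["size"]) / size
    return {"size": size, "hist": hist}

def hierarchical_grouping(regions, sim):
    """Greedy agglomerative grouping over an initial oversegmentation.
    `sim` is any combination of the similarity measures sketched above."""
    proposals = list(regions)
    regions = list(regions)
    while len(regions) > 1:
        # Step 1: find and merge the most similar pair.
        pairs = [(i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))]
        i, j = max(pairs, key=lambda p: sim(regions[p[0]], regions[p[1]]))
        merged = merge_regions(regions[i], regions[j])
        # Step 2: the new region replaces its parts; similarities to it are
        # recomputed on the next loop iteration (step 3: repeat until one region).
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)  # every intermediate region is a proposal
    return proposals
```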
Selective search
• Use hierarchical segmentation: start with small superpixels and
merge based on diverse cues
• Take the bounding boxes of all generated regions and treat them as possible object locations
Selective search
Stats
• Recall is the proportion of objects that are covered by some box with > 0.5 overlap
(Figure: recall for selected settings)
Region proposals!
R-CNN: Region proposals + CNN features
R-CNN details
• Cons
• Training is slow (84h), takes a lot of disk space
• 2000 CNN passes per image
• Inference (detection) is slow (47s / image with VGG16)
• The selective search algorithm is fixed; no learning is happening! This can lead to bad candidate region proposals.
Fast R-CNN
(Architecture diagram: forward the whole image through a ConvNet to get the "conv5" feature map; project the region proposals onto it; an "RoI Pooling" layer extracts a fixed-size feature per proposal; fully-connected layers (FCs) feed two heads: a linear + softmax classifier and linear bounding-box regressors.)
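A minimal NumPy sketch of RoI pooling for a single feature map with integer bin edges (real implementations handle sub-pixel bins, batching, and the image-to-feature-map scale):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map
    coordinates. Divides the RoI into an output_size x output_size grid and
    max-pools each cell, so every proposal yields a fixed-size tensor."""
    c, _, _ = feature_map.shape
    x1, y1, x2, y2 = roi
    out = np.zeros((c, output_size, output_size), dtype=feature_map.dtype)
    xs = np.linspace(x1, x2, output_size + 1).astype(int)
    ys = np.linspace(y1, y2, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)  # keep bins non-empty
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```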
Fast R-CNN
• Pros
• Less compute overhead
• 2.3 seconds per image inference time
• Cons
• Inference at 2.3 seconds is still too slow for real-life use!
• The selective search algorithm is fixed; no learning is happening! This can lead to bad candidate region proposals.
Fast R-CNN training
(Training diagram: the whole network, from the ConvNet through the FCs to both heads, is trainable end-to-end with a multi-task loss: log loss for the softmax classifier + smooth L1 loss for the bounding-box regressors.)
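A minimal PyTorch sketch of that multi-task loss, assuming class index 0 is background and box_preds has already been gathered per-RoI for the target class:

```python
import torch.nn.functional as F

def fast_rcnn_loss(cls_logits, cls_targets, box_preds, box_targets, lam=1.0):
    """Multi-task loss: log loss on class scores plus smooth L1 on box
    deltas, applied only to foreground RoIs. `lam` balances the two terms."""
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    fg = cls_targets > 0  # class 0 is background: no box loss for it
    if fg.any():
        box_loss = F.smooth_l1_loss(box_preds[fg], box_targets[fg])
    else:
        box_loss = box_preds.sum() * 0.0  # keep the graph valid with no foreground
    return cls_loss + lam * box_loss
```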
Speed comparison
Faster R-CNN
Region proposal network (RPN)
• Slide a small window over the feature map
• Predict object/no object
• Regress bounding box coordinates
• Box regression is with reference to anchors (3 scales x 3 aspect ratios)
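A minimal PyTorch sketch of the RPN head as described in the Faster R-CNN paper: a 3x3 "sliding window" convolution followed by two sibling 1x1 convolutions; with k anchors per location the cls branch outputs 2k object/not-object scores and the reg branch 4k box deltas:

```python
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)  # sliding window
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)  # object / not-object per anchor
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)  # 4 box deltas per anchor

    def forward(self, x):
        h = F.relu(self.conv(x))
        return self.cls(h), self.reg(h)
```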
Loss
i: the index of an anchor in a mini-batch
p_i: the predicted probability of anchor i being an object
p*_i: 1 if the anchor is positive, 0 if the anchor is negative
t_i: the 4 predicted bounding-box coordinates
t*_i: the coordinates of the ground-truth box associated with a positive anchor
L_reg(t_i, t*_i) = R(t_i − t*_i), where R is the robust loss function (smooth L1)
Classification + Regression
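These terms combine into the full RPN objective from the Faster R-CNN paper (not spelled out on the slide); the regression term is only active for positive anchors:

```latex
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```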
Online hard example mining
• Class imbalance hurts training.
• We end up training the model to learn the background space rather than to detect objects.
 Sort anchors by their calculated loss and apply NMS.
 Pick the top ones such that the ratio between the picked negatives and positives is at most 3:1 (a sketch follows below).
• Faster R-CNN samples 256 anchors per image: 128 positive, 128 negative.
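A minimal sketch of the negative-sampling step (the NMS pass over the sorted anchors is omitted here):

```python
import numpy as np

def sample_hard_negatives(losses, labels, neg_pos_ratio=3):
    """Hard example mining sketch: keep all positives and only the
    highest-loss negatives, capped at `neg_pos_ratio` per positive.
    losses: per-anchor loss values; labels: 1 = positive, 0 = negative."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_neg = min(len(neg), neg_pos_ratio * max(len(pos), 1))
    hardest = neg[np.argsort(losses[neg])[::-1][:n_neg]]  # sort by loss, descending
    return np.concatenate([pos, hardest])
```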
Speed comparison
Faster R-CNN
• Pros
• 0.2 seconds per image inference time: fast enough for real-life use
• Uses a trainable RPN instead of selective search, so the proposals are better
Strided convolutions (refresher)
(Figures: stride-1 convolution with a 3x3 kernel; stride-2 convolution with a 3x3 kernel)
IOU: Intersection over union (refresher)
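As a refresher, a minimal IoU computation for corner-format (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```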
NMS: non max suppression (refresher)
(Figures: initial predicted boxes; boxes filtered/suppressed by IoU)
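And a greedy NMS sketch, reusing iou() from the refresher above:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop the
    remaining boxes that overlap it by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```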
Why does a one-stage detector trail in accuracy?
Two-stage:
• The proposal stage rapidly narrows down the number of candidate object locations to a small set (e.g., 1-2k), filtering out most background samples.
• The classification stage fixes the foreground-to-background ratio at 1:3, or uses online hard example mining (OHEM).
One-stage:
• Has to process a much larger set of candidate object locations regularly sampled across the image, which amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios.
 Extreme foreground-background class imbalance is encountered.
Activation maps
How about predicting from multiple maps?
As the image goes deeper through the network, spatial resolution decreases and semantic value increases.
Feature pyramid networks (FPN)
• Improve the predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
• Top-down + lateral connections
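A minimal PyTorch sketch of one top-down + lateral merge step (channel counts are assumptions; the FPN paper uses 256 output channels):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """The coarser, semantically stronger map is upsampled 2x and added to
    a 1x1-projected lateral map, then smoothed with a 3x3 convolution."""
    def __init__(self, lateral_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(lateral_channels, out_channels, 1)
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, top_down, lateral_feat):
        up = F.interpolate(top_down, scale_factor=2, mode="nearest")
        return self.smooth(self.lateral(lateral_feat) + up)
```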
RetinaNet
• Backbone: produces activation maps at different pyramid levels
• The backbone can be DenseNet, VGG, MobileNet, ...
RetinaNet - Architecture
Anchors
• Aspect ratios: 0.5, 1, 2
• Scales: 1, 1.25, 1.58
• Strides: 8,16,32,64,128
• Sizes: 32, 64, 128, 256, 512
• Total (A): ratios × scales = 3 × 3 = 9 anchors per pixel location
• (K) object classes
Anchors - Example
• Anchor width = (size × scale)/sqrt(ratio); anchor height = (size × scale) × sqrt(ratio)
• E.g., for anchor size 32 (each box as [x1 y1 x2 y2], followed by its WxH):
• ratio 0.5: [-22 -11 22 11] 44x22, [-28 -14 28 14] 56x28, [-35 -17 35 17] 70x34
• ratio 1: [-16 -16 16 16] 32x32, [-20 -20 20 20] 40x40, [-25 -25 25 25] 50x50
• ratio 2: [-11 -22 11 22] 22x44, [-14 -28 14 28] 28x56, [-17 -35 17 35] 34x70
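A small sketch reproducing the table above (values match the slide up to rounding):

```python
import numpy as np

def anchors_for_size(size, scales=(1, 1.25, 1.58), ratios=(0.5, 1, 2)):
    """The 9 zero-centered anchors for one base size:
    width = size*scale/sqrt(ratio), height = size*scale*sqrt(ratio)."""
    boxes = []
    for ratio in ratios:
        for scale in scales:
            w = size * scale / np.sqrt(ratio)
            h = size * scale * np.sqrt(ratio)
            boxes.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.round(np.array(boxes))

print(anchors_for_size(32))  # first row ≈ [-22 -11 22 11], i.e. 44x22
```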
For an 800x600 input image:
• P3 activation map shape: 100x75
• Stride: 8
• Total (A) = 9 anchors per pixel location
• Total anchors at the P3 level = 100 × 75 × 9 = 67,500
• Summing similarly over all pyramid levels P3, P4, P5, P6, P7 = 90,360 anchors per image!
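A quick check of that count, assuming each feature-map side is the image side divided by the level's stride, rounded up:

```python
import math

def total_anchors(height=800, width=600, strides=(8, 16, 32, 64, 128), a=9):
    """Sum anchor counts over pyramid levels P3..P7."""
    total = 0
    for s in strides:
        fh, fw = math.ceil(height / s), math.ceil(width / s)
        total += fh * fw * a  # anchors per level = cells x anchors-per-cell
    return total

print(total_anchors())  # 90360
```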
Shift anchors
• Shift the anchors from the activation map onto the input image.
• On P3 (stride 8), the anchor centered at (0,0) is shifted by [4 4 4 4] to the first activation-map cell: e.g., [-22 -11 22 11] becomes [-18 -7 26 15], centered at (4,4).
• The next shifts are [12 4 12 4], [20 4 20 4], ... (centers (4,4), (12,4), ..., stepping by the stride of 8).
• Anchors are applied w.r.t. the input image!
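A sketch of the shifting, reusing anchors_for_size from the earlier sketch; cell centers are assumed to sit at stride/2 offsets, matching the (4,4), (12,4), ... centers above:

```python
import numpy as np

def shift_anchors(base_anchors, feat_h, feat_w, stride):
    """Replicate the zero-centered base anchors at every activation-map
    cell, shifted to the cell's center on the input image."""
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(cx, cy)
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1).reshape(-1, 1, 4)
    return (base_anchors[None, :, :] + shifts).reshape(-1, 4)

p3 = shift_anchors(anchors_for_size(32), 100, 75, 8)
print(p3.shape)  # (67500, 4) -- all P3 anchors for an 800x600 image
```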
Cross Entropy loss
Examples that are easily classified (pt > 0.5) each incur a loss of non-trivial magnitude; summed over a large number of easy examples, these small loss values can overwhelm the rare class.
Balanced cross entropy loss
• Weight the loss by α for foreground and 1−α for background
• α is a hyperparameter
• While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples!
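In the notation of the focal loss paper, the losses on this slide and the previous one are:

```latex
\mathrm{CE}(p_t) = -\log(p_t), \qquad
\mathrm{CE}_{\alpha}(p_t) = -\alpha_t \log(p_t),
\quad \text{where } \alpha_t = \alpha \text{ for foreground and } 1-\alpha \text{ for background.}
```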
Example
• The loss from easy examples = 100000 × 0.1 = 10000
• The loss from hard examples = 100 × 2.3 = 230
• 10000 / 230 ≈ 43: about 40× more loss comes from the easy examples.
Focal loss!
• Misclassified: pt is small, the modulating factor is near 1, and the loss is unaffected.
• As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.
• With γ = 2, an example classified with pt = 0.9 has 100× lower loss than with CE, and with pt ≈ 0.968 it has 1000× lower loss. This in turn increases the relative importance of correcting misclassified examples!
• Every sample is weighted according to its error!
• A modulating factor is added.
• The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted (see the sketch below).
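A minimal NumPy sketch of the α-balanced focal loss, FL(pt) = −αt (1−pt)^γ log(pt), with the paper's default γ = 2 (α is set to 1 in the printout to isolate the modulating factor):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p: predicted foreground probabilities;
    y: 1 = foreground, 0 = background."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# gamma = 2 down-weights an easy example (pt = 0.9) by 100x versus CE:
print(focal_loss(np.array([0.9]), np.array([1]), alpha=1.0))  # ≈ 0.00105
print(-np.log(0.9))                                           # ≈ 0.105
```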
Focal loss!
• Unlike FL, OHEM completely discards easy examples.
Focal loss!
Smooth L1 loss: Bounding boxes
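The slide shows the plot only; the standard smooth L1 (as in the Fast R-CNN paper), applied to each box-coordinate difference x, is:

```latex
\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```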
Prediction pipeline
• The network predicts regressions (deltas) to the anchor boxes! (A sketch of the pipeline follows below.)
 Filter by a 0.05 anchor score threshold
 Keep the top 1000 boxes per level, then merge all levels
 Apply NMS at 0.5
 Up to 300 final boxes! Display them to the user 
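A NumPy sketch of this pipeline for a single class; apply_deltas (decoding boxes from anchors + predicted deltas) is a hypothetical helper, and nms() is reused from the refresher sketch above:

```python
import numpy as np

def decode_and_filter(anchors_per_level, scores_per_level, deltas_per_level,
                      score_thresh=0.05, pre_nms_top_k=1000,
                      nms_thresh=0.5, max_detections=300):
    boxes, scores = [], []
    for anchors, s, d in zip(anchors_per_level, scores_per_level, deltas_per_level):
        keep_idx = np.flatnonzero(s > score_thresh)            # 1) score threshold
        top = keep_idx[np.argsort(s[keep_idx])[::-1][:pre_nms_top_k]]  # 2) top 1000/level
        boxes.append(apply_deltas(anchors[top], d[top]))       # hypothetical decoder
        scores.append(s[top])
    boxes, scores = np.concatenate(boxes), np.concatenate(scores)  # merge all levels
    keep = nms(boxes, scores, nms_thresh)                      # 3) NMS at 0.5
    return boxes[keep][:max_detections], scores[keep][:max_detections]  # 4) ≤ 300 boxes
```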
Stats
mAP: mean average precision
• Precision = TP/(TP+FP) = 2/3 ≈ 0.67
• Recall is the proportion of TPs out of the ground-truth labels = 2/5 = 0.4
mAP: Interpolation approach (old, 2007)
mAP: AUC approach (new, 2011)
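A sketch contrasting the two AP definitions, assuming recalls/precisions are NumPy arrays of the cumulative precision-recall curve sorted by ascending recall (the 11-point branch follows the old 2007 metric, the AUC branch the newer one):

```python
import numpy as np

def average_precision(recalls, precisions, interpolate_11pt=False):
    if interpolate_11pt:
        # Old metric: average the max precision at recall >= {0, 0.1, ..., 1}.
        return np.mean([precisions[recalls >= r].max() if (recalls >= r).any() else 0.0
                        for r in np.linspace(0, 1, 11)])
    # AUC metric: make precision monotonically decreasing from the right,
    # then sum the rectangle areas under the interpolated curve.
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.flatnonzero(r[1:] != r[:-1])
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```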
Thank You

Editor's Notes

• #23 At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals per location is denoted k. So the reg layer has 4k outputs encoding the coordinates of the k boxes, and the cls layer outputs 2k scores that estimate the probability of object / not-object for each proposal.