Object detection
Agenda
 Selective search
 R-CNN family (two-stage)
 RetinaNet (one-stage)
 Anchors
 Losses
 Stats
 mAP
Classification vs Detection
Classification
Object Detection
Goal: locate and classify the objects in an image.
Problem: Where do we look in the image for the object?
(Example image: kitten)
Segmentation
Idea: If we correctly segment the image before running object
recognition, we can use our segmentations as candidate objects.
Advantages: Can be efficient, makes no assumptions about object
sizes or shapes.
Selective search
• Start by oversegmenting the input image
("Efficient graph-based image segmentation", Felzenszwalb and Huttenlocher, IJCV 2004)
Image gradients
Similarity measures
 Color: a 25-bin color histogram for each channel = 75-dim vector (RGB)
 Texture: HOG-like Gaussian derivatives of the image in 8 directions for each channel; a 10-bin histogram per region = 240-dim vector
 Size: size similarity encourages smaller regions to merge early. It ensures that region proposals at all scales are formed at all parts of the image.
 Shape: measures how well two regions ri and rj fit into each other. If ri fits into rj, merge them to fill gaps. (A sketch of these measures follows below.)
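A minimal NumPy sketch of these similarity terms, following the definitions in the selective search paper (color/texture use histogram intersection; the slide's "Shape" measure is the paper's "fill"; how the four terms are weighted and combined is a configuration choice):

```python
import numpy as np

def hist_similarity(h_i, h_j):
    # Color/texture similarity: intersection of L1-normalized histograms.
    return np.minimum(h_i, h_j).sum()

def size_similarity(size_i, size_j, size_im):
    # Encourages small regions to merge early.
    return 1.0 - (size_i + size_j) / size_im

def fill_similarity(size_i, size_j, size_bbox, size_im):
    # The slide's "Shape": how well the two regions fit into each other,
    # measured via the tight bounding box around both (size_bbox).
    return 1.0 - (size_bbox - size_i - size_j) / size_im
```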
Selective search
1. Merge the two most similar regions based on S.
2. Update the similarities between the new region and its neighbors.
3. Go back to step 1 until the whole image is a single region.
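A toy sketch of this merge loop, assuming each region carries a size and an L1-normalized histogram. For brevity every pair is treated as adjacent, whereas the real algorithm only scores spatially neighboring regions:

```python
import numpy as np

def merge_regions(ri, rj):
    # The merged histogram is the size-weighted combination of the parts.
    size = ri["size"] + rj["size"]
    hist = (ri["hist"] * ri["size"] + rj["hist"] * rj["size"]) / size
    return {"size": size, "hist": hist}

def hierarchical_grouping(regions, sim):
    """Greedy agglomerative grouping over an initial oversegmentation.
    `sim` is any combination of the similarity measures sketched above."""
    proposals = list(regions)
    regions = list(regions)
    while len(regions) > 1:
        # Step 1: find and merge the most similar pair.
        pairs = [(i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))]
        i, j = max(pairs, key=lambda p: sim(regions[p[0]], regions[p[1]]))
        merged = merge_regions(regions[i], regions[j])
        # Step 2: the new region replaces its parts; similarities to it are
        # recomputed on the next loop iteration (step 3: repeat until one region).
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)  # every intermediate region is a proposal
    return proposals
```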
Selective search
• Use hierarchical segmentation: start with small superpixels and
merge based on diverse cues
• Take the bounding boxes of all generated regions and treat them as possible object locations
Selective search
Stats
• Recall is the proportion of objects that are covered by some box with > 0.5 overlap
(Figure: recall for selected settings)
Region proposals!
R-CNN: Region proposals + CNN features
R-CNN details
• Cons
• Training is slow (84h), takes a lot of disk space
• 2000 CNN passes per image
• Inference (detection) is slow (47s / image with VGG16)
• The selective search algorithm is fixed; no learning is happening! This can lead to bad candidate region proposals.
Fast R-CNN
(Architecture diagram: forward the whole image through a ConvNet to get the "conv5" feature map; project the region proposals onto it; an "RoI Pooling" layer extracts a fixed-size feature per proposal; fully-connected layers (FCs) feed two heads: a linear + softmax classifier and linear bounding-box regressors.)
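A minimal NumPy sketch of RoI pooling for a single feature map with integer bin edges (real implementations handle sub-pixel bins, batching, and the image-to-feature-map scale):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map
    coordinates. Divides the RoI into an output_size x output_size grid and
    max-pools each cell, so every proposal yields a fixed-size tensor."""
    c, _, _ = feature_map.shape
    x1, y1, x2, y2 = roi
    out = np.zeros((c, output_size, output_size), dtype=feature_map.dtype)
    xs = np.linspace(x1, x2, output_size + 1).astype(int)
    ys = np.linspace(y1, y2, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)  # keep bins non-empty
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```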
Fast R-CNN
• Pros
• Less compute overhead
• 2.3 seconds per image inference time
• Cons
• Inference at 2.3 seconds is still too slow for real-life use!
• The selective search algorithm is fixed; no learning is happening! This can lead to bad candidate region proposals.
Fast R-CNN training
(Training diagram: the whole network, from the ConvNet through the FCs to both heads, is trainable end-to-end with a multi-task loss: log loss for the softmax classifier + smooth L1 loss for the bounding-box regressors.)
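A minimal PyTorch sketch of that multi-task loss, assuming class index 0 is background and box_preds has already been gathered per-RoI for the target class:

```python
import torch.nn.functional as F

def fast_rcnn_loss(cls_logits, cls_targets, box_preds, box_targets, lam=1.0):
    """Multi-task loss: log loss on class scores plus smooth L1 on box
    deltas, applied only to foreground RoIs. `lam` balances the two terms."""
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    fg = cls_targets > 0  # class 0 is background: no box loss for it
    if fg.any():
        box_loss = F.smooth_l1_loss(box_preds[fg], box_targets[fg])
    else:
        box_loss = box_preds.sum() * 0.0  # keep the graph valid with no foreground
    return cls_loss + lam * box_loss
```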
Speed comparison
Faster R-CNN
Region proposal network (RPN)
• Slide a small window over the feature map
• Predict object/no object
• Regress bounding box coordinates
• Box regression is with reference to anchors (3 scales x 3 aspect ratios)
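A minimal PyTorch sketch of the RPN head as described in the Faster R-CNN paper: a 3x3 "sliding window" convolution followed by two sibling 1x1 convolutions; with k anchors per location the cls branch outputs 2k object/not-object scores and the reg branch 4k box deltas:

```python
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)  # sliding window
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)  # object / not-object per anchor
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)  # 4 box deltas per anchor

    def forward(self, x):
        h = F.relu(self.conv(x))
        return self.cls(h), self.reg(h)
```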
Loss
i: the index of an anchor in a mini-batch
p_i: the predicted probability of anchor i being an object
p*_i: 1 if the anchor is positive, 0 if the anchor is negative
t_i: the 4 predicted bounding-box coordinates
t*_i: the coordinates of the ground-truth box associated with a positive anchor
L_reg(t_i, t*_i) = R(t_i − t*_i), where R is the robust loss function (smooth L1)
Classification + Regression
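These terms combine into the full RPN objective from the Faster R-CNN paper (not spelled out on the slide); the regression term is only active for positive anchors:

```latex
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```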
Online hard example mining
• Class imbalance hurts training.
• We end up training the model to learn the background space rather than to detect objects.
 Sort anchors by their calculated loss and apply NMS.
 Pick the top ones such that the ratio between the picked negatives and positives is at most 3:1 (a sketch follows below).
• Faster R-CNN samples 256 anchors per image: 128 positive, 128 negative.
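A minimal sketch of the negative-sampling step (the NMS pass over the sorted anchors is omitted here):

```python
import numpy as np

def sample_hard_negatives(losses, labels, neg_pos_ratio=3):
    """Hard example mining sketch: keep all positives and only the
    highest-loss negatives, capped at `neg_pos_ratio` per positive.
    losses: per-anchor loss values; labels: 1 = positive, 0 = negative."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_neg = min(len(neg), neg_pos_ratio * max(len(pos), 1))
    hardest = neg[np.argsort(losses[neg])[::-1][:n_neg]]  # sort by loss, descending
    return np.concatenate([pos, hardest])
```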
Speed comparison
Faster R-CNN
• Pros
• 0.2 seconds per image inference time: fast enough for real-life use
• Uses a trainable RPN instead of selective search, so the proposals are better
Strided convolutions (refresher)
(Figures: stride-1 convolution with a 3x3 kernel; stride-2 convolution with a 3x3 kernel)
IOU: Intersection over union (refresher)
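As a refresher, a minimal IoU computation for corner-format (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```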
NMS: non max suppression (refresher)
(Figures: initial predicted boxes; boxes filtered/suppressed by IoU)
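And a greedy NMS sketch, reusing iou() from the refresher above:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop the
    remaining boxes that overlap it by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```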
Why does a one-stage detector trail in accuracy?
Two-stage:
• The proposal stage rapidly narrows down the number of candidate object locations to a small set (e.g., 1-2k), filtering out most background samples.
• The classification stage fixes the foreground-to-background ratio at 1:3, or uses online hard example mining (OHEM).
One-stage:
• Has to process a much larger set of candidate object locations regularly sampled across the image, which amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios.
 Extreme foreground-background class imbalance is encountered.
Activation maps
How about predicting from multiple maps?
As the image goes deeper through the network, spatial resolution decreases and semantic value increases.
Feature pyramid networks (FPN)
• Improve the predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
• Top-down + lateral connections
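A minimal PyTorch sketch of one top-down + lateral merge step (channel counts are assumptions; the FPN paper uses 256 output channels):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """The coarser, semantically stronger map is upsampled 2x and added to
    a 1x1-projected lateral map, then smoothed with a 3x3 convolution."""
    def __init__(self, lateral_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(lateral_channels, out_channels, 1)
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, top_down, lateral_feat):
        up = F.interpolate(top_down, scale_factor=2, mode="nearest")
        return self.smooth(self.lateral(lateral_feat) + up)
```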
RetinaNet
• Backbone: produces activation maps at different pyramid levels
• The backbone can be DenseNet, VGG, MobileNet, ...
RetinaNet - Architecture
Anchors
• Aspect ratios: 0.5, 1, 2
• Scales: 1, 1.25, 1.58
• Strides: 8,16,32,64,128
• Sizes: 32, 64, 128, 256, 512
• Total (A): ratios × scales = 3 × 3 = 9 anchors per pixel location
• (K) object classes
Anchors - Example
• Anchor width = (size × scale)/sqrt(ratio); anchor height = (size × scale) × sqrt(ratio)
• E.g., for anchor size 32 (each box as [x1 y1 x2 y2], followed by its WxH):
• ratio 0.5: [-22 -11 22 11] 44x22, [-28 -14 28 14] 56x28, [-35 -17 35 17] 70x34
• ratio 1: [-16 -16 16 16] 32x32, [-20 -20 20 20] 40x40, [-25 -25 25 25] 50x50
• ratio 2: [-11 -22 11 22] 22x44, [-14 -28 14 28] 28x56, [-17 -35 17 35] 34x70
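A small sketch reproducing the table above (values match the slide up to rounding):

```python
import numpy as np

def anchors_for_size(size, scales=(1, 1.25, 1.58), ratios=(0.5, 1, 2)):
    """The 9 zero-centered anchors for one base size:
    width = size*scale/sqrt(ratio), height = size*scale*sqrt(ratio)."""
    boxes = []
    for ratio in ratios:
        for scale in scales:
            w = size * scale / np.sqrt(ratio)
            h = size * scale * np.sqrt(ratio)
            boxes.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.round(np.array(boxes))

print(anchors_for_size(32))  # first row ≈ [-22 -11 22 11], i.e. 44x22
```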
For an 800x600 input image:
• P3 activation map shape: 100x75
• Stride: 8
• Total (A) = 9 anchors per pixel location
• Total anchors at the P3 level = 100 × 75 × 9 = 67,500
• Summing similarly over all pyramid levels P3, P4, P5, P6, P7 = 90,360 anchors per image!
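A quick check of that count, assuming each feature-map side is the image side divided by the level's stride, rounded up:

```python
import math

def total_anchors(height=800, width=600, strides=(8, 16, 32, 64, 128), a=9):
    """Sum anchor counts over pyramid levels P3..P7."""
    total = 0
    for s in strides:
        fh, fw = math.ceil(height / s), math.ceil(width / s)
        total += fh * fw * a  # anchors per level = cells x anchors-per-cell
    return total

print(total_anchors())  # 90360
```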
Shift anchors
• Shift the anchors from the activation map onto the input image.
• On P3 (stride 8), the anchor centered at (0,0) is shifted by [4 4 4 4] to the first activation-map cell: e.g., [-22 -11 22 11] becomes [-18 -7 26 15], centered at (4,4).
• The next shifts are [12 4 12 4], [20 4 20 4], ... (centers (4,4), (12,4), ..., stepping by the stride of 8).
• Anchors are applied w.r.t. the input image!
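A sketch of the shifting, reusing anchors_for_size from the earlier sketch; cell centers are assumed to sit at stride/2 offsets, matching the (4,4), (12,4), ... centers above:

```python
import numpy as np

def shift_anchors(base_anchors, feat_h, feat_w, stride):
    """Replicate the zero-centered base anchors at every activation-map
    cell, shifted to the cell's center on the input image."""
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(cx, cy)
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1).reshape(-1, 1, 4)
    return (base_anchors[None, :, :] + shifts).reshape(-1, 4)

p3 = shift_anchors(anchors_for_size(32), 100, 75, 8)
print(p3.shape)  # (67500, 4) -- all P3 anchors for an 800x600 image
```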
Cross Entropy loss
Examples that are easily classified (pt > 0.5) each incur a loss of non-trivial magnitude; summed over a large number of easy examples, these small loss values can overwhelm the rare class.
Balanced cross entropy loss
• Weight the loss by α for foreground and 1−α for background
• α is a hyperparameter
• While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples!
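In the notation of the focal loss paper, the losses on this slide and the previous one are:

```latex
\mathrm{CE}(p_t) = -\log(p_t), \qquad
\mathrm{CE}_{\alpha}(p_t) = -\alpha_t \log(p_t),
\quad \text{where } \alpha_t = \alpha \text{ for foreground and } 1-\alpha \text{ for background.}
```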
Example
• The loss from easy examples = 100000 × 0.1 = 10000
• The loss from hard examples = 100 × 2.3 = 230
• 10000 / 230 ≈ 43: about 40× more loss comes from the easy examples.
Focal loss!
• Misclassified: pt is small, the modulating factor is near 1, and the loss is unaffected.
• As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.
• With γ = 2, an example classified with pt = 0.9 has 100× lower loss than with CE, and with pt ≈ 0.968 it has 1000× lower loss. This in turn increases the relative importance of correcting misclassified examples!
• Every sample is weighted according to its error!
• A modulating factor is added.
• The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted (see the sketch below).
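A minimal NumPy sketch of the α-balanced focal loss, FL(pt) = −αt (1−pt)^γ log(pt), with the paper's default γ = 2 (α is set to 1 in the printout to isolate the modulating factor):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p: predicted foreground probabilities;
    y: 1 = foreground, 0 = background."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# gamma = 2 down-weights an easy example (pt = 0.9) by 100x versus CE:
print(focal_loss(np.array([0.9]), np.array([1]), alpha=1.0))  # ≈ 0.00105
print(-np.log(0.9))                                           # ≈ 0.105
```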
Focal loss!
• Unlike FL, OHEM completely discards easy examples.
Focal loss!
Smooth L1 loss: Bounding boxes
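The slide shows the plot only; the standard smooth L1 (as in the Fast R-CNN paper), applied to each box-coordinate difference x, is:

```latex
\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```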
Prediction pipeline
• The network predicts regressions (deltas) to the anchor boxes! (A sketch of the pipeline follows below.)
 Filter by a 0.05 anchor score threshold
 Keep the top 1000 boxes per level, then merge all levels
 Apply NMS at 0.5
 Up to 300 final boxes! Display them to the user 
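A NumPy sketch of this pipeline for a single class; apply_deltas (decoding boxes from anchors + predicted deltas) is a hypothetical helper, and nms() is reused from the refresher sketch above:

```python
import numpy as np

def decode_and_filter(anchors_per_level, scores_per_level, deltas_per_level,
                      score_thresh=0.05, pre_nms_top_k=1000,
                      nms_thresh=0.5, max_detections=300):
    boxes, scores = [], []
    for anchors, s, d in zip(anchors_per_level, scores_per_level, deltas_per_level):
        keep_idx = np.flatnonzero(s > score_thresh)            # 1) score threshold
        top = keep_idx[np.argsort(s[keep_idx])[::-1][:pre_nms_top_k]]  # 2) top 1000/level
        boxes.append(apply_deltas(anchors[top], d[top]))       # hypothetical decoder
        scores.append(s[top])
    boxes, scores = np.concatenate(boxes), np.concatenate(scores)  # merge all levels
    keep = nms(boxes, scores, nms_thresh)                      # 3) NMS at 0.5
    return boxes[keep][:max_detections], scores[keep][:max_detections]  # 4) ≤ 300 boxes
```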
Stats
mAP: mean average precision
• Precision = TP/(TP+FP) = 2/3 ≈ 0.67
• Recall is the proportion of TPs out of the ground-truth labels = 2/5 = 0.4
mAP: Interpolation approach (old, 2007)
mAP: AUC approach (new, 2011)
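A sketch contrasting the two AP definitions, assuming recalls/precisions are NumPy arrays of the cumulative precision-recall curve sorted by ascending recall (the 11-point branch follows the old 2007 metric, the AUC branch the newer one):

```python
import numpy as np

def average_precision(recalls, precisions, interpolate_11pt=False):
    if interpolate_11pt:
        # Old metric: average the max precision at recall >= {0, 0.1, ..., 1}.
        return np.mean([precisions[recalls >= r].max() if (recalls >= r).any() else 0.0
                        for r in np.linspace(0, 1, 11)])
    # AUC metric: make precision monotonically decreasing from the right,
    # then sum the rectangle areas under the interpolated curve.
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.flatnonzero(r[1:] != r[:-1])
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```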
Thank You

Editor's Notes

• #23 At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals per location is denoted k. So the reg layer has 4k outputs encoding the coordinates of the k boxes, and the cls layer outputs 2k scores that estimate the probability of object / not-object for each proposal.