150424 Scalable Object Detection using Deep Neural Networks

Perception and Intelligence Lab.
Scalable Object Detection
using Deep Neural Networks
Saturday, June 10, 2017
Presenter: Junho Cho
Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov
Google, Inc
CVPR 2014

+ DeepMultiBox
 Scalable object detection using DNN
+ Class-agnostic scalable object detection
 Only bounding boxes. Not aware of what the object in the box is.
 Predicts a set of bounding boxes where potential objects are
 Localize then recognize
+ Boxes generated using single DNN
 Outputs
• fixed number of bounding boxes.
• A score for each box. Confidence of the box containing an object.
Introduction

+ Common paradigm
 Detect objects of one particular class.
 Operate on sub-images and apply detectors exhaustively
• All locations and scales
 Was successful with discriminatively trained DPM (PAMI 2010)
 Too much computation
 Harder as # of classes ↑
• Train a separate detector per class
Previous work

+ Model
 Encode i-th object box and its confidence as node values of last net layer
+ Bounding box
 Upper-left and lower-right co-ordinates
 Vector: 𝒍𝒊 ∈ ℝ⁴ (4 node values)
 Normalized co-ordinates w.r.t. image dim.
 Linear transform of the last hidden layer
+ Confidence
 Confidence score for the box containing an object
 Score: 𝒄𝒊 ∈ [𝟎, 𝟏] (1 node value)
 Linear transform of the last hidden layer followed by a sigmoid
[Figure: a bounding box with upper-left corner (x1, y1) and lower-right corner (x2, y2); 𝒍𝒊 = [x1 y1 x2 y2]]
Proposed approach

+Inference time
 𝐾 bounding boxes, 𝐾 = 100 or 200
 Bounding box locations 𝑙𝑖, 𝑖 ∈ {1, …, 𝐾}
 Confidences 𝑐𝑖, 𝑖 ∈ {1, …, 𝐾}
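A minimal sketch of how such an output layer could be laid out. The hidden size `H`, the weight names (`w_loc`, `w_conf`) and the random values are illustrative assumptions, not taken from the paper. Each box gets 4 linear coordinate nodes and 1 sigmoid confidence node, so 𝐾 boxes need 5𝐾 output nodes.

```python
import math
import random

def multibox_head(hidden, w_loc, b_loc, w_conf, b_conf):
    """Map the last hidden layer to K boxes (linear) and K confidences (sigmoid)."""
    k = len(w_conf)
    boxes, confs = [], []
    for i in range(k):
        # 4 coordinate nodes per box: plain linear transform of the hidden layer
        box = []
        for d in range(4):
            row = 4 * i + d
            box.append(sum(w * h for w, h in zip(w_loc[row], hidden)) + b_loc[row])
        # 1 confidence node per box: linear transform followed by a sigmoid
        z = sum(w * h for w, h in zip(w_conf[i], hidden)) + b_conf[i]
        boxes.append(box)
        confs.append(1.0 / (1.0 + math.exp(-z)))
    return boxes, confs

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
H, K = 8, 100
hidden = [rng.uniform(-1, 1) for _ in range(H)]
w_loc = [[rng.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(4 * K)]
b_loc = [0.5] * (4 * K)
w_conf = [[rng.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(K)]
b_conf = [0.0] * K

boxes, confs = multibox_head(hidden, w_loc, b_loc, w_conf, b_conf)
num_output_nodes = 4 * len(boxes) + len(confs)  # 5 nodes per box
```

With K = 100 this layout gives 500 output nodes, and with K = 200 it gives 1000.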
Proposed approach
K=100 500 nodes
K=200 1000 nodes
𝑥1
𝑦1
𝑥1
𝑦2
𝑐1
𝑙1
5 𝑛𝑜𝑑𝑒𝑠
…
𝑐𝑖
…
…
𝐾
𝑐 𝐾

+ Training objective
 Train the DNN to predict the 𝑙𝑖 and their 𝑐𝑖
• Such that the highest-scoring boxes match the ground truth object boxes well.
Proposed approach

+ A training image has 𝑀 ground truth (GT) objects labeled by bounding boxes.
 Bounding boxes: 𝑔𝑗, 𝑗 ∈ {1, … , 𝑀}
 Practically, 𝐾 ≫ 𝑀
 So optimize only the predictions that best match the ground truth.
Proposed approach

+ Formulation of assignment problem
+ 𝑥 is the assignment of predicted bounding boxes to GT boxes.
+ 𝑥𝑖𝑗 ∈ {0, 1} (𝑖 ∈ {1, …, 𝐾}, 𝑗 ∈ {1, … , 𝑀})
+ 𝑥𝑖𝑗 = 1 ⇔ the 𝑖-th prediction is assigned to the 𝑗-th true object.
+ Localization loss
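For reference, the localization loss in the MultiBox paper is the squared L2 distance between each prediction and the ground truth box assigned to it. Written in the notation of these slides (𝑙𝑖: predicted box, 𝑔𝑗: GT box, 𝑥𝑖𝑗: assignment):

```latex
F_{\text{match}}(x, l) = \frac{1}{2} \sum_{i,j} x_{ij} \, \lVert l_i - g_j \rVert_2^2
```

Only assigned pairs (𝑥𝑖𝑗 = 1) contribute, so unmatched predictions receive no localization penalty.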
Proposed approach
[Figure: bipartite graph between predictions 1 … K and GT boxes 1 … M; an edge between prediction i and GT box j marks 𝑥𝑖𝑗 = 1]

+ Optimize the confidences of the boxes according to the assignment 𝑥
+ Confidence loss
+ Term (a)
 For every predicted box 𝒊 that is assigned to a ground truth 𝒋:
 𝑥𝑖𝑗 = 1, so maximize 𝑐𝑖
+ Term (b)
 Σ𝑗 𝑥𝑖𝑗 = 1 ⇒ prediction 𝑖 has been matched to a ground truth.
• The term becomes zero
 Σ𝑗 𝑥𝑖𝑗 = 0 ⇒ prediction 𝑖 has not been matched to a ground truth.
• Minimize 𝑐𝑖
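For reference, the confidence loss in the MultiBox paper is a log loss over the confidences, with the two terms labelled (a) and (b) as on this slide:

```latex
F_{\text{conf}}(x, c) =
  \underbrace{- \sum_{i,j} x_{ij} \log(c_i)}_{\text{(a)}}
  \;
  \underbrace{- \sum_{i} \Bigl(1 - \sum_{j} x_{ij}\Bigr) \log(1 - c_i)}_{\text{(b)}}
```

Term (a) is minimized by pushing 𝑐𝑖 toward 1 for matched predictions, and term (b) by pushing 𝑐𝑖 toward 0 for unmatched ones.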
Proposed approach
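[Figure: bipartite graph between predictions 1 … K and GT boxes 1 … M with 𝑥𝑖𝑗 = 1; matched predictions fall under term (a), unmatched predictions under term (b)]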

+ Final loss objective.
 Combination of localization loss and confidence loss
+ 𝛼: balance term between the two losses.
 Used 𝛼 = 0.3
+ Optimization.
 For each training example, solve for an optimal assignment 𝑥∗ of predictions to true boxes
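As a reference, the combined objective and the per-example assignment problem from the MultiBox paper can be written as follows, where F_match denotes the localization loss and F_conf the confidence loss:

```latex
F(x, l, c) = \alpha \, F_{\text{match}}(x, l) + F_{\text{conf}}(x, c)

x^{*} = \arg\min_{x} F(x, l, c)
\quad \text{subject to} \quad
x_{ij} \in \{0, 1\}, \qquad \sum_{i} x_{ij} = 1 \;\; \text{for every } j
```

The constraint says that every ground truth box 𝑗 is assigned to exactly one prediction.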
Proposed approach

+Bipartite matching
 Polynomial complexity.
• E.g. Hungarian method, time complexity: 𝑂(𝑛³)
 Inexpensive matching
• In most cases, # of ground truth boxes ≤ a dozen
 Thus fast
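A minimal sketch of the matching step, assuming the pair cost is the part of the loss that changes when prediction 𝑖 is assigned to GT box 𝑗: 𝛼/2 · ‖𝑙𝑖 − 𝑔𝑗‖² − log 𝑐𝑖 + log(1 − 𝑐𝑖). The solver below is a compact textbook Hungarian method (potentials variant) in pure Python, not the authors' implementation. The function names are illustrative.

```python
import math

ALPHA = 0.3  # balance term quoted on the slides

def hungarian(cost):
    """Min-cost assignment for an n x m cost matrix with n <= m.
    Returns result[row] = column assigned to that row."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row / column potentials
    p = [0] * (m + 1)                          # p[col] = row matched to col (1-based)
    way = [0] * (m + 1)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:                         # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    result = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            result[p[j] - 1] = j - 1
    return result

def pair_cost(box, conf, gt):
    # Change in the total loss when this prediction is matched to this GT box.
    loc = 0.5 * sum((a - b) ** 2 for a, b in zip(box, gt))
    return ALPHA * loc - math.log(conf) + math.log(1.0 - conf)

def match_predictions(boxes, confs, gts):
    # Rows are GT boxes (M), columns are predictions (K), with M <= K.
    cost = [[pair_cost(b, c, g) for b, c in zip(boxes, confs)] for g in gts]
    return hungarian(cost)  # result[j] = index i of the prediction for GT j

boxes = [[0.1, 0.1, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9], [0.0, 0.0, 1.0, 1.0]]
confs = [0.6, 0.7, 0.2]
gts = [[0.52, 0.5, 0.9, 0.88], [0.1, 0.12, 0.42, 0.4]]
match = match_predictions(boxes, confs, gts)
```

Because M is usually at most a dozen, the cost matrix stays small even with K = 100 or 200 columns.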
Proposed approach
[Figure: predictions 1 … K on one side, GT boxes 1 … 5 on the other]

+ For example… 5 of GT & K # of Prediction
Proposed approach
[Figure: example image with numbered boxes. Actually K = 100 or 200, so there are more red boxes. Find the best match between GT and predictions.]

+ Optimize network parameters
 Via back-propagation (BP)
 BP needs the first derivatives of the loss w.r.t. 𝑙 and 𝑐
 Update the network parameters after evaluating the gradient given 𝑥∗
 Train with stochastic gradient descent
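A minimal sketch of the loss and its first derivatives for a fixed assignment 𝑥∗, derived directly from the squared-distance localization loss and the log-loss confidence term with 𝛼 = 0.3. For a matched prediction, ∂F/∂𝑙𝑖 = 𝛼(𝑙𝑖 − 𝑔𝑗) and ∂F/∂𝑐𝑖 = −1/𝑐𝑖; for an unmatched one, ∂F/∂𝑙𝑖 = 0 and ∂F/∂𝑐𝑖 = 1/(1 − 𝑐𝑖). The function names and example values are illustrative assumptions.

```python
import math

ALPHA = 0.3  # balance term quoted on the slides

def multibox_loss(boxes, confs, gts, assign):
    """assign[j] = index i of the prediction matched to GT box j (the fixed x*)."""
    matched = {i: j for j, i in enumerate(assign)}
    loc, conf = 0.0, 0.0
    for i, (l, c) in enumerate(zip(boxes, confs)):
        if i in matched:
            g = gts[matched[i]]
            loc += 0.5 * sum((a - b) ** 2 for a, b in zip(l, g))
            conf -= math.log(c)          # matched: push confidence up
        else:
            conf -= math.log(1.0 - c)    # unmatched: push confidence down
    return ALPHA * loc + conf

def multibox_grads(boxes, confs, gts, assign):
    """Gradients of multibox_loss w.r.t. every box coordinate and confidence."""
    matched = {i: j for j, i in enumerate(assign)}
    d_boxes, d_confs = [], []
    for i, (l, c) in enumerate(zip(boxes, confs)):
        if i in matched:
            g = gts[matched[i]]
            d_boxes.append([ALPHA * (a - b) for a, b in zip(l, g)])
            d_confs.append(-1.0 / c)
        else:
            d_boxes.append([0.0] * len(l))
            d_confs.append(1.0 / (1.0 - c))
    return d_boxes, d_confs

boxes = [[0.1, 0.1, 0.5, 0.5], [0.4, 0.4, 0.9, 0.9], [0.0, 0.0, 1.0, 1.0]]
confs = [0.8, 0.3, 0.1]
gts = [[0.12, 0.08, 0.52, 0.5]]
assign = [0]  # GT box 0 is matched to prediction 0
d_boxes, d_confs = multibox_grads(boxes, confs, gts, assign)
```

In a real network these gradients would be propagated further back through the sigmoid and the linear layers; the sketch stops at the output nodes.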
Proposed approach

+ The above is sufficient in principle to train the model,
 but additional modifications make training faster and more accurate
+Modification
1. Cluster all training GT locations (all GT boxes from the training images)
• Find 𝐾 such clusters/centroids (K-means)
– 𝐾 : # of predictions
• Use the centroids as priors for each of the predicted locations.
• Encourages the network to learn a residual to a prior.
 Each prediction learns from its corresponding prior
• 1st prior to 𝑙1 node
• …
• 𝐾-th prior to 𝑙𝐾 node
 The 𝑙𝑖 node predicts a box close to the 𝑖-th prior.
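A minimal sketch of computing the priors with plain K-means (Lloyd's algorithm) over GT boxes given as normalized [x1, y1, x2, y2] vectors. Initializing the centroids from the first 𝐾 boxes and running a fixed number of iterations are simplifying assumptions made here; the slides do not specify these details.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_priors(gt_boxes, k, iters=20):
    """Cluster GT boxes into k centroids; centroid i serves as the prior for l_i."""
    # Simplistic init (assumption): take the first k boxes as starting centroids.
    centroids = [list(b) for b in gt_boxes[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in gt_boxes:
            nearest = min(range(k), key=lambda c: sq_dist(b, centroids[c]))
            clusters[nearest].append(b)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

gt_boxes = [
    (0.10, 0.10, 0.30, 0.30),
    (0.60, 0.60, 0.90, 0.90),
    (0.12, 0.10, 0.32, 0.30),
    (0.62, 0.60, 0.92, 0.90),
]
priors = kmeans_priors(gt_boxes, k=2)
```

With the priors fixed, the 𝑖-th output can be interpreted as the 𝑖-th prior plus a learned residual.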
Proposed approach
[Figure: priors 1 … K paired one-to-one with predictions 1 … K; each prediction learns from its prior]

+ Modification
2. Use these priors in the matching process instead of the predictions.
• Find the best match between the 𝑲 priors and the GT boxes
• Compute the confidence loss and localization loss between the GT boxes
and the coordinates of the predictions whose priors were matched
• Call it prior matching
– Hypothesis: enforces diversification among predictions
• Without it: slow convergence and a lower-quality model
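A minimal sketch of prior matching: the assignment is computed between priors and GT boxes, and the loss is then taken on the predictions that sit at the matched prior indices. Using squared distance as the matching cost and a brute-force search over assignments (practical only for tiny K, standing in for a proper bipartite matcher) are assumptions for illustration.

```python
from itertools import permutations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def match_priors_to_gt(priors, gts):
    """Brute-force one-to-one assignment; returns best[j] = prior index for GT j."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(priors)), len(gts)):
        cost = sum(sq_dist(priors[i], g) for i, g in zip(perm, gts))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)

def localization_loss(predictions, gts, assignment):
    # The loss uses prediction i, where i is the index of the matched prior.
    return 0.5 * sum(sq_dist(predictions[i], gts[j])
                     for j, i in enumerate(assignment))

priors = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.25, 0.25, 0.75, 0.75]]
predictions = [[0.4, 0.5, 0.9, 1.0], [0.5, 0.5, 0.9, 0.9], [0.3, 0.3, 0.7, 0.7]]
gts = [[0.45, 0.5, 0.95, 1.0]]

assignment = match_priors_to_gt(priors, gts)
loss = localization_loss(predictions, gts, assignment)
```

In this example prediction 0 is currently closer to the GT box than prediction 1, but prior 1 is the closest prior, so prediction 1 is the one trained toward this GT box. That is how matching on priors keeps each output tied to its own region.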
Proposed approach
[Figure: priors 1 … K are matched to GT boxes 1 … 5 (best match); the prediction corresponding to each matched prior is used for the training loss. Prediction guided by prior.]

Proposed approach
+ Prediction corresponding to prior
 Learn to predict near prior
 Prediction guided by prior

First localize, (DeepMultiBox)
+ Predict bounding box locations and associated confidences.
+ Can use the confidence scores and Non-Maximum Suppression (NMS)
 to obtain a smaller # of high-confidence boxes.
+ These boxes are supposed to represent objects.
then recognize
+ Can use subsequent classifier for object detection.
+ Can use powerful classifier
 Because of small # of boxes
+ In the paper, a second DNN is used for classification
 AlexNet: A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. NIPS 2012.
Proposed approach

+Experiment details
 Parallel training
• Faster convergence
 Boxes pruned using NMS
• Jaccard (Intersection / Union) similarity threshold of 0.5
 Generate more training crops from each image of the dataset
• Equal numbers of crops in which the bounding boxes cover 0-5%, 5-15%, 15-50%, 50-100% of the crop.
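A minimal sketch of pruning boxes with greedy NMS at a Jaccard (intersection over union) threshold of 0.5. Boxes are [x1, y1, x2, y2]. Suppressing a box when its overlap with an already kept box is strictly greater than the threshold is an assumption, as is the optional `top_k` parameter.

```python
def jaccard(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5, top_k=None):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(jaccard(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
            if top_k is not None and len(keep) == top_k:
                break
    return keep  # indices of surviving boxes, best first

boxes = [[0.0, 0.0, 1.0, 1.0], [0.05, 0.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, threshold=0.5)
```

Passing `top_k=10` would keep only the ten highest-scoring survivors.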
Experiment results

+ VOC 2007
 The Pascal Visual Object Classes Challenge
 20 object classes labelled with bounding boxes
+ Training on VOC 2012. 11000 images
 Trained K = 100 box localizer
 Trained on a data set comprising
• 10 million crops overlapping some obj
– At least 0.5 Jaccard overlap similarity
• 20 million negative crops
– At most 0.2 Jaccard sim with any obj boxes.
– Labeled with “background” class label
Experiment results - VOC 2007

+ Evaluation
 Maximum center square crop
• Resized to network input size 220x220 (AlexNet)
 Single pass, hundred candidate boxes
 Apply NMS, top 10 highest scoring detections
 Classified by 21-way classifier (20 classes + background class)
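A minimal sketch of the first evaluation step, the maximum center square crop, computed in pixel coordinates before resizing to the 220x220 network input. The integer rounding convention and the function name are assumptions.

```python
NETWORK_INPUT_SIZE = 220  # input size quoted on the slide

def max_center_square_crop(width, height):
    """Largest square centered in the image, as (left, top, right, bottom)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

crop = max_center_square_crop(500, 375)        # a typical landscape image size
scale = NETWORK_INPUT_SIZE / (crop[2] - crop[0])  # resize factor to 220x220
```

Anything outside the square is not seen by the network in this single-crop setting.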
[Figure: maximum center square crop of the image, resized to the 220x220 input size]

+ Discussion
 Analysis of the localizer in isolation
 Additional scales
• 3x3 windows of size 60% of the image
 Objects localized with a budget of 10 bounding boxes
• Max center: 45.3%
• Max center + 1 scale: 48%
 Importance of looking at the image at several resolutions
• Better with high-resolution image crops.
 Better than other reported results
• 42% (What is an object? CVPR2010)

+ Discussion
 Post-classification
• mAP: 0.29
 Quite competitive
• Because the running-time complexity is very low
• Use top 10 boxes

[Figure annotation: Max-center crop. The full image is used, but small objects such as boats and sheep are detectable.]

Experiment results – ILSVRC 2012

+ ILSVRC 2012 Classification with Localization Challenge
 Localization model with more heuristic methods
• Inception architecture
 After classification
 Much less # of proposals.

+ MultiBox approach can use transfer learning
 To detect objects it was never specifically trained on.
• Through similarities with objects that it has seen.
 Figure 5: trained on ImageNet and tested on the VOC test set
• And vice versa
 Performed class-agnostic detection.

+ The ImageNet-trained model captures more VOC windows
 Compared to vice versa
 Hypothesis: due to the ImageNet class set being richer than the VOC class set.

+ Three contributions
1. New definition of Object Detection
• A regression problem to the coordinates of several bounding boxes, as well
as a confidence score of how likely this box contains an object.
• Traditionally, score features within predefined boxes.
Contributions

2. Loss function which trains bounding box predictors
as part of network training
• Solve the assignment problem by utilizing the learning abilities of the DNN
• Back Propagation
Contributions

3. Train object box detector in class-agnostic manner
• Scalable way to detect large # of object classes.
• With post-classification, achieves competitive detection results.
• Box predictor generalizes over unseen classes
– Flexible enough to be re-used for other detection problems.
Contributions

+ Competing method.
 Better detection performance but more computation
 OverFeat
• Efficient sliding ConvNet at multiple locations and scales
• Predicting one bounding box per class
• 2 sec/image on GPU.
• 40x slower than GPU implementation of DeepMultiBox
• SCR, centered crop: closest method to DeepMultiBox
– Scores 40.0% while DeepMultiBox scores 40.94%
• DeepMultiBox extracts multiple regions of interest in one network evaluation.
Discussion and Conclusion

+ Competing method.
 R-CNN using selective search
• Propose 2000 candidates locations per image
• Extract top layer features from ConvNet
• Use hard-negative trained SVM to classify the locations into VOC classes
• 200x more expensive

+ Current state (localization network and categorization network)
 5 – 10 network evaluations
• 1 network for localization and several more for classification
 Does not scale linearly with # of classes to be recognized.
 Which makes it very competitive with DPM-like approaches.
+ Hope to build localization and recognition into a single network.
 Extract both locations and class labels in a single feed-forward pass through the network.

Thank you

+ AlexNet (NIPS 2012)
Convolution – pooling – ReLU – Normalize
= 1 convolutional layer
 5 convolutional layers
 2 fully-connected hidden layers
Introduction

+ Evaluation
 Detection@5
• Produce one box per each of the 5 labels
– Positive when at least one box and associated label are correct
• Jaccard 0.5 overlap
• Table 2.
– # of windows chosen after NMS, ranking from confidence score
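A minimal sketch of the Detection@5 criterion as described on this slide: an image counts as positive when at least one of the (up to) five predicted (label, box) pairs has the correct label and overlaps a ground truth box of that label with Jaccard similarity of at least 0.5. Treating exactly 0.5 as sufficient, and the data layout, are assumptions.

```python
def jaccard(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def detection_at_5(predictions, ground_truth, min_overlap=0.5):
    """predictions: list of (label, box); ground_truth: list of (label, box)."""
    for label, box in predictions[:5]:
        for gt_label, gt_box in ground_truth:
            if label == gt_label and jaccard(box, gt_box) >= min_overlap:
                return True
    return False

ground_truth = [("dog", [10, 10, 110, 110])]
good = [("cat", [0, 0, 50, 50]), ("dog", [20, 20, 120, 120])]
bad = [("dog", [200, 200, 300, 300]), ("cat", [10, 10, 110, 110])]
hit = detection_at_5(good, ground_truth)
miss = detection_at_5(bad, ground_truth)
```

The second example fails twice over: the box with the right label is in the wrong place, and the box in the right place has the wrong label.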

+ Compare with One-box-per-class
 Re-implementation of the winning entry of the ILSVRC-2012 “classification with localization” challenge
• SuperVision. Hinton.
– Code not provided…
 DeepMultiBox is competitive with 5-10 windows
 Two Drawbacks:
1. Output scales linearly with the # of classes
2. Doesn’t generalize naturally to multiple instances of obj of the same type.

2. Doesn’t generalize naturally to multiple objects of the same type.
+ Generalization to such scenarios is
+ necessary for actual image understanding.
+ DeepMultiBox : scalable way
+ In Fig. 5, it generally captures more objects, more accurately, than a single-box method.

+ Novel method for localizing object in an image.
+ Uses deep CNN as base feature extraction and learning model.
+ Formulates multi box localization cost
 Taking advantage of # of GT locations
 Learn to predict such locations in unseen images.

+ Results on challenging benchmarks. VOC 2007 & ILSVRC 2012
+ Works well when predicting only very few locations.
 To be probed by a subsequent classifier
+ Scalable, and generalizes across two datasets.
 Able to predict locations of interest even for classes it was not trained on.
+ Captures multiple instances of the same class
 Important feature. Aims at better image understanding.

+ Predicting more windows captures more GT bounding boxes.
 But no comparable increase in mAP on VOC2007
 Hypothesis: the classification model would work better with hard-negative mining, and by learning to jointly model local features, context, and detector confidences it could take better advantage of the proposed windows.
