150424 Scalable Object Detection using Deep Neural Networks
1. Perception and Intelligence Lab.
Scalable Object Detection
using Deep Neural Networks
Saturday, June 10, 2017
Presenter: Junho Cho
Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov
Google, Inc
CVPR 2014
2. Perception and Intelligence Lab.
+ DeepMultiBox
Scalable object detection using DNN
+ Class-agnostic scalable object detection
Bounding boxes only: the detector is not aware of which object is in a box.
Predicts a set of bounding boxes where potential objects are
Localize then recognize
+ Boxes generated using a single DNN
Outputs
• A fixed number of bounding boxes.
• A score for each box: the confidence that the box contains an object.
Introduction
3. Perception and Intelligence Lab.
+ Common paradigm
Detect objects of one particular class.
Operate on sub-images and apply detectors exhaustively
• All locations and scales
Was successful with discriminatively trained DPM (PAMI 2010)
Too much computation
Harder as # of classes ↑
• Train a separate detector per class
Previous work
4. Perception and Intelligence Lab.
+ Model
Encode the 𝑖-th object box and its confidence as node values of the last network layer
+ Bounding box
Upper-left and lower-right coordinates
Vector: 𝒍𝒊 ∈ ℝ⁴
4 node values
Coordinates normalized w.r.t. the image dimensions
Linear transform of the last hidden layer
+ Confidence
Confidence score for the box containing an object
Score: 𝒄𝒊 ∈ [0, 1]
1 node value
Linear transform of the last hidden layer followed by a sigmoid
𝒍𝒊 = [x1, y1, x2, y2], with upper-left corner (x1, y1) and lower-right corner (x2, y2)
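The output encoding described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `box_and_confidence_head` and the weight layout (one weight row per output node) are assumptions made for the example.

```python
import math

def box_and_confidence_head(hidden, w_loc, b_loc, w_conf, b_conf):
    """Map the last hidden layer to K boxes and K confidences.

    hidden: list of H activations of the last hidden layer
    w_loc / b_loc: 4K weight rows (each of length H) and 4K biases
    w_conf / b_conf: K weight rows (each of length H) and K biases
    (names and layout are hypothetical, chosen for this sketch)
    """
    def linear(weights, biases):
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(weights, biases)]

    # Boxes: a plain linear transform, grouped as [x1, y1, x2, y2] per box.
    flat = linear(w_loc, b_loc)
    boxes = [flat[4 * i:4 * i + 4] for i in range(len(flat) // 4)]
    # Confidences: linear transform followed by a sigmoid, so c_i lies in (0, 1).
    confs = [1.0 / (1.0 + math.exp(-z)) for z in linear(w_conf, b_conf)]
    return boxes, confs

# Example with H = 2 hidden units and K = 1 box.
boxes, confs = box_and_confidence_head(
    hidden=[1.0, 2.0],
    w_loc=[[0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [0.1, 0.2]],
    b_loc=[0.0, 0.0, 0.0, 0.0],
    w_conf=[[0.0, 0.0]],
    b_conf=[0.0],
)
```

With K boxes the last layer therefore has 5K output nodes: 4K for coordinates and K for confidences.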
Proposed approach
6. Perception and Intelligence Lab.
+Train Objective
Train the DNN to predict boxes 𝑙𝑖 and their confidences 𝑐𝑖
• Such that the highest-scoring boxes match the ground-truth object boxes well.
Proposed approach
7. Perception and Intelligence Lab.
+ A training image has 𝑀 ground-truth (GT) objects labeled by bounding boxes.
Bounding boxes: 𝑔𝑗, 𝑗 ∈ {1, … , 𝑀}
Practically, the number of predictions 𝐾 ≫ 𝑀
Optimize only the predictions that best match the ground truth.
Proposed approach
8. Perception and Intelligence Lab.
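+ Formulation as an assignment problem
+ 𝑥 is the assignment from predicted bounding boxes to GT boxes.
+ 𝑥𝑖𝑗 ∈ {0, 1} (𝑖 ∈ {1, … , 𝐾}, 𝑗 ∈ {1, … , 𝑀})
+ 𝑥𝑖𝑗 = 1 iff the 𝑖-th prediction is assigned to the 𝑗-th true object.
+ Localization loss: $F_{\text{match}}(x, l) = \frac{1}{2} \sum_{i,j} x_{ij} \lVert l_i - g_j \rVert_2^2$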
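As a sketch of the localization loss, assuming the squared L2 distance between matched boxes used in the paper, the function below sums half the squared distance over all pairs with 𝑥𝑖𝑗 = 1. The function name and the list-of-lists representation of 𝑥 are choices made for this example only.

```python
def localization_loss(x, boxes, gt):
    """0.5 * sum over (i, j) of x[i][j] * ||boxes[i] - gt[j]||^2.

    x: K x M assignment matrix with 0/1 entries
    boxes: K predicted boxes [x1, y1, x2, y2]
    gt: M ground-truth boxes [x1, y1, x2, y2]
    """
    total = 0.0
    for i, box in enumerate(boxes):
        for j, g in enumerate(gt):
            if x[i][j]:
                total += 0.5 * sum((a - b) ** 2 for a, b in zip(box, g))
    return total

# K = 2 predictions, M = 1 ground-truth box; prediction 1 is assigned to it.
pred_boxes = [[0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 1.0, 1.0]]
gt_boxes = [[0.5, 0.5, 0.9, 0.9]]
assignment = [[0], [1]]
loss = localization_loss(assignment, pred_boxes, gt_boxes)
```

Unassigned predictions contribute nothing here; only the confidence loss acts on them.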
Proposed approach
[Figure: bipartite graph linking predictions 1 … 𝐾 to GT boxes 1 … 𝑀; an edge between prediction 𝑖 and GT 𝑗 means 𝑥𝑖𝑗 = 1]
9. Perception and Intelligence Lab.
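+ Optimize the confidences of the boxes according to the assignment 𝑥
+ Confidence loss: $F_{\text{conf}}(x, c) = -\sum_{i,j} x_{ij} \log c_i - \sum_i \bigl(1 - \sum_j x_{ij}\bigr) \log(1 - c_i)$
+ Term (a): $-\sum_{i,j} x_{ij} \log c_i$
For every predicted box 𝒊 assigned to a ground truth 𝒋,
𝑥𝑖𝑗 = 1, so 𝑐𝑖 is maximized
+ Term (b): $-\sum_i \bigl(1 - \sum_j x_{ij}\bigr) \log(1 - c_i)$
$\sum_j x_{ij} = 1$: prediction 𝑖 has been matched to a ground truth.
• The term becomes zero
$\sum_j x_{ij} = 0$: prediction 𝑖 has not been matched to a ground truth.
• Minimize 𝑐𝑖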
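The two terms can be sketched as a cross-entropy over the per-prediction "matched" indicator. This assumes the log-loss form of the paper and that each prediction is matched to at most one ground truth; the function name is illustrative.

```python
import math

def confidence_loss(x, confs):
    """Term (a) + term (b) of the confidence loss.

    x: K x M assignment matrix with 0/1 entries
    confs: K confidences, each strictly between 0 and 1
    """
    total = 0.0
    for i, c in enumerate(confs):
        matched = sum(x[i])  # 1 if prediction i is matched, else 0
        total -= matched * math.log(c)                # term (a): push c_i up
        total -= (1 - matched) * math.log(1.0 - c)    # term (b): push c_i down
    return total

# Prediction 0 is matched, prediction 1 is not.
assignment = [[1], [0]]
good = confidence_loss(assignment, [0.9, 0.2])
bad = confidence_loss(assignment, [0.2, 0.9])
```

Confident scores on matched boxes and low scores on unmatched boxes give the smaller loss, which is the behaviour the two terms are meant to encourage.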
Proposed approach
[Figure: prediction-to-GT assignment graph with 𝑥𝑖𝑗 = 1, annotated with terms (a) and (b)]
10. Perception and Intelligence Lab.
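+ Final loss objective: $F(x, l, c) = \alpha F_{\text{match}}(x, l) + F_{\text{conf}}(x, c)$
Combination of the localization loss and the confidence loss
+ 𝛼: balance term.
Used 0.3
+ Optimization.
For each training example, solve for the optimal assignment $x^* = \arg\min_x F(x, l, c)$ subject to $x_{ij} \in \{0, 1\}$, $\sum_i x_{ij} = 1$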
Proposed approach
11. Perception and Intelligence Lab.
+ Bipartite matching
Polynomial complexity.
• E.g., the Hungarian method, time complexity 𝑂(𝑛³)
Inexpensive matching
• In most cases, # of ground truth ≤ a dozen
Thus fast
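To make the matching step concrete, here is a small sketch that finds the optimal assignment by brute force over all one-to-one matchings. Brute force is swapped in for the Hungarian method purely for readability and is only feasible for tiny 𝐾 and 𝑀. The per-edge cost is what remains of the combined loss (𝛼 times the localization term plus the confidence term) after dropping the part that does not depend on the assignment; the one-to-one constraint and all names are assumptions of this example.

```python
import itertools
import math

def edge_cost(box, conf, gt_box, alpha=0.3):
    """Cost of assigning one prediction to one ground-truth box."""
    dist2 = sum((a - b) ** 2 for a, b in zip(box, gt_box))
    return 0.5 * alpha * dist2 - math.log(conf) + math.log(1.0 - conf)

def best_assignment(boxes, confs, gt, alpha=0.3):
    """Return a tuple a where a[j] is the prediction index matched to GT j.

    Tries every way of giving each GT box a distinct prediction, so it
    needs K >= M and is exponential in M (illustration only).
    """
    best, best_cost = None, math.inf
    for perm in itertools.permutations(range(len(boxes)), len(gt)):
        cost = sum(edge_cost(boxes[i], confs[i], gt[j], alpha)
                   for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

pred_boxes = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
pred_confs = [0.6, 0.6, 0.6]
gt_boxes = [[0.5, 0.5, 1.0, 1.0], [0.0, 0.0, 0.5, 0.5]]
match = best_assignment(pred_boxes, pred_confs, gt_boxes)
```

In practice a polynomial-time solver such as the Hungarian method would replace the permutation loop; the cost function stays the same.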
Proposed approach
[Figure: bipartite graph between predictions 1 … 𝐾 and GT boxes 1 … 5]
12. Perception and Intelligence Lab.
+ For example: 5 GT boxes and 𝐾 predictions
Proposed approach
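[Figure: example image with 5 numbered GT boxes and numbered predicted boxes (red). In practice 𝐾 = 100 or 200, so there are many more red boxes. The best-matching prediction is found for each GT box.]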
13. Perception and Intelligence Lab.
+ Optimize network parameters
Via back-propagation (BP)
Compute the first derivatives of the loss w.r.t. 𝑙 and 𝑐 for the BP algorithm
Update the network parameters after evaluating the gradient given 𝑥∗
Train with stochastic gradient descent
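The first derivatives fed into back-propagation can be written out directly. The sketch below assumes the squared-L2 localization loss weighted by 𝛼 and the log-loss confidence term, which give ∂F/∂𝑙𝑖 = 𝛼 Σⱼ 𝑥𝑖𝑗 (𝑙𝑖 − 𝑔𝑗) and ∂F/∂𝑐𝑖 = (𝑐𝑖 − Σⱼ 𝑥𝑖𝑗) / (𝑐𝑖 (1 − 𝑐𝑖)). These expressions are derived here from those assumed loss forms; the function name is hypothetical.

```python
def loss_gradients(x, boxes, confs, gt, alpha=0.3):
    """Gradients of the combined loss w.r.t. box coordinates and confidences.

    x is the fixed optimal assignment (K x M, 0/1 entries).
    Returns (grad_boxes, grad_confs) with the same shapes as boxes and confs.
    """
    grad_boxes, grad_confs = [], []
    for i, (box, c) in enumerate(zip(boxes, confs)):
        g_l = [0.0, 0.0, 0.0, 0.0]
        for j, g in enumerate(gt):
            if x[i][j]:
                # d/dl_i of 0.5 * alpha * ||l_i - g_j||^2
                g_l = [acc + alpha * (b - t) for acc, b, t in zip(g_l, box, g)]
        matched = sum(x[i])
        grad_boxes.append(g_l)
        # d/dc_i of -matched*log(c_i) - (1 - matched)*log(1 - c_i)
        grad_confs.append((c - matched) / (c * (1.0 - c)))
    return grad_boxes, grad_confs

grad_l, grad_c = loss_gradients(
    x=[[1], [0]],
    boxes=[[0.1, 0.1, 0.6, 0.6], [0.0, 0.0, 1.0, 1.0]],
    confs=[0.5, 0.5],
    gt=[[0.0, 0.0, 0.5, 0.5]],
)
```

A matched box is pulled toward its ground truth and its confidence gradient is negative (the score rises under gradient descent); an unmatched box gets no coordinate gradient and its score is pushed down.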
Proposed approach
14. Perception and Intelligence Lab.
+ The procedure so far is sufficient to train the model,
but additional modifications make training faster and the model more accurate
+ Modification
1. Cluster all GT locations 𝑔𝑖 from the training images.
• Find 𝐾 such clusters/centroids (K-means)
– 𝐾 : # of predictions
• Use the centroids as priors for the predicted locations.
• Encourages the network to learn a residual to a prior.
Each prediction learns from its corresponding prior
• 1st prior to 𝑙1 node
• …
• 𝐾-th prior to 𝑙𝐾 node
The 𝑙𝑖 node predicts a box close to the 𝑖-th prior.
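A minimal sketch of this modification, assuming plain Lloyd-style K-means on the 4-d box vectors and a deterministic initialisation from the first 𝐾 boxes (both are choices made for the example, not details taken from the slides). The second function shows the residual idea: the network output is added to the prior to form the final box.

```python
def kmeans_priors(gt_boxes, k, iterations=20):
    """Cluster GT boxes [x1, y1, x2, y2] into k centroids used as priors."""
    # Assumed initialisation: the first k boxes.
    centroids = [list(b) for b in gt_boxes[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in gt_boxes:
            dists = [sum((a - b) ** 2 for a, b in zip(box, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(box)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def apply_residuals(priors, residuals):
    """Prediction i = prior i + residual i predicted by the l_i node."""
    return [[p + r for p, r in zip(prior, res)]
            for prior, res in zip(priors, residuals)]

train_gt = [[0.0, 0.0, 0.2, 0.2], [0.8, 0.8, 1.0, 1.0],
            [0.0, 0.0, 0.4, 0.4], [0.6, 0.6, 1.0, 1.0]]
priors = kmeans_priors(train_gt, k=2)
predictions = apply_residuals(priors, [[0.1, 0.1, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
```

With a zero residual a node simply reproduces its prior, so each output node starts out tied to one typical box location and size.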
Proposed approach
[Figure: each prediction node 1 … 𝐾 learns from its corresponding prior 1 … 𝐾]
15. Perception and Intelligence Lab.
+ Modification
2. Use these priors, instead of the predictions, in the matching process.
• Find the best match between the 𝑲 priors and the GT boxes
• Compute the confidence and localization losses between each GT box
and the prediction whose prior was matched to it
• Call it prior matching
– Hypothesis: it enforces diversification among the predictions
• Without it, convergence is slow and the model quality is low
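Prior matching can be sketched as follows: the assignment is computed between priors and GT boxes, and the loss is then evaluated on the predictions that sit at the matched prior indices. The sketch assumes a squared-distance matching cost and a brute-force search over one-to-one matchings; both are simplifications for illustration, and the function names are hypothetical.

```python
import itertools

def dist2(a, b):
    """Squared L2 distance between two boxes."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def match_priors(priors, gt):
    """Return a tuple a where a[j] is the prior index matched to GT j."""
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(priors)), len(gt)):
        cost = sum(dist2(priors[i], gt[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

def localization_loss_prior_matched(priors, predictions, gt):
    """Localization loss on the predictions whose priors were matched."""
    assignment = match_priors(priors, gt)
    return sum(0.5 * dist2(predictions[i], gt[j])
               for j, i in enumerate(assignment))

priors = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]]
gt_boxes = [[0.4, 0.4, 1.0, 1.0]]
# Prediction 0 happens to sit exactly on the GT box, but its prior is far away.
predictions = [[0.4, 0.4, 1.0, 1.0], [0.5, 0.5, 0.9, 0.9]]
matched = match_priors(priors, gt_boxes)
loss = localization_loss_prior_matched(priors, predictions, gt_boxes)
```

In this example prediction 1 is trained against the GT box because its prior is the closest one, even though prediction 0 currently fits better; that is how the priors keep each output node responsible for its own region.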
Proposed approach
[Figure: priors 1 … 𝐾 are matched to GT boxes 1 … 5 (best match); the training loss is computed on the prediction corresponding to each matched prior, so the prediction is guided by the prior]
16. Perception and Intelligence Lab.
Proposed approach
+ Prediction corresponding to prior
Learn to predict near prior
Prediction guided by prior
17. Perception and Intelligence Lab.
First localize (DeepMultiBox),
+ Predict bounding box locations and associated confidences.
+ Can use the confidence scores and non-maximum suppression (NMS)
to obtain a smaller # of high-confidence boxes.
+ These boxes are supposed to represent objects.
then recognize
+ Can use a subsequent classifier for object detection.
+ Can afford a powerful classifier
because of the small # of boxes
+ The paper uses a second DNN for classification
AlexNet. A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks.
Proposed approach
18. Perception and Intelligence Lab.
+Experiment details
Parallel training
• Faster convergence
Boxes pruned using NMS
• Jaccard (intersection / union) similarity threshold of 0.5
Generate more images from the dataset
• 0-5%, 5-15%, 15-50%, 50-100% of images.
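The NMS pruning step with a Jaccard threshold can be sketched as a greedy loop: visit boxes from highest to lowest score and keep a box only if it does not overlap an already kept box by more than the threshold. This is the standard greedy formulation, offered as an illustration rather than the exact procedure used in the experiments; `top_k` is an added convenience parameter.

```python
def jaccard(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5, top_k=None):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(jaccard(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
            if top_k is not None and len(keep) == top_k:
                break
    return keep

candidates = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.9], [2.0, 2.0, 3.0, 3.0]]
kept = nms(candidates, scores=[0.9, 0.8, 0.7], threshold=0.5)
```

Here the second box overlaps the first with Jaccard similarity 0.9 and is suppressed, while the third box does not overlap anything and survives.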
Experiment results
19. Perception and Intelligence Lab.
+ VOC 2007
The Pascal Visual Object Classes Challenge
20 object classes labeled with bounding boxes
+ Training on VOC 2012. 11000 images
Trained a localizer with K = 100 boxes
Classifier trained on a data set comprising
• 10 million crops overlapping some object
– At least 0.5 Jaccard overlap similarity
• 20 million negative crops
– At most 0.2 Jaccard similarity with any object box.
– Labeled with the “background” class label
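The crop labeling rule above can be sketched as a small function. The 0.5 and 0.2 thresholds come from the slide; giving a positive crop the label of its best-overlapping object and skipping crops that fall between the two thresholds are assumptions made for this example.

```python
def jaccard(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_crop(crop, gt_boxes, gt_labels, pos_thresh=0.5, neg_thresh=0.2):
    """Return a class label, "background", or None for an ambiguous crop."""
    if not gt_boxes:
        return "background"
    overlaps = [jaccard(crop, g) for g in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda j: overlaps[j])
    if overlaps[best] >= pos_thresh:
        return gt_labels[best]      # positive crop: label of best-overlapping object (assumed)
    if overlaps[best] <= neg_thresh:
        return "background"         # negative crop
    return None                     # in-between crops are skipped (assumed)

objects = [[0.0, 0.0, 1.0, 1.0]]
labels = ["cat"]
positive = label_crop([0.0, 0.0, 1.0, 0.8], objects, labels)
negative = label_crop([0.9, 0.9, 2.0, 2.0], objects, labels)
ambiguous = label_crop([0.0, 0.0, 1.0, 0.3], objects, labels)
```

The gap between the two thresholds keeps borderline crops out of both the positive and the negative set.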
Experiment results - VOC 2007
20. Perception and Intelligence Lab.
+ Evaluation
Maximum center square crop
• Resized to network input size 220x220 (AlexNet)
A single pass yields a hundred candidate boxes
Apply NMS and keep the top 10 highest-scoring detections
Each is classified by a 21-way classifier (20 classes + background class)
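The maximum center square crop is simple arithmetic on the image size, sketched below in pixel coordinates. The function names are illustrative, and integer division is used for the offsets, which is an assumption about rounding.

```python
def max_center_square_crop(width, height):
    """Largest centered square inside a width x height image.

    Returns (left, top, right, bottom) in pixels.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

def resize_factor(width, height, input_size=220):
    """Scale applied to the square crop to reach the network input size."""
    return input_size / min(width, height)

landscape = max_center_square_crop(500, 375)
portrait = max_center_square_crop(300, 400)
scale = resize_factor(500, 375)
```

For a 500x375 image the crop is the central 375x375 square, so objects near the left and right borders fall outside the network input.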
Experiment results - VOC 2007
[Figure: the maximum center square crop of an image, resized to the 220x220 input size]
21. Perception and Intelligence Lab.
+ Discussion
Analysis of the localizer in isolation
Additional scales
• 3x3 windows, each of size 60% of the image
Objects localized with a budget of 10 bounding boxes
• Max center: 45.3%
• Max center + 1 scale: 48%
Shows the importance of looking at the image at several resolutions
• Better with high-resolution image crops.
Better than other reported results
• 42% (What is an object? CVPR 2010)
Experiment results - VOC 2007
22. Perception and Intelligence Lab.
+ Discussion
Post-classification
• mAP: 0.29
Quite competitive
• While the running-time complexity is very low
• Uses only the top 10 boxes
Experiment results - VOC 2007
23. Perception and Intelligence Lab.
Experiment results - VOC 2007
Detections obtained with the max-center crop only, i.e. the full image was used.
Nevertheless, relatively small objects are detected, such as boats and sheep.
24. Perception and Intelligence Lab.
+ ILSVRC 2012 Classification with Localization Challenge
Localization model with more heuristic methods
• Inception architecture
Experiment results – ILSVRC 2012
25. Perception and Intelligence Lab.
+ ILSVRC 2012 Classification with Localization Challenge
Localization model with more heuristic methods
• Inception architecture
After classification
Far fewer proposals.
Experiment results – ILSVRC 2012
26. Perception and Intelligence Lab.
+ The MultiBox approach can use transfer learning
To detect objects it was never specifically trained on,
• based on similarities with objects that it has seen.
Figure 5: trained on ImageNet and tested on the VOC test set
• And vice versa
Performed class-agnostic detection.
Experiment results – ILSVRC 2012
27. Perception and Intelligence Lab.
+ The ImageNet-trained model captures more VOC windows
than the VOC-trained model captures ImageNet windows
Hypothesis: the ImageNet class set is much richer
than the VOC class set.
Experiment results – ILSVRC 2012
28. Perception and Intelligence Lab.
+ Three contributions
1. New definition of object detection
• A regression problem to the coordinates of several bounding boxes, plus
a confidence score for how likely each box is to contain an object.
• Traditional approaches score features within predefined boxes.
Contributions
29. Perception and Intelligence Lab.
+ Three contributions
2. A loss function that trains the bounding box predictors
as part of network training
• Solves the assignment problem by utilizing the learning abilities of the DNN
• Trained via back-propagation
Contributions
30. Perception and Intelligence Lab.
+ Three contributions
3. Trains the object box detector in a class-agnostic manner
• A scalable way to detect a large # of object classes.
• With post-classification, achieves competitive detection results.
• The box predictor generalizes over unseen classes
– Flexible enough to be reused for other detection problems.
Contributions
31. Perception and Intelligence Lab.
+ Competing method.
Better detection performance but more computation
OverFeat
• Efficient sliding ConvNet at multiple locations and scales
• Predicts one bounding box per class
• 2 sec/image on GPU.
• 40x slower than a GPU implementation of DeepMultiBox
• SCR, centered crop: closest method to DeepMultiBox
– Scores 40.0% while DeepMultiBox scores 40.94%
• DeepMultiBox extracts multiple regions of interest in one network evaluation.
Discussion and Conclusion
32. Perception and Intelligence Lab.
+ Competing method.
R-CNN using selective search
• Proposes 2000 candidate locations per image
• Extracts top-layer features from a ConvNet
• Uses an SVM trained with hard negatives to classify the locations into VOC classes
• 200x more expensive
Discussion and Conclusion
33. Perception and Intelligence Lab.
+ Current state (localization network and categorization network)
5 – 10 network evaluations
• 1 network for localization and several more for classification
Does not scale linearly with the # of classes to be recognized,
which makes it very competitive with DPM-like approaches.
+ Hope to build localization and recognition into a single network.
Extract both locations and class labels in a single feed-forward pass through the
network.
Discussion and Conclusion
36. Perception and Intelligence Lab.
+ Evaluation
Detection@5
• Produce one box for each of the 5 predicted labels
– Positive when at least one box and its associated label are correct
• A box is correct at Jaccard overlap ≥ 0.5
• Table 2.
– # of windows kept after NMS, ranked by confidence score
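The Detection@5 criterion can be sketched as a single check per image. This is an illustrative reading of the rule above, with (label, box) tuples as an assumed data layout.

```python
def jaccard(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def detection_at_5_hit(predictions, ground_truth, threshold=0.5):
    """True if any of the first 5 (label, box) predictions is correct.

    A prediction is correct when its label equals a ground-truth label and
    its box overlaps that ground-truth box with Jaccard >= threshold.
    """
    return any(p_label == g_label and jaccard(p_box, g_box) >= threshold
               for p_label, p_box in predictions[:5]
               for g_label, g_box in ground_truth)

preds = [("dog", [0.0, 0.0, 1.0, 1.0]), ("cat", [0.0, 0.0, 0.5, 0.5])]
hit = detection_at_5_hit(preds, [("cat", [0.0, 0.0, 0.5, 0.6])])
wrong_label = detection_at_5_hit(preds, [("bird", [0.0, 0.0, 0.5, 0.6])])
wrong_box = detection_at_5_hit(preds, [("cat", [0.6, 0.6, 1.0, 1.0])])
```

Averaging this boolean over a test set gives the fraction of images counted as positive under the criterion.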
Experiment results – ILSVRC 2012
37. Perception and Intelligence Lab.
+ Comparison with a one-box-per-class approach
A re-implementation of the winning entry of the ILSVRC-2012 “classification
with localization” challenge
• SuperVision. Hinton.
– Code not provided…
DeepMultiBox is competitive with 5-10 windows
Two drawbacks of one-box-per-class:
1. Output scales linearly with the # of classes
2. Doesn’t generalize naturally to multiple instances of objects of the same type.
Experiment results – ILSVRC 2012
38. Perception and Intelligence Lab.
2. Doesn’t generalize naturally to multiple objects of the same type.
+ Generalization to such scenarios
is necessary for actual image understanding.
+ DeepMultiBox: a scalable way to do so
+ In Fig. 5, it generally captures more objects, more
accurately, than a single-box method.
Experiment results – ILSVRC 2012
39. Perception and Intelligence Lab.
+ Novel method for localizing objects in an image.
+ Uses a deep CNN as the base feature extraction and learning model.
+ Formulates a multi-box localization cost
Takes advantage of a variable # of GT locations per image
Learns to predict such locations in unseen images.
Discussion and Conclusion
40. Perception and Intelligence Lab.
+ Results on two challenging benchmarks: VOC 2007 & ILSVRC 2012
+ Competitive even when predicting only very few locations,
which are then probed by a subsequent classifier
+ Scalable, and generalizes across the two datasets.
Able to predict locations of interest even for classes it was not trained on.
+ Captures multiple instances of the same class
An important feature for better image understanding.
Discussion and Conclusion
41. Perception and Intelligence Lab.
+ Predicting more windows captures more GT bounding boxes,
but without a comparable increase in mAP on VOC 2007
Hypothesis: a classification model trained with hard-negative mining, and a model that
jointly uses local features, context, and detector confidences, would
take better advantage of the proposed windows.
Discussion and Conclusion