Perception and Intelligence Lab.
Scalable Object Detection
using Deep Neural Networks
Saturday, June 10, 2017
Presenter: Junho Cho
Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov
Google, Inc
CVPR 2014
+ DeepMultiBox
 Scalable object detection using DNN
+ Class-agnostic scalable object detection
 Only bounding boxes: the model is not aware of what the object in each box is.
 Predicts a set of bounding boxes where potential objects are
 Localize, then recognize
+ Boxes are generated by a single DNN
 Outputs
• A fixed number of bounding boxes.
• A score for each box: the confidence that the box contains an object.
Introduction
+ Common paradigm
 Detect objects of one particular class.
 Operate on sub-images and apply detectors exhaustively
• All locations and scales
 Was successful with discriminatively trained DPMs (PAMI 2010)
 Too much computation
 Harder as the number of classes grows
• Trains a separate detector per class
Previous work
+ Model
 Encode the $i$-th object box and its confidence as node values of the last network layer
+ Bounding box
 Upper-left and lower-right coordinates
 Vector: $l_i \in \mathbb{R}^4$ (4 node values)
 Coordinates normalized w.r.t. the image dimensions
 Linear transform of the last hidden layer
+ Confidence
 Confidence score that the box contains an object
 Score: $c_i \in [0, 1]$ (1 node value)
 Linear transform of the last hidden layer followed by a sigmoid
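A minimal sketch of this output encoding in plain Python. The helper name `multibox_head`, the weight layout, and the toy sizes are assumptions made for illustration and are not taken from the paper: each box gets 4 linear outputs, and each confidence gets 1 linear output passed through a sigmoid.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, bias, hidden):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

def multibox_head(hidden, w_loc, b_loc, w_conf, b_conf):
    """Map the last hidden layer to K boxes and K confidences (hypothetical helper)."""
    loc = linear(w_loc, b_loc, hidden)  # 4*K values, no nonlinearity
    conf = [sigmoid(z) for z in linear(w_conf, b_conf, hidden)]  # K values in (0, 1)
    boxes = [loc[4 * i:4 * i + 4] for i in range(len(conf))]     # [x1, y1, x2, y2]
    return boxes, conf

random.seed(0)
K, H = 3, 8  # toy sizes; the slides use K = 100 or 200
hidden = [random.uniform(-1, 1) for _ in range(H)]
w_loc = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(4 * K)]
w_conf = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(K)]
boxes, conf = multibox_head(hidden, w_loc, [0.0] * (4 * K), w_conf, [0.0] * K)
```

With this layout the output layer has 5 values per box, so 5K values in total.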
[Figure: a box with upper-left corner $(x_1, y_1)$ and lower-right corner $(x_2, y_2)$, encoded as $l_i = [x_1, y_1, x_2, y_2]$]
Proposed approach
+ Inference time
 $K$ bounding boxes, with $K = 100$ or $200$
 Bounding box locations $l_i$, $i \in \{1, \dots, K\}$
 Confidences $c_i$, $i \in \{1, \dots, K\}$
[Figure: output layer with 5 nodes per box ($x_1, y_1, x_2, y_2$ for $l_i$, plus $c_i$), for $i = 1, \dots, K$: 500 nodes for $K = 100$, 1000 nodes for $K = 200$]
+ Training objective
 Train the DNN to predict boxes $l_i$ and their confidences $c_i$
• such that the highest-scoring boxes match the ground-truth object boxes well.
+ A training image has $M$ ground-truth (GT) objects, each labeled with a bounding box.
 Bounding boxes: $g_j$, $j \in \{1, \dots, M\}$
 In practice, $K \gg M$
 Optimize only the predictions that best match the ground truth.
+ Formulation as an assignment problem
+ $x$ is the assignment of predicted bounding boxes to GT boxes.
+ $x_{ij} \in \{0, 1\}$, with $i \in \{1, \dots, K\}$ and $j \in \{1, \dots, M\}$
+ $x_{ij} = 1$: the $i$-th prediction is assigned to the $j$-th true object.
+ Localization loss
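As a reference sketch, the localization loss in the paper is the squared L2 distance between each assigned prediction $l_i$ and its GT box $g_j$, summed over the assignment $x$:

```latex
F_{\text{match}}(x, l) = \frac{1}{2} \sum_{i,j} x_{ij} \,\lVert l_i - g_j \rVert_2^2
```

Only pairs with $x_{ij} = 1$ contribute, so unmatched predictions receive no localization gradient.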
[Figure: bipartite graph between predictions $1, \dots, K$ and GT boxes $1, \dots, M$; an edge marks $x_{ij} = 1$]
+ Optimize the confidences of the boxes according to the assignment $x$
+ Confidence loss
+ Term (a)
 For every predicted box $i$ that is assigned to a ground truth $j$,
 $x_{ij} = 1$, so maximize $c_i$
+ Term (b)
 $\sum_j x_{ij} = 1$: prediction $i$ has been matched to a ground truth.
• The term becomes zero
 $\sum_j x_{ij} = 0$: prediction $i$ has not been matched to a ground truth.
• Minimize $c_i$
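As a reference sketch, the confidence loss in the paper is a log loss over the box confidences. The labels (a) and (b) below are placed to agree with the two terms described above; that labeling is an assumption about the slide's annotation.

```latex
F_{\text{conf}}(x, c) =
  \underbrace{-\sum_{i,j} x_{ij} \log c_i}_{\text{(a)}}
  \;\underbrace{-\sum_{i} \Bigl(1 - \sum_{j} x_{ij}\Bigr) \log (1 - c_i)}_{\text{(b)}}
```

Minimizing term (a) pushes $c_i$ toward 1 for matched predictions; minimizing term (b) pushes $c_i$ toward 0 for unmatched ones.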
[Figure: bipartite graph between predictions $1, \dots, K$ and GT boxes $1, \dots, M$ with an edge for $x_{ij} = 1$; the two terms of the confidence loss are labeled (a) and (b)]
+ Final loss objective
 Combination of the localization loss and the confidence loss
+ $\alpha$: balance term
 $\alpha = 0.3$ was used
+ Optimization
 For each training example, solve for the optimal assignment $x^*$
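As a reference sketch, the combined objective in the paper weights the localization loss $F_{\text{match}}$ by $\alpha$ and adds the confidence loss $F_{\text{conf}}$. The constraint requires every GT box to be assigned to exactly one prediction:

```latex
F(x, l, c) = \alpha \, F_{\text{match}}(x, l) + F_{\text{conf}}(x, c)

x^{*} = \arg\min_{x} F(x, l, c)
\quad \text{subject to} \quad
x_{ij} \in \{0, 1\}, \qquad \sum_{i} x_{ij} = 1 \;\; \forall j
```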
+ Bipartite matching
 Polynomial complexity
• E.g. the Hungarian method, time complexity $O(n^3)$
 Matching is inexpensive
• In most cases there are at most a dozen ground-truth boxes
 Thus fast
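A toy sketch of the matching step, under stated assumptions. Expanding the combined loss and dropping terms that do not depend on $x$, the cost of assigning prediction $i$ to ground truth $j$ is $\frac{\alpha}{2}\lVert l_i - g_j\rVert^2 - \log c_i + \log(1 - c_i)$. The code uses exhaustive search over one-to-one assignments as a stand-in for the Hungarian method, which is only feasible for tiny $K$ and $M$; the function names and example boxes are illustrative.

```python
import math
from itertools import permutations

ALPHA = 0.3  # balance term from the slides

def pair_cost(box, conf, gt, alpha=ALPHA):
    """Assignment-dependent part of the loss for one (prediction, GT) pair."""
    sq_dist = sum((p - g) ** 2 for p, g in zip(box, gt))
    return 0.5 * alpha * sq_dist - math.log(conf) + math.log(1.0 - conf)

def best_assignment(boxes, confs, gts):
    """Try every one-to-one assignment of GT boxes to predictions.

    Returns (match, cost), where match[j] is the index of the prediction
    assigned to ground truth j. Exhaustive search, not the Hungarian method.
    """
    best, best_cost = None, float("inf")
    for cand in permutations(range(len(boxes)), len(gts)):
        cost = sum(pair_cost(boxes[i], confs[i], gts[j])
                   for j, i in enumerate(cand))
        if cost < best_cost:
            best, best_cost = list(cand), cost
    return best, best_cost

boxes = [[0.1, 0.1, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9], [0.0, 0.6, 0.3, 1.0]]
confs = [0.6, 0.6, 0.6]
gts = [[0.5, 0.5, 1.0, 1.0], [0.1, 0.1, 0.5, 0.5]]
match, cost = best_assignment(boxes, confs, gts)
print(match)  # index of the prediction matched to each GT box
```

Because all confidences are equal in this example, the match is decided purely by box distance: the first GT box goes to prediction 1 and the second to prediction 0.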
[Figure: bipartite matching between predictions $1, \dots, K$ and GT boxes $1, \dots, 5$]
+ Example: 5 GT boxes and $K$ predictions
[Figure: example image with numbered GT boxes and a few predicted (red) boxes; each GT box is matched to its best prediction. In practice $K = 100$ or $200$, so there are many more red boxes.]
+ Optimize the network parameters
 Via backpropagation (BP)
 BP needs the first derivatives of the loss with respect to $l$ and $c$
 Update the network parameters after evaluating the gradient at the given $x^*$
 Train with stochastic gradient descent
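A worked sketch of those first derivatives. It assumes the paper's losses $F = \alpha F_{\text{match}} + F_{\text{conf}}$, with $F_{\text{match}} = \frac{1}{2}\sum_{i,j} x_{ij}\lVert l_i - g_j\rVert_2^2$ and $F_{\text{conf}} = -\sum_{i,j} x_{ij}\log c_i - \sum_i (1 - \sum_j x_{ij})\log(1 - c_i)$, and differentiates them at the fixed assignment $x^*$. The expressions are derived here from those definitions, not copied from the slides.

```latex
\frac{\partial F}{\partial l_i} = \alpha \sum_{j} x^{*}_{ij} \,(l_i - g_j),
\qquad
\frac{\partial F}{\partial c_i} = \frac{c_i - \sum_{j} x^{*}_{ij}}{c_i \,(1 - c_i)}
```

For a matched prediction ($\sum_j x^*_{ij} = 1$) the confidence gradient is $-1/c_i$, so a descent step raises $c_i$; for an unmatched one it is $1/(1 - c_i)$, so a descent step lowers $c_i$.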
+ The above is sufficient in principle to train the model,
 but additional modifications make training faster and the model more accurate
+ Modification
1. Cluster all training GT locations (all $g_j$ from the training images)
• Find $K$ such clusters/centroids (K-means)
– $K$: number of predictions
• Use them as priors for each of the predicted locations.
• Encourage the network to learn a residual to a prior.
 Each prediction learns from its corresponding prior
• 1st prior to the $l_1$ node
• …
• $K$-th prior to the $l_K$ node
 The $l_i$ node predicts a box close to the $i$-th prior.
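A minimal sketch of building priors by clustering, in plain Python. The K-means details (random initialization from the data, squared L2 distance on the 4 box coordinates, fixed iteration count) and the additive `decode` step are assumptions for illustration; the slides only state that K-means centroids are used as priors and that the network learns a residual to them.

```python
import random

def kmeans_priors(gt_boxes, k, iters=20, seed=0):
    """Cluster GT boxes [x1, y1, x2, y2] into k centroids used as priors."""
    rng = random.Random(seed)
    centroids = [list(b) for b in rng.sample(gt_boxes, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in gt_boxes:
            # Assign each box to its nearest centroid (squared L2 distance).
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(box, centroids[c])))
            clusters[nearest].append(box)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def decode(prior, residual):
    """Predicted box = prior + learned residual (one reading of 'residual to a prior')."""
    return [p + r for p, r in zip(prior, residual)]

gt_boxes = [[0.1, 0.1, 0.3, 0.3], [0.12, 0.08, 0.32, 0.3],
            [0.6, 0.6, 0.9, 0.9], [0.58, 0.62, 0.92, 0.88]]
priors = kmeans_priors(gt_boxes, k=2)
box = decode(priors[0], [0.01, -0.01, 0.0, 0.02])
```

In this toy data the two centroids settle on the mean of the two small top-left boxes and the mean of the two large bottom-right boxes.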
[Figure: priors $1, \dots, K$ paired one-to-one with predictions $1, \dots, K$; each prediction learns from its prior]
+ Modification
2. Use these priors in the matching process instead of the predictions.
• Find the best match between the $K$ priors and the GT boxes
• Compute the confidence and localization losses between
the GT boxes and the coordinates of the predictions tied to the matched priors
• Call this prior matching
– Hypothesis: it enforces diversification among the predictions
• Without it: slow convergence and a lower-quality model
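A toy sketch of prior matching. The matching cost (squared L2 distance between prior and GT coordinates) and the exhaustive search are assumptions for illustration; the point shown is that the assignment is computed from the fixed priors, while the loss is evaluated on the predictions that share an index with the matched priors.

```python
from itertools import permutations

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def prior_matching(priors, gts):
    """Match each GT box to a distinct prior by exhaustive search (toy sizes only)."""
    best, best_cost = None, float("inf")
    for cand in permutations(range(len(priors)), len(gts)):
        cost = sum(sq_dist(priors[i], gts[j]) for j, i in enumerate(cand))
        if cost < best_cost:
            best, best_cost = list(cand), cost
    return best

def localization_loss(preds, gts, match):
    """0.5 * squared L2 distance between each GT and the prediction tied to its matched prior."""
    return 0.5 * sum(sq_dist(preds[i], gts[j]) for j, i in enumerate(match))

priors = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0], [0.25, 0.25, 0.75, 0.75]]
preds = [[0.1, 0.0, 0.5, 0.6], [0.4, 0.5, 1.0, 0.9], [0.2, 0.3, 0.7, 0.8]]
gts = [[0.45, 0.5, 0.95, 1.0]]
match = prior_matching(priors, gts)          # uses the priors, not the predictions
loss = localization_loss(preds, gts, match)  # evaluated on the matched prediction
```

Since the priors do not change during training, the same output node is consistently responsible for the same region of box space.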
[Figure: priors $1, \dots, K$ are matched to GT boxes $1, \dots, 5$; the prediction corresponding to each matched prior receives the training loss, so predictions are guided by priors]
+ Prediction corresponding to prior
 Learn to predict near prior
 Prediction guided by prior
First localize (DeepMultiBox)
+ Predict bounding box locations and associated confidences.
+ Use the confidence scores and non-maximum suppression (NMS)
 to obtain a smaller number of high-confidence boxes.
+ The boxes are supposed to represent objects.
Then recognize
+ A subsequent classifier can be applied for object detection.
+ A powerful classifier is affordable
 because the number of boxes is small
+ The paper uses a second DNN for classification
 AlexNet: A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks.
+ Experiment details
 Parallel training
• Faster convergence
 Boxes pruned using NMS
• Jaccard similarity (intersection / union) threshold of 0.5
 Generate more images from the dataset
• 0-5%, 5-15%, 15-50%, 50-100% of images.
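A minimal sketch of NMS pruning with a Jaccard threshold of 0.5, in plain Python. The greedy highest-score-first strategy is the common formulation and is assumed here; the slides do not spell out the exact variant used.

```python
def jaccard(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: visit boxes in descending score order, drop near-duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(jaccard(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep

boxes = [[0.0, 0.0, 0.5, 0.5], [0.05, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

Here the second box overlaps the first with Jaccard similarity 0.9 and is suppressed, while the third box does not overlap and is kept.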
Experiment results
+ VOC 2007
 The Pascal Visual Object Classes Challenge
 20 object classes labeled with bounding boxes
+ Training on VOC 2012 (11000 images)
 Trained a $K = 100$ box localizer
 Classifier trained on a data set comprising
• 10 million crops overlapping some object
– At least 0.5 Jaccard overlap similarity
• 20 million negative crops
– At most 0.2 Jaccard similarity with any object box
– Labeled with the “background” class label
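A small sketch of how crops could be labeled with these two thresholds. The function `label_crop`, the choice to give a positive crop the class of its best-overlapping object, and the decision to skip crops that fall between the thresholds are assumptions for illustration.

```python
def jaccard(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_crop(crop, objects, pos_thresh=0.5, neg_thresh=0.2):
    """Return a class name, "background", or None for an ambiguous crop.

    objects is a list of (box, class_name) pairs for one image.
    """
    best_iou, best_cls = 0.0, None
    for box, cls in objects:
        iou = jaccard(crop, box)
        if iou > best_iou:
            best_iou, best_cls = iou, cls
    if best_iou >= pos_thresh:
        return best_cls          # positive crop
    if best_iou <= neg_thresh:
        return "background"      # negative crop
    return None                  # between thresholds: skipped in this sketch

objects = [([0.0, 0.0, 0.5, 0.5], "dog")]
labels = [label_crop(c, objects) for c in (
    [0.0, 0.0, 0.5, 0.5],    # identical to the object box
    [0.6, 0.6, 1.0, 1.0],    # no overlap
    [0.25, 0.0, 0.75, 0.5],  # Jaccard = 1/3, between the thresholds
)]
```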
Experiment results - VOC 2007
+ Evaluation
 Maximum center square crop
• Resized to network input size 220x220 (AlexNet)
 Single pass, hundred candidate boxes
 Apply NMS, top 10 highest scoring detections
 Classified by 21-way classifier (20 classes + background class)
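A short sketch of the maximum center square crop, assuming pixel coordinates with the origin at the top-left. Both helper names are hypothetical; the second maps a box predicted in normalized crop coordinates back into the full image.

```python
def max_center_square_crop(width, height):
    """Largest centered square in a width x height image: (left, top, right, bottom)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

def crop_box_to_image(box, crop):
    """Map a box normalized to the crop back to pixel coordinates in the full image."""
    left, top, right, _bottom = crop
    side = right - left
    x1, y1, x2, y2 = box
    return (left + x1 * side, top + y1 * side, left + x2 * side, top + y2 * side)

crop = max_center_square_crop(500, 375)  # a typical VOC-sized image
full = crop_box_to_image((0.0, 0.0, 1.0, 1.0), crop)
```

The crop would then be resized to the 220x220 network input before the single forward pass.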
[Figure: maximum center square crop of the image, resized to the 220x220 network input]
+ Discussion
 Analysis of the localizer in isolation
 Additional scales
• 3x3 windows, each 60% of the image size
 Objects localized with 10 bounding boxes
• Max center: 45.3%
• Max center + 1 scale: 48%
 Shows the importance of looking at the image at several resolutions
• Better with high-resolution image crops.
 Better than other reported results
• 42% ("What is an object?", CVPR 2010)
+ Discussion
 Post-classification
• mAP: 0.29
 Quite competitive
• given the very low running-time complexity
• Uses only the top 10 boxes
[Table: VOC 2007 detection results, compared against DPM baselines]
Max-center crop: full image used, but small objects such as boats and sheep are detectable.
+ ILSVRC 2012 Classification with Localization Challenge
 Localization model with more heuristic methods
• Inception architecture
Experiment results – ILSVRC 2012
 After classification
 Far fewer proposals.
+ The MultiBox approach can use transfer learning
 to detect objects it was never specifically trained on,
• based on similarities with objects it has seen.
 Figure 5: trained on ImageNet and tested on the VOC test set
• And vice versa
 Performs class-agnostic detection.
+ The ImageNet-trained model captures more VOC windows
 compared to the reverse direction
 Hypothesis: the ImageNet class set is much richer
than the VOC class set.
+ Three contributions
1. New definition of object detection
• A regression problem to the coordinates of several bounding boxes, together with
a confidence score for how likely each box is to contain an object.
• Traditionally, features are scored within predefined boxes.
Contributions
2. A loss function that trains the bounding box predictors
as part of network training
• Solves an assignment problem and utilizes the learning abilities of the DNN
• Trained via backpropagation
3. Train the object box predictor in a class-agnostic manner
• A scalable way to detect a large number of object classes.
• With post-classification, achieves competitive detection results.
• The box predictor generalizes to unseen classes
– Flexible enough to be reused for other detection problems.
+ Competing methods
 Better detection performance, but more computation
 OverFeat
• Efficient sliding ConvNet at multiple locations and scales
• Predicts one bounding box per class
• 2 sec/image on GPU
• 40x slower than the GPU implementation of DeepMultiBox
• SCR (centered crop): the closest method to DeepMultiBox
– Scores 40.0%, while DeepMultiBox scores 40.94%
• DeepMultiBox extracts multiple regions of interest in one network evaluation.
Discussion and Conclusion
+ Competing methods
 R-CNN using selective search
• Proposes 2000 candidate locations per image
• Extracts top-layer features from a ConvNet
• Uses an SVM trained with hard negatives to classify the locations into VOC classes
• 200x more expensive
+ Current state (localization network and categorization network)
 5 – 10 network evaluations
• 1 network for localization and several more for classification
 Cost does not scale linearly with the number of classes to be recognized,
 which makes it very competitive with DPM-like approaches.
+ Hope to build localization and recognition into a single network.
 Extract both locations and class labels in a single feed-forward pass through the network.
Thank you
+ AlexNet (NIPS 2012)
 Convolution – pooling – ReLU – normalization = 1 convolutional layer
 5 convolutional layers
 2 fully-connected hidden layers
+ Evaluation
 Detection@5
• Produce one box for each of the 5 labels
– Positive when at least one box and its associated label are correct
• Correct box: Jaccard overlap of at least 0.5
• Table 2
– Number of windows kept after NMS, ranked by confidence score
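A minimal sketch of the Detection@5 criterion as described above. The data layout (lists of `(label, box)` pairs) and the function name are assumptions for illustration.

```python
def jaccard(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_at_5(predictions, gt_objects, min_overlap=0.5):
    """True if any of the first 5 (label, box) predictions has the right label
    and overlaps a GT box of that label with Jaccard >= min_overlap."""
    for label, box in predictions[:5]:
        for gt_label, gt_box in gt_objects:
            if label == gt_label and jaccard(box, gt_box) >= min_overlap:
                return True
    return False

gt = [("dog", [0.1, 0.1, 0.6, 0.6])]
hit = detection_at_5([("cat", [0.1, 0.1, 0.6, 0.6]),
                      ("dog", [0.15, 0.1, 0.6, 0.6])], gt)
miss = detection_at_5([("dog", [0.5, 0.5, 1.0, 1.0])], gt)
```

The first call succeeds on its second prediction (right label, Jaccard 0.9); the second call has the right label but too little overlap.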
+ Comparison with one-box-per-class
 Re-implementation of the winning entry of the ILSVRC-2012 “classification
with localization” challenge
• SuperVision (Hinton)
– Code not provided…
 DeepMultiBox is competitive with 5-10 windows
 Two drawbacks of one-box-per-class:
1. Output scales linearly with the number of classes
2. Doesn’t generalize naturally to multiple instances of objects of the same type.
+ Generalization to multiple instances of the same type is necessary for actual image understanding.
+ DeepMultiBox handles this in a scalable way
+ In Fig. 5, it generally captures more objects, more accurately, than a single-box method.
+ Novel method for localizing objects in an image.
+ Uses a deep CNN as the base feature extractor and learning model.
+ Formulates a multi-box localization cost
 that takes advantage of a variable number of GT locations per image
 and learns to predict such locations in unseen images.
+ Results on challenging benchmarks: VOC 2007 and ILSVRC 2012
+ Works well while predicting only very few locations
 to be probed by a subsequent classifier
+ Scalable, and generalizes across the two datasets.
 Able to predict locations of interest even for classes it was not trained on.
+ Captures multiple instances of the same class
 An important feature for better image understanding.
+ Predicting more windows captures more GT bounding boxes,
 but gives no comparable increase in mAP on VOC 2007
 Hypothesis: the classification model would work better with hard-negative mining, and would learn
to jointly model local features, context, and detector confidences so as to
take advantage of the proposed windows.