Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcelona 2020

Object Detection
Computer Vision 2
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Spring 2020

Acknowledgements
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
[UPC TelecomBCN 2016] [UPC TelecomBCN 2017]

Acknowledgements
[UPC TelecomBCN 2018]
Míriam Bellver
miriam.bellver@bsc.edu
PhD Candidate
Barcelona Supercomputing Center
Andreu Girbau
andreu.girbau@upc.edu
PhD Candidate
AutomaticTV
[UPC TelecomBCN 2019]

Outline
1. Motivation
2. Datasets
3. Evaluation
4. Neural Architectures
   a. Two-stage
   b. Single-stage
5. Software implementations

Recap
Figure from Charles Ollion - Olivier Grisel

Object Detection
CAT, DOG, DUCK
The task of assigning a label and a bounding box to every object in the image:
1. We don’t know the number of objects in advance.
2. Object detection relies on object proposal and object classification.

Object Detection as Classification
Classes = [cat, dog, duck]
Each candidate window is classified independently:
Window 1: Cat? NO   Dog? NO   Duck? NO
Window 2: Cat? NO   Dog? NO   Duck? NO
Window 3: Cat? YES  Dog? NO   Duck? NO
Window 4: Cat? NO   Dog? NO   Duck? NO

Challenge:
A very large number of possibilities:
● position
● scale
● aspect ratio
Question: Do you think it is feasible to evaluate all possibilities?

Solution: If your classifier is fast enough, go for it.

Object Detection with ConvNets?
ConvNets are computationally demanding. We can’t test all positions and scales!
Solution: Look at a tiny subset of positions. Choose them wisely :)
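
To get a feeling for how many candidate boxes an exhaustive search over position, scale and aspect ratio produces, here is a minimal sketch. The function name, the image size, the scales and the aspect ratios are illustrative assumptions, not values from the lecture.

```python
def count_windows(image_w, image_h, scales, aspect_ratios, stride=1):
    """Count the candidate boxes of an exhaustive sliding-window search.

    scales: box sizes in pixels (a box has an area of roughly scale * scale).
    aspect_ratios: width / height ratios.
    All parameter values used below are hypothetical.
    """
    total = 0
    for scale in scales:
        for ratio in aspect_ratios:
            box_w = round(scale * ratio ** 0.5)
            box_h = round(scale / ratio ** 0.5)
            if box_w > image_w or box_h > image_h:
                continue  # this box shape does not fit in the image
            positions_x = (image_w - box_w) // stride + 1
            positions_y = (image_h - box_h) // stride + 1
            total += positions_x * positions_y
    return total


if __name__ == "__main__":
    n = count_windows(800, 600, scales=[64, 128, 256], aspect_ratios=[0.5, 1.0, 2.0])
    print("candidate windows:", n)
```

Even with only three scales and three aspect ratios, the count grows with the product of the image width and height, which is why running a ConvNet on every window is impractical.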

Classic Datasets
PASCAL: 20 categories; 6k training images, 6k validation images, 10k test images
ILSVRC: 200 categories; 456k training images, 60k validation + test images
COCO: 80 categories; 60k validation + test images

Open Images Dataset
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., ... & Ferrari, V. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV 2020. [dataset]

Open Images Dataset v6

Images with a large number of different classes annotated (11 on the left, 7 on the right).

Evaluation metrics: Intersection over Union (IoU)
● Also known as the Jaccard index
● Size of the intersection divided by the size of the union
● Evaluates localization
Figure: PyImageSearch
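
The IoU definition above can be sketched in a few lines. This is a minimal example that assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name and the box format are illustrative choices, not part of the lecture.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clipped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes sharing a 1x1 corner: IoU = 1 / (4 + 4 - 1)
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```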

Metric: Average Precision (AP) for Object Detection
Consider the case in which your object detection algorithm provides:
● Coordinates for each bounding box.
● A confidence score for each bounding box.
[Figure: three predicted boxes with confidence scores 0.7, 0.9 and 0.5]

Rank your predictions by the confidence score given by your object detection algorithm:
[Figure: the predictions ranked #1 (0.9), #2 (0.7) and #3 (0.5)]

Set a criterion to identify whether your predictions are correct.
Typically, this is a minimum IoU with respect to the bounding boxes from the ground truth (GT) annotation.
○ For example, IoU > 0.5. This is referred to as AP0.5.
○ Other popular options: AP0.75, or a range of IoU thresholds [0.5:0.95] in 0.05 steps.
○ Each GT box can only be assigned to one predicted box.
[Figure: ground-truth boxes and the three ranked predictions (#1: 0.9, #2: 0.7, #3: 0.5), each marked as True Positive (TP) or False Positive (FP)]
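
The criterion above can be sketched as a greedy matching: predictions are visited in order of decreasing confidence, and each one claims the still-unassigned GT box it overlaps most, provided the IoU exceeds the threshold. This is only one possible matching rule (benchmarks differ in the details), and the box format, function names and toy coordinates are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, iou_thr=0.5):
    """Return True (TP) / False (FP) flags in ranked order.

    detections: list of (confidence, box) pairs, in any order.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = set()  # indices of GT boxes already assigned to a prediction
    flags = []
    for _, box in ranked:
        best_iou, best_gt = 0.0, None
        for gt_idx, gt in enumerate(gt_boxes):
            if gt_idx in matched:
                continue  # each GT box can only be assigned once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, gt_idx
        if best_gt is not None and best_iou > iou_thr:
            matched.add(best_gt)
            flags.append(True)
        else:
            flags.append(False)
    return flags


# Toy data (hypothetical coordinates): 4 GT objects, 3 predictions
gt = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10), (60, 0, 70, 10)]
preds = [(0.7, (100, 100, 110, 110)), (0.9, (0, 0, 10, 9)), (0.5, (21, 0, 30, 10))]
flags = label_detections(preds, gt)
print(flags)  # [True, False, True]
```

The toy data is chosen so that the ranked predictions come out as correct, incorrect, correct, like the three-detection example used in these slides.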

Compute the points of the Precision-Recall curve by using the confidence scores of the ranked detections as decision thresholds (Thr). In this example there are 4 ground-truth objects.
Rank  Correct?  Threshold  Precision  Recall
1     True      0.9        1/1        1/4
2     False     0.7        1/2        1/4
3     True      0.5        2/3        2/4
Incorrect detections count as False Positives (FP); ground-truth objects without a matching detection count as False Negatives (FN).
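
The precision and recall values above can be reproduced with a short sketch. It assumes the detections are already ranked and labelled as correct or incorrect, and that the number of ground-truth objects is known (4 here, as implied by the recall denominators); the function name is illustrative.

```python
from fractions import Fraction


def pr_points(correct_flags, num_gt):
    """(precision, recall) after each ranked detection.

    correct_flags: True/False per detection, sorted by decreasing confidence.
    num_gt: number of ground-truth objects.
    """
    points = []
    true_positives = 0
    for rank, is_correct in enumerate(correct_flags, start=1):
        true_positives += int(is_correct)
        precision = Fraction(true_positives, rank)    # TP / detections so far
        recall = Fraction(true_positives, num_gt)     # TP / all GT objects
        points.append((precision, recall))
    return points


points = pr_points([True, False, True], num_gt=4)
for precision, recall in points:
    print(precision, recall)
```

Exact fractions are used so that the printed values can be compared directly with the table (2/4 is shown in reduced form as 1/2).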

In the object detection case, some GT objects may never be matched by any prediction. We may consider that trying to find the missing objects with an infinite number of object proposals would drop precision to 0.0, but would eventually find all objects, so recall would be 1.0.
Rank  Correct?  Threshold  Precision  Recall
1     True      0.9        1/1        1/4
2     False     0.7        1/2        1/4
3     True      0.5        2/3        2/4
∞     True(s)   0.0        ≈ 0        1
Table inspired by: Jonathan Hui, “mAP (mean Average Precision) for Object Detection” (Medium 2018)

[Figure: Precision-Recall plot of the points (recall, precision) = (1/4, 1/1), (1/4, 1/2), (2/4, 2/3) and (1, ≈ 0); x axis: Recall, y axis: Precision, both from 0 to 1.0]

“The precision at each recall level r is interpolated by taking the maximum precision (...) for which the corresponding recall exceeds r.” (from Pascal VOC) [ref]
[ref] Everingham, Mark, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) challenge." IJCV 2010.
[Figure: interpolated Precision-Recall curve; x axis: Recall, y axis: Precision, both from 0 to 1.0]

Actually, not all PR pairs need to be computed, because AP for object detection only requires the PR pairs related to true positives (ranks 1, 3 and ∞ in this example):
Rank  Correct?  Threshold  Precision  Recall
1     True      0.9        1/1        1/4
2     False     0.7        1/2        1/4
3     True      0.5        2/3        2/4
∞     True(s)   0.0        ≈ 0        1

● The AP metric approximates the area under the PR curve.
● There are different methods for this approximation, which may cause inconsistencies between implementations.
● Popular ones:
○ (suggested) “the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]” [ref]
○ “weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight” (scikit-learn).
[ref] Everingham, Mark, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. "The Pascal Visual Object Classes (VOC) challenge." IJCV 2010.

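In our work, we adopt the approach from Pascal VOC:
● AP is “the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]”
PR pairs of the true positives:
Threshold  Precision  Recall
0.9        1/1        1/4
0.5        2/3        2/4
0.0        ≈ 0        1
Interpolated precision at each recall level:
Recall  Precision
0.0     1.00
0.1     1.00
0.2     1.00
0.3     0.67
0.4     0.67
0.5     0.00
...     0.00
1.0     0.00
AP      0.39
[Figure: interpolated Precision-Recall curve; x axis: Recall, y axis: Precision, both from 0 to 1.0]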
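
The 11-point computation above can be checked with a minimal sketch. The `strict` flag is an addition made for illustration: with `strict=True` only PR pairs whose recall strictly exceeds the level are considered, which reproduces the 0.39 of this example; with `strict=False` pairs whose recall equals the level also count (a variant found in many implementations), which gives a different value. This is exactly the kind of inconsistency between implementations mentioned earlier. The precision of the final pair is taken as exactly 0.

```python
from fractions import Fraction


def ap_11_point(pr_pairs, strict=True):
    """Mean interpolated precision at recall levels 0, 0.1, ..., 1.

    pr_pairs: list of (precision, recall) pairs.
    strict:   if True, use pairs with recall > level; otherwise recall >= level.
    """
    levels = [Fraction(i, 10) for i in range(11)]
    total = Fraction(0)
    for level in levels:
        if strict:
            candidates = [p for p, rec in pr_pairs if rec > level]
        else:
            candidates = [p for p, rec in pr_pairs if rec >= level]
        # Interpolated precision is 0 when no pair reaches this recall level
        total += max(candidates, default=Fraction(0))
    return total / len(levels)


# PR pairs of the true positives in the running example
pairs = [
    (Fraction(1, 1), Fraction(1, 4)),
    (Fraction(2, 3), Fraction(2, 4)),
    (Fraction(0), Fraction(1)),
]
ap_strict = ap_11_point(pairs, strict=True)
ap_inclusive = ap_11_point(pairs, strict=False)
print(round(float(ap_strict), 2))     # 0.39
print(round(float(ap_inclusive), 2))  # 0.45
```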

Metric: Average Precision w/o confidence scores
What if your object detection algorithm does not provide any confidence score?
[Figure: three predictions (#1, #2, #3) without confidence scores]

Metric: Average Precision w/o confidence scores
If your object detection algorithm does not provide any confidence score:
● Generate N random rankings of the predictions (e.g. N=10) and compute the AP of each one (AP1, AP2, ..., APN).
● Average the obtained APs to get the final AP.
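
A minimal sketch of this random-ranking procedure follows. It assumes each prediction has already been labelled as correct or incorrect, and it uses an 11-point AP in which a PR pair counts for a recall level when its recall is greater than or equal to that level; both the helper names and that choice are assumptions made for illustration.

```python
import random
from fractions import Fraction


def ap_11_point(correct_flags, num_gt):
    """11-point AP from ranked True/False flags (uses recall >= level)."""
    pairs = []
    true_positives = 0
    for rank, is_correct in enumerate(correct_flags, start=1):
        if is_correct:
            true_positives += 1
            pairs.append((Fraction(true_positives, rank), Fraction(true_positives, num_gt)))
    total = Fraction(0)
    for i in range(11):
        level = Fraction(i, 10)
        total += max((p for p, rec in pairs if rec >= level), default=Fraction(0))
    return total / 11


def ap_without_scores(correct_flags, num_gt, n_runs=10, seed=0):
    """Average the AP over n_runs random rankings of the predictions."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    aps = []
    for _ in range(n_runs):
        shuffled = list(correct_flags)
        rng.shuffle(shuffled)
        aps.append(ap_11_point(shuffled, num_gt))
    return float(sum(aps) / n_runs)


if __name__ == "__main__":
    print(ap_without_scores([True, False, True], num_gt=4))
```

When all predictions are correct (or all are wrong), every ranking gives the same AP, so the average only matters for mixed cases.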

Evaluation metrics: mean Average Precision (mAP)
When there are Q classes (e.g. car, bike, person…), the mAP averages the AP(q) of each class q:
\mathrm{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q)
● Further readings:
○ Tarang Sangh, “Measuring Object Detection models — mAP — What is Mean Average Precision?” (Medium
2018)

Evaluation metrics: Average Precision (AP)
You can obtain implementations of Average Precision for object detection from:
● TensorFlow
● Microsoft COCO dataset API

Object Detection
There are two main families:
● Two-stage: region proposal followed by classification
● Single-stage: a grid over the image in which each cell is a proposal

Region Proposals
● Find “blobby” image regions that are likely to contain objects
● “Class-agnostic” object detector
Slide Credit: CS231n

Region Proposals
Typical object detection/segmentation pipelines:
Object proposal → Refinement and Classification
[Figure: example detections with scores: Dog 0.85, Cat 0.80, Dog 0.75, Cat 0.90]

Region Proposals
Typical object detection/segmentation pipelines:
Object proposal → Refinement and Classification → NMS: Non-Maximum Suppression

Region Proposals: from pixels
#SS Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. Selective search for object recognition. IJCV 2013.

Region Proposals: from pixels
#MCG Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. Multiscale combinatorial grouping for image segmentation and object proposal generation. TPAMI 2016.

R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

R-CNN
[Figure: “We expect” vs. “We get”]
Solution: Non-Maximum Suppression + score threshold

R-CNN + Non Maximum Suppression (NMS)
#DPM Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object detection with discriminatively
trained part-based models. TPAMI 2009.
Figure: Adrian Rosebrock
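
Non-Maximum Suppression combined with a score threshold can be sketched as a greedy loop: keep the highest-scoring box, discard the remaining boxes that overlap it too much, and repeat. The sketch below is class-agnostic and would typically be applied once per class; the function names, the box format (x_min, y_min, x_max, y_max) and the default thresholds are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thr=0.5, score_thr=0.0):
    """Greedy NMS over a list of (score, box); returns the kept detections."""
    # Drop low-scoring boxes, then visit the rest from best to worst
    candidates = sorted(
        (d for d in detections if d[0] > score_thr),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    for score, box in candidates:
        # Keep a box only if it does not overlap any already-kept box too much
        if all(iou(box, kept_box) <= iou_thr for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections: the first two overlap heavily, the third is far away
dets = [(0.80, (1, 1, 11, 11)), (0.90, (0, 0, 10, 10)), (0.75, (50, 50, 60, 60))]
kept = nms(dets)
print(kept)
```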

R-CNN: Problems
1. Slow at test time: need to run a full forward pass of the CNN for each region proposal
2. SVMs and regressors are post-hoc: CNN features are not updated in response to SVMs and regressors
3. Complex training

Fast R-CNN
R-CNN Problem #1: Slow at test time: need to run a full forward pass of the CNN for each region proposal
Solution: Share the computation of the convolutional layers between the region proposals of an image
Girshick. Fast R-CNN. ICCV 2015

Fast R-CNN
R-CNN Problems #2 & #3: SVMs and regressors are post-hoc. Complex training.
Solution: Train it all together, end to end
- Softmax over (K+1) classes and 4 box offsets
- Positive boxes are the ones with a larger Intersection over Union with the ground truth
Girshick. Fast R-CNN. ICCV 2015

Fast R-CNN: RoI-Pooling
1. Hi-res input image: 3 x 800 x 600, with region proposal
2. Convolution and pooling produce hi-res conv features: C x H x W, with region proposal (variable size)
3. Max-pool within each grid cell to obtain RoI conv features: C x h x w for the region proposal (fixed size)
4. Fully-connected layers expect low-res conv features: C x h x w
Slide Credit: CS231n. Girshick. Fast R-CNN. ICCV 2015

Fast R-CNN: RoI-Pooling
RoI pooling allows 1) propagating the gradient only through the regions of interest, and 2) efficient computation.
Input: convolutional feature map + N regions of interest
Output: tensor of N x 7 x 7 x depth features
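
To make the operation concrete, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map: the region is split into a fixed grid and the maximum of each grid cell is kept, so regions of any size yield an output of the same size. The function name, the RoI format (x_min, y_min, x_max, y_max in feature-map cells, maximum exclusive) and the way cell borders are rounded are assumptions; real implementations operate on all channels and all N regions at once.

```python
def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one region of a 2D feature map into an out_h x out_w grid.

    feature_map: list of rows (H x W) for a single channel.
    roi: (x_min, y_min, x_max, y_max) in feature-map cells, max exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Rows covered by grid row i: floor for the start, ceil for the end
        r_start = y0 + (i * roi_h) // out_h
        r_end = max(y0 + ((i + 1) * roi_h + out_h - 1) // out_h, r_start + 1)
        row = []
        for j in range(out_w):
            c_start = x0 + (j * roi_w) // out_w
            c_end = max(x0 + ((j + 1) * roi_w + out_w - 1) // out_w, c_start + 1)
            row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


# 4x4 feature map with values 0..15
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), out_h=2, out_w=2))  # [[5, 7], [13, 15]]
print(roi_max_pool(fmap, (1, 1, 3, 3), out_h=2, out_w=2))  # [[5, 6], [9, 10]]
```

Both regions produce a 2 x 2 output here even though they have different sizes, which is what lets the fully-connected layers receive a fixed-size input.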

Fast R-CNN
                      R-CNN       Fast R-CNN
Training time         84 hours    9.5 hours      Faster!
(Speedup)             1x          8.8x
Test time per image   47 seconds  0.32 seconds   FASTER!
(Speedup)             1x          146x
mAP (VOC 2007)        66.0        66.9           Better!
Using VGG-16 CNN on the Pascal VOC 2007 dataset

Fast R-CNN: Limitation
Test-time speeds do not include region proposals:
                                            R-CNN       Fast R-CNN
Test time per image                         47 seconds  0.32 seconds
(Speedup)                                   1x          146x
Test time per image with Selective Search   50 seconds  2 seconds
(Speedup)                                   1x          25x
Slide Credit: CS231n

Faster R-CNN
Learn proposals end-to-end, sharing parameters with the classification network.
[Figure: architecture with conv layers (up to Conv5_3), a Region Proposal Network producing RPN proposals, RoI pooling, and the Fast R-CNN layers FC6, FC7 and FC8 producing class probabilities]
#Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.

Faster R-CNN
Learn proposals end-to-end, sharing parameters with the classification network.
This network is called the Region Proposal Network (RPN), and the proposals are learnt!
#Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.

Faster R-CNN
Faster R-CNN replaces Selective Search (SS) with the Region Proposal Network (RPN), which is trained jointly.

Region Proposal Network (RPN)
For each of the k anchor boxes at every location, the RPN predicts:
● Objectness scores (object / no object)
● Bounding box regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
#Faster R-CNN Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
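
The k = 9 boxes come from combining 3 scales with 3 aspect ratios at each location. A minimal sketch of generating them for one location follows; the function name, the (x_min, y_min, x_max, y_max) format and the default scale and ratio values are assumptions for illustration and should be checked against the paper or the implementation you use.

```python
def make_anchors(center_x, center_y, scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(aspect_ratios) boxes centred on one location.

    Each box has an area of scale * scale; aspect ratio = width / height.
    Default values are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale * ratio ** 0.5
            height = scale / ratio ** 0.5
            anchors.append((
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ))
    return anchors


anchors = make_anchors(0, 0)
print(len(anchors))  # 9
```

The RPN then predicts, for every such box, an objectness score and a set of regression offsets that refine it.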

Faster R-CNN
                                       R-CNN       Fast R-CNN  Faster R-CNN
Test time per image (with proposals)   50 seconds  2 seconds   0.2 seconds
(Speedup)                              1x          25x         250x
mAP (VOC 2007)                         66.0        66.9        66.9
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015

Mask R-CNN: Object Detection + Instance Segmentation
He et al. Mask R-CNN. ICCV 2017

Next lecture: Instance & Image Segmentation
Carles Ventura
Source: Detectron2

Two-stage vs Single-stage methods
Two-stage pipeline: image pixels → object proposal generation → resample pixels / features for each BBOX → high-quality classifier
Faster R-CNN: 7 FPS
Computationally too intensive and too slow for real-time applications.

Two-stage vs Single-stage methods
Instead of having two networks (Region Proposal Network + Classifier Network), one-stage architectures predict bounding boxes and confidences for multiple categories directly with a single network.

One-stage methods
Previously…
Problem: Too many positions & scales to test

OverFeat
#OverFeat Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. "Overfeat: Integrated recognition, localization and detection using convolutional networks." ICLR 2014

One-stage methods
Previously…
Problem: Too many positions & scales to test
Solution: If your classifier is fast enough, go for it

One-stage methods
Modern detectors parallelize feature extraction across all locations.
Region classification is not slow anymore!

YOLO: You Only Look Once
Proposal-free object detection pipeline
S x S grid on input
For each cell of the S x S grid, predict:
● B boxes with confidence scores C (5 x B values) + class probabilities c
#YOLO Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

[Figure: S x S grid on input → bounding boxes + confidence and class probability map → final detections]
Final detections: Cj * prob(c) > threshold
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

Each cell predicts:
- For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value
- Some number of class probabilities
For Pascal VOC:
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
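
The output size above, and the rule that a final detection requires box confidence times class probability to exceed a threshold, can be sketched as follows. The function names, the per-cell data layout and the threshold value are assumptions made for illustration; they do not describe the memory layout of any particular YOLO implementation.

```python
def yolo_output_size(grid=7, boxes_per_cell=2, num_classes=20):
    """Number of values predicted for the whole image."""
    values_per_cell = boxes_per_cell * 5 + num_classes  # (x, y, w, h, conf) per box
    return grid * grid * values_per_cell


def final_detections(cell_boxes, class_probs, threshold=0.2):
    """Detections of one grid cell whose score exceeds the threshold.

    cell_boxes: list of (x, y, w, h, confidence) for the B boxes of the cell.
    class_probs: list with one probability per class for the cell.
    Returns a list of ((x, y, w, h), class_index, score).
    """
    detections = []
    for x, y, w, h, confidence in cell_boxes:
        for class_idx, prob in enumerate(class_probs):
            score = confidence * prob  # Cj * prob(c)
            if score > threshold:
                detections.append(((x, y, w, h), class_idx, score))
    return detections


print(yolo_output_size())  # 1470

# One cell with B = 2 boxes and 3 classes (hypothetical numbers)
boxes = [(0.5, 0.5, 0.2, 0.2, 0.9), (0.5, 0.5, 0.1, 0.1, 0.1)]
dets = final_detections(boxes, class_probs=[0.8, 0.1, 0.1])
print(dets)
```

In practice the surviving boxes from all cells are then passed through non-maximum suppression.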

SSD: Single Shot MultiBox Detector
Same idea as YOLO, plus several predictors at different stages in the network to allow different receptive fields.
Liu et al. SSD: Single Shot MultiBox Detector, ECCV 2016

YOLOv2
Redmon & Farhadi. YOLO9000: Better, Faster, Stronger. CVPR 2017

YOLOv3
YOLOv2
+ residual blocks
+ skip connections
+ upsampling
+ detection at multiple scales

YOLOv4
Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection” arXiv 2020.

#YOLO Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

Military Applications & Privacy Risks

RetinaNet
Matching proposal-based performance with a one-stage approach.
Problem of one-stage detectors: they evaluate many candidate locations but only a few contain objects. This class IMBALANCE makes learning inefficient.
Focal loss: the key idea is to lower the loss weight for well-classified samples and increase it for difficult ones.
Lin et al. Focal Loss for Dense Object Detection. ICCV 2017
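
The key idea can be sketched for a single binary prediction. The form used below, -alpha_t * (1 - p_t)^gamma * log(p_t), and the default values gamma = 2 and alpha = 0.25 are taken here as assumptions about the formulation in the cited paper; the function name is illustrative and the sketch ignores batching and numerical-stability concerns.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class (0 < p < 1).
    y: ground-truth label, 1 (object) or 0 (background).
    gamma: focusing parameter; gamma = 0 gives alpha-weighted cross-entropy.
    alpha: weight of the positive class (1 - alpha for the negative class).
    """
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # well-classified positive
hard = focal_loss(0.1, 1)   # misclassified positive
print(easy, hard)
```

With gamma = 2, a sample classified with probability 0.9 has its cross-entropy term scaled by (1 - 0.9)^2 = 0.01, so the many easy background locations contribute little to the total loss.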

Neural Architectures for Object Detection
Two-stage methods
● R-CNN
● Fast R-CNN
● Faster R-CNN
● Mask R-CNN
Single-stage methods
● YOLO
● SSD
● RetinaNet

Software implementations
Most models are publicly available, ready to be used off-the-shelf.
Model          Framework
Faster R-CNN   [torchvision] (< suggested) [Detectron2] [Keras]
RetinaNet      [Detectron2] (< suggested) [Keras]
Benchmark      [TensorFlow Object Detection API]
YOLOv3         [PyTorch]
SSD            [PyTorch] [Tutorial on Keras]
Mask R-CNN     [torchvision] (< suggested) [PyTorch] [Keras & TF] [tutorial]

You will probably not be interested in the object classes defined in Pascal/COCO. You can adapt (fine-tune) existing models to your own object classes.
Wang, Xin, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. "Frustratingly Simple Few-Shot Object Detection." arXiv preprint arXiv:2003.06957 (2020). [code based on Detectron2]

Software implementations for Mobile
TensorFlow Lite: Object Detection
PyTorch Mobile (no specific solutions for object detection)

Jordi Torres, “TensorFlow or PyTorch? ” (2020) [in Catalan]

Next lab: ImageNet models
Dani Fojo
