Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
1
Lecture 11:
Detection and Segmentation
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
2
Administrative
Midterms being graded
Please don’t discuss midterms until next week - some students not yet taken
A2 being graded
Project milestones due Tuesday 5/16
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
HyperQuest
3
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
4
HyperQuest
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
5
HyperQuest
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
6
HyperQuest
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
7
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
8
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
9
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
10
HyperQuest
Will post more details on Piazza this afternoon
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
11
Last Time: Recurrent Networks
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
12
Last Time: Recurrent Networks
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
13
Figure from Karpathy et a, “Deep Visual-Semantic Alignments for
Generating Image Descriptions”, CVPR 2015; figure copyright
IEEE, 2015.
Reproduced for educational purposes.
Last Time: Recurrent Networks
A cat sitting on a
suitcase on the floor
A cat is sitting on a tree
branch
Two people walking on
the beach with
surfboards
A tennis player in
action on the court
A woman is holding
a cat in her hand
A person holding a
computer mouse on a
desk
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
14
Last Time: Recurrent Networks
Vanilla RNN
Simple RNN
Elman RNN
Elman, “Finding Structure in Time”, Cognitive Science, 1990.
Hochreiter and Schmidhuber, “Long Short-Term Memory”, Neural computation, 1997
Long Short Term Memory
(LSTM)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
15
Today: Segmentation, Localization, Detection
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
16
Class Scores
Cat: 0.9
Dog: 0.05
Car: 0.01
...
So far: Image Classification
This image is CC0 public domain Vector:
4096
Fully-Connected:
4096 to 1000
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
17
Other Computer Vision Tasks
Classification
+ Localization
Semantic
Segmentation
Object
Detection
Instance
Segmentation
CAT
GRASS, CAT,
TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple Object
No objects, just pixels This image is CC0 public domain
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
18
Semantic Segmentation
CAT
GRASS, CAT,
TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple Object
No objects, just pixels This image is CC0 public domain
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
19
Semantic Segmentation
Cow
Grass
Sky
T
r
e
e
s
Label each pixel in the
image with a category
label
Don’t differentiate
instances, only care about
pixels
This image is CC0 public domain
Grass
Cat
Sky
T
r
e
e
s
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
20
Semantic Segmentation Idea: Sliding Window
Full image
Extract patch
Classify center
pixel with CNN
Cow
Cow
Grass
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
21
Semantic Segmentation Idea: Sliding Window
Full image
Extract patch
Classify center
pixel with CNN
Cow
Cow
Grass
Problem: Very inefficient! Not
reusing shared features between
overlapping patches Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
22
Semantic Segmentation Idea: Fully Convolutional
Input:
3 x H x W
Convolutions:
D x H x W
Conv Conv Conv Conv
Scores:
C x H x W
argmax
Predictions:
H x W
Design a network as a bunch of convolutional layers
to make predictions for pixels all at once!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
23
Semantic Segmentation Idea: Fully Convolutional
Input:
3 x H x W
Convolutions:
D x H x W
Conv Conv Conv Conv
Scores:
C x H x W
argmax
Predictions:
H x W
Design a network as a bunch of convolutional layers
to make predictions for pixels all at once!
Problem: convolutions at
original image resolution will
be very expensive ...
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
24
Semantic Segmentation Idea: Fully Convolutional
Input:
3 x H x W
Predictions:
H x W
Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
High-res:
D1
x H/2 x W/2
High-res:
D1
x H/2 x W/2
Med-res:
D2
x H/4 x W/4
Med-res:
D2
x H/4 x W/4
Low-res:
D3
x H/4 x W/4
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
25
Semantic Segmentation Idea: Fully Convolutional
Input:
3 x H x W
Predictions:
H x W
Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
High-res:
D1
x H/2 x W/2
High-res:
D1
x H/2 x W/2
Med-res:
D2
x H/4 x W/4
Med-res:
D2
x H/4 x W/4
Low-res:
D3
x H/4 x W/4
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling:
Pooling, strided
convolution
Upsampling:
???
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
26
In-Network upsampling: “Unpooling”
1 2
3 4
Input: 2 x 2 Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
Nearest Neighbor
1 2
3 4
Input: 2 x 2 Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
“Bed of Nails”
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
27
In-Network upsampling: “Max Unpooling”
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
1 2
3 4
Input: 2 x 2 Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4
Max Unpooling
Use positions from
pooling layer
5 6
7 8
Max Pooling
Remember which element was max!
…
Rest of the network
Output: 2 x 2
Corresponding pairs of
downsampling and
upsampling layers
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
28
Learnable Upsampling: Transpose Convolution
Recall:Typical 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
29
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product
between filter
and input
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
30
Learnable Upsampling: Transpose Convolution
Input: 4 x 4 Output: 4 x 4
Dot product
between filter
and input
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
31
Input: 4 x 4 Output: 2 x 2
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
32
Input: 4 x 4 Output: 2 x 2
Dot product
between filter
and input
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
33
Learnable Upsampling: Transpose Convolution
Input: 4 x 4 Output: 2 x 2
Dot product
between filter
and input
Filter moves 2 pixels in
the input for every one
pixel in the output
Stride gives ratio between
movement in input and
output
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
34
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
35
Input: 2 x 2 Output: 4 x 4
Input gives
weight for
filter
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
36
Input: 2 x 2 Output: 4 x 4
Input gives
weight for
filter
Sum where
output overlaps
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Filter moves 2 pixels in
the output for every one
pixel in the input
Stride gives ratio between
movement in output and
input
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
37
Input: 2 x 2 Output: 4 x 4
Input gives
weight for
filter
Sum where
output overlaps
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Filter moves 2 pixels in
the output for every one
pixel in the input
Stride gives ratio between
movement in output and
input
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
38
Input: 2 x 2 Output: 4 x 4
Input gives
weight for
filter
Sum where
output overlaps
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Filter moves 2 pixels in
the output for every one
pixel in the input
Stride gives ratio between
movement in output and
input
Other names:
-Deconvolution (bad)
-Upconvolution
-Fractionally strided
convolution
-Backward strided
convolution
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
39
Transpose Convolution: 1D Example
a
b
x
y
z
ax
ay
az + bx
by
bz
Input Filter
Output
Output contains
copies of the filter
weighted by the
input, summing at
where at overlaps in
the output
Need to crop one
pixel from output to
make output exactly
2x input
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
40
Convolution as Matrix Multiplication (1D Example)
We can express convolution in
terms of a matrix multiplication
Example: 1D conv, kernel
size=3, stride=1, padding=1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
41
Convolution as Matrix Multiplication (1D Example)
We can express convolution in
terms of a matrix multiplication
Example: 1D conv, kernel
size=3, stride=1, padding=1
Convolution transpose multiplies by the
transpose of the same matrix:
When stride=1, convolution transpose is
just a regular convolution (with different
padding rules)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
42
Convolution as Matrix Multiplication (1D Example)
We can express convolution in
terms of a matrix multiplication
Example: 1D conv, kernel
size=3, stride=2, padding=1
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
43
Convolution as Matrix Multiplication (1D Example)
We can express convolution in
terms of a matrix multiplication
Example: 1D conv, kernel
size=3, stride=2, padding=1
Convolution transpose multiplies by the
transpose of the same matrix:
When stride>1, convolution transpose is
no longer a normal convolution!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
44
Semantic Segmentation Idea: Fully Convolutional
Input:
3 x H x W
Predictions:
H x W
Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
High-res:
D1
x H/2 x W/2
High-res:
D1
x H/2 x W/2
Med-res:
D2
x H/4 x W/4
Med-res:
D2
x H/4 x W/4
Low-res:
D3
x H/4 x W/4
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling:
Pooling, strided
convolution
Upsampling:
Unpooling or strided
transpose convolution
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
45
Classification + Localization
CAT
GRASS, CAT,
TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple Object
No objects, just pixels This image is CC0 public domain
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
46
Class Scores
Cat: 0.9
Dog: 0.05
Car: 0.01
...
Classification + Localization
This image is CC0 public domain Vector:
4096
Fully
Connected:
4096 to 1000
Box
Coordinates
(x, y, w, h)
Fully
Connected:
4096 to 4
Treat localization as a
regression problem!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
47
Class Scores
Cat: 0.9
Dog: 0.05
Car: 0.01
...
Classification + Localization
Vector:
4096
Fully
Connected:
4096 to 1000
Box
Coordinates
(x, y, w, h)
Fully
Connected:
4096 to 4
Softmax
Loss
L2 Loss
Correct label:
Cat
Correct box:
(x’, y’, w’, h’)
This image is CC0 public domain
Treat localization as a
regression problem!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
48
Class Scores
Cat: 0.9
Dog: 0.05
Car: 0.01
...
Classification + Localization
Vector:
4096
Fully
Connected:
4096 to 1000
Box
Coordinates
(x, y, w, h)
Fully
Connected:
4096 to 4
Softmax
Loss
L2 Loss
Loss
Correct label:
Cat
Correct box:
(x’, y’, w’, h’)
+
This image is CC0 public domain
Treat localization as a
regression problem!
Multitask Loss
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
49
Class Scores
Cat: 0.9
Dog: 0.05
Car: 0.01
...
Classification + Localization
Vector:
4096
Fully
Connected:
4096 to 1000
Box
Coordinates
(x, y, w, h)
Fully
Connected:
4096 to 4
Softmax
Loss
L2 Loss
Loss
Correct label:
Cat
Correct box:
(x’, y’, w’, h’)
+
This image is CC0 public domain
Often pretrained on ImageNet
(Transfer learning)
Treat localization as a
regression problem!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
50
Aside: Human Pose Estimation
This image is licensed under CC-BY 2.0.
Represent pose as a
set of 14 joint
positions:
Left / right foot
Left / right knee
Left / right hip
Left / right shoulder
Left / right elbow
Left / right hand
Neck
Head top
Johnson and Everingham, "Clustered Pose and Nonlinear Appearance Models
for Human Pose Estimation", BMVC 2010
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
51
Toshev and Szegedy, “DeepPose: Human Pose
Estimation via Deep Neural Networks”, CVPR 2014
Aside: Human Pose Estimation
Vector:
4096
Left foot: (x, y)
Right foot: (x, y)
…
Head top: (x, y)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
52
Toshev and Szegedy, “DeepPose: Human Pose
Estimation via Deep Neural Networks”, CVPR 2014
Aside: Human Pose Estimation
Vector:
4096
Left foot: (x, y)
Right foot: (x, y)
…
Head top: (x, y)
L2 loss
Correct left
foot: (x’, y’)
L2 loss
Correct head
top: (x’, y’)
L2 loss
Loss
+
...
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
53
Object Detection
CAT
GRASS, CAT,
TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple Object
No objects, just pixels This image is CC0 public domain
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
54
Object Detection: Impact of Deep Learning
Figure copyright Ross Girshick, 2015.
Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
55
Object Detection as Regression?
CAT: (x, y, w, h)
DOG: (x, y, w, h)
DOG: (x, y, w, h)
CAT: (x, y, w, h)
DUCK: (x, y, w, h)
DUCK: (x, y, w, h)
….
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
56
Object Detection as Regression?
CAT: (x, y, w, h)
DOG: (x, y, w, h)
DOG: (x, y, w, h)
CAT: (x, y, w, h)
DUCK: (x, y, w, h)
DUCK: (x, y, w, h)
….
4 numbers
16 numbers
Many
numbers!
Each image needs a
different number of outputs!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
57
Object Detection as Classification: Sliding Window
Dog? NO
Cat? NO
Background? YES
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
58
Object Detection as Classification: Sliding Window
Dog? YES
Cat? NO
Background? NO
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
59
Object Detection as Classification: Sliding Window
Dog? YES
Cat? NO
Background? NO
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
60
Object Detection as Classification: Sliding Window
Dog? NO
Cat? YES
Background? NO
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
61
Object Detection as Classification: Sliding Window
Dog? NO
Cat? YES
Background? NO
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Problem: Need to apply CNN to huge
number of locations and scales, very
computationally expensive!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
62
Region Proposals
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region
proposals in a few seconds on CPU
Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
63
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
64
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
65
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
66
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
67
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
68
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
69
R-CNN: Problems
• Ad hoc training objectives
• Fine-tune network with softmax classifier (log loss)
• Train post-hoc linear SVMs (hinge loss)
• Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
• 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
• Fixed by SPP-net [He et al. ECCV14]
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
70
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
71
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
72
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
73
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
74
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
75
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
76
Fast R-CNN
(Training)
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
77
Fast R-CNN
(Training)
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
78
Faster R-CNN: RoI Pooling
Hi-res input image:
3 x 640 x 480
with region
proposal
Hi-res conv features:
512 x 20 x 15;
Projected region
proposal is e.g.
512 x 18 x 8
(varies per proposal)
Fully-connected
layers
Divide projected
proposal into 7x7
grid, max-pool
within each cell
RoI conv features:
512 x 7 x 7
for region proposal
Fully-connected layers expect
low-res conv features:
512 x 7 x 7
CNN
Project proposal
onto features
Girshick, “Fast R-CNN”, ICCV 2015.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
79
R-CNN vs SPP vs Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
80
R-CNN vs SPP vs Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Problem:
Runtime dominated
by region proposals!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
81
Faster R-CNN:
Make CNN do proposals!
Insert Region Proposal
Network (RPN) to predict
proposals from features
Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object
classes)
4. Final box coordinates
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
82
Faster R-CNN:
Make CNN do proposals!
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
83
Detection without Proposals: YOLO / SSD
Divide image into grid
7 x 7
Image a set of base boxes
centered at each grid cell
Here B = 3
Input image
3 x H x W
Within each grid cell:
- Regress from each of the B
base boxes to a final box with
5 numbers:
(dx, dy, dh, dw, confidence)
- Predict scores for each of C
classes (including
background as a class)
Output:
7 x 7 x (5 * B + C)
Redmon et al, “You Only Look Once:
Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
84
Detection without Proposals: YOLO / SSD
Divide image into grid
7 x 7
Image a set of base boxes
centered at each grid cell
Here B = 3
Input image
3 x H x W
Within each grid cell:
- Regress from each of the B
base boxes to a final box with
5 numbers:
(dx, dy, dh, dw, confidence)
- Predict scores for each of C
classes (including
background as a class)
Output:
7 x 7 x (5 * B + C)
Go from input image to tensor of scores with one big convolutional network!
Redmon et al, “You Only Look Once:
Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
85
Object Detection: Lots of variables ...
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Base Network
VGG16
ResNet-101
Inception V2
Inception V3
Inception
ResNet
MobileNet
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017
Object Detection
architecture
Faster R-CNN
R-FCN
SSD
Image Size
# Region Proposals
…
Takeaways
Faster R-CNN is
slower but more
accurate
SSD is much
faster but not as
accurate
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
86
Aside: Object Detection + Captioning
= Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
87
Aside: Object Detection + Captioning
= Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
88
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
89
Instance Segmentation
CAT
GRASS, CAT,
TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple Object
No objects, just pixels This image is CC0 public domain
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
90
Mask R-CNN
He et al, “Mask R-CNN”, arXiv 2017
RoI Align
Conv
Classification Scores: C
Box coordinates (per class): 4 * C
CNN Conv
Predict a mask for
each of C classes
C x 14 x 14
256 x 14 x 14 256 x 14 x 14
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
91
Mask R-CNN: Very Good Results!
He et al, “Mask R-CNN”, arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017.
Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
92
Mask R-CNN
Also does pose
He et al, “Mask R-CNN”, arXiv 2017
RoI Align
Conv
Classification Scores: C
Box coordinates (per class): 4 * C
Joint coordinates
CNN Conv
Predict a mask for
each of C classes
C x 14 x 14
256 x 14 x 14 256 x 14 x 14
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
93
Mask R-CNN
Also does pose
He et al, “Mask R-CNN”, arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017.
Reproduced with permission.
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
94
Recap:
Classification
+ Localization
Semantic
Segmentation
Object
Detection
Instance
Segmentation
CAT
GRASS, CAT,
TREE, SKY
DOG, DOG, CAT DOG, DOG, CAT
Single Object Multiple Object
No objects, just pixels This image is CC0 public domain
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017
95
Next time:
Visualizing CNN features
DeepDream + Style Transfer

Object detection stanford

  • 1.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 1 Lecture 11: Detection and Segmentation
  • 2.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 2 Administrative Midterms being graded Please don’t discuss midterms until next week - some students not yet taken A2 being graded Project milestones due Tuesday 5/16
  • 3.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 HyperQuest 3
  • 4.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 4 HyperQuest
  • 5.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 5 HyperQuest
  • 6.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 6 HyperQuest
  • 7.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 7
  • 8.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 8
  • 9.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 9
  • 10.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 10 HyperQuest Will post more details on Piazza this afternoon
  • 11.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 11 Last Time: Recurrent Networks
  • 12.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 12 Last Time: Recurrent Networks
  • 13.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 13 Figure from Karpathy et a, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes. Last Time: Recurrent Networks A cat sitting on a suitcase on the floor A cat is sitting on a tree branch Two people walking on the beach with surfboards A tennis player in action on the court A woman is holding a cat in her hand A person holding a computer mouse on a desk
  • 14.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 14 Last Time: Recurrent Networks Vanilla RNN Simple RNN Elman RNN Elman, “Finding Structure in Time”, Cognitive Science, 1990. Hochreiter and Schmidhuber, “Long Short-Term Memory”, Neural computation, 1997 Long Short Term Memory (LSTM)
  • 15.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 15 Today: Segmentation, Localization, Detection
  • 16.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 16 Class Scores Cat: 0.9 Dog: 0.05 Car: 0.01 ... So far: Image Classification This image is CC0 public domain Vector: 4096 Fully-Connected: 4096 to 1000
  • 17.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 17 Other Computer Vision Tasks Classification + Localization Semantic Segmentation Object Detection Instance Segmentation CAT GRASS, CAT, TREE, SKY DOG, DOG, CAT DOG, DOG, CAT Single Object Multiple Object No objects, just pixels This image is CC0 public domain
  • 18.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 18 Semantic Segmentation CAT GRASS, CAT, TREE, SKY DOG, DOG, CAT DOG, DOG, CAT Single Object Multiple Object No objects, just pixels This image is CC0 public domain
  • 19.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 19 Semantic Segmentation Cow Grass Sky T r e e s Label each pixel in the image with a category label Don’t differentiate instances, only care about pixels This image is CC0 public domain Grass Cat Sky T r e e s
  • 20.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 20 Semantic Segmentation Idea: Sliding Window Full image Extract patch Classify center pixel with CNN Cow Cow Grass Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013 Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
  • 21.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 21 Semantic Segmentation Idea: Sliding Window Full image Extract patch Classify center pixel with CNN Cow Cow Grass Problem: Very inefficient! Not reusing shared features between overlapping patches Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013 Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
  • 22.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 22 Semantic Segmentation Idea: Fully Convolutional Input: 3 x H x W Convolutions: D x H x W Conv Conv Conv Conv Scores: C x H x W argmax Predictions: H x W Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
  • 23.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 23 Semantic Segmentation Idea: Fully Convolutional Input: 3 x H x W Convolutions: D x H x W Conv Conv Conv Conv Scores: C x H x W argmax Predictions: H x W Design a network as a bunch of convolutional layers to make predictions for pixels all at once! Problem: convolutions at original image resolution will be very expensive ...
  • 24.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 24 Semantic Segmentation Idea: Fully Convolutional Input: 3 x H x W Predictions: H x W Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network! High-res: D1 x H/2 x W/2 High-res: D1 x H/2 x W/2 Med-res: D2 x H/4 x W/4 Med-res: D2 x H/4 x W/4 Low-res: D3 x H/4 x W/4 Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015 Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
  • 25.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 25 Semantic Segmentation Idea: Fully Convolutional Input: 3 x H x W Predictions: H x W Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network! High-res: D1 x H/2 x W/2 High-res: D1 x H/2 x W/2 Med-res: D2 x H/4 x W/4 Med-res: D2 x H/4 x W/4 Low-res: D3 x H/4 x W/4 Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015 Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015 Downsampling: Pooling, strided convolution Upsampling: ???
  • 26.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 26 In-Network upsampling: “Unpooling” 1 2 3 4 Input: 2 x 2 Output: 4 x 4 1 1 2 2 1 1 2 2 3 3 4 4 3 3 4 4 Nearest Neighbor 1 2 3 4 Input: 2 x 2 Output: 4 x 4 1 0 2 0 0 0 0 0 3 0 4 0 0 0 0 0 “Bed of Nails”
  • 27.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 27 In-Network upsampling: “Max Unpooling” Input: 4 x 4 1 2 6 3 3 5 2 1 1 2 2 1 7 3 4 8 1 2 3 4 Input: 2 x 2 Output: 4 x 4 0 0 2 0 0 1 0 0 0 0 0 0 3 0 0 4 Max Unpooling Use positions from pooling layer 5 6 7 8 Max Pooling Remember which element was max! … Rest of the network Output: 2 x 2 Corresponding pairs of downsampling and upsampling layers
  • 28.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 28 Learnable Upsampling: Transpose Convolution Recall:Typical 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4
  • 29.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 29 Learnable Upsampling: Transpose Convolution Recall: Normal 3 x 3 convolution, stride 1 pad 1 Input: 4 x 4 Output: 4 x 4 Dot product between filter and input
  • 30.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 30 Learnable Upsampling: Transpose Convolution Input: 4 x 4 Output: 4 x 4 Dot product between filter and input Recall: Normal 3 x 3 convolution, stride 1 pad 1
  • 31.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 31 Input: 4 x 4 Output: 2 x 2 Learnable Upsampling: Transpose Convolution Recall: Normal 3 x 3 convolution, stride 2 pad 1
  • 32.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 32 Input: 4 x 4 Output: 2 x 2 Dot product between filter and input Learnable Upsampling: Transpose Convolution Recall: Normal 3 x 3 convolution, stride 2 pad 1
  • 33.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 33 Learnable Upsampling: Transpose Convolution Input: 4 x 4 Output: 2 x 2 Dot product between filter and input Filter moves 2 pixels in the input for every one pixel in the output Stride gives ratio between movement in input and output Recall: Normal 3 x 3 convolution, stride 2 pad 1
  • 34.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 34 Learnable Upsampling: Transpose Convolution 3 x 3 transpose convolution, stride 2 pad 1 Input: 2 x 2 Output: 4 x 4
  • 35.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 35 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter Learnable Upsampling: Transpose Convolution 3 x 3 transpose convolution, stride 2 pad 1
  • 36.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 36 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter Sum where output overlaps Learnable Upsampling: Transpose Convolution 3 x 3 transpose convolution, stride 2 pad 1 Filter moves 2 pixels in the output for every one pixel in the input Stride gives ratio between movement in output and input
  • 37.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 37 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter Sum where output overlaps Learnable Upsampling: Transpose Convolution 3 x 3 transpose convolution, stride 2 pad 1 Filter moves 2 pixels in the output for every one pixel in the input Stride gives ratio between movement in output and input
  • 38.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 38 Input: 2 x 2 Output: 4 x 4 Input gives weight for filter Sum where output overlaps Learnable Upsampling: Transpose Convolution 3 x 3 transpose convolution, stride 2 pad 1 Filter moves 2 pixels in the output for every one pixel in the input Stride gives ratio between movement in output and input Other names: -Deconvolution (bad) -Upconvolution -Fractionally strided convolution -Backward strided convolution
  • 39.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 39 Transpose Convolution: 1D Example a b x y z ax ay az + bx by bz Input Filter Output Output contains copies of the filter weighted by the input, summing at where at overlaps in the output Need to crop one pixel from output to make output exactly 2x input
  • 40.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 40 Convolution as Matrix Multiplication (1D Example) We can express convolution in terms of a matrix multiplication Example: 1D conv, kernel size=3, stride=1, padding=1
  • 41.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 41 Convolution as Matrix Multiplication (1D Example) We can express convolution in terms of a matrix multiplication Example: 1D conv, kernel size=3, stride=1, padding=1 Convolution transpose multiplies by the transpose of the same matrix: When stride=1, convolution transpose is just a regular convolution (with different padding rules)
  • 42.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 42 Convolution as Matrix Multiplication (1D Example) We can express convolution in terms of a matrix multiplication Example: 1D conv, kernel size=3, stride=2, padding=1
  • 43.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 43 Convolution as Matrix Multiplication (1D Example) We can express convolution in terms of a matrix multiplication Example: 1D conv, kernel size=3, stride=2, padding=1 Convolution transpose multiplies by the transpose of the same matrix: When stride>1, convolution transpose is no longer a normal convolution!
  • 44.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 44 Semantic Segmentation Idea: Fully Convolutional Input: 3 x H x W Predictions: H x W Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network! High-res: D1 x H/2 x W/2 High-res: D1 x H/2 x W/2 Med-res: D2 x H/4 x W/4 Med-res: D2 x H/4 x W/4 Low-res: D3 x H/4 x W/4 Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015 Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015 Downsampling: Pooling, strided convolution Upsampling: Unpooling or strided transpose convolution
  • 45.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 45 Classification + Localization CAT GRASS, CAT, TREE, SKY DOG, DOG, CAT DOG, DOG, CAT Single Object Multiple Object No objects, just pixels This image is CC0 public domain
  • 46.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 46 Class Scores Cat: 0.9 Dog: 0.05 Car: 0.01 ... Classification + Localization This image is CC0 public domain Vector: 4096 Fully Connected: 4096 to 1000 Box Coordinates (x, y, w, h) Fully Connected: 4096 to 4 Treat localization as a regression problem!
  • 47.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 47 Class Scores Cat: 0.9 Dog: 0.05 Car: 0.01 ... Classification + Localization Vector: 4096 Fully Connected: 4096 to 1000 Box Coordinates (x, y, w, h) Fully Connected: 4096 to 4 Softmax Loss L2 Loss Correct label: Cat Correct box: (x’, y’, w’, h’) This image is CC0 public domain Treat localization as a regression problem!
  • 48.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 48 Class Scores Cat: 0.9 Dog: 0.05 Car: 0.01 ... Classification + Localization Vector: 4096 Fully Connected: 4096 to 1000 Box Coordinates (x, y, w, h) Fully Connected: 4096 to 4 Softmax Loss L2 Loss Loss Correct label: Cat Correct box: (x’, y’, w’, h’) + This image is CC0 public domain Treat localization as a regression problem! Multitask Loss
  • 49.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 49 Class Scores Cat: 0.9 Dog: 0.05 Car: 0.01 ... Classification + Localization Vector: 4096 Fully Connected: 4096 to 1000 Box Coordinates (x, y, w, h) Fully Connected: 4096 to 4 Softmax Loss L2 Loss Loss Correct label: Cat Correct box: (x’, y’, w’, h’) + This image is CC0 public domain Often pretrained on ImageNet (Transfer learning) Treat localization as a regression problem!
  • 50.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 50 Aside: Human Pose Estimation This image is licensed under CC-BY 2.0. Represent pose as a set of 14 joint positions: Left / right foot Left / right knee Left / right hip Left / right shoulder Left / right elbow Left / right hand Neck Head top Johnson and Everingham, "Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation", BMVC 2010
  • 51.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 51 Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014 Aside: Human Pose Estimation Vector: 4096 Left foot: (x, y) Right foot: (x, y) … Head top: (x, y)
  • 52.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 52 Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014 Aside: Human Pose Estimation Vector: 4096 Left foot: (x, y) Right foot: (x, y) … Head top: (x, y) L2 loss Correct left foot: (x’, y’) L2 loss Correct head top: (x’, y’) L2 loss Loss + ...
  • 53.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 53 Object Detection CAT GRASS, CAT, TREE, SKY DOG, DOG, CAT DOG, DOG, CAT Single Object Multiple Object No objects, just pixels This image is CC0 public domain
  • 54.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 54 Object Detection: Impact of Deep Learning Figure copyright Ross Girshick, 2015. Reproduced with permission.
  • 55.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 55 Object Detection as Regression? CAT: (x, y, w, h) DOG: (x, y, w, h) DOG: (x, y, w, h) CAT: (x, y, w, h) DUCK: (x, y, w, h) DUCK: (x, y, w, h) ….
  • 56.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 56 Object Detection as Regression? CAT: (x, y, w, h) DOG: (x, y, w, h) DOG: (x, y, w, h) CAT: (x, y, w, h) DUCK: (x, y, w, h) DUCK: (x, y, w, h) …. 4 numbers 16 numbers Many numbers! Each image needs a different number of outputs!
  • 57.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 57 Object Detection as Classification: Sliding Window Dog? NO Cat? NO Background? YES Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
  • 58.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 58 Object Detection as Classification: Sliding Window Dog? YES Cat? NO Background? NO Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
  • 59.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 59 Object Detection as Classification: Sliding Window Dog? YES Cat? NO Background? NO Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
  • 60.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 60 Object Detection as Classification: Sliding Window Dog? NO Cat? YES Background? NO Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
  • 61.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 61 Object Detection as Classification: Sliding Window Dog? NO Cat? YES Background? NO Apply a CNN to many different crops of the image, CNN classifies each crop as object or background Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
  • 62.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 62 Region Proposals ● Find “blobby” image regions that are likely to contain objects ● Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012 Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013 Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014 Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
  • 63.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 63 R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 64.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 64 R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 65.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 65 R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 66.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 66 R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 67.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 67 R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 68.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 68 R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 69.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 69 R-CNN: Problems • Ad hoc training objectives • Fine-tune network with softmax classifier (log loss) • Train post-hoc linear SVMs (hinge loss) • Train post-hoc bounding-box regressions (least squares) • Training is slow (84h), takes a lot of disk space • Inference (detection) is slow • 47s / image with VGG16 [Simonyan & Zisserman. ICLR15] • Fixed by SPP-net [He et al. ECCV14] Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 70.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 70 Fast R-CNN Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 71.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 71 Fast R-CNN Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 72.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 72 Fast R-CNN Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 73.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 73 Fast R-CNN Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 74.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 74 Fast R-CNN Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 75.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 75 Fast R-CNN Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 76.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 76 Fast R-CNN (Training) Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 77.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 77 Fast R-CNN (Training) Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
  • 78.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 78 Faster R-CNN: RoI Pooling Hi-res input image: 3 x 640 x 480 with region proposal Hi-res conv features: 512 x 20 x 15; Projected region proposal is e.g. 512 x 18 x 8 (varies per proposal) Fully-connected layers Divide projected proposal into 7x7 grid, max-pool within each cell RoI conv features: 512 x 7 x 7 for region proposal Fully-connected layers expect low-res conv features: 512 x 7 x 7 CNN Project proposal onto features Girshick, “Fast R-CNN”, ICCV 2015.
  • 79.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 79 R-CNN vs SPP vs Fast R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014 Girshick, “Fast R-CNN”, ICCV 2015
  • 80.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 80 R-CNN vs SPP vs Fast R-CNN Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014 Girshick, “Fast R-CNN”, ICCV 2015 Problem: Runtime dominated by region proposals!
  • 81.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 81 Faster R-CNN: Make CNN do proposals! Insert Region Proposal Network (RPN) to predict proposals from features Jointly train with 4 losses: 1. RPN classify object / not object 2. RPN regress box coordinates 3. Final classification score (object classes) 4. Final box coordinates Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015 Figure copyright 2015, Ross Girshick; reproduced with permission
  • 82.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 82 Faster R-CNN: Make CNN do proposals!
  • 83.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 83 Detection without Proposals: YOLO / SSD Divide image into grid 7 x 7 Image a set of base boxes centered at each grid cell Here B = 3 Input image 3 x H x W Within each grid cell: - Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence) - Predict scores for each of C classes (including background as a class) Output: 7 x 7 x (5 * B + C) Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016 Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
  • 84.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 84 Detection without Proposals: YOLO / SSD Divide image into grid 7 x 7 Image a set of base boxes centered at each grid cell Here B = 3 Input image 3 x H x W Within each grid cell: - Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence) - Predict scores for each of C classes (including background as a class) Output: 7 x 7 x (5 * B + C) Go from input image to tensor of scores with one big convolutional network! Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016 Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
  • 85.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 85 Object Detection: Lots of variables ... Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017 Base Network VGG16 ResNet-101 Inception V2 Inception V3 Inception ResNet MobileNet R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016 Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015 Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016 Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016 MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017 Object Detection architecture Faster R-CNN R-FCN SSD Image Size # Region Proposals … Takeaways Faster R-CNN is slower but more accurate SSD is much faster but not as accurate
  • 86.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 86 Aside: Object Detection + Captioning = Dense Captioning Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016 Figure copyright IEEE, 2016. Reproduced for educational purposes.
  • 87.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 87 Aside: Object Detection + Captioning = Dense Captioning Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016 Figure copyright IEEE, 2016. Reproduced for educational purposes.
  • 88.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 88
  • 89.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 89 Instance Segmentation CAT GRASS, CAT, TREE, SKY DOG, DOG, CAT DOG, DOG, CAT Single Object Multiple Object No objects, just pixels This image is CC0 public domain
  • 90.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 90 Mask R-CNN He et al, “Mask R-CNN”, arXiv 2017 RoI Align Conv Classification Scores: C Box coordinates (per class): 4 * C CNN Conv Predict a mask for each of C classes C x 14 x 14 256 x 14 x 14 256 x 14 x 14
  • 91.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 91 Mask R-CNN: Very Good Results! He et al, “Mask R-CNN”, arXiv 2017 Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
  • 92.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 92 Mask R-CNN Also does pose He et al, “Mask R-CNN”, arXiv 2017 RoI Align Conv Classification Scores: C Box coordinates (per class): 4 * C Joint coordinates CNN Conv Predict a mask for each of C classes C x 14 x 14 256 x 14 x 14 256 x 14 x 14
  • 93.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 93 Mask R-CNN Also does pose He et al, “Mask R-CNN”, arXiv 2017 Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
  • 94.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 94 Recap: Classification + Localization Semantic Segmentation Object Detection Instance Segmentation CAT GRASS, CAT, TREE, SKY DOG, DOG, CAT DOG, DOG, CAT Single Object Multiple Object No objects, just pixels This image is CC0 public domain
  • 95.
    Fei-Fei Li &Justin Johnson & Serena Yeung Lecture 11 - May 10, 2017 95 Next time: Visualizing CNN features DeepDream + Style Transfer