Introduction to
Deep Learning
Presenter: Sungjoon Choi
(sungjoon.choi@cpslab.snu.ac.kr)
Contents
Optimization methods
CNN basics
Semantic segmentation
Weakly supervised localization
Image detection
RNN
Visual QnA
Word2Vec
Image Captioning
What is deep learning?
Wikipedia says:
"Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations."
Is it brand new?
Neural Nets: McCulloch & Pitts, 1943
Perceptron: Rosenblatt, 1958
RNN: Grossberg, 1973
CNN: Fukushima, 1979
RBM: Hinton, 1999
DBN: Hinton, 2006
D-AE: Vincent, 2008
AlexNet: Krizhevsky, 2012
GoogLeNet: Szegedy, 2015
Deep architectures
Feed-Forward: multilayer neural nets, convolutional nets
Feed-Back: Stacked Sparse Coding, Deconvolutional Nets
Bi-Directional: Deep Boltzmann Machines, Stacked Auto-Encoders
Recurrent: Recurrent Nets, Long Short-Term Memory
CNN basics
CNN
CNNs are basically layers of convolutions followed by subsampling and fully connected layers. Intuitively speaking, the convolution and subsampling layers work as feature extraction layers, while a fully connected layer classifies which category the current input belongs to using the extracted features.
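As a concrete illustration, here is a minimal PyTorch sketch of this convolution → subsampling → fully connected pattern (the layer sizes are illustrative choices, not from the slides):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Convolution + subsampling layers extract features;
    a fully connected layer classifies from those features."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution
            nn.ReLU(),
            nn.MaxPool2d(2),                             # subsampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected

    def forward(self, x):                 # x: (batch, 3, 32, 32)
        f = self.features(x)
        return self.classifier(f.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```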
Optimization methods
Gradient descent?
There are three variants of gradient descent, which differ in how much data we use to compute the gradient. We make a trade-off between the accuracy of the parameter update and the time it takes to compute it.
Batch gradient descent
In batch gradient descent, we use the entire training dataset to compute the gradient.
Stochastic gradient descent
In stochastic gradient descent (SGD), the gradient is computed from each training sample, one by one.
Mini-batch gradient descent
In mini-batch gradient descent, we take the best of both worlds. Common mini-batch sizes range between 50 and 256 (but can vary).
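A minimal NumPy sketch of the three variants, assuming a generic gradient function grad(theta, X, y) (the names and the batch size are illustrative):

```python
import numpy as np

def gd_step(theta, grad, X, y, lr=0.01, mode="mini", batch=64):
    """One parameter update under the three gradient-descent variants."""
    if mode == "batch":            # entire training set
        g = grad(theta, X, y)
    elif mode == "stochastic":     # a single random sample
        i = np.random.randint(len(X))
        g = grad(theta, X[i:i+1], y[i:i+1])
    else:                          # mini-batch: best of both worlds
        idx = np.random.choice(len(X), size=batch, replace=False)
        g = grad(theta, X[idx], y[idx])
    return theta - lr * g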
Challenges
• Choosing a proper learning rate is cumbersome → learning rate schedules
• Avoiding getting trapped in suboptimal local minima → momentum, Nesterov accelerated gradient
Adagrad
Adagrad adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequent ones:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

where $g_{t,i}$ is the gradient for parameter $i$ at step $t$, and $G_{t,ii}$ is the sum of its squared gradients up to step $t$.
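A hedged NumPy sketch of the Adagrad update (the hyperparameter values are common defaults, not from the slides):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """Adagrad: G accumulates squared gradients per parameter, so
    frequently updated parameters get smaller effective learning rates."""
    G += g ** 2
    theta -= lr / np.sqrt(G + eps) * g
    return theta, G
```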
Adadelta
Adadelta is an extension of Adagrad that seeks to remedy its monotonically decreasing learning rate. It restricts the window of accumulated past gradients to some fixed size $w$ by using exponential moving averages:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2$$
$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\, \Delta\theta_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$

No learning rate!
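A corresponding NumPy sketch of the Adadelta update (again with assumed default hyperparameters):

```python
import numpy as np

def adadelta_step(theta, g, Eg2, Edx2, gamma=0.9, eps=1e-6):
    """Adadelta: exponential moving averages replace Adagrad's
    ever-growing sum, and the RMS of past updates replaces eta."""
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2
    theta += dx
    return theta, Eg2, Edx2
```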
RMSprop
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$$
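A NumPy sketch of the RMSprop update (the default values follow Hinton's lecture suggestions):

```python
import numpy as np

def rmsprop_step(theta, g, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: keep a decaying average of squared gradients,
    but retain an explicit learning rate."""
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta -= lr / np.sqrt(Eg2 + eps) * g
    return theta, Eg2
```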
Adam
Adaptive Moment Estimation (Adam) stores both an exponentially decaying average of past gradients (momentum) and an exponentially decaying average of past squared gradients (a running average of gradient squares):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

With the bias-corrected estimates $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$, the update is

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
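A NumPy sketch of the Adam update under the same conventions (defaults follow the Adam paper's suggestions):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum (m) plus a running average of squared gradients (v),
    both bias-corrected for the early steps."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)      # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```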
Visualization
Semantic segmentation
Semantic Segmentation?
• Image Classification: one label for the whole image (e.g., lion, dog, giraffe).
• Object Detection: localize each object instance with a bounding box (e.g., bicycle, person, ball, dog).
• Semantic Segmentation: assign a class label to every pixel (e.g., the person and bicycle regions).
Semantic segmentation
Results
Weakly supervised localization
Weakly Supervised Object Localization
Usually, supervised localization is trained with bounding-box annotations. What if localization were possible from image-level labels alone, without bounding-box annotations?
Today's seminar: "Learning Deep Features for Discriminative Localization", Zhou et al., CVPR 2016 (arXiv:1512.04150v1).
Architecture
AlexNet + GAP, trained on Places205: a 227x227x3 input yields 11x11x512 conv feature maps, an 11x11 average pooling (Global Average Pooling, GAP) reduces them to a 512-d vector, and a fully connected layer maps it to 205 scene classes (example prediction: "living room").
Class activation map (CAM)
• Identify important image regions by projecting the weights of the output layer back onto the convolutional feature maps.
• CAMs can be generated for each class in a single image.
• The highlighted regions differ per category in a given image (e.g., palace, dome, church, ...).
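A hedged NumPy sketch of the CAM computation; the shapes follow the architecture above, and the function name is mine:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """CAM: weight each of the 512 feature maps by the output-layer
    weight connecting it to the chosen class, then sum.
    features:   (512, 11, 11) conv feature maps before GAP
    fc_weights: (205, 512) weights of the final fc layer"""
    w = fc_weights[class_idx]                # (512,)
    cam = np.tensordot(w, features, axes=1)  # weighted sum -> (11, 11)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)          # normalize to [0, 1]
```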
Results
• CAM on top 5 predictions on an image
• CAM for one object class in images
GAP vs. GMP
• Oquab et al., CVPR 2015, "Is object localization for free? Weakly-supervised learning with convolutional neural networks", uses global max pooling (GMP).
• Intuitive difference between GMP and GAP?
• The GAP loss encourages the network to identify the full extent of an object, while the GMP loss encourages it to identify just one discriminative part.
• With GAP, the average of a map is maximized by finding all discriminative parts of an object: if any activations are low, the output of that map is reduced.
• With GMP, low scores for all image regions except the most discriminative part do not impact the score, since only the maximum survives the pooling.
GAP & GMP
• GAP (upper) vs. GMP (lower)
• GAP outperforms GMP: it highlights more complete object regions and less background noise.
• The average-pooling loss benefits when the network identifies all discriminative regions of an object.
Concept localization
Concept localization in weakly labeled images
• Positive set: images whose text caption contains a given short phrase; negative set: randomly selected images.
• The model captures the concept, even though phrases are much more abstract than object names.
Weakly supervised text detector
• Positive set: 350 Google StreetView images that contain text; negative set: outdoor scene images from the SUN dataset.
• Text is highlighted without bounding-box annotations.
Image detection
Results
SPPnet
Results
Fast R-CNN
Faster R-CNN
Results
R-CNN
Pipeline: image → region proposals → resize (warp) each region → convolutional features → classify.
SPP net
Pipeline: image → convolutional features (computed once) → spatial pyramid pooling over regions → classify.
R-CNN vs. SPP net
Fast R-CNN
Pipeline: image → convolutional features → region proposals → RoI pooling layer → class label and confidence (per region).
R-CNN vs. SPP net vs. Fast R-CNN
Faster R-CNN
Pipeline: image → fully convolutional features → region proposal network (bounding-box regression + box classification) → Fast R-CNN detector head.
R-CNN vs. SPP net vs. Fast R-CNN
Results
RNN
Recurrent Neural Network
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM comes in!
Long Short-Term Memory: the first diagram is just a standard RNN; the second is the LSTM.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Overall Architecture
Diagram labels: (cell) state, hidden state; forget gate, input gate, output gate; next (cell) state, next hidden state; input and output (the output equals the hidden state).
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
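A hedged NumPy sketch of one LSTM step wiring these gates together (standard formulation; the packed weight layout is an implementation choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W: (4*H, D+H) and b: (4*H,) pack the
    forget/input/output/candidate transforms; h, c: (H,)."""
    H = h.size
    z = W @ np.concatenate([x, h]) + b
    f = sigmoid(z[0:H])            # forget gate
    i = sigmoid(z[H:2*H])          # input gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell state
    c_next = f * c + i * g         # next (cell) state
    h_next = o * np.tanh(c_next)   # next hidden state = output
    return h_next, c_next
```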
The Core Idea
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Visual QnA
VQA: Dataset and Problem definition
VQA dataset - Examples
Q: How many dogs are seen?
Q: What animal is this?
Q: What color is the car?
Q: What is the mustache made of?
Q: Is this vegetarian pizza?
Solving VQA
Approach: various methods have been proposed
[Malinowski et al., 2015] [Ren et al., 2015] [Andres et al., 2015] [Ma et al., 2015] [Jiang et al., 2015]
DPPnet
Motivation
The common pipeline of using deep learning for vision: take a CNN trained on ImageNet, switch the final layer, and fine-tune for the new task.
Observation: in VQA, the task is determined by the question.
DPPnet
Main Idea
Switch the parameters of a layer based on the question:
• Dynamic parameter layer
• Question parameter prediction network
DPPnet
Parameter Explosion
The dynamic parameter layer is an fc-layer with input dimension Q and output dimension P, so N = Q x P parameters must be predicted from a question feature of dimension M (the hidden-state dimension); the prediction layer then needs R = Q x P x M parameters.
For example, Q = 1000, P = 1000, M = 500 gives R = 500,000,000, i.e., 1.86 GB for a single layer, whereas all of VGG19 has 144,000,000 parameters.
DPPnet
Parameter Explosion
Solution: instead of predicting all N = Q x P weights directly, predict only an N-dimensional vector with N < Q x P, reducing the prediction layer from R = Q x P x M to R = N x M parameters. We can control N.
DPPnet
Weight Sharing with Hashing Trick
The weights of the dynamic parameter layer are picked from the predicted candidate weights by hashing [Chen et al., 2015]: the question feature yields a candidate weight vector (e.g., [0.1, 1.2, -0.7, 0.3, -0.2]), and a hash function assigns each entry of the full weight matrix to one of these candidates.
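A rough sketch of this hashing trick; a real implementation uses a deterministic hash function, for which a seeded random map stands in here:

```python
import numpy as np

def hashed_weight_matrix(candidates, Q, P, seed=0):
    """Expand N candidate weights into a Q x P matrix by hashing
    each (row, col) position to a candidate index [Chen et al., 2015]."""
    rng = np.random.default_rng(seed)   # stand-in for a deterministic hash
    idx = rng.integers(0, candidates.size, size=(Q, P))
    return candidates[idx]              # Q*P entries share N weights

W = hashed_weight_matrix(np.array([0.1, 1.2, -0.7, 0.3, -0.2]), Q=4, P=4)
```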
DPPnet
Final Architecture
End-to-end fine-tuning is possible (fully differentiable).
DPPnet
Qualitative Results
Q: What is the boy holding? DPPnet: "surfboard" / "bat" (on two different images)
DPPnet
Qualitative Results
Q: What animal is shown? DPPnet: "giraffe" / "elephant" (on two different images)
DPPnet
Qualitative Results
Q: How does the woman feel?
DPPnet: happy
Q: What type of hat is she wearing?
DPPnet: cowboy
DPPnet
Qualitative Results
Q: How many cranes are in the image? DPPnet: 2 (3)
Q: How many people are on the bench? DPPnet: 2 (1)
(Ground-truth answers in parentheses.)
How to combine image and question?
Multimodal Compact Bilinear Pooling
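The slides leave the mechanism to the figures, but a hedged sketch of compact bilinear pooling via count sketch, as used in MCB (variable names are mine), looks like:

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d dims: input index i goes to bucket h[i] with sign s[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(img_feat, q_feat, d=16000, seed=0):
    """Multimodal compact bilinear pooling: count-sketch each modality,
    then multiply in the FFT domain (circular convolution), which
    approximates the outer product of the two feature vectors."""
    rng = np.random.default_rng(seed)
    sketches = []
    for x in (img_feat, q_feat):
        h = rng.integers(0, d, size=x.size)        # hash buckets
        s = rng.choice([-1.0, 1.0], size=x.size)   # random signs
        sketches.append(count_sketch(x, h, s, d))
    return np.real(np.fft.ifft(np.fft.fft(sketches[0]) *
                               np.fft.fft(sketches[1])))
```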
MCB without Attention
MCB with Attention
Results
Word2Vec
Word2vec?
Image Captioning
Image Captioning?
Overall Architecture
Language Model
Training phase
Test phase
Results
But not always...
Show, attend and tell
Results
Results (mistakes)
Neural Art
Preliminaries
• "Understanding Deep Image Representations by Inverting Them", CVPR 2015
• "Texture Synthesis Using Convolutional Neural Networks", NIPS 2015
A Neural Algorithm of Artistic Style
Texture Synthesis Using Convolutional Neural Networks (NIPS 2015)
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
Texture?
Visual texture synthesis
Which one do you think is real? The right one is real.
The goal of texture synthesis is to produce (arbitrarily many) new samples from an example texture.
Results of this work
The right ones are the given sources!
How?
Texture Model
Inputs X_a and X_b are passed through the CNN; each layer l yields feature matrices F_a^l and F_b^l, with one row per spatial position and one column per filter (so their width is the number of filters).
Feature Correlations
At each layer, compute the Gram matrix of the feature maps: with F_a^2 of size (W*H) x (number of filters), the Gram matrix G_a^2 = (F_a^2)^T F_a^2 is (number of filters) x (number of filters) and records the correlations between filter responses.
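A small NumPy sketch of this Gram-matrix computation (the feature layout is an assumption):

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps: (channels, height, width) activations of one layer.
    Returns the (channels, channels) matrix of filter correlations."""
    C = feature_maps.shape[0]
    F = feature_maps.reshape(C, -1).T   # (W*H, number of filters)
    return F.T @ F                      # G = F^T F
```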
Texture Generation
Pass the example texture X_a and the generated image X_b through the network, and compute the Gram matrices G_a^l and G_b^l at every layer.
Texture Generation
The generated image is optimized so that its Gram matrices match those of the example: an element-wise squared loss at each layer, E_l ∝ Σ_ij (G_a,ij^l - G_b,ij^l)^2, combined into a total layer-wise loss function L = Σ_l w_l E_l.
Results
Understanding Deep Image Representations by Inverting Them (CVPR 2015)
Aravindh Mahendran, Andrea Vedaldi (VGG group)
Reconstruction from feature map
Given the feature maps F^1, F^2, F^3, ... of a reference image, let's make the features of another input similar, by changing the input image!
Receptive Field
A Neural Algorithm of Artistic Style
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
How?
Style image + content image → mixed image (neural art).
The style side builds on "Texture Synthesis Using Convolutional Neural Networks"; the content side builds on "Understanding Deep Image Representations by Inverting Them".
The style representation is the Gram matrix, as in texture synthesis.
Neural Art
p: original photo, a: original artwork, x: image to be generated.
Total loss = content loss + style loss:

$$L_{total}(p, a, x) = \alpha L_{content}(p, x) + \beta L_{style}(a, x)$$
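A hedged NumPy sketch of the combined objective, reusing gram_matrix from the texture-synthesis sketch above (layer choices and weights are illustrative):

```python
import numpy as np

def style_transfer_loss(feats_x, feats_p, feats_a,
                        content_layers, style_layers,
                        alpha=1.0, beta=1000.0):
    """Total loss = alpha * content loss + beta * style loss.
    feats_*: dict layer -> (C, H, W) feature maps for the generated
    image x, the photo p, and the artwork a."""
    content = sum(np.sum((feats_x[l] - feats_p[l]) ** 2)
                  for l in content_layers)
    style = 0.0
    for l in style_layers:
        C, H, W = feats_x[l].shape
        Gx = gram_matrix(feats_x[l])   # style of the generated image
        Ga = gram_matrix(feats_a[l])   # style of the artwork
        style += np.sum((Gx - Ga) ** 2) / (4 * C**2 * (H * W)**2)
    return alpha * content + beta * style
```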
Results
Deep Learning in Computer Vision