1. Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278-2324, November 1998
LeNet
Speaker: Chia-Jung Ni
2. • History of Representative CNN models
• Three key ideas for CNN
• Local Receptive Fields
• Shared Weights
• Sub-sampling
• Model Architecture
• Implementation
• Keras
Outline
Slide: https://drive.google.com/file/d/12YWNNbqB-_JHl0CrNEl6loINBJoGHgE3/view?usp=sharing
Code: https://drive.google.com/file/d/1wDcDgoF8VSj29ab-cXsN82Q1pxdBiaUx/view?usp=sharing
7. • Architecture of LeNet-5
• Two sets of convolutional and average pooling layers
• Followed by a flattening convolutional layer
• Then two fully-connected layers and finally a softmax classifier
Model Architecture
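As a quick check of these shapes, here is a minimal pure-Python sketch that walks the spatial sizes through LeNet-5 (32×32 input, 5×5 convolutions without padding, 2×2 average pooling with stride 2). The helper name `out_size` is illustrative, not from the paper:

```python
def out_size(size, f, stride=1, pad=0):
    # Standard output-size formula: floor((size - f + 2*pad) / stride) + 1
    return (size - f + 2 * pad) // stride + 1

s = 32                        # input image: 32x32
s = out_size(s, 5)            # C1: 28x28 (6 feature maps)
s = out_size(s, 2, stride=2)  # S2: 14x14
s = out_size(s, 5)            # C3: 10x10 (16 feature maps)
s = out_size(s, 2, stride=2)  # S4: 5x5
s = out_size(s, 5)            # C5: 1x1 (120 maps) -- effectively a 120-dim vector
print(s)  # 1
```

C5 reducing to 1×1 is why it acts as the "flattening" layer: its 120 feature maps form a plain vector feeding the fully-connected layers.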
8. • Similar in role to an activation function
• The feature maps of the first six layers (C1, S2, C3, S4, C5, F6) are all
passed through this nonlinear scaled hyperbolic tangent function
Model Architecture – Squashing Function
f(a) = A·tanh(S·a), where A = 1.7159 and S = 2/3

With this choice of parameters, the equalities f(1) = 1 and f(−1) = −1 are satisfied.

[Figure: plots of f(a), f′(a), and f′′(a)]
Some details
- Symmetric squashing functions yield faster convergence, although
learning can be slow if the weights are too large or too small.
- The absolute value of the second derivative of f(a) reaches its
maximum at +1 and −1, which also improves convergence toward the
end of the learning session.
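The squashing function is easy to verify numerically; a minimal sketch (the name `squash` is illustrative):

```python
import math

A = 1.7159     # amplitude constant from the paper
S = 2.0 / 3.0  # slope constant from the paper

def squash(a):
    # f(a) = A * tanh(S * a)
    return A * math.tanh(S * a)

# With these constants, f(1) and f(-1) come out to +1 and -1
# (to within about 1e-5), as the slide states.
print(squash(1.0), squash(-1.0))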
15. Model Architecture – Output layer (7/7)
Output layer
with Euclidean Radial Basis Function (RBF) units.
The output y_i of each RBF unit is computed as follows:

y_i = Σ_j (x_j − w_ij)²
Loss Function
With the mean squared error (MSE) criterion to measure the discrepancy:

E(W) = (1/P) Σ_{p=1}^{P} y_{D_p}(Z^p, W)

, where y_{D_p} is the output of the D_p-th RBF unit, that is, the one
that corresponds to the correct class of input pattern Z^p.

The output of a particular RBF can be interpreted as a penalty term measuring the fit
between the input pattern and a model of the class associated with the RBF. In
probabilistic terms, the RBF output can be interpreted as the unnormalized negative
log-likelihood of a Gaussian distribution in the space of configurations of layer F6.
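The RBF computation above is just a squared Euclidean distance per class; a minimal pure-Python sketch with a toy 3-class example (the vectors here are illustrative, not the paper's 84-dimensional stylized-character codes):

```python
def rbf_outputs(x, W):
    # y_i = sum_j (x_j - w_ij)^2: squared distance from x to class-i's vector
    return [sum((xj - wij) ** 2 for xj, wij in zip(x, row)) for row in W]

# Toy fixed parameter vectors for 3 classes over a 4-dim F6 output
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
x = [0.9, 0.1, 0.0, 0.0]  # F6 activations close to class 0's vector

y = rbf_outputs(x, W)
best = min(range(len(y)), key=y.__getitem__)
print(best)  # 0 -- the smallest penalty identifies the best-fitting class
```

Note that classification picks the minimum output (smallest penalty), unlike a softmax layer where the maximum wins.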
22. Appendix 1. Common to zero pad the border
Example. WLOG
- input 7x7
- 3x3 filter, applied with stride 1
- pad with 1 pixel border => output is 7x7 (spatial size preserved)
In general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with (F-1)/2.
(will preserve size spatially)
• F = 3 => zero pad with 1
• F = 5 => zero pad with 2
W_l = int((W_{l−1} − F + 2P) / S + 1)
H_l = int((H_{l−1} − F + 2P) / S + 1)
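The size formula and the (F−1)/2 padding rule can be checked in a few lines; a minimal sketch (the helper name `conv_out` is illustrative):

```python
def conv_out(w, f, stride=1, pad=0):
    # W_l = int((W_{l-1} - F + 2P) / S + 1)
    return int((w - f + 2 * pad) / stride + 1)

# The slide's example: 7x7 input, 3x3 filter, stride 1, 1-pixel zero pad
print(conv_out(7, 3, pad=1))  # 7 -- size preserved

# With stride 1, zero-padding with (F - 1) // 2 preserves any input size
for f in (3, 5, 7):
    assert conv_out(28, f, pad=(f - 1) // 2) == 28
```

Without padding the same convolution shrinks the map (e.g. 7×7 with a 3×3 filter gives 5×5), which is why LeNet-5's unpadded 5×5 convolutions reduce 32×32 to 28×28.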
23. Appendix 2. Sub-Sampling vs. Pooling
• Sub-Sampling is simply Average-Pooling followed by a learnable weight
and bias per feature map.
• Sub-Sampling is thus a generalization of Average-Pooling.
Example (2×2 windows, stride 2):

Input
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Avg-pooling
1.5 1.75
1.5 1.75

Sub-sampling
1.5w + b   1.75w + b
1.5w + b   1.75w + b
, where w and b ∈ R are learnable per feature map.
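The example above can be reproduced in a few lines of pure Python; a minimal sketch (function names are illustrative) that writes sub-sampling as average pooling followed by a trainable scalar weight and bias. (The original paper actually sums the four inputs before scaling; the factor 1/4 is simply absorbed into w.)

```python
def avg_pool_2x2(x):
    # Non-overlapping 2x2 average pooling on a list-of-lists matrix
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def subsample_2x2(x, w, b):
    # LeNet-style sub-sampling: pooled value scaled by a learnable
    # weight w and shifted by a learnable bias b (one pair per feature map)
    return [[w * v + b for v in row] for row in avg_pool_2x2(x)]

x = [[1, 2, 2, 0],
     [1, 2, 3, 2],
     [3, 1, 3, 2],
     [0, 2, 0, 2]]

print(avg_pool_2x2(x))             # [[1.5, 1.75], [1.5, 1.75]]
print(subsample_2x2(x, 1.0, 0.0))  # w=1, b=0 recovers plain average pooling
```

Setting w = 1 and b = 0 recovers average pooling exactly, which is the sense in which sub-sampling generalizes it.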