1. Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278-2324, November 1998
LeNet
Speaker: Chia-Jung Ni
2. • History of Representative CNN models
• Three key ideas for CNN
• Local Receptive Fields
• Shared Weights
• Sub-sampling
• Model Architecture
• Implementation
• Keras
Outline
Slide: https://drive.google.com/file/d/12YWNNbqB-_JHl0CrNEl6loINBJoGHgE3/view?usp=sharing
Code: https://drive.google.com/file/d/1wDcDgoF8VSj29ab-cXsN82Q1pxdBiaUx/view?usp=sharing
7. • Architecture of LeNet-5
• Two sets of convolutional and average pooling layers
• Followed by a flattening convolutional layer
• Then two fully-connected layers and finally a softmax classifier
Model Architecture
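As a quick check of these shapes, here is a minimal pure-Python sketch that walks the spatial sizes through LeNet-5 (32×32 input, 5×5 convolutions without padding, 2×2 average pooling with stride 2). The helper name `out_size` is illustrative, not from the paper:

```python
def out_size(size, f, stride=1, pad=0):
    # Standard output-size formula: floor((size - f + 2*pad) / stride) + 1
    return (size - f + 2 * pad) // stride + 1

s = 32                        # input image: 32x32
s = out_size(s, 5)            # C1: 28x28 (6 feature maps)
s = out_size(s, 2, stride=2)  # S2: 14x14
s = out_size(s, 5)            # C3: 10x10 (16 feature maps)
s = out_size(s, 2, stride=2)  # S4: 5x5
s = out_size(s, 5)            # C5: 1x1 (120 maps) -- effectively a 120-dim vector
print(s)  # 1
```

C5 reducing to 1×1 is why it acts as the "flattening" layer: its 120 feature maps form a plain vector feeding the fully-connected layers.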
8. • Similar in role to an activation function
• The feature maps of the first six layers (C1, S2, C3, S4, C5, F6) are all
passed through this nonlinear scaled hyperbolic tangent function
Model Architecture – Squashing Function
f(a) = A·tanh(S·a), where A = 1.7159 and S = 2/3

With this choice of parameters, the equalities f(1) = 1 and f(−1) = −1 are satisfied.

[Figure: plots of f(a), f′(a), and f′′(a)]
Some details
- Symmetric squashing functions yield faster convergence, although
learning can be slow if the weights are too large or too small.
- The absolute value of the second derivative of f(a) reaches its
maximum at +1 and −1, which also improves convergence toward the
end of the learning session.
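The squashing function is easy to verify numerically; a minimal sketch (the name `squash` is illustrative):

```python
import math

A = 1.7159     # amplitude constant from the paper
S = 2.0 / 3.0  # slope constant from the paper

def squash(a):
    # f(a) = A * tanh(S * a)
    return A * math.tanh(S * a)

# With these constants, f(1) and f(-1) come out to +1 and -1
# (to within about 1e-5), as the slide states.
print(squash(1.0), squash(-1.0))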
15. Model Architecture – Output layer (7/7)
Output layer
with Euclidean Radial Basis Function (RBF) units.
The output y_i of each RBF unit is computed as follows:

y_i = Σ_j (x_j − w_ij)²
Loss Function
With the mean squared error (MSE) criterion to measure the discrepancy:

E(W) = (1/P) Σ_{p=1}^{P} y_{D_p}(Z^p, W)

, where y_{D_p} is the output of the D_p-th RBF unit, that is, the one
that corresponds to the correct class of input pattern Z^p.

The output of a particular RBF can be interpreted as a penalty term measuring the fit
between the input pattern and a model of the class associated with the RBF. In
probabilistic terms, the RBF output can be interpreted as the unnormalized negative
log-likelihood of a Gaussian distribution in the space of configurations of layer F6.
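The RBF computation above is just a squared Euclidean distance per class; a minimal pure-Python sketch with a toy 3-class example (the vectors here are illustrative, not the paper's 84-dimensional stylized-character codes):

```python
def rbf_outputs(x, W):
    # y_i = sum_j (x_j - w_ij)^2: squared distance from x to class-i's vector
    return [sum((xj - wij) ** 2 for xj, wij in zip(x, row)) for row in W]

# Toy fixed parameter vectors for 3 classes over a 4-dim F6 output
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
x = [0.9, 0.1, 0.0, 0.0]  # F6 activations close to class 0's vector

y = rbf_outputs(x, W)
best = min(range(len(y)), key=y.__getitem__)
print(best)  # 0 -- the smallest penalty identifies the best-fitting class
```

Note that classification picks the minimum output (smallest penalty), unlike a softmax layer where the maximum wins.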
22. Appendix 1. Common to zero pad the border
Example. WLOG
- input 7x7
- 3x3 filter, applied with stride 1
- pad with 1 pixel border => output is 7x7 (spatial size preserved)
In general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with (F-1)/2.
(will preserve size spatially)
• F = 3 => zero pad with 1
• F = 5 => zero pad with 2
W_l = int((W_{l−1} − F + 2P) / S + 1)
H_l = int((H_{l−1} − F + 2P) / S + 1)
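The size formula and the (F−1)/2 padding rule can be checked in a few lines; a minimal sketch (the helper name `conv_out` is illustrative):

```python
def conv_out(w, f, stride=1, pad=0):
    # W_l = int((W_{l-1} - F + 2P) / S + 1)
    return int((w - f + 2 * pad) / stride + 1)

# The slide's example: 7x7 input, 3x3 filter, stride 1, 1-pixel zero pad
print(conv_out(7, 3, pad=1))  # 7 -- size preserved

# With stride 1, zero-padding with (F - 1) // 2 preserves any input size
for f in (3, 5, 7):
    assert conv_out(28, f, pad=(f - 1) // 2) == 28
```

Without padding the same convolution shrinks the map (e.g. 7×7 with a 3×3 filter gives 5×5), which is why LeNet-5's unpadded 5×5 convolutions reduce 32×32 to 28×28.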
23. Appendix 2. Sub-Sampling vs. Pooling
• Sub-Sampling is simply Average-Pooling followed by a learnable weight
and bias per feature map.
• Sub-Sampling is thus a generalization of Average-Pooling.
Example (2×2 windows, stride 2):

Input
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Avg-pooling
1.5 1.75
1.5 1.75

Sub-sampling
1.5w + b   1.75w + b
1.5w + b   1.75w + b
, where w and b ∈ R are learnable per feature map.
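The example above can be reproduced in a few lines of pure Python; a minimal sketch (function names are illustrative) that writes sub-sampling as average pooling followed by a trainable scalar weight and bias. (The original paper actually sums the four inputs before scaling; the factor 1/4 is simply absorbed into w.)

```python
def avg_pool_2x2(x):
    # Non-overlapping 2x2 average pooling on a list-of-lists matrix
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def subsample_2x2(x, w, b):
    # LeNet-style sub-sampling: pooled value scaled by a learnable
    # weight w and shifted by a learnable bias b (one pair per feature map)
    return [[w * v + b for v in row] for row in avg_pool_2x2(x)]

x = [[1, 2, 2, 0],
     [1, 2, 3, 2],
     [3, 1, 3, 2],
     [0, 2, 0, 2]]

print(avg_pool_2x2(x))             # [[1.5, 1.75], [1.5, 1.75]]
print(subsample_2x2(x, 1.0, 0.0))  # w=1, b=0 recovers plain average pooling
```

Setting w = 1 and b = 0 recovers average pooling exactly, which is the sense in which sub-sampling generalizes it.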