Case Study of CNN
from LeNet to ResNet
NamHyuk Ahn @ Ajou Univ.
2016. 03. 09
Convolutional Neural Network
Convolution Layer
- Convolve the image with a filter (a 3-D dot product over each local patch)
- Stack several filters in one layer; each filter produces one output channel
  (the blue and green outputs in the figure)
Convolution Layer
- Local connectivity
  • Instead of connecting every pixel to every neuron, connect each neuron
    only to a local region of the input (its receptive field)
  • This greatly reduces the number of parameters
- Parameter sharing
  • To reduce parameters further, all neurons in one output channel share the
    same filter (# of filters == # of output channels)
Convolution Layer
- Example) 1st conv layer in AlexNet
  • Input: [224, 224, 3], filters: 96 of size [11, 11, 3] (stride 4), output: [55, 55, 96]
- Each filter extracts a different feature (e.g. horizontal edge, vertical edge, …)
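A minimal Python sketch of the arithmetic behind this example; the stride (4) and the 227x227 effective input are assumptions from common AlexNet implementations (the paper states 224):

    # Output size and parameter count of AlexNet's first conv layer.
    def conv_output_size(input_size, filter_size, stride, pad=0):
        return (input_size - filter_size + 2 * pad) // stride + 1

    out_size = conv_output_size(227, 11, stride=4)   # -> 55
    num_params = 96 * (11 * 11 * 3) + 96             # weights + biases -> 34,944
    print(out_size, num_params)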
Pooling Layer
- Downsample the feature map to reduce its spatial size (and the computation
  in later layers)
- Usually max pooling is used (take the maximum value in each region)
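A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single channel (real layers also handle padding and multiple channels):

    import numpy as np

    def max_pool2x2(x):
        # take the maximum of each non-overlapping 2x2 region
        h, w = x.shape
        x = x[:h - h % 2, :w - w % 2]
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2x2(x))   # [[ 5.  7.] [13. 15.]]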
ReLU, FC Layer
- ReLU
  • A type of activation function (alternatives: sigmoid, tanh, …)
- Fully-connected layer
  • Same as in an ordinary neural network
Convolutional Neural Network
Training CNN
1. Compute the loss with forward-prop
2. Optimize the parameters w.r.t. the loss with back-prop
  • Use a gradient descent method (SGD)
  • The gradient of each weight is computed with the chain rule of partial derivatives
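A toy sketch of one training step: a single linear neuron with squared loss, where the gradient comes from the chain rule just as back-prop applies it layer by layer (values are made up):

    import numpy as np

    x, y = np.array([1.0, 2.0]), 3.0     # one training example
    w = np.array([0.5, -0.5])            # weights
    lr = 0.1                             # learning rate

    pred = w @ x                         # forward-prop
    loss = (pred - y) ** 2               # loss
    grad_w = 2 * (pred - y) * x          # chain rule: dL/dpred * dpred/dw
    w = w - lr * grad_w                  # SGD update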
ILSVRC trend
AlexNet (2012)
(ILSVRC 2012 winner)
AlexNet
- ReLU
- Data augmentation
- Dropout
- Ensemble of CNNs (1 CNN: 18.2%, 7 CNNs: 15.4% top-5 error)
AlexNet
- Other techniques (not covered today)
• SGD + momentum (+ mini-batch)
• Multiple GPU
• Weight Decay
• Local Response Normalization
Problems of sigmoid
- Gradient vanishing
  • When the gradient passes through a sigmoid it can vanish, because the
    local gradient of the sigmoid is almost zero in the saturated regions
- Output is not zero-centered
  • The gradients on the incoming weights then all share the same sign,
    which slows down training
ReLU
- SGD converges faster than with sigmoid-like activations
- Computationally cheap
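A small NumPy comparison of the local gradients, illustrating why ReLU avoids the vanishing problem described above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))  # at most 0.25, ~0 for large |x|
    relu_grad = (x > 0).astype(float)             # 1 for every positive input
    print(sigmoid_grad, relu_grad)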
Data augmentation
- Randomly crop [224, 224] patches from the [256, 256] training images
- At test time, average the predictions over 5 crops (4 corners + center)
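A minimal NumPy sketch of the training-time crop; the horizontal flip is an extra detail from the AlexNet paper, not stated on the slide:

    import numpy as np

    def random_crop(img, size=224):
        h, w = img.shape[:2]
        top = np.random.randint(0, h - size + 1)
        left = np.random.randint(0, w - size + 1)
        crop = img[top:top + size, left:left + size]
        if np.random.rand() < 0.5:
            crop = crop[:, ::-1]          # horizontal flip
        return crop

    img = np.zeros((256, 256, 3))
    print(random_crop(img).shape)         # (224, 224, 3)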
Dropout
- Similar to bagging (an approximation of bagging)
- Acts as a regularizer (reduces overfitting)
- Instead of using all neurons, randomly "drop out" some neurons
  (usually with probability 0.5)
Dropout
• At test time no neurons are dropped; instead the activations are scaled
  (usually by 0.5)
• The scaled value is the expected value of each neuron's training-time output
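A minimal NumPy sketch of the train/test behaviour described above (drop probability 0.5, test-time scaling by 0.5):

    import numpy as np

    p = 0.5                                   # drop probability

    def dropout_train(a):
        mask = (np.random.rand(*a.shape) > p).astype(a.dtype)
        return a * mask                       # roughly half the units are zeroed

    def dropout_test(a):
        return a * (1 - p)                    # expected value of the training output

    a = np.ones(8)
    print(dropout_train(a), dropout_test(a))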
Architecture
- conv - pool - … - fc - softmax (similar to LeNet)
- Uses large filters (e.g. 11x11)
Architecture
- Weights must be initialized randomly
  • Otherwise all neurons receive the same gradient (symmetry is never broken)
  • Usually drawn from a Gaussian distribution with std = 0.01
- Weights are updated with mini-batch SGD with momentum
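A minimal NumPy sketch of this initialization and update; the momentum coefficient 0.9 is an assumption (the value used by AlexNet):

    import numpy as np

    rng = np.random.RandomState(0)
    w = rng.normal(0.0, 0.01, size=(3, 3))    # small random weights break symmetry
    v = np.zeros_like(w)                      # momentum buffer
    lr, mu = 0.01, 0.9

    grad = np.ones_like(w)                    # stand-in for a back-prop gradient
    v = mu * v - lr * grad                    # momentum SGD update
    w = w + v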
VGGNet (2014)
(ILSVRC 2014 2nd)
VGGNet
- Uses only small kernels (always 3x3)
  • Stacking 3x3 convs gives multiple non-linearities (e.g. ReLU) for the
    same receptive field
  • Fewer weights to train (see the sketch below)
- Heavier data augmentation than AlexNet
- Ensemble of 7 models (ILSVRC submission: 7.3% top-5 error)
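A small sketch of the weight comparison from the VGG paper: three stacked 3x3 convs cover a 7x7 receptive field with fewer weights than one 7x7 conv (channel count C = 64 is illustrative):

    C = 64
    stacked_3x3 = 3 * (3 * 3 * C * C)   # 110,592 weights, 3 non-linearities
    single_7x7 = 7 * 7 * C * C          # 200,704 weights, 1 non-linearity
    print(stacked_3x3, single_7x7)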
Architecture
- Most of the memory is consumed in the early conv layers, while most of the
  parameters sit in the fc layers
GoogLeNet - Inception v1 (2014)
(ILSVRC 2014 winner)
GoogLeNet
Inception module
- Uses 1x1, 3x3 and 5x5 convs simultaneously to capture structure at a
  variety of scales
- Dense structure is captured by the 1x1 path, more spread-out structure by
  the 3x3 and 5x5 paths
- Computationally expensive
  • 1x1 conv layers are used first to reduce the channel dimension
    (see the sketch below; the same idea reappears in ResNet)
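A small sketch of why the 1x1 reduction helps, counting multiply-adds of a 5x5 conv on a 28x28 map; the channel sizes (256 -> 64) are illustrative, not the exact GoogLeNet configuration:

    H = W = 28
    direct_5x5 = H * W * (5 * 5 * 256) * 256       # ~1.28e9 multiply-adds
    reduce_1x1 = H * W * (1 * 1 * 256) * 64        # project 256 -> 64 channels
    then_5x5   = H * W * (5 * 5 * 64) * 256        # 5x5 conv on the reduced map
    print(direct_5x5, reduce_1x1 + then_5x5)       # bottleneck is ~4x cheaper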
Auxiliary Classifiers
- Such a deep network raises concerns about how effectively the gradient
  propagates back through all layers
- The auxiliary losses are added to the total loss (weighted by 0.3) during
  training, and the auxiliary branches are removed at test time
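A one-line sketch of how the auxiliary losses enter training (the 0.3 weight is from the slide; the loss values are placeholders):

    main_loss, aux1_loss, aux2_loss = 1.2, 1.5, 1.4          # hypothetical values
    total_loss = main_loss + 0.3 * (aux1_loss + aux2_loss)   # aux branches dropped at test time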
Average Pooling
- Proposed in Network in Network (also used in GoogLeNet)
- Problem of fc layers
  • They need a lot of parameters and overfit easily
- Replace the fc layers with (global) average pooling
Average Pooling
- Make the number of channels in the last conv layer equal to the number of classes
- Average each channel and pass the result to the softmax
- Reduces overfitting
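A minimal NumPy sketch of global average pooling feeding a softmax (10 classes and a 7x7 map are illustrative):

    import numpy as np

    def global_avg_pool(feat):            # feat: [num_classes, H, W]
        return feat.mean(axis=(1, 2))     # one score per channel/class

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    feat = np.random.randn(10, 7, 7)
    print(softmax(global_avg_pool(feat)))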
MSRA ResNet (2015)
(ILSVRC 2015 winner)
Before ResNet…
- Background needed:
  • PReLU
  • Xavier initialization
  • Batch Normalization
PReLU
- An adaptive version of ReLU
- The slope of the function for x < 0 is learned
- Only slightly more parameters (# of layers x # of channels)
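A minimal NumPy sketch of PReLU; the paper learns one slope per channel, a single scalar slope is shown here for brevity:

    import numpy as np

    def prelu(x, a):
        # identity for x > 0, learned slope a for x < 0
        return np.where(x > 0, x, a * x)

    x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
    print(prelu(x, a=0.25))               # a is a trainable parameter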
Xavier Initialization
- If the weights are initialized from a Gaussian with a small std, the
  outputs of the neurons shrink towards zero as the network gets deep
- If the std is increased (e.g. to 1.0), the outputs saturate to -1 or 1
- Xavier initialization sets the scale of the initial values from the number
  of input neurons (fan-in)
- Looks fine, but it assumes a linear activation, so it cannot be used as-is
  in ReLU-like networks
[Figure: per-layer activation histograms; with std = 1.0 the outputs saturate, with std = 0.01 they vanish]
Xavier Initialization / 2
- For ReLU-like activations, halve the fan-in (equivalently Var(w) = 2 / n_in,
  the He initialization), since ReLU zeroes half of the activations
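A minimal NumPy sketch of both initializations, assuming the fan-in form of Xavier (Var = 1/n_in) and its ReLU variant (Var = 2/n_in):

    import numpy as np

    def xavier_init(fan_in, fan_out, rng=np.random):
        return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

    def he_init(fan_in, fan_out, rng=np.random):   # the "Xavier / 2" variant
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

    w = he_init(3 * 3 * 64, 64)           # e.g. a 3x3 conv with 64 input channels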
Batch Normalization
- Make the output of each layer roughly Gaussian; full normalization would be
  costly, so
  • mean and variance are computed per dimension (the dimensions are assumed
    to be uncorrelated)
  • mean and variance are computed over the mini-batch (not the entire set)
- Normalization constrains the non-linearity and constrains the network
  through the uncorrelated-dimensions assumption
  • So the normalized output is linearly transformed again (the scale and
    shift factors are learned parameters)
Batch Normalization
- At test time, the mean and variance of the entire set are used (estimated
  with a moving average)
- BN acts as a regularizer (Dropout is no longer needed)
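A minimal NumPy sketch of the training-time forward pass of BN for a fully-connected layer (the running averages and the backward pass are omitted):

    import numpy as np

    def batch_norm_train(x, gamma, beta, eps=1e-5):   # x: [batch, dim]
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)       # zero mean, unit variance
        return gamma * x_hat + beta                   # learned scale and shift

    x = np.random.randn(32, 4) * 3 + 5
    out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))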
ResNet
Problem of degradation
- More depth gives more accuracy, but deep networks suffer from
  vanishing/exploding gradients
  • BN, Xavier init and Dropout can handle this up to roughly 30 layers
- Going even deeper, a degradation problem appears
  • Not just overfitting: the training error itself also increases
Deep Residual Learning
- Element-wise addition of F(x) and the shortcut connection, followed by a
  ReLU non-linearity
- If the dimensions of x and F(x) differ (the channel count changes), x is
  linearly projected to match (done by a 1x1 conv)
- Similar in spirit to the shortcut/gating idea in LSTM
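A minimal sketch of the residual computation y = ReLU(F(x) + x); F is a placeholder for the conv-BN-ReLU-conv-BN branch, and a 1x1 projection of x would be added when the channel count changes:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)

    def residual_block(x, F):
        return relu(F(x) + x)             # element-wise addition with the shortcut

    x = np.random.randn(64)
    F = lambda v: 0.1 * v                 # stand-in for the residual branch
    y = residual_block(x, F)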
Deeper Bottleneck
- To reduce training time, the block is modified into a bottleneck design
  (purely for economical reasons)
  • Left: (3x3x64)x64 + (3x3x64)x64 = 73,728 weights
  • Right: (1x1x256)x64 + (3x3x64)x64 + (1x1x64)x256 = 69,632 weights
  • The right block works on 4x more channels (256) yet has a similar number
    of parameters
  • A similar trick is also used in GoogLeNet
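The weight counts behind this comparison (biases ignored), matching the numbers above:

    plain      = 2 * (3 * 3 * 64 * 64)                              # two 3x3, 64 -> 64
    bottleneck = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)
    print(plain, bottleneck)                                        # 73,728 vs 69,632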
ResNet
- Data augmentation as in AlexNet
- Batch Normalization (no Dropout)
- Xavier / 2 (He) initialization
- Global average pooling
- The overall structure follows the VGGNet style
Conclusion
ILSVRC top-5 error by model (from the chart):
- AlexNet (2012): 15.31%
- VGGNet (2014): 7.32%
- Inception-V1 (2014): 6.66%
- Human: 5.1%
- PReLU-net (2015): 4.94%
- BN-Inception (2015): 4.82%
- ResNet-152 (2015): 3.57%
- Inception-ResNet (2016): 3.1%
Conclusion
- Dropout, BN
- ReLU-like activations (e.g. PReLU, ELU, …)
- Xavier initialization
- Average pooling
- Use a pre-trained model :)
Reference
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep
convolutional neural networks." Advances in neural information processing systems. 2012.
- Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image
recognition." arXiv preprint arXiv:1409.1556 (2014).
- Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
- He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
- He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385
(2015).
- Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, Inception-ResNet and the
Impact of Residual Connections on Learning." arXiv preprint arXiv:1602.07261 (2016).
- Gu, Jiuxiang, et al. "Recent Advances in Convolutional Neural Networks." arXiv preprint
  arXiv:1512.07108 (2015). (good as a tutorial)
- Thanks also to CS231n; some figures are taken from the CS231n lecture slides.
  See http://cs231n.stanford.edu/index.html
