Case Study of CNN
from LeNet to ResNet
NamHyuk Ahn @ Ajou Univ.
2016. 03. 09
Convolutional Neural Network
Convolution Layer
- Convolve the image with a filter (a 3-D dot product over each local patch)
- Stack several filters in one layer; each filter produces one output channel
  (the blue and green outputs in the figure)
Convolution Layer
- Local connectivity
  • Instead of connecting every pixel to every neuron, connect each neuron
    only to a local region of the input (its receptive field)
  • This greatly reduces the number of parameters
- Parameter sharing
  • To reduce parameters further, all neurons in one output channel share the
    same filter (# of filters == # of output channels)
Convolution Layer
- Example) 1st conv layer in AlexNet
  • Input: [224, 224, 3], filters: 96 of size [11, 11, 3] (stride 4), output: [55, 55, 96]
- Each filter extracts a different feature (e.g. horizontal edge, vertical edge, …)
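A minimal Python sketch of the arithmetic behind this example; the stride (4) and the 227x227 effective input are assumptions from common AlexNet implementations (the paper states 224):

    # Output size and parameter count of AlexNet's first conv layer.
    def conv_output_size(input_size, filter_size, stride, pad=0):
        return (input_size - filter_size + 2 * pad) // stride + 1

    out_size = conv_output_size(227, 11, stride=4)   # -> 55
    num_params = 96 * (11 * 11 * 3) + 96             # weights + biases -> 34,944
    print(out_size, num_params)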
Pooling Layer
- Downsample the feature map to reduce its spatial size (and the computation
  in later layers)
- Usually max pooling is used (take the maximum value in each region)
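A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single channel (real layers also handle padding and multiple channels):

    import numpy as np

    def max_pool2x2(x):
        # take the maximum of each non-overlapping 2x2 region
        h, w = x.shape
        x = x[:h - h % 2, :w - w % 2]
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2x2(x))   # [[ 5.  7.] [13. 15.]]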
ReLU, FC Layer
- ReLU
  • A type of activation function (alternatives: sigmoid, tanh, …)
- Fully-connected layer
  • Same as in an ordinary neural network
Convolutional Neural Network
Training CNN
1. Compute the loss with forward-prop
2. Optimize the parameters w.r.t. the loss with back-prop
  • Use a gradient descent method (SGD)
  • The gradient of each weight is computed with the chain rule of partial derivatives
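A toy sketch of one training step: a single linear neuron with squared loss, where the gradient comes from the chain rule just as back-prop applies it layer by layer (values are made up):

    import numpy as np

    x, y = np.array([1.0, 2.0]), 3.0     # one training example
    w = np.array([0.5, -0.5])            # weights
    lr = 0.1                             # learning rate

    pred = w @ x                         # forward-prop
    loss = (pred - y) ** 2               # loss
    grad_w = 2 * (pred - y) * x          # chain rule: dL/dpred * dpred/dw
    w = w - lr * grad_w                  # SGD update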
ILSVRC trend
AlexNet (2012)
(ILSVRC 2012 winner)
AlexNet
- ReLU
- Data augmentation
- Dropout
- Ensemble of CNNs (1 CNN: 18.2%, 7 CNNs: 15.4% top-5 error)
AlexNet
- Other techniques (not covered today)
• SGD + momentum (+ mini-batch)
• Multiple GPU
• Weight Decay
• Local Response Normalization
Problems of sigmoid
- Gradient vanishing
  • When the gradient passes through a sigmoid it can vanish, because the
    local gradient of the sigmoid is almost zero in the saturated regions
- Output is not zero-centered
  • The gradients on the incoming weights then all share the same sign,
    which slows down training
ReLU
- SGD converges faster than with sigmoid-like activations
- Computationally cheap
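A small NumPy comparison of the local gradients, illustrating why ReLU avoids the vanishing problem described above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))  # at most 0.25, ~0 for large |x|
    relu_grad = (x > 0).astype(float)             # 1 for every positive input
    print(sigmoid_grad, relu_grad)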
Data augmentation
- Randomly crop [224, 224] patches from the [256, 256] training images
- At test time, average the predictions over 5 crops (4 corners + center)
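A minimal NumPy sketch of the training-time crop; the horizontal flip is an extra detail from the AlexNet paper, not stated on the slide:

    import numpy as np

    def random_crop(img, size=224):
        h, w = img.shape[:2]
        top = np.random.randint(0, h - size + 1)
        left = np.random.randint(0, w - size + 1)
        crop = img[top:top + size, left:left + size]
        if np.random.rand() < 0.5:
            crop = crop[:, ::-1]          # horizontal flip
        return crop

    img = np.zeros((256, 256, 3))
    print(random_crop(img).shape)         # (224, 224, 3)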
Dropout
- Similar to bagging (an approximation of bagging)
- Acts as a regularizer (reduces overfitting)
- Instead of using all neurons, randomly "drop out" some neurons
  (usually with probability 0.5)
Dropout
• At test time no neurons are dropped; instead the activations are scaled
  (usually by 0.5)
• The scaled value is the expected value of each neuron's training-time output
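A minimal NumPy sketch of the train/test behaviour described above (drop probability 0.5, test-time scaling by 0.5):

    import numpy as np

    p = 0.5                                   # drop probability

    def dropout_train(a):
        mask = (np.random.rand(*a.shape) > p).astype(a.dtype)
        return a * mask                       # roughly half the units are zeroed

    def dropout_test(a):
        return a * (1 - p)                    # expected value of the training output

    a = np.ones(8)
    print(dropout_train(a), dropout_test(a))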
Architecture
- conv - pool - … - fc - softmax (similar to LeNet)
- Uses large filters (e.g. 11x11)
Architecture
- Weights must be initialized randomly
  • Otherwise all neurons receive the same gradient (symmetry is never broken)
  • Usually drawn from a Gaussian distribution with std = 0.01
- Weights are updated with mini-batch SGD with momentum
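A minimal NumPy sketch of this initialization and update; the momentum coefficient 0.9 is an assumption (the value used by AlexNet):

    import numpy as np

    rng = np.random.RandomState(0)
    w = rng.normal(0.0, 0.01, size=(3, 3))    # small random weights break symmetry
    v = np.zeros_like(w)                      # momentum buffer
    lr, mu = 0.01, 0.9

    grad = np.ones_like(w)                    # stand-in for a back-prop gradient
    v = mu * v - lr * grad                    # momentum SGD update
    w = w + v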
VGGNet (2014)
(ILSVRC 2014 2nd)
VGGNet
- Uses only small kernels (always 3x3)
  • Stacking 3x3 convs gives multiple non-linearities (e.g. ReLU) for the
    same receptive field
  • Fewer weights to train (see the sketch below)
- Heavier data augmentation than AlexNet
- Ensemble of 7 models (ILSVRC submission: 7.3% top-5 error)
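A small sketch of the weight comparison from the VGG paper: three stacked 3x3 convs cover a 7x7 receptive field with fewer weights than one 7x7 conv (channel count C = 64 is illustrative):

    C = 64
    stacked_3x3 = 3 * (3 * 3 * C * C)   # 110,592 weights, 3 non-linearities
    single_7x7 = 7 * 7 * C * C          # 200,704 weights, 1 non-linearity
    print(stacked_3x3, single_7x7)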
Architecture
- Most of the memory is consumed in the early conv layers, while most of the
  parameters sit in the fc layers
GoogLeNet - Inception v1 (2014)
(ILSVRC 2014 winner)
GoogLeNet
Inception module
- Uses 1x1, 3x3 and 5x5 convs simultaneously to capture structure at a
  variety of scales
- Dense structure is captured by the 1x1 path, more spread-out structure by
  the 3x3 and 5x5 paths
- Computationally expensive
  • 1x1 conv layers are used first to reduce the channel dimension
    (see the sketch below; the same idea reappears in ResNet)
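A small sketch of why the 1x1 reduction helps, counting multiply-adds of a 5x5 conv on a 28x28 map; the channel sizes (256 -> 64) are illustrative, not the exact GoogLeNet configuration:

    H = W = 28
    direct_5x5 = H * W * (5 * 5 * 256) * 256       # ~1.28e9 multiply-adds
    reduce_1x1 = H * W * (1 * 1 * 256) * 64        # project 256 -> 64 channels
    then_5x5   = H * W * (5 * 5 * 64) * 256        # 5x5 conv on the reduced map
    print(direct_5x5, reduce_1x1 + then_5x5)       # bottleneck is ~4x cheaper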
Auxiliary Classifiers
- Such a deep network raises concerns about how effectively the gradient
  propagates back through all layers
- The auxiliary losses are added to the total loss (weighted by 0.3) during
  training, and the auxiliary branches are removed at test time
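A one-line sketch of how the auxiliary losses enter training (the 0.3 weight is from the slide; the loss values are placeholders):

    main_loss, aux1_loss, aux2_loss = 1.2, 1.5, 1.4          # hypothetical values
    total_loss = main_loss + 0.3 * (aux1_loss + aux2_loss)   # aux branches dropped at test time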
Average Pooling
- Proposed in Network in Network (also used in GoogLeNet)
- Problem of fc layers
  • They need a lot of parameters and overfit easily
- Replace the fc layers with (global) average pooling
Average Pooling
- Make the number of channels in the last conv layer equal to the number of classes
- Average each channel and pass the result to the softmax
- Reduces overfitting
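A minimal NumPy sketch of global average pooling feeding a softmax (10 classes and a 7x7 map are illustrative):

    import numpy as np

    def global_avg_pool(feat):            # feat: [num_classes, H, W]
        return feat.mean(axis=(1, 2))     # one score per channel/class

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    feat = np.random.randn(10, 7, 7)
    print(softmax(global_avg_pool(feat)))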
MSRA ResNet (2015)
(ILSVRC 2015 winner)
Before ResNet…
- Background needed:
  • PReLU
  • Xavier initialization
  • Batch Normalization
PReLU
- An adaptive version of ReLU
- The slope of the function for x < 0 is learned
- Only slightly more parameters (# of layers x # of channels)
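A minimal NumPy sketch of PReLU; the paper learns one slope per channel, a single scalar slope is shown here for brevity:

    import numpy as np

    def prelu(x, a):
        # identity for x > 0, learned slope a for x < 0
        return np.where(x > 0, x, a * x)

    x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
    print(prelu(x, a=0.25))               # a is a trainable parameter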
Xavier Initialization
- If the weights are initialized from a Gaussian with a small std, the
  outputs of the neurons shrink towards zero as the network gets deep
- If the std is increased (e.g. to 1.0), the outputs saturate to -1 or 1
- Xavier initialization sets the scale of the initial values from the number
  of input neurons (fan-in)
- Looks fine, but it assumes a linear activation, so it cannot be used as-is
  in ReLU-like networks
[Figure: per-layer activation histograms; with std = 1.0 the outputs saturate, with std = 0.01 they vanish]
Xavier Initialization / 2
- For ReLU-like activations, halve the fan-in (equivalently Var(w) = 2 / n_in,
  the He initialization), since ReLU zeroes half of the activations
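A minimal NumPy sketch of both initializations, assuming the fan-in form of Xavier (Var = 1/n_in) and its ReLU variant (Var = 2/n_in):

    import numpy as np

    def xavier_init(fan_in, fan_out, rng=np.random):
        return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

    def he_init(fan_in, fan_out, rng=np.random):   # the "Xavier / 2" variant
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

    w = he_init(3 * 3 * 64, 64)           # e.g. a 3x3 conv with 64 input channels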
Batch Normalization
- Make the output of each layer roughly Gaussian; full normalization would be
  costly, so
  • mean and variance are computed per dimension (the dimensions are assumed
    to be uncorrelated)
  • mean and variance are computed over the mini-batch (not the entire set)
- Normalization constrains the non-linearity and constrains the network
  through the uncorrelated-dimensions assumption
  • So the normalized output is linearly transformed again (the scale and
    shift factors are learned parameters)
Batch Normalization
- At test time, the mean and variance of the entire set are used (estimated
  with a moving average)
- BN acts as a regularizer (Dropout is no longer needed)
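A minimal NumPy sketch of the training-time forward pass of BN for a fully-connected layer (the running averages and the backward pass are omitted):

    import numpy as np

    def batch_norm_train(x, gamma, beta, eps=1e-5):   # x: [batch, dim]
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)       # zero mean, unit variance
        return gamma * x_hat + beta                   # learned scale and shift

    x = np.random.randn(32, 4) * 3 + 5
    out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))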
ResNet
Problem of degradation
- More depth gives more accuracy, but deep networks suffer from
  vanishing/exploding gradients
  • BN, Xavier init and Dropout can handle this up to roughly 30 layers
- Going even deeper, a degradation problem appears
  • Not just overfitting: the training error itself also increases
Deep Residual Learning
- Element-wise addition of F(x) and the shortcut connection, followed by a
  ReLU non-linearity
- If the dimensions of x and F(x) differ (the channel count changes), x is
  linearly projected to match (done by a 1x1 conv)
- Similar in spirit to the shortcut/gating idea in LSTM
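A minimal sketch of the residual computation y = ReLU(F(x) + x); F is a placeholder for the conv-BN-ReLU-conv-BN branch, and a 1x1 projection of x would be added when the channel count changes:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)

    def residual_block(x, F):
        return relu(F(x) + x)             # element-wise addition with the shortcut

    x = np.random.randn(64)
    F = lambda v: 0.1 * v                 # stand-in for the residual branch
    y = residual_block(x, F)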
Deeper Bottleneck
- To reduce training time, the block is modified into a bottleneck design
  (purely for economical reasons)
  • Left: (3x3x64)x64 + (3x3x64)x64 = 73,728 weights
  • Right: (1x1x256)x64 + (3x3x64)x64 + (1x1x64)x256 = 69,632 weights
  • The right block works on 4x more channels (256) yet has a similar number
    of parameters
  • A similar trick is also used in GoogLeNet
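The weight counts behind this comparison (biases ignored), matching the numbers above:

    plain      = 2 * (3 * 3 * 64 * 64)                              # two 3x3, 64 -> 64
    bottleneck = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)
    print(plain, bottleneck)                                        # 73,728 vs 69,632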
ResNet
- Data augmentation as in AlexNet
- Batch Normalization (no Dropout)
- Xavier / 2 (He) initialization
- Global average pooling
- The overall structure follows the VGGNet style
Conclusion
ILSVRC top-5 error by model (from the chart):
- AlexNet (2012): 15.31%
- VGGNet (2014): 7.32%
- Inception-V1 (2014): 6.66%
- Human: 5.1%
- PReLU-net (2015): 4.94%
- BN-Inception (2015): 4.82%
- ResNet-152 (2015): 3.57%
- Inception-ResNet (2016): 3.1%
Conclusion
- Dropout, BN
- ReLU-like activations (e.g. PReLU, ELU, …)
- Xavier initialization
- Average pooling
- Use a pre-trained model :)
Reference
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep
convolutional neural networks." Advances in neural information processing systems. 2012.
- Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image
recognition." arXiv preprint arXiv:1409.1556 (2014).
- Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
- He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
- He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385
(2015).
- Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, Inception-ResNet and the
Impact of Residual Connections on Learning." arXiv preprint arXiv:1602.07261 (2016).
- Gu, Jiuxiang, et al. "Recent Advances in Convolutional Neural Networks." arXiv preprint
  arXiv:1512.07108 (2015). (good as a tutorial)
- Thanks also to CS231n; some figures are taken from the CS231n lecture slides.
  See http://cs231n.stanford.edu/index.html
