Case Study of CNN
from LeNet to ResNet
NamHyuk Ahn @ Ajou Univ.
2016. 03. 09
Presentation file in my lab seminar.

  1. Case Study of CNN from LeNet to ResNet
     NamHyuk Ahn @ Ajou Univ. 2016. 03. 09
  2. Convolutional Neural Network
  3. Convolution Layer
     - Convolution (3-dim dot product) of the image and a filter
     - Stack filters in one layer (see the blue and green outputs, called channels)
  4. Convolution Layer
     - Local connectivity
       • Instead of connecting every pixel to every neuron, connect only a local region of the input (called the receptive field)
       • This greatly reduces the number of parameters
     - Parameter sharing
       • To reduce parameters further, every spatial position in a channel shares the same filter (# of filters == # of output channels)
  5. Convolution Layer
     - Example: 1st conv layer in AlexNet
       • Input: [224, 224], filters: [11x11x3] x 96, output: [55, 55]
     - Each filter extracts different features (e.g. horizontal edges, vertical edges, ...)
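     A quick check of that output shape with the standard conv output-size formula, out = (W - F + 2P)/S + 1 (a minimal sketch; the stride of 4 and the 227x227 effective input that make the arithmetic exact are assumptions from the usual AlexNet description, not from the slide):

         def conv_output_size(w, f, stride=1, pad=0):
             # standard formula: floor((W - F + 2*P) / S) + 1
             return (w - f + 2 * pad) // stride + 1

         # AlexNet conv1: 11x11 filters, stride 4 -> a [55, 55] map for each of the 96 filters
         print(conv_output_size(227, 11, stride=4))  # 55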
  6. Pooling Layer
     - Downsample the feature map to reduce parameters
     - Usually max pooling is used (take the maximum value in each region)
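     A minimal NumPy sketch of 2x2 max pooling on a toy single-channel map (names and sizes are illustrative only):

         import numpy as np

         x = np.arange(16, dtype=float).reshape(4, 4)     # toy 4x4 feature map
         # split into non-overlapping 2x2 regions and keep the maximum of each
         pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
         print(pooled.shape)                              # (2, 2): downsampled by 2 in each dim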
  7. ReLU, FC Layer
     - ReLU
       • A kind of activation function (like sigmoid, tanh, ...)
     - Fully-connected layer
       • Same as in a normal neural network
  8. Convolutional Neural Network
  9. Training CNN
     1. Calculate the loss function with forward-prop
     2. Optimize the parameters w.r.t. the loss function with back-prop
       • Use a gradient descent method (SGD)
       • The gradient of each weight can be calculated with the chain rule of partial derivatives
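     A minimal sketch of that forward/backward/update loop on a toy linear model (all data and names here are illustrative; a real CNN only differs in how many layers the forward and backward passes go through):

         import numpy as np

         rng = np.random.default_rng(0)
         X = rng.normal(size=(64, 10))              # toy inputs (one mini-batch)
         y = rng.normal(size=(64, 1))               # toy targets
         W = rng.normal(scale=0.01, size=(10, 1))   # randomly initialized weights

         lr = 0.1
         for step in range(100):
             pred = X @ W                              # 1. forward-prop
             loss = ((pred - y) ** 2).mean()           #    loss function (MSE here)
             grad_W = 2 * X.T @ (pred - y) / len(X)    # 2. back-prop via the chain rule
             W -= lr * grad_W                          #    SGD update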
  10. ILSVRC trend
  11. AlexNet (2012) (ILSVRC 2012 winner)
  12. AlexNet
     - ReLU
     - Data augmentation
     - Dropout
     - Ensemble CNN (1-CNN 18.2%, 7-CNN 15.4%)
  13. AlexNet
     - Other methods (not covered today)
       • SGD + momentum (+ mini-batch)
       • Multiple GPUs
       • Weight decay
       • Local Response Normalization
  14. Problems of sigmoid
     - Gradient vanishing
       • When the gradient passes through a sigmoid, it can vanish because the local gradient of the sigmoid can be almost zero
     - Output is not zero-centered
       • This hurts optimization
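     The vanishing part is easy to see numerically: the local gradient of the sigmoid is s(x) * (1 - s(x)), which peaks at 0.25 and decays towards zero for large |x| (a small check, not from the slides):

         import numpy as np

         def sigmoid(x):
             return 1.0 / (1.0 + np.exp(-x))

         x = np.linspace(-10, 10, 1001)
         local_grad = sigmoid(x) * (1 - sigmoid(x))
         print(local_grad.max())   # 0.25, at x = 0
         print(local_grad[0])      # ~4.5e-05 at x = -10: repeated through layers, this vanishes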
  15. ReLU
     - SGD converges faster than with sigmoid-like activations
     - Computationally cheap
  16. Data augmentation
     - Randomly crop [256, 256] images to [224, 224]
     - At test time, take 5 crops and average the predictions
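     A rough sketch of both crop schemes (a sketch under the slide's 256 -> 224 sizes; the five test crops are assumed to be the four corners plus the center, and model is a placeholder for any classifier):

         import numpy as np

         def random_crop(img, size=224):
             h, w = img.shape[:2]
             top = np.random.randint(0, h - size + 1)
             left = np.random.randint(0, w - size + 1)
             return img[top:top + size, left:left + size]

         def five_crop_predict(img, model, size=224):
             h, w = img.shape[:2]
             offsets = [(0, 0), (0, w - size), (h - size, 0),
                        (h - size, w - size), ((h - size) // 2, (w - size) // 2)]
             crops = [img[t:t + size, l:l + size] for t, l in offsets]
             return np.mean([model(c) for c in crops], axis=0)   # average the five predictions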
  17. Dropout
     - Similar to bagging (an approximation of bagging)
     - Acts like a regularizer (reduces overfitting)
     - Instead of using all neurons, “drop out” some neurons randomly (usually with probability 0.5)
  18. Dropout
     • At test time, do not drop neurons; instead scale the weights (usually by 0.5)
     • The scaling gives the expected value of each neuron's output
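     A minimal sketch of that train/test behaviour (classic dropout with test-time scaling, as on the slide; names are illustrative):

         import numpy as np

         def dropout_forward(x, p=0.5, train=True):
             if train:
                 mask = np.random.rand(*x.shape) > p   # drop each neuron with probability p
                 return x * mask
             # test time: keep all neurons but scale by the keep probability,
             # so each output matches its expected value during training
             return x * (1 - p)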
  19. Architecture
     - conv - pool - … - fc - softmax (similar to LeNet)
     - Uses large filters (e.g. 11x11)
  20. Architecture
     - Weights must be initialized randomly
       • If not, all neurons get the same gradient
       • Usually a Gaussian distribution with std = 0.01 is used
     - Mini-batch SGD with momentum is used to update the weights
  21. VGGNet (2014) (ILSVRC 2014 2nd place)
  22. VGGNet
     - Uses small kernels (always 3x3)
       • Stacking them allows multiple non-linearities (e.g. ReLU)
       • Fewer weights to train
     - Heavier data augmentation (more than AlexNet)
     - Ensemble of 7 models (ILSVRC submission: 7.3%)
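     The "fewer weights" claim is quick to verify: two stacked 3x3 convs cover the same 5x5 receptive field as a single 5x5 conv but use fewer parameters (a sketch assuming C input and C output channels; biases ignored, and C = 64 is an arbitrary example):

         C = 64
         stacked_3x3 = 2 * (3 * 3 * C * C)   # two 3x3 conv layers
         single_5x5 = 5 * 5 * C * C          # one 5x5 conv layer with the same receptive field
         print(stacked_3x3, single_5x5)      # 73728 vs 102400: ~28% fewer weights, plus an extra ReLU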
  23. Architecture
     - Most of the memory is used in the early layers; most of the parameters are in the fc layers
  24. GoogLeNet - Inception v1 (2014) (ILSVRC 2014 winner)
  25. GoogLeNet
  26. Inception module
     - Use 1x1, 3x3 and 5x5 convs simultaneously to capture a variety of structures
     - Dense structure is captured by the 1x1 conv, more spread-out structure by the 3x3 and 5x5 convs
     - Computationally expensive
       • Use 1x1 conv layers to reduce dimensionality (details later, in ResNet)
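     A small sketch of why the 1x1 "reduce" convs matter, counting the weights of a 5x5 branch with and without reduction (the 192 -> 16 -> 32 channel counts are illustrative assumptions):

         c_in, c_reduce, c_out = 192, 16, 32

         direct_5x5 = 5 * 5 * c_in * c_out                                  # 5x5 conv applied directly
         reduced_5x5 = 1 * 1 * c_in * c_reduce + 5 * 5 * c_reduce * c_out   # 1x1 reduce, then 5x5
         print(direct_5x5, reduced_5x5)   # 153600 vs 15872: roughly 10x fewer weights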
  27. Auxiliary Classifiers
     - A deep network raises concerns about how well the gradient flows in backprop
     - The auxiliary losses are added to the total loss (weighted by 0.3) and removed at test time
  28. Average Pooling
     - Proposed in Network in Network (also used in GoogLeNet)
     - Problems of the fc layer
       • Needs lots of parameters, easy to overfit
     - Replace the fc layer with average pooling
  29. Average Pooling
     - Make the number of channels in the last conv equal to the number of classes
     - Average over each channel and pass the result to softmax
     - Reduces overfitting
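     A minimal sketch of that pooling step, assuming the last conv output is shaped (classes, H, W) (shapes are illustrative):

         import numpy as np

         num_classes, h, w = 10, 7, 7
         last_conv = np.random.rand(num_classes, h, w)   # one channel per class

         logits = last_conv.mean(axis=(1, 2))            # average each channel -> shape (10,)
         probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the pooled values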
  30. MSRA ResNet (2015) (ILSVRC 2015 winner)
  31. Before ResNet...
     - Things to know first:
       • PReLU
       • Xavier initialization
       • Batch Normalization
  32. PReLU
     - Adaptive version of ReLU
     - The slope of the function for x < 0 is learned
     - Slightly more parameters (# of layers x # of channels)
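     A minimal sketch of the PReLU forward pass with one learned slope per channel (names are illustrative; 0.25 is the initial slope used in the PReLU paper):

         import numpy as np

         def prelu(x, alpha):
             # x: (channels, H, W); alpha: one learned slope per channel, used where x < 0
             return np.where(x > 0, x, alpha[:, None, None] * x)

         x = np.random.randn(3, 8, 8)
         alpha = np.full(3, 0.25)
         y = prelu(x, alpha)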
  33. Xavier Initialization
     - If initialized from a Gaussian with a small std, the outputs of neurons become nearly zero when the network is deep
     - If the std is increased (e.g. to 1.0), the outputs saturate to -1 or 1
     - Xavier init chooses the initial scale from the number of input neurons
     - Looks fine, but this method assumes a linear activation, so it cannot be used directly in ReLU-like networks
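     A minimal sketch of the scaling rule (deciding the scale from the number of input neurons), together with the "/ 2" variant that later slides use for ResNet, which compensates for ReLU by doubling the variance (layer sizes are arbitrary):

         import numpy as np

         def xavier_init(n_in, n_out):
             # variance 1/n_in keeps the output variance close to the input variance
             return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

         def xavier_over_2_init(n_in, n_out):
             # "Xavier / 2" (He et al.): variance 2/n_in, suited to ReLU-like activations
             return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

         W1 = xavier_init(512, 256)
         W2 = xavier_over_2_init(512, 256)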
  34. (Figures: "output is saturated" vs. "output is vanished")
  35. (Figures comparing Xavier initialization and Xavier initialization / 2)
  36. Batch Normalization
     - Make the outputs Gaussian-distributed, but full normalization costs a lot
       • Compute mean and variance per dimension (assuming the dimensions are uncorrelated)
       • Compute mean and variance over the mini-batch (not the entire set)
     - Normalization constrains the non-linearity and constrains the network through the uncorrelated-dimensions assumption
       • So the normalized output is linearly transformed (the scale and shift factors are learned parameters)
  37. Batch Normalization
     - At test time, use the mean and variance of the entire set (estimated with a moving average)
     - BN acts like a regularizer (Dropout is not needed)
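     A minimal sketch of the train/test behaviour described on the last two slides: per-dimension statistics over the mini-batch, a learned scale and shift, and moving averages for test time (all names are illustrative):

         import numpy as np

         def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                               train=True, momentum=0.9, eps=1e-5):
             # x: (batch, dims); mean and variance are computed per dimension
             if train:
                 mean = x.mean(axis=0)
                 var = x.var(axis=0)
                 # moving averages are what the test-time pass will use
                 running_mean[:] = momentum * running_mean + (1 - momentum) * mean
                 running_var[:] = momentum * running_var + (1 - momentum) * var
             else:
                 mean, var = running_mean, running_var
             x_hat = (x - mean) / np.sqrt(var + eps)   # normalize
             return gamma * x_hat + beta               # learned linear transform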
  38. ResNet
  39. ResNet
  40. Problem of degradation
     - More depth gives more accuracy, but deep networks can suffer from vanishing/exploding gradients
       • BN, Xavier init and Dropout can handle this (up to ~30 layers)
     - Going even deeper, the degradation problem appears
       • Not just overfitting: the training error also increases
  41. Deep Residual Learning
     - Element-wise addition of F(x) and the shortcut connection, then a ReLU non-linearity
     - If the dims of x and F(x) are unequal (the channel count changes), x is linearly projected to match (done by a 1x1 conv)
     - Similar in spirit to LSTM
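     A shape-level sketch of that block (the residual branch F is left abstract; the 1x1 projection is written as a matrix multiply over the channel axis, which is exactly what a 1x1 conv does; names are illustrative):

         import numpy as np

         def relu(x):
             return np.maximum(0, x)

         def residual_block(x, F, W_proj=None):
             # x: (channels, H, W); F: the residual branch, a function of x
             fx = F(x)
             shortcut = x
             if W_proj is not None:   # channel counts differ: project x with a 1x1 conv
                 shortcut = np.einsum('oc,chw->ohw', W_proj, x)
             return relu(fx + shortcut)   # element-wise addition, then ReLU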
  42. Deeper Bottleneck
     - To reduce training time, the block is modified into a bottleneck design (purely for economical reasons)
       • (3x3x3)x64x64 + (3x3x3)x64x64 = 221184 (left)
       • (1x1x3)x256x64 + (3x3x3)x64x64 + (1x1x3)x64x256 = 208896 (right)
       • The right block is wider (more channels) but has a similar number of parameters
       • A similar method is also used in GoogLeNet
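     Re-evaluating the two expressions above shows the point of the bottleneck: the right-hand block touches 4x more channels for a slightly smaller weight count (this only recomputes the slide's own numbers, keeping its factor of 3 in each kernel):

         left = 2 * ((3 * 3 * 3) * 64 * 64)                                        # two 3x3 convs on 64 channels
         right = (1 * 1 * 3) * 256 * 64 + (3 * 3 * 3) * 64 * 64 + (1 * 1 * 3) * 64 * 256
         print(left, right)   # 221184 vs 208896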
  43. ResNet
     - Data augmentation as in AlexNet
     - Batch Normalization (no dropout)
     - Xavier / 2 initialization
     - Average pooling
     - The structure follows the VGGNet style
  44. Conclusion
  45. Top-5 Error
       AlexNet (2012)           15.31%
       VGGNet (2014)             7.32%
       Inception-V1 (2014)       6.66%
       Human                     5.1%
       PReLU-net (2015)          4.94%
       BN-Inception (2015)       4.82%
       ResNet-152 (2015)         3.57%
       Inception-ResNet (2016)   3.1%
  46. Conclusion
     - Dropout, BN
     - ReLU-like activations (e.g. PReLU, ELU, ...)
     - Xavier initialization
     - Average pooling
     - Use pre-trained models :)
  47. Reference
     - Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
     - Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
     - Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
     - He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
     - He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
     - Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, Inception-ResNet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
     - Gu, Jiuxiang, et al. "Recent advances in convolutional neural networks." arXiv preprint arXiv:1512.07108 (2015). (good tutorial)
     - Thanks also to CS231n; some figures are from the CS231n lecture slides. See http://cs231n.stanford.edu/index.html
