[PR12] Inception and Xception - Jaejun Yoo


Introduction to Inception and Xception
video: https://youtu.be/V0dLhyg5_Dw
Papers:
Going Deeper with Convolutions
Rethinking the Inception Architecture for Computer Vision
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Xception: Deep Learning with Depthwise Separable Convolutions


[PR12] Inception and Xception - Jaejun Yoo

  1. Inception & Xception (GoogLeNet): understanding them together with PR12. Jaejun Yoo, Ph.D. Candidate @KAIST. PR12, 10th Sep, 2017
  2. Today's contents. GoogLeNet: Inception models • Going Deeper with Convolutions • Rethinking the Inception Architecture for Computer Vision • Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
  3. Today's contents
  4. Today's contents. Xception model • Xception: Deep Learning with Depthwise Separable Convolutions
  5. Motivation. Q) What is the best way to improve the performance of a deep neural network? A) …
  6. Motivation. Q) What is the best way to improve the performance of a deep neural network? A) Bigger size! (Increase the depth and width of the model.) Yeah, DO IT, WE ARE GOOGLE
  7. Problem. 1. A bigger model typically means a larger number of parameters → overfitting. 2. Increased use of computational resources → e.g. a quadratic increase in computation: two chained 3×3 layers with C filters each (3×3×C → 3×3×C) cost on the order of C² computations.
  8. Problem. 1. A bigger model typically means a larger number of parameters → overfitting. 2. Increased use of computational resources → e.g. a quadratic increase in computation: two chained 3×3 layers with C filters each (3×3×C → 3×3×C) cost on the order of C² computations. Solution: sparsely connected architecture
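To make the quadratic growth concrete, here is a minimal Python sketch (mine, not from the slides); the helper name and the channel counts are illustrative assumptions.

```python
def conv_macs_per_position(kernel=3, c_in=256, c_out=256):
    """Multiply-accumulates for one output position of a kernel x kernel
    convolution mapping c_in input channels to c_out output channels."""
    return kernel * kernel * c_in * c_out

# Stacking 3x3xC -> 3x3xC layers: the per-position cost scales with C^2,
# so doubling the width C quadruples the computation.
for c in (64, 128, 256):
    print(c, conv_macs_per_position(c_in=c, c_out=c))
```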
  9. Sparsely Connected Architecture. 1. Mimicking biological systems 2. Theoretical underpinnings → Arora et al., "Provable bounds for learning some deep representations", ICML 2014: "Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?"
  10. Why Arora et al. is important. To provably solve optimization problems for general neural networks with two or more layers, the algorithms that would be necessary hit some of the biggest open problems in computer science. So, we don't think there's much hope for machine learning researchers to try to find algorithms that are provably optimal for deep networks. This is because the problem is NP-hard, meaning that provably solving it in polynomial time would also solve thousands of open problems that have been open for decades. Indeed, in 1988 J. Stephen Judd shows the following problem to be NP-hard: Given a general neural network and a set of training examples, does there exist a set of edge weights for the network so that the network produces the correct output for all the training examples? Judd also shows that the problem remains NP-hard even if it only requires a network to produce the correct output for just two-thirds of the training examples, which implies that even approximately training a neural network is intrinsically difficult in the worst case. In 1993, Blum and Rivest make the news worse: even a simple network with just two layers and three nodes is NP-hard to train! (Source: https://www.oreilly.com/ideas/the-hard-thing-about-deep-learning)
  11. Sparsely Connected Architecture. 1. Mimicking biological systems 2. Theoretical underpinnings → Arora et al., "Provable bounds for learning some deep representations", ICML 2014: "Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?" YES!
  12. Layer-wise learning. Correlation statistics. Videos: • http://techtalks.tv/talks/provable-bounds-for-learning-some-deep-representations/61118/ • https://www.youtube.com/watch?v=c43pqQE176g "If the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs."
  13. Layer-wise learning. Correlation statistics. Videos: • http://techtalks.tv/talks/provable-bounds-for-learning-some-deep-representations/61118/ • https://www.youtube.com/watch?v=c43pqQE176g "If the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs." Hebbian principle!
  14. Problem. 1. Sparse matrix computation is very inefficient → dense matrix calculation is extremely efficient. 2. Even ConvNets changed back from sparse connections to full connections in order to better optimize parallel computing.
  15. Problem. 1. Sparse matrix computation is very inefficient → dense matrix calculation is extremely efficient. 2. Even ConvNets changed back from sparse connections to full connections in order to better optimize parallel computing. Is there any intermediate step?
  16. Inception architecture. Main idea: How to find an optimal local sparse structure in a convolutional network, and how can it be approximated and covered by readily available dense components? : All we need is to find the optimal local construction and to repeat it spatially. : Arora et al.: layer-by-layer construction with correlation-statistics analysis.
  17. In images, correlations tend to be local
  18. Cover very local clusters by 1x1 convolutions (figure labels: 1x1; number of filters)
  19. Less spread out correlations (figure labels: 1x1; number of filters)
  20. Cover more spread out clusters by 3x3 convolutions (figure labels: 1x1, 3x3; number of filters)
  21. Cover more spread out clusters by 5x5 convolutions (figure labels: 1x1, 3x3; number of filters)
  22. Cover more spread out clusters by 5x5 convolutions (figure labels: 1x1, 3x3, 5x5; number of filters)
  23. A heterogeneous set of convolutions (figure labels: 1x1, 3x3, 5x5; number of filters)
  24. Schematic view (naive version): previous layer → {1x1 convolutions, 3x3 convolutions, 5x5 convolutions} → filter concatenation (figure labels: 1x1, 3x3, 5x5; number of filters)
  25. Naive idea: previous layer → {1x1 convolutions, 3x3 convolutions, 5x5 convolutions, 3x3 max pooling} → filter concatenation
  26. Naive idea (does not work!): previous layer → {1x1 convolutions, 3x3 convolutions, 5x5 convolutions, 3x3 max pooling} → filter concatenation
  27. Inception module (with dimension reductions): previous layer → {1x1 convolutions; 1x1 convolutions → 3x3 convolutions; 1x1 convolutions → 5x5 convolutions; 3x3 max pooling → 1x1 convolutions} → filter concatenation
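As a concrete reference, here is a minimal Keras sketch (mine, not from the slides) of the Inception module with 1x1 dimension reductions described above; all filter counts and the input shape are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Inception module with 1x1 dimension reductions (GoogLeNet style).
    Filter counts are illustrative arguments, not values from the slides."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)

    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)

    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)

    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)

    return layers.Concatenate()([b1, b2, b3, b4])  # filter concatenation

# Usage sketch with an illustrative input shape.
inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
model = tf.keras.Model(inputs, outputs)
```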
  28. Network in Network (NiN). Conv vs. MLPConv: Conv = GLM, MLPConv = 1×1 Conv. The GLM is replaced with a "micro network" structure.
  29. 1×1 Conv: 1. Increases the representational power of the neural network → in the MLPConv sense. 2. Dimension reduction → usually the filters are highly correlated.
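A quick back-of-the-envelope sketch (mine; the channel counts 192 → 16 → 32 are illustrative, not from the slide) of why the 1x1 dimension reduction in point 2 pays off when placed in front of a 5x5 convolution:

```python
def macs(k, c_in, c_out):
    # Multiply-accumulates per output position for a k x k convolution.
    return k * k * c_in * c_out

direct = macs(5, 192, 32)                     # 5x5 conv applied directly to 192 channels
reduced = macs(1, 192, 16) + macs(5, 16, 32)  # 1x1 reduction to 16 channels, then 5x5
print(direct, reduced)  # 153600 vs 15872: roughly a 10x saving
```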
  30. 5×5 Conv → 3×3 Conv + 3×3 Conv. 5x5 : 3x3 = 25 : 9 (25/9 ≈ 2.78 times). A single 5x5 conv operation naturally costs about 2.78 times as much as a 3x3 conv operation. Now compare the cost of covering the same region with a single 5x5 versus two stacked 3x3 layers: 5x5xN : (3x3xN) + (3x3xN) = 25 : 9+9 = 25 : 18 (about a 28% reduction). https://norman3.github.io/papers/docs/google_inception.html
  31. 5×5 Conv → 3×3 Conv + 3×3 Conv. https://norman3.github.io/papers/docs/google_inception.html
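The same 25 : 18 arithmetic can be checked directly in Keras. This is a minimal sketch (mine); the channel count N = 64 and the disabled bias terms are illustrative assumptions so that the parameter counts match the formula exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers

N = 64                                   # illustrative channel count
inp = tf.keras.Input(shape=(32, 32, N))

# One 5x5 convolution.
five = tf.keras.Model(inp, layers.Conv2D(N, 5, padding="same", use_bias=False)(inp))

# Two stacked 3x3 convolutions (same receptive field as a single 5x5).
x = layers.Conv2D(N, 3, padding="same", use_bias=False)(inp)
two_threes = tf.keras.Model(inp, layers.Conv2D(N, 3, padding="same", use_bias=False)(x))

print(five.count_params(), two_threes.count_params())
# 5*5*N*N = 102400 vs 2*3*3*N*N = 73728, i.e. about a 28% reduction
```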
  32. Xception. Observation 1 • The Inception module tries to explicitly factor the two tasks done by a single convolution kernel: mapping cross-channel correlations and mapping spatial correlations. Inception hypothesis • In an Inception module, these two correlations are sufficiently decoupled.
  33. Xception. Observation • The Inception module tries to explicitly factor the two tasks done by a single convolution kernel: mapping cross-channel correlations and mapping spatial correlations. Inception hypothesis • In an Inception module, these two correlations are sufficiently decoupled. Would it be reasonable to make a much stronger hypothesis than the Inception hypothesis?
  34. Xception
  35. Xception
  36. Xception
  37. Xception. Depthwise separable convolution: commonly called "separable convolution" in deep learning frameworks such as TF and Keras; a spatial convolution performed independently over each channel of an input, followed by a pointwise convolution, i.e. a 1×1 conv.
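A minimal Keras sketch (mine; the input shape and filter counts are illustrative) of the operation described above, both spelled out as depthwise-then-pointwise and via the built-in layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(32, 32, 64))  # illustrative input

# Spelled out: a 3x3 spatial conv applied independently to each channel,
# followed by a pointwise 1x1 conv that mixes channels.
y = layers.DepthwiseConv2D(3, padding="same")(x)
y = layers.Conv2D(128, 1, padding="same")(y)

# The equivalent built-in "separable convolution" layer in Keras.
z = layers.SeparableConv2D(128, 3, padding="same")(x)
```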
  38. Xception vs. depthwise separable convolution: 1. The order of the operations 2. The presence or absence of a non-linearity after the first operation. (figure: regular convolution / Inception / depthwise separable convolution)
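To make the two listed differences concrete, here is a small sketch (mine, with illustrative shapes): an (extreme) Inception-style ordering maps cross-channel correlations with a 1x1 conv first and applies a non-linearity after that first operation, while the usual depthwise separable implementation does the spatial step first and has no non-linearity in between.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(32, 32, 64))  # illustrative input

# Inception-style ordering: 1x1 first, with a non-linearity after the first operation.
a = layers.Conv2D(128, 1, activation="relu")(x)
a = layers.DepthwiseConv2D(3, padding="same", activation="relu")(a)

# Depthwise separable ordering: spatial (depthwise) conv first, then 1x1,
# with no non-linearity between the two operations.
b = layers.DepthwiseConv2D(3, padding="same")(x)
b = layers.Conv2D(128, 1)(b)
```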
  39. Xception. Observation 2: Inception modules lie in between regular convolution and depthwise separable convolution. Xception hypothesis: make the mapping entirely decouple the cross-channel correlations and spatial correlations. (figure: regular convolution / Inception / depthwise separable convolution)
  40. Xception. Observation 2: Inception modules lie in between regular convolution and depthwise separable convolution. Xception hypothesis: make the mapping entirely decouple the cross-channel correlations and spatial correlations. (figure: regular convolution / Inception / depthwise separable convolution)
  41. Xception: ImageNet and JFT
  42. References. GoogLeNet: Inception models. Blog • https://norman3.github.io/papers/docs/google_inception.html (Kor) • https://blog.acolyer.org/2017/03/21/convolution-neural-nets-part-2/ (Eng) Paper • Going Deeper with Convolutions • Rethinking the Inception Architecture for Computer Vision • Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. Xception model. Paper • Xception: Deep Learning with Depthwise Separable Convolutions
  43. Reducing the grid size efficiently. https://norman3.github.io/papers/docs/google_inception.html
  44. Reducing the grid size efficiently.
  • What if we pool first? → A representational bottleneck occurs; think of it as information loss caused by the pooling.
  • The example operation is a conv that transforms (d, d, k) into (d/2, d/2, 2k) (so here d = 35, k = 320).
  • The actual operation counts are:
    • pooling + stride-1 conv with 2k filters ⇒ 2(d/2)²k² operations
    • stride-1 conv with 2k filters + pooling ⇒ 2d²k² operations
  • That is, the former needs less computation but suffers a representational bottleneck, while the latter loses less information but (by the formulas above) costs about 4× the operations.
  https://norman3.github.io/papers/docs/google_inception.html
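Plugging the slide's own formulas and numbers (d = 35, k = 320) into a short Python check:

```python
# Operation counts from the formulas above, with d = 35 and k = 320.
d, k = 35, 320

pool_then_conv = 2 * (d / 2) ** 2 * k ** 2  # pooling first, then stride-1 conv with 2k filters
conv_then_pool = 2 * d ** 2 * k ** 2        # stride-1 conv with 2k filters first, then pooling

print(conv_then_pool / pool_then_conv)  # 4.0: less information loss, but 4x the operations
```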
