
# [PR12] Inception and Xception - Jaejun Yoo

Introduction to Inception and Xception
video: https://youtu.be/V0dLhyg5_Dw
Papers:
Going Deeper with Convolutions
Rethinking the Inception Architecture for Computer Vision
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Xception: Deep Learning with Depthwise Separable Convolutions


1. 1. Inception & Xception (GoogLeNet), understood together with PR12. Jaejun Yoo, Ph.D. Candidate @ KAIST. PR12, 10th Sep, 2017
2. 2. Today’s contents GoogLeNet : Inception models • Going Deeper with Convolutions • Rethinking the Inception Architecture for Computer Vision • Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
3. 3. Today’s contents
4. 4. Today’s contents Xception model • Xception: Deep Learning with Depthwise Separable Convolutions
5. 5. Motivation Q) What is the best way to improve the performance of a deep neural network? A) …
6. 6. Motivation Q) What is the best way to improve the performance of a deep neural network? A) Bigger size! (Increase the depth and width of the model) Yea DO IT WE ARE GOOGLE
7. 7. Problem 1. A bigger model typically means a larger number of parameters → overfitting 2. Increased use of computational resources → e.g. quadratic increase in computation: 3 × 3 × C → 3 × 3 × C takes ∝ C² computations
8. 8. Problem 1. A bigger model typically means a larger number of parameters → overfitting 2. Increased use of computational resources → e.g. quadratic increase in computation: 3 × 3 × C → 3 × 3 × C takes ∝ C² computations Solution: Sparsely connected architecture
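The quadratic cost claim on this slide can be checked with a quick back-of-the-envelope count. A minimal sketch in plain Python (hypothetical 28×28 feature map and channel counts; multiply-accumulates only, biases and activations ignored):

```python
# Multiply-accumulates (MACs) for a stride-1, 'same'-padded k x k convolution
# on an h x w feature map, mapping c_in channels to c_out channels.
def conv_macs(h, w, c_in, c_out, k=3):
    return h * w * c_out * (k * k * c_in)

base = conv_macs(28, 28, 64, 64)       # 3x3 conv, C = 64 in and out
doubled = conv_macs(28, 28, 128, 128)  # same conv with C doubled
print(doubled / base)  # -> 4.0: doubling C quadruples the computation
```

Uniformly widening every layer therefore grows computation quadratically, which is the resource problem the sparse-architecture line of work tries to sidestep.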
9. 9. Sparsely Connected Architecture 1. Mimicking biological systems 2. Theoretical underpinnings → Arora et al., Provable bounds for learning some deep representations, ICML 2014: “Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?”
10. 10. Why Arora et al. is important “To provably solve optimization problems for general neural networks with two or more layers, the algorithms that would be necessary hit some of the biggest open problems in computer science. So, we don't think there's much hope for machine learning researchers to try to find algorithms that are provably optimal for deep networks. This is because the problem is NP-hard, meaning that provably solving it in polynomial time would also solve thousands of open problems that have been open for decades. Indeed, in 1988 J. Stephen Judd shows the following problem to be NP-hard: Given a general neural network and a set of training examples, does there exist a set of edge weights for the network so that the network produces the correct output for all the training examples? Judd also shows that the problem remains NP-hard even if it only requires a network to produce the correct output for just two-thirds of the training examples, which implies that even approximately training a neural network is intrinsically difficult in the worst case. In 1993, Blum and Rivest make the news worse: even a simple network with just two layers and three nodes is NP-hard to train!” (from https://www.oreilly.com/ideas/the-hard-thing-about-deep-learning)
11. 11. Sparsely Connected Architecture 1. Mimicking biological systems 2. Theoretical underpinnings → Arora et al., Provable bounds for learning some deep representations, ICML 2014: “Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?” YES!
12. 12. Layer-wise learning Correlation statistics Videos: • http://techtalks.tv/talks/provable-bounds-for-learning-some-deep-representations/61118/ • https://www.youtube.com/watch?v=c43pqQE176g “If the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.”
13. 13. Layer-wise learning Correlation statistics Videos: • http://techtalks.tv/talks/provable-bounds-for-learning-some-deep-representations/61118/ • https://www.youtube.com/watch?v=c43pqQE176g “If the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.” Hebbian principle!
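As a toy illustration of that layer-by-layer idea (not the actual algorithm of Arora et al.), the plain-Python sketch below greedily groups units whose activation traces are highly correlated; the traces and the 0.9 threshold are made up:

```python
# Hebbian-style grouping: cluster units whose outputs are highly correlated.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

acts = {                              # activation trace of each unit over 5 inputs
    "u1": [0.1, 0.9, 0.2, 0.8, 0.1],
    "u2": [0.2, 1.0, 0.1, 0.9, 0.2],  # strongly correlated with u1
    "u3": [0.9, 0.1, 0.8, 0.2, 0.9],  # anti-correlated with u1
}
clusters = []
for name, trace in acts.items():
    for cluster in clusters:          # join the first cluster that correlates
        if pearson(trace, acts[cluster[0]]) > 0.9:
            cluster.append(name)
            break
    else:                             # otherwise start a new cluster
        clusters.append([name])
print(clusters)  # -> [['u1', 'u2'], ['u3']]
```

Each cluster would become one unit of the next layer, which is the intuition the Inception module turns into an architecture.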
14. 14. Problem 1. Sparse matrix computation is very inefficient → dense matrix computation is extremely efficient 2. Even ConvNets changed back from sparse connections to full connections to better exploit parallel computing
15. 15. Problem 1. Sparse matrix computation is very inefficient → dense matrix computation is extremely efficient 2. Even ConvNets changed back from sparse connections to full connections to better exploit parallel computing Is there any intermediate step?
16. 16. Inception architecture Main idea: How can an optimal local sparse structure in a convolutional network be found, and how can it be approximated and covered by readily available dense components? : All we need is to find the optimal local construction and to repeat it spatially : Arora et al.: layer-by-layer construction with correlation-statistics analysis
17. 17. In images, correlations tend to be local
18. 18. Cover very local clusters by 1x1 convolutions
19. 19. Less spread out correlations
20. 20. Cover more spread out clusters by 3x3 convolutions
21. 21. Cover more spread out clusters by 5x5 convolutions
22. 22. Cover more spread out clusters by 5x5 convolutions
23. 23. A heterogeneous set of convolutions: 1x1, 3x3, 5x5
24. 24. Schematic view (naive version): previous layer → [1x1 convolutions | 3x3 convolutions | 5x5 convolutions] → filter concatenation
25. 25. Naive idea: previous layer → [1x1 convolutions | 3x3 convolutions | 5x5 convolutions | 3x3 max pooling] → filter concatenation
26. 26. Naive idea (does not work!): previous layer → [1x1 convolutions | 3x3 convolutions | 5x5 convolutions | 3x3 max pooling] → filter concatenation
27. 27. Inception module: previous layer → [1x1 convolutions | 1x1 convolutions → 3x3 convolutions | 1x1 convolutions → 5x5 convolutions | 3x3 max pooling → 1x1 convolutions] → filter concatenation
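The effect of those extra 1x1 layers is easy to quantify. A minimal sketch (weights only, biases ignored) comparing the naive module with the dimension-reduced one, using filter numbers patterned after GoogLeNet's inception(3a) block, which takes 192 input channels:

```python
# Weight counts for one Inception module: naive vs. with 1x1 reductions.
def conv_params(c_in, c_out, k):
    return k * k * c_in * c_out

c_in = 192
# Naive version: 1x1, 3x3, and 5x5 branches operate directly on the input.
naive = (conv_params(c_in, 64, 1)
         + conv_params(c_in, 128, 3)
         + conv_params(c_in, 32, 5))
# Inception module: the 3x3 and 5x5 branches first shrink channels with
# 1x1 convs, and the pooling branch gets a 1x1 projection.
reduced = (conv_params(c_in, 64, 1)
           + conv_params(c_in, 96, 1) + conv_params(96, 128, 3)
           + conv_params(c_in, 16, 1) + conv_params(16, 32, 5)
           + conv_params(c_in, 32, 1))
print(naive, reduced, round(naive / reduced, 1))  # -> 387072 163328 2.4
```

The 1x1 projections also keep the concatenated output (64 + 128 + 32 + 32 = 256 channels here) from snowballing as modules are stacked.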
28. 28. Network in Network (NiN) Conv = GLM; MLPConv = 1 × 1 Conv: the GLM is replaced with a ”micro network” structure
29. 29. 𝟏 × 𝟏 Conv 1. Increases the representational power of the neural network → in the MLPConv sense 2. Dimension reduction → usually the filters are highly correlated
30. 30. 𝟓 × 𝟓 Conv → 𝟑 × 𝟑 Conv + 𝟑 × 𝟑 Conv 5x5 : 3x3 = 25 : 9 (25/9 = 2.78 times) A single 5x5 conv operation naturally costs about 2.78 times more than a 3x3 conv operation. Now compute the cost of replacing one 5x5 layer with two stacked 3x3 layers of the same size: 5x5xN : (3x3xN) + (3x3xN) = 25 : 9 + 9 = 25 : 18 (about a 28% reduction) https://norman3.github.io/papers/docs/google_inception.html
31. 31. 𝟓 × 𝟓 Conv → 𝟑 × 𝟑 Conv + 𝟑 × 𝟑 Conv https://norman3.github.io/papers/docs/google_inception.html
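The slide's arithmetic, checked in code. A minimal sketch counting weight-multiplications per output location, with the same channel width N on every layer as the slide assumes:

```python
# Cost of one 5x5 conv vs. two stacked 3x3 convs, per output location,
# with N input and N output channels throughout.
def conv_cost(k, n):
    return k * k * n * n  # k x k kernel, n -> n channels

n = 64  # arbitrary width; the 25 : 18 ratio is independent of n
five_by_five = conv_cost(5, n)
two_three_by_three = 2 * conv_cost(3, n)
saving = 1 - two_three_by_three / five_by_five
print(round(saving, 2))  # -> 0.28, the ~28% reduction (25 : 18)
```

The two 3x3 layers additionally cover the same 5x5 receptive field with an extra non-linearity in between.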
32. 32. Xception Observation 1 • The Inception module tries to explicitly factor the two tasks performed by a single convolution kernel: mapping cross-channel correlations and mapping spatial correlations Inception hypothesis • Thanks to the Inception module, these two correlations are sufficiently decoupled
33. 33. Xception Observation 1 • The Inception module tries to explicitly factor the two tasks performed by a single convolution kernel: mapping cross-channel correlations and mapping spatial correlations Inception hypothesis • Thanks to the Inception module, these two correlations are sufficiently decoupled Would it be reasonable to make a much stronger hypothesis than the Inception hypothesis?
34. 34. Xception
35. 35. Xception
36. 36. Xception
37. 37. Xception Depthwise separable convolution Commonly called “separable convolution” in deep learning frameworks such as TF and Keras: a spatial convolution performed independently over each channel of an input, followed by a pointwise convolution, i.e. a 1 × 1 conv
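The payoff of this factorization shows up directly in the cost. A minimal sketch (hypothetical sizes; multiply-accumulates only) comparing a standard convolution with its depthwise separable counterpart, whose cost ratio works out to 1/c_out + 1/k²:

```python
# MACs on an h x w map: standard k x k convolution vs. depthwise separable
# (a per-channel k x k spatial conv, then a 1x1 pointwise conv).
def standard_macs(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

def separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in  # one k x k filter per input channel
    pointwise = h * w * c_in * c_out  # 1x1 conv mixing the channels
    return depthwise + pointwise

std = standard_macs(28, 28, 3, 256, 256)
sep = separable_macs(28, 28, 3, 256, 256)
print(sep / std)  # ≈ 1/256 + 1/9 ≈ 0.115 of the standard cost
```

With a 3x3 kernel and wide layers, the separable form costs roughly a ninth of the regular convolution, which is what lets Xception spend its parameter budget elsewhere.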
38. 38. Xception Xception vs. depthwise separable convolution 1. The order of the operations 2. The presence or absence of a non-linearity after the first operation [diagram: regular convolution / Inception / depthwise separable convolution]
39. 39. Xception Observation 2 Regular convolution, Inception, and depthwise separable convolution form a spectrum; Inception modules lie in between! Xception hypothesis: make the mapping entirely decouple the cross-channel correlations and spatial correlations
41. 41. Xception Results on ImageNet and JFT
42. 42. Reference GoogLeNet : Inception models Blog • https://norman3.github.io/papers/docs/google_inception.html (Kor) • https://blog.acolyer.org/2017/03/21/convolution-neural-nets-part-2/ (Eng) Paper • Going Deeper with Convolutions • Rethinking the Inception Architecture for Computer Vision • Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning Xception models Paper • Xception: Deep Learning with Depthwise Separable Convolutions
43. 43. Reducing the grid size efficiently https://norman3.github.io/papers/docs/google_inception.html
44. 44. Reducing the grid size efficiently
• What if pooling is done first?
• Then a representational bottleneck occurs.
• Think of it as information loss caused by the pooling.
• The example operation transforms a (d, d, k) tensor into (d/2, d/2, 2k) (so here d = 35, k = 320).
• Either way, the actual operation counts are:
• pooling + stride-1 conv with 2k filters ⇒ 2(d/2)²k² operations
• stride-1 conv with 2k filters + pooling ⇒ 2d²k² operations
• That is, the left option needs fewer operations but suffers the representational bottleneck.
• The right option loses less information but needs four times the operations.
https://norman3.github.io/papers/docs/google_inception.html
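The two orderings can be checked numerically; a minimal sketch with the slide's d = 35, k = 320:

```python
# Operation counts for reducing a (d, d, k) grid to (d/2, d/2, 2k).
d, k = 35, 320

# Pooling first, then a stride-1 conv with 2k filters on the smaller grid:
# cheaper, but the pooling step creates a representational bottleneck.
pool_then_conv = 2 * (d / 2) ** 2 * k ** 2

# Stride-1 conv with 2k filters first, then pooling: no bottleneck,
# but the conv runs on the full-resolution grid.
conv_then_pool = 2 * d ** 2 * k ** 2

print(conv_then_pool / pool_then_conv)  # -> 4.0
```

For reference, the Inception-v3 paper resolves this trade-off with parallel stride-2 convolution and pooling branches whose outputs are concatenated, getting the lower cost without the bottleneck.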