video: https://youtu.be/UxJNG7ENRNg

paper: https://arxiv.org/pdf/1611.03530.pdf

Published in:
Technology

- 1. Understanding Deep Learning Requires Rethinking Generalization, explained with PR12. Jaejun Yoo, Ph.D. Candidate @KAIST. PR12, 20th Jan, 2018
- 2. Today’s contents: Understanding deep learning requires rethinking generalization, by C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals. Nov. 2016: https://arxiv.org/abs/1611.03530. ICLR 2017 Best Paper ??? …paper awards
- 3. Questions: Why do large neural networks generalize well in practice? Can the traditional theories on generalization actually explain the results we are seeing these days? What is it, then, that distinguishes neural networks that generalize well from those that don’t?
- 6. Questions: Why do large neural networks generalize well in practice? Can the traditional theories on generalization actually explain the results we are seeing these days? What is it, then, that distinguishes neural networks that generalize well from those that don’t? “Deep neural networks TOO easily fit random labels.”
- 7. Conventional wisdom: Small generalization error is due to • Model family • Various regularization techniques. “Generalization error”. Strategy: find the model with the minimum training error while using as few parameters as possible.
- 8. Conventional wisdom ??? Small generalization error due to • Model family • Various regularization techniques “Generalization error”
- 9. Effective capacity of neural networks. [Figure: parameter count vs. number of training samples, with test error, for MLP 1 x 512 (p/n = 24), AlexNet (p/n = 28), Inception (p/n = 33), Wide ResNet (p/n = 179)]
- 10. Effective capacity of neural networks. [Figure: parameter count vs. number of training samples, with test error, for MLP 1 x 512 (p/n = 24), AlexNet (p/n = 28), Inception (p/n = 33), Wide ResNet (p/n = 179)] If counting the number of parameters is not a useful way to measure model complexity, then how can we measure the effective capacity of the model?
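The p/n ratios on the slide can be sanity-checked with a little arithmetic. The sketch below assumes details that the slide itself does not state: a CIFAR-10 style setup with n = 50,000 training images, 28 x 28 x 3 input crops, and 10 output classes. The helper name `mlp_param_count` is made up for this illustration.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# Assumed setup (not stated on the slide): CIFAR-10 style data.
n_train = 50_000            # number of training samples, n
input_dim = 28 * 28 * 3     # flattened 28x28 RGB crop
num_classes = 10

# "MLP 1 x 512": one hidden layer with 512 units.
p = mlp_param_count([input_dim, 512, num_classes])
ratio = p / n_train

print(f"parameters p = {p}, samples n = {n_train}, p/n = {ratio:.1f}")
```

Under these assumptions the one-hidden-layer MLP already has roughly 24 parameters per training sample, which is the over-parameterized regime the slide is pointing at.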
- 11. Randomization test Fitting random labels and pixels
- 12. Randomization test. Naïve intuition: learning should be impossible, e.g., training would fail to converge or slow down substantially. Fitting random labels and pixels
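A minimal sketch of the two data corruptions used in a randomization test, written with NumPy. The function names, the `corrupt_prob` parameter, and the choice of one fixed pixel permutation shared by all images are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np


def corrupt_labels(labels, corrupt_prob, num_classes, seed=0):
    """Replace each label by a uniformly random class with probability corrupt_prob."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    mask = rng.random(labels.shape[0]) < corrupt_prob
    labels[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
    return labels


def shuffle_pixels(images, seed=0):
    """Apply one fixed random permutation to the pixels of every image."""
    rng = np.random.default_rng(seed)
    images = np.asarray(images)
    flat = images.reshape(images.shape[0], -1)
    perm = rng.permutation(flat.shape[1])
    return flat[:, perm].reshape(images.shape)


# Toy stand-in data: 8 "images" of shape 4x4x3 with 10 classes.
rng = np.random.default_rng(123)
images = rng.random((8, 4, 4, 3))
labels = rng.integers(0, 10, size=8)

random_labels = corrupt_labels(labels, corrupt_prob=1.0, num_classes=10)
shuffled_images = shuffle_pixels(images)
```

Training the same network on `random_labels` or `shuffled_images` and comparing the training error with the clean run is the experiment the slide describes. Only the data changes; the model and optimizer stay the same.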
- 14. Implications Rademacher complexity and VC-dimension Uniform stability “How sensitive the algorithm is to the replacement of a single example”
- 16. Implications Rademacher complexity and VC-dimension Uniform stability “How sensitive the algorithm is to the replacement of a single example” Solely a property of the algorithm (nothing to do with the data)
- 17. Implications (Summary) • The effective capacity of neural networks is sufficient for memorizing the entire data set. • Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels. • Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
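To make the Rademacher complexity point concrete, the standard definition of the empirical Rademacher complexity of a hypothesis class $\mathcal{H}$ on samples $x_1, \dots, x_n$ is given below. The notation is assumed here and is not taken from the slides.

```latex
\hat{\mathfrak{R}}_n(\mathcal{H})
  = \mathbb{E}_{\sigma}\left[\, \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, h(x_i) \right],
\qquad \sigma_1, \dots, \sigma_n \in \{\pm 1\} \ \text{i.i.d. uniform}
```

The quantity measures how well the class can correlate with random $\pm 1$ labels. If the networks fit random labels almost perfectly, as the randomization test suggests, then $\hat{\mathfrak{R}}_n(\mathcal{H}) \approx 1$, which is the trivial upper bound, so generalization bounds built on it say very little in this regime.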
- 18. The role of regularization. Shall we throw some regularization on it? What about regularization techniques? When your model keeps overfitting, this is how you survive. “Escape from Crisis No. 1” (위기탈출 넘버원)
- 19. The role of regularization What about regularization techniques?
- 20. The role of regularization. What about regularization techniques? It works pretty well even without them?
- 21. The role of regularization. What about regularization techniques? It still (over)fits just fine even with them?
- 22. The role of regularization. What about regularization techniques? Regularization certainly helps generalization, but it is NOT the fundamental reason for generalization. It still (over)fits just fine even with them?
- 23. Finite sample expressivity. Much effort has gone into studying the expressivity of NNs. However, almost all of these results are at the “population level”, showing what functions of the entire domain can and cannot be represented by certain classes of NNs with the same number of parameters. What is more relevant in practice is the expressive power of NNs on a finite sample of size n.
- 24. Finite sample expressivity. The expressive power of NNs on a finite sample of size n? “As soon as the number of parameters p of a network is greater than n, even simple two-layer neural networks can represent any function of the input sample.” "A two-layer neural network with (2n + d) weights and ReLU activations can represent any function on a sample of n points in d dimensions."
- 25. Finite sample expressivity. Proof) Lower triangular matrix… • are invertible if and only if the diagonal elements are nonzero • have their eigenvalues taken directly from the diagonal elements. $\exists\, \mathbf{A}^{-1} \because \operatorname{rank}(\mathbf{A}) = n$ ∎
- 26. Finite sample expressivity. Proof) "A two-layer neural network with (2n + d) weights and ReLU activations can represent any function on a sample of n points in d dimensions."
- 28. Finite sample expressivity. Proof) "A two-layer neural network with (2n + d) weights and ReLU activations can represent any function on a sample of n points in d dimensions." $y = \mathbf{A}w$, $\exists\, \mathbf{A}^{-1} \because \operatorname{rank}(\mathbf{A}) = n$ by Lemma 1
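The construction behind this proof can be sketched in NumPy. The details below are assumptions filled in for illustration (they follow the usual argument, not text from the slides): project the inputs to scalars with a random vector `a`, interleave biases `b` between the sorted projections so that the hidden activation matrix `A` is lower triangular with a positive diagonal, then solve `A w = y`. The function names are hypothetical.

```python
import numpy as np


def fit_two_layer_relu(Z, y, seed=0):
    """Fit n points in d dimensions exactly with a ReLU network of 2n + d weights."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    a = rng.standard_normal(d)            # d weights: projection direction
    x = Z @ a                             # scalar projections, distinct with probability 1
    order = np.argsort(x)
    xs = x[order]
    b = np.empty(n)                       # n weights: biases with b_1 < x_1 < b_2 < x_2 < ...
    b[0] = xs[0] - 1.0
    b[1:] = (xs[:-1] + xs[1:]) / 2.0
    # A[i, j] = max(x_i - b_j, 0) is lower triangular with a positive diagonal.
    A = np.maximum(xs[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])      # n weights: output layer
    return a, b, w


def predict(Z, a, b, w):
    """Evaluate c(z) = sum_j w_j * max(<a, z> - b_j, 0)."""
    hidden = np.maximum((Z @ a)[:, None] - b[None, :], 0.0)
    return hidden @ w


rng = np.random.default_rng(1)
Z = rng.standard_normal((10, 5))          # n = 10 samples, d = 5
y = rng.standard_normal(10)               # arbitrary targets, e.g. random labels
a, b, w = fit_two_layer_relu(Z, y)
print(np.max(np.abs(predict(Z, a, b, w) - y)))
```

The total weight count is `a.size + b.size + w.size`, which equals 2n + d. The fit is exact on the sample up to floating-point error, whatever the targets are.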
- 29. Implicit regularization. We do not really know why neural nets work so well, but can we gain some insight from linear models? An appeal to linear models
- 30. Implicit regularization. An appeal to linear models. But do all global minima generalize equally well? How do we tell one from another? “Curvature”
- 31. Implicit regularization. An appeal to linear models. But do all global minima generalize equally well? How do we tell one from another? “Curvature”. In the linear case, the curvature of all optimal solutions is the same, ∵ the Hessian of the loss function is not a function of the choice of $w$
- 32. Implicit regularization. If curvature doesn’t distinguish global minima, what does? The algorithm itself acts as a kind of constraint. Here, implicit regularization = SGD algorithm. With SGD, the solution converges to the minimum-norm solution.
- 33. Implicit regularization. If curvature doesn’t distinguish global minima, what does? The algorithm itself acts as a kind of constraint. Here, implicit regularization = SGD algorithm. With SGD, the solution converges to the minimum-norm solution: $w_{mn} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1} y$, where $\mathbf{X}\mathbf{X}^{T} \in \mathbb{R}^{n \times n}$ is a kind of kernel, the Gram matrix $K$
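A small numerical check of the minimum-norm claim for an underdetermined linear model. This is a sketch under stated assumptions: it uses plain full-batch gradient descent started from zero rather than SGD, a squared loss, and random Gaussian data. The function names are made up for the example.

```python
import numpy as np


def min_norm_solution(X, y):
    """Closed form w = X^T (X X^T)^{-1} y via the n x n Gram matrix."""
    K = X @ X.T
    alpha = np.linalg.solve(K, y)
    return X.T @ alpha


def gradient_descent(X, y, steps=5000):
    """Minimize 0.5 * ||X w - y||^2 starting from w = 0."""
    w = np.zeros(X.shape[1])
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))   # n = 5 samples, d = 20 parameters
y = rng.standard_normal(5)

w_mn = min_norm_solution(X, y)
w_gd = gradient_descent(X, y)
print(np.allclose(X @ w_gd, y), np.allclose(w_gd, w_mn))
```

Every update is a linear combination of the rows of `X`, so the iterate never leaves their span when started from zero. The only interpolating solution in that span is the minimum-norm one, which is why the two vectors agree.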
- 34. Implicit regularization. Quite surprisingly… a solution obtained in such a simple way has a very low error!
- 35. Implicit regularization. Quite surprisingly… a solution obtained in such a simple way has a very low error! RBF kernel: $\mathbf{X}\mathbf{X}^{T} \alpha = y$
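The kernel version of this solve can be sketched as follows, assuming the slide means replacing the Gram matrix $\mathbf{X}\mathbf{X}^{T}$ by an RBF kernel matrix $K$ and solving $K\alpha = y$. The bandwidth `gamma`, the toy data, and the function names are assumptions for illustration only.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def fit_kernel_interpolant(X, y, gamma):
    """Solve K alpha = y with no explicit regularization term."""
    return np.linalg.solve(rbf_kernel(X, X, gamma), y)


def kernel_predict(X_train, alpha, X_new, gamma):
    """Predict with f(x) = sum_i alpha_i * k(x, x_i)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha


rng = np.random.default_rng(0)
X_train = rng.standard_normal((30, 5))
y_train = rng.standard_normal(30)

alpha = fit_kernel_interpolant(X_train, y_train, gamma=0.5)
train_pred = kernel_predict(X_train, alpha, X_train, gamma=0.5)
print(np.max(np.abs(train_pred - y_train)))
```

The fit interpolates the training targets exactly (up to floating-point error), so any generalization it shows comes from the choice of solution rather than from an explicit regularizer.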
- 37. Conclusion • A simple experimental framework for understanding the effective capacity of deep learning models • Successful DeepNets are able to overfit the training set • Other formal measures of complexity for the models/algorithms/data distributions are needed to precisely explain the over-parameterized regime
- 38. Conclusion. We believe that … “understanding neural networks requires rethinking generalization.”
- 39. References • https://arxiv.org/pdf/1611.03530.pdf (paper) • https://openreview.net/forum?id=Sy8gdB9xx (open review comments) • https://github.com/pluskid/fitting-random-labels (code) • http://pluskid.org/slides/ICLR2017-Poster.pdf (poster) • https://www.youtube.com/watch?v=kCj51pTQPKI (presentation, YouTube) • https://www.slideshare.net/JungHoonSeo2/understanding-deep-learning-requires-rethinking-generalization-2017-12 (slideshare: Kor) • https://danieltakeshi.github.io/2017/05/19/understanding-deep-learning-requires-rethinking-generalization-my-thoughts-and-notes (blog)
- 40. Things to discuss about… (Summary) • The effective capacity of neural networks is sufficient for memorizing the entire data set. • Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels. • Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged. Really???
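- 41. The conventional supervised learning setting: the training and test domains are assumed to be the same. Statistical Learning Theory: SVM, …
- 42. Customer reviews of electronics (X) / positive or negative labels (Y)
- 43. Customer reviews of electronics (X) / positive or negative labels (Y). Customer reviews of video games (X)
- 44. Customer reviews of electronics (X) / positive or negative labels (Y). Customer reviews of video games (X). From the function space H represented by NNs….
- 45. Customer reviews of electronics (X) / positive or negative labels (Y). Customer reviews of video games (X). We train a classifier h. The target labels are unknown, but we want an h that finds the labels well in both domains, source (X, Y) and target (X). From the function space H represented by NNs….
- 46. Conventional strategy: find the model with the minimum training error while using as few parameters as possible.
- 47. Now the training domain (source) and the testing domain (target) differ from each other. An additional strategy is needed on top of the conventional one.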
- 48. A Computable Adaptation Bound. Divergence estimation complexity: dependent on the number of unlabeled samples
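- 49. The optimal joint hypothesis $h^{*} = \arg\min_{h \in \mathcal{H}} \big(\epsilon_S(h) + \epsilon_T(h)\big)$ is the hypothesis with minimal combined error; $\lambda = \epsilon_S(h^{*}) + \epsilon_T(h^{*})$ is that error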
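For reference, the bound these last two slides appear to describe has the following shape. This is a sketch that assumes the slides follow the standard domain adaptation bound of Ben-David et al.; the slides do not name their source, and all symbols below are assumptions of this note. Here $\epsilon_S$ and $\epsilon_T$ are the source and target errors, $\mathcal{U}_S$ and $\mathcal{U}_T$ are unlabeled samples of size $m'$ from each domain, $d$ is the VC dimension of $\mathcal{H}$, and $\delta$ is the failure probability.

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, \hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S, \mathcal{U}_T)
  \;+\; 4\sqrt{\frac{2d \log(2m') + \log(2/\delta)}{m'}}
  \;+\; \lambda,
\qquad
\lambda = \min_{h' \in \mathcal{H}} \big( \epsilon_S(h') + \epsilon_T(h') \big)
```

The second term is the divergence estimated from unlabeled data, the square-root term is the divergence estimation complexity that shrinks with the number of unlabeled samples, and $\lambda$ is the combined error of the optimal joint hypothesis.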