In Collaboration with Michael Mahoney, UC Berkeley

National Energy Research Scientific Computing Center

Empirical results, obtained with the machinery of Random Matrix Theory (RMT), are presented that aim to clarify and resolve some of the puzzling and seemingly contradictory aspects of deep neural networks (DNNs). We apply RMT to several well known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small toy models. We show that the DNN training process itself implicitly implements a form of self-regularization associated with the entropy collapse / information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization, whereas large, modern deep networks display a new kind of heavy tailed self-regularization. We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training. Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomenon unique to DNNs. We argue that this heavy tailed self-regularization has both practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN Energy landscape / optimization problem.


- 1. calculation | consulting why deep learning works: self-regularization in deep neural networks (TM) c|c (TM) charles@calculationconsulting.com
- 2. UC Berkeley / NERSC 2018. Why deep learning works: self-regularization in deep neural networks. charles@calculationconsulting.com
- 3. Who Are We? Dr. Charles H. Martin, PhD. University of Chicago, Chemical Physics. NSF Fellow in Theoretical Chemistry. Over 15 years experience in applied Machine Learning and AI. ML algos for: Aardvark, acquired by Google (2010); Demand Media (eHow), first $1B IPO since Google. Wall Street: BlackRock. Fortune 500: Roche, France Telecom. BigTech: eBay, Aardvark (Google), GoDaddy. Private Equity: Anthropocene Institute. www.calculationconsulting.com charles@calculationconsulting.com
- 4. Who Are We? Michael W. Mahoney. ICSI, RISELab, Dept. of Statistics, UC Berkeley. Algorithmic and statistical aspects of modern large-scale data analysis: large-scale machine learning | randomized linear algebra | geometric network analysis | scalable implicit regularization. PhD, Yale University, computational chemical physics. SAMSI National Advisory Committee. NRC Committee on the Analysis of Massive Data. Simons Institute Fall 2013 and 2018 program on the Foundations of Data. Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley. https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu
- 5. Motivations: towards a Theory of Deep Learning. NNs as spin glasses (LeCun et al. 2015). Looks exactly like old protein folding results (late 90s): Energy Landscape Theory. Broad questions about Why Deep Learning Works? MMDS talk 2016; blog post 2015. Completely different picture of DNNs.
- 6. Motivations: towards a Theory of Deep Learning. Theoretical: deeper insight into Why Deep Learning Works? Non-convex optimization? Regularization? Why is deep better? VC vs Stat Mech vs ? … Practical: useful insight to improve engineering DNNs. When is a network fully optimized? Large batch sizes? Better ensembles? …
- 7. Set up: the Energy Landscape. Minimize Loss: but how to avoid overtraining?
- 8. Problem: How can this possibly work? Highly non-convex? Apparently not. (Figure panels: expected, observed.) It has been suspected for a long time that local minima are not the issue.
- 9. Problem: Local Minima? Duda, Hart and Stork, 2000. Solution: add more capacity and regularize.
- 10. Motivations: what is Regularization? Every adjustable knob and switch is called regularization: Dropout, Batch Size, Noisify Data, … https://arxiv.org/pdf/1710.10686.pdf
- 11. Problem: What is Regularization in DNNs? "Understanding deep learning requires rethinking generalization" (ICLR 2017 Best Paper). Large models overfit on randomly labeled data. Regularization cannot prevent this.
- 12. Motivations: what is Regularization? Familiar optimization problem: Moore-Penrose pseudoinverse (1955); regularize (Phillips, 1962). Soften the rank of X, focus on large eigenvalues. Ridge Regression / Tikhonov-Phillips Regularization. https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
- 13. Motivations: how we study Regularization. The Energy Landscape is determined by the layer weights W_L, and traditional regularization is applied to W_L. Turn off regularization, turn it back on systematically, and study W_L.
- 14. Energy Landscape and Information flow. Information bottleneck; Entropy collapse. (Figure labels: local minima; k = 1 saddle points; k = 2 saddle points; floor / ground state; Information / Entropy.) What happens to the layer weight matrices W_L?
- 15. Self-Regularization: Experiments. Retrained LeNet5 on MNIST using Keras (Conv2D, MaxPool, Conv2D, MaxPool, FC1, FC2, FC). Two (2) other small models: 3-Layer MLP and a Mini AlexNet. And examine pre-trained models (AlexNet, Inception, …).
- 16. Matrix Complexity: Entropy and Stable Rank
- 17. Random Matrix Theory: detailed insight into W_L. Empirical Spectral Density (ESD): eigenvalues of X = W_L W_L^T.
      import keras
      import numpy as np
      import matplotlib.pyplot as plt
      …
      W = model.layers[i].get_weights()[0]
      …
      # correlation matrix of the layer weights
      X = np.dot(W, W.T)
      # X is symmetric, so eigh returns real eigenvalues and eigenvectors
      evals, evecs = np.linalg.eigh(X)
      # the normalized histogram of the eigenvalues is the ESD
      plt.hist(evals, bins=100, density=True)
- 18. Random Matrix Theory: detailed insight into W_L. Empirical Spectral Density (ESD): eigenvalues of X = W_L W_L^T. (Figure panels: Random Matrix, Random + Spikes.) Entropy decrease corresponds to breakdown of random structure and the onset of a new kind of self-regularization.
- 19. Random Matrix Theory: Marchenko-Pastur. The Empirical Spectral Density (ESD) converges to a deterministic function with well defined edges (depends on Q, the aspect ratio).
- 20. Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations. Very crisp edges.
- 21. Experiments: just apply to pre-trained Models. https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
- 22. Experiments: just apply to pre-trained Models. LeNet5 (1998), AlexNet (2012), InceptionV3 (2014), ResNet (2015), …, DenseNet201 (2018). Conv2D, MaxPool, Conv2D, MaxPool, FC, FC. https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
- 23. RMT: LeNet5. Conv2D, MaxPool, Conv2D, MaxPool, FC, FC. Marchenko-Pastur Bulk + Spikes; softrank = 10%.
- 24. RMT: AlexNet. Marchenko-Pastur Bulk-decay | Heavy Tailed. (Figure panels: FC1 zoomed in, FC2 zoomed in.)
- 25. Random Matrix Theory: InceptionV3. W_226: Marchenko-Pastur bulk decay, onset of Heavy Tails.
- 26. Eigenvalue Analysis: Rank Collapse? Smallest eigenvalue = 0: (hard) rank collapse (Q > 1) signifies over-regularization. Smallest eigenvalue > 0 (Q > 1): all smallest eigenvalues > 0, within the numerical (recipes) threshold. Modern DNNs: soft rank collapses; they do not lose hard rank.
- 27. RMT: 5+1 Phases of Training
- 28. Bulk+Spikes: Small Models. Rank 1 perturbation; perturbative correction. (Figure labels: Bulk, Spikes.) Smaller, older models can be described perturbatively w/ RMT.
- 29. Spikes: carry more information. Information begins to concentrate in the spikes. S(v): spikes have less entropy and are more localized than the bulk.
- 30. Bulk+Spikes: ~ Tikhonov regularization. Small models like LeNet5 exhibit traditional regularization: softer rank, eigenvalues above a simple scale threshold, spikes carry most information.
- 31. Heavy Tailed: Self-Regularization. W is strongly correlated / highly non-random. It can be modeled as if drawn from a heavy tailed distribution; then the RMT/MP ESD will also have heavy tails. Known results from RMT / polymer theory (Bouchaud, Potters, etc). AlexNet, ResNet50, InceptionV3, DenseNet201, … Large, well trained, modern DNNs exhibit heavy tailed self-regularization.
- 32. Heavy Tailed: Self-Regularization. Large, well trained, modern DNNs exhibit heavy tailed self-regularization. Salient ideas (what we ‘suspect’ today): no single scale threshold; no simple low rank approximation for W_L; contributions from correlations at all scales; cannot be treated perturbatively.
- 33. Self-Regularization: Batch size experiments. We can cause small models to exhibit strong correlations / heavy tails by exploiting the Generalization Gap Phenomena: large batch sizes => decrease generalization accuracy. Tuning the batch size from very large to very small.
- 34. Batch Size Tuning: Generalization Gap. Decreasing the batch size induces strong correlations in W. (Figure panels: Random-Like, Bleeding-out, Random-Like.)
- 35. Batch Size Tuning: Generalization Gap. Decreasing the batch size induces strong correlations in W.
- 36. Batch Size Tuning: Generalization Gap. Decreasing the batch size induces strong correlations in W. (Figure panels: Bulk+Spikes, Bulk+Spikes, Bulk-decay.)
- 37. Batch Size Tuning: Generalization Gap. Decreasing the batch size induces strong correlations in W. (Figure panels: Bulk-decay, Bulk-decay, Heavy-tailed.)
- 38. Applying RMT: What phase is your model in? How to apply RMT. (Figure panels: Q > 1, smallest eigenvalue > 0; Q = 1, smallest eigenvalue = 0; very crisp edges plus Tracy-Widom fluctuations.) Bulk+Spikes? Bulk-decay? Heavy-tailed? Large, well trained models approach heavy tailed self-regularization.
- 39. Applying RMT: What phase is your model in? InceptionV3 Layer 226, Q~1.3. Bulk-decay? Heavy-tailed? Best MP fits: bulk not captured; difficult to apply MP. Large, well trained models approach heavy tailed self-regularization.
- 40. Applying RMT: Heavy Tails ~ Q=1. DenseNet201, typical layer w/ Q=1.92. Best MP fit (Q fixed): heavy-tailed, but seemingly within MP eigenvalue bounds. MP fit is terrible near eigenvalue minimum = 0. Variance 1.83 is quite large. Resembles a Q=1 fit, like a soft rank collapse. Large, well trained models approach heavy tailed self-regularization.
- 41. Applying RMT: What phase is your model in? How to apply RMT. Q = 1, smallest eigenvalue = 0. The long tail takes the form of very large variance; standard MP theory assumes finite variance. Large, well trained models approach heavy tailed self-regularization.
- 42. Applying RMT: What phase is your model in? InceptionV3 Layer 302, Q~2.048. Best MP fit (Q fixed). Heavy-tailed, but not a clean power law. No hard rank collapse, but close. max ~ 30. Large, well trained models approach heavy tailed self-regularization.
- 43. Applying RMT: Should we float Q? InceptionV3 Layer 302, Q~2.048. Best MP fit (Q=1): heavy-tailed, Q=1 does not fit. max ~ 30 (not shown). ~1.3; very large variances do not capture bulk. Large, well trained models approach heavy tailed self-regularization.
- 44. Summary: applied Random Matrix Theory (RMT); self-regularization ~ entropy / information decrease; 5+1 phases of learning; small models ~ Tikhonov regularization; modern DNNs have heavy-tailed self-regularization.
- 45. Implications: RMT and Deep Learning. How can RMT be used to understand the Energy Landscape? Tradeoff between Energy and Entropy minimization. Where are the local minima? How is the Hessian behaved? Are simpler models misleading? Can we design better learning strategies?
- 46. Energy Funnels: Minimizing Frustration. Energy Landscape Theory for polymer / protein folding. http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
- 47. The Spin Glass of Minimal Frustration. Conjectured 2015 on my blog (15 min fame on Hacker News): https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ Low lying Energy state in Spin Glass ~ spikes in RMT (Bulk+Spikes, flipped).
- 48. RMT w/ Heavy Tails: Energy Landscape? Compare to LeCun’s Spin Glass model (2015). Spin Glass w/ Heavy Tails? Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993). Is the Landscape more funneled, with no ‘problems’ with local minima?
- 49. charles@calculationconsulting.com
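
Slide 12 places ridge (Tikhonov-Phillips) regression next to the Moore-Penrose pseudoinverse. A minimal NumPy sketch of that comparison follows. The synthetic data, the function name `ridge_solve`, and the `alpha` values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def ridge_solve(X, y, alpha):
    """Ridge / Tikhonov solution w = (X^T X + alpha I)^{-1} X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Synthetic regression problem (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_pinv = np.linalg.pinv(X) @ y             # Moore-Penrose (unregularized) solution
w_small = ridge_solve(X, y, alpha=1e-10)   # ridge with a negligible penalty
w_ridge = ridge_solve(X, y, alpha=10.0)    # ridge with a noticeable penalty
```

In the singular value basis of X, the ridge solution rescales each component by s^2 / (s^2 + alpha), so directions with small singular values are damped while large ones pass almost unchanged. That is one way to read "soften the rank of X, focus on large eigenvalues".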
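
Slide 16 names two matrix complexity measures, but the formulas did not survive in this transcript. The sketch below assumes commonly used definitions: stable rank as the squared Frobenius norm divided by the squared spectral norm, and matrix entropy as the Shannon entropy of the normalized squared singular values divided by the log of the rank. The function names are hypothetical.

```python
import numpy as np

def stable_rank(W):
    """Assumed definition: ||W||_F^2 / ||W||_2^2."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(np.sum(s**2) / s[0]**2)

def matrix_entropy(W):
    """Assumed definition: -sum(p log p) / log(rank), p = s^2 / sum(s^2)."""
    s = np.linalg.svd(W, compute_uv=False)
    rank = np.linalg.matrix_rank(W)
    if rank < 2:
        return 0.0  # a rank-1 (or zero) matrix has no spread to measure
    p = s**2 / np.sum(s**2)
    p = p[p > 0]  # avoid log(0)
    return float(-np.sum(p * np.log(p)) / np.log(rank))

rng = np.random.default_rng(0)
W_random = rng.normal(size=(100, 50))                        # unstructured matrix
W_rank1 = np.outer(rng.normal(size=100), rng.normal(size=50))  # fully collapsed matrix

entropy_random = matrix_entropy(W_random)
entropy_rank1 = matrix_entropy(W_rank1)
```

Under these definitions a random matrix has entropy close to 1, and both measures fall as the spectrum concentrates in a few directions.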
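
The slide 17 snippet depends on a trained Keras model. A self-contained variant is sketched here, assuming a random Gaussian matrix as a stand-in for `model.layers[i].get_weights()[0]` and using `np.histogram` in place of the plot. The function name is hypothetical.

```python
import numpy as np

def empirical_spectral_density(W, bins=100):
    """Eigenvalues of X = W W^T and their normalized histogram (the ESD)."""
    X = W @ W.T
    evals = np.linalg.eigvalsh(X)  # X is symmetric, so eigenvalues are real
    density, edges = np.histogram(evals, bins=bins, density=True)
    return evals, density, edges

# Stand-in for a layer weight matrix (assumption: no trained model is available here).
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 300))

evals, density, edges = empirical_spectral_density(W)
```

Passing `evals` to `plt.hist(evals, bins=100, density=True)` would reproduce the plot from the slide.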
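
Slide 19 states that the ESD of a random matrix converges to the Marchenko-Pastur (MP) distribution, with edges that depend on the aspect ratio Q. The sketch below uses the standard MP form for X = W^T W / N with W of shape N x M, Q = N / M >= 1, and i.i.d. entries of variance sigma^2. The normalization by N and the function names are assumptions, since the slides do not show the formula.

```python
import numpy as np

def mp_edges(Q, sigma=1.0):
    """Bulk edges lambda_minus, lambda_plus of the MP distribution."""
    lam_minus = sigma**2 * (1.0 - 1.0 / np.sqrt(Q))**2
    lam_plus = sigma**2 * (1.0 + 1.0 / np.sqrt(Q))**2
    return lam_minus, lam_plus

def mp_density(lam, Q, sigma=1.0):
    """MP density, zero outside [lambda_minus, lambda_plus]."""
    lam = np.asarray(lam, dtype=float)
    lam_minus, lam_plus = mp_edges(Q, sigma)
    rho = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    x = lam[inside]
    rho[inside] = Q / (2.0 * np.pi * sigma**2) * np.sqrt((lam_plus - x) * (x - lam_minus)) / x
    return rho

# Compare against the eigenvalues of a Gaussian random matrix.
rng = np.random.default_rng(0)
N, M = 1000, 500
W = rng.normal(size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)
lam_minus, lam_plus = mp_edges(Q=N / M)
```

For a purely random W, essentially all of `evals` should sit between `lam_minus` and `lam_plus`, up to small finite-size fluctuations at the edges. Eigenvalues well outside that interval are the "spikes" discussed in the later slides.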
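
Slide 26 distinguishes a hard rank collapse from a soft one, judged "within numerical (recipes) threshold". One way to check for hard rank loss is to count singular values above the tolerance `max(singular value) * max(shape) * eps`, which is the default used by `np.linalg.matrix_rank`. The function name and test matrices are assumptions for illustration.

```python
import numpy as np

def hard_rank(W):
    """Number of singular values above the default numerical tolerance."""
    s = np.linalg.svd(W, compute_uv=False)
    tol = s.max() * max(W.shape) * np.finfo(W.dtype).eps
    return int(np.sum(s > tol))

rng = np.random.default_rng(0)
W_full = rng.normal(size=(200, 100))                           # no rank collapse
W_low = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 100))  # hard rank 10 by construction

rank_full = hard_rank(W_full)
rank_low = hard_rank(W_low)
```

A layer whose `hard_rank` equals `min(W.shape)` has not lost hard rank, even if most of its spectral weight sits in a few large eigenvalues.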
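
Slide 31 says that a strongly correlated W can be modeled as if drawn from a heavy tailed distribution, in which case the ESD is heavy tailed too. The sketch below illustrates only that second step: it compares a Gaussian matrix with one whose entries are drawn from a Student-t distribution with 1.5 degrees of freedom (infinite variance). The choice of distribution, the matrix sizes, and the `tail_ratio` summary are assumptions, not the authors' procedure.

```python
import numpy as np

def esd_eigenvalues(W):
    """Eigenvalues of X = W^T W / N for W of shape N x M."""
    N = W.shape[0]
    return np.linalg.eigvalsh(W.T @ W / N)

def tail_ratio(evals):
    """Largest eigenvalue relative to the median: a crude tail indicator."""
    return float(evals.max() / np.median(evals))

rng = np.random.default_rng(0)
N, M = 600, 300
W_gauss = rng.normal(size=(N, M))              # finite variance: MP-like bulk
W_heavy = rng.standard_t(df=1.5, size=(N, M))  # infinite variance: heavy tailed entries

ratio_gauss = tail_ratio(esd_eigenvalues(W_gauss))
ratio_heavy = tail_ratio(esd_eigenvalues(W_heavy))
```

For the Gaussian matrix the largest eigenvalue stays within a small multiple of the median, as the MP edges require. For the heavy tailed matrix it is larger by orders of magnitude, which is why a standard MP fit (finite variance) fails on such spectra.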
