Why Deep Learning Works: Self Regularization in Deep Neural Networks


Talk given on June 8, 2018 at UC Berkeley / NERSC

In Collaboration with Michael Mahoney, UC Berkeley
National Energy Research Scientific Computing Center

Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly-contradictory aspects of deep neural networks (DNNs). We apply RMT to several well-known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models. We show that the DNN training process itself implicitly implements a form of self-regularization associated with entropy collapse / the information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization, whereas large, modern deep networks display a new kind of heavy-tailed self-regularization. We characterize self-regularization using RMT by identifying a taxonomy of 5+1 phases of training. Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy-tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomenon unique to DNNs. We argue that this heavy-tailed self-regularization has practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN energy landscape / optimization problem.


Why Deep Learning Works: Self Regularization in Deep Neural Networks

  1. 1. calculation | consulting why deep learning works: self-regularization in deep neural networks (TM) c|c (TM) charles@calculationconsulting.com
  2. 2. calculation|consulting UC Berkeley / NERSC 2018 why deep learning works: self-regularization in deep neural networks (TM) charles@calculationconsulting.com
  3. 3. calculation | consulting why deep learning works Who Are We? c|c (TM) Dr. Charles H. Martin, PhD University of Chicago, Chemical Physics NSF Fellow in Theoretical Chemistry Over 15 years experience in applied Machine Learning and AI ML algos for: Aardvark, acquired by Google (2010) Demand Media (eHow); first $1B IPO since Google Wall Street: BlackRock Fortune 500: Roche, France Telecom BigTech: eBay, Aardvark (Google), GoDaddy Private Equity: Anthropocene Institute www.calculationconsulting.com charles@calculationconsulting.com (TM) 3
  4. 4. calculation | consulting why deep learning works c|c (TM) (TM) 4 Michael W. Mahoney ICSI, RISELab, Dept. of Statistics UC Berkeley Algorithmic and statistical aspects of modern large-scale data analysis. large-scale machine learning | randomized linear algebra geometric network analysis | scalable implicit regularization PhD, Yale University, computational chemical physics SAMSI National Advisory Committee NRC Committee on the Analysis of Massive Data Simons Institute Fall 2013 and 2018 program on the Foundations of Data Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu Who Are We?
  5. 5. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 5 calculation | consulting why deep learning works NNs as spin glasses LeCun et al. 2015 Looks exactly like old protein folding results (late 90s) Energy Landscape Theory broad questions about Why Deep Learning Works ? MMDS talk 2016 Blog post 2015 completely different picture of DNNs
  6. 6. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 6 calculation | consulting why deep learning works Theoretical: deeper insight into Why Deep Learning Works ? non-convex optimization ? regularization ? why is deep better ? VC vs Stat Mech vs ? … Practical: useful insight to improve the engineering of DNNs when is a network fully optimized ? large batch sizes ? better ensembles ? …
  7. 7. c|c (TM) Set up: the Energy Landscape (TM) 7 calculation | consulting why deep learning works minimize the Loss: but how to avoid overtraining ?
  8. 8. c|c (TM) Problem: How can this possibly work ? (TM) 8 calculation | consulting why deep learning works highly non-convex ? apparently not expected vs. observed ? it has been suspected for a long time that local minima are not the issue
  9. 9. c|c (TM) Problem: Local Minima ? (TM) 9 calculation | consulting why deep learning works Duda, Hart and Stork, 2000 solution: add more capacity and regularize
  10. 10. c|c (TM) Motivations: what is Regularization ? (TM) 10 calculation | consulting why deep learning works every adjustable knob and switch is called regularization https://arxiv.org/pdf/1710.10686.pdf Dropout Batch Size Noisify Data …
  11. 11. c|c (TM) (TM) 11 calculation | consulting why deep learning works Understanding deep learning requires rethinking generalization Problem: What is Regularization in DNNs ? ICLR 2017 Best paper Large models overfit on randomly labeled data Regularization cannot prevent this
  12. 12. Moore-Penrose pseudoinverse (1955) regularize (Phillips, 1962) familiar optimization problem c|c (TM) Motivations: what is Regularization ? (TM) 12 calculation | consulting why deep learning works Soften the rank of X, focus on large eigenvalues Ridge Regression / Tikhonov-Phillips Regularization https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
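For concreteness, here is a minimal sketch of Tikhonov / ridge regularization in the spirit of this slide; the data X, y and the strength alpha are illustrative, not from the talk:

```python
import numpy as np

# Illustrative data for min_w ||X w - y||^2 + alpha ||w||^2 (not from the talk).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))      # design / feature matrix
y = rng.normal(size=100)            # targets
alpha = 1.0                         # Tikhonov regularization strength

# Closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y.
# Adding alpha to every eigenvalue of X^T X softens the small eigenvalues,
# i.e. it "softens the rank of X" and keeps the focus on large eigenvalues.
XtX = X.T @ X
w_ridge = np.linalg.solve(XtX + alpha * np.eye(XtX.shape[0]), X.T @ y)

# alpha -> 0 recovers the Moore-Penrose pseudoinverse (1955) solution.
w_pinv = np.linalg.pinv(X) @ y
```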
  13. 13. c|c (TM) Motivations: how we study Regularization (TM) 13 the Energy Landscape is determined by the layer weights WL; traditional regularization is applied to WL; turn off regularization, turn it back on systematically, and study WL
  14. 14. c|c (TM) (TM) 14 calculation | consulting why deep learning works Energy Landscape and Information flow: what happens to the layer weight matrices WL ? [Energy landscape schematic: Information bottleneck, Entropy collapse, local minima, k=1 saddle points, k=2 saddle points, floor / ground state, Information / Entropy axes]
  15. 15. c|c (TM) (TM) 15 calculation | consulting why deep learning works Self-Regularization: Experiments Retrained LeNet5 on MNIST using Keras Two (2) other small models: 3-Layer MLP and a Mini AlexNet And examine pre-trained models (AlexNet, Inception, …) Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
  16. 16. c|c (TM) (TM) 16 calculation | consulting why deep learning works Matrix Complexity: Entropy and Stable Rank
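As an illustration of the two complexity measures named on this slide, here is a minimal sketch assuming the common conventions stable rank = ||W||_F^2 / ||W||_2^2 and spectral entropy of the normalized eigenvalues of X = W^T W (the exact definitions used in the talk may differ):

```python
import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a soft, scale-invariant proxy for the rank.
    sv = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(sv**2) / np.max(sv)**2)

def spectral_entropy(W, eps=1e-12):
    # Entropy of the normalized eigenvalues of X = W^T W, scaled to [0, 1].
    sv = np.linalg.svd(W, compute_uv=False)
    p = sv**2 / np.sum(sv**2)
    return float(-np.sum(p * np.log(p + eps)) / np.log(len(p)))

# A random Gaussian layer has near-maximal entropy and a large stable rank;
# both drop as training concentrates information in a few directions.
W = np.random.default_rng(0).normal(size=(500, 300))
print(stable_rank(W), spectral_entropy(W))
```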
  17. 17. c|c (TM) (TM) 17 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL Empirical Spectral Density (ESD: eigenvalues of X = WL^T WL)
      import keras
      import numpy as np
      import matplotlib.pyplot as plt
      …
      W = model.layers[i].get_weights()[0]      # layer weight matrix W_L
      …
      X = np.dot(W.T, W)                        # correlation matrix X = W_L^T W_L
      evals = np.linalg.eigvalsh(X)             # its eigenvalues form the ESD
      plt.hist(evals, bins=100, density=True)   # plot the ESD
  18. 18. c|c (TM) (TM) 18 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL Entropy decrease corresponds to breakdown of random structure and the onset of a new kind of self-regularization Empirical Spectral Density (ESD: eigenvalues of X = WL^T WL) Random Matrix Random + Spikes
  19. 19. c|c (TM) (TM) 19 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur the Empirical Spectral Density (ESD) converges to a deterministic function with well defined edges (depending on Q, the aspect ratio)
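For reference, a minimal sketch of the Marchenko-Pastur density in one common parameterization (bulk edges λ± = σ²(1 ± 1/√Q)², Q = N/M), overlaid on the ESD of a purely random Gaussian matrix; this is illustrative, not the fitting code behind the slides:

```python
import numpy as np
import matplotlib.pyplot as plt

def mp_density(lam, Q, sigma2=1.0):
    # Marchenko-Pastur density for X = W^T W / N with aspect ratio Q = N/M.
    lam_min = sigma2 * (1 - 1 / np.sqrt(Q))**2
    lam_max = sigma2 * (1 + 1 / np.sqrt(Q))**2
    rho = np.zeros_like(lam)
    inside = (lam > lam_min) & (lam < lam_max)
    rho[inside] = (Q / (2 * np.pi * sigma2 * lam[inside])) * \
        np.sqrt((lam_max - lam[inside]) * (lam[inside] - lam_min))
    return rho

N, M = 1000, 500
Q = N / M
W = np.random.default_rng(0).normal(size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)          # ESD of X = W^T W / N
lam = np.linspace(1e-3, evals.max() * 1.1, 500)
plt.hist(evals, bins=100, density=True, alpha=0.5, label="ESD (random W)")
plt.plot(lam, mp_density(lam, Q), label="Marchenko-Pastur")
plt.legend(); plt.show()
```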
  20. 20. c|c (TM) (TM) 20 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations very crisp edges Q
  21. 21. c|c (TM) (TM) 21 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
  22. 22. c|c (TM) (TM) 22 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models LeNet5 (1998) AlexNet (2012) InceptionV3 (2014) ResNet (2015) … DenseNet201 (2018) https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5 Conv2D MaxPool Conv2D MaxPool FC FC
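As a sketch of what "just apply to pre-trained models" can look like in practice, assuming the Keras applications API (InceptionV3 here; older models like AlexNet and LeNet5 are not bundled with keras.applications) and restricting, for simplicity, to the 2-D fully connected weight matrices:

```python
import numpy as np
from tensorflow import keras

# Load a pre-trained model with ImageNet weights.
model = keras.applications.InceptionV3(weights="imagenet")

for i, layer in enumerate(model.layers):
    weights = layer.get_weights()
    if not weights or weights[0].ndim != 2:
        continue                                  # keep only 2-D weight matrices
    W = weights[0]
    N, M = max(W.shape), min(W.shape)
    X = W.T @ W if W.shape[0] >= W.shape[1] else W @ W.T
    evals = np.linalg.eigvalsh(X)                 # ESD of this layer
    print(f"layer {i} ({layer.name}): shape {W.shape}, "
          f"Q = {N / M:.2f}, lambda_max = {evals.max():.3f}")
```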
  23. 23. c|c (TM) (TM) 23 calculation | consulting why deep learning works Marchenko-Pastur Bulk + Spikes Conv2D MaxPool Conv2D MaxPool FC FC softrank = 10% RMT: LeNet5
  24. 24. c|c (TM) (TM) 24 calculation | consulting why deep learning works RMT: AlexNet Marchenko-Pastur Bulk-decay | Heavy Tailed FC1 zoomed in FC2 zoomed in
  25. 25. c|c (TM) (TM) 25 calculation | consulting why deep learning works Random Matrix Theory: InceptionV3 Marchenko-Pastur bulk decay, onset of Heavy Tails W226
  26. 26. c|c (TM) (TM) 26 calculation | consulting why deep learning works Eigenvalue Analysis: Rank Collapse ? Modern DNNs: the soft rank collapses, but they do not lose hard rank λmin > 0: all smallest eigenvalues > 0, within the numerical (Recipes) threshold (Q > 1) λmin = 0: (hard) rank collapse (Q > 1) signifies over-regularization
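A minimal sketch of this check: count eigenvalues of X = W^T W that are zero to within a numerical threshold (hard rank loss) and compare against a soft-rank measure. The machine-epsilon threshold below is a stand-in for the "Numerical Recipes"-style threshold mentioned on the slide:

```python
import numpy as np

def rank_report(W):
    evals = np.linalg.eigvalsh(W.T @ W)
    # Numerical-zero threshold, relative to the largest eigenvalue.
    tol = evals.max() * max(W.shape) * np.finfo(W.dtype).eps
    hard_rank = int(np.sum(evals > tol))              # hard (numerical) rank
    soft_rank = float(np.sum(evals) / evals.max())    # stable-rank style proxy
    return hard_rank, min(W.shape), soft_rank

W = np.random.default_rng(0).normal(size=(800, 400))
hard, full, soft = rank_report(W)
print(f"hard rank {hard} / {full}, soft rank ~ {soft:.1f}")
# Well trained modern DNN layers keep hard rank == full while the soft rank shrinks.
```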
  27. 27. c|c (TM) (TM) 27 calculation | consulting why deep learning works RMT: 5+1 Phases of Training
  28. 28. c|c (TM) (TM) 28 calculation | consulting why deep learning works Bulk+Spikes: Small Models Rank 1 perturbation Perturbative correction Bulk Spikes Smaller, older models can be described perturbatively w/RMT
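To make the Bulk+Spikes picture concrete, here is a minimal sketch (sizes and the perturbation strength are illustrative) of a rank-1 perturbation pushing one eigenvalue out of the Marchenko-Pastur bulk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 500
W_rand = rng.normal(size=(N, M)) / np.sqrt(N)      # pure noise: bulk only

# Rank-1 "signal" direction with strength well above the noise scale.
u = rng.normal(size=(N, 1)); u /= np.linalg.norm(u)
v = rng.normal(size=(M, 1)); v /= np.linalg.norm(v)
W = W_rand + 3.0 * u @ v.T

evals = np.linalg.eigvalsh(W.T @ W)                # ascending order
bulk_edge = (1 + np.sqrt(M / N))**2                # lambda_+ for sigma^2 = 1
print(f"bulk edge ~ {bulk_edge:.2f}, largest eigenvalue ~ {evals[-1]:.2f}")
# The largest eigenvalue sits far outside the bulk: a spike carrying the signal.
```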
  29. 29. c|c (TM) (TM) 29 calculation | consulting why deep learning works Spikes: carry more information Information begins to concentrate in the spikes S(v) spikes have less entropy, are more localized than bulk
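A minimal sketch of the localization measure S(v) referred to here, taken as the normalized Shannon entropy of the squared components of an eigenvector (an assumption about the exact definition); the spiked direction below is deliberately localized for illustration:

```python
import numpy as np

def vector_entropy(v, eps=1e-12):
    # Entropy of |v_i|^2, normalized to [0, 1]; low values = localized vector.
    p = np.abs(v)**2
    p /= p.sum()
    return float(-np.sum(p * np.log(p + eps)) / np.log(len(p)))

rng = np.random.default_rng(0)
N, M = 1000, 500
W = rng.normal(size=(N, M)) / np.sqrt(N)
u = np.zeros((N, 1)); u[0] = 1.0                   # localized signal directions
v = np.zeros((M, 1)); v[0] = 1.0
W_spiked = W + 3.0 * u @ v.T

for name, mat in [("random bulk", W), ("with spike", W_spiked)]:
    _, evecs = np.linalg.eigh(mat.T @ mat)
    print(name, "top eigenvector entropy:", round(vector_entropy(evecs[:, -1]), 3))
# The spike's eigenvector has markedly lower entropy (more localized) than the
# delocalized top eigenvector of the purely random matrix.
```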
  30. 30. c|c (TM) (TM) 30 calculation | consulting why deep learning works Bulk+Spikes: ~ Tikhonov regularization Small models like LeNet5 exhibit traditional regularization: softer rank, eigenvalues above a simple scale threshold (the bulk edge λ+), and spikes that carry most of the information
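A minimal sketch of that scale threshold: estimate the bulk variance, compute the Marchenko-Pastur edge λ+, and count eigenvalues above it as spikes. The σ² estimate here (mean eigenvalue) is a crude stand-in for the careful MP fits used in the talk:

```python
import numpy as np

def count_spikes(W):
    N, M = max(W.shape), min(W.shape)
    X = W.T @ W / N if W.shape[0] >= W.shape[1] else W @ W.T / N
    evals = np.linalg.eigvalsh(X)
    sigma2 = evals.mean()                           # = sigma^2 for a pure MP bulk
    lam_plus = sigma2 * (1 + np.sqrt(M / N))**2     # bulk edge (scale threshold)
    return lam_plus, evals[evals > lam_plus]

W = np.random.default_rng(0).normal(size=(1000, 300))
lam_plus, spikes = count_spikes(W)
print(f"bulk edge ~ {lam_plus:.2f}, spikes above it: {len(spikes)}")
# A purely random W has essentially no spikes; a trained LeNet5-style layer
# shows a handful of spikes that carry most of the information.
```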
  31. 31. c|c (TM) (TM) 31 calculation | consulting why deep learning works Heavy Tailed: Self-Regularization W strongly correlated / highly non-random Can be modeled as if drawn from a heavy tailed distribution Then the RMT/MP ESD will also have heavy tails Known results from RMT / polymer theory (Bouchaud, Potters, etc.) AlexNet ResNet50 InceptionV3 DenseNet201 … Large, well trained, modern DNNs exhibit heavy tailed self-regularization
  32. 32. c|c (TM) (TM) 32 calculation | consulting why deep learning works Heavy Tailed: Self-Regularization Large, well trained, modern DNNs exhibit heavy tailed self-regularization Salient ideas: what we ‘suspect’ today No single scale threshold No simple low rank approximation for WL Contributions from correlations at all scales Cannot be treated perturbatively
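As a quick diagnostic in this spirit, here is a minimal sketch that draws a weight matrix from a heavy-tailed (Pareto) distribution and fits a power-law exponent to its ESD, assuming the third-party `powerlaw` package (pip install powerlaw); this is illustrative, not the fitting procedure behind the slides:

```python
import numpy as np
import powerlaw   # third-party package, assumed installed

# Heavy-tailed "weights": Pareto magnitudes with random signs (illustrative).
rng = np.random.default_rng(0)
N, M = 1000, 500
W = rng.pareto(2.5, size=(N, M)) * rng.choice([-1.0, 1.0], size=(N, M))

evals = np.linalg.eigvalsh(W.T @ W / N)
fit = powerlaw.Fit(evals[evals > 0])            # fits rho(lambda) ~ lambda^(-alpha)
print("tail exponent alpha ~", round(fit.power_law.alpha, 2))
# Heavy-tailed W gives a heavy-tailed ESD; strongly correlated, well trained
# DNN layers show the same signature even though nothing was "drawn" this way.
```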
  33. 33. c|c (TM) (TM) 33 calculation | consulting why deep learning works Self-Regularization: Batch size experiments We can cause small models to exhibit strong correlations / heavy tails by exploiting the Generalization Gap phenomenon Large batch sizes => decreased generalization accuracy Tuning the batch size from very large to very small
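A minimal sketch of such a sweep, using a small MLP on MNIST as an illustrative stand-in for the Mini AlexNet experiments; the saved ESDs can then be compared against Marchenko-Pastur as above:

```python
import numpy as np
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

for batch_size in [1024, 256, 64, 16, 4]:          # very large -> very small
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(512, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=5, verbose=0)

    W = model.layers[0].get_weights()[0]            # first FC layer, 784 x 512
    evals = np.linalg.eigvalsh(np.dot(W.T, W))      # its ESD
    np.save(f"esd_batch{batch_size}.npy", evals)    # inspect against MP later
```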
  34. 34. c|c (TM) (TM) 34 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Random-Like Bleeding-out Random-Like
  35. 35. c|c (TM) (TM) 35 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W
  36. 36. c|c (TM) (TM) 36 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk+Spikes Bulk+Spikes Bulk-decay
  37. 37. c|c (TM) (TM) 37 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk-decay Bulk-decay Heavy-tailed
  38. 38. c|c (TM) (TM) 38 calculation | consulting why deep learning works Applying RMT: What phase is your model in ? How to apply RMT: Q > 1, λmin > 0 → Bulk+Spikes, plus Tracy-Widom fluctuations, very crisp edges; Q ~ 1, λmin ~ 0 → Bulk-decay ? Heavy-tailed ? Large, well trained models approach heavy tailed self-regularization
  39. 39. c|c (TM) (TM) 39 calculation | consulting why deep learning works Applying RMT: What phase is your model in ? Large, well trained models approach heavy tailed self-regularization InceptionV3 Layer 226 Q~1.3 Bulk-decay ? Heavy-tailed ? best MP fits bulk not captured difficult to apply MP
  40. 40. c|c (TM) (TM) 40 calculation | consulting why deep learning works Applying RMT: Heavy Tails ~ Q=1 Large, well trained models approach heavy tailed self-regularization DenseNet201 typical layer w/ Q=1.92 Heavy-tailed, but seemingly within the MP eigenvalue bounds MP fit is terrible near the eigenvalue minimum λ = 0 the fitted variance 1.83 is quite large resembles a Q=1 fit, like a soft rank collapse best MP fit (Q fixed)
  41. 41. c|c (TM) (TM) 41 calculation | consulting why deep learning works Applying RMT: What phase is your model in ? How to apply RMT Large, well trained models approach heavy tailed self-regularization Q = 1, λmin = 0 The long tail takes the form of a very large variance; standard MP theory assumes finite variance
  42. 42. c|c (TM) (TM) 42 calculation | consulting why deep learning works Applying RMT: What phase is your model in ? Large, well trained models approach heavy tailed self-regularization best MP fit (Q fixed) Heavy-tailed, but not a clean power law no hard rank collapse, but close λmax ~ 30 InceptionV3 Layer 302 Q~2.048
  43. 43. c|c (TM) (TM) 43 calculation | consulting why deep learning works Applying RMT: Should we float Q ? Large, well trained models approach heavy tailed self-regularization InceptionV3 Layer 302 Q~2.048 best MP fit (Q = 1) Heavy-tailed, Q=1 does not fit λmax ~ 30 (not shown) ~1.3 very large variances do not capture the bulk
  44. 44. c|c (TM) (TM) 44 calculation | consulting why deep learning works Summary
 applied Random Matrix Theory (RMT)
 self-regularization ~ entropy / information decrease
 5+1 phases of learning
 small models ~ Tikhonov regularization
 modern DNNs have heavy-tailed self-regularization
  45. 45. c|c (TM) (TM) 45 calculation | consulting why deep learning works Implications: RMT and Deep Learning How can RMT be used to understand the Energy Landscape ? tradeoff between Energy and Entropy minimization Where are the local minima ? How is the Hessian behaved ? Are simpler models misleading ? Can we design better learning strategies ?
  46. 46. c|c (TM) Energy Funnels: Minimizing Frustration
 (TM) 46 calculation | consulting why deep learning works http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf Energy Landscape Theory for polymer / protein folding
  47. 47. c|c (TM) the Spin Glass of Minimal Frustration
 (TM) 47 calculation | consulting why deep learning works Conjectured 2015 on my blog (15 min fame on Hacker News) https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ Bulk+Spikes, flipped low lying Energy state in Spin Glass ~ spikes in RMT
  48. 48. c|c (TM) RMT w/Heavy Tails: Energy Landscape ? (TM) 48 calculation | consulting why deep learning works Compare to LeCun’s Spin Glass model (2015) Spin Glass w/Heavy Tails ? Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993) the Landscape is more funneled, no ‘problems’ with local minima ?
  49. 49. (TM) c|c (TM) c | c charles@calculationconsulting.com
