
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley


Talk given on Dec 13, 2018 at ICSI, UC Berkeley
http://www.icsi.berkeley.edu/icsi/events/2018/12/regularization-neural-networks

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production-quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. Moreover, we can use these heavy-tailed results to form a VC-like average-case complexity metric that resembles the product norm used in analyzing toy NNs, and we can use this to predict the test accuracy of pretrained DNNs without peeking at the test data.

  • Note from the comments: Slide 45 has a mistake; it should read Frobenius Norm Squared ~ Max Eigenvalue (see the plot).

Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley

  1. 1. calculation | consulting why deep learning works: self-regularization in deep neural networks (TM) c|c (TM) charles@calculationconsulting.com
  2. 2. calculation|consulting UC Berkeley / ICSI 2018 why deep learning works: self-regularization in deep neural networks (TM) charles@calculationconsulting.com
  3. 3. calculation | consulting why deep learning works Who Are We? c|c (TM) Dr. Charles H. Martin, PhD University of Chicago, Chemical Physics NSF Fellow in Theoretical Chemistry, UIUC Over 15 years experience in applied Machine Learning and AI ML algos for: Aardvark, acquired by Google (2010) Demand Media (eHow); first $1B IPO since Google Wall Street: BlackRock Fortune 500: Roche, France Telecom BigTech: eBay, Aardvark (Google), GoDaddy Private Equity: Anthropocene Institute www.calculationconsulting.com charles@calculationconsulting.com (TM) 3
  4. 4. calculation | consulting why deep learning works c|c (TM) (TM) 4 Michael W. Mahoney ICSI, RISELab, Dept. of Statistics UC Berkeley Algorithmic and statistical aspects of modern large-scale data analysis. large-scale machine learning | randomized linear algebra geometric network analysis | scalable implicit regularization PhD, Yale University, computational chemical physics SAMSI National Advisory Committee NRC Committee on the Analysis of Massive Data Simons Institute Fall 2013 and 2018 program on the Foundations of Data Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu Who Are We?
  5. 5. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 5 calculation | consulting why deep learning works NNs as spin glasses, LeCun et al. 2015. Looks exactly like old protein-folding results (late 90s): Energy Landscape Theory. Broad questions about Why Deep Learning Works? MMDS talk 2016, Blog post 2015: a completely different picture of DNNs
  6. 6. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 6 calculation | consulting why deep learning works Theoretical: deeper insight into Why Deep Learning Works? non-convex optimization? regularization? why is deep better? VC vs Stat Mech vs ? … Practical: useful insight to improve the engineering of DNNs: when is a network fully optimized? large batch sizes? weakly supervised deep learning? …
  7. 7. c|c (TM) Set up: the Energy Landscape (TM) 7 calculation | consulting why deep learning works minimize the Loss: but how do we avoid overtraining?
  8. 8. c|c (TM) Problem: How can this possibly work? (TM) 8 calculation | consulting why deep learning works highly non-convex? apparently not: the expected and observed behavior differ, and it has been suspected for a long time that local minima are not the issue
  9. 9. c|c (TM) Problem: Local Minima ? (TM) 9 calculation | consulting why deep learning works Duda, Hart and Stork, 2000 solution: add more capacity and regularize
  10. 10. c|c (TM) (TM) 10 calculation | consulting why deep learning works Understanding deep learning requires rethinking generalization Problem: What is Regularization in DNNs ? ICLR 2017 Best Paper: Large models overfit on randomly labeled data; Regularization cannot prevent this
  11. 11. c|c (TM) Motivations: what is Regularization ? (TM) 11 calculation | consulting why deep learning works every adjustable knob and switch is called regularization https://arxiv.org/pdf/1710.10686.pdf Dropout Batch Size Noisify Data …
  12. 12. Moore-Penrose pseudoinverse (1955) regularize (Phillips, 1962) familiar optimization problem c|c (TM) Motivations: what is Regularization ? (TM) 12 calculation | consulting why deep learning works Soften the rank of X, focus on large eigenvalues Ridge Regression / Tikhonov-Phillips Regularization https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
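As a quick aside (not from the deck), here is a minimal numpy sketch of ridge / Tikhonov-Phillips regression; the data, sizes, and value of lam are invented for illustration. It shows the sense in which the regularizer "softens" the rank: each eigenvalue e of X^T X is shrunk by the factor e / (e + lam), so small, noisy eigendirections are suppressed while the large eigenvalues dominate.

    import numpy as np

    # Hypothetical data: n samples, p features, one nearly collinear column.
    rng = np.random.default_rng(0)
    n, p = 100, 20
    X = rng.normal(size=(n, p))
    X[:, -1] = X[:, 0] + 1e-6 * rng.normal(size=n)
    y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

    lam = 1.0  # Tikhonov / ridge regularization strength
    # Closed-form Tikhonov-Phillips solution: (X^T X + lam*I)^(-1) X^T y
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Shrinkage factors e / (e + lam) applied to each eigendirection of X^T X
    evals = np.linalg.eigvalsh(X.T @ X)
    print(np.round(evals / (evals + lam), 3))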
  13. 13. c|c (TM) Motivations: how we study Regularization (TM) 13 we can characterize the learning process by studying the layer weight matrices W_L: the Energy Landscape is determined by the W_L and by the eigenvalues of the correlation matrix X = (1/N) W_L^T W_L
  14. 14. c|c (TM) (TM) 14 calculation | consulting why deep learning works Random Matrix Theory: Universality Classes for DNN Weight Matrices
  15. 15. c|c (TM) (TM) 15 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur. The Empirical Spectral Density (ESD) converges to a deterministic function with well-defined edges (which depend on Q, the aspect ratio)
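A short sketch (matrix sizes and variance are my assumptions) comparing the ESD of a pure Gaussian random weight matrix to the deterministic Marchenko-Pastur bulk edges lambda_pm = sigma^2 (1 +/- 1/sqrt(Q))^2, with Q = N/M:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    N, M = 1000, 500          # aspect ratio Q = N/M = 2 (assumed)
    Q, sigma2 = N / M, 1.0

    W = rng.normal(scale=np.sqrt(sigma2), size=(N, M))
    X = W.T @ W / N           # correlation matrix, as on the previous slide
    evals = np.linalg.eigvalsh(X)

    # Deterministic MP bulk edges
    lam_minus = sigma2 * (1 - 1 / np.sqrt(Q)) ** 2
    lam_plus = sigma2 * (1 + 1 / np.sqrt(Q)) ** 2

    plt.hist(evals, bins=100, density=True)
    plt.axvline(lam_minus, color="r")
    plt.axvline(lam_plus, color="r")
    plt.title("ESD of a Gaussian W vs Marchenko-Pastur edges")
    plt.show()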
  16. 16. c|c (TM) (TM) 16 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations: very crisp edges
  17. 17. c|c (TM) (TM) 17 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into W Empirical Spectral Density (ESD: eigenvalues of X) import keras import numpy as np import matplotlib.pyplot as plt … W = model.layers[i].get_weights()[0] N,M = W.shape … X = np.dot(W.T, W)/N evals = np.linalg.eigvals(X) plt.hist(evals, bins=100, density=True)
  18. 18. c|c (TM) (TM) 18 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL DNN training induces breakdown of Gaussian random structure and the onset of a new kind of heavy tailed self-regularization Gaussian random matrix Bulk+ Spikes Heavy Tailed Small, older NNs Large, modern DNNs
  19. 19. c|c (TM) (TM) 19 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed. A heavy-tailed random matrix has a heavy-tailed empirical spectral density. Given W with elements drawn from a heavy-tailed distribution P(W_ij) ~ W^-(1+mu), form X = (1/N) W^T W (but really with a mu-dependent normalization for the heaviest tails). Then the ESD has a power-law form rho(lambda) ~ lambda^-(a*mu+b), where the eigenvalue exponent is linear in mu
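For contrast with the Gaussian case, a hedged sketch (sizes and mu invented) of the ESD of a heavy-tailed Pareto matrix, with the eigenvalue tail fit using the powerlaw package, as the later slides do:

    import numpy as np
    import powerlaw  # pip install powerlaw

    rng = np.random.default_rng(0)
    N, M, mu = 1000, 500, 2.5
    W = rng.pareto(a=mu, size=(N, M))  # heavy-tailed elements, tail exponent mu
    X = W.T @ W / N
    evals = np.linalg.eigvalsh(X)

    # The ESD itself has a power-law tail; the fitted exponent tracks mu
    fit = powerlaw.Fit(evals)
    print("mu =", mu, " fitted ESD exponent alpha =", fit.power_law.alpha)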
  20. 20. c|c (TM) (TM) 20 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed RMT says if W is heavy tailed, the ESD will also have heavy tails. If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy-tailed distribution. Known results from the early 90s, developed in Finance (Bouchaud, Potters, etc.)
  21. 21. c|c (TM) (TM) 21 calculation | consulting why deep learning works Heavy Tailed RMT: Universality Classes The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
  22. 22. c|c (TM) (TM) 22 calculation | consulting why deep learning works Experiments on pre-trained DNNs: Traditional and Heavy Tailed Implicit Self-Regularization
  23. 23. c|c (TM) (TM) 23 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models LeNet5 (1998) AlexNet (2012) InceptionV3 (2014) ResNet (2015) … DenseNet201 (2018) https://medium.com/@siddharthdas_32104/ cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5 Conv2D MaxPool Conv2D MaxPool FC FC
  24. 24. c|c (TM) (TM) 24 calculation | consulting why deep learning works LeNet 5 resembles the MP Bulk + Spikes Conv2D MaxPool Conv2D MaxPool FC FC softrank = 10% RMT: LeNet5
  25. 25. c|c (TM) (TM) 25 calculation | consulting why deep learning works RMT: AlexNet Marchenko-Pastur Bulk-decay | Heavy Tailed FC1 zoomed in FC2 zoomed in
  26. 26. c|c (TM) (TM) 26 calculation | consulting why deep learning works Random Matrix Theory: InceptionV3 Marchenko-Pastur bulk decay, onset of Heavy Tails W226
  27. 27. c|c (TM) (TM) 27 calculation | consulting why deep learning works Eigenvalue Analysis: Rank Collapse ? Modern DNNs: the soft rank collapses, but they do not lose hard rank. lambda_min = 0: (hard) rank collapse (for Q > 1) signifies over-regularization. lambda_min > 0: all of the smallest eigenvalues stay above zero, within a numerical (Numerical Recipes) threshold
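A small helper (mine, not from the slides) for checking a layer for hard rank collapse, using the usual numerical-threshold convention of machine epsilon times the largest singular value, scaled by the matrix dimension:

    import numpy as np

    def check_rank_collapse(W):
        # Return (hard_rank, collapsed) for a float-valued weight matrix W
        N, M = W.shape
        sv = np.linalg.svd(W, compute_uv=False)
        tol = np.finfo(W.dtype).eps * max(N, M) * sv.max()
        hard_rank = int((sv > tol).sum())
        return hard_rank, hard_rank < min(N, M)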
  28. 28. c|c (TM) (TM) 28 calculation | consulting why deep learning works Bulk+Spikes: Small Models Rank 1 perturbation Perturbative correction Bulk Spikes Smaller, older models can be described perturbatively w/RMT
  29. 29. c|c (TM) (TM) 29 calculation | consulting why deep learning works Bulk+Spikes: ~ Tikhonov regularization Small models like LeNet5 exhibit traditional regularization: a softer rank, with eigenvalues above a simple scale threshold forming spikes that carry most of the information
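To illustrate the scale-threshold picture, a hedged helper (assuming unit element variance and N >= M) that counts the spike eigenvalues sitting above the Marchenko-Pastur bulk edge:

    import numpy as np

    def count_spikes(W):
        # Number of eigenvalues of X = W^T W / N above the MP bulk edge
        N, M = W.shape
        Q = N / M                              # assumes N >= M
        evals = np.linalg.eigvalsh(W.T @ W / N)
        lam_plus = (1 + 1 / np.sqrt(Q)) ** 2   # MP edge for sigma^2 = 1
        return int((evals > lam_plus).sum())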
  30. 30. c|c (TM) (TM) 30 calculation | consulting why deep learning works AlexNet, 
 VGG, ResNet, Inception, DenseNet, … Heavy Tailed RMT: Scale Free ESD All large, well-trained, modern DNNs exhibit heavy-tailed, scale-free self-regularization
  31. 31. c|c (TM) (TM) 31 calculation | consulting why deep learning works Universality of Power Law Exponents: 10,000 Weight Matrices
  32. 32. c|c (TM) (TM) 32 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality
  33. 33. c|c (TM) (TM) 33 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality: 500 matrices across ~50 architectures, Linear layers & Conv2D feature maps, and 80-90% have alpha < 4
  34. 34. c|c (TM) (TM) 34 calculation | consulting why deep learning works Rank Collapse: ImageNet and AllenNLP The pretrained ImageNet and AllenNLP show (almost) no rank collapse
  35. 35. c|c (TM) (TM) 35 calculation | consulting why deep learning works Power Law Universality: BERT The pretrained BERT model is not optimal, displays rank collapse
  36. 36. c|c (TM) (TM) 36 calculation | consulting why deep learning works Predicting Test / Generalization Accuracies: Universality and Complexity Metrics
  37. 37. c|c (TM) (TM) 37 calculation | consulting why deep learning works Universality: Capacity metrics Universality suggests the power-law exponent alpha would make a good, universal DNN capacity metric. Imagine a weighted average of the per-layer exponents, where the weights b ~ |W| reflect the scale of each matrix: an unsupervised, VC-like, data-dependent complexity metric for predicting trends in average-case generalization accuracy in DNNs
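One hedged reading of this idea, as a sketch: fit a power-law exponent per layer and weight it by the scale of that layer's spectrum. Taking the weights b_i = log10(lambda_max,i) is my assumption here; the exact weights come out of the norm-power-law relation developed on the next slides.

    import numpy as np
    import powerlaw

    def weighted_alpha(weight_matrices):
        alphas, weights = [], []
        for W in weight_matrices:
            N, M = W.shape
            evals = np.linalg.eigvalsh(W.T @ W / N)
            alphas.append(powerlaw.Fit(evals).power_law.alpha)
            weights.append(np.log10(evals.max()))   # assumed weight b_i
        return float(np.average(alphas, weights=weights))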
  38. 38. c|c (TM) (TM) 38 calculation | consulting why deep learning works DNN Capacity metrics: Product norms The product norm is a data-dependent, VC-like capacity metric for DNNs
  39. 39. c|c (TM) (TM) 39 calculation | consulting why deep learning works Predicting test accuracies: Product norms We can predict trends in the test accuracy without peeking at the test data!
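For concreteness, a minimal sketch of a product-norm style metric; using the Frobenius norm per layer is my choice here, and summing logs keeps the product of many layer norms numerically stable:

    import numpy as np

    def log_product_norm(weight_matrices):
        # log(prod_i ||W_i||_F) = sum_i log ||W_i||_F
        return float(sum(np.log10(np.linalg.norm(W)) for W in weight_matrices))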
  40. 40. c|c (TM) (TM) 40 calculation | consulting why deep learning works Universality: Capacity metrics We can do even better using the weighted-average alpha. But first, we need a relation between the Frobenius norm and the Power Law, and to solve… what are the weights b ?
  41. 41. c|c (TM) (TM) 41 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations. Create a random Heavy Tailed (Pareto) matrix W, form the correlation matrix X = (1/N) W^T W, and select a Normalization. The Frobenius Norm-Power Law relation depends on the Normalization
  42. 42. c|c (TM) (TM) 42 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations. The Frobenius Norm-Power Law relation depends on the Normalization: compute the ‘eigenvalues’ of X, fit them to a power law (exponent alpha), and examine the norm-powerlaw relation 2 log ||W||_F / log lambda_max
  43. 43. c|c (TM) (TM) 43 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations import numpy as np import powerlaw … N,M = … mu = … W = np.random.pareto(a=mu,size=(N,M)) X = np.dot(W.T, W)/N evals = np.linalg.eigvals(X) alpha = powerlaw.Fit(evals).power_law.alpha logNorm = np.log10(np.linalg.norm(W)) logMaxEig = np.log10(np.max(evals)) ratio = 2*logNorm/logMaxEig
  44. 44. c|c (TM) (TM) 44 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations. Scale-free Normalization: the Frobenius Norm is dominated by a single scale-free eigenvalue, and the relation is weakly scale dependent
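A quick numerical check (sizes and mu invented) that for a heavy-tailed W the largest eigenvalue dominates the trace of X, i.e. the normalized squared Frobenius norm, so ||W||_F^2 tracks lambda_max:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, mu = 1000, 500, 1.5
    W = rng.pareto(a=mu, size=(N, M))
    evals = np.linalg.eigvalsh(W.T @ W / N)
    print("lambda_max / sum(lambda) =", evals.max() / evals.sum())
    print("sum(lambda) == ||W||_F^2 / N:",
          np.isclose(evals.sum(), np.linalg.norm(W) ** 2 / N))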
  45. 45. c|c (TM) (TM) 45 calculation | consulting why deep learning works Heavy Tailed matrices: norm-power law relations. Standard Normalization: large Pareto matrices have a simple, limiting norm-power-law relation; the relation is only linear for very Heavy Tailed matrices
  46. 46. c|c (TM) (TM) 46 calculation | consulting why deep learning works Heavy Tailed matrices: norm-power law relations. Finite-size Pareto matrices have a universal, linear norm-power-law relation; the relation is nearly linear for very Heavy and Fat Tailed finite-size matrices
  47. 47. c|c (TM) (TM) 47 calculation | consulting why deep learning works Predicting test accuracies: weighted alpha Here we treat both Linear layers and Conv2D feature maps
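One common way (my assumption, not necessarily the exact construction behind these slides) to get 2D matrices out of a Conv2D kernel for this kind of spectral analysis is to slice it per spatial position:

    import numpy as np

    def conv2d_feature_matrices(kernel):
        # A Keras Conv2D kernel has shape (k1, k2, channels_in, channels_out);
        # slicing per spatial position yields k1*k2 matrices of shape (in, out)
        k1, k2, c_in, c_out = kernel.shape
        return [kernel[i, j, :, :] for i in range(k1) for j in range(k2)]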
  48. 48. c|c (TM) (TM) 48 calculation | consulting why deep learning works Predicting test accuracies: weighted alpha Associate the log product norm with the weighted alpha metric
  49. 49. c|c (TM) (TM) 49 calculation | consulting why deep learning works Predicting test accuracies: VGG Series The weighted alpha capacity metric predicts trends in test accuracy
  50. 50. c|c (TM) (TM) 50 calculation | consulting why deep learning works Predicting test accuracies: ResNet Series
  51. 51. c|c (TM) (TM) 51 calculation | consulting why deep learning works Open source tool: weightwatcher pip install weightwatcher … import weightwatcher as ww watcher = ww.WeightWatcher(model=model) results = watcher.analyze() watcher.get_summary() watcher.print_results() All results can be reproduced using the python weightwatcher tool python tool to analyze Fat Tails in Deep Neural Networks https://github.com/CalculatedContent/WeightWatcher
  52. 52. c|c (TM) (TM) 52 calculation | consulting why deep learning works Implicit Self-Regularization: 5+1 Phases of Training
  53. 53. c|c (TM) (TM) 53 calculation | consulting why deep learning works Self-Regularization: 5+1 Phases of Training
  54. 54. c|c (TM) (TM) 54 calculation | consulting why deep learning works Self-Regularization: 5+1 Phases of Training
  55. 55. c|c (TM) (TM) 55 calculation | consulting why deep learning works Self-Regularization: Batch size experiments Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC We can induce heavy tails in small models by decreasing the batch size Mini-AlexNet: retrained to exploit Generalization Gap
  56. 56. c|c (TM) (TM) 56 calculation | consulting why deep learning works Large batch sizes => decrease generalization accuracy Self-Regularization: Batch size experiments
  57. 57. c|c (TM) (TM) 57 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Random-Like Bleeding-out Random-Like Decreasing the batch size induces strong correlations in W
  58. 58. c|c (TM) (TM) 58 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Gaussian random matrix Bulk+ Spikes Heavy Tailed Large batch sizes Small batch sizes
  59. 59. c|c (TM) (TM) 59 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk+Spikes Bulk+Spikes Bulk-decay
  60. 60. c|c (TM) (TM) 60 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk-decay Bulk-decay Heavy-tailed
  61. 61. c|c (TM) (TM) 61 calculation | consulting why deep learning works Summary
 •reviewed Random Matrix Theory (RMT) •small NNs ~ Tikhonov regularization •modern DNNs are Heavy-Tailed •Universality of Power Law exponent alpha •can predict generalization accuracies using new capacity metric, weighted average alpha •can induce heavy tails by exploiting the Generalization Gap phenomenon, decreasing batch size
  62. 62. c|c (TM) (TM) 62 calculation | consulting why deep learning works Energy Landscape and Information flow: Information bottleneck, Entropy collapse, local minima, k=1 saddle points, k=2 saddle points, floor / ground state, Information / Entropy. Is this right? It is based on a Gaussian Spin Glass model
  63. 63. c|c (TM) (TM) 63 calculation | consulting why deep learning works Implications: RMT and Deep Learning How can we characterize the Energy Landscape? the tradeoff between Energy and Entropy minimization. Where are the local minima? How does the Hessian behave? Are simpler models misleading? Can we design better learning strategies?
  64. 64. c|c (TM) Energy Funnels: Minimizing Frustration
 (TM) 64 calculation | consulting why deep learning works http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf Energy Landscape Theory for polymer / protein folding
  65. 65. c|c (TM) the Spin Glass of Minimal Frustration 
 (TM) 65 calculation | consulting why deep learning works Conjectured 2015 on my blog (15 min fame on Hacker News) https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ Bulk+Spikes, flipped low lying Energy state in Spin Glass ~ spikes in RMT
  66. 66. c|c (TM) RMT w/Heavy Tails: Energy Landscape ? (TM) 66 calculation | consulting why deep learning works Compare to LeCun’s Spin Glass model (2015): a Spin Glass with Heavy Tails? Local minima do not concentrate near the ground state (Cizeau and Bouchaud, 1993), so is the Landscape more funneled, with no ‘problems’ from local minima?
  67. 67. (TM) c|c (TM) c | c charles@calculationconsulting.com
