Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley


Talk given on Dec 13, 2018 at ICSI, UC Berkeley
http://www.icsi.berkeley.edu/icsi/events/2018/12/regularization-neural-networks

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a ``size scale'' separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. Moreover, we can use these heavy tailed results to form a VC-like average case complexity metric that resembles the product norm used in analyzing toy NNs, and we can use this to predict the test accuracy of pretrained DNNs without peeking at the test data.

  1. 1. calculation | consulting why deep learning works: self-regularization in deep neural networks (TM) c|c (TM) charles@calculationconsulting.com
  2. 2. calculation|consulting UC Berkeley / ICSI 2018 why deep learning works: self-regularization in deep neural networks (TM) charles@calculationconsulting.com
  3. 3. calculation | consulting why deep learning works Who Are We? c|c (TM) Dr. Charles H. Martin, PhD University of Chicago, Chemical Physics NSF Fellow in Theoretical Chemistry, UIUC Over 15 years experience in applied Machine Learning and AI ML algos for: Aardvark, acquired by Google (2010) Demand Media (eHow); first $1B IPO since Google Wall Street: BlackRock Fortune 500: Roche, France Telecom BigTech: eBay, Aardvark (Google), GoDaddy Private Equity: Anthropocene Institute www.calculationconsulting.com charles@calculationconsulting.com (TM) 3
  4. 4. calculation | consulting why deep learning works c|c (TM) (TM) 4 Michael W. Mahoney ICSI, RISELab, Dept. of Statistics UC Berkeley Algorithmic and statistical aspects of modern large-scale data analysis. large-scale machine learning | randomized linear algebra geometric network analysis | scalable implicit regularization PhD, Yale University, computational chemical physics SAMSI National Advisory Committee NRC Committee on the Analysis of Massive Data Simons Institute Fall 2013 and 2018 program on the Foundations of Data Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu Who Are We?
  5. 5. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 5 calculation | consulting why deep learning works NNs as spin glasses LeCun et al. 2015 Looks exactly like old protein folding results (late 90s) Energy Landscape Theory broad questions about Why Deep Learning Works ? MDDS talk 2016 Blog post 2015 completely different picture of DNNs
  6. 6. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 6 calculation | consulting why deep learning works Theoretical: deeper insight into Why Deep Learning Works ? non-convex optimization ? regularization ? why is deep better ? VC vs Stat Mech vs ? … Practical: useful insight to improve engineering DNNs when is a network fully optimized ? large batch sizes ? weakly supervised deep learning ? …
  7. 7. c|c (TM) Set up: the Energy Landscape (TM) 7 calculation | consulting why deep learning works minimize the Loss: but how to avoid overtraining ?
  8. 8. c|c (TM) Problem: How can this possibly work ? (TM) 8 calculation | consulting why deep learning works highly non-convex ? apparently not. expected vs. observed: it has been suspected for a long time that local minima are not the issue
  9. 9. c|c (TM) Problem: Local Minima ? (TM) 9 calculation | consulting why deep learning works Duda, Hart and Stork, 2000 solution: add more capacity and regularize
  10. 10. c|c (TM) (TM) 10 calculation | consulting why deep learning works Understanding deep learning requires rethinking generalization Problem: What is Regularization in DNNs ? ICLR 2017 Best paper Large models overfit on randomly labeled data Regularization can not prevent this
  11. 11. c|c (TM) Motivations: what is Regularization ? (TM) 11 calculation | consulting why deep learning works every adjustable knob and switch is called regularization https://arxiv.org/pdf/1710.10686.pdf Dropout Batch Size Noisify Data …
  12. 12. Moore-Penrose pseudoinverse (1955) regularize (Phillips, 1962) familiar optimization problem c|c (TM) Motivations: what is Regularization ? (TM) 12 calculation | consulting why deep learning works Soften the rank of X, focus on large eigenvalues Ridge Regression / Tikhonov-Phillips Regularization https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
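To make the Tikhonov-Phillips picture concrete, here is a minimal numpy sketch of ridge regression (not code from the talk); the names X, y and lam are illustrative.

import numpy as np

# Ridge regression / Tikhonov-Phillips regularization: adding lam*I before
# inverting softens the rank of X and suppresses directions with small eigenvalues.
def ridge_solution(X, y, lam=0.1):
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)
w_hat = ridge_solution(X, y, lam=0.1)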
  13. 13. c|c (TM) Motivations: how we study Regularization (TM) 13 we can characterize the learning process by studying W_L: the Energy Landscape is determined by the layer weights W_L, and by the eigenvalues of the correlation matrix X = (1/N) W_L^T W_L
  14. 14. c|c (TM) (TM) 14 calculation | consulting why deep learning works Random Matrix Theory: Universality Classes for DNN Weight Matrices
  15. 15. c|c (TM) (TM) 15 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur converges to a deterministic function Empirical Spectral Density (ESD) with well defined edges (depends on Q, aspect ratio)
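As a quick numerical check of the Marchenko-Pastur picture (a sketch, not code from the talk), one can compare the ESD of a Gaussian random W against the predicted bulk edges lambda± = sigma^2 (1 ± 1/sqrt(Q))^2, with Q = N/M.

import numpy as np

# ESD of X = (1/N) W^T W for Gaussian W, versus the Marchenko-Pastur bulk edges.
N, M, sigma = 1000, 250, 1.0
Q = N / M
W = np.random.normal(scale=sigma, size=(N, M))
X = np.dot(W.T, W) / N
evals = np.linalg.eigvalsh(X)

lam_minus = sigma**2 * (1 - 1/np.sqrt(Q))**2
lam_plus = sigma**2 * (1 + 1/np.sqrt(Q))**2
print(evals.min(), evals.max())   # well-defined edges, close to the prediction
print(lam_minus, lam_plus)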
  16. 16. c|c (TM) (TM) 16 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations very crisp edges
  17. 17. c|c (TM) (TM) 17 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into W Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
plt.hist(evals, bins=100, density=True)
  18. 18. c|c (TM) (TM) 18 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL DNN training induces breakdown of Gaussian random structure and the onset of a new kind of heavy tailed self-regularization Gaussian random matrix Bulk+ Spikes Heavy Tailed Small, older NNs Large, modern DNNs
  19. 19. c|c (TM) (TM) 19 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed A heavy-tailed random matrix has a heavy-tailed empirical spectral density: given W with elements drawn from a heavy-tailed (e.g. Pareto) distribution with tail exponent mu, form X = (1/N) W^T W (but really at finite N, M); the ESD then has a power-law form whose tail exponent is linear in mu
  20. 20. c|c (TM) (TM) 20 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed RMT says if W is heavy tailed, the ESD will also have heavy tails If W is strongly correlated , then the ESD can be modeled as if W is drawn from a heavy tailed distribution Known results from early 90s, developed in Finance (Bouchaud, Potters, etc)
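A small numerical illustration of this claim (a sketch, not from the slides): draw W with Pareto entries and look at the upper tail of the ESD on log-log axes, where a heavy tail shows up as a roughly straight line.

import numpy as np
import matplotlib.pyplot as plt

# Heavy-tailed W => heavy-tailed ESD of X = (1/N) W^T W
N, M, mu = 1000, 500, 2.5
W = np.random.pareto(a=mu, size=(N, M))
X = np.dot(W.T, W) / N
evals = np.linalg.eigvalsh(X)

bins = np.logspace(-2, np.log10(evals.max()), 100)
counts, edges = np.histogram(evals, bins=bins, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])
plt.loglog(centers[counts > 0], counts[counts > 0], '.')
plt.xlabel('eigenvalue')
plt.ylabel('ESD')
plt.show()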
  21. 21. c|c (TM) (TM) 21 calculation | consulting why deep learning works Heavy Tailed RMT: Universality Classes The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
  22. 22. c|c (TM) (TM) 22 calculation | consulting why deep learning works Experiments on pre-trained DNNs: Traditional and Heavy Tailed Implicit Self-Regularization
  23. 23. c|c (TM) (TM) 23 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models LeNet5 (1998) AlexNet (2012) InceptionV3 (2014) ResNet (2015) … DenseNet201 (2018) https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5 Conv2D MaxPool Conv2D MaxPool FC FC
  24. 24. c|c (TM) (TM) 24 calculation | consulting why deep learning works LeNet 5 resembles the MP Bulk + Spikes Conv2D MaxPool Conv2D MaxPool FC FC softrank = 10% RMT: LeNet5
  25. 25. c|c (TM) (TM) 25 calculation | consulting why deep learning works RMT: AlexNet Marchenko-Pastur Bulk-decay | Heavy Tailed FC1 zoomed in FC2 zoomed in
  26. 26. c|c (TM) (TM) 26 calculation | consulting why deep learning works Random Matrix Theory: InceptionV3 Marchenko-Pastur bulk decay, onset of Heavy Tails W226
  27. 27. c|c (TM) (TM) 27 calculation | consulting why deep learning works Eigenvalue Analysis: Rank Collapse ? Modern DNNs: soft rank collapses, but they do not lose hard rank. (Hard) rank collapse (some eigenvalues = 0 with Q > 1) signifies over-regularization; here all of the smallest eigenvalues stay > 0, within the numerical (Recipes) threshold
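One way to make this operational (a sketch under my own definitions, not necessarily the talk's): compare the numerical hard rank of W with a stable-rank style soft-rank proxy ||W||_F^2 / lambda_max.

import numpy as np

# Hard rank: number of singular values above a numerical tolerance
# (hard rank collapse would mean the smallest eigenvalues hit ~0).
# Stable rank evals.sum()/evals.max() = ||W||_F^2 / sigma_max(W)^2 is used here
# as a common soft-rank proxy; the talk's soft-rank definition may differ.
def hard_and_soft_rank(W):
    N = W.shape[0]
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    hard = np.linalg.matrix_rank(W)
    soft = evals.sum() / evals.max()
    return hard, soft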
  28. 28. c|c (TM) (TM) 28 calculation | consulting why deep learning works Bulk+Spikes: Small Models Rank 1 perturbation Perturbative correction Bulk Spikes Smaller, older models can be described perturbatively w/RMT
  29. 29. c|c (TM) (TM) 29 calculation | consulting why deep learning works Bulk+Spikes: ~ Tikhonov regularization Small models like LeNet5 exhibit traditional regularization: soft(er) rank, eigenvalues above a simple scale threshold, and the spikes carry most of the information
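A minimal sketch of a Bulk+Spikes count (my own crude construction, not the talk's code): estimate the bulk edge from the Marchenko-Pastur formula, using the raw element variance of W for sigma^2, and count eigenvalues above it.

import numpy as np

# Count 'spikes': eigenvalues of X = (1/N) W^T W above the MP bulk edge
# lambda_plus = sigma^2 (1 + 1/sqrt(Q))^2. Estimating sigma^2 from np.var(W)
# is a crude choice; more careful bulk fits are possible.
def count_spikes(W):
    N, M = W.shape
    Q = N / M
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    lam_plus = np.var(W) * (1 + 1/np.sqrt(Q))**2
    return int(np.sum(evals > lam_plus))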
  30. 30. c|c (TM) (TM) 30 calculation | consulting why deep learning works AlexNet, VGG, ResNet, Inception, DenseNet, … Heavy Tailed RMT: Scale Free ESD All large, well trained, modern DNNs exhibit heavy tailed self-regularization (scale free)
  31. 31. c|c (TM) (TM) 31 calculation | consulting why deep learning works Universality of Power Law Exponents: 10,000 Weight Matrices
  32. 32. c|c (TM) (TM) 32 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality
  33. 33. c|c (TM) (TM) 33 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality 500 matrices ~50 architectures Linear layers & Conv2D feature maps 80-90% of the exponents are < 4
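The kind of layer-by-layer exponent survey described here can be sketched as follows (restricted to 2D Dense weight matrices for brevity; the talk also treats Conv2D feature maps). It assumes tensorflow.keras and the powerlaw package.

import numpy as np
import powerlaw
from tensorflow.keras.applications import VGG16

# Fit a power-law exponent alpha to the ESD of each fully connected layer
# of a pretrained ImageNet model.
model = VGG16(weights='imagenet')
alphas = []
for layer in model.layers:
    weights = layer.get_weights()
    if not weights or weights[0].ndim != 2:
        continue                      # keep only 2D (Dense) weight matrices
    W = weights[0]
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    alphas.append(powerlaw.Fit(evals).power_law.alpha)
print(alphas)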
  34. 34. c|c (TM) (TM) 34 calculation | consulting why deep learning works Rank Collapse: ImageNet and AllenNLP The pretrained ImageNet and AllenNLP show (almost) no rank collapse
  35. 35. c|c (TM) (TM) 35 calculation | consulting why deep learning works Power Law Universality: BERT The pretrained BERT model is not optimal, displays rank collapse
  36. 36. c|c (TM) (TM) 36 calculation | consulting why deep learning works Predicting Test / Generalization Accuracies: Universality and Complexity Metrics
  37. 37. c|c (TM) (TM) 37 calculation | consulting why deep learning works Universality: Capacity metrics Universality suggests the power law exponent would make a good, universal DNN capacity metric Imagine a weighted average, where the weights b ~ |W|, the scale of the matrix An Unsupervised, VC-like, data-dependent complexity metric for predicting trends in average-case generalization accuracy in DNNs
  38. 38. c|c (TM) (TM) 38 calculation | consulting why deep learning works DNN Capacity metrics: Product norms The product norm is a data-dependent, VC-like capacity metric for DNNs
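In log space the product norm is just a sum of per-layer log norms; a minimal sketch (the layer selection shown in the comment is illustrative, not the talk's exact recipe).

import numpy as np

# log of the product norm: log prod_l ||W_l||_F = sum_l log ||W_l||_F
# (working in logs avoids overflow for deep networks)
def log_product_norm(weight_matrices):
    return sum(np.log10(np.linalg.norm(W)) for W in weight_matrices)

# usage: first collect the 2D weight matrices of a Keras-style model, e.g.
# Ws = [l.get_weights()[0] for l in model.layers
#       if l.get_weights() and l.get_weights()[0].ndim == 2]
# print(log_product_norm(Ws))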
  39. 39. c|c (TM) (TM) 39 calculation | consulting why deep learning works Predicting test accuracies: Product norms We can predict trends in the test accuracy without peeking at the test data !
  40. 40. c|c (TM) (TM) 40 calculation | consulting why deep learning works Universality: Capacity metrics We can do even better using the weighted average alpha But first, we need a relation between the Frobenius norm and the Power Law And to solve… what are the weights b ?
  41. 41. c|c (TM) (TM) 41 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations create a random Heavy Tailed (Pareto) matrix; form the Correlation matrix, select a Normalization; the Frobenius Norm-Power Law relation depends on the Normalization
  42. 42. c|c (TM) (TM) 42 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations Frobenius Norm-Power Law relation depends on the Normalization compute the ‘eigenvalues’ of X, fit them to a power law, and examine the norm-powerlaw relation: 2 log ||W||_F / log lambda_max (see the code on the next slide)
  43. 43. c|c (TM) (TM) 43 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations
import numpy as np
import powerlaw
…
N, M = …
mu = …
W = np.random.pareto(a=mu, size=(N, M))
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
alpha = powerlaw.Fit(evals).power_law.alpha
logNorm = np.log10(np.linalg.norm(W))
logMaxEig = np.log10(np.max(evals))
ratio = 2*logNorm/logMaxEig
  44. 44. c|c (TM) (TM) 44 calculation | consulting why deep learning works Scale-free Normalization the Frobenius Norm is dominated by a single scale-free eigenvalue Heavy Tailed matrices: norm-powerlaw relations Relation is weakly scale dependent
  45. 45. c|c (TM) (TM) 45 calculation | consulting why deep learning works large Pareto matrices have a simple, limiting norm-power-law relation Heavy Tailed matrices: norm-power law relations Standard Normalization Relation is only linear for very Heavy Tailed matrices
  46. 46. c|c (TM) (TM) 46 calculation | consulting why deep learning works Finite size Pareto matrices have a universal, linear norm-power-law relation Heavy Tailed matrices: norm-power law relations Relation is nearly linear for very Heavy and Fat Tailed finite size matrices
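Building on the snippet from slide 43, a small sweep over the Pareto tail parameter mu makes the (near-)linearity easy to check numerically (a sketch, not the talk's experiment).

import numpy as np
import powerlaw

# For each mu, generate a finite-size Pareto matrix and record the fitted
# power-law exponent alpha together with ratio = 2 log||W||_F / log(lambda_max).
N, M = 1000, 500
for mu in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]:
    W = np.random.pareto(a=mu, size=(N, M))
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    alpha = powerlaw.Fit(evals).power_law.alpha
    ratio = 2 * np.log10(np.linalg.norm(W)) / np.log10(np.max(evals))
    print(mu, alpha, ratio)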
  47. 47. c|c (TM) (TM) 47 calculation | consulting why deep learning works Predicting test accuracies: weighted alpha Here we treat both Linear layers and Conv2D feature maps
  48. 48. c|c (TM) (TM) 48 calculation | consulting why deep learning works Predicting test accuracies: weighted alpha Associate the log product norm with the weighted alpha metric
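One plausible reading of this weighting (a sketch; the talk's exact definition may differ) is to weight each layer's exponent alpha_l by the log of that layer's largest eigenvalue and then average over layers.

import numpy as np
import powerlaw

# Weighted-alpha sketch: average of alpha_l * log10(lambda_max,l) over layers.
def weighted_alpha(weight_matrices):
    terms = []
    for W in weight_matrices:
        N = W.shape[0]
        X = np.dot(W.T, W) / N
        evals = np.linalg.eigvalsh(X)
        alpha = powerlaw.Fit(evals).power_law.alpha
        terms.append(alpha * np.log10(np.max(evals)))
    return float(np.mean(terms))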
  49. 49. c|c (TM) (TM) 49 calculation | consulting why deep learning works Predicting test accuracies: VGG Series The weighted alpha capacity metric predicts trends in test accuracy
  50. 50. c|c (TM) (TM) 50 calculation | consulting why deep learning works Predicting test accuracies: ResNet Series
  51. 51. c|c (TM) (TM) 51 calculation | consulting why deep learning works Open source tool: weightwatcher python tool to analyze Fat Tails in Deep Neural Networks https://github.com/CalculatedContent/WeightWatcher All results can be reproduced using the python weightwatcher tool
pip install weightwatcher
…
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
  52. 52. c|c (TM) (TM) 52 calculation | consulting why deep learning works Implicit Self-Regularization: 5+1 Phases of Training
  53. 53. c|c (TM) (TM) 53 calculation | consulting why deep learning works Self-Regularization: 5+1 Phases of Training
  54. 54. c|c (TM) (TM) 54 calculation | consulting why deep learning works Self-Regularization: 5+1 Phases of Training
  55. 55. c|c (TM) (TM) 55 calculation | consulting why deep learning works Self-Regularization: Batch size experiments Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC We can induce heavy tails in small models by decreasing the batch size Mini-AlexNet: retrained to exploit Generalization Gap
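The experimental loop can be sketched with a self-contained stand-in (a small dense network on synthetic data, not the Mini-AlexNet from the talk): retrain at several batch sizes, then inspect the ESD of the FC weight matrix for each run.

import numpy as np
from tensorflow import keras

# Stand-in for the batch-size experiment: retrain the same small model at
# several batch sizes and look at the top eigenvalue of the FC correlation
# matrix; smaller batches are expected to push the ESD toward heavier tails.
x = np.random.normal(size=(5000, 64)).astype('float32')
y = (x[:, 0] > 0).astype('int32')

for batch_size in [1024, 128, 8]:
    model = keras.Sequential([
        keras.layers.Dense(256, activation='relu', input_shape=(64,), name='fc1'),
        keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
    model.fit(x, y, batch_size=batch_size, epochs=10, verbose=0)
    W = model.get_layer('fc1').get_weights()[0]
    X = np.dot(W.T, W) / W.shape[0]
    evals = np.linalg.eigvalsh(X)
    print(batch_size, float(evals.max()))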
  56. 56. c|c (TM) (TM) 56 calculation | consulting why deep learning works Large batch sizes => decrease generalization accuracy Self-Regularization: Batch size experiments
  57. 57. c|c (TM) (TM) 57 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Random-Like | Bleeding-out | Random-Like Decreasing the batch size induces strong correlations in W
  58. 58. c|c (TM) (TM) 58 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Gaussian random matrix Bulk+ Spikes Heavy Tailed Large batch sizes Small batch sizes
  59. 59. c|c (TM) (TM) 59 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk+Spikes Bulk+Spikes Bulk-decay
  60. 60. c|c (TM) (TM) 60 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk-decay Bulk-decay Heavy-tailed
  61. 61. c|c (TM) (TM) 61 calculation | consulting why deep learning works Summary
• reviewed Random Matrix Theory (RMT)
• small NNs ~ Tikhonov regularization
• modern DNNs are Heavy-Tailed
• Universality of the Power Law exponent alpha
• can predict generalization accuracies using a new capacity metric, the weighted average alpha
• can induce heavy tails by exploiting the Generalization Gap phenomenon, decreasing the batch size
  62. 62. c|c (TM) (TM) 62 calculation | consulting why deep learning works Energy Landscape and Information flow: Information bottleneck, Entropy collapse, local minima, k=1 saddle points, floor / ground state, k=2 saddle points, Information / Entropy Is this right ? Based on a Gaussian Spin Glass model
  63. 63. c|c (TM) (TM) 63 calculation | consulting why deep learning works Implications: RMT and Deep Learning How can we characterize the Energy Landscape ? tradeoff between Energy and Entropy minimization Where are the local minima ? How is the Hessian behaved ? Are simpler models misleading ? Can we design better learning strategies ?
  64. 64. c|c (TM) Energy Funnels: Minimizing Frustration (TM) 64 calculation | consulting why deep learning works http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf Energy Landscape Theory for polymer / protein folding
  65. 65. c|c (TM) the Spin Glass of Minimal Frustration (TM) 65 calculation | consulting why deep learning works Conjectured 2015 on my blog (15 min fame on Hacker News) https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ Bulk+Spikes, flipped: low lying Energy state in Spin Glass ~ spikes in RMT
  66. 66. c|c (TM) RMT w/Heavy Tails: Energy Landscape ? (TM) 66 calculation | consulting why deep learning works Compare to LeCun’s Spin Glass model (2015) Spin Glass w/ Heavy Tails ? Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993); the Landscape is more funneled, no ‘problems’ with local minima ?
  67. 67. (TM) c|c (TM) c | c charles@calculationconsulting.com