calculation | consulting
the WeightWatcher project - or Why Deep Learning Works
charles@calculationconsulting.com
calculation | consulting why deep learning works
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: Barclays, BlackRock
Fortune 500: Roche, France Telecom, Walmart
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)
www.calculationconsulting.com
charles@calculationconsulting.com
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 programs on the Foundations of Data Science
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Open source tool: weightwatcher
Understanding deep learning requires rethinking generalization
Motivations: WeightWatcher Theory
The weightwatcher theory is a Semi-Empirical theory based on:

the Statistical Mechanics of Generalization,
Random Matrix Theory, and
the theory of Strongly Correlated Systems
Research: Implicit Self-Regularization in Deep Learning
Selected publications

• Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (JMLR 2021)
• Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (ICML 2019, SDM 2020)
• Workshop: Statistical Mechanics Methods for Discovering Knowledge from Production-Scale Neural Networks (KDD 2020)
• Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data (Nature Communications 2021)
• More in press today: [Contest post-mortem] [Training transformers]
• Some unpublished, experimental results are also discussed
WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices
The tail of the ESD contains the information
Well-trained layers are heavy-tailed and well shaped
GPT-2 fits a Power Law (or Truncated Power Law), with alpha in [2, 6]

watcher.analyze(plot=True)

Good quality of fit (D is small)
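The alpha reported by analyze() is a power-law exponent fitted to the tail of the ESD. Below is a minimal numpy sketch of the idea only, not the weightwatcher implementation (which uses a proper MLE power-law fit and reports the KS distance D); a synthetic eigenvalue sample and the Hill estimator stand in for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ESD tail: eigenvalue density rho(lam) ~ lam^(-alpha) with alpha = 3
alpha_true = 3.0
eigs = 1.0 + rng.pareto(alpha_true - 1.0, size=20000)  # tail index = alpha - 1

def hill_alpha(eigs, xmin):
    """Hill estimator of the power-law exponent alpha of the ESD tail."""
    tail = eigs[eigs >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

alpha_hat = hill_alpha(eigs, xmin=1.0)
print(round(alpha_hat, 2))  # close to 3.0
```

In practice one calls watcher.analyze(plot=True) on a real model and reads alpha (and the quality-of-fit D) per layer from the returned details.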
Better-trained layers are more heavy-tailed and better shaped (GPT vs. GPT-2)
Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations
RMT says that if W is a simple random Gaussian matrix, then the ESD will have a very simple, known form. The shape depends on Q = N/M (and variance ~ 1). The eigenvalues are tightly bounded, with very crisp edges; a few spikes may appear.
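This can be checked in a few lines of numpy (the sizes N = 1000, M = 250 are chosen here for illustration): the ESD of a Gaussian random W stays inside the Marchenko-Pastur bulk edges lam± = (1 ± sqrt(1/Q))², up to small Tracy-Widom fluctuations.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 1000, 250          # W is N x M, so Q = N/M = 4
Q = N / M
W = rng.normal(0.0, 1.0, size=(N, M))

# ESD: eigenvalues of the correlation matrix X = W^T W / N
X = W.T @ W / N
eigs = np.linalg.eigvalsh(X)

# Marchenko-Pastur bulk edges for unit variance
lam_plus = (1 + np.sqrt(1 / Q)) ** 2   # upper edge, here 2.25
lam_minus = (1 - np.sqrt(1 / Q)) ** 2  # lower edge, here 0.25

print(eigs.min(), eigs.max())  # tightly bounded near [0.25, 2.25]
```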
RMT: AlexNet
Marchenko-Pastur bulk-decay | Heavy-Tailed
(panels: FC1 and FC2, each shown zoomed in)
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails
(i.e. it is all spikes; the bulk vanishes)
If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy tailed distribution
Nearly all pre-trained DNNs display heavy tails…as we shall soon see
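A small numpy experiment contrasting the two cases (symmetrized Pareto entries stand in for a strongly correlated heavy-tailed layer; the tail index 1.5 and the matrix sizes are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 250

def esd(W):
    """Eigenvalues of the correlation matrix X = W^T W / N (the ESD)."""
    return np.linalg.eigvalsh(W.T @ W / W.shape[0])

W_gauss = rng.normal(size=(N, M))
# Heavy-tailed W: symmetrized Pareto entries with tail index 1.5
W_heavy = rng.pareto(1.5, size=(N, M)) * rng.choice([-1.0, 1.0], size=(N, M))

gauss, heavy = esd(W_gauss), esd(W_heavy)

# Gaussian: bulk tightly bounded near the MP edge. Heavy-tailed: dominated by spikes.
print(gauss.max() / np.median(gauss))   # O(1)
print(heavy.max() / np.median(heavy))   # orders of magnitude larger
```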
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
(architecture diagram: Conv2D → MaxPool → Conv2D → MaxPool → FC → FC)
Heavy-Tailed Self-Regularization (HTSR)

All large, well trained, modern DNNs exhibit heavy tailed (scale-free) self-regularization:

AlexNet,
VGG11, VGG13, …
ResNet, …
Inception,
DenseNet,
BERT, RoBERTa, …
GPT, GPT-2, …
…
Heavy Tailed Metrics: GPT vs GPT2
The original GPT is poorly trained on purpose; GPT-2 is well trained

Plotting alpha for every layer: smaller alpha is better; very large alpha values are bad fits
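The layer-by-layer comparison above can be summarized in one number, the layer-average alpha. A toy computation (all alpha values below are invented for illustration; in practice they come from the `alpha` column of the details returned by watcher.analyze()):

```python
# Hypothetical per-layer alpha fits for two models (values invented for illustration)
gpt_alphas  = [3.9, 5.2, 4.8, 6.1, 5.5, 4.9]   # poorly trained: larger alphas
gpt2_alphas = [2.3, 2.8, 3.1, 2.6, 3.4, 2.9]   # well trained: smaller alphas

def avg_alpha(alphas):
    """Layer-average power-law exponent; smaller is better."""
    return sum(alphas) / len(alphas)

# Smaller average alpha predicts the better-generalizing model
assert avg_alpha(gpt2_alphas) < avg_alpha(gpt_alphas)
print(avg_alpha(gpt_alphas), avg_alpha(gpt2_alphas))
```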
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy-Tailed Universality: across 500 matrices from ~50 architectures (linear layers and Conv2D feature maps), 80-90% of the fitted alphas are < 4
Random Matrix Theory: detailed insight into the layer matrices W_L

DNN training induces a breakdown of the Gaussian random structure and the onset of a new kind of heavy tailed self-regularization:

Gaussian random matrix → Bulk + Spikes → Heavy-Tailed

Small, older NNs sit at the Gaussian end; large, modern DNNs (and/or small batch sizes) sit at the Heavy-Tailed end.
HT-SR Theory: 5+1 Phases of Training
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
WeightWatcher: predict trends in generalization
Predict test accuracies across variations in hyper-parameters
The average Power Law exponent alpha predicts generalization (at fixed depth): smaller average alpha is better, and better models are easier to treat
Charles H. Martin, Michael W. Mahoney;
[Contest post-mortem paper]
WeightWatcher: Shape vs Scale metrics
Purely norm-based (scale) metrics (from Statistical Learning Theory) can be correlated with depth but anti-correlated with hyper-parameter changes
WeightWatcher: treat architecture changes
Predict test accuracies across variations in hyper-parameters and depth
The alpha-hat metric combines shape and scale metrics and corrects for different depths (grey line); it can be derived from theory…
WeightWatcher: predict test accuracies
alpha-hat works for 100s of different CV and NLP models
(Nature Communications 2021)
We do not have access to the training or test data, but we can still predict trends in the generalization
ResNet, DenseNet, etc.
(Nature Communications 2021)
Predicting test accuracies: 100 pretrained models
The heavy tailed (shape) metrics perform best
https://github.com/osmr/imgclsmob
From an open-source sandbox of nearly 500 pretrained CV models (using at least 5 models per regression)
(Nature Communications 2021)
Correlation Flow: CV Models
We can study correlation flow by looking at alpha vs. depth (panels: VGG, ResNet, DenseNet)
(Nature Communications 2021)
WeightWatcher: interpreting shapes
Very high accuracy requires advanced methods
WeightWatcher: more Power Law shape metrics
watcher.analyze(…, fit='TPL')    Truncated Power Law fits
watcher.analyze(…, fit='E_TPL')  Extended Truncated Power Law fits

weightwatcher provides several shape (and scale) metrics, plus several more unpublished, experimental options
WeightWatcher: E_TPL shape metric
When training MT transformers from scratch to SOTA, the E_TPL (Extended Truncated Power Law) and rand_distance shape metrics track the learning curve epoch-by-epoch.

Highly accurate results leverage the advanced shape metrics; here, Lambda is the shape metric.

[Training transformers paper]
WeightWatcher: why Power Law fits?
Spiking (i.e. real) neurons exhibit power law behavior.

weightwatcher supports several PL fits from experimental neuroscience, plus totally new shape metrics we have invented (and published).
Spiking (i.e. real) neurons exhibit (truncated) power law behavior: the Critical Brain Hypothesis, evidence of Self-Organized Criticality (SOC), Per Bak (How Nature Works).

As neural systems become more complex, they exhibit power law behavior, and then truncated power law behavior. We see exactly this behavior in DNNs, and it is predictive of learning capacity.
WeightWatcher: open-source, open-science
We are looking for early adopters and collaborators
github.com/CalculatedContent/WeightWatcher
We have a Slack channel to support the tool
Please file issues
Ping me to join
Statistical Mechanics derivation of the alpha-hat metric
Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning, Engel & Van den Broeck (2001)
(diagram: multilayer feed-forward network vs. perceptron)
Gaussian data
Average over data
Average the version space volume over Gaussian data and uniform random Teachers.
The final expression has 2 parts, parameterized by the error (ε) and the size of the data set (p).
Averaging over random Teachers introduces the overlap R, the key idea in the matrix generalization.
New Set Up: Matrix-generalized Student-Teacher
“… Matrix Generalization of S-T …”, Martin, Milletari, & Mahoney (in preparation)

Real DNN matrices are NxM, strongly correlated, with Heavy-Tailed correlation matrices.
Solve for the total integrated version space.
Gibbs Learning / Canonical Ensemble
Consider the Teacher-Student (T-S) Mean Squared Error (MSE)
Integrate the canonical measure over Gaussian data
Matrix-generalized Student-Teacher overlap
Integrate the version space volume over the Students J
Expand delta function
Again, break into 2 parts
New approach: HCIZ Matrix Integrals
Fix the Teacher: average over Student correlation matrices (Wick rotation)
Represent as an HCIZ integral
Note: the average is over Student matrices which resemble the Teacher
New approach: Semi-Empirical Theory

“Generalized Norm”: a simple, functional form that can be inferred from an empirical fit to the eigenvalues of the Teacher, i.e. the WeightWatcher PowerLaw metric

“Asymptotics of HCIZ integrals …”, Tanaka (2008)
WeightWatcher: global and local convexity metrics
Smaller alpha corresponds to more convex energy landscapes: alpha ~ 2-3 (or less) is more convex than Transformers (alpha ~ 3-4 or more)

“Rational Decisions, Random Matrices and Spin Glasses” (1998) by Galluccio, Bouchaud, and Potters
When a layer's alpha < 2, we think this means the layer is overfit. We suspect that the early layers of some Convolutional Nets may be slightly overtrained (some alpha < 2). This is predicted by our HTSR theory.
New interpretation: HCIZ Matrix Integrals
Generating functional: the R-Transform (inverse Green’s function, via a contour integral), in terms of the Teacher’s eigenvalues and the Student’s cumulants
Results: Gaussian Random Weight Matrices
“Random Matrix Theory (book)” Bouchaud and Potters (2020)
Recover the Frobenius Norm (squared) as the metric
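In symbols, a sketch of the stated result (here λ_i denotes the eigenvalues of the layer correlation matrix X = WᵀW, in the notation of the slides):

```latex
\|W\|_{F}^{2} \;=\; \operatorname{Tr}\!\left[W^{T} W\right] \;=\; \sum_{i} \lambda_{i}
```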
Results: (very) Heavy Tailed Weight Matrices
“Heavy-tailed random matrices”, Burda and Jurkiewicz (2009)
Recover a Schatten norm, in terms of the Heavy-Tailed exponent
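A sketch of the form of this result (the precise prefactors depend on the universality class; α is the heavy-tailed exponent and λ_i the eigenvalues of the correlation matrix):

```latex
\|W\|_{2\alpha}^{2\alpha} \;=\; \sum_{i} \lambda_{i}^{\alpha}
```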
Application to: Heavy Tailed Weight Matrices
Some reasonable approximations give the weighted alpha metric
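The resulting weighted-alpha (alpha-hat) metric, as used in the Nature Communications paper, averages the per-layer product of the shape metric α_l and the log of the scale metric λ_max,l; a sketch in that notation, for a network of L layers:

```latex
\hat{\alpha} \;=\; \frac{1}{L}\sum_{l=1}^{L} \alpha_{l}\,\log_{10} \lambda_{l}^{\,max}
```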
Q.E.D.
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 

Recently uploaded (20)

Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 

ENS March 2022.pdf

  • 1. calculation | consulting the WeightWatcher project - or Why Deep Learning Works (TM) c|c (TM) charles@calculationconsulting.com
  • 2. calculation|consulting the WeightWatcher project - or Why Deep Learning Works (TM) charles@calculationconsulting.com
  • 3. calculation | consulting why deep learning works Who Are We? c|c (TM) Dr. Charles H. Martin, PhD University of Chicago, Chemical Physics NSF Fellow in Theoretical Chemistry, UIUC Over 15 years experience in applied Machine Learning and AI ML algos for: Aardvark, acquired by Google (2010) Demand Media (eHow); first $1B IPO since Google Wall Street: Barclays, BlackRock Fortune 500: Roche, France Telecom, Walmart BigTech: eBay, Aardvark (Google), GoDaddy Private Equity: Griffin Advisors Alt. Energy: Anthropocene Institute (Page Family) www.calculationconsulting.com charles@calculationconsulting.com (TM) 3
  • 4. calculation | consulting why deep learning works c|c (TM) (TM) 4 Michael W. Mahoney ICSI, RISELab, Dept. of Statistics UC Berkeley Algorithmic and statistical aspects of modern large-scale data analysis. large-scale machine learning | randomized linear algebra geometric network analysis | scalable implicit regularization PhD, Yale University, computational chemical physics SAMSI National Advisory Committee NRC Committee on the Analysis of Massive Data Simons Institute Fall 2013 and 2018 program on the Foundations of Data Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu Who Are We?
  • 5. c|c (TM) (TM) 5 calculation | consulting why deep learning works Open source tool: weightwatcher
  • 6. c|c (TM) (TM) 6 calculation | consulting why deep learning works Understanding deep learning requires rethinking generalization Motivations: WeightWatcher Theory The weightwatcher theory is a Semi-Empirical theory based on:
 the Statistical Mechanics of Generalization, Random Matrix Theory, and the theory of Strongly Correlated Systems
  • 7. c|c (TM) Research: Implicit Self-Regularization in Deep Learning (TM) 7 calculation | consulting why deep learning works • Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning. • Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks • Workshop: Statistical Mechanics Methods for Discovering Knowledge from Production-Scale Neural Networks • Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data • More in press today: • Some unpublished, experimental results also discussed (JMLR 2021) (ICML 2019, SDM 2020) (KDD 2020) (Nature Communications 2021) Selected publications [Contest post-mortem] [Training transformers]
  • 8. c|c (TM) WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices (TM) 8 calculation | consulting why deep learning works The tail of the ESD contains the information
  • 9. c|c (TM) WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices (TM) 9 calculation | consulting why deep learning works Well trained layers are heavy-tailed and well shaped GPT-2 Fits a Power Law (or Truncated Power Law) alpha in [2, 6] watcher.analyze(plot=True) Good quality of fit (D is small)
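The fit on this slide can be sketched in plain numpy: build the ESD of a layer weight matrix, fit the tail with a power law by maximum likelihood (the Hill estimator), and score fit quality with a Kolmogorov-Smirnov distance D. This is a minimal stand-in for what watcher.analyze() reports, not the tool's actual implementation: the matrix here is synthetic, and the median-based xmin is a simplification (weightwatcher uses the Clauset-Shalizi-Newman xmin selection).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer weight matrix" with heavy-tailed entries, standing in for a
# well-trained layer. N, M and the Pareto tail index are illustrative choices.
N, M = 1000, 300
W = rng.pareto(2.0, size=(N, M))

# ESD: eigenvalues of the layer correlation matrix X = W^T W / N
X = W.T @ W / N
evals = np.linalg.eigvalsh(X)

# Power-law fit of the ESD tail by maximum likelihood (Hill estimator),
# using the median eigenvalue as a crude xmin
xmin = np.median(evals)
tail = evals[evals >= xmin]
alpha = 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Quality of fit: KS distance D between the empirical tail CDF and the
# fitted power-law CDF (smaller D = better fit)
tail_sorted = np.sort(tail)
emp_cdf = np.arange(1, len(tail_sorted) + 1) / len(tail_sorted)
fit_cdf = 1.0 - (tail_sorted / xmin) ** (1.0 - alpha)
D = np.max(np.abs(emp_cdf - fit_cdf))

print(f"alpha = {alpha:.2f}, D = {D:.3f}")
```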
  • 10. c|c (TM) WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices (TM) 10 calculation | consulting why deep learning works Better trained layers are more heavy-tailed and better shaped GPT GPT-2
  • 11. c|c (TM) (TM) 11 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations very crisp edges Q RMT says if W is a simple random Gaussian matrix, then the ESD will have a very simple, known form Shape depends on Q=N/M (and variance ~ 1) Eigenvalues tightly bounded a few spikes may appear
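The Marchenko-Pastur prediction is easy to verify numerically. A minimal sketch (dimensions are arbitrary illustration values): for Gaussian W with Q = N/M and unit variance, the ESD of X = WᵀW/N should fill the MP bulk [λ−, λ+] with λ± = σ²(1 ± 1/√Q)², up to small Tracy-Widom edge fluctuations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian random matrix W (N x M), Q = N/M, sigma^2 = 1
N, M = 2000, 500
Q = N / M
W = rng.normal(0.0, 1.0, size=(N, M))

# ESD of the correlation matrix X = W^T W / N
evals = np.linalg.eigvalsh(W.T @ W / N)

# Marchenko-Pastur bulk edges: lambda_pm = sigma^2 (1 +/- 1/sqrt(Q))^2
lam_plus = (1.0 + 1.0 / np.sqrt(Q)) ** 2
lam_minus = (1.0 - 1.0 / np.sqrt(Q)) ** 2

print(f"ESD support: [{evals.min():.3f}, {evals.max():.3f}]")
print(f"MP edges:    [{lam_minus:.3f}, {lam_plus:.3f}]")
```

For Q = 4 the edges are 0.25 and 2.25, and the empirical support matches them closely: the "very crisp edges" of the slide.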
  • 12. c|c (TM) (TM) 12 calculation | consulting why deep learning works RMT: AlexNet Marchenko-Pastur Bulk-decay | Heavy Tailed FC1 zoomed in FC2 zoomed in
  • 13. c|c (TM) (TM) 13 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed But if W is heavy tailed, the ESD will also have heavy tails (i.e. it's all spikes, the bulk vanishes) If W is strongly correlated, then the ESD can be modeled as if W is drawn from a heavy tailed distribution Nearly all pre-trained DNNs display heavy tails… as we shall soon see
  • 14. c|c (TM) (TM) 14 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models LeNet5 (1998) AlexNet (2012) InceptionV3 (2014) ResNet (2015) … DenseNet201 (2018) Conv2D MaxPool Conv2D MaxPool FC FC
  • 15. c|c (TM) (TM) 15 calculation | consulting why deep learning works AlexNet, VGG11, VGG13, … ResNet, … Inception, DenseNet, BERT, RoBERTa, … GPT, GPT2, … … Heavy-Tailed Self-Regularization: all large, well-trained, modern DNNs exhibit heavy-tailed self-regularization (scale free, HTSR)
  • 16. c|c (TM) (TM) 16 calculation | consulting why deep learning works Heavy Tailed Metrics: GPT vs GPT2 The original GPT is poorly trained on purpose; GPT2 is well trained alpha for every layer
 smaller alpha is better; large alphas are bad fits
  • 17. c|c (TM) (TM) 17 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality 500 matrices ~50 architectures Linear layers & Conv2D feature maps 80-90% of exponents < 4
  • 18. c|c (TM) (TM) 18 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL DNN training induces breakdown of Gaussian random structure and the onset of a new kind of heavy tailed self-regularization Gaussian random matrix Bulk+ Spikes Heavy Tailed Small, older NNs Large, modern DNNs and/or Small batch sizes
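The Gaussian → Bulk+Spikes part of this transition can be reproduced in a few lines: adding a strong rank-one correlation (a crude stand-in for a learned feature; the strength 10 is an arbitrary illustrative value, well past the spike threshold) to a Gaussian matrix pushes one eigenvalue far outside the MP bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 500
Q = N / M
lam_plus = (1.0 + 1.0 / np.sqrt(Q)) ** 2   # MP bulk edge for sigma^2 = 1

# Pure noise: all eigenvalues stay inside the MP bulk
W_noise = rng.normal(size=(N, M))
evals_noise = np.linalg.eigvalsh(W_noise.T @ W_noise / N)

# Noise + a strong rank-one "learned" correlation: one spike escapes the bulk
u = rng.normal(size=N); u /= np.linalg.norm(u)
v = rng.normal(size=M); v /= np.linalg.norm(v)
W_spiked = W_noise + 10.0 * np.sqrt(N) * np.outer(u, v)
evals_spiked = np.linalg.eigvalsh(W_spiked.T @ W_spiked / N)

print(f"noise:  max eigenvalue {evals_noise.max():.2f} (MP edge {lam_plus:.2f})")
print(f"spiked: max eigenvalue {evals_spiked.max():.2f}")
```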
  • 19. c|c (TM) (TM) 19 calculation | consulting why deep learning works HT-SR Theory: 5+1 Phases of Training Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
  • 20. c|c (TM) (TM) 20 calculation | consulting why deep learning works Heavy Tailed RMT: Universality Classes The familiar Wigner/MP Gaussian class is not the only Universality class in RMT Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
  • 21. c|c (TM) WeightWatcher: predict trends in generalization (TM) 21 calculation | consulting why deep learning works Predict test accuracies across variations in hyper-parameters The average Power Law exponent alpha predicts generalization, at fixed depth Smaller average alpha is better Better models are easier to treat Charles H. Martin, Michael W. Mahoney; [Contest post-mortem paper]
  • 22. c|c (TM) WeightWatcher: Shape vs Scale metrics (TM) 22 calculation | consulting why deep learning works Purely norm-based (scale) metrics (from SLT) can be correlated with depth but anti-correlated with hyper-parameter changes
  • 23. c|c (TM) WeightWatcher: treat architecture changes (TM) 23 calculation | consulting why deep learning works Predict test accuracies across variations in hyper-parameters and depth The alpha-hat metric combines shape and scale metrics and corrects for different depths (grey line) can be derived from theory…
  • 24. c|c (TM) WeightWatcher: predict test accuracies (TM) 24 calculation | consulting why deep learning works alpha-hat works for 100s of different CV and NLP models (Nature Communications 2021) We do not have access to the training or test data, but we can still predict trends in the generalization
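As a hypothetical sketch of how the alpha-hat metric combines shape and scale: per layer it multiplies the power-law exponent alpha (shape) by the log of the largest ESD eigenvalue (scale), then averages over layers. The per-layer values below are invented for illustration; in practice they would come from watcher.analyze(), and the tool's exact weighting may differ.

```python
import numpy as np

# Hypothetical per-layer power-law exponents and largest eigenvalues,
# as weightwatcher would report them (values invented for illustration)
layer_alpha   = np.array([2.5, 3.1, 2.8, 4.0])
layer_lam_max = np.array([12.0, 30.0, 8.0, 50.0])

# alpha-hat combines shape (alpha) and scale (lambda_max):
# per-layer weighted alpha = alpha * log10(lambda_max), averaged over layers
alpha_hat = np.mean(layer_alpha * np.log10(layer_lam_max))
print(f"alpha-hat = {alpha_hat:.3f}")
```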
  • 25. c|c (TM) WeightWatcher: predict test accuracies (TM) 25 calculation | consulting why deep learning works ResNet, DenseNet, etc. (Nature Communications 2021)
  • 26. c|c (TM) (TM) 26 calculation | consulting why deep learning works Predicting test accuracies: 100 pretrained models The heavy tailed (shape) metrics perform best https://github.com/osmr/imgclsmob From an open source sandbox of nearly 500 pretrained CV models (picked >=5 models / regression) (Nature Communications 2021)
  • 27. c|c (TM) (TM) 27 calculation | consulting why deep learning works Correlation Flow: CV Models We can study correlation flow looking at vs. depth VGG ResNet DenseNet (Nature Communications 2021)
  • 28. c|c (TM) WeightWatcher: interpreting shapes (TM) 28 calculation | consulting why deep learning works very high accuracy requires advanced methods (plot annotations: hard vs. easy fits)
  • 29. c|c (TM) WeightWatcher: more Power Law shape metrics (TM) 29 calculation | consulting why deep learning works watcher.analyze(…, fit=‘TPL’) Truncated Power Law fits watcher.analyze(…, fit=‘E_TPL’) weightwatcher provides several shape (and scale) metrics plus several more unpublished experimental options
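For reference, the standard functional forms behind these fits (hedged: weightwatcher's exact parameterization and normalizations may differ):

```latex
p(\lambda) \propto \lambda^{-\alpha}, \quad \lambda \ge \lambda_{\min}
\qquad \text{(Power Law)}
```
```latex
p(\lambda) \propto \lambda^{-\alpha}\, e^{-\Lambda \lambda}
\qquad \text{(Truncated Power Law)}
```

The exponential cutoff Λ is the extra shape parameter that the truncated fits report alongside α.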
  • 30. c|c (TM) WeightWatcher: E_TPL shape metric (TM) 30 calculation | consulting why deep learning works the E_TPL (and rand_distance) shape metrics track the learning curve epoch-by-epoch Training MT transformers from scratch to SOTA Extended Truncated Power Law highly accurate results leverage the advanced shape metrics Here, (Lambda) is the shape metric [Training transformers paper]
  • 31. c|c (TM) WeightWatcher: why Power Law fits? (TM) 31 calculation | consulting why deep learning works Spiking (i.e. real) neurons exhibit power law behavior weightwatcher supports several PL fits from experimental neuroscience plus totally new shape metrics we have invented (and published)
  • 32. c|c (TM) WeightWatcher: why Power Law fits? (TM) 32 calculation | consulting why deep learning works Spiking (i.e. real) neurons exhibit (truncated) power law behavior The Critical Brain Hypothesis Evidence of Self-Organized Criticality (SOC) Per Bak (How Nature Works) As neural systems become more complex they exhibit power law behavior and then truncated power law behavior We see exactly this behavior in DNNs and it is predictive of learning capacity
  • 33. c|c (TM) WeightWatcher: open-source, open-science (TM) 33 calculation | consulting why deep learning works We are looking for early adopters and collaborators github.com/CalculatedContent/WeightWatcher We have a Slack channel to support the tool Please file issues Ping me to join
  • 34. c|c (TM) (TM) 34 calculation | consulting why deep learning works Statistical Mechanics derivation of the alpha-hat metric
  • 35. c|c (TM) (TM) 35 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Statistical Mechanics of Learning Engle &Van den Broeck (2001) MultiLayer Feed Forward Network Perceptron
  • 36. c|c (TM) (TM) 36 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Gaussian data Average over data
  • 37. c|c (TM) (TM) 37 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model
  • 38. c|c (TM) (TM) 38 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model
  • 39. c|c (TM) (TM) 39 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Average version space volume over Gaussian data and uniform random Teachers Final expression has 2 parts, parameterized by the error ( ), size of data set (p)
  • 40. c|c (TM) (TM) 40 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Average over random teachers introduces overlap R Key idea in matrix generalization
  • 41. c|c (TM) (TM) 41 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher “ .. Matrix Generalization of S-T …” Martin, Milletari, & Mahoney (in preparation) real DNN matrices: NxM Strongly correlated Heavy-Tailed correlation matrices Solve for total integrated version space
  • 42. c|c (TM) (TM) 42 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher Gibbs Learning / Canonical Ensemble Consider T-S Mean Squared Error (MSE)
  • 43. c|c (TM) (TM) 43 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher Integrate the canonical measure over Gaussian data Matrix-generalized Student-Teacher overlap
  • 44. c|c (TM) (TM) 44 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher Integrate the version space volume over the Students J Expand delta function Again, break into 2 parts
  • 45. c|c (TM) (TM) 45 calculation | consulting why deep learning works New approach: HCIZ Matrix Integrals Fix the Teacher: average over Student Correlation Matrices Wick rotation Represent as an HCIZ integral Note: which resemble the Teacher
  • 46. c|c (TM) (TM) 46 calculation | consulting why deep learning works New approach: Semi-Empirical Theory “Generalized Norm” simple, functional form can infer from empirical fit Eigenvalues of Teacher empirical fit to: “Asymptotics of HCIZ integrals …” Tanaka (2008) WeightWatcher PowerLaw metric
  • 47. c|c (TM) WeightWatcher: global and local convexity metrics (TM) 47 calculation | consulting why deep learning works Smaller alpha corresponds to more convex energy landscapes Transformers (alpha ~ 3-4 or more) alpha 2-3 (or less) “Rational Decisions, Random Matrices and Spin Glasses” (1998) by Galluccio, Bouchaud, and Potters
  • 48. c|c (TM) WeightWatcher: global and local convexity metrics (TM) 48 calculation | consulting why deep learning works When the layer alpha < 2, we think this means the layer is overfit We suspect that the early layers of some Convolutional Nets may be slightly overtrained Some alpha < 2 This is predicted from our HTSR theory
  • 50. c|c (TM) (TM) 50 calculation | consulting why deep learning works
  • 51. c|c (TM) (TM) 51 calculation | consulting why deep learning works New interpretation: HCIZ Matrix Integrals Generating functional R-Transform (inverse Green’s function, via Contour Integral) in terms of the Teacher’s eigenvalues, and the Student’s cumulants
  • 52. c|c (TM) (TM) 52 calculation | consulting why deep learning works Results: Gaussian Random Weight Matrices “Random Matrix Theory (book)” Bouchaud and Potters (2020) Recover the Frobenius Norm (squared) as the metric
  • 53. c|c (TM) (TM) 53 calculation | consulting why deep learning works Results: (very) Heavy Tailed Weight Matrices “Heavy-tailed random matrices” Burda and Jurkiewicz (2009) Recover a Schatten Norm, in terms of the Heavy Tailed exponent
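The Schatten-norm statement can be checked numerically: a Schatten norm of W is exactly a power sum over the ESD eigenvalues (since λᵢ = sᵢ² for singular values sᵢ), which is why a heavy-tailed ESD with exponent α turns the generalized norm into a Schatten-type quantity. A minimal check (dimensions and p are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 30))

# Eigenvalues of W^T W relate to singular values of W: lam_i = s_i^2
lam = np.linalg.eigvalsh(W.T @ W)
s = np.linalg.svd(W, compute_uv=False)

# Schatten 2p-norm of W, written as a power sum over ESD eigenvalues:
# ||W||_{2p}^{2p} = sum_i s_i^{2p} = sum_i lam_i^p
p = 2.0
schatten_from_svals = np.sum(s ** (2 * p))
schatten_from_esd = np.sum(lam ** p)
print(schatten_from_svals, schatten_from_esd)
```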
  • 54. c|c (TM) (TM) 54 calculation | consulting why deep learning works Application to: Heavy Tailed Weight Matrices Some reasonable approximations give the weighted alpha metric Q.E.D.
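A hedged sketch of the "reasonable approximations" step: if the generalized norm is a power sum over the ESD eigenvalues and the tail is power-law with exponent α, the sum is dominated by the largest eigenvalue, yielding the weighted-alpha (alpha-hat) form used earlier in the talk:

```latex
\log \sum_i \lambda_i^{\alpha} \;\approx\; \alpha \log \lambda_{\max}
\qquad \Longrightarrow \qquad
\hat{\alpha} = \alpha \log \lambda_{\max}
```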