calculation | consulting
the WeightWatcher project - or Why Deep Learning Works
charles@calculationconsulting.com
calculation | consulting why deep learning works
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: Barclays, BlackRock
Fortune 500: Roche, France Telecom, Walmart
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)
www.calculationconsulting.com
charles@calculationconsulting.com
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 programs on the Foundations of Data Science
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Open source tool: weightwatcher
Understanding deep learning requires rethinking generalization
Motivations: WeightWatcher Theory
The weightwatcher theory is a Semi-Empirical theory based on:

the Statistical Mechanics of Generalization,
Random Matrix Theory, and
the theory of Strongly Correlated Systems
Research: Implicit Self-Regularization in Deep Learning
Selected publications

• Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (JMLR 2021)
• Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (ICML 2019, SDM 2020)
• Workshop: Statistical Mechanics Methods for Discovering Knowledge from Production-Scale Neural Networks (KDD 2020)
• Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data (Nature Communications 2021)
• More in press today: [Contest post-mortem] [Training transformers]
• Some unpublished, experimental results are also discussed
WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices
The tail of the ESD contains the information
Well-trained layers are heavy-tailed and well shaped
GPT-2 fits a Power Law (or Truncated Power Law), with alpha in [2, 6]

watcher.analyze(plot=True)

Good quality of fit (D is small)
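The alpha reported by analyze() is a power-law exponent fitted to the tail of the ESD. Below is a minimal numpy sketch of the idea only, not the weightwatcher implementation (which uses a proper MLE power-law fit and reports the KS distance D); a synthetic eigenvalue sample and the Hill estimator stand in for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ESD tail: eigenvalue density rho(lam) ~ lam^(-alpha) with alpha = 3
alpha_true = 3.0
eigs = 1.0 + rng.pareto(alpha_true - 1.0, size=20000)  # tail index = alpha - 1

def hill_alpha(eigs, xmin):
    """Hill estimator of the power-law exponent alpha of the ESD tail."""
    tail = eigs[eigs >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

alpha_hat = hill_alpha(eigs, xmin=1.0)
print(round(alpha_hat, 2))  # close to 3.0
```

In practice one calls watcher.analyze(plot=True) on a real model and reads alpha (and the quality-of-fit D) per layer from the returned details.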
Better-trained layers are more heavy-tailed and better shaped (GPT vs. GPT-2)
Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations
RMT says that if W is a simple random Gaussian matrix, then the ESD will have a very simple, known form. The shape depends on Q = N/M (and variance ~ 1). The eigenvalues are tightly bounded, with very crisp edges; a few spikes may appear.
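This can be checked in a few lines of numpy (the sizes N = 1000, M = 250 are chosen here for illustration): the ESD of a Gaussian random W stays inside the Marchenko-Pastur bulk edges lam± = (1 ± sqrt(1/Q))², up to small Tracy-Widom fluctuations.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 1000, 250          # W is N x M, so Q = N/M = 4
Q = N / M
W = rng.normal(0.0, 1.0, size=(N, M))

# ESD: eigenvalues of the correlation matrix X = W^T W / N
X = W.T @ W / N
eigs = np.linalg.eigvalsh(X)

# Marchenko-Pastur bulk edges for unit variance
lam_plus = (1 + np.sqrt(1 / Q)) ** 2   # upper edge, here 2.25
lam_minus = (1 - np.sqrt(1 / Q)) ** 2  # lower edge, here 0.25

print(eigs.min(), eigs.max())  # tightly bounded near [0.25, 2.25]
```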
RMT: AlexNet
Marchenko-Pastur bulk-decay | Heavy-Tailed
(panels: FC1 and FC2, each shown zoomed in)
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails
(i.e. it is all spikes; the bulk vanishes)
If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy tailed distribution
Nearly all pre-trained DNNs display heavy tails…as we shall soon see
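A small numpy experiment contrasting the two cases (symmetrized Pareto entries stand in for a strongly correlated heavy-tailed layer; the tail index 1.5 and the matrix sizes are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 250

def esd(W):
    """Eigenvalues of the correlation matrix X = W^T W / N (the ESD)."""
    return np.linalg.eigvalsh(W.T @ W / W.shape[0])

W_gauss = rng.normal(size=(N, M))
# Heavy-tailed W: symmetrized Pareto entries with tail index 1.5
W_heavy = rng.pareto(1.5, size=(N, M)) * rng.choice([-1.0, 1.0], size=(N, M))

gauss, heavy = esd(W_gauss), esd(W_heavy)

# Gaussian: bulk tightly bounded near the MP edge. Heavy-tailed: dominated by spikes.
print(gauss.max() / np.median(gauss))   # O(1)
print(heavy.max() / np.median(heavy))   # orders of magnitude larger
```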
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
(architecture diagram: Conv2D → MaxPool → Conv2D → MaxPool → FC → FC)
Heavy-Tailed Self-Regularization (HTSR)

All large, well trained, modern DNNs exhibit heavy tailed (scale-free) self-regularization:

AlexNet,
VGG11, VGG13, …
ResNet, …
Inception,
DenseNet,
BERT, RoBERTa, …
GPT, GPT-2, …
…
Heavy Tailed Metrics: GPT vs GPT2
The original GPT is poorly trained on purpose; GPT-2 is well trained

Plotting alpha for every layer: smaller alpha is better; very large alpha values are bad fits
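The layer-by-layer comparison above can be summarized in one number, the layer-average alpha. A toy computation (all alpha values below are invented for illustration; in practice they come from the `alpha` column of the details returned by watcher.analyze()):

```python
# Hypothetical per-layer alpha fits for two models (values invented for illustration)
gpt_alphas  = [3.9, 5.2, 4.8, 6.1, 5.5, 4.9]   # poorly trained: larger alphas
gpt2_alphas = [2.3, 2.8, 3.1, 2.6, 3.4, 2.9]   # well trained: smaller alphas

def avg_alpha(alphas):
    """Layer-average power-law exponent; smaller is better."""
    return sum(alphas) / len(alphas)

# Smaller average alpha predicts the better-generalizing model
assert avg_alpha(gpt2_alphas) < avg_alpha(gpt_alphas)
print(avg_alpha(gpt_alphas), avg_alpha(gpt2_alphas))
```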
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy-Tailed Universality: across 500 matrices from ~50 architectures (linear layers and Conv2D feature maps), 80-90% of the fitted alphas are < 4
Random Matrix Theory: detailed insight into the layer matrices W_L

DNN training induces a breakdown of the Gaussian random structure and the onset of a new kind of heavy tailed self-regularization:

Gaussian random matrix → Bulk + Spikes → Heavy-Tailed

Small, older NNs sit at the Gaussian end; large, modern DNNs (and/or small batch sizes) sit at the Heavy-Tailed end.
HT-SR Theory: 5+1 Phases of Training
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
WeightWatcher: predict trends in generalization
Predict test accuracies across variations in hyper-parameters
The average Power Law exponent alpha predicts generalization (at fixed depth): smaller average alpha is better, and better models are easier to treat
Charles H. Martin, Michael W. Mahoney;
[Contest post-mortem paper]
WeightWatcher: Shape vs Scale metrics
Purely norm-based (scale) metrics (from Statistical Learning Theory) can be correlated with depth but anti-correlated with hyper-parameter changes
WeightWatcher: treat architecture changes
Predict test accuracies across variations in hyper-parameters and depth
The alpha-hat metric combines shape and scale metrics and corrects for different depths (grey line); it can be derived from theory…
WeightWatcher: predict test accuracies
alpha-hat works for 100s of different CV and NLP models
(Nature Communications 2021)
We do not have access to the training or test data, but we can still predict trends in the generalization
ResNet, DenseNet, etc.
(Nature Communications 2021)
Predicting test accuracies: 100 pretrained models
The heavy tailed (shape) metrics perform best
https://github.com/osmr/imgclsmob
From an open-source sandbox of nearly 500 pretrained CV models (using at least 5 models per regression)
(Nature Communications 2021)
Correlation Flow: CV Models
We can study correlation flow by looking at alpha vs. depth (panels: VGG, ResNet, DenseNet)
(Nature Communications 2021)
WeightWatcher: interpreting shapes
Very high accuracy requires advanced methods
WeightWatcher: more Power Law shape metrics
watcher.analyze(…, fit='TPL')    Truncated Power Law fits
watcher.analyze(…, fit='E_TPL')  Extended Truncated Power Law fits

weightwatcher provides several shape (and scale) metrics, plus several more unpublished, experimental options
WeightWatcher: E_TPL shape metric
When training MT transformers from scratch to SOTA, the E_TPL (Extended Truncated Power Law) and rand_distance shape metrics track the learning curve epoch-by-epoch.

Highly accurate results leverage the advanced shape metrics; here, Lambda is the shape metric.

[Training transformers paper]
WeightWatcher: why Power Law fits?
Spiking (i.e. real) neurons exhibit power law behavior.

weightwatcher supports several PL fits from experimental neuroscience, plus totally new shape metrics we have invented (and published).
Spiking (i.e. real) neurons exhibit (truncated) power law behavior: the Critical Brain Hypothesis, evidence of Self-Organized Criticality (SOC), Per Bak (How Nature Works).

As neural systems become more complex, they exhibit power law behavior, and then truncated power law behavior. We see exactly this behavior in DNNs, and it is predictive of learning capacity.
WeightWatcher: open-source, open-science
We are looking for early adopters and collaborators
github.com/CalculatedContent/WeightWatcher
We have a Slack channel to support the tool
Please file issues
Ping me to join
Statistical Mechanics derivation of the alpha-hat metric
Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning, Engel & Van den Broeck (2001)
(diagram: multilayer feed-forward network vs. perceptron)
Gaussian data
Average over data
Average the version space volume over Gaussian data and uniform random Teachers.
The final expression has 2 parts, parameterized by the error (ε) and the size of the data set (p).
Averaging over random Teachers introduces the overlap R, the key idea in the matrix generalization.
New Set Up: Matrix-generalized Student-Teacher
“… Matrix Generalization of S-T …”, Martin, Milletari, & Mahoney (in preparation)

Real DNN matrices are NxM, strongly correlated, with Heavy-Tailed correlation matrices.
Solve for the total integrated version space.
Gibbs Learning / Canonical Ensemble
Consider the Teacher-Student (T-S) Mean Squared Error (MSE)
Integrate the canonical measure over Gaussian data
Matrix-generalized Student-Teacher overlap
Integrate the version space volume over the Students J
Expand delta function
Again, break into 2 parts
New approach: HCIZ Matrix Integrals
Fix the Teacher: average over Student correlation matrices (Wick rotation)
Represent as an HCIZ integral
Note: the average is over Student matrices which resemble the Teacher
New approach: Semi-Empirical Theory

“Generalized Norm”: a simple, functional form that can be inferred from an empirical fit to the eigenvalues of the Teacher, i.e. the WeightWatcher PowerLaw metric

“Asymptotics of HCIZ integrals …”, Tanaka (2008)
WeightWatcher: global and local convexity metrics
Smaller alpha corresponds to more convex energy landscapes: alpha ~ 2-3 (or less) is more convex than Transformers (alpha ~ 3-4 or more)

“Rational Decisions, Random Matrices and Spin Glasses” (1998) by Galluccio, Bouchaud, and Potters
When a layer's alpha < 2, we think this means the layer is overfit. We suspect that the early layers of some Convolutional Nets may be slightly overtrained (some alpha < 2). This is predicted by our HTSR theory.
New interpretation: HCIZ Matrix Integrals
Generating functional: the R-Transform (inverse Green’s function, via a contour integral), in terms of the Teacher’s eigenvalues and the Student’s cumulants
Results: Gaussian Random Weight Matrices
“Random Matrix Theory (book)” Bouchaud and Potters (2020)
Recover the Frobenius Norm (squared) as the metric
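In symbols, a sketch of the stated result (here λ_i denotes the eigenvalues of the layer correlation matrix X = WᵀW, in the notation of the slides):

```latex
\|W\|_{F}^{2} \;=\; \operatorname{Tr}\!\left[W^{T} W\right] \;=\; \sum_{i} \lambda_{i}
```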
Results: (very) Heavy Tailed Weight Matrices
“Heavy-tailed random matrices”, Burda and Jurkiewicz (2009)
Recover a Schatten norm, in terms of the Heavy-Tailed exponent
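A sketch of the form of this result (the precise prefactors depend on the universality class; α is the heavy-tailed exponent and λ_i the eigenvalues of the correlation matrix):

```latex
\|W\|_{2\alpha}^{2\alpha} \;=\; \sum_{i} \lambda_{i}^{\alpha}
```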
Application to: Heavy Tailed Weight Matrices
Some reasonable approximations give the weighted alpha metric
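The resulting weighted-alpha (alpha-hat) metric, as used in the Nature Communications paper, averages the per-layer product of the shape metric α_l and the log of the scale metric λ_max,l; a sketch in that notation, for a network of L layers:

```latex
\hat{\alpha} \;=\; \frac{1}{L}\sum_{l=1}^{L} \alpha_{l}\,\log_{10} \lambda_{l}^{\,max}
```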
Q.E.D.
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 

Recently uploaded (20)

Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 

ENS March 2022.pdf

  • 1. calculation | consulting the WeightWatcher project - or Why Deep Learning Works (TM) c|c (TM) charles@calculationconsulting.com
  • 2. calculation|consulting the WeightWatcher project - or Why Deep Learning Works (TM) charles@calculationconsulting.com
  • 3. calculation | consulting why deep learning works Who Are We? c|c (TM) Dr. Charles H. Martin, PhD University of Chicago, Chemical Physics NSF Fellow in Theoretical Chemistry, UIUC Over 15 years experience in applied Machine Learning and AI ML algos for: Aardvark, acquired by Google (2010) Demand Media (eHow); first $1B IPO since Google Wall Street: Barclays, BlackRock Fortune 500: Roche, France Telecom, Walmart BigTech: eBay, Aardvark (Google), GoDaddy Private Equity: Griffin Advisors Alt. Energy: Anthropocene Institute (Page Family) www.calculationconsulting.com charles@calculationconsulting.com (TM) 3
  • 4. calculation | consulting why deep learning works c|c (TM) (TM) 4 Michael W. Mahoney ICSI, RISELab, Dept. of Statistics UC Berkeley Algorithmic and statistical aspects of modern large-scale data analysis. large-scale machine learning | randomized linear algebra geometric network analysis | scalable implicit regularization PhD, Yale University, computational chemical physics SAMSI National Advisory Committee NRC Committee on the Analysis of Massive Data Simons Institute Fall 2013 and 2018 program on the Foundations of Data Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu Who Are We?
  • 5. c|c (TM) (TM) 5 calculation | consulting why deep learning works Open source tool: weightwatcher
  • 6. c|c (TM) (TM) 6 calculation | consulting why deep learning works Understanding deep learning requires rethinking generalization Motivations: WeightWatcher Theory The weightwatcher theory is a Semi-Empirical theory based on:
 the Statistical Mechanics of Generalization, Random Matrix Theory, and the theory of Strongly Correlated Systems
  • 7. c|c (TM) Research: Implicit Self-Regularization in Deep Learning (TM) 7 calculation | consulting why deep learning works • Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning. • Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks • Workshop: Statistical Mechanics Methods for Discovering Knowledge from Production-Scale Neural Networks • Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data • More in press today: • Some unpublished, experimental results also discussed (JMLR 2021) (ICML 2019, SDM 2020) (KDD 2020) (Nature Communications 2021) Selected publications [Contest post-mortem] [Training transformers]
  • 8. c|c (TM) WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices (TM) 8 calculation | consulting why deep learning works The tail of the ESD contains the information
  • 9. c|c (TM) WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices (TM) 9 calculation | consulting why deep learning works Well trained layers are heavy-tailed and well shaped GPT-2 Fits a Power Law (or Truncated Power Law) alpha in [2, 6] watcher.analyze(plot=True) Good quality of fit (D is small)
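The fit on this slide can be sketched in plain numpy: build the ESD of a layer weight matrix, fit the tail with a power law by maximum likelihood (the Hill estimator), and score fit quality with a Kolmogorov-Smirnov distance D. This is a minimal stand-in for what watcher.analyze() reports, not the tool's actual implementation: the matrix here is synthetic, and the median-based xmin is a simplification (weightwatcher uses the Clauset-Shalizi-Newman xmin selection).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer weight matrix" with heavy-tailed entries, standing in for a
# well-trained layer. N, M and the Pareto tail index are illustrative choices.
N, M = 1000, 300
W = rng.pareto(2.0, size=(N, M))

# ESD: eigenvalues of the layer correlation matrix X = W^T W / N
X = W.T @ W / N
evals = np.linalg.eigvalsh(X)

# Power-law fit of the ESD tail by maximum likelihood (Hill estimator),
# using the median eigenvalue as a crude xmin
xmin = np.median(evals)
tail = evals[evals >= xmin]
alpha = 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Quality of fit: KS distance D between the empirical tail CDF and the
# fitted power-law CDF (smaller D = better fit)
tail_sorted = np.sort(tail)
emp_cdf = np.arange(1, len(tail_sorted) + 1) / len(tail_sorted)
fit_cdf = 1.0 - (tail_sorted / xmin) ** (1.0 - alpha)
D = np.max(np.abs(emp_cdf - fit_cdf))

print(f"alpha = {alpha:.2f}, D = {D:.3f}")
```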
  • 10. c|c (TM) WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices (TM) 10 calculation | consulting why deep learning works Better trained layers are more heavy-tailed and better shaped GPT GPT-2
  • 11. c|c (TM) (TM) 11 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations very crisp edges Q RMT says if W is a simple random Gaussian matrix, then the ESD will have a very simple, known form Shape depends on Q=N/M (and variance ~ 1) Eigenvalues tightly bounded a few spikes may appear
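The Marchenko-Pastur prediction is easy to verify numerically. A minimal sketch (dimensions are arbitrary illustration values): for Gaussian W with Q = N/M and unit variance, the ESD of X = WᵀW/N should fill the MP bulk [λ−, λ+] with λ± = σ²(1 ± 1/√Q)², up to small Tracy-Widom edge fluctuations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian random matrix W (N x M), Q = N/M, sigma^2 = 1
N, M = 2000, 500
Q = N / M
W = rng.normal(0.0, 1.0, size=(N, M))

# ESD of the correlation matrix X = W^T W / N
evals = np.linalg.eigvalsh(W.T @ W / N)

# Marchenko-Pastur bulk edges: lambda_pm = sigma^2 (1 +/- 1/sqrt(Q))^2
lam_plus = (1.0 + 1.0 / np.sqrt(Q)) ** 2
lam_minus = (1.0 - 1.0 / np.sqrt(Q)) ** 2

print(f"ESD support: [{evals.min():.3f}, {evals.max():.3f}]")
print(f"MP edges:    [{lam_minus:.3f}, {lam_plus:.3f}]")
```

For Q = 4 the edges are 0.25 and 2.25, and the empirical support matches them closely: the "very crisp edges" of the slide.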
  • 12. c|c (TM) (TM) 12 calculation | consulting why deep learning works RMT: AlexNet Marchenko-Pastur Bulk-decay | Heavy Tailed FC1 zoomed in FC2 zoomed in
  • 13. c|c (TM) (TM) 13 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed But if W is heavy tailed, the ESD will also have heavy tails (i.e. it's all spikes, the bulk vanishes) If W is strongly correlated, then the ESD can be modeled as if W is drawn from a heavy tailed distribution Nearly all pre-trained DNNs display heavy tails… as we shall soon see
  • 14. c|c (TM) (TM) 14 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models LeNet5 (1998) AlexNet (2012) InceptionV3 (2014) ResNet (2015) … DenseNet201 (2018) Conv2D MaxPool Conv2D MaxPool FC FC
  • 15. c|c (TM) (TM) 15 calculation | consulting why deep learning works AlexNet, VGG11, VGG13, … ResNet, … Inception, DenseNet, BERT, RoBERTa, … GPT, GPT2, … … Heavy-Tailed Self-Regularization: all large, well-trained, modern DNNs exhibit heavy-tailed self-regularization (scale free, HTSR)
  • 16. c|c (TM) (TM) 16 calculation | consulting why deep learning works Heavy Tailed Metrics: GPT vs GPT2 The original GPT is poorly trained on purpose; GPT2 is well trained alpha for every layer
 smaller alpha is better; large alphas are bad fits
  • 17. c|c (TM) (TM) 17 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality 500 matrices ~50 architectures Linear layers & Conv2D feature maps 80-90% of exponents < 4
  • 18. c|c (TM) (TM) 18 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL DNN training induces breakdown of Gaussian random structure and the onset of a new kind of heavy tailed self-regularization Gaussian random matrix Bulk+ Spikes Heavy Tailed Small, older NNs Large, modern DNNs and/or Small batch sizes
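The Gaussian → Bulk+Spikes part of this transition can be reproduced in a few lines: adding a strong rank-one correlation (a crude stand-in for a learned feature; the strength 10 is an arbitrary illustrative value, well past the spike threshold) to a Gaussian matrix pushes one eigenvalue far outside the MP bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 500
Q = N / M
lam_plus = (1.0 + 1.0 / np.sqrt(Q)) ** 2   # MP bulk edge for sigma^2 = 1

# Pure noise: all eigenvalues stay inside the MP bulk
W_noise = rng.normal(size=(N, M))
evals_noise = np.linalg.eigvalsh(W_noise.T @ W_noise / N)

# Noise + a strong rank-one "learned" correlation: one spike escapes the bulk
u = rng.normal(size=N); u /= np.linalg.norm(u)
v = rng.normal(size=M); v /= np.linalg.norm(v)
W_spiked = W_noise + 10.0 * np.sqrt(N) * np.outer(u, v)
evals_spiked = np.linalg.eigvalsh(W_spiked.T @ W_spiked / N)

print(f"noise:  max eigenvalue {evals_noise.max():.2f} (MP edge {lam_plus:.2f})")
print(f"spiked: max eigenvalue {evals_spiked.max():.2f}")
```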
  • 19. c|c (TM) (TM) 19 calculation | consulting why deep learning works HT-SR Theory: 5+1 Phases of Training Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
  • 20. c|c (TM) (TM) 20 calculation | consulting why deep learning works Heavy Tailed RMT: Universality Classes The familiar Wigner/MP Gaussian class is not the only Universality class in RMT Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
  • 21. c|c (TM) WeightWatcher: predict trends in generalization (TM) 21 calculation | consulting why deep learning works Predict test accuracies across variations in hyper-parameters The average Power Law exponent alpha predicts generalization, at fixed depth Smaller average alpha is better Better models are easier to treat Charles H. Martin, Michael W. Mahoney; [Contest post-mortem paper]
  • 22. c|c (TM) WeightWatcher: Shape vs Scale metrics (TM) 22 calculation | consulting why deep learning works Purely norm-based (scale) metrics (from SLT) can be correlated with depth but anti-correlated with hyper-parameter changes
  • 23. c|c (TM) WeightWatcher: treat architecture changes (TM) 23 calculation | consulting why deep learning works Predict test accuracies across variations in hyper-parameters and depth The alpha-hat metric combines shape and scale metrics and corrects for different depths (grey line) can be derived from theory…
  • 24. c|c (TM) WeightWatcher: predict test accuracies (TM) 24 calculation | consulting why deep learning works alpha-hat works for 100s of different CV and NLP models (Nature Communications 2021) We do not have access to the training or test data, but we can still predict trends in the generalization
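As a hypothetical sketch of how the alpha-hat metric combines shape and scale: per layer it multiplies the power-law exponent alpha (shape) by the log of the largest ESD eigenvalue (scale), then averages over layers. The per-layer values below are invented for illustration; in practice they would come from watcher.analyze(), and the tool's exact weighting may differ.

```python
import numpy as np

# Hypothetical per-layer power-law exponents and largest eigenvalues,
# as weightwatcher would report them (values invented for illustration)
layer_alpha   = np.array([2.5, 3.1, 2.8, 4.0])
layer_lam_max = np.array([12.0, 30.0, 8.0, 50.0])

# alpha-hat combines shape (alpha) and scale (lambda_max):
# per-layer weighted alpha = alpha * log10(lambda_max), averaged over layers
alpha_hat = np.mean(layer_alpha * np.log10(layer_lam_max))
print(f"alpha-hat = {alpha_hat:.3f}")
```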
  • 25. c|c (TM) WeightWatcher: predict test accuracies (TM) 25 calculation | consulting why deep learning works ResNet, DenseNet, etc. (Nature Communications 2021)
  • 26. c|c (TM) (TM) 26 calculation | consulting why deep learning works Predicting test accuracies: 100 pretrained models The heavy tailed (shape) metrics perform best https://github.com/osmr/imgclsmob From an open source sandbox of nearly 500 pretrained CV models (picked >=5 models / regression) (Nature Communications 2021)
  • 27. c|c (TM) (TM) 27 calculation | consulting why deep learning works Correlation Flow: CV Models We can study correlation flow looking at vs. depth VGG ResNet DenseNet (Nature Communications 2021)
  • 28. c|c (TM) WeightWatcher: interpreting shapes (TM) 28 calculation | consulting why deep learning works very high accuracy requires advanced methods (plot annotations: hard vs. easy fits)
  • 29. c|c (TM) WeightWatcher: more Power Law shape metrics (TM) 29 calculation | consulting why deep learning works watcher.analyze(…, fit=‘TPL’) Truncated Power Law fits watcher.analyze(…, fit=‘E_TPL’) weightwatcher provides several shape (and scale) metrics plus several more unpublished experimental options
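For reference, the standard functional forms behind these fits (hedged: weightwatcher's exact parameterization and normalizations may differ):

```latex
p(\lambda) \propto \lambda^{-\alpha}, \quad \lambda \ge \lambda_{\min}
\qquad \text{(Power Law)}
```
```latex
p(\lambda) \propto \lambda^{-\alpha}\, e^{-\Lambda \lambda}
\qquad \text{(Truncated Power Law)}
```

The exponential cutoff Λ is the extra shape parameter that the truncated fits report alongside α.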
  • 30. c|c (TM) WeightWatcher: E_TPL shape metric (TM) 30 calculation | consulting why deep learning works the E_TPL (and rand_distance) shape metrics track the learning curve epoch-by-epoch Training MT transformers from scratch to SOTA Extended Truncated Power Law highly accurate results leverage the advanced shape metrics Here, (Lambda) is the shape metric [Training transformers paper]
  • 31. c|c (TM) WeightWatcher: why Power Law fits? (TM) 31 calculation | consulting why deep learning works Spiking (i.e. real) neurons exhibit power law behavior weightwatcher supports several PL fits from experimental neuroscience plus totally new shape metrics we have invented (and published)
  • 32. c|c (TM) WeightWatcher: why Power Law fits? (TM) 32 calculation | consulting why deep learning works Spiking (i.e. real) neurons exhibit (truncated) power law behavior The Critical Brain Hypothesis Evidence of Self-Organized Criticality (SOC) Per Bak (How Nature Works) As neural systems become more complex they exhibit power law behavior and then truncated power law behavior We see exactly this behavior in DNNs and it is predictive of learning capacity
  • 33. c|c (TM) WeightWatcher: open-source, open-science (TM) 33 calculation | consulting why deep learning works We are looking for early adopters and collaborators github.com/CalculatedContent/WeightWatcher We have a Slack channel to support the tool Please file issues Ping me to join
  • 34. c|c (TM) (TM) 34 calculation | consulting why deep learning works Statistical Mechanics derivation of the alpha-hat metric
  • 35. c|c (TM) (TM) 35 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Statistical Mechanics of Learning Engle &Van den Broeck (2001) MultiLayer Feed Forward Network Perceptron
  • 36. c|c (TM) (TM) 36 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Gaussian data Average over data
  • 37. c|c (TM) (TM) 37 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model
  • 38. c|c (TM) (TM) 38 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model
  • 39. c|c (TM) (TM) 39 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Average version space volume over Gaussian data and uniform random Teachers Final expression has 2 parts, parameterized by the error ( ), size of data set (p)
  • 40. c|c (TM) (TM) 40 calculation | consulting why deep learning works Classic Set Up: Student-Teacher model Average over random teachers introduces overlap R Key idea in matrix generalization
  • 41. c|c (TM) (TM) 41 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher “ .. Matrix Generalization of S-T …” Martin, Milletari, & Mahoney (in preparation) real DNN matrices: NxM Strongly correlated Heavy-Tailed correlation matrices Solve for total integrated version space
  • 42. c|c (TM) (TM) 42 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher Gibbs Learning / Canonical Ensemble Consider T-S Mean Squared Error (MSE)
  • 43. c|c (TM) (TM) 43 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher Integrate the canonical measure over Gaussian data Matrix-generalized Student-Teacher overlap
  • 44. c|c (TM) (TM) 44 calculation | consulting why deep learning works New Set Up: Matrix-generalized Student-Teacher Integrate the version space volume over the Students J Expand delta function Again, break into 2 parts
  • 45. c|c (TM) (TM) 45 calculation | consulting why deep learning works New approach: HCIZ Matrix Integrals Fix the Teacher: average over Student Correlation Matrices Wick rotation Represent as an HCIZ integral Note: which resemble the Teacher
  • 46. c|c (TM) (TM) 46 calculation | consulting why deep learning works New approach: Semi-Empirical Theory “Generalized Norm” simple, functional form can infer from empirical fit Eigenvalues of Teacher empirical fit to: “Asymptotics of HCIZ integrals …” Tanaka (2008) WeightWatcher PowerLaw metric
  • 47. c|c (TM) WeightWatcher: global and local convexity metrics (TM) 47 calculation | consulting why deep learning works Smaller alpha corresponds to more convex energy landscapes Transformers (alpha ~ 3-4 or more) alpha 2-3 (or less) “Rational Decisions, Random Matrices and Spin Glasses” (1998) by Galluccio, Bouchaud, and Potters
  • 48. c|c (TM) WeightWatcher: global and local convexity metrics (TM) 48 calculation | consulting why deep learning works When the layer alpha < 2, we think this means the layer is overfit We suspect that the early layers of some Convolutional Nets may be slightly overtrained Some alpha < 2 This is predicted from our HTSR theory
  • 50. c|c (TM) (TM) 50 calculation | consulting why deep learning works
  • 51. c|c (TM) (TM) 51 calculation | consulting why deep learning works New interpretation: HCIZ Matrix Integrals Generating functional R-Transform (inverse Green’s function, via Contour Integral) in terms of the Teacher’s eigenvalues, and the Student’s cumulants
  • 52. c|c (TM) (TM) 52 calculation | consulting why deep learning works Results: Gaussian Random Weight Matrices “Random Matrix Theory (book)” Bouchaud and Potters (2020) Recover the Frobenius Norm (squared) as the metric
  • 53. c|c (TM) (TM) 53 calculation | consulting why deep learning works Results: (very) Heavy Tailed Weight Matrices “Heavy-tailed random matrices” Burda and Jurkiewicz (2009) Recover a Schatten Norm, in terms of the Heavy Tailed exponent
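The Schatten-norm statement can be checked numerically: a Schatten norm of W is exactly a power sum over the ESD eigenvalues (since λᵢ = sᵢ² for singular values sᵢ), which is why a heavy-tailed ESD with exponent α turns the generalized norm into a Schatten-type quantity. A minimal check (dimensions and p are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 30))

# Eigenvalues of W^T W relate to singular values of W: lam_i = s_i^2
lam = np.linalg.eigvalsh(W.T @ W)
s = np.linalg.svd(W, compute_uv=False)

# Schatten 2p-norm of W, written as a power sum over ESD eigenvalues:
# ||W||_{2p}^{2p} = sum_i s_i^{2p} = sum_i lam_i^p
p = 2.0
schatten_from_svals = np.sum(s ** (2 * p))
schatten_from_esd = np.sum(lam ** p)
print(schatten_from_svals, schatten_from_esd)
```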
  • 54. c|c (TM) (TM) 54 calculation | consulting why deep learning works Application to: Heavy Tailed Weight Matrices Some reasonable approximations give the weighted alpha metric Q.E.D.
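A hedged sketch of the "reasonable approximations" step: if the generalized norm is a power sum over the ESD eigenvalues and the tail is power-law with exponent α, the sum is dominated by the largest eigenvalue, yielding the weighted-alpha (alpha-hat) form used earlier in the talk:

```latex
\log \sum_i \lambda_i^{\alpha} \;\approx\; \alpha \log \lambda_{\max}
\qquad \Longrightarrow \qquad
\hat{\alpha} = \alpha \log \lambda_{\max}
```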