SlideShare a Scribd company logo
Heavy-Tailed Self-Regularization
in Deep Neural Networks
NeurIPS 2023 Workshop on Heavy Tails
charles@calculationconsulting.com
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Research: Implicit Self-Regularization in Deep Learning
• Predicting trends in the quality of state-of-the-art neural networks without
access to training or testing data
(JMLR 2021)
(Nature Communications 2021)
• Implicit Self-Regularization in Deep Neural Networks: Evidence from
Random Matrix Theory and Implications for Learning.
• SETOL: SemiEmpirical Theory of Learning
(Invited submission, Philosophical Magazine)
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Problem: How do NN layers converge ?
Under
fi
t
Over
fi
t
Random
Ideal Fit
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Accelerate Training: Adjust Layer Learning Rates
Temperature Balancing (NeurIPS 2023)
c|c
(TM)
calculation | consulting heavy-tails in DNNs
weightwatcher layer quality metric
Llama
Comparing LLMs: Detect ‘Bad’ Layers
c|c
(TM)
calculation | consulting heavy-tails in DNNs
TruthfulQA
metric
LLM Hallucinations: Gauge Truthfulness of Base Models
c|c
(TM)
calculation | consulting heavy-tails in DNNs
c|c
(TM)
calculation | consulting heavy-tails in DNNs
WeightWatcher: analyzes the ESD
(eigenvalues) of the layer weight matrices
The tail of the ESD contains the information
c|c
(TM)
calculation | consulting heavy-tails in DNNs
WeightWatcher: analyzes the ESD
(eigenvalues) of the layer weight matrices
Well trained layers are heavy-tailed and well shaped
GPT-2 Fits a Power Law
(or Truncated Power Law)
alpha in [2, 6]
watcher.analyze(plot=True)
Good quality of
fi
t (D is small)
c|c
(TM)
calculation | consulting heavy-tails in DNNs
WeightWatcher: analyzes the ESD
(eigenvalues) of the layer weight matrices
Better trained layers are more heavy-tailed and better shaped
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Random Matrix Theory: Marcenko Pastur
plus Tracy-Widom
fl
uctuations
very crisp edges
Q
RMT says if W is a simple random Gaussian matrix,
then the ESD will have a very simple , known form
Shape depends on Q=N/M
(and variance ~ 1)
Eigenvalues tightly bounded
a few spikes may appear
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails
(i.e. its all spikes, bulk vanishes)
If W is strongly correlated , then the ESD can be modeled as if W is drawn
from a heavy tailed distribution
Nearly all pre-trained DNNs display heavy tails…as shall soon see
c|c
(TM)
calculation | consulting heavy-tails in DNNs
AlexNet,
VGG11,VGG13, …
ResNet, …
DenseNet,
BERT, RoBERT, …
GPT, GPT2, …
Flan-T5
Bloom
Llama
Falcon
…
Heavy-Tailed: Self-Regularization
All large, well trained, modern DNNs exhibit heavy tailed self-regularization
scale free
HTSR
c|c
(TM)
calculation | consulting heavy-tails in DNNs
HTSR Theory: Heavy Tailed Self Regularization
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
Gaussian
random
matrix
Bulk+
Spikes
Heavy
Tailed
Small, older NNs
Large, modern DNNs
and/or
Small batch sizes
c|c
(TM)
calculation | consulting heavy-tails in DNNs
HTSR Theory: 5+1 Phases of Training
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Statistical Mechanics
Why this works…
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning from the 1990s
MultiLayer Feed Forward Network Perceptron
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Classic Set Up: Student-Teacher model
Average overlap over random students J
Generalization Error ~ ST Overlap
Exhibits phase behavior when over
fi
t
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Key Results: Complex Phase Behavior
over
fi
t models behave like glassy / meta-stable spin glasses
(2017)
c|c
(TM)
calculation | consulting heavy-tails in DNNs
New Set Up: Matrix-generalized Student-Teacher
real DNN matrices:
NxM
Strongly correlated
Heavy-Tailed
correlation matrices
Solve for Free Energy associated with the generalization error
This Free Energy is the weightwatcher layer quality metric
c|c
(TM)
calculation | consulting heavy-tails in DNNs
WeightWatcher
Quality metric
Layer Quality Metrics : SemiEmpirical Theory
“Generalized Norm”
simple, functional form
can infer from empirical
fi
t
Eigenvalues of Teacher
empirical
fi
t: R-transform:
“Asymptotics of HCZI integrals …” Tanaka (2008)
free cumulants
c|c
(TM)
calculation | consulting heavy-tails in DNNs
New Principle of Learning:
Volume Preserving Transformation
As the correlations concentrate into the tail
The change of measure must satisfy
a volume preserving transformation
Which sets a condition on the eigenvalues in the tail
c|c
(TM)
calculation | consulting heavy-tails in DNNs
New Principle of Ideal Learning:
Volume Preserving Transformation
As alpha -> 2, the |det X|=1 condition holds!
c|c
(TM)
calculation | consulting heavy-tails in DNNs
New Principle of Ideal Learning:
Volume Preserving Transformation
As alpha -> 2, the |det X|=1 condition holds!
c|c
(TM)
calculation | consulting heavy-tails in DNNs
New Principle of Ideal Learning:
As alpha -> 2, SVD generalizing components concentrate in the tail
MLP3 on MNIST, varying LR
Truncated SVD on FC1
c|c
(TM)
calculation | consulting heavy-tails in DNNs
As alpha < 2, layers appear to be over
fi
t
Layer Quality Metrics : Detecting Over
fi
tting
MLP3 on MNIST, varying LR
c|c
(TM)
calculation | consulting heavy-tails in DNNs
As alpha < 2, layers appear to be over
fi
t
Layer Quality Metrics : Detecting Over
fi
tting
MLP3 on MNIST: Reduced capacity
Train FC1, freeze FC2, FC3
c|c
(TM)
calculation | consulting heavy-tails in DNNs
Heavy Tailed Correlations: in Neuroscience
Spiking (i.e real) neurons exhibit power law behavior
weightwatcher supports several PL
fi
ts
from experimental neuroscience
plus totally new shape metrics
we have invented (and published)
Spiking (i.e real) neurons exhibit power law behavior
weightwatcher supports several PL
fi
ts
from experimental neuroscience
plus totally new shape metrics
we have invented (and published)
c|c
(TM)
calculation | consulting heavy-tails in DNNs
WeightWatcher: why Power Law
fi
ts ?
Spiking (i.e real) neurons exhibit (truncated) power law behavior
The Critical Brain Hypothesis
Evidence of Self-Organized Criticality (SOC)
Per Bak (How Nature Works)
As neural systems become more complex
they exhibit power law behavior
and then truncated power law behavior
We see exactly this behavior in DNNs
and it is predictive of learning capacity
c|c
(TM)
calculation | consulting heavy-tails in DNNs
WeightWatcher: open-source, open-science
We are looking for early adopters and collaborators
https://weightwatcher.ai 115K+ downloads; 1000 stars
(TM)
c|c
(TM)
c | c
charles@calculationconsulting.com

More Related Content

What's hot

Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
ML Basics
ML BasicsML Basics
ML Basics
SrujanaMerugu1
 
The Gaussian Process Latent Variable Model (GPLVM)
The Gaussian Process Latent Variable Model (GPLVM)The Gaussian Process Latent Variable Model (GPLVM)
The Gaussian Process Latent Variable Model (GPLVM)
James McMurray
 
Tabnet presentation
Tabnet presentationTabnet presentation
Tabnet presentation
Sebastien Fischman
 
Geospatial Data Analysis and Visualization in Python
Geospatial Data Analysis and Visualization in PythonGeospatial Data Analysis and Visualization in Python
Geospatial Data Analysis and Visualization in Python
Halfdan Rump
 
Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials Project
Anubhav Jain
 
Autoencoder
AutoencoderAutoencoder
Autoencoder
Wataru Hirota
 
DeBERTA : Decoding-Enhanced BERT with Disentangled Attention
DeBERTA : Decoding-Enhanced BERT with Disentangled AttentionDeBERTA : Decoding-Enhanced BERT with Disentangled Attention
DeBERTA : Decoding-Enhanced BERT with Disentangled Attention
taeseon ryu
 
Memory efficient pytorch
Memory efficient pytorchMemory efficient pytorch
Memory efficient pytorch
Hyungjoo Cho
 
Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to Autoencoders
Yan Xu
 
Data Augmentation
Data AugmentationData Augmentation
Data Augmentation
Md Tajul Islam
 
Towards Human-Centered Machine Learning
Towards Human-Centered Machine LearningTowards Human-Centered Machine Learning
Towards Human-Centered Machine Learning
Sri Ambati
 
PR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic ModelsPR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic Models
Hyeongmin Lee
 
Types of machine learning
Types of machine learningTypes of machine learning
Types of machine learning
HimaniAloona
 
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Sri Ambati
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Roelof Pieters
 
End to-end semi-supervised object detection with soft teacher ver.1.0
End to-end semi-supervised object detection with soft teacher ver.1.0End to-end semi-supervised object detection with soft teacher ver.1.0
End to-end semi-supervised object detection with soft teacher ver.1.0
taeseon ryu
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
Jeong-Yoon Lee
 
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and Techniques
Rui Pedro Paiva
 
The Evolution of AutoML
The Evolution of AutoMLThe Evolution of AutoML
The Evolution of AutoML
Ning Jiang
 

What's hot (20)

Autoencoders
AutoencodersAutoencoders
Autoencoders
 
ML Basics
ML BasicsML Basics
ML Basics
 
The Gaussian Process Latent Variable Model (GPLVM)
The Gaussian Process Latent Variable Model (GPLVM)The Gaussian Process Latent Variable Model (GPLVM)
The Gaussian Process Latent Variable Model (GPLVM)
 
Tabnet presentation
Tabnet presentationTabnet presentation
Tabnet presentation
 
Geospatial Data Analysis and Visualization in Python
Geospatial Data Analysis and Visualization in PythonGeospatial Data Analysis and Visualization in Python
Geospatial Data Analysis and Visualization in Python
 
Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials Project
 
Autoencoder
AutoencoderAutoencoder
Autoencoder
 
DeBERTA : Decoding-Enhanced BERT with Disentangled Attention
DeBERTA : Decoding-Enhanced BERT with Disentangled AttentionDeBERTA : Decoding-Enhanced BERT with Disentangled Attention
DeBERTA : Decoding-Enhanced BERT with Disentangled Attention
 
Memory efficient pytorch
Memory efficient pytorchMemory efficient pytorch
Memory efficient pytorch
 
Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to Autoencoders
 
Data Augmentation
Data AugmentationData Augmentation
Data Augmentation
 
Towards Human-Centered Machine Learning
Towards Human-Centered Machine LearningTowards Human-Centered Machine Learning
Towards Human-Centered Machine Learning
 
PR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic ModelsPR-409: Denoising Diffusion Probabilistic Models
PR-409: Denoising Diffusion Probabilistic Models
 
Types of machine learning
Types of machine learningTypes of machine learning
Types of machine learning
 
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
End to-end semi-supervised object detection with soft teacher ver.1.0
End to-end semi-supervised object detection with soft teacher ver.1.0End to-end semi-supervised object detection with soft teacher ver.1.0
End to-end semi-supervised object detection with soft teacher ver.1.0
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
Machine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and TechniquesMachine Learning: Applications, Process and Techniques
Machine Learning: Applications, Process and Techniques
 
The Evolution of AutoML
The Evolution of AutoMLThe Evolution of AutoML
The Evolution of AutoML
 

Similar to Heavy Tails Workshop NeurIPS2023.pdf

This Week in Machine Learning and AI Feb 2019
This Week in Machine Learning and AI Feb 2019This Week in Machine Learning and AI Feb 2019
This Week in Machine Learning and AI Feb 2019
Charles Martin
 
ENS Macrh 2022.pdf
ENS Macrh 2022.pdfENS Macrh 2022.pdf
ENS Macrh 2022.pdf
Charles Martin
 
Role of Tensors in Machine Learning
Role of Tensors in Machine LearningRole of Tensors in Machine Learning
Role of Tensors in Machine Learning
Anima Anandkumar
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022
Charles Martin
 
Spectral cnn
Spectral cnnSpectral cnn
Spectral cnn
Brian Kim
 
Recent advances on low-rank and sparse decomposition for moving object detection
Recent advances on low-rank and sparse decomposition for moving object detectionRecent advances on low-rank and sparse decomposition for moving object detection
Recent advances on low-rank and sparse decomposition for moving object detection
ActiveEon
 
Stanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning WorksStanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning Works
Charles Martin
 
WeightWatcher LLM Update
WeightWatcher LLM UpdateWeightWatcher LLM Update
WeightWatcher LLM Update
Charles Martin
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured prediction
zukun
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networks
Steve Nouri
 
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC BerkeleyWhy Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Charles Martin
 
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural NetworksWhy Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Charles Martin
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
Zihui Li
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recap
butest
 
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Charles Martin
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
Zahra Amini
 
IGARSS2011-I-Ling.ppt
IGARSS2011-I-Ling.pptIGARSS2011-I-Ling.ppt
IGARSS2011-I-Ling.ppt
grssieee
 
Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)
sohaib_alam
 
Learning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for GraphsLearning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for Graphs
Mathias Niepert
 
Performance of Matching Algorithmsfor Signal Approximation
Performance of Matching Algorithmsfor Signal ApproximationPerformance of Matching Algorithmsfor Signal Approximation
Performance of Matching Algorithmsfor Signal Approximation
iosrjce
 

Similar to Heavy Tails Workshop NeurIPS2023.pdf (20)

This Week in Machine Learning and AI Feb 2019
This Week in Machine Learning and AI Feb 2019This Week in Machine Learning and AI Feb 2019
This Week in Machine Learning and AI Feb 2019
 
ENS Macrh 2022.pdf
ENS Macrh 2022.pdfENS Macrh 2022.pdf
ENS Macrh 2022.pdf
 
Role of Tensors in Machine Learning
Role of Tensors in Machine LearningRole of Tensors in Machine Learning
Role of Tensors in Machine Learning
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022
 
Spectral cnn
Spectral cnnSpectral cnn
Spectral cnn
 
Recent advances on low-rank and sparse decomposition for moving object detection
Recent advances on low-rank and sparse decomposition for moving object detectionRecent advances on low-rank and sparse decomposition for moving object detection
Recent advances on low-rank and sparse decomposition for moving object detection
 
Stanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning WorksStanford ICME Lecture on Why Deep Learning Works
Stanford ICME Lecture on Why Deep Learning Works
 
WeightWatcher LLM Update
WeightWatcher LLM UpdateWeightWatcher LLM Update
WeightWatcher LLM Update
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured prediction
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networks
 
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC BerkeleyWhy Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
 
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural NetworksWhy Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recap
 
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
 
IGARSS2011-I-Ling.ppt
IGARSS2011-I-Ling.pptIGARSS2011-I-Ling.ppt
IGARSS2011-I-Ling.ppt
 
Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)
 
Learning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for GraphsLearning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for Graphs
 
Performance of Matching Algorithmsfor Signal Approximation
Performance of Matching Algorithmsfor Signal ApproximationPerformance of Matching Algorithmsfor Signal Approximation
Performance of Matching Algorithmsfor Signal Approximation
 

More from Charles Martin

LLM avalanche June 2023.pdf
LLM avalanche June 2023.pdfLLM avalanche June 2023.pdf
LLM avalanche June 2023.pdf
Charles Martin
 
ICCF24.pdf
ICCF24.pdfICCF24.pdf
ICCF24.pdf
Charles Martin
 
Georgetown B-school Talk 2021
Georgetown B-school Talk  2021Georgetown B-school Talk  2021
Georgetown B-school Talk 2021
Charles Martin
 
Search relevance
Search relevanceSearch relevance
Search relevance
Charles Martin
 
WeightWatcher Introduction
WeightWatcher IntroductionWeightWatcher Introduction
WeightWatcher Introduction
Charles Martin
 
WeightWatcher Update: January 2021
WeightWatcher Update:  January 2021WeightWatcher Update:  January 2021
WeightWatcher Update: January 2021
Charles Martin
 
Building AI Products: Delivery Vs Discovery
Building AI Products: Delivery Vs Discovery Building AI Products: Delivery Vs Discovery
Building AI Products: Delivery Vs Discovery
Charles Martin
 
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Charles Martin
 
AI and Machine Learning for the Lean Start Up
AI and Machine Learning for the Lean Start UpAI and Machine Learning for the Lean Start Up
AI and Machine Learning for the Lean Start Up
Charles Martin
 
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural NetworksWhy Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Charles Martin
 
Capsule Networks
Capsule NetworksCapsule Networks
Capsule Networks
Charles Martin
 
Palo alto university rotary club talk Sep 29, 2107
Palo alto university rotary club talk Sep 29, 2107Palo alto university rotary club talk Sep 29, 2107
Palo alto university rotary club talk Sep 29, 2107
Charles Martin
 
Cc stat phys draft
Cc stat phys draftCc stat phys draft
Cc stat phys draft
Charles Martin
 
CC mmds talk 2106
CC mmds talk 2106CC mmds talk 2106
CC mmds talk 2106
Charles Martin
 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
Charles Martin
 
Cc hass b school talk 2105
Cc hass b school talk  2105Cc hass b school talk  2105
Cc hass b school talk 2105
Charles Martin
 
CC Talk at Berekely
CC Talk at BerekelyCC Talk at Berekely
CC Talk at Berekely
Charles Martin
 

More from Charles Martin (17)

LLM avalanche June 2023.pdf
LLM avalanche June 2023.pdfLLM avalanche June 2023.pdf
LLM avalanche June 2023.pdf
 
ICCF24.pdf
ICCF24.pdfICCF24.pdf
ICCF24.pdf
 
Georgetown B-school Talk 2021
Georgetown B-school Talk  2021Georgetown B-school Talk  2021
Georgetown B-school Talk 2021
 
Search relevance
Search relevanceSearch relevance
Search relevance
 
WeightWatcher Introduction
WeightWatcher IntroductionWeightWatcher Introduction
WeightWatcher Introduction
 
WeightWatcher Update: January 2021
WeightWatcher Update:  January 2021WeightWatcher Update:  January 2021
WeightWatcher Update: January 2021
 
Building AI Products: Delivery Vs Discovery
Building AI Products: Delivery Vs Discovery Building AI Products: Delivery Vs Discovery
Building AI Products: Delivery Vs Discovery
 
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
 
AI and Machine Learning for the Lean Start Up
AI and Machine Learning for the Lean Start UpAI and Machine Learning for the Lean Start Up
AI and Machine Learning for the Lean Start Up
 
Why Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural NetworksWhy Deep Learning Works: Self Regularization in Deep Neural Networks
Why Deep Learning Works: Self Regularization in Deep Neural Networks
 
Capsule Networks
Capsule NetworksCapsule Networks
Capsule Networks
 
Palo alto university rotary club talk Sep 29, 2107
Palo alto university rotary club talk Sep 29, 2107Palo alto university rotary club talk Sep 29, 2107
Palo alto university rotary club talk Sep 29, 2107
 
Cc stat phys draft
Cc stat phys draftCc stat phys draft
Cc stat phys draft
 
CC mmds talk 2106
CC mmds talk 2106CC mmds talk 2106
CC mmds talk 2106
 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
 
Cc hass b school talk 2105
Cc hass b school talk  2105Cc hass b school talk  2105
Cc hass b school talk 2105
 
CC Talk at Berekely
CC Talk at BerekelyCC Talk at Berekely
CC Talk at Berekely
 

Recently uploaded

Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Selcen Ozturkcan
 
Modelo de slide quimica para powerpoint
Modelo  de slide quimica para powerpointModelo  de slide quimica para powerpoint
Modelo de slide quimica para powerpoint
Karen593256
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
Carl Bergstrom
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
frank0071
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
HOW DO ORGANISMS REPRODUCE?reproduction part 1
HOW DO ORGANISMS REPRODUCE?reproduction part 1HOW DO ORGANISMS REPRODUCE?reproduction part 1
HOW DO ORGANISMS REPRODUCE?reproduction part 1
Shashank Shekhar Pandey
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
PirithiRaju
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills MN
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
Advanced-Concepts-Team
 

Recently uploaded (20)

Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 
Modelo de slide quimica para powerpoint
Modelo  de slide quimica para powerpointModelo  de slide quimica para powerpoint
Modelo de slide quimica para powerpoint
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
HOW DO ORGANISMS REPRODUCE?reproduction part 1
HOW DO ORGANISMS REPRODUCE?reproduction part 1HOW DO ORGANISMS REPRODUCE?reproduction part 1
HOW DO ORGANISMS REPRODUCE?reproduction part 1
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 

Heavy Tails Workshop NeurIPS2023.pdf

  • 1. Heavy-Tailed Self-Regularization in Deep Neural Networks NeurIPS 2023 Workshop on Heavy Tails charles@calculationconsulting.com
  • 2. c|c (TM) calculation | consulting heavy-tails in DNNs Research: Implicit Self-Regularization in Deep Learning • Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data (JMLR 2021) (Nature Communications 2021) • Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning. • SETOL: SemiEmpirical Theory of Learning (Invited submission, Philosophical Magazine)
  • 3. c|c (TM) calculation | consulting heavy-tails in DNNs Problem: How do NN layers converge ? Under fi t Over fi t Random Ideal Fit
  • 4. c|c (TM) calculation | consulting heavy-tails in DNNs Accelerate Training: Adjust Layer Learning Rates Temperature Balancing (NeurIPS 2023)
  • 5. c|c (TM) calculation | consulting heavy-tails in DNNs weightwatcher layer quality metric Llama Comparing LLMs: Detect ‘Bad’ Layers
  • 6. c|c (TM) calculation | consulting heavy-tails in DNNs TruthfulQA metric LLM Hallucinations: Gauge Truthfulness of Base Models
  • 7. c|c (TM) calculation | consulting heavy-tails in DNNs
  • 8. c|c (TM) calculation | consulting heavy-tails in DNNs WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices The tail of the ESD contains the information
  • 9. c|c (TM) calculation | consulting heavy-tails in DNNs WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices Well trained layers are heavy-tailed and well shaped GPT-2 Fits a Power Law (or Truncated Power Law) alpha in [2, 6] watcher.analyze(plot=True) Good quality of fi t (D is small)
  • 10. c|c (TM) calculation | consulting heavy-tails in DNNs WeightWatcher: analyzes the ESD (eigenvalues) of the layer weight matrices Better trained layers are more heavy-tailed and better shaped
  • 11. c|c (TM) calculation | consulting heavy-tails in DNNs Random Matrix Theory: Marcenko Pastur plus Tracy-Widom fl uctuations very crisp edges Q RMT says if W is a simple random Gaussian matrix, then the ESD will have a very simple , known form Shape depends on Q=N/M (and variance ~ 1) Eigenvalues tightly bounded a few spikes may appear
  • 12. c|c (TM) calculation | consulting heavy-tails in DNNs Random Matrix Theory: Heavy Tailed But if W is heavy tailed, the ESD will also have heavy tails (i.e. its all spikes, bulk vanishes) If W is strongly correlated , then the ESD can be modeled as if W is drawn from a heavy tailed distribution Nearly all pre-trained DNNs display heavy tails…as shall soon see
  • 13. c|c (TM) calculation | consulting heavy-tails in DNNs AlexNet, VGG11,VGG13, … ResNet, … DenseNet, BERT, RoBERT, … GPT, GPT2, … Flan-T5 Bloom Llama Falcon … Heavy-Tailed: Self-Regularization All large, well trained, modern DNNs exhibit heavy tailed self-regularization scale free HTSR
  • 14. c|c (TM) calculation | consulting heavy-tails in DNNs HTSR Theory: Heavy Tailed Self Regularization DNN training induces breakdown of Gaussian random structure and the onset of a new kind of heavy tailed self-regularization Gaussian random matrix Bulk+ Spikes Heavy Tailed Small, older NNs Large, modern DNNs and/or Small batch sizes
  • 15. c|c (TM) calculation | consulting heavy-tails in DNNs HTSR Theory: 5+1 Phases of Training Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
  • 16. c|c (TM) calculation | consulting heavy-tails in DNNs Heavy Tailed RMT: Universality Classes The familiar Wigner/MP Gaussian class is not the only Universality class in RMT Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
  • 17. c|c (TM) calculation | consulting heavy-tails in DNNs Statistical Mechanics Why this works…
  • 18. c|c (TM) calculation | consulting heavy-tails in DNNs Classic Set Up: Student-Teacher model Statistical Mechanics of Learning from the 1990s MultiLayer Feed Forward Network Perceptron
  • 19. c|c (TM) calculation | consulting heavy-tails in DNNs Classic Set Up: Student-Teacher model Average overlap over random students J Generalization Error ~ ST Overlap Exhibits phase behavior when over fi t
  • 20. c|c (TM) calculation | consulting heavy-tails in DNNs Key Results: Complex Phase Behavior over fi t models behave like glassy / meta-stable spin glasses (2017)
  • 21. c|c (TM) calculation | consulting heavy-tails in DNNs New Set Up: Matrix-generalized Student-Teacher real DNN matrices: NxM Strongly correlated Heavy-Tailed correlation matrices Solve for Free Energy associated with the generalization error This Free Energy is the weightwatcher layer quality metric
  • 22. c|c (TM) calculation | consulting heavy-tails in DNNs WeightWatcher Quality metric Layer Quality Metrics : SemiEmpirical Theory “Generalized Norm” simple, functional form can infer from empirical fi t Eigenvalues of Teacher empirical fi t: R-transform: “Asymptotics of HCZI integrals …” Tanaka (2008) free cumulants
  • 23. c|c (TM) calculation | consulting heavy-tails in DNNs New Principle of Learning: Volume Preserving Transformation As the correlations concentrate into the tail The change of measure must satisfy a volume preserving transformation Which sets a condition on the eigenvalues in the tail
  • 24. c|c (TM) calculation | consulting heavy-tails in DNNs New Principle of Ideal Learning: Volume Preserving Transformation As alpha -> 2, the |det X|=1 condition holds!
  • 25. c|c (TM) calculation | consulting heavy-tails in DNNs New Principle of Ideal Learning: Volume Preserving Transformation As alpha -> 2, the |det X|=1 condition holds!
  • 26. c|c (TM) calculation | consulting heavy-tails in DNNs New Principle of Ideal Learning: As alpha -> 2, SVD generalizing components concentrate in the tail MLP3 on MNIST, varying LR Truncated SVD on FC1
  • 27. c|c (TM) calculation | consulting heavy-tails in DNNs As alpha < 2, layers appear to be over fi t Layer Quality Metrics : Detecting Over fi tting MLP3 on MNIST, varying LR
  • 28. c|c (TM) calculation | consulting heavy-tails in DNNs As alpha < 2, layers appear to be over fi t Layer Quality Metrics : Detecting Over fi tting MLP3 on MNIST: Reduced capacity Train FC1, freeze FC2, FC3
  • 29. c|c (TM) calculation | consulting heavy-tails in DNNs Heavy Tailed Correlations: in Neuroscience Spiking (i.e real) neurons exhibit power law behavior weightwatcher supports several PL fi ts from experimental neuroscience plus totally new shape metrics we have invented (and published) Spiking (i.e real) neurons exhibit power law behavior weightwatcher supports several PL fi ts from experimental neuroscience plus totally new shape metrics we have invented (and published)
  • 30. c|c (TM) calculation | consulting heavy-tails in DNNs WeightWatcher: why Power Law fi ts ? Spiking (i.e real) neurons exhibit (truncated) power law behavior The Critical Brain Hypothesis Evidence of Self-Organized Criticality (SOC) Per Bak (How Nature Works) As neural systems become more complex they exhibit power law behavior and then truncated power law behavior We see exactly this behavior in DNNs and it is predictive of learning capacity
  • 31. c|c (TM) calculation | consulting heavy-tails in DNNs WeightWatcher: open-source, open-science We are looking for early adopters and collaborators https://weightwatcher.ai 115K+ downloads; 1000 stars