WeightWatcher
Data Free Diagnostics for Deep Learning
TII Jan 2025
charles@calculatedcontent.com
c|c
(TM)
calculation | consulting weightwatcher.ai
Who am I?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 20 years experience in Applied Machine Learning (ML) and AI
Developed ML algos for Demand Media; the first $1B IPO since Google
Lean Start Ups: Aardvark (acquired by Google), …
Internet Giants: eHow, eBay, GoDaddy, WalmartLabs …
Wall Street: Barclays, BlackRock, GLG, …
Fortune 500: Big Pharma, Telecom, …
https://weightwatcher.ai
https://calculatedcontent.com
charles@calculatedcontent.com
WeightWatcher: open-source, open-science
https://weightwatcher.ai 115K+ downloads; 1000 stars
weightwatcher (HTSR) layer quality metric:
Research: Implicit Self-Regularization in Deep Learning
• Predicting trends in the quality of state-of-the-art neural networks without
access to training or testing data (Nature Communications 2021)
• Implicit Self-Regularization in Deep Neural Networks: Evidence from
Random Matrix Theory and Implications for Learning (JMLR 2021)
• SETOL: SemiEmpirical Theory of Learning
(Invited submission, Philosophical Magazine)
Problem: How do NN layers converge?
Underfit
Overfit
Random
Ideal Fit
Solution: weightwatcher (HTSR) layer quality metric
[figure: layer ESDs with fitted alpha values 1.5, 2.0, 6.0, 8.0]
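The headline numbers here are power-law exponents (alpha) fitted to the tail of each layer's eigenvalue spectrum. A stdlib-only sketch of the continuous maximum-likelihood (Hill-type) estimator behind such fits; this is illustrative, not the library's actual fitting code (weightwatcher follows the Clauset–Shalizi–Newman procedure with automatic xmin selection), and the synthetic Pareto "ESD" is made up for the demo:

```python
import math
import random

def fit_alpha(eigenvalues, lambda_min):
    """Continuous power-law MLE (Hill estimator) for the tail exponent:
    alpha = 1 + n / sum(ln(lambda_i / lambda_min)) over lambda_i >= lambda_min."""
    tail = [x for x in eigenvalues if x >= lambda_min]
    log_sum = sum(math.log(x / lambda_min) for x in tail)
    return 1.0 + len(tail) / log_sum

# Synthetic heavy-tailed "ESD": Pareto samples whose density ~ x**(-3)
random.seed(0)
esd = [random.paretovariate(2.0) for _ in range(50_000)]
print(round(fit_alpha(esd, lambda_min=1.0), 1))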
HTSR layer quality metric
Llama
Falcon
LLMs: ‘under & overfit’ Layers
WeightWatcher: analyzes the (tail of the) ESD
(eigenvalues) of the layer weight matrices
[figure: ESD tail fits with alpha = 2.0 and 4.0]
Theory-based: Layer Quality metrics
(2025)
HTSR SETOL
(2021)
Theoretical analysis: LLMs have underfit layers
alpha > 6
Llama, Falcon vs Falcon2
Theoretical predictions: fine-tuned updates
Falcon 3 Instruct
No underfit layers (alpha > 6)
Few overfit layers (alpha < 2)
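The thresholds on these slides (alpha > 6 underfit, alpha < 2 overfit) make a simple triage rule. A minimal sketch; the `classify_layer` helper and the example layer names are hypothetical, standing in for per-layer alphas such as those weightwatcher reports:

```python
def classify_layer(alpha):
    """HTSR rule of thumb from the slides: alpha in [2, 6] is well-trained,
    alpha > 6 flags an underfit layer, alpha < 2 flags an overfit one."""
    if alpha > 6.0:
        return "underfit"
    if alpha < 2.0:
        return "overfit"
    return "fit"

# Hypothetical per-layer alphas, e.g. as reported per layer by weightwatcher
for name, alpha in [("attn.q_proj", 3.2), ("mlp.up_proj", 7.1), ("lm_head", 1.6)]:
    print(f"{name}: alpha={alpha} -> {classify_layer(alpha)}")
```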
Theoretical analysis: fine-tuned updates follow
predictions of the theory
Llama-3.1 Mistral Qwen2.5
Theoretical prediction: better base models
(smaller alphas) are easier to fine-tune
Llama-3.1 70B base vs 70B Instruct
Overfitting is good : Segment Anything Models
Overfitting is bad : Llama Guard models
LLM Hallucinations: Gauge Truthfulness of Base Models
TruthfulQA metric
Applications of HTSR: Student Projects
low-memory fine-tuning
optimize layer learning rates
compress LLMs
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations
very crisp edges
RMT says if W is a simple random Gaussian matrix,
then the ESD will have a very simple, known form
Shape depends on Q=N/M
(and variance ~ 1)
Eigenvalues tightly bounded
a few spikes may appear
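The "tightly bounded" bulk is easy to compute. For an N x M Gaussian W with element variance sigma^2 and Q = N/M >= 1, the ESD of (1/N) W^T W is supported on [lambda-, lambda+] with lambda± = sigma^2 (1 ± 1/sqrt(Q))^2; eigenvalues beyond lambda+ (up to Tracy-Widom fluctuations) are the spikes. A stdlib-only sketch; `mp_bulk_edges` is an illustrative helper, not a library function:

```python
import math

def mp_bulk_edges(Q, sigma2=1.0):
    """Marchenko-Pastur bulk edges for the ESD of (1/N) W^T W,
    with Q = N/M >= 1: lambda_pm = sigma2 * (1 +/- 1/sqrt(Q))**2."""
    lam_minus = sigma2 * (1.0 - 1.0 / math.sqrt(Q)) ** 2
    lam_plus = sigma2 * (1.0 + 1.0 / math.sqrt(Q)) ** 2
    return lam_minus, lam_plus

print(mp_bulk_edges(1.0))  # square matrix: bulk spans (0.0, 4.0)
print(mp_bulk_edges(4.0))  # tall matrix:   bulk spans (0.25, 2.25)
```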
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails
(i.e. it's all spikes, the bulk vanishes)
If W is strongly correlated, then the ESD can be modeled as if W were drawn
from a heavy tailed distribution
Nearly all pre-trained DNNs display heavy tails… as we shall soon see
AlexNet,
VGG11,VGG13, …
ResNet, …
DenseNet,
BERT, RoBERTa, …
GPT, GPT2, …
Flan-T5
Bloom
Llama
Falcon
…
Heavy-Tailed: Self-Regularization
All large, well trained, modern DNNs exhibit heavy tailed self-regularization
scale free
HTSR
Heavy-Tails: in Latent Semantic Analysis
Heavy tails do not come solely from SGD training
LSA on 20newsgroups; great PL fit; alpha = 2.2
HTSR Theory: Heavy Tailed Self Regularization
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
Gaussian
random
matrix
Bulk + Spikes
Heavy
Tailed
Small, older NNs
Large, modern DNNs
and/or
Small batch sizes
HTSR Theory: 5+1 Phases of Training
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Charles H. Martin, Michael W. Mahoney; JMLR 22(165):1−73, 2021.
Heavy Tailed RMT: Universality Classes
Statistical Mechanics
Why this works…
Classic Set Up: Student-Teacher model
Statistical Mechanics of Learning from the 1990s
MultiLayer Feed Forward Network Perceptron
Set Up: Student-Teacher model

Quality (generalization accuracy)
Perceptron
(vector model)
NN Layer
(matrix model)
Set Up: Matrix-generalized Student-Teacher
Heavy-Tailed correlation matrix
Student-Teacher matrix overlap
Quality (generalization accuracy)
(2 forms)
Set Up: Matrix-generalized Student-Teacher
Free Energy Quality (squared) / Generating Function
Quality (generalization accuracy)
HCIZ Integral
HCIZ Integrals : Evaluation “Tricks”
Free Energy Quality (squared) / Generating Function
X
Need to Change Measure
HCIZ Integrals : Evaluation “Tricks”
Free Energy Quality (squared) / Generating Function
X
Need Determinant = 1
HCIZ Integrals : Renormalization Group Theory
Let H be the Hamiltonian for the Quality
Volume Preserving or Scale-Invariant Transformation
HCIZ Integrals : Effective Correlation Space
HCIZ Integrals : Effective Space Free Energy
Volume Preserving or Scale-Invariant Transformation
Layer Quality Metrics : SemiEmpirical Theory
“Generalized Norm”
simple, functional form
can infer from empirical fit
Eigenvalues of Teacher
empirical fit: R-transform:
“Asymptotics of HCIZ integrals …” Tanaka (2008)
free cumulants*
WeightWatcher
Layer Quality metric ~
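The exact SETOL quality formula is left to the paper; as a concrete stand-in, weightwatcher's familiar weighted-alpha summary metric, commonly written as alpha times log10 of the largest eigenvalue, is a one-liner. A sketch; the helper name is hypothetical:

```python
import math

def weighted_alpha(alpha, lambda_max):
    """Weighted-alpha style layer metric: alpha * log10(lambda_max).
    Smaller values generally track better layer quality."""
    return alpha * math.log10(lambda_max)

# e.g. a layer with alpha = 3.0 and largest eigenvalue 100
print(weighted_alpha(3.0, 100.0))  # -> 6.0
```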
R-Transform : Heavy Tails
Smaller alpha =>
Heavier tail =>
Larger higher-order cumulants =>
Better generalization…
(down to alpha = 2)
R-Transform : Branch Cut at the ECS
start of ECS
Inverse-Wishart model
New Principle of Learning:
Volume Preserving Transformation
As the correlations concentrate into the tail,
the change of measure must satisfy
a volume-preserving transformation,
which sets a condition on the eigenvalues in the tail
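Since |det X| = 1 means the product of the tail eigenvalues is 1, the condition reduces to the sum of log-eigenvalues over the tail being ~0. A stdlib-only sketch; the `tail_log_det` helper and the example spectrum are illustrative, not from the paper:

```python
import math

def tail_log_det(eigenvalues, k):
    """Sum of log-eigenvalues over the top-k tail; |det X| = 1
    corresponds to this sum being ~0 (tail eigenvalues multiply to 1)."""
    tail = sorted(eigenvalues, reverse=True)[:k]
    return sum(math.log(x) for x in tail)

# Hypothetical spectrum whose top-4 tail {4, 1, 0.5, 0.5} multiplies to 1
print(abs(tail_log_det([4.0, 1.0, 0.5, 0.5, 0.1, 0.05], k=4)) < 1e-9)  # -> True
```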
New Principle of Ideal Learning:
Volume Preserving Transformation
As alpha -> 2, the |det X|=1 condition holds!
New Principle of Ideal Learning:
Wilson Exact Renormalization Group
Evidence of Ideal Learning:
As alpha -> 2, SVD generalizing components concentrate in the tail
MLP3 on MNIST, varying LR
Effective Correlation Space (ECS)
As alpha < 2, layers appear to be overfit
Layer Quality Metrics : Detecting Overfitting
A result of spuriously large learning rates
Layer Quality Metrics : Correlation Traps
As alpha < 2, layer FC1 appears to be overfit
Why?: the IZ Free Energy becomes non-extensive
Layer Quality Metrics : Detecting Overfitting
MLP3 on MNIST: Reduced capacity
Train FC1, freeze FC2, FC3
Heavy Tailed Correlations: in Neuroscience
Spiking (i.e. real) neurons exhibit power law behavior
weightwatcher supports several PL fits
from experimental neuroscience
plus totally new shape metrics
we have invented (and published)
WeightWatcher: why Power Law fits?
Spiking (i.e. real) neurons exhibit (truncated) power law behavior
The Critical Brain Hypothesis
Evidence of Self-Organized Criticality (SOC)
Per Bak (How Nature Works)
As neural systems become more complex
they exhibit power law behavior
and then truncated power law behavior
We see exactly this behavior in DNNs
and it is predictive of learning capacity
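A truncated power law is easy to sample by inverse-CDF, which is handy for experimenting with these shape metrics on synthetic data. A stdlib-only sketch; the function name and parameters are illustrative, not fitted to any real spiking data:

```python
import random

def sample_truncated_pl(alpha, xmin, xmax, n, rng):
    """Inverse-CDF sampling from a power law p(x) ~ x**(-alpha), alpha > 1,
    truncated to [xmin, xmax]; as xmax -> inf this recovers the pure PL."""
    a, b = xmin ** (1.0 - alpha), xmax ** (1.0 - alpha)
    return [(a - rng.random() * (a - b)) ** (1.0 / (1.0 - alpha))
            for _ in range(n)]

rng = random.Random(0)
xs = sample_truncated_pl(alpha=2.5, xmin=1.0, xmax=100.0, n=10_000, rng=rng)
print(min(xs) >= 1.0 and max(xs) <= 100.0)  # -> True
```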
WeightWatcher: open-source, open-science
We are looking for early adopters and collaborators
https://weightwatcher.ai 115K+ downloads; 1000 stars
charles@calculatedcontent.com
