calculation | consulting
why deep learning works:
perspectives from theoretical chemistry
charles@calculationconsulting.com
MMDS 2016
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 10 years of experience in applied Machine Learning
Developed ML algos for Demand Media; the first $1B IPO since Google
Tech: Aardvark (now Google), eHow, GoDaddy, …
Wall Street: BlackRock
Fortune 500: Big Pharma, Telecom, eBay
www.calculationconsulting.com
charles@calculationconsulting.com
Data Scientists are Different
theoretical physics
machine learning specialist
experimental physics
data scientist
engineer
software, browser tech, dev ops, …
not all techies are the same
Problem: How can SGD possibly work?
Aren’t Neural Nets non-Convex ?!
can Spin Glass models suggest why ?
what other models are out there ?
expected vs. observed ?
Outline

Random Energy Model (REM)
Temperature, regularization and the glass transition
extending REM: Spin Glass of Minimal Frustration
protein folding analogy: Funneled Energy Landscapes
example: Dark Knowledge
Recent work: Spin Glass models for Deep Nets
Warning

condensed matter theory is about qualitative analogies
we may seek a toy model
a mean field theory
a phenomenological description
What problem is Deep Learning solving ?
minimize cross-entropy
https://www.ics.uci.edu/~pjsados.pdf
Problem: What is a good theoretical model for deep networks ?
p-spin spherical glass
LeCun … 2015
L : Hamiltonian (energy function)
X : Gaussian random variables
w : real-valued weights (spins), with a spherical constraint
H >= 3 : network depth (plays the role of p)
can be solved analytically, simulated easily
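As a rough illustration of the model above, the sketch below evaluates a p = 3 spherical spin glass energy for random Gaussian couplings X and weights w on the sphere. The normalization by N^((p-1)/2) and the constraint |w|^2 = N are the usual convention for this model and are assumed here; the function names are hypothetical and not taken from the slides.

```python
import numpy as np


def sample_spherical_weights(n, rng):
    """Draw a random point on the sphere |w|^2 = n (the spherical constraint)."""
    w = rng.standard_normal(n)
    return w * np.sqrt(n) / np.linalg.norm(w)


def p3_spin_energy(X, w):
    """Energy of a p = 3 spin glass: sum_ijk X_ijk w_i w_j w_k / n^((p-1)/2)."""
    n = w.shape[0]
    # For p = 3 the normalization n^((p-1)/2) is simply n.
    return np.einsum("ijk,i,j,k->", X, w, w, w) / n


rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, n, n))      # Gaussian couplings
w = sample_spherical_weights(n, rng)    # real-valued "spins"
energy = p3_spin_energy(X, w)
```

Because the energy is a homogeneous polynomial of degree p in w, rescaling w by a factor c rescales the energy by c^p, which is why the spherical constraint is needed to keep it bounded.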
What is a spin glass ?
Frustration: constraints that cannot all be satisfied at once
J = X = weights
S = w = spins
Energetically: all spins should be paired
why p-spin spherical glass ?
crudely: deep networks (effectively) have no local minima !
[Figure: energy levels of the critical points. Labels: floor / ground state; local minima; k = 1, k = 2, k = 3 critical points (saddle points); the critical points are ordered]
http://cims.nyu.edu/~achoroma/NonFlash/Papers/PAPER_AMMGY.pdf
any local minimum will do; the ground state is a state of overtraining
good generalization
overtraining
Early Stopping: to avoid the ground state ?
it’s easy to find the ground state; it’s hard to generalize ?
Current Interpretation

• finding the ground state is easy (sic); generalizing is hard
• finding the ground state is irrelevant: any local minimum will do
• the ground state is a state of overtraining
recent p-spin spherical glass results
actually: recent results (2013) on the behavior
(distribution of critical points, concentration of the means)
of an isotropic random function on a high dimensional manifold
require: the variables actually concentrate on their means
the weights are drawn from an isotropic random function
related to: old results on TAP solutions (1977)
# critical points ~ TAP complexity
avoid local minima? : increase Temperature
harder problem: low Temp behavior of spin glass
What problem is Deep Learning solving ?
minimize cross-entropy of output layer
entropic effects : not just min energy
more like min free energy (divergence)
Statistical Physics and Information Theory: Neri Merhav
e.g. variational autoencoders
Restricted Boltzmann Machine
can define free energy directly
A Practical Guide to Training Restricted Boltzmann Machines, Hinton
trade off between energy and entropy
min free energy directly
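To make "can define free energy directly" concrete, here is a minimal sketch of the free energy of a visible vector in an RBM with binary hidden units, in the closed form given in Hinton's practical guide. The brute-force function checks it against the definition F(v) = -log sum_h exp(-E(v, h)). The parameter names a, b, W and the tiny layer sizes are assumptions chosen for illustration.

```python
import itertools
import numpy as np


def rbm_free_energy(v, a, b, W):
    """F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)) for binary hidden units."""
    x = b + v @ W
    return -(a @ v) - np.sum(np.logaddexp(0.0, x))


def rbm_free_energy_brute_force(v, a, b, W):
    """Same quantity from the definition F(v) = -log sum_h exp(-E(v, h))."""
    energies = []
    for h in itertools.product([0.0, 1.0], repeat=len(b)):
        h = np.array(h)
        energies.append(-(a @ v) - (b @ h) - v @ W @ h)
    return -np.log(np.sum(np.exp(-np.array(energies))))


rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4              # small enough to enumerate all h
a = rng.normal(size=n_visible)          # visible biases
b = rng.normal(size=n_hidden)           # hidden biases
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
v = rng.integers(0, 2, size=n_visible).astype(float)

f_closed = rbm_free_energy(v, a, b, W)
f_brute = rbm_free_energy_brute_force(v, a, b, W)
```

The sum over hidden states is the entropic part of the free energy: many hidden configurations compatible with v lower F(v) even when no single one has very low energy.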
https://web.stanford.edu/~montanar/RESEARCH/BOOK/partB.pdf
the p -> inf limit of the p-spin spherical glass
A related approach: Random Energy Model (REM)
Random Energy Model (REM)
ground state is governed by Extreme Value Statistics
http://guava.physics.uiuc.edu/~nigel/courses/563/essays2000/pogorelov.pdf
http://scitation.aip.org/content/aip/journal/jcp/111/14/10.1063/1.479951
old result from protein folding theory
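A small numerical sketch of the extreme value statement: in the REM the 2^N energy levels are independent Gaussians, so the ground state is simply the minimum of that sample. The variance convention N/2 is an assumption (it is the one used in the Montanari book chapter linked elsewhere in these slides), and N = 20 is chosen only so the sample fits in memory.

```python
import numpy as np


def rem_ground_state_energy(n_spins, rng):
    """Minimum of 2^N independent Gaussian energies with variance N/2."""
    energies = rng.normal(scale=np.sqrt(n_spins / 2.0), size=2 ** n_spins)
    return energies.min()


rng = np.random.default_rng(0)
n_spins = 20
e_min = rem_ground_state_energy(n_spins, rng)

e_min_per_spin = e_min / n_spins          # finite-N estimate
e_min_asymptotic = -np.sqrt(np.log(2.0))  # N -> inf prediction, about -0.83
```

At this small N the estimate sits somewhat above the asymptotic value because of logarithmic finite-size corrections, which are typical of extreme value statistics.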
REM: What is Temperature ?
We can use statistical mechanics to analyze known algorithms
I don’t mean in the traditional sense of algorithmic analysis
take E_j as the objective = loss function + regularizer
study Z: form a mean field theory;
take limits N -> inf, T -> 0
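For reference, the standard definitions behind "study Z" can be written as below. The index j over configurations and units with the Boltzmann constant set to 1 are assumptions; the slides do not spell them out.

```latex
Z(T) = \sum_{j} e^{-E_j / T},
\qquad
F(T) = -T \ln Z(T),
\qquad
p_j = \frac{e^{-E_j / T}}{Z(T)} .
```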
let E(T) be the effective energy
E(T) = E/T ~ sum of weights*activations
as T -> 0, E(T) effective energies diverge; weights explode
Temperature is a proxy for weight constraints
T sets the Energy Scale
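A toy sketch of how T sets the energy scale: for a fixed set of energies, lowering T makes the effective energies E/T grow without bound while the Boltzmann distribution concentrates on the lowest state and its entropy drops. The four energy values are arbitrary and serve only as an example.

```python
import numpy as np


def boltzmann_probs(energies, temperature):
    """p_j proportional to exp(-E_j / T), computed stably."""
    scaled = -np.asarray(energies, dtype=float) / temperature
    scaled -= scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()


def entropy(probs):
    """Shannon entropy in nats, ignoring zero-probability states."""
    p = probs[probs > 0]
    return -np.sum(p * np.log(p))


energies = np.array([0.0, 0.5, 1.0, 2.0])
for temperature in (10.0, 1.0, 0.1):
    p = boltzmann_probs(energies, temperature)
    # Columns: T, largest effective energy |E/T|, entropy of the distribution.
    print(temperature, np.max(np.abs(energies / temperature)), entropy(p))
```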
Temperature: as Weight Constraints
• traditional weight regularization
• max norm constraints (e.g. with dropout)
• batch norm regularization (2015)
we avoid situations where the weights explode
in deep networks, we temper the weights
and the distribution of the activations (i.e. local entropy)
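One of the constraints listed above, the max norm constraint, can be sketched in a few lines: after each update, any unit whose incoming weight vector exceeds a chosen norm is rescaled back onto the ball. Treating rows as the incoming weights of a unit and the threshold 3.0 are assumptions for illustration only.

```python
import numpy as np


def apply_max_norm(W, max_norm):
    """Rescale each row of W (one unit's incoming weights) to have L2 norm <= max_norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows already inside the ball get scale 1 and are left unchanged.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale


rng = np.random.default_rng(0)
W = rng.normal(scale=2.0, size=(5, 8))
W_clipped = apply_max_norm(W, max_norm=3.0)
```

In the temperature picture, the bound on the weight norm plays the role of a floor on T: the effective energies cannot diverge.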
REM: a toy model for real Glasses

the glass transition is not well understood
but it is believed that entropy collapse ‘drives’ the glass transition
what is a real (structural) Glass ?

Sand + Fire = Glass
all liquids can be made into glasses
if we cool them fast enough
the glass transition is not a normal phase transition
not the melting point
arrangement of atoms is amorphous; not completely random
different cooling rates produce different glassy states
universal phenomena; not universal physics
molecular details affect the thermodynamics
REM: the Glass Transition

Entropy collapses when T <~ Tc
Phase Diagram: entropy density, energy density, free energy density
https://web.stanford.edu/~montanar/RESEARCH/BOOK/partB.pdf
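The three densities in the phase diagram have simple closed forms in the REM, sketched below. This assumes the convention that the 2^N energies are Gaussian with variance N/2, so that Tc = 1 / (2 sqrt(ln 2)); other texts scale the energies differently and get a different constant.

```python
import numpy as np

T_C = 1.0 / (2.0 * np.sqrt(np.log(2.0)))  # glass temperature, about 0.60


def rem_densities(temperature):
    """Return (entropy, energy, free energy) per spin for the REM at temperature T."""
    if temperature > T_C:
        energy = -1.0 / (2.0 * temperature)
        entropy = np.log(2.0) - 1.0 / (4.0 * temperature ** 2)
    else:
        # Frozen (glass) phase: the system sits in the lowest-energy states.
        energy = -np.sqrt(np.log(2.0))
        entropy = 0.0
    return entropy, energy, energy - temperature * entropy
```

The entropy density falls continuously to zero at Tc and stays there, which is the entropy collapse referred to above; below Tc the energy and free energy are both pinned at the ground state value.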
REM: Dynamics on the Energy Landscape

let us assume some states trap the solver for some time;
of course, there is a great effort to design solvers that can avoid traps
Energy Landscapes: and Protein Folding 

let us assume some states trap the solver in state E(j) for a short time
and the transitions E(j) -> E(j-1) are governed by finite, reversible transitions
(i.e. SGD oscillates back and forth for a while)
classic result(s): for T near the glass Temp (Tc)
the traversal times are slower than exponential !
in a physical system, like a protein or polymer,
it would take longer than the known lifetime of the universe
to find the ground (folded) state
Protein Folding: the Levinthal Paradox 

folding could take longer than the known lifetime of the universe ?
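The back-of-the-envelope arithmetic behind the paradox can be sketched as follows. All numbers are illustrative assumptions in the spirit of Levinthal's argument (chain length, conformations per residue, sampling rate), not values from the slides.

```python
residues = 100                 # assumed chain length
states_per_residue = 3         # assumed conformations per residue
samples_per_second = 1e13      # assumed sampling rate
seconds_per_year = 3.156e7
age_of_universe_years = 1.38e10

# Exhaustive search visits every conformation once.
conformations = states_per_residue ** residues
search_years = conformations / samples_per_second / seconds_per_year
ratio_to_universe_age = search_years / age_of_universe_years
```

Under these assumptions an unguided search exceeds the age of the universe by many orders of magnitude, so real folding cannot be a random walk over a flat, glassy landscape.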
Protein Folding: around the Levinthal Paradox
http://arxiv.org/pdf/cond-mat/9904060v2.pdf
Old analogy between Protein folding and Hopfield Associative Memories
Natural pattern recognition could
• use a mechanism with a glass Temp (Tc) that is as low as possible
• avoid the glass transition entirely, via energetics
Nature (i.e. folding) cannot operate this way !

Spin Glasses: Minimizing Frustration

http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Spin Glasses: vs Disordered FerroMagnets

http://arxiv.org/pdf/cond-mat/9904060v2.pdf
the Spin Glass of Minimal Frustration 

REM + strongly correlated ground state = no glass transition
https://arxiv.org/pdf/1312.7283.pdf
Training a model induces an energy gap, with few local minima
http://arxiv.org/pdf/1312.0867v1.pdf
Energy Funnels: Entropy vs Energy 

there is a tradeoff between Energy and Entropy minimization
Energy Landscape Theory of Protein Folding
Avoids the glass transition by having more favorable energetics
Levinthal paradox: glassy surface, vanishing gradients
Energy Landscape Theory of Protein Folding: funneled landscape, rugged convexity, energy / entropy tradeoff
RBMs: Entropy Energy Tradeoff
RBM on MNIST
Aligned for comparison: entropy - 200
Entropy drops off much faster than total Free Energy
Dark Knowledge: an Energy Funnel ?

MLP on MNIST: 10,000 test cases, 10 classes
784 -> 800 -> 800 -> 10: 146 errors
Distilled 784 -> 800 -> 800 -> 10: 99 errors
fit to ensemble soft-max probabilities
same entropy (capacity); better loss function
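A minimal sketch of what "fit to ensemble soft-max probabilities" means as a loss: the student is trained on the cross-entropy against the teacher's temperature-softened probabilities rather than against hard labels. The random logits stand in for real teacher and student outputs, and the temperature 4.0 is an arbitrary assumption; the slides do not give the training details.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_target_loss(student_logits, teacher_logits, temperature):
    """Mean cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))


rng = np.random.default_rng(0)
teacher_logits = rng.normal(scale=3.0, size=(4, 10))   # 4 examples, 10 classes
student_logits = rng.normal(scale=3.0, size=(4, 10))
loss = soft_target_loss(student_logits, teacher_logits, temperature=4.0)
```

The architecture, and hence the capacity, of the student is unchanged; only the target distribution differs, which is the sense of "better loss function" above.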
Adversarial Deep Nets: an Energy Funnel ?

Discriminator learns a complex loss function
Generator: fake data
Discriminator: fake vs real ?
http://soumith.ch/eyescream/
Summary

Random Energy Model (REM): simpler theoretical model
Glass Transition: temperature ~ weight constraints
extending REM: Spin Glass of Minimal Frustration
Funneled Energy Landscapes
possible examples: Dark Knowledge, Adversarial Deep Nets
Biological Classification BioHack (3).pdf
 

CC mmds talk 2106

  • 1. calculation | consulting why deep learning works: perspectives from theoretical chemistry (TM) c|c (TM) charles@calculationconsulting.com
  • 2. calculation|consulting MMDS 2016 why deep learning works: perspectives from theoretical chemistry (TM) charles@calculationconsulting.com
  • 3. Who Are We? Dr. Charles H. Martin, PhD. University of Chicago, Chemical Physics. NSF Fellow in Theoretical Chemistry. Over 10 years experience in applied Machine Learning. Developed ML algos for Demand Media; the first $1B IPO since Google. Tech: Aardvark (now Google), eHow, GoDaddy, … Wall Street: BlackRock. Fortune 500: Big Pharma, Telecom, eBay. www.calculationconsulting.com charles@calculationconsulting.com
  • 4. Data Scientists are Different: not all techies are the same. theoretical physics / machine learning specialist; experimental physics / data scientist; engineer / software, browser tech, dev ops, …
  • 5. Problem: How can SGD possibly work? Aren’t Neural Nets non-convex?! Can Spin Glass models suggest why? What other models are out there? (figure labels: expected, observed ?)
  • 6. Outline: Random Energy Model (REM); Temperature, regularization and the glass transition; extending REM: Spin Glass of Minimal Frustration; protein folding analogy: Funneled Energy Landscapes; example: Dark Knowledge; Recent work: Spin Glass models for Deep Nets
  • 7. Warning: condensed matter theory is about qualitative analogies. We may seek a toy model, a mean field theory, a phenomenological description.
  • 8. What problem is Deep Learning solving? minimize cross-entropy https://www.ics.uci.edu/~pjsados.pdf
  • 9. Problem: What is a good theoretical model for deep networks? p-spin spherical glass (LeCun … 2015). L: Hamiltonian (Energy function); X: Gaussian random variables; w: real valued (spins), spherical constraint; H >= 3 (p). Can be solved analytically, simulated easily.
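For reference, the Hamiltonian named on this slide can be written out. The form below is a sketch of the spherical spin-glass loss as it is usually stated for the "LeCun … 2015" line of work; the symbol \Lambda (the number of weights) and the normalization prefactor are not on the slide and are assumptions here. H, the network depth, plays the role of p.

```latex
L_{\Lambda,H}(w) \;=\; \frac{1}{\Lambda^{(H-1)/2}}
  \sum_{i_1,\dots,i_H=1}^{\Lambda} X_{i_1,\dots,i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H},
\qquad X_{i_1,\dots,i_H} \sim \mathcal{N}(0,1),
\qquad \frac{1}{\Lambda}\sum_{i=1}^{\Lambda} w_i^{2} = 1 .
```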
  • 10. What is a spin glass? Frustration: constraints that cannot be satisfied. J = X = weights; S = w = spins. Energetically: all spins should be paired.
  • 11. why p-spin spherical glass? crudely: deep networks (effectively) have no local minima! Figure labels: local minima, k = 1 critical points; floor / ground state; k = 2 critical points; k = 3 critical points. The critical points are ordered saddle points.
  • 12. why p-spin spherical glass? crudely: deep networks (effectively) have no local minima! http://cims.nyu.edu/~achoroma/NonFlash/Papers/PAPER_AMMGY.pdf
  • 13. any local minima will do; the ground state is a state of overtraining. Figure labels: good generalization, overtraining. Early Stopping: to avoid the ground state?
  • 14. it’s easy to find the ground state; it’s hard to generalize? Early Stopping: to avoid the ground state?
  • 15. Current Interpretation: •finding the ground state is easy (sic); generalizing is hard •finding the ground state is irrelevant: any local minima will do •the ground state is a state of overtraining
  • 16. recent p-spin spherical glass results. actually: recent results (2013) on the behavior (distribution of critical points, concentration of the means) of an isotropic random function on a high dimensional manifold. require: the variables actually concentrate on their means; the weights are drawn from an isotropic random function. related to: old results, TAP solutions (1977); # critical points ~ TAP complexity. avoid local minima?: increase Temperature. harder problem: low Temp behavior of spin glass.
  • 17. What problem is Deep Learning solving? minimize cross-entropy of output layer. entropic effects: not just min energy, more like min free energy (divergence). Statistical Physics and Information Theory: Neri Merhav. i.e. variational auto encoders
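The step from "cross-entropy" to "free energy (divergence)" can be made explicit with two standard identities. In this sketch p is the target distribution and q the model's output distribution, and \langle E \rangle, S and Z are the mean energy, entropy and partition function of a Boltzmann distribution (with k_B = 1); none of these symbols appear on the slide.

```latex
H(p, q) \;=\; -\sum_x p(x)\,\ln q(x) \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q),
\qquad
F \;=\; \langle E \rangle - T\,S \;=\; -T \ln Z .
```

Since H(p) does not depend on the model, minimizing the cross-entropy over q is the same as minimizing the divergence, and the free energy shows the same structure of an energy term traded against an entropy term.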
  • 18. What problem is Deep Learning solving? Restricted Boltzmann Machine: can define free energy directly. A Practical Guide to Training Restricted Boltzmann Machines, Hinton
  • 19. What problem is Deep Learning solving? Restricted Boltzmann Machine: trade off between energy and entropy; min free energy directly. A Practical Guide to Training Restricted Boltzmann Machines, Hinton
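As a rough illustration of "define free energy directly", here is a minimal NumPy sketch for a binary RBM at T = 1. It uses the closed form F(v) = -a·v - Σ_j log(1 + exp(x_j)) with x_j = b_j + Σ_i v_i W_ij, splits F into a mean energy and an entropy under p(h|v), and checks both against a brute-force sum over hidden states. The sizes, the random parameters and all function names are hypothetical and are not taken from the slides.

```python
import itertools
import numpy as np

def hidden_inputs(v, W, b):
    """Total input x_j = b_j + sum_i v_i W_ij to each hidden unit."""
    return b + v @ W

def rbm_free_energy(v, W, a, b):
    """Closed-form free energy of a visible vector v (binary hidden units, T = 1)."""
    x = hidden_inputs(v, W, b)
    return -(v @ a) - np.sum(np.logaddexp(0.0, x))

def energy_entropy_split(v, W, a, b):
    """Mean energy and entropy under p(h | v); F(v) = mean_energy - entropy."""
    x = hidden_inputs(v, W, b)
    p = 1.0 / (1.0 + np.exp(-x))          # p(h_j = 1 | v)
    mean_energy = -(v @ a) - np.sum(p * x)
    entropy = -np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return mean_energy, entropy

def rbm_free_energy_bruteforce(v, W, a, b):
    """F(v) = -log sum_h exp(-E(v, h)), enumerating every hidden state."""
    neg_energies = []
    for bits in itertools.product([0.0, 1.0], repeat=len(b)):
        h = np.array(bits)
        neg_energies.append(v @ a + h @ b + v @ W @ h)
    return -np.logaddexp.reduce(np.array(neg_energies))

# Hypothetical toy RBM: 6 visible units, 4 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(6, 4))
a = rng.normal(size=6)
b = rng.normal(size=4)
v = rng.integers(0, 2, size=6).astype(float)

f_closed = rbm_free_energy(v, W, a, b)
f_brute = rbm_free_energy_bruteforce(v, W, a, b)
mean_energy, entropy = energy_entropy_split(v, W, a, b)
print(f_closed, f_brute, mean_energy - entropy)
```

The three printed values should agree, which is the energy/entropy trade-off of the slide in its simplest form: lowering F can come from lowering the mean energy or from keeping the hidden-unit entropy high.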
  • 20. A related approach: Random Energy Model (REM); infinite limit of p-spin spherical glass. https://web.stanford.edu/~montanar/RESEARCH/BOOK/partB.pdf
  • 21. Random Energy Model (REM): ground state is governed by Extreme Value Statistics; old result from protein folding theory. http://guava.physics.uiuc.edu/~nigel/courses/563/essays2000/pogorelov.pdf http://scitation.aip.org/content/aip/journal/jcp/111/14/10.1063/1.479951
  • 22. REM: What is Temperature? We can use statistical mechanics to analyze known algorithms (I don’t mean in the traditional sense of algorithmic analysis): take Ej as the objective = loss function + regularizer; study Z (the partition function): form a mean field theory; take limits N -> inf, T -> 0
  • 23. REM: What is Temperature? let E(T) be the effective energy: E(T) = E/T ~ sum of weights*activations. As T -> 0, the effective energies E(T) diverge; weights explode. Temperature is a proxy for weight constraints. T sets the Energy Scale.
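The claim that temperature is a proxy for weight constraints follows from one line of algebra. In this sketch Z is the partition function over states j, and w_i, a_i stand for the weights and activations mentioned on the slide; the notation is an assumption, not the author's.

```latex
Z(T) \;=\; \sum_j e^{-E_j/T},
\qquad
\frac{E}{T} \;\sim\; \frac{1}{T}\sum_i w_i\, a_i \;=\; \sum_i \left(\frac{w_i}{T}\right) a_i .
```

Lowering T at fixed weights is therefore equivalent to rescaling every weight by 1/T at T = 1, so the limit T -> 0 corresponds to unbounded weights.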
  • 24. Temperature: as Weight Constraints. •traditional weight regularization •max norm constraints (i.e. w/dropout) •batch norm regularization (2015). We avoid situations when the weights explode; in deep networks, we temper the weights and the distribution of the activations (i.e. local entropy).
  • 25. REM: a toy model for real Glasses. The glass transition is not well understood, but it is believed that entropy collapse ‘drives’ the glass transition.
  • 26. what is a real (structural) Glass? Sand + Fire = Glass
  • 27. what is a real (structural) Glass? all liquids can be made into glasses if we cool them fast enough. The glass transition is not a normal phase transition; not the melting point. Arrangement of atoms is amorphous; not completely random. Different cooling rates produce different glassy states. Universal phenomena; not universal physics: molecular details affect the thermodynamics.
  • 28. REM: the Glass Transition. Entropy collapses when T <~ Tc. Phase Diagram: entropy density, energy density, free energy density. https://web.stanford.edu/~montanar/RESEARCH/BOOK/partB.pdf
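The entropy collapse can be seen numerically in a small REM sample. The sketch below assumes the standard REM normalization (2^N independent Gaussian energy levels with variance N/2), for which the entropy density is ln 2 - 1/(4T^2) above Tc = 1/(2 sqrt(ln 2)) and zero below it. N = 16, the seed and the temperature grid are arbitrary choices for illustration, and finite-size effects smear the transition.

```python
import numpy as np

def rem_entropy_density(energies, T, n_spins):
    """Gibbs entropy per spin, S/N = (ln Z + <E>/T) / N, for one set of energy levels."""
    beta = 1.0 / T
    log_w = -beta * energies                 # log Boltzmann weights
    log_z = np.logaddexp.reduce(log_w)       # numerically stable ln Z
    p = np.exp(log_w - log_z)                # Boltzmann probabilities
    mean_e = np.sum(p * energies)
    return (log_z + beta * mean_e) / n_spins

n_spins = 16                                 # assumed toy size: 2**16 levels
rng = np.random.default_rng(0)
energies = rng.normal(0.0, np.sqrt(n_spins / 2.0), size=2 ** n_spins)

t_c = 1.0 / (2.0 * np.sqrt(np.log(2.0)))     # REM glass temperature, about 0.6
temps = np.array([0.1, 0.3, t_c, 1.0, 2.0, 5.0])
s_sim = np.array([rem_entropy_density(energies, T, n_spins) for T in temps])
s_theory = np.where(temps > t_c, np.log(2.0) - 1.0 / (4.0 * temps ** 2), 0.0)

for T, s, s0 in zip(temps, s_sim, s_theory):
    print(f"T = {T:5.2f}   simulated s = {s:6.3f}   N -> inf theory s = {s0:6.3f}")
```

Well above Tc the sampled entropy density tracks the analytic curve; well below Tc the Boltzmann weight condenses onto a handful of low-lying levels and the entropy per spin drops toward zero.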
  • 29. REM: Dynamics on the Energy Landscape. let us assume some states trap the solver for some time; of course, there is a great effort to design solvers that can avoid traps
  • 30. Energy Landscapes: and Protein Folding. let us assume some states trap the solver in state E(j) for a short time, and the transitions E(j) -> E(j-1) are governed by finite, reversible transitions (i.e. SGD oscillates back and forth for a while). classic result(s): for T near the glass Temp (Tc) the traversal times are slower than exponential! In a physical system, like a protein or polymer, it would take longer than the known lifetime of the universe to find the ground (folded) state.
  • 31. Protein Folding: the Levinthal Paradox. folding could take longer than the known lifetime of the universe?
  • 32. Protein Folding: around the Levinthal Paradox. Old analogy between Protein folding and Hopfield Associative Memories. Natural pattern recognition could • use a mechanism with a glass Temp (Tc) that is as low as possible • avoid the glass transition entirely, via energetics. Nature (i.e. folding) can not operate this way! http://arxiv.org/pdf/cond-mat/9904060v2.pdf
  • 33. Spin Glasses: Minimizing Frustration. http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
  • 34. Spin Glasses: Minimizing Frustration. http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
  • 35. Spin Glasses: vs Disordered FerroMagnets. http://arxiv.org/pdf/cond-mat/9904060v2.pdf
  • 36. the Spin Glass of Minimal Frustration. REM + strongly correlated ground state = no glass transition. https://arxiv.org/pdf/1312.7283.pdf
  • 37. the Spin Glass of Minimal Frustration. Training a model induces an energy gap, with few local minima. http://arxiv.org/pdf/1312.0867v1.pdf
  • 38. Energy Funnels: Entropy vs Energy. there is a tradeoff between Energy and Entropy minimization
  • 39. Energy Landscape Theory of Protein Folding. there is a tradeoff between Energy and Entropy minimization
  • 40. Energy Landscape Theory of Protein Folding: avoids the glass transition by having more favorable energetics. Levinthal paradox: glassy surface, vanishing gradients. Funneled landscape: rugged convexity, energy / entropy tradeoff.
  • 41. RBMs: Entropy Energy Tradeoff. RBM on MNIST. Aligned for comparison: entropy - 200. Entropy drops off much faster than total Free Energy.
  • 42. Dark Knowledge: an Energy Funnel? MLP on MNIST; 10,000 test cases, 10 classes. 784 -> 800 -> 800 -> 10, Distilled (fit to ensemble soft-max probabilities): 99 errors. 784 -> 800 -> 800 -> 10: 146 errors. Same entropy (capacity); better loss function.
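"Fit to ensemble soft-max probabilities" can be sketched as a cross-entropy between temperature-softened teacher and student outputs. This is a minimal NumPy illustration of that idea, not the setup behind the error counts above: the logits are random placeholders, and the temperature T = 2 and all function names are assumptions.

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Row-wise log soft-max of logits / T, computed stably."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def softmax(logits, T=1.0):
    """Row-wise soft-max at temperature T; larger T gives softer probabilities."""
    return np.exp(log_softmax(logits, T))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean cross-entropy between teacher and student soft-max outputs at temperature T."""
    p_teacher = softmax(teacher_logits, T)
    return -np.mean(np.sum(p_teacher * log_softmax(student_logits, T), axis=-1))

# Placeholder logits: 5 examples, 10 classes (as in MNIST).
rng = np.random.default_rng(0)
teacher_logits = rng.normal(scale=3.0, size=(5, 10))
student_logits = rng.normal(scale=3.0, size=(5, 10))

loss_random_student = distillation_loss(student_logits, teacher_logits)
loss_matched_student = distillation_loss(teacher_logits, teacher_logits)
peak_sharp = softmax(teacher_logits, T=1.0).max()
peak_soft = softmax(teacher_logits, T=5.0).max()
print(loss_random_student, loss_matched_student, peak_sharp, peak_soft)
```

A student that reproduces the teacher's logits attains the minimum of this loss (the teacher's own entropy), and raising T flattens the targets so that the small probabilities on wrong classes carry more weight in the fit.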
  • 43. Adversarial Deep Nets: an Energy Funnel? Discriminator learns a complex loss function. Generator: fake data. Discriminator: fake vs real? http://soumith.ch/eyescream/
  • 44. Summary: Random Energy Model (REM): simpler theoretical model; Glass Transition: temperature ~ weight constraints; extending REM: Spin Glass of Minimal Frustration; Funneled Energy Landscapes; possible examples: Dark Knowledge, Adversarial Deep Nets