Theoretical Deep Learning
Xiaohu Zhu
Cofounder & Chief Scientist
Why?
Reason 1
To understand things better and deeper
Reason 2
To devise more efficient algorithms
Reason 3
To connect with other solid theories and methods
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima? What characterizes a better optimum?
Generalization
Why do nets still generalize well when the number of parameters exceeds the number of data points?
Representation
The killer application of DL
Composite functions
Shallow nets: # of parameters grows exponentially with the input dimension
Deep nets: # of units grows linearly with the input dimension when the target function is composite
Deep learning performs worse on non-composite functions
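To make the exponential-versus-linear contrast concrete, here is a minimal sketch. It assumes the approximation bounds reported in reference 3 (Poggio et al., 2017) for functions with a binary-tree compositional structure, with all constants dropped: roughly eps^(-n/m) units for a shallow net and (n - 1) * eps^(-2/m) for a deep net. The function names, the accuracy eps = 0.1, and the smoothness m = 2 are illustrative choices, not values from the slides.

```python
# Rough unit counts needed to reach accuracy eps on an n-dimensional target,
# assuming the bounds of Poggio et al. (2017) with constants dropped.
# n = input dimension, m = smoothness of the target (assumed values).

def shallow_units(n, eps, m=2):
    """Shallow net: the count grows exponentially with the dimension n."""
    return eps ** (-n / m)

def deep_units(n, eps, m=2):
    """Deep net on a binary-tree composite function: linear growth in n."""
    return (n - 1) * eps ** (-2 / m)

for n in (2, 4, 8, 16):
    print(n, round(shallow_units(n, 0.1)), round(deep_units(n, 0.1)))
```

Doubling the dimension roughly squares the shallow count but only doubles the deep one, which is the sense in which depth can avoid the curse of dimensionality for composite targets.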
Optimization 1
Linear equations: # of unknowns > # of equations ⇒ more than one solution (when the system is consistent)
Neural nets for image classification: # of parameters (~millions) ≫ # of samples (~60,000, as in CIFAR-10; ImageNet has ~1.2 million training images) ⇒ overparameterization
Bézout's theorem: # of solutions > # of atoms in the universe ⇒ degenerate: each solution corresponds to an infinite solution set
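The linear case can be checked by hand. The sketch below uses a made-up system of 2 equations in 3 unknowns (the matrix, right-hand side, and names are illustrative, not from the slides) and shows that one particular solution plus any multiple of a null-space direction is again a solution, so the solution set is infinite.

```python
# Underdetermined linear system A x = b: 2 equations, 3 unknowns.
# All numbers are illustrative.
A = [[1.0, 1.0, 1.0],
     [1.0, 2.0, 3.0]]
b = [6.0, 14.0]

x_particular = [1.0, 2.0, 3.0]  # one solution of A x = b
null_dir = [1.0, -2.0, 1.0]     # direction with A @ null_dir = 0

def residual(x):
    """Return A x - b, computed row by row."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi
            for row, bi in zip(A, b)]

def solution(t):
    """Shift the particular solution by t along the null direction."""
    return [p + t * d for p, d in zip(x_particular, null_dir)]

for t in (-2.0, 0.0, 0.5, 10.0):
    print(t, solution(t), residual(solution(t)))
```

Every value of t gives a zero residual, which is the linear analogue of the degenerate solution sets mentioned for the polynomial case.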
Optimization 2
Overparameterization: neural nets have an infinite number of global optima, which form a flat (plateau) valley in the loss landscape.
SGD stays in this degenerate valley with high probability
Good news: easy to optimize; global optima exist, are numerous, and are easy to find with optimization algorithms
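A toy illustration of such a valley, as a sketch under simplifying assumptions: the two-parameter model w1 * w2 fitted to the single target 1 is overparameterized, and its global minima form the whole curve w1 * w2 = 1. The code uses plain full-batch gradient descent rather than SGD, and the loss, learning rate, and starting points are illustrative choices, not taken from the slides.

```python
# Toy overparameterized problem: fit w1 * w2 to the target 1.
# Every point on the curve w1 * w2 = 1 is a global minimum with zero loss.

def loss(w1, w2):
    return (w1 * w2 - 1.0) ** 2

def gradient_descent(w1, w2, lr=0.05, steps=2000):
    """Plain gradient descent (not SGD) on the toy loss."""
    for _ in range(steps):
        err = w1 * w2 - 1.0
        g1, g2 = 2.0 * err * w2, 2.0 * err * w1  # partial derivatives
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

starts = [(0.5, 0.5), (2.0, 0.1), (0.3, 1.5)]
ends = [gradient_descent(*s) for s in starts]
for s, e in zip(starts, ends):
    print(s, "->", e, "loss", loss(*e))
```

Each start reaches zero loss but ends at a different point of the valley, which matches the claim that global optima are many and easy to find.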
Generalization 1
Overparameterization: good for optimization, bad for generalization
Deep learning: the tasks pair reasonably well with the loss functions
Srebro's work: cross entropy wins, i.e., overfitting the cross-entropy loss on the test set ⇏ overfitting the classification error
Dynamical-system view (differential equations): near a global minimum, a deep net behaves like a linear network
Generalization 2
Srebro's work: cross entropy wins, i.e., overfitting the cross-entropy loss on the test set ⇏ overfitting the classification error
Cross entropy ∈ exponential-type losses
Asymmetry ?⇒ special property
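The gap between the loss and the classification error can be seen in a small numeric sketch. It uses the binary logistic loss (the two-class case of cross entropy) on four hypothetical held-out margins y * f(x), one of them negative. Scaling the network output, as happens when weights keep growing during training, changes the loss but not which examples are misclassified. The margins and scales are invented for illustration.

```python
import math

# Hypothetical held-out margins y * f(x); a negative margin is a mistake.
margins = [2.0, 1.5, -0.5, 1.0]

def logistic_loss(scale):
    """Mean binary cross entropy after multiplying the outputs by scale."""
    return sum(math.log1p(math.exp(-scale * m)) for m in margins) / len(margins)

def error_rate(scale):
    """Fraction of misclassified examples; unaffected by a positive scale."""
    return sum(1 for m in margins if scale * m <= 0) / len(margins)

for scale in (1.0, 3.0, 10.0):
    print(scale, round(logistic_loss(scale), 4), error_rate(scale))
```

The held-out loss at scale 10 is several times larger than at scale 1, while the error rate stays at one mistake in four, so a rising test cross entropy does not by itself mean the classifier got worse.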
What's more?
Plateau optimum ⇒ better generalization?
Overfitting? Look out!
Do we need a prior?
Is brain research useful for DL?
References
1. Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1-49.
2. Neyshabur, B., Tomioka, R., Salakhutdinov, R., & Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071.
3. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5), 503-519.
4. Liao, Q., & Poggio, T. (2017). Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning. arXiv preprint arXiv:1703.09833.
5. Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., & Poggio, T. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD. arXiv preprint arXiv:1801.02254.
6. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., ... & Mhaskar, H. (2017). Theory of Deep Learning III: Explaining the Non-overfitting Puzzle. arXiv preprint arXiv:1801.00173.
7. Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N., & Poggio, T. (2017). Theory of Deep Learning III: Generalization Properties of SGD. Center for Brains, Minds and Machines (CBMM).
8. Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.
9. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.
Thanks
