Deep Learning and Optimization Methods
Stefan Kühn
Join me on XING
Data Science Meetup Hamburg - July 27th, 2017
Stefan Kühn (XING) Deep Optimization 27.07.2017 1 / 26
Contents
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Deep Learning
Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can
approximate any continuous function on compact subsets of $\mathbb{R}^n$
Questions:
Why do we need deep learning at all?
theoretic result
approximation by piecewise constant functions (not what you might
want for classification/regression)
Why are deep nets harder to train than shallow nets?
More parameters to be learned by training?
More hyperparameters to be set before training?
Numerical issues?
Disclaimer: ideas stolen from Martens, Sutskever, Bengio et al. and many more
Example: RNNs
Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but
extremely hard to train (somewhat less hard for LSTMs/GRUs)
Main Advantages:
Qualitatively: Flexible and rich model class
Practically: Gradients easily computed by Backpropagation (BPTT)
Main Problems:
Qualitatively: Learning long-term dependencies
Practically: Gradient-based methods struggle when separation between
input and target output is large
Example: RNNs
Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states
Indicators
Vanishing/exploding gradients
Internal Covariate Shift
Remedies
ReLU
’Careful’ initialization
Small stepsizes
(Recurrent) Batch Normalization
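A minimal sketch of why gradients vanish or explode, assuming a hypothetical scalar linear recurrence h_t = w * h_{t-1} (an illustration, not a model from the talk). The derivative of h_T with respect to h_0 is w^T, so it shrinks or grows geometrically with the sequence length:

```python
def gradient_through_time(w, steps):
    """Return d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


# The recurrent weight w and the sequence length 100 are illustrative choices.
for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, 100))
```

For |w| < 1 the gradient vanishes, for |w| > 1 it explodes. Only w = 1 keeps the signal intact over long separations between input and target.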
Example: RNNs
Recurrent Neural Nets and LSTM
Schmidhuber/Hochreiter proposed changing the RNN architecture by adding
Long Short-Term Memory units
Vanishing/exploding gradients?
fixed linear dynamics, no longer problematic
Any questions open?
Gradient-based training works better with LSTMs
LSTMs can compensate one deficiency of Gradient-based learning but
is this the only one?
Most problems are related to specific numerical issues.
2 Training and Learning
Trade-offs between Optimization and Learning
Computational complexity becomes the limiting factor when one envisions
large amounts of training data. [Bottou, Bousquet]
Underlying Idea
Approximate optimization algorithms might be sufficient for learning
purposes. [Bottou, Bousquet]
Implications:
Small-scale: Trade-off between approximation error and estimation
error
Large-scale: Computational complexity dominates
Long story short:
The best optimization methods might not be the best learning
methods!
Empirical results
Empirical evidence for SGD being a better learner than optimizer.
RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks
3 The Toolbox of Optimization Methods
Gradient Descent
Minimize a given function $f$:
$$\min f(x), \quad x \in \mathbb{R}^n$$
Direction of steepest descent, the negative gradient:
$$d = -\nabla f(x)$$
Update in step $k$:
$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$
Properties:
always a descent direction, no test needed
locally optimal, globally convergent
works with inexact line search, e.g. Armijo's rule
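The update rule with an inexact line search can be sketched as follows. This is a minimal illustration, not the speaker's code: the function name `gradient_descent_armijo`, the constants `c` and `shrink`, and the quadratic test problem are all assumptions chosen for the example.

```python
def gradient_descent_armijo(f, grad, x0, alpha0=1.0, c=1e-4, shrink=0.5,
                            tol=1e-8, max_iter=10_000):
    """Steepest descent with a backtracking (Armijo) line search."""
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq ** 0.5 < tol:
            return x, k
        alpha, fx = alpha0, f(x)
        # Shrink the step until the sufficient-decrease condition holds.
        for _ in range(50):
            x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
            if f(x_new) <= fx - c * alpha * g_norm_sq:
                break
            alpha *= shrink
        x = x_new
    return x, max_iter


# Hypothetical test problem: f(x) = 0.5 * (x0^2 + 10 * x1^2), minimum at the origin.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad_f(x):
    return [x[0], 10.0 * x[1]]


x_min, iterations = gradient_descent_armijo(f, grad_f, [3.0, -2.0])
print(x_min, iterations)
```

No descent test is needed because the negative gradient is always a descent direction. The line search only has to find a step that is short enough.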
Stochastic Gradient Descent
Setting
$$f(x) := \sum_{i=1}^{m} f_i(x), \quad m = \text{number of training examples}$$
Choose $i$ and update in step $k$:
$$x_{k+1} = x_k - \alpha \nabla f_i(x_k)$$
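A minimal sketch of this update, assuming a hypothetical one-dimensional least-squares problem in which every f_i(x) = 0.5 * (a_i * x - b_i)^2 has the same minimizer. The data, the step size and the uniform sampling of i are illustrative choices, not taken from the talk.

```python
import random


def sgd(grad_i, m, x0, alpha=0.1, epochs=50, seed=0):
    """Plain SGD: pick one training example i per step and follow -grad f_i."""
    rng = random.Random(seed)
    x = x0
    for _ in range(epochs * m):
        i = rng.randrange(m)          # choose i uniformly at random
        x = x - alpha * grad_i(i, x)  # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x


# Hypothetical data: b_i = 2 * a_i, so every f_i is minimized at x = 2.
a = [0.5, 1.0, 1.5, 2.0]
b = [2.0 * ai for ai in a]


def grad_i(i, x):
    return a[i] * (a[i] * x - b[i])


x_hat = sgd(grad_i, m=len(a), x0=0.0)
print(x_hat)
```

Each step touches a single example, so its cost does not depend on m. On noisy data a constant step size would leave the iterates fluctuating around the minimizer instead of converging to it.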
Shortcomings of Gradient Descent
local: only local information used
especially: no curvature information used
greedy: prefers high curvature directions
scale invariant: no
James Martens, Deep learning via Hessian-free optimization
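The effect of ignoring curvature can be seen on a small example. The sketch below assumes a hypothetical diagonal quadratic with one low-curvature and one high-curvature direction. The step size 0.01 is chosen to suit the high-curvature direction.

```python
def gd_quadratic(eigenvalues, x0, alpha, steps):
    """Gradient descent on f(x) = 0.5 * sum(lam_j * x_j^2); returns the final iterate."""
    x = list(x0)
    for _ in range(steps):
        x = [xj - alpha * lam * xj for xj, lam in zip(x, eigenvalues)]
    return x


eigenvalues = [1.0, 100.0]  # low-curvature and high-curvature direction
x = gd_quadratic(eigenvalues, [1.0, 1.0], alpha=0.01, steps=100)
print(x)
```

The high-curvature coordinate is resolved almost immediately, while the low-curvature coordinate is still far from zero after 100 steps. A single global step size cannot serve both directions well.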
Momentum
Update in step $k$:
$$z_{k+1} = \beta z_k + \nabla f(x_k)$$
$$x_{k+1} = x_k - \alpha z_{k+1}$$
Properties for a quadratic convex objective:
condition number $\kappa$ effectively improves by a square root
stepsizes can be twice as long
rate of convergence $\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ instead of $\frac{\kappa - 1}{\kappa + 1}$
can diverge if $\beta$ is not properly chosen/adapted
Gabriel Goh, Why momentum really works
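The momentum update and its effect on a quadratic can be sketched as below. The test problem (eigenvalues 1 and 100, so $\kappa = 100$), the tolerance and the helper name `run` are assumptions for illustration. The step size and momentum formulas are the optimal choices for a quadratic as given in Goh's article.

```python
import math


def run(eigenvalues, alpha, beta, x0, tol=1e-6, max_iter=100_000):
    """Iterate z <- beta*z + grad f(x), x <- x - alpha*z on
    f(x) = 0.5 * sum(lam_j * x_j^2). beta = 0 gives plain gradient descent.
    Returns the number of steps until ||x|| < tol."""
    x = list(x0)
    z = [0.0] * len(x)
    for k in range(max_iter):
        if math.sqrt(sum(xj * xj for xj in x)) < tol:
            return k
        grad = [lam * xj for lam, xj in zip(eigenvalues, x)]
        z = [beta * zj + gj for zj, gj in zip(z, grad)]
        x = [xj - alpha * zj for xj, zj in zip(x, z)]
    return max_iter


lam_min, lam_max = 1.0, 100.0  # condition number kappa = 100
eig = [lam_min, lam_max]
x0 = [1.0, 1.0]

# Parameters that are optimal for this quadratic.
alpha_gd = 2.0 / (lam_min + lam_max)
alpha_mom = (2.0 / (math.sqrt(lam_min) + math.sqrt(lam_max))) ** 2
beta_mom = ((math.sqrt(lam_max) - math.sqrt(lam_min)) /
            (math.sqrt(lam_max) + math.sqrt(lam_min))) ** 2

steps_gd = run(eig, alpha_gd, 0.0, x0)
steps_mom = run(eig, alpha_mom, beta_mom, x0)
print(steps_gd, steps_mom)
```

With $\kappa = 100$ the contraction factors are $99/101 \approx 0.98$ for gradient descent and $9/11 \approx 0.82$ for momentum, which is why the momentum run needs far fewer steps.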
Momentum
DEMO
Adam
Properties:
combines several clever tricks (from Momentum, RMSprop, AdaGrad)
has some similarities to Trust Region methods
empirically proven - best in class (personal opinion)
Kingma, Ba, Adam: A Method for Stochastic Optimization
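For reference, the Adam update from the Kingma and Ba paper can be sketched in a few lines. The plain-list implementation, the quadratic test problem and the step size 0.01 are assumptions made for this example. The default values of `beta1`, `beta2` and `eps` follow the paper.

```python
import math


def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam update rule, written for plain Python lists."""
    x = list(x0)
    m = [0.0] * len(x)  # first-moment estimate (momentum-like)
    v = [0.0] * len(x)  # second-moment estimate (RMSprop-like)
    for t in range(1, steps + 1):
        g = grad(x)
        m = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
        v = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
        m_hat = [mj / (1.0 - beta1 ** t) for mj in m]  # bias correction
        v_hat = [vj / (1.0 - beta2 ** t) for vj in v]
        x = [xj - alpha * mh / (math.sqrt(vh) + eps)
             for xj, mh, vh in zip(x, m_hat, v_hat)]
    return x


# Hypothetical test problem: f(x) = 0.5 * (x0^2 + 100 * x1^2).
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad_f(x):
    return [x[0], 100.0 * x[1]]


x_one = adam(grad_f, [1.0, 1.0], alpha=0.01, steps=1)
x_end = adam(grad_f, [1.0, 1.0], alpha=0.01, steps=1000)
print(x_one, x_end, f(x_end))
```

In the first step both coordinates move by about `alpha`, although their gradients differ by a factor of 100. This per-coordinate scaling of the gradient is what makes the step length largely independent of the gradient magnitude.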
SGD, Momentum and more
DEMO
L-BFGS and Nonlinear CG
Observations so far:
The better the method, the more parameters to tune.
All better methods try to incorporate curvature information.
Why not do so directly?
L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian
and scales gradient accordingly.
Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation
of the function.
No surprise: They also work with minibatches.
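How L-BFGS builds and applies its inverse-Hessian approximation can be sketched with the standard two-loop recursion. This is a simplified full-batch illustration under stated assumptions: a plain Armijo backtracking line search instead of a Wolfe search, a memory of 5 pairs, and a hypothetical quadratic test problem. Nonlinear CG and minibatching are not shown.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def lbfgs_direction(g, s_hist, y_hist):
    """Two-loop recursion: returns -H*g, with H the implicit inverse-Hessian estimate."""
    q = list(g)
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):  # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        q = [qi - a * yi for qi, yi in zip(q, y)]
        alphas.append((a, rho))
    if s_hist:  # initial scaling gamma * I from the newest pair
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    for (a, rho), s, y in zip(reversed(alphas), s_hist, y_hist):  # oldest pair first
        b = rho * dot(y, q)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def lbfgs(f, grad, x0, memory=5, tol=1e-8, max_iter=500):
    x, g = list(x0), grad(x0)
    s_hist, y_hist = [], []
    for k in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            return x, k
        d = lbfgs_direction(g, s_hist, y_hist)
        step, fx, slope = 1.0, f(x), dot(g, d)
        for _ in range(50):  # backtracking (Armijo) line search
            x_new = [xi + step * di for xi, di in zip(x, d)]
            if f(x_new) <= fx + 1e-4 * step * slope:
                break
            step *= 0.5
        g_new = grad(x_new)
        s = [u - w for u, w in zip(x_new, x)]
        y = [u - w for u, w in zip(g_new, g)]
        if dot(s, y) > 1e-10 * dot(y, y):  # keep only pairs with positive curvature
            s_hist = (s_hist + [s])[-memory:]
            y_hist = (y_hist + [y])[-memory:]
        x, g = x_new, g_new
    return x, max_iter


# Hypothetical test problem: f(x) = 0.5 * (x0^2 + 100 * x1^2).
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def grad_f(x):
    return [x[0], 100.0 * x[1]]


x_min, iters = lbfgs(f, grad_f, [1.0, 1.0])
print(x_min, iters)
```

The Hessian is never formed. Only the most recent differences of iterates and gradients are stored, which keeps memory linear in the number of parameters.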
Empirical results
Empirical evidence for better optimizers being better learners.
MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning
Truncated Newton: Hessian-Free Optimization
Main ideas:
Approximate not Hessian H, but matrix-vector product Hd.
Use finite differences instead of exact Hessian.
Use damping.
Use Linear CG method for solving quadratic approximation.
Use clever mini-batch strategy for large data sets.
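The first four ideas can be sketched together in a few lines. This is a simplified illustration, not Martens' algorithm: the damping is held fixed instead of being adapted, the full gradient is used instead of mini-batches, and the function names and the convex test function are assumptions for this example.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hessian_vector_product(grad, x, d, eps=1e-6):
    """Finite-difference approximation H d ~ (grad f(x + h*d) - grad f(x)) / h."""
    norm = dot(d, d) ** 0.5
    if norm == 0.0:
        return [0.0] * len(d)
    h = eps / norm  # scale the perturbation to the length of d
    g0 = grad(x)
    g1 = grad([xi + h * di for xi, di in zip(x, d)])
    return [(u - w) / h for u, w in zip(g1, g0)]


def conjugate_gradient(matvec, b, iters=50, tol=1e-8):
    """Linear CG for matvec(p) = b with a symmetric positive definite operator."""
    p = [0.0] * len(b)
    r = list(b)  # residual b - A p for the start value p = 0
    d = list(r)
    rr, bb = dot(r, r), dot(b, b)
    for _ in range(iters):
        if rr <= tol * tol * bb:
            break
        Ad = matvec(d)
        step = rr / dot(d, Ad)
        p = [pi + step * di for pi, di in zip(p, d)]
        r = [ri - step * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return p


def hessian_free_step(grad, x, damping=1e-3):
    """One truncated-Newton step: solve (H + damping*I) p = -grad f(x) by CG."""
    g = grad(x)

    def damped_matvec(d):
        Hd = hessian_vector_product(grad, x, d)
        return [hj + damping * dj for hj, dj in zip(Hd, d)]

    p = conjugate_gradient(damped_matvec, [-gj for gj in g])
    return [xi + pi for xi, pi in zip(x, p)]


# Hypothetical convex test function f(x) = 0.5*x0^2 + 0.25*x0^4 + 50*x1^2.
def grad_f(x):
    return [x[0] + x[0] ** 3, 100.0 * x[1]]


x = [1.0, 1.0]
for _ in range(8):
    x = hessian_free_step(grad_f, x)
print(x)
```

Each CG iteration costs two gradient evaluations and never needs the Hessian as a matrix. This is what makes the approach feasible for models with many parameters.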
Empirical test on pathological problems
Main results:
The addition problem is known to be effectively impossible for
gradient descent; HF solved it.
Basic RNN cells are used, no specialized architectures (LSTMs etc.).
Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997)
4 Takeaways
Summary
In the long run, the biggest bottleneck will be the sequential parts of an
algorithm. That’s why the number of iterations needs to be small. SGD and
its successors tend to need many more iterations, and they cannot benefit
as much from higher parallelism (GPUs).
But whatever you do/prefer/choose:
At least use successors of SGD: Momentum, Adam etc.
Look for generic approaches instead of more and more specialized and
manually fine-tuned solutions.
Key aspects:
Initialization
Adaptive choice of stepsizes/momentum/...
Scaling of the gradient
Resources
Overview of Gradient Descent methods
Why momentum really works
Adam - A Method for Stochastic Optimization
Andrew Ng et al. about L-BFGS and CG outperforming SGD
Lecture Slides Neural Networks for Machine Learning - Hinton et al.
On the importance of initialization and momentum in deep learning
Data-Science-Blog: Summary article in preparation (Stefan Kühn)
The Neural Network Zoo
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Deep Learning and Optimization Methods

  • 1. Deep Learning and Optimization Methods Stefan Kühn Join me on XING Data Science Meetup Hamburg - July 27th, 2017 Stefan Kühn (XING) Deep Optimization 27.07.2017 1 / 26
  • 2. Contents: 1 Training Deep Networks; 2 Training and Learning; 3 The Toolbox of Optimization Methods; 4 Takeaways
  • 4. Deep Learning. Neural Networks, Universal Approximation Theorem: a feed-forward neural net with one hidden layer and a finite number of parameters can approximate any continuous function on compact subsets of $\mathbb{R}^n$. Questions: Why do we need deep learning at all? The theorem is a purely theoretical result, and the approximation is by piecewise constant functions (not what you might want for classification/regression). Why are deep nets harder to train than shallow nets? More parameters to be learned by training? More hyperparameters to be set before training? Numerical issues? (Disclaimer: ideas stolen from Martens, Sutskever, Bengio et al. and many more.)
  • 5. Example: RNNs. Recurrent neural nets are extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less hard for LSTMs/GRUs). Main advantages: qualitatively, a flexible and rich model class; practically, gradients are easily computed by backpropagation through time (BPTT). Main problems: qualitatively, learning long-term dependencies; practically, gradient-based methods struggle when the separation between input and target output is large.
  • 6. Example: RNNs. Recurrent neural nets have a highly volatile relationship between parameters and hidden states. Indicators: vanishing/exploding gradients; internal covariate shift. Remedies: ReLU; 'careful' initialization; small stepsizes; (recurrent) batch normalization.
  • 7. Example: RNNs. Recurrent Neural Nets and LSTM: Schmidhuber/Hochreiter proposed a change of the RNN architecture by adding Long Short-Term Memory units. Vanishing/exploding gradients? Fixed linear dynamics, no longer problematic. Any questions open? Gradient-based training works better with LSTMs. LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one? Most problems are related to specific numerical issues.
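To make the vanishing/exploding gradient indicator concrete, here is a minimal sketch that is not part of the talk. It assumes a purely linear recurrence h_t = W h_{t-1}, so each backward step multiplies the gradient by W^T and its norm scales with the chosen radius raised to the number of steps. The helper name `backprop_norms`, the matrix size and the radii 0.9 and 1.1 are illustrative assumptions.

```python
import numpy as np

def backprop_norms(radius, steps=50, size=8, seed=0):
    """Norm of a gradient pushed back through `steps` linear recurrent steps.

    Assumption for illustration: h_t = W h_{t-1} without a nonlinearity,
    so every backward step multiplies the gradient by W^T.
    """
    rng = np.random.default_rng(seed)
    # Random orthogonal matrix scaled so that all singular values equal `radius`.
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = radius * q
    grad = np.ones(size) / np.sqrt(size)  # unit-norm gradient at the last step
    norms = []
    for _ in range(steps):
        grad = w.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms

vanishing = backprop_norms(0.9)  # norm shrinks like 0.9**t
exploding = backprop_norms(1.1)  # norm grows like 1.1**t
print(f"after 50 steps: {vanishing[-1]:.2e} (radius 0.9), {exploding[-1]:.2e} (radius 1.1)")
```

Real RNNs add nonlinearities and input-dependent Jacobians, so this only shows the basic multiplicative mechanism.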
  • 9. Trade-offs between Optimization and Learning. Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet] Underlying idea: approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet] Implications: Small-scale: trade-off between approximation error and estimation error. Large-scale: computational complexity dominates. Long story short: the best optimization methods might not be the best learning methods!
  • 10. Empirical results. Empirical evidence for SGD being a better learner than optimizer. RCV1, text classification; see e.g. Bottou, Stochastic Gradient Descent Tricks.
  • 12. Gradient Descent. Minimize a given function $f$: $\min_{x \in \mathbb{R}^n} f(x)$. Direction of steepest descent, the negative gradient: $d = -\nabla f(x)$. Update in step $k$: $x_{k+1} = x_k - \alpha \nabla f(x_k)$. Properties: always a descent direction, no test needed; locally optimal, globally convergent; works with inexact line search, e.g. Armijo's rule.
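The gradient descent update with an inexact line search can be sketched as follows. This is a minimal illustration and not code from the talk: the function name, the backtracking constants (`c`, `shrink`) and the quadratic test problem are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent_armijo(f, grad, x0, alpha0=1.0, c=1e-4, shrink=0.5,
                            tol=1e-6, max_iter=10_000):
    """Steepest descent with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        d = -g  # direction of steepest descent
        alpha = alpha0
        # Shrink the step until the sufficient-decrease (Armijo) condition holds:
        # f(x + alpha d) <= f(x) + c * alpha * g^T d
        while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
            alpha *= shrink
        x = x + alpha * d
    return x, max_iter

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x with minimizer A^{-1} b.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

x_min, iterations = gradient_descent_armijo(f, grad, x0=[5.0, -3.0])
print(x_min, iterations)
```

Because the negative gradient is always a descent direction, the backtracking loop terminates for any smooth function; no extra descent test is needed.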
  • 13. Stochastic Gradient Descent. Setting: $f(x) := \sum_i f_i(x)$, $\nabla f(x) := \sum_i \nabla f_i(x)$, $i = 1, \dots, m$, where $m$ is the number of training examples. Choose $i$ and update in step $k$: $x_{k+1} = x_k - \alpha \nabla f_i(x_k)$.
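A minimal sketch of this update, assuming a noise-free least-squares problem with $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$. The data, the stepsize and the uniform sampling of $i$ are illustrative choices, not taken from the talk.

```python
import numpy as np

def sgd(grad_i, x0, m, alpha, steps, seed=0):
    """Plain SGD: pick one example i at random and step along -grad f_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        i = rng.integers(m)           # choose i uniformly from 0..m-1
        x = x - alpha * grad_i(x, i)  # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x

# Illustrative data: b = A x_true exactly, so every f_i is minimized at x_true.
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((200, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true

def grad_i(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x_sgd = sgd(grad_i, x0=np.zeros(3), m=len(b), alpha=0.05, steps=5000)
print(x_sgd)
```

With noisy targets a constant stepsize would only reach a neighbourhood of the minimizer; the noise-free setup is what lets this example converge exactly.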
  • 14. Shortcomings of Gradient Descent. Local: only local information used; especially: no curvature information used. Greedy: prefers high curvature directions. Scale invariant: no. (James Martens, Deep learning via Hessian-free optimization)
  • 15. Momentum. Update in step $k$: $z_{k+1} = \beta z_k + \nabla f(x_k)$, $x_{k+1} = x_k - \alpha z_{k+1}$. Properties for a quadratic convex objective: condition number $\kappa$ improves by a square root; stepsizes can be twice as long; order of convergence $\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ instead of $\frac{\kappa - 1}{\kappa + 1}$; can diverge if $\beta$ is not properly chosen/adapted. (Gabriel Goh, Why momentum really works)
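The effect of the square-root improvement can be checked numerically. The sketch below is illustrative only: it assumes a diagonal quadratic with condition number 100 and uses the stepsize and momentum values that are optimal for a quadratic with known extreme eigenvalues (the formulas discussed in Goh's article); in practice these eigenvalues are not known.

```python
import numpy as np

def gradient_descent(grad, x0, alpha, steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

def momentum(grad, x0, alpha, beta, steps):
    x = np.asarray(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(steps):
        z = beta * z + grad(x)  # z_{k+1} = beta z_k + grad f(x_k)
        x = x - alpha * z       # x_{k+1} = x_k - alpha z_{k+1}
    return x

# Illustrative quadratic f(x) = 0.5 x^T A x with minimizer 0 and kappa = 100.
lam_min, lam_max = 1.0, 100.0
A = np.diag([lam_min, lam_max])

def grad(x):
    return A @ x

x0 = np.array([1.0, 1.0])

# Parameters tuned for this quadratic (assumes the eigenvalues are known).
alpha_gd = 2.0 / (lam_min + lam_max)
alpha_mom = (2.0 / (np.sqrt(lam_min) + np.sqrt(lam_max))) ** 2
beta_mom = ((np.sqrt(lam_max) - np.sqrt(lam_min))
            / (np.sqrt(lam_max) + np.sqrt(lam_min))) ** 2

err_gd = np.linalg.norm(gradient_descent(grad, x0, alpha_gd, steps=200))
err_mom = np.linalg.norm(momentum(grad, x0, alpha_mom, beta_mom, steps=200))
print(f"distance to minimizer after 200 steps: GD {err_gd:.2e}, momentum {err_mom:.2e}")
```

Setting `beta` too close to 1 for a given `alpha` makes the same loop oscillate or diverge, which is the caveat noted on the slide.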
  • 16. Momentum: DEMO
  • 17. Adam. Properties: combines several clever tricks (from Momentum, RMSprop, AdaGrad); has some similarities to trust region methods; empirically proven, best in class (personal opinion). (Kingma, Ba: Adam: A Method for Stochastic Optimization)
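For reference, the Adam update from the Kingma and Ba paper can be written in a few lines. This is a sketch rather than the speaker's demo code; the quadratic test problem, the stepsize of 0.05 and the number of steps are assumptions for illustration (the defaults in the signature are the ones suggested in the paper).

```python
import numpy as np

def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: momentum on the gradient plus per-coordinate scaling."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # running mean of gradients (momentum part)
    v = np.zeros_like(x)  # running mean of squared gradients (scaling part)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Illustrative badly scaled quadratic f(x) = 0.5 x^T A x.
A = np.diag([1.0, 100.0])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x0 = np.array([1.0, 1.0])
x_adam = adam(grad, x0, alpha=0.05, steps=1000)
print(f(x0), f(x_adam))
```

Dividing by the root of the squared-gradient average makes the step roughly independent of the gradient's scale in each coordinate, which is where the loose similarity to trust region methods comes from.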
  • 18. SGD, Momentum and more: DEMO
  • 19. L-BFGS and Nonlinear CG. Observations so far: The better the method, the more parameters to tune. All better methods try to incorporate curvature information. Why not do so directly? L-BFGS: a quasi-Newton method that builds an approximation of the (inverse) Hessian and scales the gradient accordingly. Nonlinear CG: informally speaking, nonlinear CG tries to solve a quadratic approximation of the function. No surprise: they also work with minibatches.
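Both methods are available off the shelf. The sketch below uses SciPy's `minimize` with the `L-BFGS-B` and `CG` methods on the Rosenbrock test function; the test function and starting point are illustrative choices, and this is a full-batch example, not the minibatch setting mentioned on the slide. `L-BFGS-B` is SciPy's bound-constrained variant, which behaves as L-BFGS when no bounds are given.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Classic starting point for the 2-D Rosenbrock function; the minimizer is (1, 1).
x0 = np.array([-1.2, 1.0])

results = {
    method: minimize(rosen, x0, jac=rosen_der, method=method)
    for method in ("L-BFGS-B", "CG")
}

for method, res in results.items():
    print(f"{method}: x = {res.x}, f = {res.fun:.2e}, iterations = {res.nit}")
```

Note that neither call needs a stepsize: both methods choose step lengths through their own line searches.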
  • 20. Empirical results. Empirical evidence for better optimizers being better learners. MNIST, handwritten digit recognition; from Ng et al., On Optimization Methods for Deep Learning.
  • 21. Truncated Newton: Hessian-Free Optimization. Main ideas: Approximate not the Hessian $H$, but the matrix-vector product $Hd$. Use finite differences instead of the exact Hessian. Use damping. Use the linear CG method for solving the quadratic approximation. Use a clever mini-batch strategy for large data sets.
  • 22. Empirical test on pathological problems. Main results: The addition problem is known to be effectively impossible for gradient descent; HF did it. Basic RNN cells are used, no specialized architectures (LSTMs etc.). (Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))
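The first four ideas fit into a short sketch. This is a simplified illustration, not Martens' algorithm: it uses a fixed damping value, a forward finite difference of the gradient for $Hd$, no line search and no mini-batches, and the convex test function and all helper names are assumptions made for the example.

```python
import numpy as np

def hessian_vector_product(grad, x, d, eps=1e-6):
    """Finite-difference approximation of H(x) @ d from two gradient calls."""
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return np.zeros_like(d)
    h = eps / norm  # perturb x by a vector of length eps along d
    return (grad(x + h * d) - grad(x)) / h

def conjugate_gradient(matvec, rhs, max_iter=50, tol=1e-4):
    """Linear CG for matvec(p) = rhs, stopped at a relative residual of tol."""
    p = np.zeros_like(rhs)
    r = rhs.copy()  # residual rhs - matvec(p) for p = 0
    s = r.copy()    # current search direction
    rr = r @ r
    rhs_norm = np.linalg.norm(rhs)
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol * rhs_norm:
            break
        As = matvec(s)
        step = rr / (s @ As)
        p = p + step * s
        r = r - step * As
        rr_new = r @ r
        s = r + (rr_new / rr) * s
        rr = rr_new
    return p

def hessian_free_step(grad, x, damping=1e-3):
    """One truncated-Newton step: solve (H + damping * I) p = -grad(x) by CG."""
    def matvec(d):
        return hessian_vector_product(grad, x, d) + damping * d
    return x + conjugate_gradient(matvec, -grad(x))

# Illustrative convex objective f(x) = 0.5 x^T A x + 0.25 * sum(x**4) - sum(x).
A = np.diag([1.0, 10.0, 100.0])

def grad(x):
    return A @ x + x ** 3 - 1.0

x_hf = np.zeros(3)
for _ in range(15):
    x_hf = hessian_free_step(grad, x_hf)
print(x_hf, np.linalg.norm(grad(x_hf)))
```

The Hessian is never formed: each CG iteration costs two gradient evaluations, which is what makes the approach feasible for networks with many parameters.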
  • 24. Summary. In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs). But whatever you do/prefer/choose: At least use successors of SGD: Momentum, Adam etc. Look for generic approaches instead of more and more specialized and manually fine-tuned solutions. Key aspects: initialization; adaptive choice of stepsizes/momentum/...; scaling of the gradient.
  • 25. Resources: Overview of Gradient Descent methods; Why momentum really works; Adam - A Method for Stochastic Optimization; Andrew Ng et al. about L-BFGS and CG outperforming SGD; Lecture Slides Neural Networks for Machine Learning - Hinton et al.; On the importance of initialization and momentum in deep learning; Data-Science-Blog: Summary article in preparation (Stefan Kühn); The Neural Network Zoo