The Machinery behind Deep Learning
Stefan Kühn
Join me on XING
Minds Mastering Machines - Cologne - April 26th, 2018
Contents
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Deep Learning
Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of neurons can
approximate any continuous function on compact subsets of R^n
Questions:
Why do we need deep learning at all?
theoretical result, requires very wide nets
approximation by piecewise constant functions (not what you might
want for classification/regression)
deep nets can replicate the capacity of wide shallow nets, with performance
and stability improvements
Why are deep nets harder to train than shallow nets?
More parameters to be learned by training?
More hyperparameters to be set before training?
Numerical issues?
disclaimer — ideas stolen from Martens, Sutskever, Bengio et al. and many more —
Example: RNNs
Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but
extremely hard to train (somewhat less so for LSTMs/GRUs)
Main Advantages:
Qualitatively: Flexible and rich model class
Practically: Gradients easily computed by Backpropagation (BPTT)
Main Problems:
Qualitatively: Learning long-term dependencies
Practically: Gradient-based methods struggle when the separation between
input and target output is large
Example: RNNs
Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states
Indicators
Vanishing/exploding gradients
Internal Covariate Shift
Remedies
ReLU
’Careful’ initialization
Small stepsizes
(Recurrent) Batch Normalization
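A minimal numpy sketch of the vanishing/exploding-gradient indicator (assuming purely linear RNN dynamics; the spectral radii 0.9 and 1.1 and the dimensions are arbitrary choices): backpropagation through time multiplies the gradient by the recurrent Jacobian once per step, so its norm scales roughly like the spectral radius raised to the number of steps.

    import numpy as np

    rng = np.random.default_rng(0)
    n, steps = 50, 100

    for rho in (0.9, 1.1):  # spectral radius below / above 1
        W = rng.standard_normal((n, n))
        W *= rho / max(abs(np.linalg.eigvals(W)))  # rescale to spectral radius rho
        g = np.ones(n)  # stand-in for the gradient at the final time step
        for _ in range(steps):
            g = W.T @ g  # backprop through time: one multiplication per step
        print(f"rho={rho}: gradient norm after {steps} steps = {np.linalg.norm(g):.2e}")

With rho = 0.9 the norm collapses to essentially zero, with rho = 1.1 it blows up, which is why 'careful' initialization and small stepsizes matter.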
Example: RNNs
Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed a change of the RNN architecture by adding
Long Short-Term Memory units
Vanishing/exploding gradients?
fixed linear dynamics, no longer problematic
Any open questions?
Gradient-based training works better with LSTMs
LSTMs compensate for one deficiency of gradient-based learning, but
is this the only one?
Most problems are related to specific numerical issues.
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Notions of Optimality
Mathematical Optimization
Minimize a given loss function by a certain optimization method or strategy
until convergence.
Vidal et al, Mathematics of Deep Learning
Notions of Optimality
Mathematical Optimization
Minimize a given loss function by a certain optimization method or strategy
until convergence.
Local Optimum: Minimum in local neighborhood (global minimum
might not even exist)
Global Optimum: Point with lowest function value (if existing)
Critical points: Candidates for local/global optima, or saddle points
Iterative Minimization: Step-by-step approach to find minima
Descent direction: Direction in which the function value decreases, at
least for small steps
Gradient: For differentiable functions the negative gradient is always a
descent direction, and it vanishes at critical points
Optimality and Deep Neural Nets
Some surprisingly strong theoretical results for this nonlinear+nonconvex
optimization problem - and practical evidence as well!
Saddle points: In high-dimensional nonconvex problems most critical
points are saddle points -> bad local minima could not be observed for Deep Nets
Local and global optima: Deep Nets seem to have the property that
local optima are located near the global optimum
Optimal representation: Deep Nets can represent data optimally under
certain conditions (minimal sufficient statistic)
Information Theory: Deep Nets and entropy are becoming best friends,
strong relations to optimal control theory (optimization in infinite
dimensions)
Global optimality for positively homogeneous networks:
self-explanatory
Notions of Error
Decomposition of the Error
Even the best possible prediction - the optimal prediction via the so-called
Bayes predictor - comes with an error.
Error Components
Bayes Error: Theoretically optimal error
Approximation Error: Error introduced by the model class
Estimation Error: Error introduced by parameter estimation / model
training / optimization method
Notions of Error
Example
Bayes Error: Even the optimal predictor for house prices using only zip
codes makes an error -> Property of the data / features
Approximation Error: Linear Models cannot resolve non-linear
relationships between the features, irrespective of the training method
(but possibly could with different features, e.g. polynomial
regression)
Estimation Error: Did we select the right model from the model class
based on the available data? -> depends on model class, data and
training / optimization method
But what about the Generalization Error?
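A small synthetic illustration of the components (an assumed toy setup, not from the talk): for data y = sin(x) + noise, the Bayes predictor is sin(x) and its MSE equals the noise variance; a restricted model class such as linear models adds approximation error on top.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, 100_000)
    y = np.sin(x) + rng.normal(0, 0.3, x.size)  # noise sd 0.3 -> Bayes MSE = 0.09

    bayes_mse = np.mean((y - np.sin(x)) ** 2)   # error of the optimal predictor

    a, b = np.polyfit(x, y, 1)                  # restricted class: linear models
    linear_mse = np.mean((y - (a * x + b)) ** 2)  # Bayes + approximation error

    print(f"Bayes error  ~ {bayes_mse:.3f}")
    print(f"linear model ~ {linear_mse:.3f}")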
Notions of Learning
Learning
A core objective of a learner is to generalize from its experience.
But why do we use Mathematical Optimization for Learning?
What would be an alternative? Biology?
Trade-offs between Optimization and Learning
Computational complexity becomes the limiting factor when one envisions
large amounts of training data. [Bottou, Bousquet]
Underlying Idea
Approximate optimization algorithms might be sufficient for learning
purposes. [Bottou, Bousquet]
Implications:
Small-scale: Trade-off between approximation error and estimation
error
Large-scale: Computational complexity dominates
Long story short:
The best optimization methods might not be the best learning
methods!
Empirical results
Empirical evidence for SGD being a better learner than optimizer.
RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Advanced Concepts in Mathematical Optimization
Stepsize rules: Dynamically adjust step lengths to speed up
convergence
Preconditioning: Helps with ill-conditioned problems -> pathological
curvature
Damping: A strategy for making ill-posed problems regular -> helps
make local methods (Newton) work globally
Trust region: Determine step length - or radius of trust - first and then
look for good/best descent directions
Relaxation: Relax constraints for better tractability
Combine simple and complex methods: the Levenberg-Marquardt
algorithm combines Gradient Descent and the Gauss-Newton method (ensures
global convergence plus fast local convergence)
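To make the damping idea concrete, a minimal sketch (toy numbers assumed): the damped Newton step solves (H + λI)d = −g. Small λ approaches the pure Newton step, large λ a short gradient step, which is exactly the interpolation Levenberg-Marquardt exploits.

    import numpy as np

    def damped_newton_step(g, H, lam):
        # damping: regularize the (possibly indefinite) Hessian before solving
        return np.linalg.solve(H + lam * np.eye(H.shape[0]), -g)

    g = np.array([1.0, -2.0])
    H = np.diag([1.0, -0.5])  # indefinite: the pure Newton step is not a descent step

    for lam in (0.0, 1.0, 100.0):
        d = damped_newton_step(g, H, lam)
        print(f"lambda={lam}: step={d}, descent direction: {g @ d < 0}")

With lam = 0 the step points uphill here; already moderate damping turns it into a descent direction.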
Gradient Descent
Minimize a given objective function f:
min f(x), x ∈ R^n
Direction of Steepest Descent, the negative gradient:
d = −∇f(x)
Update in step k:
x_{k+1} = x_k − α ∇f(x_k)
Properties:
always a descent direction, no test needed
locally optimal, globally convergent
works with inexact line search, e.g. Armijo’s rule
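The update and Armijo's rule in a minimal Python sketch (the ill-conditioned quadratic is an arbitrary test problem):

    import numpy as np

    def gradient_descent(f, grad, x, alpha0=1.0, beta=0.5, c=1e-4, tol=1e-8, iters=10_000):
        for _ in range(iters):
            g = grad(x)
            if np.linalg.norm(g) < tol:      # gradient vanishes at critical points
                break
            alpha = alpha0
            # Armijo's rule: shrink the step until sufficient decrease holds
            while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
                alpha *= beta
            x = x - alpha * g                # x_{k+1} = x_k - alpha * grad f(x_k)
        return x

    f = lambda x: 0.5 * (x[0] ** 2 + 100 * x[1] ** 2)   # condition number 100
    grad = lambda x: np.array([x[0], 100 * x[1]])
    print(gradient_descent(f, grad, np.array([1.0, 1.0])))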
Stochastic Gradient Descent
Setting
x model parameters
f(x) := Σ_i f_i(x), the loss function is a sum of individual losses
∇f(x) := Σ_i ∇f_i(x), i = 1, . . . , m number of training examples
Choose i and update in step k:
x_{k+1} = x_k − α ∇f_i(x_k)
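A minimal sketch of this update on an assumed least-squares toy problem, with per-example loss f_i(x) = 0.5 (a_i·x − b_i)²:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 5))            # m = 1000 training examples
    x_true = rng.standard_normal(5)
    b = A @ x_true + 0.01 * rng.standard_normal(1000)

    def grad_i(x, i):
        # gradient of the individual loss f_i(x) = 0.5 * (A[i] @ x - b[i])^2
        return (A[i] @ x - b[i]) * A[i]

    x, alpha = np.zeros(5), 0.01
    for k in range(20_000):
        i = rng.integers(len(b))          # choose example i at random
        x = x - alpha * grad_i(x, i)      # x_{k+1} = x_k - alpha * grad f_i(x_k)

    print(np.linalg.norm(x - x_true))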
Shortcomings of Gradient Descent
local: only local information used
especially: no curvature information used
greedy: prefers high curvature directions
scale invariant: no
James Martens, Deep learning via Hessian-free optimization
Momentum
Update in step k
z_{k+1} = β z_k + ∇f(x_k)
x_{k+1} = x_k − α z_{k+1}
Properties for a quadratic convex objective:
the condition number κ effectively improves by a square root
stepsizes can be twice as long
rate of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
can diverge if β is not properly chosen/adapted
Gabriel Goh, Why momentum really works
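The update in code, with the classical heavy-ball parameters that attain the (√κ − 1)/(√κ + 1) rate on a quadratic (the κ = 100 test problem from the Gradient Descent sketch):

    import numpy as np

    def momentum(grad, x, alpha, beta, steps):
        z = np.zeros_like(x)
        for _ in range(steps):
            z = beta * z + grad(x)    # z_{k+1} = beta * z_k + grad f(x_k)
            x = x - alpha * z         # x_{k+1} = x_k - alpha * z_{k+1}
        return x

    grad = lambda x: np.array([x[0], 100 * x[1]])     # eigenvalues mu=1, L=100
    alpha = (2 / (np.sqrt(100) + np.sqrt(1))) ** 2    # classical heavy-ball stepsize
    beta = ((np.sqrt(100) - 1) / (np.sqrt(100) + 1)) ** 2
    print(momentum(grad, np.array([1.0, 1.0]), alpha, beta, steps=300))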
Momentum
D E M O
https://distill.pub/2017/momentum/
Adam
Properties:
combines several clever tricks (from Momentum, RMSprop, AdaGrad)
has some similarities to Trust Region methods
empirically proven - best in class (personal opinion)
Kingma, Ba Adam: A method for stochastic optimization
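A minimal sketch of the update as published by Kingma/Ba, with the paper's default hyperparameters (the toy quadratic from before as test problem):

    import numpy as np

    def adam(grad, x, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
        m = np.zeros_like(x)  # first moment, the Momentum ingredient
        v = np.zeros_like(x)  # second moment, the RMSprop/AdaGrad ingredient
        for t in range(1, steps + 1):
            g = grad(x)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)     # bias correction for the zero init
            v_hat = v / (1 - beta2 ** t)
            x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate steps
        return x

    grad = lambda x: np.array([x[0], 100 * x[1]])
    print(adam(grad, np.array([1.0, 1.0])))

The per-coordinate scaling by √v_hat is what gives Adam its Trust-Region flavor: steps are bounded roughly by alpha regardless of the raw gradient magnitude.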
SGD, Momentum and more
D E M O
Visualization of algorithms - by Sebastian Ruder
Beyond Adam
Adam has problems (and it’s not Eve)
Parameters are coupled
Some results indicate that Adam does not have the best generalization
properties
It's a heuristic -> convergence guarantee?
And Adam has friends!
New variants that decouple parameters
Combine Adam - better at early training stages - and SGD - better
generalization properties
This also helps with convergence!
Wilson et al The Marginal Value of Adaptive Gradient Methods in Machine Learning
Keskar, Socher Improving Generalization Performance by Switching from Adam to SGD
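A naive sketch of the switching idea from Keskar/Socher: Adam for the early steps, plain SGD afterwards. In the paper the switch point and the SGD stepsize are determined automatically; here both are fixed by hand for illustration.

    import numpy as np

    def adam_then_sgd(grad, x, switch_at=500, steps=3000, alpha=0.01,
                      beta1=0.9, beta2=0.999, eps=1e-8, sgd_alpha=0.005):
        m = np.zeros_like(x)
        v = np.zeros_like(x)
        for t in range(1, steps + 1):
            g = grad(x)
            if t <= switch_at:   # Adam phase: fast progress early in training
                m = beta1 * m + (1 - beta1) * g
                v = beta2 * v + (1 - beta2) * g * g
                x = x - alpha * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
            else:                # SGD phase: often better generalization
                x = x - sgd_alpha * g
        return x

    grad = lambda x: np.array([x[0], 100 * x[1]])
    print(adam_then_sgd(grad, np.array([1.0, 1.0])))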
Higher-Order Methods
Second-Order Methods
Require the existence of the Hessian and use it to scale gradients accordingly;
very successful but computationally expensive
Classical Newton Method: fast local convergence, no global
convergence
Relaxed Newton Methods: help with global convergence
Damped Newton Methods: help with global convergence
Modified Newton Methods: help with computational complexity
Quasi-Newton Methods: help with computational complexity
Nonlinear Conjugate Gradient Methods: iteratively build an approximation
to the Hessian
But there is a lot more to explore, e.g. the basin-hopping algorithm - a strategy for finding global optima - or
derivative-free methods like Nelder-Mead (downhill simplex), Particle Swarm Optimization (PSO and its variants)
L-BFGS and Nonlinear CG
Observations so far:
The better the method, the more parameters to tune.
All better methods try to incorporate curvature information.
Why not do so directly?
L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian
and scales the gradient accordingly.
Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation
of the function.
No surprise: They also work with minibatches.
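In practice both are one call away; a minimal SciPy sketch on the Rosenbrock function (an arbitrary test problem that ships with SciPy):

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.full(10, 1.5)   # arbitrary starting point

    # Quasi-Newton: limited-memory approximation of the inverse Hessian
    res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

    # Nonlinear Conjugate Gradient (Polak-Ribiere variant in SciPy)
    res_cg = minimize(rosen, x0, jac=rosen_der, method="CG")

    print("L-BFGS-B:", res_lbfgs.nit, "iterations, f =", res_lbfgs.fun)
    print("CG:      ", res_cg.nit, "iterations, f =", res_cg.fun)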
Empirical results
Empirical evidence for better optimizers being better learners.
MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning
Truncated Newton: Hessian-Free Optimization
Main ideas:
Approximate not the Hessian H itself, but the matrix-vector product Hd.
Use finite differences instead of the exact Hessian.
Use damping.
Use the Linear CG method for solving the quadratic approximation.
Use a clever mini-batch strategy for large datasets.
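The two central ingredients in a simplified sketch (mini-batching and damping adaptation omitted; the toy gradient is an assumption): a finite-difference Hessian-vector product and a damped linear CG solve of the quadratic model.

    import numpy as np

    def hvp(grad, x, d, eps=1e-6):
        # finite-difference Hessian-vector product: Hd ~ (grad(x+eps*d) - grad(x))/eps
        return (grad(x + eps * d) - grad(x)) / eps

    def hf_step(grad, x, lam=1.0, cg_iters=50, tol=1e-12):
        # approximately solve (H + lam*I) p = -grad(x) with linear CG;
        # the Hessian is never formed, only Hd products are used
        g = grad(x)
        p = np.zeros_like(x)
        r = -g.copy()              # residual of the damped Newton system
        d = r.copy()
        for _ in range(cg_iters):
            Hd = hvp(grad, x, d) + lam * d
            a = (r @ r) / (d @ Hd)
            p, r_new = p + a * d, r - a * Hd
            if r_new @ r_new < tol:
                break
            d = r_new + ((r_new @ r_new) / (r @ r)) * d
            r = r_new
        return p

    grad = lambda x: np.array([x[0], 100 * x[1]])   # toy quadratic again
    x = np.array([1.0, 1.0])
    print(x + hf_step(grad, x))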
Empirical test on pathological problems
Main results:
The addition problem is known to be effectively impossible for
gradient descent, but HF solved it.
Basic RNN cells are used, no specialized architectures (LSTMs etc.).
(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Summary
In the long run, the biggest bottleneck will be the sequential parts of an
algorithm. That’s why the number of iterations needs to be small. SGD and
its successors tend to need many more iterations, and they cannot benefit
as much from higher parallelism (GPUs).
But whatever you do/prefer/choose:
At least try out successors of SGD: Momentum, Adam etc.
Look for generic approaches instead of more and more specialized,
manually fine-tuned solutions.
Key aspects:
Initialization
Adaptive choice of stepsizes/momentum/. . .
Scaling of the gradient
Resources
Overview of Gradient Descent methods
Why momentum really works
Adam - A Method for Stochastic Optimization
Mathematics of Deep Learning
The Marginal Value of Adaptive Gradient Methods in Machine
Learning
Andrew Ng et al. about L-BFGS and CG outperforming SGD
Lecture Slides Neural Networks for Machine Learning - Hinton et al.
On the importance of initialization and momentum in deep learning
Data-Science-Blog: Summary article in preparation (Stefan Kühn)
The Neural Network Zoo
Thank you!