Talk at the Data Science Meetup Hamburg about Deep Learning, the most important optimization methods in the field, and the relationship between training and learning
Deep Learning and Optimization Methods
1. Deep Learning and Optimization Methods
Stefan Kühn
Join me on XING
Data Science Meetup Hamburg - July 27th, 2017
2. Contents
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
4. Deep Learning
Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of parameters can approximate any continuous function on compact subsets of R^n
Questions:
Why do we need deep learning at all?
a purely theoretical result
approximation by piecewise constant functions (not what you might want for classification/regression)
Why are deep nets harder to train than shallow nets?
More parameters to be learned by training?
More hyperparameters to be set before training?
Numerical issues?
Disclaimer: ideas stolen from Martens, Sutskever, Bengio et al. and many more
5. Example: RNNs
Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but extremely hard to train (somewhat less so for LSTMs/GRUs)
Main Advantages:
Qualitatively: Flexible and rich model class
Practically: Gradients easily computed by Backpropagation (BPTT)
Main Problems:
Qualitatively: Learning long-term dependencies
Practically: Gradient-based methods struggle when the separation between input and target output is large
6. Example: RNNs
Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states
Indicators
Vanishing/exploding gradients
Internal Covariate Shift
Remedies
ReLU
'Careful' initialization
Small stepsizes
(Recurrent) Batch Normalization
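To make the first indicator concrete, here is a small NumPy sketch (mine, not from the talk) that backpropagates a gradient through T steps of a linear recurrence. Depending on whether the spectral radius of the recurrent matrix W is below or above 1, the gradient norm shrinks or blows up geometrically; the tanh-derivative factor (bounded by 1) is dropped, so the vanishing case is, if anything, understated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 100                        # hidden size, sequence length

for scale in (0.9, 1.1):              # spectral radius below / above 1
    W = rng.normal(size=(n, n))
    W *= scale / max(abs(np.linalg.eigvals(W)))   # rescale spectral radius
    g = rng.normal(size=n)            # gradient arriving at the last step
    for _ in range(T):                # backprop through time
        g = W.T @ g                   # tanh' factor (<= 1) dropped
    print(f"radius {scale}: |g| after {T} steps = {np.linalg.norm(g):.3e}")
```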
7. Example: RNNs
Recurrent Neural Nets and LSTM
Schmidhuber/Hochreiter proposed a change of the RNN architecture by adding Long Short-Term Memory units
Vanishing/exploding gradients?
fixed linear dynamics, no longer problematic
Any open questions?
Gradient-based training works better with LSTMs.
LSTMs can compensate for one deficiency of gradient-based learning, but is this the only one?
Most problems are related to specific numerical issues.
8. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
9. Trade-offs between Optimization and Learning
Computational complexity becomes the limiting factor when one envisions large amounts of training data. [Bottou, Bousquet]
Underlying Idea
Approximate optimization algorithms might be sufficient for learning purposes. [Bottou, Bousquet]
Implications:
Small-scale: Trade-off between approximation error and estimation
error
Large-scale: Computational complexity dominates
Long story short:
The best optimization methods might not be the best learning
methods!
10. Empirical results
Empirical evidence for SGD being a better learner than optimizer.
RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks
11. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
12. Gradient Descent
Minimize a given function f:
min f(x), x ∈ R^n
Direction of steepest descent, the negative gradient:
d = −∇f(x)
Update in step k:
x_{k+1} = x_k − α ∇f(x_k)
Properties:
always a descent direction, no test needed
locally optimal, globally convergent
works with inexact line search, e.g. Armijo's rule
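As a reference point, a minimal NumPy sketch of the update above with Armijo backtracking; the function and parameter names are mine, not from the slides.

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, shrink=0.5, c=1e-4,
                     tol=1e-8, max_iter=1000):
    """Steepest descent with an Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # gradient small -> done
            break
        d = -g                             # steepest descent direction
        alpha = alpha0
        # Armijo rule: shrink alpha until sufficient decrease holds
        while f(x + alpha * d) > f(x) + c * alpha * g.dot(d):
            alpha *= shrink
        x = x + alpha * d
    return x

# Usage on a simple convex quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 10.0])                   # condition number kappa = 10
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gradient_descent(f, grad, [5.0, 1.0]))   # -> close to [0, 0]
```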
13. Stochastic Gradient Descent
Setting:
f(x) := Σ_i f_i(x), ∇f(x) := Σ_i ∇f_i(x), i = 1, ..., m, with m the number of training examples
Choose i and update in step k:
x_{k+1} = x_k − α ∇f_i(x_k)
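A minimal sketch of that loop, again in NumPy; the least-squares objective and all names are illustrative assumptions, not from the talk.

```python
import numpy as np

def sgd(grad_i, x0, m, alpha=0.01, epochs=20, seed=0):
    """Plain SGD: visit one training example i per step and
    update with its partial gradient grad_i(x, i)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(m):       # one shuffled pass over the data
            x -= alpha * grad_i(x, i)
    return x

# Usage: least squares, f(x) = sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
b = A @ np.array([1.0, -2.0, 0.5])         # noiseless planted solution
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
print(sgd(grad_i, np.zeros(3), m=100))     # -> close to [1, -2, 0.5]
```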
14. Shortcomings of Gradient Descent
local: only local information used
especially: no curvature information used
greedy: prefers high-curvature directions
scale-invariant: no
James Martens, Deep learning via Hessian-free optimization
15. Momentum
Update in step k:
z_{k+1} = β z_k + ∇f(x_k)
x_{k+1} = x_k − α z_{k+1}
Properties for a quadratic convex objective:
the condition number κ of the problem effectively improves by a square root
stepsizes can be twice as long
convergence rate (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
can diverge if β is not properly chosen/adapted
Gabriel Goh, Why momentum really works
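The heavy-ball update from the slide as a small NumPy sketch on an ill-conditioned quadratic; names and constants are mine.

```python
import numpy as np

def momentum(grad, x0, alpha=0.01, beta=0.9, max_iter=500):
    """Heavy-ball update from the slide:
    z_{k+1} = beta * z_k + grad(x_k),  x_{k+1} = x_k - alpha * z_{k+1}."""
    x = np.asarray(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z = beta * z + grad(x)
        x = x - alpha * z
    return x

# Ill-conditioned quadratic: momentum damps the zig-zagging of plain GD
A = np.diag([1.0, 100.0])                  # condition number kappa = 100
grad = lambda x: A @ x
print(momentum(grad, [5.0, 1.0]))          # -> close to [0, 0]
```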
16. Momentum
D E M O
17. Adam
Properties:
combines several clever tricks (from Momentum, RMSprop, AdaGrad)
has some similarities to Trust Region methods
empirically proven - best in class (personal opinion)
Kingma/Ba, Adam: A Method for Stochastic Optimization
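A compact sketch of the update from Kingma/Ba; the stepsize here is larger than the paper's default of 0.001 so the toy problem converges quickly, and the test problem is my own choice.

```python
import numpy as np

def adam(grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
         max_iter=500):
    """Adam (Kingma/Ba): momentum-style first moment plus an
    RMSprop-style second moment, both bias-corrected."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                   # first moment estimate
    v = np.zeros_like(x)                   # second moment estimate
    for t in range(1, max_iter + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)         # bias correction
        v_hat = v / (1 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Badly scaled quadratic: the per-coordinate rescaling helps here
grad = lambda x: np.array([1.0, 100.0]) * x
print(adam(grad, [5.0, 1.0]))              # -> near the minimum at [0, 0]
```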
18. SGD, Momentum and more
D E M O
19. L-BFGS and Nonlinear CG
Observations so far:
The better the method, the more parameters to tune.
All better methods try to incorporate curvature information.
Why not do so directly?
L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian and scales the gradient accordingly.
Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation
of the function.
No surprise: They also work with minibatches.
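Both methods are available off the shelf; a small sketch using SciPy's minimize on the Rosenbrock function, a standard curvature-heavy test problem (the problem choice is mine, not from the slides).

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock: narrow curved valley, hard for plain gradient descent
def f(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

x0 = np.array([-1.2, 1.0])
for method in ("L-BFGS-B", "CG"):          # quasi-Newton vs. nonlinear CG
    res = minimize(f, x0, jac=grad, method=method)
    print(method, res.x, "iterations:", res.nit)
```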
20. Empirical results
Empirical evidence for better optimizers being better learners.
MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning
21. Truncated Newton: Hessian-Free Optimization
Main ideas:
Approximate not the Hessian H itself, but the matrix-vector product Hd.
Use finite differences instead of exact Hessian.
Use damping.
Use Linear CG method for solving quadratic approximation.
Use a clever mini-batch strategy for large datasets.
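A sketch of the first two ingredients wired together: finite-difference Hessian-vector products fed into linear CG for one truncated-Newton step on a toy quadratic. Damping and the mini-batch strategy from the list above are omitted, and all names are mine.

```python
import numpy as np

def hess_vec(grad, x, d, eps=1e-6):
    """Finite-difference Hessian-vector product:
    H d ~= (grad(x + eps*d) - grad(x)) / eps."""
    return (grad(x + eps * d) - grad(x)) / eps

def cg_solve(hvp, b, max_iter=50, tol=1e-10):
    """Linear CG for H p = b, using only Hessian-vector products."""
    p = np.zeros_like(b)
    r = b.copy()                           # residual b - H p (p starts at 0)
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hd = hvp(d)
        a = rs / (d @ Hd)
        p += a * d
        r -= a * Hd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# One truncated-Newton step on a quadratic: solves it essentially exactly
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x
x = np.array([1.0, 1.0, 1.0])
step = cg_solve(lambda d: hess_vec(grad, x, d), -grad(x))
print(x + step)                            # -> close to [0, 0, 0]
```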
22. Empirical test on pathological problems
Main results:
The addition problem is known to be effectively impossible for gradient descent; HF solved it.
Basic RNN cells are used, no specialized architectures (LSTMs etc.).
(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))
23. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
24. Summary
In the long run, the biggest bottleneck will be the sequential parts of an algorithm. That's why the number of iterations needs to be small. SGD and its successors tend to need many more iterations, and they cannot benefit as much from higher parallelism (GPUs).
But whatever you do/prefer/choose:
At least use successors of SGD: Momentum, Adam etc.
Look for generic approaches instead of more and more specialized and manually fine-tuned solutions.
Key aspects:
Initialization
Adaptive choice of stepsizes/momentum/. . .
Scaling of the gradient
25. Resources
Overview of Gradient Descent methods
Why momentum really works
Adam - A Method for Stochastic Optimization
Andrew Ng et al. about L-BFGS and CG outperforming SGD
Lecture Slides Neural Networks for Machine Learning - Hinton et al.
On the importance of initialization and momentum in deep learning
Data-Science-Blog: Summary article in preparation (Stefan Kühn)
The Neural Network Zoo