Numerical Computation
for Deep Learning
Lecture slides for Chapter 4 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last modified 2017-10-14
Thanks to Justin Gilmer and Jacob
Buckman for helpful discussions
(Goodfellow 2017)
Numerical concerns for implementations
of deep learning algorithms
• Algorithms are often specified in terms of real numbers; real
numbers cannot be implemented in a finite computer
• Does the algorithm still work when implemented with a finite
number of bits?
• Do small changes in the input to a function cause large changes to the output?
• Rounding errors, noise, measurement errors can cause large
changes
• Iterative search for best input is difficult
(Goodfellow 2017)
Roadmap
• Iterative Optimization
• Rounding error, underflow, overflow
(Goodfellow 2017)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2017)
Gradient Descent
Figure 4.1: An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum. For the example f(x) = (1/2) x^2, with f'(x) = x:
• Global minimum at x = 0. Since f'(x) = 0, gradient descent halts here.
• For x < 0, we have f'(x) < 0, so we can decrease f by moving rightward.
• For x > 0, we have f'(x) > 0, so we can decrease f by moving leftward.
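A minimal NumPy sketch of this idea (the starting point and the step size of 0.1 are arbitrary choices for illustration):
import numpy as np

def f_prime(x):
    return x            # derivative of f(x) = 1/2 x^2

x = 2.0                 # arbitrary starting point
step_size = 0.1         # illustrative learning rate
for _ in range(100):
    x = x - step_size * f_prime(x)   # step against the gradient
print(x)                # close to the global minimum at x = 0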
(Goodfellow 2017)
Approximate Optimization
Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
• Ideally, we would like to arrive at the global minimum, but this might not be possible.
• This local minimum performs nearly as well as the global one, so it is an acceptable halting point.
• This local minimum performs poorly and should be avoided.
(Goodfellow 2017)
We usually don’t even reach a
local minimum
Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network. (Left panel: gradient norm vs. training time in epochs; right panel: classification error rate vs. training time in epochs.)
(Goodfellow 2017)
Deep learning optimization
way of life
• Pure math way of life:
• Find literally the smallest value of f(x)
• Or maybe: find some critical point of f(x) where
the value is locally smallest
• Deep learning way of life:
• Decrease the value of f(x) a lot
(Goodfellow 2017)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2017)
Critical Points
Minimum, Maximum, Saddle point
Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can be a local minimum, which is lower than the neighboring points; a local maximum, which is higher than the neighboring points; or a saddle point, which has neighbors both higher and lower than itself.
(Goodfellow 2017)
Saddle Points
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1^2 - x_2^2. Along the axis corresponding to x_1, the function curves upward; this axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward; this direction is an eigenvector of the Hessian with a negative eigenvalue.
(Gradient descent escapes,
see Appendix C of “Qualitatively
Characterizing Neural Network
Optimization Problems”)
Saddle points attract
Newton’s method
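An illustrative sketch, not from the slides: on the quadratic saddle f(x) = x_1^2 - x_2^2, a single Newton step lands exactly on the saddle point at the origin, while gradient descent drifts away from it along x_2:
import numpy as np

def grad(x):
    return np.array([2 * x[0], -2 * x[1]])    # gradient of f(x) = x1^2 - x2^2

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])                   # constant Hessian of f

x0 = np.array([1.0, 0.1])                     # arbitrary starting point
x_newton = x0 - np.linalg.solve(H, grad(x0))  # one Newton step
print(x_newton)                               # [0. 0.]: lands exactly on the saddle

x_gd = x0.copy()
for _ in range(50):
    x_gd = x_gd - 0.1 * grad(x_gd)            # gradient descent
print(x_gd)                                   # x1 shrinks toward 0, but x2 grows: it escapes the saddle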
(Goodfellow 2017)
Curvature
Negative curvature, No curvature, Positive curvature
Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the function we would expect based on the gradient information alone.
(Goodfellow 2017)
Directional Second Derivatives
Figure 2.3: Effect of eigenvectors and eigenvalues. Multiplying by a matrix scales space along each eigenvector v^(i) by the corresponding eigenvalue λ_i (left: before multiplication; right: after multiplication).
Directional Curvature
(Goodfellow 2017)
Predicting optimal step size
using Taylor series
We can make a second-order Taylor series approximation to f(x) around the current point x^{(0)}:

f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2} (x - x^{(0)})^\top H (x - x^{(0)}) \qquad (4.8)

where g is the gradient and H is the Hessian at x^{(0)}. If we use a learning rate \epsilon, then the new point x will be given by x^{(0)} - \epsilon g. Substituting this into our approximation, we obtain

f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2} \epsilon^2 g^\top H g \qquad (4.9)

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g^\top H g is zero or negative, the Taylor series approximation predicts that increasing \epsilon forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large \epsilon, so we must resort to more heuristic choices of \epsilon in this case. When g^\top H g is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

\epsilon^* = \frac{g^\top g}{g^\top H g} \qquad (4.10)

In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue \lambda_{max}, this optimal step size is given by \frac{1}{\lambda_{max}}. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
Big gradients speed you up
Big eigenvalues slow you
down if you align with
their eigenvectors
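A minimal sketch of equation 4.10 (the Hessian here is an arbitrary example, not from the book):
import numpy as np

H = np.array([[100.0, 0.0],
              [0.0, 1.0]])         # example Hessian of a quadratic f(x) = 1/2 x^T H x

x = np.array([1.0, 1.0])
g = H @ x                          # gradient of the quadratic at x

eps_star = (g @ g) / (g @ H @ g)   # equation 4.10
print(eps_star)                    # about 0.01, close to 1/lambda_max, because g mostly
                                   # aligns with the large-eigenvalue direction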
(Goodfellow 2017)
Condition Number
Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output. Consider the function f(x) = A^{-1} x. When A \in \mathbb{R}^{n \times n} has an eigenvalue decomposition, its condition number is

\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right| \qquad (4.2)

This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input. This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
When the condition number is large,
sometimes you hit large eigenvalues and
sometimes you hit small ones.
The large ones force you to keep the learning
rate small, and miss out on moving fast in the
small eigenvalue directions.
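An illustrative sketch of equation 4.2 for a symmetric matrix (the example matrix is arbitrary):
import numpy as np

H = np.array([[100.0, 0.0],
              [0.0, 1.0]])                 # example symmetric matrix, eigenvalues 100 and 1

eigvals = np.linalg.eigvalsh(H)
condition_number = np.abs(eigvals).max() / np.abs(eigvals).min()
print(condition_number)                    # 100.0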
(Goodfellow 2017)
Gradient Descent and Poor
Conditioning
Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) with a poorly conditioned Hessian.
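A minimal sketch of the effect shown in the figure, using an arbitrary poorly conditioned quadratic:
import numpy as np

H = np.array([[100.0, 0.0],
              [0.0, 1.0]])        # quadratic f(x) = 1/2 x^T H x, condition number 100

x = np.array([1.0, 1.0])
step_size = 0.019                 # must stay below 2/lambda_max = 0.02 to avoid divergence
for _ in range(100):
    x = x - step_size * (H @ x)
print(x)                          # the large-eigenvalue coordinate has converged,
                                  # the small-eigenvalue coordinate is still far from 0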
(Goodfellow 2017)
Neural net visualization
(From "Qualitatively Characterizing Neural Network Optimization Problems," a conference paper at ICLR 2015)
At end of learning:
- gradient is still large
- curvature is huge
(Goodfellow 2017)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2017)
KKT Multipliers
These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged.
To perform constrained maximization, we can construct the generalized Lagrange function of -f(x), which leads to this optimization problem:

\min_x \max_\lambda \max_{\alpha, \alpha \ge 0} \; -f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x) \qquad (4.19)

We may also convert this to a problem with maximization in the outer loop:

\max_x \min_\lambda \min_{\alpha, \alpha \ge 0} \; f(x) + \sum_i \lambda_i g^{(i)}(x) - \sum_j \alpha_j h^{(j)}(x) \qquad (4.20)

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each \lambda_i.
In this book, mostly used for
theory
(e.g.: show Gaussian is highest
entropy distribution)
In practice, we usually
just project back to the
constraint region after each
step
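A sketch of that projection idea (the objective and constraint set here are arbitrary illustrations):
import numpy as np

c = np.array([3.0, 4.0])          # minimize f(x) = 1/2 ||x - c||^2 subject to ||x|| <= 1

def project(x):
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm   # project back onto the unit ball

x = np.zeros(2)
for _ in range(100):
    x = x - 0.1 * (x - c)         # unconstrained gradient step
    x = project(x)                # then project back to the constraint region
print(x)                          # approximately [0.6, 0.8], the constrained optimum c/||c||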
(Goodfellow 2017)
Roadmap
• Iterative Optimization
• Rounding error, underflow, overflow
(Goodfellow 2017)
Numerical Precision: A deep
learning super skill
• Often deep learning algorithms “sort of work”
• Loss goes down, accuracy gets within a few
percentage points of state-of-the-art
• No “bugs” per se
• Often deep learning algorithms “explode” (NaNs, large
values)
• Culprit is often loss of numerical precision
(Goodfellow 2017)
Rounding and truncation
errors
• In a digital computer, we use float32 or similar
schemes to represent real numbers
• A real number x is rounded to x + delta for some
small delta
• Overflow: large x replaced by inf
• Underflow: small x replaced by 0
(Goodfellow 2017)
Example
• Adding a very small number to a larger one may
have no effect. This can cause large changes
downstream:
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a + 1).argmax()
0
(Goodfellow 2017)
Secondary effects
• Suppose we have code that computes x-y
• Suppose x overflows to inf
• Suppose y overflows to inf
• Then x - y = inf - inf = NaN
(Goodfellow 2017)
exp
• exp(x) overflows for large x
• Doesn’t need to be very large
• float32: 89 overflows
• Never use large x
• exp(x) underflows for very negative x
• Possibly not a problem
• Possibly catastrophic if exp(x) is a denominator, an argument to a
logarithm, etc.
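A quick demonstration, assuming NumPy (exact thresholds depend on the floating-point format):
import numpy as np

print(np.exp(np.float32(89.0)))             # inf: float32 exp overflows near 88.7
print(np.exp(np.float32(-104.0)))           # 0.0: underflow for very negative x
print(np.log(np.exp(np.float32(-104.0))))   # -inf: catastrophic if the result feeds a log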
(Goodfellow 2017)
Subtraction
• Suppose x and y have similar magnitude
• Suppose x is always greater than y
• In a computer, x - y may be negative due to
rounding error
• Example: variance
Var(f(x)) = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right] \qquad (3.12) \quad Safe

Var(f(x)) = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2 \qquad Dangerous: it subtracts two quantities of similar magnitude, so rounding error can drive the computed result negative.
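A sketch of the failure mode; the values are chosen to make the float32 cancellation visible:
import numpy as np

x = (np.float32(10000.0) + np.random.randn(100000)).astype(np.float32)

mean = np.mean(x, dtype=np.float32)
dangerous = np.mean(x * x, dtype=np.float32) - mean * mean   # E[x^2] - E[x]^2
safe = np.mean((x - mean) ** 2, dtype=np.float32)            # E[(x - E[x])^2]
print(dangerous, safe)   # the first can be wildly wrong or even negative; the second is close to 1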
(Goodfellow 2017)
log and sqrt
• log(0) = - inf
• log(<negative>) is complex-valued; in real-valued software it is usually nan
• sqrt(0) is 0, but its derivative has a divide by zero
• Definitely avoid underflow or round-to-negative in the
argument!
• Common case: standard_dev = sqrt(variance)
(Goodfellow 2017)
log exp
• log exp(x) is a common pattern
• Should be simplified to x
• Avoids:
• Overflow in exp
• Underflow in exp causing -inf in log
(Goodfellow 2017)
Which is the better hack?
• normalized_x = x / st_dev
• eps = 1e-7
• Should we use
• st_dev = sqrt(eps + variance)
• st_dev = eps + sqrt(variance) ?
• What if variance is implemented safely and will never
round to negative?
(Goodfellow 2017)
log(sum(exp))
• Naive implementation:
tf.log(tf.reduce_sum(tf.exp(array)))
• Failure modes:
• If any entry is very large, exp overflows
• If all entries are very negative, all exps
underflow… and then log is -inf
(Goodfellow 2017)
Stable version
mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))
Built in version:
tf.reduce_logsumexp
(Goodfellow 2017)
Why does the logsumexp trick
work?
• Algebraically equivalent to the original version:
m + \log \sum_i \exp(a_i - m)
= m + \log \sum_i \frac{\exp(a_i)}{\exp(m)}
= m + \log \frac{1}{\exp(m)} \sum_i \exp(a_i)
= m - \log \exp(m) + \log \sum_i \exp(a_i)
= \log \sum_i \exp(a_i)
(Goodfellow 2017)
Why does the logsumexp trick
work?
• No overflow:
• Entries of safe_array are at most 0
• Some of the exp terms underflow, but not all
• At least one entry of safe_array is 0
• The sum of exp terms is at least 1
• The sum is now safe to pass to the log
(Goodfellow 2017)
Softmax
• Softmax: use your library’s built-in softmax function
• If you build your own, use:
• Similar to logsumexp
safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)
(Goodfellow 2017)
Sigmoid
• Use your library’s built-in sigmoid function
• If you build your own:
• Recall that sigmoid is just softmax with one of
the logits hard-coded to 0
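A sketch of that idea, not from the slides: branch on the sign of the input so exp is only ever applied to non-positive values:
import numpy as np

def stable_sigmoid(x):
    # sigmoid(x) = softmax of the logits [x, 0]; only ever exponentiate non-positive values
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp(-x) <= 1 here, no overflow
    e = np.exp(x[~pos])                        # x < 0 here, so exp(x) <= 1, no overflow
    out[~pos] = e / (1.0 + e)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))   # [0.  0.5 1. ] with no overflow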
(Goodfellow 2017)
Cross-entropy
• Cross-entropy loss for softmax (and sigmoid) has both
softmax and logsumexp in it
• Compute it using the logits not the probabilities
• The probabilities lose gradient due to rounding error where
the softmax saturates
• Use tf.nn.softmax_cross_entropy_with_logits or similar
• If you roll your own, use the stabilization tricks for softmax
and logsumexp
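A NumPy sketch of this (the function name is illustrative): compute the cross-entropy directly from the logits with the logsumexp trick, never forming the probabilities:
import numpy as np

def cross_entropy_from_logits(logits, label):
    # cross-entropy = logsumexp(logits) - logits[label]; the probabilities are never formed
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
    return log_sum_exp - logits[label]

logits = np.array([1000.0, 0.0, -1000.0])
print(cross_entropy_from_logits(logits, 0))   # 0.0, with no overflow
# computing -log(softmax(logits)[0]) via the probabilities would overflow in exp first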
(Goodfellow 2017)
Bug hunting strategies
• If you increase your learning rate and the loss gets
stuck, you are probably rounding your gradient to
zero somewhere: maybe computing cross-entropy
using probabilities instead of logits
• For a correctly implemented loss, too high a learning rate should usually cause an explosion
(Goodfellow 2017)
Bug hunting strategies
• If you see explosion (NaNs, very large values) immediately
suspect:
• log
• exp
• sqrt
• division
• Always suspect the code that changed most recently
(Goodfellow 2017)
Questions