Computer Science and Engineering| Indian Institute of Technology Kharagpur
cse.iitkgp.ac.in
Deep Learning
CS60010
Abir Das
Computer Science and Engineering Department
Indian Institute of Technology Kharagpur
http://cse.iitkgp.ac.in/~adas/
Agenda
• Understand basics of Matrix/Vector Calculus and Optimization concepts
to be used in the course.
• Understand different types of errors in learning
Resources
• ”Deep Learning”, I. Goodfellow, Y. Bengio, A. Courville. (Chapter 8)
Optimization and Deep Learning- Connections
• Deep learning (machine learning, in general) involves optimization in many contexts
• The goal is to find parameters 𝜽 of a neural network that significantly reduce a cost
function or objective function 𝐽(𝜽).
• Gradient based optimization is the most popular way for training Deep Neural Networks.
• There are other ways too, e.g., evolutionary or derivative-free optimization, but they
come with issues that are particularly problematic for neural network training.
• It's easy to spend a whole semester on optimization. Thus, these few lectures will only
scratch a very small part of the surface.
Optimization and Deep Learning- Differences
• In learning we care about some performance measure 𝑃 (e.g., image classification
accuracy, language translation accuracy etc.) on test set, but we minimize a different cost
function 𝐽(𝜽) on training set, with the hope that doing so will improve 𝑃
• This is in contrast to pure optimization where minimizing 𝐽(𝜽) is a goal in itself
• Let's see what types of errors creep in as a result
Expected and Empirical Risk
• Training data is {(𝒙⁽ⁱ⁾, y⁽ⁱ⁾) : i = 1, …, N}, coming from a probability distribution 𝒟(𝒙, y)
• Let the neural network learn the output function g_𝒟(𝒙), and denote the loss by l(g_𝒟(𝒙), y)
• What we want to minimize is the expected risk
  E(g_𝒟) = ∫ l(g_𝒟(𝒙), y) d𝒟(𝒙, y) = 𝔼[l(g_𝒟(𝒙), y)]
• If we could minimize this risk we would get the true optimal function
  f = arg min_{g_𝒟} E(g_𝒟)
• But we don't know the actual distribution that generates the data. So, what we actually minimize is the empirical risk
  E_n(g_𝒟) = (1/N) Σᵢ₌₁ᴺ l(g_𝒟(𝒙ᵢ), yᵢ) = 𝔼_n[l(g_𝒟(𝒙), y)]
• We choose a family ℋ of candidate prediction functions and find the function that minimizes the empirical risk
  g_n^𝒟 = arg min_{g_𝒟 ∈ ℋ} E_n(g_𝒟)
• Since f may not be found in the family ℋ, we also define
  g_{ℋ*}^𝒟 = arg min_{g_𝒟 ∈ ℋ} E(g_𝒟)
Sources of Error

What each minimizes, and staying within what:
• f: minimizes the expected risk, with no constraint on the function family
  (true data distribution known; family of functions exhaustive)
• g_{ℋ*}^𝒟: minimizes the expected risk, staying within the family of functions ℋ
  (true data distribution known; family of functions not exhaustive)
• g_n^𝒟: minimizes the empirical risk, staying within the family of functions ℋ
  (true data distribution not known; family of functions not exhaustive)

So the sources of error are:
• True data generation procedure not known
• Family of functions [or hypotheses] to try is not exhaustive
• (And not to forget) we rely on a surrogate loss in place of the true classification error rate
A Simple Learning Problem
2 data points, 2 hypothesis sets:
  ℋ0: h(x) = b
  ℋ1: h(x) = ax + b
[Figure: the fits produced by ℋ0 and ℋ1 on the two data points]
Slide courtesy: Malik Magdon-Ismail
Let’s Repeat the Experiment Many Times
Slide courtesy: Malik Magdon-Ismail
[Figure: the fitted hypotheses obtained from many repeated data sets, for ℋ0 and ℋ1]
• For each data set 𝒟 you get a different g_n^𝒟
• For a fixed 𝒙, g_n^𝒟(𝒙) is a random value, depending on 𝒟
What’s Happening on Average
Slide courtesy: Malik Magdon-Ismail
[Figure: the average fit ḡ(x) plotted against the target sin(x), for ℋ0 and ℋ1]
We can define (g_n^𝒟(𝒙) being a random value, depending on 𝒟):
• ḡ(𝒙) = 𝔼_𝒟[g_n^𝒟(𝒙)] = (1/K) (g_n^𝒟¹(𝒙) + g_n^𝒟²(𝒙) + ⋯ + g_n^𝒟ᴷ(𝒙))
  (your average prediction on 𝒙)
• var(𝒙) = 𝔼_𝒟[(g_n^𝒟(𝒙) − ḡ(𝒙))²] = 𝔼_𝒟[g_n^𝒟(𝒙)²] − ḡ(𝒙)²
  (how variable your prediction is)
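To make the averaging concrete, here is a minimal NumPy sketch (not from the slides; K, the sampling range, and the helper names are illustrative choices) that repeats the two-point experiment many times: each data set 𝒟 is two samples of sin(x), ℋ1 fits a line h(x) = ax + b, and ḡ(x) and var(x) are estimated by averaging over the K fits.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # true target function
K, n = 10_000, 2                 # number of data sets, points per data set
x_test = np.linspace(-np.pi, np.pi, 200)

preds = np.empty((K, x_test.size))
for k in range(K):
    x = rng.uniform(-np.pi, np.pi, size=n)   # one data set D of two points
    y = f(x)
    a, b = np.polyfit(x, y, deg=1)           # H1: h(x) = a*x + b
    preds[k] = a * x_test + b                # g_n^D evaluated on test points

g_bar = preds.mean(axis=0)                   # average prediction g-bar(x)
var = preds.var(axis=0)                      # var(x) = E_D[(g_n^D(x) - g_bar(x))^2]
bias2 = (f(x_test) - g_bar) ** 2             # (f(x) - g_bar(x))^2

print("avg bias^2:", bias2.mean(), " avg variance:", var.mean())
```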
𝐸𝑜𝑢𝑡 on Test Point 𝒙 for Data 𝒟
Slide motivation: Malik Magdon-Ismail
E_out^𝒟(𝒙) = (g_n^𝒟(𝒙) − f(𝒙))²
  (squared error, a random value depending on 𝒟)
E_out(𝒙) = 𝔼_𝒟[E_out^𝒟(𝒙)] = 𝔼_𝒟[(g_n^𝒟(𝒙) − f(𝒙))²]
  (expected value of the above random variable)
[Figure: g_n^𝒟(𝒙) versus f(𝒙) and the resulting E_out^𝒟(𝒙), shown for ℋ0 and ℋ1]
The Bias-Variance Decomposition
E_out(𝒙) = 𝔼_𝒟[(g_n^𝒟(𝒙) − f(𝒙))²]
         = 𝔼_𝒟[f(𝒙)² − 2 f(𝒙) g_n^𝒟(𝒙) + g_n^𝒟(𝒙)²]
         = f(𝒙)² − 2 f(𝒙) 𝔼_𝒟[g_n^𝒟(𝒙)] + 𝔼_𝒟[g_n^𝒟(𝒙)²]
         = f(𝒙)² − 2 f(𝒙) ḡ(𝒙) + 𝔼_𝒟[g_n^𝒟(𝒙)²]
         = f(𝒙)² − ḡ(𝒙)² + ḡ(𝒙)² − 2 f(𝒙) ḡ(𝒙) + 𝔼_𝒟[g_n^𝒟(𝒙)²]
         = (f(𝒙) − ḡ(𝒙))²  +  (𝔼_𝒟[g_n^𝒟(𝒙)²] − ḡ(𝒙)²)
               Bias                  Variance

E_out(𝒙) = Bias + Variance
Slide motivation: Malik Magdon-Ismail
Bias-Variance to Overfitting-Underfitting
• Suppose the true underlying function is 𝑓(𝒙).
• But the observations (data points) are noisy, i.e., what we observe is f(𝒙) + ε
[Figure: noisy observations scattered around the true function f(𝒙)]
The Extreme Cases of Bias and Variance - Under-fitting
Slide motivation: John A. Bullinaria
A good way to understand the concepts of bias and variance is by
considering the two extreme cases of what a neural network might learn.
Suppose the neural network is lazy and just produces the same
constant output whatever training data we give it, i.e. g_n^𝒟(𝒙) = c. Then
  E_out(𝒙) = (f(𝒙) − ḡ(𝒙))² + 𝔼_𝒟[g_n^𝒟(𝒙)²] − ḡ(𝒙)²
           = (f(𝒙) − c)² + 𝔼_𝒟[c²] − c²
           = (f(𝒙) − c)² + 0
In this case the variance term will be zero, but the bias will be large,
because the network has made no attempt to fit the data. We say
we have extreme under-fitting
The Extreme Cases of Bias and Variance - Over-fitting
Slide motivation: John A. Bullinaria
On the other hand, suppose the neural network is very hard working and
makes sure that it exactly fits every data point, i.e. g_n^𝒟(𝒙) = f(𝒙) + ε.
Then ḡ(𝒙) = 𝔼_𝒟[g_n^𝒟(𝒙)] = 𝔼_𝒟[f(𝒙) + ε] = 𝔼_𝒟[f(𝒙)] + 0 = f(𝒙), and
  E_out(𝒙) = (f(𝒙) − ḡ(𝒙))² + 𝔼_𝒟[g_n^𝒟(𝒙)²] − ḡ(𝒙)²
           = (f(𝒙) − f(𝒙))² + 𝔼_𝒟[(g_n^𝒟(𝒙) − ḡ(𝒙))²]
           = 0 + 𝔼_𝒟[(f(𝒙) + ε − f(𝒙))²]
           = 0 + 𝔼_𝒟[ε²]
In this case the bias term will be zero, but the variance is the
square of the noise on the data, which could be substantial. In this
case we say we have extreme over-fitting.
Bias-Variance Trade-off
[Figure: training error and test error versus the number of data points N, with the overfitting and underfitting regimes marked]
Vector/Matrix Calculus
• If f(𝒙) ∈ ℝ and 𝒙 ∈ ℝⁿ, then
  ∇_𝒙 f ≜ [∂f/∂x₁  ∂f/∂x₂  ⋯  ∂f/∂xₙ]ᵀ
  is called the gradient of f(𝒙)
• If 𝒇(𝒙) = [f₁(𝒙) f₂(𝒙) ⋯ f_m(𝒙)]ᵀ ∈ ℝᵐ and 𝒙 ∈ ℝⁿ, then
  ∇_𝒙 𝒇 = [∂fᵢ/∂xⱼ]_{m×n}, the m×n matrix whose (i, j) entry is ∂fᵢ/∂xⱼ,
  is called the "Jacobian matrix" of 𝒇(𝒙) w.r.t. 𝒙
Vector/Matrix Calculus
• If f(𝒙) ∈ ℝ and 𝒙 ∈ ℝⁿ, then
  ∇²_𝒙 f = [∂²f/∂xᵢ∂xⱼ]_{n×n}, the n×n matrix whose (i, j) entry is ∂²f/∂xᵢ∂xⱼ
  (with diagonal entries ∂²f/∂xᵢ²),
  is called the "Hessian matrix" of f(𝒙) w.r.t. 𝒙
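The gradient, Jacobian and Hessian defined above are easy to approximate numerically, which is useful for checking hand-derived (or backpropagated) derivatives. Below is a minimal sketch, assuming central finite differences with an arbitrary step size h; the helper names and the example function are illustrative, not from the lecture.

```python
import numpy as np

def grad(f, x, h=1e-5):
    """Central-difference gradient of scalar f at x (x has shape (n,))."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def jacobian(f, x, h=1e-5):
    """Central-difference Jacobian (m x n) of vector-valued f at x."""
    m = np.atleast_1d(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

def hessian(f, x, h=1e-4):
    """Hessian (n x n) of scalar f at x, as the Jacobian of the gradient."""
    return jacobian(lambda z: grad(f, z, h), x, h)

# Example: f(x) = x1^2 * x2 + sin(x2)
f = lambda x: x[0]**2 * x[1] + np.sin(x[1])
x0 = np.array([1.0, 2.0])
print(grad(f, x0))      # approx [2*x1*x2, x1^2 + cos(x2)]
print(hessian(f, x0))   # approx [[2*x2, 2*x1], [2*x1, -sin(x2)]]
```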
Vector/Matrix Calculus
• Some standard results:
  • ∂(𝒃ᵀ𝒙)/∂𝒙 = ∂(𝒙ᵀ𝒃)/∂𝒙 = 𝒃
  • ∂(𝑨𝒙)/∂𝒙 = 𝑨
  • ∂(𝒙ᵀ𝑨𝒙)/∂𝒙 = (𝑨 + 𝑨ᵀ)𝒙;  if 𝑨 = 𝑨ᵀ, then ∂(½ 𝒙ᵀ𝑨𝒙)/∂𝒙 = 𝑨𝒙
• Product rule: for 𝒖 ∈ ℝᵐ, 𝒗 ∈ ℝᵐ, 𝒙 ∈ ℝⁿ,
  [∂(𝒖ᵀ𝒗)/∂𝒙]ᵀ_{1×n} = 𝒖ᵀ_{1×m} [∂𝒗/∂𝒙]_{m×n} + 𝒗ᵀ_{1×m} [∂𝒖/∂𝒙]_{m×n}
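As a quick sanity check (a sketch, not from the slides), these identities can be verified against a finite-difference gradient; the helper name, sizes, and random seed below are arbitrary choices.

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
b = rng.normal(size=n)
x = rng.normal(size=n)

print(np.allclose(num_grad(lambda z: z @ A @ z, x), (A + A.T) @ x, atol=1e-5))  # d(x^T A x)/dx
print(np.allclose(num_grad(lambda z: b @ z, x), b, atol=1e-5))                  # d(b^T x)/dx
print(np.allclose(num_grad(lambda z: z @ z, x), 2 * x, atol=1e-5))              # d(x^T x)/dx
```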
Vector/Matrix Calculus
• Some standard results:
• If F(f(𝒙)) ∈ ℝ, f(𝒙) ∈ ℝ, 𝒙 ∈ ℝⁿ, then
  [∂F/∂𝒙]_{n×1} = [∂f/∂𝒙]_{n×1} [∂F/∂f]_{1×1}
• If F(𝒇(𝒙)) ∈ ℝ, 𝒇(𝒙) ∈ ℝᵐ, 𝒙 ∈ ℝⁿ, then
  [∂F/∂𝒙]_{n×1} = [∂𝒇/∂𝒙]ᵀ_{n×m} [∂F/∂𝒇]_{m×1}
Vector/Matrix Calculus
• Derivatives of norms:
  • For the two-norm of a vector: ∂‖𝒙‖₂²/∂𝒙 = ∂(𝒙ᵀ𝒙)/∂𝒙 = 2𝒙
  • For the Frobenius norm of a matrix: ∂‖𝑿‖_F²/∂𝑿 = ∂ tr(𝑿𝑿ᵀ)/∂𝑿 = 2𝑿
Optimization Problem
• Problem statement:
  min f(𝒙)  s.t.  𝒙 ∈ χ
• Problem statement of convex optimization:
  min f(𝒙)  s.t.  𝒙 ∈ χ
  where f: ℝⁿ → ℝ is a convex function and χ is a convex set
Convex Sets and Functions
• Convex Set: A set 𝒞 ⊆ ℝⁿ is a convex set if for all 𝒙, 𝒚 ∈ 𝒞, the line
segment connecting 𝒙 and 𝒚 is in 𝒞, i.e.,
  ∀𝒙, 𝒚 ∈ 𝒞, θ ∈ [0,1]:  θ𝒙 + (1 − θ)𝒚 ∈ 𝒞
[Figure: a convex set 𝒞 with the segment θ𝒙 + (1 − θ)𝒚 = 𝒚 + θ(𝒙 − 𝒚), θ ∈ [0,1], joining 𝒙 and 𝒚]
Convex Sets and Functions
• Convex Function: A function f: ℝⁿ → ℝ is a convex function if its
domain dom(f) is a convex set and for all 𝒙, 𝒚 ∈ dom(f) and θ ∈ [0,1],
we have
  f(θ𝒙 + (1 − θ)𝒚) ≤ θ f(𝒙) + (1 − θ) f(𝒚)
• Example: all norms are convex functions, since
  ‖θ𝒙 + (1 − θ)𝒚‖ ≤ ‖θ𝒙‖ + ‖(1 − θ)𝒚‖ = θ‖𝒙‖ + (1 − θ)‖𝒚‖
Alternative Definition of Convexity for differentiable functions
Theorem (first-order characterization): Let f: ℝⁿ → ℝ be a differentiable
function whose dom(f) is convex. Then f is convex iff
  f(𝒚) ≥ f(𝒙) + ⟨∇f(𝒙), 𝒚 − 𝒙⟩   ∀𝒙, 𝒚 ∈ dom(f)   …(1)
Proof (part 1): Eqn (1) ⇒ Convex
Given Eqn (1): f(𝒚) ≥ f(𝒙) + ⟨∇f(𝒙), 𝒚 − 𝒙⟩
Let us consider 𝒙, 𝒚 ∈ dom(f) and θ ∈ [0,1]
Let 𝒛 = (1 − θ)𝒙 + θ𝒚
Now, by Eqn (1),
  f(𝒙) ≥ f(𝒛) + ⟨∇f(𝒛), 𝒙 − 𝒛⟩ … (2), taking the pair 𝒙, 𝒛
  f(𝒚) ≥ f(𝒛) + ⟨∇f(𝒛), 𝒚 − 𝒛⟩ … (3), taking the pair 𝒚, 𝒛
Multiplying Eqn (2) by (1 − θ), we get
  (1 − θ) f(𝒙) ≥ (1 − θ) f(𝒛) + (1 − θ) ⟨∇f(𝒛), 𝒙 − 𝒛⟩ … (4)
and Eqn (3) by θ,
  θ f(𝒚) ≥ θ f(𝒛) + θ ⟨∇f(𝒛), 𝒚 − 𝒛⟩ … (5)
Now, combining Eqns (4) and (5), we have
  (1 − θ) f(𝒙) + θ f(𝒚) ≥ (1 − θ) f(𝒛) + θ f(𝒛) + (1 − θ) ⟨∇f(𝒛), 𝒙 − 𝒛⟩ + θ ⟨∇f(𝒛), 𝒚 − 𝒛⟩
⇒ (1 − θ) f(𝒙) + θ f(𝒚) ≥ f(𝒛) − θ f(𝒛) + θ f(𝒛) + ⟨∇f(𝒛), (1 − θ)(𝒙 − 𝒛)⟩ + ⟨∇f(𝒛), θ(𝒚 − 𝒛)⟩
⇒ (1 − θ) f(𝒙) + θ f(𝒚) ≥ f(𝒛) + ⟨∇f(𝒛), (1 − θ)(𝒙 − 𝒛) + θ(𝒚 − 𝒛)⟩
The inner-product term is 0 (why?), because
  (1 − θ)(𝒙 − 𝒛) + θ(𝒚 − 𝒛) = (1 − θ)𝒙 − (1 − θ)𝒛 + θ𝒚 − θ𝒛
                             = (1 − θ)𝒙 − 𝒛 + θ𝒛 + θ𝒚 − θ𝒛
                             = (1 − θ)𝒙 + θ𝒚 − 𝒛 = 𝒛 − 𝒛 = 𝟎
So the right-hand side equals f(𝒛) = f((1 − θ)𝒙 + θ𝒚), which is exactly the definition of convexity.
Proof (part 2): Convex ⇒ Eqn (1)
Suppose f is convex, i.e., ∀𝒙, 𝒚 ∈ dom(f) and ∀θ ∈ [0,1],
  f((1 − θ)𝒙 + θ𝒚) ≤ (1 − θ) f(𝒙) + θ f(𝒚)
Equivalently,
  f(𝒙 + θ(𝒚 − 𝒙)) ≤ f(𝒙) + θ (f(𝒚) − f(𝒙))
⇒ f(𝒚) − f(𝒙) ≥ [f(𝒙 + θ(𝒚 − 𝒙)) − f(𝒙)] / θ
By taking the limit as θ → 0 on both sides and using the definition
of the derivative, we obtain
  f(𝒚) − f(𝒙) ≥ lim_{θ→0} [f(𝒙 + θ(𝒚 − 𝒙)) − f(𝒙)] / θ = ⟨∇f(𝒙), 𝒚 − 𝒙⟩
(How? Let us see in the case of 1-D x, y.)
  g = lim_{θ→0} [f(x + θ(y − x)) − f(x)] / θ
    = lim_{θ→0} {[f(x + θ(y − x)) − f(x)] / (θ(y − x))} (y − x)   [multiplying numerator and denominator by (y − x)]
Let θ(y − x) = h. As θ → 0, h → 0.
  ∴ g = lim_{h→0} {[f(x + h) − f(x)] / h} (y − x) = f′(x)(y − x)
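Condition (1) is also easy to probe numerically. The sketch below (illustrative only; the function, dimension, and sample count are arbitrary choices) checks both the chord definition of convexity and the first-order condition (1) at random point pairs for the log-sum-exp function, whose gradient is the softmax.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # log-sum-exp, a standard convex function
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def grad_f(x):                   # its gradient is the softmax of x
    e = np.exp(x - x.max())
    return e / e.sum()

ok_chord, ok_first_order = True, True
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    t = rng.uniform()
    # chord definition: f(t x + (1-t) y) <= t f(x) + (1-t) f(y)
    ok_chord &= f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12
    # first-order condition (1): f(y) >= f(x) + <grad f(x), y - x>
    ok_first_order &= f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
print(ok_chord, ok_first_order)  # both should print True
```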
Gradient Descent
min_{𝒘,𝒃} loss function, i.e., min_θ J(θ)
More formally,
For scalar θ, the condition is J′(θ) = ∂J/∂θ = 0
For higher dimensional 𝜽, the condition boils down to
  J′(𝜽) = ∇_𝜽 J = 𝟎
  ⇒ [∂J/∂θ₁, ∂J/∂θ₂, …, ∂J/∂θ_N]ᵀ = 𝟎
  i.e., every component ∂J/∂θᵢ is 0
[Figure: a 1-D cost curve J(θ) with the minimum at the point where J′(θ) = 0]
One Way to Find Minima – Gradient Descent
This is helpful but not always useful.
For example, J(𝜽) = log Σᵢ₌₁ᵐ e^(𝒂ᵢᵀ𝜽 + bᵢ) is a convex function with a
clear minimum, but finding an analytical solution is not easy:
  ∇_𝜽 J = 𝟎
  ⇒ [1 / Σᵢ₌₁ᵐ e^(𝒂ᵢᵀ𝜽 + bᵢ)] Σᵢ₌₁ᵐ e^(𝒂ᵢᵀ𝜽 + bᵢ) 𝒂ᵢ = 𝟎
  ⇒ Σᵢ₌₁ᵐ e^(𝒂ᵢᵀ𝜽 + bᵢ) 𝒂ᵢ = 𝟎
So, a numerical iterative solution is sought.
One Way to Find Minima – Gradient Descent
Start with an initial guess 𝜽⁰.
Repeatedly update 𝜽 by taking a small step: 𝜽ᵏ = 𝜽ᵏ⁻¹ + η Δ𝜽 … (1)
so that J(𝜽) gets smaller with each update, i.e., J(𝜽ᵏ) ≤ J(𝜽ᵏ⁻¹) … (2)
(1) implies
  J(𝜽ᵏ) = J(𝜽ᵏ⁻¹ + η Δ𝜽)
        = J(𝜽ᵏ⁻¹) + η (Δ𝜽)ᵀ J′(𝜽ᵏ⁻¹) + h.o.t.   [using the Taylor series expansion]
        ≈ J(𝜽ᵏ⁻¹) + η (Δ𝜽)ᵀ J′(𝜽ᵏ⁻¹)            [neglecting h.o.t.] … (3)
Combining (2) and (3),
  η (Δ𝜽)ᵀ J′(𝜽ᵏ⁻¹) ≤ 0, i.e., (Δ𝜽)ᵀ J′(𝜽ᵏ⁻¹) ≤ 0   [as η is positive]
So, for the update to reduce J(𝜽), i.e., to satisfy (2), we have to choose some Δ𝜽 whose dot product
with the gradient J′(𝜽ᵏ⁻¹) is negative.
Then why not choose Δ𝜽 = −J′(𝜽ᵏ⁻¹)?
Then (Δ𝜽)ᵀ J′(𝜽ᵏ⁻¹) = −‖J′(𝜽ᵏ⁻¹)‖², which is surely a negative quantity and satisfies the condition.
Gradient Descent
Start with an initial guess 𝜽⁰.
Repeatedly update 𝜽 by taking a small step: 𝜽ᵏ = 𝜽ᵏ⁻¹ − η J′(𝜽ᵏ⁻¹),
until convergence (i.e., J′(𝜽ᵏ⁻¹) is very small).
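As an illustration (a minimal sketch, not from the slides), the loop below runs exactly this update on the earlier log-sum-exp objective J(𝜽) = log Σᵢ e^(𝒂ᵢᵀ𝜽 + bᵢ); the synthetic data, learning rate η, and stopping threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 50, 3
A = rng.normal(size=(m, d))          # rows are a_i^T
b = rng.normal(size=m)

def J(theta):
    z = A @ theta + b
    zmax = z.max()
    return zmax + np.log(np.exp(z - zmax).sum())   # numerically stable log-sum-exp

def J_grad(theta):
    z = A @ theta + b
    w = np.exp(z - z.max())
    w /= w.sum()                     # softmax weights
    return A.T @ w                   # gradient of J

theta = np.zeros(d)                  # initial guess theta^0
eta = 0.01                           # learning rate
for k in range(20_000):
    g = J_grad(theta)
    if np.linalg.norm(g) < 1e-6:     # convergence: gradient is very small
        break
    theta = theta - eta * g          # theta^k = theta^{k-1} - eta * J'(theta^{k-1})

print(k, J(theta), np.linalg.norm(J_grad(theta)))
```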
Remember, in Neural Networks,
the loss is computed by averaging
the losses for all examples
But Imagine This
Adapted from Olivier Bousquet’s invited talk at NeurIPS 2018 after winning Test of Time Award for NIPS 2007 paper:
“The Trade-Offs of Large Scale Learning” by Leon Bottou and Olivier Bousquet. Link of the talk.
Batch, Stochastic and Minibatch
• Optimization algorithms that use the entire training set to compute the gradient are called batch or deterministic
gradient methods. Ones that use a single training example for that task are called stochastic or online gradient
methods
• Most of the algorithms we use for deep learning fall somewhere in between!
• These are called minibatch or minibatch stochastic methods
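The three regimes differ only in how many examples feed each gradient estimate. A minimal sketch for a synthetic least-squares problem (the data, model, and batch size are made up for illustration); setting batch_size = N gives the batch method and batch_size = 1 gives the stochastic/online version:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1024, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # synthetic regression data

def grad_on(theta, idx):
    """Gradient of the mean squared error over the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / len(idx)

theta, eta, batch_size = np.zeros(d), 0.05, 32
for epoch in range(20):
    perm = rng.permutation(N)                 # reshuffle once per epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]  # the current minibatch
        theta -= eta * grad_on(theta, idx)

print(theta)
```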
Batch, Stochastic and Mini-batch Stochastic Gradient Descent
Images courtesy: Shubhendu Trivedi et al., Goodfellow et al.
[Figure: algorithm boxes for batch, stochastic, and mini-batch stochastic gradient descent]
Batch and Stochastic Gradient Descent
Images courtesy: Shubhendu Trivedi et. al.
Gradient Descent: only one force acting at any point.
Momentum: Two forces acting at any point.
i) Momentum built up due to gradients pushing the particle at that point
ii) Gradient computed at that point
[Figure: at a point, the actual step is the sum of (i) the momentum built up at that point and (ii) the negative of the gradient at that point, i.e., where the step would have reached without momentum]
Images courtesy: Goodfellow et. al.
Stochastic Gradient Descent with Momentum
Images courtesy: Goodfellow et. al.
Nesterov Momentum
• The difference between Nesterov momentum and standard momentum is where the gradient is
evaluated.
• With Nesterov momentum the gradient is evaluated after the current velocity is applied.
• In practice, Nesterov momentum speeds up the convergence only for well behaved loss functions (convex
with consistent curvature)
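To make the two update rules concrete, here is a minimal sketch (illustrative, not the course's reference implementation; the hyperparameters and test function are arbitrary) of SGD with classical momentum and the Nesterov variant, written with a velocity vector v that accumulates past gradients:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, eta=0.01, alpha=0.9, nesterov=False, steps=1000):
    """grad_fn(theta) returns the (mini-batch) gradient at theta."""
    v = np.zeros_like(theta)                # velocity built up from past gradients
    for _ in range(steps):
        if nesterov:
            g = grad_fn(theta + alpha * v)  # gradient evaluated after the current velocity is applied
        else:
            g = grad_fn(theta)              # gradient evaluated at the current point
        v = alpha * v - eta * g             # combine momentum and (negative) gradient
        theta = theta + v                   # actual step
    return theta

# usage on a badly conditioned quadratic J(theta) = 0.5 * theta^T A theta
A = np.diag([1.0, 50.0])                    # one shallow and one steep direction
grad_fn = lambda th: A @ th
print(sgd_momentum(grad_fn, np.array([1.0, 1.0])))
print(sgd_momentum(grad_fn, np.array([1.0, 1.0]), nesterov=True))
```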
Distill Pub Article on Why Momentum Really Works
Adaptive Learning Rate Methods
• Till now we have assigned the same learning rate in all directions.
• Would it not be a good idea if we could move slowly in a steeper direction and move faster in a shallower direction?
Images courtesy: Karpathy et. al.
AdaGrad
• Downscale the learning rate by the square root of the sum of squares of all the historical gradient values
• Parameters that have large partial derivatives of the loss have their learning rates rapidly decreased
Images courtesy: Shubhendu Trivedi et. al.
AdaGrad
• Gradients at successive steps: 𝒈⁽¹⁾ = [g_(1),1, g_(1),2, …, g_(1),d]ᵀ, 𝒈⁽²⁾ = [g_(2),1, g_(2),2, …, g_(2),d]ᵀ, …, 𝒈⁽ᵗ⁾ = [g_(t),1, g_(t),2, …, g_(t),d]ᵀ
• Accumulated squared gradients, coordinate by coordinate:
  𝒓⁽¹⁾ = [g_(1),1², …, g_(1),d²]ᵀ
  𝒓⁽²⁾ = [g_(1),1² + g_(2),1², …, g_(1),d² + g_(2),d²]ᵀ
  …
  𝒓⁽ᵗ⁾ = [g_(1),1² + g_(2),1² + ⋯ + g_(t),1², …, g_(1),d² + g_(2),d² + ⋯ + g_(t),d²]ᵀ
• Update, for each coordinate j = 1, …, d:
  Δθ_(t),j = −ε g_(t),j / (δ + √(g_(1),j² + g_(2),j² + ⋯ + g_(t),j²))
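In vector form the accumulation is 𝒓⁽ᵗ⁾ = 𝒓⁽ᵗ⁻¹⁾ + 𝒈⁽ᵗ⁾ ⊙ 𝒈⁽ᵗ⁾, followed by a coordinate-wise scaled step. A minimal sketch of one AdaGrad update (the hyperparameter values are illustrative):

```python
import numpy as np

def adagrad_step(theta, g, r, eps=0.01, delta=1e-7):
    """One AdaGrad update; r accumulates the squared gradients."""
    r = r + g * g                                    # r^(t) = r^(t-1) + g^(t) * g^(t), elementwise
    theta = theta - eps * g / (delta + np.sqrt(r))   # per-coordinate scaled step
    return theta, r

# usage on a badly conditioned quadratic
A = np.diag([1.0, 50.0])
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(2000):
    theta, r = adagrad_step(theta, A @ theta, r)
print(theta)
```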
RMSProp
• One problem of AdaGrad is that the ‘r’ vector continues to build up and grow its value.
• This shrinks the learning rate too aggressively.
• RMSProp strikes a balance by exponentially decaying contributions from past gradients.
Images courtesy: Shubhendu Trivedi et. al.
RMSProp
• Given gradients 𝒈⁽¹⁾, 𝒈⁽²⁾, …, 𝒈⁽ᵗ⁾, define 𝒓⁽ᵗ⁾ = ρ 𝒓⁽ᵗ⁻¹⁾ + (1 − ρ) 𝒈⁽ᵗ⁾ ⊙ 𝒈⁽ᵗ⁾, with 𝒓⁽⁰⁾ = 0
• 𝒓⁽¹⁾ = ρ 𝒓⁽⁰⁾ + (1 − ρ) 𝒈⁽¹⁾ ⊙ 𝒈⁽¹⁾ = (1 − ρ) 𝒈⁽¹⁾ ⊙ 𝒈⁽¹⁾
• 𝒓⁽²⁾ = ρ 𝒓⁽¹⁾ + (1 − ρ) 𝒈⁽²⁾ ⊙ 𝒈⁽²⁾ = ρ(1 − ρ) 𝒈⁽¹⁾ ⊙ 𝒈⁽¹⁾ + (1 − ρ) 𝒈⁽²⁾ ⊙ 𝒈⁽²⁾
• 𝒓⁽³⁾ = ρ 𝒓⁽²⁾ + (1 − ρ) 𝒈⁽³⁾ ⊙ 𝒈⁽³⁾ = ρ²(1 − ρ) 𝒈⁽¹⁾ ⊙ 𝒈⁽¹⁾ + ρ(1 − ρ) 𝒈⁽²⁾ ⊙ 𝒈⁽²⁾ + (1 − ρ) 𝒈⁽³⁾ ⊙ 𝒈⁽³⁾
• RMSProp uses an exponentially decaying average to discard history from the extreme past, so that the
accumulation of gradients does not stall the learning.
• AdaDelta is another variant where, instead of an exponentially decaying average, a moving window average
of the past gradients is taken.
• Nesterov acceleration can also be applied to both these variants by computing the gradients at a 'look
ahead' position (i.e., at the place where the momentum would have taken the parameters).
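A minimal sketch of the corresponding RMSProp update (illustrative values; it mirrors the AdaGrad step above, with the running sum replaced by the exponentially decaying average):

```python
import numpy as np

def rmsprop_step(theta, g, r, eps=0.01, rho=0.9, delta=1e-6):
    """One RMSProp update with decay rate rho."""
    r = rho * r + (1 - rho) * g * g                  # exponentially decaying average of g * g
    theta = theta - eps * g / (delta + np.sqrt(r))   # per-coordinate scaled step
    return theta, r
```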
ADAM (Adaptive Moments)
• Variant of the combination of RMSProp and Momentum.
• Incorporates first order moment of the gradient which can be thought of as equivalent to taking
advantage of the momentum strategy. Here momentum is also added with exponential averaging.
• It also incorporates the second order term which can be thought of as the RMSProp like exponential
averaging of the past gradients.
• Both first and second moments are corrected for bias to account for their initialization to zero.
ADAM (Adaptive Moments)
• Biased first-order moment:  𝒔⁽ᵗ⁾ = ρ₁ 𝒔⁽ᵗ⁻¹⁾ + (1 − ρ₁) 𝒈⁽ᵗ⁾
• Biased second-order moment: 𝒓⁽ᵗ⁾ = ρ₂ 𝒓⁽ᵗ⁻¹⁾ + (1 − ρ₂) 𝒈⁽ᵗ⁾ ⊙ 𝒈⁽ᵗ⁾
• Bias-corrected first-order moment:  ŝ⁽ᵗ⁾ = 𝒔⁽ᵗ⁾ / (1 − ρ₁ᵗ)
• Bias-corrected second-order moment: r̂⁽ᵗ⁾ = 𝒓⁽ᵗ⁾ / (1 − ρ₂ᵗ)
• Weight update: Δ𝜽⁽ᵗ⁾ = −ε ŝ⁽ᵗ⁾ / (√r̂⁽ᵗ⁾ + δ)   (operations applied elementwise)
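Putting the pieces together, a minimal sketch of one Adam step (the hyperparameter values are shown only for illustration):

```python
import numpy as np

def adam_step(theta, g, s, r, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update at step t (t starts at 1)."""
    s = rho1 * s + (1 - rho1) * g            # biased first moment
    r = rho2 * r + (1 - rho2) * g * g        # biased second moment
    s_hat = s / (1 - rho1 ** t)              # bias corrections for the zero initialization
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# usage: theta, s, r = adam_step(theta, grad_fn(theta), s, r, t)
```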
ADAM (Adaptive Moments)
Images courtesy: Shubhendu Trivedi et. al.
Visualization
animation courtesy: Alec Radford
Find more animations at
https://tinyurl.com/y6tkf4f8
Thank You!!
More Related Content

Similar to 03_Optimization (1).pptx

Deep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalDeep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalEdwin Efraín Jiménez Lepe
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxVenkateswaraBabuRavi
 
Machine Learning, K-means Algorithm Implementation with R
Machine Learning, K-means Algorithm Implementation with RMachine Learning, K-means Algorithm Implementation with R
Machine Learning, K-means Algorithm Implementation with RIRJET Journal
 
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...Chris Rackauckas
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1arogozhnikov
 
Data Science for Number and Coding Theory
Data Science for Number and Coding TheoryData Science for Number and Coding Theory
Data Science for Number and Coding TheoryCapgemini
 
Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Hyun Wong Choi
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
Restricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for AttributionRestricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for Attributiontaeseon ryu
 
What if computers invigilate examinations - Cypher 2018
What if computers invigilate examinations - Cypher 2018What if computers invigilate examinations - Cypher 2018
What if computers invigilate examinations - Cypher 2018Gourab Nath
 
A novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classifyA novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classifyiaemedu
 
A novel approach for satellite imagery storage by classifying the non duplica...
A novel approach for satellite imagery storage by classifying the non duplica...A novel approach for satellite imagery storage by classifying the non duplica...
A novel approach for satellite imagery storage by classifying the non duplica...IAEME Publication
 
Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...
Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...
Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...AIST
 

Similar to 03_Optimization (1).pptx (20)

Deep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalDeep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image Retrieval
 
07_BayesianClassifier.pptx
07_BayesianClassifier.pptx07_BayesianClassifier.pptx
07_BayesianClassifier.pptx
 
Imitation Learning
Imitation LearningImitation Learning
Imitation Learning
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Machine Learning, K-means Algorithm Implementation with R
Machine Learning, K-means Algorithm Implementation with RMachine Learning, K-means Algorithm Implementation with R
Machine Learning, K-means Algorithm Implementation with R
 
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
Generalizing Scientific Machine Learning and Differentiable Simulation Beyond...
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
Data Science for Number and Coding Theory
Data Science for Number and Coding TheoryData Science for Number and Coding Theory
Data Science for Number and Coding Theory
 
Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2
 
Features.pdf
Features.pdfFeatures.pdf
Features.pdf
 
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 
Restricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for AttributionRestricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for Attribution
 
What if computers invigilate examinations - Cypher 2018
What if computers invigilate examinations - Cypher 2018What if computers invigilate examinations - Cypher 2018
What if computers invigilate examinations - Cypher 2018
 
Ml intro
Ml introMl intro
Ml intro
 
Ml intro
Ml introMl intro
Ml intro
 
lecture7.pptx
lecture7.pptxlecture7.pptx
lecture7.pptx
 
A novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classifyA novel approach for satellite imagery storage by classify
A novel approach for satellite imagery storage by classify
 
A novel approach for satellite imagery storage by classifying the non duplica...
A novel approach for satellite imagery storage by classifying the non duplica...A novel approach for satellite imagery storage by classifying the non duplica...
A novel approach for satellite imagery storage by classifying the non duplica...
 
Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...
Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...
Artem Baklanov - Votes Aggregation Techniques in Geo-Wiki Crowdsourcing Game:...
 

Recently uploaded

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2
 

Recently uploaded (20)

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 

03_Optimization (1).pptx

  • 1. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Deep Learning CS60010 Abir Das Computer Science and Engineering Department Indian Institute of Technology Kharagpur http://cse.iitkgp.ac.in/~adas/
  • 2. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Agenda • Understand basics of Matrix/Vector Calculus and Optimization concepts to be used in the course. • Understand different types or errors in learning 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 2
  • 3. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Resources • ”Deep Learning”, I. Goodfellow, Y. Bengio, A. Courville. (Chapter 8) 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 3
  • 4. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Optimization and Deep Learning- Connections • Deep learning (machine learning, in general) involves optimization in many contexts • The goal is to find parameters 𝜽 of a neural network that significantly reduce a cost function or objective function 𝐽 𝜽 . • Gradient based optimization is the most popular way for training Deep Neural Networks. • There are other ways too, e.g., evolutionary or derivative free optimization, but they come with issues particularly crucial for neural network training. • Its easy to spend a semester on optimization. Thus, these few lectures will only be a scratch on a very small part of the surface. 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 4
  • 5. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Optimization and Deep Learning- Differences • In learning we care about some performance measure 𝑃 (e.g., image classification accuracy, language translation accuracy etc.) on test set, but we minimize a different cost function 𝐽(𝜽) on training set, with the hope that doing so will improve 𝑃 • This is in contrast to pure optimization where minimizing 𝐽(𝜽) is a goal in itself • Lets see what types of errors creep in as a result 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 5
  • 6. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in • Training data is 𝒙(𝑖), 𝑦(𝑖): 𝑖 = 1, … , 𝑁 coming from probability distribution 𝒟(𝒙, 𝑦) • Let the neural network learns the output functions as 𝑔𝒟 (𝒙) and the loss is denoted as 𝑙(𝑔𝒟 𝒙 , 𝑦) • What we want to minimize is the expected risk E 𝑔𝒟 = 𝑙 𝑔𝒟 𝒙 , 𝑦 𝑑𝒟 𝒙, 𝑦 = 𝔼 𝑙 𝑔𝒟 𝒙 , 𝑦 • If we could minimize this risk we would have got the true optimal function 𝑓 = arg min 𝑔𝒟 E 𝑔𝒟 • But we don’t know the actual distribution that generates the data. So, what we actually minimize is the empirical risk En 𝑔𝒟 = 1 𝑁 𝑖=1 𝑁 𝑙 𝑔𝒟 𝒙𝒊 , 𝑦𝑖 = 𝔼𝑛 𝑙 𝑔𝒟 𝒙 , 𝑦 • We choose a family ℋof candidate prediction functions and find the function that minimizes the empirical risk 𝑔n 𝒟 = arg min 𝑔𝒟∈ℋ En 𝑔𝒟 • Since 𝑓 may not be found in the family ℋ, we also define 𝑔ℋ∗ 𝒟 = arg min 𝑔𝒟∈ℋ E 𝑔𝒟 Expected and Empirical Risk 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 6
  • 7. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Minimizes Staying within 𝑓 expected risk No constraints on function family • True data distribution known • Family of functions exhaustive 𝑔ℋ∗ 𝒟 expected risk family of functions ℋ • True data distribution known • Family of functions not exhaustive 𝑔n 𝒟 empirical risk family of functions ℋ • True data distribution not known • Family of functions not exhaustive Sources of Error • True data generation procedure not known • Family of functions [or hypothesis] to try is not exhaustive • (And not to forget) we rely on a surrogate loss in place of the true classification error rate 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 7
  • 8. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in A Simple Learning Problem 2 Data Points. 2 hypothesis sets: x y x y Slide courtesy: Malik Magdon-Ismail ℋ0: ℎ 𝑥 = 𝑏 ℋ1: ℎ(𝑥) = 𝑎𝑥 + 𝑏 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 8
  • 9. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Let’s Repeat the Experiment Many Times Slide courtesy: Malik Magdon-Ismail x y x y • For each data set 𝒟 you get a different 𝑔𝑛 𝒟 • For a fixed 𝒙, 𝑔𝑛 𝒟(𝒙) is random value, depending on 𝒟 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 9
  • 10. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in What’s Happening on Average Slide courtesy: Malik Magdon-Ismail x y ḡ(x) sin(x) x y ḡ(x) sin(x) We can define 𝑔𝑛 𝒟(𝒙) Random value, depending on 𝒟 Your average prediction on 𝒙 var 𝒙 = 𝔼𝒟 𝑔𝑛 𝒟 𝒙 − 𝑔 𝒙 2 = 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 − 𝑔 𝒙 2 How variable is your prediction 𝑔 𝒙 = 𝔼𝒟 𝑔𝑛 𝒟(𝒙) = 1 𝐾 𝑔𝑛 𝒟1 𝒙 +𝑔𝑛 𝒟2 𝒙 + ⋯ + 𝑔𝑛 𝒟𝐾 (𝒙) 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 10
  • 11. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in 𝐸𝑜𝑢𝑡 on Test Point 𝒙 for Data 𝒟 Slide motivation: Malik Magdon-Ismail Squared error, a random value depending on 𝒟 𝐸𝑜𝑢𝑡 𝒙 = 𝔼𝒟 𝐸𝑜𝑢𝑡 𝒟 𝒙 = 𝔼𝒟 𝑔𝑛 𝒟 𝒙 − 𝑓 𝒙 2 Expected value of the above random variable x f (x) Eo D ut(x) x f (x) Eo D ut(x) 𝐸𝑜𝑢𝑡 𝒟 𝒙 = 𝑔𝑛 𝒟 𝒙 − 𝑓 𝒙 2 𝑔𝑛 𝒟 𝒙 𝑔𝑛 𝒟 𝒙 07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 11
  • 12. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in The Bias-Variance Decomposition 𝐸𝑜𝑢𝑡 𝒙 = 𝔼𝒟 𝑔𝑛 𝒟 𝒙 − 𝑓 𝒙 2 = 𝔼𝒟 𝑓 𝒙 2 − 2𝑓 𝒙 𝑔𝑛 𝒟 𝒙 + 𝑔𝑛 𝒟 𝒙 2 = 𝑓 𝒙 2 − 2𝑓 𝒙 𝔼𝒟 𝑔𝑛 𝒟 𝒙 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 = 𝑓 𝒙 2 − 2𝑓 𝒙 𝑔 𝒙 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 = 𝑓 𝒙 2 − 𝑔 𝒙 2 + 𝑔 𝒙 2 − 2𝑓 𝒙 𝑔 𝒙 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 = 𝑓 𝒙 − 𝑔 𝒙 2 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 − 𝑔 𝒙 2 Slide motivation: Malik Magdon-Ismail Bias Variance 𝐸𝑜𝑢𝑡 𝒙 = Bias Variance + 07, 12, 13 Jan, 2022 12 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 13. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Bias-Variance to Overfitting-Underfitting • Suppose the true underlying function is 𝑓(𝒙). • But the observations (data points) are noisy i.e., 𝑓 𝒙 + 𝜀 𝑓(𝒙) 07, 12, 13 Jan, 2022 13 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 14. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in The Extreme Cases of Bias and Variance - Under-fitting Slide motivation: John A. Bullinaria A good way to understand the concepts of bias and variance is by considering the two extreme cases of what a neural network might learn. Suppose the neural network is lazy and just produces the same constant output whatever training data we give it, i.e. 𝑔𝑛 𝒟 (𝒙)=c, then 𝐸𝑜𝑢𝑡 𝒙 = 𝑓 𝒙 − 𝑔 𝒙 2 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 − 𝑔 𝒙 2 = 𝑓 𝒙 − 𝑐 2 + 𝔼𝒟 𝒄2 − 𝒄2 = 𝑓 𝒙 − 𝑐 2 + 0 In this case the variance term will be zero, but the bias will be large, because the network has made no attempt to fit the data. We say we have extreme under-fitting 07, 12, 13 Jan, 2022 14 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 15. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in The Extreme Cases of Bias and Variance - Over-fitting Slide motivation: John A. Bullinaria On the other hand, suppose the neural network is very hard working and makes sure that it exactly fits every data point, i.e. 𝑔𝑛 𝒟 𝒙 = 𝑓 𝒙 + 𝜀, then, 𝑔 𝒙 = 𝔼𝓓 𝑔𝑛 𝒟 𝒙 = 𝔼𝓓 𝑓 𝒙 + 𝜀 = 𝔼𝓓 𝑓 𝒙 + 0 = 𝑓 𝒙 In this case the bias term will be zero, but the variance is the square of the noise on the data, which could be substantial. In this case we say we have extreme over-fitting. 𝐸𝑜𝑢𝑡 𝒙 = 𝑓 𝒙 − 𝑔 𝒙 2 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 2 − 𝑔 𝒙 2 = 𝑓 𝒙 − 𝑓 𝒙 2 + 𝔼𝒟 𝑔𝑛 𝒟 𝒙 − 𝑔(𝒙) 2 = 0 + 𝔼𝒟 𝑓 𝒙 + 𝜀 − 𝑓 𝒙 2 = 0 + 𝔼𝒟 𝜺2 07, 12, 13 Jan, 2022 15 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 16. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Bias-Variance Trade-off Number ofData Points, N Error Test Error Training Error Overfitting Underfitting 07, 12, 13 Jan, 2022 16 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 17. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Vector/Matrix Calculus • If 𝑓 𝒙 ∈ ℝ and 𝒙 ∈ ℝ𝑛, then 𝛻𝒙𝑓 ≜ 𝜕𝑓 𝜕𝑥1 𝜕𝑓 𝜕𝑥2 ⋯ 𝜕𝑓 𝜕𝑥𝑛 𝑇 is called the gradient of 𝑓(𝒙) • If 𝑓 𝒙 = 𝑓1(𝑥) 𝑓2(𝑥) ⋯ 𝑓𝑚(𝑥) 𝑇 ∈ ℝ𝑚and 𝒙 ∈ ℝ𝑛, then • 𝛻𝒙𝑓 = 𝜕𝑓1 𝜕𝑥1 𝜕𝑓1 𝜕𝑥2 ⋯ 𝜕𝑓1 𝜕𝑥𝑛 𝜕𝑓2 𝜕𝑥1 𝜕𝑓2 𝜕𝑥2 ⋯ 𝜕𝑓2 𝜕𝑥𝑛 . . . . 𝜕𝑓𝑚 𝜕𝑥1 𝜕𝑓𝑚 𝜕𝑥2 ⋯ 𝜕𝑓𝑚 𝜕𝑥𝑛 𝑚×𝑛 • is called “Jacobian matrix” of 𝑓 𝒙 w.r.t. 𝒙 07, 12, 13 Jan, 2022 17 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 18. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Vector/Matrix Calculus • If 𝑓 𝒙 ∈ ℝ and 𝒙 ∈ ℝ𝑛, then • 𝛻𝒙 2𝑓 = 𝜕2𝑓 𝜕𝑥1 2 𝜕2𝑓 𝜕𝑥1𝑥2 ⋯ 𝜕2𝑓 𝜕𝑥1𝑥𝑛 𝜕2𝑓 𝜕𝑥1𝑥2 𝜕2𝑓 𝜕𝑥2 2 ⋯ 𝜕2𝑓 𝜕𝑥2𝑥𝑛 . . . . 𝜕2𝑓 𝜕𝑥1𝑥𝑛 𝜕2𝑓 𝜕𝑥2𝑥𝑛 ⋯ 𝜕2𝑓 𝜕𝑥𝑛 2 • is called the “Hessian matrix” of 𝑓 𝒙 w.r.t. 𝒙 07, 12, 13 Jan, 2022 18 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 19. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Vector/Matrix Calculus • Some standard results: • 𝜕 𝜕𝒙 𝒃𝑇 𝒙 = 𝜕 𝜕𝒙 𝒙𝑇 𝒃 = 𝒃 • 𝜕 𝜕𝒙 𝑨𝒙 = 𝑨 • 𝜕 𝜕𝒙 𝒙𝑇𝑨𝒙 = 𝑨 + 𝑨𝑻 𝒙; if 𝑨 = 𝑨𝑇, 𝜕 𝜕𝒙 1 2 𝒙𝑇𝑨𝒙 = 𝑨𝒙 • Product rule: 𝒖 ∈ ℝ𝑚, 𝒗 ∈ ℝ𝑚, 𝒙 ∈ ℝ𝑛 • 𝜕𝒖𝑻𝒗 𝜕𝒙 𝑻 1×𝑛 = 𝒖𝑇 1×𝑚 𝜕𝒗 𝜕𝒙 𝑚×𝑛 + 𝒗𝑇 1×𝑚 𝜕𝒖 𝜕𝒙 𝑚×𝑛 07, 12, 13 Jan, 2022 19 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 20. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Vector/Matrix Calculus • Some standard results: • If 𝐹 𝑓 𝒙 ∈ ℝ, 𝑓 𝒙 ∈ ℝ, 𝒙 ∈ ℝ𝑛, then • 𝜕𝐹 𝜕𝒙 𝑛×1 = 𝜕𝑓 𝜕𝒙 𝑛×1 𝜕𝐹 𝜕𝑓 1×1 • If 𝐹 𝒇 𝒙 ∈ ℝ, 𝒇 𝒙 ∈ ℝ𝑚, 𝒙 ∈ ℝ𝑛, then • 𝜕𝐹 𝜕𝒙 𝑛×1 = 𝜕𝒇 𝜕𝒙 𝑛×𝑚 𝑇 𝜕𝐹 𝜕𝒇 𝑚×1 07, 12, 13 Jan, 2022 20 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 21. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Vector/Matrix Calculus • Derivatives of norms: • For two norm of a vector 𝜕 𝜕𝒙 𝒙 2 2 = 𝜕 𝜕𝒙 𝒙𝑇𝒙 = 2𝒙 • For Frobenius norm of a matrix 𝜕 𝜕𝑿 𝑿 𝐹 2 = 𝜕 𝜕𝑿 𝑡𝑟(𝑿𝑿𝑇) = 2𝑿 07, 12, 13 Jan, 2022 21 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 22. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Optimization Problem • Problem Statement: min 𝑠.𝑡. 𝒙∈𝜒 𝑓(𝒙) • Problem statement of convex optimization: min 𝑠.𝑡. 𝒙∈𝜒 𝑓(𝒙) • with 𝑓: ℝ𝑛 → ℝ is a convex function and • 𝜒 is a convex set 07, 12, 13 Jan, 2022 22 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 23. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in • Convex Set: A set 𝒞 ⊆ ℝ𝑛 is a convex set if for all 𝑥, 𝑦 ∈ 𝒞, the line segment connecting 𝑥 and 𝑦 is in 𝒞 , i.e., ∀𝒙, 𝒚 ∈ 𝒞, 𝜃 ∈ 0,1 , 𝜃𝒙 + 1 − 𝜃 𝒚 ∈ 𝒞 Convex Sets and Functions 𝒙 𝒚 𝜃𝒙 + 1 − 𝜃 𝒚, 𝜃 ∈ [0,1] 𝒞 𝜃𝒙 + 1 − 𝜃 𝑦 = 𝒚 + 𝜃(𝒙 − 𝒚) 07, 12, 13 Jan, 2022 23 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 24. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in • Convex Function: A function 𝑓: ℝ𝑛 → ℝ is a convex function if its domain 𝑑𝑜𝑚(𝑓) is a convex set and for all 𝑥, 𝑦 ∈ 𝑑𝑜𝑚(𝑓), and 𝜃 ∈ 0,1 , we have 𝑓 𝜃𝒙 + 1 − 𝜃 𝒚 ≤ 𝜃𝑓(𝒙) + 1 − 𝜃 𝑓(𝒚) Convex Sets and Functions 𝜃𝒙 + 1 − 𝜃 𝒚 ≤ 𝜃𝒙 + 1 − 𝜃 𝒚 = 𝜃 𝒙 + 1 − 𝜃 𝒚 All norms are convex functions 07, 12, 13 Jan, 2022 24 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 25. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Alternative Definition of Convexity for differentiable functions Theorem: (first order characterization) Let 𝑓: ℝ𝑛 → ℝ be a differentiable function whose dom(f) is convex. Then, f is convex iff 𝑓 𝒚 ≥ 𝑓 𝒙 + 𝛻𝑓 𝒙 , 𝒚 − 𝒙 ∀𝒙, 𝒚 ∈ 𝑑𝑜𝑚 𝑓 …(1) 07, 12, 13 Jan, 2022 25 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 26. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Proof (part 1): 𝐸𝑞𝑛 1 ⇒ 𝐶𝑜𝑛𝑣𝑒𝑥 07, 12, 13 Jan, 2022 Given 𝐸𝑞𝑛 1 : 𝑓 𝒚 ≥ 𝑓 𝒙 + 𝛻𝑓 𝒙 , 𝒚 − 𝒙 Let us consider 𝒙, 𝒚 ∈ 𝑑𝑜𝑚 𝑓 and 𝜃 ∈ 0,1 Let 𝒛 = 1 − 𝜃 𝒙 + 𝜃𝒚 Now, By 𝑒𝑞𝑛 1 , 𝑓 𝒙 ≥ 𝑓 𝒛 + 𝛻𝑓 𝒛 , 𝒙 − 𝒛 … (2), taking 𝒙, 𝐳 𝑓 𝒚 ≥ 𝑓 𝒛 + 𝛻𝑓 𝒛 , 𝒚 − 𝒛 … (3) , taking 𝒚, 𝐳 CS60010 / Deep Learning | Optimization (c) Abir Das 26
  • 27. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in 07, 12, 13 Jan, 2022 Multiplying 𝐸𝑞𝑛 2 with 1 − 𝜃 , we get, (1 − 𝜃)𝑓 𝒙 ≥ (1 − 𝜃)𝑓 𝒛 + (1 − 𝜃) 𝛻𝑓 𝒛 , 𝒙 − 𝒛 …(4) and 𝐸𝑞𝑛 3 with 𝜃 𝜃𝑓 𝒚 ≥ 𝜃𝑓 𝒛 + 𝜃 𝛻𝑓 𝒛 , 𝒚 − 𝒛 … (5) Now combining 𝐸𝑞𝑛s 4 𝑎𝑛𝑑 (5), we have, 1 − 𝜃 𝑓 𝒙 + 𝜃𝑓 𝒚 ≥ 1 − 𝜃 𝑓 𝒛 + 𝜃𝑓 𝒛 + 1 − 𝜃 𝛻𝑓 𝒛 , 𝒙 − 𝒛 +𝜃 𝛻𝑓 𝒛 , 𝒚 − 𝒛 CS60010 / Deep Learning | Optimization (c) Abir Das 27
  • 28. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in ⇒ 1 − 𝜃 𝑓 𝒙 + 𝜃𝑓 𝒚 ≥ 𝑓 𝒛 − 𝜃𝑓 𝒛 + 𝜃𝑓 𝒛 + 𝛻𝑓 𝒛 , 1 − 𝜃 𝒙 − 𝒛 + 𝛻𝑓 𝒛 , 𝜃(𝒚 − 𝒛) 07, 12, 13 Jan, 2022 1 − 𝜃 𝑓 𝒙 + 𝜃𝑓 𝒚 ≥ 1 − 𝜃 𝑓 𝒛 + 𝜃𝑓 𝒛 + 1 − 𝜃 𝛻𝑓 𝒛 , 𝒙 − 𝒛 +𝜃 𝛻𝑓 𝒛 , 𝒚 − 𝒛 ⇒ 1 − 𝜃 𝑓 𝒙 + 𝜃𝑓 𝒚 ≥ 𝑓 𝒛 + 𝛻𝑓 𝒛 , 1 − 𝜃 𝒙 − 𝒛 + 𝜃(𝒚 − 𝒛) =0 (Why ?) 1 − 𝜃 𝒙 − 𝒛 + 𝜃 𝑦 − 𝑧 = 1 − 𝜃 𝒙 − 1 − 𝜃 𝒛 + 𝜃𝒚 − 𝜃𝒛 = 1 − 𝜃 𝒙 − 𝒛 + 𝜃𝒛 + 𝜃𝒚 − 𝜃𝒛 = 1 − 𝜃 𝒙 + 𝜃𝒚 − 𝒛 = 𝒛 − 𝒛 = 𝟎 = 𝑓 𝒛 = 𝑓( 1 − 𝜃 𝒙 + 𝜃𝒚) CS60010 / Deep Learning | Optimization (c) Abir Das 28
• 29. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Proof (part 2): Convex ⇒ Eqn (1)
Suppose $f$ is convex, i.e., $\forall \boldsymbol{x}, \boldsymbol{y} \in \mathrm{dom}(f)$ and $\forall \theta \in [0,1]$,
$f((1-\theta)\boldsymbol{x} + \theta\boldsymbol{y}) \le (1-\theta) f(\boldsymbol{x}) + \theta f(\boldsymbol{y})$
Equivalently, $f(\boldsymbol{x} + \theta(\boldsymbol{y} - \boldsymbol{x})) \le f(\boldsymbol{x}) + \theta (f(\boldsymbol{y}) - f(\boldsymbol{x}))$
$\Rightarrow f(\boldsymbol{y}) - f(\boldsymbol{x}) \ge \dfrac{f(\boldsymbol{x} + \theta(\boldsymbol{y} - \boldsymbol{x})) - f(\boldsymbol{x})}{\theta}$
07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 29
• 30. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
By taking the limit as $\theta \to 0$ on both sides and using the definition of the derivative, we obtain
$f(\boldsymbol{y}) - f(\boldsymbol{x}) \ge \lim_{\theta \to 0} \dfrac{f(\boldsymbol{x} + \theta(\boldsymbol{y} - \boldsymbol{x})) - f(\boldsymbol{x})}{\theta} = \langle \nabla f(\boldsymbol{x}), \boldsymbol{y} - \boldsymbol{x} \rangle$ (How?)
Let us see this in the case of 1-D $x, y$:
$g = \lim_{\theta \to 0} \dfrac{f(x + \theta(y - x)) - f(x)}{\theta} = \lim_{\theta \to 0} \dfrac{f(x + \theta(y - x)) - f(x)}{\theta(y - x)}\,(y - x)$ [multiplying numerator and denominator by $(y - x)$]
Let $\theta(y - x) = h$; as $\theta \to 0$, $h \to 0$.
$\therefore g = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}\,(y - x) = f'(x)(y - x)$
07, 12, 13 Jan, 2022 CS60010 / Deep Learning | Optimization (c) Abir Das 30
• 31. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Gradient Descent
$\min_{\boldsymbol{w}, \boldsymbol{b}}$ loss function, i.e., $\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
More formally, for scalar $\theta$ the condition is $J'(\theta) = \dfrac{\partial J}{\partial \theta} = 0$.
For higher-dimensional $\boldsymbol{\theta}$, the condition boils down to
$J'(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} J = \boldsymbol{0} \;\Rightarrow\; \left[\dfrac{\partial J}{\partial \theta_1}, \dfrac{\partial J}{\partial \theta_2}, \ldots, \dfrac{\partial J}{\partial \theta_N}\right]^T = \boldsymbol{0}$, i.e., every partial derivative $\dfrac{\partial J}{\partial \theta_i} = 0$.
[Figure: a 1-D curve $J(\theta)$ with the minimum marked where $J'(\theta) = 0$]
07, 12, 13 Jan, 2022 31 CS60010 / Deep Learning | Optimization (c) Abir Das
• 32. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
One Way to Find Minima – Gradient Descent
Setting the gradient to zero is helpful but not always useful. For example,
$J(\boldsymbol{\theta}) = \log \sum_{i=1}^{m} e^{(\boldsymbol{a}_i^T \boldsymbol{\theta} + b_i)}$
is a convex function with a clear minimum, but finding an analytical solution is not easy:
$\nabla_{\boldsymbol{\theta}} J = \boldsymbol{0} \;\Rightarrow\; \dfrac{1}{\sum_{i=1}^{m} e^{(\boldsymbol{a}_i^T \boldsymbol{\theta} + b_i)}} \sum_{i=1}^{m} e^{(\boldsymbol{a}_i^T \boldsymbol{\theta} + b_i)} \boldsymbol{a}_i = \boldsymbol{0} \;\Rightarrow\; \sum_{i=1}^{m} e^{(\boldsymbol{a}_i^T \boldsymbol{\theta} + b_i)} \boldsymbol{a}_i = \boldsymbol{0}$
So a numerical, iterative solution is sought instead.
07, 12, 13 Jan, 2022 32 CS60010 / Deep Learning | Optimization (c) Abir Das
• 33. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
One Way to Find Minima – Gradient Descent
Start with an initial guess $\boldsymbol{\theta}_0$.
Repeatedly update $\boldsymbol{\theta}$ by taking a small step: $\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} + \eta \Delta\boldsymbol{\theta} \ \ldots (1)$
so that $J(\boldsymbol{\theta})$ gets smaller with each update, i.e., $J(\boldsymbol{\theta}_k) \le J(\boldsymbol{\theta}_{k-1}) \ \ldots (2)$
(1) implies
$J(\boldsymbol{\theta}_k) = J(\boldsymbol{\theta}_{k-1} + \eta \Delta\boldsymbol{\theta}) = J(\boldsymbol{\theta}_{k-1}) + \eta (\Delta\boldsymbol{\theta})^T J'(\boldsymbol{\theta}_{k-1}) + \text{h.o.t.}$ [using the Taylor series expansion]
$\approx J(\boldsymbol{\theta}_{k-1}) + \eta (\Delta\boldsymbol{\theta})^T J'(\boldsymbol{\theta}_{k-1})$ [neglecting h.o.t.] $\ \ldots (3)$
Combining (2) and (3): $\eta (\Delta\boldsymbol{\theta})^T J'(\boldsymbol{\theta}_{k-1}) \le 0$, i.e., $(\Delta\boldsymbol{\theta})^T J'(\boldsymbol{\theta}_{k-1}) \le 0$ [as $\eta$ is positive].
So, to satisfy (2) we must choose a $\Delta\boldsymbol{\theta}$ whose dot product with the gradient $J'(\boldsymbol{\theta}_{k-1})$ is negative. Then why not choose $\Delta\boldsymbol{\theta} = -J'(\boldsymbol{\theta}_{k-1})$? Then $(\Delta\boldsymbol{\theta})^T J'(\boldsymbol{\theta}_{k-1}) = -\|J'(\boldsymbol{\theta}_{k-1})\|^2$, which is surely negative (unless the gradient is already zero) and satisfies the condition.
07, 12, 13 Jan, 2022 33 CS60010 / Deep Learning | Optimization (c) Abir Das
• 34. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Gradient Descent
Start with an initial guess $\boldsymbol{\theta}_0$.
Repeatedly update $\boldsymbol{\theta}$ by taking a small step: $\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} - \eta J'(\boldsymbol{\theta}_{k-1})$
until convergence ($J'(\boldsymbol{\theta}_{k-1})$ is very small).
07, 12, 13 Jan, 2022 34 CS60010 / Deep Learning | Optimization (c) Abir Das
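A minimal gradient-descent sketch in Python is given below, applied to the convex log-sum-exp objective from the earlier slide, for which no closed-form minimizer is available. The random data and the step size eta are illustrative assumptions, not values from the slides.

    # Sketch: gradient descent on J(theta) = log sum_i exp(a_i^T theta + b_i).
    import numpy as np

    rng = np.random.default_rng(3)
    m, d = 20, 5
    A = rng.normal(size=(m, d))                  # rows are a_i^T
    b = rng.normal(size=m)

    def J(theta):
        return np.log(np.sum(np.exp(A @ theta + b)))

    def grad_J(theta):
        w = np.exp(A @ theta + b)
        return A.T @ (w / w.sum())               # softmax-weighted combination of the a_i

    theta = np.zeros(d)
    eta = 0.1                                    # learning rate (hand-picked)
    for k in range(1000):
        g = grad_J(theta)
        if np.linalg.norm(g) < 1e-6:             # convergence: gradient very small
            break
        theta = theta - eta * g                  # theta_k = theta_{k-1} - eta * J'(theta_{k-1})
    print(J(theta), np.linalg.norm(grad_J(theta)))   # objective decreases, gradient shrinks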
• 37. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Gradient Descent
Start with an initial guess $\boldsymbol{\theta}_0$.
Repeatedly update $\boldsymbol{\theta}$ by taking a small step: $\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} - \eta J'(\boldsymbol{\theta}_{k-1})$
until convergence ($J'(\boldsymbol{\theta}_{k-1})$ is very small).
Remember, in neural networks the loss is computed by averaging the losses over all examples.
07, 12, 13 Jan, 2022 37 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 38. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in But Imagine This Adapted from Olivier Bousquet’s invited talk at NeurIPS 2018 after winning the Test of Time Award for the NIPS 2007 paper: “The Trade-Offs of Large Scale Learning” by Léon Bottou and Olivier Bousquet. Link of the talk. 07, 12, 13 Jan, 2022 38 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 39. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Batch, Stochastic and Minibatch • Optimization algorithms that use the entire training set to compute the gradient are called batch or deterministic gradient methods. Ones that use a single training example for that task are called stochastic or online gradient methods • Most of the algorithms we use for deep learning fall somewhere in between! • These are called minibatch or minibatch stochastic methods 07, 12, 13 Jan, 2022 39 CS60010 / Deep Learning | Optimization (c) Abir Das
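The sketch below contrasts the regimes on an assumed toy least-squares problem (the data, batch size B, and learning rate eta are illustrative choices): a batch method would use all N examples per step, a stochastic/online method one example, and the minibatch loop shown here a small subset of size B.

    # Sketch: minibatch SGD on a toy linear-regression problem.
    import numpy as np

    rng = np.random.default_rng(4)
    N, d, B = 1000, 10, 32
    X = rng.normal(size=(N, d))
    true_theta = rng.normal(size=d)
    y = X @ true_theta + 0.1 * rng.normal(size=N)

    theta = np.zeros(d)
    eta = 0.05
    for epoch in range(20):
        perm = rng.permutation(N)                # reshuffle the data every epoch
        for start in range(0, N, B):
            idx = perm[start:start + B]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient on the minibatch only
            theta -= eta * grad
    print(np.linalg.norm(theta - true_theta))    # should be small (near the noise level)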
  • 40. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Batch, Stochastic and Mini-batch Stochastic Gradient Descent Images courtesy: Shubhendu Trivedi et al., Goodfellow et al. [Figure: mini-batch gradient descent] 07, 12, 13 Jan, 2022 40 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 41. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Batch and Stochastic Gradient Descent Images courtesy: Shubhendu Trivedi et al. 07, 12, 13 Jan, 2022 41 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 42. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Images courtesy: Karpathy et al. 07, 12, 13 Jan, 2022 42 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 43. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Images courtesy: Karpathy et al. 07, 12, 13 Jan, 2022 43 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 44. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in 07, 12, 13 Jan, 2022 44 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 45. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Gradient Descent: only one force acting at any point. 07, 12, 13 Jan, 2022 45 CS60010 / Deep Learning | Optimization (c) Abir Das
• 46. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Momentum: two forces act at any point:
i) the momentum built up by the gradients that pushed the particle to that point, and
ii) the gradient computed at that point.
[Figure: the momentum at the point, the negative of the gradient at the point (where the update would have reached without momentum), and the actual step as their combination]
07, 12, 13 Jan, 2022 46 CS60010 / Deep Learning | Optimization (c) Abir Das
• 47. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Stochastic Gradient Descent with Momentum
Images courtesy: Goodfellow et al.
[Figure: the momentum at the point, the negative of the gradient at the point (where the update would have reached without momentum), and the actual step]
07, 12, 13 Jan, 2022 47 CS60010 / Deep Learning | Optimization (c) Abir Das
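A momentum-update sketch is shown below; the hyperparameters alpha, eps and the toy quadratic are assumptions, not values from the slides. The velocity accumulates past gradients (force i), and the actual step adds the current scaled negative gradient (force ii) to it.

    # Sketch: SGD-with-momentum update on a toy quadratic J(theta) = 0.5*(theta_1^2 + 50*theta_2^2).
    import numpy as np

    def momentum_step(theta, velocity, grad, alpha=0.9, eps=0.01):
        velocity = alpha * velocity - eps * grad     # decayed momentum plus current negative gradient
        return theta + velocity, velocity            # actual step = the combined velocity

    D = np.array([1.0, 50.0])                        # diagonal curvatures of the quadratic
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        theta, v = momentum_step(theta, v, D * theta)   # gradient of the quadratic is D * theta
    print(theta)                                     # approaches the minimizer at the origin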
  • 48. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Images courtesy: Goodfellow et al. Nesterov Momentum • The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. • With Nesterov momentum the gradient is evaluated after the current velocity is applied. • In practice, Nesterov momentum speeds up convergence only for well-behaved loss functions (convex with consistent curvature) 07, 12, 13 Jan, 2022 48 CS60010 / Deep Learning | Optimization (c) Abir Das
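The Nesterov variant can be sketched as below (same assumed hyperparameters as the previous example): it is identical to standard momentum except that the gradient is evaluated at the look-ahead point theta + alpha * velocity, i.e. after the current velocity has been applied.

    # Sketch: Nesterov-momentum update on the same toy quadratic as before.
    import numpy as np

    def nesterov_step(theta, velocity, grad_fn, alpha=0.9, eps=0.01):
        lookahead = theta + alpha * velocity                     # apply the current velocity first
        velocity = alpha * velocity - eps * grad_fn(lookahead)   # gradient at the look-ahead point
        return theta + velocity, velocity

    D = np.array([1.0, 50.0])
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        theta, v = nesterov_step(theta, v, lambda t: D * t)
    print(theta)                                                 # approaches the origin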
  • 49. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Distill Pub Article on Why Momentum Really Works 07, 12, 13 Jan, 2022 49 CS60010 / Deep Learning | Optimization (c) Abir Das
• 50. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
Adaptive Learning Rate Methods
• So far we have assigned the same learning rate to all directions.
• Would it not be a good idea if we could move slowly along steeper directions and faster along shallower ones?
Images courtesy: Karpathy et al.
07, 12, 13 Jan, 2022 50 CS60010 / Deep Learning | Optimization (c) Abir Das
• 51. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
AdaGrad
• Downscale the learning rate by the square root of the sum of squares of all historical gradient values.
• Parameters with large partial derivatives of the loss have their learning rates rapidly declined.
Images courtesy: Shubhendu Trivedi et al.
• 52. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
AdaGrad
• Gradients at successive steps, written per coordinate $j = 1, \ldots, d$:
$\boldsymbol{g}^{(1)} = [g^{(1),1}, g^{(1),2}, \ldots, g^{(1),d}]^T,\quad \boldsymbol{g}^{(2)} = [g^{(2),1}, g^{(2),2}, \ldots, g^{(2),d}]^T,\quad \ldots,\quad \boldsymbol{g}^{(t)} = [g^{(t),1}, g^{(t),2}, \ldots, g^{(t),d}]^T$
• Accumulated squared gradients:
$\boldsymbol{r}^{(1)} = \left[(g^{(1),1})^2, \ldots, (g^{(1),d})^2\right]^T,\quad \boldsymbol{r}^{(2)} = \left[(g^{(1),1})^2 + (g^{(2),1})^2, \ldots, (g^{(1),d})^2 + (g^{(2),d})^2\right]^T,\quad \ldots,\quad \boldsymbol{r}^{(t)} = \left[\textstyle\sum_{\tau=1}^{t} (g^{(\tau),1})^2, \ldots, \sum_{\tau=1}^{t} (g^{(\tau),d})^2\right]^T$
• Update:
$\Delta\boldsymbol{\theta}^{(t)} = -\left[\dfrac{\epsilon\, g^{(t),1}}{\delta + \sqrt{\sum_{\tau=1}^{t} (g^{(\tau),1})^2}},\ \ldots,\ \dfrac{\epsilon\, g^{(t),d}}{\delta + \sqrt{\sum_{\tau=1}^{t} (g^{(\tau),d})^2}}\right]^T$
07, 12, 13 Jan, 2022 52 CS60010 / Deep Learning | Optimization (c) Abir Das
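The per-coordinate formula above can be sketched in a few lines of Python (eps and delta are assumed small constants, and the toy quadratic reuses the earlier assumed example): r accumulates squared gradients, and each coordinate's step is scaled by 1 / (delta + sqrt(r)).

    # Sketch: AdaGrad update.
    import numpy as np

    def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
        r = r + grad * grad                              # running sum of squared gradients, per coordinate
        theta = theta - eps * grad / (delta + np.sqrt(r))
        return theta, r

    D = np.array([1.0, 50.0])
    theta, r = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(500):
        theta, r = adagrad_step(theta, r, D * theta, eps=0.5)   # a larger step for this tiny problem
    print(theta)    # moves steadily toward the origin, though progress slows as r grows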
• 53. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
RMSProp
• One problem with AdaGrad is that the accumulator $\boldsymbol{r}$ keeps building up and growing.
• This shrinks the learning rate too aggressively.
• RMSProp strikes a balance by exponentially decaying the contributions from past gradients.
Images courtesy: Shubhendu Trivedi et al.
07, 12, 13 Jan, 2022 53 CS60010 / Deep Learning | Optimization (c) Abir Das
• 54. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
RMSProp
• For gradients $\boldsymbol{g}^{(1)}, \boldsymbol{g}^{(2)}, \ldots, \boldsymbol{g}^{(t)}$, the accumulator is $\boldsymbol{r}^{(t)} = \rho\, \boldsymbol{r}^{(t-1)} + (1-\rho)\, \boldsymbol{g}^{(t)} \odot \boldsymbol{g}^{(t)}$, with $\boldsymbol{r}^{(0)} = \boldsymbol{0}$:
$\boldsymbol{r}^{(1)} = \rho\, \boldsymbol{r}^{(0)} + (1-\rho)\, \boldsymbol{g}^{(1)} \odot \boldsymbol{g}^{(1)} = (1-\rho)\, \boldsymbol{g}^{(1)} \odot \boldsymbol{g}^{(1)}$
$\boldsymbol{r}^{(2)} = \rho\, \boldsymbol{r}^{(1)} + (1-\rho)\, \boldsymbol{g}^{(2)} \odot \boldsymbol{g}^{(2)} = \rho(1-\rho)\, \boldsymbol{g}^{(1)} \odot \boldsymbol{g}^{(1)} + (1-\rho)\, \boldsymbol{g}^{(2)} \odot \boldsymbol{g}^{(2)}$
$\boldsymbol{r}^{(3)} = \rho\, \boldsymbol{r}^{(2)} + (1-\rho)\, \boldsymbol{g}^{(3)} \odot \boldsymbol{g}^{(3)} = \rho^2(1-\rho)\, \boldsymbol{g}^{(1)} \odot \boldsymbol{g}^{(1)} + \rho(1-\rho)\, \boldsymbol{g}^{(2)} \odot \boldsymbol{g}^{(2)} + (1-\rho)\, \boldsymbol{g}^{(3)} \odot \boldsymbol{g}^{(3)}$
• RMSProp uses an exponentially decaying average to discard history from the extreme past, so that the accumulation of gradients does not stall the learning.
• AdaDelta is another variant where, instead of an exponentially decaying average, a moving-window average of the past gradients is taken.
• Nesterov acceleration can also be applied to both variants by computing the gradients at a 'look-ahead' position (i.e., at a place where the momentum would have taken the parameters).
07, 12, 13 Jan, 2022 54 CS60010 / Deep Learning | Optimization (c) Abir Das
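A corresponding update sketch is below (rho, eps, and delta are assumed hyperparameter values): compared with AdaGrad, r is an exponentially decaying average of the squared gradients rather than an ever-growing sum. Plugging this step into the toy quadratic loop used earlier drives theta toward the origin without the effective learning rate collapsing the way AdaGrad's can.

    # Sketch: RMSProp update.
    import numpy as np

    def rmsprop_step(theta, r, grad, rho=0.9, eps=0.01, delta=1e-7):
        r = rho * r + (1 - rho) * grad * grad            # exponential moving average of g ⊙ g
        theta = theta - eps * grad / (delta + np.sqrt(r))
        return theta, r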
• 55. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
ADAM (Adaptive Moments)
• A variant that combines RMSProp and momentum.
• It incorporates the first-order moment of the gradient, which can be thought of as taking advantage of the momentum strategy; here the momentum is also accumulated with exponential averaging.
• It also incorporates a second-order term, which can be thought of as the RMSProp-like exponential averaging of the past squared gradients.
• Both the first and second moments are corrected for bias to account for their initialization to zero.
07, 12, 13 Jan, 2022 55 CS60010 / Deep Learning | Optimization (c) Abir Das
• 56. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in
ADAM (Adaptive Moments)
• Biased first-order moment: $\boldsymbol{s}^{(t)} = \rho_1 \boldsymbol{s}^{(t-1)} + (1-\rho_1)\, \boldsymbol{g}^{(t)}$
• Biased second-order moment: $\boldsymbol{r}^{(t)} = \rho_2 \boldsymbol{r}^{(t-1)} + (1-\rho_2)\, \boldsymbol{g}^{(t)} \odot \boldsymbol{g}^{(t)}$
• Bias-corrected first-order moment: $\hat{\boldsymbol{s}}^{(t)} = \dfrac{\boldsymbol{s}^{(t)}}{1-\rho_1^t}$
• Bias-corrected second-order moment: $\hat{\boldsymbol{r}}^{(t)} = \dfrac{\boldsymbol{r}^{(t)}}{1-\rho_2^t}$
• Weight update: $\Delta\boldsymbol{\theta}^{(t)} = -\epsilon\, \dfrac{\hat{\boldsymbol{s}}^{(t)}}{\sqrt{\hat{\boldsymbol{r}}^{(t)}} + \delta}$ (operations applied elementwise)
07, 12, 13 Jan, 2022 56 CS60010 / Deep Learning | Optimization (c) Abir Das
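The equations above translate directly into the sketch below (rho1, rho2, eps, and delta are assumed values; t is the 1-based step count used in the bias correction, and the toy quadratic is the same assumed example as earlier).

    # Sketch: Adam update.
    import numpy as np

    def adam_step(theta, s, r, grad, t, rho1=0.9, rho2=0.999, eps=0.001, delta=1e-8):
        s = rho1 * s + (1 - rho1) * grad                 # biased first moment
        r = rho2 * r + (1 - rho2) * grad * grad          # biased second moment
        s_hat = s / (1 - rho1 ** t)                      # bias-corrected first moment
        r_hat = r / (1 - rho2 ** t)                      # bias-corrected second moment
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)   # elementwise update
        return theta, s, r

    D = np.array([1.0, 50.0])
    theta, s, r = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
    for t in range(1, 501):
        theta, s, r = adam_step(theta, s, r, D * theta, t, eps=0.01)
    print(theta)    # ends up close to the origin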
  • 57. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in ADAM (Adaptive Moments) Images courtesy: Shubhendu Trivedi et al. 07, 12, 13 Jan, 2022 57 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 58. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Visualization animation courtesy: Alec Radford Find more animations at https://tinyurl.com/y6tkf4f8 07, 12, 13 Jan, 2022 58 CS60010 / Deep Learning | Optimization (c) Abir Das
  • 59. Computer Science and Engineering| Indian Institute of Technology Kharagpur cse.iitkgp.ac.in Thank You!!