Gradient Descent
Nicholas Ruozzi
University of Texas at Dallas
Gradient Descent
• Method to find local optima of a differentiable function 𝑓
• Intuition: gradient tells us direction of greatest increase,
negative gradient gives us direction of greatest decrease
• Take steps in directions that reduce the function value
• Definition of derivative guarantees that if we take a small
enough step in the direction of the negative gradient, the
function will decrease in value
• How small is small enough?
2
Gradient Descent
Gradient Descent Algorithm:
• Pick an initial point 𝑥0
• Iterate until convergence
𝑥𝑡+1 = 𝑥𝑡 − 𝛾𝑡𝛻𝑓(𝑥𝑡)
where 𝛾𝑡 is the 𝑡-th step size (sometimes called the learning rate)
3
Gradient Descent
Gradient Descent Algorithm:
• Pick an initial point 𝑥0
• Iterate until convergence
𝑥𝑡+1 = 𝑥𝑡 − 𝛾𝑡𝛻𝑓(𝑥𝑡)
where 𝛾𝑡 is the 𝑡-th step size (sometimes called the learning rate)
4
When do we stop?
Gradient Descent
Gradient Descent Algorithm:
• Pick an initial point 𝑥0
• Iterate until convergence
𝑥𝑡+1 = 𝑥𝑡 − 𝛾𝑡𝛻𝑓(𝑥𝑡)
where 𝛾𝑡 is the 𝑡-th step size (sometimes called the learning rate)
5
Possible Stopping Criteria: iterate until
‖∇𝑓(𝑥𝑡)‖ ≤ 𝜖 for some 𝜖 > 0
How small should 𝜖 be?
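As a concrete illustration of this loop and the gradient-norm stopping criterion (not part of the slides), here is a minimal Python sketch; the function, step size, and tolerance are placeholders the user would supply, and a fixed step size is assumed for simplicity.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, eps=1e-6, max_iters=10000):
    """Minimal fixed-step gradient descent sketch.

    grad_f    : function returning the gradient of f at a point
    x0        : initial point x_0 (scalar or numpy array)
    step_size : the step size gamma (held constant here)
    eps       : stop once the gradient norm falls below eps
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:      # stopping criterion from the slide
            break
        x = x - step_size * g             # x_{t+1} = x_t - gamma * grad f(x_t)
    return x
```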
Gradient Descent
6
𝑓(𝑥) = 𝑥²
𝑥(0) = −4
Step size: .8
Gradient Descent
7
𝑓(𝑥) = 𝑥²
𝑥(1) = −4 − .8 ⋅ 2 ⋅ (−4)
𝑥(0) = −4
Step size: .8
Gradient Descent
8
𝑓(𝑥) = 𝑥²
𝑥(1) = 2.4
𝑥(0) = −4
Step size: .8
Gradient Descent
9
𝑓(𝑥) = 𝑥²
𝑥(2) = 2.4 − .8 ⋅ 2 ⋅ 2.4
𝑥(1) = 2.4
𝑥(0) = −4
Step size: .8
Gradient Descent
10
𝑓(𝑥) = 𝑥²
𝑥(2) = −1.44
𝑥(1) = 2.4
𝑥(0) = −4
Step size: .8
Gradient Descent
11
𝑓(𝑥) = 𝑥²
𝑥(0) = −4
𝑥(1) = 2.4
𝑥(2) = −1.44
𝑥(3) = .864
𝑥(4) = −0.5184
𝑥(5) = 0.31104
𝑥(30) = −8.84296e−07
Step size: .8
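Running the iteration by hand reproduces these numbers: since ∇𝑓(𝑥) = 2𝑥, the update is 𝑥(𝑡+1) = 𝑥(𝑡) − .8 ⋅ 2𝑥(𝑡) = −0.6 𝑥(𝑡), so each iterate shrinks by a factor of 0.6 and flips sign. A small sketch using the same values as the slides:

```python
x = -4.0                      # x^(0)
for t in range(30):
    x = x - 0.8 * (2 * x)     # gradient of x^2 is 2x
    print(t + 1, x)
# prints 2.4, -1.44, 0.864, -0.5184, 0.31104, ..., and about -8.84e-07 at iteration 30
```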
Gradient Descent
12
Step size: .9
Gradient Descent
13
Step size: .2
Gradient Descent
14
Step size matters!
Gradient Descent
15
Step size matters!
Line Search
• Instead of picking a fixed step size that may or may not actually
result in a decrease in the function value, we can consider
minimizing the function along the direction specified by the
gradient to guarantee that the next iteration decreases the
function value
• In other words, choose 𝛾𝑡 ∈ arg min𝛾≥0 𝑓(𝑥𝑡 − 𝛾∇𝑓(𝑥𝑡)) and set 𝑥𝑡+1 = 𝑥𝑡 − 𝛾𝑡∇𝑓(𝑥𝑡)
• This is called exact line search
• This optimization problem can be expensive to solve exactly
• However, if 𝑓 is convex, this is a univariate convex
optimization problem
16
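As an illustration (not from the slides), one exact-line-search step could be approximated numerically with SciPy's bounded scalar minimizer; the upper bound gamma_max is an arbitrary choice made for this sketch.

```python
from scipy.optimize import minimize_scalar

def exact_line_search_step(f, grad_f, x, gamma_max=1.0):
    """One gradient step with (approximately) exact line search."""
    g = grad_f(x)
    # Minimize the univariate function phi(gamma) = f(x - gamma * g) over [0, gamma_max]
    phi = lambda gamma: f(x - gamma * g)
    gamma_star = minimize_scalar(phi, bounds=(0.0, gamma_max), method="bounded").x
    return x - gamma_star * g
```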
Backtracking Line Search
• Instead of exact line search, could simply use a strategy that
finds some step size that decreases the function value (one must
exist)
• Backtracking line search: start with a large step size, 𝛾, and keep
shrinking it until 𝑓(𝑥𝑡 − 𝛾∇𝑓(𝑥𝑡)) < 𝑓(𝑥𝑡)
• This always guarantees a decrease, but it may not decrease as
much as exact line search
• Still, this is typically much faster in practice as it only requires
a few function evaluations
17
Backtracking Line Search
• To implement backtracking line search, choose two parameters
𝛼 ∈ (0, .5), 𝛽 ∈ (0, 1)
• Set 𝛾 = 1
• While 𝑓(𝑥𝑡 − 𝛾∇𝑓(𝑥𝑡)) > 𝑓(𝑥𝑡) − 𝛼 ⋅ 𝛾 ⋅ ‖∇𝑓(𝑥𝑡)‖²
• 𝛾 = 𝛽𝛾
• Set 𝑥𝑡+1 = 𝑥𝑡 − 𝛾∇𝑓(𝑥𝑡)
18
Iterations continue until
a step size is found that
decreases the function
“enough”
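A direct Python transcription of this procedure (a sketch; the particular 𝛼 and 𝛽 values are just examples within the stated ranges):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.8):
    """One gradient step with backtracking line search."""
    g = grad_f(x)
    gamma = 1.0
    # Shrink gamma until the step decreases f "enough" (the condition on the slide)
    while f(x - gamma * g) > f(x) - alpha * gamma * np.dot(g, g):
        gamma = beta * gamma
    return x - gamma * g
```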
Backtracking Line Search
19
𝛼 = .2, 𝛽 = .99
Backtracking Line Search
20
𝛼 = .1, 𝛽 = .3
Gradient Descent: Convex Functions
• For convex functions, local optima are always global optima (this
follows from the definition of convexity)
• If gradient descent converges to a critical point, then the
result is a global minimizer
• Not all convex functions are differentiable; can we still apply gradient descent?
21
Gradients of Convex Functions
• For a differentiable convex function 𝑔(𝑥), its gradients yield linear underestimators
[figure: 𝑔(𝑥) vs. 𝑥]
22
Gradients of Convex Functions
• For a differentiable convex function 𝑔(𝑥), its gradients yield linear underestimators
[figure: 𝑔(𝑥) vs. 𝑥]
23
Gradients of Convex Functions
• For a differentiable convex function 𝑔(𝑥), its gradients yield linear underestimators: zero gradient corresponds to a global optimum
[figure: 𝑔(𝑥) vs. 𝑥]
24
Subgradients
• For a convex function 𝑔(𝑥), a subgradient at a point 𝑥0 is given by any line 𝑙 such that 𝑙(𝑥0) = 𝑔(𝑥0) and 𝑙(𝑥) ≤ 𝑔(𝑥) for all 𝑥, i.e., it is a linear underestimator
[figure: 𝑔(𝑥) vs. 𝑥, with a subgradient line at 𝑥0]
25
Subgradients
• For a convex function 𝑔(𝑥), a subgradient at a point 𝑥0 is given by any line 𝑙 such that 𝑙(𝑥0) = 𝑔(𝑥0) and 𝑙(𝑥) ≤ 𝑔(𝑥) for all 𝑥, i.e., it is a linear underestimator
[figure: 𝑔(𝑥) vs. 𝑥, with a subgradient line at 𝑥0]
26
Subgradients
• For a convex function 𝑔(𝑥), a subgradient at a point 𝑥0 is given by any line 𝑙 such that 𝑙(𝑥0) = 𝑔(𝑥0) and 𝑙(𝑥) ≤ 𝑔(𝑥) for all 𝑥, i.e., it is a linear underestimator
[figure: 𝑔(𝑥) vs. 𝑥, with a subgradient line at 𝑥0]
27
Subgradients
• For a convex function 𝑔(𝑥), a subgradient at a point 𝑥0 is given by any line 𝑙 such that 𝑙(𝑥0) = 𝑔(𝑥0) and 𝑙(𝑥) ≤ 𝑔(𝑥) for all 𝑥, i.e., it is a linear underestimator
[figure: 𝑔(𝑥) vs. 𝑥, with a subgradient line at 𝑥0]
If 0 is a subgradient at 𝑥0, then 𝑥0 is a global minimum
28
Subgradients
• If a convex function is differentiable at a point 𝑥, then it has a
unique subgradient at the point 𝑥 given by the gradient
• If a convex function is not differentiable at a point 𝑥, it can have
many subgradients
• E.g., the set of subgradients of the convex function |𝑥| at the
point 𝑥 = 0 is given by the set of slopes [−1,1]
• The set of all subgradients of 𝑓 at 𝑥 forms a convex set, i.e., if 𝑔, ℎ are subgradients, then .5𝑔 + .5ℎ is also a subgradient
• Subgradients are only guaranteed to exist for convex functions
29
Subgradient Example
• Subgradient of 𝑔(𝑥) = max(𝑓1(𝑥), 𝑓2(𝑥)) for 𝑓1, 𝑓2 convex functions?
30
Subgradient Example
• Subgradient of 𝑔(𝑥) = max(𝑓1(𝑥), 𝑓2(𝑥)) for 𝑓1, 𝑓2 convex functions?
• If 𝑓1(𝑥) > 𝑓2(𝑥), ∇𝑓1(𝑥)
• If 𝑓2(𝑥) > 𝑓1(𝑥), ∇𝑓2(𝑥)
• If 𝑓1(𝑥) = 𝑓2(𝑥), ∇𝑓1(𝑥) and ∇𝑓2(𝑥) are both subgradients (and so are all convex combinations of these)
31
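This case analysis translates directly into code; a sketch, where s1 and s2 are assumed to be (sub)gradient oracles for 𝑓1 and 𝑓2:

```python
def subgrad_max(f1, f2, s1, s2, x):
    """Return one valid subgradient of g(x) = max(f1(x), f2(x)) at x.

    At a tie, either (sub)gradient -- or any convex combination -- works; we return s1(x).
    """
    return s1(x) if f1(x) >= f2(x) else s2(x)
```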
Subgradient Descent
Subgradient Descent Algorithm:
• Pick an initial point 𝑥0
• Iterate until convergence
𝑥𝑡+1 = 𝑥𝑡 − 𝛾𝑡𝑠𝑓(𝑥𝑡)
where 𝛾𝑡 is the 𝑡-th step size and 𝑠𝑓(𝑥𝑡) is a subgradient of 𝑓 at 𝑥𝑡
32
Subgradient Descent
Subgradient Descent Algorithm:
• Pick an initial point 𝑥0
• Iterate until convergence
𝑥𝑡+1 = 𝑥𝑡 − 𝛾𝑡𝑠𝑓(𝑥𝑡)
where 𝛾𝑡 is the 𝑡-th step size and 𝑠𝑓(𝑥𝑡) is a subgradient of 𝑓 at 𝑥𝑡
33
Can you use line search here?
Subgradient Descent
34
Step Size: .9
Diminishing Step Size Rules
• A fixed step size may not result in convergence for non-differentiable functions
• Instead, can use a diminishing step size:
• Required property: step size must decrease as number of
iterations increase but not too quickly that the algorithm fails
to make progress
• Common diminishing step size rules:
• 𝛾𝑡 = 𝑎/(𝑏 + 𝑡) for some 𝑎 > 0, 𝑏 ≥ 0
• 𝛾𝑡 = 𝑎/√𝑡 for some 𝑎 > 0
35
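For example, here is a sketch of subgradient descent on the non-differentiable convex function 𝑓(𝑥) = |𝑥| (a subgradient is sign(𝑥), with 0 a valid choice at 𝑥 = 0), comparing a fixed step size with the 𝛾𝑡 = 𝑎/(𝑏 + 𝑡) rule; the constants are arbitrary illustrative choices.

```python
import numpy as np

def subgrad_abs(x):
    # sign(x) is a subgradient of |x|; np.sign(0) = 0 is also a valid subgradient at 0
    return np.sign(x)

x_fixed, x_dimin = 3.0, 3.0
a, b = 1.0, 1.0
for t in range(100):
    x_fixed -= 0.9 * subgrad_abs(x_fixed)             # fixed step: ends up oscillating around 0
    x_dimin -= (a / (b + t)) * subgrad_abs(x_dimin)   # diminishing step: oscillations shrink toward 0

print(abs(x_fixed), abs(x_dimin))
```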
Subgradient Descent
36
Diminishing Step Size
Theoretical Guarantees
• The hard work in convex optimization is to identify conditions
that guarantee quick convergence to within a small error of the
optimum
• Let 𝑓best(𝑡) = min𝑡′∈{0,…,𝑡} 𝑓(𝑥𝑡′)
• For a fixed step size 𝛾, we are guaranteed that
lim𝑡→∞ 𝑓best(𝑡) − inf𝑥 𝑓(𝑥) ≤ 𝜖(𝛾)
where 𝜖(𝛾) is some positive constant that depends on 𝛾
• If 𝑓 is differentiable, then we have 𝜖(𝛾) = 0 whenever 𝛾 is small enough
37