Descent methods for unconstrained optimization
Gilles Gasso
INSA Rouen
Laboratory LITIS
October 13, 2019
Plan
1 Formulation
2 Optimality conditions
3 Notions of gradient
  Existence of a solution
4 Descent algorithms
  Main methods of descent
  Step-size search
  Summary
5 Illustration of descent methods
Formulation
Unconstrained optimization
Elements of the problem
θ ∈ R^d: vector of unknown real parameters
J : R^d → R: the function to be minimized.
Assumption: J is differentiable over its whole domain dom J = {θ ∈ R^d | J(θ) < ∞}
Problem formulation
(P)   min_{θ ∈ R^d} J(θ)
Examples
1. Quadratic function: J(θ) = (1/2) θ^T P θ + q^T θ + r, with P a positive definite matrix
2. J(θ) = cos(θ_1 − θ_2) + sin(θ_1 + θ_2) + θ_1/4
[Figure: surface and contour plots of the two example objective functions]
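As a concrete companion to these examples, here is a minimal Python/NumPy sketch (our own illustration, not part of the slides) defining the two objectives; the names J_quad and J_trig and the particular P, q used below are arbitrary choices.

import numpy as np

def J_quad(theta, P, q, r=0.0):
    # Quadratic objective: J(theta) = 0.5 * theta^T P theta + q^T theta + r
    return 0.5 * theta @ P @ theta + q @ theta + r

def J_trig(theta):
    # Second example: J(theta) = cos(theta1 - theta2) + sin(theta1 + theta2) + theta1 / 4
    t1, t2 = theta
    return np.cos(t1 - t2) + np.sin(t1 + t2) + t1 / 4.0

# Example evaluation with an arbitrary positive definite P
P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([-1.0, 1.0])
print(J_quad(np.zeros(2), P, q), J_trig(np.zeros(2)))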
Optimality conditions
Different solutions
Global solution
θ* is said to be the global minimum solution of the problem if J(θ*) ≤ J(θ), ∀ θ ∈ dom J
Local solution
θ̂ is a local minimum solution of problem (P) if there exists ϵ > 0 such that J(θ̂) ≤ J(θ), ∀ θ ∈ dom J with ‖θ − θ̂‖ ≤ ϵ
Illustration
J(θ) = cos(θ_1 − θ_2) + sin(θ_1 + θ_2) + θ_1/4
[Figure: contour plot of J with the global minimum (*) and a local minimum marked]
Optimality conditions
How do we assess a solution to the problem?
Notions of gradient
Differentiability
What for?
Differentiability allows us to check whether a given θ is a solution
It also provides a local linear approximation of J
Notion of gradient
Let J : R^d → R be a differentiable function over its whole domain. The gradient of J at θ ∈ dom J is the d-dimensional vector denoted ∇J(θ) and defined as
∇J(θ) = [∂J(θ)/∂θ_1 ··· ∂J(θ)/∂θ_d]^T
where ∂J(θ)/∂θ_k is the partial derivative of J w.r.t. the parameter θ_k.
Example
J(θ) = θ_1^4 + θ_2^4 − 4 θ_1 θ_2
Gradient: ∇J(θ) = [4θ_1^3 − 4θ_2 ; −4θ_1 + 4θ_2^3]
Evaluation at θ = (1, −1)^T: ∇J(θ) = (8, −8)^T
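A quick numerical sanity check of this analytic gradient is possible with central finite differences; the following sketch (ours, with the helper name num_grad chosen for illustration) reproduces the value (8, −8) at θ = (1, −1).

import numpy as np

def J(theta):
    t1, t2 = theta
    return t1**4 + t2**4 - 4 * t1 * t2

def num_grad(f, theta, eps=1e-6):
    # Central finite-difference approximation of the gradient
    g = np.zeros_like(theta, dtype=float)
    for k in range(theta.size):
        e = np.zeros_like(theta, dtype=float)
        e[k] = eps
        g[k] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.array([1.0, -1.0])
grad_analytic = np.array([4 * theta[0]**3 - 4 * theta[1],
                          4 * theta[1]**3 - 4 * theta[0]])
print(grad_analytic)        # [ 8. -8.]
print(num_grad(J, theta))   # approximately [ 8. -8.]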
Exercise
Find the gradient of J(θ) = cos(θ_1 − θ_2) + sin(θ_1 + θ_2) + θ_1/4
Gradient calculation
Gateaux directional derivative
Let J : R^d → R be a function and let h ∈ R^d be a vector. The Gateaux directional derivative of J at θ ∈ R^d along the direction h is the limit (if it exists)
D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t
Link with the gradient
Let J : R^d → R be a function differentiable at θ ∈ R^d. The gradient of J at θ is the unique vector ∇J(θ) ∈ R^d such that the directional derivative is the linear form
D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t = ∇J(θ)^T h
Examples
Computing the directional derivative in practice
Consider the function φ(t) = J(θ + th) with t ∈ R, h ∈ R^d, θ ∈ R^d
φ′(0), the derivative of φ at t = 0, is exactly the directional derivative
Proof: φ′(0) = lim_{t→0} [φ(t) − φ(0)] / t = lim_{t→0} [J(θ + th) − J(θ)] / t = D(h, θ)
Exercises: from the directional derivative D(h, θ) to the gradient
Compute the gradient of the following functions
Linear function: J(θ) = c^T θ with c, θ ∈ R^d
Quadratic function: J(θ) = θ^T A θ with A ∈ R^{d×d}
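For reference, a sketch of these computations (worked out here; the original slide leaves them as exercises): for the linear function, J(θ + th) = c^T θ + t c^T h, so D(h, θ) = c^T h and ∇J(θ) = c. For the quadratic function, J(θ + th) = θ^T A θ + t θ^T A h + t h^T A θ + t^2 h^T A h, so D(h, θ) = θ^T (A + A^T) h = ((A + A^T) θ)^T h, hence ∇J(θ) = (A + A^T) θ, which equals 2Aθ when A is symmetric.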
Existence of a solution
Theorem
Assume J : R^d → R is a proper, continuous and coercive* function. Then the optimization problem min_θ J(θ) admits a global solution.
(*) The function J is coercive if J(θ) → ∞ whenever ‖θ‖ → ∞.
Proof.
J being coercive, we have J(θ) → ∞ as ‖θ‖ → ∞. Let θ_0 ∈ R^d. There exists ν > 0 such that ‖θ‖ > ν ⇒ J(θ) > J(θ_0), or equivalently J(θ) ≤ J(θ_0) ⇒ ‖θ‖ ≤ ν. To find a solution, it therefore suffices to restrict the minimization to min_{θ ∈ R^d, ‖θ‖ ≤ ν} J(θ). As the set {θ ∈ R^d | ‖θ‖ ≤ ν} is compact and J is continuous, this restricted problem admits a global solution by the Weierstrass theorem.
Interpretation
The theorem states that if J is bounded from below and the bound is attained at a finite θ, the minimization problem admits a solution.
First order necessary condition
Theorem [First order condition]
Let J : R^d → R be a differentiable function on its domain. If a vector θ_0 is a (local or global) solution of problem (P), then it necessarily satisfies the condition ∇J(θ_0) = 0.
Vocabulary
Any vector θ_0 that verifies ∇J(θ_0) = 0 is called a stationary point or critical point
∇J(θ) ∈ R^d is the gradient vector of J at θ.
The gradient is the unique vector such that the directional derivative can be written as
lim_{t→0} [J(θ + th) − J(θ)] / t = ∇J(θ)^T h,   h ∈ R^d, t ∈ R
Example of a first order optimality condition
J(θ) = θ_1^4 + θ_2^4 − 4 θ_1 θ_2
Gradient: ∇J(θ) = [4θ_1^3 − 4θ_2 ; −4θ_1 + 4θ_2^3]
Stationary points verify ∇J(θ) = 0. There are three solutions: θ^(1) = (0, 0)^T, θ^(2) = (1, 1)^T and θ^(3) = (−1, −1)^T
[Figure: contour plot of J with the three stationary points]
Remarks
θ^(2) and θ^(3) are local minima but θ^(1) is not
Not every stationary point is a local extremum
We need another optimality condition: how to ensure that a stationary point is a minimum?
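These stationary points can be recovered symbolically; the following SymPy sketch (ours, not part of the slides) solves ∇J(θ) = 0 for the example above.

import sympy as sp

t1, t2 = sp.symbols('theta1 theta2', real=True)
J = t1**4 + t2**4 - 4 * t1 * t2

grad = [sp.diff(J, v) for v in (t1, t2)]           # gradient of J
solutions = sp.solve(grad, [t1, t2], dict=True)    # stationary points: grad J = 0
print(solutions)   # the three real solutions (0, 0), (1, 1), (-1, -1); order may vary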
Hessian matrix
Twice differentiable function
J : R^d → R is said to be twice differentiable on its domain dom J if, at every point θ ∈ dom J, there exists a unique symmetric matrix H(θ) ∈ R^{d×d}, called the Hessian matrix, such that
J(θ + h) = J(θ) + ∇J(θ)^T h + (1/2) h^T H(θ) h + ‖h‖² ε(h),
where ε(h) is continuous at 0 with lim_{h→0} ε(h) = 0
H(θ) is the matrix of second derivatives
H(θ) = [ ∂²J/∂θ_1∂θ_1   ∂²J/∂θ_1∂θ_2   ···   ∂²J/∂θ_1∂θ_d
          ⋮                                       ⋮
          ∂²J/∂θ_d∂θ_1   ∂²J/∂θ_d∂θ_2   ···   ∂²J/∂θ_d∂θ_d ]
H(θ) = ∇_θ^T(∇_θ J(θ)) is the Jacobian of the gradient function
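The "Jacobian of the gradient" viewpoint suggests a simple numerical check; a finite-difference sketch (ours, with the helper name num_hessian chosen for illustration):

import numpy as np

def num_hessian(grad, theta, eps=1e-5):
    # Finite-difference Jacobian of the gradient function, i.e. an approximation of the Hessian
    d = theta.size
    H = np.zeros((d, d))
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        H[:, k] = (grad(theta + e) - grad(theta - e)) / (2 * eps)
    return H

# Example: J(theta) = theta1^4 + theta2^4 - 4 theta1 theta2
grad = lambda th: np.array([4 * th[0]**3 - 4 * th[1],
                            4 * th[1]**3 - 4 * th[0]])
print(num_hessian(grad, np.array([1.0, 1.0])))   # approximately [[12, -4], [-4, 12]]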
Examples
Example 1
Objective function: J(θ) = θ_1^4 + θ_2^4 − 4 θ_1 θ_2
Gradient: ∇J(θ) = [4θ_1^3 − 4θ_2 ; −4θ_1 + 4θ_2^3]
Hessian matrix: H(θ) = [ 12θ_1^2  −4 ; −4  12θ_2^2 ]
Example 2
Quadratic objective function: J(θ) = (1/2) θ^T P θ + q^T θ + r
Directional derivative: D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t = (Pθ + q)^T h
Gradient: ∇J(θ) = Pθ + q
Hessian matrix: H(θ) = P
Second order optimality condition
Theorem [Second order optimality condition]
Let J : R^d → R be a twice differentiable function on its domain. If θ_0 is a minimum of J, then ∇J(θ_0) = 0 and H(θ_0) is a positive semi-definite matrix.
Remarks
H is positive definite if and only if all its eigenvalues are positive
H is negative definite if and only if all its eigenvalues are negative
For θ ∈ R, this condition means that the derivative of J at the minimum is null, J′(θ) = 0, and its second derivative is non-negative, i.e. J″(θ) ≥ 0
If, at a stationary point θ_0, H(θ_0) is negative definite, then θ_0 is a local maximum of J
Illustration of the second order optimality condition
J(θ) = θ_1^4 + θ_2^4 − 4 θ_1 θ_2
Gradient: ∇J(θ) = [4θ_1^3 − 4θ_2 ; −4θ_1 + 4θ_2^3]
Stationary points: θ^(1) = (0, 0)^T, θ^(2) = (1, 1)^T and θ^(3) = (−1, −1)^T
Hessian matrix: H(θ) = [ 12θ_1^2  −4 ; −4  12θ_2^2 ]
[Figure: contour plot of J with the three stationary points]
                    θ^(1)              θ^(2)              θ^(3)
Hessian             [0 −4; −4 0]       [12 −4; −4 12]     [12 −4; −4 12]
Eigenvalues         4, −4              8, 16              8, 16
Type of solution    Saddle point       Minimum            Minimum
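These eigenvalues are easy to verify numerically; a small NumPy sketch (ours, not part of the slides):

import numpy as np

def hessian(theta):
    t1, t2 = theta
    return np.array([[12 * t1**2, -4.0],
                     [-4.0, 12 * t2**2]])

for point in [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)]:
    eigvals = np.linalg.eigvalsh(hessian(point))   # symmetric matrix: real eigenvalues, ascending
    print(point, eigvals)
# (0, 0)   -> [-4.  4.] : indefinite, saddle point
# (1, 1)   -> [ 8. 16.] : positive definite, local minimum
# (-1, -1) -> [ 8. 16.] : positive definite, local minimum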
Necessary and sufficient optimality condition
Theorem [2nd order sufficient condition]
Assume the Hessian matrix H(θ_0) of J at θ_0 exists and is positive definite. Assume also that the gradient ∇J(θ_0) = 0. Then θ_0 is a (local or global) minimum of problem (P).
Theorem [Sufficient and necessary optimality condition]
Let J be a convex function. Then every local solution θ̂ is a global solution θ*.
Recall
A function J : R^d → R is convex if it verifies
J(αθ + (1 − α)z) ≤ αJ(θ) + (1 − α)J(z),   ∀ θ, z ∈ dom J, 0 ≤ α ≤ 1
How to find the solution(s)?
We have seen how to assess a solution to the problem
A question to be addressed is: how to compute a solution?
Descent algorithms
Principle of descent algorithms
Direction of descent
Let J : R^d → R. The vector h ∈ R^d is called a direction of descent at θ if there exists α > 0 such that J(θ + αh) < J(θ)
Principle of descent methods
Start from an initial point θ_0
Design a sequence of points {θ_k} with θ_{k+1} = θ_k + α_k h_k
Ensure that the sequence {θ_k} converges to a stationary point θ̂
h_k: direction of descent
α_k: the step size
General approach
General algorithm
1: Let k = 0, initialize θ_0
2: repeat
3: Find a descent direction h_k ∈ R^d
4: Line search: find a step size α_k > 0 in the direction h_k such that J(θ_k + α_k h_k) decreases "enough"
5: Update: θ_{k+1} ← θ_k + α_k h_k and k ← k + 1
6: until ‖∇J(θ_k)‖ < ϵ
The methods of descent differ by the choice of:
h: gradient algorithm, Newton, Quasi-Newton algorithm
α: Line search, backtracking
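A minimal Python sketch of this generic loop (our own illustration; the direction and step_size arguments are placeholders to be specialized by the methods discussed next):

import numpy as np

def descent(J, grad, theta0, direction, step_size, tol=1e-6, max_iter=500):
    # Generic descent loop: theta_{k+1} = theta_k + alpha_k * h_k
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:              # stopping criterion ||grad J(theta_k)|| < tol
            break
        h = direction(theta, g)                  # e.g. -g for the gradient method
        alpha = step_size(J, grad, theta, h)     # e.g. a fixed step or backtracking
        theta = theta + alpha * h
    return theta

# Gradient direction with a small fixed step on J(theta) = theta1^4 + theta2^4 - 4 theta1 theta2
J = lambda t: t[0]**4 + t[1]**4 - 4 * t[0] * t[1]
grad = lambda t: np.array([4 * t[0]**3 - 4 * t[1], 4 * t[1]**3 - 4 * t[0]])
theta_hat = descent(J, grad, [1.5, 0.5],
                    direction=lambda t, g: -g,
                    step_size=lambda *args: 0.05)
print(theta_hat)   # from this start and step, the iterates approach the stationary point (1, 1)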
Main methods of descent
Gradient Algorithm
Theorem [descent direction and opposite direction of gradient]
Let J(θ) be a differentiable function. The direction h = −∇J(θ) ∈ R^d is a descent direction.
Proof.
J being differentiable, for any t > 0 we have J(θ + th) = J(θ) + t∇J(θ)^T h + t‖h‖ϵ(th). Setting h = −∇J(θ), we get J(θ + th) − J(θ) = −t‖∇J(θ)‖² + t‖h‖ϵ(th). For t small enough, ϵ(th) → 0 and so J(θ + th) − J(θ) = −t‖∇J(θ)‖² < 0. Hence h is a descent direction.
Characteristics of the gradient algorithm
Choice of the descent direction at θ_k: h_k = −∇J(θ_k)
Complexity of the update: θ_{k+1} ← θ_k − α_k ∇J(θ_k) costs O(d)
Newton algorithm
2nd order approximation of the twice differentiable J at θ_k:
J(θ_k + h) ≈ J(θ_k) + ∇J(θ_k)^T h + (1/2) h^T H(θ_k) h
with H(θ_k) the positive definite Hessian matrix
The direction h_k which minimizes this approximation is obtained by
∇J(θ_k + h_k) = 0  ⇒  h_k = −H(θ_k)^{-1} ∇J(θ_k)
Features
Choice of the descent direction at θ_k: h_k = −H(θ_k)^{-1} ∇J(θ_k)
Complexity of the update: θ_{k+1} ← θ_k − α_k H(θ_k)^{-1} ∇J(θ_k) costs O(d^3) flops
H(θ_k) is not always guaranteed to be positive definite. Hence we cannot always ensure that h_k is a descent direction
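In practice the Newton direction is obtained by solving the linear system H(θ_k) h_k = −∇J(θ_k) rather than forming the inverse; a sketch on the running example (ours, not part of the slides):

import numpy as np

def newton_direction(grad_val, hess_val):
    # Solve H h = -grad instead of computing H^{-1} explicitly
    return np.linalg.solve(hess_val, -grad_val)

# Running example: J(theta) = theta1^4 + theta2^4 - 4 theta1 theta2
grad = lambda t: np.array([4 * t[0]**3 - 4 * t[1], 4 * t[1]**3 - 4 * t[0]])
hess = lambda t: np.array([[12 * t[0]**2, -4.0], [-4.0, 12 * t[1]**2]])

theta = np.array([1.5, 0.5])
h = newton_direction(grad(theta), hess(theta))
print(h, grad(theta) @ h < 0)   # descent direction here, since H(theta) is positive definite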
Illustration of gradient and Newton methods
[Figure, left: local approximations of the two methods in 1D — the tangent at θ_k (gradient method) vs. the quadratic approximation of J (Newton method)]
[Figure, right: directions of descent in 2D — h = −H^{-1}∇J (Newton) and h = −∇J (gradient)]
Quasi-Newton method
Main features
Choice of the descent direction at θ_k: h_k = −B(θ_k)^{-1} ∇J(θ_k)
B(θ_k) is a positive definite approximation of the Hessian matrix
Complexity of the update: most of the time O(d^2)
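For reference, a standard quasi-Newton method (BFGS) is available in SciPy; a usage sketch on the running example (our own illustration, not from the slides):

import numpy as np
from scipy.optimize import minimize

J = lambda t: t[0]**4 + t[1]**4 - 4 * t[0] * t[1]
grad = lambda t: np.array([4 * t[0]**3 - 4 * t[1], 4 * t[1]**3 - 4 * t[0]])

# BFGS builds a positive definite approximation of the Hessian from successive gradients
res = minimize(J, x0=np.array([1.5, 0.5]), jac=grad, method='BFGS')
print(res.x)   # expected to land near a minimum, here close to (1, 1)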
Step-size search
Line search
Assume the direction of descent h_k at θ_k is fixed. We aim to find a step size α_k > 0 in the direction h_k such that the function J(θ_k + α_k h_k) decreases enough (compared to J(θ_k))
Several options
Fixed step size: use a fixed value α > 0 at each iteration k: θ_{k+1} ← θ_k + α h_k
Optimal step size: θ_{k+1} ← θ_k + α_k* h_k with α_k* = argmin_{α>0} J(θ_k + α h_k)
Variable step size: the choice of α_k is adapted to the current iteration: θ_{k+1} ← θ_k + α_k h_k
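As an aside (derived here for illustration; not on the original slide), for the quadratic example J(θ) = (1/2) θ^T P θ + q^T θ + r the optimal step has a closed form: setting d/dα J(θ_k + α h_k) = ∇J(θ_k)^T h_k + α h_k^T P h_k = 0 gives α_k* = −∇J(θ_k)^T h_k / (h_k^T P h_k), which is positive whenever h_k is a descent direction and P is positive definite.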
Line search
Pros and cons
Fixed step size strategy: often not very effective
Optimal step size: can be costly in calculation time
Variable step: most commonly used approach
The step is often imprecise
A trade-off between computation cost and decrease of J
Variable step size
Armijo’s rule
Determine the step size α_k so as to obtain a sufficient decrease of J, i.e.
J(θ_k + α_k h_k) ≤ J(θ_k) + c α_k ∇J(θ_k)^T h_k
Usually c is chosen in the range [10^{-5}, 10^{-1}]
Since h_k is a descent direction, we have ∇J(θ_k)^T h_k < 0, which ensures the decrease of J
Backtracking
1: Fix an initial step ᾱ, choose 0 < ρ < 1, α ← ᾱ
2: repeat
3: α ← ρα
4: until J(θ_k + α h_k) ≤ J(θ_k) + c α ∇J(θ_k)^T h_k
Interpretation: as long as J does not decrease enough, we reduce the value of the step size
Choice of the initial step
Newton method: ᾱ = 1
Gradient method: ᾱ = 2 [J(θ_k) − J(θ_{k−1})] / ∇J(θ_k)^T h_k
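A Python sketch of backtracking (ours; written with the usual check-then-shrink ordering, and using the parameter names c and rho from the slide):

import numpy as np

def backtracking(J, grad, theta, h, alpha_bar=1.0, rho=0.5, c=1e-4, max_shrink=50):
    # Shrink alpha until the Armijo sufficient-decrease condition holds
    alpha = alpha_bar
    slope = grad(theta) @ h              # = grad J(theta)^T h, negative for a descent direction
    for _ in range(max_shrink):
        if J(theta + alpha * h) <= J(theta) + c * alpha * slope:
            break                        # sufficient decrease reached
        alpha *= rho
    return alpha

# Example with the gradient direction on J(theta) = theta1^4 + theta2^4 - 4 theta1 theta2
J = lambda t: t[0]**4 + t[1]**4 - 4 * t[0] * t[1]
grad = lambda t: np.array([4 * t[0]**3 - 4 * t[1], 4 * t[1]**3 - 4 * t[0]])
theta = np.array([1.5, 0.5])
h = -grad(theta)
print(backtracking(J, grad, theta, h))   # a step size satisfying the Armijo condition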
Summary of descent methods
General algorithm
1: Initialize θ_0
2: repeat
3: Find a direction of descent h_k ∈ R^d
4: Line search: find the step size α_k > 0
5: Update: θ_{k+1} ← θ_k + α_k h_k
6: until convergence
Method         Direction of descent h     Complexity   Convergence
Gradient       −∇J(θ)                     O(d)         linear
Quasi-Newton   −B(θ)^{-1} ∇J(θ)           O(d^2)       superlinear
Newton         −H(θ)^{-1} ∇J(θ)           O(d^3)       quadratic
Step size computation: backtracking (common) or optimal step size
Complexity of each method: depends on the complexity of calculating h, the
search for α, and the number of iterations performed till convergence
Illustration of descent methods
Gradient method
[Figure, left: J(θ_k) along the iterations k; right: evolution of the iterates θ_k on the contour plot of J]
Newton method
[Figure, left: J(θ_k) along the iterations k; right: evolution of the iterates θ_k on the contour plot of J]
Note: at each iteration, the matrix H(θ) + λI was used instead of H(θ) to guarantee the positive definiteness of the Hessian.
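The trick mentioned above, replacing H(θ) by H(θ) + λI, can be sketched as follows (our own illustration; choosing λ from the smallest eigenvalue is one simple rule among several):

import numpy as np

def damped_newton_direction(grad_val, hess_val, delta=1e-3):
    # Use H + lam*I, with lam chosen so that the shifted matrix is positive definite
    eigmin = np.linalg.eigvalsh(hess_val).min()
    lam = max(0.0, -eigmin + delta)           # shift only when H is not positive definite
    d = hess_val.shape[0]
    return np.linalg.solve(hess_val + lam * np.eye(d), -grad_val)

# Hessian at the saddle point (0, 0) of the running example (indefinite), with a nonzero gradient
hess = np.array([[0.0, -4.0], [-4.0, 0.0]])
grad = np.array([0.5, 0.5])
h = damped_newton_direction(grad, hess, delta=1.0)
print(h, grad @ h < 0)   # the shifted system yields a descent direction (prints True)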
Conclusion
Unconstrained optimization of smooth objective function
Characterization of the solution(s) requires checking the optimality
conditions
Computation of a solution using descent methods
Gradient descent method
Newton method
Not covered in this lecture:
Convergence analysis of the studied algorithms
Non-smooth optimization
Gradient-free optimization