3. Introduction: Problem specification
Suppose we have a cost function (or objective function)
f(x) : R^n → R
Our aim is to find the value of the parameters x that minimize this function,
x^* = arg min_x f(x)
subject to the following constraints:
• equality: c_i(x) = 0, i = 1, …, m_e
• inequality: c_i(x) ≥ 0, i = m_e + 1, …, m
We will start by focussing on unconstrained problems.
4. Unconstrained optimization
• down-hill search (gradient descent) algorithms can find local minima
• which of the minima is found depends on the starting point
• such minima often occur in real applications
[Figure: a function of one variable, showing a local minimum and the global minimum]
6. Class of functions
[Figure: examples of a convex function and a non-convex function]
• Convexity provides a test for a single extremum
• A non-negative sum of convex functions is convex
7. Class of functions continued
[Figure: example functions: single extremum (convex); single extremum (non-convex); multiple extrema (non-convex); noisy and “horrible” (not convex)]
8. Optimization algorithm – key ideas
• Find δx such that f(x + δx) < f(x)
• This leads to an iterative update xn+1 = xn + δx
• Reduce the problem to a series of 1D line searches: δx = αp, where p is a chosen search direction and the scalar α is found by minimizing along it (a minimal code sketch follows below)
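To make these ideas concrete, here is a minimal sketch of the generic descent loop in Python/NumPy. The helper names (descent, backtracking_line_search, f, grad_f) and the Armijo-style backtracking rule used for the 1D search are illustrative assumptions rather than anything prescribed in these notes; the direction choices discussed next all plug in through choose_direction.

```python
import numpy as np

def backtracking_line_search(f, x, p, g, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until f decreases sufficiently along p (Armijo condition)."""
    alpha = alpha0
    while f(x + alpha * p) > f(x) + c * alpha * (g @ p) and alpha > 1e-12:
        alpha *= rho
    return alpha

def descent(f, grad_f, x0, choose_direction, n_iter=100):
    """Generic iterative scheme: pick a direction p_n, do a 1D line search for
    alpha_n, then update x_{n+1} = x_n + alpha_n * p_n."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        p = choose_direction(x, g)       # the direction choice distinguishes the methods
        alpha = backtracking_line_search(f, x, p, g)
        x = x + alpha * p
    return x
```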
9. Choosing the direction 1: axial iteration
Alternate minimization over x and y, i.e. search along one coordinate axis at a time (a usage sketch follows below)
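In terms of the descent sketch above, axial iteration simply cycles the search direction through the coordinate axes, with the sign of each axis flipped so the step points downhill. The helper below is an assumed illustration:

```python
import numpy as np
from itertools import cycle

def make_axial_chooser(dim):
    """Return a choose_direction(x, g) that cycles through the coordinate axes."""
    axes = cycle(range(dim))
    def choose(x, g):
        i = next(axes)
        p = np.zeros(dim)
        p[i] = -1.0 if g[i] > 0 else 1.0   # point the axis direction downhill
        return p
    return choose

# usage with the generic loop sketched above (assumed names):
# x_min = descent(f, grad_f, x0, choose_direction=make_axial_chooser(2))
```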
10. Choosing the direction 2: steepest descent
Move in the direction of the negative gradient, −∇f(x_n) (a usage sketch follows below)
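Steepest descent fits the same template with the direction set to the negative gradient. The small quadratic below is an assumed example (not from these notes), reusing the descent helper sketched earlier:

```python
import numpy as np

# illustrative quadratic f(x) = 0.5 x^T A x - b^T x (assumed example data)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

# steepest descent: p_n = -grad f(x_n)
x_min = descent(f, grad_f, np.array([10.0, 10.0]),
                choose_direction=lambda x, g: -g)
print(x_min, np.linalg.solve(A, b))   # the two should agree closely
```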
11. Steepest descent
[Figure: steepest-descent iterates on the contour plot of the quadratic example]
• The gradient is everywhere perpendicular to the contour lines.
• After each line minimization the new gradient is always orthogonal to the previous step direction (this is true of any line minimization).
• Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
12. A harder case: Rosenbrock’s function
f(x, y) = 100 (y − x^2)^2 + (1 − x)^2
[Figure: contour plot of the Rosenbrock function]
The minimum is at (x, y) = (1, 1).
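The function and its analytic gradient (obtained by differentiating the formula above) are straightforward to code; a small sketch:

```python
import numpy as np

def rosenbrock(v):
    """f(x, y) = 100 (y - x^2)^2 + (1 - x)^2, minimum at (1, 1)."""
    x, y = v
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def rosenbrock_grad(v):
    """Analytic gradient of the Rosenbrock function."""
    x, y = v
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])
```

Feeding these into the steepest-descent sketch above reproduces the slow, zig-zagging progress along the curved valley shown on the next slide.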
13. Steepest descent on the Rosenbrock function
[Figure: steepest-descent iterates on the Rosenbrock function, zoomed and full views]
• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls down the valley
14. Conjugate Gradients – sketch only
The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum (of a quadratic function) in a finite number of steps.
• Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:
p_n^T H p_j = 0, for 0 ≤ j < n
• The resulting search directions are mutually linearly independent.
• Remarkably, pn can be chosen using only knowledge of pn−1, ∇f(xn−1)
and ∇f(xn) (see Numerical Recipes)
p_n = −∇f_n + ( ∇f_n^T ∇f_n / ∇f_{n−1}^T ∇f_{n−1} ) p_{n−1}
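A minimal sketch of a non-linear conjugate-gradient loop built around this Fletcher-Reeves style update; the backtracking line search and the gradient-norm stopping test are illustrative assumptions:

```python
import numpy as np

def conjugate_gradient(f, grad_f, x0, n_iter=100, tol=1e-6):
    """Non-linear conjugate gradients with the Fletcher-Reeves beta (a sketch)."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    p = -g                                    # first step is steepest descent
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        # simple backtracking line search along p (assumed; any 1D minimizer works)
        alpha, f0 = 1.0, f(x)
        while f(x + alpha * p) > f0 + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves: |grad_n|^2 / |grad_{n-1}|^2
        p = -g_new + beta * p                 # conjugate direction update
        x, g = x_new, g_new
    return x
```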
15. Choosing the direction 3: conjugate gradients
Again, uses first derivatives only, but avoids “undoing” previous work.
• An N-dimensional quadratic form can be minimized in at most N
conjugate descent steps.
• Starting from 3 different points, the minimum is reached in exactly 2 steps.
16. Choosing the direction 4: Newton’s method
Start from Taylor expansion in 2D
A function may be approximated locally by its Taylor series expansion about a point x_0:
f(x + δx) ≈ f(x) + (∂f/∂x, ∂f/∂y) (δx, δy)^T + (1/2) (δx, δy) [ ∂^2f/∂x^2, ∂^2f/∂x∂y ; ∂^2f/∂x∂y, ∂^2f/∂y^2 ] (δx, δy)^T
The expansion to second order is a quadratic function of δx,
f(x + δx) = a + g^T δx + (1/2) δx^T H δx
where g is the gradient and H is the Hessian matrix of second derivatives. Now minimize this expansion over δx:
min_{δx} f(x + δx) = a + g^T δx + (1/2) δx^T H δx
17. [Figure: contour plot of the quadratic example, showing the Newton step]
min_{δx} f(x + δx) = a + g^T δx + (1/2) δx^T H δx
For a minimum we require that ∇f(x + δx) = 0, and so
∇f(x + δx) = g + H δx = 0
with solution δx = −H^{-1} g (in Matlab: δx = −H\g).
This gives the iterative update
x_{n+1} = x_n − H_n^{-1} g_n
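A sketch of this Newton iteration with a line search; the Hessian function hess_f is assumed to be supplied by the caller, and the simple step-halving line search is an illustrative choice:

```python
import numpy as np

def newton(f, grad_f, hess_f, x0, n_iter=50, tol=1e-3):
    """Newton's method: x_{n+1} = x_n - alpha_n * H_n^{-1} g_n (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:          # stop once the gradient is small
            break
        H = hess_f(x)
        dx = -np.linalg.solve(H, g)          # delta x = -H^{-1} g  (Matlab: -H\g)
        alpha = 1.0                          # crude step-halving line search
        while f(x + alpha * dx) > f(x) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * dx
    return x
```

Plugging in the Rosenbrock function with its gradient and Hessian gives the kind of run shown on slide 19.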
18. • If f(x) is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution δx = −H_n^{-1} g_n is guaranteed to be a downhill direction provided that H is positive definite.
• Rather than jump straight to the predicted solution at x_n − H_n^{-1} g_n, it is better to perform a line search:
x_{n+1} = x_n − α_n H_n^{-1} g_n
• If H = I then the update x_{n+1} = x_n − H_n^{-1} g_n reduces to steepest descent.
19. Newton’s method – example
[Figure: Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations. In the zoomed view, ellipses show the successive quadratic approximations.]
• The algorithm converges in only 15 iterations – far superior to steepest descent.
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.
20. Performance issues for optimization algorithms
1. Number of iterations required
2. Cost per iteration
3. Memory footprint
4. Region of convergence
21. Special structure for cost function - non-linear least squares
• It is very common in applications for a cost function f(x) to be the
sum of a large number of squared residuals
f(x) = Σ_{i=1}^{M} r_i(x)^2
• If each residual ri(x) depends non-linearly on the parameters x then
the minimization of f(x) is a non-linear least squares problem.
• We also assume that the residuals ri are: (i) small at the optimum,
and (ii) zero-mean.
22. Non-linear least squares
f(x) = Σ_{i=1}^{M} r_i(x)^2
Gradient:
∇f(x) = 2 Σ_{i=1}^{M} r_i(x) ∇r_i(x)
Hessian:
H = ∇∇^T f(x) = 2 Σ_{i=1}^{M} ∇( r_i(x) ∇^T r_i(x) ) = 2 Σ_{i=1}^{M} [ ∇r_i(x) ∇^T r_i(x) + r_i(x) ∇∇^T r_i(x) ]
which is approximated as
H_GN = 2 Σ_{i=1}^{M} ∇r_i(x) ∇^T r_i(x)
This is the Gauss-Newton approximation.
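Writing the residuals as a vector r(x) with Jacobian J (whose i-th row is ∇^T r_i(x)), the expressions above become g = 2 J^T r and H_GN = 2 J^T J. A minimal Gauss-Newton sketch under that assumption:

```python
import numpy as np

def gauss_newton(residuals, jacobian, x0, n_iter=50, tol=1e-3):
    """Gauss-Newton for f(x) = sum_i r_i(x)^2, approximating H by 2 J^T J (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r = residuals(x)                      # vector of residuals r_i(x)
        J = jacobian(x)                       # J[i, :] = grad r_i(x)^T
        g = 2.0 * J.T @ r                     # gradient of f
        if np.linalg.norm(g) < tol:
            break
        H_gn = 2.0 * J.T @ J                  # Gauss-Newton Hessian approximation
        dx = -np.linalg.solve(H_gn, g)
        alpha = 1.0                           # crude step-halving line search
        while np.sum(residuals(x + alpha * dx)**2) > np.sum(r**2) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * dx
    return x
```

For the Rosenbrock function one can take r(x, y) = (10 (y − x^2), 1 − x), since the sum of these squared residuals is exactly f(x, y).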
23. [Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations (full and zoomed views)]
• Minimization with the Gauss-Newton approximation, with line search, takes only 14 iterations:
x_{n+1} = x_n − α_n H_n^{-1} g_n with H_n = H_GN(x_n)
24. Comparison
[Figure: Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) vs. Newton with line search (gradient < 1e-3 after 15 iterations) on the Rosenbrock function]
Newton:
• requires computing the Hessian
• exact solution if the function is quadratic
Gauss-Newton:
• approximates the Hessian by products of the residual gradients
• requires only first derivatives
26. Levenberg-Marquardt algorithm
[Figure: 1D sketch comparing a gradient-descent step with a Newton step]
• Away from the minimum, in regions of negative curvature, the Gauss-Newton approximation is not very good.
• In such regions, a simple steepest-descent step is probably the best plan.
• The Levenberg-Marquardt method is a mechanism for varying between steepest-descent and Gauss-Newton steps depending on how good the H_GN approximation is locally.
27. • The method uses the modified Hessian
H(x, λ) = H_GN + λI
• When λ is small, H approximates the Gauss-Newton Hessian.
• When λ is large, H is close to the identity, causing steepest-descent
steps to be taken.
28. LM Algorithm
H(x, λ) = H_GN(x) + λI
1. Set λ = 0.001 (say)
2. Solve δx = −H(x, λ)^{-1} g
3. If f(xn + δx) > f(xn), increase λ (×10 say) and go to 2.
4. Otherwise, decrease λ (×0.1 say), let xn+1 = xn + δx, and go to 2.
Note : This algorithm does not require explicit line searches.
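A sketch of this loop in Python/NumPy, again assuming the non-linear least-squares form so that g = 2 J^T r and H_GN = 2 J^T J; the gradient-norm stopping test is an added assumption, since the steps above do not specify a termination criterion:

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, x0, lam=1e-3, n_iter=100, tol=1e-6):
    """Levenberg-Marquardt: blend Gauss-Newton and gradient-descent steps via lambda."""
    x = np.asarray(x0, dtype=float)
    f = lambda v: np.sum(residuals(v)**2)
    for _ in range(n_iter):
        r = residuals(x)
        J = jacobian(x)
        g = 2.0 * J.T @ r
        H_gn = 2.0 * J.T @ J
        if np.linalg.norm(g) < tol:           # added stopping test (assumption)
            break
        # 2. solve for the step using the modified Hessian H_GN + lambda * I
        dx = -np.linalg.solve(H_gn + lam * np.eye(x.size), g)
        if f(x + dx) > f(x):
            lam *= 10.0                        # 3. step failed: behave more like gradient descent
        else:
            lam *= 0.1                         # 4. step succeeded: behave more like Gauss-Newton
            x = x + dx
    return x
```

Rejected steps only increase λ and are retried, so no explicit line search is needed, exactly as the note above says.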