3. Introduction: Problem specification
Suppose we have a cost function (or objective function)
f(x) : R^n → R
Our aim is to find the value of the parameters x that minimize this function,
x^* = arg min_x f(x)
subject to the following constraints:
• equality: c_i(x) = 0, i = 1, …, m_e
• inequality: c_i(x) ≥ 0, i = m_e + 1, …, m
We will start by focussing on unconstrained problems.
4. Unconstrained optimization
• down-hill search (gradient descent) algorithms can find local minima
• which of the minima is found depends on the starting point
• such minima often occur in real applications
[Figure: a function of one variable, showing a local minimum and the global minimum]
6. Class of functions
[Figure: examples of a convex function and a non-convex function]
• Convexity provides a test for a single extremum
• A non-negative sum of convex functions is convex
7. Class of functions continued
[Figure: example functions: single extremum (convex); single extremum (non-convex); multiple extrema (non-convex); noisy and “horrible” (not convex)]
8. Optimization algorithm – key ideas
• Find δx such that f(x + δx) < f(x)
• This leads to an iterative update xn+1 = xn + δx
• Reduce the problem to a series of 1D line searches: δx = αp, where p is a chosen search direction and the scalar α is found by minimizing along it (a minimal code sketch follows below)
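To make these ideas concrete, here is a minimal sketch of the generic descent loop in Python/NumPy. The helper names (descent, backtracking_line_search, f, grad_f) and the Armijo-style backtracking rule used for the 1D search are illustrative assumptions rather than anything prescribed in these notes; the direction choices discussed next all plug in through choose_direction.

```python
import numpy as np

def backtracking_line_search(f, x, p, g, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until f decreases sufficiently along p (Armijo condition)."""
    alpha = alpha0
    while f(x + alpha * p) > f(x) + c * alpha * (g @ p) and alpha > 1e-12:
        alpha *= rho
    return alpha

def descent(f, grad_f, x0, choose_direction, n_iter=100):
    """Generic iterative scheme: pick a direction p_n, do a 1D line search for
    alpha_n, then update x_{n+1} = x_n + alpha_n * p_n."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        p = choose_direction(x, g)       # the direction choice distinguishes the methods
        alpha = backtracking_line_search(f, x, p, g)
        x = x + alpha * p
    return x
```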
9. Choosing the direction 1: axial iteration
Alternate minimization over x and y, i.e. search along one coordinate axis at a time (a usage sketch follows below)
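In terms of the descent sketch above, axial iteration simply cycles the search direction through the coordinate axes, with the sign of each axis flipped so the step points downhill. The helper below is an assumed illustration:

```python
import numpy as np
from itertools import cycle

def make_axial_chooser(dim):
    """Return a choose_direction(x, g) that cycles through the coordinate axes."""
    axes = cycle(range(dim))
    def choose(x, g):
        i = next(axes)
        p = np.zeros(dim)
        p[i] = -1.0 if g[i] > 0 else 1.0   # point the axis direction downhill
        return p
    return choose

# usage with the generic loop sketched above (assumed names):
# x_min = descent(f, grad_f, x0, choose_direction=make_axial_chooser(2))
```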
10. Choosing the direction 2: steepest descent
Move in the direction of the negative gradient, −∇f(x_n) (a usage sketch follows below)
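Steepest descent fits the same template with the direction set to the negative gradient. The small quadratic below is an assumed example (not from these notes), reusing the descent helper sketched earlier:

```python
import numpy as np

# illustrative quadratic f(x) = 0.5 x^T A x - b^T x (assumed example data)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

# steepest descent: p_n = -grad f(x_n)
x_min = descent(f, grad_f, np.array([10.0, 10.0]),
                choose_direction=lambda x, g: -g)
print(x_min, np.linalg.solve(A, b))   # the two should agree closely
```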
11. Steepest descent
[Figure: steepest-descent iterates on the contour plot of the quadratic example]
• The gradient is everywhere perpendicular to the contour lines.
• After each line minimization the new gradient is always orthogonal to the previous step direction (this is true of any line minimization).
• Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
12. A harder case: Rosenbrock’s function
f(x, y) = 100 (y − x^2)^2 + (1 − x)^2
[Figure: contour plot of the Rosenbrock function]
The minimum is at (x, y) = (1, 1).
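The function and its analytic gradient (obtained by differentiating the formula above) are straightforward to code; a small sketch:

```python
import numpy as np

def rosenbrock(v):
    """f(x, y) = 100 (y - x^2)^2 + (1 - x)^2, minimum at (1, 1)."""
    x, y = v
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def rosenbrock_grad(v):
    """Analytic gradient of the Rosenbrock function."""
    x, y = v
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])
```

Feeding these into the steepest-descent sketch above reproduces the slow, zig-zagging progress along the curved valley shown on the next slide.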
13. Steepest descent on the Rosenbrock function
[Figure: steepest-descent iterates on the Rosenbrock function, zoomed and full views]
• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls down the valley
14. Conjugate Gradients – sketch only
The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum (of a quadratic function) in a finite number of steps.
• Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:
p_n^T H p_j = 0, for 0 ≤ j < n
• The resulting search directions are mutually linearly independent.
• Remarkably, pn can be chosen using only knowledge of pn−1, ∇f(xn−1)
and ∇f(xn) (see Numerical Recipes)
p_n = −∇f_n + ( ∇f_n^T ∇f_n / ∇f_{n−1}^T ∇f_{n−1} ) p_{n−1}
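A minimal sketch of a non-linear conjugate-gradient loop built around this Fletcher-Reeves style update; the backtracking line search and the gradient-norm stopping test are illustrative assumptions:

```python
import numpy as np

def conjugate_gradient(f, grad_f, x0, n_iter=100, tol=1e-6):
    """Non-linear conjugate gradients with the Fletcher-Reeves beta (a sketch)."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    p = -g                                    # first step is steepest descent
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        # simple backtracking line search along p (assumed; any 1D minimizer works)
        alpha, f0 = 1.0, f(x)
        while f(x + alpha * p) > f0 + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves: |grad_n|^2 / |grad_{n-1}|^2
        p = -g_new + beta * p                 # conjugate direction update
        x, g = x_new, g_new
    return x
```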
15. Choosing the direction 3: conjugate gradients
Again, uses first derivatives only, but avoids “undoing” previous work.
• An N-dimensional quadratic form can be minimized in at most N
conjugate descent steps.
• Starting from 3 different points, the minimum is reached in exactly 2 steps.
16. Choosing the direction 4: Newton’s method
Start from Taylor expansion in 2D
A function may be approximated locally by its Taylor series expansion about a point x_0:
f(x + δx) ≈ f(x) + (∂f/∂x, ∂f/∂y) (δx, δy)^T + (1/2) (δx, δy) [ ∂^2f/∂x^2, ∂^2f/∂x∂y ; ∂^2f/∂x∂y, ∂^2f/∂y^2 ] (δx, δy)^T
The expansion to second order is a quadratic function of δx,
f(x + δx) = a + g^T δx + (1/2) δx^T H δx
where g is the gradient and H is the Hessian matrix of second derivatives. Now minimize this expansion over δx:
min_{δx} f(x + δx) = a + g^T δx + (1/2) δx^T H δx
17. [Figure: contour plot of the quadratic example, showing the Newton step]
min_{δx} f(x + δx) = a + g^T δx + (1/2) δx^T H δx
For a minimum we require that ∇f(x + δx) = 0, and so
∇f(x + δx) = g + H δx = 0
with solution δx = −H^{-1} g (in Matlab: δx = −H\g).
This gives the iterative update
x_{n+1} = x_n − H_n^{-1} g_n
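A sketch of this Newton iteration with a line search; the Hessian function hess_f is assumed to be supplied by the caller, and the simple step-halving line search is an illustrative choice:

```python
import numpy as np

def newton(f, grad_f, hess_f, x0, n_iter=50, tol=1e-3):
    """Newton's method: x_{n+1} = x_n - alpha_n * H_n^{-1} g_n (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:          # stop once the gradient is small
            break
        H = hess_f(x)
        dx = -np.linalg.solve(H, g)          # delta x = -H^{-1} g  (Matlab: -H\g)
        alpha = 1.0                          # crude step-halving line search
        while f(x + alpha * dx) > f(x) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * dx
    return x
```

Plugging in the Rosenbrock function with its gradient and Hessian gives the kind of run shown on slide 19.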
18. • If f(x) is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution δx = −H_n^{-1} g_n is guaranteed to be a downhill direction provided that H is positive definite.
• Rather than jump straight to the predicted solution at x_n − H_n^{-1} g_n, it is better to perform a line search:
x_{n+1} = x_n − α_n H_n^{-1} g_n
• If H = I then the update x_{n+1} = x_n − H_n^{-1} g_n reduces to steepest descent.
19. Newton’s method – example
[Figure: Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations. In the zoomed view, ellipses show the successive quadratic approximations.]
• The algorithm converges in only 15 iterations – far superior to steepest descent.
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.
20. Performance issues for optimization algorithms
1. Number of iterations required
2. Cost per iteration
3. Memory footprint
4. Region of convergence
21. Special structure for cost function - non-linear least squares
• It is very common in applications for a cost function f(x) to be the
sum of a large number of squared residuals
f(x) = Σ_{i=1}^{M} r_i(x)^2
• If each residual ri(x) depends non-linearly on the parameters x then
the minimization of f(x) is a non-linear least squares problem.
• We also assume that the residuals ri are: (i) small at the optimum,
and (ii) zero-mean.
22. Non-linear least squares
f(x) = Σ_{i=1}^{M} r_i(x)^2
Gradient:
∇f(x) = 2 Σ_{i=1}^{M} r_i(x) ∇r_i(x)
Hessian:
H = ∇∇^T f(x) = 2 Σ_{i=1}^{M} ∇( r_i(x) ∇^T r_i(x) ) = 2 Σ_{i=1}^{M} [ ∇r_i(x) ∇^T r_i(x) + r_i(x) ∇∇^T r_i(x) ]
which is approximated as
H_GN = 2 Σ_{i=1}^{M} ∇r_i(x) ∇^T r_i(x)
This is the Gauss-Newton approximation.
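Writing the residuals as a vector r(x) with Jacobian J (whose i-th row is ∇^T r_i(x)), the expressions above become g = 2 J^T r and H_GN = 2 J^T J. A minimal Gauss-Newton sketch under that assumption:

```python
import numpy as np

def gauss_newton(residuals, jacobian, x0, n_iter=50, tol=1e-3):
    """Gauss-Newton for f(x) = sum_i r_i(x)^2, approximating H by 2 J^T J (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r = residuals(x)                      # vector of residuals r_i(x)
        J = jacobian(x)                       # J[i, :] = grad r_i(x)^T
        g = 2.0 * J.T @ r                     # gradient of f
        if np.linalg.norm(g) < tol:
            break
        H_gn = 2.0 * J.T @ J                  # Gauss-Newton Hessian approximation
        dx = -np.linalg.solve(H_gn, g)
        alpha = 1.0                           # crude step-halving line search
        while np.sum(residuals(x + alpha * dx)**2) > np.sum(r**2) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * dx
    return x
```

For the Rosenbrock function one can take r(x, y) = (10 (y − x^2), 1 − x), since the sum of these squared residuals is exactly f(x, y).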
23. [Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations (full and zoomed views)]
• Minimization with the Gauss-Newton approximation, with line search, takes only 14 iterations:
x_{n+1} = x_n − α_n H_n^{-1} g_n with H_n = H_GN(x_n)
24. Comparison
[Figure: Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) vs. Newton with line search (gradient < 1e-3 after 15 iterations) on the Rosenbrock function]
Newton:
• requires computing the Hessian
• exact solution if the function is quadratic
Gauss-Newton:
• approximates the Hessian by products of the residual gradients
• requires only first derivatives
26. Levenberg-Marquardt algorithm
[Figure: 1D sketch comparing a gradient-descent step with a Newton step]
• Away from the minimum, in regions of negative curvature, the Gauss-Newton approximation is not very good.
• In such regions, a simple steepest-descent step is probably the best plan.
• The Levenberg-Marquardt method is a mechanism for varying between steepest-descent and Gauss-Newton steps depending on how good the H_GN approximation is locally.
27. • The method uses the modified Hessian
H(x, λ) = H_GN + λI
• When λ is small, H approximates the Gauss-Newton Hessian.
• When λ is large, H is close to the identity, causing steepest-descent
steps to be taken.
28. LM Algorithm
H(x, λ) = H_GN(x) + λI
1. Set λ = 0.001 (say)
2. Solve δx = −H(x, λ)^{-1} g
3. If f(xn + δx) > f(xn), increase λ (×10 say) and go to 2.
4. Otherwise, decrease λ (×0.1 say), let xn+1 = xn + δx, and go to 2.
Note : This algorithm does not require explicit line searches.
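A sketch of this loop in Python/NumPy, again assuming the non-linear least-squares form so that g = 2 J^T r and H_GN = 2 J^T J; the gradient-norm stopping test is an added assumption, since the steps above do not specify a termination criterion:

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, x0, lam=1e-3, n_iter=100, tol=1e-6):
    """Levenberg-Marquardt: blend Gauss-Newton and gradient-descent steps via lambda."""
    x = np.asarray(x0, dtype=float)
    f = lambda v: np.sum(residuals(v)**2)
    for _ in range(n_iter):
        r = residuals(x)
        J = jacobian(x)
        g = 2.0 * J.T @ r
        H_gn = 2.0 * J.T @ J
        if np.linalg.norm(g) < tol:           # added stopping test (assumption)
            break
        # 2. solve for the step using the modified Hessian H_GN + lambda * I
        dx = -np.linalg.solve(H_gn + lam * np.eye(x.size), g)
        if f(x + dx) > f(x):
            lam *= 10.0                        # 3. step failed: behave more like gradient descent
        else:
            lam *= 0.1                         # 4. step succeeded: behave more like Gauss-Newton
            x = x + dx
    return x
```

Rejected steps only increase λ and are retried, so no explicit line search is needed, exactly as the note above says.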