2. Ingredients
• Objective function
• Variables
• Constraints
Find values of the variables that minimize or maximize the objective function while satisfying the constraints
3. Different Kinds of Optimization
Figure from: Optimization Technology Center
http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/
4. Different Optimization Techniques
• Algorithms have very different flavor depending on specific problem
– Closed form vs. numerical vs. discrete
– Local vs. global minima
– Running times ranging from O(1) to NP-hard
• Today:
– Focus on continuous numerical methods
5. Optimization in 1-D
• Look for analogies to bracketing in root-finding
• What does it mean to bracket a minimum?
– Three points (xleft, f(xleft)), (xmid, f(xmid)), (xright, f(xright)) bracket a minimum when
xleft < xmid < xright
f(xmid) < f(xleft)
f(xmid) < f(xright)
6. Optimization in 1-D
• Once we have these properties, there is at least one local minimum between xleft and xright
• Establishing bracket initially:
– Given xinitial, increment
– Evaluate f(xinitial), f(xinitial + increment)
– If decreasing, step until find an increase
– Else, step in opposite direction until find an increase
– Grow increment at each step
• For maximization: substitute –f for f
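The bracketing procedure above can be sketched as follows (a sketch, not from the slides; the function and argument names are illustrative):

```python
def bracket_minimum(f, x_initial, increment=0.1, grow=2.0, max_steps=100):
    """Return (x_left, x_mid, x_right) with f(x_mid) below both endpoints."""
    a, b = x_initial, x_initial + increment
    fa, fb = f(a), f(b)
    if fb > fa:                       # function increasing: step the other way
        a, b = b, a
        fa, fb = fb, fa
        increment = -increment
    for _ in range(max_steps):        # step, growing the increment each time
        increment *= grow
        c = b + increment
        fc = f(c)
        if fc > fb:                   # f(b) < f(a) and f(b) < f(c): bracket found
            return (a, b, c) if a < c else (c, b, a)
        a, b, fa, fb = b, c, fb, fc
    raise RuntimeError("no bracket found")
```

Growing the increment geometrically keeps the number of function evaluations small even when the starting point is far from the minimum.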
7. Optimization in 1-D
• Strategy: evaluate function at some xnew
[Figure: bracket (xleft, f(xleft)), (xmid, f(xmid)), (xright, f(xright)) with trial point (xnew, f(xnew))]
8. Optimization in 1-D
• Strategy: evaluate function at some xnew
– Here, new “bracket” points are xnew, xmid, xright
[Figure: the four points, with xnew, xmid, xright forming the new bracket]
9. Optimization in 1-D
• Strategy: evaluate function at some xnew
– Here, new “bracket” points are xleft, xnew, xmid
[Figure: the four points, with xleft, xnew, xmid forming the new bracket]
10. Optimization in 1-D
• Unlike with root-finding, can’t always guarantee that interval will be reduced by a factor of 2
• Let’s find the optimal place for xmid, relative to left and right, that will guarantee same factor of reduction regardless of outcome
11. Optimization in 1-D
if f(xnew) < f(xmid)
    new interval = α
else
    new interval = 1 – α²
[Figure: bracket of unit width with interval lengths α and α² marked, showing the two possible new intervals]
12. Golden Section Search
• To assure same interval, want α = 1 – α²
• So, α = φ = (√5 – 1) / 2
• This is the “golden ratio” = 0.618…
• So, interval decreases by a constant factor (~38%) per iteration
– Linear convergence
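A minimal golden section search sketch, assuming (a, b, c) already bracket a minimum (names are illustrative; a production version would cache f(b) instead of re-evaluating it):

```python
import math

def golden_section(f, a, b, c, tol=1e-8):
    """Shrink the bracket by the golden ratio each step; linear convergence."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0          # 0.618...
    while c - a > tol:
        if b - a > c - b:                        # probe the larger sub-interval
            x = b - (1.0 - phi) * (b - a)
            if f(x) < f(b):
                c, b = b, x                      # new bracket (a, x, b)
            else:
                a = x                            # new bracket (x, b, c)
        else:
            x = b + (1.0 - phi) * (c - b)
            if f(x) < f(b):
                a, b = b, x                      # new bracket (b, x, c)
            else:
                c = x                            # new bracket (a, b, x)
    return b
```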
13. Error Tolerance
• Around minimum, derivative = 0, so
f(x + Δx) = f(x) + ½ f″(x) Δx² + …
f(x + Δx) – f(x) = ½ f″(x) Δx² = εmachine ⇒ Δx ~ √ε
• Rule of thumb: pointless to ask for more accuracy than sqrt(ε)
– Can use double precision if you want a single-precision result (and/or have single-precision data)
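A quick numerical illustration of the rule of thumb (not from the slides): near a minimum f varies only quadratically, so moving x by much less than √εmachine produces a change in f that is lost to roundoff.

```python
import sys

def f(x):
    return 2.0 + x * x               # minimum at x = 0, where f = 2

eps = sys.float_info.epsilon         # ~2.2e-16 for doubles; sqrt(eps) ~ 1.5e-8
indistinguishable = f(0.0) == f(1e-9)   # Delta f = 1e-18: below roundoff, invisible
distinguishable = f(0.0) != f(1e-7)     # Delta f = 1e-14: still visible
```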
14. Faster 1-D Optimization
• Trade off super-linear convergence for worse robustness
– Combine with Golden Section search for safety
• Usual bag of tricks:
– Fit parabola through 3 points, find minimum
– Compute derivatives as well as positions, fit cubic
– Use second derivatives: Newton
19. Newton’s Method
• At each step:
xk+1 = xk – f′(xk) / f″(xk)
• Requires 1st and 2nd derivatives
• Quadratic convergence
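The update above as a minimal sketch, assuming f′ and f″ are available analytically (names are illustrative):

```python
def newton_min_1d(fprime, fsecond, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f'(x)/f''(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step                      # x_{k+1} = x_k - f'(x_k) / f''(x_k)
        if abs(step) < tol:
            break
    return x
```

Note this finds a stationary point of f; whether it is a minimum depends on the sign of f″ near the start.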
20. Multi-Dimensional Optimization
• Important in many areas
– Fitting a model to measured data
– Finding best design in some parameter space
• Hard in general
– Weird shapes: multiple extrema, saddles, curved or elongated valleys, etc.
– Can’t bracket
• In general, easier than rootfinding
– Can always walk “downhill”
21. Newton’s Method in Multiple Dimensions
• Replace 1st derivative with gradient, 2nd derivative with Hessian
• For f(x, y):
∇f = [∂f/∂x, ∂f/∂y]ᵀ
H = [ ∂²f/∂x²    ∂²f/∂x∂y ]
    [ ∂²f/∂x∂y  ∂²f/∂y²   ]
22. Newton’s Method in Multiple Dimensions
• Replace 1st derivative with gradient, 2nd derivative with Hessian
• So,
xk+1 = xk – H⁻¹(xk) ∇f(xk)
• Tends to be extremely fragile unless function very smooth and starting close to minimum
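A sketch of one multi-dimensional Newton step in 2-D (the test function and all names are chosen for illustration). On a strictly convex quadratic, a single step xk+1 = xk – H⁻¹∇f(xk) lands exactly on the minimum:

```python
def newton_step_2d(grad, hess, x, y):
    """One Newton step: solve H d = grad f, then move by -d."""
    gx, gy = grad(x, y)
    (a, b), (c, d) = hess(x, y)
    det = a * d - b * c                  # invert the 2x2 Hessian directly
    dx = (d * gx - b * gy) / det
    dy = (-c * gx + a * gy) / det
    return x - dx, y - dy

# Illustrative quadratic: f(x, y) = (x - 1)^2 + 10 (y + 2)^2 + x y
grad = lambda x, y: (2 * (x - 1) + y, 20 * (y + 2) + x)
hess = lambda x, y: ((2.0, 1.0), (1.0, 20.0))   # constant for a quadratic
```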
23. Important classification of methods
• Use function + gradient + Hessian (Newton)
• Use function + gradient (most descent methods)
• Use function values only (Nelder-Mead, also called the “simplex” or “amoeba” method)
24. Steepest Descent Methods
• What if you can’t / don’t want to use 2nd derivative?
• “Quasi-Newton” methods estimate Hessian
• Alternative: walk along (negative of) gradient…
– Perform 1-D minimization along line passing through current point in the direction of the gradient
– Once done, re-compute gradient, iterate
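A steepest descent sketch in 2-D. The slides call for an exact 1-D minimization along the gradient direction; here that is simplified to backtracking (halve the step until f decreases), and all names are illustrative:

```python
def steepest_descent(f, grad, x, y, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        gx, gy = grad(x, y)
        if (gx * gx + gy * gy) ** 0.5 < tol:     # gradient ~ 0: done
            break
        t = 1.0
        # crude line search along -gradient: shrink t until f decreases
        while f(x - t * gx, y - t * gy) >= f(x, y) and t > 1e-16:
            t *= 0.5
        x, y = x - t * gx, y - t * gy            # step downhill, then re-compute gradient
    return x, y
```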
27. Conjugate Gradient Methods
• Idea: avoid “undoing” minimization that’s already been done
• Walk along direction
dk+1 = –gk+1 + βk dk
• Polak and Ribière formula:
βk = gk+1ᵀ (gk+1 – gk) / (gkᵀ gk)
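A sketch of the direction update with the Polak–Ribière β, applied to a 2-D quadratic f(v) = ½ vᵀA v – bᵀv, where the exact 1-D line minimizer along d is α = –(gᵀd)/(dᵀAd). All names are illustrative:

```python
def cg_quadratic(A, b, v, steps=2):
    def matvec(M, u):
        return [M[0][0]*u[0] + M[0][1]*u[1], M[1][0]*u[0] + M[1][1]*u[1]]
    def dot(p, q):
        return p[0]*q[0] + p[1]*q[1]
    g = [matvec(A, v)[i] - b[i] for i in range(2)]     # gradient: A v - b
    d = [-g[0], -g[1]]                                 # first direction: steepest descent
    for _ in range(steps):
        Ad = matvec(A, d)
        alpha = -dot(g, d) / dot(d, Ad)                # exact 1-D minimization along d
        v = [v[0] + alpha*d[0], v[1] + alpha*d[1]]
        g_new = [matvec(A, v)[i] - b[i] for i in range(2)]
        # Polak-Ribiere: beta_k = g_{k+1}^T (g_{k+1} - g_k) / (g_k^T g_k)
        beta = dot(g_new, [g_new[0] - g[0], g_new[1] - g[1]]) / dot(g, g)
        d = [-g_new[0] + beta*d[0], -g_new[1] + beta*d[1]]
        g = g_new
    return v
```

For this n = 2 quadratic, two steps reach the exact solution (up to roundoff), as the next slide states.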
28. Conjugate Gradient Methods
• Conjugate gradient implicitly obtains information about Hessian
• For quadratic function in n dimensions, gets exact solution in n steps (ignoring roundoff error)
• Works well in practice…
29. Value-Only Methods in Multi-Dimensions
• If can’t evaluate gradients, life is hard
• Can use approximate (numerically evaluated) gradients:
∇f(x) = [∂f/∂e1, ∂f/∂e2, ∂f/∂e3, …]ᵀ ≈ [(f(x + δ·e1) – f(x))/δ, (f(x + δ·e2) – f(x))/δ, (f(x + δ·e3) – f(x))/δ, …]ᵀ
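The forward-difference formula above as a sketch (names are illustrative; δ trades truncation error against roundoff, echoing the √ε rule of thumb from slide 13):

```python
def numerical_gradient(f, x, delta=1e-6):
    """Approximate grad f(x) by (f(x + delta*e_i) - f(x)) / delta per coordinate."""
    f0 = f(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += delta                    # perturb one coordinate: x + delta * e_i
        grad.append((f(xp) - f0) / delta)
    return grad
```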
30. Generic Optimization Strategies
• Uniform sampling:
– Cost rises exponentially with # of dimensions
• Simulated annealing:
– Search in random directions
– Start with large steps, gradually decrease
– “Annealing schedule” – how fast to cool?
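A simplified sketch of the random-direction search with a geometric cooling schedule. This greedy variant accepts only downhill moves (full simulated annealing would also accept some uphill moves with a temperature-dependent probability); all names are illustrative:

```python
import math
import random

def annealing_search(f, x, y, step=1.0, cooling=0.95, iters=2000, seed=0):
    rng = random.Random(seed)
    fx = f(x, y)
    for _ in range(iters):
        theta = rng.uniform(0.0, 2.0 * math.pi)   # random direction
        nx = x + step * math.cos(theta)
        ny = y + step * math.sin(theta)
        fn = f(nx, ny)
        if fn < fx:                               # greedy accept (simplification:
            x, y, fx = nx, ny, fn                 #   no uphill moves here)
        step *= cooling                           # annealing schedule: shrink steps
    return x, y, fx
```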
31. Downhill Simplex Method (Nelder-Mead)
• Keep track of n+1 points in n dimensions
– Vertices of a simplex (triangle in 2D, tetrahedron in 3D, etc.)
• At each iteration: simplex can move, expand, or contract
– Sometimes known as amoeba method: simplex “oozes” along the function
32. Downhill Simplex Method (Nelder-Mead)
• Basic operation: reflection
[Figure: worst point (highest function value) and location probed by reflection step]
33. Downhill Simplex Method (Nelder-Mead)
• If reflection resulted in best (lowest) value so far, try an expansion
• Else, if reflection helped at all, keep it
[Figure: location probed by expansion step]
34. Downhill Simplex Method (Nelder-Mead)
• If reflection didn’t help (reflected point still worst), try a contraction
[Figure: location probed by contraction step]
35. Downhill Simplex Method (Nelder-Mead)
• If all else fails, shrink the simplex around the best point
36. Downhill Simplex Method (Nelder-Mead)
• Method fairly efficient at each iteration (typically 1–2 function evaluations)
• Can take lots of iterations
• Somewhat flaky – sometimes needs restart after simplex collapses on itself, etc.
• Benefits: simple to implement, doesn’t need derivative, doesn’t care about function smoothness, etc.
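A compact 2-D sketch combining the operations from the last few slides: reflection, expansion, contraction, and shrink. The coefficients (1, 2, ½, ½) are the conventional choices, not stated on the slides, and the restart logic and convergence test are omitted:

```python
def nelder_mead_2d(f, simplex, iters=200):
    pts = [list(p) for p in simplex]             # 3 vertices in 2-D
    for _ in range(iters):
        pts.sort(key=lambda p: f(p))             # best first, worst last
        best, worst = pts[0], pts[2]
        cx = (pts[0][0] + pts[1][0]) / 2.0       # centroid of all but the worst
        cy = (pts[0][1] + pts[1][1]) / 2.0
        refl = [2 * cx - worst[0], 2 * cy - worst[1]]          # reflection
        if f(refl) < f(best):
            expd = [3 * cx - 2 * worst[0], 3 * cy - 2 * worst[1]]  # expansion
            pts[2] = expd if f(expd) < f(refl) else refl
        elif f(refl) < f(pts[1]):
            pts[2] = refl                        # reflection helped: keep it
        else:
            cont = [(cx + worst[0]) / 2.0, (cy + worst[1]) / 2.0]  # contraction
            if f(cont) < f(worst):
                pts[2] = cont
            else:                                # all else failed: shrink toward best
                for i in (1, 2):
                    pts[i] = [(pts[i][0] + best[0]) / 2.0,
                              (pts[i][1] + best[1]) / 2.0]
    pts.sort(key=lambda p: f(p))
    return pts[0]
```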
37. Rosenbrock’s Function
• Designed specifically for testing optimization techniques
• Curved, narrow valley
f(x, y) = 100(y – x²)² + (1 – x)²
38. Constrained Optimization
• Equality constraints: optimize f(x) subject to gi(x) = 0
• Method of Lagrange multipliers: convert to a higher-dimensional problem
• Minimize f(x) + Σ λi gi(x) w.r.t. (x1 … xn; λ1 … λk)
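A worked check of the recipe on a tiny example (chosen for illustration, not from the slides): minimize f(x, y) = x² + y² subject to g(x, y) = x + y – 1 = 0. Setting the partials of L = f + λg to zero gives 2x + λ = 0, 2y + λ = 0, x + y = 1, so x = y = ½ and λ = –1:

```python
# Verify stationarity of L = f + lambda * g at the candidate (x, y, lambda)
x, y, lam = 0.5, 0.5, -1.0
dL_dx = 2 * x + lam          # dL/dx = 0 at the constrained minimum
dL_dy = 2 * y + lam          # dL/dy = 0
g = x + y - 1.0              # dL/dlambda = 0 recovers the constraint
```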
39. Constrained Optimization
• Inequality constraints are harder…
• If objective function and constraints all linear, this is “linear programming”
• Observation: minimum must lie at corner of region formed by constraints
• Simplex method: move from vertex to vertex, minimizing objective function
40. Constrained Optimization
• General “nonlinear programming” hard
• Algorithms for special cases (e.g. quadratic)
41. Global Optimization
• In general, can’t guarantee that you’ve found global (rather than local) minimum
• Some heuristics:
– Multi-start: try local optimization from several starting positions
– Very slow simulated annealing
– Use analytical methods (or graphing) to determine behavior, guide methods to correct neighborhoods