Derivative Free Optimization, presented in Liège, 2011


- 1. DERIVATIVE-FREE OPTIMIZATION. http://www.lri.fr/~teytaud/dfo.pdf (or Quentin's web page?). Olivier Teytaud, Inria Tao, visiting the beautiful city of Liège. Also using slides from A. Auger.
- 2. The next slide is the most important of all.
- 3. In case of trouble, interrupt me.
- 4. In case of trouble, interrupt me. Further discussion needed: R82A, Montefiore institute; olivier.teytaud@inria.fr; or after the lessons (the 25th, not the 18th).
- 5. I. Optimization and DFO II. Evolutionary algorithms III. From math. programming IV. Using machine learning V. Conclusions
- 6. Derivative-free optimization of f
- 7. Derivative-free optimization of f
- 8. Derivative-free optimization of f. No gradient! Only depends on the x's and the f(x)'s.
- 9. Derivative-free optimization of f. Why derivative-free optimization?
- 10. Derivative-free optimization of f. Why derivative-free optimization? OK, it's slower.
- 11. Derivative-free optimization of f. Why derivative-free optimization? OK, it's slower. But sometimes you have no derivative.
- 12. Derivative-free optimization of f. Why derivative-free optimization? OK, it's slower. But sometimes you have no derivative. It's simpler (by far) ==> fewer bugs.
- 13. Derivative-free optimization of f. Why derivative-free optimization? OK, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
- 14. Derivative-free optimization of f. Optimization algorithms: Newton optimization? Quasi-Newton (BFGS)? Gradient descent? Why derivative-free? OK, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
- 15. Derivative-free optimization of f. Optimization algorithms: derivative-free optimization (don't need gradients). Why derivative-free optimization? OK, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
- 16. Derivative-free optimization of f. Optimization algorithms: derivative-free optimization; comparison-based optimization (coming soon), just needing comparisons, including evolutionary algorithms. Why derivative-free optimization? OK, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
- 17. I. Optimization and DFO II. Evolutionary algorithms III. From math. programming IV. Using machine learning V. Conclusions
- 18. II. Evolutionary algorithms a. Fundamental elements b. Algorithms c. Math. analysis
- 19. Preliminaries: Gaussian distribution, multivariate Gaussian distribution, non-isotropic Gaussian distribution, Markov chains ==> for theoretical analysis.
- 20. Preliminaries: Gaussian distribution, multivariate Gaussian distribution, non-isotropic Gaussian distribution, Markov chains.
- 21. Density K exp(-p(x)), with p(x) a degree-2 polynomial (negative dominant coefficient in the exponent) and K a normalization constant.
- 22. Density K exp(-p(x)), with p(x) a degree-2 polynomial and K a normalization constant; the figure annotates the translation of the Gaussian and the size of the Gaussian.
- 23. Preliminaries: Gaussian distribution, multivariate Gaussian distribution, non-isotropic Gaussian distribution.
- 24. Isotropic case: density = K exp(-||x - μ||² / (2σ²)) ==> level sets are rotationally invariant ==> completely defined by μ and σ (do you understand why K is fixed by σ?) ==> the "isotropic" Gaussian. General case: non-isotropic.
- 25. Preliminaries: Gaussian distribution, multivariate Gaussian distribution, non-isotropic Gaussian distribution.
- 26. Non-isotropic case: step-size different on each axis. Density K exp(-p(x)) with p(x) a quadratic form (--> +infinity) and K a normalization constant.
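A non-isotropic Gaussian like the one on slide 26 can be sampled by scaling a standard Gaussian axis-wise; a minimal numerical check (the step-size values are our own illustration, not from the slides):

```python
import numpy as np

# Non-isotropic Gaussian: a different step-size on each axis (diagonal
# covariance), i.e. density K exp(-p(x)) with p a quadratic form.
rng = np.random.default_rng(0)
sigmas = np.array([3.0, 0.5])                 # per-axis step-sizes (ours)
samples = rng.standard_normal((10000, 2)) * sigmas
stds = samples.std(axis=0)                    # empirical per-axis spread
```

The empirical spreads recover the chosen per-axis step-sizes, which is exactly the "different step-size on each axis" picture.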
- 27. Notions that we will see: evolutionary algorithm; cross-over; truncation selection / roulette wheel; linear / log-linear convergence; Estimation of Distribution Algorithm; EMNA; self-adaptation; (1+1)-ES with 1/5th rule; Voronoi representation; non-isotropy.
- 28. Comparison-based optimization. Observation: we want robustness w.r.t. monotone transformations of the objective. Opt is comparison-based if its iterates depend on the objective values only through their comparisons. Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud, parallel evolution.
- 29. Comparison-based optimization: y_i = f(x_i); Opt is comparison-based if the new points depend only on the comparisons between the y_i's.
- 30. Population-based comparison-based algorithms? X(1) = (x(1,1), x(1,2), ..., x(1,λ)) = Opt(); X(2) = (x(2,1), x(2,2), ..., x(2,λ)) = Opt(x(1), signs of diff); ... x(n) = (x(n,1), x(n,2), ..., x(n,λ)) = Opt(x(n-1), signs of diff) ==> let's write it for λ=2.
- 31. Population-based comparison-based algorithms? x(1) = (x(1,1), x(1,2)) = Opt(); x(2) = (x(2,1), x(2,2)) = Opt(x(1), sign(y(1,1)-y(1,2))); ... x(n) = (x(n,1), x(n,2)) = Opt(x(n-1), sign(y(n-1,1)-y(n-1,2))), with y(i,j) = f(x(i,j)).
- 32. Abstract notations: x(i) is a population. x(1) = Opt(); x(2) = Opt(x(1), sign(y(1,1)-y(1,2))); ... x(n) = Opt(x(n-1), sign(y(n-1,1)-y(n-1,2))).
- 33. Abstract notations: x(i) is a population, I(i) is an internal state of the algorithm. x(1), I(1) = Opt(); x(2), I(2) = Opt(x(1), sign(y(1,1)-y(1,2)), I(1)); ... x(n), I(n) = Opt(x(n-1), sign(y(n-1,1)-y(n-1,2)), I(n-1)).
- 34. Abstract notations, with c(i) the comparison results of step i: x(1), I(1) = Opt(); x(2), I(2) = Opt(x(1), c(1), I(1)); ... x(n), I(n) = Opt(x(n-1), c(n-1), I(n-1)).
- 35. Comparison-based optimization ==> same behavior on many functions (Opt is comparison-based, so all monotone transformations of f are treated identically).
- 36. Comparison-based optimization ==> same behavior on many functions. Quasi-Newton methods are very poor on this.
- 37. Why comparison-based algorithms? ==> more robust ==> this can be mathematically formalized: comparison-based optimizers are slow (d log ||x_n - x*|| / n ~ constant) but robust (optimal for some worst-case analysis).
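The robustness claim can be made concrete: a comparison-based optimizer, run with the same random seed, follows exactly the same trajectory on f and on any increasing transformation g∘f. A minimal sketch (the fixed-step random search and all names are our own illustration, not an algorithm from the slides):

```python
import numpy as np

def comparison_search(f, x, sigma=0.5, iters=300, seed=0):
    """Fixed-step random search; f enters only through comparisons."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        y = x + sigma * rng.standard_normal(x.shape)
        if f(y) < f(x):          # the only use of f: a comparison
            x = y
    return x

sphere = lambda x: float(np.sum(x ** 2))
warped = lambda x: float(np.arctan(np.sum(x ** 2)))  # g o f, g increasing

x0 = np.array([2.0, -1.0, 0.5])
run_f = comparison_search(sphere, x0.copy())
run_gf = comparison_search(warped, x0.copy())
```

Since arctan is strictly increasing, every comparison comes out the same, so `run_f` and `run_gf` are identical point for point.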
- 38. II. Evolutionary algorithms a. Fundamental elements b. Algorithms c. Math. analysis
- 39. Basic schema of an Evolution Strategy. Parameters: x, σ. Generate λ points around x (x + σN, where N is a standard Gaussian).
- 40. Basic schema of an Evolution Strategy. Generate λ points around x (x + σN, where N is a standard Gaussian). Compute their fitness values.
- 41. Basic schema of an Evolution Strategy. Generate λ points around x. Compute their fitness values. Select the μ best.
- 42. Basic schema of an Evolution Strategy. Generate λ points around x. Compute their fitness values. Select the μ best. Let x = average of these best.
- 43. Basic schema of an Evolution Strategy. Generate λ points around x. Compute their fitness values. Select the μ best. Let x = average of these best.
- 44. Obviously parallel (multi-cores, clusters, grids...). Generate λ points around x. Compute their fitness values. Select the μ best. Let x = average of these best.
- 45. Obviously parallel. Really simple. Generate λ points around x. Compute their fitness values. Select the μ best. Let x = average of these best.
- 46. Obviously parallel. Really simple. Not a negligible advantage: when I accessed, for the first time, a crucial industrial code of an important company, I believed that it would be clean and bug-free. (I was young.)
- 47. The (1+1)-ES with 1/5th rule. Parameters: x, σ. Generate 1 point x' around x (x' = x + σN, where N is a standard Gaussian). Compute its fitness value. Keep the best (x or x'): x = best(x, x'), with σ = 2σ if x' is best, σ = 0.84σ otherwise.
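Slide 47's (1+1)-ES with the 1/5th rule translates directly into code; a minimal sketch (the sphere test function is our own choice):

```python
import numpy as np

def one_plus_one_es(f, x, sigma, iters=500, seed=0):
    """(1+1)-ES as on the slide: keep the best of (x, x');
    sigma *= 2 on success, sigma *= 0.84 on failure."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        xp = x + sigma * rng.standard_normal(x.shape)   # x' = x + sigma*N
        if f(xp) <= f(x):
            x, sigma = xp, 2.0 * sigma
        else:
            sigma *= 0.84
    return x, sigma

sphere = lambda x: float(np.sum(x ** 2))                # test function (ours)
x, sigma = one_plus_one_es(sphere, np.full(5, 10.0), sigma=1.0, seed=1)
```

The two multipliers balance at a success rate near 1/5: 0.2 log 2 + 0.8 log 0.84 ≈ 0, so σ is stable exactly when about one mutation in five succeeds.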
- 48. This is x...
- 49. I generate λ=6 points.
- 50. I select the μ=3 best points.
- 51. x = average of these μ=3 best points.
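Slides 48-51 describe one iteration of the basic (μ, λ) Evolution Strategy; a compact sketch with a fixed step-size (function and parameter names are ours):

```python
import numpy as np

def mu_lambda_es(f, x, sigma, mu=3, lam=6, iters=100, seed=0):
    """Slide-style ES iteration, repeated: sample lam points around x,
    keep the mu best, move x to their average (fixed sigma here)."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        pop = x + sigma * rng.standard_normal((lam, x.size))
        fitness = np.apply_along_axis(f, 1, pop)
        x = pop[np.argsort(fitness)[:mu]].mean(axis=0)  # average of the best
    return x

sphere = lambda x: float(np.sum(x ** 2))
x0 = np.full(4, 5.0)
xn = mu_lambda_es(sphere, x0, sigma=0.5, seed=0)
```

With a fixed σ the search stalls at a residual error of order σ, which is precisely why the next slides ask how to choose sigma.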
- 52. OK. Choosing an initial x is as in any algorithm. But how do I choose sigma?
- 53. OK. Choosing x is as in any algorithm. But how do I choose sigma? Sometimes by human guess. But for large numbers of iterations, there is better.
- 54. log || x_n - x* || ~ - C n
- 55. Usually termed "linear convergence" ==> but it's in log-scale: log || x_n - x* || ~ - C n.
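The log-linear behavior can be observed numerically; a sketch that estimates the constant C on the sphere function with the (1+1)-ES update from slide 47 (the experimental setup is our own):

```python
import numpy as np

# Track log ||x_n - x*|| on the sphere (x* = 0) for a (1+1)-ES with the
# 1/5th-rule step-size update; under log-linear ("linear") convergence
# the curve is roughly a straight line of slope -C.
rng = np.random.default_rng(3)
x, sigma = np.full(10, 10.0), 1.0
logs = []
for _ in range(600):
    y = x + sigma * rng.standard_normal(10)
    if np.sum(y ** 2) <= np.sum(x ** 2):
        x, sigma = y, 2.0 * sigma
    else:
        sigma *= 0.84
    logs.append(float(np.log(np.linalg.norm(x))))
C = (logs[0] - logs[-1]) / len(logs)   # average log-progress per iteration
```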
- 56. Examples of evolutionary algorithms
- 57-60. Estimation of Multivariate Normal Algorithm (four illustration slides; figures not preserved).
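EMNA can be sketched in a few lines: sample a population from a multivariate Gaussian, select the best, refit the mean and the (generally non-isotropic) covariance. A sketch under our own parameter choices; practical EMNA variants add safeguards against the variance shrinking too fast:

```python
import numpy as np

def emna(f, mean, cov, mu=10, lam=40, iters=60, seed=0):
    """EMNA sketch: sample lam points from N(mean, cov), select the mu
    best, refit mean and (generally non-isotropic) covariance."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        pop = rng.multivariate_normal(mean, cov, size=lam)
        fitness = np.apply_along_axis(f, 1, pop)
        best = pop[np.argsort(fitness)[:mu]]
        mean = best.mean(axis=0)
        # small jitter keeps the refit covariance positive semi-definite
        cov = np.cov(best, rowvar=False) + 1e-12 * np.eye(mean.size)
    return mean, cov

sphere = lambda x: float(np.sum(x ** 2))
m0 = np.full(3, 5.0)
mean, cov = emna(sphere, m0, np.eye(3), seed=2)
```

Because the covariance is refit from the selected points, it naturally becomes non-isotropic on anisotropic objectives, which is the point of the next two slides.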
- 61-62. EMNA is usually non-isotropic (two illustration slides).
- 63. Self-adaptation (works in many frameworks).
- 64. Self-adaptation (works in many frameworks). Can be used for non-isotropic multivariate Gaussian distributions.
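One classic form of self-adaptation: each offspring first mutates its own step-size log-normally, then its position, and the winner's step-size is inherited along with its position. A (1, λ) sketch (the learning rate τ and all parameters are our own choices, not from the slides):

```python
import numpy as np

def self_adaptive_es(f, x, sigma, lam=20, iters=150, seed=0):
    """(1, lam) self-adaptation: each offspring mutates its own
    step-size log-normally before mutating its position; selection
    acts on both position and step-size."""
    rng = np.random.default_rng(seed)
    tau = 1.0 / np.sqrt(2.0 * x.size)              # our learning-rate choice
    for _ in range(iters):
        sigmas = sigma * np.exp(tau * rng.standard_normal(lam))
        pop = x + sigmas[:, None] * rng.standard_normal((lam, x.size))
        i = int(np.argmin(np.apply_along_axis(f, 1, pop)))
        x, sigma = pop[i], sigmas[i]               # winner's sigma inherited
    return x, sigma

sphere = lambda x: float(np.sum(x ** 2))
x, sigma = self_adaptive_es(sphere, np.full(5, 3.0), sigma=1.0, seed=0)
```

Replacing the single σ with a vector of per-axis step-sizes gives the non-isotropic variant mentioned on slide 64.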
- 65. Let's generalize. We have seen algorithms which work as follows: we keep one search point in memory (and one step-size); we generate individuals; we evaluate these individuals; we regenerate a search point and a step-size. Maybe we could keep more than one search point?
- 66. Same scheme, generalized: we keep μ search points in memory (and one step-size); we generate λ individuals; we evaluate these generated individuals; we regenerate search points and a step-size.
- 67. Parameters: μ, λ. Generate λ points around x1, ..., xμ (e.g. each x randomly generated from two points). Compute their fitness values. Select the μ best. Don't average... Obviously parallel. Really simple.
- 68. Generate λ points around x1, ..., xμ, e.g. each x randomly generated from two points.
- 69. Generate λ points around x1, ..., xμ, e.g. each x randomly generated from two points. This is a cross-over.
- 70. Generate λ points around x1, ..., xμ, e.g. each x randomly generated from two points. This is a cross-over. Example of procedure for generating a point: randomly draw k parents x1, ..., xk (truncation selection: randomly among the selected individuals); for generating the i-th coordinate of the new individual z: u = random(1, k); z(i) = x(u)_i.
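The coordinate-wise generation procedure on slide 70 translates directly into code (a sketch; the parent encoding as an array is ours):

```python
import numpy as np

def make_child(parents, rng):
    """Slide 70's procedure: coordinate i of the new individual z is
    copied from a uniformly drawn parent: u = random(1, k); z(i) = x(u)_i."""
    k, d = parents.shape
    u = rng.integers(0, k, size=d)        # one parent index per coordinate
    return parents[u, np.arange(d)]

rng = np.random.default_rng(0)
parents = np.array([[0.0, 0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0, 1.0]])
child = make_child(parents, rng)          # each coordinate is a 0 or a 1
```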
- 71. Let's summarize. We have seen a general scheme for optimization: generate a population (e.g. from some distribution, or from a set of search points); select the best = new search points. ==> Small difference between an Evolutionary Algorithm (EA) and an Estimation of Distribution Algorithm (EDA). ==> Some EAs (older than the EDA acronym) are EDAs.
- 72. Same scheme, annotated: generating from a distribution = EDA; generating from a set of search points = EA. ==> Small difference between an EA and an EDA. ==> Some EAs (older than the EDA acronym) are EDAs.
- 73. Gives a lot of freedom: choose your representation and operators (depending on the problem); if you have a step-size, choose the adaptation rule; choose your population size λ (depending on your computer/grid); choose μ (carefully), e.g. μ = min(dimension, λ/4).
- 74. Gives a lot of freedom: choose your operators (depending on the problem); if you have a step-size, choose the adaptation rule; choose your population size λ (depending on your computer/grid); choose μ (carefully), e.g. μ = min(dimension, λ/4). Can handle strange things: optimize a physical structure? Structure represented as a Voronoi diagram; cross-over makes sense, benefits from local structure; not so many algorithms can work on that.
- 75-76. Voronoi representation: a family of points.
- 77. Voronoi representation: a family of points, their labels.
- 78. Voronoi representation: a family of points, their labels ==> cross-over makes sense ==> you can optimize a shape.
- 79. Voronoi representation: a family of points, their labels ==> cross-over makes sense ==> you can optimize a shape ==> not that mathematical, but really useful. Mutations: each label is changed with probability 1/n. Cross-over: each point/label is randomly drawn from one of the two parents.
- 80. Voronoi representation: a family of points, their labels. Mutations: each label is changed with probability 1/n. Cross-over: randomly pick one split in the representation: left part from parent 1, right part from parent 2 ==> related to biology.
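The Voronoi representation's operators from slides 79-80 can be sketched as follows (the point/label encoding and all names are our own illustration):

```python
import numpy as np

def mutate(points, labels, rng):
    """Slide 79: each label is changed with probability 1/n."""
    n = labels.size
    flip = rng.random(n) < 1.0 / n
    return points, np.where(flip, 1 - labels, labels)

def crossover(parent1, parent2, rng):
    """Slide 80: pick one split; left part from parent 1, right part
    from parent 2."""
    (p1, l1), (p2, l2) = parent1, parent2
    s = int(rng.integers(1, l1.size))     # split position
    return np.vstack([p1[:s], p2[s:]]), np.concatenate([l1[:s], l2[s:]])

rng = np.random.default_rng(0)
pts = rng.random((8, 2))                  # 8 Voronoi sites (our example)
lab = rng.integers(0, 2, size=8)          # binary labels, e.g. material/void
child_pts, child_lab = crossover((pts, lab), (pts.copy(), 1 - lab), rng)
pts2, lab2 = mutate(child_pts, child_lab, rng)
```

The shape itself is the union of the Voronoi cells whose label is 1, so both operators act on a geometry without ever parameterizing its boundary.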
- 81. Gives a lot of freedom: choose your operators (depending on the problem); if you have a step-size, choose the adaptation rule; choose your population size λ (depending on your computer/grid); choose μ (carefully), e.g. μ = min(dimension, λ/4). Can handle strange things: optimize a physical structure? Structure represented as a Voronoi diagram; cross-over makes sense, benefits from local structure; not so many algorithms can work on that.
- 82. II. Evolutionary algorithms a. Fundamental elements b. Algorithms c. Math. analysis
- 83. Consider the (1+1)-ES: x(n) = x(n-1) or x(n-1) + σ(n-1)N. We want to maximize: -E log || x(n) - x* ||.
- 84. Consider the (1+1)-ES: x(n) = x(n-1) or x(n-1) + σ(n-1)N. We want to maximize: -E log( || x(n) - x* || / || x(n-1) - x* || ).
- 85. Consider the (1+1)-ES: x(n) = x(n-1) or x(n-1) + σ(n-1)N. We don't know x*, so how can we optimize this? We will observe the acceptance rate, and we will deduce whether σ is too large or too small.
- 86. -E log( || x(n) - x* || / || x(n-1) - x* || ) on the norm function: rejected mutations vs accepted mutations.
- 87. For each step-size, evaluate this "expected progress rate" and evaluate "P(acceptance)". Rejected mutations vs accepted mutations.
- 88. Figure: progress rate as a function of the acceptance rate; the region of rejected mutations is marked.
- 89. We want to be here (maximal progress rate)! We observe (approximately) this variable: the acceptance rate.
- 90. Big step-size ==> low acceptance rate.
- 91. Small step-size ==> high acceptance rate.
- 92. Small acceptance rate ==> decrease sigma.
- 93. Big acceptance rate ==> increase sigma.
- 94. The 1/5th rule: based on maths showing that good step-size <==> success rate ~ 1/5.
- 95. I. Optimization and DFO II. Evolutionary algorithms III. From math. programming IV. Using machine learning V. Conclusions
- 96. III. From math. programming ==> pattern search methods. Comparison with ES: code more complicated; same rate; deterministic; less robust.
- 97. III. From math. programming. Also: Nelder-Mead algorithm (similar to pattern search, better constant in the rate).
- 98. III. From math. programming. Also: Nelder-Mead algorithm (similar to pattern search, better constant in the rate); NEWUOA (using value functions and not only comparisons).
- 99. I. Optimization and DFO II. Evolutionary algorithms III. From math. programming IV. Using machine learning V. Conclusions
- 100. IV. Using machine learning. What if computing f takes days? ==> parallelism ==> and "learn" an approximation of f.
- 101. IV. Using machine learning. Statistical tools: f̂(x) = approximation(x, x1, f(x1), x2, f(x2), ..., xn, f(xn)); ŷ(n+1) = f̂(x(n+1)); e.g. f̂ = the quadratic function closest to f on the x(i)'s.
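The "quadratic function closest to f on the x(i)'s" of slide 101 can be fitted by least squares on the evaluated points; a minimal sketch (the feature construction and all names are ours):

```python
import numpy as np

def quadratic_surrogate(X, y):
    """Least-squares fit of a full quadratic model (constant, linear and
    second-order terms) to the evaluated points (x_i, f(x_i))."""
    n, d = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(d)]
    cols += [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    w, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)

    def f_hat(x):
        feats = [1.0] + [x[i] for i in range(d)] \
                      + [x[i] * x[j] for i in range(d) for j in range(i, d)]
        return float(np.dot(w, feats))
    return f_hat

f = lambda x: float(np.sum(x ** 2))       # stand-in for the expensive f
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
y = np.apply_along_axis(f, 1, X)
f_hat = quadratic_surrogate(X, y)         # f is quadratic here: exact fit
```

In a surrogate-based loop, the cheap f̂ is optimized between (rare) evaluations of the real f, which is exactly the scheme of the next slide.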
- 102. IV. Using machine learning ==> keyword "surrogate models" ==> use f̂ instead of f ==> periodically, re-use the real f.
- 103. I. Optimization and DFO II. Evolutionary algorithms III. From math. programming IV. Using machine learning V. Conclusions
- 104. Derivative-free optimization is fun ==> nice maths ==> nice applications + easily parallel algorithms ==> can handle really complicated domains (mixed continuous/integer, optimization on sets of programs). Yet, it is often suboptimal on highly structured problems (when BFGS is easy to use, thanks to fast gradients).
- 105. Keywords, readings ==> cross-entropy (so close to evolution strategies) ==> genetic programming (evolutionary algorithms for automatically building programs) ==> H.-G. Beyer's book on ES is a good starting point ==> many resources on the web ==> keep in mind that representation/operators are often the key ==> we only considered isotropic algorithms; sometimes not a good idea at all.
