Derivative Free Optimization

Derivative Free Optimization, presented in Liège, 2011


  1. DERIVATIVE-FREE OPTIMIZATION. http://www.lri.fr/~teytaud/dfo.pdf (or Quentin's web page?). Olivier Teytaud, Inria TAO, visiting the beautiful city of Liège; also using slides from A. Auger.
  2. The next slide is the most important of all.
  3. In case of trouble, interrupt me.
  4. In case of trouble, interrupt me. Further discussion needed: R82A, Montefiore institute; olivier.teytaud@inria.fr; or after the lessons (the 25th, not the 18th).
  5. I. Optimization and DFO. II. Evolutionary algorithms. III. From math. programming. IV. Using machine learning. V. Conclusions.
  6. Derivative-free optimization of f.
  7. Derivative-free optimization of f.
  8. Derivative-free optimization of f. No gradient! Only depends on the x's and the f(x)'s.
  9. Derivative-free optimization of f. Why derivative-free optimization?
  10. Derivative-free optimization of f. Why derivative-free optimization? Ok, it's slower.
  11. Derivative-free optimization of f. Why derivative-free optimization? Ok, it's slower. But sometimes you have no derivative.
  12. Derivative-free optimization of f. Why derivative-free optimization? Ok, it's slower. But sometimes you have no derivative. It's simpler (by far) ==> fewer bugs.
  13. Derivative-free optimization of f. Why derivative-free optimization? Ok, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
  14. Derivative-free optimization of f. Optimization algorithms ==> Newton? ==> Quasi-Newton (BFGS)? ==> Gradient descent? ==> ... Why derivative-free optimization? Ok, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
  15. Derivative-free optimization of f. Optimization algorithms ==> derivative-free optimization (doesn't need gradients). Why derivative-free optimization? Ok, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
  16. Derivative-free optimization of f. Optimization algorithms ==> comparison-based optimization (coming soon), just needing comparisons, including evolutionary algorithms. Why derivative-free optimization? Ok, it's slower. But sometimes you have no derivative. It's simpler (by far). It's more robust (to noise, to strange functions...).
  17. I. Optimization and DFO. II. Evolutionary algorithms. III. From math. programming. IV. Using machine learning. V. Conclusions.
  18. II. Evolutionary algorithms: a. Fundamental elements; b. Algorithms; c. Math. analysis.
  19. Preliminaries: Gaussian distribution; multivariate Gaussian distribution; non-isotropic Gaussian distribution; Markov chains ==> for theoretical analysis.
  20. Preliminaries: Gaussian distribution; multivariate Gaussian distribution; non-isotropic Gaussian distribution; Markov chains.
  21. Gaussian distribution: K exp( -p(x) ), with p(x) a degree-2 polynomial (so -p(x) has a negative dominant coefficient) and K a normalization constant.
  22. Gaussian distribution: K exp( -p(x) ), with p(x) a degree-2 polynomial and K a normalization constant. [Figure annotations: translation of the Gaussian; size of the Gaussian.]
  23. Preliminaries: Gaussian distribution; multivariate Gaussian distribution; non-isotropic Gaussian distribution.
  24. Isotropic case: density = K exp( - ||x - μ||² / (2σ²) ) ==> level sets are rotationally invariant ==> completely defined by μ and σ (do you understand why K is fixed by σ?) ==> "isotropic" Gaussian. General case next.
  25. Preliminaries: Gaussian distribution; multivariate Gaussian distribution; non-isotropic Gaussian distribution.
  26. Non-isotropic case: step-size different on each axis. K exp( -p(x) ), with p(x) a quadratic form (going to +infinity) and K a normalization constant.
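The two cases above (one global step-size vs. one step-size per axis) are easy to visualize by sampling; this NumPy sketch is illustrative only, and the function names are mine, not the slides':

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_isotropic(mu, sigma, n):
    """Isotropic Gaussian: density K exp(-||x - mu||^2 / (2 sigma^2)),
    i.e. the same step-size sigma on every axis."""
    return mu + sigma * rng.standard_normal((n, len(mu)))

def sample_axiswise(mu, sigmas, n):
    """Non-isotropic (axis-aligned) Gaussian: one step-size per axis."""
    return mu + np.asarray(sigmas) * rng.standard_normal((n, len(mu)))

iso = sample_isotropic(np.zeros(2), 1.0, 100_000)
axw = sample_axiswise(np.zeros(2), [1.0, 0.1], 100_000)
# iso has roughly equal spread on both axes; axw is squashed on axis 2
```

With 100,000 samples the empirical per-axis spread is close to the requested step-sizes, which is exactly the difference between the isotropic and non-isotropic cases.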
  27. Notions that we will see: evolutionary algorithm; cross-over; truncation selection / roulette wheel; linear / log-linear convergence; Estimation of Distribution Algorithm; EMNA; self-adaptation; (1+1)-ES with 1/5th rule; Voronoi representation; non-isotropy.
  28. Comparison-based optimization. Observation: we want robustness w.r.t. monotone transformations of f: Opt is comparison-based if it uses the f(xi) only through comparisons. (Auger, Fournier, Hansen, Rolet, Teytaud, Teytaud, "parallel evolution".)
  29. Comparison-based optimization. yi = f(xi); Opt is comparison-based if it depends on the yi only through their pairwise comparisons.
  30. Population-based comparison-based algorithms? X(1) = ( x(1,1), x(1,2), ..., x(1,λ) ) = Opt(). X(2) = ( x(2,1), x(2,2), ..., x(2,λ) ) = Opt(x(1), signs of diff). ... x(n) = ( x(n,1), x(n,2), ..., x(n,λ) ) = Opt(x(n-1), signs of diff). ==> Let's write it for λ = 2.
  31. Population-based comparison-based algorithms? x(1) = (x(1,1), x(1,2)) = Opt(). x(2) = (x(2,1), x(2,2)) = Opt(x(1), sign(y(1,1) - y(1,2))). ... x(n) = (x(n,1), x(n,2)) = Opt(x(n-1), sign(y(n-1,1) - y(n-1,2))), with y(i,j) = f(x(i,j)).
  32. Abstract notations: x(i) is a population. x(1) = Opt(). x(2) = Opt(x(1), sign(y(1,1) - y(1,2))). ... x(n) = Opt(x(n-1), sign(y(n-1,1) - y(n-1,2))).
  33. Abstract notations: x(i) is a population, I(i) is an internal state of the algorithm. x(1), I(1) = Opt(). x(2), I(2) = Opt(x(1), sign(y(1,1) - y(1,2)), I(1)). ... x(n), I(n) = Opt(x(n-1), sign(y(n-1,1) - y(n-1,2)), I(n-1)).
  34. Same, with s(i) denoting the vector of comparison signs: x(1), I(1) = Opt(). x(2), I(2) = Opt(x(1), s(1), I(1)). ... x(n), I(n) = Opt(x(n-1), s(n-1), I(n-1)).
  35. Comparison-based optimization ==> same behavior on many functions (all monotone transformations of a given f).
  36. Comparison-based optimization ==> same behavior on many functions. Quasi-Newton methods are very poor on this.
  37. Why comparison-based algorithms? ==> More robust. ==> This can be mathematically formalized: comparison-based optimization is slow ( d log ||xn - x*|| / n ~ constant ) but robust (optimal for some worst-case analysis).
  38. II. Evolutionary algorithms: a. Fundamental elements; b. Algorithms; c. Math. analysis.
  39. Basic schema of an Evolution Strategy. Parameters: x, σ. Generate λ points around x ( x + σN, where N is a standard Gaussian).
  40. Basic schema of an Evolution Strategy. Parameters: x, σ. Generate λ points around x ( x + σN, where N is a standard Gaussian). Compute their λ fitness values.
  41. Basic schema of an Evolution Strategy. Parameters: x, σ. Generate λ points around x ( x + σN ). Compute their λ fitness values. Select the μ best.
  42. Basic schema of an Evolution Strategy. Parameters: x, σ. Generate λ points around x ( x + σN ). Compute their λ fitness values. Select the μ best. Let x = average of these μ best.
  43. Basic schema of an Evolution Strategy. Parameters: x, σ. Generate λ points around x ( x + σN ). Compute their λ fitness values. Select the μ best. Let x = average of these μ best.
  44. Obviously parallel (multi-cores, clusters, grids...). Generate λ points around x; compute their λ fitness values; select the μ best; let x = average of these μ best.
  45. Obviously parallel. Really simple. Generate λ points around x; compute their λ fitness values; select the μ best; let x = average of these μ best.
  46. Not a negligible advantage. When I accessed, for the first time, a crucial industrial code of an important company, I believed that it would be clean and bug-free. (I was young.)
  47. The (1+1)-ES with 1/5th rule. Parameters: x, σ. Generate 1 point x' around x ( x' = x + σN, where N is a standard Gaussian). Compute its fitness value. Keep the best (x or x'): x = best(x, x'), with σ = 2σ if x' is best, σ = 0.84σ otherwise.
  48. This is x...
  49. I generate λ = 6 points.
  50. I select the μ = 3 best points.
  51. x = average of these μ = 3 best points.
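The walkthrough in slides 48-51 (generate λ = 6 points, keep the μ = 3 best, average them) is the complete loop; a minimal sketch, assuming a sphere objective and a simple geometric step-size decay (the slides handle σ adaptation separately, later):

```python
import numpy as np

def mu_lambda_es(f, x0, sigma=2.0, lam=6, mu=3, iters=300, seed=0):
    """Basic Evolution Strategy from the slides: generate lambda points
    around x, select the mu best, move x to their average. The geometric
    step-size decay below is a placeholder (an assumption), not the
    slides' adaptation rule."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        pop = x + sigma * rng.standard_normal((lam, len(x)))   # x + sigma*N
        fitness = np.array([f(p) for p in pop])
        best = pop[np.argsort(fitness)[:mu]]                   # mu best
        x = best.mean(axis=0)                                  # average them
        sigma *= 0.97                                          # assumed decay
    return x

sphere = lambda z: float(np.dot(z, z))
xopt = mu_lambda_es(sphere, [5.0, -3.0])
```

Note that the loop never looks at the fitness values themselves beyond sorting them: the algorithm is comparison-based, as announced.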
  52. Ok. Choosing an initial x is as in any algorithm. But how do I choose sigma?
  53. Ok. Choosing x is as in any algorithm. But how do I choose sigma? Sometimes by human guess. But for a large number of iterations, there is better.
  54. log || xn - x* || ~ - C n
  55. Usually termed "linear convergence" ==> but it's in log-scale. log || xn - x* || ~ - C n
  56. Examples of evolutionary algorithms.
  57. Estimation of Multivariate Normal Algorithm.
  58. Estimation of Multivariate Normal Algorithm.
  59. Estimation of Multivariate Normal Algorithm.
  60. Estimation of Multivariate Normal Algorithm.
  61. EMNA is usually non-isotropic.
  62. EMNA is usually non-isotropic.
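A sketch of the EMNA idea, assuming truncation selection and (for brevity) a diagonal covariance refit rather than the full multivariate one; all names and constants here are mine:

```python
import numpy as np

def emna(f, dim=2, lam=50, mu=10, iters=60, seed=0):
    """EMNA sketch: sample the population from a Gaussian, keep the mu
    best, and refit the Gaussian (mean and, here, per-axis standard
    deviations, i.e. a diagonal covariance) to the selected points."""
    rng = np.random.default_rng(seed)
    mean = 5.0 * rng.standard_normal(dim)
    std = np.full(dim, 3.0)
    for _ in range(iters):
        pop = mean + std * rng.standard_normal((lam, dim))
        fitness = np.array([f(p) for p in pop])
        sel = pop[np.argsort(fitness)[:mu]]   # truncation selection
        mean = sel.mean(axis=0)               # re-estimated mean
        std = sel.std(axis=0) + 1e-12         # re-estimated per-axis spread
    return mean

sphere = lambda z: float(np.dot(z, z))
m = emna(sphere)
```

Because each axis gets its own refitted spread, the search distribution naturally becomes non-isotropic, as the slide says; refitting the spread to the selected points also tends to shrink it quickly (premature convergence is a known risk of plain EMNA).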
  63. Self-adaptation (works in many frameworks).
  64. Self-adaptation (works in many frameworks). Can be used for non-isotropic multivariate Gaussian distributions.
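A sketch of self-adaptation under the common log-normal rule (the slides give no formulas here, so τ, the (μ, λ) sizes, and the comma selection are assumptions): each offspring mutates its own step-size before using it, so selection on fitness implicitly also selects good step-sizes.

```python
import numpy as np

def self_adaptive_es(f, x0, sigma0=1.0, lam=12, mu=3, iters=300, seed=0):
    """Self-adaptation sketch (log-normal rule, an assumption): every
    offspring first mutates its inherited step-size, then mutates x with
    it; comma selection on fitness keeps the mu best (x, sigma) pairs."""
    rng = np.random.default_rng(seed)
    dim = len(x0)
    tau = 1.0 / np.sqrt(dim)                   # a common heuristic
    xs = np.tile(np.asarray(x0, dtype=float), (mu, 1))
    sigmas = np.full(mu, sigma0)
    for _ in range(iters):
        idx = rng.integers(mu, size=lam)       # pick a parent per child
        child_sig = sigmas[idx] * np.exp(tau * rng.standard_normal(lam))
        child_x = xs[idx] + child_sig[:, None] * rng.standard_normal((lam, dim))
        fit = np.array([f(c) for c in child_x])
        keep = np.argsort(fit)[:mu]
        xs, sigmas = child_x[keep], child_sig[keep]
    return xs[0], f(xs[0])

sphere = lambda z: float(np.dot(z, z))
x, fx = self_adaptive_es(sphere, [5.0, 5.0])
```

The same trick extends to non-isotropic Gaussians by giving each individual a vector of per-axis step-sizes instead of a single scalar.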
  65. Let's generalize. We have seen algorithms which work as follows: we keep one search point in memory (and one step-size); we generate individuals; we evaluate these individuals; we regenerate a search point and a step-size. Maybe we could keep more than one search point?
  66. Let's generalize. We have seen algorithms which work as follows: we keep one search point in memory (and one step-size) ==> μ search points; we generate individuals ==> λ generated individuals; we evaluate these individuals; we regenerate a search point and a step-size. Maybe we could keep more than one search point?
  67. Obviously parallel. Really simple. Parameters: generate λ points x1, ..., xλ around x1, ..., xμ, e.g. each x randomly generated from two points. Compute their λ fitness values. Select the μ best. Don't average...
  68. Generate λ points around x1, ..., xμ, e.g. each x randomly generated from two points.
  69. Generate λ points around x1, ..., xμ, e.g. each x randomly generated from two points. This is a cross-over.
  70. Generate λ points around x1, ..., xμ, e.g. each x randomly generated from two points. This is a cross-over. Example of procedure for generating a point: randomly draw k parents x1, ..., xk (truncation selection: randomly among selected individuals); for generating the i-th coordinate of the new individual z: u = random(1, k); z(i) = x(u)(i).
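The coordinate-wise procedure above (u = random(1, k); z(i) = x(u)(i)) can be written directly; a small sketch with two parents:

```python
import random

def uniform_crossover(parents, dim):
    """Coordinate-wise cross-over from the slides: for each coordinate i
    of the child z, draw u = random(1, k) and set z(i) = x(u)(i)."""
    k = len(parents)
    return [parents[random.randrange(k)][i] for i in range(dim)]

random.seed(1)
p1, p2 = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
child = uniform_crossover([p1, p2], 3)
# every coordinate of the child is copied from one of the two parents
```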
  71. Let's summarize. We have seen a general scheme for optimization: generate a population (e.g. from some distribution, or from a set of search points); select the best = new search points. ==> Small difference between an Evolutionary Algorithm (EA) and an Estimation of Distribution Algorithm (EDA). ==> Some EAs (older than the EDA acronym) are EDAs.
  72. Let's summarize. We have seen a general scheme for optimization: generate a population (e.g. from some distribution [EDA], or from a set of search points [EA]); select the best = new search points. ==> Small difference between an Evolutionary Algorithm (EA) and an Estimation of Distribution Algorithm (EDA). ==> Some EAs (older than the EDA acronym) are EDAs.
  73. Gives a lot of freedom: choose your representation and operators (depending on the problem); if you have a step-size, choose the adaptation rule; choose your population size λ (depending on your computer/grid); choose μ (carefully), e.g. μ = min(dimension, λ/4).
  74. Gives a lot of freedom: choose your operators (depending on the problem); if you have a step-size, choose the adaptation rule; choose your population size λ (depending on your computer/grid); choose μ (carefully), e.g. μ = min(dimension, λ/4). Can handle strange things: optimize a physical structure? Structure represented as a Voronoi diagram; cross-over makes sense, benefits from local structure; not so many algorithms can work on that.
  75. Voronoi representation: a family of points.
  76. Voronoi representation: a family of points.
  77. Voronoi representation: a family of points; their labels.
  78. Voronoi representation: a family of points; their labels. ==> Cross-over makes sense. ==> You can optimize a shape.
  79. Voronoi representation: a family of points; their labels. ==> Cross-over makes sense. ==> You can optimize a shape. ==> Not that mathematical, but really useful. Mutation: each label is changed with probability 1/n. Cross-over: each point/label is randomly drawn from one of the two parents.
  80. Voronoi representation: a family of points; their labels. ==> Cross-over makes sense. ==> You can optimize a shape. ==> Not that mathematical, but really useful. Mutation: each label is changed with probability 1/n. Cross-over: randomly pick one split in the representation: left part from parent 1, right part from parent 2. ==> Related to biology.
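The two operators just described can be sketched on a list of (site, label) cells; the site coordinates below are made up purely for illustration:

```python
import random

# A shape is a list of Voronoi cells: (site, label) pairs, where the
# label says e.g. "material present" (1) or not (0).

def mutate(shape):
    """Slides' mutation: each label is flipped with probability 1/n."""
    n = len(shape)
    return [(site, 1 - lab) if random.random() < 1.0 / n else (site, lab)
            for site, lab in shape]

def one_point_crossover(a, b):
    """Slides' second cross-over: pick one split point; left part from
    parent 1, right part from parent 2."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

random.seed(0)
a = [((0.1, 0.2), 0), ((0.5, 0.5), 1), ((0.9, 0.1), 1)]
b = [((0.2, 0.8), 1), ((0.4, 0.3), 0), ((0.7, 0.7), 0)]
child = one_point_crossover(a, b)
m = mutate(a)
```

Because neighboring cells in the list can encode neighboring regions of the shape, the one-point split tends to preserve contiguous chunks of both parents, which is why cross-over "benefits from local structure" here.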
  81. Gives a lot of freedom: choose your operators (depending on the problem); if you have a step-size, choose the adaptation rule; choose your population size λ (depending on your computer/grid); choose μ (carefully), e.g. μ = min(dimension, λ/4). Can handle strange things: optimize a physical structure? Structure represented as a Voronoi diagram; cross-over makes sense, benefits from local structure; not so many algorithms can work on that.
  82. II. Evolutionary algorithms: a. Fundamental elements; b. Algorithms; c. Math. analysis.
  83. Consider the (1+1)-ES: x(n) = x(n-1) or x(n-1) + σ(n-1)N. We want to maximize: - E log || x(n) - x* ||.
  84. Consider the (1+1)-ES: x(n) = x(n-1) or x(n-1) + σ(n-1)N. We want to maximize: - E log ( || x(n) - x* || / || x(n-1) - x* || ).
  85. Consider the (1+1)-ES: x(n) = x(n-1) or x(n-1) + σ(n-1)N. We want to maximize: - E log ( || x(n) - x* || / || x(n-1) - x* || ). But we don't know x*; how can we optimize this? We will observe the acceptance rate, and we will deduce whether σ is too large or too small.
  86. - E log ( || x(n) - x* || / || x(n-1) - x* || ), ON THE NORM FUNCTION. [Figure: rejected mutations vs. accepted mutations.]
  87. For each step-size, evaluate this "expected progress rate" - E log ( || x(n) - x* || / || x(n-1) - x* || ) and evaluate P(acceptance). [Figure: rejected mutations vs. accepted mutations.]
  88. [Figure: progress rate vs. acceptance rate; rejected mutations marked.]
  89. [Figure: progress rate vs. acceptance rate.] We want to be here! We observe (approximately) this variable.
  90. [Figure: progress rate vs. acceptance rate.] Big step-size.
  91. [Figure: progress rate vs. acceptance rate.] Small step-size.
  92. [Figure: progress rate vs. acceptance rate.] Small acceptance rate ==> decrease sigma.
  93. [Figure: progress rate vs. acceptance rate.] Big acceptance rate ==> increase sigma.
  94. The 1/5th rule: based on maths showing that good step-size <==> success rate ~ 1/5.
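Slides 47 and 94 together give a complete algorithm; a sketch with the slides' constants (σ doubled on success, multiplied by 0.84 on failure; note that 2^(1/5) · 0.84^(4/5) ≈ 1, so σ is roughly stable exactly when the success rate is about 1/5):

```python
import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, iters=500, seed=0):
    """(1+1)-ES with the slides' 1/5th-rule constants: keep the best of
    parent and offspring; on success multiply sigma by 2, on failure
    by 0.84 (stable at about one success per five trials)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(iters):
        y = x + sigma * rng.standard_normal(len(x))   # one mutant
        fy = f(y)
        if fy <= fx:
            x, fx = y, fy
            sigma *= 2.0        # success: the step-size was too small
        else:
            sigma *= 0.84       # failure: the step-size was too large
    return x, fx

sphere = lambda z: float(np.dot(z, z))
x, fx = one_plus_one_es(sphere, [10.0, 10.0])
```

Plotting log ||x(n) - x*|| against n for this run would show the straight line of slide 54: linear convergence, in log-scale.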
  95. I. Optimization and DFO. II. Evolutionary algorithms. III. From math. programming. IV. Using machine learning. V. Conclusions.
  96. III. From math. programming ==> the pattern search method. Comparison with ES: code more complicated; same rate; deterministic; less robust.
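For comparison, a minimal compass-style pattern-search sketch (a simplified variant of my own, not necessarily the exact method discussed in the course): poll the 2d axis directions, accept any improvement, and halve the step when a full poll fails.

```python
import numpy as np

def pattern_search(f, x0, step=1.0, iters=200):
    """Compass pattern search: deterministic and comparison-based.
    Poll the 2d axis directions; move to the first improving point,
    otherwise halve the step (refine the mesh)."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                y = x.copy()
                y[i] += s
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5
    return x, fx

sphere = lambda z: float(np.dot(z, z))
x, fx = pattern_search(sphere, [3.7, -2.2])
```

Like the ES, it only compares objective values; unlike the ES, the poll pattern is fixed and deterministic, which is part of why it is less robust on noisy or irregular functions.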
  97. III. From math. programming. Also: the Nelder-Mead algorithm (similar to pattern search, better constant in the rate).
  98. III. From math. programming. Also: the Nelder-Mead algorithm (similar to pattern search, better constant in the rate); NEWUOA (using the objective values and not only comparisons).
  99. I. Optimization and DFO. II. Evolutionary algorithms. III. From math. programming. IV. Using machine learning. V. Conclusions.
  100. IV. Using machine learning. What if computing f takes days? ==> Parallelism ==> and "learn" an approximation of f.
  101. IV. Using machine learning. Statistical tools: f̂(x) = approximation( x, x1, f(x1), x2, f(x2), ..., xn, f(xn) ); y(n+1) = f̂( x(n+1) ). E.g. f̂ = quadratic function closest to f on the x(i)'s.
  102. IV. Using machine learning ==> keyword "surrogate models" ==> use f̂ instead of f ==> periodically, re-use the real f.
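The "quadratic function closest to f on the x(i)'s" from slide 101 is a least-squares fit; a 1D sketch (the names and the sample points are mine):

```python
import numpy as np

def fit_quadratic_1d(xs, ys):
    """Least-squares fit of a*x^2 + b*x + c: the quadratic function
    closest to f on the sampled points, in one dimension."""
    A = np.vstack([xs**2, xs, np.ones_like(xs)]).T
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

def surrogate(coef, x):
    a, b, c = coef
    return a * x**2 + b * x + c

# Pretend f is expensive; sample it at a few points only.
f = lambda x: (x - 1.0) ** 2
xs = np.array([-2.0, -1.0, 0.0, 2.0, 3.0])
coef = fit_quadratic_1d(xs, f(xs))
xmin = -coef[1] / (2 * coef[0])   # minimizer of the surrogate
```

Here f happens to be quadratic itself, so the surrogate is exact and its minimizer recovers the true optimum; in the real loop, as the next slide says, one periodically re-evaluates the true f to keep the surrogate honest.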
  103. I. Optimization and DFO. II. Evolutionary algorithms. III. From math. programming. IV. Using machine learning. V. Conclusions.
  104. Derivative-free optimization is fun. ==> Nice maths. ==> Nice applications + easily parallel algorithms. ==> Can handle really complicated domains (mixed continuous/integer, optimization on sets of programs). Yet, often suboptimal on highly structured problems (when BFGS is easy to use, thanks to fast gradients).
  105. Keywords, readings: ==> cross-entropy (so close to evolution strategies); ==> genetic programming (evolutionary algorithms for automatically building programs); ==> H.-G. Beyer's book on ES = good starting point; ==> many resources on the web; ==> keep in mind that representation / operators are often the key; ==> we only considered isotropic algorithms, which is sometimes not a good idea at all.
