Simulation-based optimization: Upper Confidence Tree and Direct Policy Search
Using and combining
- UCT
- DPS
for sequential decision making.

Published in: Education

Transcript of "Simulation-based optimization: Upper Confidence Tree and Direct Policy Search"

  1. Simulations for combining heuristics and consistent algorithms. Applications to Minesweeper, the game of Go and power grids. O. Buffet, A. Couëtoux, H. Doghmen, W. Lin, O. Teytaud, & many others
  2. In France (RTE + Luc Lasne)
  3. Beautiful spatially distributed problem
      ● Short term (~10s): dispatching
          ● Real-time control, humans in the loop
      ● Days, weeks: combinatorial optimization
      ● Years: hydroelectric stocks
          ● Stochasticity (provides price of water for week-ahead)
      ● 50 years: investments
          ● Expensive optimization of strategies (parallel)
          ● Uncertainties: multiobjective (many!)? Worst-case?
  4. Goals / tools
      ● Optimizing investments
          – For the next 50 years
          – In Europe / North Africa
      ● Taking into account power plants / networks
      ● Multi-objective (scenarios), visualization with:
      ● Collaboration with a company (data, models)
      ● 3 Ph.D. students full time
      ● Dedicated machine, 500/1000 cores
  5. Main ideas
      ● We like simulation-based optimizers
          ● For analyzing simulations
          ● For bilevel optimization (“anytime” criterion: smooth performance improvement)
      ● Required knowledge is a simulator:
          ● No dependency on additional knowledge
          ● Simplified model (linearized...) not necessary
      ● But we want to be able to plug expertise in terms of strategy (e.g. handcrafted approximate policy)
  6. Main ideas
      ● We like simulation-based optimizers
          ● For analyzing simulations
          ● For bilevel optimization (“anytime” criterion: smooth performance improvement)
      ● All we want as required knowledge is a simulator:
          ● No dependency on additional knowledge
          ● No simplified model
      ● But we want to be able to plug expertise in terms of strategy (e.g. handcrafted approximate policy)
      Tools:
      ● Upper Confidence Tree = Adaptive Simulator, good for combinatorial aspects
      ● Direct Policy Search = Adaptive Simulator, good for long term effects
  7. A great challenge: MineSweeper.
      - looks easy
      - in fact, not easy: many myopic (one-step-ahead) approaches.
      - partially observable
  8. Outline
      1. Rules of MineSweeper
      2. State of the art
      3. The CSP approach
      4. The UCT approach
      5. The best of both worlds
  9. RULES: At the beginning, all locations are covered (unknown).
  10. I play here!
  11. Good news! No mine in the neighborhood! I can “click” all the neighbors.
  12. I have 3 covered neighbors, and I have 3 mines in the neighborhood ==> 3 flags!
  13. I know it's a mine, so I put a flag!
  14. No info!
  15. I play here and I lose...
  16. The most successful game ever! Who in this room never played MineSweeper?
  17. Outline
      1. Rules of MineSweeper
      2. State of the art
      3. The CSP approach
      4. The UCT approach
      5. The best of both worlds
  18. Do you think it's easy? (10 mines) MineSweeper is not simple.
  19. What is the optimal move?
  20. What is the optimal move? Remark: the question makes sense without knowing the history. You don't need the history for playing optimally. ==> (this fact is mathematically non-trivial!)
  21. What is the optimal move? This one is easy. Both remaining locations win with probability 50%.
  22. More difficult! Which move is optimal? Here, the classical approach (CSP) is wrong.
  23. Probability of a mine?
      - Top:
      - Middle:
      - Bottom:
  24. Probability of a mine?
      - Top: 33%
      - Middle:
      - Bottom:
  25. Probability of a mine?
      - Top: 33%
      - Middle: 33%
      - Bottom:
  26. Probability of a mine?
      - Top: 33%
      - Middle: 33%
      - Bottom: 33%
  27. Probability of a mine?
      - Top: 33%
      - Middle: 33%
      - Bottom: 33%
      ==> so all moves equivalent?
  28. Probability of a mine?
      - Top: 33%
      - Middle: 33%
      - Bottom: 33%
      ==> so all moves equivalent?
      ==> NOOOOO!!!
  29. Probability of a mine?
      - Top: 33%
      - Middle: 33%
      - Bottom: 33%
      Top or bottom: 66% chance of winning! Middle: 33%!
  30. The myopic (one-step-ahead) approach plays randomly. The middle is a bad move! Even with the same probability of a mine, some moves are better than others!
  31. State of the art:
      - solved in 4x4
      - NP-complete
      - Constraint Satisfaction Problem approach: find the location which is least likely to be a mine, play there.
          ==> 80% success “beginner” (9x9, 10 mines)
          ==> 45% success “intermediate” (16x16, 40 mines)
          ==> 34% success “expert” (30x16, 99 mines)
  32. Outline
      1. Rules of MineSweeper
      2. State of the art
      3. The CSP approach (and other old known methods)
      4. The UCT approach
      5. The best of both worlds
  33. Known methods (MDP = Markov Decision Process, CSP = Constraint Satisfaction Problem)
      - Exact MDP: very expensive. 4x4 solved.
      - Single Point Strategy (SPS): simple local solving
      - CSP (constraint satisfaction problem): the main approach.
          - (unknown) state: x(i) = 1 if there is a mine at location i
          - each visible location is a constraint: if location 15 is labelled 4, then the constraint is x(04)+x(05)+x(06)+x(14)+x(16)+x(24)+x(25)+x(26) = 4.
          - find all solutions x1, x2, x3, ..., xN
          - P(mine in j) = (sum over i of xi(j)) / N <== this is mathematically proved!
          - play j such that P(mine in j) is minimal
          - if several such j, randomly break ties.
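The CSP procedure on the slide above can be sketched in a few lines. This is a minimal brute-force illustration, not the implementation used by the authors: it assumes a known total number of mines, enumerates every placement on the covered cells, keeps those satisfying all visible constraints, and averages. The function names (`neighbors`, `mine_probabilities`, `csp_move`) and the (row, column) cell encoding are hypothetical, and the enumeration is only practical on tiny boards.

```python
from fractions import Fraction
from itertools import combinations


def neighbors(cell, rows, cols):
    """Return the up-to-8 cells adjacent to `cell` on a rows x cols board."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols]


def mine_probabilities(rows, cols, n_mines, revealed):
    """P(mine in j) for every covered cell j.

    `revealed` maps an uncovered cell to the number it shows.
    Assumption: mines are uniformly distributed and n_mines is known.
    """
    covered = [(r, c) for r in range(rows) for c in range(cols)
               if (r, c) not in revealed]
    solutions = []
    for mines in combinations(covered, n_mines):
        mine_set = set(mines)
        # Each visible location is a constraint on its neighborhood.
        if all(sum(n in mine_set for n in neighbors(cell, rows, cols)) == label
               for cell, label in revealed.items()):
            solutions.append(mine_set)
    if not solutions:
        raise ValueError("no mine placement is consistent with the board")
    total = len(solutions)
    return {j: Fraction(sum(j in s for s in solutions), total) for j in covered}


def csp_move(probs):
    """All covered cells with minimal mine probability (ties left to the caller)."""
    best = min(probs.values())
    return sorted(j for j, p in probs.items() if p == best)


# Toy 2x3 board with 2 mines; the top-left cell is uncovered and shows 1.
probs = mine_probabilities(2, 3, 2, {(0, 0): 1})
candidates = csp_move(probs)
```

On this toy board the three neighbors of the uncovered cell are tied, which is exactly the situation where the myopic CSP player has to break ties randomly.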
  34. CSP as modified by Legendre et al., 2012:
      - (unknown) state: x(i) = 1 if there is a mine at location i
      - each visible location is a constraint: if location 15 is 4, then the constraint is x(04)+x(05)+x(06)+x(14)+x(16)+x(24)+x(25)+x(26) = 4.
      - find all solutions x1, x2, x3, ..., xN
      - P(mine in j) = (sum over i of xi(j)) / N <== this is mathematically proved!
      - play j such that P(mine in j) is minimal
      - if several such j, choose one “closest to the frontier” (proposed by Legendre et al.)
      - if still several such j, randomly break ties.
  35. CSP
      - is very fast
      - but it's not optimal
      - because of positions like this one, where CSP plays randomly!
      Also for the initial move: don't play the first move randomly! (sometimes opening book)
  36. Outline
      1. Rules of MineSweeper
      2. State of the art
      3. The CSP approach
      4. The UCT approach
      5. The best of both worlds
  37. The MCTS approach
      ● Random simulations (Bruegmann, 93)
      ● A tree of possible futures, increasing along simulations (several simultaneous papers, 2006)
  38. Why not UCT?
      - looks like a stupid idea at first sight
      - cannot compete with CSP in terms of speed
      - but at least UCT is consistent: given sufficient time, it will play optimally.
      - tested in Couetoux and Teytaud, 2011
  39. UCT (Upper Confidence Trees): Coulom (06); Chaslot, Saito & Bouzy (06); Kocsis & Szepesvari (06)
  40. UCT
  41. UCT
  42. UCT
  43. UCT
  44. UCT, Kocsis & Szepesvari (06)
  45. Exploitation ...
  46. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
  47. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
  48. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
  49. ... or exploration? SCORE = 0/2 + k.sqrt( log(10)/2 )
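The two SCORE expressions above are instances of the same rule: average reward of a child plus an exploration bonus. The sketch below evaluates it for the slide's numbers (5 wins out of 7 visits versus 0 wins out of 2 visits, under a parent with 10 visits). The function name `ucb_score`, the use of the natural logarithm, and the tested values of `k` are assumptions made for illustration.

```python
import math


def ucb_score(wins, visits, parent_visits, k):
    """Exploitation term (wins / visits) plus exploration bonus.

    Assumption: `log` on the slide is the natural logarithm.
    Unvisited children get an infinite score so that they are tried first.
    """
    if visits == 0:
        return math.inf
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)


# The two children shown on the slides, for two hypothetical values of k.
for k in (1.0, 2.0):
    exploit = ucb_score(5, 7, 10, k)   # 5/7 + k.sqrt(log(10)/7)
    explore = ucb_score(0, 2, 10, k)   # 0/2 + k.sqrt(log(10)/2)
    choice = "exploitation" if exploit > explore else "exploration"
    print(f"k={k}: {exploit:.3f} vs {explore:.3f} -> {choice}")
```

With a small `k` the well-performing child is simulated again; with a larger `k` the rarely visited child gets its turn, which is the trade-off the two slides illustrate.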
  50. UCT in one slide. Great progress in the game of Go and in various other games.
  51. UCT in one slide. We use the CSP by Legendre et al. 2012 for expansion and simulation.
  52. Applying UCT here?
      • Might look like “too much”
      • But in many cases CSP is suboptimal
      • We have seen an example of a suboptimal move by CSP a few slides ago
      • Let's see two additional examples
  53. An example showing that the initial move matters (UCT finds it, not CSP). 3x3, 7 mines: the optimal move is anything but the center. Optimal winning rate: 25%. Optimal winning rate if random uniform initial move: 17/72. (Yes, we get 1/72 improvement!)
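The 25% and 17/72 figures can be checked by exhaustive enumeration. The sketch below assumes that the first click is guaranteed to be safe (an assumption not stated on the slide, but one that reproduces its numbers) and that the game is won once both safe cells are uncovered. The function names are hypothetical.

```python
from fractions import Fraction

SIZE = 3  # 3x3 board with 7 mines, hence exactly 2 safe cells
CELLS = [(r, c) for r in range(SIZE) for c in range(SIZE)]


def neighbors(cell):
    """Cells adjacent to `cell` on the SIZE x SIZE board."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < SIZE and 0 <= c + dc < SIZE]


def win_rate(first):
    """Exact win probability when the (safe) first click is `first`.

    The second safe cell is uniform over the 8 other cells. The number shown
    at `first` splits these candidates into groups; within a group every
    candidate is equally likely, so the best guess wins with 1/len(group).
    """
    others = [c for c in CELLS if c != first]
    groups = {}
    for other in others:
        shown = sum(1 for n in neighbors(first) if n != other)  # mines around `first`
        groups.setdefault(shown, []).append(other)
    return sum(Fraction(len(g), len(others)) * Fraction(1, len(g))
               for g in groups.values())


corner = win_rate((0, 0))
edge = win_rate((0, 1))
center = win_rate((1, 1))
uniform_first_move = sum(win_rate(c) for c in CELLS) / len(CELLS)
print(corner, edge, center, uniform_first_move)
```

Under these assumptions any non-center first move wins with probability 1/4, the center only with 1/8, and a uniformly random first move averages 17/72, so choosing the first move well gains 1/4 - 17/72 = 1/72.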
  54. Second such example: 15 mines on a 5x5 board with the GnoMine rule (i.e. the initial move is a 0, i.e. no mine in the neighborhood). Optimal success rate = 100%! Play the center, and you win (well, you have to work...). The myopic CSP approach does not find it.
  55. Outline
      1. Rules of MineSweeper
      2. State of the art
      3. The CSP approach
      4. The UCT approach
      5. The best of both worlds
  56. Summary: I have two approaches:
      ● CSP:
          ● Fast
          ● Suboptimal (myopic, only 1-step ahead)
      ● UCT:
          ● Needs a generative model (probability of next states, given my action)
          ● Asymptotically optimal
  57. The best of both worlds?
      ● CSP:
          ● Fast
          ● Suboptimal (myopic, only 1-step ahead)
      ● UCT:
          ● Needs a generative model: provided by CSP
          ● Asymptotically optimal
  58. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
  59. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
  60. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
  61. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
  62. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
  63. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
      We published a version of UCT for MineSweeper in which this was implemented using the rejection method only.
  64. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
      Rejection algorithm:
      1- randomly draw the mines
      2- if it's ok, return the new observation
      3- otherwise, go back to 1.
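The rejection algorithm can be written down directly. The following is a minimal sketch under stated assumptions, not the published implementation: “ok” is taken to mean that the drawn mines agree with every number currently visible, the total number of mines is known, and the names `sample_consistent_mines` and `simulate_click` are hypothetical.

```python
import random


def neighbors(cell, rows, cols):
    """Cells adjacent to `cell` on a rows x cols board."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols]


def sample_consistent_mines(rows, cols, n_mines, revealed, rng, max_tries=100_000):
    """Draw mine placements until one agrees with all visible numbers."""
    covered = [(r, c) for r in range(rows) for c in range(cols)
               if (r, c) not in revealed]
    for _ in range(max_tries):
        mines = set(rng.sample(covered, n_mines))            # 1- randomly draw the mines
        if all(sum(n in mines for n in neighbors(cell, rows, cols)) == label
               for cell, label in revealed.items()):         # 2- is it ok?
            return mines
        # 3- otherwise, go back to 1
    raise RuntimeError("no consistent placement found within max_tries")


def simulate_click(rows, cols, n_mines, revealed, action, rng):
    """Generative model (S, a) ==> S': observation after clicking `action`.

    Returns None if the sampled world has a mine at `action` (game lost),
    otherwise the number that would be shown there.
    """
    mines = sample_consistent_mines(rows, cols, n_mines, revealed, rng)
    if action in mines:
        return None
    return sum(n in mines for n in neighbors(action, rows, cols))


rng = random.Random(0)
world = sample_consistent_mines(2, 3, 2, {(0, 0): 1}, rng)
observation = simulate_click(2, 3, 2, {(0, 0): 1}, (0, 2), rng)
```

The number of rejected draws grows quickly as more numbers become visible, which is consistent with the remark on the next slide that this approach is correct but too slow.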
  65. What do I need for implementing UCT? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, action a: (S,a) ==> S'. Example: given the state below, and the action “top left”, what are the possible next states?
      It is mathematically ok, but it is too slow. Then, we used a weak CSP implementation. Still too slow. Now a reasonably fast implementation, with the Legendre et al. heuristic.
  66. EXPERIMENTAL RESULTS (results table not recoverable from the transcript; column label: “Our results”). 10 000 UCT-simulations per move. Huge computation time (total = a few days).
  67. CONCLUSIONS: a methodology for sequential decision making
      - When you have a myopic solver (i.e. one which neglects long term effects, as too often in industry!)
          ==> improve it with heuristics (as Legendre et al.)
          ==> combine with UCT (as we did)
          ==> significant improvements
      - We have similar experiments on industrial testbeds
  68. Main ideas
      ● We like simulation-based optimizers
          ● For analyzing simulations
          ● For bilevel optimization (“anytime” criterion: smooth performance improvement)
      ● All we want as required knowledge is a simulator:
          ● No dependency on additional knowledge
          ● No simplified model
      ● But we want to be able to plug expertise in terms of strategy (e.g. handcrafted approximate policy)
      Tools:
      ● Upper Confidence Tree = Adaptive Simulator, good for combinatorial aspects
      ● Direct Policy Search = Adaptive Simulator, good for long term effects
  69. What is Direct Policy Search? (A big part of our work.)
      ● I have a parametric policy decision = p(w,s)
          – Inputs: parameter w, state s
          – Output: decision p(w,s)
          – E.g.
              ● p(w,s) = w.s (scalar product)
              ● p(w,s) = W0 + W1 x tanh(W2 x s + W3) (neural network)
      ● I have a simulator cost = simulate(p, transition):
          – Inputs = policy p, transition function
          – Output = cost (possibly noisy)
          – Principle:
              – state = initial state
              – While (state not final)
                  ● decision = p(state)
                  ● state = transition(state, decision)
      ● Direct Policy Search( transition, policy p(.,.) ):
          – w = argmin simulate( p(w,.) ) // with your favorite optimization algorithm
          – return p(w,.)
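The pseudo-code above maps almost line by line onto a short program. The sketch below is illustrative only: the toy problem (`toy_transition`, `toy_cost`, a scalar state, a fixed horizon instead of a “final state” test) and the simple accept-if-not-worse random search standing in for “your favorite optimization algorithm” are all assumptions, not the setting used for the power-grid experiments.

```python
import math
import random


def policy(w, s):
    """p(w, s) = W0 + W1 * tanh(W2 * s + W3), the one-neuron example from the slide."""
    return w[0] + w[1] * math.tanh(w[2] * s + w[3])


def toy_transition(state, decision):
    """Hypothetical dynamics: the decision is added to a scalar state."""
    return state + decision


def toy_cost(state, decision):
    """Hypothetical cost: squared distance of the next state from 0."""
    return (state + decision) ** 2


def simulate(p, transition, cost, initial_state, horizon):
    """Run policy p for `horizon` steps and return the accumulated cost."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        decision = p(state)
        total += cost(state, decision)
        state = transition(state, decision)
    return total


def direct_policy_search(transition, cost, initial_state, horizon, rng,
                         iterations=300, sigma=0.5):
    """w = argmin simulate(p(w, .)), approximated by a simple random search."""
    best_w = [0.0, 0.0, 0.0, 0.0]
    best_cost = simulate(lambda s: policy(best_w, s), transition, cost,
                         initial_state, horizon)
    for _ in range(iterations):
        candidate = [wi + rng.gauss(0.0, sigma) for wi in best_w]
        c = simulate(lambda s: policy(candidate, s), transition, cost,
                     initial_state, horizon)
        if c <= best_cost:  # keep the candidate only if it is not worse
            best_w, best_cost = candidate, c
    return best_w, best_cost


rng = random.Random(0)
baseline = simulate(lambda s: 0.0, toy_transition, toy_cost, 1.0, 5)
w, cost_found = direct_policy_search(toy_transition, toy_cost, 1.0, 5, rng)
```

Only the simulator is queried, which is the point made on the “Main ideas” slide: no linearized or otherwise simplified model is required, and any black-box optimizer can replace the random search.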
  70. Example of policy: p(w,state) = decision such that
      ● Alpha = W0 + W1 tanh(W2.state + W3)
      ● Week-Ahead-Reward(decision) + Alpha.Stock(decision) is maximum
      ==> if linear transition, compliant with huge action dimension
      ==> non-linearities handled by the neural net
  71. One or two rivers (7 stocks)
  72. Network of rivers (7 rivers)
  73. Investment problem
  74. Summary: two simulation-based tools for Sequential Decision Making:
      ● UCT = a nice tool for short term combinatorial effects
      ● DPS = a stable tool for long term effects
      Both:
      ● Are anytime
      ● Provide simulation results
      ● Can take into account non-linear effects
      ● Can handle a high-dimensional state space
  75. Thanks for your attention! 9 mines. What is the optimal move?