Tutorial: Monte-Carlo Tree Search (MCTS)

A simple tutorial on Monte-Carlo Tree Search

Contains a description of dynamic programming and alpha-beta search, then MCTS. Special cases for simultaneous actions are discussed.

I should add comments so that it can be used without prior knowledge of MCTS; if there is at least one request for this, I'll do it.


@article{gelly:hal-00695370,
  hal_id    = {hal-00695370},
  url       = {http://hal.inria.fr/hal-00695370},
  title     = {{The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions}},
  author    = {Gelly, Sylvain and Kocsis, Levente and Schoenauer, Marc and Sebag, Mich{\`e}le and Silver, David and Szepesvari, Csaba and Teytaud, Olivier},
  abstract  = {{The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.}},
  language  = {English},
  affiliation = {TAO - INRIA Saclay - Ile de France, Laboratoire de Recherche en Informatique - LRI, LPDS, Microsoft Research - Inria Joint Centre - MSR - INRIA, University of Alberta, Canada, Department of Computing Science},
  publisher = {ACM},
  journal   = {Communications of the ACM},
  volume    = {55},
  number    = {3},
  pages     = {106-113},
  year      = {2012},
  pdf       = {http://hal.inria.fr/hal-00695370/PDF/CACM-MCTS.pdf},
}



  1. 1. Bandit-based Monte-Carlo planning: the game of Go and beyond. Designing intelligent agents with Monte-Carlo Tree Search. Olivier.Teytaud@inria.fr + F. Teytaud + H. Doghmen + others. TAO, Inria-Saclay IDF, CNRS 8623, LRI, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence. Keywords: UCB, EXP3, MCTS, UCT. Paris, April 2011. Games, games with hidden information, games with simultaneous actions.
  2. 2. Key point: PLEASE INTERRUPT ME! HAVE QUESTIONS! LET'S HAVE A FRIENDLY SESSION! ASK QUESTIONS NOW AND LATER BY MAIL! olivier.teytaud@inria.fr
  3. 3. Outline: Introduction: games / control / planning. Standard approaches. Bandit-based Monte-Carlo planning and UCT. Application to the game of Go and (far) beyond.
  4. 4. A game is a directed graph.
  5. 5. A game is a directed graph with actions.
  6. 6. A game is a directed graph with actions and players.
  7. 7. A game is a directed graph with actions and players and observations.
  8. 8. A game is a directed graph with actions and players and observations and rewards. Rewards on leaves only!
  9. 9. A game is a directed graph + actions + players + observations + rewards + loops.
  10. 10. More than games in this formalism. A main application: the management of many energy stocks in front of randomness. At each time step we see random outcomes. We have to make decisions (switching on or off). We have losses. (ANR / NSC project)
  11. 11. Opening a reservoir produces energy (and water goes to another reservoir). Figure: reservoirs 1-5, classical thermal plants, nuclear plants, electricity demand, lost water.
  12. 12. Outline: Introduction: games / control / planning. Standard approaches. Bandit-based Monte-Carlo planning and UCT. Application to the game of Go and (far) beyond.
  13. 13. What are the approaches? Dynamic programming (Massé – Bellman, 50s) (still the main approach in industry) (minimax / alpha-beta in games). Reinforcement learning (some promising results, less used in industry). Some tree exploration tools (less usual in stochastic or continuous cases). Bandit-based Monte-Carlo planning. Scripts + tuning.
  14. 14. What are the approaches? Dynamic programming (Massé – Bellman, 50s) (still the main approach in industry). Where we are: Done: presentation of the problem. Now: we briefly present dynamic programming. Thereafter: we present MCTS / UCT.
  15. 15. Dynamic programming. V(x) = expectation of future loss if optimal strategy after state x (well defined). u(x) such that the expectation of V(f(x,u(x),A)) is minimal. Computation by dynamic programming: we compute V for all the final states X.
  16. 16. Dynamic programming. V(x) = expectation of C(xH) if optimal strategy (well defined). u(x) such that the expectation of V(f(x,u(x),A)) is minimal. Computation by dynamic programming: we compute V for all the final states X. We compute V for all the "-1" states X.
  17. 17. Dynamic programming (DP). V(x) = expectation of C(xH) if optimal strategy (well defined). u(x) such that the expectation of V(f(x,u(x),A)) is minimal. Computation by dynamic programming: we compute V for all the final states X. We compute V for all the "-1" states X, then the "-2" states, and so on.
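As a concrete illustration of this backward pass, here is a minimal sketch on a toy finite-horizon problem; the state space, horizon, transition f and noise A below are invented placeholders, not taken from the slides.

    # Backward dynamic programming on a toy finite-horizon problem (illustrative sketch).
    # Assumptions (not from the slides): finite state/action sets, known transition f(x,u,a),
    # uniformly random noise a, terminal cost C.
    H = 3                       # horizon
    states = range(5)           # toy state space
    actions = [-1, 0, +1]       # toy action space
    noises = [0, 1]             # toy random outcomes

    def f(x, u, a):             # toy transition
        return max(0, min(4, x + u + a))

    def C(x):                   # terminal cost on the final states
        return abs(x - 2)

    V = {(H, x): C(x) for x in states}            # V for all the final states
    policy = {}
    for t in reversed(range(H)):                  # then the "-1" states, "-2" states, ...
        for x in states:
            best_u, best_val = None, float("inf")
            for u in actions:
                # expectation of V(f(x, u(x), A)) under the random outcome A
                val = sum(V[(t + 1, f(x, u, a))] for a in noises) / len(noises)
                if val < best_val:
                    best_u, best_val = u, val
            V[(t, x)], policy[(t, x)] = best_val, best_u

    print(V[(0, 0)], policy[(0, 0)])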
  18. 18. Dynamic programming: picture
  19. 19. Alpha-beta = DP + pruning
  20. 20. Alpha-beta = DP + pruning. “Nevertheless, I believe that a world-champion-level Go machine can be built within 10 years, based upon the same method of intensive analysis--brute force, basically--that Deep Blue employed for chess.” Hsu, IEEE Spectrum, 2007. (==> I don't think so.)
  21. 21. Extensions of DP: Approximate dynamic programming (e.g. for continuous domains). Reinforcement learning. Case where f(...) or A is black-box. Huge state spaces ==> but lack of stability. Direct Policy Search, Fitted-Q-Iteration... ==> there is room for improvements.
  22. 22. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions. Weakness. Games as benchmarks?
  23. 23. Monte-Carlo Tree Search. Monte-Carlo Tree Search (MCTS) appeared in games. R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, Turin, Italy, 2006. Its most well-known variant is termed Upper Confidence Tree (UCT).
  24. 24. UCT (Upper Confidence Trees). Coulom (06), Chaslot, Saito & Bouzy (06), Kocsis & Szepesvari (06).
  25. 25. UCT
  29. 29. UCT Kocsis & Szepesvari (06)
  30. 30. Exploitation ...
  31. 31. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
  34. 34. ... or exploration ? SCORE = 0/2 + k.sqrt( log(10)/2 )
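As a sketch, the score used in these slides (empirical mean plus an exploration bonus weighted by a constant k) can be computed as follows; k is the exploration constant the slides leave free.

    import math

    def ucb_score(wins, n_move, n_total, k):
        """Score = empirical mean + k * sqrt(log(total simulations) / simulations of this move)."""
        return wins / n_move + k * math.sqrt(math.log(n_total) / n_move)

    # The two candidates from the slides, with k = 1:
    print(ucb_score(5, 7, 10, 1.0))   # exploitation: 5/7 + sqrt(log(10)/7)
    print(ucb_score(0, 2, 10, 1.0))   # exploration:  0/2 + sqrt(log(10)/2)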
  35. 35. Parallelizing MCTS. On a parallel machine with shared memory: just many simulations in parallel, the same memory for all. On a parallel machine with no shared memory: one MCTS per comp. node, and 3 times per second: select nodes with at least 5% of total sims (depth at most 3), average all statistics on these nodes. ==> comp cost = log(nb comp nodes)
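A rough sketch of the message-passing variant described above (periodically averaging the statistics of heavily simulated shallow nodes across compute nodes); the node representation below is an invented simplification, not the actual implementation.

    # Sketch of periodic statistics averaging for message-passing MCTS parallelization.
    # Assumption: each compute node keeps {node_key: (visits, wins)}, with keys identifying
    # tree nodes by their move sequence (so len(key) is the depth).

    def nodes_to_share(tree, total_sims, max_depth=3, min_fraction=0.05):
        """Select shallow nodes with at least 5% of the total simulations."""
        return {key: stats for key, stats in tree.items()
                if len(key) <= max_depth and stats[0] >= min_fraction * total_sims}

    def average_statistics(shared_trees):
        """Average (visits, wins) of the selected nodes over all compute nodes."""
        merged = {}
        for tree in shared_trees:
            for key, (visits, wins) in tree.items():
                v, w, c = merged.get(key, (0.0, 0.0, 0))
                merged[key] = (v + visits, w + wins, c + 1)
        return {key: (v / c, w / c) for key, (v, w, c) in merged.items()}

    # Example with two compute nodes, keys = move sequences:
    t1 = {("a",): (600, 350), ("a", "b"): (200, 90)}
    t2 = {("a",): (550, 300), ("c",): (100, 40)}
    shared = [nodes_to_share(t, total_sims=1000) for t in (t1, t2)]
    print(average_statistics(shared))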
  40. 40. Good news: it works. So misleading numbers...
  41. 41. Much better than voting schemes. But little difference with T. Cazenave (depth 0).
  42. 42. Every month, someone tells us: Try with a bigger machine ! And win against top pros !
  43. 43. Being faster is not the solution
  44. 44. The same in Havannah (F. Teytaud)
  45. 45. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions. Weakness. Games as benchmarks?
  46. 46. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions: More than UCT in MCTS; Infinite action spaces; Offline learning; Online learning; Expert knowledge; Hidden information.
  47. 47. Why UCT is suboptimal for games? There are better formulas than mean + sqrt(log(...)/...) (= UCT). MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be consistent (→ best move) and frugal (if there is a good move, it does not visit all of the tree infinitely often). (==> not true for UCT)
  48. 48. Why UCT is suboptimal for games? There is better for deterministic win/draw/loss games: (sumRewards+K)/(nbTrials+2K). MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be consistent (→ best move) and frugal. (==> not true for UCT)
  49. 49. Go: from 29 to 6 stones. Formula for simulation: argmax (nbWins + 1) / (nbLosses + 2). Berthier, Doghmen, T., LION 2010. ==> consistency ==> frugality
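A small sketch of the two scores above: the in-tree score for deterministic win/draw/loss games and the formula used inside the simulations. K and the example counters are placeholders.

    def tree_score(sum_rewards, nb_trials, K=1.0):
        """In-tree score for deterministic win/draw/loss games: (sumRewards+K)/(nbTrials+2K)."""
        return (sum_rewards + K) / (nb_trials + 2 * K)

    def simulation_score(nb_wins, nb_losses):
        """Score used inside the Monte-Carlo simulations: (nbWins+1)/(nbLosses+2)."""
        return (nb_wins + 1) / (nb_losses + 2)

    # Pick the move maximizing the simulation score (the argmax from the slide):
    stats = {"A": (10, 4), "B": (3, 0), "C": (0, 5)}   # move -> (nbWins, nbLosses)
    print(max(stats, key=lambda m: simulation_score(*stats[m])))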
  50. 50. It depends on the game and on the tuning...
  51. 51. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions: More than UCT in MCTS; Infinite action spaces; Offline learning; Online learning; Expert knowledge; Hidden information.
  52. 52. Infinite action spaces: progressive widening. UCB1: choose u maximizing the compromise: empirical average for decision u + √( log(i) / number of trials with decision u ). ==> argmax only on the first i^alpha arms, with alpha in [0.25, 0.5]. (Coulom, Chaslot et al, Wang et al)
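A sketch of progressive widening on top of UCB1, restricting the argmax to the first ceil(i^alpha) arms; the value of alpha and the toy reward model are illustrative assumptions.

    import math, random

    def progressive_widening_choice(stats, i, alpha=0.4):
        """UCB1 restricted to the first ceil(i**alpha) arms; arms are assumed pre-ordered
        (e.g. by a heuristic), stats[arm] = (sum_rewards, n_trials)."""
        n_considered = max(1, math.ceil(i ** alpha))
        candidates = range(min(n_considered, len(stats)))
        def score(arm):
            s, n = stats[arm]
            if n == 0:
                return float("inf")                      # try each considered arm once
            return s / n + math.sqrt(math.log(i) / n)    # empirical average + exploration
        return max(candidates, key=score)

    # Toy usage: 100 arms, arm k has mean reward 1/(1+k); only a few arms ever get explored.
    stats = [(0.0, 0)] * 100
    for i in range(1, 2001):
        arm = progressive_widening_choice(stats, i)
        reward = random.random() < 1.0 / (1 + arm)
        s, n = stats[arm]
        stats[arm] = (s + reward, n + 1)
    print(sorted(range(100), key=lambda a: -stats[a][1])[:5])   # most-pulled arms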
  53. 53. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions: More than UCT in MCTS; Infinite action spaces; Offline learning; Online learning; Expert knowledge; Hidden information.
  54. 54. Extensions. "Standard" UCT: score(situation,move) = compromise (in [0,1+]) between a) empirical quality P(win | nextMove(situation) = move), estimated in simulations, and b) exploration term. Remark: no offline learning.
  55. 55. Extension: offline learning (introducing imitation learning). c) offline value (Bouzy et al) = empirical estimate of P(played | pattern). Pattern = ball of locations, each location either: this is black stone / this is white stone / this is empty / this is not black stone / this is not white stone / this is not empty / this is border. Support = frequency of "the center of this pattern is played". Confidence = conditional frequency of play. Bias = confidence of pattern with max support.
  56. 56. Extension: offline learning (introducing imitation learning). score(situation,move) = compromise between a) empirical quality, b) exploration term, c) offline value (Bouzy et al, Coulom) = empirical estimate of P(played | pattern) for patterns with big support ==> estimated on a database. At first, (c) is the most important; later, (a) dominates.
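One common way to implement such a compromise (a sketch, not necessarily the exact formula of the programs cited) is to treat the offline value as a prior worth a fixed number of virtual simulations, so it dominates early and the empirical quality takes over later.

    import math

    def score_with_prior(wins, visits, parent_visits, prior, k=1.0, prior_weight=10.0):
        """Sketch: the offline value (c) acts as `prior_weight` virtual simulations, so it
        dominates when `visits` is small and the empirical quality (a) dominates later."""
        quality = (wins + prior_weight * prior) / (visits + prior_weight)         # (a) + (c)
        exploration = k * math.sqrt(math.log(parent_visits + 1) / (visits + 1))   # (b)
        return quality + exploration

    # Early on, the pattern-based prior decides; after many visits, the simulations decide.
    print(score_with_prior(wins=0, visits=0, parent_visits=1, prior=0.8))
    print(score_with_prior(wins=300, visits=1000, parent_visits=5000, prior=0.8))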
  57. 57. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions: More than UCT in MCTS; Infinite action spaces; Offline learning; Online learning; Expert knowledge; Hidden information.
  58. 58. Extensions. "Standard" UCT: score(situation,move) = compromise between a) empirical quality and b) exploration term. Remark: no learning from one situation to another.
  59. 59. Extension: transient values. score(situation,move) = compromise between a) empirical quality P(win | nextMove(situation) = move), estimated in simulations, b) exploration term, c) offline value, d) "transient value" (Gelly et al, 07): P(win | move ∈ laterMoves(situations)). ==> brings information from node N to ancestor node M. ==> does not bring information from node N to descendants or cousins (many people have tried...). Brügman, Gelly et al.
  60. 60. Transient values = RAVE = very good in many games. It works also in Havannah.
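A sketch of how such transient (RAVE) values are typically blended with the regular statistics; the beta schedule below is one standard choice and an assumption, not necessarily the exact one used in the programs above.

    def rave_score(wins, visits, rave_wins, rave_visits, k_rave=1000.0):
        """Blend Q (win rate when the move is played now) with Q_RAVE
        (win rate when the move appears later in the simulation).
        beta -> 1 when visits is small, beta -> 0 as visits grows."""
        q = wins / visits if visits else 0.5
        q_rave = rave_wins / rave_visits if rave_visits else 0.5
        beta = k_rave / (k_rave + 3.0 * visits)     # one standard schedule (assumption)
        return (1.0 - beta) * q + beta * q_rave

    # With few direct visits the transient value dominates; later the true statistics do.
    print(rave_score(wins=1, visits=2, rave_wins=400, rave_visits=600))
    print(rave_score(wins=900, visits=2000, rave_wins=400, rave_visits=600))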
  61. 61. It works also in NoGo. NoGo = rules of Go, except that capturing ==> losing.
  62. 62. Counter-example to RAVE, B2. By M. Müller. B2 makes sense only if it is played immediately (otherwise A5 kills).
  63. 63. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions: More than UCT in MCTS; Infinite action spaces; Offline learning; Online learning; Expert knowledge; Hidden information.
  64. 64. Extensions. "Standard" UCT: score(situation,move) = compromise between a) empirical quality and b) exploration term. Remark: no expert rules.
  65. 65. Extension: expert rules. score(situation,move) = compromise between a) empirical quality, b) exploration term, c) offline value, d) transient value, e) expert rules ==> empirically derived linear combination. Most important terms: (e)+(c) first, then (d) becomes stronger, finally (a) only.
  66. 66. Extension: expert rules in the Monte-Carlo part. Decisive moves: play immediate wins. Anti-decisive moves: don't play moves with an immediate winning reply. Teytaud & Teytaud, CIG 2010: can be fast in connection games, e.g. Havannah.
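A sketch of decisive / anti-decisive moves inside a random playout; the game interface (legal_moves, play, is_win) and the +1/-1 player encoding are hypothetical, to be supplied by the game at hand.

    import random

    def playout_move(state, player, legal_moves, play, is_win):
        """Decisive moves: play an immediate win if one exists.
        Anti-decisive moves: avoid moves that give the opponent an immediate winning reply.
        `player` is assumed to be +1 or -1; the game functions are passed in."""
        moves = legal_moves(state, player)
        for m in moves:                               # decisive: immediate win
            if is_win(play(state, player, m), player):
                return m
        opponent = -player
        safe = []
        for m in moves:                               # anti-decisive: no winning reply
            nxt = play(state, player, m)
            if not any(is_win(play(nxt, opponent, r), opponent)
                       for r in legal_moves(nxt, opponent)):
                safe.append(m)
        return random.choice(safe or moves)           # otherwise uniform random playout move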
  67. 67. Go: from 29 to 6 stones. 1998: loss against amateur (6d), 19x19, H29. 2008: win against a pro (8p), 19x19, H9, MoGo. 2008: win against a pro (4p), 19x19, H8, CrazyStone. 2008: win against a pro (4p), 19x19, H7, CrazyStone. 2009: win against a pro (9p), 19x19, H7, MoGo. 2009: win against a pro (1p), 19x19, H6, MoGo. 2010: win against a pro (4p), 19x19, H6, Zen. 2010: win against a pro (5p), 19x19, H6, Zen. 2007: win against a pro (5p), 9x9 (blitz), MoGo. 2008: win against a pro (5p), 9x9, white, MoGo. 2009: win against a pro (5p), 9x9, black, MoGo. 2009: win against a pro (9p), 9x9, white, Fuego. 2009: win against a pro (9p), 9x9, black, MoGoTW. ==> still 6 stones at least!
  68. 68. Go: from 29 to 6 stones (same list as above). Wins with H6 / H7 are lucky (rare) wins.
  69. 69. Go: from 29 to 6 stones (same list as above). Win with the disadvantageous side.
  70. 70. 13x13 Go: new results! 9x9 Go: computers are at the human best level. - Fuego won against a top level human as white. - MoGoTW did it both as black and white and regularly wins some games against the top players. - MoGoTW won 3/4 yesterday in blind go (blind go = go in 9x9 according to the pros). 19x19 Go: the best humans still (almost always) win easily with 7 handicap stones. In WCCI 2010, experiments in 13x13 Go: - MoGo won 2/2 against 6D players with handicap 2. - MfoG won 1/2 against 6D players with handicap 2. - Fuego won 0/2 against 6D players with handicap 2. And yesterday MoGoTW won one game with handicap 2.5!
  71. 71. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions: More than UCT in MCTS; Infinite action spaces; Offline learning; Online learning; Expert knowledge; Hidden information.
  72. 72. Bandits. We have seen UCB: choose the action with maximal score Q(action,state) = empirical_reward(action,state) + sqrt( log(nbSims(state)) / nbSims(action,state) ). EXP3 is another bandit: for adversarial cases, based on a stochastic formula.
  73. 73. EXP3 in one slide. Grigoriadis et al, Auer et al, Audibert & Bubeck COLT 2009.
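The EXP3 formula itself did not survive the slide export; as a sketch, here is the textbook EXP3 update (Auer et al.), which matches the description "stochastic formula for adversarial cases". The exact variant used in the talk is an assumption.

    import math, random

    class Exp3:
        """EXP3: stochastic bandit for adversarial settings."""
        def __init__(self, n_arms, gamma=0.1):
            self.n, self.gamma = n_arms, gamma
            self.weights = [1.0] * n_arms

        def probabilities(self):
            total = sum(self.weights)
            return [(1 - self.gamma) * w / total + self.gamma / self.n
                    for w in self.weights]

        def make_decision(self):
            return random.choices(range(self.n), weights=self.probabilities())[0]

        def update(self, arm, reward):           # reward assumed in [0, 1]
            p = self.probabilities()
            estimated = reward / p[arm]           # importance-weighted reward estimate
            self.weights[arm] *= math.exp(self.gamma * estimated / self.n)

    # Usage: pick an arm, observe a (possibly adversarial) reward, update.
    bandit = Exp3(3)
    arm = bandit.make_decision()
    bandit.update(arm, reward=1.0)
    print(bandit.probabilities())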
  74. 74. MCTS for simultaneous actions. Tree structure: Player 1 plays, Player 2 plays, both players play, ... Player 1 plays, Player 2 plays.
  75. 75. MCTS for simultaneous actions (Flory, Teytaud, Evostar 2011). Player 1 plays = maxUCB node. Player 2 plays = minUCB node. Both players play = EXP3 node. Player 1 plays = maxUCB node. Player 2 plays = minUCB node.
  76. 76. MCTS for hidden information. Player 1: observation set 1, observation set 2, observation set 3, each with an EXP3 node. Player 2: observation set 1, observation set 2, observation set 3, each with an EXP3 node.
  77. 77. MCTS for hidden information (same figure). "Observation set" = set of sequences of observations.
  78. 78. MCTS for hidden information (same figure). Here, the possible sequences of observations are partitioned in 3.
  79. 79. MCTS for hidden information (same figure). Thanks Martin. (Incrementally + application to phantom-tic-tac-toe: see D. Auger 2011.)
  80. 80. MCTS for hidden information (same figure). Use EXP3: consistent even in adversarial settings. (Incrementally + application to phantom-tic-tac-toe: see D. Auger 2010.)
  81. 81. MCTS with hidden information. While (there is time for thinking) { s = initial state; os1 = observationSet1 = (); os2 = (); while (s not terminal) { b1 = bandit1(os1); b2 = bandit2(os2); d1 = b1.makeDecision; d2 = b2.makeDecision; (s,o1,o2) = transition(s,d1,d2); os1 = os1.o1; os2 = os2.o2 }; send reward to all bandits in the simulation }
  90. 90. MCTS with hidden information: incremental version. Same loop as above, plus: possibly refine the family of bandits.
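Below is a runnable toy sketch of this loop, with EXP3 bandits indexed by each player's own observation sequence. The game (two rounds of simultaneous matching pennies where each player observes only its own past moves) and all names are invented placeholders, not code from the talk.

    import math, random

    class Exp3:
        def __init__(self, n_arms, gamma=0.2):
            self.n, self.gamma, self.w = n_arms, gamma, [1.0] * n_arms
        def probs(self):
            t = sum(self.w)
            return [(1 - self.gamma) * x / t + self.gamma / self.n for x in self.w]
        def decide(self):
            return random.choices(range(self.n), weights=self.probs())[0]
        def update(self, arm, reward):
            self.w[arm] *= math.exp(self.gamma * reward / (self.n * self.probs()[arm]))
            total = sum(self.w)
            self.w = [x / total for x in self.w]   # renormalize to avoid overflow

    def transition(s, d1, d2):
        """Toy game: player 1 scores a point when both decisions match.
        Each player only observes its own decision (hidden information)."""
        score, t = s
        return (score + (d1 == d2), t + 1), ("me", d1), ("me", d2)

    bandits1, bandits2 = {}, {}
    def bandit(table, os):                      # one bandit per observation sequence
        return table.setdefault(os, Exp3(2))

    for _ in range(10000):                      # while there is time for thinking
        s = (0, 0)                              # s = initial state
        os1, os2, used = (), (), []
        while s[1] < 2:                         # while s not terminal (two rounds)
            b1, b2 = bandit(bandits1, os1), bandit(bandits2, os2)
            d1, d2 = b1.decide(), b2.decide()
            s, o1, o2 = transition(s, d1, d2)
            used.append((b1, d1, b2, d2))
            os1, os2 = os1 + (o1,), os2 + (o2,)
        r1 = s[0] / 2.0                         # reward in [0,1] for player 1
        for b1, d1, b2, d2 in used:             # send reward to all bandits in the simulation
            b1.update(d1, r1)
            b2.update(d2, 1.0 - r1)

    print(bandit(bandits1, ()).probs())         # root mixed strategy learned by player 1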
  91. 91. Let's have fun with a nice application (
  92. 92. Let's have fun with Urban Rivals (4 cards). Each player has: four cards (each one can be used once), 12 pilz (each one can be used once), 12 life points. Each card has: one attack level, one damage, special effects (forget them for the moment). Four turns: P1 attacks P2, P2 attacks P1, P1 attacks P2, P2 attacks P1.
  93. 93. Let's have fun with Urban Rivals. First, the attacker plays: chooses a card, chooses (PRIVATELY) a number of pilz. Attack level = attack(card) x (1 + nb of pilz). Then, the defender plays: chooses a card, chooses a number of pilz. Defense level = attack(card) x (1 + nb of pilz). Result: if attack > defense, the defender loses Power(attacker's card); else the attacker loses Power(defender's card).
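A tiny sketch of that resolution rule; the card values below are invented examples.

    def resolve(att_card, att_pilz, def_card, def_pilz):
        """Attack/defense level = attack(card) * (1 + number of pilz).
        If attack > defense, the defender loses Power(attacker's card) life points,
        otherwise the attacker loses Power(defender's card)."""
        attack = att_card["attack"] * (1 + att_pilz)
        defense = def_card["attack"] * (1 + def_pilz)
        if attack > defense:
            return ("defender", att_card["power"])
        return ("attacker", def_card["power"])

    # Example with invented card values:
    attacker = {"attack": 6, "power": 4}
    defender = {"attack": 7, "power": 3}
    print(resolve(attacker, att_pilz=3, def_card=defender, def_pilz=1))
    # attack level 6*(1+3)=24 > defense level 7*(1+1)=14 -> defender loses 4 life points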
  94. 94. Let's have fun with Urban Rivals. ==> The MCTS-based AI is now at the best human level. Experimental (only) remarks on EXP3: discarding strategies with a small number of sims = better approximation of the Nash; also an improvement by taking into account the other bandit; not yet compared to INF; virtual simulations (inspired by Kummer).
  95. 95. Let's have fun with a nice application ) We are now at the best human level in Urban Rivals.
  96. 96. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions. Weakness. Games as benchmarks?
  97. 97. Game of Go (9x9 here)
  104. 104. Game of Go: counting territories (white has 7.5 "bonus" as black starts).
  105. 105. Game of Go: the rules. Black plays at the blue circle: the white group dies (it is removed). It's impossible to kill white (two "eyes"). "Ko" rules: we don't come back to the same situation. (Without ko: "PSPACE hard"; with ko: "EXPTIME-complete".) At the end, we count territories ==> black starts, so +7.5 for white.
  106. 106. Easy for computers... because human knowledge is easy to encode.
  107. 107. Difficult for computers (pointed out to me by J.M. Jolion in 1998).
  108. 108. Key point in Go: there are human-easy situations which are computer-hard. We'll see much easier situations poorly understood. (komi 7.5)
  109. 109. Difficult for computers (win for black, playing A). We'll see much easier situations poorly understood. (komi 7.5) But let's see an easier case.
  110. 110. A trivial semeai Plenty of equivalent situations! They are randomly sampled, with  no generalization. 50% of estimated win probability!
  121. 121. It does not work. Why? 50% of estimated win probability! In the first node: the first simulations give ~50%; the next simulations go to 100% or 0% (depending on the chosen move). But, then, we switch to another node (~ 8! x 8! such nodes).
  122. 122. And the humans? 50% of estimated win probability! In the first node: the first simulations give ~50%; the next simulations go to 100% or 0% (depending on the chosen move). But, then, we DON'T switch to another node.
  123. 123. Semeais. Should white play in the semeai (G1) or capture (J15)?
  124. 124. Semeais. Should black play the semeai?
  126. 126. Semeais. Should black play the semeai? Useless!
  127. 127. Outline: Discrete time control: various approaches. Monte-Carlo Tree Search (UCT, MCTS; 2006). Extensions. Weakness. Games as benchmarks?
  128. 128. Difficult board games: Havannah Very difficult for computers. Very simple to implement.
  129. 129. What else? First Person Shooting (UCT for partially observable MDP).
  130. 130. What else? Games with simultaneous actions or hidden information (Flory, Teytaud, Evostar 2011). Urban Rivals = internet card game; 11 million registered users. Game with hidden information.
  131. 131. What else? Real Time Strategy Games (multiple actors, partially observable).
  132. 132. What else? Sports (continuous control).
  133. 133. "Real" games. Assumption: if a computer understands and guesses spins, then this robot will be efficient for something else than just games. (Holds true for Go.)
  135. 135. What else? Collaborative sports.
  136. 136. When is MCTS relevant? Robust in front of: high dimension; non-convexity of Bellman values; complex models; delayed reward.
  137. 137. When is MCTS relevant? Robust in front of: high dimension; non-convexity of Bellman values; complex models; delayed reward. More difficult for: high values of H; highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree Search, see Cazenave et al.); lack of a reasonable baseline for the MC.
  138. 138. When is MCTS relevant? Robust in front of: high dimension; non-convexity of Bellman values; complex models; delayed reward. (Go: H ~300, dimension = 361, fully observable, fully delayed reward.) More difficult for: high values of H; highly unobservable cases; lack of a reasonable baseline for the MC.
  139. 139. When is MCTS relevant? How to apply it: Implement the transition (a function action x state → state). Design a Monte-Carlo part (a random simulation) (a heuristic in one-player games; difficult if two opponents) ==> at this point you can simulate... Implement UCT (just a bias in the simulator – no real optimizer).
  140. 140. When is MCTS relevant? How to apply it: Implement the transition (a function action x state → state). Design a Monte-Carlo part (a random simulation). Implement UCT (just a bias in the simulator – no real optimizer). Possibly: RAVE values (Gelly et al); parallelize multicores + MPI (Cazenave et al, Gelly et al); decisive moves + anti-decisive moves (Teytaud et al); patterns (Bouzy et al).
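A compact, self-contained sketch of that recipe on a one-player toy problem (steering a counter to a target in H steps); the problem and all names are invented for illustration, and the tree/playout split is one standard way to organize UCT, not the code of any program mentioned here.

    import math, random

    H, TARGET = 8, 5                              # toy horizon and goal
    ACTIONS = (-1, 0, +1)

    def transition(state, action):                # step 1: implement the transition
        return state + action

    def reward(state):                            # terminal reward in [0, 1]
        return 1.0 if state == TARGET else 0.0

    def monte_carlo(state, depth):                # step 2: the Monte-Carlo part (random sim)
        while depth < H:
            state, depth = transition(state, random.choice(ACTIONS)), depth + 1
        return reward(state)

    tree = {}                                     # (state, depth) -> {action: [visits, total_reward]}

    def uct_choice(node, k=0.7):                  # step 3: UCT = just a bias in the simulator
        n_parent = sum(v for v, _ in node.values()) + 1
        def score(a):
            visits, total = node[a]
            if visits == 0:
                return float("inf")
            return total / visits + k * math.sqrt(math.log(n_parent) / visits)
        return max(node, key=score)

    def simulate(state=0):
        depth, path = 0, []
        while depth < H and (state, depth) in tree:      # in-tree part, biased by UCT
            node = tree[(state, depth)]
            a = uct_choice(node)
            path.append((node, a))
            state, depth = transition(state, a), depth + 1
        if depth < H:                                    # expand one new node per simulation
            tree[(state, depth)] = {a: [0, 0.0] for a in ACTIONS}
        r = monte_carlo(state, depth)                    # random playout from the new node
        for node, a in path:                             # back up the reward
            node[a][0] += 1
            node[a][1] += r

    for _ in range(5000):
        simulate()
    root = tree[(0, 0)]
    print(max(root, key=lambda a: root[a][0]))           # most simulated root action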
  141. 141. Advantages of MCTS: easy + visible. Many indicators (not only the expectation; simulation-based; visible; easy to check). The algorithm is indeed simpler than DP (unless in-depth optimization, as for Go competition...). Anytime (you stop when you want).
  142. 142. Advantages of MCTS: general. No convexity assumption. Arbitrarily complex model. I can add an opponent. High dimension ok.
  143. 143. Drawbacks of MCTS: Recent method. Impact of H not clearly known (?). No free lunch: a model of the transition / uncertainties is required (but, an advantage: no constraint) (but: Fonteneau et al, model-free MC).
  144. 144. Conclusion. Essentially asymptotically proved only. Empirically good for: the game of Go; some other (difficult) games; non-linear expensive optimization; active learning. Tested industrially (Spiral library – architecture-specific). There are understood (but not solved) weaknesses. Next challenges: solve these weaknesses (introducing learning? refutation tables? Drake et al); more industrial applications; partially observable cases (Cazenave et al, Rolet et al, Auger); H large, truncating (Lorentz); scalability (Doghmen et al).
