A simple tutorial on Monte-Carlo Tree Search
Contains a description of dynamic programming and alpha-beta search, then MCTS. Special cases for simultaneous actions are discussed.
I should add comments so that it can be used without prior knowledge of MCTS; if there is at least one request for this, I will do it.
@article{gelly:hal-00695370,
  hal_id = {hal-00695370},
  url = {http://hal.inria.fr/hal-00695370},
  title = {{The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions}},
  author = {Gelly, Sylvain and Kocsis, Levente and Schoenauer, Marc and Sebag, Mich{\`e}le and Silver, David and Szepesvari, Csaba and Teytaud, Olivier},
  abstract = {{The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.}},
  language = {English},
  affiliation = {TAO - INRIA Saclay - Ile de France, Laboratoire de Recherche en Informatique - LRI, LPDS, Microsoft Research - Inria Joint Centre - MSR - INRIA, University of Alberta, Canada, Department of Computing Science},
  publisher = {ACM},
  pages = {106--113},
  journal = {Communications of the ACM},
  volume = {55},
  number = {3},
  audience = {international},
  year = {2012},
  pdf = {http://hal.inria.fr/hal-00695370/PDF/CACM-MCTS.pdf},
}
1. Bandit-based Monte-Carlo planning: the game of Go and beyond
Designing intelligent agents with Monte-Carlo Tree Search.
Olivier.Teytaud@inria.fr + F. Teytaud + H. Doghmen + others
TAO, Inria-Saclay IDF, CNRS 8623, LRI, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence.
Keywords: UCB, EXP3, MCTS, UCT.
Paris, April 2011.
Games, games with hidden information, games with simultaneous actions.
2. Key point
PLEASE INTERRUPT ME!
HAVE QUESTIONS!
LET'S HAVE A FRIENDLY SESSION!
ASK QUESTIONS NOW AND LATER BY MAIL!
olivier.teytaud@inria.fr
3. Outline
Introduction: games / control / planning.
Standard approaches
Bandit-based Monte-Carlo Planning and UCT.
Application to the game of Go and (far) beyond
7. A game is a directed graph with actions and players and observations
[Figure: a game drawn as a directed graph, with Black and White player nodes, numbered actions, and observations (Bob / Bear / Bee).]
8. A game is a directed graph with actions and players and observations and rewards
[Figure: the same graph, now with rewards (+1, 0). Rewards on leaves only!]
9. A game is a directed graph +actions +players +observations +rewards +loops
[Figure: the same graph, now with loops.]
10. More than games in this formalism
A main application: the management of many energy stocks in the face of randomness.
At each time step we see random outcomes.
We have to make decisions (switching on or off).
We have losses.
(ANR / NSC project)
11. Opening a reservoir produces energy (and water goes to another reservoir)
[Figure: a network of five reservoirs feeding each other, plus classical thermal plants and nuclear plants, all serving the electricity demand; some water is lost.]
12. Outline
Introduction: games / control / planning.
Standard approaches
Bandit-based Monte-Carlo Planning and UCT.
Application to the game of Go and (far) beyond
13. What are the approaches?
Dynamic programming (Massé, Bellman, 50's) (still the main approach in industry) (minimax / alpha-beta in games)
Reinforcement learning (some promising results, less used in industry)
Some tree exploration tools (less usual in stochastic or continuous cases)
Bandit-based Monte-Carlo planning
Scripts + tuning
14. What are the approaches?
Dynamic programming (Massé, Bellman, 50's) (still the main approach in industry)
Where we are:
Done: presentation of the problem.
Now: we briefly present dynamic programming.
Thereafter: we present MCTS / UCT.
15. Dynamic programming
V(x) = expectation of the future loss if the optimal strategy is applied after state x (well defined).
u(x) = the action such that the expectation of V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all the final states x.
16. Dynamic programming
V(x) = expectation of C(x_H) under the optimal strategy (well defined).
u(x) = the action such that the expectation of V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all the final states x.
We compute V for all the "-1" states x (one step before the end).
17. Dynamic programming (DP)
V(x) = expectation of C(x_H) under the optimal strategy (well defined).
u(x) = the action such that the expectation of V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all the final states x.
We compute V for all the "-1" states x.
... and so on, backwards, until the initial states.
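To make the backward recursion concrete, here is a minimal sketch on a toy problem (a sketch only: the dynamics f, cost C, horizon H, states and probabilities are illustrative assumptions; only the notation follows the slides):

# Minimal backward dynamic programming on a toy problem (assumed data).
states = range(5)
actions = [0, 1]                    # e.g. switching something on or off
outcomes = [(0, 0.5), (1, 0.5)]     # (value of the random A, probability)
H = 3                               # horizon

def f(x, u, a):                     # transition function (toy dynamics)
    return (x + u + a) % 5

def C(x):                           # cost paid at the final state x_H
    return float(x)

# V[t][x] = expected cost from state x at time t under the optimal strategy
V = [[0.0] * len(states) for _ in range(H + 1)]
policy = [[0] * len(states) for _ in range(H)]
for x in states:
    V[H][x] = C(x)                  # first, V on all the final states
for t in range(H - 1, -1, -1):      # then the "-1" states, the "-2" states, ...
    for x in states:
        best_u, best_val = None, float("inf")
        for u in actions:
            val = sum(p * V[t + 1][f(x, u, a)] for a, p in outcomes)
            if val < best_val:      # u(x): the action minimizing E[V(f(x,u,A))]
                best_u, best_val = u, val
        V[t][x], policy[t][x] = best_val, best_u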
20. Alpha-beta = DP + pruning
"Nevertheless, I believe that a world-champion-level Go machine can be built within 10 years, based upon the same method of intensive analysis--brute force, basically--that Deep Blue employed for chess."
Hsu, IEEE Spectrum, 2007.
(==> I don't think so.)
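For reference, a minimal sketch of the "DP + pruning" idea (children and leaf_value are assumed helpers for some deterministic two-player zero-sum game):

def alphabeta(state, depth, alpha, beta, maximizing, children, leaf_value):
    # Minimax (DP on the game tree) with pruning of branches that
    # cannot change the result at the root.
    kids = children(state)
    if depth == 0 or not kids:
        return leaf_value(state)
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False,
                                         children, leaf_value))
            alpha = max(alpha, value)
            if alpha >= beta:      # beta cut-off: the opponent avoids this line
                break
        return value
    else:
        value = float("inf")
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True,
                                         children, leaf_value))
            beta = min(beta, value)
            if beta <= alpha:      # alpha cut-off
                break
        return value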
21. Extensions of DP
Approximate dynamic programming (e.g. for continuous domains)
Reinforcement learning
Case where f(...) or A is black-box
Huge state spaces
==> but lack of stability
Direct Policy Search, Fitted-Q-Iteration...
==> there is room for improvement
22. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weaknesses
Games as benchmarks?
23. Monte-Carlo Tree Search
Monte-Carlo Tree Search (MCTS) appeared in games.
R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, Turin, Italy, 2006.
Its most well-known variant is termed Upper Confidence Tree (UCT).
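The slides use UCT from here on without spelling out the loop, so here is a minimal one-player UCT sketch (a sketch only: the toy interface legal_moves / play / rollout, rewards in [0, 1] and the constant c are assumptions, not the tutorial's code):

import math, random

class Node:
    def __init__(self):
        self.children = {}   # move -> Node
        self.n = 0           # number of simulations through this node
        self.total = 0.0     # sum of rewards

def uct_score(parent, child, c=math.sqrt(2)):
    if child.n == 0:
        return float("inf")
    # empirical mean + exploration term: the UCB compromise used by UCT
    return child.total / child.n + c * math.sqrt(math.log(parent.n) / child.n)

def uct_search(root_state, legal_moves, play, rollout, n_sims=1000):
    root = Node()
    for _ in range(n_sims):
        node, state, path = root, root_state, [root]
        # 1) selection / expansion: descend the tree with the UCB compromise
        while True:
            moves = legal_moves(state)
            if not moves:          # terminal state
                break
            untried = [m for m in moves if m not in node.children]
            if untried:            # expand one untried move
                move = random.choice(untried)
                node.children[move] = Node()
            else:                  # all moves tried: UCB argmax
                move = max(moves, key=lambda m: uct_score(node, node.children[m]))
            node, state = node.children[move], play(state, move)
            path.append(node)
            if node.n == 0:        # reached a fresh node: stop descending
                break
        # 2) Monte-Carlo part: finish the game with a random simulation
        reward = rollout(state)
        # 3) backpropagation of the reward along the visited path
        for n in path:
            n.n += 1
            n.total += reward
    # final decision: the most simulated move at the root (the usual robust choice)
    return max(root.children, key=lambda m: root.children[m].n)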
35. Parallelizing MCTS
On a parallel machine with shared memory: just run many simulations in parallel, with the same memory (the same tree) for all.
On a parallel machine with no shared memory: one MCTS per computation node, and 3 times per second:
Select the nodes with at least 5% of the total simulations (depth at most 3);
Average all statistics on these nodes.
==> cost = log(nb of computation nodes)
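A hedged sketch of the no-shared-memory scheme, reusing the Node type from the UCT sketch above (the transport between computation nodes, e.g. an MPI all-gather, is left abstract because the slides name no library):

def heavy_node_stats(root, threshold=0.05, max_depth=3):
    # Statistics (path -> (n, total)) of nodes holding at least 5% of the
    # total simulations, at depth at most 3, as on the slide.
    stats = {}
    def walk(node, path, depth):
        if depth > max_depth or node.n < threshold * root.n:
            return
        stats[path] = (node.n, node.total)
        for move, child in node.children.items():
            walk(child, path + (move,), depth + 1)
    walk(root, (), 0)
    return stats

def average_statistics(all_stats):
    # Average the gathered statistics over all computation nodes; this is
    # what each node would do about 3 times per second.
    merged = {}
    for stats in all_stats:
        for path, (n, total) in stats.items():
            pn, pt = merged.get(path, (0, 0.0))
            merged[path] = (pn + n, pt + total)
    k = len(all_stats)
    return {path: (n / k, t / k) for path, (n, t) in merged.items()}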
45. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weaknesses
Games as benchmarks?
46. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
47. Why is UCT suboptimal for games?
There are better formulas than mean + sqrt(log(...) / ...) (= UCT).
MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be:
consistent (→ converges to the best move);
frugal (if there is a good move, it does not visit all of the tree infinitely often).
(==> not true for UCT)
48. Why is UCT suboptimal for games?
There are better formulas than mean + sqrt(log(...) / ...) (= UCT).
There is better for deterministic win/draw/loss games:
(sumRewards + K) / (nbTrials + 2K)
MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be:
consistent (→ converges to the best move);
frugal (if there is a good move, it does not visit all of the tree infinitely often).
(==> not true for UCT)
49. Go: from 29 to 6 stones
Formula for the simulation part:
argmax (nbWins + 1) / (nbLosses + 2)
Berthier, Doghmen, Teytaud, LION 2010
==> consistency
==> frugality
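In code, the move-selection rule above is a one-liner (the statistics dictionaries are assumed to be maintained by the simulations):

def select_simulation_move(moves, nb_wins, nb_losses):
    # argmax of (nbWins + 1) / (nbLosses + 2) over the legal moves
    return max(moves, key=lambda m: (nb_wins.get(m, 0) + 1) / (nb_losses.get(m, 0) + 2))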
51. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
52. Infinite action spaces: progressive widening
UCB1: choose u maximizing the compromise
empirical average for decision u + √( log(i) / number of trials with decision u )
==> at iteration i, take the argmax over the first ~ i^α arms only, with α in [0.25, 0.5]
(Coulom, Chaslot et al, Wang et al)
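A sketch of progressive widening under that reading (assumptions: arms come pre-ordered by some heuristic, and the number of arms considered grows as i^alpha):

import math

def progressive_widening_choice(arms, means, trials, i, alpha=0.4):
    # At iteration i, restrict the UCB argmax to the first ceil(i**alpha)
    # arms (alpha in [0.25, 0.5]); new arms enter the competition progressively.
    k = max(1, math.ceil(i ** alpha))
    def ucb(u):
        t = trials.get(u, 0)
        if t == 0:
            return float("inf")
        return means[u] + math.sqrt(math.log(i) / t)
    return max(arms[:k], key=ucb)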
53. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
54. Extensions
``Standard'' UCT:
score(situation, move) = compromise (in [0, 1+]) between
a) empirical quality: P ( win | nextMove(situation) = move ), estimated in simulations
b) exploration term
Remark: no offline learning.
55. Extension: offline learning (introducing imitation learning)
c) offline value (Bouzy et al) = empirical estimate of P ( played | pattern )
Pattern = a ball of locations, each location being one of:
- black stone
- white stone
- empty
- not a black stone
- not a white stone
- not empty
- border
Support = frequency of "the center of this pattern is played".
Confidence = conditional frequency of play.
Bias = confidence of the pattern with maximal support.
56. Extension: offline learning (introducing imitation learning)
score(situation, move) = compromise between
a) empirical quality
b) exploration term
c) offline value (Bouzy et al, Coulom) = empirical estimate of P ( played | pattern ), for patterns with large support
==> estimated on a database
At first, (c) is the most important; later, (a) dominates.
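A hedged sketch of such a combined score (the exact combination and decay schedule vary between programs; the schedule below is an illustrative assumption):

import math

def score(mean_win, n_parent, n_move, offline_value, c_weight=50.0):
    # (a) empirical quality + (b) exploration + (c) offline pattern value.
    # The weight of (c) decays as trials accumulate, so (c) dominates at
    # first and (a) dominates later, as stated on the slide.
    if n_move == 0:
        return float("inf")
    exploration = math.sqrt(math.log(n_parent) / n_move)
    w = c_weight / (c_weight + n_move)   # assumed decay schedule
    return (1 - w) * mean_win + exploration + w * offline_value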
57. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
58. Extensions
``Standard'' UCT:
score(situation, move) = compromise between
a) empirical quality
b) exploration term
Remark: no learning from one situation to another.
59. Extension: transient values
score(situation, move) = compromise between
a) empirical quality: P' ( win | nextMove(situation) = move ), estimated in simulations
b) exploration term
c) offline value
d) ``transient'' value (Gelly et al, 07): P' ( win | move ∈ laterMoves(situation) )
==> brings information from node N to ancestor node M
==> does not bring information from node N to descendants or cousins (many people have tried...)
Brügmann, Gelly et al
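A minimal sketch of this blending (the beta schedule below is one common choice from the literature, an assumption rather than necessarily the one used here):

def rave_score(win_mean, n_move, rave_mean, n_rave, k=1000.0):
    # Each node keeps, besides its normal statistics, "all moves as first"
    # statistics: every move played later in a simulation through the node
    # updates that move's RAVE counters. The score blends the two means;
    # beta -> 0 as real trials accumulate, so RAVE only guides early search.
    if n_move == 0 and n_rave == 0:
        return float("inf")
    beta = n_rave / (n_rave + n_move + n_rave * n_move / k)  # assumed schedule
    return beta * rave_mean + (1 - beta) * win_mean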
60. Transient values = RAVE = very good in many games
It also works in Havannah.
61. It also works in NoGo
NoGo = the rules of Go, except that capturing ==> losing.
62. Counter-example to RAVE, B2
By M. Müller.
B2 makes sense only if it is played immediately (otherwise A5 kills).
63. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
64. Extensions
``Standard'' UCT:
score(situation, move) = compromise between
a) empirical quality
b) exploration term
Remark: no expert rules.
65. Extension: expert rules
score(situation, move) = compromise between
a) empirical quality
b) exploration term
c) offline value
d) transient value
e) expert rules
==> empirically derived linear combination
Most important terms: (e) + (c) first, then (d) becomes stronger, finally (a) only.
66. Extension: expert rules in the Monte-Carlo part
Decisive moves: play immediate wins.
Anti-decisive moves: don't play moves with an immediate winning reply.
Teytaud & Teytaud, CIG 2010: can be fast in connection games, e.g. Havannah.
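A sketch of these two rules inside the Monte-Carlo part (all helpers are assumed, game-specific tests):

import random

def mc_move(state, player, opponent, legal_moves, play, immediate_win):
    moves = legal_moves(state)
    # Decisive moves: if we have an immediate winning move, play it.
    for m in moves:
        if immediate_win(state, m, player):
            return m
    # Anti-decisive moves: discard moves that leave an immediate winning reply.
    safe = []
    for m in moves:
        nxt = play(state, m)
        if not any(immediate_win(nxt, r, opponent) for r in legal_moves(nxt)):
            safe.append(m)
    return random.choice(safe or moves)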
67. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2010: win against a pro (4p) 19x19, H6 Zen
2010: win against a pro (5p) 19x19, H6 Zen
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
68. Go: from 29 to 6 stones
[Same list of results as on the previous slide.]
The wins with H6 / H7 are lucky (rare) wins.
==> still 6 stones at least!
69. Go: from 29 to 6 stones
[Same list of results as on slide 67.]
In 9x9: wins with the disadvantageous side.
==> still 6 stones at least!
70. 13x13 Go: new results!
9x9 Go: computers are at the best human level.
- Fuego won against a top-level human as white
- MoGoTW did it both as black and as white, and regularly wins some games against the top players
- MoGoTW won 3/4 yesterday in blind go (blind go = go in 9x9, according to the pros)
19x19 Go: the best humans still (almost always) win easily with 7 handicap stones.
In WCCI 2010, experiments in 13x13 Go:
- MoGo won 2/2 against 6D players with handicap 2
- MfoG won 1/2 against 6D players with handicap 2
- Fuego won 0/2 against 6D players with handicap 2
And yesterday MoGoTW won one game with handicap 2.5!
71. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
72. Bandits
We have seen UCB: choose the action with maximal score
Q(action, state) = empirical_reward(action, state) + sqrt( log(nbSims(state)) / nbSims(action, state) )
EXP3 is another bandit:
for adversarial cases;
based on a stochastic formula.
73. EXP3 in one slide
Grigoriadis et al, Auer et al, Audibert & Bubeck, COLT 2009
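The formula itself did not survive extraction, so here is a standard textbook EXP3 sketch (an assumption: the speaker's exact variant may differ):

import math, random

def exp3(n_arms, reward, T, gamma=0.1):
    # EXP3: sample an arm from a mixture of the weight distribution and the
    # uniform distribution, then apply an importance-weighted multiplicative
    # update to the pulled arm only. reward(arm, t) must lie in [0, 1].
    weights = [1.0] * n_arms
    for t in range(T):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        x = reward(arm, t)
        weights[arm] *= math.exp(gamma * x / (probs[arm] * n_arms))
        m = max(weights)
        weights = [w / m for w in weights]   # renormalize for numerical safety
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]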
74. MCTS for simultaneous actions
[Figure: a game tree alternating "Player 1 plays", "Player 2 plays" and "Both players play" nodes.]
75. MCTS for simultaneous actions
Flory, Teytaud, Evostar 2011
[Figure: the same tree; "Player 1 plays" = maxUCB node, "Player 2 plays" = minUCB node, "Both players play" = EXP3 node.]
76. MCTS for hidden information
[Figure: for each player, the tree is organized into observation sets (1, 2, 3), with one EXP3 node per observation set.]
77. MCTS for hidden information
[Same figure.] "Observation set" = set of sequences of observations.
78. MCTS for hidden information
[Same figure.] Here, the possible sequences of observations are partitioned into 3 observation sets per player.
79. MCTS for hidden information
[Same figure.] Thanks Martin.
(Incrementally + application to phantom tic-tac-toe: see D. Auger, 2011.)
80. MCTS for hidden information
[Same figure.] Use EXP3: consistent even in an adversarial setting.
(Incrementally + application to phantom tic-tac-toe: see D. Auger, 2010.)
81. MCTS with hidden information
While (there is time for thinking)
{
  s = initial state
  os1 = (); os2 = ()                           // observation sequences, initially empty
  while (s not terminal)
  {
    b1 = bandit1(os1); b2 = bandit2(os2)       // one bandit (EXP3) per observation sequence
    d1 = b1.makeDecision; d2 = b2.makeDecision // simultaneous decisions
    (s, o1, o2) = transition(s, d1, d2)        // next state + one observation per player
    os1 = os1.o1; os2 = os2.o2                 // append the new observations
  }
  send the reward to all bandits used in the simulation
}
90. MCTS with hidden information: incremental version
[Same pseudocode as above.] Possibly refine the family of bandits (i.e. the partition into observation sets) as the search proceeds.
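A runnable toy version of this simulation loop, assuming a one-step matching-pennies game and one EXP3 bandit per observation sequence (all game specifics here are illustrative assumptions):

import math, random
from collections import defaultdict

ACTIONS = [0, 1]
GAMMA = 0.2

class Exp3:
    def __init__(self):
        self.w = [1.0] * len(ACTIONS)
    def probs(self):
        s = sum(self.w)
        return [(1 - GAMMA) * x / s + GAMMA / len(ACTIONS) for x in self.w]
    def decide(self):
        return random.choices(ACTIONS, weights=self.probs())[0]
    def update(self, a, reward):             # importance-weighted update
        self.w[a] *= math.exp(GAMMA * reward / (self.probs()[a] * len(ACTIONS)))
        m = max(self.w)
        self.w = [x / m for x in self.w]     # renormalize (numerical safety)

bandit1 = defaultdict(Exp3)                  # one bandit per observation sequence
bandit2 = defaultdict(Exp3)

def transition(s, d1, d2):                   # assumed one-step matching pennies
    return ("terminal", 1.0 if d1 == d2 else 0.0), str(d2), str(d1)

for _ in range(10000):                       # "while there is time for thinking"
    s, os1, os2, used = ("start", None), "", "", []
    while s[0] != "terminal":                # "while s not terminal"
        b1, b2 = bandit1[os1], bandit2[os2]
        d1, d2 = b1.decide(), b2.decide()    # simultaneous decisions
        s, o1, o2 = transition(s, d1, d2)
        used.append((b1, d1, b2, d2))
        os1, os2 = os1 + o1, os2 + o2        # append the new observations
    r = s[1]                                 # reward for player 1
    for b1, d1, b2, d2 in used:              # send the reward to all bandits used
        b1.update(d1, r)
        b2.update(d2, 1.0 - r)               # zero-sum opponent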
92. Let's have fun with Urban Rivals (4 cards)
Each player has:
- four cards (each one can be used once)
- 12 pilz (each one can be used once)
- 12 life points
Each card has:
- one attack level
- one damage
- special effects (forget them for the moment)
Four turns:
- P1 attacks P2
- P2 attacks P1
- P1 attacks P2
- P2 attacks P1
93. Let's have fun with Urban Rivals
First, the attacker plays:
- chooses a card
- chooses (PRIVATELY) a number of pilz
Attack level = attack(card) x (1 + nb of pilz)
Then, the defender plays:
- chooses a card
- chooses a number of pilz
Defense level = attack(card) x (1 + nb of pilz)
Result:
If attack > defense, the defender loses Power(attacker's card);
else, the attacker loses Power(defender's card).
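A small sketch of one attack resolution as described above (field names are assumptions; "power" plays the role of the card's damage):

def resolve(attacker_card, atk_pilz, defender_card, def_pilz):
    # One attack resolution: compare pilz-boosted attack levels,
    # then the loser pays the winner's card power in life points.
    attack = attacker_card["attack"] * (1 + atk_pilz)
    defense = defender_card["attack"] * (1 + def_pilz)
    if attack > defense:
        return ("defender", attacker_card["power"])   # defender loses life points
    return ("attacker", defender_card["power"])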
94. Let's have fun with Urban Rivals
==> The MCTS-based AI is now at the best human level.
Experimental (only) remarks on EXP3:
- discarding strategies with a small number of sims = better approximation of the Nash equilibrium
- also an improvement from taking the other bandit into account
- not yet compared to INF
- virtual simulations (inspired by Kummer)
95. Let's have fun with a nice application
We are now at the best human level in Urban Rivals.
96. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weaknesses
Games as benchmarks?
104. Game of Go: counting territories
(white gets a 7.5-point "bonus" (the komi), as black starts)
105. Game of Go: the rules
Black plays at the blue circle: the white group dies (it is removed).
It's impossible to kill white (two "eyes").
"Ko" rule: we don't come back to the same situation.
(without ko: "PSPACE-hard"; with ko: "EXPTIME-complete")
At the end, we count territories
==> black starts, so +7.5 for white.
108. Key point in Go: there are human-easy situations which are computer-hard
We'll see much easier situations that are poorly understood.
(komi 7.5)
109. Difficult for computers (win for black, playing A)
We'll see much easier situations that are poorly understood.
(komi 7.5)
But let's see an easier case.
110. A trivial semeai
Plenty of equivalent situations!
They are randomly sampled, with no generalization.
50% estimated win probability!
121. It does not work. Why?
50% estimated win probability!
In the first node:
The first simulations give ~ 50%.
The next simulations go to 100% or 0% (depending on the chosen move).
But, then, we switch to another node.
(~ 8! x 8! such nodes)
122. And the humans?
50% estimated win probability!
In the first node:
The first simulations give ~ 50%.
The next simulations go to 100% or 0% (depending on the chosen move).
But, then, we DON'T switch to another node.
135. What else? Games with simultaneous actions or hidden information
Flory, Teytaud, Evostar 2011
Games with hidden information.
Games with simultaneous actions.
UrbanRivals = internet card game; 11 million registered users.
A game with hidden information.
138. “Real” games
Assumption: if a computer understands and guesses spins, then this robot will be efficient at something other than just games.
(holds true for Go)
139. “Real” games
[Same assumption as above, illustrated with a "VS" picture.]
141. When is MCTS relevant?
Robust in the face of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.
142. When is MCTS relevant?
Robust in the face of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.
More difficult for:
High values of H (the horizon);
Highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree Search; see Cazenave et al.);
Lack of a reasonable baseline for the MC part.
143. When is MCTS relevant?
Robust in the face of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.
More difficult for:
High values of H;
Highly unobservable cases;
Lack of a reasonable baseline for the MC part.
Go: H ~ 300; dimension = 361; fully observable; fully delayed reward.
144. When is MCTS relevant?
How to apply it:
Implement the transition (a function action x state → state).
Design a Monte-Carlo part (a random simulation): a heuristic in one-player games; difficult with two opponents.
==> at this point you can simulate...
Implement UCT (just a bias in the simulator; no real optimizer).
145. When is MCTS relevant?
How to apply it:
Implement the transition (a function action x state → state).
Design a Monte-Carlo part (a random simulation): a heuristic in one-player games; difficult with two opponents.
==> at this point you can simulate...
Implement UCT (just a bias in the simulator; no real optimizer).
Possibly add:
RAVE values (Gelly et al)
Parallelization: multicore + MPI (Cazenave et al, Gelly et al)
Decisive moves + anti-decisive moves (Teytaud et al)
Patterns (Bouzy et al)
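A skeleton of this recipe (hedged: the interface is an assumed placeholder; plug in your own game or control problem):

import random

class Problem:                                   # assumed placeholder interface
    def initial_state(self): ...
    def actions(self, state): ...
    def transition(self, state, action): ...     # step 1: action x state -> state
    def is_terminal(self, state): ...
    def reward(self, state): ...

def monte_carlo_part(problem, state):
    # step 2: a random simulation to the end (a heuristic playout in
    # one-player games); with this alone you can already simulate.
    while not problem.is_terminal(state):
        state = problem.transition(state, random.choice(problem.actions(state)))
    return problem.reward(state)

# step 3: bias this simulator near the root with UCT statistics, as in the
# UCT sketch after slide 23; then possibly add RAVE, parallelization,
# (anti-)decisive moves and patterns as listed above.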
146. Advantages of MCTS: easy + visible
Many indicators (not only the expectation; simulation-based; visible; easy to check).
The algorithm is indeed simpler than DP (unless deeply optimized, as for Go competitions...).
Anytime (you stop when you want).
148. Drawbacks of MCTS
A recent method.
The impact of H is not clearly known (?).
No free lunch: a model of the transition / uncertainties is required (but, as an advantage: no constraints on it) (see however Fonteneau et al, model-free MC).
149. Conclusion
Essentially asymptotically proved only.
Empirically good for:
The game of Go
Some other (difficult) games
Non-linear expensive optimization
Active learning
Tested industrially (Spiral library, architecture-specific).
There are understood (but not solved) weaknesses.
Next challenges:
Solve these weaknesses (introducing learning? refutation tables? Drake et al)
More industrial applications
Partially observable cases (Cazenave et al, Rolet et al, Auger)
Large H: truncating (Lorentz)
Scalability (Doghmen et al)