Board Games in Academia 2010
@article{gelly:hal-00695370,
hal_id = {hal-00695370},
url = {http://hal.inria.fr/hal-00695370},
title = {{The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions}},
author = {Gelly, Sylvain and Kocsis, Levente and Schoenauer, Marc and Sebag, Mich{\`e}le and Silver, David and Szepesvari, Csaba and Teytaud, Olivier},
abstract = {{The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.}},
language = {English},
affiliation = {TAO - INRIA Saclay - Ile de France, Laboratoire de Recherche en Informatique - LRI, LPDS, Microsoft Research - Inria Joint Centre - MSR-INRIA, University of Alberta, Canada, Department of Computing Science},
publisher = {ACM},
pages = {106-113},
journal = {Communications of the ACM},
volume = {55},
number = {3},
audience = {international},
year = {2012},
pdf = {http://hal.inria.fr/hal-00695370/PDF/CACM-MCTS.pdf},
}
Artificial intelligence and the game of Go
1. Bandit-based Monte-Carlo planning: the game
of Go and beyond
The game of Go: recent
progress for an old game
Olivier.Teytaud@inria.fr + too many people to all be cited. Includes Inria, Cnrs, Univ.
Paris-Sud, LRI, CMAP, Univ. Amsterdam, Taiwan universities (including NUTN)
TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal Network of Excellence.
Paris,
April 2010.
2. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
4. History: I'm not an expert
Origins: I don't know. Too many dates in
the literature. Does anyone know?
8th century: the game of Go in Japan?
9th century: symmetric game?
16th century: first schools?
Recently:
huge progress thanks to cultural differences in
teaching?
becomes known in Europe (cf. interest in Asian
cultures and Hikaru no Go)
5. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
6. Rules
Only recently
formalized in a
mathematical
sense.
For some rules:
winner not always clearly defined (comment
by a strong Japanese friend: in Asian
cultures this is not so important).
Recently: “komi” modified, superko adapted
so that there is no draw.
Time settings get smaller and smaller (TV +
younger people)
10. Game of Go: the rules
Black plays at the blue circle: the
white group dies (it is removed)
It's impossible to kill white (two “eyes”).
“Ko” rules: we may not return to the same position.
At the end, we count territories
==> black starts, so +7.5 for white.
18. Game of Go: counting territories
(white has 7.5 “bonus” as black starts)
19. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
20. Introduction to games
Partially or fully observable
Randomized or not
Iterated or not
1,2,3,... players
Decentralized or not
Continuous or not
Infinite time or not
21. Complexity measures
(not always well defined)
Combinatorial measures:
State-space complexity
Game-tree size
Decision complexity measures:
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity measures
State of the art level
22. Complexity measures
(not always well defined)
State-space complexity = number
of possible states
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
23. Complexity measures
(not always well defined)
State-space complexity
Game-tree size = number of leaves
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
24. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity = min # of
leaves of a tree showing perfect play
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
25. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity = # of leaves
for perfect play with constant depth
Computational complexity
Perfect-play complexity
State of the art level
26. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity (=
complexity classes, later)
Perfect-play complexity
State of the art level
27. Computational complexity:
main reasons for this measure?
Good feeling of understanding
(disagree if you want :-) )
Explicit families of problems
(extracted by reduction)
Fun
Connections
with classical complexity measures
Much better for looking clever
(when you speak about NP-complete
problems you look clever)
29. Computational complexity
Given a class X, a problem q can be
in X
or harder than problems in X (X-hard)
or both (X-complete)
or neither
(diagram: NP, NP-hard, NP-complete)
30. Computational complexity
For evaluating the complexity of your
problem:
1. Generalize your game to any size
(non-trivial for chess)
2. Consider the decision problem:
- here is a board
- is the situation a win under perfect play?
(diagram: NP, NP-hard, NP-complete)
31. Computational complexity
==> cast into a decision problem (binary question)
==> can be used for choosing the optimal move
(though not necessarily)
==> trivial games can be EXPTIME-hard
==> no clear correlation with the fact that a game is difficult
for a computer (when compared to humans)
(diagram: NP, NP-hard, NP-complete)
33. PSPACE vs EXPTIME
==> many important games are either PSPACE or EXPTIME
Theorem: If playing = filling a location
for eternity, then it is PSPACE.
(not necessarily PSPACE-complete!)
Proof: Depth-first search.
Applications: Hex, Havannah, Tic-Tac-Toe,
Ponnuki-Go...
35. NP / PSPACE / EXPTIME in Go
Tsumegos with no ko, forced moves only for
W, 2 moves for B, polynomial length: NP-complete
Ponnuki-Go: PSPACE
Go without ko: PSPACE-hard
Go with ko + Japanese rules:
EXPTIME-complete
Go with ko + superko: unknown
Some phantom-rengo undecidable?
If Go with ko > Go without ko, then
PSPACE ≠ EXPTIME
36. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity (complexity
of perfect algorithm)
State of the art level
37. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
38. State of the art level
Very weak solving
Means that we know who should win
Typically proved by strategy-stealing
E.g.: hex (first player wins), hex + swap
(second player wins)
Weak solving
Strong solving
Best results so far
39. State of the art level
Very weak solving
Weak solving
Perfect play reached with reasonable computation
time
Biggest success: draughts (tens of years of
computation on tens of machines)
Strong solving
Best results so far
40. State of the art level
Very weak solving
Weak solving
Strong solving
Perfect play from any situation in
reasonable time (variants of Tic-Tac-Toe)
Best results so far
41. State of the art level
Very weak solving
Weak solving
Strong solving
Best results so far
Shi-Fu-Mi (rock-paper-scissors): humans lose
English draughts: humans + machines reach perfect
play
Chess: nobody can compete with machines
Ponnuki-Go: some variants solved
9x9 Go: MoGoTW won with the disadvantageous side
against a top player
42. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
43. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
44. Monte-Carlo Tree Search
Monte-Carlo Tree Search (MCTS) appeared
in games.
Its most well-known variant is termed Upper
Confidence Trees (UCT).
Here I present UCT:
Bandits;
Monte-Carlo approach for tree-search;
UCT.
45. A ``bandit'' problem
p_1, ..., p_N: unknown probabilities in [0,1]
At each time step i ∈ {1,...,n}:
choose u_i ∈ {1,...,N} (as a function of u_j and r_j, j<i)
with probability p_{u_i}: win (r_i = 1), otherwise lose (r_i = 0)
46. A ``bandit'' problem: the target
p_1, ..., p_N: unknown probabilities in [0,1]
At each time step i ∈ {1,...,n}:
choose u_i ∈ {1,...,N} (as a function of u_j and r_j, j<i)
with probability p_{u_i}: win (r_i = 1), otherwise lose (r_i = 0)
Regret: R_n = n · max_i p_i − ∑_{j≤n} r_j
How to minimize the regret (worst case over p)?
47. Bandits – a classical solution
Regret: R_n = n · max_i p_i − ∑_{j≤n} r_j
UCB1: choose u maximizing the compromise:
empirical average of rewards for decision u
+ √( log(i) / number of trials with decision u )
==> optimal regret O(log(n))
(Lai et al.; Auer et al.)
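The UCB1 rule above can be sketched in a few lines of Python; the Bernoulli arm probabilities, horizon, and seed below are illustrative assumptions, not from the slides.

```python
import math
import random

def ucb1(n_plays, wins, counts):
    """Pick the arm maximizing: empirical mean + sqrt(log(t) / count)."""
    best, best_score = 0, -1.0
    for u in range(len(counts)):
        if counts[u] == 0:
            return u  # play each arm once before using the formula
        score = wins[u] / counts[u] + math.sqrt(math.log(n_plays) / counts[u])
        if score > best_score:
            best, best_score = u, score
    return best

# Toy run against hidden Bernoulli arms (hypothetical probabilities):
p = [0.2, 0.5, 0.8]
wins, counts = [0, 0, 0], [0, 0, 0]
random.seed(0)
for t in range(1, 2001):
    u = ucb1(t, wins, counts)
    r = 1 if random.random() < p[u] else 0
    wins[u] += r
    counts[u] += 1
# The best arm (index 2) receives most of the trials;
# suboptimal arms are pulled only O(log n) times.
```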
48. Infinite bandits: progressive
widening
UCB1: choose u maximizing the compromise:
empirical average of rewards for decision u
+ √( log(i) / number of trials with decision u )
==> but take the argmax only over the first ⌈i^α⌉ arms,
with α ∈ [0.25, 0.5]
(Coulom; Chaslot et al.; Wang et al.)
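Progressive widening as described here can be sketched as follows; the function name and the default α = 0.4 are my own illustrative choices, with α taken inside the slide's range [0.25, 0.5].

```python
import math

def progressive_widening_candidates(t, n_arms, alpha=0.4):
    """With (possibly infinitely) many arms, only the first ceil(t^alpha)
    arms are considered in the argmax at time t; alpha in [0.25, 0.5]."""
    return range(min(n_arms, max(1, math.ceil(t ** alpha))))

# At t = 16 with alpha = 0.5, only the first 4 arms enter the argmax;
# the candidate set grows slowly as more simulations accumulate.
```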
49. Bandits: much more
What is a bandit:
- a criterion (here the regret)
defines the problem
- usually a score (typically
exploration + exploitation)
defines an algorithm
==> a score that is optimal for one criterion is not optimal
for another ==> a wide literature
50. Bandits and trees
- we have seen the
definition of discrete
time control problems;
- we have seen what bandits are
- we now introduce trees and UCT
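As a concrete illustration before the Go slides, here is a minimal UCT sketch in Python on a toy take-away game (take 1 or 2 stones, the player taking the last stone wins). The game, the √2 exploration constant, and all names are illustrative assumptions, not MoGo's implementation.

```python
import math
import random

class Node:
    def __init__(self, stones, player):
        self.stones, self.player = stones, player  # player: whose turn it is
        self.children = {}                         # move -> Node
        self.wins, self.visits = 0.0, 0

def moves(stones):
    return [m for m in (1, 2) if m <= stones]

def rollout(stones, player):
    # Random playout to the end; returns the winner (0 or 1).
    while True:
        stones -= random.choice(moves(stones))
        if stones == 0:
            return player
        player = 1 - player

def uct_select(node):
    # UCB over children; a child's stats are from the viewpoint of the
    # player who moved INTO it, i.e. node.player.
    logn = math.log(node.visits)
    return max(node.children.items(),
               key=lambda kv: kv[1].wins / kv[1].visits
                              + math.sqrt(2 * logn / kv[1].visits))

def simulate(root):
    # One UCT iteration: select, expand, rollout, backpropagate.
    path, node = [root], root
    while node.stones > 0:
        untried = [m for m in moves(node.stones) if m not in node.children]
        if untried:
            m = random.choice(untried)
            node.children[m] = Node(node.stones - m, 1 - node.player)
            node = node.children[m]
            path.append(node)
            break
        _, node = uct_select(node)
        path.append(node)
    # Terminal: the player to move on an empty board has lost.
    winner = (1 - node.player) if node.stones == 0 \
        else rollout(node.stones, node.player)
    for n in path:
        n.visits += 1
        if winner == 1 - n.player:
            n.wins += 1

def best_move(stones, iters=3000):
    root = Node(stones, 0)
    for _ in range(iters):
        simulate(root)
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

random.seed(1)
best = best_move(4)  # in this game, leaving a multiple of 3 is winning
```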
62. Go: from 29 to 6 stones
(formula for simulation)
Asymptotically optimal move.
But the whole tree is visited infinitely often!
What is used in the implementations that work?
63. Go: from 29 to 6 stones
(formula for simulation)
64. Go: from 29 to 6 stones
(formula for simulation)
Not consistent! Sometimes:
- a good move might have 0/1
- a bad move 1/(N-1) after N simulations
==> we only simulate the bad move!
65. Go: from 29 to 6 stones
(formula for simulation)
Other (better) estimates,
but still inconsistent.
66. Go: from 29 to 6 stones
(formula for simulation)
argmax (nbWins + 1) / (nbLosses + 2)
==> consistency
==> frugality
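The selection formula on this slide is directly implementable; the `choose` helper and its `stats` container below are hypothetical illustrations of how it would be used.

```python
def score(wins, losses):
    # The slide's selection formula: (nbWins + 1) / (nbLosses + 2).
    return (wins + 1) / (losses + 2)

def choose(stats):
    # stats: move -> (wins, losses); hypothetical container.
    return max(stats, key=lambda m: score(*stats[m]))

# An unexplored move scores 1/2, while a move with 0 wins / 1 loss scores
# 1/3: unexplored moves are preferred over once-refuted ones, but a move
# with a good record dominates both.
```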
67. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weakness
Games as benchmarks ?
70. Why is UCT suboptimal for games?
(boring version)
71. Why is UCT suboptimal for games?
(clear version)
Monte-Carlo Tree Search, under mild
conditions on games (including
deterministic two-player zero-sum
games), can be
consistent (→ finds the best move);
frugal (if there is a good move, it does not
visit all of the tree infinitely often).
72. Why is UCT suboptimal for games?
(clear version)
Frugal algorithms: folklore results (many
people implement “frugal” MCTS).
However, these algorithms are (usually)
not consistent.
What is new is
sufficient
conditions for
consistency + frugality.
73. Extensions
``Standard'' UCT:
score(situation,move) = compromise (in [0,1+] )
between
a) empirical quality
P ( win | nextMove(situation) = move )
estimated in simulations
b) exploration term
(UCT is not fundamental in Go)
Remarks:
1) No offline learning
2) No learning from one situation to another
3) No expert rules
74. Extension 1: offline learning
score(situation,move) = compromise between
a) empirical quality
b) exploration term
c) offline value (Chaslot et al, Coulom) =
empirical estimate P ( played | pattern )
for patterns with big support
==> estimated on a database
At first, (c) is the most important; later, (a) dominates.
75. Extensions
``Standard'' UCT:
score(situation,move) = compromise between
a) empirical quality
b) exploration term
Remarks:
1) No offline learning
2) No learning from one situation to another
3) No expert rules
76. Extension 2: transient values
score(situation,move) = compromise between
a) empirical quality
P' ( win | nextMove(situation) = move )
estimated in simulations
b) exploration term
c) offline value
d) ``transient'' value: (Gelly et al, 07)
P' (win | same player plays “move” later)
==> brings information from node N to ancestor node M
==> does not bring information from node N to
descendants or cousins (many people have tried)
77. Extensions
``Standard'' UCT:
score(situation,move) = compromise between
a) empirical quality
b) exploration term
Remarks:
1) No offline learning
2) No learning from one situation to another
3) No expert rules
78. Extension 3: expert rules
score(situation,move) = compromise between
a) empirical quality
b) exploration term
c) offline value
d) transient value
e) expert rules
==> empirically derived linear combination
Most important terms over time:
(e)+(c) first,
then (d) becomes stronger,
finally (a) only.
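A hedged sketch of such a combined score follows; the weights and decay schedules (50, 10, 0.1) are illustrative placeholders, not MoGo's actual tuned formula.

```python
import math

def combined_score(n_parent, n_move, q_empirical, q_rave, q_offline, q_expert):
    """Combined MCTS score in the spirit of slide 78; all constants are
    illustrative assumptions."""
    if n_move == 0:
        # Before any simulation, only offline (c) and expert (e) terms exist.
        return q_offline + q_expert
    beta = 50.0 / (50.0 + n_move)    # weight of the transient/RAVE term (d)
    gamma = 10.0 / (10.0 + n_move)   # weight of offline (c) + expert (e)
    exploration = math.sqrt(math.log(n_parent + 1) / n_move)  # term (b)
    return ((1 - beta) * q_empirical          # (a) dominates eventually
            + beta * q_rave                   # (d)
            + gamma * (q_offline + q_expert)  # (c) + (e) fade out
            + 0.1 * exploration)              # (b)
```

With many simulations the empirical term (a) dominates, reproducing the schedule on the slide: (e)+(c) first, then (d), finally (a).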
79. Go: from 29 to 6 stones
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
==> still 6 stones at least!
81. A trivial semeai
Plenty of equivalent
situations!
They are randomly
sampled, with
no generalization.
50% of estimated
win probability!
92. It does not work. Why?
50% of estimated
win probability!
In the first node:
The first simulations give ~ 50%
The next simulations go to 100% or 0% (depending
on the chosen move)
But, then, we switch to another node
(~ 8! x 8! such nodes)
93. And the humans?
50% of estimated
win probability!
In the first node:
The first simulations give ~ 50%
The next simulations go to 100% or 0% (depending
on the chosen move)
But, then, we DON'T switch to another node
98. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
99. What is high-dimensional
discrete time control?
There are time steps: 0, 1, 2, ..., H.
There are states and transitions:
x_{i+1} = f(x_i, d_i)
d_i is the decision at time step i:
d_i = u(x_i)
There is a cost:
C = C(x_H)
==> We look for u(.) such that C is as
small as possible.
100. High-dimensional discrete time
control
x_{i+1} = f(x_i, d_i)
d_i = u(x_i)
C = C(x_H)
==> We look for u(.) such that C is as
small as possible.
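A minimal sketch of this control loop, with an illustrative toy choice of f, u and C (a stock level steered toward a target); none of these functions come from the slides.

```python
H = 10  # horizon

def f(x, d):
    # Transition: the decision adds or removes stock.
    return x + d

def u(x):
    # A simple feedback policy: move toward the target level 5.
    return 1 if x < 5 else (-1 if x > 5 else 0)

def C(x):
    # Terminal cost: distance from the target.
    return abs(x - 5)

x = 0
for i in range(H):
    x = f(x, u(x))
cost = C(x)
# Starting from 0, the policy reaches the target within H steps, so cost == 0.
```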
101. Discrete time + high dimension
+ uncertainty
x_{i+1} = f(x_i, d_i, A_i), d_i = u(x_i)
C = C(x_H)
A_i might be:
- a Markov model
- an opponent: A_i maximizes C (min-max)
==> we look for u(.) such that C is as small
as possible (e.g. on average).
102. Summary
High dimensional discrete time control is an
important problem
Many problems have no satisfactory
solution.
A new approach: Bandit-Based Monte-Carlo
Planning
103. High-dimensional discrete time
control
A main application: the management of
many energy stocks in the face of randomness
At each time step we see random outcomes
We have to take decisions
After H time steps, we observe a cost
104. What are the approaches ?
Dynamic programming (Massé – Bellman 50's)
(still the main approach in industry)
Reinforcement learning (some promising results,
less used in industry)
Some tree exploration tools (less usual in
stochastic or continuous cases)
Bandit-Based Monte-Carlo planning (MCTS/UCT)
105. What are the approaches ?
Dynamic programming (Massé – Bellman 50's)
(still the main approach in industry)
Where we are:
Done: Presentation of the problem.
Now: We briefly present dynamic
programming
Thereafter: We present MCTS / UCT.
106. Dynamic programming
V(x) = expectation of C(x_H) under the optimal
strategy (well defined).
u(x) such that the expectation of
V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all states x with horizon H.
107. Dynamic programming
V(x) = expectation of C(x_H) under the optimal
strategy (well defined).
u(x) such that the expectation of
V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all states x with horizon H.
We compute V for all states x with horizon H-1.
108. Dynamic programming
V(x) = expectation of C(x_H) under the optimal
strategy (well defined).
u(x) such that the expectation of
V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all states x with horizon H.
We compute V for all states x with horizon H-1.
... ... ...
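The backward recursion above can be sketched on a tiny example; the chain of states, the disturbance model, and the costs below are illustrative assumptions, not an industrial stock problem.

```python
# States 0..4, decisions -1/0/+1 (clipped to the chain), random disturbance
# A in {-1, 0} with equal probability, terminal cost C(x) = |x - 4|.
STATES = range(5)
DECISIONS = (-1, 0, 1)
H = 6  # horizon

def f(x, d, a):
    # Transition, clipped to the state space.
    return min(4, max(0, x + d + a))

def C(x):
    return abs(x - 4)

# V[h][x]: expected cost-to-go from state x at time h under the optimal
# strategy, computed backward exactly as on slides 106-108.
V = [None] * (H + 1)
V[H] = {x: C(x) for x in STATES}      # V for horizon H first ...
for h in range(H - 1, -1, -1):        # ... then H-1, H-2, ..., 0
    V[h] = {x: min(0.5 * V[h + 1][f(x, d, -1)] + 0.5 * V[h + 1][f(x, d, 0)]
                   for d in DECISIONS)
            for x in STATES}
```

Here the optimal decision is always d = +1 (drift toward the target state 4), and V[0][0] is strictly smaller than the terminal cost C(0), reflecting what the controller can recover over 6 steps.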
109. Extensions
Approximate dynamic programming (e.g.
for continuous domains)
Reinforcement learning
Case where f(...) or A is black-box
Huge state spaces
==> but lack of stability
...
==> there is room for improvements
110. Conclusion : games = great for
artificial intelligence
Very difficult
for computers.
114. “Real” games
Assumption: if a computer understands and guesses spins, then
this robot will be efficient at something other than just games.
(holds true for Go)
Frédéric Lemoine MIG 11/07/2008 114
115. “Real” games
Assumption: if a computer understands and guesses spins, then
this robot will be efficient at something other than just games.
VS
117. Conclusion
Essentially proved only asymptotically
Empirically good for
The game of Go
Some other games
Non-linear expensive optimization
Active learning
Not (yet) tested industrially
Understood weaknesses: plenty of very similar nodes!
Next challenge:
Solve these weaknesses
Industrial applications
Partially observable cases: cf. Cazenave, Rolet
118. Biblio
Bandits: Lai, Robbins, Auer, Cesa-Bianchi...
UCT: Kocsis, Szepesvari, Coquelin, Munos...
MCTS (Go): Coulom, Chaslot, Fiter, Gelly, Hoock, Silver, Muller,
Pérez, Rimmel, Wang...
Tree + DP for industrial applications: Péret, Garcia...
Bandits with infinitely many arms:
Audibert, Coulom, Munos, Wang...
Applications far from Go: Rolet,
Teytaud (F), Rimmel, De Mesmay
...
Links with “macro-actions” ?
Parallelization, mixing with offline
learning, bias...
119. Contributors
Paul Veyssière, Vincent Berthier,
Amine Bourki, Hassen Doghmen,
Matthieu Coulm
Univ. Taiwan, Univ. Paris