Board Games in Academia 2010
@article{gelly:hal-00695370,
hal_id = {hal-00695370},
url = {http://hal.inria.fr/hal-00695370},
title = {{The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions}},
author = {Gelly, Sylvain and Kocsis, Levente and Schoenauer, Marc and Sebag, Mich{\`e}le and Silver, David and Szepesvari, Csaba and Teytaud, Olivier},
abstract = {{The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.}},
language = {English},
affiliation = {TAO - INRIA Saclay - Ile de France, Laboratoire de Recherche en Informatique - LRI, LPDS, Microsoft Research - Inria Joint Centre - MSR-INRIA, University of Alberta, Canada, Department of Computing Science},
publisher = {ACM},
pages = {106-113},
journal = {Communications of the ACM},
volume = {55},
number = {3},
audience = {international},
year = {2012},
pdf = {http://hal.inria.fr/hal-00695370/PDF/CACM-MCTS.pdf},
}
Artificial intelligence and the game of Go
1. Bandit-based Monte-Carlo planning: the game
of Go and beyond
The game of Go: recent
progress for an old game
Olivier.Teytaud@inria.fr + too many people to all be cited. Includes Inria, Cnrs, Univ.
Paris-Sud, LRI, CMAP, Univ. Amsterdam, Taiwan universities (including NUTN)
TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal Network of Excellence.
Paris,
April 2010.
2. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
4. History: I'm not an expert
Origins: I don't know. Too many dates in
the literature. Does anyone know?
8th century: the game of Go in Japan?
9th century: symmetric game?
16th century: first schools?
Recently:
huge progress thanks to cultural differences in
teaching?
becomes known in Europe (cf. interest in Asian
cultures and Hikaru no Go)
5. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
6. Rules
Only recently
formalized in a
mathematical
sense.
For some rules:
winner not always clearly defined (comment
by a strong Japanese friend: in Asian
cultures this is not so important).
Recently: “komi” modified, superko adapted
so that there is no draw.
Time settings get smaller and smaller (TV +
younger people)
10. Game of Go: the rules
Black plays at the blue circle: the
white group dies (it is removed)
It's impossible to kill white (two “eyes”).
“Ko” rules: we may not return to the same position.
At the end, we count territories
==> black starts, so +7.5 for white.
18. Game of Go: counting territories
(white has 7.5 “bonus” as black starts)
19. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
20. Introduction to games
Partially or fully observable
Randomized or not
Iterated or not
1,2,3,... players
Decentralized or not
Continuous or not
Infinite time or not
21. Complexity measures
(not always well defined)
Combinatorial measures:
State-space complexity
Game-tree size
Decision complexity measures:
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity measures
State of the art level
22. Complexity measures
(not always well defined)
State-space complexity = number
of possible states
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
23. Complexity measures
(not always well defined)
State-space complexity
Game-tree size = number of leaves
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
24. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity = min # of
leaves of a tree showing perfect play
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
25. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity = # of leaves
for perfect play with constant depth
Computational complexity
Perfect-play complexity
State of the art level
26. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity (=
complexity classes, later)
Perfect-play complexity
State of the art level
27. Computational complexity:
main reasons for this measure?
Good feeling of understanding
(disagree if you want :-) )
Explicit families of problems
(extracted by reduction)
Fun
Connections
with classical complexity measures
Much better for looking clever
(when you speak about NP-complete
problems you look clever)
29. Computational complexity
Given a class X, a problem q can be
in X
or harder than problems in X (X-hard)
or both (X-complete)
or neither
(diagram: NP, NP-hard, NP-complete)
30. Computational complexity
For evaluating the complexity of your
problem:
1. Generalize your game to any size
(non-trivial for chess)
2. Consider the decision problem:
- here is a board
- is the situation a win under perfect play?
(diagram: NP, NP-hard, NP-complete)
31. Computational complexity
==> cast into a decision problem (binary question)
==> can be used for choosing the optimal move
(though not necessarily)
==> trivial games can be EXPTIME-hard
==> no clear correlation with the fact that a game is difficult
for a computer (when compared to humans)
(diagram: NP, NP-hard, NP-complete)
33. PSPACE vs EXPTIME
==> many important games are either PSPACE or EXPTIME
Theorem: If playing = filling a location
for eternity, then it is PSPACE.
(not necessarily PSPACE-complete!)
Proof: Depth-first search.
Applications: Hex, Havannah, Tic-Tac-Toe,
Ponnuki-Go...
35. NP / PSPACE / EXPTIME in Go
Tsumegos with no ko, forced moves only for
W, 2 moves for B, polynomial length: NP-complete
Ponnuki-Go: PSPACE
Go without ko: PSPACE-hard
Go with ko + Japanese rules:
EXPTIME-complete
Go with ko + superko: unknown
Some phantom-rengo undecidable?
If Go with ko > Go without ko, then
PSPACE ≠ EXPTIME
36. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity (complexity
of perfect algorithm)
State of the art level
37. Complexity measures
(not always well defined)
State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
38. State of the art level
Very weak solving
Means that we know who should win
Typically proved by strategy-stealing
E.g.: hex (first player wins), hex + swap
(second player wins)
Weak solving
Strong solving
Best results so far
39. State of the art level
Very weak solving
Weak solving
Perfect play reached with reasonable computation
time
Biggest success: draughts (tens of years of
computation on tens of machines)
Strong solving
Best results so far
40. State of the art level
Very weak solving
Weak solving
Strong solving
Perfect play from any situation in
reasonable time (variants of Tic-Tac-Toe)
Best results so far
41. State of the art level
Very weak solving
Weak solving
Strong solving
Best results so far
Shi-Fu-Mi (rock-paper-scissors): humans lose
English draughts: humans + machines reach perfect
play
Chess: nobody can compete with machines
Ponnuki-Go: some variants solved
9x9 Go: MoGoTW won with the disadvantageous side
against a top player
42. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
43. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
44. Monte-Carlo Tree Search
Monte-Carlo Tree Search (MCTS) appeared
in games.
Its most well-known variant is termed Upper
Confidence Trees (UCT).
Here I present UCT:
Bandits;
Monte-Carlo approach for tree-search;
UCT.
45. A ``bandit'' problem
p_1, ..., p_N: unknown probabilities in [0,1]
At each time step i ∈ {1,...,n}:
choose u_i ∈ {1,...,N} (as a function of u_j and r_j, j<i)
with probability p_{u_i}: win (r_i = 1), otherwise lose (r_i = 0)
46. A ``bandit'' problem: the target
p_1, ..., p_N: unknown probabilities in [0,1]
At each time step i ∈ {1,...,n}:
choose u_i ∈ {1,...,N} (as a function of u_j and r_j, j<i)
with probability p_{u_i}: win (r_i = 1), otherwise lose (r_i = 0)
Regret: R_n = n · max_i p_i − ∑_{j≤n} r_j
How to minimize the regret (worst case over p)?
47. Bandits – a classical solution
Regret: R_n = n · max_i p_i − ∑_{j≤n} r_j
UCB1: choose u maximizing the compromise:
empirical average of rewards for decision u
+ √( log(i) / number of trials with decision u )
==> optimal regret O(log(n))
(Lai et al.; Auer et al.)
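The UCB1 rule above can be sketched in a few lines of Python; the Bernoulli arm probabilities, horizon, and seed below are illustrative assumptions, not from the slides.

```python
import math
import random

def ucb1(n_plays, wins, counts):
    """Pick the arm maximizing: empirical mean + sqrt(log(t) / count)."""
    best, best_score = 0, -1.0
    for u in range(len(counts)):
        if counts[u] == 0:
            return u  # play each arm once before using the formula
        score = wins[u] / counts[u] + math.sqrt(math.log(n_plays) / counts[u])
        if score > best_score:
            best, best_score = u, score
    return best

# Toy run against hidden Bernoulli arms (hypothetical probabilities):
p = [0.2, 0.5, 0.8]
wins, counts = [0, 0, 0], [0, 0, 0]
random.seed(0)
for t in range(1, 2001):
    u = ucb1(t, wins, counts)
    r = 1 if random.random() < p[u] else 0
    wins[u] += r
    counts[u] += 1
# The best arm (index 2) receives most of the trials;
# suboptimal arms are pulled only O(log n) times.
```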
48. Infinite bandits: progressive
widening
UCB1: choose u maximizing the compromise:
empirical average of rewards for decision u
+ √( log(i) / number of trials with decision u )
==> but take the argmax only over the first ⌈i^α⌉ arms,
with α ∈ [0.25, 0.5]
(Coulom; Chaslot et al.; Wang et al.)
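Progressive widening as described here can be sketched as follows; the function name and the default α = 0.4 are my own illustrative choices, with α taken inside the slide's range [0.25, 0.5].

```python
import math

def progressive_widening_candidates(t, n_arms, alpha=0.4):
    """With (possibly infinitely) many arms, only the first ceil(t^alpha)
    arms are considered in the argmax at time t; alpha in [0.25, 0.5]."""
    return range(min(n_arms, max(1, math.ceil(t ** alpha))))

# At t = 16 with alpha = 0.5, only the first 4 arms enter the argmax;
# the candidate set grows slowly as more simulations accumulate.
```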
49. Bandits: much more
What is a bandit:
- a criterion (here the regret)
defines the problem
- usually a score (typically
exploration + exploitation)
defines an algorithm
==> a score that is optimal for one criterion is not optimal
for another ==> a wide literature
50. Bandits and trees
- we have seen the
definition of discrete
time control problems;
- we have seen what bandits are
- we now introduce trees and UCT
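As a concrete illustration before the Go slides, here is a minimal UCT sketch in Python on a toy take-away game (take 1 or 2 stones, the player taking the last stone wins). The game, the √2 exploration constant, and all names are illustrative assumptions, not MoGo's implementation.

```python
import math
import random

class Node:
    def __init__(self, stones, player):
        self.stones, self.player = stones, player  # player: whose turn it is
        self.children = {}                         # move -> Node
        self.wins, self.visits = 0.0, 0

def moves(stones):
    return [m for m in (1, 2) if m <= stones]

def rollout(stones, player):
    # Random playout to the end; returns the winner (0 or 1).
    while True:
        stones -= random.choice(moves(stones))
        if stones == 0:
            return player
        player = 1 - player

def uct_select(node):
    # UCB over children; a child's stats are from the viewpoint of the
    # player who moved INTO it, i.e. node.player.
    logn = math.log(node.visits)
    return max(node.children.items(),
               key=lambda kv: kv[1].wins / kv[1].visits
                              + math.sqrt(2 * logn / kv[1].visits))

def simulate(root):
    # One UCT iteration: select, expand, rollout, backpropagate.
    path, node = [root], root
    while node.stones > 0:
        untried = [m for m in moves(node.stones) if m not in node.children]
        if untried:
            m = random.choice(untried)
            node.children[m] = Node(node.stones - m, 1 - node.player)
            node = node.children[m]
            path.append(node)
            break
        _, node = uct_select(node)
        path.append(node)
    # Terminal: the player to move on an empty board has lost.
    winner = (1 - node.player) if node.stones == 0 \
        else rollout(node.stones, node.player)
    for n in path:
        n.visits += 1
        if winner == 1 - n.player:
            n.wins += 1

def best_move(stones, iters=3000):
    root = Node(stones, 0)
    for _ in range(iters):
        simulate(root)
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

random.seed(1)
best = best_move(4)  # in this game, leaving a multiple of 3 is winning
```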
62. Go: from 29 to 6 stones
(formula for simulation)
Asymptotically optimal move.
But the whole tree is visited infinitely often!
What is used in the implementations that work?
63. Go: from 29 to 6 stones
(formula for simulation)
64. Go: from 29 to 6 stones
(formula for simulation)
Not consistent! Sometimes:
- a good move might have 0/1
- a bad move 1/(N-1) after N simulations
==> we only simulate the bad move!
65. Go: from 29 to 6 stones
(formula for simulation)
Other (better) estimates,
but still inconsistent.
66. Go: from 29 to 6 stones
(formula for simulation)
argmax (nbWins + 1) / (nbLosses + 2)
==> consistency
==> frugality
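The selection formula on this slide is directly implementable; the `choose` helper and its `stats` container below are hypothetical illustrations of how it would be used.

```python
def score(wins, losses):
    # The slide's selection formula: (nbWins + 1) / (nbLosses + 2).
    return (wins + 1) / (losses + 2)

def choose(stats):
    # stats: move -> (wins, losses); hypothetical container.
    return max(stats, key=lambda m: score(*stats[m]))

# An unexplored move scores 1/2, while a move with 0 wins / 1 loss scores
# 1/3: unexplored moves are preferred over once-refuted ones, but a move
# with a good record dominates both.
```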
67. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weakness
Games as benchmarks ?
70. Why is UCT suboptimal for games?
(boring version)
71. Why is UCT suboptimal for games?
(clear version)
Monte-Carlo Tree Search, under mild
conditions on games (including
deterministic two-player zero-sum
games), can be
consistent (→ finds the best move);
frugal (if there is a good move, it does not
visit all of the tree infinitely often).
72. Why is UCT suboptimal for games?
(clear version)
Frugal algorithms: folklore results (many
people implement “frugal” MCTS).
However, these algorithms are (usually)
not consistent.
What is new is
sufficient
conditions for
consistency + frugality.
73. Extensions
``Standard'' UCT:
score(situation,move) = compromise (in [0,1+] )
between
a) empirical quality
P ( win | nextMove(situation) = move )
estimated in simulations
b) exploration term
(UCT is not fundamental in Go)
Remarks:
1) No offline learning
2) No learning from one situation to another
3) No expert rules
74. Extension 1: offline learning
score(situation,move) = compromise between
a) empirical quality
b) exploration term
c) offline value (Chaslot et al, Coulom) =
empirical estimate P ( played | pattern )
for patterns with big support
==> estimated on a database
At first, (c) is the most important; later, (a) dominates.
75. Extensions
``Standard'' UCT:
score(situation,move) = compromise between
a) empirical quality
b) exploration term
Remarks:
1) No offline learning
2) No learning from one situation to another
3) No expert rules
76. Extension 2: transient values
score(situation,move) = compromise between
a) empirical quality
P' ( win | nextMove(situation) = move )
estimated in simulations
b) exploration term
c) offline value
d) ``transient'' value: (Gelly et al, 07)
P' (win | same player plays “move” later)
==> brings information from node N to ancestor node M
==> does not bring information from node N to
descendants or cousins (many people have tried)
77. Extensions
``Standard'' UCT:
score(situation,move) = compromise between
a) empirical quality
b) exploration term
Remarks:
1) No offline learning
2) No learning from one situation to another
3) No expert rules
78. Extension 3: expert rules
score(situation,move) = compromise between
a) empirical quality
b) exploration term
c) offline value
d) transient value
e) expert rules
==> empirically derived linear combination
Most important terms over time:
(e)+(c) first,
then (d) becomes stronger,
finally (a) only.
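A hedged sketch of such a combined score follows; the weights and decay schedules (50, 10, 0.1) are illustrative placeholders, not MoGo's actual tuned formula.

```python
import math

def combined_score(n_parent, n_move, q_empirical, q_rave, q_offline, q_expert):
    """Combined MCTS score in the spirit of slide 78; all constants are
    illustrative assumptions."""
    if n_move == 0:
        # Before any simulation, only offline (c) and expert (e) terms exist.
        return q_offline + q_expert
    beta = 50.0 / (50.0 + n_move)    # weight of the transient/RAVE term (d)
    gamma = 10.0 / (10.0 + n_move)   # weight of offline (c) + expert (e)
    exploration = math.sqrt(math.log(n_parent + 1) / n_move)  # term (b)
    return ((1 - beta) * q_empirical          # (a) dominates eventually
            + beta * q_rave                   # (d)
            + gamma * (q_offline + q_expert)  # (c) + (e) fade out
            + 0.1 * exploration)              # (b)
```

With many simulations the empirical term (a) dominates, reproducing the schedule on the slide: (e)+(c) first, then (d), finally (a).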
79. Go: from 29 to 6 stones
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
==> still 6 stones at least!
81. A trivial semeai
Plenty of equivalent
situations!
They are randomly
sampled, with
no generalization.
50% of estimated
win probability!
92. It does not work. Why?
50% of estimated
win probability!
In the first node:
The first simulations give ~ 50%
The next simulations go to 100% or 0% (depending
on the chosen move)
But, then, we switch to another node
(~ 8! x 8! such nodes)
93. And the humans?
50% of estimated
win probability!
In the first node:
The first simulations give ~ 50%
The next simulations go to 100% or 0% (depending
on the chosen move)
But, then, we DON'T switch to another node
98. Outline: the game of Go
The history
The rules
Variants of go and complexity
Computers playing Go
Go and power plants
99. What is high-dimensional
discrete time control?
There are time steps: 0, 1, 2, ..., H.
There are states and transitions:
x_{i+1} = f(x_i, d_i)
d_i is the decision at time step i:
d_i = u(x_i)
There is a cost:
C = C(x_H)
==> We look for u(.) such that C is as
small as possible.
100. High-dimensional discrete time
control
x_{i+1} = f(x_i, d_i)
d_i = u(x_i)
C = C(x_H)
==> We look for u(.) such that C is as
small as possible.
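A minimal sketch of this control loop, with an illustrative toy choice of f, u and C (a stock level steered toward a target); none of these functions come from the slides.

```python
H = 10  # horizon

def f(x, d):
    # Transition: the decision adds or removes stock.
    return x + d

def u(x):
    # A simple feedback policy: move toward the target level 5.
    return 1 if x < 5 else (-1 if x > 5 else 0)

def C(x):
    # Terminal cost: distance from the target.
    return abs(x - 5)

x = 0
for i in range(H):
    x = f(x, u(x))
cost = C(x)
# Starting from 0, the policy reaches the target within H steps, so cost == 0.
```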
101. Discrete time + high dimension
+ uncertainty
x_{i+1} = f(x_i, d_i, A_i), d_i = u(x_i)
C = C(x_H)
A_i might be:
- a Markov model
- an opponent: A_i maximizes C (min-max)
==> we look for u(.) such that C is as small
as possible (e.g. on average).
102. Summary
High dimensional discrete time control is an
important problem
Many problems have no satisfactory
solution.
A new approach: Bandit-Based Monte-Carlo
Planning
103. High-dimensional discrete time
control
A main application: the management of
many energy stocks in the face of randomness
At each time step we see random outcomes
We have to take decisions
After H time steps, we observe a cost
104. What are the approaches ?
Dynamic programming (Massé – Bellman 50's)
(still the main approach in industry)
Reinforcement learning (some promising results,
less used in industry)
Some tree exploration tools (less usual in
stochastic or continuous cases)
Bandit-Based Monte-Carlo planning (MCTS/UCT)
105. What are the approaches ?
Dynamic programming (Massé – Bellman 50's)
(still the main approach in industry)
Where we are:
Done: Presentation of the problem.
Now: We briefly present dynamic
programming
Thereafter: We present MCTS / UCT.
106. Dynamic programming
V(x) = expectation of C(x_H) under the optimal
strategy (well defined).
u(x) such that the expectation of
V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all states x with horizon H.
107. Dynamic programming
V(x) = expectation of C(x_H) under the optimal
strategy (well defined).
u(x) such that the expectation of
V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all states x with horizon H.
We compute V for all states x with horizon H-1.
108. Dynamic programming
V(x) = expectation of C(x_H) under the optimal
strategy (well defined).
u(x) such that the expectation of
V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all states x with horizon H.
We compute V for all states x with horizon H-1.
... ... ...
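The backward recursion above can be sketched on a tiny example; the chain of states, the disturbance model, and the costs below are illustrative assumptions, not an industrial stock problem.

```python
# States 0..4, decisions -1/0/+1 (clipped to the chain), random disturbance
# A in {-1, 0} with equal probability, terminal cost C(x) = |x - 4|.
STATES = range(5)
DECISIONS = (-1, 0, 1)
H = 6  # horizon

def f(x, d, a):
    # Transition, clipped to the state space.
    return min(4, max(0, x + d + a))

def C(x):
    return abs(x - 4)

# V[h][x]: expected cost-to-go from state x at time h under the optimal
# strategy, computed backward exactly as on slides 106-108.
V = [None] * (H + 1)
V[H] = {x: C(x) for x in STATES}      # V for horizon H first ...
for h in range(H - 1, -1, -1):        # ... then H-1, H-2, ..., 0
    V[h] = {x: min(0.5 * V[h + 1][f(x, d, -1)] + 0.5 * V[h + 1][f(x, d, 0)]
                   for d in DECISIONS)
            for x in STATES}
```

Here the optimal decision is always d = +1 (drift toward the target state 4), and V[0][0] is strictly smaller than the terminal cost C(0), reflecting what the controller can recover over 6 steps.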
109. Extensions
Approximate dynamic programming (e.g.
for continuous domains)
Reinforcement learning
Case where f(...) or A is black-box
Huge state spaces
==> but lack of stability
...
==> there is room for improvements
110. Conclusion : games = great for
artificial intelligence
Very difficult
for computers.
114. “Real” games
Assumption: if a computer understands and guesses spins, then
this robot will be efficient at something other than just games.
(holds true for Go)
Frédéric Lemoine MIG 11/07/2008 114
115. “Real” games
Assumption: if a computer understands and guesses spins, then
this robot will be efficient at something other than just games.
VS
117. Conclusion
Essentially proved only asymptotically
Empirically good for
The game of Go
Some other games
Non-linear expensive optimization
Active learning
Not (yet) tested industrially
Understood weaknesses: plenty of very similar nodes!
Next challenge:
Solve these weaknesses
Industrial applications
Partially observable cases: cf. Cazenave, Rolet
118. Biblio
Bandits: Lai, Robbins, Auer, Cesa-Bianchi...
UCT: Kocsis, Szepesvari, Coquelin, Munos...
MCTS (Go): Coulom, Chaslot, Fiter, Gelly, Hoock, Silver, Muller,
Pérez, Rimmel, Wang...
Tree + DP for industrial applications: Péret, Garcia...
Bandits with infinitely many arms:
Audibert, Coulom, Munos, Wang...
Applications far from Go: Rolet,
Teytaud (F), Rimmel, De Mesmay
...
Links with “macro-actions” ?
Parallelization, mixing with offline
learning, bias...
119. Contributors
Paul Veyssière, Vincent Berthier,
Amine Bourki, Hassen Doghmen,
Matthieu Coulm
Univ. Taiwan, Univ. Paris