A simple tutorial on Monte-Carlo Tree Search
Contains a description of dynamic programming and alpha-beta search, then MCTS. Special cases for simultaneous actions are discussed.
I should add comments so that it can be used without prior knowledge of MCTS; if there is at least one request for this, I will do it.
@article{gelly:hal-00695370,
  hal_id = {hal-00695370},
  url = {http://hal.inria.fr/hal-00695370},
  title = {{The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions}},
  author = {Gelly, Sylvain and Kocsis, Levente and Schoenauer, Marc and Sebag, Mich{\`e}le and Silver, David and Szepesvari, Csaba and Teytaud, Olivier},
  abstract = {{The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.}},
  language = {English},
  affiliation = {TAO - INRIA Saclay - Ile de France, Laboratoire de Recherche en Informatique - LRI, LPDS, Microsoft Research - Inria Joint Centre - MSR - INRIA, University of Alberta, Canada, Department of Computing Science},
  publisher = {ACM},
  pages = {106--113},
  journal = {Communications of the ACM},
  volume = {55},
  number = {3},
  audience = {international},
  year = {2012},
  pdf = {http://hal.inria.fr/hal-00695370/PDF/CACM-MCTS.pdf},
}
1. Bandit-based Monte-Carlo planning: the game of Go and beyond
Designing intelligent agents with Monte-Carlo Tree Search.
Olivier.Teytaud@inria.fr + F. Teytaud + H. Doghmen + others
TAO, Inria-Saclay IDF, CNRS 8623, LRI, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence.
Keywords: UCB, EXP3, MCTS, UCT.
Paris, April 2011.
Games, games with hidden information, games with simultaneous actions.
2. Key point
PLEASE INTERRUPT ME!
HAVE QUESTIONS!
LET'S HAVE A FRIENDLY SESSION!
ASK QUESTIONS NOW AND LATER BY MAIL!
olivier.teytaud@inria.fr
3. Outline
Introduction: games / control / planning.
Standard approaches
Bandit-based Monte-Carlo Planning and UCT.
Application to the game of Go and (far) beyond
7. A game is a directed graph with actions and players and observations
[Figure: a game drawn as a directed graph, with Black and White player nodes, numbered actions, and observations (Bob / Bear / Bee).]
8. A game is a directed graph with actions and players and observations and rewards
[Figure: the same graph, now with rewards (+1, 0). Rewards on leaves only!]
9. A game is a directed graph +actions +players +observations +rewards +loops
[Figure: the same graph, now with loops.]
10. More than games in this formalism
A main application: the management of many energy stocks in the face of randomness.
At each time step we see random outcomes.
We have to make decisions (switching on or off).
We have losses.
(ANR / NSC project)
11. Opening a reservoir produces energy (and water goes to another reservoir)
[Figure: a network of five reservoirs feeding each other, plus classical thermal plants and nuclear plants, all serving the electricity demand; some water is lost.]
12. Outline
Introduction: games / control / planning.
Standard approaches
Bandit-based Monte-Carlo Planning and UCT.
Application to the game of Go and (far) beyond
13. What are the approaches?
Dynamic programming (Massé, Bellman, 50's) (still the main approach in industry) (minimax / alpha-beta in games)
Reinforcement learning (some promising results, less used in industry)
Some tree exploration tools (less usual in stochastic or continuous cases)
Bandit-based Monte-Carlo planning
Scripts + tuning
14. What are the approaches?
Dynamic programming (Massé, Bellman, 50's) (still the main approach in industry)
Where we are:
Done: presentation of the problem.
Now: we briefly present dynamic programming.
Thereafter: we present MCTS / UCT.
15. Dynamic programming
V(x) = expectation of the future loss if the optimal strategy is applied after state x (well defined).
u(x) = the action such that the expectation of V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all the final states x.
16. Dynamic programming
V(x) = expectation of C(x_H) under the optimal strategy (well defined).
u(x) = the action such that the expectation of V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all the final states x.
We compute V for all the "-1" states x (one step before the end).
17. Dynamic programming (DP)
V(x) = expectation of C(x_H) under the optimal strategy (well defined).
u(x) = the action such that the expectation of V(f(x, u(x), A)) is minimal.
Computation by dynamic programming:
We compute V for all the final states x.
We compute V for all the "-1" states x.
... and so on, backwards, until the initial states.
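To make the backward recursion concrete, here is a minimal sketch on a toy problem (a sketch only: the dynamics f, cost C, horizon H, states and probabilities are illustrative assumptions; only the notation follows the slides):

# Minimal backward dynamic programming on a toy problem (assumed data).
states = range(5)
actions = [0, 1]                    # e.g. switching something on or off
outcomes = [(0, 0.5), (1, 0.5)]     # (value of the random A, probability)
H = 3                               # horizon

def f(x, u, a):                     # transition function (toy dynamics)
    return (x + u + a) % 5

def C(x):                           # cost paid at the final state x_H
    return float(x)

# V[t][x] = expected cost from state x at time t under the optimal strategy
V = [[0.0] * len(states) for _ in range(H + 1)]
policy = [[0] * len(states) for _ in range(H)]
for x in states:
    V[H][x] = C(x)                  # first, V on all the final states
for t in range(H - 1, -1, -1):      # then the "-1" states, the "-2" states, ...
    for x in states:
        best_u, best_val = None, float("inf")
        for u in actions:
            val = sum(p * V[t + 1][f(x, u, a)] for a, p in outcomes)
            if val < best_val:      # u(x): the action minimizing E[V(f(x,u,A))]
                best_u, best_val = u, val
        V[t][x], policy[t][x] = best_val, best_u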
20. Alpha-beta = DP + pruning
"Nevertheless, I believe that a world-champion-level Go machine can be built within 10 years, based upon the same method of intensive analysis--brute force, basically--that Deep Blue employed for chess."
Hsu, IEEE Spectrum, 2007.
(==> I don't think so.)
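For reference, a minimal sketch of the "DP + pruning" idea (children and leaf_value are assumed helpers for some deterministic two-player zero-sum game):

def alphabeta(state, depth, alpha, beta, maximizing, children, leaf_value):
    # Minimax (DP on the game tree) with pruning of branches that
    # cannot change the result at the root.
    kids = children(state)
    if depth == 0 or not kids:
        return leaf_value(state)
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False,
                                         children, leaf_value))
            alpha = max(alpha, value)
            if alpha >= beta:      # beta cut-off: the opponent avoids this line
                break
        return value
    else:
        value = float("inf")
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True,
                                         children, leaf_value))
            beta = min(beta, value)
            if beta <= alpha:      # alpha cut-off
                break
        return value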
21. Extensions of DP
Approximate dynamic programming (e.g. for continuous domains)
Reinforcement learning
Case where f(...) or A is black-box
Huge state spaces
==> but lack of stability
Direct Policy Search, Fitted-Q-Iteration...
==> there is room for improvement
22. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weaknesses
Games as benchmarks?
23. Monte-Carlo Tree Search
Monte-Carlo Tree Search (MCTS) appeared in games.
R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, Turin, Italy, 2006.
Its most well-known variant is termed Upper Confidence Tree (UCT).
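The slides use UCT from here on without spelling out the loop, so here is a minimal one-player UCT sketch (a sketch only: the toy interface legal_moves / play / rollout, rewards in [0, 1] and the constant c are assumptions, not the tutorial's code):

import math, random

class Node:
    def __init__(self):
        self.children = {}   # move -> Node
        self.n = 0           # number of simulations through this node
        self.total = 0.0     # sum of rewards

def uct_score(parent, child, c=math.sqrt(2)):
    if child.n == 0:
        return float("inf")
    # empirical mean + exploration term: the UCB compromise used by UCT
    return child.total / child.n + c * math.sqrt(math.log(parent.n) / child.n)

def uct_search(root_state, legal_moves, play, rollout, n_sims=1000):
    root = Node()
    for _ in range(n_sims):
        node, state, path = root, root_state, [root]
        # 1) selection / expansion: descend the tree with the UCB compromise
        while True:
            moves = legal_moves(state)
            if not moves:          # terminal state
                break
            untried = [m for m in moves if m not in node.children]
            if untried:            # expand one untried move
                move = random.choice(untried)
                node.children[move] = Node()
            else:                  # all moves tried: UCB argmax
                move = max(moves, key=lambda m: uct_score(node, node.children[m]))
            node, state = node.children[move], play(state, move)
            path.append(node)
            if node.n == 0:        # reached a fresh node: stop descending
                break
        # 2) Monte-Carlo part: finish the game with a random simulation
        reward = rollout(state)
        # 3) backpropagation of the reward along the visited path
        for n in path:
            n.n += 1
            n.total += reward
    # final decision: the most simulated move at the root (the usual robust choice)
    return max(root.children, key=lambda m: root.children[m].n)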
35. Parallelizing MCTS
On a parallel machine with shared memory: just run many simulations in parallel, with the same memory (the same tree) for all.
On a parallel machine with no shared memory: one MCTS per computation node, and 3 times per second:
Select the nodes with at least 5% of the total simulations (depth at most 3);
Average all statistics on these nodes.
==> cost = log(nb of computation nodes)
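A hedged sketch of the no-shared-memory scheme, reusing the Node type from the UCT sketch above (the transport between computation nodes, e.g. an MPI all-gather, is left abstract because the slides name no library):

def heavy_node_stats(root, threshold=0.05, max_depth=3):
    # Statistics (path -> (n, total)) of nodes holding at least 5% of the
    # total simulations, at depth at most 3, as on the slide.
    stats = {}
    def walk(node, path, depth):
        if depth > max_depth or node.n < threshold * root.n:
            return
        stats[path] = (node.n, node.total)
        for move, child in node.children.items():
            walk(child, path + (move,), depth + 1)
    walk(root, (), 0)
    return stats

def average_statistics(all_stats):
    # Average the gathered statistics over all computation nodes; this is
    # what each node would do about 3 times per second.
    merged = {}
    for stats in all_stats:
        for path, (n, total) in stats.items():
            pn, pt = merged.get(path, (0, 0.0))
            merged[path] = (pn + n, pt + total)
    k = len(all_stats)
    return {path: (n / k, t / k) for path, (n, t) in merged.items()}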
45. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weaknesses
Games as benchmarks?
46. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
47. Why is UCT suboptimal for games?
There are better formulas than mean + sqrt(log(...) / ...) (= UCT).
MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be:
consistent (→ converges to the best move);
frugal (if there is a good move, it does not visit all of the tree infinitely often).
(==> not true for UCT)
48. Why is UCT suboptimal for games?
There are better formulas than mean + sqrt(log(...) / ...) (= UCT).
There is better for deterministic win/draw/loss games:
(sumRewards + K) / (nbTrials + 2K)
MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be:
consistent (→ converges to the best move);
frugal (if there is a good move, it does not visit all of the tree infinitely often).
(==> not true for UCT)
49. Go: from 29 to 6 stones
Formula for the simulation part:
argmax (nbWins + 1) / (nbLosses + 2)
Berthier, Doghmen, Teytaud, LION 2010
==> consistency
==> frugality
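In code, the move-selection rule above is a one-liner (the statistics dictionaries are assumed to be maintained by the simulations):

def select_simulation_move(moves, nb_wins, nb_losses):
    # argmax of (nbWins + 1) / (nbLosses + 2) over the legal moves
    return max(moves, key=lambda m: (nb_wins.get(m, 0) + 1) / (nb_losses.get(m, 0) + 2))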
51. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
52. Infinite action spaces: progressive widening
UCB1: choose u maximizing the compromise
empirical average for decision u + √( log(i) / number of trials with decision u )
==> at iteration i, take the argmax over the first ~ i^α arms only, with α in [0.25, 0.5]
(Coulom, Chaslot et al, Wang et al)
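A sketch of progressive widening under that reading (assumptions: arms come pre-ordered by some heuristic, and the number of arms considered grows as i^alpha):

import math

def progressive_widening_choice(arms, means, trials, i, alpha=0.4):
    # At iteration i, restrict the UCB argmax to the first ceil(i**alpha)
    # arms (alpha in [0.25, 0.5]); new arms enter the competition progressively.
    k = max(1, math.ceil(i ** alpha))
    def ucb(u):
        t = trials.get(u, 0)
        if t == 0:
            return float("inf")
        return means[u] + math.sqrt(math.log(i) / t)
    return max(arms[:k], key=ucb)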
53. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
54. Extensions
``Standard'' UCT:
score(situation, move) = compromise (in [0, 1+]) between
a) empirical quality: P ( win | nextMove(situation) = move ), estimated in simulations
b) exploration term
Remark: no offline learning.
55. Extension: offline learning (introducing imitation learning)
c) offline value (Bouzy et al) = empirical estimate of P ( played | pattern )
Pattern = a ball of locations, each location being one of:
- black stone
- white stone
- empty
- not a black stone
- not a white stone
- not empty
- border
Support = frequency of "the center of this pattern is played".
Confidence = conditional frequency of play.
Bias = confidence of the pattern with maximal support.
56. Extension: offline learning (introducing imitation learning)
score(situation, move) = compromise between
a) empirical quality
b) exploration term
c) offline value (Bouzy et al, Coulom) = empirical estimate of P ( played | pattern ), for patterns with large support
==> estimated on a database
At first, (c) is the most important; later, (a) dominates.
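A hedged sketch of such a combined score (the exact combination and decay schedule vary between programs; the schedule below is an illustrative assumption):

import math

def score(mean_win, n_parent, n_move, offline_value, c_weight=50.0):
    # (a) empirical quality + (b) exploration + (c) offline pattern value.
    # The weight of (c) decays as trials accumulate, so (c) dominates at
    # first and (a) dominates later, as stated on the slide.
    if n_move == 0:
        return float("inf")
    exploration = math.sqrt(math.log(n_parent) / n_move)
    w = c_weight / (c_weight + n_move)   # assumed decay schedule
    return (1 - w) * mean_win + exploration + w * offline_value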
57. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
58. Extensions
``Standard'' UCT:
score(situation, move) = compromise between
a) empirical quality
b) exploration term
Remark: no learning from one situation to another.
59. Extension: transient values
score(situation, move) = compromise between
a) empirical quality: P' ( win | nextMove(situation) = move ), estimated in simulations
b) exploration term
c) offline value
d) ``transient'' value (Gelly et al, 07): P' ( win | move ∈ laterMoves(situation) )
==> brings information from node N to ancestor node M
==> does not bring information from node N to descendants or cousins (many people have tried...)
Brügmann, Gelly et al
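A minimal sketch of this blending (the beta schedule below is one common choice from the literature, an assumption rather than necessarily the one used here):

def rave_score(win_mean, n_move, rave_mean, n_rave, k=1000.0):
    # Each node keeps, besides its normal statistics, "all moves as first"
    # statistics: every move played later in a simulation through the node
    # updates that move's RAVE counters. The score blends the two means;
    # beta -> 0 as real trials accumulate, so RAVE only guides early search.
    if n_move == 0 and n_rave == 0:
        return float("inf")
    beta = n_rave / (n_rave + n_move + n_rave * n_move / k)  # assumed schedule
    return beta * rave_mean + (1 - beta) * win_mean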
60. Transient values = RAVE = very good in many games
It also works in Havannah.
61. It also works in NoGo
NoGo = the rules of Go, except that capturing ==> losing.
62. Counter-example to RAVE, B2
By M. Müller.
B2 makes sense only if it is played immediately (otherwise A5 kills).
63. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
64. Extensions
``Standard'' UCT:
score(situation, move) = compromise between
a) empirical quality
b) exploration term
Remark: no expert rules.
65. Extension: expert rules
score(situation, move) = compromise between
a) empirical quality
b) exploration term
c) offline value
d) transient value
e) expert rules
==> empirically derived linear combination
Most important terms: (e) + (c) first, then (d) becomes stronger, finally (a) only.
66. Extension: expert rules in the Monte-Carlo part
Decisive moves: play immediate wins.
Anti-decisive moves: don't play moves with an immediate winning reply.
Teytaud & Teytaud, CIG 2010: can be fast in connection games, e.g. Havannah.
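A sketch of these two rules inside the Monte-Carlo part (all helpers are assumed, game-specific tests):

import random

def mc_move(state, player, opponent, legal_moves, play, immediate_win):
    moves = legal_moves(state)
    # Decisive moves: if we have an immediate winning move, play it.
    for m in moves:
        if immediate_win(state, m, player):
            return m
    # Anti-decisive moves: discard moves that leave an immediate winning reply.
    safe = []
    for m in moves:
        nxt = play(state, m)
        if not any(immediate_win(nxt, r, opponent) for r in legal_moves(nxt)):
            safe.append(m)
    return random.choice(safe or moves)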
67. Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9 MoGo
2008: win against a pro (4p) 19x19, H8 CrazyStone
2008: win against a pro (4p) 19x19, H7 CrazyStone
2009: win against a pro (9p) 19x19, H7 MoGo
2009: win against a pro (1p) 19x19, H6 MoGo
2010: win against a pro (4p) 19x19, H6 Zen
2010: win against a pro (5p) 19x19, H6 Zen
2007: win against a pro (5p) 9x9 (blitz) MoGo
2008: win against a pro (5p) 9x9 white MoGo
2009: win against a pro (5p) 9x9 black MoGo
2009: win against a pro (9p) 9x9 white Fuego
2009: win against a pro (9p) 9x9 black MoGoTW
==> still 6 stones at least!
68. Go: from 29 to 6 stones
[Same list of results as on the previous slide.]
The wins with H6 / H7 are lucky (rare) wins.
==> still 6 stones at least!
69. Go: from 29 to 6 stones
[Same list of results as on slide 67.]
In 9x9: wins with the disadvantageous side.
==> still 6 stones at least!
70. 13x13 Go: new results!
9x9 Go: computers are at the best human level.
- Fuego won against a top-level human as white
- MoGoTW did it both as black and as white, and regularly wins some games against the top players
- MoGoTW won 3/4 yesterday in blind go (blind go = go in 9x9, according to the pros)
19x19 Go: the best humans still (almost always) win easily with 7 handicap stones.
In WCCI 2010, experiments in 13x13 Go:
- MoGo won 2/2 against 6D players with handicap 2
- MfoG won 1/2 against 6D players with handicap 2
- Fuego won 0/2 against 6D players with handicap 2
And yesterday MoGoTW won one game with handicap 2.5!
71. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
More than UCT in MCTS
Infinite action spaces
Offline learning
Online learning
Expert knowledge
Hidden information
72. Bandits
We have seen UCB: choose the action with maximal score
Q(action, state) = empirical_reward(action, state) + sqrt( log(nbSims(state)) / nbSims(action, state) )
EXP3 is another bandit:
for adversarial cases;
based on a stochastic formula.
73. EXP3 in one slide
Grigoriadis et al, Auer et al, Audibert & Bubeck, COLT 2009
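The formula itself did not survive extraction, so here is a standard textbook EXP3 sketch (an assumption: the speaker's exact variant may differ):

import math, random

def exp3(n_arms, reward, T, gamma=0.1):
    # EXP3: sample an arm from a mixture of the weight distribution and the
    # uniform distribution, then apply an importance-weighted multiplicative
    # update to the pulled arm only. reward(arm, t) must lie in [0, 1].
    weights = [1.0] * n_arms
    for t in range(T):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        x = reward(arm, t)
        weights[arm] *= math.exp(gamma * x / (probs[arm] * n_arms))
        m = max(weights)
        weights = [w / m for w in weights]   # renormalize for numerical safety
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]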
74. MCTS for simultaneous actions
[Figure: a game tree alternating "Player 1 plays", "Player 2 plays" and "Both players play" nodes.]
75. MCTS for simultaneous actions
Flory, Teytaud, Evostar 2011
[Figure: the same tree; "Player 1 plays" = maxUCB node, "Player 2 plays" = minUCB node, "Both players play" = EXP3 node.]
76. MCTS for hidden information
[Figure: for each player, the tree is organized into observation sets (1, 2, 3), with one EXP3 node per observation set.]
77. MCTS for hidden information
[Same figure.] "Observation set" = set of sequences of observations.
78. MCTS for hidden information
[Same figure.] Here, the possible sequences of observations are partitioned into 3 observation sets per player.
79. MCTS for hidden information
[Same figure.] Thanks Martin.
(Incrementally + application to phantom tic-tac-toe: see D. Auger, 2011.)
80. MCTS for hidden information
[Same figure.] Use EXP3: consistent even in an adversarial setting.
(Incrementally + application to phantom tic-tac-toe: see D. Auger, 2010.)
81. MCTS with hidden information
While (there is time for thinking)
{
  s = initial state
  os1 = (); os2 = ()                           // observation sequences, initially empty
  while (s not terminal)
  {
    b1 = bandit1(os1); b2 = bandit2(os2)       // one bandit (EXP3) per observation sequence
    d1 = b1.makeDecision; d2 = b2.makeDecision // simultaneous decisions
    (s, o1, o2) = transition(s, d1, d2)        // next state + one observation per player
    os1 = os1.o1; os2 = os2.o2                 // append the new observations
  }
  send the reward to all bandits used in the simulation
}
90. MCTS with hidden information: incremental version
[Same pseudocode as above.] Possibly refine the family of bandits (i.e. the partition into observation sets) as the search proceeds.
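A runnable toy version of this simulation loop, assuming a one-step matching-pennies game and one EXP3 bandit per observation sequence (all game specifics here are illustrative assumptions):

import math, random
from collections import defaultdict

ACTIONS = [0, 1]
GAMMA = 0.2

class Exp3:
    def __init__(self):
        self.w = [1.0] * len(ACTIONS)
    def probs(self):
        s = sum(self.w)
        return [(1 - GAMMA) * x / s + GAMMA / len(ACTIONS) for x in self.w]
    def decide(self):
        return random.choices(ACTIONS, weights=self.probs())[0]
    def update(self, a, reward):             # importance-weighted update
        self.w[a] *= math.exp(GAMMA * reward / (self.probs()[a] * len(ACTIONS)))
        m = max(self.w)
        self.w = [x / m for x in self.w]     # renormalize (numerical safety)

bandit1 = defaultdict(Exp3)                  # one bandit per observation sequence
bandit2 = defaultdict(Exp3)

def transition(s, d1, d2):                   # assumed one-step matching pennies
    return ("terminal", 1.0 if d1 == d2 else 0.0), str(d2), str(d1)

for _ in range(10000):                       # "while there is time for thinking"
    s, os1, os2, used = ("start", None), "", "", []
    while s[0] != "terminal":                # "while s not terminal"
        b1, b2 = bandit1[os1], bandit2[os2]
        d1, d2 = b1.decide(), b2.decide()    # simultaneous decisions
        s, o1, o2 = transition(s, d1, d2)
        used.append((b1, d1, b2, d2))
        os1, os2 = os1 + o1, os2 + o2        # append the new observations
    r = s[1]                                 # reward for player 1
    for b1, d1, b2, d2 in used:              # send the reward to all bandits used
        b1.update(d1, r)
        b2.update(d2, 1.0 - r)               # zero-sum opponent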
92. Let's have fun with Urban Rivals (4 cards)
Each player has:
- four cards (each one can be used once)
- 12 pilz (each one can be used once)
- 12 life points
Each card has:
- one attack level
- one damage
- special effects (forget them for the moment)
Four turns:
- P1 attacks P2
- P2 attacks P1
- P1 attacks P2
- P2 attacks P1
93. Let's have fun with Urban Rivals
First, the attacker plays:
- chooses a card
- chooses (PRIVATELY) a number of pilz
Attack level = attack(card) x (1 + nb of pilz)
Then, the defender plays:
- chooses a card
- chooses a number of pilz
Defense level = attack(card) x (1 + nb of pilz)
Result:
If attack > defense, the defender loses Power(attacker's card);
else, the attacker loses Power(defender's card).
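A small sketch of one attack resolution as described above (field names are assumptions; "power" plays the role of the card's damage):

def resolve(attacker_card, atk_pilz, defender_card, def_pilz):
    # One attack resolution: compare pilz-boosted attack levels,
    # then the loser pays the winner's card power in life points.
    attack = attacker_card["attack"] * (1 + atk_pilz)
    defense = defender_card["attack"] * (1 + def_pilz)
    if attack > defense:
        return ("defender", attacker_card["power"])   # defender loses life points
    return ("attacker", defender_card["power"])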
94. Let's have fun with Urban Rivals
==> The MCTS-based AI is now at the best human level.
Experimental (only) remarks on EXP3:
- discarding strategies with a small number of sims = better approximation of the Nash equilibrium
- also an improvement from taking the other bandit into account
- not yet compared to INF
- virtual simulations (inspired by Kummer)
95. Let's have fun with a nice application
We are now at the best human level in Urban Rivals.
96. Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weaknesses
Games as benchmarks?
104. Game of Go: counting territories
(white gets a 7.5-point "bonus" (the komi), as black starts)
105. Game of Go: the rules
Black plays at the blue circle: the white group dies (it is removed).
It's impossible to kill white (two "eyes").
"Ko" rule: we don't come back to the same situation.
(without ko: "PSPACE-hard"; with ko: "EXPTIME-complete")
At the end, we count territories
==> black starts, so +7.5 for white.
108. Key point in Go: there are human-easy situations which are computer-hard
We'll see much easier situations that are poorly understood.
(komi 7.5)
109. Difficult for computers (win for black, playing A)
We'll see much easier situations that are poorly understood.
(komi 7.5)
But let's see an easier case.
110. A trivial semeai
Plenty of equivalent situations!
They are randomly sampled, with no generalization.
50% estimated win probability!
121. It does not work. Why?
50% estimated win probability!
In the first node:
The first simulations give ~ 50%.
The next simulations go to 100% or 0% (depending on the chosen move).
But, then, we switch to another node.
(~ 8! x 8! such nodes)
122. And the humans?
50% estimated win probability!
In the first node:
The first simulations give ~ 50%.
The next simulations go to 100% or 0% (depending on the chosen move).
But, then, we DON'T switch to another node.
135. What else? Games with simultaneous actions or hidden information
Flory, Teytaud, Evostar 2011
Games with hidden information.
Games with simultaneous actions.
UrbanRivals = internet card game; 11 million registered users.
A game with hidden information.
138. “Real” games
Assumption: if a computer understands and guesses spins, then this robot will be efficient at something other than just games.
(holds true for Go)
139. “Real” games
[Same assumption as above, illustrated with a "VS" picture.]
141. When is MCTS relevant?
Robust in the face of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.
142. When is MCTS relevant?
Robust in the face of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.
More difficult for:
High values of H (the horizon);
Highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree Search; see Cazenave et al.);
Lack of a reasonable baseline for the MC part.
143. When is MCTS relevant?
Robust in the face of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.
More difficult for:
High values of H;
Highly unobservable cases;
Lack of a reasonable baseline for the MC part.
Go: H ~ 300; dimension = 361; fully observable; fully delayed reward.
144. When is MCTS relevant?
How to apply it:
Implement the transition (a function action x state → state).
Design a Monte-Carlo part (a random simulation): a heuristic in one-player games; difficult with two opponents.
==> at this point you can simulate...
Implement UCT (just a bias in the simulator; no real optimizer).
145. When is MCTS relevant?
How to apply it:
Implement the transition (a function action x state → state).
Design a Monte-Carlo part (a random simulation): a heuristic in one-player games; difficult with two opponents.
==> at this point you can simulate...
Implement UCT (just a bias in the simulator; no real optimizer).
Possibly add:
RAVE values (Gelly et al)
Parallelization: multicore + MPI (Cazenave et al, Gelly et al)
Decisive moves + anti-decisive moves (Teytaud et al)
Patterns (Bouzy et al)
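A skeleton of this recipe (hedged: the interface is an assumed placeholder; plug in your own game or control problem):

import random

class Problem:                                   # assumed placeholder interface
    def initial_state(self): ...
    def actions(self, state): ...
    def transition(self, state, action): ...     # step 1: action x state -> state
    def is_terminal(self, state): ...
    def reward(self, state): ...

def monte_carlo_part(problem, state):
    # step 2: a random simulation to the end (a heuristic playout in
    # one-player games); with this alone you can already simulate.
    while not problem.is_terminal(state):
        state = problem.transition(state, random.choice(problem.actions(state)))
    return problem.reward(state)

# step 3: bias this simulator near the root with UCT statistics, as in the
# UCT sketch after slide 23; then possibly add RAVE, parallelization,
# (anti-)decisive moves and patterns as listed above.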
146. Advantages of MCTS: easy + visible
Many indicators (not only the expectation; simulation-based; visible; easy to check).
The algorithm is indeed simpler than DP (unless deeply optimized, as for Go competitions...).
Anytime (you stop when you want).
148. Drawbacks of MCTS
A recent method.
The impact of H is not clearly known (?).
No free lunch: a model of the transition / uncertainties is required (but, as an advantage: no constraints on it) (see however Fonteneau et al, model-free MC).
149. Conclusion
Essentially asymptotically proved only.
Empirically good for:
The game of Go
Some other (difficult) games
Non-linear expensive optimization
Active learning
Tested industrially (Spiral library, architecture-specific).
There are understood (but not solved) weaknesses.
Next challenges:
Solve these weaknesses (introducing learning? refutation tables? Drake et al)
More industrial applications
Partially observable cases (Cazenave et al, Rolet et al, Auger)
Large H: truncating (Lorentz)
Scalability (Doghmen et al)