Bandit-based Monte-Carlo planning: the game
of Go and beyond

Designing intelligent agents with Monte-Carlo Tree Search.

Olivier.Teytaud@inria.fr + F. Teytaud + H. Doghmen + others
TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal Network of Excellence.

Paris, April 2011.

                      Keywords: UCB, EXP3, MCTS, UCT.
                      Games, games with hidden information,
                        games with simultaneous actions.
Key point

                 PLEASE INTERRUPT ME!
                      HAVE QUESTIONS!
        LET'S HAVE A FRIENDLY SESSION!
 ASK QUESTIONS NOW AND LATER BY MAIL!

                  olivier.teytaud@inria.fr
Outline

Introduction:
   games / control / planning.

Standard approaches

Bandit-based Monte-Carlo Planning and UCT.

Application to the game of Go and (far) beyond
A game is a directed graph

(Figure: a directed graph of game states.)
A game is a directed graph with actions

(Figure: the same graph; the edges leaving each state are labeled
with actions 1, 2, 3.)
A game is a directed graph with actions and players

(Figure: the same graph; each node is labeled with the player to
move, White or Black.)
A game is a directed graph with actions
                 and players and observations

(Figure: the same graph; edges now also carry observations such as
"Bob", "Bear", "Bee".)
A game is a directed graph with actions
         and players and observations and rewards

(Figure: the same graph, with rewards +1 and 0 at the terminal nodes.)

                                             Rewards on leaves only!
A game is a directed graph +actions
           +players +observations +rewards +loops

(Figure: the same graph, now with a loop back to an earlier state.)
More than games in this formalism

A main application: the management of
 many energy stocks in front of randomness.
At each time step we see random outcomes.
We have to make decisions (switching on or off).
We have losses.
                             (ANR / NSC project)
Opening a reservoir produces energy
  (and water goes to another reservoir)

(Figure: five reservoirs feeding one another and, together with
classical thermal plants and nuclear plants, meeting the electricity
demand; unused water is lost.)
Outline

Introduction:
   games / control / planning.

Standard approaches

Bandit-based Monte-Carlo Planning and UCT.

Application to the game of Go and (far) beyond
What are the approaches?

Dynamic programming (Massé – Bellman, 50's)
(still the main approach in industry)
     (minimax / alpha-beta in games)

Reinforcement learning (some promising results,
less used in industry)

Some tree exploration tools (less usual in
stochastic or continuous cases)

Bandit-based Monte-Carlo planning

Scripts + tuning
What are the approaches?

Dynamic programming (Massé – Bellman, 50's)
(still the main approach in industry)

  Where we are:
  Done: presentation of the problem.
  Now: we briefly present dynamic programming.
  Thereafter: we present MCTS / UCT.
Dynamic programming (DP)

V(x) = expectation of the future loss C(xH) under an optimal
strategy after state x.  (well defined)
u(x) is chosen such that the expectation of V(f(x,u(x),A)) is minimal.
Computation by dynamic programming:

We compute V for all the final states X.
We compute V for all the “-1” states X (one step before the end).
            ...        ...          ...
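(A minimal Python sketch of this backward recursion for a finite
horizon H; the finite state/action sets, the transition f(x,u,a),
the equiprobable random outcomes A, and the final cost C are
hypothetical stand-ins for the notation above.)

    def dynamic_programming(states, actions, f, A, C, H):
        # Backward induction: V[h][x] = expected cost-to-go from x at step h.
        # states, actions: finite iterables; f(x, u, a): transition;
        # A: list of equiprobable outcomes (an assumption); C(x): final cost.
        V = {H: {x: C(x) for x in states}}
        policy = {}
        for h in range(H - 1, -1, -1):          # steps H-1, ..., 0
            V[h], policy[h] = {}, {}
            for x in states:
                # expectation of V(f(x,u,A)) for each decision u
                q = {u: sum(V[h + 1][f(x, u, a)] for a in A) / len(A)
                     for u in actions}
                u_star = min(q, key=q.get)      # u(x): minimize the expectation
                V[h][x], policy[h][x] = q[u_star], u_star
        return V, policy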
Dynamic programming: picture
Alpha-beta = DP + pruning

    “Nevertheless, I believe that a world-champion-level
     Go machine can be built within 10 years, based upon
     the same method of intensive analysis--brute force,
     basically--that Deep Blue employed for chess.”
                          Hsu, IEEE Spectrum, 2007.

                            (==> I don't think so.)
Extensions of DP

Approximate dynamic programming (e.g.
for continuous domains)
Reinforcement learning
Case where f(...) or A is black-box
Huge state spaces
   ==> but lack of stability
Direct Policy Search,
   Fitted-Q-Iteration...
==> there is room for improvement
Outline

Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS;
2006)
Extensions
Weakness
Games as benchmarks?
Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) appeared in games.

R. Coulom. Efficient Selectivity and Backup Operators in
Monte-Carlo Tree Search. Proceedings of the 5th International
Conference on Computers and Games, Turin, Italy, 2006.

Its most well-known variant is termed Upper Confidence Tree (UCT).
UCT (Upper Confidence Trees)

Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvari (06)
UCT
      Kocsis & Szepesvari (06)

(Figure slides: a UCT tree built step by step over successive simulations.)
Exploitation ...
            SCORE =
               5/7
             + k.sqrt( log(10)/7 )
... or exploration ?
              SCORE =
                 0/2
               + k.sqrt( log(10)/2 )
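(A minimal Python sketch of this exploitation/exploration score,
using the slide's numbers as the example; the per-move statistics
layout and the helper names are assumptions.)

    import math

    def ucb_score(wins, visits, parent_visits, k=1.0):
        # Exploitation + exploration: e.g. 5 wins out of 7 visits with
        # 10 parent visits gives 5/7 + k*sqrt(log(10)/7), as on the slides.
        if visits == 0:
            return float("inf")                 # try unvisited moves first
        return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

    def select_move(stats, parent_visits, k=1.0):
        # stats: dict move -> (wins, visits); pick the argmax of the score.
        return max(stats, key=lambda m: ucb_score(*stats[m], parent_visits, k))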
Parallelizing MCTS

On a parallel machine with shared memory: just many
   simulations in parallel, the same memory for all.

On a parallel machine with no shared memory: one MCTS
   per comp. node, and 3 times per second:

  Select nodes with at least 5% of total sims (depth at most 3)

  Average all statistics on these nodes

  ==> comp cost = log(nb comp nodes)
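(A sketch of this no-shared-memory synchronization step; the tree
layout — a dict from (depth, node key) to (wins, visits) — is an
assumption, and a tree-shaped MPI reduction would give the
logarithmic cost mentioned above.)

    def nodes_to_share(tree, total_sims, max_depth=3, min_fraction=0.05):
        # Keep heavily simulated shallow nodes: depth <= 3 and at least
        # 5% of the total simulations, as on the slide.
        # tree: dict (depth, key) -> (wins, visits)  (hypothetical layout)
        return {nk for nk, (w, v) in tree.items()
                if nk[0] <= max_depth and v >= min_fraction * total_sims}

    def average_statistics(trees):
        # Average the selected statistics over the per-machine trees
        # (done 3 times per second on the slide; a tree-shaped reduction
        # makes the communication cost log(nb comp nodes)).
        total = sum(v for t in trees for (w, v) in t.values())
        shared = set().union(*(nodes_to_share(t, total) for t in trees))
        return {nk: tuple(sum(t.get(nk, (0, 0))[i] for t in trees) / len(trees)
                          for i in (0, 1))
                for nk in shared}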
Good news: it works
So misleading numbers...
Much better than voting schemes




But little difference with T. Cazenave
  (depth 0).
Every month, someone tells us:

                Try with a bigger
                   machine!
                And win against
                    top pros!
Being faster is not the solution
The same in Havannah
      (F. Teytaud)
Outline

Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weakness
Games as benchmarks?
Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
  More than UCT in MCTS
  Infinite action spaces
  Offline learning
  Online learning
  Expert knowledge
  Hidden information
Why is UCT suboptimal for games?

  There are better formulas than
      mean + sqrt( log(...) / ... )    (=UCT)

  MCTS, under mild conditions on games
   (including deterministic two-player zero-
   sum games), can be
 consistent (→ finds the best move);
 frugal (if there is a good move, it does not
 visit all the tree infinitely often).
       (==> not true for UCT)
Why is UCT suboptimal for games?

  There are better formulas than
       mean + sqrt( log(...) / ... )        (=UCT)

  There is better for deterministic win/draw/loss games:

                      (sumRewards + K) / (nbTrials + 2K)

  MCTS, under mild conditions on games
   (including deterministic two-player zero-
   sum games), can be
 consistent (→ finds the best move);
 frugal (if there is a good move, it does not
 visit all the tree infinitely often).
        (==> not true for UCT)
Go: from 29 to 6 stones
Formula for
simulation
                        nbWins + 1
                argmax ---------------
                        nbLosses + 2     Berthier, Doghmen, T.,
                                              LION 2010
                   ==> consistency
                   ==> frugality
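(A Python sketch of these two regularized scores; the function names
are hypothetical, the constants (+1, +2) and (+K, +2K) come from the
slides.)

    def simulation_score(nb_wins, nb_losses):
        # Score used inside the simulations: argmax of
        # (nbWins + 1) / (nbLosses + 2)  (Berthier, Doghmen, T., LION 2010).
        return (nb_wins + 1) / (nb_losses + 2)

    def tree_score(sum_rewards, nb_trials, K=1.0):
        # For deterministic win/draw/loss games, instead of the UCT formula:
        # (sumRewards + K) / (nbTrials + 2K).
        return (sum_rewards + K) / (nb_trials + 2 * K)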
It depends on the game and on the
             tuning...
Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
  More than UCT in MCTS
  Infinite action spaces
  Offline learning
  Online learning
  Expert knowledge
  Hidden information
Infinite action spaces:
progressive widening
UCB1: Choose u maximizing the compromise:

  Empirical average for decision u
  + √( log(i) / number of trials with decision u )

  ==> argmax only over the first ⌈i^p⌉ arms,
            with p in [ 0.25, 0.5 ]

                  (Coulom, Chaslot et al, Wang et al)
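(A sketch of progressive widening, assuming the arms come pre-ordered,
e.g. by a heuristic; the exponent range [0.25, 0.5] is from the slide,
the rest is hypothetical.)

    import math

    def progressive_widening_argmax(arms, score, i, p=0.5):
        # At iteration i, take the argmax of the UCB1-style score over
        # only the first ceil(i**p) arms, p in [0.25, 0.5].
        width = max(1, math.ceil(i ** p))
        return max(arms[:width], key=score)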
Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
  More than UCT in MCTS
  Infinite action spaces
  Offline learning
  Online learning
  Expert knowledge
  Hidden information
Extensions
``Standard'' UCT:
   score(situation,move) = compromise (in [0,1+] )
           between
      a) empirical quality
           P ( win | nextMove(situation) = move )
            estimated in simulations

      b) exploration term


Remark: No offline learning
Extension: offline learning
      (introducing imitation learning)
      c) offline value (Bouzy et al) =
            empirical estimate P ( played | pattern )

Pattern = ball of locations, each location either:
              - this is a black stone
              - this is a white stone
              - this is empty
              - this is not a black stone
              - this is not a white stone
              - this is not empty
              - this is the border
Support = frequency of “the center of this pattern is played”
Confidence = conditional frequency of play
Bias = confidence of the pattern with max support
Extension: offline learning
      (introducing imitation learning)

  score(situation,move) = compromise between
     a) empirical quality
     b) exploration term
      c) offline value (Bouzy et al, Coulom) =
            empirical estimate P ( played | pattern )
            for patterns with big support
            ==> estimated on database

At first, (c) is the most important; later, (a) dominates.
Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
  More than UCT in MCTS
  Infinite action spaces
  Offline learning
  Online learning
  Expert knowledge
  Hidden information
Extensions
``Standard'' UCT:
   score(situation,move) = compromise between
      a) empirical quality
      b) exploration term

Remark: No learning from one situation to another
Extension: transient values

score(situation,move) = compromise between
   a) empirical quality
           P' ( win | nextMove(situation) = move )
          estimated in simulations
   b) exploration term
    c) offline value
    d) ``transient'' value:            (Gelly et al, 07)
           P' ( win | move ∈ laterMoves(situation) )

==> brings information from node N to ancestor node M
==> does not bring information from node N to
     descendants or cousins (many people have tried...)
                                  Brügmann, Gelly et al
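(A sketch of how such a transient/RAVE value is commonly blended with
the direct estimate; the decay schedule and the constant k_rave are
assumptions, chosen so that (d) dominates early and (a) dominates
later, as on the slides that follow.)

    def rave_score(wins, visits, amaf_wins, amaf_visits, k_rave=1000):
        # Blend the direct win rate (a) with the "all moves as first"
        # transient estimate (d); beta -> 1 with few visits, -> 0 later.
        q = wins / visits if visits else 0.5
        q_amaf = amaf_wins / amaf_visits if amaf_visits else 0.5
        beta = k_rave / (k_rave + 3.0 * visits)   # hypothetical schedule
        return beta * q_amaf + (1 - beta) * q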
Transient values = RAVE = very good
           in many games

It works also in Havannah.
It works also in NoGo.
          NoGo = rules of Go,
            except that capturing ==> losing
Counter-example to RAVE: B2
By M. Müller

                    B2 makes sense
                    only if it is played
                    immediately
                    (otherwise A5 kills).
Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
  More than UCT in MCTS
  Infinite action spaces
  Offline learning
  Online learning
  Expert knowledge
  Hidden information
Extensions
``Standard'' UCT:
   score(situation,move) = compromise between
      a) empirical quality
      b) exploration term

Remark: No expert rules
Extension: expert rules
  score(situation,move) = compromise between
     a) empirical quality
     b) exploration term
     c) offline value
     d) transient value
     e) expert rules

   ==> empirically derived linear combination

Most important terms:
(e)+(c) first,
then (d) becomes stronger,
finally (a) only.
Extension: expert rules
       in the Monte-Carlo part
Decisive moves: play immediate wins.
Anti-decisive moves: don't play moves with an immediate
                      winning reply.

Teytaud & Teytaud, CIG 2010:
  can be fast in connection games, e.g. Havannah:
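(A sketch of this filter inside a playout; the game interface —
legal_moves, wins_immediately, play — is passed in explicitly because
these names are hypothetical.)

    import random

    def playout_move(state, player, opponent,
                     legal_moves, wins_immediately, play):
        # Decisive / anti-decisive moves (Teytaud & Teytaud, CIG 2010).
        moves = legal_moves(state, player)
        for m in moves:                       # decisive: take an immediate win
            if wins_immediately(state, player, m):
                return m
        safe = [m for m in moves              # anti-decisive: avoid moves that
                if not any(                   # leave an immediate winning reply
                    wins_immediately(play(state, player, m), opponent, r)
                    for r in legal_moves(play(state, player, m), opponent))]
        return random.choice(safe if safe else moves)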
Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9        MoGo
2008: win against a pro (4p) 19x19, H8    CrazyStone
2008: win against a pro (4p) 19x19, H7    CrazyStone
2009: win against a pro (9p) 19x19, H7        MoGo
2009: win against a pro (1p) 19x19, H6        MoGo
2010: win against a pro (4p) 19x19, H6          Zen
2010: win against a pro (5p) 19x19, H6          Zen
      (wins with H6 / H7 are lucky, i.e. rare, wins)

2007: win against a pro (5p) 9x9 (blitz)     MoGo
2008: win against a pro (5p) 9x9 white       MoGo
2009: win against a pro (5p) 9x9 black       MoGo
2009: win against a pro (9p) 9x9 white       Fuego
2009: win against a pro (9p) 9x9 black       MoGoTW
      (wins with the disadvantageous side)

==> still 6 stones at least!
13x13 Go: new results!
9x9 Go: computers are at the best human level.
   - Fuego won against a top-level human as white
   - MoGoTW did it both as black and white and regularly wins
      some games against the top players.
   - MoGoTW won 3/4 yesterday in blind Go (blind Go = Go in 9x9,
           according to the pros)

19x19 Go: the best humans still (almost always) win easily with
7 handicap stones.

In WCCI 2010, experiments in 13x13 Go:
   - MoGo won 2/2 against 6D players with handicap 2
   - MfoG won 1/2 against 6D players with handicap 2
   - Fuego won 0/2 against 6D players with handicap 2
 And yesterday MoGoTW won one game with handicap 2.5!
Outline
Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
  More than UCT in MCTS
  Infinite action spaces
  Offline learning
  Online learning
  Expert knowledge
  Hidden information
Bandits

We have seen UCB:
  choose the action with maximal score

  Q(action,state) = empirical_reward(action,state)
      + sqrt( log(nbSims(state)) / nbSims(action,state) )

EXP3 is another bandit:
  for adversarial cases
  based on a stochastic formula
EXP3 in one slide

(The EXP3 formula was shown as a figure on the slide.)

Grigoriadis et al; Auer et al; Audibert & Bubeck, COLT 2009
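(Since the formula is an image on the original slide, here is a
standard EXP3 sketch — exponential weights with importance-weighted
reward estimates; the exploration rate gamma and rewards in [0, 1]
are assumptions.)

    import math, random

    class Exp3:
        # Standard EXP3: sample from exponential weights mixed with a
        # uniform exploration term; rewards are assumed to lie in [0, 1].
        def __init__(self, n_arms, gamma=0.1):
            self.gamma, self.w = gamma, [1.0] * n_arms

        def probabilities(self):
            total, n = sum(self.w), len(self.w)
            return [(1 - self.gamma) * wi / total + self.gamma / n
                    for wi in self.w]

        def draw(self):
            p = self.probabilities()
            return random.choices(range(len(self.w)), weights=p)[0]

        def update(self, arm, reward):
            p = self.probabilities()[arm]
            x_hat = reward / p                # importance-weighted estimate
            self.w[arm] *= math.exp(self.gamma * x_hat / len(self.w))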
MCTS for simultaneous actions

(Figure: a game tree alternating three node types — “Player 1 plays”
nodes, “Player 2 plays” nodes, and “both players play” nodes.)
MCTS for simultaneous actions
                                                         Flory, Teytaud,
                                                          Evostar 2011

(Figure: the same tree; “Player 1 plays” = maxUCB node,
“Player 2 plays” = minUCB node, “both players play” = EXP3 node.)
MCTS for hidden information

(Figure: for each player, the possible sequences of observations are
partitioned into observation sets — here, partitioned in 3 — and each
observation set carries an EXP3 node.)

“Observation set” = set of sequences of observations.

Use EXP3: consistent even in the adversarial setting.

(Incrementally + application to phantom tic-tac-toe: see D. Auger, 2011.)

                                                          Thanks Martin
MCTS with hidden information
While (there is time for thinking)
{
    s = initial state
    os1 = observationSet1 = (); os2 = ()        // one observation sequence per player
    while (s not terminal)
    {
        b1 = bandit1(os1); b2 = bandit2(os2)    // one bandit per observation set
        d1 = b1.makeDecision; d2 = b2.makeDecision
        (s,o1,o2) = transition(s,d1,d2)
        os1 = os1.o1, os2 = os2.o2              // append the new observations
    }
    send reward to all bandits in the simulation
}
MCTS with hidden information:
incremental version

Same loop as above, but between simulations we possibly refine
the family of bandits (i.e. the partition into observation sets).
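(A minimal executable version of the loop above, reusing the Exp3
sketch from earlier; the game interface — initial_state, is_terminal,
transition, reward, and a fixed number of actions — is a hypothetical
stand-in, and sending 1 - reward to player 2's bandits assumes a
zero-sum game.)

    def mcts_hidden_information(initial_state, is_terminal, transition,
                                reward, n_actions, n_simulations=10000):
        # One Exp3 bandit per (player, observation sequence).
        bandits = {}
        def bandit(player, obs_seq):
            return bandits.setdefault((player, obs_seq), Exp3(n_actions))
        for _ in range(n_simulations):
            s, os1, os2, used = initial_state, (), (), []
            while not is_terminal(s):
                b1, b2 = bandit(1, os1), bandit(2, os2)
                d1, d2 = b1.draw(), b2.draw()
                used += [(1, b1, d1), (2, b2, d2)]
                s, o1, o2 = transition(s, d1, d2)
                os1, os2 = os1 + (o1,), os2 + (o2,)   # append observations
            r = reward(s)                     # player 1's reward in [0, 1]
            for player, b, d in used:         # reward all bandits used
                b.update(d, r if player == 1 else 1 - r)
        return bandits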
Let's have fun with a nice
application




(
Let's have fun with Urban Rivals (4 cards)
 Each player has
  - four cards (each one can be used once)
  - 12 pilz (each one can be used once)
  - 12 life points

 Each card has:
  - one attack level
  - one damage
  - special effects (forget it for the moment)

 Four turns:
  - P1 attacks P2
  - P2 attacks P1
  - P1 attacks P2
  - P2 attacks P1

Let's have fun with Urban Rivals
First, the attacker plays:
- chooses a card
- chooses ( PRIVATELY ) a number of pilz
 Attack level = attack(card) x (1 + nb of pilz)

Then, the defender plays:
 - chooses a card
 - chooses a number of pilz
 Defense level = attack(card) x (1 + nb of pilz)

Result:
 If attack > defense
     Defender loses Power(attacker's card)
  Else
     Attacker loses Power(defender's card)
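(A sketch of one attack resolution as described above; the card
fields "attack" and "power" are hypothetical names for the quantities
on the slide.)

    def resolve_attack(att_card, att_pilz, def_card, def_pilz):
        # Levels are attack(card) x (1 + nb of pilz); if attack > defense
        # the defender loses Power(attacker's card) life points,
        # else the attacker loses Power(defender's card).
        attack_level = att_card["attack"] * (1 + att_pilz)
        defense_level = def_card["attack"] * (1 + def_pilz)
        if attack_level > defense_level:
            return "defender", att_card["power"]
        return "attacker", def_card["power"]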
Let's have fun with Urban Rivals
==> The MCTS-based AI is now at the best human level.

Experimental (only) remarks on EXP3:

- discarding strategies with a small number of sims = better approx
   of the Nash equilibrium

- also an improvement by taking into account the other bandit

- not yet compared to INF

- virtual simulations (inspired by Kummer)
Let's have fun with a nice
application




)              We are now at the
                   best human
             level in Urban Rivals.
Outline

Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weakness
Games as benchmarks?
Game of Go (9x9 here)

(Figure slides: a 9x9 game of Go, shown move by move.)
Game of Go: counting territories
(white has 7.5 “bonus” as black starts)
Game of Go: the rules
        Black plays at the blue circle: the
        white group dies (it is removed).

It's impossible to kill white (two “eyes”).

      “Ko” rule: we don't come back to the same situation.

                           (without ko: “PSPACE-hard”;
                           with ko: “EXPTIME-complete”)

  At the end, we count territories
  ==> black starts, so +7.5 for white.
Easy for computers ... because
human knowledge is easy to encode.
Difficult for computers
         (pointed out to me by J.M. Jolion in 1998)
Key point in Go: there are human-easy
situations which are computer-hard.

                           We'll see much easier
                           situations that are
                           poorly understood.

                               (komi 7.5)
Difficult for computers (win for
black, playing A)

                         We'll see much easier
                         situations that are
                         poorly understood.

                             (komi 7.5)

                         But let's see an easier case.
A trivial semeai (a capturing race)

           Plenty of equivalent
                     situations!

            They are randomly
               sampled, with
             no generalization.

             50% of estimated
               win probability!
It does not work. Why?

                                             50% of estimated
                                               win probability!

In the first node:
 - The first simulations give ~ 50%
 - The next simulations go to 100% or 0% (depending
   on the chosen move)
 - But, then, we switch to another node
                                               (~ 8! x 8! such nodes)
And the humans?

                                 50% of estimated
                                   win probability!

In the first node:
 - The first simulations give ~ 50%
 - The next simulations go to 100% or 0% (depending
   on the chosen move)
 - But, then, we DON'T switch to another node
Semeais

Should white play in the semeai (G1)
or capture (J15)?

Semeais

Should black play the semeai?

Useless!
Outline

Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weakness
Games as benchmarks?
Difficult board games: Havannah

                      Very difficult for computers.
                      Very simple to implement.
What else? First-Person Shooting
(UCT for partially observable MDPs)
What else? Games with
        simultaneous actions or hidden information

                                                 Flory, Teytaud,
                                                  Evostar 2011
Games with hidden information.
Games with simultaneous actions.

 UrbanRivals = internet card game;
  11 million registered users.
  A game with hidden information.
What else? Real-Time Strategy Games
   (multiple actors, partially obs.)
What else? Sports (continuous control)
“Real” games

     Assumption: if a computer understands and guesses spins, then
     this robot will be efficient for something else than just games.

     (holds true for Go)
What else? Collaborative sports
When is MCTS relevant?

 Robust in front of:
High dimension;
Non-convexity of Bellman values;
Complex models;
Delayed reward.

More difficult for:
High values of H;
Highly unobservable cases (Monte-Carlo, but not
Monte-Carlo Tree Search, see Cazenave et al.);
Lack of a reasonable baseline for the MC.

Example, Go: H ~ 300; dimension = 361;
fully observable; fully delayed reward.
When is MCTS relevant?

How to apply it:
Implement the transition
                (a function action x state → state )

Design a Monte-Carlo part (a random simulation)
                     (a heuristic in one-player games;
                              difficult if two opponents)

        ==> at this point you can simulate...

Implement UCT (just a bias in the simulator – no real optimizer;
see the sketch after this list)

Possibly add:
RAVE values (Gelly et al)
Parallelization: multicores + MPI (Cazenave et al, Gelly et al)
Decisive moves + anti-decisive moves (Teytaud et al)
Patterns (Bouzy et al)
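(A minimal sketch of this recipe for a one-player problem: a
transition, a random playout, and UCT as a bias in the simulator;
every interface name here is a hypothetical stand-in, and states are
assumed hashable.)

    import math

    def uct_search(root, legal, transition, playout, is_terminal,
                   n_sims=10000, k=1.0):
        # Minimal UCT: the tree is a dict state -> {move: [wins, visits]}.
        tree = {}
        for _ in range(n_sims):
            s, path = root, []
            while s in tree and not is_terminal(s):   # bias: UCB descent
                stats = tree[s]
                n = sum(v for _, v in stats.values())
                m = max(stats, key=lambda a:
                        float("inf") if stats[a][1] == 0 else
                        stats[a][0] / stats[a][1]
                        + k * math.sqrt(math.log(n) / stats[a][1]))
                path.append((s, m))
                s = transition(s, m)
            if not is_terminal(s) and s not in tree:  # expand one new node
                tree[s] = {m: [0, 0] for m in legal(s)}
            r = playout(s)                            # the Monte-Carlo part
            for s0, m in path:                        # backpropagate reward
                tree[s0][m][0] += r
                tree[s0][m][1] += 1
        return max(tree[root], key=lambda m: tree[root][m][1])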
Advantages of MCTS:
easy + visible

Many indicators (not only the expectation;
simulation-based; visible; easy to check)

The algorithm is indeed simpler than DP (unless
in-depth optimization, as for Go competitions...)

Anytime (you stop when you want)
Advantages of MCTS:
general

No convexity assumption

Arbitrarily complex models

I can add an opponent

High dimension is OK
Drawbacks of MCTS



Recent method


Impact of H not clearly known (?)


No free lunch: a model of the transition /
uncertainties is required (but, an advantage:
no constraint)
    (but: Fonteneau et al, Model-free MC)
Conclusion
Essentially asymptotically proved only
Empirically good for:
The game of Go
Some other (difficult) games
Non-linear expensive optimization
Active learning
Tested industrially (Spiral library – architecture-specific)
There are understood (but not solved) weaknesses

 Next challenges:
Solve these weaknesses (introducing learning? Refutation tables? Drake et al)
More industrial applications
Partially observable cases (Cazenave et al, Rolet et al, Auger)
Large H: truncating (Lorentz)
Scalability (Doghmen et al)
More Related Content

Viewers also liked

Tools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power SystemsTools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power SystemsOlivier Teytaud
 
Provocative statements around energy
Provocative statements around energyProvocative statements around energy
Provocative statements around energyOlivier Teytaud
 
Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...
Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...
Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...Olivier Teytaud
 
The game of Go and energy; two nice computational intelligence problems (with...
The game of Go and energy; two nice computational intelligence problems (with...The game of Go and energy; two nice computational intelligence problems (with...
The game of Go and energy; two nice computational intelligence problems (with...Olivier Teytaud
 
Noisy Optimization combining Bandits and Evolutionary Algorithms
Noisy Optimization combining Bandits and Evolutionary AlgorithmsNoisy Optimization combining Bandits and Evolutionary Algorithms
Noisy Optimization combining Bandits and Evolutionary AlgorithmsOlivier Teytaud
 
Choosing between several options in uncertain environments
Choosing between several options in uncertain environmentsChoosing between several options in uncertain environments
Choosing between several options in uncertain environmentsOlivier Teytaud
 
Energy Management (production side)
Energy Management (production side)Energy Management (production side)
Energy Management (production side)Olivier Teytaud
 
Games with partial information
Games with partial informationGames with partial information
Games with partial informationOlivier Teytaud
 
Ilab Metis: we optimize power systems and we are not afraid of direct policy ...
Ilab Metis: we optimize power systems and we are not afraid of direct policy ...Ilab Metis: we optimize power systems and we are not afraid of direct policy ...
Ilab Metis: we optimize power systems and we are not afraid of direct policy ...Olivier Teytaud
 
Artificial intelligence and blind Go
Artificial intelligence and blind GoArtificial intelligence and blind Go
Artificial intelligence and blind GoOlivier Teytaud
 
Optimization of power systems - old and new tools
Optimization of power systems - old and new toolsOptimization of power systems - old and new tools
Optimization of power systems - old and new toolsOlivier Teytaud
 
Artificial Intelligence and Optimization with Parallelism
Artificial Intelligence and Optimization with ParallelismArtificial Intelligence and Optimization with Parallelism
Artificial Intelligence and Optimization with ParallelismOlivier Teytaud
 

Viewers also liked (17)

Tools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power SystemsTools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power Systems
 
Grenoble
GrenobleGrenoble
Grenoble
 
Provocative statements around energy
Provocative statements around energyProvocative statements around energy
Provocative statements around energy
 
Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...
Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...
Tools for artificial intelligence: EXP3, Zermelo algorithm, Alpha-Beta, and s...
 
Labex2012g
Labex2012gLabex2012g
Labex2012g
 
Theory of games
Theory of gamesTheory of games
Theory of games
 
The game of Go and energy; two nice computational intelligence problems (with...
The game of Go and energy; two nice computational intelligence problems (with...The game of Go and energy; two nice computational intelligence problems (with...
The game of Go and energy; two nice computational intelligence problems (with...
 
Noisy Optimization combining Bandits and Evolutionary Algorithms
Noisy Optimization combining Bandits and Evolutionary AlgorithmsNoisy Optimization combining Bandits and Evolutionary Algorithms
Noisy Optimization combining Bandits and Evolutionary Algorithms
 
Choosing between several options in uncertain environments
Choosing between several options in uncertain environmentsChoosing between several options in uncertain environments
Choosing between several options in uncertain environments
 
Hydroelectricity
HydroelectricityHydroelectricity
Hydroelectricity
 
Energy Management (production side)
Energy Management (production side)Energy Management (production side)
Energy Management (production side)
 
Openoffice and Linux
Openoffice and LinuxOpenoffice and Linux
Openoffice and Linux
 
Games with partial information
Games with partial informationGames with partial information
Games with partial information
 
Ilab Metis: we optimize power systems and we are not afraid of direct policy ...
Ilab Metis: we optimize power systems and we are not afraid of direct policy ...Ilab Metis: we optimize power systems and we are not afraid of direct policy ...
Ilab Metis: we optimize power systems and we are not afraid of direct policy ...
 
Artificial intelligence and blind Go
Artificial intelligence and blind GoArtificial intelligence and blind Go
Artificial intelligence and blind Go
 
Optimization of power systems - old and new tools
Optimization of power systems - old and new toolsOptimization of power systems - old and new tools
Optimization of power systems - old and new tools
 
Artificial Intelligence and Optimization with Parallelism
Artificial Intelligence and Optimization with ParallelismArtificial Intelligence and Optimization with Parallelism
Artificial Intelligence and Optimization with Parallelism
 

Recently uploaded

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Tutorialmcts

  • 1. Bandit-based Monte-Carlo planning: the game of Go and beyond Designing intelligent agents with Monte-Carlo Tree Search. Olivier.Teytaud@inria.fr + F. Teytaud + H. Doghmen + others TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence. Keywords: UCB, EXP3, MCTS, UCT. Paris April 2011. Games, games with hidden information, games with simultaneous actions.
  • 2. Key point PLEASE INTERRUPT ME ! HAVE QUESTIONS ! LET'S HAVE A FRIENDLY SESSION ! ASK QUESTIONS NOW AND LATER BY MAIL! olivier.teytaud@inria.fr
  • 3. Outline Introduction: games / control / planning. Standard approaches Bandit-based Monte-Carlo Planning and UCT. Application to the game of Go and (far) beyond
  • 5. A game is a directed graph with actions 1 2 3 Games with simultaneous actions Paris 1st of February 5
  • 6. A game is a directed graph with actions and players 1 White Black 2 3 White 12 43 White Black Black Black Black Games with simultaneous actions Paris 1st of February 6
  • 7. A game is a directed graph with actions and players and observations Bob Bear Bee Bee 1 White Black 2 3 White 12 43 White Black Black Black Black Games with simultaneous actions Paris 1st of February 7
  • 8. A game is a directed graph with actions and players and observations and rewards Bob Bear Bee Bee 1 White Black 2 3 +1 0 White 12 43 Rewards White Black on leafs Black only! Black Black Games with simultaneous actions Paris 1st of February 8
  • 9. A game is a directed graph +actions +players +observations +rewards +loops Bob Bear Bee Bee 1 White Black 2 3 +1 0 White 12 43 White Black Black Black Black Games with simultaneous actions Paris 1st of February 9
  • 10. More than games in this formalism A main application: the management of many energy stocks in front of randomness At each time step we see random outcomes We have to make decisions (switching on or off) We have losses (ANR / NSC project)
  • 11. Opening a reservoir produces energy (and water goes to another reservoir) Classical Thermal plants Reservoir 2 Reservoir 1 Reservoir 5 Electricity demand Reservoir 4 Reservoir 3 Nuclear plants Lost water
  • 12. Outline Introduction: games / control / planning. Standard approaches Bandit-based Monte-Carlo Planning and UCT. Application to the game of Go and (far) beyond
  • 13. What are the approaches ? Dynamic programming (Massé – Bellman 50's) (still the main approach in industry) (minimax / alpha-beta in games) Reinforcement learning (some promising results, less used in industry) Some tree exploration tools (less usual in stochastic or continuous cases) Bandit-Based Monte-Carlo planning Scripts + tuning
  • 14. What are the approaches ? Dynamic programming (Massé – Bellman 50's) (still the main approach in industry) Where we are: Done: Presentation of the problem. Now: We briefly present dynamic programming Thereafter: We present MCTS / UCT.
  • 15. Dynamic programming. V(x) = expectation of the future loss if the optimal strategy is followed after state x (well defined). u(x) = the decision such that the expectation of V(f(x,u(x),A)) is minimal. Computation by dynamic programming: we compute V for all the final states x.
  • 16. Dynamic programming. V(x) = expectation of C(xH) under the optimal strategy (well defined). u(x) = the decision such that the expectation of V(f(x,u(x),A)) is minimal. Computation by dynamic programming: we compute V for all the final states x, then for all the “-1” states x (one step before the end).
  • 17. Dynamic programming (DP). Same recursion, continued backwards: we compute V for all the final states x, then all the “-1” states, then all the “-2” states, and so on back to the initial time step.
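The backward recursion above is easy to make concrete. Below is a minimal sketch on a toy finite-horizon problem; the transition f, the noise A, the terminal cost C and all the numbers are illustrative stand-ins, not taken from the slides.

    # Toy finite-horizon dynamic programming (backward induction).
    H = 3                          # horizon
    states = range(5)              # toy state space
    actions = (-1, 0, 1)           # toy decisions u
    noises = (0, 1)                # toy random outcomes A (uniform)

    def f(x, u, a):                # transition, clipped to the state space
        return min(max(x + u + a, 0), 4)

    def C(x):                      # terminal cost at the final states
        return abs(x - 2)

    V = {(H, x): C(x) for x in states}        # V for all the final states
    policy = {}
    for t in reversed(range(H)):              # the "-1" states, "-2" states...
        for x in states:
            # expected V(f(x, u, A)) for each decision u
            q = {u: sum(V[t + 1, f(x, u, a)] for a in noises) / len(noises)
                 for u in actions}
            policy[t, x] = min(q, key=q.get)  # u(x): minimizes the expectation
            V[t, x] = q[policy[t, x]]

    print(V[0, 0], policy[0, 0])              # value and decision at the start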
  • 19. Alpha-beta = DP + pruning
  • 20. Alpha-beta = DP + pruning “Nevertheless, I believe that a world-champion- level Go machine can be built within 10 years, based upon the same method of intensive analysis--brute force, basically--that Deep Blue employed for chess.” Hsu, IEEE Spectrum, 2007. (==> I don't think so.)
  • 21. Extensions of DP. Approximate dynamic programming (e.g. for continuous domains). Reinforcement learning: cases where f(...) or A is a black box; huge state spaces ==> but lack of stability. Direct Policy Search, Fitted-Q-Iteration... ==> there is room for improvement.
  • 22. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions Weakness Games as benchmarks ?
  • 23. Monte-Carlo Tree Search Monte-Carlo Tree Search (MCTS) appeared in games. R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, Turin, Italy, 2006. Its most well-known variant is termed Upper Confidence Tree (UCT).
  • 24. UCT (Upper Confidence Trees) Coulom (06) Chaslot, Saito & Bouzy (06) Kocsis Szepesvari (06)
  • 25. UCT
  • 26. UCT
  • 27. UCT
  • 28. UCT
  • 29. UCT Kocsis & Szepesvari (06)
  • 31.–33. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
  • 34. ... or exploration ? SCORE = 0/2 + k.sqrt( log(10)/2 )
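In code, the score used on slides 31–34 (empirical mean plus an exploration bonus) is one line; the constant k and the counts below are just the slides' running example with 10 simulations at the parent.

    import math

    def ucb_score(wins, trials, parent_trials, k=1.0):
        # exploitation (empirical mean) + exploration (log bonus)
        return wins / trials + k * math.sqrt(math.log(parent_trials) / trials)

    print(ucb_score(5, 7, 10))   # exploitation-leaning arm: 5/7 + k.sqrt(log(10)/7)
    print(ucb_score(0, 2, 10))   # exploration-leaning arm:  0/2 + k.sqrt(log(10)/2)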
  • 35.–39. Parallelizing MCTS. On a parallel machine with shared memory: just run many simulations in parallel, with the same memory for all. On a parallel machine with no shared memory: one MCTS per compute node, and 3 times per second: select the nodes with at least 5% of the total simulations (depth at most 3), and average all statistics on these nodes ==> communication cost = log(number of compute nodes).
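A sketch of the no-shared-memory scheme just described: each compute node keeps its own tree, and a few times per second the statistics of heavily simulated shallow nodes are averaged. The 5% / depth-3 thresholds follow the slide; the Node class and everything else are illustrative stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Node:
        key: str        # identifies the same position across trees
        depth: int
        sims: float
        wins: float

    def synchronize(trees, total_sims, max_depth=3, ratio=0.05):
        # group, across trees, the nodes with >= 5% of total sims, depth <= 3
        heavy = {}
        for tree in trees:
            for n in tree:
                if n.depth <= max_depth and n.sims >= ratio * total_sims:
                    heavy.setdefault(n.key, []).append(n)
        for group in heavy.values():          # average their statistics
            sims = sum(n.sims for n in group) / len(group)
            wins = sum(n.wins for n in group) / len(group)
            for n in group:
                n.sims, n.wins = sims, wins

    # two compute nodes, same root move "a", different statistics:
    t1, t2 = [Node("a", 1, 600, 300)], [Node("a", 1, 400, 280)]
    synchronize([t1, t2], total_sims=1000)
    print(t1[0].wins, t2[0].wins)             # both averaged to 290.0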
  • 40. Good news: it works... but such speed-up numbers are misleading (see the next slides).
  • 41. Much better than voting schemes, but little difference with T. Cazenave's approach (depth 0).
  • 42. Every month, someone tells us: Try with a bigger machine ! And win against top pros !
  • 43. Being faster is not the solution
  • 44. The same in Havannah (F. Teytaud)
  • 45. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions Weakness Games as benchmarks ?
  • 46. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions More than UCT in MCTS Infinite action spaces Offline learning Online learning Expert knowledge Hidden information
  • 47. Why is UCT suboptimal for games? There are better formulas than mean + sqrt( log(...) / ... ) (= UCT). MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be: consistent (→ finds the best move); frugal (if there is a good move, it does not visit all of the tree infinitely often). (==> not true for UCT)
  • 48. Why is UCT suboptimal for games? There are better formulas than mean + sqrt( log(...) / ... ) (= UCT). There is a better formula for deterministic win/draw/loss games: (sumRewards+K) / (nbTrials+2K). MCTS, under mild conditions on games (including deterministic two-player zero-sum games), can be: consistent (→ finds the best move); frugal (if there is a good move, it does not visit all of the tree infinitely often). (==> not true for UCT)
  • 49. Go: from 29 to 6 stones. Formula for the simulation part: argmax (nbWins + 1) / (nbLosses + 2). Berthier, Doghmen, T., LION 2010 ==> consistency, frugality.
  • 50. It depends on the game and on the tuning...
  • 51. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions More than UCT in MCTS Infinite action spaces Offline learning Online learning Expert knowledge Hidden information
  • 52. Infinite action spaces: progressive widening. UCB1: choose u maximizing the compromise: empirical average for decision u + √( log(i) / number of trials with decision u ) ==> argmax taken only over the first ~i^α arms, α ∈ [0.25, 0.5]. (Coulom, Chaslot et al, Wang et al)
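A sketch of progressive widening as I read this slide: at the i-th simulation in a node, the argmax is restricted to the first ceil(i^alpha) arms, with alpha in the [0.25, 0.5] range that survives in the transcript. The exact rule is my reconstruction, not guaranteed to match the slide's figure.

    import math

    def considered_arms(arms, i, alpha=0.5):
        # progressive widening: at simulation i, only the first ceil(i^alpha)
        # arms (in some fixed ordering, e.g. by a heuristic) are candidates
        return arms[:max(1, math.ceil(i ** alpha))]

    arms = list(range(100))       # a large / infinite action space, ordered
    for i in (1, 10, 100, 1000):
        print(i, len(considered_arms(arms, i)))   # 1, 4, 10, 32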
  • 53. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions More than UCT in MCTS Infinite action spaces Offline learning Online learning Expert knowledge Hidden information
  • 54. Extensions ``Standard'' UCT: score(situation,move) = compromise (in [0,1+] ) between a) empirical quality P ( win | nextMove(situation) = move ) estimated in simulations b) exploration term Remark: No offline learning
  • 55. Extension: offline learning (introducing imitation learning). c) offline value (Bouzy et al) = empirical estimate of P ( played | pattern ). Pattern = ball of locations, each location being one of: black stone / white stone / empty / not a black stone / not a white stone / not empty / border. Support = frequency of “the center of this pattern is played”. Confidence = conditional frequency of play. Bias = confidence of the pattern with max support.
  • 56. Extension: offline learning (introducing imitation learning). score(situation,move) = compromise between a) empirical quality, b) exploration term, and c) offline value (Bouzy et al, Coulom) = empirical estimate of P ( played | pattern ) for patterns with big support ==> estimated on a database. At first, (c) is the most important; later, (a) dominates.
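A sketch of that compromise: the offline value c) acts as a prior that dominates at low simulation counts and fades as the empirical quality a) accumulates. The decay schedule and the weights below are illustrative choices, not the exact formula of any Go program.

    import math

    def score(wins, sims, parent_sims, offline_value, k=1.0, prior_weight=10.0):
        empirical = wins / (sims + 1)                                    # a)
        explore = k * math.sqrt(math.log(parent_sims + 1) / (sims + 1))  # b)
        w = prior_weight / (prior_weight + sims)  # weight of c), decays with sims
        return (1 - w) * empirical + w * offline_value + explore

    print(score(0, 0, 1, offline_value=0.8))       # no data yet: the prior decides
    print(score(60, 100, 200, offline_value=0.8))  # later: empirical quality dominates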
  • 57. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions More than UCT in MCTS Infinite action spaces Offline learning Online learning Expert knowledge Hidden information
  • 58. Extensions ``Standard'' UCT: score(situation,move) = compromise between a) empirical quality b) exploration term Remark: No learning from one situation to another
  • 59. Extension: transient values. score(situation,move) = compromise between a) empirical quality P' ( win | nextMove(situation) = move ) estimated in simulations; b) exploration term; c) offline value; d) “transient” value (Gelly et al, 07): P' ( win | move ∈ laterMoves(situation) ) ==> brings information from node N to ancestor node M ==> does not bring information from node N to descendants or cousins (many people have tried...). Brügmann, Gelly et al
  • 60. Transient values = RAVE = very good in many games. It also works in Havannah.
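A sketch of the RAVE blend behind slides 59–60: the transient statistic P'( win | move ∈ laterMoves ) is mixed with the node's own statistic, with a weight that decays as direct simulations accumulate. This β schedule is one common illustrative choice, not necessarily MoGo's exact formula.

    def rave_score(wins, sims, amaf_wins, amaf_sims, k=1000.0):
        q = wins / sims if sims else 0.5        # direct: P(win | nextMove = move)
        q_rave = amaf_wins / amaf_sims if amaf_sims else 0.5  # transient statistic
        beta = k / (k + sims)   # -> 1 with few sims (trust RAVE), -> 0 with many
        return beta * q_rave + (1 - beta) * q

    print(rave_score(wins=3, sims=5, amaf_wins=120, amaf_sims=200))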
  • 61. It also works in NoGo. NoGo = rules of Go, except that capturing ==> losing.
  • 62. Counter-example to RAVE: B2 (by M. Müller). B2 makes sense only if it is played immediately (otherwise A5 kills).
  • 63. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions More than UCT in MCTS Infinite action spaces Offline learning Online learning Expert knowledge Hidden information
  • 64. Extensions ``Standard'' UCT: score(situation,move) = compromise between a) empirical quality b) exploration term Remarks: No expert rules
  • 65. Extension: expert rules. score(situation,move) = compromise between a) empirical quality, b) exploration term, c) offline value, d) transient value, e) expert rules ==> an empirically derived linear combination. Most important terms: (e)+(c) at first, then (d) becomes stronger, finally (a) only.
  • 66. Extension: expert rules in the Monte-Carlo part. Decisive moves: play immediate wins. Anti-decisive moves: don't play moves with an immediate winning reply. Teytaud & Teytaud, CIG 2010: can be fast in connection games, e.g. Havannah (see the sketch below).
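A sketch of one playout step with decisive and anti-decisive moves; the three callables stand for a hypothetical game interface and are not from the slides.

    import random

    def playout_move(moves, wins_immediately, winning_replies):
        # decisive move: take an immediate win when one exists
        for m in moves:
            if wins_immediately(m):
                return m
        # anti-decisive move: avoid moves with an immediate winning reply
        safe = [m for m in moves if not winning_replies(m)]
        return random.choice(safe if safe else moves)

    # toy usage: "b" wins now; "c" would let the opponent win back
    print(playout_move(["a", "b", "c"],
                       wins_immediately=lambda m: m == "b",
                       winning_replies=lambda m: ["x"] if m == "c" else []))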
  • 67. Go: from 29 to 6 stones 1998: loss against amateur (6d) 19x19 H29 2008: win against a pro (8p) 19x19, H9 MoGo 2008: win against a pro (4p) 19x19, H8 CrazyStone 2008: win against a pro (4p) 19x19, H7 CrazyStone 2009: win against a pro (9p) 19x19, H7 MoGo 2009: win against a pro (1p) 19x19, H6 MoGo 2010: win against a pro (4p) 19x19, H6 Zen 2010: win against a pro (5p) 19x19, H6 Zen 2007: win against a pro (5p) 9x9 (blitz) MoGo 2008: win against a pro (5p) 9x9 white MoGo 2009: win against a pro (5p) 9x9 black MoGo 2009: win against a pro (9p) 9x9 white Fuego 2009: win against a pro (9p) 9x9 black MoGoTW ==> still 6 stones at least!
  • 68. Go: from 29 to 6 stones. (Same list as slide 67, annotated:) the wins with H6 / H7 are lucky (rare) wins.
  • 69. Go: from 29 to 6 stones. (Same list, annotated:) the 9x9 wins include wins with the disadvantageous side. ==> still 6 stones at least!
  • 70. 13x13 Go: new results! 9x9 Go: computers are at the best human level. - Fuego won against a top-level human as white - MoGoTW did it both as black and as white, and regularly wins some games against the top players - MoGoTW won 3 out of 4 yesterday in blind go (blind go = go in 9x9, according to the pros). 19x19 Go: the best humans still (almost always) win easily with 7 handicap stones. In WCCI 2010, experiments in 13x13 Go: - MoGo won 2/2 against 6D players with handicap 2 - MfoG (Many Faces of Go) won 1/2 against 6D players with handicap 2 - Fuego won 0/2 against 6D players with handicap 2. And yesterday MoGoTW won one game with handicap 2.5!
  • 71. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions More than UCT in MCTS Infinite action spaces Offline learning Online learning Expert knowledge Hidden information
  • 72. Bandits. We have seen UCB: choose the action with maximal score Q(action,state) = empirical_reward(action,state) + sqrt( log(nbSims(state)) / nbSims(action,state) ). EXP3 is another bandit: for adversarial cases, based on a stochastic formula.
  • 73. EXP3 in one slide. Grigoriadis et al, Auer et al, Audibert & Bubeck (COLT 2009).
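The formula on slide 73 is an image and does not survive in this transcript; as a substitute, here is a minimal sketch of the standard EXP3 of Auer et al. (exploration rate gamma, rewards in [0,1], importance-weighted update), with an illustrative toy bandit.

    import math, random

    def exp3(num_arms, num_rounds, reward, gamma=0.1):
        w = [1.0] * num_arms                  # weights -> a mixed strategy
        for _ in range(num_rounds):
            total = sum(w)
            p = [(1 - gamma) * wi / total + gamma / num_arms for wi in w]
            arm = random.choices(range(num_arms), weights=p)[0]
            # importance-weighted update: only the played arm's weight moves
            w[arm] *= math.exp(gamma * reward(arm) / (p[arm] * num_arms))
        total = sum(w)
        return [wi / total for wi in w]       # final mixed strategy

    # toy bandit: arm 1 pays 1 with probability 0.7, the others never
    print(exp3(3, 5000, lambda a: 1.0 if a == 1 and random.random() < 0.7 else 0.0))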
  • 74. MCTS for simultaneous actions Player 1 plays Player 2 plays Both players play ... Player 1 plays Player 2 plays
  • 75. MCTS for simultaneous actions (Flory, Teytaud, Evostar 2011): Player 1 plays = maxUCB node; Player 2 plays = minUCB node; both players play = EXP3 node; then again Player 1 plays = maxUCB node, Player 2 plays = minUCB node, ...
  • 76. MCTS for hidden information. [Diagram: for each player, each observation set (1, 2, 3) gets its own EXP3 node.]
  • 77. MCTS for hidden information. “Observation set” = a set of sequences of observations.
  • 78. MCTS for hidden information. Here, the possible sequences of observations are partitioned into 3 observation sets.
  • 79. MCTS for hidden information. Thanks Martin. (Incrementally + application to phantom-tic-tac-toe: see D. Auger 2011.)
  • 80. MCTS for hidden information. Use EXP3: consistent even in the adversarial setting. (Incrementally + application to phantom-tic-tac-toe: see D. Auger 2010.)
  • 81.–89. MCTS with hidden information (slides 82–89 step through this loop):

        While (there is time for thinking) {
          s = initial state
          os1 = observationSet1 = (); os2 = ()
          while (s not terminal) {
            b1 = bandit1(os1); b2 = bandit2(os2)
            d1 = b1.makeDecision; d2 = b2.makeDecision
            (s, o1, o2) = transition(s, d1, d2)
            os1 = os1.o1; os2 = os2.o2
          }
          send reward to all bandits in the simulation
        }
  • 90. MCTS with hidden information, incremental version: the same loop, but possibly refine the family of bandits between simulations.
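The loop above, transcribed into runnable Python on a toy one-step simultaneous-move game (matching pennies), with one EXP3 bandit per observation set. The class, the toy transition and the reward are illustrative stand-ins for the slides' abstract names.

    import math, random

    class Exp3Bandit:
        def __init__(self, k, gamma=0.1):
            self.k, self.gamma, self.w = k, gamma, [1.0] * k
        def probs(self):
            t = sum(self.w)
            return [(1 - self.gamma) * wi / t + self.gamma / self.k
                    for wi in self.w]
        def make_decision(self):
            return random.choices(range(self.k), weights=self.probs())[0]
        def reward(self, arm, r):             # importance-weighted update
            self.w[arm] *= math.exp(self.gamma * r / (self.probs()[arm] * self.k))
            m = max(self.w)                   # renormalize (probs unchanged)
            self.w = [wi / m for wi in self.w]

    bandits1, bandits2 = {}, {}               # one bandit per observation set

    def bandit(table, obs):
        return table.setdefault(obs, Exp3Bandit(2))

    def transition(s, d1, d2):                # toy game: one simultaneous move;
        return "terminal", str(d1), str(d2)   # each player observes its own move

    for _ in range(5000):                     # while there is time for thinking
        s, os1, os2, trace = "initial", "", "", []
        while s != "terminal":
            b1, b2 = bandit(bandits1, os1), bandit(bandits2, os2)
            d1, d2 = b1.make_decision(), b2.make_decision()
            s, o1, o2 = transition(s, d1, d2)
            trace.append((b1, d1, b2, d2))
            os1, os2 = os1 + o1, os2 + o2
        r = 1.0 if d1 == d2 else 0.0          # matching pennies: player 1's reward
        for b1, d1, b2, d2 in trace:          # send reward to all bandits
            b1.reward(d1, r)
            b2.reward(d2, 1.0 - r)

    print([round(p, 2) for p in bandits1[""].probs()])  # hovers near Nash: [0.5, 0.5]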
  • 91. Let's have fun with a nice application (
  • 92. Let's have fun with Urban Rivals (4 cards). Each player has: four cards (each one can be used once), 12 pilz (each one can be used once), and 12 life points. Each card has: one attack level, one damage value, and special effects (forget them for the moment). Four turns: P1 attacks P2; P2 attacks P1; P1 attacks P2; P2 attacks P1.
  • 93. Let's have fun with Urban Rivals. First, the attacker plays: chooses a card, and chooses ( PRIVATELY ) a number of pilz. Attack level = attack(card) x (1 + nb of pilz). Then, the defender plays: chooses a card and a number of pilz. Defense level = attack(card) x (1 + nb of pilz). Result: if attack > defense, the defender loses Power(attacker's card); else, the attacker loses Power(defender's card).
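The resolution rule of slide 93 as a tiny sketch. The dict-based card representation is illustrative, and I read the slide's Power(card) as the card's damage value from slide 92 — an assumption, not confirmed by the transcript.

    def resolve(att_card, att_pilz, def_card, def_pilz):
        # attack / defense level = attack(card) x (1 + number of pilz spent)
        attack = att_card["attack"] * (1 + att_pilz)
        defense = def_card["attack"] * (1 + def_pilz)
        if attack > defense:
            return "defender loses", att_card["damage"]
        return "attacker loses", def_card["damage"]   # ties favor the defender

    # the attacker commits its pilz count privately before the defender answers
    print(resolve({"attack": 6, "damage": 4}, 3, {"attack": 7, "damage": 2}, 2))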
  • 94. Let's have fun with Urban Rivals ==> the MCTS-based AI is now at the best human level. Experimental (only) remarks on EXP3: - discarding strategies with a small number of sims = better approximation of the Nash equilibrium - also an improvement by taking the other bandit into account - not yet compared to INF - virtual simulations (inspired by Kummer).
  • 95. Let's have fun with a nice application ) We are now at the best human level in Urban Rivals.
  • 96. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions Weakness Games as benchmarks ?
  • 97. Game of Go (9x9 here)
  • 104. Game of Go: counting territories (white has 7.5 “bonus” as black starts)
  • 105. Game of Go: the rules. Black plays at the blue circle: the white group dies (it is removed). It is impossible to kill white (two “eyes”). “Ko” rule: we don't come back to the same situation (without ko: “PSPACE-hard”; with ko: “EXPTIME-complete”). At the end, we count territories ==> black starts, so +7.5 for white.
  • 106. Easy for computers... because human knowledge is easy to encode.
  • 107. Difficult for computers (pointed out to me by J.M. Jolion in 1998).
  • 108. Key point in Go: there are human-easy situations which are computer-hard. We'll see much easier situations that are poorly understood. (komi 7.5)
  • 109. Difficult for computers (win for black, playing A). We'll see much easier situations that are poorly understood (komi 7.5), but let's see an easier case first.
  • 110.–120. A trivial semeai (shown over a sequence of equivalent positions): plenty of equivalent situations! They are randomly sampled, with no generalization ==> 50% of estimated win probability!
  • 121. It does not work. Why? 50% of estimated win probability! In the first node: the first simulations give ~50%; the next simulations go to 100% or 0% (depending on the chosen move); but, then, we switch to another node (~ 8! x 8! such nodes).
  • 122. And the humans? In the first node: the first simulations give ~50%; the next simulations go to 100% or 0% (depending on the chosen move); but, then, we DON'T switch to another node.
  • 127. Outline Discrete time control: various approaches Monte-Carlo Tree Search (UCT, MCTS; 2006) Extensions Weakness Games as benchmarks ?
  • 128. Difficult board games: Havannah Very difficult for computers. Very simple to implement.
  • 135. What else? Games with simultaneous actions or hidden information (Flory, Teytaud, Evostar 2011). UrbanRivals = internet card game with 11 million registered users; hidden information and simultaneous actions.
  • 136. What else? Real Time Strategy Games (multiple actors, partially observable).
  • 138.–139. “Real” games. Assumption: if a computer understands and guesses spins, then this robot will be efficient for something other than just games. (This held true for Go.)
  • 141. When is MCTS relevant ? Robust in front of: High dimension; Non-convexity of Bellman values; Complex models Delayed reward
  • 142. When is MCTS relevant ? Robust in front of: High dimension; Non-convexity of Bellman values; Complex models Delayed reward More difficult for High values of H; Highly unobservable cases (Monte-Carlo, but not Monte-Carlo Tree Search, see Cazenave et al.) Lack of reasonable baseline for the MC
  • 143. When is MCTS relevant? Robust in front of: high dimension; non-convexity of Bellman values; complex models; delayed reward. More difficult for: high values of H; highly unobservable cases; lack of a reasonable baseline for the MC. Go: H ~ 300, dimension = 361, fully observable, fully delayed reward.
  • 144. When is MCTS relevant ? How to apply it: Implement the transition (a function action x state → state ) Design a Monte-Carlo part (a random simulation) (a heuristic in one-player games; difficult if two opponents) ==> at this point you can simulate... Implement UCT (just a bias in the simulator – no real optimizer)
  • 145. When is MCTS relevant ? How to apply it: Implement the transition (a function action x state → state ) Design a Monte-Carlo part (a random simulation) (a heuristic in one-player games; difficult if two opponents) ==> at this point you can simulate... Implement UCT (just a bias in the simulator – no real optimizer) Possibly RAVE values (Gelly et al) Parallelize multicores + MPI (Cazenave et al, Gelly et al) Decisive moves + anti-decisive moves (Teytaud et al) Patterns (Bouzy et al)
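Those steps, assembled into a minimal one-player UCT skeleton on a toy problem (choose H binary moves; reward = their mean). The problem interface and the constant k are illustrative; a two-player version would negate rewards at opponent nodes.

    import math, random

    H, MOVES = 5, (0, 1)                      # toy problem
    def is_terminal(s): return len(s) == H
    def reward(s): return sum(s) / H          # outcome of a finished sequence

    class Node:
        def __init__(self):
            self.sims, self.wins, self.children = 0, 0.0, {}

    def uct(budget=2000, k=0.7):
        root = Node()
        for _ in range(budget):
            node, s, path = root, [], [root]
            # 1) descend through fully expanded nodes with the UCT score
            while not is_terminal(s) and len(node.children) == len(MOVES):
                m, node = max(node.children.items(),
                              key=lambda c: c[1].wins / c[1].sims
                              + k * math.sqrt(math.log(node.sims) / c[1].sims))
                s.append(m); path.append(node)
            # 2) expand one unvisited move
            if not is_terminal(s):
                m = random.choice([m for m in MOVES if m not in node.children])
                node.children[m] = Node()
                s.append(m); path.append(node.children[m])
            # 3) Monte-Carlo part: random simulation to the end
            while not is_terminal(s):
                s.append(random.choice(MOVES))
            # 4) backpropagate
            r = reward(s)
            for n in path:
                n.sims += 1; n.wins += r
        return max(root.children, key=lambda m: root.children[m].sims)

    print(uct())                              # picks the good first move: 1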
  • 146. Advantages of MCTS: easy + visible. Many indicators (not only an expectation; simulation-based; visible; easy to check). The algorithm is indeed simpler than DP (unless in-depth optimization, as for Go competitions...). Anytime (you stop when you want).
  • 147. Advantages of MCTS: general No convexity assumption Arbitrarily complex model I can add an opponent High dimension ok
  • 148. Drawbacks of MCTS. Recent method. Impact of H not clearly known (?). No free lunch: a model of the transitions / uncertainties is required (but, an advantage: no constraint on that model) (but see Fonteneau et al, model-free MC).
  • 149. Conclusion. Essentially proved only asymptotically. Empirically good for: the game of Go; some other (difficult) games; non-linear expensive optimization; active learning. Tested industrially (Spiral library – architecture-specific optimization). There are understood (but not solved) weaknesses. Next challenges: solve these weaknesses (introducing learning? refutation tables? Drake et al); more industrial applications; partially observable cases (Cazenave et al, Rolet et al, Auger); large H, truncating (Lorentz); scalability (Doghmen et al).