Bandit-based Monte-Carlo planning: the game of Go and beyond


The game of Go: recent progress for an old game

 Olivier.Teytaud@inria.fr + too many people to cite them all. Includes Inria, Cnrs, Univ.
Paris-Sud, LRI, CMAP, Univ. Amsterdam, Taiwan universities (including NUTN)

TAO, Inria-Saclay IDF, Cnrs 8623, Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal Network of Excellence.




Paris,
April 2010.
Outline: the game of Go

The history


The rules


Variants of go and complexity


Computers playing Go


Go and power plants
History: I'm not an expert

 Origins: I don't know. Too many dates in
  the literature. Does someone know?
 8th century: the game of Go in Japan?
 9th century: symmetric game?
 16th century: first schools?
 Recently:
   huge progress thanks to cultural differences in
    teaching?
   becomes known in Europe (cf. the interest in Asian
    cultures and Hikaru No Go)
Outline: the game of Go

The history


The rules


Variants of go and complexity


Computers playing Go


Go and power plants
Rules
Only recently
formalized in a
mathematical
sense.
For some rules:
the winner is not always clearly defined (comment
by a strong Japanese friend: in Asian
cultures this is not so important).
Recently: “komi” modified, superko adapted
so that there is no draw.
Time settings get smaller and smaller (TV +
younger players)
Game of Go: the rules
Black plays at the blue circle: the
white group dies (it is removed).

It's impossible to kill white (two “eyes”).

“Ko” rule: we don't come back to the same situation.

At the end, we count territories
==> black starts, so +7.5 for white.
Game of Go (9x9 here)
Game of Go: counting territories
(white has 7.5 “bonus” as black starts)
Outline: the game of Go

The history


The rules


Variants of go and complexity


Computers playing Go


Go and power plants
Introduction to games

Partially or fully observable
Randomized or not
Iterated or not
1,2,3,... players
Decentralized or not
Continuous or not
Infinite time or not
Complexity measures
(not always well defined)

Combinatorial complexity measures:
  State-space complexity
  Game-tree size
  Decision complexity
  Game-tree complexity
Computational complexity measures:
  Computational complexity
  Perfect-play complexity
State of the art level
Complexity measures
(not always well defined)

State-space complexity = number
        of possible states
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
Complexity measures
(not always well defined)


State-space complexity
Game-tree size = number of leaves
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
Complexity measures
(not always well defined)

State-space complexity
Game-tree size
Decision complexity = min # of
leaves of a tree proving perfect play
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
Complexity measures
(not always well defined)

State-space complexity
Game-tree size
Decision complexity
Game-tree complexity = # of leaves
for perfect play with constant depth
Computational complexity
Perfect-play complexity
State of the art level
Complexity measures
(not always well defined)

State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity (=
complexity classes, later)
Perfect-play complexity
State of the art level
Computational complexity:
 main reasons for this measure?

Good feeling of understanding
                             (disagree if you want :-) )
Explicit families of problems
                             (extracted by reduction)
Fun

Connections
                  with classical complexity measures
Much better for looking clever
                  (when you speak about NP-complete
                           problems you look clever)
Computational complexity

Known: P ⊆ NP ⊆ PSPACE ⊆ EXPTIME

Conjectured: strict inclusions everywhere.

Higher classes include undecidable cases.
Computational complexity

Given a class X, a problem q can be
in X
or harder than pbs in X (X-hard)
or both (X-complete)
or neither
                [Diagram: NP, NP-complete, NP-hard]
Computational complexity
For evaluating the complexity of your
problem:

1. Generalize your game to any size
                             (non-trivial for chess)
2. Consider the problem:
 - here is a board
 - is the situation a win under perfect play?
Computational complexity

==> cast into a decision problem (binary question)

==> can be used for choosing optimal move
                             (but not necessary)

==> trivial games can be EXPTIME-hard

==> no clear correlation with the fact that a game is difficult
    for a computer (when compared to humans)


Games



Introduction

Complexity measures

Computational complexity

Zoology
PSPACE vs EXPTIME

==> many important games are either in PSPACE or in EXPTIME


Theorem: if playing = filling a location
forever, then the game is in PSPACE
                   (not necessarily PSPACE-complete!),
since the game length is bounded by the board size.


Proof: depth-first search.
Applications: Hex, Havannah, Tic-Tac-Toe,
       Ponnuki-Go...
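The depth-first argument can be made concrete. Below is a minimal sketch in Python (not from the talk, just an illustration): a negamax solver for Tic-Tac-Toe. Each move fills a cell forever, so the recursion depth, and hence the memory, is bounded by the board size; no transposition table is needed, which is exactly the polynomial-space point.

```python
# Sketch of the depth-first PSPACE argument, on Tic-Tac-Toe: every
# move fills a cell forever, so the game lasts at most 9 plies and a
# depth-first minimax needs only O(depth) memory.

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def solve(board, player):
    """Negamax value for `player` to move: +1 win, 0 draw, -1 loss."""
    if winner(board) is not None:
        return -1                      # the previous mover just won
    if '.' not in board:
        return 0                       # full board: draw
    opp = 'O' if player == 'X' else 'X'
    return max(-solve(board[:i] + player + board[i+1:], opp)
               for i, cell in enumerate(board) if cell == '.')

print(solve('.' * 9, 'X'))             # 0: Tic-Tac-Toe is a draw
```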
NP / PSPACE / EXPTIME in Go
Tsumegos with no ko, forced moves only for
 W, 2 moves for B, polynomial length:
 NP-complete
Ponnuki-Go: PSPACE
Go without ko: PSPACE-hard
Go with ko + Japanese rules:
                    EXPTIME-complete
Go with ko + superko: unknown
Some phantom-rengo undecidable?

If Go with ko > Go without ko, then
              PSPACE ≠ EXPTIME
Complexity measures
(not always well defined)

State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity (complexity
of perfect algorithm)
State of the art level
Complexity measures
(not always well defined)


State-space complexity
Game-tree size
Decision complexity
Game-tree complexity
Computational complexity
Perfect-play complexity
State of the art level
State of the art level


Very weak solving
Means that we know who should win
Typically proved by strategy-stealing
E.g.: hex (first player wins), hex + swap
 (second player wins)
Weak solving
Strong solving
Best results so far
State of the art level


Very weak solving
Weak solving
Perfect play reached with reasonable computation
time
Biggest success: draughts (tens of years of
computation on tens of machines)
Strong solving
Best results so far
State of the art level


Very weak solving
Weak solving
Strong solving
       Perfect play from any situation in
       reasonable time (variants of Tic-Tac-Toe)
Best results so far
State of the art level
Very weak solving
Weak solving
Strong solving

Best results so far
Shi-Fu-Mi (rock-paper-scissors): humans lose
English draughts: humans + machines reach perfect
play
Chess: nobody can compete with machines
Ponnuki-Go: some variants solved
9x9 Go: MoGoTW won with the disadvantageous side
against a top player
Go: from 29 to 6 stones
1998: loss against amateur (6d) 19x19 H29
2008: win against a pro (8p) 19x19, H9        MoGo
2008: win against a pro (4p) 19x19, H8    CrazyStone
2008: win against a pro (4p) 19x19, H7    CrazyStone
2009: win against a pro (9p) 19x19, H7        MoGo
2009: win against a pro (1p) 19x19, H6        MoGo

2007: win against a pro (5p) 9x9 (blitz)     MoGo
2008: win against a pro (5p) 9x9 white       MoGo
2009: win against a pro (5p) 9x9 black       MoGo
2009: win against a pro (9p) 9x9 white       Fuego
2009: win against a pro (9p) 9x9 black        MoGoTW

==> still 6 stones at least!
Outline: the game of Go

The history


The rules


Variants of go and complexity


Computers playing Go


Go and power plants
Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) appeared
in games.
Its best-known variant is termed Upper
Confidence Tree (UCT).
Here I present UCT:
bandits;
the Monte-Carlo approach for tree search;
UCT.
A ``bandit'' problem

p_1, ..., p_N: unknown probabilities ∈ [0,1]
At each time step i ∈ [1,n]:
choose u_i ∈ {1,...,N} (as a function of u_j and r_j, j<i)
With probability p_{u_i}:
win ( r_i = 1 )
lose ( r_i = 0 )
A ``bandit'' problem: the target

p_1, ..., p_N: unknown probabilities ∈ [0,1]
At each time step i ∈ [1,n]:
choose u_i ∈ {1,...,N} (as a function of u_j and r_j, j<i)
With probability p_{u_i}:
win ( r_i = 1 )
lose ( r_i = 0 )

Regret: R_n = n · max_i p_i − Σ_{j<n} r_j
How to minimize the regret (worst case on p)?
Bandits – a classical solution

Regret: R_n = n · max_i p_i − Σ_{j<n} r_j


UCB1: Choose u maximizing the compromise:


  Empirical average for decision u
  + √( log(i)/ number of trials with decision u )


  ==> optimal regret O(log(n))
                                  (Lai et al; Auer et al)
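As an illustration, here is a minimal UCB1 sketch on the Bernoulli bandit above. The arm probabilities `p` and the horizon are toy assumptions for the demo (the algorithm only ever sees the 0/1 rewards), and the score is exactly the slide's compromise: empirical average plus √(log(i)/trials).

```python
import math, random

# Minimal UCB1 sketch on the Bernoulli bandit defined above; the true
# probabilities `p` are toy assumptions, invisible to the algorithm.

def ucb1(p, n):
    N = len(p)
    counts, sums, reward = [0] * N, [0.0] * N, 0.0
    for i in range(1, n + 1):
        if i <= N:
            u = i - 1                  # initialization: each arm once
        else:                          # the slide's compromise:
            u = max(range(N), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(math.log(i) / counts[a]))
        r = 1.0 if random.random() < p[u] else 0.0
        counts[u] += 1; sums[u] += r; reward += r
    return n * max(p) - reward         # realized regret

random.seed(0)
print(ucb1([0.2, 0.5, 0.7], 10000))    # regret grows as O(log n)
```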
Infinite bandit: progressive
widening
UCB1: choose u maximizing the compromise:

  empirical average for decision u
  + √( log(i) / number of trials with decision u )

  ==> argmax only on the first ⌈i^α⌉ arms,
            with α ∈ [0.25, 0.5]


                  (Coulom, Chaslot et al, Wang et al)
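A sketch of the widening schedule follows; the particular value α = 0.4 is an arbitrary assumption inside the slide's range [0.25, 0.5]. At step i, only the first ⌈i^α⌉ arms compete in the argmax, so an infinite arm set is opened gradually.

```python
import math

# Progressive-widening sketch; alpha = 0.4 is an arbitrary choice
# within the slide's range [0.25, 0.5].

def active_arms(i, alpha=0.4):
    """Number of arms competing in the argmax at time step i."""
    return int(math.ceil(i ** alpha))

for i in (1, 10, 100, 1000, 10000):
    print(i, active_arms(i))           # 1, 3, 7, 16, 40
```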
Bandits: much more

What is a bandit:
  - a criterion (here the regret)
      defines the problem;
  - usually a score (typically
      exploration+exploitation)
          defines an algorithm.
  ==> a score optimal for one criterion is not optimal
   for another ==> a wide literature
Bandits and trees
 - we have seen the
   definition of discrete
   time control problems;


 - we have seen what
      bandits are


 - we now introduce trees and UCT
UCT (Upper Confidence Trees)




Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvari (06)
UCT
      Kocsis & Szepesvari (06)
Exploitation ...
            SCORE =
               5/7
             + k.sqrt( log(10)/7 )
... or exploration?
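For k = 1 the score above evaluates to 5/7 + √(log(10)/7) ≈ 0.714 + 0.574 ≈ 1.29. The sketch below shows how this score drives a complete toy UCT loop on Tic-Tac-Toe, with the four classical phases (selection, expansion, random playout, backpropagation). The constant k, the uniform playout policy and the game itself are illustrative assumptions, not MoGo's actual implementation.

```python
import math, random

# Toy UCT sketch on Tic-Tac-Toe.  Child score = mean +
# k*sqrt(log(parent visits)/child visits), as on the slide.

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != '.' and b[i] == b[j] == b[k]:
            return b[i]
    return None

def moves(b):  return [i for i, c in enumerate(b) if c == '.']
def play(b, m, p):  return b[:m] + p + b[m+1:]
def other(p):  return 'O' if p == 'X' else 'X'

class Node:
    def __init__(self):
        self.visits, self.wins, self.children = 0, 0.0, {}

def score(parent, child, k=0.4):
    return (child.wins / child.visits
            + k * math.sqrt(math.log(parent.visits) / child.visits))

def playout(b, p):
    """Uniformly random continuation; returns the winner or None."""
    while winner(b) is None and moves(b):
        b = play(b, random.choice(moves(b)), p)
        p = other(p)
    return winner(b)

def iterate(node, b, p):
    """One pass: selection, expansion, playout, backpropagation."""
    node.visits += 1
    if winner(b) is not None or not moves(b):
        return winner(b)
    untried = [m for m in moves(b) if m not in node.children]
    if untried:                                    # expansion
        m = random.choice(untried)
        node.children[m] = Node()
        w = playout(play(b, m, p), other(p))
    else:                                          # selection
        m = max(node.children,
                key=lambda a: score(node, node.children[a]))
        w = iterate(node.children[m], play(b, m, p), other(p))
    child = node.children[m]                       # backpropagation:
    child.visits += 1                              # stats stored from
    child.wins += 1.0 if w == p else 0.5 if w is None else 0.0
    return w                                       # the mover's view

def best_move(b, p, n=3000):
    root = Node()
    for _ in range(n):
        iterate(root, b, p)
    return max(root.children, key=lambda m: root.children[m].visits)

random.seed(0)
print(best_move('.' * 9, 'X'))         # usually 4: the center
```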
Go: from 29 to 6 stones
Formula for
simulation
                Asymptotically optimal move.

                But all of the tree is visited infinitely often!

                What is used in implementations that work?
Go: from 29 to 6 stones
Formula for
simulation

               Not consistent! Sometimes:
                - a good move might be at 0/1
                - a bad move at 1/(N-1) after N simulations
               ==> we only simulate the bad move!
Go: from 29 to 6 stones
Formula for
simulation




                       Other (better) estimates,
                       but still inconsistent
Go: from 29 to 6 stones
Formula for
simulation
                argmax  (nbWins + 1) / (nbLosses + 2)

                   ==> consistency
                   ==> frugality
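A small sketch of this recommendation rule (the variable names are illustrative): the +1/+2 smoothing keeps a move at 0 wins / 1 loss recoverable, which is exactly what breaks the pathology of the previous slide.

```python
# Sketch of the rule above; names are illustrative.

def frugal_score(nb_wins, nb_losses):
    return (nb_wins + 1) / (nb_losses + 2)

# Pathology of the raw mean: a good move at 0 wins / 1 loss scores 0
# forever, while a bad move at 1 win / (N-2) losses keeps winning the
# argmax.  With smoothing, the 0/1 move scores 1/3 and is retried:
print(frugal_score(0, 1))     # 0.333...
print(frugal_score(1, 98))    # 0.02  (1 win in 99 simulations)
```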
Outline

Discrete time control: various approaches
Monte-Carlo Tree Search (UCT, MCTS; 2006)
Extensions
Weakness
Games as benchmarks ?
Monte-Carlo Tree Search (beyond UCT)
Why is UCT suboptimal for games?
(boring version)
Why is UCT suboptimal for games?
(clear version)

  Monte-Carlo Tree Search, under mild
   conditions on games (including
   deterministic two-player zero-sum
   games), can be:
 consistent (→ converges to the best move);
 frugal (if there is a good move, it does not
 visit all of the tree infinitely often).
Why is UCT suboptimal for games?
(clear version)

  Frugal algorithms: folklore results (many
    people implement “frugal” MCTS).
  However, these algorithms are (usually)
   not consistent.
  What is new:
  sufficient
  conditions for
  consistency + frugality.
Extensions
``Standard'' UCT:
   score(situation,move) = compromise (in [0,1+] )
           between
      a) empirical quality
           P ( win | nextMove(situation) = move )
            estimated in simulations
      b) exploration term
             (UCT is not fundamental in Go)

Remarks:
  1) No offline learning
  2) No learning from one situation to another
  3) No expert rules
Extension 1: offline learning

  score(situation,move) = compromise between
     a) empirical quality
     b) exploration term
      c) offline value (Chaslot et al, Coulom) =
            empirical estimate P ( played | pattern )
            for patterns with big support
            ==> estimated on database

At first, (c) is the most important; later, (a) dominates.
Extension 2: transient values

score(situation,move) = compromise between
   a) empirical quality
           P' ( win | nextMove(situation) = move )
          estimated in simulations
   b) exploration term
    c) offline value
    d) ``transient'' value:              (Gelly et al, 07)
           P' (win | same player plays “move” later)
 ==> brings information from node N to ancestor node M
 ==> does not bring information from node N to
       descendants or cousins (many people have tried)
Extension 3: expert rules
  score(situation,move) = compromise between
     a) empirical quality
     b) exploration term
      c) offline value
      d) transient value
     e) expert rules

   ==> empirically derived linear combination.
Most important terms:
(e)+(c) first,
then (d) becomes stronger,
finally (a) only.
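The slides do not give the actual weights, so the sketch below is only a guessed decay schedule, chosen to reproduce the described behaviour: the prior terms (e)+(c) dominate first, then the transient term (d), leaving the empirical term (a). The constants 50 and 300 are pure assumptions.

```python
# Guessed decay schedule (constants 50 and 300 are assumptions; the
# slides give no numbers): (c)+(e) fade first, then (d), leaving (a).

def blended_score(n_sims, empirical, exploration,
                  offline, transient, expert):
    w_prior = 50.0 / (50.0 + n_sims)       # weight of (c) and (e)
    w_rave = 300.0 / (300.0 + n_sims)      # weight of (d)
    return (w_prior * (offline + expert)
            + (1.0 - w_prior) * (w_rave * transient
                                 + (1.0 - w_rave) * empirical)
            + exploration)                 # term (b)

for n in (1, 10, 100, 1000, 10000):        # how the two weights decay
    print(n, round(50.0 / (50.0 + n), 3), round(300.0 / (300.0 + n), 3))
```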
Go: from 29 to 6 stones
2007: win against a pro (5p) 9x9 (blitz)      MoGo
2008: win against a pro (5p) 9x9 white        MoGo
2009: win against a pro (5p) 9x9 black        MoGo
2009: win against a pro (9p) 9x9 white        Fuego
2009: win against a pro (9p) 9x9 black        MoGoTW

2008: win against a pro (8p) 19x19, H9         MoGo
2008: win against a pro (4p) 19x19, H8     CrazyStone
2008: win against a pro (4p) 19x19, H7     CrazyStone
2009: win against a pro (9p) 19x19, H7         MoGo
2009: win against a pro (1p) 19x19, H6         MoGo

==> still 6 stones at least!
Why computers are weak against
humans           (thanks M. Jolion)
A trivial semeai

           Plenty of equivalent
                     situations!

            They are randomly
                sampled, with 
             no generalization.

             50% of estimated
               win probability!
Semeai

     Plenty of equivalent
               situations!

         They are randomly
             sampled, with 
          no generalization.

          50% of estimated
            win probability!
It does not work. Why?

                                             50% of estimated
                                               win probability!


In the first node:
 the first simulations give ~ 50%;
 the next simulations go to 100% or 0% (depending
on the chosen move);
 but, then, we switch to another node
                        (~ 8! x 8! ≈ 1.6 billion such nodes)
And the humans?

                                 50% of estimated
                                   win probability!


In the first node:
 the first simulations give ~ 50%;
 the next simulations go to 100% or 0% (depending
on the chosen move);
 but, then, we DON'T switch to another node.
Semeais

Should
white
play in
the
semeai
(G1)
or capture
(J15)?
Semeais

Should black
play the
semeai?




Semeais

Should black
play the
semeai?



Useless!
Outline: the game of Go

The history


The rules


Variants of go and complexity


Computers playing Go


Go and power plants
What is high-dimensional
 discrete time control?
There are time steps: 0, 1, 2, ..., H.
There are states and transitions:
                   x_{i+1} = f( x_i, d_i )
d_i is the decision at time step i:
                      d_i = u( x_i )
There is a cost:
                     C = C( x_H )
   ==> We look for u(.) such that C is as
            small as possible.
High-dimensional discrete time
 control

              x_{i+1} = f( x_i, d_i )
                 d_i = u( x_i )
                C = C( x_H )


  ==> We look for u(.) such that C is as
           small as possible.
Discrete time + high dimension
 + uncertainty
    x_{i+1} = f( x_i, d_i, A_i )    d_i = u( x_i )
    C = C( x_H )
A_i might be:
- a Markov model
- an opponent: A_i maximizes C (worst case)


==> we look for u(.) such that C is as small
      as possible (e.g. on average).
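A minimal sketch of this setting follows; the transition f, the cost C, the policy u and the demand distribution are all toy assumptions (a one-dimensional stock with random demand). The point is only that the average cost of a fixed policy can be estimated by Monte-Carlo rollouts.

```python
import random

# Toy instance of x_{i+1} = f(x_i, d_i, A_i), C = C(x_H): a 1-D stock
# with random demand; f, C, u and the demand law are assumptions.

H = 10

def f(x, d, a):                  # transition: stock + decision - demand
    return max(0.0, x + d - a)

def C(x):                        # final cost: leftover stock is wasted
    return x

def u(x):                        # naive policy: refill up to level 5
    return max(0.0, 5.0 - x)

def rollout(x0):
    x = x0
    for _ in range(H):
        a = random.uniform(0.0, 6.0)      # random outcome A_i
        x = f(x, u(x), a)
    return C(x)

random.seed(0)
print(sum(rollout(0.0) for _ in range(10000)) / 10000)  # average cost
```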
Summary

High-dimensional discrete time control is an
important problem.


Many problems have no satisfactory
solution.


A new approach: Bandit-Based Monte-Carlo
Planning
High-dimensional discrete time
control
A main application: the management of
 many energy stocks under randomness
At each time step we see random outcomes
We have to take decisions
After H time steps, we observe a cost
What are the approaches ?

Dynamic programming (Massé – Bellman 50's)
(still the main approach in industry)


Reinforcement learning (some promising results,
less used in industry)


Some tree exploration tools (less usual in
stochastic or continuous cases)


Bandit-Based Monte-Carlo planning (MCTS/UCT)
What are the approaches ?

Dynamic programming (Massé – Bellman 50's)
(still the main approach in industry)


  Where we are:
  Done: Presentation of the problem.
  Now: We briefly present dynamic
   programming
  Thereafter: We present MCTS / UCT.
Dynamic programming

V(x) = expectation of C(x_H) under an optimal
strategy (well defined).
u(x) such that the expectation of
V( f(x, u(x), A) ) is minimal.
Computation by dynamic programming:


We compute V for all x with horizon H.
We compute V for all x with horizon H-1.
            ...        ...          ...
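A sketch of this backward recursion on a toy problem with a small finite state space (an assumption; real energy-stock problems are exactly where full enumeration breaks down): V_H = C, then V_i(x) = min_d E[ V_{i+1}(f(x, d, A)) ].

```python
# Toy backward induction: V_H(x) = C(x), then
# V_i(x) = min_d E[ V_{i+1}( f(x, d, A) ) ] for i = H-1, ..., 0.
# State/decision/demand spaces below are small assumptions.

H = 5
STATES = range(11)                 # stock level 0..10
DECISIONS = range(3)               # produce 0, 1 or 2 units
DEMANDS = [(0, 0.5), (2, 0.5)]     # demand 0 or 2, equally likely

def f(x, d, a):
    return min(10, max(0, x + d - a))

def C(x):
    return abs(x - 5)              # cost: distance to target stock 5

V = {x: C(x) for x in STATES}      # horizon H
for i in range(H):                 # horizons H-1 down to 0
    V = {x: min(sum(p * V[f(x, d, a)] for a, p in DEMANDS)
                for d in DECISIONS)
         for x in STATES}

print(V[0])                        # optimal expected cost from x = 0
```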
Extensions

Approximate dynamic programming (e.g.
for continuous domains)
Reinforcement learning
Case where f(...) or A is black-box
Huge state spaces
      ==> but lack of stability
...
==> there is room for improvements
Conclusion : games = great for
artificial intelligence

                      Very difficult
                     for computers.
What else? First-Person Shooting
(UCT for partially observable MDP)
What else? Real-Time Strategy Game
   (multiple actors, partially obs.)
What else? Sports (continuous control)
“Real” games
     Assumption: if a computer understands and guesses spins, then
     this robot will be useful for more than just games.

     (holds true for Go)
What else? Collaborative sports
Conclusion
Essentially, only asymptotic results are proved.
Empirically good for:
the game of Go;
some other games;
non-linear expensive optimization;
active learning.
Not (yet) tested industrially.
Understood weaknesses: plenty of very similar nodes!

 Next challenges:
solve these weaknesses;
industrial applications;
partially observable cases: cf. Cazenave, Rolet.
Biblio
Bandits: Lai, Robbins, Auer, Cesa-Bianchi...
UCT: Kocsis, Szepesvari, Coquelin, Munos...
MCTS (Go): Coulom, Chaslot, Fiter, Gelly, Hoock, Silver, Muller,
                                               Pérez, Rimmel, Wang...
Tree + DP for industrial applications: Péret, Garcia...
Bandits with infinitely many arms:
     Audibert, Coulom, Munos, Wang...
Applications far from Go: Rolet,
     Teytaud (F), Rimmel, De Mesmay...
Links with “macro-actions”?
Parallelization, mixing with offline
  learning, bias...
Contributors
Paul Veyssière, Vincent Berthier,
Amine Bourki, Hassen Doghmen,
Matthieu Coulm,
Univ. Taiwan, Univ. Paris.
