AI Supremacy in Games
Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and TensorCFR
Karel Ha
13th June 2018
Artificial Intelligence Center
Czech Technical University in Prague
The Outline
AI in Games
Chess: Deep Blue
Atari Games: Deep Reinforcement Learning
Go: AlphaGo, AlphaGo Zero, AlphaZero
Poker: Cepheus, DeepStack
Beyond DeepStack: TensorCFR
Conclusion
AI in Games
Chess: Deep Blue
Deep Blue (1996/1997)
Deep Blue vs Kasparov
six-game chess matches
between world chess champion Garry Kasparov
and an IBM supercomputer called Deep Blue
first match in Philadelphia in 1996, won by Kasparov
second match in New York City in 1997, won by Deep Blue
The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions.
Use of brute-force search!
Atari Games: Deep Reinforcement Learning
Atari Player by Google DeepMind
https://youtu.be/0X-NdPtFKq0?t=21m13s
[Mnih et al. 2015]
Reinforcement Learning
games of self-play
https://youtu.be/0X-NdPtFKq0?t=16m57s
Go: AlphaGo, AlphaGo Zero, AlphaZero
Crash Course: Tree Search
Tree Search
Optimal value v∗(s) determines the outcome of the game:
from every board position or state s
under perfect play by all players.
It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves, where
b is the game's breadth (number of legal moves per position)
d is its depth (game length)
[Silver et al. 2016]
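As a reference point, here is a minimal Python sketch of the exhaustive search this describes (illustrative only; a `game` object with is_terminal, utility, legal_moves and apply is a hypothetical interface), which visits on the order of b^d nodes:

def minimax(state, game, maximizing=True):
    """Exhaustively compute the optimal value v*(s) of `state` (sketch).

    `game` is assumed to provide is_terminal(s), utility(s),
    legal_moves(s) and apply(s, move). Visits roughly b^d nodes.
    """
    if game.is_terminal(state):
        return game.utility(state)          # outcome from player 1's point of view
    values = (minimax(game.apply(state, m), game, not maximizing)
              for m in game.legal_moves(state))
    return max(values) if maximizing else min(values)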
Game tree of Go
Sizes of trees for various games:
chess: b ≈ 35, d ≈ 80
Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe!
That makes Go a googol times more complex than chess.
https://deepmind.com/alpha-go.html
How to handle the size of the game tree?
for the breadth: a neural network to select moves
for the depth: a neural network to evaluate the current position
for the tree traversal: Monte Carlo tree search (MCTS)
[Allis 1994]
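A quick back-of-the-envelope check of these numbers in Python (using the approximate b and d quoted above):

chess = 35 ** 80            # ~ 3e123 move sequences
go = 250 ** 150             # ~ 5e359 move sequences
googol = 10 ** 100

print(f"chess ~ 10^{len(str(chess)) - 1}")                 # ~ 10^123
print(f"Go    ~ 10^{len(str(go)) - 1}")                    # ~ 10^359
print(f"Go / chess exceeds a googol: {(go // chess) > googol}")  # True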
Monte Carlo tree search
Crash Course: Neural Networks
Neural Network: Inspiration
inspired by the neuronal structure of the mammalian cerebral cortex
but on much smaller scales
suitable to model systems with a high tolerance to error
e.g. audio or image recognition
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
Neural Network: Modes
Two modes:
feedforward for making predictions
backpropagation for learning
[Dieterle 2003]
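A minimal numpy sketch of the two modes for a one-hidden-layer network (illustrative, not from the slides): the forward pass makes a prediction, and backpropagation pushes the squared error back to update the weights.

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

def forward(x):
    """Feedforward mode: propagate the input to a prediction."""
    h = np.tanh(x @ W1)
    return h, h @ W2

def backprop(x, y, lr=0.01):
    """Backpropagation mode: push the error back to update the weights."""
    global W1, W2
    h, y_hat = forward(x)
    err = y_hat - y                                   # dLoss/dy_hat for squared error
    grad_W2 = h.T @ err
    grad_W1 = x.T @ ((err @ W2.T) * (1.0 - h ** 2))   # tanh derivative
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 1))
backprop(x, y)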
(Deep) Convolutional Neural Network
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 10
AlphaGo (2016)
Policy and Value Networks
[Silver et al. 2016] 11
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 12
SL Policy Networks
move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%).
[Silver et al. 2016] 13
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 14
Rollout Policy
Rollout policy pπ(a|s) is faster but less accurate than the SL policy network.
It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network.
[Silver et al. 2016]
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 16
RL Policy Networks
identical in structure to the SL policy network
games of self-play
between the current RL policy network and a randomly selected previous iteration
goal: to win in the games of self-play
[Silver et al. 2016]
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 18
Value Network
similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
Beware: successive positions are strongly correlated!
Trained that way, the value network memorized the game outcomes rather than generalizing to new positions.
Solution: generate 30 million (new) positions, each sampled from a separate game
almost the accuracy of Monte Carlo rollouts (using pρ), but 15,000 times less computation!
[Silver et al. 2016]
Value Network: Selection of Moves
evaluation of all successors s′ of the root position s, using vθ(s′)
[Silver et al. 2016] 20
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 21
ELO Ratings for Various Combinations of Networks
[Silver et al. 2016] 22
Monte Carlo Tree Search (MCTS) Algorithm
The tree is traversed by simulation from the root state (i.e. descending the tree).
The next action is selected with a lookahead search:
1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at end of simulation)
[Silver et al. 2016]
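A compressed Python sketch of one simulation with these four phases (illustrative only; `game`, `policy_net`, `value_net` and `rollout` are hypothetical callbacks, and per-player sign handling is omitted):

import math

class Node:
    def __init__(self, prior):
        self.prior = prior          # move probability from the policy network
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}          # action -> Node

def simulate(root_state, root, game, policy_net, value_net, rollout, c_puct=1.0):
    """Run one MCTS simulation (sketch)."""
    state, node, path = root_state, root, [root]
    # 1. selection: descend while the current node is already expanded
    while node.children:
        parent = node
        action, node = max(
            parent.children.items(),
            key=lambda kv: kv[1].value_sum / (kv[1].visits + 1e-8)
            + c_puct * kv[1].prior * math.sqrt(parent.visits + 1) / (1 + kv[1].visits))
        state = game.apply(state, action)
        path.append(node)
    # 2. expansion: let the policy network propose children of the leaf
    for action, prob in policy_net(state).items():
        node.children[action] = Node(prior=prob)
    # 3. evaluation: mix the value network with the outcome of a fast rollout
    value = 0.5 * value_net(state) + 0.5 * rollout(state)
    # 4. backup: update statistics of every node traversed in this simulation
    for n in path:
        n.visits += 1
        n.value_sum += value
    return value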
MCTS Algorithm: Selection
[Silver et al. 2016] 24
MCTS Algorithm: Expansion
A leaf position may be expanded by the SL policy network pσ.
Once it's expanded, it remains so until the end.
[Silver et al. 2016]
MCTS: Evaluation
Both of the following 2 evaluations:
evaluation from the value network vθ(s)
evaluation by the outcome of the fast rollout pπ
[Silver et al. 2016]
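In the AlphaGo paper the two evaluations are combined with a mixing parameter λ into a single leaf value, V(sL) = (1 − λ) · vθ(sL) + λ · zL, where zL is the outcome of the rollout from the leaf sL; λ = 0.5 reportedly worked best.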
MCTS: Backup
At the end of a simulation, each traversed edge updates its values.
[Silver et al. 2016] 27
Once the search is complete, the algorithm
chooses the most visited move from the root
position.
[Silver et al. 2016] 27
Tree Evaluation: using Value Network
tree-edge values averaged over value network evaluations only
[Silver et al. 2016] 28
Tree Evaluation: using Rollouts
tree-edge values averaged over rollout evaluations only
[Silver et al. 2016] 29
Percentage of Simulations
percentage frequency:
which actions were selected during simulations
[Silver et al. 2016] 30
Principal Variation
i.e. the Path with the Maximum Visit Count
AlphaGo selected the move indicated by the red circle.
Fan Hui responded with the move indicated by the white square.
In his post-game commentary, he preferred the move predicted by AlphaGo (label 1).
[Silver et al. 2016]
Tournament with Other Go Programs
[Silver et al. 2016] 32
Fan Hui
professional 2 dan
European Go Champion in 2013, 2014 and 2015
European Professional Go Champion in 2016
biological neural network:
100 billion neurons
100 to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui
AlphaGo versus Fan Hui
AlphaGo won 5-0 in a formal match in October 2015.
“[AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person.”
Fan Hui
Lee Sedol “The Strong Stone”
professional 9 dan
the 2nd in international titles
“Roger Federer” of Go
Lee Sedol would win 97 out of 100 games against Fan Hui.
biological neural network, comparable to Fan Hui’s (in number of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol
“I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time.”
Lee Sedol
“...even beating AlphaGo by 4-1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind.”
interview in JTBC Newsroom
AlphaGo versus Lee Sedol
In March 2016 AlphaGo won 4-1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and Go organisations.
Lee received $170,000 ($150,000 for participating in all five games, and an additional $20,000 for each game won).
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo Master
In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and “Magister”.
This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016.
Over one week, AlphaGo played 60 online fast time-control games.
AlphaGo won this series of games 60:0.
https://deepmind.com/research/alphago/match-archive/master/
23-27 May 2017 in Wuzhen, China
Team Go vs. AlphaGo 0:1
AlphaGo vs. world champion Ke Jie 3:0
https://events.google.com/alphago2017/
AlphaGo Zero
defeated AlphaGo Lee by 100 games to 0
AlphaZero
AI system that mastered chess, Shogi and Go to “superhuman levels” within a handful of hours
defeated AlphaGo Zero (version with 20 blocks trained for 3 days) by 60 games to 40
1. AlphaGo Fan
2. AlphaGo Lee
3. AlphaGo Master
4. AlphaGo Zero
5. AlphaZero
Poker: Cepheus, DeepStack
“Real life consists of bluffing, of little tactics of deception, of asking yourself what is the other man going to think I mean to do.”
John von Neumann
Game Tree in Poker
[Figure: a poker game-tree fragment spanning the pre-flop, flop and turn, with fold/call/raise and check/bet actions at player 1 and player 2 decision nodes and terminal payoffs such as -50, -100, 100 and -200.]
the so-called public tree (tree of public events)
[Moravčík et al. 2017]
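A minimal sketch (hypothetical field names, not from the paper) of how a node of such a public tree might be represented, keeping only publicly observable information plus each player's range over private hands:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PublicNode:
    """One node of the public tree: only publicly observable state."""
    board_cards: List[str]                 # community cards dealt so far
    pot: int                               # chips in the pot
    player_to_act: int                     # 1 or 2, or 0 for a chance node
    ranges: Dict[int, List[float]] = field(default_factory=dict)
    # per-player probability distribution over private hands
    children: Dict[str, "PublicNode"] = field(default_factory=dict)
    # public action ("fold", "call", "bet 100", ...) -> successor node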
Cepheus (2014)
Cepheus
Heads-up Limit Hold’em Poker
http://poker.srv.ualberta.ca/
[Bowling et al. 2015]
a massive use of Counterfactual Regret Minimization+ (CFR+):
game-theoretic analogy to gradient descent
parameters (weights) of neural networks ∼ strategies
loss ∼ counterfactual values
gradient ∼ counterfactual regrets
trained neural network (optimal weights) ∼ solution to the game (Nash equilibrium)
http://poker.srv.ualberta.ca/
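To make the analogy concrete, here is a minimal numpy sketch (not Cepheus itself) of one CFR+ update at a single information set: cumulative regrets are clipped at zero and the next strategy is obtained by regret matching.

import numpy as np

def cfr_plus_update(cum_regret, counterfactual_values, strategy):
    """One CFR+ step at a single information set (sketch).

    counterfactual_values[a] is the counterfactual value of action a;
    their strategy-weighted sum is the value of the information set.
    """
    node_value = counterfactual_values @ strategy
    regrets = counterfactual_values - node_value        # the "gradient"-like signal
    cum_regret = np.maximum(cum_regret + regrets, 0.0)   # the CFR+ clipping at zero
    total = cum_regret.sum()
    next_strategy = (cum_regret / total if total > 0
                     else np.full_like(strategy, 1.0 / len(strategy)))
    return cum_regret, next_strategy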
DeepStack (2016/2017)
https://www.deepstack.ai
DeepStack
completed in December 2016
published in Science in March 2017
the first AI capable of beating professional poker players (11 of them)
44,000 hands of Heads-Up No-Limit Texas Hold’em
https://www.deepstack.ai
DeepStack: Components
[Figure: DeepStack overview — (A) a lookahead tree built from the current public state of the public tree, together with the agent's range and the opponent's counterfactual values; (B) the counterfactual value neural network used to evaluate subtrees; (C) sampled poker situations used to train that network.]
continual resolving
“intuitive” local search
sparse lookahead trees
[Moravčík et al. 2017]
DeepStack: Continual Resolving
only a strategy based on the current state
only for the remainder of the hand
no strategy for the full game
⇒ lower overall exploitability
https://www.deepstack.ai
DeepStack: “Intuitive” Local Search
no reasoning about the full remaining game
computation beyond a certain depth: a fast, approximate estimate
namely, deep neural networks
“gut feeling” of the value of holding any cards in any situation
https://www.deepstack.ai
DeepStack: Deep Counterfactual Value Network
[Figure: a feedforward net of 7 fully connected hidden layers (500 units each, linear + PReLU) maps the pot size, public cards and both players' bucketed card ranges (1,326 hands bucketed into 1,000 buckets per player) to bucketed counterfactual values; inverse bucketing and an outer zero-sum layer recover the per-card counterfactual values and enforce a zero-sum error term.]
[Moravčík et al. 2017]
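A rough numpy sketch of the pictured value function (layer sizes read off the figure; the bucketing, pot/board features and the outer zero-sum correction are only hinted at):

import numpy as np

def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def counterfactual_value_net(features, weights):
    """7 fully connected hidden layers of 500 PReLU units (sketch).

    `features`: pot size, public cards and both players' bucketed ranges
    (1,000 buckets each); output: bucketed counterfactual values for both
    players, to be mapped back to per-card values by inverse bucketing.
    """
    h = features
    for W, b in weights[:-1]:          # the 7 hidden layers
        h = prelu(h @ W + b)
    W_out, b_out = weights[-1]
    return h @ W_out + b_out           # 2 x 1,000 bucket values, flattened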
DeepStack: Sparse Lookahead Trees
a reduced number of actions considered
to play at conventional human speeds
games re-solved in under five seconds
on a simple gaming laptop
with an NVIDIA GeForce GTX 1080 GPU
https://www.deepstack.ai
DeepStack: Against Professional Players
[Figure: DeepStack's win rate against each participant, measured in mbb/g, together with the number of hands each participant played (up to 3,000).]
[Moravčík et al. 2017]
DeepStack: Theoretical Guarantees
Theorem
Let the error of counterfactual values returned by the value function be ≤ ε.
Let T be the number of re-solving iterations for each decision.
Then the exploitability of the resulting strategies is
≤ k1 · ε + k2 / √T,
where k1 and k2 are game-specific constants.
[Moravčík et al. 2017]
Beyond DeepStack: TensorCFR
from a presentation of Dr. Viliam Lisý
TensorCFR (2018)
TensorCFR
an implementation of CFR+
in TensorFlow
optimized for GPU
ongoing effort of:
Dr. Viliam Lisý
Bc. Jan Rudolf (part-time)
and me :-)
https://gitlab.com/beyond-deepstack/TensorCFR/
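As a flavour of why TensorFlow and a GPU help here, a hedged sketch (not the actual TensorCFR code) of the CFR+ regret-matching step written as tensor ops, vectorized over many information sets at once:

import tensorflow as tf

def regret_matching_plus(cum_regrets, regrets):
    """Vectorized CFR+ step: rows = information sets, columns = actions."""
    cum_regrets = tf.nn.relu(cum_regrets + regrets)            # clip regrets at zero
    totals = tf.reduce_sum(cum_regrets, axis=1, keepdims=True)
    strategies = tf.math.divide_no_nan(cum_regrets, totals)    # regret matching
    # fall back to the uniform strategy where all cumulative regrets are zero
    num_actions = tf.cast(tf.shape(cum_regrets)[1], cum_regrets.dtype)
    strategies += tf.cast(tf.equal(totals, 0), cum_regrets.dtype) / num_actions
    return cum_regrets, strategies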
TensorCFR: Computation Graph
https://gitlab.com/beyond-deepstack/TensorCFR/ 52
TensorCFR: Compute Time
https://gitlab.com/beyond-deepstack/TensorCFR/ 53
TensorCFR: Memory
https://gitlab.com/beyond-deepstack/TensorCFR/ 54
TensorCFR: TPU Compatibility
https://gitlab.com/beyond-deepstack/TensorCFR/ 55
Conclusion
Future Plans for TensorCFR
implement continual resolving (i.e. DeepStack) in TensorFlow
extend to arbitrary games other than poker with a clear public tree
research on fast value approximators over exponentially large input space
sampling?
neural networks??
distributed representations (i.e. represent states similarly as in Word2Vec)???
represent states using conveniently trained GANs????
something else?????????????
Any ideas and suggestions? Now it's the right time!
You know why?
Because we're hiring!
Thank you!
Questions?
Backup Slides
SL Policy Networks (1/2)
13-layer deep convolutional neural network
goal: to predict expert human moves
task of classification
trained from 30 million positions from the KGS Go Server
stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s) / ∂σ
(to maximize the likelihood of the human move a selected in state s)
Results:
44.4% accuracy (the state of the art from other groups)
55.7% accuracy (raw board position + move history as input)
57.0% accuracy (all input features)
[Silver et al. 2016]
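A tiny numpy sketch of this update for a linear-softmax policy over moves (illustrative only; the real pσ is a 13-layer CNN): the gradient of log pσ(a|s) is the familiar "observed minus expected" feature difference.

import numpy as np

def sl_policy_update(sigma, features, human_move, lr=0.1):
    """One stochastic-gradient-ascent step on log p_sigma(a|s) (sketch).

    sigma: (num_moves, num_features) weights of a linear softmax policy,
    features: feature vector of state s, human_move: index of the expert move.
    """
    logits = sigma @ features
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # p_sigma(.|s)
    grad = -np.outer(p, features)                 # minus the expected features
    grad[human_move] += features                  # plus the observed move's features
    return sigma + lr * grad                      # ascent on the log-likelihood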
SL Policy Networks (2/2)
Small improvements in accuracy led to large improvements
in playing strength (see the next slide)
[Silver et al. 2016]
RL Policy Networks (details)
Results (by sampling each move from pρ(·|st)):
80% win rate against the SL policy network
85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011)
The previous state of the art, based only on SL of CNNs:
11% “win” rate against Pachi
[Silver et al. 2016]
Evaluation accuracy in various stages of a game
Move number is the number of moves that had been played in the given position.
Each position evaluated by:
forward pass of the value network vθ
100 rollouts, played out using the corresponding policy
[Silver et al. 2016]
Scalability
asynchronous multi-threaded search
simulations on CPUs
computation of neural networks on GPUs
AlphaGo:
40 search threads
40 CPUs
8 GPUs
Distributed version of AlphaGo (on multiple machines):
40 search threads
1202 CPUs
176 GPUs
[Silver et al. 2016]
ELO Ratings for Various Combinations of Threads
[Silver et al. 2016]
References
Allis, Louis Victor et al. (1994). Searching for Solutions in Games and Artificial Intelligence. Ponsen & Looijen.
Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the Art Open Source Go Program”. In: Advances in Computer Games. Springer, pp. 24–38.
Bowling, Michael et al. (2015). “Heads-Up Limit Hold’em Poker is Solved”. In: Science 347.6218, pp. 145–149. URL: http://poker.cs.ualberta.ca/15science.html.
Dieterle, Frank Jochen (2003). “Multianalyte Quantifications by Means of Integration of Artificial Neural Networks, Genetic Algorithms and Chemometrics for Time-Resolved Analytical Data”. PhD thesis. Universität Tübingen.
Mnih, Volodymyr et al. (2015). “Human-Level Control through Deep Reinforcement Learning”. In: Nature 518.7540, pp. 529–533. URL: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Moravčík, Matej et al. (2017). “DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker”. In: Science 356.6337, pp. 508–513.
Munroe, Randall. Game AIs. URL: https://xkcd.com/1002/ (visited on 04/02/2016).
Silver, David et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587, pp. 484–489.

AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and TensorCFR

  • 1.
    AI Supremacy inGames Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and TensorCFR Karel Ha 13th June 2018 Artificial Intelligence Center Czech Technical University in Prague
  • 2.
    The Outline AI inGames Chess: Deep Blue Atari Games: Deep Reinforcement Learning Go: AlphaGo, AlphaGo Zero, AlphaZero Poker: Cepheus, DeepStack Beyond DeepStack: TensorCFR Conclusion 1
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
    Deep Blue vsKasparov 2
  • 8.
    Deep Blue vsKasparov six-game chess matches 2
  • 9.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov 2
  • 10.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue 2
  • 11.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov 2
  • 12.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue 2
  • 13.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions. 2
  • 14.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions. 2
  • 15.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions. Use of brute-force search! 2
  • 16.
    Atari Games: DeepReinforcement Learning
  • 17.
    Atari Player byGoogle DeepMind https://youtu.be/0X-NdPtFKq0?t=21m13s [Mnih et al. 2015] 3
  • 18.
  • 19.
    Reinforcement Learning games ofself-play https://youtu.be/0X-NdPtFKq0?t=16m57s 4
  • 20.
    Go: AlphaGo, AlphaGoZero, AlphaZero
  • 21.
  • 22.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: [Silver et al. 2016] 5
  • 23.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s [Silver et al. 2016] 5
  • 24.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. [Silver et al. 2016] 5
  • 25.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. [Silver et al. 2016] 5
  • 26.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately bd possible sequences of moves, where [Silver et al. 2016] 5
  • 27.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately bd possible sequences of moves, where b is the games breadth (number of legal moves per position) [Silver et al. 2016] 5
  • 28.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately bd possible sequences of moves, where b is the games breadth (number of legal moves per position) d is its depth (game length) [Silver et al. 2016] 5
  • 29.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 [Allis 1994] 6
  • 30.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! [Allis 1994] 6
  • 31.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html [Allis 1994] 6
  • 32.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? [Allis 1994] 6
  • 33.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves [Allis 1994] 6
  • 34.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves for the depth: a neural network to evaluate current position [Allis 1994] 6
  • 35.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves for the depth: a neural network to evaluate current position for the tree traverse: Monte Carlo tree search (MCTS) [Allis 1994] 6
  • 36.
  • 37.
  • 38.
  • 39.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 40.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex but on much smaller scales http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 41.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex but on much smaller scales suitable to model systems with a high tolerance to error http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 42.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex but on much smaller scales suitable to model systems with a high tolerance to error e.g. audio or image recognition http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 43.
  • 44.
    Neural Network: Modes Twomodes [Dieterle 2003] 9
  • 45.
    Neural Network: Modes Twomodes feedforward for making predictions [Dieterle 2003] 9
  • 46.
    Neural Network: Modes Twomodes feedforward for making predictions backpropagation for learning [Dieterle 2003] 9
  • 47.
    (Deep) Convolutional NeuralNetwork http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 10
  • 48.
    (Deep) Convolutional NeuralNetwork http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 10
  • 49.
  • 50.
    Policy and ValueNetworks [Silver et al. 2016] 11
  • 51.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 12
  • 52.
    SL Policy Networks moveprobabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%). [Silver et al. 2016] 13
  • 53.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 14
  • 54.
    Rollout Policy Rollout policypπ(a|s) is faster but less accurate than SL policy network. [Silver et al. 2016] 15
  • 55.
    Rollout Policy Rollout policypπ(a|s) is faster but less accurate than SL policy network. It takes 2µs to select an action, compared to 3 ms in case of SL policy network. [Silver et al. 2016] 15
  • 56.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 16
  • 57.
    RL Policy Networks identicalin structure to the SL policy network [Silver et al. 2016] 17
  • 58.
    RL Policy Networks identicalin structure to the SL policy network games of self-play [Silver et al. 2016] 17
  • 59.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration [Silver et al. 2016] 17
  • 60.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration goal: to win in the games of self-play [Silver et al. 2016] 17
  • 61.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration goal: to win in the games of self-play [Silver et al. 2016] 17
  • 62.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration goal: to win in the games of self-play [Silver et al. 2016] 17
  • 63.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 18
  • 64.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution [Silver et al. 2016] 19
  • 65.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! [Silver et al. 2016] 19
  • 66.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. [Silver et al. 2016] 19
  • 67.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game [Silver et al. 2016] 19
  • 68.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! [Silver et al. 2016] 19
  • 69.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! [Silver et al. 2016] 19
  • 70.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! [Silver et al. 2016] 19
  • 71.
    Value Network: Selectionof Moves evaluation of all successors s of the root position s, using vθ(s) [Silver et al. 2016] 20
  • 72.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 21
  • 73.
    ELO Ratings forVarious Combinations of Networks [Silver et al. 2016] 22
Monte Carlo Tree Search (MCTS) Algorithm
The tree is traversed by simulation from the root state (i.e. descending the tree).
The next action is selected with a lookahead search:
1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at end of simulation)
[Silver et al. 2016] 23
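For orientation, a compact Python sketch of the four phases follows; `prior_policy`, `value`, `rollout` and `apply_action` are hypothetical stand-ins for pσ, vθ, pπ and the game rules, the mixing constant `mix` corresponds to the paper's λ, and sign flips between the two players are omitted, so this illustrates the control flow rather than AlphaGo's actual search.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}  # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts(root_state, n_simulations, prior_policy, value, rollout, apply_action,
         c_puct=1.0, mix=0.5):
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        node, state, path = root, root_state, [root]
        # 1. selection: descend the tree with a PUCT-style score
        while node.children:
            total = sum(child.visits for child in node.children.values()) + 1
            action, node = max(
                node.children.items(),
                key=lambda kv: kv[1].q()
                + c_puct * kv[1].prior * math.sqrt(total) / (1 + kv[1].visits))
            state = apply_action(state, action)
            path.append(node)
        # 2. expansion: create children with priors from the policy network
        for action, p in prior_policy(state).items():
            node.children[action] = Node(prior=p)
        # 3. evaluation: mix the value-network estimate with a fast rollout
        leaf_value = (1 - mix) * value(state) + mix * rollout(state)
        # 4. backup: update every node visited during this simulation
        for visited in path:
            visited.visits += 1
            visited.value_sum += leaf_value
    # the final move is the most visited action at the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```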
MCTS Algorithm: Expansion
A leaf position may be expanded by the SL policy network pσ.
Once it is expanded, it remains so until the end.
[Silver et al. 2016] 25
MCTS: Evaluation
Both of the following two evaluations:
evaluation from the value network vθ(s)
evaluation by the outcome of the fast rollout pπ
[Silver et al. 2016] 26
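The two estimates are combined linearly into a single leaf value; [Silver et al. 2016] report a mixing constant λ = 0.5. A one-line sketch:

```python
def leaf_evaluation(v_theta, z_rollout, lam=0.5):
    """Mix the value-network estimate with the fast-rollout outcome:
    V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L   (lam = 0.5 in the paper)."""
    return (1.0 - lam) * v_theta + lam * z_rollout
```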
MCTS: Backup
At the end of a simulation, each traversed edge updates its values.
[Silver et al. 2016] 27
Once the search is complete, the algorithm chooses the most visited move from the root position.
[Silver et al. 2016] 27
Tree Evaluation: using Value Network
tree-edge values averaged over value network evaluations only
[Silver et al. 2016] 28
Tree Evaluation: using Rollouts
tree-edge values averaged over rollout evaluations only
[Silver et al. 2016] 29
Percentage of Simulations
percentage frequency: which actions were selected during simulations
[Silver et al. 2016] 30
Principal Variation, i.e. the Path with the Maximum Visit Count
AlphaGo selected the move indicated by the red circle.
Fan Hui responded with the move indicated by the white square.
In his post-game commentary, he preferred the move predicted by AlphaGo (label 1).
[Silver et al. 2016] 31
Tournament with Other Go Programs
[Silver et al. 2016] 32
Fan Hui
professional 2 dan
European Go Champion in 2013, 2014 and 2015
European Professional Go Champion in 2016
biological neural network:
100 billion neurons
100 up to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui 33
AlphaGo versus Fan Hui
AlphaGo won 5-0 in a formal match in October 2015.
“[AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person.”
Fan Hui
34
Lee Sedol, “The Strong Stone”
professional 9 dan
the 2nd in international titles
“Roger Federer” of Go
Lee Sedol would win 97 out of 100 games against Fan Hui.
biological neural network, comparable to Fan Hui’s (in number of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol 35
“I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time.”
Lee Sedol
“...even beating AlphaGo by 4-1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind.”
interview in JTBC Newsroom
35
AlphaGo versus Lee Sedol
In March 2016, AlphaGo won 4-1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and to Go organisations.
Lee received $170,000 ($150,000 for participating in all five games, and an additional $20,000 for each game won).
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 36
AlphaGo Master
In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and “Magister”.
This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016.
Over one week, AlphaGo played 60 online fast time-control games.
AlphaGo won this series of games 60:0.
https://deepmind.com/research/alphago/match-archive/master/ 37
23-27 May 2017 in Wuzhen, China
Team Go vs. AlphaGo 0:1
AlphaGo vs. world champion Ke Jie 3:0
https://events.google.com/alphago2017/ 38
AlphaGo Zero
defeated AlphaGo Lee by 100 games to 0
38
AlphaZero
AI system that mastered chess, shogi and Go to “superhuman levels” within a handful of hours
defeated AlphaGo Zero (version with 20 blocks trained for 3 days) by 60 games to 40
39
1. AlphaGo Fan
2. AlphaGo Lee
3. AlphaGo Master
4. AlphaGo Zero
5. AlphaZero
39
“Real life consists of bluffing, of little tactics of deception, of asking yourself what is the other man going to think I mean to do.”
John von Neumann
39
Game Tree in Poker
[figure: a poker game tree spanning the pre-flop, flop and turn, with fold/call/bet/check/raise actions and terminal payoffs]
the so-called public tree (tree of public events)
[Moravčík et al. 2017] 40
Cepheus
Heads-up Limit Hold’em Poker
http://poker.srv.ualberta.ca/
[Bowling et al. 2015] 41
a massive use of Counterfactual Regret Minimization+ (CFR+): a game-theoretic analogy to gradient descent
parameters (weights) of neural networks ∼ strategies
loss ∼ counterfactual values
gradient ∼ counterfactual regrets
trained neural network (optimal weights) ∼ solution to the game (Nash equilibrium)
http://poker.srv.ualberta.ca/ 42
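To make the analogy concrete, here is a minimal numpy sketch of the regret-matching+ loop at a single information set; `counterfactual_values` is a hypothetical stand-in for the game-tree traversal, and real CFR+ implementations such as Cepheus add alternating updates and the full tree walk.

```python
import numpy as np

def regret_matching_plus(counterfactual_values, n_actions, n_iters=1000):
    """Sketch of the CFR+ inner loop at one information set.
    counterfactual_values(strategy) -> np.array of per-action values."""
    regrets = np.zeros(n_actions)          # plays the role of a (clipped) gradient
    strategy_sum = np.zeros(n_actions)     # for the average strategy
    for t in range(1, n_iters + 1):
        positive = np.maximum(regrets, 0.0)
        total = positive.sum()
        strategy = positive / total if total > 0 else np.full(n_actions, 1.0 / n_actions)
        cfv = counterfactual_values(strategy)          # the "loss"-like quantity
        expected = strategy @ cfv
        # CFR+ keeps regrets clipped at zero (the "+" in CFR+)
        regrets = np.maximum(regrets + (cfv - expected), 0.0)
        strategy_sum += t * strategy                   # linearly weighted average
    return strategy_sum / strategy_sum.sum()           # approximate equilibrium strategy
```

The linearly weighted average strategy is what converges to an approximate Nash equilibrium, mirroring how the trained weights of a neural network are the output of gradient descent.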
DeepStack
completed in December 2016
published in Science in March 2017
the first AI capable of beating (11) professional poker players
44,000 hands of Heads-Up No-Limit Texas Hold’em
https://www.deepstack.ai 43
DeepStack: Components
[figure: (A) re-solving a sparse lookahead tree rooted at the current public state, given the agent’s range and the opponent’s counterfactual values; (B) a neural net mapping ranges at a public state to counterfactual values; (C) the net is trained on sampled poker situations]
continual resolving
“intuitive” local search
sparse lookahead trees
[Moravčík et al. 2017] 44
DeepStack: Continual Resolving
only a strategy based on the current state
only for the remainder of the hand
no strategy for the full game
⇒ lower overall exploitability
https://www.deepstack.ai 45
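The control flow can be sketched as a loop that re-solves only the remainder of the hand at each of the agent's decision points; `resolve` and `env` below are hypothetical stand-ins for the subgame solver and the game interface, and DeepStack's actual bookkeeping of ranges and counterfactual values is considerably more involved.

```python
def play_hand(initial_state, initial_range, initial_opp_cfvs, resolve, env):
    """Sketch of continual re-solving: at every decision point, re-solve only
    the remainder of the hand from the current public state, carrying forward
    the agent's range and the opponent's counterfactual values."""
    state, own_range, opp_cfvs = initial_state, initial_range, initial_opp_cfvs
    while not env.is_terminal(state):
        if env.to_act(state) == "agent":
            strategy, own_range, opp_cfvs = resolve(state, own_range, opp_cfvs)
            action = env.sample(strategy)
        else:
            action = env.observe_opponent(state)
        state = env.apply(state, action)
    return env.payoff(state)
```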
DeepStack: “Intuitive” Local Search
no reasoning about the full remaining game
computation beyond a certain depth: a fast approximate estimate
namely, deep neural networks
a “gut feeling” of the value of holding any cards in any situation
https://www.deepstack.ai 46
DeepStack: Deep Counterfactual Value Network
[architecture figure: card ranges for both players (1,326 hands each) plus the pot size and public cards are bucketed into the input; a feed-forward net of 7 fully connected hidden layers (500 units each, linear + PReLU) outputs bucket values, which are mapped back to per-card counterfactual values; an outer zero-sum network enforces that the two players’ values sum to zero]
[Moravčík et al. 2017] 47
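As a rough illustration of the inner feed-forward part (bucketed ranges in, bucket values out), a Keras sketch follows; the sizes mirror the figure (7 hidden layers of 500 PReLU units), while the bucketing, inverse bucketing and outer zero-sum correction from the paper are omitted.

```python
import tensorflow as tf

def build_cfv_net(n_buckets, n_hidden=7, width=500):
    """Sketch of the feed-forward core of a counterfactual value network:
    bucketed ranges for both players (plus a pot feature) in,
    bucket values for both players out."""
    inputs = tf.keras.Input(shape=(2 * n_buckets + 1,))   # two ranges + pot size
    x = inputs
    for _ in range(n_hidden):
        x = tf.keras.layers.Dense(width)(x)               # linear layer
        x = tf.keras.layers.PReLU()(x)                    # parametric ReLU
    outputs = tf.keras.layers.Dense(2 * n_buckets)(x)     # bucket values, both players
    return tf.keras.Model(inputs, outputs)
```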
DeepStack: Sparse Lookahead Trees
a reduced number of actions considered
to play at conventional human speeds
games re-solved in under five seconds
on a simple gaming laptop
with an NVIDIA GeForce GTX 1080 GPU
https://www.deepstack.ai 48
DeepStack: Against Professional Players
[plot: DeepStack win rate (mbb/g) against the number of hands played, per participant]
[Moravčík et al. 2017] 49
DeepStack: Theoretical Guarantees
Theorem: Let the error of the counterfactual values returned by the value function be at most ε.
Let T be the number of re-solving iterations for each decision.
Then the exploitability of DeepStack’s strategy is at most k1·ε + k2/√T, where k1 and k2 are game-specific constants.
[Moravčík et al. 2017] 50
from the presentation of Dr. Viliam Lisy
50
TensorCFR
an implementation of CFR+
in TensorFlow
optimized for GPU
an on-going effort of:
Dr. Viliam Lisy
Bc. Jan Rudolf (part-time)
and me :-)
https://gitlab.com/beyond-deepstack/TensorCFR/ 51
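The kind of kernel such a GPU implementation batches is easy to sketch: one vectorized regret-matching+ step over all information sets at once. This is an illustrative sketch, not TensorCFR's actual code.

```python
import tensorflow as tf

@tf.function
def regret_matching_plus_step(regrets, cfvs):
    """One vectorized regret-matching+ step over a batch of information sets
    (both tensors of shape [n_infosets, n_actions])."""
    positive = tf.maximum(regrets, 0.0)
    totals = tf.reduce_sum(positive, axis=-1, keepdims=True)
    n_actions = tf.cast(tf.shape(regrets)[-1], regrets.dtype)
    uniform = tf.ones_like(regrets) / n_actions
    strategy = tf.where(totals > 0.0, tf.math.divide_no_nan(positive, totals), uniform)
    expected = tf.reduce_sum(strategy * cfvs, axis=-1, keepdims=True)
    new_regrets = tf.maximum(regrets + cfvs - expected, 0.0)   # the "+" clipping
    return strategy, new_regrets
```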
Future Plans for TensorCFR
implement continual resolving (i.e. DeepStack) in TensorFlow
extend to arbitrary games other than poker with a clear public tree
research on fast value approximators over exponentially large input space:
sampling?
neural networks??
distributed representations (i.e. represent states similarly as in Word2Vec)???
represent states using conveniently trained GANs????
something else?????????????
Any ideas and suggestions? Now it’s the right time!
56
SL Policy Networks (1/2)
13-layer deep convolutional neural network
goal: to predict expert human moves
a task of classification
trained from 30 million positions from the KGS Go Server
stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s) / ∂σ
(to maximize the likelihood of the human move a selected in state s)
Results:
44.4% accuracy (the state of the art from other groups)
55.7% accuracy (raw board position + move history as input)
57.0% accuracy (all input features)
[Silver et al. 2016]
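As a sketch of this update: for a softmax policy, ascending the log-likelihood of the expert move is the same as descending the cross-entropy. `policy_net` is any Keras model from board features to move logits; this is an illustrative sketch, not the actual AlphaGo training pipeline.

```python
import tensorflow as tf

def sl_policy_update(policy_net, optimizer, states, expert_moves):
    """One supervised step: maximize log p_sigma(a|s) for the expert move a."""
    with tf.GradientTape() as tape:
        logits = policy_net(states, training=True)
        log_probs = tf.nn.log_softmax(logits)
        picked = tf.gather(log_probs, expert_moves, batch_dims=1)
        loss = -tf.reduce_mean(picked)      # minimizing this maximizes the likelihood
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss
```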
SL Policy Networks (2/2)
Small improvements in accuracy led to large improvements in playing strength (see the next slide).
[Silver et al. 2016]
RL Policy Networks (details)
Results (by sampling each move from pρ(·|st)):
80% win rate against the SL policy network
85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011)
The previous state of the art, based only on SL of CNNs:
11% “win” rate against Pachi
[Silver et al. 2016]
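The corresponding self-play update weights each move's log-likelihood gradient by the final game outcome z (+1 win, -1 loss), so moves from won games become more likely; a REINFORCE-style sketch, with baselines and the batching of whole self-play games omitted.

```python
import tensorflow as tf

def rl_policy_update(policy_net, optimizer, states, moves, outcomes):
    """One policy-gradient step: ascend z * log p_rho(a|s) over self-play moves."""
    with tf.GradientTape() as tape:
        logits = policy_net(states, training=True)
        log_probs = tf.gather(tf.nn.log_softmax(logits), moves, batch_dims=1)
        loss = -tf.reduce_mean(outcomes * log_probs)   # outcome-weighted likelihood
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss
```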
Evaluation accuracy in various stages of a game
Move number is the number of moves that had been played in the given position.
Each position evaluated by:
a forward pass of the value network vθ
100 rollouts, played out using the corresponding policy
[Silver et al. 2016]
Scalability
asynchronous multi-threaded search: simulations on CPUs, computation of neural networks on GPUs
AlphaGo: 40 search threads, 40 CPUs, 8 GPUs
Distributed version of AlphaGo (on multiple machines): 40 search threads, 1,202 CPUs, 176 GPUs
[Silver et al. 2016]
ELO Ratings for Various Combinations of Threads
[Silver et al. 2016]
References
Allis, Louis Victor et al. (1994). Searching for Solutions in Games and Artificial Intelligence. Ponsen & Looijen.
Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the Art Open Source Go Program”. In: Advances in Computer Games. Springer, pp. 24–38.
Bowling, Michael et al. (2015). “Heads-Up Limit Hold’em Poker is Solved”. In: Science 347.6218, pp. 145–149. URL: http://poker.cs.ualberta.ca/15science.html.
Dieterle, Frank Jochen (2003). “Multianalyte Quantifications by Means of Integration of Artificial Neural Networks, Genetic Algorithms and Chemometrics for Time-Resolved Analytical Data”. PhD thesis. Universität Tübingen.
Mnih, Volodymyr et al. (2015). “Human-Level Control through Deep Reinforcement Learning”. In: Nature 518.7540, pp. 529–533. URL: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Moravčík, Matej et al. (2017). “DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker”. In: Science 356.6337, pp. 508–513.
Munroe, Randall. Game AIs. URL: https://xkcd.com/1002/ (visited on 04/02/2016).
Silver, David et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587, pp. 484–489.