AI Supremacy in Games
Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and TensorCFR
Karel Ha
13th June 2018
Artificial Intelligence Center
Czech Technical University in Prague
The Outline
AI in Games
Chess: Deep Blue
Atari Games: Deep Reinforcement Learning
Go: AlphaGo, AlphaGo Zero, AlphaZero
Poker: Cepheus, DeepStack
Beyond DeepStack: TensorCFR
Conclusion
AI in Games
Chess: Deep Blue
Deep Blue (1996/1997)
Deep Blue vs Kasparov
six-game chess matches
between world chess champion Garry Kasparov
and an IBM supercomputer called Deep Blue
first match in Philadelphia in 1996, won by Kasparov
second match in New York City in 1997, won by Deep Blue
The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions.
Use of brute-force search!
Atari Games: Deep Reinforcement Learning
Atari Player by Google DeepMind
https://youtu.be/0X-NdPtFKq0?t=21m13s
[Mnih et al. 2015]
Reinforcement Learning
games of self-play
https://youtu.be/0X-NdPtFKq0?t=16m57s
Go: AlphaGo, AlphaGo Zero, AlphaZero
Crash Course: Tree Search
Tree Search
Optimal value v∗(s) determines the outcome of the game:
from every board position or state s
under perfect play by all players.
It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves, where
b is the game's breadth (number of legal moves per position)
d is its depth (game length)
[Silver et al. 2016]
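As a reference point, here is a minimal Python sketch of the exhaustive search this describes (illustrative only; a `game` object with is_terminal, utility, legal_moves and apply is a hypothetical interface), which visits on the order of b^d nodes:

def minimax(state, game, maximizing=True):
    """Exhaustively compute the optimal value v*(s) of `state` (sketch).

    `game` is assumed to provide is_terminal(s), utility(s),
    legal_moves(s) and apply(s, move). Visits roughly b^d nodes.
    """
    if game.is_terminal(state):
        return game.utility(state)          # outcome from player 1's point of view
    values = (minimax(game.apply(state, m), game, not maximizing)
              for m in game.legal_moves(state))
    return max(values) if maximizing else min(values)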
Game tree of Go
Sizes of trees for various games:
chess: b ≈ 35, d ≈ 80
Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe!
That makes Go a googol times more complex than chess.
https://deepmind.com/alpha-go.html
How to handle the size of the game tree?
for the breadth: a neural network to select moves
for the depth: a neural network to evaluate the current position
for the tree traversal: Monte Carlo tree search (MCTS)
[Allis 1994]
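A quick back-of-the-envelope check of these numbers in Python (using the approximate b and d quoted above):

chess = 35 ** 80            # ~ 3e123 move sequences
go = 250 ** 150             # ~ 5e359 move sequences
googol = 10 ** 100

print(f"chess ~ 10^{len(str(chess)) - 1}")                 # ~ 10^123
print(f"Go    ~ 10^{len(str(go)) - 1}")                    # ~ 10^359
print(f"Go / chess exceeds a googol: {(go // chess) > googol}")  # True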
Monte Carlo tree search
Crash Course: Neural Networks
Neural Network: Inspiration
inspired by the neuronal structure of the mammalian cerebral cortex
but on much smaller scales
suitable to model systems with a high tolerance to error
e.g. audio or image recognition
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html
Neural Network: Modes
Two modes:
feedforward for making predictions
backpropagation for learning
[Dieterle 2003]
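A minimal numpy sketch of the two modes for a one-hidden-layer network (illustrative, not from the slides): the forward pass makes a prediction, and backpropagation pushes the squared error back to update the weights.

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

def forward(x):
    """Feedforward mode: propagate the input to a prediction."""
    h = np.tanh(x @ W1)
    return h, h @ W2

def backprop(x, y, lr=0.01):
    """Backpropagation mode: push the error back to update the weights."""
    global W1, W2
    h, y_hat = forward(x)
    err = y_hat - y                                   # dLoss/dy_hat for squared error
    grad_W2 = h.T @ err
    grad_W1 = x.T @ ((err @ W2.T) * (1.0 - h ** 2))   # tanh derivative
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 1))
backprop(x, y)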
(Deep) Convolutional Neural Network
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 10
AlphaGo (2016)
Policy and Value Networks
[Silver et al. 2016] 11
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 12
SL Policy Networks
move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%).
[Silver et al. 2016] 13
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 14
Rollout Policy
Rollout policy pπ(a|s) is faster but less accurate than the SL policy network.
It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network.
[Silver et al. 2016]
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 16
RL Policy Networks
identical in structure to the SL policy network
games of self-play
between the current RL policy network and a randomly selected previous iteration
goal: to win in the games of self-play
[Silver et al. 2016]
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 18
Value Network
similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
Beware: successive positions are strongly correlated!
Trained that way, the value network memorized the game outcomes rather than generalizing to new positions.
Solution: generate 30 million (new) positions, each sampled from a separate game
almost the accuracy of Monte Carlo rollouts (using pρ), but 15,000 times less computation!
[Silver et al. 2016]
Value Network: Selection of Moves
evaluation of all successors s′ of the root position s, using vθ(s′)
[Silver et al. 2016] 20
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 21
ELO Ratings for Various Combinations of Networks
[Silver et al. 2016] 22
Monte Carlo Tree Search (MCTS) Algorithm
The tree is traversed by simulation from the root state (i.e. descending the tree).
The next action is selected with a lookahead search:
1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at end of simulation)
[Silver et al. 2016]
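A compressed Python sketch of one simulation with these four phases (illustrative only; `game`, `policy_net`, `value_net` and `rollout` are hypothetical callbacks, and per-player sign handling is omitted):

import math

class Node:
    def __init__(self, prior):
        self.prior = prior          # move probability from the policy network
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}          # action -> Node

def simulate(root_state, root, game, policy_net, value_net, rollout, c_puct=1.0):
    """Run one MCTS simulation (sketch)."""
    state, node, path = root_state, root, [root]
    # 1. selection: descend while the current node is already expanded
    while node.children:
        parent = node
        action, node = max(
            parent.children.items(),
            key=lambda kv: kv[1].value_sum / (kv[1].visits + 1e-8)
            + c_puct * kv[1].prior * math.sqrt(parent.visits + 1) / (1 + kv[1].visits))
        state = game.apply(state, action)
        path.append(node)
    # 2. expansion: let the policy network propose children of the leaf
    for action, prob in policy_net(state).items():
        node.children[action] = Node(prior=prob)
    # 3. evaluation: mix the value network with the outcome of a fast rollout
    value = 0.5 * value_net(state) + 0.5 * rollout(state)
    # 4. backup: update statistics of every node traversed in this simulation
    for n in path:
        n.visits += 1
        n.value_sum += value
    return value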
MCTS Algorithm: Selection
[Silver et al. 2016] 24
MCTS Algorithm: Expansion
A leaf position may be expanded by the SL policy network pσ.
Once it's expanded, it remains so until the end.
[Silver et al. 2016]
MCTS: Evaluation
Both of the following 2 evaluations:
evaluation from the value network vθ(s)
evaluation by the outcome of the fast rollout pπ
[Silver et al. 2016]
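In the AlphaGo paper the two evaluations are combined with a mixing parameter λ into a single leaf value, V(sL) = (1 − λ) · vθ(sL) + λ · zL, where zL is the outcome of the rollout from the leaf sL; λ = 0.5 reportedly worked best.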
MCTS: Backup
At the end of a simulation, each traversed edge updates its values.
[Silver et al. 2016] 27
Once the search is complete, the algorithm
chooses the most visited move from the root
position.
[Silver et al. 2016] 27
Tree Evaluation: using Value Network
tree-edge values averaged over value network evaluations only
[Silver et al. 2016] 28
Tree Evaluation: using Rollouts
tree-edge values averaged over rollout evaluations only
[Silver et al. 2016] 29
Percentage of Simulations
percentage frequency:
which actions were selected during simulations
[Silver et al. 2016] 30
Principal Variation
i.e. the Path with the Maximum Visit Count
AlphaGo selected the move indicated by the red circle.
Fan Hui responded with the move indicated by the white square.
In his post-game commentary, he preferred the move predicted by AlphaGo (label 1).
[Silver et al. 2016]
Tournament with Other Go Programs
[Silver et al. 2016] 32
Fan Hui
professional 2 dan
European Go Champion in 2013, 2014 and 2015
European Professional Go Champion in 2016
biological neural network:
100 billion neurons
100 to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui
AlphaGo versus Fan Hui
AlphaGo won 5-0 in a formal match in October 2015.
“[AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person.”
Fan Hui
Lee Sedol “The Strong Stone”
professional 9 dan
the 2nd in international titles
“Roger Federer” of Go
Lee Sedol would win 97 out of 100 games against Fan Hui.
biological neural network, comparable to Fan Hui’s (in number of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol
“I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time.”
Lee Sedol
“...even beating AlphaGo by 4-1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind.”
interview in JTBC Newsroom
AlphaGo versus Lee Sedol
In March 2016 AlphaGo won 4-1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and Go organisations.
Lee received $170,000 ($150,000 for participating in all five games, and an additional $20,000 for each game won).
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo Master
In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and “Magister”.
This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016.
Over one week, AlphaGo played 60 online fast time-control games.
AlphaGo won this series of games 60:0.
https://deepmind.com/research/alphago/match-archive/master/
23-27 May 2017 in Wuzhen, China
Team Go vs. AlphaGo 0:1
AlphaGo vs. world champion Ke Jie 3:0
https://events.google.com/alphago2017/
AlphaGo Zero
defeated AlphaGo Lee by 100 games to 0
AlphaZero
AI system that mastered chess, Shogi and Go to “superhuman levels” within a handful of hours
defeated AlphaGo Zero (version with 20 blocks trained for 3 days) by 60 games to 40
1. AlphaGo Fan
2. AlphaGo Lee
3. AlphaGo Master
4. AlphaGo Zero
5. AlphaZero
Poker: Cepheus, DeepStack
“Real life consists of bluffing, of little tactics of deception, of asking yourself what is the other man going to think I mean to do.”
John von Neumann
Game Tree in Poker
[Figure: a poker game-tree fragment spanning the pre-flop, flop and turn, with fold/call/raise and check/bet actions at player 1 and player 2 decision nodes and terminal payoffs such as -50, -100, 100 and -200.]
the so-called public tree (tree of public events)
[Moravčík et al. 2017]
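A minimal sketch (hypothetical field names, not from the paper) of how a node of such a public tree might be represented, keeping only publicly observable information plus each player's range over private hands:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PublicNode:
    """One node of the public tree: only publicly observable state."""
    board_cards: List[str]                 # community cards dealt so far
    pot: int                               # chips in the pot
    player_to_act: int                     # 1 or 2, or 0 for a chance node
    ranges: Dict[int, List[float]] = field(default_factory=dict)
    # per-player probability distribution over private hands
    children: Dict[str, "PublicNode"] = field(default_factory=dict)
    # public action ("fold", "call", "bet 100", ...) -> successor node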
Cepheus (2014)
Cepheus
Heads-up Limit Hold’em Poker
http://poker.srv.ualberta.ca/
[Bowling et al. 2015]
a massive use of Counterfactual Regret Minimization+ (CFR+):
game-theoretic analogy to gradient descent
parameters (weights) of neural networks ∼ strategies
loss ∼ counterfactual values
gradient ∼ counterfactual regrets
trained neural network (optimal weights) ∼ solution to the game (Nash equilibrium)
http://poker.srv.ualberta.ca/
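To make the analogy concrete, here is a minimal numpy sketch (not Cepheus itself) of one CFR+ update at a single information set: cumulative regrets are clipped at zero and the next strategy is obtained by regret matching.

import numpy as np

def cfr_plus_update(cum_regret, counterfactual_values, strategy):
    """One CFR+ step at a single information set (sketch).

    counterfactual_values[a] is the counterfactual value of action a;
    their strategy-weighted sum is the value of the information set.
    """
    node_value = counterfactual_values @ strategy
    regrets = counterfactual_values - node_value        # the "gradient"-like signal
    cum_regret = np.maximum(cum_regret + regrets, 0.0)   # the CFR+ clipping at zero
    total = cum_regret.sum()
    next_strategy = (cum_regret / total if total > 0
                     else np.full_like(strategy, 1.0 / len(strategy)))
    return cum_regret, next_strategy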
DeepStack (2016/2017)
https://www.deepstack.ai
DeepStack
completed in December 2016
published in Science in March 2017
the first AI capable of beating professional poker players (11 of them)
44,000 hands of Heads-Up No-Limit Texas Hold’em
https://www.deepstack.ai
DeepStack: Components
[Figure: DeepStack overview — (A) a lookahead tree built from the current public state of the public tree, together with the agent's range and the opponent's counterfactual values; (B) the counterfactual value neural network used to evaluate subtrees; (C) sampled poker situations used to train that network.]
continual resolving
“intuitive” local search
sparse lookahead trees
[Moravčík et al. 2017]
DeepStack: Continual Resolving
only a strategy based on the current state
only for the remainder of the hand
no strategy for the full game
⇒ lower overall exploitability
https://www.deepstack.ai
DeepStack: “Intuitive” Local Search
no reasoning about the full remaining game
computation beyond a certain depth: a fast, approximate estimate
namely, deep neural networks
“gut feeling” of the value of holding any cards in any situation
https://www.deepstack.ai
DeepStack: Deep Counterfactual Value Network
[Figure: a feedforward net of 7 fully connected hidden layers (500 units each, linear + PReLU) maps the pot size, public cards and both players' bucketed card ranges (1,326 hands bucketed into 1,000 buckets per player) to bucketed counterfactual values; inverse bucketing and an outer zero-sum layer recover the per-card counterfactual values and enforce a zero-sum error term.]
[Moravčík et al. 2017]
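A rough numpy sketch of the pictured value function (layer sizes read off the figure; the bucketing, pot/board features and the outer zero-sum correction are only hinted at):

import numpy as np

def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def counterfactual_value_net(features, weights):
    """7 fully connected hidden layers of 500 PReLU units (sketch).

    `features`: pot size, public cards and both players' bucketed ranges
    (1,000 buckets each); output: bucketed counterfactual values for both
    players, to be mapped back to per-card values by inverse bucketing.
    """
    h = features
    for W, b in weights[:-1]:          # the 7 hidden layers
        h = prelu(h @ W + b)
    W_out, b_out = weights[-1]
    return h @ W_out + b_out           # 2 x 1,000 bucket values, flattened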
DeepStack: Sparse Lookahead Trees
a reduced number of actions considered
to play at conventional human speeds
games re-solved in under five seconds
on a simple gaming laptop
with an NVIDIA GeForce GTX 1080 GPU
https://www.deepstack.ai
DeepStack: Against Professional Players
[Figure: DeepStack's win rate against each participant, measured in mbb/g, together with the number of hands each participant played (up to 3,000).]
[Moravčík et al. 2017]
DeepStack: Theoretical Guarantees
Theorem
Let the error of counterfactual values returned by the value function be ≤ ε.
Let T be the number of re-solving iterations for each decision.
Then the exploitability of the resulting strategies is
≤ k1 · ε + k2 / √T,
where k1 and k2 are game-specific constants.
[Moravčík et al. 2017]
Beyond DeepStack: TensorCFR
from a presentation of Dr. Viliam Lisý
TensorCFR (2018)
TensorCFR
an implementation of CFR+
in TensorFlow
optimized for GPU
ongoing effort of:
Dr. Viliam Lisý
Bc. Jan Rudolf (part-time)
and me :-)
https://gitlab.com/beyond-deepstack/TensorCFR/
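As a flavour of why TensorFlow and a GPU help here, a hedged sketch (not the actual TensorCFR code) of the CFR+ regret-matching step written as tensor ops, vectorized over many information sets at once:

import tensorflow as tf

def regret_matching_plus(cum_regrets, regrets):
    """Vectorized CFR+ step: rows = information sets, columns = actions."""
    cum_regrets = tf.nn.relu(cum_regrets + regrets)            # clip regrets at zero
    totals = tf.reduce_sum(cum_regrets, axis=1, keepdims=True)
    strategies = tf.math.divide_no_nan(cum_regrets, totals)    # regret matching
    # fall back to the uniform strategy where all cumulative regrets are zero
    num_actions = tf.cast(tf.shape(cum_regrets)[1], cum_regrets.dtype)
    strategies += tf.cast(tf.equal(totals, 0), cum_regrets.dtype) / num_actions
    return cum_regrets, strategies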
TensorCFR: Computation Graph
https://gitlab.com/beyond-deepstack/TensorCFR/ 52
TensorCFR: Compute Time
https://gitlab.com/beyond-deepstack/TensorCFR/ 53
TensorCFR: Memory
https://gitlab.com/beyond-deepstack/TensorCFR/ 54
TensorCFR: TPU Compatibility
https://gitlab.com/beyond-deepstack/TensorCFR/ 55
Conclusion
Future Plans for TensorCFR
implement continual resolving (i.e. DeepStack) in TensorFlow
extend to arbitrary games other than poker with a clear public tree
research on fast value approximators over exponentially large input space
sampling?
neural networks??
distributed representations (i.e. represent states similarly as in Word2Vec)???
represent states using conveniently trained GANs????
something else?????????????
Any ideas and suggestions? Now it's the right time!
You know why?
Because we're hiring!
Thank you!
Questions?
Backup Slides
SL Policy Networks (1/2)
13-layer deep convolutional neural network
goal: to predict expert human moves
task of classification
trained from 30 million positions from the KGS Go Server
stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s) / ∂σ
(to maximize the likelihood of the human move a selected in state s)
Results:
44.4% accuracy (the state of the art from other groups)
55.7% accuracy (raw board position + move history as input)
57.0% accuracy (all input features)
[Silver et al. 2016]
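A tiny numpy sketch of this update for a linear-softmax policy over moves (illustrative only; the real pσ is a 13-layer CNN): the gradient of log pσ(a|s) is the familiar "observed minus expected" feature difference.

import numpy as np

def sl_policy_update(sigma, features, human_move, lr=0.1):
    """One stochastic-gradient-ascent step on log p_sigma(a|s) (sketch).

    sigma: (num_moves, num_features) weights of a linear softmax policy,
    features: feature vector of state s, human_move: index of the expert move.
    """
    logits = sigma @ features
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # p_sigma(.|s)
    grad = -np.outer(p, features)                 # minus the expected features
    grad[human_move] += features                  # plus the observed move's features
    return sigma + lr * grad                      # ascent on the log-likelihood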
SL Policy Networks (2/2)
Small improvements in accuracy led to large improvements
in playing strength (see the next slide)
[Silver et al. 2016]
RL Policy Networks (details)
Results (by sampling each move from pρ(·|st)):
80% win rate against the SL policy network
85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011)
The previous state of the art, based only on SL of CNNs:
11% “win” rate against Pachi
[Silver et al. 2016]
Evaluation accuracy in various stages of a game
Move number is the number of moves that had been played in the given position.
Each position evaluated by:
forward pass of the value network vθ
100 rollouts, played out using the corresponding policy
[Silver et al. 2016]
Scalability
asynchronous multi-threaded search
simulations on CPUs
computation of neural networks on GPUs
AlphaGo:
40 search threads
40 CPUs
8 GPUs
Distributed version of AlphaGo (on multiple machines):
40 search threads
1202 CPUs
176 GPUs
[Silver et al. 2016]
ELO Ratings for Various Combinations of Threads
[Silver et al. 2016]
References
Allis, Louis Victor et al. (1994). Searching for Solutions in Games and Artificial Intelligence. Ponsen & Looijen.
Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the Art Open Source Go Program”. In: Advances in Computer Games. Springer, pp. 24–38.
Bowling, Michael et al. (2015). “Heads-Up Limit Hold’em Poker is Solved”. In: Science 347.6218, pp. 145–149. URL: http://poker.cs.ualberta.ca/15science.html.
Dieterle, Frank Jochen (2003). “Multianalyte Quantifications by Means of Integration of Artificial Neural Networks, Genetic Algorithms and Chemometrics for Time-Resolved Analytical Data”. PhD thesis. Universität Tübingen.
Mnih, Volodymyr et al. (2015). “Human-Level Control through Deep Reinforcement Learning”. In: Nature 518.7540, pp. 529–533. URL: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Moravčík, Matej et al. (2017). “DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker”. In: Science 356.6337, pp. 508–513.
Munroe, Randall. Game AIs. URL: https://xkcd.com/1002/ (visited on 04/02/2016).
Silver, David et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587, pp. 484–489.

AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and TensorCFR

  • 1.
    AI Supremacy inGames Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and TensorCFR Karel Ha 13th June 2018 Artificial Intelligence Center Czech Technical University in Prague
  • 2.
    The Outline AI inGames Chess: Deep Blue Atari Games: Deep Reinforcement Learning Go: AlphaGo, AlphaGo Zero, AlphaZero Poker: Cepheus, DeepStack Beyond DeepStack: TensorCFR Conclusion 1
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
    Deep Blue vsKasparov 2
  • 8.
    Deep Blue vsKasparov six-game chess matches 2
  • 9.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov 2
  • 10.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue 2
  • 11.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov 2
  • 12.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue 2
  • 13.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions. 2
  • 14.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions. 2
  • 15.
    Deep Blue vsKasparov six-game chess matches between world chess champion Garry Kasparov and an IBM supercomputer called Deep Blue first match in Philadelphia in 1996 and won by Kasparov second match in New York City in 1997 and won by Deep Blue The 1997 match was the first defeat of a reigning world chess champion by a computer under tournament conditions. Use of brute-force search! 2
  • 16.
    Atari Games: DeepReinforcement Learning
  • 17.
    Atari Player byGoogle DeepMind https://youtu.be/0X-NdPtFKq0?t=21m13s [Mnih et al. 2015] 3
  • 18.
  • 19.
    Reinforcement Learning games ofself-play https://youtu.be/0X-NdPtFKq0?t=16m57s 4
  • 20.
    Go: AlphaGo, AlphaGoZero, AlphaZero
  • 21.
  • 22.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: [Silver et al. 2016] 5
  • 23.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s [Silver et al. 2016] 5
  • 24.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. [Silver et al. 2016] 5
  • 25.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. [Silver et al. 2016] 5
  • 26.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately bd possible sequences of moves, where [Silver et al. 2016] 5
  • 27.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately bd possible sequences of moves, where b is the games breadth (number of legal moves per position) [Silver et al. 2016] 5
  • 28.
    Tree Search Optimal valuev∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately bd possible sequences of moves, where b is the games breadth (number of legal moves per position) d is its depth (game length) [Silver et al. 2016] 5
  • 29.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 [Allis 1994] 6
  • 30.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! [Allis 1994] 6
  • 31.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html [Allis 1994] 6
  • 32.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? [Allis 1994] 6
  • 33.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves [Allis 1994] 6
  • 34.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves for the depth: a neural network to evaluate current position [Allis 1994] 6
  • 35.
    Game tree ofGo Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves for the depth: a neural network to evaluate current position for the tree traverse: Monte Carlo tree search (MCTS) [Allis 1994] 6
  • 36.
  • 37.
  • 38.
  • 39.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 40.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex but on much smaller scales http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 41.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex but on much smaller scales suitable to model systems with a high tolerance to error http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 42.
    Neural Network: Inspiration inspiredby the neuronal structure of the mammalian cerebral cortex but on much smaller scales suitable to model systems with a high tolerance to error e.g. audio or image recognition http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 8
  • 43.
  • 44.
    Neural Network: Modes Twomodes [Dieterle 2003] 9
  • 45.
    Neural Network: Modes Twomodes feedforward for making predictions [Dieterle 2003] 9
  • 46.
    Neural Network: Modes Twomodes feedforward for making predictions backpropagation for learning [Dieterle 2003] 9
  • 47.
    (Deep) Convolutional NeuralNetwork http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 10
  • 48.
    (Deep) Convolutional NeuralNetwork http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 10
  • 49.
  • 50.
    Policy and ValueNetworks [Silver et al. 2016] 11
  • 51.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 12
  • 52.
    SL Policy Networks moveprobabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%). [Silver et al. 2016] 13
  • 53.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 14
  • 54.
    Rollout Policy Rollout policypπ(a|s) is faster but less accurate than SL policy network. [Silver et al. 2016] 15
  • 55.
    Rollout Policy Rollout policypπ(a|s) is faster but less accurate than SL policy network. It takes 2µs to select an action, compared to 3 ms in case of SL policy network. [Silver et al. 2016] 15
  • 56.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 16
  • 57.
    RL Policy Networks identicalin structure to the SL policy network [Silver et al. 2016] 17
  • 58.
    RL Policy Networks identicalin structure to the SL policy network games of self-play [Silver et al. 2016] 17
  • 59.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration [Silver et al. 2016] 17
  • 60.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration goal: to win in the games of self-play [Silver et al. 2016] 17
  • 61.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration goal: to win in the games of self-play [Silver et al. 2016] 17
  • 62.
    RL Policy Networks identicalin structure to the SL policy network games of self-play between the current RL policy network and a randomly selected previous iteration goal: to win in the games of self-play [Silver et al. 2016] 17
  • 63.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 18
  • 64.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution [Silver et al. 2016] 19
  • 65.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! [Silver et al. 2016] 19
  • 66.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. [Silver et al. 2016] 19
  • 67.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game [Silver et al. 2016] 19
  • 68.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! [Silver et al. 2016] 19
  • 69.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! [Silver et al. 2016] 19
  • 70.
    Value Network similar architectureto the policy network, but outputs a single prediction instead of a probability distribution Beware: successive positions are strongly correlated! Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! [Silver et al. 2016] 19
  • 71.
    Value Network: Selectionof Moves evaluation of all successors s of the root position s, using vθ(s) [Silver et al. 2016] 20
  • 72.
    Training the (DeepConvolutional) Neural Networks [Silver et al. 2016] 21
  • 73.
    ELO Ratings forVarious Combinations of Networks [Silver et al. 2016] 22
Monte Carlo Tree Search (MCTS) Algorithm
The tree is traversed by simulation from the root state (i.e. descending the tree).
The next action is selected with a lookahead search:
1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at end of simulation)
[Silver et al. 2016] 23
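For orientation, a compact Python sketch of the four phases follows; `prior_policy`, `value`, `rollout` and `apply_action` are hypothetical stand-ins for pσ, vθ, pπ and the game rules, the mixing constant `mix` corresponds to the paper's λ, and sign flips between the two players are omitted, so this illustrates the control flow rather than AlphaGo's actual search.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}  # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts(root_state, n_simulations, prior_policy, value, rollout, apply_action,
         c_puct=1.0, mix=0.5):
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        node, state, path = root, root_state, [root]
        # 1. selection: descend the tree with a PUCT-style score
        while node.children:
            total = sum(child.visits for child in node.children.values()) + 1
            action, node = max(
                node.children.items(),
                key=lambda kv: kv[1].q()
                + c_puct * kv[1].prior * math.sqrt(total) / (1 + kv[1].visits))
            state = apply_action(state, action)
            path.append(node)
        # 2. expansion: create children with priors from the policy network
        for action, p in prior_policy(state).items():
            node.children[action] = Node(prior=p)
        # 3. evaluation: mix the value-network estimate with a fast rollout
        leaf_value = (1 - mix) * value(state) + mix * rollout(state)
        # 4. backup: update every node visited during this simulation
        for visited in path:
            visited.visits += 1
            visited.value_sum += leaf_value
    # the final move is the most visited action at the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```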
MCTS Algorithm: Expansion
A leaf position may be expanded by the SL policy network pσ.
Once it is expanded, it remains so until the end.
[Silver et al. 2016] 25
MCTS: Evaluation
Both of the following two evaluations:
evaluation from the value network vθ(s)
evaluation by the outcome of the fast rollout pπ
[Silver et al. 2016] 26
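The two estimates are combined linearly into a single leaf value; [Silver et al. 2016] report a mixing constant λ = 0.5. A one-line sketch:

```python
def leaf_evaluation(v_theta, z_rollout, lam=0.5):
    """Mix the value-network estimate with the fast-rollout outcome:
    V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L   (lam = 0.5 in the paper)."""
    return (1.0 - lam) * v_theta + lam * z_rollout
```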
MCTS: Backup
At the end of a simulation, each traversed edge updates its values.
[Silver et al. 2016] 27
Once the search is complete, the algorithm chooses the most visited move from the root position.
[Silver et al. 2016] 27
Tree Evaluation: using Value Network
tree-edge values averaged over value network evaluations only
[Silver et al. 2016] 28
Tree Evaluation: using Rollouts
tree-edge values averaged over rollout evaluations only
[Silver et al. 2016] 29
Percentage of Simulations
percentage frequency: which actions were selected during simulations
[Silver et al. 2016] 30
Principal Variation, i.e. the Path with the Maximum Visit Count
AlphaGo selected the move indicated by the red circle.
Fan Hui responded with the move indicated by the white square.
In his post-game commentary, he preferred the move predicted by AlphaGo (label 1).
[Silver et al. 2016] 31
Tournament with Other Go Programs
[Silver et al. 2016] 32
Fan Hui
professional 2 dan
European Go Champion in 2013, 2014 and 2015
European Professional Go Champion in 2016
biological neural network:
100 billion neurons
100 up to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui 33
AlphaGo versus Fan Hui
AlphaGo won 5-0 in a formal match in October 2015.
“[AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person.”
Fan Hui
34
Lee Sedol, “The Strong Stone”
professional 9 dan
the 2nd in international titles
“Roger Federer” of Go
Lee Sedol would win 97 out of 100 games against Fan Hui.
biological neural network, comparable to Fan Hui’s (in number of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol 35
“I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time.”
Lee Sedol
“...even beating AlphaGo by 4-1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind.”
interview in JTBC Newsroom
35
AlphaGo versus Lee Sedol
In March 2016, AlphaGo won 4-1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and to Go organisations.
Lee received $170,000 ($150,000 for participating in all five games, and an additional $20,000 for each game won).
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 36
AlphaGo Master
In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and “Magister”.
This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016.
Over one week, AlphaGo played 60 online fast time-control games.
AlphaGo won this series of games 60:0.
https://deepmind.com/research/alphago/match-archive/master/ 37
23-27 May 2017 in Wuzhen, China
Team Go vs. AlphaGo 0:1
AlphaGo vs. world champion Ke Jie 3:0
https://events.google.com/alphago2017/ 38
AlphaGo Zero
defeated AlphaGo Lee by 100 games to 0
38
AlphaZero
AI system that mastered chess, shogi and Go to “superhuman levels” within a handful of hours
defeated AlphaGo Zero (version with 20 blocks trained for 3 days) by 60 games to 40
39
1. AlphaGo Fan
2. AlphaGo Lee
3. AlphaGo Master
4. AlphaGo Zero
5. AlphaZero
39
“Real life consists of bluffing, of little tactics of deception, of asking yourself what is the other man going to think I mean to do.”
John von Neumann
39
Game Tree in Poker
[figure: a poker game tree spanning the pre-flop, flop and turn, with fold/call/bet/check/raise actions and terminal payoffs]
the so-called public tree (tree of public events)
[Moravčík et al. 2017] 40
Cepheus
Heads-up Limit Hold’em Poker
http://poker.srv.ualberta.ca/
[Bowling et al. 2015] 41
a massive use of Counterfactual Regret Minimization+ (CFR+): a game-theoretic analogy to gradient descent
parameters (weights) of neural networks ∼ strategies
loss ∼ counterfactual values
gradient ∼ counterfactual regrets
trained neural network (optimal weights) ∼ solution to the game (Nash equilibrium)
http://poker.srv.ualberta.ca/ 42
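To make the analogy concrete, here is a minimal numpy sketch of the regret-matching+ loop at a single information set; `counterfactual_values` is a hypothetical stand-in for the game-tree traversal, and real CFR+ implementations such as Cepheus add alternating updates and the full tree walk.

```python
import numpy as np

def regret_matching_plus(counterfactual_values, n_actions, n_iters=1000):
    """Sketch of the CFR+ inner loop at one information set.
    counterfactual_values(strategy) -> np.array of per-action values."""
    regrets = np.zeros(n_actions)          # plays the role of a (clipped) gradient
    strategy_sum = np.zeros(n_actions)     # for the average strategy
    for t in range(1, n_iters + 1):
        positive = np.maximum(regrets, 0.0)
        total = positive.sum()
        strategy = positive / total if total > 0 else np.full(n_actions, 1.0 / n_actions)
        cfv = counterfactual_values(strategy)          # the "loss"-like quantity
        expected = strategy @ cfv
        # CFR+ keeps regrets clipped at zero (the "+" in CFR+)
        regrets = np.maximum(regrets + (cfv - expected), 0.0)
        strategy_sum += t * strategy                   # linearly weighted average
    return strategy_sum / strategy_sum.sum()           # approximate equilibrium strategy
```

The linearly weighted average strategy is what converges to an approximate Nash equilibrium, mirroring how the trained weights of a neural network are the output of gradient descent.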
DeepStack
completed in December 2016
published in Science in March 2017
the first AI capable of beating (11) professional poker players
44,000 hands of Heads-Up No-Limit Texas Hold’em
https://www.deepstack.ai 43
DeepStack: Components
[figure: (A) re-solving a sparse lookahead tree rooted at the current public state, given the agent’s range and the opponent’s counterfactual values; (B) a neural net mapping ranges at a public state to counterfactual values; (C) the net is trained on sampled poker situations]
continual resolving
“intuitive” local search
sparse lookahead trees
[Moravčík et al. 2017] 44
DeepStack: Continual Resolving
only a strategy based on the current state
only for the remainder of the hand
no strategy for the full game
⇒ lower overall exploitability
https://www.deepstack.ai 45
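The control flow can be sketched as a loop that re-solves only the remainder of the hand at each of the agent's decision points; `resolve` and `env` below are hypothetical stand-ins for the subgame solver and the game interface, and DeepStack's actual bookkeeping of ranges and counterfactual values is considerably more involved.

```python
def play_hand(initial_state, initial_range, initial_opp_cfvs, resolve, env):
    """Sketch of continual re-solving: at every decision point, re-solve only
    the remainder of the hand from the current public state, carrying forward
    the agent's range and the opponent's counterfactual values."""
    state, own_range, opp_cfvs = initial_state, initial_range, initial_opp_cfvs
    while not env.is_terminal(state):
        if env.to_act(state) == "agent":
            strategy, own_range, opp_cfvs = resolve(state, own_range, opp_cfvs)
            action = env.sample(strategy)
        else:
            action = env.observe_opponent(state)
        state = env.apply(state, action)
    return env.payoff(state)
```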
DeepStack: “Intuitive” Local Search
no reasoning about the full remaining game
computation beyond a certain depth: a fast approximate estimate
namely, deep neural networks
a “gut feeling” of the value of holding any cards in any situation
https://www.deepstack.ai 46
DeepStack: Deep Counterfactual Value Network
[architecture figure: card ranges for both players (1,326 hands each) plus the pot size and public cards are bucketed into the input; a feed-forward net of 7 fully connected hidden layers (500 units each, linear + PReLU) outputs bucket values, which are mapped back to per-card counterfactual values; an outer zero-sum network enforces that the two players’ values sum to zero]
[Moravčík et al. 2017] 47
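As a rough illustration of the inner feed-forward part (bucketed ranges in, bucket values out), a Keras sketch follows; the sizes mirror the figure (7 hidden layers of 500 PReLU units), while the bucketing, inverse bucketing and outer zero-sum correction from the paper are omitted.

```python
import tensorflow as tf

def build_cfv_net(n_buckets, n_hidden=7, width=500):
    """Sketch of the feed-forward core of a counterfactual value network:
    bucketed ranges for both players (plus a pot feature) in,
    bucket values for both players out."""
    inputs = tf.keras.Input(shape=(2 * n_buckets + 1,))   # two ranges + pot size
    x = inputs
    for _ in range(n_hidden):
        x = tf.keras.layers.Dense(width)(x)               # linear layer
        x = tf.keras.layers.PReLU()(x)                    # parametric ReLU
    outputs = tf.keras.layers.Dense(2 * n_buckets)(x)     # bucket values, both players
    return tf.keras.Model(inputs, outputs)
```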
DeepStack: Sparse Lookahead Trees
a reduced number of actions considered
to play at conventional human speeds
games re-solved in under five seconds
on a simple gaming laptop
with an NVIDIA GeForce GTX 1080 GPU
https://www.deepstack.ai 48
DeepStack: Against Professional Players
[plot: DeepStack win rate (mbb/g) against the number of hands played, per participant]
[Moravčík et al. 2017] 49
DeepStack: Theoretical Guarantees
Theorem: Let the error of the counterfactual values returned by the value function be at most ε.
Let T be the number of re-solving iterations for each decision.
Then the exploitability of DeepStack’s strategy is at most k1·ε + k2/√T, where k1 and k2 are game-specific constants.
[Moravčík et al. 2017] 50
from the presentation of Dr. Viliam Lisy
50
TensorCFR
an implementation of CFR+
in TensorFlow
optimized for GPU
an on-going effort of:
Dr. Viliam Lisy
Bc. Jan Rudolf (part-time)
and me :-)
https://gitlab.com/beyond-deepstack/TensorCFR/ 51
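The kind of kernel such a GPU implementation batches is easy to sketch: one vectorized regret-matching+ step over all information sets at once. This is an illustrative sketch, not TensorCFR's actual code.

```python
import tensorflow as tf

@tf.function
def regret_matching_plus_step(regrets, cfvs):
    """One vectorized regret-matching+ step over a batch of information sets
    (both tensors of shape [n_infosets, n_actions])."""
    positive = tf.maximum(regrets, 0.0)
    totals = tf.reduce_sum(positive, axis=-1, keepdims=True)
    n_actions = tf.cast(tf.shape(regrets)[-1], regrets.dtype)
    uniform = tf.ones_like(regrets) / n_actions
    strategy = tf.where(totals > 0.0, tf.math.divide_no_nan(positive, totals), uniform)
    expected = tf.reduce_sum(strategy * cfvs, axis=-1, keepdims=True)
    new_regrets = tf.maximum(regrets + cfvs - expected, 0.0)   # the "+" clipping
    return strategy, new_regrets
```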
Future Plans for TensorCFR
implement continual resolving (i.e. DeepStack) in TensorFlow
extend to arbitrary games other than poker with a clear public tree
research on fast value approximators over exponentially large input space:
sampling?
neural networks??
distributed representations (i.e. represent states similarly as in Word2Vec)???
represent states using conveniently trained GANs????
something else?????????????
Any ideas and suggestions? Now it’s the right time!
56
SL Policy Networks (1/2)
13-layer deep convolutional neural network
goal: to predict expert human moves
a task of classification
trained from 30 million positions from the KGS Go Server
stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s) / ∂σ
(to maximize the likelihood of the human move a selected in state s)
Results:
44.4% accuracy (the state of the art from other groups)
55.7% accuracy (raw board position + move history as input)
57.0% accuracy (all input features)
[Silver et al. 2016]
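As a sketch of this update: for a softmax policy, ascending the log-likelihood of the expert move is the same as descending the cross-entropy. `policy_net` is any Keras model from board features to move logits; this is an illustrative sketch, not the actual AlphaGo training pipeline.

```python
import tensorflow as tf

def sl_policy_update(policy_net, optimizer, states, expert_moves):
    """One supervised step: maximize log p_sigma(a|s) for the expert move a."""
    with tf.GradientTape() as tape:
        logits = policy_net(states, training=True)
        log_probs = tf.nn.log_softmax(logits)
        picked = tf.gather(log_probs, expert_moves, batch_dims=1)
        loss = -tf.reduce_mean(picked)      # minimizing this maximizes the likelihood
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss
```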
SL Policy Networks (2/2)
Small improvements in accuracy led to large improvements in playing strength (see the next slide).
[Silver et al. 2016]
RL Policy Networks (details)
Results (by sampling each move from pρ(·|st)):
80% win rate against the SL policy network
85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011)
The previous state of the art, based only on SL of CNNs:
11% “win” rate against Pachi
[Silver et al. 2016]
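The corresponding self-play update weights each move's log-likelihood gradient by the final game outcome z (+1 win, -1 loss), so moves from won games become more likely; a REINFORCE-style sketch, with baselines and the batching of whole self-play games omitted.

```python
import tensorflow as tf

def rl_policy_update(policy_net, optimizer, states, moves, outcomes):
    """One policy-gradient step: ascend z * log p_rho(a|s) over self-play moves."""
    with tf.GradientTape() as tape:
        logits = policy_net(states, training=True)
        log_probs = tf.gather(tf.nn.log_softmax(logits), moves, batch_dims=1)
        loss = -tf.reduce_mean(outcomes * log_probs)   # outcome-weighted likelihood
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss
```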
Evaluation accuracy in various stages of a game
Move number is the number of moves that had been played in the given position.
Each position evaluated by:
a forward pass of the value network vθ
100 rollouts, played out using the corresponding policy
[Silver et al. 2016]
Scalability
asynchronous multi-threaded search: simulations on CPUs, computation of neural networks on GPUs
AlphaGo: 40 search threads, 40 CPUs, 8 GPUs
Distributed version of AlphaGo (on multiple machines): 40 search threads, 1,202 CPUs, 176 GPUs
[Silver et al. 2016]
ELO Ratings for Various Combinations of Threads
[Silver et al. 2016]
References
Allis, Louis Victor et al. (1994). Searching for Solutions in Games and Artificial Intelligence. Ponsen & Looijen.
Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the Art Open Source Go Program”. In: Advances in Computer Games. Springer, pp. 24–38.
Bowling, Michael et al. (2015). “Heads-Up Limit Hold’em Poker is Solved”. In: Science 347.6218, pp. 145–149. URL: http://poker.cs.ualberta.ca/15science.html.
Dieterle, Frank Jochen (2003). “Multianalyte Quantifications by Means of Integration of Artificial Neural Networks, Genetic Algorithms and Chemometrics for Time-Resolved Analytical Data”. PhD thesis. Universität Tübingen.
Mnih, Volodymyr et al. (2015). “Human-Level Control through Deep Reinforcement Learning”. In: Nature 518.7540, pp. 529–533. URL: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Moravčík, Matej et al. (2017). “DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker”. In: Science 356.6337, pp. 508–513.
Munroe, Randall. Game AIs. URL: https://xkcd.com/1002/ (visited on 04/02/2016).
Silver, David et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587, pp. 484–489.