AlphaGo / AlphaGo Zero
Keita Watanabe
Motivation
• Tree-based decision-making frameworks are common across robotics, autonomous vehicles (AV), and other domains.
• Monte Carlo Tree Search (MCTS) is one of the most successful tree search algorithms.
• A recent MCTS-based decision-making framework for AV (Cai 2019) was significantly influenced by AlphaGo.
Overview of this presentation
• Introduction to Go
• AlphaGo
• SL Policy Network
• RL Policy Network
• Value Network
• MCTS (Monte Carlo Tree Search)
• AlphaGo Zero
• Improvements over AlphaGo
Rule of Go I
Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Go_(game)
Go is an adversarial game with the objective of
surrounding a larger total area of the board with one's
stones than the opponent. As the game progresses, the
players position stones on the board to map out formations
and potential territories. Contests between opposing
formations are often extremely complex and may result in the
expansion, reduction, or wholesale capture and loss of
formation stones.
The four liberties (adjacent empty points) of a single black
stone (A), as White reduces those liberties by one (B, C, and D).
When Black has only one liberty left (D), that stone is "in atari".
White may capture that stone (remove from board) with a play
on its last liberty (at D-1).
A basic principle of Go is that a group of stones must have at
least one "liberty" to remain on the board. A "liberty" is an open
"point" (intersection) bordering the group. An enclosed liberty
(or liberties) is called an eye (眼), and a group of stones with
two or more eyes is said to be unconditionally "alive". Such
groups cannot be captured, even if surrounded.
Rule of Go II
Points where
Black can capture White
Points where
White cannot place a stone
Fig. 1.1 of (Otsuki 2017)
Rule of Go IV: Victory judgment
If you want to know more, just ask Ivo or Erik
Fig. 1.2 of (Otsuki 2017)
* Score: # of stones + # of eyes
* Komi: compensation added to White's score because Black moves first. Typically 7.5 points.
* Black territory 45, White territory 36
45 > 36 + 7.5 => Black wins
Why is Go so difficult?
Approximate size of the search space:
Othello: 10^60
Chess: 10^120
Shogi: 10^220
Go: 10^360
Table 1.1 of (Otsuki 2017)
The size of the search space is enormous!
AlphaGo
Abstract
The game of Go has long been viewed as the most challenging of classic
games for artificial intelligence owing to its enormous search space and the
difficulty of evaluating board positions and moves. Here we introduce a new
approach to computer Go that uses 'value networks' to evaluate board
positions and 'policy networks' to select moves. These deep neural networks are
trained by a novel combination of supervised learning from human expert games,
and reinforcement learning from games of self-play. Without any lookahead
search, the neural networks play Go at the level of state-of-the-art Monte
Carlo tree search programs that simulate thousands of random games of
self-play. We also introduce a new search algorithm that combines Monte
Carlo simulation with value and policy networks. Using this search algorithm, our
program AlphaGo achieved a 99.8% winning rate against other Go programs,
and defeated the human European Go champion by 5 games to 0. This is the
first time that a computer program has defeated a human professional player in the
full-sized game of Go, a feat previously thought to be at least a decade away.
Overview of AlphaGo
• Rollout Policy: predicts the next move; logistic regression; fast; used for playouts; trained on records of strong players.
• Policy Network: predicts the next move; CNN; used for node selection & expansion; trained on records of strong players and refined by self-play (RL).
• Value Network: predicts the win rate; CNN; trained on self-play (RL) games.
• All three components are combined in MCTS.
Rollout Policy
• Logistic regression over well-known, hand-crafted features commonly used in this field (see the table below).
• Trained with 30 million positions from the KGS Go Server (https://www.gokgs.com/).
• This model is used for rollouts (details will be explained later).
In total: 109,747 features. Extended Table 4 of (Silver 2016)
Logistic Regression
Inputs $x_1, x_2, \dots, x_{109747}$ feed a weighted sum:

$$u = \sum_{k=1}^{109747} w_k x_k, \qquad \tilde{p} = \frac{1}{1 + e^{-u}}$$
• Logistic regression over well-known features commonly used in this field (a small code sketch follows below)
• Trained with 30 million positions from the KGS Go Server (https://www.gokgs.com/)
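Since the rollout policy is just a logistic regression over a sparse binary feature vector, a minimal sketch in Python/NumPy looks like the following. The feature indices and weights here are hypothetical placeholders, not the actual AlphaGo features or parameters.

```python
import numpy as np

NUM_FEATURES = 109_747  # size of the binary feature vector for one candidate move

def rollout_policy_score(active_features, weights):
    """Score one candidate move from the indices of its active (value-1) features."""
    u = weights[active_features].sum()       # u = sum_k w_k * x_k with x_k in {0, 1}
    return 1.0 / (1.0 + np.exp(-u))          # p~ = 1 / (1 + exp(-u))

# Hypothetical usage: random weights and a move with three active features.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=NUM_FEATURES)
print(rollout_policy_score(np.array([12, 345, 6789]), w))
```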
Tree Policy
• A logistic regression model with additional features.
• Better accuracy than the rollout policy, at the cost of extra computation time.
• Used in the expansion step of Monte Carlo Tree Search.
In total: 141,989 features. Extended Table 4 of (Silver 2016)
Overview of AlphaGo (recap: rollout policy, policy network, value network, combined in MCTS)
Policy Network: Overview
Fig. 1 of (Silver 2016)
• Convolutional Neural Network
• The network is first trained by supervised learning and later refined by reinforcement learning
• Trained with the KGS dataset: 29.4 million positions from 160,000 games played by KGS 6 to 9 dan players
SL policy network
The output is a probability for each move. Fig. 2.18 of (Otsuki 2017)
SL Policy Network
• Convolutional Neural Network
• Trained with the KGS dataset: 29.4 million positions from 160,000 games played by KGS 6 to 9 dan players
• 48 input channels (features) are prepared (the next slide explains the details)
Board image: https://senseis.xmp.net/?Go
Architecture: 19 x 19 board with 48 input channels → a 5 x 5 convolutional layer → several 3 x 3 convolutional layers → 19 x 19 output giving the probability of the next move. A sketch of such a network follows below.
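To make the architecture concrete, here is a minimal sketch of a 13-layer policy-style CNN in PyTorch. This is an assumption on my part: the filter count, padding, and training details are simplified and do not reproduce the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Simplified SL-policy-style CNN: 48 x 19 x 19 input -> move probabilities."""
    def __init__(self, channels: int = 192, num_hidden_layers: int = 11):
        super().__init__()
        layers = [nn.Conv2d(48, channels, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(num_hidden_layers):                     # 3x3 convolutional layers
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, kernel_size=1)]      # 1x1 conv to a single plane
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.body(x).flatten(1)                       # (batch, 361)
        return torch.softmax(logits, dim=1)                    # probability of the next move

# Hypothetical usage: one random board position.
probs = PolicyNet()(torch.randn(1, 48, 19, 19))
print(probs.shape, probs.sum().item())                         # torch.Size([1, 361]), ~1.0
```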
Input features
(Silver 2016)
Note: Most of the hand-crafted features here are not new; they are commonly used in this field.
RL Policy Network
• They further trained the policy network by policy-gradient reinforcement learning (see the sketch below).
• Training is done by self-play.
• The RL policy network won about 80% of games against the original SL policy network.
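As an illustration of the idea only (not AlphaGo's exact training loop), a REINFORCE-style update scales the log-probabilities of the moves played in a self-play game by the final outcome z = +1/-1. `policy_net` can be the PolicyNet sketch above; the self-play loop that produces `states`, `actions`, and `z` is assumed to exist elsewhere.

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, z):
    """One policy-gradient step from a finished self-play game.

    states:  board features, shape (T, 48, 19, 19)
    actions: chosen move indices, shape (T,)
    z:       final outcome from the player's perspective, +1 win / -1 loss
    """
    probs = policy_net(states)                                   # (T, 361)
    log_p = torch.log(probs.gather(1, actions.unsqueeze(1)) + 1e-8)
    loss = -(z * log_p).mean()                                   # ascend z * log pi(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```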
Overview of AlphaGo (recap: rollout policy, policy network, value network, combined in MCTS)
Value Network
• AlphaGo uses the RL policy network to generate training data for the value network, which predicts the win rate.
• Training data: 30 million (position, win/lose) pairs.
• Generating the data took about 1 week on 50 GPUs.
• Training also took about 1 week on 50 GPUs.
• The network provides an evaluation function for Go, which was previously considered very hard to construct. (A sketch of the regression objective follows below.)
Fig. 1 of (Silver 2016)
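For concreteness, a minimal sketch of the value-network regression in PyTorch, under simplifying assumptions: a small CNN with a scalar tanh output trained with mean squared error against the game outcome z. It is not the paper's 15-layer architecture.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Simplified value network: board features -> predicted outcome in [-1, 1]."""
    def __init__(self, channels: int = 192):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(48, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 256),
                                  nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        return self.head(self.conv(x)).squeeze(1)                # (batch,)

def value_loss(net, positions, z):
    """Mean squared error between the predicted value and the game outcome z (+1/-1)."""
    return ((net(positions) - z) ** 2).mean()
```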
| | Rollout Policy | Policy Network | Value Network |
| Model | Logistic regression | CNN (13 layers) | CNN (15 layers) |
| Time to evaluate one state | 2 μs | 5 ms | 5 ms |
| Time per playout (200 moves) | 0.4 ms | 1.0 s | - |
| Playouts per second | about 2,500 | about 1 | - |
| Move-prediction accuracy | 24% | 57% | - |
Overview of AlphaGo (recap: rollout policy, policy network, value network, combined in MCTS)
MCTS Example: Nim
• You can take one or more stones from either the left or the right pile.
• You win when you take the last stone.
• This example is from http://blog.brainpad.co.jp/entry/2018/04/05/163000
Game Tree
Green: Player who moves first wins
Yellow: Player who moves second wins
Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000
Monte Carlo Simulation
Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000
You can estimate the Q value of each state by simulation.
-> MCTS is a heuristic that lets us efficiently investigate promising states (see the sketch below).
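To illustrate estimating values by random playouts, here is a tiny self-contained Monte Carlo estimator for a Nim position. The pile sizes below are an arbitrary choice for illustration; they are not taken from the linked blog post.

```python
import random

def random_playout(piles):
    """Play Nim randomly; return +1 if the player to move at the start wins, else -1."""
    piles = list(piles)
    root_player_to_move = True
    while True:
        i = random.choice([k for k, n in enumerate(piles) if n > 0])
        piles[i] -= random.randint(1, piles[i])   # take 1..n stones from one pile
        if not any(piles):                        # whoever takes the last stone wins
            return 1 if root_player_to_move else -1
        root_player_to_move = not root_player_to_move

def estimate_q(piles, n_sim=10_000):
    """Monte Carlo estimate of the expected reward (in [-1, 1]) for the player to move."""
    return sum(random_playout(piles) for _ in range(n_sim)) / n_sim

print(estimate_q((3, 2)))   # hypothetical starting position with piles of 3 and 2
```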
Monte Carlo Tree Search
Monte Carlo tree search (MCTS) is a heuristic
search algorithm for decision processes.
The focus of Monte Carlo tree search is on
the analysis of the most promising moves,
expanding the search tree based on random
sampling of the search space. The application
of Monte Carlo tree search in games is based
on many playouts. In each playout, the game
is played out to the very end by selecting
moves at random. The final game result of
each playout is then used to weight the
nodes in the game tree so that better nodes
are more likely to be chosen in future
playouts.
(Browne 2012)
MCTS Example
N: 0, Q: 0
Initial State
N: # of visits to the state
Q: Expected reward
Selection
Select the node that maximizes

$$Q(s, a) + C_p \sqrt{\frac{2 \ln n_s}{n_{s,a}}}$$
N: 1, Q: 0
N: 1, Q: 0 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
First term: estimated reward (exploitation)
Second term: bias term (exploration)
Together they balance exploration vs. exploitation (Auer 2002).
(In this first step, the selection is effectively random because all statistics are still zero.)
①
②
N: 1, Q: 0
N: 1, Q: 0 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
Win
Rollout
Randomly play the game to the end and find out win/lose.
N: 1, Q: 0
N: 1, Q: 1 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
Win
Backup
Update Q of the visited states.
③
④
N: 1, Q: 0
N: 1, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
N: 5, Q: 0.25
N: 2, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
N: 5, Q: 0.25
N: 2, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
Expansion
Expand the tree when a node has been visited a certain pre-defined number of times (in this case, 2). A code sketch of these four steps follows below.
⑤
⑥ ⑦
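Putting selection, rollout, backup, and expansion together, here is a compact UCT-style MCTS sketch in Python using the UCB1 formula from the slides. The game interface (`legal_moves`, `play`, `is_terminal`, `winner`) is a hypothetical placeholder you would supply for Nim or another game, and rewards are kept from the root player's perspective as in the example above.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []              # list of (move, Node)
        self.N, self.W = 0, 0.0         # visit count and total reward

    def Q(self):
        return self.W / self.N if self.N else 0.0

def ucb1(child, parent_N, c_p=1.0):
    """Q(s,a) + C_p * sqrt(2 * ln(n_s) / n_{s,a}); unvisited children come first."""
    if child.N == 0:
        return float("inf")
    return child.Q() + c_p * math.sqrt(2 * math.log(parent_N) / child.N)

def mcts(root_state, game, n_iter=1000, expand_threshold=2):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend while the node already has children.
        while node.children:
            _, node = max(node.children, key=lambda mc: ucb1(mc[1], node.N))
        # 2. Expansion: only after the node has been visited enough times.
        if node.N >= expand_threshold and not game.is_terminal(node.state):
            node.children = [(m, Node(game.play(node.state, m), node))
                             for m in game.legal_moves(node.state)]
            _, node = random.choice(node.children)
        # 3. Rollout: random play to the end.
        state = node.state
        while not game.is_terminal(state):
            state = game.play(state, random.choice(game.legal_moves(state)))
        reward = game.winner(state)     # +1 if the root player wins, -1 otherwise
        # 4. Backup: propagate the result to the root.
        while node is not None:
            node.N += 1
            node.W += reward
            node = node.parent
    return max(root.children, key=lambda mc: mc[1].N)[0] if root.children else None
```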
MCTS in AlphaGo
• The bias term is the UCB-style bias weighted by the prior P(s, a) given by the SL policy network.
• The win rate is evaluated by combining playout (rollout) results with the output of the value network.
• Massively parallel computation using 176 GPUs and 1,202 CPUs.

$$Q(s, a) = (1 - \lambda)\,\frac{W_v(s, a)}{N_v(s, a)} + \lambda\,\frac{W_r(s, a)}{N_r(s, a)}$$

$$u(s, a) = c_{\mathrm{puct}}\, P(s, a)\, \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$$

Here P(s, a) comes from the policy network, and the (W, N) statistics track value-network evaluations (subscript v) and rollout results (subscript r). A code sketch of these formulas follows below.
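As a sketch only (not AlphaGo's actual implementation), the per-edge statistics and the resulting selection score could be computed as below; `lambda_mix` and `c_puct` are tunable constants and their defaults here are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored for one (state, action) edge of AlphaGo's search tree."""
    P: float              # prior probability from the SL policy network
    Nv: int = 0           # number of value-network evaluations
    Wv: float = 0.0       # accumulated value-network outputs
    Nr: int = 0           # number of rollouts
    Wr: float = 0.0       # accumulated rollout outcomes

def q_value(e, lambda_mix=0.5):
    """Q(s,a): mix of the value-network average and the rollout average."""
    qv = e.Wv / e.Nv if e.Nv else 0.0
    qr = e.Wr / e.Nr if e.Nr else 0.0
    return (1 - lambda_mix) * qv + lambda_mix * qr

def u_value(e, total_rollouts, c_puct=5.0):
    """u(s,a): exploration bonus weighted by the policy prior P(s,a)."""
    return c_puct * e.P * math.sqrt(total_rollouts) / (1 + e.Nr)

def select(edges):
    """Pick the child index maximizing Q(s,a) + u(s,a)."""
    total = sum(e.Nr for e in edges)
    return max(range(len(edges)), key=lambda i: q_value(edges[i]) + u_value(edges[i], total))
```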
Performance
Figure 4 of (Silver 2016)
AlphaGo Zero
Abstract
A long-standing goal of artificial intelligence is an algorithm that learns,
tabula rasa, superhuman proficiency in challenging domains. Recently,
AlphaGo became the first program to defeat a world champion in the game of
Go. The tree search in AlphaGo evaluated positions and selected moves using
deep neural networks. These neural networks were trained by supervised
learning from human expert moves, and by reinforcement learning from self-
play. Here we introduce an algorithm based solely on reinforcement
learning, without human data, guidance or domain knowledge beyond
game rules. AlphaGo becomes its own teacher: a neural network is trained to
predict AlphaGo's own move selections and also the winner of AlphaGo's
games. This neural network improves the strength of the tree search, resulting
in higher quality move selection and stronger self-play in the next iteration.
Starting tabula rasa, our new program AlphaGo Zero achieved superhuman
performance, winning 100‒0 against the previously published, champion-
defeating AlphaGo.
Tabula rasa is a Latin phrase often translated as "clean slate".
Point1: Dual Network
Board image: https://senseis.xmp.net/?Go
Architecture: 19 x 19 board input → shared convolutional tower → two output heads: Output 1 (p): prediction of the next move; Output 2 (v): win rate.
• 40+ layer convolutional neural network
• Each layer: 3x3 convolution + batch normalization + ReLU
• Layers 2 to 39 form residual (ResNet) blocks
• Trained by self-play (details are described later)
• 17 input channels (features) are prepared (the next slide shows the details)
• The learning method for this network is discussed later. (For now, let's assume we have trained it nicely.)
Outputs: p (move probabilities) and v (win rate). A sketch of such a two-headed network follows below.
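A minimal sketch of a dual-headed (policy + value) network in PyTorch, under simplifying assumptions: a short residual tower instead of roughly 40 blocks, and 17 input planes as listed on the next slide. It shows the two-head structure rather than reproducing AlphaGo Zero's exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.conv2, self.bn2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return torch.relu(x + y)                             # residual connection

class DualNet(nn.Module):
    """Shared convolutional tower with a policy head (p) and a value head (v)."""
    def __init__(self, ch=64, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(17, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.policy_head = nn.Sequential(nn.Conv2d(ch, 2, 1), nn.Flatten(),
                                         nn.Linear(2 * 19 * 19, 19 * 19 + 1))   # +1 for pass
        self.value_head = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Flatten(),
                                        nn.Linear(19 * 19, 64), nn.ReLU(),
                                        nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        h = self.tower(self.stem(x))
        return self.policy_head(h), self.value_head(h).squeeze(1)   # (p logits, v)

# Hypothetical usage:
p_logits, v = DualNet()(torch.randn(2, 17, 19, 19))
print(p_logits.shape, v.shape)        # torch.Size([2, 362]) torch.Size([2])
```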
AlphaGo Zero depends less on hand-crafted features: 17 input feature planes (Silver 2017) versus 48 for AlphaGo (Silver 2016).

17 features of AlphaGo Zero (Silver 2017):
| Feature | # of planes |
| Position of black stones | 1 |
| Position of white stones | 1 |
| Positions of black stones k (1-7) steps before | 7 |
| Positions of white stones k (1-7) steps before | 7 |
| Turn (color to play) | 1 |
Point 2: Improvement of MCTS
• The MCTS algorithm uses the following value for node selection.
• There is no playout; the search relies on the value network only. (A selection/backup sketch follows below.)

$$Q(s, a) + u(s, a)$$

$$Q(s, a) = \frac{W(s, a)}{N(s, a)} \quad \text{(win rate)}$$

$$u(s, a) = c_{\mathrm{puct}}\, p(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \quad \text{(bias term, weighted by the predicted probability of move } a\text{)}$$
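A minimal sketch of how this replaces rollouts: select children by Q + u, evaluate the leaf with the dual network's (p, v), and back v up the visited path. The `DualNet`-style evaluation and the constant below are assumptions for illustration.

```python
import math

C_PUCT = 1.5   # exploration constant; the value is an assumption

class ZeroNode:
    def __init__(self, prior):
        self.P, self.N, self.W = prior, 0, 0.0
        self.children = {}                       # move -> ZeroNode

    def Q(self):
        return self.W / self.N if self.N else 0.0

def select_child(node):
    """Pick the (move, child) maximizing Q(s,a) + u(s,a); no playout is involved."""
    total_N = sum(c.N for c in node.children.values())
    def score(c):
        return c.Q() + C_PUCT * c.P * math.sqrt(total_N) / (1 + c.N)
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def expand(node, priors):
    """Expand a leaf using the policy head's move probabilities p(s, a)."""
    for move, p in priors.items():
        node.children[move] = ZeroNode(prior=p)

def backup(path, value):
    """Propagate the value head's estimate v up the visited path (no rollout)."""
    for node in reversed(path):
        node.N += 1
        node.W += value
        value = -value     # the win rate alternates between the two players' perspectives
```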
MCTS 1: Selection
25%
48%
35%
Select the node which has the maximum Q(s, a) + u(s, a)
MCTS 2: Expansion
25%
48%
35%
Expand the node
30% 42%
MCTS 2: Evaluation
25%
48%
35%
Evaluate p and v using the dual network.
* p will be used for the calculation of Q + u
* the win rate of this state is updated using v
30%
42% -> 70%
p
v = 70 %
MCTS 3: Backup
50% 40%
Update the win rate of each state and propagate it up to the root node.
60% 70%
p
v = 70 %
60% -> 65%
50% -> 55%
Point 3: Improvements on RL

$$(p, v) = f_\theta(s), \qquad l = (z - v)^2 - \pi^{T} \log p + c\,\lVert\theta\rVert^2$$

• The dual network (parameters θ) accumulates data by self-play (step 1, repeated 25,000 times).
• Based on that data, update the parameters of the network (step 2) to obtain new parameters θ′.
• Let the two network instantiations compete, and adopt θ′ only if the new parameter set wins.
• Repeat steps 1 and 2.
Step 1: Data Accumulation
• Do a self-play game. Store the outcome z.
• Store all (s, π, z) tuples from the game.
• The policy π is calculated from the visit counts as shown below.
• Repeat the above process 250,000 times.

$$\pi_a = \frac{N(s, a)^{1/\gamma}}{\sum_b N(s, b)^{1/\gamma}}$$

(A small code sketch of this calculation follows.)
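A small sketch of this visit-count policy in Python; γ acts as a temperature (the default value below is an assumption).

```python
import numpy as np

def visit_count_policy(visit_counts, gamma=1.0):
    """pi_a proportional to N(s, a)^(1/gamma); smaller gamma -> greedier policy."""
    counts = np.asarray(visit_counts, dtype=float)
    scaled = counts ** (1.0 / gamma)
    return scaled / scaled.sum()

print(visit_count_policy([10, 5, 1]))             # [0.625, 0.3125, 0.0625]
print(visit_count_policy([10, 5, 1], gamma=0.5))  # sharper: mass concentrates on the best move
```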
Step 2: Parameter update
• Calculate the loss function using the (s, π, z) tuples collected in the previous step (a sketch follows below):

$$(p, v) = f_\theta(s), \qquad l = (z - v)^2 - \pi^{T} \log p + c\,\lVert\theta\rVert^2$$

• Update the parameters using gradient descent:

$$\theta' \leftarrow \theta - \alpha \cdot \nabla_\theta l$$
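A minimal PyTorch sketch of this loss and update, assuming a dual-headed model like the DualNet sketch above that returns policy logits and a scalar value; the L2 term c‖θ‖² is handled via the optimizer's weight decay here.

```python
import torch
import torch.nn.functional as F

def training_step(net, optimizer, states, pi_targets, z):
    """One gradient step on l = (z - v)^2 - pi^T log p (+ L2 via weight_decay).

    states:     (B, 17, 19, 19) board features
    pi_targets: (B, 362) MCTS visit-count policies
    z:          (B,) game outcomes in {-1, +1}
    """
    p_logits, v = net(states)
    value_loss = F.mse_loss(v, z)
    policy_loss = -(pi_targets * F.log_softmax(p_logits, dim=1)).sum(dim=1).mean()
    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with the DualNet sketch from earlier:
# net = DualNet()
# opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```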
Empirical evaluation of
AlphaGo Zero
Fig 3 of (Silver 2017)
Performance of AlphaGo
Zero
Fig 6 of (Silver 2017)
https://research.fb.com/facebook-open-sources-elf-opengo/
References
1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., … Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489. https://doi.org/10.1038/nature16961 [AlphaGo]
2. Otsuki, T., & Miyake, Y. (2017). Saikyo igo AI AlphaGo kaitai shinsho: Shinso gakushu, Montekaruro ki tansaku, kyoka gakushu kara mita sono shikumi [The strongest Go AI, AlphaGo, dissected: its mechanism from the viewpoints of deep learning, Monte Carlo tree search, and reinforcement learning]. Shoeisha.
3. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354-359. https://doi.org/10.1038/nature24270 [AlphaGo Zero]
4. Browne, C., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., … Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1). https://doi.org/10.1109/TCIAIG.2012.2186810
5. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235-256.
6. Cai, P., Luo, Y., Saxena, A., Hsu, D., & Lee, W. S. (2019). LeTS-Drive: Driving in a crowd by learning from tree search. Retrieved from https://arxiv.org/pdf/1905.12197.pdf
