Mastering the game of Go with deep neural networks and tree search

Speaker: San-Feng Chang
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche,
G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman,
S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach,
M., Kavukcuoglu, K., Graepel, T., and Hassabis, D.
Nature, 529(7587):484–489, 2016.
2016/3/22 1

Outline
• AI in Game Playing
• Previous Work of Go Research
• Architecture of AlphaGo
• AlphaGo’s methods
• The playing strength of AlphaGo
• Conclusion

AI in Game Playing (1/3)
• Game playing is a well-defined problem for measuring
the performance of an AI.
• One classification of the outcomes of an AI test:
– Optimal: it is not possible to perform better
– Strong super-human: performs better than all humans
– Super-human: performs better than most humans
– Sub-human: performs worse than most humans

Game   Players                        Branching factor   Depth (game length)   Complexity
Chess  Deep Blue vs Kasparov (1997)   35                 80                    35^80 ≈ 10^123
Go     AlphaGo vs Lee Sedol (2016)    250                150                   250^150 ≈ 10^360
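The complexity column can be checked with a few lines of Python. This is a minimal sketch; the function name `game_tree_magnitude` is an illustrative choice and does not come from the slides. It computes log10(b^d) as d * log10(b), which avoids building the huge integer.

```python
import math


def game_tree_magnitude(branching: int, depth: int) -> float:
    """Return log10(branching ** depth), the order of magnitude of b^d."""
    return depth * math.log10(branching)


# Branching factor and depth taken from the table above.
for game, b, d in [("Chess", 35, 80), ("Go", 250, 150)]:
    print(f"{game}: {b}^{d} is about 10^{game_tree_magnitude(b, d):.1f}")
```

Both exponents land within one unit of the rounded values in the table, so Go's tree is larger than the chess tree by more than 230 orders of magnitude.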
Evolution of game tree search:
Brute Force → Minimax & Alpha-Beta → MCTS → AlphaGo’s Method

• Minimax & Alpha-Beta Pruning
– Even with pruning, the search complexity is still too high for Go.
– Figure: https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/AB_pruning.svg/1280px-AB_pruning.svg.png?1458451165542
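As a rough illustration of what pruning saves, here is a minimal alpha-beta sketch on a hand-made tree. The nested-list tree, the `visited` list, and the leaf values are assumptions made for this example only; they are not taken from the linked figure.

```python
import math


def alphabeta(node, maximizing=True, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of `node`, skipping branches that cannot change the result.

    A node is either a number (leaf score) or a list of child nodes.
    Evaluated leaves are appended to `visited` so the pruning is observable.
    """
    if visited is None:
        visited = []
    if isinstance(node, (int, float)):
        visited.append(node)
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:  # the minimizer above will never allow this branch
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:  # the maximizer above already has a better option
            break
    return best


tree = [[3, 5], [2, 9], [0, 1]]  # hypothetical depth-2 tree, maximizer to move
visited = []
value = alphabeta(tree, visited=visited)
print(value, visited)
```

On this tree only four of the six leaves are evaluated. Pruning reduces the effective branching factor but leaves the growth exponential, which is why it is not enough for Go.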

Previous Work of Go Research (1/4)
• Monte Carlo rollouts search to maximum
depth without branching at all, by sampling
long sequences of actions for both players
from a policy p.
• Monte Carlo tree search (MCTS) uses Monte
Carlo rollouts to estimate the value of each
state in a search tree.

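• Monte Carlo Tree Search, steps 1 and 2:
[Figure: search tree with nodes labeled wins/visits; levels alternate between Player 1, Player 2, and Player 1. Root: 2/3. Second level: 1/1 and 1/2. Third level: 1/1 and 0/1.]
– Selection (randomly): descend from the root to a leaf node.
– Expansion: add a new child node, initialized to 0/0, below the selected leaf.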
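• Monte Carlo Tree Search, steps 3 and 4:
[Figure: the same tree with the new 0/0 node; levels alternate between Player 1, Player 2, Player 1, and Player 2.]
– Simulation: play the game out from the new 0/0 node to the end.
– Back-propagation: update the counts on the path from the new node back to the root. In the figure the new node becomes 1/1 and the nodes above it change from 1/1, 1/2, 2/3 to 2/2, 2/3, 3/4.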
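The four steps can be sketched on a toy game. Everything here is an assumption made for illustration: the game is a small Nim variant (take 1 to 3 stones, whoever takes the last stone wins), the class and function names are hypothetical, and selection uses the common UCB1 rule instead of the random selection shown in the figure.

```python
import math
import random


class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones    # stones left in the pile
        self.player = player    # player to move at this node (1 or 2)
        self.parent = parent
        self.move = move        # move that led to this node
        self.children = []
        self.wins = 0           # wins for the player who made `move`
        self.visits = 0

    def untried_moves(self):
        tried = {child.move for child in self.children}
        return [m for m in (1, 2, 3) if m <= self.stones and m not in tried]


def ucb1(child, parent_visits, c=1.4):
    exploit = child.wins / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def rollout(stones, player, rng):
    """Play uniformly random moves to the end and return the winner."""
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return player       # this player took the last stone
        player = 3 - player


def mcts(stones, iterations=3000, seed=0):
    rng = random.Random(seed)
    root = Node(stones, player=1)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down while the node is fully expanded.
        while node.stones > 0 and not node.untried_moves():
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # 2. Expansion: add one new child (0 wins / 0 visits).
        if node.stones > 0:
            move = rng.choice(node.untried_moves())
            child = Node(node.stones - move, 3 - node.player, node, move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node.
        if node.stones == 0:
            winner = 3 - node.player  # the previous mover took the last stone
        else:
            winner = rollout(node.stones, node.player, rng)
        # 4. Back-propagation: update wins/visits up to the root.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.player:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move


best_move = mcts(5)
print(best_move)
```

Each node stores wins from the point of view of the player who moved into it, so the selection step always maximizes from the current player's side. With 5 stones the search should settle on taking 1, which leaves the opponent a multiple of 4.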

• The strongest current Go programs are based
on MCTS, enhanced by policies that are
trained to predict human expert moves.
• However, prior work has been limited to
shallow policies or value functions based on a
linear combination of input features.

Architecture of AlphaGo
Neural Network Training Pipeline
• s: board position
• a: a legal move
• p(a|s): probability distribution over legal moves
• v(s): scalar value of the position
• Two Brains: the policy network p(a|s) and the value network v(s)
• Human expert dataset: KGS server, ~160,000 games, 29.4 million positions
Convolutional Neural Network (1/2)
[Figure: a regular 3-layer neural network (left) and a convolutional neural network (right)]
Input volume of size: W1 x H1 x D1
Requires four hyperparameters:
1. Number of filters K (depth)
2. Spatial extent F (kernel size)
3. The stride S
4. The amount of zero padding P
Output volume size: W2 x H2 x D2
W2 = (W1 - F + 2P)/S + 1
H2 = (H1 - F + 2P)/S + 1
D2 = K
• Parameter sharing:
total weights: (F * F * D1) * K, plus K biases
http://cs231n.github.io/convolutional-networks/

Convolutional Neural Network (2/2)
Example from http://cs231n.github.io/convolutional-networks/
Number of filters K: 2
Spatial extent F: 3 x 3
Stride S: 2
Zero padding P: 1
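The output-size formula from the previous slide can be applied to these hyperparameters. This is a minimal sketch: the function name is hypothetical, the 5 x 5 x 3 input is assumed to match the linked demo, and the second call uses assumed values for a Go-sized first layer (192 filters and padding 2 are not stated on the slides).

```python
def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Return ((W2, H2, D2), weight_count) for one convolutional layer."""

    def out_size(n):
        span = n - f + 2 * p
        if span % s != 0:
            raise ValueError("hyperparameters do not tile the input evenly")
        return span // s + 1

    shape = (out_size(w1), out_size(h1), k)
    weights = f * f * d1 * k  # shared weights; the K biases are not counted
    return shape, weights


# Slide example: K = 2, F = 3, S = 2, P = 1 on an assumed 5 x 5 x 3 input.
demo_shape, demo_weights = conv_output_shape(5, 5, 3, k=2, f=3, s=2, p=1)

# Assumed Go-sized first layer: 19 x 19 x 48 input, 5 x 5 kernel, stride 1.
go_shape, go_weights = conv_output_shape(19, 19, 48, k=192, f=5, s=1, p=2)

print(demo_shape, demo_weights)
print(go_shape, go_weights)
```

With stride 1 and padding (F - 1)/2 the spatial size is preserved, which is how a stack of convolutional layers can keep the 19 x 19 board resolution from input to output.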

AlphaGo’s methods –
Trained by Human Expert (1/6)
• Rollout policy pπ:
– Takes only 2 μs to select an action, but predicts
expert moves with only 24.2% accuracy
– Uses a linear softmax of small pattern features with
weights π
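[Figure: softmax unit of pπ with inputs n_{1,in}, n_{2,in}, n_{3,in} and outputs n_1, n_2, n_3]
$$n_{1,\mathrm{out}} = \frac{e^{n_{1,\mathrm{in}}}}{e^{n_{1,\mathrm{in}}} + e^{n_{2,\mathrm{in}}} + e^{n_{3,\mathrm{in}}}}$$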

https://qph.fs.quoracdn.net/main-qimg-9e2d012ef7cb8b29d2bed14d2975c986
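A linear softmax policy of this kind can be sketched in a few lines. The feature vectors and weights below are made-up values for illustration; the real rollout policy uses Go pattern features that are not listed on the slide.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_softmax_policy(features, weights):
    """One linear score per candidate move, then a softmax over the moves."""
    scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
    return softmax(scores)


# Hypothetical binary pattern features for three candidate moves.
features = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
weights = [0.5, -0.2, 1.0]  # hypothetical weights (pi in the slide's notation)
probs = linear_softmax_policy(features, weights)
print(probs)
```

The policy is a single dot product per move followed by a normalization, which is why it can run in microseconds while a deep network needs milliseconds.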

Trained by Human Expert (2/6)
• SL policy pσ:
– Takes 3 ms to select an action and predicts
expert moves with 57.0% accuracy
– Uses a 13-layer convolutional neural network with
weights σ
• Network layout:
– Input: 19 x 19, 48 planes
– 1st layer: Conv + ReLU, kernel size 5 x 5
– 2nd to 12th layers: Conv + ReLU, kernel size 3 x 3
– 13th layer: kernel size 1 x 1, 1 filter, softmax

Reinforcement Learning pρ (3/6)
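• Initialize weights: ρ = ρ⁻ = σ, so the RL policy pρ starts from the SL policy pσ
• Play: the current policy pρ plays against an opponent pρ⁻ drawn from the opponent pool until the game ends
• Reward: the final outcome of the game gives the reward r
• Update: ρ is adjusted by the policy gradient method
• Add pρ to the opponent pool and repeat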
Value Network vθ (4/6)
• Supervised learning (regression):
– Used to estimate the winning rate of the
current position
– Uses a 15-layer CNN
• Network layout:
– Input: 19 x 19, 48 planes + 1 (current color to play)
– 1st to 13th layers: the same as the RL policy network
– 14th layer: fully connected, 256 ReLU units
– 15th layer: fully connected, 1 tanh unit
Value Network vθ (5/6)
• Randomly sample an integer U in 1 ~ 450
– t = 1 ~ U-1: played by the SL policy network pσ
– t = U: random action
– t = U+1 ~ End: played by the RL policy network pρ
• Reward: $z_t = \pm r(s_T)$, the final outcome seen from the player to move at step t
• Only a single training example $(s_{U+1}, z_{U+1})$ is
added to the data set from each game.
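The sampling procedure on this slide can be sketched as a generic function. The game interface (`legal_moves`, `apply_move`, `is_terminal`, `outcome`) and the toy parity game are hypothetical stand-ins, and both policies are replaced by uniform random play; only the three-phase schedule follows the slide.

```python
import random


def generate_example(initial_state, legal_moves, apply_move, is_terminal, outcome,
                     sl_policy, rl_policy, rng, max_u=450):
    """Return one (s_{U+1}, z_{U+1}) pair, or None if the game ends too early."""
    u = rng.randint(1, max_u)
    state, t, sample = initial_state, 1, None
    while not is_terminal(state):
        if t < u:
            move = sl_policy(state, rng)             # t = 1 .. U-1: SL policy
        elif t == u:
            move = rng.choice(legal_moves(state))    # t = U: random action
        else:
            move = rl_policy(state, rng)             # t = U+1 .. end: RL policy
        state = apply_move(state, move)
        t += 1
        # Assumption: a terminal s_{U+1} is skipped, since nobody is to move there.
        if t == u + 1 and not is_terminal(state):
            sample = state
    if sample is None:
        return None
    return sample, outcome(state, sample)


# Toy game (assumption): 8 moves of 0 or 1; the first player wins on an even sum.
def toy_outcome(final_state, sampled_state):
    winner = 0 if sum(final_state) % 2 == 0 else 1
    mover = len(sampled_state) % 2               # player to move at s_{U+1}
    return 1 if winner == mover else -1


def uniform_policy(state, rng):
    return rng.choice((0, 1))


example = generate_example(
    initial_state=(),
    legal_moves=lambda s: (0, 1),
    apply_move=lambda s, m: s + (m,),
    is_terminal=lambda s: len(s) == 8,
    outcome=toy_outcome,
    sl_policy=uniform_policy,
    rl_policy=uniform_policy,
    rng=random.Random(0),
    max_u=5,
)
print(example)
```

Taking one position per game keeps the training examples from being strongly correlated, since successive positions in the same game differ by a single stone and share the same outcome.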

Searching (6/6)
• Q: action value → winning scores
• u(P): upper confidence bound → exploration vs. exploitation
• P: prior probability → uses pσ (the SL policy performed better here than the RL policy)

The playing strength of AlphaGo

Conclusion
• Reaching a milestone is the beginning of the
next milestone.
• Stay hungry, stay foolish!

References(1/2)
• Nature:
– Mastering the game of Go with deep neural networks and tree search
• Mark Chang:
– http://www.slideshare.net/ckmarkohchang/alphago-in-depth
• CNN:
– http://cs231n.github.io/convolutional-networks/

References(2/2)
• 陳鍾誠
– http://www.slideshare.net/ccckmit/30alphago
• Monte Carlo Tree Search
– https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
• How AlphaGo Works
– http://www.slideshare.net/ShaneSeungwhanMoon/how-alphago-works

Formula(1/2)
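• Policy Network: classification
$$\Delta\sigma = \frac{\alpha}{m} \sum_{k=1}^{m} \frac{\partial \log p_\sigma(a^k \mid s^k)}{\partial \sigma}$$
• Policy Network: reinforcement learning
$$\Delta\rho = \frac{\alpha}{n} \sum_{i=1}^{n} \sum_{t=1}^{T^i} \frac{\partial \log p_\rho(a_t^i \mid s_t^i)}{\partial \rho} \left( z_t^i - v(s_t^i) \right)$$
• Value Network: regression
$$\Delta\theta = \frac{\alpha}{m} \sum_{k=1}^{m} \left( z^k - v_\theta(s^k) \right) \frac{\partial v_\theta(s^k)}{\partial \theta}$$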

Formula(2/2)
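• Searching:
$$a_t = \arg\max_a \left( Q(s_t, a) + u(s_t, a) \right)$$
$$u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$
$$N(s, a) = \sum_{i=1}^{n} \mathbf{1}(s, a, i)$$
$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}(s, a, i) \, V(s_L^i)$$
– 1(s, a, i) indicates whether edge (s, a) was traversed during the i-th simulation
– s_L^i is the leaf node from the i-th simulation
$$V(s_L) = (1 - \lambda) \, v_\theta(s_L) + \lambda z_L$$
$$u(s, a) = c_{\mathrm{puct}} \, P(s, a) \, \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$$
– N_r is the rollout visit count of an edge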
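The selection and backup rules can be sketched as follows. This is a simplified illustration: the `Edge` class and function names are hypothetical, a single visit count stands in for the separate rollout and value-network counts, the defaults `c_puct=5.0` and `lam=0.5` are assumed settings, and the sign flip between the two players is left out.

```python
import math


class Edge:
    """Statistics stored for one (s, a) edge of the search tree."""

    def __init__(self, prior):
        self.P = prior   # prior probability P(s, a) from the policy network
        self.N = 0       # visit count N(s, a)
        self.W = 0.0     # sum of leaf evaluations V(s_L) over all visits

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0


def select_action(edges, c_puct=5.0):
    """Pick argmax_a Q(s, a) + u(s, a), with u given by the PUCT bonus."""
    total = sum(e.N for e in edges.values())

    def score(action):
        e = edges[action]
        u = c_puct * e.P * math.sqrt(total) / (1 + e.N)
        return e.Q + u

    return max(edges, key=score)


def leaf_value(v_theta, z, lam=0.5):
    """Mix the value-network estimate with the rollout outcome."""
    return (1 - lam) * v_theta + lam * z


def backup(path, value):
    """Update every edge traversed in one simulation."""
    for e in path:
        e.N += 1
        e.W += value


edges = {"a": Edge(0.6), "b": Edge(0.3), "c": Edge(0.1)}  # hypothetical priors
backup([edges["a"]], leaf_value(v_theta=-1.0, z=-1.0))    # one bad result for "a"
next_action = select_action(edges)
print(next_action)
```

The bonus u shrinks as an edge is visited more often, so the search starts out following the prior P and gradually shifts toward the moves with the best measured Q.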




How AlphaGo selected its move

(Bonus 1)

(Bonus 2)
