A brief but in-depth and highly understandable introduction to AlphaGo Zero, the successor to the world-famous AlphaGo.
Unlike its predecessor, which relied on a huge amount of human training data, AlphaGo Zero requires no human input in its training process.
Because of this, it is uninhibited by human prejudices and preconceptions, which allowed it to become the best Go player, human or machine, in history.
This is the presentation I gave on the subject in my first semester as a graduate student. It was designed for those with a background in deep learning but not reinforcement learning, and explains the core concepts necessary to understand AlphaGo Zero for an interested audience.
3. Limitations
It required a large amount of training data (29.4 million positions from 160,000 games from the KGS Go server).
It used many hand-crafted features to judge moves (atari size, liberties, capture size, etc.).
This may have limited the range of possible moves that it could play.
4. AlphaGo Zero
No human data: Starts from random play.
No hand-crafted features: Only the board and the rules of the game are used.
Single machine (4 TPUs for evaluation)
6. Terminology & Concepts
Agent: The entity interacting with the environment.
State (s): The situation that the agent is in.
Action (a): The action that the agent takes in a given state.
Reward (r): The reward (or penalty) that the agent receives from taking an action in a state.
Policy (π): The function that gives the probability of taking each action in a given state.
Value Function (V(s)): The value (long-term total reward) of the given state.
Action Value Function (Q(s, a)): The value of a given action in a given state.
$V(s) = \sum_{a \in A} \pi(s, a)\, Q(s, a)$
Note: The policy outputs a probability for each action, so the probabilities must sum to 1.
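As a quick illustration of how these quantities relate, here is a minimal Python sketch (the actions and numbers are made up) showing that V(s) is the policy-weighted average of the action values:

```python
# Hypothetical state with three legal actions (values are illustrative only).
policy = {"a1": 0.5, "a2": 0.3, "a3": 0.2}     # pi(s, a); probabilities sum to 1
q_values = {"a1": 0.1, "a2": -0.4, "a3": 0.7}  # Q(s, a)

# V(s) = sum over actions of pi(s, a) * Q(s, a)
v = sum(policy[a] * q_values[a] for a in policy)
print(v)  # approx 0.07 = 0.5*0.1 + 0.3*(-0.4) + 0.2*0.7
```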
7. The Explore-Exploit Tradeoff
The fundamental question of Reinforcement Learning:
Explore: Explore the environment further to find higher rewards.
Exploit: Exploit the known states/actions to maximize reward.
Should I just eat the cheese that I have already found, or should I search the maze for more/better cheese?
8. Monte Carlo Tree Search
The modified version used in AlphaGo Zero
9. Monte Carlo Tree Search (MCTS)
Monte Carlo: Randomly trying things, e.g. throwing dice.
Tree Search: Searching the various “leaves” of the “tree” of possibilities.
Node: State
Edge: Action
10. The Function of MCTS in AlphaGo Zero
MCTS is used to simulate games in AlphaGo Zero’s “imagination”.
But the methods of picking the next move in its “imagination” and in “reality” are very different.
11. MCTS in AlphaGo Zero
Select the next move in the MCTS simulation using U(s, a) + Q(s, a) for the state ‘s’ in the simulation.
Repeat this process until an unevaluated node (a “leaf” node) is encountered.
Back up from the node after evaluating its V(s) and P(s, a), updating the visit counts and Q(s, a).
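As a rough Python sketch of one such simulation (the node structure and the helper names `select_child`, `expand_and_evaluate`, and `backup` are my own illustrations, detailed on the following slides, not code from the paper):

```python
def run_simulation(root, net):
    """One MCTS simulation: select down to a leaf, evaluate it, back the value up."""
    node, path = root, []
    # 1. Selection: descend the tree while nodes have already been expanded.
    while node.is_expanded():
        action = select_child(node)        # argmax of Q(s, a) + U(s, a), see Selection slide
        path.append((node, action))
        node = node.child(action)
    # 2. Expand & Evaluate: query the NN once at the leaf.
    value = expand_and_evaluate(node, net)
    # 3. Backup: update N(s), N(s, a), W(s, a), Q(s, a) along the visited path.
    backup(path, value)
```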
12. Selection
Select the next move using the PUCT algorithm
$a_t = \arg\max_a \left[\, Q(s_t, a) + U(s_t, a) \,\right]$
$U(s, a) = c_{\mathrm{puct}}\, P(s, a)\, \dfrac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$
t: The time step within an MCTS simulation.
Q(s_t, a) represents Exploitation.
U(s_t, a) represents Exploration.
c_puct: A hyperparameter controlling the Explore/Exploit tradeoff.
P(s, a): The prior probabilities (NN output with Dirichlet noise).
N(s, a): The visit count of that action in that state.
Σ_b N(s, b): The total visit count of the state (all of its actions combined).
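A minimal sketch of this selection rule, assuming each edge object stores its visit count N, mean action value Q, and prior P (the class layout and the c_puct value are illustrative, not taken from the paper):

```python
import math

def select_child(node, c_puct=1.0):
    """Pick the action maximizing Q(s, a) + U(s, a) (the PUCT rule)."""
    total_visits = sum(edge.N for edge in node.edges.values())  # sum_b N(s, b)
    best_action, best_score = None, float("-inf")
    for action, edge in node.edges.items():
        u = c_puct * edge.P * math.sqrt(total_visits) / (1 + edge.N)  # exploration bonus
        score = edge.Q + u                                            # exploit + explore
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```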
13. Expand and Evaluate
Go down the “branches” of the “tree” until a “leaf” node (unevaluated state) is encountered.
Then evaluate v = V(s_L) and p = P(s, a) of that node (state) using the Neural Network.
The “tree” then grows a “branch” where there used to be a “leaf”.
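A sketch of this step, assuming `net` returns a policy vector and a scalar value, and `legal_actions` and `Edge` are hypothetical helpers:

```python
def expand_and_evaluate(leaf, net):
    """Query the NN once at the leaf: its policy output becomes the child priors,
    and its value output is returned so it can be backed up the tree."""
    p, v = net(leaf.state)                 # p: prior move probabilities, v: value in [-1, 1]
    for action in legal_actions(leaf.state):
        leaf.edges[action] = Edge(P=p[action], N=0, W=0.0, Q=0.0)
    return v                               # v = V(s_L)
```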
14. Backup
$N(s) \leftarrow N(s) + 1$
$N(s, a) \leftarrow N(s, a) + 1$
$W(s, a) \leftarrow W(s, a) + V(s_L)$
$Q(s, a) \leftarrow \dfrac{W(s, a)}{N(s, a)}$
After encountering and evaluating a leaf node, go back up to the “root” node.
Update the visit counts N(s), N(s, a) and Q(s, a).
W(s, a) is the total action value, used only for calculating Q(s, a), the average action value.
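A sketch of the backup step over the path visited during the simulation (sign handling for the two alternating players is omitted for brevity):

```python
def backup(path, v):
    """Update the statistics of every edge traversed on the way to the leaf."""
    for node, action in path:
        edge = node.edges[action]
        node.N += 1                  # N(s)    <- N(s) + 1
        edge.N += 1                  # N(s, a) <- N(s, a) + 1
        edge.W += v                  # W(s, a) <- W(s, a) + V(s_L)
        edge.Q = edge.W / edge.N     # Q(s, a) = W(s, a) / N(s, a)
```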
15. Play
$\pi(a \mid s_0) = \dfrac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$
s_0: Root node.
τ: “Temperature” parameter controlling exploration (1 → 0.0).
N(s_0, a): Visit count of each possible action from the root node.
Σ_b N(s_0, b): Total visit count of the root node.
After 1600 simulations of MCTS, select the next action.
This is a “real” action, not an “imaginary” action.
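A sketch of picking the real move from the root visit counts (names are illustrative; how τ is annealed is covered in the Temperature section below):

```python
import numpy as np

def play_move(root, tau=1.0):
    """Sample the real move with probability proportional to N(s0, a)^(1/tau)."""
    actions = list(root.edges.keys())
    counts = np.array([root.edges[a].N for a in actions], dtype=np.float64)
    pi = counts ** (1.0 / tau)
    pi /= pi.sum()                         # normalize into a probability distribution
    idx = np.random.choice(len(actions), p=pi)
    return actions[idx], pi                # pi is also stored as a training target
```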
16. Key Points
The probabilities of the policy π are given by the visit counts of the MCTS simulations, not by the NN directly.
The visit counts in MCTS are maintained for one game: not just for one MCTS simulation, and not across multiple games.
Action selection is different for simulation and play.
The NN only evaluates each node once, when it is a leaf node.
18. Network Architecture
Inputs: The previous 8 states
Outputs: Predicted Policy p & Predicted Value v.
Structure: 40 residual blocks with 2 output heads.
The policy head (top) has softmax activation to output probabilities.
The value head (bottom) has tanh activation (because +1: win, 0: tie, -1: loss).
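A reduced PyTorch sketch of this dual-headed residual architecture (sizes are shrunk here for readability; the paper's network uses 17 input planes, 256 filters, and 20 or 40 residual blocks):

```python
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
IN_PLANES = 17      # past board positions for both players + colour plane
FILTERS = 64        # paper: 256
BLOCKS = 4          # paper: 20 or 40

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                 # skip connection

class DualNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, FILTERS, 3, padding=1, bias=False),
            nn.BatchNorm2d(FILTERS), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(FILTERS) for _ in range(BLOCKS)])
        # Policy head: a distribution over 19*19 board moves + pass
        self.policy = nn.Sequential(
            nn.Conv2d(FILTERS, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))
        # Value head: a single scalar squashed into [-1, 1] by tanh
        self.value = nn.Sequential(
            nn.Conv2d(FILTERS, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(BOARD * BOARD, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        h = self.tower(self.stem(x))
        return F.log_softmax(self.policy(h), dim=1), self.value(h)
```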
19. Generating Training Data via Self-Play
$l = (z - v)^2 - \pi^{T} \log p + c \lVert \theta \rVert^2$
Loss = MSE(actual outcome, prediction) + Cross Entropy(MCTS policy, prediction) + L2(model weights).
Self-play games of the current best model are used to generate training data.
Multiple self-play games are run simultaneously to provide sufficient training data.
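A sketch of this loss in PyTorch, assuming the network outputs log-probabilities `log_p` and values `v` for a batch, with the MCTS search probabilities `pi` and game outcomes `z` as targets (the L2 term c‖θ‖² is typically applied via the optimizer's weight decay):

```python
import torch.nn.functional as F

def alphago_zero_loss(log_p, v, pi, z):
    value_loss = F.mse_loss(v.squeeze(-1), z)       # (z - v)^2, averaged over the batch
    policy_loss = -(pi * log_p).sum(dim=1).mean()   # -pi^T log p  (cross entropy)
    return value_loss + policy_loss                 # + c * ||theta||^2 via weight_decay
```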
20. Prior Probabilities
$P(s, a) = (1 - \epsilon)\, p_a + \epsilon\, \eta_a, \quad (\epsilon = 0.25,\ \eta \sim \mathrm{Dir}(0.03))$
p_a: NN policy output for action a.
The prior probability is obtained by adding Dirichlet noise to the NN output.
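A small NumPy sketch of this mixing step (in the paper the noise is applied only to the priors at the root node of each search):

```python
import numpy as np

def noisy_priors(p, eps=0.25, alpha=0.03):
    """P(s, a) = (1 - eps) * p_a + eps * eta_a, with eta ~ Dir(alpha)."""
    eta = np.random.dirichlet([alpha] * len(p))
    return (1 - eps) * p + eps * eta
```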
Temperature (Simulated annealing)
$\pi(a \mid s_0) = \dfrac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$
τ = 1 for the first 30 plays of self-play.
Then reduce τ → 0.0, which is equivalent to deterministically playing
$a = \arg\max_a N(s_0, a)$
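A small numeric illustration (the visit counts are made up) of why lowering τ is equivalent to picking the most-visited move:

```python
import numpy as np

counts = np.array([800.0, 500.0, 300.0])      # hypothetical N(s0, a) at the root
for tau in (1.0, 0.5, 0.1):
    pi = counts ** (1.0 / tau)
    pi /= pi.sum()
    print(tau, np.round(pi, 3))
# tau = 1.0 -> approx [0.5, 0.312, 0.188]
# tau = 0.5 -> approx [0.653, 0.255, 0.092]
# tau = 0.1 -> approx [0.99, 0.009, 0.0]   (nearly all mass on argmax_a N(s0, a))
```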
Training vs Evaluation
22. Final Performance and Comparison
AlphaGo Zero achieved SOTA after 40 days of training, defeating AlphaGo Master by 89:11.
The “raw network” shows the performance of the NN without MCTS (using p and v directly as the policy and value, without MCTS simulation).
23. Neural Network Architecture Comparison
The 2 innovations of AlphaGo Zero were
1. Using a residual network (ResNet) instead of a plain convolutional network (CNN).
2. Combining the policy and value networks into a single dual-headed network.
By separating out each factor, we can see the contribution of each.
For playing strength, the ResNet and dual-network factors seem to contribute roughly equally.
24. Empirical evaluation
Defeated AlphaGo Lee 100:0 after 72 hours of training.
Worse at predicting human moves but better at playing Go.
The NN seems to have learned a style different from humans.
25. Go knowledge learned by AlphaGo Zero
AlphaGo Zero discovered many human moves and variants of known human moves, as well as new moves unknown to human players.
This data also supports the conclusion that AlphaGo Zero has learned a new style of playing Go, different from how humans play the game.