A brief but in-depth and highly understandable introduction to AlphaGo Zero, the successor to the world-famous AlphaGo.
Unlike its predecessor, which relied on a huge amount of human training data, AlphaGo Zero requires no human input in its training process.
Because of this, it is uninhibited by human prejudices and preconceptions, which allowed it to become the best Go player, human or machine, in history.
This is the presentation I gave on the subject in my first semester as a graduate student. It was designed for those with a background in deep learning but not reinforcement learning, and explains the core concepts necessary to understand AlphaGo Zero for an interested audience.
3. Limitations
It required a large amount of training data (29.4 million positions from 160,000 games from the KGS Go server).
It used many hand-crafted features to judge moves (atari size, liberties, capture size, etc.).
This may have limited the range of possible moves that it could play.
4. AlphaGo Zero
No human data: Starts from random play.
No hand-crafted features: Only the board and the rules of the game are used.
Single machine (4 TPUs for evaluation)
6. Terminology & Concepts
Agent: The entity interacting with the environment.
State (s): The situation that the agent is in.
Action (a): The action that the agent takes in a given state.
Reward (r): The reward (or penalty) that the agent receives from taking an action in a state.
Policy (π): The function that gives the probability of taking each action in a given state.
Value Function (V(s)): The value (long-term total reward) of the given state.
Action Value Function (Q(s, a)): The value of a given action in a given state.
$V(s) = \sum_{a \in A} \pi(s, a)\, Q(s, a)$
Note: The policy outputs a probability for each action, so the probabilities must sum to 1.
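As a quick illustration of how these quantities relate, here is a minimal Python sketch (the actions and numbers are made up) showing that V(s) is the policy-weighted average of the action values:

```python
# Hypothetical state with three legal actions (values are illustrative only).
policy = {"a1": 0.5, "a2": 0.3, "a3": 0.2}     # pi(s, a); probabilities sum to 1
q_values = {"a1": 0.1, "a2": -0.4, "a3": 0.7}  # Q(s, a)

# V(s) = sum over actions of pi(s, a) * Q(s, a)
v = sum(policy[a] * q_values[a] for a in policy)
print(v)  # approx 0.07 = 0.5*0.1 + 0.3*(-0.4) + 0.2*0.7
```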
7. The Explore-Exploit Tradeoff
The fundamental question of Reinforcement Learning:
Explore: Explore the environment further to find higher rewards.
Exploit: Exploit the known states/actions to maximize reward.
Should I just eat the cheese that I have already found, or should I search the maze for more/better cheese?
8. Monte Carlo Tree Search
The modified version used in AlphaGo Zero
9. Monte Carlo Tree Search (MCTS)
Monte Carlo: Randomly trying things, e.g. throwing dice.
Tree Search: Searching the various “leaves” of the “tree” of possibilities.
Node: State
Edge: Action
10. The Function of MCTS in AlphaGo Zero
MCTS is used to simulate games in AlphaGo Zero’s “imagination”.
But the methods of picking the next move in its “imagination” and in “reality” are very different.
11. MCTS in AlphaGo Zero
Select the next move in the MCTS simulation using U(s, a) + Q(s, a) for the state ‘s’ in the simulation.
Repeat this process until an unevaluated node (a “leaf” node) is encountered.
Back up from the node after evaluating its V(s) and P(s, a), updating the visit counts and Q(s, a).
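As a rough Python sketch of one such simulation (the node structure and the helper names `select_child`, `expand_and_evaluate`, and `backup` are my own illustrations, detailed on the following slides, not code from the paper):

```python
def run_simulation(root, net):
    """One MCTS simulation: select down to a leaf, evaluate it, back the value up."""
    node, path = root, []
    # 1. Selection: descend the tree while nodes have already been expanded.
    while node.is_expanded():
        action = select_child(node)        # argmax of Q(s, a) + U(s, a), see Selection slide
        path.append((node, action))
        node = node.child(action)
    # 2. Expand & Evaluate: query the NN once at the leaf.
    value = expand_and_evaluate(node, net)
    # 3. Backup: update N(s), N(s, a), W(s, a), Q(s, a) along the visited path.
    backup(path, value)
```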
12. Selection
Select the next move using the PUCT algorithm
$a_t = \arg\max_a \left[\, Q(s_t, a) + U(s_t, a) \,\right]$
$U(s, a) = c_{\mathrm{puct}}\, P(s, a)\, \dfrac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$
t: The time step within an MCTS simulation.
Q(s_t, a) represents Exploitation.
U(s_t, a) represents Exploration.
c_puct: A hyperparameter controlling the Explore/Exploit tradeoff.
P(s, a): The prior probabilities (NN output with Dirichlet noise).
N(s, a): The visit count of that action in that state.
Σ_b N(s, b): The total visit count of the state (all of its actions combined).
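A minimal sketch of this selection rule, assuming each edge object stores its visit count N, mean action value Q, and prior P (the class layout and the c_puct value are illustrative, not taken from the paper):

```python
import math

def select_child(node, c_puct=1.0):
    """Pick the action maximizing Q(s, a) + U(s, a) (the PUCT rule)."""
    total_visits = sum(edge.N for edge in node.edges.values())  # sum_b N(s, b)
    best_action, best_score = None, float("-inf")
    for action, edge in node.edges.items():
        u = c_puct * edge.P * math.sqrt(total_visits) / (1 + edge.N)  # exploration bonus
        score = edge.Q + u                                            # exploit + explore
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```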
13. Expand and Evaluate
Go down the “branches” of the “tree” until a “leaf” node (unevaluated state) is encountered.
Then evaluate v = V(s_L) and p = P(s, a) of that node (state) using the Neural Network.
The “tree” then grows a “branch” where there used to be a “leaf”.
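A sketch of this step, assuming `net` returns a policy vector and a scalar value, and `legal_actions` and `Edge` are hypothetical helpers:

```python
def expand_and_evaluate(leaf, net):
    """Query the NN once at the leaf: its policy output becomes the child priors,
    and its value output is returned so it can be backed up the tree."""
    p, v = net(leaf.state)                 # p: prior move probabilities, v: value in [-1, 1]
    for action in legal_actions(leaf.state):
        leaf.edges[action] = Edge(P=p[action], N=0, W=0.0, Q=0.0)
    return v                               # v = V(s_L)
```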
14. Backup
$N(s) \leftarrow N(s) + 1$
$N(s, a) \leftarrow N(s, a) + 1$
$W(s, a) \leftarrow W(s, a) + V(s_L)$
$Q(s, a) \leftarrow \dfrac{W(s, a)}{N(s, a)}$
After encountering and evaluating a leaf node, go back up to the “root” node.
Update the visit counts N(s), N(s, a) and Q(s, a).
W(s, a) is the total action value, used only for calculating Q(s, a), the average action value.
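A sketch of the backup step over the path visited during the simulation (sign handling for the two alternating players is omitted for brevity):

```python
def backup(path, v):
    """Update the statistics of every edge traversed on the way to the leaf."""
    for node, action in path:
        edge = node.edges[action]
        node.N += 1                  # N(s)    <- N(s) + 1
        edge.N += 1                  # N(s, a) <- N(s, a) + 1
        edge.W += v                  # W(s, a) <- W(s, a) + V(s_L)
        edge.Q = edge.W / edge.N     # Q(s, a) = W(s, a) / N(s, a)
```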
15. Play
$\pi(a \mid s_0) = \dfrac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$
s_0: Root node.
τ: “Temperature” parameter controlling exploration (1 → 0.0).
N(s_0, a): Visit count of each possible action from the root node.
Σ_b N(s_0, b): Total visit count of the root node.
After 1600 simulations of MCTS, select the next action.
This is a “real” action, not an “imaginary” action.
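A sketch of picking the real move from the root visit counts (names are illustrative; how τ is annealed is covered in the Temperature section below):

```python
import numpy as np

def play_move(root, tau=1.0):
    """Sample the real move with probability proportional to N(s0, a)^(1/tau)."""
    actions = list(root.edges.keys())
    counts = np.array([root.edges[a].N for a in actions], dtype=np.float64)
    pi = counts ** (1.0 / tau)
    pi /= pi.sum()                         # normalize into a probability distribution
    idx = np.random.choice(len(actions), p=pi)
    return actions[idx], pi                # pi is also stored as a training target
```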
16. Key Points
The probabilities of the policy π are given by the visit counts of the MCTS simulations, not by the NN directly.
The visit counts in MCTS are maintained for one game: not just for one MCTS simulation, and not across multiple games.
Action selection is different for simulation and play.
The NN only evaluates each node once, when it is a leaf node.
18. Network Architecture
Inputs: The previous 8 states
Outputs: Predicted Policy p & Predicted Value v.
Structure: 40 residual blocks with 2 output heads.
The policy head (top) has softmax activation to output probabilities.
The value head (bottom) has tanh activation (because +1: win, 0: tie, -1: loss).
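A reduced PyTorch sketch of this dual-headed residual architecture (sizes are shrunk here for readability; the paper's network uses 17 input planes, 256 filters, and 20 or 40 residual blocks):

```python
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
IN_PLANES = 17      # past board positions for both players + colour plane
FILTERS = 64        # paper: 256
BLOCKS = 4          # paper: 20 or 40

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                 # skip connection

class DualNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, FILTERS, 3, padding=1, bias=False),
            nn.BatchNorm2d(FILTERS), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(FILTERS) for _ in range(BLOCKS)])
        # Policy head: a distribution over 19*19 board moves + pass
        self.policy = nn.Sequential(
            nn.Conv2d(FILTERS, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))
        # Value head: a single scalar squashed into [-1, 1] by tanh
        self.value = nn.Sequential(
            nn.Conv2d(FILTERS, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(BOARD * BOARD, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        h = self.tower(self.stem(x))
        return F.log_softmax(self.policy(h), dim=1), self.value(h)
```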
19. Generating Training Data via Self-Play
$l = (z - v)^2 - \pi^{T} \log p + c \lVert \theta \rVert^2$
Loss = MSE(actual outcome, prediction) + Cross Entropy(MCTS policy, prediction) + L2(model weights).
Self-play games of the current best model are used to generate training data.
Multiple self-play games are run simultaneously to provide sufficient training data.
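A sketch of this loss in PyTorch, assuming the network outputs log-probabilities `log_p` and values `v` for a batch, with the MCTS search probabilities `pi` and game outcomes `z` as targets (the L2 term c‖θ‖² is typically applied via the optimizer's weight decay):

```python
import torch.nn.functional as F

def alphago_zero_loss(log_p, v, pi, z):
    value_loss = F.mse_loss(v.squeeze(-1), z)       # (z - v)^2, averaged over the batch
    policy_loss = -(pi * log_p).sum(dim=1).mean()   # -pi^T log p  (cross entropy)
    return value_loss + policy_loss                 # + c * ||theta||^2 via weight_decay
```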
20. Prior Probabilities
$P(s, a) = (1 - \epsilon)\, p_a + \epsilon\, \eta_a, \quad (\epsilon = 0.25,\ \eta \sim \mathrm{Dir}(0.03))$
p_a: NN policy output for action a.
The prior probability is obtained by adding Dirichlet noise to the NN output.
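A small NumPy sketch of this mixing step (in the paper the noise is applied only to the priors at the root node of each search):

```python
import numpy as np

def noisy_priors(p, eps=0.25, alpha=0.03):
    """P(s, a) = (1 - eps) * p_a + eps * eta_a, with eta ~ Dir(alpha)."""
    eta = np.random.dirichlet([alpha] * len(p))
    return (1 - eps) * p + eps * eta
```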
Temperature (Simulated annealing)
$\pi(a \mid s_0) = \dfrac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$
τ = 1 for the first 30 plays of self-play.
Then reduce τ → 0.0, which is equivalent to deterministically playing
$a = \arg\max_a N(s_0, a)$
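A small numeric illustration (the visit counts are made up) of why lowering τ is equivalent to picking the most-visited move:

```python
import numpy as np

counts = np.array([800.0, 500.0, 300.0])      # hypothetical N(s0, a) at the root
for tau in (1.0, 0.5, 0.1):
    pi = counts ** (1.0 / tau)
    pi /= pi.sum()
    print(tau, np.round(pi, 3))
# tau = 1.0 -> approx [0.5, 0.312, 0.188]
# tau = 0.5 -> approx [0.653, 0.255, 0.092]
# tau = 0.1 -> approx [0.99, 0.009, 0.0]   (nearly all mass on argmax_a N(s0, a))
```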
Training vs Evaluation
22. Final Performance and Comparison
AlphaGo Zero achieved SOTA after 40 days of training, defeating AlphaGo Master by 89:11.
The “raw network” shows the performance of the NN without MCTS (using p and v directly as the policy and value, without MCTS simulation).
23. Neural Network Architecture Comparison
The 2 innovations of AlphaGo Zero were
1. Using a residual network (ResNet) instead of a plain convolutional network (CNN).
2. Combining the policy and value networks into a single dual-headed network.
By separating out each factor, we can see the contribution of each.
For playing strength, the ResNet and dual-network factors seem to contribute roughly equally.
24. Empirical evaluation
Defeated AlphaGo Lee 100:0 after 72 hours of training.
Worse at predicting human moves but better at playing Go.
The NN seems to have learned a style different from humans.
25. Go knowledge learned by AlphaGo Zero
AlphaGo Zero discovered many human moves and variants of known human moves, as well as new moves unknown to human players.
This data also supports the conclusion that AlphaGo Zero has learned a new style of playing Go, different from how humans play the game.