AlphaZero
A General Reinforcement Learning Algorithm that Masters Chess, Shogi and Go through Self-Play
Introduction: AlphaGo and its Successors
▪ AlphaGo: January 27th, 2016
▪ AlphaGo Master: December 29th, 2016
▪ AlphaGo Zero: October 19th, 2017
▪ AlphaZero: December 5th, 2017
▪ The full AlphaZero paper was published on December 6th, 2018, in Science.
AlphaZero: One Program to Rule them All
▪ Going Beyond the Game of Go: All three games of chess, shogi, and
go are played by a single algorithm and single network architecture.
Training is performed separately for each game.
▪ No human data: Starts tabula rasa (hence the “Zero” in the name)
from random play and only uses self-play.
▪ No hand-crafted features: Only the rules of each game and raw board
positions are used (different from original AlphaGo).
▪ Shared hyperparameters: Only learning-rate schedule and exploration
noise parameters are different for each game.
Reinforcement Learning:
A Brief Introduction
Introduction
▪ Reinforcement Learning (RL) concerns how software agents should take actions in an
environment to maximize some reward.
▪ It is different from Supervised Learning (SL) in that the agent discovers the reward by
exploring its environment, making labelled data unnecessary.
▪ AlphaZero uses the discrete Markov Decision Process (MDP) paradigm, where outcomes
are partly random and partly under the control of the agent.
Terminology
▪ Agent: The thing interacting with the
environment.
▪ State (s): The situation that the agent is in.
▪ Action (a): The action that the agent takes.
▪ Reward (r): The reward (or penalty) that the
agent receives from taking an action in a state.
▪ Policy (π): The function that decides probabilities
for taking each possible action in a given state.
Returns a vector with probabilities for all actions.
V(s) = Σ_{a∈A} π(s, a) · Q(s, a)
• Value Function (V(s)): The value (long term
discounted total reward) of the given state.
• Action-Value Function (Q(s, a)): The value of a
given action in a given state.
Key Properties
▪ The value of a state is the sum of its action-values weighted by the likelihood of the action.
V(s) = Σ_{a∈A} π(s, a) · Q(s, a)
▪ Policies must sum to 1 because they are the probabilities of choosing possible actions.
Σ_{a∈A} π(s, a) = 1
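As a minimal numerical sketch of the two identities above (toy numbers for a single state with three actions, not taken from the slides), the value of a state follows directly from the policy and the action-values:

```python
import numpy as np

# Hypothetical policy and action-values for one state with 3 actions.
pi = np.array([0.6, 0.3, 0.1])   # π(s, a): probabilities, must sum to 1
q = np.array([0.2, -0.5, 1.0])   # Q(s, a): value of each action in state s

assert np.isclose(pi.sum(), 1.0)  # Σ_a π(s, a) = 1
v = float(np.dot(pi, q))          # V(s) = Σ_a π(s, a) · Q(s, a)
print(v)                          # 0.6·0.2 + 0.3·(−0.5) + 0.1·1.0 = 0.07
```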
The Explore-Exploit Tradeoff
▪ The fundamental question of Reinforcement Learning:
▪ Explore: Explore the environment further to find higher rewards.
▪ Exploit: Exploit the known states/actions to maximize reward.
Should I just eat the cheese that I have already found, or should I search the maze for more/better cheese?
The Markov Property
▪ All states in the Markov Decision Process (MDP) must satisfy the Markov Property: each state must depend only on the state immediately before it. There is no memory of earlier states.
▪ A stochastic process has the Markov property if the conditional probability distribution of
the future given the present and the past depends only on the present state and not on any
previous states.
▪ Unfortunately, board games do not satisfy the Markov property.
Monte Carlo Tree Search
Monte Carlo Simulation
▪ Using repeated random sampling to
simulate intractable systems.
▪ The name derives from the Casino de
Monte-Carlo in Monaco.
▪ Monte Carlo simulation can be applied
to any problem with a probabilistic
interpretation.
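A classic toy illustration of the idea (not from the slides): estimating π by random sampling, where the fraction of uniformly random points that land inside the unit quarter-circle converges to π/4.

```python
import random

def estimate_pi(n_samples: int = 100_000) -> float:
    """Monte Carlo estimate of pi via repeated random sampling."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:   # point lands inside the quarter-circle
            inside += 1
    return 4.0 * inside / n_samples

print(estimate_pi())  # roughly 3.14; the estimate improves with more samples
```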
Monte Carlo Tree Search
▪ Node: State
▪ Edge: Action
▪ Tree Search: Searching the various “leaves”
of the “tree” of possibilities.
▪ The simulation begins from the “root” node.
▪ When visited in simulation, a “leaf” node
becomes a “branch” node and sprouts its own
“leaf” nodes in the “tree”.
MCTS in AlphaZero
▪ MCTS is used to simulate games in
AlphaZero’s “imagination”.
▪ The processes for selecting the next
move in the “imagination” and in
“reality” are very different.
MCTS
Training by Self-Play
Network Architecture: Introduction
▪ Inputs: Concatenated board positions from the previous 8 turns, encoded from the current player's perspective.
▪ Outputs: Policy for MCTS simulation (policy head, top) and value of the given state (value head, bottom).
▪ Inputs also include auxiliary information, such as the current player, concatenated channel-wise.
▪ Policy outputs for chess and shogi are 2D planes over the board, unlike go, which has a 1D output vector.
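As a rough sketch of the input encoding for go (the plane count and ordering here are simplifying assumptions, not the paper's exact specification), the last 8 positions of each player can be stacked channel-wise together with a constant colour-to-play plane:

```python
import numpy as np

BOARD = 19    # go board size; chess and shogi use 8x8 and 9x9 boards instead
HISTORY = 8   # number of past positions fed to the network

def encode_position(own_stones, opp_stones, to_play):
    """Stack HISTORY past positions per player plus a colour-to-play plane.

    own_stones / opp_stones: lists of HISTORY binary (BOARD, BOARD) arrays,
    most recent first. Returns a (2 * HISTORY + 1, BOARD, BOARD) float array.
    """
    planes = list(own_stones) + list(opp_stones)
    planes.append(np.full((BOARD, BOARD), float(to_play)))  # current player
    return np.stack(planes).astype(np.float32)

# Toy usage with empty boards.
empty = [np.zeros((BOARD, BOARD)) for _ in range(HISTORY)]
x = encode_position(empty, empty, to_play=1)
print(x.shape)  # (17, 19, 19) for this simplified encoding
```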
Overview
• Select the next move in the simulation using Polynomial Upper Confidence Trees (PUCT).
• Repeat until an unevaluated leaf node is encountered.
• Backup from the node after evaluating its value and action value. Update the statistics of the branches.
• Play a move after enough simulations (800 for AlphaZero) have been performed to generate a policy.
Core Concepts
▪ 𝑁(𝑠, 𝑎): Visit count, the number of times a state-action pair has been visited.
▪ 𝑊(𝑠, 𝑎): Total action-value, the sum of all NN value outputs from that branch.
▪ 𝑄(𝑠, 𝑎): Mean action-value, 𝑊(𝑠, 𝑎)/𝑁(𝑠, 𝑎).
▪ 𝑃(𝑠, 𝑎): Prior Probability, policy output of NN for the given state-action pair (s, a).
▪ N(s): Parent visit count. N(s) = Σ_{a∈A} N(s, a)
▪ 𝐶(𝑠): Exploration rate. Stays nearly constant in a single simulation.
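These per-edge statistics can be held in a small structure attached to each edge of the search tree. The layout below is an assumed sketch used by the code examples in the following sections, not DeepMind's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """Statistics for one state-action pair (s, a) in the search tree."""
    prior: float              # P(s, a): policy output of the NN
    visit_count: int = 0      # N(s, a)
    total_value: float = 0.0  # W(s, a): sum of backed-up values

    @property
    def mean_value(self) -> float:
        """Q(s, a) = W(s, a) / N(s, a); defined as 0 for unvisited edges."""
        return self.total_value / self.visit_count if self.visit_count else 0.0

@dataclass
class Node:
    """One state s; `children` maps each action to an (Edge, child Node) pair."""
    children: dict = field(default_factory=dict)

    @property
    def visit_count(self) -> int:  # N(s) = Σ_a N(s, a)
        return sum(edge.visit_count for edge, _ in self.children.values())
```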
Select
▪ Select the next move in the simulation using the PUCT algorithm.
a_selected = argmax_a [Q(s, a) + U(s, a)]
U(s, a) = C(s) · P(s, a) · √N(s) / (1 + N(s, a))
C(s) = log((1 + N(s) + c_base) / c_base) + c_init
▪ Q(s, a) + U(s, a): Upper Confidence Bound.
▪ Q(s, a): The Exploitation component.
▪ U(s, a): The Exploration component.
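A minimal sketch of the selection rule, using the Edge/Node layout sketched above and the exploration constants reported for AlphaZero, c_base ≈ 19652 and c_init ≈ 1.25 (treat the exact values as assumptions):

```python
import math

C_BASE, C_INIT = 19652.0, 1.25   # exploration constants (assumed from the paper)

def select_action(node):
    """Pick argmax_a [Q(s, a) + U(s, a)] among the children of `node`."""
    n_s = node.visit_count
    c_s = math.log((1 + n_s + C_BASE) / C_BASE) + C_INIT   # C(s)

    def puct(item):
        _action, (edge, _child) = item
        u = c_s * edge.prior * math.sqrt(n_s) / (1 + edge.visit_count)
        return edge.mean_value + u                          # Q(s, a) + U(s, a)

    action, (edge, child) = max(node.children.items(), key=puct)
    return action, edge, child
```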
Key Points
▪ All statistics for MCTS (N, W, Q, P, C) are maintained for one game only: they persist across simulations within a game but are not carried over between games.
▪ The NN evaluates each node only once, when it is a leaf node.
▪ The NN outputs P(s, a) and V(s) via the policy and value heads, respectively.
▪ P(s, a), Q(s, a), and U(s, a) are vectors with one element per action, not scalars.
Expand and Evaluate
▪ From the root node, go down the branch nodes of the tree
until a leaf node (an unevaluated node) is encountered.
▪ Evaluate the leaf node (s′) using f_θ, the Neural Network (NN), to obtain the policy and value for the simulation.
(p, v) = f_θ(s′), where p = P(s′, a) and v = V(s′)
▪ The tree then grows a branch where there was a leaf.
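A sketch of the expand-and-evaluate step under the same assumed data structures; `evaluate` stands in for the network f_θ and is expected to return a dict of priors over legal moves plus a scalar value (both placeholders here):

```python
def expand(leaf, priors):
    """Turn an unevaluated leaf into a branch by adding one Edge per legal move.

    `priors` is a dict {action: P(s', a)} taken from the policy head of the NN.
    """
    for action, p in priors.items():
        leaf.children[action] = (Edge(prior=p), Node())

def expand_and_evaluate(leaf, evaluate):
    """Evaluate the leaf with the NN, expand it, and return v = V(s') for backup."""
    priors, value = evaluate(leaf)   # (p, v) = f_θ(s')
    expand(leaf, priors)
    return value
```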
Backup
N(s) ← N(s) + 1
N(s, a) ← N(s, a) + 1
W(s, a) ← W(s, a) + v
Q(s, a) ← W(s, a) / N(s, a)
▪ A simulation terminates if a leaf node is reached, the game ends
in the simulation, the value is below a resignation threshold, or
a maximum game length is reached.
▪ Update the visit counts and average action value for all previous
state-action pairs, all the way up the tree to the root node.
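A sketch of the backup pass under the same assumptions; `path` is the list of edges traversed from the root down to the new leaf. The sign flip between levels reflects the usual two-player convention, an implementation detail not spelled out on the slide:

```python
def backup(path, value):
    """Propagate the leaf value up the traversed path, updating N, W, and Q."""
    v = value
    for edge in reversed(path):
        edge.visit_count += 1    # N(s, a) ← N(s, a) + 1
        edge.total_value += v    # W(s, a) ← W(s, a) + v
        # Q(s, a) = W / N is recomputed on demand by Edge.mean_value.
        v = -v                   # switch perspective between the two players
```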
Play
π(a|s′) = (N(s′, a) / N(s′))^(1/τ)
▪ After a specified number of simulations (800 was used), the policy for play is
decided by the visit count and the temperature parameter.
▪ 𝜏: The temperature parameter controlling the entropy of the policy.
▪ The moves in play are “real” moves, not “imaginary” simulations.
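A sketch of converting root visit counts into the play policy with a temperature; the proportional form N(s′, a)^(1/τ) used below is the slide's expression renormalized to sum to 1, and it becomes a greedy argmax as τ → 0:

```python
import numpy as np

def play_policy(visit_counts, tau=1.0):
    """Convert root visit counts N(s', a) into play probabilities π(a | s')."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau < 1e-3:                      # τ ≈ 0: always play the most visited move
        pi = np.zeros_like(counts)
        pi[counts.argmax()] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)      # N(s', a)^(1/τ)
    return scaled / scaled.sum()

counts = [500, 250, 50]                 # toy visit counts after 800 simulations
print(play_policy(counts, tau=1.0))     # proportional to visit counts
print(play_policy(counts, tau=1e-4))    # greedy: [1, 0, 0]
```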
Key Points
▪ The probabilities of the play policy π are given by the visit counts of MCTS simulation,
not by the NN directly.
▪ No NN training occurs during MCTS simulation.
▪ The action selection mechanisms for simulation and play are different.
The Loss Function
l = (z − v)² − πᵀ log p + c‖θ‖²
Loss = MSE(actual value, predicted value)
+ Cross Entropy(MCTS policy, predicted policy)
+ L2 Decay(model weights)
▪ z = +1, 0, −1 for a win, draw, or loss as the true outcome of the game.
▪ 𝑐: Weight decay hyperparameter.
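A sketch of the loss on a single training example, written with plain NumPy for clarity (a real implementation would use the framework's built-in losses and handle the L2 term via the optimizer's weight decay; c = 1e-4 is an assumed setting):

```python
import numpy as np

def alphazero_loss(z, v, mcts_pi, p_logits, weights, c=1e-4):
    """l = (z − v)² − πᵀ log p + c·‖θ‖² for one position."""
    log_p = p_logits - np.log(np.sum(np.exp(p_logits)))  # log-softmax over moves
    value_loss = (z - v) ** 2                             # MSE term
    policy_loss = -np.dot(mcts_pi, log_p)                 # cross-entropy term
    l2 = c * sum(np.sum(w ** 2) for w in weights)         # weight decay term
    return value_loss + policy_loss + l2

# Toy example: 3 legal moves, the game was eventually won (z = +1).
loss = alphazero_loss(z=1.0, v=0.4,
                      mcts_pi=np.array([0.7, 0.2, 0.1]),
                      p_logits=np.array([1.5, 0.3, -0.8]),
                      weights=[np.ones((2, 2))])
print(loss)
```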
Intuition for MCTS in AlphaZero
Self-Play vs Evaluation
Prior Probabilities
P(s′, a) = (1 − ε)·p_a + ε·η_a,  η ~ Dir(α)
▪ In training, noise is added to the root node prior
probability.
▪ 𝜖 = 0.25, 𝛼 = {0.3, 0.15, 0.03} for chess, shogi,
and go, respectively.
▪ α is scaled in inverse proportion to the approximate number of legal moves in a typical position.
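A sketch of mixing Dirichlet noise into the root priors during self-play, using the slide's ε = 0.25 and the chess value α = 0.3:

```python
import numpy as np

def add_root_noise(priors, epsilon=0.25, alpha=0.3, rng=None):
    """P(s', a) = (1 − ε)·p_a + ε·η_a, with η ~ Dir(α)."""
    rng = rng or np.random.default_rng()
    priors = np.asarray(priors, dtype=np.float64)
    noise = rng.dirichlet([alpha] * len(priors))  # one α per legal move
    return (1.0 - epsilon) * priors + epsilon * noise

print(add_root_noise([0.5, 0.3, 0.2]))            # still sums to 1
```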
Temperature
π(a|s′) = (N(s′, a) / N(s′))^(1/τ)
▪ Simulated annealing is used to increase exploration
during the first few moves (𝜏 = 1 for the first 30
moves, 𝜏 ≈ 0 afterwards).
▪ τ ≈ 0 is equivalent to choosing the action with the highest probability, while τ = 1 is equivalent to randomly choosing an action according to the probabilities given by the vector π(a|s′).
Details of Training Data Generation
▪ Self-play games from the most recent model are used to generate training data.
▪ Multiple self-play games are run in parallel to provide enough training data.
▪ 5,000 first-generation TPUs were used for data generation during training.
▪ 16 second-generation TPUs were used for model training.
▪ The actual MCTS is performed asynchronously for better resource utilization.
▪ A batch size of 4096 game steps was used for training.
Differences with AlphaGo Zero
▪ No data augmentation by symmetries. Go is symmetric but chess and shogi are not.
▪ A single network is continually updated instead of testing for the best player every 1,000
steps. Self-play games are always generated by the latest model.
▪ No Bayesian optimization of hyperparameters.
▪ 19 residual blocks in the body of the NN, unlike the final version of AlphaGo Zero, which
had 39. However, this is identical to the early version of AlphaGo Zero.
The Neural Network
Network Architecture: Structure
▪ 19 residual blocks in the body with 2 output heads.
▪ The policy head (top) has a softmax activation to output move probabilities (the policy) for the given state.
▪ The value head (bottom) has a tanh activation to output the value of the state (+1: win, 0: draw, −1: loss).
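A reduced PyTorch sketch of the two-headed residual architecture (far smaller than the real network; the channel width, head sizes, board shape, and move count below are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: conv-BN-ReLU, conv-BN, skip connection, ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.conv2, self.bn2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(x + h)

class TwoHeadedNet(nn.Module):
    """Residual body with a softmax policy head and a tanh value head."""
    def __init__(self, in_planes=17, ch=64, blocks=19, board=19, moves=362):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_planes, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.policy = nn.Sequential(nn.Conv2d(ch, 2, 1), nn.Flatten(),
                                    nn.Linear(2 * board * board, moves))
        self.value = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Flatten(),
                                   nn.Linear(board * board, 256), nn.ReLU(),
                                   nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        h = self.body(self.stem(x))
        return torch.softmax(self.policy(h), dim=-1), self.value(h)

net = TwoHeadedNet()
p, v = net(torch.zeros(1, 17, 19, 19))
print(p.shape, v.shape)  # torch.Size([1, 362]) torch.Size([1, 1])
```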
Network Inputs
Network Outputs
Results and Performance
Comparison with Previous Programs
Comparison with Reduced Thinking Time for AlphaZero
Effects of Data Augmentation in Go
Training Speed in Steps
Repeatability of Training
Interpretation and Final Remarks
Common Misunderstandings
▪ Computers just search for all possible positions.
▪ Computers cannot have creativity or intuition like humans.
▪ Computers can only perform tasks programmed by humans; therefore they cannot exceed
humans.
▪ AlphaZero needs a supercomputer to run.
Comparison of Number of Searches
Expert Opinion
“I admit that I was pleased to see that AlphaZero had a dynamic, open style like my own. The
conventional wisdom was that machines would approach perfection with endless dry
maneuvering, usually leading to drawn games. But in my observation, AlphaZero prioritizes
piece activity over material, preferring positions that to my eye looked risky and aggressive.
Programs usually reflect priorities and prejudices of programmers, but because AlphaZero
programs itself, I would say that its style reflects the truth. This superior understanding allowed
it to outclass the world's top traditional program despite calculating far fewer positions per
second. It's the embodiment of the cliché, ‘work smarter, not harder.’”
- Garry Kasparov, former World Chess Champion
Additional Information
▪ The “Zero” in AlphaZero and AlphaGo Zero means that these systems began learning
tabula rasa, from random initialization with zero human input, only the rules of the game.
▪ A single machine with 4 first-generation TPUs and 44 CPU cores was used for game-play.
A first-generation TPU has a similar inference speed to an NVIDIA Titan V GPU.
▪ Leela Zero, an open-source reimplementation of AlphaGo Zero (with Leela Chess Zero as its AlphaZero-style chess counterpart), is available for those without access to 5,000 TPUs.
What Next?
The End. Q&A
