AlphaGo Zero
Mastering the game of Go without human knowledge
AlphaGo: Mastering the game of Go with
deep neural networks and tree search
Limitations
 It required a large amount of training data (29.4 million positions from 160,000 games on the KGS Go server).
 It used many hand-crafted features to judge moves (atari size, liberties, capture size, etc.).
 These choices may have limited the range of possible moves that it could play.
AlphaGo Zero
 No human data: Starts from random play.
 No hand-crafted features: Only the board and the rules of the
game are used.
 Single machine (4 TPUs for evaluation)
Reinforcement Learning
A Short Explanation of Agents, States, Value Functions, Policies, and MCTS
Terminology & Concepts
Agent: The thing interacting with the environment.
State (s): The situation that the agent is in.
Action (a): The action that the agent takes in a
given state.
Reward (r): The reward (or penalty) that the
agent receives from taking an action in a state.
Policy (π): The function that decides probabilities
for taking each action in a given state.
Value Function (V(s)): The value (long-term total
reward) of the given state.
Action Value Function (Q(s, a)): The value of a
given action in a given state.
$$V(s) = \sum_{a \in A} \pi(s, a)\, Q(s, a)$$
Note: The policy outputs probabilities over actions, which must therefore sum to 1.
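As a toy sketch of the identity above (the pi and Q values below are invented for illustration, not from the paper):

```python
# Toy sketch: V(s) as the policy-weighted average of action values.
def state_value(pi, Q, state, actions):
    """V(s) = sum over a of pi(s, a) * Q(s, a)."""
    return sum(pi[(state, a)] * Q[(state, a)] for a in actions)

pi = {("s0", "left"): 0.7, ("s0", "right"): 0.3}  # probabilities sum to 1
Q = {("s0", "left"): 1.0, ("s0", "right"): 5.0}
print(state_value(pi, Q, "s0", ["left", "right"]))  # 0.7*1.0 + 0.3*5.0 = 2.2
```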
The Explore-Exploit Tradeoff
 The fundamental question of Reinforcement Learning:
 Explore: Explore the environment further to find higher rewards.
 Exploit: Exploit the known states/actions to maximize reward.
Should I just eat the cheese that I have already found, or should I search the maze for
more/better cheese?
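A minimal sketch of this tradeoff is epsilon-greedy action selection, a textbook method shown here only for intuition (AlphaGo Zero itself balances the two with PUCT, described below):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise exploit the action with the highest known value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```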
Monte Carlo Tree Search
The modified version used in AlphaGo Zero
Monte Carlo Tree Search (MCTS)
 Monte Carlo: Randomly trying
things, e.g. throwing dice.
 Tree Search: Searching the
various “leaves” of the “tree” of
possibilities.
 Node: State
 Edge: Action
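A minimal sketch of this tree structure (the field names are illustrative, not the paper's pseudocode):

```python
class Node:
    """A node is a state; its outgoing edges are the legal actions."""
    def __init__(self):
        self.edges = {}  # action -> Edge; an empty dict marks an unexpanded leaf

class Edge:
    """An edge is an action, carrying the statistics MCTS accumulates."""
    def __init__(self, prior):
        self.P = prior       # prior probability P(s, a) from the network
        self.N = 0           # visit count N(s, a)
        self.W = 0.0         # total action value W(s, a)
        self.Q = 0.0         # mean action value Q(s, a) = W / N
        self.child = Node()  # the state this action leads to
```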
The Function of MCTS in AlphaGo Zero
 MCTS is used to simulate
games in AlphaGo Zero’s
“imagination”.
 But the methods for picking the
next move in its “imagination”
and in “reality” are very
different.
MCTS in AlphaGo Zero
 Select the next move in the MCTS simulation using Q(s, a) + U(s, a) for the
current simulation state s.
 Repeat this process until an unevaluated node (a “leaf” node) is encountered.
 After evaluating the leaf’s V(s) and P(s, a), back up from that node, updating
the visit counts and Q(s, a).
Selection
 Select the next move using the PUCT algorithm
$$a_t = \operatorname*{argmax}_a \big[ Q(s_t, a) + U(s_t, a) \big]$$
$$U(s, a) = c_{\text{puct}}\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$
$t$: The time step in an MCTS simulation
$Q(s_t, a)$ represents Exploitation.
$U(s_t, a)$ represents Exploration.
$c_{\text{puct}}$: A hyperparameter controlling the Explore/Exploit tradeoff
$P(s, a)$: The prior probabilities (NN output with Dirichlet noise)
$N(s, a)$: The visit count of that action in that state
$\sum_b N(s, b)$: The total visit count of the state
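A sketch of this selection rule over the Node/Edge structure above (c_puct = 1.0 is an arbitrary placeholder value, not one fixed by the paper):

```python
import math

def select_action(node, c_puct=1.0):
    """Pick the action maximizing Q(s, a) + U(s, a) at this node."""
    total_n = sum(e.N for e in node.edges.values())  # visit count of the state
    def puct(edge):
        u = c_puct * edge.P * math.sqrt(total_n) / (1 + edge.N)
        return edge.Q + u
    return max(node.edges, key=lambda a: puct(node.edges[a]))
```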
Expand and Evaluate
 Go down the “branches” of the “tree” until a
“leaf” node (unevaluated state) is encountered.
 Then evaluate $v = V(s_L)$ and $p = P(s_L, a)$ of
that node (state) using the Neural Network.
 The “tree” then grows a “branch” where there
used to be a “leaf”.
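A sketch of expansion and evaluation, where `network` is a placeholder for the policy/value network, returning a prior distribution p (a dict of action to probability) and a scalar value v for the leaf state:

```python
def expand(leaf, state, network):
    """Evaluate a leaf with the NN and grow edges for its legal actions."""
    p, v = network(state)   # p: action -> prior probability, v: value in [-1, 1]
    for action, prior in p.items():
        leaf.edges[action] = Edge(prior)
    return v                # v is backed up along the search path
```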
Backup
$$N(s) \leftarrow N(s) + 1$$
$$N(s, a) \leftarrow N(s, a) + 1$$
$$W(s, a) \leftarrow W(s, a) + V(s_L)$$
$$Q(s, a) \leftarrow \frac{W(s, a)}{N(s, a)}$$
 After encountering and evaluating a leaf node,
go back up to the “root” node.
 Update visit counts N(s), N(s, a) and Q(s, a)
 W(s, a) is the total action value, used only for
calculating Q(s, a), the average action value.
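A sketch of the backup step over the path of (node, action) pairs visited during selection. One caveat: in real two-player self-play the sign of the leaf value alternates between plies; that detail is omitted here for brevity.

```python
def backup(path, v):
    """Propagate the leaf value v back to the root, updating statistics."""
    for node, action in reversed(path):
        edge = node.edges[action]
        edge.N += 1               # N(s, a) <- N(s, a) + 1
        edge.W += v               # W(s, a) <- W(s, a) + V(s_L)
        edge.Q = edge.W / edge.N  # Q(s, a) <- W(s, a) / N(s, a)
```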
Play
$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$$
 $s_0$: Root node
 $\tau$: “Temperature” parameter controlling exploration ($1 \to 0$)
 $N(s_0, a)$: Visit count of each action from the root node.
 $\sum_b N(s_0, b)$: Total visit count of the root node.
 After 1600 simulations of MCTS, select the next action.
 This is a “real” action, not an “imaginary” action.
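A sketch of turning root visit counts into the play-time policy, including the deterministic limit as the temperature goes to zero:

```python
def play_policy(root, tau=1.0):
    """pi(a|s0) proportional to N(s0, a)^(1/tau) over the root's actions."""
    actions = list(root.edges)
    counts = [root.edges[a].N for a in actions]
    if tau == 0:  # tau -> 0: always pick the most-visited move
        probs = [1.0 if n == max(counts) else 0.0 for n in counts]
        probs = [p / sum(probs) for p in probs]  # normalize in case of ties
    else:
        weights = [n ** (1.0 / tau) for n in counts]
        probs = [w / sum(weights) for w in weights]
    return dict(zip(actions, probs))
```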
Key Points
 The probabilities of the policy $\pi$ are given by the visit counts of the
MCTS simulations, not by the NN directly.
 The visit counts in MCTS are maintained for one game: not just for one
MCTS simulation, and not across multiple games.
 Action selection is different for simulation and play.
 The NN evaluates each node only once, when it is first encountered as a leaf node.
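Tying the sketches above together, one MCTS simulation descends the tree, expands a leaf, and backs up its value. `step(state, action)` is a hypothetical game-rules function, not something defined in the paper:

```python
def simulate(root, root_state, network, c_puct=1.0):
    """One simulation: select down to a leaf, expand it, back up its value."""
    node, state, path = root, root_state, []
    while node.edges:                    # descend until an unexpanded leaf
        action = select_action(node, c_puct)
        path.append((node, action))
        state = step(state, action)      # hypothetical environment step
        node = node.edges[action].child
    v = expand(node, state, network)     # the NN evaluates this node once
    backup(path, v)
```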
Training by Self-Play
Using self-play to train the RL agent with supervised learning
Network Architecture
 Inputs: The previous 8 states
 Outputs: Predicted Policy 𝑝 &
Predicted Value 𝑣.
 Structure: 40 residual blocks with
2 output heads.
 The policy head (top) has softmax
activation to output probabilities.
 The value head (bottom) has tanh
activation, since the outcome is
+1 (win), 0 (tie), or −1 (loss).
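A scaled-down PyTorch sketch of the dual-headed architecture. The actual network uses 40 residual blocks with 256 filters on a 17-plane 19×19 input; the channel and block counts here are reduced for illustration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.conv2, self.bn2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        return torch.relu(self.bn2(self.conv2(h)) + x)  # residual connection

class DualNet(nn.Module):
    def __init__(self, planes=17, ch=64, blocks=4, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(planes, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU())
        self.trunk = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        # Policy head: one logit per board point, plus one for the pass move.
        self.policy = nn.Sequential(
            nn.Conv2d(ch, 2, 1), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1))
        # Value head: tanh squashes the prediction into [-1, +1].
        self.value = nn.Sequential(
            nn.Conv2d(ch, 1, 1), nn.Flatten(),
            nn.Linear(board * board, ch), nn.ReLU(), nn.Linear(ch, 1), nn.Tanh())

    def forward(self, x):
        h = self.trunk(self.stem(x))
        return torch.softmax(self.policy(h), dim=1), self.value(h)
```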
Generating Training Data via Self-Play
$$l = (z - v)^2 - \pi^{T} \log p + c \lVert \theta \rVert^2$$
 Loss = MSE(actual outcome, prediction) +
Cross Entropy(MCTS policy, prediction) +
L2(model weights).
 Self-Play games of the current best model are
used to generate training data.
 Multiple self-play games are run simultaneously
to provide sufficient training data.
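A PyTorch sketch of this loss, where z is the game outcome, v and log_p are the network's value and log-policy outputs, and mcts_pi is the visit-count policy. In practice the L2 term is usually supplied via the optimizer's weight_decay, but it is written out here to match the equation:

```python
import torch

def alphago_zero_loss(z, v, mcts_pi, log_p, params, c=1e-4):
    value_loss = (z - v).pow(2).mean()                  # (z - v)^2
    policy_loss = -(mcts_pi * log_p).sum(dim=1).mean()  # -pi^T log p
    l2 = c * sum(p.pow(2).sum() for p in params)        # c * ||theta||^2
    return value_loss + policy_loss + l2
```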
Prior Probabilities
$$P(s, a) = (1 - \epsilon)\, p_a + \epsilon\, \eta_a, \qquad \epsilon = 0.25, \; \eta \sim \mathrm{Dir}(0.03)$$
 $p_a$: NN output for action $a$.
 The prior probability is obtained by adding Dirichlet noise to the NN output.
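A one-function sketch of this noise injection at the root, using the paper's values of epsilon = 0.25 and Dir(0.03):

```python
import numpy as np

def noisy_priors(p, epsilon=0.25, alpha=0.03):
    """P(s, a) = (1 - epsilon) * p_a + epsilon * eta_a, with eta ~ Dir(alpha)."""
    eta = np.random.dirichlet([alpha] * len(p))
    return (1 - epsilon) * np.asarray(p) + epsilon * eta
```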
Temperature (Simulated annealing)
$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$$
$\tau = 1$ for the first 30 moves of self-play. Then reduce $\tau \to 0$, which is
equivalent to deterministically selecting
$$a = \operatorname*{argmax}_a N(s_0, a)$$
Training vs Evaluation
Results
Performance and comparison with other models.
Final Performance and Comparison
 AlphaGo Zero achieved state-of-the-art performance after 40 days of
training, defeating AlphaGo Master 89:11.
 The “raw network” shows the performance of the NN without MCTS
(using $p$ and $v$ directly as the policy and value, with no MCTS simulation).
Neural Network Architecture Comparison
The two innovations of AlphaGo Zero were:
1. Using a ResNet instead of a plain CNN.
2. Combining the policy and value
networks into a single dual-headed network.
By separating out each factor, we can see
its individual contribution.
For playing performance, the ResNet and
dual-network factors seem to contribute
about equally.
Empirical evaluation
 Defeated AlphaGo Lee 100:0 after 72 hours of training.
 Worse at predicting human moves, but better at playing Go.
 The NN seems to have learned a style different from that of humans.
Go knowledge learned by AlphaGo Zero
AlphaGo Zero discovered many human
moves and variants of known human
moves, as well as new moves unknown to
human players.
This data also supports the conclusion
that AlphaGo Zero has learned a new
style of playing Go, different from how
humans play the game.
The End