AlphaZero
Mastering Chess and Shogi
by Self-Play
with a General Reinforcement Learning Algorithm
Karel Ha
article by Google DeepMind
AI Seminar, 19th December 2017
Outline
The Alpha* Timeline
AlphaGo
AlphaGo Zero (AG0)
AlphaZero
Conclusion
1
The Alpha* Timeline
Fan Hui
■ professional 2 dan
■ European Go Champion in 2013, 2014 and 2015
■ European Professional Go Champion in 2016
https://en.wikipedia.org/wiki/Fan_Hui 2
AlphaGo (AlphaGo Fan) vs. Fan Hui
AlphaGo won 5:0 in a formal match in October 2015.
[AlphaGo] is very strong and stable, it seems
like a wall. ... I know AlphaGo is a computer,
but if no one told me, maybe I would think
the player was a little strange, but a very
strong player, a real person.
Fan Hui 3
Lee Sedol “The Strong Stone”
■ professional 9 dan
■ the 2nd in international titles
■ the 5th youngest to become a professional Go player in South
Korean history
■ Lee Sedol would win 97 out of 100 games against Fan Hui.
■ “Roger Federer” of Go
https://en.wikipedia.org/wiki/Lee_Sedol 4
I heard Google DeepMind’s AI is surprisingly
strong and getting stronger, but I am
confident that I can win, at least this time.
Lee Sedol
...even beating AlphaGo by 4:1 may allow
the Google DeepMind team to claim its de
facto victory and the defeat of him
[Lee Sedol], or even humankind.
interview in JTBC
Newsroom
4
AlphaGo (AlphaGo Lee) vs. Lee Sedol
In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won
by resignation.
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 5
AlphaGo Master
In January 2017, DeepMind revealed that AlphaGo had played
a series of unofficial online games against some of the strongest
professional Go players under the pseudonyms “Master” and
“Magister”.
This AlphaGo was an improved version of the AlphaGo that played
Lee Sedol in 2016.
Over one week, AlphaGo played 60 online fast time-control games.
AlphaGo won this series of games 60:0.
https://deepmind.com/research/alphago/match-archive/master/ 6
■ 23 May - 27 May 2017 in Wuzhen, China
■ Team Go vs. AlphaGo 0:1
■ AlphaGo vs. world champion Ke Jie 3:0
https://events.google.com/alphago2017/ 7
AlphaGo Zero: defeated AlphaGo Lee by 100 games to 0
7
AI system that mastered
chess, Shogi and Go to
“superhuman levels” within
a handful of hours
AlphaZero
defeated AlphaGo Zero (version with 20 blocks trained for 3 days)
by 60 games to 40
8
1 AlphaGo Fan
2 AlphaGo Lee
3 AlphaGo Master
4 AlphaGo Zero
5 AlphaZero
8
AlphaGo
Policy and Value Networks
[Silver et al. 2016] 9
Training the (Deep Convolutional) Neural Networks
[Silver et al. 2016] 10
AlphaGo Zero (AG0)
AG0: Differences Compared to AlphaGo {Fan, Lee, Master}
AlphaGo {Fan, Lee, Master} × AlphaGo Zero:
■ supervised learning from human expert positions × from
scratch by self-play reinforcement learning (“tabula rasa”)
■ additional (auxiliary) input features × only the black and
white stones from the board as input features
■ separate policy and value networks × single neural network
■ tree search also using Monte Carlo rollouts × simpler tree
search using only the single neural network to both evaluate
positions and sample moves
■ (AlphaGo Lee) distributed machines + 48 tensor processing
units (TPUs) × a single machine + 4 TPUs
■ (AlphaGo Lee) several months of training time × 72 h of
training time (outperforming AlphaGo Lee after 36 h)
[Silver et al. 2017b] 11
AG0 achieves this via
■ a new reinforcement learning algorithm
■ with lookahead search inside the training loop
[Silver et al. 2017b] 12
AG0: Self-Play Reinforcement Learning
[Silver et al. 2017b] 13
AG0: Self-Play Reinforcement Learning – Neural Network
deep neural network fθ with parameters θ:
■ input: raw board representation s
■ output:
■ move probabilities p
■ value v of the board position
■ fθ(s) = (p, v)
■ specifics:
■ (20 or 40) residual blocks (of convolutional layers)
■ batch normalization
■ rectifier non-linearities
[Silver et al. 2017b] 14
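To make the architecture concrete, here is a minimal PyTorch-style sketch of such a dual-headed residual network. The channel count, number of residual blocks, input planes and head sizes below are illustrative assumptions, not the exact AG0 configuration reported in [Silver et al. 2017b].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus a skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)  # rectifier non-linearity after the skip connection

class PolicyValueNet(nn.Module):
    """f_theta(s) = (p, v): one convolutional tower, two heads."""
    def __init__(self, in_planes: int = 17, channels: int = 64,
                 n_blocks: int = 20, board: int = 19, n_moves: int = 19 * 19 + 1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        # policy head: move logits p (softmax gives move probabilities)
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.Flatten(),
            nn.Linear(2 * board * board, n_moves))
        # value head: scalar evaluation v of the board position in [-1, 1]
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.Flatten(),
            nn.Linear(board * board, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, s):
        h = self.tower(self.stem(s))
        return self.policy_head(h), self.value_head(h)
```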
AG0: Comparison of Various Neural Network Architectures
[Silver et al. 2017b] 15
AG0: Self-Play Reinforcement Learning – Steps
0. random weights θ0
1. at each iteration i > 0, self-play games are generated:
i. MCTS samples search probabilities πt based on the neural
network from the previous iteration fθi−1: πt = αθi−1(st)
for each time-step t = 1, 2, . . . , T
ii. move is sampled from πt
iii. data (st, πt, zt) for each t are stored for later training
iv. new neural network fθi is trained in order to minimize the loss
l = (z − v)² − π⊤ log p + c||θ||²
Loss l makes (p, v) = fθ(s) more closely match the improved search
probabilities and self-play winner (π, z).
[Silver et al. 2017b] 16
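As a rough sketch, the loss above could be computed as follows in PyTorch; the regularisation constant c and the batching details are assumptions, and in practice the c||θ||² term is often delegated to the optimizer's weight_decay rather than added explicitly.

```python
import torch
import torch.nn.functional as F

def ag0_loss(p_logits, v, pi, z, params, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch.

    p_logits: network move logits,            shape (B, n_moves)
    v:        network value prediction,       shape (B, 1)
    pi:       MCTS search probabilities,      shape (B, n_moves)
    z:        self-play game outcome,         shape (B, 1)
    params:   iterable of network parameters (theta)
    """
    value_loss = F.mse_loss(v, z)                                        # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(dim=1).mean()  # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)                         # c * ||theta||^2
    return value_loss + policy_loss + l2
```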
AG0: Monte Carlo Tree Search (1/2)
Monte Carlo Tree Search (MCTS) in AG0:
[Silver et al. 2017b] 17
AG0: Monte Carlo Tree Search (2/2)
[Silver et al. 2017b] 18
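The figures show the MCTS phases (select, expand and evaluate, backup, play). For orientation, here is a minimal sketch of the selection rule: each step descends the edge maximizing Q(s, a) + U(s, a), where U is proportional to the network prior P(s, a) and decays with the visit count. The node/edge fields (N, W, P) and the c_puct constant below are illustrative assumptions, not the paper's exact data structures.

```python
import math

def select_action(node, c_puct=1.0):
    """Pick the edge maximizing Q(s, a) + U(s, a), where
    U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).

    node.edges: dict mapping action -> edge with fields
                N (visit count), W (total action value), P (prior probability).
    """
    total_visits = sum(e.N for e in node.edges.values())
    best_action, best_score = None, -float("inf")
    for action, e in node.edges.items():
        q = e.W / e.N if e.N > 0 else 0.0                       # mean action value Q(s, a)
        u = c_puct * e.P * math.sqrt(total_visits) / (1 + e.N)  # exploration bonus U(s, a)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```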
AG0: Self-Play Reinforcement Learning – Review
[Silver et al. 2017b] 19
AG0: Elo Rating over Training Time (RL vs. SL)
[Silver et al. 2017b] 20
AG0: Elo Rating over Training Time (AG0 with 40 blocks)
[Silver et al. 2017b] 21
AG0: Tournament between AI Go Programs
[Silver et al. 2017b] 22
AG0: Discovered Joseki (Corner Sequences)
(a) five human joseki
(b) five novel joseki variants eventually preferred by AG0
[Silver et al. 2017b] 23
AG0: Discovered Playing Styles
at 3 h greedy capture of stones
at 19 h the fundamentals of Go concepts (life-and-death,
influence, territory...)
at 70 h remarkably balanced game (multiple battles,
complicated ko fight, a half-point win for white...)
[Silver et al. 2017b] 24
AlphaZero
24
To watch such a strong programme like
Stockfish, against whom most top players
would be happy to win even one game out
of a hundred, being completely taken apart
is certainly definitive.
Viswanathan Anand
It’s like chess from another dimension.
Demis Hassabis
24
AlphaZero: Differences Compared to AlphaGo Zero
AlphaGo Zero × AlphaZero:
■ binary outcome (win / loss) × expected outcome (including
draws or potentially other outcomes)
■ board positions transformed before passing to neural networks
(by randomly selected rotation or reflection) × no data
augmentation
■ games generated by the best player from previous iterations
(margin of 55 %) × continual update using the latest
parameters (without the evaluation and selection steps)
■ hyper-parameters tuned by Bayesian optimisation × reused
the same hyper-parameters without game-specific tuning
[Silver et al. 2017a] 25
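One concrete consequence of the first difference: the value target z is no longer a binary win/loss but the expected outcome of the game, so draws (frequent in chess) contribute 0. A tiny sketch of such an encoding, assuming the usual +1 / 0 / −1 convention; the result labels below are hypothetical names, not from the paper.

```python
def outcome_to_value_target(result: str, to_play_is_white: bool) -> float:
    """Map a finished game's result to the value target z in [-1, +1],
    from the perspective of the player to move: win -> +1, draw -> 0, loss -> -1."""
    if result == "draw":
        return 0.0
    winner_is_white = (result == "white_wins")
    return 1.0 if winner_is_white == to_play_is_white else -1.0
```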
AlphaZero: Elo Rating over Training Time
[Silver et al. 2017a] 26
AlphaZero: Tournament between AI Programs
(Values are given from AlphaZero’s point of view.)
[Silver et al. 2017a] 27
AlphaZero: Openings Discovered by the Self-Play (1/2)
[Silver et al. 2017a] 28
AlphaZero: Openings Discovered by the Self-Play (2/2)
[Silver et al. 2017a] 29
Conclusion
Difficulties of Go
■ challenging decision-making
■ intractable search space
■ complex optimal solution
It appears infeasible to directly approximate the optimal solution using a policy or value function!
[Silver et al. 2017a] 30
AlphaZero: Summary
■ Monte Carlo tree search
■ effective move selection and position evaluation
■ through deep convolutional neural networks
■ trained by new self-play reinforcement learning algorithm
■ new search algorithm combining
■ evaluation by a single neural network
■ Monte Carlo tree search
■ more efficient when compared to previous AlphaGo versions
■ single machine
■ 4 TPUs
■ hours rather than months of training time
[Silver et al. 2017a] 31
Novel approach
During the matches (against Stockfish and Elmo), AlphaZero
evaluated thousands of times fewer positions than Deep Blue
against Kasparov.
It compensated for this by:
■ selecting those positions more intelligently (the neural
network)
■ evaluating them more precisely (the same neural network)
Deep Blue relied on a handcrafted evaluation function.
AlphaZero was trained tabula rasa from self-play. It used
general-purpose learning.
This approach is not specific to the game of Go. The algorithm
can be used for a much wider class of AI problems!
[Silver et al. 2017a] 32
Thank you!
Questions?
32
Backup Slides
Input Features of AlphaZero’s Neural Networks
[Silver et al. 2017a]
AlphaZero: Statistics of Training
[Silver et al. 2017a]
AlphaZero: Evaluation Speeds
[Silver et al. 2017a]
Scalability When Compared to Other Programs
[Silver et al. 2017a]
Further Reading I
AlphaGo:
■ Google Research Blog
http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html
■ an article in Nature
http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
■ a reddit article claiming that AlphaGo is even stronger than it appears to be:
“AlphaGo would rather win by less points, but with higher probability.”
https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/
■ a video of how AlphaGo works (put in layman’s terms) https://youtu.be/qWcfiPi9gUU
Articles by Google DeepMind:
■ Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih
et al. 2015)
■ Neural Turing Machines (Graves, Wayne, and Danihelka 2014)
Artificial Intelligence:
■ Artificial Intelligence course at MIT
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm
Further Reading II
■ Introduction to Artificial Intelligence at Udacity
https://www.udacity.com/course/intro-to-artificial-intelligence--cs271
■ General Game Playing course https://www.coursera.org/course/ggp
■ Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2
■ The Singularity Is Near (Kurzweil 2005)
Combinatorial Game Theory (founded by John H. Conway to study endgames in Go):
■ Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory
■ On Numbers and Games (Conway 1976)
■ Computer Go as a sum of local games: an application of combinatorial game theory (Müller 1995)
Chess:
■ Deep Blue beats G. Kasparov in 1997 https://youtu.be/NJarxpYyoFI
Machine Learning:
■ Machine Learning course
https://www.coursera.org/learn/machine-learning/
■ Reinforcement Learning http://reinforcementlearning.ai-depot.com/
■ Deep Learning (LeCun, Bengio, and Hinton 2015)
Further Reading III
■ Deep Learning course https://www.udacity.com/course/deep-learning--ud730
■ Two Minute Papers https://www.youtube.com/user/keeroyz
■ Applications of Deep Learning https://youtu.be/hPKJBXkyTKM
Neuroscience:
■ http://www.brainfacts.org/
References I
Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6.
Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural Turing Machines”. In: arXiv preprint
arXiv:1410.5401.
Kurzweil, Ray (2005). The Singularity is Near: When Humans Transcend Biology. Penguin.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep Learning”. In: Nature 521.7553, pp. 436–444.
Mnih, Volodymyr et al. (2015). “Human-Level Control through Deep Reinforcement Learning”. In: Nature
518.7540, pp. 529–533. url:
https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.
Müller, Martin (1995). “Computer Go as a Sum of Local Games: an Application of Combinatorial Game Theory”.
PhD thesis. TU Graz.
Silver, David et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature
529.7587, pp. 484–489.
Silver, David et al. (2017a). “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning
Algorithm”. In: arXiv preprint arXiv:1712.01815.
Silver, David et al. (2017b). “Mastering the Game of Go without Human Knowledge”. In: Nature 550.7676,
pp. 354–359.

More Related Content

What's hot

Artificial Intelligence in Gaming.pptx
Artificial Intelligence in Gaming.pptxArtificial Intelligence in Gaming.pptx
Artificial Intelligence in Gaming.pptxMd. Rakib Trofder
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learningBig Data Colombia
 
Long Lin at AI Frontiers : AI in Gaming
Long Lin at AI Frontiers : AI in GamingLong Lin at AI Frontiers : AI in Gaming
Long Lin at AI Frontiers : AI in GamingAI Frontiers
 
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaReinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaEdureka!
 
Artificial intelligence in gaming.
Artificial intelligence in gaming.Artificial intelligence in gaming.
Artificial intelligence in gaming.Rishikese MR
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorJinwon Lee
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree SearchAlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree SearchKarel Ha
 
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete DeckAI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete DeckSlideTeam
 
ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox Tsahi Glik
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsSangwoo Mo
 
Artificial intelligence In Modern-Games.
Artificial intelligence In Modern-Games. Artificial intelligence In Modern-Games.
Artificial intelligence In Modern-Games. Nitish Kavishetti
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...Edge AI and Vision Alliance
 
알파고 풀어보기 / Alpha Technical Review
알파고 풀어보기 / Alpha Technical Review알파고 풀어보기 / Alpha Technical Review
알파고 풀어보기 / Alpha Technical Review상은 박
 
Deep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNDeep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNEuijin Jeong
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Suraj Aavula
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsSeung Jae Lee
 

What's hot (20)

Artificial Intelligence in Gaming.pptx
Artificial Intelligence in Gaming.pptxArtificial Intelligence in Gaming.pptx
Artificial Intelligence in Gaming.pptx
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learning
 
Long Lin at AI Frontiers : AI in Gaming
Long Lin at AI Frontiers : AI in GamingLong Lin at AI Frontiers : AI in Gaming
Long Lin at AI Frontiers : AI in Gaming
 
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaReinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | Edureka
 
Artificial intelligence in gaming.
Artificial intelligence in gaming.Artificial intelligence in gaming.
Artificial intelligence in gaming.
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree SearchAlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
 
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete DeckAI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
 
ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
 
Artificial intelligence In Modern-Games.
Artificial intelligence In Modern-Games. Artificial intelligence In Modern-Games.
Artificial intelligence In Modern-Games.
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
 
알파고 풀어보기 / Alpha Technical Review
알파고 풀어보기 / Alpha Technical Review알파고 풀어보기 / Alpha Technical Review
알파고 풀어보기 / Alpha Technical Review
 
Deep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNDeep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQN
 
Machine Learning
Machine Learning Machine Learning
Machine Learning
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
 

More from Karel Ha

transcript-master-studies-Karel-Ha
transcript-master-studies-Karel-Hatranscript-master-studies-Karel-Ha
transcript-master-studies-Karel-HaKarel Ha
 
Schrodinger poster 2020
Schrodinger poster 2020Schrodinger poster 2020
Schrodinger poster 2020Karel Ha
 
CapsuleGAN: Generative Adversarial Capsule Network
CapsuleGAN: Generative Adversarial Capsule NetworkCapsuleGAN: Generative Adversarial Capsule Network
CapsuleGAN: Generative Adversarial Capsule NetworkKarel Ha
 
Dynamic Routing Between Capsules
Dynamic Routing Between CapsulesDynamic Routing Between Capsules
Dynamic Routing Between CapsulesKarel Ha
 
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...Karel Ha
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerKarel Ha
 
transcript-bachelor-studies-Karel-Ha
transcript-bachelor-studies-Karel-Hatranscript-bachelor-studies-Karel-Ha
transcript-bachelor-studies-Karel-HaKarel Ha
 
Mastering the game of Go with deep neural networks and tree search: Presentation
Mastering the game of Go with deep neural networks and tree search: PresentationMastering the game of Go with deep neural networks and tree search: Presentation
Mastering the game of Go with deep neural networks and tree search: PresentationKarel Ha
 
Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiKarel Ha
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015Karel Ha
 
Separation Axioms
Separation AxiomsSeparation Axioms
Separation AxiomsKarel Ha
 
Oddělovací axiomy v bezbodové topologii
Oddělovací axiomy v bezbodové topologiiOddělovací axiomy v bezbodové topologii
Oddělovací axiomy v bezbodové topologiiKarel Ha
 
Algorithmic Game Theory
Algorithmic Game TheoryAlgorithmic Game Theory
Algorithmic Game TheoryKarel Ha
 
Summer Student Programme
Summer Student ProgrammeSummer Student Programme
Summer Student ProgrammeKarel Ha
 
Summer @CERN
Summer @CERNSummer @CERN
Summer @CERNKarel Ha
 
Tape Storage and CRC Protection
Tape Storage and CRC ProtectionTape Storage and CRC Protection
Tape Storage and CRC ProtectionKarel Ha
 
Question Answering with Subgraph Embeddings
Question Answering with Subgraph EmbeddingsQuestion Answering with Subgraph Embeddings
Question Answering with Subgraph EmbeddingsKarel Ha
 

More from Karel Ha (17)

transcript-master-studies-Karel-Ha
transcript-master-studies-Karel-Hatranscript-master-studies-Karel-Ha
transcript-master-studies-Karel-Ha
 
Schrodinger poster 2020
Schrodinger poster 2020Schrodinger poster 2020
Schrodinger poster 2020
 
CapsuleGAN: Generative Adversarial Capsule Network
CapsuleGAN: Generative Adversarial Capsule NetworkCapsuleGAN: Generative Adversarial Capsule Network
CapsuleGAN: Generative Adversarial Capsule Network
 
Dynamic Routing Between Capsules
Dynamic Routing Between CapsulesDynamic Routing Between Capsules
Dynamic Routing Between Capsules
 
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
 
transcript-bachelor-studies-Karel-Ha
transcript-bachelor-studies-Karel-Hatranscript-bachelor-studies-Karel-Ha
transcript-bachelor-studies-Karel-Ha
 
Mastering the game of Go with deep neural networks and tree search: Presentation
Mastering the game of Go with deep neural networks and tree search: PresentationMastering the game of Go with deep neural networks and tree search: Presentation
Mastering the game of Go with deep neural networks and tree search: Presentation
 
Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015
 
Separation Axioms
Separation AxiomsSeparation Axioms
Separation Axioms
 
Oddělovací axiomy v bezbodové topologii
Oddělovací axiomy v bezbodové topologiiOddělovací axiomy v bezbodové topologii
Oddělovací axiomy v bezbodové topologii
 
Algorithmic Game Theory
Algorithmic Game TheoryAlgorithmic Game Theory
Algorithmic Game Theory
 
Summer Student Programme
Summer Student ProgrammeSummer Student Programme
Summer Student Programme
 
Summer @CERN
Summer @CERNSummer @CERN
Summer @CERN
 
Tape Storage and CRC Protection
Tape Storage and CRC ProtectionTape Storage and CRC Protection
Tape Storage and CRC Protection
 
Question Answering with Subgraph Embeddings
Question Answering with Subgraph EmbeddingsQuestion Answering with Subgraph Embeddings
Question Answering with Subgraph Embeddings
 

Recently uploaded

module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 

Recently uploaded (20)

module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 

AlphaZero

  • 1. AlphaZero Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm Karel Ha article by Google DeepMind AI Seminar, 19th December 2017
  • 2. Outline The Alpha* Timeline AlphaGo AlphaGo Zero (AG0) AlphaZero Conclusion 1
  • 5. Fan Hui ■ professional 2 dan https://en.wikipedia.org/wiki/Fan_Hui 2
  • 6. Fan Hui ■ professional 2 dan ■ European Go Champion in 2013, 2014 and 2015 https://en.wikipedia.org/wiki/Fan_Hui 2
  • 7. Fan Hui ■ professional 2 dan ■ European Go Champion in 2013, 2014 and 2015 ■ European Professional Go Champion in 2016 https://en.wikipedia.org/wiki/Fan_Hui 2
  • 8. AlphaGo (AlphaGo Fan) vs. Fan Hui 3
  • 9. AlphaGo (AlphaGo Fan) vs. Fan Hui AlphaGo won 5:0 in a formal match on October 2015. 3
  • 10. AlphaGo (AlphaGo Fan) vs. Fan Hui AlphaGo won 5:0 in a formal match on October 2015. [AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person. Fan Hui 3
  • 11. Lee Sedol “The Strong Stone” https://en.wikipedia.org/wiki/Lee_Sedol 4
  • 12. Lee Sedol “The Strong Stone” ■ professional 9 dan https://en.wikipedia.org/wiki/Lee_Sedol 4
  • 13. Lee Sedol “The Strong Stone” ■ professional 9 dan ■ the 2nd in international titles https://en.wikipedia.org/wiki/Lee_Sedol 4
  • 14. Lee Sedol “The Strong Stone” ■ professional 9 dan ■ the 2nd in international titles ■ the 5th youngest to become a professional Go player in South Korean history https://en.wikipedia.org/wiki/Lee_Sedol 4
  • 15. Lee Sedol “The Strong Stone” ■ professional 9 dan ■ the 2nd in international titles ■ the 5th youngest to become a professional Go player in South Korean history ■ Lee Sedol would win 97 out of 100 games against Fan Hui. https://en.wikipedia.org/wiki/Lee_Sedol 4
  • 16. Lee Sedol “The Strong Stone” ■ professional 9 dan ■ the 2nd in international titles ■ the 5th youngest to become a professional Go player in South Korean history ■ Lee Sedol would win 97 out of 100 games against Fan Hui. ■ “Roger Federer” of Go https://en.wikipedia.org/wiki/Lee_Sedol 4
  • 17. I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time. Lee Sedol 4
  • 18. I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time. Lee Sedol ...even beating AlphaGo by 4:1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind. interview in JTBC Newsroom 4
  • 19. I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time. Lee Sedol ...even beating AlphaGo by 4:1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind. interview in JTBC Newsroom 4
  • 20. AlphaGo (AlphaGo Lee) vs. Lee Sedol https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 5
  • 21. AlphaGo (AlphaGo Lee) vs. Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 5
  • 22. AlphaGo (AlphaGo Lee) vs. Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. AlphaGo won all but the 4th game; all games were won by resignation. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 5
  • 23. AlphaGo (AlphaGo Lee) vs. Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. AlphaGo won all but the 4th game; all games were won by resignation. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 5
  • 25. AlphaGo Master In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and ”Magister”. https://deepmind.com/research/alphago/match-archive/master/ 6
  • 26. AlphaGo Master In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and ”Magister”. This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016. https://deepmind.com/research/alphago/match-archive/master/ 6
  • 27. AlphaGo Master In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and ”Magister”. This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016. Over one week, AlphaGo played 60 online fast time-control games. https://deepmind.com/research/alphago/match-archive/master/ 6
  • 28. AlphaGo Master In January 2017, DeepMind revealed that AlphaGo had played a series of unofficial online games against some of the strongest professional Go players under the pseudonyms “Master” and ”Magister”. This AlphaGo was an improved version of the AlphaGo that played Lee Sedol in 2016. Over one week, AlphaGo played 60 online fast time-control games. AlphaGo won this series of games 60:0. https://deepmind.com/research/alphago/match-archive/master/ 6
  • 31. ■ 23 May - 27 May 2017 in Wuzhen, China https://events.google.com/alphago2017/ 7
  • 32. ■ 23 May - 27 May 2017 in Wuzhen, China ■ Team Go vs. AlphaGo 0:1 https://events.google.com/alphago2017/ 7
  • 33. ■ 23 May - 27 May 2017 in Wuzhen, China ■ Team Go vs. AlphaGo 0:1 ■ AlphaGo vs. world champion Ke Jie 3:0 https://events.google.com/alphago2017/ 7
  • 34. 7
  • 35. defeated AlphaGo Lee by 100 games to 0 7
  • 36. AI system that mastered chess, Shogi and Go to “superhuman levels” within a handful of hours AlphaZero 8
  • 37. AI system that mastered chess, Shogi and Go to “superhuman levels” within a handful of hours AlphaZero defeated AlphaGo Zero (version with 20 blocks trained for 3 days) by 60 games to 40 8
  • 38. 8
  • 40. 1 AlphaGo Fan 2 AlphaGo Lee 8
  • 41. 1 AlphaGo Fan 2 AlphaGo Lee 3 AlphaGo Master 8
  • 42. 1 AlphaGo Fan 2 AlphaGo Lee 3 AlphaGo Master 4 AlphaGo Zero 8
  • 43. 1 AlphaGo Fan 2 AlphaGo Lee 3 AlphaGo Master 4 AlphaGo Zero 5 AlphaZero 8
  • 45. Policy and Value Networks [Silver et al. 2016] 9
  • 46. Training the (Deep Convolutional) Neural Networks [Silver et al. 2016] 10
  • 54. AG0: Differences Compared to AlphaGo {Fan, Lee, Master} AlphaGo {Fan, Lee, Master} × AlphaGo Zero: ■ supervised learning from human expert positions × learning from scratch by self-play reinforcement learning (“tabula rasa”) ■ additional (auxiliary) input features × only the black and white stones from the board as input features ■ separate policy and value networks × a single neural network ■ tree search that also uses Monte Carlo rollouts × a simpler tree search using only the single neural network both to evaluate positions and to sample moves ■ (AlphaGo Lee) distributed machines + 48 tensor processing units (TPUs) × a single machine + 4 TPUs ■ (AlphaGo Lee) several months of training time × 72 h of training time (outperforming AlphaGo Lee after 36 h) [Silver et al. 2017b] 11
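To make the "only the black and white stones from the board as input features" bullet concrete: the raw board can be encoded as a stack of binary feature planes (the AlphaGo Zero paper uses the last 8 positions for each colour plus one colour-to-play plane, i.e. 17 planes of 19 × 19). A minimal sketch in Python; the function name and argument layout are illustrative, not the paper's code:

import numpy as np

BOARD = 19      # Go board size
HISTORY = 8     # past positions kept per colour

def encode_state(own_history, opp_history, black_to_play):
    """Stack 17 binary planes of 19x19: 8 for the player to move,
    8 for the opponent, and 1 constant colour-to-play plane.
    own_history / opp_history: the last HISTORY boards, each a 19x19
    array of 0/1 occupancy, most recent first."""
    planes = [np.asarray(b, dtype=np.float32) for b in own_history[:HISTORY]]
    planes += [np.asarray(b, dtype=np.float32) for b in opp_history[:HISTORY]]
    colour = np.full((BOARD, BOARD), 1.0 if black_to_play else 0.0, dtype=np.float32)
    planes.append(colour)
    return np.stack(planes)   # shape (17, 19, 19): the network input s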
  • 57. AG0 achieves this via ■ a new reinforcement learning algorithm ■ with lookahead search inside the training loop [Silver et al. 2017b] 12
  • 58. AG0: Self-Play Reinforcement Learning [Silver et al. 2017b] 13
  • 69. AG0: Self-Play Reinforcement Learning – Neural Network deep neural network f_θ with parameters θ: ■ input: raw board representation s ■ output: ■ move probabilities p ■ value v of the board position ■ f_θ(s) = (p, v) ■ specifics: ■ (20 or 40) residual blocks (of convolutional layers) ■ batch normalization ■ rectifier non-linearities [Silver et al. 2017b] 14
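A minimal sketch of such a dual-head network in Python (PyTorch), assuming the 17-plane input sketched earlier. It is illustrative only: the heads are simplified (the paper adds 1 × 1 convolutions before them) and the real model stacks 20 or 40 residual blocks:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))   # convolution + batch norm + rectifier
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                  # skip connection

class DualHeadNet(nn.Module):
    """f_theta(s) = (p, v): policy logits and a scalar value in [-1, 1]."""
    def __init__(self, planes=17, channels=256, blocks=20, moves=19 * 19 + 1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        self.policy = nn.Sequential(nn.Flatten(), nn.Linear(channels * 19 * 19, moves))
        self.value = nn.Sequential(nn.Flatten(), nn.Linear(channels * 19 * 19, 1), nn.Tanh())

    def forward(self, s):
        h = self.tower(self.stem(s))
        return self.policy(h), self.value(h)  # move logits p (361 points + pass), value v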
  • 70. AG0: Comparison of Various Neural Network Architectures [Silver et al. 2017b] 15
  • 78. AG0: Self-Play Reinforcement Learning – Steps 0. random weights θ_0 1. at each iteration i > 0, self-play games are generated: i. MCTS samples search probabilities π_t based on the neural network from the previous iteration f_{θ_{i−1}}: π_t = α_{θ_{i−1}}(s_t) for each time-step t = 1, 2, ..., T ii. a move is sampled from π_t iii. the data (s_t, π_t, z_t) for each t are stored for later training iv. a new neural network f_{θ_i} is trained to minimize the loss l = (z − v)^2 − π^T log p + c||θ||^2 The loss l makes (p, v) = f_θ(s) more closely match the improved search probabilities and self-play winner (π, z). [Silver et al. 2017b] 16
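A sketch of that loss in the same PyTorch setting (the DualHeadNet above is an assumed stand-in for f_θ; in practice the c||θ||^2 term is usually handled by the optimiser's weight decay rather than added explicitly):

import torch

def alphazero_loss(net, states, search_pis, outcomes, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch."""
    policy_logits, values = net(states)                        # p (as logits) and v
    value_loss = ((outcomes - values.squeeze(-1)) ** 2).mean() # (z - v)^2
    log_p = torch.log_softmax(policy_logits, dim=1)
    policy_loss = -(search_pis * log_p).sum(dim=1).mean()      # cross-entropy with pi
    l2 = sum((w ** 2).sum() for w in net.parameters())         # ||theta||^2
    return value_loss + policy_loss + c * l2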
  • 79. AG0: Monte Carlo Tree Search (1/2) Monte Carlo Tree Search (MCTS) in AG0: [Silver et al. 2017b] 17
  • 81. AG0: Monte Carlo Tree Search (2/2) [Silver et al. 2017b] 18
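The two slides above show the search tree itself; at each node the move is chosen by the PUCT rule from the paper, a = argmax_a (Q(s, a) + U(s, a)) with U(s, a) = c_puct · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a)), where P is the network prior, N the visit count and Q the mean action value. A sketch, assuming a node object with a children dict and per-child visit_count, total_value and prior attributes (these names are illustrative):

import math

def select_child(node, c_puct=1.0):
    """Return the move maximizing Q(s, a) + U(s, a)."""
    total_visits = sum(ch.visit_count for ch in node.children.values())
    best_move, best_score = None, float("-inf")
    for move, ch in node.children.items():
        q = ch.total_value / ch.visit_count if ch.visit_count else 0.0    # mean value Q
        u = c_puct * ch.prior * math.sqrt(total_visits) / (1 + ch.visit_count)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move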
  • 82. AG0: Self-Play Reinforcement Learning – Review [Silver et al. 2017b] 19
  • 83. AG0: Elo Rating over Training Time (RL vs. SL) [Silver et al. 2017b] 20
  • 84. AG0: Elo Rating over Training Time (AG0 with 40 blocks) [Silver et al. 2017b] 21
  • 85. AG0: Tournament between AI Go Programs [Silver et al. 2017b] 22
  • 88. AG0: Discovered Joseki (Corner Sequences) (a) five human joseki; (b) five novel joseki variants eventually preferred by AG0 [Silver et al. 2017b] 23
  • 92. AG0: Discovered Playing Styles at 3 h: greedy capture of stones; at 19 h: the fundamentals of Go concepts (life-and-death, influence, territory, ...); at 70 h: a remarkably balanced game (multiple battles, a complicated ko fight, a half-point win for white, ...) [Silver et al. 2017b] 24
  • 96. To watch such a strong programme like Stockfish, against whom most top players would be happy to win even one game out of a hundred, being completely taken apart is certainly definitive. Viswanathan Anand It’s like chess from another dimension. Demis Hassabis 24
  • 101. AlphaZero: Differences Compared to AlphaGo Zero AlphaGo Zero × AlphaZero: ■ binary outcome (win / loss) × expected outcome (including draws or potentially other outcomes) ■ board positions transformed before passing to neural networks (by randomly selected rotation or reflection) × no data augmentation ■ games generated by the best player from previous iterations (margin of 55 %) × continual update using the latest parameters (without the evaluation and selection steps) ■ hyper-parameters tuned by Bayesian optimisation × reused the same hyper-parameters without game-specific tuning [Silver et al. 2017a] 25
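The third bullet is the most visible change to the training loop: AlphaGo Zero only promoted a new network after it beat the current best player in evaluation games by a 55 % margin, while AlphaZero skips the evaluation and selection steps and always generates self-play games with the latest parameters. A rough sketch of the two variants (play_game, self_play and train_step are hypothetical placeholders, not functions from the paper; the number of evaluation games is illustrative):

def alphago_zero_iteration(best_net, candidate_net, n_games=400, margin=0.55):
    # AlphaGo Zero: evaluate the candidate against the current best player
    # and promote it only if it wins at least `margin` of the games.
    wins = sum(play_game(candidate_net, best_net) for _ in range(n_games))  # 1 per win
    return candidate_net if wins / n_games >= margin else best_net

def alphazero_iteration(net):
    # AlphaZero: no evaluation/selection step; self-play always uses the
    # latest parameters, which are updated continually.
    data = self_play(net)       # generate games with the current network
    train_step(net, data)       # gradient step on the fresh data
    return net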
  • 102. AlphaZero: Elo Rating over Training Time [Silver et al. 2017a] 26
  • 106. AlphaZero: Tournament between AI Programs (Values are given from AlphaZero’s point of view.) [Silver et al. 2017a] 27
  • 107. AlphaZero: Openings Discovered by Self-Play (1/2) [Silver et al. 2017a] 28
  • 108. AlphaZero: Openings Discovered by Self-Play (2/2) [Silver et al. 2017a] 29
  • 112. Difficulties of Go ■ challenging decision-making ■ intractable search space ■ complex optimal solution It appears infeasible to approximate the optimal solution directly with a policy or value function! [Silver et al. 2017a] 30
  • 123. AlphaZero: Summary ■ Monte Carlo tree search ■ effective move selection and position evaluation ■ through deep convolutional neural networks ■ trained by a new self-play reinforcement learning algorithm ■ a new search algorithm combining ■ evaluation by a single neural network ■ Monte Carlo tree search ■ more efficient than previous AlphaGo versions ■ a single machine ■ 4 TPUs ■ hours rather than months of training time [Silver et al. 2017a] 31
  • 131. Novel approach During the matches (against Stockfish and Elmo), AlphaZero evaluated thousands of times fewer positions than Deep Blue did against Kasparov. It compensated for this by: ■ selecting those positions more intelligently (the neural network) ■ evaluating them more precisely (the same neural network) Deep Blue relied on a handcrafted evaluation function. AlphaZero was instead trained tabula rasa from self-play, using general-purpose learning. This approach is not specific to the game of Go: the algorithm can be used for a much wider class of AI problems! [Silver et al. 2017a] 32
  • 134. Input Features of AlphaZero’s Neural Networks [Silver et al. 2017a]
  • 135. AlphaZero: Statistics of Training [Silver et al. 2017a]
  • 137. Scalability When Compared to Other Programs [Silver et al. 2017a]
  • 138. Further Reading I AlphaGo: ■ Google Research Blog http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html ■ an article in Nature http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234 ■ a reddit article claiming that AlphaGo is even stronger than it appears to be: “AlphaGo would rather win by less points, but with higher probability.” https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/ ■ a video of how AlphaGo works (put in layman’s terms) https://youtu.be/qWcfiPi9gUU Articles by Google DeepMind: ■ Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih et al. 2015) ■ Neural Turing Machines (Graves, Wayne, and Danihelka 2014) Artificial Intelligence: ■ Artificial Intelligence course at MIT http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm
  • 139. Further Reading II ■ Introduction to Artificial Intelligence at Udacity https://www.udacity.com/course/intro-to-artificial-intelligence--cs271 ■ General Game Playing course https://www.coursera.org/course/ggp ■ Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2 ■ The Singularity Is Near (Kurzweil 2005) Combinatorial Game Theory (founded by John H. Conway to study endgames in Go): ■ Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory ■ On Numbers and Games (Conway 1976) ■ Computer Go as a sum of local games: an application of combinatorial game theory (Müller 1995) Chess: ■ Deep Blue beats G. Kasparov in 1997 https://youtu.be/NJarxpYyoFI Machine Learning: ■ Machine Learning course https://www.coursera.org/learn/machine-learning/ ■ Reinforcement Learning http://reinforcementlearning.ai-depot.com/ ■ Deep Learning (LeCun, Bengio, and Hinton 2015)
  • 140. Further Reading III ■ Deep Learning course https://www.udacity.com/course/deep-learning--ud730 ■ Two Minute Papers https://www.youtube.com/user/keeroyz ■ Applications of Deep Learning https://youtu.be/hPKJBXkyTKM Neuroscience: ■ http://www.brainfacts.org/
  • 141. References I Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6. Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural Turing Machines”. In: arXiv preprint arXiv:1410.5401. Kurzweil, Ray (2005). The Singularity is Near: When Humans Transcend Biology. Penguin. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep Learning”. In: Nature 521.7553, pp. 436–444. Mnih, Volodymyr et al. (2015). “Human-Level Control through Deep Reinforcement Learning”. In: Nature 518.7540, pp. 529–533. url: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf. Müller, Martin (1995). “Computer Go as a Sum of Local Games: an Application of Combinatorial Game Theory”. PhD thesis. TU Graz. Silver, David et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587, pp. 484–489. Silver, David et al. (2017a). “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”. In: arXiv preprint arXiv:1712.01815. Silver, David et al. (2017b). “Mastering the Game of Go without Human Knowledge”. In: Nature 550.7676, pp. 354–359.