AlphaGo: Mastering the game of Go with deep neural networks and tree search
Karel Ha
article by Google DeepMind
Optimization Seminar, 20th April 2016
Percentage of Simulations
percentage frequency with which actions were selected from the root during simulations
Silver et al. 2016
Principal Variation (Path with Maximum Visit Count)
The moves are presented in a numbered sequence.
AlphaGo selected the m...
Silver et al. 2016 46
Scalability
asynchronous multi-threaded search
simulations on CPUs
computation of neural networks on GPUs
AlphaGo:
40 sear...
Silver et al. 2016 47
Elo Ratings for Various Combinations of Threads
Silver et al. 2016 48
Results: the strength of AlphaGo
Tournament with Other Go Programs
Silver et al. 2016 49
Fan Hui
professional 2 dan
European Go Champion in 2013, 2014 and 2015
European Professional Go Champion in 2016
biologica...
https://en.wikipedia.org/wiki/Fan_Hui 50
AlphaGo versus Fan Hui
AlphaGo won 5:0 in a formal match in October 2015.
[AlphaGo] is very strong and stable, it seems li...
51
Lee Sedol “The Strong Stone”
professional 9 dan
the 2nd in international titles
the 5th youngest (12 years 4 months) to be...
https://en.wikipedia.org/wiki/Lee_Sedol 52
I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this ...
AlphaGo versus Lee Sedol
In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; ...
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
Who’s next?
http://www.goratings.org/ (18th April 2016) 53
AlphaGo versus Ke Jie?
professional 9 dan
the 1st in (unofficial) world ranking list
the youngest player to win 3 major inte...
https://en.wikipedia.org/wiki/Ke_Jie 54
I believe I can beat it. Machines can be very strong in many aspects but still have loopholes in certain calculations.
Ke ...
Conclusion
Difficulties of Go
challenging decision-making
intractable search space
complex optimal solution
It appears infeasible to di...
Silver et al. 2016 55
AlphaGo: summary
Monte Carlo tree search
effective move selection and position evaluation
through deep convolutional neural...
Silver et al. 2016 56
Novel approach
During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue again...
Silver et al. 2016 57
Thank you!
Questions?
Backup Slides
Input features for rollout and tree policy
Silver et al. 2016
Selection of Moves by the SL Policy Network
move probabilities taken directly from the SL policy network pσ (reported as a...
Selection of Moves by the Value Network
evaluation of all successors s′ of the root position s, using vθ(s′)
Silver et al. 2016
Tree Evaluation from Value Network
action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over val...
Tree Evaluation from Rollouts
action values Q(s, a), averaged over rollout evaluations only
Silver et al. 2016
Results of a tournament between different Go programs
Silver et al. 2016
Results of a tournament between AlphaGo and distributed AlphaGo, testing scalability with hardware
Silver et al. 2016
AlphaGo versus Fan Hui: Game 1
Silver et al. 2016
AlphaGo versus Fan Hui: Game 2
Silver et al. 2016
AlphaGo versus Fan Hui: Game 3
Silver et al. 2016
AlphaGo versus Fan Hui: Game 4
Silver et al. 2016
AlphaGo versus Fan Hui: Game 5
Silver et al. 2016
AlphaGo versus Lee Sedol: Game 1
https://youtu.be/vFr3K2DORc8
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 2 (1/2)
https://youtu.be/l-GsfyVCBu0
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 2 (2/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 3
https://youtu.be/qUAmTYHEyM8
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 4
https://youtu.be/yCALyQRN3hw
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 5 (1/2)
https://youtu.be/mzpW10DPHeQ
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 5 (2/2)
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
Further Reading I
AlphaGo:
Google Research Blog
http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-o...
Further Reading II
Introduction to Artificial Intelligence at Udacity
https://www.udacity.com/course/intro-to-artificial-in...
Further Reading III
Deep Learning course https://www.udacity.com/course/deep-learning--ud730
Two Minute Papers https://www...
References I
Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looi...
References II
Kurzweil, Ray (2005). The singularity is near: When humans transcend biology. Penguin.
LeCun, Yann, Yoshua B...
the presentation of the article "Mastering the game of Go with deep neural networks and tree search" given at the Optimization Seminar 2015/2016

Notes:
- All URLs are clickable.

- All citations are clickable (when hovered over the "year" part of "[author year]").

- To download without a SlideShare account, use https://www.dropbox.com/s/p4rnlhoewbedkjg/AlphaGo.pdf?dl=0

- The corresponding leaflet is available at http://www.slideshare.net/KarelHa1/leaflet-for-the-talk-on-alphago

- The source code is available at https://github.com/mathemage/AlphaGo-presentation

AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search

  1. 1. AlphaGo: Mastering the game of Go with deep neural networks and tree search Karel Ha article by Google DeepMind Optimization Seminar, 20th April 2016
  2. 2. Why AI?
  3. 3. Applications of AI spam filters 1
  4. 4. Applications of AI spam filters recommender systems (Netflix, YouTube) 1
  5. 5. Applications of AI spam filters recommender systems (Netflix, YouTube) predictive text (Swiftkey) 1
  6. 6. Applications of AI spam filters recommender systems (Netflix, YouTube) predictive text (Swiftkey) audio recognition (Shazam, SoundHound) 1
  7. 7. Applications of AI spam filters recommender systems (Netflix, YouTube) predictive text (Swiftkey) audio recognition (Shazam, SoundHound) self-driving cars 1
  8. 8. Artistic-Style Painting (1/2) [1] Gatys, Ecker, and Bethge 2015 [2] Li and Wand 2016 2
  10. 10. Artistic-Style Painting (2/2) Champandard 2016 3
  11. 11. C Code Generated Character by Character Karpathy 2015 4
  12. 12. Algebraic Geometry Generated Character by Character Karpathy 2015 5
  13. 13. Game of Thrones Generated Character by Character http://pjreddie.com/darknet/rnns-in-darknet/ 5
  14. 14. Game of Thrones Generated Character by Character JON He leaned close and onions, barefoot from his shoulder. “I am not a purple girl,” he said as he stood over him. “The sight of you sell your father with you a little choice.” “I say to swear up his sea or a boy of stone and heart, down,” Lord Tywin said. “I love your word or her to me.” Darknet (on Linux) http://pjreddie.com/darknet/rnns-in-darknet/ 5
  15. 15. Game of Thrones Generated Character by Character JON He leaned close and onions, barefoot from his shoulder. “I am not a purple girl,” he said as he stood over him. “The sight of you sell your father with you a little choice.” “I say to swear up his sea or a boy of stone and heart, down,” Lord Tywin said. “I love your word or her to me.” Darknet (on Linux) JON Each in days and the woods followed his king. “I understand.” “I am not your sister Lord Robert?” “The door was always some cellar to do his being girls and the Magnar of Baratheon, and there were thousands of every bite of half the same as though he was not a great knight should be seen, and not to look at the Redwyne two thousand men.” Darknet (on OS X) http://pjreddie.com/darknet/rnns-in-darknet/ 5
  16. 16. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches Hayes 2016 6
  17. 17. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever. Hayes 2016 6
  18. 18. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever. The biggest risk to the world, is me, believe it or not. Hayes 2016 6
  19. 19. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever. The biggest risk to the world, is me, believe it or not. I am what ISIS doesn’t need. Hayes 2016 6
  20. 20. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever. The biggest risk to the world, is me, believe it or not. I am what ISIS doesn’t need. I’d like to beat that @HillaryClinton. She is a horror. I told my supporter Putin to say that all the time. He has been amazing. Hayes 2016 6
  21. 21. DeepDrumpf: a Twitter bot / neural network which learned the language of Donald Trump from his speeches We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever. The biggest risk to the world, is me, believe it or not. I am what ISIS doesn’t need. I’d like to beat that @HillaryClinton. She is a horror. I told my supporter Putin to say that all the time. He has been amazing. I buy Hillary, it’s beautiful and I’m happy about it. Hayes 2016 6
  22. 22. Atari Player by Google DeepMind https://youtu.be/0X-NdPtFKq0?t=21m13s Mnih et al. 2015 7
  23. 23. https://xkcd.com/1002/ 7
  24. 24. Heads-up Limit Hold'em Poker Is Solved! Bowling et al. 2015 8
  25. 25. Heads-up Limit Hold'em Poker Is Solved! Cepheus http://poker.srv.ualberta.ca/ 0.000986 big blinds per game in expectation Bowling et al. 2015 8
  26. 26. Basics of Machine Learning
  27. 27. https://dataaspirant.com/2014/09/19/supervised-and-unsupervised-learning/ 8
  28. 28. Supervised Learning (SL) http://www.nickgillian.com/ 9
  29. 29. Supervised Learning (SL) 1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server... http://www.nickgillian.com/ 9
  30. 30. Supervised Learning (SL) 1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server... 2. training on training set http://www.nickgillian.com/ 9
  31. 31. Supervised Learning (SL) 1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server... 2. training on training set 3. testing on testing set http://www.nickgillian.com/ 9
  32. 32. Supervised Learning (SL) 1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server... 2. training on training set 3. testing on testing set 4. deployment http://www.nickgillian.com/ 9
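As a concrete illustration of the four steps above (collect, train, test, deploy), here is a minimal scikit-learn sketch; the dataset, model choice and split ratio are illustrative assumptions, not anything used in the talk:

```python
# Minimal sketch of the supervised-learning workflow from the slide
# (data -> train -> test -> deploy); dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. data collection (here: a toy built-in dataset)
X, y = load_iris(return_X_y=True)

# 2. training on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. testing on the testing set
print("test accuracy:", model.score(X_test, y_test))

# 4. deployment: use the fitted model on new, unseen inputs
print("prediction:", model.predict(X_test[:1]))
```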
  35. 35. Regression 9
  37. 37. Mathematical Regression https://thermanuals.wordpress.com/descriptive-analysis/sampling-and-regression/ 10
  38. 38. Classification https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png 11
  39. 39. Underfitting and Overfitting https://www.researchgate.net/post/How_to_Avoid_Overfitting 12
  40. 40. Underfitting and Overfitting Beware of overfitting! https://www.researchgate.net/post/How_to_Avoid_Overfitting 12
  41. 41. Underfitting and Overfitting Beware of overfitting! It is like learning for a mathematical exam by memorizing proofs. https://www.researchgate.net/post/How_to_Avoid_Overfitting 12
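A small NumPy-only sketch of the underfitting/overfitting trade-off the slide warns about; the sine curve, noise level and polynomial degrees are arbitrary illustrative choices:

```python
# Sketch of underfitting vs. overfitting: fit polynomials of increasing degree
# to noisy samples of a sine curve and compare training / held-out error.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

x_train, y_train = x[::2], y[::2]      # every other point for training
x_test,  y_test  = x[1::2], y[1::2]    # the rest held out

for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
# A very low degree underfits (both errors stay high); a very high degree
# typically drives the training error far below the held-out error --
# "memorizing the proofs" instead of learning the subject.
```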
  42. 42. Reinforcement Learning (RL) https://youtu.be/0X-NdPtFKq0?t=16m57s 13
  43. 43. Reinforcement Learning (RL) Specially: games of self-play https://youtu.be/0X-NdPtFKq0?t=16m57s 13
  44. 44. Monte Carlo Tree Search
  45. 45. Tree Search Optimal value v∗(s) determines the outcome of the game: Silver et al. 2016 14
  46. 46. Tree Search Optimal value v∗(s) determines the outcome of the game: from every board position or state s Silver et al. 2016 14
  47. 47. Tree Search Optimal value v∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. Silver et al. 2016 14
  49. 49. Tree Search Optimal value v∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves, where Silver et al. 2016 14
  50. 50. Tree Search Optimal value v∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (number of legal moves per position) Silver et al. 2016 14
  51. 51. Tree Search Optimal value v∗(s) determines the outcome of the game: from every board position or state s under perfect play by all players. It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (number of legal moves per position) d is its depth (game length) Silver et al. 2016 14
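A back-of-the-envelope check of these numbers; the ~10^80 atoms-in-the-observable-universe figure is the commonly quoted order of magnitude, not something stated on this slide:

```python
# Back-of-the-envelope game-tree sizes b**d for the numbers on the slide
# (chess: b ~ 35, d ~ 80; Go: b ~ 250, d ~ 150).
from math import log10

for game, b, d in (("chess", 35, 80), ("Go", 250, 150)):
    print(f"{game}: b^d = {b}^{d} ~ 10^{d * log10(b):.0f} move sequences")
# chess: ~10^124, Go: ~10^360 -- far beyond the ~10^80 atoms in the
# observable universe, so exhaustive minimax search is hopeless.
```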
  52. 52. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 Allis et al. 1994 15
  53. 53. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! Allis et al. 1994 15
  54. 54. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol [10^100] times more complex than chess. https://deepmind.com/alpha-go.html Allis et al. 1994 15
  55. 55. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol [10^100] times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? Allis et al. 1994 15
  56. 56. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol [10^100] times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves Allis et al. 1994 15
  57. 57. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol [10^100] times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves for the depth: a neural network to evaluate the current position Allis et al. 1994 15
  58. 58. Game tree of Go Sizes of trees for various games: chess: b ≈ 35, d ≈ 80 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe! That makes Go a googol [10^100] times more complex than chess. https://deepmind.com/alpha-go.html How to handle the size of the game tree? for the breadth: a neural network to select moves for the depth: a neural network to evaluate the current position for the tree traversal: Monte Carlo tree search (MCTS) Allis et al. 1994 15
  59. 59. Monte Carlo tree search 16
  60. 60. Neural networks
  61. 61. Neural Networks (NN): Inspiration http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 17
  62. 62. Neural Networks (NN): Inspiration inspired by the neuronal structure of the mammalian cerebral cortex http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 17
  63. 63. Neural Networks (NN): Inspiration inspired by the neuronal structure of the mammalian cerebral cortex but on much smaller scales http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 17
  64. 64. Neural Networks (NN): Inspiration inspired by the neuronal structure of the mammalian cerebral cortex but on much smaller scales suitable to model systems with a high tolerance to error http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 17
  65. 65. Neural Networks (NN): Inspiration inspired by the neuronal structure of the mammalian cerebral cortex but on much smaller scales suitable to model systems with a high tolerance to error e.g. audio or image recognition http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 17
  66. 66. Neural Networks: Modes Dieterle 2003 18
  67. 67. Neural Networks: Modes Two modes Dieterle 2003 18
  68. 68. Neural Networks: Modes Two modes feedforward for making predictions Dieterle 2003 18
  69. 69. Neural Networks: Modes Two modes feedforward for making predictions backpropagation for learning Dieterle 2003 18
  70. 70. Neural Networks: an Example of Feedforward http://stevenmiller888.github.io/mind-how-to-build-a-neural-network/ 19
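A minimal NumPy sketch of a feedforward pass through a small two-layer network; the layer sizes, sigmoid activation and random weights are illustrative assumptions:

```python
# Minimal feed-forward pass of a 2-layer network in NumPy: inputs flow
# through a hidden layer and an output layer (weights here are random
# placeholders; a trained network would have learned them).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden layer: 2 inputs -> 3 units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer: 3 units -> 1 output

x = np.array([1.0, 0.0])            # an example input
h = sigmoid(W1 @ x + b1)            # hidden activations
y = sigmoid(W2 @ h + b2)            # network prediction
print("prediction:", y)
```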
  71. 71. Gradient Descent in Neural Networks http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 20
  72. 72. Gradient Descent in Neural Networks Motto: “Learn by mistakes!” http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 20
  73. 73. Gradient Descent in Neural Networks Motto: “Learn by mistakes!” However, error functions are not necessarily convex or so “smooth”. http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 20
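A toy gradient-descent loop on a deliberately smooth, convex error function; as the slide warns, real neural-network error surfaces are neither, so this is only the idealized picture:

```python
# "Learn by mistakes": plain gradient descent on a simple error function.
def error(w):                 # a toy convex error function with minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):              # its derivative
    return 2.0 * (w - 3.0)

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # move against the gradient
print(f"w after descent: {w:.4f}, error: {error(w):.6f}")   # w -> 3
```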
  74. 74. http://xkcd.com/1425/ 20
  75. 75. Convolutional Neural Networks (CNN or ConvNet) http://code.flickr.net/2014/10/20/introducing-flickr-park-or-bird/ 21
  76. 76. (Deep) Convolutional Neural Networks The hierarchy of concepts is captured in the number of layers: the deep in “Deep Learning”. http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 22
  78. 78. Rules of Go
  79. 79. Backgammon: Man vs. Fate 22
  80. 80. Backgammon: Man vs. Fate Chess: Man vs. Man 22
  81. 81. Go: Man vs. Self Robert Šámal (White) versus Karel Král (Black), Spring School of Combinatorics 2016 22
  82. 82. Rules of Go 23
  83. 83. Rules of Go Black versus White. Black starts the game. 23
  85. 85. Rules of Go Black versus White. Black starts the game. the rule of liberty 23
  86. 86. Rules of Go Black versus White. Black starts the game. the rule of liberty the “ko” rule 23
  87. 87. Rules of Go Black versus White. Black starts the game. the rule of liberty the “ko” rule Handicap for difference in ranks: Black can place 1 or more stones in advance (compensation for White’s greater strength). 23
  88. 88. Scoring Rules: Area Scoring https://en.wikipedia.org/wiki/Go_(game) 24
  89. 89. Scoring Rules: Area Scoring A player’s score is: the number of stones that the player has on the board https://en.wikipedia.org/wiki/Go_(game) 24
  90. 90. Scoring Rules: Area Scoring A player’s score is: the number of stones that the player has on the board plus the number of empty intersections surrounded by that player’s stones https://en.wikipedia.org/wiki/Go_(game) 24
  91. 91. Scoring Rules: Area Scoring A player’s score is: the number of stones that the player has on the board plus the number of empty intersections surrounded by that player’s stones plus komi(dashi) points for the White player which is a compensation for the first move advantage of the Black player https://en.wikipedia.org/wiki/Go_(game) 24
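A sketch of area scoring on a tiny board, assuming a simple list-of-lists encoding and a komi of 6.5; both the board and the komi value are illustrative assumptions, not from the talk:

```python
# Sketch of area scoring: a player's score is their stones on the board plus
# empty regions surrounded only by their stones, plus komi for White.
EMPTY, BLACK, WHITE = ".", "B", "W"

def area_score(board, komi=6.5):
    n = len(board)
    def neighbours(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < n and 0 <= c + dc < n:
                yield r + dr, c + dc

    score = {BLACK: 0.0, WHITE: komi}           # komi compensates White
    seen = set()
    for r in range(n):
        for c in range(n):
            colour = board[r][c]
            if colour in (BLACK, WHITE):
                score[colour] += 1              # stones on the board
            elif (r, c) not in seen:
                # flood-fill this empty region and record the bordering colours
                region, borders, stack = set(), set(), [(r, c)]
                while stack:
                    p = stack.pop()
                    if p in region:
                        continue
                    region.add(p)
                    for q in neighbours(*p):
                        if board[q[0]][q[1]] == EMPTY:
                            stack.append(q)
                        else:
                            borders.add(board[q[0]][q[1]])
                seen |= region
                if len(borders) == 1:           # territory of exactly one colour
                    score[borders.pop()] += len(region)
    return score

board = ["B B . . W".split(), "B . . W W".split(),
         ". B . W .".split(), "B B . W W".split(), ". B . . W".split()]
print(area_score(board))    # e.g. {'B': 9.0, 'W': 14.5} on this toy position
```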
  92. 92. Ranks of Players Kyu and Dan ranks https://en.wikipedia.org/wiki/Go_(game) 25
  93. 93. Ranks of Players Kyu and Dan ranks or alternatively, Elo ratings https://en.wikipedia.org/wiki/Go_(game) 25
  94. 94. Chocolate micro-break 25
  95. 95. AlphaGo: Inside Out
  96. 96. Policy and Value Networks Silver et al. 2016 26
  97. 97. Training the (Deep Convolutional) Neural Networks Silver et al. 2016 27
  98. 98. SL Policy Network (1/2) 13-layer deep convolutional neural network Silver et al. 2016 28
  99. 99. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves Silver et al. 2016 28
  100. 100. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification Silver et al. 2016 28
  101. 101. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification trained from 30 million positions from the KGS Go Server Silver et al. 2016 28
  102. 102. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification trained from 30 million positions from the KGS Go Server stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s)/∂σ (to maximize the likelihood of the human move a selected in state s) Silver et al. 2016 28
  104. 104. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification trained from 30 million positions from the KGS Go Server stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s)/∂σ (to maximize the likelihood of the human move a selected in state s) Results: Silver et al. 2016 28
  105. 105. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification trained from 30 million positions from the KGS Go Server stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s)/∂σ (to maximize the likelihood of the human move a selected in state s) Results: 44.4% accuracy (the state-of-the-art from other groups) Silver et al. 2016 28
  106. 106. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification trained from 30 million positions from the KGS Go Server stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s)/∂σ (to maximize the likelihood of the human move a selected in state s) Results: 44.4% accuracy (the state-of-the-art from other groups) 55.7% accuracy (raw board position + move history as input) Silver et al. 2016 28
  107. 107. SL Policy Network (1/2) 13-layer deep convolutional neural network goal: to predict expert human moves task of classification trained from 30 million positions from the KGS Go Server stochastic gradient ascent: ∆σ ∝ ∂ log pσ(a|s)/∂σ (to maximize the likelihood of the human move a selected in state s) Results: 44.4% accuracy (the state-of-the-art from other groups) 55.7% accuracy (raw board position + move history as input) 57.0% accuracy (all input features) Silver et al. 2016 28
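A sketch of one gradient-ascent step ∆σ ∝ ∂ log pσ(a|s)/∂σ on a human move; a linear softmax model over a toy feature vector stands in for the real 13-layer CNN, and all sizes and the learning rate are illustrative assumptions:

```python
# One stochastic-gradient-ascent step on log p_sigma(a|s) for an expert move a
# in position s, with a linear softmax "policy" as a stand-in network.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_moves = 8, 5                 # toy sizes, not the real 19x19 board
W = rng.normal(scale=0.1, size=(n_moves, n_features))   # the weights sigma

def policy(state):                          # p_sigma(.|s): softmax over moves
    logits = W @ state
    e = np.exp(logits - logits.max())
    return e / e.sum()

state = rng.normal(size=n_features)         # features of a position s
human_move = 2                              # the expert move a from the dataset

p = policy(state)
# gradient of log p_sigma(a|s) w.r.t. W for a softmax-linear model
grad = np.outer(np.eye(n_moves)[human_move] - p, state)
W += 0.05 * grad                            # ascent: make the human move more likely
print("p(a|s) before/after:", p[human_move], policy(state)[human_move])
```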
  108. 108. SL Policy Network (2/2) Small improvements in accuracy led to large improvements in playing strength Silver et al. 2016 29
  109. 109. Training the (Deep Convolutional) Neural Networks Silver et al. 2016 30
  110. 110. Rollout Policy Rollout policy pπ(a|s) is faster but less accurate than SL policy network. Silver et al. 2016 31
  111. 111. Rollout Policy Rollout policy pπ(a|s) is faster but less accurate than SL policy network. accuracy of 24.2% Silver et al. 2016 31
  112. 112. Rollout Policy Rollout policy pπ(a|s) is faster but less accurate than SL policy network. accuracy of 24.2% It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network. Silver et al. 2016 31
  113. 113. Training the (Deep Convolutional) Neural Networks Silver et al. 2016 32
  114. 114. RL Policy Network (1/2) identical in structure to the SL policy network Silver et al. 2016 33
  115. 115. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play Silver et al. 2016 33
  116. 116. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play task of classification Silver et al. 2016 33
  117. 117. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play task of classification weights ρ initialized to the same values, ρ := σ Silver et al. 2016 33
  118. 118. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play task of classification weights ρ initialized to the same values, ρ := σ games of self-play Silver et al. 2016 33
  119. 119. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play task of classification weights ρ initialized to the same values, ρ := σ games of self-play between the current RL policy network and a randomly selected previous iteration Silver et al. 2016 33
  120. 120. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play task of classification weights ρ initialized to the same values, ρ := σ games of self-play between the current RL policy network and a randomly selected previous iteration to prevent overfitting to the current policy Silver et al. 2016 33
  121. 121. RL Policy Network (1/2) identical in structure to the SL policy network goal: to win in the games of self-play task of classification weights ρ initialized to the same values, ρ := σ games of self-play between the current RL policy network and a randomly selected previous iteration to prevent overfitting to the current policy stochastic gradient ascent: ∆ρ ∝ ∂ log pρ(at|st)/∂ρ · zt at time step t, where the reward function zt is +1 for winning and −1 for losing. Silver et al. 2016 33
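A sketch of the corresponding REINFORCE-style update ∆ρ ∝ ∂ log pρ(at|st)/∂ρ · zt: the same log-likelihood gradient as above, but scaled by the game outcome so moves from won self-play games are reinforced. The toy linear softmax policy is again only an illustrative stand-in:

```python
# One self-play update of the RL policy: log-likelihood gradient weighted by z.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_moves = 8, 5
rho = rng.normal(scale=0.1, size=(n_moves, n_features))   # weights rho (init from sigma)

def policy(weights, state):
    logits = weights @ state
    e = np.exp(logits - logits.max())
    return e / e.sum()

# one (state, action, outcome) triple from a finished self-play game
state = rng.normal(size=n_features)
action = int(rng.choice(n_moves, p=policy(rho, state)))   # move sampled from p_rho
z = +1.0                                                  # the game was won

grad_log_p = np.outer(np.eye(n_moves)[action] - policy(rho, state), state)
rho += 0.05 * z * grad_log_p        # gradient ascent weighted by the reward z
print("p(a|s) after update:", policy(rho, state)[action])
```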
  124. 124. RL Policy Network (2/2) Results (by sampling each move at ∼ pρ(·|st)): Silver et al. 2016 34
  125. 125. RL Policy Network (2/2) Results (by sampling each move at ∼ pρ(·|st)): 80% win rate against the SL policy network Silver et al. 2016 34
  126. 126. RL Policy Network (2/2) Results (by sampling each move at ∼ pρ(·|st)): 80% win rate against the SL policy network 85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011) Silver et al. 2016 34
  127. 127. RL Policy Network (2/2) Results (by sampling each move at ∼ pρ(·|st)): 80% win rate against the SL policy network 85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011) The previous state-of-the-art, based only on SL of CNN: 11% “win” rate against Pachi Silver et al. 2016 34
  128. 128. Training the (Deep Convolutional) Neural Networks Silver et al. 2016 35
  129. 129. Value Network (1/2) similar architecture to the policy network, but outputs a single prediction instead of a probability distribution Silver et al. 2016 36
  130. 130. Value Network (1/2) similar architecture to the policy network, but outputs a single prediction instead of a probability distribution goal: to estimate a value function v^p(s) = E[zt | st = s, at...T ∼ p] that predicts the outcome from position s (of games played by using policy p) Silver et al. 2016 36
  131. 131. Value Network (1/2) similar architecture to the policy network, but outputs a single prediction instead of a probability distribution goal: to estimate a value function v^p(s) = E[zt | st = s, at...T ∼ p] that predicts the outcome from position s (of games played by using policy p) Double approximation: vθ(s) ≈ v^pρ(s) ≈ v∗(s). Silver et al. 2016 36
  132. 132. Value Network (1/2) similar architecture to the policy network, but outputs a single prediction instead of a probability distribution goal: to estimate a value function v^p(s) = E[zt | st = s, at...T ∼ p] that predicts the outcome from position s (of games played by using policy p) Double approximation: vθ(s) ≈ v^pρ(s) ≈ v∗(s). task of regression Silver et al. 2016 36
  133. 133. Value Network (1/2) similar architecture to the policy network, but outputs a single prediction instead of a probability distribution goal: to estimate a value function v^p(s) = E[zt | st = s, at...T ∼ p] that predicts the outcome from position s (of games played by using policy p) Double approximation: vθ(s) ≈ v^pρ(s) ≈ v∗(s). task of regression stochastic gradient descent: ∆θ ∝ ∂vθ(s)/∂θ · (z − vθ(s)) (to minimize the mean squared error (MSE) between the predicted vθ(s) and the true z) Silver et al. 2016 36
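A sketch of the regression update ∆θ ∝ ∂vθ(s)/∂θ · (z − vθ(s)); a tanh-of-linear model and the learning rate are illustrative stand-ins for the real value network:

```python
# One stochastic-gradient-descent step shrinking (z - v_theta(s))^2.
import numpy as np

rng = np.random.default_rng(2)
n_features = 8
theta = rng.normal(scale=0.1, size=n_features)

def v(state):                          # v_theta(s) in (-1, 1)
    return np.tanh(theta @ state)

state = rng.normal(size=n_features)    # a position sampled from self-play
z = 1.0                                # its eventual outcome

pred = v(state)
grad_v = (1.0 - pred ** 2) * state     # d tanh(theta . s) / d theta
theta += 0.05 * (z - pred) * grad_v    # descent on the squared error
print("prediction before/after:", pred, v(state))
```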
  134. 134. Value Network (2/2) Beware of overfitting! Silver et al. 2016 37
  135. 135. Value Network (2/2) Beware of overfitting! Consecutive positions are strongly correlated. Silver et al. 2016 37
  136. 136. Value Network (2/2) Beware of overfitting! Consecutive positions are strongly correlated. Value network memorized the game outcomes, rather than generalizing to new positions. Silver et al. 2016 37
  137. 137. Value Network (2/2) Beware of overfitting! Consecutive positions are strongly correlated. Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game Silver et al. 2016 37
  138. 138. Value Network (2/2) Beware of overfitting! Consecutive positions are strongly correlated. Value network memorized the game outcomes, rather than generalizing to new positions. Solution: generate 30 million (new) positions, each sampled from a separate game almost the accuracy of Monte Carlo rollouts (using pρ), but 15000 times less computation! Silver et al. 2016 37
  139. 139. Evaluation Accuracy in Various Stages of a Game Move number is the number of moves that had been played in the given position. Silver et al. 2016 38
  140. 140. Evaluation Accuracy in Various Stages of a Game Move number is the number of moves that had been played in the given position. Each position evaluated by: forward pass of the value network vθ Silver et al. 2016 38
  141. 141. Evaluation Accuracy in Various Stages of a Game Move number is the number of moves that had been played in the given position. Each position evaluated by: forward pass of the value network vθ 100 rollouts, played out using the corresponding policy Silver et al. 2016 38
  142. 142. Elo Ratings for Various Combinations of Networks Silver et al. 2016 39
  143. 143. The Main Algorithm Silver et al. 2016 39
  144. 144. MCTS Algorithm The next action is selected by lookahead search, using simulation: Silver et al. 2016 40
  145. 145. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase Silver et al. 2016 40
  146. 146. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase Silver et al. 2016 40
  147. 147. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase Silver et al. 2016 40
  148. 148. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase 4. backup phase (at end of all simulations) Silver et al. 2016 40
  150. 150. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase 4. backup phase (at end of all simulations) Each edge (s, a) keeps: action value Q(s, a) Silver et al. 2016 40
  151. 151. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase 4. backup phase (at end of all simulations) Each edge (s, a) keeps: action value Q(s, a) visit count N(s, a) Silver et al. 2016 40
  152. 152. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase 4. backup phase (at end of all simulations) Each edge (s, a) keeps: action value Q(s, a) visit count N(s, a) prior probability P(s, a) (from SL policy network pσ) Silver et al. 2016 40
  153. 153. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase 4. backup phase (at end of all simulations) Each edge (s, a) keeps: action value Q(s, a) visit count N(s, a) prior probability P(s, a) (from SL policy network pσ) Silver et al. 2016 40
  154. 154. MCTS Algorithm The next action is selected by lookahead search, using simulation: 1. selection phase 2. expansion phase 3. evaluation phase 4. backup phase (at end of all simulations) Each edge (s, a) keeps: action value Q(s, a) visit count N(s, a) prior probability P(s, a) (from SL policy network pσ) The tree is traversed by simulation (descending the tree) from the root state. Silver et al. 2016 40
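To make the bookkeeping on this slide concrete, here is a minimal Python sketch of the per-edge statistics; the `Edge`/`Node` classes and the running-mean representation of Q(s, a) are illustrative choices, not the paper's data structures. The selection, expansion, and backup sketches further below reuse them.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """Statistics kept for one tree edge (s, a)."""
    prior: float              # P(s, a), taken from the SL policy network p_sigma
    visit_count: int = 0      # N(s, a)
    total_value: float = 0.0  # sum of leaf evaluations backed up through this edge

    @property
    def q(self) -> float:
        """Mean action value Q(s, a)."""
        return self.total_value / self.visit_count if self.visit_count else 0.0

@dataclass
class Node:
    """A search-tree node: one outgoing Edge per legal move."""
    edges: dict = field(default_factory=dict)  # move -> Edge
```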
  155. 155. MCTS Algorithm: Selection Silver et al. 2016 41
  156. 156. MCTS Algorithm: Selection At each time step t, an action $a_t$ is selected from state $s_t$: $a_t = \arg\max_a \big(Q(s_t, a) + u(s_t, a)\big)$ Silver et al. 2016 41
  157. 157. MCTS Algorithm: Selection At each time step t, an action $a_t$ is selected from state $s_t$: $a_t = \arg\max_a \big(Q(s_t, a) + u(s_t, a)\big)$ where the bonus $u(s_t, a) \propto \dfrac{P(s, a)}{1 + N(s, a)}$ Silver et al. 2016 41
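A minimal sketch of this selection rule, reusing the illustrative `Edge`/`Node` record above. The slide only states the proportionality of the bonus, so the constant `c` here is an assumption added for the example.

```python
import math

def select_action(node, c=5.0):
    """Pick the action maximizing Q(s, a) + u(s, a),
    with the exploration bonus u(s, a) proportional to P(s, a) / (1 + N(s, a))."""
    best_move, best_score = None, -math.inf
    for move, edge in node.edges.items():
        u = c * edge.prior / (1.0 + edge.visit_count)  # decays as the edge is visited
        score = edge.q + u
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```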
  158. 158. MCTS Algorithm: Expansion Silver et al. 2016 42
  159. 159. MCTS Algorithm: Expansion A leaf position may be expanded (just once) by the SL policy network pσ. Silver et al. 2016 42
  160. 160. MCTS Algorithm: Expansion A leaf position may be expanded (just once) by the SL policy network pσ. The output probabilities are stored as priors P(s, a) := pσ(a|s). Silver et al. 2016 42
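A sketch of the expansion step under the same illustrative data structures; `sl_policy_net` is a hypothetical callable standing in for the SL policy network, returning a move-to-probability dictionary for the given position.

```python
def expand(leaf, state, sl_policy_net):
    """Expand a leaf exactly once: store p_sigma(a|s) as priors P(s, a).

    sl_policy_net -- hypothetical callable mapping a position to a
                     dict {move: probability} from the SL policy network.
    """
    for move, prob in sl_policy_net(state).items():
        leaf.edges[move] = Edge(prior=prob)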
  161. 161. MCTS: Evaluation Silver et al. 2016 43
  162. 162. MCTS: Evaluation evaluation from the value network vθ(s) Silver et al. 2016 43
  163. 163. MCTS: Evaluation evaluation from the value network vθ(s) evaluation by the outcome z using the fast rollout policy pπ until the end of the game Silver et al. 2016 43
  164. 164. MCTS: Evaluation evaluation from the value network vθ(s) evaluation by the outcome z using the fast rollout policy pπ until the end of the game Silver et al. 2016 43
  165. 165. MCTS: Evaluation evaluation from the value network vθ(s) evaluation by the outcome z using the fast rollout policy pπ until the end of the game Using a mixing parameter λ, the final leaf evaluation is $V(s) = (1 - \lambda)\, v_\theta(s) + \lambda z$ Silver et al. 2016 43
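The mixed leaf evaluation, written out as code. `value_net` and `fast_rollout` are hypothetical callables standing in for vθ and a rollout with pπ, and the default mixing value used here is only a placeholder.

```python
def evaluate_leaf(state, value_net, fast_rollout, lam=0.5):
    """Mixed leaf evaluation V(s) = (1 - lambda) * v_theta(s) + lambda * z.

    value_net(state)    -- value-network estimate v_theta(s)
    fast_rollout(state) -- outcome z of a game played to the end with p_pi
    lam                 -- mixing parameter lambda (placeholder default)
    """
    v = value_net(state)     # evaluation from the value network
    z = fast_rollout(state)  # evaluation from a fast rollout
    return (1.0 - lam) * v + lam * z
```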
  166. 166. MCTS: Backup At the end of simulation, each traversed edge is updated by accumulating: the action values Q Silver et al. 2016 44
  167. 167. MCTS: Backup At the end of simulation, each traversed edge is updated by accumulating: the action values Q visit counts N Silver et al. 2016 44
  168. 168. Once the search is complete, the algorithm chooses the most visited move from the root position. Silver et al. 2016 44
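A sketch of the backup phase and the final move choice, again reusing the illustrative `Edge` record: every edge traversed in a simulation accumulates the leaf evaluation and a visit, and once all simulations are done the most-visited root move is played.

```python
def backup(path, leaf_value):
    """At the end of a simulation, update every traversed (node, move) pair:
    increment N(s, a) and accumulate the leaf value, so that Q(s, a)
    remains the mean evaluation total_value / visit_count."""
    for node, move in path:
        edge = node.edges[move]
        edge.visit_count += 1
        edge.total_value += leaf_value

def choose_move(root):
    """Once the search is complete, play the most visited move from the root."""
    return max(root.edges.items(), key=lambda item: item[1].visit_count)[0]
```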
  169. 169. Percentage of Simulations percentage frequency with which actions were selected from the root during simulations Silver et al. 2016 45
  170. 170. Principal Variation (Path with Maximum Visit Count) The moves are presented in a numbered sequence. Silver et al. 2016 46
  171. 171. Principal Variation (Path with Maximum Visit Count) The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Silver et al. 2016 46
  172. 172. Principal Variation (Path with Maximum Visit Count) The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Fan Hui responded with the move indicated by the white square; Silver et al. 2016 46
  173. 173. Principal Variation (Path with Maximum Visit Count) The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Fan Hui responded with the move indicated by the white square; in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo. Silver et al. 2016 46
  174. 174. Scalability asynchronous multi-threaded search Silver et al. 2016 47
  175. 175. Scalability asynchronous multi-threaded search simulations on CPUs Silver et al. 2016 47
  176. 176. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs Silver et al. 2016 47
  177. 177. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs Silver et al. 2016 47
  178. 178. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads Silver et al. 2016 47
  179. 179. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs Silver et al. 2016 47
  180. 180. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs 8 GPUs Silver et al. 2016 47
  181. 181. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs 8 GPUs Silver et al. 2016 47
  182. 182. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs 8 GPUs Distributed version of AlphaGo (on multiple machines): Silver et al. 2016 47
  183. 183. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs 8 GPUs Distributed version of AlphaGo (on multiple machines): 40 search threads Silver et al. 2016 47
  184. 184. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs 8 GPUs Distributed version of AlphaGo (on multiple machines): 40 search threads 1202 CPUs Silver et al. 2016 47
  185. 185. Scalability asynchronous multi-threaded search simulations on CPUs computation of neural networks on GPUs AlphaGo: 40 search threads 40 CPUs 8 GPUs Distributed version of AlphaGo (on multiple machines): 40 search threads 1202 CPUs 176 GPUs Silver et al. 2016 47
  186. 186. Elo Ratings for Various Combinations of Threads Silver et al. 2016 48
  187. 187. Results: the strength of AlphaGo
  188. 188. Tournament with Other Go Programs Silver et al. 2016 49
  189. 189. Fan Hui https://en.wikipedia.org/wiki/Fan_Hui 50
  190. 190. Fan Hui professional 2 dan https://en.wikipedia.org/wiki/Fan_Hui 50
  191. 191. Fan Hui professional 2 dan European Go Champion in 2013, 2014 and 2015 https://en.wikipedia.org/wiki/Fan_Hui 50
  192. 192. Fan Hui professional 2 dan European Go Champion in 2013, 2014 and 2015 European Professional Go Champion in 2016 https://en.wikipedia.org/wiki/Fan_Hui 50
  193. 193. Fan Hui professional 2 dan European Go Champion in 2013, 2014 and 2015 European Professional Go Champion in 2016 biological neural network: https://en.wikipedia.org/wiki/Fan_Hui 50
  194. 194. Fan Hui professional 2 dan European Go Champion in 2013, 2014 and 2015 European Professional Go Champion in 2016 biological neural network: 100 billion neurons https://en.wikipedia.org/wiki/Fan_Hui 50
  195. 195. Fan Hui professional 2 dan European Go Champion in 2013, 2014 and 2015 European Professional Go Champion in 2016 biological neural network: 100 billion neurons 100 to 1,000 trillion neuronal connections https://en.wikipedia.org/wiki/Fan_Hui 50
  196. 196. AlphaGo versus Fan Hui 51
  197. 197. AlphaGo versus Fan Hui AlphaGo won 5:0 in a formal match in October 2015. 51
  198. 198. AlphaGo versus Fan Hui AlphaGo won 5:0 in a formal match in October 2015. [AlphaGo] is very strong and stable, it seems like a wall. ... I know AlphaGo is a computer, but if no one told me, maybe I would think the player was a little strange, but a very strong player, a real person. Fan Hui 51
  199. 199. Lee Sedol “The Strong Stone” https://en.wikipedia.org/wiki/Lee_Sedol 52
  200. 200. Lee Sedol “The Strong Stone” professional 9 dan https://en.wikipedia.org/wiki/Lee_Sedol 52
  201. 201. Lee Sedol “The Strong Stone” professional 9 dan the 2nd in international titles https://en.wikipedia.org/wiki/Lee_Sedol 52
  202. 202. Lee Sedol “The Strong Stone” professional 9 dan the 2nd in international titles the 5th youngest (12 years 4 months) to become a professional Go player in South Korean history https://en.wikipedia.org/wiki/Lee_Sedol 52
  203. 203. Lee Sedol “The Strong Stone” professional 9 dan the 2nd in international titles the 5th youngest (12 years 4 months) to become a professional Go player in South Korean history Lee Sedol would win 97 out of 100 games against Fan Hui. https://en.wikipedia.org/wiki/Lee_Sedol 52
  204. 204. Lee Sedol “The Strong Stone” professional 9 dan the 2nd in international titles the 5th youngest (12 years 4 months) to become a professional Go player in South Korean history Lee Sedol would win 97 out of 100 games against Fan Hui. biological neural network comparable to Fan Hui’s (in number of neurons and connections) https://en.wikipedia.org/wiki/Lee_Sedol 52
  205. 205. I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time. Lee Sedol 52
  206. 206. I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time. Lee Sedol ...even beating AlphaGo by 4:1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind. interview in JTBC Newsroom 52
  207. 207. I heard Google DeepMind’s AI is surprisingly strong and getting stronger, but I am confident that I can win, at least this time. Lee Sedol ...even beating AlphaGo by 4:1 may allow the Google DeepMind team to claim its de facto victory and the defeat of him [Lee Sedol], or even humankind. interview in JTBC Newsroom 52
  208. 208. AlphaGo versus Lee Sedol https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
  209. 209. AlphaGo versus Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
  210. 210. AlphaGo versus Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. AlphaGo won all but the 4th game; all games were won by resignation. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
  211. 211. AlphaGo versus Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. AlphaGo won all but the 4th game; all games were won by resignation. The winner of the match was slated to win $1 million. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
  212. 212. AlphaGo versus Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. AlphaGo won all but the 4th game; all games were won by resignation. The winner of the match was slated to win $1 million. Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and to Go organisations. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
  213. 213. AlphaGo versus Lee Sedol In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol. AlphaGo won all but the 4th game; all games were won by resignation. The winner of the match was slated to win $1 million. Since AlphaGo won, Google DeepMind stated that the prize would be donated to charities, including UNICEF, and to Go organisations. Lee received $170,000 ($150,000 for participating in all five games, and an additional $20,000 for each game won). https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
  214. 214. Who’s next? 53
  215. 215. http://www.goratings.org/ (18th April 2016) 53
  216. 216. AlphaGo versus Ke Jie? https://en.wikipedia.org/wiki/Ke_Jie 54
  217. 217. AlphaGo versus Ke Jie? professional 9 dan https://en.wikipedia.org/wiki/Ke_Jie 54
  218. 218. AlphaGo versus Ke Jie? professional 9 dan the 1st in (unofficial) world ranking list https://en.wikipedia.org/wiki/Ke_Jie 54
  219. 219. AlphaGo versus Ke Jie? professional 9 dan the 1st in (unofficial) world ranking list the youngest player to win 3 major international tournaments https://en.wikipedia.org/wiki/Ke_Jie 54
  220. 220. AlphaGo versus Ke Jie? professional 9 dan the 1st in (unofficial) world ranking list the youngest player to win 3 major international tournaments head-to-head record against Lee Sedol 8:2 https://en.wikipedia.org/wiki/Ke_Jie 54
  221. 221. AlphaGo versus Ke Jie? professional 9 dan the 1st in (unofficial) world ranking list the youngest player to win 3 major international tournaments head-to-head record against Lee Sedol 8:2 biological neural network comparable to Fan Hui’s, and thus by transitivity, also comparable to Lee Sedol’s https://en.wikipedia.org/wiki/Ke_Jie 54
  222. 222. I believe I can beat it. Machines can be very strong in many aspects but still have loopholes in certain calculations. Ke Jie 54
  223. 223. I believe I can beat it. Machines can be very strong in many aspects but still have loopholes in certain calculations. Ke Jie Now facing AlphaGo, I do not feel the same strong instinct of victory when I play a human player, but I still believe I have the advantage against it. It’s 60 percent in favor of me. Ke Jie 54
  224. 224. I believe I can beat it. Machines can be very strong in many aspects but still have loopholes in certain calculations. Ke Jie Now facing AlphaGo, I do not feel the same strong instinct of victory when I play a human player, but I still believe I have the advantage against it. It’s 60 percent in favor of me. Ke Jie Even though AlphaGo may have defeated Lee Sedol, it won’t beat me. Ke Jie 54
  225. 225. Conclusion
  226. 226. Difficulties of Go challenging decision-making Silver et al. 2016 55
  227. 227. Difficulties of Go challenging decision-making intractable search space Silver et al. 2016 55
  228. 228. Difficulties of Go challenging decision-making intractable search space complex optimal solution It appears infeasible to approximate it directly using a policy or value function! Silver et al. 2016 55
  229. 229. AlphaGo: summary Monte Carlo tree search Silver et al. 2016 56
  230. 230. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation Silver et al. 2016 56
  231. 231. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks Silver et al. 2016 56
  232. 232. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning Silver et al. 2016 56
  233. 233. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining Silver et al. 2016 56
  234. 234. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining neural network evaluation Silver et al. 2016 56
  235. 235. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining neural network evaluation Monte Carlo rollouts Silver et al. 2016 56
  236. 236. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining neural network evaluation Monte Carlo rollouts scalable implementation Silver et al. 2016 56
  237. 237. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining neural network evaluation Monte Carlo rollouts scalable implementation multi-threaded simulations on CPUs Silver et al. 2016 56
  238. 238. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining neural network evaluation Monte Carlo rollouts scalable implementation multi-threaded simulations on CPUs parallel GPU computations Silver et al. 2016 56
  239. 239. AlphaGo: summary Monte Carlo tree search effective move selection and position evaluation through deep convolutional neural networks trained by novel combination of supervised and reinforcement learning new search algorithm combining neural network evaluation Monte Carlo rollouts scalable implementation multi-threaded simulations on CPUs parallel GPU computations distributed version over multiple machines Silver et al. 2016 56
  240. 240. Novel approach Silver et al. 2016 57
  241. 241. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. Silver et al. 2016 57
  242. 242. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. It compensated for this by: selecting those positions more intelligently (policy network) Silver et al. 2016 57
  243. 243. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. It compensated for this by: selecting those positions more intelligently (policy network) evaluating them more precisely (value network) Silver et al. 2016 57
  244. 244. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. It compensated for this by: selecting those positions more intelligently (policy network) evaluating them more precisely (value network) Silver et al. 2016 57
  245. 245. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. It compensated for this by: selecting those positions more intelligently (policy network) evaluating them more precisely (value network) Deep Blue relied on a handcrafted evaluation function. Silver et al. 2016 57
  246. 246. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. It compensated for this by: selecting those positions more intelligently (policy network) evaluating them more precisely (value network) Deep Blue relied on a handcrafted evaluation function. AlphaGo was trained directly and automatically from gameplay. It used general-purpose learning. Silver et al. 2016 57
  247. 247. Novel approach During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue against Kasparov. It compensated for this by: selecting those positions more intelligently (policy network) evaluating them more precisely (value network) Deep Blue relied on a handcrafted evaluation function. AlphaGo was trained directly and automatically from gameplay. It used general-purpose learning. This approach is not specific to the game of Go. The algorithm can be used for a much wider class of (so far seemingly) intractable problems in AI! Silver et al. 2016 57
  248. 248. Thank you! Questions? 57
  249. 249. Backup Slides
  250. 250. Input features for rollout and tree policy Silver et al. 2016
  251. 251. Selection of Moves by the SL Policy Network move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%). Silver et al. 2016
  252. 252. Selection of Moves by the Value Network evaluation of all successors s of the root position s, using vθ(s) Silver et al. 2016
  253. 253. Tree Evaluation from Value Network action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over value network evaluations only) Silver et al. 2016
  254. 254. Tree Evaluation from Rollouts action values Q(s, a), averaged over rollout evaluations only Silver et al. 2016
  255. 255. Results of a tournament between different Go programs Silver et al. 2016
  256. 256. Results of a tournament between AlphaGo and distributed Al- phaGo, testing scalability with hardware Silver et al. 2016
  257. 257. AlphaGo versus Fan Hui: Game 1 Silver et al. 2016
  258. 258. AlphaGo versus Fan Hui: Game 2 Silver et al. 2016
  259. 259. AlphaGo versus Fan Hui: Game 3 Silver et al. 2016
  260. 260. AlphaGo versus Fan Hui: Game 4 Silver et al. 2016
  261. 261. AlphaGo versus Fan Hui: Game 5 Silver et al. 2016
  262. 262. AlphaGo versus Lee Sedol: Game 1 https://youtu.be/vFr3K2DORc8 https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  263. 263. AlphaGo versus Lee Sedol: Game 2 (1/2) https://youtu.be/l-GsfyVCBu0 https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  264. 264. AlphaGo versus Lee Sedol: Game 2 (2/2) https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  265. 265. AlphaGo versus Lee Sedol: Game 3 https://youtu.be/qUAmTYHEyM8 https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  266. 266. AlphaGo versus Lee Sedol: Game 4 https://youtu.be/yCALyQRN3hw https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  267. 267. AlphaGo versus Lee Sedol: Game 5 (1/2) https://youtu.be/mzpW10DPHeQ https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  268. 268. AlphaGo versus Lee Sedol: Game 5 (2/2) https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
  269. 269. Further Reading I AlphaGo: Google Research Blog http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html an article in Nature http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234 a reddit article claiming that AlphaGo is even stronger than it appears to be: “AlphaGo would rather win by less points, but with higher probability.” https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/ a video of how AlphaGo works (put in layman’s terms) https://youtu.be/qWcfiPi9gUU Articles by Google DeepMind: Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih et al. 2015) Neural Turing Machines (Graves, Wayne, and Danihelka 2014) Artificial Intelligence: Artificial Intelligence course at MIT http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/index.htm
  270. 270. Further Reading II Introduction to Artificial Intelligence at Udacity https://www.udacity.com/course/intro-to-artificial-intelligence--cs271 General Game Playing course https://www.coursera.org/course/ggp Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2 The Singularity Is Near (Kurzweil 2005) Combinatorial Game Theory (founded by John H. Conway to study endgames in Go): Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory On Numbers and Games (Conway 1976) Computer Go as a sum of local games: an application of combinatorial game theory (Müller 1995) Chess: Deep Blue beats G. Kasparov in 1997 https://youtu.be/NJarxpYyoFI Machine Learning: Machine Learning course https://www.coursera.org/learn/machine-learning/ Reinforcement Learning http://reinforcementlearning.ai-depot.com/ Deep Learning (LeCun, Bengio, and Hinton 2015)
  271. 271. Further Reading III Deep Learning course https://www.udacity.com/course/deep-learning--ud730 Two Minute Papers https://www.youtube.com/user/keeroyz Applications of Deep Learning https://youtu.be/hPKJBXkyTKM Neuroscience: http://www.brainfacts.org/
  272. 272. References I Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looijen. Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the art open source Go program”. In: Advances in Computer Games. Springer, pp. 24–38. Bowling, Michael et al. (2015). “Heads-up limit hold’em poker is solved”. In: Science 347.6218, pp. 145–149. url: http://poker.cs.ualberta.ca/15science.html. Champandard, Alex J (2016). “Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks”. In: arXiv preprint arXiv:1603.01768. Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6. Dieterle, Frank Jochen (2003). “Multianalyte quantifications by means of integration of artificial neural networks, genetic algorithms and chemometrics for time-resolved analytical data”. PhD thesis. Universität Tübingen. Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge (2015). “A Neural Algorithm of Artistic Style”. In: CoRR abs/1508.06576. url: http://arxiv.org/abs/1508.06576. Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural turing machines”. In: arXiv preprint arXiv:1410.5401. Hayes, Bradley (2016). url: https://twitter.com/deepdrumpf. Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. url: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (visited on 04/01/2016).
  273. 273. References II Kurzweil, Ray (2005). The singularity is near: When humans transcend biology. Penguin. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444. Li, Chuan and Michael Wand (2016). “Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis”. In: CoRR abs/1601.04589. url: http://arxiv.org/abs/1601.04589. Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540, pp. 529–533. url: https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf. Müller, Martin (1995). “Computer Go as a sum of local games: an application of combinatorial game theory”. PhD thesis. TU Graz. Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587, pp. 484–489.
