- 1. "Mastering the game of Go with deep neural networks and tree search" by David Silver et al., Google DeepMind. Article overview by Ilya Kuzovkin. Reinforcement Learning Seminar, University of Tartu, 2016
- 2. THE GAME OF GO
- 3. BOARD
- 4. BOARD STONES
- 5. BOARD STONES GROUPS
- 6. BOARD STONES GROUPS LIBERTIES
- 7. BOARD STONES GROUPS LIBERTIES CAPTURE
- 8. BOARD STONES GROUPS LIBERTIES CAPTURE KO
- 9. BOARD STONES GROUPS LIBERTIES CAPTURE KO EXAMPLES
- 10. BOARD STONES GROUPS LIBERTIES CAPTURE KO EXAMPLES
- 11. BOARD STONES GROUPS LIBERTIES CAPTURE KO EXAMPLES
- 12. BOARD STONES GROUPS LIBERTIES CAPTURE KO EXAMPLES TWO EYES
- 13. BOARD STONES GROUPS LIBERTIES CAPTURE KO EXAMPLES TWO EYES FINAL COUNT
- 14. TRAINING
- 15. TRAINING: THE BUILDING BLOCKS. Supervised policy network p_σ(a|s) (supervised classification), reinforcement policy network p_ρ(a|s) (reinforcement learning), rollout policy p_π(a|s) and tree policy p_τ(a|s) (supervised classification), value network v_θ(s) (supervised regression)
- 16. Supervised policy network p_σ(a|s)
- 17. Supervised policy network p_σ(a|s): 19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax
- 18. Supervised policy network p_σ(a|s): 19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax • 29.4M positions from games between 6 and 9 dan players
- 19. Supervised policy network p_σ(a|s): 19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax • 29.4M positions from games between 6 and 9 dan players • stochastic gradient ascent • learning rate α = 0.003, halved every 80M steps • batch size m = 16 • 3 weeks on 50 GPUs to make 340M steps
- 20. Supervised policy network p_σ(a|s): 19 x 19 x 48 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; softmax • 29.4M positions from games between 6 and 9 dan players • stochastic gradient ascent • learning rate α = 0.003, halved every 80M steps • batch size m = 16 • 3 weeks on 50 GPUs to make 340M steps • augmented: 8 reflections/rotations • test set (1M) accuracy: 57.0% • 3 ms to select an action
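For concreteness, here is a minimal sketch of this architecture in PyTorch. The layer sizes follow the slides; the class name, padding choices, and the omission of the final layer's per-position bias are my simplifications:

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the supervised policy network p_sigma(a|s)."""

    def __init__(self, in_planes=48, k=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):  # eleven 3x3 hidden convolutional layers
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(k, 1, kernel_size=1))  # one logit per board point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                   # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)    # (batch, 361)
        return torch.softmax(logits, dim=1) # distribution over board points
```

Stochastic gradient ascent on log p_σ(a|s) for the human move a is then equivalent to minimizing cross-entropy over these 361 outputs.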
- 21. 19 X 19 X 48 INPUT
- 22. 19 X 19 X 48 INPUT
- 23. 19 X 19 X 48 INPUT
- 24. 19 X 19 X 48 INPUT
- 25. Rollout policy p_π(a|s) • supervised, same data as p_σ(a|s) • less accurate: 24.2% (vs. 57.0%) • faster: 2 μs per action (1500 times faster) • just a linear model with softmax
- 26. Rollout policy p_π(a|s) • supervised, same data as p_σ(a|s) • less accurate: 24.2% (vs. 57.0%) • faster: 2 μs per action (1500 times faster) • just a linear model with softmax
- 27. Rollout policy p_π(a|s) • supervised, same data as p_σ(a|s) • less accurate: 24.2% (vs. 57.0%) • faster: 2 μs per action (1500 times faster) • just a linear model with softmax • Tree policy p_τ(a|s)
- 28. Rollout policy p_π(a|s) • supervised, same data as p_σ(a|s) • less accurate: 24.2% (vs. 57.0%) • faster: 2 μs per action (1500 times faster) • just a linear model with softmax • Tree policy p_τ(a|s): "similar to the rollout policy but with more features"
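To illustrate what "just a linear model with softmax" means, here is a toy sketch. The real p_π scores moves with fast hand-crafted pattern features; the random features below are only a stand-in:

```python
import numpy as np

def rollout_policy(move_features, weights):
    """Linear softmax policy over candidate moves.

    move_features: (num_legal_moves, num_features) array
    weights:       (num_features,) learned weight vector
    """
    logits = move_features @ weights   # one linear score per move
    logits -= logits.max()             # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: 10 legal moves described by 6 binary pattern features each.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(10, 6)).astype(float)
probs = rollout_policy(features, weights=rng.normal(size=6))
next_move = rng.choice(10, p=probs)    # sample the next rollout move
```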
- 29. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ
- 30. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ • Self-play: current network vs. randomized pool of previous versions
- 31. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ • Self-play: current network vs. randomized pool of previous versions • Play a game until the end, get the reward z_t = ±r(s_T) = ±1
- 32. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ • Self-play: current network vs. randomized pool of previous versions • Play a game until the end, get the reward z_t = ±r(s_T) = ±1 • Set z_t^i = z_t and play the same game again, this time updating the network parameters at each time step t
- 33. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ • Self-play: current network vs. randomized pool of previous versions • Play a game until the end, get the reward z_t = ±r(s_T) = ±1 • Set z_t^i = z_t and play the same game again, this time updating the network parameters at each time step t • The baseline v(s_t^i) is 0 "on the first pass through the training pipeline" and v_θ(s_t^i) "on the second pass"
- 34. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ • Self-play: current network vs. randomized pool of previous versions • Play a game until the end, get the reward z_t = ±r(s_T) = ±1 • Set z_t^i = z_t and play the same game again, this time updating the network parameters at each time step t • The baseline v(s_t^i) is 0 "on the first pass through the training pipeline" and v_θ(s_t^i) "on the second pass" • batch size n = 128 games • 10,000 batches • one day on 50 GPUs
- 35. Reinforcement policy network p_ρ(a|s). Same architecture; the weights ρ are initialized with σ • Self-play: current network vs. randomized pool of previous versions • Play a game until the end, get the reward z_t = ±r(s_T) = ±1 • Set z_t^i = z_t and play the same game again, this time updating the network parameters at each time step t • The baseline v(s_t^i) is 0 "on the first pass through the training pipeline" and v_θ(s_t^i) "on the second pass" • batch size n = 128 games • 10,000 batches • one day on 50 GPUs • 80% wins against the supervised network • 85% wins against Pachi (no search yet!) • 3 ms to select an action
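The update behind these slides is REINFORCE with a baseline: Δρ ∝ Σ_t ∂log p_ρ(a_t|s_t)/∂ρ · (z_t − v(s_t)). A minimal PyTorch sketch of one such update; the caller is assumed to have saved the per-move log-probabilities while playing the game:

```python
import torch

def reinforce_update(optimizer, log_probs, z, baselines):
    """One policy-gradient step over a finished self-play game.

    log_probs: list of scalar tensors log p_rho(a_t | s_t), saved during play
    z:         game outcome for the current player, +1 or -1
    baselines: v(s_t) per step (0 on the first pass, v_theta on the second)
    """
    # Minimizing -log_prob * advantage is gradient ascent on expected outcome.
    loss = sum(-lp * (z - b) for lp, b in zip(log_probs, baselines))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```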
- 36. Value network v_θ(s)
- 37. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit
- 38. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit • Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p] • Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s)
- 39. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit • Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p] • Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s) • Stochastic gradient descent to minimize MSE
- 40. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit • Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p] • Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s) • Stochastic gradient descent to minimize MSE • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play
- 41. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit • Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p] • Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s) • Stochastic gradient descent to minimize MSE • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play: ‣ choose a random time step u ‣ sample moves t=1…u-1 from the SL policy ‣ make a random move u ‣ sample t=u+1…T from the RL policy and get the game outcome z ‣ add the pair (s_u, z_u) to the training set
- 42. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit • Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p] • Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s) • Stochastic gradient descent to minimize MSE • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play: ‣ choose a random time step u ‣ sample moves t=1…u-1 from the SL policy ‣ make a random move u ‣ sample t=u+1…T from the RL policy and get the game outcome z ‣ add the pair (s_u, z_u) to the training set • One week on 50 GPUs to train on 50M batches of size m=32
- 43. Value network v_θ(s): 19 x 19 x 49 input; 1 convolutional layer 5x5 with k=192 filters, ReLU; 11 convolutional layers 3x3 with k=192 filters, ReLU; 1 convolutional layer 1x1, ReLU; fully connected layer of 256 ReLU units; fully connected layer of 1 tanh unit • Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p] • Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s) • Stochastic gradient descent to minimize MSE • Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play: ‣ choose a random time step u ‣ sample moves t=1…u-1 from the SL policy ‣ make a random move u ‣ sample t=u+1…T from the RL policy and get the game outcome z ‣ add the pair (s_u, z_u) to the training set • One week on 50 GPUs to train on 50M batches of size m=32 • MSE on the test set: 0.234 • close to the MC estimate from the RL policy, but 15,000 times faster
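The regression itself is plain MSE minimization. A hedged sketch of one training step on (s_u, z_u) pairs produced by the recipe above; value_net, optimizer, and the tensor shapes are assumptions, with the network ending in the tanh unit described on the slides:

```python
import torch
import torch.nn.functional as F

def value_train_step(value_net, optimizer, states, outcomes):
    """One SGD step minimizing MSE between v_theta(s) and the outcome z.

    states:   (batch, 49, 19, 19) tensor of input planes for positions s_u
    outcomes: (batch,) tensor of game results z_u in {-1, +1}
    """
    pred = value_net(states).squeeze(-1)   # tanh head, predictions in [-1, 1]
    loss = F.mse_loss(pred, outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```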
- 44. PLAYING
- 45. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS)
- 46. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Each node s has edges (s, a) for all legal actions; each edge stores statistics: prior P(s,a), number of evaluations N_v(s,a), number of rollouts N_r(s,a), MC value estimate W_v(s,a), rollout value estimate W_r(s,a), combined mean action value Q(s,a)
- 47. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Each node s has edges (s, a) for all legal actions; each edge stores statistics: prior P(s,a), number of evaluations N_v(s,a), number of rollouts N_r(s,a), MC value estimate W_v(s,a), rollout value estimate W_r(s,a), combined mean action value Q(s,a). Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found.
- 48. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Each node s has edges (s, a) for all legal actions; each edge stores statistics: prior P(s,a), number of evaluations N_v(s,a), number of rollouts N_r(s,a), MC value estimate W_v(s,a), rollout value estimate W_r(s,a), combined mean action value Q(s,a). Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. The position s_L is added to the evaluation queue.
- 49. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Each node s has edges (s, a) for all legal actions; each edge stores statistics: prior P(s,a), number of evaluations N_v(s,a), number of rollouts N_r(s,a), MC value estimate W_v(s,a), rollout value estimate W_r(s,a), combined mean action value Q(s,a). Simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. The position s_L is added to the evaluation queue. A batch of nodes is selected for evaluation…
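A sketch of these per-edge statistics and of in-tree action selection. The slides do not show the selection rule; the paper picks a_t = argmax_a (Q(s,a) + u(s,a)) with an exploration bonus u(s,a) ∝ P(s,a)/(1 + N(s,a)). The Edge layout, the c_puct default, and the simplified bonus (omitting the parent-visit factor of the full formula) are mine:

```python
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored for one (s, a) edge, as listed on the slide."""
    P: float           # prior probability from the policy network
    Nv: int = 0        # number of value-network evaluations
    Nr: int = 0        # number of rollouts
    Wv: float = 0.0    # accumulated value-network estimates
    Wr: float = 0.0    # accumulated rollout outcomes
    Q: float = 0.0     # combined mean action value

def select_action(edges, c_puct=5.0):
    """Pick the action maximizing Q + u, where the exploration bonus
    u shrinks as the edge accumulates visits. edges: dict action -> Edge."""
    def score(edge):
        return edge.Q + c_puct * edge.P / (1 + edge.Nv + edge.Nr)
    return max(edges, key=lambda a: score(edges[a]))
```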
- 50. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS)
- 51. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Node s is evaluated using the value network to obtain v_θ(s)
- 52. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Node s is evaluated using the value network to obtain v_θ(s), and using rollout simulation with policy p_π(a|s) until the end of each simulated game to get the final game score.
- 53. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Node s is evaluated using the value network to obtain v_θ(s), and using rollout simulation with policy p_π(a|s) until the end of each simulated game to get the final game score. Each leaf is evaluated; we are ready to propagate updates.
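The two evaluations are combined into a single leaf value. In the paper this is V(s_L) = (1 − λ)·v_θ(s_L) + λ·z_L, with mixing constant λ = 0.5 in the final system:

```python
def leaf_value(v_theta, z_rollout, lam=0.5):
    """Mix the value-network estimate with the rollout outcome:
    V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L."""
    return (1.0 - lam) * v_theta + lam * z_rollout
```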
- 54. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS)
- 55. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Statistics along the path of each simulation are updated during the backward pass through t < L
- 56. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Statistics along the path of each simulation are updated during the backward pass through t < L; visit counts are updated as well
- 57. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Statistics along the path of each simulation are updated during the backward pass through t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated
- 58. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Statistics along the path of each simulation are updated during the backward pass through t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated. The current tree is updated
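A hedged sketch of that backward pass, reusing the Edge class from the selection sketch above. The real implementation is asynchronous and uses virtual loss, both omitted here:

```python
def backup(path, v_theta, z_rollout, lam=0.5):
    """Propagate one simulation's results along the visited edges.

    path: list of Edge objects from the root down to the leaf s_L
    """
    for edge in path:
        edge.Nv += 1
        edge.Wv += v_theta        # accumulate the value-network estimate
        edge.Nr += 1
        edge.Wr += z_rollout      # accumulate the rollout outcome
        # Combined mean action value: mix the two running means.
        edge.Q = (1 - lam) * edge.Wv / edge.Nv + lam * edge.Wr / edge.Nr
```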
- 59. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS)
- 60. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Once an edge (s, a) has been visited enough (n_thr) times, the successor state s' is added to the tree
- 61. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Once an edge (s, a) has been visited enough (n_thr) times, the successor state s' is added to the tree. It is initialized using the tree policy to p_τ(a|s') and updated with the SL policy
- 62. APV-MCTS (ASYNCHRONOUS POLICY AND VALUE MCTS). Once an edge (s, a) has been visited enough (n_thr) times, the successor state s' is added to the tree. It is initialized using the tree policy to p_τ(a|s') and updated with the SL policy. The tree is expanded, fully updated, and ready for the next move!
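Finally, a sketch of the expansion step, again reusing the Edge class from above. The parameter names succ_state and priors are mine, and the n_thr default is a placeholder (the paper adapts the threshold dynamically):

```python
def maybe_expand(tree, state, action, succ_state, priors, n_thr=40):
    """Add a node for successor s' once edge (s, a) has enough visits.

    succ_state: the position s' reached by playing `action` in `state`
    priors:     dict action -> p_tau(a | s') from the fast tree policy
    """
    edge = tree[state][action]
    if edge.Nv + edge.Nr <= n_thr:
        return
    tree[succ_state] = {a: Edge(P=p) for a, p in priors.items()}
    # The cheap p_tau priors are later refreshed by the stronger SL
    # policy p_sigma once its (slower) evaluation comes back.
```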
- 63. WINNING: https://www.youtube.com/watch?v=oRvlyEpOQ-8
- 64. https://www.youtube.com/watch?v=oRvlyEpOQ-8
