Deep Learning & Reinforcement Learning
Renārs Liepiņš
Lead Researcher, LUMII & LETA
renars.liepins@lumii.lv
At “Riga AI, Machine Learning and Bots”, February 16, 2017
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Source
“Machine learning is a core, transformative way by which we are rethinking everything we are doing.”
– Sundar Pichai (CEO, Google), 2015
Why such optimism?
Artificial Intelligence: computer systems able to perform tasks normally requiring human intelligence
[Chart: ImageNet classification error (%) by year. Classic methods stayed around 28% (2010–2011); deep learning then drove error down: 16.4 (2012), 11.7 (2013), 6.7 (2014), 3.57 (2015), 3.08 (2016), dropping below the marked human level.]
Nice, but so what?
First Universal Learning Algorithm
Before Deep Learning
Features for machine learning (Andrew Ng)
Images → vision features → detection
Audio → audio features → speaker ID
Text → text features → web search
…
Source
With Deep Learning
[Diagram: a deep neural network, loosely modeled on neurons in the brain, maps raw input directly to output (Andrew Ng)]
Universal Learning Algorithm
Image → neural network → “A yellow bus driving down….”
Universal Learning Algorithm – Speech Recognition
Audio → neural network → “_ q u i c k …”
Source
Universal Learning Algorithm – Translation
“Dzeltens autobuss brauc pa ceļu….” (Latvian) → neural network → “A yellow bus driving down….”
Source
Universal Learning Algorithm – Self-Driving Cars
Source
Universal Learning Algorithm – Image Captions
Image → neural network → “A yellow bus driving down….”
Source
Chinese captions (English glosses shown; Andrew Ng)
(A baseball player getting ready to bat.)
(A person surfing on the ocean.)
(A double-decker bus driving on a street.)
Universal Learning Algorithm – X-Ray Reports
Source
Universal Learning Algorithm – Photo Localisation
PlaNet is able to determine the location of almost any image with superhuman ability.
Source
Universal Learning Algorithm – Style Transfer
Source
Universal Learning Algorithm – Semantic Face Transforms
[Figures from the Deep Feature Interpolation paper. Figure 1: aging a 400x400 face, shown before and after the artifact removal step; a background-preserving mask was applied in that figure only, and the source and target images used in the transformation were 100x100. Figure 2: a test image (Silvio Berlusconi) transformed towards six categories – older, mouth open, eyes open, smiling, facial hair, spectacles – via linear interpolation in a deep feature space of pre-trained VGG features.]
Source
Universal Learning Algorithm – Lipreading
LipNet – Sentence-level Lipreading
Source
LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.
Source
Universal Learning Algorithm – Sketch Vectorisation
Source
Universal Learning Algorithm – Handwriting Generation
Source
This LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.
Universal Learning Algorithm – Image Upscaling
Source
Google – Saving you bandwidth through machine learning
Source
First Universal Learning Algorithm
Not Magic
• Simply downloading and “applying” open-source software won’t work.
• It needs to be customised to your business context and data.
• It needs lots of examples and computing power for training.
Source
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Neuron
[Diagram: cell body, output axon, synapse]
Source
Artificial Neuron
Source
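As a minimal sketch of the artificial neuron above: a weighted sum of inputs plus a bias, passed through a nonlinearity (sigmoid is one common choice; all numbers here are made up for illustration, not taken from the slides):

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum of inputs (the "cell body"), squashed by an activation
    # function whose result travels down the "output axon".
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))  # sigmoid activation

x = np.array([0.5, -1.0, 2.0])  # signals arriving at the synapses
w = np.array([0.8, 0.2, -0.4])  # synaptic weights (illustrative)
print(neuron(x, w, b=0.1))      # a single activation in (0, 1)
```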
What is a neural network?
Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (mug or not?)
x1 ∈ ℝ⁵, x2 ∈ ℝ⁵, …; layers connected by weight matrices W1, W2, W3, W4
Training
Forward pass: x2 = (W1 × x1)+, x3 = (W2 × x2)+, … through W1…W4, from Data (image) to Yes/No (mug or not?), where (z)+ is the positive part, max(0, z).

On the training data:
output | true out | error
0.9    | 1.0      | 0.1
0.3    | 0.0      | 0.3
0.2    | 1.0      | 0.8
error backpropagation
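To make the forward pass and the error column concrete, here is a minimal NumPy sketch. It assumes (z)+ denotes the rectifier max(0, z) and uses made-up layer sizes and data; it illustrates the slide’s idea, not the presenter’s actual code. Backpropagation would then push each error back through W4…W1 to adjust the weights.

```python
import numpy as np

def relu(z):
    # (z)+ : the positive part, max(0, z)
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Weight matrices W1..W4 connecting layers x1..x5
# (layer sizes are illustrative assumptions).
sizes = [5, 8, 8, 4, 1]
W = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for Wi in W:
        x = relu(Wi @ x)            # x_{i+1} = (W_i x_i)+
    return x

x1 = rng.normal(size=5)             # x1 in R^5, a stand-in for an image
output = forward(x1)
true_out = np.array([1.0])          # label: "mug" = 1.0
error = np.abs(true_out - output)   # per-example error, as in the table
print("output:", output, "error:", error)
```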
What is Deep Learning?
Input → Result (“cat”)
● Loosely based on (what little) we know about the brain
WHAT MAKES DEEP LEARNING DEEP?
Today’s largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU-days of training.
The human brain has trillions of parameters – only about 1,000× more.
Demo
http://playground.tensorflow.org/
https://transcranial.github.io/keras-js/
Why Now?
A Brief History (1943–2017)
A long time ago… (1943, 1956)
1958 – Perceptron
1969 – Perceptron criticized
1974 – Backpropagation
…awkward silence (AI Winter)
1995 – SVM reigns
1998 – Convolutional Neural Networks for Handwriting Recognition
2006 – Restricted Boltzmann Machine
2012 – Google Brain project on 16k cores; AlexNet wins ImageNet
Deep Learning
Why Now?
• Computational Power
• Big Data
• Algorithms
Current Situation
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Learning from Experience
Source
40% reduction in cooling
Source
What is Reinforcement Learning?
Agent ⇄ Environment
At each step i the agent observes State (Si), takes Action (Ai), and receives Reward (Ri).
Goal: maximize accumulated rewards R1 + R2 + R3 + … = ∑i Ri
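The loop above, as a minimal Python sketch. The Environment class is a toy stand-in with hypothetical reset/step methods (shaped like the common Gym-style interface), and the agent picks actions at random; only the state → action → reward cycle and the running reward sum are the point.

```python
import random

class Environment:
    """Toy stand-in: a three-step episode with a random final reward."""
    def reset(self):
        self.t = 0
        return self.t                 # initial state S1
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        reward = random.choice([+1, -1]) if done else 0
        return self.t, reward, done   # next state, reward R_i, episode over?

env = Environment()
state = env.reset()
total, done = 0, False
while not done:
    action = random.choice(["up", "down"])  # Agent: choose A_i given S_i
    state, reward, done = env.step(action)  # Environment: return S_{i+1}, R_i
    total += reward                         # accumulate sum_i R_i
print("accumulated reward:", total)
```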
Pong Example
States (S): game frames …
Actions (A): move the paddle up or down
Rewards (R): +1, −1, or 0
Agent ⇄ Environment
Reinforcement Agent = Policy Function
π(S) → A
Policy Function: π(Si) = Ai
Pong Example: π(game frame Si) → action Ai
Reinforcement Learning Problem
Find a policy function π(S) → A that maximizes the accumulated rewards ∑i Ri.
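To make “accumulated rewards” concrete, a tiny helper that computes an episode’s return. The discount factor gamma is a common practical addition not shown on the slides; gamma = 1.0 reproduces the plain sum ∑i Ri.

```python
def episode_return(rewards, gamma=1.0):
    # sum_i gamma**i * R_i, computed back-to-front;
    # gamma = 1.0 is the plain sum used on the slide.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(episode_return([0, 0, +1]))  # winning episode ->  1.0
print(episode_return([0, 0, -1]))  # losing episode  -> -1.0
```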
How to find π(S) → A?
Reinforcement Learning Algorithms
• Q-Learning
• Actor-Critic methods
• Policy Gradient
Episode
[Pong frames from the DQN paper (Mnih et al., Nature 2015)]
R1 = 0   R2 = 0   R3 = +1   Game Over   😁
👍 👍 👍
∑i Ri = +1
Episode
[Pong frames from the DQN paper (Mnih et al., Nature 2015)]
R1 = 0   R2 = 0   R3 = −1   Game Over   😭
👎 👎 👎
∑i Ri = −1
How to find π(S) → A?
1. Change π to stochastic: π(S) → P(A)
Pong Example
π(game frame) → [bar chart: action probabilities, scale 0 to 1]
2. Approximate π with a neural network:
π(S, θ) → P(A)
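A minimal sketch of both steps: a policy with parameters θ that maps a state to a probability distribution over actions via softmax, then samples an action (rather than always taking the most probable one). The single weight matrix and the four-feature state are illustrative assumptions, not the presenter’s architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["up", "down"]
theta = rng.normal(scale=0.1, size=(len(ACTIONS), 4))  # parameters θ

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def policy(state, theta):
    # π(S, θ) -> P(A): a distribution over actions, not one action
    return softmax(theta @ state)

state = rng.normal(size=4)       # stand-in for a preprocessed game frame
p = policy(state, theta)
action = rng.choice(len(ACTIONS), p=p)  # stochastic: sample from P(A)
print(dict(zip(ACTIONS, np.round(p, 3))), "->", ACTIONS[action])
```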
π(Si, θ) → [bar chart: action probabilities, scale 0 to 1]
How to find π(Si, θ) → P(A)?
How to find θ?
Loss function…
[An episode under the stochastic policy: at each step the network outputs action probabilities (bars from 0.0 to 1.0); rewards R1 = 0, R2 = 0, then Game Over with R3 = +1. Pong frames from the DQN paper.]
∑i Ri = +1   😁
👍 👍 👍
π(Ai | Si, θ): the probability the policy assigns to the action Ai actually taken in state Si.
For every parameter θk, look at how that probability changes:
Δ π(Ai | Si, θ) / Δ θk
and nudge θk to make the taken actions more probable after a win 👍 and less probable after a loss 👎.
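This update rule is essentially the REINFORCE policy gradient. Below is a toy sketch under two stated assumptions: the policy is the linear-softmax one from the earlier sketch, and, as standard implementations do, it follows Δ log π(Ai | Si, θ) / Δθ, which points in the same direction as Δπ/Δθ rescaled by 1/π. The whole-episode reward sum scales the step, so 👍 episodes raise and 👎 episodes lower the probability of the actions that were taken.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES, ALPHA = 2, 4, 0.1
theta = rng.normal(scale=0.1, size=(N_ACTIONS, N_FEATURES))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(state, action, theta):
    # For a linear-softmax policy:
    # d log pi(a|s,theta) / d theta = outer(one_hot(a) - pi(.|s), s)
    p = softmax(theta @ state)
    return np.outer(np.eye(N_ACTIONS)[action] - p, state)

# One episode of (state, action) pairs plus rewards (all illustrative).
states = [rng.normal(size=N_FEATURES) for _ in range(3)]
actions = [int(rng.choice(N_ACTIONS, p=softmax(theta @ s))) for s in states]
rewards = [0, 0, +1]                 # a winning episode
G = sum(rewards)                     # sum_i R_i for the episode

# REINFORCE: theta <- theta + alpha * G * d log pi / d theta, per step
for s, a in zip(states, actions):
    theta += ALPHA * G * grad_log_pi(s, a, theta)
```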
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
LETTER RESEARCH
0.0
0.2
0.4
0.6
0.8
1.0
R1=0
0.0
0.2
0.4
0.6
0.8
1.0
R2=0
0.0
0.2
0.4
0.6
0.8
1.0
Game
Over
R3=+1
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
paddle controlled by the agent on the right side of the screen and the values of
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
LETTER RESEARCH
Ri
i
n
∑ = +1
😁
👍 👍 👍
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
Reinforcement Learning
Outline
• Current State
• Deep Learning
• Reinforcement Learning
• Conclusions
Conclusions
1. [figure: “Deep Learning: Neural network” diagram, after Andrew Ng]
2. …
3. …
Deep Learning and Reinforcement Learning

  • 1. Deep Learning & Reinforcement Learning Renārs Liepiņš Lead Researcher, LUMII & LETA renars.liepins@lumii.lv At “Riga AI, Machine Learning and Bots”, February 16, 2017
  • 2. Outline • Current State • Deep Learning • Reinforcement Learning
  • 3. Outline • Current State • Deep Learning • Reinforcement Learning
  • 5. Machine learning is a core transformative way by which we are
 rethinking everything we are doing – Sundar Pichai (CEO Google) 2015 Source
  • 9. Artificial Intelligence computer systems able
 to perform tasks normally
 requiring human intelligence
  • 10. 0 5 10 15 20 25 30 2010 2011 2012 2013 2014 2015 2016 3.083.57 6.7 11.7 16.4 27.828 29 Classic Deep Learning Human
 Level
  • 11.
  • 13.
  • 14.
  • 15. Nice, but so what?
  • 17. Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Before Deep Learning Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Source
  • 18. Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! With Deep Learning Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Neurons in the brain Output Deep Learning: Neural network
  • 19. Universal Learning Algorithm Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network … …
  • 20. A yellow bus
 driving down…. Universal Learning Algorithm – Speech Recognition Andrew NgAndrew Ng _ q u i c k … Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 21. Universal Learning Algorithm – Translation Dzeltens autobuss brauc pa ceļu…. A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 22. Universal Learning Algorithm – Self driving cars Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 23. Universal Learning Algorithm A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network
  • 24. Data (image) The limitations of supervise Universal Learning Algorithm – Image captions A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 25. Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.)
  • 26. Universal Learning Algorithm – X-ray reports. Source
  • 27. Universal Learning Algorithm – Photo localisation: PlaNet is able to determine the location of almost any image with superhuman ability. Source
  • 28. Universal Learning Algorithm – Style Transfer. Source
  • 29.
  • 30. Universal Learning Algorithm – Semantic Face Transforms (Deep Feature Interpolation: older, mouth open, eyes open, smiling, …). Source
  • 31. Figure 1 (Deep Feature Interpolation paper): aging a 400×400 face with Deep Feature Interpolation, before and after the artifact-removal step; a mask was applied to preserve the background, and all source and target images used in the transformation were only 100×100. Figure 2: an example transformation of a test image (Silvio Berlusconi) towards six categories – older, mouth open, eyes open, smiling, facial hair, spectacles – each performed via linear interpolation in a deep feature space of pre-trained VGG features.
  • 32. Universal Learning Algorithm – Lipreading: LipNet (sentence-level lipreading) achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy. Source
  • 33. Universal Learning Algorithm – Sketch Vectorisation. Source
  • 34. Universal Learning Algorithm – Handwriting Generation. Source
  • 35. Image generation – handwriting: this LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.
  • 36. Universal Learning Algorithm – Image upscaling. Source
  • 37. Google – Saving you bandwidth through machine learning Source
  • 39. Not Magic • Simply downloading and "applying" open-source software won't work. • It needs to be customised to your business context and data. • It needs lots of examples and computing power for training. Source
  • 40.
  • 41. Outline • Current State • Deep Learning • Reinforcement Learning
  • 42. Outline • Current State • Deep Learning • Reinforcement Learning
  • 46. Deep Learning: Neural network.
  • 47. What is a neural network? Data (an image) enters as a vector x₁ ∈ ℝ⁵ (and likewise x₂ ∈ ℝ⁵ at the next layer) and flows through weight matrices W₁, W₂, W₃, W₄: x₁ → x₂ → x₃ → x₄ → x₅ → Yes/No (mug or not?).
  • 48. Training: each layer computes x₂ = (W₁x₁)₊, x₃ = (W₂x₂)₊, …, where (·)₊ is the rectifier. On a training example the network outputs (0.9, 0.3, 0.2) while the true outputs are (1.0, 0.0, 1.0), giving errors (0.1, 0.3, 0.8); these errors are propagated backwards through the layers (error backpropagation) to adjust the weights.
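To make the forward pass and error backpropagation above concrete, here is a minimal NumPy sketch, assuming a tiny two-layer network with the rectifier (·)₊ as the nonlinearity; the layer sizes, the sigmoid output, and the learning rate are illustrative choices, not taken from the slides.

```python
import numpy as np

def relu(z):
    # The (.)+ rectifier from the slide: keep positive parts, zero the rest.
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 5))  # first-layer weights (hidden size 4 is illustrative)
W2 = rng.normal(scale=0.1, size=(1, 4))  # second-layer weights -> one "mug or not" score

def forward(x1):
    x2 = relu(W1 @ x1)                   # x2 = (W1 x1)+
    score = W2 @ x2                      # raw score for "mug"
    return x2, score

def train_step(x1, y_true, lr=0.1):
    """One training step: forward pass, error, error backpropagation."""
    global W1, W2
    x2, score = forward(x1)
    y_pred = 1.0 / (1.0 + np.exp(-score))   # squash the score to a probability
    err = y_pred - y_true                   # output minus true output
    # Backpropagate the error through each layer (chain rule); err is also the
    # gradient of the binary cross-entropy loss with respect to the raw score.
    grad_W2 = np.outer(err, x2)
    grad_x2 = W2.T @ err
    grad_W1 = np.outer(grad_x2 * (x2 > 0), x1)  # rectifier passes gradient only where x2 > 0
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return float(err[0])

# Usage: one image flattened to 5 numbers, labelled "mug" (1.0).
x = rng.normal(size=5)
for _ in range(100):
    e = train_step(x, 1.0)
print(e)  # the error shrinks towards 0 as the weights adjust
```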
  • 49. Features for machine learning (recap): Image → vision features → detection; Audio → audio features → speaker ID; …
  • 51. What is Deep Learning? A network that maps, for example, an image to "cat" – loosely based on (what little) we know about the brain. What makes deep learning deep? Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU-days; the human brain has trillions of parameters.
  • 52. Demo
  • 56. A brief history (1943–2017): 1943–1956: a long time ago…; 1958: Perceptron; 1969: Perceptron criticized; 1974: backpropagation, then awkward silence (AI Winter); 1995: SVM reigns; 1998: convolutional neural networks for handwriting recognition; 2006: Restricted Boltzmann Machines; 2012: Google Brain project on 16k cores; 2012: AlexNet wins ImageNet; deep learning era since 2012.
  • 58. Three ingredients behind deep learning's rise: computational power, big data, and algorithms.
  • 59.
  • 61. Outline • Current State • Deep Learning • Reinforcement Learning
  • 62. Outline • Current State • Deep Learning • Reinforcement Learning – Learning from Experience
  • 63.
  • 64.
  • 67. What is Reinforcement Learning? Agent ↔ Environment: action (A₁), state (S₁), reward (R₁).
  • 68. What is Reinforcement Learning? Agent ↔ Environment: action (A₂), state (S₂), reward (R₂).
  • 69. What is Reinforcement Learning? Agent ↔ Environment: action (Aᵢ), state (Sᵢ), reward (Rᵢ).
  • 70. What is Reinforcement Learning? Agent ↔ Environment: action (Aᵢ), state (Sᵢ), reward (Rᵢ). Goal: maximize the accumulated rewards R₁ + R₂ + R₃ + … = ∑ᵢ Rᵢ.
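The loop in the diagram can be written down directly. A minimal sketch, assuming hypothetical `env` and `agent` objects with the `reset`/`step`/`act` interfaces named in the docstring (these names are illustrative, not from the slides):

```python
def run_episode(env, agent):
    """Play one episode and return the accumulated reward sum_i R_i.

    Assumed interfaces (hypothetical): env.reset() -> state,
    env.step(action) -> (next_state, reward, done), agent.act(state) -> action.
    """
    state = env.reset()                      # S_1
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)               # A_i: the agent's choice in state S_i
        state, reward, done = env.step(action)  # environment answers with S_{i+1}, R_i
        total_reward += reward                  # accumulate sum_i R_i
    return total_reward
```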
  • 71. Pong example: states (S) are game frames …; actions (A); rewards (R): +1, −1, 0. Agent ↔ Environment. Goal: maximize the accumulated rewards.
  • 72. Reinforcement Agent.
  • 73. Reinforcement Agent = Policy Function: π(S) -> A.
  • 75. Pong example: π(Sᵢ) -> Aᵢ.
  • 76. Pong example: π(Sᵢ) -> Aᵢ.
  • 77. Pong example: states (S) are game frames …; actions (A); rewards (R): +1, −1, 0; π(S) -> A. Goal: maximize the accumulated rewards.
  • 78. The Reinforcement Learning problem: find a policy function π(S) -> A that maximizes the accumulated rewards ∑ᵢ Rᵢ.
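The objective itself is easy to estimate: play many episodes and average the accumulated rewards. A short sketch building on the `run_episode` function above; the episode count is an arbitrary illustrative choice:

```python
def expected_return(env, agent, episodes=100):
    """Monte Carlo estimate of the RL objective E[sum_i R_i].

    Finding a good policy means searching for the agent that
    makes this number as large as possible."""
    returns = [run_episode(env, agent) for _ in range(episodes)]
    return sum(returns) / len(returns)
```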
  • 80. Reinforcement Learning Algorithms • Q-Learning • Actor-Critic methods • Policy Gradient
  • 81. Reinforcement Learning Algorithms • Q-Learning • Actor-Critic methods • Policy Gradient
  • 82. Episode: R₁ = 0, R₂ = 0, R₃ = +1 😁 Game Over; every step of the winning episode is marked 👍 👍 👍; ∑ᵢ Rᵢ = +1.
  • 83. Episode: R₁ = 0, R₂ = 0, R₃ = −1 😭 Game Over; every step of the losing episode is marked 👎 👎 👎; ∑ᵢ Rᵢ = −1.
  • 85. How to find π(S) -> A? 1. Make π stochastic: π(S) -> P(A).
  • 87. π(game frame) -> action probabilities (bar chart, scale 0, 0.25, 0.5, 0.75, 1).
  • 88. π(game frame) -> action probabilities (bar chart, scale 0, 0.25, 0.5, 0.75, 1).
  • 89. π(game frame) -> action probabilities (bar chart, scale 0, 0.25, 0.5, 0.75, 1).
  • 90. How to find π(S) -> A? 1. Make π stochastic: π(S) -> P(A). 2. Approximate π with a neural net: π(S, θ) -> P(A).
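A minimal sketch of such a stochastic policy, assuming a single linear layer followed by a softmax (a real network would stack more layers); the state and action dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(state, theta):
    """pi(S, theta) -> P(A): turn a state into action probabilities."""
    logits = theta @ state       # theta has shape (n_actions, state_dim)
    return softmax(logits)

def act(state, theta):
    """Sample an action according to the probability bars on the slides."""
    probs = policy(state, theta)
    return rng.choice(len(probs), p=probs)

# Usage: 3 actions (e.g. up / down / stay in Pong), an 8-number state.
theta = rng.normal(scale=0.1, size=(3, 8))
state = rng.normal(size=8)
print(policy(state, theta))      # e.g. [0.31 0.35 0.34], sums to 1
print(act(state, theta))         # a sampled action index: 0, 1 or 2
```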
  • 91. π(Sᵢ) -> action probabilities (bar chart, scale 0 to 1).
  • 92. π(Sᵢ, θ) -> action probabilities (bar chart, scale 0 to 1).
  • 93. π(Sᵢ, θ) -> action probabilities (bar chart, scale 0 to 1).
  • 98. How to find π(Sᵢ, θ) -> P(A)? That is, how to find θ? We need a loss function…
  • 99. A winning episode, step by step: the network's action probabilities (0.0–1.0) at each frame, with R₁ = 0, R₂ = 0, R₃ = +1 at Game Over; ∑ᵢ Rᵢ = +1 😁 👍 👍 👍.
  • 100. For the action Aᵢ taken in state Sᵢ, consider its probability π(Aᵢ | Sᵢ, θ): adjust every parameter θₖ along the gradient Δπ(Aᵢ | Sᵢ, θ) / Δθₖ – upwards after a win 👍, downwards after a loss 👎.
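Put together, this is a REINFORCE-style policy-gradient update: scale the gradient of the taken action's probability by the episode's outcome, so wins 👍 push that probability up and losses 👎 push it down. A sketch reusing the `policy` function above; it uses the gradient of the log-probability rather than the raw Δπ/Δθₖ, which is the standard, numerically convenient form:

```python
import numpy as np

def reinforce_update(theta, episode, lr=0.01):
    """theta_k += lr * R * d log pi(A_i | S_i, theta) / d theta_k.

    `episode` is a list of (state, action, return) triples; in the Pong
    slides the return is +1 for every step of a won episode and -1 for
    every step of a lost one.
    """
    for state, action, ret in episode:
        probs = policy(state, theta)          # pi(. | S_i, theta), from the sketch above
        # Gradient of log softmax with respect to the logits: one_hot(action) - probs.
        grad_logits = -probs
        grad_logits[action] += 1.0
        # Chain rule through the linear layer logits = theta @ state.
        grad_theta = np.outer(grad_logits, state)
        theta += lr * ret * grad_theta        # 👍 raises P(A_i | S_i), 👎 lowers it
    return theta
```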
  • 101. The winning episode again: action probabilities (0.0–1.0) at each frame, R₁ = 0, R₂ = 0, R₃ = +1 at Game Over; ∑ᵢ Rᵢ = +1 😁 👍 👍 👍.
  • 102.
  • 103. Extended Data Figure 2 | Visualization of learned value functions on two games, Breakout and Pong. a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. (Caption continues as above.)
  • 104. (Same figure and caption as slide 103.)
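The value curves in the figure (the peaks for each cleared brick in Breakout, the rise towards +1 in Pong) can be reproduced by evaluating the trained network at every frame of an episode and recording the state value V(s) = max_a Q(s, a). A hedged sketch, reusing the hypothetical q_net from the previous example:

import torch

@torch.no_grad()
def value_trace(q_net, frame_stacks):
    """Predicted state value V(s) = max_a Q(s, a) for each step of an episode."""
    values = []
    for state in frame_stacks:           # each state: tensor of shape (4, 84, 84)
        q = q_net(state.unsqueeze(0))    # add batch dimension -> (1, n_actions)
        values.append(q.max().item())    # highest action value = state value
    return values

# Example with the (untrained) network from the sketch above:
episode = [torch.rand(4, 84, 84) for _ in range(100)]
trace = value_trace(q_net, episode)      # ready to plot against time steps

With a trained network and real game frames, plotting this trace against time would show exactly the anticipation effects the caption describes: the value climbs before a reward actually arrives.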
  • 106. Outline • Current State • Deep Learning • Reinforcement Learning • Conclusions
  • 107. Conclusions 1. Deep Learning: Neural network … 2. 3.