Deep Learning & Reinforcement Learning
Renārs Liepiņš
Lead Researcher, LUMII & LETA
renars.liepins@lumii.lv
At “Riga AI, Machine Learning and Bots”, February 16, 2017
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Source
“Machine learning is a core transformative way by which we are rethinking everything we are doing.”
– Sundar Pichai (CEO, Google), 2015
Why such optimism?
Artificial Intelligence: computer systems able to perform tasks normally requiring human intelligence.
[Chart: image-classification error (%), 2010–2016 — falling from roughly 28–29% with classic computer-vision methods to about 3% with deep learning, surpassing human-level performance.]
Nice, but so what?
First Universal Learning Algorithm
Before Deep Learning
[Figure (Andrew Ng) — Features for machine learning: Images → vision features → Detection; Audio → audio features → Speaker ID; Text → text features → Web search; …]
Source
With Deep Learning
[Diagram (Andrew Ng): “Neurons in the brain” vs. “Deep Learning: Neural network” — layers of connected units leading to an Output.]
Universal Learning Algorithm
Universal Learning Algorithm – Speech Recognition
[Diagram: audio waveform → deep neural network → transcription “_ q u i c k …”]
Source
Universal Learning Algorithm – Translation
[Diagram: “Dzeltens autobuss brauc pa ceļu…” (Latvian) → deep neural network → “A yellow bus driving down…”]
Source
Universal Learning Algorithm – Self driving cars
Source
Universal Learning Algorithm – Image captions
[Diagram: image → deep neural network → caption “A yellow bus driving down…”]
Source
Chinese captions (shown with English glosses):
• “A baseball player getting ready to bat.”
• “A person surfing on the ocean.”
• “A double-decker bus driving on a street.”
Universal Learning Algorithm – X-ray reports
Source
Universal Learning Algorithm – Photo localisation
PlaNet is able to determine the location of almost any image with superhuman ability.
Source
Universal Learning Algorithm – Style Transfer
Source
Universal Learning Algorithm – Semantic Face Transforms
Source
Figure 1 (zoom in for details): Aging a 400x400 face with Deep Feature Interpolation, before and after the artifact removal step, showcasing the quality of the method. In this figure (and no other) a mask was applied to preserve the background. Although the input image was 400x400, all source and target images used in the transformation were only 100x100.
Figure 2 (zoom in for details): An example Deep Feature Interpolation transformation of a test image (Silvio Berlusconi, left) towards six categories — older, mouth open, eyes open, smiling, facial hair, spectacles. Each transformation was performed via linear interpolation in deep feature space composed of pre-trained VGG features.
Universal Learning Algorithm – Lipreading
LipNet - Sentence-level Lipreading
Source
LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders
and the previous 79.6% state-of-the-art accuracy.
Source
Universal Learning Algorithm – Sketch Vectorisation
Source
Universal Learning Algorithm – Handwriting Generation
Source
This LSTM recurrent neural network is able to generate highly realistic
cursive handwriting in a wide variety of styles, simply by predicting one data
point at a time.
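To make “one data point at a time” concrete, here is a minimal sketch of the sampling loop; the DummyHandwritingModel and its predict_next_point method are hypothetical stand-ins, not the real model or API from the demo.

import numpy as np

class DummyHandwritingModel:
    """Stand-in for a trained handwriting LSTM; emits a random pen offset each call."""
    def __init__(self, seed=0):
        self.rng = np.random.RandomState(seed)
    def predict_next_point(self, history):
        dx, dy = self.rng.randn(2) * 0.05          # small pen movement (would come from the model)
        pen_up = float(self.rng.rand() < 0.05)     # occasionally lift the pen
        return (dx, dy, pen_up)

def generate_stroke(model, n_points=200):
    points = [(0.0, 0.0, 1.0)]                     # start: pen down at the origin
    for _ in range(n_points):
        next_point = model.predict_next_point(points)  # predict one data point...
        points.append(next_point)                      # ...and feed it back in as context
    return points

print(generate_stroke(DummyHandwritingModel())[:5])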
Universal Learning Algorithm – Image upscaling
Source
Google – Saving you bandwidth through machine learning
Source
First Universal Learning Algorithm
Not Magic
• Simply downloading and “applying” open-source software won’t work.
• Needs to be customised to your business context and data.
• Needs lots of examples and computing power for training.
Source
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Neuron
[Diagram: biological neuron — cell body, output axon, synapse.]
Source
Artificial Neuron
[Diagram: artificial neuron — weighted inputs are summed and passed through an activation to produce an output.]
Source
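As a minimal sketch of the artificial neuron above (my own illustration, not code from the talk): multiply each input by its weight, sum, add a bias, and squash the result with an activation function.

import numpy as np

def artificial_neuron(x, w, b):
    """Weighted sum of inputs plus bias, squashed by a sigmoid activation."""
    z = np.dot(w, x) + b               # "synapse" strengths times incoming signals
    return 1.0 / (1.0 + np.exp(-z))    # activation: output between 0 and 1

x = np.array([0.5, -1.0, 2.0])         # incoming signals
w = np.array([0.8, 0.2, -0.4])         # learned weights
print(artificial_neuron(x, w, b=0.1))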
What is a neural network?
Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?)
x1, x2, … are vectors of activations; W1 … W4 are the weight matrices connecting the layers.
Training
Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?)
x2 = (W1 × x1) + b1
x3 = (W2 × x2) + b2
…
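Those equations written out as a minimal NumPy sketch, using the slide’s names x1 … x5 and W1 … W4; the layer widths, biases and sigmoid activation are assumptions of mine for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
sizes = [64, 32, 16, 8, 1]                        # assumed widths of layers x1 ... x5
Ws = [rng.randn(m, n) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]  # W1 ... W4
bs = [np.zeros(m) for m in sizes[1:]]

x = rng.randn(sizes[0])                           # x1: flattened input image
for W, b in zip(Ws, bs):                          # x_{k+1} = sigmoid(W_k x_k + b_k)
    x = sigmoid(W.dot(x) + b)
print("x5 (mug or not?):", x)                     # final activation near 0 or 1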
training data:
output    true out    error
0.9       1.0         0.1
0.3       0.0         0.3
0.2       1.0         0.8
Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?)
x2 = (W1 × x1) + b1, …
Error backpropagation: the output error is propagated backwards through the layers to adjust the weights W1 … W4.
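A minimal end-to-end training sketch (my own illustration, not the talk’s code): run a small two-layer network forward on toy data, compare its outputs with the true outputs, and backpropagate the error to adjust every weight a little.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                      # toy "images": 100 samples, 5 features
t = (X.sum(axis=1) > 0).astype(float)      # toy labels: "mug or not"

W1, b1 = rng.randn(8, 5) * 0.1, np.zeros(8)
W2, b2 = rng.randn(8) * 0.1, 0.0
lr = 0.5

for epoch in range(200):
    # forward pass
    h = np.tanh(X.dot(W1.T) + b1)          # hidden layer, shape (100, 8)
    y = sigmoid(h.dot(W2) + b2)            # predicted output, shape (100,)
    error = y - t                          # compare with the training data ("true out")
    loss = np.mean(error ** 2)

    # error backpropagation: chain rule, layer by layer
    d_y = 2 * error / len(X)               # dLoss/dy
    d_z2 = d_y * y * (1 - y)               # back through the sigmoid
    d_W2 = h.T.dot(d_z2)
    d_b2 = d_z2.sum()
    d_h = np.outer(d_z2, W2)               # propagate the error back to the hidden layer
    d_z1 = d_h * (1 - h ** 2)              # back through tanh
    d_W1 = d_z1.T.dot(X)
    d_b1 = d_z1.sum(axis=0)

    # gradient step: adjust every weight to reduce the error
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

print("final mean squared error:", loss)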
What is Deep Learning?
Input → Result (e.g. an image → “cat”)
● Loosely based on (what little) we know about the brain
WHAT MAKES DEEP LEARNING DEEP?
Today’s Largest Networks
• ~10 layers
• 1B parameters
• 10M images
• ~30 exaflops
• ~30 GPU days
Human brain has trillions of parameters – only ~1,000× more.
Demo
http://playground.tensorflow.org/
https://transcranial.github.io/keras-js/
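For a feel of what such a demo trains, here is a minimal Keras sketch of a tiny classifier on a toy two-dimensional dataset (my own example, assuming the keras package is installed; it is not code from either demo).

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# toy 2-D dataset: points inside vs. outside a circle (like the playground's datasets)
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (np.sum(X ** 2, axis=1) < 0.5).astype("float32")

model = Sequential()
model.add(Dense(8, activation="relu", input_dim=2))   # one small hidden layer
model.add(Dense(1, activation="sigmoid"))             # yes/no output
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=30, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))                # [loss, accuracy]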
Why Now?
A Brief History (1943–2017)
• A long time ago…
• 1958 – Perceptron
• 1969 – Perceptron criticized
• 1974 – Backpropagation
• … awkward silence (AI Winter)
• 1995 – SVM reigns
• 1998 – Convolutional Neural Networks for Handwriting Recognition
• 2006 – Restricted Boltzmann Machine
• 2012 – Google Brain Project on 16k cores
• 2012 – AlexNet wins ImageNet
• 2012 onwards – Deep Learning
Why Now?
• Computational Power
• Big Data
• Algorithms
Current Situation
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Learning from Experience
Source
40% reduction in cooling
Source
What is Reinforcement Learning?
Agent ⇄ Environment: at each step i the agent takes an Action (Ai); the environment returns a State (Si) and a Reward (Ri).
Goal: Maximize Accumulated Rewards R1 + R2 + R3 + … = ∑i Ri
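That loop translates almost line-for-line into code. A minimal sketch using the OpenAI Gym interface (assuming the gym package with the Atari extras is installed; Pong is used here because it is the example that follows, and the random agent is just a stand-in for a learned policy).

import gym  # assumes OpenAI Gym with the Atari environments is installed

env = gym.make("Pong-v0")
state = env.reset()                        # S1
total_reward = 0.0                         # accumulated rewards: sum_i R_i
done = False

while not done:
    action = env.action_space.sample()     # A_i -- random placeholder for pi(S_i)
    state, reward, done, _ = env.step(action)   # environment returns S_{i+1}, R_i
    total_reward += reward

print("accumulated reward for this episode:", total_reward)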
Pong Example
• States (S): game screens …
• Actions (A)
• Rewards (R): +1, −1, 0
Agent ⇄ Environment
Goal: Maximize Accumulated Rewards
Reinforcement Agent = Policy Function
π(S) -> A
Pong Example
π(Si) -> Ai
Goal: Maximize Accumulated Rewards
π(S) -> A
Reinforcement Learning Problem
Find a Policy Function π(S) -> A
that maximizes Accumulated Rewards ∑i Ri

How to Find π(S) -> A ?
Reinforcement Learning Algorithms
• Q-Learning
• Actor-Critic methods
• Policy Gradient
Episode
[Pong game frames (screenshots from the Nature DQN paper, Extended Data Figure 2: visualization of learned value functions on Breakout and Pong).]
R1=0   R2=0   R3=+1  😁
Game Over
👍 👍 👍
∑i Ri = +1
Episode
R1=0   R2=0   R3=−1  😭
Game Over
👎 👎 👎
∑i Ri = −1
How to Find π(S) -> A ?
1. Change π to Stochastic: π(S) -> P(A)
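Step 1 in code, as a minimal sketch (the two actions and their probabilities here are made-up numbers): the policy returns a probability for every action and the agent samples one, rather than always taking a single fixed action.

import numpy as np

actions = ["UP", "DOWN"]                  # illustrative Pong paddle moves
action_probs = np.array([0.8, 0.2])       # P(A) returned by the stochastic policy for this screen

action = np.random.choice(actions, p=action_probs)   # sample instead of a deterministic choice
print(action)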
Pong Example
π(Si) -> [bar chart: Action Probability, 0 … 1]
How to Find π(S) -> A ?
1. Change π to Stochastic: π(S) -> P(A)
2. Approximate π with a Neural Net: π(S, θ) -> P(A)
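A minimal sketch of step 2 (my own illustration; the layer sizes and two-action setup are assumptions, not the talk’s actual network): a small network whose parameters are θ maps a flattened game screen to a probability for each action via a softmax.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
n_pixels, n_hidden, n_actions = 80 * 80, 200, 2      # assumed sizes
theta = {
    "W1": rng.randn(n_hidden, n_pixels) * 0.01,      # theta: all trainable parameters
    "W2": rng.randn(n_actions, n_hidden) * 0.01,
}

def policy(state, theta):
    """pi(S, theta) -> P(A): probabilities for UP / DOWN given a flattened screen."""
    h = np.maximum(0, theta["W1"].dot(state))        # hidden layer (ReLU)
    return softmax(theta["W2"].dot(h))

screen = rng.rand(n_pixels)                          # stand-in for a preprocessed Pong frame
print(policy(screen, theta))                         # e.g. [0.51 0.49]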
π(Si, θ) -> [bar chart: Action Probability, 0 … 1]
How to Find π(Si, θ) -> P(A) ?
How to Find θ
Loss Function…
[Frames: one episode — at each step the policy’s action probabilities (0.0 – 1.0) are shown as a bar chart]
R1=0   R2=0   R3=+1   Game Over
∑i Ri = +1  😁  👍 👍 👍
π(Si, θ | Ai) — the probability the current policy assigns to the action Ai that was actually taken in state Si
For each parameter θk, look at how this probability changes as θk changes:  Δ π(Si, θ | Ai) / Δ θk
👍 Episode won: move θk in the direction that increases π(Si, θ | Ai)
👎 Episode lost: move θk in the direction that decreases π(Si, θ | Ai)
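A minimal sketch of that update in code (my own illustration, not the talk’s code): a linear softmax policy for brevity, with the gradient of log π(Ai | Si, θ) computed analytically rather than by perturbing each θk, and scaled by the episode’s total reward.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
n_features, n_actions = 16, 2            # assumed sizes; a linear policy for brevity
W = np.zeros((n_actions, n_features))    # theta
lr = 0.1

def act(state):
    probs = softmax(W.dot(state))                        # pi(S, theta) -> P(A)
    action = rng.choice(n_actions, p=probs)
    return action, probs

# one (fake) episode: random states stand in for preprocessed game screens
episode = []
for _ in range(3):
    state = rng.randn(n_features)
    action, probs = act(state)
    episode.append((state, action, probs))
episode_return = +1.0                    # sum_i R_i: +1 for a won episode, -1 for a lost one

# policy-gradient update: push up the probability of every action taken in a won
# episode, push it down in a lost one (gradient of log pi(A_i | S_i, theta))
for state, action, probs in episode:
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    grad_log_pi = np.outer(one_hot - probs, state)       # d log pi / d W for a softmax policy
    W += lr * episode_return * grad_log_pi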
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
LETTER RESEARCH
0.0
0.2
0.4
0.6
0.8
1.0
R1=0
0.0
0.2
0.4
0.6
0.8
1.0
R2=0
0.0
0.2
0.4
0.6
0.8
1.0
Game
Over
R3=+1
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
paddle controlled by the agent on the right side of the screen and the values of
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
LETTER RESEARCH
Ri
i
n
∑ = +1
😁
👍 👍 👍
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
Reinforcement Learning
Outline
• Current State
• Deep Learning
• Reinforcement Learning
• Conclusions
Conclusions
1.
2.
3.
Deep Learning and Reinforcement Learning
  • 1.
    Deep Learning &Reinforcement Learning Renārs Liepiņš Lead Researcher, LUMII & LETA renars.liepins@lumii.lv At “Riga AI, Machine Learning and Bots”, February 16, 2017
  • 2.
    Outline • Current State •Deep Learning • Reinforcement Learning
  • 3.
    Outline • Current State •Deep Learning • Reinforcement Learning
  • 4.
  • 5.
    Machine learning isa core transformative way by which we are
 rethinking everything we are doing – Sundar Pichai (CEO Google) 2015 Source
  • 6.
  • 7.
  • 8.
  • 9.
    Artificial Intelligence computer systemsable
 to perform tasks normally
 requiring human intelligence
  • 10.
    0 5 10 15 20 25 30 2010 2011 20122013 2014 2015 2016 3.083.57 6.7 11.7 16.4 27.828 29 Classic Deep Learning Human
 Level
  • 12.
  • 15.
  • 16.
  • 17.
    Andrew Ng Features formachine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Before Deep Learning Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Source
  • 18.
    Andrew Ng Features formachine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! With Deep Learning Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Neurons in the brain Output Deep Learning: Neural network
  • 19.
    Universal Learning Algorithm AndrewNgAndrew Ng Neurons in the brain Output Deep Learning: Neural network … …
  • 20.
    A yellow bus
 drivingdown…. Universal Learning Algorithm – Speech Recognition Andrew NgAndrew Ng _ q u i c k … Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 21.
    Universal Learning Algorithm– Translation Dzeltens autobuss brauc pa ceļu…. A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 22.
    Universal Learning Algorithm– Self driving cars Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 23.
    Universal Learning Algorithm Ayellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network
  • 24.
    Data (image) The limitationsof supervise Universal Learning Algorithm – Image captions A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 25.
    Andrew NAndrew N Chinesecaptions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.)
  • 26.
    Universal Learning Algorithm – X-ray reports. Source
  • 27.
    Universal Learning Algorithm – Photo localisation. PlaNet is able to determine the location of almost any image with superhuman ability. Source
  • 28.
    Universal Learning Algorithm – Style Transfer. Source
  • 30.
    Universal Learning Algorithm – Semantic Face Transforms. Source
  • 31.
    Figure: Aging a 400x400 face with Deep Feature Interpolation, before and after the artifact removal step. Example transformations of a test image (Silvio Berlusconi) towards six categories (older, mouth open, eyes open, smiling, facial hair, spectacles), each performed via linear interpolation in deep feature space composed of pre-trained VGG features.
  • 32.
    Universal Learning Algorithm – Lipreading. LipNet – Sentence-level Lipreading. LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy. Source
  • 33.
    Universal Learning Algorithm – Sketch Vectorisation. Source
  • 34.
    Universal Learning Algorithm – Handwriting Generation. Source
  • 35.
    Deep Learning in Computer Vision – Image Generation (Handwriting). This LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.
  • 36.
    Universal Learning Algorithm – Image upscaling. Source
  • 37.
    Google – Saving you bandwidth through machine learning. Source
  • 38.
  • 39.
    Not Magic • Simply downloading and “applying” open-source software won’t work. • Needs to be customised to your business context and data. • Needs lots of examples and computing power for training. Source
  • 41.
    Outline • Current State •Deep Learning • Reinforcement Learning
  • 42.
    Outline • Current State •Deep Learning • Reinforcement Learning
  • 43.
  • 44.
  • 45.
  • 46.
    Deep Learning: Neural network
  • 47.
    What is a neural network? Data (image) -> Yes/No (Mug or not?). Layers x1 x2 x3 x4 x5 connected by weights W1 W2 W3 W4; x1 ∈ R^5, x2 ∈ R^5.
  • 48.
    Training. What is a neural network? Data (image) -> Yes/No (Mug or not?), x1 ∈ R^5, x2 ∈ R^5. Forward pass: x2 = (W1 × x1) + …, x3 = (W2 × x2) + …. Example: output (0.9, 0.3, 0.2) vs. true output (1.0, 0.0, 1.0) from the training data gives error (0.1, 0.3, 0.8); the error is propagated back through the weights W1…W4 (error backpropagation).
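    To make the slide's forward pass and error backpropagation concrete, here is a minimal Python sketch of training a single-layer network on the toy target above; the sigmoid activation, the learning rate and the random input are assumptions for illustration, not the slide's exact setup.

    import numpy as np

    # Minimal sketch: forward pass x2 = f(W1 · x1), compare with the true
    # output, and push the error back into the weights (backpropagation).
    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x1 = rng.normal(size=5)                  # input vector, x1 in R^5
    y_true = np.array([1.0, 0.0, 1.0])       # desired output, as on the slide
    W1 = rng.normal(scale=0.1, size=(3, 5))  # weights to be learned

    for _ in range(2000):
        x2 = sigmoid(W1 @ x1)                # forward pass
        error = x2 - y_true                  # difference between output and true output
        W1 -= 0.5 * np.outer(error * x2 * (1 - x2), x1)  # gradient step on the weights

    print(np.round(sigmoid(W1 @ x1), 2))     # now close to (1.0, 0.0, 1.0)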
  • 49.
  • 50.
  • 51.
    What is Deep Learning? “cat” ● Loosely based on (what little) we know about the brain. WHAT MAKES DEEP LEARNING DEEP? Today’s Largest Networks: ~10 layers, 1B parameters, 10M images, ~30 Exaflops, ~30 GPU days.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
    A brief History. 1943, 1956 – a long time ago…; 1958 Perceptron; 1969 Perceptron criticized; 1974 Backpropagation; awkward silence (AI Winter); 1995 SVM reigns; 1998 Convolutional Neural Networks for Handwritten Recognition; 2006 Restricted Boltzmann Machine; 2012 Google Brain Project on 16k Cores; 2012 AlexNet wins ImageNet; 2012–2017 Deep Learning.
  • 57.
  • 58.
    Computational Power, Big Data, Algorithms. Deep Learning: Neural network
  • 60.
  • 61.
    Outline • Current State • Deep Learning • Reinforcement Learning
  • 62.
    Outline • Current State • Deep Learning • Reinforcement Learning. Learning from Experience
  • 65.
  • 66.
  • 67.
    What is Reinforcement Learning? Agent / Environment: Action (A1), State (S1), Reward (R1)
  • 68.
    What is Reinforcement Learning? Agent / Environment: Action (A2), State (S2), Reward (R2)
  • 69.
    What is Reinforcement Learning? Agent / Environment: Action (Ai), State (Si), Reward (Ri)
  • 70.
    What is Reinforcement Learning? Agent / Environment: Action (Ai), State (Si), Reward (Ri). Goal: Maximize Accumulated Rewards R1 + R2 + R3 + … = ∑i Ri
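    As a sketch of this loop in code (the Environment and Agent classes below are toy placeholders written for illustration, not a particular library's API), the agent repeatedly observes a state, picks an action, receives a reward, and the rewards are summed into ∑i Ri:

    import random

    class Environment:                        # toy stand-in for the game
        def reset(self):
            self.t = 0
            return 0                          # initial state S1
        def step(self, action):
            self.t += 1
            reward = 1 if action == "up" else 0   # made-up reward signal
            done = self.t >= 10                   # episode ends after 10 steps
            return self.t, reward, done           # next state, reward, game over?

    class Agent:                              # toy stand-in for the policy
        def act(self, state):
            return random.choice(["up", "down"])

    env, agent = Environment(), Agent()
    state, total_reward, done = env.reset(), 0, False
    while not done:
        action = agent.act(state)             # A_i
        state, reward, done = env.step(action)
        total_reward += reward                # accumulate sum_i R_i
    print("Accumulated reward:", total_reward)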
  • 71.
    Pong Example. States (S): …, Actions (A), Rewards (R): +1, -1, 0. Agent / Environment. Goal: Maximize Accumulated Rewards
  • 72.
    Reinforcement Agent
  • 73.
    Reinforcement Agent = Policy Function: π(S) -> A
  • 74.
  • 75.
    Pong Example: π(Si) -> Ai
  • 76.
    Pong Example: π(Si) -> Ai
  • 77.
    Pong Example. States (S): …, Actions (A), Rewards (R): +1, -1, 0. Agent / Environment. Goal: Maximize Accumulated Rewards. π(S) -> A
  • 78.
    Reinforcement Learning Problem: Find a Policy Function π(S) -> A that maximizes the Accumulated Rewards ∑i Ri
  • 79.
  • 80.
    Reinforcement Learning Algorithms • Q-Learning • Actor-Critic methods • Policy Gradient
  • 81.
    Reinforcement Learning Algorithms • Q-Learning • Actor-Critic methods • Policy Gradient
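    Of these families, Q-Learning is the easiest to show in a few lines. The sketch below is the generic tabular Q-learning update on a made-up three-state chain; it is illustrative only, not the deep-network variant behind the Atari results shown earlier.

    import random
    from collections import defaultdict

    # Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    alpha, gamma, actions = 0.1, 0.9, ["left", "right"]
    Q = defaultdict(float)

    def step(state, action):                  # toy chain: reach state 2 for reward 1
        next_state = min(state + 1, 2) if action == "right" else max(state - 1, 0)
        return next_state, (1 if next_state == 2 else 0), next_state == 2

    for _ in range(500):                      # episodes
        state, done = 0, False
        while not done:
            if random.random() < 0.1:         # epsilon-greedy exploration
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    print({k: round(v, 2) for k, v in Q.items()})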
  • 82.
    Episode: R1 = 0, R2 = 0, R3 = +1. Game Over 😁 👍 👍 👍 ∑i Ri = +1
  • 83.
    Episode: R1 = 0, R2 = 0, R3 = -1. Game Over 😭 👎 👎 👎 ∑i Ri = -1
  • 84.
  • 85.
    How to Find π(S) -> A ?  1. Change π to Stochastic: π(S) -> P(A)
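A minimal sketch of step 1, assuming a discrete action set: a stochastic policy returns a probability for each action, and acting means sampling from that distribution (the action names below are illustrative, not from the talk):

```python
import random

# Minimal sketch (illustrative, not from the talk): a stochastic policy gives a
# probability for each action; acting means sampling one action from P(A).
def sample_action(action_probs):
    """Draw one action index according to the distribution P(A)."""
    actions = list(range(len(action_probs)))
    return random.choices(actions, weights=action_probs, k=1)[0]

# e.g. P(A) over three hypothetical Pong actions [stay, up, down]
print(sample_action([0.25, 0.70, 0.05]))
```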
  • 86.
  • 87.
    π( ) -> Action Probability   [bar chart: 0  0.25  0.5  0.75  1]
  • 88.
    π( ) -> Action Probability   [bar chart: 0  0.25  0.5  0.75  1]
  • 89.
    π( ) -> Action Probability   [bar chart: 0  0.25  0.5  0.75  1]
  • 90.
    How to Find π(S) -> A ?  1. Change π to Stochastic: π(S) -> P(A)  2. Approximate π with a Neural Net: π(S, θ) -> P(A)
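A minimal sketch of step 2, assuming a one-layer softmax network (not the speaker's actual architecture): the parameters θ are a weight matrix, and π(S, θ) returns one probability per action:

```python
import numpy as np

# Minimal sketch (assumed architecture): π(S, θ) -> P(A) as a single linear
# layer followed by a softmax, so the output probabilities sum to 1.
def policy(state, theta):
    """state: feature vector (n_features,), theta: weights (n_features, n_actions)."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()                # P(A): one probability per action

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))           # 4 state features, 3 actions
print(policy(np.array([0.1, -0.2, 0.5, 1.0]), theta))
```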
  • 91.
    π( Si ) -> Action Probability   [bar chart: 0  0.25  0.5  0.75  1]
  • 92.
    π( Si , θ ) -> Action Probability   [bar chart: 0  0.25  0.5  0.75  1]
  • 93.
    π( Si ,  ) -> Action Probability   [bar chart: 0  0.25  0.5  0.75  1]
  • 94.
  • 95.
  • 96.
  • 97.
  • 98.
    How to Find π(Si, θ) -> P(A) ?  How to Find θ ?  Loss Function…
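One common choice of loss function for this setting, sketched here under the assumption of a REINFORCE-style objective (the talk does not spell out the exact form): weight the log-probability of each action taken by the episode return, so actions from rewarded episodes become more likely:

```python
import numpy as np

# Minimal sketch (assumed REINFORCE-style loss, not stated in the talk): lower
# loss means higher log-probability of the actions taken, weighted by the
# episode return, so winning episodes pull their actions up and losing
# episodes push them down.
def episode_loss(action_probs_per_step, actions_taken, episode_return):
    log_probs = [np.log(p[a]) for p, a in zip(action_probs_per_step, actions_taken)]
    return -episode_return * np.mean(log_probs)

probs = [np.array([0.25, 0.70, 0.05]), np.array([0.10, 0.80, 0.10])]
print(episode_loss(probs, actions_taken=[1, 1], episode_return=+1.0))
```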
  • 99.
    Episode: [action probability bars 0.0–1.0] R1 = 0, [bars] R2 = 0, [bars] Game Over R3 = +1   ∑ᵢ Rᵢ = +1 😁 👍 👍 👍
  • 100.
    π(Si , θ | Ai)   θk   [bar chart: 0.0–1.0]   👍 👎   Δπ(Si , θ | Ai) / Δθk
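Putting the pieces together, a minimal finite-difference sketch of the update on this slide (an assumption of how it might look in code, not the speaker's implementation): estimate Δπ(Si, θ | Ai)/Δθk for every parameter and move θ in that direction, scaled by the episode return:

```python
import numpy as np

# Minimal sketch (assumption, not the speaker's code): nudge every parameter
# θ_k so that the taken action A_i becomes more probable after a winning
# episode (👍) and less probable after a losing one (👎), using a
# finite-difference estimate of Δπ(S_i, θ | A_i) / Δθ_k.
def softmax_policy(state, theta):
    z = state @ theta
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(state, action, theta, episode_return, lr=0.01, eps=1e-4):
    base = softmax_policy(state, theta)[action]
    grad = np.zeros_like(theta)
    for k in np.ndindex(theta.shape):          # loop over every parameter θ_k
        theta_plus = theta.copy()
        theta_plus[k] += eps
        grad[k] = (softmax_policy(state, theta_plus)[action] - base) / eps
    return theta + lr * episode_return * grad  # up for 👍 episodes, down for 👎

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))
s = np.array([0.1, -0.2, 0.5, 1.0])
theta = reinforce_step(s, action=1, theta=theta, episode_return=+1.0)
print(softmax_policy(s, theta))                # probability of action 1 nudged up
```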
  • 101.
    Episode: [action probability bars 0.0–1.0] R1 = 0, [bars] R2 = 0, [bars] Game Over R3 = +1   ∑ᵢ Rᵢ = +1 😁 👍 👍 👍
  • 103.
    [Three action-probability bar charts, axis 0.0–1.0]
  • 104.
    [Three action-probability bar charts, axis 0.0–1.0]
  • 105.
  • 106.
    Outline • Current State • Deep Learning • Reinforcement Learning • Conclusions
  • 107.
    Conclusions  1. Deep Learning: Neural network  2.  3.