Deep Learning & Reinforcement Learning
Renārs Liepiņš
Lead Researcher, LUMII & LETA
renars.liepins@lumii.lv
At “Riga AI, Machine Learning and Bots”, February 16, 2017
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Source
“Machine learning is a core, transformative way by which we are rethinking everything we are doing.”
– Sundar Pichai (CEO, Google), 2015
Why such optimism?
Artificial Intelligence: computer systems able to perform tasks normally requiring human intelligence
[Chart: ImageNet classification error (%) by year. Classic methods stayed around 28% (2010–2011); deep learning then drove error down: 16.4 (2012), 11.7 (2013), 6.7 (2014), 3.57 (2015), 3.08 (2016), dropping below the marked human level.]
Nice, but so what?
First Universal Learning Algorithm
Before Deep Learning
Features for machine learning (Andrew Ng)
Images → vision features → detection
Audio → audio features → speaker ID
Text → text features → web search
…
Source
With Deep Learning
[Diagram: a deep neural network, loosely modeled on neurons in the brain, maps raw input directly to output (Andrew Ng)]
Universal Learning Algorithm
Image → neural network → “A yellow bus driving down….”
Universal Learning Algorithm – Speech Recognition
Audio → neural network → “_ q u i c k …”
Source
Universal Learning Algorithm – Translation
“Dzeltens autobuss brauc pa ceļu….” (Latvian) → neural network → “A yellow bus driving down….”
Source
Universal Learning Algorithm – Self-Driving Cars
Source
Universal Learning Algorithm – Image Captions
Image → neural network → “A yellow bus driving down….”
Source
Chinese captions (English glosses shown; Andrew Ng)
(A baseball player getting ready to bat.)
(A person surfing on the ocean.)
(A double-decker bus driving on a street.)
Universal Learning Algorithm – X-Ray Reports
Source
Universal Learning Algorithm – Photo Localisation
PlaNet is able to determine the location of almost any image with superhuman ability.
Source
Universal Learning Algorithm – Style Transfer
Source
Universal Learning Algorithm – Semantic Face Transforms
[Figures from the Deep Feature Interpolation paper. Figure 1: aging a 400x400 face, shown before and after the artifact removal step; a background-preserving mask was applied in that figure only, and the source and target images used in the transformation were 100x100. Figure 2: a test image (Silvio Berlusconi) transformed towards six categories – older, mouth open, eyes open, smiling, facial hair, spectacles – via linear interpolation in a deep feature space of pre-trained VGG features.]
Source
Universal Learning Algorithm – Lipreading
LipNet – Sentence-level Lipreading
Source
LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.
Source
Universal Learning Algorithm – Sketch Vectorisation
Source
Universal Learning Algorithm – Handwriting Generation
Source
This LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.
Universal Learning Algorithm – Image Upscaling
Source
Google – Saving you bandwidth through machine learning
Source
First Universal Learning Algorithm
Not Magic
• Simply downloading and “applying” open-source software won’t work.
• It needs to be customised to your business context and data.
• It needs lots of examples and computing power for training.
Source
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Neuron
[Diagram: cell body, output axon, synapse]
Source
Artificial Neuron
Source
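As a minimal sketch of the artificial neuron above: a weighted sum of inputs plus a bias, passed through a nonlinearity (sigmoid is one common choice; all numbers here are made up for illustration, not taken from the slides):

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum of inputs (the "cell body"), squashed by an activation
    # function whose result travels down the "output axon".
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))  # sigmoid activation

x = np.array([0.5, -1.0, 2.0])  # signals arriving at the synapses
w = np.array([0.8, 0.2, -0.4])  # synaptic weights (illustrative)
print(neuron(x, w, b=0.1))      # a single activation in (0, 1)
```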
What is a neural network?
Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (mug or not?)
x1 ∈ ℝ⁵, x2 ∈ ℝ⁵, …; layers connected by weight matrices W1, W2, W3, W4
Training
Forward pass: x2 = (W1 × x1)+, x3 = (W2 × x2)+, … through W1…W4, from Data (image) to Yes/No (mug or not?), where (z)+ is the positive part, max(0, z).

On the training data:
output | true out | error
0.9    | 1.0      | 0.1
0.3    | 0.0      | 0.3
0.2    | 1.0      | 0.8
error backpropagation
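To make the forward pass and the error column concrete, here is a minimal NumPy sketch. It assumes (z)+ denotes the rectifier max(0, z) and uses made-up layer sizes and data; it illustrates the slide’s idea, not the presenter’s actual code. Backpropagation would then push each error back through W4…W1 to adjust the weights.

```python
import numpy as np

def relu(z):
    # (z)+ : the positive part, max(0, z)
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Weight matrices W1..W4 connecting layers x1..x5
# (layer sizes are illustrative assumptions).
sizes = [5, 8, 8, 4, 1]
W = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for Wi in W:
        x = relu(Wi @ x)            # x_{i+1} = (W_i x_i)+
    return x

x1 = rng.normal(size=5)             # x1 in R^5, a stand-in for an image
output = forward(x1)
true_out = np.array([1.0])          # label: "mug" = 1.0
error = np.abs(true_out - output)   # per-example error, as in the table
print("output:", output, "error:", error)
```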
What is Deep Learning?
Input → Result (“cat”)
● Loosely based on (what little) we know about the brain
WHAT MAKES DEEP LEARNING DEEP?
Today’s largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU-days of training.
The human brain has trillions of parameters – only about 1,000× more.
Demo
http://playground.tensorflow.org/
https://transcranial.github.io/keras-js/
Why Now?
A Brief History (1943–2017)
A long time ago… (1943, 1956)
1958 – Perceptron
1969 – Perceptron criticized
1974 – Backpropagation
…awkward silence (AI Winter)
1995 – SVM reigns
1998 – Convolutional Neural Networks for Handwriting Recognition
2006 – Restricted Boltzmann Machine
2012 – Google Brain project on 16k cores; AlexNet wins ImageNet
Deep Learning
Why Now?
• Computational Power
• Big Data
• Algorithms
Current Situation
Outline
• Current State
• Deep Learning
• Reinforcement Learning
Learning from Experience
Source
40% reduction in cooling
Source
What is Reinforcement Learning?
Agent ⇄ Environment
At each step i the agent observes State (Si), takes Action (Ai), and receives Reward (Ri).
Goal: maximize accumulated rewards R1 + R2 + R3 + … = ∑i Ri
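The loop above, as a minimal Python sketch. The Environment class is a toy stand-in with hypothetical reset/step methods (shaped like the common Gym-style interface), and the agent picks actions at random; only the state → action → reward cycle and the running reward sum are the point.

```python
import random

class Environment:
    """Toy stand-in: a three-step episode with a random final reward."""
    def reset(self):
        self.t = 0
        return self.t                 # initial state S1
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        reward = random.choice([+1, -1]) if done else 0
        return self.t, reward, done   # next state, reward R_i, episode over?

env = Environment()
state = env.reset()
total, done = 0, False
while not done:
    action = random.choice(["up", "down"])  # Agent: choose A_i given S_i
    state, reward, done = env.step(action)  # Environment: return S_{i+1}, R_i
    total += reward                         # accumulate sum_i R_i
print("accumulated reward:", total)
```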
Pong Example
States (S): game frames …
Actions (A): move the paddle up or down
Rewards (R): +1, −1, or 0
Agent ⇄ Environment
Reinforcement Agent = Policy Function
π(S) → A
Policy Function: π(Si) = Ai
Pong Example: π(game frame Si) → action Ai
Reinforcement Learning Problem
Find a policy function π(S) → A that maximizes the accumulated rewards ∑i Ri.
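To make “accumulated rewards” concrete, a tiny helper that computes an episode’s return. The discount factor gamma is a common practical addition not shown on the slides; gamma = 1.0 reproduces the plain sum ∑i Ri.

```python
def episode_return(rewards, gamma=1.0):
    # sum_i gamma**i * R_i, computed back-to-front;
    # gamma = 1.0 is the plain sum used on the slide.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(episode_return([0, 0, +1]))  # winning episode ->  1.0
print(episode_return([0, 0, -1]))  # losing episode  -> -1.0
```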
How to find π(S) → A?
Reinforcement Learning Algorithms
• Q-Learning
• Actor-Critic methods
• Policy Gradient
Episode
[Pong frames from the DQN paper (Mnih et al., Nature 2015)]
R1 = 0   R2 = 0   R3 = +1   Game Over   😁
👍 👍 👍
∑i Ri = +1
Episode
[Pong frames from the DQN paper (Mnih et al., Nature 2015)]
R1 = 0   R2 = 0   R3 = −1   Game Over   😭
👎 👎 👎
∑i Ri = −1
How to find π(S) → A?
1. Change π to stochastic: π(S) → P(A)
Pong Example
π(game frame) → [bar chart: action probabilities, scale 0 to 1]
2. Approximate π with a neural network:
π(S, θ) → P(A)
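A minimal sketch of both steps: a policy with parameters θ that maps a state to a probability distribution over actions via softmax, then samples an action (rather than always taking the most probable one). The single weight matrix and the four-feature state are illustrative assumptions, not the presenter’s architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["up", "down"]
theta = rng.normal(scale=0.1, size=(len(ACTIONS), 4))  # parameters θ

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def policy(state, theta):
    # π(S, θ) -> P(A): a distribution over actions, not one action
    return softmax(theta @ state)

state = rng.normal(size=4)       # stand-in for a preprocessed game frame
p = policy(state, theta)
action = rng.choice(len(ACTIONS), p=p)  # stochastic: sample from P(A)
print(dict(zip(ACTIONS, np.round(p, 3))), "->", ACTIONS[action])
```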
π(Si, θ) → [bar chart: action probabilities, scale 0 to 1]
How to find π(Si, θ) → P(A)?
How to find θ?
Loss function…
[An episode under the stochastic policy: at each step the network outputs action probabilities (bars from 0.0 to 1.0); rewards R1 = 0, R2 = 0, then Game Over with R3 = +1. Pong frames from the DQN paper.]
∑i Ri = +1   😁
👍 👍 👍
π(Ai | Si, θ): the probability the policy assigns to the action Ai actually taken in state Si.
For every parameter θk, look at how that probability changes:
Δ π(Ai | Si, θ) / Δ θk
and nudge θk to make the taken actions more probable after a win 👍 and less probable after a loss 👎.
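This update rule is essentially the REINFORCE policy gradient. Below is a toy sketch under two stated assumptions: the policy is the linear-softmax one from the earlier sketch, and, as standard implementations do, it follows Δ log π(Ai | Si, θ) / Δθ, which points in the same direction as Δπ/Δθ rescaled by 1/π. The whole-episode reward sum scales the step, so 👍 episodes raise and 👎 episodes lower the probability of the actions that were taken.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES, ALPHA = 2, 4, 0.1
theta = rng.normal(scale=0.1, size=(N_ACTIONS, N_FEATURES))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(state, action, theta):
    # For a linear-softmax policy:
    # d log pi(a|s,theta) / d theta = outer(one_hot(a) - pi(.|s), s)
    p = softmax(theta @ state)
    return np.outer(np.eye(N_ACTIONS)[action] - p, state)

# One episode of (state, action) pairs plus rewards (all illustrative).
states = [rng.normal(size=N_FEATURES) for _ in range(3)]
actions = [int(rng.choice(N_ACTIONS, p=softmax(theta @ s))) for s in states]
rewards = [0, 0, +1]                 # a winning episode
G = sum(rewards)                     # sum_i R_i for the episode

# REINFORCE: theta <- theta + alpha * G * d log pi / d theta, per step
for s, a in zip(states, actions):
    theta += ALPHA * G * grad_log_pi(s, a, theta)
```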
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
thegame Breakout.At time points 1 and 2, thestate value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
LETTER RESEARCH
0.0
0.2
0.4
0.6
0.8
1.0
R1=0
0.0
0.2
0.4
0.6
0.8
1.0
R2=0
0.0
0.2
0.4
0.6
0.8
1.0
Game
Over
R3=+1
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
LETTER RESEARCH
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
paddle controlled by the agent on the right side of the screen and the values of
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
LETTER RESEARCH
Ri
i
n
∑ = +1
😁
👍 👍 👍
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on
the game Breakout.At time points 1 and 2, the state value is predicted to be ,17
and the agent is clearing the bricks at the lowest level. Each of the peaks in
the value function curve corresponds to a reward obtained by clearing a brick.
At time point 3, the agent is about to break through to the top level of bricks and
the value increases to ,21 in anticipation of breaking out and clearing a
large set of bricks. At point 4, the value is above 23 and the agent has broken
through. After this point, the ball will bounce at the upper part of the bricks
clearing many of them by itself. b, A visualization of the learned action-value
function on the game Pong. At time point 1, the ball is moving towards the
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of 21. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note,
the dashed line shows the past trajectory of the ball purely for illustrative
purposes (that is, not shown during the game). With permission from Atari
Interactive, Inc.
Reinforcement Learning
Outline
• Current State
• Deep Learning
• Reinforcement Learning
• Conclusions
Conclusions
1. [figure: “Deep Learning: Neural network” diagram, after Andrew Ng]
2. …
3. …
Deep Learning and Reinforcement Learning

  • 1. Deep Learning & Reinforcement Learning Renārs Liepiņš Lead Researcher, LUMII & LETA renars.liepins@lumii.lv At “Riga AI, Machine Learning and Bots”, February 16, 2017
  • 2. Outline • Current State • Deep Learning • Reinforcement Learning
  • 3. Outline • Current State • Deep Learning • Reinforcement Learning
  • 5. Machine learning is a core transformative way by which we are
 rethinking everything we are doing – Sundar Pichai (CEO Google) 2015 Source
  • 9. Artificial Intelligence computer systems able
 to perform tasks normally
 requiring human intelligence
  • 10. 0 5 10 15 20 25 30 2010 2011 2012 2013 2014 2015 2016 3.083.57 6.7 11.7 16.4 27.828 29 Classic Deep Learning Human
 Level
  • 11.
  • 13.
  • 14.
  • 15. Nice, but so what?
  • 17. Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Before Deep Learning Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! Source
  • 18. Andrew Ng Features for machine learning Image! Vision features! Detection! Images! Audio! Audio features! Speaker ID! Audio! Text! Text! Text features! Web search! …! With Deep Learning Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Neurons in the brain Output Deep Learning: Neural network
  • 19. Universal Learning Algorithm Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network … …
  • 20. A yellow bus
 driving down…. Universal Learning Algorithm – Speech Recognition Andrew NgAndrew Ng _ q u i c k … Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 21. Universal Learning Algorithm – Translation Dzeltens autobuss brauc pa ceļu…. A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 22. Universal Learning Algorithm – Self driving cars Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 23. Universal Learning Algorithm A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network
  • 24. Data (image) The limitations of supervise Universal Learning Algorithm – Image captions A yellow bus
 driving down…. Andrew NgAndrew Ng Neurons in the brain Output Deep Learning: Neural network Source
  • 25. Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.) Andrew NAndrew N Chinese captions (A baseball player getting ready to bat.) (A person surfing on the ocean.) (A double-decker bus driving on a street.)
  • 26. Universal Learning Algorithm – X-ray reports. Source
  • 27. Universal Learning Algorithm – Photo localisation: PlaNet is able to determine the location of almost any image with superhuman ability. Source
  • 28. Universal Learning Algorithm – Style Transfer. Source
  • 29.
  • 30. Universal Learning Algorithm – Semantic Face Transforms (Deep Feature Interpolation: older, mouth open, eyes open, smiling, …). Source
  • 31. Figure 1 (Deep Feature Interpolation paper): aging a 400×400 face with Deep Feature Interpolation, before and after the artifact-removal step; a mask was applied to preserve the background, and all source and target images used in the transformation were only 100×100. Figure 2: an example transformation of a test image (Silvio Berlusconi) towards six categories – older, mouth open, eyes open, smiling, facial hair, spectacles – each performed via linear interpolation in a deep feature space of pre-trained VGG features.
  • 32. Universal Learning Algorithm – Lipreading: LipNet (sentence-level lipreading) achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy. Source
  • 33. Universal Learning Algorithm – Sketch Vectorisation. Source
  • 34. Universal Learning Algorithm – Handwriting Generation. Source
  • 35. Image generation – handwriting: this LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.
  • 36. Universal Learning Algorithm – Image upscaling. Source
  • 37. Google – Saving you bandwidth through machine learning Source
  • 39. Not Magic • Simply downloading and "applying" open-source software won't work. • It needs to be customised to your business context and data. • It needs lots of examples and computing power for training. Source
  • 40.
  • 41. Outline • Current State • Deep Learning • Reinforcement Learning
  • 42. Outline • Current State • Deep Learning • Reinforcement Learning
  • 46. Deep Learning: Neural network.
  • 47. What is a neural network? Data (an image) enters as a vector x₁ ∈ ℝ⁵ (and likewise x₂ ∈ ℝ⁵ at the next layer) and flows through weight matrices W₁, W₂, W₃, W₄: x₁ → x₂ → x₃ → x₄ → x₅ → Yes/No (mug or not?).
  • 48. Training: each layer computes x₂ = (W₁x₁)₊, x₃ = (W₂x₂)₊, …, where (·)₊ is the rectifier. On a training example the network outputs (0.9, 0.3, 0.2) while the true outputs are (1.0, 0.0, 1.0), giving errors (0.1, 0.3, 0.8); these errors are propagated backwards through the layers (error backpropagation) to adjust the weights.
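To make the forward pass and error backpropagation above concrete, here is a minimal NumPy sketch, assuming a tiny two-layer network with the rectifier (·)₊ as the nonlinearity; the layer sizes, the sigmoid output, and the learning rate are illustrative choices, not taken from the slides.

```python
import numpy as np

def relu(z):
    # The (.)+ rectifier from the slide: keep positive parts, zero the rest.
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 5))  # first-layer weights (hidden size 4 is illustrative)
W2 = rng.normal(scale=0.1, size=(1, 4))  # second-layer weights -> one "mug or not" score

def forward(x1):
    x2 = relu(W1 @ x1)                   # x2 = (W1 x1)+
    score = W2 @ x2                      # raw score for "mug"
    return x2, score

def train_step(x1, y_true, lr=0.1):
    """One training step: forward pass, error, error backpropagation."""
    global W1, W2
    x2, score = forward(x1)
    y_pred = 1.0 / (1.0 + np.exp(-score))   # squash the score to a probability
    err = y_pred - y_true                   # output minus true output
    # Backpropagate the error through each layer (chain rule); err is also the
    # gradient of the binary cross-entropy loss with respect to the raw score.
    grad_W2 = np.outer(err, x2)
    grad_x2 = W2.T @ err
    grad_W1 = np.outer(grad_x2 * (x2 > 0), x1)  # rectifier passes gradient only where x2 > 0
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return float(err[0])

# Usage: one image flattened to 5 numbers, labelled "mug" (1.0).
x = rng.normal(size=5)
for _ in range(100):
    e = train_step(x, 1.0)
print(e)  # the error shrinks towards 0 as the weights adjust
```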
  • 49. Features for machine learning (recap): Image → vision features → detection; Audio → audio features → speaker ID; …
  • 51. What is Deep Learning? A network that maps, for example, an image to "cat" – loosely based on (what little) we know about the brain. What makes deep learning deep? Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU-days; the human brain has trillions of parameters.
  • 52. Demo
  • 56. A brief history (1943–2017): 1943–1956: a long time ago…; 1958: Perceptron; 1969: Perceptron criticized; 1974: backpropagation, then awkward silence (AI Winter); 1995: SVM reigns; 1998: convolutional neural networks for handwriting recognition; 2006: Restricted Boltzmann Machines; 2012: Google Brain project on 16k cores; 2012: AlexNet wins ImageNet; deep learning era since 2012.
  • 58. Three ingredients behind deep learning's rise: computational power, big data, and algorithms.
  • 59.
  • 61. Outline • Current State • Deep Learning • Reinforcement Learning
  • 62. Outline • Current State • Deep Learning • Reinforcement Learning – Learning from Experience
  • 63.
  • 64.
  • 67. What is Reinforcement Learning? Agent ↔ Environment: action (A₁), state (S₁), reward (R₁).
  • 68. What is Reinforcement Learning? Agent ↔ Environment: action (A₂), state (S₂), reward (R₂).
  • 69. What is Reinforcement Learning? Agent ↔ Environment: action (Aᵢ), state (Sᵢ), reward (Rᵢ).
  • 70. What is Reinforcement Learning? Agent ↔ Environment: action (Aᵢ), state (Sᵢ), reward (Rᵢ). Goal: maximize the accumulated rewards R₁ + R₂ + R₃ + … = ∑ᵢ Rᵢ.
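The loop in the diagram can be written down directly. A minimal sketch, assuming hypothetical `env` and `agent` objects with the `reset`/`step`/`act` interfaces named in the docstring (these names are illustrative, not from the slides):

```python
def run_episode(env, agent):
    """Play one episode and return the accumulated reward sum_i R_i.

    Assumed interfaces (hypothetical): env.reset() -> state,
    env.step(action) -> (next_state, reward, done), agent.act(state) -> action.
    """
    state = env.reset()                      # S_1
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)               # A_i: the agent's choice in state S_i
        state, reward, done = env.step(action)  # environment answers with S_{i+1}, R_i
        total_reward += reward                  # accumulate sum_i R_i
    return total_reward
```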
  • 71. Pong example: states (S) are game frames …; actions (A); rewards (R): +1, −1, 0. Agent ↔ Environment. Goal: maximize the accumulated rewards.
  • 72. Reinforcement Agent.
  • 73. Reinforcement Agent = Policy Function: π(S) -> A.
  • 75. Pong example: π(Sᵢ) -> Aᵢ.
  • 76. Pong example: π(Sᵢ) -> Aᵢ.
  • 77. Pong example: states (S) are game frames …; actions (A); rewards (R): +1, −1, 0; π(S) -> A. Goal: maximize the accumulated rewards.
  • 78. The Reinforcement Learning problem: find a policy function π(S) -> A that maximizes the accumulated rewards ∑ᵢ Rᵢ.
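The objective itself is easy to estimate: play many episodes and average the accumulated rewards. A short sketch building on the `run_episode` function above; the episode count is an arbitrary illustrative choice:

```python
def expected_return(env, agent, episodes=100):
    """Monte Carlo estimate of the RL objective E[sum_i R_i].

    Finding a good policy means searching for the agent that
    makes this number as large as possible."""
    returns = [run_episode(env, agent) for _ in range(episodes)]
    return sum(returns) / len(returns)
```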
  • 80. Reinforcement Learning Algorithms • Q-Learning • Actor-Critic methods • Policy Gradient
  • 81. Reinforcement Learning Algorithms • Q-Learning • Actor-Critic methods • Policy Gradient
  • 82. Episode: R₁ = 0, R₂ = 0, R₃ = +1 😁 Game Over; every step of the winning episode is marked 👍 👍 👍; ∑ᵢ Rᵢ = +1.
  • 83. Episode: R₁ = 0, R₂ = 0, R₃ = −1 😭 Game Over; every step of the losing episode is marked 👎 👎 👎; ∑ᵢ Rᵢ = −1.
  • 85. How to find π(S) -> A? 1. Make π stochastic: π(S) -> P(A).
  • 87. π(game frame) -> action probabilities (bar chart, scale 0, 0.25, 0.5, 0.75, 1).
  • 88. π(game frame) -> action probabilities (bar chart, scale 0, 0.25, 0.5, 0.75, 1).
  • 89. π(game frame) -> action probabilities (bar chart, scale 0, 0.25, 0.5, 0.75, 1).
  • 90. How to find π(S) -> A? 1. Make π stochastic: π(S) -> P(A). 2. Approximate π with a neural net: π(S, θ) -> P(A).
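A minimal sketch of such a stochastic policy, assuming a single linear layer followed by a softmax (a real network would stack more layers); the state and action dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(state, theta):
    """pi(S, theta) -> P(A): turn a state into action probabilities."""
    logits = theta @ state       # theta has shape (n_actions, state_dim)
    return softmax(logits)

def act(state, theta):
    """Sample an action according to the probability bars on the slides."""
    probs = policy(state, theta)
    return rng.choice(len(probs), p=probs)

# Usage: 3 actions (e.g. up / down / stay in Pong), an 8-number state.
theta = rng.normal(scale=0.1, size=(3, 8))
state = rng.normal(size=8)
print(policy(state, theta))      # e.g. [0.31 0.35 0.34], sums to 1
print(act(state, theta))         # a sampled action index: 0, 1 or 2
```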
  • 91. π(Sᵢ) -> action probabilities (bar chart, scale 0 to 1).
  • 92. π(Sᵢ, θ) -> action probabilities (bar chart, scale 0 to 1).
  • 93. π(Sᵢ, θ) -> action probabilities (bar chart, scale 0 to 1).
  • 98. How to find π(Sᵢ, θ) -> P(A)? That is, how to find θ? We need a loss function…
  • 99. A winning episode, step by step: the network's action probabilities (0.0–1.0) at each frame, with R₁ = 0, R₂ = 0, R₃ = +1 at Game Over; ∑ᵢ Rᵢ = +1 😁 👍 👍 👍.
  • 100. For the action Aᵢ taken in state Sᵢ, consider its probability π(Aᵢ | Sᵢ, θ): adjust every parameter θₖ along the gradient Δπ(Aᵢ | Sᵢ, θ) / Δθₖ – upwards after a win 👍, downwards after a loss 👎.
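Put together, this is a REINFORCE-style policy-gradient update: scale the gradient of the taken action's probability by the episode's outcome, so wins 👍 push that probability up and losses 👎 push it down. A sketch reusing the `policy` function above; it uses the gradient of the log-probability rather than the raw Δπ/Δθₖ, which is the standard, numerically convenient form:

```python
import numpy as np

def reinforce_update(theta, episode, lr=0.01):
    """theta_k += lr * R * d log pi(A_i | S_i, theta) / d theta_k.

    `episode` is a list of (state, action, return) triples; in the Pong
    slides the return is +1 for every step of a won episode and -1 for
    every step of a lost one.
    """
    for state, action, ret in episode:
        probs = policy(state, theta)          # pi(. | S_i, theta), from the sketch above
        # Gradient of log softmax with respect to the logits: one_hot(action) - probs.
        grad_logits = -probs
        grad_logits[action] += 1.0
        # Chain rule through the linear layer logits = theta @ state.
        grad_theta = np.outer(grad_logits, state)
        theta += lr * ret * grad_theta        # 👍 raises P(A_i | S_i), 👎 lowers it
    return theta
```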
  • 101. The winning episode again: action probabilities (0.0–1.0) at each frame, R₁ = 0, R₂ = 0, R₃ = +1 at Game Over; ∑ᵢ Rᵢ = +1 😁 👍 👍 👍.
  • 102.
  • 103. Extended Data Figure 2 | Visualization of learned value functions on two games, Breakout and Pong. a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. (Caption continues as above.)
  • 104. (Same figure and caption as slide 103.)
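The value curves in the figure (the peaks for each cleared brick in Breakout, the rise towards +1 in Pong) can be reproduced by evaluating the trained network at every frame of an episode and recording the state value V(s) = max_a Q(s, a). A hedged sketch, reusing the hypothetical q_net from the previous example:

import torch

@torch.no_grad()
def value_trace(q_net, frame_stacks):
    """Predicted state value V(s) = max_a Q(s, a) for each step of an episode."""
    values = []
    for state in frame_stacks:           # each state: tensor of shape (4, 84, 84)
        q = q_net(state.unsqueeze(0))    # add batch dimension -> (1, n_actions)
        values.append(q.max().item())    # highest action value = state value
    return values

# Example with the (untrained) network from the sketch above:
episode = [torch.rand(4, 84, 84) for _ in range(100)]
trace = value_trace(q_net, episode)      # ready to plot against time steps

With a trained network and real game frames, plotting this trace against time would show exactly the anticipation effects the caption describes: the value climbs before a reward actually arrives.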
  • 106. Outline • Current State • Deep Learning • Reinforcement Learning • Conclusions
  • 107. Conclusions 1. Deep Learning: Neural network … 2. 3.