Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Reinforcement Learning (Reloaded)
Day 11 Lecture 1
#DLUPC
http://bit.ly/dlai2018
2
Acknowledgements
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
3
Acknowledgements
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
4
A broader picture of types of learning...
Slide inspired by Alex Graves (Deepmind) at
“Unsupervised Learning Tutorial” @ NeurIPS 2018.
...with a teacher vs. ...without a teacher:
● Active agent: Reinforcement learning (with extrinsic reward) vs. Intrinsic motivation / Exploration.
● Passive agent: Supervised learning vs. Unsupervised learning.
6
Outline
1. Previously in DLAI
7
Architecture
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
8
Training data
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
In RL, training data is obtained by collecting interaction sequences of states, actions and rewards.
A complete episode of T interactions: S1, A1, R2, S2, A2, …, ST
One experience corresponds to a single tuple within an episode: (s, a, r, s’)
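As an illustration, a minimal sketch of collecting such experience tuples, assuming a Gym-style environment with the classic reset()/step() API; the environment name and the random policy are placeholders:

```python
import gym

def collect_episode(env, policy):
    """Run one episode and return the list of (s, a, r, s') experience tuples."""
    experiences = []
    s = env.reset()                          # S1
    done = False
    while not done:
        a = policy(s)                        # At
        s_next, r, done, _ = env.step(a)     # Rt+1, St+1
        experiences.append((s, a, r, s_next))
        s = s_next
    return experiences

# Example: one episode of CartPole with a random policy
env = gym.make("CartPole-v1")
episode = collect_episode(env, policy=lambda s: env.action_space.sample())
```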
9
Markov Decision Process (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
An MDP is defined by the tuple (S, A, R, P, γ). The agent-environment loop:
● The environment samples an initial state s0 ~ p(s0).
● The agent selects an action a.
● The environment samples the next state s’ ~ P(· | s, a) and a reward r ~ R(· | s, a).
10
Markov Decision Process (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
S A R P γ
11
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
MDP: Model (of the Environment)
S A R P γ
12
MDP: Model (of the Environment)
The transition function P(s’, r | s, a) records the probability of transitioning from s to s’ after taking action a, while obtaining a reward r (ℙ is the symbol for probability). From it, the state-transition function is obtained by summing the transition function across all rewards, and the reward function follows as the expected reward (both reconstructed below).
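The three formulas on this slide appear as images; a LaTeX reconstruction based on the definitions described above (notation following Lilian Weng's post):

```latex
% Transition function (joint over next state and reward)
P(s', r \mid s, a) = \mathbb{P}[S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a]

% State-transition function: sum of the transition function across all rewards
P(s' \mid s, a) = \sum_{r \in \mathcal{R}} P(s', r \mid s, a)

% Reward function: expected reward for taking action a in state s
R(s, a) = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]
        = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} P(s', r \mid s, a)
```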
13
MDP: Model (of the Environment)
The Model (of the Environment) is defined by:
● P(s’|s,a): the state-transition function.
● R(s,a): the reward function.
14
Figure: OpenAI Spinning Up
MDP: Model (of the Environment)
15
MDP: Model (of the Environment)
● Model-based RL: R(·| s,a) and P(·| s,a) are known, so an optimal solution can be found with Dynamic Programming. Because of the poor data efficiency of model-free RL, model-based RL is often the only practical option for real-world agents. More details in Sergey Levine’s slides @ Berkeley.
● Model-free RL: there is no prior knowledge of the world. The agent needs to learn all dynamics from scratch, resulting in poor data efficiency, which limits its application to real-world agents and requires sim2real transfer for real deployment.
16
MDP: Model (of the Environment)
Example of sim2real: Dexterity (OpenAI 2018)
17
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The Policy π is a function S ➝ A that specifies which action to take given a state.
● On-policy learning: the agent is trained with a sequence of interactions of the target policy.
● Off-policy learning: the agent is trained with a sequence of interactions obtained from a behaviour policy different from the target policy.
18
MDP: Policy
The Policy π is a function S ➝ A that specifies which action to take in each state.
● Deterministic: π(s) = a
● Stochastic: π(a|s) = ℙπ[A=a | S=s]
19
MDP: Policy
The Policy π is a function S ➝ A that specifies which action to take in each state.
● Discrete actions: categorical distribution over actions.
○ In the deterministic case: take the argmax action.
○ In the stochastic case: sample from the categorical distribution.
● Continuous actions: Gaussian distribution over actions; i.e. the policy generally outputs the mean and std per dimension.
○ In the deterministic case: take the mean.
○ In the stochastic case: sample from the normal distribution (both cases are sketched in code below).
Slide concept: Víctor Campos
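A minimal sketch of both cases in NumPy; the logits, mean and std inputs stand for the outputs of a hypothetical policy network:

```python
import numpy as np

def select_discrete_action(logits, deterministic=False):
    """Categorical policy over discrete actions."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax over logits
    if deterministic:
        return int(np.argmax(probs))                       # take the argmax action
    return int(np.random.choice(len(probs), p=probs))      # sample from the categorical

def select_continuous_action(mean, std, deterministic=False):
    """Gaussian policy over continuous actions (mean and std per dimension)."""
    if deterministic:
        return mean                                        # take the mean
    return np.random.normal(mean, std)                     # sample from N(mean, std)
```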
20
MDP: Return Gt
The future reward, also known as the return Gt, is the total sum of discounted rewards from time t onwards:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σk≥0 γ^k Rt+k+1
...where γ is the discount factor between [0,1].
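A small sketch of computing Gt for every step of an episode from its list of rewards; the discount factor and the reward values below are placeholders:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G1, ..., GT] of one episode, given its rewards [R2, ..., RT+1]."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # Gt = Rt+1 + gamma * Gt+1
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # [2.62, 1.8, 2.0]
```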
21
MDP: Value Functions Vπ(s) & Qπ(s,a)
Given a policy π, a value function is the expected return Gt:
● State-value function Vπ(s): the expected return when starting from state s and following policy π.
● Action-value function Qπ(s,a) (more specific): the expected return when starting from state s, taking action a, and then following policy π.
22
MDP: Advantage Function Aπ(s,a)
The Advantage function (A-value) is defined as the difference between the action-value and the state-value, Aπ(s,a) = Qπ(s,a) − Vπ(s).
Aπ(s,a) tells how good/bad an action is with respect to the expectation of returns over all possible actions for a given state.
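The value and advantage formulas on these two slides appear as images; a LaTeX reconstruction of the standard definitions they describe:

```latex
% State-value and action-value functions: expected return under policy \pi
V^{\pi}(s)   = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right]
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s, A_t = a \,\right]

% Advantage: how much better action a is than the policy's average in state s
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
```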
23
MDP: Optimal Value and Policy
The optimal value functions produce the maximum return: V*(s) = maxπ Vπ(s) and Q*(s,a) = maxπ Qπ(s,a).
The optimal policy π* achieves the optimal value functions V*(s) and Q*(s,a).
24
Outline
1. Previously in DLAI
2. Common Approaches
25
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
26
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
Goals of Reinforcement Learning
27
Value vs Policy vs Model
David Silver (Deepmind), “Introduction to Deep Learning” (2015)
Summary of approaches in RL based on whether we want to learn the value, policy,
or the model of the environment.
29
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
Goals of Reinforcement Learning
30
Inferring a Policy from the Value
Given a value function, a policy can be easily defined. For example, the greedy policy:
π(s) = arg max_{a∈A} Q(s,a)
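For instance, a sketch of the greedy policy over a small tabular Q; the Q-table layout (a dict keyed by (state, action)) is only an assumption for illustration:

```python
def greedy_policy(Q, state, actions):
    """Pick the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy example
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
print(greedy_policy(Q, "s0", ["left", "right"]))   # -> "right"
```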
31
Generalized Policy Iteration (GPI)
Generalized Policy Iteration (GPI) algorithms adopt an iterative procedure to improve the policy π:
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
The value function Vπi is repeatedly refined to approach the true value of the current policy, and in the meantime the policy is repeatedly improved towards optimality.
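A tabular sketch of this alternation for a small, known MDP; the P[s][a] = list of (prob, next_state, reward) interface is a hypothetical layout, and policy evaluation uses a fixed number of sweeps:

```python
def policy_iteration(states, actions, P, gamma=0.9, eval_sweeps=50):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: push V towards the true value of the current policy
        for _ in range(eval_sweeps):
            for s in states:
                V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V
```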
32
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
Example: Grid world
● The agent lives in a grid.
● Walls block the agent’s path.
● Big rewards come at the end.
33
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
Example: Grid world with noise = 0.2
● The agent’s actions do not always go as
planned:
○ Action North takes the agent:
■ North (80%)
■ West (10%)
■ East (10%)
○ Similarly for the West, East & South
actions.
34
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
35
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
36
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
37
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
38
GPI: Monte-Carlo Methods (MC)
Monte-Carlo (MC) Methods learn from complete episodes of raw experience and compute the observed mean return as an approximation of the expected return Gt.
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
MC estimates the empirical mean return from multiple complete episodes S1, A1, R2, S2, A2, …, ST. For the state-value function:
V(s) = Σt 𝟙[St=s] · Gt / Σt 𝟙[St=s]
...where 𝟙[St=s] is a binary indicator function.
39
GPI: Monte-Carlo Methods (MC)
The optimal policy π* is learned by iterating as follows (see the code sketch below):
1. Improve the policy greedily with respect to the current Q: π(s) = arg max_{a∈A} Q(s,a).
2. Generate a new complete episode with the updated policy π.
3. Update Q using the new episode.
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
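A sketch of this loop for an environment with discrete, hashable states, reusing the collect_episode() helper sketched earlier; the small ε-greedy exploration term is an added assumption so that every action keeps being tried:

```python
from collections import defaultdict
import random

def mc_control(env, actions, episodes=1000, gamma=0.99, eps=0.1):
    Q = defaultdict(float)        # Q[(s, a)], running mean of observed returns
    counts = defaultdict(int)

    def policy(s):
        if random.random() < eps:                       # occasional exploration
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])    # greedy w.r.t. current Q

    for _ in range(episodes):
        episode = collect_episode(env, policy)          # list of (s, a, r, s')
        g = 0.0
        for s, a, r, _ in reversed(episode):
            g = r + gamma * g                           # return from this step onwards
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]   # incremental mean
    return Q
```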
40
Figure: OpenAI Spinning Up
Deep Q-learning (DQN)
41
Deep Q-learning (DQN)
Problem: estimating the optimal Q*(s,a) might be feasible for a few state-action pairs with exhaustive search, but it is impossible for large state-action spaces.
Solution: learn a function approximation Q(s,a; θ) with machine learning, e.g. a neural network with parameters θ:
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Q*(s,a) ≈ Q(s,a; θ),  π(s) = arg max_{a∈A} Q(s,a; θ)
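A sketch of the function approximator and the one-step Q-learning target, assuming PyTorch; this is a small fully-connected network, not the convolutional architecture of the Nature DQN paper:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one Q-value per action in a single forward pass."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def q_learning_target(q_net, r, s_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta), with no bootstrap on terminal states."""
    with torch.no_grad():
        next_q = q_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done.float()) * next_q
```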
42
Deep Q-learning (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
A single feedforward pass of Q(s,a; θ) computes the Q-values for all actions from the current state (efficient).
43
Deep Q-learning (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
44
Value vs Policy vs Model
David Silver (Deepmind), “Introduction to Deep Learning” (2015)
Summary of approaches in RL based on whether we want to learn the value, policy,
or the model of the environment.
45
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
Goals of Reinforcement Learning
46
Value vs Policy vs Model
Directly learn the policy by estimating the parameters θ of a stochastic policy function:
π(a|s; θ)
47
Figure: OpenAI Spinning Up
REINFORCE (Vanilla Policy Gradients - VPG)
Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. "Policy gradient methods for reinforcement learning with function approximation." NIPS 2000.
REINFORCE (Vanilla Policy Gradients - VPG)
REINFORCE is the mathematical formulation of ‘trial and error’: try an action, and make it more likely if it resulted in positive reward; otherwise, make it less likely.
Opposite optimization goals:
● Supervised learning: minimize a loss function by gradient descent.
● Reinforcement learning: maximize J(θ), the expected return of the policy, by gradient ascent.
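The objective J(θ) and its gradient estimator appear as images on the following slides; a LaTeX reconstruction of one common form, whose pieces (Monte-Carlo estimate over N sampled trajectories, log-probability gradient, cumulative discounted reward) are annotated next:

```latex
% Expected return of the policy (objective to maximize by gradient ascent)
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]

% REINFORCE gradient estimator from N sampled trajectories (Monte Carlo)
\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
    \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \right)
    \left( \sum_{t=1}^{T} \gamma^{t-1} r_{i,t} \right)
```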
49
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
REINFORCE (Vanilla Policy Gradients - VPG)
50
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
Estimation of the expectation over all
possible trajectories, using N sampled
trajectories (Monte Carlo)
REINFORCE (Vanilla Policy Gradients - VPG)
51
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
Estimation of the expectation over all
possible trajectories, using N sampled
trajectories (Monte Carlo)
The policy gradient. If we follow it, the
action ai,t
will be more likely if the
agent ever finds itself again in state si,t
REINFORCE (Vanilla Policy Gradients - VPG)
52
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
Estimation of the expectation over all
possible trajectories, using N sampled
trajectories (Monte Carlo)
Cumulative
(discounted) reward
REINFORCE (Vanilla Policy Gradients - VPG)
The policy gradient. If we follow it, the
action ai,t
will be more likely if the
agent ever finds itself again in state si,t
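A sketch of the resulting surrogate loss for a discrete-action policy network, assuming PyTorch and the discounted_returns() helper sketched earlier; minimizing this loss with any optimizer performs gradient ascent on J(θ):

```python
import torch

def reinforce_loss(logits, actions, returns):
    """logits: (T, n_actions) policy outputs, actions: (T,) long, returns: (T,) float."""
    log_probs = torch.log_softmax(logits, dim=-1)                   # log pi(a|s) for all a
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)    # log pi(a_t|s_t)
    return -(taken * returns).mean()                                # negative surrogate
```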
53
Learn more
Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning,
Berkeley.
Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
54
Learn more
David Silver, UCL COMP050, Reinforcement Learning
55
Learn more
OpenAI Spinning Up in Deep RL
56
A taxonomy of RL Algorithms
Figure: OpenAI Spinning Up
57
Final Questions