Policy Gradient
Jie-Han Chen
NetDB, National Cheng Kung University
5/22, 2018 @ National Cheng Kung University, Taiwan
Disclaimer
Some content and images in these slides were borrowed from:
1. Sergey Levine's Deep Reinforcement Learning class at UC Berkeley
2. David Silver's Reinforcement Learning class at UCL
3. Rich Sutton's textbook
4. Deep Reinforcement Learning and Control at CMU (CMU 10703)
Outline
● Pitfall of Value-based Reinforcement Learning
● Policy gradient
● Variance reduction
● Policy in policy gradient
● Off-policy policy gradient
● Reference
Value-based Reinforcement Learning
In the previous lecture, we introduced how to use a neural network to approximate the value
function and how to learn the optimal policy in a discrete action space.
Value-based Reinforcement Learning
In Deep Q-Network (DQN), we use a neural network to approximate the action-value
function Q(s, a).
The greedy policy picks the action with the largest Q value:
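Written out (the equation on the original slide appears only as an image), the greedy policy is:

\pi(s) = \arg\max_{a} Q(s, a)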
Value-based Reinforcement Learning
● The optimal policy learned by value-based methods is deterministic
● It is hard to apply value-based methods to continuous action problems
Pitfall of Value-based Reinforcement Learning
Consider the following simple maze. The feature vector has 4 elements,
each indicating whether the agent faces a wall in that direction (N, S, W, E).
feature: (1, 1, 0, 0)
Pitfall of Value-based Reinforcement Learning
For a deterministic policy:
● It will move either east or west in both grey states.
● It may get stuck and never reach the goal state.
Pitfall of Value-based Reinforcement Learning
Although a better-designed observation could help the agent distinguish between these
states, sometimes we prefer to use a stochastic policy.
Pitfall of Value-based Reinforcement Learning
In robotics, the action (control) is often continuous: we
need to decide the joint angles/torques of a robotic arm
given the observation.
It is hard to take an argmax over a continuous action
space, so we need other solutions for continuous control
problems.
Value-based and Policy-based RL
Value-Based
● Learnt Value Function
● Implicit policy
Policy-Based
● No Value Function
● Learnt Policy
Actor-Critic
● Learnt Value Function
● Learnt Policy
Policy Gradient
The objective of reinforcement learning is to maximize the expected episodic
reward (here, we take an episodic task as our example).
We write the episodic trajectory as τ = (s_0, a_0, s_1, a_1, ..., s_T) and denote its total
reward by R(τ); the objective can then be expressed as follows:
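Reconstructing the equation shown as an image on the slide, with R(τ) the total reward of trajectory τ:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t} r(s_t, a_t) \right]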
Policy Gradient
[Figure: the agent-environment loop. The policy network maps the state (s) to an action, and the environment transitions according to p(s'|s, a).]
In this example, the sample is an episodic trajectory,
not the experience of an individual transition (s, a, r, s').
Policy Gradient
How do we find the optimal parameters of the neural network?
???
Policy Gradient
How do we find the optimal parameters of the neural network?
Gradient ascent! (maximize the objective)
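In equation form (a standard gradient ascent update; α is the learning rate):

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)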
*** Math Caution ***
Policy Gradient
tips:
We just sample trajectories using the current policy and
adjust the likelihood of each trajectory according to its episodic reward.
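The derivation behind this is the likelihood-ratio (log-derivative) trick, reconstructed here because the slides show it only as images. It moves the gradient inside the expectation; the transition probabilities p(s'|s, a) do not depend on θ and drop out:

\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right]
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \left( \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]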
Policy Gradient
The gradient of the objective: adjust the probability of the actions taken in each
trajectory according to the magnitude of its episodic reward.
Policy Gradient
The gradient of the objective: in practice, we replace the expectation with an average over multiple sampled trajectories.
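With N sampled trajectories, the Monte Carlo estimate of the gradient is (a standard reconstruction of the slide's equation):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t} r(s_{i,t}, a_{i,t}) \right)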
Vanilla Policy Gradient - REINFORCE algorithm
REINFORCE algorithm:
1. Sample trajectories {τ_i} by running the current policy π_θ
2. Estimate the gradient ∇_θ J(θ) from the sampled trajectories
3. Update the parameters: θ ← θ + α ∇_θ J(θ)
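A minimal sketch of these three steps in PyTorch-style code. This is not from the slides: the environment API (env.reset/env.step returning a 4-tuple), the policy_net module (assumed to output softmax action probabilities), and the hyperparameters are assumptions for illustration.

    import torch

    def reinforce_update(policy_net, optimizer, env, num_trajectories=10):
        """One REINFORCE update: sample trajectories, estimate the gradient, take a step."""
        loss = 0.0
        for _ in range(num_trajectories):
            log_probs, rewards = [], []
            state, done = env.reset(), False
            while not done:
                probs = policy_net(torch.as_tensor(state, dtype=torch.float32))  # softmax output
                dist = torch.distributions.Categorical(probs)
                action = dist.sample()
                state, reward, done, _ = env.step(action.item())
                log_probs.append(dist.log_prob(action))
                rewards.append(reward)
            episode_return = sum(rewards)  # total episodic reward R(tau)
            # gradient ascent on sum_t log pi(a_t|s_t) * R(tau) == descent on its negative
            loss = loss - torch.stack(log_probs).sum() * episode_return
        loss = loss / num_trajectories
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()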
Features of Policy-Based RL
Advantages
● Better convergence properties
● Effective in high-dimensional or continuous action space
● Stochastic policy
Disadvantages:
● Typically converges to a local optimum rather than the global optimum
● Inefficient learning (learns episode by episode)
● High variance
REINFORCE: bias and variance
The estimator is unbiased, because it uses the true
episodic rewards to evaluate the policy.
The REINFORCE estimator is known to have
high variance, because episodic rewards can differ
greatly between trajectories. High variance results in slow
convergence.
Variance reduction
There are two methods to reduce variance:
1. Causality
2. Baseline
Variance reduction: causality
Original:
Causality: the policy at time t' cannot affect the reward received at time t when t < t'.
Each ∇_θ log π_θ(a_t | s_t) term therefore only needs to be weighted by the rewards from
time t onward (the reward-to-go), which drops terms that only add noise.
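Reconstructing the resulting estimator (following the notation of Levine's lecture): each log-probability term is weighted only by the reward-to-go from time t, rather than the full episodic reward:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right)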
REINFORCE: reduce variance
There are two methods to reduce variance
Original:
1. Causality:
2. Baseline
Variance reduction: baseline
baseline:
We can choose the baseline to be, for example, the
average reward of the sampled trajectories:
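Reconstructing the slide's equation: with N sampled trajectories, the constant baseline and the resulting estimator are

b = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i), \qquad \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i) \left( R(\tau_i) - b \right)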
Variance reduction: baseline
baseline:
Do baselines introduce bias in expectation?
Analyze:
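The analysis, a standard derivation filling in the equations shown as images: the extra term introduced by the baseline has zero expectation, because b does not depend on the trajectory:

\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau) \, b \right]
= \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) \, b \, d\tau
= b \int \nabla_\theta \pi_\theta(\tau) \, d\tau
= b \, \nabla_\theta \int \pi_\theta(\tau) \, d\tau
= b \, \nabla_\theta 1 = 0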
Variance reduction: baseline
*The baseline is independent of the action.
Reducing variance with a baseline does not bias the estimator, as long as the
baseline does not depend on the action (it may depend on the state).
Variance reduction: baseline
● Subtracting a baseline is unbiased in expectation; it does not bias the
estimator.
● The baseline can be any function, even a random variable, as long as it does
not vary with the action.
Variance reduction: baseline
Variance:
Related paper:
"Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines",
Cathy Wu*, Aravind Rajeswaran*, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade,
Igor Mordatch, Pieter Abbeel (ICLR 2018)
Variance reduction: causality + baseline
Previously, we introduced two methods to reduce variance:
● Causality
● Baseline
Question: can we combine the two variance-reduction methods?
Variance reduction: causality + baseline
The ideal form:
[Figure: a trajectory from the starting state to a terminal state, with an intermediate state highlighted.]
From a given state in a trajectory there are many possible paths to different terminal
states, so the remaining rewards can also differ greatly.
The natural idea is to use the average remaining reward from that state as the baseline,
in other words, the value function.
Variance reduction: causality + baseline
We can learn the value function by the methods mentioned before
(a tabular method or a function approximator).
In the REINFORCE algorithm, the agent interacts with the environment
until it reaches a terminal state, so the remaining reward at each step
is known. We can therefore evaluate the policy with the Monte Carlo
method, using MSE as the loss for the value function.
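Putting the two together, a standard reconstruction of the combined estimator and the Monte Carlo value-fitting loss (V̂_w denotes the learned value function with parameters w; this notation is ours, not the slides'):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - \hat{V}_w(s_{i,t}) \right)

\mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \hat{V}_w(s_{i,t}) - \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right)^2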
Policy
Policy gradient can be applied to discrete action space problems as well as continuous
action space problems.
● Discrete action problem: Softmax Policy
● Continuous action problem: Gaussian Policy
Policy: Softmax Policy
Here, we suppose the function approximator is h(s, a; θ), where θ are its parameters;
we sample the action according to its softmax probability.
[Figure: a network mapping the observation (features) to action probabilities.]
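In symbols (reconstructing the slide's equation), the softmax policy is

\pi_\theta(a \mid s) = \frac{\exp\left( h(s, a; \theta) \right)}{\sum_{a'} \exp\left( h(s, a'; \theta) \right)}

and, for the linear case h(s, a; θ) = φ(s, a)ᵀθ used in David Silver's lecture, the score function is

\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \mathbb{E}_{a' \sim \pi_\theta}\!\left[ \phi(s, a') \right]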
Policy: Gaussian Policy
In continuous control, a Gaussian policy is common.
● Gaussian distribution:
● Gaussian policy: we use a neural network to approximate the mean, and sample the
action from the resulting Gaussian distribution. The variance can also be
parameterized.
Policy: Gaussian Policy
Here, we use a fixed variance σ², and the mean is computed as a linear combination of
state features, μ(s) = φ(s)ᵀθ, where φ(s) is the feature transformation.
The gradient of the log of the policy is given below.
With a neural network, you just need to backpropagate through the log-probability of the sampled action.
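Reconstructing the equation the slide refers to (as in David Silver's lecture): with mean μ(s) = φ(s)ᵀθ and fixed variance σ², the score function is

\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\left( a - \mu(s) \right) \phi(s)}{\sigma^2}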
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
The learning process of REINFORCE:
1. Sample multiple trajectories from π_θ
2. Fit the model
If the sampled trajectories come from a different distribution (not the current π_θ),
the learning result would be wrong.
Reason of Inefficient Learning
The main reason for inefficient learning in REINFORCE is:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
○ In vanilla policy gradient, we sample multiple trajectories but only update the model once.
Compared with TD learning, vanilla policy gradient learns much more slowly.
Reason of Inefficient Learning
The main reason for inefficient learning in REINFORCE is:
● The REINFORCE algorithm is on-policy: improve by importance sampling
● We need to learn by the Monte Carlo method: improve by Actor-Critic
Importance sampling
Importance sampling is a statistical technique that lets us estimate properties of one
distribution using samples drawn from a different distribution.
Suppose the objective is an expectation under p(X), but the data are sampled from q(X); we
can apply the following transformation:
Objective of on-policy policy gradient:
Objective of off-policy policy gradient:
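Reconstructing the missing equations: the importance-sampling identity, and the on-policy versus off-policy objectives it yields (q is the sampling distribution over trajectories):

\mathbb{E}_{x \sim p}\!\left[ f(x) \right] = \int p(x) f(x) \, dx = \int q(x) \frac{p(x)}{q(x)} f(x) \, dx = \mathbb{E}_{x \sim q}\!\left[ \frac{p(x)}{q(x)} f(x) \right]

On-policy: J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
Off-policy: J(\theta) = \mathbb{E}_{\tau \sim q}\!\left[ \frac{\pi_\theta(\tau)}{q(\tau)} R(\tau) \right]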
Off-policy & importance sampling
● target policy ( ): the learning policy, which we are interested in.
● behavior policy ( ): the policy used to collect samples.
We sample the trajectories from the behavior policy; the objective would be:
The importance sampling ratio ends up depending only on
the two policies and the action sequence: the environment
dynamics p(s'|s, a) cancel out.
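The ratio written out (a standard result): the dynamics terms appear in both numerator and denominator and cancel,

\frac{\pi_\theta(\tau)}{\beta(\tau)} = \frac{p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)}{p(s_0) \prod_{t} \beta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)} = \prod_{t} \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}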
Off-policy & importance sampling
Suppose the off-policy objective function is:
● target policy: the learner neural network
● behavior policy: the expert / behavior neural network
Off-policy & importance sampling
Suppose the off-policy objective function is:
How about causality?
Off-policy & importance sampling
The gradient of the off-policy objective:
(by causality, future actions do not affect the importance weight of the current step)
Off-policy & importance sampling
The gradient of the off-policy objective:
1. This is the general form of the off-policy policy gradient. If we
use on-policy learning, it reduces to the vanilla policy
gradient (the importance sampling ratio is 1).
2. In practice, we store trajectories along with the action
probability at each step, and then update the neural network
with the importance sampling ratio included.
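A minimal sketch of point 2 in PyTorch-style code. This is not from the slides: it assumes we stored, for each step of a behavior trajectory, the state, the chosen action, and the behavior policy's probability of that action, plus the episodic return; the names (policy_net, steps, ...) are illustrative, and treating the cumulative ratio as a detached weight is a common practical approximation.

    import torch

    def off_policy_pg_update(policy_net, optimizer, steps, episode_return):
        """One update from a trajectory collected by a behavior policy.

        steps: list of (state, action, behavior_prob) tuples, one per time step.
        episode_return: total reward R(tau) of the stored trajectory.
        """
        states = torch.as_tensor([s for s, _, _ in steps], dtype=torch.float32)
        actions = torch.as_tensor([a for _, a, _ in steps], dtype=torch.long)
        behavior_probs = torch.as_tensor([p for _, _, p in steps], dtype=torch.float32)

        probs = policy_net(states)                                # pi_theta(.|s) at every step
        taken = probs.gather(1, actions.view(-1, 1)).squeeze(1)   # pi_theta(a_t|s_t)
        ratios = taken / behavior_probs                           # per-step pi_theta / beta
        # causality: the weight at step t only uses ratios up to step t,
        # and is treated as a constant (detached) in this approximation
        weights = torch.cumprod(ratios, dim=0).detach()
        loss = -(weights * torch.log(taken)).sum() * episode_return
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()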
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy (we have already discussed importance sampling as a remedy)
● We need to learn by the Monte Carlo method (addressed by Actor-Critic in the next section)
Reference
● CS 294, Berkeley lecture 4: http://rll.berkeley.edu/deeprlcourse/
● David Silver RL course lecture7: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
● Baseline Subtraction from Shan-Hung Wu: https://www.youtube.com/watch?v=XnXRzOB0Pc8
● Andrej Karpathy’s blog: http://karpathy.github.io/2016/05/31/rl/
● Policy Gradient in pytorch https://github.com/pytorch/examples/tree/master/reinforcement_learning
Outline
● Pitfall of Value-based Reinforcement Learning
○ Value-based policy is deterministic
○ Hard to handle continuous control
● Policy gradient
● Variance reduction
○ Causality
○ Baseline
● Policy in policy gradient
○ Softmax policy
○ Gaussian policy
● Off-policy policy gradient
○ Importance sampling
