Policy Gradient (Part 1)
Reinforcement Learning : An Introduction 2e, 2018.July
Bean https://www.facebook.com/littleqoo
1
Agenda
Reinforcement Learning:An Introduction
● Policy Gradient Theorem
● REINFORCE: Monte-Carlo Policy Gradient
● One-Step Actor-Critic
● Actor-Critic with Eligibility Traces (Episodic and Continuing Case)
● Policy Parameterization for Continuous Actions
DeepMind (Richard Sutton, David Silver)
● Deterministic Policy Gradient (DPG (2014), DDPG (2016), MADDPG (2018) [part 2])
● Distributed Proximal Policy Optimization (DPPO 2017.07) [part 2]
OpenAI (Pieter Abbeel, John Schulman)
● Trust Region Policy Optimization (TRPO (2016)) [part 2]
● Proximal Policy Optimization (PPO (2017.07)) [part 2]
2
Reinforcement Learning Classification
● Value-Based
○ Learned Value Function
○ Implicit Policy
(usually Ɛ-greedy)
● Policy-Based
○ No Value Function
○ Explicit Policy
Parameterization
● Mixed(Actor-Critic)
○ Learned Value Function
○ Policy Parameterization
3
Policy Gradient Method
Goal:
Performance Measure:
Optimization: Gradient Ascent
[Actor-Critic Method]: Learn approximations to both the policy and the value
function
4
Policy Approximation (Discrete Actions)
● To ensure exploration, we generally require that the policy never becomes
deterministic
● The most common parameterization for discrete action spaces - softmax
in action preferences
○ the discrete action space cannot be too large
● Action preferences can be
parameterized arbitrarily (linear, ANN, ...)
5
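As a concrete illustration of the softmax-in-action-preferences parameterization above, here is a minimal NumPy sketch; the linear preference features x(s, a) and the toy usage at the bottom are illustrative assumptions, not something shown on the slides.

```python
import numpy as np

def softmax_policy(theta, x, state, actions):
    """Softmax in action preferences with linear preferences h(s, a, theta) = theta^T x(s, a).

    theta   : parameter vector
    x       : feature function x(state, action) -> np.ndarray (assumed interface)
    actions : list of discrete actions (the action set must be small)
    """
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()            # numerical stability; does not change the probabilities
    probs = np.exp(prefs)
    return probs / probs.sum()

# Toy usage with a hypothetical 2-action problem and one-hot action features.
actions = [0, 1]
x = lambda s, a: np.eye(2)[a]
theta = np.zeros(2)
print(softmax_policy(theta, x, state=0, actions=actions))   # -> [0.5, 0.5]
```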
Advantage of Policy Approximation
1. Can approach a deterministic policy (Ɛ-
greedy always has an Ɛ probability of selecting
a random action), e.g. the temperature parameter
(T -> 0) of the softmax
○ In practice, it is difficult to choose the reduction
schedule or the initial value of T
2. Enables the selection of actions with
arbitrary probabilities
○ e.g. bluffing in poker; action-value methods have no
natural way to do this
6
https://en.wikipedia.org/wiki/Softmax_function (Temperature parameters)
3. The policy may be a simpler function to approximate, depending on the
complexity of policies and action-value functions
4. A good way of injecting prior knowledge about the desired form of the
policy into the reinforcement learning system (often the most important
reason)
7
Short Corridor With Switched Actions
● All the states appear
identical under the function
approximation
● A method can do
significantly better if it can
learn a specific probability
with which to select the action "right"
● The best probability is
about 0.59
8
The Policy Gradient Theorem (Episodic)
https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-
learning-with-function-approximation.pdf
NIPS 2000 Policy Gradient Methods for Reinforcement Learning with Function
Approximation (Richard S. Sutton)
9
The Policy Gradient Theorem
● Stronger convergence guarantees are available for policy-gradient
methods than for action-value methods
○ Ɛ-greedy selection may change dramatically for an arbitrarily small
change in an action value that results in it having the maximal value
● There are two cases that define different performance measures
○ Episodic Case - the performance measure is the value of the start state
of the episode
○ Continuing Case - no episode end, and not even a special start state (refer to Chap. 10.3)
10
The Policy Gradient Theorem (Episodic)
● Performance
● Gradient Ascent
● Discount = 1
Bellman Equation
11
Cont.
● Performance
● Gradient Ascent
recursively
unroll
12
The Policy Gradient Theorem (Episodic)
● Performance
● Gradient Ascent
13
The Policy Gradient Theorem (Episodic) - Basic Meaning
14
The Policy Gradient Theorem (Episodic) - On Policy Distribution
fraction of time spent in s
under on-policy training
(the on-policy distribution, the same
as on p.43)
15
better written as
The Policy Gradient Theorem (Episodic) - On Policy Distribution
Number of time steps spent, on average, in
state s in a single episode
h(s) denotes the probability that
an episode begins in state s
16
The Policy Gradient Theorem (Episodic) - Concept
17
Ratio of s that
appear in the
state-action tree
Gathering gradients over all action
spaces of every state
The Policy Gradient Theorem (Episodic):
Sum Over States Weighted by How Often the States Occur Under The
Policy
● Policy gradient for episodic case
● The distribution is the on-policy distribution under
● The constant of proportionality is the average length of an episode and
can be absorbed into the step size
● Performance’s gradient ascent does not involve the derivative of the state
distribution
18
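Since the equation images did not survive extraction, the episodic policy gradient theorem referred to on this slide can be written (following Sutton & Barto, 2e) as:

```latex
\nabla J(\boldsymbol{\theta}) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a)\, \nabla \pi(a \mid s, \boldsymbol{\theta})
```

where \mu is the on-policy state distribution discussed on the previous slides.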
REINFORCE : Monte-Carlo Policy Gradient
Classical Policy Gradient
19
REINFORCE Algorithm
All Actions Method
Classical Monte-Carlo
20
REINFORCE Meaning
● The update increases the
parameter vector in this
direction proportional
to the return
● and inversely proportional to the
action probability (this makes sense
because otherwise actions that
are selected frequently would be at an
advantage)
The all-actions update is a sum over actions. When
sampling actions by their probability instead, we
have to divide each sampled gradient by that
action's probability
21
REINFORCE Algorithm
Wait Until One Episode Is Generated
22
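A minimal Python sketch of the episode-based REINFORCE loop described above; env, policy, and grad_log_pi are assumed interfaces (the slides only reproduce the book's pseudocode), so treat this as an outline rather than a faithful reimplementation.

```python
import numpy as np

def reinforce_episode(env, theta, policy, grad_log_pi, alpha=1e-3, gamma=1.0):
    """One REINFORCE update: generate a full episode, then update theta for every step.

    policy(theta, s)         -> action probabilities (e.g. softmax in preferences)
    grad_log_pi(theta, s, a) -> gradient of ln pi(a|s, theta)
    env.reset()/env.step(a)  -> assumed to return s and (s_next, r, done)
    """
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # wait until one full episode is generated
        probs = policy(theta, s)
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    G = 0.0
    for t in reversed(range(len(states))):            # returns computed backwards through time
        G = rewards[t] + gamma * G
        theta += alpha * (gamma ** t) * G * grad_log_pi(theta, states[t], actions[t])
    return theta
```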
REINFORCE on the short-corridor gridworld
short-corridor gridworld
● With a good step
size, the total
reward per episode
approaches the
optimal value of the
start state
23
REINFORCE Defects & Solutions
● Slow Convergence
● High Variance From Rewards
● Hard To Choose the Learning Rate
24
REINFORCE with Baseline (episodic)
● The expected value of the update is
unchanged (unbiased), but the baseline
can have a large effect on the update's
variance
● The baseline can be any function,
even a random variable
● For MDPs, the baseline should
vary with state; one natural
choice is the state-value function
○ in some states all actions
have high values => use a high
baseline
○ in others => a low baseline
Treat the State-Value Function as
an Independent Value-Function
Approximation!
25
REINFORCE with Baseline (episodic)
can be learned independently by any of the
methods of the previous chapters.
We use the same Monte Carlo approach
here (Section 9.3, Gradient Monte Carlo)
26
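For reference, the per-step updates of REINFORCE with a learned state-value baseline (as given in Sutton & Barto, 2e) are:

```latex
\delta \leftarrow G_t - \hat{v}(S_t, \mathbf{w}), \qquad
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S_t, \mathbf{w}), \qquad
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\, \gamma^{t}\, \delta\, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta})
```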
Short-Corridor GridWorld
● Learn much
faster
● The step size for the
policy parameter is
much less clear
how to set
● The state-value
function step size can be
set by the rule of
thumb of Section 9.6
27
Defects
● Learns slowly (produces estimates of high variance)
● Inconvenient to implement for online or continuing problems
28
Actor-Critic Methods
Combine Policy Function with Value Function
29
One-Step Actor-Critic Method
● Add one-step
bootstrapping to make
it online
● But the TD method always
introduces bias
● TD(0), with only
one random step, has
lower variance than
Monte Carlo and
accelerates learning
30
Actor-Critic
● Actor - Policy
Function
● Critic- State-Value
Function
● The Critic assigns credit
to criticize the Actor's
selection
31
https://cs.wmich.edu/~trenary/files/cs5300/RLBook/node66.html
One-step Actor-Critic Algorithm (episodic)
Independent Semi-Gradient TD(0)
(Section 9.3)
32
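A minimal sketch of the one-step actor-critic update, assuming generic function-approximator interfaces v_hat, grad_v, and grad_log_pi (these names are hypothetical, not from the slides):

```python
def one_step_actor_critic_step(theta, w, s, a, r, s_next, done,
                               v_hat, grad_v, grad_log_pi, I,
                               alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    """One online update of one-step actor-critic (episodic case).

    v_hat(w, s), grad_v(w, s), grad_log_pi(theta, s, a) are assumed function
    approximator interfaces; I is the accumulated discount gamma^t.
    """
    target = r if done else r + gamma * v_hat(w, s_next)
    delta = target - v_hat(w, s)                       # one-step TD error (the critic's signal)
    w = w + alpha_w * delta * grad_v(w, s)             # semi-gradient TD(0) critic update
    theta = theta + alpha_theta * I * delta * grad_log_pi(theta, s, a)   # actor update
    return theta, w, I * gamma
```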
Actor-Critic with Eligibility Traces (episodic)
● Weight Vector
is a long-term
memory
● Eligibility trace
is a short-term
memory,
keeping track
of which
components of
the weight
vector have
contributed to
recent state valuations
33
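Written out from the book's pseudocode (quoted from memory, so the exact placement of the discount factor I = \gamma^t may differ slightly), the per-step updates with eligibility traces are roughly:

```latex
\delta \leftarrow R + \gamma \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}) \\
\mathbf{z}^{\mathbf{w}} \leftarrow \gamma\lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla \hat{v}(S, \mathbf{w}),
\qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta\, \mathbf{z}^{\mathbf{w}} \\
\mathbf{z}^{\boldsymbol{\theta}} \leftarrow \gamma\lambda^{\boldsymbol{\theta}} \mathbf{z}^{\boldsymbol{\theta}} + I\, \nabla \ln \pi(A \mid S, \boldsymbol{\theta}),
\qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \delta\, \mathbf{z}^{\boldsymbol{\theta}},
\qquad I \leftarrow \gamma I
```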
Review of Eligibility Traces - Forward View (Optional)
34
Review of Eligibility Traces - Forward View (Optional)
TD(0)
TD(1)
TD(2)
35
My Eligibility Trace Induction Link https://cacoo.com/diagrams/gof2aiV3fCXFGJXF
Review of Eligibility Traces - Backward View vs Momentum (Optional)
Example:
Eligibility Traces vs. Gradient Momentum
similar
Accumulate Decayed Gradient
36
The Policy Gradient Theorem (Continuing)
37
The Policy Gradient Theorem (Continuing) - Performance Measure with Ergodicity
● “Ergodicity Assumption”
○ Any early decision by
the agent can have
only a temporary
effect
○ State Expectation in
the long run depends
on policy and MDP
transition
probabilities
○ Steady state
distribution is
assumed to exist and
to be independent of S0
guarantees the
limit exists
Average Rate of Reward per Time Step
38
(r(π) is a fixed value for any s. We will
later treat it in the theorem as a quantity independent of
s)
V(s)
The Policy Gradient Theorem (Continuing) - Performance Measure Definition
"Every Step's Average
Reward Is The Same"
39
The Policy Gradient Theorem (Continuing) - Steady State Distribution
Steady State Distribution Under
40
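For reference, the average-reward performance measure and the steady-state property used on these slides are (in the notation of Sutton & Barto, 2e):

```latex
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[R_t \mid S_0, A_{0:t-1} \sim \pi\big]
       = \sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r \\
\text{steady-state property:} \qquad
\sum_{s} \mu(s) \sum_{a} \pi(a \mid s, \boldsymbol{\theta})\, p(s' \mid s, a) = \mu(s') \quad \text{for all } s'
```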
Replace Discount with Average Reward for Continuing Problems (Sections 10.3, 10.4)
● A continuing problem with the discounted setting is useful in the tabular case, but
questionable in the function approximation case
● In continuing problems, the performance measure with the discounted setting is
proportional to the average-reward setting (they have almost the same
effect) (Section 10.4)
● The discounted setting is problematic with function approximation
○ with function approximation we have lost the policy improvement
theorem (Section 4.3), which is important in the Policy Iteration method
41
Proof of the Policy Gradient Theorem (Continuing) 1/2
Gradient Definition
Parameterization of policy
by replacing discount with
average reward setting
42
Proof of the Policy Gradient Theorem (Continuing) 2/2
● Introduce
steady state
distribution
and its
property
steady state distribution property
43
By definition, it is independent of s
Trick
Steady State Distribution Property
44
Policy Gradient Theorem (Continuing) Final Concept
45
Actor-Critic with Eligibility Traces (continuing)
● Replace Discount
with average
reward
● Training with
Semi-Gradient
TD(0)
Independent Semi-Gradient TD(0)
=1
46
Policy Parameterization for Continuous
Actions
● Can deal with large or infinite
continuous action spaces
● The actions are normally distributed, with mean and
standard deviation given by parameterized functions of the state
Feature vectors constructed by polynomial, Fourier, ... bases (Section 9.5)
47
Make it Positive
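A minimal sketch of the Gaussian policy parameterization for continuous actions; the feature constructors x_mu and x_sigma (polynomial, Fourier, ...) are assumed inputs, and the exp on the standard deviation is the "make it positive" trick noted above.

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, state, rng=np.random):
    """Gaussian policy: mean(s) = theta_mu^T x_mu(s), sigma(s) = exp(theta_sigma^T x_sigma(s))."""
    mu = theta_mu @ x_mu(state)
    sigma = np.exp(theta_sigma @ x_sigma(state))   # exp keeps the standard deviation positive
    return rng.normal(mu, sigma), mu, sigma

def grad_log_gaussian(theta_mu, theta_sigma, x_mu, x_sigma, state, action):
    """Gradients of ln pi(a|s) w.r.t. the two parameter vectors (standard results)."""
    mu = theta_mu @ x_mu(state)
    sigma = np.exp(theta_sigma @ x_sigma(state))
    g_mu = (action - mu) / sigma**2 * x_mu(state)
    g_sigma = ((action - mu)**2 / sigma**2 - 1.0) * x_sigma(state)
    return g_mu, g_sigma
```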
Chapter 13 Summary
● Policy gradient methods are superior to Ɛ-greedy and action-value methods in that they
○ Can learn specific probabilities for taking the actions
○ Can approach deterministic policies asymptotically
○ Can naturally handle continuous action spaces
● The policy gradient theorem gives an exact formula for how performance is
affected by the policy parameter that does not involve derivatives of the state
distribution.
● REINFORCE method
○ Add the state value as a baseline -> reduces variance without introducing bias
● Actor-Critic method
○ Add a state-value function for bootstrapping -> introduces bias but reduces
variance and accelerates learning
○ The Critic assigns credit to criticize the Actor's selection
48
Deterministic Policy Gradient
(DPG)
http://proceedings.mlr.press/v32/silver14.pdf
http://proceedings.mlr.press/v32/silver14-supp.pdf
ICML 2014 Deterministic Policy Gradient Algorithms (David Silver)
49
Comparison with Stochastic Policy Gradient
Advantage
● No action space sampling, more efficient (usually 10x faster)
● Can deal with large action space more efficiently
Weakness
● Less exploration
50
Deterministic Policy Gradient Theorem - Performance Measure
● Deterministic Policy
Performance Measure
51
● Policy Gradient (Continuing)
Performance Measure
(the paper does not distinguish between
the episodic and continuing cases)
Similar to
V(s)
V(s)
Deterministic Policy Gradient Theorem - Gradient
52
● Policy Gradient
(Continuing)
● Deterministic Policy
Gradient
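Written out, the deterministic policy gradient stated in Silver et al. (2014) takes the form:

```latex
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\Big[\, \nabla_{\theta}\, \mu_{\theta}(s)\; \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)} \Big]
```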
Deterministic Policy Gradient Theorem
Policy Gradient Theorem
Transition Probability
is parameterized by
Policy Gradient Theorem
53
Reward is
parameterized
by
Deterministic Policy Gradient Theorem
Policy Gradient Theorem
Unrolling
54
Combination
coverage cues
Deterministic Policy Gradient Theorem - Basic Meaning
55
(Diagram labels: No found / One found / Two found, with transition probabilities p=1)
Deterministic Policy Gradient Theorem
56
Steady Distribution Probability
(p.57)
(p.57)
(p.58)
Deterministic Policy Gradient Theorem
57
p=1 p=1 p=1 p=1 p=1 p=1
p(a|s’)=1 p(a|s’’)=1 p(a|s’’’)=1
Deterministic Policy Gradient Theorem vs Policy Gradient Theorem (episodic)
58
Both sample from the
steady-state distribution,
but PG has to sum
over the whole action space
Sampling Space  Sampling Space
On-Policy Deterministic Actor-Critic Problems
59
● Behaving according to a deterministic policy will not ensure adequate
exploration and may lead to suboptimal solutions
● It may still be useful in environments where there is sufficient noise in the
environment to ensure adequate exploration, even with a deterministic
behaviour policy
● Apart from that case, on-policy deterministic actor-critic is not practical
Sarsa Update
Off-Policy Deterministic Actor-Critic (OPDAC)
● Original Deterministic target policy
µθ(s)
● Trajectories generated by an
arbitrary stochastic behaviour policy
β(s,a)
● Action-value function off-policy
update - Q-learning
60
Off Policy Actor-Critic (using Importance
Sampling in both Actor and Critic)
https://arxiv.org/pdf/1205.4839.pdf
Off Policy Deterministic Actor-Critic
DAC removes the integral
over actions, so we can avoid
importance sampling in the
actor
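Quoted from memory of the DPG paper, the OPDAC updates combine a Q-learning critic with the deterministic actor gradient; the exact symbols may differ slightly from the paper:

```latex
\delta_t = r_t + \gamma\, Q^{w}\!\big(s_{t+1}, \mu_{\theta}(s_{t+1})\big) - Q^{w}(s_t, a_t) \\
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^{w}(s_t, a_t) \\
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^{w}(s_t, a_t)\big|_{a=\mu_\theta(s_t)}
```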
Compatible Function Approximation
61
● For any deterministic policy µθ(s), there always exists a compatible function
approximator of the action-value function
Off-Policy Deterministic Actor-Critic (OPDAC)
62
Actor
Critic
Experiments Designs
63
1. Continuous bandit, with a fixed-width Gaussian
behaviour policy
2. Mountain Car, with a fixed-width
Gaussian behaviour policy
3. Octopus Arm with 6 segments
a. Sigmoidal multi-layer perceptron (8 hidden units and sigmoidal output units) to represent
the policy µθ(s)
b. A(s) function approximator (Section 4.3)
c. V(s) multi-layer perceptron (40 hidden units and linear output units)
Experiment Results
64
In practice, the DAC significantly outperformed its stochastic counterpart by several
orders of magnitude in a bandit with 50 continuous action dimensions, and solved a
challenging reinforcement learning problem with 20 continuous action dimensions
and 50 state dimensions.
Deep Deterministic Policy
Gradient (DDPG)
https://arxiv.org/pdf/1509.02971.pdf
ICLR 2016 Continuous Control With Deep Reinforcement Learning (DeepMind)
65
Q-Learning Limitation
66
http://doremi2016.logdown.com/posts/2017/01/25/convolutional-neural-networks-cnn my cnn architecture
http://www.davidqiu.com:8888/research/nature14236.pdf Human-level control through deep reinforcement learning
Tabular Q-learning Limitations
● Very limited states/actions
● Can’t generalize to unobserved states
Q-learning with function approximation (neural net) can overcome the limits
above, but is still unstable or may diverge
● The correlations present in the sequence of observations
● Small updates to the Q function may significantly change the
policy (the policy may oscillate)
● The scale of rewards varies greatly from game to game
○ leads to largely unstable gradient calculations
Deep Q-Learning
67
http://www.davidqiu.com:8888/research/nature14236.pdf Human-level control through deep reinforcement learning
1. Experience Replay
○ Breaks samples'
correlations
○ Off-policy learning from
all past policies
2. Independent Target Q-
network, whose weights are updated
from the Q-network
every C steps
○ Avoids oscillations
○ Breaks correlations
with the Q-network
3. Clip rewards to limit the
scale of the TD error
○ Robust gradients
behavior policy: Ɛ-greedy
experience replay buffer
Freeze and update the Target Q network
train
minibatch-
size samples
68
modify from https://blog.csdn.net/u013236946/article/details/72871858
DQN Flow
DQN Flow (cont.)
69
1. Each time step, use Ɛ-greedy on the Q-network to create samples and
add them to the experience buffer
2. Each time step, the experience buffer randomly provides minibatch samples to
the networks (Q network, target network Q')
3. Calculate the Q network's TD error. Update the Q network, and update the target
network Q' every C steps
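A minimal Python sketch of the two DQN ingredients described above (uniform experience replay and a frozen target network for the TD targets); the q_target callable and the buffer size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay, as used by DQN, to break sample correlations."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)   # copy for uniform sampling

def dqn_targets(batch, q_target, gamma=0.99):
    """Compute TD targets with a frozen target network q_target(s) -> per-action values.

    q_target is an assumed callable (a frozen copy of the online Q-network); in DQN
    its weights are synchronized with the online network only every C steps.
    """
    return [r if done else r + gamma * max(q_target(s_next))
            for (_, _, r, s_next, done) in batch]
```

In the full agent these targets would be regressed onto by the online Q-network with gradient descent on the (clipped) TD error.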
DQN Disadvantage
● Many tasks of interest, most notably physical control tasks, have
continuous (real valued) and high dimensional action spaces
● With high-dimensional observation spaces, it can only handle discrete and
low-dimensional action spaces (it requires an iterative optimization process
at every step to find the argmax)
● A simple approach for DQN to deal with continuous domains is simply to
discretize, but this has many limitations: the number of actions increases
exponentially with the number of degrees of freedom, e.g. a 7-degree-of-
freedom system (as in the human arm) with the coarsest discretization a ∈
{−k, 0, k} for each joint gives 3^7 = 2187 action dimensionality
70
https://arxiv.org/pdf/1509.02971.pdf CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING
DDPG Contributions (DQN+DPG)
71
● Can learn policies "end-to-end": directly from raw pixel inputs (DQN)
● Can learn policies over high-dimensional continuous action spaces (DPG)
● Combining the two, we can learn policies in large state and action spaces online
72
DDPG Algo
● Experience
Replay
● Independent
Target
networks
● Batch
Normalization
of Minibatch
● Temporally
Correlated
Exploration
temporally correlated random policy
experience replay buffer
mini batch
Train Actor
mini batch
Train Critic
weighted blending between the Q and Target Q' networks
weighted blending between the Actor μ and Target Actor μ' networks
DDPG Flow
73
DDPG Flow (cont.)
74
1. Each time step, use the temporally correlated policy to create a sample and
add it to the experience replay buffer
2. Each time step, the experience buffer provides minibatch samples to all
networks (Actor μ, Actor Target μ', Q network, Q' Target network)
3. Calculate the Q network's TD error. Update the Q network and the Q' target network.
Calculate the Actor's gradient. Update μ and the target μ'
DDPG Challenges and Solutions
75
● A replay buffer is used to break up sequential samples (like DQN)
● Target networks are used for stable learning, but with a "soft" update
○
○ Target networks change slowly, but greatly improve the stability of
learning
● Batch Normalization is used to normalize each dimension across the
minibatch samples (in a low-dimensional feature space, observations may
have different physical values, like position and velocity)
● An Ornstein-Uhlenbeck process is used to generate temporally correlated
exploration, for exploration efficiency in physical environments with inertia (see the sketch below)
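A minimal Python sketch of the DDPG-specific pieces listed above (the "soft" target update and Ornstein-Uhlenbeck exploration noise); the coefficient values are typical illustrative choices, not taken from the slides.

```python
import numpy as np

def soft_update(target_params, online_params, tau=1e-3):
    """DDPG 'soft' target update: theta_target <- tau*theta + (1 - tau)*theta_target."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise via an Ornstein-Uhlenbeck process.

    theta/sigma/dt below are commonly used illustrative values (assumptions).
    """
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, rng=np.random):
        self.mu, self.theta, self.sigma, self.dt, self.rng = mu, theta, sigma, dt, rng
        self.x = np.full(size, mu, dtype=float)

    def sample(self):
        # Mean-reverting drift toward mu plus Gaussian diffusion, giving correlated noise.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        return self.x
```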
Applications
76
Editor's Notes
  1. Learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection. Without loss of generality, S0 is fixed (non-random) in each episode. J(θ) is defined differently for the episodic and continuing cases.
  2. In problems with significant function approximation, the best approximate policy may be stochastic
  3. In the latter case, a policy-based method will typically learn faster and yield a superior asymptotic policy(Kothiyal 2016)
  4. In the same state, if the environment's transition probabilities can change, an Ɛ tuned to the environment will certainly obtain a better reward than a randomly chosen one.
  5. Changing the policy parameters affects π, the reward, and p. π and the reward are easy to compute, but p belongs to the environment (unknown). The policy gradient theorem tries not to involve the derivative of the state distribution (p).
  6. The episodic policy gradient theorem uses the value function as the performance measure (the continuing case uses the average reward; the two differ), and the final result is proportional to this quantity, which here is the average reward per step. In the episodic case μ(s) is the fraction of all steps in which s appears, so we sum over and multiply by all η(s); summing all η(s) adds up the probability of every state s over all steps, which equals the average number of steps, so the episodic result is multiplied by the average episode length. The fraction of all steps in which s appears is then the same as the fraction at any single step (the continuing case). In the 2000 paper, μ(s) is defined as the steady-state fraction as the number of steps approaches infinity.
  7. Using q, some learned approximation to q-pi, is promising and deserving of further study.
  8. Previously, the gradient over the actions in a given state was computed by summing over all of them. When we do not want to sum over all actions and use sampling instead, high-probability actions are selected more often, so plain accumulation would break the original principle; therefore each sampled gradient is divided by its own probability.
  9. https://zhuanlan.zhihu.com/p/35958186
  10. Although the estimate is unbiased, the gradient has high variance, so accumulated gradients can change drastically; when the number of samples is still small it is hard to reach the best local optimum, and enough sample points are needed to approach the optimal value.
  11. using TD to make it online and continue
  12. http://mi.eng.cam.ac.uk/~mg436/LectureSlides/MLSALT7/L5.pdf
  13. The TD target differs from the true return G, so bias exists. At the same time, TD uses only one random state and action, so the randomness of the TD target is smaller than in the Monte Carlo method (which has n random steps in total), and therefore its variance is also smaller than the Monte Carlo variance.
  14. The principle of value-function approximation all comes from the Mean Squared Value Error (Section 9.2).
  15. Expectations are conditioned on the initial state S0, i.e. restricted to the states that can be visited (ergodicity) starting from S0; states that cannot be reached (perhaps reachable only from some other start state S) are outside the range of μ(s). The definition J(θ) = r(π) assumes there is a fixed value that is independent of s (this is used in the derivation); when proving the theorem, this definition is not used directly, and r(π) is only treated as a variable.
  16. ji
  17. The episodic policy gradient theorem uses the value function as the performance measure (the continuing case uses the average reward; the two differ), and the final result is proportional to this quantity, which here is the average reward per step. In the episodic case μ(s) is the fraction of all steps in which s appears, so we sum over and multiply by all η(s); summing all η(s) adds up the probability of every state s over all steps, which equals the average number of steps, so the episodic result is multiplied by the average episode length.
  18. exp is used because the standard deviation must be positive.
  19. DPG does not distinguish between the episodic and continuing cases. p(s→s') is the state transition probability, composed of π and p (this p is the state-action transition probability, taking s and π as inputs).
  20. In the deterministic case, the p in the integral refers to the accumulation of all state transition probabilities (η(s)); a p used for sampling an expectation is not suitable here, because it must be a probability distribution, so it should be replaced by μ(s), the normalized form of η(s).
  21. If the environment's dynamics provide enough exploration noise.
  22. N is noise - an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) was used to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia.