State Aware Imitation Learning
Y. Schroecker and C. L. Isbell
NIPS 2017
Eiji Uchibe
Why I chose this paper
• I was going to introduce “Learning Robust Rewards
with Adversarial Inverse Reinforcement Learning”
because it is closely related to my study
– This paper will be presented by Kobayashi-san
• So, I decided to talk about “State Aware Imitation
Learning”
– A stationary distribution is important in RL/IL, but it has not been
studied carefully
– Our paper (Morimura et al., 2010) plays a critical role
– I’m trying to use it as a component of intrinsic motivation
to explore the environment efficiently
Notation
• $s, a$: (continuous) state and action
• $\pi_\theta(a \mid s)$: stochastic policy parameterized by $\theta$
• $p(s' \mid s, a)$: (unknown) state transition probability
• $d^{\pi_\theta}(s)$: stationary distribution under $\pi_\theta$
• $p^{\pi_\theta}(s', a \mid s)$: forward transition probability associated with the forward Markov chain $M(\theta)$
• $q^{\pi_\theta}(s, a \mid s')$: backward transition probability associated with the backward Markov chain $B(\theta)$
Example: Stationary distribution
• Consider the 9x9 grid world
• 5 actions (stop, left, right, up, and down)
• Random policy: $\pi(a \mid s) = 1/5$
• The agent moves in the intended direction with probability 0.9
• Consider the limiting distribution
  $$d^{\pi}(s) \triangleq \lim_{t \to \infty} \Pr(S_t = s \mid A_{0:t} \sim \pi)$$
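As a quick numerical illustration (not from the slides), the stationary distribution of such a grid world can be computed by power iteration on the state transition matrix induced by the random policy. The clipping-at-the-walls and stay-in-place-on-failure dynamics below are assumptions, since the slide does not specify what happens when a move fails.

```python
import numpy as np

# Minimal sketch: stationary distribution of the 9x9 grid world under the
# uniform random policy, by power iteration on the transition matrix P.
N = 9
n_states = N * N
actions = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]  # stop, left, right, up, down

P = np.zeros((n_states, n_states))
for r in range(N):
    for c in range(N):
        s = r * N + c
        for dr, dc in actions:
            # intended next cell, clipped at the walls (assumption)
            nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
            s_next = nr * N + nc
            P[s, s_next] += 0.2 * 0.9   # each action chosen w.p. 1/5, succeeds w.p. 0.9
            P[s, s] += 0.2 * 0.1        # otherwise stay in place (assumption)

d = np.full(n_states, 1.0 / n_states)   # arbitrary initial distribution
for _ in range(10_000):
    d_next = d @ P
    if np.abs(d_next - d).max() < 1e-12:
        break
    d = d_next

print(d.reshape(N, N))  # d^pi(s) displayed on the 9x9 grid
```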
Example: Stationary distribution
• The state distribution converges to the same stationary distribution
  – It does not depend on the starting state (ergodicity)
  – For infinite-horizon average-reward problems, the stationary
    distribution plays an important role in defining the objective function
Policy-Gradient theorem
• The objective function of policy gradient RL:
  $$J(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{M(\theta)}\!\left[\sum_{t=1}^{T} r_t\right] = \iint d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, r(s, a)\, ds\, da$$
• Policy gradient algorithms provide a method to evaluate $\nabla_\theta J(\theta)$
• Policy gradient theorem (Morimura et al., 2010):
  $$\nabla_\theta J(\theta) = \iint d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, \nabla_\theta \ln \pi_\theta(a \mid s)\, Q_\gamma^{\pi_\theta}(s, a)\, ds\, da + (1 - \gamma) \int d^{\pi_\theta}(s)\, \nabla_\theta \ln d^{\pi_\theta}(s)\, V_\gamma^{\pi_\theta}(s)\, ds$$
  – $Q_\gamma^{\pi_\theta}$, $V_\gamma^{\pi_\theta}$: value functions with discount factor $\gamma$
Policy-Gradient theorem
• Policy gradient theorem (repeated):
  $$\nabla_\theta J(\theta) = \iint d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, \nabla_\theta \ln \pi_\theta(a \mid s)\, Q_\gamma^{\pi_\theta}(s, a)\, ds\, da + (1 - \gamma) \int d^{\pi_\theta}(s)\, \nabla_\theta \ln d^{\pi_\theta}(s)\, V_\gamma^{\pi_\theta}(s)\, ds$$
• The second term is usually ignored because $1 - \gamma \approx 0$ in many settings
• We still need samples from $d^{\pi_\theta}(s)$ to evaluate the policy gradient, but obtaining them is usually difficult
• $d^{\pi_\theta}(s)$ is used when deriving algorithms, but a different distribution is used in practice because $d^{\pi_\theta}(s)$ is unknown and hard to estimate
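As a minimal sketch (not from the paper) of how the first term is handled in practice: with on-policy samples $(s, a) \sim d^{\pi_\theta}(s)\,\pi_\theta(a \mid s)$, the double integral becomes a sample average. `grad_log_pi` and `q_estimate` are hypothetical callables; in practice $Q$ is itself estimated, e.g. from returns or a critic.

```python
import numpy as np

# Sketch: Monte Carlo estimate of the first term of the policy gradient,
#   E_{s~d, a~pi}[ grad_theta ln pi(a|s) * Q(s,a) ],
# from on-policy samples; the (1 - gamma) term is dropped, as discussed above.
def policy_gradient_estimate(states, actions, grad_log_pi, q_estimate):
    terms = [grad_log_pi(s, a) * q_estimate(s, a) for s, a in zip(states, actions)]
    return np.mean(terms, axis=0)
```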
Imitation learning as a Maximum-A-Posteriori (MAP) problem
• $S_D, A_D$: a set of demonstrated states and a set of demonstrated actions, respectively
• The goal is to find $\pi_\theta(a \mid s)$ that imitates the demonstrator's behavior, formulated as the MAP problem
  $$\arg\max_\theta\, p(\theta \mid S_D, A_D) = \arg\max_\theta \left[\ln p(A_D \mid S_D, \theta) + \ln p(S_D \mid \theta) + \ln p(\theta)\right]$$
• Inverse RL is also formulated as a MAP problem (Choi and Kim, 2011)
What is $p(S_D \mid \theta)$?
• The MAP objective: $\arg\max_\theta\, \ln p(A_D \mid S_D, \theta) + \ln p(S_D \mid \theta) + \ln p(\theta)$
• We can evaluate the first and third terms easily because $\pi_\theta(a \mid s)$ is implemented by ourselves
• The role of $\ln p(S_D \mid \theta)$ is to reproduce states that are similar to the ones in $S_D$
• If it is not considered, the imitation policy tends to take actions that lead to states that are not in $S_D$ (Pomerleau, 1989; Ross and Bagnell, 2010)
State Aware Imitation Learning (SAIL)
• To solve the MAP problem, we need $\nabla_\theta \ln p(S_D \mid \theta)$
• This paper assumes
  $$\nabla_\theta \ln p(S_D \mid \theta) = \sum_{s \in S_D} \nabla_\theta \ln d^{\pi_\theta}(s)$$
• $\theta$ is updated by the following gradient, where the first term inside the sum corresponds to $\nabla_\theta \ln p(A_D \mid S_D, \theta)$ and the second to $\nabla_\theta \ln p(S_D \mid \theta)$:
  $$\Delta\theta = \frac{1}{|S_D|} \sum_{(s, a) \in (S_D, A_D)} \left[\nabla_\theta \ln \pi_\theta(a \mid s) + \nabla_\theta \ln d^{\pi_\theta}(s)\right] + \nabla_\theta \ln p(\theta)$$
• The major contribution is to provide an estimator of $\nabla_\theta \ln d^{\pi_\theta}(s)$
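The update above can be written as a short sketch. Assumptions: `grad_log_pi`, `grad_log_d`, and `grad_log_prior` are placeholder callables returning flat gradient vectors with respect to $\theta$; in SAIL, `grad_log_d` is the learned estimator $f_w$ described in Section 3.

```python
import numpy as np

# Sketch of the SAIL parameter update for the MAP objective.
def sail_update(theta, demos, grad_log_pi, grad_log_d, grad_log_prior, lr=1e-3):
    grad = np.zeros_like(theta)
    for s, a in demos:                       # (s, a) pairs from (S_D, A_D)
        grad += grad_log_pi(s, a) + grad_log_d(s)
    grad /= len(demos)
    grad += grad_log_prior(theta)            # contribution of the prior ln p(theta)
    return theta + lr * grad                 # gradient ascent on the MAP objective
```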
3.1 A temporal difference approach to …
• Definition of the stationary distribution:
  $$d^{\pi_\theta}(s') = \int d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, p(s' \mid s, a)\, ds\, da$$
• Because
  $$\nabla_\theta\!\left[d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\right] p(s' \mid s, a) = d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, p(s' \mid s, a) \left[\nabla_\theta \ln d^{\pi_\theta}(s) + \nabla_\theta \ln \pi_\theta(a \mid s)\right],$$
  we obtain the following relation:
  $$\nabla_\theta d^{\pi_\theta}(s') = \int p^{\pi_\theta}(s, a, s') \left[\nabla_\theta \ln d^{\pi_\theta}(s) + \nabla_\theta \ln \pi_\theta(a \mid s)\right] ds\, da,$$
  where $p^{\pi_\theta}(s, a, s') \triangleq d^{\pi_\theta}(s)\, \pi_\theta(a \mid s)\, p(s' \mid s, a)$
Reverse transition probability
• Dividing by $d^{\pi_\theta}(s')$ yields
  $$\nabla_\theta \ln d^{\pi_\theta}(s') = \int q^{\pi_\theta}(s, a \mid s') \left[\nabla_\theta \ln d^{\pi_\theta}(s) + \nabla_\theta \ln \pi_\theta(a \mid s)\right] ds\, da,$$
  where the reverse transition probability is
  $$q^{\pi_\theta}(s, a \mid s') \triangleq \frac{p^{\pi_\theta}(s, a, s')}{d^{\pi_\theta}(s')}$$
• $q^{\pi_\theta}(s, a \mid s')$ defines the backward Markov chain $B(\theta)$, while $p^{\pi_\theta}(s', a \mid s)$ defines the forward Markov chain $M(\theta)$. See Propositions 1 and 2 of Morimura et al. (2010) for some theoretical properties
Temporal difference error
• Rearranging the previous equation, we obtain the fundamental equation for estimating $\nabla_\theta \ln d^{\pi_\theta}(s)$:
  $$\int q^{\pi_\theta}(s, a \mid s')\, \boldsymbol{\delta}(s, a, s')\, ds\, da = \boldsymbol{0},$$
  where
  $$\boldsymbol{\delta}(s, a, s') \triangleq \nabla_\theta \ln \pi_\theta(a \mid s) + \nabla_\theta \ln d^{\pi_\theta}(s) - \nabla_\theta \ln d^{\pi_\theta}(s')$$
  – Note that $\boldsymbol{\delta}$ is a vector
• Recap: the temporal difference error of Q-learning is the scalar
  $$\delta(s, a, s') \triangleq r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)$$
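To make the analogy concrete, here is a side-by-side sketch of the two error signals. All inputs are placeholders; the vector case assumes callables that return gradients with respect to $\theta$.

```python
import numpy as np

# Scalar TD error of Q-learning (Q assumed to be a table indexed by [state, action]).
def q_learning_td_error(Q, s, a, r, s_next, gamma=0.99):
    return r + gamma * np.max(Q[s_next]) - Q[s, a]

# Vector TD error used here: no reward term, and the "values" are estimates of
# grad_theta ln d^{pi_theta}; grad_log_pi and grad_log_d are hypothetical callables.
def sail_td_error(s, a, s_next, grad_log_pi, grad_log_d):
    return grad_log_pi(s, a) + grad_log_d(s) - grad_log_d(s_next)
```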
3.2 Online temporal difference learning …
• This paper proposes an online TD learning algorithm, which is suitable for deep neural networks
• $\nabla_\theta \ln d^{\pi_\theta}(s)$ is approximated by
  $$f_w(s) \approx \nabla_\theta \ln d^{\pi_\theta}(s) + c$$
  – $w$: parameters of the approximator
  – $c$: unknown constant vector
• Update rule, where the bracketed TD error compares the target value $\nabla_\theta \ln \pi_\theta(a \mid s) + f_w(s)$ with the prediction $f_w(s')$, and $\nabla_w f_w(s')$ is the Jacobian of $f_w$ with respect to $w$:
  $$\Delta w = \alpha\, \nabla_w f_w(s') \left[\nabla_\theta \ln \pi_\theta(a \mid s) + f_w(s) - f_w(s')\right]$$
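A minimal sketch of this update, instantiated with a linear approximator $f_w(s) = W\phi(s)$ so that the Jacobian term reduces to an outer product. The paper uses a feedforward network instead; $\phi$, $\alpha$, and the shapes are assumptions.

```python
import numpy as np

# Online TD update for f_w(s) ~ grad_theta ln d^{pi_theta}(s) + c,
# with a linear approximator f_w(s) = W @ phi(s) for concreteness.
def sail_td_update(W, phi, s, a, s_next, grad_log_pi, alpha=1e-3):
    f_s, f_s_next = W @ phi(s), W @ phi(s_next)
    # vector TD error: target is grad_log_pi(s, a) + f_w(s), prediction is f_w(s')
    delta = grad_log_pi(s, a) + f_s - f_s_next
    # for the linear case, grad_w of the i-th output at s' is phi(s'),
    # so the matrix-vector product collapses to an outer product
    W += alpha * np.outer(delta, phi(s_next))
    return W
```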
Constraint on $f_w(s)$
• Since $d^{\pi_\theta}(s)$ is a probability density function, it should satisfy $\int d^{\pi_\theta}(s)\, ds = 1$
• Therefore, we have to impose a constraint on $f_w(s)$ during learning
• The above constraint can be rewritten as follows:
  $$\mathbb{E}_{d^{\pi_\theta}}\!\left[\nabla_\theta \ln d^{\pi_\theta}(s)\right] = \int d^{\pi_\theta}(s)\, \nabla_\theta \ln d^{\pi_\theta}(s)\, ds = \int \nabla_\theta d^{\pi_\theta}(s)\, ds = \nabla_\theta \int d^{\pi_\theta}(s)\, ds = \nabla_\theta 1 = \boldsymbol{0}$$
• This constraint is satisfied by setting $c$ appropriately:
  $$f_w(s) \approx \nabla_\theta \ln d^{\pi_\theta}(s) + \mathbb{E}\!\left[f_w(s)\right]$$
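One simple way to realize this is to center the approximator output with its empirical mean over on-policy samples, i.e. to take $c$ as the sample estimate of $\mathbb{E}[f_w(s)]$. This is a sketch under that assumption, not necessarily the paper's exact procedure.

```python
import numpy as np

# Enforce E_d[grad_theta ln d(s)] = 0 by subtracting the empirical mean of f_w
# over a batch of on-policy states (batch and f_w are placeholder inputs).
def centered_grad_log_d(f_w, states_on_policy, s_query):
    c = np.mean([f_w(s) for s in states_on_policy], axis=0)  # estimate of E[f_w(s)]
    return f_w(s_query) - c     # approximates grad_theta ln d^{pi_theta}(s_query)
```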
Algorithm 1: SAIL
• $S_E, A_E$: the sets of states and actions generated by the agent's own policy $\pi_\theta$
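The Algorithm 1 box itself is a figure and is not reproduced here; the following is a hedged reconstruction of the overall loop from the preceding slides, reusing the sketch functions defined above. `rollout` is a placeholder that returns $(s, a, s')$ transitions collected with the current policy, i.e. the source of $(S_E, A_E)$.

```python
# Hedged sketch of the SAIL loop; not a verbatim copy of Algorithm 1.
def sail(theta, W, phi, demos, env, grad_log_pi, grad_log_prior,
         n_iterations=100, alpha=1e-3, lr=1e-3):
    for _ in range(n_iterations):
        # 1. collect on-policy samples (S_E, A_E) with the current policy pi_theta
        transitions = rollout(env, theta)                    # placeholder
        # 2. TD update of f_w toward grad_theta ln d^{pi_theta} (Section 3.2)
        for s, a, s_next in transitions:
            W = sail_td_update(W, phi, s, a, s_next,
                               lambda s_, a_: grad_log_pi(theta, s_, a_), alpha)
        # 3. SAIL gradient step on the demonstrations (S_D, A_D)
        #    (the centering from the constraint slide can be applied to W @ phi here)
        grad_log_d = lambda s_: W @ phi(s_)
        theta = sail_update(theta, demos,
                            lambda s_, a_: grad_log_pi(theta, s_, a_),
                            grad_log_d, grad_log_prior, lr)
    return theta, W
```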
4.2 Noisy bipedal walker
• The goal is to traverse a
plain without falling
• 4-dimensional action
  – torques applied to the 4 joints
  – noise is added
• 24-dimensional state
  – 4 dim. for the velocities in the x and y directions, the angle of the hull, and the angular velocity
  – 8 dim. for the positions and velocities of the 4 leg joints
  – 2 dim. for contact information
  – 10 dim. for lidar readings
Neural networks used in the experiment
• $\pi_\theta(a \mid s) = \mathcal{N}(a \mid \mu(s), \Sigma)$: Gaussian policy
  – $\mu(s)$: one hidden layer consisting of 100 nodes with tanh activations
  – $\Sigma$: diagonal covariance matrix
• $f_w(s)$: feedforward NN
  – two hidden layers of 80 nodes each using ReLU activations
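A minimal sketch of these two networks. The layer sizes come from the slide; the use of PyTorch, the state-independent log-std parameterization, and the convention that $f_w$ outputs one value per policy parameter are assumptions.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = N(a | mu(s), Sigma) with a state-independent diagonal Sigma."""
    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # diagonal covariance

    def log_prob(self, s, a):
        dist = torch.distributions.Normal(self.mu(s), self.log_std.exp())
        return dist.log_prob(a).sum(-1)

class GradLogDNet(nn.Module):
    """f_w(s) ~ grad_theta ln d^{pi_theta}(s) + c; output dim = number of policy parameters."""
    def __init__(self, state_dim, n_policy_params, hidden=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_policy_params))

    def forward(self, s):
        return self.net(s)
```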
4.2 Noisy bipedal walker
My opinions
• Is learning the gradient of $d^{\pi_\theta}$ easier than learning a value function?
• The dimension of the gradient is equal to the number of policy parameters. In general, it is too large if we use a deep NN policy
• The target-network technique is not used
• A trust-region framework should be introduced to avoid catastrophic changes
References
• Choi, J. and Kim, K.-E. (2011). MAP inference for Bayesian inverse reinforcement learning. NIPS 24.
• Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation, 22, 342-376.
• Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network. NIPS 1.
• Ross, S. and Bagnell, J. A. (2010). Efficient reductions for imitation learning. In Proc. of the 13th AISTATS.
• Schroecker, Y. and Isbell, C. L. (2017). State aware imitation learning. NIPS 30.