CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building Room 810
Web. https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May. 19, 2020
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q-learning evolved into DQN (2015) through three additions:
1. replay buffer
2. deep neural network
3. target network
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN achieves human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
• High-dimensional observations: the deep neural network extracts features from high-dimensional input
• Learning stability: the target network keeps the training process stable
[Diagram: the agent acts on the environment with $a_t = \arg\max_a Q(s_t, a; \theta)$, stores transitions $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer, and the DQN loss compares $Q(s_t, a_t; \theta)$ against the target network's $\max_a Q(s_{t+1}, a; \theta')$; $\theta$ is periodically copied to $\theta'$.]
Q-learning update:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
DQN update:
$Q^\pi_\theta(s_t, a_t) \leftarrow Q^\pi_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q^\pi_{\theta'}(s_{t+1}, a) - Q^\pi_\theta(s_t, a_t) \right]$
Policy ($\pi$): $a_t = \arg\max_a Q^\pi_\theta(s_t, a)$
Notation: $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward-to-go
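To make the update concrete, here is a minimal sketch of one DQN loss step in PyTorch; `q_net`, `target_net`, `optimizer`, and the batch layout are illustrative assumptions, not part of the original slides.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: tensors (s, a, r, s_next, done) sampled from the replay buffer;
    # done is 1.0 for terminal transitions, 0.0 otherwise.
    s, a, r, s_next, done = batch

    # Q(s_t, a_t; theta) for the actions actually taken.
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # TD target r_{t+1} + gamma * max_a Q(s_{t+1}, a; theta'),
    # computed with the frozen target network theta'.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    loss = F.mse_loss(q, target)  # the "DQN loss" from the diagram
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Every fixed number of steps, $\theta$ is copied into the target network $\theta'$, which is what keeps the regression target stable.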
DQN to DDPG: Limitation of DQN (discrete action spaces)
• Discrete action spaces
- DQN can only handle discrete, low-dimensional action spaces
- As the action dimension grows, the number of actions (output nodes) grows exponentially
- e.g., $k$ discrete values per dimension in an $n$-dimensional action space -> $k^n$ actions
• DQN cannot be straightforwardly applied to continuous domains
• Why? -> 1. Policy ($\pi$): $a_t = \arg\max_a Q^\pi_\theta(s_t, a)$
2. Update: $Q^\pi_\theta(s_t, a_t) \leftarrow Q^\pi_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q^\pi_{\theta'}(s_{t+1}, a) - Q^\pi_\theta(s_t, a_t) \right]$
Both the policy and the update require a maximization over actions, which becomes an intractable inner optimization when the action space is continuous.
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: DQN with Policy gradient methods
Q-learning -> DQN, via 1. replay buffer, 2. deep neural network, 3. target network
Policy gradient (REINFORCE) -> Actor-critic -> DPG, which handles continuous action spaces
DQN + DPG -> DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Policy gradient: the goal of reinforcement learning
Goal of reinforcement learning: find the policy parameters that maximize the expected return,
$\theta^* = \arg\max_\theta E_{\tau \sim p_\theta(\tau)}\left[ \sum_t r(s_t, a_t) \right]$ (the objective $J(\theta)$)
Trajectory distribution:
$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
Policy ($\pi_\theta$): a stochastic policy with weights $\theta$
[Diagram: the agent's policy $\pi_\theta(a_t \mid s_t)$ sends an action $a_t$ to the world; the world's model $p(s_{t+1} \mid s_t, a_t)$ returns the reward $r_t$ and next state $s_{t+1}$.]
[Diagram: a Markov decision process unrolled as the chain $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$ with transition probabilities $p(s_2 \mid s_1, a_1)$, $p(s_3 \mid s_2, a_2)$, $p(s_4 \mid s_3, a_3)$; a trajectory $\tau$ is one such state-action sequence.]
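As a small illustration of the trajectory factorization above, the sketch below accumulates $\log p_\theta(\tau)$ term by term; the `log_p_s1`, `log_policy`, and `log_dynamics` callables are hypothetical stand-ins.

```python
def trajectory_log_prob(states, actions, log_p_s1, log_policy, log_dynamics):
    # log p_theta(s_1, a_1, ..., s_T, a_T)
    #   = log p(s_1) + sum_t [ log pi_theta(a_t | s_t) + log p(s_{t+1} | s_t, a_t) ]
    # states: [s_1, ..., s_{T+1}], actions: [a_1, ..., a_T]
    logp = log_p_s1(states[0])
    for t, a in enumerate(actions):
        logp += log_policy(a, states[t])                    # log pi_theta(a_t | s_t)
        logp += log_dynamics(states[t + 1], states[t], a)   # log p(s_{t+1} | s_t, a_t)
    return logp
```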
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t \mid s_t)$
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: given a state $s_t$, the policy network $\pi_\theta(a_t \mid s_t)$ assigns a probability to each action, e.g. 0.1, 0.1, 0.2, 0.2, 0.4.]
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t \mid s_t)$
Objective:
$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[ \sum_t r(s_t, a_t) \right]$
Policy gradient:
$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]$
Monte-Carlo estimate over $N$ episodes:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$
Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $N$ is the number of episodes, $\theta$ the weights of the actor network, and $\alpha$ the learning rate
Problem: the agent must experience several complete episodes before each update, which causes
1. a slow training process
2. high gradient variance
[Figure: $N$ rollouts from the initial state yield episode returns $r_1, r_2, \ldots, r_N$.]
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
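A minimal sketch of this Monte-Carlo estimate, assuming a `policy` network that maps a batch of states to a `torch.distributions.Categorical`; all names are illustrative.

```python
import torch

def reinforce_update(policy, optimizer, episodes):
    # episodes: N trajectories, each a list of (state, action, reward) tuples
    loss = 0.0
    for trajectory in episodes:
        states = torch.stack([s for s, _, _ in trajectory])
        actions = torch.tensor([a for _, a, _ in trajectory])
        episode_return = sum(r for _, _, r in trajectory)   # sum_t r(s_t, a_t)

        # (sum_t grad log pi_theta(a_t | s_t)) weighted by the episode return;
        # minimizing -J(theta) performs gradient ascent on J(theta)
        log_probs = policy(states).log_prob(actions)
        loss = loss - log_probs.sum() * episode_return

    loss = loss / len(episodes)   # average over the N sampled episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the whole-episode return multiplies every log-probability term, a single noisy episode shifts the entire gradient, which is the variance problem noted above.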
Policy gradient: Actor-critic
• Actor ($\pi_\theta(a_t \mid s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions; it replaces REINFORCE's Monte-Carlo return $\sum_t r(s_t, a_t)$ with the learned estimate $Q_\phi(s_t, a_t)$
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t \mid s_t)$ $i$ times
2. Fit $Q_\phi(s_t, a_t)$ to the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_\phi(s_t, a_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\phi$ denotes the weights of the critic network
This addresses REINFORCE's two problems, 1. high gradient variance and 2. slow training, since updates no longer wait for complete episodes.
[Diagram: the actor $\pi_\theta(a_t \mid s_t)$ sends $a_t$ to the environment and observes $s_t$; transitions $(s_t, a_t, r_t, s_{t+1})_{0 \sim i}$ update the critic $Q_\phi(s_t, a_t)$, which returns $\nabla J(\theta)$ to the actor. Training alternates: sample data $i$ times, then update critic & actor.]
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
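A sketch of steps 1-4 under the same assumptions as before (a categorical `actor`, plus a `critic` network mapping state-action pairs to Q-values); the one-step TD target used to fit the critic is a common choice, not spelled out on the slide.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    # batch: (s, a, r, s_next, a_next) collected from the current policy (step 1)
    s, a, r, s_next, a_next = batch

    # Step 2: fit Q_phi toward a one-step TD target
    with torch.no_grad():
        target = r + gamma * critic(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Steps 3-4: grad J(theta) ~ sum_i grad log pi_theta(a|s) * Q_phi(s, a)
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * critic(s, a).detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```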
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
[Figure: for a 2-dimensional action $(a_x, a_y)$ discretized into 5 values per dimension, a stochastic policy $\pi_\theta(a_t \mid s_t)$ needs 10 output nodes, whereas the deterministic policy $\mu_\theta(s_t)$ needs only 2 output nodes, one per action dimension.]
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $i$ times
2. Fit $Q_\phi(s_t, a_t)$ to the samples
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a) \big|_{a = \mu_\theta(s_t)}$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Trajectory distribution (stochastic policy):
$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, with objective $J(\theta) = E_{s,a \sim p_\theta(\tau)}\left[ Q_\phi(s_t, a_t) \right]$
Trajectory distribution (deterministic policy):
$p_\theta(s_1, s_2, \ldots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, \mu_\theta(s_t))$, with objective $J(\theta) = E_{s \sim p_\theta(\tau)}\left[ Q(s, \mu_\theta(s)) \right]$
Critic loss (squared TD error): $L(\phi) = \big( r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t) \big)^2$
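Step 3's chain rule $\nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a)$ falls out automatically when the critic is evaluated at the actor's output and autograd differentiates through it; a minimal sketch with assumed `actor` and `critic` networks:

```python
import torch

def dpg_actor_update(actor, critic, actor_opt, states):
    # grad_theta J ~ grad_theta mu_theta(s) * grad_a Q_phi(s, a)|_{a = mu_theta(s)};
    # autograd applies this chain rule when Q is evaluated at a = mu_theta(s).
    actions = actor(states)                       # a = mu_theta(s), differentiable
    actor_loss = -critic(states, actions).mean()  # ascend Q by descending -Q
    actor_opt.zero_grad()
    actor_loss.backward()                         # backprop through the critic into the actor
    actor_opt.step()
```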
DDPG: DQN + DPG
From Q-learning -> DQN:
+ off-policy learning: replay buffer
+ stable updates: target network
+ high-dimensional observation spaces
- discrete, low-dimensional action spaces
From policy gradient (REINFORCE) -> actor-critic -> DPG:
+ continuous action spaces
+ lower variance than REINFORCE
- no replay buffer: correlated samples
- no target network: unstable training
DDPG combines the two lines, keeping the advantages of each.
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: algorithm(1/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• Policy: DQN's $a = \arg\max_a Q^\pi_\theta(s, a)$ is replaced by the deterministic actor $a = \mu_\theta(s)$
• Exploration: add white Gaussian noise to the action, $\mu'(s) = \mu_\theta(s) + N$
• Soft target update: $\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'$ where $\tau \ll 1$
- The target network is constrained to change slowly
- This stabilizes the training process
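A sketch of the soft target update; `tau=0.005` is an illustrative small value (the slides only require $\tau \ll 1$):

```python
import torch

def soft_update(net, target_net, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta', with tau << 1
    with torch.no_grad():
        for p, p_target in zip(net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```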
DDPG: algorithm(2/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Four networks: policy $\mu_\theta$, target policy $\mu_{\theta'}$, critic $Q_\phi$, target critic $Q_{\phi'}$
[Diagram: the actor selects $a_t = \mu_\theta(s_t) + N$; the environment returns $(s_t, a_t, r_t)$ and $s_{t+1}$; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer and sampled as minibatches of $i$ transitions. The critic is updated with loss $L(\phi)$ using the target policy's $\mu_{\theta'}(s_{t+1})$; the actor is updated with $\nabla J(\theta)$ through $\mu_\theta(s_t)$; finally $\phi'$ and $\theta'$ receive soft updates.]
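Putting the pieces together, below is a hedged sketch of one full DDPG training step matching the diagram above; the network and replay-buffer interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def ddpg_step(actor, actor_target, critic, critic_target,
              actor_opt, critic_opt, replay, batch_size=64,
              gamma=0.99, tau=0.005):
    # Sample a minibatch of stored transitions from the replay buffer
    s, a, r, s_next, done = replay.sample(batch_size)

    # Critic update: the TD target uses the *target* actor and critic
    with torch.no_grad():
        a_next = actor_target(s_next)             # mu_theta'(s_{t+1})
        y = r + gamma * (1.0 - done) * critic_target(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)     # L(phi)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: deterministic policy gradient through the critic
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update both target networks: w' <- tau * w + (1 - tau) * w'
    with torch.no_grad():
        for net, tgt in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```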
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG is used as the local planner for long-range navigation
DDPG example: multi-agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
Conclusion & Future work
• DQN cannot handle continuous action spaces directly
• DDPG handles continuous action spaces via the deterministic policy gradient and an actor-critic architecture
• MADDPG extends DDPG to multi-agent RL
• Future work: apply DDPG to continuous-action decision-making problems
• e.g., navigation, obstacle avoidance
Appendix: Objective gradient derivation
Appendix: DPG objective
Appendix: DDPG algorithm