Continuous control with deep reinforcement learning
2016-06-28
Taehoon Kim
Motivation
• DQN can only handle
• discrete (not continuous)
• low-dimensional action spaces
• A simple approach to adapting DQN to continuous domains is discretizing the action space
• a 7-degree-of-freedom system discretized as a_i ∈ {−k, 0, k} per joint
• now the action space has 3^7 = 2187 discrete actions
• an explosion of the number of discrete actions (see the sketch below)
2
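A minimal sketch (not from the paper) of why naive discretization blows up; the helper function is hypothetical:

```python
# Size of a discretized action space: bins ** degrees_of_freedom.
def discretized_action_count(bins: int, dof: int) -> int:
    return bins ** dof

print(discretized_action_count(3, 7))   # 2187, the 7-DOF example above
print(discretized_action_count(11, 7))  # 19487171 -- finer bins explode the count
```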
Contribution
• Presents a model-free, off-policy actor-critic algorithm
• learns policies in high-dimensional, continuous action spaces
• The work is based on DPG (Deterministic Policy Gradient)
3
Background
• actions a_t ∈ ℝ^N, action space 𝒜 = ℝ^N
• history of observation–action pairs s_t = (x_1, a_1, …, a_{t−1}, x_t)
• assume fully observable, so s_t = x_t
• policy π: 𝒮 → 𝒫(𝒜)
• Model the environment as a Markov decision process
• initial state distribution p(s_1)
• transition dynamics p(s_{t+1} | s_t, a_t)
4
Background
• Discounted future reward R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) (see the sketch below)
• Goal of RL is to learn a policy π which maximizes the expected return
• from the start distribution: J = 𝔼_{r_i, s_i∼E, a_i∼π}[R_1]
• Discounted state visitation distribution for a policy π: ρ^π
5
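As a concrete illustration, a hedged sketch of the discounted return for a finite reward sequence (the helper is hypothetical, not from the paper):

```python
def discounted_return(rewards, gamma=0.99, t=0):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i) for a finite episode."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))      # 1.0 + 0.0 + 0.81*2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9, t=1)) # 0.0 + 0.9*2.0 = 1.8
```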
Background
• action-value function Q^π(s_t, a_t) = 𝔼_{r_{i≥t}, s_{i>t}∼E, a_{i>t}∼π}[R_t | s_t, a_t]
• expected return after taking an action a_t in state s_t and following policy π
• Bellman equation
• Q^π(s_t, a_t) = 𝔼_{r_t, s_{t+1}∼E}[ r(s_t, a_t) + γ 𝔼_{a_{t+1}∼π}[ Q^π(s_{t+1}, a_{t+1}) ] ]
• With a deterministic policy μ: 𝒮 → 𝒜, the inner expectation drops (one-line derivation below)
• Q^μ(s_t, a_t) = 𝔼_{r_t, s_{t+1}∼E}[ r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})) ]
6
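The step from the stochastic to the deterministic Bellman equation is just that a deterministic μ puts all probability mass on μ(s_{t+1}), so the inner expectation collapses:

```latex
\mathbb{E}_{a_{t+1}\sim\mu}\left[Q^{\mu}(s_{t+1}, a_{t+1})\right]
  = Q^{\mu}\!\left(s_{t+1}, \mu(s_{t+1})\right)
```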
Background
• Expectation depends only on the environment
• so it is possible to learn Q^μ off-policy, using transitions generated from a different stochastic behavior policy β
• Q-learning (a commonly used off-policy algorithm) uses the greedy policy μ(s) = argmax_a Q(s, a)
• L(θ^Q) = 𝔼_{s_t∼ρ^β, a_t∼β, r_t∼E}[ (Q(s_t, a_t | θ^Q) − y_t)^2 ]
• where y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q)
• To scale Q-learning to large non-linear approximators, DQN introduced:
• a replay buffer and a separate target network (see the critic-loss sketch below)
7
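A minimal PyTorch-style sketch of this critic loss with target networks; `critic`, `critic_target`, `actor_target`, and the batch layout are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, critic_target, actor_target, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # column tensors sampled from the replay buffer
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1})), via target networks
        y = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next))
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]
    return F.mse_loss(critic(s, a), y)
```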
Deterministic Policy Gradient (DPG)
• In continuous spaces, finding the greedy policy requires an optimization of a_t at every timestep
• too slow for large, unconstrained function approximators and nontrivial action spaces
• Instead, use an actor-critic approach based on the DPG algorithm (both networks are sketched below)
• actor: μ(s | θ^μ): 𝒮 → 𝒜
• critic: Q(s, a | θ^Q)
8
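A sketch of what the two networks might look like; the layer sizes follow the experiment details on slide 15, everything else (PyTorch, class names) is an assumption:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta^mu): S -> A, tanh output bounds the actions."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a | theta^Q); the action enters only at the 2nd hidden layer."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)
```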
Learning algorithm
• The actor is updated by applying the chain rule to the expected return from the start distribution J w.r.t. θ^μ
• ∇_{θ^μ} J ≈ 𝔼_{s∼ρ^β}[ ∇_{θ^μ} Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t | θ^μ)} ]
• = 𝔼_{s∼ρ^β}[ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
• Silver et al. (2014) proved this is the policy gradient
• the gradient of the policy's performance (see the sketch below)
9
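In an autodiff framework the same chain rule falls out of maximizing Q at the actor's own action; a hedged sketch reusing the hypothetical networks from the previous slide:

```python
def actor_loss(actor, critic, s):
    # Maximizing E[Q(s, mu(s | theta^mu))] == minimizing its negation.
    # Backpropagating through the critic w.r.t. the action yields
    # grad_a Q * grad_theta^mu mu, i.e. the deterministic policy gradient.
    return -critic(s, actor(s)).mean()
```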
Contributions
• Introducing non-linear function approximators means that convergence is no longer guaranteed
• but they are essential to learn and generalize on large state spaces
• Contribution
• provide modifications to DPG, inspired by the success of DQN
• allowing neural network function approximators to learn online in large state and action spaces
10
Challenges 1
• NNs for RL usually assume that the samples are i.i.d.
• but when the samples are generated by exploring sequentially in an environment, this assumption no longer holds
• As in DQN, a replay buffer is used to address this issue
• As in DQN, target networks are used for stable learning, but with "soft" target updates
• θ′ ← τθ + (1 − τ)θ′, with τ ≪ 1 (see the sketch below)
• Target networks change slowly, which greatly improves the stability of learning
11
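A sketch of the soft target update for PyTorch modules (the helper name is hypothetical):

```python
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```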
Challenges 2
• When learning from a low-dimensional feature vector, observations may have different physical units (e.g. positions and velocities)
• this makes it difficult to learn effectively and to find hyper-parameters which generalize across environments
• Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to zero mean and unit variance
• also maintain a running average of the mean and variance for normalization during testing (i.e. during exploration or evaluation)
• Applied to all layers of μ and to all layers of Q prior to the action input (see the sketch below)
• Can train tasks with different units without needing to manually ensure they fall within a set range
12
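A sketch of where batch normalization might sit in the actor, under the same assumptions as the earlier network sketch:

```python
import torch.nn as nn

state_dim, action_dim = 17, 6  # hypothetical low-dimensional task

# BatchNorm1d normalizes each feature dimension across the minibatch, so
# positions and velocities with very different scales can share one set of
# hyper-parameters; here it covers the state input and both hidden layers.
actor_with_bn = nn.Sequential(
    nn.BatchNorm1d(state_dim),
    nn.Linear(state_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
    nn.Linear(400, 300), nn.BatchNorm1d(300), nn.ReLU(),
    nn.Linear(300, action_dim), nn.Tanh(),
)
```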
Challenges 3
• An advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently from the learning algorithm
• Construct an exploration policy μ′ by adding noise sampled from a noise process 𝒩
• μ′(s_t) = μ(s_t | θ_t^μ) + 𝒩
• Use an Ornstein–Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical environments with inertia (a sketch follows)
13
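A hedged sketch of an Ornstein–Uhlenbeck noise process with the parameters from the experiment-details slide; the class itself is an assumption, not the authors' code:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x += theta * (mu - x) + sigma * N(0, 1)."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, mu=0.0):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.x = np.full(action_dim, mu, dtype=np.float64)

    def sample(self):
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        return self.x

noise = OrnsteinUhlenbeckNoise(action_dim=6)
# exploration action: a_t = mu(s_t | theta^mu) + N_t
# a = actor(s).detach().numpy() + noise.sample()
```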
14
Experiment details
• Adam optimizer: lr_μ = 10^{-4}, lr_Q = 10^{-3}
• Q includes L2 weight decay of 10^{-2}; γ = 0.99
• τ = 0.001
• ReLU for hidden layers, tanh for the output layer of the actor to bound the actions
• NN: 2 hidden layers with 400 and 300 units
• The action is not included until the 2nd hidden layer of Q
• The final layer weights and biases are initialized from a uniform distribution [−3×10^{-3}, 3×10^{-3}]
• to ensure the initial outputs for the policy and value estimates are near zero
• The other layers are initialized from uniform distributions [−1/√f, 1/√f], where f is the fan-in of the layer
• Replay buffer size ℛ = 10^6; Ornstein–Uhlenbeck process: θ = 0.15, σ = 0.2 (collected in the config sketch below)
15
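The hyper-parameters above, collected into a hedged configuration sketch (key names are my own, not from the paper):

```python
ddpg_config = {
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "critic_weight_decay": 1e-2,   # L2 penalty on Q only
    "gamma": 0.99,
    "tau": 0.001,                  # soft target update rate
    "hidden_units": (400, 300),
    "final_init_range": 3e-3,      # uniform init of the final layer weights/biases
    "replay_buffer_size": int(1e6),
    "ou_theta": 0.15,
    "ou_sigma": 0.2,
}
```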
References
1. [Wang, 2015] Wang, Z., de Freitas, N., & Lanctot, M. (2015). Dueling network architectures for
deep reinforcement learning. arXiv preprint arXiv:1511.06581.
2. [Van, 2015] Van Hasselt, H., Guez, A., & Silver, D. (2015). Deep reinforcement learning with
double Q-learning. CoRR, abs/1509.06461.
3. [Schaul, 2015] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience
replay. arXiv preprint arXiv:1511.05952.
4. [Sutton, 1998] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
16