Continuous Control with Deep Reinforcement Learning
Steve JobSae
Wangyu Han, Eunjoo Yang
Continuous Control with Deep Reinforcement Learning
2
1 Overview
2 Algorithm Detail
3 Results
4 Conclusion
Overview
4
Why we choose this paper?
- Angry Birds has
  - High-dimensional state spaces
  - A 3-dimensional continuous action space
  - θ (-90.00° ~ 90.00°), R (pixels), t_tap (ms)
Overview
5
Why we choose this paper? (Motivation for us)
- Policy Gradient
  - Better convergence properties
  - Effective in high-dimensional or continuous action spaces
- DQN
  - Can solve problems with high-dimensional state spaces
- How about Policy Gradient Method + DQN?
Overview
6
Motivation for authors
- Trying to use DQN in continuous action domains
  - However, DQN can only handle discrete and low-dimensional action spaces
  - To adapt DQN to continuous domains, the action space must be discretized
  - However, with fine-grained discretization, the number of discrete actions explodes
Overview
7
Motivation for authors
- For example, in a 7-degree-of-freedom system with a rough discretization a_i ∈ {−k, 0, k}, the number of actions is 3^7 = 2187 (see the sketch below)
- Such large action spaces are difficult to explore efficiently
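To make this blow-up concrete, a tiny illustrative sketch; the 11-bin case is our own assumption, not a number from the paper:

```python
# Illustrative only: count the discrete actions produced by discretizing each action dimension.
def num_discrete_actions(dof: int, bins_per_dim: int) -> int:
    """Each of `dof` action dimensions is discretized into `bins_per_dim` values."""
    return bins_per_dim ** dof

print(num_discrete_actions(dof=7, bins_per_dim=3))   # 2187, the slide's example (3^7)
print(num_discrete_actions(dof=7, bins_per_dim=11))  # 19487171 -- finer bins explode quickly (assumed bin count)
```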
Overview
8
How to solve the problem in this paper?
- Off-policy actor-critic algorithm with deep function approximators
- DDPG = DQN + DPG (Deep Deterministic Policy Gradient)
- Use a Policy Gradient method
  - Among them, the Deterministic Policy Gradient
Algorithm Details
10
DDPG = DQN + DPG
Algorithm Details
11
DQN for Continuous Action Space?
- In DQN (for Atari), unprocessed raw pixels were used as the raw input
Algorithm Details
13
DQN for Continuous Action Space?
- It is not possible to straightforwardly apply Q-learning to continuous action spaces
L_i(θ_i) = 𝔼_{(s,a,r,s′)~U(D)}[ ( r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i) )^2 ]
  - The max over a′ makes this improper for a continuous action space
- DPG maintains a parameterized actor function μ(s | θ^μ), which specifies the current policy by deterministically mapping states to a specific action
- DDPG uses an actor-critic approach based on the DPG algorithm
Algorithm Details
14
DDPG = DQN + DPG
Algorithm Details
16
Preliminary – DPG (Deterministic Policy Gradient)
- Deterministic Policy Gradient (presented by the Pork BBQ team)
- Instead of using a stochastic policy
π_θ(s, a) = p(a | s; θ)
- DPG defines the policy as a deterministic one
a = μ_θ(s),   μ : S → A
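To make the deterministic actor concrete, here is a minimal PyTorch sketch of what μ might look like (our illustration; the hidden sizes, tanh output, and `action_scale` are assumptions rather than the paper's exact architecture):

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """μ(s | θ^μ): maps a state vector directly to one action vector."""
    def __init__(self, state_dim: int, action_dim: int, action_scale: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # bounded output in [-1, 1]
        )
        self.action_scale = action_scale

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.action_scale * self.net(state)

# a = μ_θ(s): no sampling involved, unlike a stochastic policy π(a|s)
actor = DeterministicActor(state_dim=8, action_dim=2)
action = actor(torch.randn(1, 8))
```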
Algorithm Details
18
Preliminary – Off-Policy Deterministic Actor-Critic
- The conventional action-value function, written in Bellman-equation form,
Q^π(s_t, a_t) = 𝔼_{r_t, s_{t+1}~E}[ r(s_t, a_t) + γ 𝔼_{a_{t+1}~π}[ Q^π(s_{t+1}, a_{t+1}) ] ]
- can be re-written as below, because the target policy μ is deterministic:
Q^μ(s_t, a_t) = 𝔼_{r_t, s_{t+1}~E}[ r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})) ]
Algorithm Details
19
Preliminary – Off-Policy Deterministic Actor-Critic
- Critic: Q-learning, an off-policy algorithm
  - Linear function approximation parameterized by θ^Q
  - Minimize the MSE between the TD target and the approximated Q function:
L(θ^Q) = 𝔼_{s_t~ρ^β, a_t~β, r_t~E}[ ( Q(s_t, a_t | θ^Q) − y_t )^2 ]
where  y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q)
(exploration comes from the stochastic behavior policy β)
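A hedged sketch of this critic update (our illustration; it uses a small neural-network critic instead of the linear approximator mentioned above, and `policy` stands for the deterministic μ, e.g. the actor sketched earlier):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a | θ^Q): scores a state-action pair (the slide's linear case would be a single Linear layer)."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))

def critic_loss(critic: Critic, policy, s, a, r, s_next, gamma: float = 0.99) -> torch.Tensor:
    """L(θ^Q) = E[(Q(s_t, a_t | θ^Q) - y_t)^2] with y_t = r + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q).

    `r` is expected with shape (N, 1) to match the critic output.
    """
    with torch.no_grad():                               # the TD target is treated as a constant
        y = r + gamma * critic(s_next, policy(s_next))
    return nn.functional.mse_loss(critic(s, a), y)
```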
Algorithm Details
20
Preliminary – Off-Policy Deterministic Actor-Critic
- Actor: Deterministic Policy Gradient
  - Update the policy parameters θ^μ in the direction of the action-value gradient:
∇_{θ^μ} J ≈ 𝔼_{s_t~ρ^β}[ ∇_{θ^μ} Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t|θ^μ)} ]
          = 𝔼_{s_t~ρ^β}[ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]   (by the chain rule)
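With automatic differentiation the chain rule above is applied for us: maximizing Q(s, μ(s)) with respect to the actor parameters yields exactly this gradient. A minimal sketch (our illustration, reusing the actor and critic sketched earlier; only the actor optimizer is stepped):

```python
import torch

def actor_loss(critic, actor, states: torch.Tensor) -> torch.Tensor:
    # Gradient ascent on E[ Q(s, μ(s|θ^μ)) ] == gradient descent on its negation.
    # Backprop through the critic into the actor applies ∇_a Q · ∇_{θ^μ} μ automatically.
    return -critic(states, actor(states)).mean()

# usage sketch:
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
# actor_opt.zero_grad(); actor_loss(critic, actor, s_batch).backward(); actor_opt.step()
```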
Algorithm Details
21
DDPG
- Critic: Deep Q-Network
  - Non-linear function approximation parameterized by θ^Q
L(θ^Q) = 𝔼_{s_t~ρ^β, a_t~β, r_t~E}[ ( Q(s_t, a_t | θ^Q) − y_t )^2 ]
y_t = r(s_t, a_t) + γ Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′})
- Compare with the original DQN loss, whose max over a′ is replaced by the deterministic policy μ′:
L_i(θ_i) = 𝔼_{(s,a,r,s′)~U(D)}[ ( r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i) )^2 ]
Algorithm Details
22
DDPG = DQN + DPG
Algorithm Details
24
Deep DPG (DDPG)
[1] Replay Buffer
- One challenge when using neural networks for reinforcement learning is that most optimization algorithms assume that the samples are independently and identically distributed.
- When the samples are generated from exploring sequentially in an environment, this assumption no longer holds.
- Use a replay buffer
  - The replay buffer is a finite-sized cache ℛ
  - Transitions (s_t, a_t, r_t, s_{t+1}) are stored in the replay buffer
  - At each time step, the actor and critic are updated by sampling a minibatch uniformly from the buffer.
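A minimal replay-buffer sketch matching the description above (our illustration; the default capacity is an assumption):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache ℛ of transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded when full

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the temporal correlation of sequential experience.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```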
Algorithm Details
26
Deep DPG (DDPG)
[2] soft target updates
- Directly implementing Q-learning with neural networks proved to be unstable in many environments.
  - Since the network Q(s, a | θ^Q) being updated is also used in calculating the target value, the Q update is prone to divergence.
- Soft target updates, rather than directly copying the weights.
Algorithm Details
29
Deep DPG (DDPG)
[2] soft target updates
- Create a copy of the actor and critic networks, Q′(s, a | θ^{Q′}) and μ′(s | θ^{μ′}), that are used for calculating the target values
- The weights of these target networks are then updated by having them slowly track the learned networks:
θ′ ← τθ + (1 − τ)θ′,  where τ ≪ 1
- This means the target values are constrained to change slowly, greatly improving the stability of learning
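A minimal sketch of the soft update θ′ ← τθ + (1 − τ)θ′ (our illustration; τ = 0.001 is an assumed value, and only the learnable parameters are tracked here):

```python
import torch

@torch.no_grad()
def soft_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float = 0.001):
    """θ' ← τ θ + (1 - τ) θ' -- target weights slowly track the learned weights."""
    for tgt_param, src_param in zip(target_net.parameters(), source_net.parameters()):
        tgt_param.mul_(1.0 - tau).add_(src_param, alpha=tau)

# usage sketch: called once per training step, for both the target critic and the target actor
# soft_update(critic_target, critic, tau=0.001)
# soft_update(actor_target, actor, tau=0.001)
```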
Algorithm Details
32
Deep DPG (DDPG)
[3] batch normalization
- When learning from low-dimensional feature-vector observations, the different components of the observation may have different physical units (e.g., positions vs. velocities), and the ranges may vary across environments
- Use batch normalization
  - Batch normalization normalizes each dimension across the samples in a minibatch to zero mean and unit variance.
  - DDPG uses batch normalization on the state input, on all layers of the μ network, and on all layers of the Q network prior to the action input
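A hedged sketch of where batch normalization might be placed, following the bullet above (our illustration; layer sizes are assumptions, and the critic applies batch norm only on the state pathway before the action is concatenated):

```python
import torch
import torch.nn as nn

class BNActor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_dim),              # normalize the raw state input
            nn.Linear(state_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
            nn.Linear(400, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

class BNCritic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        # Batch norm only on the layers prior to the action input.
        self.state_path = nn.Sequential(
            nn.BatchNorm1d(state_dim),
            nn.Linear(state_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
        )
        self.joint_path = nn.Sequential(            # no batch norm once the action is concatenated
            nn.Linear(400 + action_dim, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.joint_path(torch.cat([self.state_path(s), a], dim=-1))
```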
Algorithm Details
34
Deep DPG (DDPG)
[4] exploration in continuous action space
- A major challenge of learning in continuous action spaces is exploration
- DDPG constructs an exploration policy μ′ by adding noise sampled from a noise process 𝒩 to the actor policy:
μ′(s_t) = μ(s_t | θ_t^μ) + 𝒩
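The paper uses a temporally correlated noise process for 𝒩 (an Ornstein-Uhlenbeck process). Below is a minimal sketch of such a process (our illustration; the theta, sigma, and dt values are assumptions):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise 𝒩 added to μ(s_t); the parameters here are illustrative."""
    def __init__(self, action_dim: int, theta: float = 0.15, sigma: float = 0.2, dt: float = 1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(action_dim)

    def reset(self):
        self.x = np.zeros_like(self.x)

    def sample(self) -> np.ndarray:
        # dx = θ(0 - x) dt + σ √dt · N(0, 1): a mean-reverting random walk
        self.x += self.theta * (-self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        return self.x

# exploration action: a_t = μ(s_t | θ^μ) + noise.sample()
```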
Algorithm Details
35
Deep DPG (DDPG) – Algorithm
[Pseudocode figure, stepped through over slides 35-43: initialize the critic, actor, target networks, and replay buffer; for M episodes, initialize a random process for exploration and receive the initial state; then, for each step, select a noisy action, execute it, observe the reward and next state, store the transition in the replay buffer, sample a random minibatch, update the critic and actor, and softly update the target networks]
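Putting the pieces together, a hedged sketch of one DDPG training step, reusing the `DeterministicActor`, `Critic`, `ReplayBuffer`, and `soft_update` sketches above (our own illustration of the pseudocode, not the authors' code; batch size, γ, and τ are assumed values, and the noisy action selection / environment step / transition storage happen outside this function):

```python
import numpy as np
import torch
import torch.nn.functional as F

# Target networks start as copies of the learned ones, e.g.:
#   actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, buffer,
                batch_size: int = 64, gamma: float = 0.99, tau: float = 0.001):
    """One inner-loop step of the pseudocode: sample a minibatch, update critic, actor, targets."""
    batch = buffer.sample(batch_size)                   # list of (s, a, r, s_next) tuples
    s, a, r, s_next = (torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
    r = r.reshape(-1, 1)                                # (N, 1) to match the critic output

    # Critic: minimize (Q(s_t, a_t) - y_t)^2 with y_t from the *target* actor and critic
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    critic_opt.zero_grad()
    F.mse_loss(critic(s, a), y).backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, μ(s)) by backprop through the critic
    actor_opt.zero_grad()
    (-critic(s, actor(s)).mean()).backward()
    actor_opt.step()

    # Soft target updates: θ' ← τθ + (1 - τ)θ'
    soft_update(critic_target, critic, tau)
    soft_update(actor_target, actor, tau)
```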
Results
45
Experiment Environment
[Figure: screenshots of the simulated physical environments of varying difficulty; a short video was shown here]
Results
47
Experiment Results
[Performance curves; legend: original DPG with batch normalization (light gray), DPG with target network (black), DPG with target networks and batch normalization (green), DPG with target networks from pixel-only inputs (blue)]
- The authors evaluated the policy periodically during training by testing it without exploration noise.
- Learning without a target network, as in the original work with DPG, is very poor in many environments (light gray).
- In some simpler cases, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor.
Results
51
- Average / best scores
- Low-dimensional DDPG (lowd), pixel-only high-dimensional DDPG (pix)
- Original DPG with replay buffer and batch normalization (cntrl)
- All scores are normalized so that a random agent receives 0 and a planning algorithm receives 1
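The normalization in the last bullet, written out as a one-line sketch of the stated convention:

```python
def normalized_score(raw: float, random_agent: float, planner: float) -> float:
    """Normalization used in the table: the random agent maps to 0, the planner maps to 1."""
    return (raw - random_agent) / (planner - random_agent)
```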
Conclusion
53
Why we choose this paper? (Motivation for us)
- From DPG: deterministic actor-critic
- From DQN: non-linear approximation model
- Target networks for stability
- Batch normalization
- A noise process added for exploration
→ Deep Deterministic Policy Gradient
Conclusion
54
Why we choose this paper? (Motivation for us)
- In our Angry Birds agent, we have three actions that look like continuous actions
  - R, θ, t_tap
    0 ≤ R ≤ ?
    0 ≤ θ ≤ 90
    0 ≤ t_tap ≤ ?, specified to the second decimal place
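A hedged sketch of how a tanh-bounded actor output could be mapped onto these three action ranges; `R_MAX` and `T_TAP_MAX` are hypothetical placeholders standing in for the "?" bounds on the slide:

```python
import numpy as np

# Hypothetical bounds standing in for the "?" on the slide -- placeholders, not real values.
R_MAX = 1.0      # placeholder upper bound for the drag distance R
T_TAP_MAX = 1.0  # placeholder upper bound for the tap time t_tap

def to_angry_birds_action(raw: np.ndarray):
    """Map a tanh-bounded actor output in [-1, 1]^3 to (R, θ, t_tap) in the slide's ranges."""
    scaled = (raw + 1.0) / 2.0                      # -> [0, 1] per dimension
    r     = float(scaled[0] * R_MAX)                # 0 <= R <= R_MAX
    theta = float(scaled[1] * 90.0)                 # 0 <= θ <= 90
    t_tap = round(float(scaled[2] * T_TAP_MAX), 2)  # 0 <= t_tap <= T_TAP_MAX, two decimal places
    return r, theta, t_tap
```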
Conclusion
55
Why we choose this paper? (Motivation for us)
- Policy Gradient method
  - Among them, the Deterministic Policy Gradient
  - With some additional tricks: DDPG
- In our Angry Birds agent, we have three actions that look like continuous actions
Thank you


Editor's Notes

  • #2 Hello, we are team Steve JobSae. I’m Wangyu Han. We are going to present the paper ‘Continuous Control with Deep Reinforcement Learning’, written by Google DeepMind.
  • #3 This is our agenda today. First, I will give an overview. Then, I will explain the algorithm details and the results. Lastly, I will wrap up with the conclusion.
  • #4 Let’s Start with overview
  • #5 Let me explain why we chose this paper. The state can be defined in different ways, but in general, Angry Birds has relatively high-dimensional state spaces. And Angry Birds has a 3-dimensional continuous action space consisting of theta, R, and tap time. Theta ranges from -90.00 to 90.00, rounded to two decimal places. The default unit of R is the pixel, and the default unit of tap time is the millisecond.
  • #6 As we learned in this class, policy gradient has better convergence properties and is effective in high-dimensional or continuous action spaces. And DQN can solve problems with high-dimensional state spaces. By using raw images as inputs in Atari games, DQN performed better than other algorithms. These features fit the Angry Birds scenario well. So, we started researching papers that integrate the policy gradient method and DQN.
  • #7 The authors also tried to use DQN in continuous action domains. But DQN can only handle discrete and low-dimensional action spaces because it is a value-based RL method. To adapt DQN to continuous domains, the action space has to be discretized. However, with finer-grained discretization, the number of discrete actions explodes.
  • #8 For example, in a seven-degree-of-freedom system with a rough discretization of each action into -k, zero, k, the number of actions is 3^7 = 2187. Such large action spaces are difficult to explore efficiently.
  • #9 To solve this problem, they use a policy gradient method. Among the many policy gradient methods, they choose the deterministic policy gradient, which was covered by Team Pork BBQ last week. The paper proposes the Deep Deterministic Policy Gradient, in short the DDPG algorithm. DDPG is a combination of the DQN and DPG algorithms. It is an off-policy actor-critic algorithm with deep function approximators.
  • #10 Next, I will introduce Algorithm Detail.
  • #11 First, I will explain about DQN part in DDPG algorithm.
  • #12 For Atari games, unprocessed raw pixels were used for raw input
  • #13 It is not possible to straightforwardly apply Q-learning to continuous action spaces. You can see the loss function of DQN algorithm. It has maximum term to select best actions for given action-value function. But, this term is improper for continuous actions.
  • #14 So, DDPG uses actor-critic approach based on the DPG algorithm. DPG maintains a parameterized actor function which specifies the current policy by deterministically mapping states to a specific action
  • #15 Next, about DPG part in DDPG algorithm.
  • #16 A stochastic policy is represented as π_θ(s, a) = p(a|s; θ). It stochastically selects action a in state s according to the parameter vector θ.
  • #17 Instead of using a stochastic policy, DPG defines a deterministic policy μ, which determines a specific action a in state s.
  • #18 DPG is Off-Policy Deterministic Actor-Critic algorithm. Conventional action-value function written in Bellman equation form is first equation.
  • #19 This equation can be re-written for the deterministic actor-critic as the equation below, because the target policy μ is deterministic.
  • #20 The critic is used for updating the action-value function Q. In DPG, the critic is an off-policy Q-learning algorithm with a linear approximator parameterized by θ^Q. The parameters are updated in the direction that minimizes the MSE between the TD target and the approximated Q function, as in the equation below. In the approximated Q function, because the algorithm is off-policy, the action at time t is obtained from a stochastic behavior policy β, so exploration is available. And in the TD target, the action is determined by the deterministic policy μ.
  • #21 Actor is used for updating policy parameters. Stochastic policy gradient adjusts the policy parameters in the direction of maximizing a reward gradient. Instead, Deterministic policy gradient adjusts the policy parameters Theta MU in the direction of action-value gradient First line is changed to second line by chain rule.
  • #22 In the DDPG algorithm, large non-linear neural networks are used as function approximators. Compared to the original DQN loss function, in DDPG the max term is replaced by the deterministic policy μ, as shown by the red arrow.
  • #23 OK, now let me introduce the DDPG algorithm itself. Until now, we told you that, roughly speaking, DDPG is composed of both the DQN and DPG concepts. From DQN, the non-linear approximation concept was taken, and the actor-critic deterministic policy gradient method was taken from DPG. So now, let me introduce how DDPG mixed them up and added its own solutions.
  • #24 Firstly, one challenge when using neural networks for reinforcement learning is that most optimization algorithms assume that the samples are independently and identically distributed. However, when the samples are generated from exploring sequentially in an environment, this assumption no longer holds.
  • #25 So in DDPG, they brought the replay buffer concept from DQN to solve this problem. In DDPG, the replay buffer is a finite-sized cache R, and the state, action, and reward at time step t plus the state at time step t+1 are stored as a tuple in the replay buffer. At each time step the actor and critic are updated by sampling a mini batch uniformly from the buffer. Therefore, by using the replay buffer, the assumption of independently and identically distributed samples can be somewhat satisfied.
  • #26 Secondly, directly implementing Q-learning with neural networks proved to be unstable in many environments. Since the network being updated is also used in calculating the target value, the Q update is prone to divergence.
  • #27 To solve the problem, DDPG uses soft target updates rather than directly copying the weights.
  • #28 To do that, DDPG creates a copy of the actor and critic networks respectively that are used for calculating the target networks
  • #29 Then the weights of these target networks are updated by having them slowly track the learned networks.
  • #30 This means that the target values are constrained to change slowly, greatly improving the stability of learning
  • #31 Thirdly, DDPG uses batch normalization. When learning from low-dimensional feature vector observations, the different components of the observation may have different physical units, for example positions, velocities, etc. And the range may vary across environments.
  • #32 To deal with the problem, DDPG uses batch normalization
  • #33 As you already know, batch normalization technique normalizes each dimension across the samples in a mini batch to have unit mean and variance. DDPG uses batch normalization on the state input and all layers of the mu network and all layers of the Q network prior to the action input
  • #34 Lastly, Exploration is quite important issue in continuous action space Especially in DPG, Deterministic Policy determines deterministic action according to the state. DDPG should solve the exploration issue.
  • #35 To do that, DDPG constructed an exploration policy mu prime by adding noise sampled from a noise process N to the actor policy
  • #36 OK, Until now, I introduced Characteristic and concept of Deep DPG Algorithm . Here is the pseudo code for the DDPG Algorithm. Let me explain it quickly to wrap up the introducing algorithm
  • #37 Firstly, as you know, DDPG uses an actor-critic DPG, so it initializes the critic network and the actor network. They also use slow updates via additional target networks; these are Q' and mu'. And a replay buffer is used to deal with the i.i.d. sample problem. So, here it initializes all the components of DDPG.
  • #38 And For M episodes, Initialize random process for action exploration and receive initial observation state s1
  • #39 Next, For one episode these steps are conducted.
  • #40 Select action from mu and additional noise. So It provides exploration.
  • #41 Execute the action a_t, then observe the reward r_t and the new state.
  • #42 Store the transition tuple in replay buffer R
  • #43 To update the networks, I mean, for training, a random minibatch of N transitions is sampled from the replay buffer R. For training the critic network, y is set following off-policy Q-learning in terms of the target network Q', and the critic is updated by minimizing the loss with respect to the critic network. To update the actor policy, this DPG equation is used. And then the slow target updates are conducted.
  • #44 Here's the end. Let me remind you of the characteristics of DDPG: DDPG adds a noise process to the action for exploration, DDPG uses a replay buffer, and DDPG uses slow target network updates for stability.
  • #46 Let’s move on to Experiment for DDPG They constructed simulated physical environments of varying levels of difficulty to test their algorithm Here’s the screen capture of the Experiments
  • #47 Let me show you the video shortly.
  • #48 So here’s the results of the experiment. Original DPG with batch normalization is in light gray, DPG with target network is in black DPG with target networks and batch normalization is in green DPG with target networks from pixel-only inputs is in blue
  • #49 Authors evaluated the policy periodically during training by testing it without exploration noise.
  • #50 Learning without a target network, as in the original work with DPG, is very poor in many environments. It is depicted by the light gray colored graph.
  • #51 Interestingly, in some simpler cases, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor. However, the low-dimensional state descriptors mostly performed better.
  • #52 Here's the table for the numeric performance. Average and best scores are written in the table. The first two columns are for low-dimensional DDPG, and the next two columns are for the high-dimensional one. The last two columns are the original DPG with batch normalization and replay buffer. All scores are normalized so that a random agent receives 0 and a planning algorithm 1. So you can see that DDPG outperforms the original DPG, and sometimes it showed better performance than the planning algorithm.
  • #53 So, Let me wrap up the presentation.
  • #54 In summary, Deep Deterministic Policy Gradient combines the concepts of DPG and DQN. Based on DPG, it uses a deterministic actor-critic model. From DQN, it uses a non-linear approximation model for the actor and critic networks. Additionally, DDPG uses target networks for stability and batch normalization, and adds a noise process to the action for exploration.
  • #55 As a conclusion, let me remind you why we choose this paper In angry bird agent, we have three actions that look like continuous actions.
  • #56 So we chose DDPG, which is a suitable algorithm for continuous action spaces.
  • #57 Thank you for your attendance.