This presentation introduces Google DeepMind's DDPG (Deep Deterministic Policy Gradient) algorithm to my colleagues.
I tried my best to make it easy to understand...
Comments are always welcome :)
hiddenmaze91.blogspot.com
1. Continuous Control with
Deep Reinforcement Learning
ICLR 2016
Timothy P. Lillicrap, et al. (Google DeepMind)
Presenter : Hyemin Ahn
2. Introduction
Another Deep Learning + Reinforcement Learning work from Google DeepMind!
They extended their Deep Q Network (DQN), which deals with a discrete action space, to a continuous action space.
4. Reinforcement Learning : overview
How can we formalize the behavior of an agent?
5. Reinforcement Learning : overview
At each time t, the agent receives an observation x_t from the environment E.
6. Reinforcement Learning : overview
The agent takes an action a_t ∈ 𝒜 ⊆ ℝ^N, and receives a scalar reward r_t.
7. Reinforcement Learning : overview
To select an action, there is a policy π : 𝒮 → 𝒫(𝒜), which maps states to a probability distribution over actions.
10. Reinforcement Learning : overview
• From the environment E, x_t : observation, s_t ∈ 𝒮 : state.
• If E is fully observed, s_t = x_t.
• a_t ∈ 𝒜 : the agent's action.
• π : 𝒮 → 𝒫(𝒜) : a policy defining the agent's behavior; it maps states to a probability distribution over the actions.
• With 𝒮, 𝒜, an initial state distribution p(s_1), transition dynamics p(s_{t+1} | s_t, a_t), and a reward function r(s_t, a_t), the agent's behavior can be modeled as a Markov Decision Process (MDP).
• R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) : the sum of discounted future rewards, with a discount factor γ ∈ [0,1].
• Objective of RL : learn a policy π maximizing 𝔼_π[R_1]. For this, the state-action value function Q^π(s_t, a_t) = 𝔼_π[R_t | s_t, a_t] is used.
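As a concrete toy example of the discounted return above, here is a minimal Python sketch (the reward list is made up for illustration):

def discounted_return(rewards, gamma=0.99, t=0):
    # R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i); rewards[i] plays the role of r(s_i, a_i)
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards[t:], start=t))

rewards = [1.0, 0.0, 0.5, 1.0]         # hypothetical r(s_i, a_i) values
print(discounted_return(rewards))      # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0 ≈ 2.46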
11. Reinforcement Learning : Q-Learning
Q-learning is finding μ : 𝒮 → 𝒜, the greedy deterministic policy, instead of a stochastic policy π : 𝒮 → 𝒫(𝒜).
The Bellman equation refers to this recursive relationship:
Q^π(s_t, a_t) = 𝔼[r(s_t, a_t) + γ 𝔼_{a_{t+1}~π}[Q^π(s_{t+1}, a_{t+1})]].
It gets hard to compute the inner expectation due to the stochastic policy π : 𝒮 → 𝒫(𝒜).
Let us think about a deterministic policy μ instead of a stochastic one; then
Q^μ(s_t, a_t) = 𝔼[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))].
Can we do this in a continuous action space?
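A rough Python sketch of why the greedy policy is trivial for discrete actions but not for continuous ones (Q here stands for any learned state-action value function; the names are illustrative):

# Discrete action space: the greedy action is an argmax over a finite set.
def greedy_discrete(Q, s, actions):
    return max(actions, key=lambda a: Q(s, a))

# Continuous action space: argmax_a Q(s, a) is itself an optimization problem
# that would have to be solved at every single time step -- too expensive.
# DDPG's answer: learn a deterministic policy mu(s) that outputs the action directly.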
12. Reinforcement Learning : continuous space?
What DQN (Deep Q Network) did was learn the Q-function with a neural network.
With a function approximator parameterized by θ^Q, model the Q-function as a network Q(s, a | θ^Q),
and find the θ^Q minimizing the loss function
L(θ^Q) = 𝔼[(Q(s_t, a_t | θ^Q) − y_t)²], where y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q).
Problem 1: How can we know this real value y_t?
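A minimal PyTorch sketch of this critic loss (the network sizes, dimensions, and names are assumptions for illustration, not the paper's exact architecture):

import torch
import torch.nn as nn

# Hypothetical critic Q(s, a | theta_Q): an MLP over the concatenated [state, action]
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))

def critic_loss(s, a, r, s_next, a_next, gamma=0.99):
    # Target y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1})); a_next = mu(s_next).
    # detach(): y_t is treated as a fixed label -- yet it comes from the very
    # network being trained, which is exactly Problem 1.
    y = r + gamma * critic(torch.cat([s_next, a_next], dim=1)).detach()
    q = critic(torch.cat([s, a], dim=1))
    return nn.functional.mse_loss(q, y)    # E[(Q(s_t, a_t) - y_t)^2]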
13. Reinforcement Learning : continuous space?
How can we find the greedy action in a continuous action space?
Also model the policy with a network μ(s | θ^μ) parameterized by θ^μ,
and find a parameter θ^μ which can maximize 𝔼[Q(s, μ(s | θ^μ) | θ^Q)].
Anyway, if we assume that we know Q(s, a | θ^Q),
the gradient of the policy's performance can be defined as
∇_{θ^μ} J ≈ 𝔼[∇_a Q(s, a | θ^Q)|_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ)]
(Silver, David, et al. "Deterministic Policy Gradient Algorithms." ICML 2014).
Problem 2: How can we successfully explore this action space?
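A sketch of the actor update under this gradient, reusing the hypothetical critic from the previous sketch; with autograd, the chain rule ∇_a Q · ∇_{θ^μ} μ reduces to simply maximizing Q(s, μ(s)) with respect to θ^μ:

# Hypothetical actor mu(s | theta_mu): 3-d state in, 1-d action in [-1, 1] out
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(s):
    # Minimizing -Q(s, mu(s)) ascends the deterministic policy gradient:
    # autograd supplies grad_a Q(s, a)|_{a=mu(s)} * grad_{theta_mu} mu(s | theta_mu).
    loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()    # only the actor's parameters are updated here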
14. Reinforcement Learning : continuous space?
Objective : learn Q(s, a | θ^Q) and μ(s | θ^μ) in a continuous action space!
Problem of μ(s | θ^μ) : how can we successfully explore this action space?
Problem of Q(s, a | θ^Q) : how can we know this real value y_t?
Both are neural networks, so the authors suggest using additional 'target networks' Q′(s, a | θ^{Q′}) and μ′(s | θ^{μ′}).
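The target networks are slowly tracking copies of the learned networks. The soft update θ′ ← τθ + (1 − τ)θ′ with τ ≪ 1 is the paper's rule; the PyTorch pattern below (continuing the sketches above) is just one way to write it:

import copy

critic_target = copy.deepcopy(critic)   # Q'(s, a | theta_Q')
actor_target = copy.deepcopy(actor)     # mu'(s | theta_mu')

def soft_update(target, source, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta': the target slowly tracks the learned net,
    # so the training target y_t changes slowly and learning becomes much more stable.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)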