This presentation introduces Google DeepMind's DDPG (Deep Deterministic Policy Gradient) algorithm to my colleagues.
I tried my best to make it easy to understand...
Comments are always welcome :)
hiddenmaze91.blogspot.com
1. Continuous Control with
Deep Reinforcement Learning
ICLR 2016
Timothy P. Lillicrap, et al. (Google DeepMind)
Presenter : Hyemin Ahn
2. Introduction
Another Deep Learning + Reinforcement Learning work from Google DeepMind!
They extended their Deep Q Network (DQN), which deals with a discrete action space, to a continuous action space.
4. Reinforcement Learning : overview
How can we formalize the behavior of an agent?
5. Reinforcement Learning : overview
At each time t, the agent receives an observation x_t from the environment E.
6. Reinforcement Learning : overview
The agent takes an action a_t ∈ 𝒜 ⊆ ℝ^N, and receives a scalar reward r_t.
7. Reinforcement Learning : overview
To select an action, there is a policy π : 𝒮 → 𝒫(𝒜), which maps states to a probability distribution over actions.
10. Reinforcement Learning : overview
• From the environment E, x_t : observation, s_t ∈ 𝒮 : state.
• If E is fully observed, s_t = x_t.
• a_t ∈ 𝒜 : the agent's action.
• π : 𝒮 → 𝒫(𝒜) : a policy defining the agent's behavior; it maps states to a probability distribution over the actions.
• With 𝒮, 𝒜, an initial state distribution p(s_1), transition dynamics p(s_{t+1} | s_t, a_t), and a reward function r(s_t, a_t), the agent's behavior can be modeled as a Markov Decision Process (MDP).
• R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) : the sum of discounted future rewards, with a discount factor γ ∈ [0,1].
• Objective of RL : learn a policy π maximizing 𝔼_π[R_1]. For this, the state-action value function Q^π(s_t, a_t) = 𝔼_π[R_t | s_t, a_t] is used.
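As a concrete toy example of the discounted return above, here is a minimal Python sketch (the reward list is made up for illustration):

def discounted_return(rewards, gamma=0.99, t=0):
    # R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i); rewards[i] plays the role of r(s_i, a_i)
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards[t:], start=t))

rewards = [1.0, 0.0, 0.5, 1.0]         # hypothetical r(s_i, a_i) values
print(discounted_return(rewards))      # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0 ≈ 2.46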
11. Reinforcement Learning : Q-Learning
Q-learning is finding μ : 𝒮 → 𝒜, the greedy deterministic policy, instead of a stochastic policy π : 𝒮 → 𝒫(𝒜).
The Bellman equation refers to this recursive relationship:
Q^π(s_t, a_t) = 𝔼[r(s_t, a_t) + γ 𝔼_{a_{t+1}~π}[Q^π(s_{t+1}, a_{t+1})]].
It gets hard to compute the inner expectation due to the stochastic policy π : 𝒮 → 𝒫(𝒜).
Let us think about a deterministic policy μ instead of a stochastic one; then
Q^μ(s_t, a_t) = 𝔼[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))].
Can we do this in a continuous action space?
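A rough Python sketch of why the greedy policy is trivial for discrete actions but not for continuous ones (Q here stands for any learned state-action value function; the names are illustrative):

# Discrete action space: the greedy action is an argmax over a finite set.
def greedy_discrete(Q, s, actions):
    return max(actions, key=lambda a: Q(s, a))

# Continuous action space: argmax_a Q(s, a) is itself an optimization problem
# that would have to be solved at every single time step -- too expensive.
# DDPG's answer: learn a deterministic policy mu(s) that outputs the action directly.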
12. Reinforcement Learning : continuous space?
What DQN (Deep Q Network) did was learn the Q-function with a neural network.
With a function approximator parameterized by θ^Q, model the Q-function as a network Q(s, a | θ^Q),
and find the θ^Q minimizing the loss function
L(θ^Q) = 𝔼[(Q(s_t, a_t | θ^Q) − y_t)²], where y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q).
Problem 1: How can we know this real value y_t?
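A minimal PyTorch sketch of this critic loss (the network sizes, dimensions, and names are assumptions for illustration, not the paper's exact architecture):

import torch
import torch.nn as nn

# Hypothetical critic Q(s, a | theta_Q): an MLP over the concatenated [state, action]
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))

def critic_loss(s, a, r, s_next, a_next, gamma=0.99):
    # Target y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1})); a_next = mu(s_next).
    # detach(): y_t is treated as a fixed label -- yet it comes from the very
    # network being trained, which is exactly Problem 1.
    y = r + gamma * critic(torch.cat([s_next, a_next], dim=1)).detach()
    q = critic(torch.cat([s, a], dim=1))
    return nn.functional.mse_loss(q, y)    # E[(Q(s_t, a_t) - y_t)^2]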
13. Reinforcement Learning : continuous space?
How can we find the greedy action in a continuous action space?
Also model the policy with a network μ(s | θ^μ) parameterized by θ^μ,
and find a parameter θ^μ which can maximize 𝔼[Q(s, μ(s | θ^μ) | θ^Q)].
Anyway, if we assume that we know Q(s, a | θ^Q),
the gradient of the policy's performance can be defined as
∇_{θ^μ} J ≈ 𝔼[∇_a Q(s, a | θ^Q)|_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ)]
(Silver, David, et al. "Deterministic Policy Gradient Algorithms." ICML 2014).
Problem 2: How can we successfully explore this action space?
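A sketch of the actor update under this gradient, reusing the hypothetical critic from the previous sketch; with autograd, the chain rule ∇_a Q · ∇_{θ^μ} μ reduces to simply maximizing Q(s, μ(s)) with respect to θ^μ:

# Hypothetical actor mu(s | theta_mu): 3-d state in, 1-d action in [-1, 1] out
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(s):
    # Minimizing -Q(s, mu(s)) ascends the deterministic policy gradient:
    # autograd supplies grad_a Q(s, a)|_{a=mu(s)} * grad_{theta_mu} mu(s | theta_mu).
    loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()    # only the actor's parameters are updated here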
14. Reinforcement Learning : continuous space?
Objective : learn Q(s, a | θ^Q) and μ(s | θ^μ) in a continuous action space!
Problem of μ(s | θ^μ) : how can we successfully explore this action space?
Problem of Q(s, a | θ^Q) : how can we know this real value y_t?
Both are neural networks, so the authors suggest using additional 'target networks' Q′(s, a | θ^{Q′}) and μ′(s | θ^{μ′}).
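The target networks are slowly tracking copies of the learned networks. The soft update θ′ ← τθ + (1 − τ)θ′ with τ ≪ 1 is the paper's rule; the PyTorch pattern below (continuing the sketches above) is just one way to write it:

import copy

critic_target = copy.deepcopy(critic)   # Q'(s, a | theta_Q')
actor_target = copy.deepcopy(actor)     # mu'(s | theta_mu')

def soft_update(target, source, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta': the target slowly tracks the learned net,
    # so the training target y_t changes slowly and learning becomes much more stable.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)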