This document summarizes the paper "Continuous Deep Q-Learning with Model-based Acceleration" (ICML 2016), which proposes a method combining the advantages of model-free and model-based reinforcement learning. The method uses deep Q-learning with normalized advantage functions (NAF) to learn a parameterized Q-function over continuous state-action spaces, and it accelerates learning by using trajectory optimization under a fitted dynamics model to generate exploratory behavior during data collection.
2. Introduction
Yet another improvement on deep reinforcement learning: the paper tries to incorporate the advantages of both model-free reinforcement learning and model-based reinforcement learning.
5. Reinforcement Learning: Overview
At each time step $t$, the agent receives an observation $\mathbf{x}_t$ from the environment $E$.
6. Reinforcement Learning: Overview
The agent takes an action $\mathbf{u}_t \in \mathcal{U}$ and receives a scalar reward $r_t$.
7. Reinforcement Learning: Overview
The agent chooses an action according to its current policy $\pi(\mathbf{u}_t|\mathbf{x}_t)$, which maps states to a probability distribution over actions.
10. Reinforcement Learning: Overview
• From the environment $E$: states $\mathbf{x} \in \mathcal{X}$ and actions $\mathbf{u} \in \mathcal{U}$.
• $\pi(\mathbf{u}_t|\mathbf{x}_t)$: a policy defining the agent's behavior; it maps states to a probability distribution over actions.
• Given $\mathcal{X}$, $\mathcal{U}$, and an initial state distribution $p(\mathbf{x}_1)$, the agent experiences transitions to new states sampled from the dynamics distribution $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$.
• $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(\mathbf{x}_i, \mathbf{u}_i)$: the discounted sum of future rewards, with discount factor $\gamma \in [0, 1]$.
• Objective of RL: learn a policy $\pi$ that maximizes the expected return $\mathbb{E}[R_1]$ (see the sketch below).
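As a concrete illustration, here is a minimal NumPy sketch of the return $R_t$ defined above; the function name and the backward-accumulation trick are my own, not from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r(x_i, u_i) for every t.

    `rewards` is one episode's reward sequence r_1..r_T; the result's
    t-th entry is the discounted sum of future rewards from step t.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards using the recursion R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps of reward 1.0 with gamma = 0.9
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # [2.71 1.9  1.  ]
```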
11. Reinforcement Learning: Model-Free?
• When the system dynamics $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$ are not known,
• we define the Q-function $Q^\pi(\mathbf{x}_t, \mathbf{u}_t)$, corresponding to a policy $\pi$, as the expected return from $\mathbf{x}_t$ after taking $\mathbf{u}_t$ and following $\pi$ thereafter.
• Q-learning learns a greedy deterministic policy $\mu(\mathbf{x}_t)$, which corresponds to $\mu(\mathbf{x}_t) = \arg\max_{\mathbf{u}} Q(\mathbf{x}_t, \mathbf{u})$.
• The learning objective is to minimize the Bellman error
$$L(\theta^Q) = \mathbb{E}_{\mathbf{x}_t \sim \rho^\beta,\, \mathbf{u}_t \sim \beta}\!\left[\left(Q(\mathbf{x}_t, \mathbf{u}_t|\theta^Q) - y_t\right)^2\right],$$
where we assume a fixed target $y_t = r(\mathbf{x}_t, \mathbf{u}_t) + \gamma\, Q(\mathbf{x}_{t+1}, \mu(\mathbf{x}_{t+1}))$.
($\beta$: an arbitrary exploration policy; $\rho^\beta$: the state visitation frequency resulting from $\beta$; $\theta^Q$: the parameters of the Q-function. A sketch of the targets and the error follows below.)
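A minimal sketch of this objective, assuming a sampled batch of transitions with precomputed next-state Q-values; all names here are illustrative, not from the paper's code.

```python
import numpy as np

def bellman_targets(rewards, next_q_values, terminal, gamma=0.99):
    """Fixed targets y_t = r_t + gamma * Q(x_{t+1}, mu(x_{t+1})).

    `next_q_values` holds Q(x_{t+1}, mu(x_{t+1})) for a sampled batch;
    `terminal` marks transitions that have no successor state.
    """
    return rewards + gamma * next_q_values * (1.0 - terminal)

def bellman_error(q_values, targets):
    """Mean squared Bellman error over the batch; the targets are
    treated as constants, so no gradient flows through them."""
    return np.mean((q_values - targets) ** 2)

# Toy batch of three transitions
r = np.array([1.0, 0.0, 1.0])
next_q = np.array([5.0, 3.0, 0.0])
done = np.array([0.0, 0.0, 1.0])
y = bellman_targets(r, next_q, done, gamma=0.9)  # [5.5, 2.7, 1.0]
print(bellman_error(np.array([5.2, 2.5, 1.1]), y))
```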
12. Continuous Q-Learning with Normalized Advantage Functions
How did the authors learn a parameterized Q-function with deep learning when the state-action domain is continuous?

They suggest using a neural network that separately outputs a value-function term $V(\mathbf{x}|\theta^V)$ and an advantage term $A(\mathbf{x}, \mathbf{u}|\theta^A)$ of a given policy $\pi$, so that $Q(\mathbf{x}, \mathbf{u}|\theta^Q) = V(\mathbf{x}|\theta^V) + A(\mathbf{x}, \mathbf{u}|\theta^A)$.

The advantage term is a quadratic function of the action,
$$A(\mathbf{x}, \mathbf{u}|\theta^A) = -\tfrac{1}{2}\left(\mathbf{u} - \mu(\mathbf{x}|\theta^\mu)\right)^T P(\mathbf{x}|\theta^P) \left(\mathbf{u} - \mu(\mathbf{x}|\theta^\mu)\right),$$
where $P(\mathbf{x}|\theta^P) = L(\mathbf{x}|\theta^P)\, L(\mathbf{x}|\theta^P)^T$ is a state-dependent, positive-definite square matrix, and $L(\mathbf{x}|\theta^P)$ is a lower-triangular matrix whose entries come from a linear output layer of the neural network. Since the advantage is non-positive and vanishes at $\mathbf{u} = \mu(\mathbf{x}|\theta^\mu)$, the action that maximizes the Q-function is always given by $\mu(\mathbf{x}|\theta^\mu)$ (see the sketch below).
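The assembly of this quadratic advantage can be sketched as follows. As in the paper, the diagonal of $L$ is exponentiated so that $P = LL^T$ is strictly positive definite, but the function and argument names are my own.

```python
import numpy as np

def naf_q_value(v, mu, l_entries, u, action_dim):
    """Assemble Q(x,u) = V(x) + A(x,u) from network outputs for one state.

    v         : scalar value-function output V(x)
    mu        : greedy action mu(x), shape (action_dim,)
    l_entries : flat vector of the action_dim*(action_dim+1)/2
                lower-triangular entries of L(x)
    u         : action to evaluate, shape (action_dim,)
    """
    # Rebuild the lower-triangular matrix L(x) from the flat output
    L = np.zeros((action_dim, action_dim))
    rows, cols = np.tril_indices(action_dim)
    L[rows, cols] = l_entries
    # Exponentiate the diagonal so P = L L^T is positive definite
    L[np.diag_indices(action_dim)] = np.exp(np.diag(L))
    P = L @ L.T
    # Quadratic advantage: zero at u = mu(x), negative elsewhere
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff
    return v + advantage

# Toy example in a 2-D action space
q = naf_q_value(v=1.0, mu=np.array([0.5, -0.5]),
                l_entries=np.array([0.1, 0.3, 0.2]),
                u=np.array([0.0, 0.0]), action_dim=2)
print(q)  # strictly less than V(x) = 1.0, since u != mu(x)
```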
13. Continuous Q-Learning with Normalized Advantage Functions
Trick: assume that we have a target network. $Q'(\mathbf{x}, \mathbf{u}|\theta^{Q'})$ is the slow learner that supplies the fixed targets $y_t$; $Q(\mathbf{x}, \mathbf{u}|\theta^Q)$ is the explorer being trained; and $R$ is the experience container (replay buffer) from which training batches are sampled. (A sketch of the slow update follows below.)
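The paper follows DQN/DDPG practice here. Below is a minimal sketch of the soft update $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$ that keeps the slow learner slow; the names are illustrative.

```python
import numpy as np

def soft_update(target_params, explorer_params, tau=0.001):
    """Move the slow learner theta^{Q'} a small step toward the
    explorer theta^{Q}: theta' <- tau * theta + (1 - tau) * theta'.

    With tau << 1 the target network changes slowly, which keeps the
    fixed targets y_t stable while the explorer is being trained.
    """
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(explorer_params, target_params)]

# Toy parameter vectors for the two networks
theta_q = [np.array([1.0, 2.0])]
theta_q_target = [np.array([0.0, 0.0])]
theta_q_target = soft_update(theta_q_target, theta_q, tau=0.1)
print(theta_q_target)  # [array([0.1, 0.2])]
```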
14. Accelerating Learning with Imagination Rollouts
The sample complexity of model-free algorithms tends to be high when using high-dimensional function approximators. To reduce the sample complexity and accelerate the learning phase, how about using good exploratory behavior obtained from trajectory optimization?
15. Accelerating Learning with Imagination Rollouts
How about using good exploratory behavior from trajectory optimization?
[Diagram: a dynamics model $\mathcal{M}$ is fitted to real experience from the replay buffer $R$; an iLQG policy $\pi_t^{iLQG}$ derived from the model supplies exploratory actions $\mathbf{u}_t$ in states $\mathbf{x}_t$, and the model generates fictional rollouts that are stored in a separate buffer $R_f$; the explorer $Q(\mathbf{x}, \mathbf{u}|\theta^Q)$, its greedy policy $\mu(\mathbf{x}|\theta^\mu)$, and the slow learner $Q'(\mathbf{x}, \mathbf{u}|\theta^{Q'})$ are then trained on both real and fictional experience. A sketch of this loop follows below.]
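A hypothetical sketch of this data-collection loop, under assumed interfaces for the environment, the fitted model $\mathcal{M}$, and the two policies; none of these names come from the paper's code.

```python
import random

def collect_with_imagination(env, model, ilqg_policy, mu_policy,
                             real_buffer, fictional_buffer,
                             rollout_len=10, use_ilqg_prob=0.5):
    """One data-collection step mixing real and imagined experience.

    Sketch of the idea on slides 14-15: a trajectory-optimization
    (iLQG) policy derived from the fitted model provides exploratory
    actions, and the model generates short "imagination rollouts"
    stored in a separate fictional buffer R_f. `env`, `model`, and
    the two policies are assumed interfaces, not the paper's API.
    """
    # Real step: explore with iLQG some of the time, otherwise
    # follow the greedy policy mu(x) from the Q-network.
    x = env.state()
    policy = ilqg_policy if random.random() < use_ilqg_prob else mu_policy
    u = policy(x)
    x_next, r = env.step(u)
    real_buffer.append((x, u, r, x_next))

    # Imagination rollout: branch a short synthetic trajectory from a
    # previously visited real state using the fitted dynamics model.
    x_im, _, _, _ = random.choice(real_buffer)
    for _ in range(rollout_len):
        u_im = mu_policy(x_im)
        x_im_next, r_im = model.predict(x_im, u_im)
        fictional_buffer.append((x_im, u_im, r_im, x_im_next))
        x_im = x_im_next
```

Both buffers then feed the Q-learning updates sketched earlier, so the explorer trains on far more transitions than the environment alone provides.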