This document summarizes the paper "Continuous Deep Q-Learning with Model-based Acceleration" (ICML 2016), which proposes a method combining the advantages of model-free and model-based reinforcement learning. The method uses deep Q-learning with normalized advantage functions (NAF) to learn a parameterized Q-function over continuous state-action spaces, and it accelerates learning by using trajectory optimization under a fitted dynamics model to generate exploratory behavior during data collection.
2. Introduction
Yet another improvement on deep reinforcement learning: the paper tries to incorporate the advantages of both model-free reinforcement learning and model-based reinforcement learning.
5. Reinforcement Learning: Overview
At each time step $t$, the agent receives an observation $\mathbf{x}_t$ from the environment $E$.
6. Reinforcement Learning: Overview
The agent takes an action $\mathbf{u}_t \in \mathcal{U}$ and receives a scalar reward $r_t$.
7. Reinforcement Learning: Overview
The agent chooses an action according to its current policy $\pi(\mathbf{u}_t|\mathbf{x}_t)$, which maps states to a probability distribution over actions.
10. Reinforcement Learning: Overview
• From the environment $E$: states $\mathbf{x} \in \mathcal{X}$ and actions $\mathbf{u} \in \mathcal{U}$.
• $\pi(\mathbf{u}_t|\mathbf{x}_t)$: a policy defining the agent's behavior; it maps states to a probability distribution over actions.
• Given $\mathcal{X}$, $\mathcal{U}$, and an initial state distribution $p(\mathbf{x}_1)$, the agent experiences transitions to new states sampled from the dynamics distribution $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$.
• $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(\mathbf{x}_i, \mathbf{u}_i)$: the discounted sum of future rewards, with discount factor $\gamma \in [0, 1]$.
• Objective of RL: learn a policy $\pi$ that maximizes the expected return $\mathbb{E}[R_1]$ (see the sketch below).
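As a concrete illustration, here is a minimal NumPy sketch of the return $R_t$ defined above; the function name and the backward-accumulation trick are my own, not from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r(x_i, u_i) for every t.

    `rewards` is one episode's reward sequence r_1..r_T; the result's
    t-th entry is the discounted sum of future rewards from step t.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards using the recursion R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps of reward 1.0 with gamma = 0.9
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # [2.71 1.9  1.  ]
```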
11. Reinforcement Learning: Model-Free?
• When the system dynamics $p(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{u}_t)$ are not known,
• we define the Q-function $Q^\pi(\mathbf{x}_t, \mathbf{u}_t)$, corresponding to a policy $\pi$, as the expected return from $\mathbf{x}_t$ after taking $\mathbf{u}_t$ and following $\pi$ thereafter.
• Q-learning learns a greedy deterministic policy $\mu(\mathbf{x}_t)$, which corresponds to $\mu(\mathbf{x}_t) = \arg\max_{\mathbf{u}} Q(\mathbf{x}_t, \mathbf{u})$.
• The learning objective is to minimize the Bellman error
$$L(\theta^Q) = \mathbb{E}_{\mathbf{x}_t \sim \rho^\beta,\, \mathbf{u}_t \sim \beta}\!\left[\left(Q(\mathbf{x}_t, \mathbf{u}_t|\theta^Q) - y_t\right)^2\right],$$
where we assume a fixed target $y_t = r(\mathbf{x}_t, \mathbf{u}_t) + \gamma\, Q(\mathbf{x}_{t+1}, \mu(\mathbf{x}_{t+1}))$.
($\beta$: an arbitrary exploration policy; $\rho^\beta$: the state visitation frequency resulting from $\beta$; $\theta^Q$: the parameters of the Q-function. A sketch of the targets and the error follows below.)
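A minimal sketch of this objective, assuming a sampled batch of transitions with precomputed next-state Q-values; all names here are illustrative, not from the paper's code.

```python
import numpy as np

def bellman_targets(rewards, next_q_values, terminal, gamma=0.99):
    """Fixed targets y_t = r_t + gamma * Q(x_{t+1}, mu(x_{t+1})).

    `next_q_values` holds Q(x_{t+1}, mu(x_{t+1})) for a sampled batch;
    `terminal` marks transitions that have no successor state.
    """
    return rewards + gamma * next_q_values * (1.0 - terminal)

def bellman_error(q_values, targets):
    """Mean squared Bellman error over the batch; the targets are
    treated as constants, so no gradient flows through them."""
    return np.mean((q_values - targets) ** 2)

# Toy batch of three transitions
r = np.array([1.0, 0.0, 1.0])
next_q = np.array([5.0, 3.0, 0.0])
done = np.array([0.0, 0.0, 1.0])
y = bellman_targets(r, next_q, done, gamma=0.9)  # [5.5, 2.7, 1.0]
print(bellman_error(np.array([5.2, 2.5, 1.1]), y))
```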
12. Continuous Q-Learning with Normalized Advantage Functions
How did the authors learn a parameterized Q-function with deep learning when the state-action domain is continuous?

They suggest using a neural network that separately outputs a value-function term $V(\mathbf{x}|\theta^V)$ and an advantage term $A(\mathbf{x}, \mathbf{u}|\theta^A)$ of a given policy $\pi$, so that $Q(\mathbf{x}, \mathbf{u}|\theta^Q) = V(\mathbf{x}|\theta^V) + A(\mathbf{x}, \mathbf{u}|\theta^A)$.

The advantage term is a quadratic function of the action,
$$A(\mathbf{x}, \mathbf{u}|\theta^A) = -\tfrac{1}{2}\left(\mathbf{u} - \mu(\mathbf{x}|\theta^\mu)\right)^T P(\mathbf{x}|\theta^P) \left(\mathbf{u} - \mu(\mathbf{x}|\theta^\mu)\right),$$
where $P(\mathbf{x}|\theta^P) = L(\mathbf{x}|\theta^P)\, L(\mathbf{x}|\theta^P)^T$ is a state-dependent, positive-definite square matrix, and $L(\mathbf{x}|\theta^P)$ is a lower-triangular matrix whose entries come from a linear output layer of the neural network. Since the advantage is non-positive and vanishes at $\mathbf{u} = \mu(\mathbf{x}|\theta^\mu)$, the action that maximizes the Q-function is always given by $\mu(\mathbf{x}|\theta^\mu)$ (see the sketch below).
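The assembly of this quadratic advantage can be sketched as follows. As in the paper, the diagonal of $L$ is exponentiated so that $P = LL^T$ is strictly positive definite, but the function and argument names are my own.

```python
import numpy as np

def naf_q_value(v, mu, l_entries, u, action_dim):
    """Assemble Q(x,u) = V(x) + A(x,u) from network outputs for one state.

    v         : scalar value-function output V(x)
    mu        : greedy action mu(x), shape (action_dim,)
    l_entries : flat vector of the action_dim*(action_dim+1)/2
                lower-triangular entries of L(x)
    u         : action to evaluate, shape (action_dim,)
    """
    # Rebuild the lower-triangular matrix L(x) from the flat output
    L = np.zeros((action_dim, action_dim))
    rows, cols = np.tril_indices(action_dim)
    L[rows, cols] = l_entries
    # Exponentiate the diagonal so P = L L^T is positive definite
    L[np.diag_indices(action_dim)] = np.exp(np.diag(L))
    P = L @ L.T
    # Quadratic advantage: zero at u = mu(x), negative elsewhere
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff
    return v + advantage

# Toy example in a 2-D action space
q = naf_q_value(v=1.0, mu=np.array([0.5, -0.5]),
                l_entries=np.array([0.1, 0.3, 0.2]),
                u=np.array([0.0, 0.0]), action_dim=2)
print(q)  # strictly less than V(x) = 1.0, since u != mu(x)
```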
13. Continuous Q-Learning with Normalized Advantage Functions
Trick: assume that we have a target network. $Q'(\mathbf{x}, \mathbf{u}|\theta^{Q'})$ is the slow learner that supplies the fixed targets $y_t$; $Q(\mathbf{x}, \mathbf{u}|\theta^Q)$ is the explorer being trained; and $R$ is the experience container (replay buffer) from which training batches are sampled. (A sketch of the slow update follows below.)
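The paper follows DQN/DDPG practice here. Below is a minimal sketch of the soft update $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$ that keeps the slow learner slow; the names are illustrative.

```python
import numpy as np

def soft_update(target_params, explorer_params, tau=0.001):
    """Move the slow learner theta^{Q'} a small step toward the
    explorer theta^{Q}: theta' <- tau * theta + (1 - tau) * theta'.

    With tau << 1 the target network changes slowly, which keeps the
    fixed targets y_t stable while the explorer is being trained.
    """
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(explorer_params, target_params)]

# Toy parameter vectors for the two networks
theta_q = [np.array([1.0, 2.0])]
theta_q_target = [np.array([0.0, 0.0])]
theta_q_target = soft_update(theta_q_target, theta_q, tau=0.1)
print(theta_q_target)  # [array([0.1, 0.2])]
```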
14. Accelerating Learning with Imagination Rollouts
The sample complexity of model-free algorithms tends to be high when using high-dimensional function approximators. To reduce the sample complexity and accelerate the learning phase, how about using good exploratory behavior obtained from trajectory optimization?
15. Accelerating Learning with Imagination Rollouts
How about using good exploratory behavior from trajectory optimization?
[Diagram: a dynamics model $\mathcal{M}$ is fitted to real experience from the replay buffer $R$; an iLQG policy $\pi_t^{iLQG}$ derived from the model supplies exploratory actions $\mathbf{u}_t$ in states $\mathbf{x}_t$, and the model generates fictional rollouts that are stored in a separate buffer $R_f$; the explorer $Q(\mathbf{x}, \mathbf{u}|\theta^Q)$, its greedy policy $\mu(\mathbf{x}|\theta^\mu)$, and the slow learner $Q'(\mathbf{x}, \mathbf{u}|\theta^{Q'})$ are then trained on both real and fictional experience. A sketch of this loop follows below.]
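A hypothetical sketch of this data-collection loop, under assumed interfaces for the environment, the fitted model $\mathcal{M}$, and the two policies; none of these names come from the paper's code.

```python
import random

def collect_with_imagination(env, model, ilqg_policy, mu_policy,
                             real_buffer, fictional_buffer,
                             rollout_len=10, use_ilqg_prob=0.5):
    """One data-collection step mixing real and imagined experience.

    Sketch of the idea on slides 14-15: a trajectory-optimization
    (iLQG) policy derived from the fitted model provides exploratory
    actions, and the model generates short "imagination rollouts"
    stored in a separate fictional buffer R_f. `env`, `model`, and
    the two policies are assumed interfaces, not the paper's API.
    """
    # Real step: explore with iLQG some of the time, otherwise
    # follow the greedy policy mu(x) from the Q-network.
    x = env.state()
    policy = ilqg_policy if random.random() < use_ilqg_prob else mu_policy
    u = policy(x)
    x_next, r = env.step(u)
    real_buffer.append((x, u, r, x_next))

    # Imagination rollout: branch a short synthetic trajectory from a
    # previously visited real state using the fitted dynamics model.
    x_im, _, _, _ = random.choice(real_buffer)
    for _ in range(rollout_len):
        u_im = mu_policy(x_im)
        x_im_next, r_im = model.predict(x_im, u_im)
        fictional_buffer.append((x_im, u_im, r_im, x_im_next))
        x_im = x_im_next
```

Both buffers then feed the Q-learning updates sketched earlier, so the explorer trains on far more transitions than the environment alone provides.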