11. Importance Sampling
What
• Estimate properties of a distribution with samples from another distribution.
• In RL: estimate the value function of a policy with samples collected from an older policy.
Why
• Improves sampling efficiency in policy gradient methods.
What is the best sampling distribution?
• The one that gives an estimate with minimum variance.
How
• Sample data points with higher rewards.
12. The diagram on the right samples data proportionally to the reward value and therefore produces estimates with smaller variance.
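As an illustration (not from the slides), here is a minimal NumPy sketch of importance sampling; the target density, proposal density, and test function are arbitrary choices used only to show the reweighting.

```python
# Minimal importance-sampling sketch (illustrative densities and test function).
# We estimate E_p[f(X)] with samples drawn from a proposal q by weighting each
# sample with the likelihood ratio p(x) / q(x).
import numpy as np

def importance_sampling_estimate(f, p_pdf, q_pdf, samples_from_q):
    weights = p_pdf(samples_from_q) / q_pdf(samples_from_q)  # likelihood ratios
    return np.mean(weights * f(samples_from_q))

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)        # q = N(1, 2^2)

p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)      # p = N(0, 1)
q_pdf = lambda x: np.exp(-(x - 1.0)**2 / 8) / np.sqrt(8 * np.pi)

# E_p[X^2] = 1 for a standard normal; the estimate should be close to 1.
print(importance_sampling_estimate(lambda x: x**2, p_pdf, q_pdf, samples))
```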
13. Prioritized Experience Replay
• What
• Sample more important transitions more
frequently.
• How
• The probability of sampling transition $i$ is proportional to its priority $p_i$:
• $P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$
• $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD error of transition $i$.
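A hedged NumPy sketch of this proportional sampling rule (the values of alpha and epsilon below are illustrative, not taken from the slides):

```python
# Proportional prioritized sampling: P(i) = p_i^alpha / sum_k p_k^alpha,
# with priority p_i = |delta_i| + eps (delta_i is the TD error of transition i).
import numpy as np

def sample_transitions(td_errors, batch_size, alpha=0.6, eps=1e-3):
    priorities = np.abs(td_errors) + eps
    probs = priorities**alpha / np.sum(priorities**alpha)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    return idx, probs

# Example: a buffer holding five transitions with their latest TD errors.
td_errors = np.array([0.1, 2.0, 0.5, 0.05, 1.0])
idx, probs = sample_transitions(td_errors, batch_size=3)
print(idx, probs[idx])
```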
14. Connection to Importance Sampling
• To account for the change in distribution, updates to the network are weighted with importance-sampling weights:
• $\omega_i = \left(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\right)^{\beta}$
• $N$ is the size of the replay buffer.
• $\beta$ controls the amount of importance-sampling correction.
• $\beta$ is annealed linearly from $\beta_0$ to 1.
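Continuing the sketch above, the importance-sampling weights and the linear annealing of beta could look like this (the value of beta_0 and the schedule length are assumptions):

```python
# Importance-sampling correction for prioritized replay:
# w_i = (1/N * 1/P(i))^beta, normalized by the largest weight for stability.
import numpy as np

def is_weights(sampled_probs, buffer_size, beta):
    w = (1.0 / (buffer_size * sampled_probs))**beta
    return w / w.max()

def anneal_beta(step, total_steps, beta0=0.4):
    # Linearly anneal beta from beta0 at the start of training to 1 at the end.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# Example: correct the updates for three sampled transitions.
probs_of_sampled = np.array([0.05, 0.55, 0.15])
beta = anneal_beta(step=10_000, total_steps=50_000)
print(is_weights(probs_of_sampled, buffer_size=5, beta=beta))
```

Each weight then scales the TD update of the corresponding transition, so transitions that are over-sampled contribute proportionally less to the gradient.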
15. Related work
Who did what?
• What do they do?
• How do they do it?
• What do they require?
• How well does it work?
16. Imitation Learning
DAGGER
• Iteratively produces new policies based on polling the expert policy outside its original state space.
• Requires expert availability during training.
Deeply AggreVaTeD
• Extends DAGGER to work with neural networks and continuous action spaces.
• The expert needs to provide a value function.
One-shot learning
• Input: the entire demonstration set and the current state.
• Uses demonstrations to specify the goal state from different initial conditions.
• Requires a distribution of tasks with different initial conditions and goal states.
Zero-sum game
• The learner chooses a policy; the adversary chooses a reward function.
17. What is wrong with imitation learning?
• Does not combine imitation with reinforcement learning.
• Can never outperform the demonstrations.
18. Combining Imitation Learning and RL
Human-Agent Transfer (HAT)
• Transfers knowledge directly from human policies.
• Use demonstrations to shape rewards.
Policy shaping with human teachers
• Shape the policy used to sample experience.
• Use policy iteration from demonstrations.
19. Replay Buffers with Demonstrations
Human Experience Replay (HER)
• Replay buffer mixed with agent and demonstration data.
• Slightly better than a random agent.
Human Checkpoint Replay (HCR)
• No supervised learning or pre-training.
• Requires the ability to set the environment state.
Replay Buffer Spiking (RBS)
• Replay buffer is initialized with demonstration data.
• Does not maintain demonstration data in the replay buffer.
• No supervised learning or pre-training.
20. Similar works
AlphaGo
• Supervised: pre-train on 30 million expert actions before interacting with the real task.
• Policy gradient: apply policy gradient with planning rollouts.
Accelerated DQN with Expert Trajectories (ADET)
• Combines TD and classification losses in a Q-learning setup.
• Uses a trained Q-learning agent to generate demonstrations.
• The policy used by the demonstrator is guaranteed to be representable by the apprenticeship agent.
• Uses a cross-entropy classification loss instead of a large-margin loss.
23. Pre-training
• Goal:
• Imitate the demonstrator with a value function that
satisfies the Bellman equation.
• Trick:
• Such a value function can be updated with TD updates once the agent starts interacting with the environment.
• Process:
• Sample mini-batches from the demonstration data and
update the network with four losses.
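A rough sketch of this pre-training loop, assuming a PyTorch-style Q-network and placeholder helpers for the four losses (all names below, such as demo_buffer, one_step_td_loss, n_step_td_loss, large_margin_loss, and l2_loss, are hypothetical, not from the paper or slides):

```python
# Pre-training sketch: sample mini-batches from demonstration data only and
# update the network with the combined four-term loss. Helper functions and
# hyperparameters are placeholders.
for step in range(pretraining_steps):
    batch = demo_buffer.sample(batch_size)                         # demonstration data only
    loss = (one_step_td_loss(q_net, target_net, batch)             # J_DQ: 1-step TD loss
            + lambda_1 * n_step_td_loss(q_net, target_net, batch)  # J_n:  n-step TD loss
            + lambda_2 * large_margin_loss(q_net, batch)           # J_E:  supervised loss
            + lambda_3 * l2_loss(q_net))                           # J_L2: regularization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % target_update_period == 0:                           # refresh target network
        target_net.load_state_dict(q_net.state_dict())
```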
24. Interacting with the environment
• Collect self-generated data in $D^{replay}$.
• Demonstration data is never overwritten.
• Add small constants $\epsilon_a$ and $\epsilon_d$ to the priorities of agent and demonstration transitions, respectively.
• The supervised loss is not applied to self-generated data ($\lambda_2 = 0$).
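A minimal sketch of a replay buffer behaving as described above: demonstration transitions are stored permanently, only self-generated data is overwritten, and the two constants are added to the priorities (the class and its default values are illustrative, not the authors' implementation):

```python
# Replay buffer sketch: demonstration data is kept permanently; self-generated
# data lives in a fixed-size FIFO region; priorities get eps_d / eps_a bonuses.
import numpy as np

class DemoReplayBuffer:
    def __init__(self, demo_transitions, agent_capacity, eps_d=1.0, eps_a=0.001):
        self.demo = list(demo_transitions)              # never overwritten
        self.agent = []                                 # self-generated data, FIFO
        self.agent_capacity = agent_capacity
        self.eps_d, self.eps_a = eps_d, eps_a
        self.demo_prio = [1.0 + eps_d] * len(self.demo)
        self.agent_prio = []

    def add(self, transition, td_error=1.0):
        if len(self.agent) >= self.agent_capacity:      # overwrite oldest agent data only
            self.agent.pop(0)
            self.agent_prio.pop(0)
        self.agent.append(transition)
        self.agent_prio.append(abs(td_error) + self.eps_a)

    def sample(self, batch_size, alpha=0.4):
        prios = np.array(self.demo_prio + self.agent_prio)
        probs = prios**alpha / np.sum(prios**alpha)
        idx = np.random.choice(len(prios), size=batch_size, p=probs)
        data = self.demo + self.agent
        is_demo = idx < len(self.demo)   # apply the supervised loss only where True
        return [data[i] for i in idx], is_demo, probs[idx]
```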
26. Why is the supervised loss important?
The demonstration data necessarily covers a narrow part of the state space and does not take all possible actions. Many state-action pairs have never been taken and have no data to ground them to realistic values.
28. Contributions
• Demonstration data: in replay buffer
permanently.
• Pre-training: solely on the demonstration data.
• Supervised large-margin loss: in addition to TD
losses to push the value of the demonstrator’s
actions above the other action values.
• L2 Regularization losses: to prevent overfitting.
• A mix of 1-step and n-step TD losses: to update the Q-network.
• Demonstration priority bonus: the priorities of demonstration transitions are given a bonus of $\epsilon_d$ to boost the frequency with which they are sampled.
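Taken together, these contributions amount to a single combined loss on the Q-network (the form below follows the DQfD paper, with $J_{DQ}$ the 1-step TD loss, $J_n$ the n-step TD loss, $J_E$ the supervised large-margin loss, and $J_{L2}$ the regularization term):

$$J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)$$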
30. Does DQfD solve our problem?
• Outperformed the worst demonstrations in 29 games.
• Outperformed the best demonstration in 14 games.
• State of the art in 11 games.
• Performed better than PDD DQN over the first million steps in 41 games.
• PDD DQN needs 83 million steps to catch up with DQfD's performance.
The supervised large-margin loss is $J_E(Q) = \max_{a \in A}\left[Q(s, a) + l(a_E, a)\right] - Q(s, a_E)$, where $a_E$ is the action the expert demonstrator took in state $s$ and $l(a_E, a)$ is a margin function that is 0 when $a = a_E$ and positive otherwise.
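A small PyTorch sketch of this large-margin term (the margin value and batch shapes below are illustrative):

```python
# Large-margin supervised loss: J_E = max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E).
import torch

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """q_values: (batch, num_actions); expert_actions: (batch,) of action indices."""
    # l(a_E, a): `margin` for every action except the expert's action, 0 for a = a_E.
    l = torch.full_like(q_values, margin)
    l.scatter_(1, expert_actions.unsqueeze(1), 0.0)
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - q_expert).mean()

# Example with random Q-values for a batch of 32 states and 4 actions.
q = torch.randn(32, 4)
a_e = torch.randint(0, 4, (32,))
print(large_margin_loss(q, a_e))
```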
The λ parameters control the weighting between the losses
If we were to pre-train the network with only Q-learning updates towards the max value of the next state, the network would update towards the highest of these ungrounded values and propagate them throughout the Q-function.
In practice, choosing the ratio between demonstration and self-generated data during learning is critical to the performance of the algorithm. Prioritized experience replay is used to control this ratio automatically.