11. Importance Sampling
What
• Estimate properties of a distribution with samples from another distribution.
• In RL: estimate the value function of a policy with samples collected from an older policy.
Why
• Improves sampling efficiency in policy gradient methods.
What is the best sampling distribution?
• The one that gives an estimate with minimum variance.
How
• Sample data points with higher rewards.
12. The diagram on the right samples data proportionally to the reward value and therefore produces estimates with smaller variance.
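As an illustration (not from the slides), here is a minimal NumPy sketch of importance sampling; the target density, proposal density, and test function are arbitrary choices used only to show the reweighting.

```python
# Minimal importance-sampling sketch (illustrative densities and test function).
# We estimate E_p[f(X)] with samples drawn from a proposal q by weighting each
# sample with the likelihood ratio p(x) / q(x).
import numpy as np

def importance_sampling_estimate(f, p_pdf, q_pdf, samples_from_q):
    weights = p_pdf(samples_from_q) / q_pdf(samples_from_q)  # likelihood ratios
    return np.mean(weights * f(samples_from_q))

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)        # q = N(1, 2^2)

p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)      # p = N(0, 1)
q_pdf = lambda x: np.exp(-(x - 1.0)**2 / 8) / np.sqrt(8 * np.pi)

# E_p[X^2] = 1 for a standard normal; the estimate should be close to 1.
print(importance_sampling_estimate(lambda x: x**2, p_pdf, q_pdf, samples))
```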
13. Prioritized Experience Replay
• What
• Sample more important transitions more
frequently.
• How
• The probability of sampling transition $i$ is proportional to its priority $p_i$:
• $P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$
• $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD error of transition $i$.
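A hedged NumPy sketch of this proportional sampling rule (the values of alpha and epsilon below are illustrative, not taken from the slides):

```python
# Proportional prioritized sampling: P(i) = p_i^alpha / sum_k p_k^alpha,
# with priority p_i = |delta_i| + eps (delta_i is the TD error of transition i).
import numpy as np

def sample_transitions(td_errors, batch_size, alpha=0.6, eps=1e-3):
    priorities = np.abs(td_errors) + eps
    probs = priorities**alpha / np.sum(priorities**alpha)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    return idx, probs

# Example: a buffer holding five transitions with their latest TD errors.
td_errors = np.array([0.1, 2.0, 0.5, 0.05, 1.0])
idx, probs = sample_transitions(td_errors, batch_size=3)
print(idx, probs[idx])
```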
14. Connection to Importance Sampling
• To account for the change in distribution, updates to the network are weighted with importance-sampling weights:
• $\omega_i = \left(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\right)^{\beta}$
• $N$ is the size of the replay buffer.
• $\beta$ controls the amount of importance-sampling correction.
• $\beta$ is annealed linearly from $\beta_0$ to 1.
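Continuing the sketch above, the importance-sampling weights and the linear annealing of beta could look like this (the value of beta_0 and the schedule length are assumptions):

```python
# Importance-sampling correction for prioritized replay:
# w_i = (1/N * 1/P(i))^beta, normalized by the largest weight for stability.
import numpy as np

def is_weights(sampled_probs, buffer_size, beta):
    w = (1.0 / (buffer_size * sampled_probs))**beta
    return w / w.max()

def anneal_beta(step, total_steps, beta0=0.4):
    # Linearly anneal beta from beta0 at the start of training to 1 at the end.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# Example: correct the updates for three sampled transitions.
probs_of_sampled = np.array([0.05, 0.55, 0.15])
beta = anneal_beta(step=10_000, total_steps=50_000)
print(is_weights(probs_of_sampled, buffer_size=5, beta=beta))
```

Each weight then scales the TD update of the corresponding transition, so transitions that are over-sampled contribute proportionally less to the gradient.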
15. Related work
Who did what?
• What do they do?
• How do they do it?
• What do they require?
• How well does it work?
16. Imitation Learning
DAGGER
• Iteratively produces new policies based on polling the expert policy outside its original state space.
• Requires expert availability during training.
Deeply AggreVaTeD
• Extends DAGGER to work with neural networks and continuous action spaces.
• The expert needs to provide a value function.
One-shot learning
• Input: the entire demonstration set and the current state.
• Uses demonstrations to specify the goal state from different initial conditions.
• Requires a distribution of tasks with different initial conditions and goal states.
Zero-sum game
• The learner chooses a policy; the adversary chooses a reward function.
17. What is wrong with imitation learning?
• Does not combine imitation with reinforcement learning.
• Can never outperform the demonstrations.
18. Combining Imitation Learning and RL
Human-Agent Transfer (HAT)
• Transfers knowledge directly from human policies.
• Use demonstrations to shape rewards.
Policy shaping with human teachers
• Shape the policy used to sample experience.
• Use policy iteration from demonstrations.
19. Replay Buffers with Demonstrations
Human Experience Replay (HER)
• Replay buffer mixed with agent and demonstration data.
• Slightly better than a random agent.
Human Checkpoint Replay (HCR)
• No supervised learning or pre-training.
• Requires the ability to set the environment state.
Replay Buffer Spiking (RBS)
• Replay buffer is initialized with demonstration data.
• Does not maintain demonstration data in the replay buffer.
• No supervised learning or pre-training.
20. Similar works
AlphaGo
• Supervised: pre-train on 30 million expert actions before interacting with the real task.
• Policy gradient: apply policy gradient with planning rollouts.
Accelerated DQN with Expert Trajectories (ADET)
• Combines TD and classification losses in a Q-learning setup.
• Uses a trained Q-learning agent to generate demonstrations.
• The policy used by the demonstrator is guaranteed to be representable by the apprenticeship agent.
• Uses a cross-entropy classification loss instead of a large-margin loss.
23. Pre-training
• Goal:
• Imitate the demonstrator with a value function that
satisfies the Bellman equation.
• Trick:
• Such a value function can be updated with TD updates once the agent starts interacting with the environment.
• Process:
• Sample mini-batches from the demonstration data and
update the network with four losses.
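A rough sketch of this pre-training loop, assuming a PyTorch-style Q-network and placeholder helpers for the four losses (all names below, such as demo_buffer, one_step_td_loss, n_step_td_loss, large_margin_loss, and l2_loss, are hypothetical, not from the paper or slides):

```python
# Pre-training sketch: sample mini-batches from demonstration data only and
# update the network with the combined four-term loss. Helper functions and
# hyperparameters are placeholders.
for step in range(pretraining_steps):
    batch = demo_buffer.sample(batch_size)                         # demonstration data only
    loss = (one_step_td_loss(q_net, target_net, batch)             # J_DQ: 1-step TD loss
            + lambda_1 * n_step_td_loss(q_net, target_net, batch)  # J_n:  n-step TD loss
            + lambda_2 * large_margin_loss(q_net, batch)           # J_E:  supervised loss
            + lambda_3 * l2_loss(q_net))                           # J_L2: regularization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % target_update_period == 0:                           # refresh target network
        target_net.load_state_dict(q_net.state_dict())
```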
24. Interacting with the environment
• Collect self-generated data in $D^{replay}$.
• Demonstration data is never overwritten.
• Add small constants $\epsilon_a$ and $\epsilon_d$ to the priorities of agent and demonstration transitions, respectively.
• The supervised loss is not applied to self-generated data ($\lambda_2 = 0$).
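A minimal sketch of a replay buffer behaving as described above: demonstration transitions are stored permanently, only self-generated data is overwritten, and the two constants are added to the priorities (the class and its default values are illustrative, not the authors' implementation):

```python
# Replay buffer sketch: demonstration data is kept permanently; self-generated
# data lives in a fixed-size FIFO region; priorities get eps_d / eps_a bonuses.
import numpy as np

class DemoReplayBuffer:
    def __init__(self, demo_transitions, agent_capacity, eps_d=1.0, eps_a=0.001):
        self.demo = list(demo_transitions)              # never overwritten
        self.agent = []                                 # self-generated data, FIFO
        self.agent_capacity = agent_capacity
        self.eps_d, self.eps_a = eps_d, eps_a
        self.demo_prio = [1.0 + eps_d] * len(self.demo)
        self.agent_prio = []

    def add(self, transition, td_error=1.0):
        if len(self.agent) >= self.agent_capacity:      # overwrite oldest agent data only
            self.agent.pop(0)
            self.agent_prio.pop(0)
        self.agent.append(transition)
        self.agent_prio.append(abs(td_error) + self.eps_a)

    def sample(self, batch_size, alpha=0.4):
        prios = np.array(self.demo_prio + self.agent_prio)
        probs = prios**alpha / np.sum(prios**alpha)
        idx = np.random.choice(len(prios), size=batch_size, p=probs)
        data = self.demo + self.agent
        is_demo = idx < len(self.demo)   # apply the supervised loss only where True
        return [data[i] for i in idx], is_demo, probs[idx]
```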
26. Why is the supervised loss important?
The demonstration data necessarily covers a narrow part of the state space and does not take all possible actions. Many state-action pairs have never been taken and have no data to ground them to realistic values.
28. Contributions
• Demonstration data: in replay buffer
permanently.
• Pre-training: solely on the demonstration data.
• Supervised large-margin loss: in addition to TD
losses to push the value of the demonstrator’s
actions above the other action values.
• L2 Regularization losses: to prevent overfitting.
• A mix of 1-step and n-step TD losses: to update the Q-network.
• Demonstration priority bonus: the priorities of demonstration transitions are given a bonus of $\epsilon_d$ to boost the frequency with which they are sampled.
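Taken together, these contributions amount to a single combined loss on the Q-network (the form below follows the DQfD paper, with $J_{DQ}$ the 1-step TD loss, $J_n$ the n-step TD loss, $J_E$ the supervised large-margin loss, and $J_{L2}$ the regularization term):

$$J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)$$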
30. Does DQfD solve our problem?
• Outperformed the worst demonstrations in 29 games.
• Outperformed the best demonstration in 14 games.
• State of the art in 11 games.
• Performed better than PDD DQN over the first million steps in 41 games.
• PDD DQN needs 83 million steps to catch up with DQfD's performance.
The supervised large-margin loss is $J_E(Q) = \max_{a \in A}\left[Q(s, a) + l(a_E, a)\right] - Q(s, a_E)$, where $a_E$ is the action the expert demonstrator took in state $s$ and $l(a_E, a)$ is a margin function that is 0 when $a = a_E$ and positive otherwise.
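A small PyTorch sketch of this large-margin term (the margin value and batch shapes below are illustrative):

```python
# Large-margin supervised loss: J_E = max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E).
import torch

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """q_values: (batch, num_actions); expert_actions: (batch,) of action indices."""
    # l(a_E, a): `margin` for every action except the expert's action, 0 for a = a_E.
    l = torch.full_like(q_values, margin)
    l.scatter_(1, expert_actions.unsqueeze(1), 0.0)
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - q_expert).mean()

# Example with random Q-values for a batch of 32 states and 4 actions.
q = torch.randn(32, 4)
a_e = torch.randint(0, 4, (32,))
print(large_margin_loss(q, a_e))
```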
The λ parameters control the weighting between the losses
If we were to pre-train the network with only Q-learning updates towards the max value of the next state, the network would update towards the highest of these ungrounded values and propagate them throughout the Q-function.
In practice, choosing the ratio between demonstration and self-generated data during learning is critical to the performance of the algorithm. Prioritized experience replay is used to control this ratio automatically.