Deep Q-learning from
Demonstrations
Ammar Rashed
• Starts with better scores over the first million steps in 41 games.
• Out-performs the best demonstrations in 14 games.
• State-of-the-art performance in 11 games.
Motivation
Background
Related Work
Methodology
Results
Motivation
• What is the problem?
• What is missing?
• What do we have instead?
• The idea
What is the problem?
The agent must learn in the real
domain with real consequences.
The agent needs to perform
well from the start of learning.
What is missing?
• Perfectly accurate
simulations to train the
agent safely in.
What do we have instead?
Data of the system operating under a previous controller: demonstrations.
The idea
Leverage demonstrations to accelerate learning.
Background
Fantastic concepts and where to find them
• Large-margin loss
• Importance Sampling
• Prioritized Experience Replay
Large-margin classification loss

L_i = -\log \frac{ e^{\,\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})} }{ e^{\,\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})} + \sum_{j \neq y_i} e^{\,\|W_j\|\,\|x_i\|\,\cos\theta_j} }
Importance Sampling

What
• Estimate properties of a distribution with samples drawn from another distribution.
• In RL: estimate the value function of a policy with samples collected from an older policy.

Why
• Improves sampling efficiency in policy gradient methods.

What is the best sampling distribution?
• The one that yields estimates with minimum variance.

How
• Sample data points with higher rewards.
A sampling distribution that draws data in proportion to the reward value will therefore produce estimates with smaller variance. (The illustrative diagram from this slide is omitted.)
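As a minimal, self-contained illustration (not from the slides): estimate an expectation under a target distribution p using samples drawn from a different proposal distribution q, re-weighting each sample by p(x)/q(x). The distributions and the function f below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2  # quantity whose expectation we want under p

# Target p = N(1, 1), proposal q = N(0, 2); both chosen only for illustration.
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=100_000)   # samples from q, not from p
w = p_pdf(x) / q_pdf(x)                  # importance weights
estimate = np.mean(w * f(x))             # IS estimate of E_p[f(x)]
print(estimate)                          # close to the true value E_p[x^2] = 2
```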
Prioritized Experience Replay
• What
• Sample more important transitions more
frequently.
• How
• Probability of sampling transition i is proportional to its priority p_i:
P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}
• p_i = |\delta_i| + \epsilon, where \delta_i is the TD error of transition i.
Connection to Importance
Sampling
• To account for the change in distribution,
updates in the network are weighted with
importance sampling weights
• \omega_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}
• N is the size of the replay buffer.
• \beta controls the amount of importance-sampling correction.
• \beta is annealed linearly from \beta_0 to 1.
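A small, self-contained sketch (illustrative values, not the paper's hyperparameters) that puts the two formulas together: priorities from TD errors, sampling probabilities P(i), and the importance-sampling weights ω that scale the corresponding updates.

```python
import numpy as np

rng = np.random.default_rng(0)

td_errors = np.array([0.1, 2.0, 0.5, 0.05])   # |delta_i| for 4 stored transitions
eps = 1e-3
alpha, beta = 0.6, 0.4                         # illustrative exponents

p = np.abs(td_errors) + eps                    # p_i = |delta_i| + eps
P = p ** alpha / np.sum(p ** alpha)            # P(i) = p_i^alpha / sum_k p_k^alpha

N = len(p)
w = (1.0 / (N * P)) ** beta                    # omega_i = (1/N * 1/P(i))^beta
w /= w.max()                                   # common normalization for stability

i = rng.choice(N, p=P)                         # sample a transition index
# The TD update for transition i would then be scaled by w[i].
print(P, w, i)
```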
Related work
Who did what?
• What do they do?
• How do they do it?
• What do they require?
• How well does it work?
Imitation Learning

DAGGER
• Iteratively produces new policies by polling the expert policy outside its original state space.
• Requires the expert to be available during training.

Deeply AggreVaTeD
• Extends DAGGER to work with neural networks and continuous action spaces.
• The expert needs to provide a value function.

One-shot learning
• Input: the entire demonstration set and the current state.
• Uses demonstrations to specify the goal state from different initial conditions.
• Requires a distribution of tasks with different initial conditions and goal states.

Zero-sum game
• The learner chooses a policy; the adversary chooses a reward function.
What is wrong with imitation learning?
• Does not combine imitation with reinforcement learning.
• Can never out-perform demonstrations.
Combining Imitation Learning and RL

Human-Agent Transfer (HAT)
• Transfers knowledge directly from human policies.
• Uses demonstrations to shape rewards.

Policy shaping with human teachers
• Shapes the policy used to sample experience.
• Uses policy iteration from demonstrations.
Replay Buffers with Demonstrations

Human Experience Replay (HER)
• Replay buffer mixed with agent and demonstration data.
• Slightly better than a random agent.

Human Checkpoint Replay (HCR)
• No supervised learning or pre-training.
• Requires the ability to set the environment state.

Replay Buffer Spiking (RBS)
• Replay buffer is initialized with demonstration data.
• Does not maintain demonstration data in the replay buffer.
• No supervised learning or pre-training.
Similar works

AlphaGo
• Supervised: pre-trains on 30 million expert actions before interacting with the real task.
• Policy gradient: applies policy gradient with planning rollouts.

Accelerated DQN with Expert Trajectories (ADET)
• Combines TD and classification losses in a Q-learning setup.
• Uses a trained Q-learning agent to generate demonstrations.
• The policy used by the demonstrator is guaranteed to be representable by the apprenticeship agent.
• Uses a cross-entropy classification loss instead of the large-margin loss.
Methodology
Deep Q-learning from Demonstrations (DQfD)
Overview
Pre-train solely on
demonstration data.
Continue training
while interacting with
the environment.
Pre-training
• Goal:
• Imitate the demonstrator with a value function that
satisfies the Bellman equation.
• Trick:
• Such a value function can continue to be refined with TD updates once the agent starts interacting with the environment.
• Process:
• Sample mini-batches from the demonstration data and
update the network with four losses.
Interacting with the
environment
• Collect self-generated data in D_replay.
• Demonstration data is never overwritten.
• Add small constants \epsilon_a and \epsilon_d to the priorities of agent and demonstration transitions, respectively.
• The supervised loss is not applied to self-generated data (\lambda_2 = 0); see the sketch below.
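A minimal sketch of how this bookkeeping could look, assuming a simple list-based buffer; EPS_A, EPS_D, and the helper names are illustrative, not the paper's implementation.

```python
# Illustrative sketch: agent and demonstration transitions sharing one buffer.
EPS_A = 0.001   # small priority bonus for self-generated transitions
EPS_D = 1.0     # larger bonus so demonstration transitions keep being sampled

buffer = []  # each entry: (transition, priority, is_demo)

def add_transition(transition, td_error, is_demo):
    # p_i = |delta_i| + eps, with a larger eps for demonstration data.
    priority = abs(td_error) + (EPS_D if is_demo else EPS_A)
    buffer.append((transition, priority, is_demo))

def evict_one():
    # Self-generated data may be overwritten; demonstration data never is.
    for idx, (_, _, is_demo) in enumerate(buffer):
        if not is_demo:
            return buffer.pop(idx)
    return None

def supervised_weight(is_demo, lambda_2=1.0):
    # The large-margin supervised loss is switched off for self-generated data.
    return lambda_2 if is_demo else 0.0
```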
Loss function:

J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)

1-step double Q-learning loss:
J_{DQ}(Q) = \left( R(s,a) + \gamma\, Q(s_{t+1}, a^{\max}_{t+1}; \theta') - Q(s,a;\theta) \right)^2, where a^{\max}_{t+1} = \arg\max_a Q(s_{t+1}, a; \theta).

n-step double Q-learning loss: J_n(Q).

Supervised large-margin classification loss:
J_E(Q) = \max_{a \in A} \left[ Q(s,a) + l(a_E, a) \right] - Q(s, a_E)

L2 regularization loss: J_{L2}(Q).
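A toy computation of two of these terms for a single transition (illustrative Q-values and margin, not the paper's code): the 1-step double Q-learning loss J_DQ and the supervised large-margin loss J_E.

```python
import numpy as np

gamma = 0.99
margin = 0.8                                # l(a_E, a): positive for a != a_E, 0 otherwise

q_online = np.array([1.0, 2.5, 0.5])        # Q(s, .; theta) for 3 actions
q_online_next = np.array([0.2, 1.1, 0.9])   # Q(s_{t+1}, .; theta)
q_target_next = np.array([0.3, 1.0, 0.8])   # Q(s_{t+1}, .; theta')

a, a_expert, reward = 1, 0, 1.0

# 1-step double Q-learning loss: next action chosen by the online net,
# evaluated by the target net.
a_max = np.argmax(q_online_next)
td_target = reward + gamma * q_target_next[a_max]
j_dq = (td_target - q_online[a]) ** 2

# Supervised large-margin classification loss.
l = np.where(np.arange(3) == a_expert, 0.0, margin)
j_e = np.max(q_online + l) - q_online[a_expert]

print(j_dq, j_e)
```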
Why is
supervised
loss
important?
The demonstration data necessarily covers only a narrow part of the state space and does not take all possible actions. Many state-action pairs have never been visited and have no data to ground them to realistic values.
DQfD
Contributions
• Demonstration data: in replay buffer
permanently.
• Pre-training: solely on the demonstration data.
• Supervised large-margin loss: in addition to TD
losses to push the value of the demonstrator’s
actions above the other action values.
• L2 Regularization losses: to prevent overfitting.
• A mix of 1-step and N-step TD losses: to update
Q-network.
• Demonstration priority bonus: the priorities of demonstration transitions are given a bonus of \epsilon_d to boost the frequency with which they are sampled.
Results
Does DQfD solve our problem?
• Out-performed the worst demonstrations in 29 games.
• Out-performed the best demonstration in 14 games.
• State of the art in 11 games.
• Performed better than PDD DQN over the first million steps in 41 games.
• PDD DQN needs 83 million steps to catch up with DQfD's performance.
Models Comparison
Models Comparison Contd.
Evaluating model components
Questions?
Editor's Notes
1. a_E is the action the expert demonstrator took in state s, and l(a_E, a) is a margin function that is 0 when a = a_E and positive otherwise. The \lambda parameters control the weighting between the losses.
2. If we were to pre-train the network with only Q-learning updates towards the max value of the next state, the network would update towards the highest of these ungrounded values and propagate them throughout the Q-function.
3. In practice, the ratio between demonstration and self-generated data while learning is critical to the performance of the algorithm; prioritized experience replay is used to control this ratio automatically.
4. The game counts in the results are out of 42 games.