Lab Seminar
Deep Reinforcement Learning with Double Q-Learning
Seunghyeok Back
2018. 11. 05
Graduate student in the MS & Ph.D. integrated course
Artificial Intelligence Lab
shback@gist.ac.kr
School of Integrated Technology (SIT)
Gwangju Institute of Science and Technology (GIST)
Introduction
Deep Reinforcement Learning with Double Q-Learning
van Hasselt, Hado, Arthur Guez, and David Silver (DeepMind), AAAI 2016
Double Q-learning + DQN => Double DQN
• DQN suffers from substantial overestimations in some Atari games
• Generalizes Double Q-learning from the tabular setting to large-scale function approximation
• Proposes the Double DQN algorithm, which reduces overestimation and leads to much better performance
Overestimation in Q-learning
RL goal: find the optimal policy
Q-learning: find the optimal policy using the action-value function
Update of the Q-function:
Q(S_t, A_t) ← Q(S_t, A_t) + α [Y_t^Q − Q(S_t, A_t)]
where α is the learning rate and Y_t^Q − Q(S_t, A_t), the target value minus the current predicted Q-value, is the temporal-difference (TD) error.
Target value (Bellman equation): Y_t^Q = R_{t+1} + γ max_a Q(S_{t+1}, a)
Most problems are too large to learn all action values in all states -> function approximation
1) Parameterize the Q-function as Q(s, a; 𝜃)
2) Update the Q-function toward the target value (similar to SGD), following the gradient of the current predicted Q-value:
𝜃_{t+1} = 𝜃_t + α (Y_t^Q − Q(S_t, A_t; 𝜃_t)) ∇_𝜃 Q(S_t, A_t; 𝜃_t)
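For concreteness, here is a minimal NumPy sketch of the two updates above: the tabular rule and a linear-function-approximation variant. All names (q_table, features, actions, the step sizes) are illustrative placeholders, not taken from the paper or its code.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the Bellman target."""
    target = r + gamma * np.max(q_table[s_next])   # Y_t^Q = R + gamma * max_a Q(S', a)
    td_error = target - q_table[s, a]              # target minus current predicted Q-value
    q_table[s, a] += alpha * td_error              # alpha is the learning rate
    return q_table

def approx_q_update(theta, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """Semi-gradient step for a linear Q(s, a; theta) = theta . phi(s, a)."""
    target = r + gamma * max(theta @ features(s_next, b) for b in actions)
    phi = features(s, a)                           # gradient of a linear Q w.r.t. theta
    td_error = target - theta @ phi
    return theta + alpha * td_error * phi          # theta_{t+1} = theta_t + alpha * TD * grad
```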
Overestimation in Q-learning
• Q-learning sometimes learns unrealistically high action values because of the maximization step
• The max operator tends to prefer overestimated to underestimated values (a toy illustration follows this slide)
• Overestimation has been attributed to
1) insufficiently flexible function approximation, 2) noise, 3) estimation errors
Question: Do the overestimations negatively affect performance?
→ Yes, if the overestimations are not uniform and not concentrated at the states about which we wish to learn
→ DQN: sufficiently flexible function approximation (deep network) and no harmful effects of noise (deterministic environments)
→ A favorable setting, yet overestimation still occurs!
→ Double DQN reduces the overestimation and leads to better results on the Atari domain
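A toy simulation makes the bias of the maximization step concrete: if every true action value is zero but each estimate carries zero-mean noise, the max over the estimates is systematically positive, while any single estimate remains unbiased. This is only an illustration of the mechanism, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000
# True value of every action is 0; each estimate is corrupted by zero-mean noise.
noisy_estimates = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(noisy_estimates.max(axis=1).mean())  # roughly +1.5: the max picks the largest error
print(noisy_estimates[:, 0].mean())        # roughly 0.0: a single estimate is unbiased
```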
DQN
• DQN outputs a vector of action values Q(s, ·; 𝜃) for a given state s
• The target for the update is Y_t^DQN = R_{t+1} + γ max_a Q(S_{t+1}, a; 𝜃−)
Two main learning strategies (both sketched in code after this slide):
1) Separate target network
• 𝜃− is the parameter vector of the target network; every 𝜏 steps it is synchronized with the online network, 𝜃− = 𝜃
2) Experience replay
• Save each transition to a replay buffer and sample from it uniformly for training
Sources: https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8, https://www.slideshare.net/MuhammedKocaba/humanlevel-control-through-deep-reinforcement-learning-presentation
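A rough sketch of those two ingredients. Here q_target_net is assumed to be any callable that maps a state to a vector of action values; the buffer size, batch size, and every other name are placeholders rather than the paper's settings.

```python
import random
from collections import deque
import numpy as np

replay_buffer = deque(maxlen=100_000)          # experience replay: a FIFO of transitions

def store(s, a, r, s_next, done):
    replay_buffer.append((s, a, r, s_next, done))

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)   # uniform sampling for training

def dqn_targets(batch, q_target_net, gamma=0.99):
    """DQN target: Y = r + gamma * max_a Q(s', a; theta-), with no bootstrap at terminal states."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(q_target_net(s_next)))
        targets.append(r + bootstrap)
    return np.array(targets)

# Every tau gradient steps the target parameters are overwritten by the online ones
# (theta- <- theta), e.g. target_net.load_state_dict(online_net.state_dict()) in PyTorch.
```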
Double Q-learning
• (Q-learning, DQN) The same values Q are used both to select and to evaluate an action -> overestimation
• (Double Q-learning) One value function selects the action, the other evaluates it
• Two value functions, with weights 𝜃 and 𝜃′
• One of the two value functions is chosen at random and updated
• Select the action with Q(𝑆, 𝑎; 𝜃), evaluate this action with Q(𝑆, 𝑎; 𝜃′)
• This decouples the selection from the evaluation:
Y_t^DoubleQ = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; 𝜃); 𝜃′)
Selection (inner Q), evaluation (outer Q)
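The decoupled target fits in a few lines. Below, q_a and q_b stand for the two value functions with weights 𝜃 and 𝜃′; both are assumed to be callables returning a vector of action values, and all names are illustrative only.

```python
import numpy as np

def double_q_target(q_a, q_b, r, s_next, gamma=0.99):
    """Select the action with one estimator, evaluate it with the other."""
    a_star = int(np.argmax(q_a(s_next)))     # selection: argmax under Q(.; theta)   (inner Q)
    return r + gamma * q_b(s_next)[a_star]   # evaluation: value under Q(.; theta')  (outer Q)

# In tabular Double Q-learning a coin flip decides which estimator is updated,
# and the roles of q_a and q_b are swapped accordingly.
```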
Double DQN
Idea: use the target network as the second value function, without introducing additional networks.
Double Q-learning
• Picks one of the two value functions at random and updates it
• Each value function can be used for both selection and evaluation
Double DQN
• The target network Q(𝑆, 𝑎; 𝜃−) is used only for the evaluation, never for the selection (compare the two targets in the sketch after this slide):
Y_t^DoubleDQN = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; 𝜃); 𝜃−)
• The target network remains a periodic copy of the online network
• The two value functions are therefore not fully decoupled from each other
• This is the minimal possible change to DQN towards Double Q-learning
• It obtains most of the benefit of Double Q-learning while keeping the rest of DQN unchanged
• Fair comparison with DQN and minimal computational overhead
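Written side by side, the only change relative to the DQN target is which network performs the argmax. q_online and q_target are assumed callables returning action-value vectors; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def dqn_target(q_target, r, s_next, done, gamma=0.99):
    if done:
        return r
    return r + gamma * float(np.max(q_target(s_next)))    # theta- both selects and evaluates

def double_dqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    if done:
        return r
    a_star = int(np.argmax(q_online(s_next)))              # selection: online network (theta)
    return r + gamma * float(q_target(s_next)[a_star])     # evaluation: target network (theta-)
```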
Experiment
• Testbed: Atari 2600 games
• High-dimensional inputs; the visuals and mechanics vary substantially between games
• Same experimental setup and network architecture as DQN (Mnih et al., 2015)
• A single GPU, 200M frames per game
Source: https://ocr.space/blog/2015/02/playing-atari-deep-reinforcement-learning.html
Results on overoptimism
• 6 Atari games, each with 6 different random seeds, using the hyper-parameters employed in DQN
• Horizontal line: average of the actual discounted value over visited states (computed after learning concluded)
• Without overestimation, the value estimates and the horizontal line would match
• The estimates of Double DQN are much closer to the true values than those of DQN
• Double DQN produces not only more accurate value estimates but also better policies
Results on overoptimism
• Extreme overestimations appear in two games (Wizard of Wor and Asterix)
• The overestimations coincide with decreasing scores, i.e. they lead to bad policies
• Double DQN is much more stable
Quality of the learned policies
• (Mnih et al, 2015) Evaluation starts with no-op action
• Wait 30 times while not affecting the environment
• to provide different starting points for the agent
• Same hyper parameter for both DQN and Double DQN
• Double DQN clearly improves over DQN
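A sketch of the no-op evaluation protocol, assuming a classic Gym-style environment whose step() returns (obs, reward, done, info) and whose action 0 is the no-op; both assumptions are mine and are not stated in the slides.

```python
import random

NOOP_ACTION = 0  # assumption: index 0 is "do nothing" in the game's action set

def noop_start(env, max_noops=30):
    """Begin an evaluation episode with a random number (up to 30) of no-op actions."""
    obs = env.reset()
    for _ in range(random.randint(1, max_noops)):
        obs, reward, done, info = env.step(NOOP_ACTION)
        if done:                 # unlikely, but restart if the no-ops end the episode
            obs = env.reset()
    return obs
```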
Quality of the learned policies
• In deterministic games with a unique starting point, evaluation with no-op starts can let the agent memorize a sequence of actions without much need to generalize
• Instead of no-op starts, use starting points sampled from a human expert's trajectory
• Double DQN appears more robust to this more demanding evaluation (it makes appropriate generalizations)
• It finds general solutions rather than a deterministic sequence of steps