Deep Reinforcement Learning with Double Q-learning
Presenter: Takato Yamazaki
About the Paper
Title
Deep Reinforcement Learning with Double Q-learning
[arXiv:1509.06461]
Author
Hado van Hasselt, Arthur Guez, David Silver
Affiliation
Google DeepMind
Year
2015
Outline
How DDQN was Derived
DDQN
Experiment Environment
Results
Summary
Related Papers
How DDQN was Derived
Reinforcement Learning
Agent's Goal: Learn good policies for sequential decision problems
With policy π, the true value Q of an action a in state s is
Q_π(s, a) = E[ R_1 + γR_2 + ⋯ ∣ S_0 = s, A_0 = a, π ]
Optimal value is then
Q_*(s, a) = max_π Q_π(s, a)
How DDQN was Derived
Q-learning (Watkins, 1989)
Q(s, a) ← Q(s, a) + α ( R_{t+1} + γ max_{a′} Q(s′, a′) − Q(s, a) )
where α is the learning rate.
The current Q value moves closer to (reward + discounted next Q value).
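As a concrete sketch of this update rule (the state/action sizes, α, and γ below are illustrative assumptions, not values from the slides):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])   # reward + discounted best next value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the current estimate toward the target
    return Q

# Toy usage: 5 states, 2 actions.
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```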
How DDQN was Derived
Deep Q-learning (Mnih et al., 2015)
What if there are infinitely many states...
Q-learning can be cast as a minimization problem.
A neural network can be used to minimize the error!
Y_t^{DQN} = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t⁻)
min_{θ_t} L(θ_t) = min_{θ_t} E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_t⁻) − Q(s, a; θ_t) )² ]
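A minimal sketch of this target and loss on a batch, assuming the two networks' Q-values have already been evaluated into arrays (all argument names here are hypothetical):

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Y^DQN = R_{t+1} + gamma * max_a' Q(s_next, a'; theta^-); no bootstrap at terminals."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def dqn_loss(q_online_sa, targets):
    """Mean squared TD error between online Q(s, a; theta) and the fixed targets."""
    return np.mean((targets - q_online_sa) ** 2)

# Toy batch of 4 transitions with 3 actions.
rng = np.random.default_rng(0)
rewards = rng.normal(size=4)
dones = np.zeros(4)
next_q_target = rng.normal(size=(4, 3))  # target-network values at s_next
q_online_sa = rng.normal(size=4)         # online values for the taken actions
print(dqn_loss(q_online_sa, dqn_targets(rewards, next_q_target, dones)))
```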
How DDQN was Derived
Deep Q-learning (Mnih et al., 2015) (Continued)
Experience replay
Store observed transitions to memory bank
Sample from memory bank randomly and train network
Target network
Copy the online network θ_t to the target network θ_t⁻ every τ steps
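A minimal sketch of both mechanisms; the buffer capacity and the dict-of-arrays parameter layout are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Memory bank of transitions; uniform random sampling breaks temporal correlations."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def store(self, transition):
        self.memory.append(transition)  # (s, a, r, s_next, done)

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

def sync_target(online_params, target_params):
    """Copy theta_t into theta_t^- every tau steps (parameters as a dict of arrays)."""
    for name, value in online_params.items():
        target_params[name] = value.copy()
```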
How DDQN was Derived
Double Q-learning (van Hasselt, 2010)
Q-learning often OVERESTIMATES the Q values because...
it uses the maximum action value every time to update Q values
it uses the same values both to select and to evaluate an action
Double Q-learning helps avoid overestimation!
Split the weights θ into a selector and an evaluator
Double Q-learning (van Hasselt, 2010) (continued)
Double Q-learning (van Hasselt, 2010) (continued)
Q-learning target:
Y_t^Q = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t)
Transform to:
Y_t^Q = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ_t)
Use a different parameter θ′ for evaluating the Q-value:
Y_t^{DoubleQ} = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ_t′)
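In the tabular setting this amounts to keeping two tables and letting a coin flip decide which one is the selector on each step; a minimal sketch with illustrative hyperparameters:

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One table selects argmax a', the *other* table evaluates it (van Hasselt, 2010)."""
    if np.random.random() < 0.5:
        a_star = QA[s_next].argmax()                 # QA selects...
        target = r + gamma * QB[s_next, a_star]      # ...QB evaluates
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        a_star = QB[s_next].argmax()                 # QB selects...
        target = r + gamma * QA[s_next, a_star]      # ...QA evaluates
        QB[s, a] += alpha * (target - QB[s, a])

QA, QB = np.zeros((5, 2)), np.zeros((5, 2))
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2)
```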
Double Q-learning (van Hasselt, 2010) (continued)
DDQN
Double Deep Q-learning (DDQN)
Combination of DQN and Double Q-learning!!!
Use neural networks as the selector and the evaluator.
Easy implementation because...
DQN already uses the target network feature
Online network θ_t = selector
Target network θ_t⁻ = evaluator
Double Deep Q-learning (DDQN) (continued)
Double Q-learning's target was described as
Y_t^{DoubleQ} = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ_t′)
Transform for DDQN:
Y_t^{DoubleDQN} = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ_t⁻)
where θ_t is the online network and θ_t⁻ is the target network.
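A batched sketch of the DDQN target, again assuming precomputed Q-value arrays from the two networks (argument names are hypothetical); compare dqn_targets above, which both selects and evaluates with the target network:

```python
import numpy as np

def ddqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Online net (theta_t) selects a*, target net (theta_t^-) evaluates it."""
    a_star = next_q_online.argmax(axis=1)                        # selection
    evaluated = next_q_target[np.arange(len(a_star)), a_star]    # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated
```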
Experiment Environment
Atari 2600 Games, using the Arcade Learning Environment (ALE)
Experiment Environment
Network
Optimizer: RMSProp
Experiment Environment
Parameters (DQN, DDQN)
Discount value: γ = 0.99
Learning rate: α = 0.00025
Target network update: every 10000 steps
Exploration: epsilon-greedy method
Epsilon: ε = max(1 − t / 1,000,000, 0.1)
Training steps: 50,000,000
Experiment Environment
Parameters (Tuned for DDQN)
Discount value: γ = 0.99
Learning rate: α = 0.00025
Target network update: every 30000 steps
Exploration: epsilon-greedy method
Epsilon: ε = max(1 − t / 1,000,000, 0.01)
Training steps: 50,000,000
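Both parameter sets share the same linear annealing schedule and differ only in the floor (0.1 vs. 0.01); a small sketch:

```python
def epsilon(t, floor=0.1, anneal_steps=1_000_000):
    """Linearly annealed exploration rate: max(1 - t / anneal_steps, floor)."""
    return max(1.0 - t / anneal_steps, floor)

assert epsilon(0) == 1.0
assert epsilon(2_000_000) == 0.1               # DQN / DDQN setting
assert epsilon(2_000_000, floor=0.01) == 0.01  # tuned DDQN setting
```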
Results
DDQN is better than DQN
Value estimates: (1/T) Σ_{t=1}^{T} max_a Q(S_t, a; θ)
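A one-line sketch of this statistic, assuming a (T, num_actions) array of Q-values for the visited states:

```python
import numpy as np

def average_value_estimate(q_values):
    """(1/T) * sum_t max_a Q(S_t, a; theta) for a (T, num_actions) array."""
    return float(q_values.max(axis=1).mean())
```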
Results
More results
Results
More results (100 games each)
Results
More results
Summary
DDQN > DQN in most environments.
Fewer overestimations of values.
Implementation is easy!
Go DDQN!!
Related Papers
Elhadji Amadou Oury Diallo et al.: "Learning Power of Coordination in Adversarial Multi-Agent with Distributed Double DQN".
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver: "Continuous control with deep reinforcement learning", 2015. arXiv:1509.02971, http://arxiv.org/abs/1509.02971.
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot: "Dueling Network Architectures for Deep Reinforcement Learning", 2015. arXiv:1511.06581, http://arxiv.org/abs/1511.06581.