DQN algorithm
kv
Physics Department, National Taiwan University
kelispinor@gmail.com
These slides are largely credited to David Silver’s slides and CS294.
July 16, 2018
Overview
1 Overview
2 Introduction
What is Reinforcement Learning
Markov Decision Process
Dynamic Programming
What is Reinforcement Learning?
RL is a general framework for AI.
RL is for agents with the ability to interact with an environment
Each action influences the agent’s future states
Success is measured by a scalar reward signal
RL in a nutshell: Select actions to maximize future reward.
Reinforcement Learning Framework
In reinforcement learning, the agent observes the current state St, receives reward Rt, and then interacts with the environment by taking action At under its policy.
Figure: agent-environment interaction loop (the agent takes action at; the environment returns new state st+1 and reward rt+1)
Markov Decision Process
Markov Property
The future is independent of the past given the present.
P(St+1|St) = P(St+1|St, St−1, ..., S2, S1)
An MDP is a tuple ⟨S, A, P, R, γ⟩, defined by the following components
S: state space
A: action space
P(r, s′ | s, a): transition probability for the transition s, a → r, s′
Policy
Policy: any function mapping from states to actions, π : S → A
Deterministic policy a = π(s)
Stochastic policy a ∼ π(a|s)
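A tiny illustration of the two cases (the state encoding, actions, and probabilities here are made up for illustration, not taken from the slides):

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state: int) -> str:
    # a = pi(s): the same state always produces the same action.
    return "right" if state > 0 else "left"

def stochastic_policy(state: int) -> str:
    # a ~ pi(a|s): the action is sampled from a state-dependent distribution.
    p_right = 0.8 if state > 0 else 0.2
    return random.choices(ACTIONS, weights=[1 - p_right, p_right])[0]
```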
Policy Evaluation and Value Functions
Policy optimization: maximize the expected reward with respect to the policy π
maximize_π E[ Σ_t r_t ]
Policy evaluation: compute the expected return for a given π
State value function: V^π(s) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s ]
State-action value function: Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ]
Value Functions
Q-function or state-action value function: expected total reward from
state s and action a under a policy π
Q^π(s, a) = E_π[ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s, a_0 = a ]    (1)
State value function: expected (long-term) return starting from s
V^π(s) = E_π[ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s ]    (2)
       = E_{a∼π(·|s)}[ Q^π(s, a) ]    (3)
Advantage function
A^π(s, a) = Q^π(s, a) − V^π(s)    (4)
Bellman Equation
The state-action value function can be unrolled recursively
Q^π(s, a) = E[ r_0 + γ r_1 + γ^2 r_2 + ... | s, a ]    (5)
          = E_{s′}[ r + γ Q^π(s′, a′) | s, a ]    (6)
The optimal Q-function Q∗(s, a) can be unrolled recursively
Q∗(s, a) = E_{s′}[ r + γ max_{a′} Q∗(s′, a′) | s, a ]    (7)
The value iteration algorithm solves the Bellman equation
Q_{i+1}(s, a) = E_{s′}[ r + γ max_{a′} Q_i(s′, a′) | s, a ]    (8)
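As a concrete illustration of equation (8) (not from the slides), here is a minimal tabular Q-value-iteration sketch; the MDP tables P and R below are randomly generated placeholders.

```python
import numpy as np

# Hypothetical tabular MDP: n_s states, n_a actions.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # each P[s, a, :] sums to 1
R = rng.normal(size=(n_s, n_a))

Q = np.zeros((n_s, n_a))
for _ in range(500):
    # Q_{i+1}(s, a) = R(s, a) + gamma * E_{s'}[ max_{a'} Q_i(s', a') ]
    Q_next = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_next - Q)) < 1e-8:
        break
    Q = Q_next

greedy_policy = Q.argmax(axis=1)   # pi(s) = argmax_a Q*(s, a)
```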
Bellman Backup Operator
The Q-function with an explicit time index
Q^π(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ E_{a_1∼π}[ Q^π(s_1, a_1) ] ]    (9)
Define the Bellman backup operator T^π, operating on a Q-function
[T^π Q](s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ E_{a_1∼π}[ Q(s_1, a_1) ] ]    (10)
Q^π is a fixed point of T^π
T^π Q^π = Q^π    (11)
If we apply T^π repeatedly to any Q, the sequence converges to Q^π
Q, T^π Q, (T^π)^2 Q, ... → Q^π    (12)
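A minimal sketch of applying T^π repeatedly for a fixed stochastic policy (again with made-up tabular P, R, and π; not from the slides):

```python
import numpy as np

n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
R = rng.normal(size=(n_s, n_a))                    # R[s, a]
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi[s, a] = probability of a in s

def bellman_backup_pi(Q):
    # [T^pi Q](s, a) = E_{s'}[ r + gamma * E_{a'~pi} Q(s', a') ]
    v = (pi * Q).sum(axis=1)       # E_{a'~pi} Q(s', a') for every state s'
    return R + gamma * P @ v

Q = np.zeros((n_s, n_a))
for _ in range(1000):              # Q, T^pi Q, (T^pi)^2 Q, ...  ->  Q^pi
    Q = bellman_backup_pi(Q)
```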
Introducing Q∗
Denote by π∗ an optimal policy.
Q∗(s, a) = Q^{π∗}(s, a) = max_π Q^π(s, a)
It satisfies π∗(s) = argmax_a Q∗(s, a)
Then the Bellman equation
Q^π(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ E_{a_1∼π}[ Q^π(s_1, a_1) ] ]    (13)
becomes
Q∗(s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ max_{a_1} Q∗(s_1, a_1) ]    (14)
We can also define the corresponding Bellman backup operator.
Bellman Backup Operator for Q∗
The Bellman optimality backup operator, operating on a Q-function
[T Q](s_0, a_0) = E_{s_1∼P(s_1|s_0,a_0)}[ r_0 + γ max_{a_1} Q(s_1, a_1) ]    (15)
Q∗ is a fixed point of T
T Q∗ = Q∗    (16)
If we apply T repeatedly to any Q, the sequence converges to Q∗
Q, T Q, T^2 Q, ... → Q∗    (17)
Deep Q-Learning
Represent the value function by a deep Q-network with weights w
Q(s, a; w) ≈ Q^π(s, a)
The objective over Q-values is the mean-squared error
L(w) = E[ ( r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w) )^2 ]
where r + γ max_{a′} Q(s′, a′; w) is the TD target.
Q-learning gradient
∂L(w)/∂w = E[ ( r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w) ) ∂Q(s, a; w)/∂w ]
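As an illustration (not from the slides), a minimal PyTorch-style sketch of this loss for one mini-batch; q_net and the batch tensors are assumed to exist, and the TD target is detached so that only Q(s, a; w) receives gradient, matching the Q-learning gradient above.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """Mean-squared TD error for a mini-batch of transitions (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; w)
    with torch.no_grad():                                       # TD target treated as a constant
        td_target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, td_target)
```

This version bootstraps from the same weights w, exactly as on this slide; the fixed-target variant appears a few slides later.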
Deep Q-Learning
Backup estimate: T Q_t = r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
To approximate Q ← T Q_t, minimize ‖T Q_t − Q(s_t, a_t)‖^2
T is a contraction under ‖·‖_∞, but not under ‖·‖_2
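For reference, the standard sup-norm contraction argument behind this statement can be written as follows (a textbook derivation, not reproduced from the slides):

```latex
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty
  = \max_{s,a}\, \gamma \left| \mathbb{E}_{s'}\!\left[ \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \right] \right|
  \le \gamma \max_{s'} \left| \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \right|
  \le \gamma\, \|Q_1 - Q_2\|_\infty .
```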
Stability Issues
1 Data is sequential
Successive samples are highly correlated (non-i.i.d.)
2 Policy changes rapidly with slight changes in the Q-values
π may oscillate
The distribution of the data may swing
3 The scale of rewards and Q-values is unknown
Large gradients can cause unstable backpropagation
Deep Q Network
Proposed solutions
1 Use experience replay
Breaks correlations in the data, restoring a near-i.i.d. setting
2 Fix target network
The old Q-function is frozen for many timesteps between updates
Breaks the correlation between the Q-function and its target
3 Clip rewards and normalize adaptively to a sensible range
Robust gradients
Stabilize DQN: Experience Replay
Goal: remove correlations by building a data-set of the agent’s experience
a_t is sampled from an ε-greedy policy
Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample a random mini-batch of transitions (s, a, r, s′) from D
Optimize MSE between the Q-network and the Q-learning target
L(w) = E_{(s,a,r,s′)∼D}[ ( r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w) )^2 ]
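A minimal replay-memory sketch (not from the slides; the capacity, batch size, and stored done flag are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # old transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        # Store the transition (s_t, a_t, r_{t+1}, s_{t+1}).
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of successive samples.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```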
Stabilize DQN: Fixed Target
Goal: avoid oscillations by fixing the parameters used in the target
Compute the Q-learning target with respect to old, fixed parameters w−
r + γ max_{a′} Q(s′, a′; w−)
Optimize MSE between the Q-network and the Q-learning target
L(w) = E_{(s,a,r,s′)∼D}[ ( r + γ max_{a′} Q(s′, a′; w−) − Q(s, a; w) )^2 ]
where r + γ max_{a′} Q(s′, a′; w−) is the fixed target.
Periodically update the fixed parameters: w− ← w
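A minimal sketch of the fixed-target mechanism (not from the slides; the network size, sync period, and optimizer settings are illustrative, not the published DQN values):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)                  # holds the frozen parameters w-
for p in target_net.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
SYNC_EVERY = 1_000

def td_loss(batch, gamma=0.99):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target computed with the old, fixed parameters w-
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

def maybe_sync(step):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())   # w-  <-  w
```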
Stabilize DQN: Reward / Value Range
Clip rewards to [−1, 1]
Ensures gradients are well-conditioned
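A one-line sketch of the clipping step (an illustrative helper, not from the slides):

```python
def clip_reward(r: float) -> float:
    # Keep per-step rewards in [-1, 1] so TD errors, and hence gradients, stay bounded.
    return max(-1.0, min(1.0, r))
```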
DQN in Atari
Figure: Deep Q Learning
DQN in Atari
End-to-end learning of Q from pixels s
Input s is the last 4 frames, stacked
Output: Q(s, a) for each of 18 actions
Reward is the change in game score for that step
Figure: Q-Network Architecture
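A sketch of a convolutional Q-network in this spirit (the layer sizes are illustrative and chosen for a 4 x 84 x 84 stacked-frame input; they are not claimed to be the exact published architecture):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                              # one Q(s, a) per action
        )

    def forward(self, x):   # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)

q = QNetwork()
print(q(torch.zeros(1, 4, 84, 84)).shape)   # torch.Size([1, 18])
```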
DQN Results
Do Q-values have meaning?
But Q-values are usually overestimated.
Double Q Learning
E_{X1,X2}[ max(X1, X2) ] ≥ max( E[X1], E[X2] )
Q-values are noisy and overestimated
Solution: use two networks; when updating one, select the maximizing action with the other network
Q_A(s, a) ← r + γ Q_A(s′, argmax_{a′} Q_B(s′, a′))
Q_B(s, a) ← r + γ Q_B(s′, argmax_{a′} Q_A(s′, a′))
Original DQN
Q(s, a) ← r + γ Q_target(s′, a′) = r + γ Q_target(s′, argmax_{a′} Q_target(s′, a′))
Double DQN
Q(s, a) ← r + γ Q_target(s′, argmax_{a′} Q(s′, a′))    (18)
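A minimal sketch of the Double-DQN target in equation (18) (not from the slides; q_net, target_net, and the batch tensors are assumed to exist):

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    # Select the greedy action with the online network, evaluate it with the target network.
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)         # argmax_{a'} Q(s', a'; w)
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)   # Q_target(s', best_a)
    return r + gamma * (1 - done) * q_eval
```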