Deep Q-Network
guodong
Value Iteration and Q-learning
• Model-free control: iteratively optimise the value function and the policy (the Q-learning update is recalled below)
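For reference, a minimal statement of the tabular Q-learning update these slides build on (α is the step size, γ the discount factor):

```latex
% One-step Q-learning update used in model-free control:
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```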
Value Function Approximation
• A “lookup table” is not practical
  • need to generalize to unobserved states
  • need to handle large state/action spaces (including continuous states and actions)
• Transform RL into a supervised learning problem (loss sketched below)
  • model (hypothesis space)
  • loss/cost function
  • optimization
  • i.i.d. assumption
• RL can be unstable or divergent when the action-value function Q is approximated with a nonlinear function such as a neural network
  • states are correlated, the data distribution shifts as the policy changes, and the model is complex, so the i.i.d. assumption is violated
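To make the “transform into a supervised learning problem” bullet concrete, this is the squared TD-error loss that DQN minimises; θ⁻ denotes the parameters of the fixed target network introduced on the next slides:

```latex
% The TD target acts as the regression label; only \theta is updated,
% the target-network parameters \theta^{-} are held fixed between syncs.
L(\theta) = \mathbb{E}_{(s,a,r,s')}
  \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]
```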
Deep Q-Network
• First step towards “General Artificial Intelligence”
• DQN = Q-learning + Function Approximation + Deep Network
• Stabilize training with experience replay and target network
• End-to-end RL approach, and quite flexible
DQN Algorithm
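The original slide shows the DQN pseudocode as an image. Below is a minimal, hedged Python sketch of that loop: the toy environment and the linear Q-function are hypothetical stand-ins (a real DQN trains a convolutional network on stacked Atari frames), so only the structure matters, i.e. the replay buffer, ε-greedy exploration, mini-batch TD updates, and the periodically synced target network.

```python
# Minimal sketch of the DQN training loop: replay buffer, epsilon-greedy
# exploration, mini-batch TD updates, and a periodically refreshed target
# network. The toy environment and the linear Q-function are hypothetical
# stand-ins; a real DQN trains a convolutional network on stacked Atari frames.
import random
import numpy as np

STATE_DIM, N_ACTIONS = 8, 4
GAMMA, ALPHA = 0.99, 1e-3               # discount factor and step size
BUFFER_SIZE, BATCH_SIZE = 100_000, 32   # replay capacity and mini-batch size
TARGET_SYNC_EVERY = 1_000               # how often the target network is refreshed

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.01, size=(N_ACTIONS, STATE_DIM))  # online Q weights
theta_target = theta.copy()                                  # frozen target copy
replay = []                                                  # experience replay buffer

def q_values(weights, state):
    return weights @ state  # Q(s, .) under the linear stand-in approximator

def toy_env_step(state, action):
    # Hypothetical environment: random next state, random reward, rare episode end.
    return rng.normal(size=STATE_DIM), float(rng.normal()), bool(rng.random() < 0.01)

state = rng.normal(size=STATE_DIM)
for step in range(1, 20_001):
    # Epsilon-greedy action selection, with epsilon annealed from 1.0 down to 0.1.
    epsilon = max(0.1, 1.0 - 0.9 * step / 10_000)
    if rng.random() < epsilon:
        action = int(rng.integers(N_ACTIONS))
    else:
        action = int(np.argmax(q_values(theta, state)))

    next_state, reward, done = toy_env_step(state, action)
    replay.append((state, action, reward, next_state, done))
    if len(replay) > BUFFER_SIZE:
        replay.pop(0)
    state = rng.normal(size=STATE_DIM) if done else next_state

    if len(replay) >= BATCH_SIZE:
        # Sample a mini-batch of decorrelated transitions from the replay buffer.
        for s, a, r, s2, d in random.sample(replay, BATCH_SIZE):
            target = r if d else r + GAMMA * np.max(q_values(theta_target, s2))
            td_error = target - q_values(theta, s)[a]
            theta[a] += ALPHA * td_error * s  # semi-gradient step on the squared TD error

    if step % TARGET_SYNC_EVERY == 0:
        theta_target = theta.copy()  # fixed target: copy the online weights periodically
```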
Practical Tips
• stable training: experience replay (1M transitions) + fixed target network (all tips are collected in the sketch below)
• mini-batch updates
• exploration vs. exploitation via ε-greedy, with ε annealed from 1.0 down to 0.1
• input to the Q-network stacks the 4 most recent frames
• frame skipping
• discount factor γ = 0.99
• use RMSProp instead of vanilla SGD
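The tips above, collected into one hypothetical configuration dictionary; values marked as assumptions are not stated on the slide.

```python
# The practical tips above as one hypothetical configuration dictionary.
DQN_CONFIG = {
    "replay_buffer_size": 1_000_000,  # experience replay over the last 1M transitions
    "use_target_network": True,       # fixed target network for stable training
    "minibatch_size": 32,             # assumption: a typical DQN mini-batch size
    "epsilon_start": 1.0,             # epsilon-greedy exploration ...
    "epsilon_final": 0.1,             # ... annealed from 1.0 down to 0.1
    "frame_stack": 4,                 # Q-network input is the 4 most recent frames
    "frame_skip": 4,                  # assumption: common value; the slide only says "skip frames"
    "discount_gamma": 0.99,           # discount factor of 0.99
    "optimizer": "RMSProp",           # RMSProp instead of vanilla SGD
}
```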
DQN variants
• Double DQN
• Prioritized Experience Replay
• Dueling Architecture
• Asynchronous Methods
• Continuous DQN
Double Q-learning
• Motivation: reduce overestimation by decomposing the max operation in the target into action selection and action evaluation (written out below)
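Written out, the decomposition looks as follows; θ and θ′ are the two value functions that tabular Double Q-learning learns independently:

```latex
% Standard Q-learning target: the same parameters \theta both select and
% evaluate the action, which biases the target upwards (overestimation).
Y^{\mathrm{Q}} = r + \gamma \, Q\big(s', \arg\max_{a} Q(s', a; \theta);\, \theta\big)
% Double Q-learning: select the action with \theta, evaluate it with \theta'.
Y^{\mathrm{DoubleQ}} = r + \gamma \, Q\big(s', \arg\max_{a} Q(s', a; \theta);\, \theta'\big)
```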
Double DQN
• From Double Q-learning to Double DQN: select the greedy action with the online network, evaluate it with the target network (sketched below)
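A hedged sketch of that change, next to the vanilla DQN target. The function names and array shapes are illustrative; any `[batch, n_actions]` Q-value arrays will do, and the networks that produce them are outside this snippet.

```python
# Vanilla DQN target vs. Double DQN target, computed from batched Q-values.
import numpy as np

def dqn_target(rewards, dones, q_target_next, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the action.
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

def double_dqn_target(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the greedy action ...
    best_actions = q_online_next.argmax(axis=1)
    # ... and the target network evaluates it, which reduces overestimation.
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated
```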
Prioritized Experience Replay
• Motivation: replay transitions that carry more information (larger learning potential) more frequently
• Key components
  • criterion of importance: the magnitude of the TD error
  • stochastic prioritization instead of greedy prioritization
  • importance-sampling weights to correct the bias introduced by non-uniform replay
Algorithm
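A hedged sketch of the proportional variant described above. A real implementation stores priorities in a sum-tree so sampling is O(log N); plain numpy is used here for clarity, and the function name and defaults (α = 0.6, β = 0.4) are illustrative.

```python
# Proportional prioritized replay: priorities from TD errors, stochastic
# sampling proportional to p_i^alpha, and importance-sampling weights to
# correct the bias of non-uniform replay.
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    priorities = np.abs(td_errors) + eps  # criterion of importance: |TD error| (+ eps so p_i > 0)
    probs = priorities ** alpha           # alpha controls how strong the prioritization is
    probs /= probs.sum()                  # stochastic prioritization instead of greedy
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()              # normalise importance-sampling weights to <= 1
    return idx, weights                   # scale each sampled transition's loss by its weight

# Example: sample 4 transitions from a buffer holding 10 TD errors.
idx, w = sample_prioritized(np.array([0.1, 2.0, 0.5, 0.05, 1.2, 0.3, 0.7, 0.01, 0.9, 1.5]), 4)
```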
Performance comparison
Dueling Architecture - Motivation
• Motivation: for many states, estimating the state value matters more than estimating every state-action value
• Better approximation of the state value, while leveraging the power of the advantage function
Dueling Architecture - Details
• Can be adopted by existing DQN algorithms (the output of the dueling network is still a Q function)
• Estimate the value function and the advantage function in separate streams, and combine them to estimate the action-value function (aggregation formula below)
• During back-propagation, the value and advantage estimates are learned automatically, without extra supervision
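The combining step referred to above is the mean-subtracted aggregation from the dueling paper (θ are the shared convolutional parameters, α and β the parameters of the advantage and value streams):

```latex
% Mean-subtracted aggregation of the two streams; the output is still Q(s,a).
Q(s, a; \theta, \alpha, \beta)
  = V(s; \theta, \beta)
  + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)
```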
Dueling Architecture - Performance
• Converges faster
• More robust: the differences between Q-values for a given state are often small, so noise in the updates can make a nearly greedy policy switch abruptly; the dueling decomposition is less sensitive to this
• Achieves better performance on Atari games (the advantage grows as the number of actions increases)
More variants
• Continuous action control + DQN
  • NAF (Normalized Advantage Functions): a continuous variant of the Q-learning algorithm
  • DDPG: Deep Deterministic Policy Gradient
• Asynchronous Methods + DQN
  • multiple agents running in parallel + a parameter server
Reference
• Playing Atari with Deep Reinforcement Learning
• Human-level Control through Deep Reinforcement Learning
• Deep Reinforcement Learning with Double Q-learning
• Prioritized Experience Replay
• Dueling Network Architectures for Deep Reinforcement Learning
• Asynchronous Methods for Deep Reinforcement Learning
• Continuous Control with Deep Reinforcement Learning
• Continuous Deep Q-Learning with Model-based Acceleration
• Double Q-learning
• Deep Reinforcement Learning: An Overview
