Deep Q-Learning
A Reinforcement Learning approach
What is Reinforcement Learning?
- Much like the way biological agents learn
- No supervisor, only a reward signal
- Data is time-dependent (non-i.i.d.)
- Feedback is delayed
- The agent's actions affect the data it receives
Examples
- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- RoboCup soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)
Videos
Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward
- Defeat the world champion at Go: +R / -R for winning/losing a game
- Make a robot walk: +R for forward motion, -R for falling over
- Play ATARI games: +R / -R for increasing/decreasing score
- Control a helicopter: +R / -R for following the trajectory / crashing
Agent and Environment
Fully Observable Environments
Fully Observable Environments (agent state = environment state):
- Agent directly observes environment
- Example: chess board
Partially Observable Environments (agent state ≠ environment state):
- Agent indirectly observes environment
- Example: a robot with a motion sensor or camera
- Agent must construct its own state representation
RL components: Policy and Value Function
A policy is the agent's behaviour function
- Maps from state to action
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
A value function is a prediction of future reward
- Used to evaluate states and select between actions
- v_π(s) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s]
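To make the two components concrete, here is a minimal Python sketch (my own illustration, not from the slides; all state and action names are made up) of a deterministic policy, a stochastic policy, and a tabular value function:

    import random

    # Toy state and action spaces (illustrative only)
    states = ["s0", "s1"]
    actions = ["left", "right"]

    # Deterministic policy: a = pi(s), a direct state -> action map
    deterministic_pi = {"s0": "right", "s1": "left"}

    # Stochastic policy: pi(a|s), a probability distribution over actions per state
    stochastic_pi = {
        "s0": {"left": 0.2, "right": 0.8},
        "s1": {"left": 0.6, "right": 0.4},
    }

    def sample_action(pi, s):
        """Draw an action from a stochastic policy."""
        acts, probs = zip(*pi[s].items())
        return random.choices(acts, weights=probs)[0]

    # Value function: v(s), the predicted expected cumulative reward from state s,
    # used to compare states and to choose between actions.
    value = {"s0": 1.5, "s1": -0.3}

    print(sample_action(stochastic_pi, "s0"), value["s0"])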
Model
A model predicts what the environment will do next:
- P predicts the next state
- R predicts the next (immediate) reward
Maze example: r = -1 per time-step and policy
[David Silver. Advanced Topics: RL]
Maze example: Value function and Model
[David Silver. Advanced Topics: RL]
Exploration - Exploitation dilemma
Math: Markov Decision Process (MDP)
Almost all RL problems can be formalised as MDPs
It is a tuple (S, A, P, R, γ):
- S is a finite set of states
- A is a finite set of actions
- P is the state transition probability matrix: P_ss'^a = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R_s^a = E[R_{t+1} | S_t = s, A_t = a]
- γ is a discount factor: γ ∈ [0, 1]
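As an illustration (a toy example of my own, not from the slides), the tuple (S, A, P, R, γ) can be written down directly for a two-state MDP:

    # A tiny illustrative MDP with 2 states and 2 actions.
    S = ["s0", "s1"]
    A = ["stay", "move"]

    # P[s][a] is a dict {next_state: probability}  (state transition matrix)
    P = {
        "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
        "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
    }

    # R[s][a] is the expected immediate reward for taking action a in state s
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 0.5, "move": 0.0},
    }

    gamma = 0.9   # discount factor in [0, 1]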
State-Value and Action-Value functions, Bellman eq.
State-value function v_π(s): expected return starting from state s and then following policy π
Action-value function q_π(s, a): expected return starting from state s, taking action a, and then following policy π
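In standard notation (the Sutton & Barto / Silver conventions), these definitions and the corresponding Bellman expectation equations read:

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots

    v_\pi(s)   = \mathbb{E}_\pi[\, G_t \mid S_t = s \,]
               = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,]

    q_\pi(s,a) = \mathbb{E}_\pi[\, G_t \mid S_t = s,\, A_t = a \,]
               = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \,]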
Finding an Optimal Policy
- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function
- All optimal policies achieve the optimal action-value function
All you need is to find q*(s, a): acting greedily with respect to it gives an optimal policy
Bellman Opt Equation for state-value function
[David Silver. Advanced Topics: RL]
Bellman Opt Equation for action-value function
[David Silver. Advanced Topics: RL]
Bellman Opt Equation for state-value function
[David Silver. Advanced Topics: RL]
Bellman Opt Equation for action-value function
[David Silver. Advanced Topics: RL]
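The equations referenced on these slides are the Bellman optimality equations; in standard notation they read:

    v_*(s)   = \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \Big)

    q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')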
Policy Iteration Demo
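A minimal policy iteration sketch in Python (my own illustration, not the demo shown in the talk), reusing the toy MDP format from the earlier sketch: iterative policy evaluation followed by greedy policy improvement, repeated until the policy is stable.

    # Toy MDP: P[s][a] -> {s': prob}, R[s][a] -> expected reward (illustrative values)
    S = ["s0", "s1"]; A = ["stay", "move"]
    P = {"s0": {"stay": {"s0": 1.0}, "move": {"s1": 1.0}},
         "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}}}
    R = {"s0": {"stay": 0.0, "move": 1.0},
         "s1": {"stay": 0.5, "move": 0.0}}
    gamma = 0.9

    def evaluate(pi, theta=1e-8):
        """Iterative policy evaluation: solve v_pi by repeated Bellman backups."""
        v = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                a = pi[s]
                new_v = R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                return v

    def improve(v):
        """Greedy policy improvement with respect to v."""
        return {s: max(A, key=lambda a: R[s][a] +
                       gamma * sum(p * v[s2] for s2, p in P[s][a].items()))
                for s in S}

    pi = {s: A[0] for s in S}          # start from an arbitrary policy
    while True:
        v = evaluate(pi)
        new_pi = improve(v)
        if new_pi == pi:               # policy stable => optimal
            break
        pi = new_pi
    print(pi, v)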
Q-Learning - model-free off-policy control algorithm
Model-free (vs Model-based):
- The MDP model is unknown, but experience can be sampled from the MDP
- Or: the model is known, but is too big to use except through samples
Off-policy (vs On-policy):
- Can learn about the target policy from experience sampled from another (behaviour) policy
Control (vs Prediction):
- Find the best policy
Q-Learning
[David Silver. Advanced Topics: RL]
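The tabular Q-learning update can be sketched in Python as below (my own sketch; it assumes an environment object with reset() returning a state and step(a) returning (next_state, reward, done), so the interface names are illustrative):

    import random
    from collections import defaultdict

    def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Model-free, off-policy control: behave epsilon-greedily, learn the greedy target."""
        Q = defaultdict(lambda: [0.0] * n_actions)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy behaviour policy (exploration vs exploitation)
                if random.random() < epsilon:
                    a = random.randrange(n_actions)                       # explore
                else:
                    a = max(range(n_actions), key=lambda i: Q[s][i])      # exploit
                s2, r, done = env.step(a)
                # off-policy target: bootstrap from the greedy action in s2
                target = r + (0.0 if done else gamma * max(Q[s2]))
                Q[s][a] += alpha * (target - Q[s][a])
                s = s2
        return Q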
DQN - Q-Learning with function approximation
[Human-level control through deep reinforcement learning]
[Human-level control through deep reinforcement learning]
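The network from the Nature paper takes a stack of 4 preprocessed 84x84 frames and outputs one Q-value per action. A sketch of that architecture in PyTorch (my choice of framework here, not the one used in the original work):

    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        """Conv net from the Nature DQN: input 4x84x84 frames, output one Q-value per action."""
        def __init__(self, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),
            )

        def forward(self, x):
            return self.net(x / 255.0)   # scale raw pixel values to [0, 1]

    q = DQN(n_actions=4)
    print(q(torch.zeros(1, 4, 84, 84)).shape)   # -> torch.Size([1, 4])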
Issues with Q-learning with a neural network
- Data is sequential (non-iid)
- Policy changes rapidly with slight changes to Q-values
- Policy may oscillate
- Experience flows from one extreme to another
- Scale of rewards and Q-values is unknown
- Unstable backpropagation due to large gradients
DQN solutions
- Use experience replay
  - Breaks correlations in data
  - Learn from all past policies
  - Using off-policy Q-learning
- Freeze target Q-network
  - Avoid policy oscillations
  - Break correlations between Q-network and target
- Clip rewards and gradients
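Putting the fixes together, here is a minimal PyTorch sketch (my own, not the paper's code) of experience replay plus a frozen target network, with a Huber loss and gradient clipping in the training step; q_net and target_net are assumed to be two copies of the same Q-network, and states are assumed to be stored as tensors (e.g. 4x84x84 frame stacks):

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class ReplayBuffer:
        """Stores transitions; uniform sampling breaks the temporal correlations in the data."""
        def __init__(self, capacity=100_000):
            self.buf = deque(maxlen=capacity)

        def push(self, s, a, r, s2, done):
            self.buf.append((s, a, r, s2, done))

        def sample(self, batch_size):
            s, a, r, s2, d = zip(*random.sample(self.buf, batch_size))
            return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                    torch.stack(s2), torch.tensor(d, dtype=torch.float32))

    def train_step(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
        s, a, r, s2, done = buffer.sample(batch_size)
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a) for the taken actions
        with torch.no_grad():                                    # frozen target network
            target = r + gamma * (1.0 - done) * target_net(s2).max(1).values
        loss = nn.functional.smooth_l1_loss(q, target)           # Huber loss limits error scale
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(q_net.parameters(), 10.0)       # clip gradients
        optimizer.step()
        return loss.item()

    # Every C steps the target network is re-synchronised:
    # target_net.load_state_dict(q_net.state_dict())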
Neon Demo
Links
- Paper: Mnih et al. Human-level control through deep reinforcement learning. Nature, 2015
- Course: David Silver. Advanced Topics: RL
- Tutorial: David Silver. Deep Reinforcement Learning
- Book: Sutton & Barto. Reinforcement Learning: An Introduction
- Source code: simple_dqn
- REINFORCEjs
- The Arcade Learning Environment
