DEEP REINFORCEMENT LEARNING
ON THE ROAD TO SKYNET!
UW CSE Deep Learning – Felix Leeb
OVERVIEW
TODAY
→ MDPs – formalizing decisions
→ Function Approximation
→ Value Function – DQN
→ Policy Gradients – REINFORCE, NPG
→ Actor Critic – A3C, DDPG
NEXT TIME
→ Model Based RL – forward/inverse
→ Planning – MCTS, MPPI
→ Imitation Learning – DAgger, GAIL
→ Advanced Topics – Exploration, MARL, Meta-learning, LMDPs…
Paradigms compared by Objective and Applications:
Supervised Learning → Classification, Regression
Unsupervised Learning → Inference, Generation
Reinforcement Learning → Prediction, Control
[Figure: Prediction maps input 𝑥 to output 𝑦; Control maps observation 𝑥 to control 𝑢]
SETTING
[Figure: agent–environment loop – the agent selects an action using its policy; the environment returns the next state/observation and a reward]
MARKOV PROCESSES
State space
Action space
Transition function
Reward function
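In standard notation (my own summary of the components above, not copied from the slide), a Markov decision process is the tuple
$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, with transition function $P(s_{t+1} \mid s_t, a_t)$ and reward function $r_t = R(s_t, a_t)$.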
DISCOUNT FACTOR
→ We want to be greedy but not impulsive
→ Implicitly takes uncertainty in the dynamics into account
→ Mathematically: γ < 1 allows infinite-horizon returns
Return:
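The discounted return the label refers to, written out in the standard form (my notation):
$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}, \qquad 0 \le \gamma < 1.$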
SOLVING AN MDP
Goal:
Objective:
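A standard way to write the goal and objective (an assumption about what the slide's formulas showed): find the policy that maximizes expected discounted return,
$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right].$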
VALUE FUNCTIONS
→ Value = expected return from a state
→ Q function – action-specific value function
→ Advantage function – how much more valuable an action is than the state's value
→ Value depends on future rewards → depends on the policy
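Standard definitions of the three quantities above (my notation):
$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s\right], \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s, a_t = a\right], \qquad A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).$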
TABULAR SOLUTION: POLICY ITERATION
Policy Evaluation
Policy Update
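A minimal tabular policy-iteration sketch in Python, assuming the MDP is given as a transition tensor P with shape (S, A, S) and a reward matrix R with shape (S, A); these inputs and names are illustrative, not from the slide.

import numpy as np

def policy_iteration(P, R, gamma=0.99):
    # P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A)
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V
        P_pi = P[np.arange(S), pi]           # (S, S) transitions under the current policy
        R_pi = R[np.arange(S), pi]           # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy update: act greedily with respect to the one-step lookahead
        Q = R + gamma * P @ V                # (S, A) state-action values
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # converged: policy is stable
            return pi, V
        pi = new_pi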
Q LEARNING
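The tabular Q-learning update this slide covers, in its standard form (my notation); note that it needs only sampled transitions, not the transition function itself:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right].$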
FUNCTION APPROXIMATION
Model:
Training data:
Loss function:
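A standard way to fill in the three slots above for Q-learning with function approximation (my reconstruction, consistent with the DQN loss on the following slides): the model is a parameterized $Q_\theta(s,a)$, the training data are off-policy transitions $(s, a, r, s')$, and the loss is
$L(\theta) = \big(y - Q_\theta(s,a)\big)^2, \quad \text{where } y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$
and $\theta^-$ are the parameters of a separate, periodically updated copy of the network.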
IMPLEMENTATION
Action-in vs. action-out architectures (the network either takes the action as an input, or outputs one value per action)
Off-Policy Learning
→ The target depends in part on our model → old observations are still useful
→ Use a replay buffer of the most recent transitions as the dataset (sketched below)
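A minimal replay-buffer sketch in Python (capacity and batch size are illustrative defaults, not from the slide):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # keeps only the most recent transitions

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniformly sample a minibatch of stored transitions for an off-policy update
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)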
DEEP Q NETWORKS (DQN)
Mnih et al. (2015)
DQN ISSUES
→ Convergence is not guaranteed – hope for deep magic!
Stabilization tricks: replay buffer, error clipping, reward scaling, using replicas (separate target and training networks)
→ Double Q-Learning – decouple action selection from value estimation
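The double-Q idea in equation form (standard formulation, assuming a separate target network with parameters $\theta^-$): the online network selects the action, the target network evaluates it,
$y = r + \gamma \, Q_{\theta^-}\!\big(s', \arg\max_{a'} Q_{\theta}(s', a')\big).$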
POLICY GRADIENTS
→ Parameterize the policy and update those parameters directly
→ Enables new kinds of policies: stochastic, continuous action spaces
→ On-policy learning → learn directly from your actions
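In symbols (standard notation): the policy $\pi_\theta(a \mid s)$ is parameterized by $\theta$, and we directly optimize
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \gamma^{t} r_t\right].$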
POLICY GRADIENTS
→ Approximate the expectation value from samples
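The likelihood-ratio (score-function) estimator behind this: the gradient of the objective is itself an expectation, so it can be approximated with sampled trajectories,
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right] \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\theta \log \pi_\theta\!\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}.$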
REINFORCE
Sutton et al. (2000)
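A minimal REINFORCE update sketch in Python/PyTorch (the policy network, optimizer, and data-collection loop are assumed scaffolding, not from the slide):

import torch

def reinforce_update(optimizer, trajectory, gamma=0.99):
    # trajectory: list of (log_prob, reward) pairs from one rollout of the current policy,
    # where each log_prob is the differentiable log pi_theta(a_t | s_t)
    returns, G = [], 0.0
    for _, r in reversed(trajectory):
        G = r + gamma * G                      # discounted return-to-go
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    log_probs = torch.stack([lp for lp, _ in trajectory])
    loss = -(log_probs * returns).sum()        # ascend E[ log pi(a|s) * G ]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()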
VARIANCE REDUCTION
→ Constant offsets make it harder to identify the right update direction
→ Remove the offset → subtract the a priori value of each state (a baseline)
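With a baseline $b(s)$, typically an estimate of the state value, the estimator keeps the same expectation but has lower variance:
$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right], \qquad b(s) \approx V^{\pi}(s).$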
ADVANCED POLICY GRADIENT METHODS
→ For stochastic functions, the gradient is not the best direction
→ Consider the KL divergence between the old and new policy
NPG → approximating the Fisher information matrix
TRPO → computing gradients with a KL constraint
PPO → gradients with a KL penalty
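In equations (standard formulations, using old-policy parameters $\theta_{\text{old}}$ and advantage estimates $A$):
NPG: $\theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta J(\theta)$, with Fisher matrix $F = \mathbb{E}\big[\nabla_\theta \log \pi_\theta \,\nabla_\theta \log \pi_\theta^{\top}\big]$
TRPO: $\max_\theta \; \mathbb{E}\!\left[\tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} A(s,a)\right]$ subject to $\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big)\big] \le \delta$
PPO (penalty form): $\max_\theta \; \mathbb{E}\!\left[\tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} A(s,a) - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big)\right]$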
ADVANCED POLICY GRADIENT METHODS
Rajeswaran et al. (2017) Heess et al. (2017)
ACTOR CRITIC
Critic – estimates the advantage, using a Q-learning update
Actor – proposes actions, using a policy-gradient update
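A generic one-step actor-critic update (standard form, my notation; the slide's exact update may differ): the critic fits the value function by minimizing the TD error, and the actor follows the policy gradient weighted by that TD error as an advantage estimate,
$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
$w \leftarrow w + \alpha_w \,\delta_t\, \nabla_w V_w(s_t)$ (critic)
$\theta \leftarrow \theta + \alpha_\theta \,\delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (actor)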
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
Mnih et al. (2016)
DDPG
Max Ferguson (2017)
→ Off-policy learning – using deterministic policy gradients
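The deterministic policy gradient that DDPG uses, in standard notation: with deterministic actor $\mu_\theta$ and critic $Q_\phi$ trained off-policy from a replay buffer,
$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \,\nabla_\theta \mu_\theta(s)\right].$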
Editor's Notes

  • #4 In reinforcement learning the goal is to learn a policy, which gives us the action given the state.
  • #5 Prediction: finding the likely output given the input. Control is a little different: now we have to find a control given only the observation (and a reward signal). So can we use any of the tricks we learned for prediction in control?
  • #7 Markov – only the previous state matters. Decision – the agent takes actions, and those decisions have consequences. Process – there is some transition function. The transition function is sometimes called the dynamics of the system. The reward function can in general depend on both the state and the action, but often it is only related to the state. Goal: maximize overall reward.
  • #12 Without knowing the transition function.
  • #13 Allows continuous state spaces. Where's the "deep" in "deep RL"? What's with this other Q? At the beginning of training our Q function will be really bad, so the updates will be bad, but each update is moving in the right direction, so overall we're moving in the right direction. Taking the derivative of the loss with respect to Q gives you the Q-learning update, which shows that the MSE loss over the parameters is equivalent to the tabular setting of updating the Q values.
  • #14 Data: off-policy data.
  • #16 Clipping errors. Scaling rewards. Use a replay buffer – prioritizing recent actions. Double Q-Learning – using separate target and training Q networks. Sample complexity is not great – training a deep CNN through RL. Continuous action spaces are essentially impossible. This is all really annoying.
  • #17 Do we have to bother with a value function? On-policy learning – learn directly from actions. Any model that can be trained could be a policy: this allows continuous action spaces and learning a stochastic policy.
  • #18 Note: we're going to be a little hand-wavy with the notation. Essentially importance sampling. No guarantee of finding a global optimum.
  • #20 Use a baseline to reduce variance – the value function's estimate of the return. It turns out standard gradient descent is not necessarily the direction of steepest descent for stochastic function optimization – consider natural gradients.
  • #21 Natural Policy Gradients – use the Fisher information matrix to choose the gradient. TRPO – adjust the gradient subject to a KL-divergence constraint. PPO – take a step directly related to the KL divergence.
  • #22 Natural Policy Gradients – use the Fisher information matrix to choose the gradient. TRPO – adjust the gradient subject to a KL-divergence constraint. PPO – take a step directly related to the KL divergence.
  • #23 Get the convergence of policy gradients and the sample complexity of Q-learning.
  • #24 Async – parallelizes updates. Uses the advantage function. REINFORCE updates to the policy.
  • #25 Continuous control. Replay buffer. EMA between target and training networks for stability. Eps-greedy exploration. Batch normalization.