DEEP REINFORCEMENT LEARNING
ON THE ROAD TO SKYNET!
UW CSE Deep Learning – Felix Leeb
OVERVIEW
TODAY
→ MDPs – formalizing decisions
→ Function Approximation
→ Value Function – DQN
→ Policy Gradients – REINFORCE, NPG
→ Actor Critic – A3C, DDPG
NEXT TIME
→ Model Based RL – forward/inverse
→ Planning – MCTS, MPPI
→ Imitation Learning – DAgger, GAIL
→ Advanced Topics – Exploration, MARL, Meta-learning, LMDPs…
Paradigms compared by Objective and Applications:
Supervised Learning → Classification, Regression
Unsupervised Learning → Inference, Generation
Reinforcement Learning → Prediction, Control
[Figure: Prediction maps input 𝑥 to output 𝑦; Control maps observation 𝑥 to control 𝑢]
SETTING
[Figure: agent–environment loop – the agent selects an action using its policy; the environment returns the next state/observation and a reward]
MARKOV PROCESSES
State space
Action space
Transition function
Reward function
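In standard notation (my own summary of the components above, not copied from the slide), a Markov decision process is the tuple
$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, with transition function $P(s_{t+1} \mid s_t, a_t)$ and reward function $r_t = R(s_t, a_t)$.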
DISCOUNT FACTOR
→ We want to be greedy but not impulsive
→ Implicitly takes uncertainty in the dynamics into account
→ Mathematically: γ < 1 allows infinite-horizon returns
Return:
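The discounted return the label refers to, written out in the standard form (my notation):
$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}, \qquad 0 \le \gamma < 1.$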
SOLVING AN MDP
Goal:
Objective:
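A standard way to write the goal and objective (an assumption about what the slide's formulas showed): find the policy that maximizes expected discounted return,
$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right].$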
VALUE FUNCTIONS
→ Value = expected return from a state
→ Q function – action-specific value function
→ Advantage function – how much more valuable an action is than the state's value
→ Value depends on future rewards → depends on the policy
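Standard definitions of the three quantities above (my notation):
$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s\right], \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s, a_t = a\right], \qquad A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).$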
TABULAR SOLUTION: POLICY ITERATION
Policy Evaluation
Policy Update
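A minimal tabular policy-iteration sketch in Python, assuming the MDP is given as a transition tensor P with shape (S, A, S) and a reward matrix R with shape (S, A); these inputs and names are illustrative, not from the slide.

import numpy as np

def policy_iteration(P, R, gamma=0.99):
    # P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A)
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V
        P_pi = P[np.arange(S), pi]           # (S, S) transitions under the current policy
        R_pi = R[np.arange(S), pi]           # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy update: act greedily with respect to the one-step lookahead
        Q = R + gamma * P @ V                # (S, A) state-action values
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # converged: policy is stable
            return pi, V
        pi = new_pi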
Q LEARNING
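The tabular Q-learning update this slide covers, in its standard form (my notation); note that it needs only sampled transitions, not the transition function itself:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right].$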
FUNCTION APPROXIMATION
Model:
Training data:
Loss function:
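A standard way to fill in the three slots above for Q-learning with function approximation (my reconstruction, consistent with the DQN loss on the following slides): the model is a parameterized $Q_\theta(s,a)$, the training data are off-policy transitions $(s, a, r, s')$, and the loss is
$L(\theta) = \big(y - Q_\theta(s,a)\big)^2, \quad \text{where } y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$
and $\theta^-$ are the parameters of a separate, periodically updated copy of the network.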
IMPLEMENTATION
Action-in vs. action-out architectures (the network either takes the action as an input, or outputs one value per action)
Off-Policy Learning
→ The target depends in part on our model → old observations are still useful
→ Use a replay buffer of the most recent transitions as the dataset (sketched below)
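A minimal replay-buffer sketch in Python (capacity and batch size are illustrative defaults, not from the slide):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # keeps only the most recent transitions

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniformly sample a minibatch of stored transitions for an off-policy update
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)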
DEEP Q NETWORKS (DQN)
Mnih et al. (2015)
DQN ISSUES
→ Convergence is not guaranteed – hope for deep magic!
Stabilization tricks: replay buffer, error clipping, reward scaling, using replicas (separate target and training networks)
→ Double Q-Learning – decouple action selection from value estimation
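The double-Q idea in equation form (standard formulation, assuming a separate target network with parameters $\theta^-$): the online network selects the action, the target network evaluates it,
$y = r + \gamma \, Q_{\theta^-}\!\big(s', \arg\max_{a'} Q_{\theta}(s', a')\big).$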
POLICY GRADIENTS
→ Parameterize the policy and update those parameters directly
→ Enables new kinds of policies: stochastic, continuous action spaces
→ On-policy learning → learn directly from your actions
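In symbols (standard notation): the policy $\pi_\theta(a \mid s)$ is parameterized by $\theta$, and we directly optimize
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \gamma^{t} r_t\right].$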
POLICY GRADIENTS
→ Approximate the expectation value from samples
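The likelihood-ratio (score-function) estimator behind this: the gradient of the objective is itself an expectation, so it can be approximated with sampled trajectories,
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right] \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\theta \log \pi_\theta\!\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}.$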
REINFORCE
Sutton et al. (2000)
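A minimal REINFORCE update sketch in Python/PyTorch (the policy network, optimizer, and data-collection loop are assumed scaffolding, not from the slide):

import torch

def reinforce_update(optimizer, trajectory, gamma=0.99):
    # trajectory: list of (log_prob, reward) pairs from one rollout of the current policy,
    # where each log_prob is the differentiable log pi_theta(a_t | s_t)
    returns, G = [], 0.0
    for _, r in reversed(trajectory):
        G = r + gamma * G                      # discounted return-to-go
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    log_probs = torch.stack([lp for lp, _ in trajectory])
    loss = -(log_probs * returns).sum()        # ascend E[ log pi(a|s) * G ]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()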
VARIANCE REDUCTION
→ Constant offsets make it harder to identify the right update direction
→ Remove the offset → subtract the a priori value of each state (a baseline)
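With a baseline $b(s)$, typically an estimate of the state value, the estimator keeps the same expectation but has lower variance:
$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right], \qquad b(s) \approx V^{\pi}(s).$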
ADVANCED POLICY GRADIENT METHODS
→ For stochastic functions, the gradient is not the best direction
→ Consider the KL divergence between the old and new policy
NPG → approximating the Fisher information matrix
TRPO → computing gradients with a KL constraint
PPO → gradients with a KL penalty
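In equations (standard formulations, using old-policy parameters $\theta_{\text{old}}$ and advantage estimates $A$):
NPG: $\theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta J(\theta)$, with Fisher matrix $F = \mathbb{E}\big[\nabla_\theta \log \pi_\theta \,\nabla_\theta \log \pi_\theta^{\top}\big]$
TRPO: $\max_\theta \; \mathbb{E}\!\left[\tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} A(s,a)\right]$ subject to $\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big)\big] \le \delta$
PPO (penalty form): $\max_\theta \; \mathbb{E}\!\left[\tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} A(s,a) - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big)\right]$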
ADVANCED POLICY GRADIENT METHODS
Rajeswaran et al. (2017) Heess et al. (2017)
ACTOR CRITIC
Critic – estimates the advantage, using a Q-learning update
Actor – proposes actions, using a policy-gradient update
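A generic one-step actor-critic update (standard form, my notation; the slide's exact update may differ): the critic fits the value function by minimizing the TD error, and the actor follows the policy gradient weighted by that TD error as an advantage estimate,
$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
$w \leftarrow w + \alpha_w \,\delta_t\, \nabla_w V_w(s_t)$ (critic)
$\theta \leftarrow \theta + \alpha_\theta \,\delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (actor)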
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
Mnih et al. (2016)
DDPG
Max Ferguson (2017)
→ Off-policy learning – using deterministic policy gradients
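The deterministic policy gradient that DDPG uses, in standard notation: with deterministic actor $\mu_\theta$ and critic $Q_\phi$ trained off-policy from a replay buffer,
$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \,\nabla_\theta \mu_\theta(s)\right].$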
Editor's Notes

  • #4 In reinforcement learning the goal is to learn a policy, which gives us the action given the state.
  • #5 Prediction: finding the likely output given the input. Control is a little different: now we have to find a control given only the observation (and a reward signal). So can we use any of the tricks we learned for prediction in control?
  • #7 Markov – only the previous state matters. Decision – the agent takes actions, and those decisions have consequences. Process – there is some transition function. The transition function is sometimes called the dynamics of the system. The reward function can in general depend on both the state and the action, but often it is only related to the state. Goal: maximize overall reward.
  • #12 Without knowing the transition function.
  • #13 Allows continuous state spaces. Where's the "deep" in "deep RL"? What's with this other Q? At the beginning of training our Q function will be really bad, so the updates will be bad, but each update is moving in the right direction, so overall we're moving in the right direction. Taking the derivative of the loss with respect to Q gives you the Q-learning update, which shows that the MSE loss over the parameters is equivalent to the tabular setting of updating the Q values.
  • #14 Data: off-policy data.
  • #16 Clipping errors. Scaling rewards. Use a replay buffer – prioritizing recent actions. Double Q-Learning – using separate target and training Q networks. Sample complexity is not great – training a deep CNN through RL. Continuous action spaces are essentially impossible. This is all really annoying.
  • #17 Do we have to bother with a value function? On-policy learning – learn directly from actions. Any model that can be trained could be a policy: this allows continuous action spaces and learning a stochastic policy.
  • #18 Note: we're going to be a little hand-wavy with the notation. Essentially importance sampling. No guarantee of finding a global optimum.
  • #20 Use a baseline to reduce variance – the value function's estimate of the return. It turns out standard gradient descent is not necessarily the direction of steepest descent for stochastic function optimization – consider natural gradients.
  • #21 Natural Policy Gradients – use the Fisher information matrix to choose the gradient. TRPO – adjust the gradient subject to a KL-divergence constraint. PPO – take a step directly related to the KL divergence.
  • #22 Natural Policy Gradients – use the Fisher information matrix to choose the gradient. TRPO – adjust the gradient subject to a KL-divergence constraint. PPO – take a step directly related to the KL divergence.
  • #23 Get the convergence of policy gradients and the sample complexity of Q-learning.
  • #24 Async – parallelizes updates. Uses the advantage function. REINFORCE updates to the policy.
  • #25 Continuous control. Replay buffer. EMA between target and training networks for stability. Eps-greedy exploration. Batch normalization.