
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.



  1. 1. Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Technical University of Catalonia Reinforcement Learning: MDP & DQN Day 5 Lecture 2 #DLUPC http://bit.ly/dlai2018
  2. 2. 2 Acknowledgements Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  3. 3. 3 Acknowledgements Víctor Campos victor.campos@bsc.es PhD Candidate Barcelona Supercomputing Center Míriam Bellver miriam.bellver@bsc.edu PhD Candidate Barcelona Supercomputing Center
  4. 4. 4 Video lecture Xavier Giró, DLAI 2017
  5. 5. 5 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  6. 6. 6 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  7. 7. Types of machine learning Yann LeCun’s Black Forest cake 7
  8. 8. 8 Types of machine learning We can categorize three types of learning procedures: 1. Supervised Learning: 𝐲 = ƒ(𝐱), predict label y corresponding to observation x. 2. Unsupervised Learning: ƒ(𝐱), estimate the distribution of observation x. 3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱), predict action y based on observation x, to maximize a future reward z.
  9. 9. 9 Types of machine learning We can categorize three types of learning procedures: 1. Supervised Learning: 𝐲 = ƒ(𝐱) 2. Unsupervised Learning: ƒ(𝐱) 3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱), to maximize a future reward 𝐳
  10. 10. 10 Rich Sutton, “Temporal-Difference Learning” University of Alberta (2017)
  11. 11. 11 Motivation What is Reinforcement Learning (RL) ? “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96] Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of artificial intelligence research 4 (1996): 237-285.
  12. 12. 12 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  13. 13. 13 Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
  14. 14. 14 Architecture Figure: UCL Course on RL by David Silver
  15. 15. 15 Architecture Figure: UCL Course on RL by David Silver Environment
  16. 16. 16 Architecture Figure: UCL Course on RL by David Silver Environment state (st )
  17. 17. 17 Architecture Figure: UCL Course on RL by David Silver Environment state (st )
  18. 18. 18 Architecture Figure: UCL Course on RL by David Silver Environment Agent state (st )
  19. 19. 19 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) state (st )
  20. 20. 20 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st )
  21. 21. 21 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st ) The reward is given to the agent with a delay with respect to previous states and actions!
  22. 22. 22 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 )
  23. 23. 23 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) GOAL: Reach the highest score possible at the end of the game.
  24. 24. 24 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) GOAL: Learn how to take actions to maximize cumulative reward
  25. 25. 25 Architecture Multiple problems can be formulated with an RL architecture. Cart-Pole Problem Objective: Balance a pole on top of a movable cart Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  26. 26. 26 Architecture Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied on the cart. Reward: 1 at each time step if the pole is upright.
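
The agent-environment loop of these slides maps directly onto the OpenAI Gym interface mentioned later under RL Frameworks. As a minimal illustration only (assuming the classic Gym API, where env.step() returns observation, reward, done, info), a random agent on CartPole-v0 looks like this:

import gym

# Cart-Pole: state is [position, horizontal velocity, angle, angular speed],
# action pushes the cart left or right, reward is +1 per step the pole stays upright.
env = gym.make("CartPole-v0")

state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()              # placeholder agent: random action
    state, reward, done, info = env.step(action)    # environment returns next state and reward
    total_reward += reward

print("Episode return:", total_reward)
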
  27. 27. 27 Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page] Multiple problems can be formulated with an RL architecture. Robot Locomotion Objective: Make the robot move forward
  28. 28. 28 Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page] Environment Agent action (At ) reward (rt ) state (st ) State: angle and position of the joints. Action: torques applied on the joints. Reward: 1 at each time step for being upright + moving forward.
  29. 29. 29 Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
  30. 30. 30 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  31. 31. 31 Markov Decision Process (MDP) Markov Decision Processes provide a formalism for reinforcement learning problems. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Markov property: the current state completely characterises the state of the world.
  32. 32. 32 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. An MDP is defined by the tuple (S, A, R, P, γ): set of states S, set of actions A, reward distribution R, transition probabilities P, and discount factor γ.
  33. 33. 33 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Environment samples initial state s0 ~ p(s0 ) Agent selects action at Environment samples next state st+1 ~ P ( . | st , at ) Environment samples reward rt ~ R( . | st , at ) reward (rt ) state (st ) action (at )
  34. 34. 34 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Agent selects action at following policy π. A Policy π is a function S ➝ A that specifies which action to take in each state.
  35. 35. 35 MDP: Policy A Policy π is a function S ➝ A that specifies which action to take in each state. The agent selects action at following policy π. Agent GOAL: Learn how to take actions to maximize reward. MDP GOAL: Find the policy π* that maximizes the cumulative discounted reward.
  36. 36. 36 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Multiple problems can be formulated with an RL architecture. Grid World (a simple MDP) Objective: reach one of the terminal states (greyed out) in the least number of actions.
  37. 37. 37 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) Each cell is a state. A negative “reward” (penalty) is given for each transition: rt = r = -1
  38. 38. 38 MDP: Policy: Random Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Example: Actions resulting from applying a random policy on this Grid World problem.
  39. 39. 39 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exercise: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
  40. 40. 40 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Solution: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
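
As a toy illustration of this Grid World, the sketch below rolls out a random policy and accumulates the per-step penalty. The exact layout (a 4x4 grid with terminal states in two opposite corners) is an assumption made for the example, not necessarily the one drawn in the original figure.

import random

# Toy Grid World: 4x4 grid, two terminal corner states, reward -1 per move (assumed layout).
N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition: move if the target cell is inside the grid, otherwise stay."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), -1          # every transition costs -1
    return (r, c), -1

def random_policy(state):
    return random.choice(list(ACTIONS))

state, total_reward = (2, 1), 0
while state not in TERMINAL:
    action = random_policy(state)
    state, reward = step(state, action)
    total_reward += reward

print("Random-policy return:", total_reward)   # an optimal policy reaches a terminal state in far fewer steps
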
  41. 41. 41 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, action...) ? GOAL: Find policy π* that maximizes the cumulative discounted reward: Environment samples initial state s0 ~ p(s0 ) Agent selects action at ~π (.|st ) Environment samples next state st+1 ~ P ( .| st , at ) Environment samples reward rt ~ R(. | st ,at ) reward (rt ) state (st ) action (at )
  42. 42. 42 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, actions)? GOAL: Find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* maximizes the expected sum of rewards, where the expectation is taken over the initial state, the action selected at each step t, and the state sampled for step t+1.
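
The equation image does not survive in this transcript; following the CS231n formulation credited on the slide, the objective reads (in LaTeX):

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, \pi\right], \quad \text{with } s_{0} \sim p(s_{0}),\; a_{t} \sim \pi(\cdot \mid s_{t}),\; s_{t+1} \sim P(\cdot \mid s_{t}, a_{t})
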
  43. 43. 43 MDP: Value Function Vπ (s) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good state s is for a given policy π? With the value function at state s, Vπ (s): the expected cumulative reward from following policy π starting from state s.
  44. 44. 44 MDP: Q-Value Function Qπ (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good a state-action pair (s,a) is for a given policy π? With the Q-value function at state s and action a, Qπ (s,a): the expected cumulative reward from taking action a in state s and then following policy π.
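
The formulas are again lost in the transcript; the standard definitions matching the wording of these two slides are:

V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, \pi\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, a_{0} = a, \pi\right]
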
  45. 45. 45 MDP: Optimal Q-Value Function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The optimal Q-value function at state s and action a, Q* (s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair, i.e. the one obtained by choosing the policy that maximizes the expected cumulative reward.
  46. 46. 46 MDP: Optimal Q-Value Function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q* (s,a) satisfies the following Bellman equation: the maximum expected cumulative reward for the considered pair (s,a) equals the reward for (s,a) plus the discounted (by the discount factor) maximum expected cumulative reward for the future pair (s’,a’), taken in expectation over the possible future states s’ (randomness).
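
In symbols (a reconstruction of the lost equation, in the notation of Mnih et al. 2013):

Q^{*}(s,a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
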
  47. 47. 47 MDP: Bellman Equation Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q* (s,a) satisfies the Bellman equation above. GOAL: Find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* corresponds to taking the best action in any state according to Q*: in each state, select the action that maximizes the expected cumulative reward.
  48. 48. 48 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Value iteration algorithm: use the Bellman equation as an iterative update, computing an updated Q-value function from the current Q-values of the future pairs (s’,a’). The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
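
A minimal sketch of this value iteration update on a tiny tabular MDP; the MDP itself (its transition tensor P and reward tensor R) is made up purely for illustration:

import numpy as np

# Tabular Q-value iteration:
# Q_{i+1}(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * max_{a'} Q_i(s',a'))
n_states, n_actions, gamma = 3, 2, 0.9

P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] transition probabilities (illustrative)
P[0, 0, 1] = 1.0; P[0, 1, 2] = 1.0
P[1, 0, 2] = 1.0; P[1, 1, 0] = 1.0
P[2, :, 2] = 1.0                                # state 2 is absorbing
R = np.zeros((n_states, n_actions, n_states))   # R[s, a, s'] rewards (illustrative)
R[0, 1, 2] = 1.0; R[1, 0, 2] = 5.0

Q = np.zeros((n_states, n_actions))
for i in range(200):                            # iterate until (approximate) convergence
    Q_new = np.einsum("san,san->sa", P, R + gamma * Q.max(axis=1))
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

print("Q* estimate:\n", Q)
print("Greedy (optimal) policy:", Q.argmax(axis=1))
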
  49. 49. 49 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exploring all possible states and actions is not scalable by itself, let alone iteratively. E.g. for a video game, it would require generating all possible pixel configurations and actions… as many times as the iterations needed to estimate Q*(s,a). The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
  50. 50. 50 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exploring all possible states and actions is not scalable. E.g. for a video game, it would require generating all possible pixel configurations and actions. Solution: use a deep neural network as a function approximator of Q*(s,a): Q(s,a,Ө) ≈ Q*(s,a), where Ө are the neural network parameters.
  51. 51. 51 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  52. 52. 52 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a)
  53. 53. 53 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a). Forward pass loss function: sample a (s,a) pair and a future state s’; compare the Q-value predicted with Өi against the target computed by predicting the Q-value of s’ with the previous parameters Өi-1.
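
Written out (a reconstruction following Mnih et al. 2013, with θ standing for the slides’ Ө):

y_{i} = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]
\qquad
L_{i}(\theta_{i}) = \mathbb{E}_{s,a}\!\left[ \left( y_{i} - Q(s, a; \theta_{i}) \right)^{2} \right]
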
  54. 54. 54 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Train the DNN to approximate a Q-value function that satisfies the Bellman equation
  55. 55. 55 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Must compute reward during training
  56. 56. 56 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Forward pass: compute the loss function. Backward pass: gradient update of the loss with respect to the Q-function parameters Ө.
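
As an illustrative sketch only (PyTorch is an assumption here; the frameworks section of the lecture uses Keras/TensorFlow), one forward/backward training step could look as follows, with q_net holding the current parameters Өi and target_net the older parameters Өi-1; the batch tensors are placeholders sampled from experience:

import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    # states, next_states: (B, state_dim); actions: (B,) long; rewards, dones: (B,) float (dones is a 0/1 mask)
    states, actions, rewards, next_states, dones = batch

    # Forward pass: Q(s, a; theta_i) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y = r + gamma * max_a' Q(s', a'; theta_{i-1}); no gradient flows through the target
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)

    loss = F.mse_loss(q_sa, y)

    # Backward pass: gradient of the loss w.r.t. the Q-function parameters theta_i
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
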
  57. 57. 57 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(s,a,Ө) ≈ Q*(s,a)
  58. 58. 58 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(s,a,Ө) ≈ Q*(s,a). Efficiency: a single feedforward pass computes the Q-values for all actions from the current state.
  59. 59. 59 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533. Number of actions between 4 and 18, depending on the Atari game.
  60. 60. 60 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(st , ⬅), Q(st , ➡), Q(st , ⬆), Q(st ,⬇ )
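
A sketch of that network shape (again in PyTorch, which is an assumption), following the architecture reported in Mnih et al. (2015): a stack of 4 preprocessed 84x84 frames goes in, and one Q-value per action comes out of a single forward pass:

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Nature-DQN style network: conv feature extractor + fully connected head with |A| outputs."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # Q(s_t, a) for every action a
        )

    def forward(self, x):                        # x: (batch, 4, 84, 84)
        return self.head(self.features(x))

q_values = DQN(n_actions=4)(torch.zeros(1, 4, 84, 84))
print(q_values.shape)                            # torch.Size([1, 4]): Q(st, ⬅), Q(st, ➡), Q(st, ⬆), Q(st, ⬇)
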
  61. 61. 61 Deep Q-learning: Demo Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”
  62. 62. 62 Deep Q-learning: Demo
  63. 63. 63 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  64. 64. 64 RL Frameworks OpenAI Gym + keras-rl. keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently on either CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box. Slide credit: Míriam Bellver
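
A condensed version of the CartPole example from the keras-rl repository; module paths and argument names may differ across keras-rl and Keras versions, so treat it as a sketch rather than a definitive recipe:

import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make("CartPole-v0")
nb_actions = env.action_space.n

# Small fully connected Q-network: state in, one Q-value per action out
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(16, activation="relu"),
    Dense(16, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=SequentialMemory(limit=50000, window_length=1),
               nb_steps_warmup=100, target_model_update=1e-2,
               policy=EpsGreedyQPolicy())
dqn.compile(Adam(lr=1e-3), metrics=["mae"])

dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)   # train on the Gym environment
dqn.test(env, nb_episodes=5, visualize=False)              # evaluate the learned policy
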
  65. 65. 65 RL Frameworks OpenAI Universe environment
  66. 66. 66 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  67. 67. 67 Learn more Deep Learning TV, “Reinforcement learning - Ep. 30” Siraj Raval, Deep Q Learning for Video Games
  68. 68. 68 Learn more David Silver, UCL COMP050, Reinforcement Learning
  69. 69. 69 Learn more Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley. Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
  70. 70. 70 Learn more Nando de Freitas, “Machine Learning” (University of Oxford)
  71. 71. 71 Homework Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489
  72. 72. 72 Homework (exam material !!) Greg Kohs, “AlphaGo” (2017). [@ Netflix]
  73. 73. 73 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  74. 74. 74 Final Questions
