Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Reinforcement Learning (Reloaded)
Day 11 Lecture 1
#DLUPC
http://bit.ly/dlai2018
2
Acknowledgements
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
3
Acknowledgements
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
4
A broader picture of types of learning...
Slide inspired by Alex Graves (Deepmind) at
“Unsupervised Learning Tutorial” @ NeurIPS 2018.
...with a teacher vs. ...without a teacher:
● Active agent: Reinforcement learning (with extrinsic reward) vs. Intrinsic motivation / Exploration.
● Passive agent: Supervised learning vs. Unsupervised learning.
6
Outline
1. Previously in DLAI
7
Architecture
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
8
Training data
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
In RL, training data is obtained by collecting interaction sequences of states, actions and rewards.
A complete episode of T interactions: S1, A1, R2, S2, A2, …, ST
One experience corresponds to a single tuple within an episode: (s, a, r, s’)
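As an illustration, a minimal sketch of collecting such experience tuples, assuming a Gym-style environment with the classic reset()/step() API; the environment name and the random policy are placeholders:

```python
import gym

def collect_episode(env, policy):
    """Run one episode and return the list of (s, a, r, s') experience tuples."""
    experiences = []
    s = env.reset()                          # S1
    done = False
    while not done:
        a = policy(s)                        # At
        s_next, r, done, _ = env.step(a)     # Rt+1, St+1
        experiences.append((s, a, r, s_next))
        s = s_next
    return experiences

# Example: one episode of CartPole with a random policy
env = gym.make("CartPole-v1")
episode = collect_episode(env, policy=lambda s: env.action_space.sample())
```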
9
Markov Decision Process (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
An MDP is defined by the tuple (S, A, R, P, γ). The agent-environment loop:
● The environment samples an initial state s0 ~ p(s0).
● The agent selects an action a.
● The environment samples the next state s’ ~ P(· | s, a) and a reward r ~ R(· | s, a).
10
Markov Decision Process (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
S A R P γ
11
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
MDP: Model (of the Environment)
S A R P γ
12
MDP: Model (of the Environment)
The transition function P(s’, r | s, a) records the probability of transitioning from s to s’ after taking action a, while obtaining a reward r (ℙ is the symbol for probability). From it, the state-transition function is obtained by summing the transition function across all rewards, and the reward function follows as the expected reward (both reconstructed below).
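The three formulas on this slide appear as images; a LaTeX reconstruction based on the definitions described above (notation following Lilian Weng's post):

```latex
% Transition function (joint over next state and reward)
P(s', r \mid s, a) = \mathbb{P}[S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a]

% State-transition function: sum of the transition function across all rewards
P(s' \mid s, a) = \sum_{r \in \mathcal{R}} P(s', r \mid s, a)

% Reward function: expected reward for taking action a in state s
R(s, a) = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]
        = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} P(s', r \mid s, a)
```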
13
MDP: Model (of the Environment)
The Model (of the Environment) is defined by:
● P(s’|s,a): the state-transition function.
● R(s,a): the reward function.
14
Figure: OpenAI Spinning Up
MDP: Model (of the Environment)
15
MDP: Model (of the Environment)
● Model-based RL: R(·| s,a) and P(·| s,a) are known, so an optimal solution can be found with Dynamic Programming. Because of the poor data efficiency of model-free RL, model-based RL is often the only practical option for real-world agents. More details in Sergey Levine’s slides @ Berkeley.
● Model-free RL: there is no prior knowledge of the world. The agent needs to learn all dynamics from scratch, resulting in poor data efficiency, which limits its application to real-world agents and requires sim2real transfer for real deployment.
16
MDP: Model (of the Environment)
Example of sim2real: Dexterity (OpenAI 2018)
17
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The Policy π is a function S ➝ A that specifies which action to take given a state.
● On-policy learning: the agent is trained with a sequence of interactions of the target policy.
● Off-policy learning: the agent is trained with a sequence of interactions obtained from a behaviour policy different from the target policy.
18
MDP: Policy
The Policy π is a function S ➝ A that specifies which action to take in each state.
● Deterministic: π(s) = a
● Stochastic: π(a|s) = ℙπ[A=a | S=s]
19
MDP: Policy
The Policy π is a function S ➝ A that specifies which action to take in each state.
● Discrete actions: categorical distribution over actions.
○ In the deterministic case: take the argmax action.
○ In the stochastic case: sample from the categorical distribution.
● Continuous actions: Gaussian distribution over actions; i.e. the policy generally outputs the mean and std per dimension.
○ In the deterministic case: take the mean.
○ In the stochastic case: sample from the normal distribution (both cases are sketched in code below).
Slide concept: Víctor Campos
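A minimal sketch of both cases in NumPy; the logits, mean and std inputs stand for the outputs of a hypothetical policy network:

```python
import numpy as np

def select_discrete_action(logits, deterministic=False):
    """Categorical policy over discrete actions."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax over logits
    if deterministic:
        return int(np.argmax(probs))                       # take the argmax action
    return int(np.random.choice(len(probs), p=probs))      # sample from the categorical

def select_continuous_action(mean, std, deterministic=False):
    """Gaussian policy over continuous actions (mean and std per dimension)."""
    if deterministic:
        return mean                                        # take the mean
    return np.random.normal(mean, std)                     # sample from N(mean, std)
```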
20
MDP: Return Gt
The future reward, also known as the return Gt, is the total sum of discounted rewards from time t onwards:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σk≥0 γ^k Rt+k+1
...where γ is the discount factor between [0,1].
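A small sketch of computing Gt for every step of an episode from its list of rewards; the discount factor and the reward values below are placeholders:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G1, ..., GT] of one episode, given its rewards [R2, ..., RT+1]."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g      # Gt = Rt+1 + gamma * Gt+1
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # [2.62, 1.8, 2.0]
```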
21
MDP: Value Functions Vπ(s) & Qπ(s,a)
Given a policy π, a value function is the expected return Gt:
● State-value function Vπ(s): the expected return when starting from state s and following policy π.
● Action-value function Qπ(s,a) (more specific): the expected return when starting from state s, taking action a, and then following policy π.
22
MDP: Advantage Function Aπ(s,a)
The Advantage function (A-value) is defined as the difference between the action-value and the state-value, Aπ(s,a) = Qπ(s,a) − Vπ(s).
Aπ(s,a) tells how good/bad an action is with respect to the expectation of returns over all possible actions for a given state.
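The value and advantage formulas on these two slides appear as images; a LaTeX reconstruction of the standard definitions they describe:

```latex
% State-value and action-value functions: expected return under policy \pi
V^{\pi}(s)   = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right]
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s, A_t = a \,\right]

% Advantage: how much better action a is than the policy's average in state s
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
```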
23
MDP: Optimal Value and Policy
The optimal value functions produce the maximum return: V*(s) = maxπ Vπ(s) and Q*(s,a) = maxπ Qπ(s,a).
The optimal policy π* achieves the optimal value functions V*(s) and Q*(s,a).
24
Outline
1. Previously in DLAI
2. Common Approaches
25
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
26
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
Goals of Reinforcement Learning
27
Value vs Policy vs Model
David Silver (Deepmind), “Introduction to Deep Learning” (2015)
Summary of approaches in RL based on whether we want to learn the value, policy,
or the model of the environment.
29
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
Goals of Reinforcement Learning
30
Inferring a Policy from the Value
Given a value function, a policy can be easily defined. For example, the greedy policy:
π(s) = arg max_{a∈A} Q(s,a)
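For instance, a sketch of the greedy policy over a small tabular Q; the Q-table layout (a dict keyed by (state, action)) is only an assumption for illustration:

```python
def greedy_policy(Q, state, actions):
    """Pick the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy example
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
print(greedy_policy(Q, "s0", ["left", "right"]))   # -> "right"
```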
31
Generalized Policy Iteration (GPI)
Generalized Policy Iteration (GPI) algorithms adopt an iterative procedure to improve the policy π:
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
The value function Vπi is repeatedly refined to approach the true value of the current policy, and in the meantime the policy is repeatedly improved towards optimality.
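A tabular sketch of this alternation for a small, known MDP; the P[s][a] = list of (prob, next_state, reward) interface is a hypothetical layout, and policy evaluation uses a fixed number of sweeps:

```python
def policy_iteration(states, actions, P, gamma=0.9, eval_sweeps=50):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: push V towards the true value of the current policy
        for _ in range(eval_sweeps):
            for s in states:
                V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V
```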
32
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
Example: Grid world
● The agent lives in a grid.
● Walls block the agent’s path.
● Big rewards come at the end.
33
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
Example: Grid world with noise = 0.2
● The agent’s actions do not always go as
planned:
○ Action North takes the agent:
■ North (80%)
■ West (10%)
■ East (10%)
○ Similarly for the West, East & South
actions.
34
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
35
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
36
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
37
Generalized Policy Iteration (GPI)
Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
38
GPI: Monte-Carlo Methods (MC)
Monte-Carlo (MC) Methods learn from complete episodes of raw experience and compute the observed mean return as an approximation of the expected return Gt.
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
MC estimates the empirical mean return from multiple complete episodes S1, A1, R2, S2, A2, …, ST. For the state-value function:
V(s) = Σt 𝟙[St=s] · Gt / Σt 𝟙[St=s]
...where 𝟙[St=s] is a binary indicator function.
39
GPI: Monte-Carlo Methods (MC)
The optimal policy π* is learned by iterating as follows (see the code sketch below):
1. Improve the policy greedily with respect to the current Q: π(s) = arg max_{a∈A} Q(s,a).
2. Generate a new complete episode with the updated policy π.
3. Update Q using the new episode.
Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
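A sketch of this loop for an environment with discrete, hashable states, reusing the collect_episode() helper sketched earlier; the small ε-greedy exploration term is an added assumption so that every action keeps being tried:

```python
from collections import defaultdict
import random

def mc_control(env, actions, episodes=1000, gamma=0.99, eps=0.1):
    Q = defaultdict(float)        # Q[(s, a)], running mean of observed returns
    counts = defaultdict(int)

    def policy(s):
        if random.random() < eps:                       # occasional exploration
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])    # greedy w.r.t. current Q

    for _ in range(episodes):
        episode = collect_episode(env, policy)          # list of (s, a, r, s')
        g = 0.0
        for s, a, r, _ in reversed(episode):
            g = r + gamma * g                           # return from this step onwards
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]   # incremental mean
    return Q
```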
40
Figure: OpenAI Spinning Up
Deep Q-learning (DQN)
41
Deep Q-learning (DQN)
Problem: estimating the optimal Q*(s,a) might be feasible for a few state-action pairs with exhaustive search, but it is impossible for large state-action spaces.
Solution: learn a function approximation Q(s,a; θ) with machine learning, e.g. a neural network with parameters θ:
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Q*(s,a) ≈ Q(s,a; θ),  π(s) = arg max_{a∈A} Q(s,a; θ)
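A sketch of the function approximator and the one-step Q-learning target, assuming PyTorch; this is a small fully-connected network, not the convolutional architecture of the Nature DQN paper:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one Q-value per action in a single forward pass."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def q_learning_target(q_net, r, s_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta), with no bootstrap on terminal states."""
    with torch.no_grad():
        next_q = q_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done.float()) * next_q
```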
42
Deep Q-learning (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
A single feedforward pass of Q(s,a; θ) computes the Q-values for all actions from the current state (efficient).
43
Deep Q-learning (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
44
Value vs Policy vs Model
David Silver (Deepmind), “Introduction to Deep Learning” (2015)
Summary of approaches in RL based on whether we want to learn the value, policy,
or the model of the environment.
45
Value vs Policy vs Model
Value
function
Policy π Model
(of the Environment)
Goals of Reinforcement Learning
46
Value vs Policy vs Model
Directly learn the policy by estimating the parameters θ of a stochastic policy function:
π(a|s; θ)
47
Figure: OpenAI Spinning Up
REINFORCE (Vanilla Policy Gradients - VPG)
Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. "Policy gradient methods for reinforcement learning with function approximation." NIPS 2000.
REINFORCE (Vanilla Policy Gradients - VPG)
REINFORCE is the mathematical formulation of ‘trial and error’: try an action, and make it more likely if it resulted in positive reward; otherwise, make it less likely.
Opposite optimization goals:
● Supervised learning: minimize a loss function by gradient descent.
● Reinforcement learning: maximize J(θ), the expected return of the policy, by gradient ascent.
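The objective J(θ) and its gradient estimator appear as images on the following slides; a LaTeX reconstruction of one common form, whose pieces (Monte-Carlo estimate over N sampled trajectories, log-probability gradient, cumulative discounted reward) are annotated next:

```latex
% Expected return of the policy (objective to maximize by gradient ascent)
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]

% REINFORCE gradient estimator from N sampled trajectories (Monte Carlo)
\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
    \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \right)
    \left( \sum_{t=1}^{T} \gamma^{t-1} r_{i,t} \right)
```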
49
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
REINFORCE (Vanilla Policy Gradients - VPG)
50
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
Estimation of the expectation over all
possible trajectories, using N sampled
trajectories (Monte Carlo)
REINFORCE (Vanilla Policy Gradients - VPG)
51
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
Estimation of the expectation over all
possible trajectories, using N sampled
trajectories (Monte Carlo)
The policy gradient. If we follow it, the
action ai,t
will be more likely if the
agent ever finds itself again in state si,t
REINFORCE (Vanilla Policy Gradients - VPG)
52
Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018)
Policy parameters
(weights of the NN)
Estimation of the expectation over all
possible trajectories, using N sampled
trajectories (Monte Carlo)
Cumulative
(discounted) reward
REINFORCE (Vanilla Policy Gradients - VPG)
The policy gradient. If we follow it, the
action ai,t
will be more likely if the
agent ever finds itself again in state si,t
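A sketch of the resulting surrogate loss for a discrete-action policy network, assuming PyTorch and the discounted_returns() helper sketched earlier; minimizing this loss with any optimizer performs gradient ascent on J(θ):

```python
import torch

def reinforce_loss(logits, actions, returns):
    """logits: (T, n_actions) policy outputs, actions: (T,) long, returns: (T,) float."""
    log_probs = torch.log_softmax(logits, dim=-1)                   # log pi(a|s) for all a
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)    # log pi(a_t|s_t)
    return -(taken * returns).mean()                                # negative surrogate
```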
53
Learn more
Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning,
Berkeley.
Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
54
Learn more
David Silver, UCL COMP050, Reinforcement Learning
55
Learn more
OpenAI Spinning Up in Deep RL
56
A taxonomy of RL Algorithms
Figure: OpenAI Spinning Up
57
Final Questions