Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions
Reinforcement Learning
June 30, 2020
Reinforcement Learning (RL) is a growing subset of Machine Learning and one of
the most important frontiers of Artificial Intelligence.
It denotes a set of algorithms that deal with sequential decision-making.
A Reinforcement Learning algorithm can be described as a model that tells an
agent which actions it should take within a closed environment in order to
maximize a predefined overall reward.
The agent tries different sequences of actions and evaluates the total return obtained.
After many trials, the algorithm learns which actions give a greater reward and
establishes a pattern of behavior.
Thanks to this, it is able to tell the agent which action to take in every situation.
The goal of Reinforcement Learning is to capture higher-level logic and to use more
adaptable algorithms than classical Machine Learning.
Reinforcement Learning applications
Robotics - Solution of high-dimensional control problems.
Text mining - Production of highly readable summaries of long texts.
Trade execution - Optimization of trading.
Healthcare - Medication dosing, optimization of treatment.
Games - Solution of different games and achievement of superhuman performance.
Reinforcement Learning actors
Reinforcement Learning algorithms are based on the Markov Decision Process (MDP) framework.
Agent: an entity that performs actions in an environment in order to optimize a
long-term reward;
Environment (e): the scenario that the agent has to face;
Set of states (S): the set of all possible states s of the environment;
Set of actions (A): the set of all possible actions a that can be performed by
the agent;
State transition model $P(s' \mid s, a)$: the probability that the environment
state changes from $s$ to $s'$ when action $a$ is taken;
Reward ($r = R(s, a)$): a function that gives the immediate real-valued
reward for taking action $a$ in state $s$.
Reinforcement Learning actors
Episode (rollout): a sequence of states $s_t$ and actions $a_t$ for $t$ ranging from 0
to the horizon:
The agent starts in a given state of its environment, $s_0 \in S$.
At each timestep $t$ the agent observes the current state $s_t$ and takes an action $a_t \in A$.
The state evolves into a new state $s_{t+1} \in S$.
The agent obtains a reward $r_t = R(s_t, a_t)$.
The agent observes the new state $s_{t+1} \in S$.
Policy function: a policy can be deterministic ($\pi(s)$) or stochastic ($\pi(a \mid s)$):
A deterministic policy $\pi(s)$ gives the action $a$ performed by the agent when the
environment is in state $s$ ($a = \pi(s)$).
A stochastic policy $\pi(a \mid s)$ gives the probability that action $a$ is
performed by the agent when the environment is in state $s$.
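The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the names `rollout`, `sample_action` and `counter_env` are hypothetical, and the `done` flag returned by the environment is an assumption added to mark the end of an episode.

```python
import random


def rollout(env_step, policy, s0, horizon):
    """Generate an episode [(s_t, a_t, r_t), ...] by following `policy` from s0."""
    s, episode = s0, []
    for _ in range(horizon):
        a = policy(s)                      # deterministic policy: a = pi(s)
        r, s_next, done = env_step(s, a)   # environment returns reward and new state
        episode.append((s, a, r))
        s = s_next                         # the agent observes the new state
        if done:
            break
    return episode


def sample_action(probs, rng=random):
    """Draw an action index from a stochastic policy pi(.|s) given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def counter_env(s, a):
    """Toy environment (an assumption): the state is a counter that ends the episode at 3."""
    s_next = s + a
    return float(a), s_next, s_next >= 3


episode = rollout(counter_env, lambda s: 1, s0=0, horizon=10)
```

A stochastic policy would be plugged in as `policy=lambda s: sample_action(pi(s))`, where `pi(s)` returns the action probabilities for state `s`.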
Reinforcement Learning actors
Return $G_t$: the total discounted long-term reward obtained at the end of the
episode, with $r_t = R(s_t, a_t)$:
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots, \qquad \gamma < 1;$$
Value function $V(s)$: the expected long-term return when starting from state $s$:
$$V(s) = E[G_t \mid s_t = s] = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s];$$
Q-Value or Action value function $Q(s, a)$: the expected long-term return when
action $a$ is performed in state $s$.
The Bellman equation: the theoretical core of most RL algorithms:
$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s').$$
It can also be expressed using the Q-value as:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s').$$
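As a minimal sketch of these definitions, the snippet below computes a discounted return from a finite reward list and evaluates a fixed policy by repeatedly applying the Bellman equation. The function names and the nested-list layout (`R[s][a]`, `P[s][a][s']`) are assumptions made for the example.

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    g = 0.0
    # Accumulate backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def evaluate_policy(R, P, policy, gamma, sweeps=500):
    """Iterate V(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s')."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        V = [
            R[s][policy[s]]
            + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][policy[s]]))
            for s in range(len(R))
        ]
    return V
```

For a single state with reward 1 that loops onto itself, the fixed point is $V = 1 / (1 - \gamma)$, which gives a quick sanity check.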
Reinforcement Learning optimal policy
According to the Bellman equation, the optimal action value function $Q^*(s, a)$ is given by
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a'),$$
and then the optimal policy $\pi^*(s)$ is given by
$$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a).$$
In most real cases the state transition model and the reward function of the
model are unknown.
RL algorithms are used in order to learn the dynamics of the model and improve
the rewards.
The ε-greedy strategy is used to explore the environment and to address the
exploration-exploitation dilemma:
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q^*(s, a) & \text{with probability } 1 - \varepsilon \\ \text{random action from } A & \text{with probability } \varepsilon \end{cases}$$
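When the model is known, the optimality equation above can be solved by simple fixed-point iteration, and the ε-greedy rule is a few lines on top of the resulting table. The sketch below assumes a tabular layout (`R[s][a]`, `P[s][a][s']`) and a hypothetical two-state MDP chosen only for illustration.

```python
import random


def q_value_iteration(R, P, gamma, sweeps=500):
    """Iterate Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        V = [max(q_s) for q_s in Q]  # max_a' Q(s', a') for every s'
        Q = [
            [R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return Q


def epsilon_greedy(Q, s, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


# Hypothetical MDP: action 1 leads to state 1, action 0 leads to state 0;
# only taking action 1 in state 1 pays a reward.
R = [[0.0, 0.0], [0.0, 1.0]]
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
Q = q_value_iteration(R, P, gamma=0.9)
```

In this toy problem the greedy policy chooses action 1 in both states, since that is the only way to reach and keep the rewarding transition.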
The Reinforcement Learning approaches
Value-based methods:
A Value-based algorithm computes the optimal value function or the optimal
action value function by iteratively improving their estimate.
Policy-based methods:
A Policy-based algorithm looks for a policy such that the action performed at
each state is optimal to gain maximum reward in the future.
Model-based methods:
A Model-based algorithm learns a virtual model of the original
environment; the agent then learns how to act inside the virtual model,
using the outcomes that the virtual model returns.
Value Function Approximation
The goal is to estimate the optimal policy π∗(s) by iteratively approximating the
optimal action value function Q∗(s, a).
The process is based on a parametric action value function ˆQ(s, a, w) of the state
s, of the action a and of a vector of parameters w (randomly initialized).
An iteration over every step of every episode is performed.
For every iteration, given the state $s$ and the action $a$, we observe the reward
$R(s, a)$ and the new state $s'$.
According to the obtained reward, the parameters are updated using gradient descent:
$$\Delta w = \alpha \left[ R(s, a) + \gamma \hat{Q}(s', a', w) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w).$$
In favorable cases this process converges to an approximation of the optimal action value function.
In most of the real cases the parametric action value function ˆQ(s, a, w) is a
Neural Network, where the vector of parameters w is the vector of weights.
Value Function Approximation Reinforce Algorithm
Input: a differentiable action value parametric function $\hat{Q}(s, a, w)$
Algorithm parameters: learning rate $\alpha > 0$, $\varepsilon > 0$
Initialize value-function weights $w \in \mathbb{R}^d$ arbitrarily (e.g., $w = 0$)
Loop for each episode:
    $s, a \leftarrow$ initial state and action of episode (e.g., ε-greedy)
    Loop for each step of episode:
        Take action $a$, observe $r$, $s'$
        If $s'$ is terminal: $w \leftarrow w + \alpha [r - \hat{Q}(s, a, w)] \nabla_w \hat{Q}(s, a, w)$, go to next episode
        Choose $a' = \arg\max_{a' \in A} \hat{Q}(s', a', w)$ (or using ε-greedy)
        $w \leftarrow w + \alpha [r + \gamma \hat{Q}(s', a', w) - \hat{Q}(s, a, w)] \nabla_w \hat{Q}(s, a, w)$
        $s \leftarrow s'$
        $a \leftarrow a'$
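The weight update inside the loop can be illustrated with a linear approximator, for which the gradient of the estimate with respect to the weights is simply the feature vector. The linear form and the names `q_hat` and `sarsa_update` are assumptions for this sketch; the slides leave the parametric function generic.

```python
def q_hat(phi, w):
    """Linear action-value estimate: Q(s, a, w) = w . phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def sarsa_update(w, phi, r, phi_next, alpha, gamma, terminal=False):
    """One semi-gradient step on w.

    phi      -- features of the current pair (s, a)
    phi_next -- features of the next pair (s', a'), ignored when terminal
    """
    target = r if terminal else r + gamma * q_hat(phi_next, w)
    td_error = target - q_hat(phi, w)
    # For a linear model the gradient of Q with respect to w is phi(s, a).
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi)]


w = sarsa_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], alpha=0.5, gamma=0.9)
```

With a neural network in place of the linear model, the same update would be obtained by backpropagating the squared error between the target and the estimate while treating the target as a constant.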
Deep Q-Networks
A Deep Q-Network is a Value Approximation algorithm where the parametric
action value function $\hat{Q}(s, a, w)$ is a Deep Neural Network, and in particular a
Convolutional Neural Network.
A Deep Q-Network overcomes unstable learning mainly through 2 techniques:
Target Network
A Target Network $\hat{Q}(s', a', w^-)$ is a copy of the training model, with weights $w^-$, that is updated less frequently.
In the gradient descent formula, the Target Network is used as the target in place of the model:
$$\Delta w = \alpha \left[ r + \gamma \max_{a'} \hat{Q}(s', a', w^-) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w).$$
This avoids the instabilities caused by a continually changing target.
Experience Replay
An Experience Replay is a buffer that stores the four-tuples $(s, a, r, s')$ of all the
different episodes.
Each time the model is updated, a batch of tuples is randomly selected from the buffer.
This reduces overfitting, speeds up learning through mini-batches, and reuses
past tuples to avoid forgetting.
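The two stabilizing techniques can be sketched independently of any deep learning library. In the snippet below, `ReplayBuffer`, `dqn_target` and `maybe_sync` are hypothetical names, weights are plain lists, and the periodic hard copy of the weights is one common choice assumed here rather than something stated on the slide.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s') tuples; the oldest tuples are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def dqn_target(r, q_next_target, gamma, terminal=False):
    """r + gamma * max_a' Q(s', a', w^-), computed from the target network's outputs."""
    return r if terminal else r + gamma * max(q_next_target)


def maybe_sync(step, w, w_target, sync_every):
    """Copy the online weights into the target weights every `sync_every` steps."""
    return list(w) if step % sync_every == 0 else w_target
```

A training step would sample a batch from the buffer, compute `dqn_target` for each tuple with the frozen target weights, and take a gradient step on the online weights only.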
Fitted Q-Iteration
Consider the deterministic case, in which the new state $s'$ is uniquely determined
by the state $s$ and the action $a$ through a function $f$: $s' = f(s, a)$.
Let $L$ be the (possibly infinite) horizon.
The goal of this algorithm is to estimate the optimal action value function.
By the Bellman equation, in this situation the optimal action value function satisfies
$$Q^*(s, a) = (HQ^*)(s, a), \qquad \text{where } (HQ)(s, a) = R(s, a) + \gamma \max_{a'} Q(f(s, a), a').$$
Denote by $Q_N(s, a)$ the action value function over $N$ steps ($N \le L$), given by
$$Q_N(s, a) = (HQ_{N-1})(s, a) \quad \forall N > 0, \qquad Q_0(s, a) = 0.$$
The sequence of N-step action value functions $Q_N(s, a)$ converges to the optimal
action value function $Q^*(s, a)$ as $N \to L$.
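For the infinite-horizon case, the convergence can be made quantitative with a standard contraction argument. The bound below assumes $\gamma < 1$ and rewards bounded by a constant $R_{\max}$; the symbol $R_{\max}$ is introduced here and does not appear in the slides.

```latex
\begin{aligned}
\| HQ - HQ' \|_\infty
  &= \gamma \sup_{s,a} \Big| \max_{a'} Q(f(s,a), a') - \max_{a'} Q'(f(s,a), a') \Big|
   \le \gamma \, \| Q - Q' \|_\infty , \\
\| Q_N - Q^* \|_\infty
  &= \| HQ_{N-1} - HQ^* \|_\infty
   \le \gamma \, \| Q_{N-1} - Q^* \|_\infty
   \le \dots \le \gamma^N \| Q_0 - Q^* \|_\infty
   = \gamma^N \| Q^* \|_\infty
   \le \frac{\gamma^N R_{\max}}{1 - \gamma} .
\end{aligned}
```

The error therefore shrinks geometrically with the number of iterations, which is consistent with stopping the iteration once successive $Q_N$ stop changing noticeably.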
Fitted Q-Iteration Algorithm
Inputs: a set of four-tuples (state, action, reward, new state) $F$.
Initialization: $N = 0$, $\hat{Q}_N = 0$.
Iteration: repeat until stopping conditions are reached:
    $N \leftarrow N + 1$
    Build the training set $TS = \{(i^l, o^l),\ l = 1, \dots, \#F\}$ based on the function
    $\hat{Q}_{N-1}$ and on the full set of four-tuples $F$, where
    $$i^l = (s_t^l, a_t^l), \qquad o^l = r_t^l + \gamma \max_{a \in A} \hat{Q}_{N-1}(s_{t+1}^l, a).$$
    Use a regression algorithm to obtain the approximated N-step action value function
    $\hat{Q}_N(s, a)$ by training on the dataset $TS$.
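The iteration above can be sketched with any regressor. To keep the example self-contained, the snippet below swaps in a trivial lookup-table "regressor" (`fit_table`, an assumption made for illustration) and runs on a hypothetical three-state chain; a real application would use a proper supervised learner.

```python
def fit_table(training_set):
    """Stand-in regressor (an assumption): average the targets of identical inputs."""
    sums, counts = {}, {}
    for x, y in training_set:
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    table = {x: sums[x] / counts[x] for x in sums}
    # Inputs never seen in the data (e.g. terminal states) are predicted as 0.
    return lambda s, a: table.get((s, a), 0.0)


def fitted_q_iteration(F, actions, gamma, n_iterations):
    """F is a list of four-tuples (s, a, r, s_next)."""
    q = lambda s, a: 0.0  # Q_0 = 0
    for _ in range(n_iterations):
        # Inputs i = (s, a); outputs o = r + gamma * max_a' Q_{N-1}(s', a').
        training_set = [
            ((s, a), r + gamma * max(q(s_next, b) for b in actions))
            for (s, a, r, s_next) in F
        ]
        q = fit_table(training_set)
    return q


# Hypothetical chain 0 -> 1 -> 2: "R" moves right, "L" returns to state 0,
# and reaching state 2 pays a reward of 1.
F = [(0, "R", 0.0, 1), (0, "L", 0.0, 0), (1, "R", 1.0, 2), (1, "L", 0.0, 0)]
q = fitted_q_iteration(F, ["L", "R"], gamma=0.5, n_iterations=10)
```

After a few iterations the estimates stop changing, mirroring the behavior reported for the Car on a Hill example.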
Fitted Q-Iteration Example - Car on a Hill
Consider a car, modeled by a point mass, that is traveling on a hill whose profile is described by a function $H(p)$.
Objective - The objective is to bring the car to the top of the hill in minimum time
while keeping its speed within the allowed limits.
State space - The state of the system is determined by the position $p$ and the
speed $v$ of the car. The state space is given by:
$$S = \{(p, v) \in \mathbb{R}^2 : |p| \le 1 \text{ and } |v| \le 3\}.$$
Every other combination of position and speed is considered a terminal state.
Action space - The action $a$ acts directly on the acceleration of the car and can
only assume two extreme values: full acceleration ($a = 4$) or full deceleration
($a = -4$), so $A = \{-4, 4\}$.
Fitted Q-Iteration Example - Car on a Hill
System dynamics - Time is discretized in timesteps of 0.1 s. Given the state
$(p, v)$ and the action $a$ at timestep $t$, the state $(p, v)$ at timestep $t + 1$ is
computed by numerically integrating the dynamics of the system:
$$\dot{p} = v, \qquad \dot{v} = \frac{u}{1 + H'(p)^2} - \frac{g H'(p)}{1 + H'(p)^2} - \frac{v^2 H'(p) H''(p)}{1 + H'(p)^2} \qquad (g = 9.81),$$
where $u$ denotes the action $a$ and $H$ is the hill profile.
Reward function - The reward function $r(s, a)$ is defined through this expression:
$$r(s_t, a_t) = \begin{cases} -1 & \text{if } p_{t+1} < -1 \text{ or } |v_{t+1}| > 3 \\ 1 & \text{if } p_{t+1} > 1 \text{ and } |v_{t+1}| \le 3 \\ 0 & \text{otherwise} \end{cases}$$
Discount factor - The discount factor γ is set to 0.95.
Starting point - The car starts at rest at the bottom of the hill, $(p, v) = (-0.5, 0)$.
Regressor - The regressor used is an Extra-Trees regressor.
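A minimal sketch of the simulator follows. The hill profile is not given in the slides, so its derivatives are passed in as the hypothetical callables `dH` and `d2H`; the explicit Euler sub-stepping is likewise an assumption standing in for the unspecified numeric method.

```python
G = 9.81  # gravitational acceleration used in the slides


def acceleration(p, v, u, dH, d2H):
    """v_dot from the system dynamics; dH(p) and d2H(p) are H'(p) and H''(p)."""
    denom = 1.0 + dH(p) ** 2
    return u / denom - G * dH(p) / denom - v ** 2 * dH(p) * d2H(p) / denom


def step(p, v, u, dH, d2H, dt=0.1, substeps=100):
    """Advance the state by one 0.1 s timestep using explicit Euler sub-steps."""
    h = dt / substeps
    for _ in range(substeps):
        p, v = p + h * v, v + h * acceleration(p, v, u, dH, d2H)
    return p, v


def reward(p_next, v_next):
    """Reward computed from the state reached at timestep t + 1."""
    if p_next < -1 or abs(v_next) > 3:
        return -1
    if p_next > 1 and abs(v_next) <= 3:
        return 1
    return 0
```

On a flat profile (`dH = d2H = lambda p: 0.0`) the dynamics reduce to constant acceleration, which makes the integrator easy to check by hand.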
Fitted Q-Iteration Example - Car on a Hill
Fitted Q-Iteration was performed for $N = 1$ to 50.
For $N > 20$, $Q_N(s, a) \approx Q_{N+1}(s, a)$.
Left figure: the action chosen for every combination $(p, v)$, according to the
action value function $\hat{Q}_{20}(s, a)$ (black = deceleration, white = acceleration).
Right figure: the optimal trajectory according to $\hat{Q}_{20}(s, a)$.
A full implementation of Fitted Q-Iteration can be found on GitHub.
Policy Gradient
The goal of the Policy Gradient method is to find the vector of parameters θ that
maximizes the value function V (s, θ) under a parametric policy π(a|s, θ).
The process is based on a parametric policy π(a|s, θ) differentiable with respect
to the vector of parameters θ (randomly initialized).
In this case we choose a stochastic policy (Stochastic Policy Gradient).
An iteration over every episode is performed.
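In each episode we generate a sequence of triplets (state, action, reward), one per timestep $t$,
choosing the actions according to the parametric policy $\pi(a \mid s, \theta)$.
For every timestep in the resulting sequence we compute the total discounted long-term
reward $G_t$:
$$G_t = \sum_{k=t}^{T} \gamma^{k-t} R_k,$$
where $T$ is the last timestep of the episode.
Then the vector of parameters $\theta_t$ is modified using a gradient update, in which the gradient of the value function is replaced by its sampled estimate:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta V(s, \theta) \approx \theta_t + \alpha G_t \nabla_\theta \ln \pi(a_t \mid s_t, \theta).$$
The process converges to an approximation of the optimal policy (typically a local optimum).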
Policy Gradient Reinforce Algorithm
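Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Algorithm parameter: learning rate $\alpha > 0$
Initialize policy parameter $\theta \in \mathbb{R}^d$ (for example to 0)
Loop for each episode:
    Generate an episode $(s_0, a_0, r_1), \dots, (s_{T-1}, a_{T-1}, r_T)$, following $\pi(\cdot \mid \cdot, \theta)$
    Loop for each step of the episode $t = 0, 1, \dots, T - 1$:
        $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
        $\theta \leftarrow \theta + \alpha \gamma^t G \nabla_\theta \ln \pi(a_t \mid s_t, \theta)$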
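The inner loop can be sketched as follows. The policy is kept abstract: `grad_log_pi(s, a, theta)` is a hypothetical callable returning the gradient of $\ln \pi(a \mid s, \theta)$ as a list, and the episode is assumed to be a list of `(s_t, a_t, r_{t+1})` triplets.

```python
def reinforce_returns(rewards, gamma):
    """rewards[t] holds r_{t+1}; returns G for every t, computed in one backward pass."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def reinforce_episode_update(theta, episode, grad_log_pi, alpha, gamma):
    """Apply theta <- theta + alpha * gamma^t * G * grad ln pi(a_t|s_t, theta) for each step."""
    returns = reinforce_returns([r for _, _, r in episode], gamma)
    for t, (s, a, _) in enumerate(episode):
        grad = grad_log_pi(s, a, theta)  # evaluated at the current theta
        scale = alpha * (gamma ** t) * returns[t]
        theta = [th + scale * g for th, g in zip(theta, grad)]
    return theta
```

Computing all returns in a single backward pass avoids re-summing the rewards at every step, while giving the same values as the sum in the pseudocode.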
Examples of Parametric Policy
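Softmax Policy
The Softmax Policy is mostly used in the case of discrete actions:
$$\pi(a \mid s, \theta) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_k e^{\phi(s, a_k)^\top \theta}}.$$
The explicit formula for the gradient update is
$$\nabla_\theta \log \pi(a \mid s, \theta) = \phi(s, a) - E_{\pi_\theta}[\phi(s, \cdot)],$$
where $\phi(s, a)$ is the feature vector related to the state and the action.
Gaussian Policy
The Gaussian Policy is used in the case of a continuous action space:
$$\pi(a \mid s, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(a - \mu(s))^2}{2\sigma^2}\right),$$
where $\phi(s)$ is the feature vector, $\mu(s) = \phi(s)^\top \theta$, and $\sigma$ can be fixed or parametric.
The explicit formula for the gradient update is
$$\nabla_\theta \log \pi(a \mid s, \theta) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}.$$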
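Both score functions translate directly into code. In this sketch, `features[k]` is assumed to hold the feature vector $\phi(s, a_k)$ for one fixed state, and all function names are illustrative.

```python
import math


def softmax_policy(features, theta):
    """features[k] is phi(s, a_k); returns pi(a_k | s, theta) for every action."""
    scores = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_grad_log(features, theta, a):
    """grad log pi(a|s,theta) = phi(s,a) - E_pi[phi(s,.)]."""
    probs = softmax_policy(features, theta)
    expected = [sum(p * phi[i] for p, phi in zip(probs, features))
                for i in range(len(theta))]
    return [features[a][i] - expected[i] for i in range(len(theta))]


def gaussian_grad_log(phi_s, theta, a, sigma):
    """grad log pi(a|s,theta) = (a - mu(s)) * phi(s) / sigma^2, with mu(s) = phi(s) . theta."""
    mu = sum(f * t for f, t in zip(phi_s, theta))
    return [(a - mu) * f / sigma ** 2 for f in phi_s]
```

Either function can serve as the gradient term in a policy-gradient update once the feature map is chosen.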
Policy Gradient Advantages and Disadvantages
A Policy Gradient method is a simpler process compared with value-based
methods.
It allows the action to be continuous with respect to the state.
It usually has better convergence properties than other methods.
It avoids the growth in memory usage and computation time when
the action and state sets are large.
It can learn stochastic policies.
It allows the use of the ε-greedy method.
A Policy Gradient method typically converges to a local rather than a global
optimum.
It usually has high variance.
Policy Gradient Example - CartPole
CartPole is a game where a pole is attached by an unactuated joint to a cart,
which moves along a frictionless track. The pole starts upright.
The goal is to prevent it from falling by increasing and reducing the cart’s velocity.
State space - A single state is composed of 4 elements:
cart position
cart velocity
pole angle
pole angular velocity
The game ends when the pole falls, that is, when the pole angle exceeds
±12° from the vertical, or when the cart position reaches the edge of the display.
Action space - The agent can take only 2 actions:
push the cart to the left
push the cart to the right
Reward - A reward of 1 is given for every step taken (including the termination
step).
Policy Gradient Example - CartPole
The problem is solved with the Policy Gradient method (implementation on GitHub).
Base Policy: Softmax Policy
Discount factor γ = 0.95, learning rate α = 0.1, max iterations per episode: 1000
After about 60 epochs (1 epoch = 20 consecutive episodes) the agent learns a
policy that reaches a total reward of 1000, the maximum allowed by the iteration cap.
Policy Gradient Example - CartPole
This chart shows how the average reward per epoch evolves as a function of the total
number of epochs, for different values of the discount factor γ.
Actor-Critic Method
The Actor-Critic method differs from the Policy Gradient method because it estimates
both the policy and the value function, and updates both.
To reduce the high variance of Policy Gradient, the Actor-Critic
method subtracts a baseline $b(s)$ from $G_t$.
The Temporal Difference error $\delta = G_t - b(s)$ is used to update the vector of
parameters θ in place of the long-term reward $G_t$.
The most used baseline is the estimate of the value function $V(s)$; in the one-step version of the method, $G_t$ is itself estimated by $r + \gamma \hat{V}(s', w)$.
The value function $V(s)$ is learned with a Neural Network, whose output is the
approximated value function $\hat{V}(s, w)$, where $w$ is the vector of weights.
Then in every iteration the Temporal Difference error δ is used to adjust the
vector of parameters θ and the vector of weights $w$.
1 The Critic estimates the value function V(s).
2 The Actor updates the policy distribution in the direction suggested by the Critic.
Actor-Critic Algorithm
Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Input: a differentiable state-value function parameterization $\hat{V}(s, w)$
Algorithm parameters: step sizes $\alpha_\theta > 0$, $\alpha_w > 0$
Initialize policy parameter $\theta \in \mathbb{R}^d$ and state-value weights $w \in \mathbb{R}^d$ (e.g., to 0)
Loop forever (for each episode):
    Initialize $s$ (first state of episode)
    $I \leftarrow 1$
    Loop while $s$ is not terminal (for each time step):
        $a \sim \pi(\cdot \mid s, \theta)$
        Take action $a$, observe $s'$, $r$
        $\delta \leftarrow r + \gamma \hat{V}(s', w) - \hat{V}(s, w)$ (if $s'$ is terminal, then $\hat{V}(s', w) = 0$)
        $w \leftarrow w + \alpha_w I \delta \nabla_w \hat{V}(s, w)$
        $\theta \leftarrow \theta + \alpha_\theta I \delta \nabla_\theta \ln \pi(a \mid s, \theta)$
        $I \leftarrow \gamma I$
        $s \leftarrow s'$
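One pass through the inner loop can be sketched as a pure function. The linear critic $\hat{V}(s, w) = w \cdot x(s)$ is an assumption made to keep the gradient explicit (the slides use a neural network), and `grad_log_pi` is taken as an already computed gradient vector for the chosen action.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def actor_critic_step(theta, w, I, x, x_next, r, grad_log_pi,
                      alpha_theta, alpha_w, gamma, terminal):
    """One update of critic weights w and actor parameters theta.

    x, x_next -- feature vectors of s and s' for the linear critic (an assumption)
    """
    v_next = 0.0 if terminal else dot(w, x_next)   # V(s', w) = 0 at terminal states
    delta = r + gamma * v_next - dot(w, x)         # Temporal Difference error
    # For a linear critic the gradient of V with respect to w is x(s).
    w_new = [wi + alpha_w * I * delta * xi for wi, xi in zip(w, x)]
    theta_new = [ti + alpha_theta * I * delta * gi
                 for ti, gi in zip(theta, grad_log_pi)]
    return theta_new, w_new, gamma * I, delta


theta, w, I, delta = actor_critic_step(
    [0.0], [0.0], 1.0, [1.0], [1.0], 1.0, [2.0],
    alpha_theta=0.1, alpha_w=0.5, gamma=0.9, terminal=False)
```

The same TD error drives both updates: the critic moves its estimate toward the bootstrapped target, and the actor reinforces the action in proportion to how much better than expected the outcome was.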
Model-based Methods
A Model-based method relies on a base parametric model and on 3 main steps:
1 Acting: the base policy $\pi_0(a_t \mid s_t)$ is used to select the actions to perform in the real
environment, in order to collect a set of triplets $(s, a, s')$.
2 Model learning: from the collected experience, a new model $f(s, a)$ is fitted so as
to minimize the least-squares error between the model's predicted new state and the real new state:
$$\sum_i \| f(s_i, a_i) - s'_i \|^2.$$
3 Planning: the value function and the policy are updated according to the new model, in
order to be used in the real environment in the next iteration.
Most used base models: Gaussian Process, Gaussian Mixture Model
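The model-learning step can be illustrated with the simplest possible model class. The sketch below assumes scalar states and actions and a linear model $f(s, a) = A s + B a$, which is a deliberate simplification of the Gaussian Process and mixture models named above; the function name is hypothetical.

```python
def fit_linear_model(triplets):
    """Least-squares fit of s' ~ A*s + B*a from (s, a, s') triplets.

    Solves the 2x2 normal equations in closed form.
    """
    sss = sum(s * s for s, a, _ in triplets)
    saa = sum(a * a for s, a, _ in triplets)
    ssa = sum(s * a for s, a, _ in triplets)
    sy = sum(s * s2 for s, a, s2 in triplets)
    ay = sum(a * s2 for s, a, s2 in triplets)
    det = sss * saa - ssa ** 2
    if det == 0:
        raise ValueError("data does not determine A and B uniquely")
    A = (sy * saa - ay * ssa) / det
    B = (ay * sss - sy * ssa) / det
    return A, B


# Triplets generated by the hypothetical dynamics s' = 0.9*s + 0.5*a.
data = [(1.0, 0.0, 0.9), (0.0, 1.0, 0.5), (1.0, 1.0, 1.4), (2.0, -1.0, 1.3)]
A, B = fit_linear_model(data)
```

The fitted pair `(A, B)` defines the model `f = lambda s, a: A * s + B * a` that the planning step would then use in place of the real environment.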
Model Predictive Control
Model Predictive Control (MPC) is an evolution of the model-based method.
The Model-based algorithm is vulnerable to drifting: small model errors accumulate along the planned trajectory.
To address this, sampling and fitting of the model are performed continuously
during the trajectory.
In MPC the whole trajectory is optimized, but only the first action is performed;
then the new triplet $(s, a, s')$ is added to the observations and the planning is
performed again.
By constantly re-planning, MPC is less vulnerable to inaccuracies in the model.
MPC has 5 main steps:
1 Acting
2 Model learning
3 Planning
4 Execution: the first planned action is performed, and the resulting state $s'$ is observed.
5 Dataset update: the new triplet $(s, a, s')$ is appended to the dataset; go to step 3,
and every N iterations go to step 2.
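The five steps can be sketched as a short loop. The planner below enumerates every action sequence over a short horizon, which is only practical for small discrete action sets and is an assumption of this example; `fit_model`, `reward_fn` and `env_step` are hypothetical callables supplied by the user.

```python
from itertools import product


def plan(model, reward_fn, s, actions, horizon):
    """Return the action sequence with the highest simulated total reward."""
    best_seq, best_total = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        sim, total = s, 0.0
        for a in seq:
            sim = model(sim, a)        # roll the sequence out in the learned model
            total += reward_fn(sim)
        if total > best_total:
            best_seq, best_total = seq, total
    return best_seq


def mpc_loop(env_step, fit_model, reward_fn, s, actions, dataset, steps,
             horizon=3, refit_every=2):
    model = fit_model(dataset)                               # step 2: model learning
    for i in range(steps):
        if i > 0 and i % refit_every == 0:
            model = fit_model(dataset)                       # periodic return to step 2
        a = plan(model, reward_fn, s, actions, horizon)[0]   # step 3, keep first action
        s_next = env_step(s, a)                              # step 4: execution
        dataset.append((s, a, s_next))                       # step 5: dataset update
        s = s_next
    return s, dataset


true_step = lambda s, a: s + a
final, data = mpc_loop(true_step, lambda d: true_step, lambda s: -abs(s),
                       3, [-1, 1], [], steps=5)
```

In the usage lines the "learned" model is simply the true dynamics, so the controller walks the state from 3 toward 0; with an imperfect model, the periodic re-fit is what limits the effect of model error.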
Dyna-Q Architecture
Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in S$ and $a \in A$
Do until termination condition:
    1 $s \leftarrow$ current (nonterminal) state
    2 $a \leftarrow$ ε-greedy$(s, Q)$
    3 Execute action $a$
    4 Observe resultant reward $r$ and new state $s'$
      $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$
    5 $Model(s, a) \leftarrow r, s'$ (assuming deterministic environment)
    6 Planning: repeat N times:
        $s \leftarrow$ random previously observed state
        $a \leftarrow$ random action previously taken in $s$
        $r, s' \leftarrow Model(s, a)$
        $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$
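A tabular version of this architecture fits in a few dozen lines. The sketch assumes an episodic environment whose step function returns `(r, s', done)`; the `done` flag, the random tie-breaking in the greedy choice, and the toy chain environment are all additions made for this example.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start, actions, episodes, n_planning,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s', done), deterministic environment assumed

    def greedy(s):
        best = max(Q[(s, b)] for b in actions)
        return rng.choice([b for b in actions if Q[(s, b)] == best])  # random tie-break

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)
            update(s, a, r, s2, done)            # direct RL update
            model[(s, a)] = (r, s2, done)        # model learning
            for _ in range(n_planning):          # planning from simulated experience
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q


def chain_step(s, a):
    """Toy chain 0..3 (an assumption): 'R' moves right, 'L' moves left, state 3 is terminal."""
    s2 = min(s + 1, 3) if a == "R" else max(s - 1, 0)
    done = s2 == 3
    return (1.0 if done else 0.0), s2, done


Q = dyna_q(chain_step, 0, ["L", "R"], episodes=50, n_planning=10)
```

The planning loop replays remembered transitions, so value information propagates back along the chain with far fewer real environment steps than plain Q-learning would need.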
Model-based Methods Advantages and Disadvantages
Model-based Reinforcement Learning has the strong advantage of being sample
efficient.
Once the model and the reward function are known, we can plan the optimal
controls without further sampling.
The learning phase is fast, since there is no need to wait for the environment to
respond.
On the downside, if the model is inaccurate we risk learning something
completely different from reality.
Model-based algorithms still use Model-free methods, either to construct the model
or in the planning/simulation.
We have given a high-level structural overview of many classic and popular RL
algorithms, but there are many variants that we have not covered.
The main challenge in RL lies in preparing the simulation environment, which is
highly dependent on the task to be performed.
In fact, many real-world problems have enormous state or action spaces, and for
this reason the use of parametric functions is needed.
One of the main tasks in all the methods is to tune rewards and penalties in
order to obtain the desired results.
Another challenge is to build a learning process that converges to the optimum in
a reasonable time while avoiding bias and overfitting.
Last but not least, it is important to avoid forgetting when acquiring new
information.

Reinforcement Learning Overview | Marco Del Pra

  • 1. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Reinforcement Learning June 30, 2020 Reinforcement Learning
  • 2. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Introduction Reinforcement Learning (RL) is a growing subset of Machine Learning and one of the most important frontiers of Artificial Intelligence It denotes a set of algorithms that deal with sequential decision-making. A Reinforcement Learning algorithm can be described as a model that tells an agent which set of actions it should take within a closed environment in order to to maximize a predefined overall reward. The agent tries different sets of actions, evaluating the total obtained return. After many trials, the algorithm learns which actions give a greater reward and establish a pattern of behavior. Thanks to this, it is able to tell the agent which actions to take in every situation. The goal of Reinforcement Learning is to capture higher logic and use more adaptable algorithms than classical Machine Learning. Reinforcement Learning
  • 3. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Reinforcement Learning applications Robotics - Solution of high-dimensional control problems Text mining - Producction of highly readable summaries of long texts. Trade execution - Optimization of trading Healthcare - Medication dosing, optimization of treatment Games - Solution of different games and achievement of superhuman performances. Reinforcement Learning
  • 4. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Reinforcement Learning actors Reinforcement Learning algorithms are based on Markov Decision Process (MDP). Agent: an entity which performs actions in an environment in order to optimize a long-term reward; Environment (e): the scenario that the agent has to face; Set of states (S): the set of all the possible states s of the environment; Set of actions (A): the set of all the possible actions a that can be performed by the agent; State transition model P (s |s, a): describes the probability that the environment state changes in s from s with action a; Reward (r = R(s, a)): a function that indicates the immediate the real-valued reward for taking action a at state s; Reinforcement Learning
  • 5. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Reinforcement Learning actors Episode (rollout): a sequence of states st and actions at for t that varies from 0 to the horizon; The agent starts in a given state within its environment s0 ∈ S At each timestep t the agent observes the current state st and takes an action at ∈ A The state evolves into a new state st+1 ∈ S, The agent obtains a reward rt = R(st , at ) The agent observes the new state st+1 ∈ S Policy function: a policy can be deterministic (π(s)) or stochastic (π(a|s)): a deterministic policy π(s) indicates the action a performed by the the agent when the environment is in the state s (a = π(s)). a stochastic policy π(a|s) is a function that describe the probability that action a is performed by the the agent when the environment is in the state s. Reinforcement Learning
  • 6. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Reinforcement Learning actors Return Gt : the total long term reward with discount obtained at the end of the episode, according to rt = R(st , at ): Gt = rt + γrt+1 + γ2 rt+2 + γ3 rt+3 + · · · γ < 1; Value function V (s): the expected long-term return at the end: V (s) = E [Gt | st = s] = E rt + γrt+1 + γ2 rt+2 + γ3 rt+3 + · · · | st = s ; Q-Value or Action value function Q(s, a): the expected long-term return at the end performing action a. The Bellman equation: the theoretical core in most RL algorithms: Vπ(s) = R(s, π(s)) + γ s P s | s, π(s) Vπ s . It can also be expressed using the Q-value as: Qπ(s, a) = R(s, a) + γ s ∈S P s | s, a Vπ s . Reinforcement Learning
  • 7. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Reinforcement Learning optimal policy According to Bellman equation, the optimal action function Q∗(s, a) is given by Q∗ (s, a) = R(s, a) + γ s P s | s, a max a Q∗ s , a , and then the optimal policy π∗(s) is given by π∗ (s) = arg max a∈ A Q∗ (s, a). In most real cases the state transition model and the reward function of the model are unknown. RL algorithms are used in order to learn the dynamics of the model and improve the rewards. The -greedy strategy is used to explore the entire environment and solve the exploration-exploitation dilemma: π∗ (s) = arg maxa∈ A Q∗(s, a) probability 1 − random from A probability Reinforcement Learning
  • 8. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions The Reinforcement Learning approaches Value-based methods: A Value-based algorithm computes the optimal value function or the optimal action value function by iteratively improving their estimate. Policy-based methods: A Policy-based algorithm looks for a policy such that the action performed at each state is optimal to gain maximum reward in the future. Model-based methods: A Model-based algorithm learns a virtual model starting from the original environment, and the agent learns how to perform in the virtual model and then get the results returned by the virtual model. Reinforcement Learning
  • 9. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Value Function Approximation The goal is to estimate the optimal policy π∗(s) by iteratively approximating the optimal action value function Q∗(s, a). The process is based on a parametric action value function ˆQ(s, a, w) of the state s, of the action a and of a vector of parameters w (randomly initialized). An iteration over every step of every episode if performed. For every iteration, given the state s and the action a, we observe the reward R(s, a) and the new state s . According to the obtained reward the parameters are updated using the gradient descent: ∆w = α R(s, a) + γ ˆQ(s , a , w) − ˆQ(s, a, w) w ˆQ(s, a, w). This process converges to the approximation of the optimal action value function. In most of the real cases the parametric action value function ˆQ(s, a, w) is a Neural Network, where the vector of parameters w is the vector of weights. Reinforcement Learning
  • 10. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Value Function Approximation Reinforce Algorithm Input: a differentiable action value parametric function ˆQ(s, a, w) Algorithm parameters: learning rate α > 0, ε > 0 Initialize value-function weights w ∈ Rd randomly (e.g., w = 0) Loop for each episode: s, a ← initial state and action of episode (e.g., ε -greedy) Loop for each step of episode: Take action a, observe r, s If s is terminal, w ← w + α[r − ˆQ(s, a, w)] ˆQ(s, a, w), go to next episode Choose a = argmaxa ∈A ˆQ s , a , w (or using ε-greedy) w ← w + α r + γ ˆQ s , a , w − ˆQ(s, a, w) ˆQ(s, a, w) s ← s a ← a Reinforcement Learning
  • 11. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Deep Q-Networks A Deep Q-Network is a Value Approximation algorithm where the parametric action value function ˆQ(s, a, w) is a Deep Neural Network, and in particular a Convolutional Neural Network. A Deep Q-Network overcomes unstable learning using mainly 2 techniques Target Network A Target Network ˆQ s , a , w is a copy of the training model that is updated less frequently. In the gradient descent formula, the Target Network is used as target in place of the model: ∆w = α r + γ max a ˆQ s , a , w − ˆQ(s, a, w) w ˆQ(s, a, w). This solution is useful to avoid instabilities given by the continue changes in the target. Experience Replay An Experience Replay is a buffer that stores the four-tuples (s, a, r, s ) of all the different episodes. Each time the model is updated, it randomly selects a batch of tuples. This solution reduces overfitting, increases learning speed with mini-batches and reuses past tuples to avoid forgetting. Reinforcement Learning
  • 12. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Fitted Q-Iteration Consider the deterministic case, in which we have that s is uniquely determined by the state s and the action a by a function f : s = f (s, a). Let L be the possible infinite horizon. The goal of this algorithm is to estimate the optimal action value function. By the Bellman equation, in this situation the optimal action value function is Q∗ (s, a) = HQ(s, a) = R(s, a) + γ max a Q f (s, a), a Denote by QN (s, a) the action value functions over N steps (N ≤ L) given by QN (s, a) = (HQN−1) (s, a) ∀N > 0, Q0(s, a) = 0. The sequence of N-step action value functions QN (s, a) converges to the optimal action value function Q∗(s, a) as N → L. Reinforcement Learning
  • 13. Introduction RL Framework Value-based Methods Policy-based Methods Model-based Methods Conclusions Fitted Q-Iteration Algorithm Inputs: a set of four-tuples (state, action, reward, new state) F. Initialization: N = 0, QN = 0. Iteration: repeat until stopping conditions are reached N ← N + 1 Build the training set T S = il , ol , l = 1, · · · , #F based on the the function ˆQN−1 and on the full set of four-tuples F, where il = sl t , al t ol = rl t + γ maxa∈A ˆQN−1 sl t+1, a Use a regression algorithm to obtain the approximated N-Step action value function ˆQN (s, a) training on the obtained dataset T S. Reinforcement Learning
  • 14. Fitted Q-Iteration Example - Car on a Hill
    Consider a car, modeled by a point mass, that is traveling on a hill with this form: [figure: profile of the hill]
    Objective - The objective is to bring the car to the top of the hill in minimum time while keeping its speed within a limit.
    State space - The state of the system is determined by the position $p$ and the speed $v$ of the car. The state space is given by
    $S = \{(p, v) \in \mathbb{R}^2 : |p| \le 1 \text{ and } |v| \le 3\}$.
    Every other combination of position and speed is considered a terminal state.
    Action space - The action $a$ acts directly on the acceleration of the car and can only assume two extreme values: full acceleration ($a = 4$) or full deceleration ($a = -4$), so $A = \{-4, 4\}$.
  • 15. Fitted Q-Iteration Example - Car on a Hill
    System dynamics - Time is discretized in timesteps of 0.1 s. Given the state $(p, v)$ and the action $a$ at timestep $t$, the state $(p, v)$ at timestep $t + 1$ is computed by solving the dynamics of the system with a numerical method:
    $\dot{p} = v, \qquad \dot{v} = \frac{u}{1 + H'(p)^2} - \frac{g H'(p)}{1 + H'(p)^2} - \frac{v^2 H'(p) H''(p)}{1 + H'(p)^2} \qquad (g = 9.81)$,
    where $u$ is the applied action $a$, $H$ is the profile of the hill, and $H'$, $H''$ are its first and second derivatives.
    Reward function - The reward function $r(s, a)$ is defined as
    $r(s_t, a_t) = -1$ if $p_{t+1} < -1$ or $|v_{t+1}| > 3$; $\quad 1$ if $p_{t+1} > 1$ and $|v_{t+1}| \le 3$; $\quad 0$ otherwise.
    Discount factor - The discount factor $\gamma$ is set to 0.95.
    Starting point - The car starts at rest at the bottom of the hill, $(p, v) = (-0.5, 0)$.
    Regressor - The regressor used is an Extra Trees Regressor.
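A minimal sketch of one simulation step and of the reward function follows. Two things are assumptions rather than content of the slides: the integration scheme (explicit Euler with small substeps, since the slides only say "a numerical method") and the way the hill enters the code, namely as two hypothetical callables `dH` and `d2H` returning $H'(p)$ and $H''(p)$, because the hill profile itself is only given as a figure.

```python
G = 9.81  # gravitational acceleration, as on the slide


def car_step(p, v, u, dH, d2H, dt=0.1, substeps=100):
    """Advance (p, v) by one timestep dt under action u with explicit Euler."""
    h = dt / substeps
    for _ in range(substeps):
        slope, curvature = dH(p), d2H(p)
        denom = 1.0 + slope ** 2
        v_dot = u / denom - G * slope / denom - v ** 2 * slope * curvature / denom
        p, v = p + h * v, v + h * v_dot
    return p, v


def reward(p_next, v_next):
    """Reward of the transition, based on the state reached at t + 1."""
    if p_next < -1 or abs(v_next) > 3:
        return -1
    if p_next > 1:  # |v_next| <= 3 is guaranteed by the previous check
        return 1
    return 0


def is_terminal(p, v):
    """Any state outside |p| <= 1 and |v| <= 3 is terminal."""
    return abs(p) > 1 or abs(v) > 3
```

Calling `car_step` repeatedly with actions from $A = \{-4, 4\}$ and recording `(state, action, reward, new state)` would produce the four-tuples that Fitted Q-Iteration takes as input.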
  • 16. Fitted Q-Iteration Example - Car on a Hill
    Fitted Q-Iteration was performed for $N = 1$ to $50$. For $N > 20$, $\hat{Q}_N(s, a) \approx \hat{Q}_{N+1}(s, a)$.
    Left figure: the action chosen for every combination $(p, v)$, according to the action value function $\hat{Q}_{20}(s, a)$ (black = deceleration, white = acceleration).
    Right figure: the optimal trajectory according to $\hat{Q}_{20}(s, a)$.
    A full implementation of Fitted Q-Iteration can be found on GitHub.
  • 17. Policy Gradient
    The goal of the Policy Gradient method is to find the vector of parameters $\theta$ that maximizes the value function $V(s, \theta)$ under a parametric policy $\pi(a|s, \theta)$.
    The process is based on a parametric policy $\pi(a|s, \theta)$ that is differentiable with respect to the vector of parameters $\theta$ (randomly initialized). In this case we choose a stochastic policy (Stochastic Policy Gradient).
    An iteration over every episode is performed. For each episode we generate a sequence of triplets (state, action, reward), choosing the action according to the parametric policy $\pi(a|s, \theta)$.
    For every timestep $t$ in the resulting sequence we compute the total discounted long-term reward $G_t$:
    $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$.
    Then the vector of parameters $\theta_t$ is modified with a gradient update, in which $G_t \nabla_\theta \ln \pi(a_t | s_t, \theta)$ is a sample estimate of $\nabla_\theta V(s, \theta)$:
    $\theta_{t+1} = \theta_t + \alpha \nabla_\theta V(s, \theta) \approx \theta_t + \alpha G_t \nabla_\theta \ln \pi(a_t | s_t, \theta)$.
    The process converges to an approximately optimal policy.
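The return $G_t$ for every timestep of an episode can be computed in a single backward pass, since $G_t = R_{t+1} + \gamma G_{t+1}$. The sketch below assumes the rewards of one episode are stored in a list where index `t` holds $R_{t+1}$; the function name is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

For example, `discounted_returns([1, 1, 1], 0.5)` gives `[1.75, 1.5, 1.0]`.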
  • 18. Policy Gradient Reinforce Algorithm
    Input: a differentiable policy parameterization $\pi(a|s, \theta)$
    Algorithm parameter: learning rate $\alpha > 0$
    Initialize policy parameter $\theta \in \mathbb{R}^d$ (for example to 0)
    Loop for each episode:
      Generate an episode $(s_0, a_0, r_1), \dots, (s_{T-1}, a_{T-1}, r_T)$, following $\pi(\cdot|\cdot, \theta)$
      Loop for each step of the episode $t = 0, 1, \dots, T - 1$:
        $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
        $\theta \leftarrow \theta + \alpha \gamma^t G \nabla_\theta \ln \pi(a_t|s_t, \theta)$
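As a hedged illustration of the loop above, the sketch below runs REINFORCE on a deliberately tiny task that is not in the slides: there is a single state, two actions, and action 1 pays a reward of 1 at every step while action 0 pays nothing. The policy is a softmax over one preference per action, so $\nabla_\theta \ln \pi(a|\theta)$ is the one-hot vector of the chosen action minus the action probabilities. All names and hyperparameter values are assumptions.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce(n_episodes=2000, horizon=3, alpha=0.05, gamma=0.95, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one preference per action
    for _ in range(n_episodes):
        # Generate an episode following pi(. | theta).
        actions, rewards = [], []
        for _ in range(horizon):
            probs = softmax(theta)
            a = 0 if rng.random() < probs[0] else 1
            actions.append(a)
            rewards.append(1.0 if a == 1 else 0.0)  # rewards[t] holds r_{t+1}
        # Update theta once per step of the episode.
        for t in range(horizon):
            g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
            probs = softmax(theta)
            for b in range(2):
                grad_log_pi = (1.0 if b == actions[t] else 0.0) - probs[b]
                theta[b] += alpha * gamma ** t * g * grad_log_pi
    return softmax(theta)
```

After training, the returned probabilities should put most of their mass on the rewarded action; the exact values depend on the seed and the step size.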
  • 19. Examples of Parametric Policy
    Softmax Policy - The Softmax Policy is mostly used in the case of discrete actions:
    $\pi(a|s, \theta) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_{k=1}^{N} e^{\phi(s, a_k)^\top \theta}}$,
    where $\phi(s, a)$ is the feature vector related to the state and the action. The explicit formula for the gradient update is
    $\nabla_\theta \log \pi(a|s, \theta) = \phi(s, a) - E_{\pi_\theta}[\phi(s, \cdot)]$.
    Gaussian Policy - The Gaussian Policy is used in the case of a continuous action space:
    $\pi(a|s, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(a - \mu(s))^2}{2\sigma^2}}$,
    where $\phi(s)$ is the feature vector of the state, $\mu(s) = \phi(s)^\top \theta$, and $\sigma$ can be fixed or parametric. The explicit formula for the gradient update is
    $\nabla_\theta \log \pi(a|s, \theta) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$.
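Both gradient formulas translate directly into code. The sketch below uses plain lists as vectors; the function names and the convention of passing one feature vector per action are assumptions made for this example, and $\sigma$ is treated as fixed.

```python
import math


def softmax_probs(features, theta):
    """features[k] is phi(s, a_k); returns pi(a_k | s, theta) for every k."""
    scores = [sum(f * w for f, w in zip(phi, theta)) for phi in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_score(features, theta, action):
    """grad_theta log pi(a | s, theta) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_probs(features, theta)
    dim = len(theta)
    expected = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(dim)]
    return [features[action][i] - expected[i] for i in range(dim)]


def gaussian_score(phi, theta, action, sigma):
    """grad_theta log pi(a | s, theta) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = sum(f * w for f, w in zip(phi, theta))
    return [(action - mu) * f / sigma ** 2 for f in phi]
```

Either function can be plugged into the update $\theta \leftarrow \theta + \alpha G_t \nabla_\theta \ln \pi(a_t | s_t, \theta)$ in place of the gradient term.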
  • 20. Policy Gradient Advantages and Disadvantages
    Advantages
      A Policy Gradient method is a simpler process compared with value-based methods.
      It allows the action to be continuous with respect to the state.
      It usually has better convergence properties than other methods.
      It avoids the growth in memory usage and computation time when the action and state sets are large.
      It can learn stochastic policies.
      It allows the use of the $\varepsilon$-greedy method.
    Disadvantages
      A Policy Gradient method typically converges to a local rather than a global optimum.
      It usually has high variance.
  • 21. Policy Gradient Example - CartPole
    CartPole is a game where a pole is attached by an unactuated joint to a cart, which moves along a frictionless track. The pole starts upright. The goal is to prevent it from falling by increasing and reducing the cart's velocity.
    State space - A single state is composed of 4 elements:
      cart position
      cart velocity
      pole angle
      pole angular velocity
    The game ends when the pole falls, that is when the pole angle exceeds 12° in either direction, or when the cart position reaches the edge of the display.
    Action space - The agent can take only 2 actions:
      push the cart to the left
      push the cart to the right
    Reward - A reward of 1 is given for every step taken, including the termination step.
  • 22. Policy Gradient Example - CartPole
    The problem is solved with the Policy Gradient method (implementation on GitHub).
    Base policy: Softmax Policy
    Discount factor $\gamma = 0.95$, learning rate $\alpha = 0.1$, max iterations per episode: 1000
    After about 60 epochs (1 epoch = 20 consecutive episodes) the agent learns a policy that achieves a reward of 1000.
  • 23. Policy Gradient Example - CartPole
    This chart shows how the average reward per epoch evolves as a function of the total number of epochs, for different values of the discount factor $\gamma$.
  • 24. Actor-Critic Method
    The Actor-Critic method differs from the Policy Gradient method because it estimates both the policy and the value function, and updates both.
    To reduce the high variance of Policy Gradient, the Actor-Critic method subtracts a baseline $b(s)$ from $G_t$. The error $\delta = G_t - b(s)$ is used to update the vector of parameters $\theta$ in place of the long-term reward $G_t$.
    The most used baseline is the estimate of the value function $V(s)$. When $G_t$ is in turn estimated by the one-step target $r + \gamma \hat{V}(s', w)$, $\delta$ is the Temporal Difference error.
    The value function $V(s)$ is learned with a Neural Network, whose output is the approximated value function $\hat{V}(s, w)$, where $w$ is the vector of weights. In every iteration the Temporal Difference error $\delta$ is then used to adjust the vector of parameters $\theta$ and the vector of weights $w$.
    Actor-Critic:
      1 The Critic estimates the value function $V(s)$.
      2 The Actor updates the policy distribution in the direction suggested by the Critic.
  • 25. Actor-Critic Algorithm
    Input: a differentiable policy parameterization $\pi(a | s, \theta)$
    Input: a differentiable state-value function parameterization $\hat{V}(s, w)$
    Algorithm parameters: step sizes $\alpha_\theta > 0$, $\alpha_w > 0$
    Initialize policy parameter $\theta \in \mathbb{R}^{d'}$ and state-value weights $w \in \mathbb{R}^{d}$ (e.g., to 0)
    Loop forever (for each episode):
      Initialize $s$ (first state of the episode)
      $I \leftarrow 1$
      Loop while $s$ is not terminal (for each time step):
        $a \sim \pi(\cdot | s, \theta)$
        Take action $a$, observe $s'$, $r$
        $\delta \leftarrow r + \gamma \hat{V}(s', w) - \hat{V}(s, w)$ (if $s'$ is terminal, then $\hat{V}(s', w) \doteq 0$)
        $w \leftarrow w + \alpha_w I \delta \nabla_w \hat{V}(s, w)$
        $\theta \leftarrow \theta + \alpha_\theta I \delta \nabla_\theta \ln \pi(a | s, \theta)$
        $I \leftarrow \gamma I$
        $s \leftarrow s'$
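The pseudocode above can be sketched on a toy problem. Everything specific here is an assumption for illustration: the environment is a three-state chain (states 0 and 1, terminal state 2 reached with reward 1), the critic is a table with one weight per state instead of a Neural Network, and the actor is a softmax over one preference per state and action. With tabular features, $\nabla_w \hat{V}(s, w)$ is a one-hot vector, so only `w[s]` changes at each step.

```python
import math
import random

TERMINAL = 2


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(x - m) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def step(s, a):
    """Toy chain 0 - 1 - 2: action 1 moves right, action 0 moves left."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), TERMINAL)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)


def actor_critic(n_episodes=500, alpha_theta=0.1, alpha_w=0.1,
                 gamma=0.9, seed=0, max_steps=100):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(TERMINAL)]  # actor: preferences per state
    w = [0.0] * TERMINAL                           # critic: one value per state
    for _ in range(n_episodes):
        s, discount = 0, 1.0                       # discount plays the role of I
        for _ in range(max_steps):
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1
            s_next, r = step(s, a)
            v_next = 0.0 if s_next == TERMINAL else w[s_next]
            delta = r + gamma * v_next - w[s]      # Temporal Difference error
            w[s] += alpha_w * discount * delta     # critic update
            for b in range(2):                     # actor update
                grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha_theta * discount * delta * grad_log_pi
            discount *= gamma
            s = s_next
            if s == TERMINAL:
                break
    return theta, w
```

The `max_steps` cap is a practical safeguard added for this sketch; the pseudocode itself loops until a terminal state is reached.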
  • 26. Model-based Methods
    A Model-based method relies on a base parametric model and on 3 main steps:
      1 Acting: the base policy $\pi_0(a_t | s_t)$ is used to select the actions to perform in the real environment, in order to collect a set of triplets $(s, a, s')$.
      2 Model learning: from the collected experience, a new model $f(s, a)$ is fitted so as to minimize the least squares error between the model's new state and the real new state, $\sum_i \| f(s_i, a_i) - s'_i \|^2$.
      3 Planning: the value function and the policy are updated according to the new model, in order to be used in the real environment in the next iteration.
    Most used base models: Gaussian Process, Gaussian Mixture Model
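The model learning step can be illustrated with the simplest possible model class. The sketch below is an assumption-heavy stand-in: it fits a linear model $s' \approx w_s s + w_a a$ for scalar states and actions by solving the 2x2 normal equations, whereas the slides name Gaussian Processes and Gaussian Mixture Models as the usual choices. The function name is hypothetical.

```python
def fit_linear_model(triplets):
    """Least-squares fit of s' = w_s * s + w_a * a from (s, a, s') triplets."""
    ss = sum(s * s for s, a, y in triplets)
    sa = sum(s * a for s, a, y in triplets)
    aa = sum(a * a for s, a, y in triplets)
    sy = sum(s * y for s, a, y in triplets)
    ay = sum(a * y for s, a, y in triplets)
    det = ss * aa - sa * sa
    if det == 0:
        raise ValueError("the collected data does not determine the model")
    # Solve [[ss, sa], [sa, aa]] @ [w_s, w_a] = [sy, ay] by Cramer's rule.
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return lambda s, a: w_s * s + w_a * a
```

The returned callable plays the role of $f(s, a)$ in the planning step.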
  • 27. Model Predictive Control
    Model Predictive Control (MPC) is an evolution of the model-based method.
    The Model-based algorithm is vulnerable to drifting. To address this, sampling and fitting of the model are performed continuously along the trajectory.
    In MPC the whole trajectory is optimized, but only the first action is performed; then the new triplet $(s, a, s')$ is added to the observations and the planning is performed again.
    By constantly changing its plan, MPC is less vulnerable to problems in the model.
    MPC has 5 main steps:
      1 Acting
      2 Model learning
      3 Planning
      4 Execution: the first planned action is performed, and the resulting state $s'$ is observed.
      5 Dataset update: the new triplet $(s, a, s')$ is appended to the dataset; go to step 3, and every N times go to step 2.
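Steps 3 to 5 can be sketched as a receding-horizon loop. The planner below is an assumption chosen for simplicity: it enumerates every action sequence of a short horizon, which is only feasible for a small discrete action set, and the periodic model refit of step 2 is omitted (the model is passed in as a fixed callable). The names `plan`, `run_mpc`, `env_step`, `model` and `reward` are hypothetical.

```python
from itertools import product


def plan(state, model, reward, actions, horizon):
    """Return the action sequence with the highest predicted total reward."""
    best_seq, best_return = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)          # predicted next state
            total += reward(s)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq


def run_mpc(start, env_step, model, reward, actions, horizon, n_steps):
    """Plan, execute only the first action, record the triplet, then replan."""
    s, dataset = start, []
    for _ in range(n_steps):
        a = plan(s, model, reward, actions, horizon)[0]
        s_next = env_step(s, a)      # real environment response
        dataset.append((s, a, s_next))
        s = s_next
    return s, dataset
```

As a usage example, with `env_step = model = lambda s, a: s + a`, `reward = lambda s: -abs(s - 5)`, actions `(-1, 0, 1)` and a horizon of 3, the loop walks a scalar state from 0 to the target 5 and then holds it there.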
  • 28. Dyna-Q Architecture
    Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in S$ and $a \in A$.
    Do until the termination condition is reached:
      1 $s \leftarrow$ current (nonterminal) state
      2 $a \leftarrow \varepsilon$-greedy$(s, Q)$
      3 Execute action $a$
      4 Observe the resulting reward $r$ and new state $s'$, then update
        $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
      5 $Model(s, a) \leftarrow r, s'$ (assuming a deterministic environment)
      6 Planning: repeat N times:
        $s \leftarrow$ random previously observed state
        $a \leftarrow$ random action previously taken in $s$
        $r, s' \leftarrow Model(s, a)$
        $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
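A compact sketch of the procedure above, on a toy deterministic chain that is not part of the slides (states 0 and 1, terminal state 2 reached with reward 1). The environment, the random tie-breaking in the greedy choice, the cap on steps per episode and all hyperparameter values are assumptions made for this example.

```python
import random

ACTIONS = (-1, 1)
TERMINAL = 2


def env_step(s, a):
    """Deterministic toy environment: returns (reward, new state)."""
    s_next = min(max(s + a, 0), TERMINAL)
    return (1.0 if s_next == TERMINAL else 0.0), s_next


def epsilon_greedy(q, s, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(s, a)] == best])  # random ties


def q_update(q, s, a, r, s_next, alpha, gamma):
    target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def dyna_q(n_episodes=50, n_planning=10, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0, max_steps=50):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(TERMINAL + 1) for a in ACTIONS}
    model = {}
    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            a = epsilon_greedy(q, s, epsilon, rng)
            r, s_next = env_step(s, a)                 # real experience
            q_update(q, s, a, r, s_next, alpha, gamma)
            model[(s, a)] = (r, s_next)                # deterministic model
            for _ in range(n_planning):                # simulated experience
                ps = rng.choice(sorted({k[0] for k in model}))
                pa = rng.choice([k[1] for k in model if k[0] == ps])
                pr, ps_next = model[(ps, pa)]
                q_update(q, ps, pa, pr, ps_next, alpha, gamma)
            s = s_next
            if s == TERMINAL:
                break
    return q
```

The same `q_update` is applied to real and simulated transitions, which is the core idea of the architecture: the learned model lets each real step be followed by N additional updates at no sampling cost.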
  • 29. Model-based Methods Advantages and Disadvantages
    Model-based Reinforcement Learning has the strong advantage of being sample efficient.
    Once the model and the reward function are known, we can plan the optimal controls without further sampling.
    The learning phase is fast, since there is no need to wait for the environment to respond.
    On the downside, if the model is inaccurate we risk learning something completely different from reality.
    Model-based algorithms still use Model-free methods, either to construct the model or in the planning/simulation.
  • 30. Conclusions
    We have given a high-level structural overview of many classic and popular RL algorithms, but there are many variants that we have not covered.
    The main challenge in RL lies in preparing the simulation environment, which is highly dependent on the task to be performed. In fact, many real-world problems have enormous state or action spaces, and for this reason the use of parametric functions is needed.
    One of the main tasks in all the methods is to tune rewards and penalties in order to obtain the desired results.
    Another challenge is to build a learning process that converges to the optimum in a reasonable time while avoiding bias and overfitting.
    Last but not least, it is important to avoid forgetting when acquiring new observations.
  • 31. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
    Damien Ernst, Pierre Geurts, Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research 6 (2005) 503–556.