This document provides an overview of reinforcement learning. It discusses the reinforcement learning framework including actors like agents, environments, states, actions, rewards, and policies. It also summarizes several common reinforcement learning methods including value-based methods, policy-based methods, and model-based methods. Value-based methods estimate value functions using algorithms like Q-learning and deep Q-networks. Policy-based methods directly learn policies using policy gradient algorithms like REINFORCE. Model-based methods learn models of the environment and then plan based on these models.
Outline: Introduction · RL Framework · Value-based Methods · Policy-based Methods · Model-based Methods · Conclusions
Introduction
Reinforcement Learning (RL) is a growing subset of Machine Learning and one of
the most important frontiers of Artificial Intelligence.
It denotes a set of algorithms that deal with sequential decision-making.
A Reinforcement Learning algorithm can be described as a model that tells an
agent which set of actions it should take within a closed environment in order
to maximize a predefined overall reward.
The agent tries different sets of actions, evaluating the total obtained return.
After many trials, the algorithm learns which actions give a greater reward and
establishes a pattern of behavior.
Thanks to this, it is able to tell the agent which actions to take in every situation.
The goal of Reinforcement Learning is to capture higher-level logic and use more
adaptable algorithms than classical Machine Learning.
Reinforcement Learning
Reinforcement Learning applications
Robotics - Solution of high-dimensional control problems
Text mining - Production of highly readable summaries of long texts.
Trade execution - Optimization of trading
Healthcare - Medication dosing, optimization of treatment
Games - Solution of different games and achievement of superhuman performance.
Reinforcement Learning actors
Reinforcement Learning algorithms are based on the Markov Decision Process (MDP) framework.
Agent: an entity which performs actions in an environment in order to optimize a
long-term reward;
Environment (e): the scenario that the agent has to face;
Set of states (S): the set of all the possible states s of the environment;
Set of actions (A): the set of all the possible actions a that can be performed by
the agent;
State transition model P(s′ | s, a): describes the probability that the environment
state changes from s to s′ when action a is taken;
Reward (r = R(s, a)): a function that indicates the immediate real-valued
reward for taking action a in state s;
Reinforcement Learning actors
Episode (rollout): a sequence of states s_t and actions a_t for t ranging from 0
to the horizon:
The agent starts in a given state s_0 ∈ S within its environment.
At each timestep t the agent observes the current state s_t and takes an action a_t ∈ A.
The state evolves into a new state s_{t+1} ∈ S.
The agent obtains a reward r_t = R(s_t, a_t).
The agent observes the new state s_{t+1} ∈ S.
Policy function: a policy can be deterministic (π(s)) or stochastic (π(a|s)):
a deterministic policy π(s) indicates the action a performed by the agent when the
environment is in state s (a = π(s));
a stochastic policy π(a|s) describes the probability that action a is
performed by the agent when the environment is in state s.
Reinforcement Learning actors
Return G_t: the total discounted long-term reward obtained from timestep t to the
end of the episode, with r_t = R(s_t, a_t):

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ⋯   (γ < 1);

Value function V(s): the expected long-term return starting from state s:

V(s) = E[G_t | s_t = s] = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ⋯ | s_t = s];
Q-value or action value function Q(s, a): the expected long-term return obtained
starting from state s and performing action a.
The Bellman equation, the theoretical core of most RL algorithms:

V_π(s) = R(s, π(s)) + γ Σ_{s′∈S} P(s′ | s, π(s)) V_π(s′).

It can also be expressed using the Q-value as:

Q_π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) V_π(s′).
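The Bellman equation can be applied directly as a fixed-point iteration to evaluate a fixed policy. A minimal sketch on a hypothetical two-state MDP (the states, rewards and policy below are made up for illustration):

```python
# Iterative policy evaluation on a toy 2-state MDP: repeatedly apply
#   V(s) <- R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) * V(s')
# until the values stop changing.

GAMMA = 0.9

# Transition model P[s][a] = list of (next_state, probability).
P = {
    "A": {"stay": [("A", 1.0)], "go": [("B", 0.8), ("A", 0.2)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 1.0)]},
}
# Immediate reward R(s, a).
R = {("A", "stay"): 0.0, ("A", "go"): 1.0,
     ("B", "stay"): 2.0, ("B", "go"): 0.0}
# A fixed deterministic policy pi(s).
pi = {"A": "go", "B": "stay"}

V = {s: 0.0 for s in P}  # initial value estimate
for _ in range(1000):    # fixed-point iteration (gamma < 1 => contraction)
    V = {s: R[(s, pi[s])] + GAMMA * sum(p * V[s2] for s2, p in P[s][pi[s]])
         for s in P}

print(V)
```

Since γ < 1 the Bellman operator is a contraction, so the iteration converges to the unique V_π regardless of the initial estimate.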
Reinforcement Learning optimal policy
According to the Bellman equation, the optimal action value function Q*(s, a) is given by

Q*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) max_{a′} Q*(s′, a′),

and then the optimal policy π*(s) is given by

π*(s) = argmax_{a∈A} Q*(s, a).

In most real cases the state transition model and the reward function of the
model are unknown.
RL algorithms are used in order to learn the dynamics of the model and improve
the rewards.
The ε-greedy strategy is used to explore the entire environment and solve the
exploration-exploitation dilemma:

π(s) = argmax_{a∈A} Q(s, a)   with probability 1 − ε,
π(s) = random action from A   with probability ε.
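The ε-greedy rule is a one-liner in practice. A minimal sketch, with a hypothetical Q-table and state/action names:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a greedy action w.r.t. Q with probability 1 - epsilon,
    otherwise a uniformly random action (exploration)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Hypothetical Q-table: in state "s0", action "right" looks best.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
greedy = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0)
print(greedy)  # with epsilon = 0 the choice is purely greedy
```

In practice ε is often decayed over time, exploring heavily early on and exploiting the learned values later.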
The Reinforcement Learning approaches
Value-based methods:
A Value-based algorithm computes the optimal value function or the optimal
action value function by iteratively improving their estimate.
Policy-based methods:
A Policy-based algorithm looks for a policy such that the action performed at
each state is optimal to gain maximum reward in the future.
Model-based methods:
A Model-based algorithm learns a virtual model starting from the original
environment; the agent learns how to perform in the virtual model and then
gets the results returned by the virtual model.
Value Function Approximation
The goal is to estimate the optimal policy π∗(s) by iteratively approximating the
optimal action value function Q∗(s, a).
The process is based on a parametric action value function Q̂(s, a, w) of the state
s, the action a and a vector of parameters w (randomly initialized).
An iteration over every step of every episode is performed.
For every iteration, given the state s and the action a, we observe the reward
R(s, a) and the new state s′.
According to the obtained reward, the parameters are updated using gradient
descent:

Δw = α [R(s, a) + γ Q̂(s′, a′, w) − Q̂(s, a, w)] ∇_w Q̂(s, a, w).
This process converges to the approximation of the optimal action value function.
In most real cases the parametric action value function Q̂(s, a, w) is a
Neural Network, where the vector of parameters w is the vector of weights.
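The gradient update becomes especially simple with a linear function approximator, since ∇_w Q̂(s, a, w) is just the feature vector. A sketch with hand-rolled features (all names and numbers are illustrative):

```python
# One semi-gradient update step for a linear action value function
# Q_hat(s, a, w) = w . phi(s, a). With linear features, grad_w Q_hat
# is simply phi(s, a).

def q_hat(w, phi):
    return sum(wi * fi for wi, fi in zip(w, phi))

def semi_gradient_step(w, phi_sa, reward, phi_next, alpha, gamma):
    """w <- w + alpha * [r + gamma*Q(s',a',w) - Q(s,a,w)] * grad_w Q(s,a,w)."""
    td_error = reward + gamma * q_hat(w, phi_next) - q_hat(w, phi_sa)
    return [wi + alpha * td_error * fi for wi, fi in zip(w, phi_sa)]

w = [0.0, 0.0]
phi_sa = [1.0, 0.0]    # feature vector of the current (s, a)
phi_next = [0.0, 1.0]  # feature vector of the next (s', a')
w = semi_gradient_step(w, phi_sa, reward=1.0, phi_next=phi_next,
                       alpha=0.5, gamma=0.9)
print(w)  # TD error = 1.0, so w moves by alpha * phi_sa
```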
Value Function Approximation Algorithm
Input: a differentiable action value parametric function Q̂(s, a, w)
Algorithm parameters: learning rate α > 0, ε > 0
Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
Loop for each episode:
s, a ← initial state and action of episode (e.g., ε-greedy)
Loop for each step of episode:
Take action a, observe r, s′
If s′ is terminal: w ← w + α [r − Q̂(s, a, w)] ∇_w Q̂(s, a, w); go to next episode
Choose a′ = argmax_{a′∈A} Q̂(s′, a′, w) (or using ε-greedy)
w ← w + α [r + γ Q̂(s′, a′, w) − Q̂(s, a, w)] ∇_w Q̂(s, a, w)
s ← s′
a ← a′
Deep Q-Networks
A Deep Q-Network is a Value Approximation algorithm where the parametric
action value function Q̂(s, a, w) is a Deep Neural Network, in particular a
Convolutional Neural Network.
A Deep Q-Network overcomes unstable learning using mainly two techniques:
Target Network
A Target Network Q̂(s′, a′, w⁻) is a copy of the training model whose weights w⁻ are
updated less frequently.
In the gradient descent formula, the Target Network is used as the target in place of the
model:

Δw = α [r + γ max_{a′} Q̂(s′, a′, w⁻) − Q̂(s, a, w)] ∇_w Q̂(s, a, w).

This solution is useful to avoid the instabilities caused by continual changes in the target.
Experience Replay
An Experience Replay is a buffer that stores the four-tuples (s, a, r, s′) of all the
different episodes.
Each time the model is updated, a batch of tuples is selected at random from the buffer.
This solution reduces overfitting, increases learning speed through mini-batches and reuses
past tuples to avoid forgetting.
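A replay buffer of this kind can be sketched in a few lines (the capacity and dummy transitions below are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s') transitions; sampling uniformly
    at random breaks the temporal correlation between consecutive steps."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(10):  # store some dummy transitions
    buf.push(t, "a", 0.0, t + 1)
batch = buf.sample(4)
print(len(batch))
```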
Fitted Q-Iteration
Consider the deterministic case, in which the new state s′ is uniquely determined
by the state s and the action a through a function f: s′ = f(s, a).
Let L be the possibly infinite horizon.
The goal of this algorithm is to estimate the optimal action value function.
By the Bellman equation, in this situation the optimal action value function satisfies

Q*(s, a) = (HQ*)(s, a) = R(s, a) + γ max_{a′} Q*(f(s, a), a′).

Denote by Q_N(s, a) the action value functions over N steps (N ≤ L), given by

Q_N(s, a) = (HQ_{N−1})(s, a) ∀N > 0,   Q_0(s, a) = 0.

The sequence of N-step action value functions Q_N(s, a) converges to the optimal
action value function Q*(s, a) as N → L.
Fitted Q-Iteration Algorithm
Inputs: a set of four-tuples (state, action, reward, new state) F.
Initialization: N = 0, Q̂_0 = 0.
Iteration: repeat until the stopping conditions are reached:
N ← N + 1
Build the training set TS = {(i^l, o^l), l = 1, …, #F} based on the function
Q̂_{N−1} and on the full set of four-tuples F, where
i^l = (s_t^l, a_t^l),
o^l = r_t^l + γ max_{a∈A} Q̂_{N−1}(s_{t+1}^l, a).
Use a regression algorithm to obtain the approximated N-step action value function
Q̂_N(s, a), training on the obtained dataset TS.
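The iteration above can be sketched end-to-end on a toy problem. Here a trivial exact-memorization "regressor" stands in for the tree-based regression used in practice, and the four-tuples come from a made-up three-state chain:

```python
# Minimal Fitted Q-Iteration on a tiny deterministic chain
# 0 -a-> 1 -a-> 2, with reward 1 on reaching state 2.

GAMMA = 0.9

# F: four-tuples (s, a, r, s_next) collected beforehand.
F = [(0, "a", 0.0, 1), (1, "a", 1.0, 2), (2, "a", 0.0, 2)]
ACTIONS = ["a"]

def fit(inputs, outputs):
    """Trivial exact-memorization 'regressor': (s, a) -> target."""
    table = dict(zip(inputs, outputs))
    return lambda s, a: table.get((s, a), 0.0)

q_hat = lambda s, a: 0.0  # Q_0 = 0
for n in range(50):       # N iterations of FQI
    inputs = [(s, a) for s, a, r, s2 in F]
    outputs = [r + GAMMA * max(q_hat(s2, a2) for a2 in ACTIONS)
               for s, a, r, s2 in F]
    q_hat = fit(inputs, outputs)

print(round(q_hat(0, "a"), 3), round(q_hat(1, "a"), 3))
```

With a real continuous state space the memorization table would be replaced by a generalizing regressor (e.g. the tree-based methods used in the slides' example).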
Fitted Q-Iteration Example - Car on a Hill
Consider a car, modeled by a point mass, that is traveling on a hill whose profile is described by a function H(p).
Objective - The objective is to bring the car in a minimum time to the top of the
hill with a limited speed.
State space - The state of the system is determined by the position p and the
speed v of the car. The state space is given by:

S = {(p, v) ∈ R^2 : |p| ≤ 1 and |v| ≤ 3}.

Every other combination of position and speed is considered a terminal state.
Action space - The action a acts directly on the acceleration of the car and can
only assume two extreme values: full acceleration (a = 4) or full deceleration
(a = −4) (A = {−4, 4}).
Fitted Q-Iteration Example - Car on a Hill
System dynamics - Time is discretized in timesteps of 0.1 s. Given the state
(p, v) and the action a at timestep t, the state at timestep t + 1 is
computed by solving the dynamics of the system with a numerical method:

ṗ = v,
v̇ = u / (1 + H′(p)^2) − g H′(p) / (1 + H′(p)^2) − v^2 H′(p) H″(p) / (1 + H′(p)^2)

(u denotes the applied action, g = 9.81).
Reward function - The reward function is defined through this expression:

r(s_t, a_t) = −1 if p_{t+1} < −1 or |v_{t+1}| > 3,
r(s_t, a_t) = 1 if p_{t+1} > 1 and |v_{t+1}| ≤ 3,
r(s_t, a_t) = 0 otherwise.

Discount factor - The discount factor γ has been chosen equal to 0.95.
Starting point - The car stopped at the bottom of the hill (p, v) = (−0.5, 0).
Regressor - The regressor used is an Extra Tree Regressor.
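The piecewise reward above translates directly into code (the bounds are taken from the state-space definition):

```python
# The Car-on-a-Hill reward, written out directly: -1 for leaving the
# valid state space the wrong way, +1 for reaching the top with
# limited speed, 0 otherwise.

def reward(p_next, v_next):
    if p_next < -1 or abs(v_next) > 3:
        return -1  # fell off the left side or went too fast
    if p_next > 1 and abs(v_next) <= 3:
        return 1   # reached the top of the hill with limited speed
    return 0       # still driving

print(reward(-1.2, 0.0), reward(1.1, 2.0), reward(0.0, 1.0))
```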
Fitted Q-Iteration Example - Car on a Hill
The Fitted Q-Iteration was performed for N = 1 to 50.
For N > 20, Q̂_N(s, a) ≈ Q̂_{N+1}(s, a).
Left figure: the action chosen for every combination (p, v), according to the
action value function Q̂_20(s, a) (black = deceleration, white = acceleration).
Right figure: the optimal trajectory according to Q̂_20(s, a).
A full implementation of Fitted Q-Iteration can be found on GitHub.
Policy Gradient
The goal of the Policy Gradient method is to find the vector of parameters θ that
maximizes the value function V (s, θ) under a parametric policy π(a|s, θ).
The process is based on a parametric policy π(a|s, θ) differentiable with respect
to the vector of parameters θ (randomly initialized).
In this case we choose a stochastic policy (Stochastic Policy Gradient).
An iteration over every episode is performed.
For each episode we generate a sequence of triplets (state, action, reward),
choosing the actions according to the parametric policy π(a|s, θ).
For every timestep t in the resulting sequence we compute the total discounted
long-term reward G_t:

G_t = Σ_{k=t+1}^{T} γ^{k−t−1} R_k.

Then the vector of parameters θ_t is modified using a gradient update process:

θ_{t+1} = θ_t + α ∇_θ V(s, θ) = θ_t + α G_t ∇_θ ln π(a_t | s_t, θ).
The process converges to the approximated optimal policy.
Policy Gradient Reinforce Algorithm
Input: a differentiable policy parameterization π(a|s, θ)
Algorithm parameter: learning rate α > 0
Initialize policy parameter θ ∈ R^d (for example to 0)
Loop for each episode:
Generate an episode (s_0, a_0, r_1), …, (s_{T−1}, a_{T−1}, r_T), following π(·|·, θ)
Loop for each step of the episode t = 0, 1, …, T − 1:
G ← Σ_{k=t+1}^{T} γ^{k−t−1} r_k
θ ← θ + α γ^t G ∇_θ ln π(a_t | s_t, θ)
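The REINFORCE loop can be exercised on the smallest possible case, a two-armed bandit with one-step episodes (so the discount plays no role). The rewards and learning rate are made up, and a softmax policy over two logits is assumed:

```python
import math, random

# REINFORCE on a trivial 2-armed bandit: each episode is a single
# action, G is the immediate reward, and the softmax score function
# grad ln pi(a) has components (1[k == a] - pi(k)).

random.seed(0)
theta = [0.0, 0.0]        # policy parameters, one logit per action
ALPHA = 0.1
TRUE_REWARD = [0.0, 1.0]  # action 1 is better

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample a ~ pi(.|theta)
    G = TRUE_REWARD[a]                            # one-step return
    for k in range(2):                            # theta += alpha*G*grad ln pi
        theta[k] += ALPHA * G * ((1.0 if k == a else 0.0) - probs[k])

print(softmax(theta)[1])  # probability of the better action, close to 1
```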
Examples of Parametric Policy
Softmax Policy
The Softmax Policy is mostly used in the case of discrete actions:

π(a|s, θ) = e^{φ(s,a)^T θ} / Σ_{k=1}^{N} e^{φ(s,a_k)^T θ},

where φ(s, a) is the feature vector related to the state and the action.
The explicit formula for the gradient of the log-policy is:

∇_θ log π(a|s, θ) = φ(s, a) − E_{π_θ}[φ(s, ·)].

Gaussian Policy
The Gaussian Policy is used in the case of a continuous action space:

π(a|s, θ) = (1 / √(2πσ^2)) e^{−(a−μ(s))^2 / (2σ^2)},

where φ(s) is the feature vector of the state, μ(s) = φ(s)^T θ, and σ can be fixed
or parametric.
The explicit formula for the gradient of the log-policy is:

∇_θ log π(a|s, θ) = (a − μ(s)) φ(s) / σ^2.
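The closed-form softmax score function can be sanity-checked against a finite-difference gradient; a small sketch with made-up features for a single state and two actions:

```python
import math

# Check grad_theta log pi(a|s) = phi(s,a) - E_pi[phi(s,.)] numerically.

PHI = {0: [1.0, 0.0], 1: [0.2, 1.0]}  # phi(s, a) for a fixed state s

def log_pi(theta, a):
    logits = [sum(t * f for t, f in zip(theta, PHI[b])) for b in (0, 1)]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[a] - log_z

theta = [0.3, -0.1]
probs = [math.exp(log_pi(theta, b)) for b in (0, 1)]
a = 1
# Closed form: phi(s,a) - sum_b pi(b) phi(s,b).
closed = [PHI[a][i] - sum(probs[b] * PHI[b][i] for b in (0, 1))
          for i in range(2)]
# Central finite differences on log pi.
eps = 1e-6
numeric = []
for i in range(2):
    tp = list(theta); tp[i] += eps
    tm = list(theta); tm[i] -= eps
    numeric.append((log_pi(tp, a) - log_pi(tm, a)) / (2 * eps))

print(all(abs(c - n) < 1e-5 for c, n in zip(closed, numeric)))
```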
Policy Gradient Advantages and Disadvantages
Advantages
A Policy Gradient method is a simpler process compared with value-based
methods.
It naturally handles continuous action spaces.
It usually has better convergence properties with respect to other methods.
It avoids the growth in the usage of memory and in the computation time when
the action and state sets are large.
It can learn stochastic policies.
It avoids the need for an ε-greedy exploration strategy, since a stochastic policy explores on its own.
Disadvantages
A Policy Gradient method typically converges to a local rather than global
optimum.
It usually has high variance.
Policy Gradient Example - CartPole
CartPole is a game where a pole is attached by an unactuated joint to a cart,
which moves along a frictionless track. The pole starts upright.
The goal is to prevent it from falling by increasing and reducing the cart’s velocity.
State space - A single state is composed of 4 elements:
cart position
cart velocity
pole angle
pole angular velocity
The game ends when the pole falls, which is when the pole angle is more than
±12◦, or the cart position reaches the edge of the display.
Action space - The agent can take only 2 actions:
push the cart to the left
push the cart to the right
Reward - For every step taken (including the termination step), the reward is
increased by 1.
Policy Gradient Example - CartPole
The problem is solved with the Policy Gradient method (implementation on GitHub).
Base Policy: Softmax Policy
Discount factor γ = 0.95, learning rate α = 0.1, max iterations per episode: 1000
After about 60 epochs (1 epoch = 20 consecutive episodes) the agent learns a
policy that achieves a reward of 1000.
Policy Gradient Example - CartPole
This chart shows how the average reward per epoch evolves as a function of the
total number of epochs, for different values of the discount factor γ.
Actor-Critic Method
The Actor-Critic method differs from the Policy Gradient method because it
estimates both the policy and the value function, and updates both.
To reduce the high variance of Policy Gradient, the Actor-Critic method
subtracts a baseline b(s) from G_t.
The Temporal Difference error δ = G_t − b(s) is used to update the vector of
parameters θ in place of the long-term reward G_t.
The most used baseline is the estimation of the value function V (s).
The value function V(s) is learned with a Neural Network, whose output is the
approximated value function V̂(s, w), where w is the vector of weights.
Then in every iteration the Temporal Difference error δ is used to adjust the
vector of parameters θ and the vector of weights w.
Actor-Critic:
1 The Critic estimates the value function V(s).
2 The Actor updates the policy distribution in the direction suggested by the Critic.
Actor-Critic Algorithm
Input: a differentiable policy parameterization π(a | s, θ)
Input: a differentiable state-value function parameterization V̂(s, w)
Algorithm parameters: step sizes α_θ > 0, α_w > 0
Initialize policy parameter θ ∈ R^d and state-value weights w ∈ R^d (e.g., to 0)
Loop forever (for each episode):
Initialize s (first state of episode)
I ← 1
Loop while s is not terminal (for each time step):
a ∼ π(· | s, θ)
Take action a, observe s′, r
δ ← r + γ V̂(s′, w) − V̂(s, w)   (if s′ is terminal, then V̂(s′, w) ≐ 0)
w ← w + α_w I δ ∇_w V̂(s, w)
θ ← θ + α_θ I δ ∇_θ ln π(a | s, θ)
I ← γ I
s ← s′
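A single iteration of the loop above, with a tabular critic and a two-action softmax actor (at t = 0 the factor I equals 1, so it is omitted; all states and numbers are illustrative):

```python
import math

# One actor-critic update: compute the TD error delta, then move the
# critic (tabular V) and the actor (softmax logits) along it.

GAMMA, ALPHA_W, ALPHA_TH = 0.9, 0.5, 0.1

V = {"s": 0.0, "s2": 1.0}  # critic: tabular state values
theta = [0.0, 0.0]         # actor: one logit per action

def pi(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Suppose in state "s" the agent took action 0 and observed r, s'.
s, a, r, s2 = "s", 0, 1.0, "s2"
delta = r + GAMMA * V[s2] - V[s]  # TD error: 1 + 0.9*1 - 0 = 1.9
V[s] += ALPHA_W * delta           # critic update (grad of tabular V[s] is 1)
probs = pi(theta)
for k in range(2):                # actor update with the softmax score
    theta[k] += ALPHA_TH * delta * ((1.0 if k == a else 0.0) - probs[k])

print(round(V["s"], 3), round(theta[0], 4))
```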
Model-based Methods
A Model-based method is built on a base parametric model and on 3 main steps:
1 Acting: the base policy π0 (at | st ) is used to select the actions to perform in the real
environment, in order to collect a set of triplets (s, a, s ).
2 Model learning: from the collected experience, a new model f (s, a) is fitted in order
to minimize the least squares error between the model's predicted new state and the real new state:

Σ_i ‖f(s_i, a_i) − s′_i‖^2.
3 Planning: the value function and the policy are updated according to the new model, in
order to be used in the real environment in the next iteration.
Most used base models: Gaussian Process, Gaussian Mixture Model
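In the linear case the model-learning step reduces to an ordinary least squares fit. A 1-D sketch where the triplets are generated from a known toy system s′ = 0.8s + 0.5a, so the recovered coefficients can be checked:

```python
# Fit a linear dynamics model s' ~ c0*s + c1*a by least squares on
# collected (s, a, s') triplets -- a minimal sketch of "model learning".

def lstsq_2(xs, ys):
    """Least-squares fit of y = c0*x0 + c1*x1 via the normal equations."""
    s00 = sum(x[0] * x[0] for x in xs); s01 = sum(x[0] * x[1] for x in xs)
    s11 = sum(x[1] * x[1] for x in xs)
    b0 = sum(x[0] * y for x, y in zip(xs, ys))
    b1 = sum(x[1] * y for x, y in zip(xs, ys))
    det = s00 * s11 - s01 * s01
    return [(s11 * b0 - s01 * b1) / det, (s00 * b1 - s01 * b0) / det]

# Toy 1-D system: s' = 0.8*s + 0.5*a (no noise, so the fit is exact).
data = [(s, a, 0.8 * s + 0.5 * a)
        for s in (-1.0, 0.0, 1.0, 2.0) for a in (-1.0, 1.0)]
coef = lstsq_2([(s, a) for s, a, _ in data], [s2 for _, _, s2 in data])
print([round(c, 6) for c in coef])
```

With a Gaussian Process or Gaussian Mixture base model the fitting step changes, but the objective (minimizing prediction error on observed transitions) is the same.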
Model Predictive Control
The Model Predictive Control (MPC) is an evolution of the model-based method.
The basic Model-based algorithm is vulnerable to drifting.
To address that, sampling and fitting of the model are performed continuously
during the trajectory.
In MPC the whole trajectory is optimized, but only the first action is performed,
then the new triplet (s, a, s ) is added to the observations and the planning is
performed again.
By constantly changing plan, MPC is less vulnerable to problems in the model.
MPC has 5 main steps:
1 Acting
2 Model learning
3 Planning
4 Execution: the first planned action is performed, and the resulting state s is observed.
5 Dataset update: the new triplet (s, a, s′) is appended to the dataset; go to step 3,
and every N iterations go back to step 2.
Dyna-Q Architecture
Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A
Do until the termination condition is met:
1 s ← current (nonterminal) state
2 a ← ε-greedy(s, Q)
3 Execute action a
4 Observe the resulting reward r and the new state s′, and update
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
5 Model(s, a) ← (r, s′) (assuming a deterministic environment)
6 Planning: repeat N times:
s ← random previously observed state
a ← random action previously taken in s
(r, s′) ← Model(s, a)
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
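The Dyna-Q loop above can be sketched on a hypothetical four-state chain (the state numbering, rewards and hyperparameters are made up):

```python
import random

# Dyna-Q on a deterministic chain 0 -> 1 -> 2 -> 3 (reward 1 at state 3):
# each real step is followed by PLAN_STEPS simulated updates from the model.

random.seed(1)
ALPHA, GAMMA, EPS, PLAN_STEPS = 0.5, 0.9, 0.1, 20
ACTIONS = ["fwd", "back"]
Q, model = {}, {}

def step(s, a):  # true environment dynamics
    s2 = min(s + 1, 3) if a == "fwd" else max(s - 1, 0)
    return (1.0 if s2 == 3 else 0.0), s2

def q(s, a):
    return Q.get((s, a), 0.0)

def q_update(s, a, r, s2):
    target = r + GAMMA * max(q(s2, b) for b in ACTIONS)
    Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))

s = 0
for t in range(500):
    a = (random.choice(ACTIONS) if random.random() < EPS
         else max(ACTIONS, key=lambda b: q(s, b)))
    r, s2 = step(s, a)
    q_update(s, a, r, s2)        # direct RL update
    model[(s, a)] = (r, s2)      # model update (deterministic env)
    for _ in range(PLAN_STEPS):  # planning with simulated experience
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)
    s = 0 if s2 == 3 else s2     # restart the episode at the goal

print(q(0, "fwd") > q(0, "back"))
```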
Model-based Methods Advantages and Disadvantages
Model-based Reinforcement Learning has the strong advantage of being sample
efficient.
Once the model and the reward function are known, we can plan the optimal
controls without further sampling.
The learning phase is fast, since there is no need to wait for the environment to
respond.
On the downside, if the model is inaccurate we risk learning something
completely different from reality.
Model-based algorithms still use Model-free methods either to construct the model
or in the planning/simulation phase.
Conclusions
We have given a high-level structural overview of many classic and popular RL
algorithms, but there are many variants that we have not covered.
The main challenge in RL lies in preparing the simulation environment, which is
highly dependent on the task to be performed.
In fact, many real-world problems have enormous state or action spaces, and for
this reason the use of parametric functions is needed.
One of the main tasks in all the methods is to optimize rewards and penalties in
order to obtain the desired results.
Another challenge is to build a learning process that converges to the optimum in
a reasonable time while avoiding bias and overfitting.
Last but not least, it’s important to avoid forgetting when acquiring new
observations.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
Damien Ernst, Pierre Geurts, Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research 6 (2005), 503–556.