Reinforcement Learning
Marco Del Pra
June 30, 2020
Introduction
Reinforcement Learning (RL) is a growing subset of Machine Learning and one of
the most important frontiers of Artificial Intelligence
It denotes a set of algorithms that deal with sequential decision-making.
A Reinforcement Learning algorithm can be described as a model that tells an agent which set of actions it should take within a closed environment in order to maximize a predefined overall reward.
The agent tries different sets of actions, evaluating the total obtained return.
After many trials, the algorithm learns which actions yield a greater reward and establishes a pattern of behavior.
Thanks to this, it is able to tell the agent which actions to take in every situation.
Compared with classical Machine Learning, the goal of Reinforcement Learning is to capture higher-level decision logic and to use more adaptable algorithms.
Reinforcement Learning applications
Robotics - Solution of high-dimensional control problems
Text mining - Production of highly readable summaries of long texts.
Trade execution - Optimization of trading
Healthcare - Medication dosing, optimization of treatment
Games - Solution of different games and achievement of superhuman
performances.
Reinforcement Learning actors
Reinforcement Learning algorithms are based on the Markov Decision Process (MDP).
Agent: an entity which performs actions in an environment in order to optimize a
long-term reward;
Environment (e): the scenario that the agent has to face;
Set of states (S): the set of all the possible states s of the environment;
Set of actions (A): the set of all the possible actions a that can be performed by
the agent;
State transition model P(s'|s, a): describes the probability that the environment state changes from s to s' when the agent takes action a;
Reward (r = R(s, a)): a function that indicates the immediate real-valued reward for taking action a in state s;
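To make these actors concrete, here is a minimal sketch of a finite MDP written as plain Python data structures; the two-state example, the state and action names, and the numeric probabilities are purely illustrative assumptions, not part of the original slides.

```python
import numpy as np

# A minimal finite MDP, written out explicitly (hypothetical two-state example).
states = ["s0", "s1"]                 # set of states S
actions = ["stay", "move"]            # set of actions A

# State transition model P(s'|s, a): dict keyed by (s, a) -> {s': probability}
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# Reward function r = R(s, a): immediate real-valued reward for action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
    ("s1", "stay"): 0.5, ("s1", "move"): 0.0,
}

def step(s, a, rng=np.random.default_rng()):
    """Environment step: sample the next state from P(.|s, a) and return the reward."""
    probs = P[(s, a)]
    s_next = rng.choice(list(probs.keys()), p=list(probs.values()))
    return s_next, R[(s, a)]
```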
Reinforcement Learning actors
Episode (rollout): a sequence of states s_t and actions a_t for t that varies from 0 to the horizon;
The agent starts in a given state within its environment, s_0 ∈ S
At each timestep t the agent observes the current state s_t and takes an action a_t ∈ A
The state evolves into a new state s_{t+1} ∈ S
The agent obtains a reward r_t = R(s_t, a_t)
The agent observes the new state s_{t+1} ∈ S
Policy function: a policy can be deterministic (π(s)) or stochastic (π(a|s)):
a deterministic policy π(s) indicates the action a performed by the agent when the environment is in the state s (a = π(s));
a stochastic policy π(a|s) describes the probability that action a is performed by the agent when the environment is in the state s.
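The episode loop just described maps directly onto a small rollout routine. The sketch below is a minimal illustration assuming a Gym-style interface (env.reset() returning a state, env.step(a) returning a (new state, reward, done) triple); that interface and the default horizon are assumptions, not part of the slides.

```python
def rollout(env, policy, horizon=1000):
    """Generate one episode: a list of (s_t, a_t, r_t) triplets following the policy."""
    trajectory = []
    s = env.reset()                      # the agent starts in a given state s0
    for t in range(horizon):
        a = policy(s)                    # deterministic pi(s) or a sample from pi(a|s)
        s_next, r, done = env.step(a)    # environment returns the new state and the reward
        trajectory.append((s, a, r))
        s = s_next
        if done:                         # terminal state reached before the horizon
            break
    return trajectory
```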
Reinforcement Learning actors
Return G_t: the total long-term reward with discount obtained at the end of the episode, according to r_t = R(s_t, a_t):
$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots$, with $\gamma < 1$;
Value function V(s): the expected long-term return at the end:
$V(s) = \mathbb{E}\left[G_t \mid s_t = s\right] = \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s\right]$;
Q-Value or Action value function Q(s, a): the expected long-term return at the end when performing action a in state s.
The Bellman equation: the theoretical core in most RL algorithms:
$V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \, V_\pi(s')$.
It can also be expressed using the Q-value as:
$Q_\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V_\pi(s')$.
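When P and R are known, the Bellman equation can be applied as an iterative backup to evaluate a fixed policy. A minimal sketch, assuming the finite MDP is given as numpy arrays with the layout P[s, a, s'] and R[s, a] (an illustrative convention, not from the slides):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.95, tol=1e-8):
    """Iterate the Bellman backup V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
    for a fixed deterministic policy pi.

    P: array (n_states, n_actions, n_states) of transition probabilities
    R: array (n_states, n_actions) of immediate rewards
    policy: array (n_states,) with the action chosen in each state
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]          # P(s'|s, pi(s)), shape (n_states, n_states)
    R_pi = R[idx, policy]          # R(s, pi(s)),    shape (n_states,)
    V = np.zeros(n_states)
    while True:
        V_new = R_pi + gamma * P_pi @ V   # one Bellman backup for every state at once
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Repeating the analogous backup with a max over actions (the optimality form introduced on the next slide) yields value iteration.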
Reinforcement Learning optimal policy
According to the Bellman equation, the optimal action value function Q*(s, a) is given by
$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q^*(s', a')$,
and then the optimal policy π*(s) is given by
$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$.
In most real cases the state transition model and the reward function of the
model are unknown.
RL algorithms are used in order to learn the dynamics of the model and improve
the rewards.
The ε-greedy strategy is used to explore the entire environment and solve the exploration-exploitation dilemma:
$\pi(s) = \begin{cases} \arg\max_{a \in A} Q^*(s, a) & \text{with probability } 1 - \varepsilon \\ \text{a random action from } A & \text{with probability } \varepsilon \end{cases}$
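A minimal sketch of ε-greedy action selection over a tabular action value function; the table layout Q[state, action] is an assumption made for illustration.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
    """With probability 1-epsilon take the greedy action argmax_a Q[s, a],
    with probability epsilon take a random action (exploration)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action from A
    return int(np.argmax(Q[s]))               # exploit: greedy action
```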
The Reinforcement Learning approaches
Value-based methods:
A Value-based algorithm computes the optimal value function or the optimal action value function by iteratively improving their estimates.
Policy-based methods:
A Policy-based algorithm looks for a policy such that the action performed at
each state is optimal to gain maximum reward in the future.
Model-based methods:
A Model-based algorithm learns a virtual model starting from the original environment; the agent learns how to perform in the virtual model and then gets the results returned by the virtual model.
Value Function Approximation
The goal is to estimate the optimal policy π∗(s) by iteratively approximating the
optimal action value function Q∗(s, a).
The process is based on a parametric action value function Q̂(s, a, w) of the state s, of the action a and of a vector of parameters w (randomly initialized).
An iteration over every step of every episode is performed.
For every iteration, given the state s and the action a, we observe the reward R(s, a) and the new state s′.
According to the obtained reward, the parameters are updated using gradient descent:
$\Delta w = \alpha \left[ R(s, a) + \gamma \hat{Q}(s', a', w) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w)$.
This process converges to an approximation of the optimal action value function.
In most real cases the parametric action value function Q̂(s, a, w) is a Neural Network, where the vector of parameters w is the vector of weights.
Value Function Approximation Algorithm (Episodic Semi-gradient Sarsa)
Input: a differentiable action value parametric function Q̂(s, a, w)
Algorithm parameters: learning rate α > 0, ε > 0
Initialize value-function weights w ∈ R^d (e.g., randomly, or w = 0)
Loop for each episode:
s, a ← initial state and action of the episode (e.g., ε-greedy)
Loop for each step of the episode:
Take action a, observe r, s′
If s′ is terminal: w ← w + α [r − Q̂(s, a, w)] ∇_w Q̂(s, a, w), go to the next episode
Choose a′ = argmax_{a′∈A} Q̂(s′, a′, w) (or using ε-greedy)
w ← w + α [r + γ Q̂(s′, a′, w) − Q̂(s, a, w)] ∇_w Q̂(s, a, w)
s ← s′
a ← a′
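A compact Python sketch of this loop with a linear parametric action value function Q̂(s, a, w) = w[a]·φ(s); the feature map φ, the Gym-style environment interface, and the hyperparameter values are assumptions made for illustration, not part of the original slides.

```python
import numpy as np

def train_linear_q(env, phi, n_actions, episodes=500,
                   alpha=0.01, gamma=0.99, epsilon=0.1, seed=0):
    """Episodic semi-gradient update of a linear action value function
    Q_hat(s, a, w) = w[a] . phi(s), following the pseudocode above."""
    rng = np.random.default_rng(seed)
    n_features = len(phi(env.reset()))
    w = np.zeros((n_actions, n_features))        # one weight vector per action

    def q(x, a):
        return w[a] @ x

    def eps_greedy(x):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(x, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        x = phi(s)
        a = eps_greedy(x)
        done = False
        while not done:
            s_next, r, done = env.step(a)        # assumed Gym-style 3-tuple
            if done:
                w[a] += alpha * (r - q(x, a)) * x            # gradient of a linear Q is phi(s)
                break
            x_next = phi(s_next)
            a_next = eps_greedy(x_next)
            w[a] += alpha * (r + gamma * q(x_next, a_next) - q(x, a)) * x
            x, a = x_next, a_next
    return w
```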
Deep Q-Networks
A Deep Q-Network is a Value Approximation algorithm where the parametric
action value function ˆQ(s, a, w) is a Deep Neural Network, and in particular a
Convolutional Neural Network.
A Deep Q-Network overcomes unstable learning using mainly two techniques:
Target Network
A Target Network Q̂(s′, a′, w⁻) is a copy of the training model, with weights w⁻ that are updated less frequently.
In the gradient descent formula, the Target Network is used as target in place of the model:
$\Delta w = \alpha \left[ r + \gamma \max_{a'} \hat{Q}(s', a', w^-) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w)$.
This solution is useful to avoid the instabilities caused by continuous changes in the target.
Experience Replay
An Experience Replay is a buffer that stores the four-tuples (s, a, r, s′) of all the different episodes.
Each time the model is updated, it randomly selects a batch of tuples.
This solution reduces overfitting, increases learning speed with mini-batches and reuses
past tuples to avoid forgetting.
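A sketch of these two ingredients in isolation: a replay buffer that stores (s, a, r, s′, done) tuples and samples random mini-batches, plus a helper that copies (or slowly blends) the online weights into the target weights. The network itself is abstracted away as a dict of parameter arrays; all names and shapes are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) tuples sampled uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))                 # columns: states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)

def sync_target(online_params, target_params, tau=1.0):
    """Copy (tau = 1) or slowly blend (tau < 1) the online weights into the target weights.

    Both arguments are dicts of numpy arrays standing in for the weights of the
    online and target networks."""
    for key in online_params:
        target_params[key] = tau * online_params[key] + (1.0 - tau) * target_params[key]
```

In a full DQN, each sampled mini-batch would be used to build the targets r + γ max_a′ Q̂(s′, a′, w⁻) and to take one gradient step on the online network.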
Fitted Q-Iteration
Consider the deterministic case, in which the new state s′ is uniquely determined by the state s and the action a through a function f: s′ = f(s, a).
Let L be the horizon, which may possibly be infinite.
The goal of this algorithm is to estimate the optimal action value function.
By the Bellman equation, in this situation the optimal action value function satisfies
$Q^*(s, a) = (HQ^*)(s, a) = R(s, a) + \gamma \max_{a'} Q^*(f(s, a), a')$,
where H denotes the Bellman operator.
Denote by Q_N(s, a) the action value functions over N steps (N ≤ L), given by
$Q_N(s, a) = (HQ_{N-1})(s, a) \;\; \forall N > 0, \qquad Q_0(s, a) = 0$.
The sequence of N-step action value functions Q_N(s, a) converges to the optimal action value function Q*(s, a) as N → L.
Fitted Q-Iteration Algorithm
Inputs: a set of four-tuples (state, action, reward, new state) F.
Initialization: N = 0, Q̂_0 = 0.
Iteration: repeat until the stopping conditions are reached:
N ← N + 1
Build the training set TS = {(i^l, o^l), l = 1, …, #F} based on the function Q̂_{N−1} and on the full set of four-tuples F, where
i^l = (s_t^l, a_t^l)
o^l = r_t^l + γ max_{a∈A} Q̂_{N−1}(s_{t+1}^l, a)
Use a regression algorithm to obtain the approximated N-step action value function Q̂_N(s, a), training it on the obtained dataset TS.
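A sketch of this iteration in Python using scikit-learn's ExtraTreesRegressor as the regression algorithm (the same family of regressor used in the Car-on-a-Hill example below); the array layout of the four-tuples and the discrete action set are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, actions, n_iterations=50, gamma=0.95):
    """Fitted Q-Iteration on a fixed set of four-tuples (S[i], A[i], R[i], S_next[i]).

    S, S_next: arrays of shape (n, state_dim); A, R: arrays of shape (n,);
    actions: list of the possible (discrete) actions."""
    X = np.column_stack([S, A])                  # regression inputs i^l = (s_t, a_t)
    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            y = R.copy()                         # Q_1 targets: just the immediate reward
        else:
            # o^l = r + gamma * max_a Q_{N-1}(s_{t+1}, a)
            q_next = np.column_stack([
                q_model.predict(np.column_stack([S_next, np.full(len(S_next), a)]))
                for a in actions
            ])
            y = R + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return q_model                               # approximation of Q_N(s, a)
```

The greedy policy is then obtained by evaluating the returned model on (s, a) for every candidate action and taking the argmax.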
Fitted Q-Iteration Example - Car on a Hill
Consider a car, modeled by a point mass, that is traveling on a hill whose profile is described by a function H(p) of the position p.
Objective - The objective is to bring the car in a minimum time to the top of the
hill with a limited speed.
State space - The state of the system is determined by the position p and the speed v of the car. The state space is given by
$S = \{(p, v) \in \mathbb{R}^2 : |p| \le 1 \text{ and } |v| \le 3\}$.
Every other combination of position and speed is considered a terminal state.
Action space - The action a acts directly on the acceleration of the car and can
only assume two extreme values: full acceleration (a = 4) or full deceleration
(a = −4) (A = {−4, 4}).
Fitted Q-Iteration Example - Car on a Hill
System dynamics - Time is discretized in timesteps of 0.1 s. Given the state (p, v) and the action a at timestep t, the state at timestep t + 1 is computed by numerically integrating the dynamics of the system:
$\dot{p} = v, \qquad \dot{v} = \frac{u}{1 + H'(p)^2} - \frac{g\,H'(p)}{1 + H'(p)^2} - \frac{v^2\,H'(p)\,H''(p)}{1 + H'(p)^2}$,
where u is the applied acceleration (the action) and g = 9.81.
Reward function - The reward function r(s, a) is defined through this expression:
$r(s_t, a_t) = \begin{cases} -1 & \text{if } p_{t+1} < -1 \text{ or } |v_{t+1}| > 3 \\ 1 & \text{if } p_{t+1} > 1 \text{ and } |v_{t+1}| \le 3 \\ 0 & \text{otherwise} \end{cases}$
Discount factor - The discount factor γ has been chosen equal to 0.95.
Starting point - The car starts at rest at the bottom of the hill, (p, v) = (−0.5, 0).
Regressor - The regressor used is an Extra Tree Regressor.
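The reward function above translates directly into code; this is the only part of the example fully specified on the slide (the hill profile H(p) and the numerical integrator are not reproduced here).

```python
def reward(p_next, v_next):
    """Car-on-a-Hill reward: -1 if the car leaves the domain on the left or goes too fast,
    +1 if it reaches the top of the hill within the speed limit, 0 otherwise."""
    if p_next < -1 or abs(v_next) > 3:
        return -1
    if p_next > 1 and abs(v_next) <= 3:
        return 1
    return 0
```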
Fitted Q-Iteration Example - Car on a Hill
The Fitted Q-Iteration was performed for N = 1 to 50.
For N > 20, $\hat{Q}_N(s, a) \approx \hat{Q}_{N+1}(s, a)$.
Left figure: the action chosen for every combination (p, v) according to the action value function $\hat{Q}_{20}(s, a)$ (black = deceleration, white = acceleration).
Right figure: the optimal trajectory according to $\hat{Q}_{20}(s, a)$.
A full implementation of Fitted Q-Iteration can be found on Github.
Policy Gradient
The goal of the Policy Gradient method is to find the vector of parameters θ that
maximizes the value function V (s, θ) under a parametric policy π(a|s, θ).
The process is based on a parametric policy π(a|s, θ) differentiable with respect
to the vector of parameters θ (randomly initialized).
In this case we choose a stochastic policy (Stochastic Policy Gradient).
An iteration over every episode is performed.
For each episode we generate a sequence of triplets (state, action, reward), choosing at every timestep t the action according to the parametric policy π(a|s, θ).
For every timestep t in the resulting sequence we compute the total long-term reward with discount G_t:
$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$.
Then the vector of parameters θ_t is modified using a gradient update process:
$\theta_{t+1} = \theta_t + \alpha \nabla_\theta V(s, \theta) = \theta_t + \alpha\, G_t\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta)$.
The process converges to the approximated optimal policy.
Policy Gradient Reinforce Algorithm
Input: a differentiable policy parameterization π(a|s, θ)
Algorithm parameter: learning rate α > 0
Initialize policy parameter θ ∈ Rd (for example to 0)
Loop for each episode:
Generate an episode (s_0, a_0, r_1), …, (s_{T−1}, a_{T−1}, r_T), following π(· | ·, θ)
Loop for each step of the episode t = 0, 1, …, T − 1:
G ← Σ_{k=t+1}^{T} γ^{k−t−1} r_k
θ ← θ + α γ^t G ∇_θ ln π(a_t | s_t, θ)
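A sketch of the REINFORCE loop above for a discrete action space with a softmax policy over state features; the feature map phi, the Gym-style environment interface, and the hyperparameter values are assumptions (the full CartPole implementation mentioned later is linked from Github).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, phi, n_actions, episodes=1000, alpha=0.01, gamma=0.95, seed=0):
    """Monte-Carlo policy gradient (REINFORCE) with a linear-softmax policy
    pi(a|s, theta) proportional to exp(theta[a] . phi(s))."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, len(phi(env.reset()))))

    for _ in range(episodes):
        # Generate an episode (s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_T) following pi
        feats, acts, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            x = phi(s)
            probs = softmax(theta @ x)
            a = int(rng.choice(n_actions, p=probs))
            s, r, done = env.step(a)             # assumed Gym-style 3-tuple
            feats.append(x); acts.append(a); rewards.append(r)

        # Backward pass: G <- sum_{k=t+1}^T gamma^{k-t-1} r_k, then one gradient step per t
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            x, a = feats[t], acts[t]
            probs = softmax(theta @ x)
            grad_log_pi = -np.outer(probs, x)    # d log pi / d theta for all action rows
            grad_log_pi[a] += x                  # phi(s) - E_pi[phi(s, .)] for the taken action
            theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```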
Examples of Parametric Policy
Softmax Policy
The Softmax Policy is mostly used in the case of discrete actions:
$\pi(a \mid s, \theta) = \frac{e^{\phi(s,a)^\top \theta}}{\sum_{k=1}^{N} e^{\phi(s,a_k)^\top \theta}}$
The explicit formula for the gradient update is
$\nabla_\theta \log \pi(a \mid s, \theta) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$,
where φ(s, a) is the feature vector related to the state and the action.
Gaussian Policy
The Gaussian Policy is used in the case of a continuous action space:
$\pi(a \mid s, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(a-\mu(s))^2}{2\sigma^2}}$,
where φ(s) is the state feature vector, $\mu(s) = \phi(s)^\top \theta$, and σ can be fixed or parametric.
The explicit formula for the gradient update is
$\nabla_\theta \log \pi(a \mid s, \theta) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$.
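The softmax case is already exercised in the REINFORCE sketch above; for the continuous case, here is a minimal sketch of sampling from a Gaussian policy with linear mean μ(s) = φ(s)·θ and of its score function, the quantity multiplied by G_t (or by a TD error) in the gradient update. The state feature vector and the fixed σ are assumptions.

```python
import numpy as np

def gaussian_policy_sample(phi_s, theta, sigma, rng=np.random.default_rng()):
    """Sample an action from pi(a|s, theta) = N(mu(s), sigma^2), with mu(s) = phi(s) . theta."""
    mu = phi_s @ theta
    return rng.normal(mu, sigma)

def gaussian_score(phi_s, theta, sigma, a):
    """Score function grad_theta log pi(a|s, theta) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma ** 2
```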
Policy Gradient Advantages and Disadvantages
Advantages
A Policy Gradient method is a simpler process compared with value-based
methods.
It allows the action to vary continuously with respect to the state, and it handles continuous action spaces naturally.
It usually has better convergence properties with respect to other methods.
It avoids the growth in the usage of memory and in the computation time when
the action and state sets are large.
It can learn stochastic policies.
It allows the use of the ε-greedy method.
Disadvantages
A Policy Gradient method typically converges to a local rather than global
optimum.
It usually has high variance.
Policy Gradient Example - CartPole
CartPole is a game where a pole is attached by an unactuated joint to a cart,
which moves along a frictionless track. The pole starts upright.
The goal is to prevent it from falling by increasing and reducing the cart’s velocity.
State space - A single state is composed of 4 elements:
cart position
cart velocity
pole angle
pole angular velocity
The game ends when the pole falls, which is when the pole angle is more than
±12◦, or the cart position reaches the edge of the display.
Action space - The agent can take only 2 actions:
push the cart to the left
push the cart to the right
Reward - For every step taken (including the termination step), the reward is
increased by 1.
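For reference, this environment is available in OpenAI Gym as CartPole-v1; the snippet below only rolls out a random policy to show the interface. Note that recent Gym/Gymnasium releases return a (obs, info) pair from reset and a 5-tuple from step, so the exact unpacking depends on the installed version.

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                       # [cart position, cart velocity, pole angle, pole angular velocity]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # 0 = push the cart to the left, 1 = push it to the right
    obs, reward, done, info = env.step(action)   # older Gym API; newer versions also return 'truncated'
    total_reward += reward
print("episode reward:", total_reward)
```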
Policy Gradient Example - CartPole
The problem is solved with the Policy Gradient method (implementation on Github)
Base Policy: Softmax Policy
Discount factor γ = 0.95, learning rate α = 0.1, max iterations per episode: 1000
After about 60 epochs (1 epoch = 20 consecutive episodes) the agent learns a policy that obtains a reward equal to 1000.
Policy Gradient Example - CartPole
This chart shows how the average reward per epoch evolves as a function of the total number of epochs, for different values of the discount factor γ.
Actor-Critic Method
The Actor-Critic method differs from the Policy Gradient method because it estimates both the policy and the value function, and updates both.
To reduce the high variance of the Policy Gradient method, the Actor-Critic method subtracts a baseline b(s) from G_t.
The Temporal Difference error δ = G_t − b(s) is used to update the vector of parameters θ in place of the long-term reward G_t.
The most used baseline is the estimation of the value function V (s).
The value function V (s) is learned with a Neural Network, whose output is the
approximated value function ˆV (s, w), where w is the vector of weights.
Then in every iteration the Temporal Difference error δ is used to adjust the
vector of parameters θ and the vector of weights w.
Actor-Critic:
1 The Critic estimates the value function V(s).
2 The Actor updates the policy distribution in the direction suggested by the Critic.
Actor-Critic Algorithm
Input: a differentiable policy parameterization π(a | s, θ)
Input: a differentiable state-value function parameterization V̂(s, w)
Algorithm parameters: step sizes α_θ > 0, α_w > 0
Initialize policy parameter θ ∈ R^d and state-value weights w ∈ R^d (e.g., to 0)
Loop forever (for each episode):
Initialize s (first state of the episode)
I ← 1
Loop while s is not terminal (for each time step):
a ∼ π(· | s, θ)
Take action a, observe s′, r
δ ← r + γ V̂(s′, w) − V̂(s, w) (if s′ is terminal, then V̂(s′, w) ≐ 0)
w ← w + α_w I δ ∇_w V̂(s, w)
θ ← θ + α_θ I δ ∇_θ ln π(a | s, θ)
I ← γI
s ← s′
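A sketch of this loop with linear function approximation for both the actor (softmax policy) and the critic V̂(s, w) = w·φ(s); the feature map, the Gym-style environment interface, and the step sizes are assumptions made for illustration.

```python
import numpy as np

def actor_critic(env, phi, n_actions, episodes=1000,
                 alpha_theta=0.01, alpha_w=0.05, gamma=0.95, seed=0):
    """One-step actor-critic with a linear-softmax actor and a linear critic."""
    rng = np.random.default_rng(seed)
    d = len(phi(env.reset()))
    theta = np.zeros((n_actions, d))         # actor parameters
    w = np.zeros(d)                          # critic parameters

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for _ in range(episodes):
        s, done, I = env.reset(), False, 1.0
        while not done:
            x = phi(s)
            probs = softmax(theta @ x)
            a = int(rng.choice(n_actions, p=probs))
            s_next, r, done = env.step(a)    # assumed Gym-style 3-tuple
            x_next = phi(s_next)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x              # TD error
            w += alpha_w * I * delta * x                    # critic update
            grad_log_pi = -np.outer(probs, x)
            grad_log_pi[a] += x
            theta += alpha_theta * I * delta * grad_log_pi  # actor update
            I *= gamma
            s = s_next
    return theta, w
```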
Model-based Methods
A Model-based method is based on a base parametric model and on 3 main steps:
1 Acting: the base policy π_0(a_t | s_t) is used to select the actions to perform in the real environment, in order to collect a set of triplets (s, a, s′).
2 Model learning: from the collected experience, a new model f(s, a) is learned in order to minimize the least-squares error between the model's new state and the real new state,
$\sum_i \| f(s_i, a_i) - s'_i \|^2$.
3 Planning: the value function and the policy are updated according to the new model, in
order to be used in the real environment in the next iteration.
Most used base models: Gaussian Process, Gaussian Mixture Model
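A sketch of the model-learning step: given collected (s, a, s′) triplets, fit a deterministic model f(s, a) ≈ s′ by least squares. For simplicity a linear model is used here, while the slide mentions Gaussian Processes or Gaussian Mixture Models as the more common choices; the array layout is an assumption.

```python
import numpy as np

def fit_linear_model(S, A, S_next):
    """Least-squares fit of a linear dynamics model s' ~ W . [s, a, 1].

    S, S_next: arrays (n, state_dim); A: array (n, action_dim).
    Returns a function f(s, a) predicting the next state."""
    X = np.column_stack([S, A, np.ones(len(S))])       # inputs with a bias term
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)     # minimizes sum_i ||f(s_i, a_i) - s'_i||^2
    def f(s, a):
        return np.concatenate([np.atleast_1d(s), np.atleast_1d(a), [1.0]]) @ W
    return f
```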
Model Predictive Control
Model Predictive Control (MPC) is an evolution of the model-based method.
The basic Model-based algorithm is vulnerable to drifting: small errors of the learned model accumulate along the trajectory.
To address that, sampling and fitting of the model are performed continuously during the trajectory.
In MPC the whole trajectory is optimized, but only the first action is performed; then the new triplet (s, a, s′) is added to the observations and the planning is performed again.
By constantly replanning, MPC is less vulnerable to problems in the model.
MPC has 5 main steps:
1 Acting
2 Model learning
3 Planning
4 Execution: the first planned action is performed, and the resulting state s′ is observed.
5 Dataset update: the new triplet (s, a, s′) is appended to the dataset; go to step 3, and every N times go to step 2.
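A sketch of the planning/execution steps using simple random shooting: sample several candidate action sequences, roll them out through the learned model f, score them with the reward function R, and execute only the first action of the best sequence. The interfaces for f, R, and the action sampler are assumptions.

```python
import numpy as np

def mpc_action(s, f, R, action_sampler, horizon=10, n_candidates=100,
               gamma=0.95, rng=np.random.default_rng()):
    """Random-shooting MPC: plan over the learned model, return only the first action."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s_sim, total, discount = s, 0.0, 1.0
        first_action = None
        for t in range(horizon):
            a = action_sampler(rng)            # candidate action for this step
            if t == 0:
                first_action = a
            total += discount * R(s_sim, a)    # score the simulated trajectory
            s_sim = f(s_sim, a)                # step the learned model, not the real environment
            discount *= gamma
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action
```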
Dyna-Q Architecture
Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A
Do until the termination condition is reached:
1 s ← current (nonterminal) state
2 a ← ε-greedy(s, Q)
3 Execute action a
4 Observe the resulting reward r and new state s′, and update
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
5 Model(s, a) ← (r, s′) (assuming a deterministic environment)
6 Planning: repeat N times:
s ← random previously observed state
a ← random action previously taken in s
r, s′ ← Model(s, a)
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
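A tabular Python sketch of this Dyna-Q loop, assuming a deterministic environment with a Gym-style discrete interface; the hyperparameter values are illustrative.

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200,
           alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=20, seed=0):
    """Tabular Dyna-Q: direct Q-learning updates plus n_planning simulated updates
    drawn from a learned deterministic model (s, a) -> (r, s')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # Model(s, a) = (r, s')

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)        # assumed Gym-style 3-tuple
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])        # direct RL update
            model[(s, a)] = (r, s_next)                  # learn the (deterministic) model
            for _ in range(n_planning):                  # planning with simulated experience
                s_p, a_p = list(model.keys())[rng.integers(len(model))]
                r_p, s_next_p = model[(s_p, a_p)]
                Q[s_p, a_p] += alpha * (r_p + gamma * np.max(Q[s_next_p]) - Q[s_p, a_p])
            s = s_next
    return Q
```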
Model-based Methods Advantages and Disadvantages
Model-based Reinforcement Learning has the strong advantage of being sample
efficient.
Once the model and the reward function are known, we can plan the optimal
controls without further sampling.
The learning phase is fast, since there is no need to wait for the environment to
respond.
On the downside, if the model is inaccurate we risk learning something
completely different from the reality.
Model-based algorithms still use Model-free methods either to construct the model or in the planning/simulation phase.
Conclusions
We have given a high-level structural overview of many classic and popular RL algorithms, but there are a lot of variants that we have not covered.
The main challenge in RL lies in preparing the simulation environment, which is highly dependent on the task to be performed.
In fact, many real-world problems have enormous state or action spaces, and for this reason the use of parametric functions is needed.
One of the main tasks in all the methods is to design and tune rewards and penalties in order to obtain the desired results.
Another challenge is to build a learning process that converges to the optimum in
a reasonable time avoiding bias and overfitting.
Last but not least, it’s important to avoid forgetting when acquiring new
observations.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press.
Damien Ernst, Pierre Geurts, Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research 6 (2005), 503–556.
Reinforcement Learning

More Related Content

What's hot

An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)pauldix
 
Top school in noida
Top school in noidaTop school in noida
Top school in noidaEdhole.com
 
Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres Hernandez
 
Introduction to Multi-armed Bandits
Introduction to Multi-armed BanditsIntroduction to Multi-armed Bandits
Introduction to Multi-armed BanditsYan Xu
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache SparkDB Tsai
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkDB Tsai
 
How to design a linear control system
How to design a linear control systemHow to design a linear control system
How to design a linear control systemAlireza Mirzaei
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
Solving The Shortest Path Tour Problem
Solving The Shortest Path Tour ProblemSolving The Shortest Path Tour Problem
Solving The Shortest Path Tour ProblemNozir Shokirov
 
Estimating Future Initial Margin with Machine Learning
Estimating Future Initial Margin with Machine LearningEstimating Future Initial Margin with Machine Learning
Estimating Future Initial Margin with Machine LearningAndres Hernandez
 
Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...
Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...
Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...Yandex
 
Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...Willy Marroquin (WillyDevNET)
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descentkandelin
 
Aaex7 group2(中英夾雜)
Aaex7 group2(中英夾雜)Aaex7 group2(中英夾雜)
Aaex7 group2(中英夾雜)Shiang-Yun Yang
 

What's hot (20)

An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
 
Top school in noida
Top school in noidaTop school in noida
Top school in noida
 
Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017
 
Introduction to Multi-armed Bandits
Introduction to Multi-armed BanditsIntroduction to Multi-armed Bandits
Introduction to Multi-armed Bandits
 
2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark2014-06-20 Multinomial Logistic Regression with Apache Spark
2014-06-20 Multinomial Logistic Regression with Apache Spark
 
Chapter06
Chapter06Chapter06
Chapter06
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache Spark
 
How to design a linear control system
How to design a linear control systemHow to design a linear control system
How to design a linear control system
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Mdp
MdpMdp
Mdp
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Solving The Shortest Path Tour Problem
Solving The Shortest Path Tour ProblemSolving The Shortest Path Tour Problem
Solving The Shortest Path Tour Problem
 
Jmestn42351212
Jmestn42351212Jmestn42351212
Jmestn42351212
 
Estimating Future Initial Margin with Machine Learning
Estimating Future Initial Margin with Machine LearningEstimating Future Initial Margin with Machine Learning
Estimating Future Initial Margin with Machine Learning
 
Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...
Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...
Ilya Shkredov – Subsets of Z/pZ with small Wiener norm and arithmetic progres...
 
Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descent
 
Aaex7 group2(中英夾雜)
Aaex7 group2(中英夾雜)Aaex7 group2(中英夾雜)
Aaex7 group2(中英夾雜)
 
Germany2003 gamg
Germany2003 gamgGermany2003 gamg
Germany2003 gamg
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 

Similar to Reinforcement Learning Overview | Marco Del Pra

Value Function Approximation via Low-Rank Models
Value Function Approximation via Low-Rank ModelsValue Function Approximation via Low-Rank Models
Value Function Approximation via Low-Rank ModelsLyft
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12VuTran231
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313Slideshare
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowIllia Polosukhin
 
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfPo-Chuan Chen
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningRyo Iwaki
 
Machine learning (13)
Machine learning (13)Machine learning (13)
Machine learning (13)NYversity
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptxRithikRaj25
 
Cs221 lecture8-fall11
Cs221 lecture8-fall11Cs221 lecture8-fall11
Cs221 lecture8-fall11darwinrlo
 
Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithm
Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithmOptimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithm
Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithminfopapers
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackarogozhnikov
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach謙益 黃
 
Generic Reinforcement Schemes and Their Optimization
Generic Reinforcement Schemes and Their OptimizationGeneric Reinforcement Schemes and Their Optimization
Generic Reinforcement Schemes and Their Optimizationinfopapers
 
shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014Shuyang Li
 
lecture_21.pptx - PowerPoint Presentation
lecture_21.pptx - PowerPoint Presentationlecture_21.pptx - PowerPoint Presentation
lecture_21.pptx - PowerPoint Presentationbutest
 
Double Q-learning Paper Reading
Double Q-learning Paper ReadingDouble Q-learning Paper Reading
Double Q-learning Paper ReadingTakato Yamazaki
 
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning AutomataA New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automatainfopapers
 

Similar to Reinforcement Learning Overview | Marco Del Pra (20)

Value Function Approximation via Low-Rank Models
Value Function Approximation via Low-Rank ModelsValue Function Approximation via Low-Rank Models
Value Function Approximation via Low-Rank Models
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
 
frozen_lake_rl_report.pdf
frozen_lake_rl_report.pdffrozen_lake_rl_report.pdf
frozen_lake_rl_report.pdf
 
Playing Atari with Deep Reinforcement Learning
Playing Atari with Deep Reinforcement LearningPlaying Atari with Deep Reinforcement Learning
Playing Atari with Deep Reinforcement Learning
 
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdf
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
 
Machine learning (13)
Machine learning (13)Machine learning (13)
Machine learning (13)
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 
Cs221 lecture8-fall11
Cs221 lecture8-fall11Cs221 lecture8-fall11
Cs221 lecture8-fall11
 
Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithm
Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithmOptimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithm
Optimizing a New Nonlinear Reinforcement Scheme with Breeder genetic algorithm
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
 
Generic Reinforcement Schemes and Their Optimization
Generic Reinforcement Schemes and Their OptimizationGeneric Reinforcement Schemes and Their Optimization
Generic Reinforcement Schemes and Their Optimization
 
shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014
 
lecture_21.pptx - PowerPoint Presentation
lecture_21.pptx - PowerPoint Presentationlecture_21.pptx - PowerPoint Presentation
lecture_21.pptx - PowerPoint Presentation
 
Double Q-learning Paper Reading
Double Q-learning Paper ReadingDouble Q-learning Paper Reading
Double Q-learning Paper Reading
 
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning AutomataA New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata
 

More from Data Science Milan

ML & Graph algorithms to prevent financial crime in digital payments
ML & Graph  algorithms to prevent  financial crime in  digital paymentsML & Graph  algorithms to prevent  financial crime in  digital payments
ML & Graph algorithms to prevent financial crime in digital paymentsData Science Milan
 
How to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansHow to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansData Science Milan
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsData Science Milan
 
"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companiesData Science Milan
 
Question generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIQuestion generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIData Science Milan
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSData Science Milan
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraData Science Milan
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AIData Science Milan
 
Audience projection of target consumers over multiple domains a ner and baye...
Audience projection of target consumers over multiple domains  a ner and baye...Audience projection of target consumers over multiple domains  a ner and baye...
Audience projection of target consumers over multiple domains a ner and baye...Data Science Milan
 
Weak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina KhvatovaWeak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina KhvatovaData Science Milan
 
GANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex HoncharGANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex HoncharData Science Milan
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoData Science Milan
 
3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep LearningData Science Milan
 
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...Data Science Milan
 
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...Data Science Milan
 
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data ReplyPricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data ReplyData Science Milan
 
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig..."How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...Data Science Milan
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by CervedData Science Milan
 

More from Data Science Milan (20)

ML & Graph algorithms to prevent financial crime in digital payments
ML & Graph  algorithms to prevent  financial crime in  digital paymentsML & Graph  algorithms to prevent  financial crime in  digital payments
ML & Graph algorithms to prevent financial crime in digital payments
 
How to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansHow to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plans
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning Methods
 
"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies
 
Question generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIQuestion generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AI
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWS
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
 
Audience projection of target consumers over multiple domains a ner and baye...
Audience projection of target consumers over multiple domains  a ner and baye...Audience projection of target consumers over multiple domains  a ner and baye...
Audience projection of target consumers over multiple domains a ner and baye...
 
Weak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina KhvatovaWeak supervised learning - Kristina Khvatova
Weak supervised learning - Kristina Khvatova
 
GANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex HoncharGANs beyond nice pictures: real value of data generation, Alex Honchar
GANs beyond nice pictures: real value of data generation, Alex Honchar
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
 
3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning3D Point Cloud analysis using Deep Learning
3D Point Cloud analysis using Deep Learning
 
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...Deep time-to-failure: predicting failures, churns and customer lifetime with ...
Deep time-to-failure: predicting failures, churns and customer lifetime with ...
 
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...
 
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data ReplyPricing Optimization: Close-out, Online and Renewal strategies, Data Reply
Pricing Optimization: Close-out, Online and Renewal strategies, Data Reply
 
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig..."How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by Cerved
 

Recently uploaded

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Recently uploaded (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Reinforcement Learning Overview | Marco Del Pra

  • 6. Reinforcement Learning actors
Return $G_t$: the total discounted long-term reward obtained at the end of the episode, with $r_t = R(s_t, a_t)$:
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots, \qquad \gamma < 1;$$
Value function $V(s)$: the expected long-term return starting from state $s$:
$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s\right];$$
Q-value or action value function $Q(s, a)$: the expected long-term return starting from state $s$ and performing action $a$.
The Bellman equation, the theoretical core of most RL algorithms:
$$V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V_\pi(s').$$
It can also be expressed using the Q-value as:
$$Q_\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_\pi(s').$$
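As a concrete illustration of the return and of the Bellman backup, here is a minimal Python sketch; the rewards, transition matrix and policy below are made-up illustrative values, not taken from the slides.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# One Bellman evaluation backup for a small, made-up MDP:
# V_pi(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V_pi(s')
n_states = 3
R = np.array([1.0, 0.0, 2.0])          # reward of following pi in each state (illustrative)
P = np.array([[0.8, 0.2, 0.0],         # P(s'|s, pi(s)), rows sum to 1 (illustrative)
              [0.1, 0.7, 0.2],
              [0.0, 0.3, 0.7]])
gamma = 0.95

V = np.zeros(n_states)
for _ in range(500):                   # repeated backups converge to V_pi
    V = R + gamma * P @ V

print(discounted_return([1, 0, 2], gamma), V)
```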
  • 7. Reinforcement Learning optimal policy
According to the Bellman equation, the optimal action value function $Q^*(s, a)$ is given by
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a'),$$
and the optimal policy $\pi^*(s)$ is then given by
$$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a).$$
In most real cases the state transition model and the reward function are unknown. RL algorithms are used to learn the dynamics of the model and improve the rewards.
The $\varepsilon$-greedy strategy is used to explore the entire environment and to address the exploration-exploitation dilemma:
$$\pi(s) = \begin{cases} \arg\max_{a \in A} Q^*(s, a) & \text{with probability } 1 - \varepsilon \\ \text{a random action from } A & \text{with probability } \varepsilon \end{cases}$$
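A minimal sketch of the ε-greedy action selection described above, assuming a tabular Q stored as a NumPy array (an illustrative convention, not prescribed by the slides):

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps=0.1, rng=np.random.default_rng()):
    """Pick argmax_a Q[state, a] with probability 1 - eps, a random action with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# Illustrative usage with a made-up tabular Q
Q = np.zeros((5, 2))
a = epsilon_greedy(Q, state=0, n_actions=2, eps=0.1)
```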
  • 8. The Reinforcement Learning approaches
Value-based methods: a Value-based algorithm computes the optimal value function or the optimal action value function by iteratively improving its estimate.
Policy-based methods: a Policy-based algorithm looks directly for a policy such that the action performed at each state is optimal for gaining the maximum reward in the future.
Model-based methods: a Model-based algorithm learns a virtual model of the original environment; the agent learns how to act in the virtual model and then uses the results returned by it.
  • 9. Value Function Approximation
The goal is to estimate the optimal policy $\pi^*(s)$ by iteratively approximating the optimal action value function $Q^*(s, a)$.
The process is based on a parametric action value function $\hat{Q}(s, a, w)$ of the state $s$, the action $a$ and a vector of parameters $w$ (randomly initialized).
An iteration over every step of every episode is performed. At every iteration, given the state $s$ and the action $a$, we observe the reward $R(s, a)$ and the new state $s'$. According to the obtained reward, the parameters are updated by gradient descent:
$$\Delta w = \alpha \left[ R(s, a) + \gamma \hat{Q}(s', a', w) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w).$$
This process converges to an approximation of the optimal action value function.
In most real cases the parametric action value function $\hat{Q}(s, a, w)$ is a Neural Network, and the vector of parameters $w$ is its vector of weights.
  • 10. Value Function Approximation – Algorithm (episodic semi-gradient SARSA)
Input: a differentiable action value parametric function $\hat{Q}(s, a, w)$
Algorithm parameters: learning rate $\alpha > 0$, $\varepsilon > 0$
Initialize the value-function weights $w \in \mathbb{R}^d$ randomly (e.g., $w = 0$)
Loop for each episode:
    $s, a \leftarrow$ initial state and action of the episode (e.g., $\varepsilon$-greedy)
    Loop for each step of the episode:
        Take action $a$, observe $r$, $s'$
        If $s'$ is terminal: $w \leftarrow w + \alpha [r - \hat{Q}(s, a, w)] \nabla_w \hat{Q}(s, a, w)$, go to the next episode
        Choose $a' = \arg\max_{a' \in A} \hat{Q}(s', a', w)$ (or using $\varepsilon$-greedy)
        $w \leftarrow w + \alpha [r + \gamma \hat{Q}(s', a', w) - \hat{Q}(s, a, w)] \nabla_w \hat{Q}(s, a, w)$
        $s \leftarrow s'$
        $a \leftarrow a'$
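The update above can be sketched in Python for the special case of a linear $\hat{Q}(s, a, w) = w_a \cdot \phi(s)$. The environment interface (reset()/step() returning next state, reward and a done flag) and the feature map phi are assumptions made only for illustration, not part of the slides.

```python
import numpy as np

def q_hat(w, phi_s, a):
    """Linear action value: Q_hat(s, a, w) = w[a] . phi(s)."""
    return w[a] @ phi_s

def epsilon_greedy(w, phi_s, n_actions, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([q_hat(w, phi_s, a) for a in range(n_actions)]))

def semi_gradient_sarsa(env, phi, n_actions, d, episodes=500,
                        alpha=0.05, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros((n_actions, d))                 # one weight vector per action
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(w, phi(s), n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            phi_s = phi(s)                        # phi(s) returns a length-d vector
            if done:
                td_error = r - q_hat(w, phi_s, a)
                w[a] += alpha * td_error * phi_s  # gradient of a linear Q is phi(s)
                break
            a_next = epsilon_greedy(w, phi(s_next), n_actions, eps, rng)
            td_error = r + gamma * q_hat(w, phi(s_next), a_next) - q_hat(w, phi_s, a)
            w[a] += alpha * td_error * phi_s
            s, a = s_next, a_next
    return w
```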
  • 11. Deep Q-Networks
A Deep Q-Network (DQN) is a Value Function Approximation algorithm where the parametric action value function $\hat{Q}(s, a, w)$ is a Deep Neural Network, in particular a Convolutional Neural Network.
A Deep Q-Network overcomes unstable learning mainly through 2 techniques:
Target Network – a Target Network $\hat{Q}(s', a', w^-)$ is a copy of the training model that is updated less frequently. In the gradient descent formula the Target Network is used as the target in place of the model:
$$\Delta w = \alpha \left[ r + \gamma \max_{a'} \hat{Q}(s', a', w^-) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w).$$
This avoids the instabilities caused by continual changes in the target.
Experience Replay – an Experience Replay is a buffer that stores the four-tuples $(s, a, r, s')$ of all the different episodes. Each time the model is updated, a batch of tuples is randomly sampled from the buffer. This reduces overfitting, increases learning speed thanks to mini-batches, and reuses past tuples to avoid forgetting.
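A minimal PyTorch-style sketch of the two tricks; the network size, optimiser and hyper-parameters below are illustrative assumptions, not taken from the slides.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores (s, a, r, s', done) tuples and samples random mini-batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        s, a, r, s_next, done = zip(*random.sample(self.buffer, batch_size))
        to_t = lambda x, dt: torch.as_tensor(np.array(x), dtype=dt)
        return (to_t(s, torch.float32), to_t(a, torch.int64), to_t(r, torch.float32),
                to_t(s_next, torch.float32), to_t(done, torch.float32))

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())           # target = frozen copy of the model
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(buffer, batch_size=32, gamma=0.99):
    s, a, r, s_next, done = buffer.sample(batch_size)     # random mini-batch of past tuples
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                 # target computed with the target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C updates, synchronise the target network:
# target_net.load_state_dict(q_net.state_dict())
```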
  • 12. Fitted Q-Iteration
Consider the deterministic case, in which the new state $s'$ is uniquely determined by the state $s$ and the action $a$ through a function $f$: $s' = f(s, a)$. Let $L$ be the (possibly infinite) horizon.
The goal of this algorithm is to estimate the optimal action value function. By the Bellman equation, in this situation the optimal action value function is the fixed point of the operator $H$:
$$Q^*(s, a) = (HQ^*)(s, a) = R(s, a) + \gamma \max_{a'} Q^*(f(s, a), a').$$
Denote by $Q_N(s, a)$ the action value function over $N$ steps ($N \le L$), given by
$$Q_N(s, a) = (HQ_{N-1})(s, a) \quad \forall N > 0, \qquad Q_0(s, a) = 0.$$
The sequence of $N$-step action value functions $Q_N(s, a)$ converges to the optimal action value function $Q^*(s, a)$ as $N \to L$.
  • 13. Fitted Q-Iteration Algorithm
Inputs: a set $\mathcal{F}$ of four-tuples (state, action, reward, new state).
Initialization: $N = 0$, $\hat{Q}_N = 0$.
Iteration: repeat until the stopping conditions are reached:
    $N \leftarrow N + 1$
    Build the training set $\mathcal{TS} = \{(i^l, o^l),\ l = 1, \dots, \#\mathcal{F}\}$ based on the function $\hat{Q}_{N-1}$ and on the full set of four-tuples $\mathcal{F}$, where
    $$i^l = (s_t^l, a_t^l), \qquad o^l = r_t^l + \gamma \max_{a \in A} \hat{Q}_{N-1}(s_{t+1}^l, a).$$
    Use a regression algorithm to obtain the approximated $N$-step action value function $\hat{Q}_N(s, a)$ by training on the obtained dataset $\mathcal{TS}$.
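A possible sketch of the algorithm in Python, using scikit-learn's ExtraTreesRegressor as the regression step; the tuple layout of F and the omission of terminal-state handling are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, n_iterations=20, gamma=0.95):
    """F: list of (s, a, r, s_next) with s a 1-D feature vector and a a scalar action."""
    S = np.array([s for s, a, r, s_next in F])
    A = np.array([[a] for s, a, r, s_next in F])
    R = np.array([r for s, a, r, s_next in F])
    S_next = np.array([s_next for s, a, r, s_next in F])
    X = np.hstack([S, A])                       # inputs i^l = (s, a)

    q_model = None
    for _ in range(n_iterations):
        if q_model is None:
            y = R                               # Q_0 = 0, so o^l = r for N = 1
        else:
            # o^l = r + gamma * max_a Q_{N-1}(s', a)  (terminal states ignored for brevity)
            q_next = np.column_stack([
                q_model.predict(np.hstack([S_next, np.full((len(F), 1), a)]))
                for a in actions])
            y = R + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return q_model

def greedy_action(q_model, s, actions):
    values = [q_model.predict(np.hstack([s, [a]]).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmax(values))]
```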
  • 14. Fitted Q-Iteration Example – Car on a Hill
Consider a car, modeled by a point mass, that travels on a hill whose profile is described by a function $H(p)$.
Objective – Bring the car in minimum time to the top of the hill with a limited speed.
State space – The state of the system is determined by the position $p$ and the speed $v$ of the car. The state space is
$$S = \{(p, v) \in \mathbb{R}^2 : |p| \le 1 \text{ and } |v| \le 3\}.$$
Every other combination of position and speed is considered a terminal state.
Action space – The action $a$ acts directly on the acceleration of the car and can only assume two extreme values: full acceleration ($a = 4$) or full deceleration ($a = -4$), i.e. $A = \{-4, 4\}$.
  • 15. Fitted Q-Iteration Example – Car on a Hill
System dynamics – Time is discretized in timesteps of 0.1 s. Given the state $(p, v)$ and the action at timestep $t$, the state $(p, v)$ at timestep $t + 1$ is computed by solving the dynamics of the system with a numerical method:
$$\dot{p} = v, \qquad \dot{v} = \frac{u}{1 + H'(p)^2} - \frac{g\,H'(p)}{1 + H'(p)^2} - \frac{v^2 H'(p) H''(p)}{1 + H'(p)^2} \qquad (g = 9.81),$$
where $u$ is the applied action.
Reward function – The reward function $r(s, a)$ is defined as:
$$r(s_t, a_t) = \begin{cases} -1 & \text{if } p_{t+1} < -1 \text{ or } |v_{t+1}| > 3 \\ 1 & \text{if } p_{t+1} > 1 \text{ and } |v_{t+1}| \le 3 \\ 0 & \text{otherwise} \end{cases}$$
Discount factor – The discount factor $\gamma$ has been chosen equal to 0.95.
Starting point – The car starts stopped at the bottom of the hill, $(p, v) = (-0.5, 0)$.
Regressor – The regressor used is an Extra-Trees Regressor.
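A small sketch of the transition and reward just described; the hill profile derivatives H' and H'' are left as user-supplied callables, since the slides do not give them explicitly, and a plain Euler scheme stands in for the numerical integrator.

```python
G = 9.81
DT = 0.1          # 0.1 s timestep, as on the slide
SUBSTEPS = 100    # integrate with a finer internal step for accuracy

def step(p, v, u, H1, H2):
    """One 0.1 s transition. H1(p) = H'(p), H2(p) = H''(p), u in {-4, 4}."""
    h = DT / SUBSTEPS
    for _ in range(SUBSTEPS):
        denom = 1.0 + H1(p) ** 2
        v_dot = u / denom - G * H1(p) / denom - (v ** 2) * H1(p) * H2(p) / denom
        p, v = p + h * v, v + h * v_dot
    return p, v

def reward(p_next, v_next):
    if p_next < -1 or abs(v_next) > 3:
        return -1          # left the hill on the wrong side or too fast: terminal failure
    if p_next > 1 and abs(v_next) <= 3:
        return 1           # reached the top with limited speed: terminal success
    return 0

def is_terminal(p, v):
    return abs(p) > 1 or abs(v) > 3
```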
  • 16. Fitted Q-Iteration Example – Car on a Hill
The Fitted Q-Iteration was performed for $N = 1$ to $50$. For $N > 20$, $Q_N(s, a) \approx Q_{N+1}(s, a)$.
Left figure: the action chosen for every combination $(p, v)$ according to the action value function $\hat{Q}_{20}(s, a)$ (black = deceleration, white = acceleration).
Right figure: the optimal trajectory according to $\hat{Q}_{20}(s, a)$.
A full implementation of Fitted Q-Iteration can be found on GitHub.
  • 17. Policy Gradient
The goal of the Policy Gradient method is to find the vector of parameters $\theta$ that maximizes the value function $V(s, \theta)$ under a parametric policy $\pi(a \mid s, \theta)$.
The process is based on a parametric policy $\pi(a \mid s, \theta)$ differentiable with respect to the vector of parameters $\theta$ (randomly initialized). In this case we choose a stochastic policy (Stochastic Policy Gradient).
An iteration over every episode is performed. In each episode we generate a sequence of triplets (state, action, reward), choosing the actions according to the parametric policy $\pi(a \mid s, \theta)$. For every timestep in the resulting sequence we compute the total discounted long-term reward $G_t$:
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k.$$
Then the vector of parameters $\theta_t$ is modified using a gradient update process:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta V(s, \theta) = \theta_t + \alpha G_t \nabla_\theta \ln \pi(a_t \mid s_t, \theta).$$
The process converges to the approximated optimal policy.
  • 18. Policy Gradient – REINFORCE Algorithm
Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Algorithm parameter: learning rate $\alpha > 0$
Initialize the policy parameter $\theta \in \mathbb{R}^d$ (for example to 0)
Loop for each episode:
    Generate an episode $(s_0, a_0, r_1), \dots, (s_{T-1}, a_{T-1}, r_T)$ following $\pi(\cdot \mid \cdot, \theta)$
    Loop for each step of the episode $t = 0, 1, \dots, T - 1$:
        $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
        $\theta \leftarrow \theta + \alpha \gamma^t G \nabla_\theta \ln \pi(a_t \mid s_t, \theta)$
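A compact Python sketch of REINFORCE with a linear softmax policy; the environment interface (reset() -> s, step(a) -> (s, r, done)) and the feature choice (preference of action a equal to theta[a] . s) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(a|s, theta) with action preferences theta[a] . s."""
    prefs = theta @ s
    prefs = prefs - prefs.max()               # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce(env, n_actions, state_dim, episodes=1000,
              alpha=0.01, gamma=0.95, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, state_dim))
    for _ in range(episodes):
        # 1. Generate an episode following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, s)
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # 2. Policy gradient update for every step of the episode
        T = len(rewards)
        for t in range(T):
            # rewards[t] is the reward received after the action at step t,
            # so this matches G_t = sum_{k=t+1}^{T} gamma^{k-t-1} r_k
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
            probs = softmax_policy(theta, states[t])
            # grad of ln pi for a linear softmax: phi(s, a) - E_pi[phi(s, .)]
            grad = -np.outer(probs, states[t])
            grad[actions[t]] += states[t]
            theta += alpha * (gamma ** t) * G * grad
    return theta
```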
  • 19. Examples of Parametric Policy
Softmax Policy – mostly used in the case of discrete actions:
$$\pi(a \mid s, \theta) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_{k=1}^{N} e^{\phi(s, a_k)^\top \theta}},$$
where $\phi(s, a)$ is the feature vector related to the state and the action. The explicit formula for the gradient update is
$$\nabla_\theta \log \pi(a \mid s, \theta) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)].$$
Gaussian Policy – used in the case of a continuous action space:
$$\pi(a \mid s, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(a - \mu(s))^2}{2\sigma^2}},$$
where $\phi(s)$ is the feature vector, $\mu(s) = \phi(s)^\top \theta$, and $\sigma$ can be fixed or parametric. The explicit formula for the gradient update is
$$\nabla_\theta \log \pi(a \mid s, \theta) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}.$$
  • 20. Policy Gradient – Advantages and Disadvantages
Advantages:
A Policy Gradient method is a simpler process compared with value-based methods.
It allows the action to be continuous with respect to the state.
It usually has better convergence properties than other methods.
It avoids the growth in memory usage and computation time when the action and state sets are large.
It can learn stochastic policies.
It allows the use of the ε-greedy method.
Disadvantages:
A Policy Gradient method typically converges to a local rather than a global optimum.
It usually has high variance.
  • 21. Policy Gradient Example – CartPole
CartPole is a game where a pole is attached by an unactuated joint to a cart, which moves along a frictionless track. The pole starts upright, and the goal is to prevent it from falling by increasing and reducing the cart's velocity.
State space – A single state is composed of 4 elements: cart position, cart velocity, pole angle, pole angular velocity. The game ends when the pole falls, i.e. when the pole angle exceeds ±12°, or when the cart position reaches the edge of the display.
Action space – The agent can take only 2 actions: push the cart to the left, or push the cart to the right.
Reward – For every step taken (including the termination step), the reward is increased by 1.
  • 22. Policy Gradient Example – CartPole
The problem is solved with the Policy Gradient method (implementation on GitHub).
Base policy: Softmax Policy.
Discount factor γ = 0.95, learning rate α = 0.1, max iterations per episode: 1000.
After about 60 epochs (1 epoch = 20 consecutive episodes) the agent learns a policy thanks to which we get a reward equal to 1000.
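For completeness, a hypothetical adapter that would plug the classic gym CartPole-v1 environment into the REINFORCE sketch shown earlier; it targets the pre-0.26 gym API (4-tuple step return), so newer gym/gymnasium versions would need a small adjustment, and the hyper-parameters in the commented call are illustrative only.

```python
import gym
import numpy as np

class CartPoleEnv:
    """Adapts gym's CartPole to the reset()/step() convention used above."""
    def __init__(self):
        self.env = gym.make("CartPole-v1")

    def reset(self):
        return np.array(self.env.reset(), dtype=np.float32)

    def step(self, action):
        obs, reward, done, _ = self.env.step(int(action))
        return np.array(obs, dtype=np.float32), reward, done

# env = CartPoleEnv()
# theta = reinforce(env, n_actions=2, state_dim=4, episodes=1200,
#                   alpha=0.01, gamma=0.95)
```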
  • 23. Policy Gradient Example – CartPole
The chart shows how the average reward per epoch evolves as a function of the total number of epochs, for different values of the discount factor γ.
  • 24. Actor-Critic Method
The Actor-Critic method differs from the Policy Gradient method because it estimates both the policy and the value function, and updates both.
To reduce the high variance of Policy Gradient, the Actor-Critic method subtracts a baseline $b(s)$ from $G_t$. The Temporal Difference error $\delta = G_t - b(s)$ is used to update the vector of parameters $\theta$ in place of the long-term reward $G_t$.
The most used baseline is the estimate of the value function $V(s)$. The value function is learned with a Neural Network whose output is the approximated value function $\hat{V}(s, w)$, where $w$ is the vector of weights.
Then, in every iteration, the Temporal Difference error $\delta$ is used to adjust both the vector of parameters $\theta$ and the vector of weights $w$.
Actor-Critic in short:
1. The Critic estimates the value function $V(s)$.
2. The Actor updates the policy distribution in the direction suggested by the Critic.
  • 25. Actor-Critic Algorithm
Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Input: a differentiable state-value function parameterization $\hat{V}(s, w)$
Algorithm parameters: step sizes $\alpha_\theta > 0$, $\alpha_w > 0$
Initialize the policy parameter $\theta \in \mathbb{R}^{d_\theta}$ and the state-value weights $w \in \mathbb{R}^{d_w}$ (e.g., to 0)
Loop forever (for each episode):
    Initialize $s$ (first state of the episode)
    $I \leftarrow 1$
    Loop while $s$ is not terminal (for each time step):
        $a \sim \pi(\cdot \mid s, \theta)$
        Take action $a$, observe $s'$, $r$
        $\delta \leftarrow r + \gamma \hat{V}(s', w) - \hat{V}(s, w)$ (if $s'$ is terminal, then $\hat{V}(s', w) \doteq 0$)
        $w \leftarrow w + \alpha_w I \delta\, \nabla_w \hat{V}(s, w)$
        $\theta \leftarrow \theta + \alpha_\theta I \delta\, \nabla_\theta \ln \pi(a \mid s, \theta)$
        $I \leftarrow \gamma I$
        $s \leftarrow s'$
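A sketch of one-step Actor-Critic with linear parameterizations: a linear softmax policy for the actor and $\hat{V}(s, w) = w \cdot s$ for the critic. The environment interface is again an assumption made only for illustration.

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def actor_critic(env, n_actions, state_dim, episodes=1000,
                 alpha_theta=0.01, alpha_w=0.05, gamma=0.95, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, state_dim))   # actor parameters
    w = np.zeros(state_dim)                    # critic weights
    for _ in range(episodes):
        s, done, I = env.reset(), False, 1.0
        while not done:
            probs = softmax(theta @ s)
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else w @ s_next       # V(terminal) = 0
            delta = r + gamma * v_next - w @ s          # TD error
            w += alpha_w * I * delta * s                # grad_w V_hat(s, w) = s
            grad_ln_pi = -np.outer(probs, s)            # grad of ln pi for a linear softmax
            grad_ln_pi[a] += s
            theta += alpha_theta * I * delta * grad_ln_pi
            I *= gamma
            s = s_next
    return theta, w
```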
  • 26. Model-based Methods
A Model-based method relies on a base parametric model and on 3 main steps:
1. Acting: the base policy $\pi_0(a_t \mid s_t)$ is used to select the actions to perform in the real environment, in order to collect a set of triplets $(s, a, s')$.
2. Model learning: from the collected experience, a new model $f(s, a)$ is fitted so as to minimize the least-squares error between the model's predicted new state and the real new state,
$$\sum_i \| f(s_i, a_i) - s'_i \|^2.$$
3. Planning: the value function and the policy are updated according to the new model, to be used in the real environment in the next iteration.
Most used base models: Gaussian Process, Gaussian Mixture Model.
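A minimal sketch of the model-learning step, fitting a simple linear dynamics model by least squares; the linear model is used purely for illustration in place of the Gaussian Process or Gaussian Mixture models mentioned on the slide.

```python
import numpy as np

def fit_linear_dynamics(transitions):
    """transitions: list of (s, a, s_next) with s, s_next and a given as 1-D arrays."""
    X = np.array([np.concatenate([s, a]) for s, a, s_next in transitions])
    Y = np.array([s_next for s, a, s_next in transitions])
    X = np.hstack([X, np.ones((len(X), 1))])        # bias term
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)       # minimizes sum_i ||X_i W - s'_i||^2
    return W

def predict_next_state(W, s, a):
    x = np.concatenate([s, a, [1.0]])
    return x @ W
```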
  • 27. Model Predictive Control
Model Predictive Control (MPC) is an evolution of the model-based method. The basic Model-based algorithm is vulnerable to drifting; to address that, sampling and fitting of the model are performed continuously during the trajectory.
In MPC the whole trajectory is optimized, but only the first action is performed; then the new triplet $(s, a, s')$ is added to the observations and the planning is performed again. By constantly re-planning, MPC is less vulnerable to errors in the model.
MPC has 5 main steps:
1. Acting
2. Model learning
3. Planning
4. Execution: the first planned action is performed, and the resulting state $s'$ is observed.
5. Dataset update: the new triplet $(s, a, s')$ is appended to the dataset; go to step 3, and every $N$ iterations go back to step 2.
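One way to sketch the planning and execution steps is a random-shooting planner: sample candidate action sequences, roll them out through the learned model, and execute only the first action of the best sequence. The model, reward function and action bounds below are assumptions for illustration.

```python
import numpy as np

def plan_first_action(model, reward_fn, s, horizon=10, n_candidates=500,
                      action_dim=1, action_low=-1.0, action_high=1.0, rng=None):
    """model(state, action) -> next_state; reward_fn(state, action, next_state) -> float."""
    rng = rng or np.random.default_rng()
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(action_low, action_high, size=(horizon, action_dim))
        state, total = np.array(s, dtype=float), 0.0
        for a in actions:                       # roll the whole sequence out in the model
            next_state = model(state, a)
            total += reward_fn(state, a, next_state)
            state = next_state
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action                           # only the first action is executed

# MPC loop: execute best_action in the real environment, append the new (s, a, s')
# triplet to the dataset, re-plan from the new state, and periodically re-fit the model.
```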
  • 28. Dyna-Q Architecture
Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in S$ and $a \in A$.
Do until the termination condition is met:
1. $s \leftarrow$ current (nonterminal) state
2. $a \leftarrow \varepsilon$-greedy$(s, Q)$
3. Execute action $a$
4. Observe the resulting reward $r$ and new state $s'$:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
5. $Model(s, a) \leftarrow (r, s')$ (assuming a deterministic environment)
6. Planning: repeat $N$ times:
    $s \leftarrow$ a random previously observed state
    $a \leftarrow$ a random action previously taken in $s$
    $(r, s') \leftarrow Model(s, a)$
    $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
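A tabular Dyna-Q sketch following the pseudocode above; the discrete environment interface and the hyper-parameters are assumptions for illustration.

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200, planning_steps=20,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # (s, a) -> (r, s_next), deterministic model

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            # Direct RL update from the real transition
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            # Model update (deterministic environment assumed)
            model[(s, a)] = (r, s_next)
            # Planning: N updates from simulated transitions
            for _ in range(planning_steps):
                (ps, pa), (pr, ps_next) = list(model.items())[rng.integers(len(model))]
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])
            s = s_next
    return Q
```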
  • 29. Model-based Methods – Advantages and Disadvantages
Model-based Reinforcement Learning has the strong advantage of being sample efficient: once the model and the reward function are known, we can plan the optimal controls without further sampling.
The learning phase is fast, since there is no need to wait for the environment to respond.
On the downside, if the model is inaccurate we risk learning something completely different from reality.
Moreover, Model-based algorithms still use Model-free methods either to construct the model or in the planning/simulation phase.
  • 30. Conclusions
We have had a high-level structural overview of many classic and popular RL algorithms, but there are many variants that we have not covered.
The main challenge in RL lies in preparing the simulation environment, which is highly dependent on the task to be performed. In fact, many real-world problems have enormous state or action spaces, and for this reason the use of parametric functions is needed.
One of the main tasks in all the methods is to optimize rewards and penalties in order to obtain the desired results.
Another challenge is to build a learning process that converges to the optimum in a reasonable time while avoiding bias and overfitting.
Last but not least, it is important to avoid forgetting when acquiring new observations.
  • 31. References
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
Damien Ernst, Pierre Geurts, Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research 6 (2005) 503–556.