Batch mode reinforcement learning
based on the synthesis of artificial
trajectories
Damien Ernst
University of Liège, Belgium
Princeton - December 2012
Batch mode Reinforcement Learning ≃
Learning a high-performance policy for a sequential decision
problem where:
• a numerical criterion is used to define the performance of
a policy. (An optimal policy is the policy that maximizes
this numerical criterion.)
• “the only” (or “most of the”) information available on the
sequential decision problem is contained in a set of
trajectories.
Batch mode RL stands at the intersection of three worlds:
optimization (maximization of the numerical criterion),
system/control theory (sequential decision problem) and
machine learning (inference from a set of trajectories).
A typical batch mode RL problem
Discrete-time dynamics:
x_{t+1} = f(x_t, u_t, w_t), t = 0, 1, . . . , T − 1, where x_t ∈ X, u_t ∈ U and w_t ∈ W. w_t is drawn at every time step according to P_w(·).
Reward observed after each system transition:
r_t = ρ(x_t, u_t, w_t), where ρ : X × U × W → R is the reward function.
Type of policies considered: h : {0, 1, . . . , T − 1} × X → U.
Performance criterion: Expected sum of the rewards observed over the T-length horizon:
PC^h(x) = J^h(x) = E_{w_0,...,w_{T−1}}[ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ] with x_0 = x and x_{t+1} = f(x_t, h(t, x_t), w_t).
Available information: A set of elementary pieces of trajectories F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}, where y^l is the state reached after taking action u^l in state x^l and r^l is the instantaneous reward associated with the transition. The functions f, ρ and P_w are unknown.
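As a concrete illustration, such a batch dataset can be represented as a plain list of one-step transitions. The sketch below is only illustrative (the helper names are hypothetical); the simulator functions f, ρ and the disturbance sampler are assumed to be available only at data-collection time:

```python
def collect_transitions(f, rho, sample_w, sample_x, sample_u, n, rng):
    """Collect n one-step transitions (x, u, r, y) forming the batch set F_n.

    f, rho and sample_w are only used here, at data-collection time; a batch
    mode RL algorithm never calls them again and only sees the returned list.
    """
    F = []
    for _ in range(n):
        x = sample_x(rng)   # state in which the transition starts
        u = sample_u(rng)   # action applied in that state
        w = sample_w(rng)   # disturbance drawn from P_w(.)
        F.append((x, u, rho(x, u, w), f(x, u, w)))
    return F
```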
Batch mode RL and function approximators
Training function approximators (radial basis functions, neural nets, trees, etc.) using the information contained in the set of trajectories is a key element of most resolution schemes for batch mode RL problems whose state-action spaces have a large (or infinite) number of elements.
Two typical uses of FAs for batch mode RL:
• the FAs model the sequential decision problem (in our typical problem: f, ρ and P_w). The model is afterwards exploited as if it were the real problem to compute a high-performance policy.
• the FAs represent (state-action) value functions which are used in iterative schemes so as to converge to a (state-action) value function from which a high-performance policy can be computed. Iterative schemes based on the dynamic programming principle (e.g., LSPI, FQI, Q-learning); a small FQI sketch is given below.
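For the second use, here is a minimal, purely illustrative sketch of FQI with tree-based FAs. It assumes scalar states and actions, a finite grid of candidate actions for the maximization, and uses scikit-learn's ExtraTreesRegressor as the supervised learner; it is written as a generic discounted variant rather than the finite-horizon formulation used in this talk.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, n_iterations, gamma=0.95):
    """Sketch of FQI on a batch set F of (x, u, r, y) tuples (scalar x and u)."""
    X = np.array([[x, u] for (x, u, r, y) in F])        # regression inputs (x, u)
    rewards = np.array([r for (x, u, r, y) in F])
    successors = np.array([y for (x, u, r, y) in F])
    q = None
    for _ in range(n_iterations):
        if q is None:
            targets = rewards                            # Q_1(x, u) = reward
        else:
            # max over the candidate actions of Q_{N-1}(y, u')
            q_next = np.column_stack(
                [q.predict(np.column_stack([successors, np.full(len(successors), a)]))
                 for a in actions])
            targets = rewards + gamma * q_next.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q  # greedy policy: argmax over actions of q.predict([[x, a]])
```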
Why look beyond function approximators?
FA-based techniques: mature, can successfully solve many real-life problems, but:
1. not well adapted to risk sensitive performance criteria
2. may lead to unsafe policies - poor performance guarantees
3. may make suboptimal use of near-optimal trajectories
4. offer few clues about how to generate new experiments in an optimal way
1. not well adapted to risk sensitive performance criteria
An example of risk sensitive performance criterion:
PC^h(x) = −∞ if P( Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) < b ) > c, and PC^h(x) = J^h(x) otherwise.
FAs with dynamic programming: very problematic, because the (state-action) value functions need to become functions that take as values "probability distributions of future rewards" and not "expected rewards".
FAs with model learning: more likely to succeed; but what about the challenges of fitting the FAs to model the distribution of future states reached (and rewards collected) by policies, and not only an average behavior?
2. may lead to unsafe policies - poor performance
guarantees
• Benchmark: puddle world. • RL algorithm: FQI with trees.
• Trajectory set covering the puddle ⇒ optimal policy.
• Trajectory set not covering the puddle ⇒ suboptimal (unsafe) policy.
Typical performance guarantee in the deterministic case for FQI: the return of the output policy as estimated by FQI, minus a constant times the 'size' of the largest area of the state space not covered by the sample.
3. may make suboptimal use of near-optimal
trajectories
Suppose a deterministic batch mode RL problem and that in the set of trajectories there is a trajectory:
(x_0^{opt. traj.}, u_0, r_0, x_1, u_1, r_1, x_2, . . . , x_{T−2}, u_{T−2}, r_{T−2}, x_{T−1}, u_{T−1}, r_{T−1}, x_T)
where the u_t's have been selected by an optimal policy.
Question: Which batch mode RL algorithms will output a policy that is optimal for the initial state x_0^{opt. traj.}, whatever the other trajectories in the set? Answer: Not that many, and certainly not those using parametric FAs.
In my opinion: batch mode RL algorithms can only be successful on large-scale problems if (i) in the set of trajectories, many trajectories have been generated by (near-)optimal policies and (ii) the algorithms exploit very well the information contained in those (near-)optimal trajectories.
4. offer few clues about how to generate new experiments in an optimal way
Many real-life problems are variants of batch mode RL
problems for which (a limited number of) additional trajectories
can be generated (under various constraints) to enrich the
initial set of trajectories.
Question: How should these new trajectories be generated?
Many approaches based on the analysis of the FAs produced
by batch mode RL methods have been proposed; results are
mixed.
Rebuilding trajectories
We conjecture that mapping the set of trajectories into FAs generally leads to the loss of information that is essential for addressing these four issues ⇒ we have developed a new line of research for solving batch mode RL problems that does not use FAs at all.
This line of research is articulated around the rebuilding of artificial (likely "broken") trajectories from the set of trajectories given as input to the batch mode RL problem; a rebuilt trajectory is defined by the elementary pieces of trajectory it is made of.
The rebuilt trajectories are analysed to compute various things: a high-performance policy, performance guarantees, where to sample, etc.
[Figure] BLUE ARROW = elementary piece of trajectory. Left: set of trajectories given as input of the batch RL problem. Right: examples of 5-length rebuilt trajectories made from elements of this set.
Model-Free Monte Carlo Estimator
Building an oracle that estimates the performance of a policy:
important problem in batch mode RL.
Indeed, if an oracle is available, the problem of estimating a high-performance policy can be reduced to an optimization problem over a set of candidate policies.
If a model of the sequential decision problem is available, a Monte Carlo estimator (i.e., rollouts) can be used to estimate the performance of a policy.
We detail an approach that estimates the performance of a policy by rebuilding trajectories so as to mimic the behavior of the Monte Carlo estimator.
Context in which the approach is presented
Discrete-time dynamics:
x_{t+1} = f(x_t, u_t, w_t), t = 0, 1, . . . , T − 1, where x_t ∈ X, u_t ∈ U and w_t ∈ W. w_t is drawn at every time step according to P_w(·).
Reward observed after each system transition:
r_t = ρ(x_t, u_t, w_t), where ρ : X × U × W → R is the reward function.
Type of policies considered: h : {0, 1, . . . , T − 1} × X → U.
Performance criterion: Expected sum of the rewards observed over the T-length horizon:
PC^h(x) = J^h(x) = E_{w_0,...,w_{T−1}}[ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ] with x_0 = x and x_{t+1} = f(x_t, h(t, x_t), w_t).
Available information: A set of elementary pieces of trajectories F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}. f, ρ and P_w are unknown.
The approach is aimed at estimating J^h(x) from F_n.
Monte Carlo Estimator
Generate nbTraj T-length trajectories by simulating the system starting from the initial state x_0; for every trajectory, compute the sum of rewards collected; average these sums of rewards over the nbTraj trajectories to get an estimate MCE^h(x_0) of J^h(x_0).
[Figure: illustration with nbTraj = 3 and T = 5; three simulated trajectories starting from x_0; e.g., w_3 ∼ P_w(·), r_3 = ρ(x_3, h(3, x_3), w_3) and x_4 = f(x_3, h(3, x_3), w_3); sum rew. traj. 1 = Σ_{i=0}^{4} r_i; MCE^h(x_0) = (1/3) Σ_{i=1}^{3} sum rew. traj. i]

Bias MCE^h(x_0) = E_{nbTraj·T rand. var. w ∼ P_w(·)}[ MCE^h(x_0) − J^h(x_0) ] = 0
Var. MCE^h(x_0) = (1/nbTraj) · (variance of the sum of rewards along a trajectory)
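A minimal sketch of this reference estimator, assuming access to a simulator (f, rho and sample_w are hypothetical names; in the batch setting they are unknown, which is precisely why the model-free estimator below is needed):

```python
def mc_estimate(f, rho, sample_w, h, x0, T, nb_traj, rng):
    """Plain Monte Carlo estimate of J^h(x0): average return of nb_traj rollouts."""
    returns = []
    for _ in range(nb_traj):
        x, total = x0, 0.0
        for t in range(T):
            w = sample_w(rng)                 # w_t ~ P_w(.)
            total += rho(x, h(t, x), w)       # r_t = rho(x_t, h(t, x_t), w_t)
            x = f(x, h(t, x), w)              # x_{t+1} = f(x_t, h(t, x_t), w_t)
        returns.append(total)
    return sum(returns) / nb_traj
```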
Description of Model-free Monte Carlo Estimator
(MFMC)
Principle: Rebuild nbTraj T-length trajectories using the elements of the set F_n and average the sums of rewards collected along the rebuilt trajectories to get an estimate MFMCE^h(F_n, x_0) of J^h(x_0).
Trajectory rebuilding algorithm: Trajectories are rebuilt sequentially; an elementary piece of trajectory can only be used once; trajectories are grown in length by selecting at every instant t = 0, 1, . . . , T − 1 the elementary piece of trajectory (x, u, r, y) that minimizes the distance ∆((x, u), (x^end, h(t, x^end))), where x^end is the ending state of the already rebuilt part of the trajectory (x^end = x_0 if t = 0).
Remark: When sequentially selecting the pieces of trajectories, knowing only (x, u) and the previously selected elementary pieces of trajectories gives no information on the value of the disturbance w "behind" the new elementary piece of trajectory (x, u, r = ρ(x, u, w), y = f(x, u, w)) that is about to be selected. This is important for having a meaningful estimator!
[Figure: illustration with nbTraj = 3, T = 5 and F_24 = {(x^l, u^l, r^l, y^l)}_{l=1}^{24}; three rebuilt trajectories starting from x_0; sum rew. re. traj. 1 = r_3 + r_18 + r_21 + r_7 + r_9; MFMCE^h(F_n, x_0) = (1/3) Σ_{i=1}^{3} sum rew. re. traj. i]
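A minimal sketch of the MFMC estimator described above, for vector (or scalar) states and actions and the distance ∆ used in the assumptions below; the helper names are illustrative, not the authors' implementation:

```python
import numpy as np

def mfmc_estimate(F, h, x0, T, nb_traj):
    """Model-free Monte Carlo estimate of J^h(x0) from a set F of (x, u, r, y) tuples.

    Rebuilds nb_traj artificial T-length trajectories; each elementary piece of
    trajectory in F is used at most once across all rebuilt trajectories.
    """
    def dist(x, u, x2, u2):
        return (np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x2))
                + np.linalg.norm(np.atleast_1d(u) - np.atleast_1d(u2)))

    available = set(range(len(F)))
    returns = []
    for _ in range(nb_traj):
        x_end, total = x0, 0.0
        for t in range(T):
            u_target = h(t, x_end)
            # unused transition whose (x, u) is closest to (x_end, h(t, x_end))
            best = min(available, key=lambda l: dist(F[l][0], F[l][1], x_end, u_target))
            available.remove(best)
            _, _, r, y = F[best]
            total += r        # accumulate the reward of the selected piece
            x_end = y         # the rebuilt trajectory continues from its end state
        returns.append(total)
    return sum(returns) / nb_traj
```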
Analysis of the MFMC
The random set ˜F_n is defined as follows: it is made of n elementary pieces of trajectory where the first two components of an element (x^l, u^l) are given by the first two elements of the l-th element of F_n, and the last two are generated by drawing for each l a disturbance signal w^l at random from P_W(·) and taking r^l = ρ(x^l, u^l, w^l) and y^l = f(x^l, u^l, w^l). F_n is a realization of the random set ˜F_n.
The bias and variance of the MFMCE are defined as:
Bias MFMCE^h(˜F_n, x_0) = E_{w^1,...,w^n ∼ P_w}[ MFMCE^h(˜F_n, x_0) − J^h(x_0) ]
Var. MFMCE^h(˜F_n, x_0) = E_{w^1,...,w^n ∼ P_w}[ ( MFMCE^h(˜F_n, x_0) − E_{w^1,...,w^n ∼ P_w}[ MFMCE^h(˜F_n, x_0) ] )^2 ]
We provide bounds on the bias and variance of this estimator.
Assumptions
1] The functions f, ρ and h are Lipschitz continuous:
∃ L_f, L_ρ, L_h ∈ R^+ : ∀(x, x′, u, u′, w) ∈ X² × U² × W,
‖f(x, u, w) − f(x′, u′, w)‖_X ≤ L_f (‖x − x′‖_X + ‖u − u′‖_U)
|ρ(x, u, w) − ρ(x′, u′, w)| ≤ L_ρ (‖x − x′‖_X + ‖u − u′‖_U)
‖h(t, x) − h(t, x′)‖_U ≤ L_h ‖x − x′‖_X, ∀t ∈ {0, 1, . . . , T − 1}.
2] The distance ∆ is chosen such that: ∆((x, u), (x′, u′)) = ‖x − x′‖_X + ‖u − u′‖_U.
Characterization of the bias and the variance
Theorem.
Bias MFMCE^h(˜F_n, x_0) ≤ C · sparsity of F_n(nbTraj · T)
Var. MFMCE^h(˜F_n, x_0) ≤ ( √(Var. MCE^h(x_0)) + 2C · sparsity of F_n(nbTraj · T) )²
with C = L_ρ Σ_{t=0}^{T−1} Σ_{i=0}^{T−t−1} [L_f (1 + L_h)]^i, and with the sparsity of F_n(k) defined as the minimal radius r such that all balls in X × U of radius r contain at least k state-action pairs (x^l, u^l) of the set F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}.
Test system
Discrete-time dynamics: x_{t+1} = sin( (π/2)(x_t + u_t + w_t) ) with X = [−1, 1], U = [−1/2, 1/2], W = [−0.1/2, 0.1/2] and P_w(·) a uniform pdf.
Reward observed after each system transition: r_t = (1/(2π)) exp( −(x_t² + u_t²)/2 ) + w_t
Performance criterion: Expected sum of the rewards observed over a 15-length horizon (T = 15).
We want to evaluate the performance of the policy h(t, x) = −x/2 when x_0 = −0.5.
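Putting the pieces together on this benchmark with the hypothetical helpers sketched earlier (the disturbance interval is read here as [−0.05, 0.05], and the uniform sampling over X × U used to build F_n is an assumption):

```python
import numpy as np

# Test system from the slides: X = [-1, 1], U = [-1/2, 1/2], w uniform on [-0.05, 0.05]
f = lambda x, u, w: np.sin(np.pi / 2 * (x + u + w))
rho = lambda x, u, w: (1.0 / (2 * np.pi)) * np.exp(-(x**2 + u**2) / 2) + w
sample_w = lambda rng: rng.uniform(-0.05, 0.05)
h = lambda t, x: -x / 2          # policy to evaluate
T, x0 = 15, -0.5

rng = np.random.default_rng(0)
F = collect_transitions(f, rho, sample_w,
                        sample_x=lambda r: r.uniform(-1.0, 1.0),
                        sample_u=lambda r: r.uniform(-0.5, 0.5),
                        n=10000, rng=rng)

print("MC estimate of J^h(x0)  :", mc_estimate(f, rho, sample_w, h, x0, T, nb_traj=10, rng=rng))
print("MFMC estimate of J^h(x0):", mfmc_estimate(F, h, x0, T, nb_traj=10))
```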
Simulations for nbTraj = 10 and size of F_n = 100, ..., 10000. [Figure: Model-free Monte Carlo Estimator (left) vs. Monte Carlo Estimator (right)]
Simulations for nbTraj = 1, . . . , 100 and size of F_n = 10,000. [Figure: Model-free Monte Carlo Estimator (left) vs. Monte Carlo Estimator (right)]
Remember what was said about RL + FAs:
1. not well adapted to risk sensitive performance criteria
Suppose the risk sensitive performance criterion:
PC^h(x) = −∞ if P( Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) < b ) > c, and PC^h(x) = J^h(x) otherwise,
where J^h(x) = E[ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ].
MFMCE adapted to this performance criterion: Rebuild nbTraj trajectories starting from x_0 using the set F_n, as done with the MFMCE estimator. Let sum_rew_traj_i be the sum of rewards collected along the i-th rebuilt trajectory. Output as estimation of PC^h(x_0):
−∞ if ( Σ_{i=1}^{nbTraj} I{sum_rew_traj_i < b} ) / nbTraj > c
( Σ_{i=1}^{nbTraj} sum_rew_traj_i ) / nbTraj otherwise.
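A sketch of that output rule, assuming the per-trajectory sums of rewards of the rebuilt trajectories have already been collected (e.g., by returning the list `returns` from the mfmc_estimate sketch above instead of its average):

```python
def risk_sensitive_estimate(returns, b, c):
    """Risk-sensitive MFMCE output from the rebuilt-trajectory sums of rewards.

    returns: list of sums of rewards of the rebuilt trajectories
    b, c:    reward threshold and tolerated probability of falling below it
    """
    frac_below = sum(1 for s in returns if s < b) / len(returns)
    if frac_below > c:
        return float("-inf")
    return sum(returns) / len(returns)
```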
MFMCE in the deterministic case
We consider from now on that x_{t+1} = f(x_t, u_t) and r_t = ρ(x_t, u_t).
A single trajectory is sufficient to compute exactly J^h(x_0) by Monte Carlo estimation.
Theorem. Let [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be the trajectory rebuilt by the MFMCE when using the distance measure ∆((x, u), (x′, u′)) = ‖x − x′‖ + ‖u − u′‖. If f, ρ and h are Lipschitz continuous, we have
|MFMCE^h(x_0) − J^h(x_0)| ≤ Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))
where y^{l_{−1}} = x_0 and L_{Q_N} = L_ρ ( Σ_{t=0}^{N−1} [L_f (1 + L_h)]^t ).
The previous theorem extends to any rebuilt trajectory:
Theorem. Let [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be any rebuilt trajectory. If f, ρ and h are Lipschitz continuous, we have
| Σ_{t=0}^{T−1} r^{l_t} − J^h(x_0) | ≤ Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))
where ∆((x, u), (x′, u′)) = ‖x − x′‖ + ‖u − u′‖, y^{l_{−1}} = x_0 and L_{Q_N} = L_ρ ( Σ_{t=0}^{N−1} [L_f (1 + L_h)]^t ).

[Figure: a trajectory generated by policy h from x_0 (rewards r_t, e.g., r_2 = ρ(x_2, h(2, x_2))), compared with a rebuilt trajectory with rewards r^{l_0}, . . . , r^{l_4}; e.g., ∆_2 = L_{Q_3}( ‖y^{l_1} − x^{l_2}‖ + ‖h(2, y^{l_1}) − u^{l_2}‖ ); | Σ_{t=0}^{4} r_t − Σ_{t=0}^{4} r^{l_t} | ≤ Σ_{t=0}^{4} ∆_t ]
Computing a lower bound on a policy
From the previous theorem, we have for any rebuilt trajectory [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1}:
J^h(x_0) ≥ Σ_{t=0}^{T−1} r^{l_t} − Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))
This suggests finding the rebuilt trajectory that maximizes the right-hand side of the inequality to compute a tight lower bound on J^h(x_0). Let:
lower_bound(h, x_0, F_n) = max over rebuilt trajectories [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} of
[ Σ_{t=0}^{T−1} r^{l_t} − Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t})) ]
A tight upper bound on J^h(x) can be defined and computed in a similar way:
upper_bound(h, x_0, F_n) = min over rebuilt trajectories [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} of
[ Σ_{t=0}^{T−1} r^{l_t} + Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t})) ]
Why are these bounds tight? Because:
∃ C ∈ R^+ : J^h(x) − lower_bound(h, x, F_n) ≤ C · sparsity of F_n(1)
and upper_bound(h, x, F_n) − J^h(x) ≤ C · sparsity of F_n(1).
The functions lower_bound(h, x_0, F_n) and upper_bound(h, x_0, F_n) can be implemented in a "smart" way by seeing the problem as one of finding the shortest path in a graph. The complexity is linear in T and quadratic in |F_n|.
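A sketch of that computation, written as a stage-wise longest-path recursion over the n transitions at each of the T steps, for scalar states and actions; L_f, L_rho and L_h are assumed known Lipschitz constants and the function names are illustrative. Flipping the sign of the distance penalty and taking a min gives the upper bound in the same way.

```python
def l_q(N, L_f, L_rho, L_h):
    """Constant L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f (1 + L_h)]^t."""
    return L_rho * sum((L_f * (1 + L_h)) ** t for t in range(N))

def lower_bound(F, h, x0, T, L_f, L_rho, L_h):
    """Lower bound on J^h(x0) over all rebuilt trajectories; O(T * |F|^2)."""
    dist = lambda x, u, x2, u2: abs(x - x2) + abs(u - u2)   # Delta for scalar X and U
    n = len(F)
    # stage t = 0: the "previous end state" is x0 itself
    best = [F[l][2] - l_q(T, L_f, L_rho, L_h) * dist(F[l][0], F[l][1], x0, h(0, x0))
            for l in range(n)]
    for t in range(1, T):
        c = l_q(T - t, L_f, L_rho, L_h)
        # best[l] = best value of a length-(t+1) rebuilt trajectory ending with piece l
        best = [max(best[lp] - c * dist(F[l][0], F[l][1], F[lp][3], h(t, F[lp][3]))
                    for lp in range(n)) + F[l][2]
                for l in range(n)]
    return max(best)
```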
Remember what was said about RL + FAs:
2. may lead to unsafe policies - poor performance
guarantees
Let H be a set of candidate high-performance policies. To obtain a policy with good performance guarantees, we suggest solving the following problem:
h ∈ arg max_{h∈H} lower_bound(h, x_0, F_n)
If H is the set of open-loop policies, solving the above optimization problem can be seen as identifying an "optimal" rebuilt trajectory and outputting, as the open-loop policy, the sequence of actions taken along this rebuilt trajectory.
[Figure: puddle world results.
• Trajectory set covering the puddle: FQI with trees vs. h ∈ arg max_{h∈H} lower_bound(h, x_0, F_n).
• Trajectory set not covering the puddle: FQI with trees vs. h ∈ arg max_{h∈H} lower_bound(h, x_0, F_n).]
Remember what was said about RL + FAs:
3. may make suboptimal use of near-optimal
trajectories
Suppose a deterministic batch mode RL problem and that F_n contains the elements of the trajectory:
(x_0^{opt. traj.}, u_0, r_0, x_1, u_1, r_1, x_2, . . . , x_{T−2}, u_{T−2}, r_{T−2}, x_{T−1}, u_{T−1}, r_{T−1}, x_T)
where the u_t's have been selected by an optimal policy.
Let H be the set of open-loop policies. Then, the sequence of actions h ∈ arg max_{h∈H} lower_bound(h, x_0^{opt. traj.}, F_n) is an optimal one, whatever the other trajectories in the set.
Actually, the sequence of actions h output by this algorithm tends to be a concatenation of subsequences of actions belonging to optimal trajectories.
Remember what was said about RL + FAs:
4. offer few clues about how to generate new experiments in an optimal way
The functions lower bound(h, x0, Fn) and
upper bound(h, x0, Fn) can be exploited for generating new
trajectories.
For example, suppose that you can sample the state-action
space several times so as to generate m new elementary
pieces of trajectories to enrich your initial set Fn. We have
proposed a technique to determine m “interesting” sampling
locations based on these bounds.
This technique - which is still preliminary - targets sampling
locations that lead to the largest bound width decrease for
candidate optimal policies.
Closure
Rebuilding trajectories: interesting concept for solving many
problems related to batch mode RL.
Actually, the solution output by many RL algorithms (e.g., model learning with kNN, fitted Q iteration with trees) can be characterized by a set of "rebuilt trajectories".
⇒ I suspect that this concept of rebuilt trajectories could lead to a general paradigm for analyzing and designing RL algorithms.
Presentation based on the paper:
Batch mode reinforcement learning based on the synthesis of artificial trajectories. R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Annals of Operations Research, Volume 208, Issue 1, September 2013, Pages 383-416.
[A picture of the authors]
More Related Content

What's hot

Principle of Maximum Entropy
Principle of Maximum EntropyPrinciple of Maximum Entropy
Principle of Maximum EntropyJiawang Liu
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical MethodsChristian Robert
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical MethodsChristian Robert
 
Sequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionJeremyHeng10
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Asma Ben Slimene
 
WSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsWSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsChristian Robert
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemErika G. G.
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...butest
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methodsChristian Robert
 
Max Entropy
Max EntropyMax Entropy
Max Entropyjianingy
 
Random Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application ExamplesRandom Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application ExamplesFörderverein Technische Fakultät
 
Presentacion limac-unc
Presentacion limac-uncPresentacion limac-unc
Presentacion limac-uncPucheta Julian
 
Optimal real-time landing using DNN
Optimal real-time landing using DNNOptimal real-time landing using DNN
Optimal real-time landing using DNN홍배 김
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 

What's hot (20)

Principle of Maximum Entropy
Principle of Maximum EntropyPrinciple of Maximum Entropy
Principle of Maximum Entropy
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
 
Dycops2019
Dycops2019 Dycops2019
Dycops2019
 
Sequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmission
 
MUMS Opening Workshop - UQ Data Fusion: An Introduction and Case Study - Robe...
MUMS Opening Workshop - UQ Data Fusion: An Introduction and Case Study - Robe...MUMS Opening Workshop - UQ Data Fusion: An Introduction and Case Study - Robe...
MUMS Opening Workshop - UQ Data Fusion: An Introduction and Case Study - Robe...
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
 
WSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsWSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in Statistics
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation Problem
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methods
 
Max Entropy
Max EntropyMax Entropy
Max Entropy
 
Random Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application ExamplesRandom Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application Examples
 
Presentacion limac-unc
Presentacion limac-uncPresentacion limac-unc
Presentacion limac-unc
 
Optimal real-time landing using DNN
Optimal real-time landing using DNNOptimal real-time landing using DNN
Optimal real-time landing using DNN
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Multidimensional Monot...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Multidimensional Monot...MUMS: Bayesian, Fiducial, and Frequentist Conference - Multidimensional Monot...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Multidimensional Monot...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
mcmc
mcmcmcmc
mcmc
 
Talk 5
Talk 5Talk 5
Talk 5
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 

Viewers also liked

Ms Annie Keog - Carers Victoria, Support Services for Families
Ms Annie Keog - Carers Victoria, Support Services for FamiliesMs Annie Keog - Carers Victoria, Support Services for Families
Ms Annie Keog - Carers Victoria, Support Services for FamiliesPeer Support Network
 
презентация курса товароведение Cb krat
презентация курса  товароведение Cb krat  презентация курса  товароведение Cb krat
презентация курса товароведение Cb krat Светлана Бреусова
 
Analyse du projet de décret relatif à la méthodologie tarifaire applicable a...
Analyse du projet de décret relatif à  la méthodologie tarifaire applicable a...Analyse du projet de décret relatif à  la méthodologie tarifaire applicable a...
Analyse du projet de décret relatif à la méthodologie tarifaire applicable a...Université de Liège (ULg)
 
Comm 326 Friends Fandom
Comm 326 Friends FandomComm 326 Friends Fandom
Comm 326 Friends Fandomoliviaayer
 
20141123 子ども図書館寄付キャンペーン
20141123 子ども図書館寄付キャンペーン20141123 子ども図書館寄付キャンペーン
20141123 子ども図書館寄付キャンペーンsaltpayatas
 
Burnout and Depression Uncensored
Burnout and Depression UncensoredBurnout and Depression Uncensored
Burnout and Depression UncensoredFresh Vine
 
Microgrids and their destructuring eff ects on the electrical industry
Microgrids and their destructuring effects on the electrical industryMicrogrids and their destructuring effects on the electrical industry
Microgrids and their destructuring eff ects on the electrical industryUniversité de Liège (ULg)
 
Strategic Plan 2014-2015 (Final)
Strategic Plan 2014-2015 (Final)Strategic Plan 2014-2015 (Final)
Strategic Plan 2014-2015 (Final)James Patriquin
 
презентация курса товароведение Cb вся
презентация курса  товароведение Cb вся презентация курса  товароведение Cb вся
презентация курса товароведение Cb вся Светлана Бреусова
 
12caboutteamwoek 120521013446-phpapp02
12caboutteamwoek 120521013446-phpapp0212caboutteamwoek 120521013446-phpapp02
12caboutteamwoek 120521013446-phpapp02Antony Pagan
 

Viewers also liked (13)

Ms Annie Keog - Carers Victoria, Support Services for Families
Ms Annie Keog - Carers Victoria, Support Services for FamiliesMs Annie Keog - Carers Victoria, Support Services for Families
Ms Annie Keog - Carers Victoria, Support Services for Families
 
Ppt ict
Ppt ictPpt ict
Ppt ict
 
презентация курса товароведение Cb krat
презентация курса  товароведение Cb krat  презентация курса  товароведение Cb krat
презентация курса товароведение Cb krat
 
Analyse du projet de décret relatif à la méthodologie tarifaire applicable a...
Analyse du projet de décret relatif à  la méthodologie tarifaire applicable a...Analyse du projet de décret relatif à  la méthodologie tarifaire applicable a...
Analyse du projet de décret relatif à la méthodologie tarifaire applicable a...
 
Comm 326 Friends Fandom
Comm 326 Friends FandomComm 326 Friends Fandom
Comm 326 Friends Fandom
 
20141123 子ども図書館寄付キャンペーン
20141123 子ども図書館寄付キャンペーン20141123 子ども図書館寄付キャンペーン
20141123 子ども図書館寄付キャンペーン
 
Burnout and Depression Uncensored
Burnout and Depression UncensoredBurnout and Depression Uncensored
Burnout and Depression Uncensored
 
prezentare
prezentareprezentare
prezentare
 
Microgrids and their destructuring eff ects on the electrical industry
Microgrids and their destructuring effects on the electrical industryMicrogrids and their destructuring effects on the electrical industry
Microgrids and their destructuring eff ects on the electrical industry
 
Strategic Plan 2014-2015 (Final)
Strategic Plan 2014-2015 (Final)Strategic Plan 2014-2015 (Final)
Strategic Plan 2014-2015 (Final)
 
thyatira.pptx
thyatira.pptxthyatira.pptx
thyatira.pptx
 
презентация курса товароведение Cb вся
презентация курса  товароведение Cb вся презентация курса  товароведение Cb вся
презентация курса товароведение Cb вся
 
12caboutteamwoek 120521013446-phpapp02
12caboutteamwoek 120521013446-phpapp0212caboutteamwoek 120521013446-phpapp02
12caboutteamwoek 120521013446-phpapp02
 

Similar to Batch mode reinforcement learning based on the synthesis of artificial trajectories

Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...Université de Liège (ULg)
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningRyo Iwaki
 
Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...Université de Liège (ULg)
 
Meta-learning of exploration-exploitation strategies in reinforcement learning
Meta-learning of exploration-exploitation strategies in reinforcement learningMeta-learning of exploration-exploitation strategies in reinforcement learning
Meta-learning of exploration-exploitation strategies in reinforcement learningUniversité de Liège (ULg)
 
Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Asma Ben Slimene
 
Metody logiczne w analizie danych
Metody logiczne w analizie danych Metody logiczne w analizie danych
Metody logiczne w analizie danych Data Science Warsaw
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIJack Clark
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Fabian Pedregosa
 
IFAC2008art
IFAC2008artIFAC2008art
IFAC2008artYuri Kim
 
Conditional Random Fields
Conditional Random FieldsConditional Random Fields
Conditional Random Fieldslswing
 
CHAC Algorithm ECAL'07 Presentation
CHAC Algorithm ECAL'07 PresentationCHAC Algorithm ECAL'07 Presentation
CHAC Algorithm ECAL'07 PresentationAntonio Mora
 
A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...JuanPabloCarbajal3
 
On the discretized algorithm for optimal proportional control problems constr...
On the discretized algorithm for optimal proportional control problems constr...On the discretized algorithm for optimal proportional control problems constr...
On the discretized algorithm for optimal proportional control problems constr...Alexander Decker
 
block-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfblock-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfJunghyun Lee
 
Sampling-Based Planning Algorithms for Multi-Objective Missions
Sampling-Based Planning Algorithms for Multi-Objective MissionsSampling-Based Planning Algorithms for Multi-Objective Missions
Sampling-Based Planning Algorithms for Multi-Objective MissionsMd Mahbubur Rahman
 
Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...Willy Marroquin (WillyDevNET)
 

Similar to Batch mode reinforcement learning based on the synthesis of artificial trajectories (20)

Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
 
Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...Computing near-optimal policies from trajectories by solving a sequence of st...
Computing near-optimal policies from trajectories by solving a sequence of st...
 
Meta-learning of exploration-exploitation strategies in reinforcement learning
Meta-learning of exploration-exploitation strategies in reinforcement learningMeta-learning of exploration-exploitation strategies in reinforcement learning
Meta-learning of exploration-exploitation strategies in reinforcement learning
 
Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...
 
Metody logiczne w analizie danych
Metody logiczne w analizie danych Metody logiczne w analizie danych
Metody logiczne w analizie danych
 
MLMM_16_08_2022.pdf
MLMM_16_08_2022.pdfMLMM_16_08_2022.pdf
MLMM_16_08_2022.pdf
 
Lec1 01
Lec1 01Lec1 01
Lec1 01
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
IFAC2008art
IFAC2008artIFAC2008art
IFAC2008art
 
Conditional Random Fields
Conditional Random FieldsConditional Random Fields
Conditional Random Fields
 
CHAC Algorithm ECAL'07 Presentation
CHAC Algorithm ECAL'07 PresentationCHAC Algorithm ECAL'07 Presentation
CHAC Algorithm ECAL'07 Presentation
 
lecture.ppt
lecture.pptlecture.ppt
lecture.ppt
 
A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...A walk through the intersection between machine learning and mechanistic mode...
A walk through the intersection between machine learning and mechanistic mode...
 
On the discretized algorithm for optimal proportional control problems constr...
On the discretized algorithm for optimal proportional control problems constr...On the discretized algorithm for optimal proportional control problems constr...
On the discretized algorithm for optimal proportional control problems constr...
 
block-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfblock-mdp-masters-defense.pdf
block-mdp-masters-defense.pdf
 
C025020029
C025020029C025020029
C025020029
 
Sampling-Based Planning Algorithms for Multi-Objective Missions
Sampling-Based Planning Algorithms for Multi-Objective MissionsSampling-Based Planning Algorithms for Multi-Objective Missions
Sampling-Based Planning Algorithms for Multi-Objective Missions
 
Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...Scalable trust-region method for deep reinforcement learning using Kronecker-...
Scalable trust-region method for deep reinforcement learning using Kronecker-...
 

More from Université de Liège (ULg)

Reinforcement learning for electrical markets and the energy transition
Reinforcement learning for electrical markets and the energy transitionReinforcement learning for electrical markets and the energy transition
Reinforcement learning for electrical markets and the energy transitionUniversité de Liège (ULg)
 
Algorithms for the control and sizing of renewable energy communities
Algorithms for the control and sizing of renewable energy communitiesAlgorithms for the control and sizing of renewable energy communities
Algorithms for the control and sizing of renewable energy communitiesUniversité de Liège (ULg)
 
AI for energy: a bright and uncertain future ahead
AI for energy: a bright and uncertain future aheadAI for energy: a bright and uncertain future ahead
AI for energy: a bright and uncertain future aheadUniversité de Liège (ULg)
 
Extreme engineering for fighting climate change and the Katabata project
Extreme engineering for fighting climate change and the Katabata projectExtreme engineering for fighting climate change and the Katabata project
Extreme engineering for fighting climate change and the Katabata projectUniversité de Liège (ULg)
 
Ex-post allocation of electricity and real-time control strategy for renewabl...
Ex-post allocation of electricity and real-time control strategy for renewabl...Ex-post allocation of electricity and real-time control strategy for renewabl...
Ex-post allocation of electricity and real-time control strategy for renewabl...Université de Liège (ULg)
 
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...Université de Liège (ULg)
 
Décret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelableDécret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelableUniversité de Liège (ULg)
 
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...Université de Liège (ULg)
 
Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets Université de Liège (ULg)
 
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...Université de Liège (ULg)
 
Reinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisationReinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisationUniversité de Liège (ULg)
 
Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...Université de Liège (ULg)
 
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNSTProjet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNSTUniversité de Liège (ULg)
 
Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...Université de Liège (ULg)
 
Electrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the CongoElectrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the CongoUniversité de Liège (ULg)
 

More from Université de Liège (ULg) (20)

Reinforcement learning for electrical markets and the energy transition
Reinforcement learning for electrical markets and the energy transitionReinforcement learning for electrical markets and the energy transition
Reinforcement learning for electrical markets and the energy transition
 
Algorithms for the control and sizing of renewable energy communities
Algorithms for the control and sizing of renewable energy communitiesAlgorithms for the control and sizing of renewable energy communities
Algorithms for the control and sizing of renewable energy communities
 
AI for energy: a bright and uncertain future ahead
AI for energy: a bright and uncertain future aheadAI for energy: a bright and uncertain future ahead
AI for energy: a bright and uncertain future ahead
 
Extreme engineering for fighting climate change and the Katabata project
Extreme engineering for fighting climate change and the Katabata projectExtreme engineering for fighting climate change and the Katabata project
Extreme engineering for fighting climate change and the Katabata project
 
Ex-post allocation of electricity and real-time control strategy for renewabl...
Ex-post allocation of electricity and real-time control strategy for renewabl...Ex-post allocation of electricity and real-time control strategy for renewabl...
Ex-post allocation of electricity and real-time control strategy for renewabl...
 
Big infrastructures for fighting climate change
Big infrastructures for fighting climate changeBig infrastructures for fighting climate change
Big infrastructures for fighting climate change
 
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
 
Décret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelableDécret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelable
 
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
 
Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets
 
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
 
Reinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisationReinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisation
 
Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...
 
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNSTProjet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNST
 
Belgian offshore wind potential
Belgian offshore wind potentialBelgian offshore wind potential
Belgian offshore wind potential
 
Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...
 
Electrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the CongoElectrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the Congo
 
Energy: the clash of nations
Energy: the clash of nationsEnergy: the clash of nations
Energy: the clash of nations
 
Smart Grids Versus Microgrids
Smart Grids Versus MicrogridsSmart Grids Versus Microgrids
Smart Grids Versus Microgrids
 
Uber-like Models for the Electrical Industry
Uber-like Models for the Electrical IndustryUber-like Models for the Electrical Industry
Uber-like Models for the Electrical Industry
 

Recently uploaded

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Batch mode reinforcement learning based on the synthesis of artificial trajectories

guarantees

• Benchmark: puddle world.
• Trajectory sets: one covering the puddle, one not covering the puddle.
• RL algorithm: FQI with trees. With the covering set it outputs an optimal policy; with the non-covering set it outputs a suboptimal (unsafe) policy.

Typical performance guarantee in the deterministic case for FQI = (estimated return by FQI of the policy it outputs) minus constant × ("size" of the largest area of the state space not covered by the sample). 6
3. may make suboptimal use of near-optimal trajectories

Suppose a deterministic batch mode RL problem and that the set of trajectories contains a trajectory

(x_0^{opt. traj.}, u_0, r_0, x_1, u_1, r_1, x_2, . . . , x_{T−2}, u_{T−2}, r_{T−2}, x_{T−1}, u_{T−1}, r_{T−1}, x_T)

where the u_t's have been selected by an optimal policy.

Question: Which batch mode RL algorithms will output a policy that is optimal for the initial state x_0^{opt. traj.}, whatever the other trajectories in the set?

Answer: Not that many, and certainly not those using parametric FAs.

In my opinion, batch mode RL algorithms can only be successful on large-scale problems if (i) in the set of trajectories, many trajectories have been generated by (near-)optimal policies and (ii) the algorithms exploit very well the information contained in those (near-)optimal trajectories. 7
4. offer little clues about how to generate new experiments in an optimal way

Many real-life problems are variants of batch mode RL problems for which (a limited number of) additional trajectories can be generated (under various constraints) to enrich the initial set of trajectories.

Question: How should these new trajectories be generated?

Many approaches based on the analysis of the FAs produced by batch mode RL methods have been proposed; results are mixed. 8
Rebuilding trajectories

We conjecture that mapping the set of trajectories into FAs generally leads to the loss of information that is essential for addressing these four issues ⇒ we have developed a new line of research for solving batch mode RL that does not use FAs at all.

This line of research is articulated around the rebuilding of artificial (likely "broken") trajectories using the set of trajectories given as input of the batch mode RL problem; a rebuilt trajectory is defined by the elementary pieces of trajectory it is made of.

The rebuilt trajectories are analysed to compute various things: a high-performance policy, performance guarantees, where to sample, etc. 9
[Figure: blue arrows denote elementary pieces of trajectory. Left: the set of trajectories given as input of the batch RL problem. Right: examples of 5-length rebuilt trajectories made from elements of this set.] 10
Model-Free Monte Carlo Estimator

Building an oracle that estimates the performance of a policy is an important problem in batch mode RL. Indeed, if such an oracle is available, the problem of finding a high-performance policy can be reduced to an optimization problem over a set of candidate policies.

If a model of the sequential decision problem is available, a Monte Carlo estimator (i.e., rollouts) can be used to estimate the performance of a policy. We detail an approach that estimates the performance of a policy by rebuilding trajectories so as to mimic the behavior of the Monte Carlo estimator. 11
Context in which the approach is presented

Discrete-time dynamics: x_{t+1} = f(x_t, u_t, w_t), t = 0, 1, . . . , T − 1, where x_t ∈ X, u_t ∈ U and w_t ∈ W. w_t is drawn at every time step according to P_w(·).

Reward observed after each system transition: r_t = ρ(x_t, u_t, w_t) where ρ : X × U × W → R is the reward function.

Type of policies considered: h : {0, 1, . . . , T − 1} × X → U.

Performance criterion: expected sum of the rewards observed over the T-length horizon
PC^h(x) = J^h(x) = E_{w_0,...,w_{T−1}}[ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ] with x_0 = x and x_{t+1} = f(x_t, h(t, x_t), w_t).

Available information: a set of elementary pieces of trajectories F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}. f, ρ and P_w are unknown.

The approach is aimed at estimating J^h(x) from F_n. 12
Monte Carlo Estimator

Generate nbTraj T-length trajectories by simulating the system starting from the initial state x_0; for every trajectory compute the sum of rewards collected; average these sums of rewards over the nbTraj trajectories to get an estimate MCE^h(x_0) of J^h(x_0).

[Figure: illustration with nbTraj = 3 and T = 5; three simulated trajectories starting from x_0, with e.g. r_3 = ρ(x_3, h(3, x_3), w_3), x_4 = f(x_3, h(3, x_3), w_3), w_3 ∼ P_w(·); sum rew. traj. 1 = Σ_{i=0}^{4} r_i and MCE^h(x_0) = (1/3) Σ_{i=1}^{3} sum rew. traj. i.]

Bias MCE^h(x_0) = E_{nbTraj∗T rand. var. w∼P_w(·)}[MCE^h(x_0) − J^h(x_0)] = 0

Var. MCE^h(x_0) = (1/nbTraj) · (variance of the sum of rewards along a trajectory) 13
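As an illustration, a plain Monte Carlo rollout estimator might be sketched as follows. This is a minimal sketch, not the authors' code; the simulator f, reward function rho, disturbance sampler sample_w and policy h are assumed to be provided as callables.

```python
import numpy as np

def monte_carlo_estimate(x0, h, f, rho, sample_w, T, nb_traj, rng=None):
    """Plain Monte Carlo estimate of J^h(x0): average, over nb_traj simulated
    T-length trajectories, of the sum of rewards collected along each one."""
    rng = rng or np.random.default_rng()
    returns = []
    for _ in range(nb_traj):
        x, total = x0, 0.0
        for t in range(T):
            u = h(t, x)            # action prescribed by the policy
            w = sample_w(rng)      # disturbance drawn from P_w(.)
            total += rho(x, u, w)  # reward of the transition
            x = f(x, u, w)         # next state
        returns.append(total)
    return float(np.mean(returns))
```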
Description of the Model-free Monte Carlo Estimator (MFMC)

Principle: rebuild nbTraj T-length trajectories using the elements of the set F_n and average the sums of rewards collected along the rebuilt trajectories to get an estimate MFMCE^h(F_n, x_0) of J^h(x_0).

Trajectory rebuilding algorithm: trajectories are rebuilt sequentially; an elementary piece of trajectory can only be used once; trajectories are grown in length by selecting at every instant t = 0, 1, . . . , T − 1 the elementary piece of trajectory (x, u, r, y) that minimizes the distance ∆((x, u), (x_end, h(t, x_end))), where x_end is the ending state of the already rebuilt part of the trajectory (x_end = x_0 if t = 0). 14
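A minimal sketch of this rebuilding scheme, under the same assumptions as the Monte Carlo sketch above (the names mfmc_estimate, samples and dist are illustrative; samples is a list of one-step transitions (x, u, r, y) and dist is the distance ∆ on the state-action space):

```python
import numpy as np

def mfmc_estimate(x0, h, samples, T, nb_traj, dist):
    """Model-free Monte Carlo estimate of J^h(x0) obtained by rebuilding
    nb_traj trajectories of length T from the transitions in `samples`.
    Each transition may be used at most once across all rebuilt trajectories."""
    available = list(range(len(samples)))
    returns = []
    for _ in range(nb_traj):
        x_end, total = x0, 0.0
        for t in range(T):
            u_target = h(t, x_end)
            # pick the unused transition whose (x, u) is closest to (x_end, h(t, x_end))
            best = min(available,
                       key=lambda l: dist((samples[l][0], samples[l][1]),
                                           (x_end, u_target)))
            available.remove(best)
            _, _, r, y = samples[best]
            total += r
            x_end = y  # grow the rebuilt trajectory from the reached state
        returns.append(total)
    return float(np.mean(returns))
```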
Remark: when sequentially selecting the pieces of trajectories, knowing only (x, u) and the previously selected elementary pieces of trajectories gives no information on the value of the disturbance w "behind" the new elementary piece of trajectory (x, u, r = ρ(x, u, w), y = f(x, u, w)) that is about to be selected. This is important for having a meaningful estimator!

[Figure: illustration with nbTraj = 3, T = 5 and F_24 = {(x^l, u^l, r^l, y^l)}_{l=1}^{24}; three trajectories rebuilt from x_0, with e.g. sum rew. re. traj. 1 = r^3 + r^18 + r^21 + r^7 + r^9 and MFMCE^h(F_n, x_0) = (1/3) Σ_{i=1}^{3} sum rew. re. traj. i.] 15
Analysis of the MFMC

Random set ˜F_n defined as follows: it is made of n elementary pieces of trajectory where the first two components of an element (x^l, u^l) are given by the first two components of the lth element of F_n, and the last two are generated by drawing for each l a disturbance w^l at random from P_w(·) and taking r^l = ρ(x^l, u^l, w^l) and y^l = f(x^l, u^l, w^l). F_n is a realization of the random set ˜F_n.

Bias and variance of the MFMCE defined as:
Bias MFMCE^h(˜F_n, x_0) = E_{w^1,...,w^n∼P_w}[MFMCE^h(˜F_n, x_0) − J^h(x_0)]
Var. MFMCE^h(˜F_n, x_0) = E_{w^1,...,w^n∼P_w}[(MFMCE^h(˜F_n, x_0) − E_{w^1,...,w^n∼P_w}[MFMCE^h(˜F_n, x_0)])²]

We provide bounds on the bias and variance of this estimator. 16
Assumptions

1] The functions f, ρ and h are Lipschitz continuous: ∃ L_f, L_ρ, L_h ∈ R+ such that ∀(x, x′, u, u′, w) ∈ X² × U² × W:
‖f(x, u, w) − f(x′, u′, w)‖_X ≤ L_f (‖x − x′‖_X + ‖u − u′‖_U)
|ρ(x, u, w) − ρ(x′, u′, w)| ≤ L_ρ (‖x − x′‖_X + ‖u − u′‖_U)
‖h(t, x) − h(t, x′)‖_U ≤ L_h ‖x − x′‖_X   ∀t ∈ {0, 1, . . . , T − 1}.

2] The distance ∆ is chosen such that: ∆((x, u), (x′, u′)) = ‖x − x′‖_X + ‖u − u′‖_U. 17
Characterization of the bias and the variance

Theorem.
Bias MFMCE^h(˜F_n, x_0) ≤ C · sparsity_{F_n}(nbTraj ∗ T)
Var. MFMCE^h(˜F_n, x_0) ≤ ( sqrt(Var. MCE^h(x_0)) + 2C · sparsity_{F_n}(nbTraj ∗ T) )²

with C = L_ρ Σ_{t=0}^{T−1} Σ_{i=0}^{T−t−1} [L_f (1 + L_h)]^i and with the sparsity_{F_n}(k) defined as the minimal radius r such that all balls in X × U of radius r contain at least k state-action pairs (x^l, u^l) of the set F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}. 18
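For concreteness, the constant C appearing in these bounds can be computed directly from the Lipschitz constants; a small helper sketch (the function name is illustrative):

```python
def lipschitz_constant_C(L_f, L_rho, L_h, T):
    """C = L_rho * sum_{t=0}^{T-1} sum_{i=0}^{T-t-1} [L_f (1 + L_h)]^i,
    the constant in the bias/variance bounds of the MFMC estimator."""
    k = L_f * (1.0 + L_h)
    return L_rho * sum(sum(k ** i for i in range(T - t)) for t in range(T))
```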
Test system

Discrete-time dynamics: x_{t+1} = sin(π/2 (x_t + u_t + w_t)) with X = [−1, 1], U = [−1/2, 1/2], W = [−0.1/2, 0.1/2] and P_w(·) a uniform pdf.

Reward observed after each system transition: r_t = (1/(2π)) e^{−(x_t² + u_t²)/2} + w_t

Performance criterion: expected sum of the rewards observed over a 15-length horizon (T = 15).

We want to evaluate the performance of the policy h(t, x) = −x/2 when x_0 = −0.5. 19
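This benchmark can be reproduced with the sketches above (monte_carlo_estimate and mfmc_estimate). How the slides generate F_n is not specified here, so drawing state-action pairs uniformly is an assumption made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Test-system definitions from the slide
f = lambda x, u, w: np.sin(np.pi / 2.0 * (x + u + w))
rho = lambda x, u, w: np.exp(-0.5 * (x ** 2 + u ** 2)) / (2.0 * np.pi) + w
sample_w = lambda rng: rng.uniform(-0.05, 0.05)   # uniform on W = [-0.1/2, 0.1/2]
h = lambda t, x: -x / 2.0
T, x0 = 15, -0.5

def build_samples(n, rng):
    """Build a set F_n of one-step transitions from uniformly drawn (x, u) pairs
    (an illustrative assumption, not the slide's sampling scheme)."""
    samples = []
    for _ in range(n):
        x, u, w = rng.uniform(-1, 1), rng.uniform(-0.5, 0.5), sample_w(rng)
        samples.append((x, u, rho(x, u, w), f(x, u, w)))
    return samples

dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])  # Delta on X x U

samples = build_samples(10_000, rng)
print(monte_carlo_estimate(x0, h, f, rho, sample_w, T, nb_traj=10, rng=rng))
print(mfmc_estimate(x0, h, samples, T, nb_traj=10, dist=dist))
```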
[Figure: simulation results for nbTraj = 10 and |F_n| = 100, . . . , 10000, comparing the Model-free Monte Carlo Estimator with the Monte Carlo Estimator.] 20
[Figure: simulation results for nbTraj = 1, . . . , 100 and |F_n| = 10,000, comparing the Model-free Monte Carlo Estimator with the Monte Carlo Estimator.] 21
Remember what was said about RL + FAs: 1. not well adapted to risk sensitive performance criteria

Suppose the risk sensitive performance criterion:
PC^h(x) = −∞ if P( Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) < b ) > c, and PC^h(x) = J^h(x) otherwise,
where J^h(x) = E[ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ].

MFMCE adapted to this performance criterion: rebuild nbTraj trajectories starting from x_0 using the set F_n, as done with the MFMCE estimator. Let sum rew traj i be the sum of rewards collected along the ith rebuilt trajectory. Output as estimate of PC^h(x_0):
−∞ if ( Σ_{i=1}^{nbTraj} I{sum rew traj i < b} ) / nbTraj > c, and ( Σ_{i=1}^{nbTraj} sum rew traj i ) / nbTraj otherwise. 22
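A sketch of this risk-sensitive variant, reusing the same rebuilding loop as mfmc_estimate above (names are illustrative, same assumptions on samples and dist):

```python
import numpy as np

def mfmc_risk_sensitive(x0, h, samples, T, nb_traj, dist, b, c):
    """Rebuild nb_traj trajectories exactly as in mfmc_estimate, but return
    -inf if the empirical fraction of rebuilt trajectories whose return falls
    below b exceeds c; otherwise return the average of the returns."""
    available = list(range(len(samples)))
    returns = []
    for _ in range(nb_traj):
        x_end, total = x0, 0.0
        for t in range(T):
            u_target = h(t, x_end)
            best = min(available,
                       key=lambda l: dist((samples[l][0], samples[l][1]),
                                           (x_end, u_target)))
            available.remove(best)
            _, _, r, y = samples[best]
            total += r
            x_end = y
        returns.append(total)
    if np.mean([ret < b for ret in returns]) > c:
        return float('-inf')
    return float(np.mean(returns))
```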
MFMCE in the deterministic case

We consider from now on that: x_{t+1} = f(x_t, u_t) and r_t = ρ(x_t, u_t). One single trajectory is sufficient to compute J^h(x_0) exactly by Monte Carlo estimation.

Theorem. Let [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be the trajectory rebuilt by the MFMCE when using the distance measure ∆((x, u), (x′, u′)) = ‖x − x′‖ + ‖u − u′‖. If f, ρ and h are Lipschitz continuous, we have
|MFMCE^h(x_0) − J^h(x_0)| ≤ Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))
where y^{l_{−1}} = x_0 and L_{Q_N} = L_ρ Σ_{t=0}^{N−1} [L_f (1 + L_h)]^t. 23
The previous theorem extends to any rebuilt trajectory:

Theorem. Let [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be any rebuilt trajectory. If f, ρ and h are Lipschitz continuous, we have
| Σ_{t=0}^{T−1} r^{l_t} − J^h(x_0) | ≤ Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))
where ∆((x, u), (x′, u′)) = ‖x − x′‖ + ‖u − u′‖, y^{l_{−1}} = x_0 and L_{Q_N} = L_ρ Σ_{t=0}^{N−1} [L_f (1 + L_h)]^t.

[Figure: a trajectory generated by policy h from x_0 (rewards r_0, . . . , r_4) compared with a rebuilt trajectory (rewards r^{l_0}, . . . , r^{l_4}) for T = 5; each discrepancy term has the form ∆_2 = L_{Q_3}(‖y^{l_1} − x^{l_2}‖ + ‖h(2, y^{l_1}) − u^{l_2}‖), and |Σ_{t=0}^{4} r_t − Σ_{t=0}^{4} r^{l_t}| ≤ Σ_{t=0}^{4} ∆_t.] 24
Computing a lower bound on a policy

From the previous theorem, we have for any rebuilt trajectory [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1}:
J^h(x_0) ≥ Σ_{t=0}^{T−1} r^{l_t} − Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))

This suggests finding the rebuilt trajectory that maximizes the right-hand side of the inequality so as to compute a tight lower bound on J^h(x_0). Let:
lower_bound(h, x_0, F_n) = max over rebuilt trajectories [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} of
Σ_{t=0}^{T−1} r^{l_t} − Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t})) 25
A tight upper bound on J^h(x) can be defined and computed in a similar way:
upper_bound(h, x_0, F_n) = min over rebuilt trajectories [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} of
Σ_{t=0}^{T−1} r^{l_t} + Σ_{t=0}^{T−1} L_{Q_{T−t}} ∆((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}))

Why are these bounds tight? Because:
∃ C ∈ R+ : J^h(x) − lower_bound(h, x, F_n) ≤ C · sparsity_{F_n}(1)
and upper_bound(h, x, F_n) − J^h(x) ≤ C · sparsity_{F_n}(1).

The functions lower_bound(h, x_0, F_n) and upper_bound(h, x_0, F_n) can be implemented in a "smart way" by viewing the problem as one of finding a shortest path in a graph. Complexity linear in T and quadratic in |F_n|. 26
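One way to organize that shortest-path computation is a backward dynamic program over stages t = 0, . . . , T−1 whose nodes are the transitions of F_n. The sketch below is my reading of that formulation, not the authors' implementation; it assumes transitions may be reused across stages and relies on illustrative helper names (l_q, dist, samples):

```python
import numpy as np

def l_q(N, L_f, L_rho, L_h):
    """L_{Q_N} = L_rho * sum_{t=0}^{N-1} [L_f (1 + L_h)]^t."""
    k = L_f * (1.0 + L_h)
    return L_rho * sum(k ** t for t in range(N))

def lower_bound(h, x0, samples, T, dist, L_f, L_rho, L_h):
    """Lower bound on J^h(x0): maximum over rebuilt trajectories of
    sum_t r^{l_t} - sum_t L_{Q_{T-t}} * Delta_t, computed by dynamic
    programming over T stages (complexity O(T * n^2))."""
    n = len(samples)
    # best[l]: value of the best rebuilt suffix starting at the current stage
    # with transition l (computed backwards from the last stage).
    best = np.array([r for (_, _, r, _) in samples], dtype=float)
    for t in range(T - 2, -1, -1):
        pen = l_q(T - (t + 1), L_f, L_rho, L_h)
        new_best = np.empty(n)
        for l, (x, u, r, y) in enumerate(samples):
            cont = max(best[m] - pen * dist((samples[m][0], samples[m][1]),
                                            (y, h(t + 1, y)))
                       for m in range(n))
            new_best[l] = r + cont
        best = new_best
    pen0 = l_q(T, L_f, L_rho, L_h)
    return max(best[l] - pen0 * dist((samples[l][0], samples[l][1]),
                                     (x0, h(0, x0)))
               for l in range(n))
```

The upper bound is obtained the same way by flipping the sign of the penalty term and taking minima instead of maxima.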
Remember what was said about RL + FAs: 2. may lead to unsafe policies - poor performance guarantees

Let H be a set of candidate high-performance policies. To obtain a policy with good performance guarantees, we suggest solving the following problem:
h ∈ arg max_{h∈H} lower_bound(h, x_0, F_n)

If H is the set of open-loop policies, solving the above optimization problem can be seen as identifying an "optimal" rebuilt trajectory and outputting as open-loop policy the sequence of actions taken along this rebuilt trajectory (see the sketch below). 27
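In the open-loop case, the action term of ∆ can be driven to zero by setting each open-loop action equal to the action of the selected transition, so only the state mismatch is penalized. The following sketch is one possible reading of that search (it reuses l_q from the sketch above; taking L_h = 0 for state-independent policies is an assumption, as are the names state_dist and best_open_loop_sequence):

```python
def best_open_loop_sequence(x0, samples, T, state_dist, L_f, L_rho, L_h=0.0):
    """Find the rebuilt trajectory maximizing
    sum_t r^{l_t} - sum_t L_{Q_{T-t}} * ||prev_y - x^{l_t}||
    and return (lower bound value, open-loop action sequence read off it)."""
    n = len(samples)
    value = [samples[l][2] for l in range(n)]      # stage T-1 values (just r)
    succ = [[None] * n for _ in range(T)]          # best successor per stage
    for t in range(T - 2, -1, -1):
        pen = l_q(T - (t + 1), L_f, L_rho, L_h)
        new_value = [0.0] * n
        for l in range(n):
            y = samples[l][3]
            m_best = max(range(n),
                         key=lambda m: value[m] - pen * state_dist(samples[m][0], y))
            succ[t][l] = m_best
            new_value[l] = (samples[l][2] + value[m_best]
                            - pen * state_dist(samples[m_best][0], y))
        value = new_value
    pen0 = l_q(T, L_f, L_rho, L_h)
    l0 = max(range(n), key=lambda l: value[l] - pen0 * state_dist(samples[l][0], x0))
    bound = value[l0] - pen0 * state_dist(samples[l0][0], x0)
    # read off the open-loop action sequence along the best rebuilt trajectory
    actions, l = [], l0
    for t in range(T):
        actions.append(samples[l][1])
        l = succ[t][l] if t < T - 1 else None
    return bound, actions
```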
[Figure: puddle-world policies. Trajectory set covering the puddle: FQI with trees vs. h ∈ arg max_{h∈H} lower_bound(h, x_0, F_n). Trajectory set not covering the puddle: FQI with trees vs. h ∈ arg max_{h∈H} lower_bound(h, x_0, F_n).] 28
Remember what was said about RL + FAs: 3. may make suboptimal use of near-optimal trajectories

Suppose a deterministic batch mode RL problem and that F_n contains the elements of the trajectory

(x_0^{opt. traj.}, u_0, r_0, x_1, u_1, r_1, x_2, . . . , x_{T−2}, u_{T−2}, r_{T−2}, x_{T−1}, u_{T−1}, r_{T−1}, x_T)

where the u_t's have been selected by an optimal policy. Let H be the set of open-loop policies. Then, the sequence of actions
h ∈ arg max_{h∈H} lower_bound(h, x_0^{opt. traj.}, F_n)
is an optimal one, whatever the other trajectories in the set.

Actually, the sequence of actions h output by this algorithm tends to be a concatenation of subsequences of actions belonging to optimal trajectories. 29
Remember what was said about RL + FAs: 4. offer little clues about how to generate new experiments in an optimal way

The functions lower_bound(h, x_0, F_n) and upper_bound(h, x_0, F_n) can be exploited for generating new trajectories. For example, suppose that you can sample the state-action space several times so as to generate m new elementary pieces of trajectories to enrich your initial set F_n. We have proposed a technique to determine m "interesting" sampling locations based on these bounds. This technique - which is still preliminary - targets sampling locations that lead to the largest bound-width decrease for candidate optimal policies. 30
Closure

Rebuilding trajectories: an interesting concept for solving many problems related to batch mode RL.

Actually, the solution output by many RL algorithms (e.g., model learning with kNN, fitted Q iteration with trees) can be characterized by a set of "rebuilt trajectories". ⇒ I suspect that this concept of rebuilt trajectories could lead to a general paradigm for analyzing and designing RL algorithms. 31
Presentation based on the paper:

R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. "Batch mode reinforcement learning based on the synthesis of artificial trajectories." Annals of Operations Research, Volume 208, Issue 1, September 2013, Pages 383-416.

[Photograph of the authors.] 32