Model-free Monte Carlo-like Policy Evaluation

University of Liège – University of Michigan
Modelfree Monte Carlolike Policy Evaluation
Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst
Journées MAS, Bordeaux, France
(August 31st
– September 3rd
2010)
The material of this talk is based on a presentation
given by Raphael Fonteneau at Cap 2010.

Time
Patients
1
n
0 1 T
Evaluation
?
Therapy to evaluate

3
Introduction
● Discretetime stochastic optimal control problems arise in many
fields (finance, medecine, engineering,...)
● Many techniques for solving such problems use an oracle that
evaluates the performance of any given policy in order to
determine a (near)optimal control policy
● When the system is accessible to experimentation, such an oracle
can be based on a Monte Carlo (MC) approach
● In this paper, the only information is contained in a sample of one
step transitions of the system
● In this context, we propose a ModelFree Monte Carlo (MFMC)
estimator of the performance of a given policy that mimics in some
way the Monte Carlo estimator.

4
Problem statement
● We consider a discretetime system whose dynamics over T stages
is given by
● All xt
lie in a normed state space X, all ut
lie in a normed action
space U, wt
are i.i.d. according to a probability distribution pW
(.)
● An instantaneous reward is associated with the
action ut
while being in state xt

● A policy h: {0,...,T1} × X  U is given, and we want to evaluate
its performance.

Problem statement
● The expected return of the policy h when starting from an initial
state x0
is given by
where
with
x0
xT
w0
w1
wT −2
wT −1
x1
x2 xT−2 xT−1
r0
r1 rT −2
rT −1

6
Problem statement
● Problem: the functions f, ρ and pW
(.) are unknown
● They are replaced by a sample of system transitions
where the pairs are arbitrary chosen and the pairs
are determined by , where wl
is drawn
according to pW
(.)
How to evaluate Jh
(x0
) in this context?

7
The Monte Carlo estimator
● We define the Monte Carlo estimator of the expected return of h
when starting from the initial state x0
:
with

8
x1
1
x0
x1
p
xT −1
1
xT
1
xT
p
xT −1
px2
p
xT −2
p
x2
1
xT −2
1
w0
1
w0
p
wT −2
1
wT −1
1
xT
2
MC Estimator
∑
t=0
T −1
rt
1
∑
t=0
T −1
rt
2
∑
t=0
T −1
rt
p
1
p
∑
i=1
p
∑
t=0
T −1
rt
i
w1
1
w0
2 w1
2
wT −2
2 wT −1
2
wT −2
p
wT −1
p
wt
i
~pW .
w1
p
x1
2 x2
2
xT −2
2
xT −1
2
r0
1
r1
1
rT −2
1
rT −1
1
r0
2
r0
p
r1
2
r1
p
rT −2
2
rT −2
p
rT −1
2
rT −1
p

9
● We assume that the random variable Rh
(x0
) admits a finite
variance
● The bias and variance of the Monte Carlo estimator are

10
● Here, the MC approach is not feasible, since the system is
unknown
● We introduce the ModelFree Monte Carlo estimator
● From the sample of transitions, we build p sequences of different
transitions of length T called ``broken trajectories''
● These broken trajectories are built so as to minimize the
discrepancy (using a distance metric ∆) with a classical MC
sample that could be obtained by simulating the system with the
policy h
● We average the cumulated returns over the p broken trajectories
to compute an estimate of the expected return of h
● The algorithm has complexity O(npT) .
The Modelfree Monte Carlo estimator

11

12
Example with T=3, p=2, n=8

13

14

15

17

18

20

21

23

24

26

27

29

30

32
● Assumption: the functions f, ρ and h are Lipschitz continuous
MFMC estimator: analysis

33
● The only information available on the system is gathered in a
sample of n onestep transitions
● We define the random variable         as follows:
a
The set of pairs                                           is arbitrary chosen,
a
whereas the pairs are determined by
where wl
is drawn according to pW
(.)
●          is a realization of the random set         .

34
● Distance metric ∆
● ksparsity
● denotes the distance of (x,u) to its kth nearest
neighbor (using the distance ∆) in the sample

35
X
U
(x,u)
(x',u')The ksparsity can be
seen as the smallest
radius γ such that all
∆balls in X×U of radius
γ contain at least k
elements from

36
● Expected value of the MFMC estimator
● Theorem

37
● Variance of the MFMC estimator
● Theorem

38
Illustration
● System
● pW
(.) is uniform over W, T = 15, x0
= 0.5 .

39
Illustration
● Simulations for p = 10, n = 100 … 10 000, uniform grid
n
Monte Carlo estimatorModelfree Monte Carlo estimator

40
Illustration
● Simulations for p = 1 … 100, n = 10 000, uniform grid
Monte Carlo estimator
p p
Modelfree Monte Carlo estimator

41
Conclusions and Future work
Conclusions
● We have proposed in this paper an estimator of the expected
return of a policy in a modelfree setting, the MFMC estimator
● We have provided bounds on the bias and variance of the MFMC
estimator
● The bias and variance of the MFMC estimator converge to the bias
and variance of the MC estimator
Future work
● MFMC estimator in a direct policy search framework
● One could extend this approach to evaluate return distributions
(and not only expected values). This could allow to develop ''safe''
policy search techniques based on Value at Risk (VaR) criteria.

Model-­free Monte Carlo-­like Policy Evaluation

More Related Content

Similar to Model-­free Monte Carlo-­like Policy Evaluation

More from Université de Liège (ULg)

Recently uploaded