Reinforcement Learning in Configurable Environments

POLITECNICO DI MILANO
SCHOOL OF INDUSTRIAL AND INFORMATION ENGINEERING
Reinforcement Learning
in Configurable Environments:
an Information Theoretic approach
Supervisor: Prof. Marcello Restelli
Co-supervisor: Dr. Alberto Maria Metelli
Emanuele Ghelfi, 875550
20 Dec, 2018
Contents
1 Reinforcement Learning
2 Configurable MDP
3 Contribution - Relative Entropy Model Policy Search
4 Finite–Sample Analysis
5 Experiments
Motivations - Configurable Environments
Configure environmental parameters in a principled way.
Reinforcement Learning
Reinforcement Learning (Sutton and Barto, 1998) considers sequential decision making
problems.
[Figure: agent-environment interaction loop. The agent (policy) observes state S_t and reward R_t and selects action A_t; the environment returns reward R_{t+1} and next state S_{t+1}.]
Policy Search
Definition (Policy)
A policy is a function πθ : S → ∆(A) that maps states to probability distributions over actions.
Definition (Performance)
The performance of a policy is defined as:
J^π = J(θ) = E_{a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_{t=0}^{H−1} γ^t R(s_t, a_t, s_{t+1}) ].  (1)
Definition (MDP solution)
An MDP is solved when we find the best performing policy:
θ^∗ ∈ arg max_{θ∈Θ} J(θ).  (2)
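As a concrete reading of (1) and (2), the following sketch estimates J(θ) by Monte Carlo, averaging the discounted return of several rollouts. It is only an illustration: the Gym-style env with reset()/step() and the policy(state) callable are assumptions, not part of the thesis.

import numpy as np

def estimate_performance(env, policy, gamma=0.99, horizon=100, n_episodes=50):
    """Monte Carlo estimate of J(theta): average discounted return over rollouts."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        discounted_return, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)                      # a_t ~ pi_theta(.|s_t)
            state, reward, done, _ = env.step(action)   # s_{t+1} ~ P(.|s_t, a_t)
            discounted_return += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(discounted_return)
    return float(np.mean(returns))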
Gradient vs Trust Region
Gradient methods optimize the performance by following the gradient of J:
θ_{t+1} = θ_t + λ ∇_θ J(θ_t).
[Plot: the performance J^π as a function of θ, marking a local optimum θ_l, the current parameter θ_t, and the global optimum θ^∗.]
Trust region methods perform a constrained optimization:
max_θ J(θ)  s.t.  θ ∈ I(θ_t).
[Plot: the same performance landscape, with the trust region θ ∈ I(θ_t) highlighted around the current parameter θ_t.]
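To make the two update rules concrete, here is a minimal numerical sketch (my own illustration, not thesis code): a plain gradient-ascent step and a crude trust-region step that shrinks the step until it stays inside a Euclidean region I(θ_t) of radius delta; the methods listed next use a KL-based region instead. J and grad_J are assumed callables.

import numpy as np

def gradient_step(theta, grad_J, lam=0.1):
    """theta_{t+1} = theta_t + lambda * grad J(theta_t)."""
    return theta + lam * grad_J(theta)

def trust_region_step(theta, J, grad_J, delta=0.1, lam=1.0):
    """max_theta' J(theta') subject to theta' in I(theta_t), here ||theta' - theta|| <= delta."""
    step = lam * grad_J(theta)
    while np.linalg.norm(step) > delta:   # shrink until the candidate lies in the trust region
        step *= 0.5
    candidate = theta + step
    return candidate if J(candidate) >= J(theta) else theta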
Trust Region methods
Relative Entropy Policy Search (REPS) (Peters et al., 2010).
Trust Region Policy Optimization (TRPO) (Schulman et al., 2015).
Proximal Policy Optimization (PPO) (Schulman et al., 2017).
Policy Optimization via Importance Sampling (POIS) (Metelli et al., 2018b).
Configurable MDP
A Configurable Markov Decision Process (CMDP) (Metelli et al., 2018a) is an extension of an MDP in which the environment itself exposes configurable parameters.
[Figure: agent-environment loop of a CMDP. The agent acts through its policy, while the environment additionally exposes a configuration; rewards R_{t+1} and next states S_{t+1} are returned as in an MDP.]
Configurable MDP
Definition (CMDP performance)
The performance of a model-policy pair is:
J^{P,π} = J(ω, θ) = E_{a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_{t=0}^{H−1} γ^t R(s_t, a_t, s_{t+1}) ].  (3)
Definition (CMDP Solution)
The CMDP solution is the model-policy pair maximizing the performance:
(ω^∗, θ^∗) ∈ arg max_{ω∈Ω, θ∈Θ} J(ω, θ).  (4)
State of the Art
Safe Policy Model Iteration (Metelli et al., 2018a):
A safe approach (Pirotta et al., 2013) for solving CMDPs:
J^{P′,π′} − J^{P,π} (performance improvement) ≥ B(P′, π′)
  = (A^{P′,π}_{P,π,µ} + A^{P,π′}_{P,π,µ}) / (1 − γ)   (advantage term)
  − γ ∆q^{P,π} D / (2(1 − γ)²)   (dissimilarity penalization).   (5)
Limitations:
Finite state-action spaces.
Full knowledge of the environment dynamics.
High sample complexity.
Relative Entropy Model Policy Search
We present REMPS, a novel algorithm for CMDPs:
Information Theoretic approach.
Optimization and Projection.
Approximated models.
Continuous state and action spaces.
We optimize the Average Reward:
J^{P,π} = lim inf_{H→+∞} E_{a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ (1/H) Σ_{t=0}^{H−1} R(s_t, a_t, s_{t+1}) ].
REMPS - Optimization
We define the following constrained optimization problem:
Primal:
max_d E_d[ R(s, a, s′) ]   (objective function)
subject to:
D_KL(d ‖ d^{P,π}) ≤ ε   (trust region)
E_d[1] = 1   (d is a well-formed distribution).

Dual:
min_η η ε + η log E_{d^{P,π}}[ exp( R(s, a, s′) / η ) ]
subject to: η ≥ 0.

d^{P,π}: sampling distribution.
ε: trust-region size.
d: stationary distribution over state, action and next state.
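A minimal sketch of the optimization step (an illustration, not the thesis implementation): given rewards R_i of transitions sampled from d^{P,π} and a trust-region size ε, it minimizes the sample-based dual g(η) = ηε + η log( (1/N) Σ_i exp(R_i/η) ) over η > 0, using a stabilized log-sum-exp; solve_dual, eta_max and the bounds are assumed names and choices.

import numpy as np
from scipy.optimize import minimize_scalar

def remps_dual(eta, rewards, epsilon):
    """Sample-based dual: g(eta) = eta * epsilon + eta * log(mean(exp(R / eta)))."""
    z = rewards / eta
    log_mean_exp = np.max(z) + np.log(np.mean(np.exp(z - np.max(z))))  # stable log-sum-exp
    return eta * epsilon + eta * log_mean_exp

def solve_dual(rewards, epsilon, eta_max=1e3):
    """Minimize the dual over eta > 0 with a bounded scalar search."""
    result = minimize_scalar(remps_dual, bounds=(1e-6, eta_max),
                             args=(rewards, epsilon), method="bounded")
    return result.x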
REMPS - Optimization Solution
Theorem (REMPS solution)
The solution of the REMPS problem is:
d(s, a, s′) = d^{P,π}(s, a, s′) exp( R(s, a, s′)/η ) / ∫_S ∫_A ∫_S d^{P,π}(s, a, s′) exp( R(s, a, s′)/η ) ds da ds′,  (6)
where η is the minimizer of the dual problem.
The probability of (s, a, s′) is reweighted exponentially according to its reward.
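In practice the solution (6) is represented by weights on the collected samples: each transition (s_i, a_i, s′_i) drawn from d^{P,π} receives a weight proportional to exp(R_i/η). A small sketch, assuming η comes from the dual minimization above:

import numpy as np

def remps_weights(rewards, eta):
    """Normalized weights: d(s,a,s') proportional to d^{P,pi}(s,a,s') * exp(R/eta)."""
    z = rewards / eta
    w = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return w / w.sum()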
REMPS - Projection
[Figure: the REMPS solution d lies inside the KL trust region D_KL(d ‖ d^{P,π}) ≤ ε around the sampling distribution d^{P,π}, but in general outside the set D_{P,Π} of distributions induced by a parametric model-policy pair; the projection step maps it to the closest element of D_{P,Π}.]
REMPS - Projection
d-Projection (discrete state-action spaces):
arg min_{θ∈Θ, ω∈Ω} D_KL( d ‖ d^{ω,θ} ).
State-Kernel Projection (discrete action spaces):
arg min_{θ∈Θ, ω∈Ω} D_KL( P^π ‖ P^{π_θ}_ω ).
Independent Projection (continuous state-action spaces), which reduces to the weighted maximum-likelihood fit sketched below:
arg min_{θ∈Θ} D_KL( π ‖ π_θ )  and  arg min_{ω∈Ω} D_KL( P ‖ P_ω ).
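As derived in the appendix (Equations 11-14), the Independent Projection of the policy reduces to a weighted maximum-likelihood fit under the REMPS weights. The sketch below solves it in closed form for a linear-Gaussian policy π_θ(a|s) = N(θᵀs, σ²); the policy class and the function name are illustrative assumptions, not the thesis implementation.

import numpy as np

def project_policy_gaussian(states, actions, weights):
    """arg max_theta sum_i w_i log pi_theta(a_i|s_i) for a linear-Gaussian policy
    reduces to weighted least squares: theta = (S^T W S)^{-1} S^T W a."""
    W = np.diag(weights)
    A = states.T @ W @ states
    b = states.T @ W @ actions
    return np.linalg.solve(A, b)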
Algorithm
Algorithm 1 Relative Entropy Model Policy Search
1: for t = 0,1,... until convergence do
2: Collect samples from πθt , Pωt
3: Obtain η∗, the minimizer of the dual problem. (Optimization)
4: Project d according to the chosen projection strategy: (Projection)
a. d-Projection;
b. State-Kernel Projection;
c. Independent Projection.
5: Update Policy.
6: Update Model.
7: end for
8: return Policy-Model Pair (πθt , Pωt )
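A compact sketch of Algorithm 1 (an illustration of the loop, not the original implementation): collect_samples and the fit methods of policy and model are hypothetical placeholders for sample collection and for the chosen projection strategy; solve_dual and remps_weights refer to the sketches above.

def remps(env, policy, model, epsilon, iterations=100, n_samples=1000):
    for _ in range(iterations):
        # 1. Collect samples from the current policy-model pair (pi_theta_t, P_omega_t)
        states, actions, next_states, rewards = collect_samples(env, policy, model, n_samples)
        # 2. Optimization: obtain eta*, the minimizer of the dual problem
        eta = solve_dual(rewards, epsilon)
        # 3. Reweight the samples according to the REMPS solution
        weights = remps_weights(rewards, eta)
        # 4. Projection: fit policy and model to the reweighted samples
        policy.fit(states, actions, weights)                 # e.g. independent projection
        model.fit(states, actions, next_states, weights)
    return policy, model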
Finite–Sample Analysis
How much can the ideal performance differ from the approximated one?
d: solution of the optimization step with ∞ samples.
d̂: solution of the optimization step with N samples.
d̃: solution of the optimization and projection steps with N samples.
J_d − J_{d̃} = (J_d − J_{d̂})  [OPT]  +  (J_{d̂} − J_{d̃})  [PROJ].  (7)
Finite–Sample Analysis
How much can the ideal performance differ from the approximated one?
J_d − J_{d̃} ≤ r_max ψ(N) √( (8v log(2eN/v) + 8 log(8/δ)) / N )  (8)
  + r_max φ( 4 √( (8v log(2eN/v) + 8 log(8/δ)) / N ) )  (9)
  + √2 r_max sup_{d ∈ D_{d^{P,π}}} inf_{d̄ ∈ D_{P,Π}} √( D_KL(d ‖ d̄) ).  (10)
Experimental Results - Chain
Motivations:
Visualize the behaviour of REMPS;
Overcome local minima;
Configure the transition function.
[Figure: two-state chain MDP (S0, S1); each transition is labelled with the action, its policy probability (θ or 1 − θ), its configurable transition probability (ω or 1 − ω), and the reward.]
[Plots: performance J as a function of θ and ω (surface and contour views), showing a landscape with local optima.]
[Plots: average reward per iteration, approaching J(π∗, P∗), and configuration parameter ω per iteration, approaching ω∗, comparing G(PO)MDP with REMPS for ε = 1e−4, 1e−2, 1e−1 and 10.]
Experimental Results - Cartpole
Standard RL benchmark;
Configure cart acceleration;
Approximated model.
Minimize the acceleration while balancing the pole:
R(s, a, s′) = 10 − ω²/20 − 20 · (1 − cos(γ)),
where γ is the pole angle.
[Figure: cartpole with state (x, ẋ, γ, γ̇).]
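For concreteness, the configured reward can be computed as below; the argument names are assumptions (omega is the configured cart acceleration and pole_angle the pole angle γ in radians).

import math

def cartpole_reward(omega, pole_angle):
    """R = 10 - omega^2 / 20 - 20 * (1 - cos(gamma)): penalize acceleration and pole deviation."""
    return 10.0 - omega ** 2 / 20.0 - 20.0 * (1.0 - math.cos(pole_angle))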
[Plots: return per iteration under the ideal model and under the approximated model, comparing REMPS and G(PO)MDP.]
Experimental Results - TORCS
TORCS: Car Racing simulation (Bernhard Wymann, 2013; Loiacono et al., 2010).
Autonomous driving and Configuration.
Continuous Control.
Approximated model.
We configure the rear wing and the brake system.
[Plots: average reward, rear-wing configuration, and front-rear brake repartition per iteration (×10²), comparing REMPS and REPS.]
Conclusions
Contributions:
REMPS is able to solve the model-policy learning problem.
Finite-sample analysis.
Experimental evaluation.
Future research directions:
Adaptive KL-constraint.
Other divergences.
Finite-Time Analysis.
Plan to submit to ICML 2019.
Conclusions
Thank you for your attention
Questions?
Bibliography I
[Bernhard Wymann 2013] Wymann, Bernhard ; Guionneau, Christophe ; Dimitrakakis, Christos ; Coulom, Rémi ; Andrew S.: TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2013.
[Loiacono et al. 2010] Loiacono, D. ; Prete, A. ; Lanzi, P. L. ; Cardamone, L.: Learning to overtake in TORCS using simple reinforcement learning. In: IEEE Congress on Evolutionary Computation, July 2010, pp. 1–8.
[Metelli et al. 2018a] Metelli, Alberto M. ; Mutti, Mirco ; Restelli, Marcello: Configurable Markov Decision Processes. In: Dy, Jennifer (Ed.) ; Krause, Andreas (Ed.): Proceedings of the 35th International Conference on Machine Learning, vol. 80. Stockholmsmässan, Stockholm, Sweden : PMLR, 10–15 Jul 2018, pp. 3488–3497.
[Metelli et al. 2018b] Metelli, Alberto M. ; Papini, Matteo ; Faccio, Francesco ; Restelli, Marcello: Policy Optimization via Importance Sampling. In: arXiv:1809.06098 [cs, stat], September 2018.
Bibliography II
[Peters et al. 2010] Peters, Jan ; Mulling, Katharina ; Altun, Yasemin: Relative Entropy Policy Search. 2010.
[Pirotta et al. 2013] Pirotta, Matteo ; Restelli, Marcello ; Pecorino, Alessio ; Calandriello, Daniele: Safe Policy Iteration. In: Dasgupta, Sanjoy (Ed.) ; McAllester, David (Ed.): Proceedings of the 30th International Conference on Machine Learning, vol. 28. Atlanta, Georgia, USA : PMLR, 17–19 Jun 2013, pp. 307–315.
[Schulman et al. 2015] Schulman, John ; Levine, Sergey ; Moritz, Philipp ; Jordan, Michael I. ; Abbeel, Pieter: Trust Region Policy Optimization. In: arXiv:1502.05477 [cs], February 2015.
[Schulman et al. 2017] Schulman, John ; Wolski, Filip ; Dhariwal, Prafulla ; Radford, Alec ; Klimov, Oleg: Proximal Policy Optimization Algorithms. In: arXiv:1707.06347 [cs], July 2017.
[Sutton and Barto 1998] Sutton, Richard S. ; Barto, Andrew G.: Introduction to Reinforcement Learning. 1st ed. Cambridge, MA, USA : MIT Press, 1998.
d-projection
Projection of the stationary state distribution:
θ′, ω′ = arg min_{θ∈Θ, ω∈Ω} D_KL( d(s, a, s′) ‖ d^{ω,θ}(s, a, s′) )
s.t. d^{ω,θ}(s′) = ∫_S ∫_A d^{ω,θ}(s) π_θ(a|s) P_ω(s′|s, a) da ds.
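In the finite case, the stationarity constraint can be made concrete by computing the stationary distribution of the induced state kernel P^{π_θ}_ω(s′|s) = Σ_a π_θ(a|s) P_ω(s′|s, a). A power-iteration sketch (my own illustration, assuming tabular policy and transition arrays):

import numpy as np

def stationary_distribution(policy, transition, iters=1000, tol=1e-10):
    """policy: (S, A) array pi_theta(a|s); transition: (S, A, S) array P_omega(s'|s,a).
    Returns d with d(s') = sum_{s,a} d(s) pi_theta(a|s) P_omega(s'|s,a)."""
    state_kernel = np.einsum("sa,sat->st", policy, transition)   # induced state kernel P(s'|s)
    d = np.full(state_kernel.shape[0], 1.0 / state_kernel.shape[0])
    for _ in range(iters):
        d_next = d @ state_kernel
        if np.linalg.norm(d_next - d, 1) < tol:
            break
        d = d_next
    return d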
Projection of the State Kernel
Projection of the state kernel:
θ′, ω′ = arg min_{θ∈Θ, ω∈Ω} ∫_S d(s) D_KL( P^{π′}(·|s) ‖ P^{π_θ}_ω(·|s) ) ds
= arg max_{θ∈Θ, ω∈Ω} ∫_S d(s) ∫_S P^{π′}(s′|s) log P^{π_θ}_ω(s′|s) ds′ ds
= arg max_{θ∈Θ, ω∈Ω} ∫_S d(s) ∫_S ∫_A π′(a|s) P′(s′|s, a) log P^{π_θ}_ω(s′|s) da ds′ ds
= arg max_{θ∈Θ, ω∈Ω} ∫_S ∫_A ∫_S d(s, a, s′) log [ ∫_A P_ω(s′|s, a′) π_θ(a′|s) da′ ] ds′ da ds.
Independent Projection
Independent Projection of policy and model:
θ′ = arg min_{θ∈Θ} ∫_S d(s) D_KL( π′(·|s) ‖ π_θ(·|s) ) ds  (11)
= arg min_{θ∈Θ} ∫_S d(s) ∫_A π′(a|s) log [ π′(a|s) / π_θ(a|s) ] da ds  (12)
= arg min_{θ∈Θ} ∫_S ∫_A ∫_S d(s, a, s′) log [ π′(a|s) / π_θ(a|s) ] ds′ da ds  (13)
= arg max_{θ∈Θ} ∫_S ∫_A ∫_S d(s, a, s′) log π_θ(a|s) ds′ da ds.  (14)
The model projection is analogous: ω′ = arg max_{ω∈Ω} ∫_S ∫_A ∫_S d(s, a, s′) log P_ω(s′|s, a) ds′ da ds.
Projection
Theorem (Joint bound)
Let d^{P,π} denote the stationary distribution induced by the model P and policy π, and d^{P′,π′} the stationary distribution induced by the model P′ and policy π′. Assume the reward is uniformly bounded, i.e., for all s, s′ ∈ S and a ∈ A it holds that |R(s, a, s′)| < R_max. The absolute difference of the performances can be upper bounded as:
|J^{P,π} − J^{P′,π′}| ≤ R_max √( 2 D_KL( d^{P,π} ‖ d^{P′,π′} ) ).  (15)
Projection
Theorem (Disjoint bound)
Let d^{P,π} denote the stationary distribution induced by the model P and policy π, and d^{P′,π′} the stationary distribution induced by the model P′ and policy π′. Assume the reward is uniformly bounded, i.e., for all s, s′ ∈ S and a ∈ A it holds that |R(s, a, s′)| < R_max. If (P′, π′) admits a group-invertible state kernel P′^{π′}, the absolute difference of the performances can be upper bounded as:
|J^{P,π} − J^{P′,π′}| ≤ R_max c_1 E_{s,a ∼ d^{P,π}} [ √( 2 D_KL( π(·|s) ‖ π′(·|s) ) ) + √( 2 D_KL( P(·|s, a) ‖ P′(·|s, a) ) ) ],  (16)
where c_1 = 1 + ||A′^{#}||_∞ and A′^{#} is the group inverse of the state kernel P′^{π′}.
G(PO)MDP - Model extension
J^{P,π} = ∫ p_{θ,ω}(τ) G(τ) dτ
∇_ω J^{P,π} = ∫ p_{θ,ω}(τ) ∇_ω log p_{θ,ω}(τ) G(τ) dτ  (17)
= ∫ p_{θ,ω}(τ) [ Σ_{k=0}^{H} ∇_ω log P_ω(s_{k+1}|s_k, a_k) ] G(τ) dτ  (18)
∇_ω J^{P,π}_{G(PO)MDP} = ⟨ Σ_{l=0}^{H} ( Σ_{k=0}^{l} ∇_ω log P_ω(s_{k+1}|s_k, a_k) ) γ^l R(s_l, a_l, s_{l+1}) ⟩_N  (19)
where ⟨·⟩_N denotes the empirical average over N trajectories.
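A sketch of the model-gradient estimator in (19), under the assumption that for each of the N trajectories we already have the per-step score gradients ∇_ω log P_ω(s_{k+1}|s_k, a_k) (an array of shape [H, dim(ω)]) and the per-step rewards (shape [H]); this is an illustration, not the thesis code.

import numpy as np

def model_gradient_gpomdp(score_grads, rewards, gamma):
    """score_grads: list of [H, d] arrays with grad_omega log P_omega(s_{k+1}|s_k, a_k);
    rewards: list of [H] arrays; returns the G(PO)MDP-style estimate of grad_omega J."""
    per_trajectory = []
    for g, r in zip(score_grads, rewards):
        H = g.shape[0]
        cum_scores = np.cumsum(g, axis=0)                  # sum_{k <= l} grad log P_omega
        discounts = gamma ** np.arange(H)
        per_trajectory.append((cum_scores * (discounts * r)[:, None]).sum(axis=0))
    return np.mean(per_trajectory, axis=0)                 # average over N trajectories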