Reinforcement Learning in Configurable Environments: an Information Theoretic Approach


  1. Politecnico di Milano, School of Industrial and Information Engineering. Reinforcement Learning in Configurable Environments: an Information Theoretic Approach. Supervisor: Prof. Marcello Restelli. Co-supervisor: Dott. Alberto Maria Metelli. Emanuele Ghelfi, 875550. 20 Dec 2018.
  2. Contents: 1. Reinforcement Learning; 2. Configurable MDP; 3. Contribution - Relative Entropy Model Policy Search; 4. Finite-Sample Analysis; 5. Experiments.
  3. Motivations - Configurable Environments: configure environmental parameters in a principled way.
  4. Reinforcement Learning. Reinforcement Learning (Sutton and Barto, 1998) considers sequential decision-making problems. [Diagram: the agent (policy) observes state $S_t$ and reward $R_t$, takes action $A_t$, and the environment returns $R_{t+1}$, $S_{t+1}$.]
  5. Policy Search. Definition (Policy): a policy is a function $\pi_\theta : \mathcal{S} \to \Delta(\mathcal{A})$ that maps states to probability distributions over actions. Definition (Performance): the performance of a policy is $J_\pi = J(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[\sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1})\right]$ (1). Definition (MDP solution): an MDP is solved when we find the best-performing policy, $\theta^* \in \arg\max_{\theta \in \Theta} J(\theta)$ (2).
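The performance in Eq. (1) is an expectation over trajectories, so it can be estimated by Monte Carlo rollouts. A minimal sketch, assuming a Gym-style environment with `reset`/`step` and a hypothetical `policy.sample_action` method (names not from the slides):

```python
import numpy as np

def estimate_performance(env, policy, gamma=0.99, horizon=200, n_episodes=50):
    """Monte Carlo estimate of J(theta): average discounted return over rollouts."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.sample_action(s)        # a_t ~ pi_theta(.|s_t)
            s_next, r, done, _ = env.step(a)   # s_{t+1} ~ P(.|s_t, a_t), reward R
            ret += discount * r
            discount *= gamma
            if done:
                break
            s = s_next
        returns.append(ret)
    return np.mean(returns)
```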
  6. Gradient vs Trust Region. Gradient methods optimize the performance through gradient updates: $\theta_{t+1} = \theta_t + \lambda \nabla_\theta J(\theta_t)$. Trust-region methods perform a constrained optimization: $\max_\theta J(\theta)$ s.t. $\theta \in I(\theta_t)$. [Plots: the performance $J_\pi$ as a function of $\theta$, comparing an unconstrained gradient step from $\theta_t$ with a step constrained to the trust region $I(\theta_t)$.]
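To make the contrast concrete, here is a schematic sketch of the two update rules. The `grad_J` callable and the Euclidean ball used as trust region are illustrative assumptions: REPS/TRPO-style methods constrain a KL divergence rather than a parameter norm.

```python
import numpy as np

def gradient_step(theta, grad_J, lr=0.01):
    """Plain gradient ascent: theta_{t+1} = theta_t + lambda * grad J(theta_t)."""
    return theta + lr * grad_J(theta)

def trust_region_step(theta, grad_J, radius=0.1):
    """Constrained update: maximize the first-order model of J subject to
    ||theta' - theta|| <= radius, a Euclidean stand-in for I(theta_t)."""
    g = grad_J(theta)
    norm = np.linalg.norm(g)
    return theta if norm == 0.0 else theta + radius * g / norm
```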
  7. Trust Region methods: Relative Entropy Policy Search (REPS) (Peters et al., 2010); Trust Region Policy Optimization (TRPO) (Schulman et al., 2015); Proximal Policy Optimization (PPO) (Schulman et al., 2017); Policy Optimization via Importance Sampling (POIS) (Metelli et al., 2018b).
  8. Configurable MDP. A Configurable Markov Decision Process (CMDP) (Metelli et al., 2018a) is an extension of the MDP. [Diagram: the agent (policy) interacts with an environment whose configuration can also be modified.]
  9. Configurable MDP. Definition (CMDP performance): the performance of a model-policy pair is $J_{P,\pi} = J(\omega, \theta) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[\sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1})\right]$ (3). Definition (CMDP solution): the CMDP solution is the model-policy pair maximizing the performance, $\omega^*, \theta^* \in \arg\max_{\omega \in \Omega,\, \theta \in \Theta} J(\omega, \theta)$ (4).
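What distinguishes a CMDP is that some environment parameters are decision variables alongside the policy. A self-contained toy sketch (a 1-D linear system, not one of the domains used later in the slides) of what a configurable-environment interface might look like:

```python
import numpy as np

class ConfigurableToyEnv:
    """Toy configurable environment: the dynamics coefficient omega is a
    configuration the learner may set, in addition to choosing its policy."""

    def __init__(self, omega=0.9, noise=0.1):
        self.omega = omega      # environment configuration (omega in the CMDP)
        self.noise = noise
        self.state = 0.0

    def set_configuration(self, omega):
        self.omega = omega      # changing omega changes the transition kernel P_omega

    def reset(self):
        self.state = float(np.random.randn())
        return self.state

    def step(self, action):
        # s' ~ P_omega(.|s, a): the transition depends on the configuration omega.
        next_state = self.omega * self.state + action + self.noise * np.random.randn()
        reward = -(next_state ** 2 + 0.1 * action ** 2)
        self.state = next_state
        return next_state, reward, False, {}
```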
  10. State of the Art. Safe Policy Model Iteration (Metelli et al., 2018a), a safe approach (Pirotta et al., 2013) for solving CMDPs, guarantees a performance-improvement lower bound: $J_{P',\pi'} - J_{P,\pi} \geq B(P', \pi') = \frac{\mathbb{A}^{P',\pi}_{P,\pi,\mu} + \mathbb{A}^{P,\pi'}_{P,\pi,\mu}}{1-\gamma} - \frac{\gamma\, \Delta q^{P,\pi} D}{2(1-\gamma)^2}$ (5), where the first term is the advantage term and the second the dissimilarity penalization. Limitations: finite state-action spaces; full knowledge of the environment dynamics; high sample complexity.
  11. Relative Entropy Model Policy Search. We present REMPS, a novel algorithm for CMDPs: information-theoretic approach; optimization and projection; approximated models; continuous state and action spaces. We optimize the average reward: $J_{P,\pi} = \liminf_{H \to +\infty} \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[\frac{1}{H}\sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1})\right]$.
  12. REMPS - Optimization. We define the following constrained optimization problem. Primal: $\max_d \; \mathbb{E}_d[R(s, a, s')]$ subject to $D_{KL}(d \,\|\, d_{P,\pi}) \leq \epsilon$ and $\mathbb{E}_d[1] = 1$. Dual: $\min_\eta \; \eta \epsilon + \eta \log \mathbb{E}_{d_{P,\pi}}\left[e^{R(s, a, s')/\eta}\right]$ subject to $\eta \geq 0$. Here $d_{P,\pi}$ is the sampling distribution, $\epsilon$ the trust-region size, and $d$ the stationary distribution over state, action and next state.
  13. REMPS - Optimization. The objective function is the expected reward under $d$: $\mathbb{E}_d[R(s, a, s')]$.
  14. REMPS - Optimization. The KL constraint $D_{KL}(d \,\|\, d_{P,\pi}) \leq \epsilon$ defines the trust region around the sampling distribution.
  15. REMPS - Optimization. The constraint $\mathbb{E}_d[1] = 1$ ensures that $d$ is a well-formed distribution.
  16. REMPS - Optimization Solution. Theorem (REMPS solution): the solution of the REMPS problem is $d(s, a, s') = \frac{d_{P,\pi}(s, a, s') \exp\left(R(s, a, s')/\eta\right)}{\int_{\mathcal{S}}\int_{\mathcal{A}}\int_{\mathcal{S}} d_{P,\pi}(s, a, s') \exp\left(R(s, a, s')/\eta\right) \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'}$ (6), where $\eta$ is the minimizer of the dual problem: the probability of $(s, a, s')$ is weighted exponentially with the reward.
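In practice the optimization step works on samples: minimize the empirical dual in the scalar $\eta$, then re-weight each sampled transition exponentially in its reward as in Eq. (6). A minimal sketch (hypothetical function name, SciPy for the one-dimensional minimization):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def remps_optimization(rewards, epsilon):
    """Return normalized weights of the REMPS solution d on sampled transitions."""
    rewards = np.asarray(rewards, dtype=np.float64)

    def dual(eta):
        # Empirical dual: g(eta) = eta*epsilon + eta*log E[exp(R/eta)],
        # evaluated in log-sum-exp form for numerical stability.
        r = rewards / eta
        m = np.max(r)
        return eta * epsilon + eta * (m + np.log(np.mean(np.exp(r - m))))

    eta_star = minimize_scalar(dual, bounds=(1e-6, 1e6), method="bounded").x

    # d(s,a,s') proportional to d_{P,pi}(s,a,s') * exp(R(s,a,s')/eta*): on samples drawn
    # from d_{P,pi} this is an exponential re-weighting, normalized to sum to one.
    w = np.exp((rewards - np.max(rewards)) / eta_star)
    return w / np.sum(w)
```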
  17. REMPS - Projection. [Figure: the space $\mathcal{D}_{P,\Pi}$ of stationary distributions representable by the parametric model-policy pairs.]
  18. REMPS - Projection. [Figure: the sampling distribution $d_{P,\pi}$, which lies in $\mathcal{D}_{P,\Pi}$.]
  19. REMPS - Projection. [Figure: the trust region $D_{KL}(d \,\|\, d_{P,\pi}) \leq \epsilon$ around $d_{P,\pi}$.]
  20. REMPS - Projection. [Figure: the REMPS solution $d$ inside the trust region, possibly outside $\mathcal{D}_{P,\Pi}$.]
  21. REMPS - Projection. [Figure: the projection of $d$ back onto $\mathcal{D}_{P,\Pi}$.]
  22. REMPS - Projection. d-Projection (discrete state-action spaces): $\arg\min_{\theta \in \Theta,\, \omega \in \Omega} D_{KL}(d \,\|\, d_{\omega,\theta})$. State-Kernel Projection (discrete action spaces): $\arg\min_{\theta \in \Theta,\, \omega \in \Omega} D_{KL}(P^{\pi'} \,\|\, P^{\pi_\theta}_{\omega})$. Independent Projection (continuous state-action spaces): $\arg\min_{\theta \in \Theta} D_{KL}(\pi' \,\|\, \pi_\theta)$ and $\arg\min_{\omega \in \Omega} D_{KL}(P' \,\|\, P_\omega)$.
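For continuous spaces, the independent projection of the policy reduces to a weighted maximum-likelihood fit under the REMPS weights (Eq. (14) in the appendix slides); the model projection is analogous. A sketch assuming a PyTorch policy module whose forward pass returns a `torch.distributions` object (this parametrization is an assumption, not stated in the slides):

```python
import torch

def independent_policy_projection(states, actions, weights, policy, n_steps=500, lr=1e-2):
    """Independent projection: arg max_theta sum_i w_i * log pi_theta(a_i | s_i),
    where w_i are the REMPS weights of the sampled transitions."""
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.float32)
    weights = torch.as_tensor(weights, dtype=torch.float32)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        log_prob = policy(states).log_prob(actions)   # log pi_theta(a|s)
        if log_prob.dim() > 1:                        # sum per-dimension log-probs
            log_prob = log_prob.sum(dim=-1)
        loss = -(weights * log_prob).mean()           # weighted negative log-likelihood
        loss.backward()
        opt.step()
    return policy
```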
  23. Algorithm. Algorithm 1 (Relative Entropy Model Policy Search): 1: for t = 0, 1, ... until convergence do; 2: collect samples from $\pi_{\theta_t}$, $P_{\omega_t}$; 3: obtain $\eta^*$, the minimizer of the dual problem (Optimization); 4: project $d$ according to the projection strategy (Projection): a. d-Projection, b. State-Kernel Projection, c. Independent Projection; 5: update the policy; 6: update the model; 7: end for; 8: return the policy-model pair $(\pi_{\theta_t}, P_{\omega_t})$.
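A skeleton of one iteration, reusing the `remps_optimization` and `independent_policy_projection` sketches above; the batch of transitions and the `project_model` callable are assumed to be supplied by the caller (data collection and model fitting are not spelled out here):

```python
def remps_iteration(states, actions, next_states, rewards, policy, model, epsilon,
                    project_model):
    """One iteration of Algorithm 1 on a batch of transitions collected from
    the current policy-model pair (steps 2-6)."""
    # Optimization (step 3): solve the dual for eta* and re-weight the samples.
    weights = remps_optimization(rewards, epsilon)
    # Projection (steps 4-6): fit the parametric policy and model to the weights.
    policy = independent_policy_projection(states, actions, weights, policy)
    model = project_model(states, actions, next_states, weights, model)
    return policy, model
```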
  24. Finite-Sample Analysis. How much can the ideal performance differ from the approximated one? Let $d$ be the solution of the optimization with infinitely many samples, $\hat{d}$ the solution of the optimization with $N$ samples, and $\tilde{d}$ the solution of optimization and projection with $N$ samples. Then $J_d - J_{\tilde{d}} = (J_d - J_{\hat{d}}) + (J_{\hat{d}} - J_{\tilde{d}})$ (7), where the first difference is the optimization error (OPT) and the second the projection error (PROJ).
  25. Finite-Sample Analysis. The gap can be bounded as $J_d - J_{\tilde{d}} \leq r_{\max}\, \psi(N) \sqrt{\frac{8 v \log\frac{2 e N}{v} + 8 \log\frac{8}{\delta}}{N}} + r_{\max}\, \phi \sqrt[4]{\frac{8 v \log\frac{2 e N}{v} + 8 \log\frac{8}{\delta}}{N}} + \sqrt{2}\, r_{\max} \sup_{d \in \mathcal{D}_{d_{P,\pi}}} \inf_{d' \in \mathcal{D}_{P,\Pi}} \sqrt{D_{KL}(d \,\|\, d')}$ (8)-(10).
  26. Experimental Results - Chain. Motivations: visualize the behaviour of REMPS; overcoming local minima; configure the transition function. [Diagram: two-state chain ($S_0$, $S_1$); each edge is labelled with the action ($a_0$ or $a_1$), the policy probability ($\theta$ or $1-\theta$), the transition probability ($\omega$ or $1-\omega$) and the reward. Plot: the performance $J$ as a function of $\theta$ and $\omega$.]
  27. Experimental Results - Chain. [Plots: average reward vs. iteration (approaching $J(\pi^*, P^*)$) and configuration parameter $\omega$ vs. iteration (approaching $\omega^*$) for G(PO)MDP.]
  28. Experimental Results - Chain. [Same plots, adding REMPS with $\epsilon = 10^{-4}$.]
  29. Experimental Results - Chain. [Same plots, adding REMPS with $\epsilon = 10^{-2}$.]
  30. Experimental Results - Chain. [Same plots, adding REMPS with $\epsilon = 10^{-1}$.]
  31. Experimental Results - Chain. [Same plots, adding REMPS with $\epsilon = 10$.]
  32. Experimental Results - Cartpole. Standard RL benchmark; configure the cart acceleration; approximated model. Minimize the acceleration while balancing the pole: $R(s, a, s') = 10 - \frac{\omega^2}{20} - 20\,(1 - \cos(\gamma))$. [Diagram: cartpole with state variables $\gamma$, $\dot{\gamma}$, $x$, $\dot{x}$, where $\gamma$ is the pole angle.]
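A direct transcription of this reward (here `pole_angle` is $\gamma$ and `acceleration` is the penalized cart acceleration, following the slide's notation):

```python
import math

def cartpole_reward(pole_angle, acceleration):
    """Cartpole reward from the slide: keep the pole upright while penalizing
    the cart acceleration: R = 10 - omega^2/20 - 20*(1 - cos(gamma))."""
    return 10.0 - acceleration ** 2 / 20.0 - 20.0 * (1.0 - math.cos(pole_angle))
```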
  33. Experimental Results - Cartpole. [Plots: return vs. iteration under the ideal model and under the approximated model, comparing REMPS and G(PO)MDP.]
  34. Experimental Results - TORCS. TORCS: car racing simulation (Bernhard Wymann, 2013; Loiacono et al., 2010). Autonomous driving and configuration. Continuous control. Approximated model. We configure the rear wing and the brake system.
  35. Experimental Results - TORCS. [Plots: average reward, rear wing configuration, and front-rear brake repartition vs. iteration, comparing REMPS and REPS.]
  36. Conclusions. Contributions: REMPS, an algorithm able to solve the model-policy learning problem; finite-sample analysis; experimental evaluation. Future research directions: adaptive KL constraint; other divergences; finite-time analysis. Plan to submit at ICML 2019.
  37. Conclusions. Thank you for your attention. Questions?
  38. Bibliography I. [Bernhard Wymann 2013] Wymann, Bernhard; Guionneau, Christophe; Dimitrakakis, Christos; Coulom, Rémi; Andrew S.: TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2013. [Loiacono et al. 2010] Loiacono, D.; Prete, A.; Lanzi, P. L.; Cardamone, L.: Learning to overtake in TORCS using simple reinforcement learning. In: IEEE Congress on Evolutionary Computation, July 2010, pp. 1-8. [Metelli et al. 2018a] Metelli, Alberto M.; Mutti, Mirco; Restelli, Marcello: Configurable Markov Decision Processes. In: Dy, Jennifer (Ed.); Krause, Andreas (Ed.): Proceedings of the 35th International Conference on Machine Learning, vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10-15 Jul 2018, pp. 3488-3497. [Metelli et al. 2018b] Metelli, Alberto M.; Papini, Matteo; Faccio, Francesco; Restelli, Marcello: Policy Optimization via Importance Sampling. In: arXiv:1809.06098 [cs, stat], September 2018.
  39. Bibliography II. [Peters et al. 2010] Peters, Jan; Mulling, Katharina; Altun, Yasemin: Relative Entropy Policy Search. 2010. [Pirotta et al. 2013] Pirotta, Matteo; Restelli, Marcello; Pecorino, Alessio; Calandriello, Daniele: Safe Policy Iteration. In: Dasgupta, Sanjoy (Ed.); McAllester, David (Ed.): Proceedings of the 30th International Conference on Machine Learning, vol. 28. Atlanta, Georgia, USA: PMLR, 17-19 Jun 2013, pp. 307-315. [Schulman et al. 2015] Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael I.; Abbeel, Pieter: Trust Region Policy Optimization. In: arXiv:1502.05477 [cs], February 2015. [Schulman et al. 2017] Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg: Proximal Policy Optimization Algorithms. In: arXiv:1707.06347 [cs], July 2017. [Sutton and Barto 1998] Sutton, Richard S.; Barto, Andrew G.: Introduction to Reinforcement Learning. 1st edition. Cambridge, MA, USA: MIT Press, 1998.
  40. d-Projection. Projection of the stationary state distribution: $(\theta, \omega) = \arg\min_{\theta \in \Theta,\, \omega \in \Omega} D_{KL}\left(d(s, a, s') \,\|\, d_{\omega,\theta}(s, a, s')\right)$ subject to the stationarity condition $d_{\omega,\theta}(s') = \int_{\mathcal{S}}\int_{\mathcal{A}} d_{\omega,\theta}(s)\, \pi_\theta(a|s)\, P_\omega(s'|s, a)\, \mathrm{d}a\, \mathrm{d}s$.
  41. Projection of the State Kernel. Projection of the state kernel: $(\theta, \omega) = \arg\min_{\theta \in \Theta,\, \omega \in \Omega} \int_{\mathcal{S}} d(s)\, D_{KL}\left(P^{\pi'}(\cdot|s) \,\|\, P^{\pi_\theta}_{\omega}(\cdot|s)\right) \mathrm{d}s = \arg\max_{\theta,\omega} \int_{\mathcal{S}} d(s) \int_{\mathcal{S}} P^{\pi'}(s'|s) \log P^{\pi_\theta}_{\omega}(s'|s)\, \mathrm{d}s'\, \mathrm{d}s = \arg\max_{\theta,\omega} \int_{\mathcal{S}} d(s) \int_{\mathcal{S}}\int_{\mathcal{A}} \pi'(a|s)\, P'(s'|s, a) \log P^{\pi_\theta}_{\omega}(s'|s)\, \mathrm{d}a\, \mathrm{d}s'\, \mathrm{d}s = \arg\max_{\theta,\omega} \int_{\mathcal{S}}\int_{\mathcal{A}}\int_{\mathcal{S}} d(s, a, s') \log \int_{\mathcal{A}} P_\omega(s'|s, a')\, \pi_\theta(a'|s)\, \mathrm{d}a'\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'$.
  42. Independent Projection. Independent projection of policy and model: $\theta = \arg\min_{\theta \in \Theta} \int_{\mathcal{S}} d(s)\, D_{KL}(\pi'(\cdot|s) \,\|\, \pi_\theta(\cdot|s))\, \mathrm{d}s$ (11) $= \arg\min_{\theta \in \Theta} \int_{\mathcal{S}} d(s) \int_{\mathcal{A}} \pi'(a|s) \log\frac{\pi'(a|s)}{\pi_\theta(a|s)}\, \mathrm{d}a\, \mathrm{d}s$ (12) $= \arg\min_{\theta \in \Theta} \int_{\mathcal{S}}\int_{\mathcal{A}}\int_{\mathcal{S}} d(s, a, s') \log\frac{\pi'(a|s)}{\pi_\theta(a|s)}\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'$ (13) $= \arg\max_{\theta \in \Theta} \int_{\mathcal{S}}\int_{\mathcal{A}}\int_{\mathcal{S}} d(s, a, s') \log\pi_\theta(a|s)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'$ (14), and analogously for the model parameters $\omega$.
  43. Projection. Theorem (Joint bound): let $d_{P,\pi}$ denote the stationary distribution induced by the model $P$ and policy $\pi$, and $d_{P',\pi'}$ the stationary distribution induced by the model $P'$ and policy $\pi'$. Assume the reward is uniformly bounded, that is, $|R(s, a, s')| < R_{\max}$ for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$. Then the difference of performance can be upper bounded as $|J_{P,\pi} - J_{P',\pi'}| \leq R_{\max} \sqrt{2\, D_{KL}(d_{P,\pi} \,\|\, d_{P',\pi'})}$ (15).
  44. Projection. Theorem (Disjoint bound): let $d_{P,\pi}$ denote the stationary distribution induced by the model $P$ and policy $\pi$, and $d_{P',\pi'}$ the one induced by the model $P'$ and policy $\pi'$. Assume the reward is uniformly bounded, that is, $|R(s, a, s')| < R_{\max}$ for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$. If $(P', \pi')$ admits a group-invertible state kernel $P^{\pi'}$, then the difference of performance can be upper bounded as $|J_{P,\pi} - J_{P',\pi'}| \leq R_{\max}\, c_1\, \mathbb{E}_{s,a \sim d_{P,\pi}}\left[\sqrt{2\, D_{KL}(\pi' \,\|\, \pi)} + \sqrt{2\, D_{KL}(P' \,\|\, P)}\right]$ (16), where $c_1 = 1 + \|A^{\#}\|_\infty$ and $A^{\#}$ is the group inverse of the state kernel $P^{\pi'}$.
  45. G(PO)MDP - Model extension. $J_{P,\pi} = \int p_{\theta,\omega}(\tau)\, G(\tau)\, \mathrm{d}\tau$, so $\nabla_\omega J_{P,\pi} = \int p_{\theta,\omega}(\tau)\, \nabla_\omega \log p_{\theta,\omega}(\tau)\, G(\tau)\, \mathrm{d}\tau$ (17) $= \int p_{\theta,\omega}(\tau) \left[\sum_{k=0}^{H} \nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k)\right] G(\tau)\, \mathrm{d}\tau$ (18). The G(PO)MDP-style estimator from $N$ trajectories is $\widehat{\nabla}_\omega J^{\mathrm{G(PO)MDP}}_{P,\pi} = \frac{1}{N} \sum_{n=1}^{N} \sum_{l=0}^{H} \left(\sum_{k=0}^{l} \nabla_\omega \log P_\omega(s^n_{k+1}|s^n_k, a^n_k)\right) \gamma^l R(s^n_l, a^n_l, s^n_{l+1})$ (19).
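A sketch of the corresponding sample-based estimator in Eq. (19), assuming the per-step model scores $\nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k)$ have already been computed for each trajectory (array shapes are assumptions):

```python
import numpy as np

def gpomdp_model_gradient(model_scores, rewards, gamma=0.99):
    """G(PO)MDP-style estimate of grad_omega J from N trajectories.
    model_scores[n][k]: grad_omega log P_omega(s_{k+1}|s_k, a_k), shape (dim_omega,)
    rewards[n][l]:      R(s_l, a_l, s_{l+1})
    """
    grads = []
    for scores, rews in zip(model_scores, rewards):
        scores = np.asarray(scores)             # shape (H, dim_omega)
        rews = np.asarray(rews)                 # shape (H,)
        cum_scores = np.cumsum(scores, axis=0)  # sum_{k=0}^{l} grad_omega log P_omega
        discounts = gamma ** np.arange(len(rews))
        grads.append((cum_scores * (discounts * rews)[:, None]).sum(axis=0))
    return np.mean(grads, axis=0)               # average over the N trajectories
```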
