Model-Based Reinforcement Learning
@NIPS2017
Yasuhiro Fujita
Engineer
Preferred Networks, Inc.
Model-Based Reinforcement Learning (MBRL)
• Model = simulator = dynamics T(s, a, s′)
  – may or may not include the reward function
• Model-free RL uses data from the environment only
• Model-based RL uses data from a model (which is given or estimated)
  ◦ to use less data from the environment
  ◦ to look ahead and plan (sketched below)
  ◦ to explore
  ◦ to guarantee safety
  ◦ to generalize to different goals
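To make "looking ahead with a model" concrete, here is a minimal Python sketch of one-step lookahead; the model(state, action) and value_fn(state) interfaces are hypothetical and assumed only for illustration, not taken from any of the papers below.

# Minimal sketch: using a model to look ahead without touching the environment.
# `model(state, action) -> (next_state, reward)` and `value_fn(state) -> float`
# are hypothetical interfaces assumed for illustration.
import numpy as np

def one_step_lookahead(model, value_fn, state, actions, gamma=0.99):
    """Return the action whose predicted outcome looks best under value_fn.

    In model-free RL the `model(...)` call below would instead be a real
    environment step, i.e. it would consume environment data.
    """
    returns = []
    for a in actions:
        next_state, reward = model(state, a)
        returns.append(reward + gamma * value_fn(next_state))
    return actions[int(np.argmax(returns))]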
Why MBRL now?
• Despite deep RL's recent success, real-world applications are still hard to find
  – Requires a huge number of interactions (1M–1000M 😨)
  – No safety guarantees
  – Difficult to transfer to other tasks
• MBRL can be a solution to these problems
• This talk introduces some MBRL papers from the NIPS 2017 conference and the Deep RL Symposium
Imagination-Augmented Agents (I2As)
• T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Silver, and D. Wierstra, “Imagination-Augmented Agents for Deep Reinforcement Learning,” 2017.
• I2As utilize predictions from a model for planning
• Robust to model errors
The I2A architecture (1)
• Model-free path: a feed-forward net
• Model-based path:
  – Make multi-step predictions (= rollouts) from the current observation, one rollout per action
  – Encode each rollout with an LSTM
  – Aggregate the rollout codes by concatenation
The I2A architecture (2)
• The imagination core (sketched below)
  – consists of a rollout policy and a pretrained environment model
  – predicts the next observation and reward
• The rollout policy is distilled online from the I2A policy
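A rough sketch of one imagined rollout, assuming hypothetical env_model and rollout_policy callables and a torch LSTM; shapes and details are illustrative, not the paper's exact implementation.

# Sketch of a single imagined rollout in the spirit of I2A.
# env_model(obs, action) -> (next_obs, reward) and rollout_policy(obs) -> action
# are hypothetical callables; `lstm` is a torch.nn.LSTM whose input size matches
# the flattened (obs, reward) feature built below.
import torch

def imagine_rollout(obs, first_action, env_model, rollout_policy, lstm, depth=5):
    """Roll the learned model forward `depth` steps and encode the imagined
    trajectory into a single rollout code with an LSTM."""
    features = []
    action = first_action
    for _ in range(depth):
        obs, reward = env_model(obs, action)            # predicted next obs and reward
        features.append(torch.cat([obs.flatten(), reward.reshape(-1)]))
        action = rollout_policy(obs)                    # distilled from the I2A policy
    seq = torch.stack(features[::-1]).unsqueeze(1)      # (depth, batch=1, feature)
    _, (h_n, _) = lstm(seq)                             # summarize from the rollout's end
    return h_n[-1].squeeze(0)                           # one code per imagined rollout

Per the slide above, one such rollout is imagined for each action, and the resulting codes are concatenated together with the model-free path's features.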
Value Prediction Networks
• J. Oh, S. Singh, and H. Lee, “Value Prediction Network,” in NIPS, 2017.
• Directly predicting observations in pixel space may not be a good idea
  – They contain details that are irrelevant to the agent
  – They are unnecessarily high-dimensional and difficult to predict
• VPNs instead learn abstract states and a model over them by minimizing value prediction errors
The VPN architecture
• x: observation, o: option (≈ action here)
• Decompose Q(x, o) = r(s′) + γV(s′) (sketched below)
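A minimal sketch of this decomposition, assuming hypothetical encode/value/outcome/transition modules; the real VPN also predicts a per-option discount, which is omitted here for brevity.

# Sketch of the VPN decomposition: everything happens in abstract-state space.
# encode, value, outcome, transition are hypothetical learned modules.
def q_value(x, o, encode, value, outcome, transition, gamma=0.99):
    """Q(x, o) = r + gamma * V(s'), with s' a predicted abstract state."""
    s = encode(x)               # abstract state; never decoded back to pixels
    r = outcome(s, o)           # predicted immediate reward of option o
    s_next = transition(s, o)   # predicted next abstract state
    return r + gamma * value(s_next)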
Planning by VPNs
• The planning depth and width are fixed
• Values are averaged over prediction steps (see the sketch below)
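One way to write such an averaging recursion is sketched below, using the same hypothetical modules as in the previous sketch; the exact indexing and the pruning to the b best options at each depth differ slightly in the paper.

# Sketch of depth-d planning: average the direct value estimate with the best
# deeper estimate, weighting the deeper estimate more as d grows.
def plan_q(s, o, d, options, value, outcome, transition, gamma=0.99):
    r = outcome(s, o)
    s_next = transition(s, o)
    return r + gamma * plan_v(s_next, d - 1, options, value, outcome, transition, gamma)

def plan_v(s, d, options, value, outcome, transition, gamma=0.99):
    if d <= 1:
        return value(s)
    best_q = max(plan_q(s, o, d, options, value, outcome, transition, gamma)
                 for o in options)
    return (1.0 / d) * value(s) + ((d - 1.0) / d) * best_q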
Training VPNs
• V(s) of each abstract state is fit to the value computed by planning
• Combined with asynchronous n-step Q-learning, improves performance on 2D random grid worlds and on some Atari games
  – surpasses observation prediction on the grid worlds
QMDP-net (not RL but Imitation Learning)
• P. Karkus, D. Hsu, and W. S. Lee, “QMDP-Net: Deep Learning for Planning under Partial Observability,” in NIPS, 2017.
• A POMDP (partially observable MDP) and its solver are modeled as a single neural network and trained end-to-end to predict expert actions
  – Value Iteration Networks (NIPS 2016) did the same for fully observable domains
POMDPs and the QMDP algorithm
• In a POMDP
  – The agent observes only o ~ O(s), not s itself
  – A belief state is used instead: b(s) = probability of being in state s
• QMDP: an approximate algorithm for solving a POMDP (sketched below)
  1. Compute Q_MDP(s, a) of the underlying MDP for each (s, a) pair
  2. Compute the current belief b(s) = probability that the current state is s
  3. Approximate Q(b, a) ≈ Σ_s b(s) Q_MDP(s, a)
  4. Choose argmax_a Q(b, a)
  – Assumes that all uncertainty in the belief disappears after the next action
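A minimal tabular numpy sketch of these four steps; the array shapes are assumptions for illustration.

# Tabular sketch of QMDP.  T: (A, S, S) transition probabilities,
# R: (S, A) rewards, belief: (S,) distribution over states.
import numpy as np

def q_mdp(T, R, gamma=0.95, iters=200):
    """Step 1: value iteration on the underlying (fully observable) MDP."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * np.einsum('asn,n->sa', T, V)
    return Q

def qmdp_action(belief, Q):
    """Steps 2-4: weight Q_MDP by the belief and act greedily.
    Q(b, a) ≈ sum_s b(s) Q_MDP(s, a)."""
    return int(np.argmax(belief @ Q))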
The QMDP-net architecture (1)
• Consists of a Bayesian filter and a QMDP planner
  – The Bayesian filter outputs b
  – The QMDP planner outputs Q(b, a)
The QMDP-net architecture (2)
• Everything is represented as a CNN
• Works on abstract observations/states/actions that can differ from the real ones
  – abstract state = a position in the 2D plane that the CNNs operate on
Performance of the QMDP-net
• Expert actions are taken from successful trajectories of the QMDP algorithm, which solves the ground-truth POMDP
• QMDP-net surpasses plain recurrent nets and even the QMDP algorithm itself (which can fail)
MBRL with stability guarantees
• F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe Model-based Reinforcement Learning with Stability Guarantees,” in NIPS, 2017.
• Aims to guarantee stability (= recoverability to stable states) in continuous control when there is uncertainty in the estimated model
  – Achieves both safe policy updates and safe exploration
• Repeat (sketched below):
  – Estimate the region of attraction
  – Safely explore to reduce uncertainty in the model
  – Update the model (e.g. a Gaussian process)
  – Safely improve the policy to maximize some objective
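At a very high level, the loop looks like the sketch below; every helper is a hypothetical placeholder supplied by the caller, not the paper's implementation (which relies on Lyapunov functions and Gaussian-process confidence bounds). It only illustrates the order of the steps listed above.

# High-level sketch of the safe learning loop.  All helpers are hypothetical
# callables passed in by the caller.
def safe_mbrl_loop(policy, model, estimate_region_of_attraction,
                   pick_safe_informative_action, real_step, update_model,
                   safe_policy_update, n_iters=50):
    for _ in range(n_iters):
        # 1. Region of attraction: states from which the current policy
        #    provably returns to a stable region, under model uncertainty.
        safe_set = estimate_region_of_attraction(policy, model)
        # 2. Explore only inside the safe set, at the most informative point.
        state, action = pick_safe_informative_action(model, policy, safe_set)
        next_state = real_step(state, action)
        # 3. Update the statistical model (e.g. a Gaussian process).
        model = update_model(model, state, action, next_state)
        # 4. Improve the policy while keeping the stability constraint.
        policy = safe_policy_update(policy, model, safe_set)
    return policy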
How it works
• Can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down
RL on a learned model
• A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning,” 2017.
• If you can optimize a policy on a learned model, you may need less data from the environment
  – And NNs are good at prediction
• One way to obtain a policy from a learned model
  – Model Predictive Control (MPC), sketched below
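A minimal random-shooting MPC sketch on a learned model; the model and reward_fn interfaces and the action bounds are assumptions for illustration, and the paper's controller differs in its details.

# Sketch of random-shooting MPC on a learned dynamics model.
# model(state, action) -> next_state and reward_fn(state, action, next_state)
# are hypothetical interfaces; actions are assumed to live in [-1, 1]^action_dim.
import numpy as np

def mpc_action(state, model, reward_fn, action_dim, horizon=10, n_candidates=1000):
    """Sample random action sequences, simulate each with the model, and
    return the first action of the best sequence (replan at every real step)."""
    best_return, best_first = -np.inf, None
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:
            s_next = model(s, a)
            total += reward_fn(s, a, s_next)
            s = s_next
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first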
Model learning is difficult
• Even small prediction errors compound, and long rollouts eventually diverge
  – Policies learned/computed purely from simulated experience may fail
Fine-tuning a policy with model-free RL
• Outperforms pure model-free RL with this pipeline:
  1. Collect data, fit a model, and apply MPC
  2. Train a NN policy to imitate the MPC actions
  3. Fine-tune the policy with model-free RL (TRPO)
Model ensemble
• T. Kurutach and A. Tamar, “Model-Ensemble Trust-Region Policy Optimization,” in NIPS Deep Reinforcement Learning Symposium, 2017.
• Another way to learn a policy on a learned model
  – Apply model-free RL to the learned model
• Model-Ensemble Trust-Region Policy Optimization (ME-TRPO), sketched below:
  1. Fit an ensemble of NN models to predict next states
     ◦ Why an ensemble? To maintain model uncertainty
  2. Optimize the policy on simulated experience with TRPO until performance stops improving
  3. Collect new data for model learning and go to 1
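A high-level sketch of this loop; all helpers are hypothetical callables, and the actual algorithm samples a model from the ensemble per simulated step and uses a validation-based stopping test, which this sketch simplifies.

# High-level sketch of the ME-TRPO outer loop.  fit_model, trpo_update,
# evaluate_on_models, and collect_real_data are hypothetical callables;
# `data` is assumed to be a list of real transitions.
import random

def me_trpo(policy, models, fit_model, trpo_update, evaluate_on_models,
            collect_real_data, n_outer_iters=20):
    data = collect_real_data(policy)
    for _ in range(n_outer_iters):
        # 1. Fit every model in the ensemble to the real data; differing
        #    initializations/splits are what represent model uncertainty.
        for m in models:
            fit_model(m, data)
        # 2. Optimize the policy with TRPO on simulated experience until the
        #    ensemble-estimated performance stops improving.
        best = evaluate_on_models(policy, models)
        while True:
            policy = trpo_update(policy, random.choice(models))
            score = evaluate_on_models(policy, models)
            if score <= best:
                break
            best = score
        # 3. Collect fresh real data with the improved policy and repeat.
        data = data + collect_real_data(policy)
    return policy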
Effect on sample complexity
• Improves sample complexity on MuJoCo-based continuous control tasks
  – the x-axis is time steps on a log scale
Effect of the ensemble size
• More models, better performance
Summary
• MBRL is hot
  – There were more papers than could be covered here
• Popular ideas
  – Incorporating a model/planning structure into a NN
  – Using model-based simulations to reduce sample complexity
• (Deep) MBRL can be a solution to the drawbacks of deep RL
• However, MBRL has its own challenges
  – How to learn a good model
  – How to make use of a possibly bad model
