Model-based Reinforcement Learning
with Neural Networks
on Hierarchical Dynamic System
Akihiko Yamaguchi and Christopher G. Atkeson
Robotics Institute, Carnegie Mellon University http://akihikoy.net/
[Images: pouring water into a glass, ketchup bottle, squeezing a bottle, shampoo / body-wash bottles, instant coffee]
My pizza demonstration https://youtu.be/Wgj32blPGiE
https://youtu.be/GjwfbOur3CQ
Pouring: Manipulation of a Deformable Object
Planning actions
Planning the parameters of actions
= dynamic programming (optimal control, MPC, …)
Dynamics are partially unknown
→ a reinforcement learning problem
RL in pouring
Adaptation: not very hard
Generalization: hard
Are deep neural networks useful for this problem? (How should they be used in an RL framework?)
Remarks on Reinforcement Learning
It is useful to compare model-free RL vs. model-based RL
Successful robot-learning RL has been model-free (direct policy search) [cf. Kober et al. 2013]
Model-free: good at fine-tuning, lower computation cost at execution, robust to partial observability (POMDPs)
Model-based: suffers from simulation biases
Model-based advantages:
1. Generalization ability
2. Sharable / reusable models
3. Able to handle reward changes
2 and 3 are thanks to the symbolic (hierarchical) representation
[Figure: forward-kinematics (FK) ANN with input, hidden, and output layers and an update step [Magtanong et al. 2012]]
How to deal with simulation biases?
Do not learn dx/dt = F(x, u) (with a small time step dt, on the order of milliseconds)
Learn (sub)task-level dynamics instead:
Parameters → F_grasp → grasp result
Parameters → F_flow_ctrl → flow-control result
Use stochastic models (see the sketch after this slide):
Gaussian → F → Gaussian
Stochastic neural networks [Yamaguchi, Atkeson, ICRA 2016]
Use stochastic dynamic programming:
Stochastic differential dynamic programming [Yamaguchi, Atkeson, Humanoids 2015]
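A minimal sketch of this idea (the class name, the toy mean/error models, and the numbers are illustrative assumptions, not the authors' code): a (sub)task-level dynamics model maps action parameters directly to a Gaussian over outcomes, i.e. it predicts a mean and a covariance rather than integrating dx/dt at millisecond time steps.

```python
# Hedged sketch: task-level stochastic dynamics, parameters -> Gaussian outcome.
# All names and numbers here are illustrative placeholders.
import numpy as np

class TaskLevelDynamics:
    """Pairs a learned mean model with a learned error (covariance) model."""
    def __init__(self, mean_model, error_model):
        self.mean_model = mean_model      # params -> predicted outcome mean
        self.error_model = error_model    # params -> predicted outcome covariance

    def __call__(self, params):
        return self.mean_model(params), self.error_model(params)

# Toy stand-in for F_grasp: grasp parameters -> (mean, cov) of the grasp result.
F_grasp = TaskLevelDynamics(
    mean_model=lambda p: np.array([0.9 * p[0]]),
    error_model=lambda p: np.array([[0.01 + 0.1 * abs(p[1])]]))

mean, cov = F_grasp(np.array([0.5, 0.2]))  # Gaussian over the grasp result
```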
Model-based RL with Neural Networks for Hierarchical Dynamic System
Stochastic Neural Networks
Propagation of a probability distribution from the input to the output
Gradients of the output expectation with respect to the input
Difficulty: nonlinear activation functions
ReLU: f(x) = max(0, x) (see the sketch after this slide)
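For a single ReLU unit with a Gaussian input, the output mean and variance have closed forms, which is one way to propagate a distribution through the nonlinearity. A minimal univariate sketch (not the authors' implementation; scipy is used only for the normal CDF/PDF):

```python
# Moment-matched propagation of x ~ N(mu, sigma^2) through z = max(0, x).
import numpy as np
from scipy.stats import norm

def relu_gaussian_moments(mu, sigma):
    """Return (E[z], Var[z]) for z = max(0, x), x ~ N(mu, sigma^2)."""
    if sigma <= 0.0:                       # degenerate (deterministic) input
        z = max(0.0, mu)
        return z, 0.0
    a = mu / sigma
    ez = mu * norm.cdf(a) + sigma * norm.pdf(a)                            # E[z]
    ez2 = (mu ** 2 + sigma ** 2) * norm.cdf(a) + mu * sigma * norm.pdf(a)  # E[z^2]
    return ez, ez2 - ez ** 2

# A mostly-off unit still has a small positive expected output.
print(relu_gaussian_moments(-1.0, 1.0))    # approx. (0.083, 0.068)
```

Because E[z] is a smooth function of the input mean and variance, gradients of the output expectation with respect to the input are also available in closed form.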
[Figure: stochastic neural network composed of a mean model and an error model with a shared input]
Use Case
Independent neural networks for each (sub)dynamical system (see the sketch after this slide)
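One way such independent sub-dynamics models can be composed is by chaining them and propagating a Gaussian through the chain, e.g. by first-order (Taylor) linearization around the mean. A minimal sketch with toy stand-ins for two learned sub-models (the function names and numbers are illustrative assumptions):

```python
# Chaining sub-dynamics models and propagating N(mu, cov) through them.
import numpy as np

def propagate(f, err_cov, mu, cov, eps=1e-4):
    """Push N(mu, cov) through y = f(x) + noise via a numerical Jacobian."""
    mu_y = f(mu)
    J = np.stack([(f(mu + eps * e) - mu_y) / eps      # finite-difference df/dx
                  for e in np.eye(len(mu))], axis=1)
    return mu_y, J @ cov @ J.T + err_cov(mu)

# Toy stand-ins for two sub-dynamics ("grasp", then "flow control").
f_grasp = lambda x: np.array([x[0] + 0.1 * x[1], 0.5 * x[1]])
f_flow = lambda x: np.array([0.8 * x[0] * x[1]])

mu, cov = np.array([0.3, 0.2]), 0.01 * np.eye(2)      # action parameters
mu, cov = propagate(f_grasp, lambda x: 0.01 * np.eye(2), mu, cov)
mu, cov = propagate(f_flow, lambda x: 0.02 * np.eye(1), mu, cov)
print(mu, cov)    # predicted flow-control result and its uncertainty
```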
Stochastic Differential Dynamic Programming
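The slide above names the planner; as a rough, much-simplified stand-in (not the authors' DDP formulation), the core use of the stochastic models is to optimize action parameters against the expected reward they predict. A toy sketch with an invented one-step model and a quadratic reward:

```python
# Gradient ascent on expected reward under a toy stochastic one-step model.
# Everything here (model, reward, learning rate) is an illustrative assumption.
import numpy as np

def model(a):
    """Toy model: action parameters -> (mean, variance) of the poured amount."""
    return 0.5 * a[0] + 0.2 * a[1], 0.02 + 0.1 * a[1] ** 2

def expected_reward(a, target=0.3):
    mean, var = model(a)
    # E[-(y - target)^2] = -(mean - target)^2 - var   for y ~ N(mean, var)
    return -(mean - target) ** 2 - var

a, lr, eps = np.array([0.0, 0.0]), 0.5, 1e-5
for _ in range(200):                       # simple numerical-gradient ascent
    g = np.array([(expected_reward(a + eps * e) - expected_reward(a)) / eps
                  for e in np.eye(2)])
    a = a + lr * g
print(a, expected_reward(a))               # tuned action parameters
```

Note how the variance term penalizes high-uncertainty actions, which is what makes planning with a stochastic model different from planning with a point prediction.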
Results of Experiments
DNN+DDP performed better than LWR+DDP
Using redundant features did not affect the learning performance
Worked in pouring with the PR2 robot
Video: https://youtu.be/aM3hE1J5W98
More Information
http://akihikoy.net/
https://www.youtube.com/AkihikoYamaguchi
Akihiko Yamaguchi and Christopher G. Atkeson:
Neural Networks and Differential Dynamic Programming for Reinforcement
Learning Problems, in Proceedings of the 2016 IEEE International Conference on
Robotics and Automation (ICRA2016), Stockholm, Sweden, May, 2016.
https://www.researchgate.net/publication/294729454
Akihiko Yamaguchi and Christopher G. Atkeson:
Differential Dynamic Programming with Temporally Decomposed Dynamics, in
Proceedings of the 15th IEEE-RAS International Conference on Humanoid Robots
(Humanoids2015), pp. 696-703, Seoul, 2015.
https://www.researchgate.net/publication/282157952
Akihiko Yamaguchi, Christopher G. Atkeson, and Tsukasa Ogasawara:
Pouring Skills with Planning and Learning Modeled from Human Demonstrations,
International Journal of Humanoid Robotics, Vol.12, No.3, pp.1550030, July, 2015.
https://www.researchgate.net/publication/280733055
