Robot Learning with Structured Knowledge and Richer Sensing
Akihiko Yamaguchi
Robotics Institute, Carnegie Mellon University
Manipulations in Everyday Activities
Folding clothes
Cleaning
Cooking
Bathing
Dressing
…
7
Japanese way of folding T-shirts: https://youtu.be/b5AWQ5aBjgE
Chinese cooking skills: https://youtu.be/PFGGTPPNdRQ
Dynamic Programming & Reinforcement Learning
8
• Tasty / Awful
• “I am satisfied”
• …
Dynamic Programming: when the dynamics models {F_k} are given
Reinforcement Learning: when {F_k} are unknown
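To make the distinction concrete, here is a minimal sketch (not from the slides; the MDP below is a random placeholder) of finite-horizon dynamic programming when the transition models {F_k} are given. When {F_k} are unknown, this backup cannot be computed directly and the problem becomes reinforcement learning.

```python
# Minimal sketch: finite-horizon dynamic programming (value iteration) when the
# transition models {F_k} are given.  States, actions, and rewards are random
# placeholders, not a real manipulation task.
import numpy as np

n_states, n_actions, horizon = 5, 2, 10
rng = np.random.default_rng(0)

# Known dynamics: P[a, s, s'] = transition probability, R[s, a] = reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

V = np.zeros(n_states)                       # terminal values
for k in reversed(range(horizon)):
    Q = R + np.einsum('ast,t->sa', P, V)     # backup through the known model
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)                    # greedy first-step policy
print("V:", np.round(V, 2), "policy:", policy)
```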
https://www.youtube.com/playlist?list=PL41MvLpqzOg8FF0xekWT9NXCdjzN_8PUS
Yamaguchi et al., "DCOB: Action space for reinforcement learning of high DoF robots", Autonomous Robots, 2013
https://www.youtube.com/playlist?list=PL41MvLpqzOg8FF0xekWT9NXCdjzN_8PUS
Deep Reinforcement Learning
Deep learning: with big data, a neural network can learn any input/output
mapping to any precision. We do not have to worry about how large the state
space is, and images can be handled directly as input without designing features.
Deep RL: using deep neural networks to represent policies, dynamics models,
value functions, etc. Deep RL can handle large state spaces given big data.
E.g. Atari games, Google's (S. Levine) learned visual servoing.
11
Deep Reinforcement Learning
12
(T-L) Learning to play Atari games, Google DeepMind, Mnih et al. 2015: https://youtu.be/cjpEIotvwFY
(T-R) DeepMPC robotic experiments - PR2 cuts food, Lenz et al. 2015: https://youtu.be/BwA90MmkvPU
(B-L) Learning to grasp from 50K tries, Pinto et al. 2016: https://youtu.be/oSqHc0nLkm8
(B-R) Learning hand-eye coordination for robotic grasping, Levine et al. 2017: https://youtu.be/l8zKZLqkfII
Deep Reinforcement Learning
Can Deep RL solve RL problems in general?
Maybe YES
Is that the intelligence we expect of robots?
Maybe NO
Learning grasping: 50,000 samples (Pinto et al. 2016),
800,000 samples (Levine et al. 2017)
How many samples are necessary to learn to cook sushi?
The strategy for designing a problem is unclear
How to learn with fewer samples is unclear
13
Intelligent Robot
An English proverb says:
"A word to the wise is enough."
"Many words to a fool, half a word to the wise."
In Japanese: 一を知って十を知る ("know ten from knowing one")
Robot version:
Many practice trials for a foolish robot, half a practice trial for the
intelligent robot.
14
How do we measure intelligence of robots?
Adaptation ability
Generalization ability
Scalability
15
(From a talk by Leslie Kaelbling at ICRA'16)
Key components to create intelligent robots
???
???
???
???
16
Key components to create intelligent robots
Library of skills
Structured knowledge
Learning and reasoning methods
Richer sensing and general hardware
18
My Work (Introduced today)
Deformable object manipulation (liquids, powders,
vegetables and fruits, etc.)
Representing behaviors with a skill library;
verified in PR2 and Baxter pouring experiments
Model-based RL with structured knowledge;
verified in simulated pouring
Richer sensing helps learning: Liquid flow
perception, FingerVision
19
Library of skills is essential
20
(Image sources: http://reflectionsintheword.files.wordpress.com/2012/08/pouring-water-into-glass.jpg,
http://schools.graniteschools.org/edtech-canderson/files/2013/01/heinz-ketchup-old-bottle.jpg,
http://old.post-gazette.com/images2/20021213hosqueeze_230.jpg,
http://img.diytrade.com/cdimg/1352823/17809917/0/1292834033/shampoo_bottle_bodywash_bottle.jpg,
http://www.nescafe.com/upload/golden_roast_f_711.png)
Pouring Behavior with Skill Library
Skill library
flow ctrl (tip, shake, …), grasp, move arm, …
→ State machines (structure, feedback control)
Planning methods
grasp, re-grasp, pouring locations,
feasible trajectories, …
→ Optimization-based approach
Learning methods
Skill selection → Table, Softmax (sketched below)
Parameter adjustment
(e.g. shake axis) → Optimization (CMA-ES)
Improve plan quality → Improve value functions
27
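Below is a minimal sketch of the softmax skill selection mentioned above, assuming each skill keeps a running-average value estimate; the skill names, temperature, and reward signal are illustrative placeholders rather than the actual robot implementation. Parameter adjustment (e.g. the shake axis) is instead framed as black-box optimization, for which a CMA-ES library can be used in the same spirit.

```python
# Minimal sketch: softmax selection over a small skill library with a
# running-average value update.  Skill names, temperature, and the reward
# signal are illustrative assumptions, not the robot implementation.
import numpy as np

rng = np.random.default_rng(0)
skills = ['tipping', 'shaking_A', 'shaking_B']   # hypothetical flow-control skills
values = np.zeros(len(skills))                   # estimated value per skill
counts = np.zeros(len(skills))
tau = 0.2                                        # softmax temperature

def select_skill():
    p = np.exp(values / tau)
    return rng.choice(len(skills), p=p / p.sum())

def update(idx, reward):
    counts[idx] += 1
    values[idx] += (reward - values[idx]) / counts[idx]   # running average

for episode in range(20):
    i = select_skill()
    reward = rng.normal(loc=[0.2, 0.8, 0.5][i], scale=0.1)  # stand-in for the pouring outcome
    update(i, reward)
print(dict(zip(skills, values.round(2))))
```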
Sharing Knowledge Among Robots
28
The same implementation
worked on PR2 and Baxter
PR2 and Baxter:
Diff: Kinematics, grippers
Same: Arm DoF, sensors
Sharable knowledge:
Skills
Behavior structure
Not sharable:
Policy parameters
Achieved and NOT Achieved
Achieved:
Generalization of grasping, moving container, and
pouring skills
→ over container shapes
→ over initial container poses
→ over different target amounts
Adaptation of pouring skills
→ to new material types & container shapes
NOT achieved:
Generalization of pouring skills
→ over material types & container shapes
29
Reinforcement learning for generalization
30
Reinforcement Learning in Pouring
Components of pouring behavior:
Skill library: can be general
Behavior structure: can be general
Selection of skill and skill parameters: situation specific
→ Planning (dynamic programming) is necessary
Dynamics are partially unknown
→ Reinforcement Learning problem
31
Reinforcement Learning
32
(Diagram: a taxonomy of Reinforcement Learning. Direct Policy Search and
Value Function-based methods are model-free; Model-based methods plan with
Dynamic Programming / Optimization. What is learned: Policy / Value Functions /
Forward Models; planning depth: 0 / 1 / N; learning complexity: RL / RL / SL.)
33
[Direct Policy Search]  [Value Function-based]  [Model-based]
Model-free tends to obtain better performance
34
[Kober, Peters, 2011] [Kormushev, 2010]
Model-free is robust in POMDPs
35
Yamaguchi et al., "DCOB: Action space for reinforcement learning of high DoF robots", Autonomous Robots, 2013
https://www.youtube.com/playlist?list=PL41MvLpqzOg8FF0xekWT9NXCdjzN_8PUS
POMDP:
Partially Observable
Markov Decision
Process
Model-based suffers from simulation biases
36
Simulation bias: when forward models are inaccurate (as is usual when the
models are learned), integrating them causes a rapid increase in future-state
estimation errors
cf. [Atkeson, Schaal, 1997b] [Kober, Peters, 2013]
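As a toy illustration of this compounding effect (not from the slides), the snippet below rolls out a slightly mis-identified one-dimensional linear model against the true dynamics: the one-step error is small, but the multi-step prediction error grows quickly.

```python
# Toy illustration of simulation bias: a small one-step model error grows
# rapidly when the learned model is integrated over many steps.
a_true, a_model = 1.05, 1.08        # true vs. slightly mis-learned dynamics x' = a*x
x_true = x_pred = 1.0
for t in range(1, 21):
    x_true *= a_true                # real system
    x_pred *= a_model               # rollout of the learned model
    if t in (1, 5, 10, 20):
        print(f"t={t:2d}  prediction error = {abs(x_pred - x_true):.3f}")
```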
Model-based is good at generalization
37
(Figure: a forward-kinematics neural network (FK ANN) with input, hidden, and
output layers.)
Learning inverse kinematics of an android face
[Magtanong, Yamaguchi, et al., 2012]
Model-based is good at sharing / reusing
learned components
38
Forward models are sharable / reusable
Analytical models can be combined
Model-based is flexible to reward changes
39
Our Approach
Model-based reinforcement learning
How to deal with simulation biases?
Do not try to learn dx/dt = F(x,u) (dt: small, like xx ms)
Learn (sub)task-level dynamics
→ Parameters → F_grasp → Grasp result
→ Parameters → F_flow_ctrl → Flow ctrl result
Use stochastic models
→ Gaussian → F → Gaussian (sketched below)
Use stochastic dynamic programming
→ E.g. Stochastic (Differential) Dynamic Programming
How to work with a skill library?
40
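A minimal sketch of the "Gaussian → F → Gaussian" item above, using first-order linearization of a placeholder nonlinear forward model. The actual work uses stochastic neural networks (next slides); this only illustrates how an input Gaussian can be mapped to an output Gaussian.

```python
# Minimal sketch: propagate a Gaussian (mu, Sigma) through a nonlinear forward
# model F by first-order linearization.  F and the numbers are placeholders.
import numpy as np

def F(x):                                     # hypothetical (sub)task-level model
    return np.array([np.sin(x[0]) + 0.5 * x[1], 0.8 * x[1]])

def jacobian(f, x, eps=1e-5):                 # numerical Jacobian of f at x
    J = np.zeros((len(f(x)), len(x)))
    for i in range(len(x)):
        dx = np.zeros(len(x)); dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

mu = np.array([0.3, 1.0])                     # input mean
Sigma = np.diag([0.01, 0.04])                 # input covariance
J = jacobian(F, mu)
mu_out = F(mu)                                # output mean (first order)
Sigma_out = J @ Sigma @ J.T                   # output covariance (first order)
print(mu_out, Sigma_out)
```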
Model-based RL for Graph-Structured Dynamics
41
Learning Unknown Dynamical Systems with Stochastic Neural Networks
Planning Actions with Stochastic Graph-DDP
42
Forward model can be:
• Dynamical system with/without action parameters
• Kinematics
• Feature detection, policy parameterization
• Reward
• …
Bifurcation model can be:
• Possible different results of an action
• Skill selection
• Spatial decomposition of dynamics
• Spatial conversion, including kinematics, feature detection, policy parameterization, and rewards
• …
GraphDDP
Bifurcation primitive
[Yamaguchi and Atkeson, Humanoids 2015, 2016]
43
GraphDDP
Bifurcation primitive
[Yamaguchi and Atkeson, Humanoids 2015, 2016]
Skill selection
Possible different results of an action
Reward
Spatial decomposition of dynamics
44
GraphDDP
Bifurcation primitive
[Yamaguchi and Atkeson, Humanoids 2015, 2016]
45
GraphDDP
Graph structure analysis → Tree DDP with multi-point search
[Yamaguchi and Atkeson, Humanoids 2015, 2016]
46 [Yamaguchi and Atkeson, ICRA 2016]
Stochastic Neural Networks
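As a hedged sketch of the general idea (not the implementation from [Yamaguchi and Atkeson, ICRA 2016]): a small network that predicts the mean and a diagonal log-variance of the next state, trained with the Gaussian negative log-likelihood. Architecture, dimensions, and data are placeholders.

```python
# Minimal sketch: a neural network predicting a Gaussian over the next state,
# trained with the Gaussian negative log-likelihood.  Not the author's code;
# sizes and data are placeholders.
import torch
import torch.nn as nn

class GaussianDynamicsNet(nn.Module):
    def __init__(self, dim_in=4, dim_out=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, dim_out)
        self.log_var = nn.Linear(hidden, dim_out)   # diagonal covariance

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

def nll(mean, log_var, target):
    # Gaussian negative log-likelihood (diagonal covariance, constants dropped)
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).sum(dim=1).mean()

net = GaussianDynamicsNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.randn(256, 4)          # placeholder (state, action) inputs
y = torch.randn(256, 2)          # placeholder next-state targets
for step in range(200):
    mean, log_var = net(x)
    loss = nll(mean, log_var, y)
    opt.zero_grad(); loss.backward(); opt.step()
```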
47
Works in real robots
Pouring Simulation with OpenDE
48
49
Achieved GENERALIZATION
over material variation and
container shapes
Decomposition of dynamics and
richer sensing are useful in learning
50
Example-1: Flow in Pouring
Do robots need to perceive FLOW in pouring?
51
Skill parameters → Flow → Poured amount
The robot can learn skill parameters to maximize reward (poured amount == target amount)
Considering decomposed dynamics (flow as an intermediate state) makes learning easier (sketched below)
52
(Figure: decomposed vs. not-decomposed dynamics)
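The sketch below illustrates the decomposition with made-up linear models: instead of fitting one opaque mapping from skill parameters to poured amount, two simpler mappings are fit and chained, with the observed flow supervising the intermediate stage. All names and data are placeholders, not the actual pouring dynamics.

```python
# Sketch: decomposed dynamics learning, skill parameters -> flow -> poured
# amount, where the observed flow supervises the intermediate model.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, size=(100, 2))                       # skill parameters
flow = np.tanh(theta @ np.array([1.5, -0.7])) + 0.05 * rng.normal(size=100)
amount = 2.0 * flow + 0.05 * rng.normal(size=100)              # poured amount

def fit_linear(X, y):
    Xb = np.c_[X, np.ones(len(X))]                             # add bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.c_[np.atleast_2d(Xq), np.ones(len(np.atleast_2d(Xq)))] @ w

f_flow = fit_linear(theta, flow)               # stage 1: parameters -> flow
f_amount = fit_linear(flow[:, None], amount)   # stage 2: flow -> amount

theta_new = np.array([[0.4, 0.9]])
flow_hat = f_flow(theta_new)
amount_hat = f_amount(flow_hat[:, None])
print("predicted flow:", flow_hat, "predicted amount:", amount_hat)
```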
How to Perceive Flow in Reality?
53
Example-2: Tactile Sensing in Manipulation
Is tactile sensing necessary in manipulation?
e.g. Google’s grasp learning: No tactile sensing; learning
visual servoing
What if grasping a container whose content is unknown?
What if external force is applied?
54
FingerVision: Vision-based Tactile Sensing
55
Multimodal tactile sensing
Force distribution
Proximity Vision
Slip / Deformation
Object pose, texture, shape
Low-cost and easy to
manufacture
Physically robust
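As a hedged sketch of the vision-based tactile principle (not the released FingerVision software): dark markers embedded in the transparent skin can be detected with OpenCV blob detection, and their displacement from a reference frame serves as a rough proxy for the force distribution. The camera index, detector settings, and the nearest-neighbor association are illustrative assumptions.

```python
# Hedged sketch: track dark markers on a transparent skin with OpenCV blob
# detection; marker displacement from a reference frame approximates the
# force distribution.  Camera index and association are placeholders.
import cv2
import numpy as np

detector = cv2.SimpleBlobDetector_create()       # default params detect dark blobs

def marker_positions(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return np.array([kp.pt for kp in detector.detect(gray)])   # (N, 2) pixels

def displacements(ref_pts, cur_pts):
    # Nearest-neighbor association (a real implementation tracks identities)
    return np.array([cur_pts[np.linalg.norm(cur_pts - p, axis=1).argmin()] - p
                     for p in ref_pts])

cap = cv2.VideoCapture(0)                        # camera looking at the finger skin
ok, ref_frame = cap.read()
assert ok, "camera not available"
ref_pts = marker_positions(ref_frame)

for _ in range(300):                             # process a few frames as a demo
    ok, frame = cap.read()
    if not ok:
        break
    cur_pts = marker_positions(frame)
    if len(cur_pts) and len(ref_pts):
        d = displacements(ref_pts, cur_pts)
        print("mean marker displacement (px):", d.mean(axis=0))
cap.release()
```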
Summary
Library of skills is essential
Skills and high-level behavior representations can be shared among robots
Considering pros & cons of reinforcement learning approaches is important
Model-free tends to obtain better performance
Model-free is robust in POMDPs
Model-based suffers from simulation biases
Model-based is good at generalization
Model-based is good at sharing / reusing learned components
Model-based is flexible to reward changes
A model-based reinforcement learning method for graph-structured dynamical
systems is proposed
Learning forward models with stochastic neural networks
Planning with stochastic Graph-DDP (differential dynamic programming)
Generalization of pouring behavior over material types is achieved
Decomposition of dynamics and richer sensing are useful in learning
More work: http://akihikoy.net
60
