Deep Learning
applied to robotics
Presenter: Sungjoon Choi
(sungjoon.choi@cpslab.snu.ac.kr)
Contents
• Learning Contact-Rich Manipulation Skills with Guided Policy
Search
• Supersizing Self-supervision: Learning to Grasp from 50K Tries
and 700 Robot Hours
• Learning Hand-Eye Coordination for Robotic Grasping with
Deep Learning and Large-Scale Data Collection
• Playing Atari with Deep Reinforcement Learning
• Human Level Control through Deep Reinforcement Learning
• Deep Reinforcement Learning with Double Q-Learning
• For playing Atari
• For controlling real manipulators
Learning Contact-Rich
Manipulation Skills with
Guided Policy Search
Sergey Levine, Nolan Wagener, and Pieter Abbeel
ICRA 2015
Video
Introduction
This paper wins the ICRA 2015 Best Manipulation Paper Award!
But why? What’s so great about this paper?
Personally, I think the main contribution of this paper is that it proposes a direct
policy learning method that can ‘actually train a real-world robot’
to perform such tasks.
That’s it??
I guess so! By the way, ‘actually training a real-world robot’ is
harder than you might imagine!
You will see how brilliant this paper is!
Brief review of MDP and RL
[Diagram: the standard agent-environment loop. The agent takes an action and receives an observation and a reward from the environment.]
Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy!
It is like saying “I will find a function that best satisfies the given conditions!”.
However, learning a function is not an easy problem. (In fact, it is impossible
unless we use some ‘prior’ knowledge!)
So, instead of learning the function itself, most works try to find the
‘parameters’ of a function by restricting the solution space to a class of
parametric functions, such as linear functions.
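To make this concrete, here is how that idea is usually written in standard RL notation (my reconstruction, not copied from the slides): the goal is

$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r(x_t, u_t)\right],$$

and instead of searching over all policies, we restrict ourselves to a parametric family $\pi_{\theta}(u \mid x)$ (for example a linear controller $u = \theta^{\top} x$, or a neural network) and optimize the parameters $\theta$.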
Brief review of MDP and RL
What are typical impediments in reinforcement learning?
In other words, why is it so HARD to find an optimal policy??
1. We are living in a continuous world, not a discrete grid world.
- In a continuous world, the standard (tabular) MDP machinery cannot be established.
- So instead, we usually use function approximation to handle this issue (see the sketch after this list).
2. However, linear functions do not work well in practice.
- And, of course, nonlinear functions are hard to optimize.
3. A (dynamics) model, which is often required, is HARD to obtain.
- Recall that value is defined as the “expected sum of rewards!”, and that expectation is taken under the dynamics.
Today’s paper tackles all three problems listed above!!
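The following is not from the paper; it is a minimal numpy sketch of what ‘function approximation’ means in point 1, assuming a toy 1-D continuous state, a hand-coded feature map, and TD(0) updates:

```python
import numpy as np

# Minimal TD(0) with a linear value-function approximator on a toy
# 1-D continuous state space. Illustration only, not from the paper.

def features(x):
    # Hand-coded polynomial features for a scalar state x.
    return np.array([1.0, x, x**2])

def td0_update(w, x, r, x_next, gamma=0.99, alpha=0.01):
    # One TD(0) step: move V_w(x) = w . phi(x) toward r + gamma * V_w(x').
    phi, phi_next = features(x), features(x_next)
    td_error = r + gamma * w @ phi_next - w @ phi
    return w + alpha * td_error * phi

# Toy rollout: the state drifts toward 0, the reward is -x^2.
rng = np.random.default_rng(0)
w = np.zeros(3)
x = rng.uniform(-1.0, 1.0)
for _ in range(10_000):
    x_next = 0.9 * x + 0.05 * rng.normal()
    w = td0_update(w, x, -x**2, x_next)
    x = x_next

print("learned value weights:", w)
```

The learned weights define a value estimate over the whole continuous state space, which is exactly what a tabular value function cannot do.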
Big Picture (which might be wrong)
RL: Reinforcement Learning
IRL: Inverse Reinforcement Learning (= IOC, Inverse Optimal Control)
LfD: Learning from Demonstration
DPL: Direct Policy Learning
[Venn diagram relating RL, DPL, IRL (= IOC), and LfD, with Guided Policy Search highlighted.]
How these problem settings compare (objective, what’s given, what’s NOT given, algorithms):
RL: objective: find an optimal policy; given: reward, dynamics model; not given: policy; algorithms: policy iteration, value iteration, TD learning, Q-learning.
IRL (= IOC): objective: find the underlying reward, then an optimal policy; given: experts’ demonstrations (often a dynamics model); not given: reward, policy; algorithms: MaxEnt IRL, MaxMargin planning, apprenticeship learning.
DPL: objective: find an optimal policy; given: experts’ demonstrations, reward; not given: dynamics model (not always); algorithm: guided policy search.
LfD: objective: find the underlying reward, then an optimal policy; given: experts’ demonstrations + others; not given: dynamics model (not always); algorithm: GP motion controller.
How the line of work leading to this paper evolved:
- MDPs are powerful, but finding the value function requires heavy computation. → LMDP [1]
- Let’s use the LMDP formulation for the inverse optimal control problem! → [2]
- How can we measure the ‘probability’ of (experts’) state-action sequences? → [3]
- Can we learn a ‘nonlinear’ reward function? → [4]
- Can we do that with only ‘locally’ optimal examples? → [5]
- Given the reward (note that the reward is given!), how can we ‘effectively’ learn the optimal policy? → [6]
- Re-formalize guided policy search. → [7]
- Let’s learn both the ‘dynamics model’ and the policy!! → [8]
- Image-based control with a CNN!! → [9]
- Applied to a real-world robot, the PR2!! → [10]
- The beginning of a new era (RL + deep learning): how can we ‘effectively’ search for the optimal policy? → [11] (latest)

[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models." arXiv 2015.
Learning Contact-Rich Manipulation Skills with Guided Policy Search
The main building block is Guided Policy Search (GPS).
GPS is a two-stage algorithm consisting of a trajectory optimization
stage and a policy learning stage (a sketch of the overall loop follows below).
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014
GPS is a direct policy search algorithm that can effectively scale to
high-dimensional systems.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
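This is not the authors' code: a rough Python sketch of the two-stage loop described above, in which the callables (the trajectory optimizer and the policy trainer) are hypothetical placeholders supplied by the user.

```python
def guided_policy_search(optimize_trajectories, train_policy_network,
                         init_policy, cost_fn, dynamics, init_states,
                         num_iters=10):
    """Two-stage GPS loop (sketch only; helper callables are hypothetical)."""
    policy = init_policy
    guiding_samples = []
    for _ in range(num_iters):
        # Stage 1: trajectory optimization (e.g., iterative LQR) under the
        # given cost and dynamics model produces guiding trajectories,
        # i.e., lists of (state, action) pairs.
        trajectories = optimize_trajectories(cost_fn, dynamics, init_states)
        guiding_samples.extend(trajectories)

        # Stage 2: policy learning fits the neural-network controller to the
        # guiding samples (importance-sampled policy search in the paper).
        policy = train_policy_network(policy, guiding_samples)
    return policy
```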
Guided policy search
Stage 1) Trajectory optimization (iterative LQR)
Given a reward function and a dynamics model, the trajectory optimizer
produces guiding trajectories; each trajectory consists of
(state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iterative LQR
The iterative linear-quadratic regulator (iLQR) optimizes a trajectory by
repeatedly solving for the optimal policy under linear-quadratic
assumptions.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Key assumptions: linear dynamics and a quadratic reward (standard forms shown below).
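The slide shows these assumptions as equations; in standard iLQR notation (my reconstruction, not copied from the slide) the local approximations around the current trajectory are

$$x_{t+1} \approx f_{x,t}\, x_t + f_{u,t}\, u_t, \qquad r(x_t, u_t) \approx \frac{1}{2}\begin{bmatrix} x_t \\ u_t \end{bmatrix}^{\top} R_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^{\top} r_t,$$

where $f_{x,t}$ and $f_{u,t}$ are the local dynamics Jacobians and $R_t$, $r_t$ come from a second-order expansion of the reward.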
Iterative LQR
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iteratively compute a trajectory, find a deterministic policy
based on that trajectory, and recompute the trajectory until
convergence.
But this only yields a deterministic policy. We need
something stochastic!
By exploiting the concepts of linearly solvable MDPs and
maximum entropy control, one can derive the following
stochastic policy!
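The resulting policy appears on the slide as an equation; a standard form of this maximum-entropy result (my reconstruction, following the GPS papers) is a time-varying linear-Gaussian controller

$$\pi(u_t \mid x_t) = \mathcal{N}\!\left(K_t x_t + k_t,\; Q_{u,u,t}^{-1}\right),$$

where $K_t$ and $k_t$ are the iLQR feedback gains and $Q_{u,u,t}$ is the action Hessian of the Q-function: Gaussian noise scaled by the inverse curvature is exactly what makes the policy stochastic.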
Guided Policy Search
Stage 2) Policy learning
From the collected (state, action) pairs, train neural network controllers
using importance-sampled policy search (a minimal sketch of the
importance-sampling idea follows below).
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
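This is not the paper's implementation; it is a minimal numpy sketch of the importance-sampling idea behind that policy search step, assuming we already have the log-probabilities of each guiding trajectory under both the sampling distribution q and the current policy pi_theta:

```python
import numpy as np

def importance_sampled_return(logp_policy, logp_sampler, returns):
    """Self-normalized importance-sampling estimate of the expected return
    under the current policy, using trajectories drawn from a different
    (guiding) distribution.

    logp_policy  : log pi_theta(tau_i) for each guiding trajectory
    logp_sampler : log q(tau_i) under the distribution that generated it
    returns      : total reward R(tau_i) of each trajectory
    """
    log_w = np.asarray(logp_policy) - np.asarray(logp_sampler)
    log_w -= log_w.max()                 # subtract max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                         # self-normalize the weights
    return float(np.dot(w, returns))

# Toy usage: three guiding trajectories with made-up numbers.
print(importance_sampled_return(
    logp_policy=[-10.2, -11.0, -9.8],
    logp_sampler=[-10.0, -10.5, -10.1],
    returns=[5.0, 3.5, 6.2],
))
```

In GPS, the policy parameters would be adjusted to increase such an estimate, together with regularization terms that keep the policy close to the guiding distribution; that outer optimization loop is omitted here.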
Experiments
Tasks: (a) stacking large lego blocks on a fixed base, (b) onto a free-standing block, and (c) held in both grippers; (d) threading wooden rings onto a tight-fitting peg; (e) assembling a toy airplane by inserting the wheels into a slot; (f) inserting a shoe tree into a shoe; (g, h) screwing caps onto pill bottles and (i) onto a water bottle.
Supersizing Self-supervision:
Learning to Grasp from 50K Tries
and 700 Robot Hours
Lerrel Pinto, Abhinav Gupta
ICRA 2016
Video
Learning Hand-Eye Coordination
for Robotic Grasping with Deep
Learning and Large-Scale Data
Collection
Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre
Quillen
ISER 2016
Video