Deep Learning
applied to robotics
Presenter: Sungjoon Choi
(sungjoon.choi@cpslab.snu.ac.kr)
Contents
• Learning Contact-Rich Manipulation Skills with Guided Policy
Search
• Supersizing Self-supervision: Learning to Grasp from 50K Tries
and 700 Robot Hours
• Learning Hand-Eye Coordination for Robotic Grasping with
Deep Learning and Large-Scale Data Collection
• Playing Atari with Deep Reinforcement Learning
• Human Level Control through Deep Reinforcement Learning
• Deep Reinforcement Learning with Double Q-Learning
• For playing Atari
• For controlling real manipulators
Learning Contact-Rich
Manipulation Skills with
Guided Policy Search
Sergey Levine, Nolan Wagener, and Pieter Abbeel
ICRA 2015
Video
Introduction
This paper wins the ICRA 2015 Best Manipulation Paper Award!
But why? What’s so great about this paper?
Personally, I think the main contribution of this paper is that it proposes a direct
policy learning method that can ‘actually train a real-world robot’
to perform such tasks.
That’s it??
I guess so! By the way, ‘actually training a real-world robot’ is
harder than you might imagine!
You will see how brilliant this paper is!
Brief review of MDP and RL
[Diagram: the standard agent-environment loop. The agent takes an action and receives an observation and a reward from the environment.]
Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy!
It is like saying “I will find a function that best satisfies the given conditions!”.
However, learning a function is not an easy problem. (In fact, it is impossible
unless we use some ‘prior’ knowledge!)
So, instead of learning the function itself, most works try to find the
‘parameters’ of a function by restricting the solution space to a class of
parametric functions, such as linear functions.
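To make this concrete, here is how that idea is usually written in standard RL notation (my reconstruction, not copied from the slides): the goal is

$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r(x_t, u_t)\right],$$

and instead of searching over all policies, we restrict ourselves to a parametric family $\pi_{\theta}(u \mid x)$ (for example a linear controller $u = \theta^{\top} x$, or a neural network) and optimize the parameters $\theta$.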
Brief review of MDP and RL
What are typical impediments in reinforcement learning?
In other words, why is it so HARD to find an optimal policy??
1. We are living in a continuous world, not a discrete grid world.
- In a continuous world, the standard (tabular) MDP machinery cannot be established.
- So instead, we usually use function approximation to handle this issue (see the sketch after this list).
2. However, linear functions do not work well in practice.
- And, of course, nonlinear functions are hard to optimize.
3. A (dynamics) model, which is often required, is HARD to obtain.
- Recall that value is defined as the “expected sum of rewards!”, and that expectation is taken under the dynamics.
Today’s paper tackles all three problems listed above!!
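The following is not from the paper; it is a minimal numpy sketch of what ‘function approximation’ means in point 1, assuming a toy 1-D continuous state, a hand-coded feature map, and TD(0) updates:

```python
import numpy as np

# Minimal TD(0) with a linear value-function approximator on a toy
# 1-D continuous state space. Illustration only, not from the paper.

def features(x):
    # Hand-coded polynomial features for a scalar state x.
    return np.array([1.0, x, x**2])

def td0_update(w, x, r, x_next, gamma=0.99, alpha=0.01):
    # One TD(0) step: move V_w(x) = w . phi(x) toward r + gamma * V_w(x').
    phi, phi_next = features(x), features(x_next)
    td_error = r + gamma * w @ phi_next - w @ phi
    return w + alpha * td_error * phi

# Toy rollout: the state drifts toward 0, the reward is -x^2.
rng = np.random.default_rng(0)
w = np.zeros(3)
x = rng.uniform(-1.0, 1.0)
for _ in range(10_000):
    x_next = 0.9 * x + 0.05 * rng.normal()
    w = td0_update(w, x, -x**2, x_next)
    x = x_next

print("learned value weights:", w)
```

The learned weights define a value estimate over the whole continuous state space, which is exactly what a tabular value function cannot do.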
Big Picture (which might be wrong)
RL: Reinforcement Learning
IRL: Inverse Reinforcement Learning (= IOC, Inverse Optimal Control)
LfD: Learning from Demonstration
DPL: Direct Policy Learning
[Venn diagram relating RL, DPL, IRL (= IOC), and LfD, with Guided Policy Search highlighted.]
How these problem settings compare (objective, what’s given, what’s NOT given, algorithms):
RL: objective: find an optimal policy; given: reward, dynamics model; not given: policy; algorithms: policy iteration, value iteration, TD learning, Q-learning.
IRL (= IOC): objective: find the underlying reward, then an optimal policy; given: experts’ demonstrations (often a dynamics model); not given: reward, policy; algorithms: MaxEnt IRL, MaxMargin planning, apprenticeship learning.
DPL: objective: find an optimal policy; given: experts’ demonstrations, reward; not given: dynamics model (not always); algorithm: guided policy search.
LfD: objective: find the underlying reward, then an optimal policy; given: experts’ demonstrations + others; not given: dynamics model (not always); algorithm: GP motion controller.
How the line of work leading to this paper evolved:
- MDPs are powerful, but finding the value function requires heavy computation. → LMDP [1]
- Let’s use the LMDP formulation for the inverse optimal control problem! → [2]
- How can we measure the ‘probability’ of (experts’) state-action sequences? → [3]
- Can we learn a ‘nonlinear’ reward function? → [4]
- Can we do that with only ‘locally’ optimal examples? → [5]
- Given the reward (note that the reward is given!), how can we ‘effectively’ learn the optimal policy? → [6]
- Re-formalize guided policy search. → [7]
- Let’s learn both the ‘dynamics model’ and the policy!! → [8]
- Image-based control with a CNN!! → [9]
- Applied to a real-world robot, the PR2!! → [10]
- The beginning of a new era (RL + deep learning): how can we ‘effectively’ search for the optimal policy? → [11] (latest)

[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models." arXiv 2015.
Learning Contact-Rich Manipulation Skills with Guided Policy Search
The main building block is Guided Policy Search (GPS).
GPS is a two-stage algorithm consisting of a trajectory optimization
stage and a policy learning stage (a sketch of the overall loop follows below).
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014
GPS is a direct policy search algorithm that can effectively scale to
high-dimensional systems.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
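This is not the authors' code: a rough Python sketch of the two-stage loop described above, in which the callables (the trajectory optimizer and the policy trainer) are hypothetical placeholders supplied by the user.

```python
def guided_policy_search(optimize_trajectories, train_policy_network,
                         init_policy, cost_fn, dynamics, init_states,
                         num_iters=10):
    """Two-stage GPS loop (sketch only; helper callables are hypothetical)."""
    policy = init_policy
    guiding_samples = []
    for _ in range(num_iters):
        # Stage 1: trajectory optimization (e.g., iterative LQR) under the
        # given cost and dynamics model produces guiding trajectories,
        # i.e., lists of (state, action) pairs.
        trajectories = optimize_trajectories(cost_fn, dynamics, init_states)
        guiding_samples.extend(trajectories)

        # Stage 2: policy learning fits the neural-network controller to the
        # guiding samples (importance-sampled policy search in the paper).
        policy = train_policy_network(policy, guiding_samples)
    return policy
```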
Guided policy search
Stage 1) Trajectory optimization (iterative LQR)
Given a reward function and a dynamics model, the trajectory optimizer
produces guiding trajectories; each trajectory consists of
(state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iterative LQR
The iterative linear-quadratic regulator (iLQR) optimizes a trajectory by
repeatedly solving for the optimal policy under linear-quadratic
assumptions.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Key assumptions: linear dynamics and a quadratic reward (standard forms shown below).
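The slide shows these assumptions as equations; in standard iLQR notation (my reconstruction, not copied from the slide) the local approximations around the current trajectory are

$$x_{t+1} \approx f_{x,t}\, x_t + f_{u,t}\, u_t, \qquad r(x_t, u_t) \approx \frac{1}{2}\begin{bmatrix} x_t \\ u_t \end{bmatrix}^{\top} R_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^{\top} r_t,$$

where $f_{x,t}$ and $f_{u,t}$ are the local dynamics Jacobians and $R_t$, $r_t$ come from a second-order expansion of the reward.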
Iterative LQR
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iteratively compute a trajectory, find a deterministic policy
based on that trajectory, and recompute the trajectory until
convergence.
But this only yields a deterministic policy. We need
something stochastic!
By exploiting the concepts of linearly solvable MDPs and
maximum entropy control, one can derive the following
stochastic policy!
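The resulting policy appears on the slide as an equation; a standard form of this maximum-entropy result (my reconstruction, following the GPS papers) is a time-varying linear-Gaussian controller

$$\pi(u_t \mid x_t) = \mathcal{N}\!\left(K_t x_t + k_t,\; Q_{u,u,t}^{-1}\right),$$

where $K_t$ and $k_t$ are the iLQR feedback gains and $Q_{u,u,t}$ is the action Hessian of the Q-function: Gaussian noise scaled by the inverse curvature is exactly what makes the policy stochastic.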
Guided Policy Search
Stage 2) Policy learning
From the collected (state, action) pairs, train neural network controllers
using importance-sampled policy search (a minimal sketch of the
importance-sampling idea follows below).
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
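This is not the paper's implementation; it is a minimal numpy sketch of the importance-sampling idea behind that policy search step, assuming we already have the log-probabilities of each guiding trajectory under both the sampling distribution q and the current policy pi_theta:

```python
import numpy as np

def importance_sampled_return(logp_policy, logp_sampler, returns):
    """Self-normalized importance-sampling estimate of the expected return
    under the current policy, using trajectories drawn from a different
    (guiding) distribution.

    logp_policy  : log pi_theta(tau_i) for each guiding trajectory
    logp_sampler : log q(tau_i) under the distribution that generated it
    returns      : total reward R(tau_i) of each trajectory
    """
    log_w = np.asarray(logp_policy) - np.asarray(logp_sampler)
    log_w -= log_w.max()                 # subtract max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                         # self-normalize the weights
    return float(np.dot(w, returns))

# Toy usage: three guiding trajectories with made-up numbers.
print(importance_sampled_return(
    logp_policy=[-10.2, -11.0, -9.8],
    logp_sampler=[-10.0, -10.5, -10.1],
    returns=[5.0, 3.5, 6.2],
))
```

In GPS, the policy parameters would be adjusted to increase such an estimate, together with regularization terms that keep the policy close to the guiding distribution; that outer optimization loop is omitted here.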
Experiments
Tasks: (a) stacking large lego blocks on a fixed base, (b) onto a free-standing block, and (c) held in both grippers; (d) threading wooden rings onto a tight-fitting peg; (e) assembling a toy airplane by inserting the wheels into a slot; (f) inserting a shoe tree into a shoe; (g, h) screwing caps onto pill bottles and (i) onto a water bottle.
Supersizing Self-supervision:
Learning to Grasp from 50K Tries
and 700 Robot Hours
Lerrel Pinto, Abhinav Gupta
ICRA 2016
Video
Learning Hand-Eye Coordination
for Robotic Grasping with Deep
Learning and Large-Scale Data
Collection
Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre
Quillen
ISER 2016
Video