This paper proposes a method called constrained guided policy search to train a real-world robot to perform contact-rich manipulation skills without requiring prior knowledge of dynamics. The method uses iterative linear quadratic regulation to define a guiding distribution for samples that are then used to train a neural network policy via importance sampled policy search. The trained policy allows a PR2 robot to successfully accomplish tasks such as stacking blocks, threading rings onto pegs, and assembling objects.
2. Recent Trends in Reinforcement Learning: Deep Neural Policy Learning
(based on my personal opinion, which may be somewhat misleading)
Presenter: Sungjoon Choi
4. Learning Contact-Rich Manipulation Skills
with Guided Policy Search
This paper won the ICRA 2015 Best Manipulation Paper Award!
But why? What’s so great about this paper?
Personally, I think the main contribution of this paper is proposing a direct
policy learning method that can 'actually train a real-world robot'
to perform such tasks.
That’s it??
I guess so! By the way, ‘actually training a real-world robot’ is
harder than you might imagine!
You will see how brilliant this paper is!
6. Brief review of MDP and RL
State
Reward
Value
Policy
Action
Model
7. Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy!
It is like saying, "I will find a function that best satisfies the given
conditions!"
However, learning a function is not an easy problem. (In fact, it is impossible
unless we use some 'prior' knowledge!)
So, instead of learning the function itself, most works try to find the
'parameters' of a function by restricting the solution space to a class of
parametric functions, such as linear functions.
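For example, with a linear parameterization the search over functions reduces to a search over a finite parameter vector; a minimal sketch (the notation below is my own, not from the slides):

$$\pi_\theta(s) = \theta^{\top}\phi(s), \qquad \theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{\pi_\theta}\!\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big],$$

where $\phi(s)$ is a fixed feature map of the state.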
8. Brief review of MDP and RL
What are the typical impediments in reinforcement learning?
In other words, why is it so HARD to find an optimal policy?
1. We are living in a continuous world, not a discrete grid world.
- In this continuous world, the standard (tabular) MDP machinery cannot be applied directly.
- So instead, we usually use function approximation to handle this issue.
2. However, linear functions do not work well in practice.
- And, of course, nonlinear functions are hard to optimize.
3. The (dynamic) model, which is often required, is HARD to obtain.
- Remember, value is defined as the "expected sum of rewards", and evaluating that expectation requires a model (or samples from the real system).
Today's paper tackles all three problems listed above!
9. RL: Reinforcement Learning
IRL: Inverse Reinforcement Learning
LfD: Learning from Demonstration
DPL: Direct Policy Learning
Big picture (which might be wrong):
[Diagram: RL, IRL (= IOC, inverse optimal control), DPL, and LfD shown as overlapping areas, with Guided Policy Search and Constrained Guided Policy Search at their intersection]

RL:  Objective: find the optimal policy. Given: reward, dynamic model. Not given: policy. Algorithms: policy iteration, value iteration, TD learning, Q-learning.
IRL: Objective: find the underlying reward, then the optimal policy. Given: experts' demonstrations (often a dynamic model). Not given: reward, policy. Algorithms: MaxEnt IRL, MaxMargin planning, apprenticeship learning.
DPL: Objective: find the optimal policy. Given: experts' demonstrations, reward. Not given: dynamic model (not always). Algorithm: guided policy search.
LfD: Objective: find the underlying reward, then the optimal policy. Given: experts' demonstrations + others. Not given: dynamic model (not always). Algorithm: GP motion controller.
10. A rough timeline of the related papers:
- MDP is powerful, but it requires heavy computation to find the value function → LMDP [1]
- Let's use the LMDP for the inverse optimal control problem! [2]
- How can we measure the 'probability' of (experts') state-action sequences? [3]
- Can we learn a 'nonlinear' reward function? [4]
- Can we do that with only 'locally' optimal examples? [5]
- Given the reward, how can we 'effectively' learn the optimal policy? [6]
- Re-formalize guided policy search. [7]
- Let's learn both the 'dynamic model' and the policy! [8]
- Image-based control with a CNN! [9]
- Applied to a real-world robot, the PR2! [10]
- How can we 'effectively' search for the optimal policy? [11] (latest)
Note that the reward is given!
The beginning of a new era! (RL + deep learning)

[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear Inverse Reinforcement Learning with Gaussian Processes." NIPS 2011
[5] Sergey Levine and Vladlen Koltun. "Continuous Inverse Optimal Control with Locally Optimal Examples." ICML 2012
[6] Sergey Levine and Vladlen Koltun. "Guided Policy Search." ICML 2013
[7] Sergey Levine and Vladlen Koltun. "Learning Complex Neural Network Policies with Trajectory Optimization." ICML 2014
[8] Sergey Levine and Pieter Abbeel. "Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics." NIPS 2014
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." arXiv 2015
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models." arXiv 2015
11. Learning Contact-Rich Manipulation Skills
with Guided Policy Search
The main building block is Guided Policy Search (GPS).
GPS is a two-stage algorithm consisting of a trajectory optimization stage and
a policy learning stage (see the sketch at the end of this slide).
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014
GPS is a direct policy search algorithm that can effectively scale to
high-dimensional systems.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
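Roughly, the trajectory optimization stage produces guiding trajectory distributions, and the policy learning stage maximizes an importance-sampled estimate of the expected return using samples from those guiding distributions as the proposal. A simplified sketch of that objective (not the exact estimator from the papers):

$$\hat{J}(\theta) \;=\; \frac{1}{Z(\theta)} \sum_{i=1}^{m} \frac{\pi_\theta(\tau_i)}{q(\tau_i)}\, r(\tau_i), \qquad \tau_i \sim q(\tau),$$

where $q(\tau)$ is (a mixture that includes) the guiding trajectory distributions produced by the trajectory optimization stage.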
12. Guided Policy Search
Stage 1) Trajectory optimization (iterative LQR)
Given a reward function and a dynamic model, produce (locally) optimal trajectories.
Each trajectory consists of (state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
13. Iterative LQR
The iterative linear quadratic regulator (iLQR) optimizes a trajectory by
repeatedly solving for the optimal policy under linear-quadratic assumptions.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Key assumptions: linear dynamics, quadratic reward.
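In one common notation (my own, not necessarily the slides'), iLQR linearizes the dynamics and takes a quadratic model of the reward around the current trajectory,

$$x_{t+1} \approx A_t x_t + B_t u_t + c_t, \qquad r(x_t, u_t) \approx -\tfrac{1}{2} x_t^{\top} Q_t x_t - \tfrac{1}{2} u_t^{\top} R_t u_t,$$

and then solves the resulting LQR problem with a backward pass, giving a time-varying linear feedback controller $u_t = \bar{u}_t + k_t + K_t (x_t - \bar{x}_t)$ around which it re-linearizes.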
15. Iterative LQR
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iteratively compute a trajectory, find a deterministic policy based on that
trajectory, and recompute the trajectory, until convergence.
But this only yields a deterministic policy. We need something stochastic!
By exploiting the concepts of linearly solvable MDPs and maximum entropy
control, one can derive a stochastic policy of the following form.
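A sketch of that policy, roughly following the GPS formulation [6] (notation simplified): it is linear-Gaussian, with the deterministic iLQR controller as its mean and a covariance given by the curvature of the Q-function with respect to the action,

$$\pi^{\mathcal{G}}(u_t \mid x_t) = \mathcal{N}\!\big(\bar{u}_t + k_t + K_t(x_t - \bar{x}_t),\; Q_{u,u,t}^{-1}\big),$$

so that actions with nearly optimal cost-to-go are sampled with high probability (maximum entropy behavior).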
18. Constrained Guided Policy Search
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
What if we don’t know the dynamics of a robot?
We can use real-world trajectories to locally approximate the dynamics model.
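A minimal sketch of this idea, assuming numpy and made-up array shapes (this is not the authors' code, and it omits the Gaussian mixture prior mentioned on the next slide): fit time-varying linear dynamics by regressing x_{t+1} on (x_t, u_t) across the sampled rollouts.

```python
import numpy as np

def fit_local_dynamics(X, U):
    """Fit x_{t+1} ~= A_t x_t + B_t u_t + c_t from sampled rollouts.

    X: (N, T+1, dx) states from N rollouts, U: (N, T, du) actions.
    Returns lists of A_t, B_t, c_t for t = 0..T-1.
    """
    N, T, du = U.shape
    dx = X.shape[2]
    A, B, c = [], [], []
    for t in range(T):
        Z = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])  # regressors [x_t, u_t, 1]
        Y = X[:, t + 1]                                      # targets x_{t+1}
        W, *_ = np.linalg.lstsq(Z, Y, rcond=None)            # least-squares fit
        A.append(W[:dx].T)          # (dx, dx)
        B.append(W[dx:dx + du].T)   # (dx, du)
        c.append(W[-1])             # (dx,)
    return A, B, c
```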
19. Constrained Guided Policy Search
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
However, since this is only a local approximation, large deviations from the
previous trajectories might lead to disastrous optimization results.
So, impose a constraint on the KL-divergence between the old and new
trajectory distributions!
A Gaussian mixture model is further used as a prior, so that fewer samples are
needed to fit the dynamics model.
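Roughly following [8], the constrained update can be written as trajectory optimization with a trust region on the trajectory distribution (notation simplified):

$$\min_{p(\tau)} \; \mathbb{E}_{p}\big[\ell(\tau)\big] \quad \text{subject to} \quad D_{\mathrm{KL}}\!\big(p(\tau)\,\|\,\hat{p}(\tau)\big) \le \epsilon,$$

where $\hat{p}(\tau)$ is the previous trajectory distribution, under which the local linear dynamics were fitted, and $\epsilon$ controls how far the new trajectories may deviate from it.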
20. Learning Contact-Rich Manipulation Skills
with Guided Policy Search
This paper uses constrained guided policy search to learn contact-rich
manipulation skills.
21. Learning Contact-Rich Manipulation Skills
with Guided Policy Search
Tasks: (a) stacking large lego blocks on a fixed base, (b) onto a free-standing
block, (c) held in both grippers; (d) threading wooden rings onto a
tight-fitting peg; (e) assembling a toy airplane by inserting the wheels into a
slot; (f) inserting a shoe tree into a shoe; (g, h) screwing caps onto pill
bottles and (i) onto a water bottle.
22. Learning Contact-Rich Manipulation Skills
with Guided Policy Search
Action: 7 joint torques (one per joint of the PR2 arm).
State:
1. Current joint angles and velocities
2. Cartesian velocities of two or three points on the manipulated object
3. Vector from the target positions of these points to their current positions
4. Torque applied at the previous time step
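A purely hypothetical sketch of assembling this state vector (dimensions and names assumed, not taken from the paper):

```python
import numpy as np

def build_state(joint_angles, joint_velocities, point_velocities,
                point_errors, prev_torques):
    """Concatenate the features listed above into one state vector.

    joint_angles, joint_velocities, prev_torques: shape (7,) for a 7-DoF arm.
    point_velocities, point_errors: shape (n_points, 3) Cartesian quantities
    for the two or three tracked points on the manipulated object.
    """
    return np.concatenate([
        joint_angles,               # current joint angles
        joint_velocities,           # current joint velocities
        point_velocities.ravel(),   # Cartesian velocities of the tracked points
        point_errors.ravel(),       # vectors from target to current positions
        prev_torques,               # torques applied at the previous time step
    ])
```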
23. Conclusion
Constrained guided policy search is used to train a real-world PR2 robot to
perform several contact-rich manipulation tasks.
The policy is modeled with a neural network.
Prior knowledge about the dynamics is NOT required.
Iterative LQR is used to define a guiding distribution, which serves as the
proposal distribution in importance-sampled policy search.