Learning Contact-Rich
Manipulation Skills with Guided
Policy Search
Sergey Levine, Nolan Wagener, and
Pieter Abbeel
ICRA 2015
Presenter: Sungjoon Choi
Recent trends in Reinforcement
Learning
: Deep Neural Policy Learning
(based on my personal opinion, which may be somewhat misleading)
Presenter: Sungjoon Choi
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
http://rll.berkeley.edu/icra2015gps/
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
This paper won the ICRA 2015 Best Manipulation Paper Award!
But why? What’s so great about this paper?
Personally, I think the main contribution of this paper is a direct
policy learning method that can ‘actually train a real-world robot’
to perform manipulation tasks.
That’s it??
I guess so! But ‘actually training a real-world robot’ is
harder than you might imagine!
You will see how brilliant this paper is!
Brief review of MDP and RL
(Diagram: the agent-environment loop; the agent receives observations and rewards and outputs actions.)
Brief review of MDP and RL
Key components: State, Action, Reward, Policy, Value, Model
Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy!
It is like saying “I will find the function which best satisfies the given
conditions!”.
However, learning a function is not an easy problem. (In fact, it is impossible
unless we use some ‘prior’ knowledge!)
So, instead of learning a function directly, most works find the
‘parameters’ of a function by restricting the solution space to a family of
parametric functions, such as linear functions.
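To make this concrete, here is a minimal sketch (my own toy example, not from the paper) of restricting the policy to a linear parametric family, so that ‘learning a policy’ reduces to finding a finite set of parameters:

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): restrict the policy to
# the linear family u = K x + k, so "learning a policy" reduces to finding
# the finite parameter set (K, k).
class LinearPolicy:
    def __init__(self, state_dim, action_dim):
        self.K = np.zeros((action_dim, state_dim))  # feedback gain
        self.k = np.zeros(action_dim)               # bias

    def act(self, x):
        return self.K @ x + self.k

policy = LinearPolicy(state_dim=4, action_dim=2)
print(policy.act(np.array([0.1, 0.0, -0.2, 0.3])))  # -> [0. 0.]
```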
Brief review of MDP and RL
What are typical impediments in reinforcement learning?
In other words, why is it so HARD to find an optimal policy?
1. We are living in a continuous world, not a discrete grid world.
- In a continuous world, the standard (tabular) MDP machinery cannot be applied directly.
- So instead, we usually use function approximation to handle this issue.
2. However, linear functions do not work well in practice.
- And, of course, nonlinear functions are hard to optimize.
3. A (dynamics) model, which is often required, is HARD to obtain.
- The value is defined as the ‘expected sum of rewards’, and evaluating that expectation requires a model.
Today’s paper tackles all three problems listed above!!
Big Picture (which might be wrong)
RL: Reinforcement Learning
IRL: Inverse Reinforcement Learning (= IOC: Inverse Optimal Control)
DPL: Direct Policy Learning
LfD: Learning from Demonstration
(Diagram: Guided Policy Search and Constrained Guided Policy Search sit in DPL, at the intersection of RL, IRL, and LfD.)

|     | Objective | What’s given | What’s NOT given | Algorithms |
|-----|-----------|--------------|------------------|------------|
| RL  | Find optimal policy | Reward, dynamics model | Policy | Policy iteration, value iteration, TD learning, Q-learning |
| IRL | Find underlying reward, then optimal policy | Experts’ demonstrations (often a dynamics model) | Reward, policy | MaxEnt IRL, MaxMargin planning, apprenticeship learning |
| DPL | Find optimal policy | Experts’ demonstrations, reward | Dynamics model (not always) | Guided policy search |
| LfD | Find underlying reward, then optimal policy | Experts’ demonstrations + others | Dynamics model (not always) | GP motion controller |
MDP is powerful, but it requires heavy computation to find the value function. → LMDP [1]
Let’s use the LMDP for the inverse optimal control problem! → [2]
How can we measure the ‘probability’ of an (expert’s) state-action sequence? → [3]
Can we learn a ‘nonlinear’ reward function? → [4]
Can we do that with only ‘locally’ optimal examples? → [5]
Given the reward, how can we ‘effectively’ learn the optimal policy? → [6] (note that the reward is now given!)
Re-formalize guided policy search. → [7]
Learn both the ‘dynamics model’ and the policy!! → [8]
The beginning of a new era (RL + deep learning)!
Image-based control with a CNN!! → [9]
Applied to a real-world robot, the PR2!! → [10]
How can we ‘effectively’ explore for the optimal policy? → [11] (latest)

[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models." arXiv 2015.
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
The main building block is Guided Policy Search (GPS).
GPS is a two-stage algorithm consisting of a trajectory optimization
stage and a policy learning stage.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014
GPS is a direct policy search algorithm that can effectively scale to
high-dimensional systems.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
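At a very high level, the two stages alternate as in the following runnable toy (my own illustration: plain LQR stands in for iLQR, and a linear regression stands in for neural network training):

```python
import numpy as np

# Toy sketch of the two GPS stages on a 1-D double integrator (my own
# illustrative example, not the authors' code).
# Stage 1: trajectory optimization under known linear dynamics and
# quadratic cost gives a guiding controller (here, plain LQR).
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # x_{t+1} = A x_t + B u_t
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), 0.1 * np.eye(1)        # quadratic state/action costs

P = Q.copy()
for _ in range(200):                      # Riccati iteration for the LQR gain
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

# Roll out the guiding controller to collect (state, action) pairs.
xs, us, x = [], [], np.array([1.0, 0.0])
for _ in range(50):
    u = -K @ x
    xs.append(x); us.append(u)
    x = A @ x + B @ u

# Stage 2: "policy learning" = supervised regression of actions on states
# (a linear policy stands in for the paper's neural network).
X, U = np.stack(xs), np.stack(us)
theta, *_ = np.linalg.lstsq(X, U, rcond=None)
print("learned policy gain:", theta.T, " LQR gain:", -K)
```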
Guided Policy Search
Stage 1) Trajectory optimization (iterative LQR)
Given a reward function and a dynamics model, iterative LQR produces
guiding trajectories; each trajectory consists of (state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iterative LQR
The iterative linear quadratic regulator (iLQR) optimizes a trajectory by
repeatedly solving for the optimal policy under linear-quadratic
assumptions:
- Linear dynamics
- Quadratic reward
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
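Concretely, the linear-quadratic approximation around a nominal trajectory $(\bar{x}_t, \bar{u}_t)$ looks like this (standard iLQR notation, assumed rather than copied from the slide):

```latex
% Local LQ approximation around a nominal trajectory (standard iLQR form;
% the notation here is assumed, not taken from the slides).
\begin{aligned}
x_{t+1} &\approx f(\bar{x}_t, \bar{u}_t)
  + f_x\,(x_t - \bar{x}_t) + f_u\,(u_t - \bar{u}_t)
  && \text{(linearized dynamics)} \\
r(x_t, u_t) &\approx r(\bar{x}_t, \bar{u}_t)
  + r_x^\top \delta x_t + r_u^\top \delta u_t
  + \tfrac{1}{2}\,\delta x_t^\top r_{xx}\,\delta x_t
  + \tfrac{1}{2}\,\delta u_t^\top r_{uu}\,\delta u_t
  && \text{(quadratized reward)}
\end{aligned}
```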
Iterative LQR
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Iteratively compute a trajectory, find a deterministic policy
based on that trajectory, and recompute the trajectory until
convergence.
But this only yields a deterministic policy. We need
something stochastic!
By exploiting the concepts of the linearly solvable MDP and
maximum entropy control, one can derive a stochastic policy:
roughly, a linear-Gaussian controller whose mean is the iLQR
feedback policy and whose covariance is the inverse of the
action-value Hessian $Q_{uu}$.
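To make the backward/forward structure concrete, here is a compact iLQR sketch (an illustrative toy on a 1-D point mass, not the paper's implementation):

```python
import numpy as np

# Compact iLQR sketch (illustrative toy, not the paper's code): drive a
# point mass to the origin. For a truly nonlinear system, A and B would be
# re-linearized around the current trajectory at every iteration.
dt, T = 0.1, 30
A = np.array([[1.0, dt], [0.0, 1.0]])            # x = (position, velocity)
B = np.array([[0.0], [dt]])
Q, R = np.diag([1.0, 0.1]), np.array([[1e-2]])   # cost 0.5 (x'Qx + u'Ru)
f = lambda x, u: A @ x + B @ u

x0, us = np.array([1.0, 0.0]), [np.zeros(1) for _ in range(T)]
for _ in range(10):                        # iLQR outer loop
    xs = [x0]
    for u in us:                           # forward rollout
        xs.append(f(xs[-1], u))
    Vx, Vxx = Q @ xs[-1], Q.copy()         # terminal value expansion
    ks, Ks = [None] * T, [None] * T
    for t in reversed(range(T)):           # backward pass (LQ subproblem)
        Qx,  Qu  = Q @ xs[t] + A.T @ Vx, R @ us[t] + B.T @ Vx
        Qxx, Quu = Q + A.T @ Vxx @ A, R + B.T @ Vxx @ B
        Qux = B.T @ Vxx @ A
        ks[t] = -np.linalg.solve(Quu, Qu)  # open-loop correction
        Ks[t] = -np.linalg.solve(Quu, Qux) # feedback gain
        # A maximum-entropy stochastic variant would sample
        # u ~ N(mean, inv(Quu)) instead of acting deterministically.
        Vx  = Qx + Ks[t].T @ Qu + Qux.T @ ks[t] + Ks[t].T @ Quu @ ks[t]
        Vxx = Qxx + Ks[t].T @ Qux + Qux.T @ Ks[t] + Ks[t].T @ Quu @ Ks[t]
    x, new_us = x0, []
    for t in range(T):                     # forward pass with feedback
        u = us[t] + ks[t] + Ks[t] @ (x - xs[t])
        new_us.append(u); x = f(x, u)
    us = new_us
print("final state (should approach 0):", x)
```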
Guided Policy Search
Stage 2) Policy learning
From the collected (state, action) pairs, train neural network
controllers using Importance Sampled Policy Search.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Importance Sampled Policy Search
Under a Gaussian policy with mean $\mu(x)$ and variance $\sigma^2$, the probability of a trajectory prefix $\zeta_{1:t}$ is

$$\pi_\theta(\zeta_{1:t}) = \prod_{k=1}^{t} \mathcal{N}\!\left(u_k;\ \mu(x_k),\ \sigma^2\right)$$

Importance sampled policy search finds the $\theta$ that maximizes an importance-sampled objective $\Phi(\theta)$, where the normalizer over the $m$ sampled trajectories is

$$Z_t(\theta) = \sum_{i=1}^{m} \frac{\pi_\theta\!\left(\zeta^{(i)}_{1:t}\right)}{q\!\left(\zeta^{(i)}_{1:t}\right)}$$

Slide annotations: the reward (cost) term; the neural policy $\pi_\theta$ (data fitting); the sampling distribution $q$, an average of the guiding distributions or the previous policy (to compensate for off-policy samples); and a regularizer for lower variance (exploration).
The analytic gradient of $\Phi(\theta)$ flows through the neural network via back-propagation.
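A minimal numeric sketch of the importance weights and the normalizer $Z_t(\theta)$ (my own toy: the trajectory data, policy means, and reward are made up):

```python
import numpy as np

# Minimal sketch (my own toy, not the authors' code) of the importance
# weights behind the GPS objective: samples come from guiding
# distributions q, but we want expectations under the current policy.
def log_gauss(u, mu, sigma):
    return -0.5 * np.sum(((u - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2))

def traj_log_prob(actions, means, sigma):
    # log pi(zeta_{1:t}) = sum_k log N(u_k; mu(x_k), sigma^2)
    return sum(log_gauss(u, mu, sigma) for u, mu in zip(actions, means))

rng = np.random.default_rng(0)
m, t, du = 8, 20, 1                        # m sample trajectories of length t
samples = [rng.normal(size=(t, du)) for _ in range(m)]
log_q = np.array([traj_log_prob(a, np.zeros((t, du)), 1.0) for a in samples])
# Hypothetical current policy: mean shifted slightly away from q's mean.
log_pi = np.array([traj_log_prob(a, 0.1 * np.ones((t, du)), 1.0) for a in samples])

w = np.exp(log_pi - log_q)                 # importance weights pi/q
Z = w.sum()                                # Z_t(theta), the normalizer
rewards = np.array([-np.sum(a**2) for a in samples])  # toy reward per trajectory
objective = (w / Z) @ rewards              # importance-sampled return estimate
print("Z_t(theta) =", Z, " estimated return =", objective)
```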
Constrained Guided Policy Search
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
What if we don’t know the dynamics of the robot?
We can use real-world trajectories to locally approximate the
dynamics model.
Constrained Guided Policy Search
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
However, as this is only a local approximation, large deviations from
the previous trajectories might lead to disastrous optimization
results.
So, impose a constraint on the KL-divergence between the old
and new trajectory distributions!
A Gaussian mixture model is further used as a prior to reduce the
number of real-world samples needed to fit the dynamics model.
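As a sketch of the ‘local approximation’ idea (my own toy, not the authors' code), fitting a linear-Gaussian dynamics model by regression on observed transitions:

```python
import numpy as np

# Sketch (my own toy) of fitting a local linear dynamics model
# x_{t+1} = A x_t + B u_t + c from observed transitions, as constrained GPS
# does locally around the current trajectory distribution.
rng = np.random.default_rng(1)
true_A = np.array([[1.0, 0.1], [0.0, 0.9]])
true_B = np.array([[0.0], [0.1]])
true_c = np.array([0.0, 0.05])

X, U, Y = [], [], []
for _ in range(200):                      # transitions from noisy rollouts
    x, u = rng.normal(size=2), rng.normal(size=1)
    X.append(x); U.append(u)
    Y.append(true_A @ x + true_B @ u + true_c + 0.01 * rng.normal(size=2))

Phi = np.hstack([np.stack(X), np.stack(U), np.ones((200, 1))])  # rows [x, u, 1]
W, *_ = np.linalg.lstsq(Phi, np.stack(Y), rcond=None)
A_hat, B_hat, c_hat = W[:2].T, W[2:3].T, W[3]
print("recovered A close to true A:", np.allclose(A_hat, true_A, atol=0.05))
```

In the paper, a Gaussian mixture prior over transition tuples plays the role of reducing the sample count, and the KL-divergence between the old and new linear-Gaussian trajectory distributions can be evaluated in closed form.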
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
This paper uses constrained guided policy search to learn
contact-rich manipulation skills.
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
Tasks: (a) stacking large lego blocks on a
fixed base, (b) onto a free-standing
block, (c) held in both grippers;
(d) threading wooden rings onto a
tight-fitting peg; (e) assembling a
toy airplane by inserting the wheels
into a slot; (f) inserting a shoe
tree into a shoe; (g, h) screwing
caps onto pill bottles and (i) onto a
water bottle.
Learning Contact-Rich Manipulation Skills
with Guided Policy Search
(Diagram: the agent outputs 7 joint torques.)
The policy input consists of:
1. Current joint angles and
velocities
2. Cartesian velocities of two
or three points on the
manipulated object
3. Vectors from the target
positions of these points to
their current positions
4. Torque applied at the
previous time step
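As a sanity check on this input/output specification, a sketch with assumed dimensions (three tracked points; the exact sizes are my guess, not from the paper):

```python
import numpy as np

# Sketch (hypothetical shapes) of assembling the policy input described
# above for a 7-DOF arm with 3 tracked points on the object.
joint_angles  = np.zeros(7)
joint_vels    = np.zeros(7)
point_vels    = np.zeros(3 * 3)    # Cartesian velocities of 3 points
target_deltas = np.zeros(3 * 3)    # target-to-current vectors of 3 points
prev_torques  = np.zeros(7)

obs = np.concatenate([joint_angles, joint_vels, point_vels,
                      target_deltas, prev_torques])
torques = np.zeros(7)              # the policy maps obs -> 7 joint torques
print("observation dim:", obs.size)
```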
Conclusion
Constrained guided policy search is used to train a real-world PR2 robot to
perform contact-rich tasks.
The policy is modeled with a neural network.
Prior knowledge of the dynamics is NOT required.
Iterative LQR is used to define the guiding distributions, which serve as
proposal distributions in importance sampled policy search.
Thank you!
Any questions?