The document summarizes imitation learning techniques. It introduces behavioral cloning, which frames imitation learning as a supervised learning problem by learning to mimic expert demonstrations. However, behavioral cloning has limitations as it does not allow for recovery from mistakes. Alternative approaches involve direct policy learning using an interactive expert or inverse reinforcement learning, which aims to learn a reward function that explains the expert's behavior. The document outlines different types of imitation learning problems and algorithms for interactive direct policy learning, including data aggregation and policy aggregation methods.
1. Imitation Learning
ICML 2018 Tutorial
(Slides Available Online)
Yisong Yue Hoang M. Le
yyue@caltech.edu hmle@caltech.edu
@YisongYue @HoangMinhLe
yisongyue.com hoangle.info
2. Imitation Learning in a Nutshell
Given: demonstrations or demonstrator
Goal: train a policy to mimic demonstrations
Expert Demonstrations → State/Action Pairs → Learning
Images from Stephane Ross
3. Ingredients of Imitation Learning
Demonstrations or Demonstrator
Environment / Simulator
Policy Class
Loss Function
Learning Algorithm
4. Tutorial Overview
Part 1: Introduction and Core Algorithms
● Teaser results
● Overview of the ML landscape
● Types of imitation learning
● Core algorithms for passive learning, interactive learning, and cost learning
Part 2: Extensions and Applications
● Speech animation
● Structured prediction and search
● Improving over the expert
● Filtering and sequence modeling
● Multi-objective imitation learning
● Visual / few-shot imitation learning
● Domain adaptation for imitation learning
● Multi-agent imitation learning
● Hierarchical imitation learning
● Multi-modal imitation learning
● Weaker feedback
6. Helicopter Acrobatics
Learning for Control from Multiple Demonstrations
Adam Coates, Pieter Abbeel, Andrew Ng, ICML 2008
An Application of Reinforcement Learning to Aerobatic Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng, NIPS 2006 https://www.youtube.com/watch?v=0JL04JJjocc
8. A Deep Learning Approach for Generalized Speech Animation
Sarah Taylor, Taehwan Kim, Yisong Yue et al., SIGGRAPH 2017 https://www.youtube.com/watch?v=9zL7qejW9fE
10. One Shot Imitation Learning
Duan et al., NIPS ‘17
https://www.youtube.com/watch?v=oMZwkIjZzCM
11. Notation & Setup
State: s (sometimes x) (note: the state may be only partially observed)
Action: a (sometimes y)
Policy: 𝜋 𝛉 (sometimes h)
● Policy maps states to actions: 𝜋 𝛉(s) → a
● ...or distributions over actions: 𝜋 𝛉(s) → P(a)
State Dynamics: P(s’|s,a)
● Typically not known to policy.
● Essentially the simulator/environment
12. Notation & Setup Continued
Rollout: sequentially execute 𝜋(s0) on an initial state
● Produce trajectory 𝞽=(s0,a0,s1,a1,...)
P(𝞽|𝜋): distribution of trajectories induced by a policy
1. Sample s0 from P0 (distribution over initial states); initialize t = 1.
2. Sample action at from 𝜋(st-1).
3. Sample next state st by applying at to st-1 (requires access to the environment/simulator).
4. Repeat from Step 2 with t = t+1.
P(s|𝜋): distribution of states induced by a policy
● Let Pt(s|𝜋) denote distribution over t-th state
● P(s|𝜋) = (1/T)∑t Pt(s|𝜋)
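To make the rollout procedure and the induced state distribution concrete, here is a minimal sketch; the `env`/`policy` interfaces are illustrative assumptions, not from the tutorial.

```python
# Minimal rollout sketch (hypothetical `env`/`policy` interfaces; for illustration only).
def rollout(env, policy, horizon):
    """Return a trajectory tau = (s0, a0, s1, a1, ..., sT)."""
    s = env.reset()                    # sample s0 ~ P0
    trajectory = [s]
    for _ in range(horizon):
        a = policy(s)                  # choose a_t from pi(s_t)
        s = env.step(a)                # sample s_{t+1} ~ P(s' | s, a)
        trajectory += [a, s]
    return trajectory

def state_samples(env, policy, horizon, n_rollouts):
    """Empirical samples from P(s|pi) = (1/T) * sum_t P_t(s|pi)."""
    return [s for _ in range(n_rollouts)
              for s in rollout(env, policy, horizon)[::2]]   # states sit at even indices
```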
13. Example #1: Racing Game
(Super Tux Kart)
s = game screen
a = turning angle
Training set: D = {𝞽 = (s, a)} collected from 𝜋*
● s = the trajectory's sequence of states
● a = the corresponding sequence of actions
Goal: learn 𝜋 𝛉(s) → a
Images from Stephane Ross
14. Example #2: Basketball Trajectories
s = location of players & ball
a = next location of player
Training set: D = {𝞽 = (s, a)} collected from 𝜋*
● s = the trajectory's sequence of states
● a = the corresponding sequence of actions
Goal: learn 𝜋 𝛉(s) → a
16. Outline of 1st Half
Behavioral Cloning (simplest Imitation Learning setting)
Compare with Supervised Learning
Landscape of Imitation Learning settings
17. Behavioral Cloning = Reduction to Supervised Learning
(Ignoring regularization for brevity.)
Define P* = P(s|𝜋*) (distribution of states visited by expert)
(recall P(s|𝜋*) = (1/T)∑t Pt(s|𝜋*) )
(sometimes abuse notation: P* = P(s,a*=𝜋*(s)|𝜋*) )
Learning objective:
argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
Interpretations:
1. Assuming perfect imitation so far, learn to continue imitating perfectly
2. Minimize 1-step deviation error along the expert trajectories
Images from Stephane Ross
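As a concrete illustration of the objective above, here is a minimal behavioral-cloning sketch in PyTorch; the data shapes, network, and cross-entropy surrogate for L are placeholder assumptions, not the tutorial's setup.

```python
# Behavioral cloning as supervised learning: a minimal sketch.
import torch
import torch.nn as nn

states = torch.randn(1000, 8)              # placeholder expert states (s ~ P*)
actions = torch.randint(0, 4, (1000,))     # placeholder expert actions a* = pi*(s)

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()            # surrogate for the imitation loss L

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(policy(states), actions)   # E_{(s,a*)~P*} L(a*, pi_theta(s))
    loss.backward()
    optimizer.step()
```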
18. (General) Imitation Learning vs Behavioral Cloning
(Ignoring regularization for brevity.)
Behavioral Cloning (Supervised Learning):
argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
● Training distribution P* (pairs (s, a*)) is provided exogenously
(General) Imitation Learning:
argmin 𝛉 Es~P(s|𝛉) L(𝜋*(s), 𝜋 𝛉(s))
● Training distribution depends on the rollout: P(s|𝛉) = state distribution of 𝜋 𝛉 (starting from P0)
19. Limitations of Behavioral Cloning
IID assumption (supervised learning): training pairs (s, a*) are drawn from P*
Reality: 𝜋 𝛉 makes a mistake, and the next state is sampled not from P*!
Worst case is catastrophic!
Images from Stephane Ross
20. Limitations of Behavioral Cloning
Expert Trajectories
(Training Distribution)
Behavioral Cloning
Makes mistakes, enters new states
Cannot recover from new states
Videos from Stephane Ross
21. When to use Behavioral Cloning?
Advantages:
● Simple
● Efficient
Disadvantages:
● Distribution mismatch between training and testing
● No long-term planning
Use when:
● 1-step deviations are not too bad
● Learning reactive behaviors
● Expert trajectories “cover” the state space
Don’t use when:
● 1-step deviations can lead to catastrophic error
● Optimizing a long-term objective (at least not without a stronger model)
22. Outline of 1st Half
Behavioral Cloning (simplest Imitation Learning setting)
Compare with Supervised Learning
Landscape of Imitation Learning settings
23. Types of Imitation Learning
Behavioral Cloning:
● argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
● Works well when P* is close to P 𝛉
Direct Policy Learning via Interactive Demonstrator:
● Loop: collect demonstrations → supervised learning → rollout in environment
● Requires an interactive demonstrator (BC is the 1-step special case)
Inverse RL:
● Learn r such that 𝜋* = argmax 𝛉 Es~P(s|𝛉) r(s, 𝜋 𝛉(s)), then solve the resulting RL problem
● Assumes learning r is statistically easier than directly learning 𝜋*
25. Other Considerations
Choice of imitation loss (e.g., generative adversarial learning)
Suboptimal demonstrations (e.g., outperform teacher)
Partial demonstrations (e.g., weak feedback)
Domain transfer (e.g., few-shot learning)
Structured domains (e.g., multi-agent systems, structured prediction)
26. Interactive Direct Policy Learning
Behavioral Cloning is the simplest example
Beyond BC: use an interactive demonstrator
Often analyzed via learning reductions
■ Reduce “harder” learning problem to “easier” one
■ Imitation Learning → Supervised Learning
■ General Overview: http://hunch.net/~jl/projects/reductions/reductions.html
27. Learning Reductions
Behavioral Cloning reduces a “harder” problem A to an “easier” problem B:
A: Es~P(s|𝛉) L(a*(s), 𝜋 𝛉(s)) → B: E(s,a*)~P* L(a*, 𝜋 𝛉(s))
What does learning well on B imply about A?
■ E.g., can one lift PAC learning results from B to A?
28. Basic Results for Behavioral Cloning
Assume L(a*, a*) = 0.
Suppose: ε = E(s,a*)~P* L(a*, 𝜋 𝛉(s))
Then: Es~P(s|𝛉) L(a*(s), 𝜋 𝛉(s)) = O(Tε)
● Cumulative error on a T-step trajectory is O(T²ε)
*Paraphrased from:
Theorem 2.1 in [Efficient Reductions for Imitation Learning – Ross & Bagnell, AISTATS 2010]
Lemma 3 in [A Reduction from Apprenticeship Learning to Classification – Syed & Schapire, NIPS 2010]
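A rough intuition for the quadratic blow-up (a back-of-the-envelope sketch, not the cited theorems' actual proofs):

```latex
% Per-step error eps compounds over a T-step rollout:
% once the policy deviates, it may stay off-distribution for the rest of the episode.
\begin{align*}
\Pr[\text{first mistake at step } t] &\le \epsilon
  && \text{(supervised error on } P^*\text{)} \\
\text{cost of a mistake at step } t &\le T - t \le T
  && \text{(may never recover)} \\
\text{total error} &\le \sum_{t=1}^{T} \epsilon \,(T - t) = O(T^2 \epsilon)
\end{align*}
```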
29. Direct Policy Learning via Interactive Expert
Sequential learning reductions: loop of collect demonstrations → supervised learning → rollout in environment
Requires an interactive demonstrator (BC is the 1-step special case)
Sequence of distributions:
■ Es~P(m) L(𝜋*(s), 𝜋(s))
■ Ideally converges to 𝜋OPT (the best in the policy class)
Usually starting from:
■ Es~P* L(𝜋*(s), 𝜋(s))
30. Interactive Expert
Can query expert at any state
Construct loss function
■ L(𝜋*(s), 𝜋(s))
Typically applied to rollout trajectories
■ s ~ P(s|𝜋)
Driving example: L(𝜋*(s), 𝜋(s)) = (𝜋*(s) - 𝜋(s))2
Example from Super Tux Kart
(Image courtesy of Stephane Ross)
Expert provides feedback on states visited by the policy
Images from Stephane Ross
31. Alternating Optimization (Naïve Attempt)
1. Fix P, estimate 𝜋
● Solve argmin 𝛉 Es~PL(𝜋(s),𝜋 𝛉(s))
2. Fix 𝜋, estimate P
● Empirically estimate via rolling out 𝜋
3. Repeat
Not guaranteed to converge!
Images from Stephane Ross
32. Sequential Learning Reductions
Initial predictor: 𝜋0 (trained on initial expert demonstrations)
For m = 1, 2, …:
● Collect trajectories 𝞽 by rolling out 𝜋m-1 (typically roll out multiple times)
● Estimate state distribution Pm using s ∈ 𝞽
● Collect interactive feedback {𝜋*(s) | s ∈ 𝞽} (requires interactive expert)
● Data Aggregation (e.g., DAgger):
○ Train 𝜋m on P1 ∪ … ∪ Pm
● Policy Aggregation (e.g., SEARN & SMILe):
○ Train 𝜋’m on Pm
○ 𝜋m = β𝜋’m + (1-β)𝜋m-1
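To make the data-aggregation branch of this loop concrete, here is a minimal DAgger-style sketch; the `env`/`expert` interfaces and the scikit-learn classifier are illustrative assumptions, not the tutorial's code.

```python
# DAgger-style loop: a minimal sketch
# (env.reset() -> state, env.step(a) -> state, expert(state) -> action are assumed).
from sklearn.linear_model import LogisticRegression

def dagger(env, expert, n_iters=5, horizon=100):
    states, actions = [], []
    # pi_0: behavioral cloning on one expert rollout
    s = env.reset()
    for _ in range(horizon):
        states.append(s); actions.append(expert(s))
        s = env.step(actions[-1])
    policy = LogisticRegression(max_iter=1000).fit(states, actions)
    for _ in range(1, n_iters):
        s = env.reset()
        for _ in range(horizon):
            a = policy.predict([s])[0]                    # roll out the *current* policy
            states.append(s); actions.append(expert(s))   # expert labels the visited state
            s = env.step(a)
        # data aggregation: retrain on P1 U ... U Pm
        policy = LogisticRegression(max_iter=1000).fit(states, actions)
    return policy
```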
33. Data Aggregation (DAgger)
Sequence of convex losses: Lm(𝜋) = Es~P(m) L(𝜋*(s), 𝜋(s))
Online learning: find a sequence 𝜋m competitive with the best in the policy class, 𝜋OPT.
“Online regret”: RM = (1/M)∑m Lm(𝜋m) − (1/M)∑m Lm(𝜋OPT)
Follow-the-Leader: 𝜋m = argmin 𝛉 ∑m'=1..m Lm'(𝜋 𝛉)
No-regret guarantee: ∃𝜋m (typically 𝜋M) with
(1/M)∑m' Lm'(𝜋m) ≤ (1/M)∑m' Lm'(𝜋OPT) + O(1/M)
i.e., RM = O(1/M) (for M > T)
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning – Stephane Ross, Geoff Gordon, Drew Bagnell, AISTATS 2011
34. Policy Aggregation (SEARN & SMILe)
Train 𝜋’m on Pm → 𝜋m = β𝜋’m + (1−β)𝜋m-1
Unrolling the recursion (𝜋0 is the expert, not available at test time; its weight (1−β)^M becomes negligible in 𝜋M):
𝜋M = (1−β)^M 𝜋0 + β ∑m' (1−β)^(M−m') 𝜋’m'
𝜋m not much worse than 𝜋m−1:
● At most βT Lm(𝜋’m) + β²T²/2 for a T-step rollout
𝜋M not much worse than 𝜋0:
● At most 2T log(T) (1/M)∑m Lm(𝜋’m) + O(1/T) for a T-step rollout
Search-based Structured Prediction – Hal Daume et al., Machine Learning 2009
Efficient Reductions for Imitation Learning – Stephane Ross & Drew Bagnell, AISTATS 2010
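For concreteness, one way to realize the stochastic mixture 𝜋m (a sketch; the policy interfaces are hypothetical):

```python
# SEARN/SMILe-style mixture policy pi_m = beta * pi'_m + (1 - beta) * pi_{m-1},
# realized as a stochastic choice at each step.
import random

class MixturePolicy:
    def __init__(self, prev_policy, new_policy, beta):
        self.prev, self.new, self.beta = prev_policy, new_policy, beta

    def __call__(self, state):
        # With probability beta act with the newly trained policy;
        # otherwise defer to the previous mixture (which recursively includes pi_0).
        return self.new(state) if random.random() < self.beta else self.prev(state)
```

Each call stochastically defers down the chain of previous mixtures, so the expert 𝜋0 is invoked with probability (1−β)^m, matching the unrolled form above.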
35. Direct Policy Learning via Interactive Expert
Reduction to a sequence of supervised learning problems:
● New distributions constructed from roll-outs of previous policies (loop: collect demonstrations → supervised learning → rollout in environment)
● Requires interactive expert feedback
Two approaches: Data Aggregation & Policy Aggregation
● Both ensure convergence
● Motivated by different theory
Not covered:
● What are the expert feedback & loss function? (Depends on the application)
36. Types of Imitation Learning (recap)
Behavioral Cloning:
● argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
● Works well when P* is close to P 𝛉
Direct Policy Learning via Interactive Demonstrator:
● Loop: collect demonstrations → supervised learning → rollout in environment
● Requires an interactive demonstrator (BC is the 1-step special case)
Inverse RL:
● Learn r such that 𝜋* = argmax 𝛉 Es~P(s|𝛉) r(s, 𝜋 𝛉(s)), then solve the resulting RL problem
● Assumes learning r is statistically easier than directly learning 𝜋*
37. Learn Policy w/o Expert: Reinforcement Learning
MDP Formulation: transition model, starting state distribution, reward function
Goal: maximize cumulative rewards
Fully specified MDP: value & policy iteration
38. Learn Policy w/o Expert: Reinforcement Learning
MDP Formulation:
Policy Gradient Theorem (Sutton et al., NIPS 1999)
Optimize the value function w.r.t. a parameterized policy
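To ground the policy-gradient idea, here is a minimal REINFORCE-style sketch; the environment interface (`reset() -> state`, `step(a) -> (state, reward, done)`) and network sizes are illustrative assumptions, not from the tutorial.

```python
# Minimal REINFORCE-style policy gradient sketch (hypothetical env interface).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def train_episode(env, horizon=200):
    log_probs, rewards = [], []
    s = env.reset()
    for _ in range(horizon):
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        s, r, done = env.step(a.item())
        log_probs.append(dist.log_prob(a)); rewards.append(r)
        if done:
            break
    ret = sum(rewards)                            # Monte Carlo return (high variance)
    loss = -ret * torch.stack(log_probs).sum()    # gradient = E[grad log pi * return]
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return ret
```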
40. Also check out…
■ Introduction to Reinforcement Learning with Function Approximation – Richard Sutton – NIPS 2015 Tutorial
■ Deep Reinforcement Learning – David Silver – ICML 2016 Tutorial
■ Deep Reinforcement Learning, Decision Making, and Control – Sergey Levine & Chelsea Finn – ICML 2017 Tutorial
■ Deep Reinforcement Learning through Policy Optimization – Pieter Abbeel & John Schulman – NIPS 2016 Tutorial
41. Challenges with Reward Engineering
The MDP formulation assumes the reward function is given
https://blog.openai.com/faulty-reward-functions/
43. Inverse Reinforcement Learning
■ Inverse RL high-level recipe:
● Collect expert demonstrations
● Learn a reward function that explains them
● Learn a policy given the reward function (RL)
● Compare the learned policy with the expert; repeat
44. Reward Learning is Ambiguous
1. Many reward functions correspond to the same policy
45. Imitation Learning via Inverse RL (model-given)
Abbeel & Ng, ICML ‘04
Goal: find reward function
Different from “idealized” IRL
46. Game-Theoretic Inverse RL (model-given)
Syed & Schapire, NIPS ‘07
Goal: find a policy performing better than the expert over a class of rewards
Different from “idealized” IRL
47. Imitation Learning via Inverse RL (model-given)
Abbeel & Ng, ICML ‘04; Syed & Schapire, NIPS ‘07
Assumptions: RL oracle, known dynamics
Loop: update reward → run full RL → compare with expert
48. Linear Reward → Feature Expectations
Assume the reward is linear in features: r(s) = w·𝜙(s)
Then the value of a policy can be expressed in terms of its feature expectations
Abbeel & Ng, ICML ‘04; Syed & Schapire, NIPS ‘07
49. Feature Matching → Optimality
Feature expectation: μ(𝜋) = E[∑t γ^t 𝜙(st) | 𝜋]
If the reward r is linear, r = w·𝜙, then V(𝜋) = w·μ(𝜋)
So matching feature expectations matches values: μ(𝜋) = μ(𝜋*) ⇒ V(𝜋) = V(𝜋*)
(We don’t know the true w exactly)
Abbeel & Ng, ICML ‘04; Syed & Schapire, NIPS ‘07
50. Feature Matching in Inverse RL (model-given)
Abbeel & Ng, ICML ‘04
Algorithm: solve a max-margin problem in each loop
● Find reward weights that maximally separate the expert from the current policies
● Stop when the expert's feature expectations are (approximately) matched
Why? Matching feature expectations matches value for any linear reward
Theory: terminates in at most a bounded (polynomial) number of iterations
51. See also...
Maximum Margin Planning – Ratliff et al., ICML 2006
■ Generalizes to multiple MDPs
■ Agnostic about a real underlying MDP or reward function
Experiment: learning to plan based on satellite color imagery
(Figure: train image, learned cost map, test image; mode 1: follow the road, mode 2: “hide” in the trees)
52. See also...
A Game-Theoretic Approach to Apprenticeship Learning – Syed & Schapire, NIPS 2007
■ Restricted to a bounded class of linear rewards
■ Game-theoretic formulation
■ Iteratively learn via Multiplicative Weight Updates
53. Feature Expectation Matching
Want to find 𝜋 s.t. its discounted state visitation features match the expert's
Ill-posed problem → needs regularization
One regularization method: maximum entropy
54. Policy Learning is Still Ambiguous
1. Many reward functions correspond to the same policy
2. Many stochastic mixtures of policies correspond to the
same feature expectation
55. Maximum Entropy Principle
A policy induces a distribution over trajectories, P(𝞽|𝜋)
Feature matching: choose a trajectory distribution whose expected features match the expert's
Maximum entropy principle: the probability distribution which best represents the current state of knowledge is the one with the largest entropy (E.T. Jaynes, 1957)
Information Theory and Statistical Mechanics – E.T. Jaynes – Phys. Rev. 1957
56. Maximum Entropy Principle
Choose trajectory distribution that satisfies feature matching
constraints without over-committing
As uniform as possible
57. Maximum Entropy Principle
The distribution that maximizes entropy subject to linear (feature-matching) constraints is in the exponential family.
Constrained optimization (maximize entropy s.t. feature matching) → via the Lagrangian → an unconstrained objective over the multipliers
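A compact version of that derivation (a standard sketch; w and λ below are the Lagrange multipliers, introduced here rather than taken from the slides):

```latex
% Maximize entropy subject to feature matching and normalization:
\begin{align*}
\max_P \; & -\sum_\tau P(\tau)\log P(\tau)
  \quad \text{s.t.} \quad \sum_\tau P(\tau)\,\phi(\tau) = \bar{\phi}_{\mathrm{expert}},
  \quad \sum_\tau P(\tau) = 1 \\
% Stationarity of the Lagrangian in each P(tau):
& \frac{\partial}{\partial P(\tau)}\Big[-\sum_\tau P\log P
  + w^\top\Big(\sum_\tau P\,\phi - \bar{\phi}\Big)
  + \lambda\Big(\sum_\tau P - 1\Big)\Big] = 0 \\
\Rightarrow \; & P(\tau) \propto \exp\big(w^\top \phi(\tau)\big)
  \quad \text{(exponential family)}
\end{align*}
```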
58. MaxEnt Inverse RL (model-given)
Ziebart et al., AAAI ‘08
MaxEnt formulation: P(𝞽|w) ∝ exp(w·𝜙(𝞽))
Reward inference: maximize the log likelihood of the expert demonstrations under this model
59. MaxEnt Inverse RL (model-given)
Ziebart et al., AAAI ‘08
Dynamics are given by the model, so:
● Gradient descent over the log-likelihood to update the reward
● Solve forward RL w.r.t. the current reward
● Dynamic programming computes the state occupancy measure (visitation frequencies) needed for the gradient
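A sketch of the resulting gradient step on a small tabular MDP; the inputs (`phi`, `expert_feature_avg`, `state_visitation`) are assumed given, and the forward-RL/dynamic-programming computation of visitation frequencies is omitted.

```python
# One gradient step of MaxEnt IRL (tabular sketch; helper inputs are assumptions).
import numpy as np

def maxent_irl_step(w, phi, expert_feature_avg, state_visitation, lr=0.1):
    """phi: (n_states, k) feature matrix; expert_feature_avg: (k,) empirical
    expert feature expectations; state_visitation: (n_states,) occupancy D(s)
    under the current reward r(s) = phi @ w (from forward RL + DP)."""
    grad = expert_feature_avg - state_visitation @ phi   # d logL / d w
    return w + lr * grad                                 # gradient *ascent* on log-likelihood
```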
60. Inverse RL So Far...
Model-given: known dynamics, small MDP
Loop: update reward → run full RL → compare with expert
62. MaxEnt Deep IRL + Model-Free Optimization
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization – Finn et al., ICML ‘16
Recall the MaxEnt formulation; now the reward is parameterized by a neural net
Max likelihood objective: need to sample to evaluate gradients (the partition function is no longer tractable)
63. MaxEnt Deep IRL + Model-Free Optimization
Finn et al., ICML ‘16
Use a “proposal” distribution to sample trajectories
● Approximates the optimal sampling distribution, which depends on the optimal policy w.r.t. the unknown reward
Alternate between reward optimization and policy optimization
64. MaxEnt Deep IRL + Model-Free Optimization
Finn et al., ICML ‘16
Estimate the log likelihood from samples, with a gradient that also depends on the sampling distribution (which itself depends on the current reward)
65. MaxEnt Deep IRL + Model-Free Optimization
■ Generate samples by preventing the policy from changing too rapidly
■ Update policy: take only 1 policy optimization step per reward update
Figures from Chelsea Finn’s presentation
67. IL via Occupancy Measure Matching
Apprenticeship Learning Using Linear Programming – Syed, Bowling & Schapire, ICML ‘08
Occupancy measure ρ𝜋(s,a): (discounted) visitation frequency of (s,a) under 𝜋
Fact: V(𝜋) = ∑s,a ρ𝜋(s,a) r(s,a) for any reward r
68. IL Using Linear Programming (model-given)
Syed, Bowling & Schapire, ICML ‘08
Imitation learning ⇔ occupancy measure matching
Direct imitation learning by solving the dual LP of the original MDP
Dual of IRL: no reward learning needed
69. Occupancy Measure Matching + MaxEnt
Matching occupancy measures matches values for any reward function
Ill-posed occupancy matching → apply the MaxEnt principle again
Again, turn it into an unconstrained optimization
78. A Deep Learning Approach for Generalized Speech Animation
(Figure 1 from the paper: (a) an example input x, a phoneme token sequence aligned to frames for the input speech “PREDICTION”, and (b) the output y for visual speech animation, where each dimension of y corresponds to a parameter of a face model.)
Goal: learn predictor 𝜋: X → Y
Input sequence: X = ⟨x1, …, xT⟩, a phoneme sequence
Output sequence: Y = ⟨y1, …, yT⟩, a sequence of face configurations with each y ∈ R^D
Train using Behavioral Cloning
[Taylor et al., SIGGRAPH 2017]
79. A Deep Learning Approach for Generalized Speech Animation
Sarah Taylor, Taehwan Kim, Yisong Yue et al., SIGGRAPH 2017 https://www.youtube.com/watch?v=9zL7qejW9fE
80. A Deep Learning Approach for Generalized Speech Animation
Sarah Taylor, Taehwan Kim, Yisong Yue et al., SIGGRAPH 2017 https://www.youtube.com/watch?v=9zL7qejW9fE
81. (Figure: phoneme-context decision-tree queries over the target speech “SIGGRAPH”, e.g., “Is the phoneme in the 8th frame a diphthong?”, “Is the phoneme in the 8th frame a semivowel?”, “Is the phoneme in the 3rd frame articulated at the back of the mouth?”)
Full pipeline: Input Audio → Speech Recognition → Speech Animation → Retargeting (e.g., [Sumner & Popovic 2004]; chimp rig courtesy of Hao Li) → Editing
83. Structured Prediction
Learn a mapping from a structured space X to a structured space Y
Typically requires solving an optimization problem at prediction time
● E.g., the Viterbi algorithm for MAP inference in HMMs
84. Structured Prediction
(Figure collage of X → Y examples: conservation reservoir corridors [Daniel Sheldon et al.; Daniel Golovin et al.]; stereo depth estimation via binocular fusion of features observed by the eyes and reconstruction of their 3D preimage, left/right images → perceived depth [Tsukuba]; [Debadeepta Dey et al.]; [Nathan Ratliff et al.]; [Brian Ziebart et al., 2008])
85. Example: Sequence Prediction (Part-of-Speech Tagging)
– Given a sequence of words x
– Predict the sequence of tags y (dynamic programming)
x: The rain wet the cat
y: Det N V Det N
(other candidate outputs y: Adj V V Det V; V V N Adv Det; …)
Goal: Minimize Hamming Loss
86. Goal: Learn Decision Function (inside an Inference/Optimization Procedure)
● Sequentially predict outputs: 𝜋(st) → at ≡ yt
x: The rain wet the cat
y: Det N V Det N
States at successive iterations:
s1 = (x, Ø)
s2 = (x, {Det})
s3 = (x, {Det, N})
…
Interactive feedback via structured labels → supervised learning on (s, a) pairs
87. Data Aggregation (DAgger) for Sequence Labeling (Basic Version)
● Iteration 1: Roll out 𝜋0 (memorize training set)
x: The rain wet the cat
y: Det N V Det N
P1: s1 = (x, Ø), s2 = (x, {Det}), s3 = (x, {Det, N}), …
Expert labels: a1 = Det, a2 = N, a3 = V, …
Train 𝜋1 on P1
88. Data Aggregation (DAgger) for Sequence Labeling (Basic Version)
● Iteration 2: Roll out 𝜋1 (its predictions now contain mistakes)
x: The rain wet the cat
P2: s1 = (x, Ø), s2 = (x, {N}), s3 = (x, {N, N}), …
Expert labels: a1 = Det, a2 = N, a3 = V, …
Train 𝜋2 on P1 and P2
89. Data Aggregation (DAgger) for Sequence Labeling (Basic Version)
● Iteration 3: Roll out 𝜋2 (still makes mistakes)
x: The rain wet the cat
P3: s1 = (x, Ø), s2 = (x, {Det}), s3 = (x, {Det, V}), …
Expert labels: a1 = Det, a2 = N, a3 = V, …
Train 𝜋3 on P1 and P2 and P3
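The three iterations above reduce to a simple data-collection routine; here is a sketch, with a hypothetical `policy(state) -> tag` interface and gold tags standing in for the expert 𝜋*.

```python
# One DAgger iteration for sequence labeling (a sketch, not the tutorial's code).
def collect_iteration(sentences, expert_tags, policy, dataset):
    """Roll out the current policy left-to-right; the expert labels every visited state."""
    for words, gold in zip(sentences, expert_tags):
        predicted = []
        for t in range(len(words)):
            state = (words, tuple(predicted))    # s_t = (x, tags predicted so far)
            dataset.append((state, gold[t]))     # interactive feedback pi*(s_t)
            predicted.append(policy(state))      # continue the rollout with the policy's action
    return dataset
```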
90. (Basic) Ingredients for Structured Prediction
● Deal with Distributional Drift (otherwise just use behavioral cloning)
○ Related approaches include scheduled sampling:
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
[Bengio et al., NIPS 2015]
● Imitation Loss compatible w/ Structured Prediction Loss
○ Hamming Loss → 0/1 per-step loss (Hamming Loss is additive)
○ More generally: reduce to cost-sensitive classification loss
● Efficiency considerations
○ Efficiently Programmable Learning to Search [Daume et al., 2014]
○ Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction
[Sun et al., 2017]
91. Learning Policies for Contextual Submodular Optimization (Example of Cost-Sensitive Reduction)
Stephane Ross et al., ICML 2013
Training set: (x, Fx), where x is the context and Fx a monotone submodular function
Goal: learn 𝜋 that sequentially maps x to an action set A to maximize Fx
Learning reduction: at state s = (x, A), where A is the set of already selected actions, create a cost for each action (the greedy/submodular regret; see the sketch below):
○ Define: fmax = maxa Fx(A + a) – Fx(A)
○ cs(a) = fmax – (Fx(A + a) – Fx(A))
○ Supervised training example: (s, cs)
Can prove convergence to a (1 − 1/e)-optimal policy!
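A sketch of that cost construction; the coverage function and actions below are illustrative placeholders (F must be a monotone submodular set function).

```python
# Cost-sensitive reduction for contextual submodular optimization (sketch).
def action_costs(F, selected, candidates):
    """Return c_s(a) = f_max - (F(A + a) - F(A)) for each candidate action a."""
    gains = {a: F(selected | {a}) - F(selected) for a in candidates}
    f_max = max(gains.values())
    return {a: f_max - g for a, g in gains.items()}   # the greedy-best action gets cost 0

# Example with a simple coverage function (each action covers a set of items):
F = lambda A: len(set().union(*A)) if A else 0
costs = action_costs(F, frozenset(), [frozenset({1, 2}), frozenset({2})])
# -> {frozenset({1, 2}): 0, frozenset({2}): 1}
```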
92. Structured Prediction ≈ Learning to Search
Compile input description x into search problem
Train 𝜋 to effectively navigate search space
High-level supervised reward/cost signals:
● E.g., routing cost, BLEU score, Hamming loss, submodular utility, Jaccard similarity, integer program value, log likelihood of a graphical model, etc.
Expert might be weak (search space exponentially large)
93. Learning to Search in Branch and Bound Algorithms
■ Used to solve Integer Programs
● Two local decisions: ranking and pruning
● Train using solved instances (e.g., by Gurobi)
[He et al., NIPS 2014]
(Figure: a branch-and-bound search tree for the integer program
min −2x − y s.t. 3x − 5y ≤ 0, 3x + 5y ≤ 15, x ≥ 0, y ≥ 0, x, y ∈ Z,
annotated with node expansion order, global lower and upper bounds, the optimal node, fathomed nodes, and the induced ranking/pruning training examples.)
94. Sub-Optimal Expert: Imitation Learning → Reinforcement Learning
■ Query Access to Environment (simple exploration)
● Learning to Search via Retrospective Imitation [Song et al., 2018]
● Learning to Search Better than Your Teacher [Chang et al., ICML 2015]
● Reinforcement and Imitation Learning via Interactive No-Regret Learning [Ross & Bagnell, 2014]
■ More Sophisticated Exploration (also Query Access to Environment)
● Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback [Daume et al., ICLR 2018]
■ Actor-Critic Style Approaches (also Query Access to Environment)
● Truncated Horizon Policy Search: Combining Reinforcement Learning and Imitation Learning [Sun et al., ICLR 2018]
● Sequence Level Training with Recurrent Neural Networks [Ranzato et al., ICLR 2016]
95. Learning to Search Better than Your Teacher
[Chang et al., ICML 2015]
Roll-in: execute policy w/o learning
Roll-out: execute policy for learning
SEARN [Daume et al., 2009]
96. Improving Multi-step Prediction of Learned Time Series Models
[Venkatraman et al., AAAI 2015]
Learn 𝜋 that maps xt to xt+1
● Input is the previous prediction
● A special case of sequence prediction
Can also extend to latent variables:
● Learning to Filter with Predictive State Inference Machines [Sun et al., 2016]
State dynamics: xt+1 = E[f(xt)]
(Figure: (a) forward simulation of the learned model (gray) introduces error at each prediction step compared to the true time series (red); (b) the data itself provides a demonstration of the corrections required to recover → interactive imitation learning.)
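A sketch of the correction idea in the figure above, in the Data-as-Demonstrator style; `model.predict` is a hypothetical interface, not the paper's implementation.

```python
# Data-as-Demonstrator-style sketch for multi-step time-series prediction.
def dad_iteration(model, true_series, dataset):
    """Roll out the learned model on its own predictions, and add training
    pairs that point each predicted state back to the true next state."""
    for series in true_series:            # each series: [x_0, x_1, ..., x_T]
        x_hat = series[0]
        for t in range(len(series) - 1):
            dataset.append((x_hat, series[t + 1]))   # correction toward the truth
            x_hat = model.predict(x_hat)             # continue from own prediction
    return dataset
```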
97. Imitation Learning for Structured Prediction
■ Learn decision function to navigate search space
■ Reduce to interactive imitation learning
● Expert feedback from computational oracle
■ Careful choice of loss function
■ Suboptimal experts
● Due to computational complexity of solving search space
102. Smooth Imitation Learning
Le et al., ICML ‘16
Smooth Imitation by Functional Regularization
Smooth policy class: combine a complex policy class f with a smooth model class g
107. Safe Imitation Learning for Autonomous Driving
Zhang & Cho, AAAI ‘17
A safety classifier predicts whether the learned policy is likely to deviate from the reference (expert) policy; if predicted safe, drive with the learned policy, otherwise fall back to the expert
108. Safe Imitation Learning for Autonomous Driving
Zhang & Cho, AAAI ‘17
SafeDAgger: safety classifier in the loop; only query the expert in “unsafe” states
Results:
■ Reduced number of damages
■ Faster learning (can be viewed as active DAgger)
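The control-gating logic reduces to a few lines; a sketch (the policy and classifier objects are hypothetical stand-ins, not the paper's implementation):

```python
# SafeDAgger-style control gating (sketch).
def act(state, learned_policy, expert_policy, safety_classifier, threshold=0.5):
    """Drive with the learned policy when the safety classifier predicts a small
    deviation from the expert; otherwise fall back to the expert (and query it)."""
    if safety_classifier(state) < threshold:    # predicted deviation is small: "safe"
        return learned_policy(state)
    return expert_policy(state)                 # "unsafe": expert takes over / labels the state
```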
113. One Shot Imitation Learning
Duan et al., NIPS ‘17
(Figure: block-stacking tasks; blocks labeled A–F are arranged into different stacks for the training task vs. the test task.)
114. One Shot Visual Imitation Learning - Transfer
Tobin et al., IROS ‘17
Trained in simulation only
using domain randomization
Test in the real world
115. One Shot Visual Imitation Learning - Transfer
Training entirely in simulation, expert demonstrations through VR
Tobin et al., IROS ‘17
116. Visual Meta-Imitation Learning
Finn et al., ICML ‘17 (Meta Learning, MAML); Finn et al., CoRL ‘17
Fast adaptation update + meta-objective update
Meta-Imitation Training:
1. Sample a task
2. Use one demonstration for the fast adaptation update
3. Use another demonstration for the meta-objective update
117. One Shot Imitation from Watching Videos
Yu et al., RSS ‘18
pick up a novel object and place it into a previously unseen bowl
118. Repeated Inverse Reinforcement Learning
Amin et al., NIPS ‘17
Multiple tasks from multiple environments, presented sequentially
Unknown reward function: latent human preferences plus an external, task-specific component
The human behaves optimally w.r.t. the reward in each task
Goal: learn from demonstrations from as few tasks as possible and generalize to new ones, under two settings:
1. Agent chooses the task actively
2. Nature chooses the task adversarially
124. Generative Multi-agent Behavioral Cloning
Zhan et al., arXiv ‘18
Per-agent VRNNs coordinated by a macro-goal network (MAGnet), trained with a variational lower bound of the form:
E_q[ ∑t=1..T −KL( q(z_t | m_≤t, z_<t) ‖ p(z_t | m_<t, z_<t) ) + log p(m_t | z_≤t, m_<t) ]
During training, weak macro-goal labels m̂ = (m̂1, …, m̂N) are used, where m̂i is the agent's next location
• Generative multi-agent imitation learning: there is no single “correct” action
• Hierarchical: make predictions at multiple resolutions (macro-goals)
125. Multi-modal Imitation Learning: InfoGAIL
InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations – Li et al., NIPS ‘17
Thanks to materials from Stefano Ermon
(Figure: two distinct expert modes, passing left vs. passing right)
127. Multi-modal Imitation Learning: InfoGAIL
Li et al., NIPS ‘17
Thanks to materials from Stefano Ermon
Further regularization: maximize the mutual information between the latent code z and the behavior (s, a)
(Diagram: latent code Z → policy → environment → observed behavior (s, a); maximize the mutual information between Z and (s, a))
128. Multi-modal Imitation Learning: InfoGAIL
Li et al., NIPS ‘17
Thanks to materials from Stefano Ermon
Training using Wasserstein GAN
Mode generation: varying the latent code (Z = 0, 0.5, 1) produces distinct behaviors (e.g., passing right vs. passing left)
129. See also...
■ Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets – Hausman et al., NIPS ‘17
■ Multi-agent Generative Adversarial Imitation Learning – Song et al., arXiv ‘18
131. Cooperative Inverse Reinforcement Learning
Hadfield-Menell et al., NIPS ‘16
How should humans demonstrate the tasks?
Inverse RL assumes demonstrated behavior is (near) optimal, i.e., it accomplishes the task efficiently
This assumption precludes useful teaching behaviors
132. Cooperative Inverse Reinforcement Learning
Hadfield-Menell et al., NIPS ‘16
■ Human (H) and robot (R) play a cooperative game
■ H observes the actual reward signal; R only knows a prior distribution over reward functions
■ H may choose to accept less reward on a particular action in order to convey more information to R
133. See also...
■ An Efficient, Generalized Bellman Update for Cooperative Inverse Reinforcement Learning – Malik et al., ICML 2018
● More efficient POMDP solver
● (Partially) relaxes the assumption of human rationality
134. Showing vs Doing: Teaching by Demonstration
Pedagogical Inverse Reinforcement Learning
Ho et al., NIPS ‘16
Teaching depends on the expert's inferences about the learner's inferences about the demonstration
(Validated with a human experiment)
136. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
How can we most effectively leverage the teacher's feedback?
Hierarchical feedback is more natural (e.g., “corridor → left”, “end of hall → left”)
≠ flat imitation learning / reinforcement learning
137. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
(Flat) imitation learning: feedback effort proportional to the task horizon
Hierarchical imitation learning: feedback at the level of sub-tasks, reducing that effort
138. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
High level: state → macro-action → state → macro-action → … → state
Low level (within each macro-action): state → action → … → state
Key insights:
● Verifying is cheaper than labeling low-level actions
● The expert provides high-level feedback and zooms in to the low level only when needed (weak feedback)
139. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
Hybrid imitation-reinforcement: significantly faster learning than pure RL
Hierarchical imitation learning: significantly less expert-costly than flat imitation learning
(Plot: success rate vs. low-level expert cost, 0K–60K; hg-DAgger reaches high success rates at far lower expert cost than flat DAgger.)
140. Learning from Human Preferences
Christiano et al., NIPS ‘17
Preference feedback is more natural to specify than exact numeric rewards
141. Learning from Human Preferences
Christiano et al., NIPS ‘17
Assume a Bradley-Terry preference model: the probability that segment 𝞽1 is preferred over 𝞽2 is
P(𝞽1 ≻ 𝞽2) = exp(∑t r(s1t, a1t)) / [exp(∑t r(s1t, a1t)) + exp(∑t r(s2t, a2t))]
Fit r by minimizing the cross-entropy loss between these predicted probabilities and the human preference labels
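A sketch of that loss in PyTorch; the `reward_net` and trajectory encodings are placeholder assumptions, not the paper's architecture.

```python
# Bradley-Terry preference loss sketch.
import torch
import torch.nn.functional as F

def preference_loss(reward_net, traj1, traj2, human_prefers_1):
    """traj1, traj2: (T, d) tensors of state-action features;
    human_prefers_1: 1.0 if the human preferred segment 1, else 0.0."""
    r1 = reward_net(traj1).sum()        # predicted return of segment 1
    r2 = reward_net(traj2).sum()        # predicted return of segment 2
    logit = r1 - r2                     # P(traj1 > traj2) = sigmoid(r1 - r2)
    target = torch.tensor(human_prefers_1)
    return F.binary_cross_entropy_with_logits(logit, target)
```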
142. See also previous work...
■ Programming by Feedback – Akrour et al., ICML 2014
■ A Bayesian Approach for Policy Learning from Trajectory Preference Queries – Wilson et al., NIPS 2012
● Employed a similar feedback elicitation mechanism to Christiano et al.
■ Learning Trajectory Preferences for Manipulators via Iterative Improvement – Jain et al., NIPS 2013
● Co-active feedback (feedback is only an incremental improvement, not necessarily the optimal result)
143. Interactive Learning from Policy-Dependent Human Feedback (COACH)
MacGlashan et al., ICML ‘17
How do humans train animals?
(Training interface shown to AMT users: reward/punishment for a “good” / “alright” / “bad” virtual dog)
Humans tend to give policy-dependent feedback
144. Interactive Learning from Policy-Dependent Human Feedback (COACH)
MacGlashan et al., ICML ‘17
Observed properties of human feedback:
● Policy-dependent feedback
● Differential feedback
● Diminishing returns
→ Treat human feedback as an approximation to the advantage function; the advantage function is a good feedback model
145. Interactive Learning from Policy-Dependent Human Feedback (COACH)
MacGlashan et al., ICML ‘17
■ Recall the Policy Gradient Theorem
■ Update rule: substitute the (discretized) human feedback f for the advantage term in the policy gradient:
𝛉 ← 𝛉 + α ∇𝛉 log 𝜋 𝛉(a|s) · f
147. See also...
■ Interactively shaping agents via human reinforcement: the TAMER framework – Knox & Stone, K-CAP ‘09
http://www.cs.utexas.edu/~bradknox/TAMER.html
148. Summary
1. Broad research area with many interesting questions
2. Practical approaches to sequential decision making
● Especially vs RL: sample complexity, cost of
experimentation, simulator availability, generalization, etc.
3. Diverse range of applications
149. Imitation Learning
ICML 2018 Tutorial
Q & A
Yisong Yue Hoang M. Le
yyue@caltech.edu hmle@caltech.edu
@YisongYue @HoangMinhLe
yisongyue.com hoangle.info