The document summarizes imitation learning techniques. It introduces behavioral cloning, which frames imitation learning as a supervised learning problem by learning to mimic expert demonstrations. However, behavioral cloning has limitations as it does not allow for recovery from mistakes. Alternative approaches involve direct policy learning using an interactive expert or inverse reinforcement learning, which aims to learn a reward function that explains the expert's behavior. The document outlines different types of imitation learning problems and algorithms for interactive direct policy learning, including data aggregation and policy aggregation methods.
1. Imitation Learning
ICML 2018 Tutorial
(Slides Available Online)
Yisong Yue Hoang M. Le
yyue@caltech.edu hmle@caltech.edu
@YisongYue @HoangMinhLe
yisongyue.com hoangle.info
2. Imitation Learning in a Nutshell
Given: demonstrations or demonstrator
Goal: train a policy to mimic demonstrations
Expert Demonstrations → State/Action Pairs → Learning
Images from Stephane Ross
3. Ingredients of Imitation Learning
Demonstrations or Demonstrator
Environment / Simulator
Policy Class
Loss Function
Learning Algorithm
4. Tutorial Overview
Part 1: Introduction and Core Algorithms
● Teaser results
● Overview of the ML landscape
● Types of imitation learning
● Core algorithms for passive learning, interactive learning, and cost learning
Part 2: Extensions and Applications
● Speech animation
● Structured prediction and search
● Improving over the expert
● Filtering and sequence modeling
● Multi-objective imitation learning
● Visual / few-shot imitation learning
● Domain adaptation for imitation learning
● Multi-agent imitation learning
● Hierarchical imitation learning
● Multi-modal imitation learning
● Weaker feedback
6. Helicopter Acrobatics
Learning for Control from Multiple Demonstrations
Adam Coates, Pieter Abbeel, Andrew Ng, ICML 2008
An Application of Reinforcement Learning to Aerobatic Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng, NIPS 2006 https://www.youtube.com/watch?v=0JL04JJjocc
8. A Deep Learning Approach for Generalized Speech Animation
Sarah Taylor, Taehwan Kim, Yisong Yue et al., SIGGRAPH 2017 https://www.youtube.com/watch?v=9zL7qejW9fE
10. One Shot Imitation Learning
Duan et al., NIPS ‘17
https://www.youtube.com/watch?v=oMZwkIjZzCM
11. Notation & Setup
State: s (sometimes x) (note: the state may be only partially observed)
Action: a (sometimes y)
Policy: 𝜋 𝛉 (sometimes h)
● Policy maps states to actions: 𝜋 𝛉(s) → a
● ...or distributions over actions: 𝜋 𝛉(s) → P(a)
State Dynamics: P(s’|s,a)
● Typically not known to policy.
● Essentially the simulator/environment
12. Notation & Setup Continued
Rollout: sequentially execute 𝜋(s0) on an initial state
● Produce trajectory 𝞽=(s0,a0,s1,a1,...)
P(𝞽|𝜋): distribution of trajectories induced by a policy
1. Sample s0 from P0 (distribution over initial states); initialize t = 1.
2. Sample action at from 𝜋(st-1).
3. Sample next state st by applying at to st-1 (requires access to the environment/simulator).
4. Repeat from Step 2 with t = t+1.
P(s|𝜋): distribution of states induced by a policy
● Let Pt(s|𝜋) denote distribution over t-th state
● P(s|𝜋) = (1/T)∑t Pt(s|𝜋)
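To make the rollout procedure and the induced state distribution concrete, here is a minimal sketch; the `env`/`policy` interfaces are illustrative assumptions, not from the tutorial.

```python
# Minimal rollout sketch (hypothetical `env`/`policy` interfaces; for illustration only).
def rollout(env, policy, horizon):
    """Return a trajectory tau = (s0, a0, s1, a1, ..., sT)."""
    s = env.reset()                    # sample s0 ~ P0
    trajectory = [s]
    for _ in range(horizon):
        a = policy(s)                  # choose a_t from pi(s_t)
        s = env.step(a)                # sample s_{t+1} ~ P(s' | s, a)
        trajectory += [a, s]
    return trajectory

def state_samples(env, policy, horizon, n_rollouts):
    """Empirical samples from P(s|pi) = (1/T) * sum_t P_t(s|pi)."""
    return [s for _ in range(n_rollouts)
              for s in rollout(env, policy, horizon)[::2]]   # states sit at even indices
```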
13. Example #1: Racing Game
(Super Tux Kart)
s = game screen
a = turning angle
Training set: D = {𝞽 = (s, a)} collected from 𝜋*
● s = the trajectory's sequence of states
● a = the corresponding sequence of actions
Goal: learn 𝜋 𝛉(s) → a
Images from Stephane Ross
14. Example #2: Basketball Trajectories
s = location of players & ball
a = next location of player
Training set: D = {𝞽 = (s, a)} collected from 𝜋*
● s = the trajectory's sequence of states
● a = the corresponding sequence of actions
Goal: learn 𝜋 𝛉(s) → a
16. Outline of 1st Half
Behavioral Cloning (simplest Imitation Learning setting)
Compare with Supervised Learning
Landscape of Imitation Learning settings
17. Behavioral Cloning = Reduction to Supervised Learning
(Ignoring regularization for brevity.)
Define P* = P(s|𝜋*) (distribution of states visited by expert)
(recall P(s|𝜋*) = (1/T)∑t Pt(s|𝜋*) )
(sometimes abuse notation: P* = P(s,a*=𝜋*(s)|𝜋*) )
Learning objective:
argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
Interpretations:
1. Assuming perfect imitation so far, learn to continue imitating perfectly
2. Minimize 1-step deviation error along the expert trajectories
Images from Stephane Ross
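As a concrete illustration of the objective above, here is a minimal behavioral-cloning sketch in PyTorch; the data shapes, network, and cross-entropy surrogate for L are placeholder assumptions, not the tutorial's setup.

```python
# Behavioral cloning as supervised learning: a minimal sketch.
import torch
import torch.nn as nn

states = torch.randn(1000, 8)              # placeholder expert states (s ~ P*)
actions = torch.randint(0, 4, (1000,))     # placeholder expert actions a* = pi*(s)

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()            # surrogate for the imitation loss L

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(policy(states), actions)   # E_{(s,a*)~P*} L(a*, pi_theta(s))
    loss.backward()
    optimizer.step()
```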
18. (General) Imitation Learning vs Behavioral Cloning
(Ignoring regularization for brevity.)
Behavioral Cloning (Supervised Learning):
argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
● Training distribution P* (pairs (s, a*)) is provided exogenously
(General) Imitation Learning:
argmin 𝛉 Es~P(s|𝛉) L(𝜋*(s), 𝜋 𝛉(s))
● Training distribution depends on the rollout: P(s|𝛉) = state distribution of 𝜋 𝛉 (starting from P0)
19. Limitations of Behavioral Cloning
IID assumption (supervised learning): training pairs (s, a*) are drawn from P*
Reality: 𝜋 𝛉 makes a mistake, and the next state is sampled not from P*!
Worst case is catastrophic!
Images from Stephane Ross
20. Limitations of Behavioral Cloning
Expert Trajectories
(Training Distribution)
Behavioral Cloning
Makes mistakes, enters new states
Cannot recover from new states
Videos from Stephane Ross
21. When to use Behavioral Cloning?
Advantages:
● Simple
● Efficient
Disadvantages:
● Distribution mismatch between training and testing
● No long-term planning
Use when:
● 1-step deviations are not too bad
● Learning reactive behaviors
● Expert trajectories “cover” the state space
Don’t use when:
● 1-step deviations can lead to catastrophic error
● Optimizing a long-term objective (at least not without a stronger model)
22. Outline of 1st Half
Behavioral Cloning (simplest Imitation Learning setting)
Compare with Supervised Learning
Landscape of Imitation Learning settings
23. Types of Imitation Learning
Behavioral Cloning:
● argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
● Works well when P* is close to P 𝛉
Direct Policy Learning via Interactive Demonstrator:
● Loop: collect demonstrations → supervised learning → rollout in environment
● Requires an interactive demonstrator (BC is the 1-step special case)
Inverse RL:
● Learn r such that 𝜋* = argmax 𝛉 Es~P(s|𝛉) r(s, 𝜋 𝛉(s)), then solve the resulting RL problem
● Assumes learning r is statistically easier than directly learning 𝜋*
25. Other Considerations
Choice of imitation loss (e.g., generative adversarial learning)
Suboptimal demonstrations (e.g., outperform teacher)
Partial demonstrations (e.g., weak feedback)
Domain transfer (e.g., few-shot learning)
Structured domains (e.g., multi-agent systems, structured prediction)
26. Interactive Direct Policy Learning
Behavioral Cloning is the simplest example
Beyond BC: use an interactive demonstrator
Often analyzed via learning reductions
■ Reduce “harder” learning problem to “easier” one
■ Imitation Learning → Supervised Learning
■ General Overview: http://hunch.net/~jl/projects/reductions/reductions.html
27. Learning Reductions
Behavioral Cloning reduces a “harder” problem A to an “easier” problem B:
A: Es~P(s|𝛉) L(a*(s), 𝜋 𝛉(s)) → B: E(s,a*)~P* L(a*, 𝜋 𝛉(s))
What does learning well on B imply about A?
■ E.g., can one lift PAC learning results from B to A?
28. Basic Results for Behavioral Cloning
Assume L(a*, a*) = 0.
Suppose: ε = E(s,a*)~P* L(a*, 𝜋 𝛉(s))
Then: Es~P(s|𝛉) L(a*(s), 𝜋 𝛉(s)) = O(Tε)
● Cumulative error on a T-step trajectory is O(T²ε)
*Paraphrased from:
Theorem 2.1 in [Efficient Reductions for Imitation Learning – Ross & Bagnell, AISTATS 2010]
Lemma 3 in [A Reduction from Apprenticeship Learning to Classification – Syed & Schapire, NIPS 2010]
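A rough intuition for the quadratic blow-up (a back-of-the-envelope sketch, not the cited theorems' actual proofs):

```latex
% Per-step error eps compounds over a T-step rollout:
% once the policy deviates, it may stay off-distribution for the rest of the episode.
\begin{align*}
\Pr[\text{first mistake at step } t] &\le \epsilon
  && \text{(supervised error on } P^*\text{)} \\
\text{cost of a mistake at step } t &\le T - t \le T
  && \text{(may never recover)} \\
\text{total error} &\le \sum_{t=1}^{T} \epsilon \,(T - t) = O(T^2 \epsilon)
\end{align*}
```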
29. Direct Policy Learning via Interactive Expert
Sequential learning reductions: loop of collect demonstrations → supervised learning → rollout in environment
Requires an interactive demonstrator (BC is the 1-step special case)
Sequence of distributions:
■ Es~P(m) L(𝜋*(s), 𝜋(s))
■ Ideally converges to 𝜋OPT (the best in the policy class)
Usually starting from:
■ Es~P* L(𝜋*(s), 𝜋(s))
30. Interactive Expert
Can query expert at any state
Construct loss function
■ L(𝜋*(s), 𝜋(s))
Typically applied to rollout trajectories
■ s ~ P(s|𝜋)
Driving example: L(𝜋*(s), 𝜋(s)) = (𝜋*(s) - 𝜋(s))2
Example from Super Tux Kart
(Image courtesy of Stephane Ross)
Expert provides feedback on states visited by the policy
Images from Stephane Ross
31. Alternating Optimization (Naïve Attempt)
1. Fix P, estimate 𝜋
● Solve argmin 𝛉 Es~PL(𝜋(s),𝜋 𝛉(s))
2. Fix 𝜋, estimate P
● Empirically estimate via rolling out 𝜋
3. Repeat
Not guaranteed to converge!
Images from Stephane Ross
32. Sequential Learning Reductions
Initial predictor: 𝜋0 (trained on initial expert demonstrations)
For m = 1, 2, …:
● Collect trajectories 𝞽 by rolling out 𝜋m-1 (typically roll out multiple times)
● Estimate state distribution Pm using s ∈ 𝞽
● Collect interactive feedback {𝜋*(s) | s ∈ 𝞽} (requires interactive expert)
● Data Aggregation (e.g., DAgger):
○ Train 𝜋m on P1 ∪ … ∪ Pm
● Policy Aggregation (e.g., SEARN & SMILe):
○ Train 𝜋’m on Pm
○ 𝜋m = β𝜋’m + (1-β)𝜋m-1
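To make the data-aggregation branch of this loop concrete, here is a minimal DAgger-style sketch; the `env`/`expert` interfaces and the scikit-learn classifier are illustrative assumptions, not the tutorial's code.

```python
# DAgger-style loop: a minimal sketch
# (env.reset() -> state, env.step(a) -> state, expert(state) -> action are assumed).
from sklearn.linear_model import LogisticRegression

def dagger(env, expert, n_iters=5, horizon=100):
    states, actions = [], []
    # pi_0: behavioral cloning on one expert rollout
    s = env.reset()
    for _ in range(horizon):
        states.append(s); actions.append(expert(s))
        s = env.step(actions[-1])
    policy = LogisticRegression(max_iter=1000).fit(states, actions)
    for _ in range(1, n_iters):
        s = env.reset()
        for _ in range(horizon):
            a = policy.predict([s])[0]                    # roll out the *current* policy
            states.append(s); actions.append(expert(s))   # expert labels the visited state
            s = env.step(a)
        # data aggregation: retrain on P1 U ... U Pm
        policy = LogisticRegression(max_iter=1000).fit(states, actions)
    return policy
```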
33. Data Aggregation (DAgger)
Sequence of convex losses: Lm(𝜋) = Es~P(m) L(𝜋*(s), 𝜋(s))
Online learning: find a sequence 𝜋m competitive with the best in the policy class, 𝜋OPT.
“Online regret”: RM = (1/M)∑m Lm(𝜋m) − (1/M)∑m Lm(𝜋OPT)
Follow-the-Leader: 𝜋m = argmin 𝛉 ∑m'=1..m Lm'(𝜋 𝛉)
No-regret guarantee: ∃𝜋m (typically 𝜋M) with
(1/M)∑m' Lm'(𝜋m) ≤ (1/M)∑m' Lm'(𝜋OPT) + O(1/M)
i.e., RM = O(1/M) (for M > T)
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning – Stephane Ross, Geoff Gordon, Drew Bagnell, AISTATS 2011
34. Policy Aggregation (SEARN & SMILe)
Train 𝜋’m on Pm → 𝜋m = β𝜋’m + (1−β)𝜋m-1
Unrolling the recursion (𝜋0 is the expert, not available at test time; its weight (1−β)^M becomes negligible in 𝜋M):
𝜋M = (1−β)^M 𝜋0 + β ∑m' (1−β)^(M−m') 𝜋’m'
𝜋m not much worse than 𝜋m−1:
● At most βT Lm(𝜋’m) + β²T²/2 for a T-step rollout
𝜋M not much worse than 𝜋0:
● At most 2T log(T) (1/M)∑m Lm(𝜋’m) + O(1/T) for a T-step rollout
Search-based Structured Prediction – Hal Daume et al., Machine Learning 2009
Efficient Reductions for Imitation Learning – Stephane Ross & Drew Bagnell, AISTATS 2010
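For concreteness, one way to realize the stochastic mixture 𝜋m (a sketch; the policy interfaces are hypothetical):

```python
# SEARN/SMILe-style mixture policy pi_m = beta * pi'_m + (1 - beta) * pi_{m-1},
# realized as a stochastic choice at each step.
import random

class MixturePolicy:
    def __init__(self, prev_policy, new_policy, beta):
        self.prev, self.new, self.beta = prev_policy, new_policy, beta

    def __call__(self, state):
        # With probability beta act with the newly trained policy;
        # otherwise defer to the previous mixture (which recursively includes pi_0).
        return self.new(state) if random.random() < self.beta else self.prev(state)
```

Each call stochastically defers down the chain of previous mixtures, so the expert 𝜋0 is invoked with probability (1−β)^m, matching the unrolled form above.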
35. Direct Policy Learning via Interactive Expert
Reduction to a sequence of supervised learning problems:
● New distributions constructed from roll-outs of previous policies (loop: collect demonstrations → supervised learning → rollout in environment)
● Requires interactive expert feedback
Two approaches: Data Aggregation & Policy Aggregation
● Both ensure convergence
● Motivated by different theory
Not covered:
● What are the expert feedback & loss function? (Depends on the application)
36. Types of Imitation Learning (recap)
Behavioral Cloning:
● argmin 𝛉 E(s,a*)~P* L(a*, 𝜋 𝛉(s))
● Works well when P* is close to P 𝛉
Direct Policy Learning via Interactive Demonstrator:
● Loop: collect demonstrations → supervised learning → rollout in environment
● Requires an interactive demonstrator (BC is the 1-step special case)
Inverse RL:
● Learn r such that 𝜋* = argmax 𝛉 Es~P(s|𝛉) r(s, 𝜋 𝛉(s)), then solve the resulting RL problem
● Assumes learning r is statistically easier than directly learning 𝜋*
37. Learn Policy w/o Expert: Reinforcement Learning
MDP Formulation: transition model, starting state distribution, reward function
Goal: maximize cumulative rewards
Fully specified MDP: value & policy iteration
38. Learn Policy w/o Expert: Reinforcement Learning
MDP Formulation:
Policy Gradient Theorem (Sutton et al., NIPS 1999)
Optimize the value function w.r.t. a parameterized policy
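To ground the policy-gradient idea, here is a minimal REINFORCE-style sketch; the environment interface (`reset() -> state`, `step(a) -> (state, reward, done)`) and network sizes are illustrative assumptions, not from the tutorial.

```python
# Minimal REINFORCE-style policy gradient sketch (hypothetical env interface).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def train_episode(env, horizon=200):
    log_probs, rewards = [], []
    s = env.reset()
    for _ in range(horizon):
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        s, r, done = env.step(a.item())
        log_probs.append(dist.log_prob(a)); rewards.append(r)
        if done:
            break
    ret = sum(rewards)                            # Monte Carlo return (high variance)
    loss = -ret * torch.stack(log_probs).sum()    # gradient = E[grad log pi * return]
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return ret
```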
40. Also check out…
■ Introduction to Reinforcement Learning with Function Approximation – Richard Sutton – NIPS 2015 Tutorial
■ Deep Reinforcement Learning – David Silver – ICML 2016 Tutorial
■ Deep Reinforcement Learning, Decision Making, and Control – Sergey Levine & Chelsea Finn – ICML 2017 Tutorial
■ Deep Reinforcement Learning through Policy Optimization – Pieter Abbeel & John Schulman – NIPS 2016 Tutorial
41. Challenges with Reward Engineering
The MDP formulation assumes the reward function is given
https://blog.openai.com/faulty-reward-functions/
43. Inverse Reinforcement Learning
■ Inverse RL high-level recipe:
● Collect expert demonstrations
● Learn a reward function that explains them
● Learn a policy given the reward function (RL)
● Compare the learned policy with the expert; repeat
44. Reward Learning is Ambiguous
1. Many reward functions correspond to the same policy
45. Imitation Learning via Inverse RL (model-given)
Abbeel & Ng, ICML ‘04
Goal: find reward function
Different from “idealized” IRL
46. Game-Theoretic Inverse RL (model-given)
Syed & Schapire, NIPS ‘07
Goal: find a policy performing better than the expert over a class of rewards
Different from “idealized” IRL
47. Imitation Learning via Inverse RL (model-given)
Abbeel & Ng, ICML ‘04; Syed & Schapire, NIPS ‘07
Assumptions: RL oracle, known dynamics
Loop: update reward → run full RL → compare with expert
48. Linear Reward → Feature Expectations
Assume the reward is linear in features: r(s) = w·𝜙(s)
Then the value of a policy can be expressed in terms of its feature expectations
Abbeel & Ng, ICML ‘04; Syed & Schapire, NIPS ‘07
49. Feature Matching → Optimality
Feature expectation: μ(𝜋) = E[∑t γ^t 𝜙(st) | 𝜋]
If the reward r is linear, r = w·𝜙, then V(𝜋) = w·μ(𝜋)
So matching feature expectations matches values: μ(𝜋) = μ(𝜋*) ⇒ V(𝜋) = V(𝜋*)
(We don’t know the true w exactly)
Abbeel & Ng, ICML ‘04; Syed & Schapire, NIPS ‘07
50. Feature Matching in Inverse RL (model-given)
Abbeel & Ng, ICML ‘04
Algorithm: solve a max-margin problem in each loop
● Find reward weights that maximally separate the expert from the current policies
● Stop when the expert's feature expectations are (approximately) matched
Why? Matching feature expectations matches value for any linear reward
Theory: terminates in at most a bounded (polynomial) number of iterations
51. See also...
Maximum Margin Planning – Ratliff et al., ICML 2006
■ Generalizes to multiple MDPs
■ Agnostic about a real underlying MDP or reward function
Experiment: learning to plan based on satellite color imagery
(Figure: train image, learned cost map, test image; mode 1: follow the road, mode 2: “hide” in the trees)
52. See also...
A Game-Theoretic Approach to Apprenticeship Learning – Syed & Schapire, NIPS 2007
■ Restricted to a bounded class of linear rewards
■ Game-theoretic formulation
■ Iteratively learn via Multiplicative Weight Updates
53. Feature Expectation Matching
Want to find 𝜋 s.t. its discounted state visitation features match the expert's
Ill-posed problem → needs regularization
One regularization method: maximum entropy
54. Policy Learning is Still Ambiguous
1. Many reward functions correspond to the same policy
2. Many stochastic mixtures of policies correspond to the
same feature expectation
55. Maximum Entropy Principle
A policy induces a distribution over trajectories, P(𝞽|𝜋)
Feature matching: choose a trajectory distribution whose expected features match the expert's
Maximum entropy principle: the probability distribution which best represents the current state of knowledge is the one with the largest entropy (E.T. Jaynes, 1957)
Information Theory and Statistical Mechanics – E.T. Jaynes – Phys. Rev. 1957
56. Maximum Entropy Principle
Choose trajectory distribution that satisfies feature matching
constraints without over-committing
As uniform as possible
57. Maximum Entropy Principle
The distribution that maximizes entropy subject to linear (feature-matching) constraints is in the exponential family.
Constrained optimization (maximize entropy s.t. feature matching) → via the Lagrangian → an unconstrained objective over the multipliers
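A compact version of that derivation (a standard sketch; w and λ below are the Lagrange multipliers, introduced here rather than taken from the slides):

```latex
% Maximize entropy subject to feature matching and normalization:
\begin{align*}
\max_P \; & -\sum_\tau P(\tau)\log P(\tau)
  \quad \text{s.t.} \quad \sum_\tau P(\tau)\,\phi(\tau) = \bar{\phi}_{\mathrm{expert}},
  \quad \sum_\tau P(\tau) = 1 \\
% Stationarity of the Lagrangian in each P(tau):
& \frac{\partial}{\partial P(\tau)}\Big[-\sum_\tau P\log P
  + w^\top\Big(\sum_\tau P\,\phi - \bar{\phi}\Big)
  + \lambda\Big(\sum_\tau P - 1\Big)\Big] = 0 \\
\Rightarrow \; & P(\tau) \propto \exp\big(w^\top \phi(\tau)\big)
  \quad \text{(exponential family)}
\end{align*}
```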
58. MaxEnt Inverse RL (model-given)
Ziebart et al., AAAI ‘08
MaxEnt formulation: P(𝞽|w) ∝ exp(w·𝜙(𝞽))
Reward inference: maximize the log likelihood of the expert demonstrations under this model
59. MaxEnt Inverse RL (model-given)
Ziebart et al., AAAI ‘08
Dynamics are given by the model, so:
● Gradient descent over the log-likelihood to update the reward
● Solve forward RL w.r.t. the current reward
● Dynamic programming computes the state occupancy measure (visitation frequencies) needed for the gradient
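A sketch of the resulting gradient step on a small tabular MDP; the inputs (`phi`, `expert_feature_avg`, `state_visitation`) are assumed given, and the forward-RL/dynamic-programming computation of visitation frequencies is omitted.

```python
# One gradient step of MaxEnt IRL (tabular sketch; helper inputs are assumptions).
import numpy as np

def maxent_irl_step(w, phi, expert_feature_avg, state_visitation, lr=0.1):
    """phi: (n_states, k) feature matrix; expert_feature_avg: (k,) empirical
    expert feature expectations; state_visitation: (n_states,) occupancy D(s)
    under the current reward r(s) = phi @ w (from forward RL + DP)."""
    grad = expert_feature_avg - state_visitation @ phi   # d logL / d w
    return w + lr * grad                                 # gradient *ascent* on log-likelihood
```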
60. Inverse RL So Far...
Model-given: known dynamics, small MDP
Loop: update reward → run full RL → compare with expert
62. MaxEnt Deep IRL + Model-Free Optimization
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization – Finn et al., ICML ‘16
Recall the MaxEnt formulation; now the reward is parameterized by a neural net
Max likelihood objective: need to sample to evaluate gradients (the partition function is no longer tractable)
63. MaxEnt Deep IRL + Model-Free Optimization
Finn et al., ICML ‘16
Use a “proposal” distribution to sample trajectories
● Approximates the optimal sampling distribution, which depends on the optimal policy w.r.t. the unknown reward
Alternate between reward optimization and policy optimization
64. MaxEnt Deep IRL + Model-Free Optimization
Finn et al., ICML ‘16
Estimate the log likelihood from samples, with a gradient that also depends on the sampling distribution (which itself depends on the current reward)
65. MaxEnt Deep IRL + Model-Free Optimization
■ Generate samples by preventing the policy from changing too rapidly
■ Update policy: take only 1 policy optimization step per reward update
Figures from Chelsea Finn’s presentation
67. IL via Occupancy Measure Matching
Apprenticeship Learning Using Linear Programming – Syed, Bowling & Schapire, ICML ‘08
Occupancy measure ρ𝜋(s,a): (discounted) visitation frequency of (s,a) under 𝜋
Fact: V(𝜋) = ∑s,a ρ𝜋(s,a) r(s,a) for any reward r
68. IL Using Linear Programming (model-given)
Syed, Bowling & Schapire, ICML ‘08
Imitation learning ⇔ occupancy measure matching
Direct imitation learning by solving the dual LP of the original MDP
Dual of IRL: no reward learning needed
69. Occupancy Measure Matching + MaxEnt
Matching occupancy measures matches values for any reward function
Ill-posed occupancy matching → apply the MaxEnt principle again
Again, turn it into an unconstrained optimization
78. A Deep Learning Approach for Generalized Speech Animation
(Figure 1 from the paper: (a) an example input x, a phoneme token sequence aligned to frames for the input speech “PREDICTION”, and (b) the output y for visual speech animation, where each dimension of y corresponds to a parameter of a face model.)
Goal: learn predictor 𝜋: X → Y
Input sequence: X = ⟨x1, …, xT⟩, a phoneme sequence
Output sequence: Y = ⟨y1, …, yT⟩, a sequence of face configurations with each y ∈ R^D
Train using Behavioral Cloning
[Taylor et al., SIGGRAPH 2017]
79. A Deep Learning Approach for Generalized Speech Animation
Sarah Taylor, Taehwan Kim, Yisong Yue et al., SIGGRAPH 2017 https://www.youtube.com/watch?v=9zL7qejW9fE
80. A Deep Learning Approach for Generalized Speech Animation
Sarah Taylor, Taehwan Kim, Yisong Yue et al., SIGGRAPH 2017 https://www.youtube.com/watch?v=9zL7qejW9fE
81. (Figure: phoneme-context decision-tree queries over the target speech “SIGGRAPH”, e.g., “Is the phoneme in the 8th frame a diphthong?”, “Is the phoneme in the 8th frame a semivowel?”, “Is the phoneme in the 3rd frame articulated at the back of the mouth?”)
Full pipeline: Input Audio → Speech Recognition → Speech Animation → Retargeting (e.g., [Sumner & Popovic 2004]; chimp rig courtesy of Hao Li) → Editing
83. Structured Prediction
Learn a mapping from a structured space X to a structured space Y
Typically requires solving an optimization problem at prediction time
● E.g., the Viterbi algorithm for MAP inference in HMMs
84. Structured Prediction
(Figure collage of X → Y examples: conservation reservoir corridors [Daniel Sheldon et al.; Daniel Golovin et al.]; stereo depth estimation via binocular fusion of features observed by the eyes and reconstruction of their 3D preimage, left/right images → perceived depth [Tsukuba]; [Debadeepta Dey et al.]; [Nathan Ratliff et al.]; [Brian Ziebart et al., 2008])
85. Example: Sequence Prediction (Part-of-Speech Tagging)
– Given a sequence of words x
– Predict the sequence of tags y (dynamic programming)
x: The rain wet the cat
y: Det N V Det N
(other candidate outputs y: Adj V V Det V; V V N Adv Det; …)
Goal: Minimize Hamming Loss
86. Goal: Learn Decision Function (inside an Inference/Optimization Procedure)
● Sequentially predict outputs: 𝜋(st) → at ≡ yt
x: The rain wet the cat
y: Det N V Det N
States at successive iterations:
s1 = (x, Ø)
s2 = (x, {Det})
s3 = (x, {Det, N})
…
Interactive feedback via structured labels → supervised learning on (s, a) pairs
87. Data Aggregation (DAgger) for Sequence Labeling (Basic Version)
● Iteration 1: Roll out 𝜋0 (memorize training set)
x: The rain wet the cat
y: Det N V Det N
P1: s1 = (x, Ø), s2 = (x, {Det}), s3 = (x, {Det, N}), …
Expert labels: a1 = Det, a2 = N, a3 = V, …
Train 𝜋1 on P1
88. Data Aggregation (DAgger) for Sequence Labeling (Basic Version)
● Iteration 2: Roll out 𝜋1 (its predictions now contain mistakes)
x: The rain wet the cat
P2: s1 = (x, Ø), s2 = (x, {N}), s3 = (x, {N, N}), …
Expert labels: a1 = Det, a2 = N, a3 = V, …
Train 𝜋2 on P1 and P2
89. Data Aggregation (DAgger) for Sequence Labeling (Basic Version)
● Iteration 3: Roll out 𝜋2 (still makes mistakes)
x: The rain wet the cat
P3: s1 = (x, Ø), s2 = (x, {Det}), s3 = (x, {Det, V}), …
Expert labels: a1 = Det, a2 = N, a3 = V, …
Train 𝜋3 on P1 and P2 and P3
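The three iterations above reduce to a simple data-collection routine; here is a sketch, with a hypothetical `policy(state) -> tag` interface and gold tags standing in for the expert 𝜋*.

```python
# One DAgger iteration for sequence labeling (a sketch, not the tutorial's code).
def collect_iteration(sentences, expert_tags, policy, dataset):
    """Roll out the current policy left-to-right; the expert labels every visited state."""
    for words, gold in zip(sentences, expert_tags):
        predicted = []
        for t in range(len(words)):
            state = (words, tuple(predicted))    # s_t = (x, tags predicted so far)
            dataset.append((state, gold[t]))     # interactive feedback pi*(s_t)
            predicted.append(policy(state))      # continue the rollout with the policy's action
    return dataset
```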
90. (Basic) Ingredients for Structured Prediction
● Deal with Distributional Drift (otherwise just use behavioral cloning)
○ Related approaches include scheduled sampling:
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
[Bengio et al., NIPS 2015]
● Imitation Loss compatible w/ Structured Prediction Loss
○ Hamming Loss → 0/1 per-step loss (Hamming Loss is additive)
○ More generally: reduce to cost-sensitive classification loss
● Efficiency considerations
○ Efficiently Programmable Learning to Search [Daume et al., 2014]
○ Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction
[Sun et al., 2017]
91. Learning Policies for Contextual Submodular Optimization (Example of Cost-Sensitive Reduction)
Stephane Ross et al., ICML 2013
Training set: (x, Fx), where x is the context and Fx a monotone submodular function
Goal: learn 𝜋 that sequentially maps x to an action set A to maximize Fx
Learning reduction: at state s = (x, A), where A is the set of already selected actions, create a cost for each action (the greedy/submodular regret; see the sketch below):
○ Define: fmax = maxa Fx(A + a) – Fx(A)
○ cs(a) = fmax – (Fx(A + a) – Fx(A))
○ Supervised training example: (s, cs)
Can prove convergence to a (1 − 1/e)-optimal policy!
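A sketch of that cost construction; the coverage function and actions below are illustrative placeholders (F must be a monotone submodular set function).

```python
# Cost-sensitive reduction for contextual submodular optimization (sketch).
def action_costs(F, selected, candidates):
    """Return c_s(a) = f_max - (F(A + a) - F(A)) for each candidate action a."""
    gains = {a: F(selected | {a}) - F(selected) for a in candidates}
    f_max = max(gains.values())
    return {a: f_max - g for a, g in gains.items()}   # the greedy-best action gets cost 0

# Example with a simple coverage function (each action covers a set of items):
F = lambda A: len(set().union(*A)) if A else 0
costs = action_costs(F, frozenset(), [frozenset({1, 2}), frozenset({2})])
# -> {frozenset({1, 2}): 0, frozenset({2}): 1}
```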
92. Structured Prediction ≈ Learning to Search
Compile input description x into search problem
Train 𝜋 to effectively navigate search space
High-level supervised reward/cost signals:
● E.g., routing cost, BLEU score, Hamming loss, submodular utility, Jaccard similarity, integer program value, log likelihood of a graphical model, etc.
Expert might be weak (search space exponentially large)
93. Learning to Search in Branch and Bound Algorithms
■ Used to solve Integer Programs
● Two local decisions: ranking and pruning
● Train using solved instances (e.g., by Gurobi)
[He et al., NIPS 2014]
(Figure: a branch-and-bound search tree for the integer program
min −2x − y s.t. 3x − 5y ≤ 0, 3x + 5y ≤ 15, x ≥ 0, y ≥ 0, x, y ∈ Z,
annotated with node expansion order, global lower and upper bounds, the optimal node, fathomed nodes, and the induced ranking/pruning training examples.)
94. Sub-Optimal Expert: Imitation Learning → Reinforcement Learning
■ Query Access to Environment (simple exploration)
● Learning to Search via Retrospective Imitation [Song et al., 2018]
● Learning to Search Better than Your Teacher [Chang et al., ICML 2015]
● Reinforcement and Imitation Learning via Interactive No-Regret Learning [Ross & Bagnell, 2014]
■ More Sophisticated Exploration (also Query Access to Environment)
● Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback [Daume et al., ICLR 2018]
■ Actor-Critic Style Approaches (also Query Access to Environment)
● Truncated Horizon Policy Search: Combining Reinforcement Learning and Imitation Learning [Sun et al., ICLR 2018]
● Sequence Level Training with Recurrent Neural Networks [Ranzato et al., ICLR 2016]
95. Learning to Search Better than Your Teacher
[Chang et al., ICML 2015]
Roll-in: execute policy w/o learning
Roll-out: execute policy for learning
SEARN [Daume et al., 2009]
96. Improving Multi-step Prediction of Learned Time Series Models
[Venkatraman et al., AAAI 2015]
Learn 𝜋 that maps xt to xt+1
● Input is the previous prediction
● A special case of sequence prediction
Can also extend to latent variables:
● Learning to Filter with Predictive State Inference Machines [Sun et al., 2016]
State dynamics: xt+1 = E[f(xt)]
(Figure: (a) forward simulation of the learned model (gray) introduces error at each prediction step compared to the true time series (red); (b) the data itself provides a demonstration of the corrections required to recover → interactive imitation learning.)
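A sketch of the correction idea in the figure above, in the Data-as-Demonstrator style; `model.predict` is a hypothetical interface, not the paper's implementation.

```python
# Data-as-Demonstrator-style sketch for multi-step time-series prediction.
def dad_iteration(model, true_series, dataset):
    """Roll out the learned model on its own predictions, and add training
    pairs that point each predicted state back to the true next state."""
    for series in true_series:            # each series: [x_0, x_1, ..., x_T]
        x_hat = series[0]
        for t in range(len(series) - 1):
            dataset.append((x_hat, series[t + 1]))   # correction toward the truth
            x_hat = model.predict(x_hat)             # continue from own prediction
    return dataset
```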
97. Imitation Learning for Structured Prediction
■ Learn decision function to navigate search space
■ Reduce to interactive imitation learning
● Expert feedback from computational oracle
■ Careful choice of loss function
■ Suboptimal experts
● Due to computational complexity of solving search space
102. Smooth Imitation Learning
Le et al., ICML ‘16
Smooth Imitation by Functional Regularization
Smooth policy class: combine a complex policy class f with a smooth model class g
107. Safe Imitation Learning for Autonomous Driving
Zhang & Cho, AAAI ‘17
A safety classifier predicts whether the learned policy is likely to deviate from the reference (expert) policy; if predicted safe, drive with the learned policy, otherwise fall back to the expert
108. Safe Imitation Learning for Autonomous Driving
Zhang & Cho, AAAI ‘17
SafeDAgger: safety classifier in the loop; only query the expert in “unsafe” states
Results:
■ Reduced number of damages
■ Faster learning (can be viewed as active DAgger)
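The control-gating logic reduces to a few lines; a sketch (the policy and classifier objects are hypothetical stand-ins, not the paper's implementation):

```python
# SafeDAgger-style control gating (sketch).
def act(state, learned_policy, expert_policy, safety_classifier, threshold=0.5):
    """Drive with the learned policy when the safety classifier predicts a small
    deviation from the expert; otherwise fall back to the expert (and query it)."""
    if safety_classifier(state) < threshold:    # predicted deviation is small: "safe"
        return learned_policy(state)
    return expert_policy(state)                 # "unsafe": expert takes over / labels the state
```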
113. One Shot Imitation Learning
Duan et al., NIPS ‘17
(Figure: block-stacking tasks; blocks labeled A–F are arranged into different stacks for the training task vs. the test task.)
114. One Shot Visual Imitation Learning - Transfer
Tobin et al., IROS ‘17
Trained in simulation only
using domain randomization
Test in the real world
115. One Shot Visual Imitation Learning - Transfer
Training entirely in simulation, expert demonstrations through VR
Tobin et al., IROS ‘17
116. Visual Meta-Imitation Learning
Finn et al., ICML ‘17 (Meta Learning, MAML); Finn et al., CoRL ‘17
Fast adaptation update + meta-objective update
Meta-Imitation Training:
1. Sample a task
2. Use one demonstration for the fast adaptation update
3. Use another demonstration for the meta-objective update
117. One Shot Imitation from Watching Videos
Yu et al., RSS ‘18
pick up a novel object and place it into a previously unseen bowl
118. Repeated Inverse Reinforcement Learning
Amin et al., NIPS ‘17
Multiple tasks from multiple environments, presented sequentially
Unknown reward function: latent human preferences plus an external, task-specific component
The human behaves optimally w.r.t. the reward in each task
Goal: learn from demonstrations from as few tasks as possible and generalize to new ones, under two settings:
1. Agent chooses the task actively
2. Nature chooses the task adversarially
124. Generative Multi-agent Behavioral Cloning
Zhan et al., arXiv ‘18
Per-agent VRNNs coordinated by a macro-goal network (MAGnet), trained with a variational lower bound of the form:
E_q[ ∑t=1..T −KL( q(z_t | m_≤t, z_<t) ‖ p(z_t | m_<t, z_<t) ) + log p(m_t | z_≤t, m_<t) ]
During training, weak macro-goal labels m̂ = (m̂1, …, m̂N) are used, where m̂i is the agent's next location
• Generative multi-agent imitation learning: there is no single “correct” action
• Hierarchical: make predictions at multiple resolutions (macro-goals)
125. Multi-modal Imitation Learning: InfoGAIL
InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations – Li et al., NIPS ‘17
Thanks to materials from Stefano Ermon
(Figure: two distinct expert modes, passing left vs. passing right)
127. Multi-modal Imitation Learning: InfoGAIL
Li et al., NIPS ‘17
Thanks to materials from Stefano Ermon
Further regularization: maximize the mutual information between the latent code z and the behavior (s, a)
(Diagram: latent code Z → policy → environment → observed behavior (s, a); maximize the mutual information between Z and (s, a))
128. Multi-modal Imitation Learning: InfoGAIL
Li et al., NIPS ‘17
Thanks to materials from Stefano Ermon
Training using Wasserstein GAN
Mode generation: varying the latent code (Z = 0, 0.5, 1) produces distinct behaviors (e.g., passing right vs. passing left)
129. See also...
■ Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets – Hausman et al., NIPS ‘17
■ Multi-agent Generative Adversarial Imitation Learning – Song et al., arXiv ‘18
131. Cooperative Inverse Reinforcement Learning
Hadfield-Menell et al., NIPS ‘16
How should humans demonstrate the tasks?
Inverse RL assumes demonstrated behavior is (near) optimal, i.e., it accomplishes the task efficiently
This assumption precludes useful teaching behaviors
132. Cooperative Inverse Reinforcement Learning
Hadfield-Menell et al., NIPS ‘16
■ Human (H) and robot (R) play a cooperative game
■ H observes the actual reward signal; R only knows a prior distribution over reward functions
■ H may choose to accept less reward on a particular action in order to convey more information to R
133. See also...
■ An Efficient, Generalized Bellman Update for Cooperative Inverse Reinforcement Learning – Malik et al., ICML 2018
● More efficient POMDP solver
● (Partially) relaxes the assumption of human rationality
134. Showing vs Doing: Teaching by Demonstration
Pedagogical Inverse Reinforcement Learning
Ho et al., NIPS ‘16
Teaching depends on the expert's inferences about the learner's inferences about the demonstration
(Validated with a human experiment)
136. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
How can we most effectively leverage the teacher's feedback?
Hierarchical feedback is more natural (e.g., “corridor → left”, “end of hall → left”)
≠ flat imitation learning / reinforcement learning
137. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
(Flat) imitation learning: feedback effort proportional to the task horizon
Hierarchical imitation learning: feedback at the level of sub-tasks, reducing that effort
138. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
High level: state → macro-action → state → macro-action → … → state
Low level (within each macro-action): state → action → … → state
Key insights:
● Verifying is cheaper than labeling low-level actions
● The expert provides high-level feedback and zooms in to the low level only when needed (weak feedback)
139. Hierarchical Imitation and Reinforcement Learning
Le et al., ICML ‘18
Hybrid imitation-reinforcement: significantly faster learning than pure RL
Hierarchical imitation learning: significantly less expert-costly than flat imitation learning
(Plot: success rate vs. low-level expert cost, 0K–60K; hg-DAgger reaches high success rates at far lower expert cost than flat DAgger.)
140. Learning from Human Preferences
Christiano et al., NIPS ‘17
Preference feedback is more natural to specify than exact numeric rewards
141. Learning from Human Preferences
Christiano et al., NIPS ‘17
Assume a Bradley-Terry preference model: the probability that segment 𝞽1 is preferred over 𝞽2 is
P(𝞽1 ≻ 𝞽2) = exp(∑t r(s1t, a1t)) / [exp(∑t r(s1t, a1t)) + exp(∑t r(s2t, a2t))]
Fit r by minimizing the cross-entropy loss between these predicted probabilities and the human preference labels
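A sketch of that loss in PyTorch; the `reward_net` and trajectory encodings are placeholder assumptions, not the paper's architecture.

```python
# Bradley-Terry preference loss sketch.
import torch
import torch.nn.functional as F

def preference_loss(reward_net, traj1, traj2, human_prefers_1):
    """traj1, traj2: (T, d) tensors of state-action features;
    human_prefers_1: 1.0 if the human preferred segment 1, else 0.0."""
    r1 = reward_net(traj1).sum()        # predicted return of segment 1
    r2 = reward_net(traj2).sum()        # predicted return of segment 2
    logit = r1 - r2                     # P(traj1 > traj2) = sigmoid(r1 - r2)
    target = torch.tensor(human_prefers_1)
    return F.binary_cross_entropy_with_logits(logit, target)
```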
142. See also previous work...
■ Programming by Feedback – Akrour et al., ICML 2014
■ A Bayesian Approach for Policy Learning from Trajectory Preference Queries – Wilson et al., NIPS 2012
● Employed a similar feedback elicitation mechanism to Christiano et al.
■ Learning Trajectory Preferences for Manipulators via Iterative Improvement – Jain et al., NIPS 2013
● Co-active feedback (feedback is only an incremental improvement, not necessarily the optimal result)
143. Interactive Learning from Policy-Dependent Human Feedback (COACH)
MacGlashan et al., ICML ‘17
How do humans train animals?
(Training interface shown to AMT users: reward/punishment for a “good” / “alright” / “bad” virtual dog)
Humans tend to give policy-dependent feedback
144. Interactive Learning from Policy-Dependent Human Feedback (COACH)
MacGlashan et al., ICML ‘17
Observed properties of human feedback:
● Policy-dependent feedback
● Differential feedback
● Diminishing returns
→ Treat human feedback as an approximation to the advantage function; the advantage function is a good feedback model
145. Interactive Learning from Policy-Dependent Human Feedback (COACH)
MacGlashan et al., ICML ‘17
■ Recall the Policy Gradient Theorem
■ Update rule: substitute the (discretized) human feedback f for the advantage term in the policy gradient:
𝛉 ← 𝛉 + α ∇𝛉 log 𝜋 𝛉(a|s) · f
147. See also...
■ Interactively shaping agents via human reinforcement: the TAMER framework – Knox & Stone, K-CAP ‘09
http://www.cs.utexas.edu/~bradknox/TAMER.html
148. Summary
1. Broad research area with many interesting questions
2. Practical approaches to sequential decision making
● Especially vs RL: sample complexity, cost of
experimentation, simulator availability, generalization, etc.
3. Diverse range of applications
149. Imitation Learning
ICML 2018 Tutorial
Q & A
Yisong Yue Hoang M. Le
yyue@caltech.edu hmle@caltech.edu
@YisongYue @HoangMinhLe
yisongyue.com hoangle.info