1. Department of Computer
Science and Engineering
IIT Kharagpur
Imitation Learning: Learning to
Act like Humans from Humans
SSN College of Engineering – Faculty Development Program Talk
10:45-12:15 IST, 25 Nov 2017
Anirban Santara
santara.github.io
2. About me
Anirban Santara
• Google India Ph.D. Fellow at IIT Kharagpur (2015-Present)
• Graduate Research Intern at Intel Labs for Autonomous Driving (2017-Present)
• B.Tech. in Electronics and Electrical Communication Engineering from IIT Kharagpur in 2015
3. Contents
1. Building the motivation
2. Problem definition and Different Approaches to Solution
3. Issues of Safety and Reliability
5. Imitation Learning
Imitation Learning techniques aim to mimic human behavior at a given task [1]
[1] Hussein, Ahmed, et al. "Imitation Learning: A Survey of Learning Methods." ACM Computing Surveys (CSUR) 50.2 (2017): 21.
Image Source: GRASP lab - University of Pennsylvania
6. Why should you care?
• Imitation learning methods are rooted in neuroscience and form an important part of learning in humans
• Makes it possible to teach robots complex tasks with minimal expert knowledge of the tasks
  • No need for explicit programming or task-specific reward function design
• It's high time!
  • Modern sensors are able to collect and transmit high volumes of data at high speed
  • High performance computing is cheaper, more capable and more ubiquitous than ever
  • Virtual Reality systems, which are considered the best portal of human-machine interaction, are widely available
14. A quick primer on Machine Learning
Reference application – Driving a Racing Car
State variables (X):
• Position in track
• Distance from track edges along different directions
• Direction of heading
• Current speed
Action Variables (Y):
• Steering
• Acceleration
• Brake
15. Comparison of ML paradigms
Supervised Learning
• Would require training examples in the form $\{(X_i, Y_i)\}_{i=1}^{N}$
• Where $Y_i$ are the true/correct actions that must be taken in state $X_i$
Unsupervised Learning
• Works only with the input state information $X_i$
• Does not use any kind of feedback from the environment regarding performance of the agent
Reinforcement Learning
• Requires feedback from the environment in the form of reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of actions being taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, 100 if it finishes the race
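The example reward signal above could be written as a small function. This is only a sketch: the function name, the boolean event flags, the priority given to events that happen in the same time step, and the 0 returned when nothing happens are all assumptions, not part of the talk.

```python
def racing_reward(made_progress: bool, halted: bool,
                  hit_obstacle: bool, finished: bool) -> int:
    """Reward for one time step of the racing-car example.

    The values (+1, -1, -10, 100) come from the slide; checking the
    events in this order is an assumption made for illustration.
    """
    if finished:
        return 100   # finished the race
    if hit_obstacle:
        return -10   # bumped into an obstacle
    if halted:
        return -1    # came to a halt
    if made_progress:
        return 1     # made progress along the track
    return 0         # assumption: nothing notable happened


# Rewards are sparse: most steps give +1 and the large payoff comes last.
step_rewards = [
    racing_reward(True, False, False, False),
    racing_reward(False, True, False, False),
    racing_reward(False, False, True, False),
    racing_reward(True, False, False, True),
]
```

Designing such a function by hand for every task is exactly the effort that imitation learning tries to avoid.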
16. Problem Setting
Our Agent has to achieve its
goal by taking a sequence of
actions in an environment
whose states change in
response to the agent’s
actions.
[Diagram: the Agent sends an Action to the Environment; the Environment returns a New State to the Agent]
17. Mathematical Formulation
Markov Decision Process (MDP)
Imitation Learning problems are often specified in terms of a Markov Decision Process (MDP). An MDP is defined as $\mathcal{M} = (S, A, T, r, \rho_0, \gamma)$
• State Space $S$: Set of all possible states/configurations of the environment
• Action Space $A$: Set of all possible actions
• Transition Probability $T: S \times A \to S$; $T(s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$
• Reward function $r: S \times A \to \mathbb{R}$; we write $r(s_t, a_t) = r_t$
• Initial state distribution $\rho_0$; $\rho_0(s) = P(s_0 = s)$
• Temporal discount factor $\gamma$
“Markov” because it assumes:
$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t)$
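To make the tuple concrete, here is a minimal sketch of a tabular MDP in Python. The two-state "track" example, its probabilities and rewards, and the names `MDP`, `sample` and `step` are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list    # S
    actions: list   # A
    T: dict         # T[(s, a)] -> {next_state: probability}
    r: dict         # r[(s, a)] -> reward
    rho0: dict      # rho0[s] -> probability that s0 = s
    gamma: float    # temporal discount factor


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(mdp, s, a, rng):
    """Markov property: the next state depends only on (s, a)."""
    return sample(mdp.T[(s, a)], rng), mdp.r[(s, a)]


# Hypothetical toy numbers, not taken from the talk.
toy = MDP(
    states=["on_track", "off_track"],
    actions=["steer", "coast"],
    T={
        ("on_track", "steer"): {"on_track": 0.9, "off_track": 0.1},
        ("on_track", "coast"): {"on_track": 0.6, "off_track": 0.4},
        ("off_track", "steer"): {"on_track": 0.7, "off_track": 0.3},
        ("off_track", "coast"): {"on_track": 0.0, "off_track": 1.0},
    },
    r={
        ("on_track", "steer"): 1.0,
        ("on_track", "coast"): 1.0,
        ("off_track", "steer"): -1.0,
        ("off_track", "coast"): -1.0,
    },
    rho0={"on_track": 1.0},
    gamma=0.99,
)

rng = random.Random(0)
s0 = sample(toy.rho0, rng)
s1, r0 = step(toy, s0, "steer", rng)
```

Note that `step` never looks at the history before `s`, which is the Markov assumption stated above.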
18. Some more definitions
• Policy $\pi: S \to A$: A function that predicts actions for a given state
• Trajectory $\tau$: A sequence of $(s_t, a_t)$ tuples that describe an episode of experiences of an agent as it executes a policy.
$\tau = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots, s_T)$
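A trajectory can be generated by repeatedly applying a policy and a transition function. The sketch below assumes a deterministic toy environment (an integer offset from the track centre) and a hand-written policy; both, along with the name `rollout`, are hypothetical.

```python
def rollout(policy, transition, s0, horizon):
    """Return tau = [s0, a0, s1, a1, ..., s_T] for T = horizon steps."""
    tau = [s0]
    s = s0
    for _ in range(horizon):
        a = policy(s)          # the policy maps a state to an action
        s = transition(s, a)   # the environment moves to a new state
        tau.extend([a, s])
    return tau


def centering_policy(offset):
    """Hypothetical policy: steer one unit back toward the centre (offset 0)."""
    if offset > 0:
        return -1
    if offset < 0:
        return 1
    return 0


def shift(offset, action):
    """Hypothetical deterministic transition: the action shifts the offset."""
    return offset + action


tau = rollout(centering_policy, shift, s0=3, horizon=4)
# tau == [3, -1, 2, -1, 1, -1, 0, 0, 0]
```

The list alternates states and actions and ends with the final state, matching the form of $\tau$ above.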
19. Approaches to Imitation Learning
Broad Categories
Imitation Learning
• Learning from a dataset of expert demonstrations
  • Behavioral Cloning
  • Apprenticeship Learning
• Active learning with an expert
21. Problem Definition
• Given: a dataset of trajectories demonstrated by an expert: $\{\tau_i\}_{i=1}^{N}$, where each trajectory is a sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots, s_T)$
• Goal: Find a policy $\pi^*$ that achieves “expert-like performance”
22. Behavioral Cloning
Supervised learning of a mapping from states to the expert’s actions in those states
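[Diagram: the state $x = (x_1, x_2, \ldots, x_n)$ is fed to the Model, which outputs an action. The Loss is a statistical divergence between the model's action and the expert action $a$. Minimize this loss w.r.t. the model parameters.]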
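As a minimal sketch of behavioral cloning, the code below fits a one-dimensional linear policy to expert (state, action) pairs by least squares. Squared error stands in for the "statistical divergence" in the diagram, and the demonstration data, the linear model, and the function names are assumptions made for illustration.

```python
def fit_linear_policy(states, actions):
    """Closed-form least squares for action = w * state + b."""
    n = len(states)
    mean_x = sum(states) / n
    mean_a = sum(actions) / n
    var_x = sum((x - mean_x) ** 2 for x in states)
    cov_xa = sum((x - mean_x) * (a - mean_a) for x, a in zip(states, actions))
    w = cov_xa / var_x
    b = mean_a - w * mean_x
    return w, b


# Hypothetical demonstrations: state = lateral offset from the track centre,
# action = steering command. The "expert" here steers with a = -0.5 * x.
demo_states = [-2.0, -1.0, 0.0, 1.0, 2.0]
demo_actions = [1.0, 0.5, 0.0, -0.5, -1.0]

w, b = fit_linear_policy(demo_states, demo_actions)


def cloned_policy(x):
    """Policy learned purely from the expert's state-action pairs."""
    return w * x + b
```

The fit uses each (state, action) pair independently, which is the single-time-step, i.i.d. treatment of the data that behavioral cloning relies on.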
23. Pros and Cons of Behavioral Cloning
• Advantages:
• Simplicity!
• Drawbacks:
• Fails to work well with limited data
• Assumes that observations are i.i.d. and learns to fit single time-step decisions. This leads to the problem of compounding error due to covariate shift
25. Reinforcement Learning
Reinforcement Learning
refers to learning through
trial and error using
feedback from the
environment.
[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward and a New State to the Agent]
26. Goal of RL
Find a policy $\pi^*$ that maximizes the expectation of the reward function $R(\tau)$ over trajectories $\tau$:
$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau}[R(\tau)]$
Reward of a trajectory $R(\tau)$ is a function of all the rewards received in a trajectory,
e.g. $R(\tau) = \sum_t r_t$ or $R(\tau) = \sum_t \gamma^t r_t$
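The two example forms of the trajectory reward can be computed as follows. This is a small sketch; the function names are hypothetical, the sum is assumed to start at $t = 0$, and the reward sequence reuses the illustrative values from the racing example.

```python
def undiscounted_return(rewards):
    """R(tau) = sum over t of r_t."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """R(tau) = sum over t of gamma**t * r_t, with t starting at 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# progress, progress, bump into an obstacle, finish the race
rewards = [1, 1, -10, 100]

plain = undiscounted_return(rewards)          # 1 + 1 - 10 + 100 = 92
discounted = discounted_return(rewards, 0.5)  # 1 + 0.5 - 2.5 + 12.5 = 11.5
```

With $\gamma < 1$ the large final reward counts for less, so the discounted form favours policies that collect reward early.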
27. Apprenticeship Learning
1. Inverse Reinforcement Learning (IRL): Use the dataset of expert demonstrations to uncover the reward function that the expert is trying to optimize.
• This reward function is expected to succinctly encode the expert’s behavior…
2. Reinforcement Learning (RL): Learn the optimal policy for this recovered reward function using RL.
[Diagram: expert demonstrations → IRL → reward function → RL → optimum policy]
28. Pros and Cons of Apprenticeship Learning
• Advantages:
• Does not take single time-step decisions, so compounding error is not a problem, unlike in behavioral cloning
• Drawbacks:
• IRL is computationally expensive because it needs RL to run in an inner loop
• Scalability issues in large environments
• Agent needs to act in the environment during learning; this may be unsafe in risk-sensitive applications
30. Active Learning
In Active Learning the agent
is able to query the expert
for an optimal action in any
given state and use these
active samples to improve its
policy
[Flowchart: for a given state, the agent checks its confidence. High: the agent takes its own action. Low: the agent queries the expert for an action, takes that action, and rectifies its policy.]
31. Workflow of Active Learning
1. Train the agent by behavioral cloning
2. Deploy the agent in the real world in presence of an expert
3. Agent queries the expert whenever it is in doubt and rectifies itself
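The workflow can be sketched as a confidence-gated loop. Everything below is a toy assumption: the lookup-table agent, its binary confidence (1.0 for states it has seen, 0.0 otherwise), the threshold, and the scripted expert are hypothetical stand-ins, since obtaining robust confidence estimates is the hard part in practice.

```python
class TableAgent:
    """Toy agent that memorises state -> action pairs."""

    def __init__(self, demonstrations):
        # Step 1: "behavioral cloning" reduced to memorising the demos.
        self.table = dict(demonstrations)

    def confidence(self, state):
        # Hypothetical confidence: certain on seen states, clueless otherwise.
        return 1.0 if state in self.table else 0.0

    def act(self, state):
        return self.table[state]

    def rectify(self, state, expert_action):
        # Fold the expert's answer back into the policy.
        self.table[state] = expert_action


def expert(state):
    """Scripted expert: steer one unit back toward offset 0."""
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0


def run_with_expert(agent, expert_fn, states, threshold=0.5):
    """Steps 2 and 3: act when confident, otherwise query and rectify."""
    actions, n_queries = [], 0
    for s in states:
        if agent.confidence(s) >= threshold:
            a = agent.act(s)
        else:
            a = expert_fn(s)      # query the expert
            agent.rectify(s, a)   # and correct the policy
            n_queries += 1
        actions.append(a)
    return actions, n_queries


agent = TableAgent({0: 0, 1: -1})
actions, n_queries = run_with_expert(agent, expert, states=[1, 2, 2, -1, 0])
# actions == [-1, -1, -1, 1, 0]; the expert was queried for states 2 and -1
```

In this toy run the second visit to state 2 needs no query, which is the sense in which the agent rectifies itself.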
32. Pros and Cons of Active Learning
• Advantages:
• Safe during both training and testing
• Drawbacks:
• Getting robust confidence estimates is tough
• Requires longer supervision by the expert
37. GAIL: Generative Adversarial Imitation Learning
Problem of heavy tail
38. RAIL: Risk-Averse Imitation Learning
Santara et al. 2017. Accepted at Deep Reinforcement Learning Symposium at NIPS 2017
CVaR of trajectory risk
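For readers unfamiliar with CVaR (Conditional Value at Risk), the sketch below computes a generic empirical estimate: the mean of the worst $(1 - \alpha)$ fraction of trajectory costs. It is a standard textbook estimator shown for intuition only, not necessarily the exact formulation used in RAIL; the function name and the cost values are hypothetical.

```python
import math


def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of costs (higher cost = worse)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    n = len(costs)
    # Number of tail samples; round() guards against floating-point noise
    # such as (1 - 0.7) * 10 evaluating to 3.0000000000000004.
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    worst = sorted(costs, reverse=True)[:k]
    return sum(worst) / k


# Hypothetical per-trajectory costs with one catastrophic outlier (heavy tail).
costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

mean_cost = sum(costs) / len(costs)   # 14.5
cvar_80 = empirical_cvar(costs, 0.8)  # mean of the worst 2 costs: (100 + 9) / 2 = 54.5
```

The mean hides the rare catastrophic trajectory, while CVaR is dominated by it, which is why a tail-risk measure is relevant to the heavy-tail problem noted for GAIL.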