Imitation Learning

Department of Computer Science and Engineering
IIT Kharagpur
Imitation Learning: Learning to Act like Humans from Humans
SSN College of Engineering – Faculty Development Program Talk
10:45-12:15 IST, 25 Nov 2017
Anirban Santara
santara.github.io

About me
Anirban Santara
• Google India Ph.D. Fellow at IIT Kharagpur (2015-Present)
• Graduate Research Intern at Intel Labs for Autonomous Driving (2017-Present)
• B.Tech. in Electronics and Electrical Communication Engineering from IIT Kharagpur in 2015

Contents
1. Building the motivation
2. Problem definition and Different Approaches to Solution
3. Issues of Safety and Reliability

Description of the Imitation Learning Problem

Imitation Learning
Imitation Learning techniques aim to mimic human behavior at a given task.¹
¹ Hussein, Ahmed, et al. "Imitation Learning: A Survey of Learning Methods." ACM Computing Surveys (CSUR) 50.2 (2017): 21.
Image Source: GRASP lab - University of Pennsylvania

Why should you care?
• Imitation learning methods are rooted in neuroscience and form an important part of learning in humans
• Makes it possible to teach robots complex tasks with minimal expert knowledge of the tasks
• No need for explicit programming or task-specific reward function design
• It's high time!
  • Modern sensors are able to collect and transmit high volumes of data at high speed
  • High performance computing is cheaper, more capable and more ubiquitous than ever
  • Virtual Reality systems, often considered one of the best portals for human-machine interaction, are widely available

Example Application Areas

Autonomous Driving
No more accidents due to human error. No more traffic jams.

Robotic Surgery
Complex Actions in Critical Situations – Accurate. Every time.

Industrial Automation
Efficiency. Precise Quality Control. Safety.

Assistive Robotics
Elderly Care. Rehabilitation. Special Needs.

Conversational Agents
Assistance. Recommendation. Therapy.

Approaches to Solution

A quick primer on Machine Learning
Reference application – Driving a Racing Car
State variables (X):
• Position in track
• Distance from track edges along different directions
• Direction of heading
• Current speed
Action Variables (Y):
• Steering
• Acceleration
• Brake

Comparison of ML paradigms
Supervised Learning
• Would require training examples of the form $\{(X_i, Y_i)\}_{i=1}^{N}$
• where the $Y_i$ are the true/correct actions that must be taken in state $X_i$
Unsupervised Learning
• Works only with the input state information $X_i$
• Does not use any kind of feedback from the environment regarding performance of the agent
Reinforcement Learning
• Requires feedback from the environment in the form of reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of actions being taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, +100 if it finishes the race
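The example reward values for the racing car can be written down as a small function. This is only a sketch: the function name, the boolean event flags, the order in which events are checked, and the default reward of 0 when nothing happens are assumptions made for illustration, not part of the talk.

```python
def racing_reward(made_progress, halted, hit_obstacle, finished):
    """Per-step reward using the example values from the slide.

    The priority order of the checks and the fallback value 0 are
    assumptions; the slide only lists the four reward values.
    """
    if finished:
        return 100   # finishing the race
    if hit_obstacle:
        return -10   # bumping into an obstacle
    if halted:
        return -1    # coming to a halt
    if made_progress:
        return 1     # making progress along the track
    return 0         # assumed default when none of the events occur


# Rewards along a short hypothetical episode: progress, progress, crash.
episode_rewards = [
    racing_reward(True, False, False, False),
    racing_reward(True, False, False, False),
    racing_reward(False, False, True, False),
]
```

Note how sparse the signal is: the agent only learns that something went wrong at the step where the crash happens, not which earlier action caused it.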

Problem Setting
Our agent has to achieve its goal by taking a sequence of actions in an environment whose states change in response to the agent's actions.
[Diagram: the Agent sends an Action to the Environment; the Environment returns a New State to the Agent.]

Mathematical Formulation
Markov Decision Process (MDP)
Imitation Learning problems are often specified in terms of a Markov Decision Process (MDP). An MDP is defined as $\mathcal{M} = (S, A, T, r, \rho_0, \gamma)$
• State Space $S$: Set of all possible states/configurations of the environment
• Action Space $A$: Set of all possible actions
• Transition Probability $T$: maps each state-action pair in $S \times A$ to a distribution over next states; $T(s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$
• Reward function $r: S \times A \to \mathbb{R}$; we write $r(s_t, a_t) = r_t$
• Initial state distribution $\rho_0$; $\rho_0(s) = P(s_0 = s)$
• Temporal discount factor $\gamma$
"Markov" because it assumes:
$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t)$
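To make the tuple concrete, here is a minimal sketch of a finite (tabular) MDP in Python. The class name `TabularMDP`, its methods, and the two-state toy numbers are all hypothetical and chosen only for illustration. Note that `step` looks at nothing but the current state and action, which is exactly the Markov assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Finite MDP (S, A, T, r, rho0, gamma); states and actions are integer indices."""
    T: list        # T[s][a][s_next] = P(s_next | s, a)
    r: list        # r[s][a] = reward for taking action a in state s
    rho0: list     # rho0[s] = P(s_0 = s)
    gamma: float   # temporal discount factor

    def reset(self, rng):
        """Sample an initial state s_0 from rho0."""
        return rng.choices(range(len(self.rho0)), weights=self.rho0)[0]

    def step(self, s, a, rng):
        """Sample s_next from T(s, a) and return (s_next, r(s, a))."""
        probs = self.T[s][a]
        s_next = rng.choices(range(len(probs)), weights=probs)[0]
        return s_next, self.r[s][a]


# Hypothetical two-state, two-action MDP; every number here is made up.
toy_mdp = TabularMDP(
    T=[[[0.9, 0.1], [0.2, 0.8]],
       [[1.0, 0.0], [0.0, 1.0]]],
    r=[[0.0, 1.0], [0.0, 2.0]],
    rho0=[1.0, 0.0],
    gamma=0.9,
)

rng = random.Random(0)
s0 = toy_mdp.reset(rng)            # always state 0, because rho0 = [1, 0]
s1, r0 = toy_mdp.step(s0, 1, rng)  # next state is random; the reward is r[0][1]
```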

Some more definitions
• Policy $\pi: S \to A$: A function that predicts actions for a given state
• Trajectory $\tau$: A sequence of $(s_t, a_t)$ tuples that describes an episode of experiences of an agent as it executes a policy.
$\tau = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots, s_T)$
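A trajectory is what you get by repeatedly applying a policy and letting the environment respond. The sketch below assumes a deterministic environment given as a plain function; `rollout`, `chain_step` and `go_to_three` are hypothetical names used only for this illustration.

```python
def rollout(policy, step, s0, horizon):
    """Run `policy` for `horizon` steps and return (states, actions).

    states  = [s_0, s_1, ..., s_T]      (length horizon + 1)
    actions = [a_0, a_1, ..., a_{T-1}]  (length horizon)
    """
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)      # the policy maps the current state to an action
        s = step(s, a)     # the environment moves to the next state
        actions.append(a)
        states.append(s)
    return states, actions


# Hypothetical toy problem: integer states on a line, actions are moves.
def chain_step(s, a):
    return s + a


def go_to_three(s):
    """Move right until state 3 is reached, then stay put."""
    return 1 if s < 3 else 0


states, actions = rollout(go_to_three, chain_step, s0=0, horizon=4)
# states  -> [0, 1, 2, 3, 3]
# actions -> [1, 1, 1, 0]
```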

Approaches to Imitation Learning
Broad Categories
• Learning from a dataset of expert demonstrations
  • Behavioral Cloning
  • Apprenticeship Learning
• Active learning with an expert

Learning from a Dataset of Expert Demonstrations

Problem Definition
• Given: a dataset of trajectories demonstrated by an expert, $\{\tau_i\}_{i=1}^{N}$, where each trajectory is a sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots, s_T)$
• Goal: Find a policy $\pi^*$ that achieves "expert-like performance"

Behavioral Cloning
Supervised learning of a mapping from states to the expert's actions in those states
[Diagram: the state $x = (x_1, x_2, \ldots, x_n)$ is fed to the Model, which predicts an action. The Loss is a statistical divergence between the predicted action and the expert action $a$, and it is minimized w.r.t. the model parameters.]
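As a minimal sketch of behavioral cloning, the example below fits a linear policy to (state, expert action) pairs. It assumes a one-dimensional state and uses squared error as the divergence, which is just one possible choice; the function name and the demonstration data are hypothetical.

```python
def fit_linear_policy(states, expert_actions):
    """Least-squares fit of a = w * x + b to the expert's (state, action) pairs.

    Squared error stands in for the "statistical divergence" on the slide;
    this is an assumption made to keep the example closed-form.
    """
    n = len(states)
    mean_x = sum(states) / n
    mean_a = sum(expert_actions) / n
    var_x = sum((x - mean_x) ** 2 for x in states)
    cov_xa = sum((x - mean_x) * (a - mean_a)
                 for x, a in zip(states, expert_actions))
    w = cov_xa / var_x
    b = mean_a - w * mean_x
    return lambda x: w * x + b


# Hypothetical demonstrations: state = lateral offset from the track centre,
# action = steering command; this made-up expert steers back with a = -0.5 * x.
demo_states = [-2.0, -1.0, 0.0, 1.0, 2.0]
demo_actions = [1.0, 0.5, 0.0, -0.5, -1.0]

cloned_policy = fit_linear_policy(demo_states, demo_actions)
steering = cloned_policy(4.0)   # query a state outside the demonstrated range
```

The last line queries a state the expert never visited. Here the linear fit happens to extrapolate sensibly, but nothing in the training objective guarantees that.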

Pros and Cons of Behavioral Cloning
• Advantages:
  • Simplicity!
• Drawbacks:
  • Fails to work well with limited data
  • Assumes that observations are i.i.d. and learns to fit single time-step decisions. This leads to the problem of compounding error due to covariate shift: small errors take the agent into states absent from the demonstrations, where further errors become more likely

Apprenticeship Learning

Reinforcement Learning
Reinforcement Learning refers to learning through trial and error using feedback from the environment.
[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward and a New State to the Agent.]

Goal of RL
Find a policy $\pi^*$ that maximizes the expectation of the reward function $R(\tau)$ over trajectories $\tau$:
$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau}[R(\tau)]$
The reward of a trajectory, $R(\tau)$, is a function of all the rewards received in that trajectory,
e.g. $R(\tau) = \sum_t r_t$ or $R(\tau) = \sum_t \gamma^t r_t$
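Both example definitions of the trajectory reward, and a simple sample-average estimate of the expectation, can be sketched as follows. The function names and the reward sequences are hypothetical; the sample average is a plain Monte Carlo estimate, assumed here for illustration.

```python
def trajectory_return(rewards, gamma=1.0):
    """R(tau) = sum_t gamma^t * r_t; gamma = 1.0 gives the undiscounted sum."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_expected_return(reward_sequences, gamma=1.0):
    """Sample average of R(tau) over trajectories collected with one policy."""
    returns = [trajectory_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical per-step rewards from two episodes of the racing example.
crash_episode = [1, 1, -10]
clean_episode = [1, 1, 1]

undiscounted = trajectory_return(crash_episode)           # 1 + 1 - 10
discounted = trajectory_return(crash_episode, gamma=0.5)  # 1 + 0.5 - 2.5
average = estimate_expected_return([crash_episode, clean_episode])
```

With discounting, the late crash weighs less than it does in the plain sum, which is why the choice of $\gamma$ changes which policy looks best.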

Apprenticeship Learning
1. Inverse Reinforcement Learning (IRL): Use the dataset of expert demonstrations to uncover the reward function that the expert is trying to optimize.
  • This reward function is expected to succinctly encode the expert's behavior…
2. Reinforcement Learning (RL): Learn the optimal policy for this recovered reward function using RL.
[Diagram: expert demonstrations → IRL → reward function → RL → optimum policy]

Pros and Cons of Apprenticeship Learning
• Advantages:
  • Does not fit single time-step decisions in isolation, so compounding error is not a problem, unlike in behavioral cloning
• Drawbacks:
  • IRL is computationally expensive because it needs RL to run in an inner loop
  • Scalability issues in large environments
  • Agent needs to act in the environment during learning; this may be unsafe in risk-sensitive applications

Active Learning

Active Learning
In Active Learning the agent is able to query the expert for an optimal action in any given state and use these active samples to improve its policy.
[Diagram: given a state, the agent checks its confidence. If confidence is high, the agent takes its own action. If confidence is low, the agent queries the expert for an action, takes that action, and rectifies its policy.]

Workflow of Active Learning
1. Train the agent by behavioral cloning
2. Deploy the agent in the real world in the presence of an expert
3. Agent queries the expert whenever it is in doubt and rectifies itself
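The query-when-in-doubt step can be sketched as a confidence-gated decision. Everything named here is hypothetical: the lookup-table policy stands in for a behaviorally cloned model, and the "seen this state before" confidence is a deliberately crude placeholder, since robust confidence estimation is the hard part in practice.

```python
def act_with_queries(state, policy_table, expert, confidence, threshold=0.5):
    """One step of the loop: act alone if confident, else ask the expert.

    Returns (action, queried), where `queried` tells whether the expert was asked.
    """
    if confidence(state, policy_table) >= threshold:
        return policy_table[state], False
    action = expert(state)          # low confidence: query the expert
    policy_table[state] = action    # rectify the policy with the new sample
    return action, True


def seen_before(state, policy_table):
    """Placeholder confidence: 1.0 for states the policy already covers."""
    return 1.0 if state in policy_table else 0.0


def expert(state):
    """Made-up expert used only for this example."""
    return "brake" if state == "obstacle_ahead" else "accelerate"


# Stand-in for the result of step 1 (behavioral cloning).
policy_table = {"clear_road": "accelerate"}

first = act_with_queries("obstacle_ahead", policy_table, expert, seen_before)
second = act_with_queries("obstacle_ahead", policy_table, expert, seen_before)
# first  -> ("brake", True):  unfamiliar state, so the expert is queried
# second -> ("brake", False): the rectified policy now handles it alone
```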

Pros and Cons of Active Learning
• Advantages:
• Safe during both training and testing
• Drawbacks:
• Getting robust confidence estimates is tough
• Requires longer supervision of the expert

Issue of Safety

Types of Safety
• Safety during training
• Safety after deployment

Different Approaches to Ensuring Safety
• Vigilance during exploration
  • External Knowledge
    • Prior knowledge
    • Expert demonstration
    • Teacher advice
  • Risk-directed exploration
• Engineering the optimization criterion
  • Worst case criteria
  • Risk-sensitive criteria
  • Constrained criteria

Case study on how to make an existing algorithm safe

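GAIL: Generative Adversarial Imitation Learning
Problem of heavy tail: the distribution of trajectory costs of GAIL agents is often more heavy-tailed than that of the expert (Santara et al. 2017)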

RAIL: Risk-Averse Imitation Learning
Santara et al. 2017. Accepted at Deep Reinforcement Learning Symposium at NIPS 2017
CVaR (Conditional Value-at-Risk) of trajectory risk

Results

Any questions, please?
Scan me to give Anirban feedback

Thank You
