Department of Computer
Science and Engineering
IIT Kharagpur
Imitation Learning: Learning to
Act like Humans from Humans
SSN College of Engineering – Faculty Development Program Talk
10:45-12:15 IST, 25 Nov 2017
Anirban Santara
santara.github.io
About me
Anirban Santara
Google India Ph.D. Fellow at
IIT Kharagpur (2015-Present)
Graduate Research Intern at
Intel Labs for Autonomous
Driving (2017-Present)
B.Tech. in Electronics and
Electrical Communication
Engineering from IIT
Kharagpur in 2015
Contents
1. Building the motivation
2. Problem definition and Different Approaches to Solution
3. Issues of Safety and Reliability
Description of the Imitation
Learning Problem
Imitation Learning
Imitation Learning techniques aim to mimic human behavior at a given task [1].
[1] Hussein, Ahmed, et al. "Imitation Learning: A Survey of Learning Methods." ACM Computing Surveys (CSUR) 50.2 (2017): 21.
Image source: GRASP Lab, University of Pennsylvania
Why should you care?
• Imitation learning methods are rooted in neuroscience and form an
important part of learning in humans
• Makes it possible to teach robots complex tasks with minimal expert
knowledge of the tasks
• No need for explicit programming or task-specific reward function design
• It's high time!
• Modern sensors can collect and transmit high volumes of data at high speed
• High-performance computing is cheaper, more capable and more ubiquitous than ever
• Virtual Reality systems, often considered the best portal for human-machine
interaction, are widely available
Example Application Areas
Autonomous Driving
No more accidents due to human error. No more traffic jams.
Robotic Surgery
Complex Actions in Critical Situations – Accurate. Every time.
Industrial Automation
Efficiency. Precise Quality Control. Safety.
Assistive Robotics
Elderly Care. Rehabilitation. Special Needs.
Conversational Agents
Assistance. Recommendation. Therapy.
Approaches to Solution
A quick primer on Machine Learning
Reference application – Driving a Racing Car
State variables (X):
• Position in track
• Distance from track
edges along different
directions
• Direction of heading
• Current speed
Action Variables (Y):
• Steering
• Acceleration
• Brake
Comparison of ML paradigms
Supervised Learning
• Would require training examples in the form $\{(X_i, Y_i)\}_{i=1}^{N}$
• where $Y_i$ is the true/correct action that must be taken in state $X_i$
Unsupervised Learning
• Works only with the input state information $X_i$
• Does not use any kind of feedback from the environment regarding the performance of the agent
Reinforcement Learning
• Requires feedback from the
environment in the form of
reward signals
• Reward signals might be sparse and delayed
• But they should indicate the quality of the actions being taken by the agent in different states
e.g. +1 if the car makes progress, -1 if it comes to a halt, -10 if it bumps into an obstacle, +100 if it finishes the race
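The example reward signal above can be sketched as a small function. This is only an illustration: the function name, the boolean event flags, the priority order among events, and the default reward of 0 are assumptions, not part of the talk.

```python
def step_reward(made_progress: bool, halted: bool,
                collided: bool, finished: bool) -> float:
    """Reward for one time step, using the example values from the slide.

    The priority order (finish > collision > halt > progress) and the
    default of 0.0 when nothing happens are illustrative assumptions.
    """
    if finished:
        return 100.0
    if collided:
        return -10.0
    if halted:
        return -1.0
    if made_progress:
        return 1.0
    return 0.0


# A car that makes progress for two steps and then hits an obstacle.
rewards = [
    step_reward(True, False, False, False),
    step_reward(True, False, False, False),
    step_reward(False, False, True, False),
]
```

Note how the designer has to pick every number by hand; this is the task-specific reward design that imitation learning tries to avoid.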
Problem Setting
Our Agent has to achieve its
goal by taking a sequence of
actions in an environment
whose states change in
response to the agent’s
actions.
[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a new state.]
Mathematical Formulation
Markov Decision Process (MDP)
Imitation Learning problems are often specified in terms of a Markov Decision Process
(MDP). An MDP is defined as $\mathcal{M} = (S, A, T, r, \rho_0, \gamma)$
• State Space $S$: Set of all possible states/configurations of the environment
• Action Space $A$: Set of all possible actions
• Transition Probability $T$: maps each state-action pair in $S \times A$ to a distribution over next states in $S$; $T(s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$
• Reward function $r: S \times A \to \mathbb{R}$; we write $r(s_t, a_t) = r_t$
• Initial state distribution $\rho_0$; $\rho_0(s) = P(s_0 = s)$
• Temporal discount factor $\gamma \in [0, 1]$
“Markov” because it assumes:
$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t)$
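To make the tuple concrete, here is a minimal sketch of an MDP as a Python data structure with a two-state toy driving problem. The class name `MDP`, the helper `sample`, the state and action names, and all probabilities are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MDP:
    states: List[str]                                   # S
    actions: List[str]                                  # A
    transition: Callable[[str, str], Dict[str, float]]  # T(s, a) -> P(s' | s, a)
    reward: Callable[[str, str], float]                 # r(s, a)
    initial: Dict[str, float]                           # rho_0
    gamma: float                                        # discount factor


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def toy_transition(s: str, a: str) -> Dict[str, float]:
    # The next state depends only on the current (s, a): the Markov property.
    if a == "accelerate":
        return {"on_track": 0.9, "off_track": 0.1}
    return {s: 1.0}  # "brake" leaves the car where it is


def toy_reward(s: str, a: str) -> float:
    return 1.0 if (s == "on_track" and a == "accelerate") else 0.0


toy_mdp = MDP(
    states=["on_track", "off_track"],
    actions=["accelerate", "brake"],
    transition=toy_transition,
    reward=toy_reward,
    initial={"on_track": 1.0},
    gamma=0.99,
)

rng = random.Random(0)
s0 = sample(toy_mdp.initial, rng)                            # s_0 ~ rho_0
s1 = sample(toy_mdp.transition(s0, "accelerate"), rng)       # s_1 ~ T(s_0, a_0)
```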
Some more definitions
• Policy $\pi: S \to A$: A function that predicts an action for a given state
• Trajectory $\tau$: A sequence of $(s_t, a_t)$ tuples that describes an episode of experience
of an agent as it executes a policy.
$\tau = (s_0, a_0, s_1, a_1, \dots, s_t, a_t, \dots, s_T)$
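A trajectory is what you get by repeatedly applying a policy and letting the environment respond. The sketch below assumes, for simplicity, a deterministic environment on integer states; the names `rollout`, `toward_zero` and `add` are hypothetical.

```python
from typing import Callable, List


def rollout(policy: Callable[[int], int],
            step: Callable[[int, int], int],
            s0: int,
            horizon: int) -> List[int]:
    """Return tau = [s_0, a_0, s_1, a_1, ..., s_T] for T = horizon.

    `step` is a deterministic stand-in for sampling from T(s, a).
    """
    tau = [s0]
    s = s0
    for _ in range(horizon):
        a = policy(s)      # a_t = pi(s_t)
        s = step(s, a)     # s_{t+1} from the environment
        tau.extend([a, s])
    return tau


def toward_zero(s: int) -> int:
    """Toy policy: move one unit toward position 0."""
    if s > 0:
        return -1
    if s < 0:
        return 1
    return 0


def add(s: int, a: int) -> int:
    """Toy dynamics: the action is added to the position."""
    return s + a


tau = rollout(toward_zero, add, s0=3, horizon=4)
# tau == [3, -1, 2, -1, 1, -1, 0, 0, 0]
```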
Approaches to Imitation Learning
Broad Categories
Imitation Learning
• Learning from a dataset of expert demonstrations
  • Behavioral Cloning
  • Apprenticeship Learning
• Active learning with an expert
Learning from a Dataset of
Expert Demonstrations
Problem Definition
• Given: a dataset of trajectories demonstrated by an expert, $\{\tau_i\}_{i=1}^{N}$,
where each trajectory is a sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \dots, s_t, a_t, \dots, s_T)$
• Goal: Find a policy $\pi^*$ that achieves “expert-like performance”
Behavioral Cloning
Supervised learning of a mapping from states to the expert’s actions in those states
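[Figure: the state $x = (x_1, x_2, \dots, x_n)$ is fed to the model, which outputs a predicted action. The loss is a statistical divergence between the predicted action and the expert action $a$; it is minimized w.r.t. the model parameters.]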
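A minimal sketch of behavioral cloning as plain supervised regression follows. It assumes a one-dimensional state, a linear model, and mean squared error as the divergence; the synthetic "expert" (steering = -0.5 × offset), the function name and the hyperparameters are all made up for illustration.

```python
from typing import List, Tuple


def fit_linear_policy(states: List[float],
                      expert_actions: List[float],
                      lr: float = 0.1,
                      epochs: int = 500) -> Tuple[float, float]:
    """Fit a(x) = w * x + b to expert (state, action) pairs by gradient
    descent on the mean squared error between predicted and expert actions."""
    w, b = 0.0, 0.0
    n = len(states)
    for _ in range(epochs):
        errors = [w * x + b - a for x, a in zip(states, expert_actions)]
        grad_w = sum(2 * e * x for e, x in zip(errors, states)) / n
        grad_b = sum(2 * e for e in errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic demonstrations: the expert steers against the lateral offset.
states = [-1.0, -0.5, 0.0, 0.5, 1.0]
expert_actions = [-0.5 * x for x in states]

w, b = fit_linear_policy(states, expert_actions)


def cloned_policy(x: float) -> float:
    return w * x + b
```

The learned policy matches the expert on states like those in the dataset; nothing in the training objective says what to do in states the expert never visited.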
Pros and Cons of Behavioral Cloning
• Advantages:
• Simplicity!
• Drawbacks:
• Fails to work well with limited data
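• Assumes that observations are i.i.d. and learns to fit single time-step decisions.
This leads to compounding error due to covariate shift: a small mistake takes the agent to states that are absent from the demonstrations, where it makes further mistakes.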
Apprenticeship Learning
Reinforcement Learning
Reinforcement Learning
refers to learning through
trial and error using
feedback from the
environment.
[Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a reward and a new state.]
Goal of RL
Find a policy $\pi^*$ that maximizes the expectation of the reward function $R(\tau)$ over trajectories $\tau$:
$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau}[R(\tau)]$
The reward of a trajectory, $R(\tau)$, is a function of all the rewards received in that trajectory,
e.g. $R(\tau) = \sum_t r_t$ or $R(\tau) = \sum_t \gamma^t r_t$
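Both example forms of the trajectory reward can be computed with one short function; the undiscounted sum is the special case with a discount factor of 1. The function name `trajectory_return` is an illustrative choice.

```python
from typing import Sequence


def trajectory_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """R(tau) = sum_t gamma^t * r_t; gamma = 1.0 gives the plain sum."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


undiscounted = trajectory_return([1.0, 1.0, 1.0])            # 1 + 1 + 1
discounted = trajectory_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25
```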
Apprenticeship Learning
1. Inverse Reinforcement Learning (IRL): Use the dataset of expert
demonstrations to uncover the reward function that the expert is
trying to optimize.
• This reward function is expected to succinctly encode the expert’s behavior…
2. Reinforcement Learning (RL): Learn the optimal policy for this
recovered reward function using RL.
expert demonstrations → IRL → reward function → RL → optimum policy
Pros and Cons of Apprenticeship Learning
• Advantages:
• Optimizes behavior over whole trajectories rather than single time-step decisions, so compounding error is not a
problem, unlike behavioral cloning
• Drawbacks:
• IRL is computationally expensive because it needs RL to run in an
inner loop
• Scalability issues in large environments
• Agent needs to act in the environment during learning – this may be unsafe in
risk-sensitive applications
Active Learning
Active Learning
In Active Learning, the agent
can query the expert
for an optimal action in any
given state and uses these
active samples to improve its
policy.
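[Figure: given a state, the agent estimates its confidence. If confidence is high, the agent takes its own action. If confidence is low, the agent queries the expert, takes the expert's action, and rectifies its policy.]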
Workflow of Active Learning
1. Train the agent by behavioral cloning
2. Deploy the agent in the real world in the presence of an expert
3. The agent queries the expert whenever it is in doubt and rectifies itself
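The query-when-in-doubt step can be sketched as below. The agent here is a lookup table whose confidence is 1 for states it has seen and 0 otherwise; this confidence measure, the threshold value, and all function names are hypothetical simplifications (obtaining robust confidence estimates for a real model is much harder).

```python
from typing import Dict, Tuple

# Policy learned so far, e.g. from behavioral cloning (toy lookup table).
known: Dict[int, str] = {0: "straight"}


def agent_policy(state: int) -> str:
    return known[state]


def agent_confidence(state: int) -> float:
    """Toy confidence: certain on seen states, clueless otherwise."""
    return 1.0 if state in known else 0.0


def expert_policy(state: int) -> str:
    """Stand-in for the human expert."""
    return "left" if state > 0 else "straight"


def rectify(state: int, action: str) -> None:
    """Update the agent's policy with the expert's answer."""
    known[state] = action


def act(state: int, threshold: float = 0.8) -> Tuple[str, bool]:
    """Return (action, queried_expert)."""
    if agent_confidence(state) >= threshold:
        return agent_policy(state), False   # high confidence: act alone
    action = expert_policy(state)            # low confidence: ask the expert
    rectify(state, action)
    return action, True


first = act(1)    # unseen state: the expert is queried
second = act(1)   # same state again: the agent now acts on its own
```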
Pros and Cons of Active Learning
• Advantages:
• Safe during both training and testing
• Drawbacks:
• Getting robust confidence estimates is tough
• Requires prolonged supervision by the expert
Issue of Safety
Types of Safety
Safety during
training
Safety after
deployment
Different Approaches to Ensuring Safety
• Vigilance during exploration
  • External knowledge
    • Prior knowledge
    • Expert demonstration
    • Teacher advice
  • Risk-directed exploration
• Engineering the optimization criterion
  • Worst-case criteria
  • Risk-sensitive criteria
  • Constrained criteria
Case study on how to make
an existing algorithm safe
GAIL: Generative Adversarial Imitation Learning
Problem of heavy tail
RAIL: Risk-Averse Imitation Learning
Santara et al. 2017. Accepted at Deep Reinforcement Learning Symposium at NIPS 2017
CVaR of trajectory risk
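Conditional Value-at-Risk (CVaR) at level α is, roughly, the expected cost within the worst (1 − α) fraction of outcomes, which makes it sensitive to a heavy tail. The sketch below is a simple empirical estimator over sampled trajectory costs; it is not the RAIL objective itself, and the function name and the sample costs are made up for illustration.

```python
import math
from typing import Sequence


def cvar(costs: Sequence[float], alpha: float = 0.9) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs.

    At least one sample is always included in the tail.
    """
    if not costs:
        raise ValueError("need at least one cost sample")
    ordered = sorted(costs, reverse=True)  # worst (highest) cost first
    # round() guards against floating-point noise such as 3.0000000000000004
    tail_size = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail_size]) / tail_size


trajectory_costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean_cost = sum(trajectory_costs) / len(trajectory_costs)  # 5.5
tail_cost = cvar(trajectory_costs, alpha=0.8)              # mean of {10, 9}
```

Two policies can have the same mean cost and very different tail cost, which is why a risk-averse method looks at the tail as well as the mean.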
Results
Any questions, please?
Thank You
