Reinforcement Learning
for Self Driving Cars
Vinay Sameer Kadi and Mayank Gupta, with Prof. Jeff Schneider
Motivation
Self driving cars today
Credits: NVIDIA Drive
Credits: Prof Jeff Schneider – RI Seminar Talk
Motivation
Credits: YouTube
Credits: Prof Jeff Schneider – RI Seminar Talk
Goal – To make self-driving …
• Scalable to new domains.
• Robust to rare long-tail events.
• Verifiable in performance through simulation.
Motivation
• A good policy exists!
Credits: Chen et al.,“Learning by cheating”
(https://arxiv.org/pdf/1912.12294.pdf)
Motivation
• A good policy exists!
• RL should in theory
outperform imitation
learning.
Credits: OpenAI Five
(Berner et al., “Dota 2 with Large Scale Deep Reinforcement Learning”)
Motivation
• Given a good policy, it can be
optimized further every time a
safety driver intervenes.
• RL could, in theory, outperform
human performance.
Credits: Wayve
Types of RL
algorithms
Types of RL algorithms
• On-Policy Algorithms
• Use actions from the current policy to collect training data and update value estimates.
• Off-Policy Algorithms
• Use actions from a separate “behavior” policy to collect training data and update value estimates.
Brief Recap of RL
• Reward - R(s,a)
• State Value Function - V(s)
• State-Action Value Function - Q(s,a)
• Discount Factor - γ
• Tabular Q-Learning
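A minimal sketch of the tabular Q-learning update recapped above; the environment is assumed to expose a gym-style discrete interface with the old 4-tuple step API, and all hyperparameters are illustrative:

```python
import random
from collections import defaultdict


def tabular_q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (gym-style env assumed)."""
    n_actions = env.action_space.n
    Q = defaultdict(float)                       # Q[(state, action)] -> value

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # explore
                action = env.action_space.sample()
            else:                                # exploit the current table
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(n_actions))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```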
Deep Q-Networks (DQN)
• First to use deep neural networks for learning Q functions [1]
• Main contributions:
• Uses target networks
• Uses replay buffer
[1] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
• Pros:
• Off-policy – sample efficient
• Cons:
• Maximization bias
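A hedged sketch of the two contributions listed above, a replay buffer and a target network, written in PyTorch; network interfaces and hyperparameters are illustrative, not the paper's exact setup:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F


class ReplayBuffer:
    """Stores past transitions so the off-policy learner can reuse them."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = zip(*batch)
        to_t = lambda x: torch.as_tensor(x, dtype=torch.float32)
        return to_t(s), torch.as_tensor(a, dtype=torch.int64), to_t(r), to_t(s2), to_t(d)


def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss using a target network that is held fixed between periodic copies."""
    s, a, r, s2, d = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a)
    with torch.no_grad():
        max_next_q = target_net(s2).max(dim=1).values          # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - d) * max_next_q
    return F.smooth_l1_loss(q_sa, target)
```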
Policy Gradients
• Why policy gradients?
• Direct method to compute the optimal policy
• Parametrize policies and optimize them using loss functions [1]
• Advantageous in large/continuous action domains
[1] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
The policy gradient theorem (the slide annotates the state distribution $\mu(s)$, the state-action value function $q_\pi(s,a)$, and the gradient of the policy function $\nabla_\theta \pi(a \mid s, \theta)$):

$$\nabla_\theta J(\theta) \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a \mid s, \theta)$$
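To make the loss-function view concrete, here is a minimal REINFORCE-style loss in PyTorch; `policy_net` and the return computation are assumptions for illustration, not code from the cited text:

```python
import torch


def reinforce_loss(policy_net, states, actions, returns):
    """Monte-Carlo policy gradient loss: - E[ log pi(a|s) * G_t ].

    `policy_net(states)` is assumed to return a torch.distributions object
    (e.g. Categorical) over actions; `returns` are the discounted returns.
    """
    dist = policy_net(states)
    log_probs = dist.log_prob(actions)
    # Negative sign because optimizers minimize; this performs gradient ascent on J(theta)
    return -(log_probs * returns).mean()
```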
Trust Region Policy Optimization
• Pros
• Introduced the idea that a large
shift in policy is bad!
• Thus, reduces sample
complexity.
• Cons
• It is an on-policy algorithm.
Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning, 2015.
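For reference, the trust-region idea can be written as the constrained surrogate objective from the cited paper, where δ bounds how far the new policy may move from the old one:

```latex
\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \, A_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big) \right] \le \delta
```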
Proximal Policy Optimization (PPO)
A_t is the advantage (roughly Q(s,a) − V(s)) and is functionally the same as Q within the expectation
• PPO was an improvement on
TRPO.
• We can rearrange the hard KL
constraint into the softer loss
described here.
• But, their main contribution is…
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
Proximal Policy Optimization (PPO)
• The clipped loss function!
• They clip the probability ratio in the surrogate loss instead of enforcing a KL constraint.
• Good actions cannot yield an outsized improvement, while bad actions still incur at least the full penalty.
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
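A minimal sketch of the clipped surrogate loss described above, written for minimization in PyTorch; tensor shapes and the clipping coefficient are illustrative:

```python
import torch


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (sign flipped so an optimizer can minimize it).

    `advantages` plays the role of A_t; all inputs are per-timestep tensors.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the min keeps good actions from paying off too much,
    # while bad actions still incur the full penalty.
    return -torch.min(unclipped, clipped).mean()
```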
Proximal Policy Optimization (PPO)
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
Actor Critic Algorithms
• What if the gradient estimator in policy gradients has too
much variance?
• What does that mean?
• It takes too many interactions with the environment to learn the optimal policy parameters.
Actor Critic Algorithms
• It turns out that we can reduce this variance using value functions.
• If we have a value estimate for the current state, the gradient estimate becomes less noisy.
• Actor
• Policy network
• Critic
• Value function
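As an illustration of the actor/critic split, here is a minimal advantage actor-critic loss in PyTorch; the `policy_net` and `value_net` interfaces are assumptions for this sketch:

```python
import torch.nn.functional as F


def actor_critic_losses(policy_net, value_net, states, actions, returns):
    """One advantage actor-critic update step (illustrative sketch).

    The critic (value_net) supplies a baseline V(s); subtracting it from the
    return reduces the variance of the policy-gradient estimate.
    """
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()          # A(s,a) ~= G_t - V(s)
    dist = policy_net(states)
    actor_loss = -(dist.log_prob(actions) * advantages).mean()
    critic_loss = F.mse_loss(values, returns)       # critic catches up with the changing policy
    return actor_loss, critic_loss
```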
Soft Actor Critic
• Uses the maximum entropy RL framework [1]
• Uses the clipped double-Q trick to avoid maximization bias
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
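A hedged sketch of the SAC critic target, combining the entropy bonus with the clipped double-Q trick; the network interfaces are assumptions, and the tanh-squashing correction from the paper is omitted for brevity:

```python
import torch


def soft_q_target(q1_target, q2_target, policy_net, rewards, next_states,
                  dones, gamma=0.99, alpha=0.2):
    """SAC-style target: clipped double-Q plus an entropy bonus (alpha is the temperature)."""
    with torch.no_grad():
        dist = policy_net(next_states)
        next_actions = dist.rsample()                     # reparameterized sample
        log_prob = dist.log_prob(next_actions).sum(-1)    # per-dimension log-probs assumed
        # Clipped double-Q: take the minimum of two target critics to curb maximization bias
        q_next = torch.min(q1_target(next_states, next_actions),
                           q2_target(next_states, next_actions)).squeeze(-1)
        # Maximum-entropy target: reward plus the soft value of the next state
        return rewards + gamma * (1.0 - dones) * (q_next - alpha * log_prob)
```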
Soft Actor Critic
• Advantages:
• Off-policy algorithm
• Exploration is handled inherently
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Credits: BAIR
Soft Actor Critic
Performance
[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Experimental Setting (Past Work)
• State Space
• Semantically segmented bird’s-eye-view images
• An autoencoder is then trained on them.
• Waypoints!
• Action Space
• Speed – continuous, controlled using a simulated PID controller
• Steering angle – continuous
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
Experimental Setting (Past Work)
• Inputs include waypoint features describing the route to follow
• Uses the CARLA Simulator
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
Experimental Setting (Past Work)
• Rewards (a sketch combining these terms follows below)
• (Speed Reward)
• Assuming we are following the waypoints, this rewards reducing the distance to the goal
• (Deviation Penalty)
• Penalizes deviating from the trajectory / waypoints
• (Collision Penalty)
• Avoid collisions; even if a collision is unavoidable, collide at low speed
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
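The three reward terms above might be combined as in the following sketch; the weights and exact functional forms are hypothetical, not the values used in the past work:

```python
def waypoint_driving_reward(speed, dist_to_goal_delta, lateral_deviation,
                            collided, w_speed=1.0, w_dev=0.5, w_col=100.0):
    """Illustrative combination of the speed, deviation, and collision terms."""
    speed_reward = w_speed * dist_to_goal_delta       # progress along the waypoints
    deviation_penalty = -w_dev * lateral_deviation    # stay close to the trajectory
    # Scale the collision penalty by speed so that unavoidable collisions happen slowly
    collision_penalty = -w_col * speed if collided else 0.0
    return speed_reward + deviation_penalty + collision_penalty
```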
Complete Pipeline
(Past Work)
• AE Network for state
representation
• Shallow policy network
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
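A rough sketch of this pipeline; layer sizes, the waypoint feature dimension, and the frozen-encoder assumption are illustrative:

```python
import torch
import torch.nn as nn


class WaypointPolicy(nn.Module):
    """Pretrained autoencoder encoder embeds the bird's-eye-view image; a shallow
    head maps the embedding plus waypoint features to the two continuous actions."""

    def __init__(self, encoder, latent_dim=128, waypoint_dim=10, action_dim=2):
        super().__init__()
        self.encoder = encoder              # pretrained AE encoder (assumed frozen)
        self.head = nn.Sequential(
            nn.Linear(latent_dim + waypoint_dim, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),      # -> [target speed, steering angle]
        )

    def forward(self, bev_image, waypoint_features):
        with torch.no_grad():               # state representation is not fine-tuned
            z = self.encoder(bev_image)
        return self.head(torch.cat([z, waypoint_features], dim=-1))
```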
Past Work
Good Navigation in
Empty Lanes
Crashes with
stationary cars
Credits: Learning to Drive using Waypoints
(Tanmay Agarwal, Hitesh Arora, Tanvir Parhar et al.)
• Uses PPO at the
moment
• DQN is being
tried
• We want to use
SAC for this task
Future Work
• Next steps –
• Focus on settings with dynamic actors.
• Improve exploration in the current settings using SAC.
• Train in dense environments, possibly also through self-play RL.
Thank you
Hit us with questions!
We’d appreciate any useful suggestions.

Editor's Notes

  1. NOTES : Self driving cars today process sensor inputs through tens of learned systems. NEXT Existing approach works well and will result in fully autonomous cars eventually, at least in specific domains. But we need a less engineer intensive version of the current approach. Less engineering heavy – Tail scenarios. Also, transferability across cities. Perception, Prediction, Mapping and Localization, Planning – All include several sub systems that do various tasks. Assuming Perception is solved we want to use RL to remove other components.
  2. NOTES : Self driving cars today process sensor inputs through tens of learned systems. NEXT Existing approach works well and will result in fully autonomous cars eventually, at least in specific domains. But we need a less engineer intensive version of the current approach. Less engineering heavy – Tail scenarios. Also, transferability across cities. Perception, Prediction, Mapping and Localization, Planning – All include several sub systems that do various tasks. Assuming Perception is solved we want to use RL to remove other components.
  3. Learning by cheating achieved 100% performance on CARLA's benchmark recently. This shows that a good policy exists. NEXT – how it works: needs expert driving trajectories.
  4. RL has repeatedly shown itself to be capable of outperforming humans on highly complex tasks with large branching factors Branching factor is 10^4. Chess has 35 and Go has 250. However, transferring this performance to the real world in a noisy environment is a big challenge for RL.
  5. This is the first car to learn self driving using reinforcement learning. Explain Points on Screen.
  6. Model-based RL methods can have error propagation. It is difficult to fit a model to the real world, unlike in chess or AlphaGo. So, we want model-free RL. Two major classes: either optimize the policy or the value function, or use a method that uses both.
  7. TO EXPLAIN Off Policy vs On Policy methods Off policy - Advantages of replay buffers in highly correlated data Off policy methods are more sample efficient as they can reuse past experiences for training later on Important experiences can be saved and reused later for training In Green – Off Policy, In Purple – On Policy
  8. As we learned in 10-601
  9. As we learned in 10-601: 1) Tabular Q-learning works well when the state space is finite. But as the state space grows larger, we need to turn to function approximation, which takes state and action as input and returns a Q value.  2) We update the parameters until we get all the Q values of state-action pairs correct. 3) But the update equation assumes i.i.d. data, while in RL tasks the states are correlated. One of the major contributions is that they use a replay buffer to break the correlations between samples. 4) Also, the target changes (non-stationary) here, which causes instability in learning. Target networks hold the parameters fixed and avoid the changing-targets problem. 5) But still one problem remains. The target is estimated. What if the target is wrong? It leads to maximization bias.
  10. 1) Q-learning is good for discrete actions. But what if the action space is large/continuous? 2) Policy gradients are an alternative to Q-learning. In Q-learning, we first fit a Q function and then derive a policy. Policy gradients directly learn policies by parametrizing them and updating the parameters according to a loss function. 3) We don't want to go too much into the math, so the final gradient of the loss w.r.t. the policy parameters looks like …  4) q_pi is estimated from experience. 5) What happens is that we start with some parameters and a state, take an action, collect the reward and next state, and update … (interaction with the environment).
  11. TO EXPLAIN Takes optimal steps compared to Policy Gradients Maximizes Expected Value due to NEW POLICY but old value function. Denominator is due to importance sampling. q(a|s) as the old policy. BUT we do it with a constraint. Forms a “TRUST REGION” with the help of the KL Divergence.
  12. We can rearrange the constraint to form a loss. A_t is like Q(s,a) – V(s). But their main objective is L_CLIP. R is the ratio in the above function. On the left, it doesn't get too confident in its update and thus clips the loss from above. On the right, if the policy becomes worse, it reverts the changes, proportionately more so if the loss is even worse.
  13. We can rearrange the constraint to form a loss. A_t is like Q(s,a) – V(s). But their main objective is L_CLIP. R is the ratio in the above function. On the left, it doesn't get too confident in its update and thus clips the loss from above. On the right, if the policy becomes worse, it reverts the changes, proportionately more so if the loss is even worse.
  14. We can rearrange the constraint to form a loss. A_t is like Q(s,a) – V(s). But their main objective is L_CLIP. R is the ratio in the above function. On the left, it doesn't get too confident in its update and thus clips the loss from above. On the right, if the policy becomes worse, it reverts the changes, proportionately more so if the loss is even worse.
  15. We want to learn the optimal policy using a minimum number of interactions with the environment.
  16. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates. --> The basic idea is that if we know something about the state, the variance can be reduced. 1) The actor uses the critic to update itself. 2) The critic improves itself to catch up with the changing policy. These keep going and complement each other until they converge.
  17. 1) Maximum entropy framework – a balance between exploring and collecting rewards; the value function definition needs to change. 2) Improves the critic by using the clipped double-Q trick.
  18. CAR EXAMPLE
  19. 1) SAC belongs to the class of actor-critic algorithms. 2) Before SAC, the major effort to reduce sample complexity was DDPG, but it is brittle and uses deterministic policies. 3) Maximum entropy framework – a balance between exploring and collecting rewards; the value function definition needs to change. 4) Improves the critic by using the clipped double-Q trick.
  20. 1) SAC belongs to the class of actor-critic algorithms. 2) Before SAC, the major effort to reduce sample complexity was DDPG, but it is brittle and uses deterministic policies. 3) Maximum entropy framework – a balance between exploring and collecting rewards; the value function definition needs to change. 4) Improves the critic by using the clipped double-Q trick.
  21. NOTES: Experiment Setting - Problem Statement State Space (AE on Semantically Segmented images generated by CARLA) Action Space (Speed and Steer)
  22. NOTES: Experiment Setting - Problem Statement State Space (AE on Semantically Segmented images generated by CARLA) Action Space (Speed and Steer) Rewards (Ask Hitesh) Training and Testing Towns 4 Test Scenarios - Each has several test cases Photos for everything
  23. NOTES: Experiment Setting - Problem Statement State Space (AE on Semantically Segmented images generated by CARLA) Action Space (Speed and Steer) Rewards (Ask Hitesh) Training and Testing Towns 4 Test Scenarios - Each has several test cases Photos for everything
  24. To explain: the policy input could include the current speed and steer; the encoder-decoder could use a stack of frames.