Reinforcement Learning
for Self Driving Cars
Vinay Sameer Kadi and Mayank Gupta, with Prof. Jeff Schneider
Sponsored by Argo AI
Outline
• Introduction
• Related Work
• Problem Setting
• Past Work in Lab and Challenges
• Experiments
• Future Work and Timeline
Introduction
Setting up the problem
Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can eventually be run on Argo’s driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment [1]
Motivation – Why Reinforcement Learning?
• End-to-end system.
Self-driving cars today are highly modular [2]
Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance
through simulation.
We can simulate rare events to test the robustness of an algorithm [1]
Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance
through simulation.
• Once the algorithm is established, it can scale without rigorous tuning.
If we can run it on one video log, we can run it on any video log!
Related work
A brief literature review
Related work
• Controllable Imitative
Reinforcement Learning
• Trains an imitation learning model and fine-tunes it using DDPG
[3] Liang, Xiaodan, et al. "Cirl: Controllable imitative reinforcement learning for vision-based self-driving." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Related work
• Controllable Imitative
Reinforcement Learning
• Learning by Cheating
• A big improvement
• Uses an abundant supply of “oracle” information to train a large model (ResNet-34)
[1] Chen et al., “Learning by cheating”, Conference on Robot Learning, 2019
Related work
• Controllable Imitative
Reinforcement Learning
• Learning by Cheating
• Learning to drive in a day
• RL on a real car!
• Uses a tiny model (~10k parameters) and a tiny state space (~10 dimensions).
• No dynamic actors
• Uses DDPG
[4] Kendall, Alex, et al. "Learning to drive in a day." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
Key Takeaways
We propose using the Soft Actor Critic (SAC) algorithm for our work.
Existing self-driving RL approaches have tried the DDPG, A3C and DQN algorithms.
Of these, the best performing, DDPG, is extremely brittle to hyperparameters and has exploration issues.
Since we ultimately want to train on driving logs, we need an off-policy algorithm.
Off-policy algorithms are in general more sample efficient.
Problem Setting
A short description of the setup
Problem Setting
• State space – Either encoded image, waypoints or manual
Feature         Value
WP              0.4
Obstacle        1
Traffic Light   0
…               …
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control (see the sketch after this list)
• Scenarios
• Straight
• One Turn
• Navigation
• Navigation with dynamic actors
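The split between the high-level RL action (target speed and steer) and the low-level PID controller can be illustrated with a minimal sketch. The class name, gains, timestep, and the throttle/brake mapping below are illustrative assumptions, not the project’s actual controller.

# Minimal sketch (assumed interface): the RL policy outputs a bounded
# (target_speed, steer) pair; a PID loop converts the speed error into
# throttle/brake commands for the simulator. Gains are placeholders.
class SpeedPID:
    def __init__(self, kp=0.5, ki=0.05, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def control(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Positive output -> throttle, negative -> brake (both clipped to [0, 1]).
        throttle = max(0.0, min(1.0, u))
        brake = max(0.0, min(1.0, -u))
        return throttle, brake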
Past work and Challenges
Previous Work in Lab
• Used Proximal Policy Optimization (PPO)
• State space
• Encoded semantically segmented (SS) images with waypoints (WP)
• Reward
• Speed based reward
• Distance to trajectory
• Collision reward
[5] Agarwal, et al. “Learning to Drive using Waypoints“, NeurIPS 2019 Workshop – ML4AD
Previous Work in Lab
• Excels in Straight, Navigation and One Turn
• Struggles to brake for other cars in Navigation with dynamic actors.
[5] Agarwal, et al. “Learning to Drive using Waypoints“, NeurIPS 2019 Workshop – ML4AD
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Which components need to be improved?
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing solely on RL – Manual Input
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing on representation – Imitation Learning
Decoupling the problem
Input images and data from simulator → Pretrained model → RL algorithm → Reward optimization
Focusing on RL algorithms
Proximal Policy Optimization
• Designed to improve the sample complexity of policy gradient algorithms
• On-policy algorithm
Soft Actor Critic
• Designed to improve exploration using an entropy-maximization framework
• Off-policy algorithm, hence expected to be more sample efficient (objective shown below)
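For reference, the maximum-entropy objective that Soft Actor Critic optimizes can be written in LaTeX as follows, where \alpha is the entropy (temperature) coefficient trading off reward against exploration:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]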
Experiments
Task 1 (Simple)
• Settings:
• Scenario: Navigation without dynamic actors
• State space:
• Mean angle to the next 5 waypoints (computed as sketched below)
• Reward:
• Speed-based reward
• Distance from trajectory
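A minimal sketch of how this single state feature could be computed; the coordinate conventions, data structures, and the function name are illustrative assumptions rather than the project’s actual CARLA feature-extraction code.

import math

def mean_angle_to_waypoints(vehicle_xy, vehicle_yaw, waypoints_xy, n=5):
    """Mean heading error (radians) from the vehicle to the next n route waypoints.

    vehicle_xy: (x, y) position; vehicle_yaw: heading in radians;
    waypoints_xy: upcoming (x, y) waypoints along the planned route.
    """
    angles = []
    for wx, wy in waypoints_xy[:n]:
        bearing = math.atan2(wy - vehicle_xy[1], wx - vehicle_xy[0])
        # Wrap the heading error into [-pi, pi].
        error = (bearing - vehicle_yaw + math.pi) % (2 * math.pi) - math.pi
        angles.append(error)
    return sum(angles) / len(angles) if angles else 0.0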
Task 1 experiments
• PPO performs
better than SAC
Task 1 experiments
• Tried changing buffer size, buffer batch size, target update frequency
• SAC is sensitive to the entropy coefficient
• Tuning the entropy coefficient in SAC
• Tried both a manually set and a learned (auto-tuned) coefficient (see the sketch below)
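The learned-coefficient variant corresponds to SAC’s automatic temperature tuning. The snippet below is a generic sketch of that update (learning rate, target entropy, and variable names are assumptions), not the training code used in these experiments.

import torch

action_dim = 2                         # speed and steer
target_entropy = -float(action_dim)    # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi_batch):
    """log_pi_batch: log pi(a|s) for actions sampled from the current policy."""
    alpha_loss = -(log_alpha * (log_pi_batch + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()      # alpha used in the actor and critic losses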
Hypothesis
• SAC is sensitive to the learned Q values
• If the Q value estimates are biased, there is no guarantee on the learned policy
• Bias can arise for different reasons
• e.g. the observed state space not satisfying the MDP assumption made by TD learning
• Monte-Carlo estimates would be better in such cases
N-step rewards in Q value estimates
• Use N-step rewards to reduce the bias, at the cost of higher variance (see the sketch below)
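A hedged sketch of how an N-step return can replace the 1-step bootstrap in the SAC critic target. The function below is illustrative only: variable names are assumptions, and for simplicity the entropy bonus is applied only to the bootstrap term at step N.

def n_step_soft_target(rewards, dones, bootstrap_q, bootstrap_log_pi,
                       gamma=0.99, alpha=0.2):
    """N-step soft Q target for one sampled sequence.

    rewards, dones: r_t..r_{t+N-1} and their episode-termination flags;
    bootstrap_q: min of the target critics at (s_{t+N}, a'), a' ~ pi;
    bootstrap_log_pi: log pi(a' | s_{t+N}).
    """
    target, discount = 0.0, 1.0
    for r, done in zip(rewards, dones):
        target += discount * r
        discount *= gamma
        if done:                 # episode ended inside the window: no bootstrap
            return target
    # Entropy-regularized bootstrap after N steps.
    return target + discount * (bootstrap_q - alpha * bootstrap_log_pi)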
N-step rewards in SAC - Results
• PPO uses Generalized Advantage Estimation (GAE), shown below
• Analogous to N-step advantage estimation
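For comparison, GAE combines exponentially weighted N-step advantages; in LaTeX, with TD residual \delta_t:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}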
N-step rewards in SAC - Results
• Not using importance sampling corrections for the off-policy N-step returns
• No major effect for small N
• Large N will induce bias again!
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton. “Understanding multi-step deep reinforcement learning: A systematic study of the DQN target”
[7] Hessel, et al. "Rainbow: Combining improvements in deep reinforcement learning." AAAI 2018.
N-step rewards in SAC – Results
Intermediate policy: crashes while turning. Trained policy: completes the route successfully.
Task 2 (Complex)
• Settings:
• Scenario: Navigation with dynamic actors
• State space:
• Mean angle to next 5 waypoints
• Nearest obstacle distance
• Nearest obstacle speed
• Vehicle speed
• Vehicle Steer
• Distance to trajectory
• Distance to goal
• Distance to red light
• Reward
• Speed based reward
• Distance from trajectory
• Collision reward
• Traffic light reward
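A minimal sketch of assembling the Task 2 manual state vector and reward listed above. The dictionary keys, scaling, and reward weights are illustrative assumptions, not the exact values used in the experiments.

import numpy as np

def task2_state(obs):
    """obs: dict of raw simulator measurements (assumed keys)."""
    return np.array([
        obs["mean_angle_next_5_wp"],   # heading error to the route
        obs["nearest_obstacle_dist"],
        obs["nearest_obstacle_speed"],
        obs["vehicle_speed"],
        obs["vehicle_steer"],
        obs["dist_to_trajectory"],
        obs["dist_to_goal"],
        obs["dist_to_red_light"],
    ], dtype=np.float32)

def task2_reward(obs, collided, ran_red_light):
    reward = obs["vehicle_speed"]                    # speed-based reward
    reward -= 0.5 * abs(obs["dist_to_trajectory"])   # deviation penalty (weight is a placeholder)
    if collided:
        reward -= 100.0                              # collision penalty (placeholder)
    if ran_red_light:
        reward -= 50.0                               # traffic-light penalty (placeholder)
    return reward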
Task 2 results
After 4M timesteps: the agent successfully learns to stop when other actors brake.
Future Work
Future Work – Stable RL
• Compare the performance of using a smaller batch size to make training faster
• Current SAC is unstable
Future Work – Stable RL
• Compare the performance of using a smaller batch size to make training faster
• Current SAC is unstable. We hypothesize that this is due to the increased variance of the Q value estimates
• Change the policy loss to introduce an advantage function (see the sketch after this list)
• Similar to baseline usage in policy gradients!
• Use prioritized experience replay to improve sample complexity
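One possible reading of the advantage-based policy loss, as a sketch only: subtract a state-value baseline V(s) from the soft Q value, analogous to the baseline in policy gradients. The function below is our interpretation of the proposal and not an established SAC variant; names and the choice of baseline are assumptions.

import torch

def advantage_policy_loss(q1, q2, v, log_pi, alpha=0.2):
    """Policy loss with an advantage-style target (sketch).

    q1, q2: critic estimates Q(s, a) for a ~ pi; v: a state-value baseline V(s)
    (e.g. a separate value network); log_pi: log pi(a|s).
    Standard SAC minimizes alpha*log_pi - min(q1, q2); subtracting V(s)
    acts like a baseline and is intended to reduce gradient variance.
    """
    advantage = torch.min(q1, q2) - v
    return (alpha * log_pi - advantage).mean()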
Future Work – Image inputs
• Move from a manual state space to an image state space
• Instead of using autoencoder representations, leverage the imitation-learning pretrained model (a sketch follows the caption below)
Channel visualization of the output of the conv layers of LBC’s ResNet-34 [1], trained to drive
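A rough sketch of that plan: freeze the convolutional trunk of a pretrained driving model and feed its pooled features to the RL agent as the state. The torchvision ResNet-34 below is a stand-in for the LBC checkpoint; the actual weights, layer cut, and feature size are assumptions.

import torch
import torch.nn as nn
from torchvision import models

class FrozenConvEncoder(nn.Module):
    """Fixed feature extractor producing RL state vectors from images (sketch)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(pretrained=True)  # stand-in for the LBC-trained ResNet-34
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        for param in self.features.parameters():
            param.requires_grad = False              # keep the representation fixed

    def forward(self, image_batch):                  # (B, 3, H, W) -> (B, 512)
        with torch.no_grad():
            return self.features(image_batch).flatten(1)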
Timeline
September
Stability experiments, Image representations
October
Reducing Sample complexity
November
Final experimentation
References
[1] Chen et al., “Learning by cheating”, CoRL 2019.
[2] Prof. Jeff Schneider’s RI Seminar Talk.
[3] Liang, Xiaodan, et al. “CIRL: Controllable imitative reinforcement learning for vision-based self-driving”, ECCV, 2018.
[4] Kendall, Alex, et al. “Learning to drive in a day”, ICRA, IEEE, 2019.
[5] Agarwal, et al. “Learning to Drive using Waypoints”, NeurIPS 2019 Workshop – ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton. “Understanding multi-step deep reinforcement learning: A systematic study of the DQN target”.
[7] Hessel, et al. “Rainbow: Combining improvements in deep reinforcement learning”, AAAI 2018.


Editor's Notes

• #5 Our problem statement is to train a self-driving agent using RL in the CARLA simulator. In the end, our ultimate goal is to be able to train our algorithm on Argo’s hundreds of hours of driving logs. This might not happen during our capstone, but that is the main idea. We also aim for sample-efficient algorithms, i.e. algorithms that need fewer input data points to train well.
• #6 There are many motivations for using RL for self-driving. The first is that current self-driving cars are highly modular, and each of these modules needs to be fine-tuned for rare, unaccounted-for events. This is engineer-intensive. An end-to-end model, if it can perform well enough, is much more desirable.
• #7 Secondly, we want to have verifiable performance through simulation. Unseen events can be tested this way, for example rain in the video, which was not seen during training for this agent. We can also use experiences in simulation to augment real-world data.
• #8 Lastly, such an algorithm can scale very easily because it is end-to-end and less engineer-intensive.
• #10 This was the first successful deep-RL pipeline for vision-based autonomous driving that outperforms the previous modular pipeline and other imitation learning approaches on driving tasks on the CARLA benchmark.
• #11 A recent paper called Learning by Cheating achieved near-perfect results on the CARLA benchmark. They trained an “oracle” using simulator input and then used it to train a large model, since they now had an abundant supply of “ground truth” data. They write in their paper that an exciting opportunity for future work is to combine the presented ideas with reinforcement learning and train systems that exceed the capabilities of the expert.
• #12 The first real-world application of deep reinforcement learning to autonomous driving. The task is simple lane following and the network is tiny.
• #13 Current self-driving RL work has tried several algorithms. Since in the end we want to train on driving logs, we need an off-policy algorithm like DDPG or SAC. The best such algorithm demonstrated so far is DDPG, but it is very brittle to hyperparameters and has exploration issues, as pointed out by the authors of the Soft Actor Critic paper. So SAC might be a good candidate here. Also, off-policy algorithms are in general more sample efficient, which is one of our goals.
• #15 For the state space, we can have an encoded representation of an RGB or semantically segmented image. We can also have the angle from the trajectory, which is described using waypoints, as an input. Alternatively, we can take data directly from the simulator as input, which we call manual input.
• #20 Previously, the Auton Lab team working on this project used Proximal Policy Optimization over an encoding of top-view semantically segmented images, along with waypoints. They used three reward terms: first, a speed-based reward, where the agent received a reward for each waypoint collected; second, a penalty for deviating from the trajectory; and lastly, a large penalty for collisions.
• #22 There can be two sources of problems: representation or algorithm.
• #26 We had proposed SAC for our problem before, and it is designed very differently from PPO. SAC focuses on… PPO focuses on… Also, SAC is off-policy, which is desirable because we want to train on logs and because off-policy methods tend to be more sample efficient.
  • #29 Explain the plot
• #35 In general, TD works under the implicit assumption that there is an MDP. But if the state space is not fully observable, then the observations in hand cannot satisfy the MDP assumption made by TD (although there is an underlying MDP). So, in such situations, MC estimates are better. In our task, we can never be sure of the state representation, so this modification is highly likely to help for image state spaces too.
• #37–#44, #47–#53 Having an off-policy algorithm with good performance is desirable.