Reinforcement Learning
for Self Driving Cars
Vinay Sameer Kadi and Mayank Gupta, with Prof. Jeff Schneider
Sponsored by Argo AI
Outline
• Introduction
• Related Work
• Problem Setting
• Past Work in Lab and Challenges
• Experiments
• Future Work and Timeline
Introduction
Setting up the problem
Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can eventually be run on Argo’s driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment [1]
Motivation – Why Reinforcement Learning?
• End-to-end system.
Self-driving cars today are highly modular [2]
Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance
through simulation.
We can simulate rare events to test the robustness of an algorithm [1]
Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance
through simulation.
• Once the algorithm is established, it can scale without rigorous tuning.
If we can run it on one video log, we can run it on any video log!
Related work
A brief literature review
Related work
• Controllable Imitative
Reinforcement Learning
• Trains an imitation learning model and fine-tunes it using DDPG
[3] Liang, Xiaodan, et al. "Cirl: Controllable imitative reinforcement learning for vision-based self-driving." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Related work
• Controllable Imitative
Reinforcement Learning
• Learning by Cheating
• A big improvement
• Uses an abundant supply of “oracle” information to train a large model (ResNet-34)
[1] Chen et al., “Learning by cheating”, Conference on Robot Learning, 2019
Related work
• Controllable Imitative
Reinforcement Learning
• Learning by Cheating
• Learning to drive in a day
• RL on a real car!
• Uses a tiny model (~10k parameters) and a tiny state space (~10 dimensions).
• No dynamic actors
• Uses DDPG
[4] Kendall, Alex, et al. "Learning to drive in a day." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
Key Takeaways
We propose using the Soft Actor Critic (SAC) algorithm for our work.
Existing self-driving RL approaches have tried the DDPG, A3C and DQN algorithms.
Of these, the best performing, DDPG, is extremely brittle to hyperparameters and has exploration issues.
Since we ultimately want to train on driving logs, we need an off-policy algorithm.
Off-policy algorithms are in general more sample efficient.
Problem Setting
A short description of the setup
Problem Setting
• State space – Either encoded image, waypoints or manual
Feature         Value
WP              0.4
Obstacle        1
Traffic Light   0
…               …
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control (see the sketch after this list)
• Scenarios
• Straight
• One Turn
• Navigation
• Navigation with dynamic actors
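The split between the high-level RL action (target speed and steer) and the low-level PID controller can be illustrated with a minimal sketch. The class name, gains, timestep, and the throttle/brake mapping below are illustrative assumptions, not the project’s actual controller.

# Minimal sketch (assumed interface): the RL policy outputs a bounded
# (target_speed, steer) pair; a PID loop converts the speed error into
# throttle/brake commands for the simulator. Gains are placeholders.
class SpeedPID:
    def __init__(self, kp=0.5, ki=0.05, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def control(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Positive output -> throttle, negative -> brake (both clipped to [0, 1]).
        throttle = max(0.0, min(1.0, u))
        brake = max(0.0, min(1.0, -u))
        return throttle, brake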
Past work and Challenges
Previous Work in Lab
• Used Proximal Policy Optimization (PPO)
• State space
• Encoded semantically segmented (SS) images with waypoints (WP)
• Reward
• Speed based reward
• Distance to trajectory
• Collision reward
[5] Agarwal, et al. “Learning to Drive using Waypoints“, NeurIPS 2019 Workshop – ML4AD
Previous Work in Lab
• Excels in Straight, Navigation and One Turn
• Struggles to brake for other cars in Navigation with dynamic actors.
[5] Agarwal, et al. “Learning to Drive using Waypoints“, NeurIPS 2019 Workshop – ML4AD
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Which components need to be improved?
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing solely on RL – Manual Input
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing on representation – Imitation Learning
Decoupling the problem
Input images and data from simulator → Pretrained model → RL algorithm → Reward optimization
Focusing on RL algorithms
Proximal Policy Optimization
• Designed to improve the sample complexity of policy gradient algorithms
• On-policy algorithm
Soft Actor Critic
• Designed to improve exploration using an entropy-maximization framework
• Off-policy algorithm, hence expected to be more sample efficient (objective shown below)
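For reference, the maximum-entropy objective that Soft Actor Critic optimizes can be written in LaTeX as follows, where \alpha is the entropy (temperature) coefficient trading off reward against exploration:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]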
Experiments
Task 1 (Simple)
• Settings:
• Scenario: Navigation without dynamic actors
• State space:
• Mean angle to the next 5 waypoints (computed as sketched below)
• Reward:
• Speed-based reward
• Distance from trajectory
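A minimal sketch of how this single state feature could be computed; the coordinate conventions, data structures, and the function name are illustrative assumptions rather than the project’s actual CARLA feature-extraction code.

import math

def mean_angle_to_waypoints(vehicle_xy, vehicle_yaw, waypoints_xy, n=5):
    """Mean heading error (radians) from the vehicle to the next n route waypoints.

    vehicle_xy: (x, y) position; vehicle_yaw: heading in radians;
    waypoints_xy: upcoming (x, y) waypoints along the planned route.
    """
    angles = []
    for wx, wy in waypoints_xy[:n]:
        bearing = math.atan2(wy - vehicle_xy[1], wx - vehicle_xy[0])
        # Wrap the heading error into [-pi, pi].
        error = (bearing - vehicle_yaw + math.pi) % (2 * math.pi) - math.pi
        angles.append(error)
    return sum(angles) / len(angles) if angles else 0.0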
Task 1 experiments
• PPO performs
better than SAC
Task 1 experiments
• Tried changing buffer size, buffer batch size, target update frequency
• SAC is sensitive to the entropy coefficient
• Tuning the entropy coefficient in SAC
• Tried both a manually set and a learned (auto-tuned) coefficient (see the sketch below)
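The learned-coefficient variant corresponds to SAC’s automatic temperature tuning. The snippet below is a generic sketch of that update (learning rate, target entropy, and variable names are assumptions), not the training code used in these experiments.

import torch

action_dim = 2                         # speed and steer
target_entropy = -float(action_dim)    # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi_batch):
    """log_pi_batch: log pi(a|s) for actions sampled from the current policy."""
    alpha_loss = -(log_alpha * (log_pi_batch + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()      # alpha used in the actor and critic losses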
Hypothesis
• SAC is sensitive to the learned Q values
• If the Q value estimates are biased, there is no guarantee on the learned policy
• Bias can arise for different reasons
• e.g. the observed state space not satisfying the MDP assumption made by TD learning
• Monte-Carlo estimates would be better in such cases
N-step rewards in Q value estimates
• Use N-step rewards to reduce the bias, at the cost of higher variance (see the sketch below)
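A hedged sketch of how an N-step return can replace the 1-step bootstrap in the SAC critic target. The function below is illustrative only: variable names are assumptions, and for simplicity the entropy bonus is applied only to the bootstrap term at step N.

def n_step_soft_target(rewards, dones, bootstrap_q, bootstrap_log_pi,
                       gamma=0.99, alpha=0.2):
    """N-step soft Q target for one sampled sequence.

    rewards, dones: r_t..r_{t+N-1} and their episode-termination flags;
    bootstrap_q: min of the target critics at (s_{t+N}, a'), a' ~ pi;
    bootstrap_log_pi: log pi(a' | s_{t+N}).
    """
    target, discount = 0.0, 1.0
    for r, done in zip(rewards, dones):
        target += discount * r
        discount *= gamma
        if done:                 # episode ended inside the window: no bootstrap
            return target
    # Entropy-regularized bootstrap after N steps.
    return target + discount * (bootstrap_q - alpha * bootstrap_log_pi)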
N-step rewards in SAC - Results
• PPO uses Generalized Advantage Estimation (GAE), shown below
• Analogous to N-step advantage estimation
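For comparison, GAE combines exponentially weighted N-step advantages; in LaTeX, with TD residual \delta_t:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}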
N-step rewards in SAC - Results
• Not using importance sampling corrections for the off-policy N-step returns
• No major effect for small N
• Large N will induce bias again!
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton. “Understanding multi-step deep reinforcement learning: A systematic study of the DQN target”
[7] Hessel, et al. "Rainbow: Combining improvements in deep reinforcement learning." AAAI 2018.
N-step rewards in SAC – Results
Intermediate policy: crashes while turning. Trained policy: completes the route successfully.
Task 2 (Complex)
• Settings:
• Scenario: Navigation with dynamic actors
• State space:
• Mean angle to next 5 waypoints
• Nearest obstacle distance
• Nearest obstacle speed
• Vehicle speed
• Vehicle Steer
• Distance to trajectory
• Distance to goal
• Distance to red light
• Reward
• Speed based reward
• Distance from trajectory
• Collision reward
• Traffic light reward
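A minimal sketch of assembling the Task 2 manual state vector and reward listed above. The dictionary keys, scaling, and reward weights are illustrative assumptions, not the exact values used in the experiments.

import numpy as np

def task2_state(obs):
    """obs: dict of raw simulator measurements (assumed keys)."""
    return np.array([
        obs["mean_angle_next_5_wp"],   # heading error to the route
        obs["nearest_obstacle_dist"],
        obs["nearest_obstacle_speed"],
        obs["vehicle_speed"],
        obs["vehicle_steer"],
        obs["dist_to_trajectory"],
        obs["dist_to_goal"],
        obs["dist_to_red_light"],
    ], dtype=np.float32)

def task2_reward(obs, collided, ran_red_light):
    reward = obs["vehicle_speed"]                    # speed-based reward
    reward -= 0.5 * abs(obs["dist_to_trajectory"])   # deviation penalty (weight is a placeholder)
    if collided:
        reward -= 100.0                              # collision penalty (placeholder)
    if ran_red_light:
        reward -= 50.0                               # traffic-light penalty (placeholder)
    return reward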
Task 2 results
After 4M timesteps: the agent successfully learns to stop when other actors brake.
Future Work
Future Work – Stable RL
• Compare the performance of using a smaller batch size to make training faster
• Current SAC is unstable
Future Work – Stable RL
• Compare the performance of using a smaller batch size to make training faster
• Current SAC is unstable. We hypothesize that this is due to the increased variance of the Q value estimates
• Change the policy loss to introduce an advantage function (see the sketch after this list)
• Similar to baseline usage in policy gradients!
• Use prioritized experience replay to improve sample complexity
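One possible reading of the advantage-based policy loss, as a sketch only: subtract a state-value baseline V(s) from the soft Q value, analogous to the baseline in policy gradients. The function below is our interpretation of the proposal and not an established SAC variant; names and the choice of baseline are assumptions.

import torch

def advantage_policy_loss(q1, q2, v, log_pi, alpha=0.2):
    """Policy loss with an advantage-style target (sketch).

    q1, q2: critic estimates Q(s, a) for a ~ pi; v: a state-value baseline V(s)
    (e.g. a separate value network); log_pi: log pi(a|s).
    Standard SAC minimizes alpha*log_pi - min(q1, q2); subtracting V(s)
    acts like a baseline and is intended to reduce gradient variance.
    """
    advantage = torch.min(q1, q2) - v
    return (alpha * log_pi - advantage).mean()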
Future Work – Image inputs
• Move from a manual state space to an image state space
• Instead of using autoencoder representations, leverage the imitation-learning pretrained model (a sketch follows the caption below)
Channel visualization of the output of the conv layers of LBC’s ResNet-34 [1], trained to drive
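A rough sketch of that plan: freeze the convolutional trunk of a pretrained driving model and feed its pooled features to the RL agent as the state. The torchvision ResNet-34 below is a stand-in for the LBC checkpoint; the actual weights, layer cut, and feature size are assumptions.

import torch
import torch.nn as nn
from torchvision import models

class FrozenConvEncoder(nn.Module):
    """Fixed feature extractor producing RL state vectors from images (sketch)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(pretrained=True)  # stand-in for the LBC-trained ResNet-34
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        for param in self.features.parameters():
            param.requires_grad = False              # keep the representation fixed

    def forward(self, image_batch):                  # (B, 3, H, W) -> (B, 512)
        with torch.no_grad():
            return self.features(image_batch).flatten(1)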
Timeline
September
Stability experiments, Image representations
October
Reducing Sample complexity
November
Final experimentation
References
[1] Chen et al., “Learning by cheating”, CoRL 2019.
[2] Prof. Jeff Schneider’s RI Seminar Talk.
[3] Liang, Xiaodan, et al. “CIRL: Controllable imitative reinforcement learning for vision-based self-driving”, ECCV, 2018.
[4] Kendall, Alex, et al. “Learning to drive in a day”, ICRA, IEEE, 2019.
[5] Agarwal, et al. “Learning to Drive using Waypoints”, NeurIPS 2019 Workshop – ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton. “Understanding multi-step deep reinforcement learning: A systematic study of the DQN target”.
[7] Hessel, et al. “Rainbow: Combining improvements in deep reinforcement learning”, AAAI 2018.


Editor's Notes

• #5 Our problem statement is to train a self-driving agent using RL in the CARLA simulator. In the end, our ultimate goal is to be able to train our algorithm on Argo’s hundreds of hours of driving logs. This might not happen during our capstone, but that is the main idea. We also aim for sample-efficient algorithms, i.e. algorithms that need fewer input data points to train well.
• #6 There are many motivations for using RL for self-driving. The first is that current self-driving cars are highly modular, and each of these modules needs to be fine-tuned for rare, unaccounted-for events. This is engineer-intensive. An end-to-end model, if it can perform well enough, is much more desirable.
• #7 Secondly, we want to have verifiable performance through simulation. Unseen events can be tested this way, for example rain in the video, which was not seen during training for this agent. We can also use experiences in simulation to augment real-world data.
• #8 Lastly, such an algorithm can scale very easily because it is end-to-end and less engineer-intensive.
• #10 This was the first successful deep-RL pipeline for vision-based autonomous driving that outperforms the previous modular pipeline and other imitation learning approaches on driving tasks on the CARLA benchmark.
• #11 A recent paper called Learning by Cheating achieved near-perfect results on the CARLA benchmark. They trained an “oracle” using simulator input and then used it to train a large model, since they now had an abundant supply of “ground truth” data. They write in their paper that an exciting opportunity for future work is to combine the presented ideas with reinforcement learning and train systems that exceed the capabilities of the expert.
• #12 The first real-world application of deep reinforcement learning to autonomous driving. The task is simple lane following and the network is tiny.
• #13 Current self-driving RL work has tried several algorithms. Since in the end we want to train on driving logs, we need an off-policy algorithm like DDPG or SAC. The best such algorithm demonstrated so far is DDPG, but it is very brittle to hyperparameters and has exploration issues, as pointed out by the authors of the Soft Actor Critic paper. So SAC might be a good candidate here. Also, off-policy algorithms are in general more sample efficient, which is one of our goals.
• #15 For the state space, we can have an encoded representation of an RGB or semantically segmented image. We can also have the angle from the trajectory, which is described using waypoints, as an input. Alternatively, we can take data directly from the simulator as input, which we call manual input.
• #20 Previously, the Auton Lab team working on this project used Proximal Policy Optimization over an encoding of top-view semantically segmented images, along with waypoints. They used three reward terms: first, a speed-based reward, where the agent received a reward for each waypoint collected; second, a penalty for deviating from the trajectory; and lastly, a large penalty for collisions.
• #22 There can be two sources of problems: representation or algorithm.
• #26 We had proposed SAC for our problem before, and it is designed very differently from PPO. SAC focuses on… PPO focuses on… Also, SAC is off-policy, which is desirable because we want to train on logs and because off-policy methods tend to be more sample efficient.
  • #29 Explain the plot
• #35 In general, TD works under the implicit assumption that there is an MDP. But if the state space is not fully observable, then the observations in hand cannot satisfy the MDP assumption made by TD (although there is an underlying MDP). So, in such situations, MC estimates are better. In our task, we can never be sure of the state representation, so this modification is highly likely to help for image state spaces too.
• #37–#44, #47–#53 Having an off-policy algorithm with good performance is desirable.