Standard vs Custom Battery Packs - Decoding the Power Play
Driving Behavior for ADAS and Autonomous Driving V
1. Driving Behavior for ADAS
and Autonomous Driving V
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• An Auto-tuning Framework for AVs
• Adaptive Behavior Generation for AD using Deep RL with Compact Semantic
States
• Probabilistic Prediction of Interactive Driving Behavior via Hierarchical Inverse RL
• Zero-shot Deep RL Driving Policy Transfer for AVs based on Robust Control
• Risk-averse Behavior Planning for AD under Uncertainty
• Multimodal Trajectory Predictions for AD using Deep Convolutional Networks
• Exploring the Limitations of Behavior Cloning for AD
3. An Auto-tuning Framework for Autonomous Vehicles
• Many AD motion planners generate trajectories by optimizing a reward/cost functional.
• Designing and tuning a high-performance reward/cost functional for Level-4 autonomous
driving vehicles with exposure to different driving conditions is challenging.
• Traditionally, reward/cost functional tuning involves substantial human effort and time spent on
both simulations and road tests.
• As the scenario becomes more complicated, tuning to improve the motion planner
performance becomes increasingly difficult.
• To systematically solve this issue, develop a data-driven auto-tuning framework based on the
Apollo autonomous driving framework.
• The framework includes a rank-based conditional inverse reinforcement learning (IRL) alg., an
offline training strategy and an automatic method of collecting and labeling data.
• The auto-tuning framework has the advantages that make it suitable for tuning an autonomous
driving motion planner.
• compared to most IRL algorithms, training is efficient and capable to different scenarios.
• offline training strategy offers a safe way to adjust the parameters before public road testing.
• expert driving data and information about the surrounding environment are collected and
automatically labeled, which considerably reduces the manual effort.
• motion planner tuned by the framework is examined via both simulation and public road testing.
4. An Auto-tuning Framework for Autonomous Vehicles
Data-driven autonomous driving motion
planner on the Apollo platform.
Algorithm tuning loop for the motion planner in
the Apollo autonomous driving platform
5. An Auto-tuning Framework for Autonomous Vehicles
• The idea for learning the reward functional
includes two key parts: conditional
comparison and rank-based learning.
• Instead of comparing the expectation of
value functions of the expert demonstration
and optimal policy, compare the value
functions state by state.
• To accelerate the training process and
extend the coverage of corner cases,
sample random policies and compare
against the expert demonstration instead of
generating the optimal policy first, as in
policy gradient.
• The background (driving scenarios)
differences may impact the tuned reward
functional significantly.
Online and offline pipeline of the auto-tuning
frame- work by module.
6. An Auto-tuning Framework for Autonomous Vehicles
Feature Description Given A Trajectory Point
Given a path profile in a station-lateral
coordinate system, obstacles and predicted
moving trajectories are projected on the station-
time graph if there are any interactions with the
moving path of the ego car.
Then, the goal is to generate a speed profile on
the station- time graph that can safely avoid
obstacles and maintain smooth driving.
The optimal speed profile is generated by
optimizing the cost/reward functional, which
captures the trajectory smoothness, distance to
different obstacles, and path smoothness.
Application of speed profile generation inside EM plann
7. An Auto-tuning Framework for Autonomous Vehicles
Siamese network used in RC-IRL. The
value networks of both the human and
the sampled trajectories share the same
network parameter settings. The loss
function evaluates the difference between
the sampled data and the generated
trajectory via the value network outputs.
The value network inside the Siamese model is used to
capture driving behavior based on encoded features. The
network is a trainable linear combination of encoded
rewards at different times. The weight of the encoded
reward is a learnable time decay factor. The encoded
reward includes an input layer with 21 raw features and a
hidden layer with 15 nodes.
8. An Auto-tuning Framework for Autonomous Vehicles
• To validate the general performance in terms of conditional IRL, consider a general adversarial
network (GAN) for comparison.
• In GAN, each sampled trajectory is labeled as human driving (1) or random trajectory (0) and
trained with cross entropy loss.
• The network is hoped to distinguish human driving from a randomly generated trajectory.
• Since the GAN cannot distinguish different scenarios, the procedure compares the average
performance of human driving trajectories and that of randomly generated trajectories.
Simulation Results of the Motion Planner
9. • IEEE Intelligent Vehicles Symposium (IV), June 2018
• Making the right decision in traffic is a challenging task that is highly dependent on individual preferences as
well as the surrounding environment. Therefore it is hard to model solely based on expert knowledge.
• It uses Deep Reinforcement Learning to learn maneuver decisions based on a compact semantic state
representation.
• This ensures a consistent model of the environment across scenarios as well as a behavior adaptation
function, enabling on-line changes of desired behaviors without re-training.
• The input for the neural network is a simulated object list similar to that of Radar or Lidar sensors,
superimposed by a relational semantic scene description.
• The state & reward are extended by a behavior adaptation function and a parameterization respectively.
• With little expert knowledge and a set of mid-level actions, it can be seen that the agent is capable to adhere
to traffic rules and learns to drive safely in a variety of situations.
Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
10. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
• A reinforcement learning process is commonly modeled as an MDP (S,A,R,δ,T)
where S is the set of states, A the set of actions, R : S×A×S → R the reward
function, δ : S × A → S the state transition model and T the set of terminal
states. At timestep i an agent in state s ∈S can choose an action a ∈A
according to a policy π and will progress into the successor state s′ receiving
reward r. This is defined as a transition t = (s, a, s′, r).
• The aim of reinforcement learning is to maximize the future discounted
return .
• A DQN uses Q-Learning to learn Q-values for each action given input state s
based on past transitions. The predicted Q-values of the DQN are used to
adapt the policy π and therefore change the agent’s behavior.
11. The initial traffic scene is transformed into a compact semantic state representation s and used as input for the RL agent. The
agent estimates the action a with the highest return (Q-value) and executes it, e.g., changing lanes. Afterwards a reward r is
collected and a new state s′ is reached. The transition (s, a, r, s′ ) is stored in the agent’s replay memory.
Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
12. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
(a) The ego vehicle (blue) is driving on a two lane road. Five other
vehicles are in its sensor range. Based on a vehicle scope with Λlateral =
Λahead = Λbehind = 1 only four vehicles (green) are considered in the
semantic state representation.
(b) The relational grid resulting from the scene
of (a) with Λlateral = Λahead = Λbehind = 1. Moreover,
there is no adjacent lane to the right, which is
represented in the grid as well.
An example scene (a) is transformed to a relational grid (b) using a
vehicle scope Λ with Λlateral = Λahead = Λbehind = 1. The red vehicle, which is
in sensor range, is not represented in the grid. Since there is no
vehicle driving on the same lane in front of the ego vehicle, the
respective relational position in the grid is empty.
define a relational grid, centered at the ego
vehicle vego ∈V, The rows correspond to the
relational lane topology, whereas the columns
correspond to the vehicle topology on these lanes.
13. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
• For the input state representation,adapt the ontology- based concept focusing on relations
with other traffic participants as well as the road topology.
• Design the state representation to use high level preprocessed sensory inputs such as object
lists provided by common Radar and Lidar sensors and lane boundaries from visual sensors or
map data. To generate the desired behavior the reward is comprised of different factors with
varying priorities.
• A traffic scene τ is described by a semantic entity- relationship model, consisting of all scene
objects and relations between them.
• The scene objects contain all static and dynamic objects, such as vehicles, pedestrians, lane
segments, signs and traffic lights.
• This work focus on vehicles V ⊂ E, lane segments L ⊂ E and the three relation types vehicle-
vehicle relations, vehicle-lane relations and lane-lane relations.
• Every and relation holds several properties or attributes of the scene objects, such as e.g.
absolute positions or relative velocities.
• This scene description combines low level attributes with high level relational knowledge in a
generic way. It is thus applicable to any traffic scene and vehicle sensor setup, making it a
beneficial state representation.
14. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
Partially visualized entity-relationship model of the scene in the last figure (a). The vehicle topology is
modeled by vehicle-vehicle relations whereas the lane topology is modeled by lane-lane relations.
But the representation is of varying size and includes more aspects than are relevant for a given driving task. In
order to use this representation as the input to a neural network, transform it to a fixed-size relational grid that
includes only the most relevant relations.
15. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
• The set of vehicles inside the vehicle scope Λ is denoted by VΛ ⊆ V. The different features of
vehicles and lanes are represented by separate layers of the grid, resulting in a semantic state
representation.
• The column vector of each grid cell holds attributes of the vehicle and the lane it belongs to.
While the ego vehicle’s features are absolute values, other vehicles’ features are relative to
the ego vehicle or the lane they are in (induced by the vehicle-vehicle and vehicle-lane
relations of the entity-relationship model):
• 1) Non-ego vehicle features: longitudinal pos & velocity, lateral alignment and heading;
• 2) Ego vehicle features: behavior adaptation function, longitudinal velocity, lane index;
• 3) Lane features: lane type and lane ending;
• Non-perceivable features or missing entities are indicated in the semantic state
representation by a dedicated value.
• The relational grid ensures a consistent representation of the environment, independent of
the road geometry or the number of surrounding vehicles.
16. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
The relational grid contains one layer per feature. The vehicle features fv
i and fv
ego share layers and are in the
cells of the vi and vego respectively. The lane features fk
l are on additional layers in the k-th row of the grid.
17. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
• The vehicle’s actions space is defined by a set of semantic actions that
is deemed sufficient for most on-road driving tasks, excluding special
cases such as U-turns.
• The longitudinal movement of the vehicle is controlled by the actions
accelerate and decelerate.
• While executing these actions, the ego vehicle keeps its lane.
• Lateral movement is generated by the actions lane change left as well
as lane change right respectively.
• Only a single action is executed at a time and actions are executed in
their entirety, the vehicle is not able to prematurely abort an action.
• The default action results in no change of either velocity, lateral
alignment or heading.
18. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
• With the aim to generate adaptive behavior, extend the reward function R(s,a)
by a parameterization θ.
• This parameterization is used in the behavior adaptation function Ω(τ,θ), so
that the agent is able to learn different desired behaviors without the need
to train a new model for varying parameter values.
• Furthermore, the desired driving behavior consists of several individual goals,
modeled by separated rewards.
• To rank these reward functions by three different priorities.
• The highest priority has collision avoidance, important but to a lesser extent
are rewards associated with traffic rules, least prioritized are rewards
connected to the driving style.
• The overall reward function R(s, a, θ) can be expressed
19. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
• The subset Scollision ⊂ S consists of all states s describing a collision state of the ego
vehicle vego and another vehicle vi.
• In these states the agent only receives the immediate reward without any possibility
to earn any other future rewards.
• Additionally, attempting a lane change to a nonexistent adjacent lane is also treated
as a collision.
• The state dependent evaluation of the reward factors facilitates the learning process.
• As the reward for a state is independent of rewards with lower priority, the eligibility
trace is more concise for the agent being trained.
• For example, driving at the desired velocity does not mitigate the reward for collisions.
20. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
The current state from SUMO is retrieved and transformed into the semantic state representation. This
is the input to the RL agent, which is trained using the TensorForce library. Chosen actions are mapped
to a respective state change of the simulation. The agent’s reward is then computed based on the initial
state, the chosen action and the successor state.
While the concept is able to handle data from many preprocessing methods used in
autonomous vehicles, it tested the approach with the traffic simulation SUMO.
21. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
To examine the agent’s compliance to traffic rules, it is trained and evaluated on two different traffic
scenarios. In (a) the agent has the obligation to drive on the right most lane and must not pass others from
the right, amongst other constraints. In (b) the agent is allowed to accelerate while on the on-ramp and also
might overtake vehicles on its left. But it has to leave the on-ramp before it ends.
Results of the trained agents. The agents gH
(highway)and gM (merging ramp)were only
evaluated on their respective scenario, while gC
(both highway and merging ramp)was
evaluated on both. The results for gC are listed
separately for each scenario.
22. Adaptive Behavior Generation for Autonomous Driving using
Deep Reinforcement Learning with Compact Semantic States
Overtaking maneuver. The agent (blue) is driving behind a
slower vehicle (green). In this situation the action lane change left
has the highest estimated Q-value. After the lane change, the
agent accelerates and overtakes the slower vehicle.
Subsequently, the agent changes back to the right most lane.
The average speed of each agent given varying desired
velocities [m/s]. The agents have been evaluated on the
training scenarios with normal traffic density as well as
on an empty highway. While not every agent is able to
achieve the desired velocity precisely, their behavior
adapts to the different parameter values.
23. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• To safely and efficiently interact with other road participants, AVs have to accurately predict
the behavior of surrounding vehicles and plan accordingly.
• Such prediction should be probabilistic, to address the uncertainties in human behavior.
• Such prediction should also be interactive, since the distribution over all possible trajectories
of the predicted vehicle depends not only on historical information, but also on future plans
of other vehicles that interact with it.
• To achieve such interaction-aware predictions, propose a probabilistic prediction approach
based on hierarchical inverse reinforcement learning (IRL).
• First, explicitly consider the hierarchical trajectory-generation process of human drivers
involving both discrete and continuous driving decisions.
• Based on this, the distribution over all future trajectories of the predicted vehicle is
formulated as a mixture of distributions partitioned by the discrete decisions.
• Then apply IRL hierarchically to learn the distributions from real human demonstrations.
• A case study for the ramp-merging driving scenario is provided.
• The quantitative results show that the proposed approach can accurately predict both the
discrete driving decisions such as yield or pass as well as the continuous trajectories.
24. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• It focus on predicting human drivers’ interactive behavior within two vehicles: a host vehicle
(denoted by (·)H ) and a predicted vehicle (denoted by (·)M ).
• All other in-scene vehicles are treated as surrounding vehicles, denoted by (·)i
O (i=1, 2, · · ·,
N is the index for surrounding vehicles).
• It uses ξ to represent historical vehicle trajectories, and ξˆ for future trajectories.
• It is obvious that the probability distribution over all possible trajectories of the predicted
vehicle depends on his own historical trajectory and those of surrounding vehicles.
• Mathematically, such time-domain state dependency can be modelled as a conditional
probability density function (PDF):
• The probability distribution over all possible trajectories of the predicted vehicle should also
be conditioned on potential plans of the host vehicle.
25. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• The trajectory-generation process of human drivers is naturally probabilistic and
hierarchical.
• It involves both discrete and continuous driving decisions.
• The discrete driving decisions determine “rough” pattern (or homotopy class) of his future
trajectory as a game-theoretic result (e.g., to yield or to pass), whereas the continuous
driving decisions influence details of the trajectory such as velocities, accelerations and
smoothness.
• The conditional distribution is formulated as a mixture of distributions, which explicitly
captures the influences of both discrete and continuous driving decisions of human drivers:
• Thus propose to apply inverse reinforcement learning (IRL)hierarchically to learn all the
models from observed demonstrations of human drivers.
26. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Figure shows the probabilistic and hierarchical trajectory–
generation process for a lane changing scenario.
• The predicted vehicle (blue) is trying to merge into the lane
of the host vehicle (red).
• Given all observed historical trajectories ξ={ξ1:N
O, ξH , ξM }
and his belief about the host vehicle’s future trajectory ξH ,
the trajectory distribution of the predicted vehicle over all
the trajectory space is partitioned first by the discrete
decisions: merge behind (d1
M) and merge front (d2
M).
• Under different discrete decisions, the distributions of
continuous trajectories can be significantly different, and
each of them is represented via a probability distribution
model.
• The observed demonstrations are samples satisfying the
distributions.
27. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Based on the principle of maximum entropy, assume that all drivers are exponentially more
likely to make decisions (both discrete and continuous) that lead to a lower cost.
• This introduces an family of exponential distributions depending on the cost functions, and
interest is to find the optimal hierarchical cost functions that ultimately lead to trajectory
distributions matching that of the observed trajectories in a given demonstration set Ξ.
• Since the trajectory space is continuous and demonstrations are noisily local-optimal, use
Continuous Inverse Optimal Control with Locally Optimal Examples.
• Continuous space IRL: the goal is to find the optimal parameter vector such the given
demonstration set is most likely to happen.
• Under discrete decision, assume that the cost of each trajectory can be linearly parametrized by a
group of selected features.
28. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Features of Continuous Driving Decisions:
• Speed: vs the certain speed limit
• Traffic: based on intelligent driver model (IDM)
• Control effort and smoothness: a set of kinematics-related features
• Clearance to other road participants: a distance-related feature
• Goal:
• Courtesy:
29. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Features of Discrete Driving Decisions:
• Rotation angle:
• Minimum cost:
• Discrete space IRL: assume that the decisions with lower cost are exponentially more
probable.
• The goal is to find the optimal parameter vector such that the likelihood of the demonstration set
is maximized
• The parameter vector is updated via
30. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Collect human driving data from the Next Generation SIMulation (NGSIM) dataset. It
captures the highway driving behaviors/trajectories by cameras mounted on top of
surrounding buildings. The sampling time of the trajectories is △t=0.01s.
• Choose 134 ramp-merging trajectories on Interstate 80 (near Emeryville, California), and
separated them into two sets: a training set of size 80 (denoted by Ξ, i.e., the human
demonstrations), and the other 54 trajectories.
Figure shows the road map and an example group of trajectories. There are four cars in
scene, one merging vehicle (red), one lane-keeping vehicle (blue) and two surrounding
vehicles (black), with one ahead of the blue car and the other behind. interest focuses on
the interactive driving behavior of both the merging vehicle and the lane-keeping vehicle.
31. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Use the same hierarchical IRL approach to model the conditional probability distributions for
both the merging vehicle and the lane-keeping vehicle.
• Driving Decisions: In the ramp-merging scenario, the driving decisions are listed as in Table.
• As mentioned above, x=[x1, · · · , xL]T and y=[y1, · · · , yL]T are, respectively, the coordinate
vectors in Frenet Frame along the longitudinal and lateral directions.
• L is set to be 50, i.e., in each demonstration, 5s trajectories are collected.
• Feature selection: Since the right of way for the merging vehicle and the lane-keeping
vehicles are different, define different features for them.
• Lane keeping: yield and pass;
• Merge: merge back and front, feature for IDM and courtesy.
32. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• Use Tensorflow to implement the
hierarchical IRL algorithm.
• Figure gives the training curves
regarding to continuous and discrete
driving decisions.
• Due to the hierarchical structure, first
learn all four continuous distribution
models for both the merging vehicle
and the lane- keeping vehicle under
different discrete decisions.
• Randomly sample subsets of
trajectories from the training set and
perform multiple trains.
• As seen, the parameters in each
continuous model converge
consistently with small variance.
33. Probabilistic Prediction of Interactive Driving Behavior
via Hierarchical Inverse Reinforcement Learning
• In this test, generate the most probable trajectories under different discrete driving decisions
by solving a finite horizon Model Predictive Control (MPC) problem using the learned
continuous cost functions.
• Three illustrative examples:The red dotted lines and blue solid lines represent, respectively,
the predicted most likely trajectories and the ground truth trajectories;The thick black dash-
dot lines are trajectories of other vehicles;The predicted trajectories are very close to the
ground truth ones.
34. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
• Although deep RL methods have lots of strengths that are favorable if applied to
autonomous driving, real deep RL applications in autonomous driving have been
slowed down by the modeling gap between the source (training) domain and the
target (deployment) domain.
• Unlike current policy transfer approaches, which generally limit to the usage of
uninterpretable neural network representations as the transferred features, propose
to transfer concrete kinematic quantities in autonomous driving.
• The proposed robust-control-based (RC) generic transfer architecture, which call
RL-RC, incorporates a transferable hierarchical RL trajectory planner and a robust
tracking controller based on disturbance observer (DOB).
• The deep RL policies trained with known nominal dynamics model are transfered
directly to the target domain, DOB-based robust tracking control is applied to
tackle the modeling gap including the vehicle dynamics errors and the external
disturbances such as side forces.
• Simulations achieve zero-shot transfer across multiple driving scenarios such as lane
keeping, lane changing and obstacle avoidance.
35. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
• The proposed RL-RC policy transfer
architecture consists of a deep RL-
based high-level planning module and
a RC- based low-level tracking
controller.
• The overall methodology consists of
the off-line deep RL policy training
and the on-line policy transfer.
• In the off-line source domain, pre-
train a deep RL policy mapping the
perception input to control commands,
which can drive the source vehicle to
produce a trajectory completing the
control task.
For the on-line transfer in the target domain, it shows how the
system works to transfer the source driving policies.
36. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
• According to perception observation, the system constructs an imaginary source agent in the
same setting as the target agent.
• In the imaginary source domain, the pretrained policy is used to control the imaginary source
agent to perform the driving task for a finite horizon, resulting in a trajectory of kinematic
states, which serves as the reference for the target vehicle.
• Given the reference trajectory, the target agent uses a closed-loop robust tracking controller
to produce the actual control command for the target vehicle.
• As the target environment makes one step forward, the observation for the new timestep is
collected, and the system repeats the same procedure.
• In this architecture, the kinematic features are transfered without change, and the modeling
gap is compensated by RC.
• To generate the reference trajectory in the imaginary source domain, train a deep RL policy
network that directly maps perception input to trajectory of kinematic states.
• There are two key underlying assumptions for the RL-RC system:
• (i) the trajectory planned by the imaginary source agent is comparably satisfying for the target task;
• (ii) it is feasible for the target vehicle to track the trajectories produced by the source vehicle.
• These assumptions are reasonable as the source and target vehicles and settings are similar,
and the RL-RC is only effective for such cases.
37. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
Illustration of the terminology for the simulated environment and the linear tracking model for the controller.
In both the source domain and the target domain, simulate the vehicle dynamics using nonlinear
bicycle model discretized using forward Euler method with ∆t = 0.02s.
38. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
• To learn driving policy using deep RL, define Markov Decision Process (MDP) for the driving tasks.
• Since different tasks have different observations, define clusters of observations as entities.
• Interfaces of lane keeping (LK), lane changing (LC) and obstacle avoidance (OA) are shown in Table I.
• In implementation, a hierarchical RL model that can modularize distinct driving attributes is applied.
• The attributes refer to driving behaviors such as obstacle detection, lane selection, and lane tracking.
• Table II and Figure give the detailed module interfaces and their usages.
Table I DEFINITION OF DRIVING TASKS Table II DEFINITION OF RL MODULES
39. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
• Theoretically, all the three basic modules can
be optimized using any kind of method
including deep RL.
• In implementation, both lane-selection and
obstacle-detection are rule-based and
reasonably optimized for the sake of simplicity.
• The lane- tracking is optimized using model-
free deep RL.
• The benefit of hierarchical implementation over
E2E training is that basic attribute modules are
easier to optimize compared to E2E training of
complicated high-level driving policies.
Usage of hierarchical RL modules to assemble
policies for the lane keeping (LK), lane changing
(LC) and obstacle avoidance (OA) tasks.
40. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
• To design the controller, first need to obtain an
approximate linear model for the tracking
problem.
• Adopt the constant speed linear bicycle model
for lateral dynamics
• Using the equations above and applying
forward Euler discretization, the overall tracking
model can be obtained as the block diagram:
• The robust controller is designed based on the
nominal model based on the source vehicle.
• Then a DOB is added to the nominal feedback
controller,the block diagram for the overall
closed-loop system is shown:
Block diagram of linear tracking model
Block diagram of closed-loop system
41. Zero-shot Deep Reinforcement Learning Driving Policy
Transfer for Autonomous Vehicles based on Robust Control
Comparison of the driving behaviors for the RL and the RL-RC with modeling gap in the Lane Changing (LC) task.
42. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
• Autonomous vehicles have to navigate the surrounding environment with partial
observability of other objects sharing the road.
• Sources of uncertainty in autonomous vehicle measurements include sensor fusion
errors, limited sensor range due to weather or object detection latency, occlusion,
and hidden parameters such as other human driver intentions.
• Behavior planning must consider all sources of uncertainty in deciding future vehicle
maneuvers.
• This paper presents a scalable framework for risk-averse behavior planning under
uncertainty by incorporating QMDP, unscented transform, and Monte Carlo tree
search (MCTS).
• It is shown that upper confidence bound (UCB) for expanding the tree results in
noisy Q-value estimates by the MCTS and a degraded performance of QMDP.
• A modification to action selection procedure in MCTS is proposed to achieve robust
performance.
43. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
• A modular stack is considered where a road world model(RWM)is inferred based on sensor
inputs and map information. Let st represent the state of the EV as well as other objects
sharing the road with EV at time t, where j-th object state is defined as
• which represent lateral and longitudinal location, velocity, and acceleration for the object,
respectively, and time indices are dropped for simplicity.
• A hierarchical behavior planner is considered where the action space, A, is defined as a
composition of a high level action set AHL = {LaneKeep, LaneChangeR, LaneChangeL, Yield,
Negotiatelanechange} and a set of parameters, Θ, specifying how to execute the action (safe
time headway, min dist from a lead vehicle, desired velocity, max acceleration/deceleration,
level of politeness, and direction and max time/distance of lane change).
• The behavior planner action is then passed to a motion planning and control module to
generate the trajectory planning and throttle/steering control. Based on a time series of
inputs from the RWM, the behavior planner selects an action based on a risk-averse metric:
44. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
• This is a partially observable Markov decision process (POMDP)problem and its exact
solution is computationally intractable. Instead, use a QMDP-based framework equipped
with the unscented transform to generate sigma-points to sample the belief distribution. In
particular, first select 2ns + 1 sigma-points in a deterministic way as follows:
• In implementation, any sigma-point that is too close to x0 or results in physically infeasible
situation (e.g., a car is placed out of the road, overlaps with another object, or has
acceleration/deceleration/velocity beyond physical limits) is excluded.
• Therefore, the final value of ns could be smaller than the dimension of vector s. The mean
and variance of the Q-function are then calculated as follows:
45. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
• To calculate QMDP(s, a) for a given state s, use an online planner, in particular a Monte Carlo
tree search (MCTS). In baseline MCTS, the tree is expanded using the UCT bound as follows:
• It should be noted that the main objective of the MCTS approach is to select the best action
a∗, and hence ranking of the Q-values across different actions is sufficient, and accuracy of
the value function is not very important, especially for actions with low values.
• It incorporated stronger exploration approach into MCTS, by combining the UCT bound with
ε-greedy exploration.
• This implemented the ε-greedy approach since it resulted in acceptable performance.
• The modified action selection criteria is then given by
46. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
Highway ramp with limited field of view
To investigate two driving scenarios with uncertainty: 1) stationary object on the road
beyond sensor range, and 2) highway ramp with limited field of view.
It assumes there exists a stationary object on EV’s path, 400m away from its current
location. Define sensor range as a distance at which the prob. of detecting a stationary
object is 10%.
Consider a highway ramp where EV experiences a limited field of view due to existing
obstacles (e.g., trees) or difference in the elevation.
47. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
• For baseline,consider two MCTS-based schemes: 1) MCTS-Genie that receives noise-free
observations, and 2) MCTS-Noisy which makes decisions based on noisy measurements.
• The initial velocity of both EV and MV is 20m/s.
• Figure compares EV’s velocity for the candidate algorithms, and Table provides further
information about the vehicles’ status at the merge point for a realization of the noise under
which initial measurements of MV velocity are lower than its actual value.
48. Risk-averse Behavior Planning for
Autonomous Driving under Uncertainty
• In the beginning, MCTS-Noisy assumes that MV is going slower than EV, and decides to
accelerate to pass MV before reaching the merge point.
• However, after few seconds, it realizes that the initial measurements were off and in fact it is
not able to take over MV.
• Although it applies aggressive deceleration (with jerk -6.8m/s3) at that moment, there is not
enough time left to create a safe gap before the merge point (see distance to lead and time
headway reported in Table).
• MCTS-Genie observes the actual velocity of MV and hence is able to make a correct
decision from the beginning.
• For RA-QMDP, select α = 0.01 and ε = 1.
• RA-QMDP takes three sigma-points one of which considers that MV is going faster than the
value that RWM reports.
• It then makes a risk-averse decision which forces the EV to slow down and increase the gap
with MV.
• Note that RA-QMDP has some delay in making that decision since its evaluation is also
influenced by two other sigma-points that suggest MV is going slow and there might be an
opportunity to take over MV.
49. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
• High uncertainty of traffic behavior and large number of situations
that a Self Driving Vehicle may encounter on the roads, making it
very difficult to create a fully generalizable system.
• To ensure safe and efficient operations, an autonomous vehicle is
required to account for this uncertainty and to anticipate a
multitude of possible behaviors of traffic actors in its surrounding.
• This is a method to predict multiple possible trajectories of actors
while also estimating their probabilities.
• The method encodes each actor’s surrounding context into a
raster image, used as input by deep convolutional networks to
automatically derive relevant features for the task.
50. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
first rasterize an actor-specific BEV raster image encoding the actor’s map surrounding and
neighboring actors (e.g., other vehicles and pedestrians).
Then, given i-th actor’s raster image and state estimate sij at time step tj, use a CNN model
to predict a multitude of M possible future state sequences, as well as each sequence’s
probability pim such that m(pim)= 1, where m indicates mode index.
simplify the task to infer i-th actor’s future x- and y-positions, while the remaining states can
be derived by considering sij and the future position estimates. Both past and future positions
at time tj are represented in the actor-centric coordinate system derived from actor’s state at
time tj , where forward direction is x-axis, left-hand direction is y-axis, and actor’s bounding
box centroid is the origin.
51. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
Modeling multimodality of future 6-second vehicle trajectories; predicted trajectories are marked in
blue, with their probabilities indicated at the end of the trajectories
52. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
It takes an actor-specific 300 × 300 RGB raster image with 0.2m resolution and actor’s current state
(velocity, acceleration, and heading change rate) as input, and outputs M modes of future x- and y-
positions (2H outputs per mode) along with their probabilities (one scalar per mode). This results in (2H
+ 1)M outputs per actor. Probability outputs are passed through a softmax layer to ensure they sum to 1.
Note that any CNN architecture can be used as the base network; here use MobileNet-v2.
53. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
It proposes to use a Multiple-Trajectory Prediction (MTP) loss, that explicitly models the
multimodality of the trajectory space.
Mixture-of-Experts (ME) loss:
First, define a single-mode loss L of the i-th actor’s m-th mode at time tj as average displacement
error (or l2-norm) between the points of ground-truth trajectory and predicted trajectory of the m-th
mode as
54. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
Mode selection methods (modes shown in
blue): the ground truth (green) matches to the
right-turn mode when using displacement, but
the straight mode when using angle.
However, this distance function in the loss does not model the
multimodal behaviors well at intersections, as illustrated in
Figure. To address this problem, propose a function that
measures distance by considering an angle between the last
points of the two trajectories as seen from the actor position,
which improves handling of the intersection scenarios.
In practice, put a threshold of 5◦ on the angle difference.
Then, all modes which differ from the ground-truth trajectory
by less than 5◦ are considered to be potential matches, and
use displacement distance to break the tie among them. It
is found that this simple optimization makes the model
generalize well for both the intersection and straight-road
scenarios.
55. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
It shows rasters for the same scene but using two different following lanes marked in light pink,
one going straight and one turning left. The method outputs trajectories that follow the intended
paths well, and can be used to generate a multitude of trajectories for lane-following vehicles.
Assuming a knowledge of the possible lanes that can be followed and a lane-scoring system
that filters unlikely lanes, add another rasterization layer that encodes this information and
train the network to output a lane-following trajectory. Then, for one scene , generate multiple
rasters with various lanes to be followed, effectively inferring multiple trajectories.
56. Multimodal Trajectory Predictions for Autonomous
Driving using Deep Convolutional Networks
The effect on output trajectories of varying number of modes M from 1 to 4, respectively from left to right image.
In Figure visualize outputs for an intersection scene with an actor going straight, when increasing
number of modes M. For M = 1 (i.e., the STP model) the inferred trajectory is roughly an average of
straight and right-turn modes. Incrementing the number of modes to 2 get a clear separation
between modes going straight and turning right. Further, setting M to 3 the left-turn mode also
appears, albeit with lower probability. When M = 4 get an interesting effect, where the go-straight
mode splits into "fast" and "slow" modes, modeling variability in actor’s longitudinal movement.
57. Exploring the Limitations of Behavior
Cloning for Autonomous Driving
• Driving requires reacting to a wide variety of complex environment conditions and agent
behaviors.
• Explicitly modeling each possible scenario is unrealistic.
• In contrast, imitation learning can, in theory, leverage data from large fleets of human-
driven cars.
• Behavior cloning in particular has been successfully used to learn simple visuomotor policies
end-to-end, but scaling to the full spectrum of driving behaviors remains an unsolved
problem.
• A benchmark to experimentally investigate the scalability and limitations of behavior cloning.
• Behavior cloning leads to state-of-the-art results, including in unseen environments,
executing complex lateral and longitudinal maneuvers without these reactions being
explicitly programmed.
• However, well-known limitations (due to dataset bias and overfitting), new generalization
issues (due to dynamic objects and the lack of a causal model), and training instability
requiring further research before behavior cloning can graduate to real-world driving.
58. Exploring the Limitations of Behavior
Cloning for Autonomous Driving
• Behavior cloning is a form of supervised learning that can learn sensorimotor policies from
off-line collected data.
• The only requirements are pairs of input sensory observations associated with expert actions.
• Use an expanded formulation for self-driving cars called Conditional Imitation Learning, CIL.
• It uses a high-level navigational command c that disambiguates imitation around multiple
types of intersections.
• Given an expert policy π∗(x) with access to the environment state x, can execute this policy
to produce a dataset, D = {⟨oi , ci , ai ⟩}N
i=1, where oi are sensor data observations, ci are
high-level commands (e.g., take the next right, left, or stay in lane) and ai = π∗(xi) are the
resulting vehicle actions (low-level controls).
• Observations oi = {i, vm} contain a single image i and the ego car speed vm added for the
system to properly react to dynamic objects on the road.
• Without the speed context, the model cannot learn if and when it should accelerate or brake
to reach a desired speed or stop.
59. Exploring the Limitations of Behavior
Cloning for Autonomous Driving
• to learn a policy π parametrized by θ to produce similar actions to π∗ based only on
observations o and high- level commands c.
• The best parameters θ∗ are obtained by minimizing an imitation cost l :
• In order to evaluate the performance of the learned policy π(oi, ci; θ) on-line at test time,
assume access to a score function giving a numeric value expressing the performance of the
policy π on a given benchmark.
• In addition to the distributional shift problem,limitations of Behavior cloning:
• Bias in Naturalistic Driving Datasets.
• Causal Confusion.
• High variance.
60. Exploring the Limitations of Behavior
Cloning for Autonomous Driving
proposed network architecture, called CILRS, for end-to-end urban driving based on CIL. A ResNet perception
module processes an input image to a latent space followed by two prediction heads: one for controls and one
for speed.
61. Exploring the Limitations of Behavior
Cloning for Autonomous Driving
• The open source simulator CARLA provides a diverse, photo-realistic, and dynamic
environment with challenging driving conditions.
• A new larger scale CARLA driving benchmark, called NoCrash, designed to test the ability of
ego vehicles to handle complex events caused by changing traffic conditions (e.g., traffic
lights) and dynamic agents in the scene.
• Three different tasks, each one corresponding to 25 goal directed episodes.
• In each episode, the agent starts at a random position and is directed by a high-level
planner into reaching some goal position.
• The three tasks as follows:
• Empty Town: no dynamic objects.
• Regular Traffic: moderate number of cars and pedestrians.
• Dense Traffic: large number of pedestrians and heavy traffic (dense urban scenario).
• Similar to the CARLA Benchmark, NoCrash has six different weather conditions, where four
were seen in training and two reserved for testing.
• Also give two towns, one that is seen during training, and the other reserved for testing.