Driving Behavior for ADAS and Autonomous Driving X

Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction
• Joint Interaction and Trajectory Prediction for Autonomous Driving using Graph Neural
Networks
• Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in Graph-
LSTMs
• Social-WaGDAT: Interaction-aware Trajectory Prediction via Wasserstein Graph Double-
Attention Network
• EvolveGraph: Heterogeneous Multi-Agent Multi-Modal Trajectory Prediction with
Evolving Interaction Graphs
• Scenario-Transferable Semantic Graph Reasoning for Interaction-Aware Probabilistic
Prediction
• VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

MultiPath: Multiple Probabilistic Anchor Trajectory
Hypotheses for Behavior Prediction
• Predicting human behavior is a difficult and crucial task required for motion planning.
• It is challenging in large part due to the highly uncertain and multimodal set of possible outcomes
in real-world domains such as autonomous driving.
• Beyond single MAP trajectory prediction, obtaining an accurate probability distribution of the
future is an area of active interest.
• MultiPath leverages a fixed set of future state-sequence anchors that correspond to modes of the
trajectory distribution.
• At inference, the model predicts a discrete distribution over the anchors and, for each anchor,
regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at
each time step.
• The model is efficient, requiring only one forward inference pass to obtain multi-modal future
distributions, and the output is parametric, allowing compact communication and analytical
probabilistic queries.
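To make the parametric output concrete, the following is a minimal NumPy sketch of how a trajectory could be scored under an anchor-based mixture. It is an illustration, not the authors' implementation: the function names, array shapes and the diagonal covariance are assumptions made here for brevity.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax distribution over anchors."""
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def trajectory_log_likelihood(logits, anchors, offsets, log_sigma, traj):
    """Log-likelihood of one trajectory under an anchor-based mixture.

    logits:    (K,)      unnormalised anchor scores
    anchors:   (K, T, 2) fixed anchor waypoints
    offsets:   (K, T, 2) regressed offsets from the anchor waypoints
    log_sigma: (K, T, 2) log standard deviations (diagonal covariance assumed)
    traj:      (T, 2)    trajectory to score
    """
    mean = anchors + offsets
    var = np.exp(2.0 * log_sigma)
    # Per-anchor Gaussian log-density at every time step and coordinate.
    log_gauss = -0.5 * ((traj - mean) ** 2 / var + 2.0 * log_sigma + np.log(2 * np.pi))
    per_anchor = log_softmax(logits) + log_gauss.sum(axis=(1, 2))
    # Mixture over anchors via log-sum-exp.
    m = per_anchor.max()
    return float(m + np.log(np.exp(per_anchor - m).sum()))
```

Because the output is a closed-form density, queries such as "how likely is this candidate trajectory" reduce to a single function call rather than sampling.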

• MultiPath estimates the distribution over future trajectories per agent in a scene, as follows:
• 1) Based on a top-down scene representation, the Scene CNN extracts mid-level features that
encode the state of individual agents and their interactions.
• 2) For each agent in the scene, crop an agent-centric view of the mid-level feature representation
and predict the probabilities over the fixed set of K predefined anchor trajectories.
• 3) For each anchor, the model regresses offsets from the anchor states and uncertainty
distributions for each future time step.
• The distribution is parameterized by anchor trajectories A. Directly learning a mixture suffers from
mode collapse, so, as is common practice in other domains such as object detection and human
pose estimation, the anchors are estimated a priori and then fixed while the remaining parameters
are learned; in practice, the k-means algorithm serves as a simple approximation to obtain A.
• It trains the model via imitation learning by fitting parameters to maximize the log-likelihood of
recorded driving trajectories.
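The anchor estimation step can be sketched with a plain k-means over flattened trajectories. This is a minimal sketch under stated assumptions: the function name `kmeans_anchors`, the squared Euclidean distance on flattened waypoints and the fixed iteration count are choices made here, not details taken from the paper.

```python
import numpy as np

def kmeans_anchors(trajs, k, iters=20, seed=0):
    """Cluster trajectories into k anchors.

    trajs: (N, T, 2) agent-centric future trajectories.
    Returns anchors of shape (k, T, 2) and the cluster index of each trajectory.
    """
    rng = np.random.default_rng(seed)
    n, t, d = trajs.shape
    flat = trajs.reshape(n, t * d)
    # Initialise the centres from k distinct training trajectories.
    centers = flat[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = flat[assign == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers.reshape(k, t, d), assign
```

Once computed, the anchors stay fixed, and the network only learns the anchor probabilities, offsets and uncertainties.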

• The model represents a history of dynamic and static scene context as a 3-dimensional array of data
rendered from a top-down orthographic perspective.
• The first two dimensions represent spatial locations in the top-down image.
• The channels in the depth dimension hold static and time-varying (dynamic) content of a fixed
number of previous time steps.
• Agent observations are rendered as oriented bounding-box binary images, one channel for each
time step.
• Other dynamic context such as traffic light state and static context of the road (lane connectivity
and type, stop lines, speed limit, etc.) form additional channels.
• An important benefit of using such a top-down representation is the simplicity of representing
contextual information like the agents’ spatial relationships to each other and semantic road
information.
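The agent channels of such a top-down array can be sketched as follows. This is a simplified illustration: the grid size, the resolution, the origin-centred grid and the tuple layout `(cx, cy, length, width, yaw)` are assumptions, and map or traffic-light channels are omitted.

```python
import numpy as np

def render_oriented_box(grid_size, resolution, center, length, width, yaw):
    """Binary top-down mask of one oriented box.

    The grid is (grid_size, grid_size), centred on the origin, with
    `resolution` metres per pixel; columns index x and rows index y.
    """
    half = grid_size * resolution / 2.0
    coords = (np.arange(grid_size) + 0.5) * resolution - half
    xs, ys = np.meshgrid(coords, coords, indexing="xy")
    dx, dy = xs - center[0], ys - center[1]
    # Rotate pixel centres into the box frame (longitudinal / lateral axes).
    c, s = np.cos(yaw), np.sin(yaw)
    lon = c * dx + s * dy
    lat = -s * dx + c * dy
    inside = (np.abs(lon) <= length / 2) & (np.abs(lat) <= width / 2)
    return inside.astype(np.float32)

def render_history(boxes_per_step, grid_size=64, resolution=0.5):
    """boxes_per_step: list over time steps of lists of (cx, cy, length, width, yaw).

    Returns an (H, W, T) array with one binary channel per time step.
    """
    channels = []
    for boxes in boxes_per_step:
        layer = np.zeros((grid_size, grid_size), dtype=np.float32)
        for cx, cy, length, width, yaw in boxes:
            mask = render_oriented_box(grid_size, resolution, (cx, cy), length, width, yaw)
            layer = np.maximum(layer, mask)
        channels.append(layer)
    return np.stack(channels, axis=-1)
```

Static context would be appended as further channels along the last axis in the same way.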

Top: Logged trajectories of all agents are displayed in cyan. The focused agent is highlighted by a red
circle. Bottom: MultiPath showing up to 5 trajectories with uncertainty ellipses. Trajectory probabilities
(softmax outputs) are encoded in a color map shown to the right. MultiPath can predict uncertain future
trajectories for various speed (1st column), different intent at intersections (2nd and 3rd columns) and
lane changes (4th and 5th columns), where the regression baseline only predicts a single intent.

Joint Interaction and Trajectory Prediction for
Autonomous Driving using Graph Neural Networks
• This work aims to predict the future motion of vehicles in a traffic scene by explicitly modeling
their pairwise interactions.
• Specifically, they propose a graph neural network (GNN) that jointly predicts the discrete
interaction modes and 5-second future trajectories for all agents in the scene.
• The model infers an interaction graph whose nodes are agents and whose edges capture the long-
term interaction intents among the agents.
• In order to train the model to recognize known modes of interaction, it introduces an auto-
labeling function to generate ground truth interaction labels, which enables building a large
dataset of vehicle interactions without relying on human experts for manual labeling.
• Using a large-scale real-world driving dataset, they demonstrate that jointly predicting the
trajectories along with the explicit interaction types leads to significantly lower trajectory error
than baseline methods.

A joint model for interaction and trajectory prediction
Graph Network (GN)
GN consists of two components:
an edge model which combines
the representations of each edge
and its terminal nodes to output
an updated edge representation,
and a node model which operates
on each node to aggregate the
representations of incident edges
and outputs an updated node
representation.
The model consists of three components: 1) trajectory encoder network,
2) interaction prediction network, and 3) trajectory decoder network.
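The edge-model / node-model structure described above can be sketched as a single update step. This is a generic illustration, not the paper's network: the single linear layer with `tanh`, the sum aggregation over incoming edges and all parameter names are assumptions.

```python
import numpy as np

def gn_block(nodes, edges, senders, receivers, w_edge, w_node):
    """One graph-network update: edge model followed by node model.

    nodes: (N, Dn), edges: (E, De), senders/receivers: (E,) integer indices.
    w_edge: (De + 2 * Dn, De_out), w_node: (Dn + De_out, Dn_out).
    """
    # Edge model: combine each edge with the features of its two terminal nodes.
    edge_in = np.concatenate([edges, nodes[senders], nodes[receivers]], axis=1)
    new_edges = np.tanh(edge_in @ w_edge)
    # Node model: sum the updated incoming edges of each node, then update it.
    agg = np.zeros((nodes.shape[0], new_edges.shape[1]))
    np.add.at(agg, receivers, new_edges)
    node_in = np.concatenate([nodes, agg], axis=1)
    new_nodes = np.tanh(node_in @ w_node)
    return new_nodes, new_edges
```

In a joint model of this kind, the updated edge representations could feed an interaction classifier while the updated node representations feed the trajectory decoder.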

Examples of trajectory predictions on
real-world driving scenarios, from
time t to t + 5s: (a) 3-way intersection;
(b) unconventional 4-way intersection;
(c) canonical 4-way intersection; (d) 5-
way intersection; (e) actor merging
from outside the road network.

Forecasting Trajectory and Behavior of Road-Agents
Using Spectral Clustering in Graph-LSTMs
• It is an approach for traffic forecasting using a combination of spectral graph analysis and deep learning.
• It predicts both the low-level information (future trajectories) as well as the high-level
information (road-agent behavior) from the extracted trajectory of each road-agent.
• The formulation represents the proximity between the road agents using a dynamic weighted
traffic-graph.
• It uses a two-stream graph convolutional LSTM network to perform traffic forecasting using
these weighted traffic-graphs.
• The first stream predicts the spatial coordinates of road-agents, while the second stream predicts
whether a road-agent is going to exhibit aggressive, conservative, or normal behavior.
• It introduces spectral cluster regularization to reduce the error margin in long term prediction (3-
5 seconds) and improve the accuracy of the predicted trajectories.
• Code: https://gamma.umd.edu/researchdirections/autonomousdriving/spectralcows/

• The first stream is an LSTM-based encoder-decoder network, predicting the future spatial
coordinates from the trajectory history;
• The second stream is also an LSTM-based encoder-decoder network. It takes the eigenvectors of the
traffic-graph spectrum at each time instance observed so far and predicts the eigenvectors for the
next time period; the predicted spectrum sequence is used to reconstruct the sequence of
Laplacian matrices, which is then used to assign a behavior label to each road-agent;
• The behavior algorithm is rule-based.
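The input to the second stream can be illustrated by building a weighted traffic graph from agent positions and taking the spectrum of its Laplacian. This is a minimal sketch: the exponential distance kernel, the `radius` cut-off and the unnormalised Laplacian L = D - A are assumptions chosen for illustration.

```python
import numpy as np

def traffic_graph_laplacian(positions, radius=10.0):
    """positions: (N, 2). Returns the weighted adjacency A and Laplacian L = D - A."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Assumed kernel: nearby agents get larger weights; no self-loops.
    adj = np.where((dist < radius) & (dist > 0), np.exp(-dist), 0.0)
    lap = np.diag(adj.sum(axis=1)) - adj
    return adj, lap

def spectrum(lap):
    """Eigenvalues (ascending) and eigenvectors of a symmetric Laplacian."""
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvals, eigvecs
```

Recomputing the graph at every frame yields the sequence of eigenvectors that a sequence model could consume.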

The logarithm of the RMSE values; lower values are better. The prediction
window is 5 seconds for the Lyft and Apolloscape datasets, and 3 seconds for the Argoverse dataset, which
corresponds to a frame length of 30, 10, and 30, respectively.

Social-WaGDAT: Interaction-aware Trajectory Prediction
via Wasserstein Graph Double-Attention Network
• Effective understanding of the environment and accurate trajectory prediction of surrounding
dynamic obstacles are indispensable for intelligent mobile systems (like autonomous vehicles and
social robots) to achieve safe and high-quality planning when they navigate in highly interactive
and crowded scenarios.
• Because interactions are frequent and the scene evolution is uncertain, the prediction system
should support relational reasoning over different entities and provide a
distribution of future trajectories for each agent.
• This paper proposes a generic generative neural system (called Social-WaGDAT) for multi-agent
trajectory prediction, which makes a step forward to explicit interaction modeling by
incorporating relational inductive biases with a dynamic graph representation and leverages both
trajectory and scene context information.
• It also employs an efficient kinematic constraint layer applied to vehicle trajectory prediction
which not only ensures physical feasibility but also enhances model performance.
• The proposed system is evaluated on three public benchmark datasets for trajectory prediction,
where the agents cover pedestrians, cyclists and on-road vehicles.

Typical traffic scenarios with large uncertainty and interactions among multiple entities.
The upper figure in the first column was captured in a highway ramp merging scenario, where lane change
behavior with negotiation happens frequently. The lower figure was captured in a roundabout and an
unsignalized intersection scenario, where yielding and stopping behaviors happen frequently. The other two
columns show the occupancy density maps and the velocity fields of the scenarios, which are generated
based on the training data to provide statistical context information.

• The encoder-decoder architecture of Social-WaGDAT consists of three key components:
• (a) A deep feature extractor which extracts state, relation and context features from the
trajectories of agents, the sequences of occupancy density maps and velocity fields.
• (b) An encoder which includes a graph double-attention network that processes spatiotemporal
graphs and generates abstract node attributes containing interaction information, and an
encoding function which maps the node attributes to a latent space. During the testing phase, the
encoding function is not used.
• (c) A decoder samples future trajectory hypotheses satisfying physical constraints for each agent.
• The feature extractor consists of three parts: State MLP, Relation MLP and Context CNN.
• A history graph (HG) and a future graph (FG) are generated respectively to represent the
information related to the involved agents, where the state features and context features are
concatenated to be the node attributes and the relation features are used as edge attributes.
• The decoder imposes a kinematic constraint cell to enforce feasible trajectory prediction
following the recurrent unit.
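One way such a kinematic constraint could work is to have the decoder output controls rather than positions and to integrate them through a vehicle model. The sketch below uses a kinematic bicycle model with clipped acceleration and steering; the model choice, the limits, the wheelbase and the function name are all assumptions for illustration, not details stated on the slide.

```python
import numpy as np

def bicycle_rollout(state, controls, dt=0.1, wheelbase=2.8,
                    max_accel=3.0, max_steer=0.5):
    """Integrate (acceleration, steering angle) controls into a feasible path.

    state:    (x, y, heading, speed) at the current time step
    controls: (T, 2) array of raw (accel, steer) outputs, clipped to the limits
    Returns a (T, 2) array of future (x, y) positions.
    """
    x, y, yaw, v = state
    out = []
    for a, delta in controls:
        a = np.clip(a, -max_accel, max_accel)
        delta = np.clip(delta, -max_steer, max_steer)
        # Advance position with the current heading and speed.
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        # Heading rate follows from speed and steering angle.
        yaw += v / wheelbase * np.tan(delta) * dt
        v = max(v + a * dt, 0.0)  # no reversing in this sketch
        out.append((x, y))
    return np.array(out)
```

Since every output position comes from integrating bounded controls, the sampled trajectories cannot contain physically impossible jumps or turns.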

Qualitative and ablative results, where the green mask represents the predicted distribution
and the yellow, blue and red lines represent historical observation, ground truth and a
trajectory hypothesis sampled from the distribution with the smallest error, respectively.

EvolveGraph: Heterogeneous Multi-Agent Multi-Modal
Trajectory Prediction with Evolving Interaction Graphs
• This paper proposes a generic trajectory forecasting framework (named EvolveGraph) with
explicit interaction modeling via a latent interaction graph among multiple heterogeneous,
interactive agents.
• Considering the uncertainty and the possibility of different future behaviors, the model is
designed to provide multi-modal prediction hypotheses.
• Since the interactions may be time-varying even with abrupt changes, and different modalities of
agent state evolution may lead to different interactions, they address the necessity and
effectiveness of adaptively evolving the interaction graph and provide an effective solution.
• They also introduce a double-stage training pipeline which not only improves training efficiency
and accelerates convergence, but also enhances model performance in terms of prediction error.
• The framework is evaluated on multiple public benchmark datasets in various areas for trajectory
prediction, where the agents cover on-road vehicles, pedestrians, cyclists and sports players.

A high-level graphical illustration of the approach, where the encoding and decoding
horizons (re-encoding gap) are both set to 5 without loss of generality.
G(·) denotes the latent interaction graph obtained from the static encoding process,
and G’ (·) denotes the adjusted interaction graph with time dependence.

• Instead of end-to-end training in a single pipeline, the training process contains two consecutive stages:
• Static interaction graph learning: A series of encoding functions are trained to extract interaction
patterns from the observed trajectories and context information, and generate a distribution of static
latent interaction graphs.
• A series of decoding functions are trained to recurrently generate multi-modal distributions of future states.
• Dynamic interaction graph learning: The pre-trained encoding and decoding functions during the
first stage are utilized as an initialization, which are finetuned together with the training of a
recurrent network which captures the dynamics of interaction graph evolution.
• The recurrent unit can be treated as a highly flexible integration which takes past graphs into consideration.
• Due to the uncertainty of human intention and interaction outcomes, the prediction model should
capture the multi-modality of human behaviors and generate diverse prediction hypotheses which
represent various possible behavior patterns.

Dashed lines are historical trajectories, solid lines
with dots are ground truth, and dash-dotted lines
with crosses are prediction hypotheses.

Scenario-Transferable Semantic Graph Reasoning for
Interaction-Aware Probabilistic Prediction
• A number of methodologies have been proposed to solve prediction problems under different
traffic situations; however, these works either focus on one particular driving scenario (e.g.
highway, intersection, or roundabout) or do not take sufficient environment information (e.g.
road topology, traffic rules, and surrounding agents) into account.
• In fact, the limitation to certain scenarios is mainly due to the lack of generic representations
of the environment, and the insufficiency of environment information further limits the flexibility
and transferability of the predictor.
• This paper proposes a scenario-transferable and interaction-aware probabilistic prediction
algorithm based on semantic graph reasoning, which predicts behaviors of selected agents.
• It puts forward generic representations for various environment information, which take into
account Frenet frame coordinates, road topological elements, traffic regulations, as well as
dynamic insertion areas (DIA) and utilize them as building blocks to construct their spatio-
temporal structural relations.
• Then it takes advantage of these structured representations to develop a flexible and
transferable prediction algorithm, where the predictor can be directly used under unforeseen
driving circumstances that are completely different from training scenarios.

Given any driving scenario, it is able to
extract its generic static and dynamic
representations. It then utilizes semantic
graphs (SG) to describe spatio-temporal
relations within these representations and
make predictions based on the semantic
graph network (SGN) structure.

• Generic representation of the static environment:
• A traffic-free reference path can be obtained either from road’s centerline for constructed
roads or by averaging human driving paths from collected data for unconstructed areas.
• The Frenet frame can utilize any selected reference path as the reference coordinate, where
road geometrical information can be implicitly incorporated into the data without increasing
feature dimensions.
• Generic representation of the dynamic environment:
• A dynamic insertion area (DIA) is semantically defined as a dynamic area that can be inserted
or entered by agents on the road.
• DIAs under three basic topological elements: point-overlap, line-overlap, undecided-overlap.
• DIAs under different traffic regulations: traffic lights and signs.
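The Frenet-frame representation above can be illustrated by projecting a Cartesian point onto a polyline reference path. This is a minimal sketch, assuming a piecewise-linear path without zero-length segments; the function name `to_frenet` and the sign convention (positive offset to the left of the travel direction) are choices made here.

```python
import numpy as np

def to_frenet(point, path):
    """Project a 2D point onto a polyline reference path.

    Returns (s, d): arc length along the path and signed lateral offset,
    positive to the left of the direction of travel.
    """
    point = np.asarray(point, dtype=float)
    path = np.asarray(path, dtype=float)
    seg = path[1:] - path[:-1]
    seg_len = np.linalg.norm(seg, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    best = None
    for i in range(len(seg)):
        rel = point - path[i]
        # Fraction along segment i of the closest point, clamped to the segment.
        t = np.clip(rel @ seg[i] / seg_len[i] ** 2, 0.0, 1.0)
        foot = path[i] + t * seg[i]
        dist = np.linalg.norm(point - foot)
        if best is None or dist < best[0]:
            cross = seg[i][0] * rel[1] - seg[i][1] * rel[0]
            best = (dist, cum[i] + t * seg_len[i], np.sign(cross) * dist)
    return float(best[1]), float(best[2])
```

The road geometry is absorbed into the coordinates: a vehicle following a curved lane has a roughly constant `d`, so the feature dimension stays at two regardless of the road shape.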

• There are two types of semantic graph in SGN: 2D-SG
and 3D-SG.
• The 2D-SG is defined similarly to a traditional graph.
• The spatial and temporal relationships are jointly
described by edges in a 3D-SG, where the temporal
relation between any two nodes in a 3D-SG can
differ.
• By defining node attributes as semantic objects like
DIA, it implicitly encodes both static and dynamic
information into the graph.
• The edge attribute describes the relationship between
any two DIAs.
• For a 2D-SG, each edge may describe the strength of
correlation between its corresponding two DIAs at the
same time step; whereas for a 3D-SG, each edge may
represent some future information of the two DIAs.

Semantic graph network (SGN)
The input can be either a 2D-SG at the current time step or a sequence of historical 2D-SGs. The network
consists of a feature encoding layer, a spatial attention layer, a prediction encoding layer and an output layer.
The output is a set of 3D-SGs that describe how the current scene will progress in the future and thus
answer the prediction questions of interest.

Semantic intention and attention heatmap
for case 1 (a)-(d) and case 2 (e)-(g).
The predicted vehicle is colored in black.
The darker the color of the dynamic
insertion area, the higher probability for
it to be inserted by the predicted vehicle.
For each DIA that might be inserted by
the predicted vehicle, the corresponding
horizontal grids in the heatmap reflect
how much its states will be influenced by
other DIAs respectively.

Semantic intention and attention heatmap for test case 3 (a) and test case 4 (b).

VectorNet: Encoding HD Maps and Agent Dynamics
from Vectorized Representation
• Behavior prediction in dynamic, multi-agent systems is an important problem in the context of
self-driving cars, due to the complex representations and interactions of road components,
including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes,
traffic lights).
• This paper introduces VectorNet, a hierarchical graph neural network (GNN) that first exploits the
spatial locality of individual road components represented by vectors and then models the high-
order interactions among all components.
• In contrast to most recent approaches, which render trajectories of moving agents and road
context information as bird's-eye images and encode them with convolutional neural networks
(ConvNets), this approach operates on a vector representation.
• By operating on the vectorized high definition (HD) maps and agent trajectories, it avoids lossy
rendering and computationally intensive ConvNet encoding steps.
• To further boost VectorNet’s capability in learning context features, it proposes a novel auxiliary
task to recover the randomly masked out map entities and agent trajectories based on their
context.
• It also outperforms the state of the art on the Argoverse dataset.

• Most of the annotations from an HD map are in the form of splines (e.g. lanes), closed shapes (e.g.
regions of intersections) and points (e.g. traffic lights), with additional attribute info such as the
semantic labels of the annotations and their current states (e.g. color of the traffic light, speed
limit of the road).
• For agents, their trajectories are in the form of directed splines with respect to time.
• All of these elements can be approximated as sequences of vectors: for map features, pick a
starting point and direction, uniformly sample key points from the splines at the same spatial
distance, and sequentially connect the neighboring key points into vectors; for trajectories, just
sample key points with a fixed temporal interval (0.1 second), starting from t = 0, and connect
them into vectors.
• Given small enough spatial or temporal intervals, the resulting polylines serve as close
approximations of the original map and trajectories.
• To exploit the spatial and semantic locality of the nodes, it takes a hierarchical approach by first
constructing subgraphs at the vector level, where all vector nodes belonging to the same polyline
are connected with each other.

The computation flow on the vector nodes of
the same polyline. The polyline subgraph
network can be seen as a generalization of
PointNet. However, by embedding the ordering
information into vectors, constraining the
connectivity of subgraphs based on the
polyline groupings, and encoding attributes as
node features, this method is particularly
suitable to encode structured map annotations
and agent trajectories.
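The PointNet-like computation on one polyline can be sketched as a per-node encoder followed by a permutation-invariant pooling. This is a simplified sketch: the single linear layer with ReLU per layer, the max-pooling aggregator, the concatenation step and the names `subgraph_layer` and `polyline_feature` are assumptions for illustration.

```python
import numpy as np

def subgraph_layer(node_feats, weight, bias):
    """One layer of polyline-level message passing.

    node_feats: (V, D) vectors of one polyline; weight: (D, H); bias: (H,).
    Returns (V, 2H): each node's encoding concatenated with the max-pooled
    encoding of all nodes in the same polyline.
    """
    encoded = np.maximum(node_feats @ weight + bias, 0.0)  # per-node encoder + ReLU
    pooled = encoded.max(axis=0, keepdims=True)            # order-independent summary
    tiled = np.repeat(pooled, len(encoded), axis=0)
    return np.concatenate([encoded, tiled], axis=1)

def polyline_feature(node_feats, layers):
    """Stack several layers, then max-pool into one feature vector per polyline."""
    h = node_feats
    for weight, bias in layers:
        h = subgraph_layer(h, weight, bias)
    return h.max(axis=0)
```

Because pooling ignores node order, the ordering information has to live in the vectors themselves (start and end points), which is the point made in the caption above.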

• To encourage the global interaction graph to better capture interactions among different
trajectories and map polylines, it introduces an auxiliary graph completion task.
• In order to identify an individual polyline node when its corresponding feature is masked out, it
computes the minimum values of the start coordinates over all vectors belonging to that polyline to
obtain the identifier embedding.
• The graph completion objective is closely related to the widely successful BERT method for
natural language processing (NLP), which predicts missing tokens based on bidirectional context
from discrete and sequential text data.
• Unlike methods that generalize the BERT objective to unordered image patches with pre-computed
visual features, the proposed node features are jointly optimized in an end-to-end framework.
• The final multi-task training objective is $\mathcal{L} = \mathcal{L}_{\text{traj}} + \alpha \mathcal{L}_{\text{node}}$, where:
• $\mathcal{L}_{\text{traj}}$ is the negative Gaussian log-likelihood for the ground truth future trajectories,
$\mathcal{L}_{\text{node}}$ is the Huber loss between predicted node features and ground truth masked node
features, and $\alpha = 1.0$ is a scalar that balances the two loss terms.
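The two loss terms and their weighted sum can be sketched as follows. This is a minimal sketch, assuming a fixed isotropic standard deviation for the Gaussian term and a Huber threshold of 1.0; the function names are hypothetical.

```python
import numpy as np

def gaussian_nll(pred_mean, target, sigma=1.0):
    """Mean negative log-likelihood of target under N(pred_mean, sigma^2 I)."""
    var = sigma ** 2
    nll = 0.5 * (target - pred_mean) ** 2 / var + 0.5 * np.log(2 * np.pi * var)
    return float(np.mean(nll))

def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear beyond delta."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return float(np.mean(0.5 * quad ** 2 + delta * (err - quad)))

def multitask_loss(traj_pred, traj_gt, node_pred, node_gt, alpha=1.0):
    """Trajectory term plus alpha times the masked-node completion term."""
    return gaussian_nll(traj_pred, traj_gt) + alpha * huber(node_pred, node_gt)
```

With `alpha = 1.0` the auxiliary graph completion term contributes with the same weight as the trajectory term.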

[Figure: prediction; attention for road and agent]
