Driving Behavior for ADAS and Autonomous Driving VII

Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
• INFER: INtermediate representations for FuturE pRediction
• Deep Imitative Models for Flexible Inference, Planning, and Control
• Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
• AGen: Adaptable Generative Prediction Networks for Autonomous Driving
• Conditional Generative Neural System for Probabilistic Trajectory Prediction
• Coordination and Trajectory Prediction for Vehicle Interactions via Bayesian
Generative Modeling
• Interaction-aware Multi-agent Tracking and Probabilistic Behavior Prediction
via Adversarial Learning

DESIRE: Distant Future Prediction in
Dynamic Scenes with Interacting Agents
• DESIRE is a Deep Stochastic IOC (Inverse Optimal Control) RNN Encoder-Decoder framework
for predicting the future of interacting agents in dynamic scenes.
• DESIRE predicts future locations of objects in multiple scenes by 1) accounting for the multi-modal
nature of prediction (i.e., given the same context, the future may vary), 2) foreseeing potential
future outcomes and making a strategic prediction, and 3) reasoning not only from the past
motion history, but also from the scene context and the interactions among the agents.
• DESIRE achieves this in a computationally efficient way with a single end-to-end (E2E) trainable neural network model.
• The model first obtains a diverse set of hypothetical future prediction samples using a
conditional variational auto-encoder (CVAE); these samples are then ranked and refined by an RNN
scoring-regression module.
• Samples are scored by accounting for accumulated future rewards, which enables better
long-term strategic decisions similar to IOC frameworks.
• An RNN scene context fusion module jointly captures past motion histories, the semantic
scene context and interactions among multiple agents.
• A feedback mechanism iterates over ranking and refinement to boost prediction accuracy.
CVPR2017

(a) A driving scenario: The white van may steer left or right while trying to avoid a collision with other
dynamic agents. DESIRE produces accurate future predictions (shown as blue paths) by tackling the
multi-modality of future prediction while accounting for a rich set of both static and dynamic scene contexts. (b)
DESIRE generates a diverse set of hypothetical prediction samples, and then ranks and refines them
through a deep IOC network.

• Sample Generation Module
• Future prediction can be inherently ambiguous and has uncertainties as multiple plausible
scenarios can be explained under the same past situation (e.g., a vehicle heading toward an
intersection can make different turns);
• Thus, learning a deterministic function f that directly maps past trajectories to future trajectories
will under-represent potential prediction space and easily over-fit to training data.
• Moreover, a naively trained network with a simple loss will produce predictions that average out
all possible outcomes.
• This sample generation module produces a set of diverse hypotheses, critical to capturing the
multimodality of the prediction task, through an effective combination of a CVAE and an RNN
encoder-decoder.
• RNNs are implemented using gated recurrent units (GRU) to learn long-term dependencies, yet
they can be easily replaced with other popular RNNs like long short-term memory units (LSTM).
• The CVAE module generates a diverse set of future trajectories based on a past trajectory.
• Two loss terms: reconstruction loss and KLD loss.
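
The two loss terms can be sketched as follows. This is a minimal illustration, assuming a diagonal Gaussian latent variable with a standard normal prior and a mean squared L2 reconstruction error; the function names are hypothetical and the sketch is not taken from the DESIRE implementation.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kld_loss(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reconstruction_loss(y_true, y_pred):
    """Mean squared L2 distance between two trajectories of (x, y) points."""
    total = sum((a - c) ** 2 + (b - d) ** 2
                for (a, b), (c, d) in zip(y_true, y_pred))
    return total / len(y_true)


def cvae_loss(y_true, y_pred, mu, log_var):
    """Assumed unweighted sum of the two terms."""
    return reconstruction_loss(y_true, y_pred) + kld_loss(mu, log_var)
```

At test time, drawing several latent samples from the prior and decoding each one is what yields the diverse set of hypotheses.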

The overview of DESIRE. First, DESIRE generates multiple plausible prediction samples Yˆ via a CVAE-based
RNN encoder-decoder (Sample Generation Module). Then the following module assigns a reward to the
prediction samples at each time-step sequentially, as in IOC frameworks, and learns a displacement vector ∆Yˆ to
regress the prediction hypotheses (Ranking and Refinement Module). The regressed prediction samples are
refined by iterative feedback. The final prediction is the sample with the maximum accumulated future
reward. Note that the flow via aquamarine-colored paths is only available during the training phase.
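
The ranking and refinement steps described in this caption can be sketched as below. It is a toy illustration that assumes the per-step rewards and the displacement vectors have already been produced by the network; `rank_by_accumulated_reward` and `refine` are hypothetical names.

```python
def rank_by_accumulated_reward(rewards_per_sample):
    """Rank prediction samples by total reward, best first.

    rewards_per_sample[k][t] is the reward assigned to sample k at time-step t.
    Returns (order, totals), where order lists sample indices from best to worst.
    """
    totals = [sum(step_rewards) for step_rewards in rewards_per_sample]
    order = sorted(range(len(totals)), key=lambda k: totals[k], reverse=True)
    return order, totals


def refine(sample, displacement):
    """Shift every (x, y) point of a sample by its regressed displacement."""
    return [(x + dx, y + dy)
            for (x, y), (dx, dy) in zip(sample, displacement)]
```

The first index in `order` corresponds to the final prediction, i.e. the sample with the maximum accumulated future reward.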

• Ranking and Refinement Module
• Predicting a distant future can be far more challenging than predicting one close by.
• To tackle this, DESIRE adopts the decision-making process of reinforcement learning (RL), where
an agent is trained to choose the actions that maximize long-term rewards to achieve its goal.
• Instead of designing the reward function manually, however, IOC learns the unknown reward function.
• An RNN model assigns rewards to each prediction hypothesis and measures its
goodness based on the accumulated long-term rewards.
• The module then also directly refines the prediction hypotheses by learning displacements to the actual
prediction through another FC layer.
• Lastly, the module receives iterative feedback from the regressed predictions and keeps adjusting so
that it produces precise predictions at the end.
• There are two loss terms in training the IOC ranking and refinement module:
• Cross-entropy loss;
• Regression loss.
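• The total loss is a multi-task loss: the sum of the reconstruction and KLD losses of the sample
generation module and the cross-entropy and regression losses above.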

• Scene Context Fusion
• The RNN must contain information about 1) the
individual past motion context, 2) the semantic
scene context and 3) the interaction between
multiple agents, in order to provide proper
hidden representations that can score and
refine a prediction.
• It implements a spatial grid based pooling layer
similar to the social pooling (SP) layer in Social LSTM.
• Instead of max pooling over rectangular
grids, it adopts log-polar grids with
average pooling.
• Combined with CNN features, the SCF module
provides the RNN decoder with both static and
dynamic scene information.
• It learns consistency between semantics of
agents and scenes for reliable prediction.
Details of Scene Context Fusion (SCF) unit in
RNN Decoder2. Note that the input to the
GRU cell at each time-step integrates multiple
cues (i.e., the dynamics of agents, scene
context and interaction between agents).
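
The log-polar average pooling can be sketched as follows. This is a minimal sketch under assumed settings: the bin counts, the radius range and the function names are illustrative and are not given in the slides.

```python
import math


def log_polar_bin(dx, dy, n_radial=3, n_angular=4, r_min=0.5, r_max=8.0):
    """Map a relative offset to (radial_bin, angular_bin), or None if out of range."""
    r = math.hypot(dx, dy)
    if r < r_min or r >= r_max:
        return None
    # Radial bins are uniform in log(r), so nearby agents get finer resolution.
    frac = math.log(r / r_min) / math.log(r_max / r_min)
    radial = min(int(frac * n_radial), n_radial - 1)
    theta = math.atan2(dy, dx) % (2.0 * math.pi)
    angular = min(int(theta / (2.0 * math.pi) * n_angular), n_angular - 1)
    return radial, angular


def social_pool(ego_xy, neighbors, dim, n_radial=3, n_angular=4):
    """Average-pool neighbor hidden states into log-polar bins around the ego agent.

    neighbors is a list of ((x, y), hidden_vector) pairs.
    Returns a flat list of length n_radial * n_angular * dim.
    """
    sums = [[[0.0] * dim for _ in range(n_angular)] for _ in range(n_radial)]
    counts = [[0] * n_angular for _ in range(n_radial)]
    for (x, y), hidden in neighbors:
        cell = log_polar_bin(x - ego_xy[0], y - ego_xy[1], n_radial, n_angular)
        if cell is None:
            continue  # neighbor is too close or too far to be pooled
        i, j = cell
        counts[i][j] += 1
        for k in range(dim):
            sums[i][j][k] += hidden[k]
    pooled = []
    for i in range(n_radial):
        for j in range(n_angular):
            c = counts[i][j]
            pooled.extend(v / c if c else 0.0 for v in sums[i][j])
    return pooled
```

The pooled vector would then be combined with the CNN scene features and the agent's own dynamics to form the GRU input at each time-step.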

KITTI results (left 3 rows): Rows 1 and 2 in (b) show the highly reactive nature of RNN ED-SI (i.e., the prediction turns only after it
comes near a non-drivable area). In contrast, DESIRE shows its long-term prediction capability by considering potential future
rewards. DESIRE-SI also produces more convincing predictions in the presence of other vehicles. Stanford Drone Data
results (right 3 rows): Row 1 shows the multi-modal nature of the prediction problem. While the cyclist is making a
right turn, it is also possible that he turns around the roundabout (denoted with an arrow). DESIRE-SI predicts such an equally
possible future as the top prediction, while covering the ground truth future within the top 10 predictions. Rows 2 and 3 also
show that DESIRE-SI provides superior predictions by reasoning about both static and dynamic scene contexts.

INFER: INtermediate representations for
FuturE pRediction
• 2019.3
• In urban driving scenarios, forecasting future trajectories of surrounding vehicles is of
paramount importance.
• While several approaches for the problem have been proposed, the best-performing ones
tend to require extremely detailed input representations (e.g. image sequences).
• However, such methods do not generalize to datasets they have not been trained on.
• INFER proposes intermediate representations that are particularly well-suited for future prediction.
• As opposed to using texture (color) information, it relies on semantics and trains an autoregressive (AR) model
to accurately predict future trajectories of traffic participants (vehicles).
• Using semantics provides a significant boost over techniques that operate over raw pixel
intensities/disparities.
• Uncharacteristically for state-of-the-art approaches, these representations and models generalize to
completely different datasets, collected across several cities, and also across countries where
people drive on opposite sides of the road (left-hand vs. right-hand driving).
• Code and data: https://rebrand.ly/INFER-results.

• The design philosophy is based on the following three desired characteristics that
knowledge representation systems must possess:
• 1) Representational adequacy: to adequately represent task-relevant information.
• 2) Inferential adequacy: to infer traits that cannot be inferred from the original unprocessed data.
• 3) Generalizability: to generalize to other data distributions (for the same task).
• The model takes as input an intermediate representation of the scene semantics
(intermediate, because it is neither too primitive, e.g. raw pixel intensities, nor too abstract
e.g. velocities, steering angles).
• Using these intermediate representations, predict the plausible future locations of the
Vehicles of Interest (VoI).
• The proposed representation does not rely heavily on the camera viewing angle, as camera
mounting parameters (height, viewing angle, etc.) vary across datasets, and this approach
hopes to be robust to such variations.

Intermediate representations are first generated by fusing monocular images with depth information (from either stereo
or Lidar) and obtaining semantic and instance segmentation from the monocular image, followed by an orthographic
projection to bird’s-eye view. The generated intermediate representations are fed through the network, which
outputs a prediction of the target vehicle’s trajectory registered in the sensor coordinate frame.

• It formulates trajectory prediction as a per-cell regression over an occupancy grid.
• It uses the intermediate representations to simplify the objective and help the network
generalize better.
• It trains an autoregressive model that outputs the VoI’s position on an occupancy grid,
conditioned on the previous intermediate representations.
• It uses a simple Encoder-Decoder model connected by a convolutional LSTM to learn
temporal dynamics.
• It adds skip connections between corresponding encoder and decoder branches.
• The proposed trajectory prediction scheme takes as input a sequence of intermediate
representations and produces a single channel output occupancy grid.
• The training objective comprises two terms: reconstruction loss term, and safety loss term.
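
The autoregressive, per-cell formulation can be sketched as below. This is only an illustration under stated assumptions: `predict_grid` stands in for the trained network, the predicted grid is simply appended to the input sequence, the most likely cell is taken by argmax, and the grid resolution and origin cell are hypothetical parameters.

```python
def argmax_cell(grid):
    """Return (row, col) of the largest value in a 2D occupancy grid."""
    best, best_val = (0, 0), grid[0][0]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value > best_val:
                best, best_val = (r, c), value
    return best


def cell_to_position(row, col, resolution, origin_row, origin_col):
    """Convert a grid cell to metres relative to the origin cell (x right, y up)."""
    return ((col - origin_col) * resolution, (origin_row - row) * resolution)


def rollout(history, predict_grid, steps, resolution, origin):
    """Autoregressively predict `steps` positions of the vehicle of interest.

    Each predicted grid is appended to the input sequence before the next step.
    """
    frames = list(history)
    trajectory = []
    for _ in range(steps):
        grid = predict_grid(frames)
        row, col = argmax_cell(grid)
        trajectory.append(cell_to_position(row, col, resolution, *origin))
        frames.append(grid)
    return trajectory
```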

The qualitative results from the validation fold of KITTI showcase the efficacy of INFER-Skip in using the intermediate
representation to predict complex trajectories. For example, in the leftmost plot, the network is able to accurately
predict the unseen second curve in the trajectory (predicted and ground truth trajectories are shown in red and blue
color, respectively). The green and red 3D bounding boxes indicate start of preconditioning and start of prediction of the
vehicle of interest (VoI), respectively. It is worth noting that the predicted trajectories are well within the lane (dark gray)
and road region (cyan), while avoiding collisions with the obstacles (magenta).

Deep Imitative Models for Flexible Inference,
Planning, and Control
• Imitation learning produces behavioral policies with limited flexibility to accommodate new
goals at test-time.
• In contrast, model-based reinforcement learning (MBRL) can plan to arbitrary goals using a
predictive dynamics model learned from data.
• It proposes “imitative models” to combine the benefits of imitation learning and MBRL.
• Imitative models are probabilistic predictive models able to plan interpretable expert-like
trajectories to achieve arbitrary goals.
• Inference with them resembles trajectory optimization in model-based reinforcement
learning, and learning them resembles imitation learning.
• This method substantially outperforms six direct imitation learning approaches (five of them
prior work) and an MBRL approach in a dynamic simulated autonomous driving task, and
can be learned efficiently from a fixed set of expert demonstrations without additional online
data collection.
2019.6

Applying the algorithm to navigation in CARLA. Left: Image depicting the current scene, in which the light recently
turned from green to red. Left-Middle: Plot showing LIDAR observations of the agent, the goals it received from a
route planner, and the plan produced by the method. The model smoothly chooses between goals based on its
prior of expert behavior. Here, the stationary agent chooses to accelerate to follow the vehicle ahead. Right-Middle:
Image depicting an intersection scene. Right: LIDAR observations, goals, cost map of simulated potholes, and a
variety of plans the method produces, colored by the planner’s preference. Although the imitative model never
observed pothole-avoidance behavior, it is able to plan a reasonable, on-road path around them with a test-time
cost map. Its preferred plan enters the intersection and steers around a pothole.

• Reinforcement learning (RL) algorithms automatically learn desirable behaviors from raw
sensory inputs with minimal engineering. However, RL generally requires online learning: the
agent must collect more data with its latest strategy, use it to update a model, and repeat.
• Deploying a partially-trained policy on a real-world autonomous system can be dangerous.
• Learning behavior should therefore happen offline, from expert demonstrations.
• How can such demonstrations be incorporated into an autonomous car to perform a variety of tasks?
• One option is imitation learning (IL), which learns policies that stay near the expert’s distribution.
• Another is model-based methods, which can use the data to fit a dynamics model, and in
principle can be used with planning algorithms to achieve any user-specified goal.
• However, model-based (MB) and model-free RL algorithms are vulnerable to distributional
drift: when acting according to the learned model or policy, the agent visits states different
from those seen in training, and in those states it is unlikely to determine an effective course of action.
• This is problematic when the data intentionally excludes adverse events such as crashes.
• Therefore, MBRL algorithms usually require online collection and training.

Imitative planning to goals: multi-goal
waypoint planning enables fine-grained
control of the plans.
Costs can be assigned to “potholes” only seen at test-time;
expert demonstrations with potholes were never observed.
The planner prefers routes around the potholes.

• IL algorithms use expert demonstration data and, despite similar drift shortcomings, can
sometimes learn effective policies without online data collection. However, standard IL offers
little task flexibility since it only predicts low-level behavior.
• While several works augmented IL with goal conditioning, these goals must be specified in
advance during training, and are typically simple (e.g., turning left or right).
• The goal is to devise an algorithm that combines the advantages of IL and MBRL by offering
the flexibility to achieve new user-specified goals and the ability to learn from offline data.
• By learning a deep conditional probabilistic forecasting model from expert data, it captures
the distribution of expert behaviors without using manually designed reward functions.
• To plan to a goal, this method infers the most probable expert state trajectory under a
posterior distribution induced by the model and a task-specifying goal distribution.
• By incorporating a model-based representation, it can easily plan to previously unseen
user-specified goals while behaving similarly to the expert, and can be flexibly repurposed to
perform a variety of test-time tasks without any additional training.
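
The planning idea in the bullets above can be sketched as a scoring rule: the log-probability of a trajectory under the learned expert model, plus the log-likelihood of the goal, minus any test-time cost. The sketch below makes several simplifying assumptions that are not from the slides: rather than optimizing over trajectories, it picks from a finite candidate set, the goal likelihood is a Gaussian around the nearest waypoint, and `expert_log_prob` is a placeholder callable for the learned model.

```python
import math


def goal_log_likelihood(trajectory, waypoints, sigma=1.0):
    """Unnormalized Gaussian log-likelihood of ending near the closest waypoint."""
    fx, fy = trajectory[-1]
    d2 = min((fx - wx) ** 2 + (fy - wy) ** 2 for wx, wy in waypoints)
    return -0.5 * d2 / sigma ** 2


def plan(candidates, expert_log_prob, waypoints, extra_cost=None):
    """Pick the candidate maximizing the imitation prior plus the goal likelihood.

    extra_cost is an optional test-time cost (e.g. for potholes) that is
    subtracted from the score without retraining the model.
    """
    def score(trajectory):
        s = expert_log_prob(trajectory) + goal_log_likelihood(trajectory, waypoints)
        if extra_cost is not None:
            s -= extra_cost(trajectory)
        return s

    return max(candidates, key=score)
```

Because the cost only enters at planning time, the same trained model can be reused for tasks that were never demonstrated, which is the flexibility the method aims for.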

Illustration of the method applied to autonomous driving. This method trains an imitative model
from a dataset of expert examples. After training, the model is repurposed as an imitative planner.
At test-time, a route planner provides waypoints to the imitative planner, which computes expert-
like paths to each goal. The best plan is chosen according to the planning objective and provided
to a low-level PID-controller in order to produce steering and throttle actions.

Tolerating bad waypoints. The planner prefers waypoints in the distribution of expert
behavior (on the road at a reasonable distance). Columns 1, 2: Planning with ½ decoy
waypoints. Columns 3,4: Planning with all waypoints on the wrong side of the road.

Test-time plans prefer steering around potholes.
Table: Robustness to waypoint noise and test-time pothole adaptation.
The method is robust to waypoints on the wrong side of the road, and fairly
robust to decoy waypoints. The method is flexible enough to safely produce
behavior not demonstrated (pothole avoidance) by incorporating a test-
time cost. Ten episodes are collected in each Town.

Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
• Accurate prediction of others’ trajectories is essential for autonomous driving.
• Trajectory prediction is challenging because it requires reasoning about agents’ past
movements, social interactions among varying numbers and kinds of agents, constraints
from the scene context, and the stochasticity of human behavior.
• This approach models these interactions and constraints jointly within a Multi-Agent
Tensor Fusion (MATF) network.
• Specifically, the model encodes multiple agents’ past trajectories and the scene context
into a Multi-Agent Tensor, then applies convolutional fusion to capture multiagent
interactions while retaining the spatial structure of agents and the scene context.
• The model decodes recurrently to multiple agents’ future trajectories, using adversarial
loss to learn stochastic predictions.
• Experiments on both highway driving and pedestrian crowd datasets show that the model
achieves state-of-the-art prediction accuracy.
2019.7

• There are two parallel encoding streams in the MATF architecture.
• One encodes the past trajectories of each individual agent xi independently using single agent
LSTM encoders, and another encodes the static scene context image c with a CNN.
• Each LSTM encoder shares the same set of parameters, so the architecture is invariant to the
number of agents in the scene.
• The outputs of the LSTM encoders are 1-D agent state vectors {x′1 , x′2 , .., x′n } without
temporal structure.
• The output of the scene context encoder CNN is a scaled feature map c′ retaining the spatial
structure of the bird’s-eye view static scene context image.
• Next, the two encoding streams are concatenated spatially into a Multi-Agent Tensor.
• Agent encodings {x′1, x′2, .., x′n} are placed into one bird’s-eye view spatial tensor, which is
initialized to 0 and is of the same shape (width and height) as the encoded scene image c′.

• The dimension axis of the encodings fits into the channel axis of the tensor.
• The agent encodings are placed into the spatial tensor with respect to their positions at the
last time step of their past trajectories.
• This tensor is then concatenated with the encoded scene image in the channel dimension to
get a combined tensor. If multiple agents are placed into the same cell in the tensor due to
discretization, element-wise max pooling is performed.
• The Multi-Agent Tensor is fed into fully convolutional layers, which learn to represent
interactions among multiple agents and between agents and the scene context, while
retaining spatial locality, to produce a fused Multi-Agent Tensor.
• Specifically, these layers operate at multiple spatial resolution scale levels by adopting U-
Net-like architectures to model interaction at different spatial scales.
• The output feature map of this fused model c′′ has exactly the same shape as c′ in width and
height to retain the spatial structure of the encoding.
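
Building the Multi-Agent Tensor can be sketched with plain nested lists. This is a toy illustration: the function names are hypothetical, agent positions are assumed to be already discretized into (row, col) cells, and the convolutional fusion layers are not modeled.

```python
def build_agent_tensor(agent_vectors, agent_cells, height, width):
    """Scatter agent encodings into a zero-initialized H x W x D grid.

    When several agents fall into the same cell, their encodings are merged
    by element-wise max pooling.
    """
    dim = len(agent_vectors[0])
    tensor = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    occupied = set()
    for vec, (r, c) in zip(agent_vectors, agent_cells):
        if (r, c) in occupied:
            tensor[r][c] = [max(a, b) for a, b in zip(tensor[r][c], vec)]
        else:
            tensor[r][c] = list(vec)
            occupied.add((r, c))
    return tensor


def concat_channels(agent_tensor, scene_features):
    """Concatenate agent channels and scene channels cell by cell."""
    return [[a + s for a, s in zip(agent_row, scene_row)]
            for agent_row, scene_row in zip(agent_tensor, scene_features)]
```

Tracking occupied cells separately keeps negative encoding values from being clipped against the zero initialization.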

The Multi-Agent Tensor encoding is a spatial
feature map of the scene context and multiple
agents from an overhead perspective, including
agent channels (above) and context channels
(below). Agents’ feature vectors (red) output
from single-agent LSTM encoders are placed
spatially w.r.t. agents’ coordinates to form the
agent channels. The agent channels are aligned
spatially with the context channels (a context
feature map) output from scene context
encoding layers to retain the spatial structure.

• To decode each agent’s predicted trajectory, agent-specific representations with fused
interaction features for each agent {x1′′ , x2′′ , .., xn′′ } are sliced out according to their
coordinates from the fused Multi-Agent Tensor output c′′.
• These agent-specific representations are then added as a residual to the original encoded
agent vectors to form final agent encoding vectors {x1′ + x1′′ , x2′ + x2′′ , ..., xn′ + xn′′ }, which
encode all the information from the past trajectories of the agents themselves, the static
scene context, and the interaction features among multiple agents.
• In this way, this approach allows each agent to get a different social and contextual
embedding focused on itself.
• Importantly, the model gets these embeddings for multiple agents using shared feature
extractors instead of operating n times for n agents.
• Finally, for each agent in the scene, its final vector xi′ + xi′′ is decoded to future trajectory
prediction yiˆ by LSTM decoders.
• Similar to the encoders for each agent, parameters are shared to guarantee that the network
can generalize well when the number of agents in the scene varies.
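
The slicing and residual step can be sketched as below, assuming the fused tensor has the same number of channels as the agent encodings and that each agent's cell is known; the function name is hypothetical.

```python
def final_agent_encodings(agent_vectors, agent_cells, fused_tensor):
    """Compute x_i' + x_i'' for every agent.

    x_i'' is the feature vector sliced from the fused tensor at the agent's
    (row, col) cell; it is added as a residual to the original encoding x_i'.
    """
    finals = []
    for vec, (r, c) in zip(agent_vectors, agent_cells):
        fused = fused_tensor[r][c]
        finals.append([a + b for a, b in zip(vec, fused)])
    return finals
```

The fused tensor is computed once for the whole scene, so adding agents only adds cheap lookups rather than extra passes through the fusion layers.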

Illustration of the Multi-Agent Tensor Fusion (MATF) architecture.

Qualitative results from Massachusetts driving dataset. Past trajectories are shown in different colors for each
vehicle, followed by 100 sampled future trajectories. Ground truth future trajectories are shown in black, and lane
centers are shown in gray. (a) A complex scenario involving five vehicles; MATF accurately predicts the trajectory and
velocity profile for all. (b) MATF correctly predicts that the red vehicle will complete a lane change. (c) MATF
captures the uncertainty over whether the red vehicle will take the highway exit. (d) As soon as the purple vehicle
passes a highway exit, MATF predicts it will not take that exit. (e) Here, MATF fails to predict the precise ground truth
trajectory; however, the red vehicle is predicted to initiate a lane change maneuver in a very small number of
sampled trajectories.

AGen: Adaptable Generative Prediction
Networks for Autonomous Driving
• In highly interactive driving scenarios, accurate prediction of other road participants is critical
for safe and efficient navigation of autonomous cars.
• Prediction is challenging because it is difficult to model diverse driving behaviors, or to
learn such a model.
• The model should be interactive and reflect individual differences.
• Imitation learning methods, such as parameter sharing generative adversarial imitation
learning (PS-GAIL), are able to learn interactive models.
• However, the learned models average out individual differences.
• When used to predict trajectories of individual vehicles, these models are biased.
• An adaptable generative prediction framework (AGen) performs online adaptation of the
offline learned models to recover individual differences for better prediction.
• In particular, AGen combines the recursive least squares parameter adaptation algorithm
(RLS-PAA) with the offline learned model from PS-GAIL.
• RLS-PAA has analytical solutions and is able to adapt the model for every single vehicle
efficiently online.
IVS 2019

Offline model learning extracts features for
average driving behavior. Online model adaptation
can perturb the average model to fit the behavior
of a specific driver at a specific time. In particular,
take the offline pretrained policy network of PS-
GAIL as the feature extractor for averaged driving
behavior, while adapting individual vehicle
behavior using RLS-PAA online.
Heterogeneity among drivers needs to be explicitly
accounted for to improve prediction accuracy in
real-world scenarios. As mentioned earlier, it is
intractable to fit a policy network for every
individual vehicle. To make heterogeneous
prediction scalable, combine offline model
learning with online model adaptation.
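
A recursive least squares update of the kind RLS-PAA builds on can be sketched as follows. The sketch assumes a scalar output that is linear in a fixed feature vector, for instance the features produced by the pretrained policy network; which parameters AGen actually adapts is not stated in the slides, and the forgetting factor is an illustrative parameter.

```python
def rls_update(theta, P, phi, y, forgetting=1.0):
    """One recursive-least-squares step for a scalar output y ~ phi . theta.

    theta: current parameter estimate (list of floats)
    P:     current symmetric covariance-like matrix (list of lists)
    phi:   feature vector for this observation
    y:     observed target
    Returns the updated (theta, P).
    """
    n = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = forgetting + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    error = y - sum(phi[i] * theta[i] for i in range(n))
    new_theta = [theta[i] + gain[i] * error for i in range(n)]
    # P <- (P - K phi^T P) / lambda; phi^T P equals P_phi^T because P is symmetric.
    new_P = [[(P[i][j] - gain[i] * P_phi[j]) / forgetting for j in range(n)]
             for i in range(n)]
    return new_theta, new_P
```

Each step is a closed-form update with no gradient iterations, which is why this kind of adaptation is cheap enough to run per vehicle online.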

(a) Offline training using PS-GAIL. The critic computes
the difference between the expert trajectory and the roll-out
trajectory from the policy network. PS-GAIL
iteratively updates the policy to minimize the difference
and the critic to maximize it.
(b) Online adaptation using RLS-PAA. The critic computes the 2-norm
difference between the expert trajectory and the roll-out
trajectory from the policy network. RLS-PAA updates the policy
network to minimize the difference, using either 1-step or 2-step adaptation.

Predicted 2 s trajectories for 22 agents after 3 s of adaptation.
Average position RMSE over time
in the 22-agent scenario

Conditional Generative Neural System for
Probabilistic Trajectory Prediction
• Effective understanding of the environment and accurate trajectory prediction of
surrounding dynamic obstacles are critical for intelligent systems such as autonomous
vehicles and wheeled mobile robotics navigating in complex scenarios to achieve safe and
high-quality decision making, motion planning and control.
• Due to the uncertain nature of the future, it is desired to make inference from a probability
perspective instead of deterministic prediction.
• They propose a conditional generative neural system (CGNS) for probabilistic trajectory
prediction to approximate the data distribution, with which realistic, feasible and diverse
future trajectory hypotheses can be sampled.
• The system combines the strengths of conditional latent space learning and variational
divergence minimization, and leverages both static context and interaction information with
soft attention mechanisms.
• The authors also propose a regularization method for incorporating soft constraints into deep neural
networks with differentiable barrier functions, which can regulate and push the generated
samples into the feasible regions.
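
A soft constraint expressed as a differentiable barrier term can be sketched as below. The slides do not give the barrier function used in CGNS, so this sketch assumes a softplus penalty on a simple corridor constraint (lateral offset within a half-width); all names and parameter values are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def barrier_penalty(trajectory, half_width=2.0, sharpness=5.0):
    """Smooth penalty that grows when points leave the corridor |lateral| <= half_width.

    trajectory is a list of (longitudinal, lateral) points. The violation
    lateral^2 - half_width^2 is negative inside the corridor and positive outside.
    """
    total = 0.0
    for _, lateral in trajectory:
        violation = lateral * lateral - half_width * half_width
        total += softplus(sharpness * violation) / sharpness
    return total / len(trajectory)
```

Added to the training loss with a weight, a term like this is close to zero for feasible samples and increases smoothly for infeasible ones, so gradients push generated trajectories back toward the feasible region.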
2019.7

Typical urban traffic scenarios with large uncertainty and interactions
among multiple entities. The shaded areas represent the reachable
sets of possible trajectories. (a) Unsignalized roundabout with four-
way yield signs; (b) Unsignalized intersection with four-way stop signs.

• Requirements to generate diverse, realistic future trajectories:
• 1) Context-aware: The system should be able to forecast trajectories which are inside the
traversable regions and collision-free with static obstacles in the environment. For instance,
when the vehicles navigate in a roundabout they need to advance along the curves and
avoid collisions with road boundaries.
• 2) Interaction-aware: The system needs to generate reasonable trajectories compliant to
traffic or social rules, which takes into account interactions and reactions among multiple
entities. For instance, when the vehicles approach an unsignalized intersection, they need to
anticipate others’ possible intentions and motions as well as the influences of their own
behaviors on surrounding entities.
• 3) Feasibility-aware: The system should anticipate naturalistic and physically-feasible
trajectories which are compliant to vehicle kinematics or dynamics constraints, although
these constraints can be ignored for pedestrians due to the large flexibility of their motions.
• 4) Probabilistic prediction: Since the future is full of uncertainty, the system should be able to
learn an approximated distribution of future trajectories close to data distribution and
generate diverse samples which represent various possible behavior patterns.

Overview of the proposed conditional generative neural system (CGNS), which consists of four key components:
(a) a deep feature extractor with soft attention mechanism, which extracts multi-level features from scene
context image sequences and trajectories; (b) an encoder to learn conditional latent space representations; (c) a
generator (decoder) to sample future trajectory hypotheses; (d) a discriminator to distinguish predicted
trajectories from ground truth.

Fig. 3. The visualization of the context image masks and trajectory block attention masks. In the
trajectory masks, there are 4 rows representing 4 historical time steps and 6 columns representing 6
vehicles in the scene. The first column corresponds to the predicted vehicle and the others correspond to
surrounding ones. Brighter colors indicate larger attention weights. The predicted vehicles are indicated with
red bounding boxes. In all the cases, the image masks have a large weight around the predicted vehicle and the
area in its heading direction. In the first three cases, only the historical trajectories of the predicted vehicle are
assigned large attention weights, which implies that the other vehicles have little effect in these situations.
However, in the last three cases, more attention is paid to other vehicles since there exist strong interactions which
increase the inter-dependency.

Coordination and Trajectory Prediction for Vehicle
Interactions via Bayesian Generative Modeling
• Coordination recognition and subtle pattern prediction of future trajectories play a significant
role when modeling interactive behaviors of multiple agents.
• Because the future evolution is inherently uncertain, deterministic predictors are
not sufficiently safe and robust.
• To tackle the task of probabilistic prediction for multiple interactive entities, the authors propose a
coordination and trajectory prediction system (CTPS), which has a hierarchical structure
including a macro-level coordination recognition module and a micro-level subtle pattern
prediction module which solves a probabilistic generation task.
• Two types of representation of the coordination variable are used: categorized and real-valued.
• Bayesian deep learning is incorporated into generative models to generate diversified prediction hypotheses.
• The proposed system is tested on multiple driving datasets in various traffic scenarios, which
achieves better performance than baseline approaches in terms of a set of evaluation metrics.
• Using the categorized coordination can better capture multi-modality and generate more
diversified samples than the real-valued coordination, while the latter can generate prediction
hypotheses with smaller errors with a sacrifice of sample diversity.
• NNs with weight uncertainty are able to generate samples with larger variance and diversity.
2019.5

Typical highway and urban driving scenarios
where two or more entities coordinate and
interact with each other. The shaded areas
represent possible future motions which
consider multi-modality. (a) Ramp merging and
lane change behaviors on highway scenarios;
(b) Unsignalized roundabout with yield signs;
(c) Unsignalized intersection with stop signs.
Although the contexts are different, they can
be treated as generalized merging scenarios.

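• The multi-modal conditional distribution of future trajectories for interactive agents can be
factorized into a distribution over the coordination variable c (categorized or real-valued) and a
distribution over future trajectories conditioned on c.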
• This factorization naturally divides the system into a coordination recognition module
(macro-level) and a subtle pattern prediction module (micro-level).
• The coordination c can either be categorized to represent meaningful semantics, or
be a real-valued vector that encodes the underlying representations.
• If c is categorized, the micro-level module takes c in as an indicator through one-hot
encoding; if c is a real-valued variable, the micro-level module takes c in as an additional
input feature.
• The macro-level module is based on a variational recurrent neural network (VRNN)
followed by a probabilistic classifier.
• And the micro-level module is based on a Coordination-Bayesian Conditional Generative
Adversarial Network (C-BCGAN).
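
How the micro-level module might consume the two kinds of coordination variable can be sketched as below. This is a minimal sketch under assumptions: the generator input is formed by simple concatenation of history features, coordination and noise, and the function names are hypothetical.

```python
def one_hot(index, n_classes):
    """Return a one-hot list with a 1.0 at position `index`."""
    if not 0 <= index < n_classes:
        raise ValueError("class index out of range")
    return [1.0 if i == index else 0.0 for i in range(n_classes)]


def generator_input(history_features, coordination, noise, n_classes=None):
    """Concatenate history features, the coordination variable and noise.

    A categorized coordination (an int class index) enters as a one-hot
    indicator; a real-valued coordination (a sequence of floats) is appended
    unchanged as additional input features.
    """
    if isinstance(coordination, int):
        if n_classes is None:
            raise ValueError("n_classes is required for categorized coordination")
        c = one_hot(coordination, n_classes)
    else:
        c = [float(v) for v in coordination]
    return list(history_features) + c + list(noise)
```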

Overview of CTPS: (a) Coordination recognition module: The coordination variable can be discrete categories or continuous
real-valued vectors. The discrete distribution of categorized coordination is obtained by a probabilistic classifier based on latent
features extracted by VRNN. The continuous distribution of real-valued coordination is obtained by maximizing mutual info based
on a VAE-style model. Choose either formulation according to the objective and emphasis in particular tasks; (b) Subtle pattern
prediction module: The model is based on the proposed C-BCGAN in which the generator takes as input the historical info,
coordinator indicator as well as a noise from the normal distribution. Weight uncertainties are incorporated in both generator and
discriminator network.

The visualization of prediction results in the highway scenario. (a) Generation with learned coordination; (b)
Generation with real-valued coordination. Note that only the longitudinal motions are predicted for surrounding
vehicles, whereas both longitudinal and lateral motions are predicted for the center vehicle. This is why the predicted
trajectories of surrounding vehicles do not have lateral deviation.

Interaction-aware Multi-agent Tracking and Probabilistic
Behavior Prediction via Adversarial Learning
• To enable high-quality decision making and motion planning for intelligent systems such as
robots and autonomous vehicles, accurate probabilistic predictions for surrounding interactive
objects are a crucial prerequisite.
• Although many research studies have been devoted to making predictions on a single entity, it remains
an open challenge to forecast future behaviors for multiple interactive agents simultaneously.
• This work takes advantage of the Generative Adversarial Network (GAN), due to its capability of
distribution learning, and proposes a generic multi-agent probabilistic prediction and tracking
framework which takes the interactions among multiple entities into account and in which all the entities
are treated as a whole.
• However, since GANs are very hard to train, the authors conduct an empirical study and present the relationship
between training performance and hyperparameter values with a numerical case study.
• The results imply that the proposed model can capture the mean, variance and multi-modality
of the ground truth distribution.
• Moreover, the proposed approach is applied to a real-world task of vehicle behavior prediction to
demonstrate its effectiveness and accuracy.
• The proposed model trained by adversarial learning can achieve a better prediction performance
than other SoA models trained by traditional supervised learning which maximizes the data likelihood.
• The well-trained model can also be utilized as an implicit proposal distribution for
particle-filter-based Bayesian state estimation.
2019.4

The general diagram of the proposed model, which consists of a generator network and a discriminator network.

• The proposed approach is applied to a trajectory prediction task for interactive on-road
vehicles as an illustrative example, although it can also be used for many other tasks such
as interactive pedestrian trajectory prediction and human-robot interaction.
A typical highway scenario is investigated, where the gray car is the ego vehicle which aims
at forecasting future motions of its surrounding vehicles (red, green and yellow ones). The
observations of the environment can be obtained by on-board sensors. The approach can also
be adopted in overhead traffic surveillance systems with camera-based monitors.

Visualization of cases. (a) lane change left; (b) lane change right. The red dashed lines are ground truth trajectories.
