PR-284: End-to-End Object Detection with Transformers (DETR), by Jinwon Lee
This is the review of paper #284 from the TensorFlow Korea paper-reading group PR12.
The paper is DETR (DEtection TRansformer) from Facebook.
It also sits at the very top of arxiv-sanity's top recent / last year list (http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all).
With ViT recently submitted to ICLR 2021, there is much discussion about whether Transformers will now replace CNNs. This paper was presented at ECCV this year, and although it still uses a CNN for feature extraction, I consider it an important work because it proposes a way to perform object detection effectively with a transformer. The paper points out that detection pipelines rely on heuristic, non-differentiable components such as anchor boxes and NMS (Non-Maximum Suppression), which is why object detection, unlike most deep learning tasks, has resisted the end-to-end philosophy of deep learning. As a solution, it casts bounding-box prediction as a set prediction problem (no duplicates allowed, order-invariant) and proposes an end-to-end algorithm built on a transformer. If you would like the details of the DETR algorithm, which needs neither anchor boxes nor NMS, please watch the video!
Video link: https://youtu.be/lXpBcW_I54U
Paper link: https://arxiv.org/abs/2005.12872
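The set prediction formulation above hinges on a one-to-one bipartite matching between predictions and ground-truth boxes before the loss is computed. Below is a minimal sketch of that idea, not DETR's actual implementation (which runs a Hungarian solver over roughly 100 predictions with a cost mixing class probability, L1 box distance, and generalized IoU); here the cost is simplified to L1 box distance minus the predicted probability of the true class, and the search is brute force over tiny inputs:

```python
from itertools import permutations

def match_predictions_to_targets(pred_boxes, pred_scores, tgt_boxes, tgt_labels):
    """Brute-force bipartite matching (fine only for tiny N; DETR uses the
    Hungarian algorithm). Cost per (prediction, target) pair is simplified to
    L1 box distance minus the predicted score for the target's class."""
    n_pred, n_tgt = len(pred_boxes), len(tgt_boxes)
    best_cost, best_assign = float("inf"), None
    # try every assignment of targets to distinct predictions
    for perm in permutations(range(n_pred), n_tgt):
        cost = 0.0
        for t, p in enumerate(perm):
            l1 = sum(abs(a - b) for a, b in zip(pred_boxes[p], tgt_boxes[t]))
            cost += l1 - pred_scores[p][tgt_labels[t]]
        if cost < best_cost:
            best_cost, best_assign = cost, list(perm)
    return best_assign  # best_assign[t] = index of the prediction matched to target t
```

Because each target is matched to exactly one prediction, all unmatched predictions can be trained toward a "no object" class, which is what removes the need for NMS at inference time.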
Prediction and Planning for Self-Driving at Waymo, by Yu Huang
ChauffeurNet: Learning To Drive By Imitating The Best And Synthesizing The Worst
Multipath: Multiple Probabilistic Anchor Trajectory Hypotheses For Behavior Prediction
VectorNet: Encoding HD Maps And Agent Dynamics From Vectorized Representation
TNT: Target-driven Trajectory Prediction
Large Scale Interactive Motion Forecasting For Autonomous Driving: The Waymo Open Motion Dataset
Identifying Driver Interactions Via Conditional Behavior Prediction
Peeking Into The Future: Predicting Future Person Activities And Locations In Videos
STINet: Spatio-temporal-interactive Network For Pedestrian Detection And Trajectory Prediction
Yinyin Liu presents a model for object detection and localization, called Fast R-CNN. She will show how to introduce an ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin will run through a demo of applying the trained model to detect new objects.
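To illustrate what the ROI pooling layer mentioned above computes (framework-agnostic; this is not neon's API), here is a small NumPy sketch: each region of interest is divided into a fixed grid of bins and max-pooled, so an ROI of any size produces a fixed-size feature map:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """Max-pool the ROI region of a 2D feature map into a fixed
    output_size x output_size grid, as in Fast R-CNN's ROI pooling layer.
    roi = (x0, y0, x1, y1) in feature-map coordinates, end-exclusive."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # integer bin edges, mimicking the quantized pooling of the original layer
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    out = np.zeros((output_size, output_size), dtype=region.dtype)
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

In the full network this runs per ROI and per channel, and the fixed-size output is what lets a fully connected classification head follow arbitrary region proposals.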
Multi-object tracking is a computer vision task that tracks objects of different categories, such as cars, pedestrians, and animals, by analyzing video.
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
Computer vision has received great attention over the last two decades.
This research field is important not only in security-related software but also in the advanced interface between people and computers, advanced control methods, and many other areas.
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I..., by Joonhyung Lee
A presentation introducing DeepLab V3+, the state-of-the-art architecture for semantic segmentation. It also includes detailed descriptions of how 2D multi-channel convolutions function, as well as a detailed explanation of depth-wise separable convolutions.
Object detection is a central problem in computer vision and underpins many applications from medical image analysis to autonomous driving. In this talk, we will review the basics of object detection from fundamental concepts to practical techniques. Then, we will dive into cutting-edge methods that use transformers to drastically simplify the object detection pipeline while maintaining predictive performance. Finally, we will show how to train these models at scale using Determined’s integrated deep learning platform and then serve the models using MLflow.
What you will learn:
Basics of object detection including main concepts and techniques
Main ideas from the DETR and Deformable DETR approaches to object detection
Overview of the core capabilities of Determined’s deep learning platform, with a focus on its support for effortless distributed training
How to serve models trained in Determined using MLflow
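Among the detection basics covered in the talk is non-maximum suppression, the duplicate-pruning step that DETR and Deformable DETR eliminate. A minimal greedy NMS can be sketched as follows (plain tuples for boxes; the 0.5 threshold is just an illustrative default):

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it above the threshold, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

DETR's set-based training loss penalizes duplicate predictions directly, which is why this hand-tuned, non-differentiable step can be dropped from the pipeline.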
Crime Rate Analysis Using k-NN in Python
Slides for Multi-Object Tracking by Md. Minhazul Haque, Rajshahi University of Engineering and Technology
* Object
* Object Tracking
* Application
* Background Study
* How it works
* Multi-Object Tracking
* Solution
* Future Works
Efficient and accurate object detection has been an important topic in the advancement of computer vision systems.
Our project aims to detect objects with high accuracy at real-time speed.
In this project, we use a fully deep-learning-based approach to solve the object detection problem.
The input to the system is a real-time image, and the output is a bounding box for every object in the image, along with the class of the object in each box.
Objective -
Develop an application that detects objects; when the object is a vehicle such as a bicycle or car, it can count how many vehicles have passed through a particular area or road, and it can also recognize human activity.
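One common way to implement the vehicle-counting part of such an application (an assumption on my part; the objective above does not specify the method) is to track detected-object centroids across frames and count crossings of a virtual counting line:

```python
def count_line_crossings(tracks, line_y):
    """Count how many tracked objects cross a horizontal counting line.
    tracks: {track_id: [(x, y), ...]} centroid history per tracked object.
    A crossing is a consecutive pair of centroids on opposite sides of line_y."""
    count = 0
    for points in tracks.values():
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if (y0 - line_y) * (y1 - line_y) < 0:  # sign change = crossed the line
                count += 1
                break  # count each track at most once
    return count
```

In a full system, the track histories would come from associating per-frame detector boxes (e.g. by centroid distance or IoU), and separate lines or directions would distinguish entering from leaving traffic.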
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, by Hyeongmin Lee
PR12 Season 4 has finally begun! The first paper I am presenting this season is "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". View synthesis is the task of synthesizing images of a subject from unseen positions and viewing directions, given images captured from only a few viewpoints. To do this, the paper has a neural network memorize the subject's entire 3D information. This approach is becoming well known under the name Implicit Neural Representation, and a growing number of works are applying it to 2D images as well.
Video link: https://youtu.be/zkeh7Tt9tYQ
Paper link: https://arxiv.org/abs/2003.08934
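At the core of NeRF is differentiable volume rendering: the network outputs a density and colour at sampled points along each camera ray, and the pixel value is an alpha-composited sum weighted by the accumulated transmittance. A minimal sketch with scalar (grayscale) colours, a simplification of the paper's RGB formulation:

```python
import math

def render_ray(densities, colors, deltas):
    """Discrete NeRF-style volume rendering along one ray:
    alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i = prod_{j<i} (1 - alpha_j),
    pixel = sum_i T_i * alpha_i * c_i  (scalar colour per sample here)."""
    pixel, transmittance = 0.0, 1.0
    for sigma, c, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        pixel += transmittance * alpha * c      # light reaching the camera
        transmittance *= 1.0 - alpha            # occlusion for later samples
    return pixel
```

Because every operation here is differentiable in the densities and colours, the network can be trained end-to-end from nothing but posed input images and a photometric loss.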
CREATING DATA OUTPUTS FROM MULTI AGENT TRAFFIC MICRO SIMULATION TO ASSIMILATI..., by csandit
The intensive development of traffic engineering and of the technologies integrated into vehicles, roads, and their surroundings brings opportunities for real-time transport mobility modeling. Based on such a model, it is then possible to establish a predictive layer capable of predicting short- and long-term traffic flow behavior. It is possible to create a real-time model of traffic mobility based on generated data. However, data may have geographical, temporal, or other constraints, or failures. It is therefore appropriate to develop tools that artificially create the missing data, which can then be assimilated with real data. This paper presents a mechanism describing strategies for generating artificial data using microsimulations. It describes a traffic microsimulation based on our multi-agent framework, over which a system for generating traffic data is built. The system generates data whose structure corresponds to data acquired in the real world.
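As a toy illustration of the idea of artificially creating missing data that shares the structure of the real measurements (the function, data shape, and statistics below are hypothetical; the paper generates its data with a multi-agent microsimulation, not by statistical sampling):

```python
import random

def synthesize_counts(real_counts, missing_slots, seed=0):
    """Fill gaps in detector data with artificial values whose statistics
    roughly match the observed data, so the synthetic records share the
    structure of the real-world measurements.
    real_counts: {time_slot: vehicle_count}; missing_slots: slots to fill."""
    rng = random.Random(seed)  # deterministic for reproducibility
    values = list(real_counts.values())
    mean = sum(values) / len(values)
    filled = dict(real_counts)
    for slot in missing_slots:
        # sample around the observed mean (a stand-in for microsimulation output)
        filled[slot] = max(0, round(rng.gauss(mean, 1.0)))
    return filled
```

The key property, mirrored from the abstract, is that the output records are indistinguishable in structure from real ones, so downstream assimilation code does not need to treat them specially.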
A Study of Mobile User Movement Prediction Methods, by IJECEIAES
For more than a decade, the number of smartphone users has been increasing day by day. With the drastic improvements in communication technologies, predicting the future movements of mobile users has also taken on an important role. Various sectors can gain from this prediction: communication management, city development planning, and location-based services are some of the fields that movement prediction can make more valuable. In this paper, we present a study of several location prediction techniques in these areas.
Mobility Models for Delay Tolerant Networks: A Survey, by ijwmn
Delay Tolerant Networking (DTN) is an emerging networking technology widely used in environments where end-to-end paths do not exist. DTN follows a store-carry-forward mechanism to route data. This mechanism exploits the mobility of nodes, and hence the performance of DTN routing and application protocols depends heavily on the underlying mobility of the nodes and its characteristics. Therefore, suitable mobility models need to be incorporated into simulation tools to evaluate DTN protocols across many scenarios. In the DTN mobility modelling literature, a number of mobility models have been developed based on synthetic theory and real-world mobility traces. Furthermore, many researchers have developed application-specific mobility models. None of these models provides accurate evaluation in all scenarios, so model selection is an important issue in DTN protocol simulation. In this study, we summarize various widely used mobility models and compare their performance. Finally, we conclude with future research directions in mobility modelling for DTN simulation.
Advancement in VANET Routing by Optimizing Centrality with an Ant Colony Approach, by ijceronline
In a wireless ad hoc network, an opportunistic routing strategy is one in which there is no predefined rule for choosing the next node toward the destination (as there is in conventional schemes such as OLSR, DSR, or even geo-routing). A popular example of opportunistic routing is "delay tolerant" forwarding in a VANET when no direct path to the destination exists; conventional routing in this case would simply drop the packet, whereas with opportunistic routing a node acts on the available information. In this thesis, we optimize routing using centrality information and then refine it with an ant colony metaheuristic. We validate our approach on parameters such as overhead and throughput.
Similar to Pedestrian behavior/intention modeling for autonomous driving V (20)
Application of Foundation Models for Autonomous Driving, by Yu Huang
Since DARPA's Grand Challenges (rural) in 2004/05 and Urban Challenge in 2007, autonomous driving has been the most active field of AI applications. Recently, powered by large language models (LLMs), chat systems such as ChatGPT and PaLM have emerged and rapidly become a promising direction for achieving artificial general intelligence (AGI) in natural language processing (NLP). It is natural to think that we could employ these abilities to reformulate autonomous driving. By combining LLMs with foundation models, it is possible to utilize human knowledge, common sense, and reasoning to rebuild autonomous driving systems and escape the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied to autonomous driving, categorized as simulation, world models, data annotation, and planning or end-to-end solutions, etc.
Fisheye-Based Perception for Autonomous Driving VI, by Yu Huang
Disentangling and Vectorization: A 3D Visual Perception Approach for Autonomous Driving Based on Surround-View Fisheye Cameras
SVDistNet: Self-Supervised Near-Field Distance Estimation on Surround View Fisheye Cameras
FisheyeDistanceNet++: Self-Supervised Fisheye Distance Estimation with Self-Attention, Robust Loss Function and Camera View Generalization
An Online Learning System for Wireless Charging Alignment using Surround-view Fisheye Cameras
RoadEdgeNet: Road Edge Detection System Using Surround View Camera Images
Fisheye/Omnidirectional View in Autonomous Driving V, by Yu Huang
Road-line Detection and 3D Reconstruction Using Fisheye Cameras
• Vehicle Re-ID for Surround-view Camera System
• SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
• Universal Semantic Segmentation for Fisheye Urban Driving Images
• UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models
• OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving
• Adversarial Attacks on Multi-task Visual Perception for Autonomous Driving
Fisheye/Omnidirectional View in Autonomous Driving IV, by Yu Huang
FisheyeMultiNet: Real-time Multi-task Learning Architecture for Surround-view Automated Parking System
• Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline
• SynWoodScape: Synthetic Surround-view Fisheye Camera Dataset for Autonomous Driving
• Feasible Self-Calibration of Larger Field-of-View (FOV) Camera Sensors for the ADAS
Autonomous driving for robotaxis, covering perception, prediction, planning, decision making, and control, as well as simulation, visualization, and the data closed loop.
LiDAR in Adverse Weather: Dust, Snow, Rain and Fog (2), by Yu Huang
Canadian Adverse Driving Conditions Dataset, 2020, 2
Deep multimodal sensor fusion in unseen adverse weather, 2020, 8
RADIATE: A Radar Dataset for Automotive Perception in Bad Weather, 2021, 4
Lidar Light Scattering Augmentation (LISA): Physics-based Simulation of Adverse Weather Conditions for 3D Object Detection, 2021, 7
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather, 2021, 8
DSOR: A Scalable Statistical Filter for Removing Falling Snow from LiDAR Point Clouds in Severe Winter Weather, 2021, 9
Scenario-Based Development & Testing for Autonomous Driving, by Yu Huang
Formal Scenario-Based Testing of Autonomous Vehicles: From Simulation to the Real World, 2020
A Scenario-Based Development Framework for Autonomous Driving, 2020
A Customizable Dynamic Scenario Modeling and Data Generation Platform for Autonomous Driving, 2020
Large Scale Autonomous Driving Scenarios Clustering with Self-supervised Feature Extraction, 2021
Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles, 2021
Systems Approach to Creating Test Scenarios for Automated Driving Systems, Reliability Engineering and System Safety (215), 2021
How to Build a Data Closed-loop Platform for Autonomous Driving?, by Yu Huang
Introduction;
data-driven models for autonomous driving;
cloud computing infrastructure and big data processing;
annotation tools for training data;
large-scale model training platform;
model testing and verification;
related machine learning techniques;
conclusion.
Simulation for Autonomous Driving at Uber ATG, by Yu Huang
Testing Safety of SDVs by Simulating Perception and Prediction
LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World
Recovering and Simulating Pedestrians in the Wild
S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling
SceneGen: Learning to Generate Realistic Traffic Scenes
TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors
GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving
AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles
Appendix: (Waymo)
SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
RegNet: Multimodal Sensor Registration Using Deep Neural Networks
CalibNet: Self-Supervised Extrinsic Calibration using 3D Spatial Transformer Networks
RGGNet: Tolerance Aware LiDAR-Camera Online Calibration with Geometric Deep Learning and Generative Model
CalibRCNN: Calibrating Camera and LiDAR by Recurrent Convolutional Neural Network and Geometric Constraints
LCCNet: LiDAR and Camera Self-Calibration using Cost Volume Network
CFNet: LiDAR-Camera Registration Using Calibration Flow Network
2. Outline
• Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection (17.2.18)
• Group LSTM: Group Trajectory Prediction in Crowded Scenarios (ECCV2018 workshop)
• Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks (7.17)
• The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs (8.23)
• Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM (8.23)
• STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction (ICCV19)
• Neighbourhood Context Embeddings in Deep Inverse Reinforcement Learning for Predicting Pedestrian Motion Over Long Time Horizons (ICCV19)
• GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction (3.26)
• Recursive Social Behavior Graph for Trajectory Prediction (4.22)
3. Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
• As humans we possess an intuitive ability for navigation which we master through years
of practice; however existing approaches to model this trait for diverse tasks including
monitoring pedestrian flow and detecting abnormal events have been limited by using a
variety of hand-crafted features.
• Recent research in the area of deep- learning has demonstrated the power of learning
features directly from the data; and related research in recurrent neural networks has
shown exemplary results in sequence- to-sequence problems such as neural machine
translation and neural image caption generation.
• Motivated by these approaches, the authors propose a method to predict the future motion
of a pedestrian given a short history of their own, and their neighbours', past behaviour.
• The novelty of the method is the combined attention model which utilises both “soft
attention” as well as “hard-wired” attention in order to map the trajectory information
from the local neighbourhood to the future positions of the pedestrian of interest.
• They show how a simple approximation of attention weights (i.e. hard-wired) can be
merged with soft attention weights to make the model applicable to challenging
real-world scenarios with hundreds of neighbours.
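The merge of the two attention types can be sketched in a few lines of numpy. This is a minimal stand-in, not the paper's implementation: the dot-product scores and the inverse-distance weighting below are illustrative assumptions.

```python
import numpy as np

def soft_attention(query, keys):
    # Soft attention: softmax over similarity scores (plain dot products here).
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys  # soft attention context vector

def hardwired_attention(neighbour_encodings, distances):
    # Hard-wired attention: fixed weights inversely proportional to each
    # neighbour's distance from the pedestrian of interest (nothing learned).
    w = 1.0 / np.asarray(distances)
    w /= w.sum()
    return w @ neighbour_encodings  # hardwired attention context vector

rng = np.random.default_rng(0)
query = rng.normal(size=8)            # encoding of the pedestrian of interest
keys = rng.normal(size=(5, 8))        # hidden states of its own trajectory
neighbours = rng.normal(size=(3, 8))  # encodings of 3 neighbour trajectories
dists = [2.0, 5.0, 9.0]               # distances to those neighbours

# Merged context vector fed to the trajectory decoder.
merged = np.concatenate([soft_attention(query, keys),
                         hardwired_attention(neighbours, dists)])
print(merged.shape)
```

The point of the hard-wired half is that it costs almost nothing per neighbour, which is what makes hundreds of neighbours tractable.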
4. Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
A scene (on the left): The trajectory of the pedestrian of interest is shown in green, and has two neighbours (shown
in purple) to the left, one in front and none on the right. Neighbourhood encoding scheme (on the right): Trajectory
information is encoded with LSTM encoders. A soft attention context vector is used to embed the trajectory
information from the pedestrian of interest, and a hardwired attention context vector is used for neighbouring
trajectories. The soft attention context vector is generated with a soft attention function. The merged context vector
is then used to predict the future trajectory of the pedestrian of interest (shown in red).
5. Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
The Soft + Hardwired Attention model utilises the trajectory information from both the pedestrian of
interest and the neighbouring trajectories. The trajectory of the pedestrian of interest is embedded with
the soft attention context vector, while neighbouring trajectories are embedded with the aid of a
hardwired attention context vector. The soft attention context vector is generated with a soft attention
function. The merged context vector is then used to predict the future state.
6. Soft + Hardwired Attention: An LSTM Framework for Human
Trajectory Prediction and Abnormal Event Detection
7. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
• The analysis of crowded scenes is one of the most challenging scenarios in visual
surveillance, and a variety of factors need to be taken into account, such as the structure
of the environments, and the presence of mutual occlusions and obstacles.
• Traditional prediction methods (such as RNN, LSTM, VAE, etc.) focus on anticipating
individual’s future path based on the precise motion history of a pedestrian.
• However, since tracking algorithms are generally not reliable in highly dense scenes,
these methods are not easily applicable in real environments.
• Nevertheless, it is very common that people (friends, couples, family members, etc.)
tend to exhibit coherent motion patterns.
• Motivated by this phenomenon, the authors propose an approach to predict future
trajectories in crowded scenes, at the group level.
• First, by exploiting the motion coherency, cluster trajectories that have similar motion
trends.
• In this way, pedestrians within the same group can be well segmented.
• Then, an improved social-LSTM is adopted for future path prediction.
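The motion-coherency clustering can be approximated with a greedy grouping on velocity similarity; this is a simplified stand-in for the paper's clustering step, and the thresholds `cos_thresh` and `speed_ratio` are hypothetical, not taken from the paper.

```python
import numpy as np

def motion_coherent_groups(velocities, cos_thresh=0.9, speed_ratio=0.8):
    # Greedy grouping: two pedestrians join the same group when their mean
    # velocity vectors point the same way and have similar magnitude.
    n = len(velocities)
    assigned = [-1] * n
    groups = []
    for i in range(n):
        if assigned[i] >= 0:
            continue
        assigned[i] = len(groups)
        groups.append([i])
        for j in range(i + 1, n):
            if assigned[j] >= 0:
                continue
            vi, vj = velocities[i], velocities[j]
            cos = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))
            ratio = min(np.linalg.norm(vi), np.linalg.norm(vj)) / \
                    max(np.linalg.norm(vi), np.linalg.norm(vj))
            if cos > cos_thresh and ratio > speed_ratio:
                assigned[j] = assigned[i]
                groups[assigned[i]].append(j)
    return groups

# Two pedestrians walking together, one heading the opposite way.
vels = np.array([[1.0, 0.1], [1.0, 0.0], [-1.0, 0.0]])
print(motion_coherent_groups(vels))  # → [[0, 1], [2]]
```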
8. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
Representation of the Social hidden-state tensor. The black dot represents the pedestrian of interest. Other
pedestrians are shown in different color codes, namely green for pedestrians belonging to the same set, and
red for pedestrians belonging to a different set. The neighborhood of the pedestrian of interest is described by
N0 × N0 cells, which preserves the spatial information by pooling spatially adjacent neighbors. Pedestrians
belonging to the same set are not used for the final computation of the pooling layer.
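The same-group exclusion in the pooling layer might be sketched as follows; the sum-pooling, the grid `cell` size, and the argument names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def group_social_tensor(pos_i, neigh_pos, neigh_hidden, group_ids, gid_i,
                        n0=4, cell=1.0):
    # N0 x N0 grid of pooled hidden states around pedestrian i.
    # Neighbours in the SAME group as i are skipped (the Group-LSTM variant).
    d = neigh_hidden.shape[1]
    H = np.zeros((n0, n0, d))
    half = n0 * cell / 2.0
    for p, h, g in zip(neigh_pos, neigh_hidden, group_ids):
        if g == gid_i:
            continue  # same group: excluded from the pooling layer
        gx = int((p[0] - pos_i[0] + half) // cell)
        gy = int((p[1] - pos_i[1] + half) // cell)
        if 0 <= gx < n0 and 0 <= gy < n0:
            H[gx, gy] += h  # sum-pool spatially adjacent neighbours
    return H

pos_i = np.zeros(2)
npos = np.array([[0.6, 0.6], [-0.6, 0.6]])  # two neighbours
nh = np.ones((2, 3))                        # their LSTM hidden states
H = group_social_tensor(pos_i, npos, nh, group_ids=[7, 9], gid_i=7)
print(H.sum())  # only the different-group neighbour is pooled
```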
9. Group LSTM: Group Trajectory Prediction in Crowded Scenarios
The figure represents the chain structure of the LSTM network between two consecutive
time steps. At each time step, the inputs of the LSTM cell are the previous position and
the Social pooling tensor Ht. The output of the LSTM cell is the current position.
11. Social-BiGAT: Multimodal Trajectory Forecasting using
Bicycle-GAN and Graph Attention Networks
• Predicting the future trajectories of multiple interacting agents in a scene has become an
increasingly important problem for many different applications ranging from control of
autonomous vehicles and social robots to security and surveillance.
• This problem is compounded by the presence of social interactions between humans and
their physical interactions with the scene.
• While the existing literature has explored some of these cues, it has mainly ignored the
multimodal nature of each human's future trajectory.
• Social-BiGAT is a graph-based generative adversarial network that generates realistic,
multimodal trajectory predictions by better modelling the social interactions of
pedestrians in a scene.
• Based on a graph attention network (GAT) that learns reliable feature representations
that encode the social interactions between humans in the scene, and a recurrent
encoder-decoder architecture that is trained adversarially to predict, based on the
features, the humans’ paths.
• The multimodal nature of the prediction is captured by forming a reversible transformation
between each scene and its latent noise vector, as in Bicycle-GAN.
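The graph-attention building block (in the style of Veličković et al.'s GAT, which the model builds on) can be sketched as below; `W` and `a` are learned parameters in practice, random here for illustration.

```python
import numpy as np

def gat_layer(H, adj, W, a):
    # One graph-attention layer: each pedestrian attends over its neighbours
    # with coefficients computed from the pair of transformed features.
    Wh = H @ W                                   # (n, d')
    n = Wh.shape[0]
    out = np.zeros_like(Wh)
    for i in range(n):
        nbrs = np.where(adj[i])[0]
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j])
        e = np.array([np.concatenate([Wh[i], Wh[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, 0.2 * e)          # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                     # softmax over neighbours
        out[i] = alpha @ Wh[nbrs]                # attention-weighted sum
    return out

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 6))      # encoder states of 4 pedestrians
adj = np.ones((4, 4), bool)      # fully connected scene graph
W = rng.normal(size=(6, 8))
a = rng.normal(size=16)
Z = gat_layer(H, adj, W, a)
print(Z.shape)
```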
12. Social-BiGAT: Multimodal Trajectory Forecasting using
Bicycle-GAN and Graph Attention Networks
Architecture for the Social-BiGAT model. The
model consists of a single generator, two
discriminators (one at local pedestrian scale,
and one at global scene scale), and a latent
encoder that learns noise from scenes. The
model makes use of a graph attention network
(GAT) and self-attention on an image to
consider the social and physical features of a
scene.
13. Social-BiGAT: Multimodal Trajectory Forecasting using
Bicycle-GAN and Graph Attention Networks
Training process for the Social-BiGAT model. The generator and discriminators are trained with traditional
adversarial learning techniques, plus an additional L2 loss on generated samples to encourage
consistency. The latent encoder is further trained by ensuring it can recreate the noise passed into the
generator, and by making its output mirror a normal distribution.
15. The Trajectron: Probabilistic Multi-Agent Trajectory
Modeling With Dynamic Spatiotemporal Graphs
• Developing safe human-robot interaction systems is a necessary step towards the
widespread integration of autonomous agents in society.
• A key component of such systems is the ability to reason about the many potential
futures (e.g. trajectories) of other agents in the scene.
• The Trajectron is a graph-structured model that predicts many potential future trajectories of
multiple agents simultaneously in both highly dynamic and multimodal scenarios (i.e.
where the number of agents in the scene is time-varying and there are many possible
highly distinct futures for each agent).
• It combines tools from recurrent sequence modeling and variational deep generative
modeling to produce a distribution of future trajectories for each agent in a scene.
• Test the performance of the model on several datasets, obtaining state-of-the-art results
on standard trajectory prediction metrics as well as introducing a new metric for
comparing models that output distributions.
16. The Trajectron: Probabilistic Multi-Agent Trajectory
Modeling With Dynamic Spatiotemporal Graphs
Top: An example graph with four nodes. a is the
modeled node and is of type T3. It has three
neighbors: b of type T1, c of type T2, and d of type
T1. Here, c is about to connect with a. Bottom: The
corresponding architecture for node a.
Overall, the Trajectron employs a hybrid edge
combination scheme combining aspects of Social
Attention and the Structural-RNN.
17. The Trajectron: Probabilistic Multi-Agent Trajectory
Modeling With Dynamic Spatiotemporal Graphs
• Trajectron combines elements of variational deep generative models (in particular, CVAEs),
recurrent sequence models (LSTMs), and dynamic spatiotemporal graphical structures to produce
high-quality multimodal trajectories that models/predicts future behaviors of multiple humans.
• Trajectron actually models a human’s velocity, which is then numerically integrated to produce
spatial trajectories.
• Build a graph G = (V , E ) representing the scene with nodes representing agents and edges based
on agents’ spatial proximity.
• Node History Encoder (NHE) to encode a node’s state history;
• Edge Encoders (EEs) to incorporate influence from neighboring nodes.
• With the previous outputs in hand, form a concatenated representation which then
parameterizes the recognition and prior distributions in the CVAE framework.
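The velocity-to-position step can be sketched as simple forward-Euler integration; the time step `dt` and starting position are illustrative placeholders, not values from the paper.

```python
import numpy as np

def integrate_velocities(p0, velocities, dt=0.4):
    # Forward-Euler integration of predicted 2D velocities into positions,
    # since the Trajectron outputs velocities rather than raw coordinates.
    return p0 + dt * np.cumsum(velocities, axis=0)

p0 = np.array([1.0, 2.0])                            # last observed position
v = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # predicted velocities
print(integrate_velocities(p0, v, dt=0.5))
# steps: (1.5, 2.0) -> (2.0, 2.0) -> (2.0, 2.5)
```

Sampling different latent codes from the CVAE yields different velocity sequences, and hence the distinct spatial trajectories that make the output multimodal.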
19. Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
• A trajectory prediction system that incorporates the scene information (Scene-
LSTM) as well as individual pedestrian movement (Pedestrian-LSTM) trained
simultaneously within static crowded scenes.
• Superimpose a two-level grid structure (grid cells and subgrids) on the scene to
encode spatial granularity plus common human movements.
• The Scene-LSTM captures the commonly traveled paths that can be used to
significantly influence the accuracy of human trajectory prediction in local areas
(i.e. grid cells).
• Further design scene data filters, consisting of a hard filter and a soft filter, to
select the relevant scene information in a local region when necessary and
combine it with Pedestrian-LSTM for forecasting a pedestrian’s future locations.
• The experimental results on several publicly available datasets demonstrate that
it produces more accurate predicted trajectories in different scene contexts.
20. Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
Scene-LSTM learns common human movements on a two-level
grid structure. The common human movement is filtered and
used in combination with individual movement (Pedestrian-
LSTM) to predict a pedestrian’s future locations.
21. Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
The system consists of three main modules: Pedestrian Movement (PM), Scene Data (SD) and Scene Data Filter
(SDF). PM models the individual movement of pedestrians. SD encodes common human movements in each
grid cell. SDF selects relevant scene data to update the Pedestrian-LSTM, which is used to predict the future
locations. ⊗ denotes elementwise multiplication. ⊕ denotes vector addition. h_i and h_s are the hidden states of
Pedestrian-LSTM and Scene-LSTM, respectively.
22. Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
Illustrations of the hard filter, which determines whether the scene data should be applied in predicting the
future locations of a pedestrian. (a) the frame image is first divided into n × n grid cells (n = 4 in this example)
to capture all human movements in each grid cell; (b) & (c) only non-linear grid cells are selected for further
processing at the subgrid level; the scene data is not applied for pedestrians in the linear grid cell; (d) a non-
linear grid cell is further divided into m × m subgrids (m = 4) and each trajectory is parsed into subgrid paths;
(e) the common subgrids, occupied by common subgrid paths; (f) at prediction time, whether scene data is
used depends on the current location of each pedestrian. If the pedestrian's current location is in the
common subgrids, the scene data is used (red pedestrian); otherwise, it is not used (green pedestrian).
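The hard-filter decision described above might look like the following sketch; the coordinate convention and the data structures `nonlinear_cells` and `common_subgrids` are hypothetical names introduced here for illustration.

```python
def hard_filter(pos, frame_w, frame_h, nonlinear_cells, common_subgrids,
                n=4, m=4):
    # Decide whether scene data should influence this pedestrian: scene
    # memory is used only when the pedestrian stands in a grid cell with
    # non-linear movements AND inside one of that cell's common subgrids.
    cx = int(pos[0] / frame_w * n)
    cy = int(pos[1] / frame_h * n)
    if (cx, cy) not in nonlinear_cells:
        return False                     # linear cell: skip scene data
    sx = int(pos[0] / frame_w * n * m) % m
    sy = int(pos[1] / frame_h * n * m) % m
    return (cx, cy, sx, sy) in common_subgrids

nonlinear = {(1, 1)}        # cells flagged as containing non-linear paths
common = {(1, 1, 2, 3)}     # (cell_x, cell_y, sub_x, sub_y)
print(hard_filter((160, 190), 400, 400, nonlinear, common))  # True
print(hard_filter((50, 50), 400, 400, nonlinear, common))    # False
```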
23. Trajectory Prediction by Coupling Scene-LSTM
with Human Movement LSTM
Illustrations of the soft filter. The relevant scene data (i.e. Scene-LSTM) is selected using
each pedestrian's walking behavior. The filtered grid-cell memory of each pedestrian is then used in
combination with pedestrian movements (Pedestrian-LSTM) to predict the future trajectories.
25. STGAT: Modeling Spatial-Temporal Interactions for
Human Trajectory Prediction
• Human trajectory prediction is challenging and critical in various applications (e.g., autonomous
vehicles and social robots).
• Because of the continuity and foresight of the pedestrian movements, the moving pedestrians in
crowded spaces will consider both spatial and temporal interactions to avoid future collisions.
• However, most of the existing methods ignore the temporal correlations of interactions with
other pedestrians involved in a scene.
• Spatial-Temporal Graph Attention network (STGAT), based on a sequence-to-sequence
architecture to predict future trajectories of pedestrians.
• Besides the spatial interactions captured by the graph attention mechanism at each time-step,
adopt an extra LSTM to encode the temporal correlations of interactions.
• Test on two publicly available crowd datasets (ETH and UCY) and produces more “socially”
plausible trajectories for pedestrians.
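The extra temporal encoder over the per-time-step spatial embeddings can be sketched with a plain RNN cell standing in for the LSTM; shapes and parameters are illustrative only.

```python
import numpy as np

def temporal_encoder(spatial_embeddings, Wx, Wh, b):
    # STGAT's extra recurrent encoder, sketched as a vanilla RNN: it consumes
    # the per-time-step spatial interaction embeddings (here, precomputed GAT
    # outputs) so that how interactions EVOLVE over time is also encoded.
    h = np.zeros(Wh.shape[0])
    for g_t in spatial_embeddings:       # one GAT output per time step
        h = np.tanh(Wx @ g_t + Wh @ h + b)
    return h                             # temporal interaction state

rng = np.random.default_rng(2)
gat_outs = rng.normal(size=(8, 16))      # 8 observed steps, 16-dim GAT output
Wx = rng.normal(size=(32, 16)) * 0.1
Wh = rng.normal(size=(32, 32)) * 0.1
b = np.zeros(32)
state = temporal_encoder(gat_outs, Wx, Wh, b)
print(state.shape)
```

The resulting state joins the motion encoding in the Intermediate State that the decoder conditions on.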
26. STGAT: Modeling Spatial-Temporal Interactions for
Human Trajectory Prediction
The architecture of the STGAT model. The framework is based on seq2seq model and consists of 3 parts: Encoder,
Intermediate State and Decoder. The Encoder module includes three components: 2 types of LSTMs and Graph
Attention Network (GAT) . The Intermediate State encapsulates the spatial and temporal information of all
observed trajectories. The Decoder module generates the future trajectories based on Intermediate State.
29. Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
• Despite the fact that Deep Inverse Reinforcement Learning (D-IRL) based modelling
paradigms offer flexibility and robustness when anticipating human behaviour across
long time horizons, compared to their supervised learning counterparts, no existing
state-of-the-art D-IRL methods consider path planning in situations where there are
multiple moving pedestrians in the environment.
• To address this, the authors propose a recurrent neural network based method for embedding
pedestrian dynamics in a D-IRL setting where there are multiple moving agents.
• Capture the motion of the pedestrian of interest as well as the motion of other
pedestrians in the neighbourhood through Long-Short-Term Memory networks.
• The neighbourhood dynamics are encoded into a feature map, preserving the spatial
integrity of the observed trajectories.
• Utilising the maximum-entropy based non-linear inverse reinforcement learning
framework, these features are mapped to a reward map.
• Experiments demonstrate the importance of capturing the dynamic evolution of the
environment using the embedding scheme.
30. Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
The architecture used to embed the
neighbourhood context: The trajectory of the
pedestrian of interest is shown in blue, with
three neighbours shown in green. Heading
directions are indicated with circles.
Trajectories are encoded using LSTMs, where
soft attention is utilised to embed the
information from the pedestrian of interest
and hard-wired attention is used for the
neighbours. Next, a feature map is generated
to embed this information spatially, based on
the cartesian points of each trajectory.
31. Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
The architecture of the four-layer fully convolutional network used to
map the feature map G to the reward map R. The first three layers each
contain 32 1 × 1 convolution kernels with a ReLU activation, and
the final layer contains a single 1 × 1 convolution kernel.
The learned reward map covers all
the areas of the environment,
encapsulating structural factors such
as buildings and pathways that
influence pedestrian behaviour.
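Since a 1 × 1 convolution is just a per-cell linear map over channels, the four-layer network can be sketched with matrix products; the random weights below are placeholders for learned kernels.

```python
import numpy as np

def reward_network(G, weights):
    # Four 1x1-conv layers as per-cell linear maps: 1x1 kernels mix feature
    # channels without looking at neighbouring cells of the feature map G.
    x = G                                   # (H, W, C_in)
    for i, W in enumerate(weights):
        x = x @ W                           # 1x1 convolution == channel matmul
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)          # ReLU on the first three layers
    return x                                # reward map R, shape (H, W, 1)

rng = np.random.default_rng(0)
G = rng.normal(size=(16, 16, 8))            # spatial feature map
weights = [rng.normal(size=(8, 32)) * 0.1,  # 32 1x1 kernels
           rng.normal(size=(32, 32)) * 0.1,
           rng.normal(size=(32, 32)) * 0.1,
           rng.normal(size=(32, 1)) * 0.1]  # final single 1x1 kernel
R = reward_network(G, weights)
print(R.shape)  # one reward value per cell of the environment
```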
32. Neighbourhood Context Embeddings in Deep Inverse Reinforcement
Learning for Predicting Pedestrian Motion Over Long Time Horizons
33. GraphTCN: Spatio-Temporal Interaction Modeling
for Human Trajectory Prediction
• Trajectory prediction is a fundamental and challenging task to forecast the future path of the
agents in autonomous applications with multi-agent interaction, where the agents need to
predict the future movements of their neighbors to avoid collisions.
• To respond timely and precisely to the environment, high efficiency and accuracy are required in
the prediction.
• Conventional approaches, e.g., LSTM-based models, incur considerable computation cost in
prediction, especially for long sequence prediction.
• To support more efficient and accurate trajectory predictions, a CNN-based spatial-temporal
graph framework GraphTCN, which captures the spatial and temporal interactions in an input-
aware manner.
• The spatial interaction between agents at each time step is captured with an edge graph
attention network (EGAT), and the temporal interaction across time step is modeled with a
modified gated convolutional network.
• In contrast to conventional models, both the spatial and temporal modeling in GraphTCN are
computed within each local time window.
• Therefore, GraphTCN can be executed in parallel for much higher efficiency, while achieving
accuracy comparable to best-performing approaches.
34. GraphTCN: Spatio-Temporal Interaction Modeling
for Human Trajectory Prediction
The overview of GraphTCN, where EGAT captures the
spatial interaction between agents for each time step
and based on the spatial and historical trajectory
embedding, TCN further captures the temporal
interaction across time steps. The decoder module
then produces multiple socially acceptable
trajectories for all the agents simultaneously.
35. GraphTCN: Spatio-Temporal Interaction Modeling
for Human Trajectory Prediction
TCN with a stack of 3 causal convolution layers of kernel size 3. In each
layer, the left padding is adopted based on the kernel size. The input
contains the spatial information captured by preceding modules. The
output of TCN is collected by concatenating all the outputs across time.
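The left padding that makes the convolution causal can be sketched for a single channel; the identity kernel below is chosen only to make the causality visible, not a realistic learned filter.

```python
import numpy as np

def causal_conv(x, w):
    # Left-pad by k-1 zeros so y[t] depends only on x[t-k+1..t]: the output
    # at a time step never peeks at future positions of the trajectory.
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([w @ xp[t:t + k] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])   # a 4-step input sequence
w = np.array([0.0, 0.0, 1.0])        # identity kernel: newest sample only
y = causal_conv(x, w)
print(y)  # [1. 2. 3. 4.] — same length, nothing leaked from the future
```

Stacking 3 such layers of kernel size 3, as in the figure, gives each output a receptive field of 7 past time steps.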
37. Recursive Social Behavior Graph for Trajectory
Prediction
• Social interaction is an important topic in trajectory prediction to generate plausible
paths.
• Force-based models use distance to compute force, and they fail when the interaction is
complicated.
• For pooling methods, the distance between two people at a single timestep is used as the
criterion to calculate the strength of the relationship.
• Attention methods face the same problem: Euclidean distance is used to guide the
attention mechanism.
• The paper offers an insight into group-based social interaction modelling to explore
relationships among pedestrians.
• It recursively extracts social representations supervised by group-based annotations and
formulates them into a social behavior graph, called the Recursive Social Behavior Graph.
• The recursive mechanism greatly expands the representation power.
• Graph CNN is used to propagate social interaction information in such a graph.
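The propagation step can be sketched with the standard normalized-adjacency GCN rule (Kipf & Welling); the graph and features below are illustrative.

```python
import numpy as np

def gcn_layer(H, A, W):
    # One GCN layer, H' = ReLU(D^-1/2 (A+I) D^-1/2 H W): social features
    # flow between pedestrians connected in the behavior graph.
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(3)
A = np.array([[0, 1, 0],                     # pedestrians 0 and 1 share an
              [1, 0, 0],                     # edge in the behavior graph;
              [0, 0, 0]], float)             # pedestrian 2 is isolated
H = rng.normal(size=(3, 5))                  # individual features per person
W = rng.normal(size=(5, 5))
Z = gcn_layer(H, A, W)
print(Z.shape)
```

After propagation, node 2's output depends only on its own features, while 0 and 1 mix each other's social information.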
38. Recursive Social Behavior Graph for Trajectory
Prediction
Overview. For individual representation, BiLSTMs are used to encode the historical trajectory feature, and CNNs are
used to encode the human context feature. For relational social representation, the RSBG is first generated
recursively, and a GCN is then used to propagate social features. At the decoding stage, social features are
concatenated with individual features, which are finally decoded by an LSTM-based decoder.