Pedestrian behavior/intention modeling for autonomous driving II

Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• Social LSTM: Human Trajectory Prediction in Crowded Spaces
• A Data-driven Model for Interaction-aware Pedestrian Motion Prediction in Object
Cluttered Environments
• Human Motion Prediction Under Social Grouping Constraints
• Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks
• SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical
Constraints
• Social Attention: Modeling Attention in Human Crowds
• Learning to Predict Pedestrian Intention via Variational Tracking Networks
• Location-Velocity Attention for Pedestrian Trajectory Prediction
• Situation-Aware Pedestrian Trajectory Prediction with Spatio-Temporal Attention Model

Social LSTM: Human Trajectory Prediction in
Crowded Spaces
• Pedestrians follow different trajectories to avoid obstacles and accommodate fellow
pedestrians.
• Any autonomous vehicle navigating such a scene should be able to foresee the future
positions of pedestrians and accordingly adjust its path to avoid collisions.
• This problem of trajectory prediction can be viewed as a sequence generation task: the
goal is to predict the future trajectories of people based on their past positions.
• Following the recent success of Recurrent Neural Network (RNN) models for sequence
prediction tasks, the authors propose an LSTM model that learns general human movement
and predicts future trajectories.
• This is in contrast to traditional approaches which use hand-crafted functions such as Social
forces.
CVPR 2016

The goal is to predict the motion dynamics in crowded scenes, a challenging task because the motion of each
person is typically affected by their neighbors. The proposed model, called "Social" LSTM (Social-LSTM), jointly
predicts the paths of all the people in a scene by taking into account the common-sense rules and social
conventions that humans typically follow as they navigate shared environments. The predicted distribution
of their future trajectories is shown in the heat-map.

Overview of the Social-LSTM method. A
separate LSTM network is used for each trajectory
in a scene. The LSTMs are then connected to each
other through a Social pooling (S-pooling) layer.
Unlike the traditional LSTM, this pooling layer
allows spatially proximal LSTMs to share
information with each other. The bottom row
shows the S-pooling for one person in the
scene. The hidden-states of all LSTMs within a
certain radius are pooled together and used as
an input at the next time-step.
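
The pooling step described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name social_pool, the square neighbourhood of side extent, the 2 x 2 grid and the use of a sum within each cell are assumptions made for the example, and hidden states are plain Python lists.

```python
def social_pool(positions, hidden, idx, extent=4.0, cells=2):
    """Sum neighbours' hidden states into a cells x cells grid centred on person idx."""
    dim = len(hidden[idx])
    grid = [[[0.0] * dim for _ in range(cells)] for _ in range(cells)]
    px, py = positions[idx]
    cell_size = extent / cells
    for j, (qx, qy) in enumerate(positions):
        if j == idx:
            continue
        # Offset of neighbour j measured from the grid's lower-left corner.
        dx = qx - px + extent / 2
        dy = qy - py + extent / 2
        if not (0 <= dx < extent and 0 <= dy < extent):
            continue  # outside the neighbourhood: contributes nothing
        cx, cy = int(dx // cell_size), int(dy // cell_size)
        for d in range(dim):
            grid[cy][cx][d] += hidden[j][d]
    return grid


# Hypothetical scene: person 3 is too far away to influence person 0.
positions = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5), (10.0, 10.0)]
hidden = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.1], [9.0, 9.0]]
pooled = social_pool(positions, hidden, idx=0)
```

The pooled grid would then be flattened and fed to person 0's LSTM at the next time-step, alongside that person's own position.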

The figure visualizes the probability distribution
of the predicted paths for 4 people
moving in a scene across 6 time steps.
The sub-caption describes what the
model is predicting. At each time-step,
the solid lines in rows 1 and 3 represent the
ground-truth future trajectories, the
dashed lines show the observed
positions up to that time-step, and the dots
denote the position at that time-step.
Notice that the model often correctly
predicts the future paths in challenging
settings with non-linear motions.

A Data-driven Model for Interaction-aware Pedestrian
Motion Prediction in Object Cluttered Environments
• 2018.2
• A data-driven, interaction-aware motion prediction approach for
pedestrians in environments cluttered with static obstacles.
• When navigating in such workspaces shared with humans, robots need
accurate motion predictions of the surrounding pedestrians.
• Human navigation behavior is mostly influenced by the surrounding
pedestrians and by the static obstacles in their vicinity.
• The model is based on Long Short-Term Memory (LSTM) neural networks
and learns human motion behavior from demonstrated data.
• It incorporates both static obstacles and surrounding pedestrians
for trajectory forecasting.
• As part of the model, surrounding pedestrians are encoded in a 1D grid
in polar angle space.

Trajectories of pedestrians predicted with the
presented motion model for interacting
pedestrians on a real-world dataset. Left: Two-
dimensional view of the navigating pedestrians,
where the ground truth data (dashed) and the
predicted trajectories (solid) are shown. Main
image: Real-world environment with predicted
trajectories projected into the image frame.
Right top: Static obstacle grid of agent 73, which
serves as one input to the model. Static
obstacles are shown in black. Right bottom:
Pedestrian grid of agent 73, which is another
input to the model. Both grids are aligned with
the agent’s position and heading, indicated by
the arrows in the center of the grids. Each
prediction is purely based on such two grids per
agent, and its current velocity (v73).

• The main factors influencing pedestrian motion are interactions among pedestrians, the
environment, and the location of their destination;
• By taking into account the interaction between pedestrians, the accuracy of the motion
models can be significantly increased;
• It was shown that using interaction-aware motion models for dynamic agent prediction in
motion planning applications makes robots more predictable for human agents and
therefore also more “socially compliant”.
• Especially for cluttered environments it is also important to model the pedestrians’
reactions to static obstacles in their close proximity.
• The existing approaches are limited by at least one of the following shortcomings: (i) The
feature functions are hand-crafted and therefore can only capture simple interactions. (ii)
The approaches are not scalable to dense crowds, therefore real-time computation is only
feasible for a small number of agents. (iii) Static obstacles are neglected and (iv) knowledge
about a set of potential destinations is assumed.

Architecture of the LSTM model. The inputs are the query agent’s state (velocity), the occupancy grid
and the angular pedestrian grid (APG), all centered at the query agent’s position and aligned with its
coordinate frame. Each of the 3 different input channels gets processed separately through embedding,
CNN and LSTM layers. The CNN / FC combination which pre-processes the occupancy grid is pre-trained
with an auto-encoder. The concatenation of the extracted features from each channel is followed by
another LSTM layer, a FC and an output layer. The output of the model is a sequence of velocities
predicted for the query agent’s future.

• Besides the social force model, a well-known approach for interaction forecasting of
holonomic agents is the Reciprocal Velocity Obstacles (RVO) method.
• A significant amount of work has gone into modeling interacting pedestrians using
maximum entropy Inverse Reinforcement Learning (IRL).
• When navigating in cluttered environments, with other agents and / or static obstacles,
humans show an outstanding capability to “read” the intentions of other people,
perceive the environment and extract relevant information from it.
• Ultimately, robots are expected to be well integrated into the environment in a socially
compliant way, so that they navigate safely and neither get stuck nor disturb other traffic
participants, all at a human-like level.
• In order to reach this level of autonomy, robots need to have accurate models to forecast
the evolution of the environment, including other agents.
• By taking into account motion predictions of other agents, safe paths can be planned for the robot.
• The more accurate the motion predictions of the others, the more predictable the robot
will be, because the re-planning effort per timestep can be minimized.

Angular pedestrian grid (APG) construction from relative pedestrian
positions (Xt). The APG is centered at the query agent’s position and
aligned with its heading. The encoded value per grid cell is the minimum
distance to the pedestrians that lie within the cone of the angular grid cell.
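
The APG construction described above can be illustrated with a short sketch. It assumes relative positions are given in the world frame and the agent's heading in radians; the function name, the number of cones and the max_range value used for empty cones are assumptions for the example, not values from the paper.

```python
import math


def angular_pedestrian_grid(rel_positions, heading, n_cells=8, max_range=6.0):
    """Return, per angular cone, the minimum distance to any pedestrian in it."""
    grid = [max_range] * n_cells          # empty cones keep the cap value (assumed)
    cone = 2 * math.pi / n_cells
    for dx, dy in rel_positions:
        dist = math.hypot(dx, dy)
        # Bearing relative to the query agent's heading, wrapped to [0, 2*pi).
        angle = (math.atan2(dy, dx) - heading) % (2 * math.pi)
        cell = int(angle // cone) % n_cells
        grid[cell] = min(grid[cell], dist)
    return grid


# Two pedestrians straight ahead (only the closer one is kept) and one to the left.
apg = angular_pedestrian_grid([(1.0, 0.1), (3.0, 0.2), (-0.5, 2.0)], heading=0.0)
```

The output has one value per cone, so doubling the angular resolution only doubles the input size, while each stored distance stays continuous.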

• The 1st input, the state, is the query agent's velocity in its local Cartesian coordinate frame (v_x, v_y).
• The 2nd input is a 2D occupancy grid, encoding information about the static obstacles in the
vicinity of the query agent.
• The 3rd input is the information about the other agents surrounding the query agent,
encoded in a special hybrid grid.
• In order to be able to capture well the dynamics of other pedestrians, a high resolution grid
would be needed, increasing the dimensionality a lot.
• Therefore, it introduces a new way of encoding the information of other pedestrian agents
with the angular pedestrian grid (APG).
• Compared to a standard 2D grid, the resolution of the APG only influences the
dimensionality of the input linearly, while still being able to capture radial distance changes
with continuous resolution instead of discretized grid cells.
• In addition, the observability of angular position changes of surrounding pedestrians
becomes more precise the closer they are to the query agent.
• One drawback of the APG representation is that only the closest surrounding
pedestrian in each angular cone is captured; the assumption is that these pedestrians
affect the query agent's decisions the most.
• As in the occupancy grid case, the APG is centered and aligned with the pedestrian’s
position and heading, respectively.

Architecture of the pretrained convolutional autoencoder (AE) network for grid feature extraction. By applying 3
convolutional layers and an FC layer, the original occupancy grid (g_in) gets compressed to a latent space
of size 64 during the encoding phase. The decoder reconstructs a grid (g_out) which ideally matches the
input grid perfectly. The encoding part of this AE network is later used in the full LSTM motion and
interaction model.

Visualization of the truncated backpropagation through time during training. The whole sequence of length
l_sequence gets split up into sub-sequences of length d_trunc. For the initial sub-sequence the hidden state of the
LSTMs gets initialized with zeros, for succeeding ones with the final state of the preceding sub-sequence.
The full network with its LSTM cells is trained using backpropagation through time (BPTT). Since the
demonstration trajectories have different lengths and the vanishing / exploding gradient problem needs to be
avoided, the truncation depth is fixed to d_trunc. Sub-sequences of length d_trunc of the demonstrated trajectories
are used for training. For succeeding sub-sequences, the hidden states of the LSTMs are initialized with the
final states of the previous sub-sequence. When there is no preceding sub-sequence, the hidden states of
the LSTMs are initialized with zeros. Thus, information from the preceding sub-sequence is propagated forward
to the next one, but the optimization (BPTT) only affects the samples within the current sub-sequence.
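
The state hand-over between sub-sequences can be sketched without any deep learning framework. In this illustrative sketch the function names and the toy recurrent step are assumptions; a real implementation would run BPTT inside each chunk and detach the hidden state at the chunk border. The last chunk may be shorter than d_trunc when the sequence length is not a multiple of it.

```python
def split_for_tbptt(sequence, d_trunc):
    """Split a trajectory into consecutive sub-sequences of at most d_trunc steps."""
    return [sequence[i:i + d_trunc] for i in range(0, len(sequence), d_trunc)]


def run_tbptt(sequence, d_trunc, step, hidden_size=2):
    """Carry the hidden state forward across chunks and record each chunk's start state."""
    hidden = [0.0] * hidden_size           # zeros for the initial sub-sequence
    chunk_starts = []
    for chunk in split_for_tbptt(sequence, d_trunc):
        chunk_starts.append(list(hidden))  # state this chunk is initialised with
        for x in chunk:
            hidden = step(hidden, x)
        # Gradients would be computed over this chunk only, then hidden detached.
    return hidden, chunk_starts


def toy_step(hidden, x):
    # Stand-in for an LSTM cell: a leaky running sum (purely illustrative).
    return [0.5 * h + x for h in hidden]


final, starts = run_tbptt([1.0, 2.0, 3.0, 4.0, 5.0], d_trunc=2, step=toy_step)
# starts == [[0.0, 0.0], [2.5, 2.5], [6.125, 6.125]]
```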

Two examples for interaction- and obstacle-aware pedestrian trajectory prediction using the presented
LSTM model. The big circles represent pedestrians (number specifies the agent id), the small ones the ten
prediction steps. The current velocity of each pedestrian is indicated by the arrow. Static obstacles are
shown as black grid cells. Both examples stem from the test data which was not used for training the model.

Human Motion Prediction Under Social
Grouping Constraints
• 2018 IROS
• Accurate long-term prediction of human motion in populated spaces is an important but
difficult task for mobile robots and intelligent vehicles.
• What makes this task challenging is that human motion is influenced by a large variety of
factors including the person’s intention, the presence, attributes, actions, social relations
and social norms of other surrounding agents, and the geometry and semantics of the
environment.
• The paper computes human motion predictions that account for such factors.
• It formulates the task as an MDP planning problem with stochastic policies and proposes a
weighted random walk algorithm in which each agent is locally influenced by social forces
from other nearby agents.
• It incorporates social grouping information into the prediction process, reflecting the soft
formation constraints that groups typically impose on their members' motion.

• This method is for jointly predicting trajectories of all agents in the scene.
• It assumes that a person tracking system delivers short sequences of observed agent
positions, called tracklets, and that this system also provides group detection as a
partitioning of individual agents into groups.
• Such systems have also been extended with the ability to detect and reason about social
grouping hypotheses.
• Local Interaction and Group Motion Modeling: The social force model describes how the
intended motion of a person changes under the influence of repulsive forces from
other people; further social forces attract people walking in groups to
other members of the group (attraction term) and impose soft constraints on the walking
formation that resemble typical patterns of humans in groups (visibility term);
• Stochastic Policy Sampling Using Random Walks: To make predictions using the stochastic
policy πg, the method uses a random walk algorithm that samples K joint paths for all people
in the scene; each joint path represents a possible future interaction given the observed
tracklets and available group information; during the random walk, the social
interactions among the agents, which affect each agent's instantaneous stochastic policy,
are evaluated according to the group social force model.
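
To make the repulsion and attraction terms concrete, here is a minimal sketch. The exponential repulsion, the linear pull toward the group centroid, and all gains are assumptions chosen for illustration; the slides do not give the paper's functional forms, and the visibility term is omitted.

```python
import math


def repulsive_force(pos_i, pos_j, strength=2.0, falloff=1.0):
    """Force pushing agent i away from agent j, decaying exponentially with distance."""
    dx, dy = pos_i[0] - pos_j[0], pos_i[1] - pos_j[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)
    mag = strength * math.exp(-dist / falloff)
    return (mag * dx / dist, mag * dy / dist)


def group_attraction(pos_i, group_positions, gain=0.5):
    """Force pulling agent i toward the centroid of its group."""
    cx = sum(p[0] for p in group_positions) / len(group_positions)
    cy = sum(p[1] for p in group_positions) / len(group_positions)
    return (gain * (cx - pos_i[0]), gain * (cy - pos_i[1]))


def net_social_force(i, positions, group):
    """Attraction to the group's centroid plus repulsion from every other agent."""
    fx, fy = group_attraction(positions[i], [positions[k] for k in group])
    for j, p in enumerate(positions):
        if j != i:
            rx, ry = repulsive_force(positions[i], p)
            fx, fy = fx + rx, fy + ry
    return (fx, fy)


# Agents 0 and 2 form a group; agent 1 stands between them.
positions = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
force = net_social_force(0, positions, group=[0, 2])
```

In a random walk of the kind described above, a force like this would bias the probabilities of the discretized actions rather than set the motion directly.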

Algorithm 1 summarizes the operations required to obtain predictions. Assume that K joint random paths are
requested, N people are in the scene and T prediction steps are made. The complexity of the goal sampling
operation for every human (line 2) depends on the number of goals |G|. Group center calculation is done only once
for each time step (line 4). The random action sampling procedure (line 6) depends on the action space discretization
(A angles and V velocities) and has a worst-case complexity of O(AV). This happens when the agent is moving with
velocity close to v_max. The social force in the direction of agent i (line 7) is computed for each surrounding agent
within a certain radius. In the worst case, when all agents are densely located, the complexity is O(N). The group
social force computation (line 8) is a constant time operation. The overall complexity is O(K(N|G| + T(N(AV + N)))).
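
As a worked example of the bound above, the expression can be evaluated for one configuration. N = 21 and |G| = 4 match the simulated scenario in this deck; the values of K, T, A and V are made up for illustration, and constant factors are ignored.

```python
def worst_case_ops(K, N, G, T, A, V):
    """Evaluate K * (N*|G| + T * N * (A*V + N)), the quoted bound without constants."""
    goal_sampling = N * G          # line 2: goal sampling for every person
    per_step = N * (A * V + N)     # lines 6-7: action sampling plus pairwise forces
    return K * (goal_sampling + T * per_step)


ops = worst_case_ops(K=100, N=21, G=4, T=30, A=16, V=5)
# ops == 6371400
```

The T * N * N term grows fastest with crowd size, which is why the pairwise social force evaluation dominates in dense scenes.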

Prediction results in a simulated scenario with obstacles and 21 people walking in 7 groups. Goals are placed in
the 4 corners of the map. Left: initial positions of people are shown in colored circles, each color corresponding
to one group. Right, top row: predicted positions with GSF-MDP for several points in time. Consider e.g. the
green group that waits until the passage is cleared by the red and blue groups without losing its formation. Then
it gives way for the faster orange group. People in the red group are correctly predicted to maintain a side-by-side
walking formation. Right, bottom row: predicted positions with the JS-MDP baseline, where group motion is not
modeled. The green group performs unnecessary maneuvers, then gets separated. The same happens with the
red and orange groups, who lose their members in the crowd.

Social GAN: Socially Acceptable Trajectories
with Generative Adversarial Networks
• Understanding human motion behavior is critical for autonomous moving platforms
(like self-driving cars and social robots) if they are to navigate human-centric
environments.
• This is challenging because human motion is inherently multimodal: given a history
of human motion paths, there are many socially plausible ways that people could
move in the future.
• They tackle this problem by combining tools from sequence prediction and
generative adversarial networks: a recurrent sequence-to-sequence model observes
motion histories and predicts future behavior, using a novel pooling mechanism to
aggregate information across people.
• It predicts socially plausible futures by training adversarially against a recurrent
discriminator, and encourages diverse predictions with a novel variety loss.
• Through experiments on several datasets they demonstrate that the approach
outperforms prior work in terms of accuracy, variety, collision avoidance, and
computational complexity.
CVPR 2018

Illustration of a scenario where two pedestrians want to avoid each other. There are many possible ways
that they can avoid a potential collision. This work presents a method that given the same observed past,
predicts multiple socially acceptable outputs in crowded scenes.
Forecasting the behavior of humans is challenging: Interpersonal, Socially Acceptable, Multimodal.
While existing methods have made great progress in addressing specific challenges, they suffer
from two limitations: 1) they model a local neighborhood around each person when making the
prediction; hence they do not have the capacity to model interactions between all people in a scene
in a computationally efficient fashion; 2) they tend to learn the "average behavior" because of the
commonly used loss function that minimizes the Euclidean distance between the ground truth and
forecasted outputs.
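
The "average behavior" problem and the variety loss that addresses it can be contrasted in a small sketch. The variety loss keeps only the smallest L2 distance among k sampled futures; the function names and the toy trajectories below are assumptions for illustration.

```python
import math


def l2_distance(traj_a, traj_b):
    """Euclidean distance between two trajectories given as lists of (x, y) points."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(traj_a, traj_b)))


def variety_loss(ground_truth, samples):
    """Penalise only the best of the k sampled futures."""
    return min(l2_distance(ground_truth, s) for s in samples)


def average_l2_loss(ground_truth, samples):
    """Baseline: penalise every sample, which pulls all of them toward the mean."""
    return sum(l2_distance(ground_truth, s) for s in samples) / len(samples)


truth = [(1.0, 0.0), (2.0, 0.0)]
samples = [
    [(1.0, 0.0), (2.0, 0.0)],    # goes straight: matches the ground truth
    [(1.0, 1.0), (2.0, 2.0)],    # swerves left
    [(1.0, -1.0), (2.0, -2.0)],  # swerves right
]
best = variety_loss(truth, samples)
mean = average_l2_loss(truth, samples)
```

Here the variety loss is zero because one sample matches, so the two swerving samples are not punished for being different; the averaged loss would push all three toward the straight path.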

System overview. The model consists of three key components: Generator (G), Pooling Module, and
Discriminator (D). G takes as input past trajectories X_i and encodes the history of person i as H_i^t. The
pooling module takes as input all H_i^{t_obs} and outputs a pooled vector P_i for each person. The decoder
generates the future trajectory conditioned on H_i^{t_obs} and P_i. D takes as input T_real or T_fake and
classifies them as socially acceptable or not.

Comparison between this pooling mechanism (red dotted arrows) and Social Pooling (red dashed grid) for the
red person. This method computes relative positions between the red and all other people; these positions are
concatenated with each person’s hidden state, processed independently by an MLP, then pooled elementwise
to compute red person’s pooling vector P1. Social pooling only considers people inside the grid, and cannot
model interactions between all pairs of people.
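
The pooling mechanism in this caption can be sketched as follows. A single ReLU layer with hand-picked weights stands in for the MLP, and max is assumed as the elementwise pooling operation; all names and numbers are illustrative assumptions.

```python
def relu_layer(vec, weights, bias):
    """One fully connected layer with ReLU; stands in for the MLP."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def pooling_vector(i, positions, hidden, weights, bias):
    """Pooled vector for person i: embed every other person, then elementwise max."""
    embedded = []
    for j, (pos, h) in enumerate(zip(positions, hidden)):
        if j == i:
            continue
        rel = [pos[0] - positions[i][0], pos[1] - positions[i][1]]
        # Relative position concatenated with that person's hidden state.
        embedded.append(relu_layer(rel + list(h), weights, bias))
    return [max(column) for column in zip(*embedded)]


positions = [(0.0, 0.0), (1.0, 0.0), (0.0, -3.0)]
hidden = [[0.1], [0.5], [0.9]]
weights = [[1.0, 0.0, 0.0],   # output 0 looks at relative x
           [0.0, 0.0, 1.0]]   # output 1 looks at the hidden state
bias = [0.0, 0.0]
p_0 = pooling_vector(0, positions, hidden, weights, bias)
# p_0 == [1.0, 0.9]
```

Because the max runs over every other person rather than over grid cells, the result has a fixed size and is independent of the order and number of people.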

Examples of diverse predictions from this model. Each row shows a different set of observed trajectories;
columns show four different samples from our model for each scenario which demonstrate different types of
socially acceptable behavior. BEST is the sample closest to the ground-truth; in SLOW and FAST samples, people
change speed to avoid collision; in DIR samples people change direction to avoid each other. This model learns
these different avoidance strategies in a data-driven manner, and jointly predicts globally consistent and
socially acceptable trajectories for all people in the scene.

SoPhie: An Attentive GAN for Predicting Paths
Compliant to Social and Physical Constraints
• This paper addresses the problem of path prediction for multiple interacting agents in a
scene, crucial for many autonomous platforms such as self-driving cars and social robots.
• SoPhie: an interpretable framework based on Generative Adversarial Network (GAN), which
leverages two sources of information, the path history of all the agents in a scene, and the
scene context information, using images of the scene.
• To predict a future path for an agent, both physical and social information must be leveraged.
• Previous work has not succeeded in jointly modeling physical and social interactions.
• It blends a social attention mechanism with a physical attention that helps the model to
learn where to look and extract the most salient parts of the image relevant to the path.
• The social attention component aggregates info across the different agent interactions and
extracts the most important trajectory information from the surrounding neighbors.
• SoPhie also takes advantage of a GAN to generate more realistic samples and to capture the
uncertain nature of the future paths by modeling their distribution.
arXiv: 2018.9

SoPhie predicts trajectories that are socially and physically
plausible. To perform this, our approach incorporates the
influence of all agents in the scene as well as the scene context.

• First, the feature extractor module extracts proper features from the scene, i.e. the image at the
current frame I_t, using a convolutional neural network.
• It also uses an LSTM encoder to encode an index-invariant, but temporally dependent, feature between
the state of each agent, X_i^{1:t}, and the states of all other agents up to the current frame, X_{1:N\i}^{1:t}.
• Then, the attention module highlights the most important information of the inputted features for the
next module.
• The attention module consists of two attention mechanisms named as social and physical attention
components.
• The physical attention learns the spatial (physical) constraints in the scene from the training data and
concentrates on physically feasible future paths for each agent.
• Similarly, the social attention module learns the interactions between agents and their influence on
each agent’s future path.
• Finally, the LSTM based GAN module takes the highlighted features from the attention module to
generate a sequence of plausible and realistic future paths for each agent.
• In more detail, an LSTM decoder is used to predict the temporally dependent state of each agent in
the future, i.e. Ŷ_i^{1:T}.
• Similar to GAN, a discriminator is also applied to improve the performance of the generator model by
forcing it to produce more realistic samples (trajectories).

An overview of SoPhie architecture. Sophie consists of three key modules including: (a) A feature extractor
module, (b) An attention module, and (c) An LSTM based GAN module.

Three sample scenarios where physical and social attention allow correct predictions and fix the
Social GAN errors. In all figures, past and predicted trajectories are plotted as lines and distributions,
respectively. The weight maps of the physical attention mechanism are highlighted in white on
the image. The white boxes on the agents show the social attention on the agents with respect to
the blue agent. The size of each box is proportional to the attention weight on that agent.

Social Attention: Modeling Attention in
Human Crowds
• 2018.10 ICRA
• Robots navigating through crowds need to plan safe, efficient, and human-predictable trajectories.
• This is challenging as it requires the robot to predict future human trajectories within a
crowd where everyone implicitly cooperates with each other to avoid collisions.
• Previous trajectory prediction work has modeled interactions between humans as a function of
proximity.
• However, this is not necessarily true: some people in the immediate vicinity moving in the same
direction might be less important than people further away who might collide in the future.
• This work presents Social Attention, a trajectory prediction model that captures the relative
importance of each person when navigating in the crowd, irrespective of their proximity.
• The method uses a feedforward, fully differentiable, and jointly trained RNN mixture to
model trajectories of all humans in the crowd, both spatially and temporally.
• The human-human interactions are modeled using a soft attention model over all humans
in the crowd, thereby not restricting the approach with the local neighborhood assumption.
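
A soft attention step over all surrounding agents can be sketched as below. Scaled dot-product scoring is assumed here as one common choice; the vectors are hypothetical stand-ins for learned RNN hidden states, and the function name is illustrative.

```python
import math


def soft_attention(query, keys, values):
    """Weight every surrounding agent by a softmax over scaled dot-product scores."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


query = [1.0, 0.0]                                # pedestrian of interest (hypothetical)
others = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]    # three other agents (hypothetical)
weights, context = soft_attention(query, others, others)
```

Every agent receives a non-zero weight, so no hard distance cutoff is imposed; agents whose state aligns with the query simply count more.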

Humans, when navigating a crowd, pay attention to only a subset of surrounding agents at each time-step.
This work seeks to learn such an attention model over surrounding agents to predict trajectories
of all agents in the crowd more accurately by capturing subtle human-human interactions.

• Humans navigate crowds by adapting their own trajectories based on the motion of others
around them.
• It is assumed that this influence is spatially local, i.e., only spatial neighbors influence the
motion of a human in the crowd.
• But this is not necessarily true and other features such as velocity, acceleration and heading
play an important role, enabling agents who are not spatially local to influence a
pedestrian’s motion.
• The goal is to model the influence of all agents in the crowd by learning an attention model over the agents.
• Which surrounding agents do humans attend to while navigating a crowd? The hypothesis
is that the representation of trajectories learned by the model makes it possible to reason about
the importance of surrounding agents more effectively than considering only spatially local agents.
• To model interactions among humans, the future locations of each human should not be predicted
independently; instead, the model needs to reason jointly across multiple people and couple their
predictions so that interactions among them are captured.
• Towards this goal, the method uses a feedforward, fully differentiable, and jointly trained RNN
mixture that both predicts their future locations and captures human-human interactions.

Example ST-graph, unrolled ST-graph for two time steps and corresponding factor graph

• It formulates the problem of human trajectory prediction as a spatio-temporal graph.
• Nodes of st-graph represent the humans, the spatial edges connect two different humans
at the same time-step, and temporal edges connect the same human at adjacent time-steps.
• The spatial edges aim to capture the dynamics of relative orientation and distance between
two humans, and temporal edges capture the dynamics of the human’s own trajectory.
• The factor graph representation of the st-graph associates a factor function for each node
and a pairwise factor function for each edge in the graph.
• At each time-step, factors in the st-graph observe node/edge features and perform some
computation on those features. Each of these factors has parameters to be learned.
• In the formulation, all the nodes share the same factor, giving the model scalability to handle
more nodes (in dense crowds) without increasing the number of parameters.
• All spatial edges share a common factor and all temporal edges share the same factor.
• This kind of parameter sharing is necessary to generalize across scenes with varying numbers
of humans, and keeps the parameterization compact.
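
The structure of the unrolled st-graph can be made explicit with a small sketch. Nodes are (human, time-step) pairs; treating spatial edges as unordered pairs is a simplifying assumption, and the function name is illustrative.

```python
from itertools import combinations


def build_st_graph(num_humans, num_steps):
    """Return nodes, spatial edges and temporal edges of an unrolled st-graph."""
    nodes = [(h, t) for t in range(num_steps) for h in range(num_humans)]
    # Spatial edges: two different humans at the same time-step.
    spatial = [((a, t), (b, t))
               for t in range(num_steps)
               for a, b in combinations(range(num_humans), 2)]
    # Temporal edges: the same human at adjacent time-steps.
    temporal = [((h, t), (h, t + 1))
                for t in range(num_steps - 1)
                for h in range(num_humans)]
    return nodes, spatial, temporal


nodes, spatial, temporal = build_st_graph(num_humans=3, num_steps=2)
# 6 nodes, 6 spatial edges, 3 temporal edges
```

However large the graph grows, there are only three kinds of factor (node, spatial edge, temporal edge), which is what allows one shared parameter set per kind.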

Architecture of EdgeRNN (left), Attention module (middle) and NodeRNN (right)

• The factor graph representation lends itself naturally to the S-RNN architecture.
• It represents each factor with an RNN: nodeRNNs for each of the node factors, and
edgeRNNs for each of the edge factors.
• Note that all the nodeRNNs, spatial edgeRNNs and temporal edgeRNNs share parameters
among themselves.
• The spatial edgeRNNs model dynamics of human-human interactions in the crowd and the
temporal edgeRNNs model dynamics of individual motion of each human in the crowd.
• The nodeRNNs use the node features and hidden states from the neighboring edgeRNNs to
predict the future location of the node at the next time-step.
• Since the model parameters are shared across all nodes and edges, the number of parameters
is independent of the number of pedestrians at any given time.

Attention weights predicted by
Social Attention. In all plots, the red
trajectory is the pedestrian whose
attention weights are being
predicted. The rest of the
trajectories are other pedestrians in
the crowd. Solid dots represent
locations of the pedestrians at
previous time-steps and the blue
diamond marker represents their
current locations. The circles around
the current position of surrounding
pedestrians represent attention,
with the radius proportional to
attention weight.

Learning to Predict Pedestrian Intention via
Variational Tracking Networks
• 2018.11
• A deep learning based system for short term prediction of pedestrian behavior in
front of a vehicle.
• To achieve this, the authors develop a framework for class-specific object tracking and
short term path prediction based on a variant of a Variational Recurrent Neural Network
(VRNN), which incorporates latent variables corresponding to a dynamic state space
model.
• The low level visual features learned from this system were found to be highly
informative for the discrete intention prediction task (i.e., predicting whether a
pedestrian is stopping or crossing), and achieved high performance on the Daimler
benchmark.
• This is despite a much smaller training dataset than is normally used for training
deep learning models.
• The method does not rely on externally trained pedestrian pose estimation systems.

Overall data flow diagram of the approach.
From a set of images and noisy object
detections in images, it uses concepts from
variational recurrent neural networks to learn
how to track pedestrians. Features arising
from object tracking were found to be useful
for intention prediction, and a separate
recurrent neural network estimates the binary
intention label.
I_t represents the image at time t, x_t represents
the detection and v_t denotes the appropriately
time-shifted binary intention prediction label.

Zoomed in view of the tracking controller block.
The key feature is a Variational recurrent neural
network (VRNN) which uses neural networks to
estimate the mean and variance of innovations to
apply to the latent state. It also maintains
estimates of the intention prediction for evaluation
on the Daimler dataset. Notably, the GRU cell does
not receive x_t: if the inference network were given
the true value of x_t, the network could simply copy
x_t directly without learning anything from the image.
For each training update, a rollout of the system is
performed over a fixed window of training data, and
the loss is then aggregated over all trackers.
Location-Velocity Attention for Pedestrian
Trajectory Prediction
• 2019.1 WACV
• Pedestrian path forecasting is crucial in applications such as smart video surveillance.
• It is a challenging task because of the complex crowd movement patterns in the scenes.
• Most existing state-of-the-art LSTM-based prediction methods require rich context such as
labelled static obstacles, labelled entrance/exit regions and even the background scene.
• Furthermore, incorporating contextual information into trajectory prediction increases the
computational overhead and decreases the generalization of the prediction models across
different scenes.
• The paper presents a joint Location-Velocity Attention (LVA) LSTM-based method to predict trajectories.
• Specifically, a module is designed to tweak the LSTM network, and an attention mechanism is
trained to optimally combine the location and the velocity information of
pedestrians in the prediction process.
• On several publicly available datasets, the results show that it not only outperforms other
prediction methods but it also has a good generalization ability.

It proposes a joint location-velocity attention prediction network.
There are two input streams: the location information and the
velocity information. A joint attention mechanism is used to link
these two information streams in the prediction process.

• Studies on pedestrian trajectory prediction started in the 90s, with the Social
Force model being one of the pioneer papers.
• Current state-of-the-art pedestrian trajectory prediction methods rely
heavily on rich contextual information in the prediction process.
• This approach does not require a pooling layer or similar module to describe
the neighborhood.
• It uses a module to model the relationship between the location and the
velocity information in the prediction process.
• Unlike scene related context, the location and the velocity information can be
obtained from the trajectory data.
• It has the generalization ability w.r.t. different prediction lengths and scenes.

LVA network. Two LSTM layers are used for the location and the velocity information separately. In the prediction
phase, use a tweak module to combine the location and the velocity layers’ outputs.

• In general, people have similar walking speeds and similar collision avoidance habits (e.g.,
accelerating, stopping, turning left or right) regardless of the scene they are in.
• This scene-independent character of velocity motivates researchers to incorporate
velocity into the prediction framework to strengthen the generalization of the
prediction model.
• Since each velocity term (ut, vt) captures the instantaneous walking direction and walking
stride and is independent of the absolute location on the ground, the model learns the
location-independent movement patterns of pedestrians.
• On the other hand, from the location coordinates (xt, yt), the model learns scene-dependent
features such as the layout of walkable and non-walkable areas of the scene.
• In the LVA network, the location LSTM layer and the velocity LSTM layer process the location
and velocity information of the observed trajectories in parallel.
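The relation between the two input streams can be illustrated with a minimal sketch. It assumes that velocities are obtained by finite differences of consecutive locations with a fixed frame interval; the slides do not state how (ut, vt) are computed, and the function names `velocities_from_locations` and `translate` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def velocities_from_locations(track: List[Point], dt: float = 1.0) -> List[Point]:
    """Finite-difference velocities (u_t, v_t) from locations (x_t, y_t).

    Assumption: a constant frame interval dt between consecutive points.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]


def translate(track: List[Point], dx: float, dy: float) -> List[Point]:
    """Shift a whole track by a constant offset (a different place in the scene)."""
    return [(x + dx, y + dy) for x, y in track]


if __name__ == "__main__":
    track = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)]
    shifted = translate(track, 10.0, -3.0)
    # The location stream changes with the shift, the velocity stream does not.
    print(velocities_from_locations(track))
    print(velocities_from_locations(shifted))
```

Shifting the track changes every location but leaves the velocities unchanged, which is the location-independence the bullets above describe.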

Detail of the tweak module. It has three layers: an
attention layer, a softmax layer and a tweak layer.
Although attention mechanisms have been used in LSTM
models, this attention mechanism differs in two respects:
1) unlike the sequence Encoder-Decoder architecture,
where the attention is on specific time steps for each
input sequence, this attention mechanism is used to learn
the weights at each time step of prediction; 2) this
attention mechanism does not use any neighborhood
information around the pedestrian of interest (POI), unlike
some other existing trajectory prediction methods. In terms of
computational complexity, this attention mechanism is considerably simpler.
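The three layers named above (attention, softmax, tweak) can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: the attention layer is assumed to produce one linear score per stream, and the tweak layer is assumed to be a weighted sum of the two LSTM outputs. The names `tweak`, `w_loc` and `w_vel` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tweak(
    h_loc: Sequence[float],
    h_vel: Sequence[float],
    w_loc: Sequence[float],
    w_vel: Sequence[float],
) -> Tuple[List[float], Tuple[float, float]]:
    """Combine location and velocity LSTM outputs at one prediction step.

    Assumed design (not taken from the paper):
      attention layer: one scalar score per stream via a linear map,
      softmax layer:   scores -> weights that sum to 1,
      tweak layer:     weighted sum of the two stream outputs.
    """
    scores = [dot(w_loc, h_loc), dot(w_vel, h_vel)]          # attention layer
    a_loc, a_vel = softmax(scores)                           # softmax layer
    fused = [a_loc * l + a_vel * v for l, v in zip(h_loc, h_vel)]  # tweak layer
    return fused, (a_loc, a_vel)


if __name__ == "__main__":
    fused, weights = tweak([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0])
    print(fused, weights)
```

Because the weights are recomputed at every prediction step and depend only on the pedestrian's own two streams, no neighborhood pooling is involved, which matches the two differences listed in the caption.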

• The publicly available Central Station Dataset contains over 10,000 pedestrian
trajectories, extracted from a 33-minute-long surveillance video.
• Trajectory Forecasting Benchmark (also known as TrajNet) is a large-scale trajectory-based
activity benchmark that provides a unified evaluation system for testing state-of-the-art
methods.
• It contains various trajectory-based activity forecasting datasets such as BIWI, UCY, MOT,
and Stanford Drone Dataset (SDD).
• The TrajNet benchmark contains multiple video sequences in unconstrained
environments filmed with both static and moving cameras.
• The Town Centre dataset contains hundreds of human trajectories in a real-world crowded
scene.
• The annotation provides bounding boxes of the head and the body of each pedestrian.

Examples of predicted trajectories on the TrajNet benchmark. The images shown here are generated by the
TrajNet Visualizer. They have been enlarged for visualization purposes. The ground truth trajectory points are
displayed as blue squares. The predicted trajectory points generated by Social-LSTM (top row) and by LVA
(bottom row) are shown as pink circles.

Situation-Aware Pedestrian Trajectory Prediction
with Spatio-Temporal Attention Model
• 2019.2 WACV
• Pedestrian trajectory prediction is essential for collision avoidance in autonomous driving
and robot navigation.
• However, predicting a pedestrian’s trajectory in crowded environments is non-trivial as it is
influenced by other pedestrians’ motion and static structures that are present in the scene.
• Such human-human and human-space interactions lead to non-linearities in the
trajectories.
• The paper proposes a spatio-temporal graph-based Long Short-Term Memory (LSTM) network for
predicting pedestrian trajectories in crowded environments, which takes into account the
interaction with static (physical objects) and dynamic (other pedestrians) elements in the
scene.
• It improves the modeling of correlations among multiple trajectories over the space-time
dimensions using the 2D locations of the static and dynamic elements.
• Unlike Social LSTM, it is aware of static obstacles in close proximity.

• Prior work on human motion prediction has modeled human-human and human-space
interactions separately.
• Some works mainly target constrained environments with low crowd density.
• Human motion prediction follows two main trends regarding context inclusion:
local context and global context.
• Existing works also split into two branches in how they weigh the influence of multiple
objects: attention-based and uniform approaches.
• Local context methods observe interaction only for a short duration, once pedestrians are
close enough to each other, which gives a limited understanding of social interaction.
• In contrast, including social interactions on a global scene scale enables the model to better
understand how interaction evolves between a pair of pedestrians, based on the velocity effect
that the model inherently captures from changes in spatial distances over time.

(a) Crowded environment displayed over 2 time-steps.
(b) Crowd mapping to abstract spatio-temporal
graph unrolled through two time-steps
Crowd mapping to spatio-temporal graph. (a) A static obstacle is drawn as a red rectangle surrounded by a virtual
circle which indicates its neighborhood boundaries. (b) The blue nodes represent pedestrians 1,2,3,4,5 and the
red dashed node represents obstacle o such that o ∈ O. Directed downward lines indicate temporal edges linking
the same node over time-steps and undirected lines are two-way spatial edges connecting pedestrian nodes. A
directed edge points from the obstacle node to a pedestrian node to indicate the obstacle's influence on the pedestrian.

• The spatio-temporal graph is a dynamic structure that evolves temporally and spatially, due
to the motion state of the pedestrians and changes in the scene (e.g. as the number of elements
in the scene increases or decreases).
• Crowd subjects are represented as a spatio-temporal graph G = (V, ΣS, ΣT) comprising
three key components: the node set V, the spatial edge set ΣS and the temporal edge set ΣT.
Nodes represent the dynamic and static elements (e.g. pedestrians and static objects), and
spatial edges represent the interaction between two nodes.
• Temporal edges link the same pedestrian node over successive time-steps and thus connect
the graph when it is unrolled over time.
• (a) illustrates the dynamic structure with an arbitrary crowd at two consecutive time-steps.
• (b) shows the corresponding spatio-temporal graph representation, which evolves
dynamically over the spatial and temporal domain.
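The graph construction described above can be sketched in a few lines. The sketch makes assumptions the slides do not confirm: every pair of pedestrians present in a frame gets a spatial edge (global context), an obstacle influences a pedestrian only inside a circular neighborhood of a given radius, and edges are stored as plain tuples. The function name `build_st_graph` and its parameters are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Hashable, List, Set, Tuple

Point = Tuple[float, float]


def build_st_graph(
    frames: List[Dict[Hashable, Point]],
    obstacles: Dict[Hashable, Point],
    radius: float,
):
    """Unroll a crowd into nodes, spatial, obstacle and temporal edges.

    frames[t] maps pedestrian id -> (x, y) at time-step t.
    obstacles maps obstacle id -> (x, y) of a static object.
    """
    nodes: Set[Tuple[str, Hashable]] = {("obs", oid) for oid in obstacles}
    spatial: Set[Tuple[int, Hashable, Hashable]] = set()    # undirected, per time-step
    obstacle_edges: Set[Tuple[int, Hashable, Hashable]] = set()  # directed obs -> ped
    temporal: Set[Tuple[Hashable, int, int]] = set()        # same pedestrian, t -> t+1

    for t, frame in enumerate(frames):
        nodes.update(("ped", pid) for pid in frame)
        # Assumption: all pedestrian pairs in the frame interact.
        for a, b in combinations(sorted(frame), 2):
            spatial.add((t, a, b))
        # Assumption: an obstacle acts only within its virtual circle.
        for oid, (ox, oy) in obstacles.items():
            for pid, (px, py) in frame.items():
                if math.hypot(px - ox, py - oy) <= radius:
                    obstacle_edges.add((t, oid, pid))
        # Temporal edges link a pedestrian to itself in the next frame.
        if t + 1 < len(frames):
            for pid in frame:
                if pid in frames[t + 1]:
                    temporal.add((pid, t, t + 1))

    return nodes, spatial, obstacle_edges, temporal
```

Because the edge sets are rebuilt per time-step, pedestrians entering or leaving the scene change the graph, which is the dynamic behavior the first bullet describes.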

Multi-node attention mechanism pipeline for pedestrian node v2 at time-step t = 2.

• Multi-Head attention works by stacking multiple attention layers (heads), where each
layer learns mappings between words in two sentences.
• Here a simpler attention mechanism is used, Multi-Node attention, which has only a single
layer that jointly attends to features from the spatial and temporal domains and stores
the attention coefficients in a single vector for the node v2 trajectory at each time-step.
• Neighboring edgeLSTM states are transformed before concatenation using the
embedding function, which is a composite of Parametric ReLU and softmax.
• This combined activation ensures that hidden states remain within a small range of [-1,1],
which is mapped once again at the sampling stage to a normalized output range of [0,1].
• Parametric ReLU generalizes the ReLU function by making the leak parameter α a
learnable network parameter.
• Such an activation function, with an adaptive leak parameter, allows a slightly
different span of negative hidden states across training batches.
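A small sketch of the embedding function may help. It assumes the composite applies Parametric ReLU first and softmax second, and it treats α as a plain argument rather than a trained parameter; the slides give neither the order nor the training details, and the names `prelu` and `embed` are hypothetical.

```python
import math
from typing import List, Sequence


def prelu(xs: Sequence[float], alpha: float) -> List[float]:
    """Parametric ReLU: identity for positive inputs, slope alpha otherwise.

    In a network alpha would be learned; here it is passed in explicitly.
    """
    return [x if x > 0 else alpha * x for x in xs]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def embed(hidden: Sequence[float], alpha: float) -> List[float]:
    """Assumed composite embedding: PReLU followed by softmax."""
    return softmax(prelu(hidden, alpha))


if __name__ == "__main__":
    h = [-2.0, 0.0, 3.0]
    print(prelu(h, 0.25))   # negative entries are scaled by alpha
    print(embed(h, 0.25))   # bounded values that sum to 1
```

With α = 0 the function reduces to the ordinary ReLU, and a different α changes only how far negative hidden states extend, which is the adaptive span the last bullet refers to.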

Visualization results for Hotel and Zara sets (a-b).

Visualization results for Hotel and Zara sets (c-d).
