Prediction and Planning for Self Driving at Waymo
1. PREDICTION AND PLANNING FOR SELF DRIVING @WAYMO (GOOGLE)
YU HUANG
SUNNYVALE, CALIFORNIA
YU.HUANG07@GMAIL.COM
2. References
• ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
• MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction
• VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
• TNT: Target-driven Trajectory Prediction
• Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset
• Identifying Driver Interactions via Conditional Behavior Prediction
• Peeking into the Future: Predicting Future Person Activities and Locations in Videos
• STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction
3. CHAUFFEURNET: LEARNING TO DRIVE BY IMITATING THE BEST AND SYNTHESIZING THE WORST
• Train a policy for autonomous driving via imitation learning that is robust enough to drive a real vehicle.
• Standard behavior cloning is insufficient for handling complex driving scenarios, even when leveraging a perception system for preprocessing the input and a controller for executing the output on the car: 30 million examples are still not enough.
• Expose the learner to synthesized data in the form of perturbations to the expert's driving, which creates interesting situations such as collisions and/or going off the road.
• Rather than purely imitating all data, augment the imitation loss with additional losses that penalize undesirable events and encourage progress; the perturbations then provide an important signal for these losses and lead to robustness of the learned model.
• The ChauffeurNet model can handle complex situations in simulation.
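The perturbation idea above can be sketched as follows: shift a midpoint of a logged expert trajectory laterally and blend the offset smoothly back to zero at both endpoints, yielding a synthetic deviate-and-recover maneuver. The smooth bump blending and the `perturb_trajectory`/`max_lateral` names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def perturb_trajectory(traj, max_lateral=1.0, seed=None):
    """Synthesize a ChauffeurNet-style perturbation of an expert trajectory.

    traj: (N, 2) array of (x, y) waypoints from logged expert driving.
    A middle waypoint is shifted laterally, and the offset is blended to
    zero at both ends so the agent appears to deviate from and then return
    to the original path. Blending details are an assumption here.
    """
    rng = np.random.default_rng(seed)
    traj = np.asarray(traj, dtype=float)
    n = len(traj)
    mid = n // 2
    # Lateral direction = unit normal to the local heading at the midpoint.
    heading = traj[mid + 1] - traj[mid - 1]
    normal = np.array([-heading[1], heading[0]])
    normal /= np.linalg.norm(normal)
    offset = rng.uniform(-max_lateral, max_lateral)
    # Blend the offset in and out with a smooth bump: 0 at both endpoints.
    t = np.linspace(0.0, 1.0, n)
    bump = np.sin(np.pi * t) ** 2
    return traj + offset * bump[:, None] * normal
```

The perturbed trajectories "create interesting situations" precisely because the deviated states never occur in the expert data, which is what the added losses then penalize.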
11. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
• Predicting human behavior is a difficult and crucial task required for motion planning.
• It is challenging in large part due to the highly uncertain and multimodal set of possible outcomes in real-world domains such as autonomous driving.
• Beyond single MAP (maximum a posteriori) trajectory prediction, obtaining an accurate probability distribution of the future is an area of active interest.
• MultiPath leverages a fixed set of future state-sequence anchors that correspond to modes of the trajectory distribution.
• At inference, the model predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at each time step.
• The model is efficient, requiring only one forward inference pass to obtain multimodal future distributions, and the output is parametric, allowing compact communication and analytical probabilistic queries.
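The per-timestep Gaussian mixture described above can be queried analytically. A minimal sketch, assuming isotropic per-anchor uncertainties (the paper regresses fuller covariance parameters) and a hypothetical function name:

```python
import numpy as np

def multipath_mixture_logpdf(point, probs, means, sigmas):
    """Evaluate a MultiPath-style Gaussian mixture at one time step.

    probs:  (K,) softmax probabilities over the K anchors.
    means:  (K, 2) anchor waypoint plus regressed offset for this step.
    sigmas: (K,) isotropic standard deviations (simplifying assumption).
    Returns log p(point) under sum_k probs[k] * N(point; mu_k, sigma_k^2 I).
    """
    point = np.asarray(point, dtype=float)
    d2 = np.sum((means - point) ** 2, axis=1)  # squared distance to each mode
    # Log-density of each 2-D isotropic Gaussian component.
    log_comp = -d2 / (2.0 * sigmas**2) - np.log(2.0 * np.pi * sigmas**2)
    return np.log(np.sum(probs * np.exp(log_comp)))
```

This parametric form is what makes the output compact to communicate to a planner: a handful of (probability, mean, uncertainty) triples per time step instead of trajectory samples.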
13. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
• MultiPath estimates the distribution over future trajectories per agent in a scene, as follows:
• 1) Based on a top-down scene representation, the scene CNN extracts mid-level features that encode the state of individual agents and their interactions.
• 2) For each agent in the scene, crop an agent-centric view of the mid-level feature representation and predict the probabilities over the fixed set of K predefined anchor trajectories.
• 3) For each anchor, the model regresses offsets from the anchor states and uncertainty distributions for each future time step.
• The distribution is parameterized by anchor trajectories A; because directly learning a mixture suffers from mode collapse, the anchors are estimated a priori and then fixed while the rest of the parameters are learned, as is common practice in other domains such as object detection and human pose estimation. A practical way to obtain A is the k-means algorithm, used as a simple approximation.
• It trains the model via imitation learning by fitting parameters to maximize the log-likelihood of recorded driving trajectories.
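The a-priori anchor estimation can be sketched with plain k-means over flattened logged futures; the deterministic initialization and the `fit_anchors` name are simplifying assumptions (the paper cites k-means as one practical choice, not this exact routine):

```python
import numpy as np

def fit_anchors(trajectories, k, iters=50):
    """Approximate MultiPath's fixed anchor set A with k-means.

    trajectories: (N, T, 2) logged future trajectories in the agent frame.
    Returns (k, T, 2) anchors = cluster centroids over flattened waypoints.
    """
    N, T, D = trajectories.shape
    X = trajectories.reshape(N, T * D).astype(float)
    # Simple deterministic init (k-means++ would be used in practice).
    centers = X[np.linspace(0, N - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each trajectory to its nearest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers.reshape(k, T, D)
```

Freezing these centroids before training is what sidesteps mode collapse: the network only has to classify among anchors and regress small residuals, rather than discover the modes itself.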
14. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
• They still represent a history of dynamic and static scene context as a 3-dimensional array of data rendered from a top-down orthographic perspective.
• The first two dimensions represent spatial locations in the top-down image.
• The channels in the depth dimension hold static and time-varying (dynamic) content for a fixed number of previous time steps.
• Agent observations are rendered as oriented bounding-box binary images, one channel for each time step.
• Other dynamic context such as traffic light state and static context of the road (lane connectivity and type, stop lines, speed limit, etc.) form additional channels.
• An important benefit of using such a top-down representation is the simplicity of representing contextual information like the agents' spatial relationships to each other and semantic road information.
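One of these per-timestep binary agent channels can be rasterized as below. The grid convention, resolution, and point-in-box test are illustrative assumptions, not the paper's rendering pipeline:

```python
import numpy as np

def render_agent_channel(boxes, grid=(200, 200), res=0.5):
    """Rasterize oriented agent boxes into one top-down binary channel.

    boxes: iterable of (cx, cy, length, width, heading) in meters/radians,
           in a top-down frame centered on the grid.
    res:   meters per pixel.
    """
    H, W = grid
    ys, xs = np.mgrid[0:H, 0:W]
    # Pixel centers in metric coordinates, origin at the grid center.
    px = (xs - W / 2) * res
    py = (ys - H / 2) * res
    img = np.zeros(grid, dtype=np.uint8)
    for cx, cy, length, width, heading in boxes:
        c, s = np.cos(heading), np.sin(heading)
        # Rotate pixels into the box frame; inside iff |u|<=L/2 and |v|<=W/2.
        u = c * (px - cx) + s * (py - cy)
        v = -s * (px - cx) + c * (py - cy)
        img[(np.abs(u) <= length / 2) & (np.abs(v) <= width / 2)] = 1
    return img
```

Stacking one such channel per past time step, plus road and traffic-light channels, yields the 3-D input array the slide describes.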
15. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
Top: logged trajectories of all agents are displayed in cyan; the focused agent is highlighted by a red circle. Bottom: MultiPath showing up to 5 trajectories with uncertainty ellipses. Trajectory probabilities (softmax outputs) are encoded in a color map shown to the right. MultiPath can predict uncertain future trajectories at various speeds (1st column), different intents at intersections (2nd and 3rd columns) and lane changes (4th and 5th columns), where the regression baseline only predicts a single intent.
16. VECTORNET: ENCODING HD MAPS AND AGENT DYNAMICS FROM VECTORIZED REPRESENTATION
• Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).
• This paper introduces VectorNet, a hierarchical graph neural network (GNN) that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components.
• In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird's-eye images and encode them with convolutional neural networks (ConvNets), this approach operates on a vector representation.
• By operating on the vectorized high-definition (HD) maps and agent trajectories, it avoids lossy rendering and computationally intensive ConvNet encoding steps.
• To further boost VectorNet's capability in learning context features, it proposes a novel auxiliary task to recover the randomly masked-out map entities and agent trajectories based on their context.
• It also outperforms the state of the art on the Argoverse dataset.
https://github.com/DQSSSSS/VectorNet
18. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
• MOST OF THE ANNOTATIONS FROM AN HD MAP ARE IN THE FORM OF SPLINES (E.G. LANES), CLOSED SHAPES
(E.G. REGIONS OF INTERSECTIONS) AND POINTS (E.G. TRAFFIC LIGHTS), WITH ADDITIONAL ATTRIBUTE INFO SUCH
AS THE SEMANTIC LABELS OF THE ANNOTATIONS AND THEIR CURRENT STATES (E.G. COLOR OF THE TRAFFIC
LIGHT, SPEED LIMIT OF THE ROAD).
• FOR AGENTS, THEIR TRAJECTORIES ARE IN THE FORM OF DIRECTED SPLINES WITH RESPECT TO TIME.
• ALL OF THESE ELEMENTS CAN BE APPROXIMATED AS SEQUENCES OF VECTORS: FOR MAP FEATURES, PICK A
STARTING POINT AND DIRECTION, UNIFORMLY SAMPLE KEY POINTS FROM THE SPLINES AT THE SAME SPATIAL
DISTANCE, AND SEQUENTIALLY CONNECT THE NEIGHBORING KEY POINTS INTO VECTORS; FOR TRAJECTORIES,
JUST SAMPLE KEY POINTS WITH A FIXED TEMPORAL INTERVAL (0.1 SECOND), STARTING FROM T = 0, AND
CONNECT THEM INTO VECTORS.
• GIVEN SMALL ENOUGH SPATIAL OR TEMPORAL INTERVALS, THE RESULTING POLYLINES SERVE AS CLOSE
APPROXIMATIONS OF THE ORIGINAL MAP AND TRAJECTORIES.
• TO EXPLOIT THE SPATIAL AND SEMANTIC LOCALITY OF THE NODES, IT TAKES A HIERARCHICAL APPROACH BY
FIRST CONSTRUCTING SUBGRAPHS AT THE VECTOR LEVEL, WHERE ALL VECTOR NODES BELONGING TO THE
SAME POLYLINE ARE CONNECTED WITH EACH OTHER.
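The vectorization procedure described above can be sketched in a few lines. This is a simplified illustration; `polyline_to_vectors` and its attribute-free output format are assumptions, not the paper's exact interface, and semantic attribute features are omitted.

```python
import numpy as np

def polyline_to_vectors(points, interval):
    """Resample a polyline at (approximately) a fixed spatial interval and
    connect neighboring key points into vectors [start_x, start_y, end_x, end_y].
    Attribute features (semantic labels, timestamps) are omitted for brevity."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    n = max(int(round(s[-1] / interval)), 1)      # number of vectors
    targets = np.linspace(0.0, s[-1], n + 1)      # uniformly spaced key points
    keypoints = np.stack(
        [np.interp(targets, s, points[:, d]) for d in range(points.shape[1])],
        axis=1)
    # sequentially connect neighboring key points into vectors
    return np.concatenate([keypoints[:-1], keypoints[1:]], axis=1)

# A straight 10 m lane sampled at 1 m yields 10 connected vectors.
vecs = polyline_to_vectors([[0.0, 0.0], [10.0, 0.0]], interval=1.0)
```

Consecutive vectors share endpoints, so the resulting polyline closely approximates the original geometry as the interval shrinks.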
19. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
The computation flow on the vector nodes of the
same polyline. The polyline subgraph network
can be seen as a generalization of PointNet.
However, by embedding the ordering
information into vectors, constraining the
connectivity of subgraphs based on the polyline
groupings, and encoding attributes as node
features, this method is particularly suitable to
encode structured map annotations and agent
trajectories.
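The PointNet-style computation on one polyline can be sketched as follows. This is a toy single-layer version with random placeholder weights; the actual network stacks several such layers with learned parameters.

```python
import numpy as np

def subgraph_layer(node_feats, weights):
    """One polyline-subgraph layer in the spirit of the paper: a shared
    per-node MLP, a permutation-invariant max aggregation over the
    polyline, and concatenation of the aggregate back onto each node."""
    enc = np.maximum(node_feats @ weights, 0.0)  # shared one-layer MLP (ReLU)
    agg = enc.max(axis=0, keepdims=True)         # polyline-level max pool
    return np.concatenate([enc, np.repeat(agg, len(enc), axis=0)], axis=1)

rng = np.random.default_rng(0)
nodes = rng.standard_normal((5, 8))              # 5 vector nodes, 8-dim features
out = subgraph_layer(nodes, rng.standard_normal((8, 8)))
```

The output doubles the feature width: each node keeps its own encoding plus a copy of the polyline-wide aggregate, which is how local context reaches every vector node.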
20. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
• TO ENCOURAGE THE GLOBAL INTERACTION GRAPH TO BETTER CAPTURE INTERACTIONS AMONG DIFFERENT
TRAJECTORIES AND MAP POLYLINES, IT INTRODUCES AN AUXILIARY GRAPH COMPLETION TASK.
• IN ORDER TO IDENTIFY AN INDIVIDUAL POLYLINE NODE WHEN ITS CORRESPONDING FEATURE IS MASKED OUT, IT
COMPUTES THE MINIMUM VALUES OF THE START COORDINATES FROM ALL OF ITS CONSTITUENT VECTORS TO
OBTAIN THE IDENTIFIER EMBEDDING.
• THE GRAPH COMPLETION OBJECTIVE IS CLOSELY RELATED TO THE WIDELY SUCCESSFUL BERT METHOD FOR
NATURAL LANGUAGE PROCESSING (NLP), WHICH PREDICTS MISSING TOKENS BASED ON BIDIRECTIONAL
CONTEXT FROM DISCRETE AND SEQUENTIAL TEXT DATA.
• UNLIKE METHODS THAT GENERALIZE THE BERT OBJECTIVE TO UNORDERED IMAGE PATCHES WITH PRE-
COMPUTED VISUAL FEATURES, THE PROPOSED NODE FEATURES ARE JOINTLY OPTIMIZED IN AN E2E FRAMEWORK.
• THE FINAL MULTI-TASK TRAINING OBJECTIVE IS OPTIMIZED: L = L_TRAJ + α · L_NODE
• L_TRAJ IS THE NEGATIVE GAUSSIAN LOG-LIKELIHOOD FOR THE GROUND TRUTH FUTURE TRAJECTORIES, L_NODE IS THE HUBER
LOSS BETWEEN PREDICTED NODE FEATURES AND GROUND TRUTH MASKED NODE FEATURES, AND α = 1.0 IS A SCALAR THAT
BALANCES THE TWO LOSS TERMS.
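The combined objective can be illustrated numerically. A unit-variance Gaussian is assumed for the trajectory likelihood here; the paper's exact parameterization may differ.

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def vectornet_loss(pred_traj, gt_traj, pred_node, gt_node, alpha=1.0):
    """L = L_traj + alpha * L_node, with L_traj a per-waypoint negative
    Gaussian log-likelihood (unit variance assumed) and L_node a Huber
    loss on the masked node features."""
    l_traj = np.mean(0.5 * (pred_traj - gt_traj) ** 2 + 0.5 * np.log(2 * np.pi))
    l_node = np.mean(huber(pred_node - gt_node))
    return l_traj + alpha * l_node

z = np.zeros(4)
base = vectornet_loss(z, z, z, z)  # perfect predictions leave only the NLL constant
```

With perfect predictions the loss reduces to the constant 0.5·log(2π) from the Gaussian normalizer; any residual error raises both terms.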
21. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
Prediction results and the learned attention over road and agent polylines.
22. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
• THE KEY INSIGHT IS THAT FOR PREDICTION WITHIN A MODERATE TIME HORIZON, THE FUTURE MODES CAN BE
EFFECTIVELY CAPTURED BY A SET OF TARGET STATES.
• THIS LEADS TO THE TARGET-DRIVEN TRAJECTORY PREDICTION (TNT) FRAMEWORK.
• TNT HAS THREE STAGES WHICH ARE TRAINED END-TO-END.
• IT FIRST PREDICTS AN AGENT’S POTENTIAL TARGET STATES T STEPS INTO THE FUTURE, BY ENCODING ITS INTERACTIONS
WITH THE ENVIRONMENT AND THE OTHER AGENTS.
• TNT THEN GENERATES TRAJECTORY STATE SEQUENCES CONDITIONED ON TARGETS.
• A FINAL STAGE ESTIMATES TRAJECTORY LIKELIHOODS AND A FINAL COMPACT SET OF TRAJECTORY PREDICTIONS IS
SELECTED.
• THIS IS IN CONTRAST TO PREVIOUS WORK WHICH MODELS AGENT INTENTS AS LATENT VARIABLES, AND RELIES
ON TEST-TIME SAMPLING TO GENERATE DIVERSE TRAJECTORIES.
• BENCHMARK TNT ON TRAJECTORY PREDICTION OF VEHICLES AND PEDESTRIANS; IT OUTPERFORMS THE STATE OF THE
ART ON THE ARGOVERSE FORECASTING, INTERACTION, STANFORD DRONE AND AN IN-HOUSE PEDESTRIAN-AT-
INTERSECTION DATASETS.
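The three stages can be sketched as a tiny pipeline. The callables `score_fn`, `traj_fn`, and `rank_fn` are illustrative placeholders for the learned target-prediction, motion-estimation, and scoring networks.

```python
import numpy as np

def tnt_predict(candidates, score_fn, traj_fn, rank_fn, m=6, k=3):
    """Minimal sketch of the three TNT stages:
    (a) target prediction: keep the m highest-scoring target candidates,
    (b) target-conditioned motion estimation: one trajectory per target,
    (c) scoring and selection: rank hypotheses and return the top k."""
    scores = np.array([score_fn(c) for c in candidates])
    targets = [candidates[i] for i in np.argsort(-scores)[:m]]  # stage (a)
    trajs = [traj_fn(t) for t in targets]                       # stage (b)
    ranks = np.array([rank_fn(tr) for tr in trajs])             # stage (c)
    return [trajs[i] for i in np.argsort(-ranks)[:k]]

# Toy run: scalar "targets" 0..9, trajectories are 3 repeated waypoints,
# the scorer prefers trajectories starting at small values.
top = tnt_predict(list(range(10)), lambda c: c,
                  lambda t: [t] * 3, lambda tr: -tr[0], m=6, k=3)
```

Because the stages are explicit functions of proposed targets, diversity comes from the candidate set itself rather than from test-time latent sampling, which is the contrast the slide draws with prior work.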
23. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
Illustration of the TNT framework when applied to the vehicle future trajectory prediction task. TNT
consists of three stages: (a) target prediction which proposes a set of plausible targets (stars)
among all candidates (diamonds). (b) target-conditioned motion estimation which estimates a
trajectory (distribution) towards each selected target, (c) scoring and selection which ranks
trajectory hypotheses and selects a final set of trajectory predictions with likelihood scores.
24. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
TNT model overview. Scene context is first encoded as the model’s inputs. Then follows the core
three stages of TNT: (a) target prediction which proposes an initial set of M targets; (b) target-
conditioned motion estimation which estimates a trajectory for each target; (c) scoring and selection
which ranks trajectory hypotheses and outputs a final set of K predicted trajectories.
25. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
TNT supports flexible choices of targets. Vehicle target candidate points
are sampled from the lane centerlines. Pedestrian target candidate
points are sampled from a virtual grid centered on the pedestrian.
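A virtual target-candidate grid centered on a pedestrian can be generated as follows; the grid extent and spacing here are illustrative values, not the paper's.

```python
import numpy as np

def pedestrian_target_grid(center, half_extent=5.0, step=1.0):
    """Virtual grid of target candidate points centered on the pedestrian.
    Returns an (N, 2) array of candidate xy locations."""
    offsets = np.arange(-half_extent, half_extent + step / 2, step)
    xs, ys = np.meshgrid(center[0] + offsets, center[1] + offsets)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

grid = pedestrian_target_grid((3.0, -2.0), half_extent=2.0, step=1.0)
```

For vehicles the analogous candidates would be sampled along lane centerlines instead of on a regular grid, reflecting the map prior on where a vehicle can end up.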
27. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• This is the most diverse interactive motion dataset so far, and provides specific labels for
interacting objects suitable for developing joint prediction models.
• With over 100,000 scenes, each 20 seconds long at 10 Hz, this dataset contains more
than 570 hours of unique data over 1750 km of roadways.
• It was collected by mining for interesting interactions between vehicles, pedestrians, and
cyclists across six cities within the United States.
• Use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes
for each road agent, and provide corresponding high definition 3D maps for each scene.
• Introduce a new set of metrics that provides a comprehensive evaluation of both single
agent and joint agent interaction motion forecasting models.
• Finally, provide strong baseline models for individual-agent prediction and joint prediction.
• https://waymo.com/open/data/motion/
28. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Examples of interactions between
agents in a scene in the WAYMO
OPEN MOTION DATASET. Each
example highlights how predicting
the joint behavior of agents aids in
predicting likely future scenarios.
Solid and dashed lines indicate the
road graph and associated lanes.
Each numeral indicates a unique
agent in the scene.
29. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• Compared to the onboard counterpart, offboard perception has two major advantages:
• 1) it can afford much more powerful models running on ample computational resources;
• 2) it can maximally aggregate complementary information from different views by exploiting the
full point cloud sequence including both history and future.
• The offboard perception system employed contains three steps:
• (1) 3D object detector generates object proposals from each lidar frame.
• (2) multi-object tracker links detected objects throughout the lidar sequence.
• (3) for each object, an object-centric refinement network processes the tracked object boxes and
its point clouds across all frames in the track, and outputs temporally consistent and accurate 3D
bounding boxes of the object in each frame.
30. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Comparison of popular behavior prediction and motion forecasting datasets.
Specifically, compare Lyft Level 5, NuScenes, Argoverse, INTERACTION, and
the Waymo Open Motion Dataset across multiple dimensions.
31. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• The dataset provides high quality object tracks generated using an offboard perception system
along with both static and dynamic map features to provide context for the road environment.
• Mine for interesting scenarios by first hand-crafting semantic predicates involving agents'
relationships, e.g., "Agent A changed lanes at time t", and "agents A and B crossed paths with a
time gap t and relative heading difference".
• These predicates can be composed to retrieve more complex queries in an efficient SQL and
relational database framework on an overall data corpus orders of magnitude larger than the
resulting curated WAYMO OPEN MOTION DATASET.
• Pairwise interaction scenarios: merges, lane changes, unprotected turns, intersection left turns,
intersection right turns, pedestrian-vehicle interactions, cyclist vehicle interactions, interactions with
close proximity, and interactions with high accelerations.
32. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Diagram of baseline architecture. An illustration of the baseline architecture
employed for the family of learned models, with a base LSTM encoder for
agent states. The three detachable components are a road graph polyline
encoder, a traffic state LSTM encoder, and a high-order interactions encoder.
The trajectories are predicted through an MLP with a min-of-k loss.
33. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• First, consider a constant velocity model in which we assume the agent will maintain its velocity at
the current timestamp for all future steps.
• Second, consider a family of deep-learned models using various encoders, with a base
architecture of an LSTM to encode a 1-second history of observed state; this includes agents'
positions, velocities, and 3D bounding boxes.
• In order to measure the importance of particular additional features, selectively provide
additional information:
• Road graph (rg): encode the 3D map information with polylines.
• Traffic signals (ts): encode the traffic signal states with an LSTM encoder as an additional
feature.
• High-order interactions (hi): model the high-order interactions between agents with a global
interaction graph.
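The min-of-k training loss mentioned on the previous slide can be sketched as follows; this is a simplified L2 variant, and the exact distance used in the baseline may differ.

```python
import numpy as np

def min_of_k_loss(pred_trajs, gt_traj):
    """Min-of-k loss: only the predicted trajectory closest to the ground
    truth is penalized, so the k heads can specialize to different modes.
    pred_trajs: (k, T, 2), gt_traj: (T, 2)."""
    # average displacement error of each of the k hypotheses
    errs = np.mean(np.linalg.norm(pred_trajs - gt_traj[None], axis=-1), axis=-1)
    return errs.min()

gt = np.zeros((5, 2))
preds = np.stack([gt + 1.0, gt])   # one wrong hypothesis, one exact
```

Penalizing only the best hypothesis avoids averaging over modes, which would otherwise pull all k outputs toward a single blurred trajectory.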
34. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• Use conditional behavior prediction (CBP) to quantify the interactivity in our dataset.
• A model can produce either unconditional predictions or predictions conditioned on a “query
trajectory” for one of the agents in the scene.
• If two agents are not interacting, then one’s actions have no effect on the other, so knowledge of that
agent’s future should not change predictions for the other agent.
• The degree of influence agent A has on agent B is defined as the KL divergence between unconditional
predictions for B and the predictions for B conditioned on A's ground truth future trajectory.
• Apply this to interactive and standard validation datasets, computing the KL divergence between
unconditional and conditional predictions for every query agent/target agent pair in the dataset.
• KL divergences are much larger in the interactive validation dataset than in the standard validation dataset.
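A discrete sketch of this influence measurement follows. The actual models compare GMMs over continuous trajectories; here the distributions are simply over a finite set of trajectory modes.

```python
import numpy as np

def influence_score(p_marginal, p_conditional, eps=1e-12):
    """KL(conditional || marginal) over a discrete set of predicted
    trajectory modes. Zero when conditioning on the query agent's future
    does not change the prediction, i.e. the agents do not interact."""
    p = np.asarray(p_conditional, dtype=float) + eps
    q = np.asarray(p_marginal, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

uniform = [0.25, 0.25, 0.25, 0.25]
shifted = [0.70, 0.10, 0.10, 0.10]   # conditioning shifts probability mass
```

Averaging this score over query agent/target agent pairs is what separates the interactive validation set from the standard one.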
35. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
The dataset contains many agents including
pedestrians and cyclists. Top: 46% of scenes have
more than 32 agents, and 11% of scenes have
more than 64 agents. Bottom: In the standard
validation set, 33.5% of scenes require at least
one pedestrian to be predicted, and 10.4% of
scenes require at least one cyclist to be predicted.
36. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Agents selected to be predicted have diverse
trajectories. Left: Ground truth trajectory of each
predicted agent in a frame of reference where all
agents start at the origin with heading pointing along
the positive X axis (pointing up). Right: Distribution
of maximum speeds achieved by all of the agents
along their 9 second trajectory. Plots depict variety
in trajectory shapes and speed profiles.
37. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Interactive driving scenarios, such as lane changes, merges and unprotected turns, are some
of the most challenging situations for autonomous driving.
• Planning in interactive scenarios requires accurately modeling the reactions of other agents
to different future actions of the ego agent.
• It develops end-to-end models for conditional behavior prediction (CBP) that take as
input a query future trajectory for an ego-agent, and predict distributions over future
trajectories for other agents conditioned on the query.
• Leveraging such a model, develop a general-purpose agent interactivity score derived
from probabilistic first principles.
• The interactivity score allows finding interesting interactive scenarios for training and
evaluating behavior prediction models.
38. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Define an agent trajectory S as a fixed-length, time discretized sequence of agent states up to a
finite time horizon.
• All quantities in this work consider a pair of agents A and B.
• Without loss of generality, consider A to be the query agent whose plan for the future can
potentially affect B, the target agent.
• The future trajectories of A and B are random variables SA and SB.
• The marginal probability of a particular realization sb of agent B's trajectory is given by p(SB = sb),
also indicated by the shorthand p(sb).
• The conditional distribution of agent B's future trajectory given a realization sa of agent A's trajectory
is given by p(SB = sb|SA = sa), indicated by the shorthand p(sb|sa).
39. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Quantify interactions by estimating the change in log-likelihood of the target's ground-truth future sb:
Δ = log p(sb|sa) − log p(sb)
• A large change in the log-likelihood indicates a situation in which the likelihood of the target agent's
trajectory changes significantly as a result of the query agent's action.
• Use the KL divergence between the conditional and marginal distributions for the target's predicted
future trajectory SB to quantify the degree of influence exerted on B by a trajectory sa of A:
DKL( p(SB|sa) || p(SB) )
• The mutual information between the two agents' future trajectories SA and SB is computed as
I(SA; SB) = E_sa [ DKL( p(SB|sa) || p(SB) ) ]
• This mutual information serves as the interactivity score between agents A and B.
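For intuition, the interactivity score can be computed exactly when the two agents' futures are discretized into a joint table of trajectory modes; this is a toy discrete stand-in for the GMM-based estimate in the paper.

```python
import numpy as np

def mutual_information(joint):
    """I(S_A; S_B) from a discrete joint distribution over trajectory modes
    of agents A (rows) and B (columns). Zero iff the futures are independent."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    pa = p.sum(axis=1, keepdims=True)    # marginal over A's modes
    pb = p.sum(axis=0, keepdims=True)    # marginal over B's modes
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (pa @ pb)[mask])))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # no interaction
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])     # B's mode tracks A's mode
```

An independent joint factorizes into its marginals and scores zero, while perfectly coupled binary futures score log 2, matching the definition above.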
40. Identifying Driver Interactions Via Conditional
Behavior Prediction
• A CBP model predicts p(SB|SA = sa, x), the distribution of future trajectories for B conditioned on sa.
• The model assumes Gaussian uncertainty over the positions of the trajectory waypoints.
• The output is a Gaussian mixture model (GMM) with mixture weights fixed over all time steps of the same trajectory.
• The computation of the interactivity score also requires the estimation of the marginal distributions.
41. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Use the most likely 6 modes of the marginal distribution's GMM, as in standard motion forecasting
metrics, rather than sampling N samples from the marginal distribution.
• Learn to predict distribution parameters via supervised learning with the negative log-likelihood loss.
• Encourage the model to learn that agents cannot occupy the same future location in space-time,
with an additional loss function.
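One illustrative form of such an overlap-discouraging loss is a hinge on pairwise waypoint distance; the quadratic hinge and the radius value are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def overlap_penalty(traj_a, traj_b, radius=1.0):
    """Penalize two agents' predicted waypoints that come within `radius`
    of each other at the same time step; zero once they are separated."""
    d = np.linalg.norm(np.asarray(traj_a) - np.asarray(traj_b), axis=-1)
    return float(np.sum(np.maximum(0.0, radius - d) ** 2))

far = overlap_penalty(np.zeros((4, 2)), np.full((4, 2), 10.0))   # no overlap
near = overlap_penalty(np.zeros((4, 2)), np.zeros((4, 2)))       # full overlap
```

The penalty is zero for well-separated trajectories and grows as predicted positions collide in space-time, giving the model a differentiable signal toward physically plausible joint futures.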
42. Identifying Driver Interactions Via Conditional
Behavior Prediction
A conditional behavior prediction model describes
how one agent’s predicted future trajectory can
shift due to the actions of other agents.
The architecture of the conditional behavior
prediction model.
43. Identifying Driver Interactions Via Conditional
Behavior Prediction
Histogram of interactivity score (mutual
information) between 8,919,306 pairs of
agents in the validation dataset.
44. Identifying Driver Interactions Via Conditional
Behavior Prediction
Two examples of interacting agents found by
sorting examples by mutual information and
wADE. The marginal (left) and conditional
predictions (right) are shown with the query in
solid green, and predictions in dashed cyan lines.
45. Identifying Driver Interactions Via Conditional
Behavior Prediction
An example in which the query and target agents slow down in parallel
lanes as a result of a traffic light change. The marginal (left) and
conditional predictions (right) are shown with the query in solid green.
46. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• Deciphering human behaviors to predict their future paths/trajectories and what they would do from
videos is important in many applications.
• Therefore, this work studies predicting a pedestrian's future path jointly with future activities.
• They propose an end-to-end, multi-task learning system, called Next, utilizing rich visual features
about human behavioral information and interaction with their surroundings.
• It encodes a person through rich semantic features about visual appearance, body movement and
interaction with the surroundings, motivated by the fact that humans derive such predictions by
relying on similar visual cues.
• To facilitate the training, the network is learned with an auxiliary task of predicting future location
in which the activity will happen.
• In the auxiliary task, it designs a discretized grid, called the Manhattan Grid, as the location prediction
target for the system.
https://github.com/JunweiLiang/social-distancing-prediction
47. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
The goal is to jointly predict a person’s future path and activity. The green and yellow line show two
possible future trajectories and two possible activities are shown in the green and yellow boxes.
Depending on the future activity, the person (top right) may take different paths, e.g. the yellow path
for “loading” and the green path for “object transfer”.
48. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• HUMANS NAVIGATE THROUGH PUBLIC SPACES OFTEN WITH SPECIFIC PURPOSES IN MIND, RANGING
FROM SIMPLE ONES LIKE ENTERING A ROOM TO MORE COMPLICATED ONES LIKE PUTTING THINGS
INTO A CAR.
• SUCH INTENTION, HOWEVER, IS MOSTLY NEGLECTED IN EXISTING WORK.
• THE JOINT PREDICTION MODEL CAN HAVE TWO BENEFITS:
• 1) LEARNING THE ACTIVITY TOGETHER WITH THE PATH MAY BENEFIT THE FUTURE PATH PREDICTION;
INTUITIVELY, HUMANS ARE ABLE TO READ FROM OTHERS’ BODY LANGUAGE TO ANTICIPATE WHETHER
THEY ARE GOING TO CROSS THE STREET OR CONTINUE WALKING ALONG THE SIDEWALK.
• 2) THE JOINT MODEL ADVANCES THE CAPABILITY OF UNDERSTANDING NOT ONLY THE FUTURE PATH
BUT ALSO THE FUTURE ACTIVITY BY TAKING INTO ACCOUNT THE RICH SEMANTIC CONTEXT IN
VIDEOS; THIS INCREASES THE CAPABILITIES OF AUTOMATED VIDEO ANALYTICS FOR SOCIAL GOOD,
SUCH AS SAFETY APPLICATIONS LIKE ANTICIPATING PEDESTRIAN MOVEMENT AT TRAFFIC
INTERSECTIONS OR A ROAD ROBOT HELPING HUMANS TRANSPORT GOODS TO A CAR.
49. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
Overview of the Next model. Given a sequence of frames containing the person for prediction, this model utilizes
a person behavior module and a person interaction module to encode rich visual semantics into a feature tensor.
50. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• 4 KEY COMPONENTS:
• PERSON BEHAVIOR MODULE EXTRACTS VISUAL INFORMATION FROM THE BEHAVIORAL SEQUENCE OF THE
PERSON.
• PERSON INTERACTION MODULE LOOKS AT THE INTERACTION BETWEEN A PERSON AND THEIR
SURROUNDINGS.
• TRAJECTORY GENERATOR SUMMARIZES THE ENCODED VISUAL FEATURES AND PREDICTS THE FUTURE
TRAJECTORY BY THE LSTM DECODER WITH FOCAL ATTENTION.
• ACTIVITY PREDICTION UTILIZES RICH VISUAL SEMANTICS TO PREDICT THE FUTURE ACTIVITY LABEL FOR THE
PERSON.
• IN ADDITION, DIVIDE THE SCENE INTO A DISCRETIZED GRID OF MULTIPLE SCALES, CALLED
MANHATTAN GRID, TO COMPUTE CLASSIFICATION AND REGRESSION FOR ROBUST ACTIVITY
LOCATION PREDICTION.
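The classification-plus-regression encoding over such a grid can be sketched as follows; the row-major cell indexing convention here is an assumption for illustration.

```python
def manhattan_grid_target(xy, origin, cell_size, grid_shape):
    """Encode a 2D activity location as a grid-cell index (classification
    target) plus a within-cell offset (regression target).
    grid_shape is (rows, cols); cells are indexed row-major."""
    rows, cols = grid_shape
    rel_x = (xy[0] - origin[0]) / cell_size
    rel_y = (xy[1] - origin[1]) / cell_size
    col = min(max(int(rel_x), 0), cols - 1)
    row = min(max(int(rel_y), 0), rows - 1)
    return row * cols + col, (rel_x - col, rel_y - row)

cell, offset = manhattan_grid_target(
    (2.5, 1.5), origin=(0.0, 0.0), cell_size=1.0, grid_shape=(4, 4))
```

The coarse cell makes location prediction a robust classification problem, while the fractional offset recovers the precision that discretization alone would lose.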
51. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
To model appearance changes of a person, utilize a pre-trained object detection model with "RoIAlign"
to extract fixed-size CNN features for each person bounding box. Average the features along the spatial
dimensions for each person and feed them into an LSTM encoder, obtaining a feature representation of
Tobs × d, where d is the hidden size of the LSTM. To capture the body movement, utilize a person
keypoint detection model to extract person keypoint information, and apply a linear transformation to
embed the keypoint coordinates before feeding them into the LSTM encoder; the encoded feature again
has the shape Tobs × d. These appearance and movement features are commonly used in a wide variety
of studies and thus do not introduce new concerns about machine learning fairness.
52. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
The person-objects feature can capture how far away the person is from other
people and cars. The person-scene feature can capture whether the person is
near the sidewalk or grass. This information is provided to the model with the hope
of learning things like a person walks more often on the sidewalk than on the grass
and tends to avoid bumping into cars.
53. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• IT USES AN LSTM DECODER TO DIRECTLY PREDICT THE FUTURE TRAJECTORY IN XY-COORDINATES.
• THE HIDDEN STATE OF THIS DECODER IS INITIALIZED USING THE LAST STATE OF THE PERSON’S
TRAJECTORY LSTM ENCODER.
• ADD AN AUXILIARY TASK, I.E. ACTIVITY LOCATION PREDICTION, IN ADDITION TO PREDICTING THE
FUTURE ACTIVITY LABEL OF THE PERSON.
• AT EACH TIME INSTANT, THE XY-COORDINATE WILL BE COMPUTED FROM THE DECODER STATE AND
BY A FULLY CONNECTED LAYER.
• IT EMPLOYS AN EFFECTIVE FOCAL ATTENTION, ORIGINALLY PROPOSED TO CARRY OUT MULTIMODAL
INFERENCE OVER A SEQUENCE OF IMAGES FOR VISUAL QUESTION ANSWERING, WHOSE KEY IDEA IS
TO PROJECT MULTIPLE FEATURES INTO A SPACE OF CORRELATION, WHERE DISCRIMINATIVE FEATURES
ARE EASIER TO CAPTURE BY THE ATTENTION MECHANISM.
54. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
To bridge the gap between trajectory generation and activity label prediction, it proposes an activity
location prediction (ALP) module to predict the final location where the person will engage in the future
activity. The activity location prediction includes two tasks, location classification and location regression.
55. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
Qualitative comparison between this method and the baselines. Yellow path is the observable trajectory and
green path is the ground truth trajectory during the prediction period. Predictions are shown as blue heatmaps.
56. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
• DETECTING PEDESTRIANS AND PREDICTING FUTURE TRAJECTORIES FOR THEM ARE CRITICAL TASKS FOR NUMEROUS
APPLICATIONS, SUCH AS AUTONOMOUS DRIVING.
• PREVIOUS METHODS EITHER TREAT THE DETECTION AND PREDICTION AS SEPARATE TASKS OR SIMPLY ADD A TRAJECTORY
REGRESSION HEAD ON TOP OF A DETECTOR.
• AN END-TO-END TWO-STAGE NETWORK: SPATIO-TEMPORAL-INTERACTIVE NETWORK (STINET).
• IN ADDITION TO 3D GEOMETRY MODELING OF PEDESTRIANS, MODEL THE TEMPORAL INFORMATION FOR EACH OF THE
PEDESTRIANS.
• IT PREDICTS BOTH CURRENT AND PAST LOCATIONS IN THE FIRST STAGE, SO THAT EACH PEDESTRIAN CAN BE LINKED
ACROSS FRAMES AND THE COMPREHENSIVE SPATIO-TEMPORAL INFORMATION CAN BE CAPTURED IN THE SECOND
STAGE.
• ALSO, MODEL THE INTERACTION AMONG OBJECTS WITH AN INTERACTION GRAPH, TO GATHER THE INFORMATION
AMONG THE NEIGHBORING OBJECTS.
• COMPREHENSIVE EXPERIMENTS ON THE LYFT DATASET AND THE RECENTLY RELEASED LARGE-SCALE WAYMO OPEN
DATASET FOR BOTH OBJECT DETECTION AND FUTURE TRAJECTORY PREDICTION.
57. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
The overview. It takes a sequence of point clouds as input, detects pedestrians and predicts their future
trajectories simultaneously. The point clouds are processed by Pillar Feature Encoding to generate Pillar
Features. Then each Pillar Feature is fed into a backbone ResUNet to get backbone features. A Temporal
Region Proposal Network (T-RPN) takes backbone features and generates temporal proposals with past
and current boxes for each object. The Spatio-Temporal-Interactive (STI) Feature Extractor learns features
for each temporal proposal, which are used for final detection and trajectory prediction.
58. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
Backbone. Upper: overview of the backbone. The
input point cloud sequence is fed to Voxelization and
Point net to generate pseudo images, which are then
processed by ResNet U-Net to generate final
backbone feature sequence. Lower: detailed design
of ResNet U-Net.
59. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
Spatio-Temporal-Interactive Feature Extractor
(STI-FE): Local geometry, local dynamics and
history path features are extracted given a
temporal proposal. For local geometry and
local dynamics features, the yellow areas are
used for feature extraction. Relational
reasoning is performed across proposals’ local
features to generate interactive features.