Prediction and Planning for Self Driving at Waymo
1. PREDICTION AND PLANNING FOR SELF DRIVING @WAYMO (GOOGLE)
YU HUANG
SUNNYVALE, CALIFORNIA
YU.HUANG07@GMAIL.COM
2. References
• ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
• MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction
• VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
• TNT: Target-driven Trajectory Prediction
• Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset
• Identifying Driver Interactions via Conditional Behavior Prediction
• Peeking into the Future: Predicting Future Person Activities and Locations in Videos
• STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction
3. CHAUFFEURNET: LEARNING TO DRIVE BY IMITATING THE BEST AND SYNTHESIZING THE WORST
• Train a policy for autonomous driving via imitation learning that is robust enough to drive a real vehicle.
• Standard behavior cloning is insufficient for handling complex driving scenarios, even when leveraging a perception system for preprocessing the input and a controller for executing the output on the car: 30 million examples are still not enough.
• Expose the learner to synthesized data in the form of perturbations to the expert's driving, which creates interesting situations such as collisions and/or going off the road.
• Rather than purely imitating all data, augment the imitation loss with additional losses that penalize undesirable events and encourage progress; the perturbations then provide an important signal for these losses and lead to robustness of the learned model.
• The ChauffeurNet model can handle complex situations in simulation.
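The perturbation idea above can be sketched as follows: shift a midpoint of a logged expert trajectory laterally and blend the offset smoothly back to zero at both endpoints, yielding a synthetic deviate-and-recover maneuver. The smooth bump blending and the `perturb_trajectory`/`max_lateral` names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def perturb_trajectory(traj, max_lateral=1.0, seed=None):
    """Synthesize a ChauffeurNet-style perturbation of an expert trajectory.

    traj: (N, 2) array of (x, y) waypoints from logged expert driving.
    A middle waypoint is shifted laterally, and the offset is blended to
    zero at both ends so the agent appears to deviate from and then return
    to the original path. Blending details are an assumption here.
    """
    rng = np.random.default_rng(seed)
    traj = np.asarray(traj, dtype=float)
    n = len(traj)
    mid = n // 2
    # Lateral direction = unit normal to the local heading at the midpoint.
    heading = traj[mid + 1] - traj[mid - 1]
    normal = np.array([-heading[1], heading[0]])
    normal /= np.linalg.norm(normal)
    offset = rng.uniform(-max_lateral, max_lateral)
    # Blend the offset in and out with a smooth bump: 0 at both endpoints.
    t = np.linspace(0.0, 1.0, n)
    bump = np.sin(np.pi * t) ** 2
    return traj + offset * bump[:, None] * normal
```

The perturbed trajectories "create interesting situations" precisely because the deviated states never occur in the expert data, which is what the added losses then penalize.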
11. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
• Predicting human behavior is a difficult and crucial task required for motion planning.
• It is challenging in large part due to the highly uncertain and multimodal set of possible outcomes in real-world domains such as autonomous driving.
• Beyond single MAP (maximum a posteriori) trajectory prediction, obtaining an accurate probability distribution of the future is an area of active interest.
• MultiPath leverages a fixed set of future state-sequence anchors that correspond to modes of the trajectory distribution.
• At inference, the model predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at each time step.
• The model is efficient, requiring only one forward inference pass to obtain multimodal future distributions, and the output is parametric, allowing compact communication and analytical probabilistic queries.
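The per-timestep Gaussian mixture described above can be queried analytically. A minimal sketch, assuming isotropic per-anchor uncertainties (the paper regresses fuller covariance parameters) and a hypothetical function name:

```python
import numpy as np

def multipath_mixture_logpdf(point, probs, means, sigmas):
    """Evaluate a MultiPath-style Gaussian mixture at one time step.

    probs:  (K,) softmax probabilities over the K anchors.
    means:  (K, 2) anchor waypoint plus regressed offset for this step.
    sigmas: (K,) isotropic standard deviations (simplifying assumption).
    Returns log p(point) under sum_k probs[k] * N(point; mu_k, sigma_k^2 I).
    """
    point = np.asarray(point, dtype=float)
    d2 = np.sum((means - point) ** 2, axis=1)  # squared distance to each mode
    # Log-density of each 2-D isotropic Gaussian component.
    log_comp = -d2 / (2.0 * sigmas**2) - np.log(2.0 * np.pi * sigmas**2)
    return np.log(np.sum(probs * np.exp(log_comp)))
```

This parametric form is what makes the output compact to communicate to a planner: a handful of (probability, mean, uncertainty) triples per time step instead of trajectory samples.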
13. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
• MultiPath estimates the distribution over future trajectories per agent in a scene, as follows:
• 1) Based on a top-down scene representation, the scene CNN extracts mid-level features that encode the state of individual agents and their interactions.
• 2) For each agent in the scene, crop an agent-centric view of the mid-level feature representation and predict the probabilities over the fixed set of K predefined anchor trajectories.
• 3) For each anchor, the model regresses offsets from the anchor states and uncertainty distributions for each future time step.
• The distribution is parameterized by anchor trajectories A; because directly learning a mixture suffers from mode collapse, the anchors are estimated a priori and then fixed while the rest of the parameters are learned, as is common practice in other domains such as object detection and human pose estimation. A practical way to obtain A is the k-means algorithm, used as a simple approximation.
• It trains the model via imitation learning by fitting parameters to maximize the log-likelihood of recorded driving trajectories.
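The a-priori anchor estimation can be sketched with plain k-means over flattened logged futures; the deterministic initialization and the `fit_anchors` name are simplifying assumptions (the paper cites k-means as one practical choice, not this exact routine):

```python
import numpy as np

def fit_anchors(trajectories, k, iters=50):
    """Approximate MultiPath's fixed anchor set A with k-means.

    trajectories: (N, T, 2) logged future trajectories in the agent frame.
    Returns (k, T, 2) anchors = cluster centroids over flattened waypoints.
    """
    N, T, D = trajectories.shape
    X = trajectories.reshape(N, T * D).astype(float)
    # Simple deterministic init (k-means++ would be used in practice).
    centers = X[np.linspace(0, N - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each trajectory to its nearest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers.reshape(k, T, D)
```

Freezing these centroids before training is what sidesteps mode collapse: the network only has to classify among anchors and regress small residuals, rather than discover the modes itself.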
14. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
• They still represent a history of dynamic and static scene context as a 3-dimensional array of data rendered from a top-down orthographic perspective.
• The first two dimensions represent spatial locations in the top-down image.
• The channels in the depth dimension hold static and time-varying (dynamic) content for a fixed number of previous time steps.
• Agent observations are rendered as oriented bounding-box binary images, one channel for each time step.
• Other dynamic context such as traffic light state and static context of the road (lane connectivity and type, stop lines, speed limit, etc.) form additional channels.
• An important benefit of using such a top-down representation is the simplicity of representing contextual information like the agents' spatial relationships to each other and semantic road information.
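One of these per-timestep binary agent channels can be rasterized as below. The grid convention, resolution, and point-in-box test are illustrative assumptions, not the paper's rendering pipeline:

```python
import numpy as np

def render_agent_channel(boxes, grid=(200, 200), res=0.5):
    """Rasterize oriented agent boxes into one top-down binary channel.

    boxes: iterable of (cx, cy, length, width, heading) in meters/radians,
           in a top-down frame centered on the grid.
    res:   meters per pixel.
    """
    H, W = grid
    ys, xs = np.mgrid[0:H, 0:W]
    # Pixel centers in metric coordinates, origin at the grid center.
    px = (xs - W / 2) * res
    py = (ys - H / 2) * res
    img = np.zeros(grid, dtype=np.uint8)
    for cx, cy, length, width, heading in boxes:
        c, s = np.cos(heading), np.sin(heading)
        # Rotate pixels into the box frame; inside iff |u|<=L/2 and |v|<=W/2.
        u = c * (px - cx) + s * (py - cy)
        v = -s * (px - cx) + c * (py - cy)
        img[(np.abs(u) <= length / 2) & (np.abs(v) <= width / 2)] = 1
    return img
```

Stacking one such channel per past time step, plus road and traffic-light channels, yields the 3-D input array the slide describes.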
15. MULTIPATH: MULTIPLE PROBABILISTIC ANCHOR TRAJECTORY HYPOTHESES FOR BEHAVIOR PREDICTION
Top: logged trajectories of all agents are displayed in cyan; the focused agent is highlighted by a red circle. Bottom: MultiPath showing up to 5 trajectories with uncertainty ellipses. Trajectory probabilities (softmax outputs) are encoded in a color map shown to the right. MultiPath can predict uncertain future trajectories at various speeds (1st column), different intents at intersections (2nd and 3rd columns) and lane changes (4th and 5th columns), where the regression baseline only predicts a single intent.
16. VECTORNET: ENCODING HD MAPS AND AGENT DYNAMICS FROM VECTORIZED REPRESENTATION
• Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).
• This paper introduces VectorNet, a hierarchical graph neural network (GNN) that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components.
• In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird's-eye images and encode them with convolutional neural networks (ConvNets), this approach operates on a vector representation.
• By operating on the vectorized high-definition (HD) maps and agent trajectories, it avoids lossy rendering and computationally intensive ConvNet encoding steps.
• To further boost VectorNet's capability in learning context features, it proposes a novel auxiliary task to recover the randomly masked-out map entities and agent trajectories based on their context.
• It also outperforms the state of the art on the Argoverse dataset.
https://github.com/DQSSSSS/VectorNet
18. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
• MOST OF THE ANNOTATIONS FROM AN HD MAP ARE IN THE FORM OF SPLINES (E.G. LANES), CLOSED SHAPES
(E.G. REGIONS OF INTERSECTIONS) AND POINTS (E.G. TRAFFIC LIGHTS), WITH ADDITIONAL ATTRIBUTE INFO SUCH
AS THE SEMANTIC LABELS OF THE ANNOTATIONS AND THEIR CURRENT STATES (E.G. COLOR OF THE TRAFFIC
LIGHT, SPEED LIMIT OF THE ROAD).
• FOR AGENTS, THEIR TRAJECTORIES ARE IN THE FORM OF DIRECTED SPLINES WITH RESPECT TO TIME.
• ALL OF THESE ELEMENTS CAN BE APPROXIMATED AS SEQUENCES OF VECTORS: FOR MAP FEATURES, PICK A
STARTING POINT AND DIRECTION, UNIFORMLY SAMPLE KEY POINTS FROM THE SPLINES AT THE SAME SPATIAL
DISTANCE, AND SEQUENTIALLY CONNECT THE NEIGHBORING KEY POINTS INTO VECTORS; FOR TRAJECTORIES,
JUST SAMPLE KEY POINTS WITH A FIXED TEMPORAL INTERVAL (0.1 SECOND), STARTING FROM T = 0, AND
CONNECT THEM INTO VECTORS.
• GIVEN SMALL ENOUGH SPATIAL OR TEMPORAL INTERVALS, THE RESULTING POLYLINES SERVE AS CLOSE
APPROXIMATIONS OF THE ORIGINAL MAP AND TRAJECTORIES.
• TO EXPLOIT THE SPATIAL AND SEMANTIC LOCALITY OF THE NODES, IT TAKES A HIERARCHICAL APPROACH BY
FIRST CONSTRUCTING SUBGRAPHS AT THE VECTOR LEVEL, WHERE ALL VECTOR NODES BELONGING TO THE
SAME POLYLINE ARE CONNECTED WITH EACH OTHER.
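The vectorization procedure described above can be sketched in a few lines. This is a simplified illustration; `polyline_to_vectors` and its attribute-free output format are assumptions, not the paper's exact interface, and semantic attribute features are omitted.

```python
import numpy as np

def polyline_to_vectors(points, interval):
    """Resample a polyline at (approximately) a fixed spatial interval and
    connect neighboring key points into vectors [start_x, start_y, end_x, end_y].
    Attribute features (semantic labels, timestamps) are omitted for brevity."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    n = max(int(round(s[-1] / interval)), 1)      # number of vectors
    targets = np.linspace(0.0, s[-1], n + 1)      # uniformly spaced key points
    keypoints = np.stack(
        [np.interp(targets, s, points[:, d]) for d in range(points.shape[1])],
        axis=1)
    # sequentially connect neighboring key points into vectors
    return np.concatenate([keypoints[:-1], keypoints[1:]], axis=1)

# A straight 10 m lane sampled at 1 m yields 10 connected vectors.
vecs = polyline_to_vectors([[0.0, 0.0], [10.0, 0.0]], interval=1.0)
```

Consecutive vectors share endpoints, so the resulting polyline closely approximates the original geometry as the interval shrinks.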
19. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
The computation flow on the vector nodes of the
same polyline. The polyline subgraph network
can be seen as a generalization of PointNet.
However, by embedding the ordering
information into vectors, constraining the
connectivity of subgraphs based on the polyline
groupings, and encoding attributes as node
features, this method is particularly suitable to
encode structured map annotations and agent
trajectories.
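The PointNet-style computation on one polyline can be sketched as follows. This is a toy single-layer version with random placeholder weights; the actual network stacks several such layers with learned parameters.

```python
import numpy as np

def subgraph_layer(node_feats, weights):
    """One polyline-subgraph layer in the spirit of the paper: a shared
    per-node MLP, a permutation-invariant max aggregation over the
    polyline, and concatenation of the aggregate back onto each node."""
    enc = np.maximum(node_feats @ weights, 0.0)  # shared one-layer MLP (ReLU)
    agg = enc.max(axis=0, keepdims=True)         # polyline-level max pool
    return np.concatenate([enc, np.repeat(agg, len(enc), axis=0)], axis=1)

rng = np.random.default_rng(0)
nodes = rng.standard_normal((5, 8))              # 5 vector nodes, 8-dim features
out = subgraph_layer(nodes, rng.standard_normal((8, 8)))
```

The output doubles the feature width: each node keeps its own encoding plus a copy of the polyline-wide aggregate, which is how local context reaches every vector node.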
20. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
• TO ENCOURAGE THE GLOBAL INTERACTION GRAPH TO BETTER CAPTURE INTERACTIONS AMONG DIFFERENT
TRAJECTORIES AND MAP POLYLINES, IT INTRODUCES AN AUXILIARY GRAPH COMPLETION TASK.
• IN ORDER TO IDENTIFY AN INDIVIDUAL POLYLINE NODE WHEN ITS CORRESPONDING FEATURE IS MASKED OUT, IT
COMPUTES THE MINIMUM VALUES OF THE START COORDINATES FROM ALL OF ITS CONSTITUENT VECTORS TO
OBTAIN THE IDENTIFIER EMBEDDING.
• THE GRAPH COMPLETION OBJECTIVE IS CLOSELY RELATED TO THE WIDELY SUCCESSFUL BERT METHOD FOR
NATURAL LANGUAGE PROCESSING (NLP), WHICH PREDICTS MISSING TOKENS BASED ON BIDIRECTIONAL
CONTEXT FROM DISCRETE AND SEQUENTIAL TEXT DATA.
• UNLIKE METHODS THAT GENERALIZE THE BERT OBJECTIVE TO UNORDERED IMAGE PATCHES WITH PRE-
COMPUTED VISUAL FEATURES, THE PROPOSED NODE FEATURES ARE JOINTLY OPTIMIZED IN AN E2E FRAMEWORK.
• THE FINAL MULTI-TASK TRAINING OBJECTIVE IS OPTIMIZED: L = L_TRAJ + α · L_NODE
• L_TRAJ IS THE NEGATIVE GAUSSIAN LOG-LIKELIHOOD FOR THE GROUND TRUTH FUTURE TRAJECTORIES, L_NODE IS THE HUBER
LOSS BETWEEN PREDICTED NODE FEATURES AND GROUND TRUTH MASKED NODE FEATURES, AND α = 1.0 IS A SCALAR THAT
BALANCES THE TWO LOSS TERMS.
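The combined objective can be illustrated numerically. A unit-variance Gaussian is assumed for the trajectory likelihood here; the paper's exact parameterization may differ.

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def vectornet_loss(pred_traj, gt_traj, pred_node, gt_node, alpha=1.0):
    """L = L_traj + alpha * L_node, with L_traj a per-waypoint negative
    Gaussian log-likelihood (unit variance assumed) and L_node a Huber
    loss on the masked node features."""
    l_traj = np.mean(0.5 * (pred_traj - gt_traj) ** 2 + 0.5 * np.log(2 * np.pi))
    l_node = np.mean(huber(pred_node - gt_node))
    return l_traj + alpha * l_node

z = np.zeros(4)
base = vectornet_loss(z, z, z, z)  # perfect predictions leave only the NLL constant
```

With perfect predictions the loss reduces to the constant 0.5·log(2π) from the Gaussian normalizer; any residual error raises both terms.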
21. VECTORNET: ENCODING HD MAPS AND AGENT
DYNAMICS FROM VECTORIZED REPRESENTATION
Prediction results and the learned attention over road and agent polylines.
22. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
• THE KEY INSIGHT IS THAT FOR PREDICTION WITHIN A MODERATE TIME HORIZON, THE FUTURE MODES CAN BE
EFFECTIVELY CAPTURED BY A SET OF TARGET STATES.
• THIS LEADS TO THE TARGET-DRIVEN TRAJECTORY PREDICTION (TNT) FRAMEWORK.
• TNT HAS THREE STAGES WHICH ARE TRAINED END-TO-END.
• IT FIRST PREDICTS AN AGENT’S POTENTIAL TARGET STATES T STEPS INTO THE FUTURE, BY ENCODING ITS INTERACTIONS
WITH THE ENVIRONMENT AND THE OTHER AGENTS.
• TNT THEN GENERATES TRAJECTORY STATE SEQUENCES CONDITIONED ON TARGETS.
• A FINAL STAGE ESTIMATES TRAJECTORY LIKELIHOODS AND A FINAL COMPACT SET OF TRAJECTORY PREDICTIONS IS
SELECTED.
• THIS IS IN CONTRAST TO PREVIOUS WORK WHICH MODELS AGENT INTENTS AS LATENT VARIABLES, AND RELIES
ON TEST-TIME SAMPLING TO GENERATE DIVERSE TRAJECTORIES.
• BENCHMARK TNT ON TRAJECTORY PREDICTION OF VEHICLES AND PEDESTRIANS; IT OUTPERFORMS THE STATE OF THE
ART ON THE ARGOVERSE FORECASTING, INTERACTION, STANFORD DRONE AND AN IN-HOUSE PEDESTRIAN-AT-
INTERSECTION DATASETS.
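The three stages can be sketched as a tiny pipeline. The callables `score_fn`, `traj_fn`, and `rank_fn` are illustrative placeholders for the learned target-prediction, motion-estimation, and scoring networks.

```python
import numpy as np

def tnt_predict(candidates, score_fn, traj_fn, rank_fn, m=6, k=3):
    """Minimal sketch of the three TNT stages:
    (a) target prediction: keep the m highest-scoring target candidates,
    (b) target-conditioned motion estimation: one trajectory per target,
    (c) scoring and selection: rank hypotheses and return the top k."""
    scores = np.array([score_fn(c) for c in candidates])
    targets = [candidates[i] for i in np.argsort(-scores)[:m]]  # stage (a)
    trajs = [traj_fn(t) for t in targets]                       # stage (b)
    ranks = np.array([rank_fn(tr) for tr in trajs])             # stage (c)
    return [trajs[i] for i in np.argsort(-ranks)[:k]]

# Toy run: scalar "targets" 0..9, trajectories are 3 repeated waypoints,
# the scorer prefers trajectories starting at small values.
top = tnt_predict(list(range(10)), lambda c: c,
                  lambda t: [t] * 3, lambda tr: -tr[0], m=6, k=3)
```

Because the stages are explicit functions of proposed targets, diversity comes from the candidate set itself rather than from test-time latent sampling, which is the contrast the slide draws with prior work.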
23. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
Illustration of the TNT framework when applied to the vehicle future trajectory prediction task. TNT
consists of three stages: (a) target prediction which proposes a set of plausible targets (stars)
among all candidates (diamonds). (b) target-conditioned motion estimation which estimates a
trajectory (distribution) towards each selected target, (c) scoring and selection which ranks
trajectory hypotheses and selects a final set of trajectory predictions with likelihood scores.
24. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
TNT model overview. Scene context is first encoded as the model’s inputs. Then follows the core
three stages of TNT: (a) target prediction which proposes an initial set of M targets; (b) target-
conditioned motion estimation which estimates a trajectory for each target; (c) scoring and selection
which ranks trajectory hypotheses and outputs a final set of K predicted trajectories.
25. TNT: TARGET-DRIVEN TRAJECTORY PREDICTION
TNT supports flexible choices of targets. Vehicle target candidate points
are sampled from the lane centerlines. Pedestrian target candidate
points are sampled from a virtual grid centered on the pedestrian.
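A virtual target-candidate grid centered on a pedestrian can be generated as follows; the grid extent and spacing here are illustrative values, not the paper's.

```python
import numpy as np

def pedestrian_target_grid(center, half_extent=5.0, step=1.0):
    """Virtual grid of target candidate points centered on the pedestrian.
    Returns an (N, 2) array of candidate xy locations."""
    offsets = np.arange(-half_extent, half_extent + step / 2, step)
    xs, ys = np.meshgrid(center[0] + offsets, center[1] + offsets)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)

grid = pedestrian_target_grid((3.0, -2.0), half_extent=2.0, step=1.0)
```

For vehicles the analogous candidates would be sampled along lane centerlines instead of on a regular grid, reflecting the map prior on where a vehicle can end up.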
27. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• This is the most diverse interactive motion dataset so far, and provides specific labels for
interacting objects suitable for developing joint prediction models.
• With over 100,000 scenes, each 20 seconds long at 10 Hz, this dataset contains more
than 570 hours of unique data over 1750 km of roadways.
• It was collected by mining for interesting interactions between vehicles, pedestrians, and
cyclists across six cities within the United States.
• Use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes
for each road agent, and provide corresponding high definition 3D maps for each scene.
• Introduce a new set of metrics that provides a comprehensive evaluation of both single
agent and joint agent interaction motion forecasting models.
• Finally, provide strong baseline models for individual-agent prediction and joint prediction.
• https://waymo.com/open/data/motion/
28. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Examples of interactions between
agents in a scene in the WAYMO
OPEN MOTION DATASET. Each
example highlights how predicting
the joint behavior of agents aids in
predicting likely future scenarios.
Solid and dashed lines indicate the
road graph and associated lanes.
Each numeral indicates a unique
agent in the scene.
29. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• Compared to the onboard counterpart, offboard perception has two major advantages:
• 1) it can afford much more powerful models running on ample computational resources;
• 2) it can maximally aggregate complementary information from different views by exploiting the
full point cloud sequence including both history and future.
• The offboard perception system employed contains three steps:
• (1) 3D object detector generates object proposals from each lidar frame.
• (2) multi-object tracker links detected objects throughout the lidar sequence.
• (3) for each object, an object-centric refinement network processes the tracked object boxes and
its point clouds across all frames in the track, and outputs temporally consistent and accurate 3D
bounding boxes of the object in each frame.
30. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Comparison of popular behavior prediction and motion forecasting datasets.
Specifically, compare Lyft Level 5, NuScenes, Argoverse, INTERACTION, and
the Waymo Open Motion Dataset across multiple dimensions.
31. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• The dataset provides high quality object tracks generated using an offboard perception system
along with both static and dynamic map features to provide context for the road environment.
• Mine for interesting scenarios by first hand-crafting semantic predicates involving agents'
relationships, e.g., "Agent A changed lanes at time t", and "agents A and B crossed paths with a
time gap t and relative heading difference".
• These predicates can be composed to retrieve more complex queries in an efficient SQL and
relational database framework on an overall data corpus orders of magnitude larger than the
resulting curated WAYMO OPEN MOTION DATASET.
• Pairwise interaction scenarios: merges, lane changes, unprotected turns, intersection left turns,
intersection right turns, pedestrian-vehicle interactions, cyclist vehicle interactions, interactions with
close proximity, and interactions with high accelerations.
32. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Diagram of baseline architecture. An illustration of the baseline architecture
employed for the family of learned models, with a base LSTM encoder for
agent states. The three detachable components are a road graph polyline
encoder, a traffic state LSTM encoder, and a high-order interactions encoder.
The trajectories are predicted through an MLP with a min-of-k loss.
33. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• First, consider a constant velocity model in which we assume the agent will maintain its velocity at
the current timestamp for all future steps.
• Second, consider a family of deep-learned models using various encoders, with a base
architecture of an LSTM to encode a 1-second history of observed state; this includes agents'
positions, velocities, and 3D bounding boxes.
• In order to measure the importance of particular additional features, selectively provide
additional information:
• Road graph (rg): encode the 3D map information with polylines.
• Traffic signals (ts): encode the traffic signal states with an LSTM encoder as an additional
feature.
• High-order interactions (hi): model the high-order interactions between agents with a global
interaction graph.
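The min-of-k training loss mentioned on the previous slide can be sketched as follows; this is a simplified L2 variant, and the exact distance used in the baseline may differ.

```python
import numpy as np

def min_of_k_loss(pred_trajs, gt_traj):
    """Min-of-k loss: only the predicted trajectory closest to the ground
    truth is penalized, so the k heads can specialize to different modes.
    pred_trajs: (k, T, 2), gt_traj: (T, 2)."""
    # average displacement error of each of the k hypotheses
    errs = np.mean(np.linalg.norm(pred_trajs - gt_traj[None], axis=-1), axis=-1)
    return errs.min()

gt = np.zeros((5, 2))
preds = np.stack([gt + 1.0, gt])   # one wrong hypothesis, one exact
```

Penalizing only the best hypothesis avoids averaging over modes, which would otherwise pull all k outputs toward a single blurred trajectory.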
34. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
• Use conditional behavior prediction (CBP) to quantify the interactivity in our dataset.
• A model can produce either unconditional predictions or predictions conditioned on a “query
trajectory” for one of the agents in the scene.
• If two agents are not interacting, then one’s actions have no effect on the other, so knowledge of that
agent’s future should not change predictions for the other agent.
• The degree of influence agent A has on agent B is defined as the KL divergence between unconditional
predictions for B and the predictions for B conditioned on A's ground truth future trajectory.
• Apply this to interactive and standard validation datasets, computing the KL divergence between
unconditional and conditional predictions for every query agent/target agent pair in the dataset.
• KL divergences are much larger in the interactive validation dataset than in the standard validation dataset.
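A discrete sketch of this influence measurement follows. The actual models compare GMMs over continuous trajectories; here the distributions are simply over a finite set of trajectory modes.

```python
import numpy as np

def influence_score(p_marginal, p_conditional, eps=1e-12):
    """KL(conditional || marginal) over a discrete set of predicted
    trajectory modes. Zero when conditioning on the query agent's future
    does not change the prediction, i.e. the agents do not interact."""
    p = np.asarray(p_conditional, dtype=float) + eps
    q = np.asarray(p_marginal, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

uniform = [0.25, 0.25, 0.25, 0.25]
shifted = [0.70, 0.10, 0.10, 0.10]   # conditioning shifts probability mass
```

Averaging this score over query agent/target agent pairs is what separates the interactive validation set from the standard one.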
35. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
The dataset contains many agents including
pedestrians and cyclists. Top: 46% of scenes have
more than 32 agents, and 11% of scenes have
more than 64 agents. Bottom: In the standard
validation set, 33.5% of scenes require at least
one pedestrian to be predicted, and 10.4% of
scenes require at least one cyclist to be predicted.
36. Large Scale Interactive Motion Forecasting For Autonomous
Driving : The WAYMO OPEN MOTION DATASET
Agents selected to be predicted have diverse
trajectories. Left: Ground truth trajectory of each
predicted agent in a frame of reference where all
agents start at the origin with heading pointing along
the positive X axis (pointing up). Right: Distribution
of maximum speeds achieved by all of the agents
along their 9 second trajectory. Plots depict variety
in trajectory shapes and speed profiles.
37. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Interactive driving scenarios, such as lane changes, merges and unprotected turns, are some
of the most challenging situations for autonomous driving.
• Planning in interactive scenarios requires accurately modeling the reactions of other agents
to different future actions of the ego agent.
• It develops end-to-end models for conditional behavior prediction (CBP) that take as
input a query future trajectory for an ego-agent, and predict distributions over future
trajectories for other agents conditioned on the query.
• Leveraging such a model, develop a general-purpose agent interactivity score derived
from probabilistic first principles.
• The interactivity score allows finding interesting interactive scenarios for training and
evaluating behavior prediction models.
38. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Define an agent trajectory S as a fixed-length, time discretized sequence of agent states up to a
finite time horizon.
• All quantities in this work consider a pair of agents A and B.
• Without loss of generality, consider A to be the query agent whose plan for the future can
potentially affect B, the target agent.
• The future trajectories of A and B are random variables SA and SB.
• The marginal probability of a particular realization sb of agent B's trajectory is given by p(SB = sb),
also indicated by the shorthand p(sb).
• The conditional distribution of agent B's future trajectory given a realization sa of agent A's trajectory
is given by p(SB = sb|SA = sa), indicated by the shorthand p(sb|sa).
39. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Quantify interactions by estimating the change in log-likelihood of the target's ground-truth future sb:
Δ = log p(sb|sa) − log p(sb)
• A large change in the log-likelihood indicates a situation in which the likelihood of the target agent's
trajectory changes significantly as a result of the query agent's action.
• Use the KL divergence between the conditional and marginal distributions for the target's predicted
future trajectory SB to quantify the degree of influence exerted on B by a trajectory sa of A:
DKL( p(SB|sa) || p(SB) )
• The mutual information between the two agents' future trajectories SA and SB is computed as
I(SA; SB) = E_sa [ DKL( p(SB|sa) || p(SB) ) ]
• This mutual information serves as the interactivity score between agents A and B.
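For intuition, the interactivity score can be computed exactly when the two agents' futures are discretized into a joint table of trajectory modes; this is a toy discrete stand-in for the GMM-based estimate in the paper.

```python
import numpy as np

def mutual_information(joint):
    """I(S_A; S_B) from a discrete joint distribution over trajectory modes
    of agents A (rows) and B (columns). Zero iff the futures are independent."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    pa = p.sum(axis=1, keepdims=True)    # marginal over A's modes
    pb = p.sum(axis=0, keepdims=True)    # marginal over B's modes
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (pa @ pb)[mask])))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # no interaction
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])     # B's mode tracks A's mode
```

An independent joint factorizes into its marginals and scores zero, while perfectly coupled binary futures score log 2, matching the definition above.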
40. Identifying Driver Interactions Via Conditional
Behavior Prediction
• A CBP model predicts p(SB|SA = sa, x), the distribution of future trajectories for B conditioned on sa.
• The model assumes Gaussian uncertainty over the positions of the trajectory waypoints.
• The output is a Gaussian mixture model (GMM) with mixture weights fixed over all time steps of the same trajectory.
• The computation of the interactivity score also requires the estimation of the marginal distributions.
41. Identifying Driver Interactions Via Conditional
Behavior Prediction
• Use the most likely 6 modes of the marginal distribution's GMM, as in standard motion forecasting
metrics, rather than sampling N samples from the marginal distribution.
• Learn to predict distribution parameters via supervised learning with the negative log-likelihood loss.
• Encourage the model to learn that agents cannot occupy the same future location in space-time,
with an additional loss function.
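One illustrative form of such an overlap-discouraging loss is a hinge on pairwise waypoint distance; the quadratic hinge and the radius value are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def overlap_penalty(traj_a, traj_b, radius=1.0):
    """Penalize two agents' predicted waypoints that come within `radius`
    of each other at the same time step; zero once they are separated."""
    d = np.linalg.norm(np.asarray(traj_a) - np.asarray(traj_b), axis=-1)
    return float(np.sum(np.maximum(0.0, radius - d) ** 2))

far = overlap_penalty(np.zeros((4, 2)), np.full((4, 2), 10.0))   # no overlap
near = overlap_penalty(np.zeros((4, 2)), np.zeros((4, 2)))       # full overlap
```

The penalty is zero for well-separated trajectories and grows as predicted positions collide in space-time, giving the model a differentiable signal toward physically plausible joint futures.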
42. Identifying Driver Interactions Via Conditional
Behavior Prediction
A conditional behavior prediction model describes
how one agent’s predicted future trajectory can
shift due to the actions of other agents.
The architecture of the conditional behavior
prediction model.
43. Identifying Driver Interactions Via Conditional
Behavior Prediction
Histogram of interactivity score (mutual
information) between 8,919,306 pairs of
agents in the validation dataset.
44. Identifying Driver Interactions Via Conditional
Behavior Prediction
Two examples of interacting agents found by
sorting examples by mutual information and
wADE. The marginal (left) and conditional
predictions (right) are shown with the query in
solid green, and predictions in dashed cyan lines.
45. Identifying Driver Interactions Via Conditional
Behavior Prediction
An example in which the query and target agents slow down in parallel
lanes as a result of a traffic light change. The marginal (left) and
conditional predictions (right) are shown with the query in solid green.
46. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• Deciphering human behaviors to predict their future paths/trajectories and what they would do from
videos is important in many applications.
• Therefore, this work studies predicting a pedestrian's future path jointly with future activities.
• They propose an end-to-end, multi-task learning system, called Next, utilizing rich visual features
about human behavioral information and interaction with their surroundings.
• It encodes a person through rich semantic features about visual appearance, body movement and
interaction with the surroundings, motivated by the fact that humans derive such predictions by
relying on similar visual cues.
• To facilitate the training, the network is learned with an auxiliary task of predicting future location
in which the activity will happen.
• In the auxiliary task, it designs a discretized grid, called the Manhattan Grid, as the location prediction
target for the system.
https://github.com/JunweiLiang/social-distancing-prediction
47. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
The goal is to jointly predict a person’s future path and activity. The green and yellow line show two
possible future trajectories and two possible activities are shown in the green and yellow boxes.
Depending on the future activity, the person (top right) may take different paths, e.g. the yellow path
for “loading” and the green path for “object transfer”.
48. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• HUMANS NAVIGATE THROUGH PUBLIC SPACES OFTEN WITH SPECIFIC PURPOSES IN MIND, RANGING
FROM SIMPLE ONES LIKE ENTERING A ROOM TO MORE COMPLICATED ONES LIKE PUTTING THINGS
INTO A CAR.
• SUCH INTENTION, HOWEVER, IS MOSTLY NEGLECTED IN EXISTING WORK.
• THE JOINT PREDICTION MODEL CAN HAVE TWO BENEFITS:
• 1) LEARNING THE ACTIVITY TOGETHER WITH THE PATH MAY BENEFIT THE FUTURE PATH PREDICTION;
INTUITIVELY, HUMANS ARE ABLE TO READ FROM OTHERS’ BODY LANGUAGE TO ANTICIPATE WHETHER
THEY ARE GOING TO CROSS THE STREET OR CONTINUE WALKING ALONG THE SIDEWALK.
• 2) THE JOINT MODEL ADVANCES THE CAPABILITY OF UNDERSTANDING NOT ONLY THE FUTURE PATH
BUT ALSO THE FUTURE ACTIVITY BY TAKING INTO ACCOUNT THE RICH SEMANTIC CONTEXT IN
VIDEOS; THIS INCREASES THE CAPABILITIES OF AUTOMATED VIDEO ANALYTICS FOR SOCIAL GOOD,
SUCH AS SAFETY APPLICATIONS LIKE ANTICIPATING PEDESTRIAN MOVEMENT AT TRAFFIC
INTERSECTIONS OR A ROAD ROBOT HELPING HUMANS TRANSPORT GOODS TO A CAR.
49. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
Overview of the Next model. Given a sequence of frames containing the person for prediction, this model utilizes
a person behavior module and a person interaction module to encode rich visual semantics into a feature tensor.
50. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• 4 KEY COMPONENTS:
• PERSON BEHAVIOR MODULE EXTRACTS VISUAL INFORMATION FROM THE BEHAVIORAL SEQUENCE OF THE
PERSON.
• PERSON INTERACTION MODULE LOOKS AT THE INTERACTION BETWEEN A PERSON AND THEIR
SURROUNDINGS.
• TRAJECTORY GENERATOR SUMMARIZES THE ENCODED VISUAL FEATURES AND PREDICTS THE FUTURE
TRAJECTORY BY THE LSTM DECODER WITH FOCAL ATTENTION.
• ACTIVITY PREDICTION UTILIZES RICH VISUAL SEMANTICS TO PREDICT THE FUTURE ACTIVITY LABEL FOR THE
PERSON.
• IN ADDITION, DIVIDE THE SCENE INTO A DISCRETIZED GRID OF MULTIPLE SCALES, CALLED
MANHATTAN GRID, TO COMPUTE CLASSIFICATION AND REGRESSION FOR ROBUST ACTIVITY
LOCATION PREDICTION.
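The classification-plus-regression encoding over such a grid can be sketched as follows; the row-major cell indexing convention here is an assumption for illustration.

```python
def manhattan_grid_target(xy, origin, cell_size, grid_shape):
    """Encode a 2D activity location as a grid-cell index (classification
    target) plus a within-cell offset (regression target).
    grid_shape is (rows, cols); cells are indexed row-major."""
    rows, cols = grid_shape
    rel_x = (xy[0] - origin[0]) / cell_size
    rel_y = (xy[1] - origin[1]) / cell_size
    col = min(max(int(rel_x), 0), cols - 1)
    row = min(max(int(rel_y), 0), rows - 1)
    return row * cols + col, (rel_x - col, rel_y - row)

cell, offset = manhattan_grid_target(
    (2.5, 1.5), origin=(0.0, 0.0), cell_size=1.0, grid_shape=(4, 4))
```

The coarse cell makes location prediction a robust classification problem, while the fractional offset recovers the precision that discretization alone would lose.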
51. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
To model appearance changes of a person, utilize a pre-trained object detection model with "RoIAlign"
to extract fixed-size CNN features for each person bounding box. Average the features along the spatial
dimensions for each person and feed them into an LSTM encoder, obtaining a feature representation of
Tobs × d, where d is the hidden size of the LSTM. To capture the body movement, utilize a person
keypoint detection model to extract person keypoint information, and apply a linear transformation to
embed the keypoint coordinates before feeding them into the LSTM encoder; the encoded feature again
has the shape Tobs × d. These appearance and movement features are commonly used in a wide variety
of studies and thus do not introduce new concerns about machine learning fairness.
52. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
The person-objects feature can capture how far away the person is from other
people and cars. The person-scene feature can capture whether the person is
near the sidewalk or grass. This information is provided to the model with the hope
of learning things like a person walks more often on the sidewalk than on the grass
and tends to avoid bumping into cars.
53. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
• IT USES AN LSTM DECODER TO DIRECTLY PREDICT THE FUTURE TRAJECTORY IN XY-COORDINATES.
• THE HIDDEN STATE OF THIS DECODER IS INITIALIZED USING THE LAST STATE OF THE PERSON’S
TRAJECTORY LSTM ENCODER.
• ADD AN AUXILIARY TASK, I.E. ACTIVITY LOCATION PREDICTION, IN ADDITION TO PREDICTING THE
FUTURE ACTIVITY LABEL OF THE PERSON.
• AT EACH TIME INSTANT, THE XY-COORDINATE WILL BE COMPUTED FROM THE DECODER STATE AND
BY A FULLY CONNECTED LAYER.
• IT EMPLOYS AN EFFECTIVE FOCAL ATTENTION, ORIGINALLY PROPOSED TO CARRY OUT MULTIMODAL
INFERENCE OVER A SEQUENCE OF IMAGES FOR VISUAL QUESTION ANSWERING, WHOSE KEY IDEA IS
TO PROJECT MULTIPLE FEATURES INTO A SPACE OF CORRELATION, WHERE DISCRIMINATIVE FEATURES
ARE EASIER TO CAPTURE BY THE ATTENTION MECHANISM.
54. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
To bridge the gap between trajectory generation and activity label prediction, it proposes an activity
location prediction (ALP) module to predict the final location where the person will engage in the future
activity. The activity location prediction includes two tasks, location classification and location regression.
55. PEEKING INTO THE FUTURE: PREDICTING FUTURE
PERSON ACTIVITIES AND LOCATIONS IN VIDEOS
Qualitative comparison between this method and the baselines. Yellow path is the observable trajectory and
green path is the ground truth trajectory during the prediction period. Predictions are shown as blue heatmaps.
56. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
• DETECTING PEDESTRIANS AND PREDICTING FUTURE TRAJECTORIES FOR THEM ARE CRITICAL TASKS FOR NUMEROUS
APPLICATIONS, SUCH AS AUTONOMOUS DRIVING.
• PREVIOUS METHODS EITHER TREAT THE DETECTION AND PREDICTION AS SEPARATE TASKS OR SIMPLY ADD A TRAJECTORY
REGRESSION HEAD ON TOP OF A DETECTOR.
• AN END-TO-END TWO-STAGE NETWORK: SPATIO-TEMPORAL-INTERACTIVE NETWORK (STINET).
• IN ADDITION TO 3D GEOMETRY MODELING OF PEDESTRIANS, MODEL THE TEMPORAL INFORMATION FOR EACH OF THE
PEDESTRIANS.
• IT PREDICTS BOTH CURRENT AND PAST LOCATIONS IN THE FIRST STAGE, SO THAT EACH PEDESTRIAN CAN BE LINKED
ACROSS FRAMES AND THE COMPREHENSIVE SPATIO-TEMPORAL INFORMATION CAN BE CAPTURED IN THE SECOND
STAGE.
• ALSO, MODEL THE INTERACTION AMONG OBJECTS WITH AN INTERACTION GRAPH, TO GATHER THE INFORMATION
AMONG THE NEIGHBORING OBJECTS.
• COMPREHENSIVE EXPERIMENTS ON THE LYFT DATASET AND THE RECENTLY RELEASED LARGE-SCALE WAYMO OPEN
DATASET FOR BOTH OBJECT DETECTION AND FUTURE TRAJECTORY PREDICTION.
57. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
The overview. It takes a sequence of point clouds as input, detects pedestrians and predicts their future
trajectories simultaneously. The point clouds are processed by Pillar Feature Encoding to generate Pillar
Features. Then each Pillar Feature is fed into a backbone ResUNet to get backbone features. A Temporal
Region Proposal Network (T-RPN) takes backbone features and generates temporal proposals with past
and current boxes for each object. The Spatio-Temporal-Interactive (STI) Feature Extractor learns features
for each temporal proposal, which are used for final detection and trajectory prediction.
58. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
Backbone. Upper: overview of the backbone. The
input point cloud sequence is fed to Voxelization and
Point net to generate pseudo images, which are then
processed by ResNet U-Net to generate final
backbone feature sequence. Lower: detailed design
of ResNet U-Net.
59. STINET: SPATIO-TEMPORAL-INTERACTIVE NETWORK
FOR PEDESTRIAN DETECTION AND TRAJECTORY
PREDICTION
Spatio-Temporal-Interactive Feature Extractor
(STI-FE): Local geometry, local dynamics and
history path features are extracted given a
temporal proposal. For local geometry and
local dynamics features, the yellow areas are
used for feature extraction. Relational
reasoning is performed across proposals’ local
features to generate interactive features.