2. OUTLINE
• DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries
• BEVDet: High-Performance Multi-Camera 3D Object Detection in BEV
• BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
• PETR: Position Embedding Transformation for Multi-View 3D Object Detection
• FIERY: Future Instance Prediction in Bird’s-Eye View from Surround Monocular Cameras
• BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection
• PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images
• ST-P3: End-to-End Vision-based Autonomous Driving via Spatial-Temporal Feature Learning
3. DETR3D: 3D OBJECT DETECTION FROM MULTI-VIEW IMAGES VIA 3D-TO-2D QUERIES
• DETR3D makes predictions directly in 3D space: the architecture extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these 2D features, linking 3D positions to the multi-view images via camera transformation matrices (see the sketch after this list).
• Finally, the model makes a bounding box prediction per object query, using a set-to-set loss to
measure the discrepancy between the ground-truth and the prediction.
• This top-down approach outperforms its bottom-up counterpart in which object bounding box
prediction follows per-pixel depth estimation, since it does not suffer from the compounding error
introduced by a depth prediction model.
• Moreover, it does not require post-processing such as non-maximum suppression, dramatically
improving inference speed.
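To make the 3D-to-2D query step concrete, here is a minimal PyTorch sketch of projecting 3D reference points into each camera view and bilinearly sampling 2D features. The `lidar2img` matrices, the shapes, and the average-over-valid-views aggregation are illustrative assumptions, not the official implementation (which also handles multiple feature levels and iterates over decoder layers).

import torch
import torch.nn.functional as F

def sample_features_at_3d_points(ref_points, feats, lidar2img):
    """Project 3D reference points into each camera and bilinearly
    sample 2D features (illustrative sketch, not the official code).

    ref_points: (Q, 3) 3D reference points decoded from object queries
    feats:      (N, C, H, W) per-camera 2D feature maps
                (assumed at input resolution for simplicity)
    lidar2img:  (N, 4, 4) projection matrices (ego/LiDAR -> pixels)
    """
    Q, N = ref_points.shape[0], feats.shape[0]
    # Homogeneous coordinates: (Q, 4)
    pts = torch.cat([ref_points, ref_points.new_ones(Q, 1)], dim=-1)
    # Project into all cameras at once: (N, Q, 4)
    cam_pts = torch.einsum('nij,qj->nqi', lidar2img, pts)
    # Perspective divide; keep only points in front of the camera
    eps = 1e-5
    depth = cam_pts[..., 2:3]
    xy = cam_pts[..., :2] / depth.clamp(min=eps)
    # Normalize pixel coordinates to [-1, 1] for grid_sample
    H, W = feats.shape[-2:]
    xy_norm = torch.stack([xy[..., 0] / W * 2 - 1,
                           xy[..., 1] / H * 2 - 1], dim=-1)
    # Sample each camera's features at the projected points: (N, C, Q)
    sampled = F.grid_sample(feats, xy_norm.unsqueeze(2),
                            align_corners=False).squeeze(-1)
    # Mask out points behind the camera, then average valid views
    mask = (depth > eps).squeeze(-1).unsqueeze(1).float()  # (N, 1, Q)
    sampled = (sampled * mask).sum(0) / mask.sum(0).clamp(min=1)
    return sampled.transpose(0, 1)  # (Q, C) per-query features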
7. BEVDET: HIGH-PERFORMANCE MULTI-CAMERA 3D OBJECT DETECTION IN BIRD-EYE-VIEW
• BEVDet follows the principle of detecting 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed.
• In this paradigm, four modules are applied in succession, each with a different role: an image-view encoder that encodes features in the image view, a view transformer that lifts features from the image view to BEV, a BEV encoder that further encodes features in BEV, and a task-specific head that predicts the targets in BEV (see the pipeline sketch after this list).
• The authors reuse existing modules to construct BEVDet and make it feasible for multi-camera 3D object detection by devising an exclusive data augmentation strategy.
• The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between
computing budget and performance.
• BEVDet with a 704×256 image size (1/8 of its competitors) scores 29.4% mAP and 38.4% NDS on the nuScenes val set, comparable with FCOS3D (2008.2 GFLOPs, 1.7 FPS, 29.5% mAP, 37.2% NDS), while requiring just 12% of the computing budget (239.4 GFLOPs) and running 4.3 times faster.
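The four-stage design maps naturally onto a modular skeleton. Below is a minimal PyTorch sketch of how the stages compose; every submodule (`image_encoder`, `view_transformer`, `bev_encoder`, `head`) is a placeholder assumption standing in for the concrete components BEVDet reuses (e.g., an image backbone with FPN, a Lift-Splat-style view transformer, and a CenterPoint-style head).

import torch.nn as nn

class BEVDetSketch(nn.Module):
    """Illustrative four-stage BEVDet-style pipeline; each submodule
    is a placeholder for the real component."""
    def __init__(self, image_encoder, view_transformer, bev_encoder, head):
        super().__init__()
        self.image_encoder = image_encoder        # image-view feature encoder
        self.view_transformer = view_transformer  # image view -> BEV
        self.bev_encoder = bev_encoder            # further convs in BEV space
        self.head = head                          # task-specific BEV head

    def forward(self, imgs, cam_params):
        # imgs: (B, N_cams, 3, H, W); cam_params: intrinsics/extrinsics
        B, N, C, H, W = imgs.shape
        img_feats = self.image_encoder(imgs.flatten(0, 1))   # (B*N, C', h, w)
        bev_feats = self.view_transformer(
            img_feats.unflatten(0, (B, N)), cam_params)      # (B, C'', X, Y)
        bev_feats = self.bev_encoder(bev_feats)
        return self.head(bev_feats)                          # BEV predictions

Keeping the four stages decoupled is what lets BEVDet reuse existing image-view and BEV components largely unchanged.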
10. BEVDET4D: EXPLOIT TEMPORAL CUES IN MULTI-CAMERA 3D OBJECT DETECTION
• To fundamentally push the performance boundary in this area, BEVDet4D is proposed to lift the scalable BEVDet paradigm from spatial-only 3D space to spatial-temporal 4D space.
• It upgrades the framework with only a few modifications, just enough to fuse the feature from the previous frame with the corresponding one in the current frame (a fusion sketch follows this list).
• In this way, with negligible extra computing budget, the algorithm gains access to temporal cues by querying and comparing the two candidate features.
• Beyond this, BEVDet4D also simplifies the velocity learning task by removing the factors of ego-motion and time, which equips it with robust generalization performance and reduces the velocity error by 52.8%.
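The temporal fusion amounts to spatially aligning the previous BEV feature with the current ego frame and concatenating. The sketch below assumes a 2D affine ego-motion transform on the BEV plane and a user-supplied fusion conv; this is a simplification of the actual alignment operation.

import torch
import torch.nn.functional as F

def fuse_temporal_bev(bev_curr, bev_prev, ego_motion_2d, fuse_conv):
    """Warp the previous frame's BEV feature into the current ego frame
    and concatenate it with the current feature (illustrative sketch).

    bev_curr, bev_prev: (B, C, X, Y) BEV feature maps
    ego_motion_2d:      (B, 2, 3) affine transform (rotation + translation
                        in BEV grid units) between the two ego frames
    fuse_conv:          a conv block mapping 2C -> C channels
    """
    grid = F.affine_grid(ego_motion_2d, bev_prev.shape, align_corners=False)
    prev_aligned = F.grid_sample(bev_prev, grid, align_corners=False)
    return fuse_conv(torch.cat([bev_curr, prev_aligned], dim=1))

Because the previous feature is ego-motion compensated before fusion, the head can learn a positional offset between aligned frames rather than an ego-motion-entangled velocity, which is the simplification of velocity learning referred to above.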
15. PETR: POSITION EMBEDDING TRANSFORMATION FOR MULTI-VIEW 3D OBJECT DETECTION
• This paper develops position embedding transformation (PETR) for multi-view 3D object detection.
• PETR encodes the position information of 3D coordinates into image features, producing 3D position-aware features (see the sketch after this list).
• Object queries can then perceive the 3D position-aware features and perform end-to-end object detection.
• PETR achieves state-of-the-art performance (50.4% NDS and 44.1% mAP) on the standard nuScenes dataset and ranks 1st on the benchmark.
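The core of PETR is turning camera-frustum points into a 3D position embedding that is added to the 2D image features. Below is a minimal sketch; the depth range, the lack of coordinate normalization, and the user-supplied `pe_mlp` (e.g., nn.Sequential(nn.Linear(D*3, 256), nn.ReLU(), nn.Linear(256, C))) are assumptions, and the released model additionally normalizes coordinates and masks points outside the perception range.

import torch

def make_3d_position_embedding(feat_hw, num_depth, img2lidar, pe_mlp):
    """Build PETR-style 3D position embeddings (illustrative sketch).

    feat_hw:   (H, W) feature map size
    num_depth: D depth bins along each camera ray
    img2lidar: (N, 4, 4) inverse projection matrices, pixels -> 3D
    pe_mlp:    MLP mapping D*3 coordinates -> C embedding channels
    """
    H, W = feat_hw
    D = num_depth
    # Camera-frustum points: one ray of D depths per feature location
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    ds = torch.linspace(1.0, 60.0, D)                  # assumed depth range (m)
    xs = xs[None].expand(D, H, W) * ds[:, None, None]  # u*d for homogeneous form
    ys = ys[None].expand(D, H, W) * ds[:, None, None]  # v*d
    ds_ = ds[:, None, None].expand(D, H, W)
    frustum = torch.stack([xs, ys, ds_, torch.ones_like(ds_)], dim=-1)  # (D,H,W,4)
    # Lift to the shared 3D space for every camera: (N, D, H, W, 3)
    pts3d = torch.einsum('nij,dhwj->ndhwi', img2lidar, frustum)[..., :3]
    # Flatten the depth dimension into channels and embed pointwise
    coords = pts3d.permute(0, 2, 3, 1, 4).reshape(-1, H, W, D * 3)
    pe = pe_mlp(coords)                  # (N, H, W, C)
    return pe.permute(0, 3, 1, 2)        # (N, C, H, W), added to 2D features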
16. PETR: POSITION EMBEDDING TRANSFORMATION FOR MULTI-VIEW 3D OBJECT DETECTION
(a) In DETR, the object queries interact with 2D features to perform 2D detection. (b) DETR3D repeatedly projects generated 3D reference points into the image plane and samples 2D features to interact with the object queries in the decoder. (c) PETR generates 3D position-aware features by encoding the 3D position embedding into 2D image features. The object queries directly interact with the 3D position-aware features and output 3D detection results.
22. FIERY: FUTURE INSTANCE PREDICTION IN BIRD’S-EYE VIEW FROM SURROUND MONOCULAR CAMERAS
• Driving requires interacting with road agents and predicting their future behaviour in order to
navigate safely.
• FIERY: a probabilistic future prediction model in bird’s-eye view from monocular cameras.
• The model predicts future instance segmentation and motion of dynamic agents that can be
transformed into non-parametric future trajectories.
• The approach combines the perception, sensor fusion and prediction components of a traditional autonomous driving stack by predicting in bird’s-eye view directly from surround monocular RGB camera inputs.
• FIERY learns to model the inherent stochastic nature of the future solely from camera driving data in an end-to-end manner, without relying on HD maps, and predicts multimodal future trajectories (a rollout sketch follows this list).
• The code and trained models are available at https://github.com/wayveai/fiery.
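To make the probabilistic rollout concrete, here is a heavily simplified sketch of sampling one future from a diagonal-Gaussian "present" distribution and rolling it out with a recurrent cell. FIERY's actual model keeps spatial structure and trains the present distribution against a future distribution; both are collapsed here for brevity, and all names and shapes are assumptions.

import torch
import torch.nn as nn

class FutureSampler(nn.Module):
    """Illustrative FIERY-style probabilistic future rollout: sample a
    latent from a diagonal Gaussian, then roll out future states with a
    GRU conditioned on that one sample (not the released code)."""
    def __init__(self, c_bev=64, c_lat=32, n_future=4):
        super().__init__()
        self.to_mu_logvar = nn.Conv2d(c_bev, 2 * c_lat, 1)
        self.gru = nn.GRUCell(c_lat, c_bev)
        self.n_future = n_future

    def forward(self, bev_state):
        # bev_state: (B, C, X, Y) present BEV feature
        stats = self.to_mu_logvar(bev_state).mean(dim=(2, 3))  # (B, 2*c_lat)
        mu, logvar = stats.chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # one future sample
        h = bev_state.mean(dim=(2, 3))                         # pooled state (B, C)
        futures = []
        for _ in range(self.n_future):
            h = self.gru(z, h)  # the same latent drives every step of this future
            futures.append(h)
        # (B, T, C); decoded by instance segmentation/flow heads in the full model
        return torch.stack(futures, dim=1)

Sampling z several times yields several distinct rollouts, which is how a single forward pass family produces multimodal futures.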
29. BEVDEPTH: ACQUISITION OF RELIABLE DEPTH FOR MULTI-VIEW 3D OBJECT DETECTION
• This work introduces BEVDepth, a new 3D object detector with trustworthy depth estimation for camera-based Bird’s-Eye-View (BEV) 3D object detection.
• In prior camera-based detectors, depth estimation is implicitly learned without camera information, making it de-facto fake depth for creating the subsequent pseudo point cloud.
• BEVDepth instead gets explicit depth supervision utilizing encoded intrinsic and extrinsic parameters (a supervision sketch follows this list).
• A depth correction sub-network is further introduced to counteract projection-induced disturbances in the depth ground truth.
• To reduce the speed bottleneck when projecting features from image view into BEV using the estimated depth, an efficient view-transform operation is also proposed.
• Besides, BEVDepth can easily be extended with multi-frame input.
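The explicit depth supervision can be obtained by projecting LiDAR points into the image with the known intrinsics and extrinsics. Below is a minimal sketch; the depth range and the last-write-wins rasterization are simplifying assumptions.

import torch

def depth_gt_from_lidar(points, lidar2img, H, W, d_min=2.0, d_max=58.0):
    """Project LiDAR points into the image to get per-pixel depth ground
    truth for explicit depth supervision (illustrative sketch).

    points:    (P, 3) LiDAR points in ego coordinates
    lidar2img: (4, 4) combined extrinsic + intrinsic projection matrix
    """
    pts = torch.cat([points, points.new_ones(points.shape[0], 1)], dim=1)
    cam = pts @ lidar2img.T                          # (P, 4) homogeneous pixels
    depth = cam[:, 2]
    uv = (cam[:, :2] / depth.clamp(min=1e-5)[:, None]).long()
    ok = ((depth > d_min) & (depth < d_max) &
          (uv[:, 0] >= 0) & (uv[:, 0] < W) &
          (uv[:, 1] >= 0) & (uv[:, 1] < H))
    gt = points.new_zeros(H, W)
    gt[uv[ok, 1], uv[ok, 0]] = depth[ok]  # duplicate hits overwrite; fine for a sketch
    return gt  # supervise the predicted depth distribution only where gt > 0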
36. PETRV2: A UNIFIED FRAMEWORK FOR 3D PERCEPTION FROM MULTI-CAMERA IMAGES
• PETRv2 is a unified framework for 3D perception from multi-view images.
• Building on PETR, PETRv2 explores the effectiveness of temporal modeling, utilizing the temporal information of previous frames to boost 3D object detection.
• More specifically, it extends the 3D position embedding (3D PE) in PETR for temporal modeling.
• The 3D PE achieves temporal alignment of object positions across different frames (an alignment sketch follows this list).
• A feature-guided position encoder is further introduced to improve the data adaptability of the 3D PE.
• To support high-quality BEV segmentation, PETRv2 provides a simple yet effective solution by adding a set of segmentation queries.
• Each segmentation query is responsible for segmenting one specific patch of the BEV map.
• Code is available at https://github.com/megvii-research/PETR.
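The temporal alignment of the 3D PE reduces to transforming the 3D coordinates generated in the previous frame into the current ego coordinate system using the two ego poses. A minimal sketch, with the pose matrix names assumed:

import torch

def align_prev_coords(pts_prev, ego_prev2global, ego_curr2global):
    """Align 3D points from the previous frame's ego coordinates to the
    current frame (illustrative sketch of temporal alignment of 3D PE).

    pts_prev:     (..., 3) 3D points in the t-1 ego frame
    ego_*2global: (4, 4) ego-to-global pose matrices at t-1 and t
    """
    # Compose: t-1 ego -> global -> t ego
    prev2curr = torch.linalg.inv(ego_curr2global) @ ego_prev2global
    pts = torch.cat([pts_prev, torch.ones_like(pts_prev[..., :1])], dim=-1)
    return (pts @ prev2curr.T)[..., :3]

The aligned coordinates are then fed through the same position encoder as the current frame, so queries see both frames in one consistent coordinate system.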
37–40. PETRV2: A UNIFIED FRAMEWORK FOR 3D PERCEPTION FROM MULTI-CAMERA IMAGES
[Figure-only slides; recoverable labels: coordinate system transformation, feature-guided position encoder]
41. ST-P3: END-TO-END VISION-BASED AUTONOMOUS DRIVING VIA SPATIAL-TEMPORAL FEATURE LEARNING
• While there are some pioneering works on LiDAR-based input or implicit designs, this paper formulates the problem in an interpretable, vision-based setting.
• In particular, it proposes a spatial-temporal feature learning scheme, called ST-P3, that produces a set of more representative features for the perception, prediction and planning tasks simultaneously.
• Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometry information in 3D space before the bird’s-eye-view transformation for perception (see the sketch after this list); a dual-pathway modeling is devised to take past motion variations into account for future prediction; and a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning.
• Source code available at https://github.com/OpenPerceptionX/ST-P3.
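Egocentric-aligned accumulation can be sketched as warping each past BEV feature into the current ego frame before pooling them over time. The affine-on-the-BEV-plane alignment and mean-pool fusion below are simplifying assumptions rather than ST-P3's exact operators.

import torch
import torch.nn.functional as F

def egocentric_accumulate(bev_seq, ego_affines):
    """Warp each past BEV feature into the current ego frame and
    accumulate them (illustrative sketch).

    bev_seq:     list of T tensors (B, C, X, Y), oldest first,
                 with the last entry being the current frame
    ego_affines: list of T (B, 2, 3) BEV-plane affine transforms mapping
                 the current grid into each past frame (identity for the
                 current frame)
    """
    aligned = []
    for bev, theta in zip(bev_seq, ego_affines):
        grid = F.affine_grid(theta, bev.shape, align_corners=False)
        aligned.append(F.grid_sample(bev, grid, align_corners=False))
    # Mean-pool the ego-aligned history into one BEV feature map
    return torch.stack(aligned, dim=0).mean(dim=0)  # (B, C, X, Y)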