Joint Detection & Segmentation
in BEV Representation
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
Outline
• M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified
Bird’s-Eye View Representation
• BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-
Centric Autonomous Driving
• Learning Ego 3D Representation as Ray Tracing
• BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View
Representation
• Efficient and Robust 2D-to-BEV Representation Learning via Geometry-
guided Kernel Transformer
• BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
• BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers
M2BEV: Multi-Camera Joint 3D Detection and Segmentation
with Unified Bird’s-Eye View Representation
• M2BEV is a unified framework that jointly performs 3D object detection and map
segmentation in the Bird’s Eye View (BEV) space with multi-camera image inputs.
• Unlike the majority of previous works which separately process detection and
segmentation, M2BEV infers both tasks with a unified model and improves efficiency.
• M2BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in
ego-car coordinates.
• Such BEV representation is important as it enables different tasks to share a single
encoder.
• This framework further contains four important designs that benefit both accuracy and
efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of the
voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to
assign ground-truth 3D boxes to anchors; (3) a BEV centerness re-weighting that assigns
larger weights to more distant predictions; and (4) large-scale 2D detection pre-training
and auxiliary supervision.
• M2BEV is memory efficient, allowing significantly higher resolution images as input, with
faster inference speed.
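The 2D-to-3D transform above can be sketched as follows: project each BEV voxel centre into a camera image and sample the 2D feature there. This is a minimal single-camera, nearest-neighbour version; the function name, the toy calibration, and the sampling scheme are illustrative assumptions, not M2BEV's actual implementation.

```python
import numpy as np

def lift_images_to_bev(feat_2d, K, T_cam_from_ego, voxel_centers):
    """Fill BEV voxel features by projecting voxel centres into a camera.

    feat_2d: (C, H, W) image feature map; K: 3x3 intrinsics;
    T_cam_from_ego: 4x4 extrinsics (ego -> camera);
    voxel_centers: (N, 3) voxel centres in ego coordinates.
    Returns (N, C) BEV features and the validity mask.
    """
    C, H, W = feat_2d.shape
    N = voxel_centers.shape[0]
    bev_feat = np.zeros((N, C), dtype=feat_2d.dtype)
    # Transform ego-frame voxel centres into the camera frame.
    pts_cam = (T_cam_from_ego[:3, :3] @ voxel_centers.T
               + T_cam_from_ego[:3, 3:4]).T
    z = pts_cam[:, 2]
    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    u = uv[:, 0] / np.maximum(z, 1e-6)
    v = uv[:, 1] / np.maximum(z, 1e-6)
    # Only voxels in front of the camera and inside the image get features.
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ui = u[valid].astype(int)
    vi = v[valid].astype(int)
    bev_feat[valid] = feat_2d[:, vi, ui].T  # nearest-neighbour sampling
    return bev_feat, valid
```

With multiple cameras, the same projection is repeated per view and the per-camera results are merged; voxels visible in no camera keep zero features.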
BEVerse: Unified Perception and Prediction in Birds-Eye-
View for Vision-Centric Autonomous Driving
• BEVerse is a unified framework for 3D perception and prediction based on multi-camera
systems.
• Unlike existing studies that focus on improving single-task approaches, BEVerse
produces spatio-temporal Birds-Eye-View (BEV) representations from multi-camera
videos and jointly reasons about multiple tasks for vision-centric autonomous
driving.
• Specifically, BEVerse first performs shared feature extraction and lifting to generate
4D BEV representations from multi-timestamp and multi-view images.
• After ego-motion alignment, the spatio-temporal encoder is utilized for further
feature extraction in BEV.
• Finally, multiple task decoders are attached for joint reasoning and prediction.
• Within the decoders, a grid sampler is proposed to generate BEV features with
different ranges and granularities for different tasks.
• An iterative-flow method is also designed for memory-efficient future prediction.
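The grid sampler idea, resampling one shared BEV feature map onto task-specific grids with different metric ranges and cell sizes, can be illustrated with a toy nearest-neighbour resampler. BEVerse itself uses bilinear interpolation; the function and its parameters are hypothetical.

```python
import numpy as np

def grid_sample_bev(base_feat, base_range, base_res, task_range, task_res):
    """Resample a shared BEV map onto a task-specific grid.

    base_feat: (C, H, W) with both axes spanning base_range metres at
    base_res m/cell (assumes a square grid, H == W).
    task_range: (lo, hi) metres; task_res: metres per cell for the task.
    Nearest-neighbour stands in for BEVerse's bilinear interpolation.
    """
    C, H, W = base_feat.shape
    n = int(round((task_range[1] - task_range[0]) / task_res))
    # Metric coordinates of the task-grid cell centres.
    coords = task_range[0] + (np.arange(n) + 0.5) * task_res
    # Map metric coordinates to base-grid indices.
    idx = ((coords - base_range[0]) / base_res).astype(int)
    idx = np.clip(idx, 0, H - 1)
    # Gather rows then columns: a smaller/finer crop of the base map.
    return base_feat[:, idx][:, :, idx]
```

For example, the map-segmentation head might request a 60 m grid at 0.3 m/cell while motion prediction uses 100 m at 0.5 m/cell, both cut from the same encoder output.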
Learning Ego 3D Representation as Ray Tracing
• A self-driving perception model aims to extract 3D semantic representations from
multiple cameras collectively into the bird’s-eye-view (BEV) coordinate frame of
the ego car in order to ground the downstream planner.
• Existing perception methods often rely on error-prone depth estimation of the
whole scene or learning sparse virtual 3D representations without the target
geometry structure, both of which remain limited in performance and/or
capability.
• In this paper, an end-to-end architecture is presented for ego 3D representation
learning from an arbitrary number of unconstrained camera views.
• Inspired by the ray tracing principle, a polarized grid of “imaginary eyes” is
designed as the learnable ego 3D representation, and the learning process is
formulated with an adaptive attention mechanism in conjunction with 3D-to-2D
projection.
• Critically, this formulation allows extracting a rich 3D representation from 2D
images without any depth supervision, with the built-in geometry structure
consistent w.r.t. BEV.
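A minimal sketch of constructing such a polarized grid of "imaginary eye" positions in the ego frame, from which rays are later projected into the 2D views. The cell spacing and the fixed ray height are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def polar_bev_rays(num_radii, num_angles, max_radius, height=0.0):
    """Build a polar grid of 3D 'eye' positions around the ego car.

    Returns an array of shape (num_radii, num_angles, 3) of Cartesian
    points; each point is the origin of a learnable ray that will be
    projected into the camera views (3D-to-2D) for attention sampling.
    """
    # Radial bin centres, evenly spaced out to max_radius metres.
    r = (np.arange(num_radii) + 0.5) * (max_radius / num_radii)
    # Evenly spaced viewing angles over the full circle.
    theta = np.arange(num_angles) * (2.0 * np.pi / num_angles)
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    x = rr * np.cos(tt)
    y = rr * np.sin(tt)
    z = np.full_like(x, height)  # fixed height; an assumption here
    return np.stack([x, y, z], axis=-1)
```

The polar layout matches how cameras sample the world (dense near the ego car, sparse far away), which is the geometric prior the paper exploits.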
(a) The first strategy, represented by LSS and CaDDN, is based on dense pixel-level depth estimation. (b) The
second strategy, represented by PON, bypasses depth estimation by learning an implicit 2D-3D projection. (c)
This strategy backtracks 2D information from “imaginary” eyes specially designed in the BEV’s geometry.
BEVFusion: Multi-Task Multi-Sensor Fusion
with Unified Bird’s-Eye View Representation
• Recent approaches are based on point-level fusion: augmenting the LiDAR point
cloud with camera features.
• However, the camera-to-LiDAR projection throws away the semantic density of
camera features, hindering the effectiveness of such methods, especially for
semantic-oriented tasks (such as 3D scene segmentation).
• This paper breaks this deeply-rooted convention with BEVFusion, an efficient and
generic multi-task multi-sensor fusion framework.
• It unifies multi- modal features in the shared bird’s-eye view (BEV) representation
space, which nicely preserves both geometric and semantic information.
• To achieve this, the key efficiency bottlenecks in the view transformation are
diagnosed and lifted with optimized BEV pooling, reducing latency by more than 40×.
• BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D
perception tasks with almost no architectural changes.
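The BEV pooling operation being optimized is, semantically, a scatter-sum: every lifted camera feature point falls into some BEV cell, and features sharing a cell are aggregated. The paper replaces the generic scatter with precomputed intervals and a dedicated GPU kernel, which is where the 40× latency reduction comes from; this numpy version only illustrates the operation's semantics.

```python
import numpy as np

def bev_pool(feats, coords, grid_shape):
    """Sum point features that fall into the same BEV cell.

    feats: (N, C) features of lifted camera points;
    coords: (N, 2) integer (row, col) BEV cell indices per point;
    grid_shape: (H, W) of the BEV grid.
    Returns an (H, W, C) BEV feature map.
    """
    H, W = grid_shape
    C = feats.shape[1]
    bev = np.zeros((H, W, C), dtype=feats.dtype)
    # Drop points that land outside the BEV grid.
    valid = (coords[:, 0] >= 0) & (coords[:, 0] < H) & \
            (coords[:, 1] >= 0) & (coords[:, 1] < W)
    # Unbuffered scatter-add: duplicate cell indices accumulate correctly.
    np.add.at(bev, (coords[valid, 0], coords[valid, 1]), feats[valid])
    return bev
```

Because cell assignments depend only on calibration and the fixed BEV grid, the sort order and per-cell intervals can be precomputed once, which is exactly the trick the optimized kernel exploits.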
Efficient and Robust 2D-to-BEV Representation
Learning via Geometry-guided Kernel Transformer
• This work presents the Geometry-guided Kernel Transformer (GKT), a 2D-to-BEV
representation learning mechanism.
• GKT leverages the geometric priors to guide the transformer to focus on
discriminative regions, and unfolds kernel features to generate BEV
representation.
• For fast inference, a look-up table (LUT) indexing method is further introduced to
get rid of the camera’s calibrated parameters at runtime.
• GKT can run at 72.3 FPS on an RTX 3090 GPU / 45.6 FPS on a 2080 Ti GPU and is
robust to camera deviation and the predefined BEV height.
• GKT also achieves state-of-the-art real-time segmentation results, i.e., 38.0
mIoU (100m×100m perception range at 0.5m resolution) on the nuScenes val
set.
• Code and models will be available at https://github.com/hustvl/GKT.
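The LUT idea: project every BEV cell centre once offline, store the resulting pixel index, and at runtime gather features by pure table lookup with no calibration math. A simplified single-camera sketch (the kernel unfolding around each hit pixel is omitted; names are illustrative):

```python
import numpy as np

def build_bev_to_pixel_lut(bev_centers, K, T_cam_from_ego, img_hw):
    """Offline step: per BEV cell, the flat pixel index its centre hits.

    bev_centers: (N, 3) BEV cell centres in ego coordinates;
    K: 3x3 intrinsics; T_cam_from_ego: 4x4 extrinsics; img_hw: (H, W).
    Returns an (N,) int LUT; -1 marks cells projecting off-image.
    """
    H, W = img_hw
    pts = (T_cam_from_ego[:3, :3] @ bev_centers.T
           + T_cam_from_ego[:3, 3:4]).T
    z = np.maximum(pts[:, 2], 1e-6)
    uv = (K @ pts.T).T
    u = (uv[:, 0] / z).astype(int)
    v = (uv[:, 1] / z).astype(int)
    valid = (pts[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return np.where(valid, v * W + u, -1)

def sample_with_lut(feat_flat, lut):
    """Runtime step: gather by index only; no projection, no calibration.

    feat_flat: (H*W, C) flattened image features; lut: output of the
    builder above. Off-image cells receive zero features.
    """
    out = np.zeros((lut.shape[0], feat_flat.shape[1]), feat_flat.dtype)
    ok = lut >= 0
    out[ok] = feat_flat[lut[ok]]
    return out
```

In GKT the neighbourhood (kernel) around each hit pixel is unfolded and attended to, rather than a single pixel being copied, but the runtime cost model is the same: indexing instead of projection.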
(a) Geometry-based pointwise transformation leverages the camera’s calibrated parameters (intrinsic and extrinsic) to
determine the correspondence (one-to-one or one-to-many) between 2D positions and BEV grids. (b) Geometry-free
global transformation considers the full correlation between the image and BEV: each BEV grid cell interacts with all image pixels.
BEVFusion: A Simple and Robust LiDAR-
Camera Fusion Framework
• Fusing the camera and LiDAR information has become a de-facto standard for 3D
object detection tasks.
• Current methods rely on point clouds from the LiDAR sensor as queries to
leverage features from the image space.
• However, this underlying assumption makes current fusion frameworks unable to
produce any prediction when there is a LiDAR malfunction, whether minor or major.
• BEVFusion is a simple fusion framework whose camera stream does not depend
on the LiDAR input, thus addressing this downside of previous methods.
• Under the robustness training settings that simulate various LiDAR malfunctions,
this framework surpasses the state-of-the-art methods by 15.7% to 28.9% mAP.
• The code is available at https://github.com/ADLab-AutoDrive/BEVFusion.
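The robustness argument reduces to the camera BEV stream being a complete pipeline on its own, so the fusion step can degrade gracefully when LiDAR drops out. A toy illustration of that control flow (the real network uses a learned fusion module, not this fixed average):

```python
import numpy as np

def fuse_bev(cam_bev, lidar_bev=None):
    """Fuse camera and LiDAR BEV maps; survive LiDAR malfunction.

    Because the camera branch never consumes LiDAR points as queries,
    a missing lidar_bev simply means camera-only output, instead of no
    output at all as in point-level fusion. The fixed 0.5/0.5 average
    stands in for the paper's learned fusion module.
    """
    if lidar_bev is None:
        return cam_bev            # LiDAR failed: camera-only prediction
    return 0.5 * (cam_bev + lidar_bev)
```

Point-level fusion inverts this dependency (image features are gathered *at* LiDAR points), which is why it cannot produce the `lidar_bev is None` branch at all.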
BEVFormer: Learning Bird’s-Eye-View Representation from
Multi-Camera Images via Spatiotemporal Transformers
• This work presents a framework termed BEVFormer, which learns unified BEV
representations with spatiotemporal transformers to support multiple
autonomous driving perception tasks.
• In a nutshell, BEVFormer exploits both spatial and temporal information by
interacting with spatial and temporal space through pre-defined grid-shaped BEV
queries.
• To aggregate spatial information, a spatial cross-attention is designed so that each
BEV query extracts spatial features from the regions of interest (RoI) across camera views.
• For temporal information, a temporal self-attention is proposed to recurrently fuse
history BEV information.
• This approach achieves 56.9% in terms of the NDS metric on the nuScenes test set,
which is 9.0 points higher than the previous best methods and on par with the
performance of LiDAR-based baselines.
• The code will be released at https://github.com/zhiqi-li/BEVFormer.
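Spatial cross-attention can be caricatured as each grid-shaped BEV query gathering the image feature its 3D reference point projects to, gated by query-feature similarity. The sketch below uses a single precomputed sampling point per query and a sigmoid gate, instead of the paper's deformable multi-point, multi-camera attention; all names are illustrative.

```python
import numpy as np

def spatial_cross_attention(bev_queries, cam_feats, hit_index):
    """One heavily simplified step of BEV-query cross-attention.

    bev_queries: (Nq, C) grid-shaped BEV queries;
    cam_feats: (Np, C) flattened multi-camera image features;
    hit_index: (Nq,) index of the image feature each query's 3D
    reference point projects to (projection done beforehand).
    Returns updated queries of shape (Nq, C).
    """
    sampled = cam_feats[hit_index]                    # (Nq, C) gather
    # Scaled dot-product similarity between query and sampled feature,
    # squashed to (0, 1) as a per-query attention gate.
    scale = 1.0 / np.sqrt(bev_queries.shape[1])
    logits = scale * np.sum(bev_queries * sampled, axis=1)
    w = 1.0 / (1.0 + np.exp(-logits))
    # Residual update, as in a transformer layer.
    return bev_queries + w[:, None] * sampled
```

Temporal self-attention follows the same query-update pattern, but the gathered features come from the ego-motion-aligned BEV map of the previous timestamp instead of the camera views.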
Thanks

More Related Content

What's hot

What's hot (20)

Immersal を活用した AR クラウドなシステム開発とハンズオン!
Immersal を活用した AR クラウドなシステム開発とハンズオン!Immersal を活用した AR クラウドなシステム開発とハンズオン!
Immersal を活用した AR クラウドなシステム開発とハンズオン!
 
【CVPR 2020 メタサーベイ】Computational Photography
【CVPR 2020 メタサーベイ】Computational Photography【CVPR 2020 メタサーベイ】Computational Photography
【CVPR 2020 メタサーベイ】Computational Photography
 
SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~
SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~
SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~
 
Visual slam
Visual slamVisual slam
Visual slam
 
3D Perception for Autonomous Driving - Datasets and Algorithms -
3D Perception for Autonomous Driving - Datasets and Algorithms -3D Perception for Autonomous Driving - Datasets and Algorithms -
3D Perception for Autonomous Driving - Datasets and Algorithms -
 
SLAMチュートリアル大会資料(ORB-SLAM)
SLAMチュートリアル大会資料(ORB-SLAM)SLAMチュートリアル大会資料(ORB-SLAM)
SLAMチュートリアル大会資料(ORB-SLAM)
 
SfM Learner系単眼深度推定手法について
SfM Learner系単眼深度推定手法についてSfM Learner系単眼深度推定手法について
SfM Learner系単眼深度推定手法について
 
object detection with lidar-camera fusion: survey
object detection with lidar-camera fusion: surveyobject detection with lidar-camera fusion: survey
object detection with lidar-camera fusion: survey
 
オープンソース SLAM の分類
オープンソース SLAM の分類オープンソース SLAM の分類
オープンソース SLAM の分類
 
20190307 visualslam summary
20190307 visualslam summary20190307 visualslam summary
20190307 visualslam summary
 
SSII2022 [TS2] 自律移動ロボットのためのロボットビジョン〜 オープンソースの自動運転ソフトAutowareを解説 〜
SSII2022 [TS2] 自律移動ロボットのためのロボットビジョン〜 オープンソースの自動運転ソフトAutowareを解説 〜SSII2022 [TS2] 自律移動ロボットのためのロボットビジョン〜 オープンソースの自動運転ソフトAutowareを解説 〜
SSII2022 [TS2] 自律移動ロボットのためのロボットビジョン〜 オープンソースの自動運転ソフトAutowareを解説 〜
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
SLAM勉強会(3) LSD-SLAM
SLAM勉強会(3) LSD-SLAMSLAM勉強会(3) LSD-SLAM
SLAM勉強会(3) LSD-SLAM
 
3次元計測とフィルタリング
3次元計測とフィルタリング3次元計測とフィルタリング
3次元計測とフィルタリング
 
SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~
SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~
SSII2019TS: 実践カメラキャリブレーション ~カメラを用いた実世界計測の基礎と応用~
 
LiDAR点群と画像とのマッピング
LiDAR点群と画像とのマッピングLiDAR点群と画像とのマッピング
LiDAR点群と画像とのマッピング
 
SLAM入門 第2章 SLAMの基礎
SLAM入門 第2章 SLAMの基礎SLAM入門 第2章 SLAMの基礎
SLAM入門 第2章 SLAMの基礎
 
[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System
[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System
[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System
 
20190825 vins mono
20190825 vins mono20190825 vins mono
20190825 vins mono
 
論文読み会2018 (CodeSLAM)
論文読み会2018 (CodeSLAM)論文読み会2018 (CodeSLAM)
論文読み会2018 (CodeSLAM)
 

Similar to BEV Joint Detection and Segmentation

A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
Best Jobs
 
Concept of stereo vision based virtual touch
Concept of stereo vision based virtual touchConcept of stereo vision based virtual touch
Concept of stereo vision based virtual touch
Vivek Chamorshikar
 

Similar to BEV Joint Detection and Segmentation (20)

BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 
Fisheye-Omnidirectional View in Autonomous Driving III
Fisheye-Omnidirectional View in Autonomous Driving IIIFisheye-Omnidirectional View in Autonomous Driving III
Fisheye-Omnidirectional View in Autonomous Driving III
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
A Segmentation Based Sequential Pattern Matching for Efficient Video Copy Det...
 
Concept of stereo vision based virtual touch
Concept of stereo vision based virtual touchConcept of stereo vision based virtual touch
Concept of stereo vision based virtual touch
 
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Seamless view synthesis through te...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Seamless view synthesis through te...IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Seamless view synthesis through te...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Seamless view synthesis through te...
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Models used in iOS programming, with a focus on MVVM
Models used in iOS programming, with a focus on MVVMModels used in iOS programming, with a focus on MVVM
Models used in iOS programming, with a focus on MVVM
 
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Robust face recognition from multi...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Robust face recognition from multi...IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Robust face recognition from multi...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Robust face recognition from multi...
 
An Assessment of Image Matching Algorithms in Depth Estimation
An Assessment of Image Matching Algorithms in Depth EstimationAn Assessment of Image Matching Algorithms in Depth Estimation
An Assessment of Image Matching Algorithms in Depth Estimation
 
In tech vision-based_obstacle_detection_module_for_a_wheeled_mobile_robot
In tech vision-based_obstacle_detection_module_for_a_wheeled_mobile_robotIn tech vision-based_obstacle_detection_module_for_a_wheeled_mobile_robot
In tech vision-based_obstacle_detection_module_for_a_wheeled_mobile_robot
 
Driver drowsiness monitoring system using visual behaviour and machine learning
Driver drowsiness monitoring system using visual behaviour and machine learningDriver drowsiness monitoring system using visual behaviour and machine learning
Driver drowsiness monitoring system using visual behaviour and machine learning
 
Deep vo and slam ii
Deep vo and slam iiDeep vo and slam ii
Deep vo and slam ii
 
Deep vo and slam iii
Deep vo and slam iiiDeep vo and slam iii
Deep vo and slam iii
 
Google | Infinite Nature Zero Whitepaper
Google | Infinite Nature Zero WhitepaperGoogle | Infinite Nature Zero Whitepaper
Google | Infinite Nature Zero Whitepaper
 
Dynamic Error Concealment Algorithm for Multiview Coding Using Lost MBs Size...
Dynamic Error Concealment Algorithm for Multiview Coding  Using Lost MBs Size...Dynamic Error Concealment Algorithm for Multiview Coding  Using Lost MBs Size...
Dynamic Error Concealment Algorithm for Multiview Coding Using Lost MBs Size...
 

More from Yu Huang

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rain
 
Autonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksAutonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucks
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV3-d interpretation from single 2-d image IV
3-d interpretation from single 2-d image IV
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
BEV Semantic Segmentation
BEV Semantic SegmentationBEV Semantic Segmentation
BEV Semantic Segmentation
 

Recently uploaded

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
pritamlangde
 

Recently uploaded (20)

8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptx
 
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
Convergence of Robotics and Gen AI offers excellent opportunities for Entrepr...
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Signal Processing and Linear System Analysis
Signal Processing and Linear System AnalysisSignal Processing and Linear System Analysis
Signal Processing and Linear System Analysis
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Computer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesComputer Graphics Introduction To Curves
Computer Graphics Introduction To Curves
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Post office management system project ..pdf
Post office management system project ..pdfPost office management system project ..pdf
Post office management system project ..pdf
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 

BEV Joint Detection and Segmentation

  • 1. Joint Detection & Segmentation in BEV Representation Yu Huanng Sunnyvale, California Yu.huang07@gmail.com
  • 2. Outline • M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation • BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision- Centric Autonomous Driving • Learning Ego 3D Representation as Ray Tracing • BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation • Efficient and Robust 2D-to-BEV Representation Learning via Geometry- guided Kernel Transformer • BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework • BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
  • 3. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation • M2BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird’s Eye View (BEV) space with multi-camera image inputs. • Unlike the majority of previous works which separately process detection and segmentation, M2BEV infers both tasks with a unified model and improves efficiency. • M2BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in ego-car coordinates. • Such BEV representation is important as it enables different tasks to share a single encoder. • This framework further contains four important designs that benefit both accuracy and efficiency: (1) An efficient BEV encoder design that reduces the spatial dimension of a voxel feature map. (2) A dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes with anchors. (3) A BEV centerness re-weighting that reinforces with larger weights for more distant predictions, and (4) Large-scale 2D detection pre-training and auxiliary supervision. • M2BEV is memory efficient, allowing significantly higher resolution images as input, with faster inference speed.
  • 4. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
  • 5. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
  • 6. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
  • 7. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
  • 8. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
  • 9. BEVerse: Unified Perception and Prediction in Birds-Eye- View for Vision-Centric Autonomous Driving • BEVerse, a unified framework for 3D perception and prediction based on multi- camera systems. • Unlike existing studies focusing on the improvement of single-task approaches, BEVerse features in producing spatio-temporal Birds-Eye-View (BEV) representations from multi-camera videos and jointly reasoning about multiple tasks for vision-centric autonomous driving. • Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images. • After the ego- motion alignment, the spatio-temporal encoder is utilized for further feature extraction in BEV. • Finally, multiple task decoders are attached for joint reasoning and prediction. • Within the decoders, propose the grid sampler to generate BEV features with different ranges and granularities for different tasks. • Also, design the method of iterative flow for memory-efficient future prediction.
• 18. Learning Ego 3D Representation as Ray Tracing
• A self-driving perception model aims to extract 3D semantic representations from multiple cameras collectively into the bird’s-eye-view (BEV) coordinate frame of the ego car in order to ground the downstream planner.
• Existing perception methods often rely on error-prone depth estimation of the whole scene, or learn sparse virtual 3D representations without the target geometry structure; both remain limited in performance and/or capability.
• This paper presents an end-to-end architecture for ego 3D representation learning from an arbitrary number of unconstrained camera views.
• Inspired by the ray tracing principle, it designs a polarized grid of “imaginary eyes” as the learnable ego 3D representation and formulates the learning process with an adaptive attention mechanism in conjunction with 3D-to-2D projection.
• Critically, this formulation allows extracting rich 3D representations from 2D images without any depth supervision, with a built-in geometry structure consistent w.r.t. BEV.
• 19. Learning Ego 3D Representation as Ray Tracing (a) The first strategy, represented by LSS and CaDDN, is based on dense pixel-level depth estimation. (b) The second strategy, represented by PON, bypasses depth estimation by learning an implicit 2D-3D projection. (c) This strategy backtracks 2D information from “imaginary eyes” specially designed in the BEV’s geometry.
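The polarized grid and its 3D-to-2D backtracking can be sketched as follows: place "imaginary eye" positions on polar rays around the ego car, then project each into a camera view with pinhole geometry. Grid sizes, the intrinsics, and the ego-to-camera rotation are all illustrative assumptions:

```python
import numpy as np

def polarized_grid(n_rays=6, n_radii=4, max_radius=40.0):
    """Polar BEV grid of 'imaginary eye' positions (x, y, z=0).

    Sketch of an Ego3RT-style polarized grid: rays fan out from the ego
    car, with eyes at increasing radii along each ray.
    """
    angles = np.linspace(0, 2 * np.pi, n_rays, endpoint=False)
    radii = np.linspace(max_radius / n_radii, max_radius, n_radii)
    a, r = np.meshgrid(angles, radii, indexing="ij")
    xyz = np.stack([r * np.cos(a), r * np.sin(a), np.zeros_like(r)], -1)
    return xyz.reshape(-1, 3)             # (n_rays * n_radii, 3)

# ego (x fwd, y left, z up) -> camera (x right, y down, z fwd)
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def project_to_image(xyz, K, R):
    """Backtrack each eye into a view via 3D-to-2D pinhole projection."""
    cam = xyz @ R.T
    valid = cam[:, 2] > 0                 # keep eyes in front of the camera
    uvw = cam[valid] @ K.T
    return uvw[:, :2] / uvw[:, 2:3], valid

eyes = polarized_grid()
uv, valid = project_to_image(eyes, K, R)
print(eyes.shape)  # (24, 3)
```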
• 23. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation
• Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features.
• However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation).
• This paper breaks this deeply rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework.
• It unifies multi-modal features in a shared bird’s-eye view (BEV) representation space, which preserves both geometric and semantic information.
• To achieve this, it diagnoses and lifts key efficiency bottlenecks in the view transformation with an optimized BEV pooling, reducing latency by more than 40×.
• BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes.
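The data flow of the BEV pooling step can be sketched as a scatter-reduce: every camera-frustum point carries a feature and a precomputed BEV cell, and points landing in the same cell are reduced. The real kernel caches these indices and fuses the reduction on GPU; this numpy version, with illustrative shapes, only shows the logic:

```python
import numpy as np

def bev_pool(feats, coords, H=4, W=4):
    """Scatter camera-frustum features into a BEV grid by summation.

    Sketch of the BEV pooling that BEVFusion optimises: feats is
    (N, C) frustum-point features, coords is (N, 2) precomputed
    (row, col) BEV cells. Same-cell points are summed.
    """
    N, C = feats.shape
    bev = np.zeros((H, W, C), feats.dtype)
    flat = coords[:, 0] * W + coords[:, 1]           # linearised cell index
    np.add.at(bev.reshape(H * W, C), flat, feats)    # scatter-sum per cell
    return bev

feats = np.ones((5, 2))                 # 5 frustum points, 2-dim features
coords = np.array([[0, 0], [0, 0], [1, 2], [3, 3], [3, 3]])
bev = bev_pool(feats, coords)
print(bev[0, 0])  # [2. 2.] - two points landed in cell (0, 0)
```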
• 28. Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer
• This work presents the Geometry-guided Kernel Transformer (GKT), a 2D-to-BEV representation learning mechanism.
• GKT leverages geometric priors to guide the transformer to focus on discriminative regions, and unfolds kernel features to generate the BEV representation.
• For fast inference, it further introduces a look-up table (LUT) indexing method that gets rid of the camera’s calibrated parameters at runtime.
• GKT runs at 72.3 FPS on a 3090 GPU / 45.6 FPS on a 2080Ti GPU and is robust to camera deviation and the predefined BEV height.
• GKT achieves state-of-the-art real-time segmentation results, i.e., 38.0 mIoU (100m×100m perception range at 0.5m resolution) on the nuScenes val set.
• Code and models will be available at https://github.com/hustvl/GKT.
• 29. Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer (a) Geometry-based pointwise transformation leverages the camera’s calibrated parameters (intrinsics and extrinsics) to determine the correspondence (one-to-one or one-to-many) between 2D positions and BEV grids. (b) Geometry-free global transformation considers the full correlation between the image and BEV: each BEV grid interacts with all image pixels.
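The LUT idea above can be sketched as follows: with fixed calibration, the image positions each BEV grid attends to never change, so they can be computed once offline and simply indexed at runtime. Pinhole geometry and all shapes here are simplifying assumptions:

```python
import numpy as np

def build_lut(bev_centers, K, img_h, img_w, kernel=3):
    """Precompute, per BEV cell, the pixel indices of a KxK kernel region.

    Sketch of GKT's look-up-table indexing: each BEV cell centre (given
    here in the camera frame) is projected once offline; at runtime the
    transformer gathers kernel features by table lookup, without touching
    the camera parameters.
    """
    half = kernel // 2
    lut = []
    for x, y, z in bev_centers:
        p = K @ np.array([x, y, z])
        u, v = int(p[0] / p[2]), int(p[1] / p[2])
        # clamp the kernel window inside the image
        us = np.clip(np.arange(u - half, u + half + 1), 0, img_w - 1)
        vs = np.clip(np.arange(v - half, v + half + 1), 0, img_h - 1)
        lut.append([(vi, ui) for vi in vs for ui in us])
    return np.array(lut)                 # (n_cells, kernel*kernel, 2)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
centers = [(0.0, 0.0, 10.0)]            # one BEV cell centre, camera frame
lut = build_lut(centers, K, img_h=480, img_w=640)
print(lut.shape)  # (1, 9, 2)
```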
• 32. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
• Fusing camera and LiDAR information has become a de-facto standard for 3D object detection tasks.
• Current methods rely on point clouds from the LiDAR sensor as queries to leverage features from the image space.
• However, this underlying assumption makes current fusion frameworks unable to produce any prediction when there is a LiDAR malfunction, whether minor or major.
• BEVFusion is a simple fusion framework whose camera stream does not depend on the LiDAR input, thus addressing this downside of previous methods.
• Under robustness training settings that simulate various LiDAR malfunctions, this framework surpasses the state-of-the-art methods by 15.7% to 28.9% mAP.
• The code is available at https://github.com/ADLab-AutoDrive/BEVFusion.
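The architectural point above, a camera stream that stands on its own, can be sketched as a fusion step that tolerates LiDAR dropout. The paper uses a learned fusion module; the plain averaging below is a stand-in assumption:

```python
import numpy as np

def fuse_bev(cam_bev, lidar_bev=None):
    """Fuse camera and LiDAR BEV maps, tolerating LiDAR malfunction.

    Sketch of this BEVFusion's design: the camera stream produces a
    complete BEV map on its own, so when the LiDAR fails (lidar_bev is
    None) the detection head still receives a valid input. Averaging
    stands in for the learned fusion module.
    """
    if lidar_bev is None:
        return cam_bev                   # camera-only fallback
    return 0.5 * (cam_bev + lidar_bev)   # stand-in for learned fusion

cam = np.full((2, 2), 2.0)
lidar = np.zeros((2, 2))
print(fuse_bev(cam, lidar)[0, 0])  # 1.0
print(fuse_bev(cam, None)[0, 0])   # 2.0
```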
• 36. BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
• This work presents BEVFormer, a framework that learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks.
• In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through pre-defined grid-shaped BEV queries.
• To aggregate spatial information, it designs a spatial cross-attention in which each BEV query extracts spatial features from regions of interest across camera views.
• For temporal information, it proposes a temporal self-attention that recurrently fuses history BEV information.
• This approach achieves 56.9% NDS on the nuScenes test set, 9.0 points higher than previous best arts and on par with LiDAR-based baselines.
• The code will be released at https://github.com/zhiqi-li/BEVFormer.
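The spatial cross-attention can be sketched for a single BEV query: lift its BEV cell to several heights, project each 3D reference point into a camera view, and aggregate the features at the hit pixels. BEVFormer actually uses deformable attention with learned offsets and weights over multiple views; the fixed points, plain mean, and all shapes below are simplifications:

```python
import numpy as np

def spatial_cross_attention(bev_xy, z_anchors, K, R, feat):
    """One BEV query gathers image features at its projected pillar points.

    Sketch of BEVFormer-style spatial cross-attention: the query's BEV
    cell (x, y) is lifted to heights z_anchors, each 3D point is
    projected into the view, and features at valid hits are averaged.
    """
    x, y = bev_xy
    hits = []
    for z in z_anchors:
        cam = R @ np.array([x, y, z])    # ego -> camera frame
        if cam[2] <= 0:
            continue                     # behind the camera: no hit
        uvw = K @ cam
        u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
        if 0 <= v < feat.shape[0] and 0 <= u < feat.shape[1]:
            hits.append(feat[v, u])
    return np.mean(hits, axis=0) if hits else np.zeros(feat.shape[-1])

# ego (x fwd, y left, z up) -> camera (x right, y down, z fwd)
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
feat = np.ones((48, 64, 8))             # one camera's feature map, HxWxC
q = spatial_cross_attention((10.0, 0.0), [-1.0, 0.0, 1.0], K, R, feat)
print(q.shape)  # (8,)
```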