3. Autonomous Driving is one of the most challenging AI applications in the world, defined from L2 to L5,
with Operational Design Domains like Highway Pilot, Urban Pilot, Traffic Jam Pilot, Robotaxi/bus/truck, etc.
4. A solution could be modular, i.e. a pipeline of perception, mapping & localization, prediction, planning
and control; or end-to-end (E2E); or partially E2E;
5. • There are roughly two research & development routes: progressive, step by step (L2 -> L4), or by leaps
and bounds (aiming directly at L4), with the latter additionally acting like dimension reduction (L4 -> L2+);
• Challenging problems in AV: a long-tailed distribution with corner cases, safety-critical scenarios, and mass
production requirements (a closed loop).
6. BEV Network
The Bird’s-Eye-View (BEV) is a natural view to serve as a unified representation for 3-D environment
understanding in the perception module of autonomous driving;
7. BEV contains rich semantic info, precise localization, and absolute scales, which can be directly consumed
by many downstream real-world applications such as behavior prediction, motion planning, etc.
BEVerse for 3D detection/map segmentation/motion prediction
8. BEV provides a physics-interpretable way to fuse information from different views, modalities, time
series, and agents:
• Spatial and temporal fusion: BEVFormer for multi-camera spatial-temporal fusion;
• Sensor fusion: the multi-task fusion framework in BEVFusion;
• V2X collaboration: UniBEV.
11. View transformation plays a vital role in camera-only 3D perception, from Perspective View (PV) to BEV.
Copied from the survey paper “Delving into the Devils of Bird’s-eye-view Perception”
12. Current BEV approaches can be divided into two main categories based on view transformation:
geometry-based and network-based;
13. In geometry-based methods, earlier work tries homography based on the flat-ground constraint.
Sim2Real for BEV Segmentation
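The flat-ground homography can be made concrete: for a planar road (z = 0), pixels and ground points are related by a 3x3 homography built from the camera intrinsics and extrinsics. A minimal NumPy sketch, with hypothetical calibration values and not tied to any specific IPM implementation:

```python
import numpy as np

def ipm_homography(K, R, t):
    """Homography mapping ground-plane (z = 0) world points to pixels.
    A ground point (X, Y, 0) projects as pixel ~ K @ [r1 | r2 | t] @ [X, Y, 1]^T."""
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def image_to_ground(H, uv):
    """Inverse perspective mapping: back-project pixels (N, 2) to (X, Y) on the road plane."""
    pts = np.column_stack((uv, np.ones(len(uv))))   # homogeneous pixel coordinates
    g = (np.linalg.inv(H) @ pts.T).T                # apply the inverse homography
    return g[:, :2] / g[:, 2:3]                     # dehomogenize to metric (X, Y)

# hypothetical calibration: focal length 500 px, principal point (320, 240)
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
H = ipm_homography(K, np.eye(3), np.array([0., 0., 5.]))
```

Projecting BEV grid corners through H (or pixels through its inverse) is exactly the image/feature projection the slide refers to.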
14. The state-of-the-art solution among geometry-based approaches is lifting 2D features to 3D space by explicit or
implicit depth estimation, i.e. depth-based (point-based or voxel-based).
Lift, Splat, Shoot (LSS)
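The lift-then-splat idea can be sketched in a few lines: each pixel predicts a categorical distribution over depth bins, its outer product with the pixel feature forms a frustum of 3D features, and those are sum-pooled into BEV cells. The toy pinhole geometry and all shapes below are illustrative assumptions, not the actual LSS implementation:

```python
import numpy as np

def lift_splat(feats, depth_logits, depth_bins, bev_shape, cell):
    """Minimal single-camera Lift-Splat sketch (hypothetical shapes/geometry).
    feats: (H, W, C) image features; depth_logits: (H, W, D) per-pixel depth scores;
    depth_bins: (D,) metric depths; returns a BEV grid of shape (X, Y, C)."""
    H, W, C = feats.shape
    D = len(depth_bins)
    # Lift: softmax over depth bins, outer product with features -> (H, W, D, C)
    p = np.exp(depth_logits - depth_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    frustum = p[..., None] * feats[:, :, None, :]
    # Splat: drop each (pixel, depth) point into a BEV cell by pillar-sum pooling
    bev = np.zeros((*bev_shape, C))
    for v in range(H):
        for u in range(W):
            for d in range(D):
                # toy geometry: x grows with depth, y with lateral pixel offset
                x = int(depth_bins[d] / cell)
                y = int(((u - W / 2) * depth_bins[d] * 0.1) / cell) + bev_shape[1] // 2
                if 0 <= x < bev_shape[0] and 0 <= y < bev_shape[1]:
                    bev[x, y] += frustum[v, u, d]
    return bev
```

The real method vectorizes the splat and uses calibrated camera rays instead of this toy mapping, but the lift/splat structure is the same.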
15. In network-based methods, the straightforward idea is to use an MLP in a bottom-up strategy to project the
PV features to BEV;
Fishing Net for Semantic Segmentation
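As a rough illustration of the bottom-up MLP strategy, each image column can be flattened and mapped by a learned MLP to a ray of BEV cells; the layer sizes and random weights below are purely hypothetical stand-ins for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: image features (H, W, C), polar BEV with Z depth cells per column
H, W, C, Z = 8, 16, 4, 10
W1 = rng.normal(0, 0.1, (H * C, 64))   # stand-in for learned MLP weights
W2 = rng.normal(0, 0.1, (64, Z * C))

def pv_to_bev_mlp(pv):
    """Map each vertical image column (H*C values) to a BEV ray (Z*C values)."""
    cols = pv.transpose(1, 0, 2).reshape(W, H * C)     # one vector per image column
    hidden = np.maximum(cols @ W1, 0)                  # ReLU hidden layer
    rays = hidden @ W2                                 # (W, Z*C)
    return rays.reshape(W, Z, C).transpose(1, 0, 2)    # polar BEV grid (Z, W, C)
```

The column-to-ray pairing exploits the fact that a vertical scanline in PV roughly corresponds to one viewing ray in BEV.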
16. Another framework in network-based BEV employs a top-down strategy by directly constructing BEV
queries and searching for corresponding features on PV images via the cross-attention mechanism, i.e.
a transformer (with either sparse queries or dense queries).
Ego3RT: Ego 3D Representation
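The top-down query mechanism reduces, in its simplest single-head form, to cross attention between learnable BEV queries and flattened PV features. A minimal sketch, ignoring the deformable sampling, positional encodings and multi-head structure used in practice (weights here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                                            # feature dim (illustrative)
Wq, Wk, Wv = (rng.normal(0, 0.3, (C, C)) for _ in range(3))

def bev_cross_attention(queries, pv_feats):
    """Single-head cross attention: BEV queries (N, C) aggregate from
    flattened PV features (M, C); Wq/Wk/Wv stand in for learned projections."""
    Q, K, V = queries @ Wq, pv_feats @ Wk, pv_feats @ Wv
    scores = Q @ K.T / np.sqrt(C)                # (N, M) query-to-pixel relevance
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)          # softmax over PV locations
    return attn @ V                              # (N, C) features per BEV query
```

Sparse variants keep only a few hundred object queries; dense variants allocate one query per BEV cell, which is why map segmentation also becomes possible.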
17. Though built on a hard flat-ground assumption, homography-based methods have good
interpretability, where IPM (inverse perspective mapping) plays a role in image
projection or feature projection for downstream perception tasks;
Depth-based methods are usually built on an explicit 3D representation, quantized
voxels or point clouds (like pseudo-LiDAR) scattered in continuous 3D space.
• Point-based methods suffer from model complexity and lower performance;
• Voxel-based methods are popular due to computational efficiency and flexibility.
MLP-based view transform is hard due to the lack of depth info, occlusion, etc.;
A transformer, with either sparse (detection) or dense (map segmentation as well) queries,
gains impressive performance through strong relation modeling and its data-dependent
property, but efficiency is still a problem.
19. To apply BEV for autonomous driving, a data closed loop is required to build:
• Data selection is performed at both the vehicle and server side: data is selected from the vehicles based
on rough rules, like shadow modes, abnormal driving operations or specific scenario detection, and then the
collected data at the server selectively goes to annotation and training based on AI rules, such as active learning;
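One common AI rule for server-side selection is predictive uncertainty: send the frames the current model is least sure about to annotation. A minimal sketch of such an active-learning filter (the entropy criterion is an assumed example, not a specific production rule):

```python
import numpy as np

def select_for_annotation(probs, budget):
    """Pick the `budget` frames with the highest predictive entropy,
    i.e. the ones the current model is least certain about.
    probs: (N, K) per-frame class probabilities; returns frame indices."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]   # most uncertain frames first
```

Frames passing this filter would then flow to annotation and retraining, closing the loop.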
20. To apply BEV for autonomous driving, a data closed loop is required to build:
• A big model (offline, non-real-time) for BEV works only at the server, where a transformer network with dense
queries is used for the view transform;
HAOMO.AI
21. To apply BEV for autonomous driving, a data closed loop is required to build:
• A light model (real-time, online) for BEV is deployed on the vehicle, where a voxel-based view
transform with depth supervision is designed;
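Depth supervision of this kind is often implemented by projecting LiDAR points into the image and treating the nearest depth bin as a per-pixel classification target; a minimal sketch under that assumption (shapes and bin layout are illustrative):

```python
import numpy as np

def depth_supervision_loss(depth_logits, lidar_depth, depth_bins):
    """Cross-entropy between predicted per-pixel depth distributions and
    one-hot targets derived from projected LiDAR depths.
    depth_logits: (N, D); lidar_depth: (N,) metric depths; depth_bins: (D,)."""
    target = np.abs(lidar_depth[:, None] - depth_bins[None, :]).argmin(1)  # nearest bin
    z = depth_logits - depth_logits.max(1, keepdims=True)                  # stable log-softmax
    logp = z - np.log(np.exp(z).sum(1, keepdims=True))
    return -logp[np.arange(len(target)), target].mean()
```

LiDAR is needed only at training time; the deployed camera-only model keeps the learned depth head.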
22. To apply BEV for autonomous driving, a data closed loop is required to build:
• BEV data annotation is specific due to the innate 3-D structure, captured either from a 3-D sensor (LiDAR,
as in NuScenes) or from 3-D visual reconstruction of cameras;
[Figure (Tesla): auto-labeling pipeline in which images, IMU, odometry and GPS feed a big neural-net model
that outputs segmentation, depth and flow, reconstructed into the static background & ego trajectory,
moving objects & kinematics, and elevation.]
24. To apply BEV for autonomous driving, a data closed loop is required to build:
• A simulation platform is used for photo-realistic image data synthesis, digital twin (from real-to-sim), scenario
generalization and style transfer (from sim-to-real);
Google Block-NeRF; Carla Simulator (simulation with ground truth); Nvidia Omniverse
25. To apply BEV for autonomous driving, a data closed loop is required to build:
• A teacher-student training framework assists the knowledge distillation in BEV model training and deployment.
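The teacher-student idea can be sketched as a standard distillation loss: the light on-board student matches the softened outputs of the big server-side teacher while still fitting the ground-truth labels. The temperature T and weight alpha below are hypothetical hyperparameters:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Distillation sketch: KL to the teacher's temperature-softened outputs
    plus cross-entropy to ground truth, mixed by alpha."""
    ps, pt = softmax(student_logits, T), softmax(teacher_logits, T)
    kl = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(-1).mean() * T * T
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * kl + (1 - alpha) * ce
```

The T*T factor keeps the gradient scale of the softened KL term comparable to the hard-label term, a common convention in distillation.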
26. The BEV network is the new paradigm for computer vision, showing its strong potential in
autonomous driving applications;
BEV's network design depends on the computing platform, either at the server side or the client
side (the vehicle in ADS);
The data closed loop is a must for autonomous driving R&D, where BEV needs to pay more attention
to data selection and annotation;
A simulation platform can relieve the burden of BEV data annotation with state-of-the-art techniques
like photorealistic rendering, digital twin, scenario generalization and style transfer, etc.;
To optimize the deployment of BEV, knowledge distillation helps in the trade-off between
performance and computational complexity.