fusion of Camera and lidar for autonomous driving I
1. Fusion of Camera and LiDAR for
Autonomous Vehicles I
(via Deep Learning)
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• A General Pipeline for 3D Detection of Vehicles
• Combining LiDAR Space Clustering and Convolutional Neural Networks for Pedestrian
Detection
• Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object
Detection
• PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation
• RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement
• Joint 3D Proposal Generation and Object Detection from View Aggregation
• Frustum PointNets for 3D Object Detection from RGB-D Data
• Deep Continuous Fusion for Multi-Sensor 3D Object Detection
• Multi-View 3D Object Detection Network for Autonomous Driving
• End-to-end Learning of Multi-sensor 3D Tracking by Detection
3. A General Pipeline for 3D Detection of
Vehicles
• Autonomous driving requires 3D perception of vehicles and other objects in the
environment.
• Most current methods support 2D vehicle detection.
• Here is a pipeline that adopts any 2D detection network and fuses it with a 3D point cloud to
generate 3D information with minimal changes to the 2D detection network.
• To identify the 3D box, a model fitting algorithm is developed based on generalized car
models and score maps.
• A two-stage convolutional neural network (CNN) is proposed to refine the detected 3D box.
• Minimal effort is required to modify existing 2D networks to fit into the pipeline:
just one additional regression term is added at the output layer to estimate the vehicle
dimensions.
4. A General Pipeline for 3D Detection of
Vehicles
The raw image is passed to a 2D detection network which provides 2D boxes around the vehicles in
the image plane. Subsequently, a set of 3D points which fall into the 2D bounding box after projection
is selected. With this set, a model fitting algorithm detects the 3D location and 3D bounding box of
the vehicle. Another CNN, which takes the points that fit into the 3D bounding box
as input, then carries out the final 3D box regression and classification.
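The point-selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the LiDAR points have already been transformed into the camera frame and that a pinhole intrinsic matrix `K` is available. The function name `points_in_2d_box` is hypothetical.

```python
import numpy as np

def points_in_2d_box(points_cam, K, box):
    """Return the 3D points whose image projection falls inside a 2D box.

    points_cam: (N, 3) points already expressed in the camera frame (assumed).
    K: (3, 3) pinhole intrinsic matrix (assumed).
    box: (x_min, y_min, x_max, y_max) in pixels.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    in_front = points_cam[:, 2] > 0.0        # discard points behind the camera
    uvw = points_cam @ np.asarray(K, dtype=float).T  # homogeneous pixel coords
    z = np.where(in_front, uvw[:, 2], 1.0)   # avoid dividing by non-positive depth
    u = uvw[:, 0] / z
    v = uvw[:, 1] / z
    x_min, y_min, x_max, y_max = box
    inside = in_front & (u >= x_min) & (u <= x_max) & (v >= y_min) & (v <= y_max)
    return points_cam[inside]
```

The selected subset is what a model fitting step would then consume; background points inside the box are not filtered out by this sketch.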
5. Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
• This is a pedestrian detector that exploits LiDAR data, in addition to visual information.
• The hypothesis is that depth data and prior information about object size can
reduce the search space by providing candidates, and thereby speed up detection algorithms.
• A second hypothesis is that this prior definition of the location and size of the candidate bounding
box will also decrease the number of false detections.
• In the approach, LiDAR data is utilized to generate region proposals by processing the three
dimensional point cloud that it provides.
• These candidate regions are then further processed by a state-of-the-art CNN classifier that
is fine-tuned for pedestrian detection.
6. Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
The algorithm is built upon the idea of clustering the 3-D point cloud of the LiDAR. It starts
by down-sampling the raw measurements, followed by removal of the floor plane. Then, a
density-based clustering algorithm generates the candidates, which are projected onto the image
space to provide regions of interest.
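The first three stages (down-sampling, floor removal, density-based clustering) can be sketched as below. This is an illustrative approximation only. Voxel-grid down-sampling, a flat floor at a fixed height, and a small DBSCAN-style routine are assumptions chosen for brevity, and all parameter values are hypothetical. The clustering uses a full pairwise distance matrix, so it is only suitable for small point sets.

```python
import numpy as np

def voxel_downsample(points, voxel=0.2):
    """Keep one point per occupied voxel (the first one encountered)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first)]

def remove_floor(points, floor_z=0.0, margin=0.2):
    """Drop points close to an assumed horizontal floor plane at z = floor_z."""
    return points[points[:, 2] > floor_z + margin]

def density_cluster(points, eps=0.5, min_pts=3):
    """Minimal DBSCAN-style clustering; returns one label per point (-1 = noise)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point; stays noise unless a cluster claims it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # expand only through core points
        cluster += 1
    return labels
```

Each resulting cluster would then be projected into the image to obtain a candidate region for the CNN classifier; that projection depends on the sensor calibration and is left out here.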
7. Fusing Bird’s Eye View LIDAR Point Cloud and Front
View Camera Image for Deep Object Detection
• This is a method for fusing LIDAR point cloud and camera-captured images in deep
convolutional neural networks (CNN).
• The method constructs a sparse non-homogeneous pooling layer to transform
features between bird’s eye view and front view.
• The sparse point cloud is used to construct the mapping between the two views.
• The pooling layer allows efficient fusion of the multi-view features at any stage of the
network.
• This is favorable for 3D object detection using camera-LIDAR fusion for autonomous driving.
• A corresponding one-stage detector is designed and tested on the KITTI bird’s eye view
object detection dataset, which produces 3D bounding boxes from the bird’s eye view
map.
• The fusion method shows significant improvement in both the speed and accuracy of
pedestrian detection over other fusion-based object detection networks.
8. Fusing Bird’s Eye View LIDAR Point Cloud and Front
View Camera Image for Deep Object Detection
The sparse non-homogeneous pooling layer, which fuses front view image features and bird’s eye view LIDAR features.
9. Fusing Bird’s Eye View LIDAR Point Cloud and Front
View Camera Image for Deep Object Detection
The fusion-based one-stage object detection network with MS-CNN networks.
10. PointFusion: Deep Sensor Fusion for 3D
Bounding Box Estimation
• PointFusion is a generic 3D object detection method that leverages both image and 3D point
cloud information.
• Unlike existing methods that either use multi-stage pipelines or hold sensor- and dataset-specific
assumptions, PointFusion is conceptually simple and application-agnostic.
• It consists of: an off-the-shelf CNN that extracts appearance and geometry features from
input RGB image crops, a variant of PointNet that processes the raw 3D point cloud, and a
fusion sub-network that combines the two outputs to predict 3D bounding boxes.
• The image data and the raw point cloud data are independently processed by a CNN and a
PointNet architecture, respectively.
• The resulting outputs are then combined by a fusion network, which predicts multiple 3D
box hypotheses and their confidences, using the input 3D points as spatial anchors.
11. PointFusion: Deep Sensor Fusion for 3D
Bounding Box Estimation
Two feature extractors: a PointNet variant that processes raw point cloud data (A), and a CNN that extracts visual
features from an input image (B). Two fusion network formulations: a vanilla global architecture that directly regresses
the box corner locations (D), and a dense architecture that predicts the spatial offset of each of the 8 corners relative to
an input point (C). In the dense architecture, for each input point, the network predicts the spatial offset (white arrows) from a corner (red dot) to
the input point (blue), and selects the prediction with the highest score as the final prediction (E).
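The selection step of the dense architecture can be illustrated with a few lines of NumPy. This is a sketch under assumptions: the per-point offsets and scores are taken as given (they would come from the fusion network), the sign convention `corner = point + offset` is chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def dense_fusion_box(points, corner_offsets, scores):
    """Pick the 8 box corners predicted by the highest-scoring input point.

    points: (N, 3) input 3D points used as spatial anchors.
    corner_offsets: (N, 8, 3) predicted offsets; corner = point + offset is
        the sign convention assumed here.
    scores: (N,) per-point confidence.
    """
    best = int(np.argmax(scores))
    corners = points[best][None, :] + corner_offsets[best]  # (8, 3)
    return corners, best
```

Anchoring each hypothesis to an input point keeps the regression targets small and local, which is the motivation usually given for predicting offsets rather than absolute corner positions.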
12. RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
• RoarNet is an approach for 3D object detection from 2D images and 3D LiDAR point clouds.
• It is based on a two-stage object detection framework with PointNet as the backbone network.
• The first part, RoarNet 2D, estimates the 3D poses of objects from a monocular image, which
approximates where to examine further, and derives multiple candidates that are
geometrically feasible.
• This step significantly narrows down feasible 3D regions, which would otherwise require
demanding processing of 3D point clouds in a huge search space.
• The second part, RoarNet 3D, takes the candidate regions and conducts in-depth inferences
to conclude final poses in a recursive manner.
• RoarNet 3D processes 3D point clouds without any loss of data, leading to precise detection.
13. RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
The model first predicts the 2D bounding boxes and 3D poses of objects from a 2D image. For each 2D object
detection, a geometric agreement search is applied to predict the location of the object in 3D space. Centered on
each location prediction, a region proposal in the shape of a standing cylinder is set. Taking the prediction error
in bounding box and pose into account, there can be multiple region proposals for a single object.
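Sampling the points that belong to one standing-cylinder proposal can be sketched as below. The radius, height, and choice of vertical axis are assumptions for illustration, and `points_in_cylinder` is a hypothetical name.

```python
import numpy as np

def points_in_cylinder(points, center, radius, height, up_axis=2):
    """Select points inside a standing cylinder centered at `center`.

    The cylinder axis is parallel to coordinate axis `up_axis` (assumed vertical).
    """
    points = np.asarray(points, dtype=float)
    rel = points - np.asarray(center, dtype=float)
    planar = np.delete(rel, up_axis, axis=1)           # horizontal components
    in_radius = np.linalg.norm(planar, axis=1) <= radius
    in_height = np.abs(rel[:, up_axis]) <= height / 2.0
    return points[in_radius & in_height]
```

Several such cylinders, placed at slightly different candidate locations, would cover the uncertainty of the image-based location estimate.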
14. RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
• Each region proposal is responsible for detecting a single object.
• Taking the point clouds sampled from each region proposal as input, the model predicts the
location of an object relative to the center of region proposal, which recursively serves for
setting new region proposals for the next step.
• The model also predicts an objectness score, which reflects the probability of an object being
inside the region proposal.
• Only those proposals with high objectness scores are considered at the next step.
• At the final step, the model sets new region proposals at previously predicted locations.
• The model predicts all coordinates required for 3D bounding box regression including
location, rotation, and size of the objects.
• In practice, repeating this step more than once gives better detection performance.
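The recursive re-centering described in these bullets can be sketched as a short loop. The network itself is replaced by a hypothetical `predict` callable. The toy `centroid_predictor` below is only a stand-in so the example runs, and the radius, step count, and score threshold are assumed values.

```python
import numpy as np

def crop(points, center, radius):
    """Points within `radius` of `center` in the horizontal (x, y) plane."""
    d = np.linalg.norm(points[:, :2] - center[:2], axis=1)
    return points[d <= radius]

def refine_proposals(points, centers, predict, radius=2.0, steps=2, min_score=0.5):
    """Recursively re-center region proposals, dropping low-objectness ones.

    `predict` is a hypothetical stand-in for the network: it maps the cropped
    points and the proposal center to (offset, objectness).
    """
    centers = [np.asarray(c, dtype=float) for c in centers]
    for _ in range(steps):
        kept = []
        for c in centers:
            offset, score = predict(crop(points, c, radius), c)
            if score >= min_score:
                kept.append(c + offset)  # new proposal at the predicted location
        centers = kept
    return centers

def centroid_predictor(region, center):
    """Toy predictor: offset to the region centroid, score from point count."""
    if len(region) == 0:
        return np.zeros(3), 0.0
    return region.mean(axis=0) - center, min(1.0, len(region) / 5.0)
```

For example, `refine_proposals(points, [[9.0, 0.0, 0.0]], centroid_predictor)` moves a proposal that starts one meter away from a small cluster onto the cluster centroid, while proposals over empty space are discarded.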
15. RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
Architecture of RoarNet 2D
16. RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
The backbone network is a simplified version of PointNet without T-Net.
17. RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
18. Joint 3D Proposal Generation and Object
Detection from View Aggregation
• AVOD is an Aggregate View Object Detection network for autonomous driving scenarios.
• The neural network architecture uses LIDAR point clouds and RGB images to generate
features that are shared by two subnetworks: a region proposal network (RPN) and a second
stage detector network.
• The RPN uses an architecture capable of performing multimodal feature fusion on high
resolution feature maps to generate reliable 3D object proposals for multiple object classes
in road scenes.
• Using these proposals, the second stage detection network performs accurate oriented 3D
bounding box regression and category classification to predict the extents, orientation, and
classification of objects in 3D space.
• The proposed architecture produces state-of-the-art results on the KITTI 3D object detection
benchmark while running in real time with a low memory footprint.
• Code: https://github.com/kujason/avod
19. Joint 3D Proposal Generation and Object
Detection from View Aggregation
The method’s architectural diagram. The feature extractors are shown in blue, the region proposal network in
pink, and the second stage detection network in green.
20. Joint 3D Proposal Generation and Object
Detection from View Aggregation
The architecture of the proposed high-resolution feature extractor,
shown here for the image branch. Feature maps are
propagated from the encoder to the decoder section via red
arrows. Fusion is then performed at every stage of the decoder
by a learned upsampling layer, followed by concatenation, and
then mixing via a convolutional layer, resulting in a full
resolution feature map at the last layer of the decoder.
21. Joint 3D Proposal Generation and Object
Detection from View Aggregation
A visual comparison between the 8 corner box encoding, the
axis aligned box encoding, and the proposed 4 corner encoding.
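As a rough illustration of a 4 corner encoding, the sketch below represents a box by its four ground-plane corners plus two heights (bottom and top, measured from the ground), giving 10 values instead of the 24 needed for 8 full 3D corners. The axis convention (x, y horizontal, z up), the flat ground at `ground_z`, and the function name are assumptions for illustration.

```python
import numpy as np

def encode_4_corners(cx, cy, z_bottom, length, width, height, yaw, ground_z=0.0):
    """Encode an oriented box as 4 ground-plane corners plus 2 heights (10 values)."""
    # Corner offsets in the box frame, before rotation.
    half = np.array([[ length,  width],
                     [ length, -width],
                     [-length, -width],
                     [-length,  width]]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s], [s, c]])
    corners = half @ rot.T + np.array([cx, cy])  # (4, 2) corners in the ground plane
    h1 = z_bottom - ground_z                     # bottom face height above ground
    h2 = z_bottom + height - ground_z            # top face height above ground
    return np.concatenate([corners.ravel(), [h1, h2]])
```

Because the top corners are constrained to sit directly above the bottom ones, this form is more compact than regressing all 8 corners independently while still describing an oriented box.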
22. Joint 3D Proposal Generation and Object
Detection from View Aggregation
Left: 3D region proposal network output, Middle: 3D detection output, and Right: the projection of the
detection output onto image space for all three classes. The 3D LIDAR point cloud has been colorized and
interpolated for better visualization.
23. Frustum PointNets for 3D Object Detection
from RGB-D Data
• 3D object detection from RGB-D data in both indoor and outdoor scenes.
• While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns
and invariances of 3D data, this method operates directly on raw point clouds obtained by popping up RGB-D
scans.
• However, a key challenge of this approach is how to efficiently localize objects in point
clouds of large-scale scenes (region proposal).
• Instead of solely relying on 3D proposals, this method leverages both mature 2D object
detectors and advanced 3D deep learning for object localization, achieving efficiency as well
as high recall for even small objects.
• Benefiting from learning directly on raw point clouds, this method is also able to precisely
estimate 3D bounding boxes even under strong occlusion or with very sparse points.
• Evaluated on KITTI and SUN RGB-D 3D detection benchmarks.
24. Frustum PointNets for 3D Object Detection
from RGB-D Data
3D object detection pipeline. Given RGB-D data, first generate 2D object region proposals in the RGB
image using a CNN. Each 2D region is then extruded to a 3D viewing frustum, from which a point
cloud is obtained using the depth data. Finally, the frustum PointNet predicts an oriented and amodal 3D bounding box
for the object from the points in the frustum.
25. Frustum PointNets for 3D Object Detection
from RGB-D Data
Frustum PointNets for 3D object detection. A 2D CNN object detector first proposes 2D regions and
classifies their content. The 2D regions are then lifted to 3D and thus become frustum proposals. Given a point cloud
in a frustum (n × c, with n points and c channels of XYZ, intensity, etc. for each point), the object instance is
segmented by binary classification of each point. Based on the segmented object point cloud (m × c), a lightweight
regression PointNet (T-Net) tries to align the points by translation such that their centroid is close to the amodal
box center. Finally, the box estimation net estimates the amodal 3D bounding box for the object.
26. Frustum PointNets for 3D Object Detection
from RGB-D Data
Coordinate systems for point cloud. Artificial points (black dots) are shown to
illustrate (a) default camera coordinate; (b) frustum coordinate after rotating
frustums to center view; (c) mask coordinate with object points’ centroid at
origin; (d) object coordinate predicted by T-Net.
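The chain of coordinate changes (a) to (d) can be sketched with three small helpers. This is an illustration under assumptions: the camera frame is taken to have x to the right, y down, and z forward, the frustum is rotated about the y axis, and the segmentation mask and predicted center are supplied by the caller (in the method they come from the networks). All function names are hypothetical.

```python
import numpy as np

def to_frustum_coords(points, center_ray):
    """Rotate camera-frame points about the y axis so the frustum axis becomes +z.

    center_ray: 3D direction through the center of the 2D box (assumed given).
    """
    angle = np.arctan2(center_ray[0], center_ray[2])
    c, s = np.cos(angle), np.sin(angle)
    out = np.array(points, dtype=float)
    x, z = out[:, 0].copy(), out[:, 2].copy()
    out[:, 0] = c * x - s * z
    out[:, 2] = s * x + c * z
    return out

def to_mask_coords(points, mask):
    """Translate so the centroid of the segmented object points is the origin."""
    centroid = points[mask].mean(axis=0)
    return points[mask] - centroid, centroid

def to_object_coords(mask_points, predicted_center):
    """Shift by a residual center, the quantity the T-Net is described as estimating."""
    return mask_points - predicted_center
```

Normalizing the input this way means each network sees points in a canonical pose, which reduces the variation it has to learn.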
27. Frustum PointNets for 3D Object Detection
from RGB-D Data
Basic architectures and IO for PointNets. Architecture is illustrated for PointNet++ (v2)
models with set abstraction layers and feature propagation layers (for segmentation).
28. Frustum PointNets for 3D Object Detection
from RGB-D Data
True positive detection boxes are in green, while false positive boxes are in red and ground truth
boxes in blue are shown for false positive and false negative cases. Digit and letter beside each box
denote instance id and semantic class, with “v” for cars, “p” for pedestrian and “c” for cyclist.
29. Frustum PointNets for 3D Object Detection
from RGB-D Data
Network architectures for Frustum PointNets. v1 models are based on PointNet. v2 models are based on PointNet++ set
abstraction (SA) and feature propagation (FP) layers. The architecture for residual center estimation T-Net is shared for v1
and v2. The colors (blue for segmentation nets, red for T-Net and green for box estimation nets) of the network background
indicate the coordinate system of the input point cloud. Segmentation nets operate in frustum coordinate, T-Net processes
points in mask coordinate while box estimation nets take points in object coordinate. The small yellow square (or bar)
concatenated with global features is a one-hot class vector that tells the predicted category of the underlying object.
30. Deep Continuous Fusion for Multi-Sensor
3D Object Detection
• It remains an open problem to design 3D detectors that can better exploit multiple
modalities.
• A 3D object detector can exploit both LIDAR as well as cameras to perform very accurate
localization.
• It reasons in bird’s eye view (BEV) and fuses image features by learning to project them
into BEV space.
• Towards this goal, an end-to-end learnable architecture exploits continuous convolutions to
fuse image and LIDAR feature maps at different levels of resolution.
• The proposed continuous fusion layer encodes both discrete-state image features as well as
continuous geometric information.
• This enables designing a reliable and efficient end-to-end learnable 3D object detector
based on multiple sensors.
31. Deep Continuous Fusion for Multi-Sensor
3D Object Detection
Architecture of model. There are two streams, namely the camera image stream and the BEV LIDAR stream.
Continuous fusion layers are used to fuse the image features onto the BEV feature maps.
32. Deep Continuous Fusion for Multi-Sensor
3D Object Detection
Continuous fusion layer: given a target pixel on the BEV image, first extract the K nearest LIDAR points; then project the 3D
points onto the camera image plane; this helps retrieve the corresponding image features; finally feed the image
feature + continuous geometry offset into an MLP to generate the feature for the target pixel.
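The four steps of the caption can be sketched for a single BEV pixel as follows. Several simplifications are assumptions made for brevity: the target pixel is represented by a 3D location, the LiDAR points are already in the camera frame with positive depth, image features are looked up at the nearest pixel rather than interpolated, the per-neighbor MLP outputs are summed, and `mlp` is any caller-supplied callable.

```python
import numpy as np

def continuous_fusion_pixel(target_xyz, lidar_xyz, image_feat, K, mlp, k=3):
    """Compute the fused feature for one BEV pixel.

    target_xyz: (3,) 3D location assigned to the BEV pixel (assumed given).
    lidar_xyz: (N, 3) LiDAR points in the camera frame, all with z > 0 (assumed).
    image_feat: (H, W, C) image feature map.
    K: (3, 3) pinhole intrinsics at feature-map resolution (assumed).
    mlp: callable mapping a (C + 3,) vector to an output feature vector.
    """
    # 1. Find the K nearest LiDAR points to the target location.
    dist = np.linalg.norm(lidar_xyz - target_xyz, axis=1)
    nearest = np.argsort(dist)[:k]
    fused = 0.0
    for idx in nearest:
        p = lidar_xyz[idx]
        # 2. Project the 3D point onto the image plane.
        u, v, w = K @ p
        col = int(np.clip(round(u / w), 0, image_feat.shape[1] - 1))
        row = int(np.clip(round(v / w), 0, image_feat.shape[0] - 1))
        # 3. Retrieve the image feature at that location (nearest pixel).
        feat = image_feat[row, col]
        # 4. Append the geometric offset and pass the result through the MLP.
        offset = p - target_xyz
        fused = fused + mlp(np.concatenate([feat, offset]))
    return fused
```

The geometric offset tells the MLP how far each borrowed image feature is from the target location, so nearby and distant neighbors can be weighted differently.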
33. Deep Continuous Fusion for Multi-Sensor
3D Object Detection
The 2D bounding boxes are obtained by projecting the 3D detections onto the image.
The bounding boxes of an object in BEV and in the image are shown in the same color.
34. Multi-View 3D Object Detection Network
for Autonomous Driving
• Multi-View 3D networks (MV3D) is a sensory-fusion framework that takes both LIDAR point
cloud and RGB images as input and predicts oriented 3D bounding boxes.
• It encodes the sparse 3D point cloud with a compact multi-view representation.
• The network is composed of two subnetworks: one for 3D object proposal generation and
another for multi-view feature fusion.
• The proposal network generates 3D candidate boxes efficiently from the bird’s eye view
representation of 3D point cloud.
• A deep fusion scheme combines region-wise features from multiple views and enables
interactions between intermediate layers of different paths.
35. Multi-View 3D Object Detection Network
for Autonomous Driving
Input features of the MV3D network.
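As a hedged illustration of how a sparse point cloud can be turned into a compact bird’s eye view input, the sketch below rasterizes points into a maximum-height map and a point-count map. The grid extents, cell size, and the choice of these two channels are assumptions for illustration, not the exact feature set shown on the slide.

```python
import numpy as np

def bev_maps(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Rasterize (N, 3) points into max-height and point-count BEV grids."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)  # drop out-of-range points
    ix, iy, z = ix[keep], iy[keep], points[keep, 2]
    height = np.full((nx, ny), -np.inf)
    count = np.zeros((nx, ny), dtype=int)
    np.maximum.at(height, (ix, iy), z)  # per-cell maximum height
    np.add.at(count, (ix, iy), 1)       # per-cell number of points
    height[count == 0] = 0.0            # empty cells get a neutral height
    return height, count
```

The resulting grids are dense 2D arrays, so standard 2D convolutions can be applied to them directly.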
36. Multi-View 3D Object Detection Network
for Autonomous Driving
The network takes the bird’s eye view and front view of the LIDAR point cloud as well as an image as input. It first
generates 3D object proposals from the bird’s eye view map and projects them to three views. A deep fusion
network is used to combine region-wise features obtained via ROI pooling for each view. The fused features
are used to jointly predict object class and do oriented 3D box regression.
37. End-to-end Learning of Multi-sensor 3D
Tracking by Detection
• This task, commonly referred to as multi-target tracking, consists of identifying how many
objects there are in each frame and linking their trajectories over time.
• The approach performs tracking by detection and can exploit both cameras and LIDAR
data to produce very accurate 3D trajectories.
• Towards this goal, it formulates the problem as inference in a deep structured model, where
the potentials are computed using convolutional neural nets.
• The matching cost of associating two detections exploits both appearance and motion via a
Siamese network that processes images and motion representations via convolutional layers.
• Inference in the model can be done exactly and efficiently by a set of feedforward passes
followed by solving a linear program.
• Importantly, the model is formulated such that it can be trained end-to-end to solve both
the detection and tracking problems.
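To make the association part of the problem concrete, the sketch below finds the minimum-cost one-to-one matching between detections in two frames. It is not the linear program used in the work: exhaustive search is swapped in to keep the example dependency-free, it covers only the matching subproblem, and the cost matrix is assumed to be given (in the method it comes from the learned matching network).

```python
from itertools import permutations

def best_assignment(cost):
    """Exhaustively find the min-cost one-to-one matching (rows <= columns).

    cost[r][c] is the cost of linking detection r in frame t to detection c
    in frame t + 1. Only practical for a handful of detections.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    best_total, best_cols = float("inf"), None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total
```

A practical system would replace the exhaustive search with a proper solver, since the number of permutations grows factorially with the number of detections.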
38. End-to-end Learning of Multi-sensor 3D
Tracking by Detection
This work formulates tracking as a system containing multiple neural networks that are interwoven
together in a single architecture. The system takes as external input a time series of RGB frames
(camera images) and LIDAR point clouds. From these inputs, the system produces discrete trajectories of the
targets. In particular, the architecture is end-to-end trainable while still maintaining explainability, which is
achieved by formulating the system in a structured manner.
39. End-to-end Learning of Multi-sensor 3D
Tracking by Detection
Neural networks designed for both
scoring and matching: the forward passes
over a set of detections from two frames.
40. End-to-end Learning of Multi-sensor 3D
Tracking by Detection
• To extract appearance features, employ a Siamese network based on VGG16.
• Note that in a Siamese setup, the two branches (each processing a detection) share the
same set of weights.
• This makes the architecture more efficient in terms of memory and allows learning with
fewer examples.
• In particular, resize each detection to be of dimension 224 × 224.
• To produce a concise representation of activations without using fully connected layers, each
of the max-pool outputs is passed through a product layer followed by a weighted sum,
which produces a single scalar for each max-pool layer, yielding an activation vector of size 5.
• Use skip-pooling as matching should exploit both low-level features (e.g., color) as well as
semantically richer features from higher layers.
• To incorporate spatial information into the model, employ fully connected architectures that
model both 2D and 3D motion.
41. End-to-end Learning of Multi-sensor 3D
Tracking by Detection
• In particular, exploit 3D information in the form of a 180 × 200 occupancy grid in bird’s
eye view and 2D information from the occupancy region in the frontal view camera, scaled
down from the original resolution of 1242 × 375 to 124 × 37.
• In bird’s eye perspective, each 3D detection is projected onto a ground plane, leaving only
a rotated rectangle that reflects its occupancy in the world.
• Since the observer is a mobile platform (an autonomous vehicle, in this case), the coordinate
system between two subsequent frames would be shifted because the observer moved in
the time elapsed.
• Since its speed in each axis is known from the IMU data, one can calculate the displacement
of the observer between each observation and translate the coordinates accordingly; this
way, both grids are on the exact same coordinate system.
• The frontal view perspective encodes the rectangular area in the camera occupied by the
target, equivalent to projecting the 3D bounding box onto camera coordinates.
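The bird’s eye view part of this input can be sketched in two steps: shift the previous frame's coordinates by the observer displacement, then rasterize each detection's rotated rectangle into the 180 × 200 grid. The sketch assumes pure translation at constant velocity between frames (rotation is ignored), a hypothetical cell size of 0.5 m, and a grid origin at cell (0, 0).

```python
import numpy as np

def compensate_ego_motion(xy_prev, velocity_xy, dt):
    """Express previous-frame ground-plane coordinates in the current observer frame.

    Assumes the observer moved by velocity_xy * dt with no rotation.
    """
    displacement = np.asarray(velocity_xy, dtype=float) * dt
    return np.asarray(xy_prev, dtype=float) - displacement

def rasterize_box(center, length, width, yaw, grid_shape=(180, 200), cell=0.5):
    """Mark grid cells whose centers lie inside a rotated rectangle."""
    rows, cols = np.indices(grid_shape)
    # Cell centers in metric coordinates, relative to the box center.
    x = (rows + 0.5) * cell - center[0]
    y = (cols + 0.5) * cell - center[1]
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate into the box frame and compare against the half extents.
    bx = c * x + s * y
    by = -s * x + c * y
    return (np.abs(bx) <= length / 2.0) & (np.abs(by) <= width / 2.0)
```

After compensation, a static object occupies the same cells in both grids, so any remaining difference between the two rasterized rectangles reflects the object's own motion.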
43. End-to-end Learning of Multi-sensor 3D
Tracking by Detection
Trajectories are color coded, such that having the same color means it’s the same object.