The document outlines research on using LiDAR data for autonomous vehicle object detection. It begins with an introduction to sensor fusion techniques using LiDAR and camera data. Several deep learning approaches for 3D object detection from LiDAR point clouds are then summarized, including methods that project the point cloud into 2D feature maps or 3D voxel grids as input to convolutional networks. Finally, techniques for exploiting HD maps and performing real-time on-device detection are discussed. The document provides an overview of the state-of-the-art in LiDAR-based object detection for autonomous driving applications.
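As a concrete illustration of the voxel-grid input mentioned above, here is a minimal sketch (my own, not taken from any of the summarized papers) of quantizing a LiDAR point cloud into a binary occupancy grid; the grid bounds and voxel size are arbitrary example values:

```python
import numpy as np

def voxelize(points, voxel_size=0.2, grid_min=(-40.0, -40.0, -3.0), grid_shape=(400, 400, 20)):
    """Quantize an (N, 3) point cloud into a binary occupancy voxel grid."""
    grid = np.zeros(grid_shape, dtype=np.uint8)
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    # Discard points that fall outside the grid bounds.
    valid = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[valid]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

cloud = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [10.0, 5.0, 0.5]])
grid = voxelize(cloud)
print(grid.sum())  # the first two points share a voxel, so 2 voxels are occupied
```

Real detectors additionally store per-voxel features (point count, mean intensity, learned encodings) rather than a plain 0/1 occupancy flag.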
Scenario-Based Development & Testing for Autonomous Driving, by Yu Huang
Formal Scenario-Based Testing of Autonomous Vehicles: From Simulation to the Real World, 2020
A Scenario-Based Development Framework for Autonomous Driving, 2020
A Customizable Dynamic Scenario Modeling and Data Generation Platform for Autonomous Driving, 2020
Large Scale Autonomous Driving Scenarios Clustering with Self-supervised Feature Extraction, 2021
Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles, 2021
Systems Approach to Creating Test Scenarios for Automated Driving Systems, Reliability Engineering and System Safety (215), 2021
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
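To make the convolutional building block concrete, here is a minimal sketch (not course material) of the 2D cross-correlation a CNN layer computes, applied with a hand-crafted vertical-edge kernel:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A [-1, 1] kernel responds where pixel intensity changes from left to right.
img = np.zeros((4, 4)); img[:, 2:] = 1.0
edge = conv2d(img, np.array([[-1.0, 1.0]]))
print(edge)
```

In a trained network the kernel values are learned from data instead of hand-crafted, and many such filters are stacked per layer.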
Camera-Based Lane Detection by Deep Learning, by Yu Huang
lane detection, deep learning, autonomous driving, CNN, RNN, LSTM, GRU, lane localization, lane fitting, ego lane, end-to-end, vanishing point, segmentation, FCN, regression, classification
Camera-Based Road Lane Detection by Deep Learning II, by Yu Huang
lane detection, deep learning, autonomous driving, CNN, RNN, LSTM, GRU, lane localization, lane fitting, ego lane, end-to-end, vanishing point, segmentation, FCN, regression, classification
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/10/introduction-to-simultaneous-localization-and-mapping-slam-a-presentation-from-gareth-cross/
Independent game developer (and former technical lead of state estimation at Skydio) Gareth Cross presents the “Introduction to Simultaneous Localization and Mapping (SLAM)” tutorial at the May 2021 Embedded Vision Summit.
This talk provides an introduction to the fundamentals of simultaneous localization and mapping (SLAM). Cross aims to provide foundational knowledge, and viewers are not expected to have any prerequisite experience in the field.
The talk consists of an introduction to the concept of SLAM, as well as practical design considerations in formulating SLAM problems. Visual inertial odometry is introduced as a motivating example of SLAM, and Cross explains how this problem is structured and solved.
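As a toy illustration of how such problems are structured (my own sketch, not code from the talk), a 1D pose estimation with odometry and a loop-closure-like measurement can be posed as linear least squares:

```python
import numpy as np

# Toy 1D "SLAM" as linear least squares: solve for poses x0..x2 given a prior
# on x0, odometry between consecutive poses, and a direct measurement of x2.
# All measurement values below are made up for illustration.
# Rows of A encode:  x0 = 0;  x1 - x0 = 1.1;  x2 - x1 = 0.9;  x2 = 2.0
A = np.array([[ 1.0,  0.0, 0.0],
              [-1.0,  1.0, 0.0],
              [ 0.0, -1.0, 1.0],
              [ 0.0,  0.0, 1.0]])
b = np.array([0.0, 1.1, 0.9, 2.0])
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(x, 3))  # slightly inconsistent measurements reconciled by least squares
```

Real SLAM solvers handle nonlinear measurement models (rotations, camera projections) by repeatedly linearizing, but each iteration has exactly this sparse least-squares shape.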
Google Self Driving Cars
The Google Self-Driving Car is a project by Google that involves developing technology for autonomous cars. The software powering Google's cars is called Google Chauffeur. Lettering on the side of each car identifies it as a "self-driving car". The project is currently being led by Google engineer Sebastian Thrun, former director of the Stanford Artificial Intelligence Laboratory and co-inventor of Google Street View. Thrun's team at Stanford created the robotic vehicle Stanley which won the 2005 DARPA Grand Challenge and its US$2 million prize from the United States Department of Defense. The team developing the system consisted of 15 engineers working for Google, including Chris Urmson, Mike Montemerlo, and Anthony Levandowski who had worked on the DARPA Grand and Urban Challenges.
Legislation has been passed in four states and the District of Columbia allowing driverless cars. The U.S. state of Nevada passed a law on June 29, 2011, permitting the operation of autonomous cars in Nevada, after Google had been lobbying in that state for robotic car laws. The Nevada law went into effect on March 1, 2012, and the Nevada Department of Motor Vehicles issued the first license for an autonomous car in May 2012, to a Toyota Prius modified with Google's experimental driverless technology. In April 2012, Florida became the second state to allow the testing of autonomous cars on public roads, and California became the third when Governor Jerry Brown signed the bill into law at Google HQ in Mountain View. In July 2014, the city of Coeur d'Alene, Idaho adopted a robotics ordinance that includes provisions to allow for self-driving cars.
Videos
https://www.youtube.com/channel/UCCLyNDhxwpqNe3UeEmGHl8g
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos.
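Detections are usually scored against ground truth with intersection over union (IoU); a minimal sketch for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping in a 1x2 strip: intersection 2, union 6.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.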
A small helping hand from me to my engineering colleagues and other friends in need of object detection.
Self-Driving Cars With Convolutional Neural Networks (CNN), by ssuserf79e761
Self-driving cars aim to revolutionize car travel by making it safe and efficient. In this article, we outlined some of the key components such as LiDAR, RADAR, cameras, and most importantly – the algorithms that make self-driving cars possible.
A few things still need to be taken care of:
Current algorithms are not yet robust enough to perceive roads and lanes, because some roads lack markings and other signs.
Sensing modalities for localization, mapping, and perception still lack accuracy and efficiency.
Vehicle-to-vehicle communication is still a dream, but work is being done in this area as well.
The field of human-machine interaction is not explored enough, with many open, unsolved problems.
Q-learning is one of the most commonly used DRL algorithms for self-driving cars. It falls under the category of model-free learning: the agent approximates the optimal action-value function (the Q-value) directly, without building a model of the environment. The policy still determines which state-action pairs are visited and updated. The goal is to find the optimal policy by interacting with the environment, correcting the Q-value estimates whenever the agent makes an error.
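The update rule can be sketched on a toy problem (a made-up 1D corridor, not a driving environment):

```python
import random

# Tabular Q-learning on a made-up 1D corridor: states 0..3, actions -1/+1,
# reward 1 for reaching state 3 -- a toy stand-in for a driving MDP.
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(4) for a in (-1, +1)}

def step(s, a):
    s2 = min(3, max(0, s + a))
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

random.seed(0)
for _ in range(200):                                   # episodes
    s = 0
    while True:
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        if random.random() < EPS:
            a = random.choice((-1, +1))
        else:
            a = max((-1, +1), key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[(s2, -1)], Q[(s2, +1)])
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])      # the Q-learning update
        s = s2
        if done:
            break

policy = [max((-1, +1), key=lambda act: Q[(s, act)]) for s in range(3)]
print(policy)  # greedy policy; should be +1 (move right) in every state
```

In deep Q-learning the table is replaced by a neural network, but the target term `r + GAMMA * max(...)` and the error-driven update are the same.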
Computer vision has received great attention over the last two decades.
This research field is important not only for security-related software, but also for advanced interfaces between people and computers, advanced control methods, and many other areas.
RegNet: Multimodal Sensor Registration Using Deep Neural Networks
CalibNet: Self-Supervised Extrinsic Calibration using 3D Spatial Transformer Networks
RGGNet: Tolerance Aware LiDAR-Camera Online Calibration with Geometric Deep Learning and Generative Model
CalibRCNN: Calibrating Camera and LiDAR by Recurrent Convolutional Neural Network and Geometric Constraints
LCCNet: LiDAR and Camera Self-Calibration using Cost Volume Network
CFNet: LiDAR-Camera Registration Using Calibration Flow Network
3D perception is crucial for understanding the real world. It offers many benefits and new capabilities over 2D across diverse applications, from XR and autonomous driving to IoT, camera, and mobile. 3D perception with machine learning is setting the new state of the art (SOTA) in areas such as depth estimation, object detection, and neural scene representation. Making these SOTA neural networks feasible for real-world deployment on mobile devices constrained by power, thermal, and performance budgets has been a challenge. Qualcomm AI Research has developed not only novel AI techniques for 3D perception but also full-stack AI optimizations to enable real-world deployments and energy-efficient solutions. This presentation explores the latest research that is enabling efficient 3D perception while maintaining neural network model accuracy. You’ll learn about:
- The advantages of 3D perception over 2D and the need for 3D perception across applications
- Advancements in 3D perception research by Qualcomm AI Research
- Our future 3D perception research directions
The 2016 Remote Sensing Field camp will take the form of two projects.
A low-tech, low-cost aerial photography project using visible-spectrum cameras mounted on UAVs or ultralight aircraft as the sensor, to demonstrate that relatively low-tech, low-cost solutions can achieve surprisingly good results when compared to more commercial systems.
A higher-tech, higher-cost terrestrial LiDAR collect of a building or structure of historical or architectural significance.
The scope of a project will influence all other aspects of the project, including its cost, timing, quality and risk.
Traffic Light Detection and Recognition for Self Driving Cars using Deep Learning, by ijtsrd
Self-driving cars have the potential to revolutionize urban mobility by providing sustainable, safe, convenient, and congestion-free transport. Autonomous driving vehicles have become a trend in the vehicle industry. Many driver assistance systems (DAS) have been presented to support these automatic cars. This vehicle autonomy, as an application of AI, has several challenges, such as infallibly recognizing traffic lights, signs, unclear lane markings, pedestrians, etc. These problems can be overcome by using technological developments in the fields of deep learning and computer vision, thanks to the availability of graphical processing units (GPUs) and cloud platforms. Using deep learning, a deep neural network based model is proposed for reliable detection and recognition of traffic lights (TL). Aswathy Madhu | Sruthy S, "Traffic Light Detection and Recognition for Self Driving Cars using Deep Learning: Survey", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30030.pdf
Paper Url : https://www.ijtsrd.com/engineering/computer-engineering/30030/traffic-light-detection-and-recognition-for-self-driving-cars-using-deep-learning-survey/aswathy-madhu
Lane and Object Detection for Autonomous Vehicle using Advanced Computer Vision, by YogeshIJTSRD
The vision of this project is to develop lane and object detection for an autonomous vehicle system that runs efficiently under normal road conditions, eliminating the use of a high-cost LiDAR system by pairing high-resolution cameras with advanced computer vision and deep learning to provide an Advanced Driver Assistance System (ADAS). Detecting lane lines is a crucial task for any self-driving autonomous vehicle, so this project focuses on identifying lane lines on the road using OpenCV. OpenCV tools such as color selection, region-of-interest selection, grayscaling, Canny edge detection, and perspective transformation are employed. The project is modelled as an integration of two systems to solve the real-time implementation problem in autonomous vehicles. The first part is lane detection by advanced computer vision techniques, detecting the lane lines so the vehicle can be commanded to stay inside the lane markings. The second part is object detection and tracking: detecting and tracking vehicles and pedestrians on the road to build a clear understanding of the environment, so that a trajectory can be planned and generated to navigate the autonomous vehicle safely to its destination without any crashes. This is done by a deep learning method, transfer learning with the Single Shot multibox Detection (SSD) algorithm and the MobileNet architecture. G. Monika | S. Bhavani | L. Azim Jahan Siana | N. Meenakshi, "Lane and Object Detection for Autonomous Vehicle using Advanced Computer Vision", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-3, April 2021. URL: https://www.ijtsrd.com/papers/ijtsrd39952.pdf Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/39952/lane-and-object-detection-for-autonomous-vehicle-using-advanced-computer-vision/g-monika
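The grayscale, gradient, and region-of-interest steps described above can be sketched without OpenCV (a simplified stand-in for cv2.Canny and the ROI masking step; the threshold and ROI fraction are made-up example values):

```python
import numpy as np

def lane_edge_mask(gray, roi_top=0.5, thresh=0.5):
    """Toy lane-detection front end: horizontal intensity gradient (bright lane
    paint vs. dark asphalt), thresholded into a binary edge map, then masked so
    only the lower part of the frame (where the road lies) is kept."""
    grad = np.abs(np.diff(gray, axis=1))           # crude vertical-edge response
    edges = (grad > thresh).astype(np.uint8)
    edges[: int(edges.shape[0] * roi_top), :] = 0  # ignore sky / horizon region
    return edges

frame = np.zeros((8, 8)); frame[:, 4] = 1.0        # one bright "lane line" column
mask = lane_edge_mask(frame)
print(mask.sum())
```

A real pipeline would follow this with a perspective transform to a bird's-eye view and a curve fit through the surviving edge pixels.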
Object gripping algorithm for robotic assistance by means of deep learning, by IJECEIAES
This paper presents the use of recent state-of-the-art deep learning techniques, little addressed in robotic applications, in a new algorithm based on Faster R-CNN and CNN regression. Machine vision systems tend to require multiple stages to locate an object and allow a robot to take it, increasing the noise in the system and the processing times. Region-based convolutional networks solve this problem; two convolutional architectures are used, one for classification and localization of three types of objects, and one to determine the grip angle for a robotic gripper. In the established virtual environment, the grip algorithm runs at up to 5 frames per second with 100% object classification, and with the implementation of the Faster R-CNN it obtains 100% accuracy on the classifications of the test database and over 97% average precision locating the generated boxes for each element, successfully gripping the objects.
Goal location prediction based on deep learning using RGB-D camera, by journalBEEI
In a navigation system, the desired destination position plays an essential role, since path planning algorithms take the current location, the goal location, and a map of the surrounding environment as inputs. The path generated by the path planning algorithm is used to guide a user to his final destination. This paper presents a proposed algorithm based on an RGB-D camera to predict the goal coordinates in a 2D occupancy grid map for a navigation system for visually impaired people. In recent years, deep learning methods have been used in many object detection tasks, so an object detection method based on a convolutional neural network is adopted in the proposed algorithm. The distance between the current position of the sensor and the detected object is measured from the depth data acquired by the RGB-D camera. The detected object coordinates and the depth data are integrated to get an accurate goal location in the 2D map. The proposed algorithm has been tested in various real-time scenarios, and the experimental results indicate its effectiveness.
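The pixel-plus-depth to map-coordinate integration described above can be sketched with a pinhole camera model (my own simplification, not the paper's code; fx, cx, and the pose values are made-up example numbers):

```python
import math

def pixel_to_map(u, depth, fx, cx, robot_xy, robot_yaw):
    """Project a detected object's pixel column u and its depth reading into 2D
    map coordinates, using a pinhole model and the robot's current pose."""
    lateral = (u - cx) * depth / fx   # offset left/right of the optical axis
    # Object in the robot frame: forward = depth, sideways = lateral; then
    # rotate by the robot's yaw and translate by its map position.
    x = robot_xy[0] + depth * math.cos(robot_yaw) - lateral * math.sin(robot_yaw)
    y = robot_xy[1] + depth * math.sin(robot_yaw) + lateral * math.cos(robot_yaw)
    return x, y

# Object seen dead-center (u == cx) 2 m ahead of a robot at the origin facing +x.
print(pixel_to_map(u=320, depth=2.0, fx=500.0, cx=320.0, robot_xy=(0.0, 0.0), robot_yaw=0.0))
```

The resulting (x, y) would then be snapped to the nearest cell of the 2D occupancy grid to serve as the goal for the path planner.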
Similar to Lidar for Autonomous Driving II (via Deep Learning)
Application of Foundation Model for Autonomous Driving, by Yu Huang
Since DARPA’s Grand Challenges (rural) in 2004/05 and Urban Challenge in 2007, autonomous driving has been the most active field of AI applications. Recently, powered by large language models (LLMs), chat systems such as ChatGPT and PaLM have emerged and rapidly become a promising direction toward artificial general intelligence (AGI) in natural language processing (NLP). It is natural to ask whether these abilities could be employed to reformulate autonomous driving. By combining LLMs with foundation models, it is possible to utilize human knowledge, commonsense, and reasoning to rebuild autonomous driving systems out of the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied to autonomous driving, categorized as simulation, world models, data annotation, and planning or E2E solutions, etc.
Fisheye based Perception for Autonomous Driving VI, by Yu Huang
Disentangling and Vectorization: A 3D Visual Perception Approach for Autonomous Driving Based on Surround-View Fisheye Cameras
SVDistNet: Self-Supervised Near-Field Distance Estimation on Surround View Fisheye Cameras
FisheyeDistanceNet++: Self-Supervised Fisheye Distance Estimation with Self-Attention, Robust Loss Function and Camera View Generalization
An Online Learning System for Wireless Charging Alignment using Surround-view Fisheye Cameras
RoadEdgeNet: Road Edge Detection System Using Surround View Camera Images
Fisheye/Omnidirectional View in Autonomous Driving V, by Yu Huang
• Road-line detection and 3D reconstruction using fisheye cameras
• Vehicle Re-ID for Surround-view Camera System
• SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
• Universal Semantic Segmentation for Fisheye Urban Driving Images
• UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models
• OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving
• Adversarial Attacks on Multi-task Visual Perception for Autonomous Driving
Fisheye/Omnidirectional View in Autonomous Driving IV, by Yu Huang
• FisheyeMultiNet: Real-time Multi-task Learning Architecture for Surround-view Automated Parking System
• Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline
• SynWoodScape: Synthetic Surround-view Fisheye Camera Dataset for Autonomous Driving
• Feasible Self-Calibration of Larger Field-of-View (FOV) Camera Sensors for the ADAS
Autonomous driving for robotaxis, covering perception, prediction, planning, decision making, and control, as well as simulation, visualization, and the data closed loop.
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2), by Yu Huang
Canadian Adverse Driving Conditions Dataset, 2020
Deep multimodal sensor fusion in unseen adverse weather, 2020
RADIATE: A Radar Dataset for Automotive Perception in Bad Weather, 2021
Lidar Light Scattering Augmentation (LISA): Physics-based Simulation of Adverse Weather Conditions for 3D Object Detection, 2021
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather, 2021
DSOR: A Scalable Statistical Filter for Removing Falling Snow from LiDAR Point Clouds in Severe Winter Weather, 2021
How to Build a Data Closed-loop Platform for Autonomous Driving? by Yu Huang
Introduction;
Data-driven models for autonomous driving;
Cloud computing infrastructure and big data processing;
Annotation tools for training data;
Large-scale model training platform;
Model testing and verification;
Related machine learning techniques;
Conclusion.
Simulation for Autonomous Driving at Uber ATG, by Yu Huang
Testing Safety of SDVs by Simulating Perception and Prediction
LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World
Recovering and Simulating Pedestrians in the Wild
S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling
SceneGen: Learning to Generate Realistic Traffic Scenes
TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors
GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving
AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles
Appendix: (Waymo)
SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
Prediction and Planning for Self-Driving at Waymo, by Yu Huang
ChauffeurNet: Learning To Drive By Imitating The Best And Synthesizing The Worst
Multipath: Multiple Probabilistic Anchor Trajectory Hypotheses For Behavior Prediction
VectorNet: Encoding HD Maps And Agent Dynamics From Vectorized Representation
TNT: Target-driven Trajectory Prediction
Large Scale Interactive Motion Forecasting For Autonomous Driving : The Waymo Open Motion Dataset
Identifying Driver Interactions Via Conditional Behavior Prediction
Peeking Into The Future: Predicting Future Person Activities And Locations In Videos
STINet: Spatio-temporal-interactive Network For Pedestrian Detection And Trajectory Prediction
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two type of water scarcity. One is physical. The other is economic water scarcity.
Lidar for Autonomous Driving II (via Deep Learning)
1. LiDAR for Autonomous Vehicles II
(via Deep Learning)
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
Online Camera LiDAR Fusion and Object Detection on
Hybrid Data for Autonomous Driving
RegNet: Multimodal Sensor Registration Using Deep
Neural Networks
Vehicle Detection from 3D Lidar Using FCN
VoxelNet: End-to-End Learning for Point Cloud Based
3D Object Detection
Object Detection and Classification in Occupancy Grid
Maps using Deep Convolutional Networks
RT3D: Real-Time 3-D Vehicle Detection in LiDAR Point
Cloud for Autonomous Driving
BirdNet: a 3D Object Detection Framework from LiDAR
information
LMNet: Real-time Multiclass Object Detection on CPU
using 3D LiDAR
HDNET: Exploit HD Maps for 3D Object Detection
IPOD: Intensive Point-based Object Detector for Point
Cloud
PIXOR: Real-time 3D Object Detection from Point
Clouds
DepthCN: Vehicle Detection Using 3D-LIDAR and
ConvNet
SECOND: Sparsely Embedded Convolutional Detection
YOLO3D: E2E RT 3D Oriented Object Bounding Box
Detection from LiDAR Point Cloud
YOLO4D: A ST Approach for RT Multi-object Detection
and Classification from LiDAR Point Clouds
Deconvolutional Networks for Point-Cloud Vehicle
Detection and Tracking in Driving Scenarios
Fast and Furious: Real Time E2E 3D Detection,
Tracking and Motion Forecasting with a Single
Convolutional Net
…To be continued
3. Outline
SqueezeSeg: Convolutional Neural Nets with
Recurrent CRF for Real-Time Road-Object
Segmentation from 3D LiDAR Point Cloud
SEGCloud: Semantic Segmentation of 3D Point
Clouds
Multi-View 3D Object Detection Network for
Autonomous Driving
A General Pipeline for 3D Detection of Vehicles
Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
Pseudo-LiDAR from Visual Depth Estimation: Bridging
the Gap in 3D Object Detection for Autonomous
Driving
PointNet: Deep Learning on Point Sets for 3D
Classification and Segmentation
PointNet++: Deep Hierarchical Feature Learning on
Point Sets in a Metric Space
PointFusion: Deep Sensor Fusion for 3D Bounding
Box Estimation
Frustum PointNets for 3D Object Detection from RGB-
D Data
RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
Joint 3D Proposal Generation and Object Detection
from View Aggregation
SPLATNet: Sparse Lattice Networks for Point Cloud
Processing
PointRCNN: 3D Object Proposal Generation and
Detection from Point Cloud
Deep Continuous Fusion for Multi-Sensor 3D Object
Detection
End-to-end Learning of Multi-sensor 3D Tracking by
Detection
4. Online Camera LiDAR Fusion and Object Detection on
Hybrid Data for Autonomous Driving
Non-calibrated sensors result in artifacts and aberration in the environment model, which
makes tasks like free-space detection more challenging.
The goal is to improve the LiDAR and camera fusion approach of Levinson and Thrun.
Rely on intensity discontinuities and erosion and dilation of the edge image for increased
robustness against shadows and visual patterns, which is a recurring problem in point cloud
related work.
Use a gradient free optimizer instead of an exhaustive grid search to find the extrinsic
calibration.
The fusion pipeline is lightweight and able to run in real-time on a computer in the car.
For the detection task, modify the Faster R-CNN architecture to accommodate hybrid LiDAR-
camera data for improved object detection and classification.
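A minimal sketch of the gradient-free search over the 6-DOF extrinsic parameters. The objective here (`alignment_score`, `TRUE_PARAMS`) is a toy stand-in for the edge-alignment score, and the accept-if-better random search is only one possible gradient-free optimizer, not the authors' implementation:

```python
import numpy as np

# Toy stand-in for the edge-alignment objective: higher is better,
# peaks at TRUE_PARAMS (3 rotations, 3 translations). Illustrative only.
TRUE_PARAMS = np.array([0.02, -0.01, 0.005, 0.1, -0.05, 0.2])

def alignment_score(params):
    return -np.sum((params - TRUE_PARAMS) ** 2)

def random_search_calibration(score_fn, x0, iters=2000, sigma=0.05, seed=0):
    """Gradient-free optimizer: keep a random perturbation whenever it
    improves the score, shrinking the step size over time."""
    rng = np.random.default_rng(seed)
    best, best_score = x0.copy(), score_fn(x0)
    for i in range(iters):
        step = sigma * (0.999 ** i)
        cand = best + rng.normal(0.0, step, size=best.shape)
        s = score_fn(cand)
        if s > best_score:
            best, best_score = cand, s
    return best

estimate = random_search_calibration(alignment_score, np.zeros(6))
```

A grid search over 6 parameters is exponential in resolution; a stochastic local search like this evaluates the score only where it is improving, which is what makes real-time recalibration feasible.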
5. Online Camera LiDAR Fusion and Object Detection on
Hybrid Data for Autonomous Driving
The sensor fusion and object detection pipeline, estimating the rotation and translation btw their coordinate systems; a non-optimal calibration is shown.
6. RegNet: Multimodal Sensor Registration Using Deep
Neural Networks
RegNet, the deep CNN to infer a 6 DOF extrinsic calibration between multimodal sensors,
exemplified using a scanning LiDAR and a monocular camera.
Compared to existing approaches, RegNet casts all 3 conventional calibration steps (feature
extraction, feature matching and global regression) into a single real-time capable CNN.
It does not require any human interaction and bridges the gap between classical offline and
target-less online calibration approaches as it provides both a stable initial estimation as well
as a continuous online correction of the extrinsic parameters.
During training, randomly decalibrate our system in order to train RegNet to infer the
correspondence between projected depth measurements and RGB image and finally regress
the extrinsic calibration.
Additionally, with an iterative execution of multiple CNNs, that are trained on different
magnitudes of decalibration, it compares favorably to state-of-the-art methods in terms of a
mean calibration error of 0.28◦ for the rotational and 6 cm for the translation components
even for large decalibrations up to 1.5 m and 20◦ .
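The random decalibration used to generate training samples can be sketched as follows. The ±20° / ±1.5 m ranges come from the slide; `H_gt`, the Euler parameterization, and the uniform sampling are illustrative assumptions:

```python
import numpy as np

def rot_xyz(rx, ry, rz):
    """Rotation matrix from Euler angles (radians)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def decalibrate(H_gt, max_angle_deg=20.0, max_trans=1.5, rng=None):
    """Sample phi = (rx, ry, rz, tx, ty, tz) up to the largest decalibration
    range on the slide and apply it to the ground-truth extrinsic H_gt.
    The network would then regress phi back from the resulting projection."""
    rng = rng or np.random.default_rng(0)
    ang = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, 3))
    t = rng.uniform(-max_trans, max_trans, 3)
    H_phi = np.eye(4)
    H_phi[:3, :3] = rot_xyz(*ang)
    H_phi[:3, 3] = t
    return H_phi @ H_gt, np.concatenate([ang, t])

H_init, phi = decalibrate(np.eye(4))
```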
7. RegNet: Multimodal Sensor Registration Using Deep
Neural Networks
It estimates the calibration btw a depth and an RGB sensor. The depth points are projected on the RGB
image using an initial calibration Hinit. In the 1st and 2nd part of the network, use NiN blocks to extract rich
features for matching. The final part regresses decalibration by gathering global info. using two FCLs.
During training, φ_decalib is randomly perturbed, resulting in different projections of the depth points.
9. Vehicle Detection from 3D Lidar Using FCN
Point clouds from a Velodyne scan can be
roughly projected and discretized into a 2D
point map;
The projected point map is analogous to a
cylindrical image;
Encode the bounding box corner of the
vehicle (8 corners as 24-d);
It consists of one objectness classification
branch and one bounding box regression
branch.
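A minimal numpy sketch of the cylindrical projection, assuming illustrative angular resolutions (the channel layout and resolutions are not the paper's exact values):

```python
import numpy as np

def cylindrical_project(points, d_theta=np.deg2rad(0.4), d_phi=np.deg2rad(0.08)):
    """Map each 3D point (x, y, z) to a (row, col) cell of a 2D point map
    via its elevation and azimuth angles, keeping range as a channel."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    elev = np.arcsin(z / np.linalg.norm(points, axis=1))  # elevation angle
    azim = np.arctan2(y, x)                               # azimuth angle
    rows = np.floor(elev / d_theta).astype(int)
    cols = np.floor(azim / d_phi).astype(int)
    depth = np.sqrt(x ** 2 + y ** 2)                      # range channel
    return np.stack([rows, cols], axis=1), depth

# one point 10 m ahead, 1 m below the sensor
cells, depth_ch = cylindrical_project(np.array([[10.0, 0.0, -1.0]]))
```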
10. (a) The input point map, with
the d channel visualized. (b)
The output confidence map of
the objectness branch. (c)
Bounding box candidates
corresponding to all points
predicted as positive, i.e. high
confidence points in (b). (d)
Remaining bounding boxes
after non-max suppression.
Vehicle Detection from 3D Lidar Using FCN
11. VoxelNet: End-to-End Learning for Point Cloud Based
3D Object Detection
VoxelNet removes the need for manual feature
engineering on 3D point clouds: it is a generic 3D
detection network that unifies feature extraction
and bounding box prediction in a single-stage,
end-to-end trainable deep network.
Specifically, VoxelNet divides a point cloud into
equally spaced 3D voxels and transforms a group
of points within each voxel into a unified feature
representation through the voxel feature encoding
(VFE) layer.
In this way, the point cloud is encoded as a
descriptive volumetric representation, which is
then connected to a RPN to generate detections.
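The voxel partition step can be sketched as follows; voxel size and origin are illustrative, and the VFE layers that would encode each group are not shown:

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), origin=(0.0, -40.0, -3.0)):
    """Partition a point cloud into equally spaced 3D voxels:
    returns a dict mapping voxel index (i, j, k) -> array of its points."""
    idx = np.floor((points - np.array(origin)) / np.array(voxel_size)).astype(int)
    voxels = {}
    for p, (i, j, k) in zip(points, idx):
        voxels.setdefault((i, j, k), []).append(p)
    return {key: np.array(v) for key, v in voxels.items()}

pts = np.array([[0.05, 0.05, 0.10],   # these two fall in the same voxel
                [0.15, 0.07, 0.15],
                [5.05, 2.05, 1.05]])  # this one lands elsewhere
vox = voxelize(pts)
```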
15. Object Detection and Classification in Occupancy
Grid Maps using Deep Convolutional Networks
Based on a grid map environment
representation, well-suited for sensor fusion,
free-space estimation and machine learning,
detect and classify objects using deep CNNs.
As input, use a multi-layer grid map efficiently
encoding 3D range sensor info.
The inference output consists of a list of rotated
Bboxes with associated semantic classes.
Transform range sensor measurements to a multi-
layer grid map which serves as input for object
detection and classification network. From these top
view grid maps the network infers rotated 3D
bounding boxes together with semantic classes.
These boxes can be projected into the camera image
for visual validation. Cars are depicted green, cyclists
aquamarine and pedestrians cyan.
16. Object Detection and Classification in Occupancy
Grid Maps using Deep Convolutional Networks
Minimal preprocessing is applied to obtain the occupancy grid maps.
As there are labeled objects only in the camera image, remove all points that are not in the
camera’s field of view.
Apply ground surface segmentation and estimate different grid cell features; the resulting
multi-layer grid maps cover 60 m × 60 m with a cell size of either 10 cm or 15 cm.
As observed, the ground is flat in most of the scenarios, so fit a ground plane to the
representing point set.
Then, use the full point set or a non-ground subset to construct a multi-layer grid map
containing different features.
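A minimal sketch of building such a multi-layer grid map, assuming an illustrative choice of per-cell features (point count, max height, mean intensity); the paper's exact feature set may differ:

```python
import numpy as np

def multilayer_grid(points, intensity, extent=60.0, cell=0.1):
    """Scatter points into a 60 m x 60 m grid (10 cm cells) and compute
    three feature layers per cell: count, max height, mean intensity."""
    n = int(extent / cell)
    count = np.zeros((n, n))
    max_h = np.full((n, n), -np.inf)
    inten = np.zeros((n, n))
    ix = np.floor((points[:, 0] + extent / 2) / cell).astype(int)
    iy = np.floor((points[:, 1] + extent / 2) / cell).astype(int)
    keep = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)  # clip to grid
    for x, y, z, it in zip(ix[keep], iy[keep], points[keep, 2], intensity[keep]):
        count[x, y] += 1
        max_h[x, y] = max(max_h[x, y], z)
        inten[x, y] += it
    mean_inten = np.divide(inten, count, out=np.zeros_like(inten), where=count > 0)
    return np.stack([count, np.where(np.isinf(max_h), 0.0, max_h), mean_inten])

pts_demo = np.array([[1.05, 1.05, 0.5], [1.02, 1.03, 1.5]])  # same cell
grid = multilayer_grid(pts_demo, np.array([0.2, 0.4]))
```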
17. Object Detection and Classification in Occupancy
Grid Maps using Deep Convolutional Networks
KITTI Bird’s Eye View Evaluation 2017 consists of 7481 images for training and 7518
images for testing as well as corresponding range sensor data represented as point sets.
Training and test data contain 80,256 labeled objects in total which are represented as
oriented 3D Bboxes (7 parameters).
As summarized in Table, there are 8 semantic classes labeled in the training set although
not all classes are used to determine the benchmark result.
18. RT3D: Real-Time 3-D Vehicle Detection in LiDAR Point
Cloud for Autonomous Driving
Real-time 3-dimensional (RT3D) vehicle detection
method that utilizes pure LiDAR point cloud to
predict the location, orientation, and size of vehicles.
Apply pre-RoI-pooling convolution that moves the
majority of the convolution operations ahead of the
RoI pooling, leaving only a small part behind, which
significantly boosts computational efficiency.
A pose-sensitive feature map design is strongly
activated by the relative poses of vehicles, leading to
a high regression accuracy on the location,
orientation, and size of vehicles.
RT3D is the 1st LiDAR 3-D vehicle detection work
that completes detection within 0.09s.
19. RT3D: Real-Time 3-D Vehicle Detection in LiDAR Point
Cloud for Autonomous Driving
The network architecture of RT3D
20. BirdNet: a 3D Object Detection Framework from LiDAR
information
LiDAR- based 3D object detection pipeline entailing three stages:
First, laser info. is projected into a novel cell encoding for bird’s eye view projection.
Later, both object location on the plane and its heading are estimated through a
convolutional neural network originally designed for image processing.
Finally, 3D oriented detections are computed in a post-processing phase.
21. BirdNet: a 3D Object Detection Framework from LiDAR
information
Results on KITTI Benchmark test set: detections in image, BEV projection, and 3D point cloud.
22. LMNet: Real-time Multiclass Object Detection on CPU
using 3D LiDAR
An optimized single-stage deep CNN to detect objects in urban environments, using
nothing more than point cloud data.
The network structure employs dilated convolutions to gradually increase the receptive
field as depth increases; this helps reduce the computation time by about 30%.
The input consists of 5 perspective representations of the unorganized point cloud data.
The network outputs an objectness map and the bounding box offset values for each point.
Using reflection, range, and the position on each of the 3 axes helped to improve the
location and orientation of the output bounding box.
Execution time is 50 FPS using desktop GPUs, and up to 10 FPS on an Intel Core i5 CPU.
23. LMNet: Real-time Multiclass Object Detection on CPU
using 3D LiDAR
Used dilated layers
The LMNet architecture
Encoded input point cloud
24. HDNET: Exploit HD Maps for 3D Object
Detection
High-Definition (HD) maps provide strong priors that can boost the performance and
robustness of modern 3D object detectors.
A single-stage detector extracts geometric and semantic features from the HD maps.
As maps might not be available everywhere, a map prediction module estimates the map
on the fly from raw LiDAR data.
The whole framework runs at 20 frames per second.
25. HDNET: Exploit HD Maps for 3D Object
Detection
BEV LiDAR representation that exploits geometric and semantic HD map information.
(a) The raw LiDAR point cloud. (b) Incorporating geometric ground prior.
(c) Discretization of the LiDAR point cloud. (d) Incorporating semantic road prior.
26. HDNET: Exploit HD Maps for 3D Object
Detection
Network structures for object detection (left) and online map estimation (right).
27. IPOD: Intensive Point-based Object Detector for Point
Cloud
A 3D object detection framework, IPOD, based on raw point cloud.
It seeds object proposal for each point, which is the basic element.
An E2E trainable architecture, where features of all points within a proposal are extracted
from the backbone network and aggregated into a proposal feature for final bounding-box inference.
These features with both context info. and precise point cloud coord.s improve the performance.
28. IPOD: Intensive Point-based Object Detector for Point
Cloud
Illustration of point-based proposal
generation. (a) Semantic segmentation result
on the image. (b) Projected segmentation
result on point cloud. (c) Point-based
proposals on positive points after NMS.
29. IPOD: Intensive Point-based Object Detector for Point
Cloud
Illustration of proposal feature generation module. It combines location info. and
context feature to generate offsets from the centroid of interior points to the center of
target instance object. The predicted residuals are added back to the location info. in order
to make feature more robust to geometric transformation.
30. IPOD: Intensive Point-based Object Detector for Point
Cloud
Backbone architecture. Bounding-box prediction network.
31. PIXOR: Real-time 3D Object Detection from Point
Clouds
This method utilizes the 3D data more efficiently by representing the scene from the
Bird’s Eye View (BEV), and propose PIXOR (ORiented 3D object detection from
PIXel-wise NN predictions), a proposal-free, single-stage detector that outputs
oriented 3D object estimates decoded from pixel-wise neural network predictions.
The input representation, network architecture, and model optimization are specially
designed to balance high accuracy and real-time efficiency.
3D object detector from Bird’s Eye View (BEV) of LIDAR point cloud.
32. PIXOR: Real-time 3D Object Detection from Point
Clouds
The network architecture of PIXOR
Use cross-entropy loss on the classification output
and a smooth L1 loss on the regression output.
Sum the classification loss over all locations on the
output map, while the regression loss is computed
over positive locations only.
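The two loss terms above in a few lines of numpy: per-pixel cross-entropy summed over all locations, plus smooth L1 on a 6-dim regression target at positive locations only. The tensors and the target layout are illustrative, not the exact training setup:

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def pixor_loss(cls_prob, cls_gt, reg_pred, reg_gt):
    """cls_prob/cls_gt: (H, W) objectness; reg_pred/reg_gt: (H, W, 6)."""
    eps = 1e-7
    ce = -(cls_gt * np.log(cls_prob + eps)
           + (1 - cls_gt) * np.log(1 - cls_prob + eps))
    cls_loss = ce.sum()                # summed over all output locations
    pos = cls_gt > 0.5                 # regression only on positive cells
    reg_loss = smooth_l1(reg_pred[pos] - reg_gt[pos]).sum()
    return cls_loss + reg_loss

# toy 1x2 output map: one positive (p=0.9), one negative (p=0.1)
toy_loss = pixor_loss(np.array([[0.9, 0.1]]), np.array([[1.0, 0.0]]),
                      np.zeros((1, 2, 6)), np.zeros((1, 2, 6)))
```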
33. DepthCN: Vehicle Detection Using 3D-LIDAR and
ConvNet
Vehicle detection based on the Hypothesis Generation (HG) and Verification (HV)
paradigms.
The data inputted to the system is a point cloud obtained from a 3D-LIDAR
mounted on board an instrumented vehicle, which is transformed to a Dense-
depth Map (DM).
The solution starts by removing ground points followed by point cloud
segmentation.
Then, segmented obstacles (object hypotheses) are projected onto the DM.
Bboxes are fitted to the segmented objects as vehicle hypotheses (HG step).
Bboxes are used as inputs to a ConvNet to classify/verify the hypotheses of
belonging to the category ‘vehicle’ (HV step).
34. DepthCN: Vehicle Detection Using 3D-LIDAR and
ConvNet
3D-LIDAR-based vehicle detection algorithm (DepthCN).
35. DepthCN: Vehicle Detection Using 3D-LIDAR and
ConvNet
Top: the point cloud where the detected ground points are denoted with green and LIDAR points that are
out of the field of view of the camera are shown in red. Bottom: the projected clusters and HG results in the
form of 2D BB. Right: the zoomed view, and the vertical orange arrows indicate corresponding obstacles.
36. DepthCN: Vehicle Detection Using 3D-LIDAR and
ConvNet
The generated Dense-depth Map (DM) with the
projected hypotheses (red).
The ConvNet architecture. The generated hypotheses and the detection results are
shown as red and dashed-green BBs, respectively, in both
DM and images. The bottom figures show the result in PCD.
37. SECOND: Sparsely Embedded Convolutional
Detection
An improved sparse convolution method for such networks, which significantly increases
the speed of both training and inference.
Introduce a new form of angle loss regression to improve the orientation estimation
performance and a new data augmentation approach that can enhance the convergence
speed and performance.
The proposed network produces SoA results on the KITTI 3D object detection
benchmarks while maintaining a fast inference speed.
The detector takes a raw point cloud as input, converts it to voxel features and coordinates, and applies two VFE
(voxel feature encoding) layers and a linear layer. A sparse CNN is applied and an RPN generates the detection.
38. SECOND: Sparsely Embedded Convolutional
Detection
The sparse convolution
algorithm is shown above, and
the GPU rule generation
algorithm is shown below. Nin
denotes the number of input
features, and Nout denotes the
number of output features. N is
the number of gathered features.
Rule is the rule matrix, where
Rule[i, :, :] is the ith rule
corresponding to the ith kernel
matrix in the convolution kernel.
The boxes with colors except
white indicate points with sparse
data and the white boxes
indicate empty points.
39. SECOND: Sparsely Embedded Convolutional
Detection
A GPU-based rule generation algorithm
(Algorithm 1) that runs faster on a GPU.
First, collect the input indexes and
associated spatial indexes instead of the
output indexes (1st loop). Duplicate
output locations are obtained in this
stage. Then execute a unique parallel
algorithm on the spatial index data to
obtain the output indexes and their
associated spatial indexes. A buffer with
the same spatial dimensions as those of
the sparse data is generated from the
previous results for table lookup in the
next step (2nd loop). Finally, we iterate
on the rules and use the stored spatial
indexes to obtain the output index for
each input index (3rd loop).
40. SECOND: Sparsely Embedded Convolutional
Detection
The structure of sparse middle feature extractor. The
yellow boxes represent sparse convolution, the
white boxes represent submanifold convolution, and
the red box represents the sparse-to-dense layer.
The upper part of the figure shows the spatial
dimensions of the sparse data.
Lθ = SmoothL1(sin(θp − θt)),
Introducing a new angle loss regression
This approach to angle loss has two advantages:
(1) it resolves the ambiguity btw orientations of 0 and π;
(2) it naturally models the IoU against the angle offset
function.
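The sine-encoded angle loss can be written in a few lines of numpy: predictions that differ by π (a box flipped front-to-back) incur zero loss, which is exactly the 0-vs-π property above. SECOND recovers the true heading with a separate direction classifier, not shown here:

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def angle_loss(theta_p, theta_t):
    """L_theta = SmoothL1(sin(theta_p - theta_t))."""
    return smooth_l1(np.sin(theta_p - theta_t))

same = angle_loss(np.pi, 0.0)      # flipped by pi: loss ~ 0
off = angle_loss(np.pi / 2, 0.0)   # off by pi/2: maximal loss
```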
Structure of the RPN: downsampling convolutional layers, transpose convolutional layers, and concatenation.
41. SECOND: Sparsely Embedded Convolutional
Detection
Results of 3D detection on the KITTI test set. For better visualization, the 3D boxes
detected using LiDAR are projected onto images from the left camera.
42. YOLO3D: E2E RT 3D Oriented Object Bounding Box
Detection from LiDAR Point Cloud
Based on the success of the one-shot regression meta-architecture in the 2D perspective
image space, extend it to generate oriented 3D object Bboxes from LiDAR point cloud.
The idea is extending the loss function of YOLO v2 to include the yaw angle, the 3D box
center in Cartesian coordinates and the height of the box as a direct regression problem.
This formulation enables real-time performance, which is essential for automated driving.
In KITTI, it achieves real-time performance (40 fps) on Titan X GPU.
43. YOLO3D: E2E RT 3D Oriented Object Bounding Box
Detection from LiDAR Point Cloud
The total loss
Project the point cloud to get a bird’s eye view grid map;
two grid maps are created from the projection of the point cloud.
The first feature map contains the maximum height,
where each grid cell (pixel) value represents the height
of the highest point associated with that cell.
The second grid map represents the density of points.
In YOLO-v2, anchors are calculated using k-means
clustering over width and length of ground truth boxes.
The point behind using anchors, is to find priors for the
boxes, onto which the model can predict modifications.
The anchors must be able to cover the whole range of
boxes that can appear in the data.
Choose not to use clustering to calculate the anchors,
and instead, calculate the mean 3D box dimensions for
each object class, and use these average box
dimensions as anchors.
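The per-class mean-anchor computation described above can be sketched directly; the toy boxes and class ids are illustrative:

```python
import numpy as np

def mean_anchors(boxes, labels):
    """boxes: (N, 3) ground-truth (l, w, h); labels: (N,) class ids.
    Returns the mean box dimensions per class, used as anchors."""
    return {c: boxes[labels == c].mean(axis=0) for c in np.unique(labels)}

boxes = np.array([[4.0, 1.8, 1.5],
                  [4.4, 1.8, 1.7],
                  [0.8, 0.6, 1.7]])
labels = np.array([0, 0, 1])   # 0 = car, 1 = pedestrian (illustrative)
anchors = mean_anchors(boxes, labels)
```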
44. YOLO4D: A ST Approach for RT Multi-object
Detection and Classification from LiDAR Point Clouds
YOLO4D: the 3D LiDAR point clouds are aggregated over time as a 4D tensor
(3D space dimensions plus the time dimension), which is fed to a one-shot
fully convolutional detector based on the YOLO v2 architecture.
YOLO3D is extended with Convol. LSTM for temporal features aggregation.
The outputs are the oriented 3D Object BBox info., in addition to its length (L),
width (W), height (H) and orientation (yaw), together with the objects classes and
confidence scores.
Two different techniques are evaluated to incorporate the temporal dimension:
recurrence and frame stacking.
45. YOLO4D: A ST Approach for RT Multi-object
Detection and Classification from LiDAR Point Clouds
Left: Frame stacking architecture; Right: Convolutional LSTM architecture.
The prediction model
The total loss
46. Deconvolutional Networks for Point-Cloud Vehicle
Detection and Tracking in Driving Scenarios
A full vehicle detection and tracking system that
works with 3D lidar information only.
The detection step uses a CNN that receives as
input a featured representation of the 3D
information provided by a Velodyne HDL-64
sensor and returns a per-point classification of
whether it belongs to a vehicle or not.
The classified point cloud is then geometrically
processed to generate observations for a multi-
object tracking system implemented via a
number of Multi-Hypothesis Extended Kalman
Filters (MH-EKF) that estimate the position and
velocity of the surrounding vehicles.
The model is fed with an encoded representation
of the point cloud and computes for each 3D point
its probability of belonging to a vehicle. The
classified points are then clustered, generating
trustworthy observations that are fed to the
MH-EKF based tracker.
47. Deconvolutional Networks for Point-Cloud Vehicle
Detection and Tracking in Driving Scenarios
To obtain a useful input for the detector,
project the 3D point cloud raw data to a
featured image-like representation
containing ranges and reflectivity info. by
means of transformation G(·).
Ground truth for learning the classification
task is obtained by first projecting the
image-based Kitti tracklets over the 3D
Velodyne info., and then applying again
transformation G(·) over the selected points.
48. Deconvolutional Networks for Point-Cloud Vehicle
Detection and Tracking in Driving Scenarios
The network encompasses only conv. and deconv. blocks followed by BN and ReLU nonlinearities. The first
3 blocks conduct the feature extraction step, controlling, according to the vehicle detection objective, the size of
the receptive fields and the feature maps generated. The next 3 deconvolutional blocks expand the info.
enabling the point-wise classification. After each deconvolution, feature maps from the lower part of the
network are concatenated (CAT) before applying the normalization and non-linearities, providing richer info.
and better performance. During training, 3 losses are calculated at different network points.
49. Deconvolutional Networks for Point-Cloud Vehicle
Detection and Tracking in Driving Scenarios
They show the raw input point cloud, the
Deep detector output, the final tracked
vehicles and the RGB projected bounding
boxes submitted for evaluation.
50. Fast and Furious: Real Time E2E 3D Detection,
Tracking and Motion Forecasting with a Single
Convolutional Net
A deep neural network to jointly reason about 3D detection, tracking and motion forecasting
given data captured by a 3D sensor.
By jointly reasoning about these tasks, the holistic approach is more robust to occlusion as
well as sparse data at range.
It performs 3D convolutions across space and time over a bird’s eye view representation of
the 3D world, which is very efficient in terms of both memory and computation.
It can perform all tasks in as little as 30 ms.
Overlay temporal & motion forecasting data.
Green: bbox w/ 3D point. Grey: bbox w/o 3D point.
51. Fast and Furious: Real Time E2E 3D Detection,
Tracking and Motion Forecasting with a Single
Convolutional Net
The FaF work takes multiple frames as input and performs detection, tracking and motion forecasting.
52. Fast and Furious: Real Time E2E 3D Detection,
Tracking and Motion Forecasting with a Single
Convolutional Net
Modeling temporal information
53. Fast and Furious: Real Time E2E 3D Detection,
Tracking and Motion Forecasting with a Single
Convolutional Net
Motion forecasting
The loss function
classification loss
The regression targets
smooth L1
54. SqueezeSeg: Conv. Neural Nets with Recurrent CRF for RT
Road-Object Segmentation from 3D LiDAR Point Cloud
Semantic segmentation of road-objects from 3D LiDAR point clouds.
Detect and categorize instances of interest, such as cars, pedestrians and cyclists.
Formulate it as a pointwise classification problem, and propose an E2E pipeline called
SqueezeSeg based on CNN: the CNN takes a transformed LiDAR point cloud as input and
directly outputs a point-wise label map, which is then refined by a CRF as a recurrent layer.
Instance-level labels are then obtained by conventional clustering algorithms.
The CNN model is trained on LiDAR point clouds from the KITTI dataset, and point-wise
segmentation labels are derived from 3D bounding boxes from KITTI.
To obtain extra training data, built a LiDAR simulator into Grand Theft Auto V (GTA-V), a
popular video game, to synthesize large amounts of realistic training data.
GT segmentation and predicted segmentation.
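The final step, turning point-wise labels into instances with a conventional clustering algorithm, can be sketched with a simple single-linkage grouping; the radius and the method are illustrative, not the authors' exact choice:

```python
import numpy as np

def cluster_instances(points, labels, radius=0.5):
    """Group same-class points closer than `radius` into one instance id."""
    n = len(points)
    inst = -np.ones(n, dtype=int)
    next_id = 0
    for i in range(n):
        if inst[i] >= 0:
            continue
        stack, inst[i] = [i], next_id
        while stack:                      # flood-fill over near neighbours
            j = stack.pop()
            d = np.linalg.norm(points - points[j], axis=1)
            nbr = np.where((d < radius) & (labels == labels[j]) & (inst < 0))[0]
            inst[nbr] = next_id
            stack.extend(nbr.tolist())
        next_id += 1
    return inst

pts = np.array([[0.0, 0.0], [0.3, 0.0], [5.0, 0.0], [5.2, 0.0]])
ids = cluster_instances(pts, np.array([1, 1, 1, 1]))
```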
55. SqueezeSeg: Conv. Neural Nets with Recurrent CRF for RT
Road-Object Segmentation from 3D LiDAR Point Cloud
LiDAR Projections.
Network structure of SqueezeSeg
56. SqueezeSeg: Conv. Neural Nets with Recurrent CRF for RT
Road-Object Segmentation from 3D LiDAR Point Cloud
Structure of FireModule and FireDeconv
Conditional Random Field (CRF) as an RNN layer
https://github.com/BichenWuUCB/SqueezeSeg.
57. SEGCloud: Semantic Segmentation of 3D Point Clouds
SEGCloud, an E2E framework to obtain 3D point-level segmentation that combines the
advantages of NNs, trilinear interpolation (TI) and fully connected CRF (FC-CRF).
Coarse voxel predictions from a 3D Fully Convolutional NN are transferred back to the raw
3D points via trilinear interpolation.
FC-CRF enforces global consistency and provides fine-grained semantics on the points.
Implement the FC-CRF as a differentiable Recurrent NN to allow joint optimization.
58. SEGCloud: Semantic Segmentation of 3D Point Clouds
The 3D-FCNN is made of 3 residual layers sandwiched between 2 convolutional layers.
Max Pooling in the early stages of the network yields a 4X downsampling.
59. SEGCloud: Semantic Segmentation of 3D Point Clouds
Trilinear interpolation of class scores from voxels to points: Each point’s score is
computed as the weighted sum of the scores from its 8 spatially closest voxel centers.
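The voxel-to-point transfer can be sketched as follows, assuming unit voxels with centers at integer coordinates (the real grid layout differs only by scaling and offset):

```python
import numpy as np

def trilinear_scores(point, voxel_scores):
    """Interpolate class scores at `point` from its 8 surrounding voxel
    centers. voxel_scores: (X, Y, Z, C) scores at integer voxel centers."""
    base = np.floor(point).astype(int)
    frac = point - base
    out = np.zeros(voxel_scores.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # each neighbour's weight is the product of per-axis factors
                w = ((1 - frac[0]) if dx == 0 else frac[0]) * \
                    ((1 - frac[1]) if dy == 0 else frac[1]) * \
                    ((1 - frac[2]) if dz == 0 else frac[2])
                out += w * voxel_scores[base[0] + dx, base[1] + dy, base[2] + dz]
    return out

grid = np.zeros((2, 2, 2, 1))
grid[1, 0, 0, 0] = 1.0
point_scores = trilinear_scores(np.array([0.5, 0.0, 0.0]), grid)
```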
60. SEGCloud: Semantic Segmentation of 3D Point Clouds
A 2-stage training by first optimizing over the point-level unary potentials (no
CRF) and then over the joint framework for point-level fine-grained labeling.
61. Multi-View 3D networks (MV3D), a sensory-fusion
framework that takes both LIDAR point cloud and RGB
images as input and predicts oriented 3D bboxes.
Composed of 2 subnetworks: one for 3D object
proposal generation, one for multi-view feature fusion.
The proposal network generates 3D candidate boxes
from bird’s eye view representation of 3D point cloud.
A deep fusion scheme to combine region-wise
features from multiple views and enable interactions
btw intermediate layers of different paths.
Multi-View 3D Object Detection
Network for Autonomous Driving
63. Input features of the MV3D network.
Multi-View 3D Object Detection
Network for Autonomous Driving
64. Training strategy for the Region-
based Fusion Network: During
training, the bottom 3 paths and losses
are added to regularize the network.
The auxiliary layers share weights with
the corresponding layers in the main
network.
Multi-View 3D Object Detection
Network for Autonomous Driving
65. A General Pipeline for 3D Detection of
Vehicles
A pipeline to adopt 2D detection net and fuse it with a 3D point cloud to generate 3D info.
To identify the 3D box, model fitting based on generalised car models and score maps.
A two-stage CNN is proposed to refine the detected 3D box.
General fusion pipeline. All of the point clouds viewed from the top (bird’s eye view). The height is encoded by color, with
red being the ground. A subset of points is selected based on the 2D detection. A model fitting algorithm based on the
generalised car models and score maps is applied to find the car points in the subset and a two-stage refinement CNN is
designed to fine tune the detected 3D box and re-assign an objectiveness score to it.
66. A General Pipeline for 3D Detection of
Vehicles
Generalised car models. Score map (scores are indicated at bottom).
Qualitative result illustration on KITTI data (top) and Boston data (bottom). Blue boxes are the 3D detection results
67. Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
In purely image- based pedestrian detection approaches, the SoA results
have been achieved with CNN and surprisingly few detection frameworks
have been built upon multi-cue approaches.
This is a pedestrian detector for autonomous vehicles that exploits LiDAR
data, in addition to visual info.
LiDAR data is utilized to generate region proposals by processing the 3-d
point cloud that it provides.
These candidate regions are then further processed by a SoA CNN
classifier that was fine-tuned for pedestrian detection.
68. Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
(a) Cluster proposal (b) Size and ratio corrections
69. Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
Taking the inner workings of CNNs into consideration, image-based depth maps are converted to
pseudo-LiDAR representations.
With this representation, different existing LiDAR-based detection algorithms can be applied.
On the popular KITTI benchmark, it raises the detection accuracy of objects within the 30m
range from the previous SoA of 22% to an unprecedented 74%.
Pseudo-LiDAR signal from visual depth estimation.
70. Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
The two-step pipeline for image-based 3D object detection. Given stereo or monocular images,
first predict the depth map, then transform it into a 3D point cloud in the LiDAR
coordinate system. This representation is called pseudo-LiDAR and is processed exactly like
LiDAR, so any LiDAR-based 3D object detection algorithm can be applied.
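The depth-to-point-cloud conversion at the heart of pseudo-LiDAR is a pinhole back-projection of each pixel. A minimal numpy sketch, assuming the camera intrinsics fx, fy, cx, cy are given; a real pipeline would additionally transform the points from the camera frame into the LiDAR frame:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metres) into an N x 3 point cloud.

    Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]
```

The resulting N x 3 array can then be fed to any LiDAR-based detector unchanged, which is the whole point of the pseudo-LiDAR representation.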
71. Pseudo-LiDAR from Visual Depth Estimation: Bridging the
Gap in 3D Object Detection for Autonomous Driving
Apply a single 2D convolution with a
uniform kernel to the frontal view depth
map (top-left). The resulting depth map
(top-right), after being projected into the bird's-eye view (bottom-right), reveals a large
depth distortion in comparison to the
original pseudo-LiDAR view (bottom-left),
especially for far-away objects. The boxes
are super-imposed and contain all points of
the green and yellow cars respectively.
72. Fusing Bird’s Eye View LIDAR Point Cloud and Front View
Camera Image for Deep Object Detection
A method for fusing LIDAR point clouds and camera-captured images in a deep
CNN.
The method constructs a layer called sparse non-homogeneous pooling layer to
transform features between bird’s eye view and front view.
The sparse point cloud is used to construct the mapping between the two views.
The pooling layer allows fusion of multi-view features at any stage of the network.
This is favorable for 3D object detection using camera-LIDAR fusion for
autonomous driving.
A corresponding one-stage detector is designed and tested, which produces 3D
Bboxes from the bird’s eye view map.
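The sparse mapping the pooling layer builds can be sketched as a gather/scatter driven by the point cloud: each LiDAR point links one front-view pixel to one BEV cell, so its image feature can be pooled into the BEV map. A simplified illustration with mean pooling; the function name and shapes here are assumptions, not the paper's implementation:

```python
import numpy as np

def view_transform(front_feats, uv, bev_idx, bev_shape):
    """Sparse non-homogeneous pooling, simplified.

    front_feats: H x W x C front-view feature map.
    uv: list of (u, v) front-view pixels, one per LiDAR point.
    bev_idx: list of (i, j) BEV cells, one per LiDAR point.
    Returns a BEV feature map with mean-pooled image features.
    """
    out = np.zeros(bev_shape + (front_feats.shape[-1],))
    cnt = np.zeros(bev_shape)
    for (u, v), (i, j) in zip(uv, bev_idx):
        out[i, j] += front_feats[v, u]  # gather from front view
        cnt[i, j] += 1                  # count points per BEV cell
    nz = cnt > 0
    out[nz] /= cnt[nz][:, None]         # mean over points in each cell
    return out
```

Because the mapping is defined by the points rather than a fixed grid warp, the same layer can fuse the two views at any stage of the network.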
73. Fusing Bird’s Eye View LIDAR Point Cloud and Front View
Camera Image for Deep Object Detection
The vanilla fusion-based one-stage object detection network
The sparse non-homogeneous pooling layer that fuses
front view image and bird’s eye view LIDAR feature.
74. Fusing Bird’s Eye View LIDAR Point Cloud and Front View
Camera Image for Deep Object Detection
(a)From camera to bird’s eye. (b)From bird’s eye to camera. (c)From front view conv4
layer to bird’s eye conv4 layer. (d)From bird’s eye conv4 to bird’s eye conv4.
75. Fusing Bird’s Eye View LIDAR Point Cloud and Front View
Camera Image for Deep Object Detection
The fusion-based one-stage object detection network compared with SoA single-sensor networks.
76. PointNet: Deep Learning on Point Sets for 3D
Classification and Segmentation
Applications of PointNet. PointNet is a deep net architecture that consumes raw point cloud (set of
points) without voxelization or rendering. It is a unified architecture that learns both global and local
point features, providing a simple, efficient and effective approach for a number of 3D recognition tasks.
77. PointNet: Deep Learning on Point Sets for 3D
Classification and Segmentation
PointNet Architecture. The classification network takes n points as input, applies input and feature transformations, and
then aggregates point features by max pooling. The output is classification scores for k classes. The segmentation network
is an extension to the classification net. It concatenates global and local features and outputs per point scores.
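The permutation invariance that lets PointNet consume raw point sets comes from applying a shared per-point MLP and aggregating with max pooling, a symmetric function. A toy numpy sketch of that idea with a single shared layer; the real network stacks several shared layers plus input/feature transform nets:

```python
import numpy as np

def shared_mlp(points, w, b):
    # The same linear layer + ReLU is applied to every point (n x c -> n x k).
    return np.maximum(points @ w + b, 0.0)

def pointnet_global_feature(points, w, b):
    # Max pooling over the point axis is the symmetric function that makes
    # the result invariant to the ordering of the input points.
    return shared_mlp(points, w, b).max(axis=0)
```

Feeding the same cloud in any order yields the identical global feature vector, which is why no voxelization or ordering is needed.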
78. PointNet++: Deep Hierarchical Feature Learning on
Point Sets in a Metric Space
PointNet does not capture local structures induced by the metric space the points live in,
limiting its ability to recognize fine-grained patterns and its generalizability to complex
scenes.
The network called PointNet++ is able to learn deep point set features efficiently and
robustly.
This is a hierarchical NN that applies PointNet recursively on a nested partitioning of the
input point set.
By exploiting metric space distances, the network is able to learn local features with
increasing contextual scales.
Observing further that point sets are usually sampled with varying densities, which greatly
decreases the performance of networks trained on uniform densities, set learning layers are
proposed that adaptively combine features from multiple scales.
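The nested partitioning in PointNet++ typically starts by picking well-spread centroids with farthest point sampling, then applies PointNet to each centroid's neighbourhood. A sketch of the greedy sampling step:

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: pick k well-spread centroid indices from an n x 3 cloud.

    Maintains, for every point, its distance to the nearest chosen centroid,
    and repeatedly picks the point farthest from all centroids so far.
    """
    chosen = [0]  # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

On a degenerate cloud with one distant outlier, FPS picks the outlier as the second centroid, which is exactly the spread-out behaviour the hierarchy relies on.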
80. PointFusion: Deep Sensor Fusion for 3D Bounding Box
Estimation
PointFusion, a generic 3D object detection method that leverages both image and 3D point
cloud information.
The image data and the raw point cloud data are independently processed by a CNN and a
PointNet architecture, respectively.
The resulting outputs are then combined by a novel fusion network, which predicts multiple
3D box hypotheses and their confidences, using the input 3D points as spatial anchors.
Sample 3D object detection results of
PointFusion model on the KITTI dataset
(left) and the SUN-RGBD dataset (right).
81. PointFusion: Deep Sensor Fusion for 3D Bounding Box
Estimation
A PointNet variant that processes raw point cloud data (A), and a CNN that extracts visual features from an input
image (B). A vanilla global architecture that directly regresses the box corner locations (D), and a dense
architecture that predicts the spatial offset of each of the 8 corners relative to an input point (C): for each input
point, the network predicts the spatial offset (white arrows) from a corner (red dot) to the input point (blue), and
selects the prediction with the highest score as the final prediction (E).
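The dense architecture in (C)-(E) can be summarised in a few lines: every input point regresses the 8 box corners as offsets from itself, and the highest-scoring point's prediction is kept. A shape-only sketch; in the real model the offsets and scores are network outputs, not inputs:

```python
import numpy as np

def dense_box_prediction(points, corner_offsets, scores):
    """points: n x 3 input points (spatial anchors);
    corner_offsets: n x 8 x 3 per-point predicted corner offsets;
    scores: n per-point confidences.
    Returns the 8 x 3 corners predicted by the highest-scoring point.
    """
    best = int(np.argmax(scores))
    return points[best][None, :] + corner_offsets[best]
```

Using the input points as spatial anchors keeps each regression target small and local, which is the stated motivation for the dense variant over the vanilla global one.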
82. Frustum PointNets for 3D Object Detection
from RGB-D Data
A 3D object detection solution from RGB-D data in both indoor and outdoor
scenes.
Previous methods focus on images or 3D voxels, often obscuring natural 3D
patterns and invariances of the 3D data; this method instead operates on raw point
clouds by popping up RGB-D scans.
A challenge is how to efficiently localize objects in point clouds of large-scale
scenes (region proposal).
Instead of solely relying on 3D proposals, it leverages both mature 2D object
detectors and advanced 3D deep learning for object localization, achieving
efficiency as well as high recall.
Benefiting from learning directly on raw point clouds, it is also able to precisely
estimate 3D Bboxes even under strong occlusion or with very sparse points.
83. Frustum PointNets for 3D Object Detection
from RGB-D Data
3D object detection pipeline. Given RGB-D data, first generate 2D object region proposals in
the RGB image using a CNN. Each 2D region is then extruded to a 3D viewing frustum, from which
a point cloud is obtained from the depth data. Finally, the frustum PointNet predicts an (oriented
and amodal) 3D bounding box for the object from the points in the frustum.
84. Frustum PointNets for 3D Object Detection
from RGB-D Data
Frustum PointNets for 3D object detection. First leverage a 2D CNN object detector to propose 2D regions and
classify their content. 2D regions are then lifted to 3D and thus become frustum proposals. Given a point cloud in a
frustum (n × c with n points and c channels of XYZ, intensity etc. for each point), the object instance is segmented
by binary classification of each point. Based on the segmented object point cloud (m × c), a light-weight regression
PointNet (T-Net) tries to align the points by translation such that their centroid is close to the
amodal box center. Finally, the box estimation net estimates the amodal 3D bounding box for the object.
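Lifting a 2D region to a frustum proposal amounts to keeping the LiDAR points whose camera projection lands inside the 2D box. A minimal sketch in camera coordinates, assuming a 3 x 4 projection matrix P and points already in the camera frame:

```python
import numpy as np

def frustum_points(points, P, box2d):
    """Keep points whose projection falls inside a 2D detection box.

    points: n x 3 in camera coordinates; P: 3 x 4 projection matrix;
    box2d: (u_min, v_min, u_max, v_max) in pixels.
    """
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    proj = homo @ P.T
    u = proj[:, 0] / proj[:, 2]  # perspective divide
    v = proj[:, 1] / proj[:, 2]
    u0, v0, u1, v1 = box2d
    mask = (u >= u0) & (u <= u1) & (v >= v0) & (v <= v1) & (points[:, 2] > 0)
    return points[mask]
```

The frustum PointNet then segments this subset into object and clutter before regressing the amodal box.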
85. Frustum PointNets for 3D Object Detection
from RGB-D Data
Coordinate systems for point cloud. (a) default camera
coordinate; (b) frustum coordinate after rotating frustums to
center view; (c) mask coordinate with object points’ centroid
at origin; (d) object coordinate predicted by T-Net.
Basic architectures and IO for PointNets. Architecture is
illustrated for PointNet++ (v2) models with set abstraction
layers and feature propagation layers (for segmentation).
86. Frustum PointNets for 3D Object Detection
from RGB-D Data
Visualizations of Frustum PointNet results on KITTI val set.
87. RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
RoarNet performs 3D object detection from 2D images and 3D LiDAR point clouds.
Based on a two-stage object detection framework with PointNet as the backbone network, it
introduces several ideas to improve 3D object detection performance.
The first part estimates the 3D poses of objects from a monocular image, which approximates
where to examine further, and derives multiple candidates that are geometrically feasible.
This step significantly narrows down feasible 3D regions, which otherwise requires demanding
processing of 3D point clouds in a huge search space.
The second part takes the candidate regions and conducts in-depth inferences to conclude the final
poses in a recursive manner.
Inspired by PointNet, RoarNet processes 3D point clouds directly, leading to precise detection.
RoarNet is implemented in Tensorflow and publicly available with pretrained models.
88. RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
Detection pipeline of RoarNet. The model (a) predicts region proposals in 3D space using geometric
agreement search, (b) predicts objectness in each region proposal, (c) predicts 3D bounding boxes, (d)
calculates IoU (Intersection over Union) between 2D detection and 3D detection.
89. RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
Architecture of RoarNet
90. RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
(a) Previous
Architecture
(b) RoarNet 2D
Architecture
91. RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
RoarNet 2D. A unified architecture detects 2D
bounding boxes and 3D poses, illustrated in (a)
and (b), respectively. For each object, two
extreme cases are shown as non-filled boxes,
and final equally-spaced candidate locations as
colored dots in (b). All calculations are derived
in 3D space despite bird’s eye view (i.e., X-Z
plane) visualization.
92. RoarNet: A Robust 3D Object Detection based on
RegiOn Approximation Refinement
A detection pipeline of several network architectures
93. Joint 3D Proposal Generation and Object Detection
from View Aggregation
AVOD, an Aggregate View Object Detection network for autonomous driving scenarios.
The network uses LIDAR point clouds and RGB images to generate features shared by two
subnetworks: a region proposal network (RPN) and a second stage detector network.
The RPN is capable of performing multimodal feature fusion on high resolution feature maps to
generate reliable 3D object proposals for multiple object classes in road scenes.
Using these proposals, the second stage detection network performs accurate oriented 3D bounding
box regression and category classification to predict the extents, orientation, and classification of
objects in 3D space.
Source code is at: https://github.com/kujason/avod.
A visual representation of the 3D detection problem
from Bird’s Eye View (BEV). The Bbox in green is used to
determine the IoU overlap in the computation of the average
precision. The importance of explicit orientation estimation
can be seen from the fact that an object's Bbox does not change
when the orientation (purple) is shifted by ±π radians.
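The ±π ambiguity is easy to verify numerically: a BEV box and the same box rotated by π occupy an identical corner set, so IoU-based targets alone cannot supervise heading. A small check with an illustrative corner parameterisation (not AVOD's exact encoding):

```python
import numpy as np

def bev_corners(cx, cz, l, w, yaw):
    """Four BEV (X-Z plane) corners of a box with centre (cx, cz),
    length l, width w, and heading yaw."""
    c, s = np.cos(yaw), np.sin(yaw)
    dx = np.array([l, l, -l, -l]) / 2
    dz = np.array([w, -w, -w, w]) / 2
    return np.stack([cx + c * dx - s * dz,   # rotated x offsets
                     cz + s * dx + c * dz],  # rotated z offsets
                    axis=1)
```

Rotating by π negates every corner offset, which maps the symmetric corner set onto itself; hence AVOD regresses orientation explicitly instead of relying on box overlap.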
94. Joint 3D Proposal Generation and Object Detection
from View Aggregation
The method’s architectural diagram. The feature extractors are shown in blue, the region proposal
network in pink, and the second stage detection network in green.
95. Joint 3D Proposal Generation and Object Detection
from View Aggregation
The architecture of high resolution
feature extractor for the image branch.
Feature maps are propagated from the
encoder to the decoder section via red
arrows. Fusion is then performed at every
stage of the decoder by a learned
upsampling layer, followed by concatenation,
and then mixing via a convolutional layer,
resulting in a full resolution feature map at
the last layer of the decoder.
96. Joint 3D Proposal Generation and Object Detection
from View Aggregation
Qualitative results of AVOD for cars (top) and pedestrians/cyclists (bottom). Left: 3D RPN output, Middle: 3D
detection output, and Right: the projection of the detection output onto image space for all three classes.
97. SPLATNet: Sparse Lattice Networks for Point Cloud
Processing
A network architecture for processing point clouds that directly operates on a collection of
points represented as a sparse set of samples in a high-dimensional lattice.
The network uses sparse bilateral convolutional layers as building blocks. These layers
maintain efficiency by using indexing structures to apply convolutions only on occupied parts
of the lattice, and allow flexible specification of the lattice structure, enabling hierarchical
and spatially-aware feature learning as well as joint 2D-3D reasoning.
Both point-based and image-based representations can be easily incorporated in a network
with such layers and the resulting model can be trained in an E2E manner.
From point clouds and images to semantics. SPLATNet3D
directly takes point cloud as input and predicts labels for
each point. SPLATNet2D-3D, on the other hand, jointly
processes both the point cloud and the corresponding
multi-view images for better 2D and 3D predictions.
98. SPLATNet: Sparse Lattice Networks for Point Cloud
Processing
Bilateral Convolution Layer (BCL). Splat: BCL
first interpolates input features F onto a
d_l-dimensional permutohedral lattice defined by the
lattice features L at input points. Convolve: BCL
then does d_l-dimensional convolution over this
sparsely populated lattice. Slice: The filtered signal
is then interpolated back onto the input signal.
• The input points to BCL need not be ordered or
lie on a grid, as they are projected onto a
d_l-dimensional grid defined by lattice features Lin.
• The input and output points can be different for
BCL with the specification of different input and
output lattice features Lin and Lout.
• Since BCL allows separate specifications of input
and lattice features, input signals can be projected
into a different dimensional space for filtering.
• Just like in standard spatial convolutions, BCL
allows an easy specification of filter neighborhood.
• Since a signal is usually sparse in high
dimensions, BCL uses hash tables to index the
populated vertices and does convolutions only at
those locations.
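The splat and slice steps can be illustrated on a regular 1-D lattice; the real BCL uses a d_l-dimensional permutohedral lattice, and this simplification keeps only the scatter-with-barycentric-weights idea:

```python
import numpy as np

def splat(points_x, feats, cell):
    """Scatter point features onto a regular 1-D lattice with linear weights."""
    idx = np.floor(points_x / cell).astype(int)
    frac = points_x / cell - idx
    n_cells = idx.max() + 2
    lat = np.zeros(n_cells)
    w = np.zeros(n_cells)
    np.add.at(lat, idx, (1 - frac) * feats)      # weight to left vertex
    np.add.at(lat, idx + 1, frac * feats)        # weight to right vertex
    np.add.at(w, idx, 1 - frac)
    np.add.at(w, idx + 1, frac)
    return lat, w

def slice_back(lat, w, points_x, cell):
    """Interpolate (normalised) lattice values back onto arbitrary points."""
    dens = np.where(w > 0, lat / np.maximum(w, 1e-9), 0.0)
    idx = np.floor(points_x / cell).astype(int)
    frac = points_x / cell - idx
    return (1 - frac) * dens[idx] + frac * dens[idx + 1]
```

A convolution over the sparse lattice would sit between these two steps; splatting a constant signal and slicing it back recovers the constant, which is a useful sanity check on the weights.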
99. SPLATNet: Sparse Lattice Networks for Point Cloud
Processing
SPLATNet. Illustration of inputs, outputs and network architectures for SPLATNet3D and SPLATNet2D-3D.
100. SPLATNet: Sparse Lattice Networks for Point Cloud
Processing
2D to 3D projection using splat and slice
operations. Given input features of 2D
images, pixels are projected onto a 3D
permutohedral lattice defined by 3D positional lattice
features. The splatted signal is then sliced onto the
points of interest in a 3D point cloud.
Facade point cloud labeling. Sample visual
results of SPLATNet3D and SPLATNet2D-3D.
101. PointRCNN: 3D Object Proposal Generation and
Detection from Point Cloud
PointRCNN is a deep NN method for 3D object detection from raw point cloud.
The whole framework is composed of two stages:
stage-1 for the bottom-up 3D proposal generation;
stage-2 for refining proposals in canonical coordinates to obtain the detection results.
Instead of generating proposals from an RGB image or projecting the point cloud to bird's view
or voxels, the stage-1 sub-network directly generates a small number of high-quality 3D
proposals from the point cloud in a bottom-up manner, via segmenting the point cloud of the
whole scene into foreground and background points.
The stage-2 sub-network transforms the pooled points of each proposal to canonical
coordinates to learn local spatial features, which are combined with the global semantic features
of each point learned in stage-1 for accurate box refinement and confidence prediction.
102. PointRCNN: 3D Object Proposal Generation and
Detection from Point Cloud
Instead of generating proposals from fused feature
maps of bird’s view and front view, or RGB images,
this method directly generates 3D proposals from raw
point cloud in a bottom-up manner.
C: PointRCNN
103. PointRCNN: 3D Object Proposal Generation and
Detection from Point Cloud
The PointRCNN architecture. The whole network consists of two parts: (a) for generating 3D proposals
from raw point cloud in a bottom-up manner. (b) for refining the 3D proposals in canonical coordinates.
104. PointRCNN: 3D Object Proposal Generation and
Detection from Point Cloud
Bin-based localization. The surrounding area along X
and Z axes of each foreground point is split into a series
of bins to locate the object center.
Canonical transformation. The pooled points belonging to
each proposal are transformed to the corresponding canonical
coordinate system for better local spatial feature learning,
where CCS denotes Canonical Coordinate System.
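Both ideas are compact enough to sketch: the canonical transform moves each proposal's points into a box-centred, heading-aligned frame, and bin-based localisation turns centre regression into bin classification plus a small residual. Assumptions here: KITTI camera convention (yaw about the Y axis) and a search range centred on the foreground point:

```python
import numpy as np

def canonical_transform(points, center, heading):
    """Move proposal points into the Canonical Coordinate System (CCS):
    origin at the box centre, X axis aligned with the predicted heading."""
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, 0.0, -s],
                    [0.0, 1.0, 0.0],
                    [s, 0.0, c]])  # yaw rotation about the Y (up) axis
    return (points - center) @ rot.T

def bin_target(offset, bin_size, num_bins):
    """Bin-based localisation target for one axis: the bin the centre offset
    falls into, plus the residual relative to that bin's centre."""
    shifted = offset + num_bins * bin_size / 2.0  # shift search range to [0, ...)
    b = int(np.clip(shifted // bin_size, 0, num_bins - 1))
    residual = shifted - (b + 0.5) * bin_size
    return b, residual
```

Classifying the bin and regressing only the residual keeps each regression target bounded, which is the stated reason the bin-based loss localises more robustly than direct regression.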
105. PointRCNN: 3D Object Proposal Generation and
Detection from Point Cloud
The upper is the image and the lower is a representative view of the corresponding point cloud.
106. Deep Continuous Fusion for Multi-Sensor 3D Object
Detection
A 3D object detector that exploits both LIDAR and cameras to perform very accurate localization.
Design an E2E learnable architecture that exploits continuous convolutions to fuse image and
LIDAR feature maps at different levels of resolution.
The continuous fusion layer encodes both discrete-state image features and continuous
geometric info.
Deep parametric continuous convolution is a learnable operator that operates over non-grid-
structured data.
The motivation is to extend the standard grid-structured convolution to non-grid-structured
data, while retaining high capacity and low complexity.
The key idea is to exploit multi-layer perceptrons as parameterized kernel functions for continuous
convolution.
This parametric kernel function spans the full continuous domain.
The weighted summation over a finite number of neighboring points is used to approximate the
otherwise computationally prohibitive continuous convolution.
Each neighbor is weighted differently according to its relative geometric offset wrt the target point.
This enables a reliable and efficient E2E learnable 3D object detector based on multiple
sensors.
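The parametric continuous convolution reduces to: for each target BEV pixel, gather its K nearest LiDAR neighbours, run each geometric offset through a small MLP to obtain a kernel weight, and take the weighted sum of the neighbours' features. A toy sketch with random weights; the MLP shapes are hypothetical:

```python
import numpy as np

def mlp_kernel(offsets, w1, w2):
    # A tiny MLP maps each 3-D geometric offset to a scalar kernel weight,
    # playing the role of the parameterized continuous kernel function.
    return np.maximum(offsets @ w1, 0.0) @ w2

def continuous_conv(target, neighbors, feats, w1, w2):
    """Approximate the continuous convolution at `target` by a weighted sum
    over its K nearest neighbours, each weighted by an MLP of its offset."""
    offsets = neighbors - target        # relative geometric offsets
    weights = mlp_kernel(offsets, w1, w2)
    return (weights[:, None] * feats).sum(axis=0)
```

Because the kernel is a function of the continuous offset rather than a grid index, the same operator works for scattered LiDAR points and is trainable end to end.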
107. Deep Continuous Fusion for Multi-Sensor 3D Object
Detection
Continuous fusion layer: given a target pixel on BEV image, extract K nearest LIDAR points (S1); project the 3D
points onto the camera image plane (S2-3); this helps retrieve corresponding image features (S4); feed the image
features and continuous geometric offsets into an MLP to generate the feature for the target pixel (S5).
108. Deep Continuous Fusion for Multi-Sensor 3D Object
Detection
Qualitative
results on KITTI
Dataset.
109. End-to-end Learning of Multi-sensor 3D Tracking by
Detection
An approach to tracking by detection that can exploit both camera and LIDAR data to
produce very accurate 3D trajectories.
Towards this goal, formulate it as a linear program that can be solved exactly, and learn
convolutional networks for detection as well as matching in an end-to-end manner.
The system takes as input a time series of RGB frames and LIDAR point clouds. From these
inputs, the system produces discrete trajectories of the targets. In particular, an architecture that is e2e
trainable while still maintaining explainability is achieved by formulating the system in a structured manner.
110. End-to-end Learning of Multi-sensor 3D Tracking by
Detection
Forward passes over a set of detections from
two frames for both scoring and matching.
For each detection x_j, a forward pass of a Detection
Network is computed to produce θ^det_W(x_j), the cost of
using or discarding x_j according to the assignment to y^det_j.
For each pair of detections x_j and x_i from subsequent
frames, a forward pass of the Match Network is computed
to produce θ^link_W(x_i, x_j), the cost of linking or not linking
these two detections according to the assignment to y^link_{i,j}.
Finally, each detection might start a new trajectory or end an
existing one; the costs for this are computed via θ^new_W(x)
and θ^end_W(x), respectively, and are associated with the
assignments to y^new and y^end.
Formulate the problem as inference in a deep structured model (DSM), where the factors are computed
using a set of feed forward neural nets that exploit both camera and LIDAR data to compute both detection
and matching scores. Inference in the model can be done exactly by a set of feed forward processes
followed by solving a linear program. Learning is done e2e via minimization of a structured hinge loss,
optimizing simultaneously the detector and tracker.
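For a handful of detections, the association problem the paper writes as a linear program can be checked by exhaustive enumeration. This sketch minimises the same kind of cost (link, new, and end costs corresponding to the θ terms, here plain numbers); a real system solves the LP directly rather than enumerating:

```python
import itertools

def best_matching(link_costs, new_cost, end_cost):
    """Exhaustively solve a tiny two-frame association problem.

    link_costs[r][c]: cost of linking detection r (frame t) to c (frame t+1).
    Unmatched frame-t detections end a trajectory (end_cost each);
    unmatched frame-(t+1) detections start one (new_cost each).
    """
    n, m = len(link_costs), len(link_costs[0])
    best, best_assign = float("inf"), None
    # Enumerate all partial matchings between the two frames.
    for k in range(min(n, m) + 1):
        for rows in itertools.combinations(range(n), k):
            for cols in itertools.permutations(range(m), k):
                cost = sum(link_costs[r][c] for r, c in zip(rows, cols))
                cost += end_cost * (n - k) + new_cost * (m - k)
                if cost < best:
                    best, best_assign = cost, list(zip(rows, cols))
    return best, best_assign
```

Because the LP's constraint matrix for this association problem is totally unimodular, solving the relaxation yields the same integral optimum this brute-force search finds.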