Fusion of Camera and LiDAR for
Autonomous Vehicles I
(via Deep Learning)
Yu Huang
Sunnyvale, California
• A General Pipeline for 3D Detection of Vehicles
• Combining LiDAR Space Clustering and Convolutional Neural Networks for Pedestrian Detection
• Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection
• PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation
• RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement
• Joint 3D Proposal Generation and Object Detection from View Aggregation
• Frustum PointNets for 3D Object Detection from RGB-D Data
• Deep Continuous Fusion for Multi-Sensor 3D Object Detection
• Multi-View 3D Object Detection Network for Autonomous Driving
• End-to-end Learning of Multi-sensor 3D Tracking by Detection
A General Pipeline for 3D Detection of Vehicles
• Autonomous driving requires 3D perception of vehicles and other objects in the environment.
• Most current methods support 2D vehicle detection.
• This pipeline adopts any 2D detection network and fuses its output with a 3D point cloud to
generate 3D information, with minimal changes to the 2D detection network.
• To identify the 3D box, a model fitting algorithm is developed based on generalized car
models and score maps.
• A two-stage convolutional neural network (CNN) is proposed to refine the detected 3D box.
• Modifying an existing 2D network to fit into the pipeline takes minimal effort,
adding just one additional regression term at the output layer to estimate the vehicle
A General Pipeline for 3D Detection of Vehicles
The raw image is passed to a 2D detection network which provides 2D boxes around the vehicles in
the image plane. Subsequently, a set of 3D points which fall into the 2D bounding box after projection
is selected. With this set, a model fitting algorithm detects the 3D location and 3D bounding box of
the vehicle. A second CNN, which takes the points that fall inside the 3D bounding box
as input, then carries out the final 3D box regression and classification.
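The point selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the points are already expressed in the camera frame, and the 3x4 projection matrix P, the function names, and the sample values are all hypothetical.

```python
def project_point(P, xyz):
    """Project a 3D point with a 3x4 matrix P; returns (u, v) or None if behind the camera."""
    x, y, z = xyz
    row = [P[i][0] * x + P[i][1] * y + P[i][2] * z + P[i][3] for i in range(3)]
    depth = row[2]
    if depth <= 0:
        return None  # point is behind the image plane
    return row[0] / depth, row[1] / depth


def points_in_box(P, points, box):
    """Keep the 3D points whose projection falls inside a 2D box (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    selected = []
    for p in points:
        uv = project_point(P, p)
        if uv is None:
            continue
        u, v = uv
        if u_min <= u <= u_max and v_min <= v <= v_max:
            selected.append(p)
    return selected


# Hypothetical pinhole camera: focal length 100 px, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
points = [(0.0, 0.0, 10.0), (5.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
selected = points_in_box(P, points, (40, 40, 60, 60))
print(selected)
```

The selected subset is what a model fitting step would then consume; the fitting itself is not sketched here.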
Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
• This is a pedestrian detector that exploits LiDAR data, in addition to visual information.
• The hypothesis is that depth data, together with prior information about object size, can
reduce the search space by providing candidates and thereby speed up detection algorithms.
• A second hypothesis is that fixing the location and size of the candidate bounding boxes
in advance will also decrease the number of false detections.
• In the approach, LiDAR data is utilized to generate region proposals by processing the three
dimensional point cloud that it provides.
• These candidate regions are then further processed by a state-of-the-art CNN classifier that
is fine-tuned for pedestrian detection.
Combining LiDAR Space Clustering and Convolutional
Neural Networks for Pedestrian Detection
The algorithm is built upon the idea of clustering the 3-D point cloud of the LiDAR. It starts
by down-sampling the raw measurements, then removes the floor plane. Next, a
density-based clustering algorithm generates the candidates, which are projected into image
space to provide regions of interest.
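A rough sketch of the three point-cloud steps (down-sampling, floor removal, density-based clustering) is given below. It assumes a flat floor at a known height and uses a naive DBSCAN-style clustering; the function names, thresholds, and toy cloud are hypothetical, and the source does not state which clustering algorithm or parameters were used.

```python
from collections import deque


def voxel_downsample(points, voxel=0.2):
    """Keep the first point seen in each occupied voxel."""
    kept = {}
    for p in points:
        key = tuple(int(c // voxel) for c in p)
        kept.setdefault(key, p)
    return list(kept.values())


def remove_floor(points, floor_z=0.0, tol=0.15):
    """Drop points within `tol` of an assumed flat floor at height floor_z."""
    return [p for p in points if p[2] > floor_z + tol]


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dbscan(points, eps=0.5, min_pts=3):
    """Naive O(n^2) DBSCAN; returns a list of clusters, noise points are dropped."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if dist2(points[i], q) <= eps * eps]

    labels = [None] * len(points)  # None = unvisited, -1 = noise, >= 0 = cluster id
    clusters = []
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cid = len(clusters)
        clusters.append([points[i]])
        labels[i] = cid
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:          # former noise becomes a border point
                labels[j] = cid
                clusters[cid].append(points[j])
                continue
            if labels[j] is not None:    # already assigned
                continue
            labels[j] = cid
            clusters[cid].append(points[j])
            more = neighbors(j)
            if len(more) >= min_pts:     # core point: keep expanding
                queue.extend(more)
    return clusters


# Toy cloud: a strip of floor points and two upright point columns.
floor = [(float(x), 0.0, 0.0) for x in range(5)]
person_a = [(1.0, 1.0, z) for z in (0.5, 0.8, 1.1, 1.4)]
person_b = [(5.0, 5.0, z) for z in (0.5, 0.8, 1.1, 1.4)]
cloud = remove_floor(voxel_downsample(floor + person_a + person_b))
clusters = dbscan(cloud)
print([len(c) for c in clusters])
```

Each resulting cluster would then be projected into the image to obtain a candidate region for the CNN classifier; that projection is left out of this sketch.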
Fusing Bird’s Eye View LIDAR Point Cloud and Front
View Camera Image for Deep Object Detection
• This is a method for fusing LIDAR point cloud and camera-captured images in deep
convolutional neural networks (CNN).
• The method constructs a layer called sparse non-homogeneous pooling layer to transform
features between bird’s eye view and front view.
• The sparse point cloud is used to construct the mapping between the two views.
• The pooling layer allows efficient fusion of the multi-view features at any stage of the network.
• This is favorable for 3D object detection using camera-LIDAR fusion for autonomous driving.
• A corresponding one-stage detector is designed and tested on the KITTI bird’s eye view
object detection dataset, which produces 3D bounding boxes from the bird’s eye view
• The fusion method significantly improves both the speed and the accuracy of
pedestrian detection compared with other fusion-based object detection networks.
Fusing Bird’s Eye View LIDAR Point Cloud and Front
View Camera Image for Deep Object Detection
The sparse non-homogeneous pooling layer that fuses front view image features and bird’s eye view LIDAR features.
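One way to picture how a sparse point cloud can map features between the two views: each LiDAR point lands in exactly one bird's eye view cell and one image pixel, so it pairs the two. The sketch below averages image features into BEV cells through those pairs. It is an illustrative reading only, not the layer as published; `toy_project`, the cell size, and the tiny feature map are hypothetical stand-ins.

```python
from collections import defaultdict


def pool_image_to_bev(points, image_feat, project, cell_size=1.0):
    """Average image features into BEV cells, using LiDAR points as the view-to-view mapping.

    image_feat[v][u] is a feature vector; project(point) returns integer (u, v) or None.
    """
    sums = {}
    counts = defaultdict(int)
    height, width = len(image_feat), len(image_feat[0])
    for x, y, z in points:
        pixel = project((x, y, z))
        if pixel is None:
            continue
        u, v = pixel
        if not (0 <= u < width and 0 <= v < height):
            continue  # point projects outside the image
        cell = (int(x // cell_size), int(y // cell_size))
        feat = image_feat[v][u]
        acc = sums.setdefault(cell, [0.0] * len(feat))
        for k, f in enumerate(feat):
            acc[k] += f
        counts[cell] += 1
    return {cell: [s / counts[cell] for s in acc] for cell, acc in sums.items()}


def toy_project(p):
    """Hypothetical projection used only for this example (not a real camera model)."""
    return int(p[0]), int(2 * p[1])


image_feat = [[[1.0], [2.0]],
              [[3.0], [4.0]]]
points = [(0.2, 0.2, 1.0), (0.7, 0.6, 1.0), (1.5, 0.2, 1.0)]
bev = pool_image_to_bev(points, image_feat, toy_project)
print(bev)
```

Only cells that contain at least one point receive a feature, which is why the mapping is sparse.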
Fusing Bird’s Eye View LIDAR Point Cloud and Front
View Camera Image for Deep Object Detection
The fusion-based one-stage object detection network with MS-CNN networks.
PointFusion: Deep Sensor Fusion for 3D
Bounding Box Estimation
• PointFusion is a generic 3D object detection method that leverages both image and 3D point
cloud information.
• Unlike existing methods that either use multi-stage pipelines or hold sensor- and
dataset-specific assumptions, PointFusion is conceptually simple and application-agnostic.
• It consists of: an off-the-shelf CNN that extracts appearance and geometry features from
input RGB image crops, a variant of PointNet that processes the raw 3D point cloud, and a
fusion sub-network that combines the two outputs to predict 3D bounding boxes.
• The image data and the raw point cloud data are independently processed by a CNN and a
PointNet architecture, respectively.
• The resulting outputs are then combined by a fusion network, which predicts multiple 3D
box hypotheses and their confidences, using the input 3D points as spatial anchors.
PointFusion: Deep Sensor Fusion for 3D
Bounding Box Estimation
Two feature extractors: a PointNet variant that processes raw point cloud data (A), and a CNN that extracts visual
features from an input image (B). Two fusion network formulations: a vanilla global architecture that directly regresses
the box corner locations (D), and a dense architecture that predicts the spatial offset of each of the 8 corners relative to
an input point (C). For each input point, the dense network predicts the spatial offset (white arrows) from a corner (red dot)
to the input point (blue), and the prediction with the highest score is selected as the final prediction (E).
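The selection step of the dense architecture can be illustrated with a few lines. This sketch assumes the per-point offsets are stored as "corner minus point" (the sign convention is an assumption) and that the network outputs are already available as plain lists; `select_box` and the toy values are hypothetical.

```python
from itertools import product


def select_box(points, corner_offsets, scores):
    """Pick the highest-scoring input point and turn its 8 offsets into box corners.

    corner_offsets[i] holds 8 (dx, dy, dz) tuples for point i, assumed to be corner - point.
    """
    best = max(range(len(points)), key=scores.__getitem__)
    px, py, pz = points[best]
    return [(px + dx, py + dy, pz + dz) for dx, dy, dz in corner_offsets[best]]


# Toy predictions for two anchor points: each predicts a cube of half-size 1 around itself.
points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
cube = [tuple(float(v) for v in c) for c in product((-1, 1), repeat=3)]
corner_offsets = [cube, cube]
scores = [0.2, 0.9]
corners = select_box(points, corner_offsets, scores)
print(corners[0], corners[-1])
```

Using the input points as spatial anchors means each prediction is a small relative offset rather than an absolute position.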
RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
• RoarNet is an approach for 3D object detection from a 2D image and 3D Lidar point clouds.
• It is based on a two-stage object detection framework with PointNet as the backbone network.
• The first part, RoarNet 2D, estimates the 3D poses of objects from a monocular image, which
approximates where to examine further, and derives multiple candidates that are
geometrically feasible.
• This step significantly narrows down the feasible 3D regions, which would otherwise require
demanding processing of 3D point clouds in a huge search space.
• The second part, RoarNet 3D, takes the candidate regions and conducts in-depth inferences
to conclude final poses in a recursive manner.
• RoarNet 3D processes 3D point clouds without any loss of data, leading to precise detection.
RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
The model first predicts the 2D bounding boxes and 3D poses of objects from a 2D image. For each 2D object
detection, geometric agreement search is applied to predict the location of the object in 3D space. Centered on
each location prediction, a region proposal shaped like a standing cylinder is set. Taking the prediction error
in bounding box and pose into account, there can be multiple region proposals for a single object.
RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
• Each region proposal is responsible for detecting a single object.
• Taking the point clouds sampled from each region proposal as input, the model predicts the
location of an object relative to the center of region proposal, which recursively serves for
setting new region proposals for the next step.
• The model also predicts objectness score which reflects the probability of an object being
inside the region proposal.
• Only those proposals with high objectness scores are considered at the next step.
• At a final step, the model sets new region proposals at previously predicted locations.
• The model predicts all coordinates required for 3D bounding box regression including
location, rotation, and size of the objects.
• In practice, repeating this step more than once gives better detection performance.
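The recursive use of region proposals can be sketched as a short loop: sample the points inside a standing cylinder, predict an offset and an objectness score, re-center, and repeat. In the sketch below the trained network is replaced by `centroid_predictor`, a hypothetical stand-in, and the z-up axis convention, radius, height, and score threshold are all assumptions.

```python
def points_in_cylinder(points, center, radius, height):
    """Points inside a vertical cylinder (z assumed to be the up axis)."""
    cx, cy, cz = center
    return [p for p in points
            if (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2
            and abs(p[2] - cz) <= height / 2]


def centroid_predictor(inside, center):
    """Stand-in for the trained network: offset to the point centroid, score 1.0 if any points."""
    if not inside:
        return (0.0, 0.0, 0.0), 0.0
    n = len(inside)
    centroid = [sum(p[k] for p in inside) / n for k in range(3)]
    return tuple(centroid[k] - center[k] for k in range(3)), 1.0


def refine(points, center, predictor, radius=2.0, height=4.0, steps=2, min_score=0.5):
    """Re-center the proposal `steps` times; drop it if objectness falls below min_score."""
    for _ in range(steps):
        inside = points_in_cylinder(points, center, radius, height)
        offset, score = predictor(inside, center)
        if score < min_score:
            return None
        center = tuple(c + o for c, o in zip(center, offset))
    return center


cloud = [(9.5, 0.0, 0.0), (10.5, 0.0, 0.0), (10.0, 0.5, 0.0), (10.0, -0.5, 0.0)]
print(refine(cloud, (9.0, 0.0, 0.0), centroid_predictor))
print(refine(cloud, (100.0, 0.0, 0.0), centroid_predictor))
```

A proposal that starts off-center is pulled onto the object after the first step, and an empty proposal is discarded by its low score.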
RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
Architecture of RoarNet 2D
RoarNet: A Robust 3D Object Detection based
on RegiOn Approximation Refinement
The backbone network is a simplified version of PointNet without T-Net.
Joint 3D Proposal Generation and Object
Detection from View Aggregation
• AVOD is an Aggregate View Object Detection network for autonomous driving scenarios.
• The neural network architecture uses LIDAR point clouds and RGB images to generate
features that are shared by two subnetworks: a region proposal network (RPN) and a second
stage detector network.
• The RPN uses an architecture capable of performing multimodal feature fusion on high
resolution feature maps to generate reliable 3D object proposals for multiple object classes
in road scenes.
• Using these proposals, the second stage detection network performs accurate oriented 3D
bounding box regression and category classification to predict the extents, orientation, and
classification of objects in 3D space.
• The proposed architecture produces state-of-the-art results on the KITTI 3D object detection
benchmark while running in real time with a low memory footprint.
• Code is
Joint 3D Proposal Generation and Object
Detection from View Aggregation
The method’s architectural diagram. The feature extractors are shown in blue, the region proposal network in
pink, and the second stage detection network in green.
Joint 3D Proposal Generation and Object
Detection from View Aggregation
The architecture of the proposed high resolution feature extractor,
shown here for the image branch. Feature maps are
propagated from the encoder to the decoder section via red
arrows. Fusion is then performed at every stage of the decoder
by a learned upsampling layer, followed by concatenation, and
then mixing via a convolutional layer, resulting in a full
resolution feature map at the last layer of the decoder.
Joint 3D Proposal Generation and Object
Detection from View Aggregation
A visual comparison between the 8 corner box encoding, the
axis aligned box encoding, and our 4 corner encoding.
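To make the comparison concrete, the sketch below encodes an oriented box as four ground-plane corners plus two heights, which is 10 numbers instead of the 24 needed for eight free 3D corners. The axis conventions, the corner ordering, and the `encode_4c` signature are assumptions made for illustration.

```python
import math


def encode_4c(x, y, z_bottom, length, width, height, yaw, ground_z=0.0):
    """Encode an oriented box as 4 ground-plane corners and 2 heights above the ground.

    (x, y) is the box center on the ground plane and yaw the rotation about the up axis.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    half_l, half_w = length / 2, width / 2
    corners = []
    for dx, dy in ((half_l, half_w), (half_l, -half_w), (-half_l, -half_w), (-half_l, half_w)):
        # Rotate the local corner by yaw, then translate to the box center.
        corners.append((x + c * dx - s * dy, y + s * dx + c * dy))
    h1 = z_bottom - ground_z   # bottom face height above the ground plane
    h2 = h1 + height           # top face height above the ground plane
    return corners, h1, h2


corners, h1, h2 = encode_4c(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)
print(corners, h1, h2)
```

Because the four corners share the same two heights, the encoding keeps the top and bottom faces aligned by construction.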
Joint 3D Proposal Generation and Object
Detection from View Aggregation
Left: 3D region proposal network output, Middle: 3D detection output, and Right: the projection of the
detection output onto image space for all three classes. The 3D LIDAR point cloud has been colorized and
interpolated for better visualization.
Frustum PointNets for 3D Object Detection
from RGB-D Data
• This work studies 3D object detection from RGB-D data in both indoor and outdoor scenes.
• While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns
and invariances of 3D data, this method operates directly on raw point clouds obtained by popping up RGB-D data.
• However, a key challenge of this approach is how to efficiently localize objects in point
clouds of large-scale scenes (region proposal).
• Instead of solely relying on 3D proposals, this method leverages both mature 2D object
detectors and advanced 3D deep learning for object localization, achieving efficiency as well
as high recall for even small objects.
• Because it learns directly on raw point clouds, this method can also precisely
estimate 3D bounding boxes even under strong occlusion or with very sparse points.
• Evaluated on KITTI and SUN RGB-D 3D detection benchmarks.
Frustum PointNets for 3D Object Detection
from RGB-D Data
3D object detection pipeline. Given RGB-D data, first generate 2D object region proposals in the RGB
image using a CNN. Each 2D region is then extruded to a 3D viewing frustum, and the point
cloud inside it is obtained from the depth data. Finally, the frustum PointNet predicts an (oriented and amodal)
3D bounding box for the object from the points in the frustum.
Frustum PointNets for 3D Object Detection
from RGB-D Data
Frustum PointNets for 3D object detection. First, a 2D CNN object detector proposes 2D regions and
classifies their content. The 2D regions are then lifted to 3D and thus become frustum proposals. Given a point cloud
in a frustum (n × c with n points and c channels of XYZ, intensity etc. for each point), the object instance is
segmented by binary classification of each point. Based on the segmented object point cloud (m × c), a
lightweight regression PointNet (T-Net) tries to align points by translation such that their centroid is close to the amodal
box center. Finally, the box estimation net estimates the amodal 3D bounding box for the object.
Frustum PointNets for 3D Object Detection
from RGB-D Data
Coordinate systems for point cloud. Artificial points (black dots) are shown to
illustrate (a) default camera coordinate; (b) frustum coordinate after rotating
frustums to center view; (c) mask coordinate with object points’ centroid at
origin; (d) object coordinate predicted by T-Net.
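Steps (a) to (c) amount to a rotation followed by a translation, which the sketch below illustrates. It assumes a camera frame with x to the right, y down, and z forward, and takes the frustum direction as the ray through the 2D box center; step (d) depends on the trained T-Net and is not sketched. The function names are hypothetical.

```python
import math


def rotate_to_center_view(points, center_ray):
    """Rotate points about the camera y axis so that center_ray points along +z."""
    angle = math.atan2(center_ray[0], center_ray[2])  # heading of the frustum axis
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * z, y, s * x + c * z) for x, y, z in points]


def to_mask_coordinates(points):
    """Translate points so that their centroid sits at the origin."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in points]


# Two object points lying on a frustum whose central ray heads 45 degrees to the right.
frustum_points = [(2.0, 0.5, 2.0), (4.0, -0.5, 4.0)]
rotated = rotate_to_center_view(frustum_points, (1.0, 0.0, 1.0))
masked = to_mask_coordinates(rotated)
print(rotated)
print(masked)
```

After the rotation every frustum looks the same to the network regardless of where the 2D box sat in the image, which is the point of normalizing to a center view.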
Frustum PointNets for 3D Object Detection
from RGB-D Data
Basic architectures and IO for PointNets. Architecture is illustrated for PointNet++ (v2)
models with set abstraction layers and feature propagation layers (for segmentation).
Frustum PointNets for 3D Object Detection
from RGB-D Data
True positive detection boxes are in green, while false positive boxes are in red and ground truth
boxes in blue are shown for false positive and false negative cases. Digit and letter beside each box
denote instance id and semantic class, with “v” for cars, “p” for pedestrian and “c” for cyclist.
Frustum PointNets for 3D Object Detection
from RGB-D Data
Network architectures for Frustum PointNets. v1 models are based on PointNet. v2 models are based on PointNet++ set
abstraction (SA) and feature propagation (FP) layers. The architecture for residual center estimation T-Net is shared for v1
and v2. The colors (blue for segmentation nets, red for T-Net and green for box estimation nets) of the network background
indicate the coordinate system of the input point cloud. Segmentation nets operate in frustum coordinate, T-Net processes
points in mask coordinate while box estimation nets take points in object coordinate. The small yellow square (or bar)
concatenated with global features is class one-hot vector that tells the predicted category of the underlying object.
Deep Continuous Fusion for Multi-Sensor
3D Object Detection
• It remains an open problem to design 3D detectors that can better exploit multiple
• A 3D object detector can exploit both LIDAR as well as cameras to perform very accurate
• It reasons in bird’s eye view (BEV) and fuses image features by learning to project them
into BEV space.
• Towards this goal, an end-to-end learnable architecture exploits continuous convolutions to
fuse image and LIDAR feature maps at different levels of resolution.
• The proposed continuous fusion layer encodes both discrete-state image features and
continuous geometric information.
• This enables designing a reliable and efficient end-to-end learnable 3D object detector
based on multiple sensors.
Deep Continuous Fusion for Multi-Sensor
3D Object Detection
Architecture of model. There are two streams, namely the camera image stream and the BEV LIDAR stream.
Continuous fusion layers are used to fuse the image features onto the BEV feature maps.
Deep Continuous Fusion for Multi-Sensor
3D Object Detection
Continuous fusion layer: given a target pixel on the BEV image, first extract the K nearest LIDAR points; then project these
3D points onto the camera image plane to retrieve the corresponding image features; finally, feed the image
features plus the continuous geometry offset into an MLP to generate the feature for the target pixel.
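The data gathering in front of the MLP can be sketched as below. The MLP itself is learned and is left out. The offset is taken here as (dx, dy, z) under the assumption that the target pixel lies on the BEV plane at height zero, and `toy_project` plus the tiny feature map are hypothetical stand-ins for a calibrated camera and a CNN feature map.

```python
def continuous_fusion_inputs(bev_xy, lidar_points, image_feat, project, k=2):
    """Build one MLP input row per neighbor: image feature followed by the geometry offset."""
    def planar_dist2(p):
        return (p[0] - bev_xy[0]) ** 2 + (p[1] - bev_xy[1]) ** 2

    nearest = sorted(lidar_points, key=planar_dist2)[:k]   # K nearest points on the BEV plane
    rows = []
    for p in nearest:
        u, v = project(p)                                  # pixel hit by the LiDAR point
        feat = image_feat[v][u]
        offset = (p[0] - bev_xy[0], p[1] - bev_xy[1], p[2])
        rows.append(list(feat) + list(offset))
    return rows


def toy_project(p):
    """Hypothetical projection for this example only: picks a pixel of a 2x2 feature map."""
    return (0 if p[0] < 0 else 1), (0 if p[1] < 0 else 1)


image_feat = [[[0.1], [0.2]],
              [[0.3], [0.4]]]
lidar_points = [(1.0, 1.0, 0.5), (-1.0, -1.0, 0.2), (5.0, 5.0, 1.0)]
rows = continuous_fusion_inputs((0.5, 0.5), lidar_points, image_feat, toy_project)
print(rows)
```

Each row would be passed through the MLP and the outputs combined into the feature of the target BEV pixel.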
Deep Continuous Fusion for Multi-Sensor
3D Object Detection
The 2D bounding boxes are obtained by projecting the 3D detections onto the image.
The bounding box of an object is shown in the same color in the BEV and in the image.
Multi-View 3D Object Detection Network
for Autonomous Driving
• Multi-View 3D networks (MV3D) is a sensory-fusion framework that takes both LIDAR point
cloud and RGB images as input and predicts oriented 3D bounding boxes.
• It encodes the sparse 3D point cloud with a compact multi-view representation.
• The network is composed of two subnetworks: one for 3D object proposal generation and
another for multi-view feature fusion.
• The proposal network generates 3D candidate boxes efficiently from the bird’s eye view
representation of 3D point cloud.
• A deep fusion scheme combines region-wise features from multiple views and enables
interactions between intermediate layers of different paths.
Multi-View 3D Object Detection Network
for Autonomous Driving
Input features of the MV3D network.
Multi-View 3D Object Detection Network
for Autonomous Driving
The network takes the bird’s eye view and front view of the LIDAR point cloud as well as an image as input. It first
generates 3D object proposals from the bird’s eye view map and projects them to three views. A deep fusion
network is used to combine region-wise features obtained via ROI pooling for each view. The fused features
are used to jointly predict object class and do oriented 3D box regression.
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
• Multi-target tracking consists of identifying how many objects there are in each frame
and linking their trajectories over time.
• This is an approach to tracking by detection that can exploit both cameras and LIDAR
data to produce very accurate 3D trajectories.
• Towards this goal, it formulates the problem as inference in a deep structured model, where
the potentials are computed using convolutional neural nets.
• The matching cost of associating two detections exploits both appearance and motion via a
Siamese network that processes images and motion representations via convolutional layers.
• Inference in the model can be done exactly and efficiently by a set of feedforward passes
followed by solving a linear program.
• Importantly, the model is formulated such that it can be trained end-to-end to solve both
the detection and tracking problems.
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
This work formulates tracking as a system of multiple neural networks that are interwoven
in a single architecture. The system takes as external input a time series of RGB frames
(camera images) and LIDAR point clouds. From these inputs, the system produces discrete trajectories of the
targets. In particular, the architecture is end-to-end trainable while still maintaining explainability, which is
achieved by formulating the system in a structured manner.
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
Neural networks designed for both
scoring and matching: the forward passes
over a set of detections from two frames.
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
• To extract appearance features, employ a Siamese network based on VGG16.
• Note that in a Siamese setup, the two branches (each processing a detection) share the
same set of weights.
• This makes the architecture more efficient in terms of memory and allows learning with
fewer examples.
• In particular, resize each detection to be of dimension 224 × 224.
• To produce a concise representation of activations without using fully connected layers, each
of the max-pool outputs is passed through a product layer followed by a weighted sum,
which produces a single scalar for each max-pool layer, yielding an activation vector of size 5.
• Use skip-pooling as matching should exploit both low-level features (e.g., color) as well as
semantically richer features from higher layers.
• To incorporate spatial information into the model, employ fully connected architectures that
model both 2D and 3D motion.
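The reduction from five max-pool outputs to an activation vector of size 5 can be pictured as follows. This sketch reads the product layer as an element-wise product between the two Siamese branches, which is an interpretation rather than something the slides state; the flattened toy features and the weights are hypothetical.

```python
def layer_score(pool_a, pool_b, weights):
    """Element-wise product of the two branches, then a weighted sum down to one scalar."""
    return sum(w * a * b for w, a, b in zip(weights, pool_a, pool_b))


def activation_vector(pools_a, pools_b, weights_per_layer):
    """One scalar per max-pool layer; five layers give a vector of size 5."""
    return [layer_score(a, b, w) for a, b, w in zip(pools_a, pools_b, weights_per_layer)]


# Toy flattened max-pool outputs for the two detections, identical across the 5 layers.
pools_a = [[1.0, 2.0] for _ in range(5)]
pools_b = [[3.0, 4.0] for _ in range(5)]
weights = [[0.5, 0.5] for _ in range(5)]
vector = activation_vector(pools_a, pools_b, weights)
print(vector)
```

In a trained model the weights would be learned, and low-level and high-level layers would contribute differently to the match.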
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
• In particular, exploit 3D information in the form of a 180 × 200 occupancy grid in bird’s
eye view and 2D information from the occupancy region in the frontal view camera, scaled
down from the original resolution of 1242 × 375 to 124 × 37.
• In bird’s eye perspective, each 3D detection is projected onto a ground plane, leaving only
a rotated rectangle that reflects its occupancy in the world.
• Since the observer is a mobile platform (an autonomous vehicle, in this case), the coordinate
system between two subsequent frames would be shifted because the observer moved in
the time elapsed.
• Since its speed in each axis is known from the IMU data, one can calculate the displacement
of the observer between each observation and translate the coordinates accordingly; this
way, both grids are on the exact same coordinate system.
• The frontal view perspective encodes the rectangular area in the camera occupied by the
target, equivalent to projecting the 3D bounding box onto camera coordinates.
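The ego-motion compensation between two bird's eye view grids reduces to a translation when rotation is ignored. The sketch below makes that simplifying assumption (translation only, constant velocity over the interval); the function name and sample numbers are hypothetical.

```python
def compensate_ego_motion(detections_xy, velocity_xy, dt):
    """Express previous-frame ground-plane positions in the current observer frame.

    Assumes pure translation: the observer moved by velocity * dt between the frames.
    """
    dx, dy = velocity_xy[0] * dt, velocity_xy[1] * dt
    return [(x - dx, y - dy) for x, y in detections_xy]


# A static object 10 m ahead; the observer drives forward at 5 m/s for 0.1 s.
previous = [(10.0, 0.0)]
aligned = compensate_ego_motion(previous, (5.0, 0.0), 0.1)
print(aligned)
```

After this shift a static object occupies the same cells in both grids, so any remaining displacement reflects the motion of the target itself.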
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
Detector: MV3D
End-to-end Learning of Multi-sensor 3D
Tracking by Detection
Trajectories are color coded, such that having the same color means it’s the same object.
Understanding the world in 3D with AI.pdfUnderstanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdf
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
Remote Sensing Field Camp 2016
Remote Sensing Field Camp 2016 Remote Sensing Field Camp 2016
Remote Sensing Field Camp 2016
Rapid Laser Scanning the process
Rapid Laser Scanning the processRapid Laser Scanning the process
Rapid Laser Scanning the process


fusion of Camera and lidar for autonomous driving I

  • 1. Fusion of Camera and LiDAR for Autonomous Vehicles I (via Deep Learning) Yu Huang Sunnyvale, California
  • 2. Outline • A General Pipeline for 3D Detection of Vehicles • Combining LiDAR Space Clustering and Convolutional Neural Networks for Pedestrian Detection • Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection • PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation • RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement • Joint 3D Proposal Generation and Object Detection from View Aggregation • Frustum PointNets for 3D Object Detection from RGB-D Data • Deep Continuous Fusion for Multi-Sensor 3D Object Detection • Multi-View 3D Object Detection Network for Autonomous Driving • End-to-end Learning of Multi-sensor 3D Tracking by Detection
  • 3. A General Pipeline for 3D Detection of Vehicles • Autonomous driving requires 3D perception of vehicles and other objects in the environment. • Most current methods support 2D vehicle detection. • Here is a pipeline that adopts any 2D detection network and fuses it with a 3D point cloud to generate 3D information with minimal changes to the 2D detection network. • To identify the 3D box, a model fitting algorithm is developed based on generalized car models and score maps. • A two-stage convolutional neural network (CNN) is proposed to refine the detected 3D box. • Fitting an existing 2D network into the pipeline takes minimal effort: just one additional regression term at the output layer to estimate the vehicle dimensions.
  • 4. A General Pipeline for 3D Detection of Vehicles The raw image is passed to a 2D detection network, which provides 2D boxes around the vehicles in the image plane. Subsequently, the set of 3D points that fall into the 2D bounding box after projection is selected. With this set, a model fitting algorithm detects the 3D location and 3D bounding box of the vehicle. Then another CNN, which takes the points that fit into the 3D bounding box as input, carries out the final 3D box regression and classification.
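The point-selection step on slide 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `points_in_2d_box`, the assumption that the points are already expressed in the camera frame, and the 3x4 projection matrix `P` are all hypothetical choices made for the example.

```python
import numpy as np

def points_in_2d_box(points_xyz, P, box):
    """Return the 3D points whose image projection falls inside a 2D box.

    points_xyz: (N, 3) points, assumed to be in the camera frame already.
    P:          (3, 4) camera projection matrix (assumed known from calibration).
    box:        (x_min, y_min, x_max, y_max) in pixels.
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])  # homogeneous coordinates, (N, 4)
    uvw = homo @ P.T                                 # projected coordinates, (N, 3)
    in_front = uvw[:, 2] > 1e-6                      # discard points behind the camera
    depth = np.where(in_front, uvw[:, 2], 1.0)       # placeholder depth avoids division by zero
    u = uvw[:, 0] / depth
    v = uvw[:, 1] / depth
    x_min, y_min, x_max, y_max = box
    inside = in_front & (u >= x_min) & (u <= x_max) & (v >= y_min) & (v <= y_max)
    return points_xyz[inside]
```

The returned subset is what a model fitting step would consume; points behind the camera are rejected explicitly because their projection can otherwise land inside the box by accident.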
  • 5. Combining LiDAR Space Clustering and Convolutional Neural Networks for Pedestrian Detection • This is a pedestrian detector that exploits LiDAR data in addition to visual information. • The hypothesis is that depth data and prior information about object size can reduce the search space by providing candidates, thereby speeding up detection algorithms. • A further hypothesis is that this prior definition of the location and size of the candidate bounding box will also decrease the number of false detections. • In the approach, LiDAR data is used to generate region proposals by processing the three-dimensional point cloud that it provides. • These candidate regions are then further processed by a state-of-the-art CNN classifier that is fine-tuned for pedestrian detection.
  • 6. Combining LiDAR Space Clustering and Convolutional Neural Networks for Pedestrian Detection The algorithm is built upon the idea of clustering the 3D point cloud of the LiDAR. It starts by down-sampling the raw measurements, followed by removal of the floor plane. Then, a density-based clustering algorithm generates the candidates, which are projected onto the image space to provide regions of interest.
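A rough sketch of the three steps on slide 6 (down-sampling, floor removal, density-based clustering) is given below. Several things here are assumptions for illustration only: the slide does not name the clustering algorithm, so a minimal DBSCAN is used; the floor is removed with a simple height threshold on an assumed flat floor rather than a fitted plane; z is assumed to point up; and all function names and parameter values are hypothetical.

```python
import numpy as np

def voxel_downsample(points, voxel=0.2):
    """Keep one point (the first encountered) per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first)]

def remove_floor(points, floor_z=0.0, margin=0.15):
    """Drop points close to an assumed flat floor at height floor_z (z is up)."""
    return points[points[:, 2] > floor_z + margin]

def dbscan(points, eps=0.6, min_pts=3):
    """Minimal O(N^2) DBSCAN. Returns one label per point, -1 for noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point; may still join a cluster as a border point
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # claim unlabeled core or border point
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    stack.extend(neighbors[j])  # expand only through core points
        cluster += 1
    return labels
```

Each resulting cluster would then be projected onto the image to give a candidate region for the CNN classifier. The quadratic distance matrix is only practical for small point sets; a spatial index would be needed at real LiDAR scale.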
  • 7. Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection • This is a method for fusing LIDAR point cloud and camera-captured images in deep convolutional neural networks (CNN). • The method constructs a layer called sparse non-homogeneous pooling layer to transform features between bird’s eye view and front view. • The sparse point cloud is used to construct the mapping between the two views. • The pooling layer allows efficient fusion of the multi-view features at any stage of the network. • This is favorable for 3D object detection using camera-LIDAR fusion for autonomous driving. • A corresponding one-stage detector is designed and tested on the KITTI bird’s eye view object detection dataset, which produces 3D bounding boxes from the bird’s eye view map. • The fusion method shows significant improvement on both speed and accuracy of the pedestrian detection over other fusion-based object detection networks.
  • 8. Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection The sparse non-homogeneous pooling layer that fuses front view image features and bird’s eye view LIDAR features.
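The idea of using the sparse point cloud as the mapping between the two views can be sketched as below. This is only an illustration under stated assumptions: each LiDAR point is assumed to come with a precomputed front-view pixel and bird's eye view cell, features are pooled by averaging (the slides do not state the pooling operator), and the function name `pool_front_to_bev` is hypothetical.

```python
import numpy as np

def pool_front_to_bev(front_feat, pix_uv, bev_rc, bev_shape):
    """Transfer front-view features to a bird's eye view grid through LiDAR points.

    front_feat: (H, W, C) front-view feature map.
    pix_uv:     (N, 2) integer (column, row) pixel hit by each LiDAR point.
    bev_rc:     (N, 2) integer (row, column) BEV cell containing each LiDAR point.
    bev_shape:  (rows, cols) of the BEV grid.
    Cells containing no point stay zero; cells with several points are averaged.
    """
    rows, cols = bev_shape
    channels = front_feat.shape[2]
    sums = np.zeros((rows * cols, channels))
    counts = np.zeros(rows * cols)
    gathered = front_feat[pix_uv[:, 1], pix_uv[:, 0]]  # one feature vector per point, (N, C)
    flat = bev_rc[:, 0] * cols + bev_rc[:, 1]          # flattened BEV cell index per point
    np.add.at(sums, flat, gathered)                    # unbuffered scatter-add handles repeated cells
    np.add.at(counts, flat, 1.0)
    filled = counts > 0
    sums[filled] /= counts[filled, None]
    return sums.reshape(rows, cols, channels)
```

Because the correspondence is defined only where LiDAR points exist, the mapping is sparse, and because several pixels may land in one BEV cell (or none), it is non-homogeneous.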
  • 9. Fusing Bird’s Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection The fusion-based one-stage object detection network with MS-CNN networks.
  • 10. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation • PointFusion is a generic 3D object detection method that leverages both image and 3D point cloud information. • Unlike existing methods that either use multi-stage pipelines or hold sensor- and dataset-specific assumptions, PointFusion is conceptually simple and application-agnostic. • It consists of an off-the-shelf CNN that extracts appearance and geometry features from input RGB image crops, a variant of PointNet that processes the raw 3D point cloud, and a fusion sub-network that combines the two outputs to predict 3D bounding boxes. • The image data and the raw point cloud data are independently processed by a CNN and a PointNet architecture, respectively. • The resulting outputs are then combined by a fusion network, which predicts multiple 3D box hypotheses and their confidences, using the input 3D points as spatial anchors.
  • 11. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation Two feature extractors: a PointNet variant that processes raw point cloud data (A), and a CNN that extracts visual features from an input image (B). Two fusion network formulations: a vanilla global architecture that directly regresses the box corner locations (D), and a dense architecture that predicts the spatial offset of each of the 8 corners relative to an input point, as illustrated in (C): for each input point, the network predicts the spatial offset (white arrows) from a corner (red dot) to the input point (blue), and selects the prediction with the highest score as the final prediction (E).
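The final selection step of the dense architecture can be sketched as follows, assuming the network outputs are already available as arrays. The function name `dense_fusion_select` is hypothetical, and the sign convention (offset = corner minus anchor point) is an assumption made so that corners are recovered by addition.

```python
import numpy as np

def dense_fusion_select(points, corner_offsets, scores):
    """Pick the box hypothesis of the highest-scoring anchor point.

    points:         (N, 3) input 3D points used as spatial anchors.
    corner_offsets: (N, 8, 3) predicted offsets, assumed to be corner minus point.
    scores:         (N,) confidence of each per-point hypothesis.
    Returns the (8, 3) corners of the selected box and the index of its anchor point.
    """
    best = int(np.argmax(scores))
    corners = points[best][None, :] + corner_offsets[best]
    return corners, best
```

Every input point proposes a full box, so the 3D points act as anchors and no hand-designed anchor boxes are needed in this formulation.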
  • 12. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement • RoarNet is an approach for 3D object detection from a 2D image and 3D LiDAR point clouds. • It is based on a two-stage object detection framework with PointNet as the backbone network. • The first part, RoarNet 2D, estimates the 3D poses of objects from a monocular image, which approximates where to examine further, and derives multiple candidates that are geometrically feasible. • This step significantly narrows down the feasible 3D regions, which would otherwise require demanding processing of 3D point clouds in a huge search space. • The second part, RoarNet 3D, takes the candidate regions and conducts in-depth inferences to conclude final poses in a recursive manner. • RoarNet 3D processes 3D point clouds without any loss of data, leading to precise detection.
  • 13. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement The model first predicts the 2D bounding boxes and 3D poses of objects from a 2D image. For each 2D object detection, a geometric agreement search is applied to predict the location of the object in 3D space. Centered on each location prediction, a region proposal in the shape of a standing cylinder is set. Because the bounding box and pose predictions carry errors, there can be multiple region proposals for a single object.
  • 14. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement • Each region proposal is responsible for detecting a single object. • Taking the point clouds sampled from each region proposal as input, the model predicts the location of an object relative to the center of the region proposal, which recursively serves for setting new region proposals for the next step. • The model also predicts an objectness score, which reflects the probability of an object being inside the region proposal. • Only proposals with high objectness scores are considered at the next step. • At the final step, the model sets new region proposals at the previously predicted locations. • The model predicts all coordinates required for 3D bounding box regression, including the location, rotation, and size of the objects. • In practice, repeating this step more than once gives better detection performance.
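Sampling the points of one standing-cylinder region proposal can be sketched as below. The function name `crop_standing_cylinder`, the z-up axis convention, and the explicit radius and half-height parameters are assumptions for illustration; the slides do not give the cylinder dimensions.

```python
import numpy as np

def crop_standing_cylinder(points, center, radius, half_height):
    """Return the points inside a vertical cylinder, expressed relative to its center.

    points:      (N, 3) LiDAR points, with z assumed to be the vertical axis.
    center:      (3,) predicted object location used as the proposal center.
    radius:      cylinder radius in the horizontal plane.
    half_height: half of the cylinder height along z.
    """
    center = np.asarray(center, dtype=float)
    radial = np.linalg.norm(points[:, :2] - center[:2], axis=1)  # horizontal distance to the axis
    vertical = np.abs(points[:, 2] - center[2])                  # vertical distance to the center
    keep = (radial <= radius) & (vertical <= half_height)
    return points[keep] - center
```

Returning coordinates relative to the proposal center makes the cropped set independent of where the proposal sits in the scene.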
  • 15. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement Architecture of RoarNet 2D
  • 16. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement The backbone network is a simplified version of PointNet without T-Net.
  • 17. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement
  • 18. Joint 3D Proposal Generation and Object Detection from View Aggregation • AVOD is an Aggregate View Object Detection network for autonomous driving scenarios. • The neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. • The RPN uses an architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. • Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. • The proposed architecture produces state-of-the-art results on the KITTI 3D object detection benchmark while running in real time with a low memory footprint. • Code is
  • 19. Joint 3D Proposal Generation and Object Detection from View Aggregation The method’s architectural diagram. The feature extractors are shown in blue, the region proposal network in pink, and the second stage detection network in green.
  • 20. Joint 3D Proposal Generation and Object Detection from View Aggregation The architecture of the proposed high resolution feature extractor, shown here for the image branch. Feature maps are propagated from the encoder to the decoder section via red arrows. Fusion is then performed at every stage of the decoder by a learned upsampling layer, followed by concatenation, and then mixing via a convolutional layer, resulting in a full resolution feature map at the last layer of the decoder.
  • 21. Joint 3D Proposal Generation and Object Detection from View Aggregation A visual comparison between the 8 corner box encoding, the axis aligned box encoding, and our 4 corner encoding.
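To make the comparison on slide 21 concrete, the sketch below encodes an oriented box as four ground-plane corners plus two heights, which gives 10 values instead of the 24 needed for eight full 3D corners. The parameterization of the input box, the corner ordering, and the function name `encode_4_corners` are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def encode_4_corners(cx, cy, length, width, yaw, h_bottom, h_top):
    """Encode an oriented box as 4 ground-plane corners and 2 heights (10 values).

    (cx, cy):        box center in the ground plane.
    length, width:   box extents along and across its heading.
    yaw:             heading angle in radians.
    h_bottom, h_top: heights of the bottom and top faces above the ground plane.
    Output layout (assumed): [x1, x2, x3, x4, y1, y2, y3, y4, h_bottom, h_top].
    """
    dx = np.array([length, length, -length, -length]) / 2.0
    dy = np.array([width, -width, -width, width]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    xs = cx + c * dx - s * dy  # rotate the corner offsets by yaw, then translate
    ys = cy + s * dx + c * dy
    return np.concatenate([xs, ys, [h_bottom, h_top]])
```

Compared with an axis-aligned encoding, the four corners still carry the box orientation, while sharing two heights across all corners keeps the top and bottom faces parallel to the ground.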
  • 22. Joint 3D Proposal Generation and Object Detection from View Aggregation Left: 3D region proposal network output, Middle: 3D detection output, and Right: the projection of the detection output onto image space for all three classes. The 3D LIDAR point cloud has been colorized and interpolated for better visualization.
  • 23. Frustum PointNets for 3D Object Detection from RGB-D Data • 3D object detection from RGB- D data in both indoor and outdoor scenes. • While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, here directly operate on raw point clouds by popping up RGB-D scans. • However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). • Instead of solely relying on 3D proposals, this method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. • Benefited from learning directly in raw point clouds, this method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. • Evaluated on KITTI and SUN RGB-D 3D detection benchmarks.
  • 24. Frustum PointNets for 3D Object Detection from RGB-D Data 3D object detection pipeline. Given RGB-D data, first generate 2D object region proposals in the RGB image using a CNN. Each 2D region is then extruded to a 3D viewing frustum in which to get a point cloud from depth data. Finally, the frustum PointNet predicts a (oriented and amodal) 3D bounding box for the object from the points in frustum.
  • 25. Frustum PointNets for 3D Object Detection from RGB-D Data Frustum PointNets for 3D object detection. First leverage a 2D CNN object detector to propose 2D regions and classify their content. 2D regions are then lifted to 3D and thus become frustum proposals. Given a point cloud in a frustum (n × c with n points and c channels of XYZ, intensity etc. for each point), the object instance is segmented by binary classification of each point. Based on the segmented object point cloud (m × c), a lightweight regression PointNet (T-Net) tries to align points by translation such that their centroid is close to the amodal box center. Finally, the box estimation net estimates the amodal 3D bounding box for the object.
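The lifting of a 2D region to a frustum proposal can be sketched as a point filter: keep every 3D point whose image projection falls inside the 2D box. This is a minimal sketch assuming a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and points already expressed in camera coordinates (x right, y down, z forward); none of these names come from the slides.

```python
def points_in_frustum(points, box2d, fx, fy, cx, cy):
    """Select points whose pinhole projection lies inside a 2D box.

    points: iterable of tuples (x, y, z, ...extra channels) in camera
            coordinates; extra channels such as intensity are kept.
    box2d:  (u_min, v_min, u_max, v_max) in pixels.
    fx, fy, cx, cy: assumed pinhole intrinsics (hypothetical names).
    """
    u_min, v_min, u_max, v_max = box2d
    selected = []
    for p in points:
        x, y, z = p[0], p[1], p[2]
        if z <= 0.0:
            # Points behind the camera cannot project into the image.
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            selected.append(p)
    return selected


cloud = [(0.0, 0.0, 10.0, 0.5), (5.0, 0.0, 10.0, 0.1), (0.0, 0.0, -1.0, 0.2)]
frustum_points = points_in_frustum(cloud, (40, 40, 60, 60), 100.0, 100.0, 50.0, 50.0)
print(len(frustum_points))
```

The result is the n × c frustum point cloud that the segmentation network would then classify point by point.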
  • 26. Frustum PointNets for 3D Object Detection from RGB-D Data Coordinate systems for point cloud. Artificial points (black dots) are shown to illustrate (a) default camera coordinate; (b) frustum coordinate after rotating frustums to center view; (c) mask coordinate with object points’ centroid at origin; (d) object coordinate predicted by T-Net.
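Steps (b) and (c) of the coordinate normalization can be sketched as a rotation about the camera's vertical axis followed by a centroid subtraction. This is a minimal sketch under the assumption that the frustum angle is measured as `atan2(x, z)` of the frustum's central ray in camera coordinates; the function names are hypothetical.

```python
import math


def rotate_to_center_view(points, frustum_angle):
    """Rotate (x, y, z) points about the y axis so the frustum's
    central ray (at frustum_angle from +z) maps onto the +z axis."""
    c, s = math.cos(frustum_angle), math.sin(frustum_angle)
    return [(c * x - s * z, y, s * x + c * z) for x, y, z in points]


def to_mask_coordinates(points):
    """Translate points so that their centroid sits at the origin.
    Returns the shifted points and the centroid that was subtracted."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    shifted = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    return shifted, centroid


# A point on the central ray of a frustum 45 degrees to the right.
angle = math.atan2(1.0, 1.0)
print(rotate_to_center_view([(1.0, 0.0, 1.0)], angle))
print(to_mask_coordinates([(0.0, 0.0, 1.0), (0.0, 0.0, 3.0)]))
```

Step (d) would apply one more translation of the same kind, using the center residual predicted by T-Net instead of the centroid.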
  • 27. Frustum PointNets for 3D Object Detection from RGB-D Data Basic architectures and IO for PointNets. Architecture is illustrated for PointNet++ (v2) models with set abstraction layers and feature propagation layers (for segmentation).
  • 28. Frustum PointNets for 3D Object Detection from RGB-D Data True positive detection boxes are in green, while false positive boxes are in red and ground truth boxes in blue are shown for false positive and false negative cases. Digit and letter beside each box denote instance id and semantic class, with “v” for cars, “p” for pedestrian and “c” for cyclist.
  • 29. Frustum PointNets for 3D Object Detection from RGB-D Data Network architectures for Frustum PointNets. v1 models are based on PointNet. v2 models are based on PointNet++ set abstraction (SA) and feature propagation (FP) layers. The architecture for residual center estimation T-Net is shared for v1 and v2. The colors (blue for segmentation nets, red for T-Net and green for box estimation nets) of the network background indicate the coordinate system of the input point cloud. Segmentation nets operate in frustum coordinate, T-Net processes points in mask coordinate while box estimation nets take points in object coordinate. The small yellow square (or bar) concatenated with global features is a one-hot class vector that gives the predicted category of the underlying object.
  • 30. Deep Continuous Fusion for Multi-Sensor 3D Object Detection • It remains an open problem to design 3D detectors that can better exploit multiple modalities. • A 3D object detector can exploit both LIDAR and cameras to perform very accurate localization. • It reasons in bird’s eye view (BEV) and fuses image features by learning to project them into BEV space. • Towards this goal, an end-to-end learnable architecture exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. • The proposed continuous fusion layer encodes both discrete-state image features and continuous geometric information. • This enables designing a reliable and efficient end-to-end learnable 3D object detector based on multiple sensors.
  • 31. Deep Continuous Fusion for Multi-Sensor 3D Object Detection Architecture of model. There are two streams, namely the camera image stream and the BEV LIDAR stream. Continuous fusion layers are used to fuse the image features onto the BEV feature maps.
  • 32. Deep Continuous Fusion for Multi-Sensor 3D Object Detection Continuous fusion layer: given a target pixel on the BEV image, first extract the K nearest LIDAR points; then project these 3D points onto the camera image plane, which retrieves the corresponding image features; finally feed the image feature + continuous geometry offset into an MLP to generate the feature for the target pixel.
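The four steps of the continuous fusion layer can be sketched in plain Python. This is a minimal sketch with several simplifying assumptions that are not from the slides: neighbors are ranked by distance in the BEV plane, image features are read at the nearest pixel rather than interpolated, the geometry offset is taken in the BEV plane only, and the MLP is replaced by a single linear layer with ReLU whose outputs are summed over the neighbors. `project`, `weights`, and `bias` are hypothetical stand-ins for the calibrated camera projection and the learned parameters.

```python
def continuous_fusion_inputs(target_xy, lidar_points, image_features, project, k=2):
    """Build one MLP input vector per neighbor of a BEV target pixel.

    target_xy:      (x, y) position of the BEV pixel in LIDAR coordinates.
    lidar_points:   list of (x, y, z) points.
    image_features: nested list indexed as [row][col] -> feature list.
    project:        callable mapping (x, y, z) to image (u, v); assumed.
    """
    tx, ty = target_xy
    # Step 1: K nearest LIDAR points, ranked by BEV-plane distance.
    nearest = sorted(
        lidar_points, key=lambda p: (p[0] - tx) ** 2 + (p[1] - ty) ** 2
    )[:k]
    rows, cols = len(image_features), len(image_features[0])
    inputs = []
    for x, y, z in nearest:
        # Step 2: project the 3D point onto the image plane.
        u, v = project((x, y, z))
        row, col = int(round(v)), int(round(u))
        if 0 <= row < rows and 0 <= col < cols:
            # Step 3: retrieve the image feature (nearest pixel) and
            # append the geometric offset to the target pixel.
            inputs.append(list(image_features[row][col]) + [x - tx, y - ty])
    return inputs


def fuse(inputs, weights, bias):
    """Step 4 stand-in: linear layer + ReLU per neighbor, summed."""
    fused = [0.0] * len(bias)
    for vec in inputs:
        for j, (w_row, b) in enumerate(zip(weights, bias)):
            fused[j] += max(0.0, sum(w * x for w, x in zip(w_row, vec)) + b)
    return fused


features = [[[1.0], [2.0]], [[3.0], [4.0]]]
points = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (5.0, 5.0, 1.0)]
toy_project = lambda p: (p[0], p[1])  # toy projection for the demo only
vectors = continuous_fusion_inputs((0.0, 0.0), points, features, toy_project)
print(fuse(vectors, [[1.0, 1.0, 1.0]], [0.0]))
```

Running this for every BEV pixel yields an image-derived feature map aligned with the BEV LIDAR feature map, which is what allows the two streams to be combined.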
  • 33. Deep Continuous Fusion for Multi-Sensor 3D Object Detection The 2D bounding boxes are obtained by projecting the 3D detections onto the image. The bounding boxes of an object in the BEV and in the image are shown in the same color.
  • 34. Multi-View 3D Object Detection Network for Autonomous Driving • Multi-View 3D networks (MV3D) is a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. • It encodes the sparse 3D point cloud with a compact multi-view representation. • The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. • The proposal network generates 3D candidate boxes efficiently from the bird’s eye view representation of 3D point cloud. • A deep fusion scheme combines region-wise features from multiple views and enables interactions between intermediate layers of different paths.
  • 35. Multi-View 3D Object Detection Network for Autonomous Driving Input features of the MV3D network.
  • 36. Multi-View 3D Object Detection Network for Autonomous Driving The network takes the bird’s eye view and front view of the LIDAR point cloud as well as an image as input. It first generates 3D object proposals from the bird’s eye view map and projects them to the three views. A deep fusion network is used to combine region-wise features obtained via ROI pooling for each view. The fused features are used to jointly predict object class and do oriented 3D box regression.
  • 37. End-to-end Learning of Multi-sensor 3D Tracking by Detection • This task, commonly referred to as multi-target tracking, consists of identifying how many objects there are in each frame and linking their trajectories over time. • It is an approach to tracking by detection that can exploit both camera and LIDAR data to produce very accurate 3D trajectories. • Towards this goal, it formulates the problem as inference in a deep structured model, where the potentials are computed using convolutional neural nets. • The matching cost of associating two detections exploits both appearance and motion via a Siamese network that processes images and motion representations via convolutional layers. • Inference in the model can be done exactly and efficiently by a set of feedforward passes followed by solving a linear program. • Importantly, the model is formulated such that it can be trained end-to-end to solve both the detection and tracking problems.
  • 38. End-to-end Learning of Multi-sensor 3D Tracking by Detection This work formulates tracking as a system containing multiple neural networks that are interwoven in a single architecture. Note that the system takes as external input a time series of RGB frames (camera images) and LIDAR point clouds. From these inputs, the system produces discrete trajectories of the targets. In particular, the architecture is end-to-end trainable while still maintaining explainability, which is achieved by formulating the system in a structured manner.
  • 39. End-to-end Learning of Multi-sensor 3D Tracking by Detection Neural networks designed for both scoring and matching: the forward passes over a set of detections from two frames.
  • 40. End-to-end Learning of Multi-sensor 3D Tracking by Detection • To extract appearance features, employ a Siamese network based on VGG16. • Note that in a Siamese setup, the two branches (each processing a detection) share the same set of weights. • This makes the architecture more efficient in terms of memory and allows learning with fewer examples. • In particular, resize each detection to be of dimension 224 × 224. • To produce a concise representation of activations without using fully connected layers, each of the max-pool outputs is passed through a product layer followed by a weighted sum, which produces a single scalar for each max-pool layer, yielding an activation vector of size 5. • Use skip-pooling as matching should exploit both low-level features (e.g., color) and semantically richer features from higher layers. • To incorporate spatial information into the model, employ fully connected architectures that model both 2D and 3D motion.
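The reduction from five max-pool outputs to an activation vector of size 5 can be sketched as follows. This is a minimal sketch assuming the product layer is an element-wise product of the two branches' activations and that each max-pool output has been flattened to a list; the function name and the explicit per-layer `weights` lists are hypothetical.

```python
def match_activation(branch_a, branch_b, weights):
    """Reduce paired max-pool outputs to one scalar per layer.

    branch_a, branch_b: lists of flattened max-pool outputs, one entry
                        per pooling layer, from the two Siamese branches.
    weights:            per-layer weight lists of matching length
                        (learned in practice, supplied here by hand).
    """
    vector = []
    for feat_a, feat_b, layer_w in zip(branch_a, branch_b, weights):
        # Product layer: element-wise product of the two branches.
        product = [a * b for a, b in zip(feat_a, feat_b)]
        # Weighted sum collapses the layer to a single scalar.
        vector.append(sum(w * p for w, p in zip(layer_w, product)))
    return vector


# Toy activations for five pooling layers of different sizes.
a = [[1.0, 2.0], [1.0], [0.0, 1.0, 2.0], [3.0], [1.0, 1.0]]
b = [[2.0, 2.0], [4.0], [1.0, 1.0, 1.0], [0.0], [1.0, 2.0]]
w = [[1.0, 1.0], [1.0], [1.0, 1.0, 1.0], [1.0], [1.0, 1.0]]
print(match_activation(a, b, w))
```

Because every pooling layer contributes one entry, the vector mixes low-level and high-level evidence, which matches the skip-pooling motivation on the slide.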
  • 41. End-to-end Learning of Multi-sensor 3D Tracking by Detection • In particular, exploit 3D information in the form of a 180 × 200 occupancy grid in bird’s eye view and 2D information from the occupancy region in the frontal view camera, scaled down from the original resolution of 1242 × 375 to 124 × 37. • In bird’s eye perspective, each 3D detection is projected onto a ground plane, leaving only a rotated rectangle that reflects its occupancy in the world. • Since the observer is a mobile platform (an autonomous vehicle, in this case), the coordinate system is shifted between two subsequent frames because the observer moved in the time elapsed. • Since its speed along each axis is known from the IMU data, one can calculate the displacement of the observer between observations and translate the coordinates accordingly; this way, both grids are in exactly the same coordinate system. • The frontal view perspective encodes the rectangular area in the camera occupied by the target, equivalent to projecting the 3D bounding box onto camera coordinates.
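The ego-motion compensation and the frontal-view downscaling can be sketched as two small helpers. This is a minimal sketch that, like the slide, treats the correction as a pure translation (displacement = velocity × elapsed time) and ignores any rotation of the vehicle; the function names, the 2D ground-plane convention, and the example velocity are illustrative assumptions.

```python
def compensate_ego_motion(points_xy, velocity_xy, dt):
    """Express points observed in the later frame in the earlier
    frame's coordinates by adding the observer's displacement."""
    dx, dy = velocity_xy[0] * dt, velocity_xy[1] * dt
    return [(x + dx, y + dy) for x, y in points_xy]


def scale_box(box, src_size=(1242, 375), dst_size=(124, 37)):
    """Scale a frontal-view box (u_min, v_min, u_max, v_max) from the
    original image resolution to the reduced occupancy resolution."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    u_min, v_min, u_max, v_max = box
    return (u_min * sx, v_min * sy, u_max * sx, v_max * sy)


# A static object 10 m ahead appears 0.5 m closer after the vehicle
# drives forward at 5 m/s for 0.1 s; compensation restores 10 m.
print(compensate_ego_motion([(9.5, 0.0)], (5.0, 0.0), 0.1))
print(scale_box((0, 0, 1242, 375)))
```

After compensation, a static object occupies the same cells in both bird's eye view grids, so any remaining shift reflects the target's own motion.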
  • 42. End-to-end Learning of Multi-sensor 3D Tracking by Detection Detector: MV3D
  • 43. End-to-end Learning of Multi-sensor 3D Tracking by Detection Trajectories are color-coded: the same color denotes the same object.