3D Interpretation from Single 2D Image
for Autonomous Driving III
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Towards Generalization Across Depth for Monocular 3D Object Detection
• RTM3D: Real-time Monocular 3D Detection from Object Keypoints for
Autonomous Driving
• Monocular 3D Object Detection with Decoupled Structured Polygon Estimation
and Height-Guided Depth Estimation
• Exploring the Capabilities and Limits of 3D Monocular Object Detection - A
Study on Simulation and Real World Data
• Object-Aware Centroid Voting for Monocular 3D Object Detection
• Monocular 3D Detection with Geometric Constraints Embedding and Semi-
supervised Training
• Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
Towards Generalization Across Depth for
Monocular 3D Object Detection
• This work advances the state of the art by introducing MoVi-3D, a single-
stage deep architecture for monocular 3D object detection.
• MoVi-3D builds upon an approach which leverages geometrical information
to generate, both at training and test time, virtual views where the object
appearance is normalized with respect to distance.
• These virtually generated views facilitate the detection task, as they
significantly reduce the visual appearance variability associated with objects
placed at different distances from the camera.
• As a consequence, the deep model is relieved from learning depth-specific
representations and its complexity can be significantly reduced.
• In particular, thanks to the virtual view generation process, a lightweight,
single-stage architecture suffices to set new state-of-the-art results on the
popular KITTI3D benchmark.
Towards Generalization Across Depth for
Monocular 3D Object Detection
The aim is to predict a 3D bounding box for each object given a single image (left). In this image, the
scale of an object heavily depends on its distance from the camera, so the complexity of detection
increases as the distance grows. Instead of performing the detection on the original image, perform it
on virtual images (middle). Each virtual image presents a cropped and scaled version of the original
image that preserves the scale of objects as if the image had been taken at a different, given depth.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Illustration of the Monocular 3D Object Detection task. Given an input image (left),
the model predicts a 3D box for each object (middle). Each box has its 3D
dimensions s = (W, H, L), 3D center c = (x, y, z), and rotation α.
Towards Generalization Across Depth for
Monocular 3D Object Detection
• The goal is to devise a training/inference procedure that enables generalization across
depth, by indirectly forcing the models to develop representations for objects that are less
dependent on their actual depth in the scene.
• The idea is to feed the model with transformed images that have been put into a canonical
form that depends on some query depth.
• After this transformation, no matter where the car is in space, we obtain an image of the car
that is consistent in terms of the scale of the object.
• Clearly, depth still influences the appearance, e.g. due to perspective deformations, but by
removing the scale factor from the nuisance variables, the task the model has to solve is
simplified.
• In order to apply the proposed transformation, the locations of the 3D objects need to be
known in advance.
Towards Generalization Across Depth for
Monocular 3D Object Detection
3D viewport
Compute the top-left and bottom-right corners of the viewport, namely (Xv,Yv,Zv) and (Xv +
Wv,Yv – Hv,Zv) respectively, and project them to the image plane of the camera, yielding the
top-left and bottom-right corners of a 2D viewport. Crop this region and rescale it to the desired
resolution wv × hv to get the final output: a virtual image generated by the given 3D viewport.
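As a rough illustration of this step, the sketch below projects the two 3D viewport corners with the camera intrinsics and then crops and rescales the corresponding image region. Function and variable names are illustrative, and the axis conventions and rounding are assumptions, not the authors' code.

```python
import numpy as np
import cv2  # assumed available for the crop-and-resize step

def virtual_view(image, K, viewport, out_size=(128, 128)):
    """Sketch of virtual-view generation from a 3D viewport.

    viewport: (Xv, Yv, Zv, Wv, Hv) -- top-left corner, depth, metric width/height.
    K: 3x3 camera intrinsics. Names and conventions are illustrative.
    """
    Xv, Yv, Zv, Wv, Hv = viewport
    corners_3d = np.array([[Xv, Yv, Zv],             # top-left corner
                           [Xv + Wv, Yv - Hv, Zv]])  # bottom-right corner
    uvw = (K @ corners_3d.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                    # project to the image plane
    u0, v0 = uv.min(axis=0)
    u1, v1 = uv.max(axis=0)
    crop = image[int(round(v0)):int(round(v1)), int(round(u0)):int(round(u1))]
    return cv2.resize(crop, out_size)                # rescale to the desired wv x hv
```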
Towards Generalization Across Depth for
Monocular 3D Object Detection
• The goal of the training procedure is to build a network that is able to make correct
predictions within a limited depth range given an image generated from a 3D viewport.
• A ground-truth-guided sampling procedure: repeatedly draw (without replacement) a
ground-truth object, then sample a 3D viewport in its neighborhood so that the
object is completely visible in the virtual image.
• The location of the 3D viewport is perturbed with respect to the position of the target
ground-truth object in order to obtain a model that is robust to depth ranges up to the
predefined depth resolution Zres, which in turn plays an important role at inference time.
• In addition, a small share of the virtual images is generated by 3D viewports positioned
randomly, in such a way that the corresponding virtual image is completely contained
in the original image.
• A class-uniform sampling strategy: yields an equal number of virtual images for each
class present in the original image.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Training virtual image creation. We randomly sample a target object (dark-red car). Given the input
image, object position and camera parameters, compute a 3D viewport that we place at z = Zv. Then
project the 3D viewport onto the image plane, resulting in a 2D viewport. Finally crop the
corresponding region and rescale it to obtain the target virtual view (right).
Towards Generalization Across Depth for
Monocular 3D Object Detection
• Since the network has been trained to predict at distances up to twice the depth
step, we can be reasonably confident that no objects are missed, in the sense
that each object will be covered by at least one virtual image.
• Also, due to the convolutional nature of the architecture, the width of the
virtual image is adjusted to cover the entire extent of the input image.
• By doing so, the virtual images become wider with increasing depth, following a
width rule expressed in terms of the input image width W (given in the paper).
• Finally, NMS is performed over detections that have been generated from the
same virtual image.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Inference pipeline. Given the input image, camera parameters, and Zres, create a series of 3D
viewports placed every Zres/2 meters along the Z axis. Then project these viewports onto the image,
and crop and rescale the resulting regions to obtain distance-specific virtual views. Finally, use these
views to perform the 3D detection.
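A minimal sketch of the depth placement described above, assuming a fixed detection range [z_min, z_max]; the exact width-scaling rule for the virtual images is given in the paper and omitted here.

```python
import numpy as np

def viewport_depths(z_min, z_max, z_res):
    """Place candidate 3D viewports every Zres/2 meters along the Z axis, so every
    object falls within the trained depth range of at least one virtual view."""
    return np.arange(z_min, z_max + 1e-6, z_res / 2.0)

# e.g. viewport_depths(5.0, 45.0, 10.0) -> viewports at 5, 10, 15, ..., 45 m
```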
Towards Generalization Across Depth for
Monocular 3D Object Detection
It consists of two parallel branches: the top one is devoted to providing confidences about the predicted
2D and 3D bounding boxes, while the bottom one is devoted to regressing the actual bounding boxes.
White rectangles denote 3×3 convolutions with 128 output channels followed by iABNsync.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Towards Generalization Across Depth for
Monocular 3D Object Detection
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
• It proposes an efficient and accurate monocular 3D detection framework in
single shot.
• This method predicts the nine perspective keypoints of a 3D bounding box in
image space, and then utilizes the geometric relationship between the 3D and 2D
perspectives to recover the dimension, location, and orientation in 3D space.
• In this method, the properties of the object can be predicted stably even
when the estimation of keypoints is very noisy, which enables us to obtain
fast detection speed with a small architecture.
• Training uses the 3D properties of the object without the need for external
networks or supervision data.
• This method is the first real-time system for monocular 3D detection, while
achieving state-of-the-art performance on the KITTI benchmark.
• Code will be released at https://github.com/Banconxuan/RTM3D.
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Overview of the proposed method: first predict the ordinal keypoints projected into
image space by the eight vertices and the central point of a 3D object; then
reformulate the estimation of the 3D bounding box as the minimization of an energy
function built from the geometric constraints of perspective projection.
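As a simplified stand-in for this energy minimization, the sketch below recovers the 3D location by least-squares fitting the projected box keypoints to the predicted 2D keypoints, given the predicted dimension and orientation. It is an illustration under assumed KITTI-style conventions, not the authors' solver.

```python
import numpy as np
from scipy.optimize import least_squares  # assumed available

def box_keypoints_3d(dim, loc, ry):
    """8 corners + centre of a 3D box in camera coords (KITTI-style: y down,
    box bottom at the y of loc, rotation ry around the vertical axis)."""
    h, w, l = dim
    x = [ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2,  0.0 ]
    y = [ 0.0,  0.0,  0.0,  0.0,  -h,   -h,   -h,   -h,  -h/2 ]
    z = [ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2,  0.0 ]
    R = np.array([[ np.cos(ry), 0.0, np.sin(ry)],
                  [ 0.0,        1.0, 0.0       ],
                  [-np.sin(ry), 0.0, np.cos(ry)]])
    return R @ np.vstack([x, y, z]) + np.asarray(loc).reshape(3, 1)

def project(K, pts_3d):
    uvw = K @ pts_3d
    return (uvw[:2] / uvw[2]).T          # (9, 2) pixel coordinates

def recover_location(K, kpts_2d, dim, ry, z_init=20.0):
    """Recover the 3D location by minimising the reprojection error between the
    predicted 2D keypoints (9, 2) and the projected 3D box keypoints."""
    def residual(loc):
        return (project(K, box_keypoints_3d(dim, loc, ry)) - kpts_2d).ravel()
    # Initialise on the viewing ray of the main-centre keypoint at a guessed depth.
    u, v = kpts_2d[8]
    loc0 = np.linalg.inv(K) @ np.array([u, v, 1.0]) * z_init
    return least_squares(residual, loc0).x
```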
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
An overview of the proposed keypoint detection architecture: it takes only the RGB image as input
and outputs the main-center heatmap, vertex heatmaps, and vertex coordinates as the base module
to estimate the 3D bounding box. It can also predict other optional priors to further improve 3D
detection performance.
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Illustration of keypoint feature pyramid network (KFPN).
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
• Since location recovery in 3D space is quite difficult due to the absence of
depth information, this work proposes a unified framework which decomposes the
detection problem into a structured polygon prediction task and a depth recovery task.
• Different from the widely studied 2D bounding boxes, the proposed structured
polygon in the 2D image consists of several projected surfaces of the target
object, serving as a better representation for 3D detection.
• In order to inversely project the predicted 2D structured polygon to a
cuboid in the 3D physical world, the following depth recovery task uses
the object height prior to complete the inverse projection
transformation with the given camera projection matrix.
• Moreover, a fine-grained 3D box refinement scheme is proposed to
further rectify the 3D detection results.
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
The overall framework (Decoupled-3D) decouples the monocular 3D object detection problem into sub-tasks. The
overall network consists of three parts. (Top row) The 2D structured polygons are generated with a stacked
hourglass network. (Middle row) The object depth stage utilizes the 3D object height as a prior to recover the
missing depth of the object. (Bottom row) The 3D box refinement stage rectifies coarse 3D boxes using bird’s eye
view features in 3D RoIs.
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Structured polygon estimation aims to estimate the 2D locations of the projected vertices.
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Height-Guided Depth Estimation. Combine the
object height H and the corresponding pixel
height h to estimate the object depth.
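Under the pinhole model, the depth follows directly from the similar-triangles relation z ≈ f·H/h; a tiny sketch with illustrative names:

```python
def depth_from_height(f_y, H_object, h_pixels):
    """Height-guided depth: an object of physical height H_object spanning h_pixels
    vertically in the image lies at roughly z = f_y * H_object / h_pixels."""
    return f_y * H_object / h_pixels

# e.g. a 1.5 m tall car spanning 50 px with f_y = 700 px -> depth ~ 21 m
```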
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
3D Box Refinement. Rectify coarse boxes with the bird’s eye view map.
Note: the depth network is DORN (“Deep Ordinal Regression Network for Monocular Depth Estimation”).
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
• Recent deep learning methods show promising results to recover depth
information from single images by learning priors about the environment.
• In addition to the network design, the major difference between these competing
approaches lies in using a supervised or self-supervised optimization loss
function, which requires different data and ground truth information.
• This paper evaluates the performance of a 3D object detection pipeline which
is parameterizable with different depth estimation configurations.
• It implements a simple distance calculation approach based on camera
intrinsics and 2D bounding box size, as well as a self-supervised and a supervised
learning approach for depth estimation.
• It evaluates the detection pipeline on simulator data and a real-world
sequence from an autonomous vehicle on a race track.
• Advantages and drawbacks of the different depth estimation strategies are
discussed.
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
3D object detection pipeline with
three alternative configurations
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
• Distance calculation using the 2D bounding box height and the known
height of the real-world race car as a geometric constraint (“known
height assumption”).
• Depth estimation for the whole image using the supervised
DenseDepth network. The distance to each object is calculated as the
median depth value in the bounding box crop. Explicit knowledge
about the objects, like height information, is not required in this
approach.
• Depth estimation for the whole image using the self-supervised
struct2depth network. The distance to each object is calculated as the
median depth value in the bounding box crop. Explicit knowledge
about the objects, like height information, is not required in this
approach.
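For the two learned-depth configurations, the per-object distance reduces to the median of the predicted depth map inside the 2D box; a minimal sketch, assuming a dense depth map and pixel box coordinates (names illustrative):

```python
import numpy as np

def object_distance(depth_map, box):
    """Distance of a detection taken as the median predicted depth inside its
    2D bounding box crop. box = (u1, v1, u2, v2) in pixels."""
    u1, v1, u2, v2 = [int(round(c)) for c in box]
    return float(np.median(depth_map[v1:v2, u1:u2]))
```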
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• This paper proposes an end-to-end trainable monocular 3D object
detector without learning dense depth.
• Specifically, the grid coordinates of a 2D box are first projected back to
3D space with the pinhole model as 3D centroid proposals.
• Then, an object-aware voting approach is introduced, which considers
both the region-wise appearance attention and the geometric
projection distribution, to vote the 3D centroid proposals for 3D object
localization.
• With the late fusion and the predicted 3D orientation and dimension,
the 3D bounding boxes of objects can be detected from a single RGB
image.
• The method is straightforward yet significantly superior to other
monocular-based methods.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
3D Object Detection Pipeline. Given an
image with predicted 2D region proposals
(yellow box), the regions are divided into
grids. Each grid point with (u, v) coordinates
is projected back to 3D space by leveraging
the pinhole model and the class-specific 3D
height H, resulting in 3D box centroid
proposals. With the voting method inspired
by both appearance and geometric cues,
3D object location is predicted.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
The Architecture. 2D region proposals are first obtained from the RPN module. Then, with the 3D Center
Reasoning (left), multiple 3D centroid proposals are estimated from the 2D RoI grid coordinates.
Followed by the Object-Aware Voting (right), which consists of geometric projection distribution (GPD)
and appearance attention map (AAM), the 3D centroid proposals are voted for 3D localization. The 3D
dimension and orientation are estimated together with the 2D object detection head.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Objects on the driving road are placed horizontally, with negligible roll and pitch angles
with respect to the camera.
• Besides, the 3D dimension variance within each object class (such as Car) is quite small.
• These constraints lead to the idea that the apparent heights of objects in the image are
approximately invariant when the objects are at the same depth.
• A recent survey also points out that the position and apparent size of an object in an image
can be used to infer its depth on the KITTI dataset.
• Therefore, the 3D object centroid can be roughly inferred with the simple pinhole camera
model.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Specifically, divide each 2D region proposal into s × s grid cells and project the grid
coordinates back into 3D space.
• Since each grid point indicates a probable projection of the corresponding 3D object
centroid, multiple 3D centroid proposals P3d are obtained, where the i-th centroid proposal
P3d(Xi, Yi, Z) is computed via the pinhole camera model (see the sketch below).
Examples and statistics on KITTI training set.
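The back-projection can be sketched as follows, assuming the depth shared by all grid points is obtained from the class-specific height H and the box pixel height via the pinhole relation; this illustrates the idea and is not the paper's exact formula.

```python
import numpy as np

def centroid_proposals(K, box, H_class, s=4):
    """Back-project an s x s grid of points inside a 2D box to 3D centroid
    proposals. Depth is taken from the class-specific height H_class and the box
    pixel height via the pinhole relation Z ~= fy * H / h."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u1, v1, u2, v2 = box
    Z = fy * H_class / (v2 - v1)                 # shared depth estimate
    uu, vv = np.meshgrid(np.linspace(u1, u2, s), np.linspace(v1, v2, s))
    X = (uu - cx) * Z / fx
    Y = (vv - cy) * Z / fy
    return np.stack([X, Y, np.full_like(X, Z)], axis=-1)   # (s, s, 3) proposals
```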
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Specifically, use a single 1×1 convolution followed by a sigmoid activation to generate the
appearance attention map from the feature maps of the RoI pooling layer.
• The activated convolution feature map from the image indicates the foreground semantic
objects, due to the classification supervision in 2D object detection, leading to the object-
aware voting.
• The geometric voting component comes from the distribution of the offset between the
projected 3D centroid and the 2D box center.
• It has been demonstrated that the 2D box center can be modeled as a Gaussian distribution
with the ground truth as its expectation.
• To dynamically learn the distribution, the 2D grid coordinates and the RoI image features are
concatenated as input to a fully-connected layer that predicts the offset, with the Kullback-
Leibler (KL) divergence as the loss function to supervise the learning.
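One common way to realize this is to predict a Gaussian over the offset and penalize it with a KL-style loss against the ground-truth offset. The sketch below uses the familiar attenuated-L2 form; it is an assumption about the formulation, not the paper's exact derivation.

```python
import torch

def kl_offset_loss(pred_offset, pred_log_var, gt_offset):
    """KL-style loss: the prediction is modeled as a Gaussian (mean, variance) and
    the ground truth as a (near-)delta; larger predicted variance down-weights the
    squared error while being penalized by the log-variance term."""
    return (torch.exp(-pred_log_var) * (pred_offset - gt_offset) ** 2 / 2.0
            + pred_log_var / 2.0).mean()
```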
Object-Aware Centroid Voting for
Monocular 3D Object Detection
The object-aware voting can be formulated as the
element-wise multiplication of the normalized
probability maps Mapp and Mgeo (see the sketch below).
In the training stage, the 3D localization pipeline is
trained with a smooth L1 loss.
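A minimal sketch of the voting step, assuming the two maps and the centroid proposals share the same s × s grid (names illustrative):

```python
import numpy as np

def vote_centroid(proposals, m_app, m_geo):
    """Fuse the appearance attention map and the geometric distribution map by
    element-wise multiplication, normalize the result into a probability map, and
    use it to average the (s, s, 3) centroid proposals into one voted 3D location."""
    w = m_app * m_geo
    w = w / (w.sum() + 1e-9)
    return (proposals * w[..., None]).sum(axis=(0, 1))
```

The `proposals` array can come directly from the back-projection sketch above.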
Object-Aware Centroid Voting for
Monocular 3D Object Detection
The 3D dimension prediction loss, comparing
predictions and the ground truth, is defined in
logarithm space through the smooth L1 loss.
For 3D orientation estimation, Multi-Bin is used to
disentangle it into residual angle prediction and
angle-bin classification, and the 3D orientation
loss is formed accordingly (see the sketch below).
These loss functions are combined for the joint
multi-task training of 2D and 3D object detection.
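A common Multi-Bin formulation combines a cross-entropy loss over angle bins with a regression loss on the in-bin residual; the sketch below is one such instantiation and may differ in detail from the paper's version.

```python
import math
import torch
import torch.nn.functional as F

def multibin_orientation_loss(bin_logits, residual_pred, gt_angle, num_bins=2):
    """Sketch of a Multi-Bin orientation loss: classify the angle bin, then regress
    the residual angle inside the ground-truth bin.
    bin_logits: (B, num_bins), residual_pred: (B, num_bins), gt_angle: (B,)."""
    bin_size = 2.0 * math.pi / num_bins
    angle = gt_angle % (2.0 * math.pi)
    gt_bin = (angle / bin_size).long().clamp(max=num_bins - 1)
    gt_residual = angle - (gt_bin.float() + 0.5) * bin_size   # offset from bin centre
    cls_loss = F.cross_entropy(bin_logits, gt_bin)
    res_pred = residual_pred.gather(1, gt_bin.unsqueeze(1)).squeeze(1)
    reg_loss = F.smooth_l1_loss(res_pred, gt_residual)
    return cls_loss + reg_loss
```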
Object-Aware Centroid Voting for
Monocular 3D Object Detection
Object-Aware Centroid Voting for
Monocular 3D Object Detection
Object-Aware Centroid Voting for
Monocular 3D Object Detection
Qualitative results. Red: detected 3D boxes. Yellow: ground truth. Right: bird’s eye view (BEV) results.
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
• A single-shot, keypoint-based framework for monocular 3D object detection, KM3D-Net.
• A fully convolutional model is designed to predict object keypoints, dimension, and
orientation, which are then combined with perspective geometry constraints to compute position.
• Further, the geometric constraints are reformulated as a differentiable version and embedded into the
network, maintaining the consistency of model outputs in an end-to-end (E2E) fashion.
• A semi-supervised training strategy is then proposed for the case where labeled training data is scarce.
• In this strategy, a consensus prediction is enforced between two shared-weight copies of KM3D-Net for the
same unlabeled image under different input augmentation conditions and network
regularization.
• In particular, the coordinate-dependent augmentations are unified as an affine transformation for
differentiably recovering position, and a keypoint-dropout module is proposed for network
regularization.
• This model only requires RGB images, without synthetic data, instance segmentation, CAD
models, or a depth generator.
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Overview of KM3D-Net, which outputs
keypoints, object dimensions, local
orientation, and 3D confidence, followed
by differentiable geometric consistency
constraints to predict position.
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Overview of the unsupervised training branch. It leverages an affine transformation to unify input
augmentation and devises keypoint dropout for regularization. These two strategies make KM3D-Net
output two stochastic predictions for the same input; penalizing their difference is the training goal.
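The consensus objective can be sketched as two stochastic forward passes of the same network on differently augmented copies of an unlabeled image, mapped back to a common frame and compared. Here `augment` and `to_common_frame` are illustrative placeholders, and the actual KM3D-Net loss may use a different distance.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, image, augment, to_common_frame):
    """Semi-supervised consistency sketch: penalize the difference between two
    predictions of the shared-weights network under different augmentations
    (including keypoint dropout inside the model) on the same unlabeled image."""
    img_a, affine_a = augment(image)                  # random affine + photometric augmentation
    img_b, affine_b = augment(image)
    pred_a = to_common_frame(model(img_a), affine_a)  # undo the affine transform
    pred_b = to_common_frame(model(img_b), affine_b)
    return F.l1_loss(pred_a, pred_b)
```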
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
• The training of deep-learning-based 3D object detectors requires
large datasets with 3D bounding box labels for supervision that have to
be generated by hand-labeling.
• A network architecture and training procedure for learning monocular
3D object detection without 3D bounding box labels.
• By representing the objects as triangular meshes and employing
differentiable shape rendering, loss functions are defined based on depth
maps, segmentation masks, and ego- and object-motion, which are
generated by pre-trained, off-the-shelf networks.
• The method achieves performance comparable to state-of-the-art methods that require
3D bounding box labels for training, and superior performance to conventional baseline methods.
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
A monocular 3D vehicle detector that requires no 3D bounding box labels for training. The right image shows that
the predicted vehicles (colored shapes) fit the GT bounding boxes (red). Despite the noisy input depth (lower
left), the method is able to accurately predict the 3D poses of vehicles thanks to the proposed fully differentiable
training scheme. The projections of the predicted bounding boxes are shown (colored boxes, upper left).
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
The proposed model contains a single-image network and a multi-image network extension. The single-
image network back-projects the input depth map into a point cloud. A Frustum PointNet encoder predicts
the pose and shape, which are then decoded into a predicted 3D mesh and segmentation mask through
differentiable rendering. The multi-image network architecture takes three images as inputs, and the
single-image network is applied individually to each image. This network predicts a depth map for the
middle frame based on the vehicle’s pose and shape. A pre-trained network predicts ego-motion and
object-motion from the images. The reconstruction loss is computed by differentiably warping the images
into the middle frame.
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
In order to train without 3D bounding box labels, three losses are used: the
segmentation loss Lseg, the chamfer distance Lcd, and the photometric
reconstruction loss Lrec. The first two are defined for single images, while
the photometric reconstruction loss relies on temporal photo-consistency
over three consecutive frames.
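Of the three losses, the chamfer distance Lcd is the most self-contained to illustrate; a minimal sketch for two point sets follows (the segmentation and photometric terms depend on the renderer and the warping and are omitted).

```python
import torch

def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between point sets of shape (N, 3) and (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = torch.cdist(points_a, points_b)          # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```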
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
Qualitative comparison of MonoGRNet (1st row), Mono3D (2nd row), and this method (3rd row) with depth
maps from BTS. Shown are the GT bounding boxes for cars (red), the predicted bounding boxes (green), and the
back-projected point cloud. In comparison to Mono3D, the prediction accuracy is increased, specifically for
vehicles farther away. The performance of MonoGRNet and this model is comparable.
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
3-d interpretation from single 2-d image III

More Related Content

What's hot

What's hot (20)

Stereo Matching by Deep Learning
Stereo Matching by Deep LearningStereo Matching by Deep Learning
Stereo Matching by Deep Learning
 
Deep vo and slam ii
Deep vo and slam iiDeep vo and slam ii
Deep vo and slam ii
 
Depth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep LearningDepth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep Learning
 
Deep vo and slam iii
Deep vo and slam iiiDeep vo and slam iii
Deep vo and slam iii
 
Fisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving IIFisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving II
 
Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling
 
Depth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors IIDepth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors II
 
Depth Fusion from RGB and Depth Sensors IV
Depth Fusion from RGB and Depth Sensors  IVDepth Fusion from RGB and Depth Sensors  IV
Depth Fusion from RGB and Depth Sensors IV
 
Survey 1 (project overview)
Survey 1 (project overview)Survey 1 (project overview)
Survey 1 (project overview)
 
Depth Fusion from RGB and Depth Sensors III
Depth Fusion from RGB and Depth Sensors  IIIDepth Fusion from RGB and Depth Sensors  III
Depth Fusion from RGB and Depth Sensors III
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Anchor free object detection by deep learning
Anchor free object detection by deep learningAnchor free object detection by deep learning
Anchor free object detection by deep learning
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
Deep VO and SLAM IV
Deep VO and SLAM IVDeep VO and SLAM IV
Deep VO and SLAM IV
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processing
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
camera-based Lane detection by deep learning
camera-based Lane detection by deep learningcamera-based Lane detection by deep learning
camera-based Lane detection by deep learning
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 

Similar to 3-d interpretation from single 2-d image III

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
Prathamesh Joshi
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
c.choi
 

Similar to 3-d interpretation from single 2-d image III (20)

3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
On constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized ImagesOn constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized Images
 
Rendering Algorithms.pptx
Rendering Algorithms.pptxRendering Algorithms.pptx
Rendering Algorithms.pptx
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
visual realism in geometric modeling
visual realism in geometric modelingvisual realism in geometric modeling
visual realism in geometric modeling
 
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
 
Goal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D cameraGoal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D camera
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
 
Vision based non-invasive tool for facial swelling assessment
Vision based non-invasive tool for facial swelling assessment Vision based non-invasive tool for facial swelling assessment
Vision based non-invasive tool for facial swelling assessment
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
 
BEV Semantic Segmentation
BEV Semantic SegmentationBEV Semantic Segmentation
BEV Semantic Segmentation
 
Object tracking
Object trackingObject tracking
Object tracking
 
Various object detection and tracking methods
Various object detection and tracking methodsVarious object detection and tracking methods
Various object detection and tracking methods
 
ei2106-submit-opt-415
ei2106-submit-opt-415ei2106-submit-opt-415
ei2106-submit-opt-415
 
Introduction talk to Computer Vision
Introduction talk to Computer Vision Introduction talk to Computer Vision
Introduction talk to Computer Vision
 
Learning to Perceive the 3D World
Learning to Perceive the 3D WorldLearning to Perceive the 3D World
Learning to Perceive the 3D World
 
3 d scanning technology
3 d scanning technology3 d scanning technology
3 d scanning technology
 

More from Yu Huang

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 

Recently uploaded

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 

Recently uploaded (20)

Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 

3-d interpretation from single 2-d image III

  • 1. 3D Interpretation from Single 2D Image for Autonomous Driving III Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • Towards Generalization Across Depth for Monocular 3D Object Detection • RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving • Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation • Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Object-Aware Centroid Voting for Monocular 3D Object Detection • Monocular 3D Detection with Geometric Constraints Embedding and Semi- supervised Training • Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
  • 3. Towards Generalization Across Depth for Monocular 3D Object Detection • This work advances the state of the art by introducing MoVi-3D, a single- stage deep architecture for monocular 3D object detection. • MoVi-3D builds upon an approach which leverages geometrical information to generate, both at training and test time, virtual views where the object appearance is normalized with respect to distance. • These virtually generated views facilitate the detection task as they significantly reduce the visual appearance variability associated to objects placed at different distances from the camera. • As a consequence, the deep model is relieved from learning depth-specific representations and its complexity can be significantly reduced. • In particular, in this work thanks to virtual views generation process, a lightweight, single-stage architecture suffices to set new state-of-the-art results on the popular KITTI3D benchmark.
  • 4. Towards Generalization Across Depth for Monocular 3D Object Detection Aim at predicting a 3D bounding box for each object given a single image (left). In this image, the scale of an object heavily depends on its distance with respect to the camera. For this reason the complexity of the detection increases as the distance grows. Instead of performing the detection on the original image, perform it on virtual images (middle). Each virtual image presents a cropped and and scaled version of the original image that preserves the scale of objects as if the image was taken at a different, given depth.
  • 5. Towards Generalization Across Depth for Monocular 3D Object Detection Illustration of the Monocular 3D Object Detection task. Given an input image (left), the model predicts a 3D box for each object (middle). Each box has its 3D dimensions s = (W;H;L), 3D center c = (x; y; z) and rotation (alpha).
  • 6. Towards Generalization Across Depth for Monocular 3D Object Detection • The goal is to devise a training/inference procedure that enables generalization across depth, by indirectly forcing the models to develop representations for objects that are less dependent on their actual depth in the scene. • The idea is to feed the model with transformed images that have been put into a canonical form that depends on some query depth. • After this transformation, no matter where the car is in space, obtain an image of the car that is consistent in terms of the scale of the object. • Clearly, depth still influences the appearance, e.g. due to perspective deformations, but by removing the scale factor from the nuisance variables,able to simplify the task that has to be solved by the model. • In order to apply the proposed transformation,need to know the location of the 3D objects in advance.
  • 7. Towards Generalization Across Depth for Monocular 3D Object Detection 3D viewport Compute the top-left and bottom-right corners of the viewport, namely (Xv,Yv,Zv) and (Xv + Wv,Yv – Hv,Zv) respectively, and project them to the image plane of the camera, yielding the top-left and bottom-right corners of a 2D viewport. Crop it and rescale it to the desired resolution wv x hv to get the final output. It is a virtual image generated by the given 3D viewport.
  • 8. Towards Generalization Across Depth for Monocular 3D Object Detection • The goal of the training procedure is to build a network that is able to make correct predictions within a limited depth range given an image generated from a 3D viewport. • A ground-truth-guided sampling procedure:repeatedly draw (without replacement) a ground-truth object and then sample a 3D viewport in a neighborhood thereof so that the object is completely visible in the virtual image. • The location of the 3D viewport is perturbed with respect to the position of the target ground-truth object in order to obtain a model that is robust to depth ranges up to the predefined depth resolution Zres, which in turn plays an important role at inference time. • In addition, let a small share of the virtual images to be generated by 3D viewports randomly positioned in a way that the corresponding virtual image is completely contained in the original image. • A class-uniform sampling strategy:allows to get an even number of virtual images for each class that is present in the original image.
  • 9. Towards Generalization Across Depth for Monocular 3D Object Detection Training virtual image creation. We randomly sample a target object (dark-red car). Given the input image, object position and camera parameters, compute a 3D viewport that we place at z = Zv. Then project the 3D viewport onto the image plane, resulting in a 2D viewport. Finally crop the corresponding region and rescale it to obtain the target virtual view (right).
  • 10. Towards Generalization Across Depth for Monocular 3D Object Detection • Since have trained the network to be able to predict at distances that are twice the depth step, reasonably confident not missing objects, in the sense that each object will be covered by at least a virtual image. • Also, due to the convolutional nature of the architecture adjust the width of the virtual image in a way to cover the entire extent of the input image. • By doing so have virtual images that become wider as increasing the depth, following the rule (W is the width of the input image): • Finally perform NMS over detections that have been generated from the same virtual image.
  • 11. Towards Generalization Across Depth for Monocular 3D Object Detection Inference pipeline. Given the input image, camera parameters and Zres,create a series of 3D viewports placing every Zres/2 meters along the Z axis. Then project these viewports onto the image, crop and rescale the resulting regions to obtain distance-specific virtual views. Finally use these views to perform the 3D detection.
  • 12. Towards Generalization Across Depth for Monocular 3D Object Detection It consists of two parallel branches, the top one devoted to providing confidences about the predicted 2D and 3D bounding boxes, while the bottom one is devoted to regressing the actual bounding boxes. White rectangles denote 33 convolutions with 128 output channels followed by iABNsync.
  • 13. Towards Generalization Across Depth for Monocular 3D Object Detection
  • 14. Towards Generalization Across Depth for Monocular 3D Object Detection
  • 15. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving • It proposes an efficient and accurate monocular 3D detection framework in single shot. • This method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilize the geometric relationship of 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. • In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables us to obtain fast detection speed with a small architecture. • Training uses the 3D properties of the object without the need for external networks or supervision data. • This method is the first real-time system for monocular image 3D detection while achieves state-of the-art performance on the KITTI benchmark. • Code will be released at https://github.com/Banconxuan/RTM3D.
  • 16. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving Overview of proposed method: first predict ordinal keypoints projected in the image space by eight vertexes and a central point of a 3D object. then reformulate the estimation of the 3D bounding box as the problem of minimizing the energy function by using geometric constraints of perspective projection.
  • 17. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving An overview of proposed keypoint detection architecture: It takes only the RGB images as the input and outputs main center heatmap, vertexes heatmap, and vertexes coordinate as the base module to estimate 3D bounding box. It can also predict other alternative priors to further improve the performance of 3D detection.
  • 18. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving Illustration of keypoint feature pyramid network (KFPN).
  • 19. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
  • 20. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
  • 21. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation • Since the location recovery in 3D space is quite difficult on account of absence of depth information, this work proposes a unified framework which decomposes the detection problem into a structured polygon prediction task and a depth recovery task. • Different from the widely studied 2D bounding boxes, the proposed structured polygon in the 2D image consists of several projected surfaces of the target object as better representation for 3D detection. • In order to inversely project the predicted 2D structured polygon to a cuboid in the 3D physical world, the following depth recovery task uses the object height prior to complete the inverse projection transformation with the given camera projection matrix. • Moreover, a fine-grained 3D box refinement scheme is proposed to further rectify the 3D detection results.
  • 22. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation The overall framework (Decoupled-3D) decouples the monocular 3D object detection problem into sub-tasks. The overall network consists of three parts. (Top row) The 2D structured polygons are generated with a stacked hourglass network. (Middle row) Object depth stage utilizes 3D object height as a prior to recover the missing depth of the object. (Bottom row) 3D box refine stage rectifies coarse 3D boxes using bird’s eye view features in 3D-ROIs.
  • 23. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation Structured polygon estimation aims to estimate the 2D locations of the projected vertices
  • 24. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation Height-Guided Depth Estimation. Combine object height H and corresponding pixel value h to estimate object depth
  • 25. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation 3D Box Refinement. Rectify coarse boxes with bird’s eye view map Note: Depth Net DOR(“Deep Ordinal Regression Network for Monocular Depth Estimation”)
  • 26. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
  • 27. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
  • 28. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Recent deep learning methods show promising results to recover depth information from single images by learning priors about the environment. • In addition to the network design, the major difference of these competing approaches lies in using a supervised or self-supervised optimization loss function, which require different data and ground truth information. • This paper evaluate the performance of a 3D object detection pipeline which is parameterizable with different depth estimation configurations. • It implement a simple distance calculation approach based on camera intrinsics and 2D bounding box size, a self-supervised, and a supervised learning approach for depth estimation. • It evaluate the detection pipeline on simulator data and a real world sequence from an autonomous vehicle on a race track. • Advantages and drawbacks of the different depth estimation strategies are discussed
  • 29. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data 3D object detection pipeline with three alternative configurations
  • 30. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Distance calculation using the 2D bounding box height, and the known height of the real world race car as a geometric constraint. “known height assumption” • Depth estimation for the whole image using the supervised DenseDepth network. The distance to each object is calculated as the median depth value in the bounding box crop. Explicit knowledge about the objects, like height information, is not required in this approach. • Depth estimation for the whole image using the self-supervised struct2depth network. The distance to each object is calculated as the median depth value in the bounding box crop. Explicit knowledge about the objects, like height information, is not required in this approach.
  • 31. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data
  • 32. Object-Aware Centroid Voting for Monocular 3D Object Detection • This paper propose an end-to-end trainable monocular 3D object detector without learning the dense depth. • Specifically, the grid coordinates of a 2D box are first projected back to 3D space with the pinhole model as 3D centroids proposals. • Then, a object-aware voting approach is introduced, which considers both the region-wise appearance attention and the geometric projection distribution, to vote the 3D centroid proposals for 3D object localization. • With the late fusion and the predicted 3D orientation and dimension, the 3D bounding boxes of objects can be detected from a single RGB image. • The method is straightforward yet significantly superior to other monocular-based methods.
  • 33. Object-Aware Centroid Voting for Monocular 3D Object Detection 3D Object Detection Pipeline. Given an image with predicted 2D region proposals (yellow box), the regions are divided into grids. Each grid point with (u;v) coordinate is projected back to 3D space by leveraging the pinhole model and the class-specific 3D height H, resulting in 3D box centroid proposals. With the voting method inspired by both appearance and geometric cues, 3D object location is predicted.
  • 34. Object-Aware Centroid Voting for Monocular 3D Object Detection The Architecture. 2D region proposals are first obtained from the RPN module. Then, with the 3D Center Reasoning (left), multiple 3D centroid proposals are estimated from the 2D RoI grid coordinates. Followed by the Object-Aware Voting (right), which consists of geometric projection distribution (GPD) and appearance attention map (AAM), the 3D centroid proposals are voted for 3D localization. For the 3D dimension and orientation, they are estimated together with 2D object detection head.
• 35. Object-Aware Centroid Voting for Monocular 3D Object Detection
• Objects on the driving road are horizontally placed, with negligible roll and pitch angles with respect to the camera.
• Besides, the 3D dimension variance within each object class (such as Car) is quite small.
• These constraints lead to the idea that the apparent heights of objects in the image are approximately invariant when the objects are at the same depth.
• A recent survey also points out that the position and apparent size of an object in an image can be used to infer its depth on the KITTI dataset.
• Therefore, the 3D object centroid can be roughly inferred with the simple pinhole camera model.
• 36. Object-Aware Centroid Voting for Monocular 3D Object Detection
• Specifically, each 2D region proposal is divided into s x s grid cells and the grid coordinates are projected back into 3D space.
• Since each grid point indicates a probable projection of the corresponding 3D object centroid, multiple 3D centroid proposals P3d are obtained, where the i-th centroid proposal P3d = (Xi, Yi, Z) is computed by the pinhole back-projection of the grid coordinate (see the sketch below).
Examples and statistics on KITTI training set.
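The back-projection equation appears as an image on the slide; a plausible reconstruction consistent with the text, where H is the class-specific 3D height, h2d the 2D box height in pixels, (ui, vi) the grid coordinate, and (fx, fy, cu, cv) the assumed camera intrinsics, is:

```latex
Z = \frac{f_y \, H}{h_{2d}}, \qquad
X_i = \frac{(u_i - c_u)\, Z}{f_x}, \qquad
Y_i = \frac{(v_i - c_v)\, Z}{f_y}
```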
• 37. Object-Aware Centroid Voting for Monocular 3D Object Detection
• Specifically, a single 1x1 convolution followed by a sigmoid activation generates the appearance attention map from the feature maps of the RoI pooling layer.
• The activated convolutional feature map indicates the foreground semantic objects, owing to the classification supervision in 2D object detection, leading to the object-aware voting.
• The geometric voting component comes from the distribution of the offset between the projected 3D centroid and the 2D box center.
• It has been demonstrated that the 2D box center can be modeled as a Gaussian distribution with the ground truth as its expectation.
• To dynamically learn this distribution, the 2D grid coordinates and the RoI image features are concatenated as input to a fully-connected layer that predicts the offset, with the Kullback-Leibler (KL) divergence as the loss function supervising the learning (a minimal sketch of the two voting maps follows below).
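A minimal PyTorch sketch of the two voting maps, assuming a 7 x 7 RoI grid and 256-channel pooled features; the layer sizes and the softmax parameterization of the GPD are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareVoting(nn.Module):
    """Sketch of the appearance attention map (AAM) and geometric projection
    distribution (GPD) heads used for object-aware voting."""

    def __init__(self, in_channels: int = 256, grid_size: int = 7):
        super().__init__()
        s = self.s = grid_size
        # AAM: 1x1 convolution + sigmoid over the pooled RoI features.
        self.attention = nn.Sequential(nn.Conv2d(in_channels, 1, kernel_size=1),
                                       nn.Sigmoid())
        # GPD: fully-connected layer over concatenated grid coordinates and RoI
        # features, producing one score per grid cell.
        self.gpd_head = nn.Linear(in_channels * s * s + 2 * s * s, s * s)

    def forward(self, roi_feat: torch.Tensor, grid_uv: torch.Tensor):
        # roi_feat: (N, C, s, s) pooled RoI features; grid_uv: (N, s*s, 2) coords.
        m_app = self.attention(roi_feat).flatten(1)                  # (N, s*s)
        x = torch.cat([roi_feat.flatten(1), grid_uv.flatten(1)], dim=1)
        m_geo = F.softmax(self.gpd_head(x), dim=1)                   # (N, s*s)
        return m_app, m_geo
```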
• 38. Object-Aware Centroid Voting for Monocular 3D Object Detection
• The object-aware voting can be formulated as the element-wise multiplication of the normalized probability maps Mapp and Mgeo (a hedged reconstruction of the formula is given below).
• In the training stage, the 3D localization pipeline is trained with a smooth L1 loss.
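The voting equation is shown as an image on the slide; under the assumption that the voted 3D location is a weighted sum of the centroid proposals, a plausible form is:

```latex
P_{3d}^{*} = \sum_{i=1}^{s^2} w_i \, P_{3d}^{(i)},
\qquad
w_i = \frac{\big(M_{app} \odot M_{geo}\big)_i}{\sum_{j} \big(M_{app} \odot M_{geo}\big)_j},
\qquad
\mathcal{L}_{loc} = \mathrm{SmoothL1}\!\left(P_{3d}^{*} - P_{3d}^{gt}\right)
```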
• 39. Object-Aware Centroid Voting for Monocular 3D Object Detection
• The 3D dimension prediction loss, comparing predictions with the ground truth, is defined in logarithm space through the smooth L1 loss.
• For 3D orientation estimation, Multi-Bin is used to disentangle it into residual angle regression and angle-bin classification, and the orientation loss is formed accordingly.
• The loss functions for the joint multi-task training of 2D and 3D object detection combine these terms (a hedged reconstruction is given below).
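The loss equations appear as images on the slide; a plausible reconstruction consistent with the text, where the weights λ are assumed balancing factors, is:

```latex
\mathcal{L}_{dim} = \mathrm{SmoothL1}\!\left(\log \mathbf{s} - \log \mathbf{s}^{gt}\right),
\qquad
\mathcal{L}_{ori} = \mathcal{L}_{bin}^{cls} + \mathcal{L}_{res}^{reg},
\qquad
\mathcal{L} = \mathcal{L}_{2D} + \lambda_{loc}\,\mathcal{L}_{loc} + \lambda_{dim}\,\mathcal{L}_{dim} + \lambda_{ori}\,\mathcal{L}_{ori}
```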
  • 40. Object-Aware Centroid Voting for Monocular 3D Object Detection
  • 41. Object-Aware Centroid Voting for Monocular 3D Object Detection
• 42. Object-Aware Centroid Voting for Monocular 3D Object Detection
Qualitative results. Red: detected 3D boxes. Yellow: ground truth. Right: bird's-eye view (BEV) results.
• 43. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
• KM3D-Net is a single-shot, keypoints-based framework for monocular 3D object detection.
• A fully convolutional model predicts object keypoints, dimension, and orientation, which are then combined with perspective geometry constraints to compute position.
• The geometric constraints are further reformulated as a differentiable version and embedded into the network, maintaining the consistency of model outputs in an end-to-end fashion.
• A semi-supervised training strategy is then proposed for the case where labeled training data is scarce.
• In this strategy, a consensus prediction is enforced between two shared-weight copies of KM3D-Net for the same unlabeled image under different input augmentations and network regularization.
• In particular, the coordinate-dependent augmentations are unified as an affine transformation for differentiable position recovery, and a keypoints-dropout module is proposed for network regularization.
• This model requires only RGB images, without synthetic data, instance segmentation, CAD models, or a depth generator.
• 44. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
Overview of KM3D-Net, which outputs keypoints, object dimensions, local orientation, and 3D confidence, followed by differentiable geometric consistency constraints to predict position.
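A hedged sketch of one way such a differentiable geometric constraint can be realized: the translation T is recovered by a least-squares fit of the projected box corners to the predicted 2D keypoints. The corner ordering, coordinate convention, and use of exactly 8 corner keypoints are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def solve_position(kpts_2d: torch.Tensor, dims: torch.Tensor, yaw: torch.Tensor,
                   K: torch.Tensor) -> torch.Tensor:
    """Recover the 3D translation T from predicted 2D keypoints (8, 2), dimensions
    (h, w, l), yaw, and intrinsics K (3, 3) via a differentiable least squares."""
    h, w, l = dims
    # 8 box corners in the object frame (x right, y down, z forward, origin at the
    # bottom center) -- an assumed convention.
    x = torch.tensor([1., 1., -1., -1., 1., 1., -1., -1.]) * (l / 2)
    y = torch.tensor([0., -1., 0., -1., 0., -1., 0., -1.]) * h
    z = torch.tensor([1., 1., 1., 1., -1., -1., -1., -1.]) * (w / 2)
    corners = torch.stack([x, y, z], dim=1)                       # (8, 3)
    c, s = torch.cos(yaw), torch.sin(yaw)
    zero = torch.zeros(())
    R = torch.stack([torch.stack([c, zero, s]),
                     torch.tensor([0., 1., 0.]),
                     torch.stack([-s, zero, c])])                 # rotation about y
    X = corners @ R.T                                             # rotated corners
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    A, b = [], []
    for i in range(8):
        u, v = kpts_2d[i]
        # From u*(Xz+Tz) = fx*(Xx+Tx) + cx*(Xz+Tz) and the analogous v equation.
        A.append(torch.stack([fx, zero, cx - u]))
        b.append((u - cx) * X[i, 2] - fx * X[i, 0])
        A.append(torch.stack([zero, fy, cy - v]))
        b.append((v - cy) * X[i, 2] - fy * X[i, 1])
    A = torch.stack(A)                                            # (16, 3)
    b = torch.stack(b).unsqueeze(1)                               # (16, 1)
    # Differentiable least-squares solve for T = (Tx, Ty, Tz) (A assumed full rank).
    return torch.linalg.lstsq(A, b).solution.squeeze(1)
```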
• 45. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
Overview of the unsupervised training. It leverages an affine transformation to unify the input augmentations and devises keypoints dropout for regularization. These two strategies make KM3D-Net produce two stochastic predictions for the same input; penalizing their difference is the training goal.
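A minimal sketch of this consensus objective on an unlabeled image; `augment`, `inverse_affine`, and the dictionary output keys are hypothetical helpers and names, not the paper's API, and keypoint dropout is assumed to be active inside the model during these passes.

```python
import torch
import torch.nn.functional as F

def consensus_loss(model, image, augment, inverse_affine):
    """Two stochastic forward passes of the shared-weight network on differently
    augmented copies of the same unlabeled image, penalized for disagreeing."""
    img_a, affine_a = augment(image)        # coordinate-dependent augmentation A
    img_b, affine_b = augment(image)        # coordinate-dependent augmentation B
    pred_a = model(img_a)                   # keypoints, dimensions, orientation, ...
    pred_b = model(img_b)
    # Map coordinate-dependent outputs back to the original image frame using the
    # inverse affine transforms so the two predictions are directly comparable.
    kpts_a = inverse_affine(pred_a["keypoints"], affine_a)
    kpts_b = inverse_affine(pred_b["keypoints"], affine_b)
    loss = F.l1_loss(kpts_a, kpts_b) + F.l1_loss(pred_a["dim"], pred_b["dim"])
    return loss
```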
  • 46. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
  • 47. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
• 48. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
• The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision, which have to be generated by hand-labeling.
• This work presents a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels.
• By representing the objects as triangular meshes and employing differentiable shape rendering, loss functions are defined based on depth maps, segmentation masks, and ego- and object-motion, all generated by pre-trained, off-the-shelf networks.
• The approach achieves performance comparable to state-of-the-art methods that require 3D bounding box labels for training, and superior performance to conventional baseline methods.
• 49. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
A monocular 3D vehicle detector that requires no 3D bounding box labels for training. The right image shows that the predicted vehicles (colored shapes) fit the GT bounding boxes (red). Despite the noisy input depth (lower left), the method is able to accurately predict the 3D poses of vehicles thanks to the proposed fully differentiable training scheme. The projections of the predicted bounding boxes are shown in the upper left (colored boxes).
• 50. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
The proposed model contains a single-image network and a multi-image network extension. The single-image network back-projects the input depth map into a point cloud. A Frustum PointNet encoder predicts the pose and shape, which are then decoded into a predicted 3D mesh and segmentation mask through differentiable rendering. The multi-image network takes three images as input, and the single-image network is applied individually to each image. This network predicts a depth map for the middle frame based on the vehicle's pose and shape. A pre-trained network predicts ego-motion and object-motion from the images. The reconstruction loss is computed by differentiably warping the images into the middle frame.
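A generic sketch of such a photometric reconstruction term: a neighboring frame is warped into the middle frame using the predicted depth and relative camera motion, then compared photometrically. This is a standard formulation under assumed tensor shapes; the paper's exact loss (for example additional SSIM terms or masking) may differ.

```python
import torch
import torch.nn.functional as F

def photometric_reconstruction_loss(src_img, tgt_img, tgt_depth, K, T_tgt_to_src):
    """Warp src_img into the middle (target) frame using the predicted depth and the
    relative motion, then compare with tgt_img. Shapes: images (B, 3, H, W), depth
    (B, 1, H, W), K (B, 3, 3), T_tgt_to_src (B, 4, 4)."""
    B, _, H, W = tgt_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)
    cam = torch.linalg.inv(K) @ pix * tgt_depth.view(B, 1, -1)   # back-project pixels
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)     # homogeneous coords
    src_cam = (T_tgt_to_src @ cam_h)[:, :3]                      # move to source frame
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)   # perspective divide
    # Normalize to [-1, 1] for grid_sample and warp the source image.
    gx = src_pix[:, 0] / (W - 1) * 2 - 1
    gy = src_pix[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    return (warped - tgt_img).abs().mean()
```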
• 51. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
In order to train without 3D bounding box labels, three losses are used: the segmentation loss Lseg, the chamfer distance Lcd, and the photometric reconstruction loss Lrec. The first two are defined on single images, while the photometric reconstruction loss relies on temporal photo-consistency across three consecutive frames (a minimal sketch of the chamfer term is given below).
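A minimal sketch of a symmetric chamfer term between points sampled from the predicted mesh and the point cloud back-projected from the input depth map inside the segmentation mask; this is the generic chamfer distance, not necessarily the paper's exact variant.

```python
import torch

def chamfer_distance(pred_pts: torch.Tensor, depth_pts: torch.Tensor) -> torch.Tensor:
    """Symmetric chamfer distance Lcd between a predicted point set (N, 3) and the
    back-projected depth point set (M, 3)."""
    d = torch.cdist(pred_pts, depth_pts)          # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```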
• 52. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
Qualitative comparison of MonoGRNet (1st row), Mono3D (2nd row), and this method (3rd row) with depth maps from BTS. GT bounding boxes for cars are shown in red, predicted bounding boxes in green, together with the back-projected point cloud. In comparison to Mono3D, the prediction accuracy is increased, especially for vehicles farther away. The performance of MonoGRNet and this model is comparable.
  • 53. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels