3D Interpretation from Single 2D Image
for Autonomous Driving II
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Task-Aware Mono Depth Estimation for 3D
Object Detection
• M3D-RPN: Mono 3D Region Proposal Network
for Object Detection
• Mono 3D Object Detection with Pseudo-LiDAR
Point Cloud
• Mono 3D Object Detection and Box Fitting
Trained E2E Using IoU Loss
• Disentangling Mono 3D Object Detection
• Shift R-CNN: Deep Mono 3d Object Detection
With Closed-Form Geometric Constraints
• Mono 3D Object Detection via Geometric
Reasoning on Keypoints
• Mono 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
• GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
• Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
• Mono3D++: Mono 3D Vehicle Detection with
Two-Scale 3D Hypotheses and Task Priors
• Orthographic Feature Transform for Mono 3D
Object Detection
• Multi-Level Fusion based 3D Object Detection
from Mono Images
• MonoGRNet: A Geometric Reasoning Network
for Mono 3D Object Localization
• 3D Bounding Boxes for Road Vehicles: A One-
Stage, Localization Prioritized Approach using
Single Mono Images
• Joint Mono 3D Vehicle Detection and Tracking
Task-Aware Monocular Depth
Estimation for 3D Object Detection
• Monocular depth estimation enables 3D perception from a single 2D image, thus attracting
much research attention for years.
• Almost all methods treat foreground and background regions (“things and stuff”) in an
image equally.
• However, depth of foreground objects plays a crucial role in 3D object recognition and
localization.
• It first analyzes the data distributions and interaction of foreground and background, and then derives the foreground-background separated monocular depth estimation (ForeSeE) method, which estimates foreground and background depth using separate optimization objectives and decoders (a minimal sketch follows this list).
• This method significantly improves the depth estimation performance on foreground objects.
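Below is a hypothetical PyTorch-style sketch of the separation idea: a shared feature map feeds two depth decoders, and each decoder is supervised only on its own region through a foreground mask. The module and loss names, the L1 criterion, and the merge rule are assumptions for illustration, not the authors' exact ForeSeE implementation.

```python
# Hypothetical sketch of foreground/background-separated depth supervision
# (decoder structure, loss choice, and merge rule are assumptions, not ForeSeE's code).
import torch
import torch.nn as nn

class ForeSeEHead(nn.Module):
    """Two depth decoders on top of a shared feature map."""
    def __init__(self, in_ch=256):
        super().__init__()
        def decoder():
            return nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(64, 1, 1))
        self.fg_decoder, self.bg_decoder = decoder(), decoder()

    def forward(self, feats):
        return self.fg_decoder(feats), self.bg_decoder(feats)

def foresee_loss(fg_depth, bg_depth, gt_depth, fg_mask,
                 criterion=nn.L1Loss(reduction="none")):
    """Separate objectives: each decoder is only supervised on its own region."""
    fg_term = (criterion(fg_depth, gt_depth) * fg_mask).sum() / fg_mask.sum().clamp(min=1)
    bg_mask = 1.0 - fg_mask
    bg_term = (criterion(bg_depth, gt_depth) * bg_mask).sum() / bg_mask.sum().clamp(min=1)
    return fg_term + bg_term

feats = torch.randn(2, 256, 48, 156)                 # shared encoder features
fg, bg = ForeSeEHead()(feats)
gt = torch.rand(2, 1, 48, 156) * 60.0                # stand-in depth ground truth
mask = (torch.rand(2, 1, 48, 156) > 0.8).float()     # stand-in foreground mask
print(foresee_loss(fg, bg, gt, mask))
# At inference, a simple merge keeps the foreground prediction inside detected
# object regions and the background prediction elsewhere.
```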
Task-Aware Monocular Depth
Estimation for 3D Object Detection
(a) ForeSeE (b) 3D Object Detection
Illustration of the overall pipeline. (a) Foreground-background separated depth estimation. (b) 3D object detection.
Task-Aware Monocular Depth
Estimation for 3D Object Detection
(a) Input Image (b) Baseline-PL (c) ForeSeE-PL
Qualitative results of 3D object detection. The ground truth 3D bounding boxes are in red; the predictions are in green.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
• Understanding the world in 3D is a critical component of urban autonomous driving.
• Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been
paramount for successful 3D object detection algorithms, whereas monocular image-only
methods experience drastically reduced performance.
• It proposes to reduce the gap by reformulating the monocular 3D detection problem as a
standalone 3D region proposal network, called M3D-RPN.
• M3D-RPN leverages the geometric relationship of 2D and 3D perspectives, allowing 3D
boxes to utilize well-known and powerful convolutional features generated in the image-
space.
• To help address the strenuous 3D parameter estimations, it further designs depth-aware
convolutional layers which enable location specific feature development and in consequence
improved 3D scene understanding.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
M3D-RPN uses a single monocular 3D region proposal network with global convolution (orange)
and local depth-aware convolution (blue) to predict multi-class 3D bounding boxes.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Comparison of Deep3DBox (CVPR’17) and Multi-Fusion (CVPR’18) with M3D-RPN. Notice
that prior works are comprised of multiple internal stages (orange), and external
networks (blue), whereas M3D-RPN is a single-shot network trained end-to-end.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Overview of M3D-RPN. The method consists of parallel paths for global (orange) and local (blue) feature extraction. The global features use regular spatially-invariant convolution, while the local features denote depth-aware convolution. The depth-aware convolution uses non-shared kernels in the row-space, k_i for i = 1 . . . b, where b denotes the number of distinct bins. To leverage both feature variants, each output parameter from the parallel paths is combined with learned weights.
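A minimal sketch of such a row-binned (depth-aware) convolution is given below, assuming the feature map is split into b horizontal bins and each bin gets its own non-shared kernel; the class name, the bin boundaries, and the simplified handling of bin borders are illustrative rather than M3D-RPN's exact layer.

```python
# Minimal sketch of a row-binned ("depth-aware") convolution: each horizontal bin
# of the feature map is processed by its own 3x3 kernel, so features can
# specialize with image height (a proxy for depth in driving scenes).
import torch
import torch.nn as nn

class DepthAwareConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, num_bins=4, kernel_size=3, padding=1):
        super().__init__()
        self.num_bins = num_bins
        # One independent convolution per row bin (non-shared kernels).
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
            for _ in range(num_bins))

    def forward(self, x):
        h = x.shape[2]
        edges = torch.linspace(0, h, self.num_bins + 1).long().tolist()
        outs = []
        for k, conv in enumerate(self.convs):
            # Boundary handling between bins is simplified compared with the paper.
            outs.append(conv(x[:, :, edges[k]:edges[k + 1], :]))
        return torch.cat(outs, dim=2)

feat = torch.randn(1, 64, 32, 110)
out = DepthAwareConv2d(64, 64, num_bins=4)(feat)   # -> (1, 64, 32, 110)
print(out.shape)
```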
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Anchor Formulation and Visualized 3D Anchors. Each parameter within the 2D / 3D anchor formulation is depicted (left). The precomputed 3D priors are visualized for 12 anchors after projection into the image view (middle) and Bird's Eye View (right). For visualization purposes only, anchors are spanned at specific x3D locations that best minimize overlap when viewed.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Qualitative Examples. Qualitative examples of the method for multi-class 3D object detection are visualized. Yellow denotes cars, green pedestrians, and orange cyclists. All illustrated images are from the split of Chen et al. (NIPS'15) and were not used for training.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
• It aims at bridging the performance gap between 3D sensing and 2D sensing for 3D object
detection by enhancing LiDAR-based algorithms to work with single image input.
• Specifically, to perform monocular depth estimation and lift the input image to a point cloud
representation, called pseudo-LiDAR point cloud.
• Then train a LiDAR-based 3D detection network with pseudo-LiDAR end-to-end.
• Following the pipeline of two-stage 3D detection algorithms, detect 2D object proposals in
the input image and extract a point cloud frustum from the pseudo-LiDAR for each proposal,
later an oriented 3D bounding box is detected for each frustum.
• To handle the large amount of noise in the pseudo-LiDAR: (1) use a 2D-3D bounding box
consistency constraint, adjusting the predicted 3D bounding box to have a high overlap with
its corresponding 2D proposal after projecting onto the image; (2) use the instance mask
instead of the bounding box as the representation of 2D proposals, in order to reduce the
number of points not belonging to the object in the point cloud frustum.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
(a) Lift every pixel of the input image to 3D coordinates given the estimated depth to generate pseudo-LiDAR; (b) instance mask proposals are detected for extracting point cloud frustums; (c) a 3D bounding box (blue) is estimated for each point cloud frustum and made consistent with the corresponding 2D proposal. Inputs and losses are in red and orange.
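A minimal sketch of step (a), the pseudo-LiDAR lifting, is shown below: every pixel is back-projected with its estimated depth using the pinhole model. The KITTI-like intrinsics and the random depth map are stand-in values just to make the snippet runnable.

```python
# Minimal sketch of pseudo-LiDAR generation: back-project every pixel with its
# estimated depth into 3D camera coordinates (fx, fy, cx, cy are camera intrinsics;
# the depth map here is random just to make the example runnable).
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)   # (H*W, 3) point cloud

depth = np.random.uniform(5.0, 70.0, size=(375, 1242))   # stand-in depth map
points = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(points.shape)   # (465750, 3)
```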
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
The 3D bounding box output is parameterized as a set of seven parameters: the 3D coordinates of the object center (x, y, z), the object's size (h, w, l), and its heading angle θ.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
Left: it demonstrates that, when lifting all the pixels within the 2D bounding box proposal into 3D, the
generated point cloud frustum has the long tail issue. Right: lifting only the pixels within the instance
mask proposal significantly removes the points not being enclosed by the ground truth box, resulting in a
point cloud frustum with no tail.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
To alleviate the local misalignment issue, the geometric constraint of bounding box consistency is used to refine the 3D bounding box estimate. Given an inaccurate 3D bounding box estimate, it is highly likely that its 2D projection also does not match the corresponding 2D proposal well. By adjusting the 3D bounding box estimate in 3D space so that its 2D projection has a higher 2D IoU with the corresponding 2D proposal, the 3D IoU of the estimate with its ground truth can also be increased.
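The consistency score being maximized can be sketched as follows: build the eight corners of a candidate 3D box, project them with the camera matrix, take the axis-aligned 2D envelope, and compute its IoU with the 2D proposal. The box convention follows the common KITTI parameterization, and the projection matrix and box values below are stand-ins.

```python
# Sketch of the 2D-3D box consistency check: 3D box (x, y, z, h, w, l, theta) ->
# 8 corners -> projection with a 3x4 camera matrix P -> 2D envelope -> IoU with
# the 2D proposal (yaw is about the camera y-axis; y sits on the bottom face).
import numpy as np

def box3d_corners(x, y, z, h, w, l, theta):
    xs = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    ys = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    zs = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    R = np.array([[ np.cos(theta), 0, np.sin(theta)],
                  [ 0,             1, 0            ],
                  [-np.sin(theta), 0, np.cos(theta)]])
    return (R @ np.vstack([xs, ys, zs])).T + np.array([x, y, z])

def project_to_2d_box(corners, P):
    pts = np.hstack([corners, np.ones((8, 1))]) @ P.T
    uv = pts[:, :2] / pts[:, 2:3]
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])

def iou_2d(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

P = np.array([[721.5, 0, 609.6, 0], [0, 721.5, 172.9, 0], [0, 0, 1, 0]])
proposal = np.array([500.0, 150.0, 700.0, 260.0])          # 2D proposal box
projected = project_to_2d_box(box3d_corners(2.0, 1.6, 20.0, 1.5, 1.7, 4.0, 0.3), P)
print(iou_2d(projected, proposal))   # score to maximize while adjusting the 3D box
```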
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
Qualitative results of the proposed method on KITTI. The 3D bounding box estimates (blue) and ground truth (red) are visualized on the frontal images (1st and 3rd rows) and the pseudo-LiDAR point cloud (2nd and 4th rows).
Monocular 3D Object Detection and Box Fitting Trained
End-to-End Using Intersection-over-Union Loss
• Three-dimensional object detection from a single view is a challenging task which, if
performed with good accuracy, is an important enabler of low-cost mobile robot perception.
• Previous approaches to this problem suffer either from an overly complex inference engine
or from an insufficient detection accuracy.
• To deal with these issues, propose SS3D, a single-stage monocular 3D object detector.
• The framework consists of (i) a CNN, which outputs a redundant representation of each
relevant object in the image with corresponding uncertainty estimates, and (ii) a 3D
bounding box optimizer.
• The SS3D architecture provides a solid framework upon which high performing detection
systems can be built, with autonomous driving being the main application in mind.
Monocular 3D Object Detection and Box Fitting Trained
End-to-End Using Intersection-over-Union Loss
The pipeline consists of the following steps: 1) a CNN performs object detection (yielding class scores) and regresses a set of intermediate values used later for 3D bounding box fitting, 2) non-maximum suppression is applied to discard redundant detections, and finally 3) the 3D bounding boxes are fitted to the intermediate predictions using a non-linear least-squares method.
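The structure of step 3) can be sketched as an uncertainty-weighted nonlinear least-squares solve: each CNN output contributes a residual scaled by its predicted uncertainty. The observation model below is a deliberately simple stand-in (fitting a 2D box to redundant corner observations), not the actual set of 2D/3D terms SS3D regresses.

```python
# Sketch of the fitting step's structure: nonlinear least squares over box
# parameters, with each CNN output weighted by its predicted uncertainty (1/sigma).
import numpy as np
from scipy.optimize import least_squares

def observation_model(params):
    """Toy model: a 2D box (cx, cy, w, h) -> its four corner coordinates."""
    cx, cy, w, h = params
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy - h / 2,
                     cx - w / 2, cy + h / 2, cx + w / 2, cy + h / 2])

def fit_box(pred_obs, pred_sigma, init):
    # Residuals are divided by the predicted sigma, so uncertain outputs pull
    # less on the solution; this mirrors the uncertainty-weighted fitting idea.
    residual = lambda p: (observation_model(p) - pred_obs) / pred_sigma
    return least_squares(residual, x0=init).x

true = np.array([320.0, 180.0, 60.0, 40.0])
obs = observation_model(true) + np.random.normal(0, 2.0, size=8)   # noisy "CNN outputs"
sigma = np.full(8, 2.0)                                            # predicted uncertainties
print(fit_box(obs, sigma, init=np.array([300.0, 170.0, 50.0, 30.0])))
```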
Monocular 3D Object Detection and Box Fitting Trained
End-to-End Using Intersection-over-Union Loss
In the center of each object’s ground truth 2D bounding box a rectangular support region is
created, with width and height set to 20% of the former. Each output pixel in the support
region holds regression targets and one-hot classification targets for the network. In the rare
case when support regions overlap, the closer object is favored.
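A small sketch of how such a support region can be rasterized is given below; the feature-map stride and the overlap tie-breaking are simplified assumptions.

```python
# Sketch of building the rectangular support region: a mask centered on the 2D
# ground-truth box whose width/height are 20% of that box.
import numpy as np

def support_region_mask(box, image_shape, ratio=0.2):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    mask = np.zeros(image_shape, dtype=bool)
    mask[int(cy - h / 2):int(cy + h / 2) + 1,
         int(cx - w / 2):int(cx + w / 2) + 1] = True
    return mask   # pixels inside carry regression + one-hot classification targets

mask = support_region_mask((100, 50, 300, 170), image_shape=(375, 1242))
print(mask.sum(), "support pixels")
```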
Disentangling Monocular 3D Object Detection
• An approach for monocular 3D object detection from a single RGB image leverages a
disentangling transformation for 2D and 3D detection losses and a novel, self-supervised
confidence score for 3D bounding boxes.
• The proposed loss disentanglement has the twofold advantage of simplifying the training
dynamics in the presence of losses with complex interactions of parameters, and
sidestepping the issue of balancing independent regression terms.
• Its solution overcomes these issues by isolating the contribution made by groups of parameters to a given loss, without changing its nature (a minimal sketch of this idea follows the list).
• Loss disentanglement is further applied to a signed-IoU criterion-driven loss to improve the 2D detection results.
• The AP metric used in KITTI3D is critically reviewed; a flaw in the 11-point interpolated AP metric is identified and resolved, which affects all previously published detection results and particularly biases the results of monocular 3D detection.
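One way to realize the disentangling transformation is sketched below: a single corner-based 3D box loss is split into per-group terms, where each term uses the predictions of one parameter group (dimensions, center, rotation) and the ground truth for the others, so only that group drives its own gradient. The corner construction and the L1 criterion are simplifications of the paper's loss.

```python
# Hypothetical illustration of the disentangling idea for a 3D-box loss.
import torch
import torch.nn.functional as F

def corners(dims, center, yaw):
    h, w, l = dims.unbind(-1)
    xs = torch.stack([ l,  l, -l, -l,  l,  l, -l, -l], -1) / 2
    ys = torch.stack([torch.zeros_like(h)] * 4 + [-h] * 4, -1)
    zs = torch.stack([ w, -w, -w,  w,  w, -w, -w,  w], -1) / 2
    c, s = torch.cos(yaw), torch.sin(yaw)
    x = c * xs + s * zs           # rotation about the camera y-axis
    z = -s * xs + c * zs
    return torch.stack([x, ys, z], -1) + center[..., None, :]

def disentangled_loss(pred, gt):
    groups = ("dims", "center", "yaw")
    gt_corners = corners(gt["dims"], gt["center"], gt["yaw"])
    total = 0.0
    for g in groups:
        # Only group g comes from the prediction; the rest is ground truth.
        mixed = {k: (pred[k] if k == g else gt[k]) for k in groups}
        total = total + F.l1_loss(
            corners(mixed["dims"], mixed["center"], mixed["yaw"]), gt_corners)
    return total

gt = {"dims": torch.tensor([1.5, 1.7, 4.0]),
      "center": torch.tensor([2.0, 1.6, 20.0]),
      "yaw": torch.tensor(0.3)}
pred = {k: v + 0.1 for k, v in gt.items()}
print(disentangled_loss(pred, gt))
```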
Disentangling Monocular 3D Object Detection
• A two-stage architecture consists of a single-
stage 2D detector (first stage) with an additional
3D detection head (second stage) constructed
on top of features pooled from the detected 2D
bounding boxes.
• The backbone is a ResNet34 with a Feature
Pyramid Network (FPN) built on top of it.
• The FPN network has the structure with 3+2
scales, connected to the output of modules
conv3, conv4 and conv5 of ResNet34,
corresponding to down-sampling factors of ×8,
×16 and ×32, respectively.
Disentangling Monocular 3D Object Detection
• It considers the head of the single-stage 2D detector implemented in RetinaNet, which applies a detection module independently to each output f_i of the backbone.
• The detection modules share the same parameters but work inherently at different scales,
according to the scale of the features that they receive as input.
• As opposed to the standard RetinaNet, it employs iABNsync also in this head.
• The head is composed of two parallel stacks of 3 × 3 convolutions, and is parametrized by n_a reference bounding box sizes (anchors) per scale level.
Disentangling Monocular 3D Object Detection
• The 3D detection head regresses a 3D bounding box for each 2D bounding box returned
by the 2D detection head (surviving the filtering step).
• It starts by applying ROIAlign to pool features from FPN into a 14 × 14 grid for each 2D
bounding box, followed by 2 × 2 average pooling, resulting in feature maps with shape 7 ×
7 × 128.
• On top of this, two parallel branches of fully connected layers with 512 channels compute
the outputs.
• Each fully connected layer but the last one per branch is followed by iABN (non-synchronized). A minimal sketch of this head's shape flow is given below.
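The sketch below mirrors the stated shape flow (ROIAlign 14 × 14 → 2 × 2 average pooling → 7 × 7 × 128 → two parallel fully connected branches of width 512). iABN is replaced by plain BatchNorm + ReLU only to keep the snippet self-contained, and the output dimensions of the two branches are assumptions.

```python
# Rough PyTorch sketch of the 3D head's shape flow; iABN is replaced by
# BatchNorm + ReLU purely to keep the example self-contained.
import torch
import torch.nn as nn

class Head3D(nn.Module):
    def __init__(self, in_ch=128, hidden=512, out_a=6, out_b=2):
        super().__init__()
        self.pool = nn.AvgPool2d(2)                  # 14x14 -> 7x7
        def branch(out_dim):
            return nn.Sequential(
                nn.Flatten(),
                nn.Linear(7 * 7 * in_ch, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, out_dim))          # last FC has no normalization
        # Two parallel branches compute the 3D outputs (the exact split of the
        # outputs into box parameters vs. confidence is an assumption here).
        self.branch_a, self.branch_b = branch(out_a), branch(out_b)

    def forward(self, roi_feats):                    # roi_feats: (N, 128, 14, 14)
        x = self.pool(roi_feats)
        return self.branch_a(x), self.branch_b(x)

rois = torch.randn(4, 128, 14, 14)                   # features pooled via ROIAlign
a, b = Head3D()(rois)
print(a.shape, b.shape)                              # torch.Size([4, 6]) torch.Size([4, 2])
```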
Disentangling Monocular 3D Object Detection
Semantics of the outputs of the 2D and 3D detection heads
Left: 2D bounding box regression on the image plane. Center: 3D bounding box regression. Right: allocentric angle from the bird's-eye view.
Disentangling Monocular 3D Object Detection
Classes Car (top), Pedestrian
(middle) and Cyclist(bottom)
with corresponding birds-
eye view.
Shift R-CNN: Deep Monocular 3d Object Detection
With Closed-Form Geometric Constraints
• Shift R-CNN, a hybrid model for monocular 3D object detection, combines deep learning
with the power of geometry.
• It adapts a Faster R-CNN network for regressing initial 2D and 3D object properties and combines it with a least squares solution to the inverse 2D-to-3D geometric mapping problem, using the camera projection matrix.
• The closed-form solution of the mathematical system, along with the initial output of the
adapted Faster R-CNN are then passed through a final ShiftNet network that refines the
result using proposed Volume Displacement Loss.
• This geometrically constrained deep learning approach to monocular 3D object detection
obtains top results on KITTI 3D Object Detection Benchmark, being the best among all
monocular methods that do not use any pre-trained network for depth estimation.
Shift R-CNN: Deep Monocular 3d Object Detection
With Closed-Form Geometric Constraints
Overview of Shift R-CNN hybrid model. Stage 1: Faster R-CNN with added 3D angle and dimension regression. Stage 2:
Closed-form solution to 3D translation using camera projection geometric constraints. Stage 3: ShiftNet refinement and
final 3D object box reconstruction.
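The Stage 2 idea can be sketched as a small linear least-squares problem: with orientation and dimensions fixed, the constraint that a particular 3D corner projects onto a particular edge of the 2D box is linear in the translation, so four such constraints determine it. The corner-to-edge assignment and the projection matrix below are assumptions; the actual method evaluates the geometrically valid assignments rather than a single fixed one.

```python
# Sketch of a Stage-2 style closed-form step: "this 3D corner projects onto this
# 2D box edge" is linear in the translation T once R and the dimensions are fixed.
import numpy as np

def corners_object_frame(h, w, l):
    xs = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    ys = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    zs = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    return np.vstack([xs, ys, zs]).T                      # (8, 3)

def solve_translation(P, R, dims, box2d, assignment):
    """assignment maps each edge ('umin','umax','vmin','vmax') to a corner index."""
    u_min, v_min, u_max, v_max = box2d
    X = corners_object_frame(*dims) @ R.T                 # rotated corners, (8, 3)
    rows, rhs = [], []
    for edge, value, prow in [("umin", u_min, 0), ("umax", u_max, 0),
                              ("vmin", v_min, 1), ("vmax", v_max, 1)]:
        a = P[prow, :3] - value * P[2, :3]
        x = X[assignment[edge]]
        rows.append(a)
        rhs.append(-(a @ x + P[prow, 3] - value * P[2, 3]))
    T, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return T                                              # 3D translation (x, y, z)

P = np.array([[721.5, 0, 609.6, 0], [0, 721.5, 172.9, 0], [0, 0, 1, 0.0]])
theta = 0.3
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])
assignment = {"umin": 2, "umax": 0, "vmin": 5, "vmax": 1}  # assumed corner choice
print(solve_translation(P, R, (1.5, 1.7, 4.0), (560, 160, 680, 230), assignment))
```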
Shift R-CNN: Deep Monocular 3d Object Detection
With Closed-Form Geometric Constraints
Stage 2 (top) and Stage 3 (bottom) results comparison. Note that Stage 3 improves the 3D estimation
due to its noise robustness. Turquoise boxes denote objects with the same orientation and magenta
color the opposite orientation.
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
• Monocular 3D object detection is well-known to be a challenging vision task due to the loss
of depth information;
• Attempts to recover depth using separate image-only approaches lead to unstable and
noisy depth estimates, harming 3D detections.
• It proposes a keypoint-based approach for 3D object detection and localization from a
single RGB image.
• It then builds a multi-branch model around 2D keypoint detection and implements it with a conceptually simple geometric reasoning method.
• This network performs in an end-to-end manner, simultaneously and interdependently
estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along
with full 3D pose in the scene.
• To fuse the outputs of the distinct branches, a reprojection consistency loss is applied during training.
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
Start with a universal backbone network (Mask R-CNN) and complement it with three sub-networks:
2D object detection sub-network, 2D keypoints regression sub-network, and dimension regression
sub-network. The network is trained end-to-end using a multi-task loss function.
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
The 5 geometric classes of instances are represented by 5 3D CAD models with strongly distinct aspect ratios.
Geometric reasoning about instance depth
Coordinates and a visibility state are predicted for each of the 14 manually chosen keypoints;
Define instance depth as the depth Z of a vertical plane
passing through the two closest keypoints in the
camera reference frame.
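A minimal illustration of the pinhole relation that such keypoint-based depth reasoning can rest on is given below (not the authors' exact formulation): if two keypoints with a known 3D separation, taken from the matched CAD model, lie at roughly the same depth, that depth follows from their pixel separation by similar triangles.

```python
# Illustrative pinhole relation behind keypoint-based depth reasoning (assumes the
# two keypoints share approximately the same depth Z): d2d = f * d3d / Z.
def instance_depth_from_keypoints(f, d3d, d2d):
    return f * d3d / d2d

# Example: keypoints 1.6 m apart on the matched CAD model, 58 px apart in the
# image, focal length ~721 px (KITTI-like) -> depth of about 20 m.
print(instance_depth_from_keypoints(f=721.5, d3d=1.6, d2d=58.0))  # ~19.9
```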
Annotation of 3D keypoints
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
The upper part of each sub-figure
contains 2D detection inference,
including 2D bounding boxes and 2D locations of the visible keypoints. Each instance and its keypoints are displayed in their own distinctive color. The lower part visualizes
the 3D point cloud, showing the camera
location as the colored XYZ axes. Green
and red colors stand for the ground truth
and predicted 3D bounding boxes
respectively. The scenes were selected to
express diversity in complexity and cars
positioning w.r.t. the camera.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
• MonoPSR, a monocular 3D object detection method, leverages proposals and shape
reconstruction.
• First, using the fundamental relations of a pinhole camera model, detections from a mature
2D object detector are used to generate a 3D proposal per object in a scene.
• The 3D locations of these proposals prove to be quite accurate, which greatly reduces the difficulty of regressing the final 3D bounding box detection.
• Simultaneously, a point cloud is predicted in an object centered coordinate system to learn
local scale and shape information.
• However, the key challenge is how to exploit shape information to guide 3D localization.
• Losses, including a projection alignment loss, are aggregated to jointly optimize these tasks in the neural network and improve 3D localization accuracy.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
The network takes an image with 2D bounding boxes and regresses instance-centric 3D proposals to produce 3D bounding boxes; a point cloud is estimated to recover local shape and scale and to enforce 2D-3D consistency. The proposal regression and point cloud estimation are trained jointly in the network.
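One common way to realize the pinhole-based proposal mentioned above is sketched below: depth follows from the ratio of a class-prior height to the detection's pixel height, and the box centre is back-projected at that depth to give a 3D centroid proposal. The prior height, intrinsics, and box values are illustrative stand-ins rather than MonoPSR's exact regression targets.

```python
# Illustrative proposal generation from a 2D detection via pinhole relations:
# depth from the ratio of a prior 3D height to the pixel height, then
# back-projection of the box centre to a full 3D centroid proposal.
import numpy as np

def proposal_centroid(box2d, prior_height, fx, fy, cx, cy):
    x1, y1, x2, y2 = box2d
    z = fy * prior_height / (y2 - y1)          # similar triangles on object height
    u, v = (x1 + x2) / 2, (y1 + y2) / 2
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# A car detection ~72 px tall with a ~1.53 m class-prior height -> ~15 m away.
print(proposal_centroid((540, 160, 700, 232), prior_height=1.53,
                        fx=721.5, fy=721.5, cx=609.6, cy=172.9))
```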
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
The network produces a feature map using an image crop of an object and global context features as inputs. From this feature map three tasks are performed: a) the dimensions and orientation are predicted to estimate a proposal; b) offsets for the proposals are regressed; c) local point clouds are predicted and transformed into the global frame for auxiliary loss calculations.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
Losses for the corresponding predictions (red)
and ground truth (green). All penalties use the
smooth L1 loss at valid pixel locations using
automatically generated segmentation masks.
First, the point cloud loss penalizes the instance
point cloud along each channel (x, y, z). The point
cloud is then placed at its estimated location in
the camera coordinate frame using TCO, the
transformation between object and camera
coordinate frames, and penalized in the last
channel z. Finally, the point cloud is projected
into image space with Π, the camera projection
matrix. A projection alignment loss penalizes
points projected into the wrong image pixel
location.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
KITTI dataset. 2D detections (top) are shown in orange. 3D detections in green are shown projected into the
image (top) and in the 3D scene (bottom). Ground truth 3D boxes (bottom) are shown in red. Points within the
detection boxes are the estimated point clouds from the network, while the background points are taken from
the colorized interpolated LiDAR scan. Note that for pedestrians in particular, the projected 3D boxes do not fit
tightly within their 2D box, so constraining the 3D box with the 2D box is not ideal.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
• This is an efficient 3D object detection framework based on a single RGB image in the
scenario of autonomous driving.
• The efforts are put on extracting the underlying 3D information in a 2D image and
determining the accurate 3D bounding box of the object without point cloud or stereo data.
• Leveraging an off-the-shelf 2D object detector, it proposes an approach to efficiently obtain a coarse cuboid for each predicted 2D box.
• The coarse cuboid is accurate enough to guide the refinement that determines the 3D box of the object.
• In contrast to previous SoA methods that only use the features extracted from the 2D
bounding box for box refinement, it explores the 3D structure information of the object by
employing the visual features of visible surfaces.
• The features from surfaces are utilized to eliminate the problem of representation ambiguity
brought by only using a 2D bounding box.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
The key idea: (a) First predict a reliable 2D box and its observation orientation. (b) Based on the predicted 2D
information, utilize artful techniques to efficiently determine
a basic cuboid for the corresponding object, called guidance.
(c) Features extracted from the visible surfaces of projected
guidance as well as the tight 2D bounding box of it will be
utilized by the model to perform accurate refinement with
classification formulation and quality-aware loss.
An example of the feature representation ambiguity caused by only using a 2D bounding box: the 3D boxes vary largely from each other and only the left one is correct, but their corresponding 2D bounding boxes are exactly the same.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
For 2D detection, the Faster R-CNN framework is modified by adding a new branch for orientation prediction.
Top view of observation angle α and
global rotation angle θ.
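For reference, the relation between the two angles under the standard KITTI convention can be written out as below: the global yaw equals the observation angle plus the viewing-ray angle of the object centre. The numbers in the example are arbitrary.

```python
# Relation between observation angle alpha and global yaw theta (standard KITTI
# convention): theta = alpha + arctan(x / z), where (x, z) is the object centre
# in camera coordinates. The network observes alpha from the crop; theta is
# recovered once the location is known.
import math

def global_yaw(alpha, x, z):
    return alpha + math.atan2(x, z)

print(global_yaw(alpha=0.10, x=2.0, z=20.0))   # ~0.10 + 0.0997 = ~0.20 rad
```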
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
3D object detection paradigm. A CNN based model (2D+O subnet) is used to obtain a 2D bounding
box and observation orientation of the object. The guidance is then generated by the proposed
algorithm using the obtained 2D box and orientation together with the projection matrix. Features extracted from the visible surfaces, as well as the 2D bounding box of the projected guidance, are utilized by the refinement model (3D subnet). Instead of direct regression, the refinement model adopts a classification formulation with the quality-aware loss for a more accurate result.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
Visualization of feature extraction from the projected surfaces of 3D box by
perspective transformation.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
Details of the head of 3D subnet
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
3D detection results
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
• This is a monocular 3D object detection framework in the domain of autonomous driving.
• Unlike previous image-based methods that operate on RGB features from 2D images, this method solves the problem in the reconstructed 3D space in order to exploit 3D context explicitly.
• It first leverages a stand-alone module to transform the input data from the 2D image plane to 3D point-cloud space for a better input representation, then performs 3D detection using a PointNet backbone to obtain objects' 3D locations, dimensions and orientations.
• To enhance the discriminative capability of the point clouds, a multi-modal feature fusion module embeds the complementary RGB cue into the generated point clouds.
• It is more effective to infer the 3D bounding boxes in the generated 3D scene space (i.e., X, Y, Z space) than on the image plane (i.e., the R, G, B image plane).
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
Framework for monocular 3D object detection.
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
• It consists of two main stages: 3D data generation phase and 3D box estimation phase.
• In the 3D data generation phase, two deep CNNs are trained for intermediate tasks (2D detection and depth estimation) to obtain position and depth information.
• In particular, the generated depth is transformed into a point cloud, which is a better representation for 3D detection, and the 2D bounding box provides prior information about the location of the RoI (region of interest).
• Finally, the points in each RoI are extracted as the input data for subsequent steps.
• In the 3D box estimation phase, two modules are designed for background point segmentation and RGB information aggregation, respectively, in order to improve the final task.
• After that, PointNet is used as the backbone network to predict the 3D location, dimension and orientation for each RoI.
• Note that the confidence scores of 2D boxes are assigned to their corresponding 3D boxes.
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
3D box estimation (Det-Net) with RGB features fusion module
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
Qualitative comparisons of RGB information: 3D boxes are projected onto the image plane. The detection results using XYZ information only are shown as white boxes, and the blue boxes come from the model trained with the RGB feature fusion module. The proposed RGB fusion method can improve 3D detection accuracy, especially for occlusion/truncation cases.
Mono3D++: Monocular 3D Vehicle Detection with Two-
Scale 3D Hypotheses and Task Priors
• A method to infer 3d pose and shape of vehicles from a single image.
• To tackle this ill-posed problem, optimize two-scale projection consistency between the
generated 3d hypotheses and their 2d pseudo-measurements.
• Specifically, use a morphable wireframe model to generate a fine-scaled representation of
vehicle shape and pose.
• To reduce its sensitivity to 2d landmarks, jointly model the 3d bounding box as a coarse
representation which improves robustness.
• It also integrates three task priors (unsupervised monocular depth, a ground-plane constraint, and vehicle shape priors), together with forward projection errors, into an overall energy function.
Mono3D++: Monocular 3D Vehicle Detection with Two-
Scale 3D Hypotheses and Task Priors
The method takes a single image as input and generates vehicles' 3D shape and pose estimates in camera coordinates. The inference criterion combines a generative component, jointly optimizing the innovation (forward prediction error) between the projection of the 3D hypotheses and the image pseudo-measurements, monocular depth map constraints, and geometric (ground) constraints, in addition to penalizing large deformations of the shape prior.
Mono3D++: Monocular 3D Vehicle Detection with Two-
Scale 3D Hypotheses and Task Priors
The two-scale 3D hypotheses consist of the rotated and scaled 3D Bbox and morphable wireframe model.
The image pseudo-measurements include 2D Bboxes and landmarks. In the inference scheme, the hypotheses and the pseudo-measurements are used to initialize the optimization and generate the final 3D pose and shape estimate of a vehicle.
Orthographic Feature Transform for
Monocular 3D Object Detection
• Due to the perspective image-based representation, the appearance and scale of objects vary drastically with depth, and meaningful distances are difficult to infer.
• The ability to reason about the world in 3D is an essential element of the 3D object detection task.
• The orthographic feature transform enables escaping the image domain by mapping image-based features into an orthographic 3D space.
• It allows reasoning holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
• This transformation is applied as part of an E2E deep learning architecture.
Orthographic Feature Transform for
Monocular 3D Object Detection
Orthographic Feature Transform (OFT)
1. A front-end ResNet feature extractor which extracts
multi-scale feature maps from the input image.
2. An orthographic feature transform which transforms the image-based feature maps at each scale into an orthographic birds-eye-view representation (sketched after this list).
3. A top down network, consisting of a series of ResNet
residual units, which processes the birds-eye-view
feature maps in a manner which is invariant to the
perspective effects observed in the image.
4. A set of output heads which generate, for each object class and each location on the ground plane, a confidence score, a position offset, a dimension offset and an orientation vector.
5. A non-maximum suppression and decoding stage,
which identifies peaks in the confidence maps and
generates discrete bounding box predictions.
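A minimal sketch of the transform in step 2 is given below: each cell of a birds-eye-view grid on an assumed ground plane is projected into the image and picks up the image feature at that pixel. The real OFT integrates features over the full projected voxel area; nearest-pixel sampling, the grid extents, and the scaled intrinsics here are simplifying assumptions.

```python
# Minimal sketch of an orthographic feature transform: project each BEV cell
# centre into the image and sample the image feature there (nearest pixel).
import numpy as np

def orthographic_feature_transform(feat, P, x_range=(-20, 20), z_range=(1, 41),
                                   cell=0.5, y_ground=1.65):
    C, H, W = feat.shape
    xs = np.arange(x_range[0], x_range[1], cell)
    zs = np.arange(z_range[0], z_range[1], cell)
    bev = np.zeros((C, len(zs), len(xs)), dtype=feat.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            u, v, w = P @ np.array([x, y_ground, z, 1.0])   # project the cell centre
            if w <= 0:
                continue
            u, v = int(round(u / w)), int(round(v / w))
            if 0 <= u < W and 0 <= v < H:
                bev[:, i, j] = feat[:, v, u]                # orthographic feature
    return bev   # (C, depth cells, lateral cells): scale is consistent everywhere

feat = np.random.randn(8, 48, 156).astype(np.float32)        # stride-8 feature map
P = np.array([[90.0, 0, 76.0, 0], [0, 90.0, 21.6, 0], [0, 0, 1, 0]])  # intrinsics / 8
print(orthographic_feature_transform(feat, P).shape)          # (8, 80, 80)
```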
Orthographic Feature Transform for
Monocular 3D Object Detection
Architecture overview. A front-end ResNet feature extractor generates image-based features, which are mapped to an
orthographic representation via orthographic feature transform. The top down network processes these features in the
birds-eye-view space and at each location on the ground plane predicts a confidence score S, a position offset ∆pos, a
dimension offset ∆dim and an angle vector ∆ang .
Orthographic Feature Transform for
Monocular 3D Object Detection
Qualitative comparison between OFT method (left) and Mono3D at CVPR’16 (right) on the KITTI validation
set. Inset regions highlight the behaviors of the two systems at large distances. OFT is able to consistently
detect distant objects which are beyond the range of Mono3D.
Multi-Level Fusion based 3D Object
Detection from Monocular Images
• An E2E multi-level fusion based framework for 3d object detection from a single
monocular image.
• It is composed of two parts: one for 2D region proposal generation and another for simultaneous prediction of objects' 2D locations, orientations, dimensions, and 3D locations.
• With the help of a stand-alone module that estimates the disparity and computes the 3D point cloud, a multi-level fusion scheme is introduced.
• The disparity information is encoded as a front-view feature representation and fused with the RGB image to enhance the input.
• Features extracted from the original input and the point cloud are combined to boost the object detection. For 3D localization, an extra stream predicts the location information directly from the point cloud and adds it to the aforementioned location prediction.
Multi-Level Fusion based 3D Object
Detection from Monocular Images
3D object detection
Multi-Level Fusion based 3D Object
Detection from Monocular Images
Visualization of the 2D detection boxes and the projected 3D detection boxes on the point cloud inferred from the estimated disparity.
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
• MonoGRNet performs amodal 3D object localization from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
• MonoGRNet is a single, unified network composed of four task-specific subnetworks,
responsible for 2d object detection, instance depth estimation (IDE), 3d localization and local
corner regression.
• Unlike pixel-level depth estimation, which needs per-pixel annotations, the IDE method directly predicts the depth of the target 3D bounding box's center using sparse supervision.
• The 3d localization is further achieved by estimating the position in the horizontal and vertical
dimensions.
• Finally, MonoGRNet is jointly learned by optimizing the locations and poses of the 3d bounding
boxes in the global context.
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
• MonoGRNet for 3d object localization from a monocular RGB image.
• MonoGRNet consists of four subnetworks for 2d detection(brown), instance depth
estimation(green), 3d location estimation(blue) and local corner regression(yellow).
• Guided by the detected 2D Bbox, the network first estimates the depth and the 2D projection of the 3D box's center to obtain the global 3D location, and then regresses corner coordinates in the local context (see the sketch after this list).
• The final 3d bounding box is optimized in an E2E manner in the global context based on
the estimated 3d location and local corners.
• VGG-16 as the CNN backbone, but without its FC layers.
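The geometric step can be sketched as below: back-project the predicted 2D projection of the 3D box centre using the predicted instance depth, then translate the locally regressed corner offsets to that location. The intrinsics, depth, and corner values are illustrative, not MonoGRNet outputs.

```python
# Sketch of turning MonoGRNet-style outputs into a global 3D box: back-project
# the predicted centre projection (u, v) at the predicted instance depth Z_c,
# then shift the regressed local corner offsets to that location.
import numpy as np

def global_center(u, v, z_c, fx, fy, cx, cy):
    return np.array([(u - cx) * z_c / fx, (v - cy) * z_c / fy, z_c])

def global_corners(local_corners, center):
    """local_corners: (8, 3) offsets regressed in the object-local context."""
    return local_corners + center

center = global_center(u=640.0, v=200.0, z_c=22.0, fx=721.5, fy=721.5,
                       cx=609.6, cy=172.9)
local = np.array([[ 2.0,  0.0,  0.85], [ 2.0,  0.0, -0.85],
                  [-2.0,  0.0, -0.85], [-2.0,  0.0,  0.85],
                  [ 2.0, -1.5,  0.85], [ 2.0, -1.5, -0.85],
                  [-2.0, -1.5, -0.85], [-2.0, -1.5,  0.85]])
print(global_corners(local, center))   # 8 corners of the 3D box in the camera frame
```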
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Instance depth estimation subnet
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Notation for 3D bounding box localization
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Instance depth
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Predicted 3D bounding boxes
are drawn in orange, while
ground truths are in blue. Lidar
point clouds are plotted for
reference but not used. Camera
centers are at the bottom-left
corner. (a), (b) and (c) are
common cases when predictions
recall the ground truths.
(d), (e) and (f) demonstrate the model's capability of handling truncated objects outside the
image. (g), (h) and (i) show the
failed detections when some
cars are heavily occluded.
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
• Understanding 3D semantics of the surrounding objects is critically important and a
challenging requirement from the safety perspective of autonomous driving.
• This is a localization-prioritized approach for effectively localizing the position of the object in the 3D world and fitting a complete 3D box around it.
• This method requires a single image and performs both 2D and 3D detection in an end-to-end fashion.
• It works by effectively localizing the projection of the center of the bottom face of the 3D bounding box (CBF) onto the image.
• Later, in the post-processing stage, it uses a look-up-table based approach to reproject the CBF into the 3D world (see the sketch after this list).
• This stage is a one-time setup and simple enough to be deployed in fixed-map communities to store complete knowledge about the ground plane.
• The object’s dimension and pose are predicted in multitask fashion using a shared set of
features.
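A hypothetical sketch of such a look-up table is given below, assuming a flat ground plane: for every pixel, precompute the 3D point where its viewing ray meets the plane, so that localizing the CBF pixel at run time reduces to a table look-up. The plane height, intrinsics, and image size are stand-in values, and a real map-based table would store surveyed ground geometry instead.

```python
# Hypothetical ground-plane look-up table: for each pixel, the 3D point where its
# viewing ray hits the (assumed flat) plane y = h_ground in camera coordinates.
import numpy as np

def build_ground_lut(h_img, w_img, fx, fy, cx, cy, h_ground=1.65):
    u, v = np.meshgrid(np.arange(w_img), np.arange(h_img))
    # Per-pixel ray direction in the camera frame; scale so that y = h_ground.
    dir_x, dir_y, dir_z = (u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=float)
    scale = h_ground / np.where(dir_y > 1e-6, dir_y, np.nan)   # rays above the horizon never hit
    return np.stack([dir_x * scale, dir_y * scale, dir_z * scale], axis=-1)

lut = build_ground_lut(375, 1242, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
cbf_pixel = (620, 250)                      # (u, v) of the detected bottom-face centre
print(lut[cbf_pixel[1], cbf_pixel[0]])      # 3D location of the CBF on the ground plane
```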
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
Illustration of the 2D detection boxes and the corresponding 3D projections.
Joint Mono 3D Vehicle Detection and Tracking
• Vehicle 3D extents and trajectories are critical cues for predicting the future location of
vehicles and planning future agent ego-motion based on those predictions.
• Here is an online framework for 3D vehicle detection and tracking from monocular videos.
• The framework can not only associate detections of vehicles in motion over time, but also
estimate their complete 3D bounding box information from a sequence of 2D images
captured on a moving platform.
• This method leverages 3D box depth-ordering matching for robust instance association and
utilizes 3D trajectory prediction for re-identification of occluded vehicles.
• It also designs a motion learning module based on an LSTM for more accurate long-term
motion extrapolation.
• On the Argoverse dataset, this image-based method significantly outperforms the LiDAR-centric baseline methods for tracking 3D vehicles within 30 meters.
Joint Mono 3D Vehicle Detection and Tracking
Joint online detection and tracking in 3D.
The dynamic 3D tracking pipeline predicts 3D bounding box association of observed vehicles in image sequences captured by a monocular camera with an ego-motion sensor.
Joint Mono 3D Vehicle Detection and Tracking
Overview of the monocular 3D tracking framework. This online approach processes monocular frames to estimate and track regions of interest (RoIs) in 3D (a). For each RoI, 3D layout estimation (i.e., depth, orientation, dimension, and a projection of the 3D center) is learned (b). With the 3D layout, the LSTM tracker produces robust linking across frames, leveraging occlusion-aware association and depth-ordering matching (c). With the help of 3D tracking, the model further refines the 3D estimation by fusing object motion features from the previous frames (d).
Joint Mono 3D Vehicle Detection and Tracking
Illustration of depth-ordering matching. Given the tracklets and detections, sort
them into a list by depth order. For each detection of interest (DOI), calculate the
IOU between DOI and non-occluded regions of each tracklet. The depth order
naturally provides higher probabilities to tracklets near the DOI.
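A hedged sketch of depth-ordering matching is shown below: tracklets are sorted from near to far, each tracklet's non-occluded 2D region is obtained by subtracting the boxes of nearer tracklets, and a detection of interest is scored by its overlap with those visible regions. The pixel-mask occlusion handling and the image size are simplifications of the paper's scheme.

```python
# Sketch of depth-ordering matching: score a detection of interest (DOI) by its
# overlap with each tracklet's non-occluded region, built in near-to-far order.
import numpy as np

def box_mask(box, shape):
    m = np.zeros(shape, dtype=bool)
    x1, y1, x2, y2 = map(int, box)
    m[y1:y2, x1:x2] = True
    return m

def depth_ordering_scores(doi_box, tracklets, shape=(375, 1242)):
    """tracklets: list of (box2d, depth); returns an overlap score per tracklet."""
    order = sorted(range(len(tracklets)), key=lambda i: tracklets[i][1])  # near -> far
    occupied = np.zeros(shape, dtype=bool)
    visible = {}
    for i in order:
        m = box_mask(tracklets[i][0], shape)
        visible[i] = m & ~occupied            # region not hidden by nearer tracklets
        occupied |= m
    doi = box_mask(doi_box, shape)
    return [(doi & visible[i]).sum() / max(doi.sum(), 1) for i in range(len(tracklets))]

tracklets = [((600, 150, 760, 260), 12.0), ((650, 160, 780, 250), 20.0)]
print(depth_ordering_scores((610, 155, 765, 258), tracklets))
```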
Joint Mono 3D Vehicle Detection and Tracking
Illustration of occlusion-aware association. A tracked tracklet (yellow) is visible all the time, while another tracklet (red) is occluded by a third (blue) at frame T−1. During occlusion, the tracklet does not update its state but keeps inferring its motion until reappearance. A truncated or disappearing tracklet (blue at frame T) is left as lost.
Joint Mono 3D Vehicle Detection and Tracking
Experimental results on the KITTI dataset: 3D layout colored with tracking ID
3-d interpretation from single 2-d image for autonomous driving II

3-d interpretation from single 2-d image for autonomous driving II

  • 1.
    3D Interpretation fromSingle 2D Image for Autonomous Driving II Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2.
    Outline • Task-Aware MonoDepth Estimation for 3D Object Detection • M3D-RPN: Mono 3D Region Proposal Network for Object Detection • Mono 3D Object Detection with Pseudo-LiDAR Point Cloud • Mono 3D Object Detection and Box Fitting Trained E2E Using IoU Loss • Disentangling Mono 3D Object Detection • Shift R-CNN: Deep Mono 3d Object Detection With Closed-Form Geometric Constraints • Mono 3D Object Detection via Geometric Reasoning on Keypoints • Mono 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction • GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving • Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving • Mono3D++: Mono 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors • Orthographic Feature Transform for Mono 3D Object Detection • Multi-Level Fusion based 3D Object Detection from Mono Images • MonoGRNet: A Geometric Reasoning Network for Mono 3D Object Localization • 3D Bounding Boxes for Road Vehicles: A One- Stage, Localization Prioritized Approach using Single Mono Images • Joint Mono 3D Vehicle Detection and Tracking
  • 3.
    Task-Aware Monocular Depth Estimationfor 3D Object Detection • Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. • Almost all methods treat foreground and background regions (“things and stuff”) in an image equally. • However, depth of foreground objects plays a crucial role in 3D object recognition and localization. • It first analyse the data distributions and interaction of foreground and background, then get the foreground- background separated monocular depth estimation (ForeSeE) method, to estimate the foreground and background depth using separate optimization objectives and decoders. • This method significantly improves the depth estimation performance on foreground objects.
  • 4.
    Task-Aware Monocular Depth Estimationfor 3D Object Detection (a) ForeSeE (b) 3D Object Detection Illustration of the overall pipeline. (a) Foreground-background separated depth estimation. (b) 3D object detection.
  • 5.
    Task-Aware Monocular Depth Estimationfor 3D Object Detection (a) Input Image (b) Baseline-PL (c) ForeSeE-PL Qualitative results of 3D object detection. The ground truth 3D bounding boxes are in red; the predictions are in green.
  • 6.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection • Understanding the world in 3D is a critical component of urban autonomous driving. • Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been paramount for successful 3D object detection algorithms, whereas monocular image-only methods experience drastically reduced performance. • It proposes to reduce the gap by reformulating the monocular 3D detection problem as a standalone 3D region proposal network, called M3D-RPN. • M3D-RPN leverages the geometric relationship of 2D and 3D perspectives, allowing 3D boxes to utilize well-known and powerful convolutional features generated in the image- space. • To help address the strenuous 3D parameter estimations, it further designs depth-aware convolutional layers which enable location specific feature development and in consequence improved 3D scene understanding.
  • 7.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection M3D-RPN uses a single monocular 3D region proposal network with global convolution (orange) and local depth-aware convolution (blue) to predict multi-class 3D bounding boxes.
  • 8.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Comparison of Deep3DBox (CVPR’17) and Multi-Fusion (CVPR’18) with M3D-RPN. Notice that prior works are comprised of multi- ple internal stages (orange), and external networks (blue), whereas M3D-RPN is a single-shot network trained end-to-end.
  • 9.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Overview of M3D-RPN. The method consist of parallel paths for global(orange) and local(blue) feature extraction. The global features use regular spatial-invariant convolution, while the local features denote depth-aware convolution. The depth-aware convolution uses non-shared kernels in the row-space ki for i = 1 . . . b, where b denotes # of distinct bins. To leverage both variants of features, weightedly combine each output parameter from the parallel paths.
  • 10.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Anchor Formulation and Visualized 3D Anchors. To depict each parameter of within the 2D / 3D anchor formulation (left). To visualize the precomputed 3D priors when 12 anchors are used after projection in the image view (middle) and Bird’s Eye View (right). For visualization purposes only, to span anchors in specific x3D locations which best minimize overlap when viewed.
  • 11.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Qualitative Examples. To visualize qualitative examples of our method for multi-class 3D object detection. It uses yellow to denote cars, green for pedestrians, and orange for cyclists. All illustrated images are from Chen et al. method (NIPS’15) split and not used for training.
  • 12.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud • It aims at bridging the performance gap between 3D sensing and 2D sensing for 3D object detection by enhancing LiDAR-based algorithms to work with single image input. • Specifically, to perform monocular depth estimation and lift the input image to a point cloud representation, called pseudo-LiDAR point cloud. • Then train a LiDAR-based 3D detection network with pseudo-LiDAR end-to-end. • Following the pipeline of two-stage 3D detection algorithms, detect 2D object proposals in the input image and extract a point cloud frustum from the pseudo-LiDAR for each proposal, later an oriented 3D bounding box is detected for each frustum. • To handle the large amount of noise in the pseudo-LiDAR: (1) use a 2D-3D bounding box consistency constraint, adjusting the predicted 3D bounding box to have a high overlap with its corresponding 2D proposal after projecting onto the image; (2) use the instance mask instead of the bounding box as the representation of 2D proposals, in order to reduce the number of points not belonging to the object in the point cloud frustum.
  • 13.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud (a) Lift every pixel of input image to 3D coordinates given estimated depth to generate pseudo- LiDAR; (b) Instance mask proposals detected for extracting point cloud frustum; (c) 3D bounding box estimated (blue) for each point cloud frustum made to be consistent with corresponding 2D proposal. Inputs and losses are in red and orange.
  • 14.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud To parameterize 3D bounding box output as a set of seven parameters, including the 3D coordinate of the object center (x, y, z), object’s size h, w, l and its heading angle θ.
  • 15.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud Left: it demonstrates that, when lifting all the pixels within the 2D bounding box proposal into 3D, the generated point cloud frustum has the long tail issue. Right: lifting only the pixels within the instance mask proposal significantly removes the points not being enclosed by the ground truth box, resulting in a point cloud frustum with no tail.
  • 16.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud To alleviate the local misalignment issue, use the geometry constraint of the bounding box consistency to refine 3D bounding box estimate. Given an inaccurate 3D bounding box estimate, it is highly possible that its 2D projection also does not match well with the corresponding 2D proposal. By adjusting the 3D bounding box estimate in 3D space so that its 2D projection can have a higher 2D IoU with the corresponding 2D proposal, it demonstrates that the 3D IoU of 3D bounding box estimate with its ground truth can be also increased.
  • 17.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud Qualitative results of proposed method on KITTI set. To visualize 3D bounding box estimate (blue) and ground truth (red) on the frontal images (1st and 3rd rows) and pseudo-LiDAR point cloud (2nd and 4th rows).
  • 18.
    Monocular 3D ObjectDetection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss • Three-dimensional object detection from a single view is a challenging task which, if performed with good accuracy, is an important enabler of low-cost mobile robot perception. • Previous approaches to this problem suffer either from an overly complex inference engine or from an insufficient detection accuracy. • To deal with these issues, propose SS3D, a single-stage monocular 3D object detector. • The framework consists of (i) a CNN, which outputs a redundant representation of each relevant object in the image with corresponding uncertainty estimates, and (ii) a 3D bounding box optimizer. • The SS3D architecture provides a solid framework upon which high performing detection systems can be built, with autonomous driving being the main application in mind.
  • 19.
    Monocular 3D ObjectDetection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss The pipeline consists of the following steps: 1) a CNN performs object detection (yielding class scores) and regresses a set of intermediate values used later for 3D bounding box fitting, 2) non-maximum suppression is applied to discard redundant detections, and finally 3) the 3D bounding boxes are fitted given the intermediate predictions using a non-linear least- squares method.
  • 20.
    Monocular 3D ObjectDetection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss In the center of each object’s ground truth 2D bounding box a rectangular support region is created, with width and height set to 20% of the former. Each output pixel in the support region holds regression targets and one-hot classification targets for the network. In the rare case when support regions overlap, the closer object is favored.
  • 21.
    Disentangling Monocular 3DObject Detection • An approach for monocular 3D object detection from a single RGB image leverages a disentangling transformation for 2D and 3D detection losses and a novel, self-supervised confidence score for 3D bounding boxes. • The proposed loss disentanglement has the twofold advantage of simplifying the training dynamics in the presence of losses with complex interactions of parameters, and sidestepping the issue of balancing independent regression terms. • Its solution overcomes these issues by isolating the contribution made by groups of parameters to a given loss, without changing its nature. • Further to apply loss disentanglement to another signed IoU criterion-driven loss for improving 2D detection results. • To critically review the AP metric used in KITTI3D, identify and resolve a flaw in the 11-point interpolated AP metric, affecting all previously published detection results and particularly biases the results of monocular 3D detection.
  • 22.
    Disentangling Monocular 3DObject Detection • A two-stage architecture consists of a single- stage 2D detector (first stage) with an additional 3D detection head (second stage) constructed on top of features pooled from the detected 2D bounding boxes. • The backbone is a ResNet34 with a Feature Pyramid Network (FPN) built on top of it. • The FPN network has the structure with 3+2 scales, connected to the output of modules conv3, conv4 and conv5 of ResNet34, corresponding to down-sampling factors of ×8, ×16 and ×32, respectively.
  • 23.
    Disentangling Monocular 3DObject Detection • We consider the head of the single-stage 2D detector implemented in RetinaNet, which applies a detection module independently to each output fi of the backbone. • The detection modules share the same parameters but work inherently at different scales, according to the scale of the features that they receive as input. • As opposed to the standard RetinaNet, it employs iABNsync also in this head. • The head is composed of two parallel stacks of 3 × 3 convolutions, and is parametrized by na reference bounding box sizes (anchors) per scale level.
  • 24.
    Disentangling Monocular 3DObject Detection • The 3D detection head regresses a 3D bounding box for each 2D bounding box returned by the 2D detection head (surviving the filtering step). • It starts by applying ROIAlign to pool features from FPN into a 14 × 14 grid for each 2D bounding box, followed by 2 × 2 average pooling, resulting in feature maps with shape 7 × 7 × 128. • On top of this, two parallel branches of fully connected layers with 512 channels compute the outputs. • Each fully connected layer but the last one per branch is followed by iABN (non- synchronized).
  • 25.
    Disentangling Monocular 3DObject Detection Semantics of the outputs of the 2D and 3D detection heads Left: 2D bounding box regression on image plane. Center: 3D bounding box regression. Right: allocentric angle from bird-eye view.
  • 26.
    Disentangling Monocular 3DObject Detection Classes Car (top), Pedestrian (middle) and Cyclist(bottom) with corresponding birds- eye view.
  • 27.
    Shift R-CNN: DeepMonocular 3d Object Detection With Closed-Form Geometric Constraints • Shift R-CNN, a hybrid model for monocular 3D object detection, combines deep learning with the power of geometry. • It adapts a Faster R-CNN network for regressing initial 2D and 3D object properties and combine it with a least squares solution for the inverse 2D to 3D geometric mapping problem, using the camera projection matrix. • The closed-form solution of the mathematical system, along with the initial output of the adapted Faster R-CNN are then passed through a final ShiftNet network that refines the result using proposed Volume Displacement Loss. • This geometrically constrained deep learning approach to monocular 3D object detection obtains top results on KITTI 3D Object Detection Benchmark, being the best among all monocular methods that do not use any pre-trained network for depth estimation.
  • 28.
    Shift R-CNN: DeepMonocular 3d Object Detection With Closed-Form Geometric Constraints Overview of Shift R-CNN hybrid model. Stage 1: Faster R-CNN with added 3D angle and dimension regression. Stage 2: Closed-form solution to 3D translation using camera projection geometric constraints. Stage 3: ShiftNet refinement and final 3D object box reconstruction.
  • 29.
    Shift R-CNN: DeepMonocular 3d Object Detection With Closed-Form Geometric Constraints Stage 2 (top) and Stage 3 (bottom) results comparison. Note that Stage 3 improves the 3D estimation due to its noise robustness. Turquoise boxes denote objects with the same orientation and magenta color the opposite orientation.
  • 30.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints • Monocular 3D object detection is well-known to be a challenging vision task due to the loss of depth information; • Attempts to recover depth using separate image-only approaches lead to unstable and noisy depth estimates, harming 3D detections. • It proposes a keypoint-based approach for 3D object detection and localization from a single RGB image. • It then builds the multi-branch model around 2D keypoint detection implement it with a conceptually simple geometric reasoning method. • This network performs in an end-to-end manner, simultaneously and interdependently estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along with full 3D pose in the scene. • To fuse the outputs of distinct branches, applying a reprojection consistency loss during training.
  • 31.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints Start with a universal backbone network (Mask R-CNN) and complement it with three sub-networks: 2D object detection sub-network, 2D keypoints regression sub-network, and dimension regression sub-network. The network is trained end-to-end using a multi-task loss function.
  • 32.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints The 5 geometric classes of instances in our work are represented by 5 3D CAD models with strongly distinct aspect ratios. Geometric reasoning about instance depth Predict coordinates and a visibility state for each of the manually-chosen 14 keypoints; Define instance depth as the depth Z of a vertical plane passing through the two closest keypoints in the camera reference frame. Annotation of 3D keypoints
  • 33.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints The upper part of each sub-figure contains 2D detection inference, including 2D bounding and 2D locations of the visible keypoints. Each instance and its keypoints are displayed their distinctive color. The lower part visualizes the 3D point cloud, showing the camera location as the colored XYZ axes. Green and red colors stand for the ground truth and predicted 3D bounding boxes respectively. The scenes were selected to express diversity in complexity and cars positioning w.r.t. the camera.
  • 34.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction • MonoPSR, a monocular 3D object detection method, leverages proposals and shape reconstruction. • First, using the fundamental relations of a pinhole camera model, detections from a mature 2D object detector are used to generate a 3D proposal per object in a scene. • The 3D location of these proposals prove to be quite accurate, which greatly reduces the difficulty of regressing the final 3D bounding box detection. • Simultaneously, a point cloud is predicted in an object centered coordinate system to learn local scale and shape information. • However, the key challenge is how to exploit shape information to guide 3D localization. • Aggregate losses, including a projection alignment loss, to jointly optimize these tasks in the neural network to improve 3D localization accuracy.
  • 35.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction The network takes an image with 2D bounding boxes and regresses instance-centric 3D proposals to produce 3D bound intimated to recover local shape and scale, and to enforce 2D-3D consistency. The proposal regression and point cloud estimation are trained jointly in the network.
  • 36.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction The network produces a feature map using an image crop of an object and global context features as inputs. From this feature map three tasks are performed a) the dimensions and orientation are predicted to estimate a proposal b) offsets for the proposals are regressed c) local point clouds are predicted and transformed into the global frame for auxiliary loss calculations.
  • 37.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction Losses for the corresponding predictions (red) and ground truth (green). All penalties use the smooth L1 loss at valid pixel locations using automatically generated segmentation masks. First, the point cloud loss penalizes the instance point cloud along each channel (x, y, z). The point cloud is then placed at its estimated location in the camera coordinate frame using TCO, the transformation between object and camera coordinate frames, and penalized in the last channel z. Finally, the point cloud is projected into image space with Π, the camera projection matrix. A projection alignment loss penalizes points projected into the wrong image pixel location.
Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction
KITTI dataset. 2D detections (top) are shown in orange. 3D detections in green are shown projected into the image (top) and in the 3D scene (bottom). Ground-truth 3D boxes (bottom) are shown in red. Points within the detection boxes are the estimated point clouds from the network, while the background points are taken from the colorized, interpolated LiDAR scan. Note that for pedestrians in particular the projected 3D boxes do not fit tightly within their 2D box, so constraining the 3D box with the 2D box is not ideal.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
• This is an efficient 3D object detection framework based on a single RGB image in the scenario of autonomous driving.
• The effort is put on extracting the underlying 3D information in a 2D image and determining the accurate 3D bounding box of the object without point cloud or stereo data.
• Leveraging an off-the-shelf 2D object detector, an approach is proposed to efficiently obtain a coarse cuboid for each predicted 2D box.
• The coarse cuboid has enough accuracy to guide the refinement that determines the 3D box of the object.
• In contrast to previous state-of-the-art methods that only use the features extracted from the 2D bounding box for box refinement, it explores the 3D structure information of the object by employing the visual features of the visible surfaces.
• The features from the surfaces are utilized to eliminate the representation ambiguity brought by using only a 2D bounding box.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
The key idea: (a) first predict a reliable 2D box and its observation orientation; (b) based on the predicted 2D information, utilize artful techniques to efficiently determine a basic cuboid for the corresponding object, called the guidance (a sketch of one way to build such a guidance follows); (c) features extracted from the visible surfaces of the projected guidance, as well as its tight 2D bounding box, are utilized by the model to perform accurate refinement with a classification formulation and a quality-aware loss.
An example of the feature representation ambiguity caused by using only a 2D bounding box: the 3D boxes vary largely from each other and only the left one is correct, yet their corresponding 2D bounding boxes are exactly the same.
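One plausible way to build such a guidance cuboid, sketched under my own assumptions (class-average dimensions, depth from the 2D box height, and the standard observation-angle-to-yaw conversion); GS3D's actual guidance generation may differ in detail.

```python
import numpy as np

def coarse_guidance(box2d, alpha, dims_prior, fx, fy, cx, cy):
    """box2d = (x1, y1, x2, y2) in pixels, alpha = observation angle,
    dims_prior = class-average (H, W, L). Returns a rough cuboid."""
    x1, y1, x2, y2 = box2d
    H, W, L = dims_prior
    Z = fy * H / (y2 - y1)                     # depth from the 2D box height
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    X = (u - cx) * Z / fx                      # back-project the box centre
    Y = (v - cy) * Z / fy
    yaw = alpha + np.arctan2(X, Z)             # observation angle -> global rotation θ
    return dict(center=(X, Y, Z), dims=(H, W, L), yaw=yaw)
```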
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
For 2D detection, the Faster R-CNN framework is modified by adding a new branch for orientation prediction. Top view of the observation angle α and the global rotation angle θ.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
3D object detection paradigm. A CNN-based model (2D+O subnet) is used to obtain a 2D bounding box and the observation orientation of the object. The guidance is then generated by the proposed algorithm from the obtained 2D box and orientation together with the projection matrix. Features extracted from the visible surfaces, as well as from the 2D bounding box of the projected guidance, are utilized by the refinement model (3D subnet). Instead of direct regression, the refinement model adopts a classification formulation with the quality-aware loss for a more accurate result.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Visualization of feature extraction from the projected surfaces of the 3D box by perspective transformation.
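The surface feature extraction by perspective transformation can be illustrated as follows; this is a hedged sketch that warps a projected quadrilateral of an image crop (or a feature map with at most 4 channels) into a fixed-size patch, with function and argument names of my own choosing.

```python
import cv2
import numpy as np

def warp_surface(image, quad_pts, out_size=(7, 7)):
    """image: HxWxC array (C <= 4); quad_pts: (4, 2) corners of the projected
    surface in image coordinates, ordered tl, tr, br, bl."""
    h, w = out_size
    dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    M = cv2.getPerspectiveTransform(np.float32(quad_pts), dst)   # 3x3 homography
    return cv2.warpPerspective(image, M, (w, h))                 # fixed-size patch
```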
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Details of the head of the 3D subnet.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
3D detection results.
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
• This is a monocular 3D object detection framework in the domain of autonomous driving.
• Unlike previous image-based methods that operate on RGB features from 2D images, this method solves the problem in the reconstructed 3D space in order to exploit 3D contexts explicitly.
• It first leverages a stand-alone module to transform the input data from the 2D image plane to the 3D point cloud space for a better input representation, then performs the 3D detection using a PointNet backbone net to obtain objects' 3D locations, dimensions and orientations.
• To enhance the discriminative capability of the point clouds, a multi-modal feature fusion module embeds the complementary RGB cue into the generated point clouds.
• It is more effective to infer the 3D bounding boxes from the generated 3D scene space (i.e., the X, Y, Z space) than from the image plane (i.e., the R, G, B image plane).
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
Framework for monocular 3D object detection.
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
• It consists of two main stages: a 3D data generation phase and a 3D box estimation phase.
• In the 3D data generation phase, two deep CNNs are trained for intermediate tasks (2D detection and depth estimation) to get position and depth information.
• In particular, the generated depth is transformed into a point cloud, which is a better representation for 3D detection (a back-projection sketch follows this list), and the 2D bounding box provides prior information about the location of the RoI (region of interest).
• Finally, the points in each RoI are extracted as the input data for subsequent steps.
• In the 3D box estimation phase, two modules are designed for background point segmentation and RGB information aggregation, respectively, to improve the final task.
• After that, PointNet is used as the backbone net to predict the 3D location, dimension and orientation for each RoI.
• Note that the confidence scores of the 2D boxes are assigned to their corresponding 3D boxes.
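A minimal sketch of the 3D data generation step, under my assumptions about array layouts: back-project the predicted depth map with the camera intrinsics and embed the RGB cue by concatenating colors to the XYZ coordinates.

```python
import numpy as np

def depth_to_colored_points(depth, rgb, fx, fy, cx, cy):
    """depth: (H, W) metric depth; rgb: (H, W, 3) uint8; returns (H*W, 6) XYZRGB."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grids (H, W)
    Z = depth
    X = (u - cx) * Z / fx                            # back-projection with intrinsics
    Y = (v - cy) * Z / fy
    xyz = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3).astype(np.float32) / 255.0
    return np.concatenate([xyz, colors], axis=1)     # color-embedded pseudo points
```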
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
3D box estimation (Det-Net) with the RGB feature fusion module.
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
Qualitative comparison of RGB information: 3D boxes are projected to the image plane. The detection results using XYZ information only are represented by white boxes, while blue boxes come from the model trained with the RGB feature fusion module. The proposed RGB fusion method improves the 3D detection accuracy, especially for occlusion/truncation cases.
Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
• A method to infer the 3D pose and shape of vehicles from a single image.
• To tackle this ill-posed problem, it optimizes two-scale projection consistency between the generated 3D hypotheses and their 2D pseudo-measurements.
• Specifically, a morphable wireframe model generates a fine-scaled representation of vehicle shape and pose.
• To reduce the sensitivity to 2D landmarks, the 3D bounding box is jointly modeled as a coarse representation, which improves robustness.
• Three task priors, including unsupervised monocular depth, a ground plane constraint and vehicle shape priors, are integrated with the forward projection errors into an overall energy function.
Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
Takes a single image as input and generates vehicles' 3D shape and pose estimates in camera coordinates. The inference criterion combines a generative component, jointly optimizing the innovation (forward prediction error) between the projection of the 3D hypotheses and the image pseudo-measurements, monocular depth map constraints and geometric (ground) constraints, in addition to penalizing large deformations of the shape prior (a sketch of such an energy follows).
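The overall energy can be pictured as a weighted sum of the terms named above; the sketch below is purely illustrative, with placeholder weights and term names rather than the paper's formulation.

```python
def total_energy(proj_err_box, proj_err_landmarks, depth_err, ground_err,
                 shape_deformation, w=(1.0, 1.0, 0.5, 0.5, 0.1)):
    """Weighted sum of: coarse 3D-box projection error, wireframe landmark
    projection error, monocular depth term, ground-plane term, and a penalty
    on large deformations of the shape prior. Weights are placeholders."""
    terms = (proj_err_box, proj_err_landmarks, depth_err, ground_err,
             shape_deformation)
    return sum(wi * ti for wi, ti in zip(w, terms))
```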
Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
The two-scale 3D hypotheses consist of the rotated and scaled 3D Bbox and the morphable wireframe model. The image pseudo-measurements include 2D Bboxes and landmarks. In the inference scheme, the hypotheses and the pseudo-measurements initialize the optimization, which generates the final 3D pose and shape estimate of a vehicle.
Orthographic Feature Transform for Monocular 3D Object Detection
• Due to the perspective image-based representation, the appearance and scale of objects vary drastically with depth, and meaningful distances are difficult to infer.
• The ability to reason about the world in 3D is an essential element of the 3D object detection task.
• The orthographic feature transform escapes the image domain by mapping image-based features into an orthographic 3D space.
• It allows reasoning holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
• This transformation is applied as part of an E2E deep learning architecture.
Orthographic Feature Transform for Monocular 3D Object Detection
The Orthographic Feature Transform (OFT) pipeline (a sampling sketch follows this list):
1. A front-end ResNet feature extractor which extracts multi-scale feature maps from the input image.
2. An orthographic feature transform which transforms the image-based feature maps at each scale into an orthographic birds-eye-view representation.
3. A top-down network, consisting of a series of ResNet residual units, which processes the birds-eye-view feature maps in a manner that is invariant to the perspective effects observed in the image.
4. A set of output heads which generate, for each object class and each location on the ground plane, a confidence score, a position offset, a dimension offset and an orientation vector.
5. A non-maximum suppression and decoding stage, which identifies peaks in the confidence maps and generates discrete bounding box predictions.
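A much simplified sketch of step 2, the orthographic feature transform: for every cell of a birds-eye-view grid, project a point at an assumed ground height into the image and sample the image feature there. The real OFT pools over the full projected voxel footprint with integral images; the grid ranges, resolution and nearest-pixel sampling here are my simplifications.

```python
import numpy as np

def oft(image_feats, K, x_range=(-40, 40), z_range=(0, 80), y0=1.0, res=0.5):
    """image_feats: (C, H, W) feature map; K: (3, 3) intrinsics at feature scale.
    Returns a (C, Z_bins, X_bins) orthographic (birds-eye-view) feature map."""
    C, H, W = image_feats.shape
    xs = np.arange(x_range[0], x_range[1], res)
    zs = np.arange(z_range[0], z_range[1], res)
    bev = np.zeros((C, len(zs), len(xs)), dtype=image_feats.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            # Project the 3D point (x, y0, z) on the assumed ground plane.
            u = int(K[0, 0] * x / max(z, 1e-3) + K[0, 2])
            v = int(K[1, 1] * y0 / max(z, 1e-3) + K[1, 2])
            if 0 <= u < W and 0 <= v < H:
                bev[:, i, j] = image_feats[:, v, u]   # nearest-pixel sample
    return bev
```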
Orthographic Feature Transform for Monocular 3D Object Detection
Architecture overview. A front-end ResNet feature extractor generates image-based features, which are mapped to an orthographic representation via the orthographic feature transform. The top-down network processes these features in the birds-eye-view space and, at each location on the ground plane, predicts a confidence score S, a position offset ∆pos, a dimension offset ∆dim and an angle vector ∆ang.
Orthographic Feature Transform for Monocular 3D Object Detection
Qualitative comparison between the OFT method (left) and Mono3D (CVPR'16, right) on the KITTI validation set. Inset regions highlight the behaviors of the two systems at large distances. OFT is able to consistently detect distant objects that are beyond the range of Mono3D.
Multi-Level Fusion based 3D Object Detection from Monocular Images
• An E2E multi-level fusion based framework for 3D object detection from a single monocular image.
• It is composed of two parts: one for 2D region proposal generation and another for simultaneous prediction of objects' 2D locations, orientations, dimensions, and 3D locations.
• With the help of a stand-alone module that estimates the disparity and computes the 3D point cloud, a multi-level fusion scheme is introduced (a sketch of the two fusion levels follows).
• The disparity information is encoded with a front-view feature representation and fused with the RGB image to enhance the input.
• Features extracted from the original input and from the point cloud are combined to boost the object detection. For 3D localization, an extra stream predicts the location information directly from the point cloud and adds it to the aforementioned location prediction.
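The two fusion levels can be sketched roughly as below, assuming an input-level concatenation of RGB with a one-channel disparity front-view map and a feature-level concatenation before the 3D location head; module names are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Input-level fusion: concatenate RGB with a disparity front-view map."""
    def __init__(self, out_ch=32):
        super().__init__()
        self.conv = nn.Conv2d(3 + 1, out_ch, kernel_size=3, padding=1)

    def forward(self, rgb, disparity):           # rgb: (B,3,H,W), disparity: (B,1,H,W)
        x = torch.cat([rgb, disparity], dim=1)   # fuse at the input
        return self.conv(x)

def feature_fusion(img_feat, pc_feat):
    """Feature-level fusion: concatenate RoI image features with point-cloud
    features (both shaped (B, C)) before the 3D location prediction."""
    return torch.cat([img_feat, pc_feat], dim=1)
```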
Multi-Level Fusion based 3D Object Detection from Monocular Images
3D object detection.
Multi-Level Fusion based 3D Object Detection from Monocular Images
Visualization of the 2D detection boxes and the projected 3D detection boxes on the point cloud inferred from the estimated disparity.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
• MonoGRNet performs amodal 3D object localization from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
• MonoGRNet is a single, unified network composed of four task-specific subnetworks, responsible for 2D object detection, instance depth estimation (IDE), 3D localization and local corner regression.
• Unlike pixel-level depth estimation, which needs per-pixel annotations, the IDE method directly predicts the depth of the target 3D bounding box's center using sparse supervision.
• The 3D localization is further achieved by estimating the position in the horizontal and vertical dimensions.
• Finally, MonoGRNet is jointly learned by optimizing the locations and poses of the 3D bounding boxes in the global context.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
• MonoGRNet for 3D object localization from a monocular RGB image.
• MonoGRNet consists of four subnetworks for 2D detection (brown), instance depth estimation (green), 3D location estimation (blue) and local corner regression (yellow).
• Guided by the detected 2D bounding box, the network first estimates the depth and the 2D projection of the 3D box's center to obtain the global 3D location (see the sketch below), and then regresses the corner coordinates in the local context.
• The final 3D bounding box is optimized in an E2E manner in the global context based on the estimated 3D location and local corners.
• VGG-16 is used as the CNN backbone, but without its FC layers.
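The geometric step of recovering the global 3D center from the predicted instance depth and the predicted 2D projection of that center can be written out as follows; variable names are mine.

```python
import numpy as np

def recover_3d_center(u_c, v_c, Z_c, fx, fy, cx, cy):
    """(u_c, v_c): predicted 2D projection of the 3D box centre (pixels);
    Z_c: predicted instance depth; fx, fy, cx, cy: camera intrinsics."""
    X_c = (u_c - cx) * Z_c / fx
    Y_c = (v_c - cy) * Z_c / fy
    return np.array([X_c, Y_c, Z_c])

# The 8 corners regressed in the local (object-centred) frame are then
# translated by this centre to obtain the global 3D bounding box.
```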
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Instance depth estimation subnet.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Notation for 3D bounding box localization.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Instance depth.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Predicted 3D bounding boxes are drawn in orange, while ground truths are in blue. LiDAR point clouds are plotted for reference but not used. Camera centers are at the bottom-left corner. (a), (b) and (c) are common cases where predictions recall the ground truths. (d), (e) and (f) demonstrate the model's capability of handling truncated objects outside the image. (g), (h) and (i) show failed detections when some cars are heavily occluded.
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization Prioritized Approach using Single Monocular Images
• Understanding the 3D semantics of surrounding objects is a critically important and challenging requirement from the safety perspective of autonomous driving.
• This is a localization-prioritized approach for effectively localizing the position of the object in the 3D world and fitting a complete 3D box around it.
• The method requires a single image and performs both 2D and 3D detection in an end-to-end fashion.
• It works by effectively localizing the projection of the center of the bottom face of the 3D bounding box (CBF) in the image.
• Later, in the post-processing stage, it uses a look-up-table based approach to reproject the CBF into the 3D world (a sketch of such a table follows).
• This stage is a one-time setup and simple enough to be deployed in fixed map communities to store complete knowledge about the ground plane.
• The object's dimensions and pose are predicted in a multitask fashion using a shared set of features.
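One way to realize the look-up-table reprojection of the CBF, sketched under the assumption of a flat ground plane at a fixed camera height (1.65 m here, a placeholder): precompute the ray-ground intersection for every pixel, then simply index the table with the predicted CBF location at inference.

```python
import numpy as np

def build_cbf_lut(H, W, K, plane_n=np.array([0., -1., 0.]), plane_d=1.65):
    """Precompute, for every pixel, the 3D point where its viewing ray meets
    the ground plane n^T X + d = 0 (camera frame: x right, y down, z forward)."""
    K_inv = np.linalg.inv(K)
    lut = np.full((H, W, 3), np.nan, dtype=np.float32)
    for v in range(H):
        for u in range(W):
            ray = K_inv @ np.array([u, v, 1.0])       # viewing ray direction
            denom = plane_n @ ray
            if abs(denom) > 1e-6:
                t = -plane_d / denom
                if t > 0:                             # pixels above the horizon: no hit
                    lut[v, u] = t * ray               # 3D point on the ground plane
    return lut

# At inference: ground_point = lut[int(v_cbf), int(u_cbf)]
```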
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization Prioritized Approach using Single Monocular Images
Illustration of the 2D detection boxes and the corresponding 3D projections.
Joint Mono 3D Vehicle Detection and Tracking
• Vehicle 3D extents and trajectories are critical cues for predicting the future location of vehicles and planning future agent ego-motion based on those predictions.
• Here is an online framework for 3D vehicle detection and tracking from monocular videos.
• The framework can not only associate detections of vehicles in motion over time, but also estimate their complete 3D bounding box information from a sequence of 2D images captured on a moving platform.
• The method leverages 3D box depth-ordering matching for robust instance association and utilizes 3D trajectory prediction for re-identification of occluded vehicles.
• It also designs a motion learning module based on an LSTM for more accurate long-term motion extrapolation.
• On the Argoverse dataset, this image-based method is significantly better at tracking 3D vehicles within 30 meters than the LiDAR-centric baseline methods.
Joint Mono 3D Vehicle Detection and Tracking
Joint online detection and tracking in 3D. The dynamic 3D tracking pipeline predicts the 3D bounding box association of observed vehicles in image sequences captured by a monocular camera with an ego-motion sensor.
Joint Mono 3D Vehicle Detection and Tracking
Overview of the monocular 3D tracking framework. This online approach processes monocular frames to estimate and track regions of interest (RoIs) in 3D (a). For each RoI, it learns 3D layout estimation (i.e., depth, orientation, dimension and a projection of the 3D center) (b). With the 3D layout, the LSTM tracker produces robust linking across frames, leveraging occlusion-aware association and depth-ordering matching (c). With the help of 3D tracking, the model further refines its 3D estimation by fusing object motion features from the previous frames (d).
Joint Mono 3D Vehicle Detection and Tracking
Illustration of depth-ordering matching. Given the tracklets and detections, sort them into a list by depth order. For each detection of interest (DOI), calculate the IoU between the DOI and the non-occluded regions of each tracklet. The depth order naturally gives higher probabilities to tracklets near the DOI.
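A hedged sketch of depth-ordering matching as described in the caption: sort tracklets by estimated depth, approximate each tracklet's non-occluded fraction from the boxes in front of it, and score each detection of interest by overlap with those visible regions. The box format, the occlusion approximation and the helper names are my assumptions, not the paper's exact procedure.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def depth_ordering_scores(doi_box, tracklets):
    """tracklets: list of dicts {'box': (x1, y1, x2, y2), 'depth': z}.
    Returns a matching score per tracklet, higher for tracklets whose
    visible (non-occluded) region overlaps the DOI more."""
    order = sorted(range(len(tracklets)), key=lambda i: tracklets[i]['depth'])
    scores = np.zeros(len(tracklets))
    for rank, i in enumerate(order):
        box = tracklets[i]['box']
        # Occlusion by nearer tracklets, approximated by their maximum overlap.
        occlusion = max((iou(box, tracklets[j]['box']) for j in order[:rank]),
                        default=0.0)
        visible = 1.0 - occlusion                 # crude non-occluded fraction
        scores[i] = iou(doi_box, box) * visible
    return scores
```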
Joint Mono 3D Vehicle Detection and Tracking
Illustration of occlusion-aware association. A tracked tracklet (yellow) is visible all the time, while another tracklet (red) is occluded by a third one (blue) at frame T-1. During occlusion, the occluded tracklet does not update its state but keeps inferring motion until reappearance. A truncated or disappearing tracklet (blue at frame T) is marked as lost.
Joint Mono 3D Vehicle Detection and Tracking
Experimental results on the KITTI dataset: 3D layouts colored with tracking IDs.