3D Interpretation from Single 2D Image
for Autonomous Driving II
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Task-Aware Mono Depth Estimation for 3D
Object Detection
• M3D-RPN: Mono 3D Region Proposal Network
for Object Detection
• Mono 3D Object Detection with Pseudo-LiDAR
Point Cloud
• Mono 3D Object Detection and Box Fitting
Trained E2E Using IoU Loss
• Disentangling Mono 3D Object Detection
• Shift R-CNN: Deep Mono 3d Object Detection
With Closed-Form Geometric Constraints
• Mono 3D Object Detection via Geometric
Reasoning on Keypoints
• Mono 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
• GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
• Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
• Mono3D++: Mono 3D Vehicle Detection with
Two-Scale 3D Hypotheses and Task Priors
• Orthographic Feature Transform for Mono 3D
Object Detection
• Multi-Level Fusion based 3D Object Detection
from Mono Images
• MonoGRNet: A Geometric Reasoning Network
for Mono 3D Object Localization
• 3D Bounding Boxes for Road Vehicles: A One-
Stage, Localization Prioritized Approach using
Single Mono Images
• Joint Mono 3D Vehicle Detection and Tracking
Task-Aware Monocular Depth
Estimation for 3D Object Detection
• Monocular depth estimation enables 3D perception from a single 2D image, thus attracting
much research attention for years.
• Almost all methods treat foreground and background regions (“things and stuff”) in an
image equally.
• However, depth of foreground objects plays a crucial role in 3D object recognition and
localization.
• It first analyzes the data distributions and interaction of foreground and background, and then derives the foreground-background separated monocular depth estimation (ForeSeE) method, which estimates foreground and background depth using separate optimization objectives and decoders (a minimal sketch follows this list).
• This method significantly improves the depth estimation performance on foreground objects.
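Below is a hypothetical PyTorch-style sketch of the separation idea: a shared feature map feeds two depth decoders, and each decoder is supervised only on its own region through a foreground mask. The module and loss names, the L1 criterion, and the merge rule are assumptions for illustration, not the authors' exact ForeSeE implementation.

```python
# Hypothetical sketch of foreground/background-separated depth supervision
# (decoder structure, loss choice, and merge rule are assumptions, not ForeSeE's code).
import torch
import torch.nn as nn

class ForeSeEHead(nn.Module):
    """Two depth decoders on top of a shared feature map."""
    def __init__(self, in_ch=256):
        super().__init__()
        def decoder():
            return nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(64, 1, 1))
        self.fg_decoder, self.bg_decoder = decoder(), decoder()

    def forward(self, feats):
        return self.fg_decoder(feats), self.bg_decoder(feats)

def foresee_loss(fg_depth, bg_depth, gt_depth, fg_mask,
                 criterion=nn.L1Loss(reduction="none")):
    """Separate objectives: each decoder is only supervised on its own region."""
    fg_term = (criterion(fg_depth, gt_depth) * fg_mask).sum() / fg_mask.sum().clamp(min=1)
    bg_mask = 1.0 - fg_mask
    bg_term = (criterion(bg_depth, gt_depth) * bg_mask).sum() / bg_mask.sum().clamp(min=1)
    return fg_term + bg_term

feats = torch.randn(2, 256, 48, 156)                 # shared encoder features
fg, bg = ForeSeEHead()(feats)
gt = torch.rand(2, 1, 48, 156) * 60.0                # stand-in depth ground truth
mask = (torch.rand(2, 1, 48, 156) > 0.8).float()     # stand-in foreground mask
print(foresee_loss(fg, bg, gt, mask))
# At inference, a simple merge keeps the foreground prediction inside detected
# object regions and the background prediction elsewhere.
```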
Task-Aware Monocular Depth
Estimation for 3D Object Detection
(a) ForeSeE (b) 3D Object Detection
Illustration of the overall pipeline. (a) Foreground-background separated depth estimation. (b) 3D object detection.
Task-Aware Monocular Depth
Estimation for 3D Object Detection
(a) Input Image (b) Baseline-PL (c) ForeSeE-PL
Qualitative results of 3D object detection. The ground truth 3D bounding boxes are in red; the predictions are in green.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
• Understanding the world in 3D is a critical component of urban autonomous driving.
• Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been
paramount for successful 3D object detection algorithms, whereas monocular image-only
methods experience drastically reduced performance.
• It proposes to reduce the gap by reformulating the monocular 3D detection problem as a
standalone 3D region proposal network, called M3D-RPN.
• M3D-RPN leverages the geometric relationship of 2D and 3D perspectives, allowing 3D
boxes to utilize well-known and powerful convolutional features generated in the image-
space.
• To help address the strenuous 3D parameter estimations, it further designs depth-aware
convolutional layers which enable location specific feature development and in consequence
improved 3D scene understanding.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
M3D-RPN uses a single monocular 3D region proposal network with global convolution (orange)
and local depth-aware convolution (blue) to predict multi-class 3D bounding boxes.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Comparison of Deep3DBox (CVPR’17) and Multi-Fusion (CVPR’18) with M3D-RPN. Notice
that prior works are comprised of multiple internal stages (orange), and external
networks (blue), whereas M3D-RPN is a single-shot network trained end-to-end.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Overview of M3D-RPN. The method consists of parallel paths for global (orange) and local (blue) feature extraction. The global features use regular spatially-invariant convolution, while the local features denote depth-aware convolution. The depth-aware convolution uses non-shared kernels in the row-space, k_i for i = 1 . . . b, where b denotes the number of distinct bins. To leverage both feature variants, each output parameter from the parallel paths is combined with learned weights.
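A minimal sketch of such a row-binned (depth-aware) convolution is given below, assuming the feature map is split into b horizontal bins and each bin gets its own non-shared kernel; the class name, the bin boundaries, and the simplified handling of bin borders are illustrative rather than M3D-RPN's exact layer.

```python
# Minimal sketch of a row-binned ("depth-aware") convolution: each horizontal bin
# of the feature map is processed by its own 3x3 kernel, so features can
# specialize with image height (a proxy for depth in driving scenes).
import torch
import torch.nn as nn

class DepthAwareConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, num_bins=4, kernel_size=3, padding=1):
        super().__init__()
        self.num_bins = num_bins
        # One independent convolution per row bin (non-shared kernels).
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
            for _ in range(num_bins))

    def forward(self, x):
        h = x.shape[2]
        edges = torch.linspace(0, h, self.num_bins + 1).long().tolist()
        outs = []
        for k, conv in enumerate(self.convs):
            # Boundary handling between bins is simplified compared with the paper.
            outs.append(conv(x[:, :, edges[k]:edges[k + 1], :]))
        return torch.cat(outs, dim=2)

feat = torch.randn(1, 64, 32, 110)
out = DepthAwareConv2d(64, 64, num_bins=4)(feat)   # -> (1, 64, 32, 110)
print(out.shape)
```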
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Anchor Formulation and Visualized 3D Anchors. Each parameter within the 2D / 3D anchor formulation is depicted (left). The precomputed 3D priors are visualized for 12 anchors after projection into the image view (middle) and Bird's Eye View (right). For visualization purposes only, anchors are spanned at specific x3D locations that best minimize overlap when viewed.
M3D-RPN: Monocular 3D Region
Proposal Network for Object Detection
Qualitative Examples. Qualitative examples of the method for multi-class 3D object detection are visualized. Yellow denotes cars, green pedestrians, and orange cyclists. All illustrated images are from the split of Chen et al. (NIPS'15) and were not used for training.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
• It aims at bridging the performance gap between 3D sensing and 2D sensing for 3D object
detection by enhancing LiDAR-based algorithms to work with single image input.
• Specifically, to perform monocular depth estimation and lift the input image to a point cloud
representation, called pseudo-LiDAR point cloud.
• Then train a LiDAR-based 3D detection network with pseudo-LiDAR end-to-end.
• Following the pipeline of two-stage 3D detection algorithms, detect 2D object proposals in
the input image and extract a point cloud frustum from the pseudo-LiDAR for each proposal,
later an oriented 3D bounding box is detected for each frustum.
• To handle the large amount of noise in the pseudo-LiDAR: (1) use a 2D-3D bounding box
consistency constraint, adjusting the predicted 3D bounding box to have a high overlap with
its corresponding 2D proposal after projecting onto the image; (2) use the instance mask
instead of the bounding box as the representation of 2D proposals, in order to reduce the
number of points not belonging to the object in the point cloud frustum.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
(a) Lift every pixel of the input image to 3D coordinates given the estimated depth to generate pseudo-LiDAR; (b) instance mask proposals are detected for extracting point cloud frustums; (c) a 3D bounding box (blue) is estimated for each point cloud frustum and made consistent with the corresponding 2D proposal. Inputs and losses are in red and orange.
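A minimal sketch of step (a), the pseudo-LiDAR lifting, is shown below: every pixel is back-projected with its estimated depth using the pinhole model. The KITTI-like intrinsics and the random depth map are stand-in values just to make the snippet runnable.

```python
# Minimal sketch of pseudo-LiDAR generation: back-project every pixel with its
# estimated depth into 3D camera coordinates (fx, fy, cx, cy are camera intrinsics;
# the depth map here is random just to make the example runnable).
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)   # (H*W, 3) point cloud

depth = np.random.uniform(5.0, 70.0, size=(375, 1242))   # stand-in depth map
points = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(points.shape)   # (465750, 3)
```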
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
The 3D bounding box output is parameterized as a set of seven parameters: the 3D coordinates of the object center (x, y, z), the object's size (h, w, l), and its heading angle θ.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
Left: it demonstrates that, when lifting all the pixels within the 2D bounding box proposal into 3D, the
generated point cloud frustum has the long tail issue. Right: lifting only the pixels within the instance
mask proposal significantly removes the points not being enclosed by the ground truth box, resulting in a
point cloud frustum with no tail.
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
To alleviate the local misalignment issue, the geometric constraint of bounding box consistency is used to refine the 3D bounding box estimate. Given an inaccurate 3D bounding box estimate, it is highly likely that its 2D projection also does not match the corresponding 2D proposal well. By adjusting the 3D bounding box estimate in 3D space so that its 2D projection has a higher 2D IoU with the corresponding 2D proposal, the 3D IoU of the estimate with its ground truth can also be increased.
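The consistency score being maximized can be sketched as follows: build the eight corners of a candidate 3D box, project them with the camera matrix, take the axis-aligned 2D envelope, and compute its IoU with the 2D proposal. The box convention follows the common KITTI parameterization, and the projection matrix and box values below are stand-ins.

```python
# Sketch of the 2D-3D box consistency check: 3D box (x, y, z, h, w, l, theta) ->
# 8 corners -> projection with a 3x4 camera matrix P -> 2D envelope -> IoU with
# the 2D proposal (yaw is about the camera y-axis; y sits on the bottom face).
import numpy as np

def box3d_corners(x, y, z, h, w, l, theta):
    xs = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    ys = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    zs = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    R = np.array([[ np.cos(theta), 0, np.sin(theta)],
                  [ 0,             1, 0            ],
                  [-np.sin(theta), 0, np.cos(theta)]])
    return (R @ np.vstack([xs, ys, zs])).T + np.array([x, y, z])

def project_to_2d_box(corners, P):
    pts = np.hstack([corners, np.ones((8, 1))]) @ P.T
    uv = pts[:, :2] / pts[:, 2:3]
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])

def iou_2d(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

P = np.array([[721.5, 0, 609.6, 0], [0, 721.5, 172.9, 0], [0, 0, 1, 0]])
proposal = np.array([500.0, 150.0, 700.0, 260.0])          # 2D proposal box
projected = project_to_2d_box(box3d_corners(2.0, 1.6, 20.0, 1.5, 1.7, 4.0, 0.3), P)
print(iou_2d(projected, proposal))   # score to maximize while adjusting the 3D box
```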
Monocular 3D Object Detection with
Pseudo-LiDAR Point Cloud
Qualitative results of the proposed method on KITTI. The 3D bounding box estimates (blue) and ground truth (red) are visualized on the frontal images (1st and 3rd rows) and the pseudo-LiDAR point cloud (2nd and 4th rows).
Monocular 3D Object Detection and Box Fitting Trained
End-to-End Using Intersection-over-Union Loss
• Three-dimensional object detection from a single view is a challenging task which, if
performed with good accuracy, is an important enabler of low-cost mobile robot perception.
• Previous approaches to this problem suffer either from an overly complex inference engine
or from an insufficient detection accuracy.
• To deal with these issues, propose SS3D, a single-stage monocular 3D object detector.
• The framework consists of (i) a CNN, which outputs a redundant representation of each
relevant object in the image with corresponding uncertainty estimates, and (ii) a 3D
bounding box optimizer.
• The SS3D architecture provides a solid framework upon which high performing detection
systems can be built, with autonomous driving being the main application in mind.
Monocular 3D Object Detection and Box Fitting Trained
End-to-End Using Intersection-over-Union Loss
The pipeline consists of the following steps: 1) a CNN performs object detection (yielding class scores) and regresses a set of intermediate values used later for 3D bounding box fitting, 2) non-maximum suppression is applied to discard redundant detections, and finally 3) the 3D bounding boxes are fitted to the intermediate predictions using a non-linear least-squares method.
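The structure of step 3) can be sketched as an uncertainty-weighted nonlinear least-squares solve: each CNN output contributes a residual scaled by its predicted uncertainty. The observation model below is a deliberately simple stand-in (fitting a 2D box to redundant corner observations), not the actual set of 2D/3D terms SS3D regresses.

```python
# Sketch of the fitting step's structure: nonlinear least squares over box
# parameters, with each CNN output weighted by its predicted uncertainty (1/sigma).
import numpy as np
from scipy.optimize import least_squares

def observation_model(params):
    """Toy model: a 2D box (cx, cy, w, h) -> its four corner coordinates."""
    cx, cy, w, h = params
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy - h / 2,
                     cx - w / 2, cy + h / 2, cx + w / 2, cy + h / 2])

def fit_box(pred_obs, pred_sigma, init):
    # Residuals are divided by the predicted sigma, so uncertain outputs pull
    # less on the solution; this mirrors the uncertainty-weighted fitting idea.
    residual = lambda p: (observation_model(p) - pred_obs) / pred_sigma
    return least_squares(residual, x0=init).x

true = np.array([320.0, 180.0, 60.0, 40.0])
obs = observation_model(true) + np.random.normal(0, 2.0, size=8)   # noisy "CNN outputs"
sigma = np.full(8, 2.0)                                            # predicted uncertainties
print(fit_box(obs, sigma, init=np.array([300.0, 170.0, 50.0, 30.0])))
```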
Monocular 3D Object Detection and Box Fitting Trained
End-to-End Using Intersection-over-Union Loss
In the center of each object’s ground truth 2D bounding box a rectangular support region is
created, with width and height set to 20% of the former. Each output pixel in the support
region holds regression targets and one-hot classification targets for the network. In the rare
case when support regions overlap, the closer object is favored.
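A small sketch of how such a support region can be rasterized is given below; the feature-map stride and the overlap tie-breaking are simplified assumptions.

```python
# Sketch of building the rectangular support region: a mask centered on the 2D
# ground-truth box whose width/height are 20% of that box.
import numpy as np

def support_region_mask(box, image_shape, ratio=0.2):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    mask = np.zeros(image_shape, dtype=bool)
    mask[int(cy - h / 2):int(cy + h / 2) + 1,
         int(cx - w / 2):int(cx + w / 2) + 1] = True
    return mask   # pixels inside carry regression + one-hot classification targets

mask = support_region_mask((100, 50, 300, 170), image_shape=(375, 1242))
print(mask.sum(), "support pixels")
```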
Disentangling Monocular 3D Object Detection
• An approach for monocular 3D object detection from a single RGB image leverages a
disentangling transformation for 2D and 3D detection losses and a novel, self-supervised
confidence score for 3D bounding boxes.
• The proposed loss disentanglement has the twofold advantage of simplifying the training
dynamics in the presence of losses with complex interactions of parameters, and
sidestepping the issue of balancing independent regression terms.
• Its solution overcomes these issues by isolating the contribution made by groups of parameters to a given loss, without changing its nature (a minimal sketch of this idea follows the list).
• Loss disentanglement is further applied to a signed-IoU criterion-driven loss to improve the 2D detection results.
• The AP metric used in KITTI3D is critically reviewed; a flaw in the 11-point interpolated AP metric is identified and resolved, which affects all previously published detection results and particularly biases the results of monocular 3D detection.
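One way to realize the disentangling transformation is sketched below: a single corner-based 3D box loss is split into per-group terms, where each term uses the predictions of one parameter group (dimensions, center, rotation) and the ground truth for the others, so only that group drives its own gradient. The corner construction and the L1 criterion are simplifications of the paper's loss.

```python
# Hypothetical illustration of the disentangling idea for a 3D-box loss.
import torch
import torch.nn.functional as F

def corners(dims, center, yaw):
    h, w, l = dims.unbind(-1)
    xs = torch.stack([ l,  l, -l, -l,  l,  l, -l, -l], -1) / 2
    ys = torch.stack([torch.zeros_like(h)] * 4 + [-h] * 4, -1)
    zs = torch.stack([ w, -w, -w,  w,  w, -w, -w,  w], -1) / 2
    c, s = torch.cos(yaw), torch.sin(yaw)
    x = c * xs + s * zs           # rotation about the camera y-axis
    z = -s * xs + c * zs
    return torch.stack([x, ys, z], -1) + center[..., None, :]

def disentangled_loss(pred, gt):
    groups = ("dims", "center", "yaw")
    gt_corners = corners(gt["dims"], gt["center"], gt["yaw"])
    total = 0.0
    for g in groups:
        # Only group g comes from the prediction; the rest is ground truth.
        mixed = {k: (pred[k] if k == g else gt[k]) for k in groups}
        total = total + F.l1_loss(
            corners(mixed["dims"], mixed["center"], mixed["yaw"]), gt_corners)
    return total

gt = {"dims": torch.tensor([1.5, 1.7, 4.0]),
      "center": torch.tensor([2.0, 1.6, 20.0]),
      "yaw": torch.tensor(0.3)}
pred = {k: v + 0.1 for k, v in gt.items()}
print(disentangled_loss(pred, gt))
```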
Disentangling Monocular 3D Object Detection
• A two-stage architecture consists of a single-
stage 2D detector (first stage) with an additional
3D detection head (second stage) constructed
on top of features pooled from the detected 2D
bounding boxes.
• The backbone is a ResNet34 with a Feature
Pyramid Network (FPN) built on top of it.
• The FPN network has the structure with 3+2
scales, connected to the output of modules
conv3, conv4 and conv5 of ResNet34,
corresponding to down-sampling factors of ×8,
×16 and ×32, respectively.
Disentangling Monocular 3D Object Detection
• It considers the head of the single-stage 2D detector implemented in RetinaNet, which applies a detection module independently to each output f_i of the backbone.
• The detection modules share the same parameters but work inherently at different scales,
according to the scale of the features that they receive as input.
• As opposed to the standard RetinaNet, it employs iABNsync also in this head.
• The head is composed of two parallel stacks of 3 × 3 convolutions, and is parametrized by n_a reference bounding box sizes (anchors) per scale level.
Disentangling Monocular 3D Object Detection
• The 3D detection head regresses a 3D bounding box for each 2D bounding box returned
by the 2D detection head (surviving the filtering step).
• It starts by applying ROIAlign to pool features from FPN into a 14 × 14 grid for each 2D
bounding box, followed by 2 × 2 average pooling, resulting in feature maps with shape 7 ×
7 × 128.
• On top of this, two parallel branches of fully connected layers with 512 channels compute
the outputs.
• Each fully connected layer but the last one per branch is followed by iABN (non-synchronized). A minimal sketch of this head's shape flow is given below.
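The sketch below mirrors the stated shape flow (ROIAlign 14 × 14 → 2 × 2 average pooling → 7 × 7 × 128 → two parallel fully connected branches of width 512). iABN is replaced by plain BatchNorm + ReLU only to keep the snippet self-contained, and the output dimensions of the two branches are assumptions.

```python
# Rough PyTorch sketch of the 3D head's shape flow; iABN is replaced by
# BatchNorm + ReLU purely to keep the example self-contained.
import torch
import torch.nn as nn

class Head3D(nn.Module):
    def __init__(self, in_ch=128, hidden=512, out_a=6, out_b=2):
        super().__init__()
        self.pool = nn.AvgPool2d(2)                  # 14x14 -> 7x7
        def branch(out_dim):
            return nn.Sequential(
                nn.Flatten(),
                nn.Linear(7 * 7 * in_ch, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, out_dim))          # last FC has no normalization
        # Two parallel branches compute the 3D outputs (the exact split of the
        # outputs into box parameters vs. confidence is an assumption here).
        self.branch_a, self.branch_b = branch(out_a), branch(out_b)

    def forward(self, roi_feats):                    # roi_feats: (N, 128, 14, 14)
        x = self.pool(roi_feats)
        return self.branch_a(x), self.branch_b(x)

rois = torch.randn(4, 128, 14, 14)                   # features pooled via ROIAlign
a, b = Head3D()(rois)
print(a.shape, b.shape)                              # torch.Size([4, 6]) torch.Size([4, 2])
```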
Disentangling Monocular 3D Object Detection
Semantics of the outputs of the 2D and 3D detection heads
Left: 2D bounding box regression on the image plane. Center: 3D bounding box regression. Right: allocentric angle from the bird's-eye view.
Disentangling Monocular 3D Object Detection
Classes Car (top), Pedestrian
(middle) and Cyclist(bottom)
with corresponding birds-
eye view.
Shift R-CNN: Deep Monocular 3d Object Detection
With Closed-Form Geometric Constraints
• Shift R-CNN, a hybrid model for monocular 3D object detection, combines deep learning
with the power of geometry.
• It adapts a Faster R-CNN network for regressing initial 2D and 3D object properties and combines it with a least squares solution to the inverse 2D-to-3D geometric mapping problem, using the camera projection matrix.
• The closed-form solution of the mathematical system, along with the initial output of the
adapted Faster R-CNN are then passed through a final ShiftNet network that refines the
result using proposed Volume Displacement Loss.
• This geometrically constrained deep learning approach to monocular 3D object detection
obtains top results on KITTI 3D Object Detection Benchmark, being the best among all
monocular methods that do not use any pre-trained network for depth estimation.
Shift R-CNN: Deep Monocular 3d Object Detection
With Closed-Form Geometric Constraints
Overview of Shift R-CNN hybrid model. Stage 1: Faster R-CNN with added 3D angle and dimension regression. Stage 2:
Closed-form solution to 3D translation using camera projection geometric constraints. Stage 3: ShiftNet refinement and
final 3D object box reconstruction.
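The Stage 2 idea can be sketched as a small linear least-squares problem: with orientation and dimensions fixed, the constraint that a particular 3D corner projects onto a particular edge of the 2D box is linear in the translation, so four such constraints determine it. The corner-to-edge assignment and the projection matrix below are assumptions; the actual method evaluates the geometrically valid assignments rather than a single fixed one.

```python
# Sketch of a Stage-2 style closed-form step: "this 3D corner projects onto this
# 2D box edge" is linear in the translation T once R and the dimensions are fixed.
import numpy as np

def corners_object_frame(h, w, l):
    xs = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    ys = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    zs = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    return np.vstack([xs, ys, zs]).T                      # (8, 3)

def solve_translation(P, R, dims, box2d, assignment):
    """assignment maps each edge ('umin','umax','vmin','vmax') to a corner index."""
    u_min, v_min, u_max, v_max = box2d
    X = corners_object_frame(*dims) @ R.T                 # rotated corners, (8, 3)
    rows, rhs = [], []
    for edge, value, prow in [("umin", u_min, 0), ("umax", u_max, 0),
                              ("vmin", v_min, 1), ("vmax", v_max, 1)]:
        a = P[prow, :3] - value * P[2, :3]
        x = X[assignment[edge]]
        rows.append(a)
        rhs.append(-(a @ x + P[prow, 3] - value * P[2, 3]))
    T, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return T                                              # 3D translation (x, y, z)

P = np.array([[721.5, 0, 609.6, 0], [0, 721.5, 172.9, 0], [0, 0, 1, 0.0]])
theta = 0.3
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])
assignment = {"umin": 2, "umax": 0, "vmin": 5, "vmax": 1}  # assumed corner choice
print(solve_translation(P, R, (1.5, 1.7, 4.0), (560, 160, 680, 230), assignment))
```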
Shift R-CNN: Deep Monocular 3d Object Detection
With Closed-Form Geometric Constraints
Stage 2 (top) and Stage 3 (bottom) results comparison. Note that Stage 3 improves the 3D estimation
due to its noise robustness. Turquoise boxes denote objects with the same orientation and magenta
color the opposite orientation.
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
• Monocular 3D object detection is well-known to be a challenging vision task due to the loss
of depth information;
• Attempts to recover depth using separate image-only approaches lead to unstable and
noisy depth estimates, harming 3D detections.
• It proposes a keypoint-based approach for 3D object detection and localization from a
single RGB image.
• It then builds a multi-branch model around 2D keypoint detection and implements it with a conceptually simple geometric reasoning method.
• This network performs in an end-to-end manner, simultaneously and interdependently
estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along
with full 3D pose in the scene.
• To fuse the outputs of the distinct branches, a reprojection consistency loss is applied during training.
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
Start with a universal backbone network (Mask R-CNN) and complement it with three sub-networks:
2D object detection sub-network, 2D keypoints regression sub-network, and dimension regression
sub-network. The network is trained end-to-end using a multi-task loss function.
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
The 5 geometric classes of instances are represented by 5 3D CAD models with strongly distinct aspect ratios.
Geometric reasoning about instance depth
Coordinates and a visibility state are predicted for each of the 14 manually chosen keypoints;
Define instance depth as the depth Z of a vertical plane
passing through the two closest keypoints in the
camera reference frame.
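A minimal illustration of the pinhole relation that such keypoint-based depth reasoning can rest on is given below (not the authors' exact formulation): if two keypoints with a known 3D separation, taken from the matched CAD model, lie at roughly the same depth, that depth follows from their pixel separation by similar triangles.

```python
# Illustrative pinhole relation behind keypoint-based depth reasoning (assumes the
# two keypoints share approximately the same depth Z): d2d = f * d3d / Z.
def instance_depth_from_keypoints(f, d3d, d2d):
    return f * d3d / d2d

# Example: keypoints 1.6 m apart on the matched CAD model, 58 px apart in the
# image, focal length ~721 px (KITTI-like) -> depth of about 20 m.
print(instance_depth_from_keypoints(f=721.5, d3d=1.6, d2d=58.0))  # ~19.9
```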
Annotation of 3D keypoints
Monocular 3D Object Detection via
Geometric Reasoning on Keypoints
The upper part of each sub-figure
contains 2D detection inference,
including 2D bounding boxes and 2D locations of the visible keypoints. Each instance and its keypoints are displayed in their own distinctive color. The lower part visualizes
the 3D point cloud, showing the camera
location as the colored XYZ axes. Green
and red colors stand for the ground truth
and predicted 3D bounding boxes
respectively. The scenes were selected to
express diversity in complexity and cars
positioning w.r.t. the camera.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
• MonoPSR, a monocular 3D object detection method, leverages proposals and shape
reconstruction.
• First, using the fundamental relations of a pinhole camera model, detections from a mature
2D object detector are used to generate a 3D proposal per object in a scene.
• The 3D locations of these proposals prove to be quite accurate, which greatly reduces the difficulty of regressing the final 3D bounding box detection.
• Simultaneously, a point cloud is predicted in an object centered coordinate system to learn
local scale and shape information.
• However, the key challenge is how to exploit shape information to guide 3D localization.
• Losses, including a projection alignment loss, are aggregated to jointly optimize these tasks in the neural network and improve 3D localization accuracy.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
The network takes an image with 2D bounding boxes and regresses instance-centric 3D proposals to produce 3D bounding boxes; a point cloud is estimated to recover local shape and scale and to enforce 2D-3D consistency. The proposal regression and point cloud estimation are trained jointly in the network.
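One common way to realize the pinhole-based proposal mentioned above is sketched below: depth follows from the ratio of a class-prior height to the detection's pixel height, and the box centre is back-projected at that depth to give a 3D centroid proposal. The prior height, intrinsics, and box values are illustrative stand-ins rather than MonoPSR's exact regression targets.

```python
# Illustrative proposal generation from a 2D detection via pinhole relations:
# depth from the ratio of a prior 3D height to the pixel height, then
# back-projection of the box centre to a full 3D centroid proposal.
import numpy as np

def proposal_centroid(box2d, prior_height, fx, fy, cx, cy):
    x1, y1, x2, y2 = box2d
    z = fy * prior_height / (y2 - y1)          # similar triangles on object height
    u, v = (x1 + x2) / 2, (y1 + y2) / 2
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# A car detection ~72 px tall with a ~1.53 m class-prior height -> ~15 m away.
print(proposal_centroid((540, 160, 700, 232), prior_height=1.53,
                        fx=721.5, fy=721.5, cx=609.6, cy=172.9))
```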
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
The network produces a feature map using an image crop of an object and global context features as inputs. From this feature map three tasks are performed: a) the dimensions and orientation are predicted to estimate a proposal; b) offsets for the proposals are regressed; c) local point clouds are predicted and transformed into the global frame for auxiliary loss calculations.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
Losses for the corresponding predictions (red)
and ground truth (green). All penalties use the
smooth L1 loss at valid pixel locations using
automatically generated segmentation masks.
First, the point cloud loss penalizes the instance
point cloud along each channel (x, y, z). The point
cloud is then placed at its estimated location in
the camera coordinate frame using TCO, the
transformation between object and camera
coordinate frames, and penalized in the last
channel z. Finally, the point cloud is projected
into image space with Π, the camera projection
matrix. A projection alignment loss penalizes
points projected into the wrong image pixel
location.
Monocular 3D Object Detection Leveraging
Accurate Proposals and Shape Reconstruction
KITTI dataset. 2D detections (top) are shown in orange. 3D detections in green are shown projected into the
image (top) and in the 3D scene (bottom). Ground truth 3D boxes (bottom) are shown in red. Points within the
detection boxes are the estimated point clouds from the network, while the background points are taken from
the colorized interpolated LiDAR scan. Note that for pedestrians in particular, the projected 3D boxes do not fit
tightly within their 2D box, so constraining the 3D box with the 2D box is not ideal.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
• This is an efficient 3D object detection framework based on a single RGB image in the
scenario of autonomous driving.
• The efforts are put on extracting the underlying 3D information in a 2D image and
determining the accurate 3D bounding box of the object without point cloud or stereo data.
• Leveraging an off-the-shelf 2D object detector, it proposes an approach to efficiently obtain a coarse cuboid for each predicted 2D box.
• The coarse cuboid is accurate enough to guide the refinement that determines the 3D box of the object.
• In contrast to previous SoA methods that only use the features extracted from the 2D
bounding box for box refinement, it explores the 3D structure information of the object by
employing the visual features of visible surfaces.
• The features from surfaces are utilized to eliminate the problem of representation ambiguity
brought by only using a 2D bounding box.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
The key idea: (a) First predict a reliable 2D box and its observation orientation. (b) Based on the predicted 2D
information, utilize artful techniques to efficiently determine
a basic cuboid for the corresponding object, called guidance.
(c) Features extracted from the visible surfaces of projected
guidance as well as the tight 2D bounding box of it will be
utilized by the model to perform accurate refinement with
classification formulation and quality-aware loss.
An example of the feature representation ambiguity caused by only using a 2D bounding box: the 3D boxes vary largely from each other and only the left one is correct, but their corresponding 2D bounding boxes are exactly the same.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
For 2D detection, the Faster R-CNN framework is modified by adding a new branch for orientation prediction.
Top view of observation angle α and
global rotation angle θ.
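For reference, the relation between the two angles under the standard KITTI convention can be written out as below: the global yaw equals the observation angle plus the viewing-ray angle of the object centre. The numbers in the example are arbitrary.

```python
# Relation between observation angle alpha and global yaw theta (standard KITTI
# convention): theta = alpha + arctan(x / z), where (x, z) is the object centre
# in camera coordinates. The network observes alpha from the crop; theta is
# recovered once the location is known.
import math

def global_yaw(alpha, x, z):
    return alpha + math.atan2(x, z)

print(global_yaw(alpha=0.10, x=2.0, z=20.0))   # ~0.10 + 0.0997 = ~0.20 rad
```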
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
3D object detection paradigm. A CNN based model (2D+O subnet) is used to obtain a 2D bounding
box and observation orientation of the object. The guidance is then generated by the proposed
algorithm using the obtained 2D box and orientation together with the projection matrix. Features extracted from the visible surfaces, as well as the 2D bounding box of the projected guidance, are utilized by the refinement model (3D subnet). Instead of direct regression, the refinement model adopts a classification formulation with the quality-aware loss for a more accurate result.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
Visualization of feature extraction from the projected surfaces of 3D box by
perspective transformation.
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
Details of the head of 3D subnet
GS3D: An Efficient 3D Object Detection
Framework for Autonomous Driving
3D detection results
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
• This is a monocular 3D object detection framework in the domain of autonomous driving.
• Unlike previous image-based methods that operate on RGB features from 2D images, this method solves the problem in the reconstructed 3D space in order to exploit 3D context explicitly.
• It first leverages a stand-alone module to transform the input data from the 2D image plane to 3D point-cloud space for a better input representation, then performs 3D detection using a PointNet backbone to obtain objects' 3D locations, dimensions and orientations.
• To enhance the discriminative capability of the point clouds, a multi-modal feature fusion module embeds the complementary RGB cue into the generated point clouds.
• It is more effective to infer the 3D bounding boxes in the generated 3D scene space (i.e., X, Y, Z space) than on the image plane (i.e., the R, G, B image plane).
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
Framework for monocular 3D object detection.
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
• It consists of two main stages: 3D data generation phase and 3D box estimation phase.
• In the 3D data generation phase, two deep CNNs are trained for intermediate tasks (2D detection and depth estimation) to obtain position and depth information.
• In particular, the generated depth is transformed into a point cloud, which is a better representation for 3D detection, and the 2D bounding box provides prior information about the location of the RoI (region of interest).
• Finally, the points in each RoI are extracted as the input data for subsequent steps.
• In the 3D box estimation phase, two modules are designed for background point segmentation and RGB information aggregation, respectively, in order to improve the final task.
• After that, PointNet is used as the backbone network to predict the 3D location, dimension and orientation for each RoI.
• Note that the confidence scores of 2D boxes are assigned to their corresponding 3D boxes.
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
3D box estimation (Det-Net) with RGB features fusion module
Mono Object Detection via Color-Embedded
3D Reconstruction for Autonomous Driving
Qualitative comparisons of RGB information: 3D boxes are projected onto the image plane. The detection results using XYZ information only are shown as white boxes, and the blue boxes come from the model trained with the RGB feature fusion module. The proposed RGB fusion method can improve 3D detection accuracy, especially for occlusion/truncation cases.
Mono3D++: Monocular 3D Vehicle Detection with Two-
Scale 3D Hypotheses and Task Priors
• A method to infer 3d pose and shape of vehicles from a single image.
• To tackle this ill-posed problem, optimize two-scale projection consistency between the
generated 3d hypotheses and their 2d pseudo-measurements.
• Specifically, use a morphable wireframe model to generate a fine-scaled representation of
vehicle shape and pose.
• To reduce its sensitivity to 2d landmarks, jointly model the 3d bounding box as a coarse
representation which improves robustness.
• It also integrates three task priors (unsupervised monocular depth, a ground-plane constraint, and vehicle shape priors), together with forward projection errors, into an overall energy function.
Mono3D++: Monocular 3D Vehicle Detection with Two-
Scale 3D Hypotheses and Task Priors
The method takes a single image as input and generates vehicles' 3D shape and pose estimates in camera coordinates. The inference criterion combines a generative component, jointly optimizing the innovation (forward prediction error) between the projection of the 3D hypotheses and the image pseudo-measurements, monocular depth map constraints, and geometric (ground) constraints, in addition to penalizing large deformations of the shape prior.
Mono3D++: Monocular 3D Vehicle Detection with Two-
Scale 3D Hypotheses and Task Priors
The two-scale 3D hypotheses consist of the rotated and scaled 3D Bbox and morphable wireframe model.
The image pseudo-measurements include 2D Bboxes and landmarks. In the inference scheme, the hypotheses and the pseudo-measurements are used to initialize the optimization and generate the final 3D pose and shape estimate of a vehicle.
Orthographic Feature Transform for
Monocular 3D Object Detection
• Due to the perspective image-based representation, the appearance and scale of objects vary drastically with depth, and meaningful distances are difficult to infer.
• The ability to reason about the world in 3D is an essential element of the 3D object detection task.
• The orthographic feature transform enables escaping the image domain by mapping image-based features into an orthographic 3D space.
• It allows reasoning holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
• This transformation is applied as part of an E2E deep learning architecture.
Orthographic Feature Transform for
Monocular 3D Object Detection
Orthographic Feature Transform (OFT)
1. A front-end ResNet feature extractor which extracts
multi-scale feature maps from the input image.
2. An orthographic feature transform which transforms the image-based feature maps at each scale into an orthographic birds-eye-view representation (sketched after this list).
3. A top down network, consisting of a series of ResNet
residual units, which processes the birds-eye-view
feature maps in a manner which is invariant to the
perspective effects observed in the image.
4. A set of output heads which generate, for each object class and each location on the ground plane, a confidence score, a position offset, a dimension offset and an orientation vector.
5. A non-maximum suppression and decoding stage,
which identifies peaks in the confidence maps and
generates discrete bounding box predictions.
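A minimal sketch of the transform in step 2 is given below: each cell of a birds-eye-view grid on an assumed ground plane is projected into the image and picks up the image feature at that pixel. The real OFT integrates features over the full projected voxel area; nearest-pixel sampling, the grid extents, and the scaled intrinsics here are simplifying assumptions.

```python
# Minimal sketch of an orthographic feature transform: project each BEV cell
# centre into the image and sample the image feature there (nearest pixel).
import numpy as np

def orthographic_feature_transform(feat, P, x_range=(-20, 20), z_range=(1, 41),
                                   cell=0.5, y_ground=1.65):
    C, H, W = feat.shape
    xs = np.arange(x_range[0], x_range[1], cell)
    zs = np.arange(z_range[0], z_range[1], cell)
    bev = np.zeros((C, len(zs), len(xs)), dtype=feat.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            u, v, w = P @ np.array([x, y_ground, z, 1.0])   # project the cell centre
            if w <= 0:
                continue
            u, v = int(round(u / w)), int(round(v / w))
            if 0 <= u < W and 0 <= v < H:
                bev[:, i, j] = feat[:, v, u]                # orthographic feature
    return bev   # (C, depth cells, lateral cells): scale is consistent everywhere

feat = np.random.randn(8, 48, 156).astype(np.float32)        # stride-8 feature map
P = np.array([[90.0, 0, 76.0, 0], [0, 90.0, 21.6, 0], [0, 0, 1, 0]])  # intrinsics / 8
print(orthographic_feature_transform(feat, P).shape)          # (8, 80, 80)
```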
Orthographic Feature Transform for
Monocular 3D Object Detection
Architecture overview. A front-end ResNet feature extractor generates image-based features, which are mapped to an
orthographic representation via orthographic feature transform. The top down network processes these features in the
birds-eye-view space and at each location on the ground plane predicts a confidence score S, a position offset ∆pos, a
dimension offset ∆dim and an angle vector ∆ang .
Orthographic Feature Transform for
Monocular 3D Object Detection
Qualitative comparison between OFT method (left) and Mono3D at CVPR’16 (right) on the KITTI validation
set. Inset regions highlight the behaviors of the two systems at large distances. OFT is able to consistently
detect distant objects which are beyond the range of Mono3D.
Multi-Level Fusion based 3D Object
Detection from Monocular Images
• An E2E multi-level fusion based framework for 3d object detection from a single
monocular image.
• It is composed of two parts: one for 2D region proposal generation and another for simultaneous prediction of objects' 2D locations, orientations, dimensions, and 3D locations.
• With the help of a stand-alone module that estimates the disparity and computes the 3D point cloud, a multi-level fusion scheme is introduced.
• The disparity information is encoded as a front-view feature representation and fused with the RGB image to enhance the input.
• Features extracted from the original input and the point cloud are combined to boost the object detection. For 3D localization, an extra stream predicts the location information directly from the point cloud and adds it to the aforementioned location prediction.
Multi-Level Fusion based 3D Object
Detection from Monocular Images
3D object detection
Multi-Level Fusion based 3D Object
Detection from Monocular Images
Visualization of the 2D detection boxes and the projected 3D detection boxes on the point cloud inferred from the estimated disparity.
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
• MonoGRNet performs amodal 3D object localization from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
• MonoGRNet is a single, unified network composed of four task-specific subnetworks,
responsible for 2d object detection, instance depth estimation (IDE), 3d localization and local
corner regression.
• Unlike pixel-level depth estimation, which needs per-pixel annotations, the IDE method directly predicts the depth of the target 3D bounding box's center using sparse supervision.
• The 3d localization is further achieved by estimating the position in the horizontal and vertical
dimensions.
• Finally, MonoGRNet is jointly learned by optimizing the locations and poses of the 3d bounding
boxes in the global context.
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
• MonoGRNet for 3d object localization from a monocular RGB image.
• MonoGRNet consists of four subnetworks for 2d detection(brown), instance depth
estimation(green), 3d location estimation(blue) and local corner regression(yellow).
• Guided by the detected 2D Bbox, the network first estimates the depth and the 2D projection of the 3D box's center to obtain the global 3D location, and then regresses corner coordinates in the local context (see the sketch after this list).
• The final 3d bounding box is optimized in an E2E manner in the global context based on
the estimated 3d location and local corners.
• VGG-16 as the CNN backbone, but without its FC layers.
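The geometric step can be sketched as below: back-project the predicted 2D projection of the 3D box centre using the predicted instance depth, then translate the locally regressed corner offsets to that location. The intrinsics, depth, and corner values are illustrative, not MonoGRNet outputs.

```python
# Sketch of turning MonoGRNet-style outputs into a global 3D box: back-project
# the predicted centre projection (u, v) at the predicted instance depth Z_c,
# then shift the regressed local corner offsets to that location.
import numpy as np

def global_center(u, v, z_c, fx, fy, cx, cy):
    return np.array([(u - cx) * z_c / fx, (v - cy) * z_c / fy, z_c])

def global_corners(local_corners, center):
    """local_corners: (8, 3) offsets regressed in the object-local context."""
    return local_corners + center

center = global_center(u=640.0, v=200.0, z_c=22.0, fx=721.5, fy=721.5,
                       cx=609.6, cy=172.9)
local = np.array([[ 2.0,  0.0,  0.85], [ 2.0,  0.0, -0.85],
                  [-2.0,  0.0, -0.85], [-2.0,  0.0,  0.85],
                  [ 2.0, -1.5,  0.85], [ 2.0, -1.5, -0.85],
                  [-2.0, -1.5, -0.85], [-2.0, -1.5,  0.85]])
print(global_corners(local, center))   # 8 corners of the 3D box in the camera frame
```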
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Instance depth estimation subnet
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Notation for 3D bounding box localization
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Instance depth
MonoGRNet: A Geometric Reasoning Network
for Monocular 3D Object Localization
Predicted 3D bounding boxes
are drawn in orange, while
ground truths are in blue. Lidar
point clouds are plotted for
reference but not used. Camera
centers are at the bottom-left
corner. (a), (b) and (c) are
common cases when predictions
recall the ground truths.
(d), (e) and (f) demonstrate the model's capability of handling truncated objects outside the
image. (g), (h) and (i) show the
failed detections when some
cars are heavily occluded.
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
• Understanding 3D semantics of the surrounding objects is critically important and a
challenging requirement from the safety perspective of autonomous driving.
• This is a localization-prioritized approach for effectively localizing the position of the object in the 3D world and fitting a complete 3D box around it.
• This method requires a single image and performs both 2D and 3D detection in an end-to-end fashion.
• It works by effectively localizing the projection of the center of the bottom face of the 3D bounding box (CBF) onto the image.
• Later, in the post-processing stage, it uses a look-up-table based approach to reproject the CBF into the 3D world (see the sketch after this list).
• This stage is a one-time setup and simple enough to be deployed in fixed-map communities to store complete knowledge about the ground plane.
• The object’s dimension and pose are predicted in multitask fashion using a shared set of
features.
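A hypothetical sketch of such a look-up table is given below, assuming a flat ground plane: for every pixel, precompute the 3D point where its viewing ray meets the plane, so that localizing the CBF pixel at run time reduces to a table look-up. The plane height, intrinsics, and image size are stand-in values, and a real map-based table would store surveyed ground geometry instead.

```python
# Hypothetical ground-plane look-up table: for each pixel, the 3D point where its
# viewing ray hits the (assumed flat) plane y = h_ground in camera coordinates.
import numpy as np

def build_ground_lut(h_img, w_img, fx, fy, cx, cy, h_ground=1.65):
    u, v = np.meshgrid(np.arange(w_img), np.arange(h_img))
    # Per-pixel ray direction in the camera frame; scale so that y = h_ground.
    dir_x, dir_y, dir_z = (u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=float)
    scale = h_ground / np.where(dir_y > 1e-6, dir_y, np.nan)   # rays above the horizon never hit
    return np.stack([dir_x * scale, dir_y * scale, dir_z * scale], axis=-1)

lut = build_ground_lut(375, 1242, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
cbf_pixel = (620, 250)                      # (u, v) of the detected bottom-face centre
print(lut[cbf_pixel[1], cbf_pixel[0]])      # 3D location of the CBF on the ground plane
```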
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization
Prioritized Approach using Single Monocular Images
Illustration of the 2D detection boxes and the corresponding 3D projections.
Joint Mono 3D Vehicle Detection and Tracking
• Vehicle 3D extents and trajectories are critical cues for predicting the future location of
vehicles and planning future agent ego-motion based on those predictions.
• Here is an online framework for 3D vehicle detection and tracking from monocular videos.
• The framework can not only associate detections of vehicles in motion over time, but also
estimate their complete 3D bounding box information from a sequence of 2D images
captured on a moving platform.
• This method leverages 3D box depth-ordering matching for robust instance association and
utilizes 3D trajectory prediction for re-identification of occluded vehicles.
• It also designs a motion learning module based on an LSTM for more accurate long-term
motion extrapolation.
• On the Argoverse dataset, this image-based method significantly outperforms the LiDAR-centric baseline methods for tracking 3D vehicles within 30 meters.
Joint Mono 3D Vehicle Detection and Tracking
Joint online detection and tracking in 3D.
The dynamic 3D tracking pipeline predicts 3D bounding box association of observed vehicles in image sequences captured by a monocular camera with an ego-motion sensor.
Joint Mono 3D Vehicle Detection and Tracking
Overview of the monocular 3D tracking framework. This online approach processes monocular frames to estimate and track regions of interest (RoIs) in 3D (a). For each RoI, 3D layout estimation (i.e., depth, orientation, dimension, and a projection of the 3D center) is learned (b). With the 3D layout, the LSTM tracker produces robust linking across frames, leveraging occlusion-aware association and depth-ordering matching (c). With the help of 3D tracking, the model further refines the 3D estimation by fusing object motion features from the previous frames (d).
Joint Mono 3D Vehicle Detection and Tracking
Illustration of depth-ordering matching. Given the tracklets and detections, sort
them into a list by depth order. For each detection of interest (DOI), calculate the
IOU between DOI and non-occluded regions of each tracklet. The depth order
naturally provides higher probabilities to tracklets near the DOI.
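A hedged sketch of depth-ordering matching is shown below: tracklets are sorted from near to far, each tracklet's non-occluded 2D region is obtained by subtracting the boxes of nearer tracklets, and a detection of interest is scored by its overlap with those visible regions. The pixel-mask occlusion handling and the image size are simplifications of the paper's scheme.

```python
# Sketch of depth-ordering matching: score a detection of interest (DOI) by its
# overlap with each tracklet's non-occluded region, built in near-to-far order.
import numpy as np

def box_mask(box, shape):
    m = np.zeros(shape, dtype=bool)
    x1, y1, x2, y2 = map(int, box)
    m[y1:y2, x1:x2] = True
    return m

def depth_ordering_scores(doi_box, tracklets, shape=(375, 1242)):
    """tracklets: list of (box2d, depth); returns an overlap score per tracklet."""
    order = sorted(range(len(tracklets)), key=lambda i: tracklets[i][1])  # near -> far
    occupied = np.zeros(shape, dtype=bool)
    visible = {}
    for i in order:
        m = box_mask(tracklets[i][0], shape)
        visible[i] = m & ~occupied            # region not hidden by nearer tracklets
        occupied |= m
    doi = box_mask(doi_box, shape)
    return [(doi & visible[i]).sum() / max(doi.sum(), 1) for i in range(len(tracklets))]

tracklets = [((600, 150, 760, 260), 12.0), ((650, 160, 780, 250), 20.0)]
print(depth_ordering_scores((610, 155, 765, 258), tracklets))
```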
Joint Mono 3D Vehicle Detection and Tracking
Illustration of occlusion-aware association. A tracked tracklet (yellow) is visible all the time, while another tracklet (red) is occluded by a third (blue) at frame T−1. During occlusion, the tracklet does not update its state but keeps inferring its motion until reappearance. A truncated or disappearing tracklet (blue at frame T) is left as lost.
Joint Mono 3D Vehicle Detection and Tracking
Experimental results on the KITTI dataset: 3D layout colored with tracking ID
3-d interpretation from single 2-d image for autonomous driving II

3-d interpretation from single 2-d image for autonomous driving II

  • 1.
    3D Interpretation fromSingle 2D Image for Autonomous Driving II Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2.
    Outline • Task-Aware MonoDepth Estimation for 3D Object Detection • M3D-RPN: Mono 3D Region Proposal Network for Object Detection • Mono 3D Object Detection with Pseudo-LiDAR Point Cloud • Mono 3D Object Detection and Box Fitting Trained E2E Using IoU Loss • Disentangling Mono 3D Object Detection • Shift R-CNN: Deep Mono 3d Object Detection With Closed-Form Geometric Constraints • Mono 3D Object Detection via Geometric Reasoning on Keypoints • Mono 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction • GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving • Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving • Mono3D++: Mono 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors • Orthographic Feature Transform for Mono 3D Object Detection • Multi-Level Fusion based 3D Object Detection from Mono Images • MonoGRNet: A Geometric Reasoning Network for Mono 3D Object Localization • 3D Bounding Boxes for Road Vehicles: A One- Stage, Localization Prioritized Approach using Single Mono Images • Joint Mono 3D Vehicle Detection and Tracking
  • 3.
    Task-Aware Monocular Depth Estimationfor 3D Object Detection • Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. • Almost all methods treat foreground and background regions (“things and stuff”) in an image equally. • However, depth of foreground objects plays a crucial role in 3D object recognition and localization. • It first analyse the data distributions and interaction of foreground and background, then get the foreground- background separated monocular depth estimation (ForeSeE) method, to estimate the foreground and background depth using separate optimization objectives and decoders. • This method significantly improves the depth estimation performance on foreground objects.
  • 4.
    Task-Aware Monocular Depth Estimationfor 3D Object Detection (a) ForeSeE (b) 3D Object Detection Illustration of the overall pipeline. (a) Foreground-background separated depth estimation. (b) 3D object detection.
  • 5.
    Task-Aware Monocular Depth Estimationfor 3D Object Detection (a) Input Image (b) Baseline-PL (c) ForeSeE-PL Qualitative results of 3D object detection. The ground truth 3D bounding boxes are in red; the predictions are in green.
  • 6.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection • Understanding the world in 3D is a critical component of urban autonomous driving. • Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been paramount for successful 3D object detection algorithms, whereas monocular image-only methods experience drastically reduced performance. • It proposes to reduce the gap by reformulating the monocular 3D detection problem as a standalone 3D region proposal network, called M3D-RPN. • M3D-RPN leverages the geometric relationship of 2D and 3D perspectives, allowing 3D boxes to utilize well-known and powerful convolutional features generated in the image- space. • To help address the strenuous 3D parameter estimations, it further designs depth-aware convolutional layers which enable location specific feature development and in consequence improved 3D scene understanding.
  • 7.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection M3D-RPN uses a single monocular 3D region proposal network with global convolution (orange) and local depth-aware convolution (blue) to predict multi-class 3D bounding boxes.
  • 8.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Comparison of Deep3DBox (CVPR’17) and Multi-Fusion (CVPR’18) with M3D-RPN. Notice that prior works are comprised of multi- ple internal stages (orange), and external networks (blue), whereas M3D-RPN is a single-shot network trained end-to-end.
  • 9.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Overview of M3D-RPN. The method consist of parallel paths for global(orange) and local(blue) feature extraction. The global features use regular spatial-invariant convolution, while the local features denote depth-aware convolution. The depth-aware convolution uses non-shared kernels in the row-space ki for i = 1 . . . b, where b denotes # of distinct bins. To leverage both variants of features, weightedly combine each output parameter from the parallel paths.
  • 10.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Anchor Formulation and Visualized 3D Anchors. To depict each parameter of within the 2D / 3D anchor formulation (left). To visualize the precomputed 3D priors when 12 anchors are used after projection in the image view (middle) and Bird’s Eye View (right). For visualization purposes only, to span anchors in specific x3D locations which best minimize overlap when viewed.
  • 11.
    M3D-RPN: Monocular 3DRegion Proposal Network for Object Detection Qualitative Examples. To visualize qualitative examples of our method for multi-class 3D object detection. It uses yellow to denote cars, green for pedestrians, and orange for cyclists. All illustrated images are from Chen et al. method (NIPS’15) split and not used for training.
  • 12.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud • It aims at bridging the performance gap between 3D sensing and 2D sensing for 3D object detection by enhancing LiDAR-based algorithms to work with single image input. • Specifically, to perform monocular depth estimation and lift the input image to a point cloud representation, called pseudo-LiDAR point cloud. • Then train a LiDAR-based 3D detection network with pseudo-LiDAR end-to-end. • Following the pipeline of two-stage 3D detection algorithms, detect 2D object proposals in the input image and extract a point cloud frustum from the pseudo-LiDAR for each proposal, later an oriented 3D bounding box is detected for each frustum. • To handle the large amount of noise in the pseudo-LiDAR: (1) use a 2D-3D bounding box consistency constraint, adjusting the predicted 3D bounding box to have a high overlap with its corresponding 2D proposal after projecting onto the image; (2) use the instance mask instead of the bounding box as the representation of 2D proposals, in order to reduce the number of points not belonging to the object in the point cloud frustum.
  • 13.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud (a) Lift every pixel of input image to 3D coordinates given estimated depth to generate pseudo- LiDAR; (b) Instance mask proposals detected for extracting point cloud frustum; (c) 3D bounding box estimated (blue) for each point cloud frustum made to be consistent with corresponding 2D proposal. Inputs and losses are in red and orange.
  • 14.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud To parameterize 3D bounding box output as a set of seven parameters, including the 3D coordinate of the object center (x, y, z), object’s size h, w, l and its heading angle θ.
  • 15.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud Left: it demonstrates that, when lifting all the pixels within the 2D bounding box proposal into 3D, the generated point cloud frustum has the long tail issue. Right: lifting only the pixels within the instance mask proposal significantly removes the points not being enclosed by the ground truth box, resulting in a point cloud frustum with no tail.
  • 16.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud To alleviate the local misalignment issue, use the geometry constraint of the bounding box consistency to refine 3D bounding box estimate. Given an inaccurate 3D bounding box estimate, it is highly possible that its 2D projection also does not match well with the corresponding 2D proposal. By adjusting the 3D bounding box estimate in 3D space so that its 2D projection can have a higher 2D IoU with the corresponding 2D proposal, it demonstrates that the 3D IoU of 3D bounding box estimate with its ground truth can be also increased.
  • 17.
    Monocular 3D ObjectDetection with Pseudo-LiDAR Point Cloud Qualitative results of proposed method on KITTI set. To visualize 3D bounding box estimate (blue) and ground truth (red) on the frontal images (1st and 3rd rows) and pseudo-LiDAR point cloud (2nd and 4th rows).
  • 18.
    Monocular 3D ObjectDetection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss • Three-dimensional object detection from a single view is a challenging task which, if performed with good accuracy, is an important enabler of low-cost mobile robot perception. • Previous approaches to this problem suffer either from an overly complex inference engine or from an insufficient detection accuracy. • To deal with these issues, propose SS3D, a single-stage monocular 3D object detector. • The framework consists of (i) a CNN, which outputs a redundant representation of each relevant object in the image with corresponding uncertainty estimates, and (ii) a 3D bounding box optimizer. • The SS3D architecture provides a solid framework upon which high performing detection systems can be built, with autonomous driving being the main application in mind.
  • 19.
    Monocular 3D ObjectDetection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss The pipeline consists of the following steps: 1) a CNN performs object detection (yielding class scores) and regresses a set of intermediate values used later for 3D bounding box fitting, 2) non-maximum suppression is applied to discard redundant detections, and finally 3) the 3D bounding boxes are fitted given the intermediate predictions using a non-linear least- squares method.
  • 20.
    Monocular 3D ObjectDetection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss In the center of each object’s ground truth 2D bounding box a rectangular support region is created, with width and height set to 20% of the former. Each output pixel in the support region holds regression targets and one-hot classification targets for the network. In the rare case when support regions overlap, the closer object is favored.
  • 21.
    Disentangling Monocular 3DObject Detection • An approach for monocular 3D object detection from a single RGB image leverages a disentangling transformation for 2D and 3D detection losses and a novel, self-supervised confidence score for 3D bounding boxes. • The proposed loss disentanglement has the twofold advantage of simplifying the training dynamics in the presence of losses with complex interactions of parameters, and sidestepping the issue of balancing independent regression terms. • Its solution overcomes these issues by isolating the contribution made by groups of parameters to a given loss, without changing its nature. • Further to apply loss disentanglement to another signed IoU criterion-driven loss for improving 2D detection results. • To critically review the AP metric used in KITTI3D, identify and resolve a flaw in the 11-point interpolated AP metric, affecting all previously published detection results and particularly biases the results of monocular 3D detection.
  • 22.
    Disentangling Monocular 3DObject Detection • A two-stage architecture consists of a single- stage 2D detector (first stage) with an additional 3D detection head (second stage) constructed on top of features pooled from the detected 2D bounding boxes. • The backbone is a ResNet34 with a Feature Pyramid Network (FPN) built on top of it. • The FPN network has the structure with 3+2 scales, connected to the output of modules conv3, conv4 and conv5 of ResNet34, corresponding to down-sampling factors of ×8, ×16 and ×32, respectively.
  • 23.
    Disentangling Monocular 3DObject Detection • We consider the head of the single-stage 2D detector implemented in RetinaNet, which applies a detection module independently to each output fi of the backbone. • The detection modules share the same parameters but work inherently at different scales, according to the scale of the features that they receive as input. • As opposed to the standard RetinaNet, it employs iABNsync also in this head. • The head is composed of two parallel stacks of 3 × 3 convolutions, and is parametrized by na reference bounding box sizes (anchors) per scale level.
  • 24.
    Disentangling Monocular 3DObject Detection • The 3D detection head regresses a 3D bounding box for each 2D bounding box returned by the 2D detection head (surviving the filtering step). • It starts by applying ROIAlign to pool features from FPN into a 14 × 14 grid for each 2D bounding box, followed by 2 × 2 average pooling, resulting in feature maps with shape 7 × 7 × 128. • On top of this, two parallel branches of fully connected layers with 512 channels compute the outputs. • Each fully connected layer but the last one per branch is followed by iABN (non- synchronized).
  • 25.
    Disentangling Monocular 3DObject Detection Semantics of the outputs of the 2D and 3D detection heads Left: 2D bounding box regression on image plane. Center: 3D bounding box regression. Right: allocentric angle from bird-eye view.
  • 26.
    Disentangling Monocular 3DObject Detection Classes Car (top), Pedestrian (middle) and Cyclist(bottom) with corresponding birds- eye view.
  • 27.
    Shift R-CNN: DeepMonocular 3d Object Detection With Closed-Form Geometric Constraints • Shift R-CNN, a hybrid model for monocular 3D object detection, combines deep learning with the power of geometry. • It adapts a Faster R-CNN network for regressing initial 2D and 3D object properties and combine it with a least squares solution for the inverse 2D to 3D geometric mapping problem, using the camera projection matrix. • The closed-form solution of the mathematical system, along with the initial output of the adapted Faster R-CNN are then passed through a final ShiftNet network that refines the result using proposed Volume Displacement Loss. • This geometrically constrained deep learning approach to monocular 3D object detection obtains top results on KITTI 3D Object Detection Benchmark, being the best among all monocular methods that do not use any pre-trained network for depth estimation.
  • 28.
    Shift R-CNN: DeepMonocular 3d Object Detection With Closed-Form Geometric Constraints Overview of Shift R-CNN hybrid model. Stage 1: Faster R-CNN with added 3D angle and dimension regression. Stage 2: Closed-form solution to 3D translation using camera projection geometric constraints. Stage 3: ShiftNet refinement and final 3D object box reconstruction.
  • 29.
    Shift R-CNN: DeepMonocular 3d Object Detection With Closed-Form Geometric Constraints Stage 2 (top) and Stage 3 (bottom) results comparison. Note that Stage 3 improves the 3D estimation due to its noise robustness. Turquoise boxes denote objects with the same orientation and magenta color the opposite orientation.
  • 30.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints • Monocular 3D object detection is well-known to be a challenging vision task due to the loss of depth information; • Attempts to recover depth using separate image-only approaches lead to unstable and noisy depth estimates, harming 3D detections. • It proposes a keypoint-based approach for 3D object detection and localization from a single RGB image. • It then builds the multi-branch model around 2D keypoint detection implement it with a conceptually simple geometric reasoning method. • This network performs in an end-to-end manner, simultaneously and interdependently estimating 2D characteristics, such as 2D bounding boxes, keypoints, and orientation, along with full 3D pose in the scene. • To fuse the outputs of distinct branches, applying a reprojection consistency loss during training.
  • 31.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints Start with a universal backbone network (Mask R-CNN) and complement it with three sub-networks: 2D object detection sub-network, 2D keypoints regression sub-network, and dimension regression sub-network. The network is trained end-to-end using a multi-task loss function.
  • 32.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints The 5 geometric classes of instances in our work are represented by 5 3D CAD models with strongly distinct aspect ratios. Geometric reasoning about instance depth Predict coordinates and a visibility state for each of the manually-chosen 14 keypoints; Define instance depth as the depth Z of a vertical plane passing through the two closest keypoints in the camera reference frame. Annotation of 3D keypoints
  • 33.
    Monocular 3D ObjectDetection via Geometric Reasoning on Keypoints The upper part of each sub-figure contains 2D detection inference, including 2D bounding and 2D locations of the visible keypoints. Each instance and its keypoints are displayed their distinctive color. The lower part visualizes the 3D point cloud, showing the camera location as the colored XYZ axes. Green and red colors stand for the ground truth and predicted 3D bounding boxes respectively. The scenes were selected to express diversity in complexity and cars positioning w.r.t. the camera.
  • 34.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction • MonoPSR, a monocular 3D object detection method, leverages proposals and shape reconstruction. • First, using the fundamental relations of a pinhole camera model, detections from a mature 2D object detector are used to generate a 3D proposal per object in a scene. • The 3D location of these proposals prove to be quite accurate, which greatly reduces the difficulty of regressing the final 3D bounding box detection. • Simultaneously, a point cloud is predicted in an object centered coordinate system to learn local scale and shape information. • However, the key challenge is how to exploit shape information to guide 3D localization. • Aggregate losses, including a projection alignment loss, to jointly optimize these tasks in the neural network to improve 3D localization accuracy.
  • 35.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction The network takes an image with 2D bounding boxes and regresses instance-centric 3D proposals to produce 3D bound intimated to recover local shape and scale, and to enforce 2D-3D consistency. The proposal regression and point cloud estimation are trained jointly in the network.
  • 36.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction The network produces a feature map using an image crop of an object and global context features as inputs. From this feature map three tasks are performed a) the dimensions and orientation are predicted to estimate a proposal b) offsets for the proposals are regressed c) local point clouds are predicted and transformed into the global frame for auxiliary loss calculations.
  • 37.
    Monocular 3D ObjectDetection Leveraging Accurate Proposals and Shape Reconstruction Losses for the corresponding predictions (red) and ground truth (green). All penalties use the smooth L1 loss at valid pixel locations using automatically generated segmentation masks. First, the point cloud loss penalizes the instance point cloud along each channel (x, y, z). The point cloud is then placed at its estimated location in the camera coordinate frame using TCO, the transformation between object and camera coordinate frames, and penalized in the last channel z. Finally, the point cloud is projected into image space with Π, the camera projection matrix. A projection alignment loss penalizes points projected into the wrong image pixel location.
Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction
KITTI dataset. 2D detections (top) are shown in orange. 3D detections in green are shown projected into the image (top) and in the 3D scene (bottom). Ground-truth 3D boxes (bottom) are shown in red. Points within the detection boxes are the estimated point clouds from the network, while the background points are taken from the colorized, interpolated LiDAR scan. Note that for pedestrians in particular the projected 3D boxes do not fit tightly within their 2D box, so constraining the 3D box with the 2D box is not ideal.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
• This is an efficient 3D object detection framework based on a single RGB image in the scenario of autonomous driving.
• The effort is put on extracting the underlying 3D information in a 2D image and determining the accurate 3D bounding box of the object without point cloud or stereo data.
• Leveraging an off-the-shelf 2D object detector, an approach is proposed to efficiently obtain a coarse cuboid for each predicted 2D box.
• The coarse cuboid has enough accuracy to guide the refinement that determines the 3D box of the object.
• In contrast to previous state-of-the-art methods that only use the features extracted from the 2D bounding box for box refinement, it explores the 3D structure information of the object by employing the visual features of the visible surfaces.
• The features from the surfaces are utilized to eliminate the representation ambiguity brought by using only a 2D bounding box.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
The key idea: (a) first predict a reliable 2D box and its observation orientation; (b) based on the predicted 2D information, utilize artful techniques to efficiently determine a basic cuboid for the corresponding object, called the guidance (a sketch of one way to build such a guidance follows); (c) features extracted from the visible surfaces of the projected guidance, as well as its tight 2D bounding box, are utilized by the model to perform accurate refinement with a classification formulation and a quality-aware loss.
An example of the feature representation ambiguity caused by using only a 2D bounding box: the 3D boxes vary largely from each other and only the left one is correct, yet their corresponding 2D bounding boxes are exactly the same.
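One plausible way to build such a guidance cuboid, sketched under my own assumptions (class-average dimensions, depth from the 2D box height, and the standard observation-angle-to-yaw conversion); GS3D's actual guidance generation may differ in detail.

```python
import numpy as np

def coarse_guidance(box2d, alpha, dims_prior, fx, fy, cx, cy):
    """box2d = (x1, y1, x2, y2) in pixels, alpha = observation angle,
    dims_prior = class-average (H, W, L). Returns a rough cuboid."""
    x1, y1, x2, y2 = box2d
    H, W, L = dims_prior
    Z = fy * H / (y2 - y1)                     # depth from the 2D box height
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    X = (u - cx) * Z / fx                      # back-project the box centre
    Y = (v - cy) * Z / fy
    yaw = alpha + np.arctan2(X, Z)             # observation angle -> global rotation θ
    return dict(center=(X, Y, Z), dims=(H, W, L), yaw=yaw)
```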
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
For 2D detection, the Faster R-CNN framework is modified by adding a new branch for orientation prediction. Top view of the observation angle α and the global rotation angle θ.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
3D object detection paradigm. A CNN-based model (2D+O subnet) is used to obtain a 2D bounding box and the observation orientation of the object. The guidance is then generated by the proposed algorithm from the obtained 2D box and orientation together with the projection matrix. Features extracted from the visible surfaces, as well as from the 2D bounding box of the projected guidance, are utilized by the refinement model (3D subnet). Instead of direct regression, the refinement model adopts a classification formulation with the quality-aware loss for a more accurate result.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Visualization of feature extraction from the projected surfaces of the 3D box by perspective transformation.
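The surface feature extraction by perspective transformation can be illustrated as follows; this is a hedged sketch that warps a projected quadrilateral of an image crop (or a feature map with at most 4 channels) into a fixed-size patch, with function and argument names of my own choosing.

```python
import cv2
import numpy as np

def warp_surface(image, quad_pts, out_size=(7, 7)):
    """image: HxWxC array (C <= 4); quad_pts: (4, 2) corners of the projected
    surface in image coordinates, ordered tl, tr, br, bl."""
    h, w = out_size
    dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    M = cv2.getPerspectiveTransform(np.float32(quad_pts), dst)   # 3x3 homography
    return cv2.warpPerspective(image, M, (w, h))                 # fixed-size patch
```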
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
Details of the head of the 3D subnet.
GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving
3D detection results.
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
• This is a monocular 3D object detection framework in the domain of autonomous driving.
• Unlike previous image-based methods that operate on RGB features from 2D images, this method solves the problem in the reconstructed 3D space in order to exploit 3D contexts explicitly.
• It first leverages a stand-alone module to transform the input data from the 2D image plane to the 3D point cloud space for a better input representation, then performs the 3D detection using a PointNet backbone net to obtain objects' 3D locations, dimensions and orientations.
• To enhance the discriminative capability of the point clouds, a multi-modal feature fusion module embeds the complementary RGB cue into the generated point clouds.
• It is more effective to infer the 3D bounding boxes from the generated 3D scene space (i.e., the X, Y, Z space) than from the image plane (i.e., the R, G, B image plane).
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
Framework for monocular 3D object detection.
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
• It consists of two main stages: a 3D data generation phase and a 3D box estimation phase.
• In the 3D data generation phase, two deep CNNs are trained for intermediate tasks (2D detection and depth estimation) to get position and depth information.
• In particular, the generated depth is transformed into a point cloud, which is a better representation for 3D detection (a back-projection sketch follows this list), and the 2D bounding box provides prior information about the location of the RoI (region of interest).
• Finally, the points in each RoI are extracted as the input data for subsequent steps.
• In the 3D box estimation phase, two modules are designed for background point segmentation and RGB information aggregation, respectively, to improve the final task.
• After that, PointNet is used as the backbone net to predict the 3D location, dimension and orientation for each RoI.
• Note that the confidence scores of the 2D boxes are assigned to their corresponding 3D boxes.
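A minimal sketch of the 3D data generation step, under my assumptions about array layouts: back-project the predicted depth map with the camera intrinsics and embed the RGB cue by concatenating colors to the XYZ coordinates.

```python
import numpy as np

def depth_to_colored_points(depth, rgb, fx, fy, cx, cy):
    """depth: (H, W) metric depth; rgb: (H, W, 3) uint8; returns (H*W, 6) XYZRGB."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grids (H, W)
    Z = depth
    X = (u - cx) * Z / fx                            # back-projection with intrinsics
    Y = (v - cy) * Z / fy
    xyz = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3).astype(np.float32) / 255.0
    return np.concatenate([xyz, colors], axis=1)     # color-embedded pseudo points
```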
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
3D box estimation (Det-Net) with the RGB feature fusion module.
Mono Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
Qualitative comparison of RGB information: 3D boxes are projected to the image plane. The detection results using XYZ information only are represented by white boxes, while blue boxes come from the model trained with the RGB feature fusion module. The proposed RGB fusion method improves the 3D detection accuracy, especially for occlusion/truncation cases.
Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
• A method to infer the 3D pose and shape of vehicles from a single image.
• To tackle this ill-posed problem, it optimizes two-scale projection consistency between the generated 3D hypotheses and their 2D pseudo-measurements.
• Specifically, a morphable wireframe model generates a fine-scaled representation of vehicle shape and pose.
• To reduce the sensitivity to 2D landmarks, the 3D bounding box is jointly modeled as a coarse representation, which improves robustness.
• Three task priors, including unsupervised monocular depth, a ground plane constraint and vehicle shape priors, are integrated with the forward projection errors into an overall energy function.
Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
Takes a single image as input and generates vehicles' 3D shape and pose estimates in camera coordinates. The inference criterion combines a generative component, jointly optimizing the innovation (forward prediction error) between the projection of the 3D hypotheses and the image pseudo-measurements, monocular depth map constraints and geometric (ground) constraints, in addition to penalizing large deformations of the shape prior (a sketch of such an energy follows).
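The overall energy can be pictured as a weighted sum of the terms named above; the sketch below is purely illustrative, with placeholder weights and term names rather than the paper's formulation.

```python
def total_energy(proj_err_box, proj_err_landmarks, depth_err, ground_err,
                 shape_deformation, w=(1.0, 1.0, 0.5, 0.5, 0.1)):
    """Weighted sum of: coarse 3D-box projection error, wireframe landmark
    projection error, monocular depth term, ground-plane term, and a penalty
    on large deformations of the shape prior. Weights are placeholders."""
    terms = (proj_err_box, proj_err_landmarks, depth_err, ground_err,
             shape_deformation)
    return sum(wi * ti for wi, ti in zip(w, terms))
```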
Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors
The two-scale 3D hypotheses consist of the rotated and scaled 3D Bbox and the morphable wireframe model. The image pseudo-measurements include 2D Bboxes and landmarks. In the inference scheme, the hypotheses and the pseudo-measurements initialize the optimization, which generates the final 3D pose and shape estimate of a vehicle.
Orthographic Feature Transform for Monocular 3D Object Detection
• Due to the perspective image-based representation, the appearance and scale of objects vary drastically with depth, and meaningful distances are difficult to infer.
• The ability to reason about the world in 3D is an essential element of the 3D object detection task.
• The orthographic feature transform escapes the image domain by mapping image-based features into an orthographic 3D space.
• It allows reasoning holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
• This transformation is applied as part of an E2E deep learning architecture.
Orthographic Feature Transform for Monocular 3D Object Detection
The Orthographic Feature Transform (OFT) pipeline (a sampling sketch follows this list):
1. A front-end ResNet feature extractor which extracts multi-scale feature maps from the input image.
2. An orthographic feature transform which transforms the image-based feature maps at each scale into an orthographic birds-eye-view representation.
3. A top-down network, consisting of a series of ResNet residual units, which processes the birds-eye-view feature maps in a manner that is invariant to the perspective effects observed in the image.
4. A set of output heads which generate, for each object class and each location on the ground plane, a confidence score, a position offset, a dimension offset and an orientation vector.
5. A non-maximum suppression and decoding stage, which identifies peaks in the confidence maps and generates discrete bounding box predictions.
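A much simplified sketch of step 2, the orthographic feature transform: for every cell of a birds-eye-view grid, project a point at an assumed ground height into the image and sample the image feature there. The real OFT pools over the full projected voxel footprint with integral images; the grid ranges, resolution and nearest-pixel sampling here are my simplifications.

```python
import numpy as np

def oft(image_feats, K, x_range=(-40, 40), z_range=(0, 80), y0=1.0, res=0.5):
    """image_feats: (C, H, W) feature map; K: (3, 3) intrinsics at feature scale.
    Returns a (C, Z_bins, X_bins) orthographic (birds-eye-view) feature map."""
    C, H, W = image_feats.shape
    xs = np.arange(x_range[0], x_range[1], res)
    zs = np.arange(z_range[0], z_range[1], res)
    bev = np.zeros((C, len(zs), len(xs)), dtype=image_feats.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            # Project the 3D point (x, y0, z) on the assumed ground plane.
            u = int(K[0, 0] * x / max(z, 1e-3) + K[0, 2])
            v = int(K[1, 1] * y0 / max(z, 1e-3) + K[1, 2])
            if 0 <= u < W and 0 <= v < H:
                bev[:, i, j] = image_feats[:, v, u]   # nearest-pixel sample
    return bev
```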
Orthographic Feature Transform for Monocular 3D Object Detection
Architecture overview. A front-end ResNet feature extractor generates image-based features, which are mapped to an orthographic representation via the orthographic feature transform. The top-down network processes these features in the birds-eye-view space and, at each location on the ground plane, predicts a confidence score S, a position offset ∆pos, a dimension offset ∆dim and an angle vector ∆ang.
Orthographic Feature Transform for Monocular 3D Object Detection
Qualitative comparison between the OFT method (left) and Mono3D (CVPR'16, right) on the KITTI validation set. Inset regions highlight the behaviors of the two systems at large distances. OFT is able to consistently detect distant objects that are beyond the range of Mono3D.
Multi-Level Fusion based 3D Object Detection from Monocular Images
• An E2E multi-level fusion based framework for 3D object detection from a single monocular image.
• It is composed of two parts: one for 2D region proposal generation and another for simultaneous prediction of objects' 2D locations, orientations, dimensions, and 3D locations.
• With the help of a stand-alone module that estimates the disparity and computes the 3D point cloud, a multi-level fusion scheme is introduced (a sketch of the two fusion levels follows).
• The disparity information is encoded with a front-view feature representation and fused with the RGB image to enhance the input.
• Features extracted from the original input and from the point cloud are combined to boost the object detection. For 3D localization, an extra stream predicts the location information directly from the point cloud and adds it to the aforementioned location prediction.
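The two fusion levels can be sketched roughly as below, assuming an input-level concatenation of RGB with a one-channel disparity front-view map and a feature-level concatenation before the 3D location head; module names are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Input-level fusion: concatenate RGB with a disparity front-view map."""
    def __init__(self, out_ch=32):
        super().__init__()
        self.conv = nn.Conv2d(3 + 1, out_ch, kernel_size=3, padding=1)

    def forward(self, rgb, disparity):           # rgb: (B,3,H,W), disparity: (B,1,H,W)
        x = torch.cat([rgb, disparity], dim=1)   # fuse at the input
        return self.conv(x)

def feature_fusion(img_feat, pc_feat):
    """Feature-level fusion: concatenate RoI image features with point-cloud
    features (both shaped (B, C)) before the 3D location prediction."""
    return torch.cat([img_feat, pc_feat], dim=1)
```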
Multi-Level Fusion based 3D Object Detection from Monocular Images
3D object detection.
Multi-Level Fusion based 3D Object Detection from Monocular Images
Visualization of the 2D detection boxes and the projected 3D detection boxes on the point cloud inferred from the estimated disparity.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
• MonoGRNet performs amodal 3D object localization from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
• MonoGRNet is a single, unified network composed of four task-specific subnetworks, responsible for 2D object detection, instance depth estimation (IDE), 3D localization and local corner regression.
• Unlike pixel-level depth estimation, which needs per-pixel annotations, the IDE method directly predicts the depth of the target 3D bounding box's center using sparse supervision.
• The 3D localization is further achieved by estimating the position in the horizontal and vertical dimensions.
• Finally, MonoGRNet is jointly learned by optimizing the locations and poses of the 3D bounding boxes in the global context.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
• MonoGRNet for 3D object localization from a monocular RGB image.
• MonoGRNet consists of four subnetworks for 2D detection (brown), instance depth estimation (green), 3D location estimation (blue) and local corner regression (yellow).
• Guided by the detected 2D bounding box, the network first estimates the depth and the 2D projection of the 3D box's center to obtain the global 3D location (see the sketch below), and then regresses the corner coordinates in the local context.
• The final 3D bounding box is optimized in an E2E manner in the global context based on the estimated 3D location and local corners.
• VGG-16 is used as the CNN backbone, but without its FC layers.
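The geometric step of recovering the global 3D center from the predicted instance depth and the predicted 2D projection of that center can be written out as follows; variable names are mine.

```python
import numpy as np

def recover_3d_center(u_c, v_c, Z_c, fx, fy, cx, cy):
    """(u_c, v_c): predicted 2D projection of the 3D box centre (pixels);
    Z_c: predicted instance depth; fx, fy, cx, cy: camera intrinsics."""
    X_c = (u_c - cx) * Z_c / fx
    Y_c = (v_c - cy) * Z_c / fy
    return np.array([X_c, Y_c, Z_c])

# The 8 corners regressed in the local (object-centred) frame are then
# translated by this centre to obtain the global 3D bounding box.
```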
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Instance depth estimation subnet.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Notation for 3D bounding box localization.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Instance depth.
MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization
Predicted 3D bounding boxes are drawn in orange, while ground truths are in blue. LiDAR point clouds are plotted for reference but not used. Camera centers are at the bottom-left corner. (a), (b) and (c) are common cases where predictions recall the ground truths. (d), (e) and (f) demonstrate the model's capability of handling truncated objects outside the image. (g), (h) and (i) show failed detections when some cars are heavily occluded.
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization Prioritized Approach using Single Monocular Images
• Understanding the 3D semantics of surrounding objects is a critically important and challenging requirement from the safety perspective of autonomous driving.
• This is a localization-prioritized approach for effectively localizing the position of the object in the 3D world and fitting a complete 3D box around it.
• The method requires a single image and performs both 2D and 3D detection in an end-to-end fashion.
• It works by effectively localizing the projection of the center of the bottom face of the 3D bounding box (CBF) in the image.
• Later, in the post-processing stage, it uses a look-up-table based approach to reproject the CBF into the 3D world (a sketch of such a table follows).
• This stage is a one-time setup and simple enough to be deployed in fixed map communities to store complete knowledge about the ground plane.
• The object's dimensions and pose are predicted in a multitask fashion using a shared set of features.
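One way to realize the look-up-table reprojection of the CBF, sketched under the assumption of a flat ground plane at a fixed camera height (1.65 m here, a placeholder): precompute the ray-ground intersection for every pixel, then simply index the table with the predicted CBF location at inference.

```python
import numpy as np

def build_cbf_lut(H, W, K, plane_n=np.array([0., -1., 0.]), plane_d=1.65):
    """Precompute, for every pixel, the 3D point where its viewing ray meets
    the ground plane n^T X + d = 0 (camera frame: x right, y down, z forward)."""
    K_inv = np.linalg.inv(K)
    lut = np.full((H, W, 3), np.nan, dtype=np.float32)
    for v in range(H):
        for u in range(W):
            ray = K_inv @ np.array([u, v, 1.0])       # viewing ray direction
            denom = plane_n @ ray
            if abs(denom) > 1e-6:
                t = -plane_d / denom
                if t > 0:                             # pixels above the horizon: no hit
                    lut[v, u] = t * ray               # 3D point on the ground plane
    return lut

# At inference: ground_point = lut[int(v_cbf), int(u_cbf)]
```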
3D Bounding Boxes for Road Vehicles: A One-Stage, Localization Prioritized Approach using Single Monocular Images
Illustration of the 2D detection boxes and the corresponding 3D projections.
Joint Mono 3D Vehicle Detection and Tracking
• Vehicle 3D extents and trajectories are critical cues for predicting the future location of vehicles and planning future agent ego-motion based on those predictions.
• Here is an online framework for 3D vehicle detection and tracking from monocular videos.
• The framework can not only associate detections of vehicles in motion over time, but also estimate their complete 3D bounding box information from a sequence of 2D images captured on a moving platform.
• The method leverages 3D box depth-ordering matching for robust instance association and utilizes 3D trajectory prediction for re-identification of occluded vehicles.
• It also designs a motion learning module based on an LSTM for more accurate long-term motion extrapolation.
• On the Argoverse dataset, this image-based method is significantly better at tracking 3D vehicles within 30 meters than the LiDAR-centric baseline methods.
Joint Mono 3D Vehicle Detection and Tracking
Joint online detection and tracking in 3D. The dynamic 3D tracking pipeline predicts the 3D bounding box association of observed vehicles in image sequences captured by a monocular camera with an ego-motion sensor.
Joint Mono 3D Vehicle Detection and Tracking
Overview of the monocular 3D tracking framework. This online approach processes monocular frames to estimate and track regions of interest (RoIs) in 3D (a). For each RoI, it learns 3D layout estimation (i.e., depth, orientation, dimension and a projection of the 3D center) (b). With the 3D layout, the LSTM tracker produces robust linking across frames, leveraging occlusion-aware association and depth-ordering matching (c). With the help of 3D tracking, the model further refines its 3D estimation by fusing object motion features from the previous frames (d).
Joint Mono 3D Vehicle Detection and Tracking
Illustration of depth-ordering matching. Given the tracklets and detections, sort them into a list by depth order. For each detection of interest (DOI), calculate the IoU between the DOI and the non-occluded regions of each tracklet. The depth order naturally gives higher probabilities to tracklets near the DOI.
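A hedged sketch of depth-ordering matching as described in the caption: sort tracklets by estimated depth, approximate each tracklet's non-occluded fraction from the boxes in front of it, and score each detection of interest by overlap with those visible regions. The box format, the occlusion approximation and the helper names are my assumptions, not the paper's exact procedure.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def depth_ordering_scores(doi_box, tracklets):
    """tracklets: list of dicts {'box': (x1, y1, x2, y2), 'depth': z}.
    Returns a matching score per tracklet, higher for tracklets whose
    visible (non-occluded) region overlaps the DOI more."""
    order = sorted(range(len(tracklets)), key=lambda i: tracklets[i]['depth'])
    scores = np.zeros(len(tracklets))
    for rank, i in enumerate(order):
        box = tracklets[i]['box']
        # Occlusion by nearer tracklets, approximated by their maximum overlap.
        occlusion = max((iou(box, tracklets[j]['box']) for j in order[:rank]),
                        default=0.0)
        visible = 1.0 - occlusion                 # crude non-occluded fraction
        scores[i] = iou(doi_box, box) * visible
    return scores
```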
Joint Mono 3D Vehicle Detection and Tracking
Illustration of occlusion-aware association. A tracked tracklet (yellow) is visible all the time, while another tracklet (red) is occluded by a third one (blue) at frame T-1. During occlusion, the occluded tracklet does not update its state but keeps inferring motion until reappearance. A truncated or disappearing tracklet (blue at frame T) is marked as lost.
Joint Mono 3D Vehicle Detection and Tracking
Experimental results on the KITTI dataset: 3D layouts colored with tracking IDs.