3D Interpretation from Single 2D Image
for Autonomous Driving III
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Towards Generalization Across Depth for Monocular 3D Object Detection
• RTM3D: Real-time Monocular 3D Detection from Object Keypoints for
Autonomous Driving
• Monocular 3D Object Detection with Decoupled Structured Polygon Estimation
and Height-Guided Depth Estimation
• Exploring the Capabilities and Limits of 3D Monocular Object Detection - A
Study on Simulation and Real World Data
• Object-Aware Centroid Voting for Monocular 3D Object Detection
• Monocular 3D Detection with Geometric Constraints Embedding and Semi-
supervised Training
• Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
Towards Generalization Across Depth for
Monocular 3D Object Detection
• This work advances the state of the art by introducing MoVi-3D, a single-
stage deep architecture for monocular 3D object detection.
• MoVi-3D builds upon an approach which leverages geometrical information
to generate, both at training and test time, virtual views where the object
appearance is normalized with respect to distance.
• These virtually generated views facilitate the detection task, as they
significantly reduce the visual appearance variability associated with objects
placed at different distances from the camera.
• As a consequence, the deep model is relieved from learning depth-specific
representations and its complexity can be significantly reduced.
• In particular, thanks to the virtual view generation process, a lightweight,
single-stage architecture suffices to set new state-of-the-art results on the
popular KITTI3D benchmark.
Towards Generalization Across Depth for
Monocular 3D Object Detection
The aim is to predict a 3D bounding box for each object given a single image (left). In this image, the
scale of an object heavily depends on its distance from the camera, so the complexity of detection
increases as the distance grows. Instead of performing the detection on the original image, perform it
on virtual images (middle). Each virtual image presents a cropped and scaled version of the original
image that preserves the scale of objects as if the image had been taken at a different, given depth.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Illustration of the Monocular 3D Object Detection task. Given an input image (left),
the model predicts a 3D box for each object (middle). Each box has its 3D
dimensions s = (W, H, L), 3D center c = (x, y, z), and rotation α.
Towards Generalization Across Depth for
Monocular 3D Object Detection
• The goal is to devise a training/inference procedure that enables generalization across
depth, by indirectly forcing the models to develop representations for objects that are less
dependent on their actual depth in the scene.
• The idea is to feed the model with transformed images that have been put into a canonical
form that depends on some query depth.
• After this transformation, no matter where the car is in space, we obtain an image of the car
that is consistent in terms of the scale of the object.
• Clearly, depth still influences the appearance, e.g. due to perspective deformations, but by
removing the scale factor from the nuisance variables, the task the model has to solve is
simplified.
• In order to apply the proposed transformation, the locations of the 3D objects need to be
known in advance.
Towards Generalization Across Depth for
Monocular 3D Object Detection
3D viewport
Compute the top-left and bottom-right corners of the viewport, namely (Xv,Yv,Zv) and (Xv +
Wv,Yv – Hv,Zv) respectively, and project them to the image plane of the camera, yielding the
top-left and bottom-right corners of a 2D viewport. Crop this region and rescale it to the desired
resolution wv × hv to get the final output: a virtual image generated by the given 3D viewport.
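As a rough illustration of this step, the sketch below projects the two 3D viewport corners with the camera intrinsics and then crops and rescales the corresponding image region. Function and variable names are illustrative, and the axis conventions and rounding are assumptions, not the authors' code.

```python
import numpy as np
import cv2  # assumed available for the crop-and-resize step

def virtual_view(image, K, viewport, out_size=(128, 128)):
    """Sketch of virtual-view generation from a 3D viewport.

    viewport: (Xv, Yv, Zv, Wv, Hv) -- top-left corner, depth, metric width/height.
    K: 3x3 camera intrinsics. Names and conventions are illustrative.
    """
    Xv, Yv, Zv, Wv, Hv = viewport
    corners_3d = np.array([[Xv, Yv, Zv],             # top-left corner
                           [Xv + Wv, Yv - Hv, Zv]])  # bottom-right corner
    uvw = (K @ corners_3d.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                    # project to the image plane
    u0, v0 = uv.min(axis=0)
    u1, v1 = uv.max(axis=0)
    crop = image[int(round(v0)):int(round(v1)), int(round(u0)):int(round(u1))]
    return cv2.resize(crop, out_size)                # rescale to the desired wv x hv
```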
Towards Generalization Across Depth for
Monocular 3D Object Detection
• The goal of the training procedure is to build a network that is able to make correct
predictions within a limited depth range given an image generated from a 3D viewport.
• A ground-truth-guided sampling procedure: repeatedly draw (without replacement) a
ground-truth object, then sample a 3D viewport in its neighborhood so that the
object is completely visible in the virtual image.
• The location of the 3D viewport is perturbed with respect to the position of the target
ground-truth object in order to obtain a model that is robust to depth ranges up to the
predefined depth resolution Zres, which in turn plays an important role at inference time.
• In addition, a small share of the virtual images is generated by 3D viewports positioned
randomly, in such a way that the corresponding virtual image is completely contained
in the original image.
• A class-uniform sampling strategy: yields an equal number of virtual images for each
class present in the original image.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Training virtual image creation. We randomly sample a target object (dark-red car). Given the input
image, object position and camera parameters, compute a 3D viewport that we place at z = Zv. Then
project the 3D viewport onto the image plane, resulting in a 2D viewport. Finally crop the
corresponding region and rescale it to obtain the target virtual view (right).
Towards Generalization Across Depth for
Monocular 3D Object Detection
• Since the network has been trained to predict at distances up to twice the depth
step, we can be reasonably confident that no objects are missed, in the sense
that each object will be covered by at least one virtual image.
• Also, due to the convolutional nature of the architecture, the width of the
virtual image is adjusted to cover the entire extent of the input image.
• By doing so, the virtual images become wider with increasing depth, following a
width rule expressed in terms of the input image width W (given in the paper).
• Finally, NMS is performed over detections that have been generated from the
same virtual image.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Inference pipeline. Given the input image, camera parameters, and Zres, create a series of 3D
viewports placed every Zres/2 meters along the Z axis. Then project these viewports onto the image,
and crop and rescale the resulting regions to obtain distance-specific virtual views. Finally, use these
views to perform the 3D detection.
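A minimal sketch of the depth placement described above, assuming a fixed detection range [z_min, z_max]; the exact width-scaling rule for the virtual images is given in the paper and omitted here.

```python
import numpy as np

def viewport_depths(z_min, z_max, z_res):
    """Place candidate 3D viewports every Zres/2 meters along the Z axis, so every
    object falls within the trained depth range of at least one virtual view."""
    return np.arange(z_min, z_max + 1e-6, z_res / 2.0)

# e.g. viewport_depths(5.0, 45.0, 10.0) -> viewports at 5, 10, 15, ..., 45 m
```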
Towards Generalization Across Depth for
Monocular 3D Object Detection
It consists of two parallel branches: the top one is devoted to providing confidences about the predicted
2D and 3D bounding boxes, while the bottom one is devoted to regressing the actual bounding boxes.
White rectangles denote 3×3 convolutions with 128 output channels followed by iABNsync.
Towards Generalization Across Depth for
Monocular 3D Object Detection
Towards Generalization Across Depth for
Monocular 3D Object Detection
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
• It proposes an efficient and accurate monocular 3D detection framework in
single shot.
• This method predicts the nine perspective keypoints of a 3D bounding box in
image space, and then utilizes the geometric relationship between the 3D and 2D
perspectives to recover the dimension, location, and orientation in 3D space.
• In this method, the properties of the object can be predicted stably even
when the estimation of keypoints is very noisy, which enables us to obtain
fast detection speed with a small architecture.
• Training uses the 3D properties of the object without the need for external
networks or supervision data.
• This method is the first real-time system for monocular 3D detection, while
achieving state-of-the-art performance on the KITTI benchmark.
• Code will be released at https://github.com/Banconxuan/RTM3D.
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Overview of the proposed method: first predict the ordinal keypoints projected into
image space by the eight vertices and the central point of a 3D object; then
reformulate the estimation of the 3D bounding box as the minimization of an energy
function built from the geometric constraints of perspective projection.
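As a simplified stand-in for this energy minimization, the sketch below recovers the 3D location by least-squares fitting the projected box keypoints to the predicted 2D keypoints, given the predicted dimension and orientation. It is an illustration under assumed KITTI-style conventions, not the authors' solver.

```python
import numpy as np
from scipy.optimize import least_squares  # assumed available

def box_keypoints_3d(dim, loc, ry):
    """8 corners + centre of a 3D box in camera coords (KITTI-style: y down,
    box bottom at the y of loc, rotation ry around the vertical axis)."""
    h, w, l = dim
    x = [ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2,  0.0 ]
    y = [ 0.0,  0.0,  0.0,  0.0,  -h,   -h,   -h,   -h,  -h/2 ]
    z = [ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2,  0.0 ]
    R = np.array([[ np.cos(ry), 0.0, np.sin(ry)],
                  [ 0.0,        1.0, 0.0       ],
                  [-np.sin(ry), 0.0, np.cos(ry)]])
    return R @ np.vstack([x, y, z]) + np.asarray(loc).reshape(3, 1)

def project(K, pts_3d):
    uvw = K @ pts_3d
    return (uvw[:2] / uvw[2]).T          # (9, 2) pixel coordinates

def recover_location(K, kpts_2d, dim, ry, z_init=20.0):
    """Recover the 3D location by minimising the reprojection error between the
    predicted 2D keypoints (9, 2) and the projected 3D box keypoints."""
    def residual(loc):
        return (project(K, box_keypoints_3d(dim, loc, ry)) - kpts_2d).ravel()
    # Initialise on the viewing ray of the main-centre keypoint at a guessed depth.
    u, v = kpts_2d[8]
    loc0 = np.linalg.inv(K) @ np.array([u, v, 1.0]) * z_init
    return least_squares(residual, loc0).x
```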
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
An overview of the proposed keypoint detection architecture: it takes only the RGB image as input
and outputs the main-center heatmap, vertex heatmaps, and vertex coordinates as the base module
to estimate the 3D bounding box. It can also predict other optional priors to further improve 3D
detection performance.
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Illustration of keypoint feature pyramid network (KFPN).
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
• Since location recovery in 3D space is quite difficult due to the absence of
depth information, this work proposes a unified framework which decomposes the
detection problem into a structured polygon prediction task and a depth recovery task.
• Different from the widely studied 2D bounding boxes, the proposed structured
polygon in the 2D image consists of several projected surfaces of the target
object, serving as a better representation for 3D detection.
• In order to inversely project the predicted 2D structured polygon to a
cuboid in the 3D physical world, the following depth recovery task uses
the object height prior to complete the inverse projection
transformation with the given camera projection matrix.
• Moreover, a fine-grained 3D box refinement scheme is proposed to
further rectify the 3D detection results.
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
The overall framework (Decoupled-3D) decouples the monocular 3D object detection problem into sub-tasks. The
overall network consists of three parts. (Top row) The 2D structured polygons are generated with a stacked
hourglass network. (Middle row) The object depth stage utilizes the 3D object height as a prior to recover the
missing depth of the object. (Bottom row) The 3D box refinement stage rectifies coarse 3D boxes using bird’s eye
view features in 3D RoIs.
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Structured polygon estimation aims to estimate the 2D locations of the projected vertices.
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Height-Guided Depth Estimation. Combine the
object height H and the corresponding pixel
height h to estimate the object depth.
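Under the pinhole model, the depth follows directly from the similar-triangles relation z ≈ f·H/h; a tiny sketch with illustrative names:

```python
def depth_from_height(f_y, H_object, h_pixels):
    """Height-guided depth: an object of physical height H_object spanning h_pixels
    vertically in the image lies at roughly z = f_y * H_object / h_pixels."""
    return f_y * H_object / h_pixels

# e.g. a 1.5 m tall car spanning 50 px with f_y = 700 px -> depth ~ 21 m
```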
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
3D Box Refinement. Rectify coarse boxes with the bird’s eye view map.
Note: the depth network is DORN (“Deep Ordinal Regression Network for Monocular Depth Estimation”).
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
• Recent deep learning methods show promising results to recover depth
information from single images by learning priors about the environment.
• In addition to the network design, the major difference between these competing
approaches lies in using a supervised or self-supervised optimization loss
function, which requires different data and ground truth information.
• This paper evaluates the performance of a 3D object detection pipeline which
is parameterizable with different depth estimation configurations.
• It implements a simple distance calculation approach based on camera
intrinsics and 2D bounding box size, as well as a self-supervised and a supervised
learning approach for depth estimation.
• It evaluates the detection pipeline on simulator data and a real-world
sequence from an autonomous vehicle on a race track.
• Advantages and drawbacks of the different depth estimation strategies are
discussed.
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
3D object detection pipeline with
three alternative configurations
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
• Distance calculation using the 2D bounding box height and the known
height of the real-world race car as a geometric constraint (“known
height assumption”).
• Depth estimation for the whole image using the supervised
DenseDepth network. The distance to each object is calculated as the
median depth value in the bounding box crop. Explicit knowledge
about the objects, like height information, is not required in this
approach.
• Depth estimation for the whole image using the self-supervised
struct2depth network. The distance to each object is calculated as the
median depth value in the bounding box crop. Explicit knowledge
about the objects, like height information, is not required in this
approach.
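For the two learned-depth configurations, the per-object distance reduces to the median of the predicted depth map inside the 2D box; a minimal sketch, assuming a dense depth map and pixel box coordinates (names illustrative):

```python
import numpy as np

def object_distance(depth_map, box):
    """Distance of a detection taken as the median predicted depth inside its
    2D bounding box crop. box = (u1, v1, u2, v2) in pixels."""
    u1, v1, u2, v2 = [int(round(c)) for c in box]
    return float(np.median(depth_map[v1:v2, u1:u2]))
```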
Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• This paper proposes an end-to-end trainable monocular 3D object
detector without learning dense depth.
• Specifically, the grid coordinates of a 2D box are first projected back to
3D space with the pinhole model as 3D centroid proposals.
• Then, an object-aware voting approach is introduced, which considers
both the region-wise appearance attention and the geometric
projection distribution, to vote the 3D centroid proposals for 3D object
localization.
• With the late fusion and the predicted 3D orientation and dimension,
the 3D bounding boxes of objects can be detected from a single RGB
image.
• The method is straightforward yet significantly superior to other
monocular-based methods.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
3D Object Detection Pipeline. Given an
image with predicted 2D region proposals
(yellow box), the regions are divided into
grids. Each grid point with (u, v) coordinates
is projected back to 3D space by leveraging
the pinhole model and the class-specific 3D
height H, resulting in 3D box centroid
proposals. With the voting method inspired
by both appearance and geometric cues,
3D object location is predicted.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
The Architecture. 2D region proposals are first obtained from the RPN module. Then, with the 3D Center
Reasoning (left), multiple 3D centroid proposals are estimated from the 2D RoI grid coordinates.
Followed by the Object-Aware Voting (right), which consists of geometric projection distribution (GPD)
and appearance attention map (AAM), the 3D centroid proposals are voted for 3D localization. The 3D
dimension and orientation are estimated together with the 2D object detection head.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Objects on the driving road are placed horizontally, with negligible roll and pitch angles
with respect to the camera.
• Besides, the 3D dimension variance within each object class (such as Car) is quite small.
• These constraints lead to the idea that the apparent heights of objects in the image are
approximately invariant when the objects are at the same depth.
• A recent survey also points out that the position and apparent size of an object in an image
can be used to infer its depth on the KITTI dataset.
• Therefore, the 3D object centroid can be roughly inferred with the simple pinhole camera
model.
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Specifically, divide each 2D region proposal into s × s grid cells and project the grid
coordinates back into 3D space.
• Since each grid point indicates a probable projection of the corresponding 3D object
centroid, multiple 3D centroid proposals P3d are obtained, where the i-th centroid proposal
P3d(Xi, Yi, Z) is computed via the pinhole camera model (see the sketch below).
Examples and statistics on KITTI training set.
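The back-projection can be sketched as follows, assuming the depth shared by all grid points is obtained from the class-specific height H and the box pixel height via the pinhole relation; this illustrates the idea and is not the paper's exact formula.

```python
import numpy as np

def centroid_proposals(K, box, H_class, s=4):
    """Back-project an s x s grid of points inside a 2D box to 3D centroid
    proposals. Depth is taken from the class-specific height H_class and the box
    pixel height via the pinhole relation Z ~= fy * H / h."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u1, v1, u2, v2 = box
    Z = fy * H_class / (v2 - v1)                 # shared depth estimate
    uu, vv = np.meshgrid(np.linspace(u1, u2, s), np.linspace(v1, v2, s))
    X = (uu - cx) * Z / fx
    Y = (vv - cy) * Z / fy
    return np.stack([X, Y, np.full_like(X, Z)], axis=-1)   # (s, s, 3) proposals
```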
Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Specifically, use a single 1×1 convolution followed by a sigmoid activation to generate the
appearance attention map from the feature maps of the RoI pooling layer.
• The activated convolution feature map from the image indicates the foreground semantic
objects, due to the classification supervision in 2D object detection, leading to the object-
aware voting.
• The geometric voting component comes from the distribution of the offset between the
projected 3D centroid and the 2D box center.
• It has been demonstrated that the 2D box center can be modeled as a Gaussian distribution
with the ground truth as its expectation.
• To dynamically learn the distribution, the 2D grid coordinates and the RoI image features are
concatenated as input to a fully-connected layer that predicts the offset, with the Kullback-
Leibler (KL) divergence as the loss function to supervise the learning.
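One common way to realize this is to predict a Gaussian over the offset and penalize it with a KL-style loss against the ground-truth offset. The sketch below uses the familiar attenuated-L2 form; it is an assumption about the formulation, not the paper's exact derivation.

```python
import torch

def kl_offset_loss(pred_offset, pred_log_var, gt_offset):
    """KL-style loss: the prediction is modeled as a Gaussian (mean, variance) and
    the ground truth as a (near-)delta; larger predicted variance down-weights the
    squared error while being penalized by the log-variance term."""
    return (torch.exp(-pred_log_var) * (pred_offset - gt_offset) ** 2 / 2.0
            + pred_log_var / 2.0).mean()
```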
Object-Aware Centroid Voting for
Monocular 3D Object Detection
The object-aware voting can be formulated as the
element-wise multiplication of the normalized
probability maps Mapp and Mgeo (see the sketch below).
In the training stage, the 3D localization pipeline is
trained with a smooth L1 loss.
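A minimal sketch of the voting step, assuming the two maps and the centroid proposals share the same s × s grid (names illustrative):

```python
import numpy as np

def vote_centroid(proposals, m_app, m_geo):
    """Fuse the appearance attention map and the geometric distribution map by
    element-wise multiplication, normalize the result into a probability map, and
    use it to average the (s, s, 3) centroid proposals into one voted 3D location."""
    w = m_app * m_geo
    w = w / (w.sum() + 1e-9)
    return (proposals * w[..., None]).sum(axis=(0, 1))
```

The `proposals` array can come directly from the back-projection sketch above.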
Object-Aware Centroid Voting for
Monocular 3D Object Detection
The 3D dimension prediction loss, comparing
predictions and the ground truth, is defined in
logarithm space through the smooth L1 loss.
For 3D orientation estimation, Multi-Bin is used to
disentangle it into residual angle prediction and
angle-bin classification, and the 3D orientation
loss is formed accordingly (see the sketch below).
These loss functions are combined for the joint
multi-task training of 2D and 3D object detection.
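A common Multi-Bin formulation combines a cross-entropy loss over angle bins with a regression loss on the in-bin residual; the sketch below is one such instantiation and may differ in detail from the paper's version.

```python
import math
import torch
import torch.nn.functional as F

def multibin_orientation_loss(bin_logits, residual_pred, gt_angle, num_bins=2):
    """Sketch of a Multi-Bin orientation loss: classify the angle bin, then regress
    the residual angle inside the ground-truth bin.
    bin_logits: (B, num_bins), residual_pred: (B, num_bins), gt_angle: (B,)."""
    bin_size = 2.0 * math.pi / num_bins
    angle = gt_angle % (2.0 * math.pi)
    gt_bin = (angle / bin_size).long().clamp(max=num_bins - 1)
    gt_residual = angle - (gt_bin.float() + 0.5) * bin_size   # offset from bin centre
    cls_loss = F.cross_entropy(bin_logits, gt_bin)
    res_pred = residual_pred.gather(1, gt_bin.unsqueeze(1)).squeeze(1)
    reg_loss = F.smooth_l1_loss(res_pred, gt_residual)
    return cls_loss + reg_loss
```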
Object-Aware Centroid Voting for
Monocular 3D Object Detection
Object-Aware Centroid Voting for
Monocular 3D Object Detection
Object-Aware Centroid Voting for
Monocular 3D Object Detection
Qualitative results. Red: detected 3D boxes. Yellow: ground truth. Right: bird’s eye view (BEV) results.
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
• A single-shot, keypoint-based framework for monocular 3D object detection, KM3D-Net.
• A fully convolutional model is designed to predict object keypoints, dimension, and
orientation, which are then combined with perspective geometry constraints to compute position.
• Further, the geometric constraints are reformulated as a differentiable version and embedded into the
network, maintaining the consistency of model outputs in an end-to-end (E2E) fashion.
• A semi-supervised training strategy is then proposed for the case where labeled training data is scarce.
• In this strategy, a consensus prediction is enforced between two shared-weight copies of KM3D-Net for the
same unlabeled image under different input augmentation conditions and network
regularization.
• In particular, the coordinate-dependent augmentations are unified as an affine transformation for
differentiably recovering position, and a keypoint-dropout module is proposed for network
regularization.
• This model only requires RGB images, without synthetic data, instance segmentation, CAD
models, or a depth generator.
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Overview of KM3D-Net, which outputs
keypoints, object dimensions, local
orientation, and 3D confidence, followed
by differentiable geometric consistency
constraints to predict position.
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Overview of the unsupervised training branch. It leverages an affine transformation to unify input
augmentation and devises keypoint dropout for regularization. These two strategies make KM3D-Net
output two stochastic predictions for the same input; penalizing their difference is the training goal.
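The consensus objective can be sketched as two stochastic forward passes of the same network on differently augmented copies of an unlabeled image, mapped back to a common frame and compared. Here `augment` and `to_common_frame` are illustrative placeholders, and the actual KM3D-Net loss may use a different distance.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, image, augment, to_common_frame):
    """Semi-supervised consistency sketch: penalize the difference between two
    predictions of the shared-weights network under different augmentations
    (including keypoint dropout inside the model) on the same unlabeled image."""
    img_a, affine_a = augment(image)                  # random affine + photometric augmentation
    img_b, affine_b = augment(image)
    pred_a = to_common_frame(model(img_a), affine_a)  # undo the affine transform
    pred_b = to_common_frame(model(img_b), affine_b)
    return F.l1_loss(pred_a, pred_b)
```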
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
• The training of deep-learning-based 3D object detectors requires
large datasets with 3D bounding box labels for supervision that have to
be generated by hand-labeling.
• A network architecture and training procedure for learning monocular
3D object detection without 3D bounding box labels.
• By representing the objects as triangular meshes and employing
differentiable shape rendering, loss functions are defined based on depth
maps, segmentation masks, and ego- and object-motion, which are
generated by pre-trained, off-the-shelf networks.
• The method achieves performance comparable to state-of-the-art methods that require
3D bounding box labels for training, and superior performance to conventional baseline methods.
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
A monocular 3D vehicle detector that requires no 3D bounding box labels for training. The right image shows that
the predicted vehicles (colored shapes) fit the GT bounding boxes (red). Despite the noisy input depth (lower
left), the method is able to accurately predict the 3D poses of vehicles thanks to the proposed fully differentiable
training scheme. The projections of the predicted bounding boxes are shown (colored boxes, upper left).
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
The proposed model contains a single-image network and a multi-image network extension. The single-
image network back-projects the input depth map into a point cloud. A Frustum PointNet encoder predicts
the pose and shape, which are then decoded into a predicted 3D mesh and segmentation mask through
differentiable rendering. The multi-image network architecture takes three images as inputs, and the
single-image network is applied individually to each image. This network predicts a depth map for the
middle frame based on the vehicle’s pose and shape. A pre-trained network predicts ego-motion and
object-motion from the images. The reconstruction loss is computed by differentiably warping the images
into the middle frame.
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
In order to train without 3D bounding box labels, three losses are used: the
segmentation loss Lseg, the chamfer distance Lcd, and the photometric
reconstruction loss Lrec. The first two are defined for single images, while
the photometric reconstruction loss relies on temporal photo-consistency
over three consecutive frames.
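Of the three losses, the chamfer distance Lcd is the most self-contained to illustrate; a minimal sketch for two point sets follows (the segmentation and photometric terms depend on the renderer and the warping and are omitted).

```python
import torch

def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between point sets of shape (N, 3) and (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = torch.cdist(points_a, points_b)          # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```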
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
Qualitative comparison of MonoGRNet (1st row), Mono3D (2nd row), and this method (3rd row) with depth
maps from BTS. Shown are the GT bounding boxes for cars (red), the predicted bounding boxes (green), and the
back-projected point cloud. In comparison to Mono3D, the prediction accuracy is increased, specifically for
vehicles farther away. The performance of MonoGRNet and this model is comparable.
Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
3-d interpretation from single 2-d image III

More Related Content

What's hot

What's hot (20)

Stereo Matching by Deep Learning
Stereo Matching by Deep LearningStereo Matching by Deep Learning
Stereo Matching by Deep Learning
 
Deep vo and slam ii
Deep vo and slam iiDeep vo and slam ii
Deep vo and slam ii
 
Depth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep LearningDepth Fusion from RGB and Depth Sensors by Deep Learning
Depth Fusion from RGB and Depth Sensors by Deep Learning
 
Deep vo and slam iii
Deep vo and slam iiiDeep vo and slam iii
Deep vo and slam iii
 
Fisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving IIFisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving II
 
Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling Deep learning for 3-D Scene Reconstruction and Modeling
Deep learning for 3-D Scene Reconstruction and Modeling
 
Depth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors IIDepth Fusion from RGB and Depth Sensors II
Depth Fusion from RGB and Depth Sensors II
 
Depth Fusion from RGB and Depth Sensors IV
Depth Fusion from RGB and Depth Sensors  IVDepth Fusion from RGB and Depth Sensors  IV
Depth Fusion from RGB and Depth Sensors IV
 
Survey 1 (project overview)
Survey 1 (project overview)Survey 1 (project overview)
Survey 1 (project overview)
 
Depth Fusion from RGB and Depth Sensors III
Depth Fusion from RGB and Depth Sensors  IIIDepth Fusion from RGB and Depth Sensors  III
Depth Fusion from RGB and Depth Sensors III
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Anchor free object detection by deep learning
Anchor free object detection by deep learningAnchor free object detection by deep learning
Anchor free object detection by deep learning
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
Deep VO and SLAM IV
Deep VO and SLAM IVDeep VO and SLAM IV
Deep VO and SLAM IV
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processing
 
3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V3-d interpretation from single 2-d image V
3-d interpretation from single 2-d image V
 
camera-based Lane detection by deep learning
camera-based Lane detection by deep learningcamera-based Lane detection by deep learning
camera-based Lane detection by deep learning
 
Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 

Similar to 3-d interpretation from single 2-d image III

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
Prathamesh Joshi
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
c.choi
 

Similar to 3-d interpretation from single 2-d image III (20)

3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
fusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving Ifusion of Camera and lidar for autonomous driving I
fusion of Camera and lidar for autonomous driving I
 
On constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized ImagesOn constructing z dimensional Image By DIBR Synthesized Images
On constructing z dimensional Image By DIBR Synthesized Images
 
Rendering Algorithms.pptx
Rendering Algorithms.pptxRendering Algorithms.pptx
Rendering Algorithms.pptx
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
visual realism in geometric modeling
visual realism in geometric modelingvisual realism in geometric modeling
visual realism in geometric modeling
 
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
 
Goal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D cameraGoal location prediction based on deep learning using RGB-D camera
Goal location prediction based on deep learning using RGB-D camera
 
Presentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking ProjectPresentation Object Recognition And Tracking Project
Presentation Object Recognition And Tracking Project
 
Vision based non-invasive tool for facial swelling assessment
Vision based non-invasive tool for facial swelling assessment Vision based non-invasive tool for facial swelling assessment
Vision based non-invasive tool for facial swelling assessment
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
 
BEV Semantic Segmentation
BEV Semantic SegmentationBEV Semantic Segmentation
BEV Semantic Segmentation
 
Object tracking
Object trackingObject tracking
Object tracking
 
Various object detection and tracking methods
Various object detection and tracking methodsVarious object detection and tracking methods
Various object detection and tracking methods
 
ei2106-submit-opt-415
ei2106-submit-opt-415ei2106-submit-opt-415
ei2106-submit-opt-415
 
Introduction talk to Computer Vision
Introduction talk to Computer Vision Introduction talk to Computer Vision
Introduction talk to Computer Vision
 
Learning to Perceive the 3D World
Learning to Perceive the 3D WorldLearning to Perceive the 3D World
Learning to Perceive the 3D World
 
3 d scanning technology
3 d scanning technology3 d scanning technology
3 d scanning technology
 

More from Yu Huang

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 

Recently uploaded

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 

Recently uploaded (20)

Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 

3-d interpretation from single 2-d image III

  • 1. 3D Interpretation from Single 2D Image for Autonomous Driving III Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • Towards Generalization Across Depth for Monocular 3D Object Detection • RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving • Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation • Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Object-Aware Centroid Voting for Monocular 3D Object Detection • Monocular 3D Detection with Geometric Constraints Embedding and Semi- supervised Training • Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
  • 3. Towards Generalization Across Depth for Monocular 3D Object Detection • This work advances the state of the art by introducing MoVi-3D, a single- stage deep architecture for monocular 3D object detection. • MoVi-3D builds upon an approach which leverages geometrical information to generate, both at training and test time, virtual views where the object appearance is normalized with respect to distance. • These virtually generated views facilitate the detection task as they significantly reduce the visual appearance variability associated to objects placed at different distances from the camera. • As a consequence, the deep model is relieved from learning depth-specific representations and its complexity can be significantly reduced. • In particular, in this work thanks to virtual views generation process, a lightweight, single-stage architecture suffices to set new state-of-the-art results on the popular KITTI3D benchmark.
  • 4. Towards Generalization Across Depth for Monocular 3D Object Detection Aim at predicting a 3D bounding box for each object given a single image (left). In this image, the scale of an object heavily depends on its distance with respect to the camera. For this reason the complexity of the detection increases as the distance grows. Instead of performing the detection on the original image, perform it on virtual images (middle). Each virtual image presents a cropped and and scaled version of the original image that preserves the scale of objects as if the image was taken at a different, given depth.
  • 5. Towards Generalization Across Depth for Monocular 3D Object Detection Illustration of the Monocular 3D Object Detection task. Given an input image (left), the model predicts a 3D box for each object (middle). Each box has its 3D dimensions s = (W;H;L), 3D center c = (x; y; z) and rotation (alpha).
  • 6. Towards Generalization Across Depth for Monocular 3D Object Detection • The goal is to devise a training/inference procedure that enables generalization across depth, by indirectly forcing the models to develop representations for objects that are less dependent on their actual depth in the scene. • The idea is to feed the model with transformed images that have been put into a canonical form that depends on some query depth. • After this transformation, no matter where the car is in space, obtain an image of the car that is consistent in terms of the scale of the object. • Clearly, depth still influences the appearance, e.g. due to perspective deformations, but by removing the scale factor from the nuisance variables,able to simplify the task that has to be solved by the model. • In order to apply the proposed transformation,need to know the location of the 3D objects in advance.
  • 7. Towards Generalization Across Depth for Monocular 3D Object Detection 3D viewport Compute the top-left and bottom-right corners of the viewport, namely (Xv,Yv,Zv) and (Xv + Wv,Yv – Hv,Zv) respectively, and project them to the image plane of the camera, yielding the top-left and bottom-right corners of a 2D viewport. Crop it and rescale it to the desired resolution wv x hv to get the final output. It is a virtual image generated by the given 3D viewport.
  • 8. Towards Generalization Across Depth for Monocular 3D Object Detection • The goal of the training procedure is to build a network that is able to make correct predictions within a limited depth range given an image generated from a 3D viewport. • A ground-truth-guided sampling procedure:repeatedly draw (without replacement) a ground-truth object and then sample a 3D viewport in a neighborhood thereof so that the object is completely visible in the virtual image. • The location of the 3D viewport is perturbed with respect to the position of the target ground-truth object in order to obtain a model that is robust to depth ranges up to the predefined depth resolution Zres, which in turn plays an important role at inference time. • In addition, let a small share of the virtual images to be generated by 3D viewports randomly positioned in a way that the corresponding virtual image is completely contained in the original image. • A class-uniform sampling strategy:allows to get an even number of virtual images for each class that is present in the original image.
  • 9. Towards Generalization Across Depth for Monocular 3D Object Detection Training virtual image creation. We randomly sample a target object (dark-red car). Given the input image, object position and camera parameters, compute a 3D viewport that we place at z = Zv. Then project the 3D viewport onto the image plane, resulting in a 2D viewport. Finally crop the corresponding region and rescale it to obtain the target virtual view (right).
  • 10. Towards Generalization Across Depth for Monocular 3D Object Detection • Since have trained the network to be able to predict at distances that are twice the depth step, reasonably confident not missing objects, in the sense that each object will be covered by at least a virtual image. • Also, due to the convolutional nature of the architecture adjust the width of the virtual image in a way to cover the entire extent of the input image. • By doing so have virtual images that become wider as increasing the depth, following the rule (W is the width of the input image): • Finally perform NMS over detections that have been generated from the same virtual image.
  • 11. Towards Generalization Across Depth for Monocular 3D Object Detection Inference pipeline. Given the input image, camera parameters and Zres,create a series of 3D viewports placing every Zres/2 meters along the Z axis. Then project these viewports onto the image, crop and rescale the resulting regions to obtain distance-specific virtual views. Finally use these views to perform the 3D detection.
  • 12. Towards Generalization Across Depth for Monocular 3D Object Detection It consists of two parallel branches, the top one devoted to providing confidences about the predicted 2D and 3D bounding boxes, while the bottom one is devoted to regressing the actual bounding boxes. White rectangles denote 33 convolutions with 128 output channels followed by iABNsync.
  • 13. Towards Generalization Across Depth for Monocular 3D Object Detection
  • 14. Towards Generalization Across Depth for Monocular 3D Object Detection
  • 15. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving • It proposes an efficient and accurate monocular 3D detection framework in single shot. • This method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilize the geometric relationship of 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. • In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables us to obtain fast detection speed with a small architecture. • Training uses the 3D properties of the object without the need for external networks or supervision data. • This method is the first real-time system for monocular image 3D detection while achieves state-of the-art performance on the KITTI benchmark. • Code will be released at https://github.com/Banconxuan/RTM3D.
  • 16. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving Overview of proposed method: first predict ordinal keypoints projected in the image space by eight vertexes and a central point of a 3D object. then reformulate the estimation of the 3D bounding box as the problem of minimizing the energy function by using geometric constraints of perspective projection.
  • 17. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving An overview of proposed keypoint detection architecture: It takes only the RGB images as the input and outputs main center heatmap, vertexes heatmap, and vertexes coordinate as the base module to estimate 3D bounding box. It can also predict other alternative priors to further improve the performance of 3D detection.
  • 18. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving Illustration of keypoint feature pyramid network (KFPN).
  • 19. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
  • 20. RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
  • 21. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation • Since the location recovery in 3D space is quite difficult on account of absence of depth information, this work proposes a unified framework which decomposes the detection problem into a structured polygon prediction task and a depth recovery task. • Different from the widely studied 2D bounding boxes, the proposed structured polygon in the 2D image consists of several projected surfaces of the target object as better representation for 3D detection. • In order to inversely project the predicted 2D structured polygon to a cuboid in the 3D physical world, the following depth recovery task uses the object height prior to complete the inverse projection transformation with the given camera projection matrix. • Moreover, a fine-grained 3D box refinement scheme is proposed to further rectify the 3D detection results.
  • 22. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation The overall framework (Decoupled-3D) decouples the monocular 3D object detection problem into sub-tasks. The overall network consists of three parts. (Top row) The 2D structured polygons are generated with a stacked hourglass network. (Middle row) Object depth stage utilizes 3D object height as a prior to recover the missing depth of the object. (Bottom row) 3D box refine stage rectifies coarse 3D boxes using bird’s eye view features in 3D-ROIs.
  • 23. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation Structured polygon estimation aims to estimate the 2D locations of the projected vertices
  • 24. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation Height-Guided Depth Estimation. Combine object height H and corresponding pixel value h to estimate object depth
  • 25. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation 3D Box Refinement. Rectify coarse boxes with bird’s eye view map Note: Depth Net DOR(“Deep Ordinal Regression Network for Monocular Depth Estimation”)
  • 26. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
  • 27. Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation
  • 28. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Recent deep learning methods show promising results to recover depth information from single images by learning priors about the environment. • In addition to the network design, the major difference of these competing approaches lies in using a supervised or self-supervised optimization loss function, which require different data and ground truth information. • This paper evaluate the performance of a 3D object detection pipeline which is parameterizable with different depth estimation configurations. • It implement a simple distance calculation approach based on camera intrinsics and 2D bounding box size, a self-supervised, and a supervised learning approach for depth estimation. • It evaluate the detection pipeline on simulator data and a real world sequence from an autonomous vehicle on a race track. • Advantages and drawbacks of the different depth estimation strategies are discussed
  • 29. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data 3D object detection pipeline with three alternative configurations
  • 30. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data • Distance calculation using the 2D bounding box height, and the known height of the real world race car as a geometric constraint. “known height assumption” • Depth estimation for the whole image using the supervised DenseDepth network. The distance to each object is calculated as the median depth value in the bounding box crop. Explicit knowledge about the objects, like height information, is not required in this approach. • Depth estimation for the whole image using the self-supervised struct2depth network. The distance to each object is calculated as the median depth value in the bounding box crop. Explicit knowledge about the objects, like height information, is not required in this approach.
  • 31. Exploring the Capabilities and Limits of 3D Monocular Object Detection - A Study on Simulation and Real World Data
  • 32. Object-Aware Centroid Voting for Monocular 3D Object Detection • This paper propose an end-to-end trainable monocular 3D object detector without learning the dense depth. • Specifically, the grid coordinates of a 2D box are first projected back to 3D space with the pinhole model as 3D centroids proposals. • Then, a object-aware voting approach is introduced, which considers both the region-wise appearance attention and the geometric projection distribution, to vote the 3D centroid proposals for 3D object localization. • With the late fusion and the predicted 3D orientation and dimension, the 3D bounding boxes of objects can be detected from a single RGB image. • The method is straightforward yet significantly superior to other monocular-based methods.
  • 33. Object-Aware Centroid Voting for Monocular 3D Object Detection 3D Object Detection Pipeline. Given an image with predicted 2D region proposals (yellow box), the regions are divided into grids. Each grid point with (u;v) coordinate is projected back to 3D space by leveraging the pinhole model and the class-specific 3D height H, resulting in 3D box centroid proposals. With the voting method inspired by both appearance and geometric cues, 3D object location is predicted.
  • 34. Object-Aware Centroid Voting for Monocular 3D Object Detection The Architecture. 2D region proposals are first obtained from the RPN module. Then, with the 3D Center Reasoning (left), multiple 3D centroid proposals are estimated from the 2D RoI grid coordinates. Followed by the Object-Aware Voting (right), which consists of geometric projection distribution (GPD) and appearance attention map (AAM), the 3D centroid proposals are voted for 3D localization. For the 3D dimension and orientation, they are estimated together with 2D object detection head.
• 35. Object-Aware Centroid Voting for Monocular 3D Object Detection
• Objects on the driving road are horizontally placed, with negligible roll and pitch angles with respect to the camera.
• Besides, the 3D dimension variance within each object class (such as Car) is quite small.
• These constraints lead to the idea that the apparent heights of objects in the image are approximately invariant when the objects are at the same depth.
• A recent survey also points out that the position and apparent size of an object in an image can be used to infer its depth on the KITTI dataset.
• Therefore, the 3D object centroid can be roughly inferred with the simple pinhole camera model.
• 36. Object-Aware Centroid Voting for Monocular 3D Object Detection
• Specifically, each 2D region proposal is divided into s x s grid cells and the grid coordinates are projected back into 3D space.
• Since each grid point indicates a probable projection of the corresponding 3D object centroid, multiple 3D centroid proposals P3d are obtained, where the i-th centroid proposal P3d = (Xi, Yi, Z) is computed by the pinhole back-projection of the grid coordinate (see the sketch below).
Examples and statistics on KITTI training set.
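The back-projection equation appears as an image on the slide; a plausible reconstruction consistent with the text, where H is the class-specific 3D height, h2d the 2D box height in pixels, (ui, vi) the grid coordinate, and (fx, fy, cu, cv) the assumed camera intrinsics, is:

```latex
Z = \frac{f_y \, H}{h_{2d}}, \qquad
X_i = \frac{(u_i - c_u)\, Z}{f_x}, \qquad
Y_i = \frac{(v_i - c_v)\, Z}{f_y}
```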
• 37. Object-Aware Centroid Voting for Monocular 3D Object Detection
• Specifically, a single 1x1 convolution followed by a sigmoid activation generates the appearance attention map from the feature maps of the RoI pooling layer.
• The activated convolutional feature map indicates the foreground semantic objects, owing to the classification supervision in 2D object detection, leading to the object-aware voting.
• The geometric voting component comes from the distribution of the offset between the projected 3D centroid and the 2D box center.
• It has been demonstrated that the 2D box center can be modeled as a Gaussian distribution with the ground truth as its expectation.
• To dynamically learn this distribution, the 2D grid coordinates and the RoI image features are concatenated as input to a fully-connected layer that predicts the offset, with the Kullback-Leibler (KL) divergence as the loss function supervising the learning (a minimal sketch of the two voting maps follows below).
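A minimal PyTorch sketch of the two voting maps, assuming a 7 x 7 RoI grid and 256-channel pooled features; the layer sizes and the softmax parameterization of the GPD are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareVoting(nn.Module):
    """Sketch of the appearance attention map (AAM) and geometric projection
    distribution (GPD) heads used for object-aware voting."""

    def __init__(self, in_channels: int = 256, grid_size: int = 7):
        super().__init__()
        s = self.s = grid_size
        # AAM: 1x1 convolution + sigmoid over the pooled RoI features.
        self.attention = nn.Sequential(nn.Conv2d(in_channels, 1, kernel_size=1),
                                       nn.Sigmoid())
        # GPD: fully-connected layer over concatenated grid coordinates and RoI
        # features, producing one score per grid cell.
        self.gpd_head = nn.Linear(in_channels * s * s + 2 * s * s, s * s)

    def forward(self, roi_feat: torch.Tensor, grid_uv: torch.Tensor):
        # roi_feat: (N, C, s, s) pooled RoI features; grid_uv: (N, s*s, 2) coords.
        m_app = self.attention(roi_feat).flatten(1)                  # (N, s*s)
        x = torch.cat([roi_feat.flatten(1), grid_uv.flatten(1)], dim=1)
        m_geo = F.softmax(self.gpd_head(x), dim=1)                   # (N, s*s)
        return m_app, m_geo
```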
• 38. Object-Aware Centroid Voting for Monocular 3D Object Detection
• The object-aware voting can be formulated as the element-wise multiplication of the normalized probability maps Mapp and Mgeo (a hedged reconstruction of the formula is given below).
• In the training stage, the 3D localization pipeline is trained with a smooth L1 loss.
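The voting equation is shown as an image on the slide; under the assumption that the voted 3D location is a weighted sum of the centroid proposals, a plausible form is:

```latex
P_{3d}^{*} = \sum_{i=1}^{s^2} w_i \, P_{3d}^{(i)},
\qquad
w_i = \frac{\big(M_{app} \odot M_{geo}\big)_i}{\sum_{j} \big(M_{app} \odot M_{geo}\big)_j},
\qquad
\mathcal{L}_{loc} = \mathrm{SmoothL1}\!\left(P_{3d}^{*} - P_{3d}^{gt}\right)
```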
• 39. Object-Aware Centroid Voting for Monocular 3D Object Detection
• The 3D dimension prediction loss, comparing predictions with the ground truth, is defined in logarithm space through the smooth L1 loss.
• For 3D orientation estimation, Multi-Bin is used to disentangle it into residual angle regression and angle-bin classification, and the orientation loss is formed accordingly.
• The loss functions for the joint multi-task training of 2D and 3D object detection combine these terms (a hedged reconstruction is given below).
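The loss equations appear as images on the slide; a plausible reconstruction consistent with the text, where the weights λ are assumed balancing factors, is:

```latex
\mathcal{L}_{dim} = \mathrm{SmoothL1}\!\left(\log \mathbf{s} - \log \mathbf{s}^{gt}\right),
\qquad
\mathcal{L}_{ori} = \mathcal{L}_{bin}^{cls} + \mathcal{L}_{res}^{reg},
\qquad
\mathcal{L} = \mathcal{L}_{2D} + \lambda_{loc}\,\mathcal{L}_{loc} + \lambda_{dim}\,\mathcal{L}_{dim} + \lambda_{ori}\,\mathcal{L}_{ori}
```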
  • 40. Object-Aware Centroid Voting for Monocular 3D Object Detection
  • 41. Object-Aware Centroid Voting for Monocular 3D Object Detection
• 42. Object-Aware Centroid Voting for Monocular 3D Object Detection
Qualitative results. Red: detected 3D boxes. Yellow: ground truth. Right: bird's-eye view (BEV) results.
• 43. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
• KM3D-Net is a single-shot, keypoints-based framework for monocular 3D object detection.
• A fully convolutional model predicts object keypoints, dimension, and orientation, which are then combined with perspective geometry constraints to compute position.
• The geometric constraints are further reformulated as a differentiable version and embedded into the network, maintaining the consistency of model outputs in an end-to-end fashion.
• A semi-supervised training strategy is then proposed for the case where labeled training data is scarce.
• In this strategy, a consensus prediction is enforced between two shared-weight copies of KM3D-Net for the same unlabeled image under different input augmentations and network regularization.
• In particular, the coordinate-dependent augmentations are unified as an affine transformation for differentiable position recovery, and a keypoints-dropout module is proposed for network regularization.
• This model requires only RGB images, without synthetic data, instance segmentation, CAD models, or a depth generator.
• 44. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
Overview of KM3D-Net, which outputs keypoints, object dimensions, local orientation, and 3D confidence, followed by differentiable geometric consistency constraints to predict position.
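A hedged sketch of one way such a differentiable geometric constraint can be realized: the translation T is recovered by a least-squares fit of the projected box corners to the predicted 2D keypoints. The corner ordering, coordinate convention, and use of exactly 8 corner keypoints are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def solve_position(kpts_2d: torch.Tensor, dims: torch.Tensor, yaw: torch.Tensor,
                   K: torch.Tensor) -> torch.Tensor:
    """Recover the 3D translation T from predicted 2D keypoints (8, 2), dimensions
    (h, w, l), yaw, and intrinsics K (3, 3) via a differentiable least squares."""
    h, w, l = dims
    # 8 box corners in the object frame (x right, y down, z forward, origin at the
    # bottom center) -- an assumed convention.
    x = torch.tensor([1., 1., -1., -1., 1., 1., -1., -1.]) * (l / 2)
    y = torch.tensor([0., -1., 0., -1., 0., -1., 0., -1.]) * h
    z = torch.tensor([1., 1., 1., 1., -1., -1., -1., -1.]) * (w / 2)
    corners = torch.stack([x, y, z], dim=1)                       # (8, 3)
    c, s = torch.cos(yaw), torch.sin(yaw)
    zero = torch.zeros(())
    R = torch.stack([torch.stack([c, zero, s]),
                     torch.tensor([0., 1., 0.]),
                     torch.stack([-s, zero, c])])                 # rotation about y
    X = corners @ R.T                                             # rotated corners
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    A, b = [], []
    for i in range(8):
        u, v = kpts_2d[i]
        # From u*(Xz+Tz) = fx*(Xx+Tx) + cx*(Xz+Tz) and the analogous v equation.
        A.append(torch.stack([fx, zero, cx - u]))
        b.append((u - cx) * X[i, 2] - fx * X[i, 0])
        A.append(torch.stack([zero, fy, cy - v]))
        b.append((v - cy) * X[i, 2] - fy * X[i, 1])
    A = torch.stack(A)                                            # (16, 3)
    b = torch.stack(b).unsqueeze(1)                               # (16, 1)
    # Differentiable least-squares solve for T = (Tx, Ty, Tz) (A assumed full rank).
    return torch.linalg.lstsq(A, b).solution.squeeze(1)
```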
• 45. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
Overview of the unsupervised training. It leverages an affine transformation to unify the input augmentations and devises keypoints dropout for regularization. These two strategies make KM3D-Net produce two stochastic predictions for the same input; penalizing their difference is the training goal.
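A minimal sketch of this consensus objective on an unlabeled image; `augment`, `inverse_affine`, and the dictionary output keys are hypothetical helpers and names, not the paper's API, and keypoint dropout is assumed to be active inside the model during these passes.

```python
import torch
import torch.nn.functional as F

def consensus_loss(model, image, augment, inverse_affine):
    """Two stochastic forward passes of the shared-weight network on differently
    augmented copies of the same unlabeled image, penalized for disagreeing."""
    img_a, affine_a = augment(image)        # coordinate-dependent augmentation A
    img_b, affine_b = augment(image)        # coordinate-dependent augmentation B
    pred_a = model(img_a)                   # keypoints, dimensions, orientation, ...
    pred_b = model(img_b)
    # Map coordinate-dependent outputs back to the original image frame using the
    # inverse affine transforms so the two predictions are directly comparable.
    kpts_a = inverse_affine(pred_a["keypoints"], affine_a)
    kpts_b = inverse_affine(pred_b["keypoints"], affine_b)
    loss = F.l1_loss(kpts_a, kpts_b) + F.l1_loss(pred_a["dim"], pred_b["dim"])
    return loss
```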
  • 46. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
  • 47. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
• 48. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
• The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision, which have to be generated by hand-labeling.
• This work presents a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels.
• By representing the objects as triangular meshes and employing differentiable shape rendering, loss functions are defined based on depth maps, segmentation masks, and ego- and object-motion, all generated by pre-trained, off-the-shelf networks.
• The approach achieves performance comparable to state-of-the-art methods that require 3D bounding box labels for training, and superior performance to conventional baseline methods.
• 49. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
A monocular 3D vehicle detector that requires no 3D bounding box labels for training. The right image shows that the predicted vehicles (colored shapes) fit the GT bounding boxes (red). Despite the noisy input depth (lower left), the method is able to accurately predict the 3D poses of vehicles thanks to the proposed fully differentiable training scheme. The projections of the predicted bounding boxes are shown in the upper left (colored boxes).
• 50. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
The proposed model contains a single-image network and a multi-image network extension. The single-image network back-projects the input depth map into a point cloud. A Frustum PointNet encoder predicts the pose and shape, which are then decoded into a predicted 3D mesh and segmentation mask through differentiable rendering. The multi-image network takes three images as input, and the single-image network is applied individually to each image. This network predicts a depth map for the middle frame based on the vehicle's pose and shape. A pre-trained network predicts ego-motion and object-motion from the images. The reconstruction loss is computed by differentiably warping the images into the middle frame.
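A generic sketch of such a photometric reconstruction term: a neighboring frame is warped into the middle frame using the predicted depth and relative camera motion, then compared photometrically. This is a standard formulation under assumed tensor shapes; the paper's exact loss (for example additional SSIM terms or masking) may differ.

```python
import torch
import torch.nn.functional as F

def photometric_reconstruction_loss(src_img, tgt_img, tgt_depth, K, T_tgt_to_src):
    """Warp src_img into the middle (target) frame using the predicted depth and the
    relative motion, then compare with tgt_img. Shapes: images (B, 3, H, W), depth
    (B, 1, H, W), K (B, 3, 3), T_tgt_to_src (B, 4, 4)."""
    B, _, H, W = tgt_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)
    cam = torch.linalg.inv(K) @ pix * tgt_depth.view(B, 1, -1)   # back-project pixels
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)     # homogeneous coords
    src_cam = (T_tgt_to_src @ cam_h)[:, :3]                      # move to source frame
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)   # perspective divide
    # Normalize to [-1, 1] for grid_sample and warp the source image.
    gx = src_pix[:, 0] / (W - 1) * 2 - 1
    gy = src_pix[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    return (warped - tgt_img).abs().mean()
```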
• 51. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
In order to train without 3D bounding box labels, three losses are used: the segmentation loss Lseg, the chamfer distance Lcd, and the photometric reconstruction loss Lrec. The first two are defined on single images, while the photometric reconstruction loss relies on temporal photo-consistency across three consecutive frames (a minimal sketch of the chamfer term is given below).
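A minimal sketch of a symmetric chamfer term between points sampled from the predicted mesh and the point cloud back-projected from the input depth map inside the segmentation mask; this is the generic chamfer distance, not necessarily the paper's exact variant.

```python
import torch

def chamfer_distance(pred_pts: torch.Tensor, depth_pts: torch.Tensor) -> torch.Tensor:
    """Symmetric chamfer distance Lcd between a predicted point set (N, 3) and the
    back-projected depth point set (M, 3)."""
    d = torch.cdist(pred_pts, depth_pts)          # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```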
• 52. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
Qualitative comparison of MonoGRNet (1st row), Mono3D (2nd row), and this method (3rd row) with depth maps from BTS. GT bounding boxes for cars are shown in red, predicted bounding boxes in green, together with the back-projected point cloud. In comparison to Mono3D, the prediction accuracy is increased, especially for vehicles farther away. The performance of MonoGRNet and this model is comparable.
  • 53. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels