3D INTERPRETATION FROM
SINGLE 2D IMAGE FOR
AUTONOMOUS DRIVING
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
OUTLINE
 Single View Metrology
 Joint SFM and Detection Cues for Monocular 3D
Localization in Road Scenes
 Joint 3D Estimation of Objects and Scene Layout
 CubeSLAM: Monocular 3D Object Detection and
SLAM without Prior Models
 Monocular Visual Scene Understanding:
Understanding Multi-Object Traffic Scenes
 Improved Object Detection and Pose Using Part-
Based Models
 3D Object Detection and Viewpoint Estimation
with a Deformable 3D Cuboid Model
 Are Cars Just 3D Boxes? – Jointly Estimating the
3D Shape of Multiple Objects
 Classification and Pose Estimation of Vehicles in
Videos by 3D Modeling within Discrete-
Continuous Optimization
 A mixed classification-regression framework for
3D pose estimation from 2D images
 BoxCars: Improving Fine-Grained Recognition of
Vehicles using 3D BBoxes in Traffic Surveillance
 Vehicle Detection and Pose Estimation for
Autonomous Driving (Thesis)
 Deep Cuboid Detection: Beyond 2D BBoxes
 3D Bounding Box Estimation Using Deep
Learning and Geometry
 Deep MANTA: A Coarse-to-fine Many-Task
Network for joint 2D and 3D vehicle analysis from
monocular image
 3D Object Proposals for Accurate Object Class
Detection
 Monocular 3D Object Detection for Autonomous
Driving
 SSD-6D: Making RGB-Based 3D Detection and
6D Pose Estimation Great Again
 Real-Time Seamless Single Shot 6D Object Pose
Prediction
 Implicit 3D Orientation Learning for 6D Object
Detection from RGB Images
SINGLE VIEW METROLOGY
Basic geometry: The plane’s vanishing line l
is the intersection of the image plane with a
plane parallel to the reference plane and
passing through the camera centre. The
vanishing point v is the intersection of the
image plane with a line parallel to the
reference direction through the camera
centre.
Cross ratio: The point b on the plane π
corresponds to the point t on the plane π’ .
They are aligned with the vanishing point v.
The four points v, t, b and the intersection i of the
line joining them with the vanishing line define
a cross-ratio, which determines a ratio of
distances between planes in the world.
SINGLE VIEW METROLOGY
Homology mapping between parallel planes: a point X
on plane π is mapped to the point X’ on π’ by parallel
projection. In the image, the mapping between the images of
the two planes is a homology, with vertex v and axis l.
The correspondence b -> t fixes the
remaining DoF of the homology via the cross-ratio
of the 4 points: v, i, t and b.
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
 This localization framework jointly uses info. from complementary
modalities such as SFM and object detection to achieve high
localization accuracy in both near and far fields.
 Make use of raw detection scores to allow 3D Bboxes to adapt to
better quality 3D cues.
 To extract SFM cues, take advantage of dense tracking over
sparse mechanisms in autonomous driving scenarios.
 The formulation for 3D localization can be regarded as an extension
of sparse BA to incorporate object detection cues.
3D object localization framework
that combines cues from SFM and
object detection. Red denotes 2D
bounding boxes, the horizontal line
is the horizon from estimated
ground plane, green denotes
estimated 3D localization for far
and near objects, with distances in
magenta.
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Overview of the 3d object localization system combining SFM
cues (green) with object detection cues (brown).
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Coordinate system definitions for
3D object localization. The SFM
ground plane is (n⊤, h)⊤.
System overview for obtaining SFM cues on
objects, depicted in green.
Given the camera intrinsic calibration
matrix K, the bottom of a 2D Bbox, b =
(x, y, 1)⊤, can be back-projected to
3D through the ground plane {h, n}:
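A minimal sketch of this back-projection (my own illustration, assuming the camera center is the origin and the plane convention n·X + h = 0; the paper's sign convention may differ):

```python
import numpy as np

def backproject_to_ground(b_px, K, n, h):
    """Back-project the bottom point b = (x, y, 1) of a 2D Bbox onto the
    ground plane {h, n}. Assumes the camera center is the origin and the
    plane satisfies n.X + h = 0 (sign convention may differ from the paper)."""
    b = np.array([b_px[0], b_px[1], 1.0])
    ray = np.linalg.inv(K) @ b           # viewing ray direction K^-1 b
    lam = -h / (n @ ray)                 # depth along the ray at the plane
    return lam * ray                     # 3D point on the ground plane

# Illustrative intrinsics, ground normal (camera y pointing down), height 1.5 m
K = np.array([[720.0, 0, 620.0], [0, 720.0, 180.0], [0, 0, 1.0]])
n = np.array([0.0, -1.0, 0.0])
print(backproject_to_ground((640.0, 250.0), K, n, 1.5))
```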
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Output of this localization
system. The bottom left panel
shows the monocular SFM
camera trajectory. The top
panel shows input 2D bounding
boxes in red, horizon from
estimated ground plane and the
estimated 3D bounding boxes
in green with distances in
magenta. The bottom right
panel shows the top view of the
ground truth object localization
from laser scanner in red,
compared to this 3D object
localization in blue.
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
 A generative model is able to reason jointly about the 3D scene
layout as well as the 3D location and orientation of objects in the
scene.
 To infer the scene topology, geometry and traffic activities from a
video sequence from a single camera mounted on a moving car.
 It takes advantage of dynamic info. in the form of vehicle tracklets
and static info. from semantic labels and geometry (i.e., vanishing
points).
Monocular 3D Urban
Scene Understanding.
(Left) Image cues.
(Right) Estimated
layout: Detections
belonging to a tracklet
are depicted with the
same color, traffic
activities are depicted
with red lines.
Vehicle tracklets
Vanishing points
Scene labels
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
 Assume that the road surface is flat, and model the bird’s eye perspective
as the y = 0 plane of the standard camera coordinate system;
 Detect vehicles in each frame independently using a semi-supervised
version of the part-based detector in order to obtain orientation estimates;
 2D tracklets are estimated using ’tracking-by-detection’: first adjacent frames
are linked, and then short tracklets are associated to create longer ones via
the Hungarian method (see the sketch after this list).
 3D vehicle tracklets are obtained by projecting the 2D tracklets into bird’s
eye perspective, employing error-propagation to obtain cov. estimates.
 Model lanes with splines, place parking spots at equidistant places along
street boundaries.
 The model infers whether the cars participate in traffic or are parked in
order to get more accurate layout estimations.
 Latent variables are employed to associate each detected vehicle with
positions in one of these lanes or parking spaces.
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
Graphical model and road model with lanes represented as B-splines.
Transform the 2D tracklets into 3D tracklets: project the image coordinates
into bird’s eye perspective by backprojecting objects into 3D using several
complementary cues. Towards this goal, use the 2D bounding box footpoint
in combination with the estimated road plane. Two types of dominant
vanishing points: forward facing street and crossing street. Three semantic
classes, i.e., road, sky and background.
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
(Left) Tracklets from all frames superimposed. (Middle) Inference result
with θ known and (Right) θ unknown. The inferred intersection layout in
gray. Ground truth labels in blue. Detected activities in red.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
 A method for single image 3D cuboid object detection and multi-view
object SLAM without prior object model, and the two aspects can
benefit each other.
 For 3D detection, generate cuboid proposals from 2D Bboxes and
vanishing points sampling.
 The proposals are further scored and selected to align with image edges.
 Multi-view bundle adjustment with measurement functions is proposed
to jointly optimize camera poses, objects and points, utilizing single
view detection results.
 Objects can provide more geometric constraints and scale consistency
compared to points.
 Objects are utilized in two ways: they provide depth initialization for points
that are difficult to triangulate, and they provide geometric constraints in BA.
 The estimated camera poses from SLAM can improve the single-view
object detection.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Monocular 3D object detection and mapping without prior object models.
Mesh model is just for visualization and not used for detection. (a) ICL NUIM
data with various objects, whose position, orientation and dimension are
optimized by SLAM. (b) KITTI 07. With object constraints, monocular SLAM
can build a consistent map and correct scale drift, without loop closure and
constant camera height assumption.
(a) (b)
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
 A 3D cuboid by 9 DoF parameters: 3 DoF position, 3 DoF rotation
and 3 DoF dimension.
 The cuboid coordinate frame is built at the cuboid center, aligned
with the main axes.
 The camera intrinsic calibration K is also known.
 The cuboid’s projected corners should fit tightly within the 2D bounding box;
this gives 4 constraints corresponding to the 4 sides of the rectangle,
which cannot fully constrain all 9 parameters.
 A 3D cuboid has 3 orthogonal axes and can form 3 VPs after
perspective projection, depending on the object rotation R and camera
calibration K.
 After getting the 8 cuboid corners in 2D, back-project to 3D space to
compute the 3D position and dimensions, which are determined up to a scale factor.
 The scale can be reasoned from camera height to ground, prior
object size and so on.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Proposal generation from the 2D object box. Cuboids are divided
into three categories depending on the number of observable
faces. If one corner is estimated, the other seven corners can
also be computed from the vanishing points (VPs). For example in
(a), if corner 1 is sampled, then corners 2 and 3 can be
determined through the intersection of VP rays with the rectangle edges,
followed by corner 4 and the other bottom corners.
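A small sketch of the underlying 2D geometry (my own illustration): one corner obtained as the intersection of the ray from a vanishing point through a sampled corner with a bounding-box edge, using homogeneous point/line cross products. All coordinates are made up.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, returned as a Euclidean point."""
    x = np.cross(l1, l2)
    return x[:2] / x[2]

vp1 = (520.0, 80.0)                   # a vanishing point (made-up coordinates)
corner1 = (300.0, 120.0)              # a sampled/known cuboid corner
edge = line_through((250.0, 100.0), (450.0, 100.0))   # one edge of the 2D box
corner2 = intersect(line_through(vp1, corner1), edge)  # ray-edge intersection
print(corner2)                        # [410. 100.]
```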
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Denote the image as I and cuboid proposal as x, then the cost
function is defined as:
Cuboid proposal scoring. (Left) Edges to align and score
the proposals. (Right) Cuboid proposals generated from
the same 2D cyan bounding box. The top left is the best
and bottom right is the worst after scoring.
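The exact cost is defined in the paper; below is a simplified, hedged stand-in for the edge-alignment idea, scoring a proposal's projected 2D edges by their distance to Canny edges via a distance transform (the paper additionally uses angle-alignment and shape terms).

```python
import cv2
import numpy as np

def edge_alignment_cost(gray, proposal_edges):
    """Average pixel distance from the proposal's projected edge segments to
    the nearest image edge (lower is better). gray: uint8 grayscale image;
    proposal_edges: list of ((x0, y0), (x1, y1)) segments. A simplified
    stand-in for the paper's scoring function."""
    edges = cv2.Canny(gray, 80, 200)
    dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)  # dist to edges
    costs = []
    for (x0, y0), (x1, y1) in proposal_edges:
        n = max(int(np.hypot(x1 - x0, y1 - y0)), 2)
        xs = np.clip(np.linspace(x0, x1, n).astype(int), 0, dist.shape[1] - 1)
        ys = np.clip(np.linspace(y0, y1, n).astype(int), 0, dist.shape[0] - 1)
        costs.append(dist[ys, xs].mean())          # sample along the segment
    return float(np.mean(costs))
```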
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Camera poses C = {ci}, 3D landmark objects O = {oj}, points P = {pk}.
BA is formulated as a nonlinear least-squares (NLS) problem:
Camera-object 3D measurement: transform the landmark object into the camera
frame, then compare it with the measurement.
Camera-object 2D measurement: project the landmark cuboid onto the image plane
to get a 2D Bbox, and compare it with the detected 2D Bbox.
Object-point measurement: first transform the point into the cuboid frame,
then compare it with the cuboid dimensions.
Point-camera measurement: the standard 3D point re-projection error
in feature-based SLAM.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
(a) The object SLAM pipeline. Single view object detection provides
cuboid landmark and depth initialization for SLAM while SLAM can
estimate camera pose for more accurate object detection. (b)
Measurement errors between cameras, objects and points during BA.
(a) (b)
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
 Object association based on point matching:
 Dynamic points are detected through descriptor matching and
epipolar-line checking;
 Points are then associated to objects if they are observed inside the
2D object bounding box enough times and lie close to the cuboid
centroid in 3D space;
 Objects are matched by finding the candidate that shares the largest
number of map points, exceeding a threshold (10 for example);
 This works well for wide-baseline matching, repetitive objects, occlusions,
and dynamic scenarios.
Green points are normal map points,
and other color points are associated
to objects with the same color. The
front cyan moving car is not added as
SLAM landmark as no feature point is
associated with it. Points in object
overlapping areas are not associated
with any object.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
 A probabilistic 3D scene model that integrates SoA multiclass object
detection, object tracking and scene labeling together with geometric
3D reasoning.
 The model is able to represent complex object interactions such as
inter-object occlusion, physical exclusion between objects, and
geometric context.
 Inference in this model allows jointly recovering the 3D scene context
and performing 3D multi-object tracking from a mobile observer, for
objects of multiple categories, using only monocular video as input.
 This system performs explicit occlusion reasoning and is capable of
tracking objects that are partially occluded for extended periods of
time, or objects that have never been observed to their full extent.
 A joint scene tracklet model for the evidence collected over multiple
frames substantially improves performance.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
Overview of this system.
For each input frame, run
an object detector and
extract semantic scene
labels. Object
hypotheses are fused to
short-term tracklets and
put into a strong 3D
scene model with explicit
occlusion reasoning.
MCMC sampling makes
inference in the Bayesian
scene model tractable,
while HMM scene
tracking ensures long-
term associations.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
The multi-frame 3D inference and explicit
occlusion reasoning for onboard vehicle and
pedestrian tracking with overlaid horizon
estimate for different public SoA datasets.
Notation: the 3D scene state X in the world coordinate system;
the rotation angles of a vehicle-mounted camera.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
Employing the theorem of
intersecting lines to derive the
distance to an object along the
ground plane in the viewing direction
(see the sketch below).
Objects are approximated by their Bboxes and
projected onto the image. By leveraging the
depth order obtained from the 3D scene model,
the system is able to estimate occluded object regions.
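As referenced above, a minimal sketch of the intersecting-lines (similar triangles) distance estimate, assuming an approximately level camera with known height and focal length (my illustration, not the paper's exact derivation):

```python
def distance_along_ground(v_foot, v_horizon, focal_px, cam_height_m):
    """Distance to an object along the ground plane in the viewing direction:
    d = f * h_cam / (v_foot - v_horizon), where v_foot is the image row of
    the object footpoint and v_horizon the row of the horizon."""
    dv = v_foot - v_horizon
    if dv <= 0:
        raise ValueError("footpoint must lie below the horizon")
    return focal_px * cam_height_m / dv

# Footpoint 60 px below the horizon, f = 720 px, camera 1.2 m above ground
print(distance_along_ground(460.0, 400.0, 720.0, 1.2))   # 14.4 m
```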
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
 This work builds on part-based models by using accurate geometric models
both in the learning phase and at detection.
 The object model is defined as a number of roughly planar aspects
models together with a set of typical object poses;
 In the learning phase, manual annotations are used to reduce
perspective distortion before learning the part-based models.
 Training on rectified images using a deformable part-
based model (DPM) leads to models which are more specific.
 At the same time, a set of representative object poses is learnt.
 Transform the image according to each of the learnt typical poses.
 These are used at detection to remove perspective distortion.
 Scores from the aspect detectors are generated by running each
aspect model on each of the transformed images.
 Detections from the different aspect models are combined and
thresholded to produce the final object detection.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Annotation of each visible aspect of the object in training images
The two aspect models for the bus category. The upper row shows
the frontal model and the lower row shows the side model.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
A training example is considered similar to a pose P if the average angular
deviations for the front and side (the measure of angular deviation for
pose similarity) are below a predefined threshold.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Overview of the detection pipeline. The input image is transformed according
to each of the representative poses. This produces multiple images that are
individually run through the aspect detectors, creating a set of score pyramids
containing the detector scores at different scales. These are merged into one
pyramid per aspect, in the original image coordinate system. Finally, the front
and side scores are combined and non-maximum suppression is performed.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Left: how to estimate the side location (brown dot) given the frontal location
(blue dot) and the size of the skewed frontal bounding box. Right: search in a
small neighborhood (blue circle) of the expected location for each level.
The score combination can be expressed as
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Detected bounding boxes are shown in green and their layout in red.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
 Given a monocular image, localize the objects in 3D by enclosing them
with tight oriented 3D bounding boxes.
 An approach extends the deformable part-based model to reason in 3D.
 It represents an object class as a deformable 3D cuboid composed of
faces and parts, which are both allowed to deform with respect to their
anchors on the 3D box.
 Model the appearance of each face in fronto-parallel coordinates, thus
effectively factoring out the appearance variation induced by viewpoint.
 The model reasons about face visibility patterns called aspects.
 Train the cuboid model jointly and share weights across all aspects to
attain efficiency.
 Inference then entails sliding and rotating the box in 3D and scoring
object hypotheses.
 While the search space is discretized for inference, the variables are
continuous in the model.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
The deformable 3D cuboid model. Viewpoint angle θ.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
Aspects, together with the range of θ that they cover, for (left) cars and (right) beds.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
where p = (p1, · · · , p6) and V(i, a) is a binary variable encoding whether
face i is visible under aspect a. Note that a = a(θ, s) can be deterministically
computed from the rotation angle θ and the position of the stitching point
s (which we assume to always be visible), which in turn determines the face
visibility V.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
Learned models for (left) bed, (right) car.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
We use ref to index the first visible face in the aspect model, and
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
Inference in this model can be done by computing
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
KITTI: examples of car detections. (top) Ground truth,
(bottom) The 3D detections, augmented with best fitting
CAD models to visualize inferred 3D box orientations.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
 Scene understanding from the perspective of 3D shape modeling: a 3D scene
representation that reasons jointly about the 3D shape of multiple objects.
 It allows expressing 3D geometry and occlusion at the fine detail level of
individual vertices of 3D wireframe models, and makes it possible to treat
dependencies between objects, such as occlusion reasoning, in a
deterministic way.
Left: Coarse 3D object
bounding boxes derived from
2D bounding box detections.
Right: fine-grained 3D shape
model fits improve 3D
localization (bird’s eye views).
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
Scene particles (coarse 3D
geometry and fine-grained
shape). Deterministic occlusion
mask computation by ray
casting and intersection (blue).
A 3D scene model,
consisting of a common
ground plane, a set of 3D
deformable objects, and
an explicit occlusion mask
for each object.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
Object likelihood.
Scene-level likelihood.
An inference scheme that proceeds in
stages, lifting an initial 2D guess
(Initialization) about object locations to a
coarse 3D model (Coarse 3D geometry),
and refining that coarse model into a final
collection of consistent 3D shapes (Final
scene-level inference).
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
(a) Part localization accuracy and 2D pre-detection. (b-c) Example detections
and corresponding 3D reconstructions.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
COARSE+GP (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b)
COARSE+GP based on (a), (c) bird’s eye view of (b). (e) FG+SO+DO+GP shape
model fits (blue: estimated occlusion masks), (f) bird’s eye view of (e). Estimates in
red, ground truth in green.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
FG+SO+DO (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b)
FG+SO+DO based on (a), (c) bird’s eye view of (b). (d) FG+SO+DO+GP shape
model fits (blue: estimated occlusion masks), (e) bird’s eye view of (d). Estimates in
red, ground truth in green.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
 Rank possible poses and types for each frame and exploit temporal
coherence between consecutive frames for refinement.
 Cast the estimation of a vehicle’s pose and type as a solution of a
continuous optimization problem over space and time.
 Obtain initial start points by a discrete temporal optimization
reaching a global optimum on a ranked discrete set of possible types
and poses.
 To guarantee the effectiveness of the discrete-continuous optimization,
reduce the search space of potential 3D model types and poses in
each frame for the discrete optimizer.
 This avoids the common, expensive evaluation of all possible discretized
hypotheses.
 The key idea towards efficiency lies in a combination of detecting the
vehicle, rendering the 3D models, matching projected edges to input
images, and using a tree-structured MRF to get fast and globally
optimal inference and to force the vehicle to follow a feasible motion
model in the initial phase.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
Improve pose estimation
over [Toshev et al., 2009] by
processing in continuous
space (columns 1, 2),
reduce wrong classifications
due to incorrect scales
(column 3) and improve
pose estimation over
[Leotta et al., 2011] by using
existing 3D models.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
(a) Framework application flow. (b) Vehicle, described by
orientation α and centroid on the ground plane C = (x, y, 0).
Fast Directional Chamfer Matching (FDCM)
FDCM maps the edge pixels in U and E to an
orientation augmented space. The alignment cost
between the two edge maps is then given by
To update the matching score by setting
Given the shifted but projectively wrong model
projection A_p^l and the projectively correct model
projection area B_q^l, the similarity score for a
pose is calculated by combining the output of
FDCM and the area overlap:
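A hedged sketch of such a combination (the weighting and the exp() mapping are illustrative, not the paper's formula): an IoU-style area overlap between the two projection masks blended with the FDCM edge-alignment cost.

```python
import numpy as np

def similarity_score(fdcm_cost, mask_a, mask_b, alpha=0.5):
    """Combine an FDCM alignment cost with the area overlap between two model
    projection masks. alpha and exp(-cost) are illustrative choices."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    overlap = inter / max(union, 1)          # IoU-style area overlap
    edge_term = np.exp(-fdcm_cost)           # lower chamfer cost -> higher score
    return alpha * edge_term + (1.0 - alpha) * overlap
```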
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
(a) Temporal inference for ranked projections.
(b) Ackermann steering principle where φ = θ/2.
(c) Corresponding points between model’s
projected edges and edge image.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
Pose estimation using FDCM only (top row), combining FDCM and MRF
(middle row), combining FDCM, MRF and continuous optimization (bottom row).
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
 The existing 3D pose estimation methods using deep networks can be
divided into two groups:
 (i) predict 2D keypoints from images and recover 3D pose from keypoints;
 (ii) directly predict 3D pose from an image.
 A mixed classification-regression framework that uses a classification
network to produce a discrete multimodal pose estimate and a
regression network to produce a continuous refinement of the estimate
(see the sketch below).
 The framework can accommodate different architectures and loss
functions, leading to multiple classification-regression models.
A high level overview of our problem statement and proposed network architecture
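As referenced above, a minimal decoding sketch for a Bin & Delta style output (array shapes and names are illustrative, not the paper's interface): the classification head picks a coarse pose bin and the regression head refines it with a per-bin delta.

```python
import numpy as np

def decode_bin_delta(bin_probs, deltas, bin_centers):
    """Pick the most likely discrete pose bin and add its regressed offset."""
    k = int(np.argmax(bin_probs))          # discrete, multimodal estimate
    return bin_centers[k] + deltas[k]      # continuous refinement

bin_centers = np.deg2rad(np.arange(0, 360, 30))       # 12 coarse pose bins
bin_probs = np.eye(12)[4]                              # classifier picks bin 4
deltas = np.zeros(12); deltas[4] = np.deg2rad(7.0)     # regressed refinement
print(np.rad2deg(decode_bin_delta(bin_probs, deltas, bin_centers)))   # 127.0
```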
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
the Bin & Delta model
Simple/Naive Bin & Delta
Geodesic Bin & Delta
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
One delta network per pose-bin
Best (top row) and Worst (bottom row) images for Category: Bus
Best (top row) and Worst (bottom row) images for Category: Car
The previous optimization problems are as follows:
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
 Not limited to the frontal/rear viewpoint: vehicles may be seen from
any viewpoint, based on 3D Bboxes built around the vehicles.
 The Bbox can be auto-constructed from traffic surveillance data.
 For scenarios where the precise construction cannot be used,
a method for estimating the 3D bounding box is proposed.
 The 3D Bbox is used to normalize the image viewpoint by “unpacking”
the image into a plane.
 During CNN training, the color of the image is randomly altered and a
rectangle of random noise is added at a random position in the image.
 A fine-grained vehicle dataset BoxCars116k, with 116k images of
vehicles from various viewpoints taken by many surveillance cameras.
Example of automatically obtained
3D bounding box used for fine-
grained vehicle classification. Top
left: vehicle with 2D bounding box
annotation, top right: estimated
contour, bottom left: estimated
directions to vanishing points,
bottom right: 3D bounding box
automatically obtained from
surveillance video (green) and our
estimated 3D bounding box (red).
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
3D bounding box and its unpacked version.
Examples of data normalization and
auxiliary data fed to the nets. Left to
right: vehicle with 2D bounding box,
computed 3D bounding box, vectors
encoding viewpoints on the vehicle
(View), unpacked image of the
vehicle (Unpack), and rasterized 3D
bounding box fed to the net
(Rast).
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
Estimation of 3D Bbox. Left to right: image with vehicle 2D Bbox, output of contour
object detector, constructed contour, estimated directions towards vanishing points,
ground truth (green) and estimated (red) 3D Bbox.
The CNN used for estimating
directions towards vanishing
points. The vehicle image is
fed to a ResNet50 with 3
separate outputs, which
predict the directions of the
vanishing points as probabilities
in a quantized angle space
(60 bins from −90° to 90°).
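A small sketch of that quantization (my own helper names; the bin layout follows the 60 bins over [−90°, 90°] mentioned above):

```python
import numpy as np

N_BINS, LO, HI = 60, -90.0, 90.0          # 60 bins of 3 degrees each

def angle_to_bin(angle_deg):
    """Quantize a direction angle into one of the classification bins."""
    idx = int((angle_deg - LO) / (HI - LO) * N_BINS)
    return int(np.clip(idx, 0, N_BINS - 1))

def probs_to_angle(probs):
    """Decode predicted bin probabilities back to an angle (bin center)."""
    width = (HI - LO) / N_BINS
    return LO + (np.argmax(probs) + 0.5) * width

print(angle_to_bin(-90.0), angle_to_bin(0.0), angle_to_bin(89.9))   # 0 30 59
```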
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 An FCN for 2D and 3D bounding box detection of cars from monocular
images, intended for autonomous driving applications.
 The introduced network is E2E trainable and detects objects at multiple
scales in a single pass.
 A 3D bounding box representation, which is independent of the image
projection matrix (camera used to take the images).
 The detector may be trained on several different datasets and can also
detect 3D Bboxes on datasets completely different from those it was trained on.
3D bounding boxes
(left) and their top
view (right) detected
by this method. The
front sides of 3D
bounding boxes are
depicted in green,
the rear sides in red.
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 2D Bounding Box - BBTXT
 The 2D Bboxes are represented by the coord.s of their top-left (xmin,
ymin) and bottom-right (xmax, ymax) corners.
 3D Bounding Box - BB3TXT
 A Bbox in 3D has 9 DOF - 3 for position, 3 rotations, and 3 dimensions.
 Coordinates of the projected rear-bottom-left, front-bottom-left, and
front-bottom-right corners, and the y-coordinate of the front-top-left corner.
3D bounding box corners.
Info. stored about 3D Bboxes: each Bbox is defined by 3 lines - front-bottom,
left-bottom, front-left.
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 Together with the requirement that all Bboxes are pinned to the ground plane,
this provides a sufficient amount of info. to reconstruct the 3D world positions
of the 3D Bboxes.
 Inverse Projection
 Compute the inverse of KR and the camera center from
 Reconstruction of the Bottom Side
 use the ground plane equation ax + by + cz + d = 0 (normal n = [a, b, c]T ) and the
inverse projected rays to determine the position of the bottom side of the 3D bounding
box in the world.
 obtain a parallelogram in the ground plane instead of a rectangle from the
re-projection of the 3 points.
Rectification of a parallelogram
(solid) to a rectangle (dashed).
Projection of a 3D bounding box
to the ground plane.
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 Reconstruction of the Top Side
 use the direction of the bottom-left line as the normal vector of the frontal
plane n_F = [a_F, b_F, c_F] and place the front-bottom-left point in the frontal
plane to calculate the front plane as a_F x + b_F y + c_F z + d_F = 0;
 Finding the intersection of the frontal plane and the ray l_ftl = C + (KR)^-1 x_ftl
gives the position of the vertex X_ftl, which determines the height of the
bounding box.
 Ground Plane Extraction
The RANSAC algorithm for ground plane estimation.
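A generic RANSAC plane-fitting sketch for reference (the standard algorithm; the thesis' sampling strategy, thresholds and refinement steps may differ):

```python
import numpy as np

def ransac_ground_plane(points, iters=200, thresh=0.1, seed=0):
    """Fit a plane n.X + d = 0 to an (N, 3) point array with RANSAC and
    return the plane plus its inlier mask."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                               # degenerate sample
        n = n / norm
        d = -n @ p0
        dist = np.abs(points @ n + d)              # point-to-plane distances
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers
```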
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
 A Deep Cuboid Detector takes a consumer-quality RGB image of a
cluttered scene and localizes all 3D cuboids (box-like objects).
 An E2E deep learning system to detect cuboids across many semantic
categories (e.g., ovens, shipping boxes, and furniture).
 Localize cuboids with a 2D Bbox, and localize the cuboid’s corners,
effectively producing a 3D interpretation of box-like objects.
 Refine keypoints by pooling conv. features iteratively, improving the
baseline method significantly.
 This deep learning cuboid detector is trained in an end-to-end fashion
and is suitable for real-time applications.
2D Object detection vs.
3D Cuboid detection.
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
Deep Cuboid Detection Pipeline. 1) Find RoIs in the image where a cuboid
might be present and train an RPN to output such regions. 2) Features for each
RoI are pooled from a conv. feature map. 3) These pooled features are passed
through two fully connected layers, just like Faster R-CNN. 4) Output normalized
offsets of the vertices from the center of the region. 5) Refine predictions by
performing iterative feature pooling.
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
 The loss function used in the RPN consists of L_anchor-cls, the log loss
over two classes (cuboid vs. not cuboid), and L_anchor-reg, the Smooth
L1 loss of the Bbox regression values for each anchor box;
 The loss function for the R-CNN is made up of L_ROI-cls, the log loss
over two classes (cuboid vs. not cuboid), L_ROI-reg, the Smooth L1
loss of the Bbox regression values for the RoI, and L_ROI-corner, the
Smooth L1 loss over the RoI’s predicted vertex locations, also
referred to as the corner regression loss.
 The complete loss function is a weighted sum of the above-mentioned
losses.
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
Vertex Refinement via Iterative Feature Pooling. To refine cuboid detection
regions by re-pooling features from conv5 using the predicted bounding boxes.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
 3D object detection and pose estimation from a single image.
 First regresses relatively stable 3D object properties using a deep
CNN and then combines these estimates with geometric constraints
provided by a 2D object Bbox to produce a complete 3D Bbox.
 The first network output estimates the 3D object orientation using a
hybrid discrete-continuous loss, which significantly outperforms the L2
loss.
 The second output regresses the 3D object dimensions, which have relatively
little variance and can be predicted for many object types.
 These estimates, combined with geometric constraints on translation
imposed by the 2D b box, enable to recover a stable and accurate 3D
object pose.
 The perspective projection of a 3D Bbox should fit tightly within its 2D det. window.
 The 3D Bbox is described by its center T, dimensions D and
orientation, by the azimuth, elevation and roll angles.
2D box side parameters
Correspondence btw the
3D bbox and 2D bbox:
Each figure shows a 3D
bbox that surrounds an
object.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
 CNN Regression of 3D Box Parameters:
 Combine the ray direction at the crop center with the
estimated local orientation to compute the global
orientation of the object.
 Faster R-CNN, SSD: Divide the space of the bounding
boxes into several discrete modes “anchor boxes” and
then estimate the continuous offsets applied to each
anchor box.
 Discretize the orientation angle and divide it into
overlapping bins. For each bin, the network
estimates both a confidence probability that the output
angle lies inside the ith bin and the residual rotation
correction applied to the orientation of the center ray of
that bin to obtain the output angle (see the decoding sketch below).
 The residual rotation is represented by two numbers,
for the sine and the cosine of the angle.
 Total loss for the MultiBin orientation:
 The loss for dimension estimation:
Left: Car dimensions. Right:
Illustration of the local and global
orientation of a car. The local
orientation is computed w.r.t. the ray
through the center of the crop.
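As referenced above, a hedged decoding sketch for MultiBin outputs (shapes and sign conventions are illustrative): recover the residual from the predicted (sin, cos) pair of the most confident bin, add it to the bin center, then add the ray angle to convert the local orientation into a global one.

```python
import numpy as np

def decode_multibin(confidences, sin_cos, bin_centers, theta_ray):
    """confidences: (n_bins,), sin_cos: (n_bins, 2), bin_centers: (n_bins,),
    theta_ray: angle of the ray through the crop center. Returns global yaw."""
    i = int(np.argmax(confidences))
    residual = np.arctan2(sin_cos[i, 0], sin_cos[i, 1])   # delta within bin i
    local = bin_centers[i] + residual                     # local orientation
    return local + theta_ray                              # global orientation
```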
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
The architecture for MultiBin estimation of orientation and dimensions,
with 3 branches: the left branch estimates the dimensions of the object
of interest; the other two compute the confidence for each bin and the
cos(∆θ) and sin(∆θ) of each bin.
Qualitative illustration of 2D detection boxes and
estimated 3D projections.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
 Deep MANTA (Many-Tasks), for vehicle analysis from a given image.
 A robust CNN for simultaneous vehicle detection, part localization,
visibility characterization and 3D dimension estimation.
 A coarse-to-fine object proposal that boosts the vehicle detection.
 Deep MANTA localizes vehicle parts even if they are not visible.
 At inference time, the network’s outputs are used by a real-time pose
estimation step for fine orientation estimation and 3D vehicle localization.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
System outputs. Top: 2D vehicle
bboxes, vehicle part localization
and part visibility. Bottom: 3D
vehicle bbox localization and 3D
vehicle part localization. The
camera in blue.
2D/3D model
2D vehicle b box
3D b box
2D part coord.
part visibility vector
3D part coord.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Example of one 2D/3D vehicle
model. (a) the bounding box B, (b)
2D part coordinates S and part
visibility V.
Detection loss.
Visibility loss.
Template similarity loss.
Part loss.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Overview of the Deep
MANTA approach. The
entire input image is
forwarded inside the Deep
MANTA network. Conv.
layers share the same
weights. Moreover, these
3 conv. blocks correspond
to a split of an existing CNN
architecture.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Semi-automatic annotation process. (a) weak annotations on a real image (3D
b box). (b) best corresponding 3D models in green. (c) projection of these 3D
models in the image. (d) corresponding mesh of visibility. (e) Final annotations.
3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
 Exploit stereo imagery to place proposals in the form of 3D Bboxes.
 Proposals are obtained by minimizing a function encoding object size priors,
the ground plane, and depth features about free space, point cloud density and
distance to the ground.
Formulate the proposal generation problem as inference in an MRF in which the
proposal y should enclose a high-density region in the point cloud (a simple
density score is sketched after the list of potentials below).
Point cloud density:
Free space:
Height prior:
Height contrast:
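As referenced above, a toy stand-in for the point-cloud-density potential (an axis-aligned box and uniform random points; the paper's potentials are defined over oriented proposals and also include free space, height prior and height contrast):

```python
import numpy as np

def point_density_score(points, box_min, box_max):
    """Fraction of 3D points falling inside an axis-aligned box proposal."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return float(inside.mean())

pts = np.random.default_rng(0).uniform(-5, 5, size=(1000, 3))
print(point_density_score(pts, np.array([-1.0, -1.0, -1.0]),
                          np.array([1.0, 1.0, 1.0])))
```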
3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
 Score Bbox proposals using a CNN built on Fast R-CNN;
 It shares conv. features across all proposals and uses an ROI pooling layer to
compute proposal-specific features;
 It adds a context branch after the last conv. layer, and an orientation regression
loss to jointly learn object location and orientation;
 Features output from the original/context branches are concatenated and fed to the
prediction layers.
 The context regions are obtained by enlarging candidate boxes by a factor of 1.5.
 Smooth L1 loss is used for orientation regression.
 Parameters of the context branch are initialized by copying weights from the original branch.
 OxfordNet trained on ImageNet is used to initialize the weights of the conv. layers and the
branch for candidate boxes; the network is then fine-tuned E2E on the KITTI training set.
3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
 Generate a set of candidate class-specific object proposals, which are
then run through a standard CNN pipeline to obtain object detections.
 An energy minimization approach that places object candidates in 3D
using the fact that objects should be on the ground-plane.
 Score each candidate box projected to the image plane via several
intuitive potentials encoding semantic segmentation, contextual
information, size and location priors and typical object shape.
MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
CNN architecture used to score proposals for
object detection and orientation estimation.
The scoring function is obtained by
combining semantic cues
(both class- and instance-
level segmentation), location
priors, context and shape:
MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
 A method for detecting 3D model instances and estimating their
6D poses from RGB data in a single shot.
 To this end, extend the popular SSD paradigm to cover the full 6D
pose space and train on synthetic model data only.
 It competes or surpasses current state-of-the-art methods that
leverage RGB-D data on multiple challenging datasets.
 It produces results at around 10Hz, which is many times faster
than the related methods.
Discrete 6D pose
space with each
point representing
a classifiable
viewpoint.
The object distance can be
inferred from the projective ratio.
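A hedged sketch of that idea (a simplified reading: a viewpoint rendered at a known canonical distance projects to a box of known size, so the relative size of the detected box gives the depth; the paper builds full 6D hypotheses from this):

```python
import numpy as np

def distance_from_projective_ratio(bbox_detected, bbox_canonical, z_canonical):
    """Infer object distance from the ratio of 2D box diagonals; boxes are
    (xmin, ymin, xmax, ymax)."""
    def diag(b):
        return np.hypot(b[2] - b[0], b[3] - b[1])
    return z_canonical * diag(bbox_canonical) / diag(bbox_detected)

# A detection twice as large as the canonical render is half as far away.
print(distance_from_projective_ratio((0, 0, 200, 100), (0, 0, 100, 50), 1.0))
```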
SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
After predicting 2D detections (a), build 6D hypotheses and run pose
refinement and a final verification. While the unrefined poses (b) are rather
approximate, contour-based refinement (c) produces already visually
acceptable results. Occlusion-aware projective ICP with cloud data (d) leads
to a very accurate alignment.
SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
Schematic overview of the SSD-style network prediction
C denotes the
number of object
classes, V the
number of
viewpoints and R
the number of in-
plane rotation
classes. The other
4 values are
utilized to refine
the corners of the
discrete bounding
boxes.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
 A single-shot approach for simultaneously detecting an object in an
RGB image and predicting its 6D pose without requiring multiple
stages or having to examine multiple hypotheses.
 Unlike a recently proposed single-shot technique for this task,
SSD-6D, which only predicts an approximate 6D pose that must then
be refined, this approach is accurate enough not to require additional post-
processing.
 It is much faster – 50 fps on a Titan X (Pascal) GPU – and more
suitable for real-time processing.
 The key component is a CNN architecture that directly predicts the
2D image locations of the projected vertices of the object’s 3D
bounding box.
 The object’s 6D pose is then estimated using a PnP algorithm.
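A minimal sketch of that last step using OpenCV's PnP solver (the paper uses the 8 box corners plus the centroid and its own solver choice; the corner ordering and the EPnP flag here are assumptions):

```python
import cv2
import numpy as np

def pose_from_box_corners(corners_2d, dims, K):
    """Recover (R, t) from the predicted 2D projections of the 3D bounding-box
    corners. corners_2d must follow the same ordering as corners_3d below."""
    w, h, l = dims
    corners_3d = np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                           for sx in (-1, 1)
                           for sy in (-1, 1)
                           for sz in (-1, 1)], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(corners_3d,
                                  np.asarray(corners_2d, dtype=np.float64),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```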
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
The proposed CNN architecture.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
(a) (b) (c) (d)
(a) An example input image with four objects. (b) The S × S grid showing
cells responsible for detecting the four objects. (c) Each cell predicts 2D
locations of the corners of the projected 3D bounding boxes in the image.
(d) The 3D output tensor from the network, which represents for each cell
a vector consisting of the 2D corner locations, the class probabilities and a
confidence value associated with the prediction.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
In the last column, it shows failure cases due to motion blur,
severe occlusion and specularity.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
 A real-time RGB-based pipeline for object detection and 6D pose
estimation.
 This 3D orientation estimation is based on a variant of the Denoising
Autoencoder that is trained on simulated views of a 3D model using
Domain Randomization.
 This so-called Augmented Autoencoder (AAE) has several advantages
over existing methods:
 Since the training is independent of concrete representations of
object orientations within SO(3) (e.g. quaternions), it is able to handle
ambiguous poses caused by symmetric views, because one-to-many
mappings from images to orientations are avoided.
 It learns representations that specifically encode 3D orientations while
achieving robustness against occlusion and cluttered backgrounds, and
it generalizes to different environments and test sensors.
 The AAE does not require any real pose-annotated training data;
Instead, it is trained to encode 3D model views in a self-supervised way,
overcoming the need of a large pose-annotated dataset.
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
6D Object Detection pipeline with homogeneous transformation
(top-right) and depth-refined result (bottom-right) .
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Training process for the AAE; a) reconstruction target batch x of
uniformly sampled SO(3) object views; b) geometric and color
augmented input; c) reconstruction xˆ after 30000 iterations.
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Autoencoder CNN architecture with occluded test input
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Top: creating a codebook from the encodings of discrete synthetic
object views; bottom: object detection and 3D orientation estimation
using the NN(s) with highest cosine similarity from the codebook.
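A minimal sketch of that nearest-neighbour lookup (array names are illustrative): normalize the test encoding and the codebook, take cosine similarities, and return the rotations of the top-k most similar synthetic views.

```python
import numpy as np

def nearest_orientations(code, codebook, rotations, k=1):
    """codebook: (M, D) encodings of discrete synthetic views; rotations: the
    rotation assigned to each codebook entry; code: (D,) test encoding."""
    code = code / np.linalg.norm(code)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = cb @ code                        # cosine similarity to every view
    top = np.argsort(-sims)[:k]             # indices of the k best matches
    return [(rotations[i], float(sims[i])) for i in top]
```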
 

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

Recently uploaded

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
manasideore6
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
anoopmanoharan2
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
Kamal Acharya
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 

Recently uploaded (20)

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 

3-d interpretation from single 2-d image for autonomous driving

  • 7. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES Coordinate system definitions for 3D object localization. The SFM ground plane is (n⊤, h)⊤. System overview for obtaining SFM cues on objects, depicted in green. Given the camera intrinsic calibration matrix K, the bottom of a 2D Bbox, b = (x, y, 1)⊤, can be back-projected to 3D through the ground plane {h, n}.
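As a concrete illustration of this back-projection, the following minimal Python sketch intersects the viewing ray of the box's bottom point with the ground plane. The function name and the plane convention n⊤X = h (in camera coordinates) are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def backproject_to_ground(b_px, K, n, h):
    """Back-project the bottom point of a 2D bbox onto the ground plane.

    b_px : (x, y) pixel at the bottom of the 2D box
    K    : 3x3 camera intrinsic matrix
    n, h : ground plane in camera coordinates, assumed to satisfy n^T X = h
    Returns the 3D point X on the ground plane (camera frame).
    """
    b = np.array([b_px[0], b_px[1], 1.0])
    ray = np.linalg.inv(K) @ b          # viewing ray direction (up to scale)
    lam = h / (n @ ray)                 # scale so that n^T (lam * ray) = h
    return lam * ray
```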
  • 8. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES Output of this localization system. The bottom left panel shows the monocular SFM camera trajectory. The top panel shows input 2D bounding boxes in red, horizon from estimated ground plane and the estimated 3D bounding boxes in green with distances in magenta. The bottom right panel shows the top view of the ground truth object localization from laser scanner in red, compared to this 3D object localization in blue.
  • 9. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT  A generative model is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene.  To infer the scene topology, geometry and traffic activities from a video sequence from a single camera mounted on a moving car.  It takes advantage of dynamic info. in the form of vehicle tracklets and static info. from semantic labels and geometry (i.e., vanishing points). Monocular 3D Urban Scene Understanding. (Left) Image cues. (Right) Estimated layout: Detections belonging to a tracklet are depicted with the same color, traffic activities are depicted with red lines. Vehicle tracklets Vanishing points Scene labels
  • 10. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT  Assume that the road surface is flat, and model the bird’s eye perspective as the y = 0 plane of the standard camera coordinate system;  Detect vehicles in each frame independently using a semi-supervised version of the part-based detector in order to obtain orientation estimates;  2D tracklets estimated using ’tracking-by-detection’: First adjacent frames are linked and then short tracklets are associated to create longer ones via the Hungarian method.  3D vehicle tracklets are obtained by projecting the 2D tracklets into bird’s eye perspective, employing error-propagation to obtain covariance estimates.  Model lanes with splines, place parking spots at equidistant places along street boundaries.  The model infers whether the cars participate in traffic or are parked in order to get more accurate layout estimations.  Latent variables are employed to associate each detected vehicle with positions in one of these lanes or parking spaces.
  • 11. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT Graphical model and road model with lanes represented as B-splines. Transform the 2D tracklets into 3D tracklets: project the image coordinates into bird’s eye perspective by backprojecting objects into 3D using several complementary cues. Towards this goal, use the 2D bounding box footpoint in combination with the estimated road plane. Two types of dominant vanishing points: forward facing street and crossing street. Three semantic classes, i.e., road, sky and background.
  • 12. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT (Left) Tracklets from all frames superimposed. (Middle) Inference result with θ known and (Right) θ unknown. The inferred intersection layout in gray. Ground truth labels in blue. Detected activities in red.
  • 13. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS  A method for single image 3D cuboid object detection and multi-view object SLAM without prior object model, and the two aspects can benefit each other.  For 3D detection, generate cuboid proposals from 2D Bboxes and vanishing points sampling.  The proposals are further scored and selected to align with image edges.  Multi-view bundle adjustment with measurement functions is proposed to jointly optimize camera poses, objects and points, utilizing single view detection results.  Objects can provide more geometric constraints and scale consistency compared to points.  Objects are utilized in two ways: providing depth initialization for points difficult to triangulate and providing geometry constraints in BA.  The estimated camera poses from SLAM can improve the single-view object detection.
  • 14. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Monocular 3D object detection and mapping without prior object models. Mesh model is just for visualization and not used for detection. (a) ICL NUIM data with various objects, whose position, orientation and dimension are optimized by SLAM. (b) KITTI 07. With object constraints, monocular SLAM can build a consistent map and correct scale drift, without loop closure and constant camera height assumption. (a) (b)
  • 15. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS  A 3D cuboid is described by 9 DoF parameters: 3 DoF position, 3 DoF rotation and 3 DoF dimension.  The cuboid coordinate frame is built at the cuboid center, aligned with the main axes.  The camera intrinsic calibration K is also known.  The cuboid’s projected corners fit tightly within the 2D bounding box; there are 4 constraints corresponding to 4 sides of a rectangle, which cannot fully constrain all 9 parameters.  A 3D cuboid has 3 orthogonal axes and can form 3 VPs after perspective projection, depending on the object rotation R and camera calibration K.  After getting the 8 cuboid corners in 2D, back-project to 3D space to compute the 3D position and dimensions, which are determined up to a scale factor.  The scale can be reasoned from the camera height to ground, prior object size and so on.
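A small sketch of how the three vanishing points could be obtained from R and K, under the common convention that each column of R is one cuboid axis direction in the camera frame; the function name is illustrative only.

```python
import numpy as np

def cuboid_vanishing_points(K, R):
    """Vanishing points of a cuboid's three orthogonal axes.

    K : 3x3 camera intrinsics, R : 3x3 object rotation (camera frame).
    Each column of R is an axis direction; its image VP is K @ R[:, i].
    Returns three 2D points (None if an axis is parallel to the image plane).
    """
    vps = []
    for i in range(3):
        v = K @ R[:, i]
        vps.append(v[:2] / v[2] if abs(v[2]) > 1e-9 else None)
    return vps
```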
  • 16. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Proposals generation from 2D object box. Cuboids are divided into three categories depending on the number of observable faces. If one corner is estimated, the other seven corners can also be computed from vanishing points (VPs). For example in (a), if corner 1 is sampled, then corner 2 and 3 can be determined through ray intersection of VP line and rectangles, followed by corner 4 and other bottom corners.
  • 17. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Denote the image as I and cuboid proposal as x, then the cost function is defined as: Cuboid proposal scoring. (Left) Edges to align and score the proposals. (Right) Cuboid proposals generated from the same 2D cyan bounding box. The top left is the best and bottom right is the worst after scoring.
  • 18. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Camera poses C = {ci}, 3D landmark objects O = {oj}, points P = {pk}. BA is formulated as an NLS problem. Camera-object 3D measurement: transform the landmark object to the camera frame and compare with the measurement. Camera-object 2D measurement: project the landmark cuboid onto the image plane to get the 2D Bbox and compare it with the detected 2D Bbox. Object-point measurement: first transform the point to the cuboid frame and compare with the cuboid dimensions. Point-camera measurement: the standard 3D point re-projection error in feature-based SLAM.
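The slide only summarizes these measurement functions at a high level; the sketch below is a simplified, hypothetical version of two of the residuals (camera-object in 3D and object-point), not the paper's exact error terms.

```python
import numpy as np

def camera_object_error(T_cw, T_wo, T_co_meas):
    """3D camera-object error: transform the landmark object into the camera
    frame and compare with the single-view measurement.
    All T_* are 4x4 homogeneous transforms (w: world, c: camera, o: object)."""
    T_co_pred = T_cw @ T_wo
    delta = np.linalg.inv(T_co_meas) @ T_co_pred
    t_err = delta[:3, 3]                                       # translation difference
    r_err = np.arccos(np.clip((np.trace(delta[:3, :3]) - 1) / 2, -1, 1))  # rotation angle
    return np.concatenate([t_err, [r_err]])

def object_point_error(T_wo, p_w, dims):
    """Object-point error: transform a map point into the cuboid frame and
    penalize the part of it that lies outside the cuboid dimensions."""
    p_o = (np.linalg.inv(T_wo) @ np.append(p_w, 1.0))[:3]
    return np.maximum(np.abs(p_o) - dims / 2.0, 0.0)
```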
  • 19. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS (a) The object SLAM pipeline. Single view object detection provides cuboid landmark and depth initialization for SLAM while SLAM can estimate camera pose for more accurate object detection. (b) Measurement errors between cameras, objects and points during BA. (a) (b)
  • 20. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS  Object association based on point matching:  Dynamic points detected through descriptor matching and epipolar line checking;  Then first associate points to objects if points are observed enough times inside the 2D object bounding box and close to the cuboid centroid in 3D space;  Then find the object match with the largest number of shared map points exceeding a threshold (10 for example);  This works well for wide-baseline matching, repetitive objects, occlusions, and dynamic scenarios. Green points are normal map points, and other color points are associated to objects with the same color. The front cyan moving car is not added as a SLAM landmark as no feature point is associated with it. Points in object overlapping areas are not associated with any object.
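A toy sketch of the shared-map-point association rule described above; the identifiers and data layout are assumed for illustration.

```python
def associate_objects(detection_points, landmark_points, min_shared=10):
    """Associate a new detection with an existing object landmark by the
    number of shared map points (threshold of ~10 as mentioned on the slide).

    detection_points: set of map-point ids observed inside the 2D box
    landmark_points:  dict {object_id: set of map-point ids}
    Returns the matched object id, or None if no landmark shares enough points."""
    best_id, best_shared = None, 0
    for obj_id, pts in landmark_points.items():
        shared = len(detection_points & pts)
        if shared > best_shared:
            best_id, best_shared = obj_id, shared
    return best_id if best_shared >= min_shared else None
```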
  • 21. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES  A probabilistic 3D scene model that integrates SoA multiclass object detection, object tracking and scene labeling together with geometric 3D reasoning.  The model is able to represent complex object interactions such as inter-object occlusion, physical exclusion between objects, and geometric context.  Inference in this model allows to jointly recover the 3D scene context and perform 3D multi-object tracking from a mobile observer, for objects of multiple categories, using only monocular video as input.  This system performs explicit occlusion reasoning and is capable of tracking objects that are partially occluded for extended periods of time, or objects that have never been observed to their full extent.  A joint scene tracklet model for the evidence collected over multiple frames substantially improves performance.
  • 22. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES Overview of this system. For each input frame, run an object detector and extract semantic scene labels. Object hypotheses are fused to short-term tracklets and put into a strong 3D scene model with explicit occlusion reasoning. MCMC inference allows tractable inference in the Bayesian scene model, while HMM scene tracking ensures long-term associations.
  • 23. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES The multi-frame 3D inference and explicit occlusion reasoning for onboard vehicle and pedestrian tracking with overlaid horizon estimate for different public SoA datasets. The 3D scene state X is expressed in the world coordinate system; the camera is parameterized by the rotation angles of a vehicle-mounted camera.
  • 24. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES Employing the theorem of intersecting lines to derive the distance to an object along the ground plane in the viewing direction. Approximate objects by their Bboxes and project them onto the image. By leveraging the depth order obtained from the 3D scene model, occluded object regions can be estimated.
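The intersecting-lines relation reduces to the familiar similar-triangles formula sketched below; this is a generic sketch assuming a roughly level camera with known height and a known horizon row, not the paper's full derivation.

```python
def ground_plane_distance(v_foot, v_horizon, f_px, cam_height):
    """Distance to an object along the ground plane from the image row of its
    footpoint, via the intersecting-lines (similar-triangles) relation:
        Z = f * H_cam / (v_foot - v_horizon)

    v_foot, v_horizon : image rows of the object base and of the horizon
    f_px              : focal length in pixels
    cam_height        : camera height above the ground plane (metres)."""
    dv = v_foot - v_horizon
    if dv <= 0:
        return float('inf')   # footpoint at or above the horizon
    return f_px * cam_height / dv
```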
  • 25. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
  • 26. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS  This work extends part-based models by using accurate geometric models both in the learning phase and at detection.  The object model is defined as a number of roughly planar aspect models together with a set of typical object poses;  In the learning phase, manual annotations are used to reduce perspective distortion before learning the part-based models.  Training on rectified images using a deformable part-based model (DPM) leads to models which are more specific.  At the same time a set of representative object poses are learnt.  Transform the image according to each of the learnt typical poses.  These are used at detection to remove perspective distortion.  Scores from the aspect detectors are generated by running each aspect model on each of the transformed images.  Detections from the different aspect models are combined and thresholded to produce the final object detection.
  • 27. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Annotation of each visible aspect of the object in training images The two aspect models for the bus category. The upper row shows the frontal model and the lower row shows the side model.
  • 28. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
  • 29. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
  • 30. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS A training example is similar to a pose P if the average angular deviations for the front and side are below a predefined threshold. (Figure: measure of angular deviation for pose similarity.)
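A trivial sketch of this assignment rule; the threshold value is a placeholder, as the slide does not state it.

```python
import numpy as np

def similar_to_pose(front_devs_deg, side_devs_deg, thresh_deg=15.0):
    """A training example is assigned to a pose P if the average angular
    deviations of its front and side aspects both stay below a threshold
    (the 15-degree value here is an assumed placeholder)."""
    return (np.mean(front_devs_deg) < thresh_deg) and (np.mean(side_devs_deg) < thresh_deg)
```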
  • 31. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Overview of the detection pipeline. The input image is transformed according to each of the representative poses. This produces multiple images that are individually run through the aspect detectors, creating a set of score pyramids containing the detector scores at different scales. These are merged into one pyramid per aspect, in the original image coordinate system. Finally, the front and side scores are combined and non-maximum suppression is performed.
  • 32. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Left: how to estimate the side location (brown dot) given the frontal location (blue dot) and the size of the skewed frontal bounding box. Right: search in a small neighborhood (blue circle) of the expected location for each level. The score combination can be expressed as
  • 33. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Detected bounding boxes are shown in green and their layout in red.
  • 34. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL  Given a monocular image, localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes.  The approach extends the deformable part-based model to reason in 3D.  It represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box.  Model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint.  The model reasons about face visibility patterns called aspects.  Train the cuboid model jointly and share weights across all aspects to attain efficiency.  Inference then entails sliding and rotating the box in 3D and scoring object hypotheses.  While the search space is discretized for inference, the variables are continuous in the model.
  • 35. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL The deformable 3D cuboid model. Viewpoint angle θ.
  • 36. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL Aspects, together with the range of θ that they cover, for (left) cars and (right) beds.
  • 37. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
  • 38. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL where p = (p1, · · · , p6 ) and V (i, a) is a binary variable encoding whether face i is visible under aspect a. Note that a = a(θ, s) can be deterministically computed from the rotation angle θ and the position of the stitching point s (which we assume to always be visible), which in turn determines the face visibility V.
  • 39. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL Learned models for (left) bed, (right) car.
  • 40. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL We use ref to index the first visible face in the aspect model.
  • 41. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL Inference in this model can be done by computing
  • 42. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL KITTI: examples of car detections. (top) Ground truth, (bottom) The 3D detections, augmented with best fitting CAD models to visualize inferred 3D box orientations.
  • 43. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS  Scene understanding from the perspective of 3D shape modeling: a 3D scene representation that reasons jointly about the 3D shape of multiple objects.  It allows expressing 3D geometry and occlusion at the fine detail level of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way. Left: Coarse 3D object bounding boxes derived from 2D bounding box detections. Right: fine-grained 3D shape model fits improve 3D localization (bird’s eye views).
  • 44. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS Scene particles (coarse 3D geometry and fine-grained shape). Deterministic occlusion mask computation by ray casting and intersection (blue). A 3D scene model, consisting of a common ground plane, a set of 3D deformable objects, and an explicit occlusion mask for each object.
  • 45. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS Object likelihood. Scene-level likelihood. An inference scheme that proceeds in stages, lifting an initial 2D guess (Initialization) about object locations to a coarse 3D model (Coarse 3D geometry), and refining that coarse model into a final collection of consistent 3D shapes (Final scene-level inference).
  • 46. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS (a) Part localization accuracy and 2D pre-detection. (b-c) Example detections and corresponding 3D reconstructions.
  • 47. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS COARSE+GP (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b) COARSE+GP based on (a), (c) bird’s eye view of (b). (d) FG+SO+DO+GP shape model fits (blue: estimated occlusion masks), (e) bird’s eye view of (d). Estimates in red, ground truth in green.
  • 48. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS FG+SO+DO (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b) FG+SO+DO based on (a), (c) bird’s eye view of (b). (d) FG+SO+DO+GP shape model fits (blue: estimated occlusion masks), (e) bird’s eye view of (d). Estimates in red, ground truth in green.
  • 49. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION  Rank possible poses and types for each frame and exploit temporal coherence between consecutive frames for refinement.  Cast the estimation of a vehicle’s pose and type as a solution of a continuous optimization problem over space and time.  Obtain initial start points by a discrete temporal optimization reaching a global optimum on a ranked discrete set of possible types and poses.  To guarantee effectiveness of the discrete-continuous optimization, reduce the search space of potential 3D model types and poses for each frame for the discrete optimizer.  This avoids the common, expensive evaluation of all possible discretized hypotheses.  The key idea towards efficiency lies in a combination of detecting the vehicle, rendering the 3D models, matching projected edges to input images, and using a tree-structured MRF to get fast and globally optimal inference and to force the vehicle to follow a feasible motion model in the initial phase.
  • 50. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION Improve pose estimation over [Toshev et al., 2009] by processing in continuous space (columns 1, 2), reduce wrong classifications due to incorrect scales (column 3) and improve pose estimation over [Leotta et al., 2011] by using existing 3D models.
  • 51. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION (a) Framework application flow. (b) Vehicle, described by orientation α and centroid on the ground plane C = (x, y, 0). Fast Directional Chamfer Matching (FDCM): FDCM maps the edge pixels in U and E to an orientation-augmented space; the alignment cost between the two edge maps is a directional chamfer distance, which is used to update the matching score. Given the shifted but projectively wrong model projection area A^p_l and the projectively correct model projection area B^q_l, the similarity score for a pose is calculated by combining the output of FDCM and the area overlap.
  • 52. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION (a) Temporal inference for ranked projections. (b) Ackermann steering principle where φ = θ/2. (c) Corresponding points between model’s projected edges and edge image.
  • 53. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION Pose estimation using FDCM only (top row), combining FDCM and MRF (middle row), combining FDCM, MRF and continuous optimization (bottom row).
  • 54. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES  The existing 3D pose estimation methods using deep networks can be divided in two groups:  (i) predict 2D keypoints from images and recover 3D pose from keypoints;  (ii) directly predict 3D pose from an image.  A mixed classification-regression framework that uses a classification network to produce a discrete multimodal pose estimate and a regression network to produce a continuous refinement of the estimate.  The framework can accommodate different architectures and loss functions, leading to multiple classification-regression models. A high-level overview of the problem statement and proposed network architecture.
  • 55. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES
  • 56. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES The Bin & Delta model, in its Simple/Naive Bin & Delta and Geodesic Bin & Delta variants.
  • 57. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES One delta network per pose-bin. Best (top row) and worst (bottom row) images for the Bus and Car categories. The previous optimization problems are formulated accordingly.
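A minimal sketch of how a Bin & Delta prediction could be composed into a continuous pose estimate; the variable names and the 1-D angle simplification are assumptions for illustration (the paper works on full 3D rotations).

```python
import numpy as np

def bin_delta_pose(bin_logits, deltas, bin_centers):
    """Mixed classification-regression readout: the classification head picks
    a pose bin, a per-bin regression head refines it with a continuous delta.

    bin_logits : (B,) scores over pose bins
    deltas     : (B,) per-bin angular corrections (one delta network per bin)
    bin_centers: (B,) representative angles of each bin (radians)."""
    k = int(np.argmax(bin_logits))
    return bin_centers[k] + deltas[k]
```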
  • 58. BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE  Not limited to the frontal/rear viewpoint: vehicles are allowed to be seen from any viewpoint, based on 3D Bboxes built around the vehicles.  The Bbox can be auto-constructed from traffic surveillance data.  For scenarios where it is not possible to use the precise construction, a method for estimating the 3D bounding box is proposed.  The 3D Bbox is used to normalize the image viewpoint by “unpacking” the image into a plane.  During CNN training, the color of the image is randomly altered and a rectangle with random noise is added at a random position in the image.  A fine-grained vehicle dataset BoxCars116k, with 116k images of vehicles from various viewpoints taken by many surveillance cameras.
  • 59. Example of automatically obtained 3D bounding box used for fine-grained vehicle classification. Top left: vehicle with 2D bounding box annotation, top right: estimated contour, bottom left: estimated directions to vanishing points, bottom right: 3D bounding box automatically obtained from surveillance video (green) and our estimated 3D bounding box (red). BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
  • 60. 3D bounding box and its unpacked version. Examples of data normalization and auxiliary data fed to the nets. Left to right: vehicle with 2D bounding box, computed 3D bounding box, vectors encoding viewpoints on the vehicle (View), unpacked image of the vehicle (Unpack), and rasterized 3D bounding box fed to the net (Rast). BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
  • 61. Estimation of 3D Bbox. Left to right: image with vehicle 2D Bbox, output of contour object detector, constructed contour, estimated directions towards vanishing points, ground truth (green) and estimated (red) 3D Bbox. A CNN is used to estimate the directions towards the vanishing points. The vehicle image is fed to ResNet50 with 3 separate outputs which predict the directions of the vanishing points as probabilities over a quantized angle space (60 bins from −90º to 90º). BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
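The 60-bin angle quantization can be illustrated with the following helper functions; this is a generic sketch, not the authors' code.

```python
import numpy as np

def angle_to_bin(angle_deg, n_bins=60, lo=-90.0, hi=90.0):
    """Quantize a vanishing-point direction into one of 60 bins on (-90, 90)."""
    idx = int((angle_deg - lo) / (hi - lo) * n_bins)
    return int(np.clip(idx, 0, n_bins - 1))

def bin_to_angle(idx, n_bins=60, lo=-90.0, hi=90.0):
    """Convert a bin index back to the angle (degrees) of its bin centre."""
    return lo + (idx + 0.5) * (hi - lo) / n_bins
```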
  • 62. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  A FCN for 2D and 3D bounding box detection of cars from monocular images intended for autonomous driving applications.  The introduced network is E2E trainable and detects objects at multiple scales in a single pass.  A 3D bounding box representation, which is independent of the image projection matrix (camera used to take the images).  The detector may be trained on several different datasets and also detect 3D Bboxes on completely different datasets than it was trained on. 3D bounding boxes (left) and their top view (right) detected by this method. The front sides of 3D bounding boxes are depicted in green, the rear sides in red.
  • 63. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  2D Bounding Box - BBTXT  The 2D Bboxes are represented by the coord.s of their top-left (xmin, ymin) and bottom-right (xmax, ymax) corners.  3D Bounding Box - BB3TXT  A Bbox in 3D has 9 DOF - 3 for position, 3 rotations, and 3 dimensions.  Coord.s of the projected rear-bottom-left, front-bottom-left, and front-bottom-right corners and the y-coordinate of the front-top-left corner. 3D bounding box corners. Info. stored about 3D Bboxes: each Bbox is defined by 3 lines - front-bottom, left-bottom, front-left.
  • 64. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  Together with the requirement that all Bboxes are pinned to the ground plane, this provides a sufficient amount of info. to reconstruct the 3D world positions of the 3D Bboxes.  Inverse Projection  Compute the inverse of KR and the camera center from the projection matrix.  Reconstruction of the Bottom Side  use the ground plane equation ax + by + cz + d = 0 (normal n = [a, b, c]⊤) and the inverse projected rays to determine the position of the bottom side of the 3D bounding box in the world.  the re-projection of 3 points yields a parallelogram in the ground plane instead of a rectangle. Rectification of a parallelogram (solid) to a rectangle (dashed). Projection of a 3D bounding box to the ground plane.
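A compact sketch of the inverse projection step, assuming a projection matrix P = K[R|t] and a world ground plane ax + by + cz + d = 0; function and variable names are illustrative.

```python
import numpy as np

def backproject_to_plane(x_px, P, plane):
    """Intersect the viewing ray of a pixel x with a world plane ax+by+cz+d=0.

    P     : 3x4 projection matrix P = K [R | t]
    plane : (a, b, c, d)
    Returns the 3D intersection point in world coordinates."""
    KR, Kt = P[:, :3], P[:, 3]
    C = -np.linalg.inv(KR) @ Kt               # camera centre
    ray = np.linalg.inv(KR) @ np.append(x_px, 1.0)
    n, d = np.array(plane[:3]), plane[3]
    lam = -(n @ C + d) / (n @ ray)            # solve n.(C + lam*ray) + d = 0
    return C + lam * ray
```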
  • 65. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  Reconstruction of the Top Side  use the direction of the bottom-left line as the normal vector of the frontal plane n_F = [a_F, b_F, c_F] and place the front-bottom-left point in the frontal plane to calculate the front plane as a_F x + b_F y + c_F z + d_F = 0;  Finding the intersection of the frontal plane and the ray l_ftl = C + (KR)^(-1) x_ftl gives the position of the vertex X_ftl to determine the height of the bounding box.  Ground Plane Extraction The RANSAC algorithm for ground plane estimation.
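The ground plane extraction step is only described as RANSAC here; the following is a generic RANSAC plane-fitting sketch under that reading (the thresholds and iteration counts are placeholders).

```python
import numpy as np

def ransac_ground_plane(points, n_iters=200, inlier_thresh=0.1, rng=None):
    """Minimal RANSAC sketch for fitting a ground plane to 3D points.
    points : (N, 3) array. Returns (a, b, c, d) with unit normal, or None."""
    rng = np.random.default_rng() if rng is None else rng
    best_plane, best_inliers = None, 0
    for _ in range(n_iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:          # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p1
        inliers = np.sum(np.abs(points @ n + d) < inlier_thresh)
        if inliers > best_inliers:
            best_plane, best_inliers = np.append(n, d), inliers
    return best_plane
```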
  • 66. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)
  • 67. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES  A Deep Cuboid Detector takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects).  An E2E deep learning system to detect cuboids across many semantic categories (e.g., ovens, shipping boxes, and furniture).  Localize cuboids with a 2D Bbox, and localize the cuboid’s corners, effectively producing a 3D interpretation of box-like objects.  Refine keypoints by pooling conv. features iteratively, improving the baseline method significantly.  This deep learning cuboid detector is trained in an end-to-end fashion and is suitable for real-time applications. 2D Object detection vs. 3D Cuboid detection.
  • 68. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES Deep Cuboid Detection Pipeline. 1) find RoIs in the image where a cuboid might be present and train an RPN to output such regions. 2) features for each RoI are pooled from a conv. feature map. 3) These pooled features are passed through two fully connected layers just like Faster R-CNN. 4) output normalized offsets of the vertices from the center of the region. 5) refine predictions by performing iterative feature pooling.
  • 69. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES  The loss function used in the RPN consists of L_anchor−cls, the log loss over two classes (cuboid vs. not cuboid) and L_anchor−reg, the Smooth L1 loss of the Bbox regression values for each anchor box;  The loss function for the R-CNN is made up of L_ROI−cls, the log loss over two classes (cuboid vs. not cuboid), L_ROI−reg, the Smooth L1 loss of the Bbox regression values for the RoI, and L_ROI−corner, the Smooth L1 loss over the RoI’s predicted vertex locations, also referred to as the corner regression loss.  The complete loss function is a weighted sum of the above-mentioned losses.
  • 70. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES Vertex Refinement via Iterative Feature Pooling. To refine cuboid detection regions by re-pooling features from conv5 using the predicted bounding boxes.
  • 71. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY  3D object detection and pose estimation from a single image.  First regresses relatively stable 3D object properties using a deep CNN and then combines these estimates with geometric constraints provided by a 2D object Bbox to produce a complete 3D Bbox.  The first network output estimates the 3D object orientation using a hybrid discrete-continuous loss, which significantly outperforms the L2 loss.  The second output regresses the 3D object dimensions, which have relatively little variance and can be predicted for many object types.  These estimates, combined with geometric constraints on translation imposed by the 2D Bbox, enable the recovery of a stable and accurate 3D object pose.
  • 72.  Perspective projection of a 3D Bbox fitting tightly within its 2D detection window.  The 3D Bbox is described by its center T, dimensions D and orientation, given by the azimuth, elevation and roll angles. (Figure annotations: 2D box side parameters; correspondence btw the 3D bbox and 2D bbox. Each figure shows a 3D bbox that surrounds an object.) 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
  • 73.  CNN Regression of 3D Box Parameters:  Combine the ray direction at the crop center with the estimated local orientation to compute the global orientation of the object.  Faster R-CNN, SSD: Divide the space of the bounding boxes into several discrete modes “anchor boxes” and then estimate the continuous offsets applied to each anchor box.  Discretize the orientation angle and divide into overlapping bins. For each bin, the CNN network estimates both a confidence probability that the output angle lies inside the ith bin and the residual rotation correction applied to the orientation of the center ray of that bin to obtain the output angle.  The residual rotation is represented by two numbers, for the sine and the cosine of the angle.  Total loss for the MultiBin orientation:  The loss for dimension estimation: Left: Car dimensions. Right: Illustration of local and global orientation of a car. The local orientation computed wrt the ray through the center of the crop. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
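A small sketch of combining the ray direction through the crop centre with the predicted local orientation; sign conventions vary between implementations, and this assumes yaw measured from the optical axis, so treat it as illustrative only.

```python
import numpy as np

def global_yaw_from_local(theta_local, u_center, fx, cx):
    """Recover the global (egocentric) yaw from the local (allocentric)
    orientation predicted by the network and the ray through the crop centre:
        theta_global = theta_local + theta_ray

    u_center : horizontal pixel coordinate of the 2D box centre
    fx, cx   : focal length and principal point (pixels)."""
    theta_ray = np.arctan2(u_center - cx, fx)   # angle of the viewing ray
    return theta_local + theta_ray
```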
  • 74. The architecture for MultiBin estimation for orientation and dimension estimation with 3 branches: The left is for estimation of dimensions of the object of interest. The other two compute the confidence for each bin as well as cos(∆θ) and sin(∆θ) for each bin. Qualitative illustration of 2D detection boxes and estimated 3D projections. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
  • 75. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE  Deep MANTA (Many-Tasks), for vehicle analysis from a given image.  A robust CNN for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation.  A coarse-to-fine object proposal that boosts the vehicle detection.  Deep MANTA localizes vehicle parts even if they are not visible.  At inference time, the network’s outputs are used by a real-time pose estimation step for fine orientation estimation and 3D vehicle localization.
  • 76. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE System outputs. Top: 2D vehicle bboxes, vehicle part localization and part visibility. Bottom: 3D vehicle bbox localization and 3D vehicle part localization. The camera is shown in blue. (Figure annotations: 2D/3D model, 2D vehicle Bbox, 3D Bbox, 2D part coordinates, part visibility vector, 3D part coordinates.)
  • 77. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE Example of one 2D/3D vehicle model. (a) the bounding box B, (b) 2D part coordinates S and part visibility V. The training objective combines a detection loss, a visibility loss, a template similarity loss and a part loss.
  • 78. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE Overview of the Deep MANTA approach. The entire input image is forwarded through the Deep MANTA network. Conv. layers share the same weights. Moreover, these 3 conv. blocks correspond to the split of an existing CNN architecture.
  • 79. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE Semi-automatic annotation process. (a) weak annotations on a real image (3D Bbox). (b) best corresponding 3D models in green. (c) projection of these 3D models in the image. (d) corresponding mesh of visibility. (e) Final annotations.
  • 80. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS DETECTION  Exploit stereo imagery to place proposals in the form of 3D Bboxes.  Proposals are obtained by minimizing a function encoding object size priors, ground plane and depth features about free space, point cloud densities and distance to the ground. Formulate the proposal generation problem as inference in an MRF in which the proposal y should enclose a high-density region in the point cloud, with potentials for point cloud density, free space, a height prior and height contrast.
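As an illustration of the point-cloud-density potential, a simplified axis-aligned version could look like the sketch below; the actual method scores oriented boxes efficiently with integral structures, so this is not the paper's implementation.

```python
import numpy as np

def point_density_score(box_min, box_max, points):
    """Point-cloud density potential for ranking 3D box proposals: the
    fraction of scene points (e.g. from stereo) that fall inside the box.

    box_min, box_max : (3,) axis-aligned box corners, same frame as points
    points           : (N, 3) point cloud."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return inside.sum() / max(len(points), 1)
```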
  • 81. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS DETECTION  Score Bbox proposals using a CNN built on Fast R-CNN;  It shares conv. features across all proposals and uses a ROI pooling layer to compute proposal-specific features;  Adds a context branch after the last conv. layer, and an orientation regression loss to jointly learn object location and orientation;  Features output from the original/context branches are concatenated and fed to the prediction layers.  The context regions are obtained by enlarging candidate boxes by a factor of 1.5.  Smooth L1 loss for orientation regression.  Parameters of the context branch are initialized by copying weights from the original branch.  OxfordNet trained on ImageNet to initialize the weights of conv. layers and the branch for candidate boxes, then fine-tune it E2E on the KITTI training set.
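The 1.5x context enlargement mentioned above is straightforward; a possible helper (purely illustrative) is shown here.

```python
def enlarge_box(box, factor=1.5):
    """Context region used by the context branch: the candidate box enlarged
    by a factor around its centre. box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```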
  • 82. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS DETECTION
  • 83. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING  Generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain object detections.  An energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane.  Score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape.
  • 84. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING CNN architecture used to score proposals for object detection and orientation estimation. The scoring function combines semantic cues (both class and instance level segmentation), location priors, context and shape.
  • 85. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
  • 86. SSD-6D: MAKING RGB-BASED 3D DETECTION AND 6D POSE ESTIMATION GREAT AGAIN  A method for detecting 3D model instances and estimating their 6D poses from RGB data in a single shot.  To this end, it extends the popular SSD paradigm to cover the full 6D pose space and trains on synthetic model data only.  It matches or surpasses current state-of-the-art methods that leverage RGB-D data on multiple challenging datasets.  It produces results at around 10 Hz, which is many times faster than related methods. Discrete 6D pose space with each point representing a classifiable viewpoint. The object distance can be inferred from the projective ratio (see the sketch below).
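A hedged sketch of the projective-ratio idea: under a pinhole model the apparent size of an object scales inversely with depth, so distance can be estimated by comparing the detected box size to the size of the synthetic view rendered at a known canonical distance. The function and its inputs are illustrative.

```python
# Sketch: depth from the projective ratio. Under a pinhole camera, apparent
# size scales as 1/z, so z ~ z_canonical * (diag_canonical / diag_detected),
# with diagonals measured in pixels. Values are illustrative.
import math

def infer_distance(bbox_detected, diag_canonical_px, z_canonical):
    x1, y1, x2, y2 = bbox_detected
    diag_detected = math.hypot(x2 - x1, y2 - y1)
    return z_canonical * diag_canonical_px / diag_detected

print(infer_distance((100, 100, 220, 190), diag_canonical_px=300.0, z_canonical=0.5))  # -> 1.0
```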
  • 87. SSD-6D: MAKING RGB-BASED 3D DETECTION AND 6D POSE ESTIMATION GREAT AGAIN After predicting 2D detections (a), 6D hypotheses are built, followed by pose refinement and a final verification. While the unrefined poses (b) are rather approximate, contour-based refinement (c) already produces visually acceptable results. Occlusion-aware projective ICP with cloud data (d) leads to a very accurate alignment.
  • 88. SSD-6D: MAKING RGB-BASED 3D DETECTION AND 6D POSE ESTIMATION GREAT AGAIN Schematic overview of the SSD-style network prediction. C denotes the number of object classes, V the number of viewpoints, and R the number of in-plane rotation classes. The remaining 4 values refine the corners of the discrete bounding boxes.
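A small sketch of the per-prior prediction layout just described (C class scores + V viewpoint scores + R in-plane rotation scores + 4 corner refinements); the numbers below are illustrative, not the paper's exact configuration.

```python
# Sketch of the per-prior output layout: C class scores, V viewpoint scores,
# R in-plane rotation scores and 4 box-corner refinement values.
C, V, R = 20, 337, 19            # illustrative sizes, not the paper's exact numbers
values_per_prior = C + V + R + 4
num_priors = 8732                # e.g. an SSD-300-like prior count (illustrative)
print(values_per_prior, num_priors * values_per_prior)
```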
  • 89. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION  A single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses.  Unlike the recently proposed single-shot technique SSD-6D, which only predicts an approximate 6D pose that must then be refined, this approach is accurate enough not to require additional post-processing.  It is much faster (50 fps on a Titan X Pascal GPU) and more suitable for real-time processing.  The key component is a CNN architecture that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box.  The object's 6D pose is then estimated using a PnP algorithm (see the sketch below).
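A minimal sketch of the final pose-recovery step, assuming OpenCV: given the predicted 2D projections of the 3D bounding-box corners (plus centroid) and the camera intrinsics, a PnP solver (EPnP here) returns the rotation and translation. Inputs are placeholders.

```python
# Sketch (OpenCV): recover R, t from the predicted 2D projections of the
# 3D bounding-box corners plus the centroid via EPnP.
import numpy as np
import cv2

def pose_from_corners(corners_3d, corners_2d, K):
    """corners_3d: (9, 3) model-frame points; corners_2d: (9, 2) predicted pixels; K: 3x3 intrinsics."""
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K.astype(np.float64),
        np.zeros((4, 1)),            # assume no lens distortion
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)       # rotation matrix; tvec is the translation
    return R, tvec
```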
  • 90. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION The proposed CNN architecture.
  • 91. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION (a) An example input image with four objects. (b) The S × S grid showing the cells responsible for detecting the four objects. (c) Each cell predicts the 2D locations of the corners of the projected 3D bounding boxes in the image. (d) The 3D output tensor from the network, which represents, for each cell, a vector consisting of the 2D corner locations, the class probabilities, and a confidence value associated with the prediction.
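A small sketch of the output tensor layout in (d): each of the S × S cells predicts 9 × 2 corner/centroid coordinates, C class probabilities and one confidence value. Grid size and class count below are illustrative.

```python
# Sketch of the output tensor layout: per cell, 9 corner/centroid points x 2
# coordinates, C class probabilities and 1 confidence value.
S, C = 13, 13                 # illustrative grid size and class count
D = 9 * 2 + C + 1             # values predicted per cell
print((S, S, D))              # -> (13, 13, 32)
```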
  • 92. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION The last column shows failure cases due to motion blur, severe occlusion, and specularity.
  • 93. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION
  • 94. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES  A real-time RGB-based pipeline for object detection and 6D pose estimation.  The 3D orientation estimation is based on a variant of the Denoising Autoencoder that is trained on simulated views of a 3D model using Domain Randomization.  This so-called Augmented Autoencoder (AAE) has several advantages over existing methods:  Since training is independent of any concrete representation of object orientations within SO(3) (e.g. quaternions), it can handle ambiguous poses caused by symmetric views, avoiding one-to-many mappings from images to orientations.  It learns representations that specifically encode 3D orientation while being robust to occlusion and cluttered backgrounds, and it generalizes to different environments and test sensors.  The AAE does not require any real pose-annotated training data; instead, it is trained to encode 3D model views in a self-supervised way (a minimal training sketch follows), removing the need for a large pose-annotated dataset.
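A minimal training-step sketch (PyTorch) of the AAE idea, assuming encoder, decoder, augment and optimizer are user-provided: the network reconstructs the clean rendered view from a randomly augmented version of it, so the latent code becomes invariant to the augmentations and implicitly encodes orientation. A plain L2 reconstruction loss is used here for simplicity.

```python
# Sketch of one AAE training step: reconstruct the *clean* rendered view
# from its augmented version so the code becomes augmentation-invariant.
import torch

def aae_step(encoder, decoder, clean_views, augment, optimizer):
    x_aug = augment(clean_views)                    # random backgrounds, occlusion, color jitter, ...
    z = encoder(x_aug)                              # implicit orientation code
    x_rec = decoder(z)
    loss = torch.mean((x_rec - clean_views) ** 2)   # plain L2 reconstruction loss (for simplicity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```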
  • 95. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES 6D object detection pipeline with homogeneous transformation (top right) and depth-refined result (bottom right).
  • 96. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES Training process for the AAE: (a) reconstruction target batch x of uniformly sampled SO(3) object views; (b) geometrically and color-augmented input; (c) reconstruction x̂ after 30,000 iterations.
  • 97. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES Autoencoder CNN architecture with occluded test input
  • 98. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES Top: creating a codebook from the encodings of discrete synthetic object views. Bottom: object detection and 3D orientation estimation using the nearest neighbor(s) with the highest cosine similarity from the codebook (see the sketch below).
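A small sketch of the codebook lookup: the encoder code of the detected crop is compared against the stored codes of the discrete synthetic views by cosine similarity, and the rotation(s) of the most similar entries are returned. Array shapes and names are assumptions.

```python
# Sketch: cosine-similarity lookup of the k most similar codebook entries.
import numpy as np

def nearest_orientations(query_code, codebook_codes, codebook_rotations, k=1):
    """query_code: (d,); codebook_codes: (N, d); codebook_rotations: list of N rotations."""
    q = query_code / np.linalg.norm(query_code)
    c = codebook_codes / np.linalg.norm(codebook_codes, axis=1, keepdims=True)
    sims = c @ q                          # cosine similarity to every codebook entry
    idx = np.argsort(-sims)[:k]           # indices of the k highest similarities
    return [codebook_rotations[i] for i in idx], sims[idx]
```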