3D INTERPRETATION FROM
SINGLE 2D IMAGE FOR
AUTONOMOUS DRIVING
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
OUTLINE
 Single View Metrology
 Joint SFM and Detection Cues for Monocular 3D
Localization in Road Scenes
 Joint 3D Estimation of Objects and Scene Layout
 CubeSLAM: Monocular 3D Object Detection and
SLAM without Prior Models
 Monocular Visual Scene Understanding:
Understanding Multi-Object Traffic Scenes
 Improved Object Detection and Pose Using Part-
Based Models
 3D Object Detection and Viewpoint Estimation
with a Deformable 3D Cuboid Model
 Are Cars Just 3D Boxes? – Jointly Estimating the
3D Shape of Multiple Objects
 Classification and Pose Estimation of Vehicles in
Videos by 3D Modeling within Discrete-
Continuous Optimization
 A mixed classification-regression framework for
3D pose estimation from 2D images
 BoxCars: Improving Fine-Grained Recognition of
Vehicles using 3D BBoxes in Traffic Surveillance
 Vehicle Detection and Pose Estimation for
Autonomous Driving (Thesis)
 Deep Cuboid Detection: Beyond 2D BBoxes
 3D Bounding Box Estimation Using Deep
Learning and Geometry
 Deep MANTA: A Coarse-to-fine Many-Task
Network for joint 2D and 3D vehicle analysis from
monocular image
 3D Object Proposals for Accurate Object Class
Detection
 Monocular 3D Object Detection for Autonomous
Driving
 SSD-6D: Making RGB-Based 3D Detection and
6D Pose Estimation Great Again
 Real-Time Seamless Single Shot 6D Object Pose
Prediction
 Implicit 3D Orientation Learning for 6D Object
Detection from RGB Images
SINGLE VIEW METROLOGY
Basic geometry: The plane’s vanishing line l
is the intersection of the image plane with a
plane parallel to the reference plane and
passing through the camera centre. The
vanishing point v is the intersection of the
image plane with a line parallel to the
reference direction through the camera
centre.
Cross ratio: The point b on the plane π
corresponds to the point t on the plane π’ .
They are aligned with the vanishing point v.
The four points v, t, b and the intersection i of the
line joining them with the vanishing line define
a cross-ratio, which determines a ratio of
distances between planes in the world.
SINGLE VIEW METROLOGY
Homology mapping between parallel planes: a point X
on plane π is mapped to the point X’ on π’ by parallel
projection. In the image, the mapping between the images of
the two planes is a homology, with vertex v and axis l.
The correspondence b -> t fixes the
remaining DoF of the homology via the cross-ratio
of the 4 points: v, i, t and b.
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
 This localization framework jointly uses info. from complementary
modalities such as SFM and object detection to achieve high
localization accuracy in both near and far fields.
 Make use of raw detection scores to allow 3D Bboxes to adapt to
better quality 3D cues.
 To extract SFM cues, take advantage of dense tracking over
sparse mechanisms in autonomous driving scenarios.
 The formulation for 3D localization can be regarded as an extension
of sparse BA to incorporate object detection cues.
3D object localization framework
that combines cues from SFM and
object detection. Red denotes 2D
bounding boxes, the horizontal line
is the horizon from estimated
ground plane, green denotes
estimated 3D localization for far
and near objects, with distances in
magenta.
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Overview of the 3d object localization system combining SFM
cues (green) with object detection cues (brown).
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Coordinate system definitions for
3D object localization. The SFM
ground plane is (n⊤, h)⊤.
System overview for obtaining SFM cues on
objects, depicted in green.
Given the camera intrinsic calibration
matrix K, the bottom of a 2D Bbox, b =
(x, y, 1)⊤, can be back-projected to
3D through the ground plane {h, n}:
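A minimal sketch of this back-projection (my own illustration, assuming the camera center is the origin and the plane convention n·X + h = 0; the paper's sign convention may differ):

```python
import numpy as np

def backproject_to_ground(b_px, K, n, h):
    """Back-project the bottom point b = (x, y, 1) of a 2D Bbox onto the
    ground plane {h, n}. Assumes the camera center is the origin and the
    plane satisfies n.X + h = 0 (sign convention may differ from the paper)."""
    b = np.array([b_px[0], b_px[1], 1.0])
    ray = np.linalg.inv(K) @ b           # viewing ray direction K^-1 b
    lam = -h / (n @ ray)                 # depth along the ray at the plane
    return lam * ray                     # 3D point on the ground plane

# Illustrative intrinsics, ground normal (camera y pointing down), height 1.5 m
K = np.array([[720.0, 0, 620.0], [0, 720.0, 180.0], [0, 0, 1.0]])
n = np.array([0.0, -1.0, 0.0])
print(backproject_to_ground((640.0, 250.0), K, n, 1.5))
```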
JOINT SFM AND DETECTION CUES FOR
MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Output of this localization
system. The bottom left panel
shows the monocular SFM
camera trajectory. The top
panel shows input 2D bounding
boxes in red, horizon from
estimated ground plane and the
estimated 3D bounding boxes
in green with distances in
magenta. The bottom right
panel shows the top view of the
ground truth object localization
from laser scanner in red,
compared to this 3D object
localization in blue.
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
 A generative model is able to reason jointly about the 3D scene
layout as well as the 3D location and orientation of objects in the
scene.
 To infer the scene topology, geometry and traffic activities from a
video sequence from a single camera mounted on a moving car.
 It takes advantage of dynamic info. in the form of vehicle tracklets
and static info. from semantic labels and geometry (i.e., vanishing
points).
Monocular 3D Urban
Scene Understanding.
(Left) Image cues.
(Right) Estimated
layout: Detections
belonging to a tracklet
are depicted with the
same color, traffic
activities are depicted
with red lines.
Vehicle tracklets
Vanishing points
Scene labels
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
 Assume that the road surface is flat, and model the bird’s eye perspective
as the y = 0 plane of the standard camera coordinate system;
 Detect vehicles in each frame independently using a semi-supervised
version of the part-based detector in order to obtain orientation estimates;
 2D tracklets are estimated using ’tracking-by-detection’: first adjacent frames
are linked, and then short tracklets are associated to create longer ones via
the Hungarian method (see the sketch after this list).
 3D vehicle tracklets are obtained by projecting the 2D tracklets into bird’s
eye perspective, employing error-propagation to obtain cov. estimates.
 Model lanes with splines, place parking spots at equidistant places along
street boundaries.
 The model infers whether the cars participate in traffic or are parked in
order to get more accurate layout estimations.
 Latent variables are employed to associate each detected vehicle with
positions in one of these lanes or parking spaces.
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
Graphical model and road model with lanes represented as B-splines.
Transform the 2D tracklets into 3D tracklets: project the image coordinates
into bird’s eye perspective by backprojecting objects into 3D using several
complementary cues. Towards this goal, use the 2D bounding box footpoint
in combination with the estimated road plane. Two types of dominant
vanishing points: forward facing street and crossing street. Three semantic
classes, i.e., road, sky and background.
JOINT 3D ESTIMATION OF OBJECTS AND
SCENE LAYOUT
(Left) Tracklets from all frames superimposed. (Middle) Inference result
with θ known and (Right) θ unknown. The inferred intersection layout in
gray. Ground truth labels in blue. Detected activities in red.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
 A method for single image 3D cuboid object detection and multi-view
object SLAM without prior object model, and the two aspects can
benefit each other.
 For 3D detection, generate cuboid proposals from 2D Bboxes and
vanishing points sampling.
 The proposals are further scored and selected to align with image edges.
 Multi-view bundle adjustment with measurement functions is proposed
to jointly optimize camera poses, objects and points, utilizing single
view detection results.
 Objects can provide more geometric constraints and scale consistency
compared to points.
 Objects are utilized in two ways: they provide depth initialization for points
that are difficult to triangulate, and they provide geometric constraints in BA.
 The estimated camera poses from SLAM can improve the single-view
object detection.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Monocular 3D object detection and mapping without prior object models.
Mesh model is just for visualization and not used for detection. (a) ICL NUIM
data with various objects, whose position, orientation and dimension are
optimized by SLAM. (b) KITTI 07. With object constraints, monocular SLAM
can build a consistent map and correct scale drift, without loop closure and
constant camera height assumption.
(a) (b)
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
 A 3D cuboid by 9 DoF parameters: 3 DoF position, 3 DoF rotation
and 3 DoF dimension.
 The cuboid coordinate frame is built at the cuboid center, aligned
with the main axes.
 The camera intrinsic calibration K is also known.
 The cuboid’s projected corners should fit tightly within the 2D bounding box;
this gives 4 constraints corresponding to the 4 sides of the rectangle,
which cannot fully constrain all 9 parameters.
 A 3D cuboid has 3 orthogonal axes and can form 3 VPs after
perspective projection, depending on the object rotation R and camera
calibration K.
 After getting the 8 cuboid corners in 2D, back-project to 3D space to
compute the 3D position and dimensions, which are determined up to a scale factor.
 The scale can be reasoned from camera height to ground, prior
object size and so on.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Proposal generation from the 2D object box. Cuboids are divided
into three categories depending on the number of observable
faces. If one corner is estimated, the other seven corners can
also be computed from the vanishing points (VPs). For example in
(a), if corner 1 is sampled, then corners 2 and 3 can be
determined through the intersection of VP rays with the rectangle edges,
followed by corner 4 and the other bottom corners.
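A small sketch of the underlying 2D geometry (my own illustration): one corner obtained as the intersection of the ray from a vanishing point through a sampled corner with a bounding-box edge, using homogeneous point/line cross products. All coordinates are made up.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, returned as a Euclidean point."""
    x = np.cross(l1, l2)
    return x[:2] / x[2]

vp1 = (520.0, 80.0)                   # a vanishing point (made-up coordinates)
corner1 = (300.0, 120.0)              # a sampled/known cuboid corner
edge = line_through((250.0, 100.0), (450.0, 100.0))   # one edge of the 2D box
corner2 = intersect(line_through(vp1, corner1), edge)  # ray-edge intersection
print(corner2)                        # [410. 100.]
```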
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Denote the image as I and cuboid proposal as x, then the cost
function is defined as:
Cuboid proposal scoring. (Left) Edges to align and score
the proposals. (Right) Cuboid proposals generated from
the same 2D cyan bounding box. The top left is the best
and bottom right is the worst after scoring.
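The exact cost is defined in the paper; below is a simplified, hedged stand-in for the edge-alignment idea, scoring a proposal's projected 2D edges by their distance to Canny edges via a distance transform (the paper additionally uses angle-alignment and shape terms).

```python
import cv2
import numpy as np

def edge_alignment_cost(gray, proposal_edges):
    """Average pixel distance from the proposal's projected edge segments to
    the nearest image edge (lower is better). gray: uint8 grayscale image;
    proposal_edges: list of ((x0, y0), (x1, y1)) segments. A simplified
    stand-in for the paper's scoring function."""
    edges = cv2.Canny(gray, 80, 200)
    dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)  # dist to edges
    costs = []
    for (x0, y0), (x1, y1) in proposal_edges:
        n = max(int(np.hypot(x1 - x0, y1 - y0)), 2)
        xs = np.clip(np.linspace(x0, x1, n).astype(int), 0, dist.shape[1] - 1)
        ys = np.clip(np.linspace(y0, y1, n).astype(int), 0, dist.shape[0] - 1)
        costs.append(dist[ys, xs].mean())          # sample along the segment
    return float(np.mean(costs))
```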
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
Camera poses C = {ci}, 3D landmark objects O = {oj}, points P = {pk}.
BA is formulated as a nonlinear least-squares (NLS) problem:
Camera-object 3D measurement: transform the landmark object into the camera
frame, then compare it with the measurement.
Camera-object 2D measurement: project the landmark cuboid onto the image plane
to get a 2D Bbox, and compare it with the detected 2D Bbox.
Object-point measurement: first transform the point into the cuboid frame,
then compare it with the cuboid dimensions.
Point-camera measurement: the standard 3D point re-projection error
in feature-based SLAM.
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
(a) The object SLAM pipeline. Single view object detection provides
cuboid landmark and depth initialization for SLAM while SLAM can
estimate camera pose for more accurate object detection. (b)
Measurement errors between cameras, objects and points during BA.
(a) (b)
CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND
SLAM WITHOUT PRIOR MODELS
 Object association based on point matching:
 Dynamic points are detected through descriptor matching and
epipolar-line checking;
 Points are then associated to objects if they are observed inside the
2D object bounding box enough times and lie close to the cuboid
centroid in 3D space;
 Objects are matched by finding the candidate that shares the largest
number of map points, exceeding a threshold (10 for example);
 This works well for wide-baseline matching, repetitive objects, occlusions,
and dynamic scenarios.
Green points are normal map points,
and other color points are associated
to objects with the same color. The
front cyan moving car is not added as
SLAM landmark as no feature point is
associated with it. Points in object
overlapping areas are not associated
with any object.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
 A probabilistic 3D scene model that integrates SoA multiclass object
detection, object tracking and scene labeling together with geometric
3D reasoning.
 The model is able to represent complex object interactions such as
inter-object occlusion, physical exclusion between objects, and
geometric context.
 Inference in this model allows jointly recovering the 3D scene context
and performing 3D multi-object tracking from a mobile observer, for
objects of multiple categories, using only monocular video as input.
 This system performs explicit occlusion reasoning and is capable of
tracking objects that are partially occluded for extended periods of
time, or objects that have never been observed to their full extent.
 A joint scene tracklet model for the evidence collected over multiple
frames substantially improves performance.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
Overview of this system.
For each input frame, run
an object detector and
extract semantic scene
labels. Object
hypotheses are fused to
short-term tracklets and
put into a strong 3D
scene model with explicit
occlusion reasoning.
MCMC sampling makes
inference in the Bayesian
scene model tractable,
while HMM scene
tracking ensures long-
term associations.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
The multi-frame 3D inference and explicit
occlusion reasoning for onboard vehicle and
pedestrian tracking with overlaid horizon
estimate for different public SoA datasets.
Notation: the 3D scene state X in the world coordinate system;
the rotation angles of a vehicle-mounted camera.
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
Employing the theorem of
intersecting lines to derive the
distance to an object along the
ground plane in the viewing direction
(see the sketch below).
Objects are approximated by their Bboxes and
projected onto the image. By leveraging the
depth order obtained from the 3D scene model,
the system is able to estimate occluded object regions.
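As referenced above, a minimal sketch of the intersecting-lines (similar triangles) distance estimate, assuming an approximately level camera with known height and focal length (my illustration, not the paper's exact derivation):

```python
def distance_along_ground(v_foot, v_horizon, focal_px, cam_height_m):
    """Distance to an object along the ground plane in the viewing direction:
    d = f * h_cam / (v_foot - v_horizon), where v_foot is the image row of
    the object footpoint and v_horizon the row of the horizon."""
    dv = v_foot - v_horizon
    if dv <= 0:
        raise ValueError("footpoint must lie below the horizon")
    return focal_px * cam_height_m / dv

# Footpoint 60 px below the horizon, f = 720 px, camera 1.2 m above ground
print(distance_along_ground(460.0, 400.0, 720.0, 1.2))   # 14.4 m
```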
MONOCULAR VISUAL SCENE UNDERSTANDING:
UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
 This work builds on part-based models by using accurate geometric models
both in the learning phase and at detection.
 The object model is defined as a number of roughly planar aspects
models together with a set of typical object poses;
 In the learning phase, manual annotations are used to reduce
perspective distortion before learning the part-based models.
 Training on rectified images using a deformable part-
based model (DPM) leads to models which are more specific.
 At the same time, a set of representative object poses is learnt.
 Transform the image according to each of the learnt typical poses.
 These are used at detection to remove perspective distortion.
 Scores from the aspect detectors are generated by running each
aspect model on each of the transformed images.
 Detections from the different aspect models are combined and
thresholded to produce the final object detection.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Annotation of each visible aspect of the object in training images
The two aspect models for the bus category. The upper row shows
the frontal model and the lower row shows the side model.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
A training example is considered similar to a pose P if the average angular
deviations for the front and side (the measure of angular deviation for
pose similarity) are below a predefined threshold.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Overview of the detection pipeline. The input image is transformed according
to each of the representative poses. This produces multiple images that are
individually run through the aspect detectors, creating a set of score pyramids
containing the detector scores at different scales. These are merged into one
pyramid per aspect, in the original image coordinate system. Finally, the front
and side scores are combined and non-maximum suppression is performed.
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Left: how to estimate the side location (brown dot) given the frontal location
(blue dot) and the size of the skewed frontal bounding box. Right: search in a
small neighborhood (blue circle) of the expected location for each level.
The score combination can be expressed as
IMPROVED OBJECT DETECTION AND POSE USING
PART-BASED MODELS
Detected bounding boxes are shown in green and their layout in red.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
 Given a monocular image, localize the objects in 3D by enclosing them
with tight oriented 3D bounding boxes.
 An approach extends the deformable part-based model to reason in 3D.
 It represents an object class as a deformable 3D cuboid composed of
faces and parts, which are both allowed to deform with respect to their
anchors on the 3D box.
 Model the appearance of each face in fronto-parallel coordinates, thus
effectively factoring out the appearance variation induced by viewpoint.
 The model reasons about face visibility patterns called aspects.
 Train the cuboid model jointly and share weights across all aspects to
attain efficiency.
 Inference then entails sliding and rotating the box in 3D and scoring
object hypotheses.
 While the search space is discretized for inference, the variables are
continuous in the model.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
The deformable 3D cuboid model. Viewpoint angle θ.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
Aspects, together with the range of θ that they cover, for (left) cars and (right) beds.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
where p = (p1, · · · , p6) and V(i, a) is a binary variable encoding whether
face i is visible under aspect a. Note that a = a(θ, s) can be deterministically
computed from the rotation angle θ and the position of the stitching point
s (which we assume to always be visible), which in turn determines the face
visibility V.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
Learned models for (left) bed, (right) car.
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
We use ref to index the first visible face in the aspect model, and
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
Inference in this model can be done by computing
3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH
A DEFORMABLE 3D CUBOID MODEL
KITTI: examples of car detections. (top) Ground truth,
(bottom) The 3D detections, augmented with best fitting
CAD models to visualize inferred 3D box orientations.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
 Scene understanding from the perspective of 3D shape modeling: a 3D scene
representation that reasons jointly about the 3D shape of multiple objects.
 It allows expressing 3D geometry and occlusion at the fine detail level of
individual vertices of 3D wireframe models, and makes it possible to treat
dependencies between objects, such as occlusion reasoning, in a
deterministic way.
Left: Coarse 3D object
bounding boxes derived from
2D bounding box detections.
Right: fine-grained 3D shape
model fits improve 3D
localization (bird’s eye views).
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
Scene particles (coarse 3D
geometry and fine-grained
shape). Deterministic occlusion
mask computation by ray
casting and intersection (blue).
A 3D scene model,
consisting of a common
ground plane, a set of 3D
deformable objects, and
an explicit occlusion mask
for each object.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
Object likelihood.
Scene-level likelihood.
An inference scheme that proceeds in
stages, lifting an initial 2D guess
(Initialization) about object locations to a
coarse 3D model (Coarse 3D geometry),
and refining that coarse model into a final
collection of consistent 3D shapes (Final
scene-level inference).
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
(a) Part localization accuracy and 2D pre-detection. (b-c) Example detections
and corresponding 3D reconstructions.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
COARSE+GP (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b)
COARSE+GP based on (a), (c) bird’s eye view of (b). (e) FG+SO+DO+GP shape
model fits (blue: estimated occlusion masks), (f) bird’s eye view of (e). Estimates in
red, ground truth in green.
ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE
3D SHAPE OF MULTIPLE OBJECTS
FG+SO+DO (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b)
FG+SO+DO based on (a), (c) bird’s eye view of (b). (d) FG+SO+DO+GP shape
model fits (blue: estimated occlusion masks), (e) bird’s eye view of (d). Estimates in
red, ground truth in green.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
 Rank possible poses and types for each frame and exploit temporal
coherence between consecutive frames for refinement.
 Cast the estimation of a vehicle’s pose and type as a solution of a
continuous optimization problem over space and time.
 Obtain initial start points by a discrete temporal optimization
reaching a global optimum on a ranked discrete set of possible types
and poses.
 To guarantee the effectiveness of the discrete-continuous optimization,
reduce the search space of potential 3D model types and poses in
each frame for the discrete optimizer.
 This avoids the common, expensive evaluation of all possible discretized
hypotheses.
 The key idea towards efficiency lies in a combination of detecting the
vehicle, rendering the 3D models, matching projected edges to input
images, and using a tree-structured MRF to get fast and globally
optimal inference and to force the vehicle to follow a feasible motion
model in the initial phase.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
Improve pose estimation
over [Toshev et al., 2009] by
processing in continuous
space (columns 1, 2),
reduce wrong classifications
due to incorrect scales
(column 3) and improve
pose estimation over
[Leotta et al., 2011] by using
existing 3D models.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
(a) Framework application flow. (b) Vehicle, described by
orientation α and centroid on the ground plane C = (x, y, 0).
Fast Directional Chamfer Matching (FDCM)
FDCM maps the edge pixels in U and E to an
orientation augmented space. The alignment cost
between the two edge maps is then given by
To update the matching score by setting
Given the shifted but projectively wrong model
projection A_p^l and the projectively correct model
projection area B_q^l, the similarity score for a
pose is calculated by combining the output of
FDCM and the area overlap:
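A hedged sketch of such a combination (the weighting and the exp() mapping are illustrative, not the paper's formula): an IoU-style area overlap between the two projection masks blended with the FDCM edge-alignment cost.

```python
import numpy as np

def similarity_score(fdcm_cost, mask_a, mask_b, alpha=0.5):
    """Combine an FDCM alignment cost with the area overlap between two model
    projection masks. alpha and exp(-cost) are illustrative choices."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    overlap = inter / max(union, 1)          # IoU-style area overlap
    edge_term = np.exp(-fdcm_cost)           # lower chamfer cost -> higher score
    return alpha * edge_term + (1.0 - alpha) * overlap
```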
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
(a) Temporal inference for ranked projections.
(b) Ackermann steering principle where φ = θ/2.
(c) Corresponding points between model’s
projected edges and edge image.
CLASSIFICATION AND POSE ESTIMATION OF
VEHICLES IN VIDEOS BY 3D MODELING WITHIN
DISCRETE-CONTINUOUS OPTIMIZATION
Pose estimation using FDCM only (top row), combining FDCM and MRF
(middle row), combining FDCM, MRF and continuous optimization (bottom row).
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
 The existing 3D pose estimation methods using deep networks can be
divided into two groups:
 (i) predict 2D keypoints from images and recover 3D pose from keypoints;
 (ii) directly predict 3D pose from an image.
 A mixed classification-regression framework that uses a classification
network to produce a discrete multimodal pose estimate and a
regression network to produce a continuous refinement of the estimate
(see the sketch below).
 The framework can accommodate different architectures and loss
functions, leading to multiple classification-regression models.
A high level overview of our problem statement and proposed network architecture
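As referenced above, a minimal decoding sketch for a Bin & Delta style output (array shapes and names are illustrative, not the paper's interface): the classification head picks a coarse pose bin and the regression head refines it with a per-bin delta.

```python
import numpy as np

def decode_bin_delta(bin_probs, deltas, bin_centers):
    """Pick the most likely discrete pose bin and add its regressed offset."""
    k = int(np.argmax(bin_probs))          # discrete, multimodal estimate
    return bin_centers[k] + deltas[k]      # continuous refinement

bin_centers = np.deg2rad(np.arange(0, 360, 30))       # 12 coarse pose bins
bin_probs = np.eye(12)[4]                              # classifier picks bin 4
deltas = np.zeros(12); deltas[4] = np.deg2rad(7.0)     # regressed refinement
print(np.rad2deg(decode_bin_delta(bin_probs, deltas, bin_centers)))   # 127.0
```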
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
the Bin & Delta model
Simple/Naive Bin & Delta
Geodesic Bin & Delta
A MIXED CLASSIFICATION-REGRESSION FRAMEWORK
FOR 3D POSE ESTIMATION FROM 2D IMAGES
One delta network per pose-bin
Best (top row) and Worst (bottom row) images for Category: Bus
Best (top row) and Worst (bottom row) images for Category: Car
The previous optimization problems are as follows:
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
 Not limited to the frontal/rear viewpoint: vehicles may be seen from
any viewpoint, based on 3D Bboxes built around the vehicles.
 The Bbox can be auto-constructed from traffic surveillance data.
 For scenarios where the precise construction cannot be used,
a method for estimating the 3D bounding box is proposed.
 The 3D Bbox is used to normalize the image viewpoint by “unpacking”
the image into a plane.
 During CNN training, the color of the image is randomly altered and a
rectangle of random noise is added at a random position in the image.
 A fine-grained vehicle dataset BoxCars116k, with 116k images of
vehicles from various viewpoints taken by many surveillance cameras.
Example of automatically obtained
3D bounding box used for fine-
grained vehicle classification. Top
left: vehicle with 2D bounding box
annotation, top right: estimated
contour, bottom left: estimated
directions to vanishing points,
bottom right: 3D bounding box
automatically obtained from
surveillance video (green) and our
estimated 3D bounding box (red).
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
3D bounding box and its unpacked version.
Examples of data normalization and
auxiliary data fed to the nets. Left to
right: vehicle with 2D bounding box,
computed 3D bounding box, vectors
encoding viewpoints on the vehicle
(View), unpacked image of the
vehicle (Unpack), and rasterized 3D
bounding box fed to the net
(Rast).
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
Estimation of 3D Bbox. Left to right: image with vehicle 2D Bbox, output of contour
object detector, constructed contour, estimated directions towards vanishing points,
ground truth (green) and estimated (red) 3D Bbox.
The CNN used for estimating
directions towards vanishing
points. The vehicle image is
fed to a ResNet50 with 3
separate outputs, which
predict the directions of the
vanishing points as probabilities
in a quantized angle space
(60 bins from −90° to 90°).
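A small sketch of that quantization (my own helper names; the bin layout follows the 60 bins over [−90°, 90°] mentioned above):

```python
import numpy as np

N_BINS, LO, HI = 60, -90.0, 90.0          # 60 bins of 3 degrees each

def angle_to_bin(angle_deg):
    """Quantize a direction angle into one of the classification bins."""
    idx = int((angle_deg - LO) / (HI - LO) * N_BINS)
    return int(np.clip(idx, 0, N_BINS - 1))

def probs_to_angle(probs):
    """Decode predicted bin probabilities back to an angle (bin center)."""
    width = (HI - LO) / N_BINS
    return LO + (np.argmax(probs) + 0.5) * width

print(angle_to_bin(-90.0), angle_to_bin(0.0), angle_to_bin(89.9))   # 0 30 59
```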
BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES
USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 An FCN for 2D and 3D bounding box detection of cars from monocular
images, intended for autonomous driving applications.
 The introduced network is E2E trainable and detects objects at multiple
scales in a single pass.
 A 3D bounding box representation, which is independent of the image
projection matrix (camera used to take the images).
 The detector may be trained on several different datasets and can also
detect 3D Bboxes on datasets completely different from those it was trained on.
3D bounding boxes
(left) and their top
view (right) detected
by this method. The
front sides of 3D
bounding boxes are
depicted in green,
the rear sides in red.
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 2D Bounding Box - BBTXT
 The 2D Bboxes are represented by the coord.s of their top-left (xmin,
ymin) and bottom-right (xmax, ymax) corners.
 3D Bounding Box - BB3TXT
 A Bbox in 3D has 9 DOF - 3 for position, 3 rotations, and 3 dimensions.
 Coordinates of the projected rear-bottom-left, front-bottom-left, and
front-bottom-right corners, and the y-coordinate of the front-top-left corner.
3D bounding box corners.
Info. stored about 3D Bboxes: each Bbox is defined by 3 lines - front-bottom,
left-bottom, front-left.
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 Together with the requirement that all Bboxes are pinned to the ground plane,
this provides a sufficient amount of info. to reconstruct the 3D world positions
of the 3D Bboxes.
 Inverse Projection
 Compute the inverse of KR and the camera center from
 Reconstruction of the Bottom Side
 use the ground plane equation ax + by + cz + d = 0 (normal n = [a, b, c]T ) and the
inverse projected rays to determine the position of the bottom side of the 3D bounding
box in the world.
 obtain a parallelogram in the ground plane instead of a rectangle from the
re-projection of the 3 points.
Rectification of a parallelogram
(solid) to a rectangle (dashed).
Projection of a 3D bounding box
to the ground plane.
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
 Reconstruction of the Top Side
 use the direction of the bottom-left line as the normal vector of the frontal
plane n_F = [a_F, b_F, c_F] and place the front-bottom-left point in the frontal
plane to calculate the front plane as a_F x + b_F y + c_F z + d_F = 0;
 Finding the intersection of the frontal plane and the ray l_ftl = C + (KR)^-1 x_ftl
gives the position of the vertex X_ftl, which determines the height of the
bounding box.
 Ground Plane Extraction
The RANSAC algorithm for ground plane estimation.
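A generic RANSAC plane-fitting sketch for reference (the standard algorithm; the thesis' sampling strategy, thresholds and refinement steps may differ):

```python
import numpy as np

def ransac_ground_plane(points, iters=200, thresh=0.1, seed=0):
    """Fit a plane n.X + d = 0 to an (N, 3) point array with RANSAC and
    return the plane plus its inlier mask."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                               # degenerate sample
        n = n / norm
        d = -n @ p0
        dist = np.abs(points @ n + d)              # point-to-plane distances
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers
```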
VEHICLE DETECTION AND POSE ESTIMATION
FOR AUTONOMOUS DRIVING (THESIS)
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
 A Deep Cuboid Detector takes a consumer-quality RGB image of a
cluttered scene and localizes all 3D cuboids (box-like objects).
 An E2E deep learning system to detect cuboids across many semantic
categories (e.g., ovens, shipping boxes, and furniture).
 Localize cuboids with a 2D Bbox, and localize the cuboid’s corners,
effectively producing a 3D interpretation of box-like objects.
 Refine keypoints by pooling conv. features iteratively, improving the
baseline method significantly.
 This deep learning cuboid detector is trained in an end-to-end fashion
and is suitable for real-time applications.
2D Object detection vs.
3D Cuboid detection.
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
Deep Cuboid Detection Pipeline. 1) Find RoIs in the image where a cuboid
might be present and train an RPN to output such regions. 2) Features for each
RoI are pooled from a conv. feature map. 3) These pooled features are passed
through two fully connected layers, just like Faster R-CNN. 4) Output normalized
offsets of the vertices from the center of the region. 5) Refine predictions by
performing iterative feature pooling.
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
 The loss function used in the RPN consists of L_anchor-cls, the log loss
over two classes (cuboid vs. not cuboid), and L_anchor-reg, the Smooth
L1 loss of the Bbox regression values for each anchor box;
 The loss function for the R-CNN is made up of L_ROI-cls, the log loss
over two classes (cuboid vs. not cuboid), L_ROI-reg, the Smooth L1
loss of the Bbox regression values for the RoI, and L_ROI-corner, the
Smooth L1 loss over the RoI’s predicted vertex locations, also
referred to as the corner regression loss.
 The complete loss function is a weighted sum of the above-mentioned
losses.
DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
Vertex Refinement via Iterative Feature Pooling. To refine cuboid detection
regions by re-pooling features from conv5 using the predicted bounding boxes.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
 3D object detection and pose estimation from a single image.
 First regresses relatively stable 3D object properties using a deep
CNN and then combines these estimates with geometric constraints
provided by a 2D object Bbox to produce a complete 3D Bbox.
 The first network output estimates the 3D object orientation using a
hybrid discrete-continuous loss, which significantly outperforms the L2
loss.
 The second output regresses the 3D object dimensions, which have relatively
little variance and can be predicted for many object types.
 These estimates, combined with geometric constraints on translation
imposed by the 2D b box, enable to recover a stable and accurate 3D
object pose.
 The perspective projection of a 3D Bbox should fit tightly within its 2D det. window.
 The 3D Bbox is described by its center T, dimensions D and
orientation, by the azimuth, elevation and roll angles.
2D box side parameters
Correspondence btw the
3D bbox and 2D bbox:
Each figure shows a 3D
bbox that surrounds an
object.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
 CNN Regression of 3D Box Parameters:
 Combine the ray direction at the crop center with the
estimated local orientation to compute the global
orientation of the object.
 Faster R-CNN, SSD: Divide the space of the bounding
boxes into several discrete modes “anchor boxes” and
then estimate the continuous offsets applied to each
anchor box.
 Discretize the orientation angle and divide it into
overlapping bins. For each bin, the network
estimates both a confidence probability that the output
angle lies inside the ith bin and the residual rotation
correction applied to the orientation of the center ray of
that bin to obtain the output angle (see the decoding sketch below).
 The residual rotation is represented by two numbers,
for the sine and the cosine of the angle.
 Total loss for the MultiBin orientation:
 The loss for dimension estimation:
Left: Car dimensions. Right:
Illustration of the local and global
orientation of a car. The local
orientation is computed w.r.t. the ray
through the center of the crop.
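As referenced above, a hedged decoding sketch for MultiBin outputs (shapes and sign conventions are illustrative): recover the residual from the predicted (sin, cos) pair of the most confident bin, add it to the bin center, then add the ray angle to convert the local orientation into a global one.

```python
import numpy as np

def decode_multibin(confidences, sin_cos, bin_centers, theta_ray):
    """confidences: (n_bins,), sin_cos: (n_bins, 2), bin_centers: (n_bins,),
    theta_ray: angle of the ray through the crop center. Returns global yaw."""
    i = int(np.argmax(confidences))
    residual = np.arctan2(sin_cos[i, 0], sin_cos[i, 1])   # delta within bin i
    local = bin_centers[i] + residual                     # local orientation
    return local + theta_ray                              # global orientation
```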
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
The architecture for MultiBin estimation of orientation and dimensions,
with 3 branches: the left branch estimates the dimensions of the object
of interest; the other two compute the confidence for each bin and the
cos(∆θ) and sin(∆θ) of each bin.
Qualitative illustration of 2D detection boxes and
estimated 3D projections.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
 Deep MANTA (Many-Tasks), for vehicle analysis from a given image.
 A robust CNN for simultaneous vehicle detection, part localization,
visibility characterization and 3D dimension estimation.
 A coarse-to-fine object proposal that boosts the vehicle detection.
 Deep MANTA localizes vehicle parts even if they are not visible.
 At inference time, the network’s outputs are used by a real-time pose
estimation step for fine orientation estimation and 3D vehicle localization.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
System outputs. Top: 2D vehicle
bboxes, vehicle part localization
and part visibility. Bottom: 3D
vehicle bbox localization and 3D
vehicle part localization. The
camera in blue.
2D/3D model
2D vehicle b box
3D b box
2D part coord.
part visibility vector
3D part coord.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Example of one 2D/3D vehicle
model. (a) the bounding box B, (b)
2D part coordinates S and part
visibility V.
Detection loss.
Visibility loss.
Template similarity loss.
Part loss.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Overview of the Deep
MANTA approach. The
entire input image is
forwarded inside the Deep
MANTA network. Conv.
layers share the same
weights. Moreover, these
3 conv. blocks correspond
to a split of an existing CNN
architecture.
DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Semi-automatic annotation process. (a) weak annotations on a real image (3D
b box). (b) best corresponding 3D models in green. (c) projection of these 3D
models in the image. (d) corresponding mesh of visibility. (e) Final annotations.
3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
 Exploit stereo imagery to place proposals in the form of 3D Bboxes.
 Proposals are obtained by minimizing a function encoding object size priors,
the ground plane, and depth features about free space, point cloud density and
distance to the ground.
Formulate the proposal generation problem as inference in an MRF in which the
proposal y should enclose a high-density region in the point cloud (a simple
density score is sketched after the list of potentials below).
Point cloud density:
Free space:
Height prior:
Height contrast:
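As referenced above, a toy stand-in for the point-cloud-density potential (an axis-aligned box and uniform random points; the paper's potentials are defined over oriented proposals and also include free space, height prior and height contrast):

```python
import numpy as np

def point_density_score(points, box_min, box_max):
    """Fraction of 3D points falling inside an axis-aligned box proposal."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return float(inside.mean())

pts = np.random.default_rng(0).uniform(-5, 5, size=(1000, 3))
print(point_density_score(pts, np.array([-1.0, -1.0, -1.0]),
                          np.array([1.0, 1.0, 1.0])))
```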
3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
 Score Bbox proposals using a CNN built on Fast R-CNN;
 It shares conv. features across all proposals and uses an ROI pooling layer to
compute proposal-specific features;
 It adds a context branch after the last conv. layer, and an orientation regression
loss to jointly learn object location and orientation;
 Features output from the original/context branches are concatenated and fed to the
prediction layers.
 The context regions are obtained by enlarging candidate boxes by a factor of 1.5.
 Smooth L1 loss is used for orientation regression.
 Parameters of the context branch are initialized by copying weights from the original branch.
 OxfordNet trained on ImageNet is used to initialize the weights of the conv. layers and the
branch for candidate boxes; the network is then fine-tuned E2E on the KITTI training set.
3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
 Generate a set of candidate class-specific object proposals, which are
then run through a standard CNN pipeline to obtain object detections.
 An energy minimization approach that places object candidates in 3D
using the fact that objects should be on the ground-plane.
 Score each candidate box projected to the image plane via several
intuitive potentials encoding semantic segmentation, contextual
information, size and location priors and typical object shape.
MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
CNN architecture used to score proposals for
object detection and orientation estimation.
The scoring function is obtained by
combining semantic cues
(both class- and instance-
level segmentation), location
priors, context and shape:
MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
 A method for detecting 3D model instances and estimating their
6D poses from RGB data in a single shot.
 To this end, extend the popular SSD paradigm to cover the full 6D
pose space and train on synthetic model data only.
 It competes or surpasses current state-of-the-art methods that
leverage RGB-D data on multiple challenging datasets.
 It produces results at around 10Hz, which is many times faster
than the related methods.
Discrete 6D pose
space with each
point representing
a classifiable
viewpoint.
The object distance can be
inferred from the projective ratio.
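A hedged sketch of that idea (a simplified reading: a viewpoint rendered at a known canonical distance projects to a box of known size, so the relative size of the detected box gives the depth; the paper builds full 6D hypotheses from this):

```python
import numpy as np

def distance_from_projective_ratio(bbox_detected, bbox_canonical, z_canonical):
    """Infer object distance from the ratio of 2D box diagonals; boxes are
    (xmin, ymin, xmax, ymax)."""
    def diag(b):
        return np.hypot(b[2] - b[0], b[3] - b[1])
    return z_canonical * diag(bbox_canonical) / diag(bbox_detected)

# A detection twice as large as the canonical render is half as far away.
print(distance_from_projective_ratio((0, 0, 200, 100), (0, 0, 100, 50), 1.0))
```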
SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
After predicting 2D detections (a), build 6D hypotheses and run pose
refinement and a final verification. While the unrefined poses (b) are rather
approximate, contour-based refinement (c) produces already visually
acceptable results. Occlusion-aware projective ICP with cloud data (d) leads
to a very accurate alignment.
SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
Schematic overview of the SSD-style network prediction
C denotes the
number of object
classes, V the
number of
viewpoints and R
the number of in-
plane rotation
classes. The other
4 values are
utilized to refine
the corners of the
discrete bounding
boxes.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
 A single-shot approach for simultaneously detecting an object in an
RGB image and predicting its 6D pose without requiring multiple
stages or having to examine multiple hypotheses.
 Unlike a recently proposed single-shot technique for this task,
SSD-6D, which only predicts an approximate 6D pose that must then
be refined, this approach is accurate enough not to require additional post-
processing.
 It is much faster – 50 fps on a Titan X (Pascal) GPU – and more
suitable for real-time processing.
 The key component is a CNN architecture that directly predicts the
2D image locations of the projected vertices of the object’s 3D
bounding box.
 The object’s 6D pose is then estimated using a PnP algorithm.
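A minimal sketch of that last step using OpenCV's PnP solver (the paper uses the 8 box corners plus the centroid and its own solver choice; the corner ordering and the EPnP flag here are assumptions):

```python
import cv2
import numpy as np

def pose_from_box_corners(corners_2d, dims, K):
    """Recover (R, t) from the predicted 2D projections of the 3D bounding-box
    corners. corners_2d must follow the same ordering as corners_3d below."""
    w, h, l = dims
    corners_3d = np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                           for sx in (-1, 1)
                           for sy in (-1, 1)
                           for sz in (-1, 1)], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(corners_3d,
                                  np.asarray(corners_2d, dtype=np.float64),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```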
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
The proposed CNN architecture.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
(a) (b) (c) (d)
(a) An example input image with four objects. (b) The S × S grid showing
cells responsible for detecting the four objects. (c) Each cell predicts 2D
locations of the corners of the projected 3D bounding boxes in the image.
(d) The 3D output tensor from the network, which represents for each cell
a vector consisting of the 2D corner locations, the class probabilities and a
confidence value associated with the prediction.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
In the last column, it shows failure cases due to motion blur,
severe occlusion and specularity.
REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
 A real-time RGB-based pipeline for object detection and 6D pose
estimation.
 This 3D orientation estimation is based on a variant of the Denoising
Autoencoder that is trained on simulated views of a 3D model using
Domain Randomization.
 This so-called Augmented Autoencoder (AAE) has several advantages
over existing methods:
 Since the training is independent of concrete representations of
object orientations within SO(3) (e.g. quaternions), it is able to handle
ambiguous poses caused by symmetric views, because one-to-many
mappings from images to orientations are avoided.
 It learns representations that specifically encode 3D orientations while
achieving robustness against occlusion and cluttered backgrounds, and
it generalizes to different environments and test sensors.
 The AAE does not require any real pose-annotated training data;
Instead, it is trained to encode 3D model views in a self-supervised way,
overcoming the need of a large pose-annotated dataset.
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
6D Object Detection pipeline with homogeneous transformation
(top-right) and depth-refined result (bottom-right) .
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Training process for the AAE; a) reconstruction target batch x of
uniformly sampled SO(3) object views; b) geometric and color
augmented input; c) reconstruction xˆ after 30000 iterations.
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Autoencoder CNN architecture with occluded test input
IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Top: creating a codebook from the encodings of discrete synthetic
object views; bottom: object detection and 3D orientation estimation
using the NN(s) with highest cosine similarity from the codebook.
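A minimal sketch of that nearest-neighbour lookup (array names are illustrative): normalize the test encoding and the codebook, take cosine similarities, and return the rotations of the top-k most similar synthetic views.

```python
import numpy as np

def nearest_orientations(code, codebook, rotations, k=1):
    """codebook: (M, D) encodings of discrete synthetic views; rotations: the
    rotation assigned to each codebook entry; code: (D,) test encoding."""
    code = code / np.linalg.norm(code)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = cb @ code                        # cosine similarity to every view
    top = np.argsort(-sims)[:k]             # indices of the k best matches
    return [(rotations[i], float(sims[i])) for i in top]
```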
 

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

Recently uploaded

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
manasideore6
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
anoopmanoharan2
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
Kamal Acharya
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 

Recently uploaded (20)

Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 

3-d interpretation from single 2-d image for autonomous driving

  • 7. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES Coordinate system definitions for 3D object localization. The SFM ground plane is (n⊤, h)⊤. System overview for obtaining SFM cues on objects, depicted in green. Given the camera intrinsic calibration matrix K, the bottom of a 2D Bbox, b = (x, y, 1)⊤, can be back-projected to 3D through the ground plane {h, n}.
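As a concrete illustration of this back-projection, the following minimal Python sketch intersects the viewing ray of the box's bottom point with the ground plane. The function name and the plane convention n⊤X = h (in camera coordinates) are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def backproject_to_ground(b_px, K, n, h):
    """Back-project the bottom point of a 2D bbox onto the ground plane.

    b_px : (x, y) pixel at the bottom of the 2D box
    K    : 3x3 camera intrinsic matrix
    n, h : ground plane in camera coordinates, assumed to satisfy n^T X = h
    Returns the 3D point X on the ground plane (camera frame).
    """
    b = np.array([b_px[0], b_px[1], 1.0])
    ray = np.linalg.inv(K) @ b          # viewing ray direction (up to scale)
    lam = h / (n @ ray)                 # scale so that n^T (lam * ray) = h
    return lam * ray
```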
  • 8. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES Output of this localization system. The bottom left panel shows the monocular SFM camera trajectory. The top panel shows input 2D bounding boxes in red, horizon from estimated ground plane and the estimated 3D bounding boxes in green with distances in magenta. The bottom right panel shows the top view of the ground truth object localization from laser scanner in red, compared to this 3D object localization in blue.
  • 9. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT  A generative model is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene.  To infer the scene topology, geometry and traffic activities from a video sequence from a single camera mounted on a moving car.  It takes advantage of dynamic info. in the form of vehicle tracklets and static info. from semantic labels and geometry (i.e., vanishing points). Monocular 3D Urban Scene Understanding. (Left) Image cues. (Right) Estimated layout: Detections belonging to a tracklet are depicted with the same color, traffic activities are depicted with red lines. Vehicle tracklets Vanishing points Scene labels
  • 10. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT  Assume that the road surface is flat, and model the bird’s eye perspective as the y = 0 plane of the standard camera coordinate system;  Detect vehicles in each frame independently using a semi-supervised version of the part-based detector in order to obtain orientation estimates;  2D tracklets estimated using ’tracking-by-detection’: First adjacent frames are linked and then short tracklets are associated to create longer ones via the Hungarian method.  3D vehicle tracklets are obtained by projecting the 2D tracklets into bird’s eye perspective, employing error-propagation to obtain covariance estimates.  Model lanes with splines, place parking spots at equidistant places along street boundaries.  The model infers whether the cars participate in traffic or are parked in order to get more accurate layout estimations.  Latent variables are employed to associate each detected vehicle with positions in one of these lanes or parking spaces.
  • 11. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT Graphical model and road model with lanes represented as B-splines. Transform the 2D tracklets into 3D tracklets: project the image coordinates into bird’s eye perspective by backprojecting objects into 3D using several complementary cues. Towards this goal, use the 2D bounding box footpoint in combination with the estimated road plane. Two types of dominant vanishing points: forward facing street and crossing street. Three semantic classes, i.e., road, sky and background.
  • 12. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT (Left) Tracklets from all frames superimposed. (Middle) Inference result with θ known and (Right) θ unknown. The inferred intersection layout in gray. Ground truth labels in blue. Detected activities in red.
  • 13. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS  A method for single image 3D cuboid object detection and multi-view object SLAM without prior object model, and the two aspects can benefit each other.  For 3D detection, generate cuboid proposals from 2D Bboxes and vanishing points sampling.  The proposals are further scored and selected to align with image edges.  Multi-view bundle adjustment with measurement functions is proposed to jointly optimize camera poses, objects and points, utilizing single view detection results.  Objects can provide more geometric constraints and scale consistency compared to points.  Objects are utilized in two ways: providing depth initialization for points difficult to triangulate and providing geometry constraints in BA.  The estimated camera poses from SLAM can improve the single-view object detection.
  • 14. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Monocular 3D object detection and mapping without prior object models. Mesh model is just for visualization and not used for detection. (a) ICL NUIM data with various objects, whose position, orientation and dimension are optimized by SLAM. (b) KITTI 07. With object constraints, monocular SLAM can build a consistent map and correct scale drift, without loop closure and constant camera height assumption. (a) (b)
  • 15. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS  A 3D cuboid is described by 9 DoF parameters: 3 DoF position, 3 DoF rotation and 3 DoF dimension.  The cuboid coordinate frame is built at the cuboid center, aligned with the main axes.  The camera intrinsic calibration K is also known.  The cuboid’s projected corners fit tightly within the 2D bounding box; there are 4 constraints corresponding to 4 sides of a rectangle, which cannot fully constrain all 9 parameters.  A 3D cuboid has 3 orthogonal axes and can form 3 VPs after perspective projection, depending on the object rotation R and camera calibration K.  After getting the 8 cuboid corners in 2D, back-project to 3D space to compute the 3D position and dimensions, which are determined up to a scale factor.  The scale can be reasoned from the camera height to ground, prior object size and so on.
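A small sketch of how the three vanishing points could be obtained from R and K, under the common convention that each column of R is one cuboid axis direction in the camera frame; the function name is illustrative only.

```python
import numpy as np

def cuboid_vanishing_points(K, R):
    """Vanishing points of a cuboid's three orthogonal axes.

    K : 3x3 camera intrinsics, R : 3x3 object rotation (camera frame).
    Each column of R is an axis direction; its image VP is K @ R[:, i].
    Returns three 2D points (None if an axis is parallel to the image plane).
    """
    vps = []
    for i in range(3):
        v = K @ R[:, i]
        vps.append(v[:2] / v[2] if abs(v[2]) > 1e-9 else None)
    return vps
```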
  • 16. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Proposals generation from 2D object box. Cuboids are divided into three categories depending on the number of observable faces. If one corner is estimated, the other seven corners can also be computed from vanishing points (VPs). For example in (a), if corner 1 is sampled, then corner 2 and 3 can be determined through ray intersection of VP line and rectangles, followed by corner 4 and other bottom corners.
  • 17. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Denote the image as I and cuboid proposal as x, then the cost function is defined as: Cuboid proposal scoring. (Left) Edges to align and score the proposals. (Right) Cuboid proposals generated from the same 2D cyan bounding box. The top left is the best and bottom right is the worst after scoring.
  • 18. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS Camera poses C = {ci}, 3D landmark objects O = {oj}, points P = {pk}. BA is formulated as an NLS problem. Camera-object 3D measurement: transform the landmark object to the camera frame and compare with the measurement. Camera-object 2D measurement: project the landmark cuboid onto the image plane to get the 2D Bbox and compare it with the detected 2D Bbox. Object-point measurement: first transform the point to the cuboid frame and compare with the cuboid dimensions. Point-camera measurement: the standard 3D point re-projection error in feature-based SLAM.
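The slide only summarizes these measurement functions at a high level; the sketch below is a simplified, hypothetical version of two of the residuals (camera-object in 3D and object-point), not the paper's exact error terms.

```python
import numpy as np

def camera_object_error(T_cw, T_wo, T_co_meas):
    """3D camera-object error: transform the landmark object into the camera
    frame and compare with the single-view measurement.
    All T_* are 4x4 homogeneous transforms (w: world, c: camera, o: object)."""
    T_co_pred = T_cw @ T_wo
    delta = np.linalg.inv(T_co_meas) @ T_co_pred
    t_err = delta[:3, 3]                                       # translation difference
    r_err = np.arccos(np.clip((np.trace(delta[:3, :3]) - 1) / 2, -1, 1))  # rotation angle
    return np.concatenate([t_err, [r_err]])

def object_point_error(T_wo, p_w, dims):
    """Object-point error: transform a map point into the cuboid frame and
    penalize the part of it that lies outside the cuboid dimensions."""
    p_o = (np.linalg.inv(T_wo) @ np.append(p_w, 1.0))[:3]
    return np.maximum(np.abs(p_o) - dims / 2.0, 0.0)
```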
  • 19. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS (a) The object SLAM pipeline. Single view object detection provides cuboid landmark and depth initialization for SLAM while SLAM can estimate camera pose for more accurate object detection. (b) Measurement errors between cameras, objects and points during BA. (a) (b)
  • 20. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS  Object association based on point matching:  Dynamic points detected through descriptor matching and epipolar line checking;  Then first associate points to objects if points are observed enough times inside the 2D object bounding box and close to the cuboid centroid in 3D space;  Then find the object match with the largest number of shared map points exceeding a threshold (10 for example);  This works well for wide-baseline matching, repetitive objects, occlusions, and dynamic scenarios. Green points are normal map points, and other color points are associated to objects with the same color. The front cyan moving car is not added as a SLAM landmark as no feature point is associated with it. Points in object overlapping areas are not associated with any object.
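A toy sketch of the shared-map-point association rule described above; the identifiers and data layout are assumed for illustration.

```python
def associate_objects(detection_points, landmark_points, min_shared=10):
    """Associate a new detection with an existing object landmark by the
    number of shared map points (threshold of ~10 as mentioned on the slide).

    detection_points: set of map-point ids observed inside the 2D box
    landmark_points:  dict {object_id: set of map-point ids}
    Returns the matched object id, or None if no landmark shares enough points."""
    best_id, best_shared = None, 0
    for obj_id, pts in landmark_points.items():
        shared = len(detection_points & pts)
        if shared > best_shared:
            best_id, best_shared = obj_id, shared
    return best_id if best_shared >= min_shared else None
```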
  • 21. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES  A probabilistic 3D scene model that integrates SoA multiclass object detection, object tracking and scene labeling together with geometric 3D reasoning.  The model is able to represent complex object interactions such as inter-object occlusion, physical exclusion between objects, and geometric context.  Inference in this model allows to jointly recover the 3D scene context and perform 3D multi-object tracking from a mobile observer, for objects of multiple categories, using only monocular video as input.  This system performs explicit occlusion reasoning and is capable of tracking objects that are partially occluded for extended periods of time, or objects that have never been observed to their full extent.  A joint scene tracklet model for the evidence collected over multiple frames substantially improves performance.
  • 22. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES Overview of this system. For each input frame, run an object detector and extract semantic scene labels. Object hypotheses are fused to short-term tracklets and put into a strong 3D scene model with explicit occlusion reasoning. MCMC inference allows tractable inference in the Bayesian scene model, while HMM scene tracking ensures long-term associations.
  • 23. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES The multi-frame 3D inference and explicit occlusion reasoning for onboard vehicle and pedestrian tracking with overlaid horizon estimate for different public SoA datasets. The 3D scene state X is expressed in the world coordinate system; the camera is parameterized by the rotation angles of a vehicle-mounted camera.
  • 24. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES Employing the theorem of intersecting lines to derive the distance to an object along the ground plane in the viewing direction. Approximate objects by their Bboxes and project them onto the image. By leveraging the depth order obtained from the 3D scene model, occluded object regions can be estimated.
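The intersecting-lines relation reduces to the familiar similar-triangles formula sketched below; this is a generic sketch assuming a roughly level camera with known height and a known horizon row, not the paper's full derivation.

```python
def ground_plane_distance(v_foot, v_horizon, f_px, cam_height):
    """Distance to an object along the ground plane from the image row of its
    footpoint, via the intersecting-lines (similar-triangles) relation:
        Z = f * H_cam / (v_foot - v_horizon)

    v_foot, v_horizon : image rows of the object base and of the horizon
    f_px              : focal length in pixels
    cam_height        : camera height above the ground plane (metres)."""
    dv = v_foot - v_horizon
    if dv <= 0:
        return float('inf')   # footpoint at or above the horizon
    return f_px * cam_height / dv
```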
  • 25. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
  • 26. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS  This work extends part-based models by using accurate geometric models both in the learning phase and at detection.  The object model is defined as a number of roughly planar aspect models together with a set of typical object poses;  In the learning phase, manual annotations are used to reduce perspective distortion before learning the part-based models.  Training on rectified images using a deformable part-based model (DPM) leads to models which are more specific.  At the same time a set of representative object poses are learnt.  Transform the image according to each of the learnt typical poses.  These are used at detection to remove perspective distortion.  Scores from the aspect detectors are generated by running each aspect model on each of the transformed images.  Detections from the different aspect models are combined and thresholded to produce the final object detection.
  • 27. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Annotation of each visible aspect of the object in training images The two aspect models for the bus category. The upper row shows the frontal model and the lower row shows the side model.
  • 28. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
  • 29. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
  • 30. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS A training example is similar to a pose P if the average angular deviations for the front and side are below a predefined threshold. (Figure: measure of angular deviation for pose similarity.)
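A trivial sketch of this assignment rule; the threshold value is a placeholder, as the slide does not state it.

```python
import numpy as np

def similar_to_pose(front_devs_deg, side_devs_deg, thresh_deg=15.0):
    """A training example is assigned to a pose P if the average angular
    deviations of its front and side aspects both stay below a threshold
    (the 15-degree value here is an assumed placeholder)."""
    return (np.mean(front_devs_deg) < thresh_deg) and (np.mean(side_devs_deg) < thresh_deg)
```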
  • 31. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Overview of the detection pipeline. The input image is transformed according to each of the representative poses. This produces multiple images that are individually run through the aspect detectors, creating a set of score pyramids containing the detector scores at different scales. These are merged into one pyramid per aspect, in the original image coordinate system. Finally, the front and side scores are combined and non-maximum suppression is performed.
  • 32. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Left: how to estimate the side location (brown dot) given the frontal location (blue dot) and the size of the skewed frontal bounding box. Right: search in a small neighborhood (blue circle) of the expected location for each level. The score combination can be expressed as
  • 33. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS Detected bounding boxes are shown in green and their layout in red.
  • 34. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL  Given a monocular image, localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes.  The approach extends the deformable part-based model to reason in 3D.  It represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box.  Model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint.  The model reasons about face visibility patterns called aspects.  Train the cuboid model jointly and share weights across all aspects to attain efficiency.  Inference then entails sliding and rotating the box in 3D and scoring object hypotheses.  While the search space is discretized for inference, the variables are continuous in the model.
  • 35. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL The deformable 3D cuboid model. Viewpoint angle θ.
  • 36. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL Aspects, together with the range of θ that they cover, for (left) cars and (right) beds.
  • 37. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
  • 38. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL where p = (p1, · · · , p6 ) and V (i, a) is a binary variable encoding whether face i is visible under aspect a. Note that a = a(θ, s) can be deterministically computed from the rotation angle θ and the position of the stitching point s (which we assume to always be visible), which in turn determines the face visibility V.
  • 39. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL Learned models for (left) bed, (right) car.
  • 40. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL We use ref to index the first visible face in the aspect model.
  • 41. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL Inference in this model can be done by computing
  • 42. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL KITTI: examples of car detections. (top) Ground truth, (bottom) The 3D detections, augmented with best fitting CAD models to visualize inferred 3D box orientations.
  • 43. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS  Scene understanding from the perspective of 3D shape modeling: a 3D scene representation that reasons jointly about the 3D shape of multiple objects.  It allows expressing 3D geometry and occlusion at the fine detail level of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way. Left: Coarse 3D object bounding boxes derived from 2D bounding box detections. Right: fine-grained 3D shape model fits improve 3D localization (bird’s eye views).
  • 44. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS Scene particles (coarse 3D geometry and fine-grained shape). Deterministic occlusion mask computation by ray casting and intersection (blue). A 3D scene model, consisting of a common ground plane, a set of 3D deformable objects, and an explicit occlusion mask for each object.
  • 45. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS Object likelihood. Scene-level likelihood. An inference scheme that proceeds in stages, lifting an initial 2D guess (Initialization) about object locations to a coarse 3D model (Coarse 3D geometry), and refining that coarse model into a final collection of consistent 3D shapes (Final scene-level inference).
  • 46. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS (a) Part localization accuracy and 2D pre-detection. (b-c) Example detections and corresponding 3D reconstructions.
  • 47. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS COARSE+GP (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b) COARSE+GP based on (a), (c) bird’s eye view of (b). (d) FG+SO+DO+GP shape model fits (blue: estimated occlusion masks), (e) bird’s eye view of (d). Estimates in red, ground truth in green.
  • 48. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS FG+SO+DO (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b) FG+SO+DO based on (a), (c) bird’s eye view of (b). (d) FG+SO+DO+GP shape model fits (blue: estimated occlusion masks), (e) bird’s eye view of (d). Estimates in red, ground truth in green.
  • 49. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION  Rank possible poses and types for each frame and exploit temporal coherence between consecutive frames for refinement.  Cast the estimation of a vehicle’s pose and type as a solution of a continuous optimization problem over space and time.  Obtain initial start points by a discrete temporal optimization reaching a global optimum on a ranked discrete set of possible types and poses.  To guarantee effectiveness of the discrete-continuous optimization, reduce the search space of potential 3D model types and poses for each frame for the discrete optimizer.  This avoids the common, expensive evaluation of all possible discretized hypotheses.  The key idea towards efficiency lies in a combination of detecting the vehicle, rendering the 3D models, matching projected edges to input images, and using a tree-structured MRF to get fast and globally optimal inference and to force the vehicle to follow a feasible motion model in the initial phase.
  • 50. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION Improve pose estimation over [Toshev et al., 2009] by processing in continuous space (columns 1, 2), reduce wrong classifications due to incorrect scales (column 3) and improve pose estimation over [Leotta et al., 2011] by using existing 3D models.
  • 51. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION (a) Framework application flow. (b) Vehicle, described by orientation α and centroid on the ground plane C = (x, y, 0). Fast Directional Chamfer Matching (FDCM): FDCM maps the edge pixels in U and E to an orientation-augmented space; the alignment cost between the two edge maps is a directional chamfer distance, which is used to update the matching score. Given the shifted but projectively wrong model projection area A^p_l and the projectively correct model projection area B^q_l, the similarity score for a pose is calculated by combining the output of FDCM and the area overlap.
  • 52. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION (a) Temporal inference for ranked projections. (b) Ackermann steering principle where φ = θ/2. (c) Corresponding points between model’s projected edges and edge image.
  • 53. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION Pose estimation using FDCM only (top row), combining FDCM and MRF (middle row), combining FDCM, MRF and continuous optimization (bottom row).
  • 54. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES  The existing 3D pose estimation methods using deep networks can be divided in two groups:  (i) predict 2D keypoints from images and recover 3D pose from keypoints;  (ii) directly predict 3D pose from an image.  A mixed classification-regression framework that uses a classification network to produce a discrete multimodal pose estimate and a regression network to produce a continuous refinement of the estimate.  The framework can accommodate different architectures and loss functions, leading to multiple classification-regression models. A high-level overview of the problem statement and proposed network architecture.
  • 55. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES
  • 56. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES The Bin & Delta model, in its Simple/Naive Bin & Delta and Geodesic Bin & Delta variants.
  • 57. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES One delta network per pose-bin. Best (top row) and worst (bottom row) images for the Bus and Car categories. The previous optimization problems are formulated accordingly.
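A minimal sketch of how a Bin & Delta prediction could be composed into a continuous pose estimate; the variable names and the 1-D angle simplification are assumptions for illustration (the paper works on full 3D rotations).

```python
import numpy as np

def bin_delta_pose(bin_logits, deltas, bin_centers):
    """Mixed classification-regression readout: the classification head picks
    a pose bin, a per-bin regression head refines it with a continuous delta.

    bin_logits : (B,) scores over pose bins
    deltas     : (B,) per-bin angular corrections (one delta network per bin)
    bin_centers: (B,) representative angles of each bin (radians)."""
    k = int(np.argmax(bin_logits))
    return bin_centers[k] + deltas[k]
```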
  • 58. BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE  Not limited to the frontal/rear viewpoint: vehicles are allowed to be seen from any viewpoint, based on 3D Bboxes built around the vehicles.  The Bbox can be auto-constructed from traffic surveillance data.  For scenarios where it is not possible to use the precise construction, a method for estimating the 3D bounding box is proposed.  The 3D Bbox is used to normalize the image viewpoint by “unpacking” the image into a plane.  During CNN training, the color of the image is randomly altered and a rectangle with random noise is added at a random position in the image.  A fine-grained vehicle dataset BoxCars116k, with 116k images of vehicles from various viewpoints taken by many surveillance cameras.
  • 59. Example of automatically obtained 3D bounding box used for fine-grained vehicle classification. Top left: vehicle with 2D bounding box annotation, top right: estimated contour, bottom left: estimated directions to vanishing points, bottom right: 3D bounding box automatically obtained from surveillance video (green) and our estimated 3D bounding box (red). BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
  • 60. 3D bounding box and its unpacked version. Examples of data normalization and auxiliary data fed to the nets. Left to right: vehicle with 2D bounding box, computed 3D bounding box, vectors encoding viewpoints on the vehicle (View), unpacked image of the vehicle (Unpack), and rasterized 3D bounding box fed to the net (Rast). BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
  • 61. Estimation of 3D Bbox. Left to right: image with vehicle 2D Bbox, output of contour object detector, constructed contour, estimated directions towards vanishing points, ground truth (green) and estimated (red) 3D Bbox. A CNN is used to estimate the directions towards the vanishing points. The vehicle image is fed to ResNet50 with 3 separate outputs which predict the directions of the vanishing points as probabilities over a quantized angle space (60 bins from −90º to 90º). BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
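The 60-bin angle quantization can be illustrated with the following helper functions; this is a generic sketch, not the authors' code.

```python
import numpy as np

def angle_to_bin(angle_deg, n_bins=60, lo=-90.0, hi=90.0):
    """Quantize a vanishing-point direction into one of 60 bins on (-90, 90)."""
    idx = int((angle_deg - lo) / (hi - lo) * n_bins)
    return int(np.clip(idx, 0, n_bins - 1))

def bin_to_angle(idx, n_bins=60, lo=-90.0, hi=90.0):
    """Convert a bin index back to the angle (degrees) of its bin centre."""
    return lo + (idx + 0.5) * (hi - lo) / n_bins
```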
  • 62. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  A FCN for 2D and 3D bounding box detection of cars from monocular images intended for autonomous driving applications.  The introduced network is E2E trainable and detects objects at multiple scales in a single pass.  A 3D bounding box representation, which is independent of the image projection matrix (camera used to take the images).  The detector may be trained on several different datasets and also detect 3D Bboxes on completely different datasets than it was trained on. 3D bounding boxes (left) and their top view (right) detected by this method. The front sides of 3D bounding boxes are depicted in green, the rear sides in red.
  • 63. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  2D Bounding Box - BBTXT  The 2D Bboxes are represented by the coord.s of their top-left (xmin, ymin) and bottom-right (xmax, ymax) corners.  3D Bounding Box - BB3TXT  A Bbox in 3D has 9 DOF - 3 for position, 3 rotations, and 3 dimensions.  Coord.s of the projected rear-bottom-left, front-bottom-left, and front-bottom-right corners and the y-coordinate of the front-top-left corner. 3D bounding box corners. Info. stored about 3D Bboxes: each Bbox is defined by 3 lines - front-bottom, left-bottom, front-left.
  • 64. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  Together with the requirement that all Bboxes are pinned to the ground plane, this provides a sufficient amount of info. to reconstruct the 3D world positions of the 3D Bboxes.  Inverse Projection  Compute the inverse of KR and the camera center from the projection matrix.  Reconstruction of the Bottom Side  use the ground plane equation ax + by + cz + d = 0 (normal n = [a, b, c]⊤) and the inverse projected rays to determine the position of the bottom side of the 3D bounding box in the world.  the re-projection of 3 points yields a parallelogram in the ground plane instead of a rectangle. Rectification of a parallelogram (solid) to a rectangle (dashed). Projection of a 3D bounding box to the ground plane.
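A compact sketch of the inverse projection step, assuming a projection matrix P = K[R|t] and a world ground plane ax + by + cz + d = 0; function and variable names are illustrative.

```python
import numpy as np

def backproject_to_plane(x_px, P, plane):
    """Intersect the viewing ray of a pixel x with a world plane ax+by+cz+d=0.

    P     : 3x4 projection matrix P = K [R | t]
    plane : (a, b, c, d)
    Returns the 3D intersection point in world coordinates."""
    KR, Kt = P[:, :3], P[:, 3]
    C = -np.linalg.inv(KR) @ Kt               # camera centre
    ray = np.linalg.inv(KR) @ np.append(x_px, 1.0)
    n, d = np.array(plane[:3]), plane[3]
    lam = -(n @ C + d) / (n @ ray)            # solve n.(C + lam*ray) + d = 0
    return C + lam * ray
```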
  • 65. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)  Reconstruction of the Top Side  use the direction of the bottom-left line as the normal vector of the frontal plane n_F = [a_F, b_F, c_F] and place the front-bottom-left point in the frontal plane to calculate the front plane as a_F x + b_F y + c_F z + d_F = 0;  Finding the intersection of the frontal plane and the ray l_ftl = C + (KR)^(-1) x_ftl gives the position of the vertex X_ftl to determine the height of the bounding box.  Ground Plane Extraction The RANSAC algorithm for ground plane estimation.
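The ground plane extraction step is only described as RANSAC here; the following is a generic RANSAC plane-fitting sketch under that reading (the thresholds and iteration counts are placeholders).

```python
import numpy as np

def ransac_ground_plane(points, n_iters=200, inlier_thresh=0.1, rng=None):
    """Minimal RANSAC sketch for fitting a ground plane to 3D points.
    points : (N, 3) array. Returns (a, b, c, d) with unit normal, or None."""
    rng = np.random.default_rng() if rng is None else rng
    best_plane, best_inliers = None, 0
    for _ in range(n_iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:          # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p1
        inliers = np.sum(np.abs(points @ n + d) < inlier_thresh)
        if inliers > best_inliers:
            best_plane, best_inliers = np.append(n, d), inliers
    return best_plane
```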
  • 66. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)
  • 67. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES  A Deep Cuboid Detector takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects).  An E2E deep learning system to detect cuboids across many semantic categories (e.g., ovens, shipping boxes, and furniture).  Localize cuboids with a 2D Bbox, and localize the cuboid’s corners, effectively producing a 3D interpretation of box-like objects.  Refine keypoints by pooling conv. features iteratively, improving the baseline method significantly.  This deep learning cuboid detector is trained in an end-to-end fashion and is suitable for real-time applications. 2D Object detection vs. 3D Cuboid detection.
  • 68. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES Deep Cuboid Detection Pipeline. 1) find RoIs in the image where a cuboid might be present and train an RPN to output such regions. 2) features for each RoI are pooled from a conv. feature map. 3) These pooled features are passed through two fully connected layers just like Faster R-CNN. 4) output normalized offsets of the vertices from the center of the region. 5) refine predictions by performing iterative feature pooling.
  • 69. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES  The loss function used in the RPN consists of L_anchor−cls, the log loss over two classes (cuboid vs. not cuboid) and L_anchor−reg, the Smooth L1 loss of the Bbox regression values for each anchor box;  The loss function for the R-CNN is made up of L_ROI−cls, the log loss over two classes (cuboid vs. not cuboid), L_ROI−reg, the Smooth L1 loss of the Bbox regression values for the RoI, and L_ROI−corner, the Smooth L1 loss over the RoI’s predicted vertex locations, also referred to as the corner regression loss.  The complete loss function is a weighted sum of the above-mentioned losses.
  • 70. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES Vertex Refinement via Iterative Feature Pooling. To refine cuboid detection regions by re-pooling features from conv5 using the predicted bounding boxes.
  • 71. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY  3D object detection and pose estimation from a single image.  First regresses relatively stable 3D object properties using a deep CNN and then combines these estimates with geometric constraints provided by a 2D object Bbox to produce a complete 3D Bbox.  The first network output estimates the 3D object orientation using a hybrid discrete-continuous loss, which significantly outperforms the L2 loss.  The second output regresses the 3D object dimensions, which have relatively little variance and can be predicted for many object types.  These estimates, combined with geometric constraints on translation imposed by the 2D Bbox, enable the recovery of a stable and accurate 3D object pose.
  • 72.  Perspective projection of a 3D Bbox fitting tightly within its 2D detection window.  The 3D Bbox is described by its center T, dimensions D and orientation, given by the azimuth, elevation and roll angles. (Figure annotations: 2D box side parameters; correspondence btw the 3D bbox and 2D bbox. Each figure shows a 3D bbox that surrounds an object.) 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
  • 73.  CNN Regression of 3D Box Parameters:  Combine the ray direction at the crop center with the estimated local orientation to compute the global orientation of the object.  Faster R-CNN, SSD: Divide the space of the bounding boxes into several discrete modes “anchor boxes” and then estimate the continuous offsets applied to each anchor box.  Discretize the orientation angle and divide into overlapping bins. For each bin, the CNN network estimates both a confidence probability that the output angle lies inside the ith bin and the residual rotation correction applied to the orientation of the center ray of that bin to obtain the output angle.  The residual rotation is represented by two numbers, for the sine and the cosine of the angle.  Total loss for the MultiBin orientation:  The loss for dimension estimation: Left: Car dimensions. Right: Illustration of local and global orientation of a car. The local orientation computed wrt the ray through the center of the crop. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
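A small sketch of combining the ray direction through the crop centre with the predicted local orientation; sign conventions vary between implementations, and this assumes yaw measured from the optical axis, so treat it as illustrative only.

```python
import numpy as np

def global_yaw_from_local(theta_local, u_center, fx, cx):
    """Recover the global (egocentric) yaw from the local (allocentric)
    orientation predicted by the network and the ray through the crop centre:
        theta_global = theta_local + theta_ray

    u_center : horizontal pixel coordinate of the 2D box centre
    fx, cx   : focal length and principal point (pixels)."""
    theta_ray = np.arctan2(u_center - cx, fx)   # angle of the viewing ray
    return theta_local + theta_ray
```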
  • 74. The architecture for MultiBin estimation for orientation and dimension estimation with 3 branches: The left is for estimation of dimensions of the object of interest. The other two compute the confidence for each bin as well as cos(∆θ) and sin(∆θ) for each bin. Qualitative illustration of 2D detection boxes and estimated 3D projections. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
  • 75. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE  Deep MANTA (Many-Tasks), for vehicle analysis from a given image.  A robust CNN for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation.  A coarse-to-fine object proposal that boosts the vehicle detection.  Deep MANTA localizes vehicle parts even if they are not visible.  At inference time, the network’s outputs are used by a real-time pose estimation step for fine orientation estimation and 3D vehicle localization.
  • 76. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE System outputs. Top: 2D vehicle bboxes, vehicle part localization and part visibility. Bottom: 3D vehicle bbox localization and 3D vehicle part localization. The camera is shown in blue. (Figure annotations: 2D/3D model, 2D vehicle Bbox, 3D Bbox, 2D part coordinates, part visibility vector, 3D part coordinates.)
  • 77. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE Example of one 2D/3D vehicle model. (a) the bounding box B, (b) 2D part coordinates S and part visibility V. The training objective combines a detection loss, a visibility loss, a template similarity loss and a part loss.
  • 78. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE Overview of the Deep MANTA approach. The entire input image is forwarded through the Deep MANTA network. Conv. layers share the same weights. Moreover, these 3 conv. blocks correspond to the split of an existing CNN architecture.
  • 79. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE Semi-automatic annotation process. (a) weak annotations on a real image (3D Bbox). (b) best corresponding 3D models in green. (c) projection of these 3D models in the image. (d) corresponding mesh of visibility. (e) Final annotations.
  • 80. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS DETECTION  Exploit stereo imagery to place proposals in the form of 3D Bboxes.  Proposals are obtained by minimizing a function encoding object size priors, ground plane and depth features about free space, point cloud densities and distance to the ground. Formulate the proposal generation problem as inference in an MRF in which the proposal y should enclose a high-density region in the point cloud, with potentials for point cloud density, free space, a height prior and height contrast.
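As an illustration of the point-cloud-density potential, a simplified axis-aligned version could look like the sketch below; the actual method scores oriented boxes efficiently with integral structures, so this is not the paper's implementation.

```python
import numpy as np

def point_density_score(box_min, box_max, points):
    """Point-cloud density potential for ranking 3D box proposals: the
    fraction of scene points (e.g. from stereo) that fall inside the box.

    box_min, box_max : (3,) axis-aligned box corners, same frame as points
    points           : (N, 3) point cloud."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return inside.sum() / max(len(points), 1)
```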
  • 81. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS DETECTION  Score Bbox proposals using a CNN built on Fast R-CNN;  It shares conv. features across all proposals and uses a ROI pooling layer to compute proposal-specific features;  Adds a context branch after the last conv. layer, and an orientation regression loss to jointly learn object location and orientation;  Features output from the original/context branches are concatenated and fed to the prediction layers.  The context regions are obtained by enlarging candidate boxes by a factor of 1.5.  Smooth L1 loss for orientation regression.  Parameters of the context branch are initialized by copying weights from the original branch.  OxfordNet trained on ImageNet to initialize the weights of conv. layers and the branch for candidate boxes, then fine-tune it E2E on the KITTI training set.
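The 1.5x context enlargement mentioned above is straightforward; a possible helper (purely illustrative) is shown here.

```python
def enlarge_box(box, factor=1.5):
    """Context region used by the context branch: the candidate box enlarged
    by a factor around its centre. box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```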
  • 82. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS DETECTION
  • 83. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING  Generate a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain object detections.  An energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane.  Score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors and typical object shape.
  • 84. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING CNN architecture used to score proposals for object detection and orientation estimation. The scoring function combines semantic cues (both class and instance level segmentation), location priors, context and shape.
  • 85. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
  • 86. SSD-6D: MAKING RGB-BASED 3D DETECTION AND 6D POSE ESTIMATION GREAT AGAIN  A method for detecting 3D model instances and estimating their 6D poses from RGB data in a single shot.  To this end, it extends the popular SSD paradigm to cover the full 6D pose space and trains on synthetic model data only.  It matches or surpasses current state-of-the-art methods that leverage RGB-D data on multiple challenging datasets.  It produces results at around 10 Hz, which is many times faster than related methods. Discrete 6D pose space with each point representing a classifiable viewpoint. The object distance can be inferred from the projective ratio (see the sketch below).
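A hedged sketch of the projective-ratio idea: under a pinhole model the apparent size of an object scales inversely with depth, so distance can be estimated by comparing the detected box size to the size of the synthetic view rendered at a known canonical distance. The function and its inputs are illustrative.

```python
# Sketch: depth from the projective ratio. Under a pinhole camera, apparent
# size scales as 1/z, so z ~ z_canonical * (diag_canonical / diag_detected),
# with diagonals measured in pixels. Values are illustrative.
import math

def infer_distance(bbox_detected, diag_canonical_px, z_canonical):
    x1, y1, x2, y2 = bbox_detected
    diag_detected = math.hypot(x2 - x1, y2 - y1)
    return z_canonical * diag_canonical_px / diag_detected

print(infer_distance((100, 100, 220, 190), diag_canonical_px=300.0, z_canonical=0.5))  # -> 1.0
```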
  • 87. SSD-6D: MAKING RGB-BASED 3D DETECTION AND 6D POSE ESTIMATION GREAT AGAIN After predicting 2D detections (a), 6D hypotheses are built, followed by pose refinement and a final verification. While the unrefined poses (b) are rather approximate, contour-based refinement (c) already produces visually acceptable results. Occlusion-aware projective ICP with cloud data (d) leads to a very accurate alignment.
  • 88. SSD-6D: MAKING RGB-BASED 3D DETECTION AND 6D POSE ESTIMATION GREAT AGAIN Schematic overview of the SSD-style network prediction. C denotes the number of object classes, V the number of viewpoints, and R the number of in-plane rotation classes. The remaining 4 values refine the corners of the discrete bounding boxes.
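A small sketch of the per-prior prediction layout just described (C class scores + V viewpoint scores + R in-plane rotation scores + 4 corner refinements); the numbers below are illustrative, not the paper's exact configuration.

```python
# Sketch of the per-prior output layout: C class scores, V viewpoint scores,
# R in-plane rotation scores and 4 box-corner refinement values.
C, V, R = 20, 337, 19            # illustrative sizes, not the paper's exact numbers
values_per_prior = C + V + R + 4
num_priors = 8732                # e.g. an SSD-300-like prior count (illustrative)
print(values_per_prior, num_priors * values_per_prior)
```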
  • 89. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION  A single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses.  Unlike the recently proposed single-shot technique SSD-6D, which only predicts an approximate 6D pose that must then be refined, this approach is accurate enough not to require additional post-processing.  It is much faster (50 fps on a Titan X Pascal GPU) and more suitable for real-time processing.  The key component is a CNN architecture that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box.  The object's 6D pose is then estimated using a PnP algorithm (see the sketch below).
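A minimal sketch of the final pose-recovery step, assuming OpenCV: given the predicted 2D projections of the 3D bounding-box corners (plus centroid) and the camera intrinsics, a PnP solver (EPnP here) returns the rotation and translation. Inputs are placeholders.

```python
# Sketch (OpenCV): recover R, t from the predicted 2D projections of the
# 3D bounding-box corners plus the centroid via EPnP.
import numpy as np
import cv2

def pose_from_corners(corners_3d, corners_2d, K):
    """corners_3d: (9, 3) model-frame points; corners_2d: (9, 2) predicted pixels; K: 3x3 intrinsics."""
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K.astype(np.float64),
        np.zeros((4, 1)),            # assume no lens distortion
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)       # rotation matrix; tvec is the translation
    return R, tvec
```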
  • 90. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION The proposed CNN architecture.
  • 91. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION (a) An example input image with four objects. (b) The S × S grid showing the cells responsible for detecting the four objects. (c) Each cell predicts the 2D locations of the corners of the projected 3D bounding boxes in the image. (d) The 3D output tensor from the network, which represents, for each cell, a vector consisting of the 2D corner locations, the class probabilities, and a confidence value associated with the prediction.
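A small sketch of the output tensor layout in (d): each of the S × S cells predicts 9 × 2 corner/centroid coordinates, C class probabilities and one confidence value. Grid size and class count below are illustrative.

```python
# Sketch of the output tensor layout: per cell, 9 corner/centroid points x 2
# coordinates, C class probabilities and 1 confidence value.
S, C = 13, 13                 # illustrative grid size and class count
D = 9 * 2 + C + 1             # values predicted per cell
print((S, S, D))              # -> (13, 13, 32)
```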
  • 92. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION The last column shows failure cases due to motion blur, severe occlusion, and specularity.
  • 93. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT POSE PREDICTION
  • 94. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES  A real-time RGB-based pipeline for object detection and 6D pose estimation.  The 3D orientation estimation is based on a variant of the Denoising Autoencoder that is trained on simulated views of a 3D model using Domain Randomization.  This so-called Augmented Autoencoder (AAE) has several advantages over existing methods:  Since training is independent of any concrete representation of object orientations within SO(3) (e.g. quaternions), it can handle ambiguous poses caused by symmetric views, avoiding one-to-many mappings from images to orientations.  It learns representations that specifically encode 3D orientation while being robust to occlusion and cluttered backgrounds, and it generalizes to different environments and test sensors.  The AAE does not require any real pose-annotated training data; instead, it is trained to encode 3D model views in a self-supervised way (a minimal training sketch follows), removing the need for a large pose-annotated dataset.
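A minimal training-step sketch (PyTorch) of the AAE idea, assuming encoder, decoder, augment and optimizer are user-provided: the network reconstructs the clean rendered view from a randomly augmented version of it, so the latent code becomes invariant to the augmentations and implicitly encodes orientation. A plain L2 reconstruction loss is used here for simplicity.

```python
# Sketch of one AAE training step: reconstruct the *clean* rendered view
# from its augmented version so the code becomes augmentation-invariant.
import torch

def aae_step(encoder, decoder, clean_views, augment, optimizer):
    x_aug = augment(clean_views)                    # random backgrounds, occlusion, color jitter, ...
    z = encoder(x_aug)                              # implicit orientation code
    x_rec = decoder(z)
    loss = torch.mean((x_rec - clean_views) ** 2)   # plain L2 reconstruction loss (for simplicity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```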
  • 95. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES 6D object detection pipeline with homogeneous transformation (top right) and depth-refined result (bottom right).
  • 96. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES Training process for the AAE: (a) reconstruction target batch x of uniformly sampled SO(3) object views; (b) geometrically and color-augmented input; (c) reconstruction x̂ after 30,000 iterations.
  • 97. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES Autoencoder CNN architecture with occluded test input
  • 98. IMPLICIT 3D ORIENTATION LEARNING FOR 6D OBJECT DETECTION FROM RGB IMAGES Top: creating a codebook from the encodings of discrete synthetic object views. Bottom: object detection and 3D orientation estimation using the nearest neighbor(s) with the highest cosine similarity from the codebook (see the sketch below).
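A small sketch of the codebook lookup: the encoder code of the detected crop is compared against the stored codes of the discrete synthetic views by cosine similarity, and the rotation(s) of the most similar entries are returned. Array shapes and names are assumptions.

```python
# Sketch: cosine-similarity lookup of the k most similar codebook entries.
import numpy as np

def nearest_orientations(query_code, codebook_codes, codebook_rotations, k=1):
    """query_code: (d,); codebook_codes: (N, d); codebook_rotations: list of N rotations."""
    q = query_code / np.linalg.norm(query_code)
    c = codebook_codes / np.linalg.norm(codebook_codes, axis=1, keepdims=True)
    sims = c @ q                          # cosine similarity to every codebook entry
    idx = np.argsort(-sims)[:k]           # indices of the k highest similarities
    return [codebook_rotations[i] for i in idx], sims[idx]
```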