This document summarizes several papers related to monocular 3D object detection for autonomous driving. The first paper proposes MoVi-3D, a single-stage architecture that leverages virtual views to reduce visual appearance variability from objects at different distances, enabling detection across depths. The second paper describes RTM3D, which predicts object keypoints and uses geometric constraints to recover 3D bounding boxes in real-time. The third paper decouples detection into structured polygon estimation and height-guided depth estimation. It predicts 2D object surfaces and uses object height to estimate depth.
1. 3D Interpretation from Single 2D Image
for Autonomous Driving III
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• Towards Generalization Across Depth for Monocular 3D Object Detection
• RTM3D: Real-time Monocular 3D Detection from Object Keypoints for
Autonomous Driving
• Monocular 3D Object Detection with Decoupled Structured Polygon Estimation
and Height-Guided Depth Estimation
• Exploring the Capabilities and Limits of 3D Monocular Object Detection - A
Study on Simulation and Real World Data
• Object-Aware Centroid Voting for Monocular 3D Object Detection
• Monocular 3D Detection with Geometric Constraints Embedding and Semi-
supervised Training
• Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
3. Towards Generalization Across Depth for
Monocular 3D Object Detection
• This work advances the state of the art by introducing MoVi-3D, a single-
stage deep architecture for monocular 3D object detection.
• MoVi-3D builds upon an approach which leverages geometrical information
to generate, both at training and test time, virtual views where the object
appearance is normalized with respect to distance.
• These virtually generated views facilitate the detection task as they
significantly reduce the visual appearance variability associated with objects
placed at different distances from the camera.
• As a consequence, the deep model is relieved from learning depth-specific
representations and its complexity can be significantly reduced.
• In particular, thanks to the virtual view generation process, a lightweight,
single-stage architecture suffices to set new state-of-the-art results on the
popular KITTI3D benchmark.
4. Towards Generalization Across Depth for
Monocular 3D Object Detection
Aim at predicting a 3D bounding box for each object given a single image (left). In this image, the
scale of an object heavily depends on its distance with respect to the camera. For this reason the
complexity of the detection increases as the distance grows. Instead of performing the detection
on the original image, perform it on virtual images (middle). Each virtual image presents a cropped
and scaled version of the original image that preserves the scale of objects as if the image
were taken at a different, given depth.
5. Towards Generalization Across Depth for
Monocular 3D Object Detection
Illustration of the Monocular 3D Object Detection task. Given an input image (left),
the model predicts a 3D box for each object (middle). Each box has its 3D
dimensions s = (W, H, L), 3D center c = (x, y, z) and rotation angle α.
6. Towards Generalization Across Depth for
Monocular 3D Object Detection
• The goal is to devise a training/inference procedure that enables generalization across
depth, by indirectly forcing the models to develop representations for objects that are less
dependent on their actual depth in the scene.
• The idea is to feed the model with transformed images that have been put into a canonical
form that depends on some query depth.
• After this transformation, no matter where the car is in space, we obtain an image of the car
that is consistent in terms of the scale of the object.
• Clearly, depth still influences the appearance, e.g. due to perspective deformations, but by
removing the scale factor from the nuisance variables, the task that has to be solved by the
model is simplified.
• In order to apply the proposed transformation, the location of the 3D objects must be known
in advance.
7. Towards Generalization Across Depth for
Monocular 3D Object Detection
3D viewport
Compute the top-left and bottom-right corners of the viewport, namely (Xv, Yv, Zv) and (Xv +
Wv, Yv - Hv, Zv) respectively, and project them to the image plane of the camera, yielding the
top-left and bottom-right corners of a 2D viewport. Crop it and rescale it to the desired resolution
wv × hv to get the final output. The result is a virtual image generated from the given 3D viewport.
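As a concrete sketch of this viewport-to-virtual-image mapping, the following example projects the two viewport corners with a pinhole intrinsics matrix K, crops the 2D region, and rescales it with nearest-neighbor resampling. Function names and the resampling choice are illustrative, not the paper's implementation.

```python
import numpy as np

def project(K, X):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

def virtual_view(image, K, viewport, out_w, out_h):
    """Generate a virtual view from a 3D viewport (MoVi-3D style sketch).

    viewport = (Xv, Yv, Zv, Wv, Hv): top-left corner and metric size of a
    fronto-parallel rectangle placed at depth Zv. Its two opposite corners
    are projected into the image; the resulting 2D region is cropped and
    rescaled (nearest neighbor) to out_w x out_h.
    """
    Xv, Yv, Zv, Wv, Hv = viewport
    c1 = project(K, np.array([Xv, Yv, Zv]))            # one corner
    c2 = project(K, np.array([Xv + Wv, Yv - Hv, Zv]))  # opposite corner
    u0, v0 = np.floor(np.minimum(c1, c2)).astype(int)
    u1, v1 = np.ceil(np.maximum(c1, c2)).astype(int)
    crop = image[max(v0, 0):v1, max(u0, 0):u1]
    # nearest-neighbor rescale to the desired virtual-image resolution
    ys = np.linspace(0, crop.shape[0] - 1, out_h).round().astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_w).round().astype(int)
    return crop[np.ix_(ys, xs)]
```

The min/max over the projected corners keeps the crop valid regardless of the sign convention chosen for the camera Y axis.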
8. Towards Generalization Across Depth for
Monocular 3D Object Detection
• The goal of the training procedure is to build a network that is able to make correct
predictions within a limited depth range given an image generated from a 3D viewport.
• A ground-truth-guided sampling procedure: repeatedly draw (without replacement) a
ground-truth object and then sample a 3D viewport in a neighborhood thereof so that the
object is completely visible in the virtual image.
• The location of the 3D viewport is perturbed with respect to the position of the target
ground-truth object in order to obtain a model that is robust to depth ranges up to the
predefined depth resolution Zres, which in turn plays an important role at inference time.
• In addition, a small share of the virtual images is generated by 3D viewports randomly
positioned in a way that the corresponding virtual image is completely contained in the
original image.
• A class-uniform sampling strategy: this allows obtaining an even number of virtual images
for each class that is present in the original image.
9. Towards Generalization Across Depth for
Monocular 3D Object Detection
Training virtual image creation. We randomly sample a target object (dark-red car). Given the input
image, object position and camera parameters, compute a 3D viewport that we place at z = Zv. Then
project the 3D viewport onto the image plane, resulting in a 2D viewport. Finally crop the
corresponding region and rescale it to obtain the target virtual view (right).
10. Towards Generalization Across Depth for
Monocular 3D Object Detection
• Since the network has been trained to predict at distances up to twice the depth step, it is
reasonable to expect that no objects are missed, in the sense that each object will be
covered by at least one virtual image.
• Also, due to the convolutional nature of the architecture, the width of the virtual image is
adjusted to cover the entire extent of the input image.
• By doing so, the virtual images become wider as the depth increases, as a function of the
width W of the input image.
• Finally perform NMS over detections that have been generated from
the same virtual image.
11. Towards Generalization Across Depth for
Monocular 3D Object Detection
Inference pipeline. Given the input image, camera parameters and Zres, create a series of 3D
viewports placed every Zres/2 meters along the Z axis. Then project these viewports onto the image,
and crop and rescale the resulting regions to obtain distance-specific virtual views. Finally, use these
views to perform the 3D detection.
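The depth placement described above can be sketched as a tiny helper (illustrative only; the deck's exact width-adjustment rule for the virtual images is not reproduced here):

```python
import numpy as np

def query_depths(z_min, z_max, z_res):
    """Place query depths every z_res/2 meters along the Z axis, covering
    [z_min, z_max]. Since the network is trained to predict at distances up
    to twice the depth step, views spaced z_res/2 apart overlap enough that
    every object falls inside at least one virtual image."""
    step = z_res / 2.0
    return np.arange(z_min, z_max + step, step)
```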
12. Towards Generalization Across Depth for
Monocular 3D Object Detection
It consists of two parallel branches: the top one provides confidences for the predicted 2D and 3D
bounding boxes, while the bottom one regresses the actual bounding boxes. White rectangles
denote 3×3 convolutions with 128 output channels followed by iABNsync.
15. RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
• It proposes an efficient and accurate monocular 3D detection framework in a
single shot.
• This method predicts the nine perspective keypoints of a 3D bounding box in
image space, and then utilizes the geometric relationship between the 3D and 2D
perspectives to recover the dimension, location, and orientation in 3D space.
• In this method, the properties of the object can be predicted stably even
when the estimation of keypoints is very noisy, which enables us to obtain
fast detection speed with a small architecture.
• Training uses the 3D properties of the object without the need for external
networks or supervision data.
• This method is the first real-time system for monocular image 3D detection,
while achieving state-of-the-art performance on the KITTI benchmark.
• Code will be released at https://github.com/Banconxuan/RTM3D.
16. RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Overview of the proposed method: first predict the ordinal keypoints, i.e. the
image-space projections of the eight vertices and the central point of a 3D object;
then reformulate the estimation of the 3D bounding box as the minimization of an
energy function built from the geometric constraints of perspective projection.
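To illustrate the kind of geometric constraint involved, here is a simplified sketch that, assuming the dimensions and yaw are already known, recovers only the translation from the nine projected keypoints by linear least squares. RTM3D itself minimizes a joint energy over dimension, location, and orientation; the names here are illustrative.

```python
import numpy as np

def box_points(dims, yaw):
    """Eight corners plus the center of a 3D box, rotated by yaw (object frame)."""
    w, h, l = dims
    sx, sy, sz = l / 2, h / 2, w / 2
    pts = [(dx * sx, dy * sy, dz * sz)
           for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)]
    pts.append((0.0, 0.0, 0.0))  # box center: the 9th keypoint
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return (R @ np.array(pts).T).T  # (9, 3)

def project(K, P):
    """Pinhole projection of an (N, 3) point array to (N, 2) pixels."""
    uv = (K @ P.T).T
    return uv[:, :2] / uv[:, 2:3]

def solve_translation(K, dims, yaw, keypoints):
    """Recover the box translation T from 9 predicted 2D keypoints by linear
    least squares on the perspective-projection constraints."""
    f, cx, cy = K[0, 0], K[0, 2], K[1, 2]
    q = box_points(dims, yaw)  # rotated corner offsets
    A, b = [], []
    for (u, v), (qx, qy, qz) in zip(keypoints, q):
        # (u - cx)(tz + qz) = f (tx + qx)  =>  -f tx + (u - cx) tz = f qx - (u - cx) qz
        A.append([-f, 0.0, u - cx]); b.append(f * qx - (u - cx) * qz)
        A.append([0.0, -f, v - cy]); b.append(f * qy - (v - cy) * qz)
    T, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return T
```

With 18 linear equations for 3 unknowns, the system is heavily over-determined, which is what makes the recovered position stable even when individual keypoints are noisy.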
17. RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
An overview of the proposed keypoint detection architecture: it takes only RGB images as input
and outputs the main center heatmap, vertex heatmap, and vertex coordinates as the base module
to estimate the 3D bounding box. It can also predict other alternative priors to further improve the
performance of 3D detection.
18. RTM3D: Real-time Monocular 3D Detection
from Object Keypoints for Autonomous Driving
Illustration of keypoint feature pyramid network (KFPN).
21. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
• Since location recovery in 3D space is quite difficult on account of the
absence of depth information, this work proposes a unified framework
which decomposes the detection problem into a structured polygon
prediction task and a depth recovery task.
• Different from the widely studied 2D bounding boxes, the proposed
structured polygon in the 2D image consists of several projected
surfaces of the target object as better representation for 3D detection.
• In order to inversely project the predicted 2D structured polygon to a
cuboid in the 3D physical world, the subsequent depth recovery task uses
the object height as a prior to complete the inverse projection
transformation with the given camera projection matrix.
• Moreover, a fine-grained 3D box refinement scheme is proposed to
further rectify the 3D detection results.
22. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
The overall framework (Decoupled-3D) decouples the monocular 3D object detection problem into sub-tasks. The
overall network consists of three parts. (Top row) The 2D structured polygons are generated with a stacked
hourglass network. (Middle row) Object depth stage utilizes 3D object height as a prior to recover the missing depth
of the object. (Bottom row) The 3D box refinement stage rectifies coarse 3D boxes using bird's eye view features in 3D RoIs.
23. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Structured polygon estimation aims to estimate the 2D locations of the projected vertices
24. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
Height-Guided Depth Estimation. Combine
the 3D object height H and the corresponding
projected pixel height h to estimate the object depth.
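The height-guided relation reduces to the pinhole equation Z = f·H/h; a minimal sketch (illustrative function name):

```python
def height_guided_depth(f_y, H, h):
    """Height-guided depth: with focal length f_y (pixels), physical object
    height H (meters) and its projected pixel height h, the pinhole model
    gives Z = f_y * H / h."""
    return f_y * H / h
```

A taller pixel height h for the same physical height H means the object is closer, which is exactly the prior this stage exploits.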
25. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
3D Box Refinement. Rectify coarse boxes with bird’s eye view map
Note: the depth net is DORN ("Deep Ordinal Regression Network for Monocular Depth Estimation")
26. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
27. Monocular 3D Object Detection with Decoupled Structured
Polygon Estimation and Height-Guided Depth Estimation
28. Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
• Recent deep learning methods show promising results to recover depth
information from single images by learning priors about the environment.
• Beyond the network design, the major difference between these competing
approaches lies in using a supervised or self-supervised optimization loss
function, which requires different data and ground-truth information.
• This paper evaluates the performance of a 3D object detection pipeline which
is parameterizable with different depth estimation configurations.
• It implements a simple distance calculation approach based on camera
intrinsics and 2D bounding box size, a self-supervised learning approach, and
a supervised learning approach for depth estimation.
• It evaluates the detection pipeline on simulator data and a real-world
sequence from an autonomous vehicle on a race track.
• Advantages and drawbacks of the different depth estimation strategies are
discussed.
29. Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
3D object detection pipeline with
three alternative configurations
30. Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
• Distance calculation using the 2D bounding box height, and the known
height of the real world race car as a geometric constraint. “known
height assumption”
• Depth estimation for the whole image using the supervised
DenseDepth network. The distance to each object is calculated as the
median depth value in the bounding box crop. Explicit knowledge
about the objects, like height information, is not required in this
approach.
• Depth estimation for the whole image using the self-supervised
struct2depth network. The distance to each object is calculated as the
median depth value in the bounding box crop. Explicit knowledge
about the objects, like height information, is not required in this
approach.
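The median-depth-in-crop rule used by the two learned configurations can be sketched as follows (illustrative names; the box is given in pixel coordinates):

```python
import numpy as np

def object_distance(depth_map, box):
    """Distance to a detected object, taken as the median depth inside its
    2D bounding box crop. box = (u0, v0, u1, v1) in pixels. The median is
    robust to background pixels leaking into the box."""
    u0, v0, u1, v1 = box
    return float(np.median(depth_map[v0:v1, u0:u1]))
```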
31. Exploring the Capabilities and Limits of 3D Monocular Object
Detection - A Study on Simulation and Real World Data
32. Object-Aware Centroid Voting for
Monocular 3D Object Detection
• This paper proposes an end-to-end trainable monocular 3D object
detector that does not learn dense depth.
• Specifically, the grid coordinates of a 2D box are first projected back to
3D space with the pinhole model as 3D centroid proposals.
• Then, an object-aware voting approach is introduced, which considers
both the region-wise appearance attention and the geometric
projection distribution, to vote the 3D centroid proposals for 3D object
localization.
• With the late fusion and the predicted 3D orientation and dimension,
the 3D bounding boxes of objects can be detected from a single RGB
image.
• The method is straightforward yet significantly superior to other
monocular-based methods.
33. Object-Aware Centroid Voting for
Monocular 3D Object Detection
3D Object Detection Pipeline. Given an
image with predicted 2D region proposals
(yellow box), the regions are divided into
grids. Each grid point with (u, v) coordinates
is projected back to 3D space by leveraging
the pinhole model and the class-specific 3D
height H, resulting in 3D box centroid
proposals. With the voting method inspired
by both appearance and geometric cues,
3D object location is predicted.
34. Object-Aware Centroid Voting for
Monocular 3D Object Detection
The Architecture. 2D region proposals are first obtained from the RPN module. Then, with the 3D Center
Reasoning (left), multiple 3D centroid proposals are estimated from the 2D RoI grid coordinates.
Followed by the Object-Aware Voting (right), which consists of geometric projection distribution (GPD)
and appearance attention map (AAM), the 3D centroid proposals are voted for 3D localization. For the 3D
dimension and orientation, they are estimated together with the 2D object detection head.
35. Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Objects on the driving road are horizontally placed, i.e. they have no roll or pitch angles
with respect to the camera.
• Besides, the 3D dimension variance of each class of objects (such as Car) is quite small.
• These constraints lead to the idea that the apparent heights of objects on image are
approximately invariant when objects are in the same depth.
• A recent survey also points out that the position and apparent size of an object in an image
can be used to infer its depth on the KITTI dataset.
• Therefore, the 3D object centroid can be roughly inferred with the simple pinhole camera
model.
36. Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Specifically, divide each 2D region proposal into s × s grid cells and project the grid
coordinates back into 3D space.
• Since each grid point indicates a probable projection of the corresponding 3D object
centroid, this yields multiple 3D centroid proposals P3d, where the i-th proposal
P3d = (Xi, Yi, Z) is computed via the pinhole model.
Examples and statistics on KITTI training set.
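The back-projection of grid points into centroid proposals can be sketched under the pinhole model with a class-specific height prior (illustrative names; the paper's exact formulation may differ in detail):

```python
import numpy as np

def centroid_proposals(K, grid_uv, H_class, h_box):
    """Back-project 2D RoI grid points to 3D centroid proposals.

    The shared depth comes from the class height prior, Z = f * H_class / h_box,
    where h_box is the 2D box pixel height; each grid point (u_i, v_i) then
    gives X_i = (u_i - cx) * Z / f and Y_i = (v_i - cy) * Z / f.
    """
    f, cx, cy = K[0, 0], K[0, 2], K[1, 2]
    Z = f * H_class / h_box
    X = (grid_uv[:, 0] - cx) * Z / f
    Y = (grid_uv[:, 1] - cy) * Z / f
    return np.stack([X, Y, np.full_like(X, Z)], axis=1)
```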
37. Object-Aware Centroid Voting for
Monocular 3D Object Detection
• Specifically, use a single 1×1 convolution followed by sigmoid activation to generate the
appearance attention map from the feature maps of the RoI pooling layer.
• The activated convolution feature map indicates the foreground semantic objects, due to
the classification supervision in 2D object detection, leading to object-aware voting.
• The geometric voting component comes from the distribution of the offset between the
projected 3D centroid and the 2D box center.
• It has been demonstrated that the 2D box center can be modeled as a Gaussian distribution
with the ground truth as its expectation.
• To dynamically learn this distribution, the 2D grid coordinates and the RoI image features are
concatenated as input to a fully-connected layer that predicts the offset, with the
Kullback-Leibler (KL) divergence as the loss function supervising the learning.
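A sketch of a KL-divergence-style regression loss for the Gaussian-modeled offset; this is the common closed form for a predicted Gaussian against a ground-truth expectation, not necessarily the paper's exact formulation:

```python
import math

def kl_offset_loss(mu: float, log_var: float, gt: float) -> float:
    """KL-divergence-style regression loss for a Gaussian-distributed offset:
    the network predicts mean mu and log-variance log_var; treating the
    ground truth as the expectation, the loss reduces (up to constants) to
    (gt - mu)^2 / (2 * sigma^2) + 0.5 * log(sigma^2)."""
    var = math.exp(log_var)
    return (gt - mu) ** 2 / (2.0 * var) + 0.5 * log_var

# An accurate, confident prediction (small error, low variance) is cheap
loss = kl_offset_loss(mu=0.1, log_var=-2.0, gt=0.12)
```

The learned variance lets the network discount unreliable grid points, which is what makes the voting "dynamic".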
38. Object-Aware Centroid Voting for
Monocular 3D Object Detection
The object-aware voting is formulated as the
element-wise multiplication of the two normalized
probability maps, Mapp and Mgeo.
In the training stage, the 3D localization pipeline is
trained with a smooth L1 loss.
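A sketch of the voting fusion and the smooth L1 training loss; interpreting the fused map as weights over the centroid proposals is an assumption about how the vote is aggregated:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss used to train the 3D localization pipeline:
    quadratic for |diff| < beta, linear beyond."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def vote(M_app, M_geo, P3d):
    """Object-aware voting: fuse the normalized appearance map M_app and
    geometric map M_geo by element-wise multiplication, then (assumed here)
    use the fused map to weight the s*s centroid proposals P3d."""
    w = (M_app * M_geo).ravel()
    w = w / w.sum()
    return (w[:, None] * P3d).sum(axis=0)  # weighted 3D centroid
```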
39. Object-Aware Centroid Voting for
Monocular 3D Object Detection
The 3D dimension prediction loss comparing
predictions and the ground truth is defined in
logarithm space through the smooth L1 loss.
For 3D orientation estimation, Multi-Bin is used to
disentangle it into residual angle regression and
angle-bin classification, and the 3D orientation
loss is formed accordingly.
The loss function for the joint training of the 2D and
3D object detection multi-tasks combines these terms.
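The Multi-Bin decomposition can be sketched as follows; this simplified version uses non-overlapping bins, whereas Multi-Bin implementations typically overlap them:

```python
import math

def multibin_encode(theta, num_bins=2):
    """Multi-Bin orientation: classify the angle into one of num_bins bins
    and regress the residual from the bin center (simplified: no overlap)."""
    bin_size = 2 * math.pi / num_bins
    theta = theta % (2 * math.pi)          # wrap into [0, 2*pi)
    bin_idx = int(theta // bin_size)
    residual = theta - (bin_idx * bin_size + bin_size / 2)
    return bin_idx, residual

def multibin_decode(bin_idx, residual, num_bins=2):
    """Recover the angle from the classified bin and regressed residual."""
    bin_size = 2 * math.pi / num_bins
    return (bin_idx * bin_size + bin_size / 2 + residual) % (2 * math.pi)
```

Splitting classification from small-residual regression avoids the wrap-around discontinuity of direct angle regression.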
42. Object-Aware Centroid Voting for
Monocular 3D Object Detection
Qualitative results. Red: detected 3D boxes. Yellow: ground truth. Right: bird's-eye-view (BEV) results.
43. Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
• A single-shot, keypoints-based framework for monocular 3D object detection, KM3D-Net.
• A fully convolutional model predicts object keypoints, dimension, and orientation, which
are then combined with perspective geometry constraints to compute position.
• Further, the geometric constraints are reformulated as a differentiable version and embedded
into the network, maintaining the consistency of model outputs in an E2E fashion.
• A semi-supervised training strategy is proposed for when labeled training data is scarce.
• In this strategy, a consensus prediction is enforced between two shared-weights copies of
KM3D-Net given the same unlabeled image under different input augmentation conditions and
network regularization.
• In particular, the coordinate-dependent augmentations are unified as an affine transformation
for differentiable position recovery, and a keypoints-dropout module is proposed for
network regularization.
• This model only requires RGB images without synthetic data, instance segmentation, CAD
model, or depth generator.
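The geometric-constraint step can be sketched as a least-squares problem: each predicted keypoint, paired with the rotated 3D corner offset implied by the predicted dimension and orientation, yields two linear equations in the position T. This is a sketch of the standard construction, not KM3D-Net's exact tensor implementation (which solves it differentiably, e.g. via the pseudo-inverse):

```python
import numpy as np

def solve_position(kps, corners, K):
    """Recover the object position T = (X, Y, Z) from N predicted 2D
    keypoints and the corresponding rotated 3D corner offsets.
    The pinhole projection u = fx*(X+ox)/(Z+oz) + cx rearranges to the
    linear equation fx*X - (u-cx)*Z = (u-cx)*oz - fx*ox (and similarly
    for v), giving an overdetermined system A T = b solved in the
    least-squares sense."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    A, b = [], []
    for (u, v), (ox, oy, oz) in zip(kps, corners):
        A.append([fx, 0.0, -(u - cx)])
        b.append((u - cx) * oz - fx * ox)
        A.append([0.0, fy, -(v - cy)])
        b.append((v - cy) * oz - fy * oy)
    T, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return T
```

Because `lstsq` (or a pseudo-inverse) is differentiable, position errors can back-propagate into the keypoint, dimension, and orientation heads, which is what enables end-to-end training.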
44. Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Overview of KM3D-Net which
output keypoints, object
dimensions, local orientation,
and 3D confidence, followed
by differentiable geometric
consistency constraints to
predict position.
45. Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
Overview of unsupervised training. It leverages an affine transformation to unify input augmentation
and devises keypoints dropout for regularization. These two strategies make KM3D-Net output two
stochastic variables for the same input; penalizing their differences is the training goal.
46. Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
47. Monocular 3D Detection with Geometric Constraints
Embedding and Semi-supervised Training
48. Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
• The training of deep-learning-based 3D object detectors requires
large datasets with 3D bounding box labels for supervision that have to
be generated by hand-labeling.
• A network architecture and training procedure for learning monocular
3D object detection without 3D bounding box labels.
• By representing the objects as triangular meshes and employing
differentiable shape rendering, loss functions are defined based on depth
maps, segmentation masks, and ego- and object-motion, which are
generated by pre-trained, off-the-shelf networks.
• The method achieves performance comparable to SOA methods that require
3D bounding box labels for training, and superior to conventional baseline methods.
49. Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
A mono 3D vehicle detector that requires no 3D bounding box labels for training. The right image shows that
the predicted vehicles (colored shapes) fit the GT bounding boxes (red). Despite the noisy input depth (lower
left), the method accurately predicts the 3D poses of vehicles thanks to the proposed fully differentiable
training scheme. The projections of the predicted bounding boxes are shown as colored boxes (upper left).
50. Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
The proposed model contains a single-image network and a multi-image network extension. The single-
image network back-projects the input depth map into a point cloud. A Frustum PointNet encoder predicts
the pose and shape, which are then decoded into a predicted 3D mesh and segmentation mask through
differentiable rendering. The multi-image network architecture takes three images as inputs, and the
single-image network is applied individually to each image. This network predicts a depth map for the middle
frame based on the vehicle's pose and shape. A pre-trained network predicts ego-motion and object-motion
from the images. The reconstruction loss is computed by differentiably warping the images into the middle frame.
51. Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
In order to train without 3D bounding box labels, three losses are used: the
segmentation loss Lseg, the chamfer distance Lcd, and the photometric
reconstruction loss Lrec. The first two are defined for single images, while
the photometric reconstruction loss relies on temporal photo-consistency
across three consecutive frames.
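The chamfer-distance term and a weighted combination of the three losses can be sketched as follows; the brute-force pairwise distance and the unit weights are illustrative simplifications, not the paper's implementation:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric chamfer distance between point sets P (N, 3) and Q (M, 3):
    mean nearest-neighbor distance in both directions. Used to match the
    rendered mesh against the back-projected depth points."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def total_loss(L_seg, L_cd, L_rec, w_seg=1.0, w_cd=1.0, w_rec=1.0):
    """Weighted sum of the three supervision signals; the weights here
    are illustrative placeholders, not the paper's values."""
    return w_seg * L_seg + w_cd * L_cd + w_rec * L_rec
```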
52. Learning Monocular 3D Vehicle Detection
without 3D Bounding Box Labels
Qualitative comparison of MonoGRNet (1st row), Mono3D (2nd row), and this method (3rd row) with depth
maps from BTS. Shown are GT bounding boxes for cars (red), predicted bounding boxes (green), and the
back-projected point cloud. In comparison to Mono3D, the prediction accuracy is increased, specifically for
further-away vehicles. The performance of MonoGRNet and this model is comparable.