2. OUTLINE
Single View Metrology
Joint SFM and Detection Cues for Monocular 3D Localization in Road Scenes
Joint 3D Estimation of Objects and Scene Layout
CubeSLAM: Monocular 3D Object Detection and SLAM without Prior Models
Monocular Visual Scene Understanding: Understanding Multi-Object Traffic Scenes
Improved Object Detection and Pose Using Part-Based Models
3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model
Are Cars Just 3D Boxes? – Jointly Estimating the 3D Shape of Multiple Objects
Classification and Pose Estimation of Vehicles in Videos by 3D Modeling within Discrete-Continuous Optimization
A Mixed Classification-Regression Framework for 3D Pose Estimation from 2D Images
BoxCars: Improving Fine-Grained Recognition of Vehicles using 3D BBoxes in Traffic Surveillance
Vehicle Detection and Pose Estimation for Autonomous Driving (Thesis)
Deep Cuboid Detection: Beyond 2D BBoxes
3D Bounding Box Estimation Using Deep Learning and Geometry
Deep MANTA: A Coarse-to-fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image
3D Object Proposals for Accurate Object Class Detection
Monocular 3D Object Detection for Autonomous Driving
SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again
Real-Time Seamless Single Shot 6D Object Pose Prediction
Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
3. SINGLE VIEW METROLOGY
Basic geometry: The plane's vanishing line l is the intersection of the image plane with a plane parallel to the reference plane and passing through the camera centre. The vanishing point v is the intersection of the image plane with a line parallel to the reference direction through the camera centre.
Cross ratio: The point b on the plane π corresponds to the point t on the plane π'. They are aligned with the vanishing point v. The four points v, t, b and the intersection i of the line joining them with the vanishing line define a cross-ratio, which determines a ratio of distances between the planes in the world.
4. SINGLE VIEW METROLOGY
Homology mapping between parallel planes: a point X on plane π is mapped to a point X' on plane π' by parallel projection. In the image, the mapping between the images of the two planes is a homology, with vertex v and axis l. The correspondence b -> t fixes the remaining DoF of the homology from the cross-ratio of the four points v, i, t and b.
5. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES
This localization framework jointly uses information from complementary modalities, namely SFM and object detection, to achieve high localization accuracy in both near and far fields.
It makes use of raw detection scores to allow 3D bounding boxes to adapt to better-quality 3D cues.
To extract SFM cues, it takes advantage of dense tracking over sparse mechanisms in autonomous driving scenarios.
The formulation for 3D localization can be regarded as an extension of sparse BA to incorporate object detection cues.
3D object localization framework that combines cues from SFM and object detection. Red denotes 2D bounding boxes, the horizontal line is the horizon from the estimated ground plane, and green denotes estimated 3D localization for far and near objects, with distances in magenta.
6. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Overview of the 3D object localization system combining SFM cues (green) with object detection cues (brown).
7. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Coordinate system definitions for 3D object localization. The SFM ground plane is (n⊤, h)⊤.
System overview for obtaining SFM cues on objects, depicted in green.
K is the camera intrinsic calibration matrix; the bottom of a 2D bounding box, b = (x, y, 1)⊤, can be back-projected to 3D through the ground plane {h, n}:
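A minimal numpy sketch of this back-projection, assuming ground points X satisfy n·X + h = 0 in the camera frame (sign conventions differ across papers; the KITTI-like intrinsics below are placeholders):
```python
# Back-project the bottom of a 2D bounding box to 3D through the ground plane.
import numpy as np

def backproject_footpoint(K, b, n, h):
    """K: 3x3 intrinsics; b: (x, y, 1) pixel footpoint; n: unit ground normal; h: camera height."""
    ray = np.linalg.solve(K, np.asarray(b, float))  # direction of the back-projected ray
    lam = -h / float(n @ ray)                       # scale where the ray meets the plane
    return lam * ray                                # 3D point on the ground plane

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])  # KITTI-like intrinsics
n, h = np.array([0.0, -1.0, 0.0]), 1.7                             # flat ground, 1.7 m camera height
print(backproject_footpoint(K, (640.0, 250.0, 1.0), n, h))
```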
8. JOINT SFM AND DETECTION CUES FOR MONOCULAR 3D LOCALIZATION IN ROAD SCENES
Output of the localization system. The bottom-left panel shows the monocular SFM camera trajectory. The top panel shows input 2D bounding boxes in red, the horizon from the estimated ground plane, and the estimated 3D bounding boxes in green with distances in magenta. The bottom-right panel shows the top view of the ground-truth object localization from the laser scanner in red, compared to this 3D object localization in blue.
9. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT
A generative model is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene.
The goal is to infer the scene topology, geometry and traffic activities from a video sequence taken by a single camera mounted on a moving car.
It takes advantage of dynamic information in the form of vehicle tracklets and static information from semantic labels and geometry (i.e., vanishing points).
Monocular 3D urban scene understanding. (Left) Image cues: vehicle tracklets, vanishing points, scene labels. (Right) Estimated layout: detections belonging to a tracklet are depicted with the same color; traffic activities are depicted with red lines.
10. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT
Assume that the road surface is flat, and model the bird's-eye perspective as the y = 0 plane of the standard camera coordinate system.
Detect vehicles in each frame independently, using a semi-supervised version of the part-based detector in order to obtain orientation estimates.
2D tracklets are estimated using 'tracking-by-detection': first adjacent frames are linked, and then short tracklets are associated to create longer ones via the Hungarian method (see the sketch below).
3D vehicle tracklets are obtained by projecting the 2D tracklets into the bird's-eye perspective, employing error propagation to obtain covariance estimates.
Model lanes with splines; place parking spots at equidistant places along street boundaries.
The model infers whether the cars participate in traffic or are parked in order to get more accurate layout estimations.
Latent variables are employed to associate each detected vehicle with positions in one of these lanes or parking spaces.
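A toy sketch of the tracklet-association step using SciPy's Hungarian solver; the gating threshold and the plain endpoint-distance cost are invented stand-ins for the paper's richer affinities:
```python
# Link tracklet endpoints to tracklet start points with an optimal assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(ends, starts, max_cost=50.0):
    """ends/starts: arrays of (x, y) tracklet endpoints; returns matched index pairs."""
    cost = np.linalg.norm(ends[:, None, :] - starts[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)            # globally optimal assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

ends = np.array([[100.0, 200.0], [400.0, 220.0]])
starts = np.array([[405.0, 225.0], [103.0, 198.0]])
print(associate(ends, starts))  # -> [(0, 1), (1, 0)]
```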
11. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT
Graphical model, and road model with lanes represented as B-splines.
Transform the 2D tracklets into 3D tracklets: project the image coordinates into the bird's-eye perspective by back-projecting objects into 3D using several complementary cues. Towards this goal, use the 2D bounding box footpoint in combination with the estimated road plane. Two types of dominant vanishing points are used (forward-facing street and crossing street), along with three semantic classes: road, sky and background.
12. JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT
(Left) Tracklets from all frames superimposed. (Middle) Inference result with θ known and (Right) θ unknown. The inferred intersection layout is shown in gray, ground-truth labels in blue, and detected activities in red.
13. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
A method for single-image 3D cuboid object detection and multi-view object SLAM without prior object models; the two aspects benefit each other.
For 3D detection, cuboid proposals are generated from 2D bounding boxes and vanishing-point sampling. The proposals are further scored and selected to align with image edges.
Multi-view bundle adjustment with measurement functions is proposed to jointly optimize camera poses, objects and points, utilizing single-view detection results.
Objects can provide more geometric constraints and scale consistency compared to points.
Objects are utilized in two ways: they provide depth initialization for points that are difficult to triangulate, and they provide geometry constraints in BA.
The estimated camera poses from SLAM can in turn improve single-view object detection.
14. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
Monocular 3D object detection and mapping without prior object models (the mesh model is just for visualization and not used for detection). (a) ICL-NUIM data with various objects, whose position, orientation and dimension are optimized by SLAM. (b) KITTI 07: with object constraints, monocular SLAM can build a consistent map and correct scale drift, without loop closure or a constant-camera-height assumption.
15. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
A 3D cuboid is parameterized by 9 DoF: 3-DoF position, 3-DoF rotation and 3-DoF dimension.
The cuboid coordinate frame is built at the cuboid center, aligned with the main axes. The camera intrinsic calibration K is also known.
The cuboid's projected corners fit tightly within the 2D bounding box; this gives 4 constraints corresponding to the 4 sides of a rectangle, which cannot fully constrain all 9 parameters.
A 3D cuboid has 3 orthogonal axes and can form 3 vanishing points after perspective projection, depending on the object rotation R and the camera calibration K (see the sketch below).
After getting the 8 cuboid corners in 2D, they are back-projected to 3D space to compute the 3D position and dimensions, up to a scale factor. The scale can be reasoned from the camera height above the ground, prior object size and so on.
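A small sketch of the vanishing-point relation used above: with intrinsics K and cuboid rotation R (columns = the cuboid's three orthogonal axes), each axis projects to VP_i ~ K·R_i in homogeneous coordinates. The numbers are illustrative only:
```python
# Vanishing points of a cuboid's three axes under perspective projection.
import numpy as np

def cuboid_vanishing_points(K, R):
    """Rows are the homogeneous VPs of the cuboid's three axes (columns of R)."""
    return (K @ R).T

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
yaw = np.deg2rad(30)
R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
              [0, 1, 0],
              [-np.sin(yaw), 0, np.cos(yaw)]])
for vp in cuboid_vanishing_points(K, R):
    if abs(vp[2]) > 1e-9:
        print(vp[:2] / vp[2])          # finite vanishing point in pixels
    else:
        print("VP at infinity (axis parallel to image plane)")
```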
16. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
Proposal generation from the 2D object box. Cuboids are divided into three categories depending on the number of observable faces. If one corner is estimated, the other seven corners can also be computed from the vanishing points (VPs). For example, in (a), if corner 1 is sampled, then corners 2 and 3 can be determined through ray intersection of the VP lines and the rectangle, followed by corner 4 and the other bottom corners.
17. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
Denote the image as I and a cuboid proposal as x; the cost function is then defined as a weighted sum of image-alignment terms (a distance error, an angle alignment error, and a shape regularization).
Cuboid proposal scoring. (Left) Edges used to align and score the proposals. (Right) Cuboid proposals generated from the same 2D cyan bounding box; the top-left is the best and the bottom-right is the worst after scoring.
18. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
Camera poses C = {ci}, 3D landmark objects O = {oj}, and points P = {pk}. BA is formulated as the nonlinear least-squares problem of minimizing the summed camera-object, camera-point and object-point measurement errors over C, O and P.
Camera-object measurement (3D): transform the landmark object to the camera frame, then compare it with the measurement.
Camera-object measurement (2D): project the landmark cuboid onto the image plane to get a 2D bounding box, then compare it with the detected 2D bounding box.
Object-point measurement: first transform the point to the cuboid frame, then compare it with the cuboid dimensions.
Point-camera measurement: the standard 3D point re-projection error in feature-based SLAM.
19. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
(a) The object SLAM pipeline: single-view object detection provides cuboid landmarks and depth initialization for SLAM, while SLAM estimates camera poses for more accurate object detection. (b) Measurement errors between cameras, objects and points during BA.
20. CUBESLAM: MONOCULAR 3D OBJECT DETECTION AND SLAM WITHOUT PRIOR MODELS
Object association based on point matching:
Dynamic points are detected through descriptor matching and epipolar-line checking.
Points are first associated to objects if they are observed enough times inside the 2D object bounding box and close to the cuboid centroid in 3D space.
Then find the object match that shares the most map points, with the count exceeding a threshold (10 for example); a toy version is sketched below.
This works well for wide-baseline matching, repetitive objects, occlusions, and dynamic scenarios.
Green points are normal map points; points of other colors are associated to the objects of the same color. The front cyan moving car is not added as a SLAM landmark since no feature point is associated with it. Points in object-overlapping areas are not associated with any object.
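A toy version of this shared-map-point association (the threshold of 10 comes from the slide; the data structures are invented):
```python
# Match a new detection to the landmark object sharing the most map points.
def associate_object(det_points, landmarks, min_shared=10):
    """det_points: set of map-point ids on the new detection;
    landmarks: dict object_id -> set of map-point ids. Returns best id or None."""
    best_id, best_shared = None, 0
    for oid, pts in landmarks.items():
        shared = len(det_points & pts)
        if shared > best_shared:
            best_id, best_shared = oid, shared
    return best_id if best_shared >= min_shared else None

landmarks = {1: set(range(0, 40)), 2: set(range(100, 140))}
print(associate_object(set(range(20, 60)), landmarks))  # -> 1
```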
21. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
A probabilistic 3D scene model that integrates state-of-the-art multiclass object detection, object tracking and scene labeling together with geometric 3D reasoning.
The model is able to represent complex object interactions such as inter-object occlusion, physical exclusion between objects, and geometric context.
Inference in this model allows one to jointly recover the 3D scene context and perform 3D multi-object tracking from a mobile observer, for objects of multiple categories, using only monocular video as input.
The system performs explicit occlusion reasoning and is capable of tracking objects that are partially occluded for extended periods of time, or objects that have never been observed to their full extent.
A joint scene-tracklet model for the evidence collected over multiple frames substantially improves performance.
22. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
Overview of the system. For each input frame, run an object detector and extract semantic scene labels. Object hypotheses are fused into short-term tracklets and put into a strong 3D scene model with explicit occlusion reasoning. MCMC allows tractable inference in the Bayesian scene model, while HMM scene tracking ensures long-term associations.
23. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
Multi-frame 3D inference and explicit occlusion reasoning for onboard vehicle and pedestrian tracking, with the overlaid horizon estimate, on different public state-of-the-art datasets.
Notation: the 3D scene state X is expressed in the world coordinate system; the rotation angles are those of a vehicle-mounted camera.
24. MONOCULAR VISUAL SCENE UNDERSTANDING: UNDERSTANDING MULTI-OBJECT TRAFFIC SCENES
The theorem of intersecting lines is employed to derive the distance to an object along the ground plane in the viewing direction.
Objects are approximated by their bounding boxes and projected onto the image. By leveraging the depth order obtained from the 3D scene model, occluded object regions can be estimated; a toy version of this depth-ordered reasoning is sketched below.
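A toy numpy version of that depth-ordered occlusion reasoning, approximating objects by 2D boxes as the slide describes; box sizes and depths are made up:
```python
# Paint boxes nearest-first onto an id canvas; whatever a box fails to claim
# is occluded by a nearer object.
import numpy as np

def occlusion_fractions(boxes, depths, H=240, W=320):
    """boxes: list of (x1, y1, x2, y2) pixels; depths: distance per box.
    Returns the fraction of each box covered by nearer boxes."""
    order = np.argsort(depths)          # nearest first
    canvas = np.full((H, W), -1, int)   # id of nearest object per pixel
    for i in order:
        x1, y1, x2, y2 = boxes[i]
        region = canvas[y1:y2, x1:x2]
        region[region == -1] = i        # paint only where no nearer object is
    occ = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        area = (x2 - x1) * (y2 - y1)
        visible = (canvas[y1:y2, x1:x2] == i).sum()
        occ.append(1 - visible / area)
    return occ

print(occlusion_fractions([(50, 100, 150, 200), (100, 100, 220, 200)], [5.0, 10.0]))
```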
26. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
This work builds on part-based models by using accurate geometric models both in the learning phase and at detection time.
The object model is defined as a number of roughly planar aspect models together with a set of typical object poses.
In the learning phase, manual annotations are used to reduce perspective distortion before learning the part-based models. Training is performed on rectified images using a deformable part-based model (DPM), which leads to models that are more specific (a rectification sketch follows below). At the same time, a set of representative object poses is learnt.
At detection time, the image is transformed according to each of the learnt typical poses, removing perspective distortion. Scores from the aspect detectors are generated by running each aspect model on each of the transformed images. Detections from the different aspect models are combined and thresholded to produce the final object detection.
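A minimal sketch of the rectification idea with OpenCV, assuming the annotated aspect is a planar quadrilateral; the corner order and output size are arbitrary choices, not the paper's:
```python
# Warp an annotated, roughly planar aspect (e.g. a bus side) to a
# fronto-parallel view with a homography before training the DPM.
import cv2
import numpy as np

def rectify_aspect(image, quad, out_w=400, out_h=150):
    """quad: 4 image corners of the planar aspect, ordered tl, tr, br, bl."""
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    H = cv2.getPerspectiveTransform(np.float32(quad), dst)
    return cv2.warpPerspective(image, H, (out_w, out_h))

img = np.zeros((480, 640, 3), np.uint8)                  # placeholder image
quad = [(120, 200), (500, 180), (520, 330), (100, 320)]  # invented annotation
patch = rectify_aspect(img, quad)                        # fronto-parallel patch
```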
27. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
Annotation of each visible aspect of the object in training images.
The two aspect models for the bus category: the upper row shows the frontal model and the lower row shows the side model.
30. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
A training example is similar to a pose P if the average angular deviations for the front and side are below a predefined threshold.
Measure of angular deviation for pose similarity.
31. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
Overview of the detection pipeline. The input image is transformed according to each of the representative poses. This produces multiple images that are individually run through the aspect detectors, creating a set of score pyramids containing the detector scores at different scales. These are merged into one pyramid per aspect, in the original image coordinate system. Finally, the front and side scores are combined and non-maximum suppression is performed.
32. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
Left: how to estimate the side location (brown dot) given the frontal location (blue dot) and the size of the skewed frontal bounding box. Right: search in a small neighborhood (blue circle) of the expected location for each level.
The score combination can be expressed as
33. IMPROVED OBJECT DETECTION AND POSE USING PART-BASED MODELS
Detected bounding boxes are shown in green and their layout in red.
34. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
Given a monocular image, the goal is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes.
The approach extends the deformable part-based model to reason in 3D. It represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box.
The appearance of each face is modeled in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint.
The model reasons about face visibility patterns called aspects.
The cuboid model is trained jointly, and weights are shared across all aspects to attain efficiency.
Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While inference discretizes the search space, the variables are continuous in the model.
35. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
The deformable 3D cuboid model. Viewpoint angle θ.
36. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
Aspects, together with the range of θ that they cover, for (left) cars and (right) beds.
37. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
38. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
where p = (p1, · · · , p6) and V(i, a) is a binary variable encoding whether face i is visible under aspect a. Note that a = a(θ, s) can be deterministically computed from the rotation angle θ and the position of the stitching point s (which is assumed to always be visible), which in turn determines the face visibility V.
39. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
Learned models for (left) bed and (right) car.
40. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
The index ref denotes the first visible face in the aspect model, and
41. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
Inference in this model can be done by computing
42. 3D OBJECT DETECTION AND VIEWPOINT ESTIMATION WITH A DEFORMABLE 3D CUBOID MODEL
KITTI: examples of car detections. (Top) Ground truth. (Bottom) The 3D detections, augmented with best-fitting CAD models to visualize the inferred 3D box orientations.
43. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS
Scene understanding from the perspective of 3D shape modeling: a 3D scene representation that reasons jointly about the 3D shape of multiple objects.
It allows one to express 3D geometry and occlusion at the fine detail level of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way.
Left: coarse 3D object bounding boxes derived from 2D bounding box detections. Right: fine-grained 3D shape model fits improve 3D localization (bird's-eye views).
44. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS
Scene particles (coarse 3D geometry and fine-grained shape). Deterministic occlusion-mask computation by ray casting and intersection (blue).
A 3D scene model, consisting of a common ground plane, a set of 3D deformable objects, and an explicit occlusion mask for each object.
45. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS
Object likelihood. Scene-level likelihood.
An inference scheme that proceeds in stages, lifting an initial 2D guess (initialization) about object locations to a coarse 3D model (coarse 3D geometry), and refining that coarse model into a final collection of consistent 3D shapes (final scene-level inference).
46. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS
(a) Part localization accuracy and 2D pre-detection. (b-c) Example detections and corresponding 3D reconstructions.
47. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS
COARSE+GP (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b) COARSE+GP based on (a), (c) bird's-eye view of (b). (d) FG+SO+DO+GP shape model fits (blue: estimated occlusion masks), (e) bird's-eye view of (d). Estimates in red, ground truth in green.
48. ARE CARS JUST 3D BOXES? – JOINTLY ESTIMATING THE 3D SHAPE OF MULTIPLE OBJECTS
FG+SO+DO (a-c) vs. FG+SO+DO+GP (d-e). (a) 2D bounding box detections, (b) FG+SO+DO based on (a), (c) bird's-eye view of (b). (d) FG+SO+DO+GP shape model fits (blue: estimated occlusion masks), (e) bird's-eye view of (d). Estimates in red, ground truth in green.
49. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION
Rank possible poses and types for each frame, and exploit temporal coherence between consecutive frames for refinement.
The estimation of a vehicle's pose and type is cast as the solution of a continuous optimization problem over space and time.
Initial start points are obtained by a discrete temporal optimization reaching a global optimum on a ranked discrete set of possible types and poses.
To guarantee the effectiveness of the discrete-continuous optimization, the search space of potential 3D model types and poses is reduced for each frame for the discrete optimizer. This avoids the common, expensive evaluation of all possible discretized hypotheses.
The key idea towards efficiency lies in a combination of detecting the vehicle, rendering the 3D models, matching projected edges to input images, and using a tree-structured MRF to get fast and globally optimal inference and to force the vehicle to follow a feasible motion model in the initial phase.
50. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION
The method improves pose estimation over [Toshev et al., 2009] by processing in continuous space (columns 1, 2), reduces wrong classifications due to incorrect scales (column 3), and improves pose estimation over [Leotta et al., 2011] by using existing 3D models.
51. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION
(a) Framework application flow. (b) Vehicle, described by orientation α and centroid on the ground plane C = (x, y, 0).
Fast Directional Chamfer Matching (FDCM): FDCM maps the edge pixels in U and E to an orientation-augmented space, and the alignment cost between the two edge maps is computed in that space; the matching score is updated accordingly.
Given the shifted but projectively wrong model projection A_l^p and a projectively correct model projection area B_l^q, the similarity score for a pose is calculated by combining the output of FDCM and the area overlap (a plain chamfer-matching sketch follows below).
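For intuition, a plain (non-directional) chamfer-matching sketch; real FDCM additionally augments edges with orientation channels and integrates costs along line segments:
```python
# Score an edge template against an image edge map via a distance transform.
import numpy as np
import cv2

def chamfer_cost(image_edges, template_points):
    """image_edges: binary HxW edge map; template_points: Nx2 (x, y) model edge pixels."""
    # Distance from every pixel to the nearest image edge pixel.
    dist = cv2.distanceTransform((image_edges == 0).astype(np.uint8), cv2.DIST_L2, 3)
    pts = np.asarray(template_points, int)
    return dist[pts[:, 1], pts[:, 0]].mean()  # mean distance of template to edges

edges = np.zeros((100, 100), np.uint8)
edges[50, 10:90] = 1                          # a horizontal image edge
template = [(x, 52) for x in range(20, 80)]   # model edge 2 px away
print(chamfer_cost(edges, template))          # ~2.0
```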
52. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION
(a) Temporal inference for ranked projections. (b) Ackermann steering principle, where φ = θ/2. (c) Corresponding points between the model's projected edges and the edge image.
53. CLASSIFICATION AND POSE ESTIMATION OF VEHICLES IN VIDEOS BY 3D MODELING WITHIN DISCRETE-CONTINUOUS OPTIMIZATION
Pose estimation using FDCM only (top row), combining FDCM and the MRF (middle row), and combining FDCM, the MRF and continuous optimization (bottom row).
54. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES
Existing 3D pose estimation methods using deep networks can be divided into two groups: (i) predict 2D keypoints from images and recover the 3D pose from the keypoints; (ii) directly predict the 3D pose from an image.
This is a mixed classification-regression framework that uses a classification network to produce a discrete multimodal pose estimate and a regression network to produce a continuous refinement of the estimate.
The framework can accommodate different architectures and loss functions, leading to multiple classification-regression models.
A high-level overview of the problem statement and proposed network architecture.
56. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES
The Bin & Delta model, in its Simple/Naive Bin & Delta and Geodesic Bin & Delta variants.
57. A MIXED CLASSIFICATION-REGRESSION FRAMEWORK FOR 3D POSE ESTIMATION FROM 2D IMAGES
One delta network per pose bin, with the previous optimization problems adapted accordingly; a toy decode of the Bin & Delta idea is sketched below.
Best (top row) and worst (bottom row) images for category Bus.
Best (top row) and worst (bottom row) images for category Car.
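A toy numpy decode of the Bin & Delta idea (the 'networks' are stand-in arrays, and the 12 azimuth bins are an assumption for illustration):
```python
# Classifier picks a coarse pose bin; the regressor's per-bin delta refines it.
import numpy as np

BIN_CENTERS = np.deg2rad(np.arange(0, 360, 30))  # 12 coarse azimuth bins

def decode_pose(bin_logits, deltas):
    """bin_logits: (12,) classification scores; deltas: (12,) per-bin residuals (rad)."""
    k = int(np.argmax(bin_logits))               # discrete, multimodal estimate
    return BIN_CENTERS[k] + deltas[k]            # continuous refinement

logits = np.zeros(12); logits[4] = 1.0           # pretend the net picked bin 4 (120 deg)
deltas = np.full(12, np.deg2rad(7.0))
print(np.rad2deg(decode_pose(logits, deltas)))   # -> 127 degrees
```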
58. BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
The method is not limited to the frontal/rear viewpoint but allows vehicles to be seen from any viewpoint, based on 3D bounding boxes built around the vehicles.
The bounding box can be automatically constructed from traffic surveillance data. For scenarios where the precise construction cannot be used, a method estimates the 3D bounding box.
The 3D bounding box is used to normalize the image viewpoint by "unpacking" the image into a plane.
During CNN training, the color of the image is randomly altered and a rectangle of random noise is added at a random position.
A fine-grained vehicle dataset, BoxCars116k, contains 116k images of vehicles from various viewpoints taken by many surveillance cameras.
59. BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
Example of an automatically obtained 3D bounding box used for fine-grained vehicle classification. Top left: vehicle with 2D bounding box annotation; top right: estimated contour; bottom left: estimated directions to vanishing points; bottom right: 3D bounding box automatically obtained from surveillance video (green) and the estimated 3D bounding box (red).
60. BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
3D bounding box and its unpacked version.
Examples of data normalization and auxiliary data fed to the nets. Left to right: vehicle with 2D bounding box, computed 3D bounding box, vectors encoding viewpoints on the vehicle (View), unpacked image of the vehicle (Unpack), and rasterized 3D bounding box fed to the net (Rast).
61. BOXCARS: IMPROVING FINE-GRAINED RECOGNITION OF VEHICLES USING 3D BOUNDING BOXES IN TRAFFIC SURVEILLANCE
Estimation of the 3D bounding box. Left to right: image with the vehicle's 2D bounding box, output of the contour object detector, constructed contour, estimated directions towards the vanishing points, and ground-truth (green) vs. estimated (red) 3D bounding box.
A CNN is used to estimate the directions towards the vanishing points. The vehicle image is fed to a ResNet50 with 3 separate outputs which predict the directions of the vanishing points as probabilities in a quantized angle space (60 bins from −90° to 90°).
62. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)
An FCN for 2D and 3D bounding box detection of cars from monocular images, intended for autonomous driving applications.
The introduced network is end-to-end trainable and detects objects at multiple scales in a single pass.
A 3D bounding box representation which is independent of the image projection matrix (the camera used to take the images).
The detector may be trained on several different datasets, and it can also detect 3D bounding boxes on datasets entirely different from those it was trained on.
3D bounding boxes (left) and their top view (right) detected by this method. The front sides of the 3D bounding boxes are depicted in green, the rear sides in red.
63. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)
2D bounding box (BBTXT): the 2D boxes are represented by the coordinates of their top-left (xmin, ymin) and bottom-right (xmax, ymax) corners.
3D bounding box (BB3TXT): a box in 3D has 9 DoF - 3 for position, 3 rotations, and 3 dimensions. Stored are the coordinates of the projected rear-bottom-left, front-bottom-left, and front-bottom-right corners, plus the y-coordinate of the front-top-left corner.
3D bounding box corners: each box is defined by 3 lines - front-bottom, left-bottom, front-left.
64. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)
Together with the requirement that all bounding boxes are pinned to the ground plane, this provides a sufficient amount of information to reconstruct the 3D world positions of the 3D bounding boxes.
Inverse projection: compute the inverse of KR and the camera center from the projection matrix.
Reconstruction of the bottom side: use the ground plane equation ax + by + cz + d = 0 (normal n = [a, b, c]ᵀ) and the inverse-projected rays to determine the position of the bottom side of the 3D bounding box in the world. Re-projecting the 3 points yields a parallelogram in the ground plane instead of a rectangle, which must be rectified (a toy rectification follows below).
Rectification of a parallelogram (solid) to a rectangle (dashed). Projection of a 3D bounding box to the ground plane.
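One simple way to rectify the re-projected parallelogram into a rectangle, sketched in numpy (the corner naming and the choice of which edge to keep fixed are assumptions, not the thesis' exact procedure):
```python
# Given three consecutive bottom corners A, B, C of the parallelogram, keep
# edge B->C and orthogonalize edge B->A against it to obtain a rectangle.
import numpy as np

def rectify_parallelogram(A, B, C):
    """A, B, C: consecutive bottom corners (3-vectors); returns 4 rectangle corners."""
    u = C - B
    u_hat = u / np.linalg.norm(u)
    v = A - B
    v_perp = v - (v @ u_hat) * u_hat     # remove the component of v along u
    A2 = B + v_perp
    return A2, B, C, C + v_perp          # hypothetical order: rear-L, front-L, front-R, rear-R

A = np.array([0.5, 0.0, 6.0]); B = np.array([0.0, 0.0, 4.0]); C = np.array([1.8, 0.0, 3.5])
print(rectify_parallelogram(A, B, C))
```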
65. VEHICLE DETECTION AND POSE ESTIMATION FOR AUTONOMOUS DRIVING (THESIS)
Reconstruction of the top side: use the direction of the bottom-left line as the normal vector of the frontal plane, nF = [aF, bF, cF]ᵀ, and place the front-bottom-left point in the frontal plane to obtain the front plane aFx + bFy + cFz + dF = 0. Finding the intersection of the frontal plane and the ray l_ftl = C + (KR)⁻¹x_ftl gives the position of the vertex X_ftl, which determines the height of the bounding box.
Ground plane extraction: the RANSAC algorithm is used for ground plane estimation (a generic sketch follows below).
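A generic RANSAC plane fit of the kind used for ground-plane extraction (thresholds, iteration count and the synthetic data are arbitrary; this is not the thesis' code):
```python
# Fit a plane n.x + d = 0 to 3D points by repeated 3-point hypotheses.
import numpy as np

def ransac_plane(points, iters=200, thresh=0.05, rng=np.random.default_rng(0)):
    """points: Nx3 array. Returns (n, d) for the plane with the most inliers."""
    best, best_inliers = None, 0
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                       # degenerate (collinear) sample
        n = n / norm
        d = -n @ sample[0]
        inliers = np.sum(np.abs(points @ n + d) < thresh)
        if inliers > best_inliers:
            best, best_inliers = (n, d), inliers
    return best

pts = np.random.default_rng(1).uniform(-5, 5, (500, 3))
pts[:, 1] = 1.7 + 0.01 * pts[:, 1]         # noisy ground at y ~ 1.7 m
print(ransac_plane(pts))                   # normal near (0, +/-1, 0), |d| near 1.7
```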
67. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES
A Deep Cuboid Detector takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects).
It is an end-to-end deep learning system that detects cuboids across many semantic categories (e.g., ovens, shipping boxes, and furniture).
Cuboids are localized with a 2D bounding box together with the cuboid's corners, effectively producing a 3D interpretation of box-like objects.
Keypoints are refined by pooling convolutional features iteratively, improving on the baseline method significantly.
This deep-learning cuboid detector is trained in an end-to-end fashion and is suitable for real-time applications.
2D object detection vs. 3D cuboid detection.
68. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES
Deep Cuboid Detection pipeline. 1) Find RoIs in the image where a cuboid might be present, and train an RPN to output such regions. 2) Features for each RoI are pooled from a convolutional feature map. 3) These pooled features are passed through two fully connected layers, just like Faster R-CNN. 4) Normalized offsets of the vertices from the center of the region are output. 5) Predictions are refined by performing iterative feature pooling.
69. DEEP CUBOID DETECTION: BEYOND 2D
BOUNDING BOXES
The loss function used in the RPN consists of Lanchor−cls, the log loss
over two classes (cuboid vs. not cuboid) and Lanchor−reg, the Smooth
L1 loss of the Bbox regression values for each anchor box;
The loss function for the R-CNN is made up of LROI−cls, the log loss
over two classes (cuboid vs. not cuboid), LROI−reg, the Smooth L1
loss of the Bbox regression values for the RoI and LROI−corner, the
Smooth L1 loss over the RoI’s predicted vertex locations, also
referred as the corner regression loss.
The complete loss function is a weighted sum of the above
mentioned losses
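For reference, the smooth L1 term used by the regression losses above, in plain numpy (beta = 1 matches the common Fast R-CNN definition):
```python
# Smooth L1 (Huber-like) loss: quadratic near zero, linear for large errors.
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).sum()

print(smooth_l1(np.array([0.2, 3.0]), np.array([0.0, 0.0])))  # 0.02 + 2.5 = 2.52
```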
70. DEEP CUBOID DETECTION: BEYOND 2D BOUNDING BOXES
Vertex refinement via iterative feature pooling: cuboid detection regions are refined by re-pooling features from conv5 using the predicted bounding boxes.
71. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY
3D object detection and pose estimation from a single image.
The method first regresses relatively stable 3D object properties using a deep CNN, and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box.
The first network output estimates the 3D object orientation using a hybrid discrete-continuous loss, which significantly outperforms the L2 loss.
The second output regresses the 3D object dimensions, which have relatively little variance and can be predicted for many object types.
These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable recovery of a stable and accurate 3D object pose.
72. Perspective projection of a 3D Bbox fitting tightly within its 2D detection window.
The 3D Bbox is described by its center T, dimensions D, and orientation,
given by the azimuth, elevation, and roll angles.
The figure panels show the 2D box side parameters and the correspondence
between the 3D bbox and the 2D bbox; each panel shows a 3D bbox that
surrounds an object.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
73. CNN Regression of 3D Box Parameters:
Combine the ray direction at the crop center with the
estimated local orientation to compute the global
orientation of the object.
Faster R-CNN, SSD: divide the space of the bounding
boxes into several discrete modes ("anchor boxes") and
then estimate the continuous offsets applied to each
anchor box.
MultiBin: discretize the orientation angle into
overlapping bins. For each bin, the CNN
estimates both a confidence probability that the output
angle lies inside the i-th bin and the residual rotation
correction applied to the orientation of the center ray of
that bin to obtain the output angle.
The residual rotation is represented by two numbers,
the sine and the cosine of the angle (decoded as in the sketch below).
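A minimal NumPy sketch of decoding the MultiBin output into a global yaw, assuming per-bin confidences, (cos, sin) residuals, known bin-center angles, and the viewing-ray angle at the crop center; the variable names are illustrative:

```python
import numpy as np

def decode_multibin(conf, residuals, bin_centers, ray_angle):
    """conf: (B,) bin confidences; residuals: (B, 2) with (cos, sin) per bin;
    bin_centers: (B,) bin center angles; ray_angle: ray angle at crop center."""
    i = int(np.argmax(conf))                              # most confident bin
    delta = np.arctan2(residuals[i, 1], residuals[i, 0])  # residual rotation
    local_theta = bin_centers[i] + delta                  # local orientation
    return local_theta + ray_angle                        # global orientation
```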
Total loss for the MultiBin orientation: L_theta = L_conf + w * L_loc, where
L_conf is the softmax confidence loss over the bins and
L_loc = -(1/n_theta*) * sum_i cos(theta* - c_i - delta_theta_i) is taken over the
bins that cover the ground-truth angle theta* (c_i is the center angle of bin i).
The loss for dimension estimation is an L2 loss on the residual w.r.t. the
per-class mean dimensions: L_dims = (1/n) * sum (D* - D_mean - delta)^2.
Left: car dimensions. Right: illustration of the local and global orientation
of a car; the local orientation is computed w.r.t. the ray through the center
of the crop.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
74. The architecture for MultiBin orientation and dimension estimation, with 3
branches: the left branch estimates the dimensions of the object of interest;
the other two compute the confidence for each bin and the cos(∆θ) and sin(∆θ)
of each bin. Qualitative illustration of 2D detection boxes and
estimated 3D projections.
3D BOUNDING BOX ESTIMATION USING DEEP
LEARNING AND GEOMETRY
75. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Deep MANTA (Many-Tasks), for vehicle analysis from a given image.
A robust CNN for simultaneous vehicle detection, part localization,
visibility characterization and 3D dimension estimation.
A coarse-to-fine object proposal that boosts the vehicle detection.
Deep MANTA localizes vehicle parts even if they are not visible.
At inference time, the network's outputs are fed to a real-time pose
estimation step for fine orientation estimation and 3D vehicle localization.
76. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
System outputs. Top: 2D vehicle bboxes, vehicle part localization, and part
visibility. Bottom: 3D vehicle bbox localization and 3D vehicle part
localization. The camera is shown in blue.
The 2D/3D model comprises the 2D vehicle bounding box, the 3D bounding box,
the 2D part coordinates, the part visibility vector, and the 3D part coordinates.
77. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Example of one 2D/3D vehicle model: (a) the bounding box B; (b) the 2D part
coordinates S and part visibility V.
The training objective combines a detection loss, a visibility loss, a
template similarity loss, and a part loss.
78. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Overview of the Deep MANTA approach. The entire input image is forwarded
through the Deep MANTA network. The conv. layers share the same weights, and
the three conv. blocks correspond to a split of an existing CNN architecture.
79. DEEP MANTA: A COARSE-TO-FINE MANY-TASK NETWORK FOR
JOINT 2D AND 3D VEHICLE ANALYSIS FROM MONOCULAR IMAGE
Semi-automatic annotation process. (a) Weak annotations on a real image (3D
bounding box). (b) The best corresponding 3D models in green. (c) Projection of
these 3D models into the image. (d) The corresponding visibility mesh. (e) Final
annotations.
80. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
Exploit stereo imagery to place proposals in the form of 3D Bboxes.
Proposals are obtained by minimizing an energy function encoding object size
priors, the ground plane, and depth-informed features about free space, point
cloud density, and distance to the ground.
Formulate the proposal generation problem as inference in an MRF in which the
proposal y should enclose a high-density region in the point cloud.
The energy combines four potentials: point cloud density, free space, height
prior, and height contrast (a density sketch follows).
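A minimal sketch of the point-cloud-density idea: score a 3D proposal by the fraction of cloud points it contains. This is an axis-aligned simplification of the paper's potential, and the names are illustrative; the paper accelerates the count with integral images over a voxel grid, while the brute-force version below only illustrates the quantity being scored:

```python
import numpy as np

def point_cloud_density(points, box_min, box_max):
    """points: (N, 3) point cloud; box_min/box_max: (3,) corners of an
    axis-aligned 3D proposal. Returns the fraction of points inside the box."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return inside.mean()
```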
81. 3D OBJECT PROPOSALS FOR ACCURATE OBJECT CLASS
DETECTION
Score Bbox proposals using a CNN built on Fast R-CNN;
It shares conv. features across all proposals and uses an RoI pooling layer to
compute proposal-specific features;
It adds a context branch after the last conv. layer, and an orientation regression
loss to jointly learn object location and orientation;
Features output by the original and context branches are concatenated and fed to
the prediction layers.
The context regions are obtained by enlarging the candidate boxes by a factor of
1.5 (see the helper below).
A Smooth L1 loss is used for orientation regression.
The parameters of the context branch are initialized by copying the weights from
the original branch.
OxfordNet (VGG) trained on ImageNet initializes the weights of the conv. layers
and the branch for candidate boxes; the network is then fine-tuned E2E on the
KITTI training set.
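A small helper illustrating the 1.5x context enlargement around a candidate box; clipping to a KITTI-sized image is assumed, and the function name is ours:

```python
def enlarge_box(box, factor=1.5, img_w=1242, img_h=375):
    """box: (x1, y1, x2, y2). Returns the context region: same center,
    width/height scaled by `factor`, clipped to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))
```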
83. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
Generate a set of candidate class-specific object proposals, which are
then run through a standard CNN pipeline to obtain object detections.
An energy minimization approach that places object candidates in 3D,
using the fact that objects should lie on the ground plane.
Score each candidate box projected to the image plane via several
intuitive potentials encoding semantic segmentation, contextual
information, size and location priors and typical object shape.
84. MONOCULAR 3D OBJECT DETECTION FOR AUTONOMOUS DRIVING
CNN architecture used to score proposals for
object detection and orientation estimation.
The scoring function combines semantic cues (both class- and instance-level
segmentation), location priors, context, and shape.
86. SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
A method for detecting 3D model instances and estimating their
6D poses from RGB data in a single shot.
To this end, extend the popular SSD paradigm to cover the full 6D
pose space and train on synthetic model data only.
It competes with or surpasses current state-of-the-art methods that
leverage RGB-D data on multiple challenging datasets.
It produces results at around 10 Hz, which is many times faster
than related methods.
Discrete 6D pose space, with each point representing a classifiable viewpoint.
The object distance can be inferred from the projective ratio (see the sketch below).
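A minimal sketch of the projective-ratio idea: if a synthetic view was rendered at a known canonical distance, the detected box's scale relative to the canonical box gives the depth. This is an assumption-laden simplification of the paper's procedure:

```python
def infer_distance(detected_diag, canonical_diag, canonical_dist):
    """Pinhole projection scales inversely with depth: an object appearing
    k times smaller than in its canonical rendering is k times farther."""
    return canonical_dist * (canonical_diag / detected_diag)

# e.g., a 100 px box diagonal vs. 200 px at a 0.5 m rendering distance -> 1.0 m
print(infer_distance(100.0, 200.0, 0.5))
```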
87. SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
After predicting 2D detections (a), build 6D hypotheses and run pose
refinement and a final verification. While the unrefined poses (b) are rather
approximate, contour-based refinement (c) already produces visually
acceptable results. Occlusion-aware projective ICP with cloud data (d) leads
to a very accurate alignment.
88. SSD-6D: MAKING RGB-BASED 3D DETECTION
AND 6D POSE ESTIMATION GREAT AGAIN
Schematic overview of the SSD-style network prediction.
C denotes the number of object classes, V the number of viewpoints, and R the
number of in-plane rotation classes; the other 4 values are used to refine the
corners of the discrete bounding boxes, so each anchor predicts (4 + C + V + R)
values in total.
89. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
A single-shot approach for simultaneously detecting an object in an
RGB image and predicting its 6D pose without requiring multiple
stages or having to examine multiple hypotheses.
Unlike SSD-6D, a recently proposed single-shot technique for this task
that only predicts an approximate 6D pose which must then be refined,
this approach is accurate enough to require no additional post-processing.
It is much faster – 50 fps on a Titan X (Pascal) GPU – and more
suitable for real-time processing.
The key component is a CNN architecture that directly predicts the
2D image locations of the projected vertices of the object’s 3D
bounding box.
The object’s 6D pose is then estimated using a PnP algorithm.
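A minimal OpenCV sketch of the final PnP step. The cube model, intrinsics, and pose are stand-in values so the example runs self-contained; in practice the 2D points come from the network:

```python
import numpy as np
import cv2

# 3D bounding-box corners in the object frame (a unit cube as a stand-in model)
object_pts = np.array([[x, y, z] for x in (-0.5, 0.5)
                                 for y in (-0.5, 0.5)
                                 for z in (-0.5, 0.5)], dtype=np.float64)

# Hypothetical intrinsics; in practice these come from camera calibration
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

# Stand-in for the network's predicted 2D corner locations: we project the
# corners with a known pose so the example is self-contained
rvec_gt = np.array([[0.1], [0.2], [0.0]])
tvec_gt = np.array([[0.0], [0.0], [4.0]])
image_pts, _ = cv2.projectPoints(object_pts, rvec_gt, tvec_gt, K, None)

# PnP recovers the 6D pose (rotation + translation) from 2D-3D correspondences
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
R, _ = cv2.Rodrigues(rvec)
print(ok, tvec.ravel())   # should recover roughly (0, 0, 4)
```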
91. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
(a) An example input image with four objects. (b) The S × S grid showing the
cells responsible for detecting the four objects. (c) Each cell predicts the 2D
locations of the corners of the projected 3D bounding boxes in the image.
(d) The 3D output tensor from the network, which represents, for each cell,
a vector consisting of the 2D corner locations, the class probabilities, and a
confidence value associated with the prediction (see the parsing sketch below).
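A hedged NumPy sketch of parsing one grid cell's prediction vector under the layout the caption implies: 9 control points (8 box corners plus the centroid) × 2 coordinates, C class probabilities, and 1 confidence. The 9-point layout is an assumption taken from the paper:

```python
import numpy as np

def parse_cell(pred, num_classes):
    """pred: 1-D vector for one grid cell, assumed laid out as
    [9 control points x 2 coords | C class probs | 1 confidence]."""
    pts = pred[: 9 * 2].reshape(9, 2)        # projected box corners + centroid
    cls = pred[9 * 2: 9 * 2 + num_classes]   # class probabilities
    conf = pred[-1]                          # prediction confidence
    return pts, cls, conf

# e.g., a random cell vector for C = 13 classes (LINEMOD-sized, illustrative)
pts, cls, conf = parse_cell(np.random.rand(9 * 2 + 13 + 1), num_classes=13)
```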
92. REAL-TIME SEAMLESS SINGLE SHOT 6D OBJECT
POSE PREDICTION
The last column shows failure cases due to motion blur,
severe occlusion, and specularity.
94. IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
A real-time RGB-based pipeline for object detection and 6D pose
estimation.
This 3D orientation estimation is based on a variant of the Denoising
Autoencoder that is trained on simulated views of a 3D model using
Domain Randomization.
This so-called Augmented Autoencoder (AAE) has several advantages
over existing methods:
Since training is independent of any concrete representation of
object orientation within SO(3) (e.g., quaternions), it can handle
ambiguous poses caused by symmetric views, because one-to-many
mappings from images to orientations are avoided.
It learns representations that specifically encode 3D orientation while
remaining robust to occlusion and cluttered backgrounds, and
generalizing to different environments and test sensors.
The AAE does not require any real pose-annotated training data;
instead, it is trained to encode 3D model views in a self-supervised way,
overcoming the need for a large pose-annotated dataset.
95. IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
6D object detection pipeline with the homogeneous transformation
(top-right) and the depth-refined result (bottom-right).
96. IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Training process for the AAE: (a) reconstruction target batch x of
uniformly sampled SO(3) object views; (b) geometrically and color-augmented
input; (c) reconstruction x̂ after 30,000 iterations (a training sketch follows).
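A minimal PyTorch sketch of the AAE training objective: reconstruct the clean rendered view from its augmented version, so the latent code becomes invariant to the augmentations. The `encoder`, `decoder`, and `augment` callables are placeholders, and plain MSE stands in for the paper's bootstrapped per-pixel L2:

```python
import torch.nn.functional as F

def aae_step(encoder, decoder, clean_views, augment, optimizer):
    """One AAE training step: the input is an augmented view, but the
    reconstruction target is the clean view, so the code must discard
    nuisance factors (background, color, occlusion) and keep orientation."""
    x_aug = augment(clean_views)           # geometric + color augmentation
    z = encoder(x_aug)                     # low-dimensional orientation code
    x_hat = decoder(z)
    loss = F.mse_loss(x_hat, clean_views)  # pixel-wise reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```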
97. IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Autoencoder CNN architecture with occluded test input
98. IMPLICIT 3D ORIENTATION LEARNING FOR
6D OBJECT DETECTION FROM RGB IMAGES
Top: creating a codebook from the encodings of discrete synthetic
object views. Bottom: object detection and 3D orientation estimation
using the nearest neighbor(s) with the highest cosine similarity in the
codebook (see the lookup sketch below).
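A minimal NumPy sketch of the codebook lookup: encode the detected crop, then return the stored orientation whose code has the highest cosine similarity. The codebook contents are assumed precomputed from synthetic views:

```python
import numpy as np

def lookup_orientation(z_test, codebook, rotations, k=1):
    """z_test: (D,) encoding of the test crop; codebook: (N, D) encodings of
    synthetic views; rotations: (N, ...) their known orientations.
    Returns the k orientations with the highest cosine similarity."""
    z = z_test / np.linalg.norm(z_test)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = cb @ z                          # cosine similarities to all views
    top = np.argsort(sims)[::-1][:k]       # indices of the best matches
    return rotations[top], sims[top]
```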