Monocular Depth Cues in
Computer Vision Applications
Diego Cheda
Thesis Advisors:
Dr. Daniel Ponsa
Dr. Antonio López
December 14, 2012
We don’t need two eyes to perceive depth.
[Edgar Muller]
Motivation
Human depth cues
There are different sources of information supporting depth
perception.
Motivation
Depth estimation from a single image
Prior information
Our world is structured; in an abstract world it need not be, as illustrated by René Magritte's paintings: Golconda, The Blank Check, The Listening Room, and Personal Values.
Outline
1 Objectives
2 Coarse depth map estimation
3 Egomotion estimation
4 Background estimation
5 Pedestrian candidate generation
6 Conclusions and future work
Objectives
• Coarse depth map estimation
simple and low-cost
low-level features based on pictorial cues
• Increasing the performance of many applications
Egomotion estimation
Background estimation
Pedestrian candidate generation
Objectives
Segmenting an image into depth categories
• Near
Depth is usually estimated by using a stereo configuration.
• Very-far
The effect of camera translation at faraway distances is imperceptible.
• Medium and Far
Interesting for potential applications.
Outline
1 Objectives
2 Coarse depth map estimation
3 Egomotion estimation
4 Background estimation
5 Pedestrian candidate generation
6 Conclusions and future work
Coarse depth map estimation
Method
Pipeline of our approach
• Multiclass classification problem
• Supervised learning approach
Coarse depth map estimation
Method
Ground truth dataset
• Set of urban outdoor images
Saxena et al.: 400 images for training and 134 for testing.
• Each image has an associated depth map acquired by a laser
scanner.
The depth maps are thresholded to obtain the ground-truth depth categories.
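As an illustration, such a thresholding could look like the sketch below; the 30/50/70 m cut points come from the learning-approach slide, while the function name is ours:

```python
# Hedged sketch: turn a laser depth map (in meters) into the four
# ground-truth categories; thresholds follow the 30/50/70 m scheme.
import numpy as np

THRESHOLDS_M = (30.0, 50.0, 70.0)

def depth_to_category(depth_m):
    """0 = near, 1 = medium, 2 = far, 3 = very-far."""
    return np.digitize(depth_m, THRESHOLDS_M)
```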
Coarse depth map estimation
Method
Regions
Superpixels Regular grid
Superpixels preserve intra-region similarities.
× Time consuming.
× Regular grids merge information from different regions.
Computed only once per camera configuration.
Coarse depth map estimation
Method
Features
• Beyond 30 m, monocular pictorial cues are the predominant source of depth information.
• Low-level visual features represent texture, relative height, and atmospheric scattering.
Coarse depth map estimation
Method
Features - Texture
Paris street, rainy day - Gustave Caillebotte
At greater distances, texture patterns become finer and appear smoother.
To capture textures we use
• Weibull distribution
• Gabor filters
Coarse depth map estimation
Method
Features - Texture: Weibull distribution
• Compact representation
β (scale) parameter | γ (shape) parameter
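A minimal sketch of how such a two-parameter descriptor could be computed, assuming (as is common for this cue) that the Weibull is fitted to gradient magnitudes with SciPy; the thesis does not spell out its exact estimator:

```python
# Hypothetical sketch: Weibull texture descriptor for an image region.
# Fitting a two-parameter Weibull to gradient magnitudes is our assumption.
import numpy as np
from scipy.stats import weibull_min

def weibull_texture(region_gray):
    """Return (beta, gamma): Weibull scale and shape of gradient magnitudes."""
    gy, gx = np.gradient(region_gray.astype(float))
    mag = np.hypot(gx, gy).ravel()
    mag = mag[mag > 0]                             # Weibull support is x > 0
    gamma, _, beta = weibull_min.fit(mag, floc=0)  # (shape, loc=0, scale)
    return beta, gamma
```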
Coarse depth map estimation
Method
Features - Texture: Gabor filter
Images and their Gabor filter responses
• Distinguish smooth from textured regions
Coarse depth map estimation
Method
Features - Relative height
When an object is near the
horizon, it is perceived as distant.
To capture relative height we use
• Location: x and y coordinates
in the image
Coarse depth map estimation
Method
Features - Location
Depth averages over the ground truth for the near, medium, and far categories
Coarse depth map estimation
Method
Features - Atmospheric scattering
The Virgin and Child with St. Anne - Leonardo Da Vinci
The farther away objects are, the less clear and less detailed they appear with respect to closer ones.
To capture atmospheric
scattering we use
• RGB histogram
• HSV histogram
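A sketch of such color features with OpenCV and NumPy; the bin count, the per-channel value range, and the normalization are our assumptions:

```python
# Sketch of the color histogram features (input assumed 8-bit BGR).
import numpy as np
import cv2

def color_histograms(region_bgr, bins=8):
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    feats = []
    for img in (region_bgr, hsv):
        for ch in range(3):
            h, _ = np.histogram(img[..., ch], bins=bins, range=(0, 256))
            feats.append(h / max(h.sum(), 1))  # normalized histogram
    return np.concatenate(feats)
```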
Coarse depth map estimation
Method
Learning approach
One-vs-All
• Binary classifiers
• Training one classifier per class (near, medium, far, and very-far)
• Low performance due to the scarce number of positive examples for the medium
and far classes.
Our approach
• Training three classifiers: > 30, > 50, > 70 m.
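A hedged sketch of this scheme, using probabilistic SVMs from scikit-learn as stand-ins for the actual classifiers; mapping the three "farther than t" decisions to the four categories by counting positive votes is our simplification:

```python
# Sketch: three binary "farther than t?" classifiers combined into four
# depth categories. SVC and the vote-counting rule are assumptions.
import numpy as np
from sklearn.svm import SVC

THRESHOLDS = (30.0, 50.0, 70.0)  # meters, as in the slides

def train_depth_classifiers(X, depths_m):
    """One binary classifier per threshold t: is this region farther than t?"""
    return [SVC(probability=True).fit(X, depths_m > t) for t in THRESHOLDS]

def depth_category(classifiers, x):
    """Count 'farther than t' votes and map to near/medium/far/very-far."""
    votes = sum(int(c.predict_proba([x])[0, 1] > 0.5) for c in classifiers)
    return ["near", "medium", "far", "very-far"][votes]
```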
Coarse depth map estimation
Method
Training
Coarse depth map estimation
Method
Testing
Coarse depth map estimation
Method
Inference
• CRF combining the probabilities obtained from the classifiers.
• It encourages neighboring regions to belong to the same depth category.
• Graph cuts are used to obtain a maximum likelihood labeling.
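For intuition only, here is a binary (close vs. distant) version of such a graph-cut smoothing with PyMaxflow; the thesis' four-label CRF would typically be handled with alpha-expansion instead, and the pairwise weight is an assumption:

```python
# Illustrative binary graph-cut smoothing of classifier probabilities.
import numpy as np
import maxflow

def smooth_labels(p_distant, lam=0.5):
    """p_distant: HxW map of P(pixel is distant); lam: Potts pairwise weight."""
    eps = 1e-6
    cost_distant = -np.log(p_distant + eps)        # unary cost of 'distant'
    cost_close = -np.log(1.0 - p_distant + eps)    # unary cost of 'close'
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(p_distant.shape)
    g.add_grid_edges(nodes, lam)                   # 4-neighbor Potts term
    g.add_grid_tedges(nodes, cost_distant, cost_close)
    g.maxflow()
    # Boolean label map; True/False are the two sides of the min-cut
    # (see the PyMaxflow docs for the sign convention).
    return g.get_grid_segments(nodes)
```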
Coarse depth map estimation
Experimental results
Performance measurement
• Measure of performance: the Jaccard index,
J = TP / (TP + FP + FN),
which measures the level of agreement with respect to an ideal classification result.
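In code, the per-class index is simply:

```python
def jaccard_index(tp, fp, fn):
    """Jaccard index: agreement with the ideal classification (1.0 = perfect)."""
    return tp / float(tp + fp + fn)

# e.g., jaccard_index(36, 30, 34) -> 0.36  (illustrative counts)
```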
Coarse depth map estimation
Experimental results
Different region groupings
Performance of our method using different oversegmentation configurations:
regular grids of 10x10, 15x15, and 20x20 cells, and TurboPixels superpixels
with ∼200, ∼400, and ∼800 regions.

Jaccard index by algorithm and number of regions:

Algorithm      20x20     15x15     10x10
Superpixels    0.3623    0.3567    0.3561
Grid           0.3586    0.3602    0.3570

• The best performing configuration uses superpixels.
Coarse depth map estimation
Experimental results
Comparison w.r.t. state-of-the-art
Saxena et al.
• A more challenging goal: a photo-realistic 3D model.
• For each superpixel and its neighbors: features for occlusions, geometric,
statistical, and spatial information, and textures, at multiple spatial scales.
• Inference methods with a high computational cost.
• MRF
Coarse depth map estimation
Experimental results
Comparison w.r.t. state-of-the-art
Our method uses a remarkably smaller number of low-level features (64 vs. 646, respectively).
Coarse depth map estimation
Experimental results
Relevance of visual features
Coarse depth map estimation
Experimental results
Qualitative results (columns): image, laser depth map, Saxena et al., ours
Coarse depth map estimation
Conclusions
We have presented
• A supervised learning approach to segment an image
according to certain depth categories.
• Our algorithm uses a reduced number of low-level visual
features, which are based on monocular pictorial cues.
Our results show
• Monocular cues are useful for depth estimation.
• Close and distant regions are well-segmented by our approach.
• Regions at medium distances are more difficult to segment.
• On average, our method outperforms the method of Saxena et al.
Outline
1 Objectives
2 Coarse depth map estimation
3 Egomotion estimation
4 Background estimation
5 Pedestrian candidate generation
6 Conclusions and future work
Egomotion estimation
Motivation
Egomotion estimation
Estimating the vehicle position is a key component of many ADAS applications:
Autonomous navigation
Adaptive cruise control
Lane change assistance
Egomotion estimation
Problem definition
Egomotion problem
Determining the changes in the camera's 3D position and orientation.
• Camera motion is described as a 3D rigid motion:
p_t = R_t p_0 + t_t
• Six degrees of freedom (DOF).
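A worked NumPy example of p_t = R_t p_0 + t_t, with a small yaw rotation and a forward translation (all values illustrative):

```python
# Applying a 3D rigid motion to a point: rotation about Y (yaw) plus
# a forward translation. Numbers are illustrative only.
import numpy as np

yaw = np.deg2rad(5.0)
R_t = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                [ 0,           1, 0          ],
                [-np.sin(yaw), 0, np.cos(yaw)]])  # yaw rotation matrix
t_t = np.array([0.0, 0.0, 1.5])                   # forward translation (m)
p_0 = np.array([2.0, 0.5, 10.0])                  # 3D point at time 0
p_t = R_t @ p_0 + t_t                             # point after the motion
```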
Egomotion estimation
Goal
Distant regions behave as if lying on the plane at infinity
Properties
• They remain at the same image coordinates under camera translation.
• They are affected only by camera rotation.
Goal
• Identify distant regions in the image to estimate vehicle rotation decoupled
from vehicle translation.
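One way to make the decoupling concrete: for points at (near) infinity, image motion reduces to the infinite homography K R K^{-1}, so the rotation can be fitted on normalized rays from distant matches. The Kabsch-style estimator below is our illustrative choice, not necessarily the thesis' exact method:

```python
# Sketch: estimate camera rotation from matched distant points, assuming
# their motion is induced by rotation alone (plane-at-infinity model).
import numpy as np

def rotation_from_distant_points(x0, x1, K):
    """x0, x1: Nx2 matched pixels in distant regions; K: 3x3 intrinsics."""
    Kinv = np.linalg.inv(K)
    h0 = np.c_[x0, np.ones(len(x0))] @ Kinv.T  # back-project to viewing rays
    h1 = np.c_[x1, np.ones(len(x1))] @ Kinv.T
    h0 /= np.linalg.norm(h0, axis=1, keepdims=True)
    h1 /= np.linalg.norm(h1, axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd(h1.T @ h0)        # Kabsch fit: R maps h0 to h1
    D = np.diag([1, 1, np.linalg.det(U @ Vt)]) # enforce a proper rotation
    return U @ D @ Vt
```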
Egomotion estimation
Algorithm overview
Egomotion estimation based on distant points / regions
× Distant points are hard to track, since they lie in low-textured regions.
The distant-region algorithm makes maximal use of the available distant information.
Egomotion estimation
Experimental results
Datasets
• Karlsruhe dataset: 8 sequences.
• More than 8000 frames (∼3 km).
• Ground truth: INS sensor.
• Stereo depth maps.
Egomotion estimation
Experimental results
Evaluation of our distant regions segmentation
Egomotion estimation
Experimental results
Comparison with other approaches
• The five-point algorithm (5pts) by Nistér.
• The Burschka et al. method (RANSAC).
• The stereo-based algorithm by Kitt et al. (as a baseline).
Egomotion estimation
Experimental results
Rotation estimation performance | Trajectory estimation performance
Egomotion estimation
Experimental results
Yaw angle comparison
GT (INS sensor) | DR (distant regions) | DP (distant points)
Egomotion estimation
Experimental results
Trajectory results
Egomotion estimation
Conclusions
In this section, we have
• Proposed two novel monocular egomotion methods based on
tracking distant points and distant regions.
Our results show
• Rotations are accurately estimated, since distant regions
provide strong indicators of camera rotation.
• Our approach outperforms the other state-of-the-art methods considered.
• Comparable performance with respect to the considered stereo
algorithm.
Outline
1 Objectives
2 Coarse depth map estimation
3 Egomotion estimation
4 Background estimation
5 Pedestrian candidate generation
6 Conclusions and future work
Background estimation
Problem definition
Background estimation
Automatically remove transient and moving objects from a set of
images with the aim of obtaining an occlusion-free background
image of the scene.
Background model
• Represents objects whose distance to the camera is maximal.
• Background objects are stationary.
Goal
• Identify close regions to penalize deviations from our background
model.
Background estimation
Experimental results
Example of labeling
Columns: original images, close/distant region labeling, and our result
Background estimation
Method
Energy function
E(f) = Σ_{p∈P} D_p(f_p) + Σ_{(p,q)∈N} V_{p,q}(f_p, f_q)
       (data term)        (smoothness term)

Data term Penalizes deviations from our background model, taking into
account color, motion, and depth:

D_p(f_p) = α D^S_p(f_p) + β D^M_p(f_p) + γ D^P_p(f_p)

• Color variations between short time intervals
• Moving objects, by using motion boundaries
• Close objects, using our approach

Smoothness term Penalizes the intensity differences between
neighboring regions, giving a higher cost when images do not
match well.
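A minimal sketch of evaluating E(f) for a candidate labeling on a pixel grid; a plain Potts penalty stands in for the intensity-dependent smoothness term, and the α, β, γ weighting of the data costs is assumed precomputed into D:

```python
# Evaluate the energy of a labeling f given per-pixel data costs D.
import numpy as np

def energy(f, D, lam=1.0):
    """f: HxW integer labels; D: HxWxL per-pixel data costs D_p(f_p).
    Pairwise term: Potts on 4-neighborhoods (a simplification)."""
    h, w = f.shape
    data = D[np.arange(h)[:, None], np.arange(w)[None, :], f].sum()
    smooth = (f[:, 1:] != f[:, :-1]).sum() + (f[1:, :] != f[:-1, :]).sum()
    return data + lam * smooth
```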
Background estimation
Experimental results
Datasets
Towers: 11 frames | City: 7 frames | Train: 3 frames | Market: 8 frames
Background estimation
Experimental results
Agarwala et al.
• State-of-the-art method.
• Requires user intervention to refine results.
• Refined results used as ground truth.

Norm of absolute difference in RGB channels per sequence:
Towers    City      Train     Market
0.0551    0.0804    0.0479    0.0603
Background estimation
Experimental results
Independently moving objects
Original images
Our method Agarwala et al.
Background estimation
Conclusions
In this section,
• We have presented a background estimation method for image sets
containing moving/transient objects.
• This method uses depth information for such purpose by
penalizing close regions in a cost function.
Our results show that
• Our method significantly outperforms the median filter.
• Our approach is comparable to the Agarwala et al. method,
without requiring any user intervention.
Outline
1 Objectives
2 Coarse depth map estimation
3 Egomotion estimation
4 Background estimation
5 Pedestrian candidate generation
6 Conclusions and future work
Pedestrian candidate generation
Problem definition
Pedestrian candidate generation Generating hypotheses to be
evaluated by a pedestrian classifier.
[Gerónimo 2010]
Goal
Exploiting the geometric and depth information available in single images
to reduce the number of windows to be further processed.
Pedestrian candidate generation
Method
Overview
a) Original image → b) geometric information + c) depth information → fusion → d) pedestrian candidate windows
Pedestrian candidate generation
Method
Agglomerative clustering schema
• Regions over the ground surface.
• Agglomerating regions while maintaining size coherence w.r.t. depth (see the sketch after this figure).

(a) Geometric, depth, and spatial information computed over superpixels of the original image. (b) Superpixels are merged by hierarchical clustering using gravity, depth, and size. (c) Bounding boxes surround the resulting regions.
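A hypothetical sketch of the size-coherence test implied by (b): a merged region is kept as a candidate only if its pixel height matches the projected height of a pedestrian at the region's depth. The focal length, pedestrian height prior, and tolerance below are all assumptions:

```python
# Size-vs-depth coherence check for a merged region (illustrative values).
def plausible_pedestrian(region_height_px, depth_m,
                         focal_px=800.0, ped_height_m=1.7, tol=0.5):
    """Accept a region whose pixel height matches the projected height
    of a ~1.7 m pedestrian at the region's depth, within tolerance."""
    expected_px = focal_px * ped_height_m / depth_m  # pinhole projection
    return abs(region_height_px - expected_px) <= tol * expected_px
```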
Pedestrian candidate generation
Experimental results
Dataset
• CVC Pedestrian dataset.
• 15 sequences taken from a stereo rig rigidly mounted on a car
while driving in an urban scenario (4364 frames).
• 7983 manually annotated pedestrians visible at less than 50 meters.
Performance measures
• Number of pedestrian candidates generated.
• True positive rate: TPR = TP / (TP + FN).
Pedestrian candidate generation
Experimental results
Pedestrian candidate generation
Experimental results
Lost pedestrians by distance (m): 4% at 0-10, 18% at 10-25, and 78% at more than 25.
Pedestrian candidate generation
Conclusions
In this section, we have presented:
• Novel monocular method for generating pedestrian candidates.
• It is based on geometric relationships and depth.
Our results show that:
• Our method outperforms all the considered methods, since it
significantly reduces the number of candidates.
• It maintains a high TPR.
Outline
1 Objectives
2 Coarse depth map estimation
3 Egomotion estimation
4 Background estimation
5 Pedestrian candidate generation
6 Conclusions and future work
Conclusions and future work
Conclusions
• We have proposed a supervised learning approach to classify the
pixels of outdoor images in just four categories: near,
medium-distance, far and very-far, based on monocular pictorial
cues.
• Compared against the results of a more complex depth map estimation
method, our method achieves better performance while using
computationally inexpensive techniques.
• We have demonstrated the usefulness of our coarse depth maps in
improving the results of egomotion estimation, background
estimation, and pedestrian candidate generation. In each
application, we have contributed novel methods based on the use of
coarse depth.
Conclusions and future work
Future work
• Extend our approach to consider more monocular depth cues, such as
occlusions and relative and familiar size, which could improve our coarse
estimation.
• Explore other possible applications of depth information (tracking,
initializing 3D reconstruction algorithms, learning pedestrian
classifiers according to depth, etc.).
• Integrate our depth estimation method in different ADAS modules.
Conclusions and future work
Publications
This thesis is based on the following publications:
Conference papers
• Camera Egomotion Estimation in the ADAS Context, D. Cheda, D. Ponsa and
A. M. López, IEEE Conf. Intell. Transp. Syst., 2010.
• Monocular Egomotion Estimation based on Image Matching, D. Cheda, D.
Ponsa and A. M. López, Int. Conf. Pattern Recognit. Appl. and Methods, 2012.
• Monocular Depth-based Background Estimation, D. Cheda, D. Ponsa and A. M.
López, Int. Conf. Comput. Vision Theory Appl., 2012.
• Pedestrian Candidates Generation using Monocular Cues, D. Cheda, D. Ponsa
and A. M. López, IEEE Intell. Vehicles Symposium, 2012.
Journal papers under review
• Monocular Multilayer Depth Segmentation and Applications, D. Cheda, D.
Ponsa and A. M. López, submitted to IJCV, Springer.
• Monocular Visual Odometry Boosted by Monocular Depth Cues, D. Cheda, D.
Ponsa and A. M. López, submitted to ITS, IEEE.
Thanks!
