1) Given a sequence of stereo images, the pipeline generates a dense 3D semantic model of the urban environment.
2) Depth maps are generated from stereo images and fused into a volumetric representation using camera poses from feature tracking.
3) Semantic segmentation of street view images is done using a CRF model, and labels are projected onto the 3D model faces to generate the semantic model.
4) The semantic model is evaluated by projecting it back to the input images and calculating metrics like recall and intersection over union. Future work includes real-time implementation and combining image and geometric context.
Urban 3D Semantic Modelling Using Stereo Vision
1. Urban 3D Semantic Modelling Using Stereo Vision
Sunando Sengupta1, Eric Greveson2, Ali Shahrokni2, Philip H. S. Torr1
1 Oxford Brookes Vision Group, 2 2d3 Sensing
2. Urban 3D Semantic Modelling of a Road Scene
• Given a sequence of stereo images, we generate a dense 3D semantic model.
Figure: input stereo image sequence (left) and dense 3D semantic model (right).
9. Camera Estimation
• Use the feature tracks to estimate camera poses.
• Use bundle adjustment.
[a] Andreas Geiger et al., Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR 2012.
10. Depth-Map Estimation
• Semi-global block matching [1] for disparity estimation.
• Per-pixel depth computed as z = B · f / d, where B is the stereo baseline, f the focal length, and d the pixel disparity.
[1] H. Hirschmüller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008.
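The depth formula above can be sketched in a few lines. This is a minimal illustration; the 0.54 m baseline and 721 px focal length below are example rig values, not the system's calibration.

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal_length, min_disparity=1e-6):
    """Convert a disparity map to a depth map via z = B * f / d.

    baseline is in metres and focal_length in pixels, so depth is in metres.
    Pixels with near-zero disparity carry no reliable depth and become NaN.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disparity
    depth[valid] = baseline * focal_length / disparity[valid]
    return depth

# Illustrative stereo rig: 0.54 m baseline, 721 px focal length.
d = np.array([[38.9, 0.0], [19.5, 9.7]])
z = disparity_to_depth(d, baseline=0.54, focal_length=721.0)
```

Note that depth grows quickly as disparity shrinks, which is why far-away points (small d) have the noisiest depth estimates.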
11. Depth Fusion
• Depth estimates are fused using the estimated camera poses.
• Fused into a truncated signed distance (TSDF) volumetric representation [1].
[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images. SIGGRAPH 1996.
12. TSDF Volume [1]
• The entire space is divided into a grid of voxels.
• For each voxel, compute the truncated signed distance:
– positive and increasing when the voxel lies in free space,
– negative when it lies behind the surface,
– zero when it lies on the surface.
• Performed for all depth maps.
[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images. SIGGRAPH 1996.
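A toy one-ray version of the TSDF update might look like this. The weighted running average follows Curless and Levoy; the voxel depths, surface depth, and truncation value are made-up numbers for illustration.

```python
import numpy as np

def fuse_ray(tsdf, weights, voxel_depths, surface_depth, trunc):
    """Fuse one depth measurement into the voxels along a single camera ray.

    Signed distance is positive for voxels in free space in front of the
    measured surface, negative behind it, zero at the surface, and is
    truncated to [-trunc, trunc]. Successive depth maps are combined
    with a running weighted average.
    """
    sdf = np.clip(surface_depth - voxel_depths, -trunc, trunc)
    new_weights = weights + 1.0
    fused = (tsdf * weights + sdf) / new_weights
    return fused, new_weights

voxel_depths = np.arange(5.0)              # voxel centres along the ray
tsdf, weights = np.zeros(5), np.zeros(5)
tsdf, weights = fuse_ray(tsdf, weights, voxel_depths, surface_depth=2.5, trunc=1.0)
# tsdf is now [1.0, 1.0, 0.5, -0.5, -1.0]: the sign change between
# depths 2 and 3 marks the surface at 2.5.
```

The mesh extraction step then looks for exactly these zero crossings.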
15. Incremental Volume Update
• Road scenes are arbitrarily long sequences.
• A 3×3×1 grid of voxel volumes is initialised.
• Volumes are added incrementally as the vehicle moves out of the current region.
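The rolling-grid bookkeeping can be sketched as follows. The block size and the dictionary-of-blocks layout are assumptions for illustration, not the system's actual data structure.

```python
BLOCK_SIZE = 8.0  # metres covered by one voxel block (illustrative)

def active_block_keys(x, y):
    """Keys of the 3x3x1 block grid centred on the vehicle position."""
    cx, cy = int(x // BLOCK_SIZE), int(y // BLOCK_SIZE)
    return {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

def update_active_grid(active, x, y, make_block=dict):
    """Flush blocks the vehicle has left; allocate newly entered ones."""
    wanted = active_block_keys(x, y)
    flushed = {k: active.pop(k) for k in list(active) if k not in wanted}
    for k in wanted:
        active.setdefault(k, make_block())
    return flushed  # these would be written out before being discarded

active = {}
update_active_grid(active, 0.0, 0.0)             # 9 blocks around the start
flushed = update_active_grid(active, 20.0, 0.0)  # vehicle moved on; 6 blocks flushed
```

Only the fixed-size active grid is ever fused into, so memory stays bounded no matter how long the drive is.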
17. Semantic Image Segmentation
• We use the conditional random field (CRF) framework [1].
• Each pixel is a node in a grid graph G = (V, E).
• Each node is a random variable x taking a label from the label set.
Figure: input image (left) and final segmentation (right).
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.
18. Semantic Image Segmentation
• Total energy E = Epix + Epair + Eregion.
• Epix models an individual pixel’s cost of taking a label:
– computed via the dense boosting approach,
– a multi-feature variant of TextonBoost [1].
Figure: example unary costs for a pixel x: Car 0.2, Road 0.3.
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.
19. Semantic Image Segmentation
• Total energy E = Epix + Epair + Eregion.
• Epair models each pixel’s neighbourhood interaction:
– encourages label consistency between adjacent pixels while staying sensitive to edges,
– contrast-sensitive Potts model.
Figure: pairwise cost between pixels xi and xj: 0 when both take the same label (Car/Car or Road/Road), g(i, j) otherwise.
20. Semantic Image Segmentation
• Total energy E = Epix + Epair + Eregion.
• Eregion models the behaviour of a group of pixels:
– encourages all the pixels in a region to take the same label,
– groups of pixels given by multiple mean-shift segmentations.
Figure: example region costs for a clique c: Car 0.3, Road 0.1.
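The three-term energy on these slides can be evaluated for a candidate labelling as below. This is a deliberately simplified sketch: the region term here charges a flat cost whenever a region is not label-consistent, whereas the actual higher-order potential is a softer, robust version; all the numbers are invented.

```python
import numpy as np

def crf_energy(labels, unary, edges, regions, pair_weight, region_weight):
    """Total energy E = Epix + Epair + Eregion for one labelling (sketch).

    labels:  per-pixel label indices (array).
    unary:   unary[i, l] = classifier cost of pixel i taking label l.
    edges:   (i, j, g) tuples; a Potts penalty pair_weight * g is paid
             whenever neighbouring pixels i and j disagree.
    regions: lists of pixel indices; a region pays region_weight unless
             all of its pixels agree on a single label.
    """
    e_pix = unary[np.arange(len(labels)), labels].sum()
    e_pair = sum(pair_weight * g for i, j, g in edges if labels[i] != labels[j])
    e_region = sum(region_weight for r in regions if len(set(labels[r])) > 1)
    return e_pix + e_pair + e_region

# Three pixels, two labels (0 = car, 1 = road), two neighbour pairs, one region.
unary = np.array([[0.2, 0.3], [0.1, 0.4], [0.5, 0.1]])
E = crf_energy(np.array([0, 0, 1]), unary,
               edges=[(0, 1, 1.0), (1, 2, 0.5)],
               regions=[[0, 1, 2]], pair_weight=2.0, region_weight=1.0)
```

Inference then searches for the labelling that minimises this energy over all pixels jointly.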
22. Mesh Face Labelling
• A histogram of labels Zf is built for each mesh face by projecting points sampled on the face into the labelled images.
• The majority label is taken as the label of the face.
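A minimal sketch of the majority-vote face labelling; the `project` callback and the toy label maps below are stand-ins for the real camera projection and CRF outputs.

```python
import numpy as np
from collections import Counter

def label_face(sample_points, project, labelled_images):
    """Label one mesh face by majority vote over its projected samples.

    project(img_idx, point) returns the (row, col) pixel the point falls
    on in image img_idx, or None when the point is not visible there.
    """
    hist = Counter()  # the per-face label histogram Zf
    for k, label_map in enumerate(labelled_images):
        for p in sample_points:
            px = project(k, p)
            if px is not None:
                hist[int(label_map[px])] += 1
    return hist.most_common(1)[0][0] if hist else None

# Toy example: one 2x2 label image; point 0 projects to (0, 0), points 1 to (1, 1).
images = [np.array([[0, 1], [1, 1]])]
face_label = label_face([0, 1, 1], lambda k, p: (0, 0) if p == 0 else (1, 1), images)
```

Voting across all images in which the face is visible makes the face label robust to occasional per-image segmentation errors.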
23. Semantic Model
Top: left – surface reconstruction; right – semantic model.
Bottom: left – input image; right – object label set.
24. Evaluation
• The model is projected back using the estimated camera poses to create labelled images.
• Points in the model far from the camera are ignored in the projection.
In this work we attempt to create a semantic model of a road scene. We perform a dense 3D reconstruction and associate semantic meaning with the model. Such a reconstruction is particularly useful for intelligent/autonomous navigation, where the system needs to recreate the environment in which it operates and also recognise the objects present there. In our case we model a road scene, where each voxel in the reconstructed model is labelled with class labels such as car, road, pavement or building. The left side shows the input to our system, a sequence of stereo images, while the right side shows our desired output.
The camera pose estimation has two main steps, namely feature matching and bundle adjustment. We consider only feature matches satisfying both the left-right frames and the consecutive frames (stereo and ego-motion) to estimate the camera pose. This helps the bundle adjuster estimate the camera poses and feature points more accurately. In the bundle adjustment phase, our optimiser estimates the camera poses and the associated features viewed by the last n cameras (n = 20).
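The stereo-plus-ego-motion consistency check on the matches can be sketched as a lookup over match tables; the dictionary representation of the match sets is an assumption for illustration, not the actual matcher's interface.

```python
def consistent_tracks(stereo_t, temporal_left, stereo_t1):
    """Keep only feature tracks matched in both stereo views of both frames.

    stereo_t:      left_t feature id   -> right_t feature id   (stereo match)
    temporal_left: left_t feature id   -> left_t+1 feature id  (ego-motion match)
    stereo_t1:     left_t+1 feature id -> right_t+1 feature id
    Only tracks surviving all three matches are handed to bundle adjustment,
    which makes the pose and structure estimates more reliable.
    """
    kept = {}
    for f_lt, f_rt in stereo_t.items():
        f_lt1 = temporal_left.get(f_lt)
        if f_lt1 is not None and f_lt1 in stereo_t1:
            kept[f_lt] = (f_rt, f_lt1, stereo_t1[f_lt1])
    return kept

# Feature 1 is seen in all four images; feature 2 is lost after frame t.
tracks = consistent_tracks({1: 10, 2: 20}, {1: 100}, {100: 110})
```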
This is an example of the feature and camera pose estimation. The figure shows the camera centres and 3D points, registered manually to the Google map.
For generating the surface, we first estimate depth maps from the stereo pairs. These are merged into a Truncated Signed Distance (TSDF) volume using the estimated camera poses. Finally a mesh is created using the marching tetrahedra algorithm. Each depth estimate obtained from the stereo image pairs is merged into the volume.
The TSDF volume building works as follows. The entire space is divided into a grid of voxels, and for each voxel the distance to the surface is measured. The distance is zero at the surface, positive and increasing in the free space towards the camera, and negative and decreasing for voxels that lie behind the surface. The distance measure is truncated at some value, and this is done for each depth map. Finally, the voxels with zero estimates lie on the surface. A simple example shows how the TSDF volume is built: each depth estimate updates the voxels, and the zeros give the surface.
As we are reconstructing road scenes, which can run from hundreds of metres to kilometres, we use an online grid update method. We consider an active 3×3×1 grid of voxel volumes at any time during the fusion. As the vehicle track goes out of range of the current grid, the current grid blocks are written out and a new grid is initialised.
We use a Conditional Random Field (CRF) framework, which has achieved state-of-the-art results in classification of road scene data in recent years. In this framework the image is described as a grid graph: all the pixels in the image are the vertices of the graph, and each pixel is modelled as a random variable taking a value from the label set, which we need to infer. We now define each of the energy components in detail.
Epix models an individual pixel's cost of taking a label. Generally this is the classifier cost; in our case it is computed via the boosting approach. In the example shown, the green node has a cost of 0.2 for taking the label car and 0.1 for road. The costs of all other nodes are computed similarly.
The next term is the pairwise cost, which models the pixel's neighbourhood interactions. This cost enforces label consistency among adjacent pixels while remaining sensitive to edges, and so preserves boundaries in the image. In this example, the cost of two pixels taking the same label (car) is zero, and non-zero otherwise.
The final cost term models the behaviour of a group of pixels defining a region. The regions are found by an unsupervised segmentation technique such as mean-shift, and the term encourages the entire region to be classified with the same class label. In this example, the entire clique takes a cost for an object label.
These are some of the street-level details.
For each of the mesh face in the reconstructed model, we sample a certain number of points on the face. These face points are projected into the labeled images and a label histogram is built for each of hte mesh face.
Majority label is associated as the label of that particular mesh face.
As generating ground-truth 3D models is quite expensive, we evaluate the model by projecting the mesh labels into the image domain and comparing with the ground truth. We evaluate our model on both the recall and intersection-over-union metrics.
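The two evaluation metrics can be sketched per class as follows, with tiny made-up label arrays standing in for a projected model labelling and a ground-truth image.

```python
import numpy as np

def per_class_metrics(pred, gt, num_classes):
    """Per-class recall and intersection-over-union for a labelled image pair."""
    recall, iou = {}, {}
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if g.sum() > 0:
            recall[c] = inter / g.sum()   # true positives / ground-truth pixels
        if union > 0:
            iou[c] = inter / union        # true positives / (pred OR gt pixels)
    return recall, iou

pred = np.array([0, 0, 1, 1])  # labels projected from the model
gt = np.array([0, 1, 1, 1])    # ground-truth image labels
recall, iou = per_class_metrics(pred, gt, num_classes=2)
```

IoU penalises false positives as well as misses, so it is the stricter of the two metrics.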
This is the result video of our system. We use the stereo images from the KITTI dataset. Semi-global block matching stereo is used to obtain the disparity maps. The depths estimated from the disparity maps are fused into the volume using a TSDF. The street images are labelled in a CRF framework, and finally a labelled dense 3D reconstruction is obtained. We try our method at large scale, up to a kilometre. The inset image is the Google Earth image of the corresponding vehicle track overlaid on the map.