1) Given a sequence of stereo images, the pipeline generates a dense 3D semantic model of the urban environment.
2) Depth maps are generated from stereo images and fused into a volumetric representation using camera poses from feature tracking.
3) Semantic segmentation of street view images is done using a CRF model, and labels are projected onto the 3D model faces to generate the semantic model.
4) The semantic model is evaluated by projecting it back to the input images and calculating metrics like recall and intersection over union. Future work includes real-time implementation and combining image and geometric context.
Urban 3D Semantic Modelling Using Stereo Vision
1. Urban 3D Semantic Modelling Using Stereo Vision
Sunando Sengupta1, Eric Greveson2, Ali Shahrokni2, Philip H. S. Torr1
1 Oxford Brookes Vision Group, 2 2d3 Sensing
2. Urban 3D Semantic Modelling of a Road Scene
• Given a sequence of stereo images, we generate a dense 3D semantic model.
Figure: input stereo image sequence (left) and dense 3D semantic model (right).
9. Camera Estimation
• Use the feature tracks to estimate camera poses.
• Use bundle adjustment.
[a] Andreas Geiger et al., Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR 2012.
10. Depth-Map Estimation
• Semi-global block matching [1] for disparity estimation.
• Per-pixel depth computed as z = B · f / d, where B is the stereo baseline, f the focal length, and d the pixel disparity.
[1] H. Hirschmüller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008.
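The depth formula above can be sketched in a few lines. This is a minimal illustration; the 0.54 m baseline and 721 px focal length below are example rig values, not the system's calibration.

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal_length, min_disparity=1e-6):
    """Convert a disparity map to a depth map via z = B * f / d.

    baseline is in metres and focal_length in pixels, so depth is in metres.
    Pixels with near-zero disparity carry no reliable depth and become NaN.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disparity
    depth[valid] = baseline * focal_length / disparity[valid]
    return depth

# Illustrative stereo rig: 0.54 m baseline, 721 px focal length.
d = np.array([[38.9, 0.0], [19.5, 9.7]])
z = disparity_to_depth(d, baseline=0.54, focal_length=721.0)
```

Note that depth grows quickly as disparity shrinks, which is why far-away points (small d) have the noisiest depth estimates.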
11. Depth Fusion
• Depth estimates are fused using the estimated camera poses.
• Fused into a truncated signed distance (TSDF) volumetric representation [1].
[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images. SIGGRAPH 1996.
12. TSDF Volume [1]
• The entire space is divided into a grid of voxels.
• For each voxel, compute the truncated signed distance:
– positive and increasing when the voxel lies in free space,
– negative when it lies behind the surface,
– zero when it lies on the surface.
• Performed for all depth maps.
[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images. SIGGRAPH 1996.
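A toy one-ray version of the TSDF update might look like this. The weighted running average follows Curless and Levoy; the voxel depths, surface depth, and truncation value are made-up numbers for illustration.

```python
import numpy as np

def fuse_ray(tsdf, weights, voxel_depths, surface_depth, trunc):
    """Fuse one depth measurement into the voxels along a single camera ray.

    Signed distance is positive for voxels in free space in front of the
    measured surface, negative behind it, zero at the surface, and is
    truncated to [-trunc, trunc]. Successive depth maps are combined
    with a running weighted average.
    """
    sdf = np.clip(surface_depth - voxel_depths, -trunc, trunc)
    new_weights = weights + 1.0
    fused = (tsdf * weights + sdf) / new_weights
    return fused, new_weights

voxel_depths = np.arange(5.0)              # voxel centres along the ray
tsdf, weights = np.zeros(5), np.zeros(5)
tsdf, weights = fuse_ray(tsdf, weights, voxel_depths, surface_depth=2.5, trunc=1.0)
# tsdf is now [1.0, 1.0, 0.5, -0.5, -1.0]: the sign change between
# depths 2 and 3 marks the surface at 2.5.
```

The mesh extraction step then looks for exactly these zero crossings.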
15. Incremental Volume Update
• Road scenes are arbitrarily long sequences.
• A 3×3×1 grid of voxel volumes is initialised.
• Volumes are added incrementally as the vehicle moves out of the current region.
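The rolling-grid bookkeeping can be sketched as follows. The block size and the dictionary-of-blocks layout are assumptions for illustration, not the system's actual data structure.

```python
BLOCK_SIZE = 8.0  # metres covered by one voxel block (illustrative)

def active_block_keys(x, y):
    """Keys of the 3x3x1 block grid centred on the vehicle position."""
    cx, cy = int(x // BLOCK_SIZE), int(y // BLOCK_SIZE)
    return {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

def update_active_grid(active, x, y, make_block=dict):
    """Flush blocks the vehicle has left; allocate newly entered ones."""
    wanted = active_block_keys(x, y)
    flushed = {k: active.pop(k) for k in list(active) if k not in wanted}
    for k in wanted:
        active.setdefault(k, make_block())
    return flushed  # these would be written out before being discarded

active = {}
update_active_grid(active, 0.0, 0.0)             # 9 blocks around the start
flushed = update_active_grid(active, 20.0, 0.0)  # vehicle moved on; 6 blocks flushed
```

Only the fixed-size active grid is ever fused into, so memory stays bounded no matter how long the drive is.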
17. Semantic Image Segmentation
• We use the conditional random field (CRF) framework [1].
• Each pixel is a node in a grid graph G = (V, E).
• Each node is a random variable x taking a label from the label set.
Figure: input image (left) and final segmentation (right).
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.
18. Semantic Image Segmentation
• Total energy E = Epix + Epair + Eregion.
• Epix models an individual pixel’s cost of taking a label:
– computed via the dense boosting approach,
– a multi-feature variant of TextonBoost [1].
Figure: example unary costs for a pixel x: Car 0.2, Road 0.3.
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class image segmentation,” in ICCV, 2009.
19. Semantic Image Segmentation
• Total energy E = Epix + Epair + Eregion.
• Epair models each pixel’s neighbourhood interaction:
– encourages label consistency between adjacent pixels while staying sensitive to edges,
– contrast-sensitive Potts model.
Figure: pairwise cost between pixels xi and xj: 0 when both take the same label (Car/Car or Road/Road), g(i, j) otherwise.
20. Semantic Image Segmentation
• Total energy E = Epix + Epair + Eregion.
• Eregion models the behaviour of a group of pixels:
– encourages all the pixels in a region to take the same label,
– groups of pixels given by multiple mean-shift segmentations.
Figure: example region costs for a clique c: Car 0.3, Road 0.1.
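The three-term energy on these slides can be evaluated for a candidate labelling as below. This is a deliberately simplified sketch: the region term here charges a flat cost whenever a region is not label-consistent, whereas the actual higher-order potential is a softer, robust version; all the numbers are invented.

```python
import numpy as np

def crf_energy(labels, unary, edges, regions, pair_weight, region_weight):
    """Total energy E = Epix + Epair + Eregion for one labelling (sketch).

    labels:  per-pixel label indices (array).
    unary:   unary[i, l] = classifier cost of pixel i taking label l.
    edges:   (i, j, g) tuples; a Potts penalty pair_weight * g is paid
             whenever neighbouring pixels i and j disagree.
    regions: lists of pixel indices; a region pays region_weight unless
             all of its pixels agree on a single label.
    """
    e_pix = unary[np.arange(len(labels)), labels].sum()
    e_pair = sum(pair_weight * g for i, j, g in edges if labels[i] != labels[j])
    e_region = sum(region_weight for r in regions if len(set(labels[r])) > 1)
    return e_pix + e_pair + e_region

# Three pixels, two labels (0 = car, 1 = road), two neighbour pairs, one region.
unary = np.array([[0.2, 0.3], [0.1, 0.4], [0.5, 0.1]])
E = crf_energy(np.array([0, 0, 1]), unary,
               edges=[(0, 1, 1.0), (1, 2, 0.5)],
               regions=[[0, 1, 2]], pair_weight=2.0, region_weight=1.0)
```

Inference then searches for the labelling that minimises this energy over all pixels jointly.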
22. Mesh Face Labelling
• A histogram of labels Zf is built for each mesh face by projecting points sampled on the face into the labelled images.
• The majority label is taken as the label of the face.
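A minimal sketch of the majority-vote face labelling; the `project` callback and the toy label maps below are stand-ins for the real camera projection and CRF outputs.

```python
import numpy as np
from collections import Counter

def label_face(sample_points, project, labelled_images):
    """Label one mesh face by majority vote over its projected samples.

    project(img_idx, point) returns the (row, col) pixel the point falls
    on in image img_idx, or None when the point is not visible there.
    """
    hist = Counter()  # the per-face label histogram Zf
    for k, label_map in enumerate(labelled_images):
        for p in sample_points:
            px = project(k, p)
            if px is not None:
                hist[int(label_map[px])] += 1
    return hist.most_common(1)[0][0] if hist else None

# Toy example: one 2x2 label image; point 0 projects to (0, 0), points 1 to (1, 1).
images = [np.array([[0, 1], [1, 1]])]
face_label = label_face([0, 1, 1], lambda k, p: (0, 0) if p == 0 else (1, 1), images)
```

Voting across all images in which the face is visible makes the face label robust to occasional per-image segmentation errors.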
23. Semantic Model
Top: left – surface reconstruction; right – semantic model.
Bottom: left – input image; right – object label set.
24. Evaluation
• The model is projected back using the estimated camera poses to create labelled images.
• Points in the model far from the camera are ignored in the projection.
In this work we attempt to create a semantic model of a road scene. We perform a dense 3D reconstruction and associate semantic meaning with the model. Such a reconstruction is particularly useful for intelligent/autonomous navigation, where the system needs to recreate the environment in which it operates and also recognise the objects present there. In our case we model a road scene, where each voxel in the reconstructed model is labelled with class labels such as car, road, pavement or building. The left side shows the input to our system, a sequence of stereo images, while the right side shows our desired output.
The camera pose estimation has two main steps, namely feature matching and bundle adjustment. We consider only feature matches satisfying both the left-right frames and the consecutive frames (stereo and ego-motion) to estimate the camera pose. This helps the bundle adjuster estimate the camera poses and feature points more accurately. In the bundle adjustment phase, our optimiser estimates the camera poses and the associated features viewed by the last n cameras (n = 20).
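The stereo-plus-ego-motion consistency check on the matches can be sketched as a lookup over match tables; the dictionary representation of the match sets is an assumption for illustration, not the actual matcher's interface.

```python
def consistent_tracks(stereo_t, temporal_left, stereo_t1):
    """Keep only feature tracks matched in both stereo views of both frames.

    stereo_t:      left_t feature id   -> right_t feature id   (stereo match)
    temporal_left: left_t feature id   -> left_t+1 feature id  (ego-motion match)
    stereo_t1:     left_t+1 feature id -> right_t+1 feature id
    Only tracks surviving all three matches are handed to bundle adjustment,
    which makes the pose and structure estimates more reliable.
    """
    kept = {}
    for f_lt, f_rt in stereo_t.items():
        f_lt1 = temporal_left.get(f_lt)
        if f_lt1 is not None and f_lt1 in stereo_t1:
            kept[f_lt] = (f_rt, f_lt1, stereo_t1[f_lt1])
    return kept

# Feature 1 is seen in all four images; feature 2 is lost after frame t.
tracks = consistent_tracks({1: 10, 2: 20}, {1: 100}, {100: 110})
```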
This is an example of the feature and camera pose estimation. The figure shows the camera centres and 3D points, registered manually to the Google map.
For generating the surface, we first estimate depth maps from the stereo pairs. These are merged into a Truncated Signed Distance (TSDF) volume using the estimated camera poses. Finally a mesh is created using the marching tetrahedra algorithm. Each depth estimate obtained from the stereo image pairs is merged into the volume.
The TSDF volume building works as follows. The entire space is divided into a grid of voxels, and for each voxel the distance to the surface is measured. The distance is zero at the surface, positive and increasing in the free space towards the camera, and negative and decreasing for voxels that lie behind the surface. The distance measure is truncated at some value, and this is done for each depth map. Finally, the voxels with zero estimates lie on the surface. A simple example shows how the TSDF volume is built: each depth estimate updates the voxels, and the zeros give the surface.
As we are reconstructing road scenes, which can run from hundreds of metres to kilometres, we use an online grid update method. We consider an active 3×3×1 grid of voxel volumes at any time during the fusion. As the vehicle track goes out of range of the current grid, the current grid blocks are written out and a new grid is initialised.
We use a Conditional Random Field (CRF) framework, which has achieved state-of-the-art results in classification of road scene data in recent years. In this framework the image is described as a grid graph: all the pixels in the image are the vertices of the graph, and each pixel is modelled as a random variable taking a value from the label set, which we need to infer. We now define each of the energy components in detail.
Epix models an individual pixel's cost of taking a label. Generally this is the classifier cost; in our case it is computed via the boosting approach. In the example shown, the green node has a cost of 0.2 for taking the label car and 0.1 for road. The costs of all other nodes are computed similarly.
The next term is the pairwise cost, which models the pixel's neighbourhood interactions. This cost enforces label consistency among adjacent pixels while remaining sensitive to edges, and so preserves boundaries in the image. In this example, the cost of two pixels taking the same label (car) is zero, and non-zero otherwise.
The final cost term models the behaviour of a group of pixels defining a region. The regions are found by an unsupervised segmentation technique such as mean-shift, and the term encourages the entire region to be classified with the same class label. In this example, the entire clique takes a cost for an object label.
These are some of the street-level details.
For each of the mesh face in the reconstructed model, we sample a certain number of points on the face. These face points are projected into the labeled images and a label histogram is built for each of hte mesh face.
Majority label is associated as the label of that particular mesh face.
As generating ground-truth 3D models is quite expensive, we evaluate the model by projecting the mesh labels into the image domain and comparing with the ground truth. We evaluate our model on both the recall and intersection-over-union metrics.
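The two evaluation metrics can be sketched per class as follows, with tiny made-up label arrays standing in for a projected model labelling and a ground-truth image.

```python
import numpy as np

def per_class_metrics(pred, gt, num_classes):
    """Per-class recall and intersection-over-union for a labelled image pair."""
    recall, iou = {}, {}
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if g.sum() > 0:
            recall[c] = inter / g.sum()   # true positives / ground-truth pixels
        if union > 0:
            iou[c] = inter / union        # true positives / (pred OR gt pixels)
    return recall, iou

pred = np.array([0, 0, 1, 1])  # labels projected from the model
gt = np.array([0, 1, 1, 1])    # ground-truth image labels
recall, iou = per_class_metrics(pred, gt, num_classes=2)
```

IoU penalises false positives as well as misses, so it is the stricter of the two metrics.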
This is the result video of our system. We use the stereo images from the KITTI dataset. Semi-global block matching stereo is used to obtain the disparity maps. The depths estimated from the disparity maps are fused into the volume using a TSDF. The street images are labelled in a CRF framework, and finally a labelled dense 3D reconstruction is obtained. We try our method at large scale, up to a kilometre. The inset image is the Google Earth image of the corresponding vehicle track overlaid on the map.