Scene Interpretation For SLAM using Stereo-vision
In robotics, SLAM (Simultaneous Localisation and Mapping) is the problem of estimating concurrently
the robot’s position and a map of its surrounding environment. The task is challenging because of two main
factors: noisy robot dynamics and limitation of the sensors. Although work in SLAM has preferentially used laser-
rangefinders, here stereo-vision is used. Vision-sensors enable us to use more distant objects as landmarks for
navigation, and to learn and use color and texture models of the environment, in looking further ahead than is
possible with range-sensors alone.
The problem can be sub-divided into the following areas:
1. Scene Interpretation ( Local Map-building)
2. Global Map-building
3. Planning and Control
This report only focuses on Scene Interpretation.
1. Scene Interpretation
The objective is to construct a local map – distinguishing obstacles from free-regions, from the
visual information. The basic procedure is to compute a ground plane from the disparity image, analyse
the height differences from the ground to find obstacles, and finally coupling this information with color
analysis to give the traversable regions of the terrain.
The first task is to compute the disparity image from the left and right images. The stereo
camera used for the purpose is videre mdcs2 stereo-cam. Calibrating the camera with a set of 20
images of the chequered pattern yielded the fundamental matrix. 2 methods to compute the
disparity image were tried out:
• SRI Stereo-Engine, bundled as SVS software
The Stereo Engine algorithm computes disparity between a pair of stereo images, using an
extremely efficient implementation of area correlation. It comprises the following 4 stages:
1. Rectification - remove distortion from images
2. Feature Extraction - compute the Laplacian of Gaussian for each image
3. Correlation - compute disparities by matching
4. Filtering - remove uncertain matches
• Using Graph-Cuts on CUDA
The goal of the algorithm is to compute one disparity estimate for each pixel in a
reference image, which is generally the left input image. For SLAM (specifically ground-plane
extraction), a dense disparity map of the reference image is an important requirement. It is
to be noted that the algorithm takes rectified images as input. In other words, the goal is to
find a labeling f that assigns each pixel p ε P a label fp ε ℓ ( a finite set), where f is both
piecewise smooth and consistent with the observed data.
This is one of the methods for global optimization, whereby the following energy
function is minimized (the choice of the energy function depends upon the specific
E(f) = Esmooth(f) + Edata(f)
For more details on the energy function, please refer [ ]
This energy-minimization problem is solved using graph-cuts, which is implemented as a
multi-threaded algorithm on CUDA, hence speeding up the execution considerably. In the
disparity image thus obtained, each disparity image point [u,v,d] corresponds to a 3D point
in the robot’s frame.
1.2 Ground-Plane Extraction
The next step is to obtain the ground-plane. Although it is possible to detect obstacles using
local variation in height, using a ground plane simplifies processing and yields more stable results. To
extract ground-plane, RANSAC method is generally used (refer ), using sets of 3 non-collinear
points. Hypothesized planes are ranked by the number of points that are close to the plane. Points
that lie too high above the ground-plane, but lower than the robot’s height, are labeled as obstacles.
Note: The method of extraction of ground-plane using RANSAC is useful generally in relatively open
spaces. For environments which are cluttered with obstacles, it is better to resort directly to
detection of obstacles using local variation in height.
1.3 Ground-Plane Segmentation using color-model
The input image is first over-segmented into a number of super-pixels i.e, contiguous regions
with fairly uniform color and texture . Each super-pixel is classified as belonging as either to
ground or non-ground, based on the whether the membership score (X) of the super-pixel is above or
below a threshold.
Each individual image-pixel is classified as belonging either to an obstacle or the ground based on its
color appearance. The algorithm consists of the following four steps:
1. The color image is filtered with a 5X5 Gaussian filter to reduce the level of noise.
2. The filtered RGB values are transformed into the HSV color space. Valid values to hue and
saturation are assigned only if intensity is above a threshold, because color information is
very noisy at low intensity. Similarly, hue is assigned a valid value if the corresponding
saturation value is above a threshold. This causes the segmentation to be less sensitive to
3. A trapezoidal area in front of the mobile robot is used for reference. The valid hue and
intensity values of the pixels inside the trapezoidal reference area are histogrammed into
two one-dimensional, one for hue and other for intensity.
4. All pixels of the filtered input image are compared to the hue and the intensity
histograms. A pixel is classified as an ground if both the two conditions are satisfied:
• The hue histogram bin value at the pixel’s hue value is above the hue threshold.
• The intensity histogram bin value at the pixel’s intensity value at the pixel’s value is
above the intensity threshold.
The membership score is calculated as follow:
X(Si) = 1/|Si| ∑ (ground-pixels in the super-pixel)
1.4 Local Map Construction
Occupancy maps are very suitable representations for SLAM purposes. Although occupancy
grids may be implemented in any number of dimensions, 2D grids are mostly used. The reduction of
3D data of the world to 2D representation is justified, since a robot fundamentally move on the
ground-plane, hence inhabiting a 2D world. Each grid represents a space in the physical world, and
the value in the grid signifies the probability of it being occupied.
From the disparity image obtained from 1.1, the ground portion is identified from 1.2 and 1.3,
and removed from the disparity image, leaving only the obstacles. In this disparity image, darker
areas indicate lower disparities, and are farther away from the camera.
From the resulting disparity image, the maximum disparity is taken from each column, to
create a single row of maximum disparities. Fig.1.4 (b) shows these column values converted into
distance, and (c) shows these distance values converted into an occupancy grid representation, with
black indicating the uncertain region around the obstacles, and white indicating regions that are
clear. Spikes such as in fig.(b) may be present, of there are mismatches in the stereo-algorithm.
Fig.1.4 (a) Row of Maximum Disparity
Fig. 1.4 (b) Column vs Distance Graph (c) Occupancy Grid Representation
The spikes can be overcome using surface-segmentation. These spikes are caused due to errors
which are locally stable, but not large and they have no support from the surrounding surfaces. True
surfaces in the stereo image should not only be globally consistent, but globally part of a larger 3D
surface. To segment the image into surfaces of continuous disparity, the following logic is applied:
i= L given j ε N(i)
|dj – di| <=1
where i is any given pixel, L is a surface label, N(i) is a neighborhood of pixels around i and di is the
disparity value at location i.
Now using 1.4(c), the map has to be updated. Though there are efficient probabilistic
frameworks exist for this purpose, this report considers a much simpler method. Grid-values range
from 0-255, with 0 indicating free-space. All the grids start with a value of 128. With fig. 1.4(c), the
update rule is:
If i ε OCC (r) G(i) = G(i) + K
If i ε CLEAR (r) G(i) = G(i) - K
where i is an occupancy grid location, r is the sensor reading, OCC(r) represents the region of
uncertainty around the sensed obstacle, CLEAR(r) represents the clear region between robot and
the sensed obstacle, G(i) represents the current grid-value at location i and K is a constant. Typically,
one may take K=20, and G(i) < 50 to be free of obstacles and G(i) > 150 to be definitely obstructed.
2. Global Map-Building
The aim is to have accurate registration of local surroundings, and have global consistency
maintained at each step. The tools at our disposal are: Visual odometry (VO) and shaft-encoder data.
Visual Odometry ensures local consistency in map registration. Over larger regions, VO provides the
necessary corrections to the shaft-encoder data to keep the errors from growing without bounds.
The key aspect of VO is to track features from one frame to another, and hence determine the
relative incremental motion between two frames, that is the position and orientation displacement
between two frames. In our case the features are Harris corners. After they are extracted, the optical flow-
field is constructed. From this point, several parameters are calculated using camera and epipolar
geometry, and using Kalman filter. However, the report does not go into a detailed description of these.
. Don Murray and Jim Little: Using Real-time Stereo-vision for mobile robot navigation
. D.Santosh, Supreeth Achar, CV Jawahar: Autonomous Image-based Exploration for Mobile Robot Navigation
. Iwan Ulrich and Illah Nourbakhsh: Appearance-Based Obstacle Detection with Monocular Color-Vision
. Konolige, Agarwal, Bolles, Cowan, Fischler, Gerkey: Outdoor Mapping and Navigation using Stereo-Vision
 Felzenszwalb and Huttenlocher: Efficient Graph-Based Image Segmentation