3-D Visual Reconstruction: A System Perspective

Guillermo Medina-Zegarra ∗ and Edgar Lobaton †

∗ Universidad Católica San Pablo, Arequipa, Peru (e-mail: guillermoemz@gmail.com)
† North Carolina State University, North Carolina, USA (e-mail: edgar.lobaton@ncsu.edu)

Abstract.- This paper presents a comprehensive study of the challenges associated with the design of a platform for the reconstruction of 3-D objects from stereo images. The overview covers the design and calibration of a physical setup, rectification and disparity computation for stereo pairs, and the construction and rendering of 3-D meshes for visualization. We outline some challenges and research in each of these areas, and present results obtained with basic algorithms and a system built with off-the-shelf components.

Index Terms.- Epipolar geometry, Rectification, Disparity calculation, Triangulation, 3-D reconstruction.

1 Introduction

To convey the impression of depth in a painting, Renaissance artists such as Filippo Brunelleschi, Piero della Francesca, Leonardo da Vinci [1] and Albrecht Dürer studied the way the human eye sees the world. They bridged the space between the artist and the object. Figure 1 shows the perspective machine, which was used to determine the projection of 3-D points onto a plane known as the image plane. The projections were found by drawing lines between the 3-D points and the eyes of the artist (vanishing points).

Figure 1: Perspective machine [2]

It is now well known that the same mathematical principles used to understand how humans see the world can be used to build 3-D models from two images. There are many applications for 3-D reconstruction, such as Photo Tourism [3], visualization in teleimmersive environments [4] [5] [6], video games (e.g., Microsoft Kinect), and facial animation [7].

The stereo computation paradigm [8] is broadly defined as the recovery of 3-D characteristics of a scene from multiple images taken from different points of view. The paradigm involves the following steps: image acquisition, camera modeling, feature extraction, image matching, depth determination and interpolation. There is a large body of literature in the area of computer vision which deals with this problem [9] [10] [11].

In this paper we describe the common steps needed to build a 3-D visual reconstruction from two images, and we demonstrate some results using a custom-built stereo testbed.

The rest of the paper is organized as follows: section 2 describes the single view geometry; section 3 describes the epipolar geometry for two views; in section 4, the correspondence problem is presented; in section 5, the physical and algorithmic architectures used for our experiments are described; in section 6, some results using our platform are shown; finally, section 7 presents our conclusions.

2 Single View Geometry

Cameras are usually modeled using a pinhole imaging model [10]. As illustrated in Figure 2, images are formed by projecting the observations from 3-D objects onto the image plane. The image of a point $P = [X, Y, Z]^T$ in a physical object is the point $p = [x, y, 1]^T$ corresponding to the intersection between the image plane and a ray from the camera focal point C to P. Given that the world coordinate system and the coordinate system associated with a camera coincide (as shown in the figure), we have

$$x = f\,\frac{X}{Z} \quad \text{and} \quad y = f\,\frac{Y}{Z}, \tag{1}$$

where f is the focal length of the camera.

Figure 2: Pinhole camera model [10].

However, in the case that the world coordinate system does not agree with the camera frame of reference and the point P is expressed in the world coordinate system as $P_0 = [X_0, Y_0, Z_0]^T$, then we have that

$$P = R P_0 + t, \tag{2}$$
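As a quick numerical illustration of Equations (1) and (2), the following minimal Python sketch maps a point from world coordinates into the camera frame and then projects it with the pinhole model. The function names `to_camera_frame` and `project_pinhole` and all numeric values are assumptions chosen for illustration; they are not part of the platform described in this paper.

```python
def to_camera_frame(R, t, P0):
    """Equation (2): P = R * P0 + t, with R a 3x3 nested list and t, P0 3-vectors."""
    return [sum(R[i][j] * P0[j] for j in range(3)) + t[i] for i in range(3)]


def project_pinhole(P, f):
    """Equation (1): x = f * X / Z and y = f * Y / Z for a point in the camera frame."""
    X, Y, Z = P
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


# Hypothetical setup: the camera frame is rotated 90 degrees about the optical
# axis and shifted 2 units along it relative to the world frame.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
f = 2.0  # focal length, in the same units as the image coordinates

P0 = [1.0, 0.5, 2.0]           # point in world coordinates
P = to_camera_frame(R, t, P0)  # the same point in camera coordinates
x, y = project_pinhole(P, f)   # image coordinates of the projection
print(P, (x, y))
```

Keeping the rigid body transformation and the projection as separate steps mirrors the split between extrinsic and intrinsic parameters used in the rest of this section.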

In Equation (2), R and t are the rotation and translation defining the rigid body transformation between the camera and the world coordinate frames. These are called the extrinsic parameters of a camera. Given this model, the coordinates of a point in the image domain satisfy:

$$\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \Pi_0 \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix} \tag{3}$$

where $\lambda := Z$ and

$$\Pi_0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$

The previous equation is obtained under the assumption of an ideal camera model. In general, the coordinates of the image plane are not perfectly aligned and may not have the same scale (i.e., uneven spacing between sensors in a camera sensor array). This leads to the more general equation:

$$\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = K \, \Pi_0 \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix} \tag{4}$$

where

$$K = \begin{bmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{bmatrix},$$

$[o_x, o_y]$ is called the principal point, $s_x$ and $s_y$ specify the spacing between sensors in the x and y directions in the camera sensor array, and $s_\theta$ is a skew factor accounting for misalignment between sensors. The entries of K are known as the intrinsic parameters of a camera. All of these parameters can be found by measurements of image points from 3-D points with known coordinates.

In addition to the linear distortions that are modeled by K, cameras with a wide field of view can have radial distortions, which are modeled using polynomial terms in the model [10]. Cameras can be calibrated to obtain undistorted images, which we assume to be the case from now on.

3 Epipolar geometry

Epipolar geometry [9] studies the geometric relationship between the points in 3-D and their projections in the image planes of stereo cameras for a vision system.

In Figure 3, one can observe that the 3-D point P has perspective projections $p_1$ and $p_2$ [12] in the left image plane $I_1$ and in the right image plane $I_2$, respectively. The points $p_1$ and $p_2$, known as corresponding image points, can be found by intersecting the image planes with the lines from the 3-D point P to the optic centers $C_1$ and $C_2$. The so-called epipolar plane is defined by the point P and the optic centers $C_1$ and $C_2$.

Figure 3: Epipolar Geometry for stereo views [10]

Due to the geometry of the setup it is possible to verify that the image points $p_1$ and $p_2$ must satisfy the so-called epipolar equation:

$$p_2^T F p_1 = 0 \tag{5}$$

where F is the fundamental matrix [9] [13], defined as

$$F = K_2^{-T} E K_1^{-1}. \tag{6}$$

The matrices $K_1$ and $K_2$ are the matrices of intrinsic parameters and E is the so-called essential matrix. The matrix E is given by

$$E = [T_{21}]_\times R_{21} \tag{7}$$

where $R_{21}$ and $T_{21}$ are the rotation and translation between cameras 1 and 2, and $[T]_\times$ for $T = [T_1, T_2, T_3]^T$ is defined as

$$[T]_\times = \begin{bmatrix} 0 & -T_3 & T_2 \\ T_3 & 0 & -T_1 \\ -T_2 & T_1 & 0 \end{bmatrix}. \tag{8}$$

Equation (5) can be used to solve for the fundamental matrix F given that enough corresponding image points are provided. One such algorithm is known as the eight-point algorithm [14]. Hence, since the matrices $K_1$ and $K_2$ can be learned from individual calibration of the cameras, the essential matrix E can be found. One can then solve for the rotation and translation (up to scale) between cameras by using E.

Figure 3 also illustrates one more fact that can be used to find corresponding points between images once the intrinsic and extrinsic parameters are known. Essentially, given a point $p_1$, we know that the 3-D point P has to lie in the plane defined by $T_{21}$ and the line from $C_1$ to $p_1$ (i.e., the epipolar plane). This plane intersects the image plane $I_1$ in the line $l_1$ and the image plane $I_2$ in the line $l_2$.


Any point on the line $l_1$ must therefore correspond to some point on the line $l_2$. This turns the identification of corresponding points into a 1-D search problem. This process is known as disparity computation [15] [11]. In addition, it is common to warp the original image planes such that the lines $l_1$ and $l_2$ are both horizontal and aligned. This process, known as rectification [16], is done to simplify computations during the search.

Finally, given that the intrinsic parameters, the extrinsic parameters, and the corresponding points $p_1$ and $p_2$ are known, it is possible to recover the 3-D position of the point P via a process called triangulation [17]. Essentially, P is found as the intersection between the line passing through $p_1$ and $C_1$ and the line passing through $p_2$ and $C_2$.

4 The Correspondence Problem

In order to estimate the essential matrix E and to perform the disparity computations, we require the identification of corresponding points (i.e., points in the image planes that are the projection of the same physical point in 3-D). This is a very challenging and ambiguous problem [11] due to possibly large changes in perspective between cameras, the occurrence of repeated patterns in images, and the existence of occlusions.

4.1 Generic Correspondence Problem

In order to find corresponding image points for the estimation of the essential matrix E, it is possible to use techniques that identify unique points. The correspondence between points can then be determined by a metric between local descriptors. The SIFT and GLOH [18] descriptors are two commonly used techniques.

For the case of calibration [19] [20] in controlled environments, it is commonly assumed that the observed object has a regular pattern that is easy to identify (e.g., a checkerboard pattern). In these cases, it is possible to use more specialized region detection algorithms such as the Harris corner detector [11].

4.2 Disparity Computations

If the intrinsic and extrinsic calibration parameters between two cameras are known, it is possible to rectify the images such that corresponding points can be found along horizontal lines in the images. As mentioned before, finding correspondences turns into a 1-D search problem known as disparity computation. In order to perform this search, it is possible to specify a metric that characterizes the difference between two regions (e.g., normalized cross correlation). Then, we look for the location along the horizontal line that minimizes the difference between these regions [11]; for a similarity score such as normalized cross correlation, this is the location that maximizes the score. This is a very simple methodology that will be used in this paper in order to perform disparity computations. Of course, there are many more sophisticated algorithms that give more accurate results [15].

5 Testing Platform

In this section, we introduce the design of a simple stereo platform used to build 3-D models using the framework presented so far.

5.1 Physical Architecture

Figure 4a shows a frontal view of the two-camera configuration used for our experiments. Figure 4b shows the back view. The cameras were mounted using screws to prevent any motion during the calibration and image acquisition.

Figure 4: Frontal (a) and back (b) views of the stereo camera testbed.

The two digital cameras used are a Sony DSC-S750 and a Canon SD1200 IS. Both cameras were configured to have the same settings.

5.2 Algorithmic Architecture

The Camera Calibration toolbox [21] was used to calibrate the cameras. Figure 5 illustrates the steps of our implementation of 3-D visual reconstruction.

Figure 5: Preliminary outline of our reconstruction model

Image acquisition was done using the camera setup described before. We pre-process the two images with Gaussian filters to reduce noise, and manual segmentation is used to remove the background. The metric used to calculate the disparity map is the normalized cross correlation [10]. As a post-processing step, we use a median filter to correct the problem of outliers. After that, we generate a Delaunay triangulation [22] of the disparity image, which is then used to obtain a 3-D mesh by projecting key points from

the image plane to 3-D as described in section 3. Finally, the color texture is mapped from the right image to the 3-D mesh.

6 Results

In this section we present some 3-D reconstruction results using the methodology described in this paper. We also discuss limitations and drawbacks.

6.1 3-D Reconstruction Results

We used a checkerboard pattern of 10 × 7 squares for the calibration. The corner points of the checkerboard are automatically selected using the Camera Calibration toolbox [21].

Most of the implementation is done in MATLAB. However, Paraview Viewer [23] is also used for the visualization of the point clouds, which are processed using the decimate and smooth filters from the application.

Figure 6 shows some images illustrating the steps needed for the preparation of a pair of images of a cube. Figures 6a and 6b show the two original images of the object. Figures 6c and 6d show the rectified images. The black regions indicate points in the rectified images that are outside the range of the original ones. Figures 6e and 6f show the images with their background removed.

Figure 6: (a - b) Original images of the cube, (c - d) rectified images, and (e - f) images without background.

Figure 7 illustrates the 3-D reconstruction obtained from the cube images. Figure 7a is the processed disparity map obtained from the pre-processed rectified images. Figures 7b and 7c illustrate a cloud of points and the resulting 3-D mesh, respectively. Figures 7d and 7e show the surface of the cube without and with smoothing. Finally, Figure 7f corresponds to the 3-D mesh with the color texture from the rectified right image.

Figure 7: (a) Disparity map for the cube images, (b) matching points, (c) 3-D mesh, (d) reconstructed surface, (e) smoothing of the surface, and (f) surface with texture.

Figures 8 and 9 show similar results using a pair of images of a teddy bear instead of a cube.

Figure 8: (a - b) Original images of the teddy bear, (c - d) rectified images, and (e - f) images without background.

Figure 9: (a) Disparity map for the teddy bear images, (b) matching points, (c) 3-D mesh, (d) reconstructed surface, (e) smoothing of the surface, and (f) surface with texture.

6.2 Limitations and Problems Encountered

One limitation was the lack of lighting control during the image acquisition process, which causes artifacts such as shadows and color variations between images, as observed in Figures 10a and 10b. These artifacts cause errors in the disparity calculation, as observed in Figure 10c, which translate into an inconsistent reconstruction of the object, as observed in Figure 10d.

Figure 10: (a - b) Original images with artifacts, (c) disparity map, and (d) 3-D reconstruction.

As described earlier, we used a normalized cross correlation approach to compute the disparity map. This process requires the specification of a neighborhood, which is a challenging problem on its own [10]. The disparity process also requires some knowledge of the maximum offset between corresponding points [11], which may not be known in advance. Figure 11 illustrates a comparison between a failed and a good reconstruction using the approach described in this paper.

Figure 11: (a) Failed reconstruction and (b) good reconstruction of a cube.

Appendix A lists our disparity calculation algorithm. The computational cost of our algorithm is $O(n^3)$ in the worst case and $\Theta(n^2)$ in the best case. Tests were performed on a Sony VAIO VGN-FE880E with an Intel(R) Core(TM) 2 CPU at 1.66 GHz and 2 GB of RAM, running Windows 7 Ultimate 64-bit. Processing with our algorithm takes five minutes.

7 Conclusion

In this paper, we outline all the steps needed to generate a 3-D reconstruction of an object by using a stereo image pair. These 3-D models can be used in a variety of applications ranging from photo tourism [3] to virtual conferencing [6]. The method used for disparity computation uses a simple normalized cross correlation to find correspondences, which is not a very robust approach but is simple enough for demonstration purposes. There are more robust methods that use techniques such as dynamic programming and genetic algorithms [15].

One of the main challenges observed during the reconstruction process was the variability of lighting conditions, which can lead to artifacts in the reconstruction. There is also a large dependency between steps in the reconstruction. For example, small errors in calibration [24] affect the rectification process, causing problems in the disparity computation and leading to erroneous 3-D reconstructions. For this reason, it is necessary to take special care in the design of a stable physical architecture for image acquisition.

The authors of this paper would like to thank Alex Cuadros, Juan Carlos Gutierrez and Nestor Calvo.

References

[1] Leonardo Da Vinci. El Tratado de la Pintura. Edimat Libros, S.A., 2005.
[2] Olivier Faugeras. Three-dimensional Computer Vision: A Geometric Viewpoint. The MIT Press, 1993.
[3] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In SIGGRAPH Conference Proceedings, pp. 835–846, New York, NY, USA, 2006. ACM Press.
[4] Jason Leigh, Andrew E. Johnson, Maxine Brown, Daniel J. Sandin, and Thomas A. DeFanti. Visualization in teleimmersive environments. Computer, 32:66–73, December 1999.
[5] Sang-Hack Jung and Ruzena Bajcsy. A framework for constructing real-time immersive environments for training physical activities. Journal of Multimedia.
[6] Ramanarayan Vasudevan, Zhong Zhou, Gregorij Kurillo, Edgar J. Lobaton, Ruzena Bajcsy, and Klara Nahrstedt. Real-time stereo-vision system for 3D teleimmersive collaboration. In ICME'10, pp. 1208–1213, 2010.
[7] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial animation. ACM Transactions on Graphics (Proceedings SIGGRAPH 2011), 30(4), July 2011.
[8] Stephen T. Barnard and Martin A. Fischler. Computational stereo. ACM Computing Surveys (CSUR), pp. 553–572, December 1982.
[9] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Second Edition. Cambridge University Press, 2004.
[10] Yi Ma, Stefano Soatto, Jana Košecká, and S. Shankar Sastry. An Invitation to 3D Vision from Images to Geometric Models. Springer, 2004.
[11] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003.
[12] Donald Hearn and M. Pauline Baker. Gráficas por Computadora. Segunda Edición. Prentice Hall, 1995.
[13] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[14] Richard Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 383–395, 1997.
[15] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal of Computer Vision, 47:7–42, 2002.
[16] Andrea Fusiello, Emanuele Trucco, and Alessandro Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, pp. 16–22, 2000.
[17] Richard Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, pp. 146–157, 1997.
[18] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.
[19] X. Armangué, J. Salvi, and J. Batlle. A Comparative Review of Camera Calibrating Methods with Accuracy Evaluation. Pattern Recognition, 35:1617–1635, 2000.
[20] Zhengyou Zhang. Flexible camera calibration by viewing a plane from unknown orientations. Seventh International Conference on Computer Vision, pp. 666–673, 1999.
[21] Camera Calibration Toolbox for MATLAB. http://www.vision.caltech.edu/bouguetj/calib_doc/. October 18, 2011.
[22] Jonathan Richard Shewchuk. Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. First Workshop on Applied Computational Geometry, pp. 124–133, 1996.
[23] Paraview. http://www.paraview.org. October 16, 2011.
[24] Gregorij Kurillo, Ramanarayan Vasudevan, Edgar Lobaton, and Ruzena Bajcsy. A framework for collaborative real-time 3D teleimmersion in a geographically distributed environment. In Proceedings of the 2008 Tenth IEEE International Symposium on Multimedia, ISM '08, pp. 111–118, Washington, DC, USA, 2008. IEEE Computer Society.

A Appendix
dispComp(imRight, imLeft, maxDisp)
  % r (window radius) and scalar (texture threshold factor) are free
  % parameters whose values are not specified here.
  % rows and cols are the image dimensions; the first index is the row.
  thNorm ← scalar ∗ (2 ∗ r + 1)
  imDisp ← zeros(rows, cols)            % skipped pixels keep disparity 0
  for i = 1 + r to rows − r do
    for j = 1 + r to cols − maxDisp − r do
      % zero-mean base window taken from the right image
      pBase ← imRight(i − r : i + r, j − r : j + r)
      pBase ← pBase − mean(pBase)
      nBase ← norm(pBase)
      if nBase <= thNorm then
        continue                        % textureless window, no reliable match
      end if
      pBase ← pBase / nBase
      for sh = 1 to maxDisp do
        % zero-mean candidate window in the left image, shifted sh columns
        pShift ← imLeft(i − r : i + r, j + sh − r : j + sh + r)
        pShift ← pShift − mean(pShift)
        nShift ← norm(pShift)
        if nShift <= thNorm then
          corr[sh] ← 0
          continue
        end if
        % normalized cross correlation between the two windows
        corr[sh] ← sum((pShift / nShift) .∗ pBase)
      end for
      [value, index] ← max(corr)
      if value == 0 then
        imDisp[i, j] ← 0
      else
        imDisp[i, j] ← index            % shift with the highest correlation
      end if
    end for
  end for
  return(imDisp)
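
For readers who want to experiment outside MATLAB, the following is a minimal pure-Python sketch of the same normalized cross correlation search. It is an illustrative re-implementation under assumptions, not the code used for the experiments: the function name `ncc_disparity`, the default window radius `r`, the fixed threshold `min_norm` (used in place of thNorm), and the synthetic test images are all hypothetical, and images are plain nested lists of gray values.

```python
import math
import random


def ncc_disparity(right, left, max_disp, r=1, min_norm=1e-6):
    """Disparity map of `right` with respect to `left`, both rectified gray images."""
    rows, cols = len(right), len(right[0])
    disp = [[0] * cols for _ in range(rows)]  # skipped pixels keep disparity 0

    def window(img, i, j):
        """Zero-mean window centered at (i, j), flattened, and its Euclidean norm."""
        vals = [img[a][b] for a in range(i - r, i + r + 1) for b in range(j - r, j + r + 1)]
        mean = sum(vals) / len(vals)
        vals = [v - mean for v in vals]
        return vals, math.sqrt(sum(v * v for v in vals))

    for i in range(r, rows - r):
        for j in range(r, cols - max_disp - r):
            base, n_base = window(right, i, j)
            if n_base <= min_norm:
                continue  # textureless window, no reliable match
            scores = []
            for sh in range(1, max_disp + 1):
                cand, n_cand = window(left, i, j + sh)
                if n_cand <= min_norm:
                    scores.append(0.0)
                    continue
                dot = sum(b * c for b, c in zip(base, cand))
                scores.append(dot / (n_base * n_cand))
            best = max(scores)
            # A best score of exactly 0 is treated as "no match", as in the pseudocode.
            disp[i][j] = 0 if best == 0 else scores.index(best) + 1
    return disp


# Synthetic check (assumed data): the left image is the right image shifted
# two pixels to the right, so interior pixels should get disparity 2.
rng = random.Random(0)
rows, cols, true_shift = 8, 16, 2
right_img = [[rng.random() for _ in range(cols)] for _ in range(rows)]
left_img = [[rng.random() for _ in range(true_shift)] + row[:cols - true_shift]
            for row in right_img]
disparity = ncc_disparity(right_img, left_img, max_disp=4)
print(disparity[3][5])
```

The sketch recomputes every window from scratch for clarity, so it is only practical for small images.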
