Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

3-D Visual Reconstruction: A System Perspective


Published on

This is an interesting article about the thesis. The article was accepted on a symposium.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

3-D Visual Reconstruction: A System Perspective

  1. 1. 3-D Visual Reconstruction: A System Perspective Guillermo Medina-Zegarra ∗ and Edgar Lobaton †Abstract.- This paper presents a comprehensive study of involves the following steps: image acquisition, camera mod-all the challenges associated with the design of a platform eling, feature extraction, image matching, depth determina-for the reconstruction of 3-D objects from stereo images. tion and interpolation. There is a large body of literature inThe overview covers: Design and Calibration of a physical the area of computer vision which deals with this problem [9]setup, Rectification and disparity computation for stereo [10] [11].pairs, and construction and rendering of meshes in 3-D for In this paper we will describe and show the common stepsvisualization. We outline some challenges and research in to build a 3-D visual reconstruction from two images, andeach of these areas, and present some results using some we will demonstrate some results using a custom built stereobasic algorithm and a system build with off-the-shelf com- testbed.ponents. The rest of the paper is organized as follows: section 2 describes the single view geometry; section 3 describes the Index Terms.- Epipolar geometry, Rectification, Dis- epipolar geometry for two views; in section 4, the corresponparity calculation, Triangulation, 3-D reconstruction. dence problem is presented; in section 5, the physical and al- gorithmic architecture used for our experiment are described; in section 6, some results using our platform are shown; fi-1 Introduction nally, section 7 presents our conclusions.To get the impression of depth in a painting some artistsof the Renaissance such as Filippo Brunelleschi, Piero della 2 Single View GeometryFrancesca, Leonardo Da Vinci [1] and Albert Durer re-searched the way a human eye sees the world. They bridged Cameras are usually modeled using a pinhole imaging modelthe space between the artist and the object. In Figure 1 you [10]. As illustrated in Figure 2, images are formed by project-can see the perspective machine which was used to determine ing the observations from 3-D objects onto the image plain.the projection of 3-D points to a plane known as the image The image of a point P = [X,Y, Z] in a physical object isplane. The projections were found by drawing lines between the point p = [x, y, 1] corresponding to the intersect betweenthe 3-D points and the eyes of the artist (vanishing points). the image plain and a ray from the camera focal point C to P. Given that the world coordinate system and the coordinate system associated with a camera coincide (as shown in the figure), we have X Y x= f and y= f , (1) Z Z where f is the focal length of the camera. However, in the case that the world coordinate system does not agree with the cam- Figure 1: Perspectiva machine [2] era frame of reference and the point P is expressed in world It is now well known that the same mathematical principlesused to understand how humans see the world can be used tobuild 3-D models from two images. There are many appli-cations for 3-D reconstruction such as: Photo Tourism [3],visualization in teleimmersive environments [4] [5] [6], videogames (e.g. Microsoft Kinect), facial animation [7], etc. The stereo computation paradigm [8] is broadly defined asthe recovery of 3-D characteristics of a scene from multiplesimages taken from different points of view. The paradigm ∗ Universidad Cat´ lica San Pablo, Arequipa - Per´ (e-mail: o † North Carolina State University, North Carolina State - USA ( Figure 2: Pinhole camera model [10]. 1
  2. 2. coordinate system as P0 = [X0 ,Y0 , Z0 ] , then we have that P = R P0 + t, (2)where R and t are the rotation and translation defining therigid body transformation between the camera and the worldcoordinate frames. These are called the extrinsic parametersfor a camera. Given this model, the coordinates of a point inthe image domain satisfy:       X0 x f 0 0 R t  Y0  λ  y  =  0 f 0  Π0   (3) 0 1  Z0  1 0 0 1 1 Figure 3: Epipolar Geometry for stereo views [10]where λ := Z and   1 0 0 0 and p2 , known as corresponding image points, can be found Π0 =  0 1 0 0 . by intercepting the image planes with the lines from the 3-D 0 0 1 0 point P to the optic centers C1 and C2 . The so called epipolar plane is defined by the points P and the optic centers C1 andThe previous equation is obtained under the assumption of an C2 .ideal camera model. In general, the coordinates of the image Due to the geometry of the setup it is possible to verify thatplain are not perfectly aligned and may not have the same the image points p1 and p2 must satisfy the so called epipolarscale (i.e., uneven spacing between sensors in a camera sensor equation:array). This leads to the more general equations: pT F p1 = 0 (5) 2     X0 where F is the fundamental matrix [9][13] and it is defined x R t  Y0  as λ  y  = KΠ0   (4) −T −1 0 1  Z0  F = K2 EK1 (6) 1 1 The matrices K1 and K2 are the matrices of intrinsic parame-where   ters and E is the so-called essential matrix. The matrix E is f sx f sθ ox given by K= 0 f s y oy  , E = [T21 ]x R21 (7) 0 0 1 where R21 and T21 are the rotation and translation matrices be-[ox , oy ] is called the principal point, sx and sy specify the tween camera 1 and 2, and [T ]x for T = [T1 , T2 , T3 ] is definedspacing between sensors in the x and y directions in the cam- asera sensor array, and sθ is an skewing factor accounting formisalignment between sensors. The entries of K are known  as the intrinsic parameters of a camera. All of these param- 0 −T3 T2eters can be found by measurements of image points from 3-D [T ]x =  T3 0 −T1  (8)points with known coordinates. −T2 T1 0 In addition to the linear distortions that are modeled by K,cameras with wide field of view can have radial distortions Equation 5 can be used to solve for the fundamental matrixwhich are modeled using polynomial terms in the model [10]. F given that enough corresponding image points are provided.Cameras can be calibrated to obtain undistorted images which One such algorithm is known as the eight point algorithmwe will assume it is the case from now on. [14]. Hence, since the matrices K1 and K2 can be learned from individual calibration of the cameras, then the essential matrix E can be found. One can solve for the rotation and3 Epipolar geometry translation (up to scale) between cameras by using E. Figure 3 also illustrates one more fact that can be used toEpipolar geometry [9] studies the geometric relationship be- find corresponding points between images once the intrinsictween the points in 3-D and their projections in the image and extrinsic parameters are known. Essentially, given a pointplanes of stereo cameras for a vision system. p1 , we know that the 3-D point P has to lay in the plane In Figure 3, one can observe that the 3-D point P has pers defined by T21 and the line from C1 to p1 (i.e., the epipolarpective projection p1 and p2 [12] in the left image plane I1 plane). This plane intercepts the image plane I1 in the line l1and in the right image plane I2 , respectively. The points p1 and the image plane I2 in the line l2 . That is, we know that 2
  3. 3. any point in the line l1 must correspond to some point in the 5 Testing Platformline l2 . This turns the identification of corresponding pointsinto a 1-D search problem. This process is known as dispar- In this section, we introduce the design of a simple stereo plat-ity computation [15] [11]. Finally, it is common to warp the form used to building 3-D models using the framework pre-original image planes such that the lines l1 and l2 are both sented so far.horizontal and aligned. This process, known as rectification[16], is done to simplify computations during the search. 5.1 Physical Architecture Finally, given that the intrinsic parameter, extrinsic param-eters, and the corresponding points p1 and p2 are known as Figure 4a shows a frontal view of two digital camera config-well, it is possible to recover the 3-D position of the point uration used for our experiments. Figure 4b shows the backP via a process called triangulation [17]. Essentially, P is view. The cameras were mounted using screws to prevent anyfound to be the intercept between the line passing through p1 motion during the calibration and image acquisition.and C1 , and the line passing through p2 and C2 .4 The Correspondence ProblemIn order to estimate the essential matrix E and to perform thedisparity computations, we require the identification of cor- (a) (b)responding points (i.e., points in the image planes that arethe projection of the same physical point in 3-D). This is a Figure 4: Frontal (a) and back (b) views of the stereo cameravery challenging and ambiguous problem [11] due to possibly testbed.large changes in perspective between cameras, the occurrenceof repeated patterns in images, and the existence of occlu- The brands of the two digital cameras that were used aresions. Sony DSC-S750 and Canon SD1200 IS. Both cameras were configured to have the same settings.4.1 Generic Correspondence ProblemIn order to find corresponding image points for the estimation 5.2 Algorithmic Architectureof the essential matrix E, it is possible to use techniques that The Camera Calibration toolbox [21] was used to calibrateidentify unique points. The correspondence between points the cameras. Figure 5 illustrates the steps of our implementa-can then be determined by a metric between local descriptors. tion of 3-D visual reconstruction. Image acquisition was doneThe SIFT and GLOH [18] descriptors are two commonly used using the camera setup described before. We pre-processingtechniques. the two images with gaussian filters to reduce noise and man- For the case of calibration [19] [20] in controlled environ- ual segmentation is used to remove the background. The met-ments, it is commonly assumed that the observed object has ric used to calculate the disparity map is the normalized crossa regular pattern that is easy to identify (e.g., a checkerboard correlation [10]. As a post-processing step, we use a medianpattern). In these cases, it is possible to use more specialized filter to correct the problem of outliers. After that, we generateregion detection algorithms such as the corner Harris detector a Delaunay triangulation [22] of the disparity image, which is[11]. then used to obtain a 3-D mesh by projecting key points from4.2 Disparity ComputationsIf the intrinsic and extrinsic calibration parameters betweentwo cameras are known, it is possible to rectify the imagessuch that corresponding points can be found along horizontallines in the images. As mentioned before, finding correspon-dences turns into a 1-D search problem known as disparitycomputation. In order to perform this search, it is possible tospecify a metric that characterizes the difference between tworegions (e.g., normalized cross correlation). Then, we lookfor the location along the horizontal line that minimizes thedifference between these regions [11]. This is a very simplemethodology that will be used in this paper in order to per-form disparity computations. Of course, there are many moresophisticated algorithms that give more accurate results [15]. Figure 5: Preliminar outline of our reconstruction model 3
  4. 4. the image plane to 3-D as described in section 3. Finally, the Figure 7 illustrates the 3-D reconstruction obtained fromcolor texture is mapped from the right image to the 3-D mesh. the cube images. Figure 7a is the processed disparity map obtained from the pre-processed rectified images. Figures 7b and 7c illustrate a cloud of point and the resulting 3-D mesh,6 Results respectively. Figures 7d and 7e show the surface of the cube without and with smoothing. Finally, Figure 7f correspondsIn this section we present some 3-D reconstruction results us- to the 3-D mesh with the color texture from the rectified righting the methodology described in this paper. We also discuss image.limitations and drawbacks.6.1 3-D Reconstruction ResultsWe used a checkerboard pattern of 10 × 7 squares for the cal-ibration. The corner points of the checkerboard are automati-cally selected using the Camera Calibration toolbox [21]. Most of the implementation is done in MATLAB. How- (a) (b)ever, Paraview Viewer [23] is also used for the visualizationof the point clouds, which are processed using the decimateand smooth filters from the application. Figure 6 shows some images illustrating the steps neededfor the preparation of a pair of images of a cube. Figures 6aand 6b show the two original images of the object. Figures 6cand 6d show the rectified images. The black regions indicatepoints in the rectified images that are outside the range of the (c) (d)original ones. Figure 6e and 6f show the images with theirbackground removed. (e) (f) Figure 7: (a) Disparity map for the cube images, (b) matching (a) (b) points, (c) 3-D mesh, (d) reconstructed surface, (e) smoothing of the surface, and (f) surface with texture. Figures 8 and 9 show similar results using a pair of images of a teddy bear instead of a cube. 6.2 Limitations and finding problems (c) (d) One limitation was the lack of lighting control during the im- age acquisition process, which causes artifacts such as shad- ows and color variations between images as observed in Fig- ure 10a and 10b. These artifacts cause errors in the disparity calculation as observed in Figure 10c, which translate into in- consistent reconstruction of the object as observed in Figure 10d. (e) (f) As described earlier, we used a normalized cross correla- tion approach to compute the disparity map. This process re- quires the specification of a neighborhood, which is a chal-Figure 6: (a - b) Original images of the cube, (c - d) rectified lenging problem on its own [10]. The disparity process alsoimages, and (e - f) images without background. requires some knowledge of the maximum offset between cor- 4
  5. 5. (a) (b) (a) (b) (c) (d) (c) (d) (e) (f) (e) (f) Figure 9: (a) Disparity map for the teddy bear images, (b)Figure 8: (a - b) Original images of the teddy bear, (c - d) matching points, (c) 3-D mesh, (d) reconstructed surface, (e)rectified images, and (e - f) images without background. smoothing of the surface, and (f) surface with texture.responding points [11], which may not be known in advanced.Figure 11 illustrate a comparison between a failed and a goodreconstruction using the approach described in this paper. On appendix A you can watch our algorithm of disparitycalculation. Time for the processing is five minutes. The com- (a) (b) (c) (d)putacional cost of our algorithm in the worst case is O(n 3)and the best case is Θ(n2 ). Tests were performed on a ma- Figure 10: (a - b) Original images with artifacts, (c) disparitychinery Sony VAIO VGN-FE880E with a processor Intel (R) map, and (d) 3-D reconstruction.Core (TM) 2 CPU 1.66 GHz and memory RAM 2GB with anoperating system Windows 7 Ultimate 64-bit. Time for theprocessing of our algorithm is five minutes. can lead to artifacts in the reconstruction. There is also a large dependency between steps in the reconstruction. For example, small errors in calibration [24] affect the rectification process,7 Conclusion causing problem in the disparity computation, and leading to erroneous 3-D reconstructions. For this reason, it is necessaryIn this paper, we outline all the steps needed to generate a to put special care to the design of a stable physical architec-3-D reconstruction of an objected by using a stereo image ture for image acquisition.pair. These 3-D models can be used in a variety of applica- The authors of this paper template would like to thanks to:tions ranging from photo tourism [3] to virtual conferencing Alex Cuadros, Juan Carlos Gutierrez and Nestor Calvo.[6]. The method used for disparity computation uses a simplenormalized cross correlation to figure out correspondences,which is not a very robust approach but it is simple enough Referencesfor demonstration purposes. There are more robust methodsthat use techniques such as dynamic programming and genetic [1] Leonardo Da Vinci. El Tratado de la Pintura. Edimat Libros,algorithms [15]. S.A., 2005. One of the main challenges observed during the reconstruc- [2] Olivier Faugeras. Three-dimensional Computer Vision: A Ge-tion process was the variability of lighting conditions, which ometric Viewpoint. The MIT Press, 1993. 5
  6. 6. [17] Richard Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, pag. 146-157, 1997. [18] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005. [19] X. Armangu´ , J. Salvi, and J. Batlle. A Comparative Review Of e Camera Calibrating Methods with Accuracy Evaluation. Pat- tern Recognition, 35:1617–1635, 2000. [20] Zhengyou Zhang. Flexible camera calibration by viewing a (a) (b) plane from unknown orientations. Seventh International Con- ference on Computer Vision, pag. 666-673, 1999.Figure 11: (a) Failed reconstruction and (b) good reconstruc- [21] Camera Calibration Toolbox for matlab. http://www. 18 Oc-tion of a cube. tubre, 2011. [22] Jonathan Richard Shewchuk. Triangle: Engineering a 2d qual- ity mesh generator and delaunay triangulator. First Workshop [3] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo on Applied Computational Geometry, pag. 124-133, 1996. tourism: Exploring photo collections in 3d. In SIGGRAPH [23] Paraview. 16 Octubre, 2011. Conference Proceedings, pages 835–846, New York, NY, USA, [24] Gregorij Kurillo, Ramanarayan Vasudevan, Edgar Lobaton, 2006. ACM Press. and Ruzena Bajcsy. A framework for collaborative real-time [4] Jason Leigh, Andrew E. Johnson, Maxine Brown, Daniel J. 3d teleimmersion in a geographically distributed environment. Sandin, and Thomas A. DeFanti. Visualization in teleimmer- In Proceedings of the 2008 Tenth IEEE International Sympo- sive environments. Computer, 32:66–73, December 1999. sium on Multimedia, ISM ’08, pages 111–118, Washington, [5] Sang-Hack Jung and Ruzena Bajcsy. A framework for con- DC, USA, 2008. IEEE Computer Society. structing real-time immersive environments for training physi- cal activities. Journal of Multimedia. [6] Ramanarayan Vasudevan, Zhong Zhou, Gregorij Kurillo, Edgar J. Lobaton, Ruzena Bajcsy, and Klara Nahrstedt. Real- time stereo-vision system for 3d teleimmersive collaboration. In ICME’10, pages 1208–1213, 2010. [7] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Re- altime performance-based facial animation. ACM Transac- tions on Graphics (Proceedings SIGGRAPH 2011), 30(4), July 2011. [8] Stephen T. Barnard and Martin A. Fischler. Computational stereo. ACM Computing Surveys (CSUR), pag. 553-572, Di- ciembre, 1982. [9] Richard Hartley and Andrew Zisserman. Multiple View Geome- try in Computer Vision. Second Edition. Cambridge University Press, 2004.[10] Yi Ma, Stefano Soatto, Jana Koseck´ , and S. Shankar Sastry. a An Invitation to 3D Vision from Images to Geometric Models. Springer, 2004.[11] D.A. Forsyth and Ponce J. Computer Vision: A Modern ap- proach. Prentice Hall, 2003.[12] Donald Hearn and M. Pauline Baker. Gr´ ficas por Computa- a dora. Segunda Edici´ n. Prentice Hall, 1995. o[13] Emanuele Trucco and Alessandro Verri. Introductory Tech- niques for 3-D Computer Vision. Prentice Hall, 1998.[14] Richard Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, pag. 383-395, 1997.[15] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Inter- national Journal of Computer Vision, 47:7–42, 2002.[16] Andrea Fusiello, Emanuele Trucco, and Alessandro Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, pag. 16-22, 2000. 6
  7. 7. A Appendix dispComp(imDerecha,imIzquierda,maxDisp)1 thNorm ← escalar ∗ (2 ∗ r + 1)2 for i = 1 + r to col − r do3 for j = 1 + r to f il − maxDisp − r do5 pBase ← imDerecha(i − r : i + r, j − r : j + r)6 pBase ← pBase − promedio(pBase)7 nBase ← norma(pBase)8 if nBase <= thNorm then9 continue10 end if11 pBase ← pBase/nBase12 for sh = 1 to maxDisp do13 pShi f t ← imIzquierda(i − r : i + r, j + sh − r : j + sh + r)14 pShi f t ← pShi f t − promedio(pShi f t)15 nShi f t ← norma(pShi f t)16 if nShi f t <= thNorm then17 corr[sh] ← 018 continue19 end if20 corr[sh] ← sum((pShi f t/nShi f t). ∗ (pBase))21 end for22 [valor indice] ← max(corr)23 if valor == 0 then24 imDisp[i, j] ← 025 else26 imDisp[i, j] ← indice27 end if28 end for29 end for30 return(imDisp) 7