Medioni, Kang: Emerging Topics in Computer Vision


Emerging Topics in Computer Vision
Edited by Gérard Medioni and Sing Bing Kang
Contents

PREFACE
CONTRIBUTORS

1 INTRODUCTION
  1.1 Organization
  1.2 How to Use the Book
  1.3 Contents of DVDs

SECTION I: FUNDAMENTALS IN COMPUTER VISION

2 CAMERA CALIBRATION (Zhengyou Zhang)
  2.1 Introduction
  2.2 Notation and Problem Statement
    2.2.1 Pinhole Camera Model
    2.2.2 Absolute Conic
  2.3 Camera Calibration with 3D Objects
    2.3.1 Feature Extraction
    2.3.2 Linear Estimation of the Camera Projection Matrix
    2.3.3 Recover Intrinsic and Extrinsic Parameters from P
    2.3.4 Refine Calibration Parameters Through a Nonlinear Optimization
    2.3.5 Lens Distortion
    2.3.6 An Example
  2.4 Camera Calibration with 2D Objects: Plane-Based Technique
    2.4.1 Homography between the model plane and its image
    2.4.2 Constraints on the intrinsic parameters
    2.4.3 Geometric interpretation
    2.4.4 Closed-form solution
    2.4.5 Maximum likelihood estimation
    2.4.6 Dealing with radial distortion
    2.4.7 Summary
    2.4.8 Experimental Results
    2.4.9 Related Work
  2.5 Solving Camera Calibration with 1D Objects
    2.5.1 Setups with Free-Moving 1D Calibration Objects
    2.5.2 Setups with 1D Calibration Objects Moving Around a Fixed Point
    2.5.3 Basic Equations
    2.5.4 Closed-Form Solution
    2.5.5 Nonlinear Optimization
    2.5.6 Estimating the fixed point
    2.5.7 Experimental Results
  2.6 Self-Calibration
  2.7 Conclusion
  2.8 Appendix: Estimating Homography Between the Model Plane and its Image
  Bibliography

3 MULTIPLE VIEW GEOMETRY (Anders Heyden and Marc Pollefeys)
  3.1 Introduction
  3.2 Projective Geometry
    3.2.1 The central perspective transformation
    3.2.2 Projective spaces
    3.2.3 Homogeneous coordinates
    3.2.4 Duality
    3.2.5 Projective transformations
  3.3 Tensor Calculus
  3.4 Modelling Cameras
    3.4.1 The pinhole camera
    3.4.2 The camera matrix
    3.4.3 The intrinsic parameters
    3.4.4 The extrinsic parameters
    3.4.5 Properties of the pinhole camera
  3.5 Multiple View Geometry
    3.5.1 The structure and motion problem
    3.5.2 The two-view case
    3.5.3 Multi-view constraints and tensors
  3.6 Structure and Motion I
    3.6.1 Resection
    3.6.2 Intersection
    3.6.3 Linear estimation of tensors
    3.6.4 Factorization
  3.7 Structure and Motion II
    3.7.1 Two-view geometry computation
    3.7.2 Structure and motion recovery
  3.8 Auto-calibration
  3.9 Dense Depth Estimation
    3.9.1 Rectification
    3.9.2 Stereo matching
    3.9.3 Multi-view linking
  3.10 Visual Modeling
    3.10.1 3D surface reconstruction
    3.10.2 Image-based rendering
    3.10.3 Match-moving
  3.11 Conclusion
  Bibliography

4 ROBUST TECHNIQUES FOR COMPUTER VISION (Peter Meer)
  4.1 Robustness in Visual Tasks
  4.2 Models and Estimation Problems
    4.2.1 Elements of a Model
    4.2.2 Estimation of a Model
    4.2.3 Robustness of an Estimator
    4.2.4 Definition of Robustness
    4.2.5 Taxonomy of Estimation Problems
    4.2.6 Linear Errors-in-Variables Regression Model
    4.2.7 Objective Function Optimization
  4.3 Location Estimation
    4.3.1 Why Nonparametric Methods
    4.3.2 Kernel Density Estimation
    4.3.3 Adaptive Mean Shift
    4.3.4 Applications
  4.4 Robust Regression
    4.4.1 Least Squares Family
    4.4.2 M-estimators
    4.4.3 Median Absolute Deviation Scale Estimate
    4.4.4 LMedS, RANSAC and Hough Transform
    4.4.5 The pbM-estimator
    4.4.6 Applications
    4.4.7 Structured Outliers
  4.5 Conclusion
  Bibliography

5 THE TENSOR VOTING FRAMEWORK (Gérard Medioni and Philippos Mordohai)
  5.1 Introduction
    5.1.1 Motivation
    5.1.2 Desirable descriptions
    5.1.3 Our approach
    5.1.4 Chapter Overview
  5.2 Related Work
  5.3 Tensor Voting in 2D
    5.3.1 Second Order Representation and Voting in 2-D
    5.3.2 First Order Representation and Voting in 2-D
    5.3.3 Voting Fields
    5.3.4 Vote analysis
    5.3.5 Results in 2-D
    5.3.6 Illusory Contours
  5.4 Tensor Voting in 3D
    5.4.1 Representation in 3-D
    5.4.2 Voting in 3-D
    5.4.3 Vote analysis
    5.4.4 Results in 3-D
  5.5 Tensor Voting in ND
    5.5.1 Computational Complexity
  5.6 Application to Computer Vision Problems
    5.6.1 Initial Matching
    5.6.2 Uniqueness
    5.6.3 Discrete Densification
    5.6.4 Discontinuity Localization
    5.6.5 Stereo
    5.6.6 Multiple View Stereo
    5.6.7 Visual Motion from Motion Cues
    5.6.8 Visual Motion on Real Images
  5.7 Conclusion and Future Work
  5.8 Acknowledgment
  Bibliography

SECTION II: APPLICATIONS IN COMPUTER VISION

6 IMAGE BASED LIGHTING (Paul E. Debevec)
  6.1 Basic Image Based Lighting
    6.1.1 Capturing Light
    6.1.2 Illuminating Synthetic Objects with Real Light
    6.1.3 Lighting Entire Environments with IBL
  6.2 Advanced Image Based Lighting
    6.2.1 Capturing a Light Probe in Direct Sunlight
    6.2.2 Compositing objects into the scene including shadows
    6.2.3 Image-Based Lighting in Fiat Lux
    6.2.4 Capturing and Rendering Spatially-Varying Illumination
  6.3 Image Based Relighting
  6.4 Conclusion
  Bibliography

7 COMPUTER VISION IN VISUAL EFFECTS (Doug Roble)
  7.1 Introduction
  7.2 Computer Vision Problems Unique to Film
    7.2.1 Welcome to the Set
  7.3 Feature Tracking
  7.4 Optical Flow
  7.5 Camera Tracking and Structure from Motion
  7.6 The Future
  Bibliography

8 CONTENT BASED IMAGE RETRIEVAL: AN OVERVIEW (Theo Gevers and Arnold W.M. Smeulders)
  8.1 Overview of the chapter
  8.2 Image Domains
    8.2.1 Search modes
    8.2.2 The sensory gap
    8.2.3 The semantic gap
    8.2.4 Discussion
  8.3 Image Features
    8.3.1 Color
    8.3.2 Shape
    8.3.3 Texture
    8.3.4 Discussion
  8.4 Representation and Indexing
    8.4.1 Grouping data
    8.4.2 Features accumulation
    8.4.3 Feature accumulation and image partitioning
    8.4.4 Salient features
    8.4.5 Shape and object features
    8.4.6 Structure and lay-out
    8.4.7 Discussion
  8.5 Similarity and Search
    8.5.1 Semantic interpretation
    8.5.2 Similarity between features
    8.5.3 Similarity of object outlines
    8.5.4 Similarity of object arrangements
    8.5.5 Similarity of salient features
    8.5.6 Discussion
  8.6 Interaction and Learning
    8.6.1 Interaction on a semantic level
    8.6.2 Classification on a semantic level
    8.6.3 Learning
    8.6.4 Discussion
  8.7 Conclusion
  Bibliography

9 FACE DETECTION, ALIGNMENT AND RECOGNITION (Stan Z. Li and Juwei Lu)
  9.1 Introduction
  9.2 Face Detection
    9.2.1 Appearance and Learning Based Approach
    9.2.2 Preprocessing
    9.2.3 Neural and Kernel Methods
    9.2.4 Boosting Based Methods
    9.2.5 Post-Processing
    9.2.6 Evaluation
  9.3 Face Alignment
    9.3.1 Active Shape Model
    9.3.2 Active Appearance Model
    9.3.3 Modeling Shape from Texture
    9.3.4 Dealing with Head Pose
    9.3.5 Evaluation
  9.4 Face Recognition
    9.4.1 Preprocessing
    9.4.2 Feature Extraction
    9.4.3 Pattern Classification
    9.4.4 Evaluation
  Bibliography

10 PERCEPTUAL INTERFACES (Matthew Turk and Mathias Kölsch)
  10.1 Introduction
  10.2 Perceptual Interfaces and HCI
  10.3 Multimodal Interfaces
  10.4 Vision Based Interfaces
    10.4.1 Terminology
    10.4.2 Elements of VBI
    10.4.3 Computer Vision Methods for VBI
    10.4.4 VBI Summary
  10.5 Brain-Computer Interfaces
  10.6 Summary
  Bibliography

SECTION III: PROGRAMMING FOR COMPUTER VISION

11 OPEN SOURCE COMPUTER VISION LIBRARY (Gary Bradski)
  11.1 Overview
    11.1.1 Installation
    11.1.2 Organization
    11.1.3 Optimizations
  11.2 Functional Groups: What's Good for What
    11.2.1 By Area
    11.2.2 By Task
    11.2.3 Demos and Samples
  11.3 Pictorial Tour
    11.3.1 Functional Groups
    11.3.2 Demo Tour
  11.4 Programming Examples Using C/C++
    11.4.1 Read Images from Disk
    11.4.2 Read AVIs from Disk, or Video from a Camera
  11.5 Other Interfaces
    11.5.1 Ch
    11.5.2 Matlab
    11.5.3 Lush
  11.6 Appendix A
  11.7 Appendix B
  Bibliography

12 SOFTWARE ARCHITECTURE FOR COMPUTER VISION (Alexandre R.J. François)
  12.1 Introduction
    12.1.1 Motivation
    12.1.2 Contribution
    12.1.3 Outline
  12.2 SAI: A Software Architecture Model
    12.2.1 Beyond Pipes and Filters
    12.2.2 The SAI Architectural Style
    12.2.3 Example Designs
    12.2.4 Architectural Properties
  12.3 MFSM: An Architectural Middleware
    12.3.1 MFSM overview
    12.3.2 A First Image Manipulation Example
    12.3.3 Custom Elements
    12.3.4 A Shared Memory Access Example
  12.4 Conclusion
    12.4.1 Summary
    12.4.2 Perspectives
  12.5 Acknowledgments
  Bibliography
PREFACE

One of the major changes instituted at the 2001 Conference on Computer Vision and Pattern Recognition (CVPR) in Kauai, HI was the replacement of the traditional tutorial sessions with a set of short courses. The topics of these short courses were carefully chosen to reflect the diversity in computer vision and to represent very promising areas. The response to these short courses was a very pleasant surprise, with more than 200 people attending a single short course. This overwhelming response was the inspiration for this book.

There are three parts in this book. The first part covers some of the more fundamental aspects of computer vision, the second describes a few interesting applications, and the third details specific approaches to facilitate programming for computer vision. This book is not intended to be a comprehensive coverage of computer vision; it can, however, be used as a complement to most computer vision textbooks.

A unique aspect of this book is the accompanying DVD, which features videos of lectures by the contributors. We feel that these lectures will be very useful to readers as quick previews of the topics covered in the book. In addition, these lectures are much more effective at depicting results in the form of video or animations, compared to printed material.

We would like to thank all the contributors for their hard work, and Bernard Goodwin for his support and enthusiasm for our book project. The USC Distance Education Network helped to tape and produce the lectures, and Bertran Harden tirelessly assembled all the multimedia content onto a DVD. We are also grateful to P. Anandan and Microsoft Corporation for the financial support used to defray some of the lecture production costs.

Gérard Medioni, University of Southern California
Sing Bing Kang, Microsoft Research
November, 2003
CONTRIBUTORS

Gary Bradski
Mgr: Machine Learning Group, Intel Labs
SC12-303, 2200 Mission College Blvd.
Santa Clara, CA 95052-8119, USA
www.intel.com/research/mrl/research/opencv
www.intel.com/research/mrl/research/media-visual.htm

Paul Debevec
USC Institute for Creative Technologies
13274 Fiji Way, 5th Floor
Marina del Rey, CA 90292, USA
http://www.debevec.org/

Alexandre R.J. François
PHE-222 MC-0273, Institute for Robotics and Intelligent Systems
University of Southern California
Los Angeles, CA 90089-0273, USA
afrancoi@usc.edu
iris.usc.edu/~afrancoi

Theo Gevers
University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
gevers@science.uva.nl
http://carol.science.uva.nl/~gevers/

Anders Heyden
Centre for Mathematical Sciences, Lund University
Box 118, SE-221 00 Lund, Sweden
heyden@maths.lth.se
www.maths.lth.se/matematiklth/personal/andersp/

Mathias Kölsch
Computer Science Department, University of California
Santa Barbara, CA 93106, USA
matz@cs.ucsb.edu
http://www.cs.ucsb.edu/~matz

Stan Z. Li
Microsoft Research Asia
5/F, Beijing Sigma Center, No. 49, Zhichun Road, Hai Dian District
Beijing, China 100080
szli@microsoft.com
www.research.microsoft.com/~szli

Juwei Lu
Bell Canada Multimedia Lab, University of Toronto
Bahen Centre for Information Technology
Room 4154, 40 St George Str.
Toronto, ON, M5S 3G4, Canada
juwei@dsp.utoronto.ca
www.dsp.utoronto.ca/~juwei/

Gérard Medioni
SAL 300, MC-0781, Computer Science Department
University of Southern California
Los Angeles, CA 90089-0781, USA
medioni@iris.usc.edu
iris.usc.edu/home/iris/medioni/User.html

Peter Meer
Electrical and Computer Engineering Department, Rutgers University
94 Brett Road
Piscataway, NJ 08854-8058, USA
meer@caip.rutgers.edu
www.caip.rutgers.edu/~meer

Philippos Mordohai
PHE 204, MC-0273, 3737 Watt Way
Los Angeles, CA 90089-0273, USA
mordohai@usc.edu
iris.usc.edu/home/iris/mordohai/User.html

Marc Pollefeys
Department of Computer Science, University of North Carolina
Sitterson Hall, CB#3175
Chapel Hill, NC 27599-3175, USA
marc@cs.unc.edu
www.cs.unc.edu/~marc/

Doug Roble
Digital Domain
300 Rose Av
Venice, CA 90291, USA
www.d2.com

Arnold Smeulders
ISIS group, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
smeulder@science.uva.nl
www.science.uva.nl/isis/

Matthew Turk
Computer Science Department, University of California
Santa Barbara, CA 93106, USA
mturk@cs.ucsb.edu
www.cs.ucsb.edu/~mturk

Zhengyou Zhang
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052, USA
zhang@microsoft.com
www.research.microsoft.com/~zhang/
Chapter 1

INTRODUCTION

The topics in this book were handpicked to showcase what we consider to be exciting and promising in computer vision. They are a mix of well-known, traditional topics (such as camera calibration, multi-view geometry, and face detection) and newer ones (such as vision for special effects and the tensor voting framework). All have the common denominator of either demonstrated longevity or potential for endurance in computer vision, even as the popularity of a number of other areas has come and gone in the past.

1.1 Organization

The book is organized into three sections, covering fundamentals, applications, and programming aspects of computer vision.

The fundamentals section consists of four chapters. Two of the chapters deal with the more conventional but still popular areas: camera calibration and multi-view geometry. They deal with the most fundamental operations associated with vision. The chapter on robust estimation techniques will be very useful for researchers and practitioners of computer vision alike. There is also a chapter on a more recent tool (namely the tensor voting framework) that can be customized for a variety of problems.

The applications section covers two more recent applications (image-based lighting and vision for visual effects) and three in more conventional areas (image search engines, face detection and recognition, and perceptual interfaces).

One of the more overlooked areas in computer vision is the programming aspect. While there are generic commercial packages that can be used, there exist popular libraries and packages that are specifically geared toward computer vision. The final section of the book describes two different approaches to facilitate programming for computer vision.
1.2 How to Use the Book

The book is designed to be accompanying material to computer vision textbooks. Each chapter is designed to be self-contained and is written by well-known authorities in the area. We suggest that the reader watch the lecture before reading the chapter, as the lecture (given by the contributor) provides an excellent overview of the topic.

1.3 Contents of DVDs

The two DVDs are organized by chapter as follows:

– Chap. 2: Camera Calibration (Z. Zhang) – VS
– Chap. 3: Multiple View Geometry (A. Heyden, M. Pollefeys) – VS
– Chap. 4: Robust Techniques for Computer Vision (P. Meer) – VS
– Chap. 5: The Tensor Voting Framework (G. Medioni, P. Mordohai) – VS
– Chap. 6: Image Based Lighting (P.E. Debevec) – VSC
– Chap. 7: Computer Vision in Visual Effects (D. Roble) – SC
– Chap. 8: Content Based Image Retrieval: An Overview (T. Gevers, A.W.M. Smeulders) – V
– Chap. 9: Face Detection, Alignment and Recognition (S.Z. Li, J. Lu) – V
– Chap. 10: Perceptual Interfaces (M. Turk, M. Kölsch) – VS
– Chap. 11: Open Source Computer Vision Library (G. Bradski) – SP
– Chap. 12: Software Architecture for Computer Vision (A.R.J. François) – VS

(Note: V = video presentation, S = slides in PDF format, C = color images in both BMP and PDF formats, P = project and source code.)
SECTION I: FUNDAMENTALS IN COMPUTER VISION

It is only fitting that we start with some of the more fundamental concepts in computer vision. The range of topics covered in this section is wide: camera calibration, structure from motion, dense stereo, 3D modeling, robust techniques for model fitting, and a more recently developed concept called tensor voting.

In Chapter 2, Zhang reviews the different techniques for calibrating a camera. More specifically, he describes calibration techniques that use 3D reference objects, 2D planes, and 1D lines, as well as self-calibration techniques.

One of the more popular (and difficult) areas in computer vision is stereo. Heyden and Pollefeys describe in Chapter 3 how camera motion and scene structure can be reliably extracted from image sequences. Once this is accomplished, dense depth distributions can be extracted for 3D surface reconstruction and image-based rendering applications.

A basic task in computer vision is hypothesizing models (e.g., 2D shapes) and using input data (typically image data) to corroborate and fit the models. In practice, however, robust techniques for model fitting must be used to handle input noise. In Chapter 4, Meer describes various robust regression techniques such as M-estimators, RANSAC, and the Hough transform. He also covers the mean shift algorithm for the location estimation problem.

The claim by Medioni and his colleagues that computer vision problems can be addressed within a Gestalt framework is the basis of their work on tensor voting. In Chapter 5, Medioni and Mordohai provide an introduction to the concept of tensor voting, a form of binning according to proximity to ideal primitives such as edges and points. They show how this scheme can be applied to a variety of applications, such as curve and surface extraction from noisy 2D and 3D points (respectively), stereo matching, and motion-based grouping.
Chapter 2

CAMERA CALIBRATION

Zhengyou Zhang

Camera calibration is a necessary step in 3D computer vision in order to extract metric information from 2D images. It has been studied extensively in computer vision and photogrammetry, and new techniques continue to be proposed. In this chapter, we review the techniques proposed in the literature, including those using 3D apparatus (two or three planes orthogonal to each other, or a plane undergoing a pure translation, etc.), 2D objects (planar patterns undergoing unknown motions), 1D objects (a wand with dots), and unknown scene points in the environment (self-calibration). The focus is on presenting these techniques within a consistent framework.

2.1 Introduction

Camera calibration is a necessary step in 3D computer vision in order to extract metric information from 2D images. Much work has been done, starting in the photogrammetry community (see [3, 6] to cite a few), and more recently in computer vision ([12, 11, 33, 10, 37, 35, 22, 9] to cite a few). According to the dimension of the calibration objects, we can classify those techniques roughly into the following categories.

3D reference object based calibration. Camera calibration is performed by observing a calibration object whose geometry in 3-D space is known with very good precision. Calibration can be done very efficiently [8]. The calibration object usually consists of two or three planes orthogonal to each other. Sometimes, a plane undergoing a precisely known translation is also used [33], which equivalently provides 3D reference points. This approach requires an expensive calibration apparatus and
an elaborate setup.

2D plane based calibration. Techniques in this category require observing a planar pattern shown at a few different orientations [42, 31]. Different from Tsai's technique [33], the knowledge of the plane motion is not necessary. Because almost anyone can make such a calibration pattern by him/herself, the setup is easier for camera calibration.

1D line based calibration. Calibration objects used in this category are composed of a set of collinear points [44]. As will be shown, a camera can be calibrated by observing a line moving around a fixed point, such as a string of balls hanging from the ceiling.

Self-calibration. Techniques in this category do not use any calibration object, and can be considered as a 0D approach because only image point correspondences are required. Just by moving a camera in a static scene, the rigidity of the scene provides in general two constraints [22, 21] on the camera's internal parameters from one camera displacement by using image information alone. Therefore, if images are taken by the same camera with fixed internal parameters, correspondences between three images are sufficient to recover both the internal and external parameters, which allow us to reconstruct 3-D structure up to a similarity [20, 17]. Although no calibration objects are necessary, a large number of parameters need to be estimated, resulting in a much harder mathematical problem.

Other techniques exist: vanishing points for orthogonal directions [4, 19], and calibration from pure rotation [16, 30].

Before going further, I'd like to point out that no single calibration technique is the best for all situations. It really depends on the situation a user needs to deal with. Following are my few recommendations:

– Calibration with apparatus vs. self-calibration. Whenever possible, if we can pre-calibrate a camera, we should do it with a calibration apparatus. Self-calibration cannot usually achieve an accuracy comparable with that of pre-calibration, because self-calibration needs to estimate a large number of parameters, resulting in a much harder mathematical problem. When pre-calibration is impossible (e.g., scene reconstruction from an old movie), self-calibration is the only choice.

– Partial vs. full self-calibration. Partial self-calibration refers to the case where only a subset of camera intrinsic parameters are to be
calibrated. Along the same line as the previous recommendation, whenever possible, partial self-calibration is preferred because the number of parameters to be estimated is smaller. Take as an example 3D reconstruction with a camera with variable focal length: it is preferable to pre-calibrate the pixel aspect ratio and the pixel skewness.

– Calibration with 3D vs. 2D apparatus. The highest accuracy can usually be obtained by using a 3D apparatus, so it should be used when accuracy is indispensable and when it is affordable to make and use a 3D apparatus. From the feedback I received from computer vision researchers and practitioners around the world in the last couple of years, calibration with a 2D apparatus seems to be the best choice in most situations because of its ease of use and good accuracy.

– Calibration with 1D apparatus. This technique is relatively new, and it is hard for the moment to predict how popular it will be. It should, however, be useful especially for calibration of a camera network. To calibrate the relative geometry between multiple cameras as well as their intrinsic parameters, it is necessary for all involved cameras to simultaneously observe a number of points. It is hardly possible to achieve this with a 3D or 2D calibration apparatus¹ if one camera is mounted in the front of a room while another is in the back. This is not a problem for 1D objects: we can, for example, use a string of balls hanging from the ceiling.

This chapter is organized as follows. Section 2.2 describes the camera model and introduces the concept of the absolute conic, which is important for camera calibration. Section 2.3 presents the calibration techniques using a 3D apparatus. Section 2.4 describes a calibration technique based on observing a freely moving planar pattern (2D object); its extension to stereo calibration is also addressed. Section 2.5 describes a relatively new technique which uses a set of collinear points (1D object). Section 2.6 briefly introduces the self-calibration approach and provides references for further reading. Section 2.7 concludes the chapter with a discussion of recent work in this area.

2.2 Notation and Problem Statement

We start with the notation used in this chapter.

¹ An exception is when those apparatus are made transparent; then the cost would be much higher.
2.2.1 Pinhole Camera Model

[Figure 2.1. Pinhole camera model]

A 2D point is denoted by $m = [u, v]^T$. A 3D point is denoted by $M = [X, Y, Z]^T$. We use $\tilde{x}$ to denote the augmented vector obtained by adding 1 as the last element: $\tilde{m} = [u, v, 1]^T$ and $\tilde{M} = [X, Y, Z, 1]^T$. A camera is modeled by the usual pinhole model (see Figure 2.1): the image of a 3D point $M$, denoted by $m$, is formed by an optical ray from $M$ passing through the optical center $C$ and intersecting the image plane. The three points $M$, $m$, and $C$ are collinear. In Figure 2.1, for illustration purposes, the image plane is positioned between the scene point and the optical center, which is mathematically equivalent to the physical setup under which the image plane is on the other side with respect to the optical center.

The relationship between the 3D point $M$ and its image projection $m$ is given by

$$ s\,\tilde{m} = A\,[R \;\; t]\,\tilde{M} \equiv P\tilde{M} \,, \tag{2.1} $$

with

$$ A = \begin{bmatrix} \alpha & \gamma & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2.2} $$

and

$$ P = A\,[R \;\; t] \tag{2.3} $$

where $s$ is an arbitrary scale factor; $(R, t)$, called the extrinsic parameters, is the rotation and translation which relates the world coordinate system to the camera coordinate system; and $A$ is called the camera intrinsic matrix, with $(u_0, v_0)$ the coordinates of the principal point, $\alpha$ and $\beta$ the scale factors in the image $u$ and $v$ axes, and $\gamma$ the parameter describing the skew of the two image axes.
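To make the projection model concrete, here is a minimal NumPy sketch (not from the chapter; the numeric values are made up for illustration) that builds A from (α, β, γ, u0, v0), forms P = A[R t] as in (2.3), and projects a 3D point with (2.1).

```python
import numpy as np

# Intrinsic matrix A; the values are illustrative only.
alpha, beta, gamma, u0, v0 = 1000.0, 1000.0, 0.0, 320.0, 240.0
A = np.array([[alpha, gamma, u0],
              [0.0,   beta,  v0],
              [0.0,   0.0,   1.0]])

# Extrinsic parameters: world frame aligned with the camera, shifted along the optical axis.
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])

P = A @ np.hstack([R, t.reshape(3, 1)])   # 3x4 projection matrix, Eq. (2.3)

M = np.array([0.1, -0.2, 10.0, 1.0])      # homogeneous 3D point
m = P @ M                                  # Eq. (2.1): s * [u, v, 1]^T
u, v = m[:2] / m[2]                        # divide out the scale factor s
print(u, v)
```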
The 3 × 4 matrix $P$ is called the camera projection matrix, which mixes both intrinsic and extrinsic parameters. In Figure 2.1, the angle between the two image axes is denoted by $\theta$, and we have $\gamma = \alpha \cot\theta$. If the pixels are rectangular, then $\theta = 90°$ and $\gamma = 0$.

The task of camera calibration is to determine the parameters of the transformation between an object in 3D space and the 2D image observed by the camera from visual information (images). The transformation includes

– Extrinsic parameters (sometimes called external parameters): orientation (rotation) and location (translation) of the camera, i.e., $(R, t)$;
– Intrinsic parameters (sometimes called internal parameters): characteristics of the camera, i.e., $(\alpha, \beta, \gamma, u_0, v_0)$.

The rotation matrix, although consisting of 9 elements, only has 3 degrees of freedom. The translation vector $t$ obviously has 3 parameters. Therefore, there are 6 extrinsic parameters and 5 intrinsic parameters, leading to 11 parameters in total.

We use the abbreviation $A^{-T}$ for $(A^{-1})^T$ or $(A^T)^{-1}$.

2.2.2 Absolute Conic

Now let us introduce the concept of the absolute conic. For more details, the reader is referred to [7, 15].

A point $x$ in 3D space has projective coordinates $\tilde{x} = [x_1, x_2, x_3, x_4]^T$. The equation of the plane at infinity, $\Pi_\infty$, is $x_4 = 0$. The absolute conic $\Omega$ is defined by the set of points satisfying the equations

$$ x_1^2 + x_2^2 + x_3^2 = 0\,, \qquad x_4 = 0 \,. \tag{2.4} $$

Let $x_\infty = [x_1, x_2, x_3]^T$ be a point on the absolute conic (see Figure 2.2). By definition, we have $x_\infty^T x_\infty = 0$. We also have $\tilde{x}_\infty = [x_1, x_2, x_3, 0]^T$ and $\tilde{x}_\infty^T \tilde{x}_\infty = 0$. This can be interpreted as a conic of purely imaginary points on $\Pi_\infty$. Indeed, let $x = x_1/x_3$ and $y = x_2/x_3$ be a point on the conic; then $x^2 + y^2 = -1$, which is an imaginary circle of radius $\sqrt{-1}$.

An important property of the absolute conic is its invariance to any rigid transformation. Let the rigid transformation be

$$ H = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} . $$

Let $x_\infty$ be a point on $\Omega$. By definition, its projective coordinates are $\tilde{x}_\infty = \begin{bmatrix} x_\infty \\ 0 \end{bmatrix}$ with
$x_\infty^T x_\infty = 0$.

[Figure 2.2. Absolute conic and its image]

The point after the rigid transformation is denoted by $x'_\infty$, and

$$ \tilde{x}'_\infty = H\tilde{x}_\infty = \begin{bmatrix} Rx_\infty \\ 0 \end{bmatrix} . $$

Thus, $x'_\infty$ is also on the plane at infinity. Furthermore, $x'_\infty$ is on the same $\Omega$ because

$$ x'^{T}_\infty x'_\infty = (Rx_\infty)^T(Rx_\infty) = x_\infty^T(R^TR)x_\infty = 0 \,. $$

The image of the absolute conic, denoted by $\omega$, is also an imaginary conic, and is determined only by the intrinsic parameters of the camera. This can be seen as follows. Consider the projection of a point $x_\infty$ on $\Omega$, denoted by $m_\infty$, which is given by

$$ \tilde{m}_\infty = sA[R \;\; t]\begin{bmatrix} x_\infty \\ 0 \end{bmatrix} = sARx_\infty \,. $$

It follows that

$$ \tilde{m}_\infty^T A^{-T}A^{-1}\tilde{m}_\infty = s^2 x_\infty^T R^TRx_\infty = s^2 x_\infty^T x_\infty = 0 \,. $$

Therefore, the image of the absolute conic is an imaginary conic, and is defined by $A^{-T}A^{-1}$. It does not depend on the extrinsic parameters of the camera.

If we can determine the image of the absolute conic, then we can solve for the camera's intrinsic parameters, and the calibration is solved. We will show several ways in this chapter to determine $\omega$, the image of the absolute conic.
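The invariance just derived is easy to check numerically. The following sketch is mine, not the chapter's; the intrinsic values are made up, and it simply takes a (complex) point on the absolute conic, projects it with an arbitrary rotation, and verifies that the image lies on ω = A^{-T}A^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative intrinsics and an arbitrary rotation (translation drops out at infinity).
A = np.array([[1000.0, 0.5, 320.0], [0.0, 1010.0, 240.0], [0.0, 0.0, 1.0]])
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))       # a random orthonormal matrix

# A point on the absolute conic: x1^2 + x2^2 + x3^2 = 0 (note: no conjugation).
x_inf = np.array([1.0, 1.0j, 0.0])

# Its image m ~ A R x_inf lies on omega = A^{-T} A^{-1}, independently of R.
m = A @ (R @ x_inf)
omega = np.linalg.inv(A).T @ np.linalg.inv(A)
print(abs(m @ omega @ m))                           # ~0 up to floating-point error
```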
2.3 Camera Calibration with 3D Objects

The traditional way to calibrate a camera is to use a 3D reference object such as those shown in Figure 2.3. In Fig. 2.3a, the calibration apparatus used at INRIA [8] is shown, which consists of two orthogonal planes, on each of which a checker pattern is printed. A 3D coordinate system is attached to this apparatus, and the coordinates of the checker corners are known very accurately in this coordinate system. A similar calibration apparatus is a cube with a checker pattern painted on each face, so in general three faces will be visible to the camera. Figure 2.3b illustrates the device used in Tsai's technique [33], which only uses one plane with a checker pattern, but the plane needs to be displaced at least once with known motion. This is equivalent to knowing the 3D coordinates of the checker corners.

A popular technique in this category consists of four steps [8]:

1. Detect the corners of the checker pattern in each image;
2. Estimate the camera projection matrix P using linear least squares;
3. Recover intrinsic and extrinsic parameters A, R and t from P;
4. Refine A, R and t through a nonlinear optimization.

Note that it is also possible to first refine P through a nonlinear optimization, and then determine A, R and t from the refined P.

It is worth noting that using corners is not the only possibility. We can avoid corner detection by working directly in the image. In [25], calibration
is realized by maximizing the gradients around a set of control points that define the calibration object. Figure 2.4 illustrates the control points used in that work.

[Figure 2.3. 3D apparatus for calibrating cameras: (a) two orthogonal planes with checker patterns; (b) a plane undergoing a known displacement]

[Figure 2.4. Control points used in a gradient-based calibration technique]
2.3.1 Feature Extraction

If one uses a generic corner detector, such as the Harris corner detector, to detect the corners in the checker pattern image, the result is usually not good because the detected corners have poor accuracy (about one pixel). A better solution is to leverage the known pattern structure by first estimating a line for each side of the square and then computing the corners by intersecting the fitted lines.

There are two common techniques to estimate the lines. The first is to detect edges and then fit a line to the edges on each side of the square. The second technique is to directly fit a line to each side of a square in the image such that the gradient on the line is maximized. One possibility is to represent the line by an elongated Gaussian, and estimate the parameters of the elongated Gaussian by maximizing the total gradient covered by the Gaussian.

We should note that if the lens distortion is not severe, a better solution is to fit just one single line to all the collinear sides. This leads to a much more accurate estimation of the position of the checker corners.

2.3.2 Linear Estimation of the Camera Projection Matrix

Once we extract the corner points in the image, we can easily establish their correspondences with the points in the 3D space because of our knowledge of the patterns. Based on the projection equation (2.1), we are now able to estimate the camera parameters. However, the problem is quite nonlinear if we try to estimate directly A, R and t. If, on the other hand, we estimate the camera projection matrix P, a linear solution is possible, as shown now.

Given each 2D-3D correspondence $m_i = (u_i, v_i) \leftrightarrow M_i = (X_i, Y_i, Z_i)$, we can write down two equations based on (2.1):

$$ \underbrace{\begin{bmatrix} X_i & Y_i & Z_i & 1 & 0 & 0 & 0 & 0 & -u_iX_i & -u_iY_i & -u_iZ_i & -u_i \\ 0 & 0 & 0 & 0 & X_i & Y_i & Z_i & 1 & -v_iX_i & -v_iY_i & -v_iZ_i & -v_i \end{bmatrix}}_{G_i}\, p = 0 $$

where $p = [p_{11}, p_{12}, \ldots, p_{34}]^T$ and $0 = [0, 0]^T$. For $n$ point matches, we can stack all equations together:

$$ Gp = 0 \quad \text{with} \quad G = [G_1^T, \ldots, G_n^T]^T . $$

Matrix $G$ is a $2n \times 12$ matrix. The projection matrix can now be solved by

$$ \min_{p} \|Gp\|^2 \quad \text{subject to} \quad \|p\| = 1 \,. $$

The solution is the eigenvector of $G^TG$ associated with the smallest eigenvalue.
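As an illustration of this linear step, the following NumPy sketch (mine, not part of the chapter) builds G from a list of 2D-3D correspondences and takes the right singular vector of G with the smallest singular value, which is equivalent to the eigenvector solution just described.

```python
import numpy as np

def estimate_projection_matrix(points_3d, points_2d):
    """Linear (DLT-style) estimate of the 3x4 projection matrix P from n >= 6
    2D-3D correspondences, minimizing ||G p|| subject to ||p|| = 1."""
    G = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        G.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        G.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    G = np.asarray(G)
    # Right singular vector with the smallest singular value
    # (= eigenvector of G^T G with the smallest eigenvalue).
    _, _, Vt = np.linalg.svd(G)
    return Vt[-1].reshape(3, 4)
```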
In the above, in order to avoid the trivial solution $p = 0$ and considering the fact that $p$ is defined up to a scale factor, we have set $\|p\| = 1$. Other normalizations are possible. In [1], $p_{34} = 1$, which, however, introduces a singularity when the correct value of $p_{34}$ is close to zero. In [10], the constraint $p_{31}^2 + p_{32}^2 + p_{33}^2 = 1$ was used, which is singularity free. In any case, the above linear technique minimizes an algebraic distance and yields a biased estimation when the data are noisy. We will present an unbiased solution later.

2.3.3 Recover Intrinsic and Extrinsic Parameters from P

Once the camera projection matrix P is known, we can uniquely recover the intrinsic and extrinsic parameters of the camera. Let us denote the first 3×3 submatrix of P by B and the last column of P by b, i.e., P ≡ [B b]. Since P = A[R t], we have

$$ B = AR \tag{2.5} $$
$$ b = At \tag{2.6} $$

From (2.5), we have

$$ K \equiv BB^T = AA^T = \begin{bmatrix} \alpha^2 + \gamma^2 + u_0^2 & u_0v_0 + \gamma\beta & u_0 \\ u_0v_0 + \gamma\beta & \beta^2 + v_0^2 & v_0 \\ u_0 & v_0 & 1 \end{bmatrix} \equiv \begin{bmatrix} k_u & k_c & u_0 \\ k_c & k_v & v_0 \\ u_0 & v_0 & 1 \end{bmatrix} . $$

Because P is defined up to a scale factor, the last element of $K = BB^T$ is usually not equal to 1, so we first have to normalize it such that $K_{33}$ (the last element) $= 1$. After that, we immediately obtain

$$ u_0 = K_{13} \tag{2.7} $$
$$ v_0 = K_{23} \tag{2.8} $$
$$ \beta = \sqrt{k_v - v_0^2} \tag{2.9} $$
$$ \gamma = (k_c - u_0 v_0)/\beta \tag{2.10} $$
$$ \alpha = \sqrt{k_u - u_0^2 - \gamma^2} \tag{2.11} $$

The solution is unambiguous because $\alpha > 0$ and $\beta > 0$.

Once the intrinsic parameters, or equivalently matrix A, are known, the extrinsic parameters can be determined from (2.5) and (2.6) as:

$$ R = A^{-1}B \tag{2.12} $$
$$ t = A^{-1}b \,. \tag{2.13} $$
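A compact sketch of this decomposition step (mine, not the chapter's; the handling of P's unknown overall scale is an added convenience, and the sign ambiguity of that scale is ignored):

```python
import numpy as np

def decompose_projection_matrix(P):
    """Recover A, R, t from a 3x4 projection matrix P = A[R t],
    following Eqs. (2.5)-(2.13)."""
    B, b = P[:, :3], P[:, 3]
    K = B @ B.T
    K = K / K[2, 2]                       # normalize so that K[2,2] = 1
    u0, v0 = K[0, 2], K[1, 2]
    ku, kc, kv = K[0, 0], K[0, 1], K[1, 1]
    beta = np.sqrt(kv - v0**2)
    gamma = (kc - u0 * v0) / beta
    alpha = np.sqrt(ku - u0**2 - gamma**2)
    A = np.array([[alpha, gamma, u0],
                  [0.0,   beta,  v0],
                  [0.0,   0.0,   1.0]])
    A_inv = np.linalg.inv(A)
    R = A_inv @ B                         # equals scale * R, Eq. (2.12)
    scale = np.linalg.norm(R[0])          # recover the lost scale of P
    return A, R / scale, A_inv @ b / scale
```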
2.3.4 Refine Calibration Parameters Through a Nonlinear Optimization

The above solution is obtained through minimizing an algebraic distance which is not physically meaningful. We can refine it through maximum likelihood inference.

We are given n 2D-3D correspondences $m_i = (u_i, v_i) \leftrightarrow M_i = (X_i, Y_i, Z_i)$. Assume that the image points are corrupted by independent and identically distributed noise. The maximum likelihood estimate can be obtained by minimizing the distances between the image points and their predicted positions, i.e.,

$$ \min_{P} \sum_i \|m_i - \phi(P, M_i)\|^2 \tag{2.14} $$

where $\phi(P, M_i)$ is the projection of $M_i$ onto the image according to (2.1). This is a nonlinear minimization problem, which can be solved with the Levenberg-Marquardt algorithm as implemented in Minpack [23]. It requires an initial guess of P, which can be obtained using the linear technique described earlier. Note that since P is defined up to a scale factor, we can set the element having the largest initial value to 1 during the minimization.

Alternatively, instead of estimating P as in (2.14), we can directly estimate the intrinsic and extrinsic parameters, A, R, and t, using the same criterion. The rotation matrix can be parameterized with three variables, such as Euler angles or a scaled rotation vector.

2.3.5 Lens Distortion

Up to this point, we have used the pinhole model to describe a camera. It says that a point in 3D space, its corresponding point in the image, and the camera's optical center are collinear. This linear projective equation is sometimes not sufficient, especially for low-end cameras (such as WebCams) or wide-angle cameras; lens distortion has to be considered. According to [33], there are four steps in camera projection including lens distortion:
Step 1: Rigid transformation from the world coordinate system $(X_w, Y_w, Z_w)$ to the camera one $(X, Y, Z)$:

$$ [X\ Y\ Z]^T = R\,[X_w\ Y_w\ Z_w]^T + t $$

Step 2: Perspective projection from 3D camera coordinates $(X, Y, Z)$ to ideal image coordinates $(x, y)$ under the pinhole camera model:

$$ x = f\frac{X}{Z}\,, \qquad y = f\frac{Y}{Z} $$

where $f$ is the effective focal length.

Step 3: Lens distortion²:

$$ \breve{x} = x + \delta_x\,, \qquad \breve{y} = y + \delta_y $$

where $(\breve{x}, \breve{y})$ are the distorted or true image coordinates, and $(\delta_x, \delta_y)$ are distortions applied to $(x, y)$.

Step 4: Affine transformation from real image coordinates $(\breve{x}, \breve{y})$ to frame buffer (pixel) image coordinates $(\breve{u}, \breve{v})$:

$$ \breve{u} = d_x^{-1}\breve{x} + u_0\,, \qquad \breve{v} = d_y^{-1}\breve{y} + v_0\,, $$

where $(u_0, v_0)$ are coordinates of the principal point; $d_x$ and $d_y$ are distances between adjacent pixels in the horizontal and vertical directions, respectively.

There are two types of distortions:

Radial distortion: It is symmetric; ideal image points are distorted along radial directions from the distortion center. This is caused by imperfect lens shape.

Decentering distortion: This is usually caused by improper lens assembly; ideal image points are distorted in both radial and tangential directions.

The reader is referred to [29, 3, 6, 37] for more details.

² Note that the lens distortion described here is different from Tsai's treatment. Here, we go from ideal to real image coordinates, similar to [36].
The distortion can be expressed as a power series in the radial distance $r = \sqrt{x^2 + y^2}$:

$$ \delta_x = x(k_1r^2 + k_2r^4 + k_3r^6 + \cdots) + [p_1(r^2 + 2x^2) + 2p_2xy](1 + p_3r^2 + \cdots) \,, $$
$$ \delta_y = y(k_1r^2 + k_2r^4 + k_3r^6 + \cdots) + [2p_1xy + p_2(r^2 + 2y^2)](1 + p_3r^2 + \cdots) \,, $$

where the $k_i$'s are coefficients of radial distortion and the $p_j$'s are coefficients of decentering distortion.

Based on the reports in the literature [3, 33, 36], it is likely that the distortion function is totally dominated by the radial components, and especially dominated by the first term. It has also been found that any more elaborate modeling not only would not help (it is negligible when compared with sensor quantization), but also would cause numerical instability [33, 36].

Denote the ideal pixel image coordinates by $u = x/d_x + u_0$ and $v = y/d_y + v_0$. By combining Step 3 and Step 4 and using only the first two radial distortion terms, we obtain the following relationship between $(\breve{u}, \breve{v})$ and $(u, v)$:

$$ \breve{u} = u + (u - u_0)[k_1(x^2 + y^2) + k_2(x^2 + y^2)^2] \tag{2.15} $$
$$ \breve{v} = v + (v - v_0)[k_1(x^2 + y^2) + k_2(x^2 + y^2)^2] \,. \tag{2.16} $$

Following the same reasoning as in (2.14), camera calibration including lens distortion can be performed by minimizing the distances between the image points and their predicted positions, i.e.,

$$ \min_{A, R, t, k_1, k_2} \sum_i \|m_i - \breve{m}(A, R, t, k_1, k_2, M_i)\|^2 \tag{2.17} $$

where $\breve{m}(A, R, t, k_1, k_2, M_i)$ is the projection of $M_i$ onto the image according to (2.1), followed by distortion according to (2.15) and (2.16).
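A small sketch of the forward distortion model (2.15)-(2.16); this is my illustration, not code from the chapter, and the argument names are mine.

```python
import numpy as np

def distort_pixels(uv, xy, u0, v0, k1, k2):
    """Apply the two-term radial distortion of Eqs. (2.15)-(2.16).
    uv: (N, 2) ideal pixel coordinates (u, v)
    xy: (N, 2) corresponding ideal normalized image coordinates (x, y)
    Returns the distorted pixel coordinates."""
    r2 = np.sum(xy**2, axis=1)                 # x^2 + y^2
    factor = k1 * r2 + k2 * r2**2
    centered = uv - np.array([u0, v0])         # (u - u0, v - v0)
    return uv + centered * factor[:, None]
```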
2.3.6 An Example

Figure 2.5 displays an image of a 3D reference object, taken by a camera to be calibrated at INRIA. Each square has 4 corners, and in total 128 points are used for calibration.

Without considering lens distortion, the estimated camera projection matrix is

$$ P = \begin{bmatrix} 7.025659\mathrm{e}{-01} & -2.861189\mathrm{e}{-02} & -5.377696\mathrm{e}{-01} & 6.241890\mathrm{e}{+01} \\ 2.077632\mathrm{e}{-01} & 1.265804\mathrm{e}{+00} & 1.591456\mathrm{e}{-01} & 1.075646\mathrm{e}{+01} \\ 4.634764\mathrm{e}{-04} & -5.282382\mathrm{e}{-05} & 4.255347\mathrm{e}{-04} & 1 \end{bmatrix} $$

From P, we can calculate the intrinsic parameters: $\alpha = 1380.12$, $\beta = 2032.57$, $\gamma \approx 0$, $u_0 = 246.52$, and $v_0 = 243.68$. So, the angle between the two image axes is 90°, and the aspect ratio of the pixels is $\alpha/\beta = 0.679$. For the extrinsic parameters, the translation vector is $t = [-211.28, -106.06, 1583.75]^T$ (in mm), i.e., the calibration object is about 1.5 m away from the camera; the rotation axis is $[-0.08573, -0.99438, 0.0621]^T$ (i.e., almost vertical), and the rotation angle is 47.7°.

[Figure 2.5. An example of camera calibration with a 3D apparatus]

Other notable work in this category includes [27, 38, 36, 18].

2.4 Camera Calibration with 2D Objects: Plane-Based Technique

In this section, we describe how a camera can be calibrated using a moving plane. We first examine the constraints on the camera's intrinsic parameters provided by observing a single plane.

2.4.1 Homography between the model plane and its image

Without loss of generality, we assume the model plane is on $Z = 0$ of the world coordinate system. Let's denote the ith column of the rotation matrix
$R$ by $r_i$. From (2.1), we have

$$ s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A\,[r_1\ r_2\ r_3\ t] \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = A\,[r_1\ r_2\ t] \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} . $$

By abuse of notation, we still use $M$ to denote a point on the model plane, but $M = [X, Y]^T$ since $Z$ is always equal to 0. In turn, $\tilde{M} = [X, Y, 1]^T$. Therefore, a model point $M$ and its image $m$ are related by a homography $H$:

$$ s\,\tilde{m} = H\tilde{M} \quad \text{with} \quad H = A\,[r_1\ r_2\ t] \,. \tag{2.18} $$

As is clear, the 3 × 3 matrix $H$ is defined up to a scale factor.

2.4.2 Constraints on the intrinsic parameters

Given an image of the model plane, a homography can be estimated (see Appendix). Let's denote it by $H = [h_1\ h_2\ h_3]$. From (2.18), we have

$$ [h_1\ h_2\ h_3] = \lambda A\,[r_1\ r_2\ t] \,, $$

where $\lambda$ is an arbitrary scalar. Using the knowledge that $r_1$ and $r_2$ are orthonormal, we have

$$ h_1^T A^{-T}A^{-1}h_2 = 0 \tag{2.19} $$
$$ h_1^T A^{-T}A^{-1}h_1 = h_2^T A^{-T}A^{-1}h_2 \,. \tag{2.20} $$

These are the two basic constraints on the intrinsic parameters, given one homography. Because a homography has 8 degrees of freedom and there are 6 extrinsic parameters (3 for rotation and 3 for translation), we can only obtain 2 constraints on the intrinsic parameters. Note that $A^{-T}A^{-1}$ actually describes the image of the absolute conic [20]. In the next subsection, we will give a geometric interpretation.

2.4.3 Geometric Interpretation

We now relate (2.19) and (2.20) to the absolute conic [22, 20].

It is not difficult to verify that the model plane, under our convention, is described in the camera coordinate system by the following equation:

$$ \begin{bmatrix} r_3 \\ r_3^T t \end{bmatrix}^T \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix} = 0 \,, $$
where $w = 0$ for points at infinity and $w = 1$ otherwise. This plane intersects the plane at infinity in a line, and we can easily see that $\begin{bmatrix} r_1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} r_2 \\ 0 \end{bmatrix}$ are two particular points on that line. Any point on it is a linear combination of these two points, i.e.,

$$ x_\infty = a\begin{bmatrix} r_1 \\ 0 \end{bmatrix} + b\begin{bmatrix} r_2 \\ 0 \end{bmatrix} = \begin{bmatrix} ar_1 + br_2 \\ 0 \end{bmatrix} . $$

Now, let's compute the intersection of the above line with the absolute conic. By definition, the point $x_\infty$, known as the circular point [26], satisfies $x_\infty^T x_\infty = 0$, i.e., $(ar_1 + br_2)^T(ar_1 + br_2) = 0$, or $a^2 + b^2 = 0$. The solution is $b = \pm ai$, where $i^2 = -1$. That is, the two intersection points are

$$ x_\infty = a\begin{bmatrix} r_1 \pm ir_2 \\ 0 \end{bmatrix} . $$

The significance of this pair of complex conjugate points lies in the fact that they are invariant to Euclidean transformations. Their projection in the image plane is given, up to a scale factor, by

$$ \tilde{m}_\infty = A(r_1 \pm ir_2) = h_1 \pm ih_2 \,. $$

Point $\tilde{m}_\infty$ is on the image of the absolute conic, described by $A^{-T}A^{-1}$ [20]. This gives

$$ (h_1 \pm ih_2)^T A^{-T}A^{-1}(h_1 \pm ih_2) = 0 \,. $$

Requiring that both real and imaginary parts be zero yields (2.19) and (2.20).

2.4.4 Closed-form solution

We now provide the details on how to effectively solve the camera calibration problem. We start with an analytical solution. This initial estimation will be followed by a nonlinear optimization technique based on the maximum likelihood criterion, to be described in the next subsection.

Let

$$ B = A^{-T}A^{-1} \equiv \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{12} & B_{22} & B_{23} \\ B_{13} & B_{23} & B_{33} \end{bmatrix} \tag{2.21} $$

$$ = \begin{bmatrix} \dfrac{1}{\alpha^2} & -\dfrac{\gamma}{\alpha^2\beta} & \dfrac{v_0\gamma - u_0\beta}{\alpha^2\beta} \\[6pt] -\dfrac{\gamma}{\alpha^2\beta} & \dfrac{\gamma^2}{\alpha^2\beta^2} + \dfrac{1}{\beta^2} & -\dfrac{\gamma(v_0\gamma - u_0\beta)}{\alpha^2\beta^2} - \dfrac{v_0}{\beta^2} \\[6pt] \dfrac{v_0\gamma - u_0\beta}{\alpha^2\beta} & -\dfrac{\gamma(v_0\gamma - u_0\beta)}{\alpha^2\beta^2} - \dfrac{v_0}{\beta^2} & \dfrac{(v_0\gamma - u_0\beta)^2}{\alpha^2\beta^2} + \dfrac{v_0^2}{\beta^2} + 1 \end{bmatrix} . \tag{2.22} $$
Note that $B$ is symmetric, defined by the 6D vector

$$ b = [B_{11}, B_{12}, B_{22}, B_{13}, B_{23}, B_{33}]^T \,. \tag{2.23} $$

Let the ith column vector of $H$ be $h_i = [h_{i1}, h_{i2}, h_{i3}]^T$. Then, we have

$$ h_i^T B h_j = v_{ij}^T b \tag{2.24} $$

with

$$ v_{ij} = [h_{i1}h_{j1},\ h_{i1}h_{j2} + h_{i2}h_{j1},\ h_{i2}h_{j2},\ h_{i3}h_{j1} + h_{i1}h_{j3},\ h_{i3}h_{j2} + h_{i2}h_{j3},\ h_{i3}h_{j3}]^T . $$

Therefore, the two fundamental constraints (2.19) and (2.20), from a given homography, can be rewritten as 2 homogeneous equations in $b$:

$$ \begin{bmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{bmatrix} b = 0 \,. \tag{2.25} $$

If $n$ images of the model plane are observed, by stacking $n$ such equations as (2.25) we have

$$ Vb = 0 \,, \tag{2.26} $$

where $V$ is a $2n \times 6$ matrix. If $n \ge 3$, we will have in general a unique solution $b$ defined up to a scale factor. If $n = 2$, we can impose the skewless constraint $\gamma = 0$, i.e., $[0, 1, 0, 0, 0, 0]b = 0$, which is added as an additional equation to (2.26). (If $n = 1$, we can only solve two camera intrinsic parameters, e.g., $\alpha$ and $\beta$, assuming $u_0$ and $v_0$ are known (e.g., at the image center) and $\gamma = 0$; that is indeed what we did in [28] for head pose determination, based on the fact that eyes and mouth are reasonably coplanar. In fact, Tsai [33] already mentions that focal length from one plane is possible, but incorrectly says that aspect ratio is not.) The solution to (2.26) is well known as the eigenvector of $V^TV$ associated with the smallest eigenvalue (equivalently, the right singular vector of $V$ associated with the smallest singular value).

Once $b$ is estimated, we can compute all camera intrinsic parameters as follows. The matrix $B$, as described in Sect. 2.4.4, is estimated up to a scale factor, i.e., $B = \lambda A^{-T}A^{-1}$ with $\lambda$ an arbitrary scale. Without difficulty, we can uniquely extract the intrinsic parameters from matrix $B$:

$$
\begin{aligned}
v_0 &= (B_{12}B_{13} - B_{11}B_{23})/(B_{11}B_{22} - B_{12}^2) \\
\lambda &= B_{33} - [B_{13}^2 + v_0(B_{12}B_{13} - B_{11}B_{23})]/B_{11} \\
\alpha &= \sqrt{\lambda/B_{11}} \\
\beta &= \sqrt{\lambda B_{11}/(B_{11}B_{22} - B_{12}^2)} \\
\gamma &= -B_{12}\alpha^2\beta/\lambda \\
u_0 &= \gamma v_0/\beta - B_{13}\alpha^2/\lambda \,.
\end{aligned}
$$

Once $A$ is known, the extrinsic parameters for each image are readily computed. From (2.18), we have

$$ r_1 = \lambda A^{-1}h_1\,, \quad r_2 = \lambda A^{-1}h_2\,, \quad r_3 = r_1 \times r_2\,, \quad t = \lambda A^{-1}h_3 $$

with $\lambda = 1/\|A^{-1}h_1\| = 1/\|A^{-1}h_2\|$. Of course, because of noise in the data, the so-computed matrix $R = [r_1, r_2, r_3]$ does not in general satisfy the properties of a rotation matrix. The best rotation matrix can then be obtained through, for example, singular value decomposition [13, 41].
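The closed-form solution is straightforward to prototype. The sketch below is mine, not the chapter's; it assumes the plane homographies have already been estimated (e.g., with the method in the Appendix) and are reasonably accurate. It stacks the constraints (2.25) into V, solves Vb = 0 by SVD, and applies the extraction formulas above.

```python
import numpy as np

def v_ij(H, i, j):
    """Row vector v_ij of Eq. (2.24); H is indexed by columns (0-based)."""
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[1]*hj[1],
                     hi[2]*hj[0] + hi[0]*hj[2],
                     hi[2]*hj[1] + hi[1]*hj[2],
                     hi[2]*hj[2]])

def intrinsics_from_homographies(Hs):
    """Closed-form intrinsics from n >= 3 plane homographies (Eqs. 2.25-2.26)."""
    V = []
    for H in Hs:
        V.append(v_ij(H, 0, 1))                      # constraint (2.19)
        V.append(v_ij(H, 0, 0) - v_ij(H, 1, 1))      # constraint (2.20)
    _, _, Vt = np.linalg.svd(np.asarray(V))
    b = Vt[-1]
    if b[0] < 0:                                     # fix the overall sign so B11 > 0
        b = -b
    B11, B12, B22, B13, B23, B33 = b

    v0 = (B12*B13 - B11*B23) / (B11*B22 - B12**2)
    lam = B33 - (B13**2 + v0*(B12*B13 - B11*B23)) / B11
    alpha = np.sqrt(lam / B11)
    beta = np.sqrt(lam * B11 / (B11*B22 - B12**2))
    gamma = -B12 * alpha**2 * beta / lam
    u0 = gamma * v0 / beta - B13 * alpha**2 / lam
    return np.array([[alpha, gamma, u0],
                     [0.0,   beta,  v0],
                     [0.0,   0.0,   1.0]])
```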
2.4.5 Maximum likelihood estimation

The above solution is obtained through minimizing an algebraic distance which is not physically meaningful. We can refine it through maximum likelihood inference.

We are given $n$ images of a model plane and there are $m$ points on the model plane. Assume that the image points are corrupted by independent and identically distributed noise. The maximum likelihood estimate can be obtained by minimizing the following functional:

$$ \sum_{i=1}^{n}\sum_{j=1}^{m} \|m_{ij} - \hat{m}(A, R_i, t_i, M_j)\|^2 \,, \tag{2.27} $$

where $\hat{m}(A, R_i, t_i, M_j)$ is the projection of point $M_j$ in image $i$, according to equation (2.18). A rotation $R$ is parameterized by a vector of 3 parameters, denoted by $r$, which is parallel to the rotation axis and whose magnitude is equal to the rotation angle. $R$ and $r$ are related by the Rodrigues formula [8]. Minimizing (2.27) is a nonlinear minimization problem, which is solved with the Levenberg-Marquardt algorithm as implemented in Minpack [23]. It requires an initial guess of $A$, $\{R_i, t_i\,|\,i = 1..n\}$, which can be obtained using the technique described in the previous subsection.

Desktop cameras usually have visible lens distortion, especially the radial components. We have included these while minimizing (2.27). See my technical report [41] for more details.
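As a sketch of this refinement step (again mine, not the chapter's code), the functional (2.27) can be minimized with SciPy's Levenberg-Marquardt wrapper around MINPACK. The parameter vector stacks the five intrinsics and one rotation vector plus translation per image; the variable names and data layout are my own assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, model_pts, image_pts):
    """Reprojection residuals for Eq. (2.27).
    params: [alpha, beta, gamma, u0, v0, r_1 (3), t_1 (3), ..., r_n, t_n]
    model_pts: (m, 3) planar points with Z = 0
    image_pts: list of n arrays of shape (m, 2)"""
    alpha, beta, gamma, u0, v0 = params[:5]
    A = np.array([[alpha, gamma, u0], [0, beta, v0], [0, 0, 1]])
    res = []
    for i, m_obs in enumerate(image_pts):
        r = params[5 + 6*i : 5 + 6*i + 3]
        t = params[5 + 6*i + 3 : 5 + 6*i + 6]
        R = Rotation.from_rotvec(r).as_matrix()      # Rodrigues formula
        uvw = A @ (R @ model_pts.T + t[:, None])     # pinhole projection
        proj = (uvw[:2] / uvw[2]).T
        res.append((proj - m_obs).ravel())
    return np.concatenate(res)

# x0 stacks the closed-form estimates; method='lm' uses MINPACK's Levenberg-Marquardt.
# result = least_squares(residuals, x0, args=(model_pts, image_pts), method='lm')
```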
2.4.6 Dealing with radial distortion

Up to now, we have not considered the lens distortion of the camera. However, a desktop camera usually exhibits significant lens distortion, especially radial distortion. The reader is referred to Section 2.3.5 for distortion modeling. In this section, we only consider the first two terms of radial distortion.

Estimating Radial Distortion by Alternation. As the radial distortion is expected to be small, one would expect to estimate the other five intrinsic parameters, using the technique described in Sect. 2.4.5, reasonably well by simply ignoring distortion. One strategy is then to estimate $k_1$ and $k_2$ after having estimated the other parameters, which give us the ideal pixel coordinates $(u, v)$. Then, from (2.15) and (2.16), we have two equations for each point in each image:

$$ \begin{bmatrix} (u - u_0)(x^2 + y^2) & (u - u_0)(x^2 + y^2)^2 \\ (v - v_0)(x^2 + y^2) & (v - v_0)(x^2 + y^2)^2 \end{bmatrix} \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} = \begin{bmatrix} \breve{u} - u \\ \breve{v} - v \end{bmatrix} . $$

Given $m$ points in $n$ images, we can stack all equations together to obtain in total $2mn$ equations, or in matrix form $Dk = d$, where $k = [k_1, k_2]^T$. The linear least-squares solution is given by

$$ k = (D^TD)^{-1}D^Td \,. \tag{2.28} $$

Once $k_1$ and $k_2$ are estimated, one can refine the estimate of the other parameters by solving (2.27) with $\hat{m}(A, R_i, t_i, M_j)$ replaced by (2.15) and (2.16). We can alternate these two procedures until convergence.

Complete Maximum Likelihood Estimation. Experimentally, we found that the convergence of the above alternation technique is slow. A natural extension to (2.27) is then to estimate the complete set of parameters by minimizing the following functional:

$$ \sum_{i=1}^{n}\sum_{j=1}^{m} \|m_{ij} - \breve{m}(A, k_1, k_2, R_i, t_i, M_j)\|^2 \,, \tag{2.29} $$

where $\breve{m}(A, k_1, k_2, R_i, t_i, M_j)$ is the projection of point $M_j$ in image $i$ according to equation (2.18), followed by distortion according to (2.15) and (2.16). This is a nonlinear minimization problem, which is solved with the Levenberg-Marquardt algorithm as implemented in Minpack [23]. A rotation is again parameterized by a 3-vector $r$, as in Sect. 2.4.5. An initial guess of $A$ and $\{R_i, t_i\,|\,i = 1..n\}$ can be obtained using the technique described in Sect. 2.4.4 or in Sect. 2.4.5. An initial guess of $k_1$ and $k_2$ can be obtained with the technique described in the last paragraph, or simply by setting them to 0.
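The linear step (2.28) is a two-unknown least-squares problem; here is a minimal sketch of it (mine, with assumed argument names), using a numerically preferable least-squares solve instead of forming the normal equations explicitly.

```python
import numpy as np

def estimate_radial_distortion(ideal_uv, observed_uv, ideal_xy, u0, v0):
    """Linear least-squares estimate of (k1, k2) as in Eq. (2.28).
    ideal_uv:    (N, 2) ideal (distortion-free) pixel coordinates (u, v)
    observed_uv: (N, 2) observed (distorted) pixel coordinates
    ideal_xy:    (N, 2) ideal normalized image coordinates (x, y)"""
    u, v = ideal_uv[:, 0], ideal_uv[:, 1]
    r2 = np.sum(ideal_xy**2, axis=1)              # x^2 + y^2
    D = np.zeros((2 * len(u), 2))
    D[0::2, 0], D[0::2, 1] = (u - u0) * r2, (u - u0) * r2**2
    D[1::2, 0], D[1::2, 1] = (v - v0) * r2, (v - v0) * r2**2
    d = (observed_uv - ideal_uv).ravel()          # [u_obs - u, v_obs - v, ...]
    k, *_ = np.linalg.lstsq(D, d, rcond=None)     # solves D k = d in the LS sense
    return k                                      # (k1, k2)
```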
2.4.7 Summary

The recommended calibration procedure is as follows:

1. Print a pattern and attach it to a planar surface;
2. Take a few images of the model plane under different orientations by moving either the plane or the camera;
3. Detect the feature points in the images;
4. Estimate the five intrinsic parameters and all the extrinsic parameters using the closed-form solution as described in Sect. 2.4.4;
5. Estimate the coefficients of the radial distortion by solving the linear least-squares problem (2.28);
6. Refine all parameters, including the lens distortion parameters, by minimizing (2.29).

There is a degenerate configuration in my technique when planes are parallel to each other. See my technical report [41] for a more detailed description.

In summary, this technique only requires the camera to observe a planar pattern from a few different orientations. Although the minimum number of orientations is two if pixels are square, we recommend 4 or 5 different orientations for better quality. We can move either the camera or the planar pattern. The motion does not need to be known, but should not be a pure translation. When the number of orientations is only 2, one should avoid positioning the planar pattern parallel to the image plane. The pattern could be anything, as long as we know the metric on the plane. For example, we can print a pattern with a laser printer and attach the paper to a reasonably planar surface such as a hard book cover. We can even use a book with known size, because the four corners are enough to estimate the plane homographies.
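For readers who simply want to run a plane-based calibration of this general kind, OpenCV provides an implementation of the same overall pipeline (corner detection, closed-form initialization, nonlinear refinement with distortion). The sketch below is not the author's code; the pattern size, square size, and file names are placeholders to adapt to your own data.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                       # interior corners per row/column (placeholder)
square = 25.0                          # square size in your chosen unit, e.g. mm

# Model points on the plane Z = 0, as in Section 2.4.1.
obj = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points, size = [], [], None
for path in glob.glob("calib_*.png"):  # placeholder file names
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_COUNT, 30, 1e-3))
        obj_points.append(obj)
        img_points.append(corners)

rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, size, None, None)
print("RMS reprojection error:", rms)
print("Intrinsic matrix A:\n", A)
print("Distortion coefficients (k1, k2, p1, p2, k3):", dist.ravel())
```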
2.4.8 Experimental Results

The proposed algorithm has been tested on both computer simulated data and real data. The closed-form solution involves finding a singular value decomposition of a small 2n × 6 matrix, where n is the number of images. The nonlinear refinement within the Levenberg-Marquardt algorithm takes 3 to 5 iterations to converge. Due to space limitation, we describe in this section one set of experiments with real data when the calibration pattern is at different distances from the camera. The reader is referred to [41] for more experimental results with both computer simulated and real data, and to the following Web page:
http://research.microsoft.com/~zhang/Calib/
for some experimental data and the software.

The example is shown in Fig. 2.6. The camera to be calibrated is an off-the-shelf PULNiX CCD camera with a 6 mm lens. The image resolution is 640×480. As can be seen in Fig. 2.6, the model plane contains a pattern of 9 × 9 squares with 9 special dots, which are used to automatically identify the correspondence between reference points on the model plane and the square corners in the images. It was printed on A4 paper with a 600 DPI laser printer and attached to a cardboard.

In total 10 images of the plane were taken (6 of them are shown in Fig. 2.6). Five of them (called Set A) were taken at close range, while the other five (called Set B) were taken at a larger distance. We applied our calibration algorithm to Set A, Set B, and also to the whole set (called Set A+B). The results are shown in Table 2.1. For intuitive understanding, we show the estimated angle between the image axes, ϑ, instead of the skew factor γ. We can see that the angle ϑ is very close to 90°, as expected with almost all modern CCD cameras.

[Figure 2.6. Two sets of images taken at different distances to the calibration pattern. Each set contains five images. On the left, three images from the set taken at a close distance are shown. On the right, three images from the set taken at a larger distance are shown.]

Table 2.1. Calibration results with the images shown in Figure 2.6

  image set   alpha     beta      theta     u0       v0       k1        k2
  A           834.01    839.86    89.95°    305.51   240.09   -0.2235   0.3761
  B           836.17    841.08    89.92°    301.76   241.51   -0.2676   1.3121
  A+B         834.64    840.32    89.94°    304.77   240.59   -0.2214   0.3643

The camera parameters were estimated consistently for all three sets of images, except for the distortion parameters with Set B. The reason is that the calibration pattern only occupies the central part of the image in Set B, where lens distortion is not significant and therefore cannot be estimated reliably.

2.4.9 Related Work

Almost at the same time, Sturm and Maybank [31], independently from us, developed the same technique. They assumed the pixels are square (i.e., γ = 0) and studied the degenerate configurations for plane-based camera calibration. Gurdjos et al. [14] have re-derived the plane-based calibration technique from the center line constraint.

My original implementation (only the executable) is available at
http://research.microsoft.com/~zhang/calib/.
Bouguet has re-implemented my technique in Matlab, which is available at
http://www.vision.caltech.edu/bouguetj/calib_doc/.

In many applications such as stereo, multiple cameras need to be calibrated simultaneously in order to determine the relative geometry between the cameras. In 2000, I extended (not published) this plane-based technique to stereo calibration for my stereo-based gaze-correction project [40, 39]. The formulation is similar to (2.29). Consider two cameras, and denote the quantities related to the second camera by a prime ($'$). Let $(R_s, t_s)$ be the rigid transformation between the two cameras such that $(R', t') = (R, t) \circ (R_s, t_s)$, or more precisely: $R' = RR_s$ and $t' = Rt_s + t$. Stereo calibration is then to solve for $A$, $A'$, $k_1$, $k_2$, $k'_1$, $k'_2$, $\{(R_i, t_i)\,|\,i = 1, \ldots, n\}$, and $(R_s, t_s)$ by minimizing the following functional:

$$ \sum_{i=1}^{n}\sum_{j=1}^{m} \Big( \delta_{ij}\,\|m_{ij} - \breve{m}(A, k_1, k_2, R_i, t_i, M_j)\|^2 + \delta'_{ij}\,\|m'_{ij} - \breve{m}(A', k'_1, k'_2, R'_i, t'_i, M_j)\|^2 \Big) \tag{2.30} $$

subject to

$$ R'_i = R_iR_s \quad \text{and} \quad t'_i = R_it_s + t_i \,. $$

In the above formulation, $\delta_{ij} = 1$ if point $j$ is visible in the first camera in image $i$, and $\delta_{ij} = 0$ otherwise; similarly, $\delta'_{ij} = 1$ if point $j$ is visible in the second camera. This formulation thus does not require the same number of feature points to be visible over time or across cameras. Another advantage of this formulation is that the number of extrinsic parameters to be estimated is reduced from $12n$, if the two cameras are calibrated independently, to $6n + 6$. This is a reduction of 24 dimensions in parameter space if 5 planes are used.

Obviously, this is a nonlinear optimization problem. To obtain the initial guess, we first run single-camera calibration independently for each camera, and compute $R_s$ through SVD from $R'_i = R_iR_s$ $(i = 1, \ldots, n)$ and $t_s$ through least-squares from $t'_i = R_it_s + t_i$ $(i = 1, \ldots, n)$. Recently, a closed-form initialization technique through factorization of homography matrices was proposed in [34].

2.5 Solving Camera Calibration with 1D Objects

In this section, we describe in detail how to solve the camera calibration problem from a number of observations of a 1D object consisting of 3 collinear points moving around one of them [43, 44]. We only consider this minimal
2.5 Solving Camera Calibration With 1D Objects

In this section, we describe in detail how to solve the camera calibration problem from a number of observations of a 1D object consisting of 3 collinear points moving around one of them [43, 44]. We consider only this minimal configuration, but it is straightforward to extend the result if a calibration object has four or more collinear points.

2.5.1 Setups With Free-Moving 1D Calibration Objects

We now examine possible setups with 1D objects for camera calibration. As already mentioned in the introduction, we need several observations of the 1D objects. Without loss of generality, we choose the camera coordinate system to define the 1D objects; therefore, R = I and t = 0 in (2.1).

Two points with known distance. This could be the two endpoints of a stick, and we take a number of images while freely waving the stick. Let A and B be the two 3D points, and a and b the observed image points. Because the distance between A and B is known, we only need 5 parameters to define A and B. For example, we need 3 parameters to specify the coordinates of A in the camera coordinate system, and 2 parameters to define the orientation of the line AB. On the other hand, each image point provides two equations according to (2.1), giving 4 equations in total. Given N observations of the stick, we have 5 intrinsic parameters and 5N parameters for the point positions to estimate, i.e., the total number of unknowns is 5 + 5N. However, we only have 4N equations. Camera calibration is thus impossible.

Three collinear points with known distances. By adding an additional point, say C, the number of unknowns for the point positions remains the same, i.e., 5 + 5N, because the distances of C to A and B are known. For each observation, we have three image points, yielding 6N equations in total. Calibration seems plausible, but is in fact not. This is because the three image points for each observation must be collinear, and collinearity is preserved by perspective projection. We therefore only have 5 independent equations for each observation. The total number of independent equations, 5N, is always smaller than the number of unknowns. Camera calibration is still impossible.

Four or more collinear points with known distances. As seen above, when the number of points increases from two to three, the number of independent equations (constraints) increases by one for each observation. If we have a fourth point, will we have 6N independent equations in total? If so, we would be able to solve the problem, because the number of unknowns remains the same, i.e., 5 + 5N, and we would have more than enough constraints if N ≥ 5. The reality is that the addition of a fourth point, or of even more points, does not increase the number of independent equations: it is always 5N for any four or more collinear points.
This is because the cross ratio is preserved under perspective projection. With known cross ratios and three collinear points, whether they are in space or in images, the other points are determined exactly.

2.5.2 Setups With 1D Calibration Objects Moving Around a fixed Point

From the above discussion, calibration is impossible with a freely moving 1D calibration object, no matter how many points are on the object. Now let us examine what happens if one point is fixed. In the sequel, without loss of generality, point A is the fixed point, and a is the corresponding image point. We need 3 parameters, which are unknown, to specify the coordinates of A in the camera coordinate system, while image point a provides two scalar equations according to (2.1).

Two points with known distance. They could be the endpoints of a stick, and we move the stick around the endpoint that is fixed. Let B be the free endpoint and b its corresponding image point. For each observation, we need 2 parameters to define the orientation of the line AB and therefore the position of B, because the distance between A and B is known. Given N observations of the stick, we have 5 intrinsic parameters, 3 parameters for A and 2N parameters for the free endpoint positions to estimate, i.e., the total number of unknowns is 8 + 2N. However, each observation of b provides two equations, so together with a we only have 2 + 2N equations in total. Camera calibration is thus impossible.

Three collinear points with known distances. As already explained in the last subsection, adding an additional point, say C, leaves the number of unknowns for the point positions unchanged, i.e., 8 + 2N. For each observation, b provides two equations, but c only provides one additional equation because of the collinearity of a, b and c. Thus, the total number of equations is 2 + 3N for N observations. By counting, we see that with 6 or more observations we should be able to solve camera calibration, and this is indeed the case, as we shall show in the next section.

Four or more collinear points with known distances. Again, as already explained in the last subsection, the number of unknowns and the number of independent equations remain the same because of the invariance of cross-ratios. This said, the more collinear points we have, the more accurate camera calibration will be in practice, because the data redundancy can combat the noise in the image data.
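As a quick sanity check on the counting arguments of the last two subsections, the small sketch below (added here purely for illustration) tabulates unknowns versus available independent equations for the free-moving and fixed-point setups as the number of observations N grows.

```python
def count_free_moving(N):
    # Free-moving stick with three (or more) known collinear points:
    # 5 intrinsics plus 5 parameters per observation vs. 5 independent
    # equations per observation (4 per observation with only two points).
    return 5 + 5 * N, 5 * N

def count_fixed_point(N):
    # Stick rotating about a fixed point A, three collinear points observed:
    # 5 intrinsics + 3 for A + 2 per observation vs. 2 + 3N equations.
    return 8 + 2 * N, 2 + 3 * N

for N in (5, 6, 10):
    print(N, count_free_moving(N), count_fixed_point(N))
# Free-moving: equations never catch up with unknowns (calibration impossible).
# Fixed point: equations exceed unknowns once N >= 6, so calibration is feasible.
```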
2.5.3 Basic Equations

Figure 2.7. Illustration of 1D calibration objects.

Refer to Figure 2.7. Point A is the fixed point in space, and the stick AB moves around A. The length of the stick AB is known to be L, i.e.,

\[
\| B - A \| = L . \tag{2.31}
\]

The position of point C is also known with respect to A and B, and therefore

\[
C = \lambda_A A + \lambda_B B , \tag{2.32}
\]

where λA and λB are known. If C is the midpoint of AB, then λA = λB = 0.5. Points a, b and c on the image plane are the projections of the space points A, B and C, respectively. Without loss of generality, we choose the camera coordinate system to define the 1D objects; therefore, R = I and t = 0 in (2.1). Let the unknown depths of A, B and C be zA, zB and zC, respectively.
According to (2.1), we have

\[
A = z_A A^{-1} a \tag{2.33}
\]
\[
B = z_B A^{-1} b \tag{2.34}
\]
\[
C = z_C A^{-1} c . \tag{2.35}
\]

Substituting them into (2.32) yields

\[
z_C c = z_A \lambda_A a + z_B \lambda_B b \tag{2.36}
\]

after eliminating A^{-1} from both sides. Taking the cross product of both sides of the above equation with c, we have

\[
z_A \lambda_A (a \times c) + z_B \lambda_B (b \times c) = 0 .
\]

In turn, we obtain

\[
z_B = -z_A \frac{\lambda_A (a \times c) \cdot (b \times c)}{\lambda_B (b \times c) \cdot (b \times c)} . \tag{2.37}
\]

From (2.31), we have

\[
\| A^{-1} (z_B b - z_A a) \| = L .
\]

Substituting zB by (2.37) gives

\[
z_A \left\| A^{-1} \left( a + \frac{\lambda_A (a \times c) \cdot (b \times c)}{\lambda_B (b \times c) \cdot (b \times c)}\, b \right) \right\| = L .
\]

This is equivalent to

\[
z_A^2\, h^T A^{-T} A^{-1} h = L^2 \tag{2.38}
\]

with

\[
h = a + \frac{\lambda_A (a \times c) \cdot (b \times c)}{\lambda_B (b \times c) \cdot (b \times c)}\, b . \tag{2.39}
\]

Equation (2.38) contains the unknown intrinsic parameters A and the unknown depth, zA, of the fixed point A. It is the basic constraint for camera calibration with 1D objects. The vector h, given by (2.39), can be computed from the image points and the known λA and λB. Since the total number of unknowns is 6, we need at least six observations of the 1D object for calibration. Note that A^{-T} A^{-1} actually describes the image of the absolute conic [20].
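For concreteness, here is a small NumPy sketch (added for illustration; it is not from the original text) that evaluates the vector h of (2.39) from one observation, given the three homogeneous image points a, b, c and the known weights λA, λB.

```python
import numpy as np

def compute_h(a, b, c, lam_A, lam_B):
    """Vector h of Eq. (2.39) from one observation of the 1D object.

    a, b, c : length-3 arrays [u, v, 1], projections of A, B and C.
    lam_A, lam_B : known weights with C = lam_A * A + lam_B * B.
    """
    axc = np.cross(a, c)
    bxc = np.cross(b, c)
    ratio = lam_A * np.dot(axc, bxc) / (lam_B * np.dot(bxc, bxc))
    return a + ratio * b
```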
2.5.4 Closed-Form Solution

Let

\[
B = A^{-T} A^{-1} \equiv
\begin{bmatrix} B_{11} & B_{12} & B_{13}\\ B_{12} & B_{22} & B_{23}\\ B_{13} & B_{23} & B_{33} \end{bmatrix} \tag{2.40}
\]
\[
= \begin{bmatrix}
\frac{1}{\alpha^2} & -\frac{\gamma}{\alpha^2\beta} & \frac{v_0\gamma - u_0\beta}{\alpha^2\beta}\\[4pt]
-\frac{\gamma}{\alpha^2\beta} & \frac{\gamma^2}{\alpha^2\beta^2} + \frac{1}{\beta^2} & -\frac{\gamma(v_0\gamma - u_0\beta)}{\alpha^2\beta^2} - \frac{v_0}{\beta^2}\\[4pt]
\frac{v_0\gamma - u_0\beta}{\alpha^2\beta} & -\frac{\gamma(v_0\gamma - u_0\beta)}{\alpha^2\beta^2} - \frac{v_0}{\beta^2} & \frac{(v_0\gamma - u_0\beta)^2}{\alpha^2\beta^2} + \frac{v_0^2}{\beta^2} + 1
\end{bmatrix} . \tag{2.41}
\]

Note that B is symmetric and can be defined by a 6D vector

\[
b = [B_{11}, B_{12}, B_{22}, B_{13}, B_{23}, B_{33}]^T . \tag{2.42}
\]

Let h = [h1, h2, h3]^T and x = z_A^2 b; then equation (2.38) becomes

\[
v^T x = L^2 \tag{2.43}
\]

with v = [h1², 2h1h2, h2², 2h1h3, 2h2h3, h3²]^T. When N images of the 1D object are observed, stacking N such equations (2.43) gives

\[
V x = L^2 \mathbf{1} , \tag{2.44}
\]

where V = [v1, . . . , vN]^T and 1 = [1, . . . , 1]^T. The least-squares solution is then given by

\[
x = L^2 (V^T V)^{-1} V^T \mathbf{1} . \tag{2.45}
\]

Once x is estimated, we can compute all the unknowns from x = z_A^2 b. Let x = [x1, x2, . . . , x6]^T. Without difficulty, we can uniquely extract the intrinsic parameters and the depth zA as

\[
\begin{aligned}
v_0 &= (x_2 x_4 - x_1 x_5)/(x_1 x_3 - x_2^2)\\
z_A &= \sqrt{x_6 - [x_4^2 + v_0(x_2 x_4 - x_1 x_5)]/x_1}\\
\alpha &= \sqrt{z_A^2 / x_1}\\
\beta &= \sqrt{z_A^2 x_1 / (x_1 x_3 - x_2^2)}\\
\gamma &= -x_2\, \alpha^2 \beta / z_A^2\\
u_0 &= \gamma v_0/\beta - x_4\, \alpha^2 / z_A^2 .
\end{aligned}
\]

At this point, we can compute zB according to (2.37), so points A and B can be computed from (2.33) and (2.34), while point C can be computed according to (2.32).
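The closed-form solution can be written compactly in code. The sketch below is an illustrative reimplementation under the chapter's notation (it is not the author's released software): it builds one row v per observation from the vector h of (2.39), solves the linear system (2.44) in the least-squares sense, and extracts the intrinsic parameters and the depth zA from x.

```python
import numpy as np

def calibrate_1d(obs, L, lam_A=0.5, lam_B=0.5):
    """Closed-form 1D calibration following Eqs. (2.38)-(2.45).

    obs : list of (a, b, c) homogeneous image points [u, v, 1] of the fixed
          point A, the free endpoint B and the known intermediate point C.
    L   : known stick length ||B - A||.
    Returns (K, zA), where K is the 3x3 intrinsic matrix A.
    """
    V = []
    for a, b, c in obs:
        axc, bxc = np.cross(a, c), np.cross(b, c)
        h = a + (lam_A * np.dot(axc, bxc) / (lam_B * np.dot(bxc, bxc))) * b
        h1, h2, h3 = h
        V.append([h1*h1, 2*h1*h2, h2*h2, 2*h1*h3, 2*h2*h3, h3*h3])
    V = np.asarray(V)
    # Least-squares solution of V x = L^2 * 1  (Eq. 2.45).
    x, *_ = np.linalg.lstsq(V, L**2 * np.ones(len(V)), rcond=None)

    x1, x2, x3, x4, x5, x6 = x
    v0 = (x2*x4 - x1*x5) / (x1*x3 - x2*x2)
    zA = np.sqrt(x6 - (x4*x4 + v0*(x2*x4 - x1*x5)) / x1)
    alpha = np.sqrt(zA*zA / x1)
    beta = np.sqrt(zA*zA * x1 / (x1*x3 - x2*x2))
    gamma = -x2 * alpha*alpha * beta / (zA*zA)
    u0 = gamma*v0/beta - x4 * alpha*alpha / (zA*zA)
    K = np.array([[alpha, gamma, u0], [0.0, beta, v0], [0.0, 0.0, 1.0]])
    return K, zA
```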
2.5.5 Nonlinear Optimization

The above solution is obtained by minimizing an algebraic distance which is not physically meaningful. We can refine it through maximum likelihood inference.

We are given N images of the 1D calibration object, and there are 3 points on the object. Point A is fixed, and points B and C move around A. Assume that the image points are corrupted by independent and identically distributed noise. The maximum likelihood estimate can then be obtained by minimizing the following functional:

\[
\sum_{i=1}^{N} \Big( \| a_i - \phi(A, A) \|^2 + \| b_i - \phi(A, B_i) \|^2 + \| c_i - \phi(A, C_i) \|^2 \Big) , \tag{2.46}
\]

where φ(A, M) (M ∈ {A, Bi, Ci}) is the projection of point M onto the image, according to equations (2.33) to (2.35). More precisely, φ(A, M) = (1/zM) A M, where zM is the z-component of M. The unknowns to be estimated are:
– the 5 camera intrinsic parameters α, β, γ, u0 and v0 that define matrix A;
– the 3 parameters for the coordinates of the fixed point A;
– 2N additional parameters to define points Bi and Ci at each instant (see below for more details).
Therefore, we have in total 8 + 2N unknowns. Regarding the parameterization of B and C, we use the spherical coordinates φ and θ to define the direction of the 1D calibration object, and point B is then given by

\[
B = A + L \begin{bmatrix} \sin\theta\cos\phi \\ \sin\theta\sin\phi \\ \cos\theta \end{bmatrix} ,
\]

where L is the known distance between A and B. In turn, point C is computed according to (2.32). We therefore only need 2 additional parameters for each observation.

Minimizing (2.46) is a nonlinear minimization problem, which is solved with the Levenberg-Marquardt algorithm as implemented in Minpack [23]. It requires an initial guess of A, A, {Bi, Ci | i = 1..N}, which can be obtained using the technique described in the last subsection.
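A compact way to set up this refinement is sketched below. It is only illustrative: the chapter's implementation uses Minpack, whereas here scipy.optimize.least_squares plays the same Levenberg-Marquardt role, and the routine assumes that the closed-form solution above supplies the initial intrinsics K0, the fixed point A0 and per-view angles (θi, φi) derived from the reconstructed Bi.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_1d(obs, L, K0, A0, angles0, lam_A=0.5, lam_B=0.5):
    """ML refinement of Eq. (2.46) with a Levenberg-Marquardt-type solver.

    obs     : list of (a_i, b_i, c_i) observed pixel points, each length-2.
    K0, A0  : initial intrinsic matrix (3x3) and fixed point A (3-vector).
    angles0 : initial (theta_i, phi_i) per observation, shape (N, 2).
    """
    def project(K, M):                      # phi(A, M) = (1/z_M) * A * M
        p = K @ M
        return p[:2] / p[2]

    def residuals(p):
        alpha, beta, gamma, u0, v0 = p[:5]
        K = np.array([[alpha, gamma, u0], [0.0, beta, v0], [0.0, 0.0, 1.0]])
        A = p[5:8]
        res = []
        for (ai, bi, ci), (th, ph) in zip(obs, p[8:].reshape(-1, 2)):
            B = A + L * np.array([np.sin(th) * np.cos(ph),
                                  np.sin(th) * np.sin(ph),
                                  np.cos(th)])
            C = lam_A * A + lam_B * B
            res += [ai - project(K, A), bi - project(K, B), ci - project(K, C)]
        return np.concatenate(res)

    p0 = np.concatenate([[K0[0, 0], K0[1, 1], K0[0, 1], K0[0, 2], K0[1, 2]],
                         A0, np.ravel(angles0)])
    sol = least_squares(residuals, p0, method="lm")
    return sol.x
```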
2.5.6 Estimating the fixed point

In the above discussion, we assumed that the image coordinates, a, of the fixed point A are known. We now describe how to estimate a, depending on whether the fixed point A is visible in the image or not.

Invisible fixed point. The fixed point does not need to be visible in the image, and the camera calibration technique becomes more versatile without the visibility requirement. In that case, we can for example hang a string of small balls from the ceiling and calibrate multiple cameras in the room by swinging the string. The fixed point can then be estimated by intersecting lines from different images, as described below.

Each observation of the 1D object defines an image line. An image line can be represented by a 3D vector l = [l1, l2, l3]^T, defined up to a scale factor, such that a point m = [u, v]^T on the line satisfies l^T m̃ = 0 with m̃ = [u, v, 1]^T. In the sequel, we also use (n, q) to denote the line l, where n = [l1, l2]^T and q = l3. To remove the scale ambiguity, we normalize l such that ‖l‖ = 1. Furthermore, each l is associated with an uncertainty measure represented by a 3 × 3 covariance matrix Λ. Given N images of the 1D object, we have N lines: {(li, Λi) | i = 1, . . . , N}. Let the fixed point be a in the image. Obviously, if there is no noise, we have l_i^T ã = 0, or n_i^T a + q_i = 0. Therefore, we can estimate a by minimizing

\[
F = \sum_{i=1}^{N} w_i\, (l_i^T \tilde a)^2 = \sum_{i=1}^{N} w_i\, (n_i^T a + q_i)^2 = \sum_{i=1}^{N} w_i\, (a^T n_i n_i^T a + 2 q_i n_i^T a + q_i^2) , \tag{2.47}
\]

where wi is a weighting factor (see below). Setting the derivative of F with respect to a to 0 yields the solution

\[
a = -\Big( \sum_{i=1}^{N} w_i\, n_i n_i^T \Big)^{-1} \sum_{i=1}^{N} w_i\, q_i n_i .
\]

The optimal weighting factor wi in (2.47) is the inverse of the variance of l_i^T ã, which is wi = 1/(ã^T Λi ã). Note that the weight wi involves the unknown a. To overcome this difficulty, we can approximate wi by 1/trace(Λi) for the first iteration, and re-compute wi with the previously estimated a in the subsequent iterations. Usually two or three iterations are enough.

Visible fixed point. Since the fixed point is visible, we have N observations: {ai | i = 1, . . . , N}. We can therefore estimate a by minimizing Σ_{i=1}^{N} ‖a − ai‖², assuming that the image points are detected with the same accuracy. The solution is simply a = (Σ_{i=1}^{N} ai)/N.

The above estimation does not make use of the fact that the fixed point is also the intersection of the N observed lines of the 1D object. Therefore, a better technique to estimate a is to minimize the following function:

\[
F = \sum_{i=1}^{N} \Big[ (a - a_i)^T V_i^{-1} (a - a_i) + w_i\, (n_i^T a + q_i)^2 \Big] , \tag{2.48}
\]

where Vi is the covariance matrix of the detected point ai. The derivative of the above function with respect to a is given by

\[
\frac{\partial F}{\partial a} = 2 \sum_{i=1}^{N} \Big[ V_i^{-1} (a - a_i) + w_i\, n_i n_i^T a + w_i q_i n_i \Big] .
\]

Setting it to 0 yields

\[
a = \Big( \sum_{i=1}^{N} (V_i^{-1} + w_i n_i n_i^T) \Big)^{-1} \sum_{i=1}^{N} (V_i^{-1} a_i - w_i q_i n_i) .
\]

If more than three points are visible in each image, the known cross ratio provides an additional constraint for determining the fixed point. For an accessible description of uncertainty manipulation, the reader is referred to [45, Chapter 2].
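The invisible-fixed-point case reduces to a small weighted linear system. The following sketch (illustrative only, not the chapter's code) implements the two-step iteration described above: the lines li fitted to each observation are assumed to be given, the weights start at 1/trace(Λi), and a is re-solved with the weights recomputed from the current estimate.

```python
import numpy as np

def estimate_fixed_point(lines, covs, n_iter=3):
    """Estimate the image of an invisible fixed point from N image lines.

    lines : (N, 3) array of unit line vectors l_i = [l1, l2, l3].
    covs  : (N, 3, 3) covariance matrices Lambda_i of the lines.
    Minimizes F = sum_i w_i (l_i^T [a; 1])^2  as in Eq. (2.47).
    """
    n, q = lines[:, :2], lines[:, 2]
    w = 1.0 / np.array([np.trace(C) for C in covs])    # first-iteration weights
    for _ in range(n_iter):
        M = (w[:, None, None] * np.einsum('ij,ik->ijk', n, n)).sum(axis=0)
        rhs = -(w * q) @ n                              # -sum_i w_i q_i n_i
        a = np.linalg.solve(M, rhs)                     # current estimate of a
        a_h = np.append(a, 1.0)
        w = 1.0 / np.array([a_h @ C @ a_h for C in covs])   # w_i = 1/(a^T Λ_i a)
    return a
```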
  49. 49. Section 2.5. Solving Camera Calibration With 1D Objects 35 The above estimation does not make use of the fact that the fixed point is also the intersection of the N observed lines of the 1D object. Therefore, a better technique to estimate a is to minimize the following function: F = N i=1 (a−ai)T V−1 i (a−ai)+wi lT i a 2 = N i=1 (a−ai)T V−1 i (a−ai)+wi nT i a+qi 2 (2.48) where Vi is the covariance matrix of the detected point ai. The derivative of the above function with respect to a is given by ∂F ∂a = 2 N i=1 V−1 i (a − ai) + wininT i a + wiqini . Setting it to 0 yields a = N i=1 (V−1 i + wininT i ) −1 N i=1 (V−1 i ai − wiqini) . If more than three points are visible in each image, the known cross ratio provides an additional constraint in determining the fixed point. For an accessible description of uncertainty manipulation, the reader is referred to [45, Chapter 2]. 2.5.7 Experimental Results The proposed algorithm has been tested on both computer simulated data and real data. Computer Simulations The simulated camera has the following property: α = 1000, β = 1000, γ = 0, u0 = 320, and v0 = 240. The image resolution is 640×480. A stick of 70 cm is simulated with the fixed point A at [0, 35, 150]T . The other endpoint of the stick is B, and C is located at the half way between A and B. We have generated 100 random orientations of the stick by sampling θ in [π/6, 5π/6] and φ in [π, 2π] according to uniform distribution. Points A, B, and C are then projected onto the image. Gaussian noise with 0 mean and σ standard deviation is added to the projected image points a, b and c. The estimated camera parameters are compared with the ground truth, and we measure their relative errors with respect to the focal length α. Note that we measure the relative errors in
(u0, v0) with respect to α, as proposed by Triggs in [32]. He pointed out that the absolute errors in (u0, v0) are not geometrically meaningful, while computing the relative error is equivalent to measuring the angle between the true optical axis and the estimated one.

We vary the noise level from 0.1 pixels to 1 pixel. For each noise level, we perform 120 independent trials, and the results shown in Fig. 2.8 are the average. Figure 2.8a displays the relative errors of the closed-form solution, while Figure 2.8b displays those of the nonlinear minimization result. Errors increase almost linearly with the noise level. The nonlinear minimization refines the closed-form solution and produces significantly better results (with 50% less error). At the 1 pixel noise level, the errors for the closed-form solution are about 12%, while those for the nonlinear minimization are about 6%.

Real Data

For the experiment with real data, I used three toy beads from my kids and strung them together with a stick. The beads are approximately 14 cm apart (i.e., L = 28). I then moved the stick around while trying to fix one end with the aid of a book. A video of 150 frames was recorded, and four sample images are shown in Fig. 2.9. A bead in the image is modeled as a Gaussian blob in the RGB space, and the centroid of each detected blob is the image point we use for camera calibration. The proposed algorithm was applied to the 150 observations of the beads, and the estimated camera parameters are provided in Table 2.2. The first row is the estimation from the closed-form solution, while the second row is the refined result after nonlinear minimization. For the image skew parameter γ, we also provide the angle between the image axes in parentheses (it should be very close to 90°).

Table 2.2. Calibration results with real data

  Solution              α        β        γ                  u0       v0
  Closed-form           889.49   818.59   -0.1651 (90.01°)   297.47   234.33
  Nonlinear             838.49   799.36    4.1921 (89.72°)   286.74   219.89
  Plane-based           828.92   813.33   -0.0903 (90.01°)   305.23   235.17
  Relative difference   1.15%    1.69%    0.52% (0.29°)      2.23%    1.84%

For comparison, we also used the plane-based calibration technique described in [42] to calibrate the same camera. Five images of a planar pattern were taken, and one of them is shown in Fig. 2.10. The calibration result is shown in the third row of Table 2.2. The fourth row displays the relative difference between the plane-based result and the nonlinear solution with respect to the focal length (we use 828.92). As we can observe, the difference is about 2%.
Figure 2.8. Calibration errors with respect to the noise level of the image points: (a) closed-form solution; (b) nonlinear optimization.
Figure 2.9. Sample images of a 1D object used for camera calibration (Frames 10, 60, 90 and 140).

Figure 2.10. A sample image of the planar pattern used for camera calibration.
There are several sources contributing to this difference. Besides, obviously, the image noise and the imprecision of the extracted data points, one source is our current rudimentary experimental setup:
– The supposed-to-be fixed point was not fixed; it slipped around on the surface.
– The positioning of the beads was done with a ruler using eye inspection.
Considering all these factors, the proposed algorithm is very encouraging.

2.6 Self-Calibration

Self-calibration is also called auto-calibration. Techniques in this category do not require any particular calibration object. They can be considered as a 0D approach because only image point correspondences are required. Just by moving a camera in a static scene, the rigidity of the scene provides, in general, two constraints [22, 21, 20] on the camera's internal parameters from one camera displacement, using image information alone. The absolute conic, described in Section 2.2.2, is an essential concept in understanding these constraints. Therefore, if images are taken by the same camera with fixed internal parameters, correspondences between three images are sufficient to recover both the internal and external parameters, which allows us to reconstruct 3-D structure up to a similarity [20, 17]. Although no calibration objects are necessary, a large number of parameters need to be estimated, resulting in a much harder mathematical problem. We do not go further into the details of this approach because two recent books [15, 7] provide an excellent account of these techniques.

2.7 Conclusion

In this chapter, we have reviewed several camera calibration techniques. We have classified them into four categories, depending on whether they use a 3D apparatus, 2D objects (planes), 1D objects, or just the surrounding scene (self-calibration). Recommendations on choosing which technique to use were given in the introduction section.

The techniques described so far are mostly focused on single-camera calibration. We touched briefly on stereo calibration in Section 2.4.9.
Camera calibration is still an active research area because more and more applications use cameras. In [2], spheres are used to calibrate one or more cameras, which can be considered a 2D approach since only the surface property is used. In [5], a technique is described to calibrate a camera network consisting of an omni-camera and a number of perspective cameras. In [24], a technique is proposed to calibrate a projector-screen-camera system.

2.8 Appendix: Estimating Homography Between the Model Plane and its Image

There are many ways to estimate the homography between the model plane and its image. Here, we present a technique based on the maximum likelihood criterion. Let Mi and mi be the model and image points, respectively. Ideally, they should satisfy (2.18). In practice, they don't, because of noise in the extracted image points. Let's assume that mi is corrupted by Gaussian noise with mean 0 and covariance matrix Λ_mi. Then, the maximum likelihood estimation of H is obtained by minimizing the following functional:

\[
\sum_i (m_i - \hat m_i)^T \Lambda_{m_i}^{-1} (m_i - \hat m_i) , \qquad \text{where}\qquad
\hat m_i = \frac{1}{\bar h_3^T M_i} \begin{bmatrix} \bar h_1^T M_i \\ \bar h_2^T M_i \end{bmatrix}
\]

with h̄_i the i-th row of H. In practice, we simply assume Λ_mi = σ²I for all i. This is reasonable if the points are extracted independently with the same procedure. In this case, the above problem becomes a nonlinear least-squares one, i.e., min_H Σ_i ‖m_i − m̂_i‖². The nonlinear minimization is conducted with the Levenberg-Marquardt algorithm as implemented in Minpack [23]. This requires an initial guess, which can be obtained as follows.

Let x = [h̄_1^T, h̄_2^T, h̄_3^T]^T. Then equation (2.18) can be rewritten as

\[
\begin{bmatrix} M^T & 0^T & -u M^T \\ 0^T & M^T & -v M^T \end{bmatrix} x = 0 .
\]

When we are given n points, we have n such equations, which can be written in matrix form as L x = 0, where L is a 2n × 9 matrix. As x is defined up to a scale factor, the solution is well known to be the right singular vector of L associated with the smallest singular value (or, equivalently, the eigenvector of L^T L associated with the smallest eigenvalue).

In L, some elements are the constant 1, some are in pixels, some are in world coordinates, and some are multiplications of both. This makes L poorly conditioned numerically. Much better results can be obtained by performing a simple data normalization prior to running the above procedure.
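To make the initialization step concrete, here is a small NumPy sketch of the linear (DLT) estimate with the data normalization mentioned above. It is an illustrative implementation, not the chapter's released code, and it assumes that the model points Mi are given by their planar coordinates (X, Y).

```python
import numpy as np

def normalize(pts):
    """Similarity transform T so that T*pts has zero mean and unit average norm."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    return np.array([[scale, 0, -scale * mean[0]],
                     [0, scale, -scale * mean[1]],
                     [0, 0, 1]])

def estimate_homography(M, m):
    """Linear (DLT) estimate of H mapping model points M (N,2) to image points m (N,2)."""
    TM, Tm = normalize(M), normalize(m)
    Mh = (TM @ np.column_stack([M, np.ones(len(M))]).T).T   # normalized, homogeneous
    mh = (Tm @ np.column_stack([m, np.ones(len(m))]).T).T
    rows = []
    for (X, Y, W), (u, v, _) in zip(Mh, mh):
        rows.append([X, Y, W, 0, 0, 0, -u*X, -u*Y, -u*W])
        rows.append([0, 0, 0, X, Y, W, -v*X, -v*Y, -v*W])
    L = np.asarray(rows)
    # x is the right singular vector of L with the smallest singular value.
    _, _, Vt = np.linalg.svd(L)
    H_norm = Vt[-1].reshape(3, 3)
    H = np.linalg.inv(Tm) @ H_norm @ TM      # undo the normalization
    return H / H[2, 2]
```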
  55. 55. Bibliography 41 Bibliography [1] Y.I. Abdel-Aziz and H.M. Karara. Direct linear transformation into object space coordinates in close-range photogrammetry. In Proceedings of the Sym- posium on Close-Range Photogrammetry, University of Illinois at Urbana- Champaign, Urbana, Illinois, pages 1–18, January 1971. [2] M. Agrawal and L. Davis. Camera calibration using spheres: A semi-definite programming approach. In Proceedings of the 9th International Conference on Computer Vision, pages 782–789, Nice, France, October 2003. IEEE Computer Society Press. [3] D. C. Brown. Close-range camera calibration. Photogrammetric Engineering, 37(8):855–866, 1971. [4] B. Caprile and V. Torre. Using Vanishing Points for Camera Calibration. The International Journal of Computer Vision, 4(2):127–140, March 1990. [5] X. Chen, J. Yang, and A. Waibel. Calibration of a hybrid camera network. In Proceedings of the 9th International Conference on Computer Vision, pages 150–155, Nice, France, October 2003. IEEE Computer Society Press. [6] W. Faig. Calibration of close-range photogrammetry systems: Mathematical formulation. Photogrammetric Engineering and Remote Sensing, 41(12):1479– 1486, 1975. [7] O. Faugeras and Q.-T. Luong. The Geometry of Multiple Images. The MIT Press, 2001. With contributions from T. Papadopoulo. [8] O. Faugeras. Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press, 1993. [9] O. Faugeras, T. Luong, and S. Maybank. Camera self-calibration: theory and experiments. In G. Sandini, editor, Proc 2nd ECCV, volume 588 of Lecture Notes in Computer Science, pages 321–334, Santa Margherita Ligure, Italy, May 1992. Springer-Verlag. [10] O. Faugeras and G. Toscani. The calibration problem for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 15–20, Miami Beach, FL, June 1986. IEEE. [11] S. Ganapathy. Decomposition of transformation matrices for robot vision. Pat- tern Recognition Letters, 2:401–412, December 1984. [12] D. Gennery. Stereo-camera calibration. In Proceedings of the 10th Image Un- derstanding Workshop, pages 101–108, 1979. [13] G.H. Golub and C.F. van Loan. Matrix Computations. The John Hopkins University Press, Baltimore, Maryland, 3 edition, 1996. [14] P. Gurdjos, A. Crouzil, and R. Payrissat. Another way of looking at plane-based calibration: the centre circle constraint. In Proceedings of the 7th European Conference on Computer Vision, volume IV, pages 252–266, Copenhagen, May 2002. [15] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000. [16] R. Hartley. Self-calibration from multiple views with a rotating camera. In J- O. Eklundh, editor, Proceedings of the 3rd European Conference on Computer Vision, volume 800-801 of Lecture Notes in Computer Science, pages 471–478, Stockholm, Sweden, May 1994. Springer-Verlag.
[17] R. Hartley. An algorithm for self calibration from several views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 908–912, Seattle, WA, June 1994. IEEE.
[18] J. Heikkilä and O. Silvén. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1112, San Juan, Puerto Rico, June 1997. IEEE Computer Society.
[19] D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 482–488, Santa Barbara, California, June 1998. IEEE Computer Society.
[20] Q.-T. Luong and O.D. Faugeras. Self-calibration of a moving camera from point correspondences and fundamental matrices. The International Journal of Computer Vision, 22(3):261–289, 1997.
[21] Q.-T. Luong. Matrice Fondamentale et Calibration Visuelle sur l'Environnement – Vers une plus grande autonomie des systèmes robotiques. PhD thesis, Université de Paris-Sud, Centre d'Orsay, December 1992.
[22] S. J. Maybank and O. D. Faugeras. A theory of self-calibration of a moving camera. The International Journal of Computer Vision, 8(2):123–152, August 1992.
[23] J.J. Moré. The Levenberg-Marquardt algorithm, implementation and theory. In G. A. Watson, editor, Numerical Analysis, Lecture Notes in Mathematics 630. Springer-Verlag, 1977.
[24] T. Okatani and K. Deguchi. Autocalibration of projector-screen-camera system: Theory and algorithm for screen-to-camera homography estimation. In Proceedings of the 9th International Conference on Computer Vision, pages 774–781, Nice, France, October 2003. IEEE Computer Society Press.
[25] L. Robert. Camera calibration without feature extraction. Computer Vision, Graphics, and Image Processing, 63(2):314–325, March 1995. Also INRIA Technical Report 2204.
[26] J.G. Semple and G.T. Kneebone. Algebraic Projective Geometry. Oxford: Clarendon Press, 1952. Reprinted 1979.
[27] S.W. Shih, Y.P. Hung, and W.S. Lin. Accurate linear technique for camera calibration considering lens distortion by solving an eigenvalue problem. Optical Engineering, 32(1):138–149, 1993.
[28] I. Shimizu, Z. Zhang, S. Akamatsu, and K. Deguchi. Head pose determination from one image using a generic model. In Proceedings of the IEEE Third International Conference on Automatic Face and Gesture Recognition, pages 100–105, Nara, Japan, April 1998.
[29] C. C. Slama, editor. Manual of Photogrammetry. American Society of Photogrammetry, fourth edition, 1980.
[30] G. Stein. Accurate internal camera calibration using rotation, with analysis of sources of error. In Proc. Fifth International Conference on Computer Vision, pages 230–236, Cambridge, Massachusetts, June 1995.
  57. 57. Bibliography 43 [31] P. Sturm and S. Maybank. On plane-based camera calibration: A general algorithm, singularities, applications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 432–437, Fort Collins, Colorado, June 1999. IEEE Computer Society Press. [32] B. Triggs. Autocalibration from planar scenes. In Proceedings of the 5th Euro- pean Conference on Computer Vision, pages 89–105, Freiburg, Germany, June 1998. [33] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D ma- chine vision metrology using off-the-shelf tv cameras and lenses. IEEE Journal of Robotics and Automation, 3(4):323–344, August 1987. [34] T. Ueshiba and F. Tomita. Plane-based calibration algorithm for multi-camera systems via factorization of homography matrices. In Proceedings of the 9th International Conference on Computer Vision, pages 966–973, Nice, France, October 2003. IEEE Computer Society Press. [35] G.Q. Wei and S.D. Ma. A complete two-plane camera calibration method and experimental comparisons. In Proc. Fourth International Conference on Computer Vision, pages 439–446, Berlin, May 1993. [36] G.Q. Wei and S.D. Ma. Implicit and explicit camera calibration: Theory and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):469–480, 1994. [37] J. Weng, P. Cohen, and M. Herniou. Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):965–980, October 1992. [38] R. Willson. Modeling and Calibration of Automated Zoom Lenses. PhD thesis, Department of Electrical and Computer Engineering, Carnegie Mellon Univer- sity, 1994. [39] R. Yang and Z. Zhang. Eye gaze correction with stereovision for video tele- conferencing. In Proceedings of the 7th European Conference on Computer Vision, volume II, pages 479–494, Copenhagen, May 2002. Also available as Technical Report MSR-TR-01-119. [40] R. Yang and Z. Zhang. Model-based head pose tracking with stereovision. In Proc. Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FG2002), pages 255–260, Washington, DC, May 2002. Also avail- able as Technical Report MSR-TR-01-102. [41] Z. Zhang. A flexible new technique for camera calibration. Technical Report MSR-TR-98-71, Microsoft Research, December 1998. Available together with the software at http://research.microsoft.com/˜zhang/Calib/. [42] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000. [43] Z. Zhang. Camera calibration with one-dimensional objects. Technical Report MSR-TR-2001-120, Microsoft Research, December 2001. [44] Z. Zhang. Camera calibration with one-dimensional objects. In Proc. Euro- pean Conference on Computer Vision (ECCV’02), volume IV, pages 161–174, Copenhagen, Denmark, May 2002. [45] Z. Zhang and O.D. Faugeras. 3D Dynamic Scene Analysis: A Stereo Based Approach. Springer, Berlin, Heidelberg, 1992.
  58. 58. Chapter 3 MULTIPLE VIEW GEOMETRY Anders Heyden and Marc Pollefeys 3.1 Introduction There exist intricate geometric relations between multiple views of a 3D scene. These relations are related to the camera motion and calibration as well as to the scene structure. In this chapter we introduce these concepts and discuss how they can be applied to recover 3D models from images. In Section 3.2 a rather thorough description of projective geometry is given. Section 3.3 gives a short introduction to tensor calculus and Sec- tion 3.4 describes in detail the camera model used. In Section 3.5 a modern approach to multiple view geometry is presented and in Section 3.6 simple structure and motion algorithms are presented. In Section 3.7 more ad- vanced algorithms are presented that are suited for automatic processing on real image data. Section 3.8 discusses the possibility of calibrating the cam- era from images. Section 3.9 describes how the depth can be computed for most image pixels and Section 3.10 presents how the results of the previous sections can be combined to yield 3D models, render novel views or combine real and virtual elements in video. 45
3.2 Projective Geometry

Projective geometry is a fundamental tool for dealing with structure from motion problems in computer vision, especially in multiple view geometry. The main reason is that the image formation process can be regarded as a projective transformation from a 3-dimensional to a 2-dimensional projective space. It is also a fundamental tool for dealing with auto-calibration problems and for examining critical configurations and critical motion sequences.

This section deals with the fundamentals of projective geometry, including the definitions of projective spaces, homogeneous coordinates, duality, projective transformations and affine and Euclidean embeddings. For a traditional approach to projective geometry, see [9]; for more modern treatments, see [14], [15], [24].

3.2.1 The central perspective transformation

We will start the introduction of projective geometry by looking at a central perspective transformation, which is very natural from an image formation point of view, see Figure 3.1.

Figure 3.1. A central perspective transformation.
Definition 1. A central perspective transformation maps points, X, on the object plane, Πo, to points on the image plane Πi, by intersecting the line through E, called the centre, and X with Πi.

We can immediately see the following properties of the planar perspective transformation:
– All points on Πo map to points on Πi except for points on l, where l is defined as the intersection of Πo with the plane incident with E and parallel to Πi.
– All points on Πi are images of points on Πo except for points on h, called the horizon, where h is defined as the intersection of Πi with the plane incident with E and parallel to Πo.
– Lines in Πo are mapped to lines in Πi.
– The images of parallel lines intersect in a point on the horizon, see e.g. l1 and l2 in Figure 3.1.
– In the limit where the point E moves infinitely far away, the planar perspective transformation turns into a parallel projection.

Identify the planes Πo and Πi with R2, with a standard cartesian coordinate system Oe1e2 in Πo and Πi respectively. The central perspective transformation is nearly a bijective transformation between Πo and Πi, i.e. from R2 to R2. The problem is the lines l ⊂ Πo and h ⊂ Πi. If we remove these lines we obtain a bijective transformation between R2 \ l and R2 \ h, but this is not the path that we will follow. Instead, we extend each R2 with an extra line, defined as the images of points on h and the points that map to l, in the natural way, i.e. maintaining continuity. Thus, by adding one artificial line to each plane, it is possible to make the central perspective transformation bijective from (R2 + an extra line) to (R2 + an extra line). These extra lines correspond naturally to directions in the other plane, e.g. the images of the lines l1 and l2 intersect in a point on h corresponding to the direction of l1 and l2. The intersection point on h can be viewed as the limit of images of a point on l1 moving infinitely far away. Inspired by this observation we make the following definition:

Definition 2. Consider the set L of all lines parallel to a given line l in R2 and assign a point to each such set, pideal, called an ideal point or point at infinity, cf. Figure 3.2.
  61. 61. 48 Multiple View Geometry Chapter 3 pideal L e1 e2 Figure 3.2. The point at infinity corresponding to the set of lines L. Definition 3. The projective plane, P2, is defined according to P2 = R2 ∪ {ideal points} . Definition 4. The ideal line, l∞ or line at infinity in P2 is defined ac- cording to l∞ = {ideal points} . The following constructions could easily be made in P2: 1. Two different points define a line (called the join of the points) 2. Two different lines intersect in a point with obvious interpretations for ideal points and the ideal line, e.g. the line defined by an ordinary point and an ideal point is the line incident with the ordinary point with the direction given by the ideal point. Similarly we define Definition 5. The projective line, P1, is defined according to P1 = R1 ∪ {ideal point} .
  62. 62. Section 3.2. Projective Geometry 49 Observe that the projective line only contains one ideal point, which could be regarded as the point at infinity. In order to define three-dimensional projective space, P3, we start with R3 and assign an ideal point to each set of parallel lines, i.e. to each direction. Definition 6. The projective space, P3, is defined according to P3 = R3 ∪ {ideal points} . Observe that the ideal points in P3 constitutes a two-dimensional manifold, which motivates the following definition. Definition 7. The ideal points in P3 builds up a plane, called the ideal plane or plane at infinity, also containing ideal lines. The plane at infinity contains lines, again called lines at infinity. Every set of parallel planes in R3 defines an ideal line and all ideal lines builds up the ideal plane. A lot of geometrical constructions can be made in P3, e.g. 1. Two different points define a line (called the join of the two points) 2. Three different points define a plane (called the join of the three points) 3. Two different planes intersect in a line 4. Three different planes intersect in a point 3.2.3 Homogeneous coordinates It is often advantageous to introduce coordinates in the projective spaces, so called analytic projective geometry. Introduce a cartesian coordinate system, Oexey in R2 and define the line l : y = 1, see Figure 3.3. We make the following simple observations: Lemma 1. The vectors (p1, p2) and (q1, q2) determines the same line through the origin iff (p1, p2) = λ(q1, q2), λ = 0 . Proposition 1. Every line, lp = (p1, p2)t, t ∈ R, through the origin, except for the x-axis, intersect the line l in one point, p. We can now make the following definitions:
  63. 63. 50 Multiple View Geometry Chapter 3 lp : (p1.p2)t, t ∈ R ex ey p l : y = 1 Figure 3.3. Definition of homogeneous coordinates in P1 . Definition 8. The pairs of numbers (p1, p2) and (q1, q2) are said to be equiv- alent if (p1, p2) = λ(q1, q2), λ = 0 . We write (p1, p2) ∼ (q1, q2) . There is a one-to-one correspondence between lines through the origin and points on the line l if we add an extra point on the line, corresponding to the line x = 0, i.e. the direction (1, 0). By identifying the line l augmented with this extra point, corresponding to the point at infinity, p∞, with P1, we can make the following definitions: Definition 9. The one-dimensional projective space, P1, consists of pairs of numbers (p1, p2) (under the equivalence above), where (p1, p2) = (0, 0). The pair (p1, p2) is called homogeneous coordinates for the corre- sponding point in P1. Theorem 1. There is a natural division of P1 into two disjoint subsets P1 = {(p1, 1) ∈ P1 } ∪ {(p1, 0) ∈ P1 } , corresponding to ordinary points and the ideal point. The introduction of homogeneous coordinates can easily be generalized to P2 and P3 using three and four homogeneous coordinates respectively. In the case of P2 fix a cartesian coordinate system Oexeyez in R3 and define the plane Π : z = 1, see Figure 3.4. The vectors (p1, p2, p3) and (q1, q2, q3)
determine the same line through the origin iff (p1, p2, p3) = λ(q1, q2, q3), λ ≠ 0.

Figure 3.4. Definition of homogeneous coordinates in P2.

Every line through the origin, except for those in the x-y-plane, intersects the plane Π in one point. Again, there is a one-to-one correspondence between lines through the origin and points on the plane Π if we add an extra line, corresponding to the lines in the plane z = 0, i.e. the line at infinity, l∞, built up by lines of the form (p1, p2, 0). We can now identify the plane Π, augmented with this extra line corresponding to the points at infinity, l∞, with P2.

Definition 10. The triplets of numbers (p1, p2, p3) and (q1, q2, q3) are said to be equivalent iff

(p1, p2, p3) = λ(q1, q2, q3), λ ≠ 0 ,

written (p1, p2, p3) ∼ (q1, q2, q3).

Definition 11. The two-dimensional projective space P2 consists of all triplets of numbers (p1, p2, p3) ≠ (0, 0, 0). The triplet (p1, p2, p3) is called homogeneous coordinates for the corresponding point in P2.

Theorem 2. There is a natural division of P2 into two disjoint subsets

P2 = {(p1, p2, 1) ∈ P2} ∪ {(p1, p2, 0) ∈ P2} ,

corresponding to ordinary points and ideal points (or points at infinity).

The same procedure can be carried out to construct P3 (and even Pn for any n ∈ N), but it is harder to visualize, since we have to start with R4.
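As a small computational illustration of homogeneous coordinates in P2 (an added sketch, not from the original text): a finite point (x, y) is represented by any triplet λ(x, y, 1), ideal points have a zero last coordinate, and the join of two points as well as the intersection of two lines, discussed in Section 3.2.2, can both be computed as cross products of homogeneous coordinate vectors.

```python
import numpy as np

def to_homogeneous(x, y):
    """Homogeneous coordinates of the ordinary point (x, y) in P^2."""
    return np.array([x, y, 1.0])

def to_cartesian(p):
    """Back to R^2; points with p3 = 0 are ideal points (points at infinity)."""
    if np.isclose(p[2], 0.0):
        raise ValueError("ideal point: no Cartesian representative")
    return p[:2] / p[2]

# Join of two points and meet of two lines are both cross products in P^2.
def join(p, q):
    return np.cross(p, q)       # line through the points p and q

def meet(l, m):
    return np.cross(l, m)       # intersection point of the lines l and m

# Two parallel lines, e.g. y = x and y = x + 1, meet in an ideal point:
l1 = join(to_homogeneous(0, 0), to_homogeneous(1, 1))
l2 = join(to_homogeneous(0, 1), to_homogeneous(1, 2))
print(meet(l1, l2))   # last coordinate 0 -> ideal point, direction (1, 1) up to scale
```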
