19496: Individual Project
Final Report
I hereby declare that this work has not been submitted for any other degree/course at this
University or any other institution and that, except where reference is made to the work of other
authors, the material presented is original and entirely the result of my own work at the
University of Strathclyde under the supervision of Dr Paul Murray and Professor Stephen
Marshall.
3-DIMENSIONAL FACIAL MODEL CREATION USING
STEREOSCOPIC IMAGING TECHNIQUES FOR FACIAL
RECOGNITION
Fraser Macfarlane - MEng Electronic and Digital Systems
201301364
Supervisors: Dr Paul Murray and Professor Stephen Marshall
Second Assessor: Dr Lykourgos Petropoulakis
31/03/2017
Abstract
Multiple computer vision and image processing techniques were employed in both MATLAB and
the OpenCV C++ API to create virtual 3-dimensional representations of faces which could be used,
along with traditional 2-dimensional greyscale images, for facial recognition. The motivation of this
work was to investigate the potential to increase the effectiveness of law enforcement’s current
surveillance capabilities and subject recognition practices. Both a passive binocular stereo imaging
system and an active, infrared based, stereo imaging system were used to infer depth and create
the 3-dimensional representations of faces used in facial detection and subsequent recognition.
The captured faces of the participants were used to train multiple facial recognition algorithms, and
a subset of the faces was used to both validate and test the performance of each facial recognition
algorithm. The best results came from two algorithms, Principal Components Analysis and Local
Binary Patterns, which performed strongly on the data from both imaging systems, providing rapid
and accurate recognition over all of the captured data.
Acknowledgements
I would like to take this opportunity to thank those who supported and assisted me during my work
on this dissertation.
Firstly, to my friends and family who supported and encouraged me to continue through difficult
moments. To my friends and colleagues who volunteered to take part in my research providing
essential data without whom this work would not have been possible. To Professor Stephen
Marshall who has offered his support and wisdom throughout the course of this project. And finally,
to Dr Paul Murray who has again provided steadfast assistance and guidance throughout the year
in both professional and personal matters.
Table of Contents
Table of Figures.................................................................................................................................. 5
1. Introduction ............................................................................................................................... 8
2. Literature review........................................................................................................................ 9
3. Multi-camera stereo vision ...................................................................................................... 13
3.1. Camera models................................................................................................................. 13
3.2. Camera calibration........................................................................................................... 15
3.2.1. Camera parameters.................................................................................................. 15
3.2.2. Rotations and translations ....................................................................................... 16
3.2.3. Distortions................................................................................................................ 18
3.2.4. Chessboard calibration............................................................................................. 20
3.2.5. Homography............................................................................................................. 20
3.3. Image pair rectification .................................................................................................... 22
3.3.1. Epipolar geometry.................................................................................................... 22
3.3.2. The Essential and Fundamental matrices ................................................................ 24
3.3.3. Rectification of stereo image pairs........................................................................... 25
3.4. Stereo correspondence.................................................................................................... 26
3.4.1. The stereo correspondence problem....................................................................... 26
3.4.2. Stereo matching process.......................................................................................... 27
3.4.3. Semi-Global Block Matching .................................................................................... 27
3.5. Disparity computation and depth estimation.................................................................. 32
4. Structured light and time-of-flight infrared ............................................................................. 34
4.1. The Microsoft Kinect v1 and structured light infrared..................................................... 34
4.2. The Microsoft Kinect v2 and time of flight infrared......................................................... 35
5. Facial detection using the Viola-Jones algorithm..................................................................... 36
6. Facial recognition ..................................................................................................................... 39
6.1. Eigenfaces......................................................................................................................... 39
6.2. Fisherfaces........................................................................................................................ 42
6.3. Local Binary Pattern histograms ...................................................................................... 44
6.4. Eigensurfaces and Fishersurfaces .................................................................................... 46
6.5. Feature detectors............................................................................................................. 46
7. Experimental results................................................................................................................. 47
7.1. Camera calibration........................................................................................................... 47
7.1.1. Camera calibration in OpenCV................................................................................. 47
7.1.2. Camera calibration using the MATLAB Stereo Camera Calibrator app.................... 49
7.2. Image pair rectification .................................................................................................... 51
7.2.1. Image pair rectification in OpenCV .......................................................................... 52
7.2.2. Calibrated image pair rectification in MATLAB........................................................ 54
7.2.3. Un-calibrated image pair rectification in MATLAB................................................... 56
7.3. Disparity calculation......................................................................................................... 59
7.3.1. Disparity map creation in OpenCV........................................................................... 59
7.3.2. Disparity map creation using the calibrated and uncalibrated models in MATLAB 61
7.4. Active stereo with the Microsoft Kinect .......................................................................... 62
7.5. Image Post Processing and Face detection using the Viola-Jones algorithm .................. 63
7.6. Facial Recognition ............................................................................................................ 67
7.6.1. Eigenfaces and Eigensurfaces................................................................................... 67
7.6.2. Local Binary Pattern Histogram................................................................................ 69
7.6.3. SURF feature detection ............................................................................................ 71
7.6.4. Harris corner detection ............................................................................................ 71
7.6.5. Comparison and discussion of results...................................................................... 72
8. Conclusion................................................................................................................................ 74
9. Further work............................................................................................................................. 75
References........................................................................................................................................ 77
Appendix A – Confusion matrices produced by facial recognition algorithms................................ 81
Table of Figures
Figure 1 - Pinhole camera model ..................................................................................................... 14
Figure 2 - Discretised pinhole model................................................................................................ 14
Figure 3 - Simple stereo camera model ........................................................................................... 15
Figure 4 - Rotations about orthogonal axes of a frame. .................................................................. 16
Figure 5 - Translation vector describing the change in position between two orthogonal frames. 17
Figure 6 - Combined rotation and translation between two orthogonal frames............................ 18
Figure 7 - Types of radial distortion. ................................................................................................ 19
Figure 8 - View of a planar chessboard described in the object plane and the image plane. G.
Bradski and A. Kaehler, Learning OpenCV [46]................................................................................ 21
Figure 9 - Extended stereo camera model....................................................................................... 23
Figure 10 - Visualisation of Epipolar geometry................................................................................ 24
Figure 11 – Rectified image pair and horizontal epipolar lines........................................................ 25
Figure 12 – Depth estimation through the use of a rectified binocular stereo system................... 32
Figure 13 – Variation in disparity with depth................................................................................... 33
Figure 14 – Microsoft Kinect v1 sensor camera arrangement – credit Microsoft........................... 34
Figure 15 - Speckled light structured light pattern – credit Trevor Taylor (MSDN) [68].................. 35
Figure 16 - Microsoft Kinect v2 sensor camera arrangement – credit Microsoft............................ 35
Figure 17 – Set of Haar-like features, A) two-rectangle horizontal feature, B) two-rectangle vertical
feature, C) three-rectangle feature and D) four-rectangle feature ................................................. 36
Figure 18 – Facial feature detection using Haar-like features - credit Viola/Jones [34].................. 37
Figure 19 - Integral image layout example....................................................................................... 37
Figure 20 - Cascade classifier structure............................................................................................ 38
Figure 21 -Representation of an image as a vector ......................................................................... 39
Figure 22 – Comparison of Eigenfaces and Fisherfaces algorithms (Database size vs Error Rate)
[58] ................................................................................................................................................... 43
Figure 23 – Original LBP algorithm comprised of a 3x3 neighbourhood of pixels to be thresholded.
.......................................................................................................................................................... 44
Figure 24 – Local binary patterns using (8,1) [Left], (16,3) [Centre], and (16,4) [Right]
neighbourhoods............................................................................................................................... 45
Figure 25 – Local binary patterns found using a (8,1) operator [58] ............................................... 45
Figure 26 – Detected chessboard in camera calibration boards visible in stereo image pairs using
OpenCV ............................................................................................................................................ 48
Figure 27 - Pixel reprojection error between calibration board points and calibrated estimates .. 49
Figure 28 - Calibration boards 11-15 captured using a binocular stereo camera system and
MATLAB............................................................................................................................................ 50
Figure 29 – Extrinsic parameters from an arbitrary view ................................................................ 51
Figure 30 - Unrectified image pair (OpenCV)................................................................................... 52
Figure 31 - Stereo anaglyph of unrectified image pairs (OpenCV)................................................... 52
Figure 32 – Rectified image pair (OpenCV)...................................................................................... 53
Figure 33 – Stereo anaglyph of rectified image pairs (OpenCV)...................................................... 53
Figure 34 - Unrectified Image pair (MATLAB) .................................................................................. 54
Figure 35 – Overlaid anaglyph of the unrectified image pair........................................................... 54
Figure 36 – Rectified image pair (MATLAB) ..................................................................................... 55
Figure 37 – Overlaid anaglyph of the rectified image pair............................................................... 55
Figure 38 – Rectified image pair with horizontal (epipolar) lines highlighted................................. 55
Figure 39 - Fifty best SURF features in the right image.................................................................... 56
Figure 40 - Fifty best SURF features in the left image...................................................................... 56
Figure 41 - Matched features in the image pair............................................................................... 57
Figure 42 - Valid matched features after outlier removal ............................................................... 58
Figure 43 - Uncalibrated rectified image pair .................................................................................. 59
Figure 44 – Multiple disparity images created in OpenCV............................................................... 60
Figure 45 – Disparity map produced using the calibrated model for rectification.......................... 61
Figure 46 - Disparity map produced using the uncalibrated model for rectification....................... 61
Figure 47 - Microsoft Kinect RGB image captured........................................................................... 62
Figure 48 - Depth mapped colour image and depth image from the Kinect sensor........................ 62
Figure 49 - Greyscale Kinect image .................................................................................................. 63
Figure 50 - Detected face using the Viola-Jones algorithm on the Kinect greyscale data............... 63
Figure 51 - Greyscale depth mapped Kinect RGB image.................................................................. 64
Figure 52 - Face detected in the depth mapped greyscale image and in the depth image............. 64
Figure 53 – Image pair captured from the passive binocular stereo camera system...................... 65
Figure 54 – Processed greyscale base image ................................................................................... 65
Figure 55 – Unaltered disparity image returned from the SGBM algorithm................................... 66
Figure 56 – Disparity image after CLAHE has been applied ............................................................. 66
Figure 57 – Detected faces in the greyscale base image and in the computed disparity map........ 67
Figure 58 - Mean face of the stereo greyscale training dataset ...................................................... 68
Figure 59 -The six most varied Eigenfaces of the stereo greyscale training dataset....................... 68
Figure 60 – The six most varied Eigensurfaces of the stereo disparity training dataset ................. 69
Figure 61 - Mean surface of the stereo disparity training dataset .................................................. 69
Figure 62 – Image pair with matching subject................................................................................. 70
Figure 63 – Image pair with non-matching subjects........................................................................ 70
Figure 64 – LBP Histogram error results analysis and match determination................................... 70
Figure 65 – Matched SURF features between test and training images from the Kinect greyscale
dataset.............................................................................................................................................. 71
Figure 66 - Matched Harris features between test and training images from the Kinect greyscale
dataset.............................................................................................................................................. 71
Figure 67 – Percentage of correct facial matches for each method of facial recognition and each
dataset.............................................................................................................................................. 73
Figure 68 – Match times per image for each method of facial recognition and each dataset........ 74
1. Introduction
Closed-Circuit Television (CCTV) security monitoring forms an integral part of daily life, acting as both
a real-time protection and prevention tool and a useful retrospective investigative tool.
In the event of a serious incident, this public and private space CCTV footage is supplemented with
vast amounts of image and video data obtained by private citizens via various devices such as
smartphones, tablets and traditional digital cameras. As a result of the multitude of sources
presented, any available footage will have varying position, resolution and field of view, each
providing fragmented partial images of the full scene. If the relative positions of these cameras are
known, or can be estimated, they can be combined to form a view of a scene unlike that which
could be given by a single image.
In addition to CCTV, facial detection and recognition also prove substantially useful as investigative
tools, allowing potential suspects to be identified automatically and rapidly whereas prior to the
development of such techniques the process was a manual and laborious undertaking. Methods of
facial recognition are becoming increasingly robust and this is an area where significant research
has been undertaken [1][2][3] resulting in innovations and improvements on existing techniques.
Other forms of image recognition based security systems such as those at airport security
checkpoints, London Heathrow for instance, include cameras at border control which capture
frontal images of faces for comparison prior to permitting a passenger to board their flight. To avoid
the additional complexities that come from having multiple cameras positioned at various extreme
angles and distances around a room, a more idealised system akin to airport security checkpoint
cameras was developed.
While not widely used in security systems, 3-dimensional model creation is used extensively in
other areas such as computer animation and computer generated imagery (CGI) in both the film
and video game industries [4]. Similar 3-dimensional imaging and modelling is also prevalent in the
medical industry, being used for diagnosis purposes in the form of magnetic resonance imaging
(MRI) or through the use of 3-dimensional modelling and printing for surgical preparation and to
create practice surgical procedures, prosthetics, synthetic tissue and organ transplants along with
other implants [5].
This project aims to utilise various digital image processing techniques to estimate depth, employing
two distinct imaging techniques. The first is stereo vision, or stereoscopy, and the second is the
emission and reception of infrared light using a Microsoft Kinect sensor. Both of these methods
provide an idealised emulation of CCTV source footage, or mimic a system such as the
example at London Heathrow airport. The output from both of these image acquisition systems can
be used to create 3-dimensional models of a subject’s face which can then be used for 3-
dimensional facial recognition purposes. As well as using this 3-dimensional data, the images
themselves can be utilised to carry out traditional 2-dimensional facial recognition against a
database of faces.
This report is split into four parts. The first part comprises sections 1 and 2, which form the
introduction and the review of literature relevant to the project. Sections 3 through 6 form the
second part, which covers the relevant background details of the techniques found in the
literature. Section 7 presents the results gathered over the course of the project, and the final part,
sections 8 and 9, forms the conclusion of the work and potential future work to be investigated,
respectively.
2. Literature review
As discussed in section 1, there were two main broad areas of initial investigation and research to
be carried out in this project: 1) 3-dimensional model creation and 2) facial recognition. As the
project progressed, various other avenues of analysis were explored as and when it became
appropriate to do so.
Both the 3-dimensional model creation and facial recognition have a strong reliance on the set of
images captured from the various cameras in a passive stereo imaging system as well as those
obtained using the colour camera and infrared sensor in the Microsoft Kinect which can also be
thought of as an active stereo imaging method. The model creation relies on the stereo system and
implemented algorithms to accurately find correspondences between the image pairs returned.
This allows for a good estimation of depth in the shared field of view of the cameras in the system
[6] which can then be used in further stages. The reliance on the Microsoft Kinect sensor in the 3-
dimensional imaging is lesser as it has a specialised infrared sensor to accurately estimate depth.
The facial recognition stages have a reliance on not only the raw images from the stereo and Kinect
systems, but also on the estimated depth for the implementation of 3-dimensional recognition
techniques. The ability to accurately match faces based on a set of training data relies inherently
on the images acquired from the various sources available.
There are two widely used methods of 3-dimensional modelling using passive stereo imaging: stereo
vision, which this project focuses on, and structure (or stereo) from motion (SfM). These methods
are similar in the way they construct 3-dimensional models from image data; however, they vary in
the way they capture that data, stereo vision being a static multiple-camera system viewing an
object from multiple angles while SfM consists of a single moving camera capturing sequential
images or video footage of an object [7][8][9][10].
Both stereo imaging and SfM rely on feature detection methods in order to match points in image
pairs or image sequences before depth can be estimated from them. There are various
industry-standard methods; two of the best known and most widely used algorithms are the Scale-
Invariant Feature Transform (SIFT) [11] and Speeded-Up Robust Features (SURF) [12]. These
algorithms have been compared in various papers, observing and evaluating their speed,
computation cost and accuracy [13][14][15]. The general conclusion of the comparisons drawn is
that SIFT is the more robust of the two, returning a far greater number of matched points more
regularly at the expense of both computation cost and speed. SURF runs far faster than SIFT and
therefore has a justification for use under certain circumstances that require real-time or faster
computation over robustness. Initially it was believed these algorithms were available in OpenCV,
which is the open source computer vision library being used in this project; however, they were
removed from the native version of the library in version 3 and required additional installations,
which were not carried out due to time constraints. They are, however, available in MATLAB and were
used in an investigation into uncalibrated depth estimation (see Section 7.2.3), as well as for their
potential use towards facial recognition (Section 7.6.3).
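As a brief, hedged illustration of the kind of MATLAB feature-matching pipeline referred to here, the sketch below detects and matches SURF features between a stereo image pair; the image file names are placeholders and the exact parameters used in the project may differ.

```matlab
% Sketch: detect and match SURF features between a stereo image pair in MATLAB.
% Assumes the Computer Vision System Toolbox; file names are placeholders.
I1 = rgb2gray(imread('left.png'));
I2 = rgb2gray(imread('right.png'));

% Detect SURF keypoints and extract their descriptors in each image
points1 = detectSURFFeatures(I1);
points2 = detectSURFFeatures(I2);
[features1, validPoints1] = extractFeatures(I1, points1);
[features2, validPoints2] = extractFeatures(I2, points2);

% Match descriptors and keep the corresponding point locations
indexPairs = matchFeatures(features1, features2);
matched1 = validPoints1(indexPairs(:, 1));
matched2 = validPoints2(indexPairs(:, 2));

% Visualise the putative matches for inspection
figure; showMatchedFeatures(I1, I2, matched1, matched2, 'montage');
```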
Although depth estimation can be achieved through the use of a single sensor, as is the case in the
technology used in the Kinect sensor, true stereo vision requires a minimum of two cameras and
uses image correspondences to estimate depth, taking advantage of known or estimated camera
positions. There have been various papers published on the use of two-camera, three-camera
[16][17][18][19] and scalable multi-camera arrays [20][21]. With more cameras, the complexity
and computation cost of the associated depth estimation increases, but so does the accuracy of the
system, producing higher resolution results.
As highlighted in [6] and [22], as well as in most other literature encountered on the subject, stereo
imaging requires offline pre-processing in order to relate the cameras in the system to each other
and their relative positions in the real world that they are imaging. This is achieved through camera
calibration. Camera calibration can be accomplished through the use of a known pattern which is
used to calculate the parameters associated with the stereo cameras. Any pattern with known
structure and size can be used although two common patterns that stand out most in literature are
the checkerboard and asymmetrical circle patterns [10][23]. Camera calibration corrects distortions
and calculates the camera parameters that will be used in further steps of stereo imaging. Due to
this dependence on calibration exhibited by both the facial recognition and depth estimation
stages, it is essential that images are captured and corrected accurately. If images go forward having
been poorly corrected or uncorrected, error will be accrued in the steps discussed in Section 3
which could ultimately mean a poor depth estimation is produced.
It is possible to apply facial recognition to 2-dimensional and 3-dimensional data with a substantial
amount of research effort being focused in these areas over the past two decades [24]. It is
generally recognised that two methods of facial recognition exist: global methods which, as the
name suggests, compare faces as a whole [24], and local methods which attempt to match faces
through the comparison of small neighbourhoods of images, such as local binary patterns (LBP) [2],
or by using feature detection and extraction methods such as SIFT and SURF [11][12] together with
some similarity measure, such as a distance between matched features.
Two of the most well-known and well-used algorithms for global facial recognition are Eigenfaces,
first described by Turk and Pentland in 1991 [24][25], which uses Principal Component Analysis
(PCA), and Fisherfaces, which uses Fisher’s Linear Discriminant Analysis. The latter was first
theorised in 1936 by R.A. Fisher as a classification method of species [26] and his method has since
been applied to the facial recognition problem where it first carries out the same steps as the
Eigenfaces method but then expands on this to increase classification accuracy (see Section 6.2).
Both of the aforementioned methods utilise PCA for dimensionality reduction on the data in order
to reduce the computation required and improve run-time. Due to their similarity and popularity
they have been compared closely in literature [27][28][29].
In addition to 2-dimensional methods, another method of facial recognition is to model a face in 3-
dimensional space and compare its spatial characteristics with those of models stored in a
database. These techniques have similar associated steps to those in traditional 2-dimensional
methods; however, they exploit an additional dimension of data and are more robust to variations in
illumination and pose [3][24].
The literature reviewed until this stage was part of an initial research phase prior to the submission
of the interim report. Due to the nature of the project and its broad spectrum of component parts
a secondary review of further material was conducted after the initial interim report was written.
This was done in order to both refresh knowledge in certain areas and also look into new or updated
features and technologies that could be applied to the project.
As mentioned above, the Kinect sensors use infrared light in order to aid depth estimation and,
because of this, they are referred to as active stereo imaging systems. The technology behind the
Kinect sensors was investigated further in order to implement a system that estimates depth and
carries out facial recognition using the collected data, alongside that from the passive system
constructed, in order to compare their results. The Kinect v1 uses a technology called structured
light [30][31][32], which scatters a known infrared pattern into a scene; the distortions of that
pattern allow the estimation of depth. The newer model, the Kinect v2, uses infrared time of flight
[32][33], which measures the time taken between emitting a pulse of IR light and receiving its
reflection; knowing the speed at which the light travels, the depth is half the distance travelled.
Due to hardware restrictions, only the Kinect v2 Time of
Flight (ToF) sensor was used for the purposes of this project.
Prior to the submission of the interim report, it was identified that some method of automatically
and dynamically detecting faces would be required due to the quantities of data to be accrued. One
algorithm, created by Paul Viola and Michael Jones, offered detection accuracy comparable to the
best alternatives of the time while running considerably faster [34]. The Viola-Jones algorithm
introduces three novel contributions [34][35][36], the first being the use of the integral image, a
form of image representation that allows the rectangular features used in detection to be evaluated rapidly. The
second is the use of a boosting algorithm called AdaBoost [37] to construct a classifier and the third
and final contribution is the combination of increasingly more complex classifiers in a cascade
structure [34][35].
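The short sketch below shows how a Viola-Jones style detection can be run in MATLAB, under the assumption that the toolbox's pre-trained frontal-face cascade is a reasonable stand-in for the detector used later in the project; the image file name is a placeholder.

```matlab
% Sketch: Viola-Jones face detection with MATLAB's pre-trained frontal-face cascade.
% The image file name is a placeholder.
detector = vision.CascadeObjectDetector();   % defaults to a frontal face model
I = imread('subject.png');
bboxes = step(detector, I);                  % one [x y w h] row per detected face

% Draw the detections and display the annotated image
annotated = insertShape(I, 'Rectangle', bboxes, 'LineWidth', 3);
figure; imshow(annotated); title('Detected faces');
```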
Due to its use as the standard stereo correspondence algorithm in both MATLAB and OpenCV,
the Semi-Global Block Matching (SGBM) algorithm, first developed by Heiko
Hirschmüller in 2005 [38][39], was investigated. The original algorithm uses mutual information (MI) in its
matching of stereo correspondences, although this feature is not implemented in either the
MATLAB or OpenCV versions. It performed at least as well as similar
algorithms of the time, and better where sub-pixel accuracy is concerned. The algorithm also
boasts a swift runtime per image and a variety of adjustable parameters which allow for great
control over the matching process; because of this, an additional investigation into the ideal parameters
to use for the purposes of the project was performed. The algorithm and its component parts are
described in greater detail in Section 3.4.3.
Following on from initial research into methods of applying facial recognition to the collected image
data, additional algorithms were investigated as the facial recognition stage became more of a
focus; in particular, methods of 3-dimensional facial recognition that stem from 2-dimensional
techniques, and the use of feature detection algorithms to carry out facial recognition.
3-Dimensional variants of Eigenfaces and Fisherfaces, Eigensurfaces [40] and Fishersurfaces
[41][42], were investigated and implemented. Feature detectors such as SURF and SIFT could be
utilised to perform facial recognition by extracting prominent features from images and through
some metric, the features present in each image could be measured against one another. These
metrics can be various distance measurements between matched features or simply the number
of matched features returned by these algorithms.
Various other feature detectors present in MATLAB were also tested; these included Binary Robust
Invariant Scalable Keypoints (BRISK) [43], the Harris feature detector [44] and the Features from
Accelerated Segment Test (FAST) feature detector [45].
3. Multi-camera stereo vision
The first method of acquiring data to create models and to perform facial recognition was a passive
stereo imaging system. This system was initially composed of three standard Logitech computer
webcams; however, due to one camera lens having developed an irreparable defect, the system
was downgraded to using only two of the original three cameras.
There are four main steps in a stereo vision system: camera calibration, image pair
rectification, stereo correspondence, and disparity or depth estimation.
Before attempting to carry out these steps and in order to understand how the use of multiple
cameras could estimate the depth in their field-of-view, the theoretical camera models and the
image formation process were researched.
3.1. Camera models
One of the simplest representations of a camera is the pinhole camera model, where light converges
to a single point through an image plane; this point is the camera centre, camera origin or optical
centre. Any point, P, in space will be represented by a point, p, on the image plane, as shown in
Figure 1.
In reality the image plane has a finite number of elements, depending on the resolution; as such
the model becomes discrete, as shown in Figure 2, and the point P now corresponds to some pixel,
[u_p, v_p], on the image plane.
The point P can exist anywhere along the vector passing through the camera origin, p and P, so
there is an infinite number of potential 3-dimensional points that the pixel p could represent.
Adding a second camera and viewing the same point P, now notated as p′, in the second reference
frame allows for the inference of distance, or depth, as shown in Figure 3. This is the principle
behind binocular or multi-view stereo imaging.
Figure 1 - Pinhole camera model
Figure 2 - Discretised pinhole model
3.2. Camera calibration
The first stage in carrying out stereo imaging is the camera calibration step. Calibration is performed
offline, meaning it can be carried out and the results stored for later use assuming the cameras are
not moved or adjusted in any way. Calibration aims to find the extrinsic and intrinsic parameter of
the cameras in the system [10] [46] allowing coordinates in the camera frame(s) to be equivalent
to coordinates in the world frame [9]. Extrinsic parameters are those which relate the cameras in
the system to each other [46] along with their relative positions and orientations in a 3-dimensional
space, whereas intrinsic parameters relate to the geometry and optical characteristics that are
internal and unique to each camera. The calibration process is described in the following sections.
3.2.1. Camera parameters
As discussed above, a camera can be described by its intrinsic and extrinsic parameters. The intrinsic
parameters generally include the effective focal length, f, the scale factor, s, (alternatively known
as the aspect ratio α) and the image centre, either referred to as [cx,cy] or [u0,v0] in literature. The
intrinsic parameters of a camera can be described in terms of a 3x3 matrix, 𝑀𝑖𝑛𝑡, which relates the
image points x,y and w to points in a 3-dimensional space X,Y and Z. The third coordinate in the
image frame, w, is equivalent to Z using homogeneous coordinate systems [46]. This relation is
shown in Equation (1) as described in [23] and [46]:
\[ q = M_{int}\,Q \;\;\rightarrow\;\; \begin{bmatrix} x \\ y \\ w \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{1} \]

where f_x = −f/s_x and f_y = −f/s_y are the scaled focal lengths expressed in pixel units.

Figure 3 - Simple stereo camera model
The extrinsic parameters, M_ext, are used to transform object coordinates to a camera-centred
coordinate frame and are essentially the joint rotation-translation matrix [R|t], which is shown in
Equation (6) in Section 3.2.2. The extrinsic parameters are also used to describe the relationship
between cameras in the system using their individual frames of reference and, additionally, to
describe their relative position and orientation against that of a global world frame.
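To make Equation (1) concrete, the following sketch projects a single camera-frame point onto the image plane using an intrinsic matrix; all parameter values are illustrative rather than calibrated ones.

```matlab
% Sketch: projecting a 3-D camera-frame point onto the image plane using the
% intrinsic matrix of Equation (1). Parameter values are illustrative only.
fx = 800; fy = 800;          % scaled focal lengths in pixel units
cx = 320; cy = 240;          % principal point (image centre)
Mint = [fx 0 cx; 0 fy cy; 0 0 1];

P = [0.1; -0.05; 1.5];       % a point in the camera frame (metres)
q = Mint * P;                % homogeneous image coordinates [x; y; w]
uv = q(1:2) / q(3);          % divide by w to obtain pixel coordinates
fprintf('Projected pixel: (%.1f, %.1f)\n', uv(1), uv(2));
```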
3.2.2. Rotations and translations
Any object in a scene has an orientation, or pose, associated with it relative to the frame of the
camera which can be computed in terms of a rotation, 𝑅, and a translation, 𝑡 [46]. Any rotation can
be defined in terms of individual angular rotations about the three orthogonal axes in a world
frame: yaw about the z-axis, φ, pitch about the y-axis, ψ, and roll about the x-
axis, ω. Figure 4 displays these three rotations diagrammatically. The three rotations can be
expressed individually by the three 3x3 matrices shown in Equations (2), (3) and (4) for roll, pitch and yaw
respectively, and the total rotation can be expressed as the multiplication of these three individual
rotation matrices as expressed in Equation (5).
Figure 4 - Rotations about orthogonal axes of a frame.
\[ R_x(\omega) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\omega & -\sin\omega \\ 0 & \sin\omega & \cos\omega \end{bmatrix} \tag{2} \]

\[ R_y(\psi) = \begin{bmatrix} \cos\psi & 0 & \sin\psi \\ 0 & 1 & 0 \\ -\sin\psi & 0 & \cos\psi \end{bmatrix} \tag{3} \]

\[ R_z(\varphi) = \begin{bmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{4} \]

\[ R = R_x(\omega)\, R_y(\psi)\, R_z(\varphi) \tag{5} \]
These sets of rotations describe an object's pose in a 3-dimensional world. Its position relative to
a world frame can also be described, this time in terms of a 3x1 vector whose elements contain the
change in x, y and z position of the object frame's origin. The translation between a frame {A} and
another frame {B} is denoted as ^A P_B_origin and is displayed in Figure 5.
Combining a rotation and a translation fully describes both an object’s relative pose and position in
relation to a common frame. Figure 6 shows the rotation and translation between two frames and
the general transformation matrix, 𝑇, or the joint rotation-translation matrix, [𝑅|𝑡], is shown in
Equation (6).
Figure 5 - Translation vector describing the change in position between two
orthogonal frames.
\[ M_{ext} = [R|t] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \tag{6} \]
If both the rotation and translation between the camera frame(s) and the world frame are known,
any point in the camera frame(s), 𝑃𝑐, can be related to its corresponding point in the world frame,
𝑃𝑤, using Equation (7).
𝑃𝑐 = 𝑅(𝑃𝑤 − 𝑇) (7)
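The short sketch below composes a rotation from roll, pitch and yaw as in Equations (2)-(5) and applies Equation (7) to map a world point into the camera frame; the angles, translation and point are illustrative values only.

```matlab
% Sketch: composing a rotation from roll, pitch and yaw (Equations (2)-(5)) and
% mapping a world point into the camera frame with Equation (7).
% Angles, translation and the point are illustrative values.
omega = deg2rad(5);  psi = deg2rad(-3);  phi = deg2rad(10);   % roll, pitch, yaw

Rx = [1 0 0; 0 cos(omega) -sin(omega); 0 sin(omega) cos(omega)];
Ry = [cos(psi) 0 sin(psi); 0 1 0; -sin(psi) 0 cos(psi)];
Rz = [cos(phi) -sin(phi) 0; sin(phi) cos(phi) 0; 0 0 1];
R  = Rx * Ry * Rz;                       % total rotation, Equation (5)

T  = [0.12; 0; 0];                       % translation between the frames
Pw = [0.5; 0.2; 2.0];                    % a point expressed in the world frame
Pc = R * (Pw - T);                       % the same point in the camera frame, Equation (7)
disp(Pc);
```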
3.2.3. Distortions
There are two main and noticeably visible forms of distortion associated with camera lenses: radial
and tangential distortion [46]. Both are the consequence of physical properties of the lens and can
be accounted for to an extent during the calibration process [47][48].
Radial distortion occurs in a lens as a result of non-planar geometry which will result in a more
eccentric angle of incidence of light passing through a lens the further from the optical centre it is.
Figure 6 - Combined rotation and translation between two orthogonal frames.
As such, the radial distortion is a function of radial distance, r, and there are generally two
categories of radial distortion, pincushion and barrel, both of which are visualised in Figure 7.
Radial distortion can be described by the first few terms of a Taylor series [46] with coefficients 𝑘1,
𝑘2 and 𝑘3. Generally, only the first two coefficients are required to fully describe the distortion
unless it is particularly prominent. Radial distortion in the x and y axes can be corrected using
Equation (8) and (9) respectively:
\[ x_{corrected} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \tag{8} \]

\[ y_{corrected} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \tag{9} \]
Tangential distortion arises from the lens not being parallel to the optical plane of the sensor and
is purely a manufacturing defect. The tangential distortion is described by two parameters 𝑝1 and
𝑝2 and can be minimised in the x and y axes by using Equations (10) and (11):
\[ x_{corrected} = x + [\,2p_1 xy + p_2(r^2 + 2x^2)\,] \tag{10} \]

\[ y_{corrected} = y + [\,p_1(r^2 + 2y^2) + 2p_2 xy\,] \tag{11} \]
Figure 7 - Types of radial distortion.

The five distortion parameters needed to correct distortions in a lens are more often than not
bundled into a single vector in software packages [46], in the form [k_1, k_2, k_3, p_1, p_2]. These
distortion coefficients, along with the intrinsic camera parameters and the rotation and translation
information, make ten parameters that need to be solved for. Using a flat, planar object to calibrate a
camera system fixes eight of the ten parameters, but a minimum of two views of the object are
needed for every camera in order to solve for the geometric parameters R and t. This object is known as
the calibration object [46] and, for the purposes of this project, an offset chessboard was used.
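As a worked illustration of Equations (8)-(11), the sketch below applies the radial and tangential corrections to a point in normalised image coordinates; the coefficient values are invented for the example, and the tangential terms follow the common form that includes the xy cross-terms.

```matlab
% Sketch: applying the radial (8)-(9) and tangential (10)-(11) corrections to a
% point in normalised image coordinates. Coefficient values are illustrative.
k1 = -0.25; k2 = 0.07; k3 = 0;        % radial distortion coefficients
p1 = 1e-3;  p2 = -5e-4;               % tangential distortion coefficients

x = 0.3; y = -0.2;                    % normalised (pinhole-model) coordinates
r2 = x^2 + y^2;                       % squared radial distance from the optical centre

radial = 1 + k1*r2 + k2*r2^2 + k3*r2^3;
x_corr = x*radial + (2*p1*x*y + p2*(r2 + 2*x^2));
y_corr = y*radial + (p1*(r2 + 2*y^2) + 2*p2*x*y);
fprintf('Corrected point: (%.4f, %.4f)\n', x_corr, y_corr);
```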
3.2.4. Chessboard calibration
In theory, any planar object with distinct known characteristics can be used as a calibration object;
however, practical choices include repeating patterns of circles or, more frequently, a chessboard
[46][49]. Chessboards are commonly used because the repeating pattern of black and white squares
causes little to no bias towards any one section of the calibration object [49]. An offset chessboard,
or one with sides of non-equal size, is used to ensure that the bottom black corner of the chessboard
is always used as the world origin.
Calibrating the camera(s) with a chessboard returns the geometric parameters of the camera
system, as the size of the squares, the number of squares and the number of inside corners on the board
are known. By comparing the position of the inside points of the chessboard in every view returned
from the cameras in the system, the camera positions themselves can be estimated and, as the lines
on the object should appear perfectly straight, any distortions caused by the lens can be corrected.
The more images captured, the more accurate the calibration will be.
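A minimal sketch of this calibration procedure in MATLAB, assuming the Computer Vision System Toolbox, is shown below; the folder names, square size and options are placeholders rather than the exact settings used in the project.

```matlab
% Sketch: calibrating a two-camera system from chessboard images in MATLAB.
% Folder names and the square size are placeholders.
leftImages  = imageDatastore('calib/left');
rightImages = imageDatastore('calib/right');

% Locate the inner chessboard corners in every image pair
[imagePoints, boardSize] = detectCheckerboardPoints( ...
    leftImages.Files, rightImages.Files);

% Generate the corresponding world coordinates of the corners (square size in mm)
squareSize = 25;
worldPoints = generateCheckerboardPoints(boardSize, squareSize);

% Estimate intrinsics, extrinsics and distortion for both cameras
stereoParams = estimateCameraParameters(imagePoints, worldPoints);
showReprojectionErrors(stereoParams);     % inspect the calibration quality
```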
3.2.5. Homography
In computer vision homography is defined as the mapping of an object from one plane to another
[46] and the homography matrix, 𝐻, contains the parameters to perform this mapping. For any
point in space, 𝑄, the corresponding point on the image plane, 𝑞, can be found using Equation (12).
\[ \tilde{q} = sH\tilde{Q}, \qquad \text{where } \tilde{q} = [x \;\; y \;\; 1]^T \text{ and } \tilde{Q} = [X \;\; Y \;\; Z \;\; 1]^T \tag{12} \]
Figure 8, from [46], shows this mapping from the object plane to the image plane through
homography.
The extrinsic, 𝑀𝑒𝑥𝑡, and intrinsic, 𝑀𝑖𝑛𝑡, parameters of the camera, from Equations (1) and (6)
respectively, also need to be taken into account while projecting objects onto the image frame.
Equation (13) shows how the extrinsic and intrinsic parameters are used to project points between
planes:
\[ \tilde{q} = s\, M_{int} M_{ext} \tilde{Q} \tag{13} \]
The projected coordinate, 𝑄̃, is defined for all of space along the vector through 𝑄̃ and 𝑞̃. The true
point of interest lies on the object plane where 𝑍 = 0. The rotational matrix, 𝑅, is divided into
individual column vectors 𝑟1, 𝑟2 and 𝑟3 and, for the purposes of this computation, 𝑟3 is not required
[46]. As a result, Equation (13) is reduced to the form shown in Equation (14).
\[ \tilde{q} = s\, M_{int} [r_1 \;\; r_2 \;\; r_3 \;\; t]\, \tilde{Q} = s\, M_{int} [r_1 \;\; r_2 \;\; t]\, \tilde{Q}' \tag{14} \]

where q̃ = [x y 1]^T, Q̃ = [X Y 0 1]^T and Q̃′ = [X Y 1]^T.
Figure 8 - View of a planar chessboard described in the object plane and the image plane.
G. Bradski and A. Kaehler, Learning OpenCV [46]
From Equation (14) it can be observed that the homography matrix, H, is now a 3x3 matrix and can
be written as:

\[ H = s\, M_{int} [r_1 \;\; r_2 \;\; t] \tag{15} \]

Therefore,

\[ \tilde{q} = sH\tilde{Q}' \tag{16} \]

given that s can be factored out of H.
Now, with the homography matrix known, any point on a source (object) plane can be described
on a destination (image) plane and vice versa using 𝐻 and Equations (17) and (18) respectively:
\[ p_{dst} = H p_{src} \tag{17} \]

\[ p_{src} = H^{-1} p_{dst} \tag{18} \]

where p_dst = [x_dst y_dst 1]^T and p_src = [x_src y_src 1]^T.
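The following sketch applies Equations (17) and (18) directly, mapping a homogeneous point through an example homography and back again; the matrix values are illustrative, not computed from real data.

```matlab
% Sketch: mapping points between planes with a homography, Equations (17)-(18).
% The matrix H is an illustrative example only.
H = [ 1.02  0.01  15;
     -0.02  0.98   8;
      1e-5  2e-5   1];

p_src = [120; 80; 1];            % homogeneous point on the source (object) plane
p_dst = H * p_src;               % mapped to the destination (image) plane, Eq. (17)
p_dst = p_dst / p_dst(3);        % normalise so the third component is 1

p_back = H \ p_dst;              % inverse mapping, Eq. (18)
p_back = p_back / p_back(3);     % recovers the original source point
disp([p_dst.'; p_back.']);
```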
3.3. Image pair rectification
Once calibration has taken place, the parameters of the cameras are known and objects can be
mapped to and from different planes; the next stage is to rectify, or virtually align, the cameras.
Rectification is the process of using the camera geometry, and the resulting image geometry, to
warp the data to appear as if the cameras in the system are aligned on the same virtual plane. This
process also significantly simplifies what is known as the correspondence problem (Section 3.4.1).
3.3.1. Epipolar geometry
The geometry exhibited by a stereo camera system is known as epipolar geometry. A binocular
arrangement of cameras can be described in terms of two pinhole camera models, as seen in Figure
3. This model can be extended by adding the baseline, b, the distance between the two camera
centres, O_R and O_T, which corresponds to the translation in space calculated during calibration;
this extended model is visualised in Figure 9.
Figure 9 also shows the problem that is solved using multiple cameras: points P and Q in space
correspond with the same pixel on the reference image plane, π_R. By using a second, target, image
plane, π_T, it becomes apparent that they are separate points, existing at p′ and q′ on the target
image plane.
The points of intersection with the baseline in each image plane are referred to as the epipoles, e_R
and e_T, and the triangle confined by the point P in space and the two camera centres describes the
epipolar plane. The lines on the image plane that join projected points and epipoles, or alternatively
where the epipolar plane meets the image plane, are known as the epipolar lines, and Figure 10
displays these relationships.
Figure 9 - Extended stereo camera model
Due to the nature of epipolar geometry, any point along an epipolar line in one image plane will
correspond to a point on the epipolar line in the other image plane; this is known as the epipolar
constraint [6][46] and is vital, along with the stereo rectification stage, in solving the stereo
correspondence problem.
3.3.2. The Essential and Fundamental matrices
Two additional matrices are needed in order to compute epipolar lines: the essential
matrix, E, and the fundamental matrix, F. The essential matrix contains information on the rotation
and translation between the cameras in space. The fundamental matrix differs slightly in that it also
contains information about the intrinsic parameters of the cameras and therefore relates the
cameras in terms of pixel coordinates [46]. Both matrices can be obtained during calibration, or
alternatively estimated as shown in Section 7.2.3.
The essential matrix is defined as a matrix that ensures the equality shown in Equation (19) holds
true:
\[ p_r^T E\, p_l = 0 \tag{19} \]

where p_l is a point in the left image and p_r is the corresponding point in the right image.
Figure 10 - Visualisation of Epipolar geometry.
The fundamental matrix is defined in Equation (20) and again ensures the equality in Equation (21)
is observed.
\[ F = (M_{int_r}^{-1})^T E\, M_{int_l}^{-1} \tag{20} \]

where M_int_r and M_int_l are the intrinsic parameter matrices of the right and left cameras respectively.

\[ p_r^T F\, p_l = 0 \tag{21} \]
If the equalities in Equations (19) and (21) hold, meaning the products are equal to zero, the two points p_r and
p_l lie on corresponding epipolar lines in each image.
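As a sketch of how the fundamental matrix might be estimated from matched features in MATLAB (as is done for the uncalibrated case in Section 7.2.3), the code below assumes matched1 and matched2 are matched point sets such as those produced by the SURF example earlier, and checks that the inliers approximately satisfy Equation (21).

```matlab
% Sketch: estimating the fundamental matrix from matched points in MATLAB and
% checking the epipolar constraint of Equation (21). matched1/matched2 are
% assumed to come from a prior feature-matching step.
[F, inliers] = estimateFundamentalMatrix( ...
    matched1, matched2, 'Method', 'RANSAC', 'NumTrials', 2000);

% Verify that inlying correspondences approximately satisfy p_r' * F * p_l = 0
pl = [matched1.Location(inliers, :), ones(nnz(inliers), 1)];   % left points, homogeneous
pr = [matched2.Location(inliers, :), ones(nnz(inliers), 1)];   % right points, homogeneous
residuals = sum((pr * F) .* pl, 2);      % one epipolar residual per correspondence
fprintf('Mean |epipolar residual|: %.2e\n', mean(abs(residuals)));
```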
3.3.3. Rectification of stereo image pairs
Stereo rectification aims to align the images on a common virtual plane so that corresponding points occur in each
image on the same horizontal line. Knowing the extrinsic parameters (the rotation and translation from one
camera axis to another), along with the essential and fundamental matrices, allows a
transformation between frames to be calculated. This transformation consists of a rotation,
translation and, often, a scaling operation such that the epipolar lines become horizontal. Figure 11
shows the rectification process visually.
Figure 11 – Rectified image pair and horizontal epipolar lines
A number of algorithms can be used to both calculate and transform the epipolar lines of a stereo
image pair [46], with some requiring calibration and others able to estimate both the essential and
fundamental matrices to a lesser degree of accuracy during rectification.
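A minimal MATLAB sketch of calibrated rectification is shown below; it assumes a stereoParams object such as the one produced by the calibration sketch in Section 3.2.4, and the image names are placeholders.

```matlab
% Sketch: rectifying a calibrated stereo pair in MATLAB so that corresponding
% points lie on the same image row. stereoParams is assumed to come from a
% prior calibration step; image names are placeholders.
I1 = imread('left_scene.png');
I2 = imread('right_scene.png');

[J1, J2] = rectifyStereoImages(I1, I2, stereoParams);

% Overlay the rectified pair as a red-cyan anaglyph; rows should now align
figure; imshow(stereoAnaglyph(J1, J2)); title('Rectified stereo anaglyph');
```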
3.4. Stereo correspondence
After rectification has taken place, the search for matches along horizontal scanlines (the epipolar
lines) can begin. There are a variety of challenges that come with having two different views of the
same object and numerous ways of overcoming them. Ensuring good matches are made is essential
in the estimation of depth but this stage also relies on an accurate calibration process. Rectification
is not always essential, however the epipolar geometry is always required to be known or
estimated.
3.4.1. The stereo correspondence problem
The stereo correspondence problem stems from the lack of prior knowledge of where
correspondences may occur in an image given points in its paired image. Using cameras with a
resolution of 640x480 means that any one pixel could have any one of 307,200 pixel matches in the
other image. It is expensive and infeasible to test every one of these pixels against every other in a
paired image so ways of overcoming this problem are required.
The solution comes in the form of epipolar geometry: the epipolar constraint states that any point
on one image plane's epipolar line will correspond to a point on the paired image plane's
corresponding epipolar line. Knowing this, the number of potential matches reduces drastically, as
a 2-dimensional search area becomes a single-dimensional vector.
Rectification further simplifies the problem, as all matches then occur on the same horizontal line in an
image, but it is not essential if the epipolar geometry is known [39]. This is again due to the epipolar
constraint: if the epipolar lines are known or can be computed, a pixel in one image will
correspond to another along the epipolar line. Despite the epipolar lines not being horizontal in the case of
unrectified images, the search space is still only a 1-dimensional vector.
3.4.2. Stereo matching process
A taxonomy by Scharstein and Szeliski in 2002 [50] identified four key building blocks or steps that
were present in the algorithms of the time and are still applicable today. These steps are as follows:
1) Matching cost computation
2) Cost aggregation
3) Disparity computation
4) Disparity refinement
Matching cost computation is simply the cost associated with finding correspondences in images,
while cost aggregation is finding the total summed cost and making a decision based on these
matched costs. The costs mentioned here reflect the similarity of any two candidate
correspondences, and the aim is to minimise the total cost function which, theoretically, results in
the strongest correspondences. Disparity computation is the stage which acts on these decisions to
find the disparity between matched points, and the final stage is to process and refine this disparity
value.
As well as these general steps, there are multiple classes of matching methods: local methods and
global methods. There are also less commonly used methods such as dynamic programming and
cooperative algorithms. Generally, local methods focus on the cost computation and aggregation
stages, whereas in global methods the majority of the work is carried out in the disparity computation
stage, as they are primarily concerned with the image as a whole [50].
3.4.3. Semi-Global Block Matching
There are various stereo matching algorithms available; however, one algorithm was selected, not
only due to its implementation in both OpenCV and MATLAB but also due to its performance. Semi-
global matching was first described by Hirschmüller in 2005 [38][39] and offers a way of creating dense
disparity maps with sub-pixel precision, achieving quality akin to that obtained using global
algorithms in a fraction of the time [39][51].
Semi-Global Matching (SGM) is implemented in both MATLAB and OpenCV as Semi-Global Block
Matching (SGBM) and works in a local neighbourhood of pixels in order to determine
correspondences in stereo image pairs. These implementations use a Birchfield-Tomasi cost
calculation [52] as opposed to mutual information.
The SGM algorithm follows the four steps in Section 3.4.2 from Scharstein and Szeliski’s taxonomy
[50].
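The sketch below shows one plausible way of invoking MATLAB's semi-global implementation on a rectified pair; the parameter values are illustrative, and the tuning actually used in the project is discussed in Section 7.3.

```matlab
% Sketch: computing a disparity map from a rectified pair with MATLAB's
% semi-global matching implementation. J1 and J2 are assumed to be the
% rectified colour images from the rectification sketch in Section 3.3.3;
% the parameter values are illustrative.
G1 = rgb2gray(J1);
G2 = rgb2gray(J2);

disparityMap = disparity(G1, G2, ...
    'Method', 'SemiGlobal', ...       % semi-global block matching
    'DisparityRange', [0 64], ...     % search range in pixels (width must be a multiple of 16)
    'BlockSize', 15, ...              % odd matching-window size
    'UniquenessThreshold', 15);       % reject ambiguous matches

figure; imshow(disparityMap, [0 64]); colormap jet; colorbar;
```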
3.4.3.1. Matching cost computation
For SGM, the matching cost between a base image, B, and a matching image, M, can be calculated
on a pixelwise scale. For any pixel, p, on the base image, its intensity can be represented as I_Bp,
and similarly the suspected corresponding pixel on the matching image, q, has an intensity I_Mq,
where q is found using the computed epipolar line e_BM and Equation (22):
\[ q = e_{BM}(p, d) \tag{22} \]

where d is an estimated disparity value between the images, provided they have been rectified, and e_BM(p, d) represents the epipolar line, found using Equation (23):

\[ e_{BM}(p, d) = [\,p_x - d, \; p_y\,]^T \tag{23} \]
While the images themselves are not required to be rectified, it is essential that the epipolar
geometry exhibited between them is known.
Generally, pixelwise matching can be inaccurate so a small matching window is used around the
pixel of interest. The size and shape of the window are important parameters to consider as they
will have a direct impact on the disparity image produced [51]. A commonly used window is a
variable-sized, odd-sided square neighbourhood of pixels in which the centre pixel is the pixel to be
matched. Changing the dimensions of the window will affect the disparity image produced:
increasing the window size makes the matching process more robust; however, due to the
assumption of constant disparity within the window, object edges or smaller areas can become
blurred [39][51][53].
The Birchfield-Tomasi cost function C_BT(p, d) seeks to find the minimum difference between I_Bp
and I_Mq, which corresponds to a minimum cost [39][52].
3.4.3.2. Cost aggregation
Calculating cost on a pixelwise scale can often be incorrect [39][51][53]. There is a potential that
noise in an image can produce varying degrees of uncertainty. An incorrect match could be
perceived to be a better match than the noisy true correspondence due to its lower cost function.
Ideally, matching individual pixels would provide the truest representation of the scene but
individual pixels rarely contain sufficient information for this to be realised [53].
Global matching methods sum matching cost over the image and aim to find the disparity image, 𝐷,
which minimises Equation (24):
\[ E(D) = \sum_p \left( C(p, D_p) + \sum_{q \in N_p} P\, T[\,|D_p - D_q| \geq 1\,] \right) \tag{24} \]
The first term corresponds to the sum of all matching costs across the image, with the second term
adding a penalty, P, for neighbouring pixels whose disparity differs from that of p. Such differences
in disparity are permitted if the match is particularly strong.
The SGM algorithm adds a third term which lessens the penalty on neighbouring pixels with a small
disparity step, more akin to a curve or sloped surface than a discontinuity. Equation (25)
shows this new relation:
\[ E(D) = \sum_p \left( C(p, D_p) + \sum_{q \in N_p} P_1\, T[\,|D_p - D_q| = 1\,] + \sum_{q \in N_p} P_2\, T[\,|D_p - D_q| > 1\,] \right) \tag{25} \]
The first term is again the summed matching cost across the image, the second term adds the small
penalty, P1, associated with small disparity steps, and the third term adds a larger penalty, P2, for
larger steps in disparity.
The global minimisation of the energy function E(D) in 2-dimensions over the disparity image, D,
can be described as NP-complete and therefore takes substantial time to compute. Minimisation in
1-dimension, however, can be performed efficiently [39][53]. The aggregated cost is the sum of the
costs of 1-dimensional paths terminating at pixel p from all directions, giving a new, smoothed
aggregate cost S(p, d).
The cost along a path to p in a certain direction, r, is denoted as $L'_r(p, d)$ and is calculated using Equation (26) below:

$L'_r(p, d) = C(p, d) + \min\left\{ L'_r(p-r, d),\; L'_r(p-r, d-1) + P_1,\; L'_r(p-r, d+1) + P_1,\; \min_i L'_r(p-r, i) + P_2 \right\}$    (26)
The $p - r$ term in (26) corresponds to the previous pixel in the path, and at each iteration the minimum cost of the previous pixel is added to the cost of the current pixel. This creates a perpetually increasing value of $L'$ [39], which can lead to very large costs in larger images. This increasing cost can be combated by altering Equation (26) to that seen in Equation (27), adding a term which subtracts the minimum path cost of the previous pixel from the new cost [38][39].
$L_r(p, d) = C(p, d) + \min\left\{ L_r(p-r, d),\; L_r(p-r, d-1) + P_1,\; L_r(p-r, d+1) + P_1,\; \min_i L_r(p-r, i) + P_2 \right\} - \min_k L_r(p-r, k)$    (27)
As mentioned previously, the smoothed, or aggregated, cost of a particular match, $S(p, d)$, is simply the sum of all the path-wise costs to a pixel and can be found using Equation (28):

$S(p, d) = \sum_{r} L_r(p, d)$    (28)
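As an illustration of the recurrence in Equation (27), the sketch below aggregates a pre-computed pixelwise cost volume along a single left-to-right path. It is a simplified, single-direction version of the aggregation used in SGM; the cost volume C and the penalties P1 and P2 are assumed inputs, not values used in this project.

```matlab
% Minimal sketch of 1-D SGM cost aggregation along one path (left to right).
% C is an assumed pixelwise cost volume of size [rows, cols, numDisparities];
% P1 and P2 are assumed smoothness penalties.
function L = aggregatePathLR(C, P1, P2)
    [rows, cols, nD] = size(C);
    L = zeros(rows, cols, nD);
    L(:, 1, :) = C(:, 1, :);                       % first pixel on each path
    for x = 2:cols
        prev    = squeeze(L(:, x-1, :));           % costs of the previous pixel
        minPrev = min(prev, [], 2);                % min_k L_r(p-r, k)
        for d = 1:nD
            sameD  = prev(:, d);
            stepDn = inf(rows, 1);  stepUp = inf(rows, 1);
            if d > 1,  stepDn = prev(:, d-1) + P1; end
            if d < nD, stepUp = prev(:, d+1) + P1; end
            jumpD = minPrev + P2;
            best  = min(min(sameD, stepDn), min(stepUp, jumpD));
            % Equation (27): add the current cost, subtract the previous minimum
            L(:, x, d) = squeeze(C(:, x, d)) + best - minPrev;
        end
    end
end
```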
3.4.3.3. Disparity computation
A disparity image, $D_B$, created using the base image, $I_B$, as reference corresponds to the one which, for every pixel, minimises the smoothed cost function $S(p, d)$, giving Equation (29):

$D_B(p) = \operatorname{arg\,min}_d S(p, d)$    (29)
The disparity image, $D_M$, can also be calculated for the matching image, $I_M$, by exploiting the epipolar geometry exhibited by the images. By going from a pixel, q, in $I_M$ to the corresponding pixel, p, in $I_B$, the disparity can be selected such that it is produced by the minimum cost and can be found using Equation (30):

$D_M(q) = \operatorname{arg\,min}_d S(e_{MB}(q, d), d)$    (30)
As well as utilising the epipolar geometry of the image pairs, 𝐷 𝑀 can be calculated using the same
steps as calculating 𝐷 𝐵 but in this case 𝐼 𝑀 is used as a base image and 𝐼 𝐵 is the match image. The
addition of a second disparity image to this process can yield improved results at the expense of an
increased runtime [38][39].
If both disparity images $D_B$ and $D_M$ have been determined, occlusions and false matches can be detected by performing a consistency check. Each disparity value of one image is compared to that of the other and, if the difference is greater than one, the disparity is set to invalid. As well as producing a higher quality disparity image, the check acts as a uniqueness constraint and enforces one-to-one mappings only. This check is performed using Equation (31).
$D_p = \begin{cases} D_{Bp} & \text{if } |D_{Bp} - D_{Mq}| \leq 1 \\ D_{inv} & \text{otherwise} \end{cases}, \qquad q = e_{BM}(p, D_{Bp})$    (31)
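A minimal sketch of this left-right consistency check is given below, assuming rectified images with DB and DM as assumed input disparity maps and invalid pixels marked as NaN.

```matlab
% Minimal sketch of the left-right consistency check in Equation (31).
% DB and DM are assumed disparity maps computed with the base and matching
% images as reference respectively; inconsistent pixels are marked as NaN.
function DP = consistencyCheck(DB, DM)
    [rows, cols] = size(DB);
    DP = nan(rows, cols);
    for y = 1:rows
        for x = 1:cols
            d  = DB(y, x);
            qx = round(x - d);                 % corresponding column in DM
            if isfinite(d) && qx >= 1 && qx <= cols && abs(d - DM(y, qx)) <= 1
                DP(y, x) = d;                  % consistent: keep the disparity
            end                                % otherwise left as invalid (NaN)
        end
    end
end
```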
The estimation of depth from the computed disparity is discussed in Section 3.5.
3.4.3.4. Disparity Refinement
Finally, SGM's disparity refinement stage addresses three distinct problems commonly present in disparity images. The first is the removal of peaks, or outliers, which occur in the image pair due to areas of low texture or a high concentration of noise, such as reflections. In order to improve these areas, further post-processing techniques, such as thresholding or segmentation, are often required [51]. Thresholds can be used to filter out small invalid regions of connected pixels, and segmentation based on drastic changes in disparity value can also be used to highlight areas for filtering. The second refinement stage is intensity consistent disparity selection, which addresses a problem that arises in predominantly indoor, low-texture environments. If a detailed, textured object in the foreground appears in front of a texture-less background, distorted edges can occur around the object because SGM aggregates cost along multiple 1-dimensional paths which may not all agree on a disparity step. In order to combat this, Hirschmüller applied a fixed bandwidth mean shift segmentation [39][51][54]. The third and final form of disparity refinement is discontinuity preserving interpolation, which aims to fill in the obscured areas caused by occlusion due to the translation of the cameras. By identifying the direction from which to interpolate the background, these occluded areas can be estimated based on the surrounding disparities.
Both the first and third refinement techniques were applied in post processing in an attempt to
improve the disparity images returned by the algorithm.
3.5. Disparity computation and depth estimation
Once correspondences have been determined in the stereo images, the disparity can be calculated
by exploiting the geometry of triangles. Figure 12 shows the rectified stereo model from a top
down, perpendicular to the virtual image plane, view. Given a point 𝑃 in space in the field of view
of two cameras with optical centres 𝑂 𝑅 and 𝑂 𝑇 respectively with baseline 𝐵, the matched points 𝑝
and 𝑝′ correspond to the visualisation of 𝑃 on the rectified image planes. The pixel corresponding
to 𝑃 on the reference image plane 𝜋 𝑅, 𝑝, is 𝑥 𝑅 pixels from the left-hand side of the image on a
particular scanline whereas 𝑝′ on the target image pane, 𝜋 𝑇, is a distance of 𝑥 𝑇 from the left of the
resulting image on the same y-coordinate scanline.
Figure 12 – Depth estimation through the use of a rectified binocular stereo system
The disparity, 𝑑, is equivalent to the difference in x-coordinates of any corresponding pixels. In
Figure 12 above, the disparity is equivalent to (𝑥 𝑅 − 𝑥 𝑇). Knowing the disparity, the depth, 𝑍, can
be estimated using the derivation shown in Equation (32).
$\dfrac{B}{Z} = \dfrac{(B + x_T) - x_R}{Z - f}$    (32)

$\dfrac{Z - f}{Z} = \dfrac{(B + x_T) - x_R}{B}$

$1 - \dfrac{f}{Z} = \dfrac{(B + x_T) - x_R}{B}$

$\dfrac{f}{Z} = 1 - \dfrac{(B + x_T) - x_R}{B}$

$Z = \dfrac{f}{1 - \dfrac{(B + x_T) - x_R}{B}}$

$Z = \dfrac{B \times f}{B - B - x_T + x_R}$

$Z = \dfrac{B \times f}{x_R - x_T} = \dfrac{B \times f}{d}$
The disparity, $x_R - x_T$, is inversely proportional to depth, meaning that objects have a higher disparity the closer they are to the imaging system. This occurs because a near object appears in drastically different locations in the two resulting images. Figure 13 highlights this inversely proportional relationship between disparity and depth. Using either the value for disparity or depth, a basic 3-dimensional representation, or disparity map, of a scene can be created.
Figure 13 – Variation in disparity with depth
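As a minimal illustration of the relation $Z = Bf/d$ derived above, the sketch below converts a disparity map into a depth map; the baseline, focal length and disparity map are assumed inputs rather than this project's calibrated values.

```matlab
% Minimal sketch of depth estimation from disparity, Z = B*f/d.
% B (baseline, metres), f (focal length, pixels) and the disparity map d
% (pixels) are assumed inputs; non-positive disparities are treated as invalid.
B = 0.12;                       % assumed baseline of 12 cm
f = 700;                        % assumed focal length in pixels
Z = (B * f) ./ d;               % element-wise depth in metres
Z(~(d > 0)) = NaN;              % mask out invalid disparities
imshow(Z, []); colormap jet; colorbar; title('Estimated depth (m)');
```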
4. Structured light and time-of-flight infrared
The steps described previously correspond to what is called a passive stereo vision system, as no active illumination or other external processes are needed in order to calculate depth. One example
of an active stereo system is the Microsoft Kinect and its use of infrared light to calculate depth in
a scene.
4.1. The Microsoft Kinect v1 and structured light infrared
The first iteration of Microsoft’s Kinect sensor uses a technology called structured light which
projects a known pattern of infrared light into a scene. The projected light becomes deformed by
the geometric shape of objects in the scene and from this deformation, depth can be inferred.
The Kinect v1 is comprised of an RGB colour camera, an infrared emitter and an infrared receiver, arranged as illustrated in Figure 14.
Due to the infrared receiver and emitter being in offset positions, similar techniques utilised in
passive stereo vision are employed in the Kinect sensor. By calculating the disparity between the
projected pattern and the received pattern through triangulation, depth can be inferred using the
infrared emitter and receiving camera and Equation (32).
The Kinect sensor uses a speckled light pattern [31] comprising dots of infrared light, similar to the one seen in Figure 15. Triangulation is used between the two views in order to infer depth; however, rather than attempting to match a single projected pixel to a single pixel in the received scene, a 9x9 window is matched using correlation [32]. The centre pixel of the window with the highest correlation is taken as the match and the disparity can then be calculated accordingly.
Figure 14 – Microsoft Kinect v1 sensor camera arrangement – credit Microsoft
4.2. The Microsoft Kinect v2 and time of flight infrared
Due to Microsoft's proprietary restraints, limited technical information specific to the updated Kinect sensor is available, other than that it uses a technology called Time-of-Flight (ToF) [32][33]. ToF cameras measure the time taken for infrared light to be emitted, reflect off a surface and return to the sensor. When this travel time is known, the distance can be calculated and the depth of an object in a scene can therefore be estimated. Figure 16 shows the hardware configuration of the Kinect v2 sensor.
ToF sensors illuminate a scene with infrared light which is modulated using continuous wave intensity modulation [32]. The emitted intensity modulated light is periodic and, upon reflection off a surface, returns to the sensor with a small phase shift. This phase shift corresponds to a time shift, $\Phi$ [s], between emission and reception of the optical signal, which can be found for each sensor pixel. The distance travelled can be calculated from this time shift and the constant, c, which represents the speed of light, and the depth is half of the total distance travelled.
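As a simple worked illustration of this relationship (the time shift used below is an assumed example value, not a measurement): the round-trip distance is the product of the speed of light and the time shift, so the depth is

$Z = \dfrac{c \, \Phi}{2}$

and an assumed time shift of $\Phi = 20\,\mathrm{ns}$ would correspond to a depth of $Z = (3 \times 10^8\,\mathrm{m/s} \times 20 \times 10^{-9}\,\mathrm{s})/2 = 3\,\mathrm{m}$.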
Figure 15 - Speckled light structured light pattern – credit Trevor Taylor (MSDN) [68]
Figure 16 - Microsoft Kinect v2 sensor camera arrangement – credit Microsoft
5. Facial detection using the Viola-Jones algorithm
After the initial research into facial recognition, it was discovered that a method of face detection
prior to recognition was necessary so as to avoid characterisation of non-face objects in the
background. The Viola-Jones algorithm offers a robust, real-time machine learning approach to the object detection process with broad applications; however, it has a lengthy training period [34][35]. The algorithm is comprised of three major elements: the first is the integral image, the second is a set of simple classifiers built using AdaBoost, and the third is the combination of these weak classifiers in an increasingly complex cascade structure [34].
The detection algorithm aims to find simple rectangular features in images and classify them
accordingly. This method uses features, as opposed to pixels, as feature based systems encode
more information and are faster than pixel based systems [34]. The features used are similar to the
Haar basis functions used in [55] and are coined Haar-like features. There are three specific forms of feature used in the algorithm: two-rectangle, three-rectangle and four-rectangle features, which are shown in Figure 17.
In each rectangular feature, the sum of pixels in the lighter region is subtracted from the sum of
pixels in the dark region to obtain a value for that feature. Using these features allows for rapid
identification of potential discriminatory facial features such as eyes and noses and these examples
are shown in Figure 18 from [34].
Figure 17 – Set of Haar-like features, A) two-rectangle horizontal feature, B) two-
rectangle vertical feature, C) three-rectangle feature and D) four-rectangle feature
The integral image proposed by Viola and Jones in [34] provides a rapid method of evaluating features. A particular (x, y) coordinate in the integral image is equal to the sum of all pixels above and to the left of (x, y) and is calculated using Equation (33):

$ii(x, y) = \sum_{x' \leq x,\, y' \leq y} i(x', y')$    (33)
Every pixel in the image produced using Equation (33) is the sum of all the pixels in the region
contained above and to the left of it. Knowing this, the rectangular regions of images can be
evaluated using only four reference points. Figure 19 shows an example of an integral image with
four regions A, B, C and D. Location 1 is the sum of all values within region A, location 2 is the sum
of all pixels within regions A and B collectively. Similarly, location 3 is the sum of all pixels in regions
A and C and finally location 4 is the sum of all the pixels in the entire image.
Figure 18 – Facial feature detection using Haar-like features - credit Viola/Jones [34]
Figure 19 - Integral image layout example
In order to find the sum of pixels in region D, only the values of these four locations are needed
and the sum within D can be calculated as:
$D = l(4) + l(1) - \big(l(2) + l(3)\big)$    (34)
$D = (A + B + C + D) + A - \big((A + B) + (A + C)\big)$
$D = 2A - 2A + B - B + C - C + D$
$D = D$
Similar processes to the one described in Equation (34) can be applied to any rectangular region
and this characteristic of the integral image is what makes it well suited to evaluating Haar-like
features. For any two-rectangle feature six reference points are needed (as two of the eight
individual locations are shared), similarly for a three-rectangle feature where eight reference points
are needed and finally nine are needed for a four-rectangle feature [34].
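As a minimal sketch of how the integral image allows a rectangular sum to be evaluated from four references, as in Equation (34), the fragment below builds the integral image with a cumulative sum; the test image and rectangle coordinates are illustrative assumptions.

```matlab
% Minimal sketch of building an integral image and evaluating a rectangular sum
% from four reference values, as in Equation (34).
I  = double(imread('cameraman.tif'));     % assumed example greyscale image
ii = cumsum(cumsum(I, 1), 2);             % integral image: ii(y,x) = sum of I(1:y,1:x)

% Assumed rectangle: rows r1..r2, columns c1..c2 (all > 1 for simplicity)
r1 = 50; r2 = 80; c1 = 60; c2 = 100;
rectSum = ii(r2, c2) - ii(r1-1, c2) - ii(r2, c1-1) + ii(r1-1, c1-1);

% Check against a direct summation of the region
assert(rectSum == sum(sum(I(r1:r2, c1:c2))));
```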
After the features from the image have been evaluated they must then be classified; AdaBoost [56][57] is used to select features and train a set of weak classifiers [34]. There are too many rectangular features to classify individually whilst retaining the efficiency offered by using them, so the most promising features are selected and combined to form an efficient classifier. The first two features selected by AdaBoost are shown in Figure 18 and correspond with properties common to faces: the first is that the region of the eyes is darker than the nose and cheeks of a subject, and the second is that the eyes will be less illuminated than the bridge of the subject's nose [34].
By cascading simple classifiers based on more general features first, followed by more complex classifiers looking for very particular features, the computation time can be reduced drastically whilst also providing a low false positive rate [34]. Figure 20 shows this cascade structure, where the higher the layer number, the more complex the classifier. The final classifier in the chain outputs the candidate regions that pass every stage for further processing.
Figure 20 - Cascade classifier structure
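MATLAB's Computer Vision System Toolbox provides a pre-trained Viola-Jones detector through vision.CascadeObjectDetector; the short sketch below illustrates its typical use. The image file name is an assumption, and the default pre-trained frontal face model is used rather than a cascade trained on this project's own data.

```matlab
% Minimal sketch of face detection with MATLAB's pre-trained Viola-Jones cascade.
detector = vision.CascadeObjectDetector();        % default frontal face model
I     = imread('subject.png');                    % assumed input image
bbox  = step(detector, I);                        % bounding boxes of detected faces
annot = insertObjectAnnotation(I, 'rectangle', bbox, 'Face');
imshow(annot); title('Detected faces');
```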
6. Facial recognition
Following on from an initial investigation into methods of facial recognition, further algorithms were researched and tested for their applicability to the data obtained from both the stereo camera system and the Kinect sensor. At first, only the Fisherfaces method [1][24][58] was implemented, while both Eigenfaces [24][25][29][58] and Local Binary Pattern Histograms [2][24] were researched. Due to the unsuccessful application of Fisherfaces to the captured data, multiple other methods of facial recognition were also tested, with the subsequent implementation of the successful methods.
6.1. Eigenfaces
Eigenfaces uses Principal Components Analysis (PCA) [59] to reduce the dimensionality of a data set by identifying its principal components. The principal components are the directions of greatest variance in the data, while the remaining components are discarded.
Eigenfaces can be applied to facial recognition problems by projecting training images into a PCA subspace [58] and then projecting an image to be recognised into this trained subspace. By finding the nearest neighbour to the query image using a distance measure such as the Euclidean, Manhattan or Mahalanobis distance, the test image can be matched to a trained image and potentially recognised.
The process of dimensionality reduction begins with representing an N×N face image, I, by an N²×1 vector, Γ, and subtracting the average face of the set from it. The resulting mean-subtracted vector is represented by Φ. Figure 21 shows the representation of an image as a vector. In order to carry out facial recognition, a training set of faces is reduced in dimension and the eigenvectors of the images are projected into the subspace.
Figure 21 -Representation of an image as a vector
The following steps were used to compute the eigenfaces of a training set [25]:
For the M images in the training set $I_1, I_2, \ldots, I_M$, each image, $I_i$, was represented as a vector, $\Gamma_i$, and the average face vector, $\Psi$, could be computed using Equation (35):

$\Psi = \dfrac{1}{M} \sum_{i=1}^{M} \Gamma_i$    (35)
The mean face calculated in Equation (35) was then subtracted from every face vector to obtain the mean-subtracted image vector $\Phi_i$ using Equation (36):

$\Phi_i = \Gamma_i - \Psi$    (36)
The covariance matrix of the images, C, was then calculated using Equation (37):

$C = \dfrac{1}{M} \sum_{n=1}^{M} \Phi_n \Phi_n^T = A A^T \quad (N^2 \times N^2 \text{ matrix})$    (37)

$\text{where } A = [\Phi_1 \; \Phi_2 \; \ldots \; \Phi_M] \quad (N^2 \times M \text{ matrix})$
The eigenvectors, $u_i$, of $A A^T$ were then computed; however, $A A^T$ is a very large matrix and is not practical for any computation. Instead the eigenvectors, $v_i$, of $A^T A$ (an $M \times M$ matrix) are calculated using the derivation shown below in (38):

$A^T A v_i = \mu_i v_i \;\Rightarrow\; A A^T A v_i = \mu_i A v_i \;\Rightarrow\; C A v_i = \mu_i A v_i \;\Rightarrow\; C u_i = \mu_i u_i, \text{ where } u_i = A v_i$    (38)
The non-zero eigenvalues of $A A^T$ are identical to those of $A^T A$, and their eigenvectors, $u_i$ and $v_i$, are related by Equation (39):

$u_i = A v_i$    (39)

where $u_i$ are the eigenvectors of $A A^T$ and $v_i$ are the eigenvectors of $A^T A$.
Each face could then be represented as a combination of the K best eigenvectors, as seen in Equation (40):

$\hat{\Phi}_i = \sum_{j=1}^{K} w_j u_j, \quad \text{where } w_j = u_j^T \Phi_i$    (40)
The $u_j$ are the eigenfaces of the set and the $w_j$ are their corresponding weights. Each normalised face, $\Phi_i$, can be represented by a vector, $\Omega_i$, containing the weights of the eigenfaces used to represent it, which is then used to project images into the eigenspace:

$\Omega_i = [w_1^i \; w_2^i \; \ldots \; w_K^i]^T, \quad i = 1, 2, \ldots, M$    (41)
The mean face as well as the six best eigenfaces from the two Stereo training datasets (greyscale
and disparity) are shown in Section 7.6.1.
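A minimal sketch of this training procedure is given below, assuming the training faces have already been detected, cropped to a common size and loaded into the columns of a matrix; the variable names and the choice of K are illustrative only, not the values used in this project.

```matlab
% Minimal sketch of eigenface training (Equations (35)-(41)).
% Gamma is an assumed N^2-by-M matrix whose columns are the vectorised
% training faces; K is the assumed number of eigenfaces to retain.
K   = 20;
Psi = mean(Gamma, 2);                      % mean face, Equation (35)
A   = Gamma - Psi;                         % mean-subtracted faces Phi_i, Eq. (36)

% Eigenvectors of the small M-by-M matrix A'*A (Equation (38))
[V, mu] = eig(A' * A);
[~, order] = sort(diag(mu), 'descend');    % strongest components first
V = V(:, order(1:K));

U = A * V;                                 % eigenfaces u_i = A*v_i, Eq. (39)
U = U ./ sqrt(sum(U.^2, 1));               % normalise each eigenface

Omega = U' * A;                            % K-by-M weight vectors, Eq. (41)
```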
In order to recognise faces, a similar procedure is followed. The query image is required to be the
same size and be centred in a similar fashion to the training images. In this case the face detection
algorithm used centred and resized the images as part of its process. The query image, 𝛤, is first
normalised by again subtracting the mean face of the trained set of images:
𝛷 = 𝛤 − 𝛹 (42)
The now normalised image is projected onto the PCA subspace, or eigenspace [25], using Equation (43):

$\hat{\Phi} = \sum_{j=1}^{K} w_j u_j, \quad \text{where } w_j = u_j^T \Phi$    (43)
The face image, $\Phi$, can then be represented by a vector of weights, $\Omega$:

$\Omega = [w_1 \; w_2 \; \ldots \; w_K]$    (44)
In order to find a matching image, the minimum distance, $e_d$, between the projected query image, $\Omega$, and the i projected training images, $\Omega_i$, is found using Equation (45):

$e_d = \min_i \lVert \Omega - \Omega_i \rVert$    (45)

The minimum distance $e_d$ determines the training image $\Gamma_i$ that matches the query image, $\Gamma$.
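Continuing the training sketch above (and again using illustrative variable names), recognition of a query face then reduces to a projection followed by a nearest-neighbour search:

```matlab
% Minimal sketch of eigenface recognition (Equations (42)-(45)).
% gammaQuery is an assumed N^2-by-1 vectorised query face; Psi, U and Omega
% come from the training sketch above.
phiQuery   = gammaQuery - Psi;                   % normalise the query, Eq. (42)
omegaQuery = U' * phiQuery;                      % project into the eigenspace, Eq. (43)

dists = sqrt(sum((Omega - omegaQuery).^2, 1));   % Euclidean distance to each training face
[eMin, bestMatch] = min(dists);                  % Eq. (45): index of the matched face
```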
6.2. Fisherfaces
While PCA determines a linear combination of features that maximise the variance of a data set or
set of images, this process may not be optimal. The eigenfaces method does not consider that data
can be represented as a class and therefore potentially important and informative data can be
overlooked and discarded as seemingly redundant information [58]. Illumination in particular
constitutes variation in the data but this is not necessarily information that defines a particular
person or face and therefore can lead to misclassification or misrecognition.
Linear Discriminant Analysis (LDA), or Fisherfaces, performs class specific dimensionality reduction
[27][58] which maximises the between-class variance or between-class scatter, 𝑆 𝐵, and minimises
the within-class variance or within-class scatter, 𝑆 𝑊, which are found using Equations (46) and (47)
respectively.
For a data set $X = \{X_1, X_2, \ldots, X_C\}$ with C classes, where each class $X_i = \{x_1, x_2, \ldots, x_n\}$ has n samples:

$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$    (46)

$S_W = \sum_{i=1}^{C} \sum_{x_j \in X_i} (x_j - \mu_i)(x_j - \mu_i)^T$    (47)

Where: C is the number of classes, n is the number of samples, $N_i$ is the number of samples in class $X_i$, $\mu_i$ is the mean image of class $X_i$, and $\mu$ is the mean image of the entire set.
Fisher's original algorithm [26] then strives to find an optimal projection, $\Omega_{opt}$, which satisfies the conditions mentioned above regarding within- and between-class scatter. $\Omega_{opt}$ can be found using Equation (48):

$\Omega_{opt} = \arg\max_{\Omega} \dfrac{|\Omega^T S_B \Omega|}{|\Omega^T S_W \Omega|}$    (48)
Following the steps in [27], the projection matrix, $\Omega$, can be expressed in terms of PCA and LDA operations using Equation (49):

$\Omega^T = \Omega_{lda}^T \, \Omega_{pca}^T$    (49)
The optimised $\Omega_{pca}$ and $\Omega_{lda}$ terms can be calculated using Equations (50) and (51) respectively:

$\Omega_{pca} = \arg\max_{\Omega} |\Omega^T S_B \Omega|$    (50)

$\Omega_{lda} = \arg\max_{\Omega} \dfrac{|\Omega^T \Omega_{pca}^T S_B \Omega_{pca} \Omega|}{|\Omega^T \Omega_{pca}^T S_W \Omega_{pca} \Omega|}$    (51)
Using the resultant projection, $\Omega$, from Equation (49), the images can be projected into the subspace, optimising the class scatter.
Both Eigenfaces and Fisherfaces rely on a sufficiently large training database of faces to correctly
identify an unknown face [58]. Figure 22 shows the two algorithms have a generally positive trend
in recognition rate as the number of training images increases.
Figure 22 – Comparison of Eigenfaces and Fisherfaces algorithms (Database size vs Error Rate) [58]
6.3. Local Binary Pattern histograms
Local Binary Pattern (LBP) was originally created as a two dimensional surface texture description
operator that accounted for both spatial patterns and greyscale contrasts in an image [60]. LBP and
its extended algorithms exhibit invariance to both monotonic illumination changes and slight
rotations [61]. The original algorithm operates over a 3x3 square neighbourhood of pixels to
produce a binary description, or label, of the eight pixels surrounding the centre based on a
thresholding operation where the threshold used is the centre pixel.
Figure 23 shows this operation, where the centre pixel has a value of 38 and each of the surrounding eight pixels is assigned a zero if its value is less than the threshold set by the centre pixel, or a one otherwise. In the case of Figure 23, the label is comprised of the 8-bit binary number formed by reading the eight thresholded values circularly, giving the label 01111001, or 121 in decimal. The histogram of the resulting labels can be used to describe texture in sections of an image known as cells.
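A minimal sketch of the original 3x3 operator is shown below; it computes an LBP label for every interior pixel of an assumed greyscale image and then forms the label histogram. The sampling order of the neighbours is one of several equivalent conventions, and the test image is an assumed example.

```matlab
% Minimal sketch of the original 3x3 LBP operator and its label histogram.
% I is an assumed greyscale image; border pixels are skipped for simplicity.
I = double(imread('cameraman.tif'));
[rows, cols] = size(I);
labels = zeros(rows, cols);

% Clockwise neighbour offsets starting at the top-left pixel (one convention)
offs = [-1 -1; -1 0; -1 1; 0 1; 1 1; 1 0; 1 -1; 0 -1];

for y = 2:rows-1
    for x = 2:cols-1
        c = I(y, x);                              % centre pixel acts as threshold
        bits = zeros(1, 8);
        for k = 1:8
            bits(k) = I(y + offs(k,1), x + offs(k,2)) >= c;
        end
        labels(y, x) = sum(bits .* 2.^(7:-1:0));  % 8-bit binary label as decimal
    end
end

H = histcounts(labels(2:end-1, 2:end-1), 0:256);  % 256-bin LBP label histogram
```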
The original operator was then extended to consider different sizes of circular neighbourhoods with
radius 𝑅. As 𝑅 increases the number of points, 𝑃, that can fall on the edge of the neighbourhood
encapsulated by the circle also increases and the algorithm is no longer restricted to an eight
element 3x3 array of pixels. However, as the radius changes, the sampling points may not correspond with exact pixel coordinates and therefore bilinear interpolation is applied to estimate the pixel value at each sampling point [2][60][62]. The neighbourhood being investigated is described in the form (P, R)
and various examples for different values of 𝑅 and 𝑃 are shown below in Figure 24.
Figure 23 – Original LBP algorithm comprised of a 3x3 neighbourhood of pixels to be thresholded.
LBP is used in textural analysis because the patterns returned from a neighbourhood are easily classifiable, either visually or programmatically. Figure 25 shows five potential results from an (8,1) circular operator used to classify various neighbourhoods based only on pixel values.
Once an LBP labelled image, $f_l(x, y)$, has been obtained, the LBP histogram for that image can be defined using Equation (52) [60]:

$H_i = \sum_{x, y} I\{f_l(x, y) = i\}, \quad i = 0, \ldots, n - 1$    (52)
where n is the number of LBP labels and:

$I\{A\} = \begin{cases} 1, & A \text{ is true} \\ 0, & A \text{ is false} \end{cases}$
The histogram created using Equation (52) only takes the distribution of localised patterns into consideration. In order to retain the information conveyed by the spatial features present in faces, the image can be split into m regions $R_0, R_1, \ldots, R_{m-1}$ and an improved histogram can be created using Equation (53):
Figure 24 – Local binary patterns using (8,1) [Left], (16,3) [Centre], and (16,4) [Right] neighbourhoods
Figure 25 – Local binary patterns found using a (8,1) operator [58]
$H_{i,j} = \sum_{x, y} I\{f_l(x, y) = i\} \, I\{(x, y) \in R_j\}, \quad i = 0, \ldots, n - 1, \; j = 0, \ldots, m - 1$    (53)
Using a description such as the one in Equation (53) gives information on three levels of locality [62]: the first, at a pixel level, describes the patterns in small neighbourhoods; the second is the summation of these patterns to form a histogram of the spatial information in each particular region; and the third is the concatenation of these regional histograms into one which describes the image globally. The global histograms representing different images can be compared and, by selecting a match based on a minimisation of error, facial recognition can be achieved.
6.4. Eigensurfaces and Fishersurfaces
Although significant advancements have been made in the field of facial recognition, many
algorithms concerned with 2-dimensional data are susceptible to inaccuracies caused by the
environment in which images are captured. Environmental factors can include changes in
illumination, pose and facial expression, all of which affect recognition rates and make the need for consistent conditions paramount [40][41]. Using 3-dimensional data in the form of facial models reduces the dependence on consistent capture conditions and provides greater robustness to changes in orientation and expression of faces [63].
Eigensurfaces is the application of 3-dimensional data to the Eigenfaces algorithm [40]. Similarly,
Fishersurfaces is the application of 3-dimensional data to the Fisherfaces algorithm [27][41].
6.5. Feature detectors
In addition to both 2-dimensional and 3-dimensional facial recognition algorithms, multiple feature
detection and matching algorithms were investigated for their potential use in this project. As this
was considered late in the timeline of the project, functions that were available natively in MATLAB
were considered for use. The feature detectors included in MATLAB are as follows: SURF [12], the Harris corner detector [44], Binary Robust Invariant Scalable Keypoints (BRISK) [43] and the FAST feature detector [45].
SIFT [11][64] did not have a native implementation in MATLAB (as of version 2016b); although a third party implementation was available, due to time constraints it was not included.
As all but SURF and Harris returned insufficient features, only these two methods could be implemented and tested; some results are shown in Sections 7.6.3 and 7.6.4 respectively.
7. Experimental results
This section exhibits a subset of the results obtained at every stage of the project, passive and active
stereo vision in both OpenCV and MATLAB, image post-processing, facial detection and facial
recognition.
7.1. Camera calibration
As discussed in Section 3.2 camera calibration is the first and, quite possibly, the most important
step in any stereo vision system. For the purposes of this project, calibration was implemented in OpenCV initially and then again using MATLAB and the Stereo Camera Calibrator app. In both
methods a calibration object, which in this case was a chessboard pattern, was held in front of the
camera pair with variations in position and rotation so as not to bias the calibration process.
7.1.1. Camera calibration in OpenCV
OpenCV has multiple native functions to apply the theory discussed in Section 3.2. Using a
chessboard calibration pattern, multiple calibration images could be acquired by taking snapshots
of a video feed. Snapshots could only be taken providing a chessboard had been detected using the
OpenCV function findChessboardCorners() which could be visualised by superimposing the
chessboard back onto the image pair using drawChessboardCorners(). Figure 26 shows the detected
chessboards superimposed onto the images of the calibration object by using the two
aforementioned functions.
Using the points corresponding to the detected chessboard corners the camera can be calibrated
using the stereoCalibrate() function which takes the known position of points in the calibration
object (as its dimensions are known) and the points from each image corresponding to these object
points and returns the calibrated intrinsic and extrinsic camera parameters. As discussed in the
theory, these parameters are used in future stages and are essential to the operation of the system
as a whole.
Figure 26 – Detected chessboard in camera calibration boards visible in stereo image pairs using OpenCV
7.1.2. Camera calibration using the MATLAB Stereo Camera Calibrator app
A similar calibration process to the one developed with OpenCV was created in MATLAB using the
Stereo Camera Calibrator App initially and then using this example to develop an original
framework. The first step in carrying out calibration was once again to capture images of
chessboards and detect the chessboard points in each of them which allows the cameras to be
calibrated. Twenty image pairs, five of which are displayed in Figure 28, were captured using the
binocular stereo system and chessboards could be detected in them using the
detectCheckerboardPoints() function in MATLAB that would return the pixel locations of the
chessboard object in each of the images.
Through prior knowledge of the physical dimensions of each of the chessboard squares and the
number of squares present on the calibration object, a set of world coordinates that apply to the
board can be specified. Using these world points along with the image points returned previously
allow for the calculation of camera parameters through the estimateCameraParameters() function.
Using the calibrated camera parameters, estimates of the location of each chessboard in the set of
calibration images can be made to a high level of accuracy. These reprojections are overlaid onto
the chessboards, seen in Figure 28 as green markers, and the reprojection errors for each image
pair, or the distance in pixels between the true point and the re-projected point, can be seen in
Figure 27.
Figure 27 - Pixel reprojection error between calibration board points and calibrated estimates
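A minimal sketch of the programmatic version of this calibration workflow is shown below; the image file lists and the 30 mm square size are assumptions used purely for illustration.

```matlab
% Minimal sketch of stereo calibration from chessboard image pairs in MATLAB.
% The file names and the square size are assumed example values.
leftImages  = {'left01.png',  'left02.png',  'left03.png'};
rightImages = {'right01.png', 'right02.png', 'right03.png'};

% Detect the chessboard corners in every image pair
[imagePoints, boardSize] = detectCheckerboardPoints(leftImages, rightImages);

% World coordinates of the corners, given an assumed 30 mm square size
squareSize  = 30;                                   % millimetres
worldPoints = generateCheckerboardPoints(boardSize, squareSize);

% Estimate the intrinsic and extrinsic parameters of the stereo pair
stereoParams = estimateCameraParameters(imagePoints, worldPoints);

% Visualise the reprojection errors and the extrinsic parameters
showReprojectionErrors(stereoParams);
figure; showExtrinsics(stereoParams);
```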
Figure 28 - Calibration boards 11-15 captured using a binocular stereo camera system and MATLAB
The extrinsic parameters, the location of each camera and of every board in the world frame, can also be visualised using MATLAB. This helps to ensure no bias is made towards any portion of the stereo system's field of view. The visualisation of the extrinsic parameters is shown in Figure 29.
7.2. Image pair rectification
Once the calibration stage had been completed and the parameters for the system were obtained,
new images could be captured and rectified before further stereo processing. Rectification was
carried out in both OpenCV and MATLAB with a calibrated camera arrangement as well as an
investigation into an uncalibrated rectification using feature detection.
Figure 29 – Extrinsic parameters from an arbitrary view
7.2.1. Image pair rectification in OpenCV
Using the parameters estimated during calibration, the images could be rectified so that they are aligned and the correspondence problem is therefore reduced. The first stage is to
capture a normal image pair, an example of which is shown in Figure 30.
By viewing the stereo anaglyph of these images, Figure 31, it becomes apparent that a large offset
exists in the location of a face. In order to perform stereo matching, these images need to be
warped and scaled through rectification.
OpenCV has multiple functions that aid in the rectification process; stereoRectify() computes the rectification transforms from the intrinsic and extrinsic parameters, and applying these transforms to the two images yields the rectified pair.
Figure 30 - Unrectified image pair (OpenCV)
Figure 31 - Stereo anaglyph of unrectified image pairs (OpenCV)
Figure 32 shows the rectified stereo pair. On first inspection, it may appear that there is very little
difference between the rectified and unrectified image pairs. However, on closer inspection of the
stereo anaglyph of the rectified image, shown in Figure 33, when compared to the anaglyph of the
unrectified images, the difference becomes clear.
The rectified stereo anaglyph shows little disparity in the face; however, on the body and in the background, there is an obvious linear shift in the horizontal x direction.
Figure 32 – Rectified image pair (OpenCV)
Figure 33 – Stereo anaglyph of rectified image pairs (OpenCV)
7.2.2. Calibrated image pair rectification in MATLAB
The rectification process initially carried out in OpenCV was replicated using MATLAB following on
from the calibration process described earlier. An image pair, Figure 34, could be captured using the stereo cameras, and viewing the stereo anaglyph of the images, Figure 35, highlights the disparity in the locations of the various objects in the scene as both an x- and y-axis shift.
The MATLAB function rectifyStereoImages() uses the same method described in theory and in the
OpenCV implementation to produce the rectified image pair shown in Figure 36.
Figure 34 - Unrectified Image pair (MATLAB)
Figure 35 – Overlaid anaglyph of the unrectified image pair
The rectification process can be verified by both viewing the stereo anaglyph, shown in Figure 37,
as well as determining whether the two images are horizontally aligned by using multiple horizontal
lines overlaid onto the image pair. These horizontal lines act as confirmations of the epipolar
constraint and are shown in Figure 38.
Figure 36 – Rectified image pair (MATLAB)
Figure 37 – Overlaid anaglyph of the rectified image pair
Figure 38 – Rectified image pair with horizontal (epipolar) lines highlighted
7.2.3. Un-calibrated image pair rectification in MATLAB
While calibration is important for any stereo vision system there are sometimes cases where it is
not possible. There has been some research into utilising methods of feature detection to find
correspondences to then calculate the fundamental matrix in order to rectify the images without
ever needing to calibrate the system [65]. Using a MathWorks example [66] for guidance, a framework was created which allowed for the uncalibrated rectification of image pairs.
Taking the same test image pair used in Section 7.2.2 above, Figure 34, and viewing its stereo
anaglyph, Figure 35, it is apparent these images are offset. With no camera parameters to calculate
the fundamental matrix, features in both images are detected instead using speeded-up robust
features (SURF). Using the MATLAB function detectSURFFeatures() on each of the input images
returned a multitude of features with the fifty strongest features in the right and left images being
shown in Figure 39 and Figure 40 respectively.
Figure 39 - Fifty best SURF features in the right image
Figure 40 - Fifty best SURF features in the left image
The strongest features in one image may not correspond to the strongest in the other; this is usually caused by occlusion. In order to determine correspondences, the features from each image first had to be extracted using the MATLAB function extractFeatures(), which returns a vector of features for each image, before an attempt is made to match these features using matchFeatures(). The resulting
matched features are displayed below in Figure 41, superimposed onto the stereo anaglyph of the
image pair.
Not all of these features are valid however. It is apparent from Figure 41 that a handful of features
have been matched with features corresponding to objects on the opposite side of the scene. These
outliers can be removed by applying the epipolar constraint by first estimating the fundamental
matrix using the matched features in conjunction with the Random Sample Consensus (RANSAC)
algorithm [67] in the MATLAB function estimateFundamentalMatrix(). This function takes a random
set of input features and aims to converge on an estimate for the fundamental matrix over a certain
number of iterations. The next stage is to assess whether the estimated matrix is acceptable by
using the function isEpipoleInImage() which determines whether the decomposition of the
fundamental matrix is valid. This stage also ensures there is an intersection of epipolar lines
between the two perceived camera centres and their image planes resulting in a set of epipoles as
visualised in Figure 10.
Figure 41 - Matched features in the image pair
The resulting valid points are those which adhere to the epipolar constraint and can be seen in
Figure 42.
Using the estimateUncalibratedRectification() function in MATLAB, the rectification transforms can be calculated based on the valid matched points and the fundamental matrix. The transforms returned here can then be used as inputs to the rectifyStereoImages() function, where previously the camera parameters found through calibration were required.
The uncalibrated rectified image pair’s stereo anaglyph is shown in Figure 43.
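A minimal sketch of this uncalibrated pipeline is given below; it follows the MathWorks example in outline, with the RANSAC trial count and the greyscale conversion being assumed choices rather than the exact settings used in this project.

```matlab
% Minimal sketch of uncalibrated rectification from SURF correspondences.
% I1 and I2 are an assumed unrectified image pair.
G1 = rgb2gray(I1);  G2 = rgb2gray(I2);

pts1 = detectSURFFeatures(G1);
pts2 = detectSURFFeatures(G2);
[f1, vpts1] = extractFeatures(G1, pts1);
[f2, vpts2] = extractFeatures(G2, pts2);

indexPairs = matchFeatures(f1, f2);
matched1 = vpts1(indexPairs(:, 1));
matched2 = vpts2(indexPairs(:, 2));

% Estimate the fundamental matrix with RANSAC and keep only the inliers
[F, inliers] = estimateFundamentalMatrix(matched1, matched2, ...
    'Method', 'RANSAC', 'NumTrials', 2000);
in1 = matched1(inliers);
in2 = matched2(inliers);

% Compute the projective transforms and rectify the pair
[t1, t2] = estimateUncalibratedRectification(F, in1, in2, size(I2));
[J1, J2] = rectifyStereoImages(I1, I2, projective2d(t1), projective2d(t2));
imshow(stereoAnaglyph(J1, J2));
```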
Figure 42 - Valid matched features after outlier removal
7.3. Disparity calculation
Once the image pairs had been rectified the disparity between corresponding pixels could be
determined through the use of the SGBM algorithm, described in Section 3.4.3, for both the
OpenCV and MATLAB implementations.
7.3.1. Disparity map creation in OpenCV
An SGBM object could be created and its parameters defined in order to compute disparity images
from a rectified image pair. Some examples of the disparity computed in OpenCV are shown in
Figure 44.
Figure 43 - Uncalibrated rectified image pair
It became apparent, despite testing and refining parameter choice, that the disparity maps being
produced in OpenCV were inferior to those produced in other applications using similar techniques,
only providing depth information on objects close to the camera. A substantial amount of effort was expended in order to resolve the issues presented here, especially in determining the ideal parameters to use in conjunction with this algorithm and the available image data. However, these steps provided little to no improvement and it was at this stage that the decision was taken to also develop a MATLAB implementation.
Figure 44 – Multiple disparity images created in OpenCV
7.3.2. Disparity map creation using the calibrated and uncalibrated models in MATLAB
The disparity could be estimated in MATLAB by providing the two rectified images as well as the
parameters needed for SGBM. Prior to carrying out this process and following on from issues
encountered in the OpenCV implementation, an investigation into selecting ideal parameters was
performed. Following this investigation, the parameters required during the call to the disparity()
function in MATLAB were optimised to provide the most informative disparity images. The same
parameters were used in order to determine disparity from both the calibrated and uncalibrated
methods. Figure 45 shows the disparity image created using the calibrated method and Figure 46
shows the uncalibrated method’s disparity image.
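A minimal sketch of this step is shown below; the parameter values are illustrative assumptions rather than the optimised values identified during the investigation.

```matlab
% Minimal sketch of semi-global matching disparity estimation in MATLAB.
% J1 and J2 are an assumed rectified image pair; the parameter values are
% illustrative only.
G1 = rgb2gray(J1);  G2 = rgb2gray(J2);
d  = disparity(G1, G2, 'Method', 'SemiGlobal', ...
               'DisparityRange', [0 64], 'BlockSize', 15, ...
               'UniquenessThreshold', 15);
imshow(d, [0 64]); colormap jet; colorbar; title('Disparity map');
```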
Figure 45 – Disparity map produced using the calibrated model for rectification
Figure 46 - Disparity map produced using the uncalibrated model for rectification
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition
3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition

More Related Content

Similar to 3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition

For the assignments in this course, you will be developing a Disas.docx
For the assignments in this course, you will be developing a Disas.docxFor the assignments in this course, you will be developing a Disas.docx
For the assignments in this course, you will be developing a Disas.docxmecklenburgstrelitzh
 
THE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKS
THE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKSTHE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKS
THE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKSDebashish Mandal
 
Smart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaSmart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaHendrik Drachsler
 
ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4
ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4
ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4nataliej4
 
RY_PhD_Thesis_2012
RY_PhD_Thesis_2012RY_PhD_Thesis_2012
RY_PhD_Thesis_2012Rajeev Yadav
 
Fundamentos de química orgánica y aplicaciones en ciencias de la tierra
Fundamentos de química orgánica y aplicaciones en ciencias de la tierraFundamentos de química orgánica y aplicaciones en ciencias de la tierra
Fundamentos de química orgánica y aplicaciones en ciencias de la tierrayachi ku
 
Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...
Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...
Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...Alberto Cuadrado
 
Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...
Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...
Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...Alberto Cuadrado
 
Business Dissertation Thesis
Business Dissertation ThesisBusiness Dissertation Thesis
Business Dissertation ThesisDr Bryan Mills
 
Karen Tsang (2016) - Perceived Glass Ceilings
Karen Tsang (2016) - Perceived Glass CeilingsKaren Tsang (2016) - Perceived Glass Ceilings
Karen Tsang (2016) - Perceived Glass CeilingsKaren Tsang
 
ICTs_for_Child_Protection_Case_Management_Research_HealthEnabled
ICTs_for_Child_Protection_Case_Management_Research_HealthEnabledICTs_for_Child_Protection_Case_Management_Research_HealthEnabled
ICTs_for_Child_Protection_Case_Management_Research_HealthEnabledwcphilbrick
 
My own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer PredictionMy own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer PredictionGabriele Mineo
 
Acupuncturists As Entrepreneurs Experiences Of New Professionals Founding Pr...
Acupuncturists As Entrepreneurs  Experiences Of New Professionals Founding Pr...Acupuncturists As Entrepreneurs  Experiences Of New Professionals Founding Pr...
Acupuncturists As Entrepreneurs Experiences Of New Professionals Founding Pr...Tye Rausch
 
MANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECTMANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECTERICK MAINA
 

Similar to 3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition (20)

Thesispdf
ThesispdfThesispdf
Thesispdf
 
For the assignments in this course, you will be developing a Disas.docx
For the assignments in this course, you will be developing a Disas.docxFor the assignments in this course, you will be developing a Disas.docx
For the assignments in this course, you will be developing a Disas.docx
 
THE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKS
THE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKSTHE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKS
THE IMPACT OF SOCIALMEDIA ON ENTREPRENEURIAL NETWORKS
 
eclampsia
eclampsiaeclampsia
eclampsia
 
Smart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaSmart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao Pargana
 
ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4
ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4
ỨNG DỤNG KỸ THUẬT HỌC SÂU TRONG CHẨN ĐOÁN BỆNH NGOÀI DA ceb245a4
 
Hssttx3
Hssttx3Hssttx3
Hssttx3
 
RY_PhD_Thesis_2012
RY_PhD_Thesis_2012RY_PhD_Thesis_2012
RY_PhD_Thesis_2012
 
Fundamentos de química orgánica y aplicaciones en ciencias de la tierra
Fundamentos de química orgánica y aplicaciones en ciencias de la tierraFundamentos de química orgánica y aplicaciones en ciencias de la tierra
Fundamentos de química orgánica y aplicaciones en ciencias de la tierra
 
libor maserk
libor maserklibor maserk
libor maserk
 
Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...
Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...
Prevalence and Determinats of Human Papillomavirus (HPV) Infection in Kerala,...
 
Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...
Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...
Prevalence and Determinants of Human Papillomavirus (HPV) Infection in Kerala...
 
AShruhanMSThesis
AShruhanMSThesisAShruhanMSThesis
AShruhanMSThesis
 
Business Dissertation Thesis
Business Dissertation ThesisBusiness Dissertation Thesis
Business Dissertation Thesis
 
Karen Tsang (2016) - Perceived Glass Ceilings
Karen Tsang (2016) - Perceived Glass CeilingsKaren Tsang (2016) - Perceived Glass Ceilings
Karen Tsang (2016) - Perceived Glass Ceilings
 
ICTs_for_Child_Protection_Case_Management_Research_HealthEnabled
ICTs_for_Child_Protection_Case_Management_Research_HealthEnabledICTs_for_Child_Protection_Case_Management_Research_HealthEnabled
ICTs_for_Child_Protection_Case_Management_Research_HealthEnabled
 
AUDIBERT_Julien_2021.pdf
AUDIBERT_Julien_2021.pdfAUDIBERT_Julien_2021.pdf
AUDIBERT_Julien_2021.pdf
 
My own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer PredictionMy own Machine Learning project - Breast Cancer Prediction
My own Machine Learning project - Breast Cancer Prediction
 
Acupuncturists As Entrepreneurs Experiences Of New Professionals Founding Pr...
Acupuncturists As Entrepreneurs  Experiences Of New Professionals Founding Pr...Acupuncturists As Entrepreneurs  Experiences Of New Professionals Founding Pr...
Acupuncturists As Entrepreneurs Experiences Of New Professionals Founding Pr...
 
MANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECTMANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECT
 

Recently uploaded

Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 

Recently uploaded (20)

Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 

3-Dimensional Facial Model Creation Using Stereoscopic Imaging Techniques for Facial Recognition

  • 4. 3 Table of Contents Table of Figures.................................................................................................................................. 5 1. Introduction ............................................................................................................................... 8 2. Literature review........................................................................................................................ 9 3. Multi-camera stereo vision ...................................................................................................... 13 3.1. Camera models................................................................................................................. 13 3.2. Camera calibration........................................................................................................... 15 3.2.1. Camera parameters.................................................................................................. 15 3.2.2. Rotations and translations ....................................................................................... 16 3.2.3. Distortions................................................................................................................ 18 3.2.4. Chessboard calibration............................................................................................. 20 3.2.5. Homography............................................................................................................. 20 3.3. Image pair rectification .................................................................................................... 22 3.3.1. Epipolar geometry.................................................................................................... 22 3.3.2. The Essential and Fundamental matrices ................................................................ 24 3.3.3. Rectification of stereo image pairs........................................................................... 25 3.4. Stereo correspondence.................................................................................................... 26 3.4.1. The stereo correspondence problem....................................................................... 26 3.4.2. Stereo matching process.......................................................................................... 27 3.4.3. Semi-Global Block Matching .................................................................................... 27 3.5. Disparity computation and depth estimation.................................................................. 32 4. Structured light and time-of-flight infrared ............................................................................. 34 4.1. The Microsoft Kinect v1 and structured light infrared..................................................... 34 4.2. The Microsoft Kinect v2 and time of flight infrared......................................................... 35 5. Facial detection using the Viola-Jones algorithm..................................................................... 36 6. Facial recognition ..................................................................................................................... 39 6.1. Eigenfaces......................................................................................................................... 39 6.2. Fisherfaces........................................................................................................................ 42
  • 5. 4 6.3. Local Binary Pattern histograms ...................................................................................... 44 6.4. Eigensurfaces and Fishersurfaces .................................................................................... 46 6.5. Feature detectors............................................................................................................. 46 7. Experimental results................................................................................................................. 47 7.1. Camera calibration........................................................................................................... 47 7.1.1. Camera calibration in OpenCV................................................................................. 47 7.1.2. Camera calibration using the MATLAB Stereo Camera Calibrator app.................... 49 7.2. Image pair rectification .................................................................................................... 51 7.2.1. Image pair rectification in OpenCV .......................................................................... 52 7.2.2. Calibrated image pair rectification in MATLAB........................................................ 54 7.2.3. Un-calibrated image pair rectification in MATLAB................................................... 56 7.3. Disparity calculation......................................................................................................... 59 7.3.1. Disparity map creation in OpenCV........................................................................... 59 7.3.2. Disparity map creation using the calibrated and uncalibrated models in MATLAB 61 7.4. Active stereo with the Microsoft Kinect .......................................................................... 62 7.5. Image Post Processing and Face detection using the Viola-Jones algorithm .................. 63 7.6. Facial Recognition ............................................................................................................ 67 7.6.1. Eigenfaces and Eigensurfaces................................................................................... 67 7.6.2. Local Binary Pattern Histogram................................................................................ 69 7.6.3. SURF feature detection ............................................................................................ 71 7.6.4. Harris corner detection ............................................................................................ 71 7.6.5. Comparison and discussion of results...................................................................... 72 8. Conclusion................................................................................................................................ 74 9. Further work............................................................................................................................. 75 References........................................................................................................................................ 77 Appendix A – Confusion matrices produced by facial recognition algorithms................................ 81
  • 6. 5 Table of Figures Figure 1 - Pinhole camera model ..................................................................................................... 14 Figure 2 - Discretised pinhole model................................................................................................ 14 Figure 3 - Simple stereo camera model ........................................................................................... 15 Figure 4 - Rotations about orthogonal axes of a frame. .................................................................. 16 Figure 5 - Translation vector describing the change in position between two orthogonal frames. 17 Figure 6 - Combined rotation and translation between two orthogonal frames............................ 18 Figure 7 - Types of radial distortion. ................................................................................................ 19 Figure 8 - View of a planar chessboard described in the object plane and the image plane. G. Bradski and A. Kaehler, Learning OpenCV [46]................................................................................ 21 Figure 9 - Extended stereo camera model....................................................................................... 23 Figure 10 - Visualisation of Epipolar geometry................................................................................ 24 Figure 11 – Rectified image pair and horizontal epipolar lines........................................................ 25 Figure 12 – Depth estimation through the use of a rectified binocular stereo system................... 32 Figure 13 – Variation in disparity with depth................................................................................... 33 Figure 14 – Microsoft Kinect v1 sensor camera arrangement – credit Microsoft........................... 34 Figure 15 - Speckled light structured light pattern – credit Trevor Taylor (MSDN) [68].................. 35 Figure 16 - Microsoft Kinect v2 sensor camera arrangement – credit Microsoft............................ 35 Figure 17 – Set of Haar-like features, A) two-rectangle horizontal feature, B) two-rectangle vertical feature, C) three-rectangle feature and D) four-rectangle feature ................................................. 36 Figure 18 – Facial feature detection using Haar-like features - credit Viola/Jones [34].................. 37 Figure 19 - Integral image layout example....................................................................................... 37 Figure 20 - Cascade classifier structure............................................................................................ 38 Figure 21 -Representation of an image as a vector ......................................................................... 39 Figure 22 – Comparison of Eigenfaces and Fisherfaces algorithms (Database size vs Error Rate) [58] ................................................................................................................................................... 43 Figure 23 – Original LBP algorithm comprised of a 3x3 neighbourhood of pixels to be thresholded. .......................................................................................................................................................... 44 Figure 24 – Local binary patterns using (8,1) [Left], (16,3) [Centre], and (16,4) [Right] neighbourhoods............................................................................................................................... 
45 Figure 25 – Local binary patterns found using a (8,1) operator [58] ............................................... 45 Figure 26 – Detected chessboard in camera calibration boards visible in stereo image pairs using OpenCV ............................................................................................................................................ 48 Figure 27 - Pixel reprojection error between calibration board points and calibrated estimates .. 49
  • 7. 6 Figure 28 - Calibration boards 11-15 captured using a binocular stereo camera system and MATLAB............................................................................................................................................ 50 Figure 29 – Extrinsic parameters from an arbitrary view ................................................................ 51 Figure 30 - Unrectified image pair (OpenCV)................................................................................... 52 Figure 31 - Stereo anaglyph of unrectified image pairs (OpenCV)................................................... 52 Figure 32 – Rectified image pair (OpenCV)...................................................................................... 53 Figure 33 – Stereo anaglyph of rectified image pairs (OpenCV)...................................................... 53 Figure 34 - Unrectified Image pair (MATLAB) .................................................................................. 54 Figure 35 – Overlaid anaglyph of the unrectified image pair........................................................... 54 Figure 36 – Rectified image pair (MATLAB) ..................................................................................... 55 Figure 37 – Overlaid anaglyph of the rectified image pair............................................................... 55 Figure 38 – Rectified image pair with horizontal (epipolar) lines highlighted................................. 55 Figure 39 - Fifty best SURF features in the right image.................................................................... 56 Figure 40 - Fifty best SURF features in the left image...................................................................... 56 Figure 41 - Matched features in the image pair............................................................................... 57 Figure 42 - Valid matched features after outlier removal ............................................................... 58 Figure 43 - Uncalibrated rectified image pair .................................................................................. 59 Figure 44 – Multiple disparity images created in OpenCV............................................................... 60 Figure 45 – Disparity map produced using the calibrated model for rectification.......................... 61 Figure 46 - Disparity map produced using the uncalibrated model for rectification....................... 61 Figure 47 - Microsoft Kinect RGB image captured........................................................................... 62 Figure 48 - Depth mapped colour image and depth image from the Kinect sensor........................ 62 Figure 49 - Greyscale Kinect image .................................................................................................. 63 Figure 50 - Detected face using the Viola-Jones algorithm on the Kinect greyscale data............... 63 Figure 51 - Greyscale depth mapped Kinect RGB image.................................................................. 64 Figure 52 - Face detected in the depth mapped greyscale image and in the depth image............. 64 Figure 53 – Image pair captured from the passive binocular stereo camera system...................... 65 Figure 54 – Processed greyscale base image ................................................................................... 65 Figure 55 – Unaltered disparity image returned from the SGBM algorithm................................... 
66 Figure 56 – Disparity image after CLAHE has been applied ............................................................. 66 Figure 57 – Detected faces in the greyscale base image and in the computed disparity map........ 67 Figure 58 - Mean face of the stereo greyscale training dataset ...................................................... 68 Figure 59 -The six most varied Eigenfaces of the stereo greyscale training dataset....................... 68 Figure 60 – The six most varied Eigensurfaces of the stereo disparity training dataset ................. 69 Figure 61 - Mean surface of the stereo disparity training dataset .................................................. 69
  • 8. 7 Figure 62 – Image pair with matching subject................................................................................. 70 Figure 63 – Image pair with non-matching subjects........................................................................ 70 Figure 64 – LBP Histogram error results analysis and match determination................................... 70 Figure 65 – Matched SURF features between test and training images from the Kinect greyscale dataset.............................................................................................................................................. 71 Figure 66 - Matched Harris features between test and training images from the Kinect greyscale dataset.............................................................................................................................................. 71 Figure 67 – Percentage of correct facial matches for each method of facial recognition and each dataset.............................................................................................................................................. 73 Figure 68 – Match times per image for each method of facial recognition and each dataset........ 74
  • 9. 8 1. Introduction
Closed-Circuit Television (CCTV) security monitoring forms an integral part of daily life, acting both as a real-time protection and prevention tool and as a useful retrospective investigative tool. In the event of a serious incident, this public and private space CCTV footage is supplemented with vast amounts of image and video data obtained by private citizens via various devices such as smartphones, tablets and traditional digital cameras. As a result of the multitude of sources presented, any available footage will have varying position, resolution and field of view, each providing fragmented partial images of the full scene. If the relative positions of these cameras are known, or can be estimated, they can be combined to form a view of a scene unlike that which could be given by a single image. In addition to CCTV, facial detection and recognition also prove substantially useful as investigative tools, allowing potential suspects to be identified automatically and rapidly, whereas prior to the development of such techniques the process was a manual and laborious undertaking. Methods of facial recognition are becoming increasingly robust and this is an area where significant research has been undertaken [1][2][3], resulting in innovations and improvements on existing techniques. Other forms of image-recognition-based security system, such as those at airport security checkpoints (London Heathrow, for instance), include cameras at border control which capture frontal images of faces for comparison prior to permitting a passenger to board their flight. To avoid the additional complexities that come from having multiple cameras positioned at various extreme angles and distances around a room, a more idealised system akin to airport security checkpoint cameras was developed. While not widely used in security systems, 3-dimensional model creation is used extensively in other areas such as computer animation and computer-generated imagery (CGI) in both the film and video game industries [4]. Similar 3-dimensional imaging and modelling is also prevalent in the medical industry, being used for diagnosis in the form of magnetic resonance imaging (MRI), or through the use of 3-dimensional modelling and printing for surgical preparation and practice, prosthetics, synthetic tissue and organ transplants along with other implants [5]. This project aims to utilise various digital image processing techniques to estimate depth, employing two distinctly different imaging techniques. The first of these is stereo vision, or stereoscopy, and the second is the emission and reception of infrared light using a Microsoft Kinect sensor; both of these methods provide an idealised emulation of CCTV source footage or mimic a system such as the
  • 10. 9 example at London Heathrow airport. The output from both of these image acquisition systems can be used to create 3-dimensional models of a subject’s face which can then be used for 3- dimensional facial recognition purposes. As well as using this 3-dimensional data, the images themselves can be utilised to carry out traditional 2-dimensional facial recognition against a database of faces. This report is split into four parts with the first part being comprised of sections 1 and 2, which form the introduction and review of literature relevant to the project. Sections 3 through 6 form the second part which is concerned with the relevant background details of the techniques covered in literature. Section 7 displays the results gathered over the course of the project and the final part, sections 8 and 9 form the conclusion of work and potential future work to be investigated respectively. 2. Literature review As discussed in section 1, there were two main broad areas of initial investigation and research to be carried out in this project: 1) 3-dimensional model creation and 2) facial recognition. As the project progressed, various other avenues of analysis were explored as and when it became appropriate to do so. Both the 3-dimensional model creation and facial recognition have a strong reliance on the set of images captured from the various cameras in a passive stereo imaging system as well as those obtained using the colour camera and infrared sensor in the Microsoft Kinect which can also be thought of as an active stereo imaging method. The model creation relies on the stereo system and implemented algorithms to accurately find correspondences between the image pairs returned. This allows for a good estimation of depth in the shared field of view of the cameras in the system [6] which can then be used in further stages. The reliance on the Microsoft Kinect sensor in the 3- dimensional imaging is lesser as it has a specialised infrared sensor to accurately estimate depth. The facial recognition stages have a reliance on not only the raw images from the stereo and Kinect systems, but also on the estimated depth for the implementation of 3-dimensional recognition techniques. The ability to accurately match faces based on a set of training data relies inherently on the images acquired from the various sources available. There are two widely used methods of 3-dimensional modelling using passive stereo imaging, these are stereo vision, which this project focuses on, and structure (or stereo) from motion (SfM). These methods are similar in the way they construct 3-dimensional models from image data however they
  • 11. 10 vary in the way they capture said data, stereo vision being a multiple-camera static system viewing an object from multiple angles while SfM consists of a single moving camera capturing sequential images or video footage of an object [7][8][9][10]. Both stereo imaging and SfM rely on feature detection methods in order to match points in image pairs or image sequences to then be able to estimate depth from them at all. There are various industry standard methods, two of the most known and well-used algorithms are the Scale- Invariant Feature Transform (SIFT) [11] or Speeded-Up Robust Features (SURF) [12]. These algorithms have been compared in various papers, observing and evaluating their speed, computation cost and accuracy [13][14][15]. The general conclusion of the comparisons drawn is that SIFT is the more robust of the two, returning a far greater number of matched points more regularly at the expense of both computation cost and speed. SURF runs far faster than SIFT and therefore has a justification for use under certain circumstances that require real-time or faster computation over robustness. Initially it was believed these algorithms were available in OpenCV, which is the open source computer vision library being used in this project, however they were removed from the native version of the library in version 3 and required additional installations which were not included due to time constraints. They are however available in MATLAB and were used in an investigation into uncalibrated depth estimation (see Section 7.2.3) as well as for their potential use towards facial recognition (Section 7.6.3). Although depth estimation can be achieved through the use of a single sensor, as is the case in the technology used in the Kinect sensor, true stereo vision requires a minimum of two cameras and uses image correspondences to estimate depth, taking advantage of known or estimated camera positions. There have been various papers published on the use of two camera, three camera [16][17][18][19] and multi-camera scalable arrays [20][21]. With increased cameras, the complexity and computation cost of the associated depth estimation increases, but so does the accuracy of the system and, as such, would produce higher resolution results. As highlighted in [6] and [22] , as well as, most other literature encountered on the subject, stereo imaging requires offline pre-processing in order to relate the cameras in the system to each other and their relative positions in the real world that they are imaging. This is achieved through camera calibration. Camera calibration can be accomplished through the use of a known pattern which is used to calculate the parameters associated with the stereo cameras. Any pattern with known structure and size can be used although two common patterns that stand out most in literature are the checkerboard and asymmetrical circle patterns [10][23]. Camera calibration corrects distortions and calculates the camera parameters that will be used in further steps of stereo imaging. Due to
  • 12. 11 this dependence on calibration exhibited by both the facial recognition and depth estimation stages, it is essential that images are captured and corrected accurately. If images go forward having been poorly corrected or uncorrected, error will be accrued in the steps discussed in Section 3, which could ultimately mean a poor depth estimation is produced. It is possible to apply facial recognition to 2-dimensional and 3-dimensional data, with a substantial amount of research effort being focused in these areas over the past two decades [24]. It is generally recognised that two methods of facial recognition exist: global methods which, as the name suggests, compare faces as a whole [24], and local methods which attempt to match faces through the comparison of small neighbourhoods of images, such as local binary pattern (LBP) [2], or by using feature detection and extraction methods such as SIFT and SURF [11][12] together with some similarity measure such as a distance between matched features. Two of the most well-known and well-used algorithms for global facial recognition are Eigenfaces, first described by Turk and Pentland in 1991 [24][25], which uses Principal Component Analysis (PCA), and Fisherfaces, which uses Fisher's Linear Discriminant Analysis. The latter was first theorised in 1936 by R.A. Fisher as a classification method for species [26] and his method has since been applied to the facial recognition problem, where it first carries out the same steps as the Eigenfaces method but then expands on this to increase classification accuracy (see Section 6.2). Both of the aforementioned methods utilise PCA for dimensionality reduction on the data in order to reduce the computation required and improve run-time. Due to their similarity and popularity they have been compared closely in the literature [27][28][29]. In addition to 2-dimensional methods, another method of facial recognition is to model a face in 3-dimensional space and compare its spatial characteristics with those of models stored in a database. These techniques have similar associated steps to those in traditional 2-dimensional methods, however they exploit an additional dimension of data and are immune to variations in illumination or pose [3][24]. The literature reviewed until this stage was part of an initial research phase prior to the submission of the interim report. Due to the nature of the project and its broad spectrum of component parts, a secondary review of further material was conducted after the initial interim report was written. This was done in order to both refresh knowledge in certain areas and also look into new or updated features and technologies that could be applied to the project. As mentioned above, the Kinect sensors use infrared light in order to aid depth estimation and because of this they are referred to as active stereo imaging systems. The technology behind the Kinect sensors was investigated further in order to implement a system to estimate depth and carry
  • 13. 12 out facial recognition using data collected along with that from the passive system constructed in order to compare their results. The Kinect v1 uses a technology called structured light [30][31][32] which scatters a known infrared pattern into a scene and the distortions of which allow the estimation of depth whereas the newer model the Kinect v2 uses infrared time of flight [32][33] which utilises the emission and reflection of IR light measuring the time taken between emitting a pulse and receiving the same pulse which, knowing the speed at which the light travels, allows the depth to be half the distance travelled. Due to hardware restrictions, only the Kinect v2 Time of Flight (ToF) sensor was used for the purposes of this project. Prior to the submission of the interim report, it was identified that some method of automatically and dynamically detecting faces would be required due to the quantities of data to be accrued. One algorithm, created by Paul Viola and Michael Jones, offered equivalent speed and accuracy in its detection of faces compared to the best alternatives of the time [34]. The Viola-Jones algorithm introduces three novel contributions [34][35][36], the first being the use of an integral image, a form of image representation that allows for rapid identification of prominent facial features. The second is the use of a boosting algorithm called AdaBoost [37] to construct a classifier and the third and final contribution is the combination of increasingly more complex classifiers in a cascade structure [34][35]. Due to its use as a stereo correspondence algorithm as standard in both MATLAB and OpenCV algorithms, the Semi-Global Block Matching (SGBM) algorithm first developed by Heiko Hirschmüller in 2005 [38] [39] was investigated. This algorithm uses mutual information (MI) in its matching of stereo correspondences, although this feature is not implemented in either the MATLAB or OpenCV implementations. This particular algorithm performed well against similar algorithms of the time, if not better where sub-pixel accuracy is concerned. The algorithm also boasts a swift runtime per image and a variety of adjustable parameters which allow for great control over the matching process. Due to this an additional investigation into the ideal parameters to use for the purposes of the project was performed. The algorithm and its component parts are described in greater detail in Section 3.4.3. Following on from initial research into methods of applying facial recognition to the collected image data, additional algorithms were investigated as the facial recognition stage became more of a focus. In particular, methods of 3-dimensional facial recognition that have stemmed from 2- dimensional techniques and employing feature detection algorithms to carry out facial recognition. 3-Dimensional variants of Eigenfaces and Fisherfaces, Eigensurfaces [40] and Fishersurfaces [41][42], were investigated and implemented. Feature detectors such as SURF and SIFT could be
  • 14. 13 utilised to perform facial recognition by extracting prominent features from images; through some metric, the features present in each image could then be measured against one another. These metrics can be various distance measurements between matched features or simply the number of matched features returned by these algorithms. Various other feature detectors present in MATLAB were also tested; these included Binary Robust Invariant Scalable Keypoints (BRISK) [43], the Harris feature detector [44] and the Features from Accelerated Segment Test (FAST) feature detector [45].
3. Multi-camera stereo vision
The first method of acquiring data to create models and to perform facial recognition was a passive stereo imaging system. This system was initially composed of three Logitech standard computer webcams; however, due to one camera lens having developed an unrepairable defect, the system was downgraded to using only two of the original three cameras. There are four main steps in a stereo vision system: camera calibration, image pair rectification, stereo correspondence, and disparity or depth estimation. Before attempting to carry out these steps, and in order to understand how the use of multiple cameras could estimate the depth in their field of view, the theoretical camera models and the image formation process were researched.
3.1. Camera models
One of the simplest representations of a camera is the pinhole camera model, where light converges to a single point through an image plane; this point is the camera centre, camera origin or optical centre. Any point, $P$, in space will be represented by a point, $p$, on the image plane, as shown in Figure 1.
  • 15. 14 In reality, the image plane has a finite number of elements, the number of which depends on the resolution; as such, the model becomes discrete, as shown in Figure 2, and the point $P$ now corresponds to some pixel, $[u_p, v_p]$, on the image plane. The point $P$ can exist anywhere on the vector passing through the camera origin, $p$ and $P$, with an infinite number of potential objects that the point $P$ could represent. Adding a second camera and viewing the same point $P$, whose projection is now notated as $p'$ in the second reference frame, allows for the inference of distance, or depth, as shown in Figure 3. This is the principle behind binocular or multi-view stereo imaging.
Figure 1 - Pinhole camera model
Figure 2 - Discretised pinhole model
  • 16. 15 3.2. Camera calibration
The first stage in carrying out stereo imaging is the camera calibration step. Calibration is performed offline, meaning it can be carried out and the results stored for later use, assuming the cameras are not moved or adjusted in any way. Calibration aims to find the extrinsic and intrinsic parameters of the cameras in the system [10][46], allowing coordinates in the camera frame(s) to be equivalent to coordinates in the world frame [9]. Extrinsic parameters are those which relate the cameras in the system to each other [46], along with their relative positions and orientations in 3-dimensional space, whereas intrinsic parameters relate to the geometry and optical characteristics that are internal and unique to each camera. The calibration process is described in the following sections.
3.2.1. Camera parameters
As discussed above, a camera can be described by its intrinsic and extrinsic parameters. The intrinsic parameters generally include the effective focal length, $f$, the scale factor, $s$ (alternatively known as the aspect ratio $\alpha$), and the image centre, referred to as either $[c_x, c_y]$ or $[u_0, v_0]$ in the literature. The intrinsic parameters of a camera can be described in terms of a 3x3 matrix, $M_{int}$, which relates the image points $x$, $y$ and $w$ to points in a 3-dimensional space $X$, $Y$ and $Z$. The third coordinate in the image frame, $w$, is equivalent to $Z$ using homogeneous coordinate systems [46]. This relation is shown in Equation (1), as described in [23] and [46].
Figure 3 - Simple stereo camera model
  • 17. 16 $$ q = M_{int}\,Q \;\rightarrow\; \begin{bmatrix} x \\ y \\ w \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{1} $$
where $f_x = -f/s_x$ and $f_y = -f/s_y$ are the scaled focal lengths expressed in pixel units.
The extrinsic parameters, $M_{ext}$, are used to transform object coordinates to a camera-centred coordinate frame and are essentially the joint rotation-translation matrix $[R|t]$, which is shown in Equation (6) in Section 3.2.2. The extrinsic parameters are also used to describe the relationship between cameras in the system using their individual frames of reference and, additionally, to describe the relative position and orientation against that of a global world frame.
3.2.2. Rotations and translations
Any object in a scene has an orientation, or pose, associated with it relative to the frame of the camera, which can be computed in terms of a rotation, $R$, and a translation, $t$ [46]. Any rotation can be defined in terms of individual angular rotations about the three orthogonal axes in a world frame: yaw about the z-axis, $\varphi$, pitch about the y-axis, $\psi$, and roll about the x-axis, $\omega$. Figure 4 displays these three rotations diagrammatically. The three rotations can be expressed individually by the three 3x3 matrices shown in Equations (2), (3) and (4) for roll, pitch and yaw respectively, and the total rotation can be expressed as the multiplication of these three individual rotation matrices, as expressed in Equation (5).
Figure 4 - Rotations about orthogonal axes of a frame.
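Returning to Equation (1): as a minimal, hedged illustration (not code from this project, and with placeholder intrinsic values), the following C++ sketch uses OpenCV's fixed-size matrix types to project a point expressed in the camera frame into pixel coordinates and normalise by the homogeneous coordinate.

```cpp
#include <opencv2/core.hpp>
#include <iostream>

int main()
{
    // Assumed intrinsic parameters (placeholder values in pixel units).
    const double fx = 800.0, fy = 800.0, cx = 320.0, cy = 240.0;
    const cv::Matx33d M_int(fx, 0.0, cx,
                            0.0, fy, cy,
                            0.0, 0.0, 1.0);

    // A point Q = [X Y Z]^T expressed in the camera frame (metres).
    const cv::Vec3d Q(0.1, -0.05, 2.0);

    // Equation (1): q = M_int * Q gives homogeneous image coordinates [x y w]^T.
    const cv::Vec3d q = M_int * Q;

    // Dividing by w (= Z) yields the pixel position [u v]^T on the image plane.
    const double u = q[0] / q[2];
    const double v = q[1] / q[2];

    std::cout << "Pixel coordinates: (" << u << ", " << v << ")\n";
    return 0;
}
```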
  • 18. 17 $$ R_x(\omega) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\omega & -\sin\omega \\ 0 & \sin\omega & \cos\omega \end{bmatrix} \tag{2} $$
$$ R_y(\psi) = \begin{bmatrix} \cos\psi & 0 & \sin\psi \\ 0 & 1 & 0 \\ -\sin\psi & 0 & \cos\psi \end{bmatrix} \tag{3} $$
$$ R_z(\varphi) = \begin{bmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{4} $$
$$ R = R_x(\omega)\,R_y(\psi)\,R_z(\varphi) \tag{5} $$
These sets of rotations describe an object's pose in a 3-dimensional world; its position relative to a world frame can also be described, this time in terms of a 3x1 vector whose elements contain the change in $x$, $y$ and $z$ position of the object frame's origin. The translation between a frame $\{A\}$ and another frame $\{B\}$ is denoted as $^{A}P_{B\,origin}$ and is displayed in Figure 5. Combining a rotation and a translation fully describes both an object's relative pose and its position in relation to a common frame. Figure 6 shows the rotation and translation between two frames, and the general transformation matrix, $T$, or the joint rotation-translation matrix, $[R|t]$, is shown in Equation (6).
Figure 5 - Translation vector describing the change in position between two orthogonal frames.
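As a small sketch of the composition in Equations (2)-(5) (illustrative only, not the project's code), the following C++ function builds the total rotation matrix from roll, pitch and yaw angles given in radians:

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Builds the total rotation of Equation (5), R = Rx(w) * Ry(psi) * Rz(phi),
// from the individual roll, pitch and yaw rotation matrices.
cv::Matx33d rotationFromRollPitchYaw(double roll, double pitch, double yaw)
{
    const cv::Matx33d Rx(1, 0, 0,
                         0, std::cos(roll), -std::sin(roll),
                         0, std::sin(roll),  std::cos(roll));      // Equation (2)

    const cv::Matx33d Ry( std::cos(pitch), 0, std::sin(pitch),
                          0,               1, 0,
                         -std::sin(pitch), 0, std::cos(pitch));    // Equation (3)

    const cv::Matx33d Rz(std::cos(yaw), -std::sin(yaw), 0,
                         std::sin(yaw),  std::cos(yaw), 0,
                         0,              0,             1);        // Equation (4)

    return Rx * Ry * Rz;                                           // Equation (5)
}
```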
  • 19. 18 $$ M_{ext} = [R|t] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \tag{6} $$
If both the rotation and translation between the camera frame(s) and the world frame are known, any point in the camera frame(s), $P_c$, can be related to its corresponding point in the world frame, $P_w$, using Equation (7).
$$ P_c = R(P_w - T) \tag{7} $$
3.2.3. Distortions
There are two main and noticeably visible forms of distortion associated with camera lenses: radial and tangential distortion [46]. Both are the consequence of physical properties of the lens and can be accounted for to an extent during the calibration process [47][48]. Radial distortion occurs in a lens as a result of non-planar geometry, which will result in a more eccentric angle of incidence of light passing through the lens the further from the optical centre it is.
Figure 6 - Combined rotation and translation between two orthogonal frames.
  • 20. 19 As such, the radial distortion is a function of the radial distance, $r$, and there are generally two categories of radial distortion, pincushion and barrel distortion, which are both visualised in Figure 7.
Figure 7 - Types of radial distortion.
Radial distortion can be described by the first few terms of a Taylor series [46] with coefficients $k_1$, $k_2$ and $k_3$. Generally, only the first two coefficients are required to fully describe the distortion unless it is particularly prominent. Radial distortion in the x and y axes can be corrected using Equations (8) and (9) respectively:
$$ x_{corrected} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \tag{8} $$
$$ y_{corrected} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \tag{9} $$
Tangential distortion arises from the lens not being parallel to the optical plane of the sensor and is purely a manufacturing defect. The tangential distortion is described by two parameters, $p_1$ and $p_2$, and can be minimised in the x and y axes by using Equations (10) and (11):
$$ x_{corrected} = x + [2p_1 xy + p_2(r^2 + 2x^2)] \tag{10} $$
$$ y_{corrected} = y + [p_1(r^2 + 2y^2) + 2p_2 xy] \tag{11} $$
The five distortion parameters needed to correct distortions in a lens are more often than not bundled into a single vector in software packages [46], in the form $[k_1, k_2, k_3, p_1, p_2]$, and these
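A hedged sketch of the correction terms in Equations (8)-(11) is given below. It is illustrative only: the function applies the radial and tangential terms to a point given in normalised image coordinates, and in practice this correction is handled by the calibration software's own undistortion routines (OpenCV, for example, performs it iteratively inside its undistortion functions).

```cpp
#include <cmath>

struct Point2 { double x, y; };

// Applies the radial (Equations (8)-(9)) and tangential (Equations (10)-(11))
// terms to a point in normalised image coordinates. k1, k2, k3 are the radial
// coefficients; p1, p2 are the tangential coefficients.
Point2 correctDistortion(Point2 p,
                         double k1, double k2, double k3,
                         double p1, double p2)
{
    const double r2 = p.x * p.x + p.y * p.y;   // squared radial distance
    const double r4 = r2 * r2;
    const double r6 = r4 * r2;

    const double radial = 1.0 + k1 * r2 + k2 * r4 + k3 * r6;

    Point2 corrected;
    corrected.x = p.x * radial + 2.0 * p1 * p.x * p.y + p2 * (r2 + 2.0 * p.x * p.x);
    corrected.y = p.y * radial + p1 * (r2 + 2.0 * p.y * p.y) + 2.0 * p2 * p.x * p.y;
    return corrected;
}
```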
  • 21. 20 distortion coefficients, along with the intrinsic camera parameters and the rotation and translation information, make ten parameters which need to be solved for. Using a flat, planar object to calibrate a camera system fixes eight of the ten parameters, but a minimum of two views of the object are needed for every camera to solve for the geometric parameters $R$ and $t$. This object is known as the calibration object [46], and for the purposes of this project an offset chessboard was used.
3.2.4. Chessboard calibration
In theory, any planar object with distinct known characteristics can be used as a calibration object; however, some practical choices are repeating patterns of circles or, more frequently, a chessboard [46][49]. Chessboards are commonly used as the repeating pattern of black and white squares causes little to no bias towards any one section of the calibration object [49]. An offset chessboard, or one with non-equal sized sides, is used to ensure that the bottom black corner of the chessboard is always used as the world origin. Calibrating the camera(s) with a chessboard returns the geometric parameters of the camera system, as the size of the squares, the number of squares and the number of inside corners on the board are known. By comparing the position of the inside points of the chessboard in every view returned from the cameras in the system, the camera positions themselves can be estimated and, as the lines in the object should appear perfectly straight, any distortions caused by the lens can be corrected. The more images captured, the more accurate the calibration will be.
3.2.5. Homography
In computer vision, homography is defined as the mapping of an object from one plane to another [46], and the homography matrix, $H$, contains the parameters to perform this mapping. For any point in space, $Q$, the corresponding point on the image plane, $q$, can be found using Equation (12).
$$ \tilde{q} = sH\tilde{Q} \quad\text{where } \tilde{q} = [x\;\; y\;\; 1]^T \text{ and } \tilde{Q} = [X\;\; Y\;\; Z\;\; 1]^T \tag{12} $$
  • 22. 21 Figure 8, from [46], shows this mapping from the object plane to the image plane through homography. The extrinsic, $M_{ext}$, and intrinsic, $M_{int}$, parameters of the camera, from Equations (6) and (1) respectively, also need to be taken into account while projecting objects onto the image frame. Equation (13) shows how the extrinsic and intrinsic parameters are used to project points between planes:
$$ \tilde{q} = sM_{int}M_{ext}\tilde{Q} \tag{13} $$
The projected coordinate, $\tilde{Q}$, is defined for all of space along the vector through $\tilde{Q}$ and $\tilde{q}$; the true point of interest lies on the object plane, where $Z = 0$. The rotation matrix, $R$, is divided into individual column vectors $r_1$, $r_2$ and $r_3$ and, for the purposes of this computation, $r_3$ is not required [46]. As a result, Equation (13) is reduced to the form shown in Equation (14).
$$ \tilde{q} = sM_{int}[r_1\;\; r_2\;\; r_3\;\; t]\,\tilde{Q} = sM_{int}[r_1\;\; r_2\;\; t]\,\tilde{Q}' \quad\text{where } \tilde{q} = [x\;\; y\;\; 1]^T,\; \tilde{Q} = [X\;\; Y\;\; 0\;\; 1]^T \text{ and } \tilde{Q}' = [X\;\; Y\;\; 1]^T \tag{14} $$
Figure 8 - View of a planar chessboard described in the object plane and the image plane. G. Bradski and A. Kaehler, Learning OpenCV [46]
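As a brief, hedged illustration of this plane-to-plane transfer (not code from the project), the following C++ helper applies a 3x3 homography to a point and normalises by the homogeneous coordinate, anticipating the point-transfer relations given in Equations (17) and (18) below; the inverse mapping simply uses the inverse of $H$.

```cpp
#include <opencv2/core.hpp>

// Maps a point on the source (object) plane to the destination (image) plane
// using a 3x3 homography H: p_dst = H * p_src, followed by division by the
// third (scale) coordinate. The reverse mapping would use H.inv() instead.
cv::Point2d mapThroughHomography(const cv::Matx33d& H, const cv::Point2d& src)
{
    const cv::Vec3d p_src(src.x, src.y, 1.0);   // homogeneous source point
    const cv::Vec3d p_dst = H * p_src;          // projective transfer
    return cv::Point2d(p_dst[0] / p_dst[2],
                       p_dst[1] / p_dst[2]);
}
```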
  • 23. 22 From Equation (14) it can be observed that the homography matrix, $H$, is now a 3x3 matrix and can be written as:
$$ H = sM_{int}[r_1\;\; r_2\;\; t] \tag{15} $$
Therefore,
$$ \tilde{q} = sH\tilde{Q}' \tag{16} $$
given that $s$ can be factored out of $H$. Now, with the homography matrix known, any point on a source (object) plane can be described on a destination (image) plane, and vice versa, using $H$ and Equations (17) and (18) respectively:
$$ p_{dst} = Hp_{src} \tag{17} $$
$$ p_{src} = H^{-1}p_{dst} \tag{18} $$
where $p_{dst} = [x_{dst}\;\; y_{dst}\;\; 1]^T$ and $p_{src} = [x_{src}\;\; y_{src}\;\; 1]^T$.
3.3. Image pair rectification
Once calibration has taken place, the parameters of the cameras have been found and the ability to map objects to and from different planes has been gained, the next stage is to rectify, or align, the cameras virtually. Rectification is the process of utilising the camera geometry, and the resulting image geometry, in order to warp the data to appear as if the cameras in the system are aligned on the same virtual plane; this process also significantly simplifies what is known as the correspondence problem (Section 3.4.1).
3.3.1. Epipolar geometry
The geometry exhibited by a stereo camera system is known as epipolar geometry. A binocular arrangement of cameras can be described in terms of two pinhole camera models, as seen in Figure
  • 24. 23 3. This model can be extended by adding the baseline, $b$, or the distance between the two camera centres, $O_R$ and $O_T$, which corresponds to the translation in space as calculated during calibration; this is visualised in Figure 9. Figure 9 also shows the problem that is solved using multiple cameras: points $P$ and $Q$ in space correspond with the same pixel on the reference image plane, $\pi_R$. By using a second, target, image plane, $\pi_T$, it becomes apparent that they are separate points, existing at $p'$ and $q'$ on the target image plane. The points of intersection with the baseline in each image plane are referred to as the epipoles, $e_R$ and $e_T$, and the triangle confined by the point $P$ in space and the two camera centres describes the epipolar plane. The lines on the image plane that join projected points and epipoles, or alternatively where the epipolar plane meets the image plane, are known as the epipolar lines, and Figure 10 displays these relationships.
Figure 9 - Extended stereo camera model
  • 25. 24 Due to the nature of epipolar geometry, any point along an epipolar line in one image plane will correspond to a point on the epipolar line in the other image plane; this is known as the epipolar constraint [6][46] and is vital, along with the stereo rectification stage, in solving the stereo correspondence problem.
Figure 10 - Visualisation of Epipolar geometry.
3.3.2. The Essential and Fundamental matrices
Two additional variables are needed in order to compute epipolar lines: the essential matrix, $E$, and the fundamental matrix, $F$. The essential matrix contains information on the rotation and translation between the cameras in space. The fundamental matrix differs slightly in that it also contains information about the intrinsic parameters of the cameras and therefore relates the cameras in terms of pixel coordinates [46]. Both matrices can be obtained during calibration, or alternatively estimated as shown in Section 7.2.3. The essential matrix is defined as the matrix that ensures the equality shown in Equation (19) holds true:
$$ p_r^T E\, p_l = 0 \tag{19} $$
where $p_l$ is a point in the left image and $p_r$ is the corresponding point in the right image.
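To make the constraint concrete, a minimal sketch (illustrative only, assuming $E$ and the point coordinates are already available and expressed consistently) simply evaluates the residual of Equation (19); for an exact correspondence and an exact essential matrix the residual is zero, so in practice a small tolerance is used. The same test applies to the fundamental matrix with points given in pixel coordinates, as in Equation (21).

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Evaluates the epipolar constraint of Equation (19) for a candidate pair of
// corresponding points given in homogeneous form.
bool satisfiesEpipolarConstraint(const cv::Matx33d& E,
                                 const cv::Vec3d& p_l,   // point in the left image
                                 const cv::Vec3d& p_r,   // point in the right image
                                 double tolerance = 1e-3)
{
    const double residual = p_r.dot(E * p_l);   // p_r^T * E * p_l
    return std::abs(residual) < tolerance;
}
```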
  • 26. 25 The fundamental matrix is defined in Equation (20) and again ensures that the equality in Equation (21) is observed.
$$ F = (M_{int,r}^{-1})^T E\, M_{int,l}^{-1} \tag{20} $$
where $M_{int,r}$ and $M_{int,l}$ are the intrinsic parameter matrices of the right and left cameras respectively.
$$ p_r^T F\, p_l = 0 \tag{21} $$
If the expressions in Equations (19) and (21) are equal to zero, it means that the two points $p_r$ and $p_l$ lie on the same epipolar line as viewed in each image.
3.3.3. Rectification of stereo image pairs
Stereo rectification aims to align the images perfectly on a virtual plane so that corresponding points occur in each image on the same horizontal line. Knowing the extrinsic parameters, the rotation and translation from one camera axis to another, along with the essential and fundamental matrices, allows a transformation between frames to be calculated. This transformation consists of a rotation, translation and, often, a scaling operation such that the epipolar lines become horizontal. Figure 11 shows the rectification process visually.
Figure 11 – Rectified image pair and horizontal epipolar lines
  • 27. 26 A number of algorithms can be used to both calculate and transform the epipolar lines of a stereo image pair [46], with some requiring calibration and others able to estimate both the essential and fundamental matrices, to a lesser degree of accuracy, during rectification.
3.4. Stereo correspondence
After rectification has taken place, the search for matches along horizontal scanlines (the epipolar lines) can begin. There are a variety of challenges that come with having two different views of the same object, and numerous ways of overcoming them. Ensuring good matches are made is essential in the estimation of depth, but this stage also relies on an accurate calibration process. Rectification is not always essential; however, the epipolar geometry is always required to be known or estimated.
3.4.1. The stereo correspondence problem
The stereo correspondence problem stems from the lack of prior knowledge of where correspondences may occur in an image given points in its pair image. Using cameras with a resolution of 640x480 means that any one pixel could have any one of 307,200 potential pixel matches in the other image. It is expensive and infeasible to test every one of these pixels against every other in a paired image, so ways of overcoming this problem are required. The solution comes in the form of epipolar geometry: the epipolar constraint states that any point on one image plane's epipolar line will correspond to a point on the paired image plane's corresponding epipolar line. Knowing this, the number of potential matches reduces drastically, as a 2-dimensional search area becomes a single-dimensional vector. Rectification further simplifies the problem, as all matches then occur on the same horizontal line in an image, but it is not essential if the epipolar geometry is known [39]. This is again due to the epipolar constraint: if the epipolar lines are known or can be computed, one pixel in one image will correspond to another along the epipolar line. Despite the epipolar lines not being horizontal in the case of unrectified images, the search space is still only a 1-dimensional vector.
  • 28. 27 3.4.2. Stereo matching process
A taxonomy by Scharstein and Szeliski in 2002 [50] identified four key building blocks, or steps, that were present in the algorithms of the time and are still applicable today. These steps are as follows:
1) Matching cost computation
2) Cost aggregation
3) Disparity computation
4) Disparity refinement
Matching cost computation is simply the cost associated with finding correspondences in images; cost aggregation is finding the total summed cost and making a decision based on these matched costs. The costs mentioned here are associated with the similarity of any two correspondences, and the aim is to minimise the total cost function which, theoretically, results in the strongest correspondences. Disparity computation is the stage which acts on these decisions to find the disparity between matched points, and the final stage is to process and refine this disparity value. As well as these general steps, there are multiple classes of matching methods: local methods and global methods. There are also less commonly used methods such as dynamic programming and cooperative algorithms. Generally, local methods focus on the cost computation and aggregation stages, whereas in global methods the majority of work is carried out in the disparity computation stage, as they are primarily concerned with the image as a whole [50].
3.4.3. Semi-Global Block Matching
There are various stereo matching algorithms available; however, one algorithm was selected, not only due to its implementation in both OpenCV and MATLAB but also due to its performance. Semi-global matching was first described by Hirschmüller in 2007 [39] and offers a way of creating dense disparity maps with sub-pixel precision, achieving quality akin to that obtained using global algorithms in a fraction of the time [39][51]. Semi-Global Matching (SGM) is implemented in both MATLAB and OpenCV as Semi-Global Block Matching (SGBM) and works in a local neighbourhood of pixels in order to determine correspondences in stereo image pairs. These implementations use a Birchfield-Tomasi cost calculation [52] as opposed to mutual information. The SGM algorithm follows the four steps in Section 3.4.2 from Scharstein and Szeliski's taxonomy [50].
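For orientation, the sketch below shows how the OpenCV C++ SGBM implementation is typically driven. It is a hedged example rather than the configuration used in this project: the file names are placeholders, and the parameter values (including the commonly suggested P1 and P2 penalties, which play the roles of the smoothness penalties in Equation (25) of Section 3.4.3.2) are illustrative defaults only.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>    // cv::StereoSGBM in OpenCV 3.x
#include <opencv2/imgcodecs.hpp>

int main()
{
    // Rectified greyscale image pair (placeholder file names).
    cv::Mat left  = cv::imread("left_rectified.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("right_rectified.png", cv::IMREAD_GRAYSCALE);

    const int blockSize      = 5;    // odd-sided square matching window
    const int numDisparities = 64;   // must be a multiple of 16
    const int channels       = 1;

    cv::Ptr<cv::StereoSGBM> sgbm = cv::StereoSGBM::create(
        0,                                        // minimum disparity
        numDisparities,
        blockSize,
        8  * channels * blockSize * blockSize,    // P1: small-step penalty
        32 * channels * blockSize * blockSize,    // P2: large-step penalty
        1,                                        // disp12MaxDiff (left-right check)
        63,                                       // preFilterCap
        10,                                       // uniquenessRatio
        100,                                      // speckleWindowSize
        2);                                       // speckleRange

    // The output is a 16-bit fixed-point disparity map (disparity * 16).
    cv::Mat disparity;
    sgbm->compute(left, right, disparity);
    return 0;
}
```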
  • 29. 28 3.4.3.1. Matching cost computation
For SGM, the matching cost between a base image, $B$, and a matching image, $M$, can be calculated on a pixelwise scale. For any pixel, $p$, on the base image, its intensity can be represented as $I_{Bp}$, and similarly the suspected corresponding pixel on the matching image, $q$, has an intensity $I_{Mq}$, where $q$ is found using the computed epipolar line $e_{BM}$ and Equation (22):
$$ q = e_{BM}(p, d) \tag{22} $$
where $d$ is an estimated disparity value between the images, providing they have been rectified, and $e_{BM}(p, d)$ represents the epipolar line and is found using Equation (23):
$$ e_{BM}(p, d) = [p_x - d,\; p_y]^T \tag{23} $$
While the images themselves are not required to be rectified, it is essential that the epipolar geometry exhibited between them is known. Generally, pixelwise matching can be inaccurate, so a small matching window is used around the pixel of interest. The size and shape of the window are important parameters to consider, as they will have a direct impact on the disparity image produced [51]. A commonly used window is a variable-sized, odd-sided square neighbourhood of pixels where the centre pixel is the pixel to be matched. Changing the dimensions of the window will affect the disparity image produced: increasing the window size will make the matching process more robust; however, due to the assumption of constant disparity within the window, object edges or smaller areas can become blurred [39][51][53]. The Birchfield-Tomasi cost function $C_{BT}(p, d)$ seeks to find the minimum difference between $I_{Bp}$ and $I_{Mq}$, which corresponds to a minimum cost [39][52].
3.4.3.2. Cost aggregation
Calculating cost on a pixelwise scale can often be incorrect [39][51][53]. There is a potential that noise in an image can produce varying degrees of uncertainty. An incorrect match could be perceived to be a better match than the noisy true correspondence due to its lower cost function. Ideally, matching individual pixels would provide the truest representation of the scene, but individual pixels rarely contain sufficient information for this to be realised [53].
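To make the windowed cost concrete, the following small sketch computes a plain sum-of-absolute-differences cost over an odd-sided square window for a rectified pair. It is purely illustrative: it uses SAD rather than the Birchfield-Tomasi cost of the MATLAB and OpenCV implementations, and border handling is omitted.

```cpp
#include <opencv2/core.hpp>
#include <cstdlib>

// SAD cost between the window centred on (row, col) in the base image and the
// window centred on (row, col - d) in the matching image, for 8-bit greyscale
// rectified images. halfWin is the window radius, so the window side length is
// (2 * halfWin + 1) pixels.
double windowCost(const cv::Mat& base, const cv::Mat& match,
                  int row, int col, int d, int halfWin)
{
    double cost = 0.0;
    for (int dy = -halfWin; dy <= halfWin; ++dy)
        for (int dx = -halfWin; dx <= halfWin; ++dx)
        {
            const int b = base.at<uchar>(row + dy, col + dx);
            const int m = match.at<uchar>(row + dy, col + dx - d);
            cost += std::abs(b - m);
        }
    return cost;
}
```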
Global matching methods sum the matching cost over the image and aim to find the disparity image, D, which minimises Equation (24):

E(D) = \sum_p \left( C(p, D_p) + \sum_{q \in N_p} P \, T[|D_p - D_q| \geq 1] \right)    (24)

The first term corresponds to the sum of all matching costs across the image, with the second term adding a penalty for neighbouring pixels whose disparity differs from that of the pixel itself. Differences in disparity are still permitted in this case if the match is particularly strong. The SGM algorithm adds a third term which lessens the penalty on neighbouring pixels with a small disparity step, more akin to a curve or sloped surface than a discontinuity. Equation (25) shows this new relation:

E(D) = \sum_p \left( C(p, D_p) + \sum_{q \in N_p} P_1 \, T[|D_p - D_q| = 1] + \sum_{q \in N_p} P_2 \, T[|D_p - D_q| > 1] \right)    (25)

The first term is again the summed matching cost across the image, the second term adds the small penalty, P_1, associated with small disparity steps, and the third term adds a larger penalty, P_2, for larger steps in disparity. The global minimisation of the energy function E(D) in 2 dimensions over the disparity image, D, is NP-complete and therefore takes substantial time to compute. Minimisation in 1 dimension, however, can be performed efficiently [39][53]. The aggregated cost S(p, d) is formed by summing the matching costs along 1-dimensional paths terminating at pixel p from all directions. The cost along a path to p in a particular direction, r, is denoted L'_r(p, d) and is calculated using Equation (26) below:

L'_r(p, d) = C(p, d) + \min \left\{ L'_r(p - r, d),\; L'_r(p - r, d - 1) + P_1,\; L'_r(p - r, d + 1) + P_1,\; \min_i L'_r(p - r, i) + P_2 \right\}    (26)
The p - r term in (26) corresponds to the previous pixel along the path, and at each iteration the minimum cost of the previous pixel is added to the cost of the current pixel. This creates a perpetually increasing value of L' [39], which can lead to very large costs in larger images. This increasing cost can be combated by altering Equation (26) to that seen in Equation (27), adding a third term which subtracts the minimum path cost of the previous pixel from the new cost [38][39]:

L_r(p, d) = C(p, d) + \min \left\{ L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_i L_r(p - r, i) + P_2 \right\} - \min_k L_r(p - r, k)    (27)

As mentioned previously, the smoothed, or aggregated, cost of a particular match, S(p, d), is simply the sum of all the path-wise costs to a pixel and can be found using Equation (28):

S(p, d) = \sum_r L_r(p, d)    (28)

3.4.3.3. Disparity computation
A disparity image, D_B, created using the base image, I_B, as reference is the one which, for every pixel, minimises the smoothed cost function S(p, d), giving Equation (29):

D_B = \min_d S(p, d)    (29)

The disparity image D_M can also be calculated for the matching image, I_M, by exploiting the epipolar geometry exhibited by the images. Going from a pixel, q, in I_M to the corresponding pixel, p, in I_B, the disparity is selected such that it produces the minimum cost, and can be found using Equation (30):

D_M = \min_d S(e_{MB}(q, d), d)    (30)
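To make the aggregation and disparity selection steps concrete, the following is a minimal MATLAB sketch of Equation (27) along a single left-to-right path direction, followed by the winner-takes-all selection of Equation (29). It is an illustrative sketch only, not the MATLAB or OpenCV implementation used in this project: the cost volume C and the function name are assumed, and a full SGM implementation would repeat the aggregation over several path directions and sum them as in Equation (28).

    function D = sgmSingleDirectionDemo(C, P1, P2)
    % C is a [rows x cols x numDisparities] pixelwise matching cost volume.
        [rows, cols, numD] = size(C);
        L = zeros(rows, cols, numD);
        L(:, 1, :) = C(:, 1, :);                     % paths start at the image border
        for x = 2:cols
            prev = squeeze(L(:, x-1, :));            % costs at the previous pixel on the path
            minPrev = min(prev, [], 2);              % min_k L(p - r, k), used by the P2 term
            for d = 1:numD
                sameD = prev(:, d);                                          % no disparity change
                if d > 1,    down = prev(:, d-1) + P1; else, down = inf(rows, 1); end
                if d < numD, up   = prev(:, d+1) + P1; else, up   = inf(rows, 1); end
                jump = minPrev + P2;                                         % larger disparity jump
                best = min(min(sameD, down), min(up, jump));
                % subtracting min_k L(p - r, k) stops the path cost growing without bound
                L(:, x, d) = squeeze(C(:, x, d)) + best - minPrev;
            end
        end
        S = L;                        % one direction used here in place of the full sum of (28)
        [~, D] = min(S, [], 3);       % Equation (29): disparity of minimum aggregated cost
        D = D - 1;                    % disparities indexed from zero
    end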
As well as utilising the epipolar geometry of the image pairs, D_M can be calculated using the same steps as D_B, but with I_M used as the base image and I_B as the matching image. The addition of a second disparity image to this process can yield improved results at the expense of increased runtime [38][39]. If both disparity images D_B and D_M have been determined, occlusions and false matches can be detected by performing a consistency check. Each disparity value of one image is compared to that of the other and, if the difference is greater than one, the disparity is set to invalid. As well as producing a higher quality disparity image, the check acts as a uniqueness constraint and enforces one-to-one mappings only. This check is performed using Equation (31):

D_p = \begin{cases} D_{Bp} & \text{if } |D_{Bp} - D_{Mq}| \leq 1 \\ D_{inv} & \text{otherwise} \end{cases} \quad \text{where } q = e_{BM}(p, D_{Bp})    (31)

The estimation of depth from the computed disparity is discussed in Section 3.5.
3.4.3.4. Disparity Refinement
Finally, SGM's disparity refinement stage addresses three distinct problems commonly present in disparity images. The first is the removal of peaks, or outliers, which occur due to areas of low texture or a high concentration of noise such as reflections. In order to improve these areas, further post-processing techniques, such as thresholding or segmentation, are often required [51]. Thresholds can be used to filter out small invalid regions of connected pixels, and segmentation based on drastic changes in disparity value can also be used to highlight an area for filtering. The second refinement stage is intensity consistent disparity selection, which addresses a problem that arises in predominantly indoor, low-texture environments. If a detailed, textured object in the foreground appears in front of a texture-less background, distorted edges can occur around objects, because SGM utilises multiple 1-dimensional paths for its cost aggregation stage which may not all agree on a disparity step. In order to combat this, Hirschmüller applied a fixed-bandwidth mean shift segmentation [39][51][54]. The third and final form of disparity refinement is discontinuity preserving interpolation, which aims to fill in the obscured areas caused by occlusion due to the translation between the cameras. By identifying the direction to
interpolate the background from, these occluded areas can be estimated based on the surrounding disparities. Both the first and third refinement techniques were applied in post processing in an attempt to improve the disparity images returned by the algorithm.
3.5. Disparity computation and depth estimation
Once correspondences have been determined in the stereo images, the disparity can be calculated by exploiting the geometry of triangles. Figure 12 shows the rectified stereo model viewed from the top down, perpendicular to the virtual image plane. Given a point P in space within the field of view of two cameras with optical centres O_R and O_T respectively, separated by a baseline B, the matched points p and p' correspond to the projections of P onto the rectified image planes. The pixel corresponding to P on the reference image plane \pi_R, p, is x_R pixels from the left-hand side of the image on a particular scanline, whereas p' on the target image plane, \pi_T, is a distance x_T from the left of the resulting image on the same y-coordinate scanline.
Figure 12 – Depth estimation through the use of a rectified binocular stereo system
The disparity, d, is equivalent to the difference in x-coordinates of any corresponding pixels; in Figure 12 above, the disparity is equivalent to (x_R - x_T). Knowing the disparity, the depth, Z, can be estimated using the derivation from similar triangles shown in Equation (32):

\frac{B}{Z} = \frac{(B + x_T) - x_R}{Z - f}    (32)

\frac{Z - f}{Z} = \frac{(B + x_T) - x_R}{B}

1 - \frac{f}{Z} = \frac{(B + x_T) - x_R}{B}

1 - \frac{(B + x_T) - x_R}{B} = \frac{f}{Z}

Z = \frac{f}{1 - \frac{(B + x_T) - x_R}{B}}

Z = \frac{B \times f}{B - B - x_T + x_R}

Z = \frac{B \times f}{x_R - x_T} = \frac{B \times f}{d}

The disparity, x_R - x_T, is inversely proportional to depth, meaning that objects have a higher disparity the closer they are to the imaging system. This occurs due to the object appearing in drastically different locations in the resulting images. Figure 13 highlights this inversely proportional relationship between disparity and depth. Using either the disparity or the depth value, a basic 3-dimensional representation, or disparity map, of a scene can be created.
Figure 13 – Variation in disparity with depth
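As an illustration of the final relation in Equation (32), the short MATLAB sketch below converts a disparity map into a depth map. It assumes a disparity map has already been computed (as in Section 7.3), and the baseline and focal length values shown are purely illustrative; in practice they come from the calibration stage.

    baselineMM    = 120;                         % assumed baseline in millimetres
    focalLengthPx = 1050;                        % assumed focal length in pixels
    depthMap = nan(size(disparityMap));          % disparityMap assumed from Section 7.3
    valid = disparityMap > 0;                    % zero or negative disparities are invalid
    depthMap(valid) = baselineMM * focalLengthPx ./ disparityMap(valid);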
4. Structured light and time-of-flight infrared
The steps described previously correspond to what is called a passive stereo vision system, as no driven exterior processes are needed in order to calculate depth. One example of an active stereo system is the Microsoft Kinect and its use of infrared light to calculate depth in a scene.
4.1. The Microsoft Kinect v1 and structured light infrared
The first iteration of Microsoft's Kinect sensor uses a technology called structured light, which projects a known pattern of infrared light into a scene. The projected light becomes deformed by the geometric shape of objects in the scene and, from this deformation, depth can be inferred. The Kinect v1 is comprised of an RGB colour camera, an infrared emitter and an infrared receiver, arranged as illustrated in Figure 14. Because the infrared receiver and emitter are in offset positions, techniques similar to those utilised in passive stereo vision are employed in the Kinect sensor. By calculating the disparity between the projected pattern and the received pattern through triangulation, depth can be inferred using the infrared emitter, the receiving camera and Equation (32). The Kinect sensor uses a speckled light pattern [31] consisting of dots of infrared light similar to the one seen in Figure 15. Triangulation is used between the two views in order to infer depth; however, rather than attempting to match a single projected pixel to a single pixel in the received scene, a 9x9 window is matched using correlation [32]. The centre pixel of the window with the highest correlation is taken as the match and the disparity can then be calculated accordingly.
Figure 14 – Microsoft Kinect v1 sensor camera arrangement – credit Microsoft
4.2. The Microsoft Kinect v2 and time-of-flight infrared
Due to Microsoft's proprietary restrictions, limited technical information specific to the updated Kinect sensor is available, other than that it uses a technology called Time-of-Flight (ToF) [32][33]. ToF cameras measure the time taken for infrared light to be emitted, reflect off a surface and return to the sensor. When this travel time is known, the distance can be calculated and the depth of an object in a scene can therefore be estimated. Figure 16 shows the hardware configuration of the Kinect v2 sensor. ToF sensors illuminate a scene with infrared light which is modulated using continuous wave intensity modulation [32]. The emitted intensity-modulated light is periodic and, upon reflection off a surface, returns to the sensor with a small phase shift. This phase shift corresponds to a time shift, Φ [s], in the optical signal between emission and reception, which can be found for each sensor pixel. The distance can be calculated using this time shift and the constant c, the speed of light; the depth is then half of the total distance travelled.
Figure 15 - Speckled structured light pattern – credit Trevor Taylor (MSDN) [68]
Figure 16 - Microsoft Kinect v2 sensor camera arrangement – credit Microsoft
5. Facial detection using the Viola-Jones algorithm
After the initial research into facial recognition, it was discovered that a method of face detection prior to recognition was necessary so as to avoid characterisation of non-face objects in the background. The Viola-Jones algorithm offers a robust, real-time machine learning approach to the object detection process with broad applications, although it has a lengthy training period [34][35]. The algorithm is comprised of three major elements: the first is the integral image, the second is a set of simple classifiers built using AdaBoost and the third is the combination of these weak classifiers in an increasingly complex cascade structure [34]. The detection algorithm aims to find simple rectangular features in images and classify them accordingly. This method uses features, as opposed to pixels, as feature-based systems encode more information and are faster than pixel-based systems [34]. The features used are similar to the Haar basis functions used in [55] and are termed Haar-like features. There are three specific forms of feature used in the algorithm: two-rectangle features, three-rectangle features and four-rectangle features, shown in Figure 17. In each rectangular feature, the sum of pixels in the lighter region is subtracted from the sum of pixels in the darker region to obtain a value for that feature. Using these features allows for rapid identification of potentially discriminatory facial features such as eyes and noses; examples from [34] are shown in Figure 18.
Figure 17 – Set of Haar-like features: A) two-rectangle horizontal feature, B) two-rectangle vertical feature, C) three-rectangle feature and D) four-rectangle feature
The integral image proposed by Viola and Jones in [34] provides a rapid method of evaluating features. A particular (x, y) coordinate in the integral image is equal to the sum of all pixels above and to the left of (x, y) and is calculated using Equation (33):

ii(x, y) = \sum_{x' \leq x,\; y' \leq y} i(x', y')    (33)

Every pixel in the image produced using Equation (33) is the sum of all the pixels in the region above and to the left of it. Knowing this, rectangular regions of an image can be evaluated using only four reference points. Figure 19 shows an example of an integral image with four regions A, B, C and D. Location 1 is the sum of all values within region A, and location 2 is the sum of all pixels within regions A and B collectively. Similarly, location 3 is the sum of all pixels in regions A and C, and finally location 4 is the sum of all the pixels in the entire image.
Figure 18 – Facial feature detection using Haar-like features - credit Viola/Jones [34]
Figure 19 - Integral image layout example
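The following MATLAB sketch illustrates this idea: the integral image is built with two cumulative sums and the sum over an arbitrary rectangle is then recovered from four corner lookups, exactly the property exploited when evaluating Haar-like features. The image and region bounds are illustrative, and the four-point lookup itself is derived in Equation (34) below.

    I  = im2double(imread('cameraman.tif'));     % any greyscale image
    ii = cumsum(cumsum(I, 1), 2);                % integral image: sum above and to the left
    ii = padarray(ii, [1 1], 0, 'pre');          % zero pad so corner lookups need no special cases
    r1 = 60; r2 = 90; c1 = 40; c2 = 80;          % bounds of the rectangle to evaluate
    regionSum = ii(r2+1, c2+1) + ii(r1, c1) - ii(r1, c2+1) - ii(r2+1, c1);
    % the same four-point lookup evaluates each rectangle of a Haar-like feature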
In order to find the sum of pixels in region D, only the values of these four locations are needed, and the sum within D can be calculated as shown in Equation (34):

D = l(4) + l(1) - (l(2) + l(3))    (34)
D = (A + B + C + D) + A - (A + B + A + C)
D = 2A - 2A + B - B + C - C + D
D = D

Similar processes to the one described in Equation (34) can be applied to any rectangular region, and this characteristic of the integral image is what makes it well suited to evaluating Haar-like features. For any two-rectangle feature, six reference points are needed (as two of the eight individual locations are shared); similarly, eight reference points are needed for a three-rectangle feature and nine for a four-rectangle feature [34]. After the features from the image have been evaluated they must then be classified; AdaBoost [56][57] is used to select features and train a set of weak classifiers [34]. There are too many rectangular features to classify individually whilst retaining the efficiency offered by using them, so the most promising features are selected and combined to form an efficient classifier. The first two features selected by AdaBoost are shown in Figure 18 and correspond with properties common to faces: the region of the eyes is darker than the nose and cheeks of a subject, and the eyes are less illuminated than the bridge of the subject's nose [34]. By cascading simple classifiers based on more general features first, followed by more complex classifiers looking for very particular features, the computation time can be reduced drastically whilst also providing a low false positive rate [34]. Figure 20 shows this cascade structure, where the higher the layer number, the more complex the classifier. The final classifier in the chain outputs detections for further processing.
Figure 20 - Cascade classifier structure
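As an illustration of how such a cascade is invoked in practice, the sketch below uses MATLAB's built-in vision.CascadeObjectDetector, which provides a pre-trained Viola-Jones cascade. It is shown only as a minimal usage example with a sample image shipped with the toolbox, not as the detection code developed for this project.

    faceDetector = vision.CascadeObjectDetector();          % pre-trained frontal face cascade
    I = imread('visionteam.jpg');                            % example image from the toolbox
    bboxes = step(faceDetector, I);                          % one [x y width height] row per face
    annotated = insertObjectAnnotation(I, 'rectangle', bboxes, 'Face');
    imshow(annotated);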
6. Facial recognition
Following on from an initial investigation into methods of facial recognition, further algorithms were researched and tested for their applicability to the data obtained from both the stereo camera system and the Kinect sensor. At first, only the Fisherfaces method [1][24][58] was implemented, while both Eigenfaces [24][25][29][58] and Local Binary Pattern Histograms [2][24] were researched. Due to the unsuccessful application of Fisherfaces to the captured data, multiple other methods of facial recognition were also tested, with the successful methods subsequently implemented.
6.1. Eigenfaces
Eigenfaces uses Principal Components Analysis (PCA) [59] to reduce a data set by identifying its principal components. The principal components are the directions of greatest variation in the data, while the remainder are discarded. Eigenfaces can be applied to facial recognition problems by projecting training images into a PCA subspace [58] and then projecting an image to be recognised into this trained subspace. By finding the nearest neighbour to the query image through a distance measure such as the Euclidean, Manhattan or Mahalanobis distance, the test image can be matched to a trained image and potentially recognised. The process of dimensionality reduction begins with representing an N×N face image I as an N^2×1 vector, Γ, and subtracting the average face of the set; this mean-subtracted vector is represented by Φ. Figure 21 shows the representation of an image as a vector. In order to carry out facial recognition, a training set of faces is reduced in dimension and the eigenvectors of the images are projected into the subspace.
Figure 21 - Representation of an image as a vector
The following steps are used to compute the eigenfaces of a training set [25]. For the M images in the training set I_1, I_2, ..., I_M, each image, I_i, is represented as a vector, Γ_i, and the average face vector, Ψ, is computed using Equation (35):

\Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i    (35)

The mean face calculated in Equation (35) is then subtracted from every face vector to obtain the mean-subtracted image vector Φ_i using Equation (36):

\Phi_i = \Gamma_i - \Psi    (36)

The covariance matrix of the images, C, is then calculated using Equation (37):

C = \frac{1}{M} \sum_{n=1}^{M} \Phi_n \Phi_n^T = A A^T \quad (N^2 \times N^2 \text{ matrix})    (37)

where A = [\Phi_1\; \Phi_2\; \dots\; \Phi_M] (an N^2 \times M matrix).

The eigenvectors, u_i, of A A^T are then required; however, A A^T is a very large matrix and is not practical for computation. Instead, the eigenvectors, v_i, of A^T A (an M \times M matrix) are calculated using the derivation shown in (38):

A^T A v_i = \mu_i v_i \;\Rightarrow\; A A^T A v_i = \mu_i A v_i \;\Rightarrow\; C A v_i = \mu_i A v_i \;\Rightarrow\; C u_i = \mu_i u_i, \quad \text{where } u_i = A v_i    (38)

The eigenvalues of A A^T are identical to those of A^T A and their eigenvectors, u_i and v_i, are related by Equation (39):

u_i = A v_i    (39)

where u_i are the eigenvectors of A A^T and v_i are the eigenvectors of A^T A.
Each face can then be represented as a combination of the K best eigenvectors, as seen in Equation (40):

\hat{\Gamma}_i - \Psi = \sum_{j=1}^{K} w_j u_j, \quad \text{where } w_j = u_j^T \Phi_i    (40)

The u_j are the eigenfaces of the set and the w_j are the weights of each. Each normalised face, Φ_i, can be represented by a vector Ω_i containing the weights of the eigenfaces used to represent it, which is then used to project images into the eigenspace:

\Omega_i = [w_1^i\; w_2^i\; \dots\; w_K^i]^T, \quad i = 1, 2, \dots, M    (41)

The mean face as well as the six best eigenfaces from the two stereo training datasets (greyscale and disparity) are shown in Section 7.6.1. In order to recognise faces, a similar procedure is followed. The query image is required to be the same size as, and centred in a similar fashion to, the training images; in this case the face detection algorithm used centred and resized the images as part of its process. The query image, Γ, is first normalised by again subtracting the mean face of the trained set of images:

\Phi = \Gamma - \Psi    (42)

The normalised image is projected onto the PCA subspace, or eigenspace [25], using Equation (43):

\hat{\Phi} = \sum_{j=1}^{K} w_j u_j, \quad \text{where } w_j = u_j^T \Phi    (43)

The face image, Φ, is then represented by a vector of weights, Ω:

\Omega = [w_1\; w_2\; \dots\; w_K]    (44)
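As an illustration of the training and projection steps in Equations (35) to (41), the MATLAB sketch below computes the eigenfaces via the smaller M-by-M eigenproblem. It is a minimal sketch under stated assumptions, not the implementation used in this project: 'faces' is an assumed N^2-by-M matrix whose columns are vectorised training images, and K is an illustrative number of eigenfaces to retain.

    M   = size(faces, 2);
    Psi = mean(faces, 2);                                  % mean face, Equation (35)
    Phi = faces - repmat(Psi, 1, M);                       % mean-subtracted faces, Equation (36)
    [V, D] = eig(Phi' * Phi);                              % eigenvectors of the M x M matrix A'A
    [~, order] = sort(diag(D), 'descend');                 % order by decreasing eigenvalue
    K = 10;                                                % number of eigenfaces to keep
    U = Phi * V(:, order(1:K));                            % eigenfaces u_i = A v_i, Equation (39)
    U = U ./ repmat(sqrt(sum(U.^2, 1)), size(U, 1), 1);    % normalise each eigenface to unit length
    W = U' * Phi;                                          % K x M training weight vectors, Equation (41)

A query face is then projected in the same way (subtract Psi, multiply by U') and matched to the training column of W at the smallest distance, as described next.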
In order to find a matching image, the minimum distance, e_d, between the projected query image, Ω, and the i projected training images, Ω_i, is found using Equation (45):

e_d = \min_i \lVert \Omega - \Omega_i \rVert    (45)

The minimum distance e_d determines the training image Γ_i that matches the query image, Γ.
6.2. Fisherfaces
While PCA determines a linear combination of features that maximises the variance of a data set or set of images, this process may not be optimal. The Eigenfaces method does not consider that data can belong to classes, and therefore potentially important and informative data can be overlooked and discarded as seemingly redundant information [58]. Illumination in particular constitutes variation in the data, but this is not necessarily information that defines a particular person or face, and it can therefore lead to misclassification or misrecognition. Linear Discriminant Analysis (LDA), or Fisherfaces, performs class-specific dimensionality reduction [27][58] which maximises the between-class variance, or between-class scatter, S_B, and minimises the within-class variance, or within-class scatter, S_W; these are found using Equations (46) and (47) respectively. For a data set X = {x_1, x_2, ..., x_C} with C classes, where each class x_i = {x_1, x_2, ..., x_n} has n samples:

S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T    (46)

S_W = \sum_{i=1}^{C} \sum_{x_j \in x_i} (x_j - \mu_i)(x_j - \mu_i)^T    (47)

where C is the number of classes, n is the number of samples in a class, N_i is the number of samples in class x_i, μ_i is the mean image of class x_i, and μ is the mean image of the entire set.
Fisher's original algorithm [26] then strives to find an optimal projection, Ω_opt, which satisfies the conditions mentioned above regarding within- and between-class scatter. Ω_opt can be found using Equation (48):

\Omega_{opt} = \arg\max_{\Omega} \frac{|\Omega^T S_B \Omega|}{|\Omega^T S_W \Omega|}    (48)

Following the steps in [27], the projection matrix, Ω, can be expressed in terms of PCA and LDA operations using Equation (49):

\Omega = \Omega_{lda}^T \Omega_{pca}^T    (49)

The optimised Ω_pca and Ω_lda terms can be calculated using Equations (50) and (51) respectively:

\Omega_{pca} = \arg\max_{\Omega} |\Omega^T S_B \Omega|    (50)

\Omega_{lda} = \arg\max_{\Omega} \frac{|\Omega^T \Omega_{pca}^T S_B \Omega_{pca} \Omega|}{|\Omega^T \Omega_{pca}^T S_W \Omega_{pca} \Omega|}    (51)

Using the resultant Ω from Equation (49), the images can be projected into the subspace, optimising the class scatter. Both Eigenfaces and Fisherfaces rely on a sufficiently large training database of faces to correctly identify an unknown face [58]. Figure 22 shows that the two algorithms have a generally positive trend in recognition rate as the number of training images increases.
Figure 22 – Comparison of Eigenfaces and Fisherfaces algorithms (database size vs error rate) [58]
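For illustration, the scatter matrices of Equations (46) and (47) can be accumulated directly as in the sketch below. X is an assumed feature matrix with one sample per column and 'labels' gives each sample's class; in practice these matrices are formed after the PCA projection of Equation (49) so that they remain small enough to handle.

    mu = mean(X, 2);                                    % mean of the entire set
    classes = unique(labels);
    Sb = zeros(size(X, 1));                             % between-class scatter, Equation (46)
    Sw = zeros(size(X, 1));                             % within-class scatter, Equation (47)
    for c = 1:numel(classes)
        Xi  = X(:, labels == classes(c));               % samples belonging to class i
        Ni  = size(Xi, 2);
        mui = mean(Xi, 2);                              % class mean
        Sb  = Sb + Ni * (mui - mu) * (mui - mu)';
        Sw  = Sw + (Xi - repmat(mui, 1, Ni)) * (Xi - repmat(mui, 1, Ni))';
    end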
6.3. Local Binary Pattern histograms
Local Binary Pattern (LBP) was originally created as a two-dimensional surface texture descriptor that accounts for both spatial patterns and greyscale contrast in an image [60]. LBP and its extended algorithms exhibit invariance to both monotonic illumination changes and slight rotations [61]. The original algorithm operates over a 3x3 square neighbourhood of pixels to produce a binary description, or label, of the eight pixels surrounding the centre, based on a thresholding operation where the threshold used is the centre pixel. Figure 23 shows this operation: the centre pixel has a value of 38 and each of the surrounding eight pixels is assigned a zero or a one depending on whether its value is less than or greater than the threshold assigned by the centre pixel, respectively. In the case of Figure 23, the label is the 8-bit binary number formed by reading the eight thresholded values circularly, giving the label 01111001, or 121 in decimal. The histogram of the resulting labels can be used to describe texture in sections of an image known as cells. The original operator was later extended to consider circular neighbourhoods of different sizes with radius R. As R increases, the number of sampling points, P, that can fall on the circle also increases and the algorithm is no longer restricted to an eight-element 3x3 array of pixels. However, as the radius changes, the sampling points may not correspond to exact pixel coordinates, and therefore bilinear interpolation is applied to estimate the value at each sampling point [2][60][62]. The neighbourhood being investigated is described in the form (P, R), and examples for different values of R and P are shown below in Figure 24.
Figure 23 – Original LBP algorithm comprised of a 3x3 neighbourhood of pixels to be thresholded.
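The basic 3x3 operator can be written directly, as in the minimal MATLAB sketch below. The image is illustrative, ties with the centre value are assigned a one here, and border pixels are simply left unlabelled; this is a sketch of the operator rather than the recognition code used in this project.

    I = im2double(imread('cameraman.tif'));              % any greyscale image
    [rows, cols] = size(I);
    lbp = zeros(rows, cols);
    offsets = [-1 -1; -1 0; -1 1; 0 1; 1 1; 1 0; 1 -1; 0 -1];   % neighbours read circularly
    for y = 2:rows-1
        for x = 2:cols-1
            centre = I(y, x);
            label = 0;
            for k = 1:8
                neighbour = I(y + offsets(k, 1), x + offsets(k, 2));
                label = label * 2 + double(neighbour >= centre);  % append the next bit of the label
            end
            lbp(y, x) = label;
        end
    end
    lbpHist = histcounts(lbp(2:end-1, 2:end-1), 0:256);  % histogram of the LBP labels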
LBP is used in textural analysis because the patterns returned from a neighbourhood are easily classifiable, either visually or programmatically. Figure 25 shows five potential results from an (8,1) circular operator used to classify various neighbourhoods based only on pixel values. After applying LBP, and once an LBP-labelled image, f_l(x, y), has been obtained, the LBP histogram for that image can be defined using Equation (52) [60]:

H_i = \sum_{x,y} I\{f_l(x, y) = i\}, \quad i = 0, \dots, n - 1    (52)

where n is the number of LBP labels and:

I\{A\} = \begin{cases} 1, & A \text{ is true} \\ 0, & A \text{ is false} \end{cases}

The histogram created using Equation (52) only takes the distribution of localised patterns into consideration. In order to retain the information conveyed by the spatial features present in faces, the image can be split into m regions R_0, R_1, ..., R_{m-1} and an improved histogram can be created using Equation (53):
Figure 24 – Local binary patterns using (8,1) [left], (16,3) [centre], and (16,4) [right] neighbourhoods
Figure 25 – Local binary patterns found using an (8,1) operator [58]
H_{i,j} = \sum_{x,y} I\{f_l(x, y) = i\} \, I\{(x, y) \in R_j\}, \quad i = 0, \dots, n - 1, \quad j = 0, \dots, m - 1    (53)

Using a description such as the one in Equation (53) gives information on three levels of locality [62]: the first, at a pixel level, describes the patterns in small neighbourhoods; the second is the summation of these patterns to form a histogram of the spatial information in each particular region; and the final level is the concatenation of these regional histograms into one which describes the image globally. The global histograms representing different images can be compared and, by selecting a match based on a minimisation of error, facial recognition can be achieved.
6.4. Eigensurfaces and Fishersurfaces
Although significant advancements have been made in the field of facial recognition, many algorithms concerned with 2-dimensional data are susceptible to inaccuracies caused by the environment in which images are captured. Environmental factors include changes in illumination, pose and facial expression, all of which affect recognition rates, making the need for consistent conditions paramount [40][41]. Using 3-dimensional data in the form of facial models eliminates the need for consistent conditions and also provides robustness to changes in orientation and expression of faces [63]. Eigensurfaces is the application of 3-dimensional data to the Eigenfaces algorithm [40]. Similarly, Fishersurfaces is the application of 3-dimensional data to the Fisherfaces algorithm [27][41].
6.5. Feature detectors
In addition to both 2-dimensional and 3-dimensional facial recognition algorithms, multiple feature detection and matching algorithms were investigated for their potential use in this project. As this was considered late in the timeline of the project, only functions available natively in MATLAB were considered for use. The feature detectors included in MATLAB are as follows: SURF [12], the Harris corner detector [44], Binary Robust Invariant Scalable Keypoints (BRISK) [43] and the FAST feature detector [45]. SIFT [11][64] did not have a native implementation in MATLAB (as of version 2016b); although a third-party implementation was available, it was not included due to time constraints.
Due to all but SURF and Harris returning insufficient features, only these two methods could be implemented and tested; some results are shown in Sections 7.6.3 and 7.6.4 respectively.
7. Experimental results
This section exhibits a subset of the results obtained at every stage of the project: passive and active stereo vision in both OpenCV and MATLAB, image post-processing, facial detection and facial recognition.
7.1. Camera calibration
As discussed in Section 3.2, camera calibration is the first and quite possibly the most important step in any stereo vision system. For the purposes of this project, calibration was implemented in OpenCV initially and then again using MATLAB and the Stereo Camera Calibrator App. In both methods a calibration object, in this case a chessboard pattern, was held in front of the camera pair with variations in position and rotation so as not to bias the calibration process.
7.1.1. Camera calibration in OpenCV
OpenCV has multiple native functions which apply the theory discussed in Section 3.2. Using a chessboard calibration pattern, multiple calibration images could be acquired by taking snapshots of a video feed. Snapshots could only be taken provided a chessboard had been detected using the OpenCV function findChessboardCorners(), and the detection could be visualised by superimposing the chessboard back onto the image pair using drawChessboardCorners(). Figure 26 shows the detected chessboards superimposed onto the images of the calibration object using these two functions. Using the points corresponding to the detected chessboard corners, the cameras can be calibrated with the stereoCalibrate() function, which takes the known positions of points on the calibration object (as its dimensions are known) and the corresponding points from each image, and returns the calibrated intrinsic and extrinsic camera parameters. As discussed in the theory, these parameters are used in later stages and are essential to the operation of the system as a whole.
Figure 26 – Detected chessboards in the calibration boards visible in stereo image pairs using OpenCV
7.1.2. Camera calibration using the MATLAB Stereo Camera Calibrator App
A calibration process similar to the one developed with OpenCV was created in MATLAB, initially using the Stereo Camera Calibrator App and then using this example to develop an original framework. The first step in carrying out calibration was once again to capture images of chessboards and detect the chessboard points in each of them, allowing the cameras to be calibrated. Twenty image pairs, five of which are displayed in Figure 28, were captured using the binocular stereo system, and chessboards were detected in them using the detectCheckerboardPoints() function in MATLAB, which returns the pixel locations of the chessboard corners in each of the images. Through prior knowledge of the physical dimensions of each of the chessboard squares and the number of squares present on the calibration object, a set of world coordinates describing the board can be specified. Using these world points along with the image points returned previously allows the camera parameters to be calculated through the estimateCameraParameters() function. Using the calibrated camera parameters, the location of each chessboard in the set of calibration images can be estimated to a high level of accuracy. These reprojections are overlaid onto the chessboards, seen in Figure 28 as green markers, and the reprojection errors for each image pair, i.e. the distance in pixels between the true point and the re-projected point, can be seen in Figure 27.
Figure 27 - Pixel reprojection error between calibration board points and calibrated estimates
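A minimal sketch of these calibration steps is shown below, using the toolbox functions named above. The two cell arrays of calibration image filenames (leftFiles, rightFiles) and the square size are assumptions for illustration rather than the exact values used in this project.

    squareSizeMM = 25;                                           % assumed size of one chessboard square
    [imagePoints, boardSize] = detectCheckerboardPoints(leftFiles, rightFiles);
    worldPoints  = generateCheckerboardPoints(boardSize, squareSizeMM);
    stereoParams = estimateCameraParameters(imagePoints, worldPoints);
    showReprojectionErrors(stereoParams);                        % per-pair error, as in Figure 27
    showExtrinsics(stereoParams);                                % camera and board layout, as in Figure 29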
Figure 28 - Calibration boards 11-15 captured using a binocular stereo camera system and MATLAB
The extrinsic parameters, i.e. the location of each camera and every board in the world frame, can also be visualised using MATLAB. This helps to ensure no bias is made towards any portion of the stereo system's field of view. The visualisation of the extrinsic parameters is shown in Figure 29.
7.2. Image pair rectification
Once the calibration stage had been completed and the parameters for the system had been obtained, new images could be captured and rectified before further stereo processing. Rectification was carried out in both OpenCV and MATLAB with a calibrated camera arrangement, and an investigation into uncalibrated rectification using feature detection was also performed.
Figure 29 – Extrinsic parameters from an arbitrary view
7.2.1. Image pair rectification in OpenCV
Using the parameters estimated during calibration, the images could be rectified so that they are aligned and the correspondence problem is therefore reduced. The first stage is to capture a normal image pair, an example of which is shown in Figure 30. Viewing the stereo anaglyph of these images, Figure 31, it becomes apparent that a large offset exists in the location of the face. In order to perform stereo matching, these images need to be warped and scaled through rectification. OpenCV has multiple functions that aid in the rectification process; stereoRectify() performs the core of this operation, taking the intrinsic and extrinsic parameters and returning the rectification transforms, which are then applied to the two images to produce the rectified pair.
Figure 30 - Unrectified image pair (OpenCV)
Figure 31 - Stereo anaglyph of unrectified image pair (OpenCV)
Figure 32 shows the rectified stereo pair. On first inspection, it may appear that there is very little difference between the rectified and unrectified image pairs. However, on closer inspection of the stereo anaglyph of the rectified images, shown in Figure 33, and comparison with the anaglyph of the unrectified images, the difference becomes clear. The rectified stereo anaglyph shows little disparity in the face; however, on the body and in the background there is an obvious linear shift in the horizontal x direction.
Figure 32 – Rectified image pair (OpenCV)
Figure 33 – Stereo anaglyph of rectified image pair (OpenCV)
7.2.2. Calibrated image pair rectification in MATLAB
The rectification process initially carried out in OpenCV was replicated in MATLAB, following on from the calibration process described earlier. An image pair, Figure 34, was captured using the stereo cameras; viewing the stereo anaglyph of the images, Figure 35, highlights the disparity in the locations of the various objects in the scene as both an x- and y-axis shift. The MATLAB function rectifyStereoImages() uses the same method described in the theory and in the OpenCV implementation to produce the rectified image pair shown in Figure 36.
Figure 34 - Unrectified image pair (MATLAB)
Figure 35 – Overlaid anaglyph of the unrectified image pair
The rectification process can be verified both by viewing the stereo anaglyph, shown in Figure 37, and by determining whether the two images are horizontally aligned using multiple horizontal lines overlaid onto the image pair. These horizontal lines act as confirmation of the epipolar constraint and are shown in Figure 38.
Figure 36 – Rectified image pair (MATLAB)
Figure 37 – Overlaid anaglyph of the rectified image pair
Figure 38 – Rectified image pair with horizontal (epipolar) lines highlighted
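A minimal sketch of this calibrated rectification and its visual check is given below, assuming the stereoParams object from the calibration stage and a newly captured image pair I1 and I2.

    [J1, J2] = rectifyStereoImages(I1, I2, stereoParams);   % calibrated rectification
    imshow(stereoAnaglyph(J1, J2));                         % overlaid anaglyph, as in Figure 37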
7.2.3. Uncalibrated image pair rectification in MATLAB
While calibration is important for any stereo vision system, there are cases where it is not possible. There has been research into utilising methods of feature detection to find correspondences from which the fundamental matrix can be calculated, allowing the images to be rectified without ever needing to calibrate the system [65]. Using a MathWorks example [66] for guidance, a framework was created which allowed for the uncalibrated rectification of image pairs. Taking the same test image pair used in Section 7.2.2 above, Figure 34, and viewing its stereo anaglyph, Figure 35, it is apparent that these images are offset. With no camera parameters from which to calculate the fundamental matrix, features in both images are detected instead using Speeded-Up Robust Features (SURF). Using the MATLAB function detectSURFFeatures() on each of the input images returned a multitude of features, with the fifty strongest features in the right and left images shown in Figure 39 and Figure 40 respectively.
Figure 39 - Fifty best SURF features in the right image
Figure 40 - Fifty best SURF features in the left image
The strongest features in one image may not correspond to the strongest in the other; this is usually caused by occlusion. In order to determine correspondences, the features from each image first had to be extracted using the MATLAB function extractFeatures(), which returns a vector of feature descriptors for each image, before these features were matched using matchFeatures(). The resulting matched features are displayed below in Figure 41, superimposed onto the stereo anaglyph of the image pair. Not all of these matches are valid, however. It is apparent from Figure 41 that a handful of features have been matched with features corresponding to objects on the opposite side of the scene. These outliers can be removed by applying the epipolar constraint, first estimating the fundamental matrix from the matched features in conjunction with the Random Sample Consensus (RANSAC) algorithm [67] via the MATLAB function estimateFundamentalMatrix(). This function takes a random subset of the input features at each iteration and aims to converge on an estimate of the fundamental matrix over a certain number of iterations. The next stage is to assess whether the estimated matrix is acceptable by using the function isEpipoleInImage(), which determines whether an epipole lies within the image and hence whether the pair is suitable for rectification. This stage also confirms the intersection of the epipolar lines between the two perceived camera centres and their image planes, resulting in a pair of epipoles as visualised in Figure 10.
Figure 41 - Matched features in the image pair
The resulting valid points are those which adhere to the epipolar constraint and can be seen in Figure 42. Using the estimateUncalibratedRectification() function in MATLAB, the rectification transforms can be calculated from the valid matched points and the fundamental matrix. The transforms returned here can then be used as inputs to the rectifyStereoImages() function, where previously camera parameters found through calibration were required. The stereo anaglyph of the uncalibrated rectified image pair is shown in Figure 43.
Figure 42 - Valid matched features after outlier removal
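Putting the steps of this subsection together, a minimal sketch of the uncalibrated pipeline is shown below. I1 and I2 are assumed greyscale images, and the RANSAC parameter values are illustrative rather than those tuned for this project.

    pts1 = detectSURFFeatures(I1);
    pts2 = detectSURFFeatures(I2);
    [f1, vpts1] = extractFeatures(I1, pts1);
    [f2, vpts2] = extractFeatures(I2, pts2);
    idxPairs = matchFeatures(f1, f2);
    matched1 = vpts1(idxPairs(:, 1));
    matched2 = vpts2(idxPairs(:, 2));
    [F, inliers] = estimateFundamentalMatrix(matched1, matched2, ...
        'Method', 'RANSAC', 'NumTrials', 2000, 'DistanceThreshold', 1e-4);
    [t1, t2] = estimateUncalibratedRectification(F, ...
        matched1(inliers).Location, matched2(inliers).Location, size(I2));
    [I1rect, I2rect] = rectifyStereoImages(I1, I2, t1, t2);   % transforms replace stereoParams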
7.3. Disparity calculation
Once the image pairs had been rectified, the disparity between corresponding pixels could be determined through the use of the SGBM algorithm, described in Section 3.4.3, in both the OpenCV and MATLAB implementations.
7.3.1. Disparity map creation in OpenCV
An SGBM object could be created and its parameters defined in order to compute disparity images from a rectified image pair. Some examples of the disparity maps computed in OpenCV are shown in Figure 44.
Figure 43 - Uncalibrated rectified image pair
It became apparent, despite testing and refining the parameter choices, that the disparity maps being produced in OpenCV were inferior to those produced in other applications using similar techniques, providing depth information only on objects close to the camera. A substantial amount of effort was spent attempting to rectify the issues presented here, especially in determining the ideal parameters to use with this algorithm and the available image data. However, these steps provided little to no improvement, and it was at this stage that the decision was taken that a MATLAB implementation should also be developed.
Figure 44 – Multiple disparity images created in OpenCV
7.3.2. Disparity map creation using the calibrated and uncalibrated models in MATLAB
The disparity could be estimated in MATLAB by providing the two rectified images as well as the parameters needed for SGBM. Prior to carrying out this process, and following on from the issues encountered in the OpenCV implementation, an investigation into selecting ideal parameters was performed. Following this investigation, the parameters passed to the disparity() function in MATLAB were optimised to provide the most informative disparity images. The same parameters were used to determine disparity for both the calibrated and uncalibrated methods. Figure 45 shows the disparity image created using the calibrated method and Figure 46 shows the uncalibrated method's disparity image.
Figure 45 – Disparity map produced using the calibrated model for rectification
Figure 46 - Disparity map produced using the uncalibrated model for rectification
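A minimal call to the MATLAB disparity() function is sketched below. The rectified images J1 and J2 are assumed to be colour captures from the previous stage, and the parameter values shown are illustrative rather than the optimised values found during the investigation described above.

    dispMap = disparity(rgb2gray(J1), rgb2gray(J2), ...
        'Method', 'SemiGlobal', ...
        'DisparityRange', [0 128], ...      % range width must be divisible by 16
        'BlockSize', 15, ...
        'UniquenessThreshold', 15);
    imshow(dispMap, [0 128]);               % display the map scaled to the disparity range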