Im-ception
Facial PAD Through Fine Tuning Deep Convolutional
Neural Networks
by
Cooper Wakefield
School of Information Technology and Electrical Engineering,
The University of Queensland.
Submitted for the degree of
Bachelor of Engineering
in the field of Mechatronic Engineering.
June 2019.
Cooper Wakefield
cooper.wakefield@uq.net.au
43593321
Tel. +614 21 185 173
June 9, 2019
Prof. Michael Brünig
Head of School
School of Information Technology and Electrical Engineering
The University of Queensland
St Lucia, Q 4072
Dear Professor Brünig,
In accordance with the requirements of the degree of Bachelor of Engineering
(Honours) in the division of Mechatronic Engineering, in the School of Information
Technology and Electrical Engineering, I present the following thesis entitled:
“Im-Ception”
Facial PAD Through Fine Tuning Deep Convolutional Neural Networks
This work was performed under the supervision of Prof. Brian Lovell.
I declare that the work submitted in this thesis is my own, except as acknowl-
edged in the text and footnotes, and has not been previously submitted for a degree
at The University of Queensland or any other institution.
Yours sincerely,
Cooper Wakefield.
Keywords
Deep learning; transfer learning; biometrics; facial PAD; presentation attack detection; PAD; fine tuning; convolutional neural network; facial recognition engines.
Acknowledgments
I would like to express my appreciation for the opportunity to work under the supervision of Brian Lovell, whose initial brainchild spawned this thesis. To have access to, and make the acquaintance of, someone of Professor Lovell's stature within the international computer vision and biometric community has been an amazing opportunity for an aspiring engineer. I would also like to thank Dr Arnold Wiliem for his early guidance in the fields of deep learning and dataset structuring. For someone like myself, who had very little exposure to machine and deep learning before this project, his insights were invaluable to its success.
I also feel obliged to extend my appreciation to François Chollet, whose development of the Keras deep learning framework has enabled the accelerated advancement of deep learning worldwide by making it more accessible and less verbose in its implementation. This has had a direct effect on my thesis, enabling a streamlined and efficient development period.
Abstract
In our modern epoch, the ‘necessity’ for fast and efficient access to our data has
often left us with the feeling that security is more of an afterthought than a priority.
At the forefront of this debate over the security of our data have been facial recognition engines, which have, over the last few years, begun to appear in consumer-based products. The quick progression of facial recognition systems into real-time environments has raised new concerns over their ability to resist presentation attacks [1]. With companies such as Apple leading the charge with facial recognition to control everything from unlocking your phone to logging into your banking, it has never been a more important time to reliably detect presentation attacks on our facial recognition engines. This thesis will outline and develop a presentation attack detection (PAD) system through fine tuning a deep convolutional neural network.
It was found that leveraging pre-trained networks and fine tuning the upper layers of the network to detect facial presentation attacks is feasible, with an F1 score of 99.96% achieved under the scope of this thesis.
Contents

Keywords
Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Problem Definition
  1.2 Scope
    1.2.1 Boundaries and Limitations
    1.2.2 Scope Definition
  1.3 Aims
  1.4 Relevance and Impact

2 Literature review / prior art
  2.1 Vertical-Cavity Surface-Emitting Lasers (VCSEL)
  2.2 Motion-Based Schemes
  2.3 Image-Quality Schemes
  2.4 Texture-Based Schemes
  2.5 Prior Uses of CNN for PAD
  2.6 Where this solution places

3 Theory
  3.1 Convolutional Neural Networks
    3.1.1 Recent advancements in Artificial Intelligence
    3.1.2 Architectures
  3.2 Transfer Learning (Fine tuning)
    3.2.1 Fine Tuning vs. Training from scratch
    3.2.2 Bottleneck
  3.3 Facial recognition
  3.4 Biometrics
  3.5 Presentation Attacks
    3.5.1 Types of facial PA
  3.6 Summary and conclusions

4 Solution Design and Implementation
  4.1 Design Overview
  4.2 Implementation
    4.2.1 Hardware
    4.2.2 Software used for development
    4.2.3 Overview of deep learning framework
  4.3 Dataset
    4.3.1 Data Collection
    4.3.2 Size and Structure
    4.3.3 Improvements and limitations
  4.4 Architecture
    4.4.1 Fine tuning model (transfer learning)
    4.4.2 Design Rationale

5 Results and Discussion

6 Conclusions
  6.1 Summary and conclusions
  6.2 Possible future work

Appendices
A Code Listings
  A.1 Bottleneck Features Generation Script
  A.2 Fine Tune network script
  A.3 Evaluate Model
  A.4 Visualisation Generation Script
  A.5 Config File
B Companion disk
C Tensorboard Graphics
D Timeline From Proposal

Bibliography
List of Figures

2.1 FaceID: VCSEL projecting 30,000 dots [37]
3.1 Most basic structure of a CNN
3.2 Distribution of image to hidden layers
3.3 Illustration of active nodes
3.4 VGG16 graphic
4.1 Design flow
4.2 Comparison of GPUs
4.3 Example of spoof image
4.4 Machine used for training
4.5 Samples of real images
4.6 Samples of fake images
4.7 VGG16 graphic
5.1 Tensorboard outputs
5.2 Fake spoofing
5.3 Confusion Matrix
5.4 Final Scores
C.1 Tensorboard output of network
D.1 Timeline of project from project proposal
List of Tables

1.1 Scope Definition
Chapter 1
Introduction
Biometric systems are becoming commonplace in our everyday lives, from fingerprint readers, to voice recognition, and, pertinent to this thesis topic, facial recognition systems. As with any technological advancement, there are often flaws with the early iterations, and facial recognition is no exception. We are at a stage in the development of facial recognition systems where real-time applications are in demand, and the move into this space has raised concern from both the biometric community and the public over the ability of these systems to resist presentation attacks [1]. The proposed solution utilises transfer learning to fine tune the VGG16 deep convolutional neural network architecture to enable differentiation between real and fake facial artefacts. Through extensive experimentation with varying parameters, which will be outlined in the following sections, on a self-curated dataset, an F1 score of 99.96% was obtained.
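For reference, the F1 score quoted here and throughout this thesis is the standard harmonic mean of precision and recall over the binary real/fake classification (a definition added for clarity, not a result of this work):

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}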
1.1 Problem Definition
Failure in facial recognition software has made headlines since Apple's Face ID launch on September 12, 2017 [2]; most notably failing to differentiate between twins, siblings with similar faces, and, in extreme cases, unrelated people of the same race [3]. Apple's own Face ID Security release states that the probability of someone other than yourself being able to unlock your phone through the Face ID protocol is "1 in 1,000,000 (versus 1 in 50,000 for Touch ID)"; however, it caveats this statement by saying "The probability of a false match is different for twins and siblings that look like you as well as among children under the age of 13, because their distinct facial features may not have fully developed." [4].
A presentation attack is exactly that: an attack on the security of a biometric facial recognition engine through the presentation of a facial biometric artefact [1]. Facial biometric artefacts can range from a simple print-out of the individual's face to a complex 3-dimensional modelled mask. Prior developments of PAD techniques have been largely based on classical computer vision and can be loosely categorised into three main groups: (1) motion-based schemes, (2) image quality analysis schemes and (3) micro-texture-based schemes [1]. Another approach that has been employed to perform PAD on facial recognition engines is that of vertical-cavity surface-emitting lasers (VCSELs), which provide 3-dimensional sensing [16]. This is what is found in the iPhone X; this technology requires extra, costly hardware to implement, and thus a solution that leverages the pre-existing hardware within a device is favourable if its accuracy is comparable to that of current technologies.
1.2 Scope
This thesis moves away from a classical computer vision approach and harnesses recent developments in deep learning to develop a novel way of detecting presentation attacks. It looks at the generation (1st or 2nd) of an image, and thus approaches the problem from an alternative angle to one focused on the face. This has applications not only in consumer-based products as discussed above, but in any security application where facial recognition is employed, and potentially beyond this scope wherever determining the generation of an image is needed: think verification of image authenticity.
Upon commencement of this thesis project, the ambitious intention was to produce a solution that encompassed multiple forms of presentation attack detection. During the course of developing the PAD software, it quickly became apparent just how much data would be needed to ensure consistent and reliable detection. As the use case for this product is in the area of biometric security, it is imperative that accuracy be as high as possible. As a result, it was decided to limit the focus of this thesis to one attack method, the replay attack. The logic behind this was that if the developed detection engine could successfully and reliably detect this type of presentation attack, then with enough structured data it would be possible to develop a PAD method for the other means of attack.
1.2.1 Boundaries and Limitations
• Collection of data from a wide variety of the population was unfeasible without a monetary injection to fund a university-wide data collection (with a reward-based incentive for each participant).
• Computing power was limited. Access to the school's GPU was possible, however inconvenient. Another option was cloud computing, however this is a very expensive option, and when trialling one option from FloydHub, I found that the overall speed, once the uploading of datasets was accounted for, was comparable to training on my GTX 1050 Ti.
• Limited access to varying devices for diversity of the dataset when training a model on replay attacks.
1.2.2 Scope Definition
Condition            | In Scope                                         | Out of Scope
---------------------|--------------------------------------------------|---------------------------------
Lighting             | Consistent lighting                              | Extreme (really dark or really light)
PA Method            | Replay attack                                    | Lollipop, print, 3D modelled mask
Size of Dataset      | Approximately 70,000 training and 10,000 validation images | Larger-scale, diverse data
Implementation       | Manual testing (command line)                    | GUI and/or live detection
Dataset Collection   | Limited to myself, close friends and family      | Any extension to this
Diversity of Dataset | Male and female, varying age; more of myself than other individuals; lighting fairly similar across all; same device used to display the fake (Dell XPS 15) | Varying ethnicity, varying lighting, replay attack on various devices
Production Level     | Proof of concept stage                           | Not production ready; trained on one method of PA; no GUI

Table 1.1: Scope Definition
1.3 Aims
The aims of this project were as follows:
• Phase 1: The project sits in a field at the forefront of development in artificial intelligence, and more specifically the use of artificial intelligence in the biometric security space. As such, an extensive research phase was needed in order to develop an effective and in-depth solution to the problem. Not only this, but the amount of data needed was extensive, and thus the data collection process took time, as did the cleansing of this data. This can be summarised as:
– Dataset collection
– Research into differing deep learning techniques.
– Consulting with Professor Lovell and Dr Wiliem to discuss potential solutions/directions to take.
– Exploring the prior work on PAD in facial recognition engines.
– Exploring the computing power that would be feasible to complete this project under the aforementioned scope.
– Drawing on the knowledge gained at work on leveraging AI to develop solutions to problems.
– Testing multiple deep learning architectures and their ability to be fine tuned to this challenge.
• Phase 2: Phase two of this thesis was reserved for the development of the solution. This was time consuming both because of the trial-and-error nature of deep learning, where adjustments to a plethora of variables are needed to optimise the results, and because of the computationally expensive nature of the task at hand.
The culmination of these two phases will result in a proof of concept showing the ability of deep learning in the field of biometric security, specifically that of facial recognition engines. By proving the effectiveness on replay attacks, it will lead on to further solutions for other presentation attack methods, thus bringing an accessible and hardware-light solution to PAD in facial recognition implementations.
1.4 Relevance and Impact
The relevance of this problem is high in our modern world. As stated above, there is increased use of facial recognition engines in our day-to-day lives, and this is only set to continue, with facial recognition use predicted to grow by up to 26 percent through 2025 [17]. As such, it becomes increasingly important to develop techniques to prevent unwanted or unauthorised access to these devices or establishments. Current methods, like that in the iPhone X [16], employ 3-dimensional sensing in order to conduct facial recognition. However, this requires an added hardware component that is expensive as a percentage of the overall cost of the handset [18], with Android manufacturers going as far as not pursuing facial recognition as a method for unlocking their phones because they cannot command the same price premiums that Apple does [18].
Access to devices is not the only use case for this solution. A huge market in which facial recognition is utilised is entry into establishments; think entry into buildings without the need for a guard. This is applicable in a range of environments, from nightclubs to office buildings, and could allow for 24-hour access without the need for staff to monitor the doors.
Chapter 2
Literature review / prior art
This section highlights the prior work done on facial recognition presentation attack detection. In doing so, it sets the theoretical grounds from which this project launches. It outlines traditional means of PAD on facial recognition engines, namely VCSEL, image-quality schemes, motion-based schemes and texture-based schemes, and briefly explores some of the prior work done in the field of using CNNs for the detection of presentation attacks on facial recognition engines. This is a very new field, and thus there is not a vast range of papers on the matter.
2.1 Vertical-Cavity Surface-Emitting Lasers (VCSEL)
Potentially the most prevalent use of facial recognition in our modern world is that of the iPhone X. Announced in September 2017 [2], Apple's Face ID was one of the first technologies to implement facial recognition as a replacement for the traditional fingerprint reader. It utilises a 3D sensing technology based on Vertical-Cavity Surface-Emitting Lasers (VCSELs). The traditional use case for a VCSEL is as a light source to carry data over optical fibre [16]; its use in consumer-based products as a source of 3D sensing had not been done before the release of Face ID [16]. Apple achieve this using a Time of Flight (TOF) sensor, powered by an infrared illuminator, as a trigger for the dot projector, which drives a single VCSEL to create an array of 30,000 spots (Figure 2.1) [4] while also capturing an image. The dot projector returns a 3D mapping of the face, which enables the application processing unit of the phone to determine whether the face presented to the camera is in fact the owner's, and whether it is the real thing or a biometric artefact [4].
Figure 2.1: FaceID: VCSEL projecting 30,000 dots [37]

While VCSEL provides a novel solution to PAD in facial recognition engines, we still see other companies (such as Android manufacturers) lagging in their adoption of this technology, citing the cost of these units as too high to warrant inclusion in their handsets [18]. Apple is able to use technology such as this because it can charge a premium for its handsets, unlike other companies. This is the major drawback of VCSEL, and it is why a solution that leverages the pre-existing hardware within the phone (camera and processing unit) positions itself as highly important within the developments in facial recognition engine PAD.
2.2 Motion-Based Schemes
Motion-based schemes have been used with great success, especially in facial biometric use cases [6] [13]. They are especially useful for the analysis of video streams, where the motion of the total image is analysed, identifying abnormal motion using either motion correlation [14] or non-rigid analysis of the motion using GMM [8], Eulerian magnification [8] or DTA. Motion-based schemes are used for protection against methods like photo presentation, where relative movement between the subject's head and the scene context [6] is non-existent. They can also be used against lollipop presentation attacks, where irregular motion of the presented 'photo on a stick' is detected. Motion-based schemes also encompass live-face-specific motion such as blinking, mouth movement or head movements [13].
2.3 Image-Quality Schemes
Image quality analysis is the most basic form of PAD, where the image itself is analysed for basic indicators like contrast, sharpness, blur, chromatic aberration, etc. [5]. This can be effective for photo presentation attacks or even video presentation attacks, however it is not useful against 3D mask attacks. Image quality schemes are becoming even less relevant as technological advancements make presentation attacks harder to detect: where print quality was often quite poor, the level of printing quality possible today can render these schemes irrelevant or, at the minimum, less effective.
2.4 Texture-Based Schemes
These pertain mostly to the class of Local Binary Patterns (LBP) [5], or to Difference of Gaussians [15]. These techniques are used to identify the texture of the image, where a recreation of the original will not have the same skin texture (e.g. skin tone, lighting irregularities). The first work using LBP for facial PAD was conducted by Määttä et al. in 2011, whereby the image is broken into a representation of itself as a feature vector (a concatenation of uniform LBP histograms computed with (P, R) = (16, 2) and (8, 1) over the whole image, together with histograms of 9 overlapping blocks) [13]. This representation of the image as a binary number sequence after thresholding is then used as a resultant label [42]. A histogram of these labels then forms a descriptor of the texture of the image. The next step of this process is often to feed it into a support vector machine (SVM). SVMs are often used for binary classification tasks and have been used to great success for years, having first been developed in 1963 by Vladimir N. Vapnik and Alexey Ya. Chervonenkis [43]. The concept of an SVM is simple: it aims to generate a hyperplane or line between the data, isolating it into two separate classes.
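As a concrete illustration of this classical pipeline, the sketch below computes a uniform LBP histogram with scikit-image and feeds it to a linear SVM from scikit-learn. The libraries and parameters are illustrative assumptions, not those of the cited works.

# Illustrative LBP + SVM anti-spoofing pipeline (assumed libraries: scikit-image,
# scikit-learn; parameters are examples, not those used by Maatta et al.).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray_image, points=8, radius=1):
    # Uniform LBP labels each pixel; a normalised histogram of the labels
    # then acts as a texture descriptor for the whole image.
    lbp = local_binary_pattern(gray_image, points, radius, method="uniform")
    n_bins = points + 2   # uniform patterns plus one "non-uniform" bin
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

def train_lbp_svm(face_crops, labels):
    # face_crops: list of grayscale face images; labels: 1 = real, 0 = spoof.
    features = np.array([lbp_histogram(img) for img in face_crops])
    clf = SVC(kernel="linear")   # hyperplane separating real from spoof descriptors
    clf.fit(features, labels)
    return clf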
2.5 Prior Uses of CNN for PAD
With the intersection of popular consumer products and facial recognition engines becoming a reality in the last few years, and, in parallel, the increased development of graphics processing units, deep learning has only recently become a feasible solution to PAD. As a result there have been very few explorations into this field, and no integration into production products that I have been able to find. Four papers were found that explore this concept on facial recognition engines, and each uses a slightly different method to leverage deep convolutional neural networks to attain PAD [19, 21, 22, 23].
The most recent of these challenges the approach taken within this thesis, arguing that a binary classification engine built using deep convolutional neural networks is not enough to reliably detect presentation attacks [19]. Rather, they propose that an auxiliary supervision step be used to guide the learning towards discriminative and generalizable cues. They utilise a CNN-RNN model to estimate the face depth with pixel-wise supervision, as well as to estimate the rPPG signals with sequence-wise supervision. These two estimates are combined to distinguish between real and fake images. They were able to achieve an F1 score of 45.8%; while this is not a strong result in comparison to the results found here, their scope was a lot wider, which is where the lower score comes from.
The paper that most closely resembles the solution presented in this thesis explores the concept of fine tuning only the fully connected layer; rather than extracting the prediction from there, they perform a principal component analysis (PCA) to reduce the dimensionality of the features, which helps reduce over-fitting [22]. The features were extracted from the 11th, 13th, 15th, 18th, 20th, 22nd, 25th and 27th layers of their fine-tuned network. Finally, they use a support vector machine (SVM) to distinguish between real and fake facial artefacts. They were able to obtain an EER of 4.99%.
2.6 Where this solution places
The solution proposed in this thesis, while similar to the other CNN implementations, explores in more depth the process of transfer learning to garner meaningful results. In comparison to traditional computer vision techniques, fine tuning a deep convolutional neural network gives a more robust solution to a difficult problem. Traditional computer vision techniques rely on an individual to determine the factors that differentiate a real biometric artefact from a presentation attack, and subsequently to search for these through some of the aforementioned techniques. The benefit of deep learning is that instead of telling a network what to look for, one can pre-classify the data, feed it to the network and allow it to decide what differentiates the two classes. The uniqueness of this solution is the leveraging of the network's lower layers' pre-learned weights on generalised features to streamline the training process. This solution thus stands alone in its simplicity of implementation, while maintaining a high level of accuracy and relying on the quality of data to expand the robustness of a future solution.
Chapter 3
Theory
This section provides related background theory on the topic at hand, giving the reader the necessary understanding or, if already versed in the fields of deep learning and biometric security, the ability to reference key concepts.
3.1 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are, as the name suggests, networks of a large number of computational nodes referred to as neurons. These computational neurons work in a distributed fashion to collectively learn patterns from the input in order to optimise the output [24]. The most basic form of a CNN can be seen in Figure 3.1, where we see a convolution layer, a pooling layer and a fully connected layer before achieving an output. This is the simplest architecture that will be encountered; any number of convolution and pooling layers, and differing structures of fully connected layers, can be arranged to optimise the network's ability to learn from the training data.
Figure 3.1: Most basic structure of a CNN
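A minimal sketch of such a structure in Keras (the framework used in this thesis) is shown below; the layer sizes are arbitrary illustrations and not the network actually trained here.

# Minimal convolution -> pooling -> fully connected network, mirroring Figure 3.1.
# Layer sizes are illustrative only; this is not the model trained in this thesis.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # binary output: real vs fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])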
After loading our multidimensional vector (image) into the input layer, it is distributed into the hidden layers as illustrated in Figure 3.2. The input nodes are passive, merely relaying their single input into their multiple outputs. This is in contrast to the hidden and output layers, which are active, continually modifying the signal by weighing up how a stochastic change has affected the final output, either positively or negatively.
Figure 3.2: Distribution of image to hidden layers
This is where the process of learning takes place, as summarised in Figure 3.3, where each value from the input layer is distributed to all hidden nodes. As the input enters a hidden node it is multiplied by a weight, which is both predetermined and updated as learning takes place. These weighted inputs are then summed to produce a single output value which, before leaving the node, is passed through a non-linear mathematical activation function; in this case, a sigmoid activation [25]. The notion of deep learning arises when we strategically stack multiple of these hidden layers on top of each other to maximise the learning process and optimise efficiency.
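In code, a single hidden node of this kind reduces to a weighted sum followed by the activation; a toy illustration with made-up numbers:

# Toy illustration of one hidden node: weighted sum of inputs, then a sigmoid.
# The numbers are invented purely for demonstration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.2, 0.7, 0.1])     # values arriving from the input layer
weights = np.array([0.5, -1.2, 0.8])   # weights, updated as learning takes place
bias = 0.1

output = sigmoid(np.dot(inputs, weights) + bias)   # value passed to the next layer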
The training of CNNs is performed predominantly under two circumstances: supervised and unsupervised.
Supervised learning is where a CNN learns from a pre-labelled dataset, with a set of vectors representing the image and a corresponding pre-defined output. This method of training provides a lower overall classification error.
Unsupervised learning differs from the supervised case in that the input data does not have associated labels; rather, the network groups input vectors with similar characteristics into bundles.
3.1.1 Recent advancements in Artificial Intelligence
Three major breakthroughs in the last few years can be credited for the quantum leap in deep learning and artificial intelligence as a whole. The first major breakthrough was the introduction of graphics processing units (GPUs) over a decade ago: parallel computing chips that we now take for granted. These were necessitated as gaming became more computationally expensive, with the need to make millions of re-calculations every second. As these became more popular, and advancements in graphics and general computing became more demanding, the development of GPUs accelerated, and by around 2005 the cost of GPUs had come down dramatically [33].
The pivotal moment in artificial intelligence came when a team at Stanford University discovered they could use GPUs to compute neural networks in parallel [33]. Today, when nearly every consumer laptop or desktop purchased has some form of GPU, deep learning has become a viable option for problems that were otherwise inaccessible to the average individual, or even to a small data science team without access to large computing power. An extension of this advancement in hardware is the ability to utilise cloud computing, where one can rent space on compute engines that would otherwise be unattainable on a normal budget, with some single GPU units costing upwards of $5000 AUD if you were to purchase one for a build.
The second important breakthrough over the last few years has come about with the awareness of big data, and the value that has been placed in it [34]. Without access to incredibly vast datasets, our artificial intelligence is anything but intelligent. As larger companies like Google and Amazon pave the way for cloud computing and big data collection, their developments trickle down through the data science community and the general public alike, and we benefit from the data they collect, the algorithms they develop and the computing power they make available [33].
The last key development that has accelerated the feasibility of artificial intelligence can be found in improved algorithms [35]. As aforementioned, the developments of these large companies in the fields of AI have had flow-on effects for the community as a whole. With greater access to tried and tested algorithms that have been trained on big data on faster parallel computing engines, society can leverage these advancements to solve problems in a more efficient and less taxing manner.
With the convergence of these three fields, we have seen an exponential advancement in AI over the last decade [36], and if the trend of development continues we will be sure to see AI become even more a part of our lives, perhaps without us being entirely aware.
Figure 3.3: Illustration of active nodes
3.1.2 Architectures
There are numerous architectures that have been developed by leading data science teams and academic groups to perform image classification and recognition tasks. These are often readily accessible and have been trained on extremely large datasets that would otherwise be inaccessible to the average individual. Keras - which will be explained in Section 4 - makes available multiple architectures and models for image classification tasks, with their associated pre-trained weights. These include:
• Xception
• VGG16
• VGG19
• ResNet, ResNeXt, ResNetV2
• InceptionV3
• InceptionResNetV2
• MobileNet
• MobileNetV2
• DenseNet
• NASNet
VGG16 was chosen for this application due to its depth, its receptivity to fine tuning, and the fact that I had prior exposure to it and have been able to obtain good results with it on earlier image classification tasks where other architectures were not as successful.
VGG16
First debuted in the 2014 ImageNet Challenge, VGG16 explored the effect of a network's depth in the large-scale image recognition setting [30]. The main contribution was exploring the effect very small (3x3) convolution filters have on increasingly deep networks. It was found that greater accuracy could be achieved in comparison to prior methods using 16-19 weight layers; hence the 16 in VGG16 and the similar architecture VGG19. The 'VGG' comes from the developers' group name, the Visual Geometry Group [30].
The VGG team have released these two models as their highest performing architectures for further development and utilisation. This success was achieved by fixing the other parameters of the architecture and steadily increasing the depth of the network by adding more convolutional layers, which was feasible due to the use of the aforementioned very small (3 x 3) convolution filters in all layers [30]. The architecture's structure can be seen in Figure 3.4, where the input is a fixed-size (224 x 224) RGB image. The only preprocessing that VGG implement is subtracting the mean RGB value, computed on the training set, from each pixel [30]. The structure can be summarised as:
• A stack of convolutional layers with (3 x 3) filters applied; this is the smallest size able to capture spatial location.
• Convolution stride fixed to 1 pixel.
• Spatial pooling achieved with a max pooling layer after each of the 5 convolution blocks, applied over a (2 x 2) pixel window with stride 2.
• The 5 convolution blocks are usually followed by 3 fully connected layers; however, in the implemented model, due to the fine tuning, a unique fully connected layer is added, which will be explained in Section 4.
• Hidden layers have ReLU activation.
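Keras exposes this architecture, with its ImageNet-trained weights, in a single call; a sketch of loading just the convolutional base (as is done for the fine tuning described later) follows. The input size shown is VGG16's default and is an assumption rather than the exact resolution used in this thesis.

# Load the VGG16 convolutional base with ImageNet weights via Keras.
# include_top=False drops the original fully connected layers so that a new
# classifier head can be attached for the PAD task (see Chapter 4).
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
conv_base.summary()   # prints the five convolution blocks described above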
Figure 3.4: VGG16 graphic
3.2 Transfer Learning (Fine tuning)
When approaching a deep learning problem, the old adage is that a data scientist will spend approximately 90 percent of their time collecting and cleaning data, 5 percent training models and actually writing code, and the last 5 percent writing documentation and cleaning up repositories [29]. This is particularly accurate for computer vision tasks, where the patterns that are learned are often not obvious and are dictated solely by the quality and type of data fed into the network. Therefore, copious amounts of data are needed to train these large deep convolutional neural networks from scratch. The architecture used in this thesis (VGG16) was trained on a dataset called ImageNet, which contained 1000 classes and over one million training images [30] at the time of training. While training time is heavily dependent on the equipment used, to give a sense of the magnitude of the task at hand, the team in [30] trained this classifier over 74 epochs on a single machine running 4 NVIDIA Titan Black GPUs, utilising data parallelism. This is achieved by splitting the training data into several GPU batches, processed in parallel on the 4 GPUs. Taking all this into account, and the considerable resources needed to train a network from scratch on such a dataset, it took the team 2-3 weeks to train the model. This gives a clear understanding of the sheer resources needed to effectively train these deep CNNs, and thus shows why transfer learning is such an important part of deep learning and why it was applicable in this particular thesis, given the limited computational resources at hand.
Fine tuning is the process of retraining the higher layers of a CNN architecture, leveraging the idea that the features the early layers of the model learn are very generalised features common to most image-based classification tasks. In contrast, the later layers learn features and patterns more specific to the classification task at hand [31].
3.2.1 Fine Tuning vs. Training from scratch
The act of fine tuning uses the same techniques as one would use for training an entire network, whereby the weights of each layer are tweaked and updated as the model learns. The difference, however, comes in the layers that this learning process operates on. We have the ability to freeze the model up to a certain layer and to train everything from that point onwards, where freezing locks the weights already obtained through training on a prior dataset (in this case ImageNet) so that neither our new data nor further training affects them.
Therefore, fine tuning allows us to:
• Reduce the size of the dataset needed to train a model, as we are not training the whole network, only the upper layers; and the layers we are training are not being trained from scratch.
• Dramatically reduce training time, as there are fewer layers to train and thus fewer weights to update, as well as less data to iterate over. In our case, from weeks to hours.
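In Keras, freezing is a single flag per layer. The sketch below freezes everything up to the final convolution block; the cut-off point is illustrative, and the split actually used in this thesis is given in Chapter 4.

# Freeze the pre-trained base so only its upper layers are updated during training.
# The cut-off shown here (the last VGG16 convolution block) is an illustration.
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

for layer in conv_base.layers:
    layer.trainable = False
for layer in conv_base.layers[-4:]:    # block5_conv1 ... block5_pool
    layer.trainable = True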
3.2.2 Bottleneck
As discussed in this section, fine tuning leverages networks pre-trained on extremely large datasets, which have therefore learned features applicable to most computer vision problems. A key technique utilised in transfer learning is the collection of 'bottleneck' features of the network. This is achieved by running the training and validation data over the convolutional part of the network once, and storing the output. This output is the last activation mapping that the network produces before a fully connected layer [32]. The main reason bottleneck features are utilised is computational efficiency: they enable the use of these large, computationally costly networks on lower-powered parallel computing chips, as once the bottleneck features are captured, they can be loaded in a more efficient data structure to reduce the size of the trainable network. Another reason bottleneck features, coupled with fine tuning of the upper layers of the network, are utilised is to prevent overfitting; large neural networks have a large entropic capacity and thus have a tendency to overfit if data resources are insufficient.
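A sketch of capturing bottleneck features with Keras is shown below; the paths, image size and batch size are placeholders rather than the thesis configuration, and the actual script used is listed in Appendix A.1.

# Sketch of bottleneck feature extraction: run the data through the frozen
# convolutional base once and save the resulting activations to disk.
# Paths, image size and batch size are placeholders, not the thesis settings.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

datagen = ImageDataGenerator(rescale=1.0 / 255)
generator = datagen.flow_from_directory(
    "data/train",              # placeholder path containing real/ and fake/ subfolders
    target_size=(224, 224),
    batch_size=20,
    class_mode=None,           # labels are handled separately
    shuffle=False,
)

bottleneck_features = conv_base.predict(generator)
np.save("bottleneck_features_train.npy", bottleneck_features)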
3.3 Facial recognition
Facial recognition pertains to the use of an individual's unique anatomical feature, their face, as an identification mechanism. In a general sense this is done by comparing a presented face to a list of accepted faces in a database. It has a wide range of uses within society, from consumer goods such as the iPhone X up to target identification in surveillance work in government organisations and the like [27].
For the use case explored in this thesis, we will be looking at a novel solution to preventing the spoofing of these facial recognition engines. This is not limited to any exact application of facial recognition, which highlights the importance of this work and the subsequent advancements in the field of artificial intelligence, specifically in the use case of PAD for facial recognition engines.
3.4 Biometrics
This thesis is grounded in the use of biometric identification, and thus it is important to have a thorough understanding of what biometrics is, how it can be leveraged and its potential vulnerabilities.
Biometric identification, or biometrics, is the automatic identification of a person through the use of their unique anatomical features, which in the case of this thesis is their face [26]. The use of biometrics has advantages over traditional forms of identification such as a physical card or passwords. These are:
• The individual has to be present at the time of identification; therefore passwords and/or ID cards cannot be used in the individual's absence.
• The use of biometrics as a form of entrance or verification is convenient, as you do not have to remember passwords or carry a form of identification at all times.
• Biometric traits are a lot harder to steal or replicate in comparison with traditional forms of ID.
3.5 Presentation Attacks
3.5.1 Types of facial PA
There are many types of presentation attacks used to penetrate facial recognition engines, namely: print attack, replay attack and 3D modelled mask. Each of these has its own parameters for detection, and thus an inherent difficulty associated with detecting it. This thesis explores the detection of the replay attack; however, it is important to understand the problem in its entirety to comprehend the broad impact of presentation attacks on our lives.
Print Attack (2D image)
A common method of bypassing 2D facial recognition engines is to present a photograph of the intended subject [6]. In a world where our faces are plastered all over the internet on our social media platforms, the security of using biometric facial measures to unlock our devices, or allow access to rooms, is an increasing risk. An image of the intended subject can be easily obtained and displayed, either in hard copy or in digital form on a screen [6]; and it is known that facial recognition systems respond quite poorly to such attacks due to the ease of recreation [7] [6]. 2D image presentation is the simplest form of presentation attack, and therefore the easiest to detect; the latter methods prove harder to detect.
The 'lollipop' attack is an extension of the preceding 2D image attack. In the 2D image attack, motion detection can be used to detect the difference between the subject and its background context [8]. The 'lollipop' attack is essentially the 2D image placed on a stick, providing a motion difference between the background and subject and thus increasing the difficulty of PAD.
Replay Attack
A recorded video can be used to fool facial recognition systems. PAD becomes even harder again, as the previous methods used on the 2D image and lollipop attacks do not apply as easily to a moving video, and thus methods like detecting the frame of the device, or reflections in the glass, are necessary to detect a presentation attack. This type of presentation attack is particularly important, as this thesis primarily uses video, and video of a video, to train and test the CNN, detecting whether an image is 1st generation or 2nd generation (real or fake).
3D Printed Mask
With recent developments in PAD, 3D moulded masks have become a way of attacking facial recognition engines, whereby precise measurements are taken of the intended subject and a realistic mask is then manufactured. This method of presentation attack is particularly effective in fooling facial recognition systems, however it is incredibly hard to manufacture. The likelihood of being able to obtain someone's exact facial measurements, skin texture, etc. is extremely low if you are trying to hack their system, and thus the risk of this type of attack is low. This is summarised by Rich Mogull, a security analyst, who said about Apple's Face ID: "If you have to 3D print a model of someone's face to defeat this, that's probably an acceptable risk for most of the population." [9] However, he goes on to caveat this statement by saying "if I were an intelligence agent, I wouldn't turn on any biometric." [9]
3.6 Summary and conclusions
It can be seen from the preceding theoretical overview of the key design choices for this thesis that fine tuning deep convolutional neural networks presents exciting potential in the biometric security space, specifically in facial recognition PAD. The implementation of transfer learning coupled with the Visual Geometry Group's VGG16 architecture is an exciting prospect for PAD, and access to extensive, diverse data is paramount to the success of a production-level product. The following chapter outlines the implementation of this theoretical solution.
Chapter 4
Solution Design and Implementation
The purpose of this thesis was to develop a presentation attack defence mechanism for facial recognition engines by leveraging transfer learning on convolutional neural networks. It aims to address the problem of biometric security as we move into an age where facial recognition engines are becoming ever present in everything from consumer-based products to the highest security clearance applications. The following sections outline the techniques used, the frameworks utilised and the hardware the solution was trained on. This was conducted as a proof of concept rather than a production-ready product, and this must be taken into account when considering the implementation at this point.
I broke this thesis into two phases, to spread the workload over the two semesters it was conducted in. As I outlined previously in Section 1.3,
Phase 1 was reserved for:
• Dataset collection
• Research into differing deep learning techniques.
• Consulting with Professor Lovell and Dr Wiliem to discuss potential solutions/directions to take.
• Exploring the prior work on PAD in facial recognition engines.
• Exploring the computing power that would be feasible to complete this project under the aforementioned scope.
• Drawing on the knowledge gained at work on leveraging AI to develop solutions to problems.
• Testing multiple deep learning architectures and their ability to be fine tuned to this challenge.
This phase commenced at the beginning of Semester 2, 2018, and was completed during the summer break. Dataset collection was by far the most time consuming task, with over 120GB of data collected personally and an additional 5GB obtained from an online dataset; however, as will be discussed later, this online dataset became less useful and could be omitted if training were conducted again. Furthermore, the data cleansing process after the dataset had been formed, to ensure a balanced and non-biased dataset, was the second most time consuming part of Phase 1. Through trial and error, it was found that initial models had a heavy bias due to the data presented to the model, and thus skewed results. Later iterations, after the referenced data cleansing had been conducted, eliminated this bias.
Phase 2, as outlined in Section 1.3, was for the development of the model. Once the data was curated appropriately, the actual task of training the model and optimising the results was not as time consuming as the dataset collection and curation; however, it was still a time consuming and at times frustrating endeavour, as training the model was computationally expensive. Changes and tweaks to the model took approximately 20 hours each to implement, which will be explained later.
Reflecting on the decision to break this thesis into two phases highlights the importance that data has on the success of a project, especially this one. Having two phases ensured there was a logical flow to the project and that there was always progress in the right direction, instead of wasting time training models on improperly curated data.
4.1 Design Overview
Figure 4.1: Design flow
The developed solution leverages a network architecture developed by the Visual Geometry Group called VGG16. The network is fine-tuned from the last convolutional layer up through the newly created fully connected layer. The solution is written in Python, using a Tensorflow back end and a Keras front end. The decision to use Keras as a frontend, as opposed to writing the solution purely in Tensorflow, was made for simplicity. Keras is an open-source neural-network API written in Python that can operate on top of Theano, CNTK or, in the case of this thesis, Tensorflow. It enables fast deep learning experimentation and development and comes with VGG16 pre-packaged, which increases its appeal. It is summarised as "Being able to go from idea to result with the least possible delay is key to doing good research" [38], which was exactly the paradigm of this thesis.
This solution was not developed as a complete solution to the problem of PAD in facial recognition engines. Rather, it was developed as an exploration into the possibility of using deep learning as a means of preventing these attacks, where current solutions rely on extra (and often very expensive) hardware, as explored prior. As a result, it does not come ready for integration into doorways or smartphones. For this integration to happen there would need to be an individual development period for each integration, as the differing hardware, operating systems and environments would all impact the detection capabilities and general operation of the solution. As aforementioned, this solution also does not address all methods of presentation attack, but rather focuses on the replay attack under very specific conditions, which will be outlined below. The theorised implementation flow would look like Figure 4.1, if it were to be further developed and implemented.
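As an indication of how such a model is assembled in Keras, the sketch below attaches a new fully connected head to the VGG16 base and compiles it for binary (real versus fake) classification. The layer sizes and optimiser settings are illustrative assumptions; the actual head and hyperparameters are those described in Section 4.4 and Appendix A.2.

# Sketch of attaching a new fully connected head to the VGG16 base and compiling
# the combined model for binary (real vs fake) classification.
# Layer sizes and optimiser settings are illustrative, not the thesis values.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)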
4.2 Implementation
With no funding for this research project, its implementation was confined to the hardware that was easily accessible. This thankfully included a GPU, which would be considered the most crucial element of hardware in any deep learning problem. It allowed for faster-than-CPU training and testing times, which meant a more streamlined development phase. While the GPU used was far from optimal, it still provided adequate performance for the task at hand.
The workflow for training and garnering insights into the solution was designed as a three-step process. At their most simple level, the steps are:
• Generating bottleneck features
• Fine tuning the model
• Evaluating and generating insights into the model
The reasons behind pursuing fine tuning for this solution have been thoroughly established throughout this paper, so they will not be elaborated on in this section. Rather, the key aspects of the solution will be explained.
4.2.1 Hardware
The following section outlines the hardware used for the generation of this solution. It encompasses the devices used for capture, the device used for presenting the replay attack, and the machine used for training the neural network.
Throughout the process of this thesis, accessibility and cost effectiveness were two key criteria that were adhered to, and the hardware choices were no different. While more powerful and capable options were available at the time, they came at a price premium.
Graphics Processing Unit
Crucial to any deep learning problem is the graphics processing unit (GPU). It is the single most important piece of hardware, as it provides the computational power necessary to conduct the complex matrix calculations inherent in deep learning. As aforementioned, this project had no budget, so it was necessary to minimise the cost of all stages of this solution, and as a result it was decided to utilise the GPU already built into my personal laptop. Thankfully, while it is a low-end GPU, it is still considered a good entry-level GPU for deep learning and can be parallelised with NVIDIA's CUDA parallel computing platform. The GPU utilised was the NVIDIA GTX 1050 Ti. A comparison can be seen in Figure 4.2.
Figure 4.2: Comparison of GPUs
While cost was a major consideration, the ability to prototype and receive meaningful results in a reasonably timely manner was also important. In the initial stages of the project, careful consideration was given to whether to continue training on the 1050 Ti or to pursue cloud computing. Using FloydHub, a comparison was run in which an early iteration of a model was trained for 20 epochs on their most affordable GPU instance, a Tesla K80. Taking into account the increased RAM of the GPU, the batch size was increased from 20 to 256. While there was an improvement in overall training time, the difference did not warrant spending upwards of $20 AUD for each training run. Given the small decrease in total training time, the quite large cost of using cloud instances ($9 USD a month, plus $1.20 USD per hour of training) and the laborious task of migrating large datasets to the cloud (in the case of this thesis, 15GB), pursuing a cloud computing option under the budget constraints of this project was not possible.
Presentation Hardware
As defined in the scope, this thesis revolves around developing a deep learning solution to presentation attacks on facial recognition engines, specifically replay attacks. When deconstructing this problem, it becomes apparent that there is a multitude of potential ways a replay attack could occur, most of these variations coming from the device the attack is presented on. In the same vein as limiting the exploration of this solution to replay attacks, it also became necessary to limit the device these replay attacks were presented on. It was decided that the replay attacks would be displayed on a Dell XPS 15 laptop. This was chosen due to the way the replay is rendered on the screen (see Figure 4.3). For the sake of garnering meaningful insights into the way the model differentiated between real and fake, this display was utilised, as it provided a very definite framing of the face.
Figure 4.3: Example of spoof image
Host Machine
The machine used for all development and training of the solution was the Dell XPS 15; specifications for the machine can be seen in Figure 4.4.
Figure 4.4: Machine used for training
4.2.2 Software used for development
The development environment, much like that of any other software development, consisted of an IDE and version control for keeping everything organised and logical. This allowed for a methodical approach to development. The IDE used was VS Code, with integrated linting, a version control portal and IntelliSense. VS Code offers a very nice platform, especially for deep learning development, where easy visual access to dataset paths and config files streamlines the process and the monotony of development. The version control used was Git, with a repository stored on GitHub, which is linked in Appendix B.
4.2.3 Overview of deep learning framework
There are many languages that can be used to write a deep learning solution, however Python presents as the most supported online, as well as being a very easy language in which to prototype and eventually develop a solution. This simplicity lends itself to easy-to-follow code and therefore reliable solutions, and comes down to the simplified syntax and the emphasis on natural language throughout. This is why Python was used to develop this solution.
Back end
Tensorflow was used as the backend to facilitate the use of neural networks. Much like Python, Tensorflow was utilised for its accessibility and support network. While it is not as fast as other DL frameworks like CNTK and MXNet, the wide support online and its seamless integration with Python, as well as with Keras, make it a logical choice [40].
Front end
Tensorflow, while powerful, is extremely verbose in its implementation, which is why Keras has been utilised as a front end API to make the task of development easier. Much like any other deep learning problem, this problem utilises large amounts of data, and Keras offers helpful functions for interfacing the network with this data.
4.3 Dataset
4.3.1 Data Collection
The data collection phase took the longest amount of time during the development of this project. As has been made abundantly clear throughout this paper, extensive and varied data is paramount to the success of a deep learning solution. As a result, it was made a priority that data collection would begin as soon as possible in the first semester of this thesis. However, as there was no budget for this project, there was no opportunity to conduct capture sessions in which a reward could be offered for participation. As a result, the data collection relied on family and friends donating their time and their faces for the benefit of this project.
The collection was kept as simple as possible: participants were asked to film
their face with their phones, occasionally moving it around, to generate as many
unique frames as possible for the network to learn from. The following devices
were used in the collection (a sketch of the subsequent frame-extraction step follows the list):
• iPhone X
• iPhone 6
• iPhone 7
• Logitech C920 HD Pro
• Dell XPS 15 webcam
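The captured videos were then split into individual frames; a minimal sketch of that step is shown here, assuming OpenCV is used and with illustrative file paths:

import cv2

# Split one captured video into individual frames (paths are illustrative).
capture = cv2.VideoCapture('captures/participant01_real.mov')
frame_idx = 0
while True:
    ok, frame = capture.read()
    if not ok:
        break  # end of the video
    cv2.imwrite('dataset/training/real/frame_{:06d}.png'.format(frame_idx), frame)
    frame_idx += 1
capture.release()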
4.3.2 Size and Structure
As has been outlined throughout, the data was the most important part of ensuring
a successful outcome. With any deep learning problem, adequate data needs to be
presented to the network for it to begin to identify deterministic features within
each class. In this case, the network needed to learn the difference between real
and fake images presented to it. The initial dataset that was curated was 120 GB,
with a 70/20/10 split between training, validation and test sets. This equated to a
total of approximately 850,000 images. However, these images covered only 10
individuals' faces, and due to the method of collection (filming faces and then
splitting each video into its frames) there was a large amount of similar data
within the set. Furthermore, there was an uneven balance of individuals' faces,
which resulted in a heavy bias. As a result, the decision was made to further curate
this dataset into more unique examples. The end result was a 15 GB dataset with an
80/10/10 split over train, validation and test respectively, equating to a total of
approximately 98,000 images. This reduction in the size of the dataset, without
losing unique data, allowed for a more efficient training process with respect to
total training time. The directory layout used for the curated data is sketched below.
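For reference, this is the layout implied by the configuration and helper scripts in Appendix A, with class sub-folders real/ and fake/ under each split (folder names taken from the config file):

new_curated_smaller/
    training/
        real/       # frames of genuine faces
        fake/       # frames of replayed faces
    validation/
        real/
        fake/
    test/
        real/
        fake/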
Figure 4.5: Samples of real images
A sample of the variation in conditions under which facial artefacts were captured
can be seen in Figures 4.5 and 4.6. While an effort was made to keep the lighting
and crop of the face reasonably consistent, variation did creep in through each
individual's interpretation of the instructions given to them. This did not appear
to affect the end results; however, the test cases were also conducted under these
similar lighting conditions, and so the results may be representative of just what
the network learnt.
Figure 4.6: Samples of fake images
4.3.3 Improvements and limitations
It is clear from Figures 4.5 and 4.6, and from what was touched on above, that the
dataset did not perfectly represent all lighting conditions that would be expected in
a real-world implementation; rather, reasonably good lighting was employed during
each data gathering session. If this solution were to progress past this early
proof-of-concept stage, considerable emphasis would need to be placed on the conditions
under which the data gathering sessions were conducted: not just to ensure optimal
conditions, but the contrary, to ensure coverage of all environments that would be
expected in a real-world implementation of this solution, including the sub-optimal
conditions often experienced in dimly lit rooms. This is where deep learning, when
starved of certain variations of data, will fall over, and where other techniques
such as VCSEL-based sensing would still allow accurate detection.
The obvious limitation of this solution is that it was only trained to detect replay
attacks, and under the very specific circumstances outlined throughout. However, this
was by design; it was never meant to be a complete solution, rather a proof that this
particular method, even on limited hardware and with limited access to extensive data,
can obtain a reasonable result in the endeavour to minimise the hardware overhead
required within devices for the detection of presentation attacks on facial recognition
engines. Further work is outlined below, and this is where the improvements will take
place: through greater diversity of data and, with that, the range of attack methods
the model is able to detect.
4.4 Architecture
The architecture of VGG16 was outlined above, and the revised architecture can be
seen in Figure 4.7, where the fully connected classifier found in the standard VGG16
has been replaced with one that suits binary classification: a single output unit
with a sigmoid activation (real or fake) rather than the standard 1000-class softmax.
The two halves of the network (the convolutional base and the fully connected
classifier) are instantiated separately and run once over the training and validation
data to obtain what are called bottleneck features. The bottleneck features of the
convolutional base are the last activation maps produced before the fully connected
classifier. As explained in the theory, these bottleneck features are generated and
stored in NumPy arrays, rather than fusing the convolutional base and the fully
connected classifier and training the whole network, for computational efficiency.
This was only possible because data augmentation was not being used, as there was
sufficient data for the task at hand. These modifications to the network facilitate
an efficient fine-tuning process, one that is tailored to the task at hand: binary
classification. A condensed sketch of this bottleneck-feature step is given below.
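A condensed sketch of the bottleneck-feature step, drawn from the generation script in Appendix A.1 (the dataset path and the .npy file name here are illustrative):

import numpy as np
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator

# Run the convolutional base once over the (unshuffled) training images and
# store its final activation maps as bottleneck features.
base = VGG16(include_top=False, weights='imagenet')
datagen = ImageDataGenerator(rescale=1. / 255)
generator = datagen.flow_from_directory(
    'dataset/training', target_size=(224, 224),
    batch_size=20, class_mode=None, shuffle=False)
features = base.predict_generator(generator, steps=len(generator), verbose=1)
np.save('bottleneck_features_train.npy', features)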
Figure 4.7: VGG16 graphic
4.4.1 Fine tuning model (transfer learning)
The last convolutional block of the model, illustrated in Figure 4.7 by the 14x14x512
block, as well as the fully connected classifier, is fine tuned. This is done after
the bottleneck features of the full convolutional base have been collected and the
fully connected classifier has been trained on them separately. The process is
achieved by freezing the weights of every layer up to the last convolutional block,
meaning those weights cannot be updated. Fine tuning is possible with the VGG16
packaged with Keras because it has been trained on ImageNet and therefore already
has trained weights. If either part did not already have trained weights, the large
gradient updates in the fully connected classifier triggered by randomly initialised
weights would destroy the learned weights in the convolutional base. The weights for
the fully connected classifier are obtained when the bottleneck features are created.
Another reason for fine tuning, in addition to the reasons previously mentioned, is
that a network this large, coupled with a large dataset of limited variation, has a
very large entropic capacity and thus a tendency to overfit. As the features learnt
in the lower levels of the convolutional base are more general (edges, variation in
lighting, etc.), they apply to a wide range of computer vision tasks and can therefore
be leveraged in this solution. Another consideration that is important in order to
obtain meaningful results is to use a relatively slow learning rate (in this case
1e-4); this prevents the large-magnitude updates that tend to ruin previously learnt
weights. A condensed sketch of this freezing and compilation step is given below.
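A condensed sketch of the freezing and compilation step (the full script, including loading of the pre-trained classifier weights, is given in Appendix A.2; the 224x224 input size comes from the config file):

from keras import optimizers
from keras.applications import VGG16
from keras.layers import Flatten, Dense, Dropout
from keras.models import Model, Sequential

# Assemble the combined network: VGG16 convolutional base plus the small
# binary classifier on top.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
top = Sequential()
top.add(Flatten(input_shape=base.output_shape[1:]))
top.add(Dense(256, activation='relu'))
top.add(Dropout(0.5))
top.add(Dense(1, activation='sigmoid'))
model = Model(inputs=base.input, outputs=top(base.output))

# Freeze everything up to the last convolutional block so that only block 5
# of VGG16 and the classifier receive gradient updates.
for layer in model.layers[:15]:
    layer.trainable = False

# A slow SGD learning rate avoids the large updates that would otherwise
# destroy the pre-trained ImageNet weights.
model.compile(optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              loss='binary_crossentropy',
              metrics=['accuracy'])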
During the training and validation process, the dataset was run over the last
convolutional block and the fully connected classifier for 20 epochs, with a batch
size of 20. The low batch size was due to the limited graphics card used for
training, which had only 4 GB of memory. While 20 epochs is not a lot compared with
the usual deep learning task, it can be seen in Figure 5.1 that the loss and accuracy
had plateaued by epoch 20, and running the training any longer would not have been
beneficial. This can potentially be put down to the task being a reasonably easy
binary classification, with the network only needing to learn the difference between
two classes. Furthermore, the two classes were obviously different in the way they
were represented, so the network learnt relatively quickly. Another factor that
contributes to the efficient convergence of the network, despite its depth and large
number of parameters, is the implicit regularisation imposed by the greater depth
and smaller convolution filter sizes [30].
Overall training time for this configuration was approximately 20 hours. Deep learning
is an extremely computationally expensive task, and as has been outlined, every effort
was made to make the process more efficient. With the hardware at hand, operating over
a large network like VGG16 and a reasonably large dataset, this training time is about
as efficient as one could expect, which only serves to highlight the effectiveness of
transfer learning. Considering that training only the last convolutional block and the
fully connected classifier took 20 hours, training the network from scratch would blow
the time out significantly and would be completely unfeasible on this hardware. As a
comparison, in the original training of VGG16 with state-of-the-art computing power
(four Titan Black GPUs), it took the Visual Geometry Group 2-3 weeks to train a
single net [30].
4.4.2 Design Rationale
While the implementation of this solution is not the most ground-breaking or
revolutionary, that was largely intentional. In our modern epoch, where advances in
artificial intelligence are starting to drive change across all industries, it can
often be forgotten that simple implementations can sometimes garner the most reliable
results. As a proof of concept at an honours level, this became an interesting foray
into the possibilities of leveraging established techniques and applying them to a
problem that is at the forefront of concern. As this thesis developed, it became a
goal to show that even relatively simple solutions can be applied to a very difficult
problem, and to highlight the importance of meaningful data.
In phase one of this thesis, where extensive research was conducted into similar
methods used for this problem, none covered the exact implementation outlined in this
paper. Furthermore, there are countless tutorials, articles and papers written on
binary classification that use meaningless examples to illustrate the concept. This
posed an interesting opportunity to develop a binary classifier using these methods
that solves a real-world problem.
Chapter 5
Results and Discussion
Upon commencing this thesis topic, there was very little in the way of presentation
attack detection for facial recognition engines, and as a result there was very little
to compare or benchmark results against. Even at this point, there is still only a
handful of approaches that pursue a similar technique. However, with an F1 score of
99.96% as seen in Figure 5.4, it is clear that the results of this exploration into
the viability of fine tuning deep convolutional neural networks as a detection method
highlight the effectiveness of such a technique. During training, it can be seen in
Figure 5.1 that a high level of accuracy was attained early on (around epoch 6),
meaning that over the training data the network quickly learnt to differentiate
between real and fake. This can be attributed to the very binary nature of the task:
if one were to look at two example images, it would be very evident which is real and
which is fake, as the fake is displayed on a laptop screen with a black background
surrounding it.
Figure 5.1: Tensorboard outputs
It was thus hypothesised that this is what the network was learning when
differentiating between the two cases. This was verified by taking one of the real
images, pasting it over a black background, and running it through the model; as
expected, the model classified the image as fake, as seen in Figure 5.2. A minimal
sketch of this check is given below.
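A minimal sketch of that check, assuming OpenCV and NumPy are used to composite the image (the sizes and paths are illustrative):

import cv2
import numpy as np

# Paste a genuine face image onto a larger black canvas, mimicking the black
# border of a replayed image on a laptop screen (illustrative paths and sizes).
face = cv2.imread('dataset/test/real/frame_000123.png')
h, w = face.shape[:2]
canvas = np.zeros((2 * h, 2 * w, 3), dtype=np.uint8)
canvas[h // 2:h // 2 + h, w // 2:w // 2 + w] = face
cv2.imwrite('demo_images/real_on_black.png', canvas)
# The composited image is then classified with the evaluation script (Appendix A.3).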
Figure 5.2: Fake spoofing
Further investigating the results, Figure 5.3 shows that there were no examples of
the model predicting an image as real when it was actually fake. The false positives
came from the model predicting an artefact as fake when it was in fact real, much
like in Figure 5.2. This can be explained by the model looking for the black border
around the image, which in fake images, as they were all captured under the same
circumstances, is always present. In the case of real images, there is the potential
for a darker area to be present within an image, which could be mistaken for the
black area around the image on the computer screen. This further highlights the need
for a versatile and varied dataset that does not introduce bias into the model. The
dataset was purposely curated in this way to allow the generation of meaningful
results in the absence of extensive varied data; it gave the network something
easier to learn than what would be required if a production-ready solution were to
be developed down the line.
Figure 5.3: Confusion Matrix
The scores shown in Figure 5.4 were calculated as follows [39]:
\[ \mathrm{precision} = \frac{tp}{tp + fp} \]
The precision ratio gives an intuitive measurement of the model's ability not to
label an artefact as positive when in fact it is negative, where tp is the number of
true positives and fp is the number of false positives.
\[ \mathrm{recall} = \frac{tp}{tp + fn} \]
The recall gives a representation of the model's ability to find all the true
positives, where fn is the number of false negatives.
\[ F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]
The F1 score conveys the balance between precision and recall. It is a measure of
the classifier's accuracy, the harmonic mean of the two prior measures, and is best
at 1 and worst at 0. A short worked example of these definitions is given below.
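As a short worked example of these definitions, assuming scikit-learn and with illustrative labels rather than the actual test-set outputs:

from sklearn.metrics import precision_recall_fscore_support

# Illustrative labels only: 1 = real, 0 = fake.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]   # one genuine image wrongly flagged as fake
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='binary')
print(precision, recall, f1)  # 1.0, 0.667 (2/3), 0.8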
Figure 5.4: Final Scores
Chapter 6
Conclusions
6.1 Summary and conclusions
As seen in the results, the F1 score obtained was incredibly promising. It showed
that, under the conditions the network was trained on, it was extremely capable of
detecting presentation attacks by determining whether an image was fake or real. In
saying this, the solution was not perfect. Through thorough testing it was found the
network could be fooled if the lighting, the screen the attack was replayed on, or a
host of other environmental variables were altered. While this is not a failure, it
is a note on the importance of extremely large and varied data sets. It highlights
that this problem, like any deep learning problem, requires thorough scenario
planning before the solution is developed, to ensure that when data is collected and
collated it incorporates all the potential scenes the engine is likely to see.
With interest growing in the intersection of biometrics and convenient access to
everything from data to buildings, it has never been more important to focus on the
other side of this coin: the security of these systems. With the emergence of viable
deep learning solutions, the need for standalone hardware solutions to facial PAD
might become a thing of the past.
6.2 Possible future work
As has been touched on throughout this paper, the solution presented is not a
complete one; it only explores the possibility of implementing such a technique in a
real-world scenario by investigating the effectiveness of detecting replay attacks.
As a result, further work would revolve around exploring all the types of
presentation attack on facial recognition engines. Once the scope of the final
solution has been established, an extensive data gathering and collation process
would need to take place in order to ensure there is enough variation to effectively
identify presentation attacks under each of these methods. Some things that would
need to be taken into consideration are:
• Type of attack
• Gender
• Device the data is captured with
• Device the artefacts are presented on (in the case of attacks that use a device
to display)
• Variation in lighting
• Age of the individuals
• Angle the data is captured at
This is not an exhaustive list, but it aims to highlight the consideration needed
when capturing vast amounts of data for a task such as this, where a false negative
could result in the loss of valuable information or assets. When curating data sets,
it is easy to overlook certain factors and subsequently introduce a bias into your
model, which in turn introduces a vulnerability into the detection mechanism.
While VGG16 served as a great platform for transfer learning on which to experiment
and test the feasibility of this approach, the majority use case for this sort of
technology will be within handheld devices like phones, so a more lightweight network
would need to be implemented. This is down to the processing power available on
current phones, which pales in comparison to even the cheapest CUDA-optimised GPUs.
However, in the future this may not be as much of an issue, as technological
advancements will likely continue in the same vein they have over the last decade.
Given the right amount of time, adequate computing power and access to large
segments of the population for data collection, the problem of facial presentation
attack detection is one that could be solved with deep learning, specifically fine
tuning deep convolutional neural networks.
Appendix A
Code Listings
A.1 Bottleneck Features Generation Script
import json
import math

import numpy as np
import tensorflow as tf
from keras.applications import VGG16
from keras.layers import Flatten, Dense, Dropout
from keras.models import Sequential
from keras.preprocessing.image import ImageDataGenerator

from data_visualisation import vis_dataset, train_samples, validation_samples

# Let TensorFlow grow GPU memory usage rather than allocating it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

with open('config.json') as f:
    conf = json.load(f)

top_model_weights_path = 'fc_model_13gb.h5'
# vis_dataset()


def save_bottleneck_features():
    datagen = ImageDataGenerator(rescale=1. / 255)
    # Build the VGG16 network without its fully connected classifier.
    model = VGG16(include_top=False, weights='imagenet')

    # Run the convolutional base once over the (unshuffled) training images
    # and store the resulting bottleneck features.
    generator = datagen.flow_from_directory(
        conf['train_path'],
        target_size=(conf['height'], conf['width']),
        batch_size=conf['batch_size'],
        class_mode=None,
        shuffle=False)
    bottleneck_features_train = model.predict_generator(
        generator,
        int(math.ceil(train_samples()[0] / conf['batch_size'])),
        verbose=1)
    np.save('bottleneck_features_train_13gb.npy', bottleneck_features_train)

    # Repeat for the validation images.
    generator = datagen.flow_from_directory(
        conf['validation_path'],
        target_size=(conf['height'], conf['width']),
        batch_size=conf['batch_size'],
        class_mode=None,
        shuffle=False)
    bottleneck_features_validation = model.predict_generator(
        generator,
        int(math.ceil(validation_samples()[0] / conf['batch_size'])),
        verbose=1)
    np.save('bottleneck_features_validation_13gb.npy',
            bottleneck_features_validation)


def train_top_model():
    train_data = np.load('bottleneck_features_train_13gb.npy')
    # The generators above are unshuffled, so the first half of the features
    # belongs to one class and the second half to the other.
    train_labels = np.array(
        [0] * (train_samples()[0] // 2) + [1] * (train_samples()[0] // 2))

    validation_data = np.load('bottleneck_features_validation_13gb.npy')
    validation_labels = np.array(
        [0] * (validation_samples()[0] // 2) +
        [1] * (validation_samples()[0] // 2))

    # Small fully connected classifier trained on the bottleneck features.
    model = Sequential()
    model.add(Flatten(input_shape=train_data.shape[1:]))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',
        loss='binary_crossentropy',
        metrics=['accuracy'])

    model.fit(
        train_data,
        train_labels,
        epochs=conf['epochs'],
        batch_size=conf['batch_size'],
        verbose=1,
        validation_data=(validation_data, validation_labels))
    model.save_weights(top_model_weights_path)


save_bottleneck_features()
train_top_model()
A.2 Fine Tune Network Script
import json
import time

import tensorflow as tf
from keras import layers, optimizers
from keras.applications import VGG16
from keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
from keras.models import Model, Sequential
from keras.preprocessing.image import ImageDataGenerator

from data_visualisation import train_samples, validation_samples

NAME = "imception-finetune-on-13gb-20epochs-{}".format(int(time.time()))

# Let TensorFlow grow GPU memory usage rather than allocating it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

top_model_weights_path = 'fc_model_13gb.h5'

tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))

with open('config.json') as f:
    conf = json.load(f)

# Build the VGG16 convolutional base.
VGG_model = VGG16(
    weights=conf['weights'],
    input_shape=(conf['height'], conf['width'], 3),
    include_top=conf['include_top'])
print('Model loaded.')

# Build the fully connected classifier.
top_model = Sequential()
top_model.add(layers.Flatten(input_shape=VGG_model.output_shape[1:]))
top_model.add(layers.Dense(256, activation='relu'))
top_model.add(layers.Dropout(0.5))
top_model.add(layers.Dense(1, activation='sigmoid'))

# Load the weights obtained from bottleneck training.
# Needed in order to conduct fine tuning.
print("[INFO] - Loading top model weights")
top_model.load_weights(top_model_weights_path)

# Add the fully connected classifier on top of the convolutional base.
print("[INFO] - Adding top layer")
model = Model(inputs=VGG_model.input, outputs=top_model(VGG_model.output))
model.summary()

# Set every layer up to the last convolutional block as non-trainable.
# This preserves the pre-trained weights in those layers.
for layer in model.layers[:15]:
    layer.trainable = False

print("[INFO] - Compiling...")
model.compile(optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# No data augmentation is applied for now.
datagen = ImageDataGenerator(rescale=1. / 255)
train_batches = datagen.flow_from_directory(
    conf['train_path'],
    target_size=(conf['height'], conf['width']),
    batch_size=conf['batch_size'],
    class_mode='binary')
valid_batches = datagen.flow_from_directory(
    conf['validation_path'],
    target_size=(conf['height'], conf['width']),
    batch_size=conf['batch_size'],
    class_mode='binary')

# Optional callbacks (not passed to fit_generator in this run).
es = EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=7)
mc = ModelCheckpoint('best_model_13gb.h5',
                     monitor='val_acc',
                     mode='max',
                     verbose=1,
                     save_best_only=True)

history = model.fit_generator(
    train_batches,
    validation_data=valid_batches,
    epochs=conf['epochs'],
    steps_per_epoch=(train_samples()[0] // conf['batch_size']),
    validation_steps=(validation_samples()[0] // conf['batch_size']),
    verbose=1,
    callbacks=[tensorboard])

# Serialise the model architecture to JSON and the weights to HDF5.
model_json = model.to_json()
with open("finetuned_vgg16_13gb.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights('imception_finetune_13gb.h5')
print("[INFO] - Saved model to disk")
A.3 Evaluate Model
# Import the necessary packages.
import glob
import itertools
import json

import cv2
import imutils
import matplotlib.pyplot as plt
import numpy as np
from keras.models import model_from_json
from keras.preprocessing.image import ImageDataGenerator, img_to_array
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

np.set_printoptions(suppress=True)

with open('config.json') as f:
    conf = json.load(f)


def single_image():
    for file in glob.glob("C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/demo_images/*.png"):
        image = cv2.imread(file)
        orig = image.copy()

        # Pre-process the image for classification.
        image = cv2.resize(image, (224, 224))
        image = image.astype("float") / 255.0
        image = img_to_array(image)
        image = np.expand_dims(image, axis=0)

        # Load the trained convolutional neural network.
        print("[INFO] loading network...")
        json_file = open(conf['FT_model'], 'r')
        loaded_model_json = json_file.read()
        json_file.close()
        model = model_from_json(loaded_model_json)
        # Load the weights into the new model.
        model.load_weights(conf['FT_weights'])
        print("loaded model from disk")

        # Classify the input image and build the label.
        prediction = model.predict(image)
        label = "Real" if prediction == 1 else "Fake"
        label = "{}".format(label)

        # Draw the label on the image and show it.
        output = imutils.resize(orig, width=400)
        cv2.putText(output, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (0, 255, 0), 2)
        cv2.imshow("Output", output)
        cv2.waitKey(0)


def load_model():
    # Load the trained convolutional neural network.
    print("[INFO] loading network...")
    json_file = open(conf['FT_model'], 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    model = model_from_json(loaded_model_json)
    # Load the weights into the new model.
    model.load_weights(conf['FT_weights'])
    print("loaded model from disk")
    # return model


def generate_scores():
    # Load the trained convolutional neural network.
    print("[INFO] loading network...")
    json_file = open(conf['FT_model'], 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    model = model_from_json(loaded_model_json)
    # Load the weights into the new model.
    model.load_weights(conf['FT_weights'])
    print("loaded model from disk")

    generator = ImageDataGenerator()
    test_generator = generator.flow_from_directory(
        conf['test_path'],
        target_size=(conf['height'], conf['width']),
        batch_size=conf['batch_size'],
        shuffle=False,
        class_mode='binary')

    # Predict crisp classes for the test set.
    test_generator.reset()
    predictions = model.predict_generator(test_generator, verbose=1)
    predictions = np.concatenate(predictions, axis=0)
    predictions = predictions.astype(int)
    val_trues = test_generator.classes

    cf = confusion_matrix(val_trues, predictions)
    precisions, recall, f1_score, _ = precision_recall_fscore_support(
        val_trues, predictions, average='binary')

    # Plot the confusion matrix with the F1 score shown on the x-axis label.
    plt.imshow(cf, cmap=plt.cm.Blues, interpolation='nearest')
    plt.colorbar()
    plt.title('Confusion Matrix without Normalization')
    plt.xlabel('Predicted\nF1 Score: {0:.3f}%'.format(f1_score * 100))
    plt.ylabel('Actual')
    tick_marks = np.arange(len(set(val_trues)))  # number of classes
    class_labels = ['Fake', 'Real']
    plt.xticks(tick_marks, class_labels)
    plt.yticks(tick_marks, class_labels)
    # Write the count inside each cell of the matrix.
    thresh = cf.max() / 2.
    for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
        plt.text(
            j, i, format(cf[i, j], 'd'),
            horizontalalignment='center',
            color='white' if cf[i, j] > thresh else 'black')
    plt.show()

    print('F1 score: %f' % f1_score)
    print('Recall Score: %f' % recall)
    print('precisions: %f' % precisions)


# generate_scores()
single_image()
A.4 Visualisation Generation Script
import json
import os

import matplotlib.pyplot as plt

with open('config.json') as f:
    conf = json.load(f)


def train_samples():
    # Count the real and fake images in the training set.
    path_real, dirs_real, files_real = next(os.walk(conf['train_path'] + '/real'))
    path_fake, dirs_fake, files_fake = next(os.walk(conf['train_path'] + '/fake'))

    file_count_real = len(files_real)
    file_count_fake = len(files_fake)
    print(file_count_fake + file_count_real)
    return file_count_fake + file_count_real, file_count_real, file_count_fake


def validation_samples():
    # Count the real and fake images in the validation set.
    path_real, dirs_real, files_real = next(os.walk(conf['validation_path'] + '/real'))
    path_fake, dirs_fake, files_fake = next(os.walk(conf['validation_path'] + '/fake'))

    file_count_real = len(files_real)
    file_count_fake = len(files_fake)
    print(file_count_fake + file_count_real)
    return file_count_fake + file_count_real, file_count_real, file_count_fake


def vis_dataset():
    # Plot the balance of real and fake images in the training set.
    fig = plt.figure()
    x = ['real', 'fake']
    y = [train_samples()[1], train_samples()[2]]

    plt.bar(x, y)
    plt.title('Balance of training dataset')
    plt.xlabel('Labels')
    plt.ylabel('Number of images')
    plt.show()
    fig.savefig(conf['directory'] + '/training_visualisation.jpg')


# vis_dataset()
A.5 Config File
{
    "weights": "imagenet",
    "include_top": false,

    "FT_model": "Final_Model/best_run_98.5%.json",
    "FT_weights": "Final_Model/best_run_98.5%.h5",

    "directory": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception",
    "train_path": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/new_curated_smaller/training",
    "test_path": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/new_curated_smaller/test",
    "validation_path": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/new_curated_smaller/validation",
    "height": 224,
    "width": 224,
    "validation_split": 0.20,
    "seed": 9,
    "num_classes": 2,
    "batch_size": 20,
    "epochs": 20
}
Appendix B
Companion disk
https://github.com/wakefieldcooper/imception.git
The data set has not been provided as it contains the faces of individuals
who gave permission for their faces to be used as training data, not for
public distribution.
Appendix C
Tensorboard Graphics
Figure C.1: Tensorboard output of network
Appendix D
Timeline From Proposal
Figure D.1: Timeline of project from project proposal
Bibliography
[1] R. Raghavendra and C. Busch, “Presentation attack detection methods for
face recognition systems: A Comprehensive Survey,” Norwegian
Biometric Laboratory, Norwegian University of Science and Tech-
nology (NTNU), Gjøvik, Norway, 2017.
[2] V. Savov, “The Verge,” 12 September 2017. [Online]. Avail-
able: https://www.theverge.com/2017/9/12/16288806/apple-
iphone-x-price-release-date-features-announced. [Accessed 22 Au-
gust 2018].
[3] C. Zhao, “News Week,” 18 December 2017. [Online]. Available:
https://www.newsweek.com/iphone-x-racist-apple-refunds-
device-cant-tell-chinese-people-apart-woman-751263. [Accessed
22 August 2018].
[4] A. Inc., “Face ID Security,” Apple, Silicon Valley, Nov, 2017.
[5] R. Raghavendra and C. Busch, “Presentation attack detection algorithm for
face and iris biometrics,” in 2014 22nd European Signal Process-
ing Conference (EUSIPCO), Lisbon, Portugal, 2014.
[6] A. A. a. S. Marcel, “Counter-Measures to photo attacks in face
recognition: A public database and a baseline,” International
joint conference on biometrics (IJCB), 2011.
[7] N. M. D. a. B. Q. Minh, “Your face is not your password,” in
Black Hat Conference, 2009.
[8] M.-A. Waris, “The 2nd competition on counter measures to 2d
face spoofing attacks,” ICB, 2013.
[9] A. Greenberg, “Wired,” Wired, 9 December 2017. [Online]. Avail-
able: www.wired.com/story/iphone-x-faceid-security/. [Accessed
23 August 2018].
[10] Apple, “About Face ID advanced technology,” Apple, 2017. [On-
line]. Available: www.support.apple.com/en-us/HT208108. [Ac-
cessed 23 Aug 2018].
[11] A. Karpathy, “TensorFlow,” 2014. [Online]. Available:
https://karpathy.github.io/2014/09/02/what-i-learned-from-
competing-against-a-convnet-on-imagenet/. [Accessed 23 Aug
2018].
[12] S. H. e. al., “Convolutional Neural Networks for Iris Presenta-
tion Attack Detection; Toward Cross-Dataset and Cross-Sensor
Generalization,” IEEE, Michigan State, 2018.
[13] A. A. a. S. I. Chingovska, “On the effectiveness of local binary
patterns in face anti-spoofing,” International conference of the
biometric special interest group (BIOSIG), 2012.
[14] M. M. C. a. S. M. Andre Anjos, “Motion-based counter-measures
to photo attacks in face recognition,” IET Biometrics, 2013.
[15] J. Y. S. L. Z. L. D. Y. a. S. L. Shiwei Zhang, “A face antispoofing
database with diverse attacks,” IAPR, 2012.
[16] A. Extance, Faces light up over VCSEL prospects, SPIE News-
room, 2018
[17] C. Burt. Facial Recognition to grow by more
than 26 percent through 2025. [online] Available:
https://www.biometricupdate.com/201811/facial-recognition-to-
grow-by-more-than-26-percent-through-2025 [accessed 21 April
2019]
[18] B. Mayo Face ID deemed too costly to copy, Android
makers target in-display fingerprint sensors instead [online]
Available: https://9to5mac.com/2018/03/23/face-id-premium-
android-fingerprint-sensors/ [Accessed 24 April 2019]
[19] Y. Liu, X. Liu, et al., Learning Deep Models for Face Anti-Spoofing:
Binary or Auxiliary Supervision. Computer Vision Foundation,
2018.
[20] Godoy, Alan Simões, Flávio Stuchi, Jose Angeloni, Marcus
Uliani, Mário Violato, Ricardo. Using Deep Learning for Detect-
ing Spoofing Attacks on Speech Signals, (2015).
[21] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and
K.-W. Cheung. Integration of image quality and motion cues for
face anti-spoofing: A neural network approach. J. Visual Com-
munication and Image Representation, 38:451– 460, 2016.
[22] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An
original face anti-spoofing approach using partial convolutional
neural network. In IPTA, 2016
[23] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional neural network
for face anti-spoofing. arXiv preprint arXiv:1408.5601, 2014
[24] O’Shea, Keiron Nash, Ryan. An Introduction to Convolutional
Neural Networks. ArXiv e-prints, (2015).
[25] S. Smith The Scientist and Engineer’s Guide to Digital Signal
Processing, 2011.
[26] ”Biometrics: Overview”. Biometrics.cse.msu.edu. 6 September
2007. [Accessed 22 April].
[27] ”Mugspot Can Find A Face In The Crowd – Face-Recognition
Software Prepares To Go To Work In The Streets”. ScienceDaily.
12 November 1997. [Accessed 22 April 2019].
[28] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An
original face anti-spoofing approach using partial convolutional
neural network. In IPTA, 2016.
[29] Gil Press, Cleaning Big Data: Most Time-Consuming,
Least Enjoyable Data Science Task, Survey Says. Available
at: https://www.forbes.com/sites/gilpress/2016/03/23/data-
preparation-most-time-consuming-least-enjoyable-data-science-
task-survey-says/41c86b376f63. [Accessed 22 April 2019]
[30] Karen Simonyan Andrew Zisserman, Very Deep Convolutional
Networks for Large-Scale Image Recognition, Visual Geometry
Group, Department of Engineering Science, University of Oxford,
2015.
[31] V. Gupta, Fine-tuning using pre-trained models, Available at:
https://www.learnopencv.com/keras-tutorial-fine-tuning-using-
pre-trained-models/. [Accessed: 23 April 2019]
[32] F. Chollet, Building powerful image classification models using
very little data, Available at: https://blog.keras.io/building-
powerful-image-classification-models-using-very-little-data.html
[Accessed 23 April 2019]
[33] Kevin Kelly, The Three Breakthroughs That Have Finally Un-
leashed AI on the World, 2017.
[34] F. Provos T. Fawcett, Data Science and its Relationship to Big
Data and Data-Driven Decision Making, 2013.
[35] Y. Bengio Y. LeCun, Scaling Learning Algorithms towards AI,
2007.
[36] K. C. Morris, C. Schlenoff, V. Srinivasan, A Remarkable Resur-
gence of Artificial Intelligence and its Impact on Automation and
Autonomy, IEEE, 2017.
[37] c. Page, US cops warned not to gawp at
iPhones due to Face ID lock-out. Available at:
https://www.theinquirer.net/inquirer/news/3064480/us-cops-
warned-not-to-gawp-at-iphones-due-to-face-id-lock-out. [Ac-
cessed 26 April 2019]
[38] ”Keras: The Python Deep Learning library”, Available at:
https://keras.io/. [Accessed 3 June 2019]
[39] ”Precision, recall and f-score” Available at: https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.precision
recall fscore support.html [Accessed June 2019]
[40] S. Shi, Q. Wang, P. Xu, X. Chu, Benchmarking State-of-the-Art
Deep Learning Software Tools, arxiv, 2017.
[41] D. T. Nguyen, T. D. Pham, M. B. Lee and K. R. Park, Visible-
Light Camera Sensor-Based Presentation Attack Detection for
Face Recognition by Combining Spatial and Temporal Informa-
tion, MDPI, 2019.
[42] Singh, Aruni Singh, Sanjay Tiwari, Dr. Shrikant. Comparison
of face Recognition Algorithms on Dummy Faces. International
Journal of Multimedia Its Applications, 2012.
[43] Cortes, Corinna; Vapnik, Vladimir N. ”Support-vector networks”.
Machine Learning, 1995.
[44] Raghavendra, R B. Raja, Kiran Busch, Christoph. Presentation
Attack Detection for Face Recognition Using Light Field Camera.
IEEE transactions on image processing : a publication of the
IEEE Signal Processing Society, 2015.

Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
💚Trustworthy Call Girls Pune Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top...
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 

  • 8. Contents
Keywords
Acknowledgments
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Problem Definition
  1.2 Scope
    1.2.1 Boundaries and Limitations
    1.2.2 Scope Definition
  1.3 Aims
  1.4 Relevance and Impact
2 Literature review / prior art
  2.1 Vertical-Cavity Surface-Emitting Lasers (VCSEL)
  2.2 Motion-Based Schemes
  2.3 Image-Quality Schemes
  2.4 Texture-Based Schemes
  2.5 Prior Uses of CNN for PAD
  2.6 Where this solution places
3 Theory
  3.1 Convolutional Neural Networks
    3.1.1 Recent advancements in Artificial Intelligence
    3.1.2 Architectures
  3.2 Transfer Learning (Fine tuning)
    3.2.1 Fine Tuning vs. Training from scratch
    3.2.2 Bottleneck
  3.3 Facial recognition
  3.4 Biometrics
  3.5 Presentation Attacks
    3.5.1 Types of facial PA
  3.6 Summary and conclusions
4 Solution Design and Implementation
  4.1 Design Overview
  4.2 Implementation
    4.2.1 Hardware
    4.2.2 Software used for development
    4.2.3 Overview of deep learning framework
  4.3 Dataset
    4.3.1 Data Collection
    4.3.2 Size and Structure
    4.3.3 Improvements and limitations
  4.4 Architecture
    4.4.1 Fine tuning model (transfer learning)
    4.4.2 Design Rationale
5 Results and Discussion
6 Conclusions
  6.1 Summary and conclusions
  6.2 Possible future work
Appendices
  A Code Listings
    A.1 Bottleneck Features Generation Script
    A.2 Fine Tune network script
    A.3 Evaluate Model
    A.4 Visualisation Generation Script
    A.5 Config File
  B Companion disk
  C Tensorboard Graphics
  D Timeline From Proposal
  • 11. List of Figures
2.1 FaceID: VCSEL projecting 30,000 dots [37]
3.1 Most basic structure of a CNN
3.2 Distribution of image to hidden layers
3.3 Illustration of active nodes
3.4 VGG16 graphic
4.1 Design flow
4.2 Comparison of GPUs
4.3 Example of spoof image
4.4 Machine used for training
4.5 Samples of real images
4.6 Samples of fake images
4.7 VGG16 graphic
5.1 Tensorboard outputs
5.2 Fake spoofing
5.3 Confusion Matrix
5.4 Final Scores
C.1 Tensorboard output of network
D.1 Timeline of project from project proposal
  • 12. List of Tables
1.1 Scope Definition
  • 13. Chapter 1 Introduction Biometric systems are becoming common place in our everyday lives, from finger print readers, to voice recognition, and pertinent to this thesis topic; facial recogni- tion systems. As with any technological advancement, there are often flaws with the early iterations, and facial recognition is no exception. We are poised in a period of the development stage of facial recognition systems where real-time applications are in demand, and with the move into this space it has raised concern from the biometric community and the public over the ability to resist presentation attacks [1]. The proposed solution utilises transfer learning to fine tune the VGG16 deep convolutional neural network architecture to enable differentiation between real and fake facial artefacts. Through extensive experimentation with varying parameters which will be outlined in the proceeding sections on a self curated dataset, an F1 Score of 99.96% was obtained. 1.1 Problem Definition Failure in facial recognition software has made headlines since Apple’s Face ID launch on September 12, 2017 [2]; most notably failing to differentiate between twins, siblings with similar faces, and even to the extremity of people of the same race [3]. Apple’s own Face ID Security release states that the probability of someone other than yourself being able to unlock your phone through the Face ID protocol is “1 in 1,000,000 (versus 1 in 50,000 for Touch ID)”, however caveats this statement by saying “The probability of a false match is different for twins and siblings that look like you as well as among children under the age of 13, because their distinct facial features may not have fully developed.” [4]. A presentation attack is exactly that, with the end goal to attack the security of a biometric facial recognition engine through the presentation of a facial biometric 1
  • 14. 2 CHAPTER 1. INTRODUCTION artefact [1]. Facial biometric artefacts can range from a simple print out of the individuals face, to the complexity of a 3-dimensional modelled mask. Prior devel- opments of PAD techniques have been largely based on classical computer vision and can be loosely categorised into three main groups; (1) Motion based schemes, (2) Image quality analysis schemes (3) and micro-texture-based schemes [1]. Another approach that has been employed to perform PAD on facial recognition engines is that of vertical-cavity surface-emitting lasers (VCSELs), to provide 3-dimensional sensing [16]. This is what is found in the iPhone X; this technology requires extra costly hardware to implement, and thus a solution that leverages the pre-existing hardware within a device is favourable if the level of accuracy can be comparable to that of current technologies. 1.2 Scope This thesis will move away from a classical computer vision approach, and harness recent developments in deep learning to develop a novel way of detecting presenta- tion attacks. This will look at the generation (1st or 2nd) of an image, and thus looks at the problem from an alternative approach than one of focus on the face. This will have applications in not only consumer-based products as discussed above, but any security application where facial recognition is employed; and potentially outside of this scope depending on the necessity for determining the generation of an image: think verification of image authenticity. Upon commencement of this thesis project, the ambitious intention was to pro- duce a solution that encompassed multiple forms of presentation attack detection. During the course of producing and developing the PAD software, it became quickly apparent the shear size of data that would be needed in order to ensure a consistent and reliable detection. As the use case for this product is in the area of biometric security, it is imperative that accuracy be as high as possible. As a result of this, It was decided to limit the focus of this thesis on one attack method, the replay attack. The logic behind this, was that if the developed detection engine could successfully and reliably detect on this type of presentation attack, then it would mean that with enough structured data it would be possible to develop a PAD method for the other means of attack.
  • 15. 1.2.1 Boundaries and Limitations
• Collection of data from a wide variety of the population was unfeasible without monetary injection to fund a university-wide data collection (a reward-based incentive for each participant).
• Computing power was limited; access to the school's GPU was possible, however inconvenient. Another option was to use cloud computing, however this is a very expensive option, and when trialling one option from FloydHub, I found that the speed increase, after dealing with the uploading of datasets, was comparable to training on my GTX 1050 Ti.
• Access to varying devices, needed to diversify the dataset for training a model on replay attacks, was limited.
1.2.2 Scope Definition
Condition | In Scope | Out of Scope
Lighting | Consistent lighting | Extreme (really dark or really light)
PA Method | Replay attack | Lollipop, print, 3D modelled mask
Size of Dataset | Approximately 70,000 training and 10,000 validation images | Larger-scale, diverse data
Implementation | Manual testing (command line) | GUI and/or live detection
Dataset Collection | Limited to myself, close friends and family | Any extension to this
Diversity of Dataset | Male and female, varying age; more of myself than other individuals; lighting fairly similar across all; the same device used to display the fake (Dell XPS 15) | Varying ethnicity, varying lighting, replay attack on various devices
Production Level | Proof of concept stage | Not production ready; trained on one method of PA; no GUI
Table 1.1: Scope Definition
  • 16. 4 CHAPTER 1. INTRODUCTION 1.3 Aims The aims of this project were as follows: • Phase 1: The project sits in a field that is at the forefront of development in the artificial intelligence and more specifically the use of artificial intelligence in the biometric security space. As such, an extensive research phase was needed in order to develop an effective and in depth solution the problem. Not only this, but the level of data needed was extensive, and thus the data collection process took time, as well as the cleansing of this data. This can be summarised as: – Data set collection – Research into differing deep learning techniques. – Consulting with Professor Lovell and Wiliem to discuss potential solu- tions/directions to take. – Exploring the prior work on PAD in facial recognition engines. – Exploring the computing power that would be feasible to complete this project under the scope aforementioned. – Pulling on the knowledge gained at work on leveraging AI to develop solutions to problems – Testing multiple deep learning architectures and their ability to be fine tuned to this challenge. • Phase 2: Phase two of this thesis was reserved to the development of the solution. This was both time consuming due to the trial and error nature of deep learning, where adjustments to a plethora of variables is needed to optimise the results; as well as the computationally expensive nature of the task at hand. The culmination of these two phases will result in a proof of concept showing the ability of deep learning in the field of biometric security; specifically that of facial recognition engines. Through proving the effectiveness on replay attacks, it will lead on to further solutions for other presentation attack methods, thus bringing an accessible and hardware light solution to PAD in facial recognition implementations.
  • 17. 1.4. RELEVANCE AND IMPACT 5 1.4 Relevance and Impact The relevance of this problem is high in our modern world, as stated above, there is an increased use of facial recognition engines in our day to day lives. This is only set to continue, with a prediction that facial recognition will continue to increase, up to 26 percent through 2025 [17]. As such, it becomes increasingly important to develop techniques to prevent unwanted or unauthorised access to these devices or establishments. Current methods like in the iPhone X [16], employ 3 dimensional sensing in order to conduct facial recognition. However this requires an added hard- ware component that is expensive percentage wise of the over all cost of the handset [18]. With android going as far as to not pursue facial recognition as a method for unlocking their phones as they cannot command the same price premiums that Apple does [18]. Access to devices is not the only use case for this solution. A huge market that fa- cial recognition is utilised in is entry into establishments. Think entry into buildings without the need of a guard. This is applicable in a range of environments from nightclubs, to office buildings. This could allow for 24 hour access without the need for staff to monitor the doors.
  • 18. 6 CHAPTER 1. INTRODUCTION
  • 19. Chapter 2 Literature review / prior art
This section highlights the prior work done on facial recognition presentation attack detection, and in doing so sets the theoretical grounds from which this project launches. It outlines traditional means of PAD on facial recognition engines, namely VCSEL, image-quality schemes, motion-based schemes and texture-based schemes, and briefly explores some of the prior work done in the field of using CNNs for the detection of presentation attacks on facial recognition engines. This is a very new field, and thus there is not a vast range of papers on the matter.
  • 20. 8 CHAPTER 2. LITERATURE REVIEW / PRIOR ART Figure 2.1: FaceID: VCSEL projecting 30,000 dots [37] citing the cost of these units too high to warrant inclusion in their handsets [18]. Apple is able to use technology such as this, as they are able to charge a premium for their handsets unlike other companies. This is the major draw back of VCSEL and is why a solution that leverages the pre-existing hardware within the phone (camera and processing unit) positions itself within the developments in facial recognition engine PAD as highly important.
  • 21. 2.2 Motion-Based Schemes
Motion-based schemes have been used to great success, especially in facial biometric use cases [6] [13]. They are especially useful for the analysis of video streams, where the motion of the total image is analysed, identifying abnormal motion using either motion correlation [14] or non-rigid analysis of the motion using GMM [8], Eulerian magnification [8] or DTA. Motion-based schemes are used for protection against methods like photo presentation, where movement between the subject's head and the scene context [6] is non-existent. They can also be used against lollipop presentation attacks, where irregular motion of the presented 'photo on a stick' is detected. Motion-based schemes also encompass live-face-specific motion such as blinking, mouth movement or head movements [13].
2.3 Image-Quality Schemes
Image quality analysis is the most basic form of PAD, where the image itself is analysed for basic indicators like contrast, sharpness, blur, chromatic aberration, etc. [5]. This can be effective for photo presentation attacks or even video presentation attacks, however it is not useful for 3D mask attacks. Image quality schemes are also becoming less relevant as technological advancements make presentation attacks harder to detect: where print quality was often quite poor, the level of printing quality possible today can render these schemes irrelevant or, at a minimum, less effective.
2.4 Texture-Based Schemes
These pertain mostly to the class of Local Binary Patterns (LBP) [5], or to Difference of Gaussians [15]. These techniques are used to identify the texture of the image, where a recreation of the original will not have the same texture of skin (e.g. skin tone, lighting irregularities). The first work using LBP for facial PAD was conducted by Maatta et al. in 2011, whereby the image is broken into a representation of itself as a feature vector (a concatenation of uniform LBP histograms with (P, R) of (16, 2) and (8, 1) over the whole image, and histograms of 9 overlapping blocks) [13]. This representation of the image as a binary number sequence after thresholding is then used as a resultant label [42]. A histogram of these labels then forms a descriptor of the texture of the image. The next step of this process is often to feed it into a support vector machine (SVM). SVMs are often used for binary classification tasks and have been used to great success for years, having first been developed in 1963 by Vladimir N. Vapnik and Alexey Ya. Chervonenkis [43]. The concept of an SVM is simple: it aims to generate a hyperplane or line between the data, isolating it into two separate classes.
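To make this classical texture-based pipeline concrete, the sketch below shows a minimal LBP-plus-SVM baseline. This is an illustrative example only and is not the approach developed in this thesis; it assumes scikit-image and scikit-learn are available, and the (P, R) settings, variable names and data are placeholders.

# Illustrative LBP + SVM texture baseline (not the method developed in this thesis).
# Assumes scikit-image and scikit-learn; parameters and data are placeholders.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray_face, points=8, radius=1):
    # Uniform LBP codes take values 0..points+1, hence points+2 histogram bins.
    codes = local_binary_pattern(gray_face, points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2), density=True)
    return hist

# Hypothetical data: grey-scale face crops and labels (0 = real, 1 = attack).
train_faces = [np.random.rand(64, 64) for _ in range(20)]
train_labels = np.random.randint(0, 2, size=20)

features = np.array([lbp_histogram(face) for face in train_faces])
classifier = SVC(kernel="rbf", gamma="scale")   # the separating hyperplane described above
classifier.fit(features, train_labels)
print(classifier.predict([lbp_histogram(np.random.rand(64, 64))]))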
  • 22. 2.5 Prior Uses of CNN for PAD
With the intersection of popular consumer products and facial recognition engines becoming a reality in the last few years, and in parallel the increased development of graphics processing units, deep learning has only recently become a feasible solution to PAD. As a result there have been very few explorations into this field, and no integration into production products that I have been able to find. Four papers were found that explore this concept on facial recognition engines, and each uses a slightly different method to leverage deep convolutional neural nets to attain PAD [19, 21, 22, 23]. The most recent of these challenges the notion, underpinning this paper, that a binary classification engine built from a deep convolutional neural network is enough to reliably detect presentation attacks [19]. Rather, they propose that an auxiliary supervision step be used to guide the learning towards discriminative and generalizable cues. They utilise a CNN-RNN model to estimate the face depth with pixel-wise supervision, as well as to estimate the rPPG signals with sequence-wise supervision. These two estimates are combined to distinguish between real and fake images. They were able to achieve an F1 score of 45.8%; while this is not a great result in comparison to the results that we found, their scope was a lot wider, which is where the lower score is attained. The paper that most closely resembles the solution presented here explores the concept of fine tuning only the fully connected layer; rather than extracting the prediction from there, they perform a principal component analysis (PCA) to reduce the dimensionality of the features, which helps reduce over-fitting [22]. The features were extracted from the 11th, 13th, 15th, 18th, 20th, 22nd, 25th and 27th layers of their fine-tuned network. Finally, they use a support vector machine (SVM) to distinguish between real and fake facial artefacts. They were able to obtain an EER of 4.99%.
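For readers unfamiliar with the PCA-plus-SVM stage used in [22], the following is a hypothetical sketch with scikit-learn. The feature arrays, dimensions and component count are assumptions for illustration, not values taken from that paper.

# Hypothetical sketch of a PCA + SVM stage as described in [22]. The deep
# features are assumed to have already been extracted from chosen layers of a
# fine-tuned network; array shapes and n_components are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

deep_features = np.random.rand(200, 4096)        # stand-in for concatenated layer activations
labels = np.random.randint(0, 2, size=200)        # 0 = real, 1 = attack

pca = PCA(n_components=64)                        # dimensionality reduction to limit over-fitting
reduced = pca.fit_transform(deep_features)

svm = SVC(kernel="linear")
svm.fit(reduced, labels)
print(svm.score(reduced, labels))                 # accuracy on the toy training data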
  • 23. 2.6 Where this solution places
The solution proposed in this paper, while similar to the other CNN implementations, explores in more depth the process of transfer learning to garner meaningful results. In comparison to traditional computer vision techniques, fine tuning a deep convolutional neural network gives a more robust solution to a difficult problem. Traditional computer vision techniques rely on an individual to determine which factors differentiate a real biometric artefact from a presentation attack, and to subsequently search for these through some of the aforementioned techniques. The benefit of deep learning is that instead of telling a network what to look for, one can pre-classify the data, feed it to the network and allow it to decide what differentiates the two. The uniqueness of this solution is its leveraging of the network's lower-level pre-learned weights on generalised features to streamline the training process. This solution thus stands alone in its simplicity of implementation, while maintaining a high level of accuracy and relying on the quality of data to expand the robustness of a future solution.
  • 24. 12 CHAPTER 2. LITERATURE REVIEW / PRIOR ART
  • 25. Chapter 3 Theory This section will provide related background theory on the topic at hand, that will provide the reader with the understanding, or if already versed in the field of deep learning and biometric security; the ability to reference key concepts. 3.1 Convolutional Neural Networks Convolutional Neural Networks (CNNs) are as the name suggests, a network of a large number of computational nodes which are referred to as neurons. These computational neurons work in a distributed fashion to collectively learn patterns from the input in order to optimise the output [24]. The most basic form of a CNN can be seen in figure 3.1, where we see a convolution layer, pooling layer and a fully connected layers before achieving an output. This is the most simple of architectures that will be encountered, and any number of convolution, pooling layers and differing structures of fully connected layers can be arranged to optimise the networks ability to learn from the training data. Figure 3.1: Most basic structure of a CNN 13
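As a concrete illustration of the basic structure in Figure 3.1, a minimal Keras model of that shape might look as follows. This is a toy sketch only: the layer sizes are arbitrary and this is not the network used in this thesis.

# Toy Keras model mirroring Figure 3.1: convolution -> pooling -> fully
# connected -> output. Layer sizes are arbitrary; this is not the thesis model.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)))  # convolution layer
model.add(MaxPooling2D(pool_size=(2, 2)))                                    # pooling layer
model.add(Flatten())
model.add(Dense(64, activation="relu"))                                      # fully connected layer
model.add(Dense(1, activation="sigmoid"))                                    # real vs. fake output
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()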
  • 26. After loading our multidimensional vector (the image) into the input layer, it is distributed into the hidden layers as illustrated in Figure 3.2. The input nodes are passive, merely relaying their single input to their multiple outputs. This is in contrast to the hidden and output layers, which are active, continually modifying the signal by weighing up how a stochastic change has affected the final output, either positively or negatively. Figure 3.2: Distribution of image to hidden layers This is where the process of learning takes place, and it is summarised in Figure 3.3, where each value from the input layer is distributed to all hidden nodes. As the input enters the hidden node it is multiplied by a weight, which is both predetermined and updated as learning takes place. These weighted inputs are then summed to produce a single output value which, before leaving the node, is passed through a non-linear activation function; in this case, a sigmoid activation [25]. The notion of deep learning arises when we strategically stack multiple hidden layers on top of each other to maximise the learning process and optimise efficiency. The training of CNNs is performed predominantly under two circumstances: supervised and unsupervised. Supervised learning is where a CNN learns from a pre-labelled dataset, with a set of vectors representing the image and a corresponding pre-defined output. This method of training provides a lower overall classification error. Unsupervised learning differs from the supervised case in that the input data does not have associated labels; rather, the network groups input vectors with similar characteristics into bundles.
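A tiny numerical illustration of the single-node computation described above (weighted sum followed by a sigmoid), using made-up inputs and weights:

# Toy illustration of one hidden node: weighted inputs are summed and passed
# through a sigmoid. All values here are made up for demonstration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.2, 0.8, 0.5])     # values arriving from the previous layer
weights = np.array([0.4, -0.6, 0.9])   # learned weights (arbitrary here)
bias = 0.1

output = sigmoid(np.dot(weights, inputs) + bias)
print(output)                          # the node's single output value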
  • 27. 3.1. CONVOLUTIONAL NEURAL NETWORKS 15 3.1.1 Recent advancements in Artificial Intelligence Three major breakthroughs in the last few years can be credited for the quantum leap in deep learning and artificial intelligence on the whole. The first major break- through was seen with the introduction of graphic processing units (GPUs) over a decade ago. A parallel computing chip that we now take for granted. These were necessitated as gaming became more computationally expensive, with the need to make millions of re-calculations every second. As these became more popular and advancements in graphics and general computing become more demanding, the de- velopment of GPU’s was increased, and thus by around 2005 the cost of GPUs had come down dramatically [33]. The pivotal moment in artificial intelligence came when a team at Stanford univer- sity discovered they could use GPUs to compute neural networks in parallel [33]. Today, where nearly every consumer based laptop or desktop that is purchase has some form of GPU, deep learning has become a viable option for problems that were otherwise inaccessible to solve by the average individual or even small data science team that did not have access to large computing power. An extension of this advancement in hardware is the ability to utilise cloud computing, where one can rent space on compute engines that would otherwise be unattainable on a nor- mal budget; with some single GPU units costing upwards of $5000 AUD if you were to purchase one for a build. The second important breakthrough over the last few years has come about with the awareness of big data, and the value that has been placed in it [34]. Without access to incredibly vast datasets, our artificial intelligence is anything but intelligent. As larger companies like Google and Amazon pave the way for cloud computing and big data collection, their developments begin to trickle down through the data science community and the general public alike. Where we benefit from the data they collect, the algorithms they develop and the computing power they make available [33]. The last key development over the last few years that has accelerated development and feasibility of artificial intelligence can be found in improved algorithms[35]. As aforementioned, the developments of these large companies in the fields of AI have had flow on effects for the community as a whole. With greater access to tried and tested algorithms that have been trained on big data on faster parallel comput- ing engines, it has enabled society to leverage these advancements to solve problems in a more efficient and less taxing manner. With the convergence of these three fields, we have seen an exponential advancement in AI over the last decade [36] and if the trend of development continues we will be sure to see AI become even more apart of our lives, perhaps without us being entirely aware.
  • 28. 16 CHAPTER 3. THEORY Figure 3.3: Illustration of active nodes 3.1.2 Architectures There are numerous different architectures that have been developed by leading data science teams and academic groups to perform image classification and recognition tasks. These are often readily accessible and have been trained on extremely large datasets that would otherwise be inaccessible to the average individual. Keras - which will be explained in section 4 - makes available multiple architectures and associated models for image classification tasks, with their associated pre-trained weights. These include: • Xception • VGG16 • VGG19 • ResNet, ResNeXt, ResNetV2 • InceptionV3 • InceptionResNetV2 • MobileNet • MobileNetV2 • DenseNet • NASNet
  • 29. 3.1. CONVOLUTIONAL NEURAL NETWORKS 17 Chosen for this application was VGG16, due to its depth, receptivity to fine tuning and the fact that I had exposure to it prior and have been able to obtain good results with prior image classification tasks where other architectures have not been as successful. VGG16 First debuted in the 2014 ImageNet Challenge 2014, VGG16 explored the effects of the depth of a network in the large-scale image recognition setting [30]. Their main contribution being exploring the effects very small (3x3) convolution filters have on increasingly deep networks. It found that greater accuracy could be achieved in comparison to prior methods using 16-19 weight layers; hence the 16 in VGG16 and similar architecture VGG19. The VGG comes from the developers group name of Visual Geometry Group [30]. The VGG team have released these two models as their highest performing archi- tectures for further development and utilisation. This success was achieved through other parameters of the architecture being fixed, and steadily increase the depth of the network by adding more convolutional layers. This was feasible due to the use of the aforementioned very small (3 x 3) convolution filters in all layers [30]. The architectures structure can be seen in Figure 3.4, where by the input is a fixed-size (224 x 224) RGB image. The only preprocessing that VGG have implemented is subtracting the mean RGB value, computed on the training set, from each pixel [30]. The illustration of the structure can be explained as: • Stack of convolutional layers with (3 x 3) filters applied. This is the smallest size to capture spatial location. • Convolution stride fixed to 1 pixel • Spatial pooling is achieved with a max pooling layer after each of the 5 con- volution blocks. This is done over a (2 x 2) pixel widow, with stride 2. • The 5 convolution blocks are usually followed by 3 Fully Connected layers, however in the implemented model, due to the fine tuning, a unique Fully Connected layer is added which will be explained in section 4. • Hidden layers have ReLU activation.
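The pre-trained VGG16 described above is available directly from keras.applications. A minimal way to load its convolutional base, dropping the original ImageNet classifier so that a task-specific head can later be attached (as discussed in Section 3.2), is sketched below; the input size follows the 224 x 224 RGB convention noted above.

# Loading the pre-trained VGG16 convolutional base from keras.applications.
# include_top=False drops the ImageNet fully connected layers so that a
# task-specific head can be added later.
from keras.applications.vgg16 import VGG16

conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
conv_base.summary()   # shows the five convolution blocks and their pooling layers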
  • 30. 18 CHAPTER 3. THEORY Figure 3.4: VGG16 graphic 3.2 Transfer Learning (Fine tuning) When approaching a deep learning problem, the old adage is that a data scientist will spend approximately 90 percent of their time collecting and cleaning data, 5 percent of their time training models and actually writing code, and the last 5 per- cent writing documentation and cleaning up repositories [29]. This is particularly accurate when approaching computer vision tasks, where by the patterns that are learned are often not obvious however solely dictated by the quality and type of data that is fed into the network. Therefore copious amounts of data is needed to train these large deep convolutional neural networks from scratch. The architecture used in this thesis (VGG16) was trained on a dataset called imagenet which contained 1000 classes and over one million training images [30] at the time of training. While training time is heavily dependant on the equipment used; to give the magnitude of the task at hand, the team in [30] trained this classifier over 74 epochs on a single machine running 4 NVIDIA Titan Black GPUs, utilising data parallelism. This is achieved by splitting the training data into several GPU batches, processed in par- allel on the 4 GPUs. Taking this all into account, and the considerable resources needed to train a network from scratch on a dataset; it took the team 2-3 weeks to train the model. This gives a clear understanding of the shear resources needed to effectively train these deep CNNs and thus shows why transfer learning is such an important part of deep learning, and why it was applicable in this particular thesis; with the limited computational resources at hand. Fine tuning is the process of retraining higher levels of the CNN architecture, by leveraging the idea that the features that the early layers of the model learn are very generalised features that are common among most image based classification tasks. In contrast, the later layers start to learn features and patterns more specific
  • 31. to the classification task at hand [31].
3.2.1 Fine Tuning vs. Training from scratch
The act of fine tuning is achieved using the same techniques as one would use for training an entire network, whereby the weights of each layer are tweaked and updated as the model learns. The difference, however, comes in the layers that this learning process operates on. We have the ability to freeze the model up to a certain layer, and to train everything from that point onwards, where freezing means locking the weights already obtained through training on a prior dataset (in this case ImageNet) so that neither our new data nor further training can affect them. Therefore, fine tuning allows us to:
• Reduce the size of the dataset needed to train a model, as we are not training the whole network, rather just the upper layers. Second to this, the layers we are training are not being trained from scratch.
• Dramatically reduce the training time, as there are fewer layers to train and thus fewer weights to update, as well as less data to iterate over. In our case, from weeks to hours.
3.2.2 Bottleneck
As discussed in this section, fine tuning leverages networks pre-trained on extremely large datasets, which have therefore learned features applicable to most computer vision problems. A key technique utilised in transfer learning is the idea of collecting 'bottleneck' features of the network. This is achieved by running the training and validation data over the convolutional part of the network once, and storing the output. This output is the last set of activation maps that the network produces before a fully connected layer [32]. The main reason that bottleneck features are utilised is computational efficiency: it enables the use of these large, computationally costly networks on lower-powered parallel computing chips, as once the bottleneck features are captured, we can load them in a more efficient data structure and reduce the size of the trainable network. Another reason that bottleneck features, coupled with fine tuning of the upper layers of the network, are utilised is to prevent overfitting. Large neural networks have a large entropic capacity and thus have a tendency to overfit if data resources are insufficient.
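A minimal sketch of how bottleneck-feature extraction can be done with Keras is shown below. The directory layout, batch size and file names are assumptions for illustration; the script actually used in this thesis is listed in Appendix A.1.

# Sketch of bottleneck-feature extraction: the frozen VGG16 convolutional base
# is run once over the images and its final activation maps are saved to disk
# for later training of the small fully connected head. Paths, batch size and
# file names are assumptions; see Appendix A.1 for the script actually used.
import numpy as np
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import ImageDataGenerator

conv_base = VGG16(weights="imagenet", include_top=False)

datagen = ImageDataGenerator(rescale=1.0 / 255)
generator = datagen.flow_from_directory(
    "data/train",            # hypothetical folder with real/ and fake/ subdirectories
    target_size=(224, 224),
    batch_size=20,
    class_mode=None,         # labels are not needed, only the activations
    shuffle=False)

bottleneck_features = conv_base.predict_generator(generator, steps=len(generator))
np.save("bottleneck_features_train.npy", bottleneck_features)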
  • 32. 20 CHAPTER 3. THEORY 3.3 Facial recognition Facial recognition pertains to the use of an individuals unique anatomical feature that is their face, as an identification mechanism. This is done in a general sense by comparing a presented face to a list of accepted faces in a database. It has a wide range of uses within society, from consumer based goods such as the iPhone X up to target identification in surveillance work in government organisations and the like [27]. For the use case that is explored in this thesis, we will be looking at a novel solution to preventing the spoofing of these facial recognition engines. This is not limited to any exact application of facial recognition, and thus highlights the importance of work and the subsequent advancements in the field of artificial intelligence; specifically in the use case of PAD of facial recognition engines. 3.4 Biometrics This thesis is based in the use of biometric identification, and thus it is important to have a thorough understanding of what biometrics is, how it can be leveraged and its’ potential vulnerabilities. Biometric identification or biometrics is the automatic identification of a person through the use of their unique anatomical features; which in the case of this thesis is their face [26]. The use of biometrics has advantages over the traditional form of identification such as a physical card or passwords. These are: • The individual has to be present at the time of identification, therefore pass- words and/or ID can’t be used regardless of whether an individual is present. • The use of biometrics as a form of entrance or verification is convenient as you do not have to remember passwords or carry a form of identification on you at all times. • A lot harder to steal or replicate in comparision with tradition forms of ID. 3.5 Presentation Attacks 3.5.1 Types of facial PA There are many types of presentation attacks that are used to penetrate facial recog- nition engines, namely: print attack, replay attack and 3D modelled mask. Each of these have their own parameters for detection, and thus inherent difficulty associ- ated with detecting them. This thesis explores the detection of the replay attack,
  • 33. 3.5. PRESENTATION ATTACKS 21 however it is important to understand the problem in its entirety to comprehend the broad nature of presentation attacks to our lives. Print Attack (2D image) A common method of by-passing 2D facial recognition engines is to present a pho- tograph of the intended subject [6]. In our world where our face is plastered all over the internet with our social media platforms, the security of using biometric facial measures to unlock our devices, or allow access to rooms, is an increasing risk. This image of the intended subject can be easily obtained and displayed, either in hard copy or in digital form, on a screen [6]; and it is known that facial recognition sys- tem respond quite poorly to attacks due to the ease of recreation [7] [6]. 2D image presentation is the simplest form of presentation attack, and therefore the easiest to detect. The latter methods prove to be harder to detect. The ‘lollipop’ attack is an extension to the preceding 2D image attack. In the 2D image attack, motion detection can be used to detect the difference between the subject and its background context [8]. The ‘lollipop’ attack is essentially the 2D image however placed on a stick, to provide a motion difference between the background and subject, thus increasing the difficulty of PAD. Replay Attack A recorded video can be used to fool facial recognition systems. PAD becomes even harder again as the previous methods used on the 2D image and lollipop attacks do not apply as easily to the moving video, and thus methods like detecting the frame of the device, or reflections in the glass are necessary to detect a presentation attack. This type of presentation attack is particularly important, as this thesis will be primarily using video and a video of a video to train the CNN as well as to test the CNN; detecting whether an image is 1st generation or 2nd generation (real or fake). 3D Printed Mask With recent development in PAD, 3D molded masks have become a way of attack- ing these facial recognition engines, where-by precise measurements are taken of the intended subject and then a realistic mask is manufactured. This method of presen- tation attack is particularly effective in fooling facial recognition systems, however it is incredibly hard to manufacture. The likelihood of being able to obtain someone’s exact facial measurements, skin texture, etc. is extremely improbable if you are trying to hack their system, and thus the risk for this type of attack is low. This
  • 34. 22 CHAPTER 3. THEORY is summarized by Rich Mogull, a security analyst, who said about Apple’s Face ID, [9]”If you have to 3D print a model of someone’s face to defeat this, that’s proba- bly an acceptable risk for most of the population.” However, he goes on to caveat this statement by saying “if I were an intelligence agent, I wouldn’t turn on any biometric.” [9] . 3.6 Summary and conclusions It can be seen from the preceding theoretical overview of the key design choices for this thesis that fine tuning deep convolutional neural networks presents exiting potential in the biometric security space, specifically in facial recognition PAD. The implementation of transfer learning coupled with Visual Geometry Group’s VGG16 architecture is an exciting prospect for PAD and that access to extensive diverse data is paramount to the success of a production level product. The proceeding section will serve to outline the implementation of this theoretical solution.
  • 35. Chapter 4 Solution Design and Implementation The purpose of this thesis was to develop a presentation attack defence mechanism for facial recognition engines through leveraging transfer learning on convolutional neural networks. It aims to solve the problem of biometric security as we move into an age where facial recognition engines are becoming ever present in everything from consumer based products to the highest security clearance applications. Below will outline the techniques used, the frameworks utilised and the hardware that it was trained on. This was conducted as more of a proof of concept compared to a production ready product, and as such this must be taken into account when considering the implementation at this point. I broke this thesis into two phases, to spread the workload over the two semesters it was conducted in. As I outlined previously in Section 1.3, Phase 1 was reserved for: • Data set collection • Research into differing deep learning techniques. • Consulting with Professor Lovell and Wiliem to discuss potential solutions/di- rections to take. • Exploring the prior work on PAD in facial recognition engines. • Exploring the computing power that would be feasible to complete this project under the scope aforementioned. • Pulling on the knowledge gained at work on leveraging AI to develop solutions to problems 23
  • 36. 24 CHAPTER 4. SOLUTION DESIGN AND IMPLEMENTATION • Testing multiple deep learning architectures and their ability to be fine tuned to this challenge. This phase commenced at the beginning of semester 2 2018 and was completed dur- ing the summer break, where dataset collection was by far the most time consuming task with over 120GB of data being collected personally, and an additional 5GB ob- tained from a dataset online, however as will be discussed later this dataset became less useful and could be omitted if training was conducted again. Furthermore, the data cleansing process after the dataset had been formed to ensure a balanced and non-biased dataset was the second most time consuming part of phase 1. Through trial and error, it was found that initial models had a heavy bias due to the data presented to the model and thus skewed results. Later iterations after the referenced data cleansing had been conducted, eliminated this bias. Phase 2 - As outlined in Section 1.3, was for the development of the model. Once the data was curated appropriately, the actual task of training the model and optimising the results was not as time consuming as the dataset collection and cu- ration however it still was a time consuming and at times frustrating endeavour as training the model was computationally expensive and thus time consuming. With changes and tweaks to the model took approximately 20 hours to implement which will be explained later. Reflecting on the timeline decision of breaking this thesis into two phases high- lights the importance that data has on the success of a project, especially this one. Having two phases ensured that there was a logical flow to the project and ensured there was always progress in the right direction, instead of wasting time training models on improperly curated data.
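Given how much of Phase 1 was spent balancing and cleansing the dataset, a simple sanity check of class balance proved useful. The helper below is a hypothetical example; the data/train and data/validation directory names are assumed for illustration, not the exact layout used.

# Hypothetical helper for checking class balance after data cleansing.
# A root directory with one subfolder per class is an assumed layout.
import os

def count_images(root):
    counts = {}
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = len(os.listdir(class_dir))
    return counts

print(count_images("data/train"))        # e.g. {'fake': 35000, 'real': 35000}
print(count_images("data/validation"))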
  • 37. 4.1. DESIGN OVERVIEW 25 4.1 Design Overview Figure 4.1: Design flow The developed solution leverages a network architecture developed by Visual Ge- ometry Group called VGG16. The network is fine-tuned from the last convolutional layer, up through the newly created fully connected layer. The solution is written in Python, using a Tensorflow back end and Keras front end. The decision to use Keras as a frontend, as opposed to writing the solution purely in Tensorflow is for simplicity. Keras is an open-sourced neural-network API written in Python that can operate on top of Theano, CNTK or in the case of this thesis, Tensorflow. It en- ables fast deep learning experimentation and development and comes with VGG16 pre-packaged; which increases the appeal of using Keras. It is surmised as ”Being able to go from idea to result with the least possible delay is key to doing good research”[38]. Which was exactly the paradigm of this thesis. This solution was not developed as a complete solution to the problem that is PAD in facial recognition engines. Rather, it was developed as an exploration into the possibility of using deep learning as a means to preventing these attacks; where cur- rent solutions rely on extra (and often very expensive) hardware, as explored prior. As a result it does not come ready for integration into doorways or smart phones. For this integration to happen there would need to be an individual development period for each integration, as the differing hardware, operating systems and envi- ronments all would impact on the detection capabilities and general operation of the solution. As aforementioned, this solution also does not incorporate all methods of PAD, but rather focuses on the replay attack under very specific conditions, which will be outlined below. The theorised implementation flow would look like Figure 4.1, if it were to be further developed and implemented.
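To make the design in Figure 4.1 concrete, the sketch below shows one way the fine-tuned model could be assembled in Keras: the VGG16 base frozen up to its last convolutional block, with a newly created fully connected head on top. The layer sizes, optimiser settings and exact freeze point here are assumptions; the script actually used is listed in Appendix A.2.

# Sketch of the fine-tuned model: freeze VGG16 below its last convolutional
# block and attach a new fully connected head. Layer sizes, learning rate and
# the freeze point are assumptions; see Appendix A.2 for the real script.
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout
from keras.optimizers import SGD

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")   # train only the last conv block

x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)             # newly created fully connected layer
x = Dropout(0.5)(x)                              # regularisation against over-fitting
prediction = Dense(1, activation="sigmoid")(x)   # real vs. fake output

model = Model(inputs=base.input, outputs=prediction)
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),   # low learning rate for fine tuning
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()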
  • 38. 26 CHAPTER 4. SOLUTION DESIGN AND IMPLEMENTATION 4.2 Implementation With no funding for this research project, the implementation of it was thus confined to the hardware that was easily accessible. This thankfully included a GPU, which would be considered the most crucial element of hardware in any deep learning problem. It allowed for faster than CPU training and testing times, which meant a more streamlined development phase. While the GPU used was far from optimal, it still provided adequate performance for the task at hand. The work flow for training and garnering insights into the solution was designed as a three step process. At their most simple level they are: • Generating bottleneck features • Fine tune model • Evaluate and generate insights into the model The reasons behind pursuing fine tuning for this solution has been thoroughly estab- lished through out this paper, so that will not be elaborated in this section. Rather, the key aspects of the solution will be explained. 4.2.1 Hardware The following section will outline the hardware used for the generation of this solu- tion. It will encompass the devices used for capture, the device used for presenting the replay attack and the machine used for training the neural network. Throughout the process of this thesis, accessibility and cost effectiveness were two key criteria that were adhered to; and the hardware was no different with its con- siderations. While more powerful and capable options were at the time available, they came at a price premium. Graphics Processing Unit Crutial to any deep learning problem is the graphics processing unit (GPU). It is the single most important piece of hardware, as it provides the computational power necessary to conduct the complex matrix calculations inherent in deep learning. As aforementioned, as this project had no budget, it was necessary to minimise the cost of all stages of this solution and as a result it was decided to utilise the GPU that was already built into my personal laptop. Thankfully, while it is a low end GPU, it is still considered a good entry level GPU for deep learning and can be parallelised with NVIDIA’s CUDA parallel computing platform. The GPU utilised was the Nvidia GTX 1050ti. A comparision can be seen in figure 4.2.
Figure 4.2: Comparison of GPUs

While cost was a major consideration, the ability to prototype and receive meaningful results in a reasonably timely manner was also important. In the initial stages of the project, careful consideration went into deciding whether to continue training on the 1050 Ti or to pursue cloud computing. Using FloydHub, a comparison was run in which an early iteration of a model was trained for 20 epochs on their most affordable GPU instance, a Tesla K80. To take advantage of the increased GPU memory, the batch size was raised from 20 to 256. While there was an improvement in overall training time, the difference did not warrant spending upwards of $20 AUD per training run. The small decrease in total training time, the considerable cost of cloud instances ($9 USD a month, plus $1.20 USD per hour of training) and the laborious task of migrating large datasets to the cloud (in the case of this thesis, 15GB) meant that pursuing a cloud computing option under the budget constraints of this project was not feasible.

Presentation Hardware

As defined in the scope, this thesis revolves around developing a deep learning solution to presentation attacks on facial recognition engines, specifically replay attacks. When deconstructing this problem, it becomes apparent that there is a multitude of ways a replay attack could occur, with most of the variation coming from the device the attack is presented on. In the same vein as limiting the exploration to replay attacks, it also became necessary to limit the device these replay attacks were presented on. It was decided that the replay attacks would be displayed on a Dell XPS 15 laptop.
This was chosen due to the way the replay is rendered on the screen (see Figure 4.3). For the sake of garnering meaningful insights into the way the model differentiated between real and fake, this display was used, as it provided a very definite framing of the face.

Figure 4.3: Example of spoof image

Host Machine

The machine used for all development and training of the solution was the Dell XPS 15; its specifications can be seen in Figure 4.4.
Figure 4.4: Machine used for training
4.2.2 Software used for development

The development environment, much like that of any other software project, consisted of an IDE and version control to keep everything organised and logical, which allowed for a methodical approach to development. The IDE used was VS Code, with integrated linting, a version control portal and IntelliSense. VS Code offers a very pleasant platform for deep learning development in particular, where easy visual access to dataset paths and config files streamlines the process and reduces the monotony of development. The version control system used was Git, with the repository stored on GitHub and linked in Appendix B.

4.2.3 Overview of deep learning framework

There are many languages that can be used to write a deep learning solution; however, Python is the most widely supported online and is a very easy language in which to prototype and eventually develop a solution. This simplicity, which comes from its clean syntax and emphasis on natural language, lends itself to easy-to-follow code and therefore reliable solutions. This is why Python was used to develop this solution.

Back end

TensorFlow was used as the back end to facilitate the use of neural networks. Much like Python, TensorFlow was chosen for its accessibility and support network. While it is not as fast as other deep learning frameworks such as CNTK and MXNet, its wide support online and its seamless integration with Python and Keras make it a logical choice [40].

Front end

TensorFlow, while powerful, is extremely verbose in its implementation, which is why Keras has been used as a front-end API to make development easier. Like any other deep learning problem, this one uses large amounts of data, and Keras offers helpful functions for interfacing the network with that data.

4.3 Dataset

4.3.1 Data Collection

The data collection phase took the longest amount of time during development. As has been made abundantly clear throughout this paper, extensive and varied data is paramount to the success of a deep learning solution. As a result, it was made a priority that data collection would begin as soon as possible
in the first semester of this thesis. However, as there was no budget for this project, there was no opportunity to conduct capture sessions in which participants could be rewarded for taking part. As a result, data collection relied on family and friends donating their time and their faces for the benefit of this project. The collection was kept as simple as possible: participants were asked to film their face with their phones, occasionally moving it around, to generate as many unique frames as possible for the network to learn from. The following devices were used in the collection:
• iPhone X
• iPhone 6
• iPhone 7
• Logitech C920 HD Pro
• Dell XPS 15 webcam

4.3.2 Size and Structure

As has been outlined throughout, the data was the most important part of ensuring a successful outcome. In any deep learning problem, adequate data needs to be presented to the network for it to begin to identify deterministic features within each classification class. In this case, the network needed to learn the difference between real and fake images presented to it. The initial dataset that was curated was 120GB, with a 70/20/10 split between training, validation and test sets. This equated to approximately 850,000 images. However, these images were spread over only 10 individuals' faces. Due to the method of collection (filming faces and then splitting each video into its frames), there was a large amount of similar data within this set. Furthermore, there was an uneven balance between individuals' faces, which resulted in a heavy bias. As a result, the decision was made to further curate this dataset into more unique examples. The end result was a 15GB dataset with an 80/10/10 split over train, validation and test respectively, equating to approximately 98,000 images. This reduction in size without losing unique data allowed for a more efficient training process with respect to total training time.
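The frames themselves were produced by splitting each capture video into individual images, as described above. The actual splitting script is not reproduced in the appendices, so the following is only a minimal sketch of how this can be done with OpenCV; the file naming, folder layout and sampling step are assumptions for illustration.

import cv2
import os

def video_to_frames(video_path, out_dir, label, step=5):
    # Split a capture video into frames, keeping every `step`-th frame to reduce
    # the amount of near-duplicate data written into the real/fake folders.
    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            # e.g. real_0001.png or fake_0001.png, matching the real/fake split
            cv2.imwrite(os.path.join(out_dir, '{}_{:04d}.png'.format(label, saved)), frame)
            saved += 1
        index += 1
    capture.release()
    return saved

# Example usage (hypothetical paths):
# video_to_frames('captures/subject01_phone.mov', 'dataset/training/real', 'real')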
Figure 4.5: Samples of real images

A sample of the variation in conditions under which facial artefacts were captured can be seen in Figures 4.5 and 4.6. While effort was made to keep lighting and the crop of the face reasonably consistent, variation did creep in through each individual's interpretation of the instructions given to them. This did not appear to affect the end results; however, the test cases were also conducted under similar lighting conditions and so may be representative only of what the network learnt under those conditions.

Figure 4.6: Samples of fake images

4.3.3 Improvements and limitations

It is clear from Figures 4.5 and 4.6, and from what was touched on above, that the dataset did not represent all lighting conditions that would be expected in a real-world implementation. Rather, reasonably good lighting was employed in each data gathering session. If this solution were to progress past this point of early proof of concept, considerable emphasis would need to be placed on the conditions under which the data gathering sessions were conducted: not just to ensure optimal conditions, but the contrary, to ensure that all environments expected in a real-world implementation are covered. This includes the sub-optimal conditions often experienced in dimly lit rooms. This is where deep learning, when starved of certain variations of data, will fall over, and where other techniques such as VCSEL-based sensing would still allow an accurate detection. The obvious limitation of this solution is that it was only trained to detect replay attacks, and under the very specific circumstances outlined throughout. However, this was by design; it was never meant to be a complete solution, rather a
proof that with this particular method, even on limited hardware and with limited access to extensive data, a reasonable result can be obtained in the endeavour to minimise hardware overhead within devices for the detection of presentation attacks on facial recognition engines. Further work will be outlined below; this is where the improvements would take place, through greater diversity of data and, with that, the range of attack methods the solution is able to detect.

4.4 Architecture

The architecture of VGG16 was outlined above, and the revised architecture can be seen in Figure 4.7, where the fully connected block found in the standard VGG16 has been replaced with one that suits binary classification: a sigmoid activation instead of softmax, and a single output (Real or Fake) rather than the standard 1000 classes. The two halves of the network (the convolutional base and the fully connected layer) are instantiated separately, and the base is run once over the training and validation data to obtain what are called bottleneck features. These bottleneck features are the last activation maps of the convolutional base before the fully connected layer. As explained in the theory, the bottleneck features are generated and stored in NumPy arrays, rather than fusing the convolutional base and the fully connected layer and training the whole lot, for computational efficiency. This was only possible because data augmentation was not used, as there was sufficient data for the task at hand. These modifications to the network facilitate an efficient fine-tuning process, one tailored to the task at hand: binary classification. A minimal sketch of the replacement classifier head follows Figure 4.7.

Figure 4.7: VGG16 graphic
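The sketch below shows only the replacement head described above; it mirrors the classifier built in the listing in Appendix A.1. The bottleneck shape shown is an assumption for 224x224 inputs passed through VGG16 with include_top=False.

from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

def build_top_model(bottleneck_shape=(7, 7, 512)):
    # Binary classifier head that replaces VGG16's 1000-way softmax block.
    top_model = Sequential()
    top_model.add(Flatten(input_shape=bottleneck_shape))
    top_model.add(Dense(256, activation='relu'))
    top_model.add(Dropout(0.5))
    top_model.add(Dense(1, activation='sigmoid'))  # single Real/Fake output
    return top_model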
4.4.1 Fine tuning model (transfer learning)

The last convolutional block of the model, illustrated in Figure 4.7 by the 14x14x512 block, is fine-tuned together with the fully connected layer. This is done after the bottleneck features of the full convolutional base have been collected and the fully connected layer has been trained on them separately. Fine tuning is achieved by freezing the weights of every layer up to the last convolutional block, meaning those weights cannot be updated. This is possible with the VGG16 packaged with Keras because it has been trained on ImageNet and therefore already has trained weights. If either part did not already have trained weights, the large gradient updates triggered by a randomly initialised fully connected layer would destroy the learned weights in the convolutional base. The weights for the fully connected layer are obtained when the bottleneck features are created.

Another reason for using fine tuning, in addition to those previously mentioned, is that a network this large, coupled with a large dataset of limited variation, has a very large entropic capacity and thus a tendency to overfit. As the features learnt at the lower levels of the convolutional base are general (edges, variation in lighting, etc.), they apply to a large range of computer vision tasks and can therefore be leveraged in this solution. Another consideration that is important in order to obtain meaningful results is a relatively slow learning rate (in this case 1e-4). This prevents large-magnitude updates that tend to ruin previously learnt weights.

During training and validation, the dataset was run over the last convolutional block and the fully connected layer for 20 epochs with a batch size of 20. The low batch size was due to the limited graphics card used for training, which has only 4GB of memory. While 20 epochs is not a lot compared with the usual deep learning task, it can be seen in Figure 5.1 that the loss and the accuracy had plateaued by epoch 20, and running for longer would not have been beneficial. This can potentially be put down to the task being reasonably easy: as a binary classifier, the network only needed to learn the difference between two classes, and the two classes were obviously different in the way they were represented, so the network learnt relatively quickly. Another factor contributing to the efficient convergence of the network, despite its depth and large number of parameters, is the implicit regularisation imposed by that greater depth and the smaller convolution filter sizes [30].

Overall training time for this configuration was approximately 20 hours. Deep learning is an extremely computationally expensive task, and as has been outlined, every effort was made to make the process more efficient. The freezing and training configuration described above is sketched below.
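A minimal sketch of the fine-tuning step, mirroring the listing in Appendix A.2. The layer index, learning rate, epochs and batch size are the values used in this thesis; `model` is the combined VGG16 base plus classifier head, and `train_batches` and `valid_batches` are assumed to be the directory iterators built with ImageDataGenerator (the steps-per-epoch arguments used in the full listing are omitted here for brevity).

from keras import optimizers

# Freeze everything below the last convolutional block of VGG16 so that only
# block 5 and the new classifier head receive weight updates.
for layer in model.layers[:15]:
    layer.trainable = False

# A deliberately small learning rate prevents large updates from destroying the
# ImageNet-learned weights in the unfrozen block.
model.compile(optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# 20 epochs at a batch size of 20, dictated by the 4GB of memory on the GTX 1050 Ti.
model.fit_generator(train_batches,
                    validation_data=valid_batches,
                    epochs=20,
                    verbose=1)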
With the hardware at hand, operating over a large network like VGG16 and a reasonably large dataset, this training time is about as efficient as one could expect, and it only serves to highlight the effectiveness of transfer learning. Considering that training only the last convolutional block and the fully connected layer took 20 hours, training the network from scratch would blow that time out significantly and would be completely infeasible on this hardware. As a comparison, in the original training of VGG16, with state-of-the-art computing power (4x Titan Black GPUs), it took the Visual Geometry Group 2-3 weeks to train a single net [30].

4.4.2 Design Rationale

While the implementation of this solution is not the most ground-breaking or revolutionary, that was largely intentional. In our modern epoch, where advances in artificial intelligence are starting to drive change across all industries, it is often forgotten that simple implementations can garner the most reliable results. As a proof of concept at an honours level, this became an interesting foray into the possibilities of leveraging established techniques and applying them to a problem at the forefront of concern. As the thesis developed, it became a goal to show that even relatively simple solutions can be applied to a very difficult problem, and to highlight the importance of meaningful data.

In phase one of this thesis, where extensive research was conducted into similar methods for this problem, none covered the exact implementation outlined in this paper. Furthermore, there are countless tutorials, articles and papers written on binary classification that use meaningless examples to illustrate the concept. This posed an interesting opportunity to develop a binary classifier, using these methods, that solved a real-world problem.
Chapter 5
Results and Discussion

Upon commencing this thesis topic, there was very little prior work on presentation attack detection for facial recognition engines, so there was little to compare results against or to benchmark techniques against. Even at this point, only a handful of approaches pursue a similar technique. However, with an F1 score of 99.96%, as seen in Figure 5.4, the results of this exploration into the viability of fine tuning deep convolutional neural networks as a detection method clearly highlight the effectiveness of such a technique.

During training, it can be seen in Figure 5.1 that a high level of accuracy was attained early on (around epoch 6), meaning that over the training data the network quickly learnt to differentiate between real and fake. This can be attributed to the very binary nature of the task at hand. Looking at a pair of images, it is very evident which of the two is real and which is fake: the fake is clearly displayed on a laptop screen with a black background surrounding it.

Figure 5.1: Tensorboard outputs
It was thus hypothesised that this is what the network was learning when differentiating between the two cases. This was verified by taking one of the real images, pasting it over a black background and running it through the model; as expected, the model classified the image as fake, as seen in Figure 5.2.

Figure 5.2: Fake spoofing

Investigating the results further, Figure 5.3 shows that there were no examples of the model predicting an image as real when it was actually fake. The false positives came from the model predicting an artefact as fake when it was in fact real, much like in Figure 5.2. This can be explained by the model looking for the black border around the image, which in fake images - as they were all captured under the same circumstances - is always present. In real images, there is the potential for a darker area within the image to be mistaken for the black area around the image on the computer screen. This further highlights the need for a versatile and varied dataset that does not introduce bias into the model. The dataset was purposely curated in this way to allow the generation of meaningful results in the absence of extensive, varied data; it gave the network something easier to learn than would be required if a production-ready solution were developed down the line.
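The compositing used for this probe was not included in the appendix listings, so the following is only an illustrative sketch of the idea; the canvas size, helper name and 0.5 decision threshold are assumptions, and the 0 = fake / 1 = real mapping follows the alphabetical folder ordering used by the evaluation script in Appendix A.3.

import cv2
import numpy as np

def black_background_probe(model, image_path, canvas_size=(720, 1280)):
    # Paste a genuine face image onto a black canvas so it mimics the framing of a
    # replay attack shown on a laptop screen, then query the trained model.
    face = cv2.resize(cv2.imread(image_path), (224, 224))
    canvas = np.zeros((canvas_size[0], canvas_size[1], 3), dtype=np.uint8)
    y0 = (canvas_size[0] - 224) // 2
    x0 = (canvas_size[1] - 224) // 2
    canvas[y0:y0 + 224, x0:x0 + 224] = face
    probe = cv2.resize(canvas, (224, 224)).astype('float') / 255.0
    prediction = model.predict(np.expand_dims(probe, axis=0))
    return 'Real' if float(prediction[0][0]) >= 0.5 else 'Fake'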
Figure 5.3: Confusion Matrix

The scores shown in Figure 5.4 were calculated in the following ways [39]:

Precision = tp / (tp + fp)

Precision gives an intuitive measure of the model's ability not to label an artefact as positive when it is in fact negative, where tp is the number of true positives and fp the number of false positives.

Recall = tp / (tp + fn)

Recall represents the model's ability to find all the true positives, where fn is the number of false negatives.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

The F1-Score conveys the balance between precision and recall. It is a measure of the classifier's accuracy: the harmonic mean of the two prior measures, best at 1 and worst at 0.

Figure 5.4: Final Scores
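These scores were produced with scikit-learn's precision_recall_fscore_support, as in the evaluation script in Appendix A.3. The following is a toy worked example only, not the thesis test set; the labels are invented to show how the formulas behave when one real image is incorrectly flagged as fake.

from sklearn.metrics import precision_recall_fscore_support

# Toy labels: 0 = fake, 1 = real (the positive class for average='binary').
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1]  # one real image incorrectly flagged as fake

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='binary')
print(precision, recall, f1)  # 1.000, 0.800, 0.889: tp=4, fp=0, fn=1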
Chapter 6
Conclusions

6.1 Summary and conclusions

As seen in the results, the F1-score obtained was incredibly promising. It showed that, under the conditions the network was trained under, it was extremely capable of detecting presentation attacks by determining whether an image was fake or real. That said, the solution was not perfect. Through thorough testing it was found that the network could be fooled if the lighting, the screen the attack was replayed on, or a host of other environmental variables were altered. While this is not a failure, it is a note on the importance of extremely large and varied datasets. It highlights that this problem, like any deep learning problem, would require thorough scenario planning before the solution is developed, to ensure that the data collected and collated incorporates all the potential scenes the engine is likely to see.

With growing interest in the intersection of biometrics and convenient access to everything from data to buildings, it has never been more important to focus on the other side of the coin: the security of these systems. With the emergence of viable deep learning solutions, the need for standalone hardware solutions to facial PAD might become a thing of the past.

6.2 Possible future work

As has been touched on throughout this paper, the solution presented is not complete; it only explores the possibility of implementing such a technique in a real-world scenario by investigating the effectiveness of detecting replay attacks. As a result, further work would revolve around exploring all types of presentation attack on facial recognition engines. Once the scope of
the final solution has been established, an extensive data gathering and collation process would need to take place to ensure there is enough variation to effectively identify presentation attacks under each of these methods. Some things that would need to be taken into consideration are:
• Type of attack
• Gender
• Device the data is captured with
• Device the artefacts are presented on (in the case of attacks that use a device to display)
• Variation in lighting
• Age of the individuals
• Angle the data is captured at

This is not an exhaustive list, but it aims to highlight the consideration needed when capturing vast amounts of data for a task such as this, where a false negative could result in the loss of valuable information or assets. When curating datasets, it is easy to overlook certain factors and subsequently introduce a bias into the model, which in turn introduces vulnerability into the detection mechanism.

While VGG16 served as a great platform for transfer learning, allowing the feasibility of this approach to be experimented with and tested, the majority use case for this sort of technology will be hand-held devices such as phones, so a more lightweight network would need to be implemented. This is down to the processing power available on current phones, which pales in comparison to even the cheapest CUDA-optimised GPUs. In future this may be less of an issue, as technological advancements will likely continue in the same vein they have over the last decade. Given the right amount of time, adequate computing power and access to large segments of the population for data collection, the problem of facial presentation attack detection is one that could be solved with deep learning, specifically by fine tuning deep convolutional neural networks.
Appendix A
Code Listings

A.1 Bottleneck Features Generation Script

import tensorflow as tf
from keras.applications import VGG16
from keras import models
from keras.layers import Flatten, Dense, Dropout
from keras import optimizers
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt
from keras.preprocessing.image import ImageDataGenerator
import json
import numpy as np
import math
from data_visualisation import vis_dataset, train_samples, validation_samples

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

with open('config.json') as f:
    conf = json.load(f)

top_model_weights_path = 'fc_model_13gb.h5'
# vis_dataset()


def save_bottleneck_features():
    datagen = ImageDataGenerator(rescale=1. / 255)
    # build the VGG16 network
    model = VGG16(include_top=False, weights='imagenet')
    generator = datagen.flow_from_directory(
        conf['train_path'],
        target_size=(conf['height'], conf['width']),
        batch_size=conf['batch_size'],
        class_mode=None,
        shuffle=False)
    bottleneck_features_train = model.predict_generator(
        generator, int(math.ceil(train_samples()[0] / conf['batch_size'])),
        verbose=1)  # hard code value as can't reference files file_calculations()[0]
    np.save('bottleneck_features_train_13gb.npy',
            bottleneck_features_train)

    generator = datagen.flow_from_directory(
        conf['validation_path'],
        target_size=(conf['height'], conf['width']),
        batch_size=conf['batch_size'],
        class_mode=None,
        shuffle=False)
    bottleneck_features_validation = model.predict_generator(
        generator, int(math.ceil(validation_samples()[0] / conf['batch_size'])),
        verbose=1)  # file_calculations()[0]
    np.save('bottleneck_features_validation_13gb.npy',
            bottleneck_features_validation)


def train_top_model():
    train_data = np.load('bottleneck_features_train_13gb.npy')
    # will need to make validation set to do this method
    train_labels = np.array(
        [0] * (train_samples()[0] // 2) + [1] * (train_samples()[0] // 2))

    validation_data = np.load('bottleneck_features_validation_13gb.npy')
    validation_labels = np.array(
        [0] * (validation_samples()[0] // 2) + [1] * (validation_samples()[0] // 2))

    model = Sequential()
    model.add(Flatten(input_shape=(train_data.shape[1:])))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(
        optimizer='rmsprop',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    model.fit(
        train_data,
        train_labels,
        epochs=conf['epochs'],
        batch_size=conf['batch_size'],
        verbose=1,
        validation_data=(validation_data, validation_labels))
    model.save_weights(top_model_weights_path)


save_bottleneck_features()
train_top_model()

A.2 Fine Tune Network Script

import tensorflow as tf
from keras.applications import VGG16
from keras import models
from keras import layers
from keras import optimizers
from keras.models import Sequential, Model
from keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
from keras.models import model_from_json
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt
from keras.preprocessing.image import ImageDataGenerator
import json
from data_visualisation import vis_dataset, train_samples, validation_samples
import time

NAME = "imception-finetune-on-13gb-20epochs-{}".format(int(time.time()))

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

top_model_weights_path = 'fc_model_13gb.h5'


tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))
with open('config.json') as f:
    conf = json.load(f)

# Build VGG16 base model
VGG_model = VGG16(
    weights=conf['weights'],
    input_shape=(conf["height"], conf["width"], 3),
    include_top=conf['include_top'])
print('Model loaded.')

# Building fully connected classifier
top_model = Sequential()
top_model.add(layers.Flatten(input_shape=VGG_model.output_shape[1:]))
top_model.add(layers.Dense(256, activation='relu'))
top_model.add(layers.Dropout(0.5))
top_model.add(layers.Dense(1, activation='sigmoid'))

# Load in weights from bottleneck training.
# Needed in order to conduct fine tuning
print("[INFO] - Loading top model weights")
top_model.load_weights(top_model_weights_path)

# Add FC classifier model to base model
print("[INFO] - Adding top layer")
# model.add(top_model)
model = Model(input=VGG_model.input, output=top_model(VGG_model.output))
model.summary()

# Set up to the last Conv block as non-trainable
# This will preserve weights in these layers
for layer in model.layers[:15]:
    layer.trainable = False

print("[INFO] - Compiling...")
model.compile(optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Won't apply data augmentation for now
datagen = ImageDataGenerator(rescale=1. / 255)
train_batches = datagen.flow_from_directory(
    conf['train_path'],
    target_size=(conf['height'], conf['width']),
    batch_size=conf['batch_size'],
    class_mode='binary'
)
valid_batches = datagen.flow_from_directory(
    conf['validation_path'],
    target_size=(conf['height'], conf['width']),
    batch_size=conf['batch_size'],
    class_mode='binary'
)

es = EarlyStopping(monitor='val_acc',
                   mode='max',
                   verbose=1,
                   patience=7)
mc = ModelCheckpoint('best_model_13gb.h5',
                     monitor='val_acc',
                     mode='max',
                     verbose=1,
                     save_best_only=True)
history = model.fit_generator(
    train_batches,
    validation_data=valid_batches,
    epochs=conf['epochs'],
    steps_per_epoch=(train_samples()[0] // conf['batch_size']),
    validation_steps=(validation_samples()[0] // conf['batch_size']),
    verbose=1,
    callbacks=[tensorboard]
)
# serialize model to JSON
model_json = model.to_json()
with open("finetuned_vgg16_13gb.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights('imception_finetune_13gb.h5')
print("[INFO] - Saved model to disk")
A.3 Evaluate Model

# import the necessary packages
from keras.preprocessing.image import img_to_array
from keras import models
from keras.models import load_model
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator
from keras.models import model_from_json
import numpy as np
import argparse
import imutils
import glob
import json
import cv2
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import matplotlib.pyplot as plt
import itertools
np.set_printoptions(suppress=True)

with open('config.json') as f:
    conf = json.load(f)

# load the image
def single_image():
    for file in glob.glob("C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/demo_images/*.png"):
        image = cv2.imread(file)
        orig = image.copy()

        # pre-process the image for classification
        image = cv2.resize(image, (224, 224))
        image = image.astype("float") / 255.0
        image = img_to_array(image)
        image = np.expand_dims(image, axis=0)

        # load the trained convolutional neural network
        print("[INFO] loading network...")
        json_file = open(conf['FT_model'], 'r')
        loaded_model_json = json_file.read()
        json_file.close()
        model = model_from_json(loaded_model_json)
        # load weights into new model
        model.load_weights(conf['FT_weights'])
        print("loaded model from disk")

        # classify the input image
        prediction = model.predict(image)
        # build the label
        label = "Real" if prediction == 1 else "Fake"
        label = "{}".format(label)

        # draw the label on the image
        output = imutils.resize(orig, width=400)
        cv2.putText(output, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (0, 255, 0), 2)

        # show the output image
        cv2.imshow("Output", output)
        cv2.waitKey(0)


def load_model():
    # load the trained convolutional neural network
    print("[INFO] loading network...")
    json_file = open(conf['FT_model'], 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    model = model_from_json(loaded_model_json)
    # load weights into new model
    model.load_weights(conf['FT_weights'])
    print("loaded model from disk")
    # return model


def generate_scores():
    # load the trained convolutional neural network
    print("[INFO] loading network...")
    json_file = open(conf['FT_model'], 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    model = model_from_json(loaded_model_json)
    # load weights into new model
    model.load_weights(conf['FT_weights'])
    print("loaded model from disk")

    generator = ImageDataGenerator()
    test_generator = generator.flow_from_directory(
        conf['test_path'],
        target_size=(conf['height'], conf['width']),
        batch_size=conf['batch_size'],
        shuffle=False,
        class_mode='binary'
    )

    # predict crisp classes for test set
    test_generator.reset()
    predictions = model.predict_generator(test_generator, verbose=1)
    predictions = np.concatenate(predictions, axis=0)
    predictions = predictions.astype(int)
    val_trues = (test_generator.classes)

    cf = confusion_matrix(val_trues, predictions)
    precisions, recall, f1_score, _ = precision_recall_fscore_support(
        val_trues, predictions, average='binary'
    )
    # plt.matshow(cf)
    # plt.title('Confusion Matrix Plot')
    # plt.colorbar()
    # plt.xlabel('Predicted')
    # plt.ylabel('Actual')
    # plt.show()
    plt.imshow(cf, cmap=plt.cm.Blues, interpolation='nearest')
    plt.colorbar()
    plt.title('Confusion Matrix without Normalization')
    plt.xlabel('Predicted \n F1 Score: {0:.3f}%'.format(f1_score * 100))
    plt.ylabel('Actual')
    tick_marks = np.arange(len(set(val_trues)))  # length of classes
    class_labels = ['Fake', 'Real']
    tick_marks
    plt.xticks(tick_marks, class_labels)
    plt.yticks(tick_marks, class_labels)
    # plotting text value inside cells
    thresh = cf.max() / 2.
    for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
        plt.text(
            j, i, format(cf[i, j], 'd'),
            horizontalalignment='center',
            color='white' if cf[i, j] > thresh else 'black'
        )
    plt.show()

    print('F1 score: %f' % f1_score)
    print('Recall Score: %f' % recall)
    print('precisions: %f' % precisions)


# generate_scores()
single_image()
A.4 Visualisation Generation Script

import matplotlib.pyplot as plt
import os
import json


with open('config.json') as f:
    conf = json.load(f)


def train_samples():
    path_real, dirs_real, files_real = next(os.walk(conf['train_path'] + '/real'))
    path_fake, dirs_fake, files_fake = next(os.walk(conf['train_path'] + '/fake'))

    file_count_real = len(files_real)
    file_count_fake = len(files_fake)
    print(file_count_fake + file_count_real)
    return file_count_fake + file_count_real, file_count_real, file_count_fake


def validation_samples():
    path_real, dirs_real, files_real = next(os.walk(conf['validation_path'] + '/real'))
    path_fake, dirs_fake, files_fake = next(os.walk(conf['validation_path'] + '/fake'))

    file_count_real = len(files_real)
    file_count_fake = len(files_fake)
    print(file_count_fake + file_count_real)
    return file_count_fake + file_count_real, file_count_real, file_count_fake


def vis_dataset():
    fig = plt.figure()
    x = ['real', 'fake']
    y = [file_calculations()[1], file_calculations()[2]]

    plt.bar(x, y)
    plt.title('Balance of training dataset')
    plt.xlabel('Labels')
    plt.ylabel('Number of images')
    plt.show()
    fig.savefig(conf['directory'] + '/training_visualisation.jpg')
# vis_dataset()

A.5 Config File

{
    "weights": "imagenet",
    "include_top": false,

    "FT_model": "Final_Model/best_run_98.5%.json",
    "FT_weights": "Final_Model/best_run_98.5%.h5",

    "directory": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception",
    "train_path": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/new_curated_smaller/training",
    "test_path": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/new_curated_smaller/test",
    "validation_path": "C:/Users/enqui/AppData/Local/Programs/Python/Python36/Thesis/repo/imception/new_curated_smaller/validation",
    "height": 224,
    "width": 224,
    "validation_split": 0.20,
    "seed": 9,
    "num_classes": 2,
    "batch_size": 20,
    "epochs": 20
}
Appendix B
Companion disk

https://github.com/wakefieldcooper/imception.git

The dataset has not been provided, as it contains the faces of individuals who gave permission for their faces to be used as training data, not for public distribution.
Appendix C
Tensorboard Graphics

Figure C.1: Tensorboard output of network
Appendix D
Timeline From Proposal

Figure D.1: Timeline of project from project proposal
Bibliography

[1] R. R. and C. Busch, "Presentation attack detection methods for face recognition systems: A comprehensive survey," Norwegian Biometric Laboratory, Norwegian University of Science and Technology (NTNU), Gjøvik, Norway, 2017.

[2] V. Savov, "The Verge," 12 September 2017. [Online]. Available: https://www.theverge.com/2017/9/12/16288806/apple-iphone-x-price-release-date-features-announced. [Accessed 22 August 2018].

[3] C. Zhao, "News Week," 18 December 2017. [Online]. Available: https://www.newsweek.com/iphone-x-racist-apple-refunds-device-cant-tell-chinese-people-apart-woman-751263. [Accessed 22 August 2018].

[4] Apple Inc., "Face ID Security," Apple, Silicon Valley, Nov. 2017.

[5] R. R. and C. Busch, "Presentation attack detection algorithm for face and iris biometrics," in 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 2014.

[6] A. A. and S. Marcel, "Counter-measures to photo attacks in face recognition: A public database and a baseline," International Joint Conference on Biometrics (IJCB), 2011.

[7] N. M. D. and B. Q. Minh, "Your face is not your password," in Black Hat Conference, 2009.
[8] M.-A. Waris, "The 2nd competition on counter measures to 2D face spoofing attacks," ICB, 2013.

[9] A. Greenberg, "Wired," Wired, 9 December 2017. [Online]. Available: www.wired.com/story/iphone-x-faceid-security/. [Accessed 23 August 2018].

[10] Apple, "About Face ID advanced technology," Apple, 2017. [Online]. Available: www.support.apple.com/en-us/HT208108. [Accessed 23 Aug 2018].

[11] A. Karpathy, "TensorFlow," 2014. [Online]. Available: https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/. [Accessed 23 Aug 2018].

[12] S. H. et al., "Convolutional Neural Networks for Iris Presentation Attack Detection: Toward Cross-Dataset and Cross-Sensor Generalization," IEEE, Michigan State, 2018.

[13] A. A. a. S. I. Chingovska, "On the effectiveness of local binary patterns in face anti-spoofing," International Conference of the Biometrics Special Interest Group (BIOSIG), 2012.

[14] M. M. C. a. S. M. Andre Anjos, "Motion-based counter-measures to photo attacks in face recognition," IET Biometrics, 2013.

[15] J. Y. S. L. Z. L. D. Y. a. S. L. Shiwei Zhang, "A face antispoofing database with diverse attacks," IAPR, 2012.

[16] A. Extance, Faces light up over VCSEL prospects, SPIE Newsroom, 2018.

[17] C. Burt, Facial Recognition to grow by more than 26 percent through 2025. [Online]. Available:
https://www.biometricupdate.com/201811/facial-recognition-to-grow-by-more-than-26-percent-through-2025 [Accessed 21 April 2019].

[18] B. Mayo, Face ID deemed too costly to copy, Android makers target in-display fingerprint sensors instead. [Online]. Available: https://9to5mac.com/2018/03/23/face-id-premium-android-fingerprint-sensors/ [Accessed 24 April 2019].

[19] Y. Liu, X. Liu, et al., Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision. Computer Vision Foundation, 2018.

[20] Godoy, Alan; Simões, Flávio; Stuchi, Jose; Angeloni, Marcus; Uliani, Mário; Violato, Ricardo. Using Deep Learning for Detecting Spoofing Attacks on Speech Signals, 2015.

[21] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and K.-W. Cheung. Integration of image quality and motion cues for face anti-spoofing: A neural network approach. J. Visual Communication and Image Representation, 38:451-460, 2016.

[22] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An original face anti-spoofing approach using partial convolutional neural network. In IPTA, 2016.

[23] J. Yang, Z. Lei, and S. Z. Li. Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601, 2014.

[24] O'Shea, Keiron; Nash, Ryan. An Introduction to Convolutional Neural Networks. ArXiv e-prints, 2015.

[25] S. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, 2011.

[26] "Biometrics: Overview". Biometrics.cse.msu.edu. 6 September 2007. [Accessed 22 April 2019].
[27] "Mugspot Can Find A Face In The Crowd – Face-Recognition Software Prepares To Go To Work In The Streets". ScienceDaily. 12 November 1997. [Accessed 22 April 2019].

[28] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid. An original face anti-spoofing approach using partial convolutional neural network. In IPTA, 2016.

[29] Gil Press, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Available at: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/41c86b376f63. [Accessed 22 April 2019].

[30] Karen Simonyan and Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, Visual Geometry Group, Department of Engineering Science, University of Oxford, 2015.

[31] V. Gupta, Fine-tuning using pre-trained models. Available at: https://www.learnopencv.com/keras-tutorial-fine-tuning-using-pre-trained-models/. [Accessed 23 April 2019].

[32] F. Chollet, Building powerful image classification models using very little data. Available at: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html [Accessed 23 April 2019].

[33] Kevin Kelly, The Three Breakthroughs That Have Finally Unleashed AI on the World, 2017.

[34] F. Provost and T. Fawcett, Data Science and its Relationship to Big Data and Data-Driven Decision Making, 2013.

[35] Y. Bengio and Y. LeCun, Scaling Learning Algorithms towards AI, 2007.
[36] K. C. Morris, C. Schlenoff, V. Srinivasan, A Remarkable Resurgence of Artificial Intelligence and its Impact on Automation and Autonomy, IEEE, 2017.

[37] C. Page, US cops warned not to gawp at iPhones due to Face ID lock-out. Available at: https://www.theinquirer.net/inquirer/news/3064480/us-cops-warned-not-to-gawp-at-iphones-due-to-face-id-lock-out. [Accessed 26 April 2019].

[38] "Keras: The Python Deep Learning library". Available at: https://keras.io/. [Accessed 3 June 2019].

[39] "Precision, recall and f-score". Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html [Accessed June 2019].

[40] S. Shi, Q. Wang, P. Xu, X. Chu, Benchmarking State-of-the-Art Deep Learning Software Tools, arXiv, 2017.

[41] D. T. Nguyen, T. D. Pham, M. B. Lee and K. R. Park, Visible-Light Camera Sensor-Based Presentation Attack Detection for Face Recognition by Combining Spatial and Temporal Information, MDPI, 2019.

[42] Singh, Aruni; Singh, Sanjay; Tiwari, Shrikant. Comparison of face Recognition Algorithms on Dummy Faces. International Journal of Multimedia & Its Applications, 2012.

[43] Cortes, Corinna; Vapnik, Vladimir N. "Support-vector networks". Machine Learning, 1995.

[44] Raghavendra, R.; Raja, Kiran B.; Busch, Christoph. Presentation Attack Detection for Face Recognition Using Light Field Camera. IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society, 2015.