FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
By
NAIMISH AGARWAL
Summer Internship 2016 Project
Department of Analytical Information Systems and Business
Intelligence, University of Paderborn, Germany
External Guide
DR ARTUS KROHN-GRIMBERGHE
Internal Guide
DR RANJANA VYAS
2. NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
2
Candidate’s Declaration
I hereby declare that the work presented in this project report entitled Facial Key Points
Detection using Deep Convolutional Neural Network - NaimishNet, submitted as Summer
Internship 2016 project report at Indian Institute of Information Technology, Allahabad, is
an authenticated record of my original work carried out from June 2016 to July 2016 under
the guidance of Dr Artus Krohn-Grimberghe and Dr Ranjana Vyas. Due
acknowledgements have been made in the text to all other material used. The project
was done in full compliance with the requirements and constraints of the prescribed
curriculum.
Signature
______________________________
Naimish Agarwal (IRM2013013)
Paderborn, Germany
14th July, 2016
Certificate
This is to certify that the above statement made by the candidate is correct to the best
of my knowledge.
Signatures
______________________________
Dr Artus Krohn-Grimberghe
Paderborn, Germany
14th July, 2016
_____________________________
Dr Ranjana Vyas,
Allahabad, India
1st August, 2016
Table of Contents
Candidate’s Declaration
Certificate
List of Figures
List of Tables
About Me
External Guide
Internal Guide
Acknowledgement
1 Introduction
2 NaimishNet Architecture
2.1 NaimishNet Architecture Design Decisions
3 Experiments
3.1 Dataset Description
3.2 Proposed Approach
3.2.1 Data Augmentation
3.2.2 Data Pre-processing
3.2.3 Pre-training Analysis
3.2.4 Training
3.2.5 Evaluation
4 Results
5 Conclusion
6 Future Scope
7 Appendix
7.1 Software and Hardware Deployed
8 References
List of Figures
Figure 1 Number of non-missing target rows in the augmented and pre-processed train dataset
Figure 2 Annotated Faces with 15 Facial Key Points, marked in blue for original and in red for predicted.
Figure 3 shows the worst, medium and best predictions, where the RMSEtop = 6.87, RMSEmiddle = 1.04, and RMSEbottom = 0.37, the original FKPs are shown in blue and the predicted FKPs are shown in red.
Figure 4 Number of Epochs per NaimishNet
Figure 5 Loss History of NaimishNet for Left Eye Center as Target
Figure 6 Loss History of NaimishNet for Right Eye Center as Target
Figure 7 Loss History of NaimishNet for Left Eye Inner Corner as Target
Figure 8 Loss History of NaimishNet for Left Eye Outer Corner as Target
Figure 9 Loss History of NaimishNet for Right Eye Inner Corner as Target
Figure 10 Loss History of NaimishNet for Right Eye Outer Corner as Target
Figure 11 Loss History of NaimishNet for Left Eyebrow Inner End as Target
Figure 12 Loss History of NaimishNet for Left Eyebrow Outer End as Target
Figure 13 Loss History of NaimishNet for Right Eyebrow Inner End as Target
Figure 14 Loss History of NaimishNet for Right Eyebrow Outer End as Target
Figure 15 Loss History of NaimishNet for Nose Tip as Target
Figure 16 Loss History of NaimishNet for Mouth Left Corner as Target
Figure 17 Loss History of NaimishNet for Mouth Right Corner as Target
Figure 18 Loss History of NaimishNet for Mouth Center Top Lip as Target
Figure 19 Loss History of NaimishNet for Mouth Center Bottom Lip as Target
Figure 20 Loss per final check-pointed NaimishNet
Figure 21 Comparison of Kaggle Scores
List of Tables
Table 1 NaimishNet layer-wise architecture
Table 2 Filter details of Convolution2d layers
Table 3 Target Facial Key Points which need to be swapped while horizontally flipping the data
Table 4 Software deployed with version information
About Me
Naimish Agarwal is a student pursuing a B. Tech in Information
Technology with an M. Tech in Robotics, a five-year dual degree program,
in the batch of 2013-2018, at the Indian Institute of Information
Technology, Allahabad, India.
External Guide
Dr Artus Krohn-Grimberghe is a Junior Professor and Head of the
Department of Analytical Information Systems and Business
Intelligence, at University of Paderborn, Germany.
Internal Guide
Dr Ranjana Vyas is an Assistant Professor at Indian Institute of
Information Technology, Allahabad, India.
Acknowledgement
I would like to express my special gratitude to Prof Dr O.P. Vyas, Dean Research &
Development, Indian Institute of Information Technology, Allahabad, for providing me the
opportunity to join the University of Paderborn, Germany, for a two-month internship in the
summer of 2016.
I would like to express my gratitude to Dr Rahul Kala, Assistant Professor in Robotics and
Artificial Intelligence Laboratory, Indian Institute of Information Technology, Allahabad for
motivating me when in need.
1 Introduction
Facial Key Points (FKPs) detection is an important and challenging problem in the field of
computer vision, which involves detecting FKPs such as the centers and corners of the eyes,
the nose tip, etc. The problem is to predict the real-valued (x, y) co-ordinates of the FKPs,
in the space of image pixels, for a given face image. It finds application in tracking faces
in images and videos, analysis of facial expressions, detection of dysmorphic facial signs
for medical diagnosis, face recognition, etc.
Facial features vary greatly from one individual to another, and even for a single individual
there is a large amount of variation due to pose, size, position, etc. The problem becomes
even more challenging when the face images are taken under different illumination
conditions, viewing angles, etc.
In the past few years, advances in FKPs detection have been made through the application of
deep convolutional neural networks (DCNNs), a special type of feed-forward
neural network with shared weights and local connectivity. DCNNs have helped build
state-of-the-art models for image recognition, recommender systems, natural language
processing, etc. Krizhevsky et al. [1] applied a DCNN in the ImageNet image classification
challenge and outperformed the previous state-of-the-art model for image classification.
Wang et al. [2] addressed FKPs detection by first applying histogram stretching for image
contrast enhancement, followed by principal component analysis for noise reduction and
mean patch search algorithm with correlation scoring and mutual information scoring for
predicting left and right eye centers. Sun et al. [3] estimated FKPs by using a three-level
convolutional neural network, where at each level, outputs of multiple networks were
fused for robust and accurate estimation. Longpre et al. [4] predicted FKPs by first applying
data augmentation to expand the number of training examples, followed by testing
different architectures of convolutional neural networks like LeNet [5] and VGGNet [6],
and finally used a weighted ensemble of models. Nouri et al. [7] used six specialist DCNNs
trained over pre-trained weights. Oneto et al. [8] applied a variety of data pre-processing
techniques like histogram stretching, Gaussian blurring, followed by image flipping, key
point grouping, and then finally applied LeNet.
Taigman et al. [9] provided a new deep network architecture, DeepFace, for state-of-the-art
face recognition. Li et al. [10] provided a new DCNN architecture for state-of-the-art
face alignment.
We present a DCNN architecture – NaimishNet, based on LeNet, which addresses the
problem of facial key points detection by providing a learning model for a single facial
key point.
2 NaimishNet Architecture
NaimishNet consists of 4 convolution2d layers, 4 maxpooling2d layers and 3 dense layers,
with sandwiched dropout and activation layers, as shown in Table 1.
Layer Number   Layer Name        Layer Shape
1              Input1            (1, 96, 96)
2              Convolution2d1    (32, 93, 93)
3              Activation1       (32, 93, 93)
4              Maxpooling2d1     (32, 46, 46)
5              Dropout1          (32, 46, 46)
6              Convolution2d2    (64, 44, 44)
7              Activation2       (64, 44, 44)
8              Maxpooling2d2     (64, 22, 22)
9              Dropout2          (64, 22, 22)
10             Convolution2d3    (128, 21, 21)
11             Activation3       (128, 21, 21)
12             Maxpooling2d3     (128, 10, 10)
13             Dropout3          (128, 10, 10)
14             Convolution2d4    (256, 10, 10)
15             Activation4       (256, 10, 10)
16             Maxpooling2d4     (256, 5, 5)
17             Dropout4          (256, 5, 5)
18             Flatten1          (6400)
19             Dense1            (1000)
20             Activation5       (1000)
21             Dropout5          (1000)
22             Dense2            (1000)
23             Activation6       (1000)
24             Dropout6          (1000)
25             Dense3            (2)
Table 1 NaimishNet layer-wise architecture
Specifics of NaimishNet:
o Input1 is the input layer.
o Activation1 to Activation5 use Exponential Linear Units (ELUs) [11] as activation
functions, whereas Activation6 uses Linear Activation Function.
o Dropout [12] probability is increased from 0.1 to 0.6 from Dropout1 to Dropout6, with
a step size of 0.1.
o Maxpooling2d1 to Maxpooling2d4 use a pool shape of (2, 2), with non-overlapping
strides and no zero padding.
o Flatten1 flattens 3d input to 1d output.
o Convolution2d1 to Convolution2d4 do not use zero padding, have their weights
initialized with random numbers drawn from uniform distribution, and the specifics
of number of filters and filter shape are shown in Table 2.
o Dense1 to Dense3 are regular fully connected layers with weights initialized using
Glorot uniform initialization [13].
o Adam [14] optimizer, with learning rate of 0.001, beta1 of 0.9, beta2 of 0.999 and
epsilon of 1e-08, is used for minimizing Mean Squared Error (MSE).
o There are 7,488,962 model parameters.
Layer Name        No. of Filters   Filter Shape
Convolution2d1    32               (4, 4)
Convolution2d2    64               (3, 3)
Convolution2d3    128              (2, 2)
Convolution2d4    256              (1, 1)
Table 2 Filter details of Convolution2d layers
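The layer shapes in Table 1 and the parameter count quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not the original implementation: it assumes stride-1 convolutions without zero padding, (2, 2) non-overlapping pooling, and one bias per filter or dense unit. The helper names are illustrative.

```python
def conv_out(size, kernel):
    """Output width/height of a stride-1 convolution without zero padding."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output width/height of non-overlapping max pooling without padding."""
    return size // pool


# (number of filters, square filter size) for Convolution2d1..4, from Table 2
conv_specs = [(32, 4), (64, 3), (128, 2), (256, 1)]
dense_units = [1000, 1000, 2]  # Dense1..Dense3, from Table 1

channels, size = 1, 96  # Input1 has shape (1, 96, 96)
total_params = 0
shapes = []
for n_filters, k in conv_specs:
    # weights (filters x kernel area x input channels) plus one bias per filter
    total_params += n_filters * (k * k * channels) + n_filters
    size = conv_out(size, k)
    shapes.append((n_filters, size, size))  # after convolution
    size = pool_out(size)
    shapes.append((n_filters, size, size))  # after max pooling
    channels = n_filters

flat = channels * size * size  # size of the Flatten1 output
fan_in = flat
for units in dense_units:
    total_params += fan_in * units + units  # weights plus biases
    fan_in = units

print(shapes)
print(flat, total_params)
```

Under these assumptions the flattened size comes out as 6400 and the total as 7,488,962, in agreement with Table 1 and the parameter count stated in the list above. Almost all parameters sit in Dense1 (6400 x 1000 weights).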
2.1 NaimishNet Architecture Design Decisions
We tried several deep network architectures before settling on the NaimishNet
architecture. Most of the design decisions were governed primarily by
hardware restrictions and the time taken per epoch for different runs of the network.
Although we wanted one architecture to fit well for all the different FKPs as targets, we
mainly took decisions based on the performance of the model with the left eye center FKP as
target. We ran the different architectures for at most 40 epochs and forecast their
performance and generalizability based on the experience we gathered while training
multiple deep network architectures. We do not provide exact RMSE scores for these runs,
since this data was not recorded while experimenting with different
architectures. We also do not provide the specifics of every architecture we tried, since
multiple changes were made without being recorded before arriving at the NaimishNet
architecture.
We experimented with 5 convolution2d and 5 maxpooling2d layers and 4 dense layers
with rectified linear units as activation functions, but that overfitted. We experimented
with 3 convolution2d and 3 maxpooling2d layers and 2 dense layers with rectified linear
units as activation functions, but that in turn required pre-training the weights to achieve
good performance. We wanted to avoid pre-training to reduce the training time. We did
not experiment with zero padding and strides.
We sandwiched batch normalization layers [15] after every maxpooling2d layer, but that
increased the per-epoch training time fivefold. We used a dropout
probability of 0.5 for every dropout layer, but that showed a poor fit. We removed the
dropout layers, but that resulted in overfitting. We tried different initialization schemes such as
He normal initialization [15], but that did not perform well in the first 20 epochs;
in fact, uniform initialization performed better for us. We experimented with
different patience levels (PLs) for the early stopping callback function and concluded that a PL
of 10% of the maximum number of epochs (300 in our case, giving a PL of 30) would do well in
most cases. For the aforementioned cases, we used the RMSProp optimizer [16].
The main aim was to reduce the training time while achieving maximum
performance and generalizability. To achieve that, we experimented with exponential
linear units as activation functions, the Adam optimizer, and linearly increasing dropout
probabilities along the depth of the network. The filter specifics were chosen based
on our past experience with an exponentially increasing number of filters and a linearly
decreasing filter size. The batch size was kept at 128 because vector operations are
optimized for sizes which are powers of 2. Also, given over 7 million model parameters
and limited GPU memory, a batch size of 128 was ideal for us. The number of main
layers, i.e. 4 convolutional layers (convolution2d and maxpooling2d) and 3 dense layers,
was ideal since exponential linear units perform well when the number of layers
is greater than 5. The parameters of the Adam optimizer were used as recommended
by its authors, and we did not experiment with them.
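The progressions mentioned in this paragraph (filter counts doubling, filter sizes shrinking by one, dropout growing in steps of 0.1) can be written down compactly. This is only an illustrative sketch; the variable names are hypothetical and do not come from the original code.

```python
n_conv_layers = 4     # Convolution2d1..4
n_dropout_layers = 6  # Dropout1..6

# Exponentially increasing number of filters: 32, 64, 128, 256
filters = [32 * 2 ** i for i in range(n_conv_layers)]

# Linearly decreasing square filter size: 4, 3, 2, 1
filter_sizes = [4 - i for i in range(n_conv_layers)]

# Linearly increasing dropout probability: 0.1 up to 0.6 in steps of 0.1
dropout_probs = [round(0.1 * (i + 1), 1) for i in range(n_dropout_layers)]

print(filters, filter_sizes, dropout_probs)
```

The first two lists reproduce Table 2, and the third matches the dropout probabilities listed under the specifics of NaimishNet.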
NaimishNet is inspired by LeNet [5], whose architecture follows the pattern input ->
conv2d -> maxpool2d -> conv2d -> maxpool2d -> conv2d -> maxpool2d -> dense ->
output. The works of Longpre et al. [4], Oneto et al. [8], etc. have shown that LeNet-styled
architectures have been successfully applied to the Facial Key Points Detection problem.
3 Experiments
3.1 Dataset Description
We used the dataset from the Kaggle Competition - Facial Key Points Detection [17].
The dataset was chosen to benchmark our solution against the existing approaches which
address the FKPs detection problem.
Each face image has 15 FKPs: left eye center, right eye center, left eye inner corner,
left eye outer corner, right eye inner corner, right eye outer corner, left eyebrow inner end,
left eyebrow outer end, right eyebrow inner end, right eyebrow outer end, nose tip, mouth
left corner, mouth right corner, mouth center top lip and mouth center bottom lip. Here,
left and right are from the point of view of the subject.
The greyscale input images, with pixel values in the range [0, 255], have a size of 96 x 96 pixels.
The train dataset consists of 7049 images with 30 targets, i.e. (x, y) for each of the 15 FKPs. It
has missing target values for many FKPs in many face images. The test dataset consists of
1783 images with no target information. The Kaggle submission file consists of 27124 FKP
co-ordinates, which are to be predicted.
The Kaggle submissions are scored based on the Root Mean Squared Error (RMSE), which
is given by
\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \]
where y_i is the original value, \hat{y}_i is the predicted value and n is the number of targets to
be predicted.
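As a small illustration of the scoring formula, the sketch below computes the RMSE with NumPy. The function name and the sample values are made up for the example and are not taken from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error over all targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Four hypothetical target co-ordinates and their predictions (in pixels).
# Errors are -2, 2, -3, 1, so the mean squared error is 18 / 4 = 4.5.
example = rmse([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(example)  # square root of 4.5
```

Because the errors are squared before averaging, a few badly predicted key points raise the score more than many slightly inaccurate ones.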
3.2 Proposed Approach
3.2.1 Data Augmentation
Data augmentation helps boost the performance of a deep learning model when the
training data is limited, by generating more training data. We horizontally flipped [7]
the images for which target information for all 15 FKPs is available, and also swapped
the target values according to Table 3. Then, we vertically stacked the new horizontally
flipped data under the original train data to create the augmented train dataset.
Target Facial Key Point        Swapped With
Left Eye Center X              Right Eye Center X
Left Eye Center Y              Right Eye Center Y
Left Eye Inner Corner X        Right Eye Inner Corner X
Left Eye Inner Corner Y        Right Eye Inner Corner Y
Left Eye Outer Corner X        Right Eye Outer Corner X
Left Eye Outer Corner Y        Right Eye Outer Corner Y
Left Eyebrow Inner End X       Right Eyebrow Inner End X
Left Eyebrow Inner End Y       Right Eyebrow Inner End Y
Left Eyebrow Outer End X       Right Eyebrow Outer End X
Left Eyebrow Outer End Y       Right Eyebrow Outer End Y
Mouth Left Corner X            Mouth Right Corner X
Mouth Left Corner Y            Mouth Right Corner Y
Table 3 Target Facial Key Points which need to be swapped while horizontally flipping the data
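The flip-and-swap augmentation can be sketched as follows. This is a hedged illustration rather than the original code: the array layout (one row per FKP, in the order listed in Section 3.1), the function names, and in particular the mirroring of x co-ordinates as 95 - x (pixel-center convention) are assumptions, since the report only describes the swap.

```python
import numpy as np

# Row order follows the list of 15 FKPs in Section 3.1 (an assumption of this sketch).
FKP_NAMES = [
    "left_eye_center", "right_eye_center",
    "left_eye_inner_corner", "left_eye_outer_corner",
    "right_eye_inner_corner", "right_eye_outer_corner",
    "left_eyebrow_inner_end", "left_eyebrow_outer_end",
    "right_eyebrow_inner_end", "right_eyebrow_outer_end",
    "nose_tip", "mouth_left_corner", "mouth_right_corner",
    "mouth_center_top_lip", "mouth_center_bottom_lip",
]
# Left/right pairs from Table 3, as row indices into FKP_NAMES.
SWAP_PAIRS = [(0, 1), (2, 4), (3, 5), (6, 8), (7, 9), (11, 12)]


def flip_horizontal(image, keypoints, width=96):
    """Flip one (96, 96) image and its (15, 2) array of (x, y) key points."""
    flipped_image = image[:, ::-1].copy()
    flipped_kp = np.array(keypoints, dtype=float)
    # Assumption: x is mirrored about the vertical center line of the image.
    flipped_kp[:, 0] = (width - 1) - flipped_kp[:, 0]
    # Swap left and right key points so the labels stay subject-relative.
    for a, b in SWAP_PAIRS:
        flipped_kp[[a, b]] = flipped_kp[[b, a]]
    return flipped_image, flipped_kp


def augment(images, keypoints):
    """Stack flipped copies under the originals: (n, ...) becomes (2n, ...)."""
    flipped = [flip_horizontal(img, kp) for img, kp in zip(images, keypoints)]
    flipped_images = np.stack([f[0] for f in flipped])
    flipped_kps = np.stack([f[1] for f in flipped])
    return np.vstack([images, flipped_images]), np.vstack([keypoints, flipped_kps])
```

Flipping twice returns the original image and key points, which is a convenient sanity check for the swap table.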
3.2.2 Data Pre-processing
The image pixels are normalized to the range [0, 1] by dividing by 255.0, and the train
targets are zero-centered to the range [-1, 1] by first subtracting 48.0, since the images are
96 x 96, and then dividing by 48.0.
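A minimal sketch of this scaling is given below. It assumes the targets are mapped onto [-1, 1] as (y - 48) / 48, which is the order of operations consistent with the stated range; the function names, including the inverse mapping, are illustrative additions.

```python
import numpy as np


def scale_pixels(images):
    """Map pixel values from [0, 255] to [0, 1]."""
    return np.asarray(images, dtype=np.float32) / 255.0


def scale_targets(targets):
    """Map pixel co-ordinates from [0, 96] to [-1, 1]."""
    return (np.asarray(targets, dtype=np.float32) - 48.0) / 48.0


def unscale_targets(scaled):
    """Inverse of scale_targets: map [-1, 1] back to pixel co-ordinates."""
    return np.asarray(scaled, dtype=np.float32) * 48.0 + 48.0


print(scale_targets([0.0, 48.0, 96.0]))  # image edges map to -1 and 1, the center to 0
```

The inverse mapping is what would be needed to turn network outputs back into pixel co-ordinates before scoring.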
3.2.3 Pre-training Analysis
Figure 1 shows that the number of non-missing target rows differs across FKPs,
so we have created 15 NaimishNet models, one per FKP.
Figure 1 Number of non-missing target rows in the augmented and pre-processed train dataset
3.2.4 Training
For each of the 15 NaimishNet models, filter (keep) the rows with non-missing target values
and split the filtered train data into 80% train dataset (T) and 20% validation dataset (V).
Compute feature-wise mean M on T, and zero-center T and V by subtracting M, and also
store M. Reshape features of T and V to (1, 96, 96), and start the training. Finally, validate
on V, and store the model’s loss history.
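The data preparation for one model can be sketched as below. The report does not say how the 80/20 split was drawn or how missing values are encoded, so the shuffled split, the fixed seed, and the use of NaN for missing targets are assumptions of this sketch, and the function name is hypothetical.

```python
import numpy as np


def prepare_split(features, targets, val_fraction=0.2, seed=0):
    """features: (n, 9216) scaled pixels; targets: (n, 2), NaN where missing."""
    # Keep only the rows whose (x, y) target is present.
    keep = ~np.isnan(targets).any(axis=1)
    features, targets = features[keep], targets[keep]

    # Assumption: a random 80/20 split with a fixed seed.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_val = int(round(val_fraction * len(features)))
    val_idx, train_idx = order[:n_val], order[n_val:]

    # Feature-wise mean M is computed on the train part only and reused for V.
    mean = features[train_idx].mean(axis=0)
    x_train = (features[train_idx] - mean).reshape(-1, 1, 96, 96)
    x_val = (features[val_idx] - mean).reshape(-1, 1, 96, 96)
    return x_train, targets[train_idx], x_val, targets[val_idx], mean
```

Computing M on T alone keeps information from V out of the training inputs; the stored M is needed again at evaluation time.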
Figure 1 data (horizontal axis: number of non-missing target rows; vertical axis: target facial key point):
Target Facial Key Point      Non-missing target rows
Left Eye Center              9179
Right Eye Center             9176
Left Eye Inner Corner        4411
Left Eye Outer Corner        4407
Right Eye Inner Corner       4408
Right Eye Outer Corner       4408
Left Eyebrow Inner End       4410
Left Eyebrow Outer End       4365
Right Eyebrow Inner End      4410
Right Eyebrow Outer End      4376
Nose Tip                     9189
Mouth Left Corner            4409
Mouth Right Corner           4410
Mouth Center Top Lip         4415
Mouth Center Bottom Lip      9156
Each model is trained with a batch size of 128, and the maximum number of epochs is 300.
Two callback functions are used, which execute at the end of every epoch: Early
Stopping Callback (ESC) and Model Checkpoint Callback (MCC).
ESC stops the training when the number of contiguous epochs without improvement in
validation loss reaches 30 (the Patience Level). MCC stores the weights of the model with the
lowest validation loss.
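The interplay of the two callbacks can be illustrated with a small simulation over a list of per-epoch validation losses. This is a sketch of the logic described above, not the Keras implementation; the function name and the sample loss values are hypothetical.

```python
def run_with_callbacks(val_losses, patience=30):
    """Return (best_epoch, best_loss, epochs_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if loss < best_loss:
            # MCC: a new lowest validation loss, so the weights would be saved here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            # ESC: stop after `patience` contiguous epochs without improvement.
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss, epochs_run


# The loss improves for 3 epochs and then plateaus at a worse value.
result = run_with_callbacks([5.0, 4.0, 3.0] + [3.5] * 100, patience=30)
print(result)  # (3, 3.0, 33)
```

In this simulation training stops 30 epochs after the best epoch, so the check-pointed model lies 30 epochs before the last epoch that was run.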
3.2.5 Evaluation
For each of the 15 NaimishNet models, reload M, zero center test features by subtracting
M and reshape the samples to (1, 96, 96). Reload the best check-pointed model weights,
predict on test features, and store the predictions.
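The evaluation steps for one model can be sketched as follows. The `model_predict` argument stands in for the reloaded check-pointed network and is a hypothetical callable; mapping the outputs back to pixel co-ordinates with 48 * y + 48 is an assumption that simply inverts the target scaling of Section 3.2.2, since the report does not spell this step out.

```python
import numpy as np


def predict_keypoint(model_predict, test_features, mean):
    """test_features: (n, 9216) scaled pixels; mean: stored feature-wise mean M."""
    # Zero-center with the mean computed on the train split, then reshape.
    x = (np.asarray(test_features, dtype=float) - mean).reshape(-1, 1, 96, 96)
    scaled = np.asarray(model_predict(x), dtype=float)  # (n, 2) values in [-1, 1]
    # Assumption: undo the target scaling to obtain pixel co-ordinates.
    return scaled * 48.0 + 48.0


# Stand-in model that always predicts the image center in scaled co-ordinates.
dummy_model = lambda x: np.zeros((len(x), 2))
predictions = predict_keypoint(dummy_model, np.zeros((3, 9216)), np.full(9216, 0.5))
print(predictions)
```

Running this once per key point and collecting the 15 outputs would give the full set of predicted co-ordinates for each test image.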
4 Results
Figure 2 Annotated Faces with 15 Facial Key Points, marked in blue for original and in red for predicted.
Figure 2 shows the annotated faces with 15 FKPs of 6 individuals. It shows that the
predicted locations are very close to the actual locations of FKPs.
Figure 3 shows the worst, medium and best predictions, where the RMSEtop = 6.87, RMSEmiddle = 1.04, and RMSEbottom =
0.37, the original FKPs are shown in blue and the predicted FKPs are shown in red.
In Figure 3, the high RMSEtop arises because the training dataset contains few images
with a side face pose. Most of the images in the training dataset are
of frontal faces with nearly closed lips, so NaimishNet delivers lower RMSEmiddle and
RMSEbottom.
Figure 4 Number of Epochs per NaimishNet
Figure 4 shows that the total number of epochs is 3507. The total training time of the 15 NaimishNet
models is approximately 20 hours.
Figure 5 Loss History of NaimishNet for Left Eye Center as Target
Figure 6 Loss History of NaimishNet for Right Eye Center as Target
Figure 5 and Figure 6 show that both train and validation RMSE decrease over time with
small wiggles along the curve, suggesting that the batch size was rightly chosen, and end
up being close enough, thus showing that the two NaimishNet models generalized well.
The best check-pointed models were achieved before 300 epochs, thus supporting the
decision to fix 300 epochs as the upper limit.
Figure 7 Loss History of NaimishNet for Left Eye Inner Corner as Target
Figure 8 Loss History of NaimishNet for Left Eye Outer Corner as Target
Figure 7 and Figure 8 show that both train and validation RMSE decrease over time with
small wiggles and end up being close enough. The best check-pointed models were
achieved before 300 epochs. Also, for Figure 8, the fact that the train RMSE does not become less
than the validation RMSE implies that the Patience Level (PL) for ESC could have been
increased to 50, so that the training continued for some more epochs.
Figure 9 Loss History of NaimishNet for Right Eye Inner Corner as Target
Figure 10 Loss History of NaimishNet for Right Eye Outer Corner as Target
Figure 9 and Figure 10 show that both train and validation RMSE decrease over time with
very small wiggles and end up being close enough. The best check-pointed models were
achieved close to 300 epochs. For Figure 10, the PL of ESC could have been increased to 50
so that the train RMSE becomes less than the validation RMSE.
Figure 11 Loss History of NaimishNet for Left Eyebrow Inner End as Target
Figure 12 Loss History of NaimishNet for Left Eyebrow Outer End as Target
Figure 11 and Figure 12 show that the train and validation RMSE decrease over time with
small wiggles and end up being close enough. The best check-pointed models were
achieved well before 300 epochs. No further improvement would have been seen by
increasing the PL of ESC.
Figure 13 Loss History of NaimishNet for Right Eyebrow Inner End as Target
Figure 14 Loss History of NaimishNet for Right Eyebrow Outer End as Target
Figure 13 and Figure 14 show that both train and validation RMSE decrease over time with
small wiggles and end up being close in the check-pointed model, which can be seen 30
epochs before the last epoch on x-axis.
Figure 15 Loss History of NaimishNet for Nose Tip as Target
Figure 16 Loss History of NaimishNet for Mouth Left Corner as Target
Figure 15 and Figure 16 show that both train and validation RMSE decrease over time and
end up being close enough in the final check-pointed model. For Figure 16, the PL of ESC
could have been increased to 50, though this would not necessarily have yielded any visible
improvement.
Figure 17 Loss History of NaimishNet for Mouth Right Corner as Target
Figure 18 Loss History of NaimishNet for Mouth Center Top Lip as Target
Figure 17 and Figure 18 show that the train and validation RMSE decrease over time, with
small wiggles, and that the upper limit of 300 epochs was just enough for the models to
show convergence.
Figure 19 Loss History of NaimishNet for Mouth Center Bottom Lip as Target
Figure 19 shows that the train and validation RMSE decrease over time and that the wiggly
nature of the validation curve increases towards the end. We could have tried increasing
the batch size to 256 if our hardware could have handled it. The curves are close together
and show convergence before 300 epochs.
Figure 20 shows that the NaimishNet generalizes well, since the average train / validation
RMSE is 1.03, where average loss is calculated by
\[ \text{Average Loss} = \sqrt{\frac{\sum_{i=1}^{15} \mathrm{RMSE}_i^2}{15}} \]
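The average loss can be recomputed from the per-model values plotted in Figure 20. The sketch below assumes that the first two data series of the figure are the train and validation RMSE respectively (their element-wise ratio matches the third series); the function name is illustrative.

```python
import math

# Per-model RMSE values read from Figure 20, in the order of the 15 FKPs.
train_rmse = [2.12, 1.99, 1.45, 1.81, 1.29, 1.57, 2.08, 2.40,
              1.80, 1.77, 2.86, 2.26, 1.78, 1.82, 2.71]
val_rmse = [2.23, 2.06, 1.42, 1.58, 1.31, 1.35, 1.80, 2.34,
            1.76, 2.05, 2.73, 1.92, 1.82, 1.75, 2.78]


def average_loss(rmses):
    """Root of the mean of the squared per-model RMSE values."""
    return math.sqrt(sum(r * r for r in rmses) / len(rmses))


avg_train = average_loss(train_rmse)
avg_val = average_loss(val_rmse)
print(avg_train, avg_val, avg_train / avg_val)
```

Under that assumption the averages come out at roughly 2.03 (train) and 1.98 (validation), and their ratio at roughly 1.03, in line with the values reported for Figure 20.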
Figure 20 Loss per final check-pointed NaimishNet
Figure 20 data:
Target Facial Key Point      Train RMSE   Validation RMSE   Train / Validation RMSE
Left Eye Center              2.12         2.23              0.95
Right Eye Center             1.99         2.06              0.96
Left Eye Inner Corner        1.45         1.42              1.02
Left Eye Outer Corner        1.81         1.58              1.15
Right Eye Inner Corner       1.29         1.31              0.98
Right Eye Outer Corner       1.57         1.35              1.17
Left Eyebrow Inner End       2.08         1.8               1.16
Left Eyebrow Outer End       2.4          2.34              1.02
Right Eyebrow Inner End      1.8          1.76              1.02
Right Eyebrow Outer End      1.77         2.05              0.86
Nose Tip                     2.86         2.73              1.05
Mouth Left Corner            2.26         1.92              1.18
Mouth Right Corner           1.78         1.82              0.98
Mouth Center Top Lip         1.82         1.75              1.04
Mouth Center Bottom Lip      2.71         2.78              0.97
Average                      2.03         1.98              1.03
Figure 21 Comparison of Kaggle Scores
Figure 21 shows that NaimishNet outperforms the approaches used by Longpre et al. [4],
Oneto et al. [8] and Wang et al. [2], since a lower score indicates a better model.
Figure 21 data:
Approach                               Kaggle Score
LeNet-5 w/o test time aug [4]          3.15
LeNet-5 [4]                            3.089
Ensemble (LeNet-5 + Dropout) [4]       3.347
Ensemble (LeNet-5 + LeNet-3) [4]       3.104
Ensemble (LeNet-5 + VGG) [4]           3.3
Oneto et al. [8]                       2.91784
Wang et al. [2]                        2.8843
NaimishNet                             2.52471
5 Conclusion
NaimishNet, a learning model for a single facial key point, is based on LeNet [5]. The
works of Longpre et al. [4], Oneto et al. [8], etc. have shown that LeNet-styled
architectures can be successfully applied to the Facial Key Points Detection problem, as
is also shown in Figure 21.
6 Future Scope
Given enough compute resources and time, NaimishNet can be further modified by
experimenting with different initialization schemes, different activation functions, different
numbers of filters and kernel sizes, different numbers of layers, switching from LeNet [5] to
VGGNet [6], introducing Inception modules as in GoogLeNet [18], or introducing Residual
Networks [19], etc. Different image pre-processing approaches like histogram stretching,
zero centering, etc. can be tried to check which approaches improve the model’s
accuracy. New approaches to Facial Key Points Detection can be based on the
application of deep learning as a three-stage process (face detection, face alignment,
and facial key points detection), with the use of state-of-the-art algorithms for each
stage.
7 Appendix
7.1 Software and Hardware Deployed
Name            Version
Python          2.7.11
Numpy           1.10.4
Matplotlib      1.5.1
Pandas          0.18.0
Theano          0.9.0
Keras           1.0.4
Scikit-learn    0.17.1
cPickle         1.71
Anaconda        4.0.0
Cuda            7.5
cuDNN           4
OpenCV          3.1.0
Windows         10
Table 4 Software deployed with version information
An HP Envy Notebook with an NVIDIA GeForce GTX 950M GPU and an Intel Core i7 5500 @ 2.4 GHz
was used to perform all the experiments.
8 References
[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep
Convolutional Neural Networks," in NIPS, 2012.
[2] Y. Wang and Y. Song, "Facial Keypoints Detection," Stanford University, 2014.
[3] Y. Sun, X. Wang and X. Tang, "Deep Convolutional Network Cascade for Facial Key
Point Detection," in Conference on Computer Vision and Pattern Recognition, 2013.
[4] S. Longpre and A. Sohmshetty, "Facial Keypoint Detection," Stanford University,
2016.
[5] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to
Document Recognition," in Proceedings of the IEEE, 1998.
[6] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale
Image Recognition," in International Conference on Learning Representations, 2015.
[7] D. Nouri, "Using convolutional neural nets to detect facial keypoints tutorial," 17
December 2014. [Online]. Available:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/.
[8] M. Oneto, L. Yang, Y. Jang and C. Dailey, "Facial Keypoints Detection: An Effort to
Top the Kaggle Leaderboard," Berkeley School of Information, 2015.
[9] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: Closing the Gap to
Human-Level Performance in Face Verification," in Computer Vision and Pattern
Recognition, 2014.
[10] P. Li, X. Liao, Y. Xu, H. Ling and C. Liao, "GPU Boosted Deep Learning in Real-time
Face Alignment," in GPU Technology Conference, 2016.
[11] D.-A. Clevert, T. Unterthiner and S. Hochreiter, "Fast and Accurate Deep Network
Learning by Exponential Linear Units (ELUs)," in International Conference on
Learning Representations, 2016.
[12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A
Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine
Learning Research, pp. 1929-1958, 2014.
[13] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward
neural networks," Journal of Machine Learning Research, pp. 249-256, 2010.
[14] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization," in
International Conference on Learning Representations, 2015.
[15] F.-F. Li, A. Karpathy and J. Johnson, "Neural Networks Part 2: Setting up the Data
and Loss," 2016. [Online]. Available: http://cs231n.github.io/neural-networks-2/.
[16] F.-F. Li, A. Karpathy and J. Johnson, "Neural Networks Part 3: Learning and
Evaluation," 2016. [Online]. Available: http://cs231n.github.io/neural-networks-3/.
[17] "Kaggle Competition - Facial Key Points Detection," 7 May 2013. [Online]. Available:
https://www.kaggle.com/c/facial-keypoints-detection.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V.
Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," in Computer
Vision and Pattern Recognition, 2014.
[19] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition,"
in Computer Vision and Pattern Recognition, 2015.