SlideShare a Scribd company logo
1 of 34
Download to read offline
FACIAL KEY POINTS
DETECTION USING
DEEP CONVOLUTIONAL
NEURAL NETWORK-
NAIMISHNET
By
NAIMISH AGARWAL
Summer Internship 2016 Project
Department of Analytical Information Systems and Business
Intelligence, University of Paderborn, Germany
External Guide
DR ARTUS KROHN-GRIMBERGHE
Internal Guide
DR RANJANA VYAS
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
2
Candidate’s Declaration
I hereby declare that the work presented in this project report entitled Facial Key Points
Detection using Deep Convolutional Neural Network - NaimishNet, submitted as Summer
Internship 2016 project report at Indian Institute of Information Technology, Allahabad, is
an authenticated record of my original work carried out from June 2016 to July 2016 under
the guidance of Dr Artus Krohn-Grimberghe and Dr Ranjana Vyas. Due
acknowledgements have been made in the text to all other material used. The project
was done in full compliance with the requirements and constraints of the prescribed
curriculum.
Signature
______________________________
Naimish Agarwal (IRM2013013)
Paderborn, Germany
14th July, 2016
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
3
Certificate
This is to certify that the above statement made by the candidate is correct to the best
of my knowledge.
Signatures
______________________________
Dr Artus Krohn-Grimberghe
Paderborn, Germany
14th July, 2016
_____________________________
Dr Ranjana Vyas,
Allahabad, India
1st August, 2016
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
4
Table of Contents
Candidate’s Declaration..............................................................................................................2
Certificate.......................................................................................................................................3
List of Figures...................................................................................................................................5
List of Tables....................................................................................................................................6
About Me........................................................................................................................................7
External Guide................................................................................................................................7
Internal Guide ................................................................................................................................7
Acknowledgement .......................................................................................................................8
1 Introduction.................................................................................................................................9
2 NaimishNet Architecture .........................................................................................................10
2.1 NaimishNet Architecture Design Decisions.....................................................................11
3 Experiments ...............................................................................................................................13
3.1 Dataset Description ...........................................................................................................13
3.2 Proposed Approach..........................................................................................................13
3.2.1 Data Augmentation....................................................................................................13
3.2.2 Data Pre-processing ...................................................................................................14
3.2.3 Pre-training Analysis ....................................................................................................14
3.2.4 Training..........................................................................................................................15
3.2.5 Evaluation.....................................................................................................................16
4 Results.........................................................................................................................................17
5 Conclusion.................................................................................................................................30
6 Future Scope .............................................................................................................................31
7 Appendix ...................................................................................................................................32
7.1 Software and Hardware Deployed.................................................................................32
8 References.................................................................................................................................33
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
5
List of Figures
Figure 1 Number of non-missing target rows in the augmented and pre-processed train
dataset..........................................................................................................................................15
Figure 2 Annotated Faces with 15 Facial Key Points, marked in blue for original and in red
for predicted. ...............................................................................................................................17
Figure 3 shows the worst, medium and best predictions, where the RMSEtop = 6.87,
RMSEmiddle = 1.04, and RMSEbottom = 0.37, the original FKPs are shown in blue and the
predicted FKPs are shown in red. ..............................................................................................18
Figure 4 Number of Epochs per NaimishNet ............................................................................19
Figure 5 Loss History of NaimishNet for Left Eye Center as Target .........................................20
Figure 6 Loss History of NaimishNet for Right Eye Center as Target.......................................20
Figure 7 Loss History of NaimishNet for Left Eye Inner Corner as Target................................21
Figure 8 Loss History of NaimishNet for Left Eye Outer Corner as Target ..............................21
Figure 9 Loss History of NaimishNet for Right Eye Inner Corner as Target .............................22
Figure 10 Loss History of NaimishNet for Right Eye Outer Corner as Target..........................22
Figure 11 Loss History of NaimishNet for Left Eyebrow Inner End as Target ..........................23
Figure 12 Loss History of NaimishNet for Left Eyebrow Outer End as Target.........................23
Figure 13 Loss History of NaimishNet for Right Eyebrow Inner End as Target........................24
Figure 14 Loss History of NaimishNet for Right Eyebrow Outer End as Target ......................24
Figure 15 Loss History of NaimishNet for Nose Tip as Target....................................................25
Figure 16 Loss History of NaimishNet for Mouth Left Corner as Target ..................................25
Figure 17 Loss History of NaimishNet for Mouth Right Corner as Target................................26
Figure 18 Loss History of NaimishNet for Mouth Center Top Lip as Target ............................26
Figure 19 Loss History of NaimishNet for Mouth Center Bottom Lip as Target......................27
Figure 20 Loss per final check-pointed NaimishNet ................................................................28
Figure 21 Comparison of Kaggle Scores...................................................................................29
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
6
List of Tables
Table 1 NaimishNet layer-wise architecture.............................................................................10
Table 2 Filter details of Convolution2d layers...........................................................................11
Table 3 Target Facial Key Points which need to be swapped while horizontally flipping
the data........................................................................................................................................14
Table 4 Software deployed with version information .............................................................32
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
7
About Me
Naimish Agarwal is a student pursuing B. Tech in Information
Technology with M. Tech in Robotics, a 5 years’ dual degree program,
in the batch of 2013-2018, at Indian Institute of Information
Technology, Allahabad, India.
External Guide
Dr Artus Krohn-Grimberghe is a Junior Professor and Head of the
Department of Analytical Information Systems and Business
Intelligence, at University of Paderborn, Germany.
Internal Guide
Dr Ranjana Vyas is an Assistant Professor at Indian Institute of
Information Technology, Allahabad, India.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
8
Acknowledgement
I would like to express my special gratitude to Prof Dr O.P. Vyas, Dean Research &
Development, Indian Institute of Information Technology, Allahabad for providing me the
opportunity to join University of Paderborn, Germany for a two months’ internship in the
summer of 2016.
I would like to express my gratitude to Dr Rahul Kala, Assistant Professor in Robotics and
Artificial Intelligence Laboratory, Indian Institute of Information Technology, Allahabad for
motivating me when in need.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
9
1 Introduction
Facial Key Points (FKPs) detection is an important and challenging problem in the field of
computer vision, which involves detecting FKPs like centers and corners of eyes, nose tip,
etc. The problem is to predict the (x, y) real-valued co-ordinates in the space of image
pixels of the FKPs for a given face image. It finds its application in tracking faces in images
and videos, analysis of facial expressions, detection of dysmorphic facial signs for medical
diagnosis, face recognition, etc.
Facial features vary greatly from one individual to another, and even for a single individual
there is a large amount of variation due to pose, size, position, etc. The problem becomes
even more challenging when the face images are taken under different illumination
conditions, viewing angles, etc.
In the past few years, advancements in FKPs detection are made by the application of
deep convolutional neural network (DCNN), which is a special type of feed-forward
neural network with shared weights and local connectivity. DCNNs have helped build
state-of-the-art models for image recognition, recommender systems, natural language
processing, etc. Krizhevsky et al. [1] applied DCNN in ImageNet image classification
challenge and outperformed the previous state-of-the-art model for image classification.
Wang et al. [2] addressed FKPs detection by first applying histogram stretching for image
contrast enhancement, followed by principal component analysis for noise reduction and
mean patch search algorithm with correlation scoring and mutual information scoring for
predicting left and right eye centers. Sun et al. [3] estimated FKPs by using a three level
convolutional neural network, where at each level, outputs of multiple networks were
fused for robust and accurate estimation. Longpre et al. [4] predicted FKPs by first applying
data augmentation to expand the number of training examples, followed by testing
different architectures of convolutional neural networks like LeNet [5] and VGGNet [6],
and finally used a weighted ensemble of models. Nouri et al. [7] used six specialist DCNNs
trained over pre-trained weights. Oneto et al. [8] applied a variety of data pre-processing
techniques like histogram stretching, Gaussian blurring, followed by image flipping, key
point grouping, and then finally applied LeNet.
Taigman et al. [9] provided a new deep network architecture, DeepFace, for state-of-
the-art face recognition. Li et al. [10] provided a new DCNN architecture for state-of-the
art face alignment.
We present a DCNN architecture – NaimishNet, based on LeNet, which addresses the
problem of facial key points detection by providing a learning model for a single facial
key point.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
10
2 NaimishNet Architecture
NaimishNet consists of 4 convolution2d layers, 4 maxpooling2d layers and 3 dense layers,
with sandwiched dropout and activation layers, as shown in Table 1.
Layer Number Layer Name Layer Shape
1 Input1 (1, 96, 96)
2 Convolution2d1 (32, 93, 93)
3 Activation1 (32, 93, 93)
4 Maxpooling2d1 (32, 46, 46)
5 Dropout1 (32, 46, 46)
6 Convolution2d2 (64, 44, 44)
7 Activation2 (64, 44, 44)
8 Maxpooling2d2 (64, 22, 22)
9 Dropout2 (64, 22, 22)
10 Convolution2d3 (128, 21, 21)
11 Activation3 (128, 21, 21)
12 Maxpooling2d3 (128, 10, 10)
13 Dropout3 (128, 10, 10)
14 Convolution2d4 (256, 10, 10)
15 Activation4 (256, 10, 10)
16 Maxpooling2d4 (256, 5, 5)
17 Dropout4 (256, 5, 5)
18 Flatten1 (6400)
19 Dense1 (1000)
20 Activation5 (1000)
21 Dropout5 (1000)
22 Dense2 (1000)
23 Activation6 (1000)
24 Dropout6 (1000)
25 Dense3 (2)
Table 1 NaimishNet layer-wise architecture
Specifics of NaimishNet:
o Input1 is the input layer.
o Activation1 to Activation5 use Exponential Linear Units (ELUs) [11] as activation
functions, whereas Activation6 uses Linear Activation Function.
o Dropout [12] probability is increased from 0.1 to 0.6 from Dropout1 to Dropout6, with
a step size of 0.1.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
11
o Maxpooling2d1 to Maxpooling2d4 use a pool shape of (2, 2), with non-overlapping
strides and no zero padding.
o Flatten1 flattens 3d input to 1d output.
o Convolution2d1 to Convolution2d4 do not use zero padding, have their weights
initialized with random numbers drawn from uniform distribution, and the specifics
of number of filters and filter shape are shown in Table 2.
o Dense1 to Dense3 are regular fully connected layers with weights initialized using
Glorot uniform initialization [13].
o Adam [14] optimizer, with learning rate of 0.001, beta1 of 0.9, beta2 of 0.999 and
epsilon of 1e-08, is used for minimizing Mean Squared Error (MSE).
o There are 7,488,962 model parameters
Layer Name No. of Filters Filter Shape
Convolution2d1 32 (4, 4)
Convolution2d2 64 (3, 3)
Convolution2d3 128 (2, 2)
Convolution2d4 256 (1, 1)
Table 2 Filter details of Convolution2d layers
2.1 NaimishNet Architecture Design Decisions
We have tried various different deep network architectures before finally settling down
with NaimishNet architecture. Most of the design decisions were governed primarily by the
hardware restrictions and the time taken per epoch for different runs of the network.
Although we wanted one architecture to fit well to all the different FKPs as target, we have
mainly taken decisions based on the performance of the model on left eye center FKP as
target. We have run the different architectures for at most 40 epochs and forecasted its
performance and generalizability based on our experience we gathered while training
multiple deep network architectures. We do not provide the exact experimental numbers
about the RMSE scores since this data was not recorded while experimenting with different
architectures. We also do not provide the specifics of every architecture we tried since
multiple changes were made without taking any record before arriving at NaimishNet
architecture.
We experimented with 5 convolution2d and 5 maxpooling2d layers and 4 dense layers
with rectified linear units as activation functions, but that over fitted. We experimented
with 3 convolution2d and 3 maxpooling2d layers and 2 dense layers with rectified linear
units as activation functions, but that in turn required pre-training the weights to achieve
good performance. We wanted to avoid pre-training to reduce the training time. We did
not experiment with zero padding and strides.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
12
We sandwiched batch normalization layers [15] after every maxpooling2d layer but that
increased the per epoch training time by fivefold. We used dropout layers with dropout
probability of 0.5 for every dropout layer but that showed a poor fit. We removed the
dropout layers but that resulted in overfitting. We tried different initialization schemes like
He normal initialization [15], but that did not show nice performance in the first 20 epochs,
in fact uniform initialization performed better than that for us. We experimented with
different patience levels (PLs) for early stopping callback function and concluded that PL
of 10% of total number of epochs, in our case 300, would do well in most of the cases. For
the aforementioned cases, we used RMSProp optimizer [16].
The main aim was to reduce the training time along with achieving maximum
performance and generalizability. To achieve that we experimented with exponential
linear units as activation function, Adam optimizer, and linearly increasing dropout
probabilities along the depth of the network. The specifics of the filter were chosen based
on our past experience with exponentially increasing number of filters and linearly
decreasing filter size. The batch size was kept 128 because vector operations are
optimized for sizes which are in powers of 2. Also, given over 7 million model parameters,
and limited GPU memory, batch size of 128 was an ideal size for us. The number of main
layers i.e. 4 convolutional layers (convolution2d and maxpooling2d) and 3 dense layers
was ideal since exponential linear units perform well in cases where the number of layers
are greater than 5. The parameters of Adam optimizer were used as is as recommended
by its authors and we did not experiment with them.
NaimishNet, is inspired by LeNet [5], since LeNet’s architecture follows the pattern input ->
conv2d -> maxpool2d -> conv2d -> maxpool2d -> conv2d -> maxpool2d -> dense ->
output. Works of Longpre et al. [4], Oneto et al. [8], etc. have shown that LeNet styled
architectures have been successfully applied for Facial Key Points Detection problem.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
13
3 Experiments
3.1 Dataset Description
We have used the dataset from Kaggle Competition – Facial Key Points Detection [17].
The dataset was chosen to benchmark our solution against the existing approaches which
address FKPs detection problem.
There are 15 FKPs per face image like left eye center, right eye center, left eye inner corner,
left eye outer corner, right eye inner corner, right eye outer corner, left eyebrow inner end,
left eyebrow outer end, right eyebrow inner end, right eyebrow outer end, nose tip, mouth
left corner, mouth right corner, mouth center top lip and mouth center bottom lip. Here,
left and right are from the point of view of the subject.
The greyscale input images, with pixel values in range of [0, 255], have size of 96 x 96 pixels.
The train dataset consists of 7049 images, with 30 targets, i.e. (x, y) for each of 15 FKPs. It
has missing target values for many FKPs for many face images. The test dataset consists of
1783 images with no target information. The Kaggle submission file consists of 27124 FKPs
co-ordinates, which are to be predicted.
The Kaggle submissions are scored based on the Root Mean Squared Error (RMSE), which
is given by
𝑅𝑀𝑆𝐸 = √
∑ (𝑦𝑖 − 𝑦̂𝑖)2𝑛
𝑖=1
𝑛
where 𝑦𝑖 is the original value, 𝑦̂𝑖 is the predicted value and n is the number of targets to
be predicted.
3.2 Proposed Approach
3.2.1 Data Augmentation
Data Augmentation helps boost the performance of a deep learning model when the
training data is limited by generating more training data. We have horizontally flipped [7]
the images for which target information for all the 15 FKPs are available, and also swapped
the target values according to Table 3. Then, we vertically stacked the new horizontally
flipped data under the original train data to create the augmented train dataset.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
14
Target Facial Key Point Target Facial Key Point
Left Eye Center X Right Eye Center X
Left Eye Center Y Right Eye Center Y
Left Eye Inner Corner X Right Eye Inner Corner X
Left Eye Inner Corner Y Right Eye Inner Corner Y
Left Eye Outer Corner X Right Eye Outer Corner X
Left Eye Outer Corner Y Right Eye Outer Corner Y
Left Eyebrow Inner Corner X Right Eyebrow Inner Corner X
Left Eyebrow Inner Corner Y Right Eyebrow Inner Corner Y
Left Eyebrow Outer Corner X Right Eyebrow Outer Corner X
Left Eyebrow Outer Corner Y Right Eyebrow Outer Corner Y
Mouth Left Corner X Mouth Right Corner X
Mouth Left Corner Y Mouth Right Corner Y
Table 3 Target Facial Key Points which need to be swapped while horizontally flipping the data
3.2.2 Data Pre-processing
The image pixels are normalized to the range [0, 1] by dividing by 255.0, and the train
targets are zero-centered to the range [-1, 1] by first dividing by 48.0, since the images are
96 x 96, and then subtracting 48.0.
3.2.3 Pre-training Analysis
Figure 1 shows that there are different number of non-missing target rows for different FKPs,
so, we have created 15 NaimishNet models.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
15
Figure 1 Number of non-missing target rows in the augmented and pre-processed train dataset
3.2.4 Training
For each of the 15 NaimishNet models, filter (keep) the rows with non-missing target values
and split the filtered train data into 80% train dataset (T) and 20% validation dataset (V).
Compute feature-wise mean M on T, and zero-center T and V by subtracting M, and also
store M. Reshape features of T and V to (1, 96, 96), and start the training. Finally, validate
on V, and store the model’s loss history.
9179
9176
4411
4407
4408
4408
4410
4365
4410
4376
9189
4409
4410
4415
9156
0 2000 4000 6000 8000 10000
LEFT EYE CENTER
RIGHT EYE CENTER
LEFT EYE INNER CORNER
LEFT EYE OUTER CORNER
RIGHT EYE INNER CORNER
RIGHT EYE OUTER CORNER
LEFT EYEBROW INNER END
LEFT EYEBROW OUTER END
RIGHT EYEBROW INNER END
RIGHT EYEBROW OUTER END
NOSE TIP
MOUTH LEFT CORNER
MOUTH RIGHT CORNER
MOUTH CENTER TOP LIP
MOUTH CENTER BOTTOM LIP
TargetFacialKeyPoint
Number of non-missing target rows
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
16
Each model is trained with a batch size of 128 and maximum number of epochs are 300.
Two callback functions are used which execute at the end of every epoch – Early
Stopping Callback (ESC) and Model Checkpoint Callback (MCC).
ESC stops the training when the number of contiguous epochs without improvement in
validation loss are 30 (Patience Level). MCC stores the weights of the model with the
lowest validation loss.
3.2.5 Evaluation
For each of the 15 NaimishNet models, reload M, zero center test features by subtracting
M and reshape the samples to (1, 96, 96). Reload the best check-pointed model weights,
predict on test features, and store the predictions.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
17
4 Results
Figure 2 Annotated Faces with 15 Facial Key Points, marked in blue for original and in red for predicted.
Figure 2 shows the annotated faces with 15 FKPs of 6 individuals. It shows that the
predicted locations are very close to the actual locations of FKPs.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
18
Figure 3 shows the worst, medium and best predictions, where the RMSEtop = 6.87, RMSEmiddle = 1.04, and RMSEbottom =
0.37, the original FKPs are shown in blue and the predicted FKPs are shown in red.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
19
In Figure 3, high RMSEtop is because of the presence of less number of images, in the
training dataset, with the side face pose. Most of the images in the training dataset are
that of the frontal face with nearly closed lips. So, NaimishNet delivers lower RMSEmiddle and
RMSEbottom.
Figure 4 Number of Epochs per NaimishNet
Figure 4 shows that total number of epochs are 3507. Total training time of 15 NaimishNet
models is approximately 20 hours.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
20
Figure 5 Loss History of NaimishNet for Left Eye Center as Target
Figure 6 Loss History of NaimishNet for Right Eye Center as Target
Figure 5 and Figure 6 show that both train and validation RMSE decrease over time with
small wiggles along the curve showcasing that the batch size was rightly chosen and end
up being close enough, thus showing that the two NaimishNet models generalized well.
The best check-pointed models were achieved before 300 epochs, thus certifying the
decision of fixing 300 epochs as the upper limit.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
21
Figure 7 Loss History of NaimishNet for Left Eye Inner Corner as Target
Figure 8 Loss History of NaimishNet for Left Eye Outer Corner as Target
Figure 7 and Figure 8 show that both train and validation RMSE decrease over time with
small wiggles and end up being close enough. The best check-pointed models were
achieved before 300 epochs. Also, for Figure 8, since the train RMSE does not become less
than validation RMSE implies that the Patience Level (PL) for ESC could have been
increased to 50. Thus, the training could have been continued for some more epochs.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
22
Figure 9 Loss History of NaimishNet for Right Eye Inner Corner as Target
Figure 10 Loss History of NaimishNet for Right Eye Outer Corner as Target
Figure 9 and Figure 10 show that both train and validation RMSE decrease over time with
very small wiggles and end up being close enough. The best check-pointed models were
achieved close to 300 epochs. For Figure 10, PL of ESC could have been increased to 50
so that the train RMSE becomes less than validation RMSE.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
23
Figure 11 Loss History of NaimishNet for Left Eyebrow Inner End as Target
Figure 12 Loss History of NaimishNet for Left Eyebrow Outer End as Target
Figure 11 and Figure 12, show that the train and validation RMSE decreases over time with
small wiggles, and end up being close enough. The best check-pointed models were
achieved far before 300 epochs. No further improvement could have been seen by
increasing PL of ESC.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
24
Figure 13 Loss History of NaimishNet for Right Eyebrow Inner End as Target
Figure 14 Loss History of NaimishNet for Right Eyebrow Outer End as Target
Figure 13 and Figure 14 show that both train and validation RMSE decrease over time with
small wiggles and end up being close in the check-pointed model, which can be seen 30
epochs before the last epoch on x-axis.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
25
Figure 15 Loss History of NaimishNet for Nose Tip as Target
Figure 16 Loss History of NaimishNet for Mouth Left Corner as Target
Figure 15 and Figure 16 show that both train and validation RMSE decrease over time and
they end up being close enough in the final check-pointed model. For Figure 16, PL of ESC
could have been increased to 50 but not necessarily there would have been any visible
improvement.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
26
Figure 17 Loss History of NaimishNet for Mouth Right Corner as Target
Figure 18 Loss History of NaimishNet for Mouth Center Top Lip as Target
Figure 17 and Figure 18 show that the train and validation RMSE decrease over time, with
small wiggles, and the upper limit of 300 epochs was just right enough for the models to
show convergence.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
27
Figure 19 Loss History of NaimishNet for Mouth Center Bottom Lip as Target
Figure 19 shows that the train and validation RMSE decreases over time and the wiggly
nature of the validation curve increases towards the end. We could have tried to increase
the batch size to 256 if it could be handled by our hardware. The curves are close together
and show convergence before 300 epochs.
Figure 20 shows that the NaimishNet generalizes well, since the average train / validation
RMSE is 1.03, where average loss is calculated by
𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝐿𝑜𝑠𝑠 = √
∑ 𝑅𝑀𝑆𝐸𝑖
215
𝑖̇=1
15
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
28
Figure 20 Loss per final check-pointed NaimishNet
2.12
1.99
1.45
1.81
1.29
1.57
2.08
2.4
1.8
1.77
2.86
2.26
1.78
1.82
2.71
2.03
2.23
2.06
1.42
1.58
1.31
1.35
1.8
2.34
1.76
2.05
2.73
1.92
1.82
1.75
2.78
1.98
0.95
0.96
1.02
1.15
0.98
1.17
1.16
1.02
1.02
0.86
1.05
1.18
0.98
1.04
0.97
1.03
Left Eye Center
Right Eye Center
Left Eye Inner Corner
Left Eye Outer Corner
Right Eye Inner Corner
Right Eye Outer Corner
Left Eyebrow Inner End
Left Eyebrow Outer End
Right Eyebrow Inner End
Right Eyebrow Outer End
Nose Tip
Mouth Left Corner
Mouth Right Corner
Mouth Center Top Lip
Mouth Center Bottom Lip
Average
TargetFacialKeyPoint Loss per final check-pointed NaimishNet
Train / Validation RMSE Validation RMSE Train RMSE
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
29
Figure 21 Comparison of Kaggle Scores
Figure 21 shows that NaimishNet outperforms the approaches used by Longpre et al. [4],
Oneto et al. [8], Wang et al. [2], since lower the score, better the model.
3.15
3.089
3.347
3.104
3.3
2.91784
2.8843
2.52471
0 0.5 1 1.5 2 2.5 3 3.5 4
LENET-5 W/O TEST TIME AUG [4]
LENET-5 [4]
ENSEMBLE (LENET-5 + DROPOUT) [4]
ENSEMBLE (LENET-5 + LENET-3) [4]
ENSEMBLE (LENET-5 + VGG) [4]
ONETO ET AL [8]
WANG ET AL [2]
NAIMISHNET
Kaggle Score
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
30
5 Conclusion
NaimishNet – a learning model for a single facial key point, is based on LeNet [5]. As per
Figure 21, LeNet based architectures have proven to be successful for Facial Key Points
Detection. Works of Longpre et al. [4], Oneto et al. [8], etc. have shown that LeNet styled
architectures have been successfully applied for Facial Key Points Detection problem, as
is also shown in Figure 21.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
31
6 Future Scope
Given enough compute resources and time, NaimishNet can be further modified by
experimenting with different initialization schemes, different activation functions, different
number of filters and kernel size, different number of layers, switching from LeNet [5] to
VGGNet [6], introducing Inception modules as in GoogleNet [18], or introducing Residual
Networks [19], etc. Different image pre-processing approaches like histogram stretching,
zero centering, etc. can be tried to check which approaches improve the model’s
accuracy. New approaches to Facial Key Points Detection can be based on the
application of deep learning as a three stage process - face detection, face alignment,
and facial key points detection, with the use of the state-of-the-art algorithms for each
stage.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
32
7 Appendix
7.1 Software and Hardware Deployed
Name Version
Python 2.7.11
Numpy 1.10.4
Matplotlib 1.5.1
Pandas 0.18.0
Theano 0.9.0
Keras 1.0.4
Scikit-learn 0.17.1
cPickle 1.71
Anaconda 4.0.0
Cuda 7.5
cuDNN 4
OpenCV 3.1.0
Windows 10
Table 4 Software deployed with version information
HP Envy Notebook, with NVIDIA GeForce GTX 950M GPU and Intel core i7 5500 @ 2.4 GHz,
was used to perform all the experiments.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
33
8 References
[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep
Convolutional Neural Networks," in NIPS, 2012.
[2] Y. Wang and Y. Song, "Facial Keypoints Detection," Stanford University, 2014.
[3] Y. Sun, X. Wang and X. Tang, "Deep Convolutional Network Cascade for Facial Key
Point Detection," in Conference on Computer Vision and Pattern Recognition, 2013.
[4] S. Longpre and A. Sohmshetty, "Facial Keypoint Detection," Stanford University,
2016.
[5] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to
Document Recognition," in Proceedings of the IEEE, 1998.
[6] K. Simoyan and A. Zisserman, "Very Deep Convolutional Networks for Large-scale
Image Recognition," in International Conference on Learning Representation, 2015.
[7] D. Nouri, "Using convolutional neural nets to detect facial keypoints tutorial," 17 12
2014. [Online]. Available: http://danielnouri.org/notes/2014/12/17/using-
convolutional-neural-nets-to-detect-facial-keypoints-tutorial/.
[8] M. Oneto, L. Yang, Y. Jang and C. Dailey, "Facial Keypoints Detection: An Effort to
Top the Kaggle Leaderboard," Berkeley School of Information, 2015.
[9] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: Closing the Gap to
Human-Level Performance in Face Verification," in Computer Vision and Pattern
Recognition, 2014.
[10] P. Li, X. Liao, Y. Xu, H. Ling and C. Liao, "GPU Boosted Deep Learning in Real-time
Face Alignment," in GPU Technology Conference, 2016.
[11] D.-A. Clevert, T. Unterthiner and S. Hochreiter, "Fast and Accurate Deep Network
Learning by Exponential Linear Units (ELUs)," in International Conference on
Learning Representations, 2016.
[12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A
Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine
Learning Research, pp. 1929-1958, 2014.
[13] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward
neural networks," Journal of Machine Learning Research, pp. 249-256, 2010.
NAIMISH AGARWAL
FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET
34
[14] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization," in
International Conference on Learning Representation, 2015.
[15] F.-F. Li, A. Karpathy and J. Johnson, "Neural Networks Part 2: Setting up the Data
and Loss," 2016. [Online]. Available: http://cs231n.github.io/neural-networks-2/.
[16] F.-F. Li, A. Karpathy and J. Johnson, "Neural Networks Part 3: Learning and
Evaluation," 2016. [Online]. Available: http://cs231n.github.io/neural-networks-3/.
[17] "Kaggle Competition - Facial Key Points Detection," 7 May 2013. [Online]. Available:
https://www.kaggle.com/c/facial-keypoints-detection.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V.
Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," in Computer
Vision and Pattern Recognition, 2014.
[19] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition,"
in Computer Vision and Pattern Recognition, 2015.

More Related Content

Similar to Facial Key Point Detection Using Deep Learning

Evaluation of conditional images synthesis: generating a photorealistic image...
Evaluation of conditional images synthesis: generating a photorealistic image...Evaluation of conditional images synthesis: generating a photorealistic image...
Evaluation of conditional images synthesis: generating a photorealistic image...SamanthaGallone
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Cooper Wakefield
 
A Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsA Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsChetan Hireholi
 
Usability of Web Based Financial Services
Usability of Web Based Financial ServicesUsability of Web Based Financial Services
Usability of Web Based Financial ServicesAustin Dimmer
 
IUCRC_EconImpactFeasibilityReport_FinalFinal
IUCRC_EconImpactFeasibilityReport_FinalFinalIUCRC_EconImpactFeasibilityReport_FinalFinal
IUCRC_EconImpactFeasibilityReport_FinalFinalJay Lee
 
Man 00851 rev 001 understanding image checker 9.0
Man 00851 rev 001 understanding image checker 9.0Man 00851 rev 001 understanding image checker 9.0
Man 00851 rev 001 understanding image checker 9.0alex123123123
 
Senior Project: Methanol Injection Progressive Controller
Senior Project: Methanol Injection Progressive Controller Senior Project: Methanol Injection Progressive Controller
Senior Project: Methanol Injection Progressive Controller QuyenVu47
 
Machine to Machine White Paper
Machine to Machine White PaperMachine to Machine White Paper
Machine to Machine White PaperJosep Pocalles
 
Data analysis with R.pdf
Data analysis with R.pdfData analysis with R.pdf
Data analysis with R.pdfPepeMara
 
Smart Oilfield Data Mining Final Project-Rod Pump Failure Prediction
Smart Oilfield Data Mining Final Project-Rod Pump Failure PredictionSmart Oilfield Data Mining Final Project-Rod Pump Failure Prediction
Smart Oilfield Data Mining Final Project-Rod Pump Failure PredictionJeffrey Daniels
 
Real time emotion_detection_from_videos
Real time emotion_detection_from_videosReal time emotion_detection_from_videos
Real time emotion_detection_from_videosCafer Yıldız
 

Similar to Facial Key Point Detection Using Deep Learning (20)

Evaluation of conditional images synthesis: generating a photorealistic image...
Evaluation of conditional images synthesis: generating a photorealistic image...Evaluation of conditional images synthesis: generating a photorealistic image...
Evaluation of conditional images synthesis: generating a photorealistic image...
 
Wisr2011 en
Wisr2011 enWisr2011 en
Wisr2011 en
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...
 
A Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software DefectsA Machine Learning approach to predict Software Defects
A Machine Learning approach to predict Software Defects
 
Usability of Web Based Financial Services
Usability of Web Based Financial ServicesUsability of Web Based Financial Services
Usability of Web Based Financial Services
 
IUCRC_EconImpactFeasibilityReport_FinalFinal
IUCRC_EconImpactFeasibilityReport_FinalFinalIUCRC_EconImpactFeasibilityReport_FinalFinal
IUCRC_EconImpactFeasibilityReport_FinalFinal
 
Moving cr-to-dr
Moving cr-to-drMoving cr-to-dr
Moving cr-to-dr
 
Man 00851 rev 001 understanding image checker 9.0
Man 00851 rev 001 understanding image checker 9.0Man 00851 rev 001 understanding image checker 9.0
Man 00851 rev 001 understanding image checker 9.0
 
report
reportreport
report
 
Thesis
ThesisThesis
Thesis
 
Rand rr2637
Rand rr2637Rand rr2637
Rand rr2637
 
Case sas 2
Case sas 2Case sas 2
Case sas 2
 
2D ROBOTIC PLOTTER
2D ROBOTIC PLOTTER2D ROBOTIC PLOTTER
2D ROBOTIC PLOTTER
 
Senior Project: Methanol Injection Progressive Controller
Senior Project: Methanol Injection Progressive Controller Senior Project: Methanol Injection Progressive Controller
Senior Project: Methanol Injection Progressive Controller
 
thesis
thesisthesis
thesis
 
thesis
thesisthesis
thesis
 
Machine to Machine White Paper
Machine to Machine White PaperMachine to Machine White Paper
Machine to Machine White Paper
 
Data analysis with R.pdf
Data analysis with R.pdfData analysis with R.pdf
Data analysis with R.pdf
 
Smart Oilfield Data Mining Final Project-Rod Pump Failure Prediction
Smart Oilfield Data Mining Final Project-Rod Pump Failure PredictionSmart Oilfield Data Mining Final Project-Rod Pump Failure Prediction
Smart Oilfield Data Mining Final Project-Rod Pump Failure Prediction
 
Real time emotion_detection_from_videos
Real time emotion_detection_from_videosReal time emotion_detection_from_videos
Real time emotion_detection_from_videos
 

Facial Key Point Detection Using Deep Learning

  • 1. FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK- NAIMISHNET By NAIMISH AGARWAL Summer Internship 2016 Project Department of Analytical Information Systems and Business Intelligence, University of Paderborn, Germany External Guide DR ARTUS KROHN-GRIMBERGHE Internal Guide DR RANJANA VYAS
  • 2. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 2 Candidate’s Declaration I hereby declare that the work presented in this project report entitled Facial Key Points Detection using Deep Convolutional Neural Network - NaimishNet, submitted as Summer Internship 2016 project report at Indian Institute of Information Technology, Allahabad, is an authenticated record of my original work carried out from June 2016 to July 2016 under the guidance of Dr Artus Krohn-Grimberghe and Dr Ranjana Vyas. Due acknowledgements have been made in the text to all other material used. The project was done in full compliance with the requirements and constraints of the prescribed curriculum. Signature ______________________________ Naimish Agarwal (IRM2013013) Paderborn, Germany 14th July, 2016
  • 3. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 3 Certificate This is to certify that the above statement made by the candidate is correct to the best of my knowledge. Signatures ______________________________ Dr Artus Krohn-Grimberghe Paderborn, Germany 14th July, 2016 _____________________________ Dr Ranjana Vyas, Allahabad, India 1st August, 2016
  • 4. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 4 Table of Contents Candidate’s Declaration..............................................................................................................2 Certificate.......................................................................................................................................3 List of Figures...................................................................................................................................5 List of Tables....................................................................................................................................6 About Me........................................................................................................................................7 External Guide................................................................................................................................7 Internal Guide ................................................................................................................................7 Acknowledgement .......................................................................................................................8 1 Introduction.................................................................................................................................9 2 NaimishNet Architecture .........................................................................................................10 2.1 NaimishNet Architecture Design Decisions.....................................................................11 3 Experiments ...............................................................................................................................13 3.1 Dataset Description ...........................................................................................................13 3.2 Proposed Approach..........................................................................................................13 3.2.1 Data Augmentation....................................................................................................13 3.2.2 Data Pre-processing ...................................................................................................14 3.2.3 Pre-training Analysis ....................................................................................................14 3.2.4 Training..........................................................................................................................15 3.2.5 Evaluation.....................................................................................................................16 4 Results.........................................................................................................................................17 5 Conclusion.................................................................................................................................30 6 Future Scope .............................................................................................................................31 7 Appendix ...................................................................................................................................32 7.1 Software and Hardware Deployed.................................................................................32 8 References.................................................................................................................................33
  • 5. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 5 List of Figures Figure 1 Number of non-missing target rows in the augmented and pre-processed train dataset..........................................................................................................................................15 Figure 2 Annotated Faces with 15 Facial Key Points, marked in blue for original and in red for predicted. ...............................................................................................................................17 Figure 3 shows the worst, medium and best predictions, where the RMSEtop = 6.87, RMSEmiddle = 1.04, and RMSEbottom = 0.37, the original FKPs are shown in blue and the predicted FKPs are shown in red. ..............................................................................................18 Figure 4 Number of Epochs per NaimishNet ............................................................................19 Figure 5 Loss History of NaimishNet for Left Eye Center as Target .........................................20 Figure 6 Loss History of NaimishNet for Right Eye Center as Target.......................................20 Figure 7 Loss History of NaimishNet for Left Eye Inner Corner as Target................................21 Figure 8 Loss History of NaimishNet for Left Eye Outer Corner as Target ..............................21 Figure 9 Loss History of NaimishNet for Right Eye Inner Corner as Target .............................22 Figure 10 Loss History of NaimishNet for Right Eye Outer Corner as Target..........................22 Figure 11 Loss History of NaimishNet for Left Eyebrow Inner End as Target ..........................23 Figure 12 Loss History of NaimishNet for Left Eyebrow Outer End as Target.........................23 Figure 13 Loss History of NaimishNet for Right Eyebrow Inner End as Target........................24 Figure 14 Loss History of NaimishNet for Right Eyebrow Outer End as Target ......................24 Figure 15 Loss History of NaimishNet for Nose Tip as Target....................................................25 Figure 16 Loss History of NaimishNet for Mouth Left Corner as Target ..................................25 Figure 17 Loss History of NaimishNet for Mouth Right Corner as Target................................26 Figure 18 Loss History of NaimishNet for Mouth Center Top Lip as Target ............................26 Figure 19 Loss History of NaimishNet for Mouth Center Bottom Lip as Target......................27 Figure 20 Loss per final check-pointed NaimishNet ................................................................28 Figure 21 Comparison of Kaggle Scores...................................................................................29
  • 6. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 6 List of Tables Table 1 NaimishNet layer-wise architecture.............................................................................10 Table 2 Filter details of Convolution2d layers...........................................................................11 Table 3 Target Facial Key Points which need to be swapped while horizontally flipping the data........................................................................................................................................14 Table 4 Software deployed with version information .............................................................32
  • 7. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 7 About Me Naimish Agarwal is a student pursuing B. Tech in Information Technology with M. Tech in Robotics, a 5 years’ dual degree program, in the batch of 2013-2018, at Indian Institute of Information Technology, Allahabad, India. External Guide Dr Artus Krohn-Grimberghe is a Junior Professor and Head of the Department of Analytical Information Systems and Business Intelligence, at University of Paderborn, Germany. Internal Guide Dr Ranjana Vyas is an Assistant Professor at Indian Institute of Information Technology, Allahabad, India.
  • 8. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 8 Acknowledgement I would like to express my special gratitude to Prof Dr O.P. Vyas, Dean Research & Development, Indian Institute of Information Technology, Allahabad for providing me the opportunity to join University of Paderborn, Germany for a two months’ internship in the summer of 2016. I would like to express my gratitude to Dr Rahul Kala, Assistant Professor in Robotics and Artificial Intelligence Laboratory, Indian Institute of Information Technology, Allahabad for motivating me when in need.
  • 9. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 9 1 Introduction Facial Key Points (FKPs) detection is an important and challenging problem in the field of computer vision, which involves detecting FKPs like centers and corners of eyes, nose tip, etc. The problem is to predict the (x, y) real-valued co-ordinates in the space of image pixels of the FKPs for a given face image. It finds its application in tracking faces in images and videos, analysis of facial expressions, detection of dysmorphic facial signs for medical diagnosis, face recognition, etc. Facial features vary greatly from one individual to another, and even for a single individual there is a large amount of variation due to pose, size, position, etc. The problem becomes even more challenging when the face images are taken under different illumination conditions, viewing angles, etc. In the past few years, advancements in FKPs detection are made by the application of deep convolutional neural network (DCNN), which is a special type of feed-forward neural network with shared weights and local connectivity. DCNNs have helped build state-of-the-art models for image recognition, recommender systems, natural language processing, etc. Krizhevsky et al. [1] applied DCNN in ImageNet image classification challenge and outperformed the previous state-of-the-art model for image classification. Wang et al. [2] addressed FKPs detection by first applying histogram stretching for image contrast enhancement, followed by principal component analysis for noise reduction and mean patch search algorithm with correlation scoring and mutual information scoring for predicting left and right eye centers. Sun et al. [3] estimated FKPs by using a three level convolutional neural network, where at each level, outputs of multiple networks were fused for robust and accurate estimation. Longpre et al. [4] predicted FKPs by first applying data augmentation to expand the number of training examples, followed by testing different architectures of convolutional neural networks like LeNet [5] and VGGNet [6], and finally used a weighted ensemble of models. Nouri et al. [7] used six specialist DCNNs trained over pre-trained weights. Oneto et al. [8] applied a variety of data pre-processing techniques like histogram stretching, Gaussian blurring, followed by image flipping, key point grouping, and then finally applied LeNet. Taigman et al. [9] provided a new deep network architecture, DeepFace, for state-of- the-art face recognition. Li et al. [10] provided a new DCNN architecture for state-of-the art face alignment. We present a DCNN architecture – NaimishNet, based on LeNet, which addresses the problem of facial key points detection by providing a learning model for a single facial key point.
  • 10. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 10 2 NaimishNet Architecture NaimishNet consists of 4 convolution2d layers, 4 maxpooling2d layers and 3 dense layers, with sandwiched dropout and activation layers, as shown in Table 1. Layer Number Layer Name Layer Shape 1 Input1 (1, 96, 96) 2 Convolution2d1 (32, 93, 93) 3 Activation1 (32, 93, 93) 4 Maxpooling2d1 (32, 46, 46) 5 Dropout1 (32, 46, 46) 6 Convolution2d2 (64, 44, 44) 7 Activation2 (64, 44, 44) 8 Maxpooling2d2 (64, 22, 22) 9 Dropout2 (64, 22, 22) 10 Convolution2d3 (128, 21, 21) 11 Activation3 (128, 21, 21) 12 Maxpooling2d3 (128, 10, 10) 13 Dropout3 (128, 10, 10) 14 Convolution2d4 (256, 10, 10) 15 Activation4 (256, 10, 10) 16 Maxpooling2d4 (256, 5, 5) 17 Dropout4 (256, 5, 5) 18 Flatten1 (6400) 19 Dense1 (1000) 20 Activation5 (1000) 21 Dropout5 (1000) 22 Dense2 (1000) 23 Activation6 (1000) 24 Dropout6 (1000) 25 Dense3 (2) Table 1 NaimishNet layer-wise architecture Specifics of NaimishNet: o Input1 is the input layer. o Activation1 to Activation5 use Exponential Linear Units (ELUs) [11] as activation functions, whereas Activation6 uses Linear Activation Function. o Dropout [12] probability is increased from 0.1 to 0.6 from Dropout1 to Dropout6, with a step size of 0.1.
  • 11. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 11 o Maxpooling2d1 to Maxpooling2d4 use a pool shape of (2, 2), with non-overlapping strides and no zero padding. o Flatten1 flattens 3d input to 1d output. o Convolution2d1 to Convolution2d4 do not use zero padding, have their weights initialized with random numbers drawn from uniform distribution, and the specifics of number of filters and filter shape are shown in Table 2. o Dense1 to Dense3 are regular fully connected layers with weights initialized using Glorot uniform initialization [13]. o Adam [14] optimizer, with learning rate of 0.001, beta1 of 0.9, beta2 of 0.999 and epsilon of 1e-08, is used for minimizing Mean Squared Error (MSE). o There are 7,488,962 model parameters Layer Name No. of Filters Filter Shape Convolution2d1 32 (4, 4) Convolution2d2 64 (3, 3) Convolution2d3 128 (2, 2) Convolution2d4 256 (1, 1) Table 2 Filter details of Convolution2d layers 2.1 NaimishNet Architecture Design Decisions We have tried various different deep network architectures before finally settling down with NaimishNet architecture. Most of the design decisions were governed primarily by the hardware restrictions and the time taken per epoch for different runs of the network. Although we wanted one architecture to fit well to all the different FKPs as target, we have mainly taken decisions based on the performance of the model on left eye center FKP as target. We have run the different architectures for at most 40 epochs and forecasted its performance and generalizability based on our experience we gathered while training multiple deep network architectures. We do not provide the exact experimental numbers about the RMSE scores since this data was not recorded while experimenting with different architectures. We also do not provide the specifics of every architecture we tried since multiple changes were made without taking any record before arriving at NaimishNet architecture. We experimented with 5 convolution2d and 5 maxpooling2d layers and 4 dense layers with rectified linear units as activation functions, but that over fitted. We experimented with 3 convolution2d and 3 maxpooling2d layers and 2 dense layers with rectified linear units as activation functions, but that in turn required pre-training the weights to achieve good performance. We wanted to avoid pre-training to reduce the training time. We did not experiment with zero padding and strides.
  • 12. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 12 We sandwiched batch normalization layers [15] after every maxpooling2d layer but that increased the per epoch training time by fivefold. We used dropout layers with dropout probability of 0.5 for every dropout layer but that showed a poor fit. We removed the dropout layers but that resulted in overfitting. We tried different initialization schemes like He normal initialization [15], but that did not show nice performance in the first 20 epochs, in fact uniform initialization performed better than that for us. We experimented with different patience levels (PLs) for early stopping callback function and concluded that PL of 10% of total number of epochs, in our case 300, would do well in most of the cases. For the aforementioned cases, we used RMSProp optimizer [16]. The main aim was to reduce the training time along with achieving maximum performance and generalizability. To achieve that we experimented with exponential linear units as activation function, Adam optimizer, and linearly increasing dropout probabilities along the depth of the network. The specifics of the filter were chosen based on our past experience with exponentially increasing number of filters and linearly decreasing filter size. The batch size was kept 128 because vector operations are optimized for sizes which are in powers of 2. Also, given over 7 million model parameters, and limited GPU memory, batch size of 128 was an ideal size for us. The number of main layers i.e. 4 convolutional layers (convolution2d and maxpooling2d) and 3 dense layers was ideal since exponential linear units perform well in cases where the number of layers are greater than 5. The parameters of Adam optimizer were used as is as recommended by its authors and we did not experiment with them. NaimishNet, is inspired by LeNet [5], since LeNet’s architecture follows the pattern input -> conv2d -> maxpool2d -> conv2d -> maxpool2d -> conv2d -> maxpool2d -> dense -> output. Works of Longpre et al. [4], Oneto et al. [8], etc. have shown that LeNet styled architectures have been successfully applied for Facial Key Points Detection problem.
  • 13. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 13 3 Experiments 3.1 Dataset Description We have used the dataset from Kaggle Competition – Facial Key Points Detection [17]. The dataset was chosen to benchmark our solution against the existing approaches which address FKPs detection problem. There are 15 FKPs per face image like left eye center, right eye center, left eye inner corner, left eye outer corner, right eye inner corner, right eye outer corner, left eyebrow inner end, left eyebrow outer end, right eyebrow inner end, right eyebrow outer end, nose tip, mouth left corner, mouth right corner, mouth center top lip and mouth center bottom lip. Here, left and right are from the point of view of the subject. The greyscale input images, with pixel values in range of [0, 255], have size of 96 x 96 pixels. The train dataset consists of 7049 images, with 30 targets, i.e. (x, y) for each of 15 FKPs. It has missing target values for many FKPs for many face images. The test dataset consists of 1783 images with no target information. The Kaggle submission file consists of 27124 FKPs co-ordinates, which are to be predicted. The Kaggle submissions are scored based on the Root Mean Squared Error (RMSE), which is given by 𝑅𝑀𝑆𝐸 = √ ∑ (𝑦𝑖 − 𝑦̂𝑖)2𝑛 𝑖=1 𝑛 where 𝑦𝑖 is the original value, 𝑦̂𝑖 is the predicted value and n is the number of targets to be predicted. 3.2 Proposed Approach 3.2.1 Data Augmentation Data Augmentation helps boost the performance of a deep learning model when the training data is limited by generating more training data. We have horizontally flipped [7] the images for which target information for all the 15 FKPs are available, and also swapped the target values according to Table 3. Then, we vertically stacked the new horizontally flipped data under the original train data to create the augmented train dataset.
  • 14. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 14 Target Facial Key Point Target Facial Key Point Left Eye Center X Right Eye Center X Left Eye Center Y Right Eye Center Y Left Eye Inner Corner X Right Eye Inner Corner X Left Eye Inner Corner Y Right Eye Inner Corner Y Left Eye Outer Corner X Right Eye Outer Corner X Left Eye Outer Corner Y Right Eye Outer Corner Y Left Eyebrow Inner Corner X Right Eyebrow Inner Corner X Left Eyebrow Inner Corner Y Right Eyebrow Inner Corner Y Left Eyebrow Outer Corner X Right Eyebrow Outer Corner X Left Eyebrow Outer Corner Y Right Eyebrow Outer Corner Y Mouth Left Corner X Mouth Right Corner X Mouth Left Corner Y Mouth Right Corner Y Table 3 Target Facial Key Points which need to be swapped while horizontally flipping the data 3.2.2 Data Pre-processing The image pixels are normalized to the range [0, 1] by dividing by 255.0, and the train targets are zero-centered to the range [-1, 1] by first dividing by 48.0, since the images are 96 x 96, and then subtracting 48.0. 3.2.3 Pre-training Analysis Figure 1 shows that there are different number of non-missing target rows for different FKPs, so, we have created 15 NaimishNet models.
  • 15. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 15 Figure 1 Number of non-missing target rows in the augmented and pre-processed train dataset 3.2.4 Training For each of the 15 NaimishNet models, filter (keep) the rows with non-missing target values and split the filtered train data into 80% train dataset (T) and 20% validation dataset (V). Compute feature-wise mean M on T, and zero-center T and V by subtracting M, and also store M. Reshape features of T and V to (1, 96, 96), and start the training. Finally, validate on V, and store the model’s loss history. 9179 9176 4411 4407 4408 4408 4410 4365 4410 4376 9189 4409 4410 4415 9156 0 2000 4000 6000 8000 10000 LEFT EYE CENTER RIGHT EYE CENTER LEFT EYE INNER CORNER LEFT EYE OUTER CORNER RIGHT EYE INNER CORNER RIGHT EYE OUTER CORNER LEFT EYEBROW INNER END LEFT EYEBROW OUTER END RIGHT EYEBROW INNER END RIGHT EYEBROW OUTER END NOSE TIP MOUTH LEFT CORNER MOUTH RIGHT CORNER MOUTH CENTER TOP LIP MOUTH CENTER BOTTOM LIP TargetFacialKeyPoint Number of non-missing target rows
  • 16. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 16 Each model is trained with a batch size of 128 and maximum number of epochs are 300. Two callback functions are used which execute at the end of every epoch – Early Stopping Callback (ESC) and Model Checkpoint Callback (MCC). ESC stops the training when the number of contiguous epochs without improvement in validation loss are 30 (Patience Level). MCC stores the weights of the model with the lowest validation loss. 3.2.5 Evaluation For each of the 15 NaimishNet models, reload M, zero center test features by subtracting M and reshape the samples to (1, 96, 96). Reload the best check-pointed model weights, predict on test features, and store the predictions.
  • 17. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 17 4 Results Figure 2 Annotated Faces with 15 Facial Key Points, marked in blue for original and in red for predicted. Figure 2 shows the annotated faces with 15 FKPs of 6 individuals. It shows that the predicted locations are very close to the actual locations of FKPs.
  • 18. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 18 Figure 3 shows the worst, medium and best predictions, where the RMSEtop = 6.87, RMSEmiddle = 1.04, and RMSEbottom = 0.37, the original FKPs are shown in blue and the predicted FKPs are shown in red.
  • 19. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 19 In Figure 3, high RMSEtop is because of the presence of less number of images, in the training dataset, with the side face pose. Most of the images in the training dataset are that of the frontal face with nearly closed lips. So, NaimishNet delivers lower RMSEmiddle and RMSEbottom. Figure 4 Number of Epochs per NaimishNet Figure 4 shows that total number of epochs are 3507. Total training time of 15 NaimishNet models is approximately 20 hours.
  • 20. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 20 Figure 5 Loss History of NaimishNet for Left Eye Center as Target Figure 6 Loss History of NaimishNet for Right Eye Center as Target Figure 5 and Figure 6 show that both train and validation RMSE decrease over time with small wiggles along the curve showcasing that the batch size was rightly chosen and end up being close enough, thus showing that the two NaimishNet models generalized well. The best check-pointed models were achieved before 300 epochs, thus certifying the decision of fixing 300 epochs as the upper limit.
  • 21. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 21 Figure 7 Loss History of NaimishNet for Left Eye Inner Corner as Target Figure 8 Loss History of NaimishNet for Left Eye Outer Corner as Target Figure 7 and Figure 8 show that both train and validation RMSE decrease over time with small wiggles and end up being close enough. The best check-pointed models were achieved before 300 epochs. Also, for Figure 8, since the train RMSE does not become less than validation RMSE implies that the Patience Level (PL) for ESC could have been increased to 50. Thus, the training could have been continued for some more epochs.
  • 22. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 22 Figure 9 Loss History of NaimishNet for Right Eye Inner Corner as Target Figure 10 Loss History of NaimishNet for Right Eye Outer Corner as Target Figure 9 and Figure 10 show that both train and validation RMSE decrease over time with very small wiggles and end up being close enough. The best check-pointed models were achieved close to 300 epochs. For Figure 10, PL of ESC could have been increased to 50 so that the train RMSE becomes less than validation RMSE.
  • 23. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 23 Figure 11 Loss History of NaimishNet for Left Eyebrow Inner End as Target Figure 12 Loss History of NaimishNet for Left Eyebrow Outer End as Target Figure 11 and Figure 12, show that the train and validation RMSE decreases over time with small wiggles, and end up being close enough. The best check-pointed models were achieved far before 300 epochs. No further improvement could have been seen by increasing PL of ESC.
  • 24. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 24 Figure 13 Loss History of NaimishNet for Right Eyebrow Inner End as Target Figure 14 Loss History of NaimishNet for Right Eyebrow Outer End as Target Figure 13 and Figure 14 show that both train and validation RMSE decrease over time with small wiggles and end up being close in the check-pointed model, which can be seen 30 epochs before the last epoch on x-axis.
  • 25. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 25 Figure 15 Loss History of NaimishNet for Nose Tip as Target Figure 16 Loss History of NaimishNet for Mouth Left Corner as Target Figure 15 and Figure 16 show that both train and validation RMSE decrease over time and they end up being close enough in the final check-pointed model. For Figure 16, PL of ESC could have been increased to 50 but not necessarily there would have been any visible improvement.
  • 26. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 26 Figure 17 Loss History of NaimishNet for Mouth Right Corner as Target Figure 18 Loss History of NaimishNet for Mouth Center Top Lip as Target Figure 17 and Figure 18 show that the train and validation RMSE decrease over time, with small wiggles, and the upper limit of 300 epochs was just right enough for the models to show convergence.
  • 27. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 27 Figure 19 Loss History of NaimishNet for Mouth Center Bottom Lip as Target Figure 19 shows that the train and validation RMSE decreases over time and the wiggly nature of the validation curve increases towards the end. We could have tried to increase the batch size to 256 if it could be handled by our hardware. The curves are close together and show convergence before 300 epochs. Figure 20 shows that the NaimishNet generalizes well, since the average train / validation RMSE is 1.03, where average loss is calculated by 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝐿𝑜𝑠𝑠 = √ ∑ 𝑅𝑀𝑆𝐸𝑖 215 𝑖̇=1 15
  • 28. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 28 Figure 20 Loss per final check-pointed NaimishNet 2.12 1.99 1.45 1.81 1.29 1.57 2.08 2.4 1.8 1.77 2.86 2.26 1.78 1.82 2.71 2.03 2.23 2.06 1.42 1.58 1.31 1.35 1.8 2.34 1.76 2.05 2.73 1.92 1.82 1.75 2.78 1.98 0.95 0.96 1.02 1.15 0.98 1.17 1.16 1.02 1.02 0.86 1.05 1.18 0.98 1.04 0.97 1.03 Left Eye Center Right Eye Center Left Eye Inner Corner Left Eye Outer Corner Right Eye Inner Corner Right Eye Outer Corner Left Eyebrow Inner End Left Eyebrow Outer End Right Eyebrow Inner End Right Eyebrow Outer End Nose Tip Mouth Left Corner Mouth Right Corner Mouth Center Top Lip Mouth Center Bottom Lip Average TargetFacialKeyPoint Loss per final check-pointed NaimishNet Train / Validation RMSE Validation RMSE Train RMSE
  • 29. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 29 Figure 21 Comparison of Kaggle Scores Figure 21 shows that NaimishNet outperforms the approaches used by Longpre et al. [4], Oneto et al. [8], Wang et al. [2], since lower the score, better the model. 3.15 3.089 3.347 3.104 3.3 2.91784 2.8843 2.52471 0 0.5 1 1.5 2 2.5 3 3.5 4 LENET-5 W/O TEST TIME AUG [4] LENET-5 [4] ENSEMBLE (LENET-5 + DROPOUT) [4] ENSEMBLE (LENET-5 + LENET-3) [4] ENSEMBLE (LENET-5 + VGG) [4] ONETO ET AL [8] WANG ET AL [2] NAIMISHNET Kaggle Score
  • 30. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 30 5 Conclusion NaimishNet – a learning model for a single facial key point, is based on LeNet [5]. As per Figure 21, LeNet based architectures have proven to be successful for Facial Key Points Detection. Works of Longpre et al. [4], Oneto et al. [8], etc. have shown that LeNet styled architectures have been successfully applied for Facial Key Points Detection problem, as is also shown in Figure 21.
  • 31. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 31 6 Future Scope Given enough compute resources and time, NaimishNet can be further modified by experimenting with different initialization schemes, different activation functions, different number of filters and kernel size, different number of layers, switching from LeNet [5] to VGGNet [6], introducing Inception modules as in GoogleNet [18], or introducing Residual Networks [19], etc. Different image pre-processing approaches like histogram stretching, zero centering, etc. can be tried to check which approaches improve the model’s accuracy. New approaches to Facial Key Points Detection can be based on the application of deep learning as a three stage process - face detection, face alignment, and facial key points detection, with the use of the state-of-the-art algorithms for each stage.
  • 32. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 32 7 Appendix 7.1 Software and Hardware Deployed Name Version Python 2.7.11 Numpy 1.10.4 Matplotlib 1.5.1 Pandas 0.18.0 Theano 0.9.0 Keras 1.0.4 Scikit-learn 0.17.1 cPickle 1.71 Anaconda 4.0.0 Cuda 7.5 cuDNN 4 OpenCV 3.1.0 Windows 10 Table 4 Software deployed with version information HP Envy Notebook, with NVIDIA GeForce GTX 950M GPU and Intel core i7 5500 @ 2.4 GHz, was used to perform all the experiments.
  • 33. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 33 8 References [1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012. [2] Y. Wang and Y. Song, "Facial Keypoints Detection," Stanford University, 2014. [3] Y. Sun, X. Wang and X. Tang, "Deep Convolutional Network Cascade for Facial Key Point Detection," in Conference on Computer Vision and Pattern Recognition, 2013. [4] S. Longpre and A. Sohmshetty, "Facial Keypoint Detection," Stanford University, 2016. [5] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," in Proceedings of the IEEE, 1998. [6] K. Simoyan and A. Zisserman, "Very Deep Convolutional Networks for Large-scale Image Recognition," in International Conference on Learning Representation, 2015. [7] D. Nouri, "Using convolutional neural nets to detect facial keypoints tutorial," 17 12 2014. [Online]. Available: http://danielnouri.org/notes/2014/12/17/using- convolutional-neural-nets-to-detect-facial-keypoints-tutorial/. [8] M. Oneto, L. Yang, Y. Jang and C. Dailey, "Facial Keypoints Detection: An Effort to Top the Kaggle Leaderboard," Berkeley School of Information, 2015. [9] Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," in Computer Vision and Pattern Recognition, 2014. [10] P. Li, X. Liao, Y. Xu, H. Ling and C. Liao, "GPU Boosted Deep Learning in Real-time Face Alignment," in GPU Technology Conference, 2016. [11] D.-A. Clevert, T. Unterthiner and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," in International Conference on Learning Representations, 2016. [12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, pp. 1929-1958, 2014. [13] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, pp. 249-256, 2010.
  • 34. NAIMISH AGARWAL FACIAL KEY POINTS DETECTION USING DEEP CONVOLUTIONAL NEURAL NETWORK - NAIMISHNET 34 [14] D. P. Kingma and J. L. Ba, "Adam: A Method for Stochastic Optimization," in International Conference on Learning Representation, 2015. [15] F.-F. Li, A. Karpathy and J. Johnson, "Neural Networks Part 2: Setting up the Data and Loss," 2016. [Online]. Available: http://cs231n.github.io/neural-networks-2/. [16] F.-F. Li, A. Karpathy and J. Johnson, "Neural Networks Part 3: Learning and Evaluation," 2016. [Online]. Available: http://cs231n.github.io/neural-networks-3/. [17] "Kaggle Competition - Facial Key Points Detection," 7 May 2013. [Online]. Available: https://www.kaggle.com/c/facial-keypoints-detection. [18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," in Computer Vision and Pattern Recognition, 2014. [19] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," in Computer Vision and Pattern Recognition, 2015.