1. Face Presentation Attack Detection
using Color Spaces Features and
Convolutional Neural Network
Rachmawan Atmaji Perdana1, Muhammad Nurkhoiri Hindratno1, Ahmad Syafiq Kamil1, Muhammad Rafi Juliansyah1,
Rully Kusumajaya1, Mohammad Hamdani2, Gembong Satrio Wibowanto1, Anto Satriyo Nugroho1
(1)Research Center for Artificial Intelligence and Cyber Security, (2)Research Center for Electronics
National Research and Innovation Agency
2. Background
High-tech devices such as smartphones have become accessible to most
people
Face recognition has gained popularity as a use case for authentication
(biometric authentication using the face modality)
Face biometrics is more accessible than other biometric modalities since it
does not require a special acquisition device, unlike fingerprints or irises, which
require a special scanner or camera.
3. (Face) Presentation Attack
Where someone tries to make themselves recognized by the system as another person (who has
the authority to access the service)
Or in other words, impersonation
Presentation attacks are carried out using photos, videos, or masks of other people's faces
A Presentation Attack Detection (PAD) system is needed to distinguish attack (spoof) images
from real face images
4. Presentation Attack Detection Approach
Liveness-based PAD
Detects 'life' by tracking facial
movements such as eye blinking or
lip movements
Texture-based PAD
Finds inherent properties that can
be used to distinguish real and spoof
face images, such as color texture and
depth
Static-based: looks only at
frequency or spatial properties of
a single image (used in this paper
since it is computationally inexpensive)
Dynamic-based: looks at
spatiotemporal texture properties
of a sequence of images (video)
5. Related Works
The research in [2] proposes a method to analyze an image's texture using multi-scale LBP and improves histogram features using a micro-texture pattern.
•This study shows that the EER value is smaller than in previous studies.
•Research utilizing the REPLAY-ATTACK database revealed that time-domain analysis produced better results than still-frame analysis.
The study by [3] introduces a new public dataset called REPLAY-ATTACK.
•This dataset has three attacks that have been proven effective in bypassing biometric recognition systems.
•It achieved better performance compared to state-of-the-art at that time.
The earliest paper to use a CNN as a feature learner to tackle the PAD problem was [4].
•They trained the AlexNet CNN model in a supervised style to find and distinguish spoof attacks based on discriminatory features of the face images.
•It achieved better performance compared to state-of-the-art at that time.
In 2016, Boulkenafet et al. published one of the most influential papers in the realm of PAD [5].
•They obtained chrominance and luminance information from the HSV and YCbCr channels, then extracted low texture features from the color components in these channels and made histograms of each of
these features.
•It achieved 3.2% and 0.0% EER on CASIA and Replay Attack datasets, respectively.
Anjum and Sonekar [6] proposed a method for liveness detection using color texture and image distortion analysis.
•They used HSV and YCbCr in addition to exploring color texture information in RGB.
•This study's results indicate that using texture and image-distortion features together can improve accuracy compared to an SVM operating on only one of these types of information.
6. Related Works - continued
Atoum et al. [7] published a novel method that merges two CNNs, namely a Patch-based CNN and a Depth-based CNN.
• Patch-based CNN uses several parts of the face as input to detect spoofing, while Depth-based CNN detects the depth of the image.
• The merging of the two CNNs proved to complement each other's deficiencies and produced equal or better results.
The research conducted by Lin and Su [8] uses a CNN with an architecture of their own design.
• The CNN architecture is designed with input images extracted using the RGB and HSV channels.
• The results show that the APCER, BPCER, and ACER values of the proposed model are smaller than those of existing models.
In their study [9], Grover and Mehra investigated how to extract deep features using CNN.
• The features are extracted by combining the LBP descriptor with CNN.
• Based on this analysis, it can be concluded that combining the modified LBP descriptor with a CNN can produce a system that is fast at detecting spoofing attacks.
Das et al. [10] conducted anti-spoofing research using VGG16 and LBP to distinguish real and fake faces.
• Their study combined feature information from the luminance and chrominance channels using the LBP descriptor.
• This study yielded good results in identifying real and fake faces.
An experiment by [11] used an LBP histogram of only three components of the HSV and YCbCr color spaces, thus requiring less computation and memory and allowing it to run on an FPGA.
Research related to color denoising using YCbCr and CIELUV was also conducted by Balamurali et al. [12].
• The detected face image is converted to the said color spaces.
• The vector features of the two images are obtained by passing the image to the VGG architecture and then combined into one and classified using SVM to determine whether the image is genuine or fake.
7. Why HSV and YCbCr Color Space?
Often used in many PAD tasks and proven useful in many studies
Provide additional facial texture information for spoofing detection through chrominance and
luminance information
Should be able to discriminate between spoof and non-spoof images
Idea: by combining the face image representations in HSV and YCbCr and feeding them into a
Convolutional Neural Network, the network becomes a feature learner and classifier for PAD
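The stacking idea above can be sketched in plain NumPy: convert an RGB face crop to HSV and YCbCr and concatenate them into a 6-channel input. This is an illustrative sketch, not the paper's code; in practice OpenCV's cvtColor would do the conversions, and the BT.601 full-range YCbCr coefficients below are an assumption.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    # BT.601 full-range conversion (assumed); rgb is float in [0,1], shape (H, W, 3)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return np.stack([y, cb, cr], axis=-1)

def rgb_to_hsv(rgb):
    # vectorized HSV conversion, all channels scaled to [0,1]
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    maxc, minc = rgb.max(axis=-1), rgb.min(axis=-1)
    delta = maxc - minc
    s = np.where(maxc > 0, delta / np.where(maxc > 0, maxc, 1), 0.0)
    safe = np.where(delta > 0, delta, 1)  # avoid division by zero on gray pixels
    h = np.where(maxc == r, ((g - b) / safe) % 6, 0.0)
    h = np.where(maxc == g, (b - r) / safe + 2, h)
    h = np.where(maxc == b, (r - g) / safe + 4, h)
    h = np.where(delta == 0, 0.0, h) / 6.0
    return np.stack([h, s, maxc], axis=-1)

def hsv_ycbcr_tensor(rgb):
    # 6-channel representation: HSV and YCbCr stacked along the channel axis
    return np.concatenate([rgb_to_hsv(rgb), rgb_to_ycbcr(rgb)], axis=-1)
```

In Scenario II this 6-channel tensor would be the CNN input; in Scenario I only the first or last three channels would be fed.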
8. Proposed Method
Face Alignment to normalize rotated image (by 'straightening' eyes) →
Localize face area using ResNet10 SSD (from OpenCV) →
Crop and Resize face image to 128 x 128 pixels →
Get HSV and YCbCr representation of face images →
Feed and Train images in CNN to create model
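The crop-and-resize step of the pipeline can be sketched as follows; a minimal sketch assuming detection (the ResNet10 SSD) and eye-based alignment have already produced a bounding box upstream, with a nearest-neighbour resize standing in for OpenCV's interpolation.

```python
import numpy as np

def crop_and_resize(image, box, size=128):
    # `box` = (x, y, w, h) face bounding box, e.g. from OpenCV's ResNet10 SSD
    # detector; detection and eye-based alignment are assumed done upstream.
    x, y, w, h = box
    face = image[y:y + h, x:x + w]
    # nearest-neighbour resize to size x size (a stand-in for cv2.resize)
    rows = np.arange(size) * face.shape[0] // size
    cols = np.arange(size) * face.shape[1] // size
    return face[rows][:, cols]
```

The resulting 128 x 128 crop is then converted to HSV and YCbCr before being fed to the CNN.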
9. CASIA Dataset
• Includes 50 real participants; artificial faces are created using high-
quality recordings of the real faces.
• Low-quality, average-quality, and high-quality imaging are considered.
• There are three fake-face attacks: the warped photo attack, the cut photo
attack, and the video attack.
• The final database contains 600 video clips, with 12 videos per subject (3
real and 9 fraudulent); 240 for training and 360 for testing.
Folder | Real face | Warped Photo Attack | Cut Photo Attack | Video Attack | Total
train | 651 | 791 | 725 | 751 | 2918
test | 1021 | 1202 | 1060 | 1024 | 4307
10. NUAA Dataset
• The NUAA photo database was collected using several webcams purchased from an online
marketplace.
• The database was collected over two sessions, about two weeks apart, under varied conditions.
• All the color images in the database have the same number of pixels.
• Each subject was recorded with the webcams in a series of photos during every session.
• Unlike CASIA, NUAA only has real face images and photo attack images.
11. CNN Architectures
• Using state-of-the-art CNN architectures with Top-5 accuracy above 90%
• Fine-tuned with initial weights from ImageNet
• Using binary cross-entropy (BCE) as the loss function
Loss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log\big(p(y_i)\big) + (1 - y_i) \cdot \log\big(1 - p(y_i)\big) \right]
• Scenario I : Feed image in HSV/YCbCr to CNN
• Scenario II : Feed both HSV and YCbCr to CNN
MobileNetV2
VGG16
ResNet50
Figure 4. Detailed Architecture of CNN Used in Second Scenario
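The binary cross-entropy loss above maps directly to code; a minimal NumPy sketch for illustration, not the actual training loop (which fine-tunes the ImageNet-initialized models in a deep learning framework).

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    # Loss = -(1/N) * sum_i [ y_i*log(p(y_i)) + (1-y_i)*log(1-p(y_i)) ]
    eps = 1e-12                      # numerical guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

For PAD, y_i = 1 could label a bona fide image and y_i = 0 an attack, with p(y_i) the model's predicted bona fide probability.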
12. Experiment Result (EER and HTER)
• Fusion of HSV and YCbCr for PAD
generally yields lower EER and
HTER than using the HSV or YCbCr
color space alone.
• These results confirm two
experiments conducted by
Boulkenafet et al. (2015, 2016).
• We are still investigating why HSV
performs better on CASIA and
YCbCr performs better on NUAA.
EER and HTER of NUAA Dataset (%), EER / HTER per color space
Model | HSV | YCbCr | HSV + YCbCr
ResNet50 | 14.65 / 14.65 | 2.96 / 2.94 | 1.85 / 1.85
VGG16 | 8.41 / 8.43 | 1.6 / 1.61 | 1.25 / 1.18
MobileNetV2 | 16.1 / 16.1 | 6.49 / 6.5 | 5.99 / 5.98

EER and HTER of CASIA Dataset (%), EER / HTER per color space
Model | HSV | YCbCr | HSV + YCbCr
ResNet50 | 6.56 / 6.54 | 8.62 / 8.61 | 6.95 / 6.9
VGG16 | 6.86 / 7.05 | 10.18 / 10.19 | 5.58 / 5.57
MobileNetV2 | 5.09 / 5.07 | 7.54 / 7.63 | 3.62 / 3.8
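The EER and HTER metrics reported above could be computed from per-image scores roughly as follows; a hedged sketch using a simple threshold sweep, not the paper's exact evaluation protocol (HTER is properly computed with a threshold fixed on a development set).

```python
import numpy as np

def far_frr(bona_fide, attack, thr):
    # scores at or above the threshold are accepted as bona fide
    far = np.mean(np.asarray(attack) >= thr)    # attacks wrongly accepted
    frr = np.mean(np.asarray(bona_fide) < thr)  # bona fide wrongly rejected
    return far, frr

def eer(bona_fide, attack):
    # sweep candidate thresholds; the EER is the operating point where
    # FAR and FRR are closest, reported as their average there
    thresholds = np.unique(np.concatenate([bona_fide, attack]))
    rates = [far_frr(bona_fide, attack, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2
```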
13. Color Space Analysis in CASIA and NUAA
Dataset
Color space analysis can be employed as a technique to characterize the
differences between the CASIA and NUAA datasets
Histogram similarity between bona fide and impostor images on each
color component can be used to measure the differences between bona
fide and impostor images.
The greater the difference, the better the measured color component
distinguishes bona fide and impostor images.
14. MEAN OF CHI-SQUARE VALUES OF BONA FIDE AND IMPOSTOR
IMAGE HISTOGRAM
Mean of Chi-Square Value in CASIA (Bona fide vs Impostor)
Color Component | Warped Photo Attack | Cut Photo Attack | Video Attack
H | 368000.309 | 261274.257 | 323010.222
S | 165192.486 | 160347.83 | 153888.729
V | 72267.6273 | 80817.2635 | 66737.3916
Y | 58123.7854 | 107410.825 | 63554.0094
Cr | 151599.445 | 125699.966 | 147597.709
Cb | 188433.815 | 121616.277 | 189304.904
H, S, and Cb have better discriminatory nature
Mean of Chi-Square Value in NUAA (Bona fide vs Impostor)
Color Component | Photo Attack
H | 535133.498
S | 86404.3076
V | 201754.048
Y | 233779.829
Cr | 245715.501
Cb | 1281518.85
H, Y, Cr, and Cb have better discriminatory nature
d(H_1, H_2) = \sum_{I} \frac{\big(H_1(I) - H_2(I)\big)^2}{H_1(I)}
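The Chi-Square distance above maps directly to code; a minimal sketch (OpenCV's compareHist with the HISTCMP_CHISQR flag implements the same formula), with empty bins of H1 skipped to avoid division by zero, an assumption about how such bins are handled.

```python
import numpy as np

def chi_square_distance(h1, h2):
    # d(H1, H2) = sum_I (H1(I) - H2(I))^2 / H1(I), skipping empty bins of H1
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    nz = h1 > 0
    return np.sum((h1[nz] - h2[nz]) ** 2 / h1[nz])
```

Applied per color component (H, S, V, Y, Cr, Cb), averaging this distance over bona fide/impostor pairs yields the table values above.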
15. Conclusion
• HSV could reduce the number of successful
video attacks on average by 61.8%, warped
photo attacks by 13.6%, and cut photo
attacks by 7% compared to YCbCr.
• This result confirms the study by
Boulkenafet et al. (2015), which stated that
the HSV color space is more effective
against video attacks than YCbCr on
CASIA.
False Negatives on MobileNetV2
Color Space | Warped Photo Attack | Cut Photo Attack | Video Attack | Total
HSV + YCbCr | 55 | 28 | 47 | 130
HSV | 81 | 55 | 30 | 166
YCbCr | 91 | 63 | 100 | 254

False Negatives on ResNet50
Color Space | Warped Photo Attack | Cut Photo Attack | Video Attack | Total
HSV + YCbCr | 105 | 86 | 34 | 225
HSV | 143 | 51 | 20 | 214
YCbCr | 147 | 82 | 54 | 283

False Negatives on VGG16
Color Space | Warped Photo Attack | Cut Photo Attack | Video Attack | Total
HSV + YCbCr | 64 | 32 | 87 | 183
HSV | 121 | 58 | 59 | 238
YCbCr | 166 | 45 | 124 | 335
16. Comparison with Other Methods in CASIA
Method | EER (%) | HTER (%)
Chingovska et al. [3] | 18.2 | -
Yang et al. [4] | 7.4 | -
Moon Y. et al. [11] | 10.22 | 1.43
Boulkenafet et al. [14] | 6.2 | -
Boulkenafet et al. [5] | 3.2 | -
He and Luo [19] | 5.83 | -
Khammari, M. [20] | 2.62 | 2.14
Larbi et al. [15] | - | 10.68
Proposed (HSV + YCbCr, MobileNetV2) | 3.62 | 3.8
17. Conclusions
We proposed fusing HSV and YCbCr color-space texture features and feeding them into state-of-the-
art CNN models for presentation attack detection.
Our experiments show that using joint information from the HSV and YCbCr color spaces yields better and
more promising results than a single color space (HSV or YCbCr).
Our experiments achieve EER values of less than 6% on both the CASIA and NUAA datasets, with the best EER
of 1.25% on the NUAA dataset using VGG16 and 3.62% on the CASIA dataset using MobileNetV2.