SPRINGER BRIEFS IN COMPUTER SCIENCE
Haibin Yan
Jiwen Lu
Facial Kinship Verification
A Machine Learning Approach
SpringerBriefs in Computer Science
Series editors
Stan Zdonik, Brown University, Providence, Rhode Island, USA
Shashi Shekhar, University of Minnesota, Minneapolis, Minnesota, USA
Xindong Wu, University of Vermont, Burlington, Vermont, USA
Lakhmi C. Jain, University of South Australia, Adelaide, South Australia, Australia
David Padua, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
Xuemin (Sherman) Shen, University of Waterloo, Waterloo, Ontario, Canada
Borko Furht, Florida Atlantic University, Boca Raton, Florida, USA
V.S. Subrahmanian, University of Maryland, College Park, Maryland, USA
Martial Hebert, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Katsushi Ikeuchi, University of Tokyo, Tokyo, Japan
Bruno Siciliano, Università di Napoli Federico II, Napoli, Italy
Sushil Jajodia, George Mason University, Fairfax, Virginia, USA
Newton Lee, Newton Lee Laboratories, LLC, Tujunga, California, USA
Foreword
Any parent knows that people are always interested in whether their own child
bears some resemblance to either of them, mother or father. Human interest in family
ties and connections, in common ancestry and lineage, is as natural to life as life
itself.
Biometrics has made enormous advances over the last 30 years. When I wrote
my first paper on automated face recognition (in 1985) I could count the active
researchers in the area on the fingers of my hand. Now my fingers do not suffice
even to count the countries in which automatic face recognition is being developed,
my fingers suffice only to count the continents. Surprisingly, even with all the
enormous volume of research in automated face recognition and in face biometrics,
there has not as yet been much work on kinship analysis in biometrics and computer
vision despite the natural interest. And it does seem a rather obvious tack to take.
The work to date has appeared in conference papers and journal papers, and now it
is a book. Those are prudent routes to take in the development of any approach and
technology.
There are many potential applications here. In the context of forensic systems,
the police now use familial DNA to trace offenders, particularly for cold crimes
(those from a long time ago). As we reach systems where automated search will
become routine, one can countenance systems that look for kinship. It could even
putatively help to estimate appearance with age, given a missing subject. There are
many other possible applications too, especially as advertising has noted the potent
abilities of biometric systems.
For now those topics are a long way ahead. First, facial kinship analysis needs
exposure and development in terms of science. It needs data and it needs technique.
I am pleased to see that is what is being considered in this new text on Facial
Kinship Verification: A Machine Learning Approach. This will be part of our
future. Enjoy!
Southampton, UK
February 2017
Mark Nixon
IAPR Fellow
Preface
Facial images convey many important human characteristics, such as identity,
gender, expression, age, and ethnicity. Over the past two decades, a large number of
face analysis problems have been investigated in the computer vision and pattern
recognition community. Representative examples include face recognition, facial
expression recognition, facial age estimation, gender classification and ethnicity
recognition. Compared with these face analysis tasks, facial kinship verification is a
relatively new research topic in face analysis, and only a few attempts have been
made over the past few years. However, this new research topic has several
potential applications, such as family album organization, image annotation, social
media analysis, and missing children/parents search. Hence, it is desirable to write a
book that summarizes the state of the art of research findings in this direction and
provides some useful suggestions to researchers who are working in this field.
This book specializes in facial kinship verification, covering topics ranging from
classical feature representation and metric learning methods to state-of-the-art facial
kinship verification methods based on feature learning and metric learning techniques. It mainly
comprises three parts. The first part focuses on the feature learning methods, which
are recently developed for facial kinship verification. The second part presents
several metric learning methods for facial kinship verification, including both
conventional methods and some recently proposed methods. The third part dis-
cusses some recent studies on video-based facial kinship verification. As feature
learning and metric learning methods presented in this book can also be easily
applied to other face analysis tasks, e.g., face recognition, facial expression
recognition, facial age estimation, and gender classification, it will be beneficial for
researchers and practitioners who are searching for solutions for their specific face
analysis applications or even pattern recognition problems. The book is also suit-
able for graduates, researchers, and practitioners interested in computer vision and
machine learning both as a learning text and as a reference book.
I thank Prof. Marcelo H. Ang Jr. and Prof. Aun Neow Poo at the National
University of Singapore for bringing me to the world of computer vision and
robotics, and for their valuable suggestions on my research and career. I also thank
the publication team of SpringerBriefs for its assistance.
The writing of this book is supported in part by the National Natural Science
Foundation of China (Grant No. 61603048, 61672306), the Beijing Natural Science
Foundation (Grant No. 4174101), the Fundamental Research Funds for the Central
Universities at Beijing University of Posts and Telecommunications (Grant No.
2017RC21) and the Thousand Talents Program for Distinguished Young Scholars.
Beijing, China
February 2017
Haibin Yan
1 Introduction to Facial Kinship Verification
and graph-based fusion [9]. Table 1.1 briefly reviews and compares existing kinship
verification methods for facial kinship modeling over the past 5 years, where the
performance of these methods is evaluated by the mean verification rate. While the
verification rates of different methods in this table cannot be directly compared due
to different datasets and experimental protocols, we can still see that major progress in
kinship verification has been made in recent years. More recently, Fang et al. [7]
extended kinship verification to kinship classification. In their work, they proposed
a kinship classification approach that reconstructs the query face from a sparse
set of samples among the candidates for family classification, and a 15% rank-one
classification rate was achieved on a dataset consisting of 50 families.
To the best of our knowledge, there have been only limited attempts to tackle the
problem of facial kinship verification in the literature. However, facial kinship
verification has several potential applications. One representative application of
kinship verification is social media analysis. For example, there are tens of billions
of images on the popular Facebook website, and more than 2.5 billion images are
added to the website each month. How to automatically organize such large-scale
data remains a challenging problem in computer vision and multimedia. There are
two key questions to be answered: (1) who these people are, and (2) what their
relations are. Face recognition is an important approach to address the first question,
and kinship verification is a useful technique to approach the second question. When
kinship relations are known, it is possible to automatically create family trees from
these social network websites. Currently, our method has achieved a 70–75% verification
rate when two face images were captured from different photos, and a 75–80%
verification rate when they came from the same photo. Compared with state-of-the-art
face verification methods, which usually achieve above a 90% verification rate on the
LFW dataset, the performance of existing facial kinship verification methods is low.
However, existing computational facial kinship verification methods still provide
useful information for analyzing the relation between two persons, because these
numbers are not only much higher than random guessing (50%) but also comparable
to those of human observers.
Another important application of kinship verification is missing children search.
Currently, DNA testing is the dominant approach to verifying the kin relation of two
persons, and it is effective for finding missing children. However, DNA testing has
two limitations: (1) it raises serious privacy concerns, which may restrict its use in
some applications; (2) its cost is high. Kinship verification from facial images can
remedy these weaknesses because verifying kinship relations from facial images is
convenient and inexpensive. For example, if we want to find a missing child among
thousands of children, it is difficult to use DNA testing to verify the kin relations
due to privacy concerns. With our kinship verification method, however, we can
quickly identify some possible candidates who show high facial similarity. DNA
testing can then be applied to these candidates to obtain the exact search result.
1.2 Outline of This Book
This book aims to give a timely summary of past achievements as well
as to introduce some emerging techniques for facial kinship verification, especially
from the machine learning perspective. We would also like to give some suggestions
to readers who are interested in conducting research in this area by presenting
some new facial kinship datasets and benchmarks.
The remainder of this book is organized as follows:
Chapter 2 presents the state-of-the-art feature learning techniques for facial kin-
ship verification. Specifically, conventional well-known facial feature descriptors are
briefly reviewed. However, these feature representation methods are hand-crafted
and not data-adaptive. In this chapter, we introduce the feature learning approach
and show how to employ it for facial kinship verification.
Chapter 3 presents the state-of-the-art metric learning techniques for facial kinship
verification. Specifically, conventional similarity learning methods, such as principal
component analysis, linear discriminant analysis, and locality preserving projections
are briefly reviewed. However, these similarity learning methods are hand-crafted
and not data-adaptive. In this chapter, we introduce the metric learning approach and
show how to use it for facial kinship verification.
Chapter 4 introduces some recent advances in video-based facial kinship veri-
fication. Compared with a single image, a face video provides more information
to describe the appearance of the human face. It can capture the face of the person of
interest under different poses, expressions, and illuminations. Moreover, face videos
can be captured much more easily in real applications because there are numerous
surveillance cameras installed in public areas. Hence, it is desirable to employ face
videos to determine the kin relations of persons. However, it is also challenging to
exploit discriminative information from face videos because intra-class variations
within a face video are usually larger than those in a single still image. In this chapter,
we show some recent efforts in video-based facial kinship verification.
Chapter 5 finally concludes this book by giving some suggestions to potential
researchers in this area. These suggestions include the commonly used benchmark
datasets and the standard evaluation protocols. Moreover, we also discuss some
potential directions for future work on facial kinship verification.
References
1. Alvergne, A., Oda, R., Faurie, C., Matsumoto-Oda, A., Durand, V., Raymond, M.: Cross-
cultural perceptions of facial resemblance between kin. J. Vis. 9(6), 1–10 (2009)
2. Dal Martello, M., Maloney, L.: Where are kin recognition signals in the human face? J. Vis.
6(12), 1356–1366 (2006)
3. DeBruine, L., Smith, F., Jones, B., Craig Roberts, S., Petrie, M., Spector, T.: Kin recognition
signals in adult faces. Vis. Res. 49(1), 38–43 (2009)
4. Deng, W., Hu, J., Guo, J.: Extended SRC: undersampled face recognition via intraclass variant
dictionary. IEEE Trans. Pattern Anal. Mach. Intell. 34(9), 1864–1870 (2012)
5. Dibeklioglu, H., Salah, A.A., Gevers, T.: Like father, like son: facial expression dynamics for
kinship verification. In: IEEE International Conference on Computer Vision, pp. 1497–1504
(2013)
6. Du, S., Ward, R.K.: Improved face representation by nonuniform multilevel selection of gabor
convolution features. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 39(6), 1408–1419 (2009)
7. Fang, R., Gallagher, A.C., Chen, T., Loui, A.: Kinship classification by modeling facial feature
heredity. In: IEEE International Conference on Image Processing, pp. 2983–2987 (2013)
8. Fang, R., Tang, K., Snavely, N., Chen, T.: Towards computational models of kinship verification.
In: IEEE International Conference on Image Processing, pp. 1577–1580 (2010)
9. Guo, Y., Dibeklioglu, H., van der Maaten, L.: Graph-based kinship recognition. In: International
Conference on Pattern Recognition, pp. 4287–4292 (2014)
10. Guo, G., Wang, X.: Kinship measurement on salient facial features. IEEE Trans. Instrum. Meas.
61(8), 2322–2325 (2012)
11. Kaminski, G., Dridi, S., Graff, C., Gentaz, E.: Human ability to detect kinship in strangers’
faces: effects of the degree of relatedness. Proc. R. Soc. B: Biol. Sci. 276(1670), 3193–3200
(2009)
12. Kohli, N., Singh, R., Vatsa, M.: Self-similarity representation of weber faces for kinship classi-
fication. In: IEEE International Conference on Biometrics: Theory, Applications, and Systems,
pp. 245–250 (2012)
13. Kohli, N., Vatsa, M., Singh, R., Noore, A., Majumdar, A.: Hierarchical representation learning
for kinship verification. IEEE Trans. Image Process. 26(1), 289–302 (2017)
14. Lu, J., Hu, J., Zhou, X., Zhou, J., Castrillòn-Santana, M., Lorenzo-Navarro, J., Bottino, A.G.,
Vieira, T.: Kinship verification in the wild: the first kinship verification competition. In:
IAPR/IEEE Joint Conference on Biometrics, pp. 1–6 (2014)
15. Lu, J., Zhou, X., Tan, Y.P., Shang, Y., Zhou, J.: Neighborhood repulsed metric learning for
kinship verification. IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 331–345 (2014)
16. Somanath, G., Kambhamettu, C.: Can faces verify blood-relations? In: IEEE International
Conference on Biometrics: Theory, Applications and Systems, pp. 105–112 (2012)
17. Xia, S., Shao, M., Fu, Y.: Toward kinship verification using visual attributes. In: IEEE Inter-
national Conference on Pattern Recognition, pp. 549–552 (2012)
18. Xia, S., Shao, M., Luo, J., Fu, Y.: Understanding kin relationships in a photo. IEEE Trans.
Multimed. 14(4), 1046–1056 (2012)
19. Yan, H., Lu, J., Deng, W., Zhou, X.: Discriminative multimetric learning for kinship verification.
IEEE Trans. Inf. Forensics Secur. 9(7), 1169–1178 (2014)
20. Yan, H., Lu, J., Zhou, X.: Prototype-based discriminative feature learning for kinship verifica-
tion. IEEE Trans. Cybern. 45(11), 2535–2545 (2015)
21. Zhou, X., Hu, J., Lu, J., Shang, Y., Guan, Y.: Kinship verification from facial images under
uncontrolled conditions. In: ACM International Conference on Multimedia, pp. 953–956 (2011)
22. Zhou, X., Lu, J., Hu, J., Shang, Y.: Gabor-based gradient orientation pyramid for
kinship verification under uncontrolled environments. In: ACM International Conference on
Multimedia, pp. 725–728 (2012)
2 Feature Learning for Facial Kinship Verification
Fig. 2.1 The LBP and ILBP operators: a original image patch, b result of the LBP operator,
c result of the ILBP operator
defined according to the number of spatial transitions that are bitwise 0/1 changes in
the calculated binary values. If there are at most two bitwise transitions from 0 to 1
or vice versa, the binary value is called a uniform pattern.
Recently, Jin et al. [22] developed an improved LBP (ILBP) method. Different
from LBP, ILBP makes better use of the central pixel and takes the mean of all gray
values in the patch as the threshold value. The newly generated binary value of the
central pixel is appended at the leftmost position of the original binary string. The
corresponding decimal value range is thus changed from [0, 255] to [0, 510] for a
3×3 operator. Figure 2.1 shows the basic ideas of LBP and ILBP.
Since LBP and ILBP are robust to illumination changes, they have been widely used
in many semantic understanding tasks such as face recognition and facial expression
recognition.
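As an illustrative sketch (not the authors' code; the clockwise neighbour ordering and the sample patch are assumptions made here), the LBP and ILBP codes of a single 3×3 patch described above can be computed as follows:

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 LBP: threshold the 8 neighbours at the centre pixel's value."""
    center = patch[1, 1]
    # Neighbours in clockwise order starting from the top-left corner.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    # Read the bits as a binary number, first neighbour = most significant bit.
    return sum(b << i for i, b in enumerate(reversed(bits)))   # value in [0, 255]

def ilbp_code(patch):
    """Improved LBP: threshold all 9 pixels (centre included) at the patch mean
    and prepend the centre bit, giving a 9-bit code."""
    mean = patch.mean()
    center_bit = 1 if patch[1, 1] >= mean else 0
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [center_bit] + [1 if n >= mean else 0 for n in neighbors]
    return sum(b << i for i, b in enumerate(reversed(bits)))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_code(patch))    # -> 143
print(ilbp_code(patch))   # -> 399
```

A pattern is "uniform" in the sense above when its circular bit string contains at most two 0/1 transitions; counting the transitions in `bits` would implement that check.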
2.1.2 Gabor Feature Representation
The Gabor wavelet is a popular feature extraction method for visual signal representation;
discriminative information is extracted by convolving the original image with
a set of Gabor kernels with different scales and orientations. A 2-D Gabor wavelet
kernel is the product of an elliptical Gaussian envelope and a complex plane wave,
defined as [31]:
ψ_{μ,ν}(z) = (‖k_{μ,ν}‖² / σ²) exp(−‖k_{μ,ν}‖² ‖z‖² / (2σ²)) [e^{i k_{μ,ν}·z} − e^{−σ²/2}]   (2.1)
Fig. 2.2 Real part of Gabor kernels at eight orientations and five scales [31]
where μ and ν define the orientation and scale of the Gabor kernels, z = z(x, y) is
the variable in the complex spatial domain, ‖·‖ denotes the norm operator, and the
wave vector k_{μ,ν} is defined as follows:

k_{μ,ν} = k_ν e^{jφ_μ}   (2.2)

where k_ν = k_max / f^ν, φ_μ = πμ/8, k_max is the maximum frequency, f is the spacing
factor between kernels in the frequency domain, and σ is the standard deviation of the
Gaussian envelope, which determines the number of oscillations. For a given image I, the
convolution of I and a Gabor kernel ψ_{μ,ν} is defined as

O_{μ,ν}(z) = I(z) ∗ ψ_{μ,ν}(z)   (2.3)

where O_{μ,ν}(z) is the convolution result corresponding to the Gabor kernel at orientation μ
and scale ν. Figure 2.2 shows the real part of the Gabor kernels.
Usually, five spatial frequencies and eight orientations are used for Gabor feature
representation, so a total of 40 Gabor kernel functions are applied at each pixel of
an image. The computational cost is therefore generally expensive. Moreover, only
the magnitudes of the Gabor wavelet coefficients are used as features, because the
phase information is sensitive to inaccurate alignment.
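As a hedged sketch of Eqs. (2.1)–(2.3), a bank of 40 Gabor kernels can be built as follows; the parameter values k_max = π/2, f = √2, σ = 2π, and the 31×31 support are common choices assumed here, not taken from this text:

```python
import numpy as np

def gabor_kernel(mu, nu, kmax=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi, size=31):
    """One 2-D Gabor kernel (Eq. 2.1) at orientation mu and scale nu."""
    k = kmax / (f ** nu)                         # k_nu = kmax / f^nu
    phi = np.pi * mu / 8.0                       # phi_mu = pi * mu / 8
    kx, ky = k * np.cos(phi), k * np.sin(phi)    # wave vector k_{mu,nu}
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x ** 2 + y ** 2
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * z2 / (2 * sigma ** 2))
    # Complex plane wave minus the DC-compensation term of Eq. (2.1).
    wave = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * wave

# A bank of 40 kernels: five scales (nu) times eight orientations (mu).
bank = [gabor_kernel(mu, nu) for nu in range(5) for mu in range(8)]
```

Convolving an image with every kernel in `bank` and keeping only the magnitudes of the responses yields the 40-channel Gabor representation described above.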
2.2 Feature Learning
There have been a number of feature learning methods proposed in recent years
[3, 16, 18, 20, 26, 38]. Representative feature learning methods include sparse
auto-encoders [3], denoising auto-encoders [38], restricted Boltzmann machine [16],
convolutional neural networks [18], independent subspace analysis [20], and reconstruction
independent component analysis [26]. Recently, there has also been some
work on feature learning-based face representation, and some of it has achieved
reasonably good performance in face recognition. For example, Lei et al. [29] pro-
posed a discriminant face descriptor (DFD) method by learning an image filter using
the LDA criterion to obtain LBP-like features. Cao et al. [7] presented a learning-
based (LE) feature representation method by applying the bag-of-word (BoW) frame-
work. Hussain et al. [19] proposed a local quantized pattern (LQP) method by mod-
ifying the LBP method with a learned coding strategy. Compared with hand-crafted
feature descriptors, learning-based feature representation methods usually show bet-
terrecognitionperformancebecausemoredata-adaptiveinformationcanbeexploited
in the learned features.
Compared with real-valued feature descriptors, binary codes have three advantages:
(1) they save memory, (2) they are faster to compute, and (3) they are robust to
local variations. Recently, there has been an increasing interest
in binary code learning in computer vision [14, 40, 43, 44]. For example, Weiss et
al. [44] proposed an efficient binary coding learning method by preserving the simi-
larity of original features for image search. Norouzi et al. [36] learned binary codes
by minimizing a triplet ranking loss for similar pairs. Wang et al. [43] presented a
binary code learning method by maximizing the similarity of neighboring pairs and
minimizing the similarity of non-neighboring pairs for image retrieval. Trzcinski et
al. [40] obtained binary descriptors from patches by learning several linear projections
based on pre-defined filters during training. However, most existing binary
code learning methods are developed for similarity search [14, 40] and visual tracking
[30]. While binary features such as LBP and the Haar-like descriptor have been used
in face recognition and have achieved encouraging performance, most of them are
hand-crafted. In this chapter, we first review the compact binary face descriptor method,
which learns binary features directly from raw pixels for face representation. Then,
we introduce a feature learning approach to learn mid-level discriminative features
with low-level descriptors for kinship verification.
2.2.1 Learning Compact Binary Face Descriptor
While binary features have been proven to be very successful in face recognition [1],
most existing binary face descriptors are all hand-crafted (e.g., LBP and its exten-
sions) and they suffer from the following limitations:
Fig. 2.3 The bin distributions of the a LBP and b our CBFD methods. We computed the bin
distributions of the LBP histogram and our method on the FERET training set, which consists of
1002 images from 429 subjects. For a fair comparison, both adopted the same number of
bins for feature representation, which was set to 59 in this figure. We clearly see from this figure that
the histogram distribution is uneven for LBP and more uniform for our CBFD method. © [2015]
IEEE. Reprinted, with permission, from Ref. [35]
1. It is generally impossible to sample large-size neighborhoods for hand-crafted
binary descriptors in feature encoding due to the high computational burden.
However, a large sample size is more desirable because more discriminative
information can be exploited in large neighborhoods.
2. It is difficult to manually design an optimal encoding method for hand-crafted
binary descriptors. For example, the conventional LBP adopts a hand-crafted
codebook for feature encoding, which is simple but not discriminative enough
because a hand-crafted codebook cannot exploit contextual information well.
3. Hand-crafted binary codes such as those in LBP are usually unevenly distributed,
as shown in Fig. 2.3a. Some codes appear less often than others in many real-life face
images, which means that some bins in the LBP histogram are less informative
and less compact. These bins therefore make LBP less discriminative.
Fig. 2.4 One example showing how to extract a pixel difference vector (PDV) from the original
face image. For any pixel in the image, we first identify its neighbors in a (2R + 1) × (2R + 1)
space, where R is a parameter defining the neighborhood size; R is selected as 1 in this figure for
easy illustration. Then, the differences between the center point and the neighboring pixels are
computed as the PDV. © [2015] IEEE. Reprinted, with permission, from Ref. [35]
To address these limitations, the compact binary face descriptor (CBFD) learning
method [35] was proposed to learn face descriptors directly from raw pixels. Unlike LBP,
which samples small-size neighboring pixels and computes binary codes with a fixed coding
strategy, we sample large-size neighboring pixels and learn a feature filter to obtain
binary codes automatically. Let X = [x_1, x_2, ..., x_N] be the training set containing
N samples, where x_n ∈ R^d is the nth pixel difference vector (PDV) and 1 ≤ n ≤ N.
Unlike most previous feature learning methods [18, 26], which use the original raw
pixel patch to learn the feature filters, we use PDVs for feature learning because a PDV
measures the differences between the center point and the neighboring pixels within a
patch, so it can better describe how pixel values change and implicitly encode
important visual patterns such as edges and lines in face images. Moreover, the PDV has
been widely used in many local face feature descriptors, such as the hand-crafted LBP [1]
and the learning-based DFD [29]. Figure 2.4 illustrates how to extract one PDV from the
original face image. For any pixel in the image, we first identify its neighbors in a
(2R + 1) × (2R + 1) space, where R is a parameter to define the neighborhood size.
Then, the differences between the center point and the neighboring pixels are computed
as the PDV. In our experiments, R is set to 3, so that each PDV is a 48-dimensional
feature vector.
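The PDV extraction just described can be sketched as follows (a straightforward loop, not the authors' implementation; the toy 10×10 image is only for illustration):

```python
import numpy as np

def extract_pdvs(image, R=3):
    """Extract one pixel difference vector (PDV) per valid pixel. Each PDV
    holds the differences between the (2R+1)^2 - 1 neighbours and the centre
    pixel, so its dimension is (2R+1)**2 - 1, i.e. 48 for R = 3."""
    H, W = image.shape
    pdvs = []
    for i in range(R, H - R):
        for j in range(R, W - R):
            patch = image[i - R:i + R + 1, j - R:j + R + 1].astype(float)
            diff = (patch - patch[R, R]).ravel()         # neighbour minus centre
            vec = np.delete(diff, (2 * R + 1) * R + R)   # drop the centre entry
            pdvs.append(vec)
    return np.array(pdvs)

img = np.arange(100, dtype=float).reshape(10, 10)   # toy 10x10 "image"
X = extract_pdvs(img, R=3)
print(X.shape)   # (16, 48): 4x4 valid centres, each a 48-dimensional PDV
```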
Our CBFD method aims to learn K hash functions to map and quantize each x_n
into a binary vector b_n = [b_{n1}, ..., b_{nK}] ∈ {0, 1}^{1×K}, which encodes more compact
and discriminative information. Let w_k ∈ R^d be the projection vector for the kth
function; then the kth binary code b_{nk} of x_n can be computed as

b_{nk} = 0.5 × (sgn(w_k^T x_n) + 1)   (2.4)

where sgn(v) equals 1 if v ≥ 0 and −1 otherwise.
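As a minimal numeric sketch of Eq. (2.4), with a random projection w_k standing in for a learned one (the dimension d = 48 matches the PDV size used in this chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
x_n = rng.standard_normal(48)            # one 48-dimensional PDV
w_k = rng.standard_normal(48)            # projection vector for the kth bit (learned in practice)
b_nk = 0.5 * (np.sign(w_k @ x_n) + 1)    # Eq. (2.4): a single bit in {0, 1}
```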
To make b_n discriminative and compact, we enforce three important criteria when
learning these binary codes:
1. The learned binary codes are compact. Since large-size neighboring pixels are
sampled, there is some redundancy in the sampled vectors; making the codes
compact reduces this redundancy.
2. The learned binary codes preserve the energy of the original sample vectors well,
so that little information is lost in the binary code learning step.
3. The learned binary codes are evenly distributed, so that each bin in the histogram
conveys more discriminative information, as shown in Fig. 2.3b.
To achieve these objectives, we formulate the following optimization objective
function:

min_{w_k} J(w_k) = J_1(w_k) + λ_1 J_2(w_k) + λ_2 J_3(w_k)
               = −Σ_{n=1}^{N} (b_{nk} − μ_k)² + λ_1 Σ_{n=1}^{N} ((b_{nk} − 0.5) − w_k^T x_n)² + λ_2 (Σ_{n=1}^{N} (b_{nk} − 0.5))²   (2.5)

where N is the number of PDVs extracted from the whole training set, μ_k is the mean
of the kth binary code over all PDVs in the training set (recomputed and updated in
each iteration of our method), and λ_1 and λ_2 are two parameters that balance the
effects of the different terms, making a good trade-off among them in the objective
function.
The physical meanings of the different terms in (2.5) are as follows:
1. The first term J_1 ensures that the variance of the learned binary codes is
maximized, so that only a few bins are needed to represent the original
PDVs in the learned binary codes.
2. The second term J_2 ensures that the quantization loss between the original feature
and the encoded binary codes is minimized, which minimizes the information
loss in the learning process.
3. The third term J_3 ensures that the feature bins in the learned binary codes are
distributed as evenly as possible, so that they are more compact and informative,
enhancing the discriminative power.
Let W = [w_1, w_2, ..., w_K] ∈ R^{d×K} be the projection matrix. We map each sample
x_n into a binary vector as follows:

b_n = 0.5 × (sgn(W^T x_n) + 1).   (2.6)
Then, (2.5) can be re-written as:

min_W J(W) = J_1(W) + λ_1 J_2(W) + λ_2 J_3(W)
           = −(1/N) × tr((B − U)^T (B − U)) + λ_1 ‖(B − 0.5) − W^T X‖²_F + λ_2 ‖(B − 0.5) × 1^{N×1}‖²_F   (2.7)

where B = 0.5 × (sgn(W^T X) + 1) ∈ {0, 1}^{K×N} is the binary code matrix and
U ∈ R^{K×N} is the mean matrix whose columns repeat the mean vector of the
binary bits over the training set.
To our knowledge, (2.7) is an NP-hard problem due to the non-linear sgn(·)
function. To address this, we relax the sgn(·) function to its signed magnitude [14,
43] and rewrite J_1(W) as follows:

J_1(W) = −(1/N) × (tr(W^T X X^T W) − 2 × tr(W^T X M^T W) + tr(W^T M M^T W))   (2.8)

where M ∈ R^{d×N} is the mean matrix whose columns repeat the mean of all PDVs
in the training set.
Similarly, J_3(W) can be re-written as:

J_3(W) = ‖(W^T X − 0.5) × 1^{N×1}‖²_2
       = tr(W^T X 1^{N×1} 1^{1×N} X^T W) − N × tr(1^{1×K} W^T X 1^{N×1}) + H   (2.9)

where H = 0.5 × 1^{1×N} × 1^{N×1} × 0.5, which is a constant and is not influenced by W.
Combining (2.7)–(2.9), we have the following objective function for our CBFD
model:

min_W J(W) = tr(W^T Q W) + λ_1 ‖(B − 0.5) − W^T X‖²_F − λ_2 × N × tr(1^{1×K} W^T X 1^{N×1})
subject to: W^T W = I   (2.10)

where
Algorithm 2.1: CBFD
Input: Training set X = [x_1, x_2, ..., x_N], iteration number T, parameters λ_1 and λ_2, binary
code length K, and convergence parameter ε.
Output: Feature projection matrix W.
Step 1 (Initialization):
Initialize W to the top K eigenvectors of XX^T corresponding to the K largest eigenvalues.
Step 2 (Optimization):
For t = 1, 2, ..., T, repeat:
2.1. Fix W and update B using (2.13).
2.2. Fix B and update W using (2.14).
2.3. If ‖W_t − W_{t−1}‖ < ε and t > 2, go to Step 3.
Step 3 (Output):
Output the matrix W.
Q = −(1/N) × (X X^T − 2 X M^T + M M^T) + λ_2 X 1^{N×1} 1^{1×N} X^T   (2.11)

and the columns of W are constrained to be orthogonal.
While (2.10) is not convex in W and B simultaneously, it is convex in one of
them when the other is fixed. Following the work in [14], we iteratively optimize W
and B using the following two-stage method.
Update B with a fixed W: when W is fixed, (2.10) can be re-written as:

min_B J(B) = ‖(B − 0.5) − W^T X‖²_F.   (2.12)

The solution to (2.12) is (B − 0.5) = W^T X if there is no constraint on B. Since B is
a binary matrix, this solution is relaxed as:

B = 0.5 × (sgn(W^T X) + 1).   (2.13)
Update W with a fixed B: when B is fixed, (2.10) can be re-written as:

min_W J(W) = tr(W^T Q W) + λ_1 (tr(W^T X X^T W) − 2 × tr((B − 0.5) X^T W)) − λ_2 × N × tr(1^{1×K} W^T X 1^{N×1})
subject to: W^T W = I.   (2.14)
We use the gradient descent method with the curvilinear search algorithm in [45]
to solve for W. Algorithm 2.1 summarizes the detailed procedure of the proposed
CBFD method.
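The two-stage procedure of Algorithm 2.1 can be sketched as follows. This is a simplified, hedged reimplementation on random data: the curvilinear search of [45] is replaced by a plain gradient step followed by re-orthonormalization via QR, and the hyperparameter values are arbitrary assumptions:

```python
import numpy as np

def cbfd_fit(X, K=8, lam1=0.1, lam2=0.01, T=20, lr=1e-3):
    """Simplified CBFD sketch. X is d x N with one PDV per column.
    Alternates the closed-form code update (Eq. 2.13) with a gradient step on
    W for Eq. 2.14; the constraint W^T W = I is enforced by re-orthonormalising
    W with a QR decomposition after each step (a simplification of the
    curvilinear search used in [45])."""
    d, N = X.shape
    M = np.tile(X.mean(axis=1, keepdims=True), (1, N))     # mean matrix of PDVs
    ones = np.ones((N, 1))
    # Q from Eq. (2.11).
    Q = -(X @ X.T - 2 * X @ M.T + M @ M.T) / N + lam2 * (X @ ones) @ (X @ ones).T
    # Step 1: initialise W with the top-K eigenvectors of X X^T.
    vals, vecs = np.linalg.eigh(X @ X.T)
    W = vecs[:, np.argsort(vals)[::-1][:K]]
    # Step 2: alternate the B-update and the W-update.
    for _ in range(T):
        B = 0.5 * (np.sign(W.T @ X) + 1)                   # Eq. (2.13)
        grad = (Q + Q.T) @ W \
            + 2 * lam1 * (X @ X.T @ W - X @ (B - 0.5).T) \
            - lam2 * N * (X @ ones) @ np.ones((1, K))      # gradient of Eq. (2.14)
        W, _ = np.linalg.qr(W - lr * grad)                 # step + re-orthonormalise
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((48, 200))        # 200 synthetic 48-dimensional PDVs
W = cbfd_fit(X)
codes = 0.5 * (np.sign(W.T @ X) + 1)      # K x N binary code matrix
```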
Fig. 2.5 The flow-chart of the CBFD-based face representation and recognition method. For each
training face, we first divide it into several non-overlapped regions and learn the feature filter
and dictionary for each region individually. Then, we apply the learned filter and dictionary to
extract a histogram feature for each block, and we concatenate these histograms into a longer
feature vector for face representation. Finally, the nearest neighbor classifier is used to measure
sample similarity. © [2015] IEEE. Reprinted, with permission, from Ref. [35]
Having obtained the learned feature projection matrix W, we first project each PDV into a low-dimensional feature vector. Unlike many previous feature learning methods [26, 27], which usually perform feature pooling on the learned features directly, we apply an unsupervised clustering method to learn a codebook from the training set so that the learned codes are more data-adaptive. In our implementations, the conventional K-means method is applied to learn the codebook due to its simplicity. Then, each learned binary code feature is pooled as a bin, and all PDVs within the same face image are represented as a histogram feature for face representation. Previous studies have shown that different face regions carry different structural information, so it is desirable to learn position-specific features for face representation. Motivated by this finding, we divide each face image into many non-overlapping local regions and learn a CBFD feature descriptor for each local region. Finally, the features extracted from different regions are combined to form the final representation of the whole face image. Figure 2.5 illustrates how to use the CBFD for face representation.
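As a concrete illustration, the region-wise histogram pooling described above can be sketched as follows. This is a minimal sketch assuming a learned filter `W` that binarizes projected PDVs and a K-means codebook; all names, shapes, and sizes are illustrative stand-ins, not taken from the book.

```python
import numpy as np
from sklearn.cluster import KMeans

def cbfd_histogram(pdvs, W, kmeans):
    """Project PDVs with a learned filter W, binarize, then pool the
    binary codes into a normalized histogram over a K-means codebook."""
    codes = (pdvs @ W > 0).astype(float)            # one binary code per PDV
    bins = kmeans.predict(codes)                    # codebook assignment per PDV
    hist = np.bincount(bins, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)              # normalized histogram

def face_representation(regions, W, kmeans):
    """Concatenate per-region histograms into one feature vector."""
    return np.concatenate([cbfd_histogram(r, W, kmeans) for r in regions])

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))                      # hypothetical learned filter
train_codes = (rng.standard_normal((300, 8)) @ W > 0).astype(float)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_codes)
regions = [rng.standard_normal((50, 8)) for _ in range(4)]  # 4 face regions of PDVs
feat = face_representation(regions, W, kmeans)
print(feat.shape)  # (32,)
```

In this sketch each of the 4 regions contributes one 8-bin histogram, mirroring the concatenation of region-wise features into a longer vector.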
2.2.2 Prototype-Based Discriminative Feature Learning
Recently, there has been growing interest in unsupervised feature learning and deep
learning in computer vision and machine learning, and a variety of feature learning
methods have been proposed in the literature [10, 16, 18, 21, 23, 25, 28, 33, 38, 41,
Fig. 2.6 Pipeline of our proposed kinship verification approach. First, we construct a set of face samples from the Labeled Faces in the Wild (LFW) dataset as the prototypes and represent each face image from the kinship dataset as a combination of these prototypes in the hyperplane space. Then, we use the labeled kinship information and learn mid-level features in the hyperplane space to extract more semantic information for feature representation. Finally, the learned hyperplane parameters are used to represent face images in both the training and testing sets as a discriminative mid-level feature for kinship verification. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
49, 52] to learn feature representations from raw pixels. Generally, feature learning
methods exploit some prior knowledge such as smoothness, sparsity, and temporal
and spatial coherence [4]. Representative feature learning methods include sparse
auto-encoder [38, 41], restricted Boltzmann machines [16], independent subspace
analysis [8, 28], and convolutional neural networks [25]. These methods have been
successfully applied in many computer vision tasks such as image classification [25],
human action recognition [21], face recognition [18], and visual tracking [42].
Unlike previous feature learning methods, which learn features directly from raw pixels, we propose learning mid-level discriminative features from low-level descriptors [51], where each entry in the mid-level feature vector is the corresponding decision value from one support vector machine (SVM) hyperplane. We formulate an optimization objective function on the learned features so that face samples with a kin relation are expected to have similar decision values from these hyperplanes. Hence, our method is complementary to the existing feature learning methods. Figure 2.6 shows the pipeline of our proposed approach.
Low-level feature descriptors such as LBP [2] and the scale-invariant feature transform (SIFT) [32] are usually ambiguous and not discriminative enough for kinship verification, especially when face images are captured in the wild, because such scenarios introduce large variations in the face images. Unlike most previous kinship verification work, where low-level hand-crafted feature descriptors [12, 13, 15, 24, 34, 39, 47, 48, 50, 53, 54] such as local binary patterns (LBP) [2, 9] and Gabor features [31, 54] are employed for face representation, we aim to extract more semantic information from low-level features to better characterize the relation of face images for kinship verification.
To achieve this, we learn mid-level feature representations by combining a large unsupervised dataset with a small supervised dataset, because it is difficult to obtain a large number of labeled face images with kinship labels for discriminative feature learning. First, we construct a set of face samples with unlabeled kinship relations from the Labeled Faces in the Wild (LFW) dataset [17] as the reference set. Then, each sample in the training set with a labeled kin relation is represented as a mid-level feature vector. Finally, we formulate an objective function that minimizes the distances between intra-class samples (those with a kinship relation) and maximizes the distances between inter-class neighboring samples in the mid-level feature space. Unlike most existing prototype-based feature learning methods, which learn the model with a strongly labeled training set, our method works on a large unsupervised generic set combined with a small labeled training set because unlabeled data can be easily collected. The following details the proposed approach.
Let $Z = [z_1, z_2, \ldots, z_N] \in \mathbb{R}^{d\times N}$ be an unlabeled reference image set, where $N$ and $d$ are the number of samples and the feature dimension of each sample, respectively. Let $S = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_M, y_M)\}$ be the training set containing $M$ pairs of face images with a kinship relation (positive image pairs), where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^d$ are the face images of the $i$th pair. Different from most existing
feature learning methods which learn feature representations from raw pixels, we
aim to learn a set of mid-level features which are obtained from a set of prototype
hyperplanes. For each training sample in S, we apply the linear SVM to learn the
weight vector w to represent it as follows:
w =
N
n=1
αnlnzj =
N
n=1
βnzj = Zβ (2.15)
where ln is the label of the unlabeled data zn, βn = αnln is the combination coefficient,
αn is the dual variable, and β = [β1, β2, . . . , βN ] ∈ RN×1
is the coefficient vector.
Specifically, if βn is non-zero, it means that the sample zk in the unlabeled reference
set is selected as a support vector of the learned SVM model, and ln = 1 if βn is
positive. Otherwise, ln = −1. Motivated by the maximal margin principal of SVM,
we only need to select a sparse set of support vectors to learn the SVM hyperplane.
Hence, β should be a sparse vector, where β 1 ≤ γ , and γ is a parameter to control
the sparsity of β.
Having learned the SVM hyperplane, each training sample $x_i$ and $y_i$ can be represented as

$$f(x_i) = w^T x_i = x_i^T w = x_i^T Z\beta \quad (2.16)$$

$$f(y_i) = w^T y_i = y_i^T w = y_i^T Z\beta. \quad (2.17)$$

Assume we have learned $K$ linear SVM hyperplanes; then the mid-level feature representations of $x_i$ and $y_i$ can be written as

$$f(x_i) = [x_i^T Z\beta_1, x_i^T Z\beta_2, \ldots, x_i^T Z\beta_K]^T = B^T Z^T x_i \quad (2.18)$$

$$f(y_i) = [y_i^T Z\beta_1, y_i^T Z\beta_2, \ldots, y_i^T Z\beta_K]^T = B^T Z^T y_i \quad (2.19)$$

where $B = [\beta_1, \beta_2, \ldots, \beta_K]$ is the coefficient matrix.
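In code, the mid-level mapping of (2.18) is a single matrix product. A minimal sketch with random stand-ins for the reference set $Z$ and coefficient matrix $B$ (all sizes illustrative):

```python
import numpy as np

def midlevel_features(x, Z, B):
    """f(x) = B^T Z^T x: the K decision values of the prototype hyperplanes,
    where each hyperplane weight vector is w_k = Z beta_k."""
    return B.T @ (Z.T @ x)      # shape (K,)

rng = np.random.default_rng(1)
d, N, K = 10, 30, 5
Z = rng.standard_normal((d, N))   # unlabeled reference set (columns are samples)
B = rng.standard_normal((N, K))   # coefficient matrix [beta_1, ..., beta_K]
x = rng.standard_normal(d)
f = midlevel_features(x, Z, B)
# each entry equals the decision value w_k^T x with w_k = Z beta_k
w0 = Z @ B[:, 0]
print(np.isclose(f[0], w0 @ x))  # True
```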
Now, we propose the following optimization criterion to learn the coefficient matrix B with the sparsity constraint:

$$\max_{B}\; H(B) = H_1(B) + H_2(B) - H_3(B)$$

$$= \frac{1}{Mk}\sum_{i=1}^{M}\sum_{t_1=1}^{k} \|f(x_i) - f(y_{it_1})\|_2^2 + \frac{1}{Mk}\sum_{i=1}^{M}\sum_{t_2=1}^{k} \|f(x_{it_2}) - f(y_i)\|_2^2 - \frac{1}{M}\sum_{i=1}^{M} \|f(x_i) - f(y_i)\|_2^2$$

$$\text{s.t. } \|\beta_k\|_1 \le \gamma, \quad k = 1, 2, \ldots, K \quad (2.20)$$
where $y_{it_1}$ represents the $t_1$th $k$-nearest neighbor of $y_i$ and $x_{it_2}$ denotes the $t_2$th $k$-nearest neighbor of $x_i$, respectively. The objective of $H_1$ and $H_2$ in (2.20) is to push the mid-level feature representations of $y_{it_1}$ and $x_i$, and of $x_{it_2}$ and $y_i$, as far apart as possible if they are originally near to each other in the low-level feature space. The physical meaning of $H_3$ in (2.20) is that $x_i$ and $y_i$ should be close to each other in the mid-level feature space. We enforce the sparsity constraint on $\beta_k$ so that only a sparse set of support vectors from the unlabeled reference dataset is selected to learn the hyperplane, because we assume each sample can be sparsely reconstructed from the reference set, which is inspired by the work in [23]. In our work, we apply the same parameter $\gamma$ to every hyperplane to reduce the number of parameters in our model, so that the complexity of the proposed approach is reduced.
Combining (2.18)–(2.20), we simplify $H_1(B)$ to the following form:

$$H_1(B) = \frac{1}{Mk}\sum_{i=1}^{M}\sum_{t_1=1}^{k} \|B^T Z^T x_i - B^T Z^T y_{it_1}\|_2^2 = \frac{1}{Mk}\,\mathrm{tr}\Big(\sum_{i=1}^{M}\sum_{t_1=1}^{k} B^T Z^T (x_i - y_{it_1})(x_i - y_{it_1})^T Z B\Big) = \mathrm{tr}(B^T F_1 B) \quad (2.21)$$

where

$$F_1 = \frac{1}{Mk}\sum_{i=1}^{M}\sum_{t_1=1}^{k} Z^T (x_i - y_{it_1})(x_i - y_{it_1})^T Z. \quad (2.22)$$
Similarly, $H_2(B)$ and $H_3(B)$ can be simplified as

$$H_2(B) = \mathrm{tr}(B^T F_2 B), \qquad H_3(B) = \mathrm{tr}(B^T F_3 B) \quad (2.23)$$
where

$$F_2 = \frac{1}{Mk}\sum_{i=1}^{M}\sum_{t_2=1}^{k} Z^T (x_{it_2} - y_i)(x_{it_2} - y_i)^T Z \quad (2.24)$$

$$F_3 = \frac{1}{M}\sum_{i=1}^{M} Z^T (x_i - y_i)(x_i - y_i)^T Z. \quad (2.25)$$
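The matrices $F_1$–$F_3$ of (2.22)–(2.25) can be accumulated directly from the training pairs. The sketch below assumes the $k$-nearest neighbors are found by Euclidean distance among the parent (respectively child) images, which the text does not spell out, and uses random stand-in data:

```python
import numpy as np

def scatter_matrices(X, Y, Z, k):
    """Accumulate F1, F2, F3 from M positive pairs (columns of X and Y)
    and the reference set Z, following (2.22)-(2.25)."""
    M = X.shape[1]
    N = Z.shape[1]
    ZT = Z.T
    F1, F2, F3 = np.zeros((N, N)), np.zeros((N, N)), np.zeros((N, N))
    for i in range(M):
        # k nearest neighbors of y_i among the other child images
        dy = np.linalg.norm(Y - Y[:, [i]], axis=0); dy[i] = np.inf
        for t in np.argsort(dy)[:k]:
            diff = X[:, i] - Y[:, t]
            F1 += ZT @ np.outer(diff, diff) @ Z
        # k nearest neighbors of x_i among the other parent images
        dx = np.linalg.norm(X - X[:, [i]], axis=0); dx[i] = np.inf
        for t in np.argsort(dx)[:k]:
            diff = X[:, t] - Y[:, i]
            F2 += ZT @ np.outer(diff, diff) @ Z
        diff = X[:, i] - Y[:, i]
        F3 += ZT @ np.outer(diff, diff) @ Z
    return F1 / (M * k), F2 / (M * k), F3 / M

rng = np.random.default_rng(2)
d, M, N, k = 6, 8, 12, 3
X, Y = rng.standard_normal((d, M)), rng.standard_normal((d, M))
Z = rng.standard_normal((d, N))
F1, F2, F3 = scatter_matrices(X, Y, Z, k)
print(F1.shape)  # (12, 12)
```

Each term is a sum of outer products, so the resulting matrices are symmetric positive semi-definite by construction.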
Based on (2.21)–(2.25), the proposed PDFL model can be formulated as follows:

$$\max\; H(B) = \mathrm{tr}[B^T (F_1 + F_2 - F_3) B]$$

$$\text{s.t. } B^T B = I, \quad \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K \quad (2.26)$$

where $B^T B = I$ is a constraint that controls the scale of $B$ so that the optimization problem in (2.26) is well-posed with respect to $B$.
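Note that without the sparsity constraint, (2.26) is a standard trace maximization under an orthogonality constraint, whose solution is given by the leading eigenvectors of $F_1 + F_2 - F_3$. A small numerical check, with a random symmetric matrix standing in for $F_1 + F_2 - F_3$:

```python
import numpy as np

rng = np.random.default_rng(6)
Msym = rng.standard_normal((10, 10))
F = Msym + Msym.T          # symmetric stand-in for F1 + F2 - F3
K = 3
# Trace maximization under B^T B = I (no sparsity): the optimum is spanned
# by the K eigenvectors of F with the largest eigenvalues.
vals, vecs = np.linalg.eigh(F)   # eigenvalues in ascending order
B = vecs[:, -K:]                 # K leading eigenvectors
print(np.isclose(np.trace(B.T @ F @ B), vals[-K:].sum()))  # True
```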
Since there is a sparsity constraint on each $\beta_k$, we cannot obtain $B$ by solving a standard eigenvalue equation. To address this, we adopt the alternating optimization method in [37] and reformulate the optimization problem as a regression problem. Let $F = F_1 + F_2 - F_3$. We perform singular value decomposition (SVD) on $F$ to obtain $F = G^T G$, where $G \in \mathbb{R}^{N\times N}$. Following [37], we reformulate a regression problem by using an intermediate matrix $A = [a_1, a_2, \ldots, a_K] \in \mathbb{R}^{N\times K}$ (please see Theorem 1 in [37] for more details on this reformulation):

$$\min\; H(A, B) = \sum_{k=1}^{K} \|G a_k - G\beta_k\|^2 + \lambda \sum_{k=1}^{K} \beta_k^T \beta_k$$

$$\text{s.t. } A^T A = I_{K\times K}, \quad \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K. \quad (2.27)$$
We employ an alternating optimization method [37] to optimize A and B iteratively.
Fix A, optimize B: For a given $A$, we solve the following problem to obtain $B$:

$$\min\; H(B) = \sum_{k=1}^{K} \|G a_k - G\beta_k\|^2 + \lambda \sum_{k=1}^{K} \beta_k^T \beta_k \quad \text{s.t. } \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K. \quad (2.28)$$

Considering that the $\beta_k$ are independent in (2.28), we obtain each $\beta_k$ individually by solving the following optimization problem:
Algorithm 2.2: PDFL
Input: Reference set $Z = [z_1, z_2, \ldots, z_N] \in \mathbb{R}^{d\times N}$; training set $S = \{(x_i, y_i) \mid i = 1, 2, \ldots, M\}$, $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^d$.
Output: Coefficient matrix $B = [\beta_1, \beta_2, \ldots, \beta_K]$.
Step 1 (Initialization): Initialize $A \in \mathbb{R}^{N\times K}$ and $B \in \mathbb{R}^{N\times K}$, where each entry is set as 1.
Step 2 (Local optimization): For $t = 1, 2, \ldots, T$, repeat
  2.1. Compute B according to (2.28).
  2.2. Compute A according to (2.30)–(2.31).
  2.3. If $t > 2$ and $\|B_t - B_{t-1}\|_F \le \epsilon$ ($\epsilon$ is set as 0.001 in our experiments), go to Step 3.
Step 3 (Output coefficient matrix): Output coefficient matrix $B = B_t$.
$$\min\; H(\beta_k) = \|h_k - G\beta_k\|^2 + \lambda \beta_k^T \beta_k = \|g_k - P\beta_k\|^2 \quad \text{s.t. } \|\beta_k\|_1 \le \gamma \quad (2.29)$$

where $h_k = G a_k$, $g_k = [h_k^T, 0_N^T]^T$, $P = [G^T, \sqrt{\lambda}\, I_N]^T$, and $\beta_k$ can be easily obtained by using the conventional least angle regression solver [11].
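The $\beta_k$ update of (2.29) is an $\ell_1$-constrained least-squares problem solvable by least angle regression. The sketch below uses scikit-learn's LARS-based lasso in its penalized form; note that the $\ell_1$-ball constraint $\|\beta_k\|_1 \le \gamma$ corresponds to some penalty weight, so the `alpha` value here is an illustrative assumption that would need tuning:

```python
import numpy as np
from sklearn.linear_model import LassoLars

def update_beta(G, a_k, lam=1.0, alpha=0.01):
    """Sparse beta_k update of (2.29): stack G over sqrt(lam)*I so that
    ||g - P beta||^2 = ||G a_k - G beta||^2 + lam * beta^T beta, then solve
    the resulting sparse least-squares problem with a LARS-based lasso."""
    N = G.shape[1]
    h = G @ a_k                                   # h_k = G a_k
    g = np.concatenate([h, np.zeros(N)])          # g_k = [h_k^T, 0^T]^T
    P = np.vstack([G, np.sqrt(lam) * np.eye(N)])  # P = [G^T, sqrt(lam) I]^T
    model = LassoLars(alpha=alpha, fit_intercept=False)
    model.fit(P, g)
    return model.coef_

rng = np.random.default_rng(3)
G = rng.standard_normal((20, 15))
beta = update_beta(G, rng.standard_normal(15))
print(beta.shape)  # (15,)
```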
Fix B, optimize A: For a given $B$, we solve the following problem to obtain $A$:

$$\min\; H(A) = \|GA - GB\|^2 \quad \text{s.t. } A^T A = I_{K\times K}. \quad (2.30)$$

$A$ can be obtained by using SVD, namely

$$G^T G B = U S V^T, \quad \text{and} \quad A = \tilde{U} V^T \quad (2.31)$$

where $\tilde{U} = [u_1, u_2, \ldots, u_K]$ contains the top $K$ leading left singular vectors of $U = [u_1, u_2, \ldots, u_N]$.
We repeat the above two steps until the algorithm meets a certain convergence
condition. The proposed PDFL algorithm is summarized in Algorithm 2.2.
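The A update of (2.30)–(2.31) is an orthogonal Procrustes-style problem; a minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

def update_A(G, B):
    """A update of (2.30)-(2.31): SVD of G^T G B, then A = U_K V^T,
    where U_K holds the K leading left singular vectors."""
    U, s, Vt = np.linalg.svd(G.T @ G @ B, full_matrices=True)
    K = B.shape[1]
    return U[:, :K] @ Vt        # A is N x K with A^T A = I

rng = np.random.default_rng(4)
G = rng.standard_normal((12, 12))
B = rng.standard_normal((12, 4))
A = update_A(G, B)
print(np.allclose(A.T @ A, np.eye(4)))  # True
```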
2.2.3 Multiview Prototype-Based Discriminative Feature
Learning
Different feature descriptors usually capture complementary information to describe face images from different aspects [6], and it is helpful to improve the kinship verification performance with multiple feature descriptors. A natural solution for feature learning with multiview data is concatenating multiple features first and then
applying existing feature learning methods on the concatenated features. However,
it is not physically meaningful to directly combine different features because they
usually show different statistical characteristics and such a concatenation cannot well
exploit the feature diversity. In this work, we introduce a multiview PDFL (MPDFL)
method to learn a common coefficient matrix with multiple low-level descriptors for
mid-level feature representation for kinship verification.
Given the training set $S$, we first extract $L$ feature descriptors, denoted as $S^1, \ldots, S^L$, where $S^l = \{(x_1^l, y_1^l), \ldots, (x_i^l, y_i^l), \ldots, (x_M^l, y_M^l)\}$ is the $l$th feature representation, and $x_i^l \in \mathbb{R}^d$ and $y_i^l \in \mathbb{R}^d$ are the $i$th parent and child faces in the $l$th feature space, $l = 1, 2, \ldots, L$. MPDFL aims to learn a shared coefficient matrix $B$ with the sparsity constraint so that the intra-class variations are minimized and the inter-class variations are maximized in the mid-level feature spaces.
To exploit complementary information from facial images, we introduce a nonnegative weight vector $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_L]$ to weight each feature space of PDFL. Generally, the larger $\alpha_l$ is, the more the $l$th feature space contributes to learning the sparse coefficient matrix. MPDFL is formulated as the following objective function, by using an intermediate matrix $A = [a_1, a_2, \ldots, a_K] \in \mathbb{R}^{N\times K}$:

$$\max_{B,\alpha}\; \sum_{l=1}^{L} \alpha_l\, \mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B]$$

$$\text{s.t. } B^T B = I, \quad \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K, \quad \sum_{l=1}^{L} \alpha_l = 1, \; \alpha_l \ge 0 \quad (2.32)$$
where $F_1^l$, $F_2^l$, and $F_3^l$ are the counterparts of $F_1$, $F_2$, and $F_3$ in the $l$th feature space, $1 \le l \le L$.
The solution to (2.32) is $\alpha_l = 1$ for the feature descriptor with the maximal $\mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B]$ and $\alpha_l = 0$ for all other descriptors, so only one feature space would actually be used. To address this, we replace $\alpha_l$ with $\alpha_l^r$ ($r > 1$) and re-define the optimization function as

$$\max_{B,\alpha}\; \sum_{l=1}^{L} \alpha_l^r\, \mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B]$$

$$\text{s.t. } B^T B = I, \quad \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K, \quad \sum_{l=1}^{L} \alpha_l = 1, \; \alpha_l \ge 0. \quad (2.33)$$
Similar to PDFL, we also reformulate MPDFL as the following regression problem:

$$\min_{A,B,\alpha}\; \sum_{l=1}^{L}\sum_{k=1}^{K} \alpha_l^r \|G_l a_k - G_l \beta_k\|^2 + \lambda \sum_{k=1}^{K} \beta_k^T \beta_k$$

$$\text{s.t. } A^T A = I_{K\times K}, \quad \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K \quad (2.34)$$

where $F_l = G_l^T G_l$ and $F_l = F_1^l + F_2^l - F_3^l$.
Since (2.34) is non-convex with respect to $A$, $B$, and $\alpha$, we solve it iteratively, similar to PDFL, by using an alternating optimization method.
Fix A and B, optimize α: For the given $A$ and $B$, we construct a Lagrange function

$$L(\alpha, \zeta) = \sum_{l=1}^{L} \alpha_l^r\, \mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B] - \zeta\Big(\sum_{l=1}^{L} \alpha_l - 1\Big). \quad (2.35)$$
Setting $\frac{\partial L(\alpha,\zeta)}{\partial \alpha_l} = 0$ and $\frac{\partial L(\alpha,\zeta)}{\partial \zeta} = 0$, we have

$$r\alpha_l^{r-1}\, \mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B] - \zeta = 0 \quad (2.36)$$

$$\sum_{l=1}^{L} \alpha_l - 1 = 0. \quad (2.37)$$
Combining (2.36) and (2.37), we can obtain $\alpha_l$ as follows:

$$\alpha_l = \frac{\big(1/\mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B]\big)^{1/(r-1)}}{\sum_{l=1}^{L}\big(1/\mathrm{tr}[B^T (F_1^l + F_2^l - F_3^l) B]\big)^{1/(r-1)}}. \quad (2.38)$$
Fix A and α, optimize B: For the given A and α, we solve the following problem
to obtain B:
$$\min\; H(B) = \sum_{l=1}^{L}\sum_{k=1}^{K} \alpha_l^r \|G_l a_k - G_l \beta_k\|^2 + \lambda \sum_{k=1}^{K} \beta_k^T \beta_k \quad \text{s.t. } \|\beta_k\|_1 \le \gamma, \; k = 1, 2, \ldots, K. \quad (2.39)$$
Similar to PDFL, we individually obtain βk by solving the following optimization
problem:
$$\min\; H(\beta_k) = \sum_{l=1}^{L} \alpha_l^r \|G_l a_k - G_l \beta_k\|^2 + \lambda \beta_k^T \beta_k = \Big\|h_k - \Big(\sum_{l=1}^{L}\alpha_l^r G_l\Big)\beta_k\Big\|^2 + \lambda \beta_k^T \beta_k = \|g_k - P\beta_k\|^2 \quad \text{s.t. } \|\beta_k\|_1 \le \gamma \quad (2.40)$$

where $h_k = \sum_{l=1}^{L} \alpha_l^r G_l a_k$, $g_k = [h_k^T, 0_N^T]^T$, $P = \big[\big(\sum_{l=1}^{L}\alpha_l^r G_l\big)^T, \sqrt{\lambda}\, I_N\big]^T$, and $\beta_k$ can be obtained by using the conventional least angle regression solver [11].
Fix B and α, optimize A: For the given $B$ and $\alpha$, we solve the following problem to obtain $A$:

$$\min\; H(A) = \sum_{l=1}^{L} \alpha_l^r \|G_l A - G_l B\|^2 \quad \text{s.t. } A^T A = I_{K\times K}. \quad (2.41)$$

$A$ can be obtained by using SVD, namely

$$\Big(\sum_{l=1}^{L} \alpha_l^r G_l^T G_l\Big) B = U S V^T, \quad \text{and} \quad A = \tilde{U} V^T \quad (2.42)$$

where $\tilde{U} = [u_1, u_2, \ldots, u_K]$ contains the top $K$ leading left singular vectors of $U = [u_1, u_2, \ldots, u_N]$.
We repeat the above three steps until the algorithm converges to a local optimum.
Algorithm 2.3 summarizes the proposed MPDFL algorithm.
2.3 Evaluation
In this section, we conduct kinship verification experiments on four benchmark kin-
ship datasets to show the efficacy of feature learning methods for kinship verification
applications. The following details the results.
Algorithm 2.3: MPDFL
Input: $Z^l = [z_1^l, z_2^l, \ldots, z_N^l]$, the $l$th feature representation of the reference set; $S^l = \{(x_i^l, y_i^l) \mid i = 1, 2, \ldots, M\}$, the $l$th feature representation of the training set.
Output: Coefficient matrix $B = [\beta_1, \beta_2, \ldots, \beta_K]$.
Step 1 (Initialization):
  1.1. Initialize $A \in \mathbb{R}^{N\times K}$ and $B \in \mathbb{R}^{N\times K}$, where each entry is set as 1.
  1.2. Initialize $\alpha = [1/K, 1/K, \ldots, 1/K]$.
Step 2 (Local optimization): For $t = 1, 2, \ldots, T$, repeat
  2.1. Compute α according to (2.38).
  2.2. Compute B according to (2.39).
  2.3. Compute A according to (2.41)–(2.42).
  2.4. If $t > 2$ and $\|B_t - B_{t-1}\|_F \le \epsilon$ ($\epsilon$ is set as 0.001 in our experiments), go to Step 3.
Step 3 (Output coefficient matrix): Output coefficient matrix $B = B_t$.
2.3.1 Data Sets
Four publicly available face kinship datasets, namely KinFaceW-I [34],¹ KinFaceW-II [34],² Cornell KinFace [12],³ and UB KinFace [46],⁴ were used for our evaluation. Facial images in all these datasets were collected from the Internet via online search. Figure 2.7 shows some sample positive pairs from these four datasets.
There are four kin relations in both the KinFaceW-I and KinFaceW-II datasets:
Father-Son (F-S), Father-Daughter (F-D), Mother-Son (M-S), and Mother-Daughter
(M-D). For KinFaceW-I, these four relations contain 134, 156, 127, and 116 pairs,
respectively. For KinFaceW-II, each relation contains 250 pairs.
There are 150 pairs of parents and children images in the Cornell KinFace dataset,
where 40, 22, 13, and 25% of them are with the F-S, F-D, M-S, and M-D relation,
respectively.
There are 600 face images of 400 persons in the UB KinFace dataset. These
images are categorized into 200 groups, and each group has three images, which
correspond to facial images of the child, young parent and old parent, respectively.
For each group, we constructed two kinship face pairs: child and young parent, and
child and old parent. Therefore, we constructed two subsets from the UB KinFace
dataset: Set 1 (200 child and 200 young parent face images) and Set 2 (200 child and
200 old parent face images). Since there is a large imbalance among the different kinship relations in the UB KinFace database (nearly 80% of the pairs have the F-S relation), we did not separate the different kinship relations on this dataset.
¹ http://www.kinfacew.com.
² http://www.kinfacew.com.
³ http://chenlab.ece.cornell.edu/projects/KinshipVerification.
⁴ http://www.ece.neu.edu/~yunfu/research/Kinface/Kinface.htm.
Fig. 2.7 Some sample positive pairs (with a kinship relation) from different face kinship databases. Face images from the first to fourth rows are from the KinFaceW-I [34], KinFaceW-II [34], Cornell KinFace [12], and UB KinFace [48] databases, respectively. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
2.3.2 Experimental Settings
We randomly selected 4000 face images from the LFW dataset to construct the reference set, which was used for all four kinship face datasets to learn the mid-level feature representations. We aligned each face image in all datasets to 64 × 64 pixels using the provided eye positions and converted it into a gray-scale image. We applied three different feature descriptors, including LBP [2], spatial pyramid learning (SPLE) [53], and SIFT [32], to extract different and complementary information from each face image. The reason we selected these three features is that they have shown reasonably good performance in recent kinship verification studies [34, 53]. We followed the same parameter settings for these features as in [34] so that a fair comparison can be obtained.
For the LBP feature, 256 bins were used to describe each face image because this setting yields better performance. For the SPLE feature, we first constructed a sequence of grids at three different resolutions (0, 1, and 2), giving 21 cells in total. Then, each local feature in each cell was quantized into 200 bins, and each face image was represented by a 4200-dimensional feature vector. For the SIFT feature, we densely sampled and computed one 128-dimensional descriptor over each 16 × 16 patch, where the overlap between two neighboring patches is 8 pixels. Then, all SIFT descriptors were concatenated into a long feature vector. For these features, we applied principal component analysis (PCA) to reduce each feature to 100 dimensions to remove some noise components.
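The PCA reduction step can be sketched with scikit-learn; the 4200-dimensional size below mirrors the SPLE descriptor, and the data are random stand-ins:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.standard_normal((500, 4200))          # e.g. SPLE-sized descriptors
X100 = PCA(n_components=100).fit_transform(X)  # keep the 100 leading components
print(X100.shape)  # (500, 100)
```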
The fivefold cross-validation strategy was used in our experiments. We tuned the parameters of our PDFL and MPDFL methods on the KinFaceW-II dataset because it is the largest one, which makes parameter tuning more effective than on the other datasets. We divided the KinFaceW-II dataset into five folds of equal size, applied four folds to learn the coefficient matrix, and used the remaining fold for testing. Among the training folds, we used three to learn our models and the remaining one to tune the parameters of our methods. In our implementations, the parameters r, λ, γ, and K were empirically set as 5, 1, 0.5, and 500, respectively. Finally, the support vector machine (SVM) classifier with the RBF kernel is applied for verification.
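The evaluation protocol (pair features, fivefold cross-validation, RBF-kernel SVM) can be sketched as below on synthetic stand-in data. Note that the chapter does not state how a pair of mid-level features is turned into one SVM input, so the element-wise absolute difference used here is an assumption:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(8)
# Hypothetical mid-level features for 200 positive and 200 negative pairs.
K = 50
fx, fy = rng.standard_normal((400, K)), rng.standard_normal((400, K))
labels = np.repeat([1, 0], 200)
fy[labels == 1] += 0.8 * fx[labels == 1]   # make positive pairs correlated

# Pair representation: element-wise absolute difference (an assumption;
# the chapter only states that an RBF-kernel SVM performs verification).
pairs = np.abs(fx - fy)
scores = cross_val_score(SVC(kernel="rbf"), pairs, labels, cv=5)
print(scores.shape)  # (5,)
```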
2.3.3 Results and Analysis
2.3.3.1 Comparison with Existing Low-Level Feature Descriptors
We compared our PDFL and MPDFL methods with existing low-level feature descriptors. The difference between our methods and the existing feature representations is that we use the mid-level features rather than the original low-level features for verification. Tables 2.1, 2.2, 2.3, and 2.4 tabulate the verification rates of different feature descriptors on the KinFaceW-I, KinFaceW-II, Cornell KinFace, and UB KinFace kinship databases, respectively. From these tables, we see that our proposed PDFL and MPDFL outperform the best existing feature descriptors, with minimum gains in mean verification accuracy of 2.6 and 7.1%, 6.2 and 6.8%, 1.0 and 2.4%, and 0.9 and 4.6% on the KinFaceW-I, KinFaceW-II, Cornell KinFace, and UB KinFace kinship datasets, respectively.
Table 2.1 Verification rate (%) on the KinFaceW-I dataset. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Feature F-S F-D M-S M-D Mean
LBP 62.7 60.2 54.4 61.4 59.7
LBP+PDFL 65.7 65.5 60.4 67.4 64.8
LE 66.1 59.1 58.9 68.0 63.0
LE+PDFL 68.2 63.5 61.3 69.5 65.6
SIFT 65.5 59.0 55.5 55.4 58.9
SIFT+PDFL 67.5 62.0 58.8 57.9 61.6
MPDFL (All) 73.5 67.5 66.1 73.1 70.1
Table 2.2 Verification rate (%) on the KinFaceW-II dataset. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Feature F-S F-D M-S M-D Mean
LBP 64.0 63.5 62.8 63.0 63.3
LBP+PDFL 69.5 69.8 70.6 69.5 69.9
LE 69.8 66.1 72.8 72.0 70.2
LE+PDFL 77.0 74.3 77.0 77.2 76.4
SIFT 60.0 56.9 54.8 55.4 56.8
SIFT+PDFL 69.0 62.4 62.4 62.0 64.0
MPDFL (All) 77.3 74.7 77.8 78.0 77.0
Table 2.3 Verification rate (%) on the Cornell KinFace dataset. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Feature F-S F-D M-S M-D Mean
LBP 67.1 63.8 75.0 60.0 66.5
LBP+PDFL 67.9 64.2 77.0 60.8 67.5
LE 72.7 66.8 75.4 63.2 69.5
LE+PDFL 73.7 67.8 76.4 64.2 70.5
SIFT 64.5 67.3 68.4 61.8 65.5
SIFT+PDFL 66.5 69.3 69.4 62.8 67.0
MPDFL (All) 74.8 69.1 77.5 66.1 71.9
2.3.3.2 Comparison with State-of-the-Art Kinship Verification Methods
Table 2.5 compares our PDFL and MPDFL methods with the state-of-the-art kinship verification methods presented in the past several years. To further investigate the performance differences between our feature learning approach and the other compared methods, we evaluate the verification results using a null-hypothesis statistical test based on the Bernoulli model [5] to check whether the differences between
Table 2.4 Verification rate (%) on the UB KinFace dataset. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Feature Set 1 Set 2 Mean
LBP 63.4 61.2 62.3
LBP+PDFL 64.0 62.2 63.1
LE 61.9 61.3 61.6
LE+PDFL 62.8 63.5 63.2
SIFT 62.5 62.8 62.7
SIFT+PDFL 63.8 63.4 63.6
MPDFL (All) 67.5 67.0 67.3
Table 2.5 Verification accuracy (%) on different kinship datasets. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Method KinFaceW-I KinFaceW-II Cornell KinFace UB KinFace
Method in [12] N.A. N.A. 70.7 (0, 1) N.A.
Method in [48] N.A. N.A. N.A. 56.5 (1, 1)
NRML [34] 64.3 (0, 1) 75.7 (0, 1) 69.5 (0, 1) 65.6 (0, 1)
MNRML [34] 69.9 (0, 0) 76.5 (0, 1) 71.6 (0, 0) 67.1 (0, 0)
PDFL (best) 64.8 70.2 70.5 63.6
MPDFL 70.1 77.0 71.9 67.3
the results of our approach and those of other methods are statistically significant.
The results of the p-tests of PDFL and MPDFL are given in the brackets right after
the verification rate of each method in each table, where the number “1” represents
significant difference and “0” represents otherwise. There are two numbers in each
bracket, where the first represents the significant difference of PDFL and the second
represents that of MPDFL over previous methods. We see that PDFL achieves accuracy comparable to the existing state-of-the-art methods, and MPDFL obtains better performance than the existing kinship verification methods when the same kinship dataset is used for evaluation. Moreover, the improvement of MPDFL is significant for most comparisons.
Since our feature learning approach and previous metric learning methods exploit
discriminative information in the feature extraction and similarity measure stages,
respectively, we also conduct kinship verification experiments when both of them
are used for our verification task. Table2.6 tabulates the verification performance
when such discriminative information is exploited in different manners. We see that
the performance of our feature learning approach can be further improved when the
discriminative metric learning methods are applied.
Table 2.6 Verification accuracy (%) of extracting discriminative information in different stages on different kinship datasets. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Method KinFaceW-I KinFaceW-II Cornell KinFace UB KinFace
NRML 64.3 75.7 69.5 65.6
PDFL (best) 64.8 70.2 70.5 63.6
PDFL (best) + NRML 67.4 77.5 73.4 67.8
MNRML 69.9 76.5 71.6 67.1
MPDFL 70.1 77.0 71.9 67.3
MPDFL + MNRML 72.3 78.5 75.4 69.8
Table 2.7 Verification accuracy (%) of PDFL with the best single feature and different classifiers. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Method KinFaceW-I KinFaceW-II Cornell KinFace UB KinFace
NN 63.9 69.3 70.1 63.1
SVM 64.8 70.2 70.5 63.6
Table 2.8 Verification accuracy (%) of MPDFL with different classifiers. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Method KinFaceW-I KinFaceW-II Cornell KinFace UB KinFace
NN 69.6 76.0 70.4 66.2
SVM 70.1 77.0 71.9 67.3
2.3.3.3 Comparison with Different Classifiers
We investigated the performance of our PDFL (with the best single feature) and MPDFL with different classifiers. In our experiments, we evaluated two classifiers: (1) SVM and (2) NN. For the NN classifier, the cosine similarity of two face images is used. Tables 2.7 and 2.8 tabulate the mean verification rates of our PDFL and MPDFL when different classifiers were used for verification. We see that our feature learning methods are not sensitive to the selection of the classifier.
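The NN verification rule with cosine similarity can be sketched as follows; the threshold value is illustrative only, since in practice it would be chosen on the training folds:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def verify(u, v, threshold=0.5):
    """NN-style verification: declare a kin relation when the cosine
    similarity of the pair exceeds a threshold (value illustrative)."""
    return cosine_similarity(u, v) >= threshold

u = np.array([1.0, 0.0, 1.0])
print(np.isclose(cosine_similarity(u, u), 1.0))  # True
```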
2.3.3.4 Parameter Analysis
We took the KinFaceW-I dataset as an example to investigate the verification performance and training cost of our MPDFL versus varying values of K and γ. Figures 2.8 and 2.9 show the mean verification accuracy and the training time of MPDFL for different K and γ. We see that setting K = 500 and γ = 0.5 offers a good tradeoff between the efficiency and effectiveness of our proposed method.
Fig. 2.8 Mean verification accuracy (%) of our MPDFL versus different values of K and γ (γ from 0.1 to 0.7; curves for K = 300, 500, and 700). © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Fig. 2.9 Training time (in seconds) of our MPDFL versus different values of K and γ (γ from 0.1 to 0.7; curves for K = 300, 500, and 700). © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Figure 2.10 shows the mean verification rates of PDFL and MPDFL versus the number of iterations on the KinFaceW-I dataset. We see that PDFL and MPDFL achieve stable verification performance within several iterations.
We investigated the effect of the parameter r in MPDFL. Figure 2.11 shows the verification rate of MPDFL versus different values of r on the different kinship datasets.
Fig. 2.10 Mean verification accuracy (%) of LBP+PDFL, LE+PDFL, SIFT+PDFL, and MPDFL versus the number of iterations on the KinFaceW-I dataset. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Fig. 2.11 Mean verification rate of our MPDFL versus different values of r on the KinFaceW-I, KinFaceW-II, Cornell KinFace, and UB KinFace datasets. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
We observe that our MPDFL method is in general robust to the parameter r, and the best verification performance is obtained when r is set to 5.
2.3.3.5 Computational Time
We compare the computational time of the proposed PDFL and MPDFL methods with state-of-the-art metric learning-based kinship verification methods, including NRML and MNRML. Our hardware consists of a 2.4-GHz CPU and 6 GB of RAM. Table 2.9 shows the time spent on the training and testing stages of the different methods, where the Matlab software, the KinFaceW-I database, and the SVM classifier were used. We see that the computational times of our feature learning methods are comparable to those of NRML and MNRML.
2.3.3.6 Comparison with Human Observers in Kinship Verification
Human ability in kinship verification was evaluated in [34]. We also compared our method with humans on the KinFaceW-I and KinFaceW-II datasets. For a fair comparison between human ability and our proposed approach, the training samples and their kin labels used in our approach were selected and presented to 10 human observers (5 males and 5 females) aged 20-30 years [34] to provide the prior knowledge needed to learn the kin relation from human face images. Then, the testing samples used in our experiments were presented to these observers to evaluate human performance in kinship verification. There are two evaluations for humans, HumanA and HumanB in [34], where only the face region and the whole original face were presented to the human observers, respectively. Table 2.10 shows the
Table 2.9 CPU time (in seconds) used by different kinship verification methods on the KinFaceW-I database. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Method Training Testing
NRML 18.55 0.95
MNRML 22.35 0.95
PDFL 18.25 0.95
MPDFL 21.45 0.95
Table 2.10 Performance comparison (%) between our methods and humans on kinship verification on the KinFaceW-I and KinFaceW-II datasets, respectively. © [2015] IEEE. Reprinted, with permission, from Ref. [51]
Method KinFaceW-I KinFaceW-II
F-S F-D M-S M-D F-S F-D M-S M-D
HumanA 62.0 60.0 68.0 72.0 63.0 63.0 71.0 75.0
HumanB 68.0 66.5 74.0 75.0 72.0 72.5 77.0 80.0
PDFL (LE) 68.2 63.5 61.3 69.5 77.0 74.3 77.0 77.2
MPDFL 73.5 67.5 66.1 73.1 77.3 74.7 77.8 78.0
performance of these observers and our approach. We see that our proposed methods
achieve even better kinship verification performance than human observers on most
subsets of these two kinship datasets.
We make the following four observations from the above experimental results listed in Tables 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9 and 2.10, and Figs. 2.8, 2.9, 2.10, and 2.11:
1. Learning discriminative mid-level features achieves better verification performance than using the original low-level features. This is because the learned mid-level features exploit discriminative information while the original low-level features cannot.
2. MPDFL achieves better performance than PDFL, which indicates that combining multiple low-level descriptors to learn mid-level features is better than using a single one, because multiple features provide complementary information for feature learning.
3. PDFL achieves comparable performance to, and MPDFL achieves better performance than, existing kinship verification methods. The reason is that most existing kinship verification methods use low-level hand-crafted features for face representation, which are not discriminative enough to characterize the kin relation of face images.
4. Both PDFL and MPDFL achieve better kinship verification performance than human observers, which further shows the potential of our computational face-based kinship verification models for practical applications.
References
1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns. In: Euro-
pean Conference on Computer Vision, pp. 469–481 (2004)
2. Ahonen, T., Hadid, A., et al.: Face description with local binary patterns: application to face
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 28(12), 2037–2041 (2006)
3. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep
networks. In: NIPS, pp. 153–160 (2007)
4. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives.
IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
5. Beveridge, J.R., She, K., Draper, B., Givens, G.H.: Parametric and nonparametric methods for
the statistical evaluation of human id algorithms. In: International Workshop on the Empirical
Evaluation of Computer Vision Systems (2001)
6. Bickel, S., Scheffer, T.: Multi-view clustering. In: IEEE International Conference on Data
Mining, pp. 19–26 (2004)
7. Cao, Z., Yin, Q., Tang, X., Sun, J.: Face recognition with learning-based descriptor. In: CVPR,
pp. 2707–2714 (2010)
8. Deng, W., Liu, Y., Hu, J., Guo, J.: The small sample size problem of ICA: a comparative study
and analysis. Pattern Recognit. 45(12), 4438–4450 (2012)
9. Deng, W., Hu, J., Lu, J., Guo, J.: Transform-invariant PCA: a unified approach to fully automatic
face alignment, representation, and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 36(6),
1275–1284 (2014)
10. Dornaika, F., Bosaghzadeh, A.: Exponential local discriminant embedding and its application
to face recognition. IEEE Trans. Cybern. 43(3), 921–934 (2013)
11. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2),
407–499 (2004)
12. Fang, R., Tang, K., Snavely, N., Chen, T.: Towards computational models of kinship verification. In: IEEE International Conference on Image Processing, pp. 1577–1580 (2010)
13. Fang, R., Gallagher, A.C., Chen, T., Loui, A.: Kinship classification by modeling facial feature
heredity, pp. 2983–2987 (2013)
14. Gong, Y., Lazebnik, S.: Iterative quantization: a procrustean approach to learning binary codes.
In: CVPR, pp. 817–824 (2011)
15. Guo, G., Wang, X.: Kinship measurement on salient facial features. IEEE Trans. Instrum. Meas.
61(8), 2322–2325 (2012)
16. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural
Comput. 18(7), 1527–1554 (2006)
17. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for
studying face recognition in unconstrained environments. Technical Report 07-49, University
of Massachusetts, Amherst (2007)
18. Huang, G.B., Lee, H., Learned-Miller, E.: Learning hierarchical representations for face veri-
fication with convolutional deep belief networks. In: IEEE International Conference Computer
Vision and Pattern Recognition, pp. 2518–2525 (2012)
19. Hussain, S.U., Napoléon, T., Jurie, F., et al.: Face recognition using local quantized patterns.
In: BMVC, pp. 1–12 (2012)
20. Hyvärinen, A., Hurri, J., Hoyer, P.O.: Independent component analysis. Natural Image Statis-
tics, pp. 151–175 (2009)
21. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
22. Jin, H., Liu, Q., Lu, H., Tong, X.: Face detection using improved LBP under Bayesian framework. In: International Conference on Image and Graphics, pp. 306–309 (2004)
23. Kan, M., Xu, D., Shan, S., Li, W., Chen, X.: Learning prototype hyperplanes for face verification
in the wild. IEEE Trans. Image Process. 22(8), 3310–3316 (2013)
24. Kohli, N., Singh, R., Vatsa, M.: Self-similarity representation of weber faces for kinship classi-
fication. In: IEEE International Conference on Biometrics: Theory, Applications, and Systems,
pp. 245–250 (2012)
25. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1106–1114 (2012)
26. Le, Q.V., Karpenko, A., Ngiam, J., Ng, A.Y.: ICA with reconstruction cost for efficient overcomplete feature learning. In: NIPS, pp. 1017–1025 (2011)
27. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal
features for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368
(2011)
28. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal
features for action recognition with independent subspace analysis. In: IEEE International
Conference on Computer Vision and Pattern Recognition, pp. 3361–3368 (2011)
29. Lei, Z., Pietikainen, M., Li, S.Z.: Learning discriminant face descriptor. IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 289–302 (2014)
30. Li, X., Shen, C., Dick, A.R., van den Hengel, A.: Learning compact binary codes for visual
tracking. In: CVPR, pp. 2419–2426 (2013)
31. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear
discriminant model for face recognition. IEEE Trans. Image Process. 11(4), 467–476 (2002)
32. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2),
91–110 (2004)
33. Lu, J., Tan, Y.P.: Regularized locality preserving projections and its extensions for face recog-
nition. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 40(3), 958–963 (2010)
34. Lu, J., Zhou, X., Tan, Y.P., Shang, Y., Zhou, J.: Neighborhood repulsed metric learning for
kinship verification. IEEE Trans. Pattern Anal. Mach. Intell. 34(2), 331–345 (2014)
35. Lu, J., Liong, V.E., Zhou, X., Zhou, J.: Learning compact binary face descriptor for face
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(10), 2041–2056 (2015)
36. Norouzi, M., Fleet, D., Salakhutdinov, R.: Hamming distance metric learning. In: NIPS, pp.
1070–1078 (2012)
37. Qiao, Z., Zhou, L., Huang, J.Z.: Sparse linear discriminant analysis with applications to high
dimensional low sample size data. Int. J. Appl. Math. 39(1), 48–60 (2009)
38. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: Explicit
invariance during feature extraction. In: International Conference on Machine Learning, pp.
833–840 (2011)
39. Somanath, G., Kambhamettu, C.: Can faces verify blood-relations? In: IEEE International
Conference on Biometrics: Theory, Applications and Systems, pp. 105–112 (2012)
40. Trzcinski, T., Lepetit, V.: Efficient discriminative projections for compact binary descriptors.
In: ECCV, pp. 228–242 (2012)
41. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust
features with denoising autoencoders. In: International Conference on Machine Learning, pp.
1096–1103 (2008)
42. Wang, N., Yeung, D.Y.: Learning a deep compact image representation for visual tracking. In:
Advances in Neural Information Processing Systems, pp. 809–817 (2013)
43. Wang, J., Kumar, S., Chang, S.F.: Semi-supervised hashing for scalable image retrieval. In:
CVPR, pp. 3424–3431 (2010)
44. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS, pp. 1753–1760 (2008)
45. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math.
Program. 1–38 (2013)
46. Xia, S., Shao, M., Fu, Y.: Kinship verification through transfer learning. In: International Joint
Conference on Artificial Intelligence, pp. 2539–2544 (2011)
47. Xia, S., Shao, M., Fu, Y.: Toward kinship verification using visual attributes. In: IEEE Inter-
national Conference on Pattern Recognition, pp. 549–552 (2012)
48. Xia, S., Shao, M., Luo, J., Fu, Y.: Understanding kin relationships in a photo. IEEE Trans.
Multimed. 14(4), 1046–1056 (2012)
49. Xu, Y., Li, X., Yang, J., Lai, Z., Zhang, D.: Integrating conventional and inverse representation
for face recognition. IEEE Trans. Cybern. 44(10), 1738–1746 (2014)
50. Yan, H., Lu, J., Deng, W., Zhou, X.: Discriminative multimetric learning for kinship verification. IEEE Trans. Inf. Forensics Secur. 9(7), 1169–1178 (2014)
51. Yan, H., Lu, J., Zhou, X.: Prototype-based discriminative feature learning for kinship verifica-
tion. IEEE Trans. Cybern. 45(11), 2535–2545 (2015)
52. Yang, J., Zhang, D., Yang, J.Y.: Constructing PCA baseline algorithms to reevaluate ICA-based face-recognition performance. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 37(4), 1015–1021 (2007)
53. Zhou, X., Hu, J., Lu, J., Shang, Y., Guan, Y.: Kinship verification from facial images under uncontrolled conditions. In: ACM International Conference on Multimedia, pp. 953–956 (2011)
54. Zhou, X., Lu, J., Hu, J., Shang, Y.: Gabor-based gradient orientation pyramid for kinship verification under uncontrolled environments. In: ACM International Conference on Multimedia, pp. 725–728 (2012)
3 Metric Learning for Facial Kinship Verification
objective function is formulated based on some supervised information of the train-
ing samples to learn the distance metric. The difference among these methods lies
mainly in the objective functions, which are designed for their specific tasks. Typical
supervised metric learning methods include linear discriminant analysis (LDA) [1],
neighborhood component analysis (NCA) [6], cosine similarity metric learning
(CSML) [18], large margin nearest neighbor (LMNN) [24], and information the-
oretic metric learning (ITML) [5].
Metric learning methods have been successfully employed in various visual analy-
sis tasks [7, 10, 11, 13, 14, 16, 22, 25, 29]. For example, Lu et al. proposed a
neighborhood repulsed metric learning (NRML) [15] method by considering the dif-
ferent importance of different negative samples for kinship verification. Yan et al.
presented a discriminative multi-metric learning (DMML) [28] by making use of
multiple feature descriptors for kinship verification. However, most previous metric learning-based kinship verification methods are developed with the Euclidean similarity metric, which is not powerful enough to measure the similarity of face samples, especially when they are captured in wild conditions. Since the correlation similarity metric can better handle face variations than the Euclidean similarity metric, we propose a new metric learning method that uses the correlation similarity measure for kinship verification.
Metric learning seeks a suitable distance metric from a training set of data points. In this section, we briefly review several widely used and representative metric learning methods, including principal component analysis (PCA) [23],
linear discriminant analysis (LDA) [1], locality preserving projections (LPP) [8],
information theoretic metric learning (ITML) [5], side-information-based linear dis-
criminant analysis (SILD) [9], KISS metric learning (KISSME) [10], and cosine
similarity metric learning (CSML) [18].
3.1.1 Principal Component Analysis
Consider a set of samples represented as vectors, $\{x_i \in \mathbb{R}^d, i = 1, 2, \ldots, N\}$, with corresponding labels $\{l_i \in \{1, 2, \ldots, c\}, i = 1, 2, \ldots, N\}$, where $N$ is the number of samples, $d$ is the feature dimension of each sample, $c$ is the number of classes, and $x_i$ carries the class label $l_i$. The objective of subspace learning is to find a linear mapping $W = [w_1, w_2, \ldots, w_k]$ that projects $\{x_i, i = 1, 2, \ldots, N\}$ into a low-dimensional representation $\{y_i \in \mathbb{R}^m, i = 1, 2, \ldots, N\}$, i.e., $y_i = W^T x_i$, with $m \ll d$. The essential difference among subspace learning methods lies in how the mapping $W$ is defined and found.
PCA seeks to find a set of projection axes such that the global scatter is maximized
after the projection of the samples. The global scatter can be characterized by the
mean square of the Euclidean distance between any pair of the projected sample
points, defined as [23]
3.1 Conventional Metric Learning
$$J_T(w) = \frac{1}{2N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} (y_i - y_j)^2 \qquad (3.1)$$
We can simplify $J_T(w)$ to the following form:

$$J_T(w) = \frac{1}{2N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} (w^T x_i - w^T x_j)(w^T x_i - w^T x_j)^T = w^T \left[ \frac{1}{2N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)(x_i - x_j)^T \right] w \qquad (3.2)$$
Let

$$S_T = \frac{1}{2N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)(x_i - x_j)^T \qquad (3.3)$$

and let the mean vector be $m = \frac{1}{N} \sum_{i=1}^{N} x_i$; then $S_T$ can be calculated as follows:

$$S_T = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)(x_i - m)^T \qquad (3.4)$$

Then (3.1) can be rewritten as

$$J_T(w) = w^T S_T w \qquad (3.5)$$
The projections $\{w_1, w_2, \ldots, w_k\}$ that maximize $J_T(w)$ form an orthogonal set of vectors, namely the eigenvectors of $S_T$ associated with the $k$ largest eigenvalues, $k < d$; this is the solution of PCA.
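The derivation above maps directly to a few lines of NumPy. The following is a minimal sketch of Eqs. (3.4)–(3.5), not the implementation used in the experiments; the function name `pca` and the random data are ours, for illustration only:

```python
import numpy as np

def pca(X, k):
    """PCA per Eqs. (3.3)-(3.5): eigenvectors of the total scatter S_T
    associated with the k largest eigenvalues (k < d)."""
    m = X.mean(axis=0)                  # mean vector m = (1/N) sum_i x_i
    Xc = X - m                          # centered samples x_i - m
    S_T = Xc.T @ Xc / X.shape[0]        # total scatter matrix, Eq. (3.4)
    evals, evecs = np.linalg.eigh(S_T)  # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :k]           # flip to descending order, keep top k
    return W, m

# usage: y_i = W^T (x_i - m) is the low-dimensional representation
X = np.random.default_rng(0).normal(size=(100, 5))
W, m = pca(X, 2)
Y = (X - m) @ W
```

Because the columns of $W$ are eigenvectors of a symmetric matrix, they come out orthonormal, and the variance of the projected data decreases column by column.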
3.1.2 Linear Discriminant Analysis
LDA seeks to find a set of projection axes such that the Fisher criterion (the ratio of the
between-class scatter to the within-class scatter) is maximized after the projection.
The between-class scatter SB and the within-class scatter SW are defined as [1]
$$S_B = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T \qquad (3.6)$$
$$S_W = \sum_{i=1}^{c} \sum_{j=1}^{N_i} (x_{ij} - m_i)(x_{ij} - m_i)^T \qquad (3.7)$$
where $x_{ij}$ denotes the $j$th training sample of the $i$th class, $N_i$ is the number of training samples in the $i$th class, $m_i$ is the mean of the training samples of the $i$th class, and $m$ is the mean of all training samples. The objective function of LDA is defined as
$$\max_{w} \frac{w^T S_B w}{w^T S_W w} \qquad (3.8)$$
The corresponding projections $\{w_1, w_2, \ldots, w_k\}$ are the eigenvectors of the generalized eigenvalue problem

$$S_B w = \lambda S_W w \qquad (3.9)$$
Let $\{w_1, w_2, \ldots, w_k\}$ be the eigenvectors corresponding to the $k$ largest eigenvalues $\{\lambda_i \mid i = 1, 2, \ldots, k\}$, ordered decreasingly, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$; then $W = [w_1, w_2, \ldots, w_k]$ is the learned mapping of LDA. Since the rank of $S_B$ is bounded by $c - 1$, $k$ is at most $c - 1$.
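The LDA solution of Eq. (3.9) can likewise be sketched in a few lines. This minimal version is illustrative rather than the implementation used in our experiments; the small ridge added to $S_W$ is our own choice, to guard against singularity in the small-sample case:

```python
import numpy as np

def lda(X, labels, k):
    """LDA per Eqs. (3.6)-(3.9): eigenvectors of S_W^{-1} S_B with the
    k largest eigenvalues; k is at most c - 1."""
    d = X.shape[1]
    m = X.mean(axis=0)
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mc - m, mc - m)   # between-class scatter, Eq. (3.6)
        S_W += (Xc - mc).T @ (Xc - mc)              # within-class scatter, Eq. (3.7)
    # small ridge on S_W (our choice) guards against singularity
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + 1e-6 * np.eye(d), S_B))
    order = np.argsort(evals.real)[::-1]            # largest eigenvalues first
    return evecs[:, order[:k]].real

# usage on three well-separated Gaussian classes (toy data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(i * 4.0, 1.0, size=(30, 4)) for i in range(3)])
labels = np.repeat([0, 1, 2], 30)
W = lda(X, labels, 2)
```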
3.1.3 Locality Preserving Projections
LPP [8] is a manifold learning method whose aim is to preserve the intrinsic geometric structure of the original data, so that samples lying in a neighborhood maintain their locality relationship after projection. Specifically, LPP first constructs an affinity graph to characterize the neighborhood relationships of the training set, and then seeks a low-dimensional embedding that preserves the intrinsic geometry and local structure. The objective function of LPP is formulated as follows:
$$\min \sum_{ij} (y_i - y_j)^2 S_{ij} \qquad (3.10)$$
where $y_i$ and $y_j$ are the low-dimensional representations of $x_i$ and $x_j$. The affinity matrix $S$ can be defined as:
$$S_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / t), & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases} \qquad (3.11)$$
where $t$ and $k$ are two prespecified parameters defining the local neighborhood, and $N_k(x)$ denotes the $k$ nearest neighbors of $x$. Following some simple algebraic deduction steps [8], one can obtain
$$\frac{1}{2} \sum_{ij} (y_i - y_j)^2 S_{ij} = w^T X L X^T w \qquad (3.12)$$
where $L = D - S$ is the Laplacian matrix and $D$ is a diagonal matrix whose entries are the column sums of $S$, i.e., $D_{ii} = \sum_{j} S_{ji}$. The matrix $D$ provides a natural measure on the data points: the larger the value of $D_{ii}$, the more "important" $y_i$ is. Usually, one can impose the constraint:
$$y^T D y = 1 \;\Rightarrow\; w^T X D X^T w = 1. \qquad (3.13)$$
Then, this minimization problem can be converted into the following constrained optimization problem:

$$w_{\mathrm{opt}} = \arg\min_{w} \; w^T X L X^T w \quad \text{s.t.} \quad w^T X D X^T w = 1. \qquad (3.14)$$
Finally, the bases of LPP are the eigenvectors of the following generalized eigenvalue problem:

$$X L X^T w = \lambda X D X^T w \qquad (3.15)$$
Let $w_1, w_2, \ldots, w_k$ be the solutions of (3.15), ordered according to their eigenvalues, $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$; then $W = [w_1, w_2, \ldots, w_k]$ is the resulting mapping of LPP.
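A minimal sketch of the LPP pipeline of Eqs. (3.10)–(3.15) follows. It stores samples row-wise, so `X.T @ L @ X` realizes $X L X^T$ in the book's column-vector notation; the neighborhood parameters and the small ridge are illustrative defaults of ours:

```python
import numpy as np

def lpp(X, k_nn=5, t=1.0, k_dims=2):
    """LPP per Eqs. (3.10)-(3.15): build the affinity graph S, form the
    Laplacian L = D - S, and keep the eigenvectors of the generalized
    problem X L X^T w = lambda X D X^T w with the smallest eigenvalues.
    X holds one sample per row (transposed w.r.t. the book's notation)."""
    N, d = X.shape
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    S = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D2[i])[1:k_nn + 1]     # k nearest neighbors, excluding self
        S[i, nbrs] = np.exp(-D2[i, nbrs] / t)    # heat-kernel weights, Eq. (3.11)
    S = np.maximum(S, S.T)                       # x_i in N_k(x_j) or x_j in N_k(x_i)
    Dg = np.diag(S.sum(axis=1))
    L = Dg - S                                   # graph Laplacian
    A = X.T @ L @ X                              # X L X^T
    B = X.T @ Dg @ X + 1e-6 * np.eye(d)          # ridge (our choice) for invertibility
    evals, evecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(evals.real)               # smallest eigenvalues first
    return evecs[:, order[:k_dims]].real

W = lpp(np.random.default_rng(2).normal(size=(40, 6)), k_nn=4, t=2.0, k_dims=2)
```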
3.1.4 Information-Theoretical Metric Learning
Let $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$ be a training set consisting of $N$ data points. The aim of common distance metric learning approaches is to seek a positive semidefinite (PSD) matrix $M \in \mathbb{R}^{d \times d}$ under which the squared Mahalanobis distance between two data points $x_i$ and $x_j$ is computed as:

$$d_M^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \qquad (3.16)$$

where $d$ is the dimension of data point $x_i$.
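Eq. (3.16) is simple to state in code; with $M = I$ it reduces to the squared Euclidean distance. The helper name below is ours, for illustration:

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance of Eq. (3.16): (x_i - x_j)^T M (x_i - x_j)."""
    diff = xi - xj
    return float(diff @ M @ diff)

# with M = I the distance reduces to the squared Euclidean distance
xi, xj = np.array([1.0, 2.0]), np.array([3.0, 1.0])
d2 = mahalanobis_sq(xi, xj, np.eye(2))   # (-2)^2 + 1^2 = 5
```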
Information-theoretic metric learning (ITML) [5] is a typical metric learning
method, which exploits the relationship of the multivariate Gaussian distribution
and the set of Mahalanobis distances to generalize the regular Euclidean distance.
The basic idea of ITML is to find a PSD matrix $M$ that approaches a predefined matrix $M_0$ by minimizing the LogDet divergence between the two matrices, under the constraints that the squared distance $d_M^2(x_i, x_j)$ of a positive (similar) pair is smaller than a threshold $\tau_p$ while that of a negative (dissimilar) pair is larger than a threshold $\tau_n$, where $\tau_n > \tau_p > 0$. By imposing these constraints on all pairs of the training set, ITML can be formulated as the following
LogDet optimization problem:

$$\begin{aligned} \min_{M} \quad & D_{\mathrm{ld}}(M, M_0) = \operatorname{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - d \\ \text{s.t.} \quad & d_M^2(x_i, x_j) \le \tau_p \quad \forall\, \ell_{ij} = 1 \\ & d_M^2(x_i, x_j) \ge \tau_n \quad \forall\, \ell_{ij} = -1, \end{aligned} \qquad (3.17)$$
in which the predefined metric $M_0$ is set to the identity matrix in our experiments, $\operatorname{tr}(A)$ is the trace of a square matrix $A$, and $\ell_{ij}$ denotes the pairwise label of the data points $x_i$ and $x_j$: $\ell_{ij} = 1$ for a similar pair (with kinship) and $\ell_{ij} = -1$ for a dissimilar pair (without kinship). In practice, the optimization problem (3.17) is solved with iterative Bregman projections, which project the current solution onto a single constraint at each step via the update:
$$M_{t+1} = M_t + \beta\, M_t (x_i - x_j)(x_i - x_j)^T M_t, \qquad (3.18)$$
in which β is a projection variable which is controlled by both the learning rate and
the pairwise label of a pair of data points.
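To convey the flavor of the rank-one update in Eq. (3.18), the toy sketch below contracts a violated similar pair and expands a violated dissimilar pair along the pair direction. The exact $\beta$ of ITML involves slack variables and dual updates; the simple rate-based $\beta$ here is our illustrative surrogate, not the published algorithm:

```python
import numpy as np

def itml_step(M, xi, xj, label, tau_p=1.0, tau_n=4.0, gamma=0.1):
    """One Bregman-style rank-one update in the spirit of Eq. (3.18).

    A violated similar pair (label = +1, distance > tau_p) is contracted,
    a violated dissimilar pair (label = -1, distance < tau_n) is expanded.
    The beta here is a simple rate-based surrogate, not ITML's exact beta.
    """
    diff = (xi - xj).reshape(-1, 1)
    dist = float(diff.T @ M @ diff)
    if label == 1 and dist > tau_p:
        beta = -gamma / (1.0 + gamma * dist)          # contract; keeps M PSD
    elif label == -1 and dist < tau_n:
        beta = gamma / max(1.0 - gamma * dist, 1e-8)  # expand
    else:
        return M                                      # constraint already satisfied
    return M + beta * (M @ diff) @ (diff.T @ M)       # Eq. (3.18) update shape

# a similar pair at squared distance 9 > tau_p gets pulled closer
M = itml_step(np.eye(2), np.array([3.0, 0.0]), np.array([0.0, 0.0]), label=1)
```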
3.1.5 Side-Information Linear Discriminant Analysis
Side-information-based linear discriminant analysis (SILD) [9] makes use of the side information of pairs of data points, estimating the within-class scatter matrix $C_p$ from the positive pairs and the between-class scatter matrix $C_n$ from the negative pairs of the training set:
$$C_p = \sum_{\ell_{ij} = 1} (x_i - x_j)(x_i - x_j)^T, \qquad (3.19)$$

$$C_n = \sum_{\ell_{ij} = -1} (x_i - x_j)(x_i - x_j)^T. \qquad (3.20)$$
Then, SILD learns a discriminative linear projection $W \in \mathbb{R}^{d \times m}$, $m \le d$, by solving the optimization problem:

$$\max_{W} \frac{\det(W^T C_n W)}{\det(W^T C_p W)}. \qquad (3.21)$$
By diagonalizing $C_p$ and $C_n$ as:

$$C_p = U D_p U^T, \qquad (U D_p^{-1/2})^T C_p\, (U D_p^{-1/2}) = I, \qquad (3.22)$$

$$(U D_p^{-1/2})^T C_n\, (U D_p^{-1/2}) = V D_n V^T, \qquad (3.23)$$
the projection matrix W can be computed as:
$$W = U D_p^{-1/2} V, \qquad (3.24)$$
in which matrices U and V are orthogonal, and matrices Dp and Dn are diagonal. In
the transformed subspace, the squared Euclidean distance of a pair of data points xi
and xj is calculated by:
$$d_W^2(x_i, x_j) = \| W^T x_i - W^T x_j \|_2^2 = (x_i - x_j)^T W W^T (x_i - x_j) = (x_i - x_j)^T M (x_i - x_j). \qquad (3.25)$$
This distance is equivalent to computing the squared Mahalanobis distance in the original space, with $M = W W^T$.
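The diagonalization steps of Eqs. (3.22)–(3.24) can be sketched as follows. The small ridge on $C_p$ is our addition to keep $D_p^{-1/2}$ well defined, and the pair-list interface and toy data are illustrative:

```python
import numpy as np

def sild(pos_pairs, neg_pairs, m):
    """SILD per Eqs. (3.19)-(3.24): diagonalize C_p, whiten with U D_p^{-1/2},
    diagonalize the whitened C_n, and return W = U D_p^{-1/2} V (m columns)."""
    d = pos_pairs[0][0].shape[0]
    Cp = sum(np.outer(a - b, a - b) for a, b in pos_pairs)   # Eq. (3.19)
    Cn = sum(np.outer(a - b, a - b) for a, b in neg_pairs)   # Eq. (3.20)
    Cp += 1e-6 * np.eye(d)                 # ridge (our choice) so D_p^{-1/2} exists
    Dp, U = np.linalg.eigh(Cp)             # C_p = U D_p U^T, Eq. (3.22)
    T = U @ np.diag(Dp ** -0.5)            # whitening transform U D_p^{-1/2}
    Dn, V = np.linalg.eigh(T.T @ Cn @ T)   # Eq. (3.23)
    W = T @ V                              # Eq. (3.24)
    order = np.argsort(Dn)[::-1]           # directions with the largest ratio first
    return W[:, order[:m]]

rng = np.random.default_rng(3)
pos_pairs = [(x, x + 0.2 * rng.normal(size=4)) for x in rng.normal(size=(50, 4))]
neg_pairs = [(x, x + 2.0 + rng.normal(size=4)) for x in rng.normal(size=(50, 4))]
W = sild(pos_pairs, neg_pairs, m=2)
```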
3.1.6 Keep It Simple and Straightforward Metric Learning
Keep it simple and straightforward metric learning (KISSME) [10] aims to learn a distance metric from the perspective of statistical inference. KISSME decides whether a pair of data points $x_i$ and $x_j$ is dissimilar (negative) or not via a likelihood ratio test. The hypothesis $H_0$ states that a pair of data points is dissimilar, and the hypothesis $H_1$ states that the pair is similar. The log-likelihood ratio is given by:
$$\delta(x_i, x_j) = \log \frac{p(x_i, x_j \mid H_0)}{p(x_i, x_j \mid H_1)}, \qquad (3.26)$$
where $p(x_i, x_j \mid H_0)$ is the probability density function of a pair of data points under the hypothesis $H_0$. The hypothesis $H_0$ is accepted if $\delta(x_i, x_j)$ is larger than a nonnegative threshold; otherwise $H_0$ is rejected and the pair is considered similar. By assuming a single Gaussian distribution for the pairwise difference $z_{ij} = x_i - x_j$ and relaxing problem (3.26), $\delta(x_i, x_j)$ is simplified to:
$$\delta(x_i, x_j) = (x_i - x_j)^T (C_p^{-1} - C_n^{-1})(x_i - x_j), \qquad (3.27)$$
in which the covariance matrices Cp and Cn are computed by (3.19) and (3.20),
respectively.
To obtain a PSD Mahalanobis matrix $M$, KISSME projects $\hat{M} = C_p^{-1} - C_n^{-1}$ onto the cone of positive semidefinite matrices by clipping the spectrum of $\hat{M}$ via eigenvalue decomposition.
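The whole KISSME recipe, including the final clipping of $\hat{M}$ onto the PSD cone, fits in a short sketch; the function name and toy data below are ours, for illustration:

```python
import numpy as np

def kissme(pos_pairs, neg_pairs):
    """KISSME per Eq. (3.27): M_hat = C_p^{-1} - C_n^{-1}, then clip the
    spectrum of M_hat so the returned Mahalanobis matrix M is PSD."""
    Zp = np.stack([a - b for a, b in pos_pairs])    # pairwise differences, similar
    Zn = np.stack([a - b for a, b in neg_pairs])    # pairwise differences, dissimilar
    Cp = Zp.T @ Zp / len(Zp)
    Cn = Zn.T @ Zn / len(Zn)
    M_hat = np.linalg.inv(Cp) - np.linalg.inv(Cn)
    evals, evecs = np.linalg.eigh(M_hat)
    return evecs @ np.diag(np.maximum(evals, 0.0)) @ evecs.T   # PSD projection

# toy usage: similar pairs differ slightly, dissimilar pairs differ a lot
rng = np.random.default_rng(4)
pos_pairs = [(x, x + 0.1 * rng.normal(size=3)) for x in rng.normal(size=(100, 3))]
neg_pairs = [(x, x + 3.0 + rng.normal(size=3)) for x in rng.normal(size=(100, 3))]
M = kissme(pos_pairs, neg_pairs)
```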