7. Holistic learning
Eigenfaces [1][2]
Adapted from Wikipedia (Eigenface).
Fisherfaces [2]
Adapted from OpenCV docs (Face Recognition).
Bayes, Laplacianfaces, 2DPCA, SRC, CRC, metric learning, etc.
8. Local handcraft
Gabor filter [3]
Adapted from Mathworks.com (Gabor Feature Extraction).
Local Binary Pattern [4][5][6]
Adapted from scikit-image (Local Binary Pattern for texture classification).
EBGM, LGBP, HD-LBP, etc.
13. Training, evaluation protocol
[Diagram: four dataset configurations for training and evaluation.]
• Dataset type 1: training set + test set; IDs in the training set = IDs in the test set.
• Dataset type 2: training set + test set; IDs in the training set != IDs in the test set. The test set provides information on matched and mismatched pairs for verification.
• Dataset type 3: training set + probe set + gallery set; IDs in the training set != IDs in the test set (for identification).
• Dataset type 4: training set + probe set + gallery set; IDs in the training set != IDs in the test set, and probe IDs that do not exist in the gallery set are allowed (for open-set identification).
14. Training, evaluation protocol (Verification)
[Diagram: verification uses a type 2 dataset (training set + test set) or a type 3 dataset (training set + probe set + gallery set).]
Information on matched and mismatched pairs is provided for verification, e.g. pairs_test.txt in the LFW dataset:
• "George_W_Bush 10 24" → a matched pair (images 10 and 24 of the same person).
• "George_W_Bush 12 John_Kerry 8" → a mismatched pair.
LFW provides 10 sets for the test. A set consists of 300 matched pairs and 300 mismatched pairs.
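The pairs file format is simple enough to parse directly: three fields denote a matched pair, four fields a mismatched pair. A minimal sketch, assuming the standard LFW image naming Name/Name_XXXX.jpg (the function name is ours, not part of the LFW tooling):

```python
def parse_lfw_pairs(path="pairs_test.txt"):
    """Parse LFW-style pair lines into (img1, img2, is_match) triples."""
    pairs = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:        # matched: name, idx1, idx2
                name, i1, i2 = parts
                pairs.append((f"{name}/{name}_{int(i1):04d}.jpg",
                              f"{name}/{name}_{int(i2):04d}.jpg", True))
            elif len(parts) == 4:      # mismatched: name1, idx1, name2, idx2
                n1, i1, n2, i2 = parts
                pairs.append((f"{n1}/{n1}_{int(i1):04d}.jpg",
                              f"{n2}/{n2}_{int(i2):04d}.jpg", False))
    return pairs
```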
15. Training, evaluation protocol (Verification)
Training and evaluation protocols for the LFW dataset.
Adapted from [11].
1. Unrestricted, Labeled Outside Data
2. Unrestricted, No Outside Data
Commonly, deep FR methods train on large outside datasets and therefore report under the Unrestricted, Labeled Outside Data protocol.
16. Training, evaluation protocol (Identification)
[Diagram: identification protocols.]
• Close-set identification uses a type 3 dataset (training set + probe set + gallery set): every probe ID appears in the gallery. Adapted from [8].
• Open-set identification uses a type 4 dataset: some probe IDs do not appear in the gallery. Adapted from [8].
17. Dataset
Long-tail distribution
Adapted from [8].
• The depth of a dataset forces the trained model to address a wide range of intra-class variations, such as lighting, age, and pose.
• The breadth of a dataset ensures that the trained model covers the sufficiently variable appearance of different people.
20. Evaluation metrics (Face verification)
• Receiver operating characteristic (ROC)
• Measures the true accept rate (TAR; TPR) while the false accept rate (FAR; FPR) is kept very low, as required in most security-certification scenarios.
• E.g. PaSC: TAR@10⁻² FAR; IJB-A: TAR@10⁻³ FAR; MegaFace: TAR@10⁻⁶ FAR; MS-Celeb-1M challenge 3: TAR@10⁻⁹ FAR.
• Mean accuracy (ACC)
• Represents the percentage of correct classifications.
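A minimal sketch of how TAR at a fixed FAR can be computed from raw pair scores (the function name and the quantile-based thresholding are our assumptions, not a benchmark's reference implementation):

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-3):
    """TAR at a fixed FAR: choose the threshold so that only `far` of the
    impostor (mismatched-pair) scores exceed it, then measure how many
    genuine (matched-pair) scores pass that threshold."""
    threshold = np.quantile(impostor_scores, 1.0 - far)
    return float(np.mean(genuine_scores >= threshold))
```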
21. Evaluation metrics (Identification. Close-set)
• Rank-N
• Rank-N measures what percentage of probe searches return the probe's gallery mate within the top N rank-ordered results.
• IJB-A/B/C focus on the rank-1 and rank-5 recognition rates.
• Cumulative match characteristic (CMC)
• The CMC curve reports the percentage of probes identified within a given rank (the independent variable).
• The MegaFace challenge systematically evaluates the rank-1 recognition rate as a function of an increasing number of gallery distractors (from 10 to 1M).
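A rough sketch of rank-N / CMC computation from a probe-gallery similarity matrix, assuming a closed set (every probe has a gallery mate); the function name is ours:

```python
import numpy as np

def cmc_curve(similarity, probe_ids, gallery_ids, max_rank=5):
    """Rank-1..rank-max_rank rates from a (num_probes, num_gallery)
    similarity matrix; assumes a closed set (every probe has a mate)."""
    order = np.argsort(-similarity, axis=1)        # best gallery match first
    ranked = gallery_ids[order]                    # gallery ids in ranked order
    first_hit = (ranked == probe_ids[:, None]).argmax(axis=1)
    return [float(np.mean(first_hit < r)) for r in range(1, max_rank + 1)]
```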
22. Evaluation metrics (Identification. Close-set)
• Precision-coverage curve
• Measures identification performance under a variable threshold t.
• A probe is rejected when its confidence score is lower than t.
• The algorithms are compared in terms of what fraction of probes pass (the coverage) at a high recognition precision, e.g. 95% or 99%.
CMC curves. Adapted from [9][12] and from [13].
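A minimal sketch of one point on a precision-coverage curve (sweep the threshold t to trace the full curve); names are our assumptions:

```python
import numpy as np

def precision_coverage(confidence, predicted_ids, true_ids, t):
    """One curve point: coverage = fraction of probes passing threshold t,
    precision = rank-1 accuracy among the passed probes."""
    passed = confidence >= t
    coverage = float(np.mean(passed))
    if not passed.any():
        return coverage, 1.0  # convention: empty set counts as fully precise
    precision = float(np.mean(predicted_ids[passed] == true_ids[passed]))
    return coverage, precision
```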
23. Evaluation metrics (Identification. Open-set)
• Decision (or detection) error tradeoff (DET) curve [14]
• Characterizes the false negative identification rate (FNIR) as a function of the false positive identification rate (FPIR).
• The FPIR measures what fraction of comparisons between probe templates and non-mate gallery templates result in a match score exceeding a threshold T. At the same time, the FNIR measures what fraction of probe searches fail to match a mated gallery template above a score of T.
• The algorithms are compared in terms of the FNIR at a low FPIR, e.g. 1% or 10%.
• The IJB-A benchmark supports open-set face recognition.
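A minimal sketch of one DET-curve operating point, assuming the mate and non-mate score arrays have already been collected; names are ours:

```python
import numpy as np

def det_point(mate_scores, nonmate_scores, T):
    """One DET-curve point at threshold T for open-set identification.
    mate_scores: score of each probe search against its mated gallery template.
    nonmate_scores: scores of probe vs. non-mate gallery template comparisons."""
    fpir = float(np.mean(nonmate_scores > T))   # non-mate comparisons that match
    fnir = float(np.mean(mate_scores <= T))     # mated searches that fail
    return fpir, fnir
```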
25. Example of FR training-test sequence.
[Diagram: a large-scale dataset serves as the training set (probe and gallery sets are held out); the feature extractor is trained on it with a loss function.]
26. Example of FR training-test sequence.
[Diagram: the trained feature extractor produces features for the probe and gallery sets of each benchmark (benchmarks 1-3). Evaluation then uses a classifier provided by the benchmark dev tool, e.g. a simple threshold or a Joint Bayesian classifier.]
27. Example of FR training-test sequence.
[Diagram: the trained feature extractor is fine-tuned (transfer learning) on each benchmark's own training set, and a classifier (e.g. metric learning, SRC) is trained on top; evaluation again runs on the probe and gallery sets (benchmarks 1 and 2).]
28. Deep FR System
Deep FR system. Adapted from [8].
K. Zhang, Z. Zhang, Z. Li, Y. Qiao. Joint face detection and alignment using multi-task
cascaded convolutional networks. arXiv preprint arXiv:1604.02878, 2016
32. Deep Face (Facebook, CVPR, 2014)
Outline of the DeepFace architecture
Adapted from [7].
Dataset for training: the Social Face Classification (SFC) dataset (4.4M labeled faces, 4K identities, 800-1,200 faces per person).
Objective: minimize the cross-entropy loss over the softmax output.
33. Deep Face (Facebook, CVPR, 2014)
• Verification metric
• Weighted $\chi^2$ distance
• The DeepFace feature vector shares several similarities with histogram-based features [6]:
1. It contains non-negative values.
2. It is very sparse.
3. Its values lie in [0, 1].
• $\chi^2(f_1, f_2) = \sum_i w_i \, \frac{(f_1[i] - f_2[i])^2}{f_1[i] + f_2[i]}$
• The weight parameters $w_i$ are learned using a linear SVM.
• Siamese network [18]
• Metric learning
• $d(f_1, f_2) = \sum_i \alpha_i \, |f_1[i] - f_2[i]|$
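Both verification distances are straightforward to compute once the weights are given; a minimal NumPy sketch (in the paper the weights w and alpha are learned by the linear SVM and the Siamese network respectively, here they are plain inputs, and the eps guard is our addition):

```python
import numpy as np

def weighted_chi2(f1, f2, w, eps=1e-12):
    """Weighted chi-squared distance between two sparse, non-negative
    feature vectors; eps guards against zero denominators."""
    return float(np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps)))

def siamese_distance(f1, f2, alpha):
    """Weighted L1 distance induced on top of the Siamese network."""
    return float(np.sum(alpha * np.abs(f1 - f2)))
```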
35. Deep Face (Facebook, CVPR, 2014)
Comparison of the classification errors on the SFC.
Adapted from [7].
• DF-1.5K, DF-3.3K, DF-4.4K: trained on subsets of 1.5K, 3.3K, and 4.4K persons.
• DF-10%, DF-20%, DF-50%: the global number of samples in SFC reduced to 10%, 20%, and 50%.
• DF-sub1, DF-sub2, DF-sub3: networks with the C3, L4, and L5 layers chopped off, respectively.
36. Deep Face (Facebook, CVPR, 2014)
The performance of various individual DeepFace networks and of the Siamese network.
Adapted from [7].
• DeepFace-single: 3D-aligned RGB inputs.
• DeepFace-align2D: 2D-aligned RGB inputs.
• DeepFace-gradient: gray-level image plus image gradient magnitude and orientation.
• DeepFace-ensemble: combined distances using a non-linear SVM with a simple sum of power CPD kernels.
37. Deep Face (Facebook, CVPR, 2014)
Comparison with the state-of-the-art on the LFW dataset.
Adapted from [7].
• DeepFace-single, unsupervised: directly compares the inner product of a pair of normalized features.
38. Deep Face (Facebook, CVPR, 2014)
• DeepFace-single, unsupervised (95.92%): directly compares the inner product of a pair of normalized features.
• DeepFace-single, restricted (97.00%): 5,400 pair labels for training a kernel SVM.
• DeepFace-ensemble, restricted (97.15%): single + gradient + align2D.
• DeepFace-ensemble, unrestricted 1 (97.25%): single + gradient + align2D + Siamese.
• DeepFace-ensemble, unrestricted 2 (97.35%): 5 single + gradient + align2D + Siamese.
39. Deep Face (Facebook, CVPR, 2014)
Comparison with the state-of-the-art on the LFW dataset.
Adapted from [7].
41. DeepID2 (CUHK, NIPS, 2014)
The DeepID2 feature learning algorithm.
Adapted from [19].
42. DeepID2 (CUHK, NIPS, 2014)
Patches selected for feature extraction (positions, scales, color channels, horizontal flipping).
Adapted from [19].
The ConvNet structure for DeepID2 feature extraction.
Adapted from [19].
44. DeepID2 (CUHK, NIPS, 2014)
(Left) Face verification accuracy, varying the weighting parameter $\lambda$.
(Right) Face verification accuracy of DeepID2 features learned with both the face identification and verification signals, where the number of training identities used for face identification varies.
Adapted from [19].
45. DeepID2 (CUHK, NIPS, 2014)
Spectrum of eigenvalues of the inter- and intra-personal scatter
matrices.
Adapted from [19].
46. DeepID2 (CUHK, NIPS, 2014)
The first two PCA dimensions of DeepID2 features extracted from
six identities in LFW.
Adapted from [19].
Comparison of different verification signals (classifying the 8,192 identities).
Adapted from [19].
47. DeepID2 (CUHK, NIPS, 2014)
Face verification accuracy with DeepID2 features extracted from
an increasing number of face patches.
Adapted from [19].
Accuracy comparison with the previous best results on LFW.
Adapted from [19].
48. DeepID2 (CUHK, NIPS, 2014)
ROC comparison with the previous best results on LFW.
Adapted from [19].
53. FaceNet (Google, CVPR, 2015)
$\mathcal{T}$ is the set of all possible triplets in the training set and has cardinality $N$.
The embedding $f(x) \in \mathbb{R}^d$ is constrained to live on the $d$-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$.
Every triplet should satisfy
$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall\, (f(x_i^a), f(x_i^p), f(x_i^n)) \in \mathcal{T}$
and the loss being minimized is
$\mathcal{L} = \sum_i^N \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+$
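A minimal NumPy sketch of this hinge-form loss over a batch of anchor/positive/negative embeddings (not the authors' implementation; the margin default of 0.2 is the value reported in the paper):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge-form triplet loss on (batch, d) L2-normalized embeddings:
    sum_i [ ||f_a - f_p||^2 - ||f_a - f_n||^2 + alpha ]_+"""
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)
    return float(np.sum(np.maximum(d_pos - d_neg + alpha, 0.0)))
```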
54. FaceNet (Google, CVPR, 2015)
Triplet Selection
Given an anchor $x_i^a$:
Hard positive: $\arg\max_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$
Hard negative: $\arg\min_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$
It is infeasible to compute the argmin and argmax across the whole training set, and doing so might lead to poor training, as mislabeled and poorly imaged faces would dominate the hard positives and negatives.
56. FaceNet (Google, CVPR, 2015)
Triplet Selection
• Generate triplets offline every n steps,
using the most recent network checkpoint
and computing the argmin and argmax on
a subset of the data.
• Generate triplets online. This can be done
by selecting the hard positive/negative
exemplars from within a mini-batch.
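As an illustration of the second strategy, a rough sketch of in-batch mining that pairs every anchor-positive pair with its hardest in-batch negative (the paper also describes semi-hard selection, which this sketch omits; names are ours):

```python
import numpy as np

def mine_batch_triplets(embeddings, labels):
    """For every valid anchor-positive pair in the batch, pick the hardest
    (closest) in-batch negative; returns index triples (a, p, n)."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)              # pairwise squared distances
    same = labels[:, None] == labels[None, :]
    triplets = []
    for a in range(len(labels)):
        negatives = np.where(~same[a])[0]
        if negatives.size == 0:
            continue
        for p in np.where(same[a])[0]:
            if p == a:
                continue
            hardest_neg = negatives[np.argmin(dist[a, negatives])]
            triplets.append((a, p, hardest_neg))
    return triplets
```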
57. FaceNet (Google, CVPR, 2015)
Dataset: Google (500M images, 10M identities).
Network: Inception, 224×224 input.
LFW results:
98.87% ± 0.15 using a fixed center crop.
99.63% ± 0.09 using extra face alignment.
62. Center Loss (SIAT, ECCV, 2016)
The distribution of deeply learned features on (a) the training set and (b) the test set, both under the supervision of the softmax loss.
Adapted from [22].
63. Center Loss (SIAT, ECCV, 2016)
$\mathcal{L}_C = \frac{1}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2$
$x_i$: the $i$-th deep feature, belonging to the $y_i$-th class.
$c_{y_i}$: the $y_i$-th class center of the deep features.
The center loss and its variants suffer from massive GPU memory consumption on the classification layer, and they prefer balanced and sufficient training data for each identity.
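A minimal NumPy sketch of the loss value together with the paper's mini-batch center update $c_j \leftarrow c_j - \alpha \, \Delta c_j$, where $\Delta c_j$ averages $(c_j - x_i)$ over the batch samples of class $j$ (function name ours; labels are assumed to be integer indices into the centers array):

```python
import numpy as np

def center_loss_step(x, y, centers, alpha=0.5):
    """Center loss for one mini-batch plus the paper-style center update.
    x: (m, d) deep features; y: (m,) integer labels; centers: (classes, d)."""
    diff = x - centers[y]                      # x_i - c_{y_i}
    loss = 0.5 * float(np.sum(diff ** 2))
    for c in np.unique(y):
        idx = (y == c)
        delta = np.sum(centers[c] - x[idx], axis=0) / (1.0 + idx.sum())
        centers[c] -= alpha * delta            # in-place center update
    return loss
```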
65. Center Loss (SIAT, ECCV, 2016)
The distribution of deeply learned features under the joint supervision of the softmax loss and the center loss.
Adapted from [22].
67. Center Loss (SIAT, ECCV, 2016)
Face verification accuracies on the LFW dataset, achieved by (a) models with different $\lambda$ and fixed $\alpha = 0.5$, and (b) models with different $\alpha$ and fixed $\lambda = 0.003$.
Adapted from [22].
68. Center Loss (SIAT, ECCV, 2016)
A: softmax
B: softmax + contrastive loss
C: proposed, $\lambda = 0.003$, $\alpha = 0.5$
Adapted from [22].
69. L-Softmax (Peking univ. , ICML, 2016)
Original softmax loss:
$L = \frac{1}{N} \sum_i L_i = \frac{1}{N} \sum_i -\log \left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)$
$x_i$: the $i$-th input feature; $y_i$: its label; $N$: the number of training samples; $f_j$: the $j$-th element of the vector of class scores $\boldsymbol{f}$.
$\boldsymbol{f}$ is usually the activation of a fully connected layer $\boldsymbol{W}$, so $f_{y_i}$ can be written as $f_{y_i} = \boldsymbol{W}_{y_i}^T \boldsymbol{x}_i$, where $\boldsymbol{W}_{y_i}$ is the $y_i$-th column of $\boldsymbol{W}$.
[Diagram: $\boldsymbol{x}$ → fully connected layer ($\boldsymbol{W}$, $b$) → $\boldsymbol{f}$ → softmax → cross entropy.]
70. L-Softmax (Peking univ. , ICML, 2016)
$f_j = \boldsymbol{W}_j^T \boldsymbol{x}_i = \|\boldsymbol{W}_j\| \|\boldsymbol{x}_i\| \cos(\theta_j)$, where $\theta_j$ $(0 \le \theta_j \le \pi)$ is the angle between $\boldsymbol{W}_j$ and $\boldsymbol{x}_i$.
$L_i = -\log \left( \frac{e^{\|\boldsymbol{W}_{y_i}\| \|\boldsymbol{x}_i\| \cos\theta_{y_i}}}{\sum_j e^{\|\boldsymbol{W}_j\| \|\boldsymbol{x}_i\| \cos\theta_j}} \right)$
In binary classification, if we have a sample $\boldsymbol{x}$ from class 1, the original softmax requires
$\|\boldsymbol{W}_1\| \|\boldsymbol{x}\| \cos\theta_1 > \|\boldsymbol{W}_2\| \|\boldsymbol{x}\| \cos\theta_2$.
L-Softmax instead requires
$\|\boldsymbol{W}_1\| \|\boldsymbol{x}\| \cos(m\theta_1) > \|\boldsymbol{W}_2\| \|\boldsymbol{x}\| \cos\theta_2 \quad (0 \le \theta_1 \le \frac{\pi}{m})$, where $m$ is a positive integer.
Since $\|\boldsymbol{W}_1\| \|\boldsymbol{x}\| \cos\theta_1 \ge \|\boldsymbol{W}_1\| \|\boldsymbol{x}\| \cos(m\theta_1) > \|\boldsymbol{W}_2\| \|\boldsymbol{x}\| \cos\theta_2$, satisfying the new criterion implies the original one, which yields an angular margin between the classes.
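A minimal NumPy sketch of the margin modification, assuming the columns of $W$ are the class weight vectors (function name ours); it only replaces the target-class logit with $\|W_y\|\|x\|\cos(m\theta)$, and omits the piecewise $\psi(\theta)$ the paper uses to keep the modified logit monotonically decreasing for $\theta > \pi/m$:

```python
import numpy as np

def l_softmax_logits(W, x, y, m=2):
    """Replace the target-class logit ||W_y|| ||x|| cos(theta) with
    ||W_y|| ||x|| cos(m * theta). Columns of W are class weight vectors.
    Valid for 0 <= theta <= pi/m; the paper's psi(theta) extension
    for larger angles is omitted in this sketch."""
    logits = W.T @ x                               # f_j = W_j^T x
    w_norm = np.linalg.norm(W[:, y])
    x_norm = np.linalg.norm(x)
    cos_theta = np.clip(logits[y] / (w_norm * x_norm), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    logits[y] = w_norm * x_norm * np.cos(m * theta)
    return logits
```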
84. SphereFace (Georgia Tech. , CVPR, 2017)
1:1M rank-1 identification results on the MegaFace benchmark: (a) introducing label flips into the training data, (b) introducing outliers into the training data.
Adapted from [26].
107. References
[1] M. Turk, A. Pentland, “Face recognition using eigenfaces,” in Proc. CVPR, pp.
586–591. (1991)
[2] P. Belhumeur, J. P. Hespanha, and D. Kriegman. “Eigenfaces vs. fisherfaces:
Recognition using class specific linear projection,” in PAMI, 19(7):711-720, July
1997.
[3] H. G. Feichtinger, T. Strohmer, “Gabor Analysis and Algorithms,” in Birkhauser,
1998.
[4] D.-C. He, L. Wang, “Texture Unit, Texture Spectrum, and Texture Analysis,” in IEEE Trans. Geoscience and Remote Sensing, Vol. 28, No. 4, pp. 509-512, 1990.
[5] L. Wang, D.-C. He, “Texture Classification Using Texture Spectrum,” in Pattern Recognition, Vol. 23, No. 8, pp. 905-910, 1990.
[6] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary
patterns: Application to face recognition,” in PAMI, 2006
[7] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, “DeepFace: Closing the gap to human-
level performance in face verification,” in Proc. CVPR, 2014
[8] M. Wang, W. Deng, “Deep Face Recognition: A Survey,” ArXiv preprint
arXiv:1804.06655v8
[9] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
108. References
[10] G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller. Labeled Faces in the Wild:
A Database for Studying Face Recognition in Unconstrained Environments.
University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.
[11] G. B. Huang, E. Learned-Miller. Labeled Faces in the Wild: Updates and New Reporting Procedures. University of Massachusetts, Amherst, Technical Report UM-CS-2014-003, 2014.
[12] J. Deng, J. Guo, S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
[13] F. Zhao, J. Zhao, S. Yan, J. Feng. Dynamic Conditional Networks for Few-Shot Learning. In ECCV, 2018.
[14] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, A. K. Jain. Pushing the Frontiers of Unconstrained Face Detection and Recognition: IARPA Janus Benchmark A. In CVPR, 2015.
[15] A. Talwalkar, S. Kumar, H. Rowley. Large-scale manifold learning. In CVPR, 2008.
[16] K.-C. Lee, J. Ho, M.-H. Yang, D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In CVPR, 2003.
[17] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang. “Face recognition using Laplacianfaces,” PAMI, 27(3):328-340, 2005.
[18] S. Chopra, R. Hadsell, Y. LeCun. Learning a similarity metric discriminatively,
with application to face verification. In CVPR, 2005.
109. References
[19] Y. Sun, Y. Chen, X. Wang, X. Tang. Deep learning face representation by joint
identification-verification. In NIPS, pages 1988-1996, 2014.
[20] Y. Sun, D. Liang, X. Wang, X. Tang. Deepid3: Face recognition with very deep
neural networks. arXiv preprint arXiv:1502.00873
[21] F. Schroff, D. Kalenichenko, J. Philbin. Facenet: A unified embedding for face
recognition and clustering. In CVPR, pp. 815-823, 2015.
[22] Y. Wen, K. Zhang, Z. Li, Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499-515, 2016.
[23] W. Liu, Y. Wen, Z. Yu, M. Yang. Large-margin softmax loss for convolutional
neural networks. In ICML, pp. 507-516, 2016.
[24] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[25] F. Wang, W. Liu, H. Liu, J. Cheng. Additive margin softmax for face verification.
arXiv preprint arXiv:1801.05599, 2018
[26] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, C. C. Loy. The devil of face recognition is in the noise. In ECCV, 2018.
[27] R. Ranjan, C. D. Castillo, R. Chellappa. L2-constrained softmax loss for
discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[28] J. Deng, J. Guo, S. Zafeiriou. Arcface: Additive angular margin loss for deep
face recognition .arXiv:1801.07698, 2018.
Editor's Notes
Alignment pipeline. (a) The detected face, with 6 initial fiducial points. (b) The induced 2D-aligned crop. (c) 67 fiducial points on the 2D-aligned crop with their corresponding Delaunay triangulation; we added triangles on the contour to avoid discontinuities. (d) The reference 3D shape transformed to the 2D-aligned crop image plane. (e) Triangle visibility w.r.t. the fitted 3D-2D camera; darker triangles are less visible. (f) The 67 fiducial points induced by the 3D model that are used to direct the piece-wise affine warping. (g) The final frontalized crop. (h) A new view generated by the 3D model (not used in this paper).
L2+ only decreases the distances between DeepID2 features of the same identity.
L2- only increases the distances between DeepID2 features of different identities if they are smaller than the margin.
Figure 2: Comparison among softmax loss, modified softmax loss, and A-Softmax loss. In this toy experiment, we construct a CNN to learn 2-D features on a subset of the CASIA face dataset. Specifically, we set the output dimension of the FC1 layer to 2 and visualize the learned features. Yellow dots represent the first-class face features, while purple dots represent the second-class face features. One can see that features learned by the original softmax loss cannot be classified simply via angles, while those of the modified softmax loss can. Our A-Softmax loss can further increase the angular margin of learned features.
Figure 5: Visualization of features learned with different m. The first row shows the 3D features projected on the unit sphere. The projected points are the
intersection points of the feature vectors and the unit sphere. The second row shows the angle distribution of both positive pairs and negative pairs (we choose
class 1 and class 2 from the subset to construct positive and negative pairs). Orange area indicates positive pairs while blue indicates negative pairs. All angles
are represented in radian. Note that, this visualization experiment uses a 6-class subset of the CASIA-WebFace dataset.
Figure 11. Parallel calculation by simple matrix partition. Setting: ResNet 50, batch size 8×64, feature dimension 512, 32-bit floating point, identity number 1 million, 8 × 1080Ti GPUs (11GB). Communication cost: 1MB (feature x). Training speed: 800 samples/second.
(1) Get feature (x). Face embedding features are aggregated into one feature matrix (batch size 8×64 × feature dimension 512) from the 8 GPU cards. The size of the aggregated feature matrix is only 1MB, and the communication cost is negligible when we transfer the feature matrix.
(2) Get similarity score matrix (score = xW). We copy the feature matrix onto each GPU, and concurrently multiply the feature matrix by the centre sub-matrix (feature dimension 512 × identity number 1M/8) to get the similarity score sub-matrix (batch size 512 × identity number 1M/8) on each GPU. The similarity score matrix goes forward to calculate the ArcFace loss and the gradient. Here, we conduct a simple matrix partition on the centre matrix and the similarity score matrix along the identity dimension, and there is no communication cost on the centre and similarity score matrices. Both the centre sub-matrix and the similarity score sub-matrix are only 256MB on each GPU.
(3) Get gradient on centre (dW). We transpose the feature
matrix on each GPU, and concurrently multiply the
transposed feature matrix by the gradient sub-matrix of the
similarity score.
(4) Get gradient on feature (x). We concurrently multiply
the gradient sub-matrix of similarity score by the transposed
centre sub-matrix and sum up the outputs from 8
GPU cards to get the gradient on feature x.
Considering the communication cost (MB level), our
implementation of ArcFace can be easily and efficiently
trained on millions of identities by clusters.