A Discriminative Feature Learning Approach
for Deep Face Recognition
Yandong Wen, Kaipeng Zhang, Zhifeng Li and Yu Qiao
Paper Seminar @ SK Telecom: Jisung Kim
Separable vs Discriminative
A toy example : What’s wrong with Softmax
1. m = mini-batch size
2. n = the number of classes
3. x_i = feature vector in R^d (d is the feature dimension)
4. W ∈ R^(d×n) (weights of the last fully connected layer), b ∈ R^n (bias)
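For reference, the softmax loss the slide is questioning, in the paper's notation (x_i is the deep feature of the i-th sample, with label y_i):

L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_j}}

It only encourages the classes to be separable; nothing in it pulls features of the same class together.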
Is it good for clustering?
The deep features are separable, but not discriminative enough: intra-class variation remains large.
Is it good for clustering?
MNIST toy example : Training Set (50K) / Test Set (10K)
Let’s be discriminative by Center Loss
1. m = mini-batch size
2. c_{y_i} = the y_i-th class center, in R^d
3. x_i = feature vector in R^d (d is the feature dimension)
4. But! c_{y_i} must be updated as the deep features change.
5. Ideally that means averaging the features of each class over the entire training set in every iteration, which is impractical.
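The center loss itself penalizes the distance between each deep feature and its class center:

L_C = \frac{1}{2} \sum_{i=1}^{m} \left\lVert x_i - c_{y_i} \right\rVert_2^2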
Let’s be discriminative by Center Loss
1. Two modifications
a. Update the centers from each mini-batch rather than the whole training set
b. Update the centers with a learning rate α, to damp perturbations from mislabelled samples
2. Total loss = Softmax Loss + λ · Center Loss
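In the paper's notation, the mini-batch center update (with indicator δ) and the joint objective are:

\Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \delta(y_i = j)}, \qquad c_j^{t+1} = c_j^{t} - \alpha \, \Delta c_j^{t}

L = L_S + \lambda L_C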
Varying λ with Loss = Softmax + λ · Center
The discriminative feature learning algorithm
CNN Architecture → https://github.com/ydwen/caffe-face
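The released implementation is Caffe-based (repo above). As a minimal illustrative sketch only, a PyTorch-style module for the center loss and its manual center update could look like the following; the class and method names are mine, not from the authors' code:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Sketch of the center loss (Wen et al., ECCV 2016).
    Hypothetical PyTorch port; the official release is Caffe-based."""

    def __init__(self, num_classes: int, feat_dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # center update rate (the paper's alpha)
        # Class centers c_j; updated by the rule below, not by the optimizer.
        self.register_buffer("centers", torch.zeros(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # L_C = 1/2 * sum_i ||x_i - c_{y_i}||_2^2 over the mini-batch
        return 0.5 * (feats - self.centers[labels]).pow(2).sum()

    @torch.no_grad()
    def update_centers(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # Delta c_j = sum_{y_i = j} (c_j - x_i) / (1 + #{i : y_i = j})
        for j in labels.unique():
            mask = labels == j
            delta = (self.centers[j] - feats[mask]).sum(dim=0) / (1 + mask.sum())
            self.centers[j] -= self.alpha * delta
```

A training step would then compute loss = softmax_loss + λ · center(feats, labels), backpropagate, step the optimizer, and finally call center.update_centers(feats.detach(), labels).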
Compared to Siamese and Triplet
1. Dramatic data expansion
a. Must construct (x_i, x_j) pairs or (x_anchor, x_positive, x_negative) triplets, so the effective training set grows combinatorially
2. Hard to mine informative pairs/triplets → hard to keep the loss decreasing → hard to train
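For context, the standard forms of the contrastive (pairwise) and triplet losses being compared; these forms come from the contrastive/triplet-loss literature, not from this slide (d is Euclidean distance, y ∈ {0, 1} marks a genuine pair, and m is a margin):

L_{contrastive} = y \, d(x_i, x_j)^2 + (1 - y) \max\big(0,\; m - d(x_i, x_j)\big)^2

L_{triplet} = \max\big(0,\; \lVert f_a - f_p \rVert_2^2 - \lVert f_a - f_n \rVert_2^2 + m\big)

Both are defined over pairs or triplets, which is why the training data expands combinatorially and why mining matters so much.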
The Devil is in the details.
1. Preprocessing
2. Training Data
3. Detailed Settings in CNNs
4. Detailed Settings in Testing
Implementation Details : Preprocessing
1. Cropped face size = (112x96x3)
a. Subtract 127.5
b. Divide by 128
c. Every pixel value then lies in [-1, 1]
2. Face detection by MTCNN
3. Use 5 landmarks
a. Two Eyes
b. One Nose
c. Two Mouth Corners
4. Alignment : Similarity Transformation.
a. Rotation, Translation, Scaling
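A minimal sketch of the normalization arithmetic above (NumPy; the function name is mine, and MTCNN detection plus similarity-transform alignment are assumed to have produced the 112x96 crop already):

```python
import numpy as np

def normalize_face(face: np.ndarray) -> np.ndarray:
    """Normalize an aligned 112x96x3 face crop as in the paper:
    subtract 127.5, divide by 128, so values land in [-1, 1]."""
    assert face.shape == (112, 96, 3)
    return (face.astype(np.float32) - 127.5) / 128.0
```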
Implementation Details : Training data
1. 17,189 unique persons
2. 0.7 M images
a. CASIA-WebFace : 0.49 M, 10,575
b. CACD2000 : 0.16 M, 2,000
c. Celebrity+ : 0.20 M, 10,177
3. Overlapping identities (across the datasets and with the test sets) are removed!
4. Data augmentation : only horizontal flipping
Implementation Details : Detailed Settings in CNNs
1. Model A : Softmax Loss
2. Model B : Softmax Loss & Contrastive Loss
3. Model C : Softmax Loss & Center Loss
4. Batch Size : 256
5. GPUs : 2 x Titan X
6. Learning rate : starts at 0.1
7. Models A, C
a. Divided by 10 at 16K and 24K iterations
b. Training completes at 28K iterations (roughly 14 hours)
8. Model B
a. Divided by 10 at 24K and 36K iterations
b. Training completes at 42K iterations (roughly 22 hours)
Implementation Details : Detailed Settings in Testing
1. Deep Features : first FC layer
2. Extract two features
a. From the original image
b. From the horizontally flipped image
3. Concatenate the two
4. Reduce dimensionality with PCA
5. Score by cosine distance
6. Identification
a. Nearest Neighbor
7. Verification
a. Threshold Comparison
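A sketch of this test-time pipeline (`net` is a hypothetical callable standing in for the first-FC-layer feature extractor; PCA is omitted for brevity):

```python
import numpy as np

def extract_feature(net, image: np.ndarray) -> np.ndarray:
    """Concatenate features of the image and its horizontal flip,
    per the paper's test protocol."""
    feat = net(image)
    feat_flip = net(image[:, ::-1, :])  # flip along the width axis (H, W, C)
    return np.concatenate([feat, feat_flip])

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    # Verification: compare this score against a threshold.
    # Identification: pick the gallery entry with the highest score.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```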
λ & α on LFW
LFW (images) & YTF (videos)
Verification Performance (λ=0.003 & α=0.5)
MegaFace
1. Gallery Set : 690,000 persons / 1 million images (distractors)
2. Probe set
a. FaceScrub : 530 persons / 100K images
b. FGNet : 82 persons / 1,002 images (ages varying from 0 to 69)
3. Small / Large Protocol
a. Small : training set under 0.5M images and under 20K persons
b. Large : training set over 0.5M images or over 20K persons
4. Face Identification
a. From 1 probe vs 10 distractors
b. Up to 1 probe vs 1,000,000 distractors
5. Face Verification
a. 4 billion = 4,000,000,000 negative pairs
MegaFace : Identification (1M distractors)
MegaFace : Verification (10^-6 FAR, 1M distractors)
FaceScrub
1. 530 celebrities
2. 100K images
FGNet
1. 82 persons
2. 1,002 images
MegaFace : Identification (FaceScrub)
MegaFace : Verification (FaceScrub)
MegaFace : Identification (FGNet)
MegaFace : Verification (FGNet)
Conclusions
1. Use center loss to learn discriminative features!
2. Much room remains for performance improvement
a. To meet practical demands
b. Identification : rank-1 accuracy with 1M distractors
c. Verification : 10^-6 FAR with 1M distractors
