Introduction to
Face Processing
with Computer Vision
Gabriel Bianconi
Founder, Scalar Research
AI & Data Science Consulting Firm
Previously at the Stanford AI Lab
Agenda
• Theory
  • Detection
  • Recognition
  • Other Tasks
• Practice
  • Rapid Prototyping
  • Scaling
Theory
Face Detection
Haar-Like Features
• Summarize image based on simple light/dark intensity patterns
• Manually determined feature extractors (kernels)
• Leveraged for first real-time face detector (2001)
Ref: Viola & Jones (2001). Image: Wikimedia
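As a rough illustration of the bullets above, a two-rectangle Haar-like feature is the difference between the pixel sums of two adjacent rectangles, and an integral image (summed-area table) makes each rectangle sum a constant-time lookup. This is a minimal sketch in plain Python; the function names and the toy 4x4 image are assumptions made for illustration, not the original detector's code.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_horizontal(ii, top, left, height, width):
    """Left half minus right half of the window (width should be even)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum


# Toy image (hypothetical data): bright left half, dark right half.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ii = integral_image(image)
edge_response = two_rect_horizontal(ii, 0, 0, 4, 4)
```

A large absolute response suggests a vertical light/dark edge inside the window; a real detector evaluates many such features at many positions and scales.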
Histogram of Oriented Gradients (HOG)
• Summarize image by distribution of intensity gradients
• Gradient intensities and orientations represent edges, etc.
• Captures more information than simple Haar-like features
Ref: Shu et al. (2011); Rojas et al. (2011)
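The idea behind HOG can be sketched for a single cell: compute a gradient at each pixel, then accumulate gradient magnitudes into orientation bins. The sketch below is a simplification under stated assumptions (central differences, 9 unsigned bins, no bin interpolation, no block normalization), and the function name is hypothetical.

```python
import math


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Border pixels are skipped so that central differences are always defined.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Toy cell (hypothetical data) with a vertical edge: the gradient points along x.
cell = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
hist = cell_histogram(cell)
```

All of the gradient energy of this toy cell lands in the first bin (orientations near 0 degrees); a full descriptor concatenates such histograms over a grid of cells.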
R-CNN
• Introduces CNNs for object detection
• CNNs learn how to extract features from data
• Breakthrough in performance
• Beats previous SOTA methods by huge margin
• However, detection is extremely slow
Ref: Girshick et al. (2014).
CNN Features
Ref: Lee et al. (2009).
Fast R-CNN
• Improvement to R-CNN that leverages a CNN for classification and bounding-box regression
• Other than proposing regions, the system is now end-to-end, vs. three components trained greedily.
• Predictions are 200x+ faster, with better performance
• Region proposals are still a bottleneck; total inference time is ~2s.
Ref: Girshick (2015).
Faster R-CNN
• Leverages CNN for region proposals as well
• “Region Proposal Network”
• Finally an end-to-end system with deep learning
• About 10x faster than Fast R-CNN, with better performance
• Total inference time is ~0.2s
Ref: Ren et al. (2016).
MTCNN
• Many models for face detection draw heavily from generalized object detection methods.
• MTCNN, for example, trains a multi-task system for detection and alignment.
Ref: Zhang et al. (2015).
DSFD
• The current SOTA method draws heavily from
modern single-shot detection architectures.
• DSFD extends to a dual-shot detector with
enhanced features and loss functions.
Ref: Li et al. (2018).
Are we there yet?
Ref: Yang (2016).
WIDER Face (Easy): ~96% AP
WIDER Face (Medium): ~95% AP
WIDER Face (Hard): ~90% AP
Facial Recognition
Facial Recognition
• Facial recognition actually corresponds to a group of different tasks.
• Verification vs. Identification vs. Grouping vs. …
• Closed-Set vs. Open-Set
Closed-Set Recognition
• Every identity appears in training set
• Example: recognizing celebrities
• Effectively a classification problem
• Model aims to learn separable features
Closed-Set Identification
Diagram: Test Sample → Model → Label Confidences (Label 0, Label 1, …)
Images: Wikimedia
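Closed-set identification, as in the diagram above, amounts to picking the label with the highest confidence. A minimal sketch, assuming the model outputs one raw score per known label and that a softmax turns those scores into confidences; the scores and label names below are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into confidences that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identify_closed_set(scores, labels):
    """Return the most likely label and its confidence."""
    confidences = softmax(scores)
    best = max(range(len(labels)), key=lambda i: confidences[i])
    return labels[best], confidences[best]


labels = ["label_0", "label_1", "label_2"]
scores = [0.5, 2.5, -1.0]  # hypothetical model outputs for one test sample
predicted, confidence = identify_closed_set(scores, labels)
```

Because every score maps to one of the training identities, a face outside the training set is still assigned some label, which is why this setup only fits the closed-set case.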
Closed-Set Verification
Diagram: Test Sample A and Test Sample B → Model → Label Confidences for each sample
Images: Wikimedia
Open-Set Recognition
• Not every identity appears in training set
• Example: Facebook Photos
• Effectively a metric learning problem
• Model aims to learn large-margin features (embeddings)
Embeddings
• Map each sample to a vector (coordinate system)
• Used for words, graphs, faces, etc.
• Embeddings preserve similarity
• Similar samples close to each other
• Dissimilar samples far from each other
Images: Wikimedia
Embeddings
• “Similar” depends on the training data
• Same person, physical characteristic, etc.
• Embeddings represent latent information
• High-dimensional embeddings trained on large datasets learn to represent latent information about the person (e.g. physical characteristics)
Open-Set Identification
Diagram: Test Sample → Model → Embedding + Distance, compared against stored embeddings (Emb. 0, Emb. 1, Emb. 2, …)
Images: Wikimedia
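The diagram above can be sketched as a ranking problem: embed the test sample, then sort the stored embeddings by distance to it. This sketch assumes L2 (Euclidean) distance and uses made-up 2-D vectors in place of real face embeddings; `rank_gallery` is a hypothetical helper name.

```python
import math


def rank_gallery(query, gallery):
    """Return (distance, name) pairs sorted from closest to farthest."""
    return sorted((math.dist(query, emb), name) for name, emb in gallery.items())


# Hypothetical stored embeddings; real face embeddings have far more dimensions.
gallery = {
    "emb_0": [0.0, 0.0],
    "emb_1": [1.0, 0.0],
    "emb_2": [0.0, 1.0],
}
ranking = rank_gallery([0.9, 0.1], gallery)
closest_distance, closest_name = ranking[0]
```

New identities can be added to the gallery without retraining the model, which is what makes the approach suitable for the open-set case.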
Open-Set Verification
Diagram: Test Sample A and Test Sample B → Model → Embedding A and Embedding B → Distance vs. Threshold
Images: Wikimedia
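Verification then reduces to a single comparison, and the threshold can be chosen on labeled pairs. The sketch below assumes L2 distance and picks, from a few candidate thresholds, the one with the best accuracy on a tiny hypothetical validation set; all names and numbers are illustrative.

```python
import math


def verify(embedding_a, embedding_b, threshold):
    """True if the two embeddings are close enough to count as the same person."""
    return math.dist(embedding_a, embedding_b) <= threshold


def accuracy_at_threshold(pairs, threshold):
    """pairs: list of (embedding_a, embedding_b, is_same_person) tuples."""
    correct = sum(verify(a, b, threshold) == same for a, b, same in pairs)
    return correct / len(pairs)


# Hypothetical labeled pairs (2-D vectors stand in for real embeddings).
pairs = [
    ([0.0, 0.0], [0.1, 0.0], True),
    ([0.0, 0.0], [0.3, 0.0], True),
    ([0.0, 0.0], [0.8, 0.0], False),
    ([1.0, 1.0], [1.0, 0.5], False),
]
candidates = [0.2, 0.4, 0.6]
best_threshold = max(candidates, key=lambda t: accuracy_at_threshold(pairs, t))
```

A lower threshold rejects more genuine pairs, a higher one accepts more impostors; which error matters more depends on the application.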
Metric Learning
Ref: Liu et al. (2018)
Are we there yet?
Ref: Deng et al. (2018); Learned-Miller et al. (2016)
LFW (Labeled Faces in the Wild): 99.8%+ accuracy
Cross-Factor Facial Recognition
Cross-Age
Ref: Zheng et al. (2017)
Cross-Pose
Ref: Li et al. (2011)
Cross-Makeup
Ref: Chen et al. (2013)
Further Research
Security
• How do we deal with adversarial users?
• Real face goes undetected or misclassified
• Fake face gets recognized
• Private data is extracted from model
• …
Ref: Grigory Bakunov (2017)
Biometrics & Multi-Modal Data
• How do we deal with…
• Identical twins?
• Plastic surgery?
• …
Ref: Singh et al. (2010)
Biometrics & Multi-Modal Data
• Combine with other biometric data
• Biometric traits (e.g. hand)
• Multiple sensors (e.g. 2D + 3D)
• Multiple pictures (e.g. viewpoints, sequences)
• …
Ref: Singh et al. (2010); Ross & Jain (2004); Ross & Govindarajan (2005)
Ref: Apple
Privacy
• How do we deal with…
• Models that can predict gender, race, …?
• Models that leak the data?
• Predictions without sharing the raw data?
• …
Ref: Singh et al. (2010)
Other Tasks
Alignment & Pose Estimation
Ref: Ruiz et al. (2018)
Face Landmarks
Classification
Figure: example faces labeled Neutral, Happy, Happy
3D Reconstruction
Ref: Sela et al. (2017)
Practice
Rapid Prototyping
Dozens of Tools
Figure: tools placed along an axis from simplicity to accuracy: face_recognition, OpenCV, FaceNet, InsightFace, MTCNN, …, DSFD, …
APIs
• There are dozens of APIs providing low-cost face processing at scale
• Most services charge less than $1 per 1000 images
• Depending on the use case, might be cheaper than provisioning GPUs and deploying your own models (esp. if considering developer time)
• Often these APIs can achieve performance that’s close to state-of-the-art
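The API vs. self-hosting trade-off can be checked with simple arithmetic. In the sketch below the $1 per 1000 images comes from the bullet above, while the fixed monthly cost of a self-hosted deployment is a purely hypothetical placeholder to be replaced with your own estimate.

```python
def monthly_api_cost(images_per_month, price_per_1000):
    """Cost of sending every image to a pay-per-use API."""
    return images_per_month / 1000 * price_per_1000


def break_even_images(fixed_monthly_cost, price_per_1000):
    """Monthly volume above which a fixed-cost deployment becomes cheaper than the API."""
    return fixed_monthly_cost / price_per_1000 * 1000


PRICE_PER_1000 = 1.00          # upper bound quoted on the slide
SELF_HOSTED_MONTHLY = 1500.00  # hypothetical: GPUs plus maintenance time

api_cost = monthly_api_cost(200_000, PRICE_PER_1000)
break_even = break_even_images(SELF_HOSTED_MONTHLY, PRICE_PER_1000)
```

Under these placeholder numbers the API stays cheaper until roughly 1.5 million images per month; the conclusion shifts with the real prices and the real cost of developer time.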
APIs – Example: Azure
• Detection
• Classification
• Gender, age, emotion, hair, smile, eyes, glasses, makeup, …
• Landmarks
• Pose Estimation
• Recognition
• Verification, identification, grouping, similarity search, …
Embeddings
• Face embeddings are typically used for open-set recognition systems
• They can be leveraged to quickly train models for downstream tasks (e.g. classification)
• Tools
  • face_recognition (GitHub): extremely fast, reliable for frontal faces
  • FaceNet: based on deep learning, strong across the board
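One way to reuse embeddings for a downstream classification task is a nearest-centroid classifier: average the embeddings of each class, then assign a new embedding to the closest average. This is only one simple option chosen for illustration (the slides do not name a specific classifier), and the 2-D vectors and class names are hypothetical.

```python
import math


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], emb)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, embedding):
    """Return the class whose centroid is closest in L2 distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], embedding))


# Hypothetical training data: precomputed embeddings with class labels.
train_embeddings = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
train_labels = ["neutral", "neutral", "happy", "happy"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [0.7, 0.8])
```

Since the embedding model is already trained, fitting takes a single pass over a small labeled set, which is what makes this kind of prototyping fast.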
Example – Facebook Photos
• Task: open-set face identification
• Strategy:
1. Detect faces and compute embeddings for known photos of users; store for future use.
2. Whenever a photo is uploaded, do the same and compare against known set.
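The two steps above can be sketched as a small gallery class. To keep the sketch runnable without a face model, the encoder is passed in as a function that maps a photo to a list of face embeddings; the class name, the 0.6 distance cutoff, and the fake encoder data are assumptions for illustration only. It also assumes each known photo shows only the enrolled user.

```python
import math


class FaceGallery:
    def __init__(self, encode, max_distance=0.6):
        self.encode = encode              # callable: photo -> list of face embeddings
        self.max_distance = max_distance  # placeholder cutoff for calling a match
        self.known = []                   # stored (user, embedding) pairs

    def enroll(self, user, photo):
        """Step 1: compute and store embeddings for a known photo of a user."""
        for embedding in self.encode(photo):
            self.known.append((user, embedding))

    def identify(self, photo):
        """Step 2: match every face in an uploaded photo against the known set."""
        results = []
        for embedding in self.encode(photo):
            best_user, best_distance = None, self.max_distance
            for user, known_embedding in self.known:
                distance = math.dist(embedding, known_embedding)
                if distance <= best_distance:
                    best_user, best_distance = user, distance
            results.append(best_user)  # None means no known user is close enough
        return results


# Fake encoder with hypothetical 2-D embeddings, standing in for a real model.
fake_encodings = {
    "alice_1.jpg": [[0.0, 0.0]],
    "bob_1.jpg": [[1.0, 1.0]],
    "party.jpg": [[0.1, 0.0], [0.9, 1.0], [5.0, 5.0]],
}
gallery = FaceGallery(encode=lambda photo: fake_encodings[photo])
gallery.enroll("alice", "alice_1.jpg")
gallery.enroll("bob", "bob_1.jpg")
names = gallery.identify("party.jpg")
```

With the face_recognition package, the encoder could plausibly be `lambda path: fr.face_encodings(fr.load_image_file(path))`. The linear scan over the known set is the simplest option and becomes the bottleneck as the search space grows.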
Example – Detection
import face_recognition as fr
image = fr.load_image_file("file.jpg")
# One (top, right, bottom, left) bounding box per detected face
face_locations = fr.face_locations(image)
Ref: github.com/ageitgey/face_recognition
Example – Embedding
image = fr.load_image_file("file.jpg")
# 128-dimensional embedding of the first detected face (raises IndexError if no face is found)
face_embedding = fr.face_encodings(image)[0]
Ref: github.com/ageitgey/face_recognition
Example – L2 Distance
Images: Wikimedia
  -   0.31 0.59 0.69
0.31   -   0.52 0.63
0.59 0.52   -   0.50
0.69 0.63 0.50   -
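A matrix like the one above can be produced by computing the L2 distance between every pair of embeddings. A minimal sketch, with hypothetical 2-D vectors standing in for real face embeddings (so the numbers differ from the slide):

```python
import math


def pairwise_l2(embeddings):
    """Symmetric matrix of L2 distances between all pairs of embeddings."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]


def format_matrix(matrix):
    """Render the matrix as text, with '-' on the diagonal as on the slide."""
    rows = []
    for i, row in enumerate(matrix):
        cells = ["  - " if i == j else f"{d:.2f}" for j, d in enumerate(row)]
        rows.append(" ".join(cells))
    return "\n".join(rows)


# Hypothetical embeddings for four face images.
embeddings = [[0.0, 0.0], [0.3, 0.0], [0.3, 0.4], [0.0, 0.4]]
distances = pairwise_l2(embeddings)
table = format_matrix(distances)
```

Smaller entries indicate faces the model considers more similar; the diagonal is always zero because each embedding is compared with itself.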
Face Landmarks
• Face landmarks can also be quickly extracted with pretrained models and used for a number of downstream tasks.
Example – Face Landmarks
# Dict mapping each facial feature to a list of (x, y) points, for the first detected face
face_landmarks = fr.face_landmarks(image)[0]
print(face_landmarks.keys())
# left_eyebrow, right_eyebrow, bottom_lip, top_lip, …
Ref: github.com/ageitgey/face_recognition
Example – Snapchat Filters
• Task: face manipulation
• Strategy:
1. Detect face and localize landmarks in image
2. Add objects, reshape image, etc. based on landmarks
from PIL import Image, ImageDraw
…
pil_image = Image.fromarray(image)
# 'RGBA' mode lets the fill color blend with the underlying pixels
d = ImageDraw.Draw(pil_image, 'RGBA')
lip_fill = (150, 0, 0, 128) # shade of red, 50% alpha
d.polygon(face_landmarks['top_lip'], fill=lip_fill)
d.polygon(face_landmarks['bottom_lip'], fill=lip_fill)
…
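Adding an object such as glasses follows the same pattern: derive a position, size, and rotation from the landmarks, then draw the object there. The sketch below covers only the geometry, assuming landmarks in the dict-of-points format shown earlier (with `left_eye` and `right_eye` keys); the function names, the `scale` factor, and the coordinates are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def overlay_placement(face_landmarks, scale=2.0):
    """Center, width, and rotation (degrees) for an overlay spanning both eyes."""
    left = centroid(face_landmarks['left_eye'])
    right = centroid(face_landmarks['right_eye'])
    dx, dy = right[0] - left[0], right[1] - left[1]
    center = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
    width = scale * math.hypot(dx, dy)        # overlay is wider than the eye distance
    angle = math.degrees(math.atan2(dy, dx))  # image coordinates: y grows downward
    return center, width, angle


# Hypothetical landmarks for a level face and for a tilted face.
level_face = {'left_eye': [(30, 40), (40, 40)], 'right_eye': [(70, 40), (80, 40)]}
tilted_face = {'left_eye': [(30, 40), (40, 40)], 'right_eye': [(70, 80), (80, 80)]}

center, width, angle = overlay_placement(level_face)
_, _, tilted_angle = overlay_placement(tilted_face)
```

The resulting values can then be used to resize, rotate, and paste an overlay image; note that the sign convention of the rotation depends on the drawing library.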
Scaling
Bias
• People & Demographics
• Is your training set… Coworkers? Single location?
• Environment
• Does it cover… Day and night? Seasons? Lighting
conditions? Backgrounds?
• Sensors
• Did you consider… Diverse hardware? Calibration?
Viewpoint (angle)? Resolution? Occlusion?
Optimizations
• It is often easier to simplify the real-world task than to drastically improve ML models.
Chart: performance vs. time (weeks). Multiple model optimizations ($$$ in developer time, etc.)
Chart: performance vs. time (weeks). Install a new light ($)
Risks
• What happens when your model makes a mistake?
• How can you deal with adversarial users?
• What is your threat model?
Other Considerations
• How do you handle…
• Model getting stale over time?
• Growing search space?
• Large amounts of real-time data?
• Detecting or tracking people vs. faces?
• Speed vs. cost vs. performance trade-offs?
Thank you.
gabriel@scalarresearch.com
