"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
ML Paper Tutorial - Video Face Manipulation Detection Through Ensemble of CNNs.pptx
1. Video Face Manipulation
Detection Through
Ensemble of CNNs
Paper Tutorial: Chris Chien
N. Bonettini, E. D. Cannas, S. Mandelli, L. Bondi, P. Bestagini and S. Tubaro, "Video Face Manipulation
Detection Through Ensemble of CNNs," 2020 25th International Conference on Pattern Recognition (ICPR), 2021,
pp. 5012-5019, doi: 10.1109/ICPR48806.2021.9412711.
2. Paper Summary
● Motivation: tackle the detection of modern face-manipulation techniques, and run the
detection more efficiently.
● Methodology:
○ Ensemble CNN (EfficientNet models) trained based on the use of attention layers and
siamese training.
● Future Work:
○ Embedding of temporal info
○ Voting schemes of the ensemble models
4. Fake Face Detection Algorithms
Previous networks/methods designed for fake face detection:
● MesoNet: a relatively shallow CNN detecting fake faces
● XceptionNet
● LSTM: extract a series of frame-based features
● Warping traces models
● Eye blinking analysis
● Semantic analysis of the frames
● Inconsistent lighting effects
6. Dataset
● FF++:
○ Data Variety: Face2Face, FaceSwap, DeepFakes,
Neural Textures techniques.
○ Data Volume: Each method is applied to 1,000
pristine YouTube videos, each with at least
280 frames.
○ Data Format: Videos are compressed using the
H.264 codec.
○ Data Split: 720 videos for training, 140 for validation
and 140 for testing.
● DFDC
○ Data Variety: Different DeepFake techniques,
with diversity in gender, skin tone and age.
○ Data Volume: 119,000 videos, each with
roughly 300 frames. Unbalanced dataset:
about 100,000 are fakes.
○ Data Split: 35 folders for training, 5 folders for
validation and the last 10 folders for testing.
7. Data Preprocessing
● Select 32 frames from each video for the training set: this count helps curb overfitting,
and additional frames do not improve model performance.
● Extract faces from each frame using the BlazeFace extractor, since it is faster than the MTCNN
detector. If more than one face appears in a frame, keep only the one with the highest confidence score.
● Data augmentation: downscaling, horizontal flipping, random brightness contrast, hue saturation,
noise addition and finally JPEG compression (Albumentations library).
[Pipeline diagram: sample 32 frames → extract faces with BlazeFace → run data augmentation]
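The frame selection and augmentation steps above can be sketched in NumPy (a minimal illustration, not the paper's code; the actual pipeline uses the Albumentations library, and the parameter ranges below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frame_indices(n_frames: int, n_keep: int = 32) -> np.ndarray:
    """Pick 32 evenly spaced frame indices from a video."""
    return np.linspace(0, n_frames - 1, n_keep).round().astype(int)

def augment(face: np.ndarray) -> np.ndarray:
    """Randomly flip, shift brightness/contrast and add noise to a face crop."""
    out = face.astype(np.float32)
    if rng.random() < 0.5:                       # horizontal flip
        out = out[:, ::-1]
    if rng.random() < 0.5:                       # brightness / contrast
        out = rng.uniform(0.8, 1.2) * out + rng.uniform(-20, 20)
    if rng.random() < 0.5:                       # additive Gaussian noise
        out = out + rng.normal(0, 5, out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)
```
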
8. Model Training
Solution = Ensemble CNN + Attention Mechanism + Siamese Paradigm
Keep computational complexity at bay:
● Analyze 4,000 videos in less than 9 hours using at most a single NVIDIA
P100 GPU.
● The trained model must occupy less than 1 GB of disk space.
9. Ensembling Process
● Why use an ensemble?
Train classifiers that capture high-level semantic information and complement one
another.
● How?
1. Model Arch Source: EfficientNet (reasoning: a good trade-off among
model size, run time (FLOPS cost) and accuracy).
2. Attention Mechanism: makes the network explainable → shows which part of
the frame is manipulated.
3. Network Training Strategy: Siamese Training.
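One simple way to fuse the ensemble outputs, sketched below under the assumption that each member emits a per-frame fake score (the paper leaves more elaborate voting schemes as future work):

```python
import numpy as np

def video_fake_score(scores) -> float:
    """scores: shape (n_models, n_frames) of per-frame fake scores.
    Average across ensemble members, then across the video's frames."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean(axis=0).mean())
```
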
10. What Is EfficientNet?
● The design of EfficientNet relies on architecture-scaling techniques for
CNNs.
● The scaling balances the network dimensions: width, depth, and
image resolution.
Image Source: EfficientNet: Rethinking
Model Scaling for Convolutional Neural
Networks
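The compound-scaling idea can be illustrated with the coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15, chosen so that α·β²·γ² ≈ 2; the baseline dimensions in the usage below are made-up illustrative values):

```python
# Compound scaling: depth, width and resolution grow together as
# alpha**phi, beta**phi, gamma**phi for a user-chosen coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def scaled_dims(phi: int, depth: float, width: float, resolution: float):
    """Scale a baseline network's depth/width/resolution jointly,
    roughly doubling FLOPs for each unit increase of phi."""
    return (depth * ALPHA ** phi,
            width * BETA ** phi,
            resolution * GAMMA ** phi)
```
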
11. What Is Attention Layer?
Given query (Q), key (K) and value (V) vectors:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
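A minimal NumPy sketch of this scaled dot-product attention (the generic formula, not the exact attention block used in the paper's EfficientNetB4Att):

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax taken row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```
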
12. What Is Siamese Training? - Pros and Cons
It compares the similarity of a pair of input images.
Pros:
● It is designed for scalable systems (one-shot learning), with use cases
such as facial recognition, place registration and signature verification.
Cons:
● Training can take longer, since the model learns from a quadratic number of image pairs.
13. What Is Siamese Training? - Siamese Model Architecture
[Diagram: a real image and a fake image each pass through Network 1 and Network 2, which share weights; the resulting real-image and fake-image embeddings feed the loss function.]
14. What Is Siamese Training? - Siamese Model Loss Functions
Inputs: a real image as the anchor (A), another real image as the positive (P), and a fake image as the negative (N).
Option 1. Triplet Loss = max(0, D(A, P) − D(A, N) + margin)
The anchor and positive should move closer together; the anchor and negative should move farther apart.
Option 2. Contrastive Loss, for an image pair (A, B) with label Y (0 = same class, 1 = different class):
(1 − Y)·0.5·D(A, B)² + Y·0.5·max(0, margin − D(A, B))²
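Both loss options can be written directly on embedding vectors; a NumPy sketch (D is taken to be the Euclidean distance, and `margin=1.0` is an illustrative default, not the paper's value):

```python
import numpy as np

def triplet_loss(a, p, n, margin=1.0) -> float:
    """max(0, D(A, P) - D(A, N) + margin): pull the anchor towards the
    positive, push it away from the negative."""
    return max(0.0, np.linalg.norm(a - p) - np.linalg.norm(a - n) + margin)

def contrastive_loss(a, b, y, margin=1.0) -> float:
    """y = 0 for a similar pair (pull together), y = 1 for a dissimilar
    pair (push apart until the margin is reached)."""
    d = np.linalg.norm(a - b)
    return (1 - y) * 0.5 * d**2 + y * 0.5 * max(0.0, margin - d)**2
```
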
15. EfficientNetB4 with the Attention Layer
Standard EfficientNet:
● Model Size: 19 million
parameters
● Model Operations: 4.2
billion FLOPs
● Model Performance: 83.8%
top-1 accuracy on
ImageNet
EfficientNetB4Att
17. Other Training Info
● Hyperparameters: models use the Adam optimizer with β₁ = 0.9,
β₂ = 0.999, ε = 10⁻⁸, and an initial learning rate of 10⁻⁵.
● HW for Training: Intel Xeon E5-2687W-v4 and an NVIDIA Titan V.
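For reference, one Adam update step with these hyperparameters, written out in NumPy (a textbook sketch of the update rule, not the paper's training code):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * g            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g**2         # second-moment (variance) estimate
    m_hat = m / (1 - b1**t)              # bias correction for step t
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```
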
19. EfficientNetB4Att Explainability
Select the output of the sigmoid layer in the attention block, which is a 2D map
of size 28 × 28. Then up-scale it to the input face size (224 × 224) and
superimpose it on the input face.
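The up-scale-and-superimpose step might look like this in NumPy (nearest-neighbour up-scaling and a 50/50 blend are assumptions; the slide does not specify these details):

```python
import numpy as np

def overlay_attention(face: np.ndarray, att: np.ndarray, alpha=0.5):
    """Up-scale a 28x28 attention map to the 224x224 face and blend."""
    scale = face.shape[0] // att.shape[0]            # 224 // 28 = 8
    up = np.kron(att, np.ones((scale, scale)))       # nearest-neighbour
    heat = up[..., None] * 255.0                     # map [0, 1] -> [0, 255]
    out = (1 - alpha) * face.astype(np.float32) + alpha * heat
    return np.clip(out, 0, 255).astype(np.uint8)
```
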