2. DEEPFAKES
NEURAL NETWORKS
ANALYZING THE TECHNOLOGY
PROCESS
APPLICATION
3D HEAD POSE ESTIMATION
INCONSISTENT HEAD POSES IN DEEP FAKES
CLASSIFICATION BASED ON HEAD POSES
CONCLUSION
5. The word deepfake has been around for only a couple of years. It is a combination of “deep learning” – a subset of AI that uses neural networks – and “fake.” It refers to a technique for human image synthesis that combines and superimposes existing images and videos onto source images or videos using a machine learning technique known as a generative adversarial network. The term was coined in 2017 and is named after a Reddit user known as “deepfakes” who, in December 2017, used the technology to edit the faces of celebrities onto people in pornographic video clips. These videos and audio clips look and sound just like the real thing. Deepfakes are lies disguised to look like truth.
6. A deep neural network is a deep learning concept: it is what artificial intelligence researchers call computer systems that have been trained to do specific tasks, in this case to recognize altered images. These networks are organized in connected layers. Deep neural network architectures can identify manipulated images at the pixel level with high precision. Similar neural networks also power Snapchat and Instagram filters.
7. Several properties of the source material determine whether a deepfake will be convincing. The following criteria are used to evaluate these requirements:
Number of images
Lighting conditions
Size/quality of the source material
Angle of the source material
Differing facial structures
Overlapping objects
8. PROCESS
At the moment there are two main applications used to create deep fakes: FakeApp and faceswap. The process requires three steps: extraction, training, and creation.
Extraction
The deep- in deep fakes comes from the fact that this face-swap technology uses Deep Learning. It
often requires large amounts of data. Without hundreds of face pictures or some videos, you will
not be able to create a deepfake video.
A way to get around this is to collect a number of video clips featuring the people you want to face-swap. The extraction process refers to extracting all frames from these video clips, identifying the faces, and aligning them. The alignment is critical, since the neural network that performs the face swap requires all faces to have the same size (usually 256×256 pixels) and aligned features. Detecting and aligning faces is considered a mostly solved problem, and most applications do it very efficiently.
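As a rough illustration of the extraction step, the sketch below reads a clip, detects faces with dlib, and saves fixed-size crops; the file names, paths, and crop size are assumptions for illustration, not the defaults of FakeApp or faceswap, and a real pipeline would also align the crops using facial landmarks.

```python
# Sketch of the extraction step: read every frame of a clip, detect faces
# with dlib, and save fixed-size crops for training.
# Paths and sizes are illustrative assumptions, not FakeApp/faceswap defaults.
import os
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
os.makedirs("faces_A", exist_ok=True)

cap = cv2.VideoCapture("person_A.mp4")   # hypothetical input clip
crop_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for rect in detector(gray):
        # Clamp the detection box to the frame and crop the face region.
        x1, y1 = max(rect.left(), 0), max(rect.top(), 0)
        x2, y2 = min(rect.right(), frame.shape[1]), min(rect.bottom(), frame.shape[0])
        face = frame[y1:y2, x1:x2]
        if face.size == 0:
            continue
        # Resize to the 256x256 input size the face-swap network expects.
        face = cv2.resize(face, (256, 256))
        cv2.imwrite(f"faces_A/{crop_idx:06d}.png", face)
        crop_idx += 1
cap.release()
```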
9. Training
Training is a technical term borrowed from Machine Learning. In this case, it refers to the process that allows a neural network to convert one face into another. Although it takes several hours, the training phase needs to be done only once. Once completed, the network can convert a face from person A into person B. This is the most obscure part of the entire process.
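FakeApp and faceswap do not document their networks here, but the commonly described face-swap setup is a pair of autoencoders sharing one encoder: decoder A learns to reconstruct person A, decoder B learns person B, and a swap is produced by decoding A's encoding with B's decoder. A minimal PyTorch sketch of that idea, with made-up layer sizes and hyperparameters, might look like this:

```python
# Minimal sketch of the shared-encoder / two-decoder idea behind face swapping.
# Layer sizes and training details are illustrative, not the FakeApp architecture.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, z):
        return self.net(z)

encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()   # one decoder per identity
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder_a.parameters()) + list(decoder_b.parameters()),
    lr=5e-5,
)
loss_fn = nn.L1Loss()

def train_step(batch_a, batch_b):
    # Both identities are reconstructed through the *same* encoder,
    # so the latent space learns identity-independent face structure.
    recon_a = decoder_a(encoder(batch_a))
    recon_b = decoder_b(encoder(batch_b))
    loss = loss_fn(recon_a, batch_a) + loss_fn(recon_b, batch_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# After training, a swap is simply: decoder_b(encoder(face_of_person_a))
```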
10. Creation
Once the training is complete, it is finally time to create a deepfake. Starting from a video, all frames
are extracted and all faces are aligned. Then, each one is converted using the trained neural network.
The final step is to merge the converted face back into the original frame. While this sounds like an
easy task, it is actually where most face-swap applications go wrong.
The creation process is the only one that does not use any Machine Learning, and it is the stage where most visible mistakes appear. Each frame is processed independently; since there is no temporal correlation between frames, the final video might show some flickering.
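As a hedged sketch of the merge step, the snippet below pastes a converted face back into the original frame with OpenCV's Poisson blending; the rectangular mask and bounding-box handling are simplifying assumptions, and because each frame is still handled independently it does nothing about flickering.

```python
# Sketch of merging a converted face back into the original frame.
# Uses Poisson blending (cv2.seamlessClone); the mask and box are illustrative.
import cv2
import numpy as np

def merge_face(frame, converted_face, face_box):
    """Paste a converted face crop into `frame` inside `face_box` = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = face_box
    resized = cv2.resize(converted_face, (x2 - x1, y2 - y1))
    # A simple full-rectangle mask; real face-swap tools use a tighter facial mask.
    mask = 255 * np.ones(resized.shape[:2], dtype=np.uint8)
    center = ((x1 + x2) // 2, (y1 + y2) // 2)
    return cv2.seamlessClone(resized, frame, mask, center, cv2.NORMAL_CLONE)
```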
11. EXAMPLES
Obama Deepfake
Jordan Peele
In this deepfake, a false video that seems deceptively real, the American actor and director Jordan Peele shows former US president Barack Obama speaking about the dangers of false information and fake news. Peele transferred his own facial movements onto Obama's facial characteristics using deepfake technology.
Mark Zuckerberg Deepfake
This particular deepfake manipulates the audio to make Facebook CEO Mark Zuckerberg sound like a psychopath talking to CBS News about the "truth of Facebook and who really owns the future."
This video was widely circulated on Instagram and ultimately
went viral.
12. APPLICATIONS
FAKE APP
In January 2018, a proprietary desktop application called FakeApp was launched. The app allows users to easily create and share videos with swapped faces. It uses an artificial neural network, a GPU, and three to four gigabytes of storage space to generate the fake video. To produce a convincing result, the program needs a lot of visual material of the person to be inserted, so that it can learn which image aspects have to be exchanged, based on the video sequences and images.
FACE SWAP
When applied correctly, this technique is uncannily good at swapping faces. But it has a major disadvantage: it
only works on pre-existing pictures.
It relies on neural networks, computational models that are loosely inspired by the way real brains process
information. This novel technique allows generating so-called deepfakes, which actually morph a person’s face
to mimic someone else's features, while preserving the original facial expression.
13. 3D HEAD POSE ESTIMATION
The 3D head pose corresponds to the rotation and translation from the world coordinates to the corresponding camera coordinates. Specifically, denote $[U, V, W]^T$ as the world coordinates of one facial landmark, $[X, Y, Z]^T$ as its camera coordinates, and $(x, y)^T$ as its image coordinates. The transformation between the world and the camera coordinate systems can be formulated as

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = R \begin{bmatrix} U \\ V \\ W \end{bmatrix} + \vec{t},$$

where $R$ is the $3 \times 3$ rotation matrix and $\vec{t}$ is the $3 \times 1$ translation vector. The transformation between the camera and image coordinate systems is defined as

$$s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix},$$

where $f_x$ and $f_y$ are the focal lengths in the $x$- and $y$-directions, $(c_x, c_y)$ is the optical center, and $s$ is an unknown scaling factor.
14. In 3D head pose estimation, we need to solve the reverse problem, i.e., estimating $s$, $R$ and $\vec{t}$ using the 2D image coordinates and 3D world coordinates of the same set of facial landmarks obtained from a standard model (e.g., a 3D average face model), assuming we know the camera parameters. Specifically, for a set of $n$ facial landmark points, this can be formulated as an optimization problem:

$$\min_{s, R, \vec{t}} \sum_{i=1}^{n} \left\| s \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} - \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \left( R \begin{bmatrix} U_i \\ V_i \\ W_i \end{bmatrix} + \vec{t} \right) \right\|^2,$$

which can be solved efficiently using the Levenberg-Marquardt algorithm [15]. The estimated $R$ is the camera pose, i.e., the rotation of the camera with respect to the world coordinates, and the head pose is obtained by reversing it as $R^T$ (since $R$ is an orthonormal matrix).
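The minimization above is the standard perspective-n-point (PnP) problem, so one way to reproduce it is OpenCV's cv2.solvePnP with the iterative (Levenberg-Marquardt) solver. The sketch below assumes the 3D model points and 2D landmark points are already available and uses the same camera approximations adopted later (focal length ≈ image width, optical center at the image center, no lens distortion):

```python
# Sketch of 3D head pose estimation from 2D landmarks via the PnP problem.
# `model_points_3d` (N x 3) would come from a standard 3D face model and
# `image_points_2d` (N x 2) from a landmark detector such as DLib;
# both are assumed inputs here.
import cv2
import numpy as np

def estimate_head_pose(model_points_3d, image_points_2d, image_width, image_height):
    # Approximate intrinsics: focal length = image width, optical center = image
    # center, no lens distortion (the approximation used in the classification step).
    fx = fy = float(image_width)
    cx, cy = image_width / 2.0, image_height / 2.0
    camera_matrix = np.array([[fx, 0, cx],
                              [0, fy, cy],
                              [0,  0,  1]], dtype=np.float64)
    dist_coeffs = np.zeros(4)

    # solvePnP minimizes the reprojection error iteratively (Levenberg-Marquardt),
    # returning the rotation (as a Rodrigues vector) and translation of the world
    # coordinates with respect to the camera.
    ok, rvec, tvec = cv2.solvePnP(
        model_points_3d.astype(np.float64),
        image_points_2d.astype(np.float64),
        camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    R, _ = cv2.Rodrigues(rvec)      # camera pose
    head_pose = R.T                 # head pose is the transpose (inverse) of R
    return head_pose, R, tvec
```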
15. INCONSISTENT HEAD POSES IN DEEP FAKES
As a result of swapping faces in the central face region in the Deep Fake process in Fig. 1, the landmark locations of fake faces often deviate from those of the original faces. As shown in Fig. 1(c), a landmark $P_0$ in the central face region is first affine-transformed into $P_0^{\mathrm{in}} = M P_0$. After the generative neural network, its corresponding landmark on the faked face is $Q_0^{\mathrm{out}}$.
As the configuration of the generative neural network in Deep Fake does not guarantee landmark matching, and people have different facial structures, the landmark $Q_0^{\mathrm{out}}$ on the generated face can have a different location than $P_0^{\mathrm{in}}$. Based on a comparison of the 51 central-region landmarks of 795 pairs of images at 64 × 64 pixels, the mean shift of a landmark from the input (Fig. 1(d)) to the output (Fig. 1(e)) of the generative neural network is 1.540 pixels, with a standard deviation of 0.921 pixels. After the inverse transformation $Q_0 = M^{-1} Q_0^{\mathrm{out}}$, the landmark locations $Q_0$ in the faked face will differ from the corresponding landmarks $P_0$ in the original face.
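To make the notation concrete, here is a small numeric illustration of how a shift introduced by the generative network survives the inverse transform; the affine matrix M, the landmark coordinates, and the size of the shift are made-up numbers, not values from the experiment:

```python
# Numeric illustration of the landmark shift: P0 -> P_in = M @ P0, the generator
# outputs Q_out near-but-not-equal to P_in, and Q0 = M^-1 @ Q_out differs from P0.
# M, P0, and the ~1.5 px shift are made-up numbers for illustration only.
import numpy as np

M = np.array([[0.9, 0.05, 10.0],     # hypothetical affine warp into the face crop
              [-0.05, 0.9, 12.0],
              [0.0, 0.0, 1.0]])

P0 = np.array([120.0, 150.0, 1.0])   # landmark in the original face (homogeneous coords)
P_in = M @ P0                        # landmark after the affine transform

Q_out = P_in + np.array([1.5, -0.9, 0.0])   # generator shifts it by ~1.5 px (cf. mean 1.540)
Q0 = np.linalg.inv(M) @ Q_out        # mapped back into the original frame

print(np.linalg.norm(Q0[:2] - P0[:2]))  # nonzero: the fake's landmark no longer matches P0
```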
16. Fig.: Distribution of the cosine distance between $\vec{v}_c$ and $\vec{v}_a$ (the head orientation vectors estimated from the central face region and from the whole face, respectively) for fake and real face images.
17. CLASSIFICATION BASED ON HEAD POSES
We further trained SVM classifiers on the differences between head poses estimated from the full set of facial landmarks and those estimated from the central face region, to differentiate Deep Fakes from real images or videos. The features are extracted using the following procedure: (1) For each image or video frame, we run a face detector and extract 68 facial landmarks using the software package DLib [16]. (2) Then, with the standard 3D facial landmark model of the same 68 points from OpenFace2 [17], the head poses from the central face region ($R_c$ and $\vec{t}_c$) and from the whole face ($R_a$ and $\vec{t}_a$) are estimated using landmarks 18–36, 49, 55 (red in Fig. 2) and 1–36, 49, 55 (red and blue in Fig. 2), respectively. Here, we approximate the camera focal length by the image width, take the camera center to be the image center, and ignore the effect of lens distortion. (3) The differences between the obtained rotation matrices ($R_a - R_c$) and translation vectors ($\vec{t}_a - \vec{t}_c$) are flattened into a vector, which is standardized by subtracting its mean and dividing by its standard deviation before classification.
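A sketch of how these features could be assembled and classified is shown below; it assumes the per-image head poses (Ra, ta) and (Rc, tc) have already been estimated as described above, and the SVM kernel and preprocessing choices are assumptions rather than settings confirmed by the paper:

```python
# Sketch of the classification step: flatten (Ra - Rc) and (ta - tc) into a
# feature vector per image, standardize, and train an SVM.
# `pose_estimates` and `labels` are assumed, precomputed inputs.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pose_difference_feature(Ra, ta, Rc, tc):
    # Ra, Rc: 3x3 rotation matrices; ta, tc: 3x1 translation vectors.
    return np.concatenate([(Ra - Rc).ravel(), (ta - tc).ravel()])  # length 12

# X: one feature vector per image; y: 1 for Deep Fake, 0 for real (assumed labels).
X = np.stack([pose_difference_feature(Ra, ta, Rc, tc)
              for (Ra, ta, Rc, tc) in pose_estimates])
y = np.array(labels)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
scores = clf.decision_function(X)    # scores of this kind are used to draw an ROC curve
```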
18. Fig.: ROC curves of the SVM classification results from experimental evaluations of our method on a set of real face images and Deep Fakes; see text for details.
19. CONCLUSION
In this paper, we propose a new method to expose AI-generated fake face images or videos (commonly known as Deep Fakes). Our method is based on the observation that such Deep Fakes are created by splicing a synthesized face region into the original image, and in doing so introduce errors that can be revealed when 3D head poses are estimated from the face images. We perform experiments to demonstrate this phenomenon and further develop a classification method based on this cue.