In the last of the three talks from Barclays, I show how easy it is to make your own deepfake, focussing on a "social media" quality fake for ethical reasons. The takeaway is how easy they are to make with a £1000 laptop in 24 hours, so that you can question what you see and think critically.
Welcome. My name is Janet Bastiman and this is the last in this series of talks on deepfakes.
I’m deviating from Orwell today and instead drawing on the warning from Musical Scientist Tim Blais. Today I’m going to give you the information you need to create your own deepfake, but I’m not going to show you all the tricks to make one as good as some of the examples I showed in the first talk. This will be a “social media” quality fake. I’m doing this because if you want to take this further, to make something that will fool in HD, you will have to make a conscious decision about how you are going to use that knowledge.
So we’ve looked at what they are, and the why of the ethics, so we’re finishing with the how: how do you make your own? We’ll start with some of the freely available apps and websites and then dive into how you make your own bespoke video.
In week one, I showed the Zao app that came out last year – from a single image you could put yourself in one of a selection of video clips. This was restricted to accounts in China but was really compelling.
If you search for face swap in your phone’s app store you’ll see a whole host of these now – some better than others. These apps are free and they are not keeping your data, but by using them you are creating output for their ongoing training. Whenever you get something for free, you are the currency you’ve used. Remember that.
In the interests of science I used Reface, previously Doublicat, as it’s one of the best. This is 24.99 a year and gives you a 3 day free trial. I took a selfie with glasses and one without, and selected the gif of Daenerys.
Reface have full deepfake tech and this app is just a side project to help showcase their abilities and help improve their commercial offering. It takes less than 10 seconds to put you in the gif. As you can see I tried with and without glasses to be fair. The one without is a better fake but I don’t recognise it as me ;)
This is a lot of fun and please do play with these apps but only in the knowledge that while they are not keeping your data, it is highly likely that the transient copy they make is being used to make their software better.
But what if you want to make something bespoke?
I’d like to introduce you to the amazing Captain Samantha Cristoforetti. She is an Italian fighter pilot and astronaut and was on the ISS for expeditions 42 and 43 from November 2014 to June 2015. She is absolutely awesome and has inspired a whole host of girls to go into engineering. They made a Barbie doll of her. She was the first astronaut to have an espresso in space and found huge fame for taking her Starfleet uniform on board for this shot.
Now I’ve never been to space, but as you might have guessed from the Lego Space station that’s been behind me in these sessions, I’m really excited by the idea. However, as a middle aged civilian I’m probably not top of the British space agency’s list, let alone ESA’s. But since I have a habit of making rash statements in my talks that end up happening, if anyone from ESA would like a data scientist/biochemist/mathematician to go up to the ISS, I do volunteer. That said, on the assumption I will never go to space, let’s fake it.
This consists of a series of steps. You need a source data set and a set for the face you want to replace it with. For both of these, you run an extract script to determine the facial features. Once you have the alignments file for your two faces you can train your network. What you’re doing here is creating two generative adversarial networks. One has a converter from A to B and a discriminator, and the other has a converter from B to A and a discriminator.
Then you convert the image or video. Each frame is analysed to identify the source face, and then a prediction is made from that subsection of the image with the face replaced and a new frame created. This uses the model created previously, which converts A to B and B to A.
Finally the video is stitched back together and any new audio added. Simple, right? Let’s look at each step in detail with a worked example.
Before I do that, here are our social responsibility pledges with deepfakes. Never use this with the intent to mislead – there are plenty of legitimate uses as we discussed last week.
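If it helps to picture the "new frame created" step, here is a minimal sketch of pasting a predicted face back into a frame. It is a toy under stated assumptions: the frame is a greyscale grid held in plain Python lists, and `paste_face`, the mask values and the pixel numbers are all made up for illustration. It is not how Faceswap itself does the blending.

```python
def paste_face(frame, new_face, mask, top, left):
    """Blend new_face into a copy of frame at (top, left).

    mask holds values from 0.0 (keep the original pixel) to 1.0 (use the new face).
    """
    out = [row[:] for row in frame]  # copy so the original frame is untouched
    for r, (face_row, mask_row) in enumerate(zip(new_face, mask)):
        for c, (pixel, alpha) in enumerate(zip(face_row, mask_row)):
            old = out[top + r][left + c]
            out[top + r][left + c] = round(alpha * pixel + (1 - alpha) * old)
    return out


# Hypothetical 4x4 black frame and a 2x2 "predicted face" patch.
frame = [[0] * 4 for _ in range(4)]
new_face = [[200, 200], [200, 200]]
mask = [[1.0, 0.5], [0.0, 1.0]]  # soft edges help hide the seam

result = paste_face(frame, new_face, mask, top=1, left=1)
for row in result:
    print(row)
```

A soft mask is the reason a swapped face can sit in the frame without a hard rectangular edge around it.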
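The steps above can be sketched as a list of commands. This is a rough outline rather than a recipe: the `extract`, `train` and `convert` subcommands and the `-i`, `-o`, `-A`, `-B` and `-m` flags are how I remember the Faceswap usage guide, so treat them as assumptions and check them against the tool's own help output. The folder and file names are hypothetical.

```python
from pathlib import Path


def faceswap_commands(src_a, src_b, target_video, workdir):
    """Build the extract, train and convert commands without running them."""
    work = Path(workdir)
    faces_a, faces_b = work / "faces_a", work / "faces_b"
    model, output = work / "model", work / "converted"
    base = ["python", "faceswap.py"]
    return [
        # 1. Extract faces and alignments for each person.
        base + ["extract", "-i", str(src_a), "-o", str(faces_a)],
        base + ["extract", "-i", str(src_b), "-o", str(faces_b)],
        # 2. Train a model that maps A to B and B to A.
        base + ["train", "-A", str(faces_a), "-B", str(faces_b), "-m", str(model)],
        # 3. Convert the target video using the trained model.
        base + ["convert", "-i", str(target_video), "-o", str(output), "-m", str(model)],
    ]


commands = faceswap_commands("sam_images", "janet.mp4", "iss_interview.mp4", "work")
for command in commands:
    print(" ".join(command))
```

Printing the commands first, rather than running them, makes it easy to check each one before committing many hours of GPU time.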
First we need an application. Now while you could write one from scratch, there’s really no need. A quick Google search and you can find all of this, including Faceswap on GitHub.
Faceswap does work on Windows, but I’ve stuck with Ubuntu because of some of the other tools I’ll be using and it means there’s more GPU to play with. You need the Nvidia CUDA libraries installed and a machine with a compatible GPU. I have a Dell XPS with an Nvidia GTX 1050, which is a 4 GB card, by no means top of the line considering the 3080s are rumoured to be out soon.
Getting all this installed can be a bit fiddly but they all come with step by step instructions so as long as you are comfortable with that, you can do this.
I ran this within a Docker container to protect my main system.
So within 5 mins I got everything I needed installed in a virtual environment and looked at the code to make sure it was doing everything I expected. Don’t just install random repositories from GitHub without knowing what they do ;)
Another 10 mins and I’d checked everything was running and working as expected.
And this is what I got. This is running from a container on my machine, so I do not need to worry about installing TensorFlow or building any other tools. Now I follow the instructions. Keep an eye on the times at the top of the slides showing how much time this takes, both hands on (actively doing things on my laptop) and processing (where I’m off doing something else).
We need some source data. There are quite a lot of images of Samantha online, so I downloaded a Google thumbnail scraper and fed it a few different search terms to ensure I got a variety of images. Faceswap recommends 500 to 5000. I got 700 initially. I deleted anything that wasn’t her, cropped out her face where there were multiple people in the image, and ended up with 477. I wasn’t sure if that was enough, but most of the interviews were of her in the same pose so I didn’t think that they would add much. This took about 30 mins to go through manually.
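A quick way to check where you stand against that 500 to 5000 guideline is to count what is left in the folder after cleaning. This is only a sketch: the function names and the list of image suffixes are my own choices, and the thresholds are simply the figures quoted above.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image types


def count_images(folder):
    """Count image files directly inside folder (not in subfolders)."""
    return sum(
        1
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def enough_images(count, minimum=500, maximum=5000):
    """True when count falls inside the recommended range."""
    return minimum <= count <= maximum


# My cleaned set of 477 sits just under the recommended minimum.
print(enough_images(477))
```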
Not being quite as famous, I don’t have as many images online so I made a quick video of myself walking around the house and pulling faces much to the delight and amusement of my family. This gave me a few minutes of HD footage with lots of different angles and lighting.
The Faceswap app can deal with both of these types of inputs. Point the app at your data and hit go and you’ll get output like this. Okay, but what’s happening behind the scenes? Well first, for a video it’s extracting the frames, and then for each image it’s using tools from VGG face detect to find the landmarks in the faces. This is a 5 minute job for each set of images. The faces and their landmark features are saved to an output folder.
How does it do this? Faceswap uses a pretrained model, VGG faces, created by the Visual Geometry Group in Oxford and originally trained on thousands of celebrities. We don’t want the 2k faces that were in the main dataset, but instead the features of both of our faces. It creates a feature map of the face in each image and outputs this as a JSON file. This can be used to determine if two faces are the same or different, even in different poses. This is a very deep network, and the architecture is on the slide here if you want to implement it yourself, but because we’re using a pre-trained model we can just load it in, use it to get the face features and save these results.
Faceswap is in the process of releasing an updated mapping tech. I’ve included this because if you wanted to do this a few years ago, you would have needed to understand how to code this model or even how to manipulate it. You don’t any more – it’s a plug in.
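Faceswap handles the frame extraction for you, but if you want to see what that step amounts to, here is a sketch that builds an equivalent ffmpeg command. The helper name, the output file pattern and the 5 frames per second rate are assumptions for illustration, and the command is only printed, not run.

```python
def ffmpeg_extract_frames(video_path, frames_dir, fps=5):
    """Build an ffmpeg command that writes numbered PNG frames from a video."""
    return [
        "ffmpeg",
        "-i", str(video_path),               # input video
        "-vf", f"fps={fps}",                 # sample this many frames per second
        f"{frames_dir}/frame_%05d.png",      # numbered output files
    ]


command = ffmpeg_extract_frames("janet.mp4", "frames", fps=5)
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
```

Sampling a few frames per second rather than every frame keeps near-identical images out of the training set.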
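To give a feel for how a feature map can tell two faces apart, here is a small sketch that compares feature vectors with cosine similarity. Everything in it is an assumption for illustration: the three-number vectors are toy values (real feature vectors have thousands of entries), the 0.8 threshold is arbitrary, and cosine similarity is one common choice rather than necessarily what Faceswap uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_person(a, b, threshold=0.8):
    """Treat two faces as the same person when their features are similar enough."""
    return cosine_similarity(a, b) >= threshold


# Toy feature vectors: the same face in two poses, and a different face.
face_front = [0.9, 0.1, 0.4]
face_side = [0.8, 0.2, 0.5]
other_face = [0.1, 0.9, 0.2]

print(same_person(face_front, face_side))
print(same_person(face_front, other_face))
```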
And this is the sort of data we have after that first stage of processing. Notice how it doesn’t care about anything above my eyebrows and is confused by my glasses. So we’re less than an hour in from when I had this crazy idea and I’ve got detailed mapping of both faces and it’s time to train.
This is actually the fun bit :) During training we’re going to create a model, using generative adversarial techniques, that will take face A and turn it into face B, and take face B and turn it into face A. It does this by recognising the face and generating a new face within this space. The alignment files are used to test the new face to see if it is a good match for the desired destination face, as well as ensuring the face matches the sharpness of the rest of the image, and every time it does better the model is saved. This is what GANs do in a nutshell. I used all the default settings and pointed the app at the new face directories and alignment files I’d created. So here it is fairly early in the process. It’s still pretty random and blurry, but if you really squint your eyes, the destination face is starting to get my eyes.
Let’s remind ourselves of GANs. Where the face should be, we’re using a generator to turn random noise into a fake image. Both fakes and originals are fed into the discriminator, which is using the alignment files and some other tooling to validate the images. When the discriminator network is correct then it is reinforced. When the generator fools the discriminator then the generator is reinforced. Through many iterations of training you end up with realistic outputs. Again – you do not need to know how to code a GAN or interact with it to use this software.
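For anyone who does want to peek under the hood, the "reinforcement" of each network comes down to two loss values. This is a minimal sketch of the textbook GAN losses, assuming the discriminator outputs a score between 0 (fake) and 1 (real). It is not taken from the Faceswap code, and the function names are my own.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for a single score, clipped to avoid log(0)."""
    eps = 1e-12
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(real_scores, fake_scores):
    """Low when real images score near 1 and fakes score near 0."""
    real = sum(bce(s, 1.0) for s in real_scores) / len(real_scores)
    fake = sum(bce(s, 0.0) for s in fake_scores) / len(fake_scores)
    return real + fake


def generator_loss(fake_scores):
    """Low when the discriminator is fooled into scoring fakes near 1."""
    return sum(bce(s, 1.0) for s in fake_scores) / len(fake_scores)


# A confident, correct discriminator has a low loss...
print(discriminator_loss(real_scores=[0.9], fake_scores=[0.1]))
# ...which means the generator, whose fake scored 0.1, has a high one.
print(generator_loss(fake_scores=[0.1]))
```

Training alternates between nudging each network to lower its own loss, which is why the two improve against each other.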
I came back a few hours later and my laptop had completely died. It had got far too hot but fortunately was not near anything flammable. So I did a hard shutdown, put it in the fridge for a few minutes and then rebooted. Brought up sensors and made sure it was back to a lower temperature before kicking everything off again.
I had to manually kill the Docker containers here but not the image, so it was pretty fast to be up and running. From the timestamp on the files I’d lost maybe an hour. One hour less that I could train, but not a big deal. So after propping up the laptop to get increased airflow (the intake vents on the Dell XPS are not big enough for it to be maxed out like this for so long) I set it going again. It was coded well enough to recognise that there was a model in progress, imported it and carried on training.
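That pick-up-where-you-left-off behaviour is a simple pattern, sketched below. It is a simplification under stated assumptions: the "model" is just an iteration counter in a JSON file, it saves every fixed number of steps rather than whenever the model improves, and all the names are hypothetical.

```python
import json
from pathlib import Path


def load_or_create_state(path):
    """Resume from a saved state if one exists, otherwise start fresh."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"iteration": 0}


def save_state(path, state):
    """Write the current state to disk."""
    Path(path).write_text(json.dumps(state))


def train(path, iterations, save_every=100):
    """Run training steps, saving a checkpoint every save_every iterations."""
    state = load_or_create_state(path)
    for _ in range(iterations):
        state["iteration"] += 1  # stand-in for one real training step
        if state["iteration"] % save_every == 0:
            save_state(path, state)
    return state
```

If the machine dies at iteration 250, the next run resumes from the checkpoint at 200, so only the work since the last save is lost.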
So maybe another 15 minutes of actual hands on effort. I let it run for about 20 hours.
At this point, this is what I was getting. Going from me to Sam wasn’t great, but I don’t want to put Sam into a video of me doing an AI related talk, so I only really cared about the A to B columns, from Sam to me. These aren’t perfect but are looking pretty good.
So I needed a good video. I found a nice interview from December 2014 with the BBC where it’s obvious that Sam is on the ISS. You can see the microphone floating and the lack of gravity on her, particularly with her hair. Using a browser extension I downloaded this from YouTube (for educational purposes!) and using Shotcut (an open source video editing app for Linux) I took about 30s from this video, that I felt was a reasonable representation.
I then used the convert option in Faceswap, again with all the default settings, using the downloaded video as my source and the model I’d created in the previous step. This took a few minutes as it extracted all the frames and converted every one, using the model. The final step was to use the included ffmpeg tool to stitch it back together with the original audio, and I had a 30 second clip of me in the ISS. But I sounded wrong! (And of course the hair wasn’t right.)
Now there are other faking apps where you can take samples of a voice and artificially generate sentences, but since this was me, I used another open source and very well respected tool called Audacity to record my own version of the transcript, and just went back to Shotcut and added a new audio track for the video after muting the original. (Also, going into the audio faking would overrun this talk time!) With a bit more care I could have cut the audio into segments and stretched it to fit, as my fake mouth is matching how Samantha talks and not how I talk. So the final version is a little out of sync in places. Similarly, I found it really difficult to say what she said in my normal voice without trying to impersonate her Italian English.
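The stitching step can be sketched as a single ffmpeg command that takes the video from the converted frames and the audio from the original clip. The helper name, file names and the 25 frames per second rate are assumptions for illustration; the frame rate needs to match the source video. As before, the command is only printed.

```python
def ffmpeg_stitch(frames_pattern, original_video, output_path, fps=25):
    """Build an ffmpeg command joining converted frames with the original audio."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # frame rate of the image sequence
        "-i", frames_pattern,     # input 0: converted frames
        "-i", original_video,     # input 1: original clip, used for its audio
        "-map", "0:v:0",          # video from the frames
        "-map", "1:a:0",          # audio from the original
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",    # widely playable pixel format
        "-shortest",              # stop at the end of the shorter stream
        output_path,
    ]


stitch = ffmpeg_stitch("converted/frame_%05d.png", "iss_interview.mp4", "fake.mp4")
print(" ".join(stitch))
# To actually run it: subprocess.run(stitch, check=True)
```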
The audio here is quite quiet, but I’d got the giggles by this point and had run out of time, so this was the best audio I had. The network has done a pretty good job with my face but is completely confused by my glasses. At times, you can see a strange double eyebrow going on.
So in total about 2 hours of my time, over about 22 hours elapsed time. This isn’t perfect, but with a bit more training data for Sam and maybe using a reference video of me without glasses, and a longer training time I could smooth it out a lot and get rid of the blurriness. Something of this quality could be made by anyone with a £1000 computer.
It’s this easy. I do have an advantage in that I know how this all works, I had everything set up and I could debug it, but this is still very easy and you should all have a go just to see how easy they are.
We’ve seen what they are, positive and negative applications and how easy they are to make. We’ve seen that they are out there. You need to be ready to tell the difference and think critically.
Making a deepfake
PART 3: How to make a deepfake
Dr Janet Bastiman
You gotta choose, yourself, how to use it
The knowledge you hold and
Don't ever let a letter go
You only get one shot to stop
And one chance to know
Responsibility comes once you're a science guy,
Tim Blais, Choose Yourself
Dr Janet Bastiman @yssybyl
One shot
• Short videos
Bespoke
• Technical Requirements
• Data Requirements
End to End
• Generation and post-processing
Captain Samantha Cristoforetti
Images from ESA and Samantha Cristoforetti
"If you have ALS, it's crucial
that you do not wait and that
you start backing up your voice
while you still have it."
- Pat Quinn
• Source Face
• Destination Face
• Still images
Get data and
• Swap A to B
• Swap B to A
Create a GAN
• Source video
• Apply deepfake
Create your new
• Over dub sound
• Fix issues
• it is not for creating inappropriate content.
• it is not for changing faces without consent
or with the intent of hiding its use.
• it is not for any illicit, unethical, or
Part 1: What are deepfakes?
Part 2: Ethics and detection of deepfakes