Alberto Massidda - Scenes from a memory - Codemotion Rome 2019

Scenes from a memory
Neural audio/video generation
Alberto Massidda

Who we are
● Founded in 2001;
● Branches in Milan, Rome, Cosenza and London;
● Market leader in enterprise ready solutions based on Open Source tech;
● Expertise:
○ DevOps: Monitoring, Cloud and Containers
○ Agile: Processes and Management
○ Integration: Mobile, Frontend, Backend
○ Data science: ML, BigData and many more...

This presentation is Open Source (yay!)
https://creativecommons.org/licenses/by-nc-sa/3.0/

Outline
1. Supervised vs Unsupervised Learning
2. Deep generative models
a. Fully Visible Belief Networks
b. Variational Autoencoders
c. Generative Adversarial Networks
3. Some applications to images and audio
Heavily inspired by Stanford CS231n 2017, lecture 13 “Generative Models”, and Ian Goodfellow's NIPS 2016 tutorial.

Supervised vs Unsupervised learning
Supervised Learning
Data: (x,y)
x: data, y: label
Goal: learn a function to map x → y
Examples
● classification
● regression
● almost all Computer Vision
● almost all NLP

Unsupervised Learning (sexy!)
Data: x
x: unlabeled data
Goal: learn hidden structure of x
Examples
● clustering
● dimensionality reduction
● feature learning
● rich content generation

Deep generative Models
Given training data, generate new samples from the same probability distribution.
Train by Maximum Likelihood: pick the model parameters that make the training data most probable.
Assumption: training data is generated by a hidden probability distribution p_data(x).
Goal:
1. learn a function p_model(x) as similar as possible to p_data(x);
2. generate new samples from p_model(x).
Generation as a density estimation problem.
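As a toy illustration of generation as density estimation, the sketch below fits a one-dimensional Gaussian by maximum likelihood and then samples from it. The Gaussian family, the data and the function names are assumptions made for illustration; deep generative models replace the Gaussian with a neural network.

```python
import random
import statistics

def fit_gaussian_mle(samples):
    """Maximum-likelihood estimates of a Gaussian's mean and standard deviation."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples, mu)  # population std (divides by n) is the MLE
    return mu, sigma

def sample_model(mu, sigma, count, rnd):
    """Generate new samples from the fitted model distribution."""
    return [rnd.gauss(mu, sigma) for _ in range(count)]

rnd = random.Random(0)
# Stand-in for the hidden data distribution: a Gaussian with mean 3 and std 2.
training_data = [rnd.gauss(3.0, 2.0) for _ in range(10_000)]
mu, sigma = fit_gaussian_mle(training_data)
new_samples = sample_model(mu, sigma, 5, rnd)
```

The fitted mean and standard deviation should land close to the hidden values, and the new samples come from the model, not from the training set.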

Explicit vs implicit density estimation
Generative Model
● Explicit density
○ Tractable density: Fully visible belief network (PixelRNN/CNN, Nonlinear ICA, NADE, MADE)
○ Approximate density, Variational: Variational Autoencoder
○ Approximate density, Markov chain: Boltzmann Machine
● Implicit density
○ Direct: Generative Adversarial Network
○ Markov chain: Generative Stochastic Network

Fully Visible Belief Network
FVBNs are models that use the chain rule of probability to decompose a
probability distribution over an n-dimensional vector into a product of
one-dimensional probability distributions:
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
Here p(x) is the likelihood of the image and each factor is the probability of a
single pixel conditioned on the previous pixels.
We just use a neural network to express the distribution function.
The main drawback of FVBNs is that samples must be generated one entry at a
time, so generation cannot be parallelized.
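A minimal sketch of sampling with the chain rule, assuming binary entries and a logistic conditional in place of a deep network (both simplifications are mine, as are the names `sample_fvbn`, `weights` and `biases`). It shows why generation is sequential: entry i cannot be drawn before entries 1..i-1 exist.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_fvbn(weights, biases, rnd):
    """Draw a binary vector one entry at a time and return it with its log-likelihood.

    Entry i is drawn from p(x_i = 1 | x_1..x_{i-1}) = sigmoid(b_i + sum_j w_ij * x_j),
    where only the weights for j < i are used.
    """
    x = []
    log_prob = 0.0
    for i, bias in enumerate(biases):
        # zip stops at len(x), so only the already generated entries contribute.
        logit = bias + sum(w * x_j for w, x_j in zip(weights[i], x))
        p_one = sigmoid(logit)
        x_i = 1 if rnd.random() < p_one else 0
        # Chain rule: the log-likelihood is the sum of the conditional log-probabilities.
        log_prob += math.log(p_one if x_i == 1 else 1.0 - p_one)
        x.append(x_i)
    return x, log_prob
```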

PixelRNN
Use an LSTM to generate pixels one at a time, starting from the corner.

PixelCNN
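Still generate pixels starting from the corner, but now we model context over a
region with a CNN.
[Figure: n×n pixel grid from x_11 (first row, first column) through x_1n to x_nn, with x_ij the pixel being generated.]
Causal convolution forbids parallelization at generation time, because each pixel
depends on the pixels before it. Training can still run in parallel, since all the
context pixels are known from the training image.
It turns out to have an optimal substructure, so we can memoize intermediate results.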
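One common way to make a 2D convolution causal is to mask the kernel so that it only sees pixels that come earlier in raster-scan order. The sketch below builds such a mask; the function name and the `include_center` switch are illustrative assumptions, not something the slides specify.

```python
def causal_mask(kernel_size, include_center=False):
    """Binary mask for a square kernel, assuming pixels are generated row by row.

    Ones mark positions above the center row, plus positions to the left of the
    center in the same row. The center itself is included only on request.
    """
    center = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        mask_row = []
        for col in range(kernel_size):
            before = row < center or (row == center and col < center)
            is_center = row == center and col == center
            mask_row.append(1 if before or (include_center and is_center) else 0)
        mask.append(mask_row)
    return mask
```

Multiplying the kernel weights by this mask before each convolution keeps the prediction for x_ij from looking at x_ij itself or at any later pixel.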

WaveNet
Generate samples one at a time through dilated causal convolution (demo).
Super quality but super slow: 2 mins computation, 1 sec of audio output!
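A minimal sketch of a dilated causal convolution on a 1D signal, written as plain loops for readability. The helper names and the kernel size of 2 used in the usage note are assumptions for illustration.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * signal[t - k * dilation]; samples before t = 0 count as zero."""
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            source = t - k * dilation
            if source >= 0:  # causal: never look at future samples
                total += weight * signal[source]
        output.append(total)
    return output

def receptive_field(kernel_size, dilations):
    """Number of past samples (current one included) seen by a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel size 2 and dilations doubling from 1 to 512, `receptive_field(2, [2 ** i for i in range(10)])` gives 1024 samples from only ten layers, which is why dilation is attractive for raw audio.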

Autoencoders
Autoencoders are a kind of neural network that projects high-dimensional
samples into a low-dimensional compressed representation and then reconstructs the
input from it.
Original input X → Encoder → latent representation (vector Z) → Decoder → reconstructed input X

Autoencoders architecture
Train by backpropagating the error between original and reconstructed input.
input → 4 layers CNN → features (z) → 4 layers up-CNN → reconstructed input
L2 loss = \lVert x - \hat{x} \rVert^2
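A toy version of this training loop, assuming a linear encoder and decoder with a one-dimensional latent instead of the CNN stacks, so that the gradients of the L2 loss can be written by hand. All names and hyperparameters here are illustrative.

```python
import random

def train_tiny_autoencoder(data, lr=0.05, steps=300, seed=0):
    """Linear autoencoder with a 1-D latent z, trained by gradient descent on the mean L2 loss."""
    rnd = random.Random(seed)
    dim = len(data[0])
    w_enc = [rnd.uniform(-0.1, 0.1) for _ in range(dim)]
    w_dec = [rnd.uniform(-0.1, 0.1) for _ in range(dim)]
    n = len(data)
    losses = []
    for _ in range(steps):
        grad_enc = [0.0] * dim
        grad_dec = [0.0] * dim
        loss = 0.0
        for x in data:
            z = sum(w * xi for w, xi in zip(w_enc, x))    # encoder: x -> z
            x_hat = [z * w for w in w_dec]                 # decoder: z -> x_hat
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            loss += sum(e * e for e in err)                # squared L2 reconstruction error
            grad_z = sum(2.0 * e * w for e, w in zip(err, w_dec))
            for j in range(dim):
                grad_dec[j] += 2.0 * err[j] * z            # d loss / d w_dec[j]
                grad_enc[j] += grad_z * x[j]               # d loss / d w_enc[j], through z
        losses.append(loss / n)
        w_enc = [w - lr * g / n for w, g in zip(w_enc, grad_enc)]
        w_dec = [w - lr * g / n for w, g in zip(w_dec, grad_dec)]
    return w_enc, w_dec, losses
```

On points that lie along a line, for example `[(t, 2 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]`, one latent dimension is enough to describe the data, so the reconstruction loss should drop over the training steps.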

Autoencoders for noise reduction

Autoencoders for semi-supervised learning
We can initialize a learner by training an Autoencoder first, throwing away the
decoder, and then training a supervised model on top of it by looking at labels too.
input → encoder → features (z) → my classifier → predicted label y
The predicted label is compared with the true label y through a softmax loss.
The encoder is fine-tuned jointly with the classifier.

Problem with Autoencoders
The fundamental problem with
autoencoders, for generation, is that
their latent space may not be
continuous, or allow easy interpolation.
The clusters here are totally discrete.
How can you randomly sample from the
border of two spaces without producing
garbage? Your model doesn’t know how to
cope with intersections of spaces.

Solution: Variational Autoencoders
VAEs learn a probability distribution (Gaussian) over the latent space: the encoder
outputs its mean μ and standard deviation σ. A latent vector z is sampled from it and
used in turn to bootstrap another distribution over the final output.
input → layers → encoder output (μ, σ) → sampled z (this is the new latent vector) → layers → reconstructed input
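A minimal sketch of the sampling step. The slides do not name it, but the usual way to draw z is the reparameterization z = μ + σ·ε with ε ~ N(0, 1), and the usual regularizer is the KL divergence between the encoder's Gaussian and the N(0, 1) prior. Both are assumptions about the setup, as are the function names.

```python
import math
import random

def sample_latent(mu, sigma, rnd):
    """Draw z ~ N(mu, sigma^2) per dimension as z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + s * rnd.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions (sigma must be > 0)."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma))
```

Writing the sample as a deterministic function of μ, σ and external noise is what lets the reconstruction error backpropagate into the encoder.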

Generating with VAE
At inference time, we use the decoder only and we start with a z of our choice.
Since the layers have been trained to output an image given a vector z, we can
pilot the output by varying the expected latent factor: the sampled image will reflect
the change in the input factors.
our a priori latent vector, drawn from N(0,1) → sampled z → layers (this is our decoder) → reconstructed input

Generating with VAE
By interpolating the latent factors (z_1 and z_2, in this case) we can
smoothly transition the output.
Here is a grid of outputs from a model trained on MNIST.
Note the blurry contours due to L2 loss.
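A minimal sketch of the interpolation itself, assuming plain linear interpolation between two latent vectors. The function name is illustrative, and the decoder that would turn each vector into an image is not shown.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced from z_start to z_end (steps >= 2)."""
    path = []
    for i in range(steps):
        alpha = i / (steps - 1)  # goes from 0.0 to 1.0
        path.append([(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)])
    return path
```

Feeding each vector of the path to the decoder gives one row of smoothly changing outputs; sweeping two latent factors over a grid gives a full matrix of them.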

DeepFake/Faceswap
Technically, it’s a pair of autoencoders with:
● 1 shared encoder;
● 2 distinct decoders, T and B.
The idea is to train
● “encoder - decoder T” on pictures T;
● “encoder - decoder B” on pictures B.
At inference time, we feed picture B to the encoder and decode with decoder T.
Train: pictures B (Baldwin) → shared encoder → decoder Baldwin; pictures T (Trump) → shared encoder → decoder Trump
Infer: picture B (Baldwin) → shared encoder → decoder Trump

Generative Adversarial Networks (Goodfellow et al, 2014)
GANs are based on a game theory approach with a non zero-sum objective: training looks
for a Nash equilibrium between two networks, Generator and Discriminator.
Both players have cost functions that are defined in terms of both players’
parameters:
● The Discriminator must minimize J^{(D)}(\theta^{(D)}, \theta^{(G)}) and can control only \theta^{(D)}.
● The Generator must minimize J^{(G)}(\theta^{(D)}, \theta^{(G)}) and can control only \theta^{(G)}.
White noise (z) → Generator network → generated image G(z)
Generated image G(z) or trainset image → Discriminator network → score in (0..1): 1 = real, 0 = fake
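The slides do not spell out the two costs. A minimal sketch, assuming the cross-entropy discriminator cost and the non-saturating generator cost described in Goodfellow's tutorial, computed from discriminator scores that lie strictly between 0 and 1 (the function names are mine):

```python
import math

def discriminator_cost(d_real, d_fake):
    """J(D) = -1/2 * mean(log D(x)) - 1/2 * mean(log(1 - D(G(z))))."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -0.5 * real_term - 0.5 * fake_term

def generator_cost_non_saturating(d_fake):
    """J(G) = -1/2 * mean(log D(G(z))): low when the discriminator is fooled."""
    return -0.5 * sum(math.log(d) for d in d_fake) / len(d_fake)
```

When the discriminator outputs 0.5 for every image, real or generated, its cost equals log 2: it can no longer tell the two apart.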

Deep Convolutional GAN architecture (Radford et al, 2016)
Generator is an upsampling network with transposed convolutions.
Discriminator is a convolutional network.
● Replace pooling layers with strided convolutions for downsampling.
● Use batch normalization in both networks.
● Don’t use fully connected layers.
● Generator uses ReLU everywhere but in last layer, which uses tanh.
● Discriminator uses LeakyReLU everywhere.
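A short sketch of the size arithmetic behind upsampling with transposed convolutions and downsampling with strided convolutions. The kernel size 4, stride 2 and padding 1 used below are illustrative assumptions, not values taken from the slides.

```python
def transposed_conv_output_size(size_in, kernel, stride, padding):
    """Spatial output size of a transposed convolution (no output padding, no dilation)."""
    return (size_in - 1) * stride - 2 * padding + kernel

def conv_output_size(size_in, kernel, stride, padding):
    """Spatial output size of a regular strided convolution (no dilation)."""
    return (size_in + 2 * padding - kernel) // stride + 1

# Generator side: each transposed convolution doubles the resolution.
generator_sizes = [4]
for _ in range(4):
    generator_sizes.append(
        transposed_conv_output_size(generator_sizes[-1], kernel=4, stride=2, padding=1)
    )

# Discriminator side: each strided convolution halves it, with no pooling layer.
discriminator_sizes = [64]
for _ in range(4):
    discriminator_sizes.append(
        conv_output_size(discriminator_sizes[-1], kernel=4, stride=2, padding=1)
    )
```

With these settings the generator goes 4 → 8 → 16 → 32 → 64 and the discriminator mirrors it back down.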

Interpretable vector math in GAN
Interpolating input noise allows for output interpolation.
All of these are model outputs.
Average the source Z vectors of each group of samples and do the math on the averages.
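A minimal sketch of that recipe, assuming three hypothetical groups of z vectors (A, B and C) whose generated images share a visual concept. The group names and functions are illustrative only.

```python
def mean_vector(vectors):
    """Average a list of equally sized z vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def latent_arithmetic(group_a, group_b, group_c):
    """Return mean(A) - mean(B) + mean(C), the z vector to feed to the Generator."""
    a, b, c = mean_vector(group_a), mean_vector(group_b), mean_vector(group_c)
    return [x - y + w for x, y, w in zip(a, b, c)]
```

Averaging several z vectors per concept before doing the arithmetic tends to be more stable than working with single samples.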

GAN painting: deep drawing with object-level control

Unsupervised image-to-image translation

Unsupervised image-to-image translation
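Images in different domains can be mapped to the same latent representation in a
shared latent space.
It uses VAEs to learn a latent space that conditions Coupled GANs, which generate
the images: X_2^{2→1} and X_1^{1→2} are the domain-translated ones.
Weights are tied in the high-level layers of the Encoders and Generators to force
similarity in high-order features.
[Figure: encoders E1 and E2 map X1 and X2 to a shared latent code Z; generators G1 and G2 decode Z into the reconstructions X_1^{1→1} and X_2^{2→2} and the translations X_1^{1→2} and X_2^{2→1}; discriminators D1 and D2 judge the generated images.]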

seq2seq CNN-Att-RNN for audio: Tacotron

References
Ian Goodfellow - NIPS 2016 tutorial on generative models
Stanford CS231n: Convolutional Neural Networks for Visual Recognition
● Lecture 13, Generative Models (YouTube)
Exploring DeepFakes
Variational Autoencoders Explained
Pix2pix: Image-to-Image Translation with Conditional Adversarial Nets
