Scenes from a memory
Neural audio/video generation
Alberto Massidda
Who we are
● Founded in 2001;
● Branches in Milan, Rome, Cosenza and London;
● Market leader in enterprise-ready solutions based on Open Source tech;
● Expertise:
○ DevOps: Monitoring, Cloud and Containers
○ Agile: Processes and Management
○ Integration: Mobile, Frontend, Backend
○ Data science: ML, BigData and many more...
This presentation is Open Source (yay!)
https://creativecommons.org/licenses/by-nc-sa/3.0/
Outline
1. Supervised vs Unsupervised Learning
2. Deep generative models
a. Fully Visible Belief Networks
b. Variational Autoencoders
c. Generative Adversarial Networks
3. Some applications to images and audio
Heavily inspired by Stanford CS231n (2017), lesson 13 “Generative Models”, and Ian Goodfellow's NIPS 2016 tutorial.
Supervised vs Unsupervised learning
Supervised Learning
Data: (x,y)
x: data, y: label
Goal: learn a function to map x → y
Examples
● classification
● regression
● almost all Computer Vision
● almost all NLP
Unsupervised Learning (sexy!)
Data: x
x: unlabeled data
Goal: learn hidden structure of x
Examples
● clustering
● dimensionality reduction
● feature learning
● rich content generation
Deep generative Models
Given training data, generate new samples from the same probability distribution.
Train by Maximum Likelihood: choose the parameters that define the distribution so that the training data is as likely as possible.
Assumption: training data is generated by a hidden probability distribution $p_{data}(x)$.
Goal:
1. learn a function $p_{model}(x)$ as similar as possible to $p_{data}(x)$;
2. generate new samples from $p_{model}(x)$.
Generation as a density estimation problem.
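A minimal sketch of "generation as density estimation", assuming the simplest possible model: a 1-D Gaussian whose two parameters are fitted by Maximum Likelihood. The data, the function names and the Gaussian choice are illustrative assumptions, not something from the slides; deep generative models replace the Gaussian with a neural network.

```python
import numpy as np

def fit_gaussian_mle(samples):
    """Maximum-likelihood estimate of a 1-D Gaussian: sample mean and std."""
    mu = samples.mean()
    sigma = samples.std()  # ddof=0 is the maximum-likelihood estimate
    return mu, sigma

def sample_model(mu, sigma, n, rng):
    """Generate n new samples from the fitted p_model."""
    return rng.normal(mu, sigma, size=n)

rng = np.random.default_rng(0)
# Stand-in for training data drawn from a hidden p_data (here N(3, 2^2))
train = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu, sigma = fit_gaussian_mle(train)            # step 1: learn p_model
new_samples = sample_model(mu, sigma, 5, rng)  # step 2: sample from p_model
```

The two goals map directly to the two calls: fit first, then sample.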
Explicit vs implicit density estimation
Generative Model
● Explicit density
○ Tractable density: Fully visible belief network (PixelRNN/CNN, Nonlinear ICA, NADE, MADE)
○ Approximate density, variational: Variational Autoencoder
○ Approximate density, Markov chain: Boltzmann Machine
● Implicit density
○ Direct: Generative Adversarial Network
○ Markov chain: Generative Stochastic Network
Fully Visible Belief Network
FVBNs are models that use the chain rule of probability to decompose a
probability distribution over an n-dimensional vector into a product of
one-dimensional probability distributions:
$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$
Here $p(x)$ is the likelihood of the image and each factor is the probability of a single pixel conditioned on the previous pixels.
We just use a neural network to express the distribution function.
The main drawback of FVBNs is that samples must be generated one entry at a
time, so generation cannot be parallelized.
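The chain-rule decomposition can be sketched on binary vectors. This is a toy under stated assumptions: each conditional is a logistic regression on the previous entries, and `W`, `b` and all function names are hypothetical. Real FVBNs use a deep network for each conditional.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conditional_p1(x, i, W, b):
    """p(x_i = 1 | x_1..x_{i-1}): looks only at the entries before i."""
    return sigmoid(b[i] + W[i, :i] @ x[:i])

def log_likelihood(x, W, b):
    """log p(x) as a sum of the log one-dimensional conditionals."""
    total = 0.0
    for i in range(len(x)):
        p1 = conditional_p1(x, i, W, b)
        total += np.log(p1 if x[i] == 1 else 1.0 - p1)
    return total

def sample(W, b, rng):
    """Generate one entry at a time: entry i needs entries 0..i-1."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = float(rng.random() < conditional_p1(x, i, W, b))
    return x

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))  # only the strictly lower triangle is ever used
b = rng.normal(size=n)
x = sample(W, b, rng)
```

The loop in `sample` is the drawback mentioned above: each entry has to wait for all the previous ones.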
PixelRNN
Use an LSTM to generate pixels one at a time, starting from the corner.
PixelCNN
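Still generate pixels starting from the corner, but now we model context over a
region with a CNN.
[Figure: n×n pixel grid from x11 (first corner) through x1n to xnn (last corner); pixel xij is conditioned on a region of previously generated pixels.]
Causal convolution forbids parallelization at generation time (but turns out to
have an optimal substructure, so we can memoize).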
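One common way to make a 2-D convolution causal is to zero out the kernel weights that would look at pixels not generated yet. A minimal sketch, assuming a raster-scan order (row by row, left to right) and a hypothetical helper name `causal_mask`:

```python
import numpy as np

def causal_mask(k, include_center=False):
    """k x k mask: 1 for positions above the center, or left of it on the same row."""
    mask = np.zeros((k, k))
    c = k // 2
    mask[:c, :] = 1.0   # all rows above the center pixel
    mask[c, :c] = 1.0   # same row, strictly left of the center
    if include_center:
        mask[c, c] = 1.0
    return mask

# The mask is multiplied element-wise with the kernel before convolving.
print(causal_mask(3))
```

With `include_center=False` the pixel being predicted never sees itself, which is what the first layer needs; deeper layers can include the center.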
WaveNet
Generate audio samples one at a time through dilated causal convolutions.
Super quality but super slow: 2 mins of computation for 1 sec of audio output!
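A dilated causal convolution can be sketched in a few lines of NumPy. This is an illustration of the operation only, not WaveNet itself; the function names and the kernel size of 2 are assumptions.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], with zeros before the start of x."""
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left padding only: no future samples
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(len(w)):
            y[t] += w[k] * xp[pad + t - k * dilation]
    return y

def receptive_field(kernel_size, dilations):
    """Number of past samples (current one included) seen by a stack of layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = causal_dilated_conv(x, w=np.array([1.0, 1.0]), dilation=2)  # y[t] = x[t] + x[t-2]
rf = receptive_field(2, [1, 2, 4, 8])
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth, while each output still depends only on past samples.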
Autoencoders
Autoencoders are a kind of neural network that projects high-dimensional
samples into a low-dimensional compressed representation and then reconstructs the input from it.
[Diagram: original input X → Encoder → vector Z (latent representation) → Decoder → reconstructed input X̂]
Autoencoders architecture
Train by backpropagating the error between original and reconstructed input.
$L_2\ \text{loss} = \lVert x - \hat{x} \rVert^2$
[Diagram: input x → 4 layers CNN → features (z) → 4 layers up-CNN → reconstructed input x̂]
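A minimal sketch of the training loop, assuming a much simpler model than the CNN in the diagram: one linear layer as encoder, one linear layer as decoder, plain gradient descent on the L2 loss. The data, sizes and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 8-D samples that really live on a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
X = X / X.std()

W_enc = rng.normal(scale=0.1, size=(8, 2))  # encoder: 8-D input -> 2-D features z
W_dec = rng.normal(scale=0.1, size=(2, 8))  # decoder: 2-D features z -> 8-D output

def l2_loss(X, X_hat):
    """Mean over samples of the squared reconstruction error."""
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

lr = 0.01
losses = []
for step in range(500):
    Z = X @ W_enc        # encode
    X_hat = Z @ W_dec    # decode
    losses.append(l2_loss(X, X_hat))
    G = 2.0 * (X_hat - X) / len(X)   # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ G               # backpropagate to the decoder weights
    grad_enc = X.T @ (G @ W_dec.T)   # ...and through the decoder to the encoder
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

The only supervision signal is the input itself: no labels are involved anywhere in the loop.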
Autoencoders for noise reduction
Autoencoders for semi-supervised learning
We can initialize a learner by training an Autoencoder first, throwing away the
decoder, and then training a supervised model on top of the encoder by looking at labels too.
[Diagram: input → encoder → features (z) → my classifier → softmax → predicted ŷ, compared with label y; the encoder is fine-tuned jointly with the classifier.]
Problem with Autoencoders
The fundamental problem with
autoencoders, for generation, is that
their latent space may not be
continuous, or allow easy interpolation.
The clusters here are totally discrete.
How can you randomly sample from the
border of two spaces without producing
garbage? Your model doesn’t know how to
cope with intersections of spaces.
Solution: Variational Autoencoders
VAEs learn a probability distribution (Gaussian) that is used to sample a latent
vector, which is used in turn to bootstrap another distribution over the final output.
[Diagram: input → layers → encoder output (μ, σ) → sampled z (this is the new latent vector) → layers → reconstructed input]
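The sampling step can be sketched as follows. The slides do not say how the sample is drawn; this assumes the standard "reparameterization trick" from the VAE literature, an encoder that outputs μ and log σ² (a common convention, assumed here), and the usual KL regularizer toward N(0, 1). All names are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                 # stand-in for the encoder output μ
log_var = np.log(np.array([0.25, 4.0]))    # stand-in for the encoder output log σ²
z = sample_latent(mu, log_var, rng)        # this is what the decoder receives
kl = kl_to_standard_normal(mu, log_var)
```

Writing the sample as μ + σ·ε keeps the randomness in ε, so gradients can flow back into μ and σ during training.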
Generating with VAE
At inference time, we use the decoder only and we start with a z of our choice.
Since the layers have been trained to output an image given a vector z, we can
pilot the output by varying the latent factors: the sampled image will reflect
the change in the input factors.
[Diagram: sampled z (this is our a priori latent vector, N(0,1)) → layers (this is our decoder) → reconstructed input]
Generating with VAE
By interpolating the latent factors (z1 and z2, in this case) we can
smoothly transition the output.
Here is a matrix of outputs from an MNIST-trained model.
Note the blurry contours due to L2 loss.
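Interpolation in latent space is just a straight line between two latent vectors. A minimal sketch, with a hypothetical `interpolate` helper; each row of the result would be passed to the trained decoder, which is not shown here.

```python
import numpy as np

def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced on the segment z_start -> z_end."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * z_start + alphas * z_end

path = interpolate(np.array([0.0, 0.0]), np.array([2.0, 4.0]), steps=5)
# path[0] is z_start, path[-1] is z_end, the rows in between are the transition
```

Sweeping two latent factors independently over a grid gives the kind of output matrix described above.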
DeepFake/Faceswap
Technically, it’s a pair of autoencoders
with:
● 1 shared encoder;
● 2 distinct decoders, T and B;
The idea is to train
● “encoder - decoder T” on pictures T;
● “encoder - decoder B” on pictures B;
At inference time, we feed picture B to the
encoder and decode with decoder T.
[Diagram. Train: shared encoder → decoder Baldwin; shared encoder → decoder Trump. Infer: shared encoder → decoder Trump.]
Generative Adversarial Networks (Goodfellow et al, 2014)
GANs are based on a game-theory approach with a non-zero-sum objective: find a
Nash equilibrium between two networks, Generator and Discriminator.
Both players have cost functions that are defined in terms of both players’
parameters:
● The Discriminator must minimize $J^{(D)}(\theta^{(D)}, \theta^{(G)})$ and can control only $\theta^{(D)}$.
● The Generator must minimize $J^{(G)}(\theta^{(D)}, \theta^{(G)})$ and can control only $\theta^{(G)}$.
[Diagram: white noise (z) → Generator network → generated image G(z); generated images and trainset images → Discriminator network → output in (0..1), 1=real, 0=fake]
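The two cost functions can be sketched from the Discriminator outputs alone. This assumes the cross-entropy Discriminator cost and the non-saturating Generator cost described in Goodfellow's tutorial, written here without the constant 1/2 factors; the function names and the `eps` guard are illustrative.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy: trainset images are labeled 1, generated images 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating cost: the Generator wants D(G(z)) close to 1."""
    return -np.mean(np.log(d_fake + eps))

# d_real = D(x) on trainset images, d_fake = D(G(z)) on generated images
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.3, 0.2])
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

Both costs depend on `d_fake`, which depends on both networks: that is the sense in which each cost is a function of both players' parameters, while each player only updates its own.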
Deep Convolutional GAN (DCGAN) architecture (Radford et al, 2016)
Generator is an upsampling network with transposed convolutions.
Discriminator is a convolutional network.
● Replace pooling layers with strided convolutions for downsampling.
● Use batch normalization in both networks.
● Don’t use fully connected hidden layers.
● Generator uses ReLU everywhere except in the last layer, which uses tanh.
● Discriminator uses LeakyReLU everywhere.
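How a transposed convolution upsamples can be checked with the output-size formula. A small sketch, assuming kernel 4, stride 2 and padding 1 (values used by many public DCGAN implementations, not stated in the slides) and a 4×4 starting feature map:

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel

size = 4
sizes = [size]
for _ in range(4):
    size = conv_transpose_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

With these settings every layer doubles the spatial resolution, so four layers take a 4×4 map to a 64×64 image.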
Interpretable vector math in GAN
Interpolating input noise allows for output interpolation.
All of these are model outputs.
Average the source Z vectors and do the math on the averages.
[Figure: arithmetic on averaged Z vectors and the resulting generated images]
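The "average, then do the math" recipe can be sketched like this. The Z vectors below are random stand-ins and the names `concept` and `vector_math` are hypothetical; in practice each group would hold the Z vectors of a few generated images that share one visual concept.

```python
import numpy as np

def concept(z_samples):
    """Average several Z vectors whose outputs share one visual concept."""
    return np.mean(z_samples, axis=0)

def vector_math(z_a, z_b, z_c):
    """Compute mean(A) - mean(B) + mean(C) in the input noise space."""
    return concept(z_a) - concept(z_b) + concept(z_c)

rng = np.random.default_rng(0)
# Hypothetical stand-ins: three 100-D Z vectors per concept
z_a, z_b, z_c = (rng.uniform(-1, 1, size=(3, 100)) for _ in range(3))
z_new = vector_math(z_a, z_b, z_c)  # this vector would be fed to the Generator
```

Averaging a few source vectors before the arithmetic is what the slide suggests; working on single vectors tends to be less stable.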
GAN painting: deep drawing with object-level control
Unsupervised image-to-image translation
Images in different domains can be mapped to the same latent
representation in a shared latent space.
Uses VAEs to train a latent space that conditions Coupled GANs that
generate images: $X_2^{2 \to 1}$ and $X_1^{1 \to 2}$ are domain translated.
Weights are tied in the high-level layers of Encoder and Generator to force
similarity in high-order features.
[Diagram: X1 → E1 and X2 → E2 map into the shared latent space Z; G1 outputs $X_1^{1 \to 1}$ and $X_2^{2 \to 1}$, judged by D1; G2 outputs $X_1^{1 \to 2}$ and $X_2^{2 \to 2}$, judged by D2.]
seq2seq CNN-Att-RNN for audio: Tacotron
References
Ian Goodfellow - NIPS 2016 tutorial on generative models
Stanford CS231n: Convolutional Neural Networks for Visual Recognition
● Lesson 13, Generative Models (YouTube)
Exploring DeepFakes
Variational Autoencoders Explained
Pix2pix: Image-to-Image Translation with Conditional Adversarial Nets
Thanks

Alberto Massidda - Scenes from a memory - Codemotion Rome 2019
