Generative Adversarial Networks
Rubens Zimbres
Data Scientist
Machine Learning for IoT at Vecto Mobile
Agenda
• Generative Models
• Discriminative Models
• Game Theory: Prisoner's Dilemma, Nash Equilibrium, Rational Choice, Bounded Rationality, Non-cooperative Games
• Training GANs
• Some Examples
(Diagram: Player 1 (D) vs. Player 2 (G), + Noise)
Generative Models
• Probabilistic: samples from the joint distribution P(x, y)
• Generates samples (via latent variables) following a distribution
• Trained by Maximum Likelihood
• Models the range of values belonging to a class
Generative Models
• Latent Dirichlet Allocation (NLP)
• Mixture of Gaussians
• Hidden Markov Models
• VAEs (Variational Autoencoders)
• RBMs (Restricted Boltzmann Machines)
• GANs
Mixture of Gaussians (plot)
Linear Discriminant Analysis (plot)
Variational Autoencoders
• Encoder → latent space → Decoder
• Generated samples tend to be blurred
• Trained end-to-end with backpropagation
• Regularized by the Kullback–Leibler (KL) divergence ("surprise")
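As a quick illustration of the KL term above, a minimal sketch of the discrete KL divergence in plain Python (the distributions p and q are made-up examples, not from the talk):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.
    Measures the 'surprise' of using q to model data drawn from p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # true distribution
q = [0.8, 0.2]   # approximate distribution
print(kl_divergence(p, p))  # 0.0 -> identical distributions
print(kl_divergence(p, q))  # > 0, and note KL is asymmetric
```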
Restricted Boltzmann Machines
Generative Models
(Graphical-model diagram: inputs X1, X2 — Buy, Recommend — and outputs Y1, Y2)
Discriminative Models
• Observed variables: target; infer outputs (conditional)
• A function maps from x to Y
• Conditional density function
• P(Y|x) on a subsample
• Decision boundary
• Computationally expensive
Discriminative Models
• Logistic Regression
• SVMs
• Neural Networks
• Decision Trees
Occam's razor principle
Generative vs Discriminative
• Generative learns joint probability distribution = P(x,y)
• Discriminative learns conditional probability distribution = P(y|x)
GAN Advantages over Other Generative Models
• D & G = Multi-Layer Perceptrons
• Regular backpropagation, like a VAE (G and D are differentiable)
• Does not necessarily involve Maximum Likelihood estimation
• No need to use Markov Chains
• Subject to, yet fairly robust against, overfitting
Game Theory
• 2-Player Game – Zero-Sum
• Minimax solution: Nash Equilibrium
• Unique solution: D = 1/2 everywhere
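The minimax game on this slide is the standard GAN value function; writing it out (this is the formulation from Goodfellow et al., 2014, not shown explicitly on the slide):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

For a fixed G the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_G(x)); at the equilibrium p_G = p_data this gives D = 1/2 everywhere.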
Nash Equilibrium
• Sub-optimal
• Non-cooperative
• Emulates human behavior: bounded rationality (Herbert Simon)
• Moves are simultaneous but the players are not equivalent ** (GANs)
GANs
• Semi-Supervised Learning: missing labels
• Inverse Reinforcement Learning
(Diagram: the Generator G, an MLP decoder, maps latent variables + added noise to fake samples; the Discriminator D, an MLP encoder, receives both fake samples and real training data (MNIST) and outputs 1 for real.)
ADVERSARIAL payoff: (0, 0)
COOPERATIVE payoff: (1, 1)
Question
• How can we change the GAN's strategy to create a WIN-WIN situation with payoff equal to (1, 1)?
(Diagram: the standard adversarial GAN — G (MLP) maps latent variables + noise to fake samples, D (MLP) classifies fake vs. real training data — with payoff (0, 0); versus a cooperative variant with payoff (1, 1), where G plays an "infiltrated cop" that is also fed real training data.)
GAN Tuning
• Neural Network architecture
• Hyperparameters
• Game Theory strategies
GANs
• Train simultaneously (no freezing)
• Train the Discriminator for k steps per one Generator step (otherwise no Nash Equilibrium)
• Until pG = pdata (global optimum)
• At that point the Discriminator cannot distinguish the two distributions
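The alternating schedule above can be sketched as a toy 1-D GAN in plain Python (the linear generator, logistic discriminator, data distribution, and learning rate are all illustrative assumptions, not the talk's actual setup):

```python
import math, random

random.seed(0)
sigmoid = lambda u: 1.0 / (1.0 + math.exp(-u))

# Toy models: G(z) = a*z + b (generator), D(x) = sigmoid(w*x + c) (discriminator)
a, b = 1.0, 0.0          # generator parameters (theta_G)
w, c = 1.0, 0.0          # discriminator parameters (theta_D)
lr, k = 0.05, 1          # learning rate; k discriminator steps per generator step

for step in range(2000):
    for _ in range(k):   # --- discriminator: gradient ascent on V ---
        x_real = random.gauss(3.0, 0.5)        # sample from p_data
        x_fake = a * random.gauss(0, 1) + b    # sample from p_G
        d_r, d_f = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
        # ascend E[log D(x)] + E[log(1 - D(G(z)))]
        w += lr * ((1 - d_r) * x_real - d_f * x_fake)
        c += lr * ((1 - d_r) - d_f)
    # --- generator: descend -log D(G(z)) (non-saturating loss) ---
    z = random.gauss(0, 1)
    d_f = sigmoid(w * (a * z + b) + c)
    a += lr * (1 - d_f) * w * z
    b += lr * (1 - d_f) * w

print(b)  # should have drifted toward the data mean (3.0)
```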
GANs
• If x comes from pdata: y = 1 (Real)
• If x comes from pG: y = 0 (Fake)
• Optimize θ rather than pG directly
• G: ReLU (avoids the vanishing gradient)
• Dropout
• Supervised Learning in D:
• At the beginning, D classifies with high confidence
• Usually D is bigger/deeper than G
Vanishing Gradient
Image source: Andrej Karpathy
The gradient shrinks (≈ −25%) at each activation
Coefficient and Intercept: Regularization
Dropout
• Generator: Gradient Descent on V
• Discriminator: Gradient Ascent on V
• Minibatch training
• Batch Normalization in G ***
• Objective: minimize Cross-Entropy
• Minimax Game
• Regular cross-entropy loss
Training GANs
• The Discriminator minimizes J(D)(θD, θG) while controlling only θD
• The Generator minimizes J(G)(θD, θG) while controlling only θG
• Neither player can control the other's parameters (G still learns)
Training Challenges
• Non-convergence: a Nash Equilibrium is harder to reach than the optimum of an objective function
• Mode collapse: G fails to output diversity (only a few good samples)
• D converges to the right distribution
• G concentrates its samples at the most probable point
• Solution: k steps of D training per 1 step of G training
Tips and Tricks for GANs
• Normalize data to (-1, 1)
• Use Tanh as the activation of the Generator's output
• Sample z from a Gaussian distribution instead of a uniform distribution
• Build separate mini-batches for the Discriminator:
• All real
• All fake
• Use Batch Normalization
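The first tip — scaling data into (-1, 1) to match a Tanh output — can be sketched as follows (the 8-bit pixel range is an assumption, not stated in the talk):

```python
def normalize(pixels):
    """Map 8-bit pixel values [0, 255] into (-1, 1), the range of Tanh."""
    return [p / 127.5 - 1.0 for p in pixels]

def denormalize(values):
    """Inverse map, back to [0, 255] for display."""
    return [(v + 1.0) * 127.5 for v in values]

print(normalize([0, 127.5, 255]))  # [-1.0, 0.0, 1.0]
```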
Tips and Tricks for GANs
• Use DCGANs or VAE+GAN
• Adam in the Generator (exponential momentum decay) (Adaptive Moment Estimation)
• SGD in the Discriminator
• If the Generator's loss is high, it feeds garbage to the Discriminator
• Dropout in the Generator (against overfitting)
Gradient Descent
Nesterov Momentum
Adam Optimizer
Momentum with exponential decay, weighted by the derivative of the error
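A minimal sketch of the Adam update these slides contrast with SGD and Nesterov momentum (the β, ε defaults follow the original Adam paper; the quadratic objective is a made-up example):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponentially decayed 1st/2nd moments of the gradient."""
    m = b1 * m + (1 - b1) * grad          # momentum (1st moment)
    v = b2 * v + (1 - b2) * grad ** 2     # uncentered variance (2nd moment)
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta**2 (gradient 2*theta) starting from theta = 5.0
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(theta)  # close to the minimum at 0
```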
Optimizers: generalization
Wilson et al, 2017
GAN Examples: Vector Arithmetic
Radford et al, 2015
Matrix Multiplication
DCGAN
• Deep Convolutional GAN
• Batch Norm everywhere except the Generator's output layer and the Discriminator's input layer (covariate shift - Kaggle)
• To increase dimension, upsampling: Conv2DTranspose with stride > 1
• Downsampling: average pooling, or Conv2D with stride
Convolutions
• Convolution2D: 3x3 kernel, 4x4 input, stride = 1
• Conv2DTranspose: 3x3 kernel, 2x2 input, stride = 1
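Under the common no-padding assumption, the two operations above are shape inverses of each other; a small sketch of the output-size arithmetic:

```python
def conv2d_out(n, k, s=1):
    """Output size of a valid (no-padding) convolution along one dimension."""
    return (n - k) // s + 1

def conv2d_transpose_out(n, k, s=1):
    """Output size of a transposed convolution (no padding) along one dimension."""
    return (n - 1) * s + k

print(conv2d_out(4, 3, 1))            # 4x4 input, 3x3 kernel -> 2x2
print(conv2d_transpose_out(2, 3, 1))  # 2x2 input, 3x3 kernel -> 4x4
```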
DCGAN Architecture
DCGAN Training
DCGAN Output
DCGAN – Face Recognition
InfoGAN
• Based on Mutual Information and Entropy (uncertainty)
Coin toss
(Diagram: the usual GAN — G (MLP) maps latent variables + noise to fake samples, D (MLP) classifies fake vs. real training data — with coin-toss variables A and B attached.)
B = [0,0,0,0,0,1,1,1,1,1], N = 10, so P(B=b) = 0.5
H(B) = -Σ P(B=b)·log P(B=b)
H(B) = -2 · 0.5 · log(0.5) ≈ 0.30
A = [0,0,0,0,0,0,1,1,1,1]
P(A=b|B=b) = 4/5 = 0.8 (subsample where B = b)
H(A|B=b) = -(0.8·log(0.8) + 0.2·log(0.2)) ≈ 0.22
(base-10 logarithms)
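The coin-toss numbers can be checked numerically (base-10 logs, matching the slide's convention):

```python
import math

def entropy(seq):
    """Shannon entropy of a binary sequence, using base-10 logs."""
    p = sum(seq) / len(seq)
    return -sum(q * math.log10(q) for q in (p, 1 - p) if q > 0)

B = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
A = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

print(round(entropy(B), 2))                      # H(B) ~ 0.30
subsample = [a for a, b in zip(A, B) if b == 1]  # A restricted to B = b = 1
print(round(entropy(subsample), 2))              # H(A|B=b) ~ 0.22
```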
InfoGAN Output
cGAN: Conditional GANs
GAN Examples
• Next frame prediction (Lotter et al, 2016)
• Image-to-image translation (Isola et al, 2016)
GAN Examples
• Text-to-Image generation (Reed et al, 2016)
• Image Inpainting (Pathak et al, 2016)
Cycle GAN: Style Transfer
Zhu et al, 2017
Different Game Strategies
• Rather than tuning architecture and hyperparameters:
• Freeze layers (unofficial implementation)
• Discriminator : Generator update ratios
• 1:1
• 5:1
• 1:5
• Cooperation
• Random strategy update
• Tit-for-Tat strategy
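A minimal sketch of such an update schedule (the function name and API are my own; the talk gives no code):

```python
import random

def pick_player(step, d_steps=5, g_steps=1, strategy="ratio", rng=random):
    """Decide whether to train D or G at a given step.
    strategy='ratio' cycles d_steps of D, then g_steps of G (e.g. 5:1);
    strategy='random' flips a fair coin each step."""
    if strategy == "random":
        return "D" if rng.random() < 0.5 else "G"
    cycle = d_steps + g_steps
    return "D" if step % cycle < d_steps else "G"

schedule = [pick_player(t, d_steps=5, g_steps=1) for t in range(12)]
print("".join(schedule))  # DDDDDGDDDDDG
```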
Architecture: Denoising Autoencoders
Conv2D → MaxPooling → Conv2D → MaxPooling → Conv2D → UpSampling → Conv2D → UpSampling
(trained with backpropagation)
GAN Structure
One Pixel Attack to Fool Neural Nets
Generated Samples
noise = x + 0.1 · rnd(0, 1), with x̄ = 0, σ = 1
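The noise model above — additive standard-normal noise scaled by 0.1 — can be sketched as (the function name is my own):

```python
import random

def add_noise(x, scale=0.1, rng=random):
    """Add Gaussian noise drawn from N(0, 1), scaled by `scale`, to each element."""
    return [xi + scale * rng.gauss(0.0, 1.0) for xi in x]

random.seed(0)
clean = [0.0] * 5
noisy = add_noise(clean, scale=0.1)
print(noisy)
```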
Update strategy
Synchronous vs. Asynchronous
http://www.nytimes.com/interactive/science/rock-paper-scissors.html
1:1 Update (Vanilla GAN)
(Diagram: G and D, both MLPs; latent variables + noise → fake samples vs. real training data; trained simultaneously)
Learning Rate: 0.008
Epochs: 1,000 each
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
Accuracy Training Set: .92
Accuracy Test Set: .914
Nash Equilibrium: Sub-optimal
5:1 Update
(Diagram: same GAN; the Discriminator is trained 5 times per 1 Generator training step)
Learning Rate: 0.008
Epochs: 1,000 each
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
Accuracy Training Set: .91
Accuracy Test Set: .91
1:5 Update
(Diagram: same GAN; the Generator is trained 5 times per 1 Discriminator training step)
Learning Rate: 0.008
Epochs: 1,000 each
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
Accuracy Training Set: .92
Accuracy Test Set: .91
Cooperation
(Diagram: G and D trained at once; G also receives the real training data)
Learning Rate: 0.008
Epochs: 1,000 each
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
Accuracy Training Set: .99
Accuracy Test Set: .93
Sigmoid on x: vanishing gradient
Random Strategy
(Diagram: G and D trained at chance — which network trains at each step is chosen randomly)
Learning Rate: 0.008
Epochs: 1,500
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
Accuracy Training Set: .98
Accuracy Test Set: .96
Polarization of the game until N = ∞, x ≈ 0.5
Sigmoid on x: vanishing gradient → ReLU
Tit-For-Tat Strategy
(Diagram: G and D; the Generator is penalized whenever it provides noisy samples)
Learning Rate: 0.008
Epochs: 1,500
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
In GAN training: % of noisy samples
noise = x + 0.1 · rnd(0, 1), with x̄ = 0, σ = 1
Accuracy Training Set: .96
Accuracy Test Set: .92
noise = x + 0.4 · rnd(0, 1), with x̄ = 0, σ = 1
Accuracy Training Set: .92
Accuracy Test Set: .91
Freeze Layers
(Diagram: one network is trained first, its weights are frozen, then the second is trained)
Learning Rate: 0.008
Epochs: 1,000 each
Generator optimizer: Adam
Discriminator optimizer: SGD
Decay Rate: 5e-5
Momentum: 0.9
Accuracy Training Set: .63
Accuracy Test Set: .12
Sigmoid on x: vanishing gradient → ReLU
* SYNCHRONICITY
Experiment Results
GANs Deep Learning Summer School