This deck was originally given at a styling research presentation at Stitch Fix, where I talked about some of the recent progress in unsupervised deep learning methods for image analysis. It includes descriptions of Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), their hybrid (VAE/GAN), Generative Moment Matching Networks (GMMN), and Adversarial Autoencoders (AAE).
2. WHY DEEP LEARNING?
Before deep learning, much of computer vision focused on hand-engineered feature descriptors and image statistics.
[Figure: SURF, MSER, and corner feature detectors]
Image Credit: http://www.mathworks.com/products/computer-vision/features.html
3. WHY DEEP LEARNING?
Turns out NNs are great feature extractors.
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
4. WHY DEEP LEARNING?
Turns out NNs are great feature extractors.
| Team name | Entry description | Classification error | Localization error |
|---|---|---|---|
| GoogLeNet | No localization. Top-5 val score is 6.66% error. | 0.06656 | 0.606257 |
| VGG | A combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167 |
| VGG | A combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431 |
| VGG | A combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231 |
| VGG | A combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501 |
Leaderboard
5. WHY DEEP LEARNING?
Convolution gives a local, translation-invariant feature hierarchy.
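A minimal sketch of that idea (PyTorch is assumed here, and all layer sizes are illustrative): stacked convolutions with pooling detect patterns over progressively larger regions, roughly edges, then textures, then object parts.

```python
# Sketch of a convolutional feature hierarchy (layer sizes illustrative).
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges
    nn.ReLU(),
    nn.MaxPool2d(2),                              # local translation invariance
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # curves / textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # object parts
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # one value per feature map
)

x = torch.randn(1, 3, 64, 64)    # dummy RGB image
print(features(x).shape)         # torch.Size([1, 64, 1, 1])
```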
6. WHY DEEP LEARNING?
Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
7. WHY DEEP LEARNING?
[Figure: hierarchical CNN features (edges, curves, tops of 3 shapes) feeding a softmax output for classification]
Image Credit: http://parse.ele.tue.nl/education/cluster2
8-12. WHY DEEP LEARNING?
Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
16-19. THE UNSUPERVISED MO
Try to learn an embedding space of image data (generally including a generative process).
1) Train an encoder and decoder to encode, then reconstruct, each image (sketched in code below).
2) Generate images from random embeddings and reinforce “good”-looking images.
DOWNSIDES
Higher-dimensional embeddings are not interpretable.
Latent distributions may contain gaps, with no sensible continuum between points.
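A hedged sketch of step 1 (encode, then reconstruct); PyTorch, MNIST-sized inputs, and the 32-dimensional embedding are assumptions for illustration:

```python
# One reconstruction training step for a plain auto-encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 32
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 28 * 28), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.rand(64, 1, 28, 28)            # stand-in batch of images in [0, 1]
recon = decoder(encoder(x)).view_as(x)   # encode, then reconstruct
loss = F.binary_cross_entropy(recon, x)  # pixel-level reconstruction loss
opt.zero_grad()
loss.backward()
opt.step()
```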
20-21. OUTLINE
1. Variational Auto-encoders (VAE)
2. Generative Adversarial Networks (GAN)
3. The combination of the two (VAE/GAN)
4. Generative Moment Matching Networks (GMMN)
5. Adversarial Auto-encoders (AAE), briefly
stitchfix/fauxtograph
42-44. TRAINING
The Generator and Discriminator play a minimax game:
Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly identifying training vs. generated data.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

In minibatch form:

$$L_D = -\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)\right]$$

$$L_G = \frac{1}{m}\sum_{i=1}^{m}\log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)$$

http://arxiv.org/pdf/1406.2661v1.pdf
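A minimal sketch of one training step implementing these losses; PyTorch and the tiny MLP generator/discriminator are illustrative assumptions, not the paper's architecture:

```python
# One GAN training step: update D on L_D, then G on L_G.
import torch
import torch.nn as nn

z_dim = 100
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

x = torch.randn(64, 784)        # stand-in for a real image minibatch
z = torch.randn(64, z_dim)      # random embeddings

# Discriminator step: minimize L_D = -(1/m) sum[log D(x) + log(1 - D(G(z)))]
loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
opt_D.zero_grad()
loss_D.backward()
opt_D.step()

# Generator step: minimize L_G = (1/m) sum[log(1 - D(G(z)))]
loss_G = torch.log(1 - D(G(z))).mean()
opt_G.zero_grad()
loss_G.backward()
opt_G.step()
```

In practice these losses are usually computed via binary cross-entropy for numerical stability, and the paper notes that having G minimize -log D(G(z)) instead gives stronger gradients early in training.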
63. TAKEAWAY
http://arxiv.org/pdf/1512.09300v1.pdf
A learned similarity metric provides feature-level distance rather than pixel-level distance.
We are trying to get away from pixels to begin with, so why use pixel distance as the metric?
The latent space of a GAN, with the encoder of a VAE.
…BUT NOT THAT EASY TO TRAIN
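A hedged sketch of the learned similarity idea: measure reconstruction error in a discriminator's feature space rather than in pixel space. Here `feat` is a hypothetical intermediate layer of the GAN discriminator, for illustration only.

```python
# Feature-level distance between an image and its reconstruction.
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # stand-in for D's features

def learned_similarity(x, x_recon):
    # Squared distance in feature space, not pixel space.
    return ((feat(x) - feat(x_recon)) ** 2).mean()

x, x_recon = torch.rand(8, 784), torch.rand(8, 784)
print(learned_similarity(x, x_recon))
```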
65. DESCRIPTION
Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss.
Train the generative network to output a distribution whose moments match the dataset's.
$$L_{\text{MMD}^2} = \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{M}\sum_{j=1}^{M}\phi(y_j) \right\|^2$$

Expanded with the kernel trick:

$$L_{\text{MMD}^2} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N}k(x_i, x_{i'}) - \frac{2}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M}k(x_i, y_j) + \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M}k(y_j, y_{j'})$$
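A small sketch of the kernel form above; PyTorch is assumed, and the Gaussian kernel with bandwidth sigma is a common choice rather than anything mandated by the slide:

```python
# Biased estimator of squared MMD with a Gaussian kernel.
import torch

def gaussian_kernel(a, b, sigma=1.0):
    d2 = torch.cdist(a, b) ** 2                   # pairwise squared distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # (1/N^2) sum k(x,x') - (2/MN) sum k(x,y) + (1/M^2) sum k(y,y')
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

x = torch.randn(128, 32)    # generated samples
y = torch.randn(128, 32)    # dataset samples
print(mmd2(x, y))           # near 0 when the two distributions match
```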
69-70. DESCRIPTION
Want to create an auto-encoder whose “code space” has a distribution matching an arbitrary specified prior.
Like a VAE, but instead of using a Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.
Train the encoder/decoder with reconstruction metrics.
Additionally: sample from the encoding space and train the encoder to produce samples indistinguishable from the specified prior.
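A hedged sketch of that two-part training; PyTorch, a unit-Gaussian prior, and the MLP sizes are all illustrative assumptions:

```python
# Adversarial auto-encoder: reconstruction step plus an adversarial step
# that pushes the code distribution toward the prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 8
enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                    nn.Linear(256, 784), nn.Sigmoid())
disc = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                     nn.Linear(64, 1), nn.Sigmoid())
opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
opt_d = torch.optim.Adam(disc.parameters())

x = torch.rand(64, 784)   # stand-in image batch in [0, 1]

# 1) Reconstruction: ordinary auto-encoder step.
recon_loss = F.binary_cross_entropy(dec(enc(x)), x)
opt_ae.zero_grad(); recon_loss.backward(); opt_ae.step()

# 2) Discriminator: tell prior samples from encoder codes.
z_prior = torch.randn(64, latent_dim)
d_loss = -(torch.log(disc(z_prior)) +
           torch.log(1 - disc(enc(x).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 3) Encoder as generator: make codes indistinguishable from the prior.
g_loss = -torch.log(disc(enc(x))).mean()
opt_ae.zero_grad(); g_loss.backward(); opt_ae.step()
```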