This deck was originally given at a styling research presentation at Stitch Fix, where I talked about some of the recent progress in unsupervised deep learning methods for image analysis. It includes descriptions of Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), their hybrid (VAE/GAN), Generative Moment Matching Networks (GMMN), and Adversarial Autoencoders (AAE).
2. WHY DEEP LEARNING?
Before deep learning, much of computer vision focused on hand-engineered feature descriptors and image statistics.
[Figure: SURF, MSER, and corner feature detectors]
Image Credit: http://www.mathworks.com/products/computer-vision/features.html
3. WHY DEEP LEARNING?
Turns out NNs are great feature extractors.
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
4. WHY DEEP LEARNING?
Turns out NNs are great feature extractors.
| Team name | Entry description | Classification error | Localization error |
|---|---|---|---|
| GoogLeNet | No localization. Top-5 val score is 6.66% error. | 0.06656 | 0.606257 |
| VGG | A combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167 |
| VGG | A combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431 |
| VGG | A combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231 |
| VGG | A combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501 |
Leaderboard
5. WHY DEEP LEARNING?
Convolution gives a local, translation-invariant feature hierarchy.
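A minimal sketch of that idea (PyTorch is assumed here, and all layer sizes are illustrative): stacked convolutions with pooling detect patterns over progressively larger regions, roughly edges, then textures, then object parts.

```python
# Sketch of a convolutional feature hierarchy (layer sizes illustrative).
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges
    nn.ReLU(),
    nn.MaxPool2d(2),                              # local translation invariance
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # curves / textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # object parts
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # one value per feature map
)

x = torch.randn(1, 3, 64, 64)    # dummy RGB image
print(features(x).shape)         # torch.Size([1, 64, 1, 1])
```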
6. WHY DEEP LEARNING?
Image Credit: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
7. WHY DEEP LEARNING?
[Figure: hierarchical CNN features (edges, curves, tops of 3 shapes) feeding a softmax output for classification]
Image Credit: http://parse.ele.tue.nl/education/cluster2
8-12. WHY DEEP LEARNING?
Image Credit: http://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
16-19. THE UNSUPERVISED MO
Try to learn an embedding space of image data (generally including a generative process).
1) Train an encoder and decoder to encode, then reconstruct, each image (sketched in code below).
2) Generate images from random embeddings and reinforce “good”-looking images.
DOWNSIDES
Higher-dimensional embeddings are not interpretable.
Latent distributions may contain gaps, with no sensible continuum between points.
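A hedged sketch of step 1 (encode, then reconstruct); PyTorch, MNIST-sized inputs, and the 32-dimensional embedding are assumptions for illustration:

```python
# One reconstruction training step for a plain auto-encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 32
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 28 * 28), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.rand(64, 1, 28, 28)            # stand-in batch of images in [0, 1]
recon = decoder(encoder(x)).view_as(x)   # encode, then reconstruct
loss = F.binary_cross_entropy(recon, x)  # pixel-level reconstruction loss
opt.zero_grad()
loss.backward()
opt.step()
```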
20-21. OUTLINE
1. Variational Auto-encoders (VAE)
2. Generative Adversarial Networks (GAN)
3. The combination of the two (VAE/GAN)
4. Generative Moment Matching Networks (GMMN)
5. Adversarial Auto-encoders (AAE), briefly
stitchfix/fauxtograph
42-44. TRAINING
The Generator and Discriminator play a minimax game:
Generator: lower loss for fooling the Discriminator.
Discriminator: lower loss for correctly identifying training vs. generated data.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

In minibatch form:

$$L_D = -\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)\right]$$

$$L_G = \frac{1}{m}\sum_{i=1}^{m}\log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)$$

http://arxiv.org/pdf/1406.2661v1.pdf
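A minimal sketch of one training step implementing these losses; PyTorch and the tiny MLP generator/discriminator are illustrative assumptions, not the paper's architecture:

```python
# One GAN training step: update D on L_D, then G on L_G.
import torch
import torch.nn as nn

z_dim = 100
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

x = torch.randn(64, 784)        # stand-in for a real image minibatch
z = torch.randn(64, z_dim)      # random embeddings

# Discriminator step: minimize L_D = -(1/m) sum[log D(x) + log(1 - D(G(z)))]
loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
opt_D.zero_grad()
loss_D.backward()
opt_D.step()

# Generator step: minimize L_G = (1/m) sum[log(1 - D(G(z)))]
loss_G = torch.log(1 - D(G(z))).mean()
opt_G.zero_grad()
loss_G.backward()
opt_G.step()
```

In practice these losses are usually computed via binary cross-entropy for numerical stability, and the paper notes that having G minimize -log D(G(z)) instead gives stronger gradients early in training.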
63. TAKEAWAY
http://arxiv.org/pdf/1512.09300v1.pdf
A learned similarity metric provides feature-level distance rather than pixel-level distance.
We are trying to get away from pixels to begin with, so why use pixel distance as the metric?
The latent space of a GAN, with the encoder of a VAE.
…BUT NOT THAT EASY TO TRAIN
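A hedged sketch of the learned similarity idea: measure reconstruction error in a discriminator's feature space rather than in pixel space. Here `feat` is a hypothetical intermediate layer of the GAN discriminator, for illustration only.

```python
# Feature-level distance between an image and its reconstruction.
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # stand-in for D's features

def learned_similarity(x, x_recon):
    # Squared distance in feature space, not pixel space.
    return ((feat(x) - feat(x_recon)) ** 2).mean()

x, x_recon = torch.rand(8, 784), torch.rand(8, 784)
print(learned_similarity(x, x_recon))
```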
65. DESCRIPTION
Use the Maximum Mean Discrepancy (MMD) between generated data and test data as the loss.
Train the generative network to output a distribution whose moments match the dataset's.
$$L_{\text{MMD}^2} = \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{M}\sum_{j=1}^{M}\phi(y_j) \right\|^2$$

Expanded with the kernel trick:

$$L_{\text{MMD}^2} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N}k(x_i, x_{i'}) - \frac{2}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M}k(x_i, y_j) + \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M}k(y_j, y_{j'})$$
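A small sketch of the kernel form above; PyTorch is assumed, and the Gaussian kernel with bandwidth sigma is a common choice rather than anything mandated by the slide:

```python
# Biased estimator of squared MMD with a Gaussian kernel.
import torch

def gaussian_kernel(a, b, sigma=1.0):
    d2 = torch.cdist(a, b) ** 2                   # pairwise squared distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # (1/N^2) sum k(x,x') - (2/MN) sum k(x,y) + (1/M^2) sum k(y,y')
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

x = torch.randn(128, 32)    # generated samples
y = torch.randn(128, 32)    # dataset samples
print(mmd2(x, y))           # near 0 when the two distributions match
```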
69-70. DESCRIPTION
Want to create an auto-encoder whose “code space” has a distribution matching an arbitrary specified prior.
Like a VAE, but instead of using a Gaussian KL divergence, use an adversarial procedure to match the coding distribution to the prior.
Train the encoder/decoder with reconstruction metrics.
Additionally: sample from the encoding space and train the encoder to produce samples indistinguishable from the specified prior.
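A hedged sketch of that two-part training; PyTorch, a unit-Gaussian prior, and the MLP sizes are all illustrative assumptions:

```python
# Adversarial auto-encoder: reconstruction step plus an adversarial step
# that pushes the code distribution toward the prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 8
enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                    nn.Linear(256, 784), nn.Sigmoid())
disc = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                     nn.Linear(64, 1), nn.Sigmoid())
opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
opt_d = torch.optim.Adam(disc.parameters())

x = torch.rand(64, 784)   # stand-in image batch in [0, 1]

# 1) Reconstruction: ordinary auto-encoder step.
recon_loss = F.binary_cross_entropy(dec(enc(x)), x)
opt_ae.zero_grad(); recon_loss.backward(); opt_ae.step()

# 2) Discriminator: tell prior samples from encoder codes.
z_prior = torch.randn(64, latent_dim)
d_loss = -(torch.log(disc(z_prior)) +
           torch.log(1 - disc(enc(x).detach()))).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 3) Encoder as generator: make codes indistinguishable from the prior.
g_loss = -torch.log(disc(enc(x))).mean()
opt_ae.zero_grad(); g_loss.backward(); opt_ae.step()
```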