2019.07.05
Takuya KOUMURA
cycentum.com
Probabilistic generative model
x ~ P(x)
x ∈ R^N (e.g. pixel values, sound amplitudes)
Difficult to sample directly because the xi are highly
dependent on each other and N is large
3 classes of generative models
⚫Generative adversarial network (GAN)
⚫Variational autoencoder (VAE)
⚫Discrete autoregressive model
History
PixelRNN & PixelCNN
Oord A van den, Kalchbrenner N,
Kavukcuoglu K (2016) Pixel
Recurrent Neural Networks
VQ-VAE2
Razavi A, Oord A van den, Vinyals O
(2019) Generating Diverse High-
Fidelity Images with VQ-VAE-2
VQ-VAE
Oord A van den, Vinyals O,
Kavukcuoglu K (2017) Neural
Discrete Representation Learning
VAE
Kingma DP, Welling M (2013) Auto-
Encoding Variational Bayes.
WaveNet
Oord A van den, Dieleman S, Zen H,
Simonyan K, Vinyals O, Graves A,
Kalchbrenner N, Senior A,
Kavukcuoglu K (2016) WaveNet: A
Generative Model for Raw Audio
Discrete autoregressive model
xi ~ P(xi | x1, …, xi−1)
xi is discrete
The output layer is softmax
⚫In the case of an RGB image
⚪ The task is 256-way classification
⚪ P(xi | x<i) = P(xi,R | x<i) P(xi,G | xi,R, x<i) P(xi,B | xi,R, xi,G, x<i)
⚫In the case of a sound
⚪ µ-law companding transformation
⚪ Quantizing to 256 discrete values
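The µ-law step above can be sketched as follows: a minimal NumPy version with µ = 255, giving 256 discrete values (the function names are mine, not from the WaveNet paper):

```python
import numpy as np

MU = 255  # mu-law parameter; gives MU + 1 = 256 discrete values

def mu_law_encode(x, mu=MU):
    """Compand amplitudes in [-1, 1] with the mu-law, then quantize to mu+1 levels."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return np.round((companded + 1) / 2 * mu).astype(np.int64)       # integers in [0, mu]

def mu_law_decode(q, mu=MU):
    """Invert the quantization and the companding."""
    companded = 2 * q / mu - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu
```

The companding keeps fine amplitude resolution near zero, where most of the energy of natural sound lies, at the cost of coarser resolution near ±1; each quantized sample then becomes a 256-way classification target for the softmax output layer.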
Variational Autoencoder
z ∈ R^D: A latent variable
z ~ P(z): A simple distribution, easy to sample from
(e.g. Gaussian, uniform, or discrete)
x = f(z) : Deterministic mapping from a latent to the
data → can be modeled by an NN
[Figure: z is sampled from P(z), then mapped through an NN to x]
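Generation under this view is just: sample z, then apply the deterministic mapping. A toy sketch, with a fixed random linear map plus tanh standing in for a trained NN f:

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 2, 4                    # latent dimension D, data dimension N
W = rng.normal(size=(N, D))    # stand-in for the weights of a trained NN

def f(z):
    """Deterministic mapping from a latent z in R^D to data x in R^N."""
    return np.tanh(W @ z)

z = rng.standard_normal(D)     # z ~ P(z): a standard Gaussian, trivial to sample
x = f(z)                       # a generated data point
```

The hard problem of sampling a high-dimensional x directly is replaced by sampling from a simple low-dimensional distribution and mapping the result through the network.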
Variational Autoencoder
z = E(x): Encoder
x = D(z): Decoder
During training, z is guided to follow a simple
distribution, such as Gaussian, uniform, or discrete
[Figure: the encoder NN maps x to z and the decoder NN maps z back to x; during training z is guided to follow P(z), and at generation time z is sampled from P(z) and decoded]
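For a Gaussian choice of P(z), the "guide to follow" is typically the KL-divergence term of the VAE objective (Kingma & Welling 2013). A sketch of that loss, assuming the encoder outputs a mean and log-variance per latent dimension:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction error plus a KL term guiding z toward N(0, I)."""
    recon = np.sum((x - x_recon) ** 2)                       # squared reconstruction error
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl
```

When the encoder already matches the prior (mu = 0, logvar = 0) the KL term vanishes and only the reconstruction error remains.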
Rationale
“we concentrate on discrete representations
which are potentially a more natural fit for many
of the modalities we are interested in. Language
is inherently discrete, similarly speech is
typically represented as a sequence of symbols.
Images can often be described concisely by
language. Furthermore, discrete representations
are a natural fit for complex reasoning, planning
and predictive learning (e.g., if it rains, I will use
an umbrella).”
Prior
⚫During training,
elements in z are
assumed to be
independent
⚫After training, a prior
over z is modeled by
an autoregressive
model, from which z
is sampled during
generation
[Figure: the encoder E maps x to ze; Discretize sets zq = ek; the decoder D reconstructs x from zq; an autoregressive model provides the prior over zq]
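The Discretize step maps each encoder output vector ze to its nearest codebook embedding, zq = ek. A minimal sketch, with a random codebook standing in for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 4                           # codebook size K, embedding dimension D
codebook = rng.normal(size=(K, D))    # stand-in for learned embeddings e_1 ... e_K

def discretize(ze):
    """Replace each vector in ze with its nearest codebook entry."""
    # Squared distances from every ze vector to every codebook vector: shape (M, K)
    d2 = np.sum((ze[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    k = np.argmin(d2, axis=-1)        # discrete codes: indices into the codebook
    return k, codebook[k]             # (indices, quantized vectors zq = e_k)

ze = rng.normal(size=(5, D))          # stand-in for encoder outputs E(x)
k, zq = discretize(ze)
```

After training, the autoregressive prior is fit over these discrete indices k; generation then samples k from that model and decodes the corresponding codebook vectors with D.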