2019.07.05
Takuya KOUMURA
cycentum.com
Probabilistic generative model
x ~ P(x)
x ∈ R^N (e.g. pixel values, sound amplitudes)
Difficult to sample directly because the xi are highly
dependent on each other and N is large
3 classes of generative models
⚫Generative adversarial network (GAN)
⚫Variational autoencoder (VAE)
⚫Discrete autoregressive model
History
PixelRNN & PixelCNN
Oord A van den, Kalchbrenner N,
Kavukcuoglu K (2016) Pixel
Recurrent Neural Networks
VQ-VAE2
Razavi A, Oord A van den, Vinyals O
(2019) Generating Diverse High-
Fidelity Images with VQ-VAE-2
VQ-VAE
Oord A van den, Vinyals O,
Kavukcuoglu K (2017) Neural
Discrete Representation Learning
VAE
Kingma DP, Welling M (2013) Auto-
Encoding Variational Bayes.
WaveNet
Oord A van den, Dieleman S, Zen H,
Simonyan K, Vinyals O, Graves A,
Kalchbrenner N, Senior A,
Kavukcuoglu K (2016) WaveNet: A
Generative Model for Raw Audio
Discrete autoregressive model
xi ~ P(xi | x1, …, xi−1)
xi is discrete
The output layer is softmax
⚫In the case of an RGB image
⚪ The task is 256-way classification
⚪ P(xi | x<i) = P(xi,R | x<i) P(xi,G | xi,R, x<i) P(xi,B | xi,R, xi,G, x<i)
⚫In the case of a sound
⚪ µ-law companding transformation
⚪ Quantizing to 256 discrete values
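The µ-law step above can be sketched as follows: a minimal NumPy version with µ = 255, giving 256 discrete values (the function names are mine, not from the WaveNet paper):

```python
import numpy as np

MU = 255  # mu-law parameter; gives MU + 1 = 256 discrete values

def mu_law_encode(x, mu=MU):
    """Compand amplitudes in [-1, 1] with the mu-law, then quantize to mu+1 levels."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return np.round((companded + 1) / 2 * mu).astype(np.int64)       # integers in [0, mu]

def mu_law_decode(q, mu=MU):
    """Invert the quantization and the companding."""
    companded = 2 * q / mu - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu
```

The companding keeps fine amplitude resolution near zero, where most of the energy of natural sound lies, at the cost of coarser resolution near ±1; each quantized sample then becomes a 256-way classification target for the softmax output layer.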
Variational Autoencoder
z ∈ R^D: A latent variable
z ~ P(z): A simple distribution, easy to sample from
(e.g. Gaussian, uniform, or discrete)
x = f(z) : Deterministic mapping from a latent to the
data → can be modeled by an NN
[Figure: z is sampled from P(z), then mapped through an NN to x]
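Generation under this view is just: sample z, then apply the deterministic mapping. A toy sketch, with a fixed random linear map plus tanh standing in for a trained NN f:

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 2, 4                    # latent dimension D, data dimension N
W = rng.normal(size=(N, D))    # stand-in for the weights of a trained NN

def f(z):
    """Deterministic mapping from a latent z in R^D to data x in R^N."""
    return np.tanh(W @ z)

z = rng.standard_normal(D)     # z ~ P(z): a standard Gaussian, trivial to sample
x = f(z)                       # a generated data point
```

The hard problem of sampling a high-dimensional x directly is replaced by sampling from a simple low-dimensional distribution and mapping the result through the network.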
Variational Autoencoder
z = E(x): Encoder
x = D(z): Decoder
During training, z is guided to follow a simple
distribution, such as Gaussian, uniform, or discrete
[Figure: the encoder NN maps x to z and the decoder NN maps z back to x; during training z is guided to follow P(z), and at generation time z is sampled from P(z) and decoded]
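For a Gaussian choice of P(z), the "guide to follow" is typically the KL-divergence term of the VAE objective (Kingma & Welling 2013). A sketch of that loss, assuming the encoder outputs a mean and log-variance per latent dimension:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction error plus a KL term guiding z toward N(0, I)."""
    recon = np.sum((x - x_recon) ** 2)                       # squared reconstruction error
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl
```

When the encoder already matches the prior (mu = 0, logvar = 0) the KL term vanishes and only the reconstruction error remains.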
Rationale
“we concentrate on discrete representations
which are potentially a more natural fit for many
of the modalities we are interested in. Language
is inherently discrete, similarly speech is
typically represented as a sequence of symbols.
Images can often be described concisely by
language. Furthermore, discrete representations
are a natural fit for complex reasoning, planning
and predictive learning (e.g., if it rains, I will use
an umbrella).”
Prior
⚫During training,
elements in z are
assumed to be
independent
⚫After training, a prior
over z is modeled by
an autoregressive
model, from which z
is sampled during
generation
[Figure: the encoder E maps x to ze; Discretize sets zq = ek; the decoder D reconstructs x from zq; an autoregressive model provides the prior over zq]
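The Discretize step maps each encoder output vector ze to its nearest codebook embedding, zq = ek. A minimal sketch, with a random codebook standing in for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 4                           # codebook size K, embedding dimension D
codebook = rng.normal(size=(K, D))    # stand-in for learned embeddings e_1 ... e_K

def discretize(ze):
    """Replace each vector in ze with its nearest codebook entry."""
    # Squared distances from every ze vector to every codebook vector: shape (M, K)
    d2 = np.sum((ze[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    k = np.argmin(d2, axis=-1)        # discrete codes: indices into the codebook
    return k, codebook[k]             # (indices, quantized vectors zq = e_k)

ze = rng.normal(size=(5, D))          # stand-in for encoder outputs E(x)
k, zq = discretize(ze)
```

After training, the autoregressive prior is fit over these discrete indices k; generation then samples k from that model and decodes the corresponding codebook vectors with D.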