Pixel Recurrent Neural Networks
Google DeepMind
Presented by Osman Tursun
METU, CENG, KOVAN Lab.
Outline
1. Generative model
2. Proposed models
3. Optimization
4. Experiment and results
5. Conclusion
Generative model
Generative model
What I cannot create, I do not understand.
Richard Feynman
Why generative models?
• Unsupervised learning is the future
• Many applications: image compression, deblurring, synthesizing images and video frames, text-to-image generation, and more
Challenges of generative models
• Modeling the probabilistic dependencies on previously generated content, e.g. earlier pixels
• Complex, high-dimensional structures such as images
• Difficulty of training models that are expressive, tractable, and scalable at the same time
Generative models
• Latent variable models (VAE, DRAW¹)
• Adversarial models (GAN²)
• Autoregressive models (NADE³, MADE⁴, RIDE⁵)
1 Karol Gregor et al. “DRAW: A recurrent neural network for image generation”. In: arXiv preprint arXiv:1502.04623 (2015).
2 Ian Goodfellow et al. “Generative adversarial nets”. In: NIPS. 2014.
3 Hugo Larochelle and Iain Murray. “The Neural Autoregressive Distribution Estimator.” In: AISTATS. vol. 1. 2011, p. 2.
4 Mathieu Germain et al. “MADE: Masked Autoencoder for Distribution Estimation.” In: ICML. 2015.
5 Lucas Theis and Matthias Bethge. “Generative Image Modeling Using Spatial LSTMs”. In: NIPS. 2015.
Comparison of generative models
Image Generation Models
Three image generation approaches dominate the field: Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and autoregressive models.
Figure: schematics of a VAE (encoder q_φ(z|x), latent z ~ p_θ(z), decoder x ~ p_θ(x|z)) and a GAN (generator G maps z to a fake sample, discriminator D decides real vs. fake), alongside autoregressive models (cf. https://openai.com/blog/generative-models/).
VAE
  Pros: efficient inference with approximate latent variables.
  Cons: generated samples tend to be blurry.
GAN
  Pros: generates sharp images; no Markov chain or approximate inference networks needed during sampling.
  Cons: difficult to optimize due to unstable training dynamics.
Autoregressive models
  Pros: very simple and stable training process; tractable likelihood; currently gives the best log-likelihoods.
  Cons: relatively inefficient during sampling.
This slide is from Yohei Sugawara
Proposed models
Auto-regressive image modeling
The joint distribution over the image pixels is factorized into a product of per-pixel conditional distributions (in raster-scan order), and each pixel's distribution is further factorized over its three colour channels:

p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})

p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
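To make the factorization concrete, here is a minimal sampling sketch in Python; the `model` callable is a hypothetical stand-in for a trained PixelRNN/PixelCNN that returns a 256-way distribution for one colour channel of one pixel, conditioned on everything generated so far.

```python
import numpy as np

def sample_image(model, n, rng=None):
    """Sample an n x n RGB image pixel by pixel, channel by channel.

    `model(image, i, c)` is a hypothetical callable that returns a length-256
    probability vector for channel c of pixel i, conditioned on pixels 0..i-1
    and on the earlier channels of pixel i already written into `image`.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    image = np.zeros((n, n, 3), dtype=np.uint8)
    for i in range(n * n):                 # raster-scan order
        row, col = divmod(i, n)
        for c in range(3):                 # R, then G, then B
            probs = model(image, i, c)     # p(x_{i,c} | x_<i, earlier channels)
            image[row, col, c] = rng.choice(256, p=probs)
    return image

# Dummy usage with a uniform "model", just to show the calling convention:
uniform = lambda image, i, c: np.full(256, 1.0 / 256)
sample = sample_image(uniform, n=8)
```

Sampling therefore requires n² sequential evaluations of the conditional distribution; the architectures below differ mainly in how cheaply and with how much context that conditional can be computed.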
Proposed models
• PixelRNN: Row LSTM and Diagonal BiLSTM variants
• PixelCNN
• Multi-Scale PixelRNN
Generative image modeling with Spatial LSTM
MCGSM: mixture of conditional Gaussian scale mixtures⁶
The figure is from RIDE⁷
6 Lucas Theis, Reshad Hosseini, and Matthias Bethge. “Mixtures of conditional Gaussian scale mixtures applied to multiscale image representations”. In: PloS one (2012).
7 Lucas Theis and Matthias Bethge. “Generative Image Modeling Using Spatial LSTMs”. In: NIPS. 2015.
Row LSTM
• Captures a roughly triangular context above each pixel
• Uses 1-D convolutions along the rows with kernel size k × 1, k ≥ 3
• The convolution is masked to exclude future pixels
• The input-to-state component is computed in parallel over the whole image, producing a 4h × n × n feature map for the four LSTM gates (see the sketch below)
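A sketch of that parallel input-to-state computation, assuming PyTorch and a hypothetical `RowLSTMInputToState` module; the masking of the kernel against future pixels is omitted for brevity, so this only illustrates how the 4h gate features for every position come out of a single row-wise convolution.

```python
import torch
import torch.nn as nn

class RowLSTMInputToState(nn.Module):
    """Input-to-state term of a Row LSTM layer (illustrative sketch only).

    A 1 x k convolution along each row maps h input channels to 4h channels,
    which are later split into the i, f, o, g gates of the LSTM.
    """
    def __init__(self, h, k=3):
        super().__init__()
        self.conv = nn.Conv2d(h, 4 * h, kernel_size=(1, k), padding=(0, k // 2))

    def forward(self, x):                  # x: (batch, h, n, n)
        return self.conv(x)                # (batch, 4h, n, n), all rows in parallel

gates = RowLSTMInputToState(h=16)(torch.zeros(1, 16, 32, 32))
i, f, o, g = gates.chunk(4, dim=1)          # four LSTM gate pre-activations
```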
Diagonal BiLSTM
• Captures the entire available context
• Scans the image along its diagonals, in both directions
Diagonal BiLSTM Skew Operation
• Parallelized via a skew operation (see the sketch below)
• n × n ←→ n × (2n − 1)
• The state-to-state convolutional kernel is 2 × 1
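A minimal NumPy sketch of the skew map for a single channel (batch and feature dimensions omitted): each row i is shifted right by i positions so that the image diagonals line up as columns, which is what lets the diagonal LSTM be computed column by column.

```python
import numpy as np

def skew(x):
    """Map an (n, n) array to (n, 2n-1) by shifting row i right by i positions."""
    n = x.shape[0]
    out = np.zeros((n, 2 * n - 1), dtype=x.dtype)
    for i in range(n):
        out[i, i:i + n] = x[i]
    return out

def unskew(y):
    """Inverse of skew: recover the original (n, n) array."""
    n = y.shape[0]
    return np.stack([y[i, i:i + n] for i in range(n)])

x = np.arange(16).reshape(4, 4)
assert np.array_equal(unskew(skew(x)), x)   # round trip recovers the input
```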
PixelCNN
• A large but bounded receptive field replaces the PixelRNN’s unbounded dependency range
• Turns generation into a pixel-level classification problem
• Training is fully parallel, but test-time generation is still sequential, pixel by pixel (see the sketch below)
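A toy PixelCNN-style stack, written as a rough PyTorch sketch rather than the paper's exact architecture: a mask-'A' first layer, a few mask-'B' layers, and a 1 × 1 convolution producing 256-way logits per pixel. The masks here are purely spatial; the R/G/B channel-group rule is sketched separately in the Masked Convolution section below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at positions that would see future pixels.

    Mask 'A' also zeroes the centre weight (used only in the first layer);
    mask 'B' keeps it. Channel-group masking for R/G/B is omitted here.
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0   # centre (A only) and right of it
        mask[kh // 2 + 1:, :] = 0                          # every row below the centre
        self.register_buffer('mask', mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding)

def tiny_pixelcnn(h=32, layers=3):
    """Mask-A 7x7 input layer, mask-B 3x3 stack, then 256 logits per pixel."""
    blocks = [MaskedConv2d('A', 1, h, kernel_size=7, padding=3), nn.ReLU()]
    for _ in range(layers):
        blocks += [MaskedConv2d('B', h, h, kernel_size=3, padding=1), nn.ReLU()]
    blocks.append(nn.Conv2d(h, 256, kernel_size=1))
    return nn.Sequential(*blocks)

logits = tiny_pixelcnn()(torch.zeros(1, 1, 28, 28))   # (1, 256, 28, 28)
```

At training time all positions are evaluated in one parallel pass; at sampling time the network still has to be run once per pixel (and channel), which is why generation remains sequential.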
PixelRNN vs PixelCNN
“Pixel Recurrent Neural Networks” received the Best Paper Award at ICML 2016. It proposes two families of models, PixelRNN and PixelCNN (and two types of LSTM layers for PixelRNN: Row LSTM and Diagonal BiLSTM).

PixelRNN (Row LSTM, Diagonal BiLSTM)
  Pros: effectively handles long-range dependencies ⇒ good performance.
  Cons: each state must be computed sequentially ⇒ computationally expensive.
PixelCNN (masked convolution)
  Pros: convolutions are easy to parallelize ⇒ much faster to train.
  Cons: bounded receptive field ⇒ inferior performance; the blind-spot problem caused by the masked convolution needs to be eliminated.

LSTM-based models are a natural choice for the autoregressive dependencies, while the CNN-based model uses masked convolutions (zeroing part of the 3 × 3 kernel weights w11 … w33) to keep the model causal.

This slide is from Yohei Sugawara
Multi-scale PixelRNN
• One unconditional PixelRNN plus one or more conditional PixelRNNs
• The unconditional network models a small, subsampled version of the original image
• Each conditional network is similar to a PixelRNN, but its layers are biased by an upsampled version of the smaller image (see the sketch below)
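A rough sketch of the biasing step, with hypothetical module and parameter names: the small image is mapped to a full-resolution conditioning map by a transposed convolution, and that map would be added as an extra bias to each layer's input-to-state term in the conditional network.

```python
import torch
import torch.nn as nn

class UpsampleCondition(nn.Module):
    """Turn a small s x s image into a (hidden, s*factor, s*factor) bias map.

    A single transposed convolution stands in for the paper's upsampling
    network; `factor` is the ratio between the large and small image sizes.
    """
    def __init__(self, channels, hidden, factor):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(channels, hidden,
                                         kernel_size=factor, stride=factor)

    def forward(self, small):              # small: (batch, channels, s, s)
        return self.deconv(small)          # (batch, hidden, s*factor, s*factor)

cond = UpsampleCondition(channels=3, hidden=16, factor=4)
bias_map = cond(torch.zeros(1, 3, 8, 8))   # (1, 16, 32, 32), added to each layer's gates
```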
Optimization
Residual Connections
• Deep network: PixelRNN 12 layers, PixelCNN 15 layers
• Residual connections increase convergence speed and help propagate the signal through many layers (see the sketch below)
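A minimal residual wrapper, assuming (as the paper's blocks do) that the wrapped layer preserves the feature-map shape; the exact block layout in the paper differs, so treat this purely as an illustration of the skip connection.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Compute x + f(x) around any shape-preserving layer f."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        return x + self.layer(x)   # the identity path lets gradients flow directly

# e.g. a 12-layer stack of residual blocks around some hypothetical make_layer(h):
# net = nn.Sequential(*[Residual(make_layer(h)) for _ in range(12)])
```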
Masked Convolution
• Masks are applied to the convolution kernels so the model cannot see future context.
• Mask A is used only at the first convolutional layer; mask B is used in all subsequent input-to-state convolutions (a sketch of the mask construction follows the citation below).
MADE: Masked Autoencoder for Distribution Estimation⁸
8 Mathieu Germain et al. “MADE: Masked Autoencoder for Distribution Estimation.” In: ICML. 2015.
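A NumPy sketch of how such a mask can be built, following the paper's description (exact implementations vary): everything strictly before the centre position is kept for all channels, and at the centre the channels are split into R/G/B groups so that mask A sees only strictly earlier groups while mask B may also see its own group.

```python
import numpy as np

def build_mask(mask_type, out_ch, in_ch, k):
    """Return an (out_ch, in_ch, k, k) {0,1} mask for a masked convolution."""
    mask = np.ones((out_ch, in_ch, k, k), dtype=np.float32)
    c = k // 2
    mask[:, :, c, c + 1:] = 0              # right of the centre pixel
    mask[:, :, c + 1:, :] = 0              # rows below the centre pixel

    def group(ch, total):                  # assign each channel to the R, G or B group
        return ch * 3 // total

    for o in range(out_ch):
        for i in range(in_ch):
            if mask_type == 'A':           # first layer: only strictly earlier groups
                keep = group(i, in_ch) < group(o, out_ch)
            else:                          # mask 'B': a group may also see its own features
                keep = group(i, in_ch) <= group(o, out_ch)
            mask[o, i, c, c] = 1.0 if keep else 0.0
    return mask

mask_A = build_mask('A', out_ch=3, in_ch=3, k=5)   # e.g. first layer on RGB input
mask_B = build_mask('B', out_ch=6, in_ch=6, k=3)
```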
Discrete Softmax Distribution
• Turns a regression problem into a 256-way classification problem per colour channel
• Simple to implement, yet gives better results (see the sketch below)
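A sketch of the resulting training objective, assuming per-channel logits of shape (batch, 256, H, W): every pixel value in {0, …, 255} is a class, and the loss is ordinary cross-entropy, whose value is the negative log-likelihood used for evaluation.

```python
import torch
import torch.nn.functional as F

def pixel_nll(logits, targets):
    """Mean negative log-likelihood over the 256 discrete pixel values.

    logits:  (batch, 256, H, W) unnormalised scores for one colour channel
    targets: (batch, H, W) integer pixel values in [0, 255]
    """
    return F.cross_entropy(logits, targets)

logits = torch.randn(2, 256, 8, 8)
targets = torch.randint(0, 256, (2, 8, 8))
loss = pixel_nll(logits, targets)          # nats per pixel; divide by log(2) for bits
```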
Experiment and results
Specification of Models
Evaluation
• Datasets: MNIST, CIFAR-10, and ImageNet
• Metric: log-likelihood (reported in nats for MNIST and in bits per dimension for CIFAR-10 and ImageNet)
Quantitative results
Image completions
Conclusion
Summary
• Row LSTM, Diagonal BiLSTM, and PixelCNN models
• A discrete softmax output distribution
• Masked convolutions
• Residual connections
• New state of the art on MNIST and CIFAR-10, with results also reported on ImageNet
Useful resources
• Sergei Turukin PixelCNN post and implementation
• PixelRNN conference presentation
• PixelRNN review by Kyle Kastner
• Blog post on DRAW
Questions?
