
PixelCNN, Wavenet, Normalizing Flows - Santiago Pascual - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.



  1. 1. Santiago Pascual de la Puente santi.pascual@upc.edu PhD Candidate Universitat Politècnica de Catalunya Technical University of Catalonia Deep Generative Models III: PixelCNN/Wavenet Normalizing Flows #DLUPC http://bit.ly/dlai2018
  2. 2. Outline ● What is in here? ● Introduction ● Taxonomy ● Variational Auto-Encoders (VAEs) (DGMs I) ● Generative Adversarial Networks (GANs) (DGMs II) ● PixelCNN/Wavenet (DGMs III) ● Normalizing Flows (and flow-based gen. models) (DGMs III) ○ Real NVP ○ GLOW ● Models comparison ● Conclusions 2
  3. 3. Previously in DGMs...
  4. 4. What is a generative model? 4 We have datapoints that emerge from some generating process (landscapes in nature, speech from people, etc.). Having our dataset X = {x1, x2, …, xN} with example datapoints (images, waveforms, written text, etc.), each point xn lies in some M-dimensional space. Each point xn comes from an M-dimensional probability distribution P(X) → Model it!
  5. 5. 5 What is a generative model? We can generate any type of data, like speech waveform samples or image pixels. [Figure: a waveform of M samples = M dimensions; a 32x32 image with 3 channels is scanned and unrolled into a vector of M = 32x32x3 = 3072 dimensions.]
  6. 6. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples! 6 What is a generative model? Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
  7. 7. 7 We will see models that learn the probability density function: ● Explicitly (we impose a known loss function) ○ (1) With approximate density → Variational Auto-Encoders (VAEs) ○ (2) With tractable density & likelihood-based: ■ PixelCNN, Wavenet. ■ Flow-based models: real-NVP, GLOW. ● Implicitly (we “do not know” the loss function) ○ (3) Generative Adversarial Networks (GANs). Taxonomy
  8. 8. Variational Auto-Encoder We finally reach the Variational AutoEncoder objective function, a decomposition of the log likelihood of our data: log p(x) = D_KL(q(z|x) ‖ p(z|x)) + E_q(z|x)[log p(x|z)] − D_KL(q(z|x) ‖ p(z)). The first term is the non-computable and non-negative (KL) approximation error. The second is the reconstruction loss of our data given the latent space → NEURAL DECODER reconstruction loss! The third regularizes our latent representation → NEURAL ENCODER projects over the prior. 8 Credit: Kristiadi
  9. 9. Variational Auto-Encoder Reparameterization trick: sample ε ~ N(0, I) and operate with it, multiplying by σ and summing μ, so that z = μ + σ ⊙ ε remains differentiable with respect to the encoder outputs. 9
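To make the trick concrete, here is a minimal PyTorch sketch (not the lecture's code) of the reparameterized sample together with the two loss terms from the previous slide; the MSE reconstruction term and the function names are illustrative assumptions.

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the graph differentiable."""
    std = torch.exp(0.5 * log_var)      # sigma
    eps = torch.randn_like(std)         # eps ~ N(0, I)
    return mu + std * eps               # z

def vae_loss(x, x_rec, mu, log_var):
    """Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I))."""
    rec = torch.nn.functional.mse_loss(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return rec + kl
```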
  10. 10. Variational Auto-Encoder Generative behavior Q: How can we now generate new samples once the underlying generating distribution is learned? A: We can sample from our prior, discarding the encoder path (e.g. drawing z1, z2, z3 to generate different samples). 10
  11. 11. GAN schematic 11 First things first: what is an encoder neural network? Some deep network that extracts high-level knowledge from our (potentially raw) data; we can use that knowledge to discriminate among classes, to score probabilistic outcomes, to extract features, etc. Adversarial Learning = let's use the encoder's abstraction to learn whether the generator did a good job generating something realistic ẍ = G(z) or not.
  12. 12. Generative + Adversarial 12 [Diagram: a latent random variable z is sampled and fed to the Generator; the Discriminator receives either a generated sample or a real-world sample from the database and outputs Real/Fake, driving the loss.] D is a binary classifier, whose objective is: database images are Real, whereas generated ones are Fake.
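As a concrete illustration of D's binary objective, a hedged PyTorch sketch of the vanilla BCE discriminator loss; `G`, `D` and the shapes of their outputs are assumptions, not taken from the slides.

```python
import torch
import torch.nn.functional as F

def d_loss(D, G, x_real, z):
    """Discriminator: push database samples toward 'Real' (1) and G(z) toward 'Fake' (0)."""
    real_logits = D(x_real)
    fake_logits = D(G(z).detach())  # stop gradients into G during D's update
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake
```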
  13. 13. Conditional GANs GANs can be conditioned on extra info besides z: text, labels, speech, etc. z might capture random characteristics of the data (variabilities of plausible futures), whilst c would condition the deterministic parts! For details on ways to condition GANs: Ways of Conditioning Generative Adversarial Networks (Kwak et al.)
  14. 14. Evolving GANs Different loss functions in the Discriminator can fit your purpose: ● Binary cross-entropy (BCE): Vanilla GAN ● Least-Squares: LSGAN ● Wasserstein GAN + Gradient Penalty: WGAN-GP ● BEGAN: Energy based (D as an autoencoder) with boundary equilibrium. ● Maximum margin: Improved binary classification
  15. 15. Likelihood-based Models
  16. 16. PixelRNN, PixelCNN & Wavenet
  17. 17. Factorizing the joint distribution ● Model explicitly the joint probability distribution of data streams x as a product of element-wise conditional distributions for each element xi in the stream. ○ Example: an image x of size (n, n) is decomposed by scanning pixels in raster mode (row by row, and pixel by pixel within every row). ○ Apply the probability chain rule: p(x) = ∏i p(xi | x1, …, xi−1), where xi is the i-th pixel in the image. 17
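To make the factorization tangible, a hedged sketch that evaluates the joint log-likelihood of an image as the sum of per-pixel conditional log-probabilities; the `model` interface (256-way logits per pixel, causal in raster order) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def image_log_likelihood(model, x):
    """x: (B, 1, H, W) greyscale image with integer pixel values in [0, 255].

    The model is assumed to output logits of shape (B, 256, H, W), where the
    prediction at pixel i only depends on pixels x_1..x_{i-1} in raster order.
    """
    logits = model(x.float() / 255.0)                   # (B, 256, H, W)
    log_probs = F.log_softmax(logits, dim=1)            # per-pixel categorical distribution
    picked = log_probs.gather(1, x.long()).squeeze(1)   # log p(x_i | x_<i) at the true value
    return picked.flatten(1).sum(dim=1)                 # sum over the raster scan
```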
  18. 18. Factorizing the joint distribution ● To model highly nonlinear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model. ○ Q: How can we deal with this (which model)? 18
  19. 19. Factorizing the joint distribution ● To model highly nonlinear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model. ○ Q: How can we deal with this (which model)? ■ A: Recurrent Neural Network. 19
  20. 20. PixelRNN An RNN predicts the probability of each sample xi with a categorical output distribution: a softmax over the RNN/LSTM state, which summarizes the previously generated pixels xi−1, xi−2, xi−3, xi−4, … up to position (x, y). At every pixel, in a greyscale image: 256 classes (8 bits/pixel). 20 Recall the RNN formulation: the hidden state summarizes all previous inputs. (van den Oord et al. 2016)
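A minimal sketch of the PixelRNN idea for a raster-scanned greyscale image: an LSTM consumes the previous pixel values and emits a 256-way softmax for the next one. Layer sizes and class names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class TinyPixelRNN(nn.Module):
    """Predict p(x_i | x_<i) with an LSTM over the raster-scanned pixel sequence."""
    def __init__(self, hidden=128, levels=256):
        super().__init__()
        self.embed = nn.Embedding(levels, hidden)   # previous pixel value -> vector
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, levels)        # logits over 256 intensities

    def forward(self, prev_pixels):                 # (B, T) integers in [0, 255]
        h, _ = self.rnn(self.embed(prev_pixels))
        return self.out(h)                          # (B, T, 256) categorical logits
```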
  21. 21. Factorizing the joint distribution ● To model highly nonlinear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model. ○ Q: How can we deal with this (which model)? ■ A: Recurrent Neural Network. ■ B: Convolutional Neural Network. 21(van den Oord et al. 2016)²
  22. 22. Factorizing the joint distribution ● To model highly nonlinear and long-range correlations between pixels and their conditional distributions, we will need a powerful non-linear sequential model. ○ Q: How can we deal with this (which model)? ■ A: Recurrent Neural Network. ■ B: Convolutional Neural Network. A CNN? Whut?? 22
  23. 23. My CNN is going causal... Let's say we have a sequence of 100-dimensional vectors describing words ("hi there how are you doing ?"), with sequence length = 8. We arrange it in 2D (100 dim × sequence length 8) and focus on 1D to exemplify how to make it causal. 23
  24. 24. 24 My CNN is going causal... We can apply a 1D convolutional activation over the (100 dim × 8) matrix, for an arbitrary kernel of width = 3: each 1D convolutional kernel (K1) is a 2D matrix of size (3, 100).
  25. 25. 25 My CNN is going causal... Keep in mind we are working with 100 dimensions, although here we depict just one for simplicity. [Diagram: a width-3 kernel (w1 w2 w3) slides over the 8 time-steps with no padding, producing 6 outputs.] The output length of the convolution is well known to be seq_length − kwidth + 1 = 8 − 3 + 1 = 6, so the output matrix will be (6, 100) because there was no padding.
  26. 26. 26 My CNN is going causal... When we add zero padding, we normally do so on both sides of the sequence (as in image padding). [Diagram: one zero is padded on each side, giving a length-10 input.] The output length of the convolution is well known to be seq_length − kwidth + 1 = 10 − 3 + 1 = 8, so the output matrix will be (8, 100) because we had padding.
  27. 27. 27 My CNN went causal Add the zero padding just on the left side of the sequence, not symmetrically. [Diagram: two zeros are padded on the left, giving a length-10 input.] The output length is again seq_length − kwidth + 1 = 10 − 3 + 1 = 8, so the output matrix will be (8, 100) because we had padding. HOWEVER: now every time-step t depends only on the two previous inputs and the current time-step → every output is causal. Roughly: we make a causal convolution by left-padding the sequence with (kwidth − 1) zeros.
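A hedged PyTorch sketch of exactly this recipe: left-pad with (kwidth − 1) zeros so the output keeps length T and each time-step only sees current and past inputs. Class and argument names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution made causal by left-padding with (kwidth - 1) zeros."""
    def __init__(self, in_ch=100, out_ch=100, kwidth=3):
        super().__init__()
        self.left_pad = kwidth - 1
        self.conv = nn.Conv1d(in_ch, out_ch, kwidth)

    def forward(self, x):                       # x: (B, channels, T)
        x = F.pad(x, (self.left_pad, 0))        # pad only the left of the time axis
        return self.conv(x)                     # output length stays T, outputs are causal

# e.g. an 8-step sequence of 100-dim vectors keeps length 8 after the causal conv
y = CausalConv1d()(torch.randn(1, 100, 8))      # -> (1, 100, 8)
```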
  28. 28. PixelCNN/Wavenet We can simply exemplify how causal convolutions help us with the SoTA model for audio generation, Wavenet. We have a one-dimensional stream of samples of length T: 28
  29. 29. PixelCNN/Wavenet We can simply exemplify how causal convolutions help us with the SoTA model for audio generation, Wavenet. We have a one-dimensional stream of samples of length T (e.g. audio): the receptive field has to be T. 29 (van den Oord et al. 2016)³
  30. 30. PixelCNN/Wavenet We can simply exemplify how causal convolutions help us with the SoTA model for audio generation, Wavenet. We have a one-dimensional stream of samples of length T: the receptive field has to be T. Dilated convolutions help us reach a receptive field sufficiently large to emulate an RNN! 30 Samples available online
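A hedged sketch of how stacking causal convolutions with dilations 1, 2, 4, … grows the receptive field exponentially (the WaveNet trick); channel counts, the dilation schedule and the plain ReLU stack are illustrative simplifications, not the real architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Stack of causal convolutions with dilations 1, 2, 4, ..., 2^(n_layers-1)."""
    def __init__(self, channels=32, kwidth=2, n_layers=10):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kwidth, dilation=2 ** l) for l in range(n_layers)]
        )
        self.kwidth = kwidth
        # receptive field = 1 + (kwidth - 1) * sum(dilations) = 1024 samples here
        self.receptive_field = 1 + (kwidth - 1) * sum(2 ** l for l in range(n_layers))

    def forward(self, x):                                  # x: (B, channels, T)
        for layer in self.layers:
            pad = (self.kwidth - 1) * layer.dilation[0]    # left-pad to stay causal
            x = torch.relu(layer(F.pad(x, (pad, 0))))      # length stays T, outputs stay causal
        return x
```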
  31. 31. Conditional PixelCNN/Wavenet ● Q: Can we make the generative model learn the natural distribution of X and perform a specific task conditioned on it? ○ A: Yes! We can condition each sample on an embedding/feature vector h, which can represent encoded text, speaker identity, generation speed, etc. 31
  32. 32. Conditional PixelCNN/Wavenet 32 ● PixelCNN produces very sharp and realistic samples, but... ● Q: Is this a cheap generative model computationally?
  33. 33. Conditional PixelCNN/Wavenet ● PixelCNN produces very sharp and realistic samples, but... ● Q: Is this a cheap generative model computationally? ○ A: Nope. Take the example of Wavenet: the network must be run forward 16,000 times in an auto-regressive way to predict 1 second of audio at the standard 16 kHz sampling rate. 33
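To see where the cost comes from, a hedged sketch of naive auto-regressive sampling: one full forward pass per output sample, i.e. 16,000 passes for one second of 16 kHz audio. The `wavenet` module, the zero-initialized context and the 8-bit categorical output are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_audio(wavenet, seconds=1, sr=16000, receptive_field=1024):
    """Naive auto-regressive sampling: one forward pass per generated sample."""
    audio = torch.zeros(1, 1, receptive_field)              # zero-initialized context
    for _ in range(seconds * sr):                           # 16,000 iterations per second
        logits = wavenet(audio[:, :, -receptive_field:])    # assumed (1, 256, T) logits
        probs = torch.softmax(logits[:, :, -1], dim=1)      # distribution of the next sample
        nxt = torch.multinomial(probs.squeeze(0), 1)        # draw one 8-bit quantized value
        nxt = (nxt.float() / 255.0).view(1, 1, 1)           # back to a waveform-like value
        audio = torch.cat([audio, nxt], dim=2)
    return audio[:, :, receptive_field:]
```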
  34. 34. Normalizing Flows
  35. 35. 35 Normalizing Flows
  36. 36. 36 Normalizing Flows
  37. 37. 37 Normalizing Flows
  38. 38. 38 Normalizing Flows
  39. 39. 39 Normalizing Flows
  40. 40. 40 Normalizing Flows
  41. 41. 41 Normalizing Flows and the Effective Chained Composition
  42. 42. 42 Normalizing Flows and the Effective Chained Composition (Mohamed and Rezende 2017)
  43. 43. 43 Normalizing Flows and the Effective Chained Composition (Mohamed and Rezende 2017)
  44. 44. 44 Normalizing Flows and the Effective Chained Composition (Mohamed and Rezende 2017)
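The chained composition on these slides rests on the change-of-variables rule: log q_K(z_K) = log q_0(z_0) − Σ_k log|det ∂f_k/∂z_{k−1}|. A minimal numeric sketch, using simple element-wise scaling flows (an illustrative choice, not a flow from the slides) so the log-determinants are trivial:

```python
import torch

def chained_flow_log_prob(z0, scales):
    """Push a base sample z0 ~ N(0, I) through K element-wise affine flows z_k = s_k * z_{k-1}.

    Each Jacobian here is diagonal, so log|det J_k| = sum of log|s_k| over dimensions.
    """
    base = torch.distributions.Normal(0.0, 1.0)
    log_prob = base.log_prob(z0).sum(dim=-1)            # log q_0(z_0)
    z = z0
    for s in scales:                                    # each s: (dim,) non-zero scale vector
        z = s * z                                       # one flow step f_k
        log_prob = log_prob - torch.log(s.abs()).sum()  # subtract log|det J_k|
    return z, log_prob

z0 = torch.randn(4, 2)                                  # batch of base samples
zK, logq = chained_flow_log_prob(z0, [torch.tensor([2.0, 0.5]), torch.tensor([1.5, 3.0])])
```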
  45. 45. 45 Planar Flow
  46. 46. 46 Planar Flow
  47. 47. 47 Planar Flow We can apply this simple flow to an already seen model to enhance its representational power, though... We still sample z ~ N(0, I)!!
  48. 48. 48 Planar Flow We can apply this simple flow to an already seen model to enhance its representational power, though... We still sample z ~ N(0, I)!! But if we had a somewhat more sophisticated flow formulation, we could directly build an end-to-end generative model!
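A hedged sketch of a single planar flow step from Rezende and Mohamed (2015), f(z) = z + u·tanh(wᵀz + b), whose log-det-Jacobian is log|1 + uᵀψ(z)| with ψ(z) = h′(wᵀz + b)·w. The parameter initialization is illustrative and the invertibility constraint on u is omitted for brevity.

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """f(z) = z + u * tanh(w^T z + b); log|det J| = log|1 + u^T psi(z)|."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                         # z: (B, dim)
        lin = z @ self.w + self.b                                 # (B,)
        f_z = z + self.u * torch.tanh(lin).unsqueeze(-1)          # planar transformation
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w   # h'(w^T z + b) * w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)   # (B,)
        return f_z, log_det
```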
  49. 49. 49 Enhancing Normalizing Flows
  50. 50. Real-NVP
  51. 51. Real-valued Non-Volume Preserving (Real NVP) Flows 51
  52. 52. Real-valued Non-Volume Preserving (Real NVP) Flows 52
  53. 53. Real-valued Non-Volume Preserving (Real NVP) Flows 53
  54. 54. Real-valued Non-Volume Preserving (Real NVP) Flows 54 [Diagram: the input is split in two; one half is copied through unchanged, the other goes through an affine coupling.]
  55. 55. Real-NVP Permutations 55 [Diagram: successive layers alternate which half is copied and which goes through the affine coupling, so every dimension gets transformed across the stack.]
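A hedged sketch of one affine coupling layer matching the copy/affine-coupling diagrams above: half the dimensions are copied, the other half are scaled and shifted by functions of the copied half, and the `flip` flag plays the role of the permutation between layers. Sizes, the small MLP and an even `dim` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Real NVP coupling: y1 = x1 (copy), y2 = x2 * exp(s(x1)) + t(x1)."""
    def __init__(self, dim, hidden=64, flip=False):
        super().__init__()
        half = dim // 2                                       # assumes dim is even
        self.flip = flip                                      # alternate which half is copied
        self.net = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * half)) # predicts [s, t]

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        if self.flip:
            x1, x2 = x2, x1
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t                            # invertible, triangular Jacobian
        log_det = s.sum(dim=-1)                               # log|det J| = sum of s
        y = torch.cat([y2, x1], dim=-1) if self.flip else torch.cat([x1, y2], dim=-1)
        return y, log_det
```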
  56. 56. GLOW
  57. 57. 57 GLOW
  58. 58. 58 GLOW
  59. 59. 59 GLOW Improvements over Real-NVP.
  60. 60. 60 GLOW step of flow formulations
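A hedged sketch of the invertible 1×1 convolution that sits inside each GLOW step of flow (between actnorm and the affine coupling); its log-determinant contribution is h·w·log|det W| for an h×w feature map. The orthogonal initialization is kept, but the paper's LU-decomposition trick is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Invertible1x1Conv(nn.Module):
    """Learned channel mixing: y = W x per spatial position, W a (C, C) invertible matrix."""
    def __init__(self, channels):
        super().__init__()
        w_init, _ = torch.linalg.qr(torch.randn(channels, channels))  # random orthogonal init
        self.weight = nn.Parameter(w_init)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = F.conv2d(x, self.weight.view(c, c, 1, 1))     # 1x1 convolution
        log_det = h * w * torch.slogdet(self.weight)[1]   # h*w*log|det W| per example
        return y, log_det
```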
  61. 61. 61 GLOW Multi-scale structure
  62. 62. 62 GLOW Multi-scale structure This is a very powerful structure: we can infer z features from an image and we can sample an image from z features, EITHER WAY!
  63. 63. 63 GLOW Multi-scale structure Additionally, during training the loss is computed in the simple Z space!
  64. 64. 64 GLOW Results on CelebA
  65. 65. 65 GLOW Results with temperature
  66. 66. 66 GLOW Results interpolating Z
  67. 67. 67 GLOW Semantic Manipulation
  68. 68. 68 GLOW’s depth dependency
  69. 69. 69 WaveGlow (Prenger et al. 2018) Modified structure to deal with 1D audio data, giving SoTA results without the auto-regressive restriction. Audio samples available online
  70. 70. Models Comparison
  71. 71. 71 Models comparison
  72. 72. 72 Models comparison Latent space → feature disentanglement: we can navigate it and better control what we generate
  73. 73. 73 Models comparison No direct likelihood measurement, just an approximation or qualitative insights.
  74. 74. 74 Models comparison Direct likelihood values, direct objective “good fit” measurements.
  75. 75. 75 Models comparison Very deep models (many flow steps); they need a lot of memory and compute power to get the best results.
  76. 76. 76 Models comparison Fast training.
  77. 77. 77 Models comparison Slow auto-regressive sampling.
  78. 78. 78 Models comparison One-shot predictions, quickest to sample from.
  79. 79. Conclusions
  80. 80. 80 Conclusions ● (Deep) Generative models are built to learn the underlying structures hidden in our (high-dimensional) data. ● There are currently four main successful deep generative models: VAE, GAN, PixelCNN and NF-based. ● PixelCNN factorizes an explicitly known discrete distribution with the probability chain rule ○ These are the slowest generative models to sample from because of their recursive (auto-regressive) nature. ● VAEs, GANs, and NFs are capable of factorizing our highly complex PDFs into a simpler prior of our choice over z, which we can then explore and control to generate specific features. ● Once we have trained a VAE with the variational lower bound, we can generate new samples from the learned z manifold, assuming the most significant data lies over the prior → hence we like powerful priors (NFs can help in this matter). ● GANs use an implicit learning method to mimic our data distribution Pdata. ● GANs and NFs are the sharpest generators at the moment → very active research fields. ● GANs are the hardest generative models to train owing to the intrinsic instabilities of the G-D synergies. ● NFs (Real NVP based) require many steps to be effective in their task, consuming many resources at the moment.
  81. 81. Thanks! Questions?
  82. 82. References 82 ● Glow: generative flow with invertible 1x1 convolutions (Kingma and Dhariwal, 2018) ● Density Estimation Using Real-NVP (Dinh et al. 2016) ● Variational Inference With Normalizing Flows (Rezende and Mohamed, 2015) ● Paper Review on Glow (Pascual 2018) ● Normalizing Flows (Kosiorek 2016) ● Normalizing Flows Tutorial 1 (Jang 2016) ● Normalizing Flows Tutorial 2 (Jang 2016)
