How PixelCNN and PixelRNN are used to create WaveNet.
WaveNet is a neural network for generating raw audio, developed by DeepMind (a Google company), and is used for speech synthesis in Google Duplex.
1. WaveNet: an audio generative model based on the PixelCNN architecture
Guided by: Ms. FABEELA ALI RAWTHER
Presented by: ABEY R HURTIS
2. Introduction
● This work explores raw audio generation techniques, inspired by recent advances in
neural autoregressive generative models.
● The question this paper addresses is whether similar approaches can succeed in
generating wideband raw audio waveforms, which are signals with very high
temporal resolution (at least 16,000 samples per second)
3. Convolutional Neural Network (CNN)
● A CNN has far fewer parameters than a comparable dense (fully connected) network, because each filter's weights are shared across all input positions
● Fewer parameters generally mean a smaller memory footprint and shorter training time
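To make the parameter argument concrete, here is a rough sketch that counts the weights and biases of one dense layer and one convolutional layer producing an output of the same size. The layer sizes (a 32x32 RGB input, 64 feature maps, a 3x3 filter) are assumptions chosen only for illustration.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a 2-D convolution; independent of the image size."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# Hypothetical sizes: a 32x32 RGB input mapped to 64 feature maps
# of the same spatial size.
height, width, in_ch, out_ch = 32, 32, 3, 64
dense = dense_params(height * width * in_ch, height * width * out_ch)
conv = conv2d_params(3, 3, in_ch, out_ch)
print(f"dense: {dense:,} parameters, 3x3 conv: {conv:,} parameters")
```

The convolution count does not depend on the image height or width, which is where the saving comes from.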
4. PixelCNN
● The h features for each input position at every
layer in the network are split into three parts, each
corresponding to one of the RGB channels
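● PixelCNN factorizes the joint distribution of the pixels into a product of conditional distributions, and generates pixel Xi from the pixels that precede it
● The pixels depend on one another in raster-scan order (row by row, left to right within each row)
● Pixel Xi is generated one channel at a time: R first, then G conditioned on R, then B conditioned on R and G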
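Written out, the factorization described above takes the following form, as given in the Pixel Recurrent Neural Networks paper. Here x_i is the pixel written Xi in the bullets, n^2 is the number of pixels in an n x n image, and x_{<i} stands for all pixels before x_i in raster-scan order.

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \dots, x_{i-1})$$

$$p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})$$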
5. Generation of Pixels
[Figure: the sequence of pixels X1, ..., Xi, ..., Xn² in raster-scan order]
Pixel Xi is generated for each of the three channels.
A softmax layer at the end of the network outputs a probability distribution over the possible
intensity values of Xi, given the context pixels before Xi. The intensity is then sampled from this distribution.
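The softmax step can be sketched as follows. This is a minimal illustration, not the network itself: the 256 logits are random placeholder values standing in for the output of a trained model, and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_intensity(logits, rng):
    """Draw one intensity index (0..len(logits)-1) from the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
# Placeholder logits for one channel of pixel Xi: 256 scores, one per intensity.
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
probs = softmax(logits)
value = sample_intensity(logits, rng)
print(value, round(sum(probs), 6))
```

Sampling, rather than always taking the most probable intensity, is what lets repeated runs produce different images.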
6. Masked Convolution
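● A masked filter sets the weights for the current and all future pixels to zero, so each position only sees pixels above it and to its left
● Stacking such masked filters leaves a blind spot: some pixels above and to the right of the current pixel never enter the receptive field
● To remove the blind spot, the convolution is split into two stacks: a vertical stack, and an added horizontal stack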
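A mask of this kind can be built as below. This is a single-channel sketch that ignores the ordering of the R, G and B channels. The names "A" and "B" follow the mask types of the Pixel Recurrent Neural Networks paper; the function name is hypothetical.

```python
def causal_mask(kernel_size: int, mask_type: str):
    """Return a kernel_size x kernel_size mask of 0/1 values.

    Type "A" also hides the centre position (used in the first layer);
    type "B" keeps the centre (used in later layers).
    """
    if kernel_size % 2 == 0 or mask_type not in ("A", "B"):
        raise ValueError("expected an odd kernel size and mask type 'A' or 'B'")
    centre = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            above = row < centre
            left_in_same_row = row == centre and col < centre
            is_centre = row == centre and col == centre
            if above or left_in_same_row or (is_centre and mask_type == "B"):
                mask[row][col] = 1
    return mask


for line in causal_mask(3, "A"):
    print(line)
```

The mask is multiplied element-wise with the filter weights before the convolution is applied, so the zeroed positions never contribute to the output.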
7. Masked Convolution (cont…)
● The vertical stack of convolutions reads all the rows above the current pixel
● The horizontal stack of convolutions reads the current row, from the current pixel to all the pixels to its left
8. Gated PixelCNN
● With only a plain vertical and a plain horizontal convolution, more complex interactions between features cannot be learned, although stacking more layers may help
● A different model is needed because the existing one is not expressive enough to capture the more important features
● The Gated PixelCNN fills this gap: in both the vertical and the horizontal stack it multiplies the output of a filter convolution (tanh) by the output of a gate convolution (sigmoid)
⊙ element-wise multiplication
* convolution operation
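The gated activation unit that the symbols ⊙ and * belong to is, as given in the Conditional Image Generation with PixelCNN Decoders paper:

$$\mathbf{y} = \tanh(W_{k,f} * \mathbf{x}) \odot \sigma(W_{k,g} * \mathbf{x})$$

Here σ is the sigmoid function, k is the layer index, and f and g mark the filter and gate weights.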
10. Conditional PixelCNN
● Given a high-level image description represented as a latent vector h, we
seek to model the conditional distribution p(x|h)
● We model the conditional distribution by adding terms that depend on h to
the activations before the nonlinearities in the gated PixelCNN equation
11. Conditional PixelCNN (cont…)
● If h is a one-hot encoding that specifies a class, this is equivalent to adding a
class-dependent bias at every layer
● By mapping h to a spatial representation s = m(h) (which has the same width and height
as the image but may have an arbitrary number of feature maps) with a
deconvolutional neural network m(), we obtain a location-dependent
bias
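For reference, the two forms of conditioning described on these slides can be written as follows, using the notation of the Conditional Image Generation with PixelCNN Decoders paper. With a latent vector h, the terms that depend on h are added inside both nonlinearities:

$$\mathbf{y} = \tanh(W_{k,f} * \mathbf{x} + V_{k,f}^{T}\mathbf{h}) \odot \sigma(W_{k,g} * \mathbf{x} + V_{k,g}^{T}\mathbf{h})$$

With the spatial representation s = m(h), the added term becomes an unmasked 1x1 convolution over s, which gives the location-dependent bias:

$$\mathbf{y} = \tanh(W_{k,f} * \mathbf{x} + V_{k,f} * \mathbf{s}) \odot \sigma(W_{k,g} * \mathbf{x} + V_{k,g} * \mathbf{s})$$

When h is one-hot, the product V^T h simply selects one column of V, which is why the first form reduces to a class-dependent bias.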
12. WaveNet
● The joint probability of a waveform x = {x1,...,xT} is factored as a product of conditional
probabilities as follows: p(x) = ∏ (t = 1 to T) p(xt | x1,...,xt-1)
● The conditional probability distribution is modelled by a stack of convolutional layers.
There are no pooling layers, and the output has the same time dimensionality as the input
● The model uses a softmax layer to output a categorical distribution over the next sample value xt
● The model is optimized to maximize the log-likelihood of the data w.r.t. the
parameters
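The training objective can be illustrated with a small sketch that sums log p(xt | x1,...,xt-1) over a sequence. The per-timestep distributions below are made-up numbers standing in for the softmax outputs of a model, and the function name is hypothetical.

```python
import math


def sequence_log_likelihood(step_probs, targets):
    """Sum of log p(x_t | x_1..x_{t-1}) over all timesteps."""
    if len(step_probs) != len(targets):
        raise ValueError("need one distribution per target sample")
    return sum(math.log(probs[target]) for probs, target in zip(step_probs, targets))


# Made-up model outputs: a distribution over 4 quantized values
# at each of 3 timesteps.
step_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
targets = [0, 1, 3]  # the values actually observed in the waveform
log_lik = sequence_log_likelihood(step_probs, targets)
print(f"log-likelihood = {log_lik:.4f}, loss = {-log_lik:.4f}")
```

Maximizing the log-likelihood is the same as minimizing its negative, which is the cross-entropy loss commonly used to train softmax outputs.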
13. Causal Convolution
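● The causal convolution ensures that the ordering of the data is respected: the prediction for timestep t cannot depend on any future timestep
● For images a causal convolution needs a mask tensor, but audio is 1-D data, so it is enough to shift the output of an ordinary convolution by a few timesteps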
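A causal convolution on 1-D data can be sketched in a few lines. This version left-pads the input with zeros, which has the same effect as shifting the output; the function name and the example filter are assumptions for illustration.

```python
def causal_conv1d(signal, kernel):
    """Causal 1-D convolution: output[t] depends only on signal[0..t].

    kernel[-1] weights the current sample, kernel[-2] the previous one, etc.
    The input is left-padded with zeros so the output keeps the input length.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(signal))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]  # average of the previous and the current sample
print(causal_conv1d(signal, kernel))  # [0.5, 1.5, 2.5, 3.5]
```

Changing a later sample of `signal` leaves all earlier outputs unchanged, which is the causality property.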
14. Causal Convolution (cont…)
● At training time, the conditional predictions for all timesteps can be made in parallel
because all timesteps of the ground truth x are known.
● At generation time the predictions are sequential: after each sample is predicted, it is fed back into
the network to predict the next sample
● The problem with causal convolutions is that they require many layers, or large filters, to
increase the receptive field
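The sequential generation loop looks roughly like this. The `toy_model` function is a hypothetical stand-in for a trained network: it only returns a fixed-shape distribution so that the loop can run.

```python
import random


def generate(predict_next, seed, n_samples, rng):
    """Autoregressive sampling: each new sample is appended and fed back as input."""
    samples = list(seed)
    for _ in range(n_samples):
        probs = predict_next(samples)  # distribution over the next value
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        samples.append(nxt)  # feed the prediction back in
    return samples


def toy_model(history):
    """Hypothetical stand-in for a trained network: favours repeating the last value."""
    n_values = 4
    probs = [0.1] * n_values
    probs[history[-1]] = 0.7
    return probs


waveform = generate(toy_model, seed=[0], n_samples=10, rng=random.Random(0))
print(waveform)
```

Each iteration needs the result of the previous one, so the loop cannot be parallelized over time the way training can.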
15. Dilated Convolution
● A dilated convolution is a convolution
where the filter is applied over an
area larger than its length by
skipping input values with a certain
step
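As a sketch of the definition above, the following function applies a filter whose taps are `dilation` samples apart while staying causal. The function name and the example values are assumptions for illustration.

```python
def dilated_causal_conv1d(signal, kernel, dilation):
    """Causal convolution whose taps are `dilation` samples apart.

    With dilation=1 this is an ordinary causal convolution.
    """
    pad = (len(kernel) - 1) * dilation
    padded = [0.0] * pad + list(signal)
    return [
        sum(w * padded[t + j * dilation] for j, w in enumerate(kernel))
        for t in range(len(signal))
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# A 2-tap filter with dilation 2 combines x[t-2] and x[t], skipping x[t-1].
print(dilated_causal_conv1d(signal, [1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

The filter still has only two weights, but it spans three input samples.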
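16. Dilated Convolution (cont…)
● Stacking dilated convolutions makes the receptive field grow much faster with depth than stacking ordinary convolutions, at the same cost per layer, so the model is more efficient
● In WaveNet the dilation is doubled at every layer up to a limit and then repeated, e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512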
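The effect of dilation on the receptive field is simple arithmetic: each stride-1 layer adds (filter length - 1) x dilation input samples. A short sketch, assuming filters of length 2 and ten layers:

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of a stack of
    stride-1 convolutions with the given per-layer dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


n_layers = 10
plain = receptive_field(2, [1] * n_layers)  # no dilation
doubling = receptive_field(2, [2 ** i for i in range(n_layers)])  # 1, 2, 4, ..., 512
print(plain, doubling)  # 11 1024
```

With the same ten layers, doubling the dilation at each layer covers 1024 samples instead of 11.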
17. Softmax Distributions
● Because raw audio is typically stored as a sequence of 16-bit integer values (one per
timestep), a softmax layer would need to output 65,536 probabilities per timestep to model
all possible values
● To make this more tractable, we first apply a µ-law companding transformation (ITU-T,
1988) to the data, and then quantize it to 256 possible values:
f(xt) = sign(xt) · ln(1 + µ|xt|) / ln(1 + µ), where -1 < xt < 1 and µ = 255
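The companding and quantization step can be sketched as below. The rounding scheme and the function names are assumptions; real implementations differ in how they map the companded value to the 256 bins.

```python
import math


def mu_law_encode(x, mu=255):
    """Compand x in [-1, 1] and quantize it to an integer in 0..mu.

    Uses f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu).
    """
    x = max(-1.0, min(1.0, x))
    companded = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((companded + 1) / 2 * mu + 0.5)  # map [-1, 1] to 0..mu and round


def mu_law_decode(index, mu=255):
    """Approximate inverse: map an integer in 0..mu back to [-1, 1]."""
    companded = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(companded) * math.log1p(mu)) / mu, companded)


for sample in (-1.0, -0.01, 0.0, 0.01, 1.0):
    code = mu_law_encode(sample)
    print(sample, code, round(mu_law_decode(code), 4))
```

The logarithm spends more of the 256 levels on small amplitudes, where most of the speech signal lies, than a linear quantizer would.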
18. Gated Activation Units
● WaveNet uses the same gated activation unit as the gated PixelCNN. Because audio is one-dimensional, the two separate convolution stacks (vertical and horizontal) are not needed
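In the WaveNet paper the unit is written z = tanh(W_f,k * x) ⊙ σ(W_g,k * x). The element-wise part can be sketched as follows, taking the outputs of the filter and gate convolutions as given; the input values are placeholders chosen for illustration.

```python
import math


def sigmoid(v):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate) over two equal-length sequences."""
    if len(filter_out) != len(gate_out):
        raise ValueError("filter and gate outputs must have the same length")
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]


# Placeholder outputs of the filter and gate convolutions for four timesteps.
filter_out = [0.5, -1.0, 2.0, 0.0]
gate_out = [10.0, 0.0, -10.0, 3.0]
print([round(z, 4) for z in gated_activation(filter_out, gate_out)])
```

A gate value near 1 lets the filter output through almost unchanged, while a gate value near 0 suppresses it.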
19. Conditional WaveNet
● Given an additional input h, WaveNets can model the conditional distribution p(x|h) of the audio
given this input
● We condition the model on other inputs in two different ways: global conditioning and local
conditioning
● Global conditioning is characterised by a single latent representation h that influences the output
distribution across all timesteps, e.g. a speaker embedding
● Local conditioning uses a second time series ht, e.g. linguistic features in a text-to-speech model, which is upsampled to the audio resolution when its sampling frequency is lower
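In the notation of the WaveNet paper, the two kinds of conditioning enter the gated activation unit as follows. For global conditioning, a linear projection of h is broadcast over all timesteps:

$$\mathbf{z} = \tanh(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}) \odot \sigma(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h})$$

For local conditioning, the conditioning series is first upsampled to a series y = f(h) with the same resolution as the audio, and V * y is a 1x1 convolution:

$$\mathbf{z} = \tanh(W_{f,k} * \mathbf{x} + V_{f,k} * \mathbf{y}) \odot \sigma(W_{g,k} * \mathbf{x} + V_{g,k} * \mathbf{y})$$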
20. Advantages and Limitations
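● Training is parallelizable: with no recurrent connections, the predictions for all timesteps are made at once
● Fewer parameters than a dense network, so less memory and training time
● Large receptive field from dilated convolutions
● Flexible global and local conditioning
● Limitation: generation is slow, because samples are produced one at a time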
21. Applications
● Music Generation with WaveNet
○ Enlarging the receptive field was crucial to obtain samples that sounded musical
○ Even with a receptive field of several seconds, the models did not enforce long-range consistency, which resulted
in second-to-second variations in genre, instrumentation, volume and sound quality
○ Conditional generation with genre or instrument tags works reasonably well
○ Datasets used: 1) MagnaTagATune 2) YouTube piano dataset
● Conditioning on ImageNet Classes
○ Given a one-hot encoding hi for the i-th class, we model p(x|hi)
○ The log-likelihood did not change much with conditioning
○ We observed great improvements in the visual quality of the generated samples
○ We see that the generated classes are very distinct from one another, and that the corresponding objects, animals
and backgrounds are clearly produced. Furthermore, the images of a single class are very diverse: for example, the
model was able to generate similar scenes from different angles and lighting conditions
23. Conclusion
● Gated PixelCNN is computationally more efficient than PixelRNN
● State-of-the-art performance on the ImageNet 32x32 and 64x64 datasets
● Class-conditional modelling generates realistic images corresponding to the given classes. On
human portraits the model is capable of generating new images of the same person in
different poses and lighting conditions
● High log-likelihood scores
● Possible future applications include generating new images with a certain object solely from a
single example image, and combining the model with variational autoencoders
24. References
1. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu,
Conditional Image Generation with PixelCNN Decoders. arXiv:1606.05328v2 [cs.CV] 18 Jun 2016
2. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal
Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, WaveNet: A Generative Model for Raw Audio
3. Aäron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu ,Pixel Recurrent Neural Networks
arXiv:1601.06759v3 [cs.CV] 19 Aug 2016
4. Bengio, Yoshua and Bengio, Samy. Modeling high dimensional discrete data with multi-layer neural
networks. pp. 400–406. MIT Press, 2000.
5. Agiomyrgiannakis, Yannis. Vocaine the vocoder and applications in speech synthesis. In ICASSP, pp. 4230–
4234, 2015.
25. References (cont…)
6. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on
heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
7. Marc G Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos.
Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868, 2016.
8. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014