1. Automatic Music Generation using WaveNet
Presenter: 何冠勳 61047017s
Date: 2022/01/11
Authors: Aäron van den Oord et al., Google DeepMind
Published as arXiv:1609.03499 [cs.SD]
4. What is WaveNet?
◉ WaveNet is a Deep Learning-based generative model for raw audio developed by Google
DeepMind.
◉ The main objective of WaveNet is to generate new samples from the original distribution of
the data. Hence, it is known as a Generative Model.
“WaveNet is like a language model from NLP.”
◉ In a language model, given a sequence of words, the model tries to predict the next word.
Similar to a language model, in WaveNet, given a sequence of samples, it tries to predict the
next sample.
(input: the time-series data {x1, …, xt-1}; output: xt)
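For reference, the paper factorizes the joint probability of a waveform x = {x1, …, xT} as a product of these per-sample conditionals:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})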
5. Contributions
◉ They show that WaveNets can generate raw speech signals with subjective naturalness
never before reported in the field of text-to-speech (TTS), as assessed by human raters.
◉ In order to deal with long-range temporal dependencies needed for raw audio generation,
they develop new architectures based on dilated causal convolutions, which exhibit very
large receptive fields.
◉ They show that when conditioned on a speaker identity, a single model can be used to
generate different voices.
◉ The same architecture shows strong results when tested on a small speech recognition
dataset, and is promising when used to generate other audio modalities such as music.
6. “WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation such as TTS, music, speech enhancement, and so on.”
8. Architecture
There are four mechanisms in WaveNet:
causal dilated convolutions, gated activation units, residual connections, and skip connections.
9. Dilated?
[Figure: a normal convolution vs. a dilated convolution]
◉ Enables the network to consider context beyond the immediate neighborhood.
◉ Preserves the input resolution throughout the network.
◉ Exponentially increasing the dilation factor results in exponential receptive field growth with depth.
10. Causal dilated convolutions
◉ Causal, i.e. the prediction p(xt+1 | x1, …, xt) does not depend on any of the future timesteps xt+1, …, xT.
◉ The filter is applied over an area larger than its length by skipping input values with a certain step.
○ Allows the network to operate on a coarser scale than a normal convolution, but more efficiently.
○ A form of “dimensionality reduction” similar to pooling.
◉ Stacked dilated convolutions enable very large receptive fields with few layers.
“The receptive field, or sensory space, is a delimited medium where some physiological stimuli
can evoke a sensory neuronal response in specific organisms.”
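As a minimal sketch of a causal dilated convolution (the layer names and shapes below are my own, not from the slides): TensorFlow's Conv1D layer with padding="causal" left-pads the input by (kernel_size - 1) * dilation, so the output at time t never sees inputs after t.

import tensorflow as tf

# A causal dilated 1-D convolution: output length equals input length,
# and no output timestep depends on future samples.
causal = tf.keras.layers.Conv1D(filters=32, kernel_size=2,
                                dilation_rate=4, padding="causal")

x = tf.random.normal([1, 1024, 32])   # (batch, time, channels)
y = causal(x)                         # shape (1, 1024, 32), strictly causal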
12. Causal dilated convolutions
◉ Implemented in TensorFlow as tf.nn.atrous_conv2d. (Name comes from “à trous” in French, i.e.
“with holes”.)
◉ Paper uses exponentially doubling dilations up to a limit, and then repeated:
1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.
◉ Each 1, 2, 4, ..., 512 block has a receptive field of size 1024, a more efficient counterpart of a 1 × 1024
convolution.
◉ Stacking these units further increases model expressivity.
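The receptive-field arithmetic is easy to check in a few lines of Python (the three-block stack below matches the schedule quoted above; kernel size 2 is assumed from the paper's figures):

# Receptive field of stacked dilated convolutions with kernel size 2:
# rf = (kernel_size - 1) * sum(dilations) + 1
dilations = [2 ** i for i in range(10)] * 3    # 1, 2, 4, ..., 512, repeated 3x
block_rf = sum(2 ** i for i in range(10)) + 1  # one 1..512 block
stack_rf = sum(dilations) + 1                  # the whole stack
print(block_rf)   # 1024, matching the slide
print(stack_rf)   # 3070 for the three stacked blocks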
13. Gated activation units
◉ Gated activation units, the same as used in the gated PixelCNN:
z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)
where ∗ is a convolution, ⊙ is element-wise multiplication, σ is the sigmoid function,
k is the layer index, f denotes the filter, g denotes the gate, and W is a learnable
convolution filter.
◉ Empirically found to outperform ReLUs.
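A minimal sketch of this gated unit in TensorFlow (the function name and channel counts are assumptions, not from the paper's code):

import tensorflow as tf

def gated_activation(x, channels, dilation):
    # z = tanh(W_f * x) ⊙ σ(W_g * x): two parallel causal convolutions,
    # one acting as the filter and one as the gate.
    filt = tf.keras.layers.Conv1D(channels, 2, dilation_rate=dilation,
                                  padding="causal")(x)
    gate = tf.keras.layers.Conv1D(channels, 2, dilation_rate=dilation,
                                  padding="causal")(x)
    return tf.tanh(filt) * tf.sigmoid(gate)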
14. Residual?
◉ In traditional neural networks, each layer feeds into the next layer. In a network with residual
blocks, each layer feeds into the next layer and directly into the layers about 2–3 hops away.
Optimizing the residual F(x) := H(x) − x, where H(x) is the desired underlying mapping,
is more tractable for deeper networks.
15. Residual & Skip connections
◉ Both residual and parameterised skip connections are used throughout the network, to speed up
convergence and enable training of much deeper models.
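Putting the pieces together, one residual block might look like the following sketch (it reuses the hypothetical gated_activation above; the 1×1 convolutions follow the paper's block diagram):

import tensorflow as tf

def residual_block(x, channels, dilation):
    z = gated_activation(x, channels, dilation)
    z = tf.keras.layers.Conv1D(channels, 1)(z)      # 1x1 convolution
    residual = x + z                                # input to the next block
    skip = tf.keras.layers.Conv1D(channels, 1)(z)   # summed across all blocks
    return residual, skip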
17. Data preprocessing
◉ Input and output are sequences of one-hot vectors, not scalars
(input: the time-series data {x1, …, xt-1}; output: xt).
◉ Original case
○ Audio data is a sequence of 16-bit integers; the range of a 16-bit integer is [-32768, 32767]
○ The dimension of the softmax layer would therefore be 65,536
◉ They used the μ-law algorithm to decrease the output size of the softmax layer.
18. Mu-law algorithm
◉ They used the μ-law companding algorithm to decrease the output size of the softmax layer:
f(xt) = sign(xt) · ln(1 + μ|xt|) / ln(1 + μ), where −1 < xt < 1 and μ = 255
◉ The pipeline:
○ Mu-law encoding: [-32768, 32767] (16-bit int) → [0, 255] (8-bit int)
○ Model generating: [0, 255] (8-bit int) → [0, 255] (8-bit int)
○ Mu-law decoding: [0, 255] (8-bit int) → [-32768, 32767] (16-bit int)
◉ The softmax vector size becomes 256 instead of 65,536.
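A sketch of μ-law companding in NumPy (the function names are mine; the formula is the paper's):

import numpy as np

MU = 255

def mulaw_encode(audio_int16):
    # 16-bit ints -> [-1, 1), compand, then quantize to 256 classes.
    x = audio_int16.astype(np.float64) / 32768.0
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * MU).astype(np.uint8)

def mulaw_decode(codes_uint8):
    # Invert the companding: 256 classes -> [-1, 1] -> 16-bit ints.
    y = 2.0 * codes_uint8.astype(np.float64) / MU - 1.0
    x = np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU
    return (x * 32767.0).astype(np.int16)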
20. Paper experiments
◉ Datasets
○ MagnaTagATune dataset — 200 hours of audio, annotated with tags describing genre,
instrumentation, tempo, volume, mood
○ YouTube piano dataset — 60 hours of solo piano music (easier to model, since
constrained to a single instrument)
◉ Enlarging the receptive field was crucial to obtain samples that sounded musical.
◉ Even then the models failed to enforce long-range consistency; nevertheless, the samples
were often harmonic and aesthetically pleasing, even when produced by unconditional models.
◉ Some clips:
22. Inference experiments
◉ As the paper does not come with official code, I instead used the source code from here.
◉ It is a TensorFlow implementation of the WaveNet generative neural network architecture
for audio generation.
◉ I used the GTZAN blues subset for experimental training; it contains 100 WAV files
categorized as blues.
◉ So far, I have made three attempts. Each attempt takes two days to train a WaveNet
model with the training steps set to 100k.
23. Inference experiments
◉ First attempt, made after fixing some path and environment bugs.
◉ The first generated audio was poor. After carefully reading the code, I found a
constraint that only 16 kHz WAV files are allowed as data, so I converted the dataset
with SoX (a sketch of the conversion follows this slide).
◉ SoX is a cross-platform command-line utility that can convert various formats of
computer audio files into other formats.
◉ After that, the second attempt gave this result:
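The exact command from the slide is not preserved; as a hypothetical reconstruction, resampling the whole dataset to 16 kHz with SoX can be scripted like this:

import glob
import subprocess

# Hypothetical paths; SoX resamples with: sox in.wav -r 16000 out.wav
for path in glob.glob("gtzan_blues/*.wav"):
    out = path.replace(".wav", ".16k.wav")
    subprocess.run(["sox", path, "-r", "16000", out], check=True)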
24. Inference experiments
◉ The second generated audio sounds intriguing, like a punching effect.
◉ From my perspective, a likely cause of this phenomenon is that each piece of training
music has a different loudness.
◉ I therefore mastered every training clip against a commonly used loudness measure,
LKFS, aiming for -10 LKFS. (YouTube uses a loudness reference of -24 LKFS.)
◉ Loudness, K-weighted, relative to full scale (LKFS) is a standard loudness measurement unit
used for audio normalization in broadcast television systems and other video and music
streaming services.
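A minimal sketch of normalizing a clip to -10 LKFS, assuming the pyloudnorm library (the slides do not name the exact tool used):

import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("clip.wav")                      # hypothetical file name
meter = pyln.Meter(rate)                              # ITU-R BS.1770 meter
loudness = meter.integrated_loudness(data)            # measured loudness in LKFS
normalized = pyln.normalize.loudness(data, loudness, -10.0)
sf.write("clip_normalized.wav", normalized, rate)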
25. Inference experiments
◉ LKFS meter libraries are available for C++ and Python.
◉ The third generated audio has better content, though still not good enough to be
called music.
◉ The reason generating a specific genre of music may not work probably lies in the
input data itself, since each clip has its own tempo, key, and instrumentation.
(Although blues already shares a common form.)
◉ To derive a better result, some form of conditioning must be added during training.
27. Conclusion
◉ WaveNet operates directly at the waveform level.
◉ WaveNet combines causal filters with dilated convolutions, enabling receptive fields to
grow exponentially with depth, which is important for modelling the long-range temporal
dependencies in audio signals.
◉ WaveNet achieves strong results on real-world problems such as TTS, where human raters
judged its audio to be more natural than prior systems.
◉ The fact that generating 16 kHz audio timestep by timestep with deep neural networks
works at all is really surprising, although inference is slow because samples must be
produced one at a time.