1. Automatic Music Generation using WaveNet
Presenter: 何冠勳 61047017s
Date: 2022/01/11
Authors: Aäron van den Oord et al., Google DeepMind
Published as arXiv:1609.03499 [cs.SD]
4. What is WaveNet?
◉ WaveNet is a Deep Learning-based generative model for raw audio developed by Google
DeepMind.
◉ The main objective of WaveNet is to generate new samples from the original distribution of
the data. Hence, it is known as a Generative Model.
“WaveNet is like a language model from NLP.”
◉ In a language model, given a sequence of words, the model tries to predict the next word.
Similar to a language model, in WaveNet, given a sequence of samples, it tries to predict the
next sample.
(input: the time-series data {x1, …, xt-1}; output: xt)
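For reference, the paper factorizes the joint probability of a waveform x = {x1, …, xT} as a product of these per-sample conditionals:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})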
5. Contributions
◉ They show that WaveNets can generate raw speech signals with subjective naturalness
never before reported in the field of text-to-speech (TTS), as assessed by human raters.
◉ In order to deal with long-range temporal dependencies needed for raw audio generation,
they develop new architectures based on dilated causal convolutions, which exhibit very
large receptive fields.
◉ They show that when conditioned on a speaker identity, a single model can be used to
generate different voices.
◉ The same architecture shows strong results when tested on a small speech recognition
dataset, and is promising when used to generate other audio modalities such as music.
6. “WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation such as TTS, music, speech enhancement, and so on.”
8. Architecture
There are four mechanisms in WaveNet:
causal dilated convolutions, gated activation units, residual connections, and skip connections.
9. Dilated?
[Figure: a normal convolution vs. a dilated convolution]
◉ Enables the network to consider context beyond the immediate neighborhood.
◉ Preserves the input resolution throughout the network.
◉ Exponentially increasing the dilation factor results in exponential receptive field growth with depth.
10. Causal dilated convolutions
◉ Causal, i.e. the prediction p(xt+1 | x1, …, xt) does not depend on any of the future timesteps xt+1, …, xT.
◉ The filter is applied over an area larger than its length by skipping input values with a certain step.
○ Allows the network to operate on a coarser scale than a normal convolution, but more efficiently.
○ A form of “dimensionality reduction” similar to pooling.
◉ Stacked dilated convolutions enable very large receptive fields with few layers.
“The receptive field, or sensory space, is a delimited medium where some physiological stimuli
can evoke a sensory neuronal response in specific organisms.”
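As a minimal sketch of a causal dilated convolution (the layer names and shapes below are my own, not from the slides): TensorFlow's Conv1D layer with padding="causal" left-pads the input by (kernel_size - 1) * dilation, so the output at time t never sees inputs after t.

import tensorflow as tf

# A causal dilated 1-D convolution: output length equals input length,
# and no output timestep depends on future samples.
causal = tf.keras.layers.Conv1D(filters=32, kernel_size=2,
                                dilation_rate=4, padding="causal")

x = tf.random.normal([1, 1024, 32])   # (batch, time, channels)
y = causal(x)                         # shape (1, 1024, 32), strictly causal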
12. Causal dilated convolutions
◉ Implemented in TensorFlow as tf.nn.atrous_conv2d. (Name comes from “à trous” in French, i.e.
“with holes”.)
◉ Paper uses exponentially doubling dilations up to a limit, and then repeated:
1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.
◉ Each 1, 2, 4, ..., 512 block has a receptive field of size 1024, a more efficient counterpart of a 1 × 1024
convolution.
◉ Stacking these units further increases model expressivity.
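The receptive-field arithmetic is easy to check in a few lines of Python (the three-block stack below matches the schedule quoted above; kernel size 2 is assumed from the paper's figures):

# Receptive field of stacked dilated convolutions with kernel size 2:
# rf = (kernel_size - 1) * sum(dilations) + 1
dilations = [2 ** i for i in range(10)] * 3    # 1, 2, 4, ..., 512, repeated 3x
block_rf = sum(2 ** i for i in range(10)) + 1  # one 1..512 block
stack_rf = sum(dilations) + 1                  # the whole stack
print(block_rf)   # 1024, matching the slide
print(stack_rf)   # 3070 for the three stacked blocks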
13. Gated activation units
◉ Gated activation units, the same as used in the gated PixelCNN:
z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)
where ∗ is a convolution, ⊙ is element-wise multiplication, σ is the sigmoid function,
k is the layer index, f denotes the filter, g denotes the gate, and W is a learnable
convolution filter.
◉ Empirically found to outperform ReLUs.
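A minimal sketch of this gated unit in TensorFlow (the function name and channel counts are assumptions, not from the paper's code):

import tensorflow as tf

def gated_activation(x, channels, dilation):
    # z = tanh(W_f * x) ⊙ σ(W_g * x): two parallel causal convolutions,
    # one acting as the filter and one as the gate.
    filt = tf.keras.layers.Conv1D(channels, 2, dilation_rate=dilation,
                                  padding="causal")(x)
    gate = tf.keras.layers.Conv1D(channels, 2, dilation_rate=dilation,
                                  padding="causal")(x)
    return tf.tanh(filt) * tf.sigmoid(gate)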
14. Residual?
◉ In traditional neural networks, each layer feeds into the next layer. In a network with residual
blocks, each layer feeds into the next layer and directly into the layers about 2–3 hops away.
Optimizing the residual F(x) := H(x) − x, where H(x) is the desired underlying mapping,
is more tractable for deeper networks.
15. Residual & Skip connections
◉ Both residual and parameterised skip connections are used throughout the network, to speed up
convergence and enable training of much deeper models.
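Putting the pieces together, one residual block might look like the following sketch (it reuses the hypothetical gated_activation above; the 1×1 convolutions follow the paper's block diagram):

import tensorflow as tf

def residual_block(x, channels, dilation):
    z = gated_activation(x, channels, dilation)
    z = tf.keras.layers.Conv1D(channels, 1)(z)      # 1x1 convolution
    residual = x + z                                # input to the next block
    skip = tf.keras.layers.Conv1D(channels, 1)(z)   # summed across all blocks
    return residual, skip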
17. Data preprocessing
◉ Input and output are sequences of one-hot vectors, not scalars
(input: the time-series data {x1, …, xt-1}; output: xt).
◉ Original case
○ Audio data is a sequence of 16-bit integers; the range of a 16-bit integer is [-32768, 32767]
○ The dimension of the softmax layer would therefore be 65,536
◉ They used the μ-law algorithm to decrease the output size of the softmax layer.
18. Mu-law algorithm
◉ They used the μ-law companding algorithm to decrease the output size of the softmax layer:
f(xt) = sign(xt) · ln(1 + μ|xt|) / ln(1 + μ), where −1 < xt < 1 and μ = 255
◉ The pipeline:
○ Mu-law encoding: [-32768, 32767] (16-bit int) → [0, 255] (8-bit int)
○ Model generating: [0, 255] (8-bit int) → [0, 255] (8-bit int)
○ Mu-law decoding: [0, 255] (8-bit int) → [-32768, 32767] (16-bit int)
◉ The softmax vector size becomes 256 instead of 65,536.
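A sketch of μ-law companding in NumPy (the function names are mine; the formula is the paper's):

import numpy as np

MU = 255

def mulaw_encode(audio_int16):
    # 16-bit ints -> [-1, 1), compand, then quantize to 256 classes.
    x = audio_int16.astype(np.float64) / 32768.0
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * MU).astype(np.uint8)

def mulaw_decode(codes_uint8):
    # Invert the companding: 256 classes -> [-1, 1] -> 16-bit ints.
    y = 2.0 * codes_uint8.astype(np.float64) / MU - 1.0
    x = np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU
    return (x * 32767.0).astype(np.int16)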
20. Paper experiments
◉ Datasets
○ MagnaTagATune dataset — 200 hours of audio, annotated with tags describing genre,
instrumentation, tempo, volume, mood
○ YouTube piano dataset — 60 hours of solo piano music (easier to model, since
constrained to a single instrument)
◉ Enlarging the receptive field was crucial to obtain samples that sounded musical.
◉ Even then the models failed to enforce long-range consistency; nevertheless, the samples
were often harmonic and aesthetically pleasing, even when produced by unconditional models.
◉ Some clips:
22. Inference experiments
◉ As the paper does not come with official code, I instead used the source code from here.
◉ It is a TensorFlow implementation of the WaveNet generative neural network architecture
for audio generation.
◉ I used the GTZAN blues subset for experimental training; it contains 100 WAV files
categorized as blues.
◉ So far, I have made three attempts. Each attempt takes two days to train a WaveNet
model with the training steps set to 100k.
23. Inference experiments
◉ First attempt, made after fixing some path and environment bugs.
◉ The first generated audio was poor. After carefully reading the code, I found a
constraint that only 16 kHz WAV files are allowed as data, so I converted the dataset
with SoX (a sketch of the conversion follows this slide).
◉ SoX is a cross-platform command-line utility that can convert various formats of
computer audio files into other formats.
◉ After that, the second attempt gave this result:
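The exact command from the slide is not preserved; as a hypothetical reconstruction, resampling the whole dataset to 16 kHz with SoX can be scripted like this:

import glob
import subprocess

# Hypothetical paths; SoX resamples with: sox in.wav -r 16000 out.wav
for path in glob.glob("gtzan_blues/*.wav"):
    out = path.replace(".wav", ".16k.wav")
    subprocess.run(["sox", path, "-r", "16000", out], check=True)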
24. Inference experiments
◉ The second generated audio sounds intriguing, like a punching effect.
◉ From my perspective, a likely cause of this phenomenon is that each piece of training
music has a different loudness.
◉ I therefore mastered every training clip against a commonly used loudness measure,
LKFS, aiming for -10 LKFS. (YouTube uses a loudness reference of -24 LKFS.)
◉ Loudness, K-weighted, relative to full scale (LKFS) is a standard loudness measurement unit
used for audio normalization in broadcast television systems and other video and music
streaming services.
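A minimal sketch of normalizing a clip to -10 LKFS, assuming the pyloudnorm library (the slides do not name the exact tool used):

import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("clip.wav")                      # hypothetical file name
meter = pyln.Meter(rate)                              # ITU-R BS.1770 meter
loudness = meter.integrated_loudness(data)            # measured loudness in LKFS
normalized = pyln.normalize.loudness(data, loudness, -10.0)
sf.write("clip_normalized.wav", normalized, rate)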
25. Inference experiments
◉ LKFS meter libraries are available for C++ and Python.
◉ The third generated audio has better content, though still not good enough to be
called music.
◉ The reason generating a specific genre of music may not work probably lies in the
input data itself, since each clip has its own tempo, key, and instrumentation.
(Although blues already shares a common form.)
◉ To derive a better result, some form of conditioning must be added during training.
27. Conclusion
◉ WaveNet operates directly at the waveform level.
◉ WaveNet combines causal filters with dilated convolutions, enabling receptive fields to
grow exponentially with depth, which is important for modelling the long-range temporal
dependencies in audio signals.
◉ WaveNet achieves strong results on real-world problems such as TTS, where human raters
judged its audio to be more natural than prior systems.
◉ The fact that generating 16 kHz audio timestep by timestep with deep neural networks
works at all is really surprising, although inference is slow because samples must be
produced one at a time.