How to Understand and Implement WaveNet
Introduction
-WaveNet: a deep generative model of audio data that operates directly at the waveform level
Contributions
Method
Causal convolutions
Dilated causal convolutions
Softmax distribution
Implementation
-Keras
2. Introduction
• Raw audio generation
• WaveNet: very high temporal resolution (16,000 samples per second)
3. Contributions
• Generate raw speech signals
• New architectures based on dilated causal convolutions
• A single model can be used to generate different voices, conditioned on a speaker identity
4. Overview of WaveNet
WaveNet: a deep generative model of audio data that operates directly at the waveform level
Dilated convolution
• exponentially increases the receptive field (see the sketch below)
• models long-range temporal dependencies
The model can be conditioned in a global or local way
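To make the exponential growth concrete, a small sketch (the receptive_field helper is hypothetical, not from the paper): with filter width 2 and dilations doubling each layer, ten layers already cover 1024 samples.

```python
# Hypothetical helper: receptive field of a stack of dilated convolutions.
def receptive_field(dilations, kernel_size=2):
    # Each layer adds (kernel_size - 1) * dilation samples of context.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One WaveNet-style block with dilations 1, 2, 4, ..., 512:
print(receptive_field([2 ** i for i in range(10)]))  # -> 1024 samples
```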
7. Softmax distribution
• Raw audio: one 16-bit integer value per time step
• A softmax layer would need to output 65,536 probabilities per timestep
• μ-law companding transformation
• Quantize to 256 possible values (sketch below)
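A minimal NumPy sketch of this step, assuming waveforms normalized to [-1, 1]; the function names are illustrative, and μ = 255 yields the 256 values mentioned above.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compand a waveform in [-1, 1] with mu-law, then quantize to mu+1 bins."""
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] onto integer bins {0, ..., mu} (256 values for mu = 255)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=255):
    """Invert quantization and companding (lossy: quantization discards detail)."""
    companded = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu
```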
14. Joint probability
• Waveform: x = {x_1, ..., x_T} (joint probability written out below)
• The conditional probability distribution is modelled by a stack of convolutional layers (similarly to PixelCNN)
• No pooling layers
• Dimensionality of input = dimensionality of output
• Output: a softmax layer over the next sample
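Written out, the factorization the slide refers to is:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})
```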
15. Dilation layer
Background
The key application the dilated-convolution authors had in mind is dense prediction: vision tasks where the predicted output has a size and structure similar to the input image. For example, semantic segmentation with one label per pixel; image super-resolution, denoising, demosaicing, bottom-up saliency, keypoint detection, etc.
16. Dilation layer
In many such applications one wants to integrate information from different spatial scales and balance two properties:
1. local, pixel-level accuracy, such as precise detection of edges, and
2. integration of knowledge of the wider, global context.
To address this problem, people often use some kind of multi-scale convolutional neural network, which typically relies on spatial pooling. The authors instead propose using layers of dilated convolutions, which address the multi-scale problem efficiently without increasing the number of parameters too much.
17. Dilation layer
In the visual system, receptive fields are volumes in visual space (receptive field = center + surround).
dilated conv = atrous conv (from the French "à trous", meaning "with holes")
The convolution is computed using only the pixels marked in red, so the receptive field can be enlarged without any loss of resolution.
It is called atrous convolution because, within the full receptive field, filter coefficients exist only at the red-dot positions; all remaining positions are filled with zeros.
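Since the outline lists Keras for the implementation, here is a minimal sketch of how such dilated convolutions compose into a causal stack for audio. The layer widths and dilation schedule are illustrative, and the paper's gated activations and residual/skip connections are omitted.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dilated_causal_stack(channels=32, dilations=(1, 2, 4, 8, 16, 32)):
    # Variable-length mono waveform: shape (timesteps, 1)
    inputs = keras.Input(shape=(None, 1))
    x = inputs
    for d in dilations:
        # padding='causal' left-pads the input so no output depends on
        # future samples; dilation_rate doubles the context per layer.
        x = layers.Conv1D(channels, kernel_size=2, dilation_rate=d,
                          padding='causal', activation='relu')(x)
    # Per-timestep softmax over the 256 mu-law amplitude bins.
    outputs = layers.Conv1D(256, kernel_size=1, activation='softmax')(x)
    return keras.Model(inputs, outputs)
```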
24. EXAMPLE – MUSIC
• MagnaTagATune dataset: 200 hours; each 29-second clip is annotated with tags (genre, instrumentation, tempo, volume and mood of the music)
• YouTube piano dataset: 60 hours of solo piano music
• Enlarging the receptive field was crucial to obtain samples that sounded musical
• Conditional music models: generate music given a set of tags specifying e.g. genre or instruments
25. EXAMPLE – Multi-speaker speech generation
• English multi-speaker corpus from the CSTR voice cloning toolkit (VCTK): 44 hours from 109 different speakers
• Not conditioned on text
• Generates non-existent but human-language-like words in a smooth way, with realistic-sounding intonations
• Lack of long-range coherence
- due to the limited receptive field size (about 300 ms)
• A single model is powerful enough to capture the characteristics of all 109 speakers
26. EXAMPLE – Text-To-Speech
• Google's TTS dataset (English: 24.6 h, Mandarin: 34.8 h)
• Locally conditioned on linguistic features derived from the input texts
• Evaluation
- subjective paired comparison tests: listeners choose the sample they prefer
- mean opinion score (MOS): 1: bad, 2: poor, 3: fair, 4: good, 5: excellent
29. Conditional WaveNets (cont.)
• Global conditioning
- h: a single representation influencing the output distribution across all timesteps
• Local conditioning
- a second timeseries with a lower sampling frequency than the raw data
- upsampled to the audio rate with a transposed convolution
30. Conditional WaveNets (cont.)
+ Global conditioning is characterized by a single latent representation h that influences the output distribution across all timesteps.
+ For local conditioning, we have a second timeseries h(t), possibly with a lower sampling frequency (formula below).
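For reference, the paper's gated activation in layer k with global conditioning on h, where * is convolution, ⊙ is elementwise multiplication, and V_{f,k}, V_{g,k} are learned projections; for local conditioning, the V^T h terms are replaced by 1×1 convolutions over the transposed-convolution upsampling of h(t):

```latex
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}\right)
       \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h}\right)
```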