How to Understand and Implement WaveNet
Introduction
-WaveNet: a deep generative model of audio data that operates directly at the waveform level
Contributions
Method
Causal convolutions
Dilated causal convolutions
Softmax distribution
Implementation
-Keras
2. Introduction
• Raw audio generation
• WaveNet: very high temporal resolution (16,000 samples per second)
3. Contributions
• Generate raw speech signals
• New architectures based on dilated causal convolutions
• A single model can be used to generate different voices, conditioned on a speaker identity
4. Overview of WaveNet
WaveNet: a deep generative model of audio data that operates directly at the waveform level
Dilated convolution
• exponentially increases the receptive field (see the sketch below)
• models long-range temporal dependencies
The model can be conditioned in a global or local way
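To make the exponential growth concrete, a small sketch (the receptive_field helper is hypothetical, not from the paper): with filter width 2 and dilations doubling each layer, ten layers already cover 1024 samples.

```python
# Hypothetical helper: receptive field of a stack of dilated convolutions.
def receptive_field(dilations, kernel_size=2):
    # Each layer adds (kernel_size - 1) * dilation samples of context.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One WaveNet-style block with dilations 1, 2, 4, ..., 512:
print(receptive_field([2 ** i for i in range(10)]))  # -> 1024 samples
```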
7. Softmax distribution
• Raw audio: one 16-bit integer value per time step
• A softmax layer would need to output 65,536 probabilities per timestep
• μ-law companding transformation
• Quantize to 256 possible values (sketch below)
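A minimal NumPy sketch of this step, assuming waveforms normalized to [-1, 1]; the function names are illustrative, and μ = 255 yields the 256 values mentioned above.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compand a waveform in [-1, 1] with mu-law, then quantize to mu+1 bins."""
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] onto integer bins {0, ..., mu} (256 values for mu = 255)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, mu=255):
    """Invert quantization and companding (lossy: quantization discards detail)."""
    companded = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu
```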
14. Joint probability
• Waveform: x = {x_1, ..., x_T} (joint probability written out below)
• The conditional probability distribution is modelled by a stack of convolutional layers (similarly to PixelCNN)
• No pooling layers
• Dimensionality of input = dimensionality of output
• Output: a softmax layer over the next sample
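Written out, the factorization the slide refers to is:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})
```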
15. Dilation layer
Background
The key application the dilated-convolution authors had in mind is dense prediction: vision tasks where the predicted output has a size and structure similar to the input image. For example, semantic segmentation with one label per pixel; image super-resolution, denoising, demosaicing, bottom-up saliency, keypoint detection, etc.
16. Dilation layer
In many such applications one wants to integrate information from different spatial scales and balance two properties:
1. local, pixel-level accuracy, such as precise detection of edges, and
2. integration of knowledge of the wider, global context.
To address this problem, people often use some kind of multi-scale convolutional neural network, which typically relies on spatial pooling. The authors instead propose using layers of dilated convolutions, which address the multi-scale problem efficiently without increasing the number of parameters too much.
17. Dilation layer
In the visual system, receptive fields are volumes in visual space (receptive field = center + surround).
dilated conv = atrous conv (from the French "à trous", meaning "with holes")
The convolution is computed using only the pixels marked in red, so the receptive field can be enlarged without any loss of resolution.
It is called atrous convolution because, within the full receptive field, filter coefficients exist only at the red-dot positions; all remaining positions are filled with zeros.
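Since the outline lists Keras for the implementation, here is a minimal sketch of how such dilated convolutions compose into a causal stack for audio. The layer widths and dilation schedule are illustrative, and the paper's gated activations and residual/skip connections are omitted.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dilated_causal_stack(channels=32, dilations=(1, 2, 4, 8, 16, 32)):
    # Variable-length mono waveform: shape (timesteps, 1)
    inputs = keras.Input(shape=(None, 1))
    x = inputs
    for d in dilations:
        # padding='causal' left-pads the input so no output depends on
        # future samples; dilation_rate doubles the context per layer.
        x = layers.Conv1D(channels, kernel_size=2, dilation_rate=d,
                          padding='causal', activation='relu')(x)
    # Per-timestep softmax over the 256 mu-law amplitude bins.
    outputs = layers.Conv1D(256, kernel_size=1, activation='softmax')(x)
    return keras.Model(inputs, outputs)
```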
24. EXAMPLE – MUSIC
• MagnaTagATune dataset: 200 hours; each 29-second clip is annotated with tags (genre, instrumentation, tempo, volume and mood of the music)
• YouTube piano dataset: 60 hours of solo piano music
• Enlarging the receptive field was crucial to obtain samples that sounded musical
• Conditional music models: generate music given a set of tags specifying e.g. genre or instruments
25. EXAMPLE – Multi-speaker speech generation
• English multi-speaker corpus from the CSTR voice cloning toolkit (VCTK): 44 hours from 109 different speakers
• Not conditioned on text
• Generates non-existent but human-language-like words in a smooth way, with realistic-sounding intonations
• Lack of long-range coherence
- due to the limited receptive field size (about 300 ms)
• A single model is powerful enough to capture the characteristics of all 109 speakers
26. EXAMPLE – Text-To-Speech
• Google's TTS dataset (English: 24.6 h, Mandarin: 34.8 h)
• Locally conditioned on linguistic features derived from the input texts
• Evaluation
- subjective paired comparison tests: listeners choose the sample they prefer
- mean opinion score (MOS): 1: bad, 2: poor, 3: fair, 4: good, 5: excellent
29. Conditional WaveNets (cont.)
• Global conditioning
- h: a single representation influencing the output distribution across all timesteps
• Local conditioning
- a second timeseries with a lower sampling frequency than the raw data
- upsampled to the audio rate with a transposed convolution
30. Conditional WaveNets (cont.)
+ Global conditioning is characterized by a single latent representation h that influences the output distribution across all timesteps.
+ For local conditioning, we have a second timeseries h(t), possibly with a lower sampling frequency (formula below).
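For reference, the paper's gated activation in layer k with global conditioning on h, where * is convolution, ⊙ is elementwise multiplication, and V_{f,k}, V_{g,k} are learned projections; for local conditioning, the V^T h terms are replaced by 1×1 convolutions over the transposed-convolution upsampling of h(t):

```latex
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}\right)
       \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h}\right)
```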