Slides by Xin Wang
National Institute of Informatics
Copyright (c) 2018 - 2019
National Institute of Informatics
Department of Computer Science
Some rights reserved.
This work is licensed under the Creative Commons Attribution 3.0 license.
See http://creativecommons.org/ for details.
Note: Natural Japanese speech data belonging to the ATR Ximera corpus have been deleted
in this publicly available version
Neural Waveform Modeling
from our experiences in text-to-speech application
2
contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG with Shinji Takaki and Junichi Yamagishi
National Institute of Informatics, Japan
NLP lecture series, IIS
Erlangen Germany, 2019
 Postdoc, Yamagishi-lab, NII
 Research keywords:
• Text-to-speech synthesis (TTS)
1. Neural network
2. Hidden Markov model
• Speech anti-spoofing
SELF-INTRODUCTION
3
WANG Xin
Pronunciation: "one shin"
☛Research-map page: https://researchmap.jp/wangxin/?lang=english
☛Personal page: http://tonywangx.github.io
鑫王
CONTENTS
4
Introduction
Theory
Practice
Summary
• AR & flow-based models
• No AR nor flow
• WaveNet
• Neural source-filter model
• Beyond speech
• Future direction
5
Text-to-speech synthesis
http://www.hawking.org.uk/the-computer.html
https://hackaday.com/2018/05/10/googles-duplex-ai-has-conversation-indistinguishable-from-humans/
Text TTS Speech waveform
INTRODUCTION
Text TTS Speech waveform
Text-to-speech synthesis
 Statistical parametric speech synthesis 1
6
Marianna
made the
marmalade
Linguistic features Acoustic features
Front-end
(Text-analyzer)
Back-end
Waveform
generator
Acoustic
models
Text
/m/ /ɛ/ /r/ …
H* on Marianna …
(S (NP (N Marianna))
   (VP (V made)
       (NP (ART the)
           (N marmalade))))
Mel-spectrum, F0,
Band-aperiodicity, etc.
1. H. Zen, K. Tokuda, and A. W. Black. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
INTRODUCTION
Text-to-speech synthesis
 Recent TTS frameworks
7
Front-end
(Text-analyzer)
Back-end
Waveform
generator
Acoustic
models
Text
Trimmed
front-end
‘end-to-end’ TTS system
Text Waveform
module
Attention-based
acoustic model
Front-end
(Text-analyzer)
Unified back-end
Text Waveform
module
Pre-processing
A. van den Oord, et al. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Y. Wang, et al. Tacotron: Towards end-to-end speech synthesis. In Proc. Interspeech, pages 4006–4010, 2017.
J. Shen, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proc. ICASSP, pages 4779–4783, 2018.
INTRODUCTION
8
Spectral features, F0, etc.
Neural waveform modeling
INTRODUCTION
9
Neural waveform modeling
1 2 3 4 T…
Waveform
values
Neural waveform models
…
INTRODUCTION
10
Naïve neural waveform model
…
1 2 3 4 T…
Convolutional network (CNN) / Recurrent network (RNN)
Mean-square-error (MSE) / Cross-entropy (CE)
Waveform
values
INTRODUCTION
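A minimal sketch of such a naïve model (my own illustration, assuming a PyTorch-style implementation; the layer sizes and the 80-sample frame shift are assumptions, not values from the lecture):

```python
# Sketch of the naive waveform model: acoustic features (frame rate) are
# upsampled to the waveform rate and mapped to samples by a CNN, trained
# with an MSE loss.  Layer sizes are illustrative, not from the lecture.
import torch
import torch.nn as nn

class NaiveWaveModel(nn.Module):
    def __init__(self, feat_dim=81, frame_shift=80, hidden=64):
        super().__init__()
        self.frame_shift = frame_shift          # waveform samples per acoustic frame
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv1d(hidden, 1, kernel_size=1), # one waveform value per time step
        )

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        x = feats.transpose(1, 2)               # -> (batch, feat_dim, frames)
        x = torch.repeat_interleave(x, self.frame_shift, dim=2)  # frame -> sample rate
        return self.net(x).squeeze(1)           # (batch, frames * frame_shift)

model = NaiveWaveModel()
feats = torch.randn(2, 100, 81)                 # 100 frames of acoustic features
target = torch.randn(2, 100 * 80)               # corresponding waveform
loss = nn.functional.mse_loss(model(feats), target)
loss.backward()
```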
11
Naïve neural waveform model
…
…
1 2 3 4 T…
INTRODUCTION
12
INTRODUCTION
☛ Reference in appendix
☛ Tutorial slides: https://www.slideshare.net/jyamagis/
Overview of neural waveform models (all improve upon the naïve model):
• Autoregressive (AR) neural models: WaveNet, SampleRNN, WaveRNN, FFTNet, LPCNet, ExcitNet, GlotNet
• Flow-based models: Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet
• Neither AR nor flow: neural source-filter model (NSF), multi-head CNN (MCNN), GELP
This tutorial covers both their theoretical interpretation and practical issues.
CONTENTS
13
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model
• Beyond speech
• Future work
14
THEORY: AR NEURAL WAVEFORM MODEL
[Taxonomy diagram with the AR neural models highlighted: WaveNet, SampleRNN, WaveRNN, FFTNet, LPCNet, ExcitNet, GlotNet; conceptually rooted in the Jordan network]
Michael I. Jordan. Serial order: A parallel distributed processing approach. Technical Report 8604, Institute for Cognitive Science, 1986.
Overview
15
THEORY: AR NEURAL WAVEFORM MODEL
General idea
 Training: teacher forcing 1
1 2 3 4 T…
…
…1 2 3
Natural
waveform
1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
16
General idea
 Training: teacher forcing 1
1 2 3 4 T…
…
…1 2 3
1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
THEORY: AR NEURAL WAVEFORM MODEL
17
General idea
 Sequential generation
…
…1 2
1 2 3 4
3
T
…Generated
waveform
THEORY: AR NEURAL WAVEFORM MODEL
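A minimal sketch of the two modes (assumed PyTorch-style pseudo-implementation; `ar_net` is a hypothetical causal network, and the MSE loss is only a stand-in for whatever output distribution is used):

```python
# Sketch of AR training (teacher forcing) vs. sequential generation.
# `ar_net` is a hypothetical causal network mapping the previous sample
# (plus conditioning features) to a prediction of the current sample.
import torch

def teacher_forcing_loss(ar_net, wave, cond):
    # wave: (batch, T) natural waveform; cond: (batch, T, dim) conditioning
    prev = torch.nn.functional.pad(wave[:, :-1], (1, 0))   # o_{t-1}, zero at t=1
    pred = ar_net(prev, cond)                               # all steps in parallel
    return torch.nn.functional.mse_loss(pred, wave)

@torch.no_grad()
def generate(ar_net, cond):
    batch, T, _ = cond.shape
    out = torch.zeros(batch, T)
    prev = torch.zeros(batch, 1)
    for t in range(T):                                       # O(T) sequential loop
        out[:, t] = ar_net(prev, cond[:, t:t+1, :]).squeeze(1)
        prev = out[:, t:t+1]                                 # feed back own output
    return out
```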
18
 WaveNet
• Tractable probability & powerful AR dependency
• Slow sequential generation & only left-to-right dependency
 WaveRNN 1
• Batch-sampling: faster generation
• Subscale-dependency: more than left-to-right dependency
 LPCNet & GlotNet 2,3
• Classical AR + neural AR
1. N. Kalchbrenner, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80, pages 2410–2419, 2018.
2. J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In Proc. ICASSP, pages 5891–5895, 2019.
3. L. Juvela, et al. Speaker-independent raw waveform model for glottal excitation. In Proc. Interspeech, pages 2012–2016, 2018.
THEORY: AR NEURAL WAVEFORM MODEL
19
THEORY: FLOW-BASED MODELS
 Fast generation?
20
Revisit AR model
 Consider an AR model using a Gaussian distribution
1 2 3 T
1 2 3
NN
1 2 3 T
1 2 3
NN
Training
Generation
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017.
THEORY: FLOW-BASED MODELS
Or equivalently
21
Revisit AR model
 Consider an AR model using a Gaussian distribution
1 2 3 T
1 2 3
NN
1 2 3 T
1 2 3
NN
Training
Generation
 z-1 denotes time delay
 See proof of in appendix
NN
z-1
H(.)
NN
z-1
H-1(.)
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017.
THEORY: FLOW-BASED MODELS
22
Revisit AR model
 Consider an AR model using a Gaussian distribution
1 2 3 T
1 2 3
NN
1 2 3 T
1 2 3
NN
Training
Generation
NN
z-1
H(.)
NN
z-1
H-1(.)
 z-1 denotes time delay
 See proof of in appendix
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017.
Such an AR model is a flow-based model
Training:
1. Transform o1:T to n1:T
2. Maximize the likelihood of n1:T under N(nt; 0, 1)
Generation:
1. Sample nt from N(nt; 0, 1)
2. Transform nt to ot
3. Repeat from t=1 to t=T
THEORY: FLOW-BASED MODELS
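Written out (a reconstruction of the elided equations, following the cited masked autoregressive flow paper):

```latex
% The Gaussian AR model read as a normalizing flow (reconstruction of the
% slide's elided equations; notation follows the cited MAF paper).
\begin{align}
p(o_{1:T}) &= \prod_{t=1}^{T}\mathcal{N}\!\big(o_t;\ \mu_t,\ \sigma_t^2\big),
  \qquad (\mu_t,\sigma_t)=\mathrm{NN}(o_{1:t-1}) \\
\text{Training:}\qquad n_t &= H(o_t) = \frac{o_t-\mu_t}{\sigma_t},
  \qquad \log p(o_{1:T}) = \sum_{t=1}^{T}\big[\log\mathcal{N}(n_t;0,1)-\log\sigma_t\big] \\
\text{Generation:}\qquad o_t &= H^{-1}(n_t) = \mu_t + \sigma_t\,n_t,
  \qquad n_t\sim\mathcal{N}(0,1)
\end{align}
```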
23
THEORY: FLOW-BASED MODELS
From AR to Inverse AR flow-based model
 z-1 denotes time delay
NN
z-1
H(.)
Training
Generation
NN
z-1
H(.)
NN
z-1
H-1(.) NN
z-1
H-1(.)
AR flow Inverse-AR flow
24
THEORY: FLOW-BASED MODELS
From AR to Inverse AR flow-based model
 z-1 denotes time delay
NN
z-1
H(.)
Training
Generation
NN
z-1
H-1(.)
AR flow
NN
z-1
H(.)
NN
z-1
H-1(.)
Inverse-AR flow
AR flow: ✓ O(1) training, ! O(T) generation
Inverse-AR flow: ! O(T) training, ✓ O(1) generation
Knowledge distilling
Parallel WaveNet & ClariNet
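In compact form (my restatement; μ and σ are the shift and scale predicted by the NN, and the conditioning variable decides where the O(T) loop sits):

```latex
% AR flow (MAF-style) vs. inverse-AR flow (IAF-style): the variable that the
% NN is conditioned on decides which direction needs the O(T) loop.
\begin{align}
\text{AR flow (training):}\quad
  n_t &= \frac{o_t - \mu(o_{<t})}{\sigma(o_{<t})}
  \quad\text{all $t$ in parallel;}\qquad
  \text{generation } o_t = \mu(o_{<t}) + \sigma(o_{<t})\,n_t \text{ is sequential} \\
\text{Inverse-AR flow (generation):}\quad
  o_t &= \mu(n_{<t}) + \sigma(n_{<t})\,n_t
  \quad\text{all $t$ in parallel;}\qquad
  \text{computing } n_{1:T} \text{ from } o_{1:T} \text{ is sequential}
\end{align}
```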
25
 WaveGlow1 & FloWaveNet2
• Fast generation & slow training
 Parallel WaveNet3 & ClariNet4
• Knowledge-distilling is complicated
1. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, 2019.
2. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. In Proc. ICML, 2019.
3. A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
4. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proc. ICLR, 2019.
THEORY: FLOW-BASED MODELS
• Faster training & generation
• Easy to implement
26
THEORY: NEURAL SOURCE-FILTER MODEL
No AR, no flow
Neural source-filter
Model (NSF) 1
• Source-filter architecture
• Spectral-domain training criterion
1. X. Wang, et al. Neural source-filter-based waveform model for statistical parametric speech synthesis. In Proc. ICASSP, pages 5916–5920, 2019.
2. S. Ö. Arık, et al. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
MCNN2
GELP3
27
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
…
General idea
• No AR or inverse AR flow
…
1 2 3 4 T
‘Filter’
Natural
waveform
Generated
waveform
1 2 3 4 TF0/pitch ‘Source’
28
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
…
General idea
• Based on short-time Fourier transform (STFT)
…
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance …
…
1 2 3 4 TF0/pitch
…
…
1 2 3 4 TF0/pitch
29
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
Probabilistic interpretation?
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance …
…
What is the ?
30
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
Probabilistic interpretation?
• Spectral distance
1 2 3 4 T…
Framing
Framing
Spectral
distance
FFT
FFT
 , where D is frame length. where K is FFT points.
31
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
Probabilistic interpretation?
•
1 2 3 4 T…
Framing
Framing
FFT
FFT
Likelihood
over Gaussian
 For ease of explanation, this denotes the spectral power vector,
 where D is the frame length and K is the number of FFT points.
32
THEORY: NEURAL SOURCE-FILTER MODEL
Probabilistic interpretation?
•
1 2 3 4 T…
Framing
FFT
1 2 3 4 T…
Framing
FFT
Likelihood
over Gaussian
 where K is the number of FFT points
 For ease of explanation, this denotes the spectral power vector
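One way to write this out (a hedged reconstruction; the constants on the slides are elided), with Y_n(k) and Ŷ_n(k) the k-th FFT bin of the n-th frame of the natural and generated waveforms:

```latex
% Hedged reconstruction: minimizing the log-spectral-amplitude distance is
% (up to constants) maximizing a Gaussian likelihood over log spectral amplitudes.
\begin{align}
\mathcal{L}_s
  &= \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}
     \Big(\log\frac{|Y_n(k)|^2}{|\hat{Y}_n(k)|^2}\Big)^2 \\
  &= -\sum_{n=1}^{N}\sum_{k=1}^{K}
     \log\mathcal{N}\!\big(\log|Y_n(k)|^2 ;\ \log|\hat{Y}_n(k)|^2,\ 1\big) + \text{const.}
\end{align}
```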
33
Naïve model
AR model
Inverse-AR flow
NSF
THEORY IN SUMMARY
CONTENTS
34
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model ➣
• Beyond speech
• Future work
35
PRACTICE: WAVENET
WaveNet variants
 Discretized or continuous-valued waveforms
• Two practical issues:
1. How to generate waveform samples? ➣
2. How to train a Gaussian WaveNet? ➣
[Diagram: WaveNet with a softmax output over discretized waveform levels vs. a GMM/Gaussian output over continuous waveform values]
36
PRACTICE: WAVENET
WaveNet variants
 Discretized or continuous-valued waveforms
 Other variants
• WaveNet using mixture of logistic distribution 1
• WaveNet + Spline 2
• Quantization noise shaping 3, related noise shaping method 4
1. T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
2. Y. Agiomyrgiannakis. B-spline PDF: A generalization of histograms to continuous density models for generative audio networks. In Proc. ICASSP, pages 5649–5653. IEEE, 2018.
3. T. Yoshimura, et al. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE TASLP, 26(7):1173–1180, 2018.
4. K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018.
[Diagram: softmax output (discretized waveform) vs. GMM/Gaussian output (continuous-valued waveform)]
Generation strategy
 WaveNet-softmax
• Generation as a search problem
• Search space: 256^T for an 8-bit waveform of length T
37
PRACTICE: WAVENET
Generation strategy
 WaveNet-softmax
• Sub-optimal search by
o Exploitation
o Exploration
o Or mix of both
38
PRACTICE: WAVENET
Random sampling
Greedy search
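A minimal sketch of these strategies (assumed PyTorch-style; `logits` and `voiced` are hypothetical inputs, and in a real AR loop each step would be sampled before the next one is computed). The voiced/unvoiced mix anticipates the rule described a few slides later:

```python
# Sketch of generation strategies for a softmax WaveNet output.
# `logits`: (T, 256) unnormalized scores over mu-law levels at each step;
# `voiced`: (T,) boolean flags; both are hypothetical inputs for illustration.
import torch

def greedy(logits):                        # exploitation: argmax at every step
    return logits.argmax(dim=-1)

def random_sampling(logits):               # exploration: sample from the softmax
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

def mixed(logits, voiced, greedy_ratio=0.5):
    # explore in unvoiced steps; exploit in a random subset of voiced steps
    out = random_sampling(logits)
    pick = voiced & (torch.rand(len(voiced)) < greedy_ratio)
    out[pick] = logits[pick].argmax(dim=-1)
    return out
```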
Generation strategy
 WaveNet-softmax
39
[Figure: natural waveform (mu-law levels, 0-1024) and the model's output probability at each sampling point]
PRACTICE: WAVENET
PRACTICE: WAVENET
Generation method
 Experiments on WaveNet vocoder
[Figure: zoomed-in output distributions over waveform levels at individual sampling points, in unvoiced and voiced regions]
How about
1. Exploration in unvoiced steps
2. Exploitation in randomly selected
voiced steps
41
PRACTICE: WAVENET
Natural
Greedy
search
Random
sampling
Mixed
approach
42
PRACTICE: WAVENET
 Rainbow gram: https://gist.github.com/jesseengel/e223622e255bd5b8c9130407397a0494
Natural
Greedy
search
Random
sampling
Mixed
approach
Generation strategy
 WaveNet-softmax
• Exploitation & exploration
• Other strategy: temperature of softmax 1
 WaveNet-Gaussian
• Infinite search space: exact search for the best sample sequence is impossible
• Same strategy as WaveNet-softmax
43
PRACTICE: WAVENET
Greedy best?
Sampling?
1. Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
44
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
• Maximum likelihood training is risky: very large gradients
45
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
[Figure: negative log-likelihood versus training epoch]
46
Negative log-likelihood
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
• Toy experiment
o 1 utterance & network well-initialized
o Variance floor applied
47
[Figure: negative log-likelihood versus epoch during network training, together with the natural waveform and the predicted μt and σt]
48
PRACTICE: WAVENET
Difficulty of fitting the Gaussian
 Why is joint learning unstable? A toy experiment
• Use the MSE network
• Fit only one utterance
[Figure: negative log-likelihood versus epoch, with the natural waveform and the predicted μt and σt]
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
• Our two-step strategy
1. Train the blue part with
2. Train the red part only
• Gradients stay mild
1. Minimizes while keeping gradients mild
2. Gradients do not explode when
49
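A minimal sketch of one possible realization of such a two-step schedule; it is my assumption (not stated on the slide) that the "blue part" is the backbone plus mean predictor trained with the variance fixed to 1 (an MSE-like criterion), and the "red part" is the variance predictor trained afterwards with everything else frozen:

```python
# Hedged sketch of a two-step schedule for a Gaussian output layer:
# step 1 fits the mean with the variance held fixed (gradients stay mild),
# step 2 trains only the variance branch.  The split into `mean_head` and
# `logvar_head` is an assumption made for illustration; `backbone` is assumed
# to apply teacher forcing internally (conditioning on the shifted waveform).
import torch

def gaussian_nll(x, mu, logvar):
    return 0.5 * (logvar + (x - mu) ** 2 / logvar.exp()).mean()

def two_step_training(backbone, mean_head, logvar_head, loader, epochs=(10, 10)):
    # Step 1: backbone + mean head, variance fixed to 1 (logvar = 0) -> MSE-like loss
    opt1 = torch.optim.Adam(list(backbone.parameters()) + list(mean_head.parameters()))
    for _ in range(epochs[0]):
        for x, cond in loader:
            h = backbone(x, cond)
            loss = gaussian_nll(x, mean_head(h), torch.zeros_like(x))
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Step 2: train the variance branch only, everything else frozen
    opt2 = torch.optim.Adam(logvar_head.parameters())
    for _ in range(epochs[1]):
        for x, cond in loader:
            with torch.no_grad():
                h = backbone(x, cond)
                mu = mean_head(h)
            loss = gaussian_nll(x, mu, logvar_head(h))
            opt2.zero_grad(); loss.backward(); opt2.step()
```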
Training stability
 WaveNet-Gaussian
• Experiment: 5 hours data
50
PRACTICE: WAVENET
[Figure: negative log-likelihood versus epoch on the training and validation sets, comparing the naïve strategy with our strategy (step 1 followed by step 2)]
Generation strategy
Training WaveNet-Gaussian
52
PRACTICE: WAVENET
Greedy best?
Sampling?
Exploitation + exploration
Keep gradients mild
CONTENTS
53
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model
• Beyond speech
• Future work
54
PRACTICE: NSF
1 2 3 4 T…
…
General idea
• Spectral-domain training criterion
• Source-filter structure
…
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance …
…
1 2 3 4 TF0/pitch
55
PRACTICE: NSF
Common structure
• No AR or inverse AR
• No knowledge distilling
Spectral features & F0
Condition module
Source module Filter module
Frequency-domain distance
Natural
waveform
Generated
waveform
F0 infor. Spectral infor.
Generated
waveform
Gradients
56
PRACTICE: NSF
Common structure
• Condition module: input feature pre-process
Spectral features & F0
Source module Filter module
Frequency-domain distance
Natural
waveform
Generated
waveform
[Condition module diagram: Bi-LSTM (temporal smoothing), CONV (dimension change), up-sampling, and concatenation (Cat.) of the processed spectral features with the up-sampled F0]
57
PRACTICE: NSF
Common structure
• Source module: generate a sine waveform given F0
 FF: feedforward layer with Tanh
Spectral features & F0
Filter module
Frequency-domain distance
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
Generated
waveform
Gradients
Up samplingBi-LSTM CONV Cat.
F0
58
PRACTICE: NSF
Common structure
…
Random initial phase
Sampling rate
Noise
FF
Sine
generator
Fundamental
component
Voiced:
Unvoiced: noise
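A minimal numpy sketch of this source signal (the sampling rate, number of harmonics, amplitude, and noise scale are assumptions; the FF layer that merges the harmonics is omitted):

```python
# Sketch of the NSF source module: a sine (plus harmonics) with a random
# initial phase in voiced regions, Gaussian noise in unvoiced regions.
# All scaling constants below are assumptions for illustration.
import numpy as np

def source_signal(f0, sr=16000, n_harmonics=7, amp=0.1, noise_std=0.003):
    # f0: fundamental frequency per waveform sample (Hz), 0 in unvoiced regions
    T = len(f0)
    voiced = f0 > 0
    sine = np.zeros(T)
    for h in range(1, n_harmonics + 1):                    # fundamental + harmonics
        phase = 2 * np.pi * np.cumsum(h * f0 / sr)         # instantaneous phase
        sine += amp * np.sin(phase + np.random.uniform(0, 2 * np.pi))
    noise = np.random.randn(T) * noise_std
    # voiced: sine plus a little noise; unvoiced: noise only (assumed scaling amp/3)
    return np.where(voiced, sine + noise, noise / noise_std * amp / 3)
```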
59
PRACTICE: NSF
Common structure
• Error metric
Spectral features & F0
Frequency-domain distance
Natural
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
Filter module
Generated
waveform
Gradients
Compute frequency-domain distance
Compute gradients for SGD
Up samplingBi-LSTM CONV Cat.
F0
60
PRACTICE: NSF
Common structure
• Based on short-time Fourier transform
Spectral features & F0
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
FFTFraming FFT Framing
iFFT De-framing
Filter module
Up samplingBi-LSTM CONV Cat.
F0
61
PRACTICE: NSF
Common structure
• Different frame shifts / window lengths / FFT points
• Homogeneous distances, summed together
[Diagram: three framing + FFT branches with different configurations; their spectral distances are summed (+)]
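A minimal PyTorch-style sketch of this summed multi-resolution spectral distance (the three (FFT size, frame shift, frame length) settings are illustrative assumptions):

```python
# Sketch of a multi-resolution log-spectral-amplitude distance: the same
# distance is computed under several STFT configurations and summed.
import torch

def log_spec_distance(x, y, n_fft, hop, win):
    window = torch.hann_window(win)
    X = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                   window=window, return_complex=True)
    Y = torch.stft(y, n_fft, hop_length=hop, win_length=win,
                   window=window, return_complex=True)
    eps = 1e-7
    return ((torch.log(X.abs() ** 2 + eps) -
             torch.log(Y.abs() ** 2 + eps)) ** 2).mean()

def multi_resolution_loss(generated, natural,
                          configs=((512, 80, 320), (128, 40, 80), (2048, 640, 1920))):
    # configs: (FFT points, frame shift, frame length) per resolution (assumed values)
    return sum(log_spec_distance(generated, natural, *c) for c in configs)
```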
62
PRACTICE: NSF
Common structure
• Different NSF models, different neural filter modules
Spectral features & F0
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
FFTFraming FFT Framing
iFFT De-framing
Filter module
Up samplingBi-LSTM CONV Cat.
F0
Filter module
63
PRACTICE: NSF
Common structure
Spectral features & F0
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
FFTFraming FFT Framing
iFFT De-framing
Filter module
Up samplingBi-LSTM CONV Cat.
F0
Filter module
NSF models: Baseline NSF (b-NSF), Simplified NSF (s-NSF), Harmonic-plus-noise NSF (hn-NSF, ver. 1 and ver. 2)
Described in: ICASSP 2019, journal paper (submitted), SSW 2019
64
PRACTICE: NSF
Baseline and simplified NSF
• Baseline filter block follows WaveNet / ClariNet
• Baseline filter block can be simplified
Baseline
filter block 1
Baseline
filter block 2
Baseline
filter block 5
…
Simplified
filter block 1
Simplified
filter block 2
Simplified
filter block 5
…
b-NSF
s-NSF
simplify
simplify
65
PRACTICE: NSF
Baseline and simplified NSF
[Diagram: b-NSF stacks baseline filter blocks 1-5; s-NSF stacks simplified filter blocks 1-5]
[Diagram: a baseline filter block uses dilated CONV units with Tanh × Sigmoid gating (• denotes element-wise multiplication), FF layers, and residual connections, as in WaveNet; a simplified filter block keeps only dilated CONV and FF layers with residual connections]
66
PRACTICE: NSF
Baseline and simplified NSF
• Both models:
1. Strong harmonics in high-frequency bands
2. Awful unvoiced (fricative) sounds
• Model ‘overfitted’ to voiced sounds?
Baseline
filter block 1
Baseline
filter block 2
Baseline
filter block 5
…
Simplified
filter block 1
Simplified
filter block 2
Simplified
filter block 5
…
b-NSF
s-NSF
simplify
67
PRACTICE: NSF
Harmonic-plus-noise NSF
 HP, LP: high- and low-pass finite-impulse-response (FIR) filter
[Diagram: b-NSF (baseline filter blocks 1-5) is simplified into s-NSF (simplified filter blocks 1-5), which is then upgraded into hn-NSF: a harmonic branch of simplified filter blocks followed by a low-pass (LP) FIR filter, and a noise branch followed by a high-pass (HP) FIR filter, summed at the output]
68
PRACTICE: NSF
Harmonic-plus-noise NSF
Simplified
filter block 1
Simplified
filter block 2
…
Simplified
filter block 5
Simplified
filter block 5
noise
+
HP
LP
Maximum voicing frequency
(MVF)
hn-NSF
69
PRACTICE: NSF
Harmonic-plus-noise NSF
 Version I: choose MVF based on u/v
Simplified
filter block 1
Simplified
filter block 2
…
Simplified
filter block 5
Simplified
filter block 5
noise
+
HP
LP
u/v flag
For voiced sounds
For unvoiced sounds
Condition module for hn-NSF
Fixed MVFs
70
PRACTICE: NSF
Harmonic-plus-noise NSF
 Version II: predict MVF from input features
• Predict MVF from condition module (SSW paper)
• From MVF to FIR filter coefficients (SSW paper)
MVF
Simplified
filter block 1
Simplified
filter block 2
…
Simplified
filter block 5
Simplified
filter block 5
noise
+
HP
LP
Condition module for hn-NSF
sinc
Hamming
window
Gain
norm.
HP
LP
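A minimal numpy sketch of this windowed-sinc construction (the filter order and the simple DC-gain normalization are assumptions):

```python
# Sketch of building low-pass / high-pass FIR filters from a cutoff frequency
# (the maximum voicing frequency) using a Hamming-windowed sinc.
import numpy as np

def windowed_sinc_filters(cutoff_hz, sr=16000, order=10):
    n = np.arange(-order, order + 1)                   # symmetric tap indices
    fc = cutoff_hz / (sr / 2.0)                        # cutoff normalized to Nyquist
    lp = fc * np.sinc(fc * n) * np.hamming(len(n))     # windowed-sinc low-pass
    lp /= lp.sum()                                     # unity gain at DC
    hp = -lp
    hp[order] += 1.0                                   # spectral inversion -> high-pass
    return lp, hp

lp, hp = windowed_sinc_filters(cutoff_hz=4000)         # hypothetical MVF of 4 kHz
filtered = np.convolve(np.random.randn(100), lp, mode="same")  # toy usage
```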
75
PRACTICE: NSF
[Diagram: the complete hn-NSF — condition module (Bi-LSTM, CONV, up-sampling, concatenation of spectral features with F0 and MVF), source module (sine generator with harmonics, noise, and an FF layer), and neural filter module (simplified filter blocks; harmonic branch low-pass filtered, noise branch high-pass filtered, summed into the generated waveform)]
NSF is a deep-residual network
Configuration
 Data and features
 Models
85
Corpus: ATR Ximera F009 [1], 15 hours, 16 kHz, Japanese, neutral speaking style
Acoustic features: Mel-generalized cepstrum coefficients (MGC, 60 dims) or Mel-spectra (80 dims); F0 (1 dim)
PRACTICE: COMPARISON
Models compared: WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), hn-NSF (trainable MVF), WORLD vocoder
Speech quality (ICASSP)
• 245 paid evaluators, 1450 evaluation sets
86
PRACTICE: COMPARISON
Copy-synthesis
Pipeline TTS
[Figure: MOS scores for the WORLD vocoder, WaveNet-softmax, WaveNet-Gaussian, and b-NSF]
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v1.html
Speech quality (Journal paper submitted)
• >150 paid evaluators
• s-NSF did badly on unvoiced sounds
87
PRACTICE: COMPARISON
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v2.html
[Figure: MOS scores for WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), hn-NSF (trainable MVF), and the WORLD vocoder]
Speech quality (SSW 2019)
• >150 paid evaluators
88
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v3.html
PRACTICE: COMPARISON
■ Copy-synthesis
■ Pipeline TTS
[Figure: MOS scores for natural speech, the WORLD vocoder, WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), and hn-NSF (trainable MVF)]
Generation speed
 Mem-save mode: allocate and release GPU memory layer by layer
(limited by our CUDA implementation)
 Normal mode: allocate GPU memory once
89
How many waveform points can be generated in 1s (Tesla p100)?
PRACTICE: COMPARISON
[Figure: number of waveform points generated per second for WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), hn-NSF (trainable MVF), and the WORLD vocoder]
CONTENTS
90
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model
• Beyond speech
• Future work
91
SUMMARY
AR model
WaveRNNSampleRNN
FFTNet
WaveNet
LPCNetExcitNet GlotNet
Multi-head
CNN
No AR, no flow
Neural source-filter
Model (NSF)
• No explicit
Naïve
model
Inverse AR flow
FloWaveNetWaveGlow
ClariNet
Parallel
WaveNet
GELP
92
BEYOND SPEECH
(c.f. HTS Slides, by HTS Working Group)
Source module Filter module
93
BEYOND SPEECH
Music performance
 Training
• URMP dataset 1
o ground-truth F0
o 13 instruments
o solo recording
• One model for all instruments
1 University of Rochester Multi-Modal Music Performance (URMP) Dataset http://www2.ece.rochester.edu/projects/air/projects/URMP.html
[Diagram: a single neural waveform model conditioned on F0 and Mel-spectra; audio samples of natural, b-NSF, s-NSF, and hn-NSF (trainable MVF) outputs for violin, viola, oboe, trumpet, and saxophone]
BEYOND SPEECH
Music performance
 Testing with natural Mel-spectra and F0 as input
[Audio samples: natural, WaveNet, b-NSF, s-NSF, and hn-NSF (trainable MVF) outputs for horn, trombone, tuba, clarinet, and flute]
BEYOND SPEECH
Music performance
 Testing with natural Mel-spectra and F0 as input
96
FUTURE DIRECTION
(c.f. HTS Slides, by HTS Working Group)
Questions & Comments
are always Welcome!
97
https://nii-yamagishilab.github.io/samples-nsf/index.html
98
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu.
WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end
neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et.al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume
80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–
2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv
preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder covering entire audible
frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658. 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et. al.. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–
3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281,
2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint
arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP
(submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv
preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv
preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal
excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. Excitnet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv
preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846,
2018.
MCNN: S. Ö. Arık, H. Jun, and G. Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal
Processing Letters, 26(1):94–98, 2018.
GELP: L. Juvela, et al. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proc. Interspeech, 2019.
99
REFERENCE
By Lauri Juvela, Aalto University
[Diagram: training criterion — the generated and natural waveforms are framed/windowed (frame length M), zero-padded by K-M points, and transformed by a K-point DFT, giving N frames of K DFT bins; the DFT output lives in the complex-valued domain, the waveforms in the real-valued domain]
APPENDIX
Training criterion
[Diagram: the framing operation written as a sparse matrix X with T rows and NM columns; each length-M frame occupies one block of columns, offset by the frame shift, with zeros elsewhere]
APPENDIX
Training criterion
[Diagram: gradients of the frequency-domain distance are propagated back through the inverse DFT and de-framing/windowing to the generated waveform; gradients w.r.t. the zero-padded part (K-M points) are not used in de-framing/windowing]
APPENDIX
103
FLOW-BASED MODELS
Recap AR model
 Consider a WaveNet using a Gaussian distribution
1. Because , we have
1 2 3 T
1 2 3
NN
 z-1 denotes time delay
NN
z-1
H-1(.)
104
FLOW-BASED MODELS
105
FLOW-BASED MODELS
Recap AR model
 Consider a WaveNet using a Gaussian distribution
2. Because , we have
3. Therefore
 z-1 denotes time delay
Triangle-matrix,
as nt depends on o<t
106
FLOW-BASED MODELS
Recap AR model
 Consider a WaveNet using a Gaussian distribution
• So:
 z-1 denotes time delay
107
FLOW-BASED MODELS
Inverse-AR flow
1. Because , we have
 z-1 denotes time delay
NN
z-1
H-1(.)
Triangle-matrix,
as nt depends on ot
108
FLOW-BASED MODELS
Inverse-AR flow
2. Therefore
 z-1 denotes time delay
NN
z-1
H-1(.)
109
FLOW-BASED MODELS
AR flow vs inverse-AR
 z-1 denotes time delay
NN
z-1
H-1(.)NN
z-1
H-1(.)
110
FLOW-BASED MODELS
 z-1 denotes time delay
NN
z-1
H-1(.)NN
z-1
H-1(.)
AR flow
AR flow vs inverse-AR
Inverse-AR flow
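A hedged reconstruction of the elided derivations: both flows are invertible with triangular Jacobians, so the log-determinant reduces to a sum of per-step log standard deviations.

```latex
% Both flows use the change-of-variables formula; the Jacobians are triangular,
% so the log-determinant is the sum of the diagonal (per-step) terms.
\begin{align}
\text{AR flow:}\quad n_t &= \frac{o_t-\mu(o_{<t})}{\sigma(o_{<t})}
  &\Rightarrow\quad
  \log p(o_{1:T}) &= \sum_{t=1}^{T}\big[\log\mathcal{N}(n_t;0,1) - \log\sigma(o_{<t})\big] \\
\text{Inverse-AR flow:}\quad o_t &= \mu(n_{<t}) + \sigma(n_{<t})\,n_t
  &\Rightarrow\quad
  \log p(o_{1:T}) &= \sum_{t=1}^{T}\big[\log\mathcal{N}(n_t;0,1) - \log\sigma(n_{<t})\big]
\end{align}
```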
DianaGray10
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

Neural Waveform Modeling

  • 14. MCNN GELP 14 THEORY: AR NEURAL WAVEFORM MODEL Flow-based model FloWaveNet, WaveGlow No AR, nor flow Neural source-filter model (NSF) • No explicit • Spectral-domain training criterion • Source-filter architecture ClariNet Parallel WaveNet Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet Jordan network Michael I. Jordan. Serial order: A parallel distributed processing approach. Technical Report 8604, Institute for Cognitive Science, 1986. Overview
  • 15. 15 THEORY: AR NEURAL WAVEFORM MODEL General idea  Training: teacher forcing 1 1 2 3 4 T… … …1 2 3 Natural waveform 1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
  • 16. 16 General idea  Training: teacher forcing 1 1 2 3 4 T… … …1 2 3 1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989. THEORY: AR NEURAL WAVEFORM MODEL
  • 17. 17 General idea  Sequential generation … …1 2 1 2 3 4 3 T …Generated waveform THEORY: AR NEURAL WAVEFORM MODEL
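The two modes on slides 15 through 17 can be made concrete with a toy sketch. The predictor `ar_predict` and its weights `w` below are hypothetical stand-ins for WaveNet/WaveRNN, not the actual models; the point is only that training uses natural past samples (teacher forcing, so all steps can be evaluated in parallel), while generation must feed back its own outputs one step at a time.

```python
import numpy as np

def ar_predict(history, w):
    """Toy AR predictor: weighted sum of the last len(w) samples.
    A stand-in for WaveNet/WaveRNN, which condition on o_{<t}."""
    return float(np.dot(w, history[-len(w):]))

natural = np.sin(2 * np.pi * 100 * np.arange(160) / 16000)   # toy waveform
w = np.array([0.5, 0.3, 0.1])                                # toy parameters
pad = np.zeros(len(w))

# Training (teacher forcing): the input at step t is the *natural* o_{<t},
# so every step can be evaluated independently (i.e., in parallel).
padded = np.concatenate([pad, natural])
errors = [natural[t] - ar_predict(padded[:t + len(pad)], w)
          for t in range(len(natural))]
mse = float(np.mean(np.square(errors)))

# Generation: each new sample is fed back as input, hence strictly sequential.
generated = list(pad)
for t in range(len(natural)):
    generated.append(ar_predict(np.asarray(generated), w))
generated = np.asarray(generated[len(pad):])
```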
  • 18. MCNN GELP 18 Flow-based model FloWaveNet, WaveGlow No AR, nor flow Neural source-filter model (NSF) • No explicit • Spectral-domain training criterion • Source-filter architecture ClariNet Parallel WaveNet Naïve model  WaveNet • Tractable probability & powerful AR dependency • Slow sequential generation & only left-to-right dependency  WaveRNN 1 • Batch-sampling: faster generation • Subscale-dependency: more than left-to-right dependency  LPCNet & GlotNet 2,3 • Classical AR + neural AR AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet 1. N. Kalchbrenner, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80, pages 2410–2419, 10–15 Jul 2018. 2. J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In Proc. ICASSP, pages 5891–5895, 2019. 3. L. Juvela, et al. Speaker-independent raw waveform model for glottal excitation. In Proc. Interspeech 2018, pages 2012–2016, 2018. THEORY: AR NEURAL WAVEFORM MODEL
  • 19. MCNN GELP Flow-based model FloWaveNet, WaveGlow 19 THEORY: FLOW-BASED MODELS No AR, nor flow Neural source-filter model (NSF) • No explicit • Spectral-domain training criterion • Source-filter architecture ClariNet Parallel WaveNet Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet Flow-based model FloWaveNet, WaveGlow  Fast generation?
  • 20. 20 Revisit AR model  Consider an AR model using a Gaussian distribution 1 2 3 T 1 2 3 NN 1 2 3 T 1 2 3 NN Training Generation G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017. THEORY: FLOW-BASED MODELS Or equivalently
  • 21. 21 Revisit AR model  Consider an AR model using a Gaussian distribution 1 2 3 T 1 2 3 NN 1 2 3 T 1 2 3 NN Training Generation  z-1 denotes time delay  See proof of in appendix NN z-1 H(.) NN z-1 H-1(.) G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017. THEORY: FLOW-BASED MODELS
  • 22. 22 Revisit AR model  Consider an AR model using a Gaussian distribution 1 2 3 T 1 2 3 NN 1 2 3 T 1 2 3 NN Training Generation NN z-1 H(.) NN z-1 H-1(.)  z-1 denotes time delay  See proof of in appendix G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017. Such an AR model is a flow-based model Training: 1. Transform o1:T to n1:T 2. Maximizing n1:T likelihood over N(nt 0, 1) Generation: 1. Sample nt from N(nt 0, 1) 2. Transform nt to ot 3. Repeat from t=1 to t=T THEORY: FLOW-BASED MODELS
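For readers who want to see the flow view on slides 20 through 22 operationally, the sketch below uses a placeholder network `toy_nn` (an assumption, not the WaveNet architecture) with the Gaussian AR parameterization: training maps the observed samples to noise and adds the log-determinant term, while generation inverts the transform sequentially.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_nn(past):
    """Placeholder for the network mapping o_{<t} to (mu_t, sigma_t)."""
    mu = 0.3 * past[-1] if len(past) else 0.0
    return mu, np.exp(-2.0)                 # constant log sigma = -2 for the toy

o = 0.1 * rng.standard_normal(100)          # "observed" waveform (toy data)

# Training direction: transform o_1:T into n_1:T (each step sees natural o_{<t})
n = np.empty_like(o)
log_sigmas = np.empty_like(o)
for t in range(len(o)):
    mu, sigma = toy_nn(o[:t])
    n[t] = (o[t] - mu) / sigma
    log_sigmas[t] = np.log(sigma)
# change of variables: log p(o) = sum_t [ log N(n_t; 0, 1) - log sigma_t ]
log_lik = np.sum(-0.5 * n ** 2 - 0.5 * np.log(2 * np.pi) - log_sigmas)

# Generation direction: draw n_t ~ N(0, 1) and invert the transform sequentially
gen = []
for t in range(len(o)):
    mu, sigma = toy_nn(np.asarray(gen))
    gen.append(mu + sigma * rng.standard_normal())
```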
  • 23. 23 THEORY: FLOW-BASED MODELS From AR to Inverse AR flow-based model  z-1 denotes time delay NN z-1 H(.) Training Generation NN z-1 H(.) NN z-1 H-1(.) NN z-1 H-1(.) AR flow Inverse-AR flow
  • 24. 24 THEORY: FLOW-BASED MODELS From AR to Inverse AR flow-based model  z-1 denotes time delay NN z-1 H(.) Training Generation NN z-1 H-1(.) AR flow NN z-1 H(.) NN z-1 H-1(.) Inverse-AR flow ✓ O(1) ! O(T)✓ O(1) ! O(T) Knowledge distilling Parallel WaveNet & ClariNet
  • 25. 25 MCNN No AR, nor flow Neural source-filter model (NSF) Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet  WaveGlow1 & FloWaveNet2 • Fast generation & slow training  Parallel WaveNet3 & ClariNet4 • Knowledge-distilling is complicated 1. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, 2019. 2. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. In Proc. ICML, 2019. 3. A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. Proc. ICML, pages 3918–3926, 2018. 4. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. ICLR, 2019. Inverse AR flow FloWaveNet, WaveGlow ClariNet Parallel WaveNet THEORY: FLOW-BASED MODELS
  • 26. • Faster training & generation • Easy to implement Inverse AR flow FloWaveNet, WaveGlow 26 THEORY: NEURAL SOURCE-FILTER MODEL No AR, no flow Neural source-filter model (NSF) 1 • Source-filter architecture • Spectral-domain training criterion Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet ClariNet Parallel WaveNet 1. X. Wang, et al. Neural source-filter-based waveform model for statistical parametric speech synthesis. In Proc. ICASSP, pages 5916–5920, 2019. 2. S. Ö. Arık, et al. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018. MCNN2 GELP3
  • 27. 27 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… … General idea • No AR or inverse AR flow … 1 2 3 4 T ‘Filter’ Natural waveform Generated waveform 1 2 3 4 TF0/pitch ‘Source’
  • 28. 28 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… … General idea • Based on short-time Fourier transform (STFT) … Generated waveform Natural waveform 1 2 3 4 T Spectral distance … … 1 2 3 4 TF0/pitch
  • 29. … … 1 2 3 4 TF0/pitch 29 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… Probabilistic interpretation? Generated waveform Natural waveform 1 2 3 4 T Spectral distance … … What is the ?
  • 30. 30 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… Probabilistic interpretation? • Spectral distance 1 2 3 4 T… Framing Framing Spectral distance FFT FFT  , where D is frame length. where K is FFT points.
  • 31. 31 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… Probabilistic interpretation? • 1 2 3 4 T… Framing Framing FFT FFT Likelihood over Gaussian  For explanation, denotes spectral power vector  , where D is frame length. where K is FFT points.
  • 32. 32 THEORY: NEURAL SOURCE-FILTER MODEL Probabilistic interpretation? • 1 2 3 4 T… Framing FFT 1 2 3 4 T… Framing FFT Likelihood over Gaussian  , where K is FFT points  For explanation, denotes spectral power vector
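The equation images on slides 29 through 32 did not survive extraction; the LaTeX below is a hedged reconstruction of one spectral-distance form that is consistent with the surrounding text (a unit-variance Gaussian over log spectral powers), not necessarily the exact expression used in the papers.

```latex
% Hedged reconstruction; assumes a unit-variance Gaussian over log spectral powers.
\mathcal{L}_s
  = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K}
    \Big( \log |Y_n(k)|^{2} - \log |\hat{Y}_n(k)|^{2} \Big)^{2}
  = -\sum_{n=1}^{N} \sum_{k=1}^{K}
    \log \mathcal{N}\!\Big( \log |Y_n(k)|^{2};\; \log |\hat{Y}_n(k)|^{2},\; 1 \Big)
    + \text{const.}
```

Here Y_n(k) and Ŷ_n(k) are the K-point FFT coefficients of the n-th length-D frame of the natural and generated waveforms, matching the D and K mentioned on the slides.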
  • 33. 33 Naïve model AR model Inverse-AR flow NSF THEORY IN SUMMARY
  • 34. CONTENTS 34 Introduction Theory Practice Summary • AR & flow-based models • Neural source filter model • WaveNet • Neural source-filter model ➣ • Beyond speech • Future work
  • 35. 35 PRACTICE: WAVENET WaveNet variants  Discretized or continuous-valued waveforms • Two practical issues: 1. How to generate waveform samples? ➣ 2. How to train WaveNet-Gaussian? ➣ 1 2 1 2 3 4 3 1 2 1 2 3 4 3 GMM/Gaussian vs. Softmax ➣
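As background for the discretized (softmax) branch above: 8-bit mu-law companding is the standard way to turn continuous samples into 256 classes. The sketch below is generic mu-law encoding and decoding, not code from the authors' toolkit.

```python
import numpy as np

def mulaw_encode(x, mu=255, bits=8):
    """Mu-law companding + uniform quantization of x in [-1, 1] to 2**bits classes."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.clip(((y + 1.0) / 2.0 * (2 ** bits - 1) + 0.5).astype(np.int64),
                   0, 2 ** bits - 1)

def mulaw_decode(idx, mu=255, bits=8):
    """Inverse mapping: class index -> companded value -> waveform sample."""
    y = 2.0 * idx.astype(np.float64) / (2 ** bits - 1) - 1.0
    return np.sign(y) * (np.power(1.0 + mu, np.abs(y)) - 1.0) / mu

x = np.sin(2 * np.pi * np.arange(100) / 100.0)
roundtrip_error = np.max(np.abs(mulaw_decode(mulaw_encode(x)) - x))  # small
```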
  • 36. 36 PRACTICE: WAVENET WaveNet variants  Discretized or continuous-valued waveforms  Other variants • WaveNet using mixture of logistic distribution 1 • WaveNet + Spline 2 • Quantization noise shaping 3, related noise shaping method 4 1. T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. 2. Y. Agiomyrgiannakis. B-spline PDF: A generalization of histograms to continuous density models for generative audio networks. In Proc. ICASSP, pages 5649–5653. IEEE, 2018. 3. T. Yoshimura, et al. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE TASLP, 26(7):1173–1180, 2018. 4. K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018. 1 2 1 2 3 4 3 Softmax 1 2 1 2 3 4 3 GMM/Gaussian
  • 37. Generation strategy  WaveNet-softmax • Generation as a search problem • Search space: 256^T for an 8-bit waveform of length T 1 2 1 2 3 4 3 37 PRACTICE: WAVENET 1 2 3 4 … … … …
  • 38. Generation strategy  WaveNet-softmax • Sub-optimal search by o Exploitation o Exploration o Or mix of both 38 PRACTICE: WAVENET 1 2 3 4 … … … … Random sampling Greedy search
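A minimal sketch of the exploitation/exploration choice (and the softmax temperature mentioned on slide 43), assuming 256 mu-law classes; the mixed voiced/unvoiced rule at the end is the idea suggested on slide 40, with the voicing decision itself left as an external input.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_class(logits, mode="sample", temperature=1.0):
    """Pick the next mu-law class from softmax logits.
    'greedy' = exploitation (argmax); 'sample' = exploration (random draw);
    temperature < 1 sharpens the distribution before drawing."""
    if mode == "greedy":
        return int(np.argmax(logits))
    scaled = logits / temperature
    p = np.exp(scaled - np.max(scaled))
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Mixed rule in the spirit of slide 40: explore in unvoiced steps,
# exploit in (randomly selected) voiced steps; 'voiced' comes from outside.
logits = rng.standard_normal(256)   # stand-in network output for one step
voiced = True
idx = next_class(logits, mode="greedy" if voiced else "sample")
```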
  • 39. Generation strategy  WaveNet-softmax 39 PRACTICE: WAVENET [Figure: generated waveform (mu-law levels, 0–1024) and the corresponding output probability at each sampling point]
  • 40. PRACTICE: WAVENET Generation method  Experiments on WaveNet vocoder [Figure: softmax output distributions over waveform levels at example sampling points] How about 1. Exploration in unvoiced steps 2. Exploitation in randomly selected voiced steps
  • 42. 42 PRACTICE: WAVENET  Rainbow gram: https://gist.github.com/jesseengel/e223622e255bd5b8c9130407397a0494 Natural Greedy search Random sampling Mixed approach
  • 43. Generation strategy  WaveNet-softmax • Exploitation & exploration • Other strategy: temperature of softmax 1  WaveNet-Gaussian • Infinite search space: the best is impossible • Same strategy as WaveNet-softmax 43 PRACTICE: WAVENET Greedy best? Sampling? 1. Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
  • 44. 44 PRACTICE: WAVENET Training stability  WaveNet-Gaussian • Maximum likelihood training is risky: very large gradients NN 1 2 1 2 3 4 3
  • 48. 48 PRACTICE: WAVENET The problem with fitting a Gaussian  Why is joint learning unstable? Toy experiment • Use the MSE network • Fit only one utterance [Figure: negative log-likelihood versus training epoch when fitting the natural waveform, the predicted μt, and the predicted σt]
  • 49. PRACTICE: WAVENET Training stability  WaveNet-Gaussian • Our two-steps strategy 1. Train blue part with 2. Train red part only • Gradient will be mild 1. Minimizes while keep gradient mild 2. Gradient not explode when 49 NN 1 2 1 2 3 4 3
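One possible reading of the two-step schedule above, written as a sketch: step 1 trains the shared trunk and the mean branch with the variance pinned (so the gradient stays MSE-like and mild), and step 2 updates only the variance branch. The module names, sizes, and the choice to pin the variance to 1 in step 1 are assumptions for illustration, not the authors' exact recipe.

```python
import torch

# Hypothetical Gaussian output head: a shared trunk plus two branches that
# predict the mean ("blue part") and the log-variance ("red part").
trunk = torch.nn.GRU(input_size=64, hidden_size=64, batch_first=True)
mean_head = torch.nn.Linear(64, 1)
logvar_head = torch.nn.Linear(64, 1)

def gaussian_nll(target, mu, logvar):
    return 0.5 * (logvar + (target - mu) ** 2 / logvar.exp()).mean()

x = torch.randn(4, 100, 64)   # stand-in conditioning / AR input features
y = torch.randn(4, 100, 1)    # stand-in target waveform samples

# Step 1: train trunk + mean with the variance pinned to 1 (logvar = 0);
# the loss then behaves like a scaled MSE and its gradients stay mild.
opt1 = torch.optim.Adam(list(trunk.parameters()) + list(mean_head.parameters()))
h, _ = trunk(x)
loss1 = gaussian_nll(y, mean_head(h), torch.zeros_like(y))
opt1.zero_grad()
loss1.backward()
opt1.step()

# Step 2: update only the variance branch, so the 1/sigma^2 factor cannot
# send exploding gradients back into the rest of the network.
opt2 = torch.optim.Adam(logvar_head.parameters())
with torch.no_grad():
    h, _ = trunk(x)
    mu = mean_head(h)
loss2 = gaussian_nll(y, mu, logvar_head(h))
opt2.zero_grad()
loss2.backward()
opt2.step()
```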
  • 50. Training stability  WaveNet-Gaussian • Experiment: 5 hours of data 50 PRACTICE: WAVENET [Figure: negative log-likelihood versus epoch on the training and validation sets, comparing the naïve strategy with our two-step strategy (step 1 followed by step 2)]
  • 51. Training stability  WaveNet-Gaussian • Experiment: 5 hours of data 51 PRACTICE: WAVENET [Figure: the same negative log-likelihood comparison as slide 50]
  • 52. Generation strategy Training WaveNet-Gaussian 52 PRACTICE: WAVENET Greedy best? Sampling? Exploitation + exploration Keep gradients mild NN 1 2 1 2 3 4 3
  • 53. CONTENTS 53 Introduction Theory Practice Summary • AR & flow-based models • Neural source filter model • WaveNet • Neural source-filter model • Beyond speech • Future work
  • 54. 54 PRACTICE: NSF 1 2 3 4 T… … General idea • Spectral-domain training criterion • Source-filter structure … Generated waveform Natural waveform 1 2 3 4 T Spectral distance … … 1 2 3 4 TF0/pitch
  • 55. 55 PRACTICE: NSF Common structure • No AR or inverse AR • No knowledge distilling Spectral features & F0 Condition module Source module Filter module Frequency-domain distance Natural waveform Generated waveform F0 infor. Spectral infor. Generated waveform Gradients
  • 56. 56 PRACTICE: NSF Common structure • Condition module: input feature pre-process Spectral features & F0 Source module Filter module Frequency-domain distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Generated waveform Gradients Up sampling Dimension change Temporal smoothing Cat.
  • 57. 57 PRACTICE: NSF Common structure • Source module: generate a sine waveform given F0  FF: feedforward layer with Tanh Spectral features & F0 Filter module Frequency-domain distance Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics Generated waveform Gradients Up samplingBi-LSTM CONV Cat. F0
  • 58. Spectral features & F0 Filter module Frequency-domain distance Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics Generated waveform Gradients Up samplingBi-LSTM CONV Cat. F0 58 PRACTICE: NSF Common structure … Random initial phase Sampling rate Noise FF Sine generator Fundamental component Voiced: Unvoiced: noise
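A sketch of such a sine-based source signal, assuming a per-sample F0 contour as input; the number of harmonics, the amplitudes, and the noise level below are illustrative values, not the ones used in the NSF papers.

```python
import numpy as np

def sine_source(f0, fs=16000, num_harmonics=7, noise_std=0.003, seed=1):
    """Sine-based excitation from a per-sample F0 contour (already upsampled to fs):
    fundamental + harmonics with random initial phases in voiced regions (f0 > 0),
    Gaussian noise everywhere (and noise only where f0 == 0)."""
    rng = np.random.default_rng(seed)
    phase = 2.0 * np.pi * np.cumsum(f0 / fs)        # instantaneous phase
    e = np.zeros(len(f0))
    for h in range(1, num_harmonics + 1):
        e += (0.1 / h) * np.sin(h * phase + rng.uniform(0.0, 2.0 * np.pi))
    e = np.where(f0 > 0, e, 0.0) + noise_std * rng.standard_normal(len(f0))
    return e

f0 = np.concatenate([np.full(8000, 220.0), np.zeros(4000)])  # voiced then unvoiced
excitation = sine_source(f0)
```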
  • 59. 59 PRACTICE: NSF Common structure • Error metric Spectral features & F0 Frequency-domain distance Natural waveform Up sampling Noise FF Sine generator harmonics Filter module Generated waveform Gradients Compute frequency-domain distance Compute gradients for SGD Up samplingBi-LSTM CONV Cat. F0
  • 60. 60 PRACTICE: NSF Common structure • Based on short-time Fourier transform Spectral features & F0 Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics FFTFraming FFT Framing iFFT De-framing Filter module Up samplingBi-LSTM CONV Cat. F0
  • 61. 61 PRACTICE: NSF Common structure • Different frame shifts / window lengths / FFT points • Homogenous distances • FFTFraming FFT Framing iFFT De-framing FFTFraming FFT Framing iFFT De-framing FFTFraming FFT Framing iFFT De-framing +
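A minimal multi-resolution spectral distance in the spirit of this slide: the same log-spectral-amplitude distance computed under several framing/FFT configurations and summed. The Hann window and the three configurations are placeholders, not the settings used in the experiments.

```python
import numpy as np

def log_spectral_distance(x, y, frame_len, frame_shift, fft_len, eps=1e-5):
    """Distance between log spectral amplitudes of two waveforms for one
    STFT configuration (Hann window, zero-padded FFT of size fft_len)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    dist = 0.0
    for i in range(n_frames):
        seg = slice(i * frame_shift, i * frame_shift + frame_len)
        X = np.fft.rfft(x[seg] * win, n=fft_len)
        Y = np.fft.rfft(y[seg] * win, n=fft_len)
        dist += np.sum((np.log(np.abs(X) ** 2 + eps)
                        - np.log(np.abs(Y) ** 2 + eps)) ** 2)
    return dist

# Homogeneous distances at three resolutions, summed as on the slide.
configs = [(320, 80, 512), (80, 40, 128), (1920, 640, 2048)]   # placeholders
x = np.random.randn(16000)
y = np.random.randn(16000)
total_distance = sum(log_spectral_distance(x, y, *c) for c in configs)
```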
  • 62. 62 PRACTICE: NSF Common structure • Different NSF models, different neural filter modules Spectral features & F0 Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics FFTFraming FFT Framing iFFT De-framing Filter module Up samplingBi-LSTM CONV Cat. F0 Filter module
  • 63. 63 PRACTICE: NSF Common structure Spectral features & F0 Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics FFTFraming FFT Framing iFFT De-framing Filter module Up samplingBi-LSTM CONV Cat. F0 Filter module NSF models Baseline NSF (b-NSF) Simplified NSF (s-NSF) Harmonic-plus-noise NSF (hn-NSF) hn-NSF Ver.1 hn-NSF with ver.2 ICASSP 2019 Journal paper submitted SSW 2019
  • 64. 64 PRACTICE: NSF Baseline and simplified NSF • Baseline filter block follows WaveNet / ClariNet • Baseline filter block can be simplified Baseline filter block 1 Baseline filter block 2 Baseline filter block 5 … Simplified filter block 1 Simplified filter block 2 Simplified filter block 5 … b-NSF s-NSF simplify
  • 65. simplify 65 PRACTICE: NSF Baseline and simplified NSF Baseline filter block 2 Baseline filter block 5 … Simplified filter block 2 Simplified filter block 5 … b-NSF s-NSF   Element-wise multiplication Baseline filter block 1 Simplified filter block 1 Simplified filter block Dilated CONV +FF … FF Dilated CONV + Baseline filter block Dilated CONV + Tanh Sigmoid • FF FF FF + Dilated CONV + Tanh Sigmoid • FF FF + … + FF
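A sketch of a simplified (non-gated) dilated-convolution filter block in the spirit of slides 64 and 65, assuming the simplification amounts to dropping the tanh/sigmoid gating of the baseline block. The channel count, depth, the way the condition features enter, and the residual connection around the block are illustrative assumptions, not the exact s-NSF design.

```python
import torch

class SimplifiedFilterBlock(torch.nn.Module):
    """Non-gated dilated-convolution block: the tanh/sigmoid pair of the
    baseline (WaveNet/ClariNet-style) block is replaced by a single tanh."""
    def __init__(self, channels=64, num_layers=10):
        super().__init__()
        self.pre = torch.nn.Conv1d(1, channels, kernel_size=1)
        self.convs = torch.nn.ModuleList(
            torch.nn.Conv1d(channels, channels, kernel_size=3,
                            dilation=2 ** i, padding=2 ** i)
            for i in range(num_layers))
        self.post = torch.nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, signal, cond):
        # signal: (B, 1, T) output of the previous block; cond: (B, C, T) features
        h = self.pre(signal)
        for conv in self.convs:
            h = torch.tanh(conv(h) + cond)
        return signal + self.post(h)   # each block refines (adds to) its input

block = SimplifiedFilterBlock()
y = block(torch.randn(2, 1, 16000), torch.randn(2, 64, 16000))
```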
  • 66. 66 PRACTICE: NSF Baseline and simplified NSF • Both models: 1. Strong harmonics in high-frequency bands 2. Awful unvoiced (fricative) sounds • Model ‘overfitted’ to voiced sounds? Baseline filter block 1 Baseline filter block 2 Baseline filter block 5 … Simplified filter block 1 Simplified filter block 2 Simplified filter block 5 … b-NSF s-NSF simplify
  • 67. 67 PRACTICE: NSF Harmonic-plus-noise NSF  HP, LP: high- and low-pass finite-impulse-response (FIR) filter Baseline filter block 1 Baseline filter block 2 Baseline filter block 5 … Simplified filter block 1 Simplified filter block 2 Simplified filter block 5 … Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP b-NSF s-NSF hn-NSF simplify upgrade
  • 68. Baseline filter block 2 Baseline filter block 5 … Simplified filter block 2 Simplified filter block 5 … Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP b-NSF s-NSF hn-NSF simplification improvement Baseline filter block 1 Simplified filter block 1 Simplified filter block 1 68 PRACTICE: NSF Harmonic-plus-noise NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Maximum voicing frequency (MVF) hn-NSF
  • 69. 69 PRACTICE: NSF Harmonic-plus-noise NSF  Version I: choose MVF based on u/v Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag For voiced sounds For unvoiced sounds Condition module for hn-NSF Fixed MVFs
  • 70. 70 PRACTICE: NSF Harmonic-plus-noise NSF  Version II: predict MVF from input features • Predict MVF from condition module (SSW paper) • From MVF to FIR filter coefficients (SSW paper) MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF sinc Hamming window Gain norm. HP LP
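The windowed-sinc construction on this slide can be sketched directly: given a cutoff frequency such as the predicted MVF, build a truncated sinc, apply a Hamming window, and normalize the gain; the high-pass filter follows by spectral inversion. The tap count and the DC normalization below are illustrative choices, not the exact coefficients described in the SSW paper.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs=16000, half_taps=10):
    """Low-pass FIR from a cutoff (e.g. the MVF): truncated sinc * Hamming window,
    normalized to unit gain at DC. Filter length is 2 * half_taps + 1."""
    n = np.arange(-half_taps, half_taps + 1)
    h = 2.0 * cutoff_hz / fs * np.sinc(2.0 * cutoff_hz / fs * n)
    h *= np.hamming(len(n))
    return h / np.sum(h)

def highpass_fir(cutoff_hz, fs=16000, half_taps=10):
    """High-pass counterpart via spectral inversion (delta minus low-pass)."""
    h = -lowpass_fir(cutoff_hz, fs, half_taps)
    h[half_taps] += 1.0
    return h

mvf = 4000.0                 # hypothetical maximum voicing frequency in Hz
lp = lowpass_fir(mvf)        # applied to the harmonic (sine) branch
hp = highpass_fir(mvf)       # applied to the noise branch
```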
  • 71. 71 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 72. 72 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 73. 73 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 74. 74 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 75. 75 PRACTICE: NSF Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module NSF is a deep-residual network
  • 76. 76 PRACTICE: NSF Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module NSF is a deep-residual network
  • 77. 77Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 78. 78Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 79. 79Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 80. 80Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 81. 81Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 82. 82Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 83. 83Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 84. 84Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 85. Configuration  Data and features  Models 85 PRACTICE: COMPARISON. Corpus: ATR Ximera F009 [1], size: 15 hours, note: 16 kHz, Japanese, neutral style. Features (dimension): Mel-generalized cepstrum coefficients (MGC, 60) or Mel-spectra (80), plus F0 (1). Models: WaveNet softmax, WaveNet Gaussian, b-NSF, s-NSF, hn-NSF with trainable MVF, hn-NSF with fixed MVF, WORLD vocoder
  • 86. Speech quality (ICASSP) • 245 paid evaluators, 1450 evaluation sets 86 PRACTICE: COMPARISON Copy-synthesis Pipeline TTS WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder WORLD vocoder WaveNet softmax WaveNet Gaussian b-NSF ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v1.html
  • 87. Speech quality (Journal paper submitted) • >150 paid evaluators • s-NSF did badly on unvoiced sounds 87 PRACTICE: COMPARISON ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v2.html WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder
  • 88. Speech quality (SSW 2019) • >150 paid evaluators 88 ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v3.html PRACTICE: COMPARISON ■ Copy-synthesis ■ Pipeline TTS WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder WaveNet softmax hn-NSF trainable MVF hn-NSF fixed MVF Natural
  • 89. Generation speed  Mem-save mode: allocate and release GPU memory layer by layer (limited by our CUDA implementation)  Normal mode: allocate GPU memory once 89 How many waveform points can be generated in 1 s (Tesla P100)? PRACTICE: COMPARISON WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder
  • 90. CONTENTS 90 Introduction Theory Practice Summary • AR & flow-based models • Neural source filter model • WaveNet • Neural source-filter model • Beyond speech • Future work
  • 91. 91 SUMMARY AR model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet Multi-head CNN No AR, no flow Neural source-filter model (NSF) • No explicit Naïve model Inverse AR flow FloWaveNet, WaveGlow, ClariNet, Parallel WaveNet GELP
  • 92. 92 BEYOND SPEECH (c.f. HTS Slides, by HTS Working Group) Source module Filter module
  • 93. 93 BEYOND SPEECH Music performance  Training • URMP dataset1 o ground-truth F0 o 13 instruments o solo recording • One model for all instruments 1 University of Rochester Multi-Modal Music Performance (URMP) Dataset http://www2.ece.rochester.edu/projects/air/projects/URMP.html Neural waveform model F0 Mel-spectra
  • 94. Natural b-NSF S-NSF hn-NSF trainable MVF Violin Viola Oboe Trumpet Saxophone BEYOND SPEECH Music performance  Testing with natural Mel-spectra and F0 as input WaveNet
  • 95. Natural b-NSF S-NSF hn-NSF trainable MVF Horn Trombone Tuba Clarinet Flute BEYOND SPEECH Music performance  Testing with natural Mel-spectra and F0 as input
  • 96. 96 FUTURE DIRECTION (c.f. HTS Slides, by HTS Working Group)
  • 97. Questions & Comments are always Welcome! 97 https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 98. 98 REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658, 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
MCNN: S. Ö. Arık, H. Jun, and G. Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
GELP: L. Juvela, et al. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proc. Interspeech, 2019.
  • 99. 99 REFERENCE By Lauri Juvela, Aalto University
  • 101. APPENDIX Training criterion [Figure: the framing operation written as a T × NM matrix X; each block of M columns holds one frame (1st, 2nd, …, Nth frame) of length M, offset by the frame shift, with zeros elsewhere]
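A small sketch of the framing step that the matrix above represents, assuming the final frame is zero-padded; it uses index arithmetic rather than an explicit T × NM matrix multiplication.

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Cut x into overlapping frames (the rows implied by the matrix above),
    zero-padding the tail so the last frame is complete."""
    n_frames = int(np.ceil(max(len(x) - frame_len, 0) / frame_shift)) + 1
    pad_len = (n_frames - 1) * frame_shift + frame_len - len(x)
    padded = np.concatenate([x, np.zeros(pad_len)])
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return padded[idx]                     # shape: (n_frames, frame_len)

frames = frame_signal(np.arange(10.0), frame_len=4, frame_shift=2)
```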
  • 102. Training criterion DFT Framing/ windowing DFT Framing/ windowing Generated waveform Natural waveform … N frames …K DFT bins K-points iDFT Frame Length M De-framing/ windowing inverse DFT De-framing /windowing Gradients Gradients w.r.t. zero-padded part Not used in de-framing/windowing Padding K-M Complex-value domain Real-value domain APPENDIX
  • 103. 103 FLOW-BASED MODELS Recap AR model  Consider a WaveNet using a Gaussian distribution 1. Because , we have 1 2 3 T 1 2 3 NN  z-1 denotes time delay NN z-1 H-1(.)
  • 105. 105 FLOW-BASED MODELS Recap AR model  Consider a WaveNet using a Gaussian distribution 2. Because , we have 3. Therefore  z-1 denotes time delay Triangle-matrix, as nt depends on o<t
  • 106. 106 FLOW-BASED MODELS Recap AR model  Consider a WaveNet using a Gaussian distribution • So:  z-1 denotes time delay
  • 107. 107 FLOW-BASED MODELS Inverse-AR flow 1. Because , we have  z-1 denotes time delay NN z-1 H-1(.) Triangle-matrix, as nt depends on ot
  • 108. 108 FLOW-BASED MODELS Inverse-AR flow 2. Therefore  z-1 denotes time delay NN z-1 H-1(.)
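Since the equation images on slides 103 through 108 did not survive extraction, the block below is a hedged LaTeX reconstruction of the change-of-variables results they appear to state, assuming the Gaussian parameterization used earlier in the deck.

```latex
% Hedged reconstruction; assumes o_t = mu_t + sigma_t * n_t with n_t ~ N(0, 1).
\begin{align*}
\text{AR flow: }\; & n_t = \frac{o_t - \mu_t(o_{<t})}{\sigma_t(o_{<t})}, \qquad
  \frac{\partial n_{1:T}}{\partial o_{1:T}}\ \text{is triangular with}\
  \frac{\partial n_t}{\partial o_t} = \frac{1}{\sigma_t(o_{<t})}, \\
& \log p(o_{1:T}) = \sum_{t=1}^{T}
  \Big[ \log \mathcal{N}(n_t; 0, 1) - \log \sigma_t(o_{<t}) \Big]. \\
\text{Inverse-AR flow: }\; & o_t = \mu_t(n_{<t}) + \sigma_t(n_{<t})\, n_t, \qquad
  \frac{\partial o_{1:T}}{\partial n_{1:T}}\ \text{is triangular with}\
  \frac{\partial o_t}{\partial n_t} = \sigma_t(n_{<t}), \\
& \log p(o_{1:T}) = \sum_{t=1}^{T}
  \Big[ \log \mathcal{N}(n_t; 0, 1) - \log \sigma_t(n_{<t}) \Big].
\end{align*}
```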
  • 109. 109 FLOW-BASED MODELS AR flow vs inverse-AR  z-1 denotes time delay NN z-1 H-1(.)NN z-1 H-1(.)
  • 110. 110 FLOW-BASED MODELS  z-1 denotes time delay NN z-1 H-1(.)NN z-1 H-1(.) AR flow AR flow vs inverse-AR Inverse-AR flow

Editor's Notes

  1. This work is licensed under the Creative Commons Attribution 3.0 License. All slides may be reused for non-commercial purposes provided full attribution is made to the National Institute of Informatics. (See http://creativecommons.org/ for details.)
  2. two reasons
  3. /work/smg/wang/PROJ/PROJS/TSNet/WaveModel/MODEL/continuous/trial003/output_trained_network.jsn/arctic_a0118.wav
  4. For neural network, AR ->
  5. For neural waveform modelling, AR -> AR + teacher forcing -> likelihood on better PDF
  6. Fast generation? Flow-based model Why Flow-based model is fast Not explain how to implement, but the link between the Flow and AR
  7. To be modified
  8. To be modified
  9. To be modified
  10. To be modified
  11. To be modified
  12. To be modified
  13. Mention other strategies such as temperature
  14. SAVE/misc/waveform_models/wavenet_sampling_strategy/outputs
  15. SAVE/misc/waveform_models/wavenet_sampling_strategy/outputs
  16. /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage1/epoch018_0.000.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage2/epoch012_0.500.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage2/epoch012_0.750.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage2/epoch012_1.000.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/008/output_testset_mix_epoch013_mdn1.000000/ATR_Ximera_F009_AOZORAR_03372_T01.wav
  17. To be modified
  18. Many questions to ask How to design frequency-domain distance How to design source module How to design condition module / what input features should be used
  19. Meaning of this equation (frequency modulation)
  20. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  21. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  22. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  23. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  24. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  25. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  26. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  27. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  28. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  29. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  30. Gaussian WaveNet: continuous-valued waveforms NSF: continuous-valued waveforms
  31. Gaussian WaveNet: continuous-valued waveforms NSF: continuous-valued waveforms
  32. Gaussian WaveNet: continuous-valued waveforms NSF: continuous-valued waveforms
  33. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/samples_presentation
  34. Training: efficient, not slow
  35. Training: efficient, not slow
  36. Training: efficient, not slow
  37. /work/smg/wang/PROJ/PROJS/NSF-Extented/URMP/project-CURRENNT-scripts/waveform-modeling/tmp_samples
  38. /work/smg/wang/PROJ/PROJS/NSF-Extented/URMP/project-CURRENNT-scripts/waveform-modeling/tmp_samples
  39. Training: efficient, not slow
  40. WaveNet 101 Just play one row