Neural source-filter waveform model

Neural source-filter waveform model
1contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI
National Institute of Informatics, Japan
2019-02-27
SLP-L8.6 ICASSP 2019
Brighton, United Kingdom

 Introduction
 Proposed model
 Experiments
 Summary
CONTENTS
2

3
INTRODUCTION
Mel-spectrogram, F0, etc.
Task definition

4
INTRODUCTION
Task definition
…
1 2 3 4 T…
Waveform
values
Neural waveform models

5
INTRODUCTION
Naïve model
…
1 2 3 4 T…
CNN / RNN / Feedforward
Mean-square-error (MSE) / Cross-entropy (CE)
Waveform
values

6
INTRODUCTION
Naïve model
…
…
1 2 3 4 T…
Simple model
Simple criterion

7
INTRODUCTION
WaveNet
WaveRNN
SampleRNN
FFTNet
WaveGlow
FloWaveNet ClariNet
Universal neural vocoder
AR + expert knowledge
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Recent neural waveform models
Parallel WaveNet
Auto-
regressive
(AR) model
Criterion
AR Model
Criterion
Deterministic
transform
Model
Criterion
Flow-based
transform
☛ Reference in appendix
☛ Tutorial slides: https://www.slideshare.net/jyamagis/
Flow-based models

Auto-
regressive
(AR) model
Criterion
AR Model
Criterion
Deterministic
transform
Model
Criterion
Flow-based
transform
8
INTRODUCTION
WaveNet
WaveRNN
SampleRNN
FFTNet
WaveGlow
FloWaveNet ClariNet
Universal neural vocoder
AR + expert knowledge
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Recent neural waveform models
• Without AR or flow?
• Use knowledge on speech modeling?
Parallel WaveNet
• Powerful model
! Sequential generation
• Non-sequential generation
! Complicated
! Slow in training
Neural source-
filter model (NSF)
Model
Better
criterion
Source
• Spectral-domain criterion
• Source-filter architecture
☛ Reference in appendix
☛ Tutorial slides: https://www.slideshare.net/jyamagis/
Flow-based models

 Introduction
 NSF model
 Experiments
 Summary
CONTENTS
9

10
NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
…
Idea 1: spectral domain criterion
• Avoid naïve model assumption
…
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance
…
…

11
1 2 3 4 T…
…
Idea 2: sine-based source excitation
…
1 2 3 4 T
‘Filter’
Natural
waveform
‘Source’

12
Model structure
• Without AR, flow, or knowledge-distilling
Spectral features & F0
Condition module
Source module Filter module
Spectral distance
Natural
waveform
Generated
waveform
F0 infor. Spectral infor.
Generated
waveform
Gradients

13
Condition module
• Up sampling by replication
Spectral distance
Natural
waveform
Generated
waveform
Up sampling Up samplingBi-LSTM CONV
F0
Generated
waveform
Gradients

14
Source module
• FF: feedforward layer with Tanh
• ,
Filter module
Spectral distance
Natural
waveform
Generated
waveform
F0
Noise
FF
Sine
generator
harmonics
Generated
waveform
Gradients

Filter module
Spectral distance
Natural
waveform
Generated
waveform
F0
Noise
FF
Sine
generator
harmonics
15
Source module
…
Random initial phase
Sampling rate
Noise
FF
Sine
generator
Fundamental
component
Voiced:
Unvoiced: noise

Filter module
Spectral distance
Natural
waveform
Generated
waveform
F0
Noise
FF
Sine
generator
harmonics
16
Filter module

17
Filter module
• Neural filter blocks based on dilated convolution (CONV)
Spectral distance
Natural
waveform
Generated
waveform
F0
Noise
FF
Sine
generator
harmonics
Filter
block 1
Filter
block 2
Filter
block 5
…

18
Spectral distance
Natural
waveform
F0
Noise
FF
Sine
generator
harmonics
Filter
block 1
Filter
block 2
Filter
block 5
…
Filter block
Dilated
CONV
+
Tanh
Sigmoid * FF
FF
FF
+
Dilated
CONV
+
Tanh
Sigmoid * FF
FF
+
… FF+
Filter module
• Similar to ClariNet
• Ten dilated-CONV in one block
•
Generated
waveform
☛ Our WaveNet implementation: http://tonywangx.github.io/pdfs/wavenet.pdf

19
Training criterion
Spectral distance
Natural
waveform
Generated
waveform
F0
Noise
FF
Sine
generator
harmonics
Filter
block 1
Filter
block 2
Filter
block 5
…

20
Training criterion
• Please check paper and:
Natural
waveform
Generated
waveform
F0
Noise
FF
Sine
generator
harmonics
Filter
block 1
Filter
block 2
Filter
block 5
…
FFTFraming FFT Framing
iFFT De-framing
SLP-P22.8, STFT spectral loss for training a neural speech waveform model

21
Training criterion
• Different frame shifts / window lengths / FFT points
• Homogenous distances
•
iFFT De-framing
iFFT De-framing
iFFT De-framing +

 Introduction
 Proposed model
 Experiments
• Comparison with WaveNet
• Ablation test
• Analysis
 Summary
CONTENTS
22

Test 1: comparison with WaveNet
 Data and feature
 Models
23[1] Kawai, H., Toda, T., Ni, J., Tsuzaki, M., and Tokuda, K. (2004). Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184..
Corpus Size Note
ATR Ximera F009 [1] 15 hours 16kHz, Japanese, neutral style
Feature Dimension
Acoustic
Mel-generalized cepstrum coefficients (MGC) 60
F0 1
EXPERIMENTS

 Speech quality
• 245 paid evaluators, 1450 evaluation sets
o Copy-synthesis: given natural MGC/F0
o Pipeline TTS: given generated MGC/F0 from acoustic models
24
EXPERIMENTS
WORLD WaveNet
u-law
WaveNet
Gaussian
Neural
source-filter
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html

 Speech quality
• 245 paid evaluators, 1450 evaluation sets
o Copy-synthesis: given natural MGC/F0
o Pipeline TTS: given generated MGC/F0 from acoustic models
25
EXPERIMENTS
WORLD WaveNet
u-law
WaveNet
Gaussian
Neural
source-filter

 Generation speed
 Memory-saving: release and allocate GPU memory layer by layer (pp. 69-73)
 Single GPU card, same DNN toolkit
26
EXPERIMENTS
How many waveform sampling points can be generated in 1s GPU time?
0k
50k
100k
150k
200k
250k
WaveNet NSF GPU-memory-
saving mode
NSF normal mode
20 k
227 k
0.19 k

Test 2: Ablation test
 One more corpus
 Copy-synthesis
27[2] J. Kominek and A. W. Black. The cmu arctic speech databases. In Fifth ISCA workshop on speech synthesis, 2004.
Corpus Size Note
CMU-arctic SLT [2] 0.83 hours 16kHz, English
Feature Dimension
Acoustic
Mel-spectrogram 80
F0 1
EXPERIMENTS

28
EXPERIMENTS
Natural NSF
ATR Ximera
F009
CMU-arctic
SLT
Condition module
Spectral distanceNatural
waveform
Generated
waveform

Natural NSF NSF w/o sine excitation
ATR Ximera
F009
CMU-arctic
SLT
29
EXPERIMENTS
Condition module
Gaussian noise Filter module
waveform
Generated
waveform
Spectral infor.

iDFT De-framing
iDFT De-framing
iDFT De-framing +
30
EXPERIMENTS
Ls1: frame shift 5ms, window length 20ms
Natural NSF NSF only Ls1
ATR Ximera
F009
CMU-arctic
SLT

iDFT De-framing
iDFT De-framing
iDFT De-framing +
31
EXPERIMENTS
Natural NSF NSF only Ls1 NSF Ls1+Ls2
ATR Ximera
F009
CMU-arctic
SLT
Ls1: frame shift 5ms, window length 20ms
Ls2: frame shift 2.5ms, window length 5ms

Analysis
32
EXPERIMENTS
Condition module
Source module
Filter
block 1
Filter
block 2
Filter
block 3
Filter
block 4
Filter
block 5

Analysis
33
EXPERIMENTS
Condition module
Source module
Filter
block 1
Filter
block 2
Filter
block 3
Filter
block 4
Filter
block 5

Analysis
34
EXPERIMENTS
Condition module
Source module
Filter
block 1
Filter
block 2
Filter
block 3
Filter
block 4
Filter
block 5

Analysis
35
EXPERIMENTS
Condition module
Source module
Filter
block 1
Filter
block 2
Filter
block 3
Filter
block 4
Filter
block 5

Analysis
36
EXPERIMENTS
Condition module
Source module
Filter
block 1
Filter
block 2
Filter
block 3
Filter
block 4
Filter
block 5

Analysis
37
EXPERIMENTS
Condition module
Source module
Filter
block 1
Filter
block 2
Filter
block 3
Filter
block 4
Filter
block 5

 Introduction
 Proposed model
 Experiments
 Summary
CONTENTS
38

39
SUMMARY
NSF
• Condition: up sampling
• Source: sine excitation
• Filter: dilated-CONV
• Criterion: STFT
Condition module
waveform
Generated
waveform
Easy training
Fast generation
High-quality waveforms

40
SUMMARY
Recent work
 Simplified NSF: faster generation speed
 Simplified NSF -> Harmonic-plus-noise NSF
• Best quality
Dilated
conv
+
Tanh
Sigmoid * FF
FF
FF
+
Dilated
conv
+
Tanh
Sigmoid * FF
FF
+
… FF+
Dilated
conv
+FF … FF
Dilated
conv
+

Questions & Comments
are always Welcome!
41
https://nii-yamagishilab.github.io/samples-nsf/index.html
https://arxiv.org/abs/1904.12088

42
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu.
WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. Samplernn: An unconditional end-to-end
neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et.al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume
80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–
2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv
preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder covering entire audible
frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658. 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et. al.. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–
3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281,
2018.
FlowWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. Flowavenet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv preprint
arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP
(submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv
LP-WavNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. Lp-wavenet: Linear prediction-based wavenet speech synthesis. arXiv
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal
excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. Excitnet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv
LPCNet: J.-M. Valin and J. Skoglund. Lpcnet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846,
2018.

43
INTRODUCTION
Autoregressive (AR) model
1 2 3 4 T…
…
…1 2 3
• Powerful model
! Slow generation speed

44
INTRODUCTION
Normalizing flow + knowledge distilling
1 2 3 4 T…
…
…1 2 3
Teacher AR WaveNet
• Fast generation
! Complicated model

45
1 2 3 4 T…
…
Idea 1: spectral domain criterion
• Avoid naïve model assumption
• Differentiable & efficient computation
…
Generated
waveform
Natural
waveform 1 2 3 4 T
Short-time Fourier transform
Short-time Fourier transform
Spectral amplitude error
Gradients

Natural NSF
NSF w/o ‘residual’
connection in filter module
F009
CMU-arctic
E2: Ablation test
46
EXPERIMENTS
Filter block
Dilated
CONV
+
Tanh
Sigmoid * FF
FF
FF
+
Dilated
CONV
+
Tanh
Sigmoid * FF
FF
+
… FF+
https://nii-yamagishilab.github.io/samples-nsf/index.html

47
SUMMARY
Recent work
 Input F0?
• VCTK corpus, Mel-spectrogram + F0
• Samples generated from NSF, by changing F0 only
F0 0.5 * F0 2.0 * F0

48
SUMMARY
Recent work
 Input F0?
• https://arxiv.org/abs/1904.12088
• Same Mel-spec., different F0 input
NSF
F0 contour a
Natural Mel-spec.
NSF
F0 contour b
Natural Mel-spec.
NSF
F0 contour c
Natural Mel-spec.
Waveform 1
Waveform 2
Waveform 3
F0 contour 1
F0 contour 4
F0 contour 5
F0 contour from Mel-spec.
Corr. ~1.0
Corr. ~0.9
F0
model
Random
sampling

49
SUMMARY
Recent work
 Input F0?
• https://arxiv.org/abs/1904.12088
• Same Mel-spec., different F0 input
• Not easy to control F0 of generated waveforms from WaveNet
WaveNet
F0 contour a
Natural Mel-spec.
WaveNet
F0 contour b
Natural Mel-spec.
WaveNet
F0 contour c
Natural Mel-spec.
Waveform 1
Waveform 2
Waveform 3
F0 contour 1
F0 contour 4
F0 contour 5
F0 contour from Mel-spec.
Corr. ~0.97
Corr. ~0.92
F0
model
Random
sampling

Filter module (original)
• Similar to WaveNet / parallel WaveNet / ClariNet
• Necessary?
50
SIMPLIFIED & IMPROVED NSF
Neural filter module
Dilated-CONV-based
filter block 1
…Dilated-CONV-based
filter block 2
Dilated-CONV-based
filter block 5
Dilated-CONV-based filter block
Dilated
conv layer
+
Tanh
Sigmoid * FF
FF
FF
+
Dilated
conv layer
+
Tanh
Sigmoid * FF
FF
+
… FF+

Filter module (simplified)
• Without gated-activation / additional feedforward / ax+b
• Deep residual network
• Easy to interpret and optimize
51
Neural filter module
Dilated-CONV-based
filter block 1
…Dilated-CONV-based
filter block 2
Dilated-CONV-based
filter block 5
Simplified dilated-CONV-based filter block
Dilated
conv layer
+FF … FF
Dilated
conv layer
+

Comparison with original NSF
 Waveform generation speed
 Number of model parameters
 Memory-save mode: release and allocate GPU memory layer by layer
52
AR WaveNet
Original NSF Simplified NSF
Memory save Full speed Memory save Full speed
190 20,000 227,000
Number of waveform values generated in 1s GPU time
AR WaveNet Original NSF Simplified NSF
~ 2.9 M ~1.8 M
88,000 327,000
~1.07 M

Improve the simplified NSF?
• Simplified NSF trained on Mel-spectrogram, 16kHz Japanese data
• Unvoiced sound disappears 53
Filter moduleSource module
Condition module
Epoch 1
Epoch 12 Epoch 28

Improve the simplified NSF?
54
Filter moduleSource module
Condition module
Dilated
CONV
+
Dilated
CONV
+
Dilated
CONV
+
Dilated
CONV
+
Dilated
CONV
+
De-voiced した
… 金融調節を実施した。
? Separate streams for unvoiced / voiced sound

Improved NSF
• Harmonic + noise style
• How to merge?
55
Source module
(sine-based)
Condition module
Filter module 1Noise
F0
Filter module 2
Merge

Improved NSF
• How to merge?
• Summation: may not separate harmonic from noise
56
Condition module
F0
+
Source module
(sine-based)
Filter module 2

Improved NSF
• How to merge?
X Summation: may not separate harmonic from noise
X Merge in time domain: hard / soft switch
57
Condition module
Voice/unvoiced
Source module
(sine-based)
Filter module 2
F0

Improved NSF
• How to merge?
✓ Merge in frequency domain: FIR filters
Condition module
F0
+
Source module
(sine-based)
Filter module 2
58
Harmonic partNoise part
Merged
FFT bins

Improved NSF
• How to merge?
Condition module
F0
+
Source module
(sine-based)
Filter module 2
59

Improved NSF
• How to merge?
o Change cut-off frequency based on voicing condition
60
Condition module U/V
Source module
(sine-based)
Filter module 2
+
F0

61
Spectrogram of ‘noise’ component
Spectrogram of ‘harmonic’ component

HNM + FIR filters
 Deterministic filter selection
WAVEFORM MODELING
e33
e21e10
62
Learning curve

Simplified & improved NSF
• Simplified: filter uses only dilated convolution + skip-connection
• Improved: separate streams for noise and harmonic
 More speech processing ideas? e.g., multi-band, glottal excitation
+
U/V
63
SUMMARY
Source module
(sine-based)
Condition module
F0
Filter module 2Dilated
CONV
+
Dilated
CONV
+
Dilated
CONV
+
Dilated
CONV
+
Dilated
CONV
+

Natural
Simplified &
improved NSF
MGC + F0
Mel-spec
+ F0
MGC + F0
Mel-spec
+ F0
MGC + F0
Mel-spec
+ F0
Samples (16kHz waveform, natural acoustic features)
• Natural MGC / Mel-spectrogram
• Natural F0
64
SUMMARY

Natural
Simplified &
improved NSF
MGC + F0
Mel-spec
+ F0
MGC + F0
Mel-spec
+ F0
MGC + F0
Mel-spec
+ F0
Samples (16kHz waveform, text-to-speech)
• Predicted MGC / Mel-spectrogram from text
• Natural F0
65
SUMMARY

DFT
Framing/
windowing
DFT
Framing/
windowing
Generated
waveform
Natural
waveform
…
N frames
…K
DFT
bins
K-points
DFT
Frame
Length M
Padding
K-M
0
0
0
0
0
0
0
0
0
0
Framing/
windowing
Complex-value domain Real-value domain
Training criterion

X=
…
1st Frame
2nd Frame
Nth Frame
T rows
M (frame length)
…
…
Frame
shift
NM
columns
…
…
0
0
0
0
0
0
0
0
0 0 0
0
Training criterion

Training criterion
DFT
Framing/
windowing
DFT
Framing/
windowing
Generated
waveform
Natural
waveform
…
N frames
…K
DFT
bins
K-points
iDFT
Frame
Length M
De-framing/
windowing
inverse
DFT
De-framing
/windowing Gradients
Gradients w.r.t. zero-padded part
Not used in de-framing/windowing
Padding
K-M
Complex-value domain Real-value domain

Implementation issue
 WaveNet memory consumption in generation
• Only allocate GPU memory for generating 1 waveform point
• For t = 1 : T, reuse the memory space
• Check: http://tonywangx.github.io/pdfs/CURRENNT_WAVENET.pdf (pp.44-55) 69
PART 3
Condition
GPU Memory for
neural network in
theory
Number of
layers
Waveform length
Condition
CNN stacks
Memory for 1 time step
N time steps
(N = dilation size)Implementation
WaveNet

 NSF normal node
• Normal mode: allocate GPU memory for all layers and all time steps
• GPU memory consumption is large for long waveforms
70
PART 3
GPU Memory for
neural network in
theory
Number of
layers
Waveform length
NSF
Condition Condition
CNN stacks
Implementation

 NSF memory-save node
• Memory-save mode:
• Allocate GPU memory for the current layer and all previous
layers it depends on
• Release GPU memory when one layer is no longer needed
• GPU memory consumption is small
71
PART 3
Condition
GPU Memory for
neural network in
theory
Number of
layers
Waveform length
Condition
CNN stacks
NSF
Implementation

 NSF memory-save node
72
PART 3
Condition
GPU Memory for
neural network in
theory
Number of
layers
Waveform length
Condition
CNN stacks
Memory-save mode:
Function __computeGenPass_LayerByLayer_mem
Build layer-dependency graph
For layer_n = 1 : L
Allocate GPU memory and compute layer_n’s output
For layer_m = 1 : layer_n where layer_m’s output is input to layer_n
If layer_m is no longer needed by any layer, release memory of layer_m
NSF
Implementation

 Generation speed
 Memory-saving: release and allocate GPU memory layer by layer (pp. 68-71)
 Single GPU card, same DNN toolkit
 Training is also straightforward and fast
73
EXPERIMENTS
WaveNet
NSF
GPU-memory-saving
mode
Normal mode
0.19 k
How many waveform sampling points can be generated in 1s GPU time?
20 k 227 k

 National Institute of Informatics
 Yamagishi-lab
 Postdoc
 PhD (2015-2018) SOKENDAI・NII
 M.A (2012-2015) USTC
 B.A (2008-2012) UESTC
 http://tonywangx.github.io
SELF-INTRODUCTION
74

Neural source-filter waveform model

More Related Content

What's hot

Similar to Neural source-filter waveform model

More from Yamagishi Laboratory, National Institute of Informatics, Japan

Recently uploaded

Neural source-filter waveform model

Editor's Notes