Neural source-filter waveform model
Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI
National Institute of Informatics, Japan
SLP-L8.6, ICASSP 2019, Brighton, United Kingdom, 2019-02-27
Contact: wangxin@nii.ac.jp. We welcome critical comments, suggestions, and discussion.
CONTENTS
• Introduction
• Proposed model
• Experiments
• Summary
INTRODUCTION
Task definition
[Figure: neural waveform models convert input acoustic features (Mel-spectrogram, F0, etc.) into a sequence of waveform values over time steps 1 ... T]
INTRODUCTION
Naïve model
[Figure: a CNN / RNN / feedforward network maps the input features directly to waveform values 1 ... T, trained with a mean-square-error (MSE) or cross-entropy (CE) criterion]
• Simple model
• Simple criterion
INTRODUCTION
Recent neural waveform models
• Autoregressive (AR) models (AR model + criterion): WaveNet, WaveRNN, SampleRNN, FFTNet, universal neural vocoder
  • Powerful models
  ! Sequential generation
• AR + expert knowledge: LPCNet, LP-WaveNet, ExcitNet, GlotNet
• Flow-based models (AR teacher + deterministic transform, or flow-based transform + criterion): Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet
  • Non-sequential generation
  ! Complicated
  ! Slow in training
• Without AR or flow? Use knowledge on speech modeling?
  -> Neural source-filter model (NSF): a source-filter architecture trained with a spectral-domain criterion
☛ References in appendix
☛ Tutorial slides: https://www.slideshare.net/jyamagis/
CONTENTS
• Introduction
• NSF model
• Experiments
• Summary
NEURAL SOURCE-FILTER MODEL
Idea 1: spectral-domain criterion
• Avoid the naïve model assumption
[Figure: the generated waveform is compared with the natural waveform through a spectral distance over time steps 1 ... T]
NEURAL SOURCE-FILTER MODEL
Idea 2: sine-based source excitation
[Figure: a sine-based 'source' signal over time steps 1 ... T is transformed by a 'filter' into the natural waveform]
NEURAL SOURCE-FILTER MODEL
Model structure
• Without AR, flow, or knowledge distilling
[Figure: spectral features & F0 enter the condition module; F0 information goes to the source module and spectral information to the filter module; the generated waveform is compared with the natural one through the spectral distance, whose gradients train the model]
NEURAL SOURCE-FILTER MODEL
Condition module
• Upsampling by replication
[Figure: F0 is upsampled directly for the source module; the spectral features pass through a Bi-LSTM and a CONV layer before upsampling for the filter module]
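To make "upsampling by replication" concrete, here is a minimal pure-Python sketch (the function name and feature values are illustrative, not the toolkit's API):

```python
def upsample_by_replication(frame_feats, frame_shift):
    """Repeat each frame-level feature vector `frame_shift` times so that
    frame-rate conditions align with the waveform sampling rate."""
    out = []
    for feat in frame_feats:
        out.extend([feat] * frame_shift)
    return out

# 3 frames of 2-dim features, frame shift of 80 samples (5 ms at 16 kHz)
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
upsampled = upsample_by_replication(feats, 80)
# len(upsampled) == 240; samples 0-79 all carry the first frame's features
```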
NEURAL SOURCE-FILTER MODEL
Source module
• Sine generator: from the upsampled F0, generate the fundamental sine component and its harmonics, with a random initial phase; the sampling rate sets the digital frequency
• Voiced segments: sine waveform plus noise; unvoiced segments: noise only
• FF: a feedforward layer with Tanh that merges the harmonics into the final excitation
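A toy sketch of the sine-based excitation for the fundamental component (the amplitude and noise levels here are illustrative assumptions, not the paper's settings):

```python
import math
import random

def sine_excitation(f0, fs=16000, amp=0.1, noise_std=0.003):
    """Generate the fundamental excitation from per-sample F0 values (Hz).

    Voiced samples (f0 > 0): a sine with accumulated instantaneous phase and a
    random initial phase, plus small additive noise.
    Unvoiced samples (f0 == 0): noise only.
    """
    phase = random.uniform(0.0, 2.0 * math.pi)  # random initial phase
    out = []
    for f in f0:
        phase += 2.0 * math.pi * f / fs         # instantaneous phase
        if f > 0:
            out.append(amp * math.sin(phase) + random.gauss(0.0, noise_std))
        else:
            out.append(random.gauss(0.0, noise_std))
    return out

# 100 unvoiced samples followed by 400 voiced samples at 220 Hz
excitation = sine_excitation([0.0] * 100 + [220.0] * 400)
```

Harmonic components would be generated the same way at multiples of F0 and merged by the FF layer.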
NEURAL SOURCE-FILTER MODEL
Filter module
• Neural filter blocks based on dilated convolution (CONV): five blocks in sequence transform the source excitation into the output waveform
NEURAL SOURCE-FILTER MODEL
Filter module
• Each filter block is similar to ClariNet's: dilated CONV layers with gated Tanh/Sigmoid activations, FF layers, and skip connections
• Ten dilated-CONV layers in one block
☛ Our WaveNet implementation: http://tonywangx.github.io/pdfs/wavenet.pdf
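As a rough illustration of the dilated-convolution idea, here is a toy scalar version of a residual filter block; the real block uses the gated Tanh/Sigmoid activation and FF layers described above, and the kernels are learned:

```python
import math

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D dilated convolution with zero-padding on the left; later
    kernel taps align with more recent samples."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = [0.0] * pad + list(x)
    return [sum(kernel[j] * xp[i + j * dilation] for j in range(k))
            for i in range(len(x))]

def filter_block(x, layers):
    """Simplified residual block: each (kernel, dilation) layer applies a
    dilated conv followed by Tanh, added back to the running signal."""
    out = list(x)
    for kernel, dilation in layers:
        h = dilated_conv1d(out, kernel, dilation)
        out = [o + math.tanh(v) for o, v in zip(out, h)]
    return out

# Dilation sizes typically double per layer: 1, 2, 4, ...
layers = [([0.5, 0.5], 2 ** i) for i in range(3)]
y = filter_block([0.0, 1.0, 0.0, -1.0], layers)
```

Stacking such blocks gives a large receptive field with few parameters, which is the reason for the dilated-CONV design.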
NEURAL SOURCE-FILTER MODEL
Training criterion
• Please check the paper and SLP-P22.8, "STFT spectral loss for training a neural speech waveform model"
[Figure: both the generated and natural waveforms go through framing and FFT; gradients flow back through the inverse FFT and de-framing]
NEURAL SOURCE-FILTER MODEL
Training criterion
• Multiple spectral distances with different frame shifts / window lengths / FFT points
• The homogeneous distances are summed into the final criterion
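A NumPy sketch of a multi-resolution spectral criterion (a log-amplitude distance; the exact distance and resolution settings of the paper may differ, and the configs below just mirror the Ls1/Ls2 values at 16 kHz):

```python
import numpy as np

def log_spectral_frames(x, frame_len, frame_shift, fft_len):
    """Log amplitude spectra of overlapping Hann-windowed frames."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] * win
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, fft_len)) + 1e-5)

def multi_res_spectral_loss(gen, nat,
                            configs=((80, 320, 512),    # shift 5 ms, win 20 ms
                                     (40, 80, 128))):   # shift 2.5 ms, win 5 ms
    """Sum of mean-squared log-spectral distances over several resolutions;
    configs are (frame_shift, frame_len, fft_len) triples."""
    return sum(np.mean((log_spectral_frames(gen, m, s, k)
                        - log_spectral_frames(nat, m, s, k)) ** 2)
               for s, m, k in configs)
```

Each term is differentiable, so gradients can flow back through the (i)FFT and de-framing as in the slides.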
CONTENTS
• Introduction
• Proposed model
• Experiments
  • Comparison with WaveNet
  • Ablation test
  • Analysis
• Summary
EXPERIMENTS
Test 1: comparison with WaveNet
• Data: ATR Ximera F009 [1], 15 hours, 16 kHz, Japanese, neutral style
• Acoustic features: Mel-generalized cepstral coefficients (MGC, 60 dimensions) and F0 (1 dimension)

[1] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. XIMERA: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179-184, 2004.
EXPERIMENTS
Test 1: comparison with WaveNet
Speech quality
• 245 paid evaluators, 1450 evaluation sets
  o Copy-synthesis: given natural MGC/F0
  o Pipeline TTS: given MGC/F0 generated by acoustic models
[Figure: MOS results for WORLD, WaveNet (u-law), WaveNet (Gaussian), and the neural source-filter model]
☛ Samples, models, code: https://nii-yamagishilab.github.io/samples-nsf/index.html
EXPERIMENTS
Test 1: comparison with WaveNet
Generation speed: how many waveform sampling points can be generated in 1 s of GPU time?
• Single GPU card, same DNN toolkit
• Memory-saving mode: release and allocate GPU memory layer by layer (see the implementation appendix)
[Figure: bar chart; WaveNet 0.19 k, NSF GPU-memory-saving mode 20 k, NSF normal mode 227 k]
EXPERIMENTS
Test 2: ablation test
• One more corpus: CMU-arctic SLT [2], 0.83 hours, 16 kHz, English
• Acoustic features: Mel-spectrogram (80 dimensions) and F0 (1 dimension)
• Copy-synthesis

[2] J. Kominek and A. W. Black. The CMU Arctic speech databases. In Fifth ISCA Workshop on Speech Synthesis, 2004.
EXPERIMENTS
Test 2: ablation test
[Figure: spectrograms of natural speech and the full NSF model on ATR Ximera F009 and CMU-arctic SLT]
EXPERIMENTS
Test 2: ablation test, NSF w/o sine excitation
• The sine-based source module is replaced by Gaussian noise (no F0 information)
[Figure: spectrograms of natural speech, NSF, and NSF w/o sine excitation on ATR Ximera F009 and CMU-arctic SLT]
EXPERIMENTS
Test 2: ablation test, NSF with a single spectral distance
• Ls1: frame shift 5 ms, window length 20 ms
[Figure: spectrograms of natural speech, NSF, and NSF with only Ls1 on ATR Ximera F009 and CMU-arctic SLT]
EXPERIMENTS
Test 2: ablation test
• Ls1: frame shift 5 ms, window length 20 ms
• Ls2: frame shift 2.5 ms, window length 5 ms
[Figure: spectrograms of natural speech, NSF, NSF with only Ls1, and NSF with Ls1 + Ls2 on ATR Ximera F009 and CMU-arctic SLT]
EXPERIMENTS
Analysis
[Figure: signals at the outputs of the condition module, the source module, and filter blocks 1-5, shown step by step]
CONTENTS
• Introduction
• Proposed model
• Experiments
• Summary
SUMMARY
NSF
• Condition module: upsampling
• Source module: sine excitation
• Filter module: dilated CONV
• Criterion: STFT-based spectral distance
-> Easy training, fast generation, high-quality waveforms
SUMMARY
Recent work
• Simplified NSF: faster generation speed
• Simplified NSF -> harmonic-plus-noise NSF: best quality
[Figure: the original gated filter block (dilated conv, Tanh/Sigmoid, FF layers) vs. the simplified block (dilated conv + FF with skip connections)]
Questions & comments are always welcome!
https://nii-yamagishilab.github.io/samples-nsf/index.html
https://arxiv.org/abs/1904.12088
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410-2419, 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251-2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654-5658, 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918-3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
INTRODUCTION
Autoregressive (AR) model
[Figure: each waveform value is generated conditioned on the previously generated values]
• Powerful model
! Slow generation speed

INTRODUCTION
Normalizing flow + knowledge distilling
[Figure: a teacher AR WaveNet distills knowledge into a flow-based student that generates the waveform in parallel]
• Fast generation
! Complicated model
NEURAL SOURCE-FILTER MODEL
Idea 1: spectral-domain criterion
• Avoid the naïve model assumption
• Differentiable & efficient computation
[Figure: the generated and natural waveforms each pass through a short-time Fourier transform; the spectral amplitude error provides the training gradients]
EXPERIMENTS
E2: ablation test, NSF without the 'residual' connection in the filter module
[Figure: spectrograms of natural speech, NSF, and NSF w/o the 'residual' connection, on F009 and CMU-arctic; filter-block diagram]
SUMMARY
Recent work: input F0?
• VCTK corpus, Mel-spectrogram + F0
• Samples generated from NSF by changing only the F0 input: F0, 0.5 * F0, 2.0 * F0
SUMMARY
Recent work: can the input F0 control the output?
• https://arxiv.org/abs/1904.12088
• Same natural Mel-spectrogram, different input F0 contours (randomly sampled from an F0 model)
• NSF: the F0 of each generated waveform follows its input F0 contour (Corr. ~1.0) rather than the F0 contour derived from the Mel-spectrogram (Corr. ~0.9)
• WaveNet: it is not easy to control the F0 of the generated waveforms (Corr. ~0.97 and ~0.92)
SIMPLIFIED & IMPROVED NSF
Neural filter module
• Original filter block: similar to WaveNet / Parallel WaveNet / ClariNet, with gated Tanh/Sigmoid activations and FF layers. Is all of this necessary?
• Simplified filter block: only dilated conv layers and FF layers with skip connections, without the gated activation, additional feedforward transforms, or the a*x+b transform
• The result is a deep residual network: easy to interpret and optimize
SIMPLIFIED & IMPROVED NSF
Comparison with the original NSF
• Memory-save mode: release and allocate GPU memory layer by layer
• Waveform values generated in 1 s of GPU time:
  • AR WaveNet: 190
  • Original NSF: 20,000 (memory save) / 227,000 (full speed)
  • Simplified NSF: 88,000 (memory save) / 327,000 (full speed)
• Number of model parameters: AR WaveNet ~2.9 M, original NSF ~1.8 M, simplified NSF ~1.07 M
SIMPLIFIED & IMPROVED NSF
Improve the simplified NSF?
• Simplified NSF trained on Mel-spectrograms, 16 kHz Japanese data
• Problem: unvoiced sounds disappear
[Figure: spectrograms at epochs 1, 12, and 28; the de-voiced し and た in "… 金融調節を実施した。" ("... carried out monetary-policy operations.") vanish]
? Separate streams for unvoiced / voiced sound
SIMPLIFIED & IMPROVED NSF
Improved NSF
• Harmonic-plus-noise style: the sine-based source module drives one filter module (harmonic part) while noise drives another (noise part)
• How to merge the two streams?
  X Summation: may not separate the harmonic part from the noise
  X Merge in the time domain with a hard / soft voiced/unvoiced switch
  ✓ Merge in the frequency domain with FIR filters, changing the cut-off frequency based on the voicing condition
[Figure: harmonic and noise parts combined across FFT bins into the merged spectrum]
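A NumPy sketch of the frequency-domain merge: complementary windowed-sinc FIR filters low-pass the harmonic stream and high-pass the noise stream before summation. The cut-off, tap count, and function names here are illustrative; in the improved NSF the cut-off changes with the voicing condition.

```python
import numpy as np

def merge_harmonic_noise(harmonic, noise, cutoff_hz, fs=16000, taps=33):
    """Low-pass `harmonic` and high-pass `noise` at `cutoff_hz`, then sum."""
    n = np.arange(taps) - (taps - 1) / 2
    fc = cutoff_hz / fs
    lp = 2 * fc * np.sinc(2 * fc * n) * np.hamming(taps)  # windowed-sinc LP
    hp = -lp
    hp[(taps - 1) // 2] += 1.0                            # spectral inversion -> HP
    return (np.convolve(harmonic, lp, mode="same")
            + np.convolve(noise, hp, mode="same"))

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
harm = np.sin(2 * np.pi * 200 * t)       # toy 'harmonic' stream
noi = 0.1 * rng.standard_normal(16000)   # toy 'noise' stream
merged = merge_harmonic_noise(harm, noi, cutoff_hz=3000)
```

With a fixed filter structure, only the cut-off needs to follow the U/V decision, which keeps the merge deterministic and differentiable.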
[Figure: spectrograms of the 'noise' and 'harmonic' components]
WAVEFORM MODELING
HNM + FIR filters: deterministic filter selection
[Figure: learning curve]
SUMMARY
Simplified & improved NSF
• Simplified: the filter uses only dilated convolution + skip connections
• Improved: separate streams for the noise and harmonic parts, merged using the U/V (voiced/unvoiced) decision
• More speech processing ideas? e.g., multi-band processing, glottal excitation
[Figure: condition module, sine-based source module, noise stream, and the two filter modules merged at the output]
SUMMARY
Samples (16 kHz waveforms)
• Copy-synthesis: natural MGC / Mel-spectrogram and natural F0
• Text-to-speech: MGC / Mel-spectrogram predicted from text and natural F0
[Table: samples from natural speech, the original NSF, the simplified NSF, and the simplified & improved NSF, each conditioned on MGC + F0 and on Mel-spec + F0]
SUMMARY
DFT
Framing/
windowing
DFT
Framing/
windowing
Generated
waveform
Natural
waveform
…
N frames
…K
DFT
bins
K-points
DFT
Frame
Length M
Padding
K-M
0
0
0
0
0
0
0
0
0
0
Framing/
windowing
Complex-value domain Real-value domain
NEURAL SOURCE-FILTER MODEL
Training criterion
NEURAL SOURCE-FILTER MODEL
Training criterion
[Figure: framing written as a matrix operation: the length-T waveform multiplies a sparse 0/1 matrix X with T rows and N*M columns, producing the N concatenated frames of length M, offset by the frame shift]
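The framing-as-matrix view on this slide can be sketched with toy sizes (NumPy; illustrative only):

```python
import numpy as np

def framing_matrix(T, M, shift):
    """0/1 matrix X with T rows and N*M columns: x @ X yields the N
    concatenated frames of length M taken from a length-T waveform x."""
    N = 1 + (T - M) // shift
    X = np.zeros((T, N * M))
    for n in range(N):
        for m in range(M):
            X[n * shift + m, n * M + m] = 1.0  # frame n starts at n * shift
    return X

x = np.arange(10.0)                    # toy waveform, T = 10
X = framing_matrix(10, M=4, shift=2)   # N = 4 frames
frames = (x @ X).reshape(-1, 4)
```

Because framing is linear, the backward pass de-frames by multiplying with the transpose of X.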
NEURAL SOURCE-FILTER MODEL
Training criterion
[Figure: in the backward pass, gradients of the spectral distance flow through the inverse DFT and de-framing/windowing; gradients w.r.t. the zero-padded part (the K-M padded points per frame) are not used in de-framing/windowing]
PART 3
Implementation issue: WaveNet memory consumption in generation
• Only allocate GPU memory for generating one waveform point
• For t = 1 : T, reuse the memory space (each CNN stack keeps N time steps, where N is the dilation size)
• Check: http://tonywangx.github.io/pdfs/CURRENNT_WAVENET.pdf (pp. 44-55)
[Figure: theoretical GPU memory of the WaveNet implementation vs. number of layers and waveform length]
PART 3
Implementation issue: NSF normal mode
• Normal mode: allocate GPU memory for all layers and all time steps
• GPU memory consumption is large for long waveforms
[Figure: theoretical GPU memory of NSF vs. number of layers and waveform length]
PART 3
Implementation issue: NSF memory-save mode
• Allocate GPU memory only for the current layer and the previous layers it depends on
• Release GPU memory when a layer is no longer needed
• GPU memory consumption is small
[Figure: theoretical GPU memory of NSF in memory-save mode vs. number of layers and waveform length]
PART 3
Implementation issue: NSF memory-save mode

function __computeGenPass_LayerByLayer_mem:
    build the layer-dependency graph
    for layer_n = 1 : L:
        allocate GPU memory and compute layer_n's output
        for each layer_m whose output is an input to layer_n:
            if layer_m is no longer needed by any layer:
                release the memory of layer_m
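The pseudocode above can be sketched in Python with reference counting; the layer structure and names here are hypothetical, not the actual toolkit's API:

```python
def generate_layer_by_layer(layer_fns, deps, x):
    """Compute layers in topological order, releasing each stored output as
    soon as no remaining layer depends on it (the 'memory-save' idea).

    layer_fns: name -> function taking a list of input values
    deps:      name -> list of parent layer names, in topological order;
               'input' denotes the network input x
    """
    # count how many consumers still need each layer's output
    remaining = {"input": 0}
    for name, parents in deps.items():
        remaining.setdefault(name, 0)
        for p in parents:
            remaining[p] = remaining.get(p, 0) + 1
    cache = {"input": x}
    last = None
    for name, parents in deps.items():
        cache[name] = layer_fns[name]([cache[p] for p in parents])
        for p in parents:
            remaining[p] -= 1
            if remaining[p] == 0:
                del cache[p]   # release this layer's "GPU memory"
        last = name
    return cache[last], len(cache)

# toy 3-layer network: c consumes both a and b, so a must live until c runs
layer_fns = {"a": lambda v: v[0] + 1, "b": lambda v: v[0] * 2,
             "c": lambda v: v[0] + v[1]}
deps = {"a": ["input"], "b": ["a"], "c": ["a", "b"]}
result, kept = generate_layer_by_layer(layer_fns, deps, 3)
```

Only one cached output survives at the end, which is the small-memory behavior the slide describes.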
EXPERIMENTS
Test 1: comparison with WaveNet (generation speed, revisited)
• Memory-saving mode: release and allocate GPU memory layer by layer
• Single GPU card, same DNN toolkit
• Training is also straightforward and fast
• Waveform sampling points generated in 1 s of GPU time: WaveNet 0.19 k; NSF GPU-memory-saving mode 20 k; NSF normal mode 227 k
SELF-INTRODUCTION
• National Institute of Informatics, Yamagishi-lab, postdoc
• PhD (2015-2018), SOKENDAI / NII
• M.A. (2012-2015), USTC
• B.A. (2008-2012), UESTC
• http://tonywangx.github.io

Neural source-filter waveform model

  • 1.
    Neural source-filter waveformmodel 1contact: wangxin@nii.ac.jp we welcome critical comments, suggestions, and discussion Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI National Institute of Informatics, Japan 2019-02-27 SLP-L8.6 ICASSP 2019 Brighton, United Kingdom
  • 2.
     Introduction  Proposedmodel  Experiments  Summary CONTENTS 2
  • 3.
  • 4.
    4 INTRODUCTION Task definition … 1 23 4 T… Waveform values Neural waveform models
  • 5.
    5 INTRODUCTION Naïve model … 1 23 4 T… CNN / RNN / Feedforward Mean-square-error (MSE) / Cross-entropy (CE) Waveform values
  • 6.
    6 INTRODUCTION Naïve model … … 1 23 4 T… Simple model Simple criterion
  • 7.
    7 INTRODUCTION WaveNet WaveRNN SampleRNN FFTNet WaveGlow FloWaveNet ClariNet Universal neuralvocoder AR + expert knowledge LPCNet LP-WaveNet ExcitNet GlotNet Recent neural waveform models Parallel WaveNet Auto- regressive (AR) model Criterion AR Model Criterion Deterministic transform Model Criterion Flow-based transform ☛ Reference in appendix ☛ Tutorial slides: https://www.slideshare.net/jyamagis/ Flow-based models
  • 8.
    Auto- regressive (AR) model Criterion AR Model Criterion Deterministic transform Model Criterion Flow-based transform 8 INTRODUCTION WaveNet WaveRNN SampleRNN FFTNet WaveGlow FloWaveNetClariNet Universal neural vocoder AR + expert knowledge LPCNet LP-WaveNet ExcitNet GlotNet Recent neural waveform models • Without AR or flow? • Use knowledge on speech modeling? Parallel WaveNet • Powerful model ! Sequential generation • Non-sequential generation ! Complicated ! Slow in training Neural source- filter model (NSF) Model Better criterion Source • Spectral-domain criterion • Source-filter architecture ☛ Reference in appendix ☛ Tutorial slides: https://www.slideshare.net/jyamagis/ Flow-based models
  • 9.
     Introduction  NSFmodel  Experiments  Summary CONTENTS 9
  • 10.
    10 NEURAL SOURCE-FILTER MODEL 12 3 4 T… … Idea 1: spectral domain criterion • Avoid naïve model assumption … Generated waveform Natural waveform 1 2 3 4 T Spectral distance … …
  • 11.
    11 NEURAL SOURCE-FILTER MODEL 12 3 4 T… … Idea 2: sine-based source excitation … 1 2 3 4 T ‘Filter’ Natural waveform ‘Source’
  • 12.
    12 NEURAL SOURCE-FILTER MODEL Modelstructure • Without AR, flow, or knowledge-distilling Spectral features & F0 Condition module Source module Filter module Spectral distance Natural waveform Generated waveform F0 infor. Spectral infor. Generated waveform Gradients
  • 13.
    13 NEURAL SOURCE-FILTER MODEL Conditionmodule • Up sampling by replication Spectral features & F0 Source module Filter module Spectral distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Generated waveform Gradients
  • 14.
    14 NEURAL SOURCE-FILTER MODEL Sourcemodule • FF: feedforward layer with Tanh • , Spectral features & F0 Filter module Spectral distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics Generated waveform Gradients
  • 15.
    Spectral features &F0 Filter module Spectral distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics 15 NEURAL SOURCE-FILTER MODEL Source module … Random initial phase Sampling rate Noise FF Sine generator Fundamental component Voiced: Unvoiced: noise
  • 16.
    Spectral features &F0 Filter module Spectral distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics 16 NEURAL SOURCE-FILTER MODEL Filter module
  • 17.
    17 NEURAL SOURCE-FILTER MODEL Filtermodule • Neural filter blocks based on dilated convolution (CONV) Spectral features & F0 Spectral distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics Filter block 1 Filter block 2 Filter block 5 …
  • 18.
    18 NEURAL SOURCE-FILTER MODEL Spectralfeatures & F0 Spectral distance Natural waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics Filter block 1 Filter block 2 Filter block 5 … Filter block Dilated CONV + Tanh Sigmoid * FF FF FF + Dilated CONV + Tanh Sigmoid * FF FF + … FF+ Filter module • Similar to ClariNet • Ten dilated-CONV in one block • Generated waveform ☛ Our WaveNet implementation: http://tonywangx.github.io/pdfs/wavenet.pdf
  • 19.
    19 NEURAL SOURCE-FILTER MODEL Trainingcriterion Spectral features & F0 Spectral distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics Filter block 1 Filter block 2 Filter block 5 …
  • 20.
    20 NEURAL SOURCE-FILTER MODEL Trainingcriterion • Please check paper and: Spectral features & F0 Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Noise FF Sine generator harmonics Filter block 1 Filter block 2 Filter block 5 … FFTFraming FFT Framing iFFT De-framing SLP-P22.8, STFT spectral loss for training a neural speech waveform model
  • 21.
    21 NEURAL SOURCE-FILTER MODEL Trainingcriterion • Different frame shifts / window lengths / FFT points • Homogenous distances • FFTFraming FFT Framing iFFT De-framing FFTFraming FFT Framing iFFT De-framing FFTFraming FFT Framing iFFT De-framing +
  • 22.
     Introduction  Proposedmodel  Experiments • Comparison with WaveNet • Ablation test • Analysis  Summary CONTENTS 22
  • 23.
    Test 1: comparisonwith WaveNet  Data and feature  Models 23[1] Kawai, H., Toda, T., Ni, J., Tsuzaki, M., and Tokuda, K. (2004). Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184.. Corpus Size Note ATR Ximera F009 [1] 15 hours 16kHz, Japanese, neutral style Feature Dimension Acoustic Mel-generalized cepstrum coefficients (MGC) 60 F0 1 EXPERIMENTS
  • 24.
    Test 1: comparisonwith WaveNet  Speech quality • 245 paid evaluators, 1450 evaluation sets o Copy-synthesis: given natural MGC/F0 o Pipeline TTS: given generated MGC/F0 from acoustic models 24 EXPERIMENTS WORLD WaveNet u-law WaveNet Gaussian Neural source-filter ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 25.
    Test 1: comparisonwith WaveNet  Speech quality • 245 paid evaluators, 1450 evaluation sets o Copy-synthesis: given natural MGC/F0 o Pipeline TTS: given generated MGC/F0 from acoustic models 25 EXPERIMENTS WORLD WaveNet u-law WaveNet Gaussian Neural source-filter ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 26.
    Test 1: comparisonwith WaveNet  Generation speed  Memory-saving: release and allocate GPU memory layer by layer (pp. 69-73)  Single GPU card, same DNN toolkit 26 EXPERIMENTS How many waveform sampling points can be generated in 1s GPU time? ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html 0k 50k 100k 150k 200k 250k WaveNet NSF GPU-memory- saving mode NSF normal mode 20 k 227 k 0.19 k
  • 27.
    Test 2: Ablationtest  One more corpus  Copy-synthesis 27[2] J. Kominek and A. W. Black. The cmu arctic speech databases. In Fifth ISCA workshop on speech synthesis, 2004. Corpus Size Note CMU-arctic SLT [2] 0.83 hours 16kHz, English Feature Dimension Acoustic Mel-spectrogram 80 F0 1 EXPERIMENTS
  • 28.
    Test 2: Ablationtest 28 EXPERIMENTS Natural NSF ATR Ximera F009 CMU-arctic SLT Spectral features & F0 Condition module Source module Filter module Spectral distanceNatural waveform Generated waveform F0 infor. Spectral infor. ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 29.
    Natural NSF NSFw/o sine excitation ATR Ximera F009 CMU-arctic SLT Test 2: Ablation test 29 EXPERIMENTS Spectral features & F0 Condition module Gaussian noise Filter module Spectral distanceNatural waveform Generated waveform Spectral infor. ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 30.
    FFTFraming FFT Framing iDFTDe-framing FFTFraming FFT Framing iDFT De-framing FFTFraming FFT Framing iDFT De-framing + Test 2: Ablation test 30 EXPERIMENTS Ls1: frame shift 5ms, window length 20ms Natural NSF NSF only Ls1 ATR Ximera F009 CMU-arctic SLT ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 31.
    FFTFraming FFT Framing iDFTDe-framing FFTFraming FFT Framing iDFT De-framing FFTFraming FFT Framing iDFT De-framing + Test 2: Ablation test 31 EXPERIMENTS Natural NSF NSF only Ls1 NSF Ls1+Ls2 ATR Ximera F009 CMU-arctic SLT Ls1: frame shift 5ms, window length 20ms Ls2: frame shift 2.5ms, window length 5ms ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 32.
    Analysis 32 EXPERIMENTS Condition module Source module Filter block1 Filter block 2 Filter block 3 Filter block 4 Filter block 5
  • 33.
    Analysis 33 EXPERIMENTS Condition module Source module Filter block1 Filter block 2 Filter block 3 Filter block 4 Filter block 5
  • 34.
    Analysis 34 EXPERIMENTS Condition module Source module Filter block1 Filter block 2 Filter block 3 Filter block 4 Filter block 5
  • 35.
    Analysis 35 EXPERIMENTS Condition module Source module Filter block1 Filter block 2 Filter block 3 Filter block 4 Filter block 5
  • 36.
    Analysis 36 EXPERIMENTS Condition module Source module Filter block1 Filter block 2 Filter block 3 Filter block 4 Filter block 5
  • 37.
    Analysis 37 EXPERIMENTS Condition module Source module Filter block1 Filter block 2 Filter block 3 Filter block 4 Filter block 5
  • 38.
     Introduction  Proposedmodel  Experiments  Summary CONTENTS 38
  • 39.
    39 SUMMARY NSF • Condition: upsampling • Source: sine excitation • Filter: dilated-CONV • Criterion: STFT Spectral features & F0 Condition module Source module Filter module Spectral distanceNatural waveform Generated waveform F0 infor. Spectral infor. Easy training Fast generation High-quality waveforms
  • 40.
    40 SUMMARY Recent work  SimplifiedNSF: faster generation speed  Simplified NSF -> Harmonic-plus-noise NSF • Best quality Dilated conv + Tanh Sigmoid * FF FF FF + Dilated conv + Tanh Sigmoid * FF FF + … FF+ Dilated conv +FF … FF Dilated conv +
  • 41.
    Questions & Comments arealways Welcome! 41 https://nii-yamagishilab.github.io/samples-nsf/index.html https://arxiv.org/abs/1904.12088
  • 42.
    42 REFERENCE
    WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
    SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
    WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 2018.
    FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
    Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
    Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658, 2018.
    Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
    ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
    FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
    WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
    RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
    NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
    LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
    GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
    ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
    LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
  • 43.
    43 INTRODUCTION Autoregressive (AR) model • Powerful model ! Slow generation speed (figure: sample-by-sample AR generation over t = 1 … T)
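The sequential bottleneck can be seen in a toy generation loop: each output sample requires one full network evaluation conditioned on all previous samples. The step function here is a stand-in for the network, not an actual WaveNet:

```python
import numpy as np

def ar_generate(step_fn, cond, T):
    """Toy autoregressive generation: sample t depends on samples 0..t-1,
    so the T network calls must run one after another (no parallelism)."""
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = step_fn(x[:t], cond)  # one network call per waveform sample
    return x
```

At 16 kHz this means 16,000 strictly sequential network calls per second of audio, which is why AR generation is slow.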
  • 44.
    44 INTRODUCTION Normalizing flow + knowledge distillation • Fast generation ! Complicated model (figure: teacher AR WaveNet distilled into a parallel student)
  • 45.
    45 NEURAL SOURCE-FILTER MODEL Idea 1: spectral domain criterion • Avoid naïve model assumption • Differentiable & efficient computation (figure: short-time Fourier transform of generated and natural waveforms, spectral amplitude error, gradients back-propagated to the model)
  • 46.
    E2: Ablation test 46 EXPERIMENTS — Natural vs. NSF vs. NSF w/o 'residual' connection in filter module (F009, CMU-arctic) (figure: filter block with dilated CONV, tanh/sigmoid gate, and FF layers) https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 47.
    47 SUMMARY Recent work  Input F0? • VCTK corpus, Mel-spectrogram + F0 • Samples generated from NSF, by changing F0 only: F0, 0.5 × F0, 2.0 × F0
  • 48.
    48 SUMMARY Recent work  Input F0? • https://arxiv.org/abs/1904.12088 • Same Mel-spec., different F0 input: NSF + F0 contour a / b / c (each with the natural Mel-spec.) → Waveform 1 / 2 / 3 → F0 contours 1 / 4 / 5; vs. F0 contour from Mel-spec., corr. ~1.0 / corr. ~0.9; F0 contours from an F0 model or random sampling
  • 49.
    49 SUMMARY Recent work  Input F0? • https://arxiv.org/abs/1904.12088 • Same Mel-spec., different F0 input • Not easy to control F0 of generated waveforms from WaveNet: WaveNet + F0 contour a / b / c (each with the natural Mel-spec.) → Waveform 1 / 2 / 3 → F0 contours 1 / 4 / 5; vs. F0 contour from Mel-spec., corr. ~0.97 / corr. ~0.92; F0 contours from an F0 model or random sampling
  • 50.
    Filter module (original) • Similar to WaveNet / Parallel WaveNet / ClariNet • Necessary? 50 SIMPLIFIED & IMPROVED NSF (figure: neural filter module with dilated-CONV-based filter blocks 1–5; each block uses a dilated conv layer, tanh/sigmoid gate, and FF layers)
  • 51.
    Filter module (simplified) • Without gated activation / additional feedforward / ax+b • Deep residual network • Easy to interpret and optimize 51 SIMPLIFIED & IMPROVED NSF (figure: simplified dilated-CONV-based filter block — dilated conv layers with residual additions, then FF layers)
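The simplified block can be sketched as a plain residual stack of dilated convolutions followed by feed-forward layers; the kernel size, activations, and weight shapes below are illustrative assumptions, not the paper's exact hyper-parameters:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution. x: (T, C), w: (K, C, C)."""
    K, T = w.shape[0], x.shape[0]
    xp = np.vstack([np.zeros(((K - 1) * dilation, x.shape[1])), x])
    out = np.zeros((T, w.shape[2]))
    for k in range(K):
        out += xp[k * dilation:k * dilation + T] @ w[k]
    return out

def simplified_filter_block(x, conv_ws, ff_ws):
    """Simplified NSF filter block: dilated convs with residual additions
    (no tanh/sigmoid gating), then feed-forward (FF) projections."""
    h = x
    for i, w in enumerate(conv_ws):
        h = h + np.tanh(dilated_conv1d(h, w, dilation=2 ** i))  # residual add
    for w in ff_ws:
        h = np.tanh(h @ w)
    return h
```

Dropping the gate removes roughly half of each block's parameters and multiplications, which is consistent with the speed and size gains reported on the comparison slide.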
  • 52.
    Comparison with original NSF  Waveform generation speed  Number of model parameters  Memory-save mode: release and allocate GPU memory layer by layer 52 SIMPLIFIED & IMPROVED NSF
    Waveform values generated in 1 s of GPU time: AR WaveNet 190; Original NSF 20,000 (memory-save) / 88,000 (full speed); Simplified NSF 227,000 (memory-save) / 327,000 (full speed)
    Number of model parameters: AR WaveNet ~2.9 M; Original NSF ~1.8 M; Simplified NSF ~1.07 M
  • 53.
    Improve the simplified NSF? • Simplified NSF trained on Mel-spectrogram, 16 kHz Japanese data • Unvoiced sound disappears 53 SIMPLIFIED & IMPROVED NSF (spectrograms at epochs 1, 12, 28)
  • 54.
    Improve the simplified NSF? 54 SIMPLIFIED & IMPROVED NSF — the syllables し and た are de-voiced in the Japanese utterance 「金融調節を実施した。」 ('Monetary adjustment was carried out.') ? Separate streams for unvoiced / voiced sound
  • 55.
    Improved NSF • Harmonic + noise style • How to merge? 55 SIMPLIFIED & IMPROVED NSF (figure: sine-based source module and noise input, F0, condition module, filter modules 1 and 2, merge)
  • 56.
    Improved NSF • Harmonic + noise style • How to merge? • Summation: may not separate harmonic from noise 56 SIMPLIFIED & IMPROVED NSF (figure: sine-based source module and noise input, filter modules 1 and 2, summed outputs)
  • 57.
    Improved NSF • Harmonic + noise style • How to merge? X Summation: may not separate harmonic from noise X Merge in time domain: hard / soft switch 57 SIMPLIFIED & IMPROVED NSF (figure: voiced/unvoiced switch between filter modules 1 and 2)
  • 58.
    Improved NSF • Harmonic + noise style • How to merge? X Summation: may not separate harmonic from noise X Merge in time domain: hard / soft switch ✓ Merge in frequency domain: FIR filters 58 SIMPLIFIED & IMPROVED NSF (figure: harmonic part and noise part merged across FFT bins)
  • 60.
    Improved NSF • Harmonic + noise style • How to merge? ✓ Merge in frequency domain: FIR filters o Change cut-off frequency based on voicing condition 60 SIMPLIFIED & IMPROVED NSF
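A minimal sketch of the frequency-domain merge: a windowed-sinc low-pass FIR keeps the harmonic stream below a cut-off, a complementary (spectral-inversion) high-pass keeps the noise stream above it, and the cut-off follows the voicing decision. The filter length and cut-off frequencies here are illustrative assumptions:

```python
import numpy as np

def lowpass_fir(cutoff_hz, sr=16000, taps=65):
    """Windowed-sinc low-pass FIR filter coefficients."""
    n = np.arange(taps) - (taps - 1) / 2
    h = (2.0 * cutoff_hz / sr) * np.sinc(2.0 * cutoff_hz / sr * n)
    return h * np.hamming(taps)

def merge_streams(harmonic, noise, voiced, sr=16000,
                  fc_voiced=5000.0, fc_unvoiced=1000.0):
    """Merge harmonic and noise streams with complementary FIR filters;
    the cut-off frequency changes with the voicing condition."""
    lp = lowpass_fir(fc_voiced if voiced else fc_unvoiced, sr)
    hp = -lp
    hp[(len(lp) - 1) // 2] += 1.0  # spectral inversion: high-pass = delta - lp
    return (np.convolve(harmonic, lp, mode='same')
            + np.convolve(noise, hp, mode='same'))
```

Unlike a time-domain switch, the two filters are complementary in frequency, so the harmonic and noise components stay separated in the merged signal.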
  • 61.
    61 Spectrogram of 'noise' component / Spectrogram of 'harmonic' component
  • 62.
    HNM + FIR filters  Deterministic filter selection WAVEFORM MODELING 62 (figure: learning curve)
  • 63.
    Simplified & improved NSF • Simplified: filter uses only dilated convolution + skip-connection • Improved: separate streams for noise and harmonic  More speech processing ideas? e.g., multi-band, glottal excitation 63 SUMMARY (figure: sine-based source module, noise input, U/V, condition module, F0, filter modules 1 and 2)
  • 64.
    Samples (16 kHz waveform, natural acoustic features) • Natural MGC / Mel-spectrogram • Natural F0 — systems: Natural, Original NSF, Simplified NSF, Simplified & improved NSF, each with MGC + F0 and Mel-spec + F0 64 SUMMARY
  • 65.
    Samples (16 kHz waveform, text-to-speech) • Predicted MGC / Mel-spectrogram from text • Natural F0 — systems: Natural, Original NSF, Simplified NSF, Simplified & improved NSF, each with MGC + F0 and Mel-spec + F0 65 SUMMARY
  • 66.
  • 67.
    NEURAL SOURCE-FILTER MODEL Training criterion (figure: framing written as a linear operation — a sparse matrix X with T rows and N·M columns; each block of M columns extracts one frame of length M at the given frame shift, zeros elsewhere)
  • 68.
    NEURAL SOURCE-FILTER MODEL Training criterion — framing/windowing and K-point DFT of both the generated and the natural waveform (N frames, K DFT bins, frame length M, K−M zeros of padding per frame); gradients flow back through the inverse DFT and de-framing/windowing; gradients w.r.t. the zero-padded part are not used in de-framing/windowing; the DFT side is complex-valued, the waveform side real-valued
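The forward half of this pipeline — framing, windowing, zero-padding each frame from M to K points, and the K-point DFT — can be sketched as follows. The values M = 320, shift 80, K = 512 mirror the Ls1 configuration at 16 kHz; the window choice is an assumption:

```python
import numpy as np

def frame_and_dft(x, M=320, shift=80, K=512):
    """Frame/window a waveform, zero-pad frames from M to K, K-point DFT."""
    win = np.hanning(M)
    n = (len(x) - M) // shift + 1
    frames = np.stack([x[i * shift:i * shift + M] * win for i in range(n)])
    padded = np.pad(frames, ((0, 0), (0, K - M)))  # K - M zeros per frame
    return np.fft.rfft(padded, axis=1)             # (n_frames, K // 2 + 1)
```

During training only the spectral amplitudes of the two DFTs are compared; the gradient travels back through the inverse DFT and de-framing, and the components belonging to the zero-padded samples are simply discarded.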
  • 69.
    Implementation issue  WaveNetmemory consumption in generation • Only allocate GPU memory for generating 1 waveform point • For t = 1 : T, reuse the memory space • Check: http://tonywangx.github.io/pdfs/CURRENNT_WAVENET.pdf (pp.44-55) 69 PART 3 Condition GPU Memory for neural network in theory Number of layers Waveform length Condition CNN stacks Memory for 1 time step N time steps (N = dilation size)Implementation WaveNet
  • 70.
    Implementation issue  NSFnormal node • Normal mode: allocate GPU memory for all layers and all time steps • GPU memory consumption is large for long waveforms 70 PART 3 GPU Memory for neural network in theory Number of layers Waveform length NSF Condition Condition CNN stacks Implementation
  • 71.
    Implementation issue  NSFmemory-save node • Memory-save mode: • Allocate GPU memory for the current layer and all previous layers it depends on • Release GPU memory when one layer is no longer needed • GPU memory consumption is small 71 PART 3 Condition GPU Memory for neural network in theory Number of layers Waveform length Condition CNN stacks NSF Implementation
  • 72.
    Implementation issue  NSFmemory-save node 72 PART 3 Condition GPU Memory for neural network in theory Number of layers Waveform length Condition CNN stacks Memory-save mode: Function __computeGenPass_LayerByLayer_mem Build layer-dependency graph For layer_n = 1 : L Allocate GPU memory and compute layer_n’s output For layer_m = 1 : layer_n where layer_m’s output is input to layer_n If layer_m is no longer needed by any layer, release memory of layer_m NSF Implementation
  • 73.
    Test 1: comparison with WaveNet  Generation speed  Memory-saving: release and allocate GPU memory layer by layer (pp. 68–71)  Single GPU card, same DNN toolkit  Training is also straightforward and fast 73 EXPERIMENTS — waveform sampling points generated in 1 s of GPU time: WaveNet 0.19 k; NSF 20 k (GPU-memory-saving mode) / 227 k (normal mode) ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 74.
     National Institute of Informatics  Yamagishi-lab  Postdoc  PhD (2015–2018) SOKENDAI / NII  M.A. (2012–2015) USTC  B.A. (2008–2012) UESTC  http://tonywangx.github.io SELF-INTRODUCTION 74
