Slides by Xin Wang
National Institute of Informatics
Copyright (c) 2018 - 2019
National Institute of Informatics
Department of Computer Science
Some rights reserved.
This work is licensed under the Creative Commons Attribution 3.0 license.
See http://creativecommons.org/ for details.
Note: Natural Japanese speech data belonging to the ATR Ximera corpus have been deleted
in this publicly available version
Neural Waveform Modeling
from our experiences in text-to-speech application
2
contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG with Shinji Takaki and Junichi Yamagishi
National Institute of Informatics, Japan
NLP lecture series, IIS
Erlangen Germany, 2019
 Postdoc, Yamagishi-lab, NII
 Research keywords:
• Text-to-speech synthesis (TTS)
1. Neural network
2. Hidden Markov model
• Speech anti-spoofing
SELF-INTRODUCTION
3
WANG Xin
Pronunciation: "one shin"
☛Research-map page: https://researchmap.jp/wangxin/?lang=english
☛Personal page: http://tonywangx.github.io
鑫王
CONTENTS
4
Introduction
Theory
Practice
Summary
• AR & flow-based models
• No AR nor flow
• WaveNet
• Neural source-filter model
• Beyond speech
• Future direction
5
Text-to-speech synthesis
http://www.hawking.org.uk/the-computer.html
https://hackaday.com/2018/05/10/googles-duplex-ai-has-conversation-indistinguishable-from-humans/
Text TTS Speech waveform
INTRODUCTION
Text TTS Speech waveform
Text-to-speech synthesis
 Statistical parametric speech synthesis 1
6
Marianna
made the
marmalade
Linguistic features Acoustic features
Front-end
(Text-analyzer)
Back-end
Waveform
generator
Acoustic
models
Text
/m/ /ɛ/ /r/ …
H* on Marianna …
(S (NP (N Marianna))
   (VP (V made)
       (NP (ART the)
           (N marmalade))))
Mel-spectrum, F0,
Band-aperiodicity, etc.
1. H. Zen, K. Tokuda, and A. W. Black. Statistical parametric speech synthesis. Speech Communication, 51:1039–1064, 2009.
INTRODUCTION
Text-to-speech synthesis
 Recent TTS frameworks
7
Front-end
(Text-analyzer)
Back-end
Waveform
generator
Acoustic
models
Text
Trimmed
front-end
‘end-to-end’ TTS system
Text Waveform
module
Attention-based
acoustic model
Front-end
(Text-analyzer)
Unified back-end
Text Waveform
module
Pre-processing
A. van den Oord, et al. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Y. Wang, et al. Tacotron: Towards end-to-end speech synthesis. In Proc. Interspeech, pages 4006–4010, 2017.
J. Shen, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proc. ICASSP, pages 4779–4783, 2018.
INTRODUCTION
8
Spectral features, F0, etc.
Neural waveform modeling
INTRODUCTION
9
Neural waveform modeling
1 2 3 4 T…
Waveform
values
Neural waveform models
…
INTRODUCTION
10
Naïve neural waveform model
…
1 2 3 4 T…
Convolutional network (CNN) / Recurrent network (RNN)
Mean-square-error (MSE) / Cross-entropy (CE)
Waveform
values
INTRODUCTION
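A minimal sketch of such a naïve model (my own illustration, assuming a PyTorch-style implementation; the layer sizes and the 80-sample frame shift are assumptions, not values from the lecture):

```python
# Sketch of the naive waveform model: acoustic features (frame rate) are
# upsampled to the waveform rate and mapped to samples by a CNN, trained
# with an MSE loss.  Layer sizes are illustrative, not from the lecture.
import torch
import torch.nn as nn

class NaiveWaveModel(nn.Module):
    def __init__(self, feat_dim=81, frame_shift=80, hidden=64):
        super().__init__()
        self.frame_shift = frame_shift          # waveform samples per acoustic frame
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv1d(hidden, 1, kernel_size=1), # one waveform value per time step
        )

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        x = feats.transpose(1, 2)               # -> (batch, feat_dim, frames)
        x = torch.repeat_interleave(x, self.frame_shift, dim=2)  # frame -> sample rate
        return self.net(x).squeeze(1)           # (batch, frames * frame_shift)

model = NaiveWaveModel()
feats = torch.randn(2, 100, 81)                 # 100 frames of acoustic features
target = torch.randn(2, 100 * 80)               # corresponding waveform
loss = nn.functional.mse_loss(model(feats), target)
loss.backward()
```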
11
Naïve neural waveform model
…
…
1 2 3 4 T…
INTRODUCTION
12
INTRODUCTION
☛ Reference in appendix
☛ Tutorial slides: https://www.slideshare.net/jyamagis/
Overview of neural waveform models (all improve upon the naïve model):
• Autoregressive (AR) neural models: WaveNet, SampleRNN, WaveRNN, FFTNet, LPCNet, ExcitNet, GlotNet
• Flow-based models: Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet
• Neither AR nor flow: neural source-filter model (NSF), multi-head CNN (MCNN), GELP
This tutorial covers both their theoretical interpretation and practical issues.
CONTENTS
13
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model
• Beyond speech
• Future work
14
THEORY: AR NEURAL WAVEFORM MODEL
[Taxonomy diagram with the AR neural models highlighted: WaveNet, SampleRNN, WaveRNN, FFTNet, LPCNet, ExcitNet, GlotNet; conceptually rooted in the Jordan network]
Michael I. Jordan. Serial order: A parallel distributed processing approach. Technical Report 8604, Institute for Cognitive Science, 1986.
Overview
15
THEORY: AR NEURAL WAVEFORM MODEL
General idea
 Training: teacher forcing 1
1 2 3 4 T…
…
…1 2 3
Natural
waveform
1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
16
General idea
 Training: teacher forcing 1
1 2 3 4 T…
…
…1 2 3
1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
THEORY: AR NEURAL WAVEFORM MODEL
17
General idea
 Sequential generation
…
…1 2
1 2 3 4
3
T
…Generated
waveform
THEORY: AR NEURAL WAVEFORM MODEL
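A minimal sketch of the two modes (assumed PyTorch-style pseudo-implementation; `ar_net` is a hypothetical causal network, and the MSE loss is only a stand-in for whatever output distribution is used):

```python
# Sketch of AR training (teacher forcing) vs. sequential generation.
# `ar_net` is a hypothetical causal network mapping the previous sample
# (plus conditioning features) to a prediction of the current sample.
import torch

def teacher_forcing_loss(ar_net, wave, cond):
    # wave: (batch, T) natural waveform; cond: (batch, T, dim) conditioning
    prev = torch.nn.functional.pad(wave[:, :-1], (1, 0))   # o_{t-1}, zero at t=1
    pred = ar_net(prev, cond)                               # all steps in parallel
    return torch.nn.functional.mse_loss(pred, wave)

@torch.no_grad()
def generate(ar_net, cond):
    batch, T, _ = cond.shape
    out = torch.zeros(batch, T)
    prev = torch.zeros(batch, 1)
    for t in range(T):                                       # O(T) sequential loop
        out[:, t] = ar_net(prev, cond[:, t:t+1, :]).squeeze(1)
        prev = out[:, t:t+1]                                 # feed back own output
    return out
```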
18
 WaveNet
• Tractable probability & powerful AR dependency
• Slow sequential generation & only left-to-right dependency
 WaveRNN 1
• Batch-sampling: faster generation
• Subscale-dependency: more than left-to-right dependency
 LPCNet & GlotNet 2,3
• Classical AR + neural AR
1. N. Kalchbrenner, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80, pages 2410–2419, 2018.
2. J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In Proc. ICASSP, pages 5891–5895, 2019.
3. L. Juvela, et al. Speaker-independent raw waveform model for glottal excitation. In Proc. Interspeech, pages 2012–2016, 2018.
THEORY: AR NEURAL WAVEFORM MODEL
19
THEORY: FLOW-BASED MODELS
 Fast generation?
20
Revisit AR model
 Consider an AR model using a Gaussian distribution
1 2 3 T
1 2 3
NN
1 2 3 T
1 2 3
NN
Training
Generation
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017.
THEORY: FLOW-BASED MODELS
Or equivalently
21
Revisit AR model
 Consider an AR model using a Gaussian distribution
1 2 3 T
1 2 3
NN
1 2 3 T
1 2 3
NN
Training
Generation
 z-1 denotes time delay
 See proof of in appendix
NN
z-1
H(.)
NN
z-1
H-1(.)
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017.
THEORY: FLOW-BASED MODELS
22
Revisit AR model
 Consider an AR model using a Gaussian distribution
1 2 3 T
1 2 3
NN
1 2 3 T
1 2 3
NN
Training
Generation
NN
z-1
H(.)
NN
z-1
H-1(.)
 z-1 denotes time delay
 See proof of in appendix
G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017.
Such an AR model is a flow-based model
Training:
1. Transform o1:T to n1:T
2. Maximize the likelihood of n1:T under N(nt; 0, 1)
Generation:
1. Sample nt from N(nt; 0, 1)
2. Transform nt to ot
3. Repeat from t=1 to t=T
THEORY: FLOW-BASED MODELS
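Written out (a reconstruction of the elided equations, following the cited masked autoregressive flow paper):

```latex
% The Gaussian AR model read as a normalizing flow (reconstruction of the
% slide's elided equations; notation follows the cited MAF paper).
\begin{align}
p(o_{1:T}) &= \prod_{t=1}^{T}\mathcal{N}\!\big(o_t;\ \mu_t,\ \sigma_t^2\big),
  \qquad (\mu_t,\sigma_t)=\mathrm{NN}(o_{1:t-1}) \\
\text{Training:}\qquad n_t &= H(o_t) = \frac{o_t-\mu_t}{\sigma_t},
  \qquad \log p(o_{1:T}) = \sum_{t=1}^{T}\big[\log\mathcal{N}(n_t;0,1)-\log\sigma_t\big] \\
\text{Generation:}\qquad o_t &= H^{-1}(n_t) = \mu_t + \sigma_t\,n_t,
  \qquad n_t\sim\mathcal{N}(0,1)
\end{align}
```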
23
THEORY: FLOW-BASED MODELS
From AR to Inverse AR flow-based model
 z-1 denotes time delay
NN
z-1
H(.)
Training
Generation
NN
z-1
H(.)
NN
z-1
H-1(.) NN
z-1
H-1(.)
AR flow Inverse-AR flow
24
THEORY: FLOW-BASED MODELS
From AR to Inverse AR flow-based model
 z-1 denotes time delay
NN
z-1
H(.)
Training
Generation
NN
z-1
H-1(.)
AR flow
NN
z-1
H(.)
NN
z-1
H-1(.)
Inverse-AR flow
AR flow: ✓ O(1) training, ! O(T) generation
Inverse-AR flow: ! O(T) training, ✓ O(1) generation
Knowledge distilling
Parallel WaveNet & ClariNet
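In compact form (my restatement; μ and σ are the shift and scale predicted by the NN, and the conditioning variable decides where the O(T) loop sits):

```latex
% AR flow (MAF-style) vs. inverse-AR flow (IAF-style): the variable that the
% NN is conditioned on decides which direction needs the O(T) loop.
\begin{align}
\text{AR flow (training):}\quad
  n_t &= \frac{o_t - \mu(o_{<t})}{\sigma(o_{<t})}
  \quad\text{all $t$ in parallel;}\qquad
  \text{generation } o_t = \mu(o_{<t}) + \sigma(o_{<t})\,n_t \text{ is sequential} \\
\text{Inverse-AR flow (generation):}\quad
  o_t &= \mu(n_{<t}) + \sigma(n_{<t})\,n_t
  \quad\text{all $t$ in parallel;}\qquad
  \text{computing } n_{1:T} \text{ from } o_{1:T} \text{ is sequential}
\end{align}
```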
25
 WaveGlow1 & FloWaveNet2
• Fast generation & slow training
 Parallel WaveNet3 & ClariNet4
• Knowledge-distilling is complicated
1. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, 2019.
2. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. In Proc. ICML, 2019.
3. A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
4. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proc. ICLR, 2019.
THEORY: FLOW-BASED MODELS
• Faster training & generation
• Easy to implement
26
THEORY: NEURAL SOURCE-FILTER MODEL
No AR, no flow
Neural source-filter
Model (NSF) 1
• Source-filter architecture
• Spectral-domain training criterion
1. X. Wang, et al. Neural source-filter-based waveform model for statistical parametric speech synthesis. In Proc. ICASSP, pages 5916–5920, 2019.
2. S. Ö. Arık, et al. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
MCNN2
GELP3
27
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
…
General idea
• No AR or inverse AR flow
…
1 2 3 4 T
‘Filter’
Natural
waveform
Generated
waveform
1 2 3 4 TF0/pitch ‘Source’
28
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
…
General idea
• Based on short-time Fourier transform (STFT)
…
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance …
…
1 2 3 4 TF0/pitch
…
…
1 2 3 4 TF0/pitch
29
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
Probabilistic interpretation?
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance …
…
What is the ?
30
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
Probabilistic interpretation?
• Spectral distance
1 2 3 4 T…
Framing
Framing
Spectral
distance
FFT
FFT
 , where D is frame length. where K is FFT points.
31
THEORY: NEURAL SOURCE-FILTER MODEL
1 2 3 4 T…
Probabilistic interpretation?
•
1 2 3 4 T…
Framing
Framing
FFT
FFT
Likelihood
over Gaussian
 For ease of explanation, this denotes the spectral power vector,
 where D is the frame length and K is the number of FFT points.
32
THEORY: NEURAL SOURCE-FILTER MODEL
Probabilistic interpretation?
•
1 2 3 4 T…
Framing
FFT
1 2 3 4 T…
Framing
FFT
Likelihood
over Gaussian
 where K is the number of FFT points
 For ease of explanation, this denotes the spectral power vector
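One way to write this out (a hedged reconstruction; the constants on the slides are elided), with Y_n(k) and Ŷ_n(k) the k-th FFT bin of the n-th frame of the natural and generated waveforms:

```latex
% Hedged reconstruction: minimizing the log-spectral-amplitude distance is
% (up to constants) maximizing a Gaussian likelihood over log spectral amplitudes.
\begin{align}
\mathcal{L}_s
  &= \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}
     \Big(\log\frac{|Y_n(k)|^2}{|\hat{Y}_n(k)|^2}\Big)^2 \\
  &= -\sum_{n=1}^{N}\sum_{k=1}^{K}
     \log\mathcal{N}\!\big(\log|Y_n(k)|^2 ;\ \log|\hat{Y}_n(k)|^2,\ 1\big) + \text{const.}
\end{align}
```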
33
Naïve model
AR model
Inverse-AR flow
NSF
THEORY IN SUMMARY
CONTENTS
34
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model ➣
• Beyond speech
• Future work
35
PRACTICE: WAVENET
WaveNet variants
 Discretized or continuous-valued waveforms
• Two practical issues:
1. How to generate waveform samples? ➣
2. How to train a Gaussian WaveNet? ➣
[Diagram: WaveNet with a softmax output over discretized waveform levels vs. a GMM/Gaussian output over continuous waveform values]
36
PRACTICE: WAVENET
WaveNet variants
 Discretized or continuous-valued waveforms
 Other variants
• WaveNet using mixture of logistic distribution 1
• WaveNet + Spline 2
• Quantization noise shaping 3, related noise shaping method 4
1. T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
2. Y. Agiomyrgiannakis. B-spline PDF: A generalization of histograms to continuous density models for generative audio networks. In Proc. ICASSP, pages 5649–5653. IEEE, 2018.
3. T. Yoshimura, et al. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE TASLP, 26(7):1173–1180, 2018.
4. K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018.
[Diagram: softmax output (discretized waveform) vs. GMM/Gaussian output (continuous-valued waveform)]
Generation strategy
 WaveNet-softmax
• Generation as a search problem
• Search space: 256^T for an 8-bit waveform of length T
37
PRACTICE: WAVENET
Generation strategy
 WaveNet-softmax
• Sub-optimal search by
o Exploitation
o Exploration
o Or mix of both
38
PRACTICE: WAVENET
Random sampling
Greedy search
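A minimal sketch of these strategies (assumed PyTorch-style; `logits` and `voiced` are hypothetical inputs, and in a real AR loop each step would be sampled before the next one is computed). The voiced/unvoiced mix anticipates the rule described a few slides later:

```python
# Sketch of generation strategies for a softmax WaveNet output.
# `logits`: (T, 256) unnormalized scores over mu-law levels at each step;
# `voiced`: (T,) boolean flags; both are hypothetical inputs for illustration.
import torch

def greedy(logits):                        # exploitation: argmax at every step
    return logits.argmax(dim=-1)

def random_sampling(logits):               # exploration: sample from the softmax
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

def mixed(logits, voiced, greedy_ratio=0.5):
    # explore in unvoiced steps; exploit in a random subset of voiced steps
    out = random_sampling(logits)
    pick = voiced & (torch.rand(len(voiced)) < greedy_ratio)
    out[pick] = logits[pick].argmax(dim=-1)
    return out
```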
Generation strategy
 WaveNet-softmax
39
[Figure: natural waveform (mu-law levels, 0-1024) and the model's output probability at each sampling point]
PRACTICE: WAVENET
PRACTICE: WAVENET
Generation method
 Experiments on WaveNet vocoder
[Figure: zoomed-in output distributions over waveform levels at individual sampling points, in unvoiced and voiced regions]
How about
1. Exploration in unvoiced steps
2. Exploitation in randomly selected
voiced steps
41
PRACTICE: WAVENET
Natural
Greedy
search
Random
sampling
Mixed
approach
42
PRACTICE: WAVENET
 Rainbow gram: https://gist.github.com/jesseengel/e223622e255bd5b8c9130407397a0494
Natural
Greedy
search
Random
sampling
Mixed
approach
Generation strategy
 WaveNet-softmax
• Exploitation & exploration
• Other strategy: temperature of softmax 1
 WaveNet-Gaussian
• Infinite search space: exact search for the best sample sequence is impossible
• Same strategy as WaveNet-softmax
43
PRACTICE: WAVENET
Greedy best?
Sampling?
1. Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
44
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
• Maximum likelihood training is risky: very large gradients
45
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
[Figure: negative log-likelihood versus training epoch]
46
Negative log-likelihood
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
• Toy experiment
o 1 utterance & network well-initialized
o Variance floor applied
47
[Figure: negative log-likelihood versus epoch during network training, together with the natural waveform and the predicted μt and σt]
48
PRACTICE: WAVENET
Difficulty of fitting the Gaussian
 Why is joint learning unstable? A toy experiment
• Use the MSE network
• Fit only one utterance
[Figure: negative log-likelihood versus epoch, with the natural waveform and the predicted μt and σt]
PRACTICE: WAVENET
Training stability
 WaveNet-Gaussian
• Our two-step strategy
1. Train the blue part with
2. Train the red part only
• Gradients stay mild
1. Minimizes while keeping gradients mild
2. Gradients do not explode when
49
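A minimal sketch of one possible realization of such a two-step schedule; it is my assumption (not stated on the slide) that the "blue part" is the backbone plus mean predictor trained with the variance fixed to 1 (an MSE-like criterion), and the "red part" is the variance predictor trained afterwards with everything else frozen:

```python
# Hedged sketch of a two-step schedule for a Gaussian output layer:
# step 1 fits the mean with the variance held fixed (gradients stay mild),
# step 2 trains only the variance branch.  The split into `mean_head` and
# `logvar_head` is an assumption made for illustration; `backbone` is assumed
# to apply teacher forcing internally (conditioning on the shifted waveform).
import torch

def gaussian_nll(x, mu, logvar):
    return 0.5 * (logvar + (x - mu) ** 2 / logvar.exp()).mean()

def two_step_training(backbone, mean_head, logvar_head, loader, epochs=(10, 10)):
    # Step 1: backbone + mean head, variance fixed to 1 (logvar = 0) -> MSE-like loss
    opt1 = torch.optim.Adam(list(backbone.parameters()) + list(mean_head.parameters()))
    for _ in range(epochs[0]):
        for x, cond in loader:
            h = backbone(x, cond)
            loss = gaussian_nll(x, mean_head(h), torch.zeros_like(x))
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Step 2: train the variance branch only, everything else frozen
    opt2 = torch.optim.Adam(logvar_head.parameters())
    for _ in range(epochs[1]):
        for x, cond in loader:
            with torch.no_grad():
                h = backbone(x, cond)
                mu = mean_head(h)
            loss = gaussian_nll(x, mu, logvar_head(h))
            opt2.zero_grad(); loss.backward(); opt2.step()
```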
Training stability
 WaveNet-Gaussian
• Experiment: 5 hours data
50
PRACTICE: WAVENET
[Figure: negative log-likelihood versus epoch on the training and validation sets, comparing the naïve strategy with our strategy (step 1 followed by step 2)]
Generation strategy
Training WaveNet-Gaussian
52
PRACTICE: WAVENET
Greedy best?
Sampling?
Exploitation + exploration
Keep gradients mild
CONTENTS
53
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model
• Beyond speech
• Future work
54
PRACTICE: NSF
1 2 3 4 T…
…
General idea
• Spectral-domain training criterion
• Source-filter structure
…
Generated
waveform
Natural
waveform 1 2 3 4 T
Spectral
distance …
…
1 2 3 4 TF0/pitch
55
PRACTICE: NSF
Common structure
• No AR or inverse AR
• No knowledge distilling
Spectral features & F0
Condition module
Source module Filter module
Frequency-domain distance
Natural
waveform
Generated
waveform
F0 infor. Spectral infor.
Generated
waveform
Gradients
56
PRACTICE: NSF
Common structure
• Condition module: input feature pre-process
Spectral features & F0
Source module Filter module
Frequency-domain distance
Natural
waveform
Generated
waveform
[Condition module diagram: Bi-LSTM (temporal smoothing), CONV (dimension change), up-sampling, and concatenation (Cat.) of the processed spectral features with the up-sampled F0]
57
PRACTICE: NSF
Common structure
• Source module: generate a sine waveform given F0
 FF: feedforward layer with Tanh
Spectral features & F0
Filter module
Frequency-domain distance
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
Generated
waveform
Gradients
Up samplingBi-LSTM CONV Cat.
F0
58
PRACTICE: NSF
Common structure
…
Random initial phase
Sampling rate
Noise
FF
Sine
generator
Fundamental
component
Voiced:
Unvoiced: noise
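A minimal numpy sketch of this source signal (the sampling rate, number of harmonics, amplitude, and noise scale are assumptions; the FF layer that merges the harmonics is omitted):

```python
# Sketch of the NSF source module: a sine (plus harmonics) with a random
# initial phase in voiced regions, Gaussian noise in unvoiced regions.
# All scaling constants below are assumptions for illustration.
import numpy as np

def source_signal(f0, sr=16000, n_harmonics=7, amp=0.1, noise_std=0.003):
    # f0: fundamental frequency per waveform sample (Hz), 0 in unvoiced regions
    T = len(f0)
    voiced = f0 > 0
    sine = np.zeros(T)
    for h in range(1, n_harmonics + 1):                    # fundamental + harmonics
        phase = 2 * np.pi * np.cumsum(h * f0 / sr)         # instantaneous phase
        sine += amp * np.sin(phase + np.random.uniform(0, 2 * np.pi))
    noise = np.random.randn(T) * noise_std
    # voiced: sine plus a little noise; unvoiced: noise only (assumed scaling amp/3)
    return np.where(voiced, sine + noise, noise / noise_std * amp / 3)
```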
59
PRACTICE: NSF
Common structure
• Error metric
Spectral features & F0
Frequency-domain distance
Natural
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
Filter module
Generated
waveform
Gradients
Compute frequency-domain distance
Compute gradients for SGD
Up samplingBi-LSTM CONV Cat.
F0
60
PRACTICE: NSF
Common structure
• Based on short-time Fourier transform
Spectral features & F0
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
FFTFraming FFT Framing
iFFT De-framing
Filter module
Up samplingBi-LSTM CONV Cat.
F0
61
PRACTICE: NSF
Common structure
• Different frame shifts / window lengths / FFT points
• Homogeneous distances, summed together
[Diagram: three framing + FFT branches with different configurations; their spectral distances are summed (+)]
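A minimal PyTorch-style sketch of this summed multi-resolution spectral distance (the three (FFT size, frame shift, frame length) settings are illustrative assumptions):

```python
# Sketch of a multi-resolution log-spectral-amplitude distance: the same
# distance is computed under several STFT configurations and summed.
import torch

def log_spec_distance(x, y, n_fft, hop, win):
    window = torch.hann_window(win)
    X = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                   window=window, return_complex=True)
    Y = torch.stft(y, n_fft, hop_length=hop, win_length=win,
                   window=window, return_complex=True)
    eps = 1e-7
    return ((torch.log(X.abs() ** 2 + eps) -
             torch.log(Y.abs() ** 2 + eps)) ** 2).mean()

def multi_resolution_loss(generated, natural,
                          configs=((512, 80, 320), (128, 40, 80), (2048, 640, 1920))):
    # configs: (FFT points, frame shift, frame length) per resolution (assumed values)
    return sum(log_spec_distance(generated, natural, *c) for c in configs)
```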
62
PRACTICE: NSF
Common structure
• Different NSF models, different neural filter modules
Spectral features & F0
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
FFTFraming FFT Framing
iFFT De-framing
Filter module
Up samplingBi-LSTM CONV Cat.
F0
Filter module
63
PRACTICE: NSF
Common structure
Spectral features & F0
Natural
waveform
Generated
waveform
Up sampling
Noise
FF
Sine
generator
harmonics
FFTFraming FFT Framing
iFFT De-framing
Filter module
Up samplingBi-LSTM CONV Cat.
F0
Filter module
NSF models: Baseline NSF (b-NSF), Simplified NSF (s-NSF), Harmonic-plus-noise NSF (hn-NSF, ver. 1 and ver. 2)
Described in: ICASSP 2019, journal paper (submitted), SSW 2019
64
PRACTICE: NSF
Baseline and simplified NSF
• Baseline filter block follows WaveNet / ClariNet
• Baseline filter block can be simplified
Baseline
filter block 1
Baseline
filter block 2
Baseline
filter block 5
…
Simplified
filter block 1
Simplified
filter block 2
Simplified
filter block 5
…
b-NSF
s-NSF
simplify
simplify
65
PRACTICE: NSF
Baseline and simplified NSF
[Diagram: b-NSF stacks baseline filter blocks 1-5; s-NSF stacks simplified filter blocks 1-5]
[Diagram: a baseline filter block uses dilated CONV units with Tanh × Sigmoid gating (• denotes element-wise multiplication), FF layers, and residual connections, as in WaveNet; a simplified filter block keeps only dilated CONV and FF layers with residual connections]
66
PRACTICE: NSF
Baseline and simplified NSF
• Both models:
1. Strong harmonics in high-frequency bands
2. Awful unvoiced (fricative) sounds
• Model ‘overfitted’ to voiced sounds?
Baseline
filter block 1
Baseline
filter block 2
Baseline
filter block 5
…
Simplified
filter block 1
Simplified
filter block 2
Simplified
filter block 5
…
b-NSF
s-NSF
simplify
67
PRACTICE: NSF
Harmonic-plus-noise NSF
 HP, LP: high- and low-pass finite-impulse-response (FIR) filter
[Diagram: b-NSF (baseline filter blocks 1-5) is simplified into s-NSF (simplified filter blocks 1-5), which is then upgraded into hn-NSF: a harmonic branch of simplified filter blocks followed by a low-pass (LP) FIR filter, and a noise branch followed by a high-pass (HP) FIR filter, summed at the output]
68
PRACTICE: NSF
Harmonic-plus-noise NSF
Simplified
filter block 1
Simplified
filter block 2
…
Simplified
filter block 5
Simplified
filter block 5
noise
+
HP
LP
Maximum voicing frequency
(MVF)
hn-NSF
69
PRACTICE: NSF
Harmonic-plus-noise NSF
 Version I: choose MVF based on u/v
Simplified
filter block 1
Simplified
filter block 2
…
Simplified
filter block 5
Simplified
filter block 5
noise
+
HP
LP
u/v flag
For voiced sounds
For unvoiced sounds
Condition module for hn-NSF
Fixed MVFs
70
PRACTICE: NSF
Harmonic-plus-noise NSF
 Version II: predict MVF from input features
• Predict MVF from condition module (SSW paper)
• From MVF to FIR filter coefficients (SSW paper)
MVF
Simplified
filter block 1
Simplified
filter block 2
…
Simplified
filter block 5
Simplified
filter block 5
noise
+
HP
LP
Condition module for hn-NSF
sinc
Hamming
window
Gain
norm.
HP
LP
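A minimal numpy sketch of this windowed-sinc construction (the filter order and the simple DC-gain normalization are assumptions):

```python
# Sketch of building low-pass / high-pass FIR filters from a cutoff frequency
# (the maximum voicing frequency) using a Hamming-windowed sinc.
import numpy as np

def windowed_sinc_filters(cutoff_hz, sr=16000, order=10):
    n = np.arange(-order, order + 1)                   # symmetric tap indices
    fc = cutoff_hz / (sr / 2.0)                        # cutoff normalized to Nyquist
    lp = fc * np.sinc(fc * n) * np.hamming(len(n))     # windowed-sinc low-pass
    lp /= lp.sum()                                     # unity gain at DC
    hp = -lp
    hp[order] += 1.0                                   # spectral inversion -> high-pass
    return lp, hp

lp, hp = windowed_sinc_filters(cutoff_hz=4000)         # hypothetical MVF of 4 kHz
filtered = np.convolve(np.random.randn(100), lp, mode="same")  # toy usage
```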
75
PRACTICE: NSF
[Diagram: the complete hn-NSF — condition module (Bi-LSTM, CONV, up-sampling, concatenation of spectral features with F0 and MVF), source module (sine generator with harmonics, noise, and an FF layer), and neural filter module (simplified filter blocks; harmonic branch low-pass filtered, noise branch high-pass filtered, summed into the generated waveform)]
NSF is a deep-residual network
Configuration
 Data and features
 Models
85
Corpus: ATR Ximera F009 [1], 15 hours, 16 kHz, Japanese, neutral speaking style
Acoustic features: Mel-generalized cepstrum coefficients (MGC, 60 dims) or Mel-spectra (80 dims); F0 (1 dim)
PRACTICE: COMPARISON
Models compared: WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), hn-NSF (trainable MVF), WORLD vocoder
Speech quality (ICASSP)
• 245 paid evaluators, 1450 evaluation sets
86
PRACTICE: COMPARISON
Copy-synthesis
Pipeline TTS
[Figure: MOS scores for the WORLD vocoder, WaveNet-softmax, WaveNet-Gaussian, and b-NSF]
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v1.html
Speech quality (Journal paper submitted)
• >150 paid evaluators
• s-NSF did badly on unvoiced sounds
87
PRACTICE: COMPARISON
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v2.html
[Figure: MOS scores for WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), hn-NSF (trainable MVF), and the WORLD vocoder]
Speech quality (SSW 2019)
• >150 paid evaluators
88
☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v3.html
PRACTICE: COMPARISON
■ Copy-synthesis
■ Pipeline TTS
[Figure: MOS scores for natural speech, the WORLD vocoder, WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), and hn-NSF (trainable MVF)]
Generation speed
 Mem-save mode: allocate and release GPU memory layer by layer
(limited by our CUDA implementation)
 Normal mode: allocate GPU memory once
89
How many waveform points can be generated in 1s (Tesla p100)?
PRACTICE: COMPARISON
[Figure: number of waveform points generated per second for WaveNet-softmax, WaveNet-Gaussian, b-NSF, s-NSF, hn-NSF (fixed MVF), hn-NSF (trainable MVF), and the WORLD vocoder]
CONTENTS
90
Introduction
Theory
Practice
Summary
• AR & flow-based models
• Neural source filter model
• WaveNet
• Neural source-filter model
• Beyond speech
• Future work
91
SUMMARY
AR model
WaveRNNSampleRNN
FFTNet
WaveNet
LPCNetExcitNet GlotNet
Multi-head
CNN
No AR, no flow
Neural source-filter
Model (NSF)
• No explicit
Naïve
model
Inverse AR flow
FloWaveNetWaveGlow
ClariNet
Parallel
WaveNet
GELP
92
BEYOND SPEECH
(c.f. HTS Slides, by HTS Working Group)
Source module Filter module
93
BEYOND SPEECH
Music performance
 Training
• URMP dataset 1
o ground-truth F0
o 13 instruments
o solo recording
• One model for all instruments
1 University of Rochester Multi-Modal Music Performance (URMP) Dataset http://www2.ece.rochester.edu/projects/air/projects/URMP.html
[Diagram: a single neural waveform model conditioned on F0 and Mel-spectra; audio samples of natural, b-NSF, s-NSF, and hn-NSF (trainable MVF) outputs for violin, viola, oboe, trumpet, and saxophone]
BEYOND SPEECH
Music performance
 Testing with natural Mel-spectra and F0 as input
[Audio samples: natural, WaveNet, b-NSF, s-NSF, and hn-NSF (trainable MVF) outputs for horn, trombone, tuba, clarinet, and flute]
BEYOND SPEECH
Music performance
 Testing with natural Mel-spectra and F0 as input
96
FUTURE DIRECTION
(c.f. HTS Slides, by HTS Working Group)
Questions & Comments
are always Welcome!
97
https://nii-yamagishilab.github.io/samples-nsf/index.html
98
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu.
WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end
neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et.al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume
80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–
2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv
preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder covering entire audible
frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658. 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et. al.. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–
3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281,
2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint
arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP
(submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv
preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv
preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal
excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. Excitnet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv
preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846,
2018.
MCNN: S. Ö. Arık, H. Jun, and G. Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal
Processing Letters, 26(1):94–98, 2018.
GELP: L. Juvela, et al. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proc. Interspeech, 2019.
99
REFERENCE
By Lauri Juvela, Aalto University
[Diagram: training criterion — the generated and natural waveforms are framed/windowed (frame length M), zero-padded by K-M points, and transformed by a K-point DFT, giving N frames of K DFT bins; the DFT output lives in the complex-valued domain, the waveforms in the real-valued domain]
APPENDIX
Training criterion
[Diagram: the framing operation written as a sparse matrix X with T rows and NM columns; each length-M frame occupies one block of columns, offset by the frame shift, with zeros elsewhere]
APPENDIX
Training criterion
[Diagram: gradients of the frequency-domain distance are propagated back through the inverse DFT and de-framing/windowing to the generated waveform; gradients w.r.t. the zero-padded part (K-M points) are not used in de-framing/windowing]
APPENDIX
103
FLOW-BASED MODELS
Recap AR model
 Consider a WaveNet using a Gaussian distribution
1. Because , we have
1 2 3 T
1 2 3
NN
 z-1 denotes time delay
NN
z-1
H-1(.)
104
FLOW-BASED MODELS
105
FLOW-BASED MODELS
Recap AR model
 Consider a WaveNet using a Gaussian distribution
2. Because , we have
3. Therefore
 z-1 denotes time delay
Triangle-matrix,
as nt depends on o<t
106
FLOW-BASED MODELS
Recap AR model
 Consider a WaveNet using a Gaussian distribution
• So:
 z-1 denotes time delay
107
FLOW-BASED MODELS
Inverse-AR flow
1. Because , we have
 z-1 denotes time delay
NN
z-1
H-1(.)
Triangle-matrix,
as nt depends on ot
108
FLOW-BASED MODELS
Inverse-AR flow
2. Therefore
 z-1 denotes time delay
NN
z-1
H-1(.)
109
FLOW-BASED MODELS
AR flow vs inverse-AR
 z-1 denotes time delay
NN
z-1
H-1(.)NN
z-1
H-1(.)
110
FLOW-BASED MODELS
 z-1 denotes time delay
NN
z-1
H-1(.)NN
z-1
H-1(.)
AR flow
AR flow vs inverse-AR
Inverse-AR flow
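A hedged reconstruction of the elided derivations: both flows are invertible with triangular Jacobians, so the log-determinant reduces to a sum of per-step log standard deviations.

```latex
% Both flows use the change-of-variables formula; the Jacobians are triangular,
% so the log-determinant is the sum of the diagonal (per-step) terms.
\begin{align}
\text{AR flow:}\quad n_t &= \frac{o_t-\mu(o_{<t})}{\sigma(o_{<t})}
  &\Rightarrow\quad
  \log p(o_{1:T}) &= \sum_{t=1}^{T}\big[\log\mathcal{N}(n_t;0,1) - \log\sigma(o_{<t})\big] \\
\text{Inverse-AR flow:}\quad o_t &= \mu(n_{<t}) + \sigma(n_{<t})\,n_t
  &\Rightarrow\quad
  \log p(o_{1:T}) &= \sum_{t=1}^{T}\big[\log\mathcal{N}(n_t;0,1) - \log\sigma(n_{<t})\big]
\end{align}
```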
DianaGray10
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 

Recently uploaded (20)

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

Neural Waveform Modeling

  • 14. MCNN GELP 14 THEORY: AR NEURAL WAVEFORM MODEL Flow-based model FloWaveNet, WaveGlow No AR, nor flow Neural source-filter model (NSF) • No explicit • Spectral-domain training criterion • Source-filter architecture ClariNet Parallel WaveNet Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet Jordan network Michael I. Jordan. Serial order: A parallel distributed processing approach. Technical Report 8604, Institute for Cognitive Science, 1986. Overview
  • 15. 15 THEORY: AR NEURAL WAVEFORM MODEL General idea  Training: teacher forcing 1 1 2 3 4 T… … …1 2 3 Natural waveform 1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
  • 16. 16 General idea  Training: teacher forcing 1 1 2 3 4 T… … …1 2 3 1 R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989. THEORY: AR NEURAL WAVEFORM MODEL
  • 17. 17 General idea  Sequential generation … …1 2 1 2 3 4 3 T …Generated waveform THEORY: AR NEURAL WAVEFORM MODEL
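The two modes on slides 15 through 17 can be made concrete with a toy sketch. The predictor `ar_predict` and its weights `w` below are hypothetical stand-ins for WaveNet/WaveRNN, not the actual models; the point is only that training uses natural past samples (teacher forcing, so all steps can be evaluated in parallel), while generation must feed back its own outputs one step at a time.

```python
import numpy as np

def ar_predict(history, w):
    """Toy AR predictor: weighted sum of the last len(w) samples.
    A stand-in for WaveNet/WaveRNN, which condition on o_{<t}."""
    return float(np.dot(w, history[-len(w):]))

natural = np.sin(2 * np.pi * 100 * np.arange(160) / 16000)   # toy waveform
w = np.array([0.5, 0.3, 0.1])                                # toy parameters
pad = np.zeros(len(w))

# Training (teacher forcing): the input at step t is the *natural* o_{<t},
# so every step can be evaluated independently (i.e., in parallel).
padded = np.concatenate([pad, natural])
errors = [natural[t] - ar_predict(padded[:t + len(pad)], w)
          for t in range(len(natural))]
mse = float(np.mean(np.square(errors)))

# Generation: each new sample is fed back as input, hence strictly sequential.
generated = list(pad)
for t in range(len(natural)):
    generated.append(ar_predict(np.asarray(generated), w))
generated = np.asarray(generated[len(pad):])
```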
  • 18. MCNN GELP 18 Flow-based model FloWaveNet, WaveGlow No AR, nor flow Neural source-filter model (NSF) • No explicit • Spectral-domain training criterion • Source-filter architecture ClariNet Parallel WaveNet Naïve model  WaveNet • Tractable probability & powerful AR dependency • Slow sequential generation & only left-to-right dependency  WaveRNN 1 • Batch-sampling: faster generation • Subscale-dependency: more than left-to-right dependency  LPCNet & GlotNet 2,3 • Classical AR + neural AR AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet 1. N. Kalchbrenner, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80, pages 2410–2419, 10–15 Jul 2018. 2. J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In Proc. ICASSP, pages 5891–5895, 2019. 3. L. Juvela, et al. Speaker-independent raw waveform model for glottal excitation. In Proc. Interspeech 2018, pages 2012–2016, 2018. THEORY: AR NEURAL WAVEFORM MODEL
  • 19. MCNN GELP Flow-based model FloWaveNet, WaveGlow 19 THEORY: FLOW-BASED MODELS No AR, nor flow Neural source-filter model (NSF) • No explicit • Spectral-domain training criterion • Source-filter architecture ClariNet Parallel WaveNet Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet Flow-based model FloWaveNet, WaveGlow  Fast generation?
  • 20. 20 Revisit AR model  Consider an AR model using a Gaussian distribution 1 2 3 T 1 2 3 NN 1 2 3 T 1 2 3 NN Training Generation G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017. THEORY: FLOW-BASED MODELS Or equivalently
  • 21. 21 Revisit AR model  Consider an AR model using a Gaussian distribution 1 2 3 T 1 2 3 NN 1 2 3 T 1 2 3 NN Training Generation  z-1 denotes time delay  See proof of in appendix NN z-1 H(.) NN z-1 H-1(.) G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017. THEORY: FLOW-BASED MODELS
  • 22. 22 Revisit AR model  Consider an AR model using a Gaussian distribution 1 2 3 T 1 2 3 NN 1 2 3 T 1 2 3 NN Training Generation NN z-1 H(.) NN z-1 H-1(.)  z-1 denotes time delay  See proof of in appendix G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. Proc NIPS, pages 2338–2347, 2017. Such an AR model is a flow-based model Training: 1. Transform o1:T to n1:T 2. Maximizing n1:T likelihood over N(nt 0, 1) Generation: 1. Sample nt from N(nt 0, 1) 2. Transform nt to ot 3. Repeat from t=1 to t=T THEORY: FLOW-BASED MODELS
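For readers who want to see the flow view on slides 20 through 22 operationally, the sketch below uses a placeholder network `toy_nn` (an assumption, not the WaveNet architecture) with the Gaussian AR parameterization: training maps the observed samples to noise and adds the log-determinant term, while generation inverts the transform sequentially.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_nn(past):
    """Placeholder for the network mapping o_{<t} to (mu_t, sigma_t)."""
    mu = 0.3 * past[-1] if len(past) else 0.0
    return mu, np.exp(-2.0)                 # constant log sigma = -2 for the toy

o = 0.1 * rng.standard_normal(100)          # "observed" waveform (toy data)

# Training direction: transform o_1:T into n_1:T (each step sees natural o_{<t})
n = np.empty_like(o)
log_sigmas = np.empty_like(o)
for t in range(len(o)):
    mu, sigma = toy_nn(o[:t])
    n[t] = (o[t] - mu) / sigma
    log_sigmas[t] = np.log(sigma)
# change of variables: log p(o) = sum_t [ log N(n_t; 0, 1) - log sigma_t ]
log_lik = np.sum(-0.5 * n ** 2 - 0.5 * np.log(2 * np.pi) - log_sigmas)

# Generation direction: draw n_t ~ N(0, 1) and invert the transform sequentially
gen = []
for t in range(len(o)):
    mu, sigma = toy_nn(np.asarray(gen))
    gen.append(mu + sigma * rng.standard_normal())
```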
  • 23. 23 THEORY: FLOW-BASED MODELS From AR to Inverse AR flow-based model  z-1 denotes time delay NN z-1 H(.) Training Generation NN z-1 H(.) NN z-1 H-1(.) NN z-1 H-1(.) AR flow Inverse-AR flow
  • 24. 24 THEORY: FLOW-BASED MODELS From AR to Inverse AR flow-based model  z-1 denotes time delay NN z-1 H(.) Training Generation NN z-1 H-1(.) AR flow NN z-1 H(.) NN z-1 H-1(.) Inverse-AR flow ✓ O(1) ! O(T)✓ O(1) ! O(T) Knowledge distilling Parallel WaveNet & ClariNet
  • 25. 25 MCNN No AR, nor flow Neural source-filter model (NSF) Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet  WaveGlow1 & FloWaveNet2 • Fast generation & slow training  Parallel WaveNet3 & ClariNet4 • Knowledge-distilling is complicated 1. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, 2019. 2. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. In Proc. ICML, 2019. 3. A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. Proc. ICML, pages 3918–3926, 2018. 4. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. ICLR, 2019. Inverse AR flow FloWaveNet, WaveGlow ClariNet Parallel WaveNet THEORY: FLOW-BASED MODELS
  • 26. • Faster training & generation • Easy to implement Inverse AR flow FloWaveNet, WaveGlow 26 THEORY: NEURAL SOURCE-FILTER MODEL No AR, no flow Neural source-filter model (NSF) 1 • Source-filter architecture • Spectral-domain training criterion Naïve model AR neural model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet ClariNet Parallel WaveNet 1. X. Wang, et al. Neural source-filter-based waveform model for statistical parametric speech synthesis. In Proc. ICASSP, pages 5916–5920, 2019. 2. S. Ö. Arık, et al. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018. MCNN2 GELP3
  • 27. 27 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… … General idea • No AR or inverse AR flow … 1 2 3 4 T ‘Filter’ Natural waveform Generated waveform 1 2 3 4 TF0/pitch ‘Source’
  • 28. 28 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… … General idea • Based on short-time Fourier transform (STFT) … Generated waveform Natural waveform 1 2 3 4 T Spectral distance … … 1 2 3 4 TF0/pitch
  • 29. … … 1 2 3 4 TF0/pitch 29 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… Probabilistic interpretation? Generated waveform Natural waveform 1 2 3 4 T Spectral distance … … What is the ?
  • 30. 30 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… Probabilistic interpretation? • Spectral distance 1 2 3 4 T… Framing Framing Spectral distance FFT FFT  , where D is frame length. where K is FFT points.
  • 31. 31 THEORY: NEURAL SOURCE-FILTER MODEL 1 2 3 4 T… Probabilistic interpretation? • 1 2 3 4 T… Framing Framing FFT FFT Likelihood over Gaussian  For explanation, denotes spectral power vector  , where D is frame length. where K is FFT points.
  • 32. 32 THEORY: NEURAL SOURCE-FILTER MODEL Probabilistic interpretation? • 1 2 3 4 T… Framing FFT 1 2 3 4 T… Framing FFT Likelihood over Gaussian  , where K is FFT points  For explanation, denotes spectral power vector
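The equation images on slides 29 through 32 did not survive extraction; the LaTeX below is a hedged reconstruction of one spectral-distance form that is consistent with the surrounding text (a unit-variance Gaussian over log spectral powers), not necessarily the exact expression used in the papers.

```latex
% Hedged reconstruction; assumes a unit-variance Gaussian over log spectral powers.
\mathcal{L}_s
  = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K}
    \Big( \log |Y_n(k)|^{2} - \log |\hat{Y}_n(k)|^{2} \Big)^{2}
  = -\sum_{n=1}^{N} \sum_{k=1}^{K}
    \log \mathcal{N}\!\Big( \log |Y_n(k)|^{2};\; \log |\hat{Y}_n(k)|^{2},\; 1 \Big)
    + \text{const.}
```

Here Y_n(k) and Ŷ_n(k) are the K-point FFT coefficients of the n-th length-D frame of the natural and generated waveforms, matching the D and K mentioned on the slides.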
  • 33. 33 Naïve model AR model Inverse-AR flow NSF THEORY IN SUMMARY
  • 34. CONTENTS 34 Introduction Theory Practice Summary • AR & flow-based models • Neural source filter model • WaveNet • Neural source-filter model ➣ • Beyond speech • Future work
  • 35. 35 PRACTICE: WAVENET WaveNet variants  Discretized or continuous-valued waveforms • Two practical issues: 1. How to generate waveform samples? ➣ 2. How to train WaveNet-Gaussian? ➣ 1 2 1 2 3 4 3 1 2 1 2 3 4 3 GMM/Gaussian vs. Softmax ➣
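As background for the discretized (softmax) branch above: 8-bit mu-law companding is the standard way to turn continuous samples into 256 classes. The sketch below is generic mu-law encoding and decoding, not code from the authors' toolkit.

```python
import numpy as np

def mulaw_encode(x, mu=255, bits=8):
    """Mu-law companding + uniform quantization of x in [-1, 1] to 2**bits classes."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.clip(((y + 1.0) / 2.0 * (2 ** bits - 1) + 0.5).astype(np.int64),
                   0, 2 ** bits - 1)

def mulaw_decode(idx, mu=255, bits=8):
    """Inverse mapping: class index -> companded value -> waveform sample."""
    y = 2.0 * idx.astype(np.float64) / (2 ** bits - 1) - 1.0
    return np.sign(y) * (np.power(1.0 + mu, np.abs(y)) - 1.0) / mu

x = np.sin(2 * np.pi * np.arange(100) / 100.0)
roundtrip_error = np.max(np.abs(mulaw_decode(mulaw_encode(x)) - x))  # small
```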
  • 36. 36 PRACTICE: WAVENET WaveNet variants  Discretized or continuous-valued waveforms  Other variants • WaveNet using mixture of logistic distribution 1 • WaveNet + Spline 2 • Quantization noise shaping 3, related noise shaping method 4 1. T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. 2. Y. Agiomyrgiannakis. B-spline PDF: A generalization of histograms to continuous density models for generative audio networks. In Proc. ICASSP, pages 5649–5653. IEEE, 2018. 3. T. Yoshimura, et al. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE TASLP, 26(7):1173–1180, 2018. 4. K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018. 1 2 1 2 3 4 3 Softmax 1 2 1 2 3 4 3 GMM/Gaussian
  • 37. Generation strategy  WaveNet-softmax • Generation as a search problem • Search space: 256^T for an 8-bit waveform of length T 1 2 1 2 3 4 3 37 PRACTICE: WAVENET 1 2 3 4 … … … …
  • 38. Generation strategy  WaveNet-softmax • Sub-optimal search by o Exploitation o Exploration o Or mix of both 38 PRACTICE: WAVENET 1 2 3 4 … … … … Random sampling Greedy search
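A minimal sketch of the exploitation/exploration choice (and the softmax temperature mentioned on slide 43), assuming 256 mu-law classes; the mixed voiced/unvoiced rule at the end is the idea suggested on slide 40, with the voicing decision itself left as an external input.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_class(logits, mode="sample", temperature=1.0):
    """Pick the next mu-law class from softmax logits.
    'greedy' = exploitation (argmax); 'sample' = exploration (random draw);
    temperature < 1 sharpens the distribution before drawing."""
    if mode == "greedy":
        return int(np.argmax(logits))
    scaled = logits / temperature
    p = np.exp(scaled - np.max(scaled))
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Mixed rule in the spirit of slide 40: explore in unvoiced steps,
# exploit in (randomly selected) voiced steps; 'voiced' comes from outside.
logits = rng.standard_normal(256)   # stand-in network output for one step
voiced = True
idx = next_class(logits, mode="greedy" if voiced else "sample")
```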
  • 39. Generation strategy  WaveNet-softmax 39 PRACTICE: WAVENET [Figure: generated waveform (mu-law levels, 0–1024) and the corresponding output probability at each sampling point]
  • 40. PRACTICE: WAVENET Generation method  Experiments on WaveNet vocoder [Figure: softmax output distributions over waveform levels at example sampling points] How about 1. Exploration in unvoiced steps 2. Exploitation in randomly selected voiced steps
  • 42. 42 PRACTICE: WAVENET  Rainbow gram: https://gist.github.com/jesseengel/e223622e255bd5b8c9130407397a0494 Natural Greedy search Random sampling Mixed approach
  • 43. Generation strategy  WaveNet-softmax • Exploitation & exploration • Other strategy: temperature of softmax 1  WaveNet-Gaussian • Infinite search space: the best is impossible • Same strategy as WaveNet-softmax 43 PRACTICE: WAVENET Greedy best? Sampling? 1. Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255, 2018.
  • 44. 44 PRACTICE: WAVENET Training stability  WaveNet-Gaussian • Maximum likelihood training is risky: very large gradients NN 1 2 1 2 3 4 3
  • 48. 48 PRACTICE: WAVENET The problem with fitting a Gaussian  Why is joint learning unstable? Toy experiment • Use the MSE network • Fit only one utterance [Figure: negative log-likelihood versus training epoch when fitting the natural waveform, the predicted μt, and the predicted σt]
  • 49. PRACTICE: WAVENET Training stability  WaveNet-Gaussian • Our two-steps strategy 1. Train blue part with 2. Train red part only • Gradient will be mild 1. Minimizes while keep gradient mild 2. Gradient not explode when 49 NN 1 2 1 2 3 4 3
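One possible reading of the two-step schedule above, written as a sketch: step 1 trains the shared trunk and the mean branch with the variance pinned (so the gradient stays MSE-like and mild), and step 2 updates only the variance branch. The module names, sizes, and the choice to pin the variance to 1 in step 1 are assumptions for illustration, not the authors' exact recipe.

```python
import torch

# Hypothetical Gaussian output head: a shared trunk plus two branches that
# predict the mean ("blue part") and the log-variance ("red part").
trunk = torch.nn.GRU(input_size=64, hidden_size=64, batch_first=True)
mean_head = torch.nn.Linear(64, 1)
logvar_head = torch.nn.Linear(64, 1)

def gaussian_nll(target, mu, logvar):
    return 0.5 * (logvar + (target - mu) ** 2 / logvar.exp()).mean()

x = torch.randn(4, 100, 64)   # stand-in conditioning / AR input features
y = torch.randn(4, 100, 1)    # stand-in target waveform samples

# Step 1: train trunk + mean with the variance pinned to 1 (logvar = 0);
# the loss then behaves like a scaled MSE and its gradients stay mild.
opt1 = torch.optim.Adam(list(trunk.parameters()) + list(mean_head.parameters()))
h, _ = trunk(x)
loss1 = gaussian_nll(y, mean_head(h), torch.zeros_like(y))
opt1.zero_grad()
loss1.backward()
opt1.step()

# Step 2: update only the variance branch, so the 1/sigma^2 factor cannot
# send exploding gradients back into the rest of the network.
opt2 = torch.optim.Adam(logvar_head.parameters())
with torch.no_grad():
    h, _ = trunk(x)
    mu = mean_head(h)
loss2 = gaussian_nll(y, mu, logvar_head(h))
opt2.zero_grad()
loss2.backward()
opt2.step()
```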
  • 50. Training stability  WaveNet-Gaussian • Experiment: 5 hours of data 50 PRACTICE: WAVENET [Figure: negative log-likelihood versus epoch on the training and validation sets, comparing the naïve strategy with our two-step strategy (step 1 followed by step 2)]
  • 51. Training stability  WaveNet-Gaussian • Experiment: 5 hours of data 51 PRACTICE: WAVENET [Figure: the same negative log-likelihood comparison as slide 50]
  • 52. Generation strategy Training WaveNet-Gaussian 52 PRACTICE: WAVENET Greedy best? Sampling? Exploitation + exploration Keep gradients mild NN 1 2 1 2 3 4 3
  • 53. CONTENTS 53 Introduction Theory Practice Summary • AR & flow-based models • Neural source filter model • WaveNet • Neural source-filter model • Beyond speech • Future work
  • 54. 54 PRACTICE: NSF 1 2 3 4 T… … General idea • Spectral-domain training criterion • Source-filter structure … Generated waveform Natural waveform 1 2 3 4 T Spectral distance … … 1 2 3 4 TF0/pitch
  • 55. 55 PRACTICE: NSF Common structure • No AR or inverse AR • No knowledge distilling Spectral features & F0 Condition module Source module Filter module Frequency-domain distance Natural waveform Generated waveform F0 infor. Spectral infor. Generated waveform Gradients
  • 56. 56 PRACTICE: NSF Common structure • Condition module: input feature pre-process Spectral features & F0 Source module Filter module Frequency-domain distance Natural waveform Generated waveform Up sampling Up samplingBi-LSTM CONV F0 Generated waveform Gradients Up sampling Dimension change Temporal smoothing Cat.
  • 57. 57 PRACTICE: NSF Common structure • Source module: generate a sine waveform given F0  FF: feedforward layer with Tanh Spectral features & F0 Filter module Frequency-domain distance Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics Generated waveform Gradients Up samplingBi-LSTM CONV Cat. F0
  • 58. Spectral features & F0 Filter module Frequency-domain distance Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics Generated waveform Gradients Up samplingBi-LSTM CONV Cat. F0 58 PRACTICE: NSF Common structure … Random initial phase Sampling rate Noise FF Sine generator Fundamental component Voiced: Unvoiced: noise
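A sketch of such a sine-based source signal, assuming a per-sample F0 contour as input; the number of harmonics, the amplitudes, and the noise level below are illustrative values, not the ones used in the NSF papers.

```python
import numpy as np

def sine_source(f0, fs=16000, num_harmonics=7, noise_std=0.003, seed=1):
    """Sine-based excitation from a per-sample F0 contour (already upsampled to fs):
    fundamental + harmonics with random initial phases in voiced regions (f0 > 0),
    Gaussian noise everywhere (and noise only where f0 == 0)."""
    rng = np.random.default_rng(seed)
    phase = 2.0 * np.pi * np.cumsum(f0 / fs)        # instantaneous phase
    e = np.zeros(len(f0))
    for h in range(1, num_harmonics + 1):
        e += (0.1 / h) * np.sin(h * phase + rng.uniform(0.0, 2.0 * np.pi))
    e = np.where(f0 > 0, e, 0.0) + noise_std * rng.standard_normal(len(f0))
    return e

f0 = np.concatenate([np.full(8000, 220.0), np.zeros(4000)])  # voiced then unvoiced
excitation = sine_source(f0)
```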
  • 59. 59 PRACTICE: NSF Common structure • Error metric Spectral features & F0 Frequency-domain distance Natural waveform Up sampling Noise FF Sine generator harmonics Filter module Generated waveform Gradients Compute frequency-domain distance Compute gradients for SGD Up samplingBi-LSTM CONV Cat. F0
  • 60. 60 PRACTICE: NSF Common structure • Based on short-time Fourier transform Spectral features & F0 Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics FFTFraming FFT Framing iFFT De-framing Filter module Up samplingBi-LSTM CONV Cat. F0
  • 61. 61 PRACTICE: NSF Common structure • Different frame shifts / window lengths / FFT points • Homogenous distances • FFTFraming FFT Framing iFFT De-framing FFTFraming FFT Framing iFFT De-framing FFTFraming FFT Framing iFFT De-framing +
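A minimal multi-resolution spectral distance in the spirit of this slide: the same log-spectral-amplitude distance computed under several framing/FFT configurations and summed. The Hann window and the three configurations are placeholders, not the settings used in the experiments.

```python
import numpy as np

def log_spectral_distance(x, y, frame_len, frame_shift, fft_len, eps=1e-5):
    """Distance between log spectral amplitudes of two waveforms for one
    STFT configuration (Hann window, zero-padded FFT of size fft_len)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    dist = 0.0
    for i in range(n_frames):
        seg = slice(i * frame_shift, i * frame_shift + frame_len)
        X = np.fft.rfft(x[seg] * win, n=fft_len)
        Y = np.fft.rfft(y[seg] * win, n=fft_len)
        dist += np.sum((np.log(np.abs(X) ** 2 + eps)
                        - np.log(np.abs(Y) ** 2 + eps)) ** 2)
    return dist

# Homogeneous distances at three resolutions, summed as on the slide.
configs = [(320, 80, 512), (80, 40, 128), (1920, 640, 2048)]   # placeholders
x = np.random.randn(16000)
y = np.random.randn(16000)
total_distance = sum(log_spectral_distance(x, y, *c) for c in configs)
```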
  • 62. 62 PRACTICE: NSF Common structure • Different NSF models, different neural filter modules Spectral features & F0 Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics FFTFraming FFT Framing iFFT De-framing Filter module Up samplingBi-LSTM CONV Cat. F0 Filter module
  • 63. 63 PRACTICE: NSF Common structure Spectral features & F0 Natural waveform Generated waveform Up sampling Noise FF Sine generator harmonics FFTFraming FFT Framing iFFT De-framing Filter module Up samplingBi-LSTM CONV Cat. F0 Filter module NSF models Baseline NSF (b-NSF) Simplified NSF (s-NSF) Harmonic-plus-noise NSF (hn-NSF) hn-NSF Ver.1 hn-NSF with ver.2 ICASSP 2019 Journal paper submitted SSW 2019
  • 64. 64 PRACTICE: NSF Baseline and simplified NSF • Baseline filter block follows WaveNet / ClariNet • Baseline filter block can be simplified Baseline filter block 1 Baseline filter block 2 Baseline filter block 5 … Simplified filter block 1 Simplified filter block 2 Simplified filter block 5 … b-NSF s-NSF simplify
  • 65. simplify 65 PRACTICE: NSF Baseline and simplified NSF Baseline filter block 2 Baseline filter block 5 … Simplified filter block 2 Simplified filter block 5 … b-NSF s-NSF   Element-wise multiplication Baseline filter block 1 Simplified filter block 1 Simplified filter block Dilated CONV +FF … FF Dilated CONV + Baseline filter block Dilated CONV + Tanh Sigmoid • FF FF FF + Dilated CONV + Tanh Sigmoid • FF FF + … + FF
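A sketch of a simplified (non-gated) dilated-convolution filter block in the spirit of slides 64 and 65, assuming the simplification amounts to dropping the tanh/sigmoid gating of the baseline block. The channel count, depth, the way the condition features enter, and the residual connection around the block are illustrative assumptions, not the exact s-NSF design.

```python
import torch

class SimplifiedFilterBlock(torch.nn.Module):
    """Non-gated dilated-convolution block: the tanh/sigmoid pair of the
    baseline (WaveNet/ClariNet-style) block is replaced by a single tanh."""
    def __init__(self, channels=64, num_layers=10):
        super().__init__()
        self.pre = torch.nn.Conv1d(1, channels, kernel_size=1)
        self.convs = torch.nn.ModuleList(
            torch.nn.Conv1d(channels, channels, kernel_size=3,
                            dilation=2 ** i, padding=2 ** i)
            for i in range(num_layers))
        self.post = torch.nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, signal, cond):
        # signal: (B, 1, T) output of the previous block; cond: (B, C, T) features
        h = self.pre(signal)
        for conv in self.convs:
            h = torch.tanh(conv(h) + cond)
        return signal + self.post(h)   # each block refines (adds to) its input

block = SimplifiedFilterBlock()
y = block(torch.randn(2, 1, 16000), torch.randn(2, 64, 16000))
```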
  • 66. 66 PRACTICE: NSF Baseline and simplified NSF • Both models: 1. Strong harmonics in high-frequency bands 2. Awful unvoiced (fricative) sounds • Model ‘overfitted’ to voiced sounds? Baseline filter block 1 Baseline filter block 2 Baseline filter block 5 … Simplified filter block 1 Simplified filter block 2 Simplified filter block 5 … b-NSF s-NSF simplify
  • 67. 67 PRACTICE: NSF Harmonic-plus-noise NSF  HP, LP: high- and low-pass finite-impulse-response (FIR) filter Baseline filter block 1 Baseline filter block 2 Baseline filter block 5 … Simplified filter block 1 Simplified filter block 2 Simplified filter block 5 … Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP b-NSF s-NSF hn-NSF simplify upgrade
  • 68. Baseline filter block 2 Baseline filter block 5 … Simplified filter block 2 Simplified filter block 5 … Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP b-NSF s-NSF hn-NSF simplification improvement Baseline filter block 1 Simplified filter block 1 Simplified filter block 1 68 PRACTICE: NSF Harmonic-plus-noise NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Maximum voicing frequency (MVF) hn-NSF
  • 69. 69 PRACTICE: NSF Harmonic-plus-noise NSF  Version I: choose MVF based on u/v Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag For voiced sounds For unvoiced sounds Condition module for hn-NSF Fixed MVFs
  • 70. 70 PRACTICE: NSF Harmonic-plus-noise NSF  Version II: predict MVF from input features • Predict MVF from condition module (SSW paper) • From MVF to FIR filter coefficients (SSW paper) MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF sinc Hamming window Gain norm. HP LP
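The windowed-sinc construction on this slide can be sketched directly: given a cutoff frequency such as the predicted MVF, build a truncated sinc, apply a Hamming window, and normalize the gain; the high-pass filter follows by spectral inversion. The tap count and the DC normalization below are illustrative choices, not the exact coefficients described in the SSW paper.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs=16000, half_taps=10):
    """Low-pass FIR from a cutoff (e.g. the MVF): truncated sinc * Hamming window,
    normalized to unit gain at DC. Filter length is 2 * half_taps + 1."""
    n = np.arange(-half_taps, half_taps + 1)
    h = 2.0 * cutoff_hz / fs * np.sinc(2.0 * cutoff_hz / fs * n)
    h *= np.hamming(len(n))
    return h / np.sum(h)

def highpass_fir(cutoff_hz, fs=16000, half_taps=10):
    """High-pass counterpart via spectral inversion (delta minus low-pass)."""
    h = -lowpass_fir(cutoff_hz, fs, half_taps)
    h[half_taps] += 1.0
    return h

mvf = 4000.0                 # hypothetical maximum voicing frequency in Hz
lp = lowpass_fir(mvf)        # applied to the harmonic (sine) branch
hp = highpass_fir(mvf)       # applied to the noise branch
```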
  • 71. 71 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 72. 72 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 73. 73 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 74. 74 PRACTICE: NSF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP u/v flag Condition module for hn-NSF MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 Simplified filter block 5 noise + HP LP Condition module for hn-NSF
  • 75. 75 PRACTICE: NSF Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module NSF is a deep-residual network
  • 76. 76 PRACTICE: NSF Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module NSF is a deep-residual network
  • 77. 77Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 78. 78Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 79. 79Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 80. 80Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 81. 81Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 82. 82Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 83. 83Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 84. 84Spectral features & F0 Up sampling Noise FF Sine generator harmonics Up samplingBi-LSTM CONV Cat. F0 MVF Simplified filter block 1 Simplified filter block 2 … Simplified filter block 5 noise + HP LP Simplified filter block 5 Condition module for proposed hn-NSF Source module PRACTICE: NSF NSF is a deep-residual network
  • 85. Configuration  Data and features  Models 85 PRACTICE: COMPARISON. Corpus: ATR Ximera F009 [1], size: 15 hours, note: 16 kHz, Japanese, neutral style. Features (dimension): Mel-generalized cepstrum coefficients (MGC, 60) or Mel-spectra (80), plus F0 (1). Models: WaveNet softmax, WaveNet Gaussian, b-NSF, s-NSF, hn-NSF with trainable MVF, hn-NSF with fixed MVF, WORLD vocoder
  • 86. Speech quality (ICASSP) • 245 paid evaluators, 1450 evaluation sets 86 PRACTICE: COMPARISON Copy-synthesis Pipeline TTS WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder WORLD vocoder WaveNet softmax WaveNet Gaussian b-NSF ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v1.html
  • 87. Speech quality (Journal paper submitted) • >150 paid evaluators • s-NSF did badly on unvoiced sounds 87 PRACTICE: COMPARISON ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v2.html WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder
  • 88. Speech quality (SSW 2019) • >150 paid evaluators 88 ☛ Samples, models, codes: https://nii-yamagishilab.github.io/samples-nsf/nsf-v3.html PRACTICE: COMPARISON ■ Copy-synthesis ■ Pipeline TTS WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder WaveNet softmax hn-NSF trainable MVF hn-NSF fixed MVF Natural
  • 89. Generation speed  Mem-save mode: allocate and release GPU memory layer by layer (limited by our CUDA implementation)  Normal mode: allocate GPU memory once 89 How many waveform points can be generated in 1 s (Tesla P100)? PRACTICE: COMPARISON WaveNet softmax b-NSF hn-NSF trainable MVF WaveNet Gaussian s-NSF hn-NSF fixed MVF WORLD vocoder
  • 90. CONTENTS 90 Introduction Theory Practice Summary • AR & flow-based models • Neural source filter model • WaveNet • Neural source-filter model • Beyond speech • Future work
  • 91. 91 SUMMARY AR model WaveRNN, SampleRNN, FFTNet, WaveNet, LPCNet, ExcitNet, GlotNet Multi-head CNN No AR, no flow Neural source-filter model (NSF) • No explicit Naïve model Inverse AR flow FloWaveNet, WaveGlow, ClariNet, Parallel WaveNet GELP
  • 92. 92 BEYOND SPEECH (c.f. HTS Slides, by HTS Working Group) Source module Filter module
  • 93. 93 BEYOND SPEECH Music performance  Training • URMP dataset1 o ground-truth F0 o 13 instruments o solo recording • One model for all instruments 1 University of Rochester Multi-Modal Music Performance (URMP) Dataset http://www2.ece.rochester.edu/projects/air/projects/URMP.html Neural waveform model F0 Mel-spectra
  • 94. Natural b-NSF S-NSF hn-NSF trainable MVF Violin Viola Oboe Trumpet Saxophone BEYOND SPEECH Music performance  Testing with natural Mel-spectra and F0 as input WaveNet
  • 95. Natural b-NSF S-NSF hn-NSF trainable MVF Horn Trombone Tuba Clarinet Flute BEYOND SPEECH Music performance  Testing with natural Mel-spectra and F0 as input
  • 96. 96 FUTURE DIRECTION (c.f. HTS Slides, by HTS Working Group)
  • 97. Questions & Comments are always Welcome! 97 https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 98. 98 REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658, 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
MCNN: S. Ö. Arık, H. Jun, and G. Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
GELP: L. Juvela, et al. GELP: GAN-excited linear prediction for speech synthesis from mel-spectrogram. In Proc. Interspeech, 2019.
  • 99. 99 REFERENCE By Lauri Juvela, Aalto University
  • 101. APPENDIX Training criterion [Figure: the framing operation written as a T × NM matrix X; each block of M columns holds one frame (1st, 2nd, …, Nth frame) of length M, offset by the frame shift, with zeros elsewhere]
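A small sketch of the framing step that the matrix above represents, assuming the final frame is zero-padded; it uses index arithmetic rather than an explicit T × NM matrix multiplication.

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Cut x into overlapping frames (the rows implied by the matrix above),
    zero-padding the tail so the last frame is complete."""
    n_frames = int(np.ceil(max(len(x) - frame_len, 0) / frame_shift)) + 1
    pad_len = (n_frames - 1) * frame_shift + frame_len - len(x)
    padded = np.concatenate([x, np.zeros(pad_len)])
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return padded[idx]                     # shape: (n_frames, frame_len)

frames = frame_signal(np.arange(10.0), frame_len=4, frame_shift=2)
```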
  • 102. Training criterion DFT Framing/ windowing DFT Framing/ windowing Generated waveform Natural waveform … N frames …K DFT bins K-points iDFT Frame Length M De-framing/ windowing inverse DFT De-framing /windowing Gradients Gradients w.r.t. zero-padded part Not used in de-framing/windowing Padding K-M Complex-value domain Real-value domain APPENDIX
  • 103. 103 FLOW-BASED MODELS Recap AR model  Consider a WaveNet using a Gaussian distribution 1. Because , we have 1 2 3 T 1 2 3 NN  z-1 denotes time delay NN z-1 H-1(.)
  • 105. 105 FLOW-BASED MODELS Recap AR model  Consider a WaveNet using a Gaussian distribution 2. Because , we have 3. Therefore  z-1 denotes time delay Triangle-matrix, as nt depends on o<t
  • 106. 106 FLOW-BASED MODELS Recap AR model  Consider a WaveNet using a Gaussian distribution • So:  z-1 denotes time delay
  • 107. 107 FLOW-BASED MODELS Inverse-AR flow 1. Because , we have  z-1 denotes time delay NN z-1 H-1(.) Triangle-matrix, as nt depends on ot
  • 108. 108 FLOW-BASED MODELS Inverse-AR flow 2. Therefore  z-1 denotes time delay NN z-1 H-1(.)
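Since the equation images on slides 103 through 108 did not survive extraction, the block below is a hedged LaTeX reconstruction of the change-of-variables results they appear to state, assuming the Gaussian parameterization used earlier in the deck.

```latex
% Hedged reconstruction; assumes o_t = mu_t + sigma_t * n_t with n_t ~ N(0, 1).
\begin{align*}
\text{AR flow: }\; & n_t = \frac{o_t - \mu_t(o_{<t})}{\sigma_t(o_{<t})}, \qquad
  \frac{\partial n_{1:T}}{\partial o_{1:T}}\ \text{is triangular with}\
  \frac{\partial n_t}{\partial o_t} = \frac{1}{\sigma_t(o_{<t})}, \\
& \log p(o_{1:T}) = \sum_{t=1}^{T}
  \Big[ \log \mathcal{N}(n_t; 0, 1) - \log \sigma_t(o_{<t}) \Big]. \\
\text{Inverse-AR flow: }\; & o_t = \mu_t(n_{<t}) + \sigma_t(n_{<t})\, n_t, \qquad
  \frac{\partial o_{1:T}}{\partial n_{1:T}}\ \text{is triangular with}\
  \frac{\partial o_t}{\partial n_t} = \sigma_t(n_{<t}), \\
& \log p(o_{1:T}) = \sum_{t=1}^{T}
  \Big[ \log \mathcal{N}(n_t; 0, 1) - \log \sigma_t(n_{<t}) \Big].
\end{align*}
```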
  • 109. 109 FLOW-BASED MODELS AR flow vs inverse-AR  z-1 denotes time delay NN z-1 H-1(.)NN z-1 H-1(.)
  • 110. 110 FLOW-BASED MODELS  z-1 denotes time delay NN z-1 H-1(.)NN z-1 H-1(.) AR flow AR flow vs inverse-AR Inverse-AR flow

Editor's Notes

  1. This work is licensed under the Creative Commons Attribution 3.0 License. All slides may be reused for non-commercial purposes provided full attribution is made to the National Institute of Informatics. (See http://creativecommons.org/ for details.)
  2. two reasons
  3. /work/smg/wang/PROJ/PROJS/TSNet/WaveModel/MODEL/continuous/trial003/output_trained_network.jsn/arctic_a0118.wav
  4. For neural network, AR ->
  5. For neural waveform modelling, AR -> AR + teacher forcing -> likelihood on better PDF
  6. Fast generation? Flow-based model Why Flow-based model is fast Not explain how to implement, but the link between the Flow and AR
  7. To be modified
  8. To be modified
  9. To be modified
  10. To be modified
  11. To be modified
  12. To be modified
  13. Mention other strategies such as temperature
  14. SAVE/misc/waveform_models/wavenet_sampling_strategy/outputs
  15. SAVE/misc/waveform_models/wavenet_sampling_strategy/outputs
  16. /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage1/epoch018_0.000.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage2/epoch012_0.500.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage2/epoch012_0.750.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/208/sys2/stage2/epoch012_1.000.wav /work/smg/wang/PROJ/NNWAV/WAVNET/F009/02/008/output_testset_mix_epoch013_mdn1.000000/ATR_Ximera_F009_AOZORAR_03372_T01.wav
  17. To be modified
  18. Many questions to ask How to design frequency-domain distance How to design source module How to design condition module / what input features should be used
  19. Meaning of this equation (frequency modulation)
  20. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  21. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  22. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  23. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  24. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  25. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  26. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  27. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  28. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  29. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/ana_data/m1/
  30. Gaussian WaveNet: continuous-valued waveforms NSF: continuous-valued waveforms
  31. Gaussian WaveNet: continuous-valued waveforms NSF: continuous-valued waveforms
  32. Gaussian WaveNet: continuous-valued waveforms NSF: continuous-valued waveforms
  33. /Users/wangxin/WORK/REMO/SAVE/ssw10-nsf-h-sinc/scripts-analysis/samples_presentation
  34. Training: efficient, not slow
  35. Training: efficient, not slow
  36. Training: efficient, not slow
  37. /work/smg/wang/PROJ/PROJS/NSF-Extented/URMP/project-CURRENNT-scripts/waveform-modeling/tmp_samples
  38. /work/smg/wang/PROJ/PROJS/NSF-Extented/URMP/project-CURRENNT-scripts/waveform-modeling/tmp_samples
  39. Training: efficient, not slow
  40. WaveNet 101 Just play one row