Slide 2 / 18
Text-To-Speech (TTS) synthesis using Deep Neural Networks (DNNs) [Zen et al., 2013]
[Figure: TTS pipeline — text → text analysis → linguistic features → DNN-based acoustic models → speech parameters → speech synthesis → speech]
To realize high-quality TTS,
studio-quality clean speech data is required for training the DNNs.
Slide 3
Outline of this talk
Goal: realizing high-quality TTS using NOISY speech data
Common approach: noise reduction before training DNNs
– Error caused by the noise reduction propagates into TTS training...
Proposed: training DNNs considering the noise additive process
– GAN*-based noise generation models are introduced into TTS training.
Results: improved synthetic speech quality
*Generative Adversarial Network [Goodfellow et al., 2014]
[Figure: conventional approach — noisy speech (observed) → noise reduction → clean speech (estimated) → TTS training; proposed approach — clean speech (unobserved) + noise addition via noise generation models → noisy speech (observed), used directly for TTS training]
Slide 4
Noise reduction using Spectral Subtraction (SS)*
The amplitude spectra after noise reduction, $y_\mathrm{s}^{(\mathrm{SS})}$, are calculated as:

$$
y_\mathrm{s}^{(\mathrm{SS})}(t,f) =
\begin{cases}
\sqrt{y_\mathrm{ns}^{2}(t,f) - \beta\,\bar{y}_\mathrm{n}^{2}(f)} & \text{if } y_\mathrm{ns}^{2}(t,f) - \beta\,\bar{y}_\mathrm{n}^{2}(f) > 0 \\
0 & \text{otherwise}
\end{cases}
$$
The estimated average noise power $\bar{y}_\mathrm{n}^{2}$ is defined as:

$$
\bar{y}_\mathrm{n}^{2}(f) = \frac{1}{T_\mathrm{n}} \sum_{t=1}^{T_\mathrm{n}} y_\mathrm{n}^{2}(t,f)
$$

($T_\mathrm{n}$: total frame length of the noise)
Limitations
– Approximating the noise distribution with its expectation value $\bar{y}_\mathrm{n}^{2}$
– Causing a trade-off between noise reduction and speech distortion due to the setting of the hyper-parameter $\beta$ (noise suppression ratio)
*[Boll, 1979]
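The two formulas above can be condensed into a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def spectral_subtraction(y_ns, y_n, beta=1.0):
    """Spectral subtraction (SS) sketch.
    y_ns: noisy amplitude spectra, shape (T, F)
    y_n:  noise amplitude spectra from a noise-only segment, shape (T_n, F)
    Returns estimated clean amplitude spectra, shape (T, F)."""
    # Average noise power per frequency bin: (1/T_n) * sum_t y_n(t, f)^2
    noise_power = np.mean(y_n ** 2, axis=0)      # shape (F,)
    # Subtract the scaled noise power from the noisy power spectra
    diff = y_ns ** 2 - beta * noise_power        # broadcasts over frames
    # Half-wave rectification: keep the positive part, floor the rest at 0
    return np.sqrt(np.maximum(diff, 0.0))
```

Raising `beta` suppresses more noise but also removes more speech energy, which is exactly the trade-off noted in the limitations above.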
Slide 5
Training TTS from noisy speech using SS
[Figure: conventional training pipeline — noisy amplitude spectra y_ns → noise reduction using SS → estimated clean amplitude spectra y_s^(SS) (target); the TTS model predicts clean amplitude spectra ŷ_s^(SS) from linguistic features; trained with the mean squared error L_MSE(ŷ_s^(SS), y_s^(SS))]
→ Minimize

$$
L_\mathrm{MSE}\bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}, \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr) =
\frac{1}{T}\bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr)^{\top}
\bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr)
$$

($T$: total frame length of the features)
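The criterion above is a squared Euclidean distance averaged over the $T$ frames. A minimal NumPy sketch (the function name and shapes are assumptions):

```python
import numpy as np

def mse_loss(y_pred, y_target):
    """L_MSE = (1/T) * (y_pred - y_target)^T (y_pred - y_target).
    y_pred, y_target: feature sequences of shape (T, F)."""
    diff = y_pred - y_target
    # Sum of squared errors over all frames and bins, divided by T
    return np.sum(diff * diff) / len(diff)
```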
Slide 6
Issues in training TTS using SS
1. Speech distortion caused by the error of SS
2. Propagation of the distortion by using y_s^(SS) as the target vector
[Figure: the SS-based training pipeline — y_ns → noise reduction using SS → y_s^(SS); TTS trained with L_MSE(ŷ_s^(SS), y_s^(SS))]
These issues significantly degrade synthetic speech quality...
Slide 8
Overview of the proposed algorithm
[Figure: proposed training pipeline — the TTS model predicts clean spectra ŷ_s from linguistic features; pre-trained noise generation models G_n(·) produce generated noise from prior noise; the noise addition step forms estimated noisy spectra ŷ_ns, trained against the observed noisy spectra y_ns with L_MSE(ŷ_ns, y_ns)]
We want G_n(·) to model the distribution of the observed noise.
Slide 9
Pre-training of noise generation models based on GANs

[Figure: GAN pre-training — non-speech periods are extracted from the noisy speech y_ns to obtain observed noise y_n; the noise generation models G_n(·) map prior noise to generated noise ŷ_n; the discriminative models D(·) distinguish observed noise (label 1) from generated noise (label 0)]

$$
\min_{G_\mathrm{n}} \max_{D} V(G_\mathrm{n}, D) =
\mathbb{E}\bigl[\log D(\boldsymbol{y}_\mathrm{n})\bigr] +
\mathbb{E}\bigl[\log\bigl(1 - D(\hat{\boldsymbol{y}}_\mathrm{n})\bigr)\bigr]
$$
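The value function can be evaluated empirically from discriminator outputs. A minimal sketch (the function name is an assumption; `d_real` and `d_fake` stand for D applied to observed and generated noise):

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Empirical GAN value V(G_n, D) = E[log D(y_n)] + E[log(1 - D(y_hat_n))].
    d_real: D's probabilities on observed noise (label 1), in (0, 1)
    d_fake: D's probabilities on generated noise (label 0), in (0, 1)
    D maximizes this value; G_n minimizes it."""
    return (np.mean(np.log(d_real + eps))
            + np.mean(np.log(1.0 - d_fake + eps)))
```

A well-trained discriminator (real scored near 1, fake near 0) pushes the value toward 0 from below; a fooled discriminator (both near 0.5) drives it toward $-2\log 2$, the equilibrium of the minimax game.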
Slide 11
Pre-training of noise generation models based on GANs (cont.)
This training minimizes the approximate Jensen–Shannon (JS) divergence between the distributions of the observed noise y_n and the generated noise ŷ_n.
Slide 12
Comparison of observed/generated noise
(generating Gaussian noise from uniform noise)
[Figure: spectrograms (frequency [kHz] vs. time [s]) and amplitude histograms of the observed and generated noise]
Our noise generation models effectively reproduce the characteristics of the observed noise!
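For intuition about the experiment above (generating Gaussian noise from uniform noise), the map an ideal generator would have to learn exists in closed form, e.g. the Box–Muller transform. This sketch is only an illustration of the target mapping, not the GAN itself:

```python
import numpy as np

def ideal_uniform_to_gaussian(u1, u2):
    """Box-Muller transform: maps two independent uniform samples in (0, 1]
    to one standard Gaussian sample - the kind of mapping an ideal
    generator trained on this task would approximate."""
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

rng = np.random.default_rng(0)
u1 = 1.0 - rng.random(100_000)   # shift to (0, 1] so log(u1) is finite
u2 = rng.random(100_000)
g = ideal_uniform_to_gaussian(u1, u2)
# g is empirically zero-mean with unit variance, matching the Gaussian target
```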
Slide 13
Discussion of proposed algorithm
Modeling the distribution of stationary noise by using GANs
– Musical noise [Miyazaki et al., 2012] (an unpleasant artifact) can be reduced.
– By using recurrent networks, the distribution of non-stationary noise can also be modeled by our algorithm.
Extending the proposed algorithm
– The distribution of context-dependent noise (e.g., pop noise) can be captured by using conditional GANs [Mirza et al., 2014].
– By using WaveNet [Oord et al., 2016], the noise distribution can be modeled in the waveform domain.
Adapting TTS or noise generation models
– Pre-recorded clean speech data can be used to build the initial models used in our algorithm.
Slide 19
Conclusion
Purpose
– Training high-quality TTS using noisy speech data
Proposed
– A training algorithm considering the noise additive process
• Our noise generation models can learn the distribution of the observed noise through GAN-based training.
Results
– Improved synthetic speech quality compared with TTS using SS
Future work
– Modeling non-stationary noise with the proposed algorithm
• Using richer DNN architectures (e.g., long short-term memory)
– Comparing our algorithm with state-of-the-art noise suppression
Thank you for your attention!