Slide 2 / 18
Text-To-Speech (TTS) synthesis using Deep Neural Networks (DNNs) [Zen et al., 2013]
[Figure: TTS pipeline — text → text analysis → linguistic features → DNN-based acoustic models → speech parameters → speech synthesis → speech]
To realize high-quality TTS,
studio-quality clean speech data is required for training the DNNs.
Slide 3
Outline of this talk
Goal: realizing high-quality TTS using NOISY speech data
Common approach: noise reduction before training DNNs
– Error caused by the noise reduction propagates into TTS training...
Proposed: training DNNs considering the noise additive process
– GAN*-based noise generation models are introduced into TTS training.
Results: improved synthetic speech quality
*Generative Adversarial Network [Goodfellow et al., 2014]
[Figure: conventional approach — noisy speech (observed) → noise reduction → clean speech (estimated) → TTS training; proposed approach — clean speech (unobserved) + noise addition via noise generation models → noisy speech (observed), used directly for TTS training]
Slide 4
Noise reduction using Spectral Subtraction (SS)*
The amplitude spectra after noise reduction, $y_\mathrm{s}^{(\mathrm{SS})}$, are calculated as:

$$
y_\mathrm{s}^{(\mathrm{SS})}(t,f) =
\begin{cases}
\sqrt{y_\mathrm{ns}^{2}(t,f) - \beta\,\bar{y}_\mathrm{n}^{2}(f)} & \text{if } y_\mathrm{ns}^{2}(t,f) - \beta\,\bar{y}_\mathrm{n}^{2}(f) > 0 \\
0 & \text{otherwise}
\end{cases}
$$
The estimated average noise power $\bar{y}_\mathrm{n}^{2}$ is defined as:

$$
\bar{y}_\mathrm{n}^{2}(f) = \frac{1}{T_\mathrm{n}} \sum_{t=1}^{T_\mathrm{n}} y_\mathrm{n}^{2}(t,f)
$$

($T_\mathrm{n}$: total frame length of the noise)
Limitations
– Approximating the noise distribution with its expectation value $\bar{y}_\mathrm{n}^{2}$
– Causing a trade-off between noise reduction and speech distortion due to the setting of the hyper-parameter $\beta$ (noise suppression ratio)
*[Boll, 1979]
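The two formulas above can be condensed into a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def spectral_subtraction(y_ns, y_n, beta=1.0):
    """Spectral subtraction (SS) sketch.
    y_ns: noisy amplitude spectra, shape (T, F)
    y_n:  noise amplitude spectra from a noise-only segment, shape (T_n, F)
    Returns estimated clean amplitude spectra, shape (T, F)."""
    # Average noise power per frequency bin: (1/T_n) * sum_t y_n(t, f)^2
    noise_power = np.mean(y_n ** 2, axis=0)      # shape (F,)
    # Subtract the scaled noise power from the noisy power spectra
    diff = y_ns ** 2 - beta * noise_power        # broadcasts over frames
    # Half-wave rectification: keep the positive part, floor the rest at 0
    return np.sqrt(np.maximum(diff, 0.0))
```

Raising `beta` suppresses more noise but also removes more speech energy, which is exactly the trade-off noted in the limitations above.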
Slide 5
Training TTS from noisy speech using SS
[Figure: conventional training pipeline — noisy amplitude spectra y_ns → noise reduction using SS → estimated clean amplitude spectra y_s^(SS) (target); the TTS model predicts clean amplitude spectra ŷ_s^(SS) from linguistic features; trained with the mean squared error L_MSE(ŷ_s^(SS), y_s^(SS))]
→ Minimize

$$
L_\mathrm{MSE}\bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}, \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr) =
\frac{1}{T}\bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr)^{\top}
\bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr)
$$

($T$: total frame length of the features)
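The criterion above is a squared Euclidean distance averaged over the $T$ frames. A minimal NumPy sketch (the function name and shapes are assumptions):

```python
import numpy as np

def mse_loss(y_pred, y_target):
    """L_MSE = (1/T) * (y_pred - y_target)^T (y_pred - y_target).
    y_pred, y_target: feature sequences of shape (T, F)."""
    diff = y_pred - y_target
    # Sum of squared errors over all frames and bins, divided by T
    return np.sum(diff * diff) / len(diff)
```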
Slide 6
Issues in training TTS using SS
1. Speech distortion caused by the error of SS
2. Propagation of the distortion by using y_s^(SS) as the target vector
[Figure: the SS-based training pipeline — y_ns → noise reduction using SS → y_s^(SS); TTS trained with L_MSE(ŷ_s^(SS), y_s^(SS))]
These issues significantly degrade synthetic speech quality...
Slide 8
Overview of the proposed algorithm
[Figure: proposed training pipeline — the TTS model predicts clean spectra ŷ_s from linguistic features; pre-trained noise generation models G_n(·) produce generated noise from prior noise; the noise addition step forms estimated noisy spectra ŷ_ns, trained against the observed noisy spectra y_ns with L_MSE(ŷ_ns, y_ns)]
We want G_n(·) to model the distribution of the observed noise.
Slide 9
Pre-training of noise generation models based on GANs

[Figure: GAN pre-training — non-speech periods are extracted from the noisy speech y_ns to obtain observed noise y_n; the noise generation models G_n(·) map prior noise to generated noise ŷ_n; the discriminative models D(·) distinguish observed noise (label 1) from generated noise (label 0)]

$$
\min_{G_\mathrm{n}} \max_{D} V(G_\mathrm{n}, D) =
\mathbb{E}\bigl[\log D(\boldsymbol{y}_\mathrm{n})\bigr] +
\mathbb{E}\bigl[\log\bigl(1 - D(\hat{\boldsymbol{y}}_\mathrm{n})\bigr)\bigr]
$$
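The value function can be evaluated empirically from discriminator outputs. A minimal sketch (the function name is an assumption; `d_real` and `d_fake` stand for D applied to observed and generated noise):

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Empirical GAN value V(G_n, D) = E[log D(y_n)] + E[log(1 - D(y_hat_n))].
    d_real: D's probabilities on observed noise (label 1), in (0, 1)
    d_fake: D's probabilities on generated noise (label 0), in (0, 1)
    D maximizes this value; G_n minimizes it."""
    return (np.mean(np.log(d_real + eps))
            + np.mean(np.log(1.0 - d_fake + eps)))
```

A well-trained discriminator (real scored near 1, fake near 0) pushes the value toward 0 from below; a fooled discriminator (both near 0.5) drives it toward $-2\log 2$, the equilibrium of the minimax game.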
Slide 11
Pre-training of noise generation models based on GANs (cont.)
This training minimizes the approximate Jensen–Shannon (JS) divergence between the distributions of the observed noise y_n and the generated noise ŷ_n.
Slide 12
Comparison of observed/generated noise
(generating Gaussian noise from uniform noise)
[Figure: spectrograms (frequency [kHz] vs. time [s]) and amplitude histograms of the observed and generated noise]
Our noise generation models effectively reproduce the characteristics of the observed noise!
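For intuition about the experiment above (generating Gaussian noise from uniform noise), the map an ideal generator would have to learn exists in closed form, e.g. the Box–Muller transform. This sketch is only an illustration of the target mapping, not the GAN itself:

```python
import numpy as np

def ideal_uniform_to_gaussian(u1, u2):
    """Box-Muller transform: maps two independent uniform samples in (0, 1]
    to one standard Gaussian sample - the kind of mapping an ideal
    generator trained on this task would approximate."""
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

rng = np.random.default_rng(0)
u1 = 1.0 - rng.random(100_000)   # shift to (0, 1] so log(u1) is finite
u2 = rng.random(100_000)
g = ideal_uniform_to_gaussian(u1, u2)
# g is empirically zero-mean with unit variance, matching the Gaussian target
```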
Slide 13
Discussion of proposed algorithm
Modeling the distribution of stationary noise by using GANs
– Musical noise [Miyazaki et al., 2012] (an unpleasant artifact) can be reduced.
– By using recurrent networks, the distribution of non-stationary noise can also be modeled by our algorithm.
Extending the proposed algorithm
– The distribution of context-dependent noise (e.g., pop noise) can be captured by using conditional GANs [Mirza et al., 2014].
– By using WaveNet [Oord et al., 2016], the noise distribution can be modeled in the waveform domain.
Adapting TTS or noise generation models
– Pre-recorded clean speech data can be used to build the initial models used in our algorithm.
Slide 19
Conclusion
Purpose
– Training high-quality TTS using noisy speech data
Proposed
– A training algorithm considering the noise additive process
• Our noise generation models can learn the distribution of the observed noise through GAN-based training.
Results
– Improved synthetic speech quality compared with TTS using SS
Future work
– Modeling non-stationary noise with the proposed algorithm
• Using richer DNN architectures (e.g., long short-term memory)
– Comparing our algorithm with state-of-the-art noise suppression
Thank you for your attention!