©Yuki Saito, 07/03/2017
TRAINING ALGORITHM TO DECEIVE
ANTI-SPOOFING VERIFICATION
FOR DNN-BASED SPEECH SYNTHESIS
Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari
(The University of Tokyo)
ICASSP 2017 SP-L4.2
Outline of This Talk

• Issue: quality degradation in statistical parametric speech synthesis
  due to over-smoothing of the speech params.
• Countermeasures: reproducing natural statistics
  – 2nd moment (a.k.a. Global Variance: GV) [Toda et al., 2007]
  – Histogram [Ohtani et al., 2012]
• Proposed: training algorithm to deceive an Anti-Spoofing Verification (ASV)
  for DNN-based speech synthesis
  – Tries to deceive the ASV, which distinguishes natural / synthetic speech.
  – Compensates the distribution difference between natural / synthetic speech.
• Results:
  – Improves the synthetic speech quality.
  – Is robust against its hyper-parameter setting.
Conventional Training Algorithm:
Minimum Generation Error (MGE) Training [Wu et al., 2016]

[Diagram: linguistic feats. are fed frame by frame (t = 1, ..., T) into the DNN acoustic
models, which output static-dynamic mean vectors; ML-based parameter generation converts
them into the generated speech params. ĉ, which are compared with the natural speech
params. c through the generation error L_G(c, ĉ).]

Generation error to be minimized:

    L_\mathrm{G}(\mathbf{c}, \hat{\mathbf{c}}) = \frac{1}{T} (\hat{\mathbf{c}} - \mathbf{c})^\top (\hat{\mathbf{c}} - \mathbf{c}) \rightarrow \text{Minimize}
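For concreteness, here is a minimal NumPy sketch of this generation error. The function
name and the array layout (T frames × D dims of stacked static-dynamic features) are our
own choices, and ML-based parameter generation itself is omitted.

```python
import numpy as np

def mge_generation_error(c_nat, c_gen):
    """Generation error L_G(c, c_hat): squared error between natural and generated
    speech parameter sequences of shape (T, D), averaged over the T frames."""
    diff = (c_gen - c_nat).reshape(-1)          # stack all frames into one long vector
    return float(diff @ diff) / c_nat.shape[0]  # (1/T) * (c_hat - c)^T (c_hat - c)
```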
Issue of MGE Training:
Over-smoothing of Generated Speech Parameters

[Figure: scatter plots of the 21st vs. 23rd mel-cepstral coefficients for natural speech
and for MGE-generated speech; the MGE distribution is much narrower than the natural one.]

These distributions are significantly different...
(GV [Toda et al., 2007] explicitly compensates the 2nd moment.)
Proposed algorithm:
Training Algorithm to Deceive
Anti-Spoofing Verification (ASV)
Anti-Spoofing Verification (ASV):
Discriminator to Prevent Spoofing Attacks with Speech [Wu et al., 2016] [Chen et al., 2015]

[Diagram: natural speech params. c and generated speech params. ĉ are passed through the
feature function φ(·) (here, φ(c_t) = c_t) into the ASV D(·), which outputs 1 for natural
and 0 for generated speech.]

The ASV is trained to minimize the cross entropy:

    L_\mathrm{D}(\mathbf{c}, \hat{\mathbf{c}})
      = \underbrace{-\frac{1}{T} \sum_{t=1}^{T} \log D(\phi(\mathbf{c}_t))}_{L_\mathrm{D,1}(\mathbf{c})}
        \underbrace{-\frac{1}{T} \sum_{t=1}^{T} \log\left(1 - D(\phi(\hat{\mathbf{c}}_t))\right)}_{L_\mathrm{D,0}(\hat{\mathbf{c}})}
        \rightarrow \text{Minimize}

L_D,1(c): loss to recognize natural speech as natural.
L_D,0(ĉ): loss to recognize generated speech as generated.
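A minimal NumPy sketch of this cross entropy, assuming the per-frame ASV outputs
D(φ(c_t)) and D(φ(ĉ_t)) are already available as arrays; the epsilon guard and the
function name are our additions.

```python
import numpy as np

def asv_cross_entropy(d_natural, d_generated):
    """ASV loss L_D(c, c_hat) = L_D,1(c) + L_D,0(c_hat).
    d_natural / d_generated: per-frame ASV outputs D(.) in (0, 1)."""
    eps = 1e-12                                        # guard against log(0)
    l_d1 = -np.mean(np.log(d_natural + eps))           # natural recognized as natural
    l_d0 = -np.mean(np.log(1.0 - d_generated + eps))   # generated recognized as generated
    return l_d1 + l_d0
```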
Training Algorithm to Deceive ASV

[Diagram: as in MGE training, linguistic feats. go through the acoustic models and
ML-based parameter generation to yield the generated speech params. ĉ. Besides the
generation error L_G(c, ĉ), the generated params. are fed through the feature function
φ(·) into the ASV D(·), giving L_D,1(ĉ), the loss to recognize generated speech as natural.]

Loss for the acoustic models:

    L(\mathbf{c}, \hat{\mathbf{c}})
      = L_\mathrm{G}(\mathbf{c}, \hat{\mathbf{c}})
        + \omega_\mathrm{D} \frac{E_{L_\mathrm{G}}}{E_{L_\mathrm{D}}} L_\mathrm{D,1}(\hat{\mathbf{c}})
        \rightarrow \text{Minimize}

ω_D: weight; E_{L_G}, E_{L_D}: expectation values of L_G(c, ĉ) and L_D,1(ĉ), respectively.
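A NumPy sketch of this combined loss under the same assumptions as the sketches above;
the scale factor E_{L_G} / E_{L_D} is passed in as precomputed constants, and all names
are illustrative rather than the authors' own.

```python
import numpy as np

def proposed_loss(c_nat, c_gen, d_generated, omega_d, e_lg, e_ld):
    """L(c, c_hat) = L_G(c, c_hat) + omega_D * (E_LG / E_LD) * L_D,1(c_hat)."""
    eps = 1e-12
    t = c_nat.shape[0]
    l_g = float(((c_gen - c_nat) ** 2).sum()) / t      # generation error L_G
    l_d1 = -float(np.mean(np.log(d_generated + eps)))  # generated recognized as natural
    return l_g + omega_d * (e_lg / e_ld) * l_d1        # ratio balances the two terms
```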
Iterative Optimization of Acoustic Models and ASV

• ① Update the acoustic models (ASV fixed):
  – Minimize L_G(c, ĉ) plus the weighted L_D,1(ĉ), so that the generated params. ĉ
    are recognized as natural (label 1) by the fixed ASV.
• ② Update the ASV (acoustic models fixed):
  – Minimize the cross entropy L_D(c, ĉ), so that the ASV again distinguishes
    natural (1) from generated (0) speech params.

By iterating ① and ②, we construct the final acoustic models!
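A rough PyTorch sketch of one round of these alternating updates, under simplifying
assumptions that are ours: the acoustic model is taken to output speech params. directly
(ML-based parameter generation and the feature function are omitted), and `acoustic_model`,
`asv`, and the two optimizers are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

def train_round(acoustic_model, asv, opt_am, opt_asv,
                ling_feats, c_nat, omega_d, e_lg, e_ld):
    bce = nn.BCELoss()
    ones = torch.ones(c_nat.shape[0], 1)
    zeros = torch.zeros(c_nat.shape[0], 1)

    # (1) Update the acoustic models while the ASV is fixed.
    c_gen = acoustic_model(ling_feats)
    l_g = ((c_gen - c_nat) ** 2).sum() / c_nat.shape[0]
    l_d1 = bce(asv(c_gen), ones)                  # deceive the ASV: generated -> "natural"
    loss_am = l_g + omega_d * (e_lg / e_ld) * l_d1
    opt_am.zero_grad(); loss_am.backward(); opt_am.step()

    # (2) Update the ASV while the acoustic models are fixed.
    c_gen = acoustic_model(ling_feats).detach()   # stop gradients into the acoustic models
    loss_asv = bce(asv(c_nat), ones) + bce(asv(c_gen), zeros)
    opt_asv.zero_grad(); loss_asv.backward(); opt_asv.step()
```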
Discussions of Proposed Algorithm

• Compensation of speech feats. through the feature function:
  – Automatically derived feats., such as auto-encoded feats.
  – Conventional analytically derived feats., such as the GV
• Loss function for training the acoustic models:
  – Combination of MGE training and adversarial training [Goodfellow et al., 2014]
• Effect of the adversarial training:
  – Minimizes the Jensen-Shannon divergence between the distributions of natural
    and generated data.
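The slides only state this divergence result; the following reminder of where it comes
from follows [Goodfellow et al., 2014] and is not part of the original presentation. For a
fixed generator, the optimal discriminator and the resulting adversarial objective are:

```latex
\[
  D^{*}(\mathbf{x}) = \frac{p_\mathrm{nat}(\mathbf{x})}{p_\mathrm{nat}(\mathbf{x}) + p_\mathrm{gen}(\mathbf{x})},
  \qquad
  C(G) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_\mathrm{nat} \,\|\, p_\mathrm{gen}\right),
\]
```

so minimizing the adversarial term drives the generated distribution p_gen toward the
natural one p_nat.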
Distributions of Speech Parameters

[Figure: scatter plots of the 21st vs. 23rd mel-cepstral coefficients for natural, MGE,
and proposed; the MGE distribution is narrow, while the proposed one is as wide as that
of natural speech.]

Our algorithm alleviates the over-smoothing effect!
Compensation of Global Variance

• Global Variance (GV) [Toda et al., 2007]:
  – 2nd moment of the parameter distribution

[Figure: GV (log scale, 10^-4 to 10^1) per feature index for natural, MGE, and proposed;
the proposed GVs are close to the natural ones, while the MGE GVs are much smaller.]

The GV is NOT used for training, but it is compensated by deceiving the ASV!
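A one-line NumPy sketch of the GV as used here (the per-dimension variance of a
parameter sequence); the shape convention is our assumption.

```python
import numpy as np

def global_variance(c):
    """Global Variance (GV) [Toda et al., 2007]: per-dimension variance of a
    speech parameter sequence c of shape (T frames, D dims)."""
    return np.var(c, axis=0)
```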
Additional Effect:
Alleviation of Unnaturally Strong Correlation

• Maximal Information Coefficient (MIC) [Reshef et al., 2011]:
  – Quantifies the nonlinear correlation between two variables
  – Natural speech params. tend to have weak correlations [Ijima et al., 2016]

[Figure: MIC values (0.0 = weak to 1.0 = strong) among mel-cepstral coefficients
(indices 0-24) for natural, MGE, and proposed; MGE shows unnaturally strong correlations,
while the proposed correlations are weak, as in natural speech.]

The proposed algorithm not only compensates the GV,
but also makes the correlations among speech params. natural!
Experimental Evaluations
Experimental Conditions

Dataset:                 ATR Japanese speech database (phonetically balanced 503 sentences)
Train / evaluation data: 450 sentences / 53 sentences (16 kHz sampling)
Linguistic feats.:       274-dimensional vector (phoneme, accent type, frame position, etc.)
Speech params.:          Mel-cepstral coefficients (0th through 24th), F0, 5-band aperiodicity
Predicted params.:       Mel-cepstral coefficients (the others were NOT predicted)
Optimization algorithm:  AdaGrad [Duchi et al., 2011] (learning rate: 0.01)
Acoustic models:         Feed-forward, 274 – 3x400 (ReLU) – 75 (linear)
ASV:                     Feed-forward, 25 – 2x200 (ReLU) – 1 (sigmoid)
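For reference, the two layer configurations above can be written down directly; this
PyTorch sketch mirrors only the listed sizes and activations, and the framework choice
is ours, not the authors'.

```python
import torch.nn as nn

# Acoustic models: 274 -> 3x400 (ReLU) -> 75 (linear)
acoustic_model = nn.Sequential(
    nn.Linear(274, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 75),
)

# ASV: 25 -> 2x200 (ReLU) -> 1 (sigmoid)
asv = nn.Sequential(
    nn.Linear(25, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 1), nn.Sigmoid(),
)
```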
Initialization, Training, and Objective Evaluation

• Initialization:
  – Acoustic models: conventional MGE training
  – ASV: trained to distinguish natural / generated speech after the MGE training
• Training:
  – Acoustic models: updated with the proposed algorithm
  – ASV: retrained to distinguish natural / generated speech after updating the
    acoustic models
• Objective evaluation (a small sketch of the spoofing rate follows this list):
  – Generation loss L_G(c, ĉ) and spoofing rate:

    Spoofing rate = (# of spoofing synthetic speech params.) / (total # of synthetic speech params.)

  – We calculated these values with various ω_D.
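A minimal NumPy sketch of the spoofing rate as the fraction of generated params. that
the ASV accepts as natural; the 0.5 decision threshold is our assumption, not stated
on the slide.

```python
import numpy as np

def spoofing_rate(d_generated, threshold=0.5):
    """Fraction of generated speech params. judged 'natural' by the ASV,
    i.e. D(phi(c_hat_t)) > threshold (threshold = 0.5 is an assumption)."""
    return float(np.mean(np.asarray(d_generated) > threshold))
```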
Results of Objective Evaluations

[Figure: generation loss and spoofing rate plotted against the weight ω_D (0.0 to 1.0).
The generation loss got worse as ω_D increased, while the spoofing rate got better,
exceeding 99% when ω_D > 0.3.]

Our algorithm makes the generation loss worse,
but it can train the acoustic models to deceive the ASV!
Results of Subjective Evaluations in Terms of Speech Quality

[Figure: preference scores (w/ 8 listeners) for Proposed (ω_D = 1.0), Proposed (ω_D = 0.3),
and MGE (ω_D = 0.0). Both proposed settings got better scores than MGE, with NO significant
difference between the two proposed settings. Error bars denote 95% confidence intervals.]

Our algorithm improves the synthetic speech quality
and is robust against its hyper-parameter setting!

Speech samples: http://sython.org/demo/icassp2017advtts/demo.html
Conclusion

• Purpose:
  – Improving the speech quality of statistical parametric speech synthesis
• Proposed:
  – Training algorithm to deceive an ASV
    • Compensates the difference between the distributions of natural and generated
      speech params. using adversarial training
• Results:
  – Improved the speech quality compared to conventional MGE training
  – Worked robustly against its hyper-parameter setting
• Future work:
  – Devising temporally and linguistically dependent ASVs
  – Extending our algorithm to generate F0 and duration