fujii22apsipa_asc

AVATAR SYMBIOTIC SOCIETY
Adaptive End-to-End Text-to-Speech Synthesis
Based on Error Correction Feedback from Humans
Kazuki Fujii, Yuki Saito, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 ThPM1-3.5
7 - 10 Nov. 2022・Chiang Mai, Thailand
1 / 17

Research background
2 / 17
• End-to-end (E2E) text-to-speech (TTS) [Wang+17]
・High quality but low controllability
・Difficulty in correcting accent errors
(a cause of miscommunication, especially in a pitch-accent language such as Japanese)
[Figure: E2E TTS model converting the text "a m e" into speech, with the pitch (High/Low) of the morae "a" and "me"]
Prosody modeling and control in E2E TTS is a very important and challenging task.

Improving controllability using linguistic features
3 / 17
• Examples of linguistic features
・Full-context labels [Okamoto+19], [Okamoto+20]
・Phonetic and prosodic labels [Kurihara+21]
• High controllability, but prerequisite expertise is needed
“Adaptability”, the ability to easily correct mistakes,
is not taken into consideration.
Input text → Text analysis → Linguistic features → E2E TTS → Synthesized speech

Outline of this talk
4 / 17
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrects accent errors.
・Our method successfully achieves the same quality of synthetic speech
as the conventional method.

Overview of proposed TTS model
5 / 17
• Backbone TTS model: FastSpeech2 [Ren+21] with a trainable prosody predictor

Backbone TTS model: FastSpeech2
6 / 17
• FastSpeech2 [Ren+21]
・Modules for generating a mel-spectrogram of speech from a phoneme sequence
・Stable learning and inference
[Figure: FastSpeech2 architecture; the Variance Adaptor comprises duration, pitch, and energy predictors]

Proposed method
7 / 17
• Prosody predictor: DNN to estimate pitch changes per syllable (mora)
• Input: Phoneme embedding + Word embedding from BERT [Devlin+19]
• Output: One of three prosodic symbols ("[" / "]": raising / lowering pitch; "_": keeping accent unchanged)
• Loss function: MSE between predicted symbols and ground truth (text analysis results)

Overview of proposed HITL framework
8 / 17
• Motivation: Improve the adaptability of synthetic speech
• Approach: Involve multiple listeners in the process of correcting accent errors
[Figure: multiple listeners correct the accent of the morae "a" and "me" in synthetic speech]

HITL accent error correction: Feedback aggregation
9 / 17
• Challenge: Listeners differ in their error correction abilities
・Simple way: Choose one from multiple accent annotations
[Figure: a random selector may pick a bad accent, resulting in low-quality speech]
・Proposed: Aggregate in the following ways (actual accent is Low (L) High (H) H H)
Annotator    Annotated accent    Estimated ability by MACE
Listener 1   L L L L             Low ability
Listener 2   L L L L             Low ability
Listener 3   L H H H             High ability
Aggregation method   Aggregated accent
Mode                 L L L L
MACE [Hovy+13]       L H H H
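The Mode row of the example above can be reproduced with a short sketch. It assumes the mode is taken per mora position (the slides do not say whether aggregation is per position or per whole sequence), and the function name `aggregate_by_mode` is hypothetical. MACE, which additionally estimates each listener's ability, is not implemented here.

```python
from collections import Counter


def aggregate_by_mode(annotations):
    """Return the most frequent accent label at each mora position."""
    if len({len(a) for a in annotations}) != 1:
        raise ValueError("all annotations must cover the same number of morae")
    # zip(*annotations) yields one column of labels per mora position.
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]


annotations = [
    ["L", "L", "L", "L"],  # Listener 1
    ["L", "L", "L", "L"],  # Listener 2
    ["L", "H", "H", "H"],  # Listener 3
]
print(aggregate_by_mode(annotations))  # ['L', 'L', 'L', 'L']
```

With two low-ability listeners agreeing on the wrong pattern, the mode follows the majority and misses the actual accent L H H H, which is the case where an ability-aware aggregator such as MACE helps.
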

Experimental evaluation
10 / 17
• Evaluation targets
・TTS model
・HITL framework
• Objective evaluation criterion
・RMSE of the logF0 between synthetic and natural speech
• Subjective evaluation criterion
・Mean Opinion Score (MOS) test
  ・Listeners rated the naturalness of each sample on a 5-point scale (1: very poor to 5: very good).
  ・The number of listeners was 50, and each listened to 20 speech samples.
・Preference AB test
  ・Listeners evaluated 10 pairs of speech samples synthesized by a specific method-pair.
  ・The number of listeners for each AB test was 25.
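The objective criterion, RMSE of logF0 between synthetic and natural speech, can be sketched as follows. The slides do not describe frame alignment or unvoiced-frame handling, so this sketch assumes two already time-aligned F0 contours in Hz, with 0 marking unvoiced frames, and scores only frames voiced in both. The function name `logf0_rmse` is hypothetical.

```python
import math


def logf0_rmse(f0_synth, f0_natural):
    """RMSE of log F0 over frames that are voiced (F0 > 0) in both contours."""
    if len(f0_synth) != len(f0_natural):
        raise ValueError("contours must be time-aligned to the same length")
    squared_errors = [
        (math.log(s) - math.log(n)) ** 2
        for s, n in zip(f0_synth, f0_natural)
        if s > 0 and n > 0
    ]
    if not squared_errors:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up contours: the third frame is unvoiced in the synthetic speech.
print(logf0_rmse([100.0, 200.0, 0.0], [100.0, 100.0, 150.0]))
```

A lower value means the synthetic pitch contour is closer to the natural one.
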

Experimental setting for TTS model
11 / 17
• Experimental conditions for TTS model
TTS model: FastSpeech2 [Ren+21]
Train/eval. data of TTS model and prosody predictor: JSUT corpus [Takamichi+20], BASIC5000 subset, 4,488 / 512 sentences
• Compared methods for TTS model
・FS2
・FS2+Symbol
・FS2+Predictor (target)
・FS2+Predictor … Proposed

Result of objective evaluation
12 / 17
• Objective evaluation of TTS model
[Figure: objective evaluation results per method on a Good/Bad axis; annotations: "worsen", "comparable"]
・"worsen": This means that the prediction error of the prosody predictor significantly degrades quality.
・"comparable": This suggests that natural speech can be synthesized if the correct accent is obtained through feedback.

Result of subjective evaluation
13 / 17
• Subjective evaluation of TTS model
・Preference score for prosody naturalness of synthetic speech
… BOLD denotes a significant difference between the two methods
Compared method                          Preference score
FS2 vs. FS2+Predictor                    0.552 vs. 0.448
FS2 vs. FS2+Predictor (target)           0.268 vs. 0.732
FS2+Symbol vs. FS2+Predictor             0.768 vs. 0.232
FS2+Symbol vs. FS2+Predictor (target)    0.520 vs. 0.480
FS2 vs. FS2+Symbol                       0.248 vs. 0.752
[Slide annotations: "better" (two comparisons), "No significant difference" (two comparisons)]
These results suggest that our method improves the prosodic naturalness of synthetic speech
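One way to check whether a preference score differs significantly from chance is an exact two-sided binomial test, sketched below. The slides do not state which test the authors used, so this is an assumption. The figure of 250 judgments per comparison is also an assumption (25 listeners times 10 pairs, pooled), and the function name `binomial_two_sided_p` is hypothetical.

```python
from math import comb


def binomial_two_sided_p(wins, trials):
    """Exact two-sided p-value for H0: both methods are preferred equally often."""
    extreme = max(wins, trials - wins)
    # Number of outcomes at least as extreme in one tail (exact integers).
    tail = sum(comb(trials, i) for i in range(extreme, trials + 1))
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * tail / 2 ** trials)


TRIALS = 250  # assumption: 25 listeners x 10 pairs, all judgments pooled

for label, score in [
    ("FS2 vs. FS2+Predictor", 0.552),
    ("FS2 vs. FS2+Predictor (target)", 0.732),
]:
    wins = round(score * TRIALS)
    print(label, wins, binomial_two_sided_p(wins, TRIALS))
```

A p-value below the chosen threshold (commonly 0.05) would correspond to a comparison marked as significant.
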

Experimental setting for HITL framework
14 / 17
• Experimental conditions for HITL framework
Listener hiring platform: Lancers (crowdsourcing platform)
Dataset, participants: Japanese female speaker (JSUT corpus; data not used for training the TTS models and prosody predictor), 100 sentences, 15 persons / sentence (= 1,500 crowdworkers)
HITL framework interface: Radio buttons initialized with text analysis-derived accent information
• Compared methods for HITL framework
・Mode, MACE, Best, Median, Worst
[Figure: the accents input by 15 crowdworkers (e.g., "L H H H L L ....", "H H H H L L ....", "L H H H H L ....") are each synthesized by the E2E TTS; the outputs are sorted by logF0 RMSE against the ground truth, from Good to Bad, to pick Best, Median, and Worst]
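The Best, Median, and Worst baselines can be sketched as a sort over the crowdworkers' inputs. This assumes each input accent sequence already has a logF0 RMSE against the ground truth; the function name `pick_best_median_worst` and the RMSE values in the example are made up, and taking the middle element of the sorted list as the median is an assumption.

```python
def pick_best_median_worst(candidates):
    """Select baselines from (accent_sequence, logf0_rmse) pairs.

    Lower RMSE is better, so the sorted list runs from Good to Bad.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return {
        "Best": ranked[0][0],
        "Median": ranked[len(ranked) // 2][0],
        "Worst": ranked[-1][0],
    }


# Three of the 15 inputs per sentence, with made-up RMSE values.
candidates = [
    ("H H H H L L", 0.31),
    ("L H H H L L", 0.12),
    ("L H H H H L", 0.20),
]
print(pick_best_median_worst(candidates))
```

With 15 crowdworkers per sentence, the middle index is 7, so Median is the 8th-ranked input.
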

• Objective evaluation of HITL framework
15 / 17
[Figure: objective evaluation results per method on a Good/Bad axis, with the methods w/ HITL marked; annotations: "worsen", "comparable"]
These results suggest that error correction feedback can improve TTS quality

• Subjective evaluation of HITL framework
16 / 17
… BOLD values are higher than that of FS2+Predictor
Method                   MOS    95% confidence interval
FS2+Predictor            2.87   0.163
FS2+Predictor (target)   3.54   0.156
Best                     3.62   0.151
Median                   3.38   0.156
Worst                    2.84   0.174
Mode                     3.63   0.159
MACE                     3.49   0.161
・"obviously lower": These results suggest that the prediction error of the prosody predictor degrades the naturalness of speech.
・"significantly higher": These results suggest that our method improves the prosodic naturalness of synthetic speech
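A MOS and its 95% confidence interval can be computed from raw ratings as sketched below. The slides do not say how the interval was derived, so the normal approximation (z = 1.96), the reading of the reported value as a half-width, and the function name `mos_with_ci` are assumptions. The ratings are made up.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("at least two ratings are needed for a sample std")
    mean = statistics.mean(ratings)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


ratings = [3, 4, 3, 4]  # made-up 5-point naturalness scores
mos, ci = mos_with_ci(ratings)
print(f"MOS {mos:.2f} +/- {ci:.3f}")
```

Under this reading, two methods whose intervals do not overlap differ by more than the rating noise alone would explain.
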

Summary
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrected accent errors.
・Our method successfully achieved the same quality of synthetic speech
as the conventional method.
• Future work
・User interface for feedback
・How to integrate the obtained prosodic sequences.
17 / 17
Thank you for your attention!
