Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

11/10/2022
Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari
Graduate School of Information Science and Technology,
The University of Tokyo, Japan.
APSIPA ASC 2022 @ Chiang Mai, Thailand

Background: personalized speech synthesis
Speech synthesis: artificially synthesize human-like speech.
− Text-to-speech synthesis: using text as inputs.
− Can synthesize human-like natural speech [Shen+18, Ren+21].
Personalized speech synthesis: reproduce a speaker’s individuality.
− Voice cloning: reproduces individuality, especially that of voice timbre [Xie+21].
− Limitation: handles only fluent, reading-style speech (like an announcer’s).
Challenge: handle spontaneous speech, including disfluency.
[Figure: text-to-speech synthesis converts text (“I’ll explain speech synthesis.”) into speech; text-to-speech synthesis for a target speaker converts the same text into the target speaker’s speech.]

Background: disfluency and FP
What is disfluency?
− Parts of spontaneous speech that are not fluent [Shriberg+94].
• Hesitation, filled pause, etc.
[Figure: a speaker says “My research theme is uh ‘personalized spontaneous speech synthesis’”; the listener recognizes the important word “personalized spontaneous …”.]
Various roles of disfluency
− Speech generation: disfluencies are produced when speakers make mistakes [Levelt+83].
− Communication: they reduce listening effort and facilitate understanding of newly appearing words [Arnold+04].
Filled pauses (FPs): have a filling-in role [Maekawa+03].
− FPs can be decomposed into FP positions and FP words.
• FP position: where is the FP inserted in the utterance?
• FP word: which FP word is used?
− FPs are necessary for human-like speech and its personalization.
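The decomposition into FP positions and FP words can be illustrated with a minimal sketch. The `Token` class, the `decompose` function, and the slot convention (slot i sits immediately before fluent word i, with at most one FP per slot) are assumptions made for illustration, not the authors' data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical token type: a word plus a flag marking filled pauses."""
    word: str
    is_fp: bool = False


def decompose(tokens: List[Token]) -> Tuple[List[str], List[Optional[str]]]:
    """Split an FP-included utterance into fluent words and per-slot FP words.

    slots[i] holds the FP word spoken immediately before fluent word i
    (the last slot is the position after the final word); None means no FP.
    This sketch assumes at most one FP per slot.
    """
    words: List[str] = []
    slots: List[Optional[str]] = [None]
    for tok in tokens:
        if tok.is_fp:
            slots[-1] = tok.word   # the FP word fills the currently open slot
        else:
            words.append(tok.word)
            slots.append(None)     # open the slot that follows this word
    return words, slots


utterance = [Token("I'll"), Token("explain"), Token("uh", is_fp=True),
             Token("the"), Token("theory")]
words, slots = decompose(utterance)
fp_positions = [i for i, fp in enumerate(slots) if fp is not None]
fp_words = [fp for fp in slots if fp is not None]
```

Keeping positions and words separate makes it possible to vary one while holding the other fixed, which is what the later comparisons between predicted and ground-truth FPs rely on.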

Overview of this work
Prior knowledge of FPs in linguistics
− Vocabulary: Japanese (target language) has 160 different FPs [Hirose+06].
− Individuality: FP usage differs among speakers [Watanabe+19].
− Replaceability: the effect of an FP does not change if the FP word is replaced [Yamashita+07].
Research purpose: personalized spontaneous speech synthesis
− Reproduce individuality of FPs.
− Realize voice cloning for more human-like spontaneous speech.
[Figure: (a) Conventional speech synthesis: the text “I’ll explain the theory” is synthesized as “I’ll explain the theory”. (b) Personalized spontaneous speech synthesis: the same text is synthesized as “I’ll explain uh the theory”.]
This work: investigation based on this knowledge of FPs
− Investigate relations between FP position/word and speech evaluation.
− Compare personalized (ground-truth) and non-personalized (predicted) FPs.

Related work
FP-included speech synthesis
− Ex) multi-speaker speech synthesis model w/ FP insertion [Yan+21]
Evaluation of FP-included speech synthesis
− Ex) compare the individuality of synthetic speech w/ and w/o FPs [Szekely+19a].
Limitations
− Use a limited FP word vocabulary (only “uh” and “um” in English [Yan+21]).
− Do not evaluate FP positions and words in detail [Szekely+19a].
− Do not evaluate individuality [Szekely+19b].
This work
− Create rich FP word vocabulary.
− Investigate relations between FP position/word and speech evaluation in detail.
− Evaluate in terms of naturalness, individuality, and listening effort.

Spontaneous speech synthesis model
with FP insertion

Spontaneous speech synthesis model w/ FP insertion
Structure of the proposed model
[Figure: no-FP text (“I’ll explain the theory.”) → word embedding → FP prediction model → FP-included text (“I’ll explain uh the theory.”) with FP tag → embedding → encoder → decoder → FP-included speech.]
FP prediction model
• Trained on a multi-speaker FP-annotated corpus.
• Predicts “None” or one of 13 kinds of FP words.
Text-to-speech synthesis model
• Trained on the target speaker’s spontaneous speech corpus.

FP vocabulary and dataset
Rich FP word vocabulary for personalization
− Should include FP words used by various speakers.
• Use multi-speaker FP-annotated corpus.
• Exclude FP words used less than 20% by all speakers.
− Obtained vocabulary:
• Includes 13 FP words: ee, ano, eeto, n, aanoo, e, anoo, a, nn, ma, maa, aa, etto.
• Covers 83% of each speaker’s FPs on average.
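The vocabulary filtering and the coverage figure can be sketched as follows. The slide's 20% rule is read here as "keep FP words used by at least 20% of the speakers", which is only one possible interpretation; the function names, the toy corpus, and the placeholder word `x_rare` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def build_vocabulary(fp_by_speaker: Dict[str, List[str]],
                     min_speaker_ratio: float = 0.2) -> Set[str]:
    """Keep FP words used by at least `min_speaker_ratio` of the speakers.

    Assumption: this is one reading of the 20% threshold on the slide.
    """
    n_speakers = len(fp_by_speaker)
    speakers_using = Counter()
    for fps in fp_by_speaker.values():
        speakers_using.update(set(fps))  # count each speaker once per word
    return {w for w, c in speakers_using.items()
            if c / n_speakers >= min_speaker_ratio}


def mean_coverage(fp_by_speaker: Dict[str, List[str]], vocab: Set[str]) -> float:
    """Average, over speakers, of the share of FP tokens found in `vocab`."""
    ratios = [sum(w in vocab for w in fps) / len(fps)
              for fps in fp_by_speaker.values() if fps]
    return sum(ratios) / len(ratios)


# Toy corpus (hypothetical): FP tokens observed for six speakers.
toy = {
    "s1": ["ee", "ee", "ano", "maa"],
    "s2": ["ee", "ano", "ano", "x_rare"],
    "s3": ["ee", "eeto"],
    "s4": ["ano", "eeto", "eeto"],
    "s5": ["ee", "maa"],
    "s6": ["ano", "ee"],
}
vocab = build_vocabulary(toy)
coverage = mean_coverage(toy, vocab)
```

In this toy corpus only `x_rare` is dropped, because a single speaker out of six uses it; the coverage is then the per-speaker share of FP tokens still representable with the reduced vocabulary.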
Corpus: JLecSponSpeech
− Japanese spontaneous lecture speech corpus, 3-5 hours per speaker.
− Includes two speakers.
− Includes FP tags and timing information.
※ If you want to use our corpus, please check the paper.

FP insertion methods
− Predicted (non-personalized): “I’ll explain the theory.” → FP prediction → “I’ll uh explain the theory.” → text-to-speech synthesis
− Predicted w/ ground-truth position: “I’ll explain <FP> the theory.” → FP prediction → “I’ll explain um the theory.” → text-to-speech synthesis
− Ground-truth (personalized): “I’ll explain uh the theory.” → text-to-speech synthesis
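The three insertion methods differ only in where the FP positions and FP words come from. A minimal sketch, assuming a slot-based representation (one optional FP word per word boundary) and a stand-in predictor function, neither of which is taken from the authors' code:

```python
from typing import Callable, List, Optional

Slots = List[Optional[str]]  # one entry per word boundary; None = no FP


def insert_fps(words: List[str], slots: Slots) -> str:
    """Interleave FP words into fluent text; slots[i] precedes words[i]."""
    out: List[str] = []
    for i, fp in enumerate(slots):
        if fp is not None:
            out.append(fp)
        if i < len(words):
            out.append(words[i])
    return " ".join(out)


def predicted_with_true_position(true_slots: Slots,
                                 predict_word: Callable[[int], str]) -> Slots:
    """Keep ground-truth positions, but let a predictor choose each FP word."""
    return [predict_word(i) if fp is not None else None
            for i, fp in enumerate(true_slots)]


words = ["I'll", "explain", "the", "theory."]
true_slots: Slots = [None, None, "uh", None, None]       # ground truth (personalized)
pred_slots: Slots = [None, "uh", None, None, None]       # hypothetical model output
mixed_slots = predicted_with_true_position(true_slots, lambda i: "um")

ground_truth_text = insert_fps(words, true_slots)
predicted_text = insert_fps(words, pred_slots)
mixed_text = insert_fps(words, mixed_slots)
```

Each resulting string would then be passed to the text-to-speech synthesis model; the lambda returning "um" stands in for the FP prediction model restricted to the ground-truth positions.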

Experimental evaluation

Experimental settings
Models of FP prediction and text-to-speech synthesis
FP prediction
  Model                    BERT + BLSTM [Matsunaga+22]
  Dataset                  CSJ [cite]
Text-to-speech synthesis
  Model                    FastSpeech2 [Ren+20]
  Dataset (pre-training)   JSUT [Sonobe+17]
  Dataset (training)       JLecSponSpeech
  Auxiliary feature        FP tag (concatenated to phoneme embedding)
  Other hyper-parameters   Published implementation [1]
[1] https://github.com/ndkgit339/FastSpeech2-filled_pause_speech_synthesis
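Concatenating an FP tag to the phoneme embedding can be pictured with a framework-free sketch. It assumes a one-dimensional binary tag per phoneme; the published implementation may encode the tag differently (for example, as a learned embedding), and the function name is hypothetical.

```python
from typing import List


def concat_fp_tag(phoneme_embeddings: List[List[float]],
                  is_fp: List[bool]) -> List[List[float]]:
    """Append a 1-dimensional FP tag (1.0 = phoneme belongs to an FP) to each embedding."""
    if len(phoneme_embeddings) != len(is_fp):
        raise ValueError("need exactly one FP flag per phoneme")
    return [emb + [1.0 if flag else 0.0]
            for emb, flag in zip(phoneme_embeddings, is_fp)]


# Toy 2-dimensional embeddings for three phonemes; the middle one is part of an FP.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tagged = concat_fp_tag(embeddings, [False, True, False])
```

The tag gives the encoder an explicit signal about which phonemes belong to an FP, so the model does not have to infer that from the phoneme sequence alone.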

Investigation of FP insertion effects
Experiments
− Compare FP-included synthetic speech with preference AB/XAB tests.
− A total of 30 listeners each evaluated 10 speech samples in the AB test and 8 in the XAB test.
− Conduct the evaluations separately for each of the two speakers of JLecSponSpeech.
Investigations:
Quality of FP-included speech
Necessity of FP prediction
Necessity of reproduction of FP position and word
Necessity of reproduction of FP word
Necessity of reproduction of FP position
Criteria:
Naturalness: which speech sample sounds more natural (human-like)?
Individuality: which speech sample sounds closer to target speaker?
Listening effort: which speech sample requires less effort to listen to?
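A preference score in an AB or XAB test is the fraction of votes a method receives, and 0.5 is the chance level. The sketch below shows one common way to check whether a score differs from chance, using an exact two-sided binomial test. The slides do not state which significance test was used, and the vote total of 300 (30 listeners × 10 samples) is an assumption.

```python
from math import comb


def preference_score(votes_for_a: int, n_votes: int) -> float:
    """Fraction of all votes that preferred method A."""
    return votes_for_a / n_votes


def binomial_two_sided_p(votes_for_a: int, n_votes: int) -> float:
    """Exact two-sided binomial test against the chance level p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    is twice the probability of the smaller tail, capped at 1.
    """
    tail = min(votes_for_a, n_votes - votes_for_a)
    p = 2 * sum(comb(n_votes, i) for i in range(tail + 1)) / 2 ** n_votes
    return min(1.0, p)


# Assumed vote total: 30 listeners x 10 samples = 300 votes per AB comparison.
n = 300
score = preference_score(198, n)        # 198 of 300 votes for method A
p_clear = binomial_two_sided_p(198, n)  # strongly one-sided outcome
p_close = binomial_two_sided_p(158, n)  # near-chance outcome
```

With these assumed counts, a 198-to-102 split is far from chance, whereas a 158-to-142 split is not distinguishable from chance at the usual 5% level.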

Quality of FP-included speech
Compared methods:
Method         FP word (W)     FP position (P)   Example
NoW-NoP        --              --                I’ll explain speech synthesis.
TrueW-TrueP    Ground-truth    Ground-truth      I’ll explain uh speech synthesis.
Results:
Criterion          Spk.   NoW-NoP vs. TrueW-TrueP
Naturalness        A      0.660 vs. 0.340
                   B      0.563 vs. 0.437
Individuality      A      0.671 vs. 0.329
                   B      0.542 vs. 0.458
Listening effort   A      0.660 vs. 0.340
                   B      0.560 vs. 0.440
“No FP” is preferred.
Limitation: speech quality is degraded by FP insertion.

Necessity of FP prediction
Compared methods:
Method         FP word (W)     FP position (P)   Example
PredW-PredP    Predicted       Predicted         I’ll um explain speech synthesis.
RandW-RandP    Random          Random            I’ll explain speech synthesis uh.
Results:
Criterion          Spk.   PredW-PredP vs. RandW-RandP
Naturalness        A      0.770 vs. 0.230
                   B      0.747 vs. 0.253
Individuality      A      0.808 vs. 0.192
                   B      0.817 vs. 0.183
Listening effort   A      0.750 vs. 0.250
                   B      0.693 vs. 0.307
Predicted FPs are preferred.
Predicted FPs, not random ones, are necessary.

Necessity of reproduction of FP position and word
Compared methods: PredW-PredP vs. TrueW-TrueP
Results:
Criterion          Spk.   PredW-PredP vs. TrueW-TrueP
Naturalness        A      0.470 vs. 0.530
                   B      0.457 vs. 0.543
Individuality      A      0.442 vs. 0.558
                   B      0.350 vs. 0.650
Listening effort   A      0.487 vs. 0.513
                   B      0.433 vs. 0.567
Ground-truth FPs are preferred.
Ground-truth FPs are preferred in speaker B.
Reproducing ground-truth FPs (positions and words) is necessary for personalization.

Necessity of reproduction of FP word
Compared methods:
Method         FP word (W)     FP position (P)   Example
PredW-TrueP    Predicted       Ground-truth      I’ll explain um speech synthesis.
Results:
Criterion          Spk.   PredW-TrueP vs. TrueW-TrueP
Naturalness        A      0.470 vs. 0.530
                   B      0.493 vs. 0.507
Individuality      A      0.454 vs. 0.546
                   B      0.496 vs. 0.504
Listening effort   A      0.463 vs. 0.537
                   B      0.527 vs. 0.473
Ground-truth words are preferred to predicted ones.
No significant differences.
• Reproducing ground-truth FP words might be necessary for personalization (in some speakers).
• Replaceability of FP words [Yamashita+07] might be true, i.e., FP functions remain unchanged if FP words are replaced.

Necessity of reproduction of FP position
Compared methods:
Method         FP word (W)     FP position (P)   Example
PredW-TrueP    Predicted       Ground-truth      I’ll explain um speech synthesis.
Results:
Criterion          Spk.   PredW-PredP vs. PredW-TrueP
Naturalness        A      0.437 vs. 0.563
                   B      0.423 vs. 0.577
Individuality      A      0.542 vs. 0.458
                   B      0.479 vs. 0.521
Listening effort   A      0.470 vs. 0.530
                   B      0.503 vs. 0.497
Ground-truth positions are preferred to predicted ones.
Speech with ground-truth FP positions sounds more natural.

Absolute evaluation of FP-included synthetic speech
Summary of Mean Opinion Score (MOS) test
[Figure: plot of naturalness MOS against individuality MOS (axis ticks at 2 and 4), showing random FP, reading-style speech synthesis, and natural speech, with a label for spontaneous speech synthesis.]
※ details in our paper
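For readers unfamiliar with MOS tests, the score for a system is the mean of the listeners' ratings, usually reported with a confidence interval. A minimal sketch, assuming a 1-5 rating scale and a normal-approximation 95% interval; the ratings below are made up for illustration and are not results from this study.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def mos_with_ci(ratings: List[int], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width


# Hypothetical ratings on an assumed 1-5 scale for a single system.
ratings = [4, 3, 5, 4, 4, 3, 2, 4]
mos, ci = mos_with_ci(ratings)
```

Plotting one such score per criterion gives a point per system in the naturalness/individuality plane, which is how the summary figure is organized.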

[Figure, built up over four slides: the same plot of naturalness MOS against individuality MOS, with predicted FP, ground-truth FP, and no FP added to random FP, reading-style speech synthesis, and natural speech. Annotations: “Improvement by FP prediction”, “Inferiority of FP reproduction”, “Inferiority of modeling FP-included synthesis”, and “Future work”.]

Summary and future direction

Research purpose:
− Personalized spontaneous speech synthesis, which reproduces
individuality of FPs.
This work: investigation based on linguistic priors of FPs
− Investigate relations between FP position/word and speech evaluation.
− Experimentally evaluate FP-included synthetic speech.
• Compare personalized and non-personalized FPs.
− Clarify relations between FP insertion and the naturalness/individuality of synthetic speech.
− Limitation: synthesized speech quality is degraded by FP insertion.
Future work
− Improve the quality of speech synthesized by the spontaneous speech synthesis model.
Thank you for your attention!
