11/10/2022
Empirical Study Incorporating
Linguistic Knowledge on Filled Pauses
for Personalized Spontaneous Speech Synthesis
Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari
Graduate School of Information Science and Technology,
The University of Tokyo, Japan.
APSIPA ASC 2022 @ Chiang Mai, Thailand
Background: personalized speech synthesis
Speech synthesis: artificially synthesize human-like speech.
− Text-to-speech synthesis: using text as inputs.
− Can synthesize human-like natural speech [Shen+18, Ren+21].
Personalized speech synthesis: reproduce speaker’s individuality.
− Voice cloning: especially reproduce individuality of voice timbre [Xie+21].
− Limitation: handle only fluent reading-style speech (like an announcer).
Challenge: handle spontaneous speech including disfluency.
[Figure: text-to-speech synthesis converts text (“I’ll explain speech synthesis.”) into speech; text-to-speech synthesis for a target speaker converts the same text into the target speaker’s speech.]
Background: disfluency and FP
What is disfluency?
− Parts of spontaneous speech that are not fluent [Shriberg+94].
• Hesitation, filled pause, etc.
[Figure: a speaker says “My research theme is uh ‘personalized spontaneous speech synthesis’”, and the listener recognizes the important word “personalized spontaneous …”.]
Various roles of disfluency
− Speech generation: disfluency is produced when speakers make mistakes [Levelt+83].
− Communication: disfluency reduces listening effort and helps listeners understand newly introduced words [Arnold+04].
Filled pauses (FPs): have a filling-in role [Maekawa+03].
− FPs can be decomposed into FP positions and words.
• FP position: where in the utterance is the FP inserted?
• FP word: which FP word is used?
− Necessary for human-like speech and its personalization
Prior knowledge of FPs in linguistics
− Vocabulary: Japanese (target language) has 160 different FPs [Hirose+06].
− Individuality: FPs are different among speakers [Watanabe+19].
− Replaceability: the effect of an FP does not change if the FP word is replaced [Yamashita+07].
Overview of this work
Research purpose: personalized spontaneous speech synthesis
− Reproduce individuality of FPs.
− Realize voice cloning for more human-like spontaneous speech.
[Figure: (a) Conventional speech synthesis: the text “I’ll explain the theory” is synthesized as “I’ll explain the theory”. (b) Personalized spontaneous speech synthesis: the same text is synthesized as “I’ll explain uh the theory”.]
This work: investigation based on this linguistic knowledge of FPs
− Investigate relations between FP position/word and speech evaluation.
− Compare personalized (ground-truth) and non-personalized (predicted) FPs.
Related work
FP-included speech synthesis
− E.g., multi-speaker speech synthesis model w/ FP insertion [Yan+21].
Evaluation of FP-included speech synthesis
− E.g., comparison of the individuality of synthetic speech w/ and w/o FPs [Szekely+19a].
Limitations
− Use a limited FP word vocabulary (only “uh” and “um” in English [Yan+21]).
− Do not evaluate FP positions and words in detail [Szekely+19a].
− Do not evaluate individuality [Szekely+19b].
This work
− Create rich FP word vocabulary.
− Investigate relations between FP position/word and speech evaluation in detail.
− Evaluate in terms of naturalness, individuality, and listening effort.
Spontaneous speech synthesis model with FP insertion
Structure of the proposed model
[Figure: no-FP text (“I’ll explain the theory.”) → word embedding → FP prediction model → FP-included text (“I’ll explain uh the theory.”) → embedding (with FP tag) → encoder → decoder → FP-included speech.]
FP prediction model
• Trained on multi-speaker FP-annotated corpus.
• Predict “None” or 13 kinds of FP words.
Text-to-speech synthesis model
• Trained on target speaker’s spontaneous speech corpus.
FP vocabulary and dataset
Rich FP word vocabulary for personalization
− Should include FP words used by various speakers.
• Use multi-speaker FP-annotated corpus.
• Exclude FP words used less than 20% by all speakers.
− Obtained vocabulary:
• Includes 13 FP words: ee, ano, eeto, n, aanoo, e, anoo, a, nn, ma, maa, aa, etto.
• Covers 83% of each speaker's FPs on average.
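The vocabulary selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the per-speaker counts are toy data, `build_fp_vocab` and `coverage` are hypothetical helpers, and the 20% rule is read here as "keep an FP word only if at least 20% of the speakers use it", which is only one possible interpretation of the slide.

```python
from collections import Counter

def build_fp_vocab(counts_by_speaker, min_speaker_ratio=0.2):
    """Keep FP words used by at least `min_speaker_ratio` of the speakers.

    NOTE: this reading of the 20% rule is an assumption.
    """
    n_speakers = len(counts_by_speaker)
    speakers_using = Counter()
    for counts in counts_by_speaker.values():
        for word, n in counts.items():
            if n > 0:
                speakers_using[word] += 1
    return {w for w, k in speakers_using.items() if k / n_speakers >= min_speaker_ratio}

def coverage(counts_by_speaker, vocab):
    """Average over speakers of the share of each speaker's FP tokens found in `vocab`."""
    ratios = []
    for counts in counts_by_speaker.values():
        total = sum(counts.values())
        if total == 0:
            continue  # speaker without FPs: nothing to cover
        covered = sum(n for w, n in counts.items() if w in vocab)
        ratios.append(covered / total)
    return sum(ratios) / len(ratios)

# Toy FP counts per speaker (hypothetical; "uun" and "hee" are made-up rare words).
toy_counts = {
    "spk1": {"ee": 50, "ano": 30, "maa": 20},
    "spk2": {"ee": 10, "eeto": 80, "uun": 10},
    "spk3": {"ee": 40, "ano": 60},
    "spk4": {"e": 70, "ano": 30},
    "spk5": {"ee": 25, "eeto": 75},
    "spk6": {"ee": 90, "hee": 10},
}
vocab = build_fp_vocab(toy_counts)
avg_cov = coverage(toy_counts, vocab)
print(sorted(vocab), round(avg_cov, 3))
```

A lower threshold gives a larger vocabulary and higher coverage, at the cost of more rare classes for the FP prediction model to learn.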
Corpus: JLecSponSpeech
− Japanese spontaneous lecture speech corpus with 3-5 hours of speech per speaker.
− Includes two speakers.
− Includes FP tags and timing information.
※ If you want to use our corpus, please check the paper.
FP insertion methods
[Figure: three ways of producing FP-included text from “I’ll explain the theory.” before text-to-speech synthesis]
− Predicted (non-personalized): FP prediction chooses both position and word, e.g., “I’ll uh explain the theory.”
− Predicted w/ ground-truth position: FP prediction chooses the word for “I’ll explain <FP> the theory.”, e.g., “I’ll explain um the theory.”
− Ground-truth (personalized): ground-truth position and word, e.g., “I’ll explain uh the theory.”
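The insertion conditions can be illustrated with a small sketch. Everything here is an assumption made for illustration: `insert_fps` and `random_fps` are hypothetical helpers, FPs are represented as a mapping from word index to FP word, the "predicted" values are hard-coded stand-ins for a real FP prediction model, and English "uh"/"um" replace the Japanese FP vocabulary.

```python
import random

FP_VOCAB = ["uh", "um"]  # toy stand-ins for the 13 Japanese FP words

def insert_fps(words, fps):
    """Insert FP words into a word list.

    `fps` maps a position i (0..len(words)) to the FP word spoken before
    words[i]; position len(words) means the end of the utterance.
    """
    out = []
    for i, w in enumerate(words):
        if i in fps:
            out.append(fps[i])
        out.append(w)
    if len(words) in fps:
        out.append(fps[len(words)])
    return " ".join(out)

def random_fps(words, n_fps, rng):
    """Random baseline: random positions and random FP words."""
    positions = rng.sample(range(len(words) + 1), n_fps)
    return {p: rng.choice(FP_VOCAB) for p in positions}

words = "I'll explain the theory.".split()
true_fps = {2: "uh"}                              # ground-truth word and position
pred_fps = {1: "uh"}                              # hypothetical model output
pred_word_true_pos = {p: "um" for p in true_fps}  # hypothetical predicted word at true position

no_fp = insert_fps(words, {})
true_w_true_p = insert_fps(words, true_fps)
pred_w_pred_p = insert_fps(words, pred_fps)
pred_w_true_p = insert_fps(words, pred_word_true_pos)
rand_w_rand_p = insert_fps(words, random_fps(words, 1, random.Random(0)))

for name, text in [("no FP", no_fp), ("true word, true pos", true_w_true_p),
                   ("pred word, pred pos", pred_w_pred_p),
                   ("pred word, true pos", pred_w_true_p),
                   ("random", rand_w_rand_p)]:
    print(f"{name}: {text}")
```

Keeping positions and words in one mapping makes it easy to swap either factor independently, which is what the pairwise comparisons rely on.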
Experimental evaluation
Experimental settings
Models of FP prediction and text-to-speech synthesis
FP prediction
  Model: BERT + BLSTM [Matsunaga+22]
  Dataset: CSJ [cite]
Text-to-speech synthesis
  Model: FastSpeech2 [Ren+20]
  Dataset (pre-training): JSUT [Sonobe+17]
  Dataset (training): JLecSponSpeech
  Auxiliary feature: FP tag (concatenated to phoneme embed.)
  Other hyper-parameters: Published implementation [1]
[1] https://github.com/ndkgit339/FastSpeech2-filled_pause_speech_synthesis
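How an FP tag might be concatenated to phoneme embeddings can be sketched in a few lines. This is not taken from the published implementation: the binary tag, the single extra dimension, the toy embedding table, and the helper `embed_with_fp_tag` are all assumptions made for illustration.

```python
import random

def embed_with_fp_tag(phoneme_ids, fp_tags, table):
    """Look up each phoneme embedding and append its FP tag as one extra dimension."""
    if len(phoneme_ids) != len(fp_tags):
        raise ValueError("phoneme_ids and fp_tags must have the same length")
    return [table[p] + [float(t)] for p, t in zip(phoneme_ids, fp_tags)]

rng = random.Random(0)
EMBED_DIM = 4    # toy size; real models use much larger embeddings
N_PHONEMES = 10  # toy phoneme inventory
table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(N_PHONEMES)]

phoneme_ids = [3, 1, 7, 7, 2]
fp_tags = [0, 0, 1, 1, 0]  # 1 marks phonemes that belong to an FP (assumed encoding)
features = embed_with_fp_tag(phoneme_ids, fp_tags, table)
print(len(features), len(features[0]))
```

The encoder then sees, for every phoneme, whether it is part of an FP, so the same phoneme can be rendered differently inside and outside an FP.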
Investigation of FP insertion effects
Experiments
− Compare FP-included synthetic speech with preference AB/XAB tests.
− A total of 30 listeners evaluated 10 and 8 speech samples in the AB and XAB tests, respectively.
− Conduct evaluations for each of the two speakers of JLecSponSpeech.
Investigations:
− Quality of FP-included speech
− Necessity of FP prediction
− Necessity of reproduction of FP position and word
− Necessity of reproduction of FP word
− Necessity of reproduction of FP position
Criteria:
− Naturalness: which speech sample sounds more natural (human-like)?
− Individuality: which speech sample sounds closer to target speaker?
− Listening effort: which speech sample requires less effort to listen to?
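A preference score in an AB or XAB test is the fraction of judgments that favor one system. The sketch below computes such a score and an exact two-sided binomial test against chance (0.5). The slides do not state which significance test was used, so the test choice is an assumption, as are the helper names and the example of 198 wins out of 300 judgments (30 listeners times 10 samples).

```python
from math import comb

def preference_score(wins_a, n_total):
    """Fraction of judgments that preferred system A."""
    return wins_a / n_total

def binomial_two_sided_p(wins_a, n_total, p0=0.5):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null hypothesis P(prefer A) = p0.
    """
    pmf = [comb(n_total, k) * p0**k * (1 - p0) ** (n_total - k)
           for k in range(n_total + 1)]
    observed = pmf[wins_a]
    # Small relative tolerance so that ties are not lost to rounding.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts: 198 of 300 judgments prefer system A.
score = preference_score(198, 300)
p_value = binomial_two_sided_p(198, 300)
print(f"preference score = {score:.3f}, p = {p_value:.2e}")
```

A score close to 0.5 with a large p-value means the listeners showed no clear preference between the two systems.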
Quality of FP-included speech
Compared methods:
Method       FP word (W)   FP position (P)  Example
NoW-NoP      --            --               I’ll explain speech synthesis.
TrueW-TrueP  Ground-truth  Ground-truth     I’ll explain uh speech synthesis.
Results:
Criterion         Spk.  NoW-NoP vs. TrueW-TrueP
Naturalness       A     0.660 vs. 0.340
                  B     0.563 vs. 0.437
Individuality     A     0.671 vs. 0.329
                  B     0.542 vs. 0.458
Listening effort  A     0.660 vs. 0.340
                  B     0.560 vs. 0.440
“No FP” is preferred.
Limitation: quality degradation of speech by FP insertion
Necessity of FP prediction
Compared methods:
Method       FP word (W)  FP position (P)  Example
PredW-PredP  Predicted    Predicted        I’ll um explain speech synthesis.
RandW-RandP  Random       Random           I’ll explain speech synthesis uh.
Results:
Criterion         Spk.  PredW-PredP vs. RandW-RandP
Naturalness       A     0.770 vs. 0.230
                  B     0.747 vs. 0.253
Individuality     A     0.808 vs. 0.192
                  B     0.817 vs. 0.183
Listening effort  A     0.750 vs. 0.250
                  B     0.693 vs. 0.307
Predicted FPs are preferred.
Not random but predicted FPs are necessary.
Necessity of reproduction of FP position and word
Compared methods:
Method       FP word (W)   FP position (P)  Example
PredW-PredP  Predicted     Predicted        I’ll um explain speech synthesis.
TrueW-TrueP  Ground-truth  Ground-truth     I’ll explain uh speech synthesis.
Results:
Criterion         Spk.  PredW-PredP vs. TrueW-TrueP
Naturalness       A     0.470 vs. 0.530
                  B     0.457 vs. 0.543
Individuality     A     0.442 vs. 0.558
                  B     0.350 vs. 0.650
Listening effort  A     0.487 vs. 0.513
                  B     0.433 vs. 0.567
Ground-truth FPs are preferred.
Ground-truth FPs are preferred in speaker B.
Reproducing ground-truth FPs (positions and words) is necessary for personalization.
Necessity of reproduction of FP word
Compared methods:
Method       FP word (W)   FP position (P)  Example
PredW-TrueP  Predicted     Ground-truth     I’ll explain um speech synthesis.
TrueW-TrueP  Ground-truth  Ground-truth     I’ll explain uh speech synthesis.
Results:
Criterion         Spk.  PredW-TrueP vs. TrueW-TrueP
Naturalness       A     0.470 vs. 0.530
                  B     0.493 vs. 0.507
Individuality     A     0.454 vs. 0.546
                  B     0.496 vs. 0.504
Listening effort  A     0.463 vs. 0.537
                  B     0.527 vs. 0.437
Ground-truth words are preferred to predicted ones.
No significant differences
• Reproducing ground-truth FP words might be necessary for personalization (in some speakers).
• Replaceability of FP words [Yamashita+07] might be true, i.e., FP functions remain unchanged if FP words are replaced.
Necessity of reproduction of FP position
Compared methods:
Method       FP word (W)  FP position (P)  Example
PredW-PredP  Predicted    Predicted        I’ll um explain speech synthesis.
PredW-TrueP  Predicted    Ground-truth     I’ll explain um speech synthesis.
Results:
Criterion         Spk.  PredW-PredP vs. PredW-TrueP
Naturalness       A     0.437 vs. 0.563
                  B     0.423 vs. 0.577
Individuality     A     0.542 vs. 0.458
                  B     0.479 vs. 0.521
Listening effort  A     0.470 vs. 0.530
                  B     0.503 vs. 0.497
Ground-truth positions are preferred to predicted ones.
Speech with ground-truth FP positions sounds more natural.
Absolute evaluation of FP-included synthetic speech
Summary of Mean Opinion Score (MOS) test
[Figure: scatter plot of naturalness MOS against individuality MOS (axis ticks at 2 and 4). Plotted systems: reading-style speech synthesis; spontaneous speech synthesis with No FP, Random FP, Predicted FP, and Ground-truth FP; natural speech. Annotations: “Improvement by FP prediction”, “Inferiority of FP reproduction”, “Inferiority of modeling FP-included synthesis”, “Future work”.]
※ details in our paper
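A mean opinion score is the average of listener ratings for one system, usually reported with a confidence interval. The sketch below is a generic illustration under stated assumptions: ratings on a 1-5 scale, a normal-approximation 95% interval, a hypothetical helper `mos_with_ci`, and toy ratings for two placeholder systems that do not correspond to any result in this talk.

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width

# Toy ratings (hypothetical, 1-5 scale) for two placeholder systems.
toy_ratings = {
    "system_a": [2, 3, 2, 3, 3, 2],
    "system_b": [3, 4, 3, 3, 4, 3],
}
results = {name: mos_with_ci(r) for name, r in toy_ratings.items()}
for name, (m, hw) in results.items():
    print(f"{name}: MOS = {m:.2f} +/- {hw:.2f}")
```

Running the same function once on naturalness ratings and once on individuality ratings gives the two coordinates of each system in a plot like the one summarized above.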
Summary and future direction
Research purpose:
− Personalized spontaneous speech synthesis, which reproduces
individuality of FPs.
This work: investigation based on linguistic priors of FPs
− Investigate relations between FP position/word and speech evaluation.
− Experimentally evaluate FP-included synthetic speech.
• Compare personalized and non-personalized FPs.
− Clarify relations between FP insertion and the naturalness/individuality of synthetic speech.
− Limitation: synthesized speech quality is degraded by FP insertion.
Future work
− Improve the quality of speech synthesized by the spontaneous speech synthesis model.
Thank you for your attention!

Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
 

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

  • 1. 11/10/2022 Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari Graduate School of Information Science and Technology, The University of Tokyo, Japan. APSIPA ASC 2022 @ Chiang Mai, Thailand
  • 2. Background: personalized speech synthesis
    Speech synthesis: artificially synthesize human-like speech.
    − Text-to-speech synthesis: uses text as input.
    − Can synthesize human-like natural speech [Shen+18, Ren+21].
    Personalized speech synthesis: reproduce a speaker’s individuality.
    − Voice cloning: reproduces individuality, especially of voice timbre [Xie+21].
    − Limitation: handles only fluent reading-style speech (like an announcer).
    Challenge: handle spontaneous speech including disfluency.
    [Figure: text (“I’ll explain speech synthesis.”) → text-to-speech synthesis → speech; the same text → text-to-speech synthesis for target speaker → target speaker’s speech.]
  • 3. Background: disfluency and FP
    What is disfluency?
    − A part of spontaneous speech that is not fluent [Schriberg+94].
      • Hesitation, filled pause, etc.
    [Figure: a speaker says “My research theme is uh personalized spontaneous speech synthesis”; the listener recognizes the important word “personalized spontaneous …”.]
    Various roles of disfluency
    − Speech generation: produced when speakers make mistakes [Levelt+83].
    − Communication: reduces listening effort and facilitates understanding of newly appearing words [Arnold+04].
    Filled pauses (FPs): have a filling-in role [Maekawa+03].
    − FPs can be decomposed into FP positions and FP words.
      • FP position: where is the FP inserted in the utterance?
      • FP word: which kind of FP word is used?
    − Necessary for human-like speech and its personalization.
  • 4. Prior knowledge of FPs in linguistics
    − Vocabulary: Japanese (target language) has 160 different FPs [Hirose+06].
    − Individuality: FPs differ among speakers [Watanabe+19].
    − Replaceability: the effect of an FP is unchanged if the FP word is replaced [Yamashita+07].
    Overview of this work
    Research purpose: personalized spontaneous speech synthesis
    − Reproduce individuality of FPs.
    − Realize voice cloning for more human-like spontaneous speech.
    [Figure: (a) conventional speech synthesis: “I’ll explain the theory” → “I’ll explain the theory”; (b) personalized spontaneous speech synthesis: “I’ll explain the theory” → “I’ll explain uh the theory”.]
    This work: investigation based on this knowledge of FPs
    − Investigate relations between FP position/word and speech evaluation.
    − Compare personalized (ground-truth) and non-personalized (predicted) FPs.
  • 6. Related work
    FP-included speech synthesis
    − Ex.) multi-speaker speech synthesis model with FP insertion [Yan+21].
    Evaluation of FP-included speech synthesis
    − Ex.) comparison of the individuality of synthetic speech with and without FPs [Szekely+19a].
    Limitations
    − Limited FP word vocabulary (only “uh” and “um” in English [Yan+21]).
    − FP positions and words are not evaluated in detail [Szekely+19a].
    − Individuality is not evaluated [Szekely+19b].
    This work
    − Creates a rich FP word vocabulary.
    − Investigates relations between FP position/word and speech evaluation in detail.
    − Evaluates in terms of naturalness, individuality, and listening effort.
  • 7. Spontaneous speech synthesis model with FP insertion
  • 8. Spontaneous speech synthesis model w/ FP insertion
    Structure of the proposed model
    [Figure: no-FP text (“I’ll explain the theory.”) → FP prediction model → FP-included text (“I’ll explain uh the theory.”) → text-to-speech synthesis model → FP-included speech. Block labels in the figure: word embedding, FP tag, embedding, encoder, decoder.]
    FP prediction model
    − Trained on a multi-speaker FP-annotated corpus.
    − Predicts “None” or one of 13 kinds of FP words.
    Text-to-speech synthesis model
    − Trained on the target speaker’s spontaneous speech corpus.
  • 9. FP vocabulary and dataset
    Rich FP word vocabulary for personalization
    − Should include FP words used by various speakers.
      • Use a multi-speaker FP-annotated corpus.
      • Exclude FP words used less than 20% by all speakers.
    − Obtained vocabulary:
      • Includes 13 FP words: ee, ano, eeto, n, aanoo, e, anoo, a, nn, ma, maa, aa, etto.
      • Covers 83% of each speaker’s FPs on average.
    Corpus: JLecSponSpeech
    − Japanese lecture spontaneous speech corpus with 3-5 hours per speaker.
    − Includes two speakers.
    − Includes FP tags and timing information.
    ※ If you want to use our corpus, please check the paper.
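The vocabulary selection and coverage figure on this slide can be illustrated with a minimal sketch. It assumes one reading of the 20% rule: a word is dropped only if its share of FP tokens is below 20% for every speaker. The slide could also be read as "used by fewer than 20% of speakers", and the slide does not say which is meant. The function names and the toy data are hypothetical and are not taken from the authors' code or corpus.

```python
from collections import Counter


def build_fp_vocabulary(fps_by_speaker, min_rate=0.2):
    """Keep an FP word if at least one speaker uses it at a rate >= min_rate.

    Assumption: "rate" is the word's share of that speaker's FP tokens.
    """
    vocabulary = set()
    for fps in fps_by_speaker.values():
        counts = Counter(fps)
        total = sum(counts.values())
        for word, count in counts.items():
            if count / total >= min_rate:
                vocabulary.add(word)
    return vocabulary


def mean_coverage(fps_by_speaker, vocabulary):
    """Average over speakers of the share of FP tokens inside the vocabulary."""
    ratios = [
        sum(word in vocabulary for word in fps) / len(fps)
        for fps in fps_by_speaker.values()
        if fps  # skip speakers without any FP token
    ]
    return sum(ratios) / len(ratios)


# Hypothetical toy data: FP tokens observed for two speakers.
toy_fps = {
    "spk1": ["ee"] * 6 + ["ano"] * 3 + ["maa"],
    "spk2": ["eeto"] * 5 + ["ee"] * 4 + ["etto"],
}
toy_vocabulary = build_fp_vocabulary(toy_fps)
toy_coverage = mean_coverage(toy_fps, toy_vocabulary)
```

In the toy data, "maa" and "etto" each make up only 10% of one speaker's FPs, so they fall outside the vocabulary and reduce the coverage for that speaker.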
  • 10. FP insertion methods
    − Ground-truth (personalized): “I’ll explain uh the theory.” → text-to-speech synthesis.
    − Predicted (non-personalized): “I’ll explain the theory.” → FP prediction → “I’ll uh explain the theory.” → text-to-speech synthesis.
    − Predicted w/ ground-truth position: “I’ll explain <FP> the theory.” (ground-truth position) → FP prediction → “I’ll explain um the theory.” → text-to-speech synthesis.
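The three insertion conditions can be sketched as operations on a token list with one FP label per token (the FP, if any, that follows the token). This is only an illustration: the label layout, the function names, and the score dictionaries standing in for the FP prediction model's output are assumptions, not the authors' interface.

```python
def insert_fps(tokens, fp_after):
    """Build FP-included text; fp_after[i] is the FP word after tokens[i], or None."""
    out = []
    for token, fp in zip(tokens, fp_after):
        out.append(token)
        if fp is not None:
            out.append(fp)
    return " ".join(out)


def predict_words_at_true_positions(true_fp_after, fp_word_scores):
    """Keep the ground-truth positions but pick the best-scoring FP word at each.

    fp_word_scores[i] is a hypothetical dict {FP word: score} for the slot
    after token i (the "None" class is excluded because the position is given).
    """
    labels = []
    for true_fp, scores in zip(true_fp_after, fp_word_scores):
        if true_fp is None:
            labels.append(None)
        else:
            labels.append(max(scores, key=scores.get))
    return labels


tokens = ["I'll", "explain", "the", "theory."]
true_labels = [None, "uh", None, None]   # ground-truth (personalized)
pred_labels = ["uh", None, None, None]   # predicted (non-personalized)
scores = [{"uh": 0.6, "um": 0.4}, {"uh": 0.3, "um": 0.7},
          {"uh": 0.5, "um": 0.5}, {"uh": 0.5, "um": 0.5}]

ground_truth_text = insert_fps(tokens, true_labels)
predicted_text = insert_fps(tokens, pred_labels)
mixed_text = insert_fps(tokens, predict_words_at_true_positions(true_labels, scores))
```

With these toy inputs the three texts match the slide's examples: the FP word and position both come from the speaker, both come from the predictor, or only the word comes from the predictor.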
  • 12. Experimental settings
    Models of FP prediction and text-to-speech synthesis
    FP prediction
    − Model: BERT + BLSTM [Matsunaga+22]
    − Dataset: CSJ [cite]
    Text-to-speech synthesis
    − Model: FastSpeech2 [Ren+20]
    − Dataset (pre-training): JSUT [Sonobe+17]
    − Dataset (training): JLecSponSpeech
    − Auxiliary feature: FP tag (concatenated to phoneme embedding)
    − Other hyper-parameters: published implementation [1]
    [1] https://github.com/ndkgit339/FastSpeech2-filled_pause_speech_synthesis
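The auxiliary feature in the settings above, an FP tag concatenated to the phoneme embedding, can be pictured with a minimal sketch. It assumes the tag is a single 0/1 value per phoneme and uses plain Python lists in place of tensors. The slide does not state the tag's dimensionality, and the function name is hypothetical, so this is not the published implementation.

```python
from typing import List


def concat_fp_tag(phoneme_embeddings: List[List[float]],
                  fp_tags: List[int]) -> List[List[float]]:
    """Append a binary FP tag (1 = phoneme belongs to an FP) to each embedding."""
    if len(phoneme_embeddings) != len(fp_tags):
        raise ValueError("one FP tag is required per phoneme embedding")
    return [emb + [float(tag)] for emb, tag in zip(phoneme_embeddings, fp_tags)]


# Hypothetical 2-dimensional embeddings for three phonemes; the second is part of an FP.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tagged = concat_fp_tag(embeddings, [0, 1, 0])
```

Each output vector is one dimension longer than its input, so the layers that consume the embedding would need a correspondingly wider input.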
  • 13. Investigation of FP insertion effects
    Experiments
    − Compare FP-included synthetic speech by preference AB/XAB tests.
    − A total of 30 listeners evaluated 10/8 speech samples in the AB/XAB tests.
    − Conduct evaluations for each of the two speakers of JLecSponSpeech.
    Investigations:
    − Quality of FP-included speech
    − Necessity of FP prediction
    − Necessity of reproduction of FP position and word
    − Necessity of reproduction of FP word
    − Necessity of reproduction of FP position
    Criteria:
    − Naturalness: which speech sample sounds more natural (human-like)?
    − Individuality: which speech sample sounds closer to the target speaker?
    − Listening effort: which speech sample requires less effort to listen to?
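A preference score from an AB test is the fraction of judgments won by one method, and a common way to check whether it differs from chance (0.5) is an exact two-sided binomial test. The sketch below assumes 30 listeners × 10 samples = 300 judgments per AB comparison. The slides do not say which significance test the authors used, so the test choice and the function names are assumptions.

```python
from math import comb


def preference_score(wins_a, n_judgments):
    """Fraction of judgments that preferred method A."""
    return wins_a / n_judgments


def binomial_two_sided_p(wins_a, n_judgments):
    """Exact two-sided p-value against the null 'both methods equally preferred'.

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the upper tail of the larger count, capped at 1.
    """
    k = max(wins_a, n_judgments - wins_a)
    tail = sum(comb(n_judgments, i) for i in range(k, n_judgments + 1))
    return min(1.0, 2 * tail / 2 ** n_judgments)


# Hypothetical counts chosen to match scores of about 0.660 and 0.507 out of 300.
clear_preference = binomial_two_sided_p(198, 300)
near_chance = binomial_two_sided_p(152, 300)
```

A split of 198 to 102 is far from chance, while 152 to 148 is not, which is the kind of distinction behind a "no significant differences" remark for near-even scores.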
  • 14. Quality of FP-included speech
    Compared methods:
      Method        FP word (W)    FP position (P)   Example
      NoW-NoP       --             --                I’ll explain speech synthesis.
      TrueW-TrueP   Ground-truth   Ground-truth      I’ll explain uh speech synthesis.
    Results (NoW-NoP vs. TrueW-TrueP):
      Criterion          Spk. A            Spk. B
      Naturalness        0.660 vs. 0.340   0.563 vs. 0.437
      Individuality      0.671 vs. 0.329   0.542 vs. 0.458
      Listening effort   0.660 vs. 0.340   0.560 vs. 0.440
    “No FP” is preferred.
    Limitation: quality degradation of speech by FP insertion.
  • 15. Necessity of FP prediction
    Compared methods:
      Method        FP word (W)   FP position (P)   Example
      PredW-PredP   Predicted     Predicted         I’ll um explain speech synthesis.
      RandW-RandP   Random        Random            I’ll explain speech synthesis uh.
    Results (PredW-PredP vs. RandW-RandP):
      Criterion          Spk. A            Spk. B
      Naturalness        0.770 vs. 0.230   0.747 vs. 0.253
      Individuality      0.808 vs. 0.192   0.817 vs. 0.183
      Listening effort   0.750 vs. 0.250   0.693 vs. 0.307
    Predicted FPs are preferred.
    Predicted FPs, not random ones, are necessary.
  • 16. Necessity of reproduction of FP position and word
    Compared methods:
      Method        FP word (W)    FP position (P)   Example
      PredW-PredP   Predicted      Predicted         I’ll um explain speech synthesis.
      TrueW-TrueP   Ground-truth   Ground-truth      I’ll explain uh speech synthesis.
    Results (PredW-PredP vs. TrueW-TrueP):
      Criterion          Spk. A            Spk. B
      Naturalness        0.470 vs. 0.530   0.457 vs. 0.543
      Individuality      0.442 vs. 0.558   0.350 vs. 0.650
      Listening effort   0.487 vs. 0.513   0.433 vs. 0.567
    Ground-truth FPs are preferred.
    Ground-truth FPs are preferred in speaker B.
    Reproducing ground-truth FPs (positions and words) is necessary for personalization.
  • 17. Necessity of reproduction of FP word
    Compared methods:
      Method        FP word (W)    FP position (P)   Example
      PredW-TrueP   Predicted      Ground-truth      I’ll explain um speech synthesis.
      TrueW-TrueP   Ground-truth   Ground-truth      I’ll explain uh speech synthesis.
    Results (PredW-TrueP vs. TrueW-TrueP):
      Criterion          Spk. A            Spk. B
      Naturalness        0.470 vs. 0.530   0.493 vs. 0.507
      Individuality      0.454 vs. 0.546   0.496 vs. 0.504
      Listening effort   0.463 vs. 0.537   0.527 vs. 0.437
    Ground-truth words are preferred to predicted ones.
    No significant differences.
    − Reproducing ground-truth FP words might be necessary for personalization (in some speakers).
    − Replaceability of FP words [Yamashita+07] might be true, i.e., FP functions remain unchanged if FP words are replaced.
  • 18. Necessity of reproduction of FP position
    Compared methods:
      Method        FP word (W)   FP position (P)   Example
      PredW-PredP   Predicted     Predicted         I’ll um explain speech synthesis.
      PredW-TrueP   Predicted     Ground-truth      I’ll explain um speech synthesis.
    Results (PredW-PredP vs. PredW-TrueP):
      Criterion          Spk. A            Spk. B
      Naturalness        0.437 vs. 0.563   0.423 vs. 0.577
      Individuality      0.542 vs. 0.458   0.479 vs. 0.521
      Listening effort   0.470 vs. 0.530   0.503 vs. 0.497
    Ground-truth positions are preferred to predicted ones.
    Speech with ground-truth FP positions sounds more natural.
  • 19-23. Absolute evaluation of FP-included synthetic speech
    Summary of Mean Opinion Score (MOS) test
    [Figure: naturalness MOS and individuality MOS (axis ticks at 2 and 4) for natural speech, reading-style speech synthesis, and spontaneous speech synthesis with no FP, random FP, predicted FP, and ground-truth FP.]
    Annotations in the figure: improvement by FP prediction; inferiority of FP reproduction; inferiority of modeling FP-included synthesis; future work.
    ※ Details in our paper.
  • 25. Summary and future direction
    Research purpose:
    − Personalized spontaneous speech synthesis that reproduces individuality of FPs.
    This work: investigation based on linguistic priors of FPs
    − Investigate relations between FP position/word and speech evaluation.
    − Experimentally evaluate FP-included synthetic speech.
      • Compare personalized and non-personalized FPs.
    − Clarify relations between FP insertion and naturalness/individuality of synthetic speech.
    − Limitation: synthesized speech quality is degraded by FP insertion.
    Future work
    − Improve the quality of the speech synthesized by the spontaneous speech synthesis model.
    Thank you for your attention!