AVATAR SYMBIOTIC
SOCIETY
Adaptive End-to-End Text-to-Speech Synthesis
Based on Error Correction Feedback from Humans
Kazuki Fujii, Yuki Saito, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 ThPM1-3.5
7 - 10 Nov. 2022・Chiang Mai, Thailand
1 / 17
Research background
2 / 17
• End-to-end (E2E) Text-to-Speech (TTS)[Wang+17]
・High quality but low controllability
・Difficulty in correcting accent errors
(a cause of miscommunication, especially in a pitch-accent language such as Japanese)
[Figure: an E2E TTS model converts the text "a me" into speech; the output is shown with High/Low pitch patterns]
Prosody modeling and control in E2E TTS is a very important and challenging task.
Improving controllability using linguistic features
3 / 17
• Examples of linguistic features
・Full-context labels [Okamoto+19], [Okamoto+20]
・Phonetic and prosodic labels [Kurihara+21]
• High controllability but need for prerequisite expertise
“Adaptability”, the ability to easily correct mistakes,
is not taken into consideration.
Input text → Text analysis → Linguistic features → E2E TTS → Synthesized speech
Outline of this talk
4 / 17
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrects accent errors.
・Our method successfully achieves the same quality of synthetic speech
as the conventional method.
Overview of proposed TTS model
5 / 17
• Backbone TTS model: FastSpeech2 [Ren+21] with trainable prosody predictor
Backbone TTS model: FastSpeech2
6 / 17
• FastSpeech2 [Ren+21]
・Modules for generating a mel-spectrogram of speech from a phoneme sequence
・Stable learning and inference
[Figure: FastSpeech2 architecture; the variance adaptor contains the duration/pitch/energy predictors]
Proposed method
7 / 17
• Prosody predictor: DNN to estimate pitch changes per syllable (mora)
• Input: Phoneme embedding + Word embedding from BERT [Devlin+19]
• Output: One of three prosodic symbols ("[" / "]": raising / lowering pitch, "_": keeping accent unchanged)
• Loss function: MSE between predicted symbols and ground truth (text analysis results)
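
As a rough illustration of the three prosodic symbols above, the sketch below derives one symbol per mora from a High/Low pitch pattern. The function name `hl_to_symbols` and the convention that the symbol on a mora describes the pitch change toward the next mora are assumptions made for this example; the slides do not specify the exact labeling scheme.

```python
def hl_to_symbols(pattern):
    """Map a per-mora High/Low pattern to per-mora prosodic symbols.

    Assumed convention (not stated in the slides): the symbol attached to
    mora i describes the pitch change between mora i and mora i + 1.
    """
    if not pattern:
        return []
    symbols = []
    for cur, nxt in zip(pattern, pattern[1:]):
        if cur == "L" and nxt == "H":
            symbols.append("[")  # pitch rises after this mora
        elif cur == "H" and nxt == "L":
            symbols.append("]")  # pitch falls after this mora
        else:
            symbols.append("_")  # pitch stays the same
    symbols.append("_")  # last mora: no following mora to compare with
    return symbols


print(hl_to_symbols(["L", "H", "H", "H"]))
```

Under this convention, a listener's correction of a High/Low pattern translates directly into a symbol sequence that the TTS model can take in place of the predictor's output.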
Overview of proposed HITL framework
8 / 17
• Motivation: Improve the adaptability of synthetic speech
• Approach: Involve multiple listeners in the process of correcting accent errors
[Figure: HITL loop in which multiple listeners correct the accent of the synthesized "a me"]
HITL accent error correction: Feedback aggregation
9 / 17
• Challenging point: Differences in listeners' error correction abilities
・Simple way: Choose one from multiple accent annotations
[Figure: a random selector may pick a bad accent annotation, which yields low-quality speech]
・Proposed: Aggregate in the following ways (actual accent is Low (L) High (H) H H)

| Annotator / method | Annotated accent | Ability estimated by MACE |
| --- | --- | --- |
| Listener 1 | L L L L | Low ability |
| Listener 2 | L L L L | Low ability |
| Listener 3 | L H H H | High ability |
| Mode | L L L L | |
| MACE [Hovy+13] | L H H H | |
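
The difference between the two aggregation rows can be sketched with a per-mora vote. The code below is not MACE itself: MACE estimates each annotator's competence from the annotations [Hovy+13], whereas here the `ability` weights are hypothetical, hand-set values used only to show why an ability-aware vote can recover "L H H H" when the plain mode returns "L L L L". The function names are also assumptions for this example.

```python
from collections import Counter


def aggregate_mode(annotations):
    """Per-mora majority vote over equally long H/L annotation sequences."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]


def aggregate_weighted(annotations, weights):
    """Per-mora vote in which each listener's label counts with its weight."""
    result = []
    for column in zip(*annotations):
        scores = Counter()
        for label, weight in zip(column, weights):
            scores[label] += weight
        result.append(max(scores, key=scores.get))
    return result


# Annotations from the slide example (listeners 1 to 3).
annotations = [list("LLLL"), list("LLLL"), list("LHHH")]
ability = [0.2, 0.2, 0.9]  # hypothetical weights, not values from the slides

print("".join(aggregate_mode(annotations)))               # LLLL
print("".join(aggregate_weighted(annotations, ability)))  # LHHH
```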
Experimental evaluation
10 / 17
• Evaluation targets
・TTS model
・HITL framework
• Objective evaluation criterion
・RMSE of the logF0 between synthetic and natural speech
• Subjective evaluation criterion
・Mean Opinion Score (MOS) test
・Listeners rated the naturalness of each sample on a 5-point scale (1: very poor to 5: very good).
・The number of listeners was 50, and each listened to 20 speech samples.
・Preference AB test
・Listeners evaluated 10 pairs of speech samples synthesized by a specific method-pair.
・The number of listeners for each AB test was 25.
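
The objective criterion above, the RMSE of log F0, can be sketched as follows. This is a minimal version under stated assumptions: the two F0 contours are already time-aligned and equally long, unvoiced frames are marked with a non-positive F0, and only frames voiced in both signals are compared. The slides do not describe the alignment or voicing handling, and the function name `log_f0_rmse` is made up for this example.

```python
import math


def log_f0_rmse(f0_synth, f0_natural):
    """RMSE between log F0 contours over frames voiced in both signals.

    Assumptions for this sketch: the contours are time-aligned, have equal
    length, and mark unvoiced frames with F0 <= 0.
    """
    if len(f0_synth) != len(f0_natural):
        raise ValueError("contours must have the same number of frames")
    sq_errors = [
        (math.log(s) - math.log(n)) ** 2
        for s, n in zip(f0_synth, f0_natural)
        if s > 0 and n > 0
    ]
    if not sq_errors:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Identical voiced frames give zero error; the unvoiced frame is skipped.
print(log_f0_rmse([200.0, 0.0, 220.0], [200.0, 210.0, 220.0]))  # 0.0
```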
Experimental setting for TTS model
11 / 17
• Experimental conditions for TTS model

| Item | Setting |
| --- | --- |
| TTS model | FastSpeech2 [Ren+21] |
| Train/eval. data of TTS model and prosody predictor | JSUT corpus [Takamichi+20], BASIC5000 subset, 4,488 / 512 sentences |

• Compared methods for TTS model
・FS2
・FS2+Symbol
・FS2+Predictor (target)
・FS2+Predictor …Proposed
Result of objective evaluation
12 / 17
• Objective evaluation of TTS model
[Figure: objective evaluation results (logF0 RMSE) for the compared methods on a Good/Bad axis; plot not recoverable]
・worsen: This means that the prediction error of the prosody predictor significantly degrades quality.
・comparable: This suggests that natural speech can be synthesized if the correct accent is obtained through feedback.
Result of subjective evaluation
13 / 17
• Subjective evaluation of TTS model
・Preference score for prosody naturalness of synthetic speech
…BOLD denotes a significant difference between the two methods
・Slide annotations: "better" (twice), "No significant difference"

| Compared methods | Preference score |
| --- | --- |
| FS2 vs. FS2+Predictor | 0.552 vs. 0.448 |
| FS2 vs. FS2+Predictor (target) | 0.268 vs. 0.732 |
| FS2+Symbol vs. FS2+Predictor | 0.768 vs. 0.232 |
| FS2+Symbol vs. FS2+Predictor (target) | 0.520 vs. 0.480 |
| FS2 vs. FS2+Symbol | 0.248 vs. 0.752 |

These results suggest that our method improves the prosodic naturalness of synthetic speech.
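
One way to check whether a preference split differs from chance is an exact two-sided binomial test, sketched below. The slides do not say which test the authors used, so this is only an illustration. It assumes the 25 listeners × 10 pairs of each AB test can be pooled into 250 independent judgments, and the function name `binomial_two_sided_p` is hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided binomial test against a chance level of 0.5.

    Sums the probabilities of all outcomes at least as far from
    trials / 2 as the observed count.
    """
    observed_dev = abs(successes - trials / 2)
    total = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= observed_dev
    )
    return total / 2 ** trials


trials = 25 * 10  # listeners x pairs, assumed to be pooled
for score in (0.552, 0.732):
    wins = round(score * trials)
    print(score, wins, binomial_two_sided_p(wins, trials))
```

Under these assumptions, a score of 0.552 (138 of 250) stays above the 0.05 level, while 0.732 (183 of 250) falls far below it.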
Experimental setting for HITL framework
14 / 17
• Experimental conditions for HITL framework

| Item | Setting |
| --- | --- |
| Listener hiring platform | Lancers (crowdsourcing platform) |
| Dataset, participants | Japanese female speaker (JSUT corpus; data not used for training the TTS models and prosody predictor); 100 sentences, 15 persons / sentence (= 1,500 crowdworkers) |
| HITL framework interface | Radio buttons initialized with text-analysis-derived accent information |

• Compared methods for HITL framework
・Mode, MACE, Best, Median, Worst
[Figure: for each sentence, the accent annotations from the 15 crowdworkers (e.g., "L H H H L L ....") are each fed to the E2E TTS, the synthesized outputs are sorted by logF0 RMSE against the ground truth (Good ↑ / Bad ↓), and the input accents ranked Best, Median, and Worst are selected]
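
The Best / Median / Worst selection described above can be sketched as a sort over (annotation, logF0 RMSE) pairs. The RMSE values and the six-mora accent strings below are hypothetical placeholders, and the function name and the tie-breaking for the median are assumptions for this example.

```python
def pick_best_median_worst(scored_annotations):
    """Select reference annotations from (accent, logF0 RMSE) pairs.

    Lower RMSE is treated as better. With an even number of candidates the
    upper-middle element is used as the median (an arbitrary choice here).
    """
    if not scored_annotations:
        raise ValueError("need at least one annotation")
    ranked = sorted(scored_annotations, key=lambda pair: pair[1])
    return {
        "Best": ranked[0][0],
        "Median": ranked[len(ranked) // 2][0],
        "Worst": ranked[-1][0],
    }


# Hypothetical accents and RMSE values for three crowdworkers.
candidates = [("LHHHLL", 0.12), ("HHHHLL", 0.31), ("LHHHHL", 0.20)]
print(pick_best_median_worst(candidates))
```

Because this selection needs the ground-truth speech, Best, Median, and Worst serve as reference points for the evaluation, while Mode and MACE only use the annotations themselves.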
Result of objective evaluation
15 / 17
• Objective evaluation of HITL framework
[Figure: objective evaluation results on a Good/Bad axis, with the methods w/ HITL marked; plot not recoverable. Slide annotations: "worsen", "comparable"]
These results suggest that error correction feedback can improve TTS quality.
Result of subjective evaluation
16 / 17
• Subjective evaluation of HITL framework
・Mean opinion score (MOS) for prosody naturalness of synthetic speech
…BOLD are higher than those of FS2+Predictor

| Method | MOS | 95% confidence interval |
| --- | --- | --- |
| FS2+Predictor | 2.87 | 0.163 |
| FS2+Predictor (target) | 3.54 | 0.156 |
| Best | 3.62 | 0.151 |
| Median | 3.38 | 0.156 |
| Worst | 2.84 | 0.174 |
| Mode | 3.63 | 0.159 |
| MACE | 3.49 | 0.161 |

・obviously lower: These results suggest that the prediction error of the prosody predictor degrades the naturalness of speech.
・significantly higher: These results suggest that our method improves the prosodic naturalness of synthetic speech.
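
A MOS with a 95% confidence interval can be computed from raw ratings as sketched below. The slides do not state how the interval was obtained, so this sketch assumes the reported value is the half-width of a normal-approximation interval (1.96 standard errors). The function name `mos_with_ci` and the ratings in the usage line are made up for illustration.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of the normal-approximation 95% CI)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 5-point ratings, not data from the experiment.
print(mos_with_ci([3, 4, 4, 5, 3, 4, 2, 4]))
```

Two methods whose intervals do not overlap, such as FS2+Predictor and Mode in the table, differ by more than this margin of error.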
Summary
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrected accent errors.
・Our method successfully achieved the same quality of synthetic speech
as the conventional method.
• Future work
・User interface of feedback
・How to integrate the obtained prosodic sequences.
17 / 17
Thank you for your attention!
