Investigation of Text-to-speech based Synthetic Parallel Data for Sequence-to-sequence Non-parallel Voice Conversion
Ding Ma, Wen-Chin Huang and Tomoki Toda
Graduate School of Informatics, Nagoya University, Nagoya, Japan
Paper ID: #1606 Presenter: Ding Ma
Introduction
• Voice conversion (VC)
• A technique that converts the speaker identity of speech from a source speaker to a target speaker while preserving the linguistic information.
• VC is expected to play a significant role in augmented human communication.
[Figure: source speech → VC → target speech]
Introduction
• Sequence-to-sequence (seq2seq) modeling
• Seq2seq model: a model that takes a sequence of items and outputs another sequence of items; such models emerged from the development of deep neural networks (DNNs).
• Can automatically determine the output phoneme duration.
• Captures long-term dependencies: prosody (F0 and duration), intonation…
• Requires a large parallel speech corpus from the source and target speakers for training.
[Figure: source speech → Encoder → Attention → Decoder → target speech]
Background
• Voice Conversion Challenge 2020 (VCC2020)
• Biennial event to compare the performance of different VC systems.
• 2 tasks: intra-lingual, semiparallel case in Task 1; cross-lingual case in Task 2.
• Parallel: same utterances
• Nonparallel: different utterances
• Semiparallel: parallel + nonparallel (can be regarded as a relaxation of the nonparallel case)
• Limited dataset: only 90 utterances in Task 1 / 70 utterances in Task 2
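The three data conditions above can be told apart by comparing which utterances each speaker has recorded. The following is a minimal sketch, assuming each utterance carries a text or prompt ID; the function name and the IDs are hypothetical and not part of VCC2020.

```python
def classify_dataset(source_ids, target_ids):
    """Label a VC training set by how many utterances both speakers share."""
    source_ids, target_ids = set(source_ids), set(target_ids)
    shared = source_ids & target_ids
    if shared == source_ids == target_ids:
        return "parallel"       # every utterance exists for both speakers
    if not shared:
        return "nonparallel"    # no utterance exists for both speakers
    return "semiparallel"       # some shared, some speaker-specific


# Hypothetical utterance IDs, for illustration only.
print(classify_dataset(["u1", "u2"], ["u1", "u2"]))
print(classify_dataset(["u1", "u2"], ["u3", "u4"]))
print(classify_dataset(["u1", "u2"], ["u1", "u3"]))
```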
Background
• VTN: Voice Transformer Network, a sequence-to-sequence voice conversion model using Transformer with text-to-speech (TTS) pretraining [1].
• ➕ Training data reduced from 1 hr → 5 min (thanks to pretraining).
• ➖ Still needs parallel training data.
• How to tackle the issue of a semiparallel dataset?
• 「Synthetic speech method」
• We extended the VTN model by training TTS models to generate synthetic parallel data (SPD). (Semiparallel → Parallel)
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
Background
• Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
[Figure: (a) TTS training process using the semiparallel dataset; (b) SPD generation process using source synthetic data, target synthetic data, and external SPD.]
Background
• Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
• Four types of parallel data are available in total for training the VC model:
1. <source natural, target natural>
2. <SPD with source synthetic, target natural>
3. <source natural, SPD with target synthetic>
4. <external SPD with source synthetic, external SPD with target synthetic>
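How the four pair types arise from a semiparallel dataset can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `build_training_pairs`, the pair labels, and the `tts_src` / `tts_tgt` callables (stand-ins for the fine-tuned TTS models of each speaker) are all hypothetical, and utterances are represented by plain strings.

```python
def build_training_pairs(src_nat, tgt_nat, external_texts, tts_src, tts_tgt):
    """Assemble VC training pairs from a semiparallel dataset.

    src_nat / tgt_nat: dict mapping text -> natural utterance of that speaker.
    tts_src / tts_tgt: callables text -> synthetic utterance (placeholders
    for the TTS models trained on each speaker; an assumption of this sketch).
    """
    pairs = []
    for text in sorted(set(src_nat) | set(tgt_nat)):
        if text in src_nat and text in tgt_nat:
            # Type 1: both speakers recorded this text.
            pairs.append(("natural-natural", src_nat[text], tgt_nat[text]))
        elif text in tgt_nat:
            # Type 2: only the target recorded it, so synthesize the source.
            pairs.append(("synthetic-natural", tts_src(text), tgt_nat[text]))
        else:
            # Type 3: only the source recorded it, so synthesize the target.
            pairs.append(("natural-synthetic", src_nat[text], tts_tgt(text)))
    for text in external_texts:
        # Type 4: external text, both sides synthesized.
        pairs.append(("synthetic-synthetic", tts_src(text), tts_tgt(text)))
    return pairs


# Toy data: t1 is parallel, t2 is source-only, t3 is target-only, t9 is external.
src = {"t1": "src_nat_t1", "t2": "src_nat_t2"}
tgt = {"t1": "tgt_nat_t1", "t3": "tgt_nat_t3"}
pairs = build_training_pairs(
    src, tgt, ["t9"],
    tts_src=lambda t: f"src_syn_{t}",
    tts_tgt=lambda t: f"tgt_syn_{t}",
)
for kind, s, t in pairs:
    print(kind, s, t)
```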
「Synthetic Speech Method」
• There are still uncertainties about the effects and usage of SPD in seq2seq VC models. In this paper we address the following 3 questions:
• Q1: What are the feasibility and properties of using SPD?
• Q1-1: How does the quality of the data affect VC performance?
• Q1-2: Which kind of training pair is better?
• Source + target natural / source synthetic only / target synthetic only / natural + synthetic (mixed situation)
• Q2: How can this method benefit from a semiparallel setting?
• Fix the original training data and vary the semiparallel ratio (0 / 25% / 50% / 75% / 100%)
• Q3: What are the influences of using external text data?
• Fix the original training data and increase the external data (1k / 2k / 5k)
Datasets and Configuration
• Initial dataset: CMU ARCTIC database (1132 parallel utterances recorded by English speakers at 16 kHz)
• Female: clb, slt
• Male: bdl, rms
• Development set and evaluation set: 100 utterances each
• External dataset: M-AILABS database
• English corpus: 15369 utterances, 30 hours long
• Implementation:
• TTS models: pretrained Transformer-TTS architecture
• VC model: VTN (Transformer-based seq2seq VC model) [1]
• Vocoder: Parallel WaveGAN (PWG) neural vocoder [2]
• Objective evaluation: Transformer-based ASR engine trained on the LibriSpeech database [3]
[1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
[2] R. Yamamoto, E. Song, and J. M. Kim. “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.
[3] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884-5888, 2018.
Experiment and Evaluation
• Q1: What are the feasibility and properties of using SPD?
Five kinds of training pairs:
1. <source natural, target natural>
2. <source natural, target synthetic>
3. <source synthetic, target natural>
4. <source synthetic, target synthetic>
5. <source synthetic and source natural, target natural and target synthetic>
Experiment and Evaluation
• MCD: mel-cepstral distortion / CER: character error rate / WER: word error rate
• Objective evaluation results for Q1
Table I: Comparison of results for different training pairs and data sizes. TTS-450, TTS-400, TTS-200 and TTS-80 denote the corresponding data size used for TTS fine-tuning, which also reflects the TTS performance and the data size for SPD generation and VC training.
• TTS performance has a critical impact on the VC results.
• Among the pairs using SPD, the source synthetic - target natural training set generally performs best.
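The three objective metrics can be sketched in a few lines. This is a minimal sketch using their standard definitions, not the evaluation scripts used for the tables: CER and WER are edit distances normalized by reference length, and MCD is computed here under the assumptions that the frames are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient has been dropped. All function names are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of an ASR hypothesis against the reference text."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def wer(ref, hyp):
    """Word error rate of an ASR hypothesis against the reference text."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def mcd(ref_frames, conv_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a list of mel-cepstral coefficients; alignment and removal
    of the 0th coefficient are assumed to have been done beforehand.
    """
    const = 10.0 / math.log(10.0)
    total = 0.0
    for a, b in zip(ref_frames, conv_frames):
        total += const * math.sqrt(2.0 * sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(ref_frames)


print(wer("the cat sat", "the dog sat"))
print(cer("abc", "abd"))
print(mcd([[1.0, 2.0]], [[1.0, 2.5]]))
```

Lower is better for all three: MCD compares converted speech with the target speaker's natural speech, while CER and WER measure how intelligible the converted speech is to the ASR engine.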
Experiment and Evaluation
• Q2: How can this method benefit from a semiparallel setting?
• Training procedure with different semiparallel settings (e.g., data size = 400).
• The parallel ratio (PR) represents the proportion of natural parallel data, and thus reflects the semiparallel setting of each group.
• The TTS models of the source and target speakers are each trained with a constant data size but a different semiparallel setting for each group.
• Two-part experiment: training dataset I retains all SPD, as shown in (a); training dataset II removes the natural-synthetic part of the semiparallel cases, as shown in (b).
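One way to realize a given parallel ratio is sketched below. It assumes that each speaker always keeps `datasize` natural utterances, of which a fraction PR share the same text and the rest are speaker-specific; the slides do not spell out the exact partitioning, so this layout, the function name, and the utterance IDs are assumptions for illustration.

```python
def semiparallel_split(utt_ids, datasize, parallel_ratio):
    """Split an utterance pool into parallel, source-only and target-only parts.

    Each speaker ends up with `datasize` natural utterances:
    round(datasize * parallel_ratio) shared ones plus speaker-specific ones.
    """
    n_par = round(datasize * parallel_ratio)
    n_non = datasize - n_par
    needed = n_par + 2 * n_non
    if len(utt_ids) < needed:
        raise ValueError(f"need {needed} utterances, got {len(utt_ids)}")
    parallel = utt_ids[:n_par]
    src_only = utt_ids[n_par:n_par + n_non]
    tgt_only = utt_ids[n_par + n_non:needed]
    return parallel, src_only, tgt_only


# Hypothetical pool: 1132 utterances minus 100 development and 100 evaluation.
train_pool = [f"arctic_{i:04d}" for i in range(932)]
for pr in (0.0, 0.25, 0.5, 0.75, 1.0):
    parallel, src_only, tgt_only = semiparallel_split(train_pool, 400, pr)
    print(pr, len(parallel), len(src_only), len(tgt_only))
```

Under this layout, the source-only and target-only parts are the utterances for which SPD would have to be synthesized on the other speaker's side.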
Experiment and Evaluation
• Objective evaluation results for Q2
Table II: Experimental results under different semiparallel settings.
Experiment and Evaluation
• Objective evaluation results for Q3
Table III: Experimental results of adding external data of different sizes. TTS-400 and TTS-200 denote the corresponding data size used for TTS fine-tuning.
Experiment and Evaluation
• Subjective evaluation (MOS) results for Q1, Q2 and Q3 under specific datasets.
Table IV: Results of subjective evaluation on the test set for data sizes of 450 and 80, with 95% confidence intervals, for Q1.
Table V: Results of subjective evaluation on the test sets, with 95% confidence intervals, for Q2.
Table VI: Results of subjective evaluation on the test sets for data sizes of 400 and 200, with 95% confidence intervals, for Q3.
• The overall results are consistent with the findings in the objective evaluations.
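A mean opinion score with a 95% confidence interval can be computed as in the sketch below. The slides do not state how the intervals were obtained, so the normal approximation (z = 1.96) used here is an assumption, and the listener ratings are made-up illustrative values, not results from the paper.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half-width) of an approximate 95% confidence interval.

    Uses the normal approximation mean ± z * s / sqrt(n), where s is the
    sample standard deviation of the listener ratings.
    """
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Illustrative ratings on a 1-5 scale (not data from the paper).
ratings = [4, 3, 5, 4, 4, 3, 5, 4]
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```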
Conclusions
• SPD is feasible for seq2seq non-parallel VC. The VC results using SPD are determined by the performance of the TTS models and the VC training data size. In addition, the VC result is also affected by which side (source or target) the SPD is applied to.
• When the dataset is semiparallel, the PR should be kept large enough. If the original data size is large, introducing SPD for either the target speaker or the source speaker can achieve good VC results, so making full use of all types of SPD to secure enough data maximizes the benefit. Conversely, when the original data size is small, well-performing TTS models are difficult to obtain, and training pairs with a negative impact, such as source natural-target synthetic, should be avoided.
• SPD with external text data as data augmentation can improve parallel seq2seq VC performance to a certain extent (e.g. natural-natural).
Future work
• Use more speakers and a larger amount of data to further investigate the benefits that seq2seq non-parallel VC can obtain from SPD.
• In terms of methodology, introduce VC models that can be trained directly on non-parallel data and compare their performance with that of SPD-based seq2seq VC, so as to further clarify the role of SPD.
18

More Related Content

What's hot

深層学習を利用した音声強調
深層学習を利用した音声強調深層学習を利用した音声強調
深層学習を利用した音声強調Yuma Koizumi
 
Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice ConversionNU_I_TODALAB
 
信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離NU_I_TODALAB
 
Hands on Voice Conversion
Hands on Voice ConversionHands on Voice Conversion
Hands on Voice ConversionNU_I_TODALAB
 
統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展Shinnosuke Takamichi
 
実環境音響信号処理における収音技術
実環境音響信号処理における収音技術実環境音響信号処理における収音技術
実環境音響信号処理における収音技術Yuma Koizumi
 
WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響NU_I_TODALAB
 
End-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークEnd-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークNU_I_TODALAB
 
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3Naoya Takahashi
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020Yuki Saito
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法NU_I_TODALAB
 
短時間発話を用いた話者照合のための音声加工の効果に関する検討
短時間発話を用いた話者照合のための音声加工の効果に関する検討短時間発話を用いた話者照合のための音声加工の効果に関する検討
短時間発話を用いた話者照合のための音声加工の効果に関する検討Shinnosuke Takamichi
 
敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワークNU_I_TODALAB
 
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトCREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトNU_I_TODALAB
 
Nakai22sp03 presentation
Nakai22sp03 presentationNakai22sp03 presentation
Nakai22sp03 presentationYuki Saito
 
異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例NU_I_TODALAB
 
Statistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modelingStatistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modelingNU_I_TODALAB
 
Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討Shinnosuke Takamichi
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)Yuki Saito
 

What's hot (20)

深層学習を利用した音声強調
深層学習を利用した音声強調深層学習を利用した音声強調
深層学習を利用した音声強調
 
Advanced Voice Conversion
Advanced Voice ConversionAdvanced Voice Conversion
Advanced Voice Conversion
 
信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離信号の独立性に基づく多チャンネル音源分離
信号の独立性に基づく多チャンネル音源分離
 
Hands on Voice Conversion
Hands on Voice ConversionHands on Voice Conversion
Hands on Voice Conversion
 
統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展
 
実環境音響信号処理における収音技術
実環境音響信号処理における収音技術実環境音響信号処理における収音技術
実環境音響信号処理における収音技術
 
WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響WaveNetが音声合成研究に与える影響
WaveNetが音声合成研究に与える影響
 
End-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head DecoderネットワークEnd-to-End音声認識ためのMulti-Head Decoderネットワーク
End-to-End音声認識ためのMulti-Head Decoderネットワーク
 
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020
 
喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法喉頭摘出者のための歌唱支援を目指した電気音声変換法
喉頭摘出者のための歌唱支援を目指した電気音声変換法
 
短時間発話を用いた話者照合のための音声加工の効果に関する検討
短時間発話を用いた話者照合のための音声加工の効果に関する検討短時間発話を用いた話者照合のための音声加工の効果に関する検討
短時間発話を用いた話者照合のための音声加工の効果に関する検討
 
敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク敵対的学習による統合型ソースフィルタネットワーク
敵対的学習による統合型ソースフィルタネットワーク
 
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクトCREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
CREST「共生インタラクション」共創型音メディア機能拡張プロジェクト
 
Nakai22sp03 presentation
Nakai22sp03 presentationNakai22sp03 presentation
Nakai22sp03 presentation
 
異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例異常音検知に対する深層学習適用事例
異常音検知に対する深層学習適用事例
 
Saito2103slp
Saito2103slpSaito2103slp
Saito2103slp
 
Statistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modelingStatistical voice conversion with direct waveform modeling
Statistical voice conversion with direct waveform modeling
 
Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討Moment matching networkを用いた音声パラメータのランダム生成の検討
Moment matching networkを用いた音声パラメータのランダム生成の検討
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 

Similar to Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion

Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...IRJET Journal
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEIRJET Journal
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNAVER Engineering
 
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti1
 
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...IRJET Journal
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET Journal
 
Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...eSAT Publishing House
 
Conversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionConversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionTakato Hayashi
 
Mediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesMediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesXavier Anguera
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...ijnlc
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHIRJET Journal
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT IntroductionRIILP
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014Paris Open Source Summit
 
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015The BioTeam Inc.
 

Similar to Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion (20)

Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
Rendering Of Voice By Using Convolutional Neural Network And With The Help Of...
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
 
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptx
 
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
IRJET - Automatic Lip Reading: Classification of Words and Phrases using Conv...
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
IRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation SystemIRJET- Speech to Speech Translation System
IRJET- Speech to Speech Translation System
 
Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...
 
Conversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognitionConversational transfer learning for emotion recognition
Conversational transfer learning for emotion recognition
 
Mediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slidesMediaeval 2013 Spoken Web Search results slides
Mediaeval 2013 Spoken Web Search results slides
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
 
Deep Domain
Deep DomainDeep Domain
Deep Domain
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
 
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 

More from NU_I_TODALAB

距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知NU_I_TODALAB
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionNU_I_TODALAB
 
音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識NU_I_TODALAB
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法NU_I_TODALAB
 
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法NU_I_TODALAB
 
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元NU_I_TODALAB
 
Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法NU_I_TODALAB
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換NU_I_TODALAB
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...NU_I_TODALAB
 
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調NU_I_TODALAB
 
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離NU_I_TODALAB
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現NU_I_TODALAB
 

More from NU_I_TODALAB (12)

距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知距離学習を導入した二値分類モデルによる異常音検知
距離学習を導入した二値分類モデルによる異常音検知
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
 
音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識音素事後確率を利用した表現学習に基づく発話感情認識
音素事後確率を利用した表現学習に基づく発話感情認識
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
 
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
空気/体内伝導マイクロフォンを用いた雑音環境下における自己発声音強調/抑圧法
 
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
時間領域低ランクスペクトログラム近似法に基づくマスキング音声の欠損成分復元
 
Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法Deep Neural Networkに基づく日常生活行動認識における適応手法
Deep Neural Networkに基づく日常生活行動認識における適応手法
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
 
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
実環境下におけるサイレント音声通話の実現に向けた雑音環境変動に頑健な非可聴つぶやき強調
 
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
ケプストラム正則化NTFによるステレオチャネル楽曲音源分離
 
音情報処理における特徴表現
音情報処理における特徴表現音情報処理における特徴表現
音情報処理における特徴表現
 

Recently uploaded

Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodManicka Mamallan Andavar
 
70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical trainingGladiatorsKasper
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Secure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labsamber724300
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdfAkritiPradhan2
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptxmohitesoham12
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organizationchnrketan
 
Javier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier Fernández Muñoz
 
Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptNoman khan
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 

Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion

  • 1. Investigation of Text-to-speech based Synthetic Parallel Data for Sequence-to-sequence Non-parallel Voice Conversion
      • Ding Ma, Wen-Chin Huang and Tomoki Toda
      • Graduate School of Informatics, Nagoya University, Nagoya, Japan
      • Paper ID: #1606; Presenter: Ding Ma
  • 2. Introduction
      • Voice conversion (VC)
          • A methodology that converts the speaker identity of speech from a source speaker to a target speaker while preserving the linguistic information.
          • VC is expected to play a significant role in augmented human communication.
      • Diagram: source speech → VC → target speech
  • 3. Introduction
      • Sequence-to-sequence (seq2seq) modeling
          • Seq2seq model: a model that takes a sequence of items and outputs another sequence of items. Such models emerged from the development of deep neural networks (DNNs).
          • Can automatically determine the output phoneme duration.
          • Captures long-term dependencies: prosody (F0 and duration), intonation…
          • Requires a large parallel speech corpus from the source and target speakers for training.
      • Diagram: source speech → encoder → attention → decoder → target speech
  • 4. Background
      • Voice Conversion Challenge 2020 (VCC2020)
          • Biennial event to compare the performance of different VC systems.
          • 2 tasks: intra-lingual, semi-parallel case in Task 1; cross-lingual case in Task 2.
          • Parallel: same utterances
          • Nonparallel: different utterances
          • Semiparallel: a mix of parallel and nonparallel data (can be regarded as a relaxation of the nonparallel case)
          • Limited dataset: only 90 corpus in T1 / 70 corpus in T2
  • 5. Background
      • VTN: Voice Transformer Network, i.e. sequence-to-sequence voice conversion using Transformer with text-to-speech (TTS) pretraining [1].
          • ➕ 1 hr → 5 min of training data (thanks to pretraining).
          • ➖ Still needs parallel training data.
      • How to tackle the issue of a semiparallel dataset?
          • 「Synthetic speech method」
          • We extended the VTN model by training TTS models to generate synthetic parallel data (SPD). (Semiparallel → Parallel)
      • [1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
  • 6. Background
      • Figure: generation process of synthetic parallel data (SPD) from a semiparallel dataset. (a) TTS training process using the semiparallel dataset; (b) SPD generation process using source synthetic data, target synthetic data, and external SPD.
  • 7. Background
      • Generation process of synthetic parallel data (SPD) from a semiparallel dataset.
      • In total, four types of parallel data are available for training the VC model:
          1. <source natural, target natural>
          2. <SPD with source synthetic, target natural>
          3. <source natural, SPD with target synthetic>
          4. <external SPD with source synthetic, external SPD with target synthetic>
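The four pair types on slide 7 can be illustrated with a small sketch. This is not the authors' code: the `Pair` record, the `build_training_pairs` function, and the utterance IDs are hypothetical names introduced only for illustration. It assumes each utterance is identified by its sentence ID, and that a TTS model stands in for whichever speaker did not record a given sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    utt_id: str
    source: str  # "natural" or "synthetic"
    target: str  # "natural" or "synthetic"


def build_training_pairs(src_ids, tgt_ids, external_ids=()):
    """Label every sentence with the kind of <source, target> pair it yields.

    src_ids / tgt_ids: sentences recorded by the source / target speaker.
    external_ids: text-only sentences recorded by neither speaker (assumption).
    """
    src_ids, tgt_ids = set(src_ids), set(tgt_ids)
    pairs = []
    # Type 1: both speakers recorded the sentence.
    for u in sorted(src_ids & tgt_ids):
        pairs.append(Pair(u, "natural", "natural"))
    # Type 2: only the target recorded it, so the source side is synthesized.
    for u in sorted(tgt_ids - src_ids):
        pairs.append(Pair(u, "synthetic", "natural"))
    # Type 3: only the source recorded it, so the target side is synthesized.
    for u in sorted(src_ids - tgt_ids):
        pairs.append(Pair(u, "natural", "synthetic"))
    # Type 4: external text, both sides synthesized.
    for u in sorted(set(external_ids) - src_ids - tgt_ids):
        pairs.append(Pair(u, "synthetic", "synthetic"))
    return pairs


if __name__ == "__main__":
    for p in build_training_pairs({"a", "b", "c"}, {"a", "d"}, {"x"}):
        print(p)
```

In this toy call, sentence `a` gives a natural-natural pair, `d` a synthetic-natural pair, `b` and `c` natural-synthetic pairs, and `x` a fully synthetic external pair.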
  • 8. 「Synthetic Speech Method」
      • There are still uncertainties about the effects and usage of SPD in seq2seq VC models. In this paper we try to address the following 3 questions:
      • Q1: What are the feasibility and properties of using SPD?
          • Q1-1: How does the quality of the data affect VC performance?
          • Q1-2: Which kind of training pair is better?
              • Source + target natural / source synthetic only / target synthetic only / natural + synthetic (mixed situation)
      • Q2: How can this method benefit from a semiparallel setting?
          • Fix the original training data and vary the semiparallel ratio (0 / 25% / 50% / 75% / 100%)
      • Q3: What are the influences of using external text data?
          • Fix the original training data and increase the external data (1k / 2k / 5k)
  • 9. Datasets and Configuration
      • Initial dataset: CMU ARCTIC database (1132 parallel utterances recorded by English speakers at 16 kHz)
          • Female: clb, slt
          • Male: bdl, rms
          • Development set and evaluation set: 100 utterances each
      • External dataset: M-AILABS database
          • English corpus: 15369 utterances, 30 hours long
      • Implementation:
          • TTS models: pretrained Transformer-TTS architecture
          • VC model: VTN (Transformer-based seq2seq VC model) [1]
          • Vocoder: Parallel WaveGAN (PWG) neural vocoder [2]
          • Objective evaluation: Transformer-based ASR engine trained on the LibriSpeech database [3]
      • [1] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
      • [2] R. Yamamoto, E. Song, and J. M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.
      • [3] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884-5888, 2018.
  • 10. Experiment and Evaluation
      • Q1: What are the feasibility and properties of using SPD?
      • Five kinds of training pairs:
          1. <source natural, target natural>
          2. <source natural, target synthetic>
          3. <source synthetic, target natural>
          4. <source synthetic, target synthetic>
          5. <source synthetic and source natural, target natural and target synthetic>
  • 11. Experiment and Evaluation
      • MCD: mel-cepstral distortion / CER: character error rate / WER: word error rate
      • Objective evaluation results for Q1
      • Table I: Comparison of results for different training pairs and data sizes. TTS-450, TTS-400, TTS-200 and TTS-80 denote the corresponding data size used for TTS fine-tuning, which also reflects the TTS performance and the data size used for SPD generation and VC training.
      • TTS performance has a critical impact on the VC results.
      • Among the pairs that use SPD, the <source synthetic, target natural> training set generally performs best.
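For readers unfamiliar with the three objective metrics on slide 11, here is a minimal sketch of how they are commonly computed. It is not the evaluation code used in the study. It assumes the common MCD form (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²) averaged over frames that are already time-aligned and exclude the 0th coefficient, and it computes CER with spaces removed, both of which are conventions chosen here rather than details stated in the slides.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref_text, hyp_text):
    return error_rate(ref_text.split(), hyp_text.split())


def cer(ref_text, hyp_text):
    # Assumption: spaces are ignored when counting characters.
    return error_rate(list(ref_text.replace(" ", "")),
                      list(hyp_text.replace(" ", "")))


def mcd(ref_frames, conv_frames):
    """Mean mel-cepstral distortion in dB over pre-aligned frames."""
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("frames must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, c in zip(ref_frames, conv_frames):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(r, c)))
    return scale * total / len(ref_frames)
```

In an ASR-based evaluation like the one described on slide 9, `hyp_text` would be the recognizer's transcript of the converted speech and `ref_text` the ground-truth sentence; lower is better for all three metrics.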
  • 12. Experiment and Evaluation
      • Q2: How can this method benefit from a semiparallel setting?
      • Figure: training procedure under different semiparallel settings (e.g., datasize = 400).
      • The parallel ratio (PR) denotes the proportion of the natural parallel corpus, and so characterizes the semiparallel setting of each group.
      • For each group, the TTS models of the source and target speakers are trained with a constant data size but a different semiparallel setting.
      • The experiment has two parts: training dataset I retains all SPD, as shown in (a); training dataset II removes the natural-synthetic part of the semiparallel cases, as shown in (b).
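One way to picture the parallel ratio on slide 12 is as a partition of sentence IDs. The sketch below is an interpretation, not the authors' procedure: the function name `split_by_parallel_ratio` is hypothetical, and it assumes that each speaker keeps exactly `datasize` natural utterances, of which a fraction PR is shared and the rest are sentences the other speaker did not record.

```python
def split_by_parallel_ratio(utt_ids, datasize, parallel_ratio):
    """Partition sentence IDs into parallel, source-only and target-only sets.

    Assumption: each speaker has `datasize` natural utterances in total,
    so `parallel` + `source_only` and `parallel` + `target_only` both have
    `datasize` items.
    """
    if not 0.0 <= parallel_ratio <= 1.0:
        raise ValueError("parallel_ratio must lie in [0, 1]")
    n_par = round(datasize * parallel_ratio)
    n_non = datasize - n_par
    needed = n_par + 2 * n_non
    if needed > len(utt_ids):
        raise ValueError("not enough sentences for this setting")
    parallel = utt_ids[:n_par]
    source_only = utt_ids[n_par:n_par + n_non]
    target_only = utt_ids[n_par + n_non:needed]
    return parallel, source_only, target_only


if __name__ == "__main__":
    ids = [f"utt_{i:04d}" for i in range(1132)]  # placeholder IDs
    for pr in (0.0, 0.25, 0.5, 0.75, 1.0):
        par, src, tgt = split_by_parallel_ratio(ids, 400, pr)
        print(pr, len(par), len(src), len(tgt))
```

Under this reading, PR = 100% reduces to the fully parallel case, while PR = 0% leaves no natural-natural pairs, so every training pair would need at least one synthetic side.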
  • 13. Experiment and Evaluation
      • Objective evaluation results for Q2
      • Table II: Experimental results under different semiparallel settings.
  • 14. Experiment and Evaluation
      • Objective evaluation results for Q2
      • Table II: Experimental results under different semiparallel settings.
  • 15. Experiment and Evaluation
      • Objective evaluation results for Q3
      • Table III: Experimental results of adding external data of different sizes. TTS-400 and TTS-200 denote the corresponding data size used for TTS fine-tuning.
  • 16. Experiment and Evaluation
      • Subjective evaluation (MOS) results for Q1, Q2 and Q3 under specific datasets.
      • Table IV: Results of subjective evaluation on the test set with data sizes of 450 and 80, with 95% confidence intervals, for Q1.
      • Table V: Results of subjective evaluation on the test sets, with 95% confidence intervals, for Q2.
      • Table VI: Results of subjective evaluation on the test sets with data sizes of 400 and 200, with 95% confidence intervals, for Q3.
      • The overall results are consistent with the findings of the objective evaluations.
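The slides report MOS with 95% confidence intervals but do not say how the intervals were computed. A minimal sketch of one common approach follows, assuming independent ratings and a normal approximation (mean ± 1.96 standard errors). The function name and the ratings are made up for illustration.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Assumption: ratings are independent listener scores on a 1-5 scale;
    z = 1.96 corresponds to a 95% interval under the normal approximation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err


if __name__ == "__main__":
    mean, half = mos_with_ci([3, 4, 4, 5, 4])  # invented ratings
    print(f"MOS = {mean:.2f} ± {half:.2f}")
```

With few ratings per system, a Student's t quantile in place of 1.96 would give a wider and more cautious interval.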
  • 17. Conclusions
      • SPD is feasible for seq2seq non-parallel VC. The VC results obtained with SPD are determined by the performance of the TTS models and by the VC training data size. The VC result is also affected by which side (source or target) the SPD is used for.
      • When the dataset is semiparallel, we should try to ensure that the PR is large enough. If the original data size is large, introducing SPD for either the target speaker or the source speaker can achieve good VC results, so making full use of all types of SPD to secure the amount of data maximizes the benefit. Conversely, when the original data size is small, well-performing TTS models are difficult to obtain, and training pairs with a negative impact, such as <source natural, target synthetic>, should be avoided.
      • SPD with external text data as data augmentation can improve parallel seq2seq VC performance to a certain extent (e.g. natural-natural).
  • 18. Future work
      • Use more speakers and a larger amount of data to further investigate how much seq2seq non-parallel VC can benefit from SPD.
      • In terms of methodology, future research can introduce VC models that can be trained directly on non-parallel data and compare their performance with that of SPD-based seq2seq VC, so as to further clarify the role of SPD.