Hiroyuki Miyoshi, Yuki Saito,
Shinnosuke Takamichi, and Hiroshi Saruwatari
(The University of Tokyo)
Voice Conversion Using
Sequence-to-Sequence Learning
of Context Posterior Probabilities
INTERSPEECH Tue-O-4-10-1
Stockholm, Sweden
Aug. 22, 2017
INTERSPEECH 2017 @Stockholm Aug. 22, 2017 1/15
Outline of This Talk
Issue:
• Voice conversion needs parallel data of source and target speakers.
Conventional method:
• Voice conversion using context posterior probabilities (CPPs). [Sun et al., 2016]
1. Recognition: source speech feats. → source CPPs.
2. Synthesis: copied source CPPs → target speech feats.
Pros.: Non-parallel voice conversion
Cons.: Difficulty of converting speaker individuality included in CPPs
Proposed:
• Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs
• Joint training of recognition and synthesis to increase conversion performance
Results:
• Seq2Seq learning achieved variable-length voice conversion.
• Joint training improved speaker similarity and quality of converted speech.
Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016] Training
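[Figure: two-stage training. 1. Recognition: an LSTM $\boldsymbol{R}(\cdot)$ maps source speech feats. $\boldsymbol{x}$ to CPPs $\hat{\boldsymbol{p}}_x$ and is trained against context labels $\boldsymbol{l}_x$ (shown as a, i, u). 2. Synthesis: an LSTM $\boldsymbol{G}(\cdot)$ maps target CPPs $\hat{\boldsymbol{p}}_y$ to target speech feats. $\boldsymbol{y}$. Horizontal axis: time.]
Recognition error (softmax cross entropy): $L_C(\hat{\boldsymbol{p}}_x, \boldsymbol{l}_x)$
Synthesis error (mean squared error): $L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_y), \boldsymbol{y})$
Separated training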
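The two training criteria above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy 3-class CPP frames, the 2-dimensional speech feats., the one-hot (index) form of the context labels, and the averaging over frames are all assumptions made for the example. The CPP frames are assumed to be already softmax-normalized.

```python
import math

def recognition_error(cpps, labels):
    """L_C: mean cross entropy between predicted CPP frames and label indices."""
    return -sum(math.log(frame[label]) for frame, label in zip(cpps, labels)) / len(labels)

def synthesis_error(predicted, target):
    """L_G: mean squared error over all frames and feature dimensions."""
    squared = [(p - t) ** 2
               for p_frame, t_frame in zip(predicted, target)
               for p, t in zip(p_frame, t_frame)]
    return sum(squared) / len(squared)

# Toy data (hypothetical): 2 frames, 3 context classes, 2-dim speech feats.
source_cpps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # stands in for p_hat_x
source_labels = [0, 1]                               # stands in for l_x
synthesized = [[1.0, 2.0], [3.0, 4.0]]              # stands in for G(p_hat_y)
target_feats = [[1.0, 2.0], [3.0, 5.0]]             # stands in for y

l_c = recognition_error(source_cpps, source_labels)
l_g = synthesis_error(synthesized, target_feats)
print(f"L_C = {l_c:.4f}, L_G = {l_g:.4f}")
```

In separated training the two values are minimized independently, each by its own model, so nothing ties the recognizer's errors to what the synthesizer sees.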
Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016] Conversion (conventional)
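[Figure: conversion. 1. Recognition: the LSTM $\boldsymbol{R}(\cdot)$ maps source speech feats. $\boldsymbol{x}$ to source CPPs $\hat{\boldsymbol{p}}_x$. COPY: the source CPPs are used as the target CPPs. 2. Synthesis: the LSTM $\boldsymbol{G}(\cdot)$ maps the copied CPPs to predicted speech feats. $\hat{\boldsymbol{y}}$. Horizontal axis: time.]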
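A rough sketch of the copy-based conversion step, assuming frame-wise stand-ins for the two LSTMs (`toy_recognize` and `toy_synthesize` are hypothetical placeholders, not trained models). It shows that copying CPPs gives one output frame per input frame, so the converted speech keeps the source speaker's timing.

```python
def convert_by_copy(source_feats, recognize, synthesize):
    """Conventional conversion: recognize, copy the CPPs unchanged, synthesize."""
    source_cpps = [recognize(frame) for frame in source_feats]  # 1. Recognition
    copied_cpps = source_cpps                                    # COPY
    return [synthesize(cpp) for cpp in copied_cpps]              # 2. Synthesis

def toy_recognize(frame):
    """Hypothetical stand-in for R: normalize magnitudes into a posterior."""
    total = sum(abs(v) for v in frame) or 1.0
    return [abs(v) / total for v in frame]

def toy_synthesize(cpp):
    """Hypothetical stand-in for G: scale posteriors into feature values."""
    return [10.0 * p for p in cpp]

source_feats = [[1.0, 3.0], [2.0, 2.0], [4.0, 0.0]]
converted = convert_by_copy(source_feats, toy_recognize, toy_synthesize)
print(len(source_feats), len(converted))  # both sequences have the same length
```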
Issues of Conventional Voice Conversion
1. CPPs’ shapes and lengths differ significantly between speakers.
Shapes are different.
Lengths of each phoneme are different.
2. Improving recognition accuracy ≠ improving synthesis accuracy
Conventional method separately trains speech recognition/synthesis.
Proposed Algorithms
1. Sequence-to-Sequence Conversion from
Source CPP to Target CPP
2. Joint Training of Recognition and Synthesis
(like auto-encoding)
Sequence-to-Sequence Learning [Sutskever et al., 2014]
• Sequence-to-Sequence Learning: variable-length conversion
Input sequence (Encoder): 雨 が 降る
Output sequence (Decoder): It rains
Japanese-to-English translation using Seq2Seq learning
Constraints
• Phoneme duration is given.
• Conversion is done phoneme by phoneme.
• Problems of Seq2Seq conversion of CPPs
・Determining duration is difficult.
・Conversion failures propagate if the number of frames to be generated is large.
[Weng et al., 2016]
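The two constraints can be illustrated with a small sketch: the source CPP sequence is cut into phoneme segments using the given durations, and each segment is converted on its own into a fixed number of target frames. All names here are hypothetical, and `nearest_frame_resample` is only a placeholder that stands in for the Seq2Seq model so the example runs; it is not the conversion proposed in the talk.

```python
def split_by_durations(frames, durations):
    """Cut a frame sequence into consecutive segments of the given lengths."""
    segments, start = [], 0
    for d in durations:
        segments.append(frames[start:start + d])
        start += d
    if start != len(frames):
        raise ValueError("durations must sum to the number of frames")
    return segments

def convert_phoneme_by_phoneme(source_cpps, source_durations, target_durations, convert_segment):
    """Convert each phoneme segment separately into its given target duration."""
    converted = []
    for segment, n_out in zip(split_by_durations(source_cpps, source_durations), target_durations):
        out = convert_segment(segment, n_out)
        if len(out) != n_out:
            raise ValueError("segment converter returned the wrong number of frames")
        converted.extend(out)
    return converted

def nearest_frame_resample(segment, n_out):
    """Placeholder converter (NOT Seq2Seq): repeat or skip frames to reach n_out."""
    return [segment[i * len(segment) // n_out] for i in range(n_out)]

# Toy example: frames are labeled strings so the alignment is easy to read.
source_cpps = ["a1", "a2", "a3", "i1", "i2"]
result = convert_phoneme_by_phoneme(source_cpps, [3, 2], [2, 4], nearest_frame_resample)
print(result)
```

Because each segment is short, an error made while generating one phoneme cannot propagate into the next one, and the output length is fixed by the given durations rather than predicted.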
Sequence-to-Sequence Conversion of CPPs
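• Conversion
[Figure: 1. Recognition: the LSTM $\boldsymbol{R}(\cdot)$ maps source speech feats. $\boldsymbol{x}$ to source CPPs $\hat{\boldsymbol{p}}_x$. Seq2Seq conversion: $\boldsymbol{C}(\cdot)$ maps $\hat{\boldsymbol{p}}_x$ to $\boldsymbol{C}(\hat{\boldsymbol{p}}_x)$, which take the place of the target CPPs. 2. Synthesis: the LSTM $\boldsymbol{G}(\cdot)$ maps them to predicted speech feats. $\hat{\boldsymbol{y}} = \boldsymbol{G}(\boldsymbol{C}(\hat{\boldsymbol{p}}_x))$. Horizontal axis: time.]
Loss function: $L_G(\boldsymbol{C}(\hat{\boldsymbol{p}}_x), \hat{\boldsymbol{p}}_y) + L_C(\boldsymbol{C}(\hat{\boldsymbol{p}}_x), \boldsymbol{l}_y)$
First term (mean squared error): minimizes conversion error.
Second term (softmax cross entropy between predicted CPPs and target labels): alleviates recognition error included in target CPPs.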
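The loss for the Seq2Seq converter can be sketched as a single function. This is an illustrative reading of the slide's formula, not the authors' code: the equal weighting of the two terms follows the plain sum on the slide, while the function name, the toy values, the index form of the labels, and the per-frame averaging are assumptions. Converted and target CPPs are assumed to have the same number of frames.

```python
import math

def conversion_loss(converted_cpps, target_cpps, target_labels):
    """L_G(C(p_hat_x), p_hat_y) + L_C(C(p_hat_x), l_y) for one phoneme segment."""
    n_values = sum(len(frame) for frame in converted_cpps)
    # Mean squared error between converted CPPs and recognized target CPPs.
    mse = sum((c - t) ** 2
              for c_frame, t_frame in zip(converted_cpps, target_cpps)
              for c, t in zip(c_frame, t_frame)) / n_values
    # Cross entropy between converted CPPs and the target context labels.
    xent = -sum(math.log(frame[label])
                for frame, label in zip(converted_cpps, target_labels)) / len(target_labels)
    return mse + xent

# Toy data (hypothetical): 2 frames, 2 context classes.
converted = [[0.8, 0.2], [0.5, 0.5]]    # C(p_hat_x)
target_cpps = [[1.0, 0.0], [0.5, 0.5]]  # p_hat_y, may contain recognition errors
target_labels = [0, 1]                  # l_y
loss = conversion_loss(converted, target_cpps, target_labels)
print(f"conversion loss = {loss:.4f}")
```

The second term uses the labels rather than the recognized target CPPs, which is why it can pull the converter away from errors that the recognizer made on the target speech.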
Effect of the Proposed Algorithm
• Variable-length voice conversion
[Figure: source CPP, target CPP, and CPP after Seq2Seq conversion; horizontal axis: frame (time), values from 0 to 1.]
Variable-length conversion of CPPs is achieved!
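Joint Training of
Speech Recognition and Synthesis
• Training
[Figure: joint training. 1. Recognition: the LSTM $\boldsymbol{R}(\cdot)$ maps source speech feats. $\boldsymbol{x}$ to source CPPs $\hat{\boldsymbol{p}}_x$, compared with context labels $\boldsymbol{l}_x$. 2. Synthesis: the LSTM $\boldsymbol{G}(\cdot)$ maps $\hat{\boldsymbol{p}}_x$ back to source speech feats. $\boldsymbol{G}(\hat{\boldsymbol{p}}_x)$, compared with $\boldsymbol{x}$. Horizontal axis: time.]
Joint training loss: $L_C(\boldsymbol{R}(\boldsymbol{x}), \boldsymbol{l}_x) + L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_x), \boldsymbol{x})$
First term: recognition error (conventional term). Second term: synthesis error using predicted CPP.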
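A minimal sketch of the joint criterion, assuming frame-wise toy stand-ins for the two models (`toy_recognize` and `toy_synthesize` are hypothetical; the real models are LSTMs that see the whole sequence). The point it shows is that the synthesizer is fed the recognizer's own predicted CPPs and is scored on how well it reconstructs the input, as in an auto-encoder.

```python
import math

def joint_loss(feats, labels, recognize, synthesize):
    """L_C(R(x), l_x) + L_G(G(p_hat_x), x), averaged over frames."""
    cpps = [recognize(frame) for frame in feats]       # p_hat_x = R(x)
    recon = [synthesize(cpp) for cpp in cpps]          # G(p_hat_x)
    l_c = -sum(math.log(c[l]) for c, l in zip(cpps, labels)) / len(labels)
    squared = [(r - x) ** 2
               for r_frame, x_frame in zip(recon, feats)
               for r, x in zip(r_frame, x_frame)]
    l_g = sum(squared) / len(squared)
    return l_c + l_g

def toy_recognize(frame):
    """Hypothetical stand-in for R: a 2-class posterior from a sigmoid."""
    p1 = 1.0 / (1.0 + math.exp(-frame[0]))
    return [1.0 - p1, p1]

def toy_synthesize(cpp):
    """Hypothetical stand-in for G: map the posterior back to a 1-dim feature."""
    return [2.0 * cpp[1]]

feats = [[0.0], [2.0]]   # x
labels = [0, 1]          # l_x
loss = joint_loss(feats, labels, toy_recognize, toy_synthesize)
print(f"joint loss = {loss:.4f}")
```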
Experimental Evaluations
Experimental Setup
Dataset: ATR Japanese speech database (phonetically balanced 503 sentences)
Training / Test: 450 sentences / 53 sentences (16 kHz sampling)
Linguistic feats.: 224-dimensional vectors (phonemes)
Speech feats.: Mel-cepstrum (1st through 24th) + delta
Optimization algorithm: AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
Recognition / Synthesis model: Bidirectional LSTM (256 units)
Encoder / Decoder: Bidirectional LSTM / LSTM (256 units each)
Number of speakers: 8, including the source and target speakers
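For readers unfamiliar with the optimizer named in the setup, the standard AdaGrad update can be sketched as follows. Only the learning rate of 0.01 comes from the slide; the `eps` constant, the toy objective, and the function name are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.01, eps=1e-8):
    """One in-place AdaGrad update: per-parameter step scaled by past gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        params[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)

# Toy objective (hypothetical): minimize sum(p^2), whose gradient is 2p.
params = [1.0, -2.0]
accum = [0.0, 0.0]
for _ in range(3):
    grads = [2.0 * p for p in params]
    adagrad_step(params, grads, accum)
print(params)
```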
Objective and Subjective Evaluations of
Seq2Seq Learning
[Figures: objective eval. and subjective eval. results. Chart annotations: "Better!" (twice) and "Worse". Error bars denote 95% confidence intervals.]
Voice samples (source, target) are available online.
Objective Evaluation of Joint Training
[Figure: mel-cepstral distortion results; chart annotation: "Better!"]
Joint training achieved a better mel-cepstral distortion score!
Auto-encoding case: calculates reconstruction error after recognition and synthesis.
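Mel-cepstral distortion is not defined on the slide. A commonly used definition, assumed here, is $(10/\ln 10)\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$ in dB per frame, averaged over frames. The sketch below follows that definition; whether the authors used exactly this variant (for example, which coefficients are included) is not stated in the source.

```python
import math

def mel_cepstral_distortion(reference, estimate):
    """Mean mel-cepstral distortion in dB over time-aligned frames."""
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        scale * math.sqrt(sum((r - e) ** 2 for r, e in zip(r_frame, e_frame)))
        for r_frame, e_frame in zip(reference, estimate)
    ]
    return sum(per_frame) / len(per_frame)

# Toy mel-cepstra (hypothetical): 2 frames, 2 coefficients each.
reference = [[1.0, 0.5], [0.2, 0.1]]
reconstructed = [[1.0, 0.5], [1.2, 0.1]]  # second frame is off by 1.0 in one coefficient
mcd = mel_cepstral_distortion(reference, reconstructed)
print(f"MCD = {mcd:.3f} dB")
```

In the auto-encoding case, `reference` would be the input speech feats. and `reconstructed` the output after recognition and synthesis, so the two sequences are already aligned frame by frame.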
Subjective Evaluation of Joint Training
[Figure: subjective eval. results; chart annotations: "Better!" (twice).]
Joint training improved both speaker similarity and speech quality!
Conclusion
Issue:
• Difficulty of converting speaker individuality included in CPPs.
• Improving recognition accuracy ≠ improving synthesis accuracy.
Proposed:
• Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
• Joint training of recognition and synthesis.
Results:
• Seq2Seq learning achieved variable-length voice conversion.
• Joint training improved speaker similarity and quality of converted speech.

More Related Content

What's hot

BERT
BERTBERT
Attention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its ApplicationsAttention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its Applications
Artifacia
 
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
RIILP
 
Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable...
 Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable... Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable...
Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable...
Tomoki Koriyama
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
WarNik Chow
 
Machine Translation Introduction
Machine Translation IntroductionMachine Translation Introduction
Machine Translation Introduction
nlab_utokyo
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
Matīss ‎‎‎‎‎‎‎  
 
Class9
 Class9 Class9
Class9
issbp
 
Neural machine translation by jointly learning to align and translate
Neural machine translation by jointly learning to align and translateNeural machine translation by jointly learning to align and translate
Neural machine translation by jointly learning to align and translate
sotanemoto
 
Understanding GloVe
Understanding GloVeUnderstanding GloVe
Understanding GloVe
JEE HYUN PARK
 
Why Ruby
Why RubyWhy Ruby
Why Ruby
Daniel Lv
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
RIILP
 
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training EnsemblesSemi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Mohamed El-Geish
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
Arvind Devaraj
 
Translating phrases in neural machine translation
Translating phrases in neural machine translationTranslating phrases in neural machine translation
Translating phrases in neural machine translation
sekizawayuuki
 
Bert
BertBert
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Minh Pham
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentation
Surya Sg
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
Fwdays
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
sadakpramodh
 

What's hot (20)

BERT
BERTBERT
BERT
 
Attention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its ApplicationsAttention Mechanism in Language Understanding and its Applications
Attention Mechanism in Language Understanding and its Applications
 
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
5. manuel arcedillo & juanjo arevalillo (hermes) translation memories
 
Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable...
 Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable... Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable...
Semi-supervised Prosody Modeling Using Deep Gaussian Process Latent Variable...
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
Machine Translation Introduction
Machine Translation IntroductionMachine Translation Introduction
Machine Translation Introduction
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
Class9
 Class9 Class9
Class9
 
Neural machine translation by jointly learning to align and translate
Neural machine translation by jointly learning to align and translateNeural machine translation by jointly learning to align and translate
Neural machine translation by jointly learning to align and translate
 
Understanding GloVe
Understanding GloVeUnderstanding GloVe
Understanding GloVe
 
Why Ruby
Why RubyWhy Ruby
Why Ruby
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training EnsemblesSemi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
Translating phrases in neural machine translation
Translating phrases in neural machine translationTranslating phrases in neural machine translation
Translating phrases in neural machine translation
 
Bert
BertBert
Bert
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentation
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
 

Similar to Interspeech 2017 s_miyoshi

Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
Universidad Nacional de San Martin
 
Direct Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete UnitsDirect Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete Units
IJCI JOURNAL
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
inscit2006
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword units
Tae Hwan Jung
 
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATIONAPPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION
IJDKP
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
behzad66
 
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURESMULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
mlaij
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
Jayavardhan Reddy Peddamail
 
LPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A ReviewLPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A Review
ijiert bestjournal
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
simonp16
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Kotaro Hara
 
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...
Association for Computational Linguistics
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...
karthik annam
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
cscpconf
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
Lifeng (Aaron) Han
 
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATIONIMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
csandit
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
REMEGIUSPRAVEENSAHAY
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
ijnlc
 
FYPReport
FYPReportFYPReport
FYPReport
David Ferris
 
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti1
 

Similar to Interspeech 2017 s_miyoshi (20)

Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
Direct Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete UnitsDirect Punjabi to English Speech Translation using Discrete Units
Direct Punjabi to English Speech Translation using Discrete Units
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword units
 
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATIONAPPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION
APPROACH FOR THICKENING SENTENCE SCORE FOR AUTOMATIC TEXT SUMMARIZATION
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
 
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURESMULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
LPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A ReviewLPC Models and Different Speech Enhancement Techniques- A Review
LPC Models and Different Speech Enhancement Techniques- A Review
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
 
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism ...
 
Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...Performance estimation based recurrent-convolutional encoder decoder for spee...
Performance estimation based recurrent-convolutional encoder decoder for spee...
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATIONIMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
 
FYPReport
FYPReportFYPReport
FYPReport
 
SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptx
 

Recently uploaded

5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
symbo111
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
drwaing
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理
一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理
一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理
skuxot
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
This is my Environmental physics presentation
This is my Environmental physics presentationThis is my Environmental physics presentation
This is my Environmental physics presentation
ZainabHashmi17
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
zwunae
 
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.pptPROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
bhadouriyakaku
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 

Recently uploaded (20)

5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理
一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理
一比一原版(UC Berkeley毕业证)加利福尼亚大学|伯克利分校毕业证成绩单专业办理
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
This is my Environmental physics presentation
This is my Environmental physics presentationThis is my Environmental physics presentation
This is my Environmental physics presentation
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理
 
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.pptPROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
PROJECT FORMAT FOR EVS AMITY UNIVERSITY GWALIOR.ppt
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 

Interspeech 2017 s_miyoshi

  • 1. Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari (The University of Tokyo) Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities INTERSPEECH Tue-O-4-10-1 Stockholm, Sweden Aug. 22, 2017
  • 2. INTERSPEECH 2017 @Stockholm Aug. 22, 2017 1/15. Outline of This Talk. Issue: voice conversion needs parallel data of source and target speakers. Conventional method: voice conversion using context posterior probabilities (CPPs) [Sun et al., 2016]. (1) Recognition: source speech features → source CPPs. (2) Synthesis: copied source CPPs → target speech features. Pros: non-parallel voice conversion. Cons: difficulty of converting the speaker individuality included in CPPs. Proposed: sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs, and joint training of recognition and synthesis to increase conversion performance. Results: Seq2Seq learning achieved variable-length voice conversion, and joint training improved the speaker similarity and quality of converted speech.
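  • 3. Conventional Voice Conversion Algorithm: Copied Context Posterior Probability [Sun et al., 2016]. Training (separated). 1. Recognition: an LSTM $\boldsymbol{R}(\cdot)$ maps the source speech features $\boldsymbol{x}$ to CPPs $\hat{\boldsymbol{p}}_x$, which are compared with the context labels $\boldsymbol{l}_x$ (e.g. a, i, u) through the recognition error (softmax cross entropy) $L_C(\hat{\boldsymbol{p}}_x, \boldsymbol{l}_x)$. 2. Synthesis: an LSTM $\boldsymbol{G}(\cdot)$ maps the target CPPs $\hat{\boldsymbol{p}}_y$ to the target speech features $\boldsymbol{y}$ and is trained with the synthesis error (mean squared error) $L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_y), \boldsymbol{y})$. The two models are trained separately.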
  • 4. Conventional Voice Conversion Algorithm: Copied Context Posterior Probability [Sun et al., 2016]. Conversion (conventional). 1. Recognition: the LSTM $\boldsymbol{R}(\cdot)$ maps the source speech features $\boldsymbol{x}$ to the source CPP $\hat{\boldsymbol{p}}_x$. 2. Synthesis: the source CPP is copied in place of the target CPP $\hat{\boldsymbol{p}}_y$, and the LSTM $\boldsymbol{G}(\cdot)$ outputs the predicted speech features $\hat{\boldsymbol{y}}$.
  • 5. Issues of Conventional Voice Conversion. 1. The shapes and lengths of CPPs differ significantly between speakers: the shapes are different, and the length of each phoneme is different. 2. Improving recognition accuracy ≠ improving synthesis accuracy: the conventional method trains speech recognition and synthesis separately.
  • 6. Proposed Algorithms. 1. Sequence-to-Sequence Conversion from Source CPP to Target CPP. 2. Joint Training of Recognition and Synthesis (like auto-encoding).
  • 7. Sequence-to-Sequence Learning [Sutskever et al., 2014]. Sequence-to-sequence learning performs variable-length conversion. Example: Japanese-to-English translation using Seq2Seq learning, where the input sequence (encoder) "雨 が 降る" is converted to the output sequence (decoder) "It rains". Problems of Seq2Seq conversion of CPPs: (1) determining duration is difficult; (2) conversion failures propagate if the number of frames to be generated is large [Weng et al., 2016]. Constraints: phoneme duration is given, and conversion is done phoneme by phoneme.
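  • 8. Sequence-to-Sequence Conversion of CPPs. 1. Recognition: the LSTM $\boldsymbol{R}(\cdot)$ maps the source speech features $\boldsymbol{x}$ to the source CPP $\hat{\boldsymbol{p}}_x$. Conversion: the Seq2Seq converter $\boldsymbol{C}(\cdot)$ maps $\hat{\boldsymbol{p}}_x$ to $\boldsymbol{C}(\hat{\boldsymbol{p}}_x)$. 2. Synthesis: the LSTM $\boldsymbol{G}(\cdot)$ outputs the predicted speech features $\hat{\boldsymbol{y}} = \boldsymbol{G}(\boldsymbol{C}(\hat{\boldsymbol{p}}_x))$. Loss function: $L_G(\boldsymbol{C}(\hat{\boldsymbol{p}}_x), \hat{\boldsymbol{p}}_y) + L_C(\boldsymbol{C}(\hat{\boldsymbol{p}}_x), \boldsymbol{l}_y)$. The first term (mean squared error) minimizes the conversion error; the second term (softmax cross entropy between predicted CPPs and target labels) alleviates the recognition error included in the target CPPs.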
  • 9. Effect of the Proposed Algorithm: variable-length voice conversion. Variable-length conversion of CPPs is achieved! [Figure: source CPP, target CPP, and CPP after Seq2Seq conversion, each with values from 0 to 1, plotted over time (frames).]
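  • 10. Joint Training of Speech Recognition and Synthesis. Training: 1. Recognition: the LSTM $\boldsymbol{R}(\cdot)$ maps the source speech features $\boldsymbol{x}$ to the source CPP $\hat{\boldsymbol{p}}_x = \boldsymbol{R}(\boldsymbol{x})$, with context labels $\boldsymbol{l}_x$. 2. Synthesis: the LSTM $\boldsymbol{G}(\cdot)$ reconstructs the source speech features as $\boldsymbol{G}(\hat{\boldsymbol{p}}_x)$. Joint training loss: $L_C(\boldsymbol{R}(\boldsymbol{x}), \boldsymbol{l}_x) + L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_x), \boldsymbol{x})$, i.e. the recognition error (conventional term) plus the synthesis error computed using the predicted CPP.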
  • 12. Experimental Setup
      Dataset: ATR Japanese speech database (503 phonetically balanced sentences)
      Training / test: 450 sentences / 53 sentences (16 kHz sampling)
      Linguistic features: 224-dimensional vectors (phonemes)
      Speech features: mel-cepstrum (1st through 24th) + delta
      Optimization algorithm: AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
      Recognition / synthesis model: bidirectional LSTM (256 units)
      Encoder / decoder: bidirectional LSTM / LSTM (256 units each)
      Number of speakers: 8, including the source and target speakers
  • 13. Objective and Subjective Evaluations of Seq2Seq Learning. [Figures: objective evaluation and subjective evaluation; the panels are annotated "Better!", "Better!", and "Worse".] Error bars denote 95% confidence intervals. Voice samples (source and target) are available online.
  • 14. Objective Evaluation of Joint Training. Joint training achieved a better mel-cepstral distortion score. Auto-encoding case: the reconstruction error is calculated after recognition and synthesis.
  • 15. Subjective Evaluation of Joint Training. Joint training improved both speaker similarity and speech quality.
  • 16. Conclusion. Issue: difficulty of converting the speaker individuality included in CPPs; improving recognition accuracy ≠ improving synthesis accuracy. Proposed: sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs, and joint training of recognition and synthesis. Results: Seq2Seq learning achieved variable-length voice conversion, and joint training improved the speaker similarity and quality of converted speech.
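
As a rough illustration of the loss on the "Sequence-to-Sequence Conversion of CPPs" slide, the sketch below adds the mean squared error between converted and target CPPs to the cross entropy between converted CPPs and target context labels. It is not the authors' implementation: the function names, the averaging over frames, the equal weighting of the two terms, and the assumption that both sequences already have the same number of frames are choices made here for illustration only.

```python
import numpy as np


def mean_squared_error(pred, ref):
    """L_G: mean squared error between two (frames, classes) arrays."""
    return float(np.mean((pred - ref) ** 2))


def cross_entropy(probs, onehot_labels, eps=1e-12):
    """L_C: cross entropy between posteriors and one-hot labels, averaged over frames.

    `probs` is assumed to already be softmax outputs (each row sums to 1).
    """
    log_p = np.log(np.clip(probs, eps, 1.0))
    return float(-np.mean(np.sum(onehot_labels * log_p, axis=1)))


def seq2seq_cpp_loss(converted_cpp, target_cpp, target_labels):
    """L_G(C(p_x), p_y) + L_C(C(p_x), l_y), with equal weights (an assumption)."""
    return (mean_squared_error(converted_cpp, target_cpp)
            + cross_entropy(converted_cpp, target_labels))


if __name__ == "__main__":
    # Toy example: 3 frames, 4 context classes (hypothetical sizes).
    labels = np.eye(4)[[0, 1, 2]]
    target_cpp = 0.9 * labels + 0.1 / 4       # imperfect "recognized" target CPPs
    converted = np.full((3, 4), 0.25)         # an uninformative converter output
    print(seq2seq_cpp_loss(converted, target_cpp, labels))
```

The cross-entropy term ties the converter to the target labels rather than only to the recognized target CPPs, which is how the slide describes alleviating recognition errors in those CPPs.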
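
The joint training objective on the "Joint Training of Speech Recognition and Synthesis" slide can be sketched in the same spirit. The talk uses LSTMs for recognition and synthesis; here they are replaced by toy linear layers (`recognize` and `synthesize`, both hypothetical stand-ins) so that the example stays short. The point it tries to show is only that the synthesis error is computed on the predicted CPP, so both models contribute to a single loss. Weighting and averaging are assumptions.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def recognize(x, w_r):
    """Toy stand-in for R(.): linear layer + softmax over context classes."""
    return softmax(x @ w_r)


def synthesize(p, w_g):
    """Toy stand-in for G(.): linear layer from CPPs back to speech features."""
    return p @ w_g


def joint_loss(x, labels, w_r, w_g, eps=1e-12):
    """L_C(R(x), l_x) + L_G(G(p_x), x), where p_x = R(x) is the predicted CPP."""
    p_hat = recognize(x, w_r)
    recognition_error = -np.mean(
        np.sum(labels * np.log(np.clip(p_hat, eps, 1.0)), axis=1))
    synthesis_error = np.mean((synthesize(p_hat, w_g) - x) ** 2)
    return float(recognition_error + synthesis_error)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, feat_dim, n_classes = 5, 3, 4      # hypothetical sizes
    x = rng.normal(size=(frames, feat_dim))
    labels = np.eye(n_classes)[rng.integers(0, n_classes, size=frames)]
    w_r = rng.normal(size=(feat_dim, n_classes))
    w_g = rng.normal(size=(n_classes, feat_dim))
    print(joint_loss(x, labels, w_r, w_g))
```

In an actual training loop this scalar would be minimized with respect to the parameters of both models at once, in contrast to the separated training of the conventional method.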
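
The objective evaluation reports mel-cepstral distortion, but the slides do not state the formula. The sketch below uses the commonly used definition, $(10 / \ln 10) \sqrt{2 \sum_d (c_d - \hat{c}_d)^2}$ in dB, averaged over frames. It assumes the two sequences are already time-aligned and that the arrays hold the 1st through 24th mel-cepstral coefficients (0th coefficient excluded); whether the authors computed it exactly this way is not confirmed by the source.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Average mel-cepstral distortion in dB between aligned (frames, dims) arrays."""
    if ref_mcep.shape != conv_mcep.shape:
        raise ValueError("sequences must be time-aligned and have equal shape")
    diff = ref_mcep - conv_mcep
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(100, 24))                     # hypothetical reference frames
    conv = ref + 0.05 * rng.normal(size=ref.shape)       # slightly perturbed copy
    print(mel_cepstral_distortion(ref, conv))
```

Lower values mean the converted (or reconstructed) features are closer to the reference, which matches the "Better!" direction on the objective evaluation slides.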