Hiroyuki Miyoshi, Yuki Saito,
Shinnosuke Takamichi, and Hiroshi Saruwatari
(The University of Tokyo)
Voice Conversion Using
Sequence-to-Sequence Learning
of Context Posterior Probabilities
INTERSPEECH Tue-O-4-10-1
Stockholm, Sweden
Aug. 22, 2017
INTERSPEECH 2017 @Stockholm Aug. 22, 2017 1/15
Outline of This Talk
Issue:
 Voice conversion needs parallel data of source and target speakers.
Conventional method
 Voice conversion using context posterior probabilities (CPPs). [Sun et al., 2016]
1. Recognition: source speech feats. → source CPPs.
2. Synthesis: copied source CPPs. → target speech feats.
Pros. : Non-parallel voice conversion
Cons. : Difficulty of converting speaker individuality included in CPPs
Proposed:
 Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target
CPPs
 Joint training of recognition and synthesis to increase conversion performance
Results:
 Seq2Seq learning achieved variable-length voice conversion.
 Joint training improved speaker similarity and quality of converted speech.
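Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016] Training
[Figure: two LSTMs trained separately along the time axis.]
1. Recognition: $\boldsymbol{R}(\cdot)$ (LSTM) maps source speech feats. $\boldsymbol{x}$ to CPPs $\hat{\boldsymbol{p}}_x$ (e.g., over contexts a, i, u), trained against context labels $\boldsymbol{l}_x$.
Recognition error (softmax cross entropy): $L_C(\hat{\boldsymbol{p}}_x, \boldsymbol{l}_x)$
2. Synthesis: $\boldsymbol{G}(\cdot)$ (LSTM) maps target CPPs $\hat{\boldsymbol{p}}_y$ to target speech feats. $\boldsymbol{y}$.
Synthesis error (mean squared error): $L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_y), \boldsymbol{y})$
Separated training: the two errors are minimized independently.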
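Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016] Conversion (conventional)
[Figure: recognition and synthesis LSTMs chained along the time axis.]
1. Recognition: $\boldsymbol{R}(\cdot)$ maps source speech feats. $\boldsymbol{x}$ to source CPPs $\hat{\boldsymbol{p}}_x$.
COPY: the source CPPs are used in place of the target CPPs.
2. Synthesis: $\boldsymbol{G}(\cdot)$ maps the copied CPPs to predicted speech feats. $\hat{\boldsymbol{y}} = \boldsymbol{G}(\hat{\boldsymbol{p}}_x)$.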
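The conventional conversion path can be sketched as a composition of two functions. In this minimal sketch, `recognize` and `synthesize` are hypothetical stand-ins for the trained LSTMs R(·) and G(·); the toy functions below are assumptions that only make the example runnable. The point is that copying CPPs frame by frame always yields as many output frames as input frames.

```python
from typing import Callable, List

Frame = List[float]        # one frame of speech features or posteriors
FrameSeq = List[Frame]     # one utterance: a list of frames


def convert_by_copy(
    source_feats: FrameSeq,
    recognize: Callable[[FrameSeq], FrameSeq],
    synthesize: Callable[[FrameSeq], FrameSeq],
) -> FrameSeq:
    """Conventional conversion: y_hat = G(p_hat_x), with p_hat_x = R(x)."""
    source_cpps = recognize(source_feats)   # 1. Recognition
    copied_cpps = source_cpps               # COPY: treated as target CPPs
    return synthesize(copied_cpps)          # 2. Synthesis


# Toy stand-ins (assumptions, not the trained models); both work frame by frame.
def toy_recognize(feats: FrameSeq) -> FrameSeq:
    out = []
    for frame in feats:
        total = sum(abs(v) for v in frame) or 1.0
        out.append([abs(v) / total for v in frame])  # each frame sums to 1
    return out


def toy_synthesize(cpps: FrameSeq) -> FrameSeq:
    return [[2.0 * p for p in frame] for frame in cpps]


source = [[1.0, 3.0], [2.0, 2.0], [4.0, 0.0]]
converted = convert_by_copy(source, toy_recognize, toy_synthesize)
print(len(source), len(converted))  # the two lengths are always equal
```

Because the frame count is fixed by the source, differences in phoneme duration between speakers cannot be converted on this path.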
Issues of Conventional Voice Conversion
1. The shapes and lengths of CPPs differ significantly between speakers.
Shapes are different.
Lengths of each phoneme are different.
2. Improving recognition accuracy ≠ improving synthesis accuracy
The conventional method trains speech recognition and synthesis separately.
Proposed Algorithms
1. Sequence-to-Sequence Conversion from
Source CPP to Target CPP
2. Joint Training of Recognition and Synthesis
(like auto-encoding)
Sequence-to-Sequence Learning [Sutskever et al., 2014]
 Sequence-to-Sequence Learning: variable-length conversion
Example: Japanese-to-English translation using Seq2Seq learning
Input sequence (Encoder): 雨 が 降る
Output sequence (Decoder): It rains
Problems of Seq2Seq conversion of CPPs
・Determining duration is difficult.
・Conversion failures propagate if the number of frames to be generated is large.
[Weng et al., 2016]
Constraints
・Phoneme duration is given.
・Conversion is done phoneme by phoneme.
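A minimal sketch of the two constraints, assuming the source phoneme boundaries and the target durations are supplied from outside (the slides do not say how they are obtained). `resample_segment` is a hypothetical placeholder for the Seq2Seq model, not the authors' converter; it only shows that each phoneme is converted on its own to a given number of frames.

```python
from typing import Callable, List, Tuple

Frame = List[float]
FrameSeq = List[Frame]


def convert_phoneme_by_phoneme(
    source_cpps: FrameSeq,
    source_bounds: List[Tuple[int, int]],
    target_durations: List[int],
    convert_segment: Callable[[FrameSeq, int], FrameSeq],
) -> FrameSeq:
    """Convert each phoneme segment separately to a given number of frames."""
    if len(source_bounds) != len(target_durations):
        raise ValueError("one target duration is needed per phoneme")
    converted: FrameSeq = []
    for (start, end), n_frames in zip(source_bounds, target_durations):
        segment = source_cpps[start:end]          # frames of one phoneme
        converted.extend(convert_segment(segment, n_frames))
    return converted


def resample_segment(segment: FrameSeq, n_frames: int) -> FrameSeq:
    """Placeholder converter: nearest-frame resampling to n_frames frames."""
    return [
        segment[min(len(segment) - 1, i * len(segment) // n_frames)]
        for i in range(n_frames)
    ]


# Five source frames holding two phonemes: frames 0-2 and frames 3-4.
cpps = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]
out = convert_phoneme_by_phoneme(cpps, [(0, 3), (3, 5)], [2, 4], resample_segment)
print(len(cpps), "->", len(out))
```

Converting short segments keeps the number of generated frames per decoding run small, which is the stated remedy for error propagation.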
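Sequence-to-Sequence Conversion of CPPs
[Figure: conversion with a Seq2Seq model inserted between recognition and synthesis, along the time axis.]
1. Recognition: $\boldsymbol{R}(\cdot)$ (LSTM) maps source speech feats. $\boldsymbol{x}$ to source CPPs $\hat{\boldsymbol{p}}_x$.
Seq2Seq conversion: $\boldsymbol{C}(\cdot)$ maps the source CPPs to converted CPPs $\boldsymbol{C}(\hat{\boldsymbol{p}}_x)$, which stand in for the target CPPs.
2. Synthesis: $\boldsymbol{G}(\cdot)$ (LSTM) maps the converted CPPs to predicted speech feats. $\hat{\boldsymbol{y}} = \boldsymbol{G}(\boldsymbol{C}(\hat{\boldsymbol{p}}_x))$.
Loss function: $L_G(\boldsymbol{C}(\hat{\boldsymbol{p}}_x), \hat{\boldsymbol{p}}_y) + L_C(\boldsymbol{C}(\hat{\boldsymbol{p}}_x), \boldsymbol{l}_y)$
First term: mean squared error. Minimizes conversion error.
Second term: softmax cross entropy between predicted CPPs and target labels. Alleviates recognition error included in target CPPs.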
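The two-term loss can be written out for a single phoneme segment as follows. This is a sketch under stated assumptions: CPPs are lists of per-frame probability vectors that already sum to one, target labels are class indices, converted and target sequences have the same number of frames, and the two terms are weighted equally. None of these details are given on the slide.

```python
import math
from typing import List

FrameSeq = List[List[float]]


def mean_squared_error(pred: FrameSeq, target: FrameSeq) -> float:
    """Mean of squared differences over all frames and dimensions."""
    diffs = [(p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    return sum(diffs) / len(diffs)


def cross_entropy(pred: FrameSeq, labels: List[int], eps: float = 1e-12) -> float:
    """Mean negative log-probability assigned to the labelled class."""
    return -sum(math.log(pf[k] + eps) for pf, k in zip(pred, labels)) / len(labels)


def conversion_loss(
    converted_cpps: FrameSeq,
    target_cpps: FrameSeq,
    target_labels: List[int],
) -> float:
    """L_G(C(p_x), p_y) + L_C(C(p_x), l_y), equally weighted (assumption)."""
    return (mean_squared_error(converted_cpps, target_cpps)
            + cross_entropy(converted_cpps, target_labels))


converted_cpps = [[0.5, 0.5], [0.25, 0.75]]   # C(p_hat_x)
target_cpps = [[1.0, 0.0], [0.0, 1.0]]        # p_hat_y
target_labels = [0, 1]                        # l_y as class indices
loss = conversion_loss(converted_cpps, target_cpps, target_labels)
print(round(loss, 4))
```

The second term compares against the labels rather than the recognized target CPPs, so a recognition mistake in the target CPPs does not go uncorrected.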
Effect of the Proposed Algorithm
 Variable-length voice conversion
[Figure: source CPP, target CPP, and CPP after Seq2Seq conversion; posterior probability (0 to 1) against frame (time).]
Variable-length conversion of CPPs is achieved!
Joint Training of
Speech Recognition and Synthesis
 Training
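[Figure: recognition and synthesis LSTMs trained jointly along the time axis.]
1. Recognition: $\boldsymbol{R}(\cdot)$ (LSTM) maps source speech feats. $\boldsymbol{x}$ to source CPPs $\hat{\boldsymbol{p}}_x = \boldsymbol{R}(\boldsymbol{x})$, trained against context labels $\boldsymbol{l}_x$.
2. Synthesis: $\boldsymbol{G}(\cdot)$ (LSTM) maps the predicted source CPPs back to source speech feats. $\boldsymbol{x}$.
Joint training loss: $L_C(\boldsymbol{R}(\boldsymbol{x}), \boldsymbol{l}_x) + L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_x), \boldsymbol{x})$
First term: recognition error (conventional term). Second term: synthesis error using predicted CPPs.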
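The joint objective can be sketched as a single forward pass in which the predicted CPPs feed both terms. `recognize_logits` and `synthesize` are hypothetical callables standing in for R(·) (before its softmax) and G(·); equal weighting of the two terms is also an assumption.

```python
import math
from typing import Callable, List

FrameSeq = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one frame."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(
    x: FrameSeq,
    labels: List[int],
    recognize_logits: Callable[[FrameSeq], FrameSeq],
    synthesize: Callable[[FrameSeq], FrameSeq],
) -> float:
    """L_C(R(x), l_x) + L_G(G(p_hat_x), x) for one utterance."""
    cpps = [softmax(frame) for frame in recognize_logits(x)]    # p_hat_x = R(x)
    recognition = -sum(math.log(p[k]) for p, k in zip(cpps, labels)) / len(labels)
    recon = synthesize(cpps)                                    # G(p_hat_x)
    sq = [(a - b) ** 2 for rf, xf in zip(recon, x) for a, b in zip(rf, xf)]
    synthesis = sum(sq) / len(sq)                               # error against x itself
    return recognition + synthesis


# Toy models (assumptions): scaled identity as recognizer, identity as synthesizer.
x = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
loss = joint_loss(
    x,
    labels,
    recognize_logits=lambda feats: [[4.0 * v for v in f] for f in feats],
    synthesize=lambda cpps: cpps,
)
print(loss)
```

Since the synthesis term is computed from the recognizer's own output, minimizing the sum with a gradient-based optimizer would update the recognizer according to how well its CPPs can be synthesized from, not only according to label accuracy.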
Experimental Evaluations
Experimental Setup
Dataset: ATR Japanese speech database (503 phonetically balanced sentences)
Training / Test: 450 sentences / 53 sentences (16 kHz sampling)
Linguistic feats.: 224-dimensional vectors (phonemes)
Speech feats.: Mel-cepstrum (1st through 24th) + delta
Optimization algorithm: AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
Recognition / Synthesis model: Bidirectional LSTM (256 units)
Encoder / Decoder: Bidirectional LSTM / LSTM (256 units each)
Number of speakers: 8, including the source and target speakers
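For reference, one AdaGrad update with the learning rate from the setup can be sketched as below. The `eps` value and the zero initial accumulator are assumptions; the slides give only the learning rate.

```python
import math
from typing import List, Tuple


def adagrad_step(
    params: List[float],
    grads: List[float],
    accum: List[float],
    learning_rate: float = 0.01,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """One AdaGrad update; returns (new_params, new_accum)."""
    # Accumulate squared gradients per parameter.
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    # Scale each step by the inverse root of its accumulated squared gradient.
    new_params = [
        p - learning_rate * g / (math.sqrt(a) + eps)
        for p, g, a in zip(params, grads, new_accum)
    ]
    return new_params, new_accum


params, accum = [1.0, -2.0], [0.0, 0.0]
params, accum = adagrad_step(params, [0.5, 0.0], accum)
print(params, accum)
```

Parameters with large accumulated gradients take smaller steps, so the effective learning rate differs per parameter and shrinks over training.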
Objective and Subjective Evaluations of
Seq2Seq Learning
[Figures: objective evaluation and subjective evaluation of Seq2Seq learning, with labels "Source" and "Target"; two results are marked "Better!" and one "Worse". Error bars denote 95% confidence intervals.]
Voice samples are available online.
Objective Evaluation of Joint Training
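[Figure: mel-cepstral distortion with and without joint training; joint training is marked "Better!".]
Joint training achieved a better mel-cepstral distortion score.
Auto-encoding case: the reconstruction error is calculated after recognition and synthesis.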
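Mel-cepstral distortion is commonly computed per frame as (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²) and averaged over frames, with lower values being better. The sketch below uses that common definition as an assumption, since the slides do not state the exact formula, and it expects the two sequences to be time-aligned already.

```python
import math
from typing import List


def mel_cepstral_distortion(ref: List[List[float]], conv: List[List[float]]) -> float:
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if len(ref) != len(conv) or not ref:
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((r - c) ** 2 for r, c in zip(rf, cf)))
        for rf, cf in zip(ref, conv)
    ]
    return sum(per_frame) / len(per_frame)


reference = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.2]]
shifted = [[0.2, 0.2, 0.3], [0.0, 0.1, 0.2]]   # one coefficient off by 0.1
mcd_same = mel_cepstral_distortion(reference, reference)
mcd_shifted = mel_cepstral_distortion(reference, shifted)
print(mcd_same, mcd_shifted)
```

In the auto-encoding case the reconstructed features have the same length as the input, so no extra alignment step is needed before calling the function.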
Subjective Evaluation of Joint Training
[Figures: subjective evaluation of speaker similarity and speech quality; both results are marked "Better!".]
Joint training improved both speaker similarity and speech quality.
Conclusion
Issue:
 Difficulty of converting speaker individuality included in CPPs.
 Improving recognition accuracy ≠ improving synthesis accuracy.
Proposed:
 Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target
CPPs.
 Joint training of recognition and synthesis.
Results:
 Seq2Seq learning achieved variable-length voice conversion.
 Joint training improved speaker similarity and quality of converted speech.
