Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input

Shinnosuke Takamichi, The University of Tokyo, Project Research Associate
2015©Shinnosuke TAKAMICHI
09/19/2015
Yuri Nishigaki, Shinnosuke Takamichi, Tomoki Toda,
Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST)
MLSLP2015 in Aizu Univ.
Speech-based creative activities and HMM-based speech synthesis
[Figure labels: Singing voice / Speech; Advertisement, Live concert, Narration, Next? (Video avatar, Voice actor, …)]
Useful method: HMM-based speech synthesis [Tokuda et al., 2013]
[Figure: the text “Synthesize!” is converted into synthetic speech parameters and then into speech.]
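Manual control of synthetic speech
[Figure: a user controls synthetic speech either through a Multi-Regression HMM [Nose et al., 2007], where styles such as "laugh" and "sad" are controlled via regression, or by manually manipulating HMM parameters.]
Both approaches are very useful, but it is difficult to control the speech exactly as the user wants.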
Motivation of this study
■ Functions we want
– Original capability of HMM-based TTS
– Speech-based control
• Intuitive to control
• Make synthetic speech mimic input speech prosody
■ Our work
– Speech synthesis having both functions
[Figure: a system that takes the text “Synthesize.” and spoken input and produces synthetic speech. Annotations: "MR-HMM etc."; "Similar to VOCALISTENER for singing voice control".]
Overview of the proposed system (Only text is input.)
[Figure: original HMM-based speech synthesis. Input text → text analysis → parameter generation (using the synthesis HMM) → waveform generation → synthetic speech.]
Overview of the proposed system (Text & speech are input.)
[Figure: block diagram. The input text goes through text analysis and the input speech through speech analysis. Duration extraction uses the alignment HMM, parameter generation uses the synthesis HMM, and F0 modification precedes waveform generation, which outputs the synthetic speech.]
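Duration extraction module
[Figure: the features of the input speech and the context of the input text are aligned using the alignment HMM (HMM alignment), giving the duration of the input speech. Duration generation with the synthesis HMM then produces the state duration of the synthetic speech, which is passed to parameter generation.]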
Alignment accuracy & duration unit
■ How to build alignment HMMs suitable for input speech?
– → Use pre-recorded speech uttered by the users
– Large amounts → user-dependent HMMs
– Small amounts → HMMs adapted from original alignment HMMs
■ How to map the input speech duration to synthetic speech?
– Alignment/synthesis HMM-states represent different speech segments.
– Which is better, HMM-state, phone, or mora-level duration unit?
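The slides do not say how an input-speech duration is spread over the synthesis HMM states once a duration unit is chosen. As a minimal sketch, assuming the total frame count of each unit is split in proportion to the mean state durations of the synthesis HMM (an assumption, not a rule given in the slides), phone- and mora-level mapping might look like the following. The names `redistribute`, `map_durations`, and `groups` are hypothetical, and the toy example uses 3 states per phone for brevity.

```python
def redistribute(total, weights):
    """Split `total` frames over states in proportion to `weights`.

    Uses largest-remainder rounding so the integer frame counts
    always sum to `total`.
    """
    wsum = sum(weights)
    raw = [total * w / wsum for w in weights]
    frames = [int(r) for r in raw]
    leftover = total - sum(frames)
    # Give the remaining frames to the states with the largest fractional part.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - frames[i], reverse=True)
    for i in order[:leftover]:
        frames[i] += 1
    return frames


def map_durations(input_state_durs, synth_state_means, groups):
    """Map input-speech durations onto synthesis HMM states.

    input_state_durs:  per-phone state durations (frames) from the alignment HMM.
    synth_state_means: per-phone mean state durations of the synthesis HMM
                       (assumed available; used only as proportions).
    groups:            lists of phone indices that form one duration unit,
                       e.g. [[0], [1]] for phone level, [[0, 1]] for one mora.
    """
    out = [None] * len(input_state_durs)
    for group in groups:
        # Total duration of this unit in the input speech.
        total = sum(sum(input_state_durs[p]) for p in group)
        weights = [m for p in group for m in synth_state_means[p]]
        frames = redistribute(total, weights)
        pos = 0
        for p in group:
            n = len(synth_state_means[p])
            out[p] = frames[pos:pos + n]
            pos += n
    return out


# Toy example: three phones, the first two forming one mora.
input_durs = [[2, 3, 5], [1, 1, 2], [4, 4, 4]]
synth_means = [[1.0, 2.0, 2.0], [2.0, 2.0, 4.0], [3.0, 3.0, 3.0]]
phone_level = map_durations(input_durs, synth_means, [[0], [1], [2]])
mora_level = map_durations(input_durs, synth_means, [[0, 1], [2]])
```

State-level mapping would simply copy `input_durs` as they are. The coarser units keep only the total duration of each phone or mora, which is one way to sidestep the mismatch between alignment and synthesis HMM states noted above. A practical version would also need to decide whether a state may receive zero frames.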
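Speech parameter generation module
[Figure: given the context of the input text and the state duration from duration extraction, parameter generation with the synthesis HMM outputs the spectrum of the synthetic speech (sent to waveform generation) and the F0 generated from the HMMs (sent to F0 modification).]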
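F0 modification module
[Figure: the features of the input speech and the F0 generated from the HMMs (from parameter generation) pass through F0 conversion and U/V region modification, yielding the F0 of the synthetic speech, which is sent to waveform generation.]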
F0 conversion & unvoiced/voiced modification
[Figure: F0 contours over time for the reference generated from HMMs, the input speech, the F0-converted contour (linear conversion), and the U/V-modified contour (spline interpolation).]
■ F0 conversion adjusts the F0 range of the input speech to fit the reference.
■ U/V modification adjusts the U/V regions of the input speech to fit the reference.
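The two steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: unvoiced frames are marked with 0.0, the linear conversion is taken to be a mean and standard-deviation match in the log-F0 domain, and linear interpolation in log F0 stands in for the spline interpolation named on the slide. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev

UNVOICED = 0.0  # assumed marker for unvoiced frames


def linear_f0_conversion(input_f0, reference_f0):
    """Shift and scale the input log F0 so its mean/std match the reference."""
    in_log = [math.log(f) for f in input_f0 if f > 0]
    ref_log = [math.log(f) for f in reference_f0 if f > 0]
    mu_in, sd_in = mean(in_log), pstdev(in_log)
    mu_ref, sd_ref = mean(ref_log), pstdev(ref_log)
    scale = sd_ref / sd_in if sd_in > 0 else 1.0
    return [
        math.exp((math.log(f) - mu_in) * scale + mu_ref) if f > 0 else UNVOICED
        for f in input_f0
    ]


def _fill_voiced(i, f0, voiced_idx):
    """Estimate F0 at unvoiced frame i from the nearest voiced neighbours."""
    left = [j for j in voiced_idx if j < i]
    right = [j for j in voiced_idx if j > i]
    if left and right:
        a, b = left[-1], right[0]
        t = (i - a) / (b - a)
        # Linear interpolation in log F0 (simplification of the spline).
        return math.exp((1 - t) * math.log(f0[a]) + t * math.log(f0[b]))
    if left:
        return f0[left[-1]]
    if right:
        return f0[right[0]]
    raise ValueError("input F0 has no voiced frames")


def modify_uv(f0, reference_f0):
    """Make the U/V decision of every frame agree with the reference."""
    voiced_idx = [i for i, f in enumerate(f0) if f > 0]
    out = []
    for i, (f, ref) in enumerate(zip(f0, reference_f0)):
        if ref <= 0:
            out.append(UNVOICED)                          # V->U (or stays U)
        elif f > 0:
            out.append(f)                                 # already voiced
        else:
            out.append(_fill_voiced(i, f0, voiced_idx))   # U->V
    return out


# Toy contours in Hz; both have the same number of frames.
input_f0 = [0.0, 100.0, 110.0, 0.0, 130.0, 120.0]
reference_f0 = [0.0, 200.0, 220.0, 240.0, 0.0, 210.0]
converted = linear_f0_conversion(input_f0, reference_f0)
modified = modify_uv(converted, reference_f0)
```

The sketch assumes both contours are frame-aligned, which fits the system described here because the synthetic speech takes its durations from the input speech.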
EXPERIMENTAL EVALUATION
Experimental Setup
Content | Value/Setting
User | 4 Japanese speakers (2 male & 2 female)
Target speaker | 1 Japanese female speaker
Training data of synthesis HMMs | 450 phoneme-balanced sentences, 16 kHz-sampled, 5 ms shift, reading style
Evaluation data | 53 phoneme-balanced sentences
Speech features | 25-dim. mel-cepstrum, log F0, 5-band aperiodicity
Speech analyzer | STRAIGHT [Kawahara et al., 1999]
Text analyzer | Open JTalk
Acoustic model | 5-state HSMM [Zen et al., 2007]
Evaluations:
■ 1. Duration unit & alignment HMM adaptation
■ 2. Synthesis HMM adaptation
■ 3. Effect of U/V modification
Evaluation 1: duration unit & alignment HMM adaptation
■ 3 duration units
– State / phoneme / mora-level duration
■ 4 HMMs using different amounts of pre-recorded speech
– 0 … target-speaker-dependent HMMs (= synthesis HMM)
– 1 … HMMs adapted using 1 utterance uttered by the user
– 56 … HMMs adapted using 56 utterances
– 450 … user-dependent HMMs
■ Evaluation
– MOS test on naturalness of synthetic speech
– DMOS test on prosody mimicking ability of synthetic speech
• Input speech is presented as reference.
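For readers unfamiliar with how opinion scores of this kind are usually summarized, the following sketch computes a mean opinion score with a normal-approximation 95% confidence interval from 1-to-5 ratings. The slides do not state how the scores were aggregated or how significance was assessed, so the interval formula, the function name `mos_with_ci`, and the ratings are all illustrative assumptions.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mean score, half-width of the approximate 95% CI).

    `scores` are listener ratings on a 1-5 scale; z=1.96 assumes a
    normal approximation, which is rough for small listener panels.
    """
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, half_width


# Hypothetical ratings for one system (not data from the study).
ratings = [4, 5, 3, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The same computation applies to DMOS ratings; the difference lies in the listening test, where the input speech is played as a reference before each rating.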
Result 1: duration unit & alignment HMM adaptation
[Figure: two bar charts, MOS on naturalness and DMOS on prosody mimicking ability (score axis 1 to 5), for state-, phone-, and mora-level duration units with alignment HMMs built from 0, 1, 56, and 450 utterances. "No significant diff." is marked in both panels.]
We can confirm that (1) adaptation is effective and (2) phoneme-level duration is relatively robust.
Experiment 2: Effectiveness of U/V modification in naturalness
[Figure, first panel: preference score on naturalness [%] (axis 0 to 100), without vs. with modification, for Spkr1 to Spkr4. Second panel: U/V modification ratio [%] (axis 0 to 20), split into U->V and V->U modification, for Spkr1 to Spkr4.]
U/V modification can improve naturalness, especially when many U frames of the input speech are fixed.
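The U/V modification ratio plotted for each speaker can be illustrated with a short sketch. The slides do not define the ratio, so this assumes it is the percentage of all frames whose U/V decision is changed to match the reference, counted separately for U->V and V->U. The function name and the 0.0-as-unvoiced convention are hypothetical.

```python
def uv_modification_ratio(input_f0, reference_f0):
    """Return (U->V %, V->U %) of frames changed to match the reference.

    Frames with F0 <= 0 are treated as unvoiced (an assumed convention).
    The denominator is the total number of frames (also an assumption).
    """
    n = len(input_f0)
    pairs = list(zip(input_f0, reference_f0))
    u_to_v = sum(1 for f, r in pairs if f <= 0 and r > 0)
    v_to_u = sum(1 for f, r in pairs if f > 0 and r <= 0)
    return 100.0 * u_to_v / n, 100.0 * v_to_u / n


# Toy contours in Hz: one frame becomes voiced, one becomes unvoiced.
input_f0 = [0.0, 100.0, 110.0, 0.0, 130.0, 120.0]
reference_f0 = [0.0, 200.0, 220.0, 240.0, 0.0, 210.0]
u2v, v2u = uv_modification_ratio(input_f0, reference_f0)
```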
Conclusion
■ 2 functions to control synthetic speech
– An original function of HMM-based TTS
• MR-HMM or manual control
– Speech-based control
• Intuitive for users
■ 2 main modules of our system
– Mimic duration.
• Copy duration of input speech to synthetic speech.
– Mimic F0 patterns.
• Copy dynamic F0 pattern of input speech to synthetic speech.
■ Future work
– HMM selection using text & speech