Statistical parametric speech synthesis by product of experts
Published on: ILCC/HCRC seminar series at University of Edinburgh, March 2011
Multiple-level acoustic models (AMs) are often combined in statistical parametric speech synthesis. Both linear and non-linear functions of the observation sequence are used as features in these AMs. This combination of multiple-level AMs can be expressed as a product of experts (PoE); the likelihoods from the AMs are scaled, multiplied together and then normalized. Currently these multiple-level AMs are individually trained and only combined at the synthesis stage. This seminar discusses a more consistent PoE framework where the AMs are jointly trained. A generalization of trajectory HMM training can be used for multiple-level Gaussian AMs based on linear functions. However for the non-linear case this is not possible, so a scheme based on contrastive divergence learning is described.

Transcript

  • 1. Statistical Parametric Speech Synthesis Based on Product of Experts Heiga ZEN Toshiba Research Europe Ltd. Cambridge Research Lab. ILCC/HCRC Seminar Series @ University of Edinburgh March 11th, 2011 Copyright 2011, Toshiba Corporation
  • 2. Overview Goal of text-to-speech (TTS) synthesis - Convert text * sequence of words [ hello, my, name, is, heiga, zen ] - to speech * waveform 1
  • 3. Typical flow of TTS systems TEXT Grapheme-to-Phoneme Part-of-Speech tagging Text analysis Text normalization Pause prediction Prosody prediction Speech synthesis discrete ⇒ discrete Waveform generation SYNTHESIZED discrete ⇒ continuous SPEECH This talk mainly focuses on the speech synthesis part 2
  • 4. Statistical parametric speech synthesis [Pipeline: Speech → feature extraction; TEXT → labels; model training → parameter generation → waveform synthesis → synthesized speech] - Large data + automatic training ⇒ Automatic voice building - Parametric representation of speech + statistical modeling ⇒ Flexible to change its voice characteristics - HMM as its acoustic model ⇒ HMM-based speech synthesis (HTS) [Yoshimura;02] 3
  • 5. Statistical parametric speech synthesis Advantages - Flexibility to change voice characteristics - Small footprint - Robustness Drawback - Quality Major factors for quality degradation - Vocoding (feature extraction) - Acoustic model (modeling) ⇒ better modeling w/ PoE! - Over-smoothing (reconstruction) 4
  • 6. Outline Statistical parametric speech synthesis framework - Speech & features - Text & labels - Acoustic models by directed & undirected graphical models Product of Experts (PoE)-based speech synthesis - Hierarchical nature in speech & text - Combination of multiple models at multiple levels - PoE interpretation Experiments Conclusions 5
  • 7. Statistical speech synthesis Probabilistic model of speech given text - Training corpus: text & speech pairs (e.g., 5k utterances) (X, W) = {(x^1, w^1), ..., (x^R, w^R)}, where x^r = {x^r_1, ..., x^r_N}: speech and w^r = {w^r_1, ..., w^r_K}: text - Estimate generative model of speech given text: λ̂ = arg max_λ p(X | W, λ) - Use estimated model to generate speech for unseen text: x̂ = arg max_x p(x | w, λ̂) (ML parameter generation) or draw x from p(x | w, λ̂) (random sampling) 6
  • 8. Speech Modeling speech waveform directly is difficult - Convert waveform into a series of feature vectors - Typically extracted at every 5 ms w/ overlap c = {c_1, ..., c_T}: feature vector sequence (continuous) - Features are similar to those used in our mobile phone - Resynthesis from feature vector sequence (original vs reconstructed) ⇒ Reconstruct very naturally sounding speech 7
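The "feature vectors extracted every 5 ms with overlap" of slide 8 amount to slicing the waveform into overlapping analysis frames. A minimal sketch; the 16 kHz sampling rate and 25 ms window length are illustrative assumptions (the slide only specifies the ~5 ms shift):

```python
import numpy as np

def frame(signal, frame_len, hop):
    """Slice a 1-D waveform into overlapping analysis frames."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# Assumed 16 kHz audio, 25 ms window, 5 ms hop -> frame_len=400, hop=80
x = np.arange(16000, dtype=float)      # 1 s of dummy samples
frames = frame(x, frame_len=400, hop=80)
# frames.shape == (196, 400); each row feeds one feature vector c_t
```

Each row would then be turned into a spectral feature vector (e.g., mel-cepstrum) to form the sequence c = {c_1, ..., c_T}.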
  • 9. Text Words are too big as modeling unit - Impossible to have all words (infinite) * Break words into sub-word units (e.g., phones) * Number of phones is finite * e.g., hot ⇒ [ hh, aa, t ] - Take into account richer contexts * triphone (typically used in ASR) - e.g., hot ⇒ [ x-hh+aa, hh-aa+t, aa-t+x ] * full context (all contexts that might be relevant) - Part-of-Speech, grammar, accent, stress etc. - Context-dependent phone sequence ⇒ Labels l = {l_1, ..., l_L} 8
  • 10. Text-to-Speech ⇒ Labels-to-Features Assume speech⇒feats and text⇒labels are deterministic - Not true, but a reasonable assumption Text–speech: λ̂ = arg max_λ Π_{r=1}^R p(x^r | w^r, λ), x̂ = arg max_x p(x | w, λ̂) ⇒ Labels–features: λ̂ = arg max_λ Π_{r=1}^R p(c^r | l^r, λ), ĉ = arg max_c p(c | l, λ̂) - Estimate generative model of speech feats given labels - Use estimated model to generate feats given labels 9
  • 11. Example Feature sequence (2nd mel-cepstrum) & labels [Plot: trajectory c over 0–1.8 sec; label sequence l: sil j i b u N n o j i ts u ry o k u w a pau] - Features are a smooth trajectory - Each segment (1 label) has different duration - How to model probability of c given l? 10
  • 12. Acoustic models for speech synthesis Sub-word level model [Plot: same trajectory c & labels l as previous slide] - 1 model per segment (label) * same label ⇒ same model - Concatenate sub-word models to form entire-level model - Sharing structure is usually built to reduce #parameters 11
  • 13. Acoustic models for speech synthesis Discrete latent state Markov model - Unobservable (latent) variable: state at each frame q_t p(c | l, λ) = Σ_{∀q} p(c | q, λ) P(q | l, λ), where q = {q_1, ..., q_T}: state sequence (discrete), c = {c_1, ..., c_T}: observation sequence (continuous) - State transition probability P(q | l, λ) ⇒ Markov (or semi-Markov) chain - State output probability p(c | q, λ) ⇒ directed or undirected graphical model 12
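The sum over all state sequences in slide 13, p(c | l, λ) = Σ_q p(c | q, λ) P(q | l, λ), is tractable via the standard forward algorithm. A minimal sketch on a toy 2-state HMM with 1-D Gaussian outputs (all numbers are illustrative, not from the talk):

```python
import numpy as np

def forward_loglik(log_pi, log_A, log_b):
    """log p(c | l) = log-sum over all state sequences q of
    p(c | q) P(q | l), computed with the forward recursion."""
    T, S = log_b.shape
    alpha = log_pi + log_b[0]                    # log alpha_1(s)
    for t in range(1, T):
        m = alpha.max()                          # log-sum-exp for stability
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_b[t]
    m = alpha.max()
    return m + np.log(np.sum(np.exp(alpha - m)))

# Toy model: 2 states with Gaussian outputs, 3-frame observation
mu, var = np.array([0.0, 1.0]), np.array([1.0, 1.0])
c = np.array([0.1, 0.2, 0.9])
log_b = -0.5 * ((c[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_pi = np.log(np.array([0.5, 0.5]))
ll = forward_loglik(log_pi, log_A, log_b)
```

For this 3-frame toy example the result can be checked against brute-force enumeration of all 2³ state sequences.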
  • 14. Acoustic models for speech synthesis Graphical models for modeling time series - Directed model (e.g., HMM, autoregressive HMM): p(c | q, λ) = Π_{t=1}^T p(c_t | q_t, λ) or p(c | q, λ) = Π_{t=1}^T p(c_t | q_t, c_{t−1}, λ) - Undirected model (e.g., trajectory HMM [Zen;07]): p(c | q, λ) = (1/Z_q(λ)) Π_{t=1}^T ψ(c_{t−1}, c_t, c_{t+1}; q_t, λ), where ψ is a clique potential function and Z_q(λ) = ∫ Π_{t=1}^T ψ(c_{t−1}, c_t, c_{t+1}; q_t, λ) dc 13
  • 15. Graphical models for modeling time series Directed model (e.g., HMM): p(c | q, λ) = Π_{t=1}^T p(c_t | q_t, λ) - Probability is factorized over time - Each factor is locally normalized (valid p.d.f.) Undirected model (e.g., trajectory HMM): p(c | q, λ) = (1/Z_q(λ)) Π_{t=1}^T ψ(c_{t−1}, c_t, c_{t+1}; q_t, λ) - Potential function is factorized over time - Combined factors are globally normalized 14
  • 16. Most probable sequence from HMM Feature sequence (2nd mel-cepstrum) & labels [Plot: natural trajectory vs piece-wise constant HMM output over 0–1.8 sec with phone labels] - Features are smooth, but model output is piece-wise constant * due to frame-wise conditional independence assumption * Not a good model for smooth speech feature trajectory 15
  • 17. Most probable sequence from trajectory HMM Feature sequence (2nd mel-cepstrum) & labels [Plot: natural trajectory vs trajectory-HMM output over 0–1.8 sec with phone labels] - Both features & model output are smooth * due to edges between neighbouring observations * Good model for capturing local smooth characteristics 16
  • 18. Directed (D) vs Undirected (U) graphical models D: explicitly model an imagined generative process D: easier to make Bayesian D: allow easy sampling U: normalization constant often makes inference intractable U: allow incorporation of richer features U: more naturally cope with a set of soft constraints all of which should be simultaneously roughly satisfied ⇒ Good property for acoustic modeling 17
  • 19. Outline Statistical parametric speech synthesis framework - Speech & features - Text & labels - Acoustic models by directed & undirected graphical models Product of Experts (PoE)-based speech synthesis - Hierarchical nature in speech & text - Combination of multiple models at multiple levels - PoE interpretation Experiments Conclusions 18
  • 20. Hierarchical structure in text [Tree: Document → Paragraphs → Sentences → Phrases → Words → Characters] 19
  • 21. Hierarchical structure in speech [Plot: trajectory c over 0–1.8 sec with aligned segmentations] phone: sil j i b u N n o j i ts u ry o k u w a pau / syllable: sil ji bu N no ji tsu ryo ku wa pau / word: sil jibuN no jitsuryoku wa pau / phrase: sil jibuNnojitsuryokuwa pau 20
  • 22. Model hierarchical structure in text & speech Current approach - Ignore hierarchical structure in text * Contexts at all levels are associated with phone - Ignore hierarchical structure in speech * Frame-level p.d.f. (directed) or potential func (undirected) Possible approaches - Directed model * Hierarchical priors - Undirected model * Combine models at multiple levels as a single model ⇒ Product of Experts (PoE) [Hinton;02] 21
  • 23. Mixture model vs Product model Mixture of experts: p(c | λ_1, ..., λ_M) = (1/Z) Σ_{i=1}^M α_i p(f_i(c) | λ_i) - Data is generated from union of experts - Robust for modeling data with many variations - GMM → Mixture of Gaussians Product of experts [Hinton;02]: p(c | λ_1, ..., λ_M) = (1/Z) Π_{i=1}^M {p(f_i(c) | λ_i)}^{α_i} - Data is generated from intersection of experts - Efficient for modeling data with many constraints - PoG → Product of Gaussians 22
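The Product-of-Gaussians (PoG) case mentioned on slide 23 has a closed form worth noting: the renormalized product of Gaussians is again Gaussian, with precisions adding up. A minimal 1-D sketch:

```python
import numpy as np

def product_of_gaussians(mus, variances):
    """Renormalized product of 1-D Gaussian experts N(x; mu_i, var_i)
    is Gaussian: precisions add, mean is precision-weighted."""
    prec = 1.0 / np.asarray(variances, float)
    var = 1.0 / prec.sum()
    mu = var * np.sum(prec * np.asarray(mus, float))
    return mu, var

mu, var = product_of_gaussians([0.0, 2.0], [1.0, 1.0])
# mu == 1.0, var == 0.5: the product is *narrower* than either expert
```

This illustrates the union-vs-intersection contrast on the slide: a 50/50 mixture of the same two experts would have mean 1.0 but a larger overall variance, while the product concentrates mass where both experts agree.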
  • 24. Hierarchical modeling w/ PoE [Plot: trajectory c with phone, syllable, & word segmentations and feature functions f_phn(c), f_syl(c), f_wrd(c)] - Extract segment-level features at each level: f_i(c) - Model extracted features: p(f_i(c) | λ_i) - Multiply them then normalize the result: (1/Z) Π_{i=1}^M p(f_i(c) | λ_i) 23
  • 25. Feature function & individual model Feature function - Extracts segment-level feature from c - Linear or non-linear * dynamic features * DCT [Latorre;08, Qian;09] * average [Wang;08] * summation [Ling;06, Gao;08] * variance [Toda;07] Models at each level - Typically context-dependent Gaussians or non-Gaussians 24
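The feature functions listed on slide 25 are each only a few lines of NumPy; a sketch with a made-up 4-frame segment (the segment values and DCT order are illustrative):

```python
import numpy as np

def delta(seg):
    """Dynamic feature: first difference (linear in the segment)."""
    return np.diff(seg)

def dct2(seg, k=3):
    """First k DCT-II coefficients, as used for e.g. F0 contours (linear)."""
    n = len(seg)
    t = (np.arange(n) + 0.5) * np.pi / n
    return np.array([np.sum(seg * np.cos(c * t)) for c in range(k)])

seg = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical segment of c
features = {
    "delta": delta(seg),               # linear
    "dct": dct2(seg),                  # linear
    "average": seg.mean(),             # linear
    "sum": seg.sum(),                  # linear
    "variance": seg.var(),             # non-linear (quadratic)
}
```

All but the variance are linear functions of the segment, which is exactly the distinction the talk later uses: linear features keep the Gaussian PoE tractable, the quadratic variance feature does not.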
  • 26. Conventional usage of PoE for speech synthesis Training - Estimate each model individually: λ̂_i = arg max_{λ_i} p(f_i(c) | λ_i), i = 1, ..., M Synthesis - Generate c that jointly maximizes output probs from AMs: ĉ = arg max_c Σ_{i=1}^M α_i log p(f_i(c) | λ̂_i) Parameters of multiple models are trained independently * Inconsistency between training & synthesis * Use weights to control balance among AMs * Weights are determined by held-out data 25
  • 27. Training PoE Jointly estimate all models: {λ̂_i} = arg max_{{λ_i}} Π_{r=1}^R (1/Z({λ_i}, {l^r_i})) Π_{i=1}^M p(f_i(c^r) | l^r_i, λ_i), then ĉ = arg max_c Π_{i=1}^M p(f_i(c) | l_i, λ̂_i) - Jointly estimate all models as PoE, synthesis w/ all models - Linear feature function & Gaussian experts ⇒ Easy to train - General case ⇒ Difficult to train, use Contrastive Divergence [Hinton;02] 26
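Why the linear-feature, Gaussian-expert case on slide 27 is "easy": with f_i(c) = A_i c and Gaussian experts, the product over c is itself Gaussian, so the jointly most probable c is the solution of one linear system (the same mechanics as trajectory-HMM parameter generation). A sketch with a hypothetical static expert and a first-difference ("delta") expert, all numbers made up:

```python
import numpy as np

def poe_mode(As, mus, Sigmas):
    """Mode of prod_i N(A_i c; mu_i, Sigma_i) over c: the product is
    Gaussian in c, so solve (sum A_i' S_i^-1 A_i) c = sum A_i' S_i^-1 mu_i."""
    D = As[0].shape[1]
    P, r = np.zeros((D, D)), np.zeros(D)
    for A, mu, S in zip(As, mus, Sigmas):
        Si = np.linalg.inv(S)
        P += A.T @ Si @ A
        r += A.T @ Si @ mu
    return np.linalg.solve(P, r)

A_static = np.eye(3)                                  # expert on c itself
A_delta = np.array([[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]])  # on differences
c_hat = poe_mode([A_static, A_delta],
                 [np.array([0.0, 1.0, 0.0]), np.zeros(2)],
                 [np.eye(3), 0.1 * np.eye(2)])
# the low-variance zero-mean delta expert smooths the static targets [0,1,0]
```

With only the static expert the mode is the static mean itself; adding the delta expert pulls neighbouring frames together, which is exactly the "set of soft constraints" behaviour the undirected view is praised for.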
  • 28. Outline Statistical parametric speech synthesis framework - Speech & features - Text & labels - Acoustic models by directed & undirected graphical models Product of Experts (PoE)-based speech synthesis - Hierarchical nature in speech & text - Combination of multiple models at multiple levels - PoE interpretation Experiments Conclusions 27
  • 29. Experiment - Multiple-Level Dur Models as PoEs Combine multiple-level duration models [Ling;06, Gao;08] - Durations are also modeled in statistical speech synthesis - State-level durations are modeled by Gaussians - Combine higher-level duration models w/ state-level one * phone duration = sum of state durations in a phone * phone durations are modeled by Gaussians ⇒ linear & Gaussian PoE: p(d | λ) = (1/Z) Π_{i=1}^P Π_{j=1}^{N_i} N(d_ij; ξ_ij, σ_ij) · Π_{k=1}^P N(p_k; ν_k, ω_k), where d_ij is the duration of state j in phoneme i and p_k is the duration of phoneme k (the sum of its state durations) 28
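Since the phone-level expert on slide 29 acts on a sum of state durations (a linear feature) and all experts are Gaussian, the most probable durations under this PoE come from one small linear system per phone. A sketch for a single 3-state phone with made-up means and variances:

```python
import numpy as np

def refine_state_durations(xi, var_st, nu, var_ph):
    """Most likely state durations under a PoE of per-state Gaussians
    N(d_j; xi_j, var_st_j) and a phone-level Gaussian N(sum_j d_j; nu, var_ph).
    The quadratic objective gives a linear system:
      (diag(1/var_st) + ones/var_ph) d = xi/var_st + (nu/var_ph) 1"""
    xi = np.asarray(xi, float)
    var_st = np.asarray(var_st, float)
    n = len(xi)
    P = np.diag(1.0 / var_st) + np.ones((n, n)) / var_ph
    r = xi / var_st + nu / var_ph
    return np.linalg.solve(P, r)

d = refine_state_durations([3.0, 4.0, 3.0], [1.0, 1.0, 1.0],
                           nu=12.0, var_ph=1.0)
# d == [3.5, 4.5, 3.5]: the state sum (10) is pulled toward the
# phone-level mean (12), landing at 11.5
```

The phone-level expert redistributes duration across states in proportion to their (here equal) variances, which is the "refinement" effect measured in the RMSE table that follows.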
  • 30. Experiment - Multiple-Level Dur Models as PoEs Graphical model representation [Diagrams over state durations d_ij, states q_ij, phone-level nodes r_i, and syllable-level node s_1: (a) baseline (st), (b) PoE (st*ph), (c) PoE (st*ph*syl)] 29
  • 31. Experiment - Multiple-Level Dur Models as PoE Experimental conditions * Training data; 2,469 utterances * Development data; 137 utterances - Used to optimize weights in conventional method - Weights were optimized to minimize RMSE by grid search - Baseline & proposed method did not use development data * Almost the same training setup as Nitech-HTS 2005 [Zen;06] * Test data; 137 utterances * State, phone, & syllable-level models were clustered individually - # of leaf nodes * state; 607, phoneme; 1,364, syllable; 281 30
  • 32. Experimental Results Duration prediction results (RMSE in frames (relative improvement))
      Model              Phoneme        Syllable       Pause
      Baseline (st)      5.08 (ref)     8.98 (ref)     35.0 (ref)
      uPoE (st*ph)       4.62 (9.1%)    8.13 (9.5%)    31.8 (9.1%)
      uPoE (st*ph*syl)   4.62 (9.1%)    8.11 (9.7%)    31.8 (9.1%)
      PoE (st*ph)        4.60 (9.4%)    8.04 (10.5%)   31.9 (8.9%)
      PoE (st*ph*syl)    4.57 (10.0%)   8.02 (10.7%)   31.9 (8.9%)
    st; state only, st*ph; state & phoneme, st*ph*syl; state, phoneme, & syllable. uPoE; individually trained multiple-level duration models with optimized weights. PoE; jointly estimated multiple-level duration models. 31
  • 33. Experiment - Global Variance as PoE Parameter generation w/ global variance (GV) [Toda;07] [Plot: natural vs most probable 2nd mel-cepstral coefficient sequence over 0–3 sec; v(m) marks the dynamic range] Most probable sequence has smaller dynamic range 32
  • 34. Experiment - Global Variance as PoE Parameter generation w/ global variance (GV) [Toda;07] - How to recover dynamic range? ⇒ Use additional expert modeling dynamic range f_v(c) = (1/T) Σ_{t=1}^T diag((c_t − c̄)(c_t − c̄)ᵀ): intra-utt variance, quadratic p(c | q, λ, λ_GV) = (1/Z_q) N(c; c̄_q, P_q) {N(f_v(c); µ_v, Σ_v)}^α: product of frame-level experts & one utterance-level expert * experts are Gaussians * feature function is non-linear (quadratic) ⇒ General PoE ⇒ Contrastive divergence learning 33
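Because the GV feature on slide 34 is quadratic, the PoE no longer has a Gaussian closed form and contrastive divergence is needed. A toy CD-1 sketch for a single 1-D GV-style expert, using a few random-walk Metropolis steps in place of the hybrid Monte Carlo the talk actually used; the restriction to updating only the GV mean µ_v and all numerical settings are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gv(c):
    """Intra-utterance variance feature (1-D analogue of f_v on the slide)."""
    return np.mean((c - c.mean()) ** 2)

def log_potential(c, m, mu_v, sig2_v, alpha):
    # unnormalized log density: frame-level Gaussian * GV expert^alpha
    return (-0.5 * np.sum((c - m) ** 2)
            - 0.5 * alpha * (gv(c) - mu_v) ** 2 / sig2_v)

def cd1_step(data, m, mu_v, sig2_v, alpha, lr=0.01, n_mh=10, step=0.1):
    """One contrastive-divergence update of mu_v: positive statistics
    from the data, negative statistics from short MCMC chains
    initialized at the data (CD-1 style)."""
    dlog = lambda c: alpha * (gv(c) - mu_v) / sig2_v  # d log psi / d mu_v
    pos = np.mean([dlog(c) for c in data])
    neg = 0.0
    for c in data:
        x = c.copy()
        for _ in range(n_mh):          # random-walk Metropolis steps
            prop = x + step * rng.standard_normal(x.shape)
            dlp = (log_potential(prop, m, mu_v, sig2_v, alpha)
                   - log_potential(x, m, mu_v, sig2_v, alpha))
            if np.log(rng.random()) < dlp:
                x = prop
        neg += dlog(x)
    neg /= len(data)
    return mu_v + lr * (pos - neg)

data = [rng.standard_normal(5) for _ in range(4)]   # dummy "utterances"
mu_v_new = cd1_step(data, m=np.zeros(5), mu_v=1.0, sig2_v=1.0, alpha=0.5)
```

The structure matches the slide's recipe: short chains per training utterance stand in for the intractable model expectation, and the gradient of the log potential w.r.t. the expert parameter is evaluated at data and reconstruction samples.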
  • 35. Experiment - Global Variance as PoEs Graphical model representation [Diagrams over states q_t and observations c_t: (a) w/o GV (chain structure only), (b) w/ GV (additional utterance-level node g_utt connected to all c_t)] 34
  • 36. Experiment - Global Variance as PoE Experimental conditions * Training data; 2,469 utterances - Training data was split into mini-batches (250 utterances) - 10 MCMC samplings at each contrastive divergence learning step * Hybrid Monte Carlo with 20 leap-frog steps * Leap-frog size was adjusted adaptively - Learning rate was annealed at every 2,000 iterations - Momentum method was used to accelerate learning - Context-dependent logarithmic GV w/o silence was used * Test sentences; 70 sentences - Paired comparison test, # of subjects 7 (native English speakers) - 30 sentences per subject 35
  • 37. Experimental Results Paired comparison test result (preference %): Baseline 17.1, PoE 32.4, No preference 50.5. Baseline; conventional (not jointly estimated) GV. PoE; proposed (jointly estimated) GV. Difference was statistically significant at p<0.05 level 36
  • 38. Outline Statistical parametric speech synthesis framework - Speech & features - Text & labels - Acoustic models by directed & undirected graphical models Product of Experts (PoE)-based speech synthesis - Hierarchical nature in speech & text - Combination of multiple models at multiple levels - PoE interpretation Experiments Conclusions 37
  • 39. Conclusions PoE-based statistical parametric speech synthesis - Undirected model-based acoustic model for synthesis - Combine multiple models at multiple levels as PoE * Can incorporate hierarchical nature of speech * Joint estimation as PoE gave improvements Future work - Effect on quality of synthesized speech * Combining models > Joint estimation ⇒ Finding good experts is more important - Automatic method to define experts 38
  • 40. References
    [Yoshimura;02] T. Yoshimura, "Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems," Ph.D. thesis, Nitech, 2002.
    [Zen;07] H. Zen, et al., "Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences," Computer Speech & Language, vol. 21, no. 1, pp. 153–173, 2007.
    [Hinton;02] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
    [Latorre;08] J. Latorre & M. Akamine, "Multilevel parametric-base F0 model for speech synthesis," Proc. Interspeech, pp. 2274–2277, 2008.
    [Qian;09] Y. Qian, et al., "Improved prosody generation by maximizing joint likelihood of state and longer units," Proc. ICASSP, pp. 1173–1176, 2009.
    [Wang;08] C.-C. Wang, et al., "Multi-layer F0 modeling for HMM-based speech synthesis," Proc. ISCSLP, pp. 129–132, 2008.
    [Ling;06] Z.-H. Ling, et al., "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," Proc. Blizzard Challenge Workshop, 2006.
    [Gao;08] B.-H. Gao, et al., "Duration refinement by jointly optimizing state and longer unit likelihood," Proc. Interspeech, pp. 2266–2269, 2008.
    [Toda;07] T. Toda & K. Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007. 39
