Candeias sti lg2p_vfinal


  1. GENERATING A PRONUNCIATION DICTIONARY FOR EUROPEAN PORTUGUESE USING A JOINT-SEQUENCE MODEL WITH EMBEDDED STRESS ASSIGNMENT
     Arlindo Veiga (1,2), Sara Candeias (1), Fernando Perdigão (1,2)
     (1) Instituto de Telecomunicações, Polo de Coimbra, Portugal; (2) Universidade de Coimbra, DEEC, Portugal
     STIL 2011, 8th Symposium in Information and Human Language Technology, Oct. 24-26, 2011, Cuiabá, Brazil
     © 2005, it - instituto de telecomunicações. All rights reserved.
  2. SUMMARY
     • Goal
     • Problem Statement
     • G2P System
     • Joint-Sequence Model
     • Stressed Vowel Assignment
     • Results
     • Conclusions
  3. GOAL
     • To generate a pronunciation dictionary for European Portuguese (EP)
     • To develop a grapheme-to-phoneme (G2P) system for EP
  4. PROBLEM STATEMENT
     What approaches? How? Implementing an automatic G2P converter.
     • Linguistic rules
       • Portuguese orthography is roughly phonologically based → it gives good coverage of the grapheme-to-phoneme association
       • No natural human language fully satisfies this assumption → the association between G and P is not quite one-to-one → a list of exceptions is needed
       • Very complex, hard and tiresome
  5. PROBLEM STATEMENT
     What approaches? How? Implementing an automatic G2P converter.
     • Linguistic rules
     • Statistics
       • Using pronunciation examples, it may be possible to predict the pronunciation of unseen words by analogy
       • Not smart enough: vaga -> v "a g 6 vs. vagarosa -> v 6 g 6 r "O z 6
  6. PROBLEM STATEMENT
     What approaches? How? Implementing an automatic G2P converter.
     • Linguistic rules
     • Statistics
     • MIXED
  7. G2P SYSTEM
     System based on a mixed approach founded on:
     • a stochastic model: joint-sequence model
     • rules for stressed vowel assignment
     Alignment between graphemes and phonemes: “one-to-one”
  8. JOINT-SEQUENCE MODEL
     Alignment between graphemes and phonemes: “one-to-one”
     < B r a s i l >   / b r 6 z i l /
  9. JOINT-SEQUENCE MODEL
     Alignment between graphemes and phonemes: “one-to-one”
     < B r a s i l >   / b r 6 z i l /
     < c h a m o u >   / S 6 m o /
     < t ê m >   / t 6~ i~ 6~ i~ /
  10. JOINT-SEQUENCE MODEL
      Alignment between graphemes and phonemes: “one-to-one”
      < c h a m o u >   / S 6 m o /
      < t ê m >   / t 6~ i~ 6~ i~ /
  11. JOINT-SEQUENCE MODEL
      • Implementing the Levenshtein algorithm (“1-01”)
      • Defining alternative symbols
        • Graphemes → DIGRAPHS
      < c h a m o u > → < S a m º >   / S 6 m o /
  12. JOINT-SEQUENCE MODEL
      • Implementing the Levenshtein algorithm (“1-01”)
      • Defining alternative symbols
        • Graphemes → DIGRAPHS
        • Phonemes → SAMPA UniChar
      < c h a m o u > → < S a m º >   / S 6 m o /
      < t ê m >   / t 6~ i~ 6~ i~ /,  / t ï ï /,  / t Æ ï /
  13. JOINT-SEQUENCE MODEL
      • Implementing the Levenshtein algorithm (“1-01”)
      • Defining alternative symbols
        • Graphemes → DIGRAPHS
        • Phonemes → SAMPA UniChar
      < c h a m o u > → < S a m º >   / S 6 m o /
      < t ê m >   / t Æ ï /
  14. JOINT-SEQUENCE MODEL
      • Implementing the Levenshtein algorithm (“1-01”)
      • Defining alternative symbols
        • Graphemes → DIGRAPHS
        • Phonemes → SAMPA UniChar
      < c h a m o u > → < S a m º >   / S 6 m o /
      < t ê m >   / t Æ ï /
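
The digraph step on these slides can be illustrated with a minimal Python sketch. It assumes a symbol table containing only the two mappings visible in the example (<ch> to S and <ou> to º); the table and the function names `collapse_digraphs` and `pair_symbols` are hypothetical, and the system's actual table is not given in the slides.

```python
# Hypothetical symbol table: only the two digraph mappings visible on the
# slide (<ch> -> S, <ou> -> º); the real system's table is not given.
DIGRAPHS = {"ch": "S", "ou": "º"}


def collapse_digraphs(word, table=DIGRAPHS):
    """Replace each known digraph with its single-character stand-in."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def pair_symbols(graphemes, phonemes):
    """Pair graphemes and phonemes position by position (one-to-one)."""
    if len(graphemes) != len(phonemes):
        raise ValueError("sequences differ in length; an alignment step is needed")
    return list(zip(graphemes, phonemes))


graphemes = collapse_digraphs("chamou")
pairs = pair_symbols(graphemes, ["S", "6", "m", "o"])
print(graphemes)
print(pairs)
```

Once both sides use one symbol per sound unit, the word and its transcription have the same length, so a position-by-position pairing is enough for this example. Words that still differ in length after substitution would need the Levenshtein-based alignment named on the slide.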
  15. JOINT-SEQUENCE MODEL
      • Implementing the Levenshtein algorithm (“1-01”)
      • Defining alternative symbols
        • Graphemes → DIGRAPHS
        • Phonemes → SAMPA UniChar
      < c h a m o u > → < S a m º >   / S 6 m o /
      < t ê m >   / t Æ ï /
      Each aligned grapheme-phoneme pair is a graphoneme.
      GOAL: to compute the most probable pronunciation of a word given the word's graphoneme form
      TECHNIQUE: using n-grams
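
The goal and technique on this slide can be written compactly. The form below is the way an n-gram model over graphonemes is commonly stated in the joint-sequence literature; the notation (g for the grapheme sequence, φ for the phoneme sequence, q_i for the i-th graphoneme, K for the number of graphonemes) is assumed here and does not come from the slides.

```latex
\hat{\varphi}(g) = \arg\max_{\varphi} \; p(g, \varphi),
\qquad
p(g, \varphi) \approx \prod_{i=1}^{K+1} p\bigl(q_i \mid q_{i-n+1}, \dots, q_{i-1}\bigr),
\qquad
q_i = (g_i, \varphi_i)
```

Here q_{K+1} stands for an end-of-word marker, and positions before the first graphoneme are padded with a start-of-word marker. Under a one-to-one alignment, K equals the number of grapheme symbols left after digraph substitution.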
  16. G2P SYSTEM
      System based on a mixed approach founded on:
      • a stochastic model: joint-sequence model
        • Several errors due to incorrect stress assignment: solidamente, incansavelmente
      • rules for stressed vowel assignment
  17. G2P SYSTEM
      System based on a mixed approach founded on:
      • a stochastic model: joint-sequence model
      • rules for stressed vowel assignment (6 rules)
      Marking the stressed vowel (Vstressed) improved the statistical model by expressing graphoneme classes unequivocally.
  18. STRESSED VOWEL ASSIGNMENT
      To generate a univocal graphoneme, we attributed special symbols to the stressed vowel (Vstressed).
      For adverbs ending in <mente> (e.g. <rápido> → <rapidamente>, fast → quickly):
      • An algorithm divides the word into two parts, <ROOT> and <mente>.
      • The <ROOT> part is handled by a specific module (a list of graphematic patterns in which the Vstressed is identified).
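
The root/suffix split for <mente> adverbs can be sketched as follows. This is a minimal illustration only: the function name `split_mente` is hypothetical, and the check is purely orthographic, so a noun such as "semente" (seed) would also be split. The actual system presumably needs more than a suffix test.

```python
SUFFIX = "mente"


def split_mente(word):
    """Split a <mente> adverb into (root, suffix); return (word, "") otherwise."""
    lowered = word.lower()
    # Require a non-empty root so the noun "mente" itself is left whole.
    if lowered.endswith(SUFFIX) and len(lowered) > len(SUFFIX):
        return lowered[:-len(SUFFIX)], SUFFIX
    return lowered, ""


for w in ["rapidamente", "solidamente", "incansavelmente", "mente", "vaga"]:
    print(w, "->", split_mente(w))
```

The root returned here (for example "rapida") is what would then be matched against the list of graphematic patterns to locate the stressed vowel.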
  19. VOCABULARY
      To estimate the graphoneme model:
      • SpeechDat pronunciation dictionary
        • 15k entries
        • Deletion of foreign words
        • Change of some transcriptions
        • Standardization of the pronunciation
      Applied to the CETEMPúblico vocabulary: 40k words → 40k pronunciations
  20. DICTIONARY
      CETEMPúblico 40k pronunciations:
      • Iterative procedure:
        • Long manual verification
        • Correction of the transcriptions
      • Comparison to the pronunciations of LOQUENDO
        • The majority of the transcriptions agreed.
        • Where they differed, the transcriptions from our dictionary were correct most of the time.
      This dictionary was used for the training and test procedure.
  21. EXPERIMENTS
      All experiments were based on the dictionary of 40k pronunciations, in two versions:
      • with stress marking
      • without stress marking
      To train and test the model, each of these two dictionaries was partitioned into five folds for a cross-validation procedure.
      Final results were obtained by averaging the five partial results.
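
The five-fold procedure described on this slide can be sketched in a few lines of Python. The shuffling, the seed, and the `evaluate(train, test)` callback are assumptions made for illustration; the slides do not say how the folds were drawn.

```python
import random


def five_fold_splits(entries, k=5, seed=0):
    """Yield (train, test) lists; each entry is in exactly one test fold."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test


def cross_validate(entries, evaluate, k=5):
    """Average the score returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in five_fold_splits(entries, k)]
    return sum(scores) / len(scores)


# Toy usage: a stand-in "evaluate" that reports the test share of the data.
entries = [f"word{i}" for i in range(10)]
score = cross_validate(entries, lambda train, test: len(test) / (len(train) + len(test)))
print(score)
```

In a real run, `evaluate` would train the n-gram model on `train` and return an error rate measured on `test`, once for the dictionary with stress marking and once for the one without.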
  22. RESULTS
      The performance of the G2P conversion system was expressed as two average error rates: phoneme error rate (PER) and word error rate (WER).
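
The slides do not give the exact formulas for the two rates. The sketch below uses the common definitions as an assumption: PER is the total Levenshtein distance between reference and predicted phoneme sequences divided by the number of reference phonemes, and WER is the fraction of words with at least one error. Function names and the toy data are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(references, hypotheses):
    """Return (PER, WER) as fractions over paired pronunciations."""
    phone_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    phone_total = sum(len(r) for r in references)
    word_errors = sum(r != h for r, h in zip(references, hypotheses))
    return phone_errors / phone_total, word_errors / len(references)


# Toy data (stress marks omitted): the second word has one wrong phoneme.
refs = [["v", "a", "g", "6"], ["b", "r", "6", "z", "i", "l"]]
hyps = [["v", "a", "g", "6"], ["b", "r", "a", "z", "i", "l"]]
per, wer = error_rates(refs, hyps)
print(per, wer)
```

Because a single wrong phoneme makes the whole word count as wrong, WER is always at least as sensitive as PER to isolated errors such as a misplaced stress.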
  23. RESULTS
      The following figures summarize the results obtained using n-grams with n between 2 and 8.
  24. RESULTS
      Using n-grams with large contexts (n greater than 5) did not improve the system. In fact, the error rates increased slightly, because there are too few samples to estimate large contexts.
  25. RESULTS
      Marking the stressed vowel contributed to a significant improvement in system performance.
  26. CONCLUSIONS
      The joint-sequence model with embedded stress assignment achieved good results.
      By inspecting the test errors, we observed that most of them resulted from uncommon grapheme patterns or compound words without graphic stress marks.
      The most frequent errors involved the pronunciation of stressed <e> and <o>, which can be pronounced as /E/ vs. /e/ (<selo>: verb vs. noun) and /O/ vs. /o/ (<ovos> (pl.) vs. <ovo> (sing.)) without any systematic rule.
      Our system is freely available at http://www.co.it.pt/~labfala/g2p/ and includes models, dictionaries and the G2P converter.
      Thank you
  27. GENERATING A PRONUNCIATION DICTIONARY FOR EUROPEAN PORTUGUESE USING A JOINT-SEQUENCE MODEL WITH EMBEDDED STRESS ASSIGNMENT
      Arlindo Veiga (1,2), Sara Candeias (1) (saracandeias@co.it.pt), Fernando Perdigão (1,2)
      (1) Instituto de Telecomunicações, Polo de Coimbra, Portugal; (2) Universidade de Coimbra, DEEC, Portugal
  28. INTRODUCTION
      Generate a pronunciation dictionary for EP
      • Grapheme-to-phoneme conversion (G2P): Bom dia → b"o~ d"i6 (en. Good morning)
      • Applications: component of ASR and TTS systems, e.g. in language learning, machine translation, …
      • For correct pronunciation we need:
        • G2P, stress assignment
      • Contributions of this paper:
        • Show phonological constraints (stressed vowel)
        • Evaluate a mixed approach for a G2P system
        • Make the dictionary, the model and the converter publicly available
