Modeling Improved Syllabification Algorithm for Amharic


© Nirayo Hailu & Sebsbie Hailemariam

Published in: Technology

  1. AGIS’11: Action Week for Information Sharing
     Modeling Improved Syllabification Algorithm for Amharic
     Nirayo Hailu and Sebsbie Hailemariam (PhD)
     nirayo2000@yahoo.com, sebsibe2004@yahoo.com
     December 01, 2011, Addis Ababa, Ethiopia
  2. OUTLINE
     • Introduction
     • Amharic Syllable Structure & Syllabification
     • Design of Syllabification Model
     • Experimental Results & Evaluation
     • Conclusion & Recommendations
     Modeling Improved Amharic Syllabification Algorithm
  3. INTRODUCTION
     • A syllable is a unit of sound composed of a central peak of sonority (usually a vowel) and the consonants that cluster around this central peak.
     • Syllabification is the task of segmenting words, whether spoken or written, into syllables.
     • Technically, the basic elements of the syllable are:
       • Onset
       • Rhyme (Nucleus + Coda)
     • A syllable can be described by a series of grammars:
       • a consonant-vowel-consonant (CVC) sequence
       • onset, nucleus & coda (ONC)
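The onset / rhyme (nucleus + coda) decomposition described above can be sketched in a few lines of Python. This is only an illustration: the vowel set `aeiou` is an assumed simplification of the transliteration, and the names `split_syllable` and `rhyme` are hypothetical, not taken from the authors' implementation.

```python
VOWELS = set("aeiou")  # assumption: simplified vowel set for a Latin transliteration


def split_syllable(syllable):
    """Return (onset, nucleus, coda) for a single transliterated syllable."""
    # The nucleus starts at the first vowel.
    start = next((i for i, ch in enumerate(syllable) if ch in VOWELS), None)
    if start is None:
        raise ValueError(f"no vowel nucleus in {syllable!r}")
    # The nucleus runs through any immediately following vowels.
    end = start
    while end < len(syllable) and syllable[end] in VOWELS:
        end += 1
    return syllable[:start], syllable[start:end], syllable[end:]


def rhyme(syllable):
    """Rhyme = nucleus + coda."""
    _, nucleus, coda = split_syllable(syllable)
    return nucleus + coda


for syl in ("bil", "hat"):
    print(syl, split_syllable(syl), rhyme(syl))
```

For the two syllables of /bil-hat/, the sketch separates the onsets `b` and `h` from the rhymes `il` and `at`.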
  4. INTRODUCTION
     Example: syllable (σ) structure for the word ብልሃት /bil-hat/
       σ1: onset b; rhyme: nucleus i, coda l
       σ2: onset h; rhyme: nucleus a, coda t
     Importance of identifying syllable structures:
     • Speech synthesis (in the G2P module, prosody module and synthesis module)
       • Improves the intonation of the synthesized speech
     • Speech recognition (pronunciation dictionary)
       • To build a recognizer that represents pronunciations in terms of syllables/phonemes rather than graphemes
  5. INTRODUCTION
     Amharic Language
     • Amharic is a syllabic language in which every grapheme represents a Consonant-Vowel assimilation.
     • Not all syllables are uttered as the orthography suggests.
     • Amharic orthography does not show epenthetic vowels or geminated consonants.
     • In this project we developed an appropriate syllabification model for Amharic text.
  6. AUTOMATIC SYLLABIFICATION
     Approaches
     Rule-based:
     • Effectively embodies some theoretical position regarding the syllable.
     • Rules are used as the gold standard.
     • Requires a linguistic expert.
     • Implements notions such as the maximal onset principle & the sonority hierarchy.
     • Example: sonority curve for the word ምልክት /milkit/ (figure)
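A sonority curve like the one referred to for /milkit/ can be approximated as follows. The numeric scale below is an assumption (a coarse, conventional ranking of stops < fricatives < nasals < liquids < glides < vowels), not the scale used by the authors, and the function names are hypothetical.

```python
# Assumed coarse sonority scale; the slides do not give the actual values.
SONORITY = {}
for chars, level in (
    ("ptkbdgq", 1),  # stops
    ("fszh", 2),     # fricatives
    ("mn", 3),       # nasals
    ("lr", 4),       # liquids
    ("wy", 5),       # glides
    ("aeiou", 6),    # vowels
):
    for ch in chars:
        SONORITY[ch] = level


def sonority_curve(word):
    """Map each phoneme of a transliterated word to its sonority level."""
    return [SONORITY[ch] for ch in word]


def sonority_peaks(word):
    """Indices whose sonority is strictly higher than both neighbours."""
    curve = sonority_curve(word)
    peaks = []
    for i, level in enumerate(curve):
        left = curve[i - 1] if i > 0 else 0
        right = curve[i + 1] if i + 1 < len(curve) else 0
        if level > left and level > right:
            peaks.append(i)
    return peaks


print(sonority_curve("milkit"))  # [3, 6, 4, 1, 6, 1]
print(sonority_peaks("milkit"))  # [1, 4]
```

Under this assumed scale the word has two sonority peaks, one per vowel, which corresponds to its two syllables.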
  7. AUTOMATIC SYLLABIFICATION
     Approaches
     Data-driven:
     • Infers new syllabifications from an evidence base of already-syllabified words (a dictionary or lexicon).
     • Requires a large training corpus to attain better performance.
     • Examples:
       • Look-up procedure
       • Syllabification by analogy
       • Decision tree-based syllabification
  8. AUTOMATIC SYLLABIFICATION
     Related works
     Title | Approach used | Dataset used | Accuracy
     Unit Selection Voice For Amharic Using FESTVOX | Rule-based | - | -
     A Rule based Syllabification Algorithm for Sinhala | Rule-based | 30,000 distinct words | 99.95%
     Automatic detection of syllable boundaries in spontaneous speech (for French language) | Rule-based | speaker 1: 653 words; speaker 2: 1,238 words | 97.77%
     Automatic Word Stress Marking & Syllabification for Catalan TTS | Rule-based | test 1: 1000 words; test 2: 223 words | 99.7%; 99.8%
     Automatic Syllabification for Danish Text-to-Speech Systems | Rule-based (RB) & ANN | test 1: 1000 (randomly selected); test 2: 1000 (from newspapers) | RB: 96.9% & 98.7%; ANN: 94.1% & 94.5%
     A Syllabification Algorithm for Spanish | Rule-based | 316 words | 98.4%
  9. AMHARIC SYLLABLE STRUCTURE & SYLLABIFICATION
     Syllable structure of Amharic words
     • The main syllable templates: V, VC, VCC, CV, CVC, CVCC.
     Gemination & syllable structures
     • Gemination happens when a spoken consonant is pronounced for an audibly longer period of time than a short consonant.
     • Traditionally, it is represented either as /C:/ or /CC/ to indicate its length.
     • Gemination occurs frequently in Amharic words, except for the phonemes /h/ & /ax/.
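As a small illustration of the six syllable templates listed above, a candidate syllable can be reduced to its consonant-vowel skeleton and checked against the template set. The vowel set and the function names are assumptions made for this sketch.

```python
VOWELS = set("aeiou")  # assumption: simplified vowel set for a Latin transliteration
TEMPLATES = {"V", "VC", "VCC", "CV", "CVC", "CVCC"}  # templates listed on the slide


def cv_skeleton(segment):
    """Reduce a transliterated segment to a string of C and V symbols."""
    return "".join("V" if ch in VOWELS else "C" for ch in segment)


def is_valid_syllable(segment):
    """True if the segment's CV skeleton matches one of the main templates."""
    return cv_skeleton(segment) in TEMPLATES


for candidate in ("bil", "kift", "a", "kri"):
    print(candidate, cv_skeleton(candidate), is_valid_syllable(candidate))
```

A candidate such as `kri` (skeleton CCV) is rejected because none of the templates begins with a consonant cluster.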
  10. AMHARIC SYLLABLE STRUCTURE & SYLLABIFICATION
      Syllable structure of Amharic words
      Gemination & syllable structures
      • Example: ክፍት /kift/ and, with gemination, /kiffitt/ (figure: syllable structures of /kift/ and /kiffitt/)
  11. AMHARIC SYLLABLE STRUCTURE & SYLLABIFICATION
      Syllable structure of Amharic words
      Consonant clusters & their syllable structures
      • In Amharic, a consonant cluster may contain at most two consonants.
      • Onset clusters are not allowed in Amharic.
      • The sonority hierarchy helps us deal with consonant clusters.
      • Example: if a stop & a liquid appear together in a word-final cluster, an epenthetic vowel is inserted, because the sonority of the final liquid is greater than that of the preceding phoneme.
  12. AMHARIC SYLLABLE STRUCTURE & SYLLABIFICATION
      Syllable structure of Amharic words
      Consonant clusters & epenthesis
      • Epenthesis: the process of inserting an epenthetic vowel to split impermissible consonant clusters.
      • General rules regarding epenthesis in Amharic:
        • Word-initially: no consonant cluster
        • CCC → CCiC, as in /fendto/ → /fendito/
        • C:C → C:iC, as in /fellgo/ → /felligo/
        • CC: → CiC:, as in /sebrre/ → /sebirre/
        • C:C: → C:iC:
        • Word-finally: the sonority hierarchy principle is applied
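The medial cluster rules above (CCC, C:C, CC:, C:C:) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes geminates are written as doubled letters in the transliteration, uses an assumed vowel set, leaves out the word-initial and word-final cases, and all function names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: simplified vowel set for a Latin transliteration


def tokenize(word):
    """Split a word into phoneme tokens; a doubled consonant becomes one geminate token."""
    tokens, i = [], 0
    while i < len(word):
        ch = word[i]
        if ch not in VOWELS and i + 1 < len(word) and word[i + 1] == ch:
            tokens.append(ch * 2)  # geminate, e.g. "ll"
            i += 2
        else:
            tokens.append(ch)
            i += 1
    return tokens


def is_consonant(token):
    return token[0] not in VOWELS


def is_geminate(token):
    return len(token) == 2


def split_cluster(run, vowel):
    """Apply the cluster rules to one maximal run of consonant tokens."""
    if len(run) == 3 and not any(is_geminate(t) for t in run):
        return run[:2] + [vowel] + run[2:]  # CCC -> CCiC
    if len(run) == 2 and any(is_geminate(t) for t in run):
        return [run[0], vowel, run[1]]      # C:C, CC:, C:C: -> vowel in between
    return run                              # other runs are left untouched here


def insert_epenthesis(word, vowel="i"):
    tokens, out, i = tokenize(word), [], 0
    while i < len(tokens):
        if not is_consonant(tokens[i]):
            out.append(tokens[i])
            i += 1
            continue
        j = i
        while j < len(tokens) and is_consonant(tokens[j]):
            j += 1
        out.extend(split_cluster(tokens[i:j], vowel))
        i = j
    return "".join(out)


for w in ("fendto", "fellgo", "sebrre"):
    print(w, "->", insert_epenthesis(w))
```

Run on the slide's three examples, the sketch yields `fendito`, `felligo` and `sebirre`; a plain two-consonant medial cluster such as the one in `bilhat` is left unchanged.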
  13. DESIGN OF SYLLABIFICATION MODEL
      Syllabification
      • Given gemination handling rules, syllabification rules, epenthesis rules & the syllable templates of the language, it is possible to syllabify a given text, that is, to mark its syllable boundaries.
  14. DESIGN OF SYLLABIFICATION MODEL
      Design of rule-based syllabification model
      The overall architecture of automatic syllabification includes five modules:
      o Transliteration module
      o Gemination handling module (expert knowledge)
      o Epenthesis module
      o Syllabification module
      o Stress assignment module
  15. GENERAL ARCHITECTURE FOR AUTOMATIC SYLLABIFICATION
      Input: Amharic text
      1. Transliteration
      2. Gemination (uses: expert's knowledge)
         • Geminated consonant identification
      3. Epenthesis (uses: sonority scale of phonemes & epenthesis rules)
         • Consonant cluster identification
         • Epenthetic vowel insertion
      4. Syllabification (uses: syllable templates & syllabification rules)
         • Consonant-vowel parsing
         • Syllable template matching
         • Syllable boundary marking
      5. Stress assignment (uses: syllable weight rules)
         • Syllable weight assignment
         • Stress marker
      Output: syllable- & stress-marked phoneme sequence
  16. DESIGN OF SYLLABIFICATION MODEL
      Proposed epenthesis procedure
      1. Accept the input word & scan it from left to right.
      2. If a consonant cluster occurs at word-initial position, insert an epenthetic vowel between the consonants. Exception: the first phoneme is a consonant & the next consonant is the glide /w/. (Rule #1)
      3. If three consonants appear in sequence at word-medial or word-final position, insert an epenthetic vowel before the third consonant. (Rule #2) Exception: if the sonority of the middle consonant is greater than that of the others, insert the epenthetic vowel after the first consonant in the cluster.
      4. If a consonant cluster contains a geminate followed by a singleton, insert an epenthetic vowel after the geminated consonant. (Rule #3)
  17. DESIGN OF SYLLABIFICATION MODEL
      Proposed epenthesis procedure (continued)
      5. If a consonant cluster contains a singleton followed by a geminate, insert an epenthetic vowel after the singleton consonant. (Rule #4)
      6. If a consonant cluster contains two different geminates in sequence, insert an epenthetic vowel between the two geminate consonants. (Rule #5)
      7. If the sonority of the final consonant is greater than that of the preceding consonant, insert the epenthetic vowel between the consonants of the final cluster. (Rule #6)
      8. Repeat steps 2 through 7 until all the phonemes in the phoneme list are parsed.
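Rule #6 of the procedure above (the word-final sonority check) can be illustrated with a short sketch. The sonority scale is an assumed coarse ranking, not the authors' scale; the input strings are illustrative transliterations; and `fix_final_cluster` is a hypothetical name.

```python
VOWELS = set("aeiou")  # assumption: simplified vowel set

# Assumed coarse sonority scale; the slides do not list the actual values.
SONORITY = {}
for chars, level in (("ptkbdgq", 1), ("fszh", 2), ("mn", 3), ("lr", 4), ("wy", 5)):
    for ch in chars:
        SONORITY[ch] = level


def fix_final_cluster(word, vowel="i"):
    """Insert an epenthetic vowel into a word-final CC cluster with rising sonority."""
    if len(word) < 2:
        return word
    prev, last = word[-2], word[-1]
    if prev in VOWELS or last in VOWELS:
        return word  # the word does not end in a consonant cluster
    if SONORITY[last] > SONORITY[prev]:
        return word[:-1] + vowel + last  # e.g. stop + liquid: split the cluster
    return word  # falling or level sonority: the cluster is kept


print(fix_final_cluster("kibr"))  # stop + liquid, rising sonority
print(fix_final_cluster("kift"))  # fricative + stop, falling sonority
```

With this scale a final stop + liquid cluster is split, while a final cluster with falling sonority, as in /kift/, stays intact and the syllable keeps its CVCC shape.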
  18. DESIGN OF SYLLABIFICATION MODEL
      Proposed syllabification procedure
      1. Accept the input from the epenthesis algorithm & scan it from left to right.
      2. If two vowel phonemes (VV) occur in sequence at word-initial position, mark a syllable boundary between them.
      3. If the initial phoneme is a vowel & the next two phonemes are a consonant & a vowel respectively, mark the syllable boundary just at the second …
      4. If a (VCCV) pattern occurs at any position, mark a syllable boundary between the two consonants of the cluster.
      5. If a (VCVC) pattern occurs at word-initial position, mark a syllable boundary before the second vowel.
      6. If a (CVV) sequence occurs at any position, mark a syllable boundary between the two vowels.
  19. DESIGN OF SYLLABIFICATION MODEL
      Proposed syllabification procedure (continued)
      7. If a (CVCCV) phoneme sequence occurs at word-initial position, mark a syllable boundary between the middle consonants (CVC-CV).
      8. If a (CVCC) pattern occurs at word-final position & there is a phoneme before the first consonant, mark a syllable boundary before the initial consonant of this pattern.
      9. If a (CVCV) pattern occurs at any position, mark a syllable boundary after the vowels; if it occurs at word-final position, the result is a CV-CV pattern.
      10. If a (CVC1C1VC or CVCCVC) pattern occurs in a word, mark a syllable boundary between the geminated consonants (CVC1-C1VC).
      11. If a (VVCC) syllable pattern occurs at word-final or word-initial position, mark a syllable boundary between the two vowels.
      12. Repeat steps 2 through 11 until all phonemes are parsed.
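The boundary-marking idea behind the procedure above can be sketched in a much simplified form. This is not the authors' rule set: instead of the eleven pattern rules, it places one boundary between each pair of neighbouring vowels (V-V, V-CV, VC-CV), which agrees with the patterns above for input that has already been through epenthesis. The vowel set and the name `syllabify` are assumptions.

```python
VOWELS = set("aeiou")  # assumption: simplified vowel set for a Latin transliteration


def syllabify(word):
    """Split a transliterated, post-epenthesis word into syllables."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]
    boundaries = []
    for prev, nxt in zip(nuclei, nuclei[1:]):
        consonants_between = nxt - prev - 1
        if consonants_between <= 1:
            # V-V or V-CV: a single consonant becomes the onset of the next syllable.
            boundaries.append(prev + 1)
        else:
            # VC-CV: only the last consonant becomes an onset (no onset clusters).
            boundaries.append(nxt - 1)
    starts = [0] + boundaries
    ends = boundaries + [len(word)]
    return [word[s:e] for s, e in zip(starts, ends)]


for w in ("bilhat", "milkit", "felligo", "kiffitt"):
    print(w, "->", "-".join(syllabify(w)))
```

On these inputs the sketch produces `bil-hat`, `mil-kit`, `fel-li-go` and `kif-fitt`, placing the boundary between the two halves of a geminate as in step 10.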
  20. DESIGN OF SYLLABIFICATION MODEL
      • The algorithms were implemented in the C# programming language.
  21. EXPERIMENTAL RESULTS & EVALUATION
      The test corpus:
      • Each word contains three to four syllables on average.
      • The corpus contains a total of 3,099 syllables.
      • It contains 779 consonant clusters, including geminated consonants.
      Cluster type | Phonemes type | Position in a word | Total number | Percentage (%)
      Consonant | #CC | initial | 256 | 45.47
      Consonant | CC# | final | 78 | 13.85
      Consonant | CCC | medial or final | 55 | 9.77
      Gemination | C:C | medial or final | 65 | 11.54
      Gemination | CC: | medial or final | 61 | 10.83
      Gemination | C:C: | medial or final | 48 | 8.53
      Total | | | 563 | 100
  22. EXPERIMENTAL RESULTS & EVALUATION
      Epenthesis performance
      Epenthesis insertion performance (rows: expert decision; columns: algorithm prediction)
      Expert decision | Insert epenthesis | Don't insert epenthesis | Total | Rate
      Insert epenthesis | 528 | 17 | 545 | 69.97%
      Don't insert epenthesis | 2 | 232 | 234 | 30.03%
      Total | 530 | 249 | 779 | 100%
      Accuracy = (# correct insertions + # correct neglections) / (# total consonant clusters) × 100%
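The accuracy formula can be evaluated directly on the counts in the table above. The transcript does not preserve the resulting figure, so the value below is derived here from the table and is not a number quoted from the slides; the dictionary layout and function name are illustrative.

```python
# Counts from the epenthesis table: (expert decision, algorithm prediction) -> clusters
confusion = {
    ("insert", "insert"): 528,  # correct insertions
    ("insert", "skip"): 17,     # missed insertions
    ("skip", "insert"): 2,      # wrong insertions
    ("skip", "skip"): 232,      # correct neglections
}


def epenthesis_accuracy(table):
    """(# correct insertions + # correct neglections) / # total clusters, in percent."""
    correct = table[("insert", "insert")] + table[("skip", "skip")]
    total = sum(table.values())
    return 100.0 * correct / total


print(sum(confusion.values()))                   # 779 clusters in total
print(round(epenthesis_accuracy(confusion), 2))  # 97.56
```

The 19 disagreements (17 missed and 2 wrong insertions) out of 779 clusters give an epenthesis accuracy of about 97.56% under this formula.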
  23. EXPERIMENTAL RESULTS & EVALUATION
      Syllabification
      Distribution of syllable templates over the test result (frequencies by position)
      Syllable pattern | Word initial | Word medial | Word final | Total | Percentage (%)
      V | 36 | 0 | 2 | 38 | 1.27
      CV | 352 | 690 | 579 | 1621 | 53.59
      VC | 64 | 25 | 5 | 94 | 3.10
      VCC | 74 | 0 | 0 | 74 | 2.44
      CVC | 450 | 310 | 342 | 1102 | 36.43
      CVCC | 24 | 0 | 72 | 96 | 3.17
      Total | 1000 | 1025 | 1000 | 3025 | 100
      Syllabification performance: evaluation by an Amharic linguist shows an overall syllabifier accuracy of 98.1%. Word accuracy is 98.1%, and juncture accuracy is about the same.
  24. EXPERIMENTAL RESULTS & EVALUATION
      Syllabification errors
      • Most of the errors were due to a missed epenthetic vowel or a wrongly inserted epenthetic vowel.
      Problem # | Problem description | Total # in the test result corpus
      1 | Words with wrong epenthetic vowel insertions | 2
      2 | Words with neglected epenthetic vowel insertions | 11
      3 | Syllabification problems from neglected epenthesis in a CC sequence at word-medial position | 6
      Total syllabification errors | | 19
  25. CONCLUSION & RECOMMENDATIONS
      Conclusion
      • An automatic syllabification algorithm that considers the frequently occurring epenthetic vowel /i/ and gemination.
      • An algorithm for the frequently occurring epenthetic vowel.
      • Results showed 98.1% word accuracy.
      • Rule-based syllabification & linguistic syllabification principles are important in implementing automatic syllabification & epenthesis.
  26. CONCLUSION & RECOMMENDATIONS
      Recommendations
      • Researchers in the area can use the algorithm in Amharic TTS & in Amharic ASR; the source code is available in C++ (GNU C++ compiler) and C#.
      Future works
      • Gemination handling algorithm.
      • Study of final consonant clusters to improve the performance of the syllabifier.
      • Stress assignment algorithm.
      • Performance investigation on both Amharic TTS & ASR.
      • A comparison study using data-driven approaches.
  27. Thank you!
