
SLTU 2012


  1. Developments of Swahili resources for an ASR system. Hadrien Gelas (1,2), Laurent Besacier (2), François Pellegrino (1). (1) Laboratoire DDL, CNRS, Université de Lyon, France; (2) LIG, CNRS, Université Joseph Fourier, Grenoble, France.
  2. Outline: (1) introduction to Swahili, (2) ASR resources, (3) system results.
  3. Part 1: Swahili.
  4. Only 2% of Swahili speakers are native (between 800k and 5M); 98% are non-native. In total, between 40M and 100M speakers.
  5. Spoken over a large area of East Africa, in more than 9 countries.
  6. Official language of 5 nations.
  7. Map of the Swahili language area in East Africa.
  8. Internet penetration rate (%): Africa 13.5, Asia 26.2, World Average 32.7, Middle East 35.6, Latin America / Caribbean 39.5, Europe 61.3, Oceania / Australia 67.5, North America 78.6.
  9. Internet population growth 2000-2011 (%): North America 152.6, Oceania / Australia 214, Europe 376.4, World Average 528.1, Asia 789.6, Latin America / Caribbean 1205.1, Middle East 2244.8, Africa 2988.4.
  10. Swahili and IT services.
  11. Swahili and IT services... but not yet all of them.
  12. Swahili online resources.
  13. Bantu family: 333.
  14. Swahili features for ASR: rich morphology (noun class agreement systems, complex verbs); non-tonal; Roman script.
  15. Part 2: ASR resources. Pipeline: acoustic models + pronunciation dictionary + language models -> text output.
  16. ASR resources: the language models need a text corpus.
  17. Text corpus sizes (M words): Sawa corpus [Getao and Miriti] 2, Helsinki corpus 12.5, our corpus 28 (collected from 16 news websites).
  18. Rich morphology in Swahili. English: "They will not tell you"; Swahili: hawatakuambieni; segmentation: ha-wa-ta-ku-ambi-e-ni; gloss: NEG-SM2-FUT-OM2-tell-FIN-PL.
  19. Rich morphology for ASR (type OOV %): Word-65k 19.17, Word-200k 12.46, Word-400k 10.28.
  20. Even with a 400k word vocabulary, OOV rates remain high.
  21. To reach a larger lexical coverage, we used an unsupervised approach (Morfessor) to segment words into sub-word units.
  22. Type OOV %: Word-65k 19.17, Word-200k 12.46, Word-400k 10.28, Morf-65k 11.36, Morf-200k 1.61.
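As a toy illustration of the sub-word idea (not the paper's setup: the real splits come from unsupervised Morfessor training, and the OOV figures above are measured on the full corpus), a sketch of how morph units cover words that a word lexicon misses:

```python
# Toy illustration: sub-word units lower the OOV rate because unseen
# words can be rebuilt from known morphs. The morph inventory below is
# hand-made for the example, not Morfessor output.

def oov_rate(lexicon, test_types):
    """Percentage of test vocabulary types missing from the lexicon."""
    missing = sum(1 for t in test_types if t not in lexicon)
    return 100.0 * missing / len(test_types)

def segmentable(word, units):
    """True if `word` can be fully split into known sub-word units (DP)."""
    ok = [True] + [False] * len(word)
    for i in range(1, len(word) + 1):
        ok[i] = any(ok[j] and word[j:i] in units for j in range(i))
    return ok[-1]

word_lexicon = {"hawatakuambieni", "atasema"}
morphs = {"ha", "wa", "ta", "ku", "ambi", "e", "ni", "a", "sema"}

test_types = ["hawatakuambieni", "watasema"]  # "watasema" is unseen
print(oov_rate(word_lexicon, test_types))     # 50.0: the word model misses it
print(all(segmentable(w, morphs) for w in test_types))  # True: morphs cover both
```

The same principle, applied with a 65k or 200k morph inventory learned from the 28M-word corpus, is what drives the drop from 19.17% to 1.61% on the slide.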
  23. ASR resources: the pronunciation dictionary needs a pronunciation for each unit.
  24. Pronunciation dictionary: the 65k most frequent units (words or sub-words) + a grapheme-to-phoneme script taking advantage of the regularity of Swahili spelling.
  25. BUT... there is an issue with English words, proper names and acronyms!
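The paper's G2P script is not reproduced here; a minimal rule-based sketch in the same spirit, assuming an illustrative digraph-to-phone table rather than the actual script or phone inventory:

```python
# Minimal grapheme-to-phoneme sketch exploiting the near-phonemic spelling
# of Swahili: greedy longest-match over multigraphs, single letters map to
# themselves. The multigraph table and phone symbols are illustrative only.

MULTIGRAPHS = {
    "ng'": "N",                      # velar nasal
    "ch": "tS", "sh": "S", "ny": "J",
    "th": "T", "dh": "D", "gh": "G",
}

def g2p(word):
    """Convert one spelling unit at a time, longest match first."""
    phones, i = [], 0
    while i < len(word):
        for n in (3, 2, 1):
            chunk = word[i:i + n]
            if chunk in MULTIGRAPHS:
                phones.append(MULTIGRAPHS[chunk])
                i += n
                break
        else:  # no multigraph matched: the letter is its own phone
            phones.append(word[i])
            i += 1
    return phones

print(g2p("chakula"))  # ['tS', 'a', 'k', 'u', 'l', 'a']
```

This regularity is exactly why a script suffices for native Swahili words, and why English loanwords, proper names and acronyms (next slides) break the assumption.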
  26. Pronunciation dictionary: nearly 9% of the units in the 65k lexicon are found in the CMU English dictionary.
  27. Comparing entries: in the 65k dictionary, games -> g a m e s; in CMU, games -> G EY M Z.
  28. Step 1: find identical words in both dictionaries (e.g. games).
  29. Step 2: map the CMU phones to Swahili phones.
  30. Step 3: add the mapped pronunciation as a variant: games(2) -> g e y m z.
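The three steps above can be sketched as follows; the CMU-to-Swahili phone mapping shown is a hypothetical fragment, not the paper's full table:

```python
# Sketch of steps 1-3: find 65k entries that also appear in the CMU
# dictionary, map their CMU (ARPAbet) phones to Swahili phones, and add
# the result as a numbered pronunciation variant.

CMU_TO_SWAHILI = {"G": "g", "EY": "e y", "M": "m", "Z": "z"}  # fragment only

def add_english_variants(lexicon, cmu_dict):
    out = dict(lexicon)
    for word, cmu_phones in cmu_dict.items():
        # step 1: identical spelling in both dictionaries,
        # and every CMU phone has a mapping
        if word in lexicon and all(p in CMU_TO_SWAHILI for p in cmu_phones):
            # step 2: map CMU phones to Swahili phones
            mapped = " ".join(CMU_TO_SWAHILI[p] for p in cmu_phones)
            # step 3: add as a variant alongside the G2P pronunciation
            out[word + "(2)"] = mapped
    return out

lex = {"games": "g a m e s"}
cmu = {"games": ["G", "EY", "M", "Z"]}
print(add_english_variants(lex, cmu)["games(2)"])  # g e y m z
```

Keeping both pronunciations lets the recognizer match either a spelling-based or an English-like realization of the loanword.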
  31. ASR resources: the acoustic models need audio data with matching transcriptions.
  32. Audio corpus: the main constraint for us! Collecting it is a time-consuming and expensive task.
  33. Read speech corpus (1st solution). Transcriptions are directly available and the task is easy to prepare. BUT it may not be natural enough, and speakers willing to record must be found. 3h30 were collected this way.
  34. Crowdsourced transcriptions (2nd solution). Amazon Mechanical Turk: tasks can be posted online and anyone can be paid to do them. Pros: quality good enough for acoustic models; transcribers can be found. Cons: completion rate lower than for English; ethical issues. Only a test: 1h30 of read speech was transcribed this way.
  35. Collaborative transcriptions (3rd solution). Corpus to transcribe: web broadcast news (available online with good enough quality). Collaboration with a Kenyan institute.
  36. A 1st acoustic model (AM) is trained on the read speech corpus.
  37. A 2-hour set is automatically segmented and filtered.
  38. The 2-hour set is transcribed using the 1st set AM.
  39. The transcribed 2-hour set is sent to the Taji Institute for correction.
  40. After correction, the data are added to the training corpus and a new acoustic model (2nd set AM) is trained.
  41. By the 6th set AM, 12 hours had been transcribed.
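The loop in slides 36-41 can be sketched as follows; `train_am`, `auto_transcribe` and `human_correct` are toy stand-ins (the real system used an ASR toolkit for the first two, and manual correction by the Kenyan institute for the third):

```python
# Sketch of the iterative collaborative-transcription loop: bootstrap an
# AM on read speech, then repeatedly auto-transcribe a 2-hour batch, have
# it corrected, fold it into training, and retrain.

def train_am(corpus):
    """Placeholder acoustic model: records only its training-set size."""
    return {"trained_on": len(corpus)}

def auto_transcribe(am, batch):
    """Placeholder decoding: one hypothesis per utterance."""
    return [(utt, "hypothesis") for utt in batch]

def human_correct(hypotheses):
    """Placeholder for the institute's manual correction pass."""
    return [(utt, "corrected") for utt, _ in hypotheses]

def bootstrap(read_speech, two_hour_batches):
    training_data = list(read_speech)   # 3h30 of read speech to start
    am = train_am(training_data)        # 1st set AM
    for batch in two_hour_batches:      # each batch: ~2h, segmented/filtered
        corrected = human_correct(auto_transcribe(am, batch))
        training_data += corrected      # add corrected transcriptions
        am = train_am(training_data)    # retrain: 2nd ... 6th set AM
    return am, training_data
```

Run with six 2-hour batches, this yields the 12 transcribed hours of the slide; each retrained AM produces better hypotheses, which in turn makes the next correction pass faster.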
  42. Collaborative transcriptions: time spent correcting each 2-hour set (hours) plotted against the character accuracy rate (%) of the automatic transcriptions. Correction time drops from about 40 hours for the 1st set (around 60% accuracy) to about 15 hours for the 6th set (around 85% accuracy) as the AM improves.
  43. Part 3: system results (WER). Same pipeline: acoustic models + pronunciation dictionary + language models -> text output.
  44. Asante! (Thank you!) hadrien.gelas@univ-lyon2.fr, laurent.besacier@imag.fr, francois.pellegrino@univ-lyon2.fr
