
Progress on Bangla Text-To-Speech System by Dr. M. Shahidur Rahman


  1. Progress on Bangla Text-To-Speech System. Presented by: Dr. M. Shahidur Rahman, Professor, Dept. of Computer Science & Engg., Shahjalal University of Science & Technology (rahmanms@sust.edu)
  2. Outline • Introduction to TTS • How TTS works • Present Bangla TTS systems • Problems of the present Bangla TTS • Directions to improve the performance of Bangla TTS • Discussion
  3. What is a TTS? • The goal of text-to-speech (TTS) synthesis is to convert arbitrary input text into intelligible and natural-sounding speech – TTS is not a “cut-and-paste” approach that strings together isolated words – Instead, TTS employs linguistic analysis to infer correct pronunciation and prosody (i.e., NLP) and acoustic representations of speech to generate waveforms (i.e., DSP)
  4. TTS Applications. Applications: • Services for the visually impaired community • Services for people who are illiterate or have difficulty reading • Enabling use of computers and IT services – Reading email aloud – Using a word processor – Using the Internet. Existing TTS systems: • Festival • Bell Labs TTS
  5. How TTS Works
  6. Different TTS Systems. Phoneme-Based TTS System • Phonemes are: – The minimal distinctive phonetic units – Relatively small in number (39 phonemes in English) • Disadvantage: – Phonemes ignore the transitional sound between units
  7. Different TTS Systems (cont’d). Diphone-Based TTS System • Diphones are: – Made up of 2 phonemes – Incorporate the transitional sound – Produce better-sounding speech – Ex. কক = ক + কঅ + অক + ক • Disadvantage: – Over 1500 diphones in the English language
  8. Text Pre-Processing • Convert raw text, which may include numbers, abbreviations, etc., into the equivalent written-out words
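As a minimal sketch of the pre-processing step described on this slide, the snippet below expands a few abbreviations and spells out small integers. The abbreviation table, the English number names, and the function names `number_to_words` and `normalize` are illustrative assumptions, not part of the presented system; a Bangla front end would need its own lexicon and number grammar.

```python
# Hypothetical abbreviation table; a real system needs a far larger,
# language-specific lexicon.
ABBREVIATIONS = {"Dr.": "Doctor", "Dept.": "Department", "Engg.": "Engineering"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Replace abbreviations and small integers with written-out words."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdigit() and int(token) <= 999:
            out.append(number_to_words(int(token)))
        else:
            out.append(token)  # leave everything else untouched
    return " ".join(out)
```

Whitespace tokenization is the simplest possible choice here; dates, currency, and punctuation attached to tokens would need dedicated rules.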
  9. Word to Diphone Converter (Phonetization) • Purpose: – Translate words to their diphone representations (Ex. রাজা → diphones: {র + রআ + আজ + জআ}) – Mark the text into prosodic units such as phrases, clauses, and sentences • Resource: – Dictionary of words and their diphones
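The dictionary-based conversion on this slide can be sketched as a lookup followed by pairing adjacent phonemes. The one-entry `LEXICON`, the `#` pause marker, and the function names are assumptions made for illustration; the slide's own example writes the word-edge units without an explicit pause symbol.

```python
PAUSE = "#"  # hypothetical marker for the silence at a word boundary

# Hypothetical pronunciation dictionary: word -> phoneme sequence.
LEXICON = {
    "রাজা": ["র", "আ", "জ", "আ"],
}


def phonemes_to_diphones(phonemes):
    """Pair each phoneme with its right-hand neighbour, padding with pauses."""
    padded = [PAUSE] + list(phonemes) + [PAUSE]
    return [left + right for left, right in zip(padded, padded[1:])]


def word_to_diphones(word):
    """Look the word up in the dictionary and return its diphone sequence."""
    return phonemes_to_diphones(LEXICON[word])
```

With pause padding, a word of n phonemes yields n + 1 diphones, and every unit boundary falls in the middle of a phoneme rather than at a transition.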
  10. Prosody [Block diagram labels: Diphone Retrieval, Concatenation, Acoustic Manipulation, Diphone Database, Prosody Parameters]
  11. Properties of Speech [Figure: waveform of cat.wav with segments labeled Periodic and Non-Periodic]
  12. Altering Pitch/Duration/Amplitude • For smooth concatenation, it is very important to alter the pitch, duration, and amplitude at the concatenation point.
  13. Altering Pitch [Figure: a pitch period is extracted from the original diphone; extracted pitch period × Hanning window = Hanned pitch period]
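The windowing step on this slide (extracted pitch period × Hanning window) can be sketched in plain Python as follows. The function names are assumptions, and the symmetric Hann formula used here is one common definition rather than necessarily the one used in the presented system.

```python
import math


def hann_window(length):
    """Symmetric Hann (Hanning) window of the given length."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_hann(segment):
    """Multiply an extracted pitch-period segment by a Hann window."""
    window = hann_window(len(segment))
    return [sample * weight for sample, weight in zip(segment, window)]
```

The taper drives both ends of the segment to zero, so overlapping and adding neighbouring segments does not introduce clicks at the seams.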
  14. PSOLA – Pitch Synchronous Overlap and Add [Figure: Hanned pitch periods overlapped and added] • Overlap = 50%: overlap + add • Pitch up: overlap > 50% • Pitch down: overlap < 50%
  15. Altering Duration • Increase the number of PSOLA iterations (overlaps) to increase duration • Decrease the number of PSOLA iterations (overlaps) to decrease duration
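The overlap rules for pitch and the iteration rule for duration can be illustrated with a simplified overlap-add sketch. It assumes equal-length, already windowed grains and ignores pitch-mark detection entirely; `hop_for_overlap`, `overlap_add`, and `change_duration` are hypothetical helper names, not functions from the presented system.

```python
def hop_for_overlap(grain_length, overlap):
    """Spacing between grain starts for a given overlap fraction (0..1)."""
    return max(1, round(grain_length * (1.0 - overlap)))


def overlap_add(grains, hop):
    """Sum equal-length grains placed `hop` samples apart."""
    length = hop * (len(grains) - 1) + len(grains[0])
    out = [0.0] * length
    for i, grain in enumerate(grains):
        for j, value in enumerate(grain):
            out[i * hop + j] += value
    return out


def change_duration(grains, factor):
    """Repeat (factor > 1) or drop (factor < 1) grains to stretch or shrink."""
    count = max(1, round(len(grains) * factor))
    last = len(grains) - 1
    return [grains[min(int(i / factor), last)] for i in range(count)]
```

More overlap places the grains closer together, which raises the pitch but also shortens the output; repeating grains with `change_duration` before the overlap-add step is one way to restore the original length.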
  16. Altering Amplitude • Multiply the signal by a constant – If constant > 1, the amplitude increases – If constant < 1, the amplitude decreases
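Amplitude scaling is a single multiplication per sample, as the sketch below shows. The optional clipping to the range [-1, 1] is an added assumption for signals stored as normalized floats and is not mentioned on the slide.

```python
def scale_amplitude(signal, gain, clip=False):
    """Multiply every sample by `gain`; optionally clip to [-1.0, 1.0]."""
    scaled = [sample * gain for sample in signal]
    if clip:
        # Assumption: samples are normalized floats, so limit the range.
        scaled = [max(-1.0, min(1.0, sample)) for sample in scaled]
    return scaled
```

A gain above 1 makes the segment louder and a gain below 1 makes it quieter, which allows the levels of two neighbouring units to be matched at the join.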
  17. Concatenation. Diphones → Word • Using PSOLA at the joining ends • Ensures smooth transition. Words → Sentence • Straight joining at the end points, due to the presence of pauses
  18. Putting All Together [Block diagram labels: TTS System, Text Pre-processing, Prosody, Concatenation, words]
  19. Types of Concatenative Speech Synthesis • Concatenative synthesis with a fixed inventory – contains one sample for each unit and performs prosodic modification to match the required prosody • Unit-selection-based synthesis – stores several instances of each unit, thus improving the chances of finding a well-matched unit
  20. Progress of Bangla TTS • KATHA – Developed at BRAC University – Unit-based system using the Festival framework – 4355 diphones – Takes 2 sec to generate a 10 sec utterance • BANGLA VAANI – Syllable-based synthesis system – Developed in Kolkata • SUBACHAN – Developed by a team at SUST – Diphone-based synthesis system – 527 diphones – Takes 45 ms to generate a 10 sec utterance
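The timing figures quoted for KATHA and SUBACHAN can be compared as real-time factors (synthesis time divided by audio duration). This is a small arithmetic sketch using only the numbers on the slide; the function and variable names are illustrative.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Time spent synthesizing per second of generated audio."""
    return synthesis_seconds / audio_seconds


# Figures taken from the slide: both systems generate a 10 sec utterance.
katha_rtf = real_time_factor(2.0, 10.0)
subachan_rtf = real_time_factor(0.045, 10.0)

# How many times faster SUBACHAN is than KATHA on these figures.
speedup = katha_rtf / subachan_rtf
```

On these figures SUBACHAN is roughly 44 times faster, with an inventory of 527 diphones against 4355.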
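  21. Speech Signal From Katha and Subachan • (Voice of Katha) তিনি প্রধানত কবি হলেও বেশ কিছু প্রবন্ধ-নিবন্ধ রচনা ও প্রকাশ করেছেন • (Voice of Subachan) তিনি প্রধানত কবি হলেও বেশ কিছু প্রবন্ধ-নিবন্ধ রচনা ও প্রকাশ করেছেন [English: "Although he was primarily a poet, he also wrote and published a number of essays and articles"] • (Voice of Katha) জীবনানন্দ দাশ বিংশ শতাব্দীর অন্যতম প্রধান আধুনিক বাংলা কবি • (Voice of Subachan) জীবনানন্দ দাশ বিংশ শতাব্দীর অন্যতম প্রধান আধুনিক বাংলা কবি [English: "Jibanananda Das was one of the foremost modern Bengali poets of the twentieth century"]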
  22. Problems: Homograph Ambiguity • Homographs are words that share the same spelling but differ in meaning and pronunciation
  23. Solution: Homograph Disambiguation • Collect all possible homograph words • Determine the POS tag of the homograph words – Ex. ছেলেরা মাঠে বল (bol) খেলছে। ("The boys are playing ball in the field.") তুমি যাবে কি না বল (bolo)। ("Say whether or not you will go.") • Bayes' theorem can also be applied to determine the most likely reading of a word from its context.
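Both ideas on this slide can be sketched for the homograph বল, read as "bol" (noun) or "bolo" (verb). The POS table, the toy training contexts, and all function names are hypothetical; the POS tag is assumed to come from an external tagger, and the Bayes variant shown is a naive Bayes classifier with add-one smoothing, which is one possible way to apply Bayes' theorem here.

```python
import math
from collections import Counter

# Hypothetical table: (word, POS tag) -> pronunciation.
POS_READINGS = {("বল", "NOUN"): "bol", ("বল", "VERB"): "bolo"}

# Hypothetical toy training data: context words seen with each reading.
TRAINING = {
    "bol": [["ছেলেরা", "মাঠে", "খেলছে"], ["মাঠে", "খেলা"]],
    "bolo": [["তুমি", "যাবে", "কি", "না"], ["তুমি", "কি"]],
}


def reading_from_pos(word, pos_tag):
    """Pick the pronunciation once a tagger has supplied the POS tag."""
    return POS_READINGS[(word, pos_tag)]


def train(examples):
    """Estimate priors and per-reading context word counts."""
    total = sum(len(contexts) for contexts in examples.values())
    priors, counts, vocab = {}, {}, set()
    for reading, contexts in examples.items():
        priors[reading] = len(contexts) / total
        counts[reading] = Counter(w for ctx in contexts for w in ctx)
        vocab.update(counts[reading])
    return priors, counts, vocab


def classify(context, model):
    """Return the reading with the highest naive Bayes log-probability."""
    priors, counts, vocab = model
    best, best_score = None, -math.inf
    for reading, prior in priors.items():
        denom = sum(counts[reading].values()) + len(vocab)
        score = math.log(prior)
        for word in context:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[reading][word] + 1) / denom)
        if score > best_score:
            best, best_score = reading, score
    return best


MODEL = train(TRAINING)
```

The POS lookup is deterministic but depends on tagger accuracy, while the Bayes variant needs labelled example sentences for each homograph.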
  24. Problems: Improper Concatenation [Figure: signal from the utterance of রাশেদ (Rashed), with a region marked "Not concatenated properly"]
  25. Solution: Improper Concatenation • PSOLA • Reduce the number of concatenation points – Ex 1. Sentence → কামাল ভাল ছেলে। ("Kamal is a good boy.") Units → কা + আমা + আল, ভা + আলো, ছে + এলে, instead of ক + কআ + আম + মআ + আল + ল … – Ex 2 (ফলা): পৃথিবী → পৃ + ইথি + ইবী • Vowel sounds are periodic, and thus suitable points for concatenation • Use the 1000 most frequently spoken words
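One way to read the examples on this slide is that units are cut only inside vowels, so every join lands in a periodic region and a word needs fewer joins than with plain diphones. The sketch below follows that reading; the partial `VOWELS` set, the phoneme spellings, and the function names are assumptions for illustration.

```python
# Partial, illustrative vowel inventory (an assumption, not a full list).
VOWELS = {"অ", "আ", "ই", "ঈ", "উ", "ঊ", "এ", "ও"}


def vowel_joined_units(phonemes):
    """Split a phoneme sequence so that adjacent units share a vowel."""
    units, current = [], []
    for phoneme in phonemes:
        current.append(phoneme)
        if phoneme in VOWELS and len(current) > 1:
            units.append(current)
            current = [phoneme]  # the next unit starts in the same vowel
    # Keep a trailing consonant tail, or a lone phoneme if nothing was emitted.
    if current and (len(current) > 1 or not units):
        units.append(current)
    return units


def join_count(units):
    """Number of concatenation points needed to assemble the units."""
    return max(0, len(units) - 1)
```

For the phoneme sequence ক আ ম আ ল this yields three units and two joins, compared with six units and five joins for the diphone sequence written on the slide.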
  26. Duration Modeling
  27. Duration Modeling (cont’d)
  28. Thank you all! Suggestions?
  29. Sound Synthesized by Katha (audio sample)
  30. Sound Synthesized by Subachan (audio sample)
