Gujarati Text-to-Speech Presentation


A presentation on the development of a text-to-speech system for Gujarati. The input is arbitrary Gujarati Unicode text and the output is the equivalent speech sound.

Published in: Technology

Gujarati Text-to-Speech Presentation

  1. Text-to-Speech System for Gujarati. Project presentation by Samyak Bhuta.
  2. Project Profile. Objective: developing a Text-to-Speech system for Gujarati.
  3. Project Profile. Under the guidance of Prof. Ram Mohan and Shri Jignesh Dholakia.
  4. Project Profile. At the Resource Centre for Indian Language Technology Solutions in Gujarati, Faculty of Arts, The M. S. University of Baroda, Baroda.
  5. Next 25 minutes: Sound and Speech Sound; ABC of TTS Systems; Pilot Project; GTTS from scratch; Speech, Syllable and Partneme; Speech Sounds in detail; Core Engine; Language Dependent Components.
  6. Sound: a flow of air. [Diagram: air flows from the Source to the Ear as Sound]
  7. What makes sounds different? The factors responsible for the perceptual difference between one kind of sound and another are: amplitude (or volume), which tells how much power the air-flow holds; and frequency (or pitch), which tells at what rate the air-flow repeats itself.
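
The two factors named on this slide can be illustrated with a pure tone. The following is a minimal sketch, not part of the original presentation: the function names, the 8000 Hz sample rate and the chosen amplitudes and frequencies are all assumptions made for illustration.

```python
import math


def sine_wave(amplitude, frequency_hz, duration_s, sample_rate=8000):
    """Return samples of a pure tone (sample_rate is an assumed value)."""
    n = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * t / sample_rate)
        for t in range(n)
    ]


def sign_changes(samples):
    """Count how often the wave crosses zero; more crossings = higher pitch."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)


# Same duration, different amplitude (volume) and frequency (pitch).
quiet_low = sine_wave(amplitude=0.2, frequency_hz=220.0, duration_s=0.01)
loud_high = sine_wave(amplitude=0.8, frequency_hz=440.0, duration_s=0.01)

print(max(abs(s) for s in quiet_low), max(abs(s) for s in loud_high))
print(sign_changes(quiet_low), sign_changes(loud_high))
```

The louder tone has the larger peak value, and the higher-pitched tone crosses zero more often within the same duration.
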
  8. The "source" doesn't matter. An air-flow of kind A will sound the same whether it was generated by source X or by source Y.
  9. Speech Sound. A kind of sound whose source is the human vocal organs and which finds its place in human speech, e.g. ક્, સ્, અ, ઈ. A standard called the International Phonetic Alphabet (IPA) is used to depict such sounds.
  10. IPA. The IPA covers almost all the speech sounds of all the languages in the world. Speech sounds are more formally known as phones. The IPA uses a set of symbols to represent them, e.g. k, s, ə, i, ʤ. IPA Chart …
  11. IPA Chart
  12. Synthesized Speech Sound. If we can produce the same pattern of air-flow that the human vocal organs produce for a speech sound, we can say that we have synthesized that speech sound.
  13. Speech Synthesizer. A mechanism capable of producing synthesized speech sound in a controlled manner.
  14. Text-to-Speech Systems. A speech synthesizer that is smart enough to produce the equivalent speech output for a given text. The smartness lies in making the output as natural and intelligible as possible.
  15. Text-to-Speech Systems. Usually, a TTS system is specific to one human language and takes input text only in that language.
  16. Basic structure of TTS Systems. The function of any TTS system is generally divided into three subtasks or phases: I. Preprocessing; II. Phonetic-Prosodic Translation; III. Speech Production. The input text travels through these phases one by one and eventually ends up as speech.
  17. Preprocessing. The text "Dr. Ajay Shah will come to clinic on 23 ,Jan." is read as "DOCTOR Ajay Shah will come to clinic on TWENTY THIRD OF JANUARY". Preprocessing converts the input text from its raw form into pronounceable words.
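
The preprocessing step on this slide can be sketched as a small text normalizer. This is an illustrative sketch only, using the slide's English example: the tables `ABBREVIATIONS` and `MONTHS`, the date pattern and the function names are assumptions, and a Gujarati preprocessor would need its own tables and rules.

```python
import re

# Hypothetical lookup tables, deliberately tiny.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
MONTHS = {"Jan": "January", "Feb": "February", "Mar": "March"}

ORDINALS = [
    "", "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth",
]

# Matches dates written like "23 ,Jan." or "23, Jan."
DATE_RE = re.compile(r"\b(\d{1,2})\s*,\s*(" + "|".join(MONTHS) + r")\.")


def day_ordinal(day):
    """Spell out a day of the month (1..31) as an ordinal."""
    if day < 20:
        return ORDINALS[day]
    if day == 20:
        return "twentieth"
    if day == 30:
        return "thirtieth"
    tens = "twenty" if day < 30 else "thirty"
    return f"{tens} {ORDINALS[day % 10]}"


def preprocess(text):
    """Expand abbreviations and dates into pronounceable words."""
    for abbreviation, full in ABBREVIATIONS.items():
        text = text.replace(abbreviation, full)

    def expand_date(match):
        day = int(match.group(1))
        if not 1 <= day <= 31:
            return match.group(0)  # not a plausible day: leave untouched
        return f"{day_ordinal(day)} of {MONTHS[match.group(2)]}"

    return DATE_RE.sub(expand_date, text)


print(preprocess("Dr. Ajay Shah will come to clinic on 23 ,Jan."))
```
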
  18. Phonetic-Prosodic Translation. This phase can be logically divided into two phases: Phonetic Translation and Prosodic Translation. Real TTS systems may implement these phases separately or as a unit, but together they provide the data for the next phase of the TTS system.
  19. Phonetic Translation. In human languages, the script in use doesn't necessarily possess a one-to-one mapping with speech, e.g. enough is pronounced as INAF / inəf (IPA); છોકરો is pronounced as છોક્રો / ʧokɾo (IPA).
  20. Phonetic Translation. Phonetic translation tells the next phase exactly which speech sounds (phones) are to be produced for the given text. Phonetic translation is also known as letter-to-sound rules.
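
A naive version of such letter-to-sound rules might look like the sketch below. It is not the project's transcriptor: the character tables cover only a handful of letters, the rule "every consonant carries an inherent ə unless a vowel sign or virama follows or the word ends" is a simplifying assumption, and the name `to_phones` is hypothetical.

```python
# Partial, hypothetical tables: just enough letters for the examples below.
CONSONANTS = {
    "ક": "k", "ગ": "g", "છ": "ʧ",  # છ transcribed as on the slide
    "ણ": "ɳ", "ત": "t", "ન": "n", "પ": "p", "ભ": "bʰ",
    "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ", "સ": "s",
}
VOWEL_SIGNS = {
    "\u0abe": "a",  # GUJARATI VOWEL SIGN AA
    "\u0ac0": "i",  # GUJARATI VOWEL SIGN II
    "\u0acb": "o",  # GUJARATI VOWEL SIGN O
}
VIRAMA = "\u0acd"  # suppresses the inherent vowel
SCHWA = "ə"


def to_phones(word):
    """Naively convert a Gujarati word to a list of IPA phones."""
    phones = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            # Inherent schwa unless the word ends here or a sign overrides it.
            if nxt is not None and nxt != VIRAMA and nxt not in VOWEL_SIGNS:
                phones.append(SCHWA)
        elif ch in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[ch])
        elif ch != VIRAMA:
            raise ValueError(f"no rule for character {ch!r}")
    return phones


for word in ["રામ", "પત્ર", "લશ્કર", "છોકરો"]:
    print(word, "".join(to_phones(word)))
```

For છોકરો the naive rule keeps the schwa after ક and yields ʧokəɾo rather than the ʧokɾo given on the slide, which is exactly the kind of mismatch between script and speech that real letter-to-sound rules have to handle.
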
  21. Prosodic Translation. Letter-to-sound rules only tell which kind of speech sound to generate. To convey the emotions and expressions residing in the input text, prosody needs to be applied. By prosody we mean amplitude + pitch + duration.
  22. Speech Production. This phase is responsible for the actual speech output. It uses the phonetic and prosodic information provided by the previous phase. Various approaches exist for producing speech.
  23. Different ways for Speech Production. Three widely used approaches for speech production are Articulatory Synthesis, Source-Filter Synthesis and Concatenative Synthesis. The speech production part of a TTS system is generally referred to as the speech engine.
  24. Use cases. Having understood the structure of TTS systems, we realized that all three phases are required in order to develop a complete TTS system for Gujarati. At the topmost level of abstraction, a single use case can be conceived: fulfilling the requirement of having a TTS system for Gujarati.
  25. Use cases. The topmost use case can then be divided into three further use cases, one for each of the three phases. During the project we tried to realize each use case one by one.
  26. Pilot Project. As we approached the various requirements and use cases to be realized, we found that developing a Preprocessor is less significant than developing the other two phases, so we decided to develop it later. We decided to develop the Phonetic-Prosodic Translation phase first, as it can easily be plugged into any already built …
  27. Pilot Project. … speech engine that takes its input in IPA. FreeTTS, IBMJS, Dhvani and Narad were studied. We used the Java Speech API with IBMJS as the speech engine. The input to the engine was provided through the Java Speech Markup Language (JSML).
  28. Pilot Project: Objective. To develop a TTS system using an already available speech engine, supplying the engine with the IPA transcription (equivalent) of the target Gujarati Unicode text.
  29. Pilot Project: S/W Requirement. A speech engine component which takes IPA and speaks it out.
  30. Pilot Project: Design. A number of use cases were conceived and their implementations were provided as different Java classes.
  31. Pilot Project: Conclusion. We cannot continue developing a TTS system with an "outsider" speech engine, as the accent and other characteristics need to be Gujarati in nature.
  32. Starting GTTS from Scratch. From the result of the Pilot Project we concluded that the speech engine must be developed with Gujarati in mind. The concatenative approach was chosen since it provides naturalness and has a proven track record.
  33. Concatenation. In the concatenative approach, already stored segments of sound are joined together to produce the complete speech. Such segments are known as concatenation units. We used partnemes as our concatenation unit.
  34. Partnemes. A partneme is a very small segment of sound whose typical length ranges from 8 ms to 100 ms. We get partnemes by cutting recorded speech. But before understanding what a partneme is, we have to understand human speech in greater detail, especially the relation between speech and syllable.
  35. How we speak? During normal breathing, the period we devote to breathing in is longer than the period of breathing out in a complete breath cycle. But when we start speaking, the breath-in period becomes shorter, paving the way for a longer breath-out period. This is because speaking anything requires some air-flow. We use the air-flow …
  36. How we speak? Human Vocal Tract. … powered by the lungs during breath-out. This air-flow is modified at various points of the human vocal tract, ending up as one or another kind of speech sound (phone). The human vocal tract comprises various organs which, in one way or another, change the air-flow. Human Vocal Tract …
  37. Human Vocal Tract
  38. (no text)
  39. How we speak? Syllable and Speech. During one complete breath cycle we can speak out more than one phone. All the phones spoken out in just one breath cycle constitute a syllable. A sequence of such syllables in continuity forms speech.
  40. How we speak? Syllable Structure. It is important to know the structure of the syllable in order to understand partnemes. Typically a syllable is made up of a vowel as the nucleus with consonants around it. Gujarati employs the following syllable structure: < C + C + C + V + V̯ + C + C >
  41. How we speak? Syllable Structure. < C + C + C + V + V̯ + C + C >, where C is a consonant, V is a vowel and V̯ is a non-syllabic vowel. An utterance (spoken word) is made up of a series of such syllables.
  42. How we speak? Syllable Structure. રામ - ɾam is made up of a single syllable; the structure becomes < ɾ(C) + a(V) + m(C) >. પત્ર - pətɾ is also made up of a single syllable; the structure becomes < p(C) + ə(V) + t(C) + ɾ(C) >. લશ્કર - ləʃkəɾ is made up of two syllables; the structure becomes < l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >.
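
The syllable template can be checked mechanically. The sketch below assumes that every slot except the nucleus V is optional (the slides' examples CVC and CVCC suggest this, but the presentation does not state it), uses a small hypothetical vowel set, and marks a non-syllabic vowel by the combining inverted breve below (U+032F).

```python
import re

VOWELS = {"a", "ə", "i", "o", "u", "e"}  # assumed, incomplete vowel set
NON_SYLLABIC = "\u032f"                  # the mark under V̯

# < C + C + C + V + V̯ + C + C > with only V mandatory (an assumption);
# lowercase "v" stands for the non-syllabic vowel slot.
SYLLABLE_RE = re.compile(r"C{0,3}Vv?C{0,2}")


def cv_pattern(phones):
    """Map a list of phones to a string over C, V and v."""
    pattern = ""
    for phone in phones:
        if phone.endswith(NON_SYLLABIC):
            pattern += "v"
        elif phone in VOWELS:
            pattern += "V"
        else:
            pattern += "C"
    return pattern


def is_valid_syllable(phones):
    """True if the phones fit the template as a single syllable."""
    return SYLLABLE_RE.fullmatch(cv_pattern(phones)) is not None


print(cv_pattern(["ɾ", "a", "m"]), is_valid_syllable(["ɾ", "a", "m"]))
print(is_valid_syllable(["l", "ə", "ʃ", "k", "ə", "ɾ"]))  # two syllables, not one
```
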
  43. How we speak? Consonants and Vowels. Consonants and vowels are two different kinds of speech sound with different acoustic parameters. To know the exact difference between them, we have to understand how a single vocal tract is capable of producing so many different sounds.
  44. How we speak? Articulation. Modification of the air-flow is achieved by articulation of the various speech organs of the vocal tract. The exact nature of the speech sound produced during breath-out is determined by: 1. Place of Articulation; 2. Manner of Articulation.
  45. How we speak? Place of articulation. Place of articulation refers to the exact point in the human vocal tract where the articulation happens, e.g. [p] - two lips; [k] - back of tongue with velum; [ɾ] - tip of tongue with alveolar ridge.
  46. How we speak? Manner of articulation. Manner of articulation refers to the degree of constriction made during the articulation, e.g. [p] - stop or plosive; [ʧ] - affricate; [ɾ] - tap; [j] - glide; [o] - vowel (no constriction).
  47. How we speak? Voicedness. If the vocal cords are vibrating (and thus changing the air-flow) while the air-flow travels from the glottis, we get a voiced sound, e.g. [g] - voiced; [k] - unvoiced.
  48. How we speak? Aspiration. Aspiration refers to the state of the vocal cords during the final stage of speaking out a phone. When we speak out an aspirated phone, the vocal cords approach the vibrating state as time goes on (irrespective of the phone's voicedness), e.g. [kʰ] - aspirated; [k] - unaspirated.
  49. Segmentation and Partneme. Segmentation into partnemes is achieved by cutting up the recorded syllable. Shown is the sound waveform for ગમન built from partnemes; red lines mark the separations.
  50. Partnemes. As shown, a syllable is logically divided into: null sound to consonant transition; core consonant; consonant to vowel transition; core vowel; vowel to consonant transition; core consonant; consonant to null sound transition.
  51. Partnemes. If we can provide the partnemes for each vowel and consonant, we can join them accordingly to produce any complete syllable and hence any utterance, e.g. કરણ - kəɾəɳ: 0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
  52. ભારત - bʰaɾət: 0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
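
The two examples follow a regular pattern: a transition from the null sound, then each core phone followed by the transition to its successor, ending with a transition back to the null sound. A minimal sketch of that expansion is shown below; the function name and the use of "0" for the null sound mirror the slide's notation, but the code itself is an illustration and not the project's implementation.

```python
NULL = "0"  # the null sound, written 0 on the slides


def partneme_sequence(phones):
    """Expand a list of phones into the partneme names to concatenate."""
    phones = list(phones)
    if not phones:
        return []
    units = [f"{NULL}_{phones[0]}"]          # null-to-first-phone transition
    for current, following in zip(phones, phones[1:] + [NULL]):
        units.append(current)                 # core of the phone
        units.append(f"{current}_{following}")  # transition to the next sound
    return units


print(";".join(partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])))
print(";".join(partneme_sequence(["bʰ", "a", "ɾ", "ə", "t"])))
```

Phones are passed as a list rather than a string so that a multi-character phone such as bʰ is treated as one unit.
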
  53. Core Engine. The speech engine we developed concatenates such a partneme sequence, based on the given IPA, using a pair of files. One, called the Voice File, contains the audio data of all the partnemes. The other, called the Voice Info File, serves as an index to the Voice File: it contains the position and length of each partneme in the Voice File.
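
The relationship between the two files can be sketched in memory as one byte string plus an index of (offset, length) pairs. The actual on-disk formats of the Voice File and Voice Info File are not described in the presentation, so the structures, function names and dummy audio bytes below are assumptions for illustration only.

```python
def build_voice(partneme_audio):
    """Pack {partneme name: audio bytes} into (voice_data, voice_info).

    voice_data plays the role of the Voice File; voice_info plays the role
    of the Voice Info File, mapping each name to (offset, length).
    """
    voice_data = bytearray()
    voice_info = {}
    for name, audio in partneme_audio.items():
        voice_info[name] = (len(voice_data), len(audio))
        voice_data.extend(audio)
    return bytes(voice_data), voice_info


def synthesize(units, voice_data, voice_info):
    """Concatenate the audio of the named partnemes in order."""
    output = bytearray()
    for unit in units:
        offset, length = voice_info[unit]
        output.extend(voice_data[offset:offset + length])
    return bytes(output)


# Dummy audio bytes standing in for recorded partnemes.
voice_data, voice_info = build_voice(
    {"0_k": b"\x01\x02", "k": b"\x03", "k_0": b"\x04\x05\x06"}
)
print(voice_info)
print(synthesize(["0_k", "k", "k_0"], voice_data, voice_info))
```

Keeping the index separate from the audio means a partneme can be located by a single lookup and slice, without scanning the audio data.
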
  54. Core Engine. The Core Engine realizes the use case of having a speech engine.
  55. Language Dependent Components. Since the Core Engine only understands IPA sequences, we have to provide a component which translates Gujarati text into an IPA sequence. Preprocessing capabilities also need to be developed for a complete TTS system. Unlike the Core Engine, both of these components are specific to a particular language and …
  56. Language Dependent Components. … are therefore kept aside as language dependent components. Preprocessor: as preprocessing should be highly customizable by the end user, we have provided a text file which can be edited to control the functionality of the preprocessor.
  57. IPATranscriptor: this component currently provides only phonetic translation of the given Gujarati text, as complete rules for prosodic translation are not available.
  58. Thanks: Prof. Bhartiben Modi, Mr. Ajay Sarvaiya, Mr. Irshad Shaikh, Mr. Mihir Trivedi.
  59. Sloka. Having grasped the meanings through the intellect, the soul joins the mind with the desire to speak. The mind kindles the bodily fire, and that (bodily fire) impels the vital breath. That impelled air, striking against the head, reaching the mouth and passing through the respective places, brings forth the five kinds of speech sounds through the contribution of accent, duration, place, and external and internal effort. - Pāṇinīya Śikṣā, tenth chapter, kārikā 6, 9.