Text-to-Speech System for Gujarati
Project Presentation by Samyak Bhuta
* PROJECT PROFILE *
Objective: Developing a Text-to-Speech System for Gujarati
* PROJECT PROFILE *
Under the guidance of
Prof. Ram Mohan
Shri Jignesh Dholakia
* PROJECT PROFILE *
At the Resource Centre for Indian Language Technology Solutions in Gujarati, Faculty of Arts, The M. S. University of Baroda, BARODA.
Next 25 minutes …
> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components
Sound : a flow of air
[Figure: air flows from the Source to the Ear, where it is heard as Sound ♫♪♫]
What makes sounds different ?
The factors responsible for the perceptual difference between one kind of sound and another are
• Amplitude (or volume), which tells how much power the air-flow holds
• Frequency (or pitch), which tells at what rate the air-flow repeats itself
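The two factors can be illustrated with a pure tone. This is a minimal sketch, not part of the project: the function names and the 8000 Hz sample rate are assumptions chosen for illustration. Amplitude scales the height of the wave, while frequency sets how many times per second it repeats.

```python
import math

def sine_wave(amplitude, frequency_hz, duration_s, sample_rate=8000):
    """Return the samples of a pure tone as a list of floats."""
    n_samples = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]

def count_cycles(samples):
    """Count upward zero crossings, roughly one per repetition of the wave."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

# Two sounds that differ in both volume and pitch.
quiet_low = sine_wave(amplitude=0.2, frequency_hz=100, duration_s=1.0)
loud_high = sine_wave(amplitude=0.8, frequency_hz=400, duration_s=1.0)

print(max(quiet_low), count_cycles(quiet_low))
print(max(loud_high), count_cycles(loud_high))
```

The second tone peaks four times higher and repeats about four times as often in the same second, which is what a listener perceives as louder and higher pitched.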
The “Source” doesn’t matter
An air-flow of kind A will sound the same whether it is generated by source X or source Y.
Speech Sound
A kind of sound whose source is the human vocal organs and which finds its place in human speech.
e.g. ક્ , સ્ , અ , ઈ
A standard called the International Phonetic Alphabet (IPA) is used to depict such sounds.
IPA
IPA comprises almost all the speech sounds of all languages in the world.
Speech sounds are more formally known as phones.
IPA uses a set of symbols to represent them.
e.g. k , s , ə , i , ʤ
IPA Chart …
Synthesized Speech Sound
If we can produce the same pattern of air-flow that the human vocal organs produce for a speech sound, we can say that we have synthesized that speech sound.
Speech Synthesizer
A mechanism which is capable of producing synthesized speech sounds in a controlled manner.
Text-to-Speech Systems
A speech synthesizer which is smart enough to produce speech output equivalent to the given text.
The smartness lies in making the output as natural and intelligible as possible.
Text-to-Speech Systems
Usually, TTS systems are specific to one human language and take input text only from that language.
Basic structure of TTS Systems
The function of any TTS system is generally divided into three subtasks or phases.
I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production
The text input travels through these phases one by one and eventually ends up as speech.
Preprocessing
“Dr. Ajay Shah will come to clinic on 23 , Jan.”
We read it as
“DOCTOR Ajay Shah will come to clinic on TWENTY THIRD OF JANUARY”.
Preprocessing converts the input text from its raw form into pronounceable words.
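The conversion in this example can be sketched as follows. This is a minimal illustration only, not the project's preprocessor: the abbreviation table, the "day , Month." date pattern and all function names are assumptions, and only days 1 to 31 are handled.

```python
import re

# Hypothetical abbreviation table; a real preprocessor would need many more entries.
ABBREVIATIONS = {"Dr.": "Doctor", "Jan.": "January"}

ORDINALS_1_TO_19 = [
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth",
]

def day_ordinal(day):
    """Spell out a day of the month (1 to 31) as ordinal words."""
    if 1 <= day <= 19:
        return ORDINALS_1_TO_19[day - 1]
    if day == 20:
        return "twentieth"
    if 21 <= day <= 29:
        return "twenty " + ORDINALS_1_TO_19[day - 21]
    if day == 30:
        return "thirtieth"
    if day == 31:
        return "thirty first"
    raise ValueError(f"not a day of the month: {day}")

def expand_day(match):
    """Replace 'NN ,' with 'ordinal of'; leave out-of-range numbers untouched."""
    day = int(match.group(1))
    if not 1 <= day <= 31:
        return match.group(0)
    return day_ordinal(day) + " of "

def preprocess(text):
    # A number followed by a comma and an abbreviated capitalised word is
    # treated as a date, e.g. "23 , Jan." (an assumption of this sketch).
    text = re.sub(r"\b(\d{1,2})\s*,\s*(?=[A-Z][a-z]+\.)", expand_day, text)
    # Expand the known abbreviations.
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return text

print(preprocess("Dr. Ajay Shah will come to clinic on 23 , Jan."))
```

The date rule runs before abbreviation expansion because it relies on the trailing period of the month abbreviation to recognise a date.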
Phonetic-Prosodic Translation
This phase can be logically divided into two different phases,
• Phonetic Translation
• Prosodic Translation
Real TTS systems may implement these phases separately or as a unit, but together they provide the data for the next phase of TTS.
Phonetic Translation
In human languages, the script in use does not necessarily have a one-to-one mapping with speech.
e.g.
enough is pronounced as INAF / inəf (IPA)
છોકરો is pronounced as છોક્રો / ʧokɾo (IPA)
Phonetic Translation
Phonetic translation tells the next phase exactly which speech sounds (phones) are to be produced for the given text.
Phonetic translation is also referred to as letter-to-sound rules.
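A very small set of letter-to-sound rules can be sketched as below. This is an illustrative sketch under stated assumptions, not the project's rule set: it covers only a handful of consonants and one vowel sign, gives every bare consonant the inherent vowel ə, honours the virama, and drops only a word-final inherent ə. It does not handle word-internal deletion such as the one in છોકરો.

```python
# Hypothetical, deliberately tiny mapping tables.
CONSONANTS = {
    "ક": "k", "ગ": "g", "ત": "t", "ન": "n", "પ": "p",
    "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ", "ણ": "ɳ",
}
VOWEL_SIGNS = {"\u0abe": "a"}  # U+0ABE, the dependent vowel sign AA
VIRAMA = "\u0acd"              # U+0ACD, suppresses the inherent vowel
SCHWA = "ə"

def to_phones(word):
    """Translate a Gujarati word into a list of IPA phones (sketch only)."""
    phones = []
    ends_in_inherent_schwa = False
    i = 0
    while i < len(word):
        letter = word[i]
        if letter not in CONSONANTS:
            raise ValueError(f"character not covered by this sketch: {letter!r}")
        phones.append(CONSONANTS[letter])
        following = word[i + 1] if i + 1 < len(word) else ""
        if following == VIRAMA:
            # Consonant without a vowel, as in the first half of a conjunct.
            ends_in_inherent_schwa = False
            i += 2
        elif following in VOWEL_SIGNS:
            # An explicit vowel sign replaces the inherent vowel.
            phones.append(VOWEL_SIGNS[following])
            ends_in_inherent_schwa = False
            i += 2
        else:
            # A bare consonant carries the inherent vowel.
            phones.append(SCHWA)
            ends_in_inherent_schwa = True
            i += 1
    if ends_in_inherent_schwa:
        phones.pop()  # the word-final inherent vowel is not pronounced
    return phones

for word in ["રામ", "કરણ", "ગમન"]:
    print(word, "".join(to_phones(word)))
```

A table-driven design keeps the language-specific knowledge in data, so that rules can be extended without touching the traversal logic.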
Prosodic Translation
Letter-to-sound mapping only provides information about the kind of speech sound to be generated. To convey the emotions and expressions residing in the input text, prosody needs to be applied.
By prosody we mean
Amplitude + Pitch + Duration
Speech Production
This phase is responsible for the actual output of the speech.
It uses the phonetic and prosodic information provided by the previous phase.
Various approaches exist for the production of speech.
Different ways for Speech Production
Three widely used approaches for speech production are
• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis
The speech production part of a TTS system is generally referred to as the speech engine.
Use cases
Once we understood the structure of TTS systems, we realized that all three phases are required in order to develop a complete TTS system for Gujarati.
At the topmost level of abstraction, a single use case can be conceived for fulfilling the requirement of having a TTS system for Gujarati.
Use cases
The topmost use case can then be divided into three further use cases, each fulfilling the requirement of one of the three phases.
During the project we tried to realize each use case one by one.
Pilot Project
As we worked through the requirements and use cases to be realized, we found that developing a preprocessor is less significant than developing the other two phases, so we decided to develop it later on.
We decided to develop the Phonetic-Prosodic Translation phase first, as it can easily be plugged into any already built …
Pilot Project
… speech engine that takes its input in terms of IPA.
FreeTTS, IBMJS, Dhvani and Narad were studied.
We used the Java Speech API along with IBMJS as the speech engine.
The input to the engine was provided through the Java Speech Markup Language (JSML).
Pilot Project : Objective
To develop a TTS system using an already available speech engine, supplying the engine with the IPA transcription of the target Gujarati Unicode text.
Pilot Project : S/W Requirement
A speech engine component which takes IPA and speaks it out.
Pilot Project : Design
A number of use cases were conceived, and their implementations were provided as separate Java classes.
Pilot Project : Conclusion
We cannot continue developing a TTS system with an “outsider” speech engine, as the accent and other characteristics of the speech need to be Gujarati in nature.
Starting GTTS from Scratch
From the result of the Pilot Project we concluded that the speech engine has to be developed with Gujarati in mind.
The concatenative approach was chosen since it provides naturalness and has a proven track record.
Concatenation
In the concatenative approach, already stored segments of sound are joined together to produce the complete speech.
Such segments are known as concatenation units.
We used partnemes as our concatenation unit.
Partnemes
A partneme is a very small segment of sound whose typical length ranges from 8 ms to 100 ms. We get partnemes by cutting up recorded speech.
But before understanding what a partneme is, we have to understand human speech in greater detail, especially the relation between speech and syllable.
How we speak ?
During normal breathing, breathing in and breathing out take comparable shares of a complete breath cycle.
But when we start speaking, the breath-in period becomes much shorter, making way for a longer breath-out period.
This is because to speak out anything we need some air-flow. We use the air-flow …
How we speak ? : Human Vocal Tract
… powered by the lungs during breath-out.
This air-flow is modified at various points of the human vocal tract, ending up as one or another kind of speech sound (phone).
The human vocal tract comprises various organs which, in one way or another, change the air-flow.
Human Vocal Tract …
How we speak ? : Syllable and Speech
During one complete breath cycle we can speak out more than one phone.
All the phones spoken out in just one breath cycle constitute a syllable.
A continuous sequence of such syllables forms speech.
How we speak ? : Syllable Structure
It is important to know the structure of a syllable in order to understand partnemes.
Typically a syllable is made up of a vowel as its nucleus with consonants around it.
Gujarati employs the following syllable structure.
< C + C + C + V + V̯ + C + C >
How we speak ? : Syllable Structure
< C + C + C + V + V̯ + C + C >
where
C - consonant
V - vowel
V̯ - non-syllabic vowel
An utterance (spoken word) is made up of a series of such syllables.
How we speak ? : Syllable Structure
રામ - ɾam is made up of a single syllable. Here the structure becomes
< ɾ(C) + a(V) + m(C) >
પત્ર - pətɾ is also made up of a single syllable. Here the structure becomes
< p(C) + ə(V) + t(C) + ɾ(C) >
લશ્કર - ləʃkəɾ is made up of two syllables. Here the structure becomes
< l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >
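The syllable structure can be checked mechanically. A minimal sketch, assuming that every slot of the template except the nucleus V is optional (the single-syllable examples only fit under that reading), and writing the non-syllabic vowel V̯ as a lowercase `v`. The names are hypothetical.

```python
import re

# < C + C + C + V + V̯ + C + C > with optional slots: up to three onset
# consonants, one vowel, an optional non-syllabic vowel ("v"), and up to
# two coda consonants.
SYLLABLE_TEMPLATE = re.compile(r"C{0,3}Vv?C{0,2}")

def fits_template(phone_classes):
    """Return True if a sequence of phone classes forms one valid syllable."""
    return SYLLABLE_TEMPLATE.fullmatch("".join(phone_classes)) is not None

print(fits_template(["C", "V", "C"]))            # ɾam
print(fits_template(["C", "V", "C", "C"]))       # pətɾ
print(fits_template(["C", "C", "C", "C", "V"]))  # four onset consonants
```

The first two calls correspond to the single-syllable examples above and are accepted; the third has one onset consonant too many and is rejected.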
How we speak ? : Consonants and Vowels
Consonants and vowels are two different kinds of speech sound with different acoustic parameters.
To know the exact difference between consonants and vowels, we have to understand how a single vocal tract is capable of producing so many different sounds.
How we speak ? : Articulation
Modification of the air-flow is achieved by articulation of the various speech organs of the vocal tract.
The exact nature of the speech sound that comes out during breath-out is determined by
1. Place of Articulation
2. Manner of Articulation
How we speak ? : Place of articulation
Place of articulation refers to the exact point in the human vocal tract where the articulation takes place.
e.g.
[p] - two lips
[k] - back of tongue with velum
[ɾ] - tip of tongue with alveolar ridge
How we speak ? : Manner of articulation
Manner of articulation refers to the degree of constriction made during the articulation.
e.g.
[p] - stop or plosive
[ʧ] - affricate
[ɾ] - tap
[j] - glide
[o] - vowel (no constriction)
How we speak ? : Voicedness
If the vocal cords are vibrating (and thus changing the air-flow) while the air-flow passes through the glottis, we get a voiced sound.
e.g.
[g] - voiced
[k] - unvoiced
How we speak ? : Aspiration
Aspiration refers to the state of the vocal cords during the final stage of speaking out a phone. When we speak out an aspirated phone, the vocal cords approach the vibrating state only gradually (irrespective of the phone's voicedness).
e.g.
[kʰ] - aspirated
[k] - unaspirated
Segmentation and Partneme
Partnemes are obtained by segmenting a recorded syllable.
Shown is the sound waveform for ગમન, built with partnemes. Red lines mark the separation.
Partnemes
As shown, a syllable is logically divided into
• null sound to consonant transition
• core consonant
• consonant to vowel transition
• core vowel
• vowel to consonant transition
• core consonant
• consonant to null sound transition
Partnemes
If we can provide the partnemes for each vowel and consonant, we can join them accordingly to produce any complete syllable and hence any utterance.
e.g.
કરણ - kəɾəɳ
0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
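The partneme sequence in this example follows a regular pattern: a transition from null sound, then each core phone followed by the transition to its neighbour, and finally a transition to null sound. A minimal sketch of that expansion, with a hypothetical function name and `0` standing for null sound as in the example:

```python
def partneme_sequence(phones):
    """Expand a list of IPA phones into the names of the partnemes to join."""
    if not phones:
        return []
    units = ["0_" + phones[0]]  # null sound to first phone
    for current, following in zip(phones, phones[1:]):
        units.append(current)                    # core of the phone
        units.append(current + "_" + following)  # transition to the next phone
    units.append(phones[-1])         # core of the last phone
    units.append(phones[-1] + "_0")  # last phone to null sound
    return units

print(";".join(partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])))
# 0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
```

For n phones this yields 2n + 1 partnemes, so the number of stored units needed grows with the number of distinct phone pairs rather than with the vocabulary.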
Core Engine
The speech engine we developed to concatenate such a partneme sequence, based on the given IPA, uses a pair of files.
One, called the Voice File, contains the audio data of all the partnemes.
The other, called the Voice Info File, serves as an index into the Voice File. It contains the position and length of each partneme in the Voice File.
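The interplay of the two files can be sketched as follows. The actual file formats are not described here, so this sketch assumes a hypothetical Voice Info File with one `name offset length` entry per line and treats the Voice File as raw bytes held in memory.

```python
def parse_voice_info(text):
    """Parse lines of the assumed form 'name offset length' into a dict."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, offset, length = line.split()
        index[name] = (int(offset), int(length))
    return index

def concatenate(units, voice_data, index):
    """Join the audio of the named partnemes in the given order."""
    audio = bytearray()
    for name in units:
        offset, length = index[name]
        audio += voice_data[offset:offset + length]
    return bytes(audio)

# Toy data: three "partnemes" stored back to back in one voice file.
voice_data = b"AAABBCCCC"
voice_info = "0_k 0 3\nk 3 2\nk_0 5 4\n"

index = parse_voice_info(voice_info)
speech = concatenate(["0_k", "k", "k_0"], voice_data, index)
print(speech)
```

Keeping all audio in one file with a separate index means a partneme is fetched by a single slice, without opening one file per unit.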
Core Engine
The Core Engine realizes the use case for having a speech engine.
Language Dependent Components
Since the Core Engine only understands an IPA sequence, we have to provide a component which translates Gujarati text into an IPA sequence.
Preprocessing capabilities also need to be developed for a complete TTS system.
Unlike the Core Engine, both of these components are specific to a particular language and …
Language Dependent Components
… are therefore kept aside as language dependent components.
Preprocessor : As preprocessing should be highly customizable by the end user, we have provided a text file which can be edited to control the functionality of the preprocessor.
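How an editable text file can drive a preprocessor is sketched below. The format of the project's file is not described here, so the `token = expansion` syntax, the `#` comments and the function names are all assumptions made for illustration.

```python
def load_rules(text):
    """Read 'token = expansion' lines; blank lines and '#' comments are ignored."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        token, _, expansion = line.partition("=")
        rules[token.strip()] = expansion.strip()
    return rules

def apply_rules(text, rules):
    """Replace every occurrence of each token with its expansion."""
    for token, expansion in rules.items():
        text = text.replace(token, expansion)
    return text

# Contents an end user might keep in the editable rules file (hypothetical).
rules_file = """
# abbreviations
Dr. = Doctor
Prof. = Professor
"""

rules = load_rules(rules_file)
print(apply_rules("Dr. Shah met Prof. Mohan", rules))
```

Because the rules live in data rather than code, an end user can add or change expansions without rebuilding the preprocessor.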
IPATranscriptor : This component currently provides only phonetic translation of the given Gujarati text, as complete rules for prosodic translation are not available.
Thanks
Prof. Bhartiben Modi
Mr. Ajay Sarvaiya
Mr. Irshad Shaikh
Mr. Mihir Trivedi
Sloka
Having grasped the meanings through the intellect, the soul joins the mind with the desire to speak. The mind kindles the bodily fire, and that (bodily fire) impels the vital breath. That impelled breath strikes the head (mūrdhā), reaches the mouth and, passing through the respective places, brings forth the five kinds of speech sounds (varṇa) through the contribution of accent (svara), duration (kāla), place (sthāna), and external and internal effort (prayatna).
- Pāṇinīya Śikṣā, tenth chapter, kārikā 6, 9.