This presentation describes the development of a Maithili text-to-speech (TTS) system using a unit selection concatenative method. The system was built on an 8-hour speech corpus: 5 hours borrowed from LDCIL, CIIL Mysore, and 3 hours recorded from native speakers in a studio environment. Ten native speakers evaluated the synthesized speech with mean opinion score (MOS) testing, giving an average of 4.2, i.e. about 84%. Future work includes implementing more robust prosodic features to handle intra- and inter-dialectal differences in Maithili.
2. Abstract
• The paper discusses the development of a Maithili TTS system.
• Unit Selection Concatenative Method
• Basic unit - Syllable
• Speech corpus – 8 hours (5 hours borrowed from LDCIL, CIIL Mysore, and 3 hours collected
from native speakers in a studio environment).
• 1055 most frequently occurring words have been recorded and stored.
• Interface - C#.Net
• 930 syllable units: 300 C*V syllables (30*10) and 10 independent vowels, each multiplied by 3 [(300*3) + (10*3) = 930].
• Evaluation is performed by 10 native speakers using MOS.
• The quality of synthesized speech is approximately 84%.
07/29/19 2
4. Introduction
• A Text-to-Speech (TTS) system converts raw text into human speech sounds.
• Applications: a communication aid for visually impaired people, telecommunication, industrial
and educational uses, and many more.
• The Government of India (GOI) initiated the development of TTS systems for Indian languages through the TTS consortium
project under MeitY (Ministry of Electronics and Information Technology).
• 13 Indian languages, namely Hindi, Gujarati, Telugu, Marathi, Tamil, Odia, Malayalam,
Bengali, Assamese, Kannada, Bodo, Manipuri, and Rajasthani.
• Methodologies: articulatory synthesis, formant synthesis, unit selection synthesis (USS),
HMM-based speech synthesis (HTS), etc.
5. Maithili Language
• Maithili (EGIDS 0-4) - spoken in Bihar, India, and Eastern Nepal (approximately 30 million speakers).
• Scripts – Devanagari, Kaithi, Mithilakshar (also known as Tirhuta), and Newari.
• 16 phonologically distinctive vowel segments - 8 oral vowels [i e æ a ə ɔ o u] and 8
corresponding nasal vowels [ĩ ẽ æ̃ ã ə̃ ɔ̃ õ ũ].
• 2 distinctive oral diphthongs /əi/ and /əu/.
• 33 consonantal segments (including [m, n, ŋ, ʃ, ɳ, ʂ, w (v), j, r, ɽ, l]) are used in written form, but only 30 segments
are realized in speech.
• The consonants [ʃ, ɳ, ʂ] are replaced by [s, n, s], respectively, in speech:
<ʃaːm> → [saːm] ‘evening’, <baːɳ> → [baːn] ‘arrow’, and <kəʂʈ> → [kəsʈ] ‘pain’.
• The sound [ɽ] is also realized as [r] in intervocalic and syllable-boundary positions:
<ɡʰoɽaː> → [ɡʰoraː] ‘horse’ and <kəɽ.ək> → [kər.ək] ‘strict’.
• Native Maithili exhibits four types of syllabic structure V, CV, VC and CVC.
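The written-to-spoken substitutions listed on this slide can be sketched as a small rule set. This is only an illustration in Python, not the authors' C# implementation: the function name `spoken_form` is hypothetical, the vowel set is the oral inventory given above (nasal vowels are not handled), and the [ɽ] rule is interpreted here as "between vowels, optionally across a `.` syllable boundary".

```python
import re

# Oral vowels from the inventory above; nasal vowels are not handled in this sketch.
VOWELS = "ieæaəɔou"

# [ʃ] and [ʂ] are realized as [s]; [ɳ] is realized as [n].
SIMPLE_MAP = str.maketrans({"ʃ": "s", "ʂ": "s", "ɳ": "n"})

# [ɽ] preceded by a vowel (with optional length mark) and followed by a vowel,
# optionally across a "." syllable boundary. This reading of the rule is an assumption.
RETROFLEX_FLAP = re.compile(rf"([{VOWELS}]ː?)ɽ(?=\.?[{VOWELS}])")


def spoken_form(written: str) -> str:
    """Map a written-form transcription to its spoken realization."""
    spoken = written.translate(SIMPLE_MAP)
    return RETROFLEX_FLAP.sub(r"\1r", spoken)


if __name__ == "__main__":
    for form in ["ʃaːm", "baːɳ", "kəʂʈ", "ɡʰoɽaː", "kəɽ.ək"]:
        print(form, "->", spoken_form(form))
```

Applied to the five example words on the slide, the sketch reproduces the spoken forms given there. A word-final [ɽ] is left unchanged, since the slide does not say how that case is treated.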
6. Database Creation
• A phonetically balanced text corpus was collected and recorded in a studio environment at a sampling
frequency of 16 kHz with 16-bit resolution.
• Domains covered: children's stories, literature, science, tourism, politics, history, daily affairs, drama,
poetry, etc.
• Sources: published books, newspapers, local periodicals and magazines, web pages and blogs,
and dictionaries (Kalyani shabdkosh).
• 120 oral and folk narratives (stories and legends) were audio-transliterated and then recorded in a
studio environment.
• Segmentation with PRAAT software at sentence, word, and syllable levels.
• For example, a syllable ‘’ [] <then>, ‘’ [lətam] <guava>, and ‘’
• The speech database consists of 930 syllable units (C*V): [(300*3) + (10*3)] = 930.
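The inventory arithmetic on this slide can be checked with a few lines of Python. The sketch below only restates the slide's formula; the slide does not say what the factor of 3 stands for, so it is kept as an unnamed multiplier. The storage estimate for the 8-hour corpus at 16 kHz/16 bits assumes mono, uncompressed PCM, which is an assumption and not stated in the source.

```python
def syllable_units(consonants: int = 30, vowels: int = 10, factor: int = 3) -> int:
    """Reproduce the slide's count: (C*V syllables * 3) + (independent vowels * 3)."""
    cv_syllables = consonants * vowels          # 30 * 10 = 300
    return cv_syllables * factor + vowels * factor


def raw_pcm_bytes(hours: float, sample_rate_hz: int = 16_000,
                  bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size of uncompressed PCM audio; mono is an assumption."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)


if __name__ == "__main__":
    print("syllable units:", syllable_units())
    print("8-hour corpus, bytes of raw PCM:", raw_pcm_bytes(8))
```

Under the mono assumption, 8 hours of 16 kHz/16-bit audio comes to a little under 1 GB of raw samples.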
7. Proposed Algorithm
• Concatenative Unit Selection Synthesis (USS) has been used because it applies only a small amount of Digital
Signal Processing (DSP) to the recorded speech data.
• DSP makes speech less natural; even so, some systems apply a small amount of signal processing
at the points of concatenation to smooth the waveform.
• Input text is tokenized into words based on white space and special symbols such as purn viram
(full stop), semicolon, comma, colon, question mark, exclamation mark, etc.
• Identify the non-standard word (NSW) tokens such as abbreviations, acronyms, numbers, fractions, ratios, symbols,
dates, times, etc.
• Classify the tokens as abbreviations, acronyms, numbers, symbols, dates, URLs, etc.
• Convert non-standard words to standard words using the corresponding expansion rules and the
developed lexicon.
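The tokenize, identify, classify, and convert steps above can be sketched as follows. This is a minimal Python illustration, not the authors' C#/SQL modules: the function names, the regular expressions, and the one-entry lexicon are all hypothetical, and only abbreviations are expanded, because the slides do not give the expansion rules for numbers, dates, or other classes.

```python
import re

# Punctuation named on the slide; "।" is the purn viram (danda).
PUNCTUATION = "।.;,:?!"

# Illustrative patterns only; the authors' actual rules are not given in the slides.
NSW_PATTERNS = [
    ("url", re.compile(r"(?:https?://|www\.)\S+")),
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")),
    ("time", re.compile(r"\d{1,2}:\d{2}")),
    ("fraction", re.compile(r"\d+/\d+")),
    ("number", re.compile(r"\d+(?:\.\d+)?")),
]


def tokenize(text):
    """Split on white space, then trim punctuation from both ends of each token."""
    tokens = []
    for chunk in text.split():
        token = chunk.strip(PUNCTUATION)
        if token:
            tokens.append(token)
    return tokens


def classify(token, abbreviations):
    """Return an NSW class label, or "standard" for ordinary words."""
    if token in abbreviations:
        return "abbreviation"
    for label, pattern in NSW_PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "standard"


def normalize(text, abbreviations):
    """Expand abbreviations from the lexicon; other NSW classes are only tagged."""
    result = []
    for token in tokenize(text):
        label = classify(token, abbreviations)
        expanded = abbreviations[token] if label == "abbreviation" else token
        result.append((expanded, label))
    return result


if __name__ == "__main__":
    lexicon = {"Dr": "Doctor"}  # placeholder entry, not from the authors' lexicon
    for item in normalize("Dr. Jha, 07/29/19; 10:30 www.example.org 42।", lexicon):
        print(item)
```

Punctuation is trimmed only at token edges so that internal characters in times, dates, and URLs survive tokenization; a real system would also need rules for ratios, symbols, and acronyms.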
8. Flowchart
• Input Maithili text using UTF-16.
• The text is normalized using three algorithmic
modules written in C# and SQL.
• The input text is segmented at sentence
and word level.
• A word-level search is performed in the database; if
the word is found, the corresponding speech file is
added to the playlist. Otherwise, the word is broken
into its syllables, and the
corresponding syllable files are searched for and
added to the playlist.
• The speech units found are concatenated in the
playlist using digital signal processing.
• The playlist is played back.
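The lookup-with-fallback flow above can be sketched in Python. Everything named here is hypothetical: `build_playlist`, the dictionary-based "database", and the `syllabify` callback stand in for the authors' C#/SQL components. The slides also do not specify which signal processing is applied at the joins; a short linear crossfade is used below as one common choice, not as the authors' method.

```python
def build_playlist(words, word_files, syllable_files, syllabify):
    """Prefer a whole-word recording; otherwise fall back to syllable recordings."""
    playlist, missing = [], []
    for word in words:
        if word in word_files:
            playlist.append(word_files[word])
            continue
        for syllable in syllabify(word):
            if syllable in syllable_files:
                playlist.append(syllable_files[syllable])
            else:
                missing.append(syllable)  # the slides do not cover this case
    return playlist, missing


def crossfade_concat(units, overlap):
    """Join sample sequences, blending `overlap` samples linearly at each join."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            weight = (i + 1) / (n + 1)  # ramps from the old unit to the new one
            out[start + i] = (1 - weight) * out[start + i] + weight * unit[i]
        out.extend(unit[n:])
    return out


if __name__ == "__main__":
    word_files = {"w1": "words/w1.wav"}
    syllable_files = {"s1": "syl/s1.wav", "s2": "syl/s2.wav"}
    playlist, missing = build_playlist(
        ["w1", "s1.s2"], word_files, syllable_files,
        syllabify=lambda word: word.split("."),  # toy syllabifier for the demo
    )
    print(playlist, missing)
    print(crossfade_concat([[1.0] * 4, [0.0] * 4], overlap=2))
```

Returning the missing syllables separately keeps the sketch from silently dropping units, which makes gaps in the recorded inventory easy to spot.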
10. Conclusion
• The Mean Opinion Score (MOS) was calculated
on test data from 10 users (5 male and 5
female).
• 10 sample sentences covering different
domains were given to the evaluators.
• The MOS of the TTS system for the Maithili
language is 4.2 out of 5, i.e. 84 percent.
• Prosodic differences in some speech samples
were found due to intra-dialectal and
inter-dialectal differences.
• The present TTS system can be made more
robust with the implementation of prosodic
features.
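The relation between the reported MOS of 4.2 and the 84 percent figure can be sketched as below. The helper names are hypothetical, a 5-point rating scale is assumed (it is what makes 4.2 correspond to 84 percent), and the ratings in the demo are made-up placeholders, not the study's data.

```python
from statistics import mean


def mean_opinion_score(ratings_by_listener):
    """Average all ratings across listeners and sentences."""
    return mean(score for ratings in ratings_by_listener for score in ratings)


def mos_to_percent(mos, scale_max=5):
    """Express a MOS as a percentage of the top of the scale (5-point scale assumed)."""
    return 100 * mos / scale_max


if __name__ == "__main__":
    # Placeholder ratings for illustration only; not the evaluators' actual scores.
    demo_ratings = [[4, 5, 4], [4, 4, 5]]
    demo_mos = mean_opinion_score(demo_ratings)
    print(round(demo_mos, 2), round(mos_to_percent(demo_mos), 1))
    print(round(mos_to_percent(4.2), 1))  # the reported MOS as a percentage
```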