Maithili Text-to-Speech System
Amit Kumar Jha
Pankaj Dwivedi
Piyush Pratap Singh
July 2019
IEEE CONECCT-2019
JULY 26-27, 2019
Abstract
• The paper discusses the development of a Maithili TTS system.
• Unit Selection Concatenative Method
• Basic unit - Syllable
• Speech corpus – 8 hours (5 hours borrowed from LDCIL, CIIL Mysore, and 3 hours collected
from native speakers in a studio environment).
• The 1055 most frequently occurring words have been recorded and stored.
• Interface - C#.Net
• 930 syllable units (C*V) - built from 300 CV syllables (30*10) and 10 independent vowels.
• Evaluation is performed by 10 native speakers using MOS.
• The quality of synthesized speech is approximately 84%.
07/29/19 2
Roadmap
• Introduction
• Maithili Phonology and Database Creation
• Proposed Algorithm
• Results and Discussion
• Conclusion
Introduction
• A Text-to-Speech (TTS) system converts raw text into human speech sounds.
• Applications: a communication aid for visually impaired people, telecommunication, industrial
and educational applications, and many more.
• GOI initiated development of TTS systems for Indian languages through the TTS consortium
project under MeitY (Ministry of Electronics and Information Technology).
• 13 Indian languages, namely Hindi, Gujarati, Telugu, Marathi, Tamil, Odia, Malayalam,
Bengali, Assamese, Kannada, Bodo, Manipuri, and Rajasthani.
• Methodologies: articulatory synthesis, formant synthesis, unit selection synthesis (USS), and
HMM-based speech synthesis (HTS), etc.
Maithili Language
• Maithili (EGIDS 0-4) - spoken in Bihar, India, and eastern Nepal (approximately 30 million speakers).
• Scripts – Devanagari, Kaithi, Mithilakshar (also known as Tirhuta), and Newari.
• 16 phonologically distinctive vowel segments - 8 oral vowels [i e æ a ə ɔ o u] and 8
corresponding nasal vowels [ĩ ẽ æ̃ ã ə̃ ɔ̃ õ ũ].
• 2 distinctive oral diphthongs /əi/ and /əu/.
• 33 consonantal segments (including [m, n, ŋ, ʃ, ɳ, ʂ, w (v), j, r, ɽ, l]) are used in written form;
only 30 segments are realized in speech.
• The consonants [ʃ, ɳ, ʂ] are replaced by [s, n, s], respectively, in speech:
<ʃaːm> → [saːm] ‘evening’, <baːɳ> → [baːn] ‘arrow’, and <kəʂʈ> → [kəsʈ] ‘pain’.
• The sound [ɽ] is also realized as [r] in intervocalic and syllable-boundary positions:
<ɡʰoɽaː> → [ɡʰoraː] ‘horse’ and <kəɽ.ək> → [kər.ək] ‘strict’.
• Native Maithili exhibits four types of syllable structure: V, CV, VC, and CVC.
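
The written-to-spoken replacements on this slide can be sketched as a small rewrite function. This is only an illustration under stated assumptions: the function name `written_to_spoken` is hypothetical, input is a broad phonemic string with "." marking syllable boundaries (as in the slide's examples), and the [ɽ] rule is read as "between vowels, optionally across a syllable boundary", which covers both examples but may be narrower than the authors' rule.

```python
import re

# Oral vowel symbols copied from the slide; the length mark "ː" is handled separately.
VOWELS = "ieæaəɔou"

# Written segment -> spoken segment, as listed on the slide.
MERGERS = {"ʃ": "s", "ɳ": "n", "ʂ": "s"}


def written_to_spoken(form: str) -> str:
    """Apply the slide's consonant replacements to a broad phonemic string."""
    out = "".join(MERGERS.get(ch, ch) for ch in form)
    # Assumption: ɽ -> r when preceded by a (possibly long) vowel and followed
    # by a vowel, with an optional "." syllable boundary in between.
    return re.sub(rf"(?<=[{VOWELS}ː])ɽ(?=\.?[{VOWELS}])", "r", out)


for written in ["ʃaːm", "baːɳ", "kəʂʈ", "ɡʰoɽaː", "kəɽ.ək"]:
    print(written, "->", written_to_spoken(written))
```

A word-final [ɽ] is left unchanged by this sketch, since the slide gives no example for that position.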
Database Creation
• Phonetically balanced text data (corpus) is collected and recorded in a studio environment at a
sampling frequency of 16 kHz, 16 bits per sample.
• Domains adequately covered: children's stories, literature, science, tourism, politics, history,
daily affairs, drama, poetry, etc.
• Sources: published books, newspapers, local periodicals and magazines, web pages and blogs,
and dictionaries (Kalyani shabdkosh).
• 120 oral and folk narratives (stories and legends) were audio-transliterated and then recorded in
a studio environment.
• PRAAT software - Sentence, Word, and Syllable levels.
• For example, a syllable ‘’ [] <then>, ‘’ [lətam] <guava>, and ‘’
• The speech database consists of 930 syllable units (C*V): (300*3) + (10*3) = 930.
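
The inventory arithmetic above can be checked in a few lines. The counts come from the slides; the variable names are illustrative, and the meaning of the factor 3 is not stated in the slides, so it is kept as an unexplained multiplier here.

```python
CONSONANTS = 30           # consonants realized in speech (from the slides)
VOWELS_PER_CONSONANT = 10  # the "30*10" product on the abstract slide
INDEPENDENT_VOWELS = 10
VARIANTS = 3              # the "*3" factor; its meaning is not given in the slides

cv_syllables = CONSONANTS * VOWELS_PER_CONSONANT
total_units = (cv_syllables * VARIANTS) + (INDEPENDENT_VOWELS * VARIANTS)

print(cv_syllables, total_units)
```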
Proposed Algorithm
• Concatenative Unit Selection Synthesis (USS) has been used because it applies only a small amount
of digital signal processing (DSP) to the recorded speech data.
• DSP makes speech less natural; even so, some systems apply a small amount of signal processing
at the point of concatenation to smooth the waveform.
• Input text is tokenized into words based on white space and special symbols such as purn viram
(full stop), semicolon, comma, colon, question mark, exclamation mark, etc.
• Identify the non-standard word (NSW) tokens such as abbreviations, acronyms, numbers, fractions,
ratios, symbols, dates, times, etc.
• Classify the tokens as abbreviation, acronym, number, symbol, date, URL, etc.
• Convert non-standard words to standard words using the corresponding expansion rules and the
developed lexicon.
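
The tokenize, classify, and expand steps above can be sketched as follows. This is a minimal illustration, not the authors' C#/SQL modules: the regular expressions, the class names, the `normalize` function, and the one-entry lexicon are all assumptions, and the toy input is in Latin script because the slides give no Maithili expansion rules.

```python
import re

PUNCT = "।;,:?!"  # purn viram (U+0964) plus the marks named on the slide

# Ordered alternatives: NSW patterns are tried before the generic word run.
TOKEN_RE = re.compile(
    r"https?://\S+"              # URL, kept whole
    r"|\d{1,2}/\d{1,2}/\d{4}"    # date, e.g. 26/07/2019
    r"|\d{1,2}:\d{2}"            # time, e.g. 10:30
    r"|\d+/\d+"                  # fraction, e.g. 3/4
    rf"|[^\s{PUNCT}]+"           # any other run of non-space, non-punctuation
    rf"|[{PUNCT}]"               # a single punctuation mark
)


def classify(token: str) -> str:
    """Assign a coarse class to one token (illustrative classes only)."""
    if len(token) == 1 and token in PUNCT:
        return "punct"
    if re.fullmatch(r"https?://\S+", token):
        return "url"
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", token):
        return "date"
    if re.fullmatch(r"\d{1,2}:\d{2}", token):
        return "time"
    if re.fullmatch(r"\d+/\d+", token):
        return "fraction"
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "number"
    if token.endswith("."):
        return "abbreviation"
    return "word"


def normalize(text: str, lexicon: dict) -> list:
    """Return (class, spoken_form) pairs, expanding NSW tokens found in the lexicon."""
    result = []
    for token in TOKEN_RE.findall(text):
        kind = classify(token)
        if kind == "punct":
            continue  # punctuation only delimits tokens in this sketch
        spoken = token if kind == "word" else lexicon.get(token, token)
        result.append((kind, spoken))
    return result


toy_lexicon = {"Dr.": "Doctor"}  # hypothetical entry, not from the paper
pairs = normalize("Dr. Jha, 26/07/2019 10:30 3/4 1055 http://example.org ।", toy_lexicon)
print(pairs)
```

NSW tokens without a lexicon entry pass through unchanged here; a real system would route each class to its own expansion rule.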
Flowchart
• Input Maithili text encoded in UTF-16.
• Text is normalized using three algorithmic modules written in C# and SQL.
• The input text is segmented at sentence and word level.
• A word-level search is done in the database; if the word is found, the corresponding speech
file is added to the playlist. Otherwise, the word is broken into its syllables, and the
corresponding syllable files are searched for and added to the playlist.
• The speech units found are concatenated in the playlist using digital signal processing.
• Play the sound of the playlist.
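
The word lookup with syllable fallback described above can be sketched like this. It is a simplified illustration under assumptions: the databases are plain dictionaries mapping text to hypothetical file paths, the keys are romanized toy strings, and the greedy longest-match `syllabify` is a stand-in, since the slides do not say how words are split into syllables. Concatenation and playback are not shown.

```python
def syllabify(word: str, syllables) -> list:
    """Greedy longest-match split of a word into known syllable units (an assumption)."""
    units, i = [], 0
    longest = max(map(len, syllables))
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in syllables:
                units.append(piece)
                i += size
                break
        else:
            # No known unit starts at position i.
            raise KeyError(f"no syllable unit covers {word[i:]!r}")
    return units


def build_playlist(words, word_db: dict, syllable_db: dict) -> list:
    """Prefer a whole-word recording; otherwise fall back to syllable recordings."""
    playlist = []
    for word in words:
        if word in word_db:
            playlist.append(word_db[word])
        else:
            playlist.extend(syllable_db[s] for s in syllabify(word, syllable_db))
    return playlist


# Toy data with hypothetical file names.
word_db = {"ghora": "words/ghora.wav"}
syllable_db = {"ka": "syl/ka.wav", "ra": "syl/ra.wav"}
playlist = build_playlist(["ghora", "kara"], word_db, syllable_db)
print(playlist)
```

Looking up whole words first means the 1055 recorded frequent words are played as single units, and only out-of-vocabulary words incur syllable joins.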
Results and Discussion
MOS Score
MOS Chart for Quality analysis
Conclusion
• Mean Opinion Score (MOS) from 10 users (5 male and 5 female) was calculated on test data.
• 10 sample sentences covering different domains were given to the evaluators.
• The MOS of the TTS system for the Maithili language is 4.2 out of 5, i.e. 84 percent.
• Prosodic differences in some speech samples were found due to intra-dialectal and
inter-dialectal differences.
• The present TTS system can be made more robust by implementing prosodic features.
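
The relation between the MOS and the percentage quoted above can be sketched as below. The function names and the sample ratings are hypothetical (the slides do not list individual ratings), and a 5-point scale is assumed because 4.2 maps to 84 percent only on that scale.

```python
from statistics import mean


def mos(ratings) -> float:
    """Mean Opinion Score: the arithmetic mean of all listener ratings."""
    return mean(ratings)


def mos_percent(score: float, scale_max: int = 5) -> float:
    """Express a MOS as a percentage of the scale maximum (assumed to be 5)."""
    return round(100 * score / scale_max, 1)


# Hypothetical ratings for illustration only, not the study's data.
sample_ratings = [4, 5, 4, 4, 4]
print(mos(sample_ratings), mos_percent(mos(sample_ratings)))
```

With a mean of 4.2 this gives 84.0, which agrees with the figure reported on the slide.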
