Maithili Text-to-Speech

Maithili Text-to-Speech System
Amit Kumar Jha
Pankaj Dwivedi
Piyush Pratap Singh
July 2019
IEEE CONECCT-2019
JULY 26-27, 2019

Abstract
• The paper discusses the development of a Maithili TTS system.
• Unit Selection Concatenative Method
• Basic unit - Syllable
• Speech corpus: 8 hours (5 hours borrowed from LDCIL, CIIL Mysore, and 3 hours collected
from native speakers in a studio environment).
• 1055 most frequently occurring words have been recorded and stored.
• Interface - C#.Net
• 930 syllable units (C*V), built from 300 syllables (30*10) and 10 independent vowels.
• Evaluation is performed by 10 native speakers using MOS.
• The quality of synthesized speech is approximately 84%.
07/29/19 2

Roadmap
• Introduction
• Maithili Phonology and Database Creation
• Proposed Algorithm
• Results and Discussion
• Conclusion

Introduction
• A Text-to-Speech (TTS) system converts raw text into human speech sounds.
• Applications: a communication aid for visually impaired people, telecommunication, industrial
and educational uses, and many more.
• The Government of India (GOI) initiated development of TTS systems for Indian languages through
the TTS consortium project under MeitY (Ministry of Electronics and Information Technology).
• The project covers 13 Indian languages: Hindi, Gujarati, Telugu, Marathi, Tamil, Odia, Malayalam,
Bengali, Assamese, Kannada, Bodo, Manipuri, and Rajasthani.
• Methodologies: articulatory synthesis, formant synthesis, unit selection synthesis (USS), and
HMM based speech synthesis (HTS), etc.

Maithili Language
• Maithili (EGIDS 0-4) is spoken in Bihar, India, and eastern Nepal by approximately 30 million people.
• Scripts: Devanagari, Kaithi, Mithilakshar (also known as Tirhuta), and Newari.
• 16 phonologically distinctive vowel segments: 8 oral vowels [i e æ a ə ɔ o u] and 8
corresponding nasal vowels [ĩ ẽ æ̃ ã ə̃ ɔ̃ õ ũ].
• 2 distinctive oral diphthongs /əi/ and /əu/.
• 33 consonantal segments (including [m, n, ŋ, ʃ, ɳ, ʂ, w (v), j, r, ɽ, l]) are used in written
form, but only 30 segments are realized in speech.
• The consonants [ʃ, ɳ, ʂ] are replaced by [s, n, s], respectively, in speech:
<ʃaːm> → [saːm] ‘evening’, <baːɳ> → [baːn] ‘arrow’, and <kəʂʈ> → [kəsʈ] ‘pain’.
• The sound [ɽ] is also realized as [r] in intervocalic and syllable-boundary positions:
<ɡʰoɽaː> → [ɡʰoraː] ‘horse’ and <kəɽ.ək> → [kər.ək] ‘strict’.
• Native Maithili exhibits four types of syllable structure: V, CV, VC, and CVC.

Database Creation
• A phonetically balanced text corpus was collected, and the speech was recorded in a studio
environment at a sampling frequency of 16 kHz with 16 bits per sample.
• Domains covered: children's stories, literature, science, tourism, politics, history, daily
affairs, drama, poetry, etc.
• Sources: published books, newspapers, local periodicals and magazines, web pages and blogs,
and dictionaries (Kalyani Shabdkosh).
• 120 oral and folk narratives (stories and legends) were audio-transliterated and then recorded
in a studio environment.
• PRAAT software was used to segment the recordings at the sentence, word, and syllable levels.
• For example, a syllable ‘’ [] <then>, ‘’ [lətam] <guava>, and ‘’
• The speech database consists of 930 syllable units (C*V): (300*3) + (10*3) = 930.
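
The unit count above can be checked with a short Python sketch. It is only an illustration: the slides give the counts (30 consonants, 10 vowels, 10 independent vowels) but not the actual phone inventory, so the symbols `C1..C30` and `V1..V10` are placeholders, and the name `VARIANTS_PER_UNIT` is hypothetical. The factor of 3 is taken from the slide as written; the slide does not say what it denotes.

```python
from itertools import product

# Placeholder symbols: only the counts come from the slides, not the phones.
consonants = [f"C{i}" for i in range(1, 31)]   # 30 consonants realized in speech
vowels = [f"V{i}" for i in range(1, 11)]       # 10 vowels

# Every consonant-vowel combination: 30 * 10 = 300 CV syllables.
cv_syllables = [c + v for c, v in product(consonants, vowels)]

# The 10 vowels also occur on their own as independent units.
independent_vowels = list(vowels)

# Factor taken from the slide's formula; its meaning is not stated there.
VARIANTS_PER_UNIT = 3

total_units = (len(cv_syllables) + len(independent_vowels)) * VARIANTS_PER_UNIT
print(total_units)  # 930
```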

Proposed Algorithm
• Concatenative Unit Selection Synthesis (USS) has been used because it applies only a small
amount of Digital Signal Processing (DSP) to the recorded speech data.
• DSP tends to make speech less natural; even so, some systems apply a small amount of signal
processing at the points of concatenation to smooth the waveform.
• Input text is tokenized into words based on white space and special symbols such as the purn
viram (full stop), semicolon, comma, colon, question mark, and exclamation mark.
• Non-standard word (NSW) tokens such as abbreviations, acronyms, numbers, fractions, ratios,
symbols, dates, and times are identified.
• The tokens are classified as abbreviation, acronym, number, symbol, date, URL, etc.
• Non-standard words are converted to standard words using the corresponding expansion rules
and a purpose-built lexicon.
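
The tokenization and NSW steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' C# modules: the regular expressions in `NSW_PATTERNS`, the function names, and the one-entry sample lexicon (an English stand-in) are all hypothetical, since the slides do not give the actual rules or lexicon.

```python
import re

# Marks named on the slide; "\u0964" is the Devanagari danda (purn viram).
PUNCTUATION = "\u0964;,:?!"

# Hypothetical NSW patterns, tried in order. The real rules are not given.
NSW_PATTERNS = [
    ("url", re.compile(r"(https?://|www\.)\S+")),
    ("date", re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")),
    ("time", re.compile(r"\d{1,2}:\d{2}")),
    ("number", re.compile(r"\d+([.,]\d+)*")),
]


def tokenize(text):
    """Split on white space, then strip punctuation from both ends of each token."""
    tokens = []
    for chunk in text.split():
        core = chunk.strip(PUNCTUATION)
        if core:
            tokens.append(core)
    return tokens


def classify(token, lexicon):
    """Label a token as an abbreviation, one of the NSW classes, or a plain word."""
    if token in lexicon:
        return "abbreviation"
    for label, pattern in NSW_PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "word"


def normalize(text, lexicon):
    """Return (token, label, spoken form) triples; only lexicon entries are expanded here."""
    result = []
    for token in tokenize(text):
        label = classify(token, lexicon)
        spoken = lexicon.get(token, token)
        result.append((token, label, spoken))
    return result


# Illustrative input and lexicon (assumptions, not data from the paper).
sample = "Dr. Jha 10:30 12/05/2019 www.example.org 42\u0964"
for triple in normalize(sample, {"Dr.": "Doctor"}):
    print(triple)
```

Stripping punctuation only from token edges, rather than splitting on every colon or comma, keeps tokens such as times and dates intact for the classification step. Expansion of numbers, dates, and times into words would need language-specific rules that the slides do not describe, so the sketch leaves those tokens unchanged.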

Flowchart
• Maithili text is input in UTF-16 encoding.
• The text is normalized using three algorithmic
modules written in C# and SQL.
• The input text is segmented at the sentence
and word levels.
• A word-level search is done in the database; if
the word is found, the corresponding speech file
is added to the playlist. Otherwise, the word is
broken into its syllables, and the corresponding
syllable files are searched for and added to the
playlist.
• The speech units found are concatenated in the
playlist using digital signal processing.
• The playlist is played.
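
The lookup-with-fallback flow above might look like the following Python sketch. It is an illustration, not the authors' C#/SQL implementation: `build_playlist`, `crossfade_concat`, the dictionary-based databases, the `syllabify` callback, and the romanized sample entries are all hypothetical. The linear cross-fade is one simple way to smooth a joint; the slides do not say which signal processing the system applies.

```python
def build_playlist(words, word_db, syllable_db, syllabify):
    """Return the unit files to play, preferring whole-word recordings."""
    playlist = []
    for word in words:
        if word in word_db:
            # Whole-word recording exists: use it directly.
            playlist.append(word_db[word])
            continue
        # Otherwise fall back to the word's syllables.
        for syllable in syllabify(word):
            unit = syllable_db.get(syllable)
            if unit is not None:  # syllables with no recording are skipped
                playlist.append(unit)
    return playlist


def crossfade_concat(units, overlap=4):
    """Join lists of samples, linearly cross-fading `overlap` samples per joint."""
    out = []
    for samples in units:
        n = min(overlap, len(out), len(samples))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * samples[i]
        out.extend(samples[n:])
    return out


# Hypothetical databases with romanized stand-in entries.
word_db = {"ghar": "words/ghar.wav"}
syllable_db = {"ka": "syllables/ka.wav", "mal": "syllables/mal.wav"}
syllables_of = {"kamal": ["ka", "mal"]}

playlist = build_playlist(
    ["ghar", "kamal"], word_db, syllable_db, lambda w: syllables_of.get(w, [])
)
print(playlist)
```

`crossfade_concat` works on already decoded samples; reading the audio files named in the playlist is outside this sketch.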

Results and Discussion
MOS Score
MOS Chart for Quality analysis

Conclusion
• The Mean Opinion Score (MOS) was calculated
on test data from 10 users (5 male and 5
female).
• 10 sample sentences covering different
domains were given to the evaluators.
• The quality of the TTS system for the Maithili
language is 4.2, i.e., 84 percent.
• Prosodic differences in some speech samples
were found due to intra-dialectal and
inter-dialectal differences.
• The present TTS system can be made more
robust by implementing prosodic
features.
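
The relation between the reported MOS and the percentage can be sketched in Python as below. This assumes the usual 5-point MOS scale, which is consistent with 4.2 corresponding to 84 percent. The individual ratings are invented for illustration and are not the study's data, and `mos_summary` is a hypothetical helper name.

```python
from statistics import mean


def mos_summary(ratings, scale_max=5):
    """Return the mean opinion score and the same value as a percentage of the scale."""
    mos = mean(ratings)
    return mos, 100 * mos / scale_max


# Invented ratings from 10 listeners (illustration only, not the paper's data).
ratings = [4, 5, 4, 4, 4, 5, 4, 4, 4, 4]

mos, percent = mos_summary(ratings)
print(f"MOS = {mos:.1f}, quality = {percent:.0f}%")
```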