This paper explores a syllable-based approach to building language-independent text-to-speech systems for Indian languages. Using a common phone set and a common question set, and borrowing context-independent monophone models across languages together with the syllable approach, makes the procedure easier and less time-consuming without compromising the quality of the synthesized speech. Systems can be built even without knowing the language, which is especially beneficial in the Indian scenario.
Comparative study of Text-to-Speech Synthesis for Indian Languages by using Syllable Approach
1. Comparative study of Text-to-Speech Synthesis for Indian Languages by using Syllable Approach
CLASS: M.E I COMPUTER
GUIDED BY: PROF. ASHISH MANWATKAR
PRESENTED BY: RAVI SHARMA
ROLL NO: 15311
2. CONTENT
• INTRODUCTION
• MOTIVATION
• LITERATURE SURVEY
• DATA TABLE
• SYSTEM ARCHITECTURE
• MATHEMATICAL MODEL
• ALGORITHM
• ADVANTAGES
• DISADVANTAGES
• APPLICATION
• CONCLUSION
3. INTRODUCTION
• Text-to-Speech Synthesis-
A system that takes a sequence of words as input and converts it to
speech.
4. • Parts of Speech Synthesizers
Speech synthesizers usually consist of two parts.
First Part- The first part, the front end, has two major tasks.
• First, it takes the raw text and converts things like numbers and
abbreviations into their written-out word equivalents. This process is
often called text normalization.
• Then it assigns phonetic transcriptions to each word, and divides and
marks the text into linguistic units such as phrases, clauses, and
sentences.
5. • Second Part- The other part, the back end, takes the symbolic
linguistic representation and converts it into actual sound output.
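The text normalization task of the first part can be illustrated with a minimal sketch. It is not the system described in these slides: the abbreviation table, the English number words and the function names are all assumptions chosen for illustration, and only integers from 0 to 999 are expanded.

```python
import re

# Hypothetical abbreviation table; a real system would need a much larger,
# language-specific one.
ABBREVIATIONS = {"Dr.": "Doctor", "Prof.": "Professor", "No.": "Number"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize(text):
    """Expand known abbreviations and small integers into written-out words."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Replace each standalone run of 1-3 digits with its spelled-out form;
    # longer digit strings are left untouched in this sketch.
    return re.sub(r"\b\d{1,3}\b",
                  lambda match: number_to_words(int(match.group())),
                  text)
```

For example, `normalize("Dr. Rao has 42 books")` returns `Doctor Rao has forty two books`.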
6. Text-to-phoneme challenges
• Speech synthesis systems use two basic approaches to determine the
pronunciation of a word based on its spelling, a process which is often
called text-to-phoneme conversion.
7. Dictionary Based approach
• The simplest approach to text-to-phoneme conversion is the
dictionary-based approach, where a large dictionary containing all
the words of a language and their correct pronunciation is stored by
the program.
• Determining the correct pronunciation of each word is a matter of
looking up each word in the dictionary and replacing the spelling with
the pronunciation specified in the dictionary
8. Rule based approach
• The other approach used for text-to-phoneme conversion is the
rule-based approach, where rules for the pronunciations of words are
applied to words to work out their pronunciations based on their
spellings. This is similar to the "sounding out" approach to learning
to read.
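The two approaches above can be sketched together: look the word up in a dictionary first, and sound it out with rules only when it is missing. This combination is an assumption made for illustration; the slides describe the approaches separately. The lexicon entries, phone labels and letter rules below are hypothetical.

```python
# Hypothetical pronunciation dictionary (word -> phone list).
LEXICON = {
    "speech": ["s", "p", "ii", "ch"],
    "text": ["t", "e", "k", "s", "t"],
}

# Hypothetical letter-to-sound rules; two-letter graphemes are tried before
# falling back to one letter = one phone.
LETTER_RULES = [("sh", "sh"), ("ch", "ch"), ("aa", "aa"), ("ee", "ii")]


def rule_based(word):
    """Sound out a word left to right using the grapheme rules."""
    phones = []
    i = 0
    while i < len(word):
        for grapheme, phone in LETTER_RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            # No rule matched: treat the single letter as its own phone.
            phones.append(word[i])
            i += 1
    return phones


def to_phonemes(word):
    """Dictionary lookup first; rules for out-of-vocabulary words."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return rule_based(word)
```

Here `to_phonemes("Speech")` is answered from the dictionary, while an unseen word such as `"shaala"` falls through to the rules and yields `["sh", "aa", "l", "a"]`.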
9. • SYLLABLE RULES-
A syllable is a cluster of consonants and a vowel.
A syllable should contain one vowel and any number of consonants.
1. A single vowel can act as a syllable (i.e. V).
2. Possible patterns: V, C*V, V*C, C*V*C, C*C*V, C*C*C*V*C*C*C, etc.
3. A consonant before the vowel is called the "Onset", i.e. C*V.
4. A consonant after the vowel is called the "Coda", i.e. V*C.
10. Syllable Rules-
1. When nasals such as / ’/, or half-pronounced / / or / / sounds,
succeed a vowel immediately, they are treated as a part of
the vowel and also of the same syllable. For example, / ’/ in sa ’sthaa
is a part of the syllable containing /sa/.
2. When there are three or more consonants between two
consecutive vowels, the first consonant would be a part of the coda
of the previous syllable while the remaining consonants would be
onset of the next syllable .
11. Syllable Rules-
3. When there are exactly two consonants between two vowels, the first consonant
would be part of coda of previous syllable and the second would be onset of the
next syllable
4. When the second consonant is a member of the set {/r/ /s/ /sh/ /shh/}, both the
consonants would be a part of onset of the next syllable
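Rules 2 to 4 above can be sketched as a small function over a list of phones. This is only one reading of the rules, not the presenters' implementation. It assumes that rule 4 is an exception to the two-consonant case of rule 3, that a single consonant between vowels becomes the onset of the next syllable, and that word-initial and word-final consonants attach to the first and last syllable. Rule 1 (nasals) is not modelled. The vowel labels are hypothetical.

```python
# Hypothetical vowel labels; a real phone set would define these per language.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "o", "oo", "ai", "au"}

# Set named in rule 4: a second consonant from this set pulls both
# consonants into the onset of the next syllable.
ONSET_PAIR_SECOND = {"r", "s", "sh", "shh"}


def syllabify(phones):
    """Split a list of phones into syllables, one vowel per syllable."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_positions:
        return [phones] if phones else []

    starts = [0]  # index at which each syllable begins
    for prev, nxt in zip(vowel_positions, vowel_positions[1:]):
        cluster = phones[prev + 1:nxt]  # consonants between two vowels
        if len(cluster) <= 1:
            coda_len = 0  # assumption: a lone consonant is an onset
        elif len(cluster) == 2:
            # Rule 3, with the rule 4 exception.
            coda_len = 0 if cluster[1] in ONSET_PAIR_SECOND else 1
        else:
            coda_len = 1  # rule 2: first consonant is coda, rest are onset
        starts.append(prev + 1 + coda_len)

    ends = starts[1:] + [len(phones)]
    return [phones[s:e] for s, e in zip(starts, ends)]
```

With these assumptions, `["k", "a", "m", "l", "aa"]` splits as `[["k", "a", "m"], ["l", "aa"]]` under rule 3, while `["p", "a", "t", "r", "a"]` splits as `[["p", "a"], ["t", "r", "a"]]` under rule 4.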
12. HMM synthesis
• A relatively new technology is speech synthesis based on hidden
Markov models (HMMs), a statistical modelling technique.
• It is a statistical method in which the text-to-speech system is based
on a model that is not known beforehand but is refined through
continued training.
• The technique consumes large CPU resources but very little memory.
• This approach seems to give better prosody, without glitches, while
still producing very natural-sounding, human-like speech.
14. MOTIVATION
• There are 1652 languages in India
• Building a TTS system for each of them is time-consuming and
exhausting. Thus a more generic approach towards system building is
required. A common framework is first designed, using which
language-specific systems are then built.
15. LITERATURE SURVEY
1. PAPER TITLE: An Unit Selection based Hindi Text To Speech Synthesis System Using Syllable as a Basic Unit
   Aim of the Paper: Improved naturalness in the synthesized speech.
   Advantages: Reduces the prosody mismatch and spectral discontinuity that occur during syllable concatenation.
   Disadvantages: Large number of concatenation points. This results in glitches at the output, and the prosody mismatch and spectral discontinuity are hard to eliminate.
2. PAPER TITLE: Design and Development of a Text-To-Speech Synthesizer for Indian Languages
   Aim of the Paper: The design and implementation of a unit selection based text-to-speech synthesizer with syllables and polysyllables as units of concatenation.
   Advantages: Improves synthesis quality and reduces the search space, improving the synthesis timing.
   Disadvantages: It is not clear, at the time of writing, how spectral interpolation will be performed at the boundaries.
16. LITERATURE SURVEY (continued)
3. PAPER TITLE: Development of Speech Database for Hindi Text-To-Speech System Considering Syllable as a Basic Unit
   Aim of the Paper: Convert an orthographic text into intelligible and natural-sounding speech.
   Advantages: This technique provides very high quality speech output which is reasonably natural and equivalent to the voice of the original speaker.
   Disadvantages: Pre-processing of the text is required before synthesizing.
4. PAPER TITLE: Text-to-Speech Synthesis using syllable-like units
   Aim of the Paper: The design of a syllable-based concatenative waveform synthesizer for Indian languages.
   Advantages: The automatic segmentation algorithm has indeed created a useful speech unit that has low target and concatenation costs.
   Disadvantages: The current work uses a single unique syllable-like unit from the repository for synthesis.
17. LITERATURE SURVEY (continued)
5. PAPER TITLE: Statistical parametric speech synthesis
   Aim of the Paper: Generating acceptable speech synthesis.
   Advantages: A variety of speaking styles or emotional speech can be synthesized using a small amount of speech data.
   Disadvantages: Quality of the synthesized speech. Factors which degrade the quality: vocoder, modeling accuracy, and over-smoothing.
6. PAPER TITLE: Unit selection in a concatenative speech synthesis system using a large speech database
   Aim of the Paper: The generation of natural-sounding synthesized speech waveforms.
   Advantages: Produces more natural speech.
   Disadvantages: There is little difference in the quality of output using the two training methods.
18. LITERATURE SURVEY (continued)
7. PAPER TITLE: An Unit Selection based Hindi Text To Speech Synthesis System Using Syllable as a Basic Unit
   Aim of the Paper: Improved naturalness in the synthesized speech and very high quality speech output when compared to other synthesizing techniques.
   Advantages: Reduces the prosody mismatch and spectral discontinuity that occur during syllable concatenation.
   Disadvantages: Large number of concatenation points. This results in glitches at the output, and the prosody mismatch and spectral discontinuity are hard to eliminate.
19. LITERATURE SURVEY (continued)
8. PAPER TITLE: A Common Attribute based Unified HTS framework for Speech Synthesis in Indian Languages
   Aim of the Paper: High-quality synthetic speech.
   Advantages: Concatenates pre-recorded speech units in the database such that the target and concatenation costs are minimized.
   Disadvantages: To obtain high-quality synthetic speech, the size of the database required is large, to ensure that sufficient examples of each unit in every possible context are available.
20. DATA TABLE
TABLE I: Degradation MOS (DMOS) and word error rate (WER) scores
Target language                       Marathi  Bengali  Tamil   Tamil   Telugu  Malayalam
Source language                       Hindi    Hindi    Tamil   Hindi   Tamil   Tamil
Number of hours of target language    3        2        3       3       3       3
DMOS                                  2.79     2.50     2.97    2.53    2.63    2.88
WER                                   3.48%    15.06%   6.61%   5.16%   16.14%  3.13%
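The slides do not say how the WER values in Table I were computed. Assuming the conventional definition (substitutions, deletions and insertions divided by the number of reference words), a minimal sketch is shown below; the function name and the whitespace tokenization are illustrative choices.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For a six-word reference in which a listener's transcription substitutes one word and drops another, the function returns 2/6, i.e. about 33%.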
22. MATHEMATICAL MODEL
Let I = Set of Language
I = {T, S}
Where,
T is the text which is the input and
S is the sound which is the output.
$D(I) = \arg\max_{o} \, p(o \mid w, \lambda)$
Where,
$\lambda$ represents the model parameters,
$o$ represents the speech parameters and
$w$ is the transcription of the test sentence
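As a toy illustration of the arg max above, assume that the model parameters are one independent one-dimensional Gaussian per phone label in w. Under that assumption the most likely speech parameter sequence o is simply the sequence of means. The model values and function names are hypothetical, and the sketch omits the state durations and dynamic features that HMM-based synthesis systems use.

```python
import math

# Hypothetical model parameters (lambda): (mean, variance) per phone label.
MODEL = {"a": (1.0, 0.25), "k": (-0.5, 0.10), "m": (0.2, 0.05)}


def log_likelihood(o, w, model):
    """log p(o | w, lambda) for independent 1-D Gaussians, one per label."""
    total = 0.0
    for value, label in zip(o, w):
        mean, var = model[label]
        total += -0.5 * (math.log(2 * math.pi * var)
                         + (value - mean) ** 2 / var)
    return total


def most_likely_output(w, model):
    """arg max over o of p(o | w, lambda): the per-label Gaussian means."""
    return [model[label][0] for label in w]
```

Any other candidate sequence, for example the means shifted by 0.1, gives a lower `log_likelihood` than the output of `most_likely_output` for the same transcription.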
23. Syllable Rules-
A syllable is a cluster of consonants and a vowel.
A syllable should contain one vowel and any number of consonants.
A single vowel can act as a syllable (i.e. V).
Possible patterns: V, C*V, V*C, C*V*C, C*C*V, C*C*C*V*C*C*C, etc.
A consonant before the vowel is called the "Onset", i.e. (C*V)
A consonant after the vowel is called the "Coda", i.e. (V*C)
Output = Pk
Where D(I) = dictionary function
Pk is phonetics
25. ADVANTAGES
• For people wanting to learn a new language
• For educational institutions looking to enhance student learning,
recall and comprehension
• For people wanting to learn through multiple mediums to solidify
learning
• For people with physical disabilities
• Difficulty handling a book or paper
• Visual issues (difficulty seeing text)
26. DISADVANTAGES
• Despite large improvements, Speech Synthesis can still sound a little
unnatural.
• The approaches to Speech Synthesis that yield the most natural
speech need considerable resources in terms of data storage and
processing power.
• Pronunciation analysis from written text is also a major problem.
27. APPLICATION
• Systems that provide voice synthesis output for blind users are
generally referred to as screen readers.
• Applications for the Blind
• Applications for the Deafened and Vocally Handicapped
• Educational Applications
28. CONCLUSION
This paper explores a syllable-based approach to building
language-independent text-to-speech systems for Indian languages. Using
a common phone set and a common question set, and borrowing
context-independent monophone models across languages together with the
syllable approach, makes the procedure easier and less time-consuming
without compromising the quality of the synthesized speech. Systems can
be built even without knowing the language, which is especially
beneficial in the Indian scenario.
29. REFERENCES
• [1] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 1, 1996, pp. 373–376.
• [2] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 3, pp. 1039–1064, November 2009.
• [3] A. Beyerlein, W. Byrne, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, and W. Wang, "Towards language independent acoustic modeling," in Proceeding on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 2000, pp. 1029–1032.
• [4] R. Bayeh, S. Lin, G. Chollet, and C. Mokbel, "Towards multilingual speech recognition using data driven source/target acoustical units association," in Acoustics, Speech, and Signal Processing, 2004. Proceedings ICASSP ’ , vol. , , pp. I–521–4.
• [5] V. B. Le and L. Besacier, "First steps in fast acoustic modeling for a new target language: Application to Vietnamese," in Acoustics, Speech, and Signal Processing, . Proceedings ICASSP ’ , vol. , , pp. –824.
• [6] P. Eswar, "A rule based approach for spotting characters from continuous speech in Indian languages," PhD Dissertation, Indian Institute of Technology, Department of Computer Science and Engg., Madras, India, 1991.