3. INTRODUCTION
• Speech Synthesis is the artificial production of
human speech. A synthesizer can incorporate a model of
the vocal tract and other human voice characteristics to
create a completely "synthetic" voice output.
• A computer system used for this purpose is called
a speech computer or speech synthesizer.
• A text-to-speech (TTS) system converts normal
language text into speech; other systems
render symbolic linguistic representations like phonetic
transcriptions into speech.
4. HISTORY
• In 1779, the German-born scientist Christian Gottlieb
Kratzenstein, working at the Russian Academy of Sciences,
built models of the human vocal tract that could produce
the five long vowel sounds.
• Wolfgang von Kempelen added models of the tongue and
lips, enabling his machine to pronounce both vowels
and consonants.
• In 1837, Charles Wheatstone produced a "speaking
machine" based on von Kempelen's design, and in 1857,
Joseph Faber built the "Euphonia". Wheatstone's design was
resurrected in 1923 by Paget.
5. • In the 1930s, Bell Labs developed the vocoder, which
automatically analyzed speech into its fundamental tone
and resonances.
• Homer Dudley developed a keyboard-operated voice
synthesizer called the Voder (Voice Demonstrator),
which was derived from the vocoder.
• The first computer-based speech synthesis systems
were created in the late 1950s. The first general English
text-to-speech system was developed by Noriko
Umeda et al. in 1968 at the Electrotechnical Laboratory,
Japan.
[Figure: a speech synthesizer from the 1990s]
6. CONSTRUCTION
• A text-to-speech system (or "engine") is composed of two
parts: a front-end and a back-end.
• The front-end first converts raw text containing symbols like
numbers and abbreviations into the equivalent of written-
out words (tokenization, or text normalization); it then
assigns a phonetic transcription to each word
(grapheme-to-phoneme conversion) and divides and marks the
text into prosodic units, like phrases, clauses,
and sentences.
• The back-end—often referred to as the synthesizer—
then converts the symbolic linguistic representation into
sound.
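The two-stage front-end described above can be sketched as a toy pipeline. The dictionaries, function names, and ARPAbet-like transcriptions below are illustrative assumptions, not the API of any real engine:

```python
# Minimal sketch of a TTS front-end: normalization, then
# grapheme-to-phoneme lookup. All data here is a toy example.

ABBREVIATIONS = {"Dr.": "doctor", "St.": "saint"}
DIGITS = {"2": "two", "3": "three", "4": "four"}
LEXICON = {                      # toy pronunciation dictionary
    "doctor": "D AA K T ER",
    "two": "T UW",
    "cats": "K AE T S",
}

def normalize(text):
    """Step 1 (tokenization / text normalization): expand
    abbreviations and digits into written-out words."""
    out = []
    for tok in text.split():
        tok = ABBREVIATIONS.get(tok, tok)
        word = tok.strip(".,!?").lower()
        out.append(DIGITS.get(word, word))
    return out

def to_phonemes(words):
    """Step 2 (grapheme-to-phoneme conversion): look up each word's
    transcription; spell unknown words letter by letter."""
    return [LEXICON.get(w, " ".join(w.upper())) for w in words]

words = normalize("Dr. Smith has 2 cats.")
phones = to_phonemes(words)
# words -> ['doctor', 'smith', 'has', 'two', 'cats']
```

A real front-end would also mark prosodic units (phrase, clause, and sentence boundaries) in this symbolic representation before handing it to the back-end.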
8. APPROACHES
There are different approaches to speech synthesis, for
example: text-to-speech and concept-to-speech synthesis.
• Concept-to-speech synthesis involves a generation
component that generates a textual expression from
semantic, pragmatic and discourse knowledge. The
speech signal can then be generated from this
expression.
• In text-to-speech synthesis, the text to be spoken is
provided, not generated by the system. It must,
however, be analyzed and interpreted in order to convey
the proper pronunciation and emphasis.
10. • Concatenative synthesis is based on the concatenation
(or stringing together) of segments of recorded speech.
Generally, concatenative synthesis produces the most
natural-sounding synthesized speech.
• Formant synthesis does not use human speech samples
at runtime. Instead, the synthesized speech output is
created using additive synthesis and an acoustic model
(physical modelling synthesis). Parameters such
as fundamental frequency, voicing, and noise levels are
varied over time to create a waveform of artificial speech.
• Articulatory synthesis refers to computational techniques
for synthesizing speech based on models of the
human vocal tract and the articulation processes
occurring there.
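The formant-synthesis idea above — building speech from parameters rather than recordings — can be sketched with simple additive synthesis. The formant frequencies and bandwidths below are rough, illustrative values for an /a/-like vowel, not measurements:

```python
import numpy as np

def formant_vowel(f0=120.0, formants=((800, 80), (1150, 90), (2900, 120)),
                  dur=0.5, sr=16000):
    """Additive formant synthesis sketch: sum harmonics of the
    fundamental f0, weighting each harmonic by a resonance envelope
    centred on the formant frequencies."""
    t = np.arange(int(dur * sr)) / sr
    def envelope(freq):
        # Gaussian-shaped resonance peak for each formant
        return sum(np.exp(-0.5 * ((freq - fc) / bw) ** 2) for fc, bw in formants)
    wave = np.zeros_like(t)
    n = 1
    while n * f0 < sr / 2:                 # sum harmonics up to Nyquist
        wave += envelope(n * f0) * np.sin(2 * np.pi * n * f0 * t)
        n += 1
    return wave / np.max(np.abs(wave))     # normalize to [-1, 1]

tone = formant_vowel()
```

A real formant synthesizer would additionally vary f0, voicing, and noise levels over time, as the slide notes, rather than holding them constant.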
11. • HMM-based synthesis, also called statistical
parametric synthesis, is a method based on hidden
Markov models. In this system, the frequency
spectrum (vocal tract), fundamental frequency (voice
source), and duration (prosody) of speech are modeled
simultaneously by HMMs.
• Sinewave synthesis is a technique for synthesizing
speech by replacing the formants (main bands of energy)
with pure tone whistles.
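Sinewave synthesis can be sketched even more simply: one pure tone per formant, and nothing else. The frequencies below are illustrative values for an /i/-like vowel, held constant here, whereas a real system would let them vary over time to follow the formant tracks:

```python
import numpy as np

def sinewave_speech(formant_freqs, dur=0.4, sr=16000):
    """Sinewave synthesis sketch: replace each formant with a single
    pure tone at its centre frequency and sum the tones."""
    t = np.arange(int(dur * sr)) / sr
    wave = sum(np.sin(2 * np.pi * f * t) for f in formant_freqs)
    return wave / len(formant_freqs)       # keep amplitude in [-1, 1]

# Three tones standing in for the first three formants (assumed values)
sw = sinewave_speech([270, 2290, 3010])
```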
12. SYSTEMS PROVIDING SPEECH
SYNTHESIS
• Apple uses the VoiceOver speech engine in Mac OS on its
laptops and in iOS on iPhones, iPads and iPods.
• Modern Windows desktop systems can use SAPI
4 and SAPI 5 components to support speech synthesis
and speech recognition. Microsoft Speech Server is a
server-based package for voice synthesis and
recognition. It is designed for network use with web
applications and call centers.
• The Mattel Intellivision game console offered
the Intellivoice Voice Synthesis module in 1982. It
included the SP0256 Narrator speech synthesizer chip
on a removable cartridge.
13. SYSTEMS PROVIDING TEXT TO
SPEECH SYNTHESIS
• From version 1.6, Android has included support for speech
synthesis (TTS).
• A number of applications and add-ons can read web
pages aloud from a web browser or the Google Toolbar,
such as Text-to-voice, an add-on for Firefox.
• Some specialized software can narrate RSS-feeds.
• Some e-book readers, such as the Amazon
Kindle, PocketBook eBook Reader Pro, and the BeBook
Neo, use TTS.
• GPS Navigation units use speech synthesis for
automobile navigation.
15. MARKUP LANGUAGES ON
SPEECH SYNTHESIS
• A number of markup languages have been established
for the rendition of text as speech in an XML-compliant
format. The most recent is the Speech Synthesis Markup
Language (SSML), which became a W3C
recommendation in 2004.
• Older speech synthesis markup languages include Java
Speech Markup Language (JSML) and SABLE.
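A minimal SSML document looks like the fragment below. The structure follows the W3C SSML 1.0 recommendation, though support for specific `interpret-as` values varies by processor:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your balance is
  <say-as interpret-as="currency">$42.50</say-as>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+10%">Thank you for calling.</prosody>
</speak>
```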
16. APPLICATIONS
• The longest-standing application has been in screen
readers for people with visual impairment, but text-to-speech
systems are now commonly used by people with dyslexia and
other reading difficulties, as well as by pre-literate children.
• Speech synthesis techniques are also used in entertainment
productions such as games and animations.
• In addition, speech synthesis is a valuable computational aid
for the analysis and assessment of speech disorders.
• It can also be used as an educational tool to learn different
accents, as in Google Translate.
17. CHALLENGES
• Despite large improvements, synthesized speech can still
sound somewhat unnatural.
• The approaches to speech synthesis that yield the most
natural speech require considerable resources in terms of
data storage and processing power.
• The process of tokenizing text is rarely straightforward.
Many English spellings are pronounced differently
depending on context (heteronyms such as "read" or
"lead"), which makes correct pronunciation hard to
determine automatically.
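The heteronym problem can be illustrated with a toy disambiguation rule. The cue words and transcriptions below are illustrative assumptions; real front-ends use part-of-speech tagging and statistical models rather than keyword lists:

```python
# Toy sketch of heteronym disambiguation for the word "read".

PRONUNCIATIONS = {
    ("read", "past"):    "R EH D",   # "I read it yesterday"
    ("read", "present"): "R IY D",   # "I read every day"
}

PAST_CUES = {"yesterday", "already", "had", "have"}

def disambiguate_read(sentence):
    """Guess the tense of 'read' from crude lexical cues."""
    words = set(sentence.lower().strip(".").split())
    tense = "past" if PAST_CUES & words else "present"
    return PRONUNCIATIONS[("read", tense)]

print(disambiguate_read("I read the book yesterday"))   # R EH D
print(disambiguate_read("I read the paper every day"))  # R IY D
```

Even this tiny rule fails easily ("I have to read it" triggers the past-tense cue), which is exactly why context-dependent pronunciation remains a genuine challenge.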