Speech synthesis is the artificial production of human speech using technology. A text-to-speech (TTS) system converts written text into spoken audio. TTS systems have two parts: a front-end that converts text into linguistic units and a back-end that converts these into speech. Different TTS systems use techniques such as formant synthesis, concatenative synthesis, and unit selection to generate speech. TTS is used in navigation systems, voice assistants, e-books, and screen readers to assist those with visual or reading impairments.
2. Speech synthesis
Speech synthesis is the artificial production
of human speech that sounds almost like a
human voice, with precise control of pitch,
speed, and tone.
An automated or AI-based system designed
for this purpose is called a text-to-speech
synthesizer and can be implemented in
software or hardware.
3. Architecture of TTS systems
[Diagram: a typical TTS pipeline with two stages.]
Text-to-phoneme module: text input in orthographic form is
normalized (drawing on an abbreviation lexicon), then passed
through grapheme-to-phoneme conversion (using an exceptions
lexicon and orthographic rules) to yield a phoneme string;
prosodic modelling (using grammar rules and a prosodic model)
then produces the phoneme string with prosodic annotation.
Phoneme-to-speech module: acoustic synthesis (by various
methods) converts the annotated phoneme string into the
synthetic speech output.
4. Speech Synthesis
Speech Synthesis is the artificial production of human
speech.
A synthesizer can incorporate a model of the vocal tract and
other human voice characteristics to create a completely
"synthetic" voice output.
A computer system used for this purpose is called a speech
computer or speech synthesizer.
A text-to-speech (TTS) system converts normal language
text into speech; other systems render symbolic linguistic
representations like phonetic transcriptions into speech.
5. Text-to-speech
A text-to-speech system (or "engine") is composed of two
parts: a front-end and a back-end.
The front-end first converts raw text containing symbols like
numbers and abbreviations into the equivalent of written-out
words (text normalization and tokenization); it then assigns a
phonetic transcription to each word (grapheme-to-phoneme
conversion) and divides and marks the text into prosodic units,
like phrases, clauses, and sentences.
The back-end, often referred to as the synthesizer, then
converts the symbolic linguistic representation into sound.
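The front-end stages can be sketched with a toy normalizer and lexicon lookup. The abbreviation table and the ARPAbet-style transcriptions below are illustrative assumptions, not from any real system:

```python
# Toy abbreviation lexicon and pronunciation lexicon (both assumed).
ABBREV = {"dr.": "doctor"}
LEXICON = {"doctor": "D AA K T ER", "who": "HH UW"}

def front_end(text):
    """Normalize tokens, then assign a phonetic transcription per word."""
    # Tokenization/normalization: expand abbreviations to written-out words
    words = [ABBREV.get(tok, tok) for tok in text.lower().split()]
    # Grapheme-to-phoneme step: lexicon lookup with a crude fallback
    # (a real system would apply letter-to-sound rules instead)
    phones = [LEXICON.get(w, w.upper()) for w in words]
    return words, phones

words, phones = front_end("Dr. Who")
# words  -> ['doctor', 'who']
# phones -> ['D AA K T ER', 'HH UW']
```

A real front-end also handles numbers, dates, and sentence segmentation; this sketch only shows the lookup structure.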
6. Types of voice synthesis systems
Two types are distinguished: text-to-speech and
concept-to-speech synthesis.
Concept-to-speech synthesis involves a generation
component that generates a textual expression
from semantic, pragmatic and discourse
knowledge. The speech signal can then be
generated from this expression.
In text-to-speech synthesis, the text to be spoken is
provided, not generated by the system. It must,
however, be analyzed and interpreted in order to
convey the proper pronunciation and emphasis.
7. Speech Synthesis for Translations
The synthesized speech can be controlled more precisely than human
speech, making it easier to produce an accurate rendition of the
original text.
It saves ample time and spares the labor of manual work, which can
be error-prone.
A translator using speech synthesis does not need to spend time
recording the translated text being read aloud, a significant time
saving for long or complex texts.
8. Speech sound variations
Pitch, length, loudness
Intonation (pitch)
essential to avoid monotonous robot-like voice
linked to basic syntax (e.g. statement vs. question), but also to
thematization (stress)
Pitch range is a sensitive issue
Rhythm (length)
Has to do with pace (natural tendency to slow down at the end of
an utterance)
Also need to pause at appropriate places
Linked (with pitch and loudness) to stress
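These prosodic rules can be sketched with a toy annotator. The rising-vs-falling boundary tone and the 1.3x final-lengthening factor are illustrative assumptions:

```python
def annotate_prosody(words):
    """Attach toy prosodic marks: a rising boundary tone for questions,
    a falling one otherwise, and lengthening of the final word
    (speech naturally slows at the end of an utterance)."""
    rising = words[-1].endswith("?")
    marks = []
    for i, w in enumerate(words):
        marks.append({
            "word": w.rstrip("?.!,"),
            # final lengthening on the last word only (assumed factor)
            "length": 1.3 if i == len(words) - 1 else 1.0,
        })
    marks[-1]["tone"] = "rise" if rising else "fall"
    return marks

ann = annotate_prosody(["is", "it", "ready?"])
# last word: {'word': 'ready', 'length': 1.3, 'tone': 'rise'}
```

Real prosodic models derive these values from syntax and discourse context rather than punctuation alone.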
10. Articulatory synthesis
Simulation of physical processes of human articulation
Wolfgang von Kempelen (1734-1804) and others used
bellows, reeds and tubes to construct mechanical
speaking machines
Modern versions simulate electronically the effect of
articulator positions, vocal tract shape, etc.
11. Formant synthesis
Reproduce the relevant characteristics of the
acoustic signal
In particular, amplitude and frequency of
formants
But also other resonances and noise, e.g. for
nasals, laterals, fricatives, etc.
Values of acoustic parameters are derived by rule
from phonetic transcription
Result is intelligible, but too “pure” and sounds
synthetic
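A very crude additive approximation of the idea: add energy at assumed textbook formant frequencies for the vowel /a/. Real formant synthesizers drive resonant filters from rule-derived parameters rather than summing plain sines, so this is a sketch of the principle only:

```python
import math

def formant_wave(f0, formants, dur=0.05, sr=16000):
    """Sum a sine at the fundamental with weaker sines at each
    formant frequency, then normalize to [-1, 1]."""
    n = int(dur * sr)
    samples = []
    for i in range(n):
        t = i / sr
        s = math.sin(2 * math.pi * f0 * t)                # voicing source
        for k, f in enumerate(formants, start=1):
            s += math.sin(2 * math.pi * f * t) / (2 * k)  # formant energy
        samples.append(s)
    peak = max(abs(x) for x in samples)
    return [x / peak for x in samples]

# Assumed values: f0 = 120 Hz; formants of /a/ roughly 700/1220/2600 Hz
wave = formant_wave(120, [700, 1220, 2600])
```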
12. Concatenative synthesis
Concatenate segments of pre-recorded natural
human speech
Requires database of previously recorded human
speech covering all the possible segments to be
synthesised
Segment might be phoneme, syllable, word,
phrase, or any combination
13. Concatenative synthesis
Input is phonemic representation + prosodic features
Diphone segments can be digitally manipulated for
length, pitch and loudness
Segment boundaries need to be smoothed to avoid
distortion
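The boundary smoothing can be illustrated with a linear crossfade over a few samples at the join; the segment values and overlap length below are made up:

```python
def crossfade(a, b, overlap=4):
    """Concatenate two segments, linearly blending the last `overlap`
    samples of `a` with the first `overlap` samples of `b` so the
    join has no abrupt jump."""
    out = list(a[:-overlap])
    for i in range(overlap):
        w = i / overlap                                   # weight ramps 0 -> 1
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

seg1 = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
seg2 = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
joined = crossfade(seg1, seg2)   # len(seg1) + len(seg2) - overlap samples
```

Production systems use more sophisticated signal-domain techniques (e.g. pitch-synchronous overlap-add), but the goal is the same: no audible discontinuity at segment boundaries.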
14. Diphone synthesis
Most systems use diphones because they are:
Manageable in number
Automatically extractable from recordings of
human speech
Able to capture most inter-allophonic variants
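Splitting an utterance into diphones can be sketched as pairing adjacent phonemes, padded with silence at the edges; the ARPAbet-style transcription of "cat" below is an assumption:

```python
def diphones(phonemes):
    """Return the diphone units for a phoneme sequence: each unit runs
    from the middle of one phoneme to the middle of the next, which is
    where the coarticulation transitions live. '_' marks silence."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]

units = diphones(["k", "ae", "t"])   # "cat"
# -> [('_', 'k'), ('k', 'ae'), ('ae', 't'), ('t', '_')]
```

For a language with N phonemes this gives at most roughly N^2 units, which is why the inventory stays manageable.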
15. Unit selection synthesis
Same idea as concatenative synthesis, but database
contains bigger variety of “units”
Multiple examples of phonemes (under different prosodic
conditions) are recorded
Selection of the appropriate unit therefore becomes more
complex, as the database contains competing candidates
for selection
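A greedy sketch of the selection step, choosing among competing candidates by a target cost plus a join cost; real systems search the whole candidate lattice (e.g. with Viterbi), and the toy database and pitch-only costs here are assumptions:

```python
def select_units(targets, database):
    """For each (phoneme, desired pitch) target, pick the candidate
    with the lowest target cost (pitch mismatch) plus join cost
    (pitch jump from the previously chosen unit)."""
    chosen = []
    prev_pitch = None
    for phon, want_pitch in targets:
        candidates = [u for u in database if u["phon"] == phon]
        def cost(u):
            target_cost = abs(u["pitch"] - want_pitch)
            join_cost = 0 if prev_pitch is None else abs(u["pitch"] - prev_pitch)
            return target_cost + join_cost
        best = min(candidates, key=cost)
        chosen.append(best)
        prev_pitch = best["pitch"]
    return chosen

# Toy database: several recordings of each phoneme at different pitches
db = [
    {"phon": "a", "pitch": 100}, {"phon": "a", "pitch": 140},
    {"phon": "b", "pitch": 105}, {"phon": "b", "pitch": 150},
]
picked = select_units([("a", 110), ("b", 115)], db)
# picks the 100 Hz /a/, then the 105 Hz /b/ (closest and smoothest join)
```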
16. Navigation and Voice Commands
Navigation systems and voice-activated assistants like Siri and Google
Assistant are prime examples of TTS software.
They convert text-based directions into speech, making it easier for
drivers to stay focused on the road.
The voice assistants offer voice commands for various tasks, such as
sending a text message or setting a reminder. This technology
benefits people unfamiliar with an area or who have trouble reading
maps.
17. Applications
TTS can read aloud web pages from a web browser or the Google
Toolbar; Text-to-voice, for example, is such an add-on for Firefox.
Some specialized software can narrate RSS-feeds.
Some e-book readers, such as the Amazon Kindle,
PocketBook eBook Reader Pro, and the Bebook Neo use
TTS.
GPS Navigation units use speech synthesis for automobile
navigation.
18. Use
Speech synthesizers are great to help in preparing
educational materials, such as audiobooks, audio
blogs and language-learning materials.
They also serve learners who prefer to listen to
material rather than read it, and educational content
creators can now produce materials for those with reading
impairments, such as dyslexia.
19. Use
The longest application has been in the use of screen readers
for people with visual impairment, but text-to-speech systems
are now commonly used by people with dyslexia and other
reading difficulties as well as by pre-literate children.
Speech synthesis techniques are also used in entertainment
productions such as games and animations.
In addition, speech synthesis is a valuable computational aid
for the analysis and assessment of speech disorders.
It can also be used as an educational tool for learning
different accents, as in Google Translate.
20. Limitations
Speech Synthesis can still sound a little unnatural.
The approaches to Speech Synthesis that yield the
most natural speech need considerable resources in
terms of data storage and processing power.
The process of tokenizing text is rarely
straightforward.
Many English words are spelled the same but pronounced
differently depending on context, which makes the correct
pronunciation difficult to determine automatically.
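The context-dependence problem shows up with homographs; a hypothetical lexicon keyed by part of speech illustrates why plain word lookup is not enough (the ARPAbet transcriptions are assumptions):

```python
# Hypothetical lexicon: the same spelling maps to different
# pronunciations depending on grammatical context.
HOMOGRAPHS = {
    ("read", "past"): "R EH D",      # "I read it yesterday"
    ("read", "present"): "R IY D",   # "I read every day"
    ("lead", "noun"): "L EH D",      # the metal
    ("lead", "verb"): "L IY D",      # to guide
}

def pronounce(word, context_tag):
    """Look up a pronunciation; without the context tag
    the choice would be ambiguous."""
    return HOMOGRAPHS[(word.lower(), context_tag)]

# pronounce("read", "past") -> "R EH D"
```

Resolving the tag requires part-of-speech analysis in the front-end, which is one reason tokenizing and transcribing text is rarely straightforward.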