Speech synthesis is the artificial production of human speech using technology. A text-to-speech (TTS) system converts written text into spoken audio. TTS systems have two parts: a front-end that converts text into linguistic units and a back-end that converts these into speech. Different TTS systems use techniques such as formant synthesis, concatenative synthesis, and unit selection to generate speech. TTS is used in navigation systems, voice assistants, e-books, and screen readers to assist those with visual or reading impairments.
2. Speech synthesis
Speech synthesis is the artificial production
of human speech that sounds almost like a
human voice, with precise control of pitch,
speed, and tone.
An automated or AI-based system designed
for this purpose is called a text-to-speech
synthesizer and can be implemented in
software or hardware.
3. Architecture of TTS systems
[Diagram: a typical TTS pipeline with two stages.]
Text-to-phoneme module: text input in orthographic form is
normalized (drawing on an abbreviation lexicon), then passed
through grapheme-to-phoneme conversion (using an exceptions
lexicon and orthographic rules) to yield a phoneme string;
prosodic modelling (using grammar rules and a prosodic model)
then produces the phoneme string with prosodic annotation.
Phoneme-to-speech module: acoustic synthesis (by various
methods) converts the annotated phoneme string into the
synthetic speech output.
4. Speech Synthesis
Speech Synthesis is the artificial production of human
speech.
A synthesizer can incorporate a model of the vocal tract and
other human voice characteristics to create a completely
"synthetic" voice output.
A computer system used for this purpose is called a speech
computer or speech synthesizer.
A text-to-speech (TTS) system converts normal language
text into speech; other systems render symbolic linguistic
representations like phonetic transcriptions into speech.
5. Text-to-speech
A text-to-speech system (or "engine") is composed of two
parts: a front-end and a back-end.
The front-end first converts raw text containing symbols like
numbers and abbreviations into the equivalent of written-out
words (text normalization and tokenization); it then assigns a
phonetic transcription to each word (grapheme-to-phoneme
conversion) and divides and marks the text into prosodic units,
like phrases, clauses, and sentences.
The back-end, often referred to as the synthesizer, then
converts the symbolic linguistic representation into sound.
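The front-end stages can be sketched with a toy normalizer and lexicon lookup. The abbreviation table and the ARPAbet-style transcriptions below are illustrative assumptions, not from any real system:

```python
# Toy abbreviation lexicon and pronunciation lexicon (both assumed).
ABBREV = {"dr.": "doctor"}
LEXICON = {"doctor": "D AA K T ER", "who": "HH UW"}

def front_end(text):
    """Normalize tokens, then assign a phonetic transcription per word."""
    # Tokenization/normalization: expand abbreviations to written-out words
    words = [ABBREV.get(tok, tok) for tok in text.lower().split()]
    # Grapheme-to-phoneme step: lexicon lookup with a crude fallback
    # (a real system would apply letter-to-sound rules instead)
    phones = [LEXICON.get(w, w.upper()) for w in words]
    return words, phones

words, phones = front_end("Dr. Who")
# words  -> ['doctor', 'who']
# phones -> ['D AA K T ER', 'HH UW']
```

A real front-end also handles numbers, dates, and sentence segmentation; this sketch only shows the lookup structure.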
6. Types of voice synthesis systems
Two types are distinguished: text-to-speech and
concept-to-speech synthesis.
Concept-to-speech synthesis involves a generation
component that generates a textual expression
from semantic, pragmatic and discourse
knowledge. The speech signal can then be
generated from this expression.
In text-to-speech synthesis, the text to be spoken is
provided, not generated by the system. It must,
however, be analyzed and interpreted in order to
convey the proper pronunciation and emphasis.
7. Speech Synthesis for Translations
The synthesized speech can be controlled more precisely than human
speech, making it easier to produce an accurate rendition of the
original text.
It saves ample time and spares the labor of manual work, which can
be error-prone.
A translator using speech synthesis does not need to spend time
recording the translated text being read aloud, a significant time
saving for long or complex texts.
8. Speech sound variations
Pitch, length, loudness
Intonation (pitch)
essential to avoid monotonous robot-like voice
linked to basic syntax (e.g. statement vs. question), but also to
thematization (stress)
Pitch range is a sensitive issue
Rhythm (length)
Has to do with pace (natural tendency to slow down at the end of
an utterance)
Also need to pause at appropriate places
Linked (with pitch and loudness) to stress
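These prosodic rules can be sketched with a toy annotator. The rising-vs-falling boundary tone and the 1.3x final-lengthening factor are illustrative assumptions:

```python
def annotate_prosody(words):
    """Attach toy prosodic marks: a rising boundary tone for questions,
    a falling one otherwise, and lengthening of the final word
    (speech naturally slows at the end of an utterance)."""
    rising = words[-1].endswith("?")
    marks = []
    for i, w in enumerate(words):
        marks.append({
            "word": w.rstrip("?.!,"),
            # final lengthening on the last word only (assumed factor)
            "length": 1.3 if i == len(words) - 1 else 1.0,
        })
    marks[-1]["tone"] = "rise" if rising else "fall"
    return marks

ann = annotate_prosody(["is", "it", "ready?"])
# last word: {'word': 'ready', 'length': 1.3, 'tone': 'rise'}
```

Real prosodic models derive these values from syntax and discourse context rather than punctuation alone.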
10. Articulatory synthesis
Simulation of physical processes of human articulation
Wolfgang von Kempelen (1734-1804) and others used
bellows, reeds and tubes to construct mechanical
speaking machines
Modern versions simulate electronically the effect of
articulator positions, vocal tract shape, etc.
11. Formant synthesis
Reproduce the relevant characteristics of the
acoustic signal
In particular, amplitude and frequency of
formants
But also other resonances and noise, e.g. for
nasals, laterals, fricatives, etc.
Values of acoustic parameters are derived by rule
from phonetic transcription
Result is intelligible, but too “pure” and sounds
synthetic
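A very crude additive approximation of the idea: add energy at assumed textbook formant frequencies for the vowel /a/. Real formant synthesizers drive resonant filters from rule-derived parameters rather than summing plain sines, so this is a sketch of the principle only:

```python
import math

def formant_wave(f0, formants, dur=0.05, sr=16000):
    """Sum a sine at the fundamental with weaker sines at each
    formant frequency, then normalize to [-1, 1]."""
    n = int(dur * sr)
    samples = []
    for i in range(n):
        t = i / sr
        s = math.sin(2 * math.pi * f0 * t)                # voicing source
        for k, f in enumerate(formants, start=1):
            s += math.sin(2 * math.pi * f * t) / (2 * k)  # formant energy
        samples.append(s)
    peak = max(abs(x) for x in samples)
    return [x / peak for x in samples]

# Assumed values: f0 = 120 Hz; formants of /a/ roughly 700/1220/2600 Hz
wave = formant_wave(120, [700, 1220, 2600])
```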
12. Concatenative synthesis
Concatenate segments of pre-recorded natural
human speech
Requires database of previously recorded human
speech covering all the possible segments to be
synthesised
Segment might be phoneme, syllable, word,
phrase, or any combination
13. Concatenative synthesis
Input is phonemic representation + prosodic features
Diphone segments can be digitally manipulated for
length, pitch and loudness
Segment boundaries need to be smoothed to avoid
distortion
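The boundary smoothing can be illustrated with a linear crossfade over a few samples at the join; the segment values and overlap length below are made up:

```python
def crossfade(a, b, overlap=4):
    """Concatenate two segments, linearly blending the last `overlap`
    samples of `a` with the first `overlap` samples of `b` so the
    join has no abrupt jump."""
    out = list(a[:-overlap])
    for i in range(overlap):
        w = i / overlap                                   # weight ramps 0 -> 1
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

seg1 = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
seg2 = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
joined = crossfade(seg1, seg2)   # len(seg1) + len(seg2) - overlap samples
```

Production systems use more sophisticated signal-domain techniques (e.g. pitch-synchronous overlap-add), but the goal is the same: no audible discontinuity at segment boundaries.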
14. Diphone synthesis
Most systems use diphones because they are:
Manageable in number
Automatically extractable from recordings of
human speech
Able to capture most inter-allophonic variants
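Splitting an utterance into diphones can be sketched as pairing adjacent phonemes, padded with silence at the edges; the ARPAbet-style transcription of "cat" below is an assumption:

```python
def diphones(phonemes):
    """Return the diphone units for a phoneme sequence: each unit runs
    from the middle of one phoneme to the middle of the next, which is
    where the coarticulation transitions live. '_' marks silence."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]

units = diphones(["k", "ae", "t"])   # "cat"
# -> [('_', 'k'), ('k', 'ae'), ('ae', 't'), ('t', '_')]
```

For a language with N phonemes this gives at most roughly N^2 units, which is why the inventory stays manageable.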
15. Unit selection synthesis
Same idea as concatenative synthesis, but database
contains bigger variety of “units”
Multiple examples of phonemes (under different prosodic
conditions) are recorded
Selection of the appropriate unit therefore becomes more
complex, as the database contains competing candidates
for selection
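A greedy sketch of the selection step, choosing among competing candidates by a target cost plus a join cost; real systems search the whole candidate lattice (e.g. with Viterbi), and the toy database and pitch-only costs here are assumptions:

```python
def select_units(targets, database):
    """For each (phoneme, desired pitch) target, pick the candidate
    with the lowest target cost (pitch mismatch) plus join cost
    (pitch jump from the previously chosen unit)."""
    chosen = []
    prev_pitch = None
    for phon, want_pitch in targets:
        candidates = [u for u in database if u["phon"] == phon]
        def cost(u):
            target_cost = abs(u["pitch"] - want_pitch)
            join_cost = 0 if prev_pitch is None else abs(u["pitch"] - prev_pitch)
            return target_cost + join_cost
        best = min(candidates, key=cost)
        chosen.append(best)
        prev_pitch = best["pitch"]
    return chosen

# Toy database: several recordings of each phoneme at different pitches
db = [
    {"phon": "a", "pitch": 100}, {"phon": "a", "pitch": 140},
    {"phon": "b", "pitch": 105}, {"phon": "b", "pitch": 150},
]
picked = select_units([("a", 110), ("b", 115)], db)
# picks the 100 Hz /a/, then the 105 Hz /b/ (closest and smoothest join)
```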
16. Navigation and Voice Commands
Navigation systems and voice-activated assistants like Siri and Google
Assistant are prime examples of TTS software.
They convert text-based directions into speech, making it easier for
drivers to stay focused on the road.
The voice assistants offer voice commands for various tasks, such as
sending a text message or setting a reminder. This technology
benefits people unfamiliar with an area or who have trouble reading
maps.
17. Applications
TTS can read aloud web pages from a web browser or the Google
Toolbar; Text-to-voice, for example, is such an add-on for Firefox.
Some specialized software can narrate RSS-feeds.
Some e-book readers, such as the Amazon Kindle,
PocketBook eBook Reader Pro, and the Bebook Neo use
TTS.
GPS Navigation units use speech synthesis for automobile
navigation.
18. Use
Speech synthesizers are great to help in preparing
educational materials, such as audiobooks, audio
blogs and language-learning materials.
They also serve learners who prefer to listen to
material rather than read it, and educational content
creators can now produce materials for those with reading
impairments, such as dyslexia.
19. Use
The longest application has been in the use of screen readers
for people with visual impairment, but text-to-speech systems
are now commonly used by people with dyslexia and other
reading difficulties as well as by pre-literate children.
Speech synthesis techniques are also used in entertainment
productions such as games and animations.
In addition, speech synthesis is a valuable computational aid
for the analysis and assessment of speech disorders.
It can also be used as an educational tool for learning
different accents, as in Google Translate.
20. Limitations
Speech Synthesis can still sound a little unnatural.
The approaches to Speech Synthesis that yield the
most natural speech need considerable resources in
terms of data storage and processing power.
The process of tokenizing text is rarely
straightforward.
Many English words are spelled the same but pronounced
differently depending on context, which makes the correct
pronunciation difficult to determine automatically.
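The context-dependence problem shows up with homographs; a hypothetical lexicon keyed by part of speech illustrates why plain word lookup is not enough (the ARPAbet transcriptions are assumptions):

```python
# Hypothetical lexicon: the same spelling maps to different
# pronunciations depending on grammatical context.
HOMOGRAPHS = {
    ("read", "past"): "R EH D",      # "I read it yesterday"
    ("read", "present"): "R IY D",   # "I read every day"
    ("lead", "noun"): "L EH D",      # the metal
    ("lead", "verb"): "L IY D",      # to guide
}

def pronounce(word, context_tag):
    """Look up a pronunciation; without the context tag
    the choice would be ambiguous."""
    return HOMOGRAPHS[(word.lower(), context_tag)]

# pronounce("read", "past") -> "R EH D"
```

Resolving the tag requires part-of-speech analysis in the front-end, which is one reason tokenizing and transcribing text is rarely straightforward.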