Speech Recognition and Speech Synthesis on iOS

http://sysrun.haifa.il.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html

@peterfriese
peter.friese@zuehlke.com
xing.to/peter
http://peterfriese.de
Peter Friese

Ever since we began using computers,
we have dreamt of using
spoken language
to communicate with them

SPEECH SYNTHESIS
SPEECH RECOGNITION

Speech Synthesis
is the artificial production of
human speech

Speech synthesis: History
1769: Speaking machine, by Wolfgang von Kempelen (he also developed the
famous Mechanical Turk)
Functional representation of the human vocal tract.
http://www.youtube.com/watch?v=zYRVqrfY3tQ
1939: Voder, based on the Vocoder (voice encoder), developed by Homer Dudley at Bell Labs.
Needed to be played (using a keyboard) by a trained operator.
Exhibited at the 1939 World's Fair.
http://www.youtube.com/watch?v=CyaK22DMfF0
1970: Vocoder, custom built for Kraftwerk.
http://www.youtube.com/watch?v=w-Jq7BHtQMA

Speech Synthesis
Most modern speech synthesis
systems use electronic /
computerized approaches

Text to speech (TTS)
Text -> Front end -> Back end -> Speech
In modern TTS systems, speech synthesis is a
multi-step process that is divided into two
main parts:
1) Front end (analysis)
2) Back end (synthesis)

Text to speech (TTS)
Front end: Text -> Text analysis -> Words -> Linguistic analysis (phrasing, intonation, duration) -> Phonemes
Back end: Phonemes -> Waveform generation -> Speech

TTS: Analysis
Text normalization challenges
My latest project is to
learn how to better
project my voice

TTS: Analysis
Text normalization challenges
1430
Half past two
one - four - three - zero
Fourteen hundred and thirty
One thousand four hundred thirty
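
The ambiguity above can be made concrete with a small sketch. It assumes the front end has already decided which reading applies (digit string, cardinal number, or "hundreds" style) and only shows how differently the same token "1430" is then spelled out. All function names are hypothetical, the cardinal and hundreds readings only cover 0..9999 and 100..9999 respectively, and every output buffer is assumed to have room for at least one byte.

```c
#include <stddef.h>
#include <string.h>

static const char *ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"};
static const char *TENS[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"};

/* Append one word to out, separated by a space; silently truncates. */
static void append_word(char *out, size_t size, const char *word) {
    if (out[0] != '\0') strncat(out, " ", size - strlen(out) - 1);
    strncat(out, word, size - strlen(out) - 1);
}

/* Spell a value in the range 0..99. */
static void spell_below_100(int n, char *out, size_t size) {
    if (n < 20) { append_word(out, size, ONES[n]); return; }
    append_word(out, size, TENS[n / 10]);
    if (n % 10 != 0) append_word(out, size, ONES[n % 10]);
}

/* Reading 1: digit by digit (PIN, phone number). Non-digits are skipped. */
void read_as_digits(const char *text, char *out, size_t size) {
    out[0] = '\0';
    for (; *text != '\0'; ++text)
        if (*text >= '0' && *text <= '9')
            append_word(out, size, ONES[*text - '0']);
}

/* Reading 2: cardinal number, valid for 0..9999. */
void read_as_cardinal(int n, char *out, size_t size) {
    out[0] = '\0';
    if (n == 0) { append_word(out, size, ONES[0]); return; }
    if (n >= 1000) {
        append_word(out, size, ONES[n / 1000]);
        append_word(out, size, "thousand");
        n %= 1000;
    }
    if (n >= 100) {
        append_word(out, size, ONES[n / 100]);
        append_word(out, size, "hundred");
        n %= 100;
    }
    if (n > 0) spell_below_100(n, out, size);
}

/* Reading 3: "hundreds" style (as used for years), valid for 100..9999. */
void read_as_hundreds(int n, char *out, size_t size) {
    out[0] = '\0';
    spell_below_100(n / 100, out, size);
    append_word(out, size, "hundred");
    if (n % 100 != 0) {
        append_word(out, size, "and");
        spell_below_100(n % 100, out, size);
    }
}
```

Calling the three functions on 1430 reproduces three of the four readings on the slide. The fourth ("half past two") additionally requires knowing that the token is a time of day, which is exactly the context decision a real front end has to make.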

TTS: Analysis
Text to phoneme challenges
read
Red
Reed

TTS: Synthesis
1) Concatenative synthesis
2) Formant synthesis

TTS: Concatenative synthesis
Base strategy: Concatenate segments of recorded speech
Unit selection synthesis: uses phones, diphones, half-phones, syllables,
morphemes, words, phrases and sentences. Best results, often
indistinguishable from human speech. Requires a huge amount of
prerecorded data.
Diphone synthesis: uses a minimal database containing all diphones of a
natural language (English: 800 diphones, German: 2500 diphones).
Disadvantage: sonic glitches. Still used commercially, but on the decline.
Domain-specific synthesis: concatenates prerecorded words and
sentences. Used in transport schedule announcements, weather reports,...
Simple to implement. High level of naturalness.
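
To illustrate what a diphone database is indexed by, here is a minimal sketch that turns a phoneme sequence into the list of diphones (transitions between neighbouring phonemes) a diphone synthesizer would have to look up and concatenate. The function name, the "#" marker for silence at the utterance boundaries, and the phoneme symbols in the usage note are assumptions made for this example, not taken from any particular engine.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes the diphones for n phonemes into out as space-separated "a-b"
 * pairs and returns how many were written. "#" stands for the silence
 * before the first and after the last phoneme, so n phonemes yield n + 1
 * diphones. Stops early (keeping only complete pairs) if out is too small. */
int to_diphones(const char *phonemes[], int n, char *out, size_t size) {
    int count = 0;
    size_t used = 0;
    if (size == 0) return 0;
    out[0] = '\0';
    if (n <= 0) return 0;
    for (int i = 0; i <= n; ++i) {
        const char *left = (i == 0) ? "#" : phonemes[i - 1];
        const char *right = (i == n) ? "#" : phonemes[i];
        int written = snprintf(out + used, size - used, "%s%s-%s",
                               count > 0 ? " " : "", left, right);
        if (written < 0 || (size_t)written >= size - used) {
            out[used] = '\0'; /* drop the partially written pair */
            break;
        }
        used += (size_t)written;
        ++count;
    }
    return count;
}
```

For the four (hypothetical) phonemes h, e, l, o the function yields the five diphones "#-h h-e e-l l-o o-#". A synthesizer would then fetch one recorded snippet per pair and join them, which is where the sonic glitches mentioned above can arise.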

TTS: Formant synthesis
Formant: spectral peak of the sound spectrum of the voice.
It is sufficient to reproduce the first two (of 4) formants to be able to
distinguish vowels.
Can be implemented quite easily, but produces rather artificial-sounding
output (“computer voice”).
English
Vowel   Formant f1   Formant f2
i       240 Hz       2400 Hz
e       390 Hz       2300 Hz
o       360 Hz       640 Hz
German
Vowel   Formant f1   Formant f2
i       320 Hz       3200 Hz
e       500 Hz       2300 Hz
o       500 Hz       1000 Hz
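
The claim that two formants are enough to tell vowels apart can be checked with a small sketch. It uses the English values from the slide (i: 240/2400 Hz, e: 390/2300 Hz, o: 360/640 Hz) and assigns a measured (f1, f2) pair to the nearest table entry. This is the reverse direction of synthesis and only a toy nearest-neighbour lookup; the struct and function names are made up for this example.

```c
#include <stddef.h>

typedef struct {
    char vowel;
    double f1_hz;
    double f2_hz;
} Formants;

/* English formant values as listed on the slide. */
static const Formants ENGLISH[] = {
    {'i', 240.0, 2400.0},
    {'e', 390.0, 2300.0},
    {'o', 360.0, 640.0},
};

/* Returns the vowel whose (f1, f2) pair is closest to the measured pair,
 * using squared Euclidean distance in Hz. */
char classify_vowel(double f1_hz, double f2_hz) {
    char best = '?';
    double best_dist = -1.0;
    for (size_t k = 0; k < sizeof ENGLISH / sizeof ENGLISH[0]; ++k) {
        double d1 = f1_hz - ENGLISH[k].f1_hz;
        double d2 = f2_hz - ENGLISH[k].f2_hz;
        double dist = d1 * d1 + d2 * d2;
        if (best_dist < 0.0 || dist < best_dist) {
            best_dist = dist;
            best = ENGLISH[k].vowel;
        }
    }
    return best;
}
```

A measurement of 380 Hz / 700 Hz, for instance, lands on 'o'. Note that "i" and "e" share almost the same f2 and differ mainly in f1, so neither formant alone would separate all three vowels.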

TTS: Synthesis
               Concatenative                 Formant
Advantages     • High level of naturalness   • No large database required
                                             • Very intelligible, also at high speeds
Disadvantages  • Requires large database     • Low level of naturalness (“robotic” sound)

TTS SDKs
• Siri
• iOS Voice Services
• Flite
• OpenEars (based on Flite)
• iSpeech
• Nuance
• AT&T
• Google TTS
• Bing TTS

Using iOS Voice Services
Private API: Not safe for the App Store - use at your own risk!
VSSpeechSynthesizer *speech =
[[NSClassFromString(@"VSSpeechSynthesizer") alloc] init];
[speech setRate:(float)1.0];
[speech startSpeakingString:@"Hello world, how are you"];

OpenEars SDK
URL: http://www.politepix.com/openears/
Shared Source
Based on CMU Pocketsphinx, CMU Flite, and CMU-CLMTK
Works offline, both for recognition and synthesis
Currently only supports English
Synthetic sound (diphone voice synthesis)
Pricing: free, with additional paid voices

iSpeech SDK
URL: http://www.ispeech.org
Commercial, free access for testing
Needs a server connection
Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese,
Japanese, Danish, Finnish, Italian, German, Russian, ...
Synthetic sound (diphone voice synthesis)
Pricing:
pay per use ($0.02 per transaction (TX))
pay per install ($0.25 per install, minimum 10,000 installs)
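
Which of the two plans is cheaper depends on how often each installed copy uses the service. The following sketch works that out under two assumptions that are not stated on the slide: that the plans are alternatives, and that the 10,000-install minimum is billed even when fewer copies are installed. The function names are hypothetical.

```c
/* Transactions per install at which both plans cost the same. */
double break_even_transactions(double price_per_install,
                               double price_per_transaction) {
    return price_per_install / price_per_transaction;
}

/* Total cost of the per-install plan; at least min_installs are billed
 * (assumed interpretation of "minimum 10,000 installs"). */
double per_install_cost(double price_per_install, long installs,
                        long min_installs) {
    long billed = installs > min_installs ? installs : min_installs;
    return price_per_install * (double)billed;
}

/* Total cost of the per-use plan. */
double per_use_cost(double price_per_transaction, long installs,
                    double transactions_per_install) {
    return price_per_transaction * (double)installs * transactions_per_install;
}
```

With the prices above, break_even_transactions(0.25, 0.02) gives 12.5, so pay per use is the cheaper option only while a typical install stays below about a dozen requests over its lifetime. Under the assumed minimum, the per-install plan also starts at $2,500 regardless of actual installs.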

AT&T Speech SDK
URL: http://developer.att.com
Commercial, free trial access for 90 days
Pricing: USD 99 / year grants 1,000,000 API calls per month
TTS API:
Web Service:
send text, get WAV back
Voices:
US English (male / female)
US Spanish (male / female)

Nuance
URL: http://dragonmobile.nuancemobiledeveloper.com/
Rather natural sound
Pricing:
Several Service Levels (Silver, Gold, Emerald)
Silver:
Up to 20 TX per device per day, max 500,000 devices
Gold:
Pay per device ($0.24 per install)
Pay per transaction ($0.009 per tx)
Pre-payment of at least $3000

Speech Recognition
is the translation of spoken
words into text.

Speech recognition: History
1952: “Audrey” developed at Bell Labs. Could recognize digits spoken by a
single voice.
1962: “Shoebox” by IBM, demonstrated at the World's Fair. Could recognize 16
words spoken in English.
http://sysrun.haifa.il.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html
1970s: DARPA Speech Understanding Research program. “Harpy”, developed at
Carnegie Mellon University (could understand 1011 words).
http://www.youtube.com/watch?v=N3i6NoUZsSw
1980s: By using statistical models (Hidden Markov Models), ASR vocabularies
grew from a few hundred words, via several thousand words, to
a potentially unlimited number of words. Still, discrete dictation was
required.
1990s: Dragon NaturallySpeaking (originally at $9000) supports continuous
speech recognition.

Speech recognition
(Analog) speech -> Preprocessing -> Recognition (Decoder) -> Candidates -> Text
The decoder draws on: Acoustic model, Dictionary, Language model

Speech recognition
Acoustic model: States -> Phonemes (e.g. /’h/ -> /a/)
Dictionary: Phonemes -> Words (e.g. "how", "will")
Language model: Words -> Sentences (e.g. "how will the weather be tomorrow / today", "show me")
Speech Recognition SDKs
• Siri
• Flite
• iSpeech
• Nuance
• AT&T
• Google TTS
• Bing TTS


OpenEars SDK
URL: http://www.politepix.com/openears/
Shared Source
Based on CMU Pocketsphinx, CMU Flite, and CMU-CLMTK
Works offline, both for recognition and synthesis
Vocabulary: needs to be provided by developer
Currently only supports English
Pricing: free, with additional paid voices

iSpeech SDK
URL: http://www.ispeech.org
Pricing:
pay per use ($0.02 per transaction (TX))
pay per install ($0.25 per install, minimum 10,000 installs)

AT&T Speech SDK
URL: http://developer.att.com
Commercial, free trial access for 90 days
Pricing: USD 99 / year grants 1,000,000 API calls per month
Supports several recognition contexts:
Gaming, Social Media, Web Search, Business Search, Voicemail to Text,
SMS, Question and Answer, TV, Generic
Support for command mode:
provide set of commands that are allowed in your app. Supports 19
languages (including English, German, Mandarin, Japanese, French,
Italian)

Nuance
URL: http://dragonmobile.nuancemobiledeveloper.com/
Supports several languages: English (US, UK), Spanish, Chinese, ...
Pricing:
Several Service Levels (Silver, Gold, Emerald)
Silver:
Up to 20 TX per device per day, max 500,000 devices
Gold:
Pay per device ($0.24 per install)
Pay per transaction ($0.009 per tx)
Pre-payment of at least $3000

Multi-modal UIs
Pixeltone
http://www.gierad.com/projects/pixeltone-a-multimodal-interface-for-image-editing/

Zühlke. Empowering Ideas.
@peterfriese
http://www.zuehlke.com
Want to learn more? Get in touch - I’m available for consulting:

http://slidesha.re/15xNxpf
