Speech Recognition and Speech Synthesis on iOS

Ever since we started using computers, people have dreamt of interacting with them in a natural way, using spoken language. Hardly any science fiction flick gets by without humans talking to computers or even androids. With Siri, Apple has brought this capability to iOS. Many developers hoped to use Siri in their apps, but so far Apple hasn’t provided us with an API. In this talk, I will show how to use speech recognition and speech synthesis in native iOS applications without having to jailbreak your device. We will take a look at some libraries (both open source and commercial) that allow us to build speech-enabled apps with little effort.

Published in: Technology

  1. Speech Recognition and Speech Synthesis on iOS
     http://sysrun.haifa.il.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html
  2. Peter Friese, @peterfriese, peter.friese@zuehlke.com, xing.to/peter, http://peterfriese.de
  3. Ever since we started using computers, we have dreamt of using spoken language to communicate with them
  4. SPEECH SYNTHESIS / SPEECH RECOGNITION
  5. WHAT IS SPEECH SYNTHESIS?
  6. Speech synthesis is the artificial production of human speech
  7. Speech synthesis: History
     • 1769: Speaking machine, by Wolfgang von Kempelen (he also developed the famous Mechanical Turk). Functional representation of the human vocal tract. http://www.youtube.com/watch?v=zYRVqrfY3tQ
     • 1939: Vocoder (voice encoder), developed by Homer Dudley for Bell Labs. Needed to be played (using a keyboard) by a trained operator. Exhibited at the 1939 World's Fair. http://www.youtube.com/watch?v=CyaK22DMfF0
     • 1970: Vocoder, custom built for Kraftwerk. http://www.youtube.com/watch?v=w-Jq7BHtQMA
  8. Speech synthesis: Most modern speech synthesis systems use electronic / computerized approaches
  9. Text to speech (TTS)
     In modern TTS systems, speech synthesis is a multi-step process that is divided into two main parts:
     1) Front end (analysis)
     2) Back end (synthesis)
     Diagram: Text → Front end → Back end → Speech
  10. Text to speech (TTS)
      Diagram: Text → text analysis → words → linguistic analysis (phrasing, intonation, duration) → phonemes → waveform generation → Speech
      Front end: text analysis and linguistic analysis. Back end: waveform generation.
  11. TTS: Analysis
      Text normalization challenges:
      "My latest project is to learn how to better project my voice"
  12. TTS: Analysis
      Text normalization challenges: "1430" can be read as
      • Half past two
      • One - four - three - zero
      • Fourteen hundred and thirty
      • One thousand four hundred thirty
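
The four readings of "1430" on slide 12 can be reproduced with a small sketch. This is only an illustration of why normalization is ambiguous, not how any of the SDKs in this talk work; all function names are hypothetical, and the time reading assumes a 24-hour HHMM token and only handles full and half hours.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out a number in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_digits(token):
    """Read the token digit by digit, e.g. a PIN or phone extension."""
    return " ".join(ONES[int(ch)] for ch in token)


def as_hundreds(token):
    """Read a four-digit token in hundreds, e.g. a year."""
    hundreds, rest = divmod(int(token), 100)
    words = f"{two_digits(hundreds)} hundred"
    return words if rest == 0 else f"{words} and {two_digits(rest)}"


def as_cardinal(token):
    """Read a token in the range 1000..9999 as a plain quantity."""
    thousands, rest = divmod(int(token), 1000)
    hundreds, rest = divmod(rest, 100)
    parts = [f"{ONES[thousands]} thousand"]
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_time(token):
    """Read an HHMM token as a clock time (assumption: full/half hours only)."""
    hour, minute = int(token[:2]), int(token[2:])
    hour12 = hour % 12 or 12
    if minute == 0:
        return f"{ONES[hour12]} o'clock"
    if minute == 30:
        return f"half past {ONES[hour12]}"
    raise ValueError("this sketch only handles full and half hours")


def readings(token):
    """Return every candidate reading; a real front end must pick one from context."""
    return {
        "time": as_time(token),
        "digits": as_digits(token),
        "hundreds": as_hundreds(token),
        "cardinal": as_cardinal(token),
    }
```

Calling `readings("1430")` yields all four candidates; choosing between them needs the surrounding words ("at 1430", "in 1430", "PIN 1430"), which is the hard part of the front end.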
  13. TTS: Analysis
      Text to phoneme challenges: "read" can be pronounced "red" or "reed"
  14. SYNTHESIS APPROACHES
  15. TTS: Synthesis
      1) Concatenative synthesis
      2) Formant synthesis
  16. TTS: Concatenative synthesis
      Base strategy: Concatenate segments of recorded speech
      • Unit selection synthesis: uses phones, diphones, half-phones, syllables, morphemes, words, phrases and sentences. Best results, often indistinguishable from human speech. Requires huge amount of pre-recorded data.
      • Diphone synthesis: uses a minimal database containing all diphones of a natural language (English: 800 diphones, German: 2500 diphones). Disadvantage: sonic glitches. Still used commercially, but on the decline.
      • Domain-specific synthesis: concatenates prerecorded words and sentences. Used in transport schedule announcements, weather reports, ... Simple to implement. High level of naturalness.
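
As a rough illustration of the diphone strategy on slide 16, the sketch below splits a phoneme sequence into diphones (transitions between neighbouring phonemes) and concatenates pre-recorded units. The `_` silence marker, the function names, and the tiny fake database are assumptions made for this example; a real voice stores audio for every diphone of the language.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding with silence ('_') at both ends."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(diphones, database):
    """Join the recorded samples of each diphone into one sample list."""
    samples = []
    for name in diphones:
        # A KeyError here means the voice database lacks this diphone.
        samples.extend(database[name])
    return samples


# Hypothetical database: diphone name -> recorded samples (placeholder numbers).
FAKE_DATABASE = {
    "_-h": [0.0, 0.1],
    "h-e": [0.2, 0.3, 0.2],
    "e-l": [0.1, 0.0],
    "l-o": [0.1, 0.2, 0.3],
    "o-_": [0.1, 0.0],
}

units = to_diphones(["h", "e", "l", "o"])
audio = concatenate(units, FAKE_DATABASE)
```

The hard cuts between units in `concatenate` are where the "sonic glitches" mentioned on the slide come from; production systems smooth these joins.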
  17. TTS: Formant synthesis
      Formant: spectral peak of the sound spectrum of the voice.
      It is sufficient to reproduce the first two (of 4) formants to be able to distinguish vowels.
      Can be implemented quite easily, but results in rather artificial results (“computer voice”).

      English
      Vowel   Formant f1   Formant f2
      i       240 Hz       2400 Hz
      e       390 Hz       2300 Hz
      o       360 Hz       640 Hz

      German
      Vowel   Formant f1   Formant f2
      i       320 Hz       3200 Hz
      e       500 Hz       2300 Hz
      o       500 Hz       1000 Hz
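
To make the two-formant idea on slide 17 concrete, here is a deliberately crude sketch that mixes two sine waves at f1 and f2 using the English values from the table. This is an assumption-laden simplification: a real formant synthesizer shapes a voiced source with resonant filters rather than adding pure tones, and the function names, duration, and sample rate are chosen only for illustration.

```python
import math
import struct
import wave

# f1 / f2 in Hz, taken from the English table on slide 17.
FORMANTS_HZ = {"i": (240, 2400), "e": (390, 2300), "o": (360, 640)}


def vowel_samples(vowel, duration_s=0.25, sample_rate=16000):
    """Return float samples in [-1, 1] mixing two sines at the vowel's f1 and f2."""
    f1, f2 = FORMANTS_HZ[vowel]
    count = int(duration_s * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * f1 * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * f2 * t / sample_rate)
        for t in range(count)
    ]


def write_wav(path, samples, sample_rate=16000):
    """Write the samples as 16-bit mono PCM so the result can be listened to."""
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(sample_rate)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        out.writeframes(frames)
```

For example, `write_wav("i.wav", vowel_samples("i"))` produces a short, very synthetic tone; swapping in the German values changes the perceived vowel colour.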
  18. TTS: Synthesis
      Concatenative
      • Advantages: high level of naturalness
      • Disadvantages: requires large database
      Formant
      • Advantages: no large database required; very intelligible, also at high speeds
      • Disadvantages: low level of naturalness (“robotic” sound)
  19. SPEECH SYNTHESIS DEMOS
  20. TTS SDKs
      • Siri
      • iOS Voice Services
      • Flite
      • OpenEars (based on Flite)
      • iSpeech
      • Nuance
      • AT&T
      • Google TTS
      • Bing TTS
  22. Using iOS Voice Service
      Private API: Not safe for the App Store - use at your own risk!

          // Look the class up at runtime because it is not part of the public SDK.
          VSSpeechSynthesizer *speech = [[NSClassFromString(@"VSSpeechSynthesizer") alloc] init];
          [speech setRate:(float)1.0];
          [speech startSpeakingString:@"Hello world, how are you"];
  23. OpenEars SDK
      URL: http://www.politepix.com/openears/
      • Shared Source
      • Based on CMU Pocketsphinx, CMU Flite, and CMU-CLMTK
      • Works offline, both for recognition and synthesis
      • Currently only supports English
      • Synthetic sound (diphone voice synthesis)
      • Pricing: free, with additional paid voices
  24. iSpeech SDK
      URL: http://www.ispeech.org
      • Commercial, free access for testing
      • Needs a server connection
      • Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese, Japanese, Danish, Finnish, Italian, German, Russian, ...
      • Synthetic sound (diphone voice synthesis)
      • Pricing: pay per use ($0.02 per TX) or pay per install ($0.25 per install, minimum 10,000 installs)
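
The two iSpeech pricing options on slide 24 can be compared with a little arithmetic. The sketch below works in cents to avoid rounding surprises and assumes, hypothetically, that the two options are alternatives and that "minimum 10,000 installs" means at least 10,000 installs are billed; check the vendor's current terms before relying on either assumption.

```python
PER_TX_CENTS = 2          # pay per use: $0.02 per transaction (TX)
PER_INSTALL_CENTS = 25    # pay per install: $0.25 per install
MIN_INSTALLS = 10_000     # assumed to be the minimum number of billed installs


def pay_per_use_cost(installs, tx_per_install):
    """Total cost in dollars when every transaction is billed."""
    return installs * tx_per_install * PER_TX_CENTS / 100


def pay_per_install_cost(installs):
    """Total cost in dollars when installs are billed, with the assumed minimum."""
    return max(installs, MIN_INSTALLS) * PER_INSTALL_CENTS / 100


# Above this many transactions per install, pay per install is cheaper
# (ignoring the install minimum).
BREAK_EVEN_TX_PER_INSTALL = PER_INSTALL_CENTS / PER_TX_CENTS
```

With these numbers the break-even is 12.5 transactions per install: an app with 2,000 installs and 10 transactions each costs $400 on pay per use but $2,500 on pay per install, while 50,000 installs with 20 transactions each cost $20,000 versus $12,500.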
  25. AT&T Speech SDK
      URL: http://developer.att.com
      • Commercial, free trial access for 90 days
      • Pricing: USD 99 / year grants 1,000,000 API calls per month
      • TTS API: web service (send text, get WAV back)
      • Voices: US English (male / female), US Spanish (male / female)
  26. Nuance
      URL: http://dragonmobile.nuancemobiledeveloper.com/
      • Commercial, free access for testing
      • Needs a server connection
      • Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese, Japanese, Danish, Finnish, Italian, German, Russian, ...
      • Rather natural sound
      • Pricing: several service levels (Silver, Gold, Emerald)
        Silver: up to 20 TX per device per day, max 500,000 devices
        Gold: pay per device ($0.24 per install), pay per transaction ($0.009 per TX), pre-payment of at least $3000
  27. WHAT IS SPEECH RECOGNITION?
  28. Speech recognition is the translation of spoken words into text.
  29. Speech recognition: History
      • 1952: “Audrey” developed at Bell Labs. Could recognize digits spoken by a single voice.
      • 1962: “Shoebox” by IBM, demonstrated at the World's Fair. Could recognize 16 words spoken in English. http://sysrun.haifa.il.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html
      • 1970s: DARPA Speech Understanding Research program. “Harpy”, developed at Carnegie Mellon University (could understand 1011 words). http://www.youtube.com/watch?v=N3i6NoUZsSw
      • 1980s: By using statistical models (Hidden Markov Models), ASR vocabularies grew from a few hundred words over several thousand words to potentially unlimited numbers of words. Still, discrete dictation was required.
      • 1990s: Dragon Naturally Speaking (originally at $9000) supports continuous speech recognition.
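  30. Speech recognition
      Diagram: (analog) speech → preprocessing → recognition → text. The decoder uses the acoustic model, the dictionary, and the language model, and produces several candidates.
  31. Speech recognition
      Diagram: states → phonemes → words → sentences
      • Acoustic model: states to phonemes, e.g. /’h/, /’h/ -> /a/, /a/
      • Dictionary: phonemes to words, e.g. "how"
      • Language model: words to sentences, e.g. "how will the weather be tomorrow / today", "show me"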
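
The interplay of acoustic model and language model on slides 30 and 31 can be sketched as a rescoring step: each candidate sentence carries an acoustic score, and a language model decides which word sequence is most plausible. Everything below is a toy under stated assumptions: the bigram probabilities, the acoustic scores, and the function names are invented for illustration and do not come from any SDK in this talk.

```python
import math

# Invented bigram probabilities P(word | previous word); "<s>" marks sentence start.
BIGRAM_PROB = {
    ("<s>", "how"): 0.5,
    ("how", "will"): 0.4,
    ("will", "the"): 0.5,
    ("the", "weather"): 0.3,
    ("weather", "be"): 0.4,
    ("be", "tomorrow"): 0.2,
    ("be", "today"): 0.2,
}
UNSEEN_PROB = 1e-4  # small fallback probability for word pairs not in the model


def lm_log_prob(words):
    """Sum the log probabilities of each word given its predecessor."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), UNSEEN_PROB))
        previous = word
    return total


def best_candidate(candidates, lm_weight=1.0):
    """Pick the sentence with the highest acoustic + weighted language model score.

    candidates: list of (sentence, acoustic_log_score) pairs.
    """
    def combined(candidate):
        sentence, acoustic_log_score = candidate
        return acoustic_log_score + lm_weight * lm_log_prob(sentence.split())

    return max(candidates, key=combined)[0]


# Hypothetical decoder output: similar-sounding candidates with made-up scores.
CANDIDATES = [
    ("how will the weather be tomorrow", -12.0),
    ("how will the whether be tomorrow", -11.5),
    ("how well the weather bee tomorrow", -11.8),
]
```

Here `best_candidate(CANDIDATES)` selects the first sentence even though its acoustic score is slightly worse, because the language model penalizes the unlikely word pairs in the other two.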
  32. SPEECH RECOGNITION DEMOS
  33. Speech Recognition SDKs
      • Siri
      • iOS Voice Services
      • Flite
      • OpenEars (based on Flite)
      • iSpeech
      • Nuance
      • AT&T
      • Google TTS
      • Bing TTS
  35. OpenEars SDK
      URL: http://www.politepix.com/openears/
      • Shared Source
      • Based on CMU Pocketsphinx, CMU Flite, and CMU-CLMTK
      • Works offline, both for recognition and synthesis
      • Vocabulary: needs to be provided by developer
      • Currently only supports English
      • Pricing: free, with additional paid voices
  36. iSpeech SDK
      URL: http://www.ispeech.org
      • Commercial, free access for testing
      • Needs a server connection
      • Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese, Japanese, Danish, Finnish, Italian, German, Russian, ...
      • Pricing: pay per use ($0.02 per TX) or pay per install ($0.25 per install, minimum 10,000 installs)
  37. AT&T Speech SDK
      URL: http://developer.att.com
      • Commercial, free trial access for 90 days
      • Pricing: USD 99 / year grants 1,000,000 API calls per month
      • Supports several recognition contexts: Gaming, Social Media, Web Search, Business Search, Voicemail to Text, SMS, Question and Answer, TV, Generic
      • Support for command mode: provide set of commands that are allowed in your app. Supports 19 languages (including English, German, Mandarin, Japanese, French, Italian)
  38. Nuance
      URL: http://dragonmobile.nuancemobiledeveloper.com/
      • Commercial, free access for testing
      • Needs a server connection
      • Supports several languages: English (US, UK), Spanish, Chinese, Japanese, Danish, Finnish, Italian, German, Russian, ...
      • Pricing: several service levels (Silver, Gold, Emerald)
        Silver: up to 20 TX per device per day, max 500,000 devices
        Gold: pay per device ($0.24 per install), pay per transaction ($0.009 per TX), pre-payment of at least $3000
  39. OUTLOOK
  40. Multi-modal UIs
      Pixeltone: http://www.gierad.com/projects/pixeltone-a-multimodal-interface-for-image-editing/
  41. Multi modal input
  42. + = ?   ++ = ?   + = ?
  43. http://elizaapp.com
  45. Zühlke. Empowering Ideas.
      Peter Friese, @peterfriese, peter.friese@zuehlke.com, http://www.zuehlke.com
      Want to learn more? Get in touch - I’m available for consulting: http://slidesha.re/15xNxpf
