Voice Recognition and Natural Language - Dallas TechFest 2016

An overview of what voice recognition and natural language are, and the state of the industry, for people new to the subject

Published in: Technology

Voice Recognition and Natural Language - Dallas TechFest 2016

  1. 1. Voice Recognition and Natural Language Dallas TechFest January 29, 2016 Crispin Reedy @crispinTX #DallasTechFest16
  2. 2. • Voice User Interface Designer • 10 years in the field • Former coder; got interested in UX • President of the Association for Voice Interaction Design • Consultant for Versay Solutions @crispinTX crispinreedy.com © 2016 Versay Solutions LLC
  3. 3. Disclaimers This Session Is About: • What is speech recognition anyway? • Should I speech-enable X? How? • In general, how does it work? – What technologies should I consider? – What skills are important? • What are the design considerations? It’s NOT About: • Detailed code • In-depth how-tos • Deep technical knowledge • Advanced ASR
  4. 4. Should I Speech-Enable X?
  5. 5. What IS X?
  6. 6. How does this new modality enable or enhance what I want to do on this platform?
  7. 7. What IS X?
  8. 8. Terms & Technologies • Speech Recognition • Natural Language Understanding • Text to Speech • Voice Verification (Biometrics)
  9. 9. Speech Recognition • Also known as “ASR” – “Speech to Text” • Spoken language in, machine-readable format out: “See the cat.”
  10. 10. Natural Language Understanding • Extracting meaning from natural text – Not necessarily tied to speech recognition • Example: “Hello, yes, I’d like to pay my water bill. Can you help me with that?” yields Action = BillPay, BillType = Water
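
The bill-pay example on the slide above can be sketched in a few lines of Python. This is only an illustration of the idea of mapping text to meaning: the keyword tables and the `extract_meaning` function are hypothetical, and a production NLU engine would normally rely on a trained model rather than hand-written patterns.

```python
import re

# Hypothetical pattern tables, invented for this illustration.
ACTION_PATTERNS = {
    "BillPay": re.compile(r"\bpay\b.*\bbill\b"),
    "CheckBalance": re.compile(r"\bbalance\b"),
}
BILL_TYPES = ("water", "electric", "gas")


def extract_meaning(utterance: str) -> dict:
    """Return the first matching action and bill type found in the text."""
    text = utterance.lower()
    meaning = {}
    for action, pattern in ACTION_PATTERNS.items():
        if pattern.search(text):
            meaning["Action"] = action
            break
    for bill_type in BILL_TYPES:
        if re.search(rf"\b{bill_type}\b", text):
            meaning["BillType"] = bill_type.capitalize()
            break
    return meaning


print(extract_meaning(
    "Hello, yes, I'd like to pay my water bill. Can you help me with that?"
))
```

Note that the greeting and the closing question are simply ignored: the meaning lives in a few words, which is why NLU does not need to be tied to speech recognition.
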
  11. 11. Text to Speech • Speech Synthesis – Used to convert text to spoken words
  12. 12. Voice Verification • Also called voiceprints, biometrics, voice authentication, etc. • Recognizes a person, not necessarily what they are saying. – You can have ASR without Voice Verification – And vice versa “My voice is my password.” “Authenticated. Welcome, Mr. Smith.” ✓
  13. 13. Uses: Separate Applications Speech Recognition • Hands-free command / control • Dictation • Input text • Small form factor device, etc. Text To Speech • Output text dynamically • Respond to input • Useful when no display is available Natural Language Understanding • Necessary at some level for all language-based input • Also used to parse large volumes of text Voice Verification • Security
  14. 14. Uses: Combined [Diagram: an Application connected to ASR, NLU, TTS, Voice prints / Verification, and Data; labeled steps: Sign-In, Interaction, Request, Action, Meaning, Access Data, Output]
  15. 15. True Multimodality [Diagram: an Application connected to ASR, NLU, TTS, Voice prints / Verification, Data, Touch, and Keyboard; labeled steps: Sign-In, Interaction, Request, Action, Meaning, Access Data, Output, Manage I/O Modality, Determine Meaning in Context] Visual Context!
  16. 16. Credit: Jon Bloom
  17. 17. Let’s Talk Speech!
  18. 18. Output: Text to Speech • (Somewhat) mature technology • (Fairly) easy to understand and use – Note: “Create TTS audio” is not the same as having a TTS engine
  19. 19. How it Works
  20. 20. TTS Engine • Text in, speech out • May do some text pre-processing – “St. James St.” becomes “Saint James Street” – Punctuation – If it doesn’t do this, you’ll have to do it yourself. • Grapheme to phoneme transcription • Identify intonation patterns – Assign the correct lexical stress to the words
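
The pre-processing step above can be illustrated with a small Python sketch. It handles only the "St." abbreviation from the slide, using a simple positional heuristic that is an assumption of this example, not how any particular TTS engine works: "St." before a capitalized word is read as "Saint", and any other "St." as "Street".

```python
import re


def expand_st(text: str) -> str:
    """Expand the ambiguous abbreviation "St." using its position."""
    # "St." directly before a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining "St." is read as "Street".
    text = re.sub(r"\bSt\.", "Street", text)
    return text


print(expand_st("St. James St."))
```

The heuristic breaks easily (for example, "Main St. Then turn left" would be misread), which is the point of the slide: addresses and names need careful handling, either by the engine or by your own code.
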
  21. 21. What Makes Good TTS? • Phonemes change based on location – “Cat” – “Alligator” • Elision – “I’m. Awaiting. You.” – “I’m awaiting you.” • Intonation – “Do you want coffee?” – “Do you want soda, tea, or coffee?”
  22. 22. SSML • XML-based W3C standard for Speech Synthesis Markup – Not universally supported by vendors. • Tags for marking up text to produce a more natural quality output. – Emphasis – Break – Voice – Prosody (e.g. pitch)
  23. 23. SSML Example
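
The example shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch that builds a small SSML document with Python's standard `xml.etree.ElementTree`. The prompt text and the `build_ssml` function are invented for illustration; the `speak`, `emphasis`, `break`, and `prosody` elements come from the W3C SSML specification, but vendor support varies, so check your engine's documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a short SSML prompt (illustrative wording only)."""
    # A conforming document would also declare the SSML XML namespace;
    # it is omitted here to keep the sketch short.
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    speak.text = "Your balance is "

    # Stress the part the listener cares about.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "fifty dollars"
    emphasis.tail = "."

    # Insert a half-second pause before the closing sentence.
    ET.SubElement(speak, "break", {"time": "500ms"})

    # Slow down and lower the pitch for the sign-off.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
    prosody.text = "Thank you for calling."

    return ET.tostring(speak, encoding="unicode")


print(build_ssml())
```
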
  24. 24. When To Use It • When high-quality audio is not a requirement – TTS has improved considerably, but is still noticeably synthetic • When you have a lot of dynamic data – If you just need to say a few things, it may be overkill
  25. 25. Other Considerations • More phonemes = higher quality voice – Also means a bigger download and install (if on device) • Exceptions (addresses, names) can be iffy – May require a lot of work to handle well • Your data needs to be clean and ready to voice back – Acronyms, incomplete sentences will not sound good • Some applications may have other acoustic limitations – Telephony • It is possible to build a custom voice – But it takes a lot of work!
  26. 26. Where To Find It • Many commercial products available – Most languages and dialects, e.g. American English, British English, etc. – Many different voices – Nuance, Cepstral, Ivona – Some open source – Some APIs • Chrome https://developer.chrome.com/apps/tts
  27. 27. ASR and NLU
  28. 28. ASR and NLU: Topics • Complications of speech – Why is it so hard? • How it works: overview • Early commercial adoptions – IVR • Design considerations • Speech today – Different vendors • Should I voice-enable X?
  29. 29. (The Speech Chain, Bell Labs, 1963)
  30. 30. [Diagram, from The Voice in the Machine (Pieraccini): knowledge layers World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics, Linguistics, Physiology; units Concepts, Phrases, Words, Phonemes, Sounds; regions marked ASR and NLU]
  31. 31. Speech Is Ambiguous • Speech is never stationary – Coarticulation • Noisy environments • Accents • Different speakers have voices with different acoustic qualities – Goats • Challenges vary depending on what you are going to recognize – Spelling (short utterances) can be difficult even for humans – Phonetic alphabet (Military)
  32. 32. Language Is Ambiguous • Humans can deduce meaning from context and unknown words “How can I help you?” I’m having a problem with my account. I’d like that one. No, not the green one, the red one. Time flies like an arrow. Fruit flies like a banana.
  33. 33. Everything Is Ambiguous • All modern speech recognition is probabilistic – GUI: Button clicked? true / false – VUI: There is an 85% chance that button was clicked
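
Because a recognizer returns a confidence rather than a certainty, application code typically branches on that score. The sketch below shows one common pattern (accept, confirm, or reprompt). The `decide` function and both threshold values are assumptions chosen for illustration; real thresholds are tuned per application and per recognizer.

```python
def decide(confidence: float,
           accept_at: float = 0.80,
           confirm_at: float = 0.45) -> str:
    """Map a recognition confidence score (0.0 to 1.0) to a dialog action."""
    if confidence >= accept_at:
        return "accept"      # confident enough to act on the result
    if confidence >= confirm_at:
        return "confirm"     # ask "Did you say ...?" before acting
    return "reprompt"        # too uncertain; ask again


for score in (0.85, 0.60, 0.20):
    print(score, decide(score))
```
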
  34. 34. Three Dimensions of Speech Problems (The Voice in the Machine: Pieraccini) • Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent • Isolated Words, Connected Words, Natural Speech • Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited • Humanlike
  35. 35. History of Speech Recognition • AUDREY: Davis, Biddulph, and Balashek - Bell Labs 1952 • Analog • Isolated digit recognition – Pause between digits • Speaker-dependent
  36. 36. Sampling • The start of being able to digitally manipulate audio
  37. 37. Spectrogram vs. Waveform [Figure: axis labels dB and frequency]
  38. 38. 1970s: Template Matching • Template matching approach – “Brute force” model – Quantized spectrograms – What about duration? • Dynamic time warping • Endpoint detection – Difficult to do • Feature extraction
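
Dynamic time warping answers the "what about duration?" question: it compares two sequences while allowing one to be stretched or compressed in time. The sketch below is a textbook version in Python. For simplicity each frame is a single number; real recognizers compared vectors of spectral features. The two templates and the observed sequence are made-up toy data.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(observed, templates):
    """Return the template word with the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(observed, templates[word]))


# Toy templates: one stored feature sequence per word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 4, 2, 1, 1]}
# The same shape as "yes", but spoken more slowly.
observed = [1, 1, 3, 4, 4, 3, 1]
print(recognize(observed, templates))
```
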
  39. 39. 1980s: The Power of Statistics • The recognition of connected speech becomes a search for the best path in a large network – Problem of finding the probabilities • Statistical Language Models – Not all sequences of words are equally probable – Rank all permissible sentences in terms of probability • “Correct” grammar is not applicable • Restricted by domain • Hidden Markov Models (HMM) – Unified probabilistic model for speech
  40. 40. Hidden Markov Model Example ("HiddenMarkovModel" by Tdunning, vectorization; Wikimedia) • X: states • y: possible observations • a: state transition probabilities • b: output probabilities
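
The "search for the best path" in an HMM is usually done with the Viterbi algorithm. The sketch below is a compact Python version using the slide's vocabulary: hidden states (X), observations (y), transition probabilities (a), and output probabilities (b). The two-state silence/speech model and all of its probabilities are invented toy values, not taken from the talk.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy model (all numbers are illustrative assumptions).
states = ["silence", "speech"]                      # X
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {                                         # a
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {                                          # b
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.3, "loud": 0.7},
}

print(viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p))
```
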
  41. 41. You’re Only As Good As What You’re Trained On • Corpus (plural: corpora) – Collection of speech used to train a recognizer – Acoustic and/or Pronunciation Model • Associates sounds with symbols and words. • Created from a general speech corpus and a phonetic and orthographic transcription – Statistical Language Model (SLM) • A probability distribution over sequences of words • Created from a domain-specific speech corpus and a tagged transcription to extract meaning
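
To make "a probability distribution over sequences of words" concrete, here is a minimal bigram language model in Python. It is an unsmoothed maximum-likelihood sketch trained on three made-up sentences; the function names and the tiny corpus are assumptions for illustration, and production SLMs are trained on far larger domain-specific corpora with smoothing.

```python
from collections import Counter


def train_bigram(sentences):
    """Count word and word-pair frequencies, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])           # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Multiply P(word | previous word) across the sentence (no smoothing)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        probability *= bigrams[(prev, word)] / unigrams[prev]
    return probability


corpus = ["pay my bill", "pay my water bill", "check my balance"]
unigrams, bigrams = train_bigram(corpus)
print(sentence_probability("pay my bill", unigrams, bigrams))
print(sentence_probability("bill my pay", unigrams, bigrams))
```

The in-domain word order gets a nonzero probability while the scrambled one gets zero, which is how the model helps the recognizer rank competing hypotheses.
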
  42. 42. Training [Diagram: the Speech Recognition Engine with an Acoustic Model, a Pronunciation Model, and an SLM and/or Grammar]
  43. 43. Language Model vs. Grammar • SLM – Has to be trained against collected utterances – Large potential set of what the caller can say – Tagged with the meanings of what they can say • Grammar (GRXML) – More tightly constrained than an SLM – Easier to create – Not “trained” in the same way – System will only recognize what is in the grammar
  44. 44. [Diagram: Utterance (Noise Levels? Barge-In?), Feature Extraction, Endpointing, Speech Recognition Engine with Grammar or SLM, then a Recognition Event containing Probabilities, n-best list, Literal return, Tokens]
  45. 45. Natural Language Understanding • Parsing input to extract meaning • Covers a large field – Commands – Automatic classification of emails – Newspaper articles, large chunks of text • Lexicon • Parser • Grammar rules • New tools / APIs
  46. 46. Levels of Meaning “How can I help you?” • Too Broad / Ambiguous: “I’m having a problem with my account.” • Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…” • Just Right: “I’m seeing an unusual charge on my bill.”
  47. 47. Multi-Token Utterances • “I’d like to transfer $50 from my checking account to my savings account.” – ACTION = Transfer – FROM_ACCOUNT = Checking – TO_ACCOUNT = Savings – AMOUNT = $50 • Unfortunately, people don’t often naturally produce these kinds of utterances.
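
The token breakdown on this slide can be reproduced with a single regular expression, shown here as a rough Python sketch. The pattern and the `parse_transfer` function are hypothetical and match only this one phrasing; that rigidity is exactly why grammars and SLMs exist, and why callers who do not produce the full sentence need follow-up prompts.

```python
import re

# Hypothetical pattern covering one phrasing of a transfer request.
TRANSFER = re.compile(
    r"transfer \$?(?P<amount>\d+(?:\.\d{2})?) "
    r"from my (?P<source>\w+) account "
    r"to my (?P<target>\w+) account",
    re.IGNORECASE,
)


def parse_transfer(utterance: str):
    """Return the four tokens from the slide, or None if nothing matched."""
    match = TRANSFER.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match["source"].capitalize(),
        "TO_ACCOUNT": match["target"].capitalize(),
        "AMOUNT": "$" + match["amount"],
    }


print(parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
))
print(parse_transfer("I want to move some money."))
```
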
  48. 48. Early Commercial Adoption • IVR – Touchtone / DTMF • “For checking, press 1. For savings, press 2.” – Directed Dialog (Grammar-based ASR) • “Which account? Just say ‘checking,’ ‘savings,’ or ‘money market.’” – Natural Language (SLM-based ASR) • “From which account?” • SpeechWorks / Nuance technology • VoiceXML / GRXML
  50. 50. Typical IVR Architecture [Diagram: PSTN / VOIP, SIP, Voice Browser, VUI (VXML) over HTTP, App Server / Data Connection, Data, MRCP, ASR Server, TTS Server]
  51. 51. Anatomy of a VUI + NLU project • Voice User Interface Design – High level design • Design style, sound and feel, IA – Detailed design • Prompts (recorded) • Grammars for directed dialog states • Data I/O • SLM Creation – Utterance capture – Transcription – Tagging – Compiling and deployment
  53. 53. VUI Design Doc – Detailed Example
  54. 54. Corpora Documentation Example
  55. 55. Design Considerations • Types of Speech User Interfaces – Command and Control – Dictation – Dialog-based • Speech is a linear, time-based interface – Multimodality introduces additional complications
  56. 56. Design Considerations • If the recognizer doesn’t get something, you have to reprompt. • Don’t say “sorry.” “Where are you traveling today?” I’m going to…. <noise> “What city was that?”
  57. 57. Design Considerations • Speech is interruptible – Main Menu: Choose from: “Beverages,” “Sandwiches,” “Sides,” “Salads,” or “Alcoholic Drinks.”
  58. 58. Design Considerations • Prompts imply more than choices – Would you like chocolate or vanilla? • Yes • Both
  59. 59. Design Considerations • Input must be limited *after* it is provided – Can’t check the box on the client side to only allow input of valid amounts – “Sorry, you’re only allowed to transfer up to $500.”
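
In code, "limiting input after it is provided" means validating the recognized value and answering with a corrective prompt, since there is no form field to constrain what the caller says. A minimal Python sketch follows; the `validate_transfer` function and the $500 default limit mirror the slide's example and are otherwise assumptions.

```python
def validate_transfer(amount: float, limit: float = 500.0):
    """Return a corrective prompt for an invalid amount, or None if it is valid."""
    if amount <= 0:
        return "How much would you like to transfer?"
    if amount > limit:
        # The caller has already spoken; all we can do is explain and ask again.
        return f"Sorry, you're only allowed to transfer up to ${limit:,.0f}."
    return None


for spoken_amount in (100, 750):
    print(spoken_amount, validate_transfer(spoken_amount))
```
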
  60. 60. Design Considerations • Avoid using the word “Help” as a global command. • Instead, if there is a need to give additional information, supply it in the first or second reprompts. – Or use specific keywords – Other than “help” • “You can also say ‘instructions.’” • “Or, say ‘It’s something else.’”
  61. 61. User Centered Design Techniques • A set of techniques designed to keep the focus on the user during the design process • May include but are not limited to: – Conversations • Specific to VUI design – Read Aloud • Specific to VUI design – Card Sorts • Used to construct an IA – Personas • Used in all modalities – Usability Testing • Used in all modalities – A/B Testing • Useful for applications that are already in production
  62. 62. Usability Testing
  64. 64. Should I Speech-Enable X?
  65. 65. What IS X?
  66. 66. What’s the Use Case For Speech? • Enabling application – User can’t do it any other way – New tasks • Enhancing application – User can do it now – But speech makes it better • Faster • Safer Credit: Bruce Balentine, EIG
  67. 67. How Hard Is It To Do? • What do you need it for? • What kind of device will you be running it on? – Connectivity? Can you use cloud-based ASR? – Do you have to download it? If so, how much space do you have? • How much control do you need over the application / user interface?
  68. 68. Possibilities • Write an app (skill) for an agent such as Cortana / Alexa • Use cloud APIs to add ASR to your app / device / page / gadget • Download an ASR and use full-featured capabilities for more robust recognition • Build your own
  69. 69. Distributed: Today’s Speech Agents • Siri • Cortana • Google Now • Amazon Echo (Alexa)
  70. 70. Today’s Cloud-Based Speech APIs • Distributed speech recognition – Collection and compression of speech is on the device – The language models are typically on the network – Phone can be speaker-dependent • Trains itself on your voice and on the acoustic environments you are in most often – Many companies are providing APIs to use their speech recognition
  71. 71. AVS vs. Amazon Echo • Could use AVS with the Amazon Echo, or with your own device
  72. 72. Speech API Example: Alexa Voice Service (AVS)
  73. 73. Alexa Skill Example
  75. 75. Alexa “Skills” • “Alexa, ask Yelp to find me a restaurant.” – Cortana has similar integration • Register your skill with Amazon and publish it
  76. 76. Cloud vs. Downloadable / Embedded Cloud: • Microsoft – Cortana integration – Project Oxford API • Google API • Amazon • Several recent startups – Api.ai, Capio.ai, Speechmatics, iSpeech Downloadable / Embedded: • Microsoft – Windows 10 Speech APIs – Microsoft Speech Server • Nuance – the 800 pound gorilla in the room • Interactions – IBM Watson
  77. 77. Cloud vs. Downloadable / Embedded Cloud: • Easy to get started • Lightweight • Not much specialized knowledge Downloadable / Embedded: • Customizable • Probably better recognition • Can be device-specific • More features • Higher powered • Will require specialized knowledge • Speech scientist
  78. 78. Today’s NLU APIs • Microsoft LUIS (part of Project Oxford) • Api.ai
  79. 79. Open Source ASR • CMU Sphinx – pocketsphinx • Kaldi – http://kaldi-asr.org/ • Github • New updates include some pretty interesting stuff (DNN) • Requires: – Corpus – Tech know-how
  80. 80. Who You May Need On Your Team • Speech Scientist • VUI Designer
  81. 81. Should I Speech-Enable X?
  82. 82. Should I Speech-Enable X? Desktop App / Website • Easy to get started with API-based ASR • But the use case may not be as powerful Tablet / Mobile • Stronger use case • But will the network be available for APIs? Industrial Device • Great use case esp. with multimodal • But this is harder to do and probably will be custom Gadget • Decent use case • APIs are tailored for this • Will they do everything you need? • Will the extra modality be a plus or just a “silly add-on?” Car • Safety considerations are high here • Need better user interfaces & more robust IVR • Touchtone can still be good for a lot of applications • Speech is good for complex call routing and input
  83. 83. Resources • The Voice in the Machine: Building Computers that Understand Speech – Roberto Pieraccini • YouTube video: “Open the Pod Bay Doors, Siri” • Best Practices in VUI Design: AVIxD Wiki – http://videsign.wikispaces.com/ • AVIxD: Quarterly Brown Bags
  84. 84. Thanks! @crispinTX crispinreedy.com creedy@versay.com
