Where's Jarvis? The Future of Voice Recognition and Natural Language User Interfaces UXPA 2016


  1. #UXPA2016 Session Survey: http://www.uxpa2016.org/sessionsurvey?sessionid=321 © 2016 Versay Solutions. Where’s Jarvis? The Future of Voice Recognition and Natural Language User Interfaces. Crispin Reedy, Versay Solutions @crispinTX crispinreedy.com #UXPA2016
  2. From the session description • What is voice recognition? • What is natural language understanding? • What are the common technologies in the market today? • How does this fit with IoT? • What are design considerations / methods to evaluate these types of interfaces? • Implied: Should I speech-enable my ___? • Bonus Q: Why doesn’t it work the way we want it to, and when will it?
  3. Should I Speech-Enable My ___?
  4. Iron Man 2: Marvel Studios, Paramount Pictures
  5. Star Trek Voyager: Paramount Television
  6. “Tomato soup.” “Tomato soup. OK, what kind?” “Just plain.” “Coming right up!” • Implicit confirmation • Second-level, open-ended prompting • Cultural context: plain = hot
  7. Terms & Technologies • Speech Recognition • Natural Language Understanding • Voice Verification (Biometrics) • Text to Speech
  8. Speech Recognition (“ASR”): “See the cat.”
  9. Natural Language Understanding • Extracting meaning from natural text • “Hello, yes, I’d like to pay my water bill. Can you help me with that?” • Intent = BillPay • Entity (Bill Type) = Water
  10. Voice Verification • “My voice is my password.” • “Authenticated. Welcome, Mr. Smith.” ✓
  11. Text To Speech
  12. What Is Good TTS? • Phonemes change based on location • “Cat” • “Alligator” • Elision • “I’m. Awaiting. You.” • “I’m awaiting you.” • Intonation • “Do you want coffee?” • “Do you want soda, tea, or coffee?” • Most TTS isn’t “Movie Quality” IMDB
  13. SSML Example SSML
  14. Speech Recognition • Hands-free command / control • Dictation • Input text • Small form factor device, etc. Text To Speech • Output text dynamically • Respond to input • Useful when no display is available Natural Language Understanding • Necessary for all language-based input • Extract meaning • Parse large volumes of text Voice Verification • Security
  15. [Diagram] ASR, NLU, TTS, Verification, Voice prints, Application, Data • Sign-In • Interaction • Request • Action • Meaning • Access Data • Output
  16. [Diagram] ASR, NLU, TTS, Verification, Voice prints, Application, Data, Touch, Keyboard • Sign-In • Interaction • Request • Action • Meaning • Access Data • Output • Manage I/O Modality • Determine Meaning in Context • Visual Context!
  17. (No slide text transcribed)
  18. ASR
  19. [Diagram] World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics • Linguistics, Physiology • Concepts, Phrases, Words, Phonemes, Sounds • ASR, NLU
  20. Speech is ambiguous
  21. Language is ambiguous
  22. Everything is ambiguous
  23. [Diagram] Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent • Isolated Words, Connected Words, Natural Speech • Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited • Humanlike
  24. AUDREY: Automatic Digit Recognizer, Bell Labs, 1952
  25. X: states • y: possible observations • a: state transition probabilities • b: output probabilities • "HiddenMarkovModel" by Tdunning, vectorization: Wikimedia
  26. Training • Speech Recognition Engine • Acoustic Model • SLM and/or Grammar • Pronunciation Model
  27. [Diagram] Utterance (Noise Levels? Barge-In?) • Feature Extraction • Endpointing • Speech Recognition Engine • Grammar or SLM • Probabilities • Recognition Event: n-best list, Literal return, Tokens
  28. Early Commercial Adoptions • Interactive Voice Response • “Those Phone Menus” • Server-based ASR • Nuance • Microsoft • Voice-Enabled Handheld Devices • Industrial / Productivity applications • Device-based ASR • Network not needed • Note: Call center is still an important customer touchpoint!
  29. Today’s Speech Agents vs. APIs • Siri / Apple APIs • Cortana / Cortana APIs • Google Now / Google Voice Actions • Amazon Echo (Alexa) / AVS API • Jibo • Ubi / Ubi Kit • Assistant.ai / Api.ai
  30. Alexa Skill vs. Alexa Voice Service Amazon.com
  31. Alexa Skill Example Amazon.com
  32. Amazon.com
  33. CapitalOne.com
  34. NLU
  35. Natural Language Understanding • Parsing input to extract meaning • Covers a large field • Commands • Automatic classification of emails • Newspaper articles, large chunks of text • Bots • Conversational agents • Messaging apps • Personal assistants • Input could be via speech or via text
  36. Levels of Meaning • Prompt: “How can I help you?” • Too Broad / Ambiguous: “I’m having a problem with my account.” • Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…” • Just Right: “I’m seeing an unusual charge on my bill.”
  37. NLU Tasks http://www.conversational-technologies.com/nldemos/nlDemos.html
  38. Intents and Entities • “I’d like to transfer $50 from my checking account to my savings account.” • ACTION = Transfer (Intent) • FROM_ACCOUNT = Checking (Entity) • TO_ACCOUNT = Savings (Entity) • AMOUNT = $50 (Entity)
  39. NLU APIs • API.ai • Alexa • Microsoft LUIS • Wit.ai • Google Voice Actions • Etc.
  40. Today’s NLU APIs • Microsoft LUIS (part of Project Oxford) Microsoft.com
  41. Today’s NLU APIs • API.ai
  42. The Future Is Here • DNN (Deep Neural Networks) • Being applied to both ASR and NLU problems • Requires large amounts of data to train the models
  43. What’s The Glue Here? • Consistency Across Contexts? • “Omnichannel CX” • Data Is Everywhere • State Chart XML?
  44. ASR vs. NLU: Wrap Up. ASR • Spoken aloud • Requires some NLU even if it’s hand-crafted (tagging) • Useful in hands-free, eyes-free contexts. NLU • Focuses on meaning extraction • Could be used for chat bots, etc. • Machine learning to train models
  45. Design Considerations
  46. Design Considerations • What are you trying to build? • What’s your platform? • Existing guidelines / research • User testing is key • Especially if you’re trying to do something complicated
  47. Should I Speech-Enable My ___?
  48. What’s Your ASR/NLU Platform? • Write an app (skill) for an agent such as Cortana / Alexa • Use cloud APIs to add ASR / NLU to your app / device / page / gadget • Download software and use full-featured capabilities for more robust recognition on a specific device • Build your own
  49. Network Availability • Simply irritating… or totally unusable? “What’s on my calendar today?” “Sorry, I can’t complete that request right now.”
  50. Appropriate Modality? • Voice Only? Voice + Display? • Is it possible for the user to switch modalities? • Or would switching potentially be dangerous? “How long is the flight from Dallas to Seattle?” “I’ve got a few results to show you.”
  51. Is State Maintained? • Does your platform support a multiple-stage interaction? • Does it remember what you did previously? “Who is Barack Obama?” “Barack Obama is the 44th president of the United States.” “How old is he?” “I’m sorry, I don’t understand your question.”
  52. Wake-Up Words • How many of these “Agents” will we be talking to? “Jibo, take a picture.” “Alexa, play music.” “OK Google, set the temperature to 77 degrees.”
  53. System Personality • Are you writing for an “Agent” who has an existing style? • What if your skill or app doesn’t match that style? • If not, should you create one? “Hi, I’m Julie!”
  54. Context • Real-world context • Digital context • How much does your app know about where you are and what it can do? “When I get home, remind me to take out the trash.” “I’m sorry, your calendar doesn’t support location-based reminders.”
  55. What Are You Trying To Recognize? • Long utterances work better than short ones • Letter names require extra work “Start a session” “Got it”
  56. And So Much More…. • What will you do when the recognizer just can’t get it? “I want my…. BARK BARK BARK Timmy STOP THAT NOW GET DOWN!” ????
  57. Existing Guidelines / Research • Caveat: Best practices evolved in one modality (e.g. voice-only) may not apply the same way in another (e.g. combined voice + touch) • But they could be adapted • Association for Voice Interaction Design (AVIxD.org) • Wiki • Peer-Reviewed Journal • Virtual “Brown Bags” • Academic Sources, Books
  58. AVIxD.org CUI Working Group is actively recruiting!
  59. Specific Example: “Help” • VoiceXML Standard (2004): “Help” should be a global command • AVIxD Wiki (2014): Stop using “Help” as a global • Agent API Doc (2015): Offer “Help”
  60. Specific Example: “Help” • Designers who tune applications have seen that the word “help” is a known “False Attractor” • Other things that you say which are short get recognized as “help” • People don’t voluntarily come up with “help” unless they are prompted • Give callers a context-specific command only where help may truly be needed, and call it something besides “help” • System: “Say or enter your account number, or say, where do I find it.”
  61. Special Case: Car • “Distracted Driver” is a hot topic! • Richard Young, Wayne State University • Paper: “Safe Interaction For Drivers” • “Visual-Manual Mode” – What we do today • “Auditory-Vocal Mode” – Speech only. NO GUI. • “Mixed Mode” – Speech and GUI being used together • Finding: If you give someone a graphic interface, they’re going to look at it • And take their eyes off the road
  62. Design Documents
  63. Usability Studies / Research • Special Challenges • Technical setup • Phone tap / Recording both sides
  64. Warner Bros.
  65. Early-Stage Voice-Only Prototype
  66. Should I Speech-Enable My ___?
  67. What’s the Use Case? • Enabling application • User can’t do it any other way • New tasks • Enhancing application • User can do it now • But speech makes it better • Faster • Safer
  68. API-Based, Device-Based, Roll Your Own / Open-Source • Flexibility • Power • Customization • Time • Difficulty
  69. Cloud vs. Downloadable / Embedded • Easy to get started • Lightweight • Not much specialized knowledge • Customizable • Probably better recognition • Can be device-specific • More features • Higher powered • May require specialized knowledge – Speech scientist
  70. Open Source ASR • CMU Sphinx • pocketsphinx • Kaldi • http://kaldi-asr.org/ • Github • New updates include some pretty interesting stuff (DNN) • Requires: • Corpus • Tech know-how
  71. Should I Speech-Enable My ___?
  72. Should I Speech-Enable My ___? Maybe
  73. Iron Man 2: Marvel Studios, Paramount Pictures • Where’s Jarvis?
  74. Where’s Jarvis? • Gesture-Based Interface • Artificial Intelligence • Voice-Based Interface
  75. Where’s Jarvis? • ASR • NLU • Voice Design • Context
  76. Resources • Handout / Web page
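
The intent-and-entity breakdown on the "Intents and Entities" slide can be sketched as a toy rule-based parser. This is an illustration only: the function name `parse_transfer` and the regular expression are assumptions made up for this example, and they do not represent how API.ai, LUIS, Wit.ai, or any other service named in the deck works. Those services learn the mapping from training data rather than from a hand-written pattern.

```python
import re

# Hypothetical hand-written pattern for a single intent (an assumption for
# illustration; trained NLU engines do not work from one fixed regex).
TRANSFER_PATTERN = re.compile(
    r"transfer\s+\$(?P<amount>\d+(?:\.\d{2})?)\s+"
    r"from\s+my\s+(?P<from_account>\w+)\s+account\s+"
    r"to\s+my\s+(?P<to_account>\w+)\s+account",
    re.IGNORECASE,
)


def parse_transfer(utterance):
    """Return the intent and entities as a dict, or None if nothing matches."""
    match = TRANSFER_PATTERN.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match.group("from_account").capitalize(),
        "TO_ACCOUNT": match.group("to_account").capitalize(),
        "AMOUNT": "$" + match.group("amount"),
    }


result = parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
)
print(result)
```

A pattern like this breaks as soon as the caller rephrases ("move fifty bucks into savings"), which is the practical argument for trained models.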

Editor's Notes

  • Voice User Interface Designer
    10 years in the field
    English major, former coder; got interested in UX
    President of the Association for Voice Interaction Design
    Consultant for Versay Solutions
    2 weeks in a row for conferences
  • Jarvis:
    Audio and gestural
    Perfect recognition.
    No error recovery needed
    Great voice quality
    Connected to vast amounts of data
    Understands all the parts of the model: “Lose the landscape.”
    Context-sensitive.
    Aware of the space around him
    Sense of humor. “Am I to include the Belgian Waffle stands?”
    Takes initiative. “What is it you’re trying to achieve, sir?”
  • Replicator:
    Good recognition
    No error recovery needed
    Good voice quality – understandable
    Connected to data – perhaps too much so?
    Context-sensitive, but was this enough?
    A design failure (not a tech failure)
    Specifically around excessive disambiguation
  • A Better Replicator Conversation
  • “Speech to Text”?
    Spoken language to machine-readable format
  • Not necessarily tied to speech recognition

  • Also called voiceprints, biometrics, voice authentication, etc.
    Not going to discuss this one in a lot of detail today but it’s important that you understand the difference between these technologies.
    Recognizes a person, not necessarily what they are saying.
    You can have ASR without Voice Verification
    And vice versa
  • Human voice talent
    Hundreds of hours of recording
    Digitized
    Phonemes:
    Concatenated speech synthesis
  • Dynamic Speech Synthesis
    Many commercial products are available
    API-based
    Downloadable
    Quality varies
    If possible, record audio
    TTS has improved considerably, but is still noticeable
    High quality TTS may not be available in all situations
    If you have a lot of dynamic data TTS is useful
    You can mix recorded audio and TTS
    You may have to use TTS
    Voice Agent (Alexa, Cortana, etc.)
    API-based
    Some of them do let you mark up your TTS with SSML

    More phonemes = higher quality voice
    Also means a bigger download and install (if on device)
    Exceptions (addresses, names) can be iffy
    May require a lot of work to handle well
    St. James St.
    Saint James Street
    Punctuation
    Your data needs to be clean and ready to voice back
    Acronyms, incomplete sentences will not sound good
    It is possible to build a custom voice
    But it takes a lot of work!
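
The "St. James St." case above can be sketched as a small text-normalization step that runs before the text reaches the TTS engine. The function name `expand_st` and its two rules are assumptions for illustration; real TTS front ends use much richer normalization.

```python
import re


def expand_st(text):
    """Expand 'St.' to 'Saint' before a name and to 'Street' everywhere else."""
    # 'St.' followed by a capitalized word is read as 'Saint'.
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any 'St.' still left is read as 'Street'.
    text = re.sub(r"\bSt\.", "Street", text)
    return text


spoken = expand_st("St. James St.")
print(spoken)  # Saint James Street
```

The rule is deliberately naive: "Main St. Apartment 4" would come out as "Main Saint Apartment 4", which shows why addresses and names can take a lot of work to handle well.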
  • Speech Synthesis Markup Language
    XML-based W3C standard
    Not universally supported
    Tags that allow you to produce more natural-sounding output.
    Emphasis
    Break
    Voice
    Prosody
    Pitch
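
A minimal sketch of what such markup can look like, using the `emphasis`, `break`, and `prosody` elements from the W3C SSML specification. The sentence and attribute values are made up for illustration, the root element omits the namespace attributes that the full specification expects, and which tags a given agent or TTS engine honors varies by platform. The Python wrapper only checks that the markup is well-formed XML; it does not synthesize audio.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML for the "soda, tea, or coffee" intonation example.
ssml = (
    "<speak>"
    'Do you want <emphasis level="strong">soda</emphasis>, tea, '
    '<break time="300ms"/> or '
    '<prosody pitch="low" rate="slow">coffee</prosody>?'
    "</speak>"
)

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
root = ET.fromstring(ssml)
tags = [element.tag for element in root.iter()]
print(tags)
```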
  • World Knowledge: Concepts of the world around us, e.g. tables have four legs, what is left and right, what a car is, etc. This is the level before language
    Semantics: The first level of language. Knowledge can be represented in structured meaningful elements. Example: semantics of a party invitation
    Syntax: The rules that govern putting words together to form meaningful units
    Lexicon: What words mean
    Morphology: How words change their form to perform differently in a language, e.g. horse / horses
    Phonetics: Phonemes and how words are built
    Acoustics: What phonemes sound like and how to create them
  • Speech is never stationary
    Coarticulation
    Noisy environments
    Accents
    Different speakers have voices with different acoustic qualities
    Goats
    Challenges vary depending on what you are going to recognize
    Spelling (short utterances) can be difficult even for humans
    Phonetic alphabet (Military)
  • Humans can deduce meaning from context and unknown words

    “How can I help you?”
    I’m having a problem with my account.

    I’d like that one. No, not the green one, the red one.

    Time flies like an arrow.
    Fruit flies like a banana.
  • All modern speech recognition is probabilistic
    GUI: Button clicked? true / false
    VUI: There is an 85% chance that button was clicked
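
Because a recognition result is a probability rather than a true/false event, a voice application usually has to decide what to do at each confidence level. The sketch below assumes a recognizer that returns an n-best list of (text, confidence) pairs. The function name `decide` and both thresholds are hypothetical and would need tuning against real call data.

```python
def decide(n_best, accept_at=0.85, confirm_at=0.50):
    """Choose an action for the top hypothesis in an n-best list.

    n_best is a list of (text, confidence) pairs with confidence in [0, 1].
    Returns ("accept", text), ("confirm", text), or ("reject", None).
    """
    if not n_best:
        return ("reject", None)
    text, confidence = max(n_best, key=lambda pair: pair[1])
    if confidence >= accept_at:
        return ("accept", text)       # act on it, perhaps with implicit confirmation
    if confidence >= confirm_at:
        return ("confirm", text)      # ask "Did you say ...?"
    return ("reject", None)           # re-prompt the user


decision = decide([("help", 0.30), ("pay my water bill", 0.90)])
print(decision)  # ('accept', 'pay my water bill')
```

The middle band is where design matters most: confirming too often is tedious, and confirming too rarely acts on misrecognitions.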
  • Three Dimensions of Speech Problems
  • AUDREY: Davis, Biddulph, and Balashek - Bell Labs 1952
    Analog
    Isolated digit recognition
    Pause between digits
    Speaker-dependent
    Speech recognition with vacuum tubes – How very steampunk.
    Her name was AUDREY. Let that sink in a minute.
    (Automatic Digit Recognizer)
  • 1980’s: The Power of Statistics
    The recognition of connected speech becomes a search for the best path in a large network
    Problem of finding the probabilities
    Statistical Language Models
    Not all sequences of words are equally probable
    Rank all permissible sentences in terms of probability
    “Correct” grammar is not applicable
    Restricted by domain
    Hidden Markov Models (HMM)
    Unified probabilistic model for speech
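
The "search for the best path" idea can be sketched with the Viterbi algorithm on a toy Hidden Markov Model. Everything in this example is an assumption chosen for illustration: the two states, the two observation labels, and all probabilities are invented, and a real recognizer works over phoneme-level states and acoustic feature vectors rather than "quiet" and "loud".

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               the state that path was in at time t - 1)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model (all numbers are made up): is each audio frame silence or speech?
states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.3, "loud": 0.7},
}

path = viterbi(["quiet", "loud", "loud", "quiet", "quiet"],
               states, start_p, trans_p, emit_p)
print(path)  # ['silence', 'speech', 'speech', 'silence', 'silence']
```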
  • You’re Only As Good As What You’re Trained On
    Corpora
    Collection of speech used to train a recognizer
    Acoustic and/or Pronunciation Model
    Associates sounds with symbols and words.
    Created from a general speech corpus and a phonetic and orthographic transcription
    Statistical Language Model (SLM)
    A probability distribution over sequences of words
    Created from a domain-specific speech corpus and a tagged transcription to extract meaning
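
"A probability distribution over sequences of words" can be sketched with a bigram model trained on a tiny corpus. The three training sentences are invented for illustration, and the model uses plain maximum-likelihood counts with no smoothing, which a production SLM would not do.

```python
from collections import Counter


def train_bigram(corpus):
    """Count context words and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Maximum-likelihood probability of a sentence (no smoothing)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, current in zip(words[:-1], words[1:]):
        if unigrams[previous] == 0:
            return 0.0
        probability *= bigrams[(previous, current)] / unigrams[previous]
    return probability


# Hypothetical domain-specific corpus.
corpus = ["pay my water bill", "pay my phone bill", "check my balance"]
unigrams, bigrams = train_bigram(corpus)

likely = sentence_probability("pay my water bill", unigrams, bigrams)
scrambled = sentence_probability("bill water my pay", unigrams, bigrams)
print(likely, scrambled)
```

The same words in a scrambled order score zero, while an unseen but plausible sentence such as "pay my balance" still gets a nonzero score because each of its word pairs appeared in training.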
  • Speech Agent: The “Person” who
    Distributed speech recognition
    Collection and compression of speech is on the device
    The language models are typically on the network
    Phone can be speaker-dependent
    Trains itself on your voice and on the acoustic environments you are in most often
    Many companies are providing APIs to use their speech recognition
  • “Alexa, ask Capital One, what’s my current credit card balance?”
  • Observations to make: Represents the entirety of a VUI experience
    Placement of Spanish prompt would vary depending on type of call.
    Confirmation is variable
    Confirmation prompt is general
  • What do you need it for?
    What kind of device will you be running it on?
    Connectivity?
    Can you use cloud based ASR?
    How much control do you need over the application / user interface?
  • Jarvis:
    Audio and gestural
    Perfect recognition.
    No error recovery needed
    Great voice quality
    Connected to vast amounts of data
    Understands all the parts of the model: “Lose the landscape.”
    Context-sensitive.
    Aware of the space around him
    Sense of humor. “Am I to include the Belgian Waffle stands?”
    Takes initiative. “What is it you’re trying to achieve, sir?”
