Deep Learning in Alexa at AI NEXT Conference

AI NEXT Conference 2017 Seattle by Nikko Strom.
Video: https://www.youtube.com/channel/UCj09XsAWj-RF9kY4UvBJh_A


  1. AI NEXT, Bellevue, WA, March 18, 2017. Nikko Strom, Sr. Principal Scientist, Alexa Machine Learning. www.nikkostrom.com – twitter.com/nikkostrom – angel.co/nikkostrom
  2. • Alexa • Deep learning at scale • Speech recognition • Speech synthesis • Alexa Prize, Alexa Fund, and Alexa Accelerator
  3. Alexa’s growing family AI NEXT 3/18/2017 (C) Amazon.com
  4. Alexa in the wild
  5. Alexa’s friends 68°
  6. Alexa’s skills
  7. Deep Learning at scale
  8. Longer-form talk at AWS re:Invent 2016: Deep Learning in Alexa (MAC202), https://www.youtube.com/watch?v=TYRckcVm4WE
  9. Speech data: 16 years = 140,160 hours ≈ 14,016 hours of speech
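
The arithmetic behind slide 9 can be checked with a short sketch. The 365-day year and the 10% speech fraction are assumptions: the slide states neither, but they are the values that reproduce its two figures exactly.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365   # assumption: leap days ignored
YEARS = 16            # from the slide

total_hours = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY

# Assumption: one hour in ten is speech. This fraction is inferred from the
# ratio of the slide's two numbers; the slide itself does not state it.
SPEECH_ONE_IN = 10
speech_hours = total_hours // SPEECH_ONE_IN

print(total_hours)   # 140160
print(speech_hours)  # 14016
```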
  10. Large-scale distributed training: up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model. Thousands of hours of speech training data stored in S3.
  11. Large-scale distributed training: all nodes must communicate updates to the model to all other nodes. GPUs compute model updates fast – Think updates per second. A model update is hundreds of MB.
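
A rough sketch of why the communication pattern on slides 10 and 11 is demanding, assuming a naive scheme in which every node sends each full update to every other node. The function name, the 100 MB update size, and the rate of one update per second are illustrative placeholders, not figures from the talk; only the count of 80 nodes comes from the slides.

```python
def per_node_outgoing_bytes(num_nodes: int, update_bytes: int,
                            updates_per_second: int) -> int:
    """Bytes per second one node sends if it ships every full update
    to every other node (naive all-to-all exchange)."""
    return (num_nodes - 1) * update_bytes * updates_per_second


NUM_NODES = 80               # upper bound quoted on slide 10
UPDATE_BYTES = 100 * 10**6   # assumed placeholder for "hundreds of MB"
UPDATES_PER_SECOND = 1       # assumed placeholder; the slide gives no rate

per_node = per_node_outgoing_bytes(NUM_NODES, UPDATE_BYTES, UPDATES_PER_SECOND)
cluster_total = NUM_NODES * per_node

print(per_node / 10**9, "GB/s sent by each node")
print(cluster_total / 10**9, "GB/s across the cluster")
```

Even with these conservative placeholders, the naive exchange lands in the gigabytes-per-second range per node, which suggests why reducing the size of each update matters at this scale.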
  12. DNN training speed (plot: frames per second, 0 to 600,000, against number of GPU workers, 0 to 80). Strom, Nikko. "Scalable Distributed DNN Training using Commodity GPU Cloud Computing." INTERSPEECH. Vol. 7. 2015.
  13. Speech Recognition
  14. Speech recognition: Sound → Signal processing → Feature vectors [4.7, 2.3, -1.4, …] → Acoustic model → Phonetic probabilities [0.1, 0.1, 0.4, …] → Decoder (inference) → Words “increase to 70 degrees” → Post processing → Text “Increase to 70°”
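
The last stage on slide 14, post processing, turns the decoder's word sequence into display text. A minimal sketch of that one step, covering only the slide's example (a degrees rewrite plus sentence capitalization); the function name and the single rule are assumptions, not a description of Alexa's actual post-processor.

```python
import re


def post_process(words: str) -> str:
    """Toy post-processing: '<number> degrees' becomes '<number>°',
    then the first letter is capitalized."""
    text = re.sub(r"(\d+) degrees\b", "\\1\u00b0", words)
    return text[:1].upper() + text[1:]


print(post_process("increase to 70 degrees"))
```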
  15. Transfer learning from English to German: Hidden layer 1 → Hidden layer 2 → … → Last hidden layer → Output layer (phoneme outputs: æI ɑɜ ʊ … eæI ɑɜ u: … œ)
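
One way to read slide 15 is that the hidden layers trained on English are reused for German, and only the output layer is replaced to match the new phoneme set. The sketch below illustrates that idea with plain Python lists. All layer sizes, the helper names `new_layer` and `transfer`, and the random initialization are hypothetical; whether the reused layers are then fine-tuned is not stated on the slide.

```python
import random


def new_layer(n_in, n_out, rng):
    """Randomly initialized weight matrix: one row of n_in weights per output unit."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def transfer(source_model, n_target_outputs, rng):
    """Copy the hidden layers of the source model and attach a fresh output layer."""
    hidden = [[row[:] for row in layer] for layer in source_model["hidden"]]
    n_last_hidden = len(hidden[-1])  # number of units in the last hidden layer
    return {"hidden": hidden,
            "output": new_layer(n_last_hidden, n_target_outputs, rng)}


rng = random.Random(0)
# Hypothetical sizes: 40 input features, three hidden layers of 8 units,
# 45 English output units, 50 German output units.
english = {
    "hidden": [new_layer(40, 8, rng), new_layer(8, 8, rng), new_layer(8, 8, rng)],
    "output": new_layer(8, 45, rng),
}
german = transfer(english, n_target_outputs=50, rng=rng)

print(len(german["hidden"]), "hidden layers reused")
print(len(german["output"]), "German output units")
```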
  16. The cocktail party problem Alexa! Blah Blah Blah Blah Blah Blah
  17. The cocktail party problem … play some jazz! …blah, blah, blah, blah… …blah, blah, blah, blah… …blah, blah, blah, blah… …blah, blah. …blah, blah, blah, blah… …blah, blah, blah, blah…
  18. Anchored speech detection: “Alexa, play some jazz!” The wake word is the “anchor” and goes to the Encoder; the request (speech consistent with the anchor) goes to the Decoder. Roland Maas, Sree Hari Krishnan Parthasarathi, Brian King, Ruitong Huang, Björn Hoffmeister. “Anchored Speech Detection.” INTERSPEECH. 2016.
  19. Anchored speech detection: “Alexa, play some jazz!” LSTM Encoder: speech features from wake word → anchor embedding. LSTM Decoder: speech features from request + anchor embedding → endpoint decision over time t. (Maas et al., “Anchored Speech Detection,” INTERSPEECH 2016.)
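
The data flow on slides 18 and 19 can be illustrated with a deliberately simplified stand-in. The sketch below is not the LSTM encoder-decoder from the cited paper: mean pooling replaces the encoder, and a cosine-similarity threshold replaces the decoder. The function names, the two-dimensional toy features, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math


def encode_anchor(wake_word_frames):
    """Stand-in for the encoder: average the wake-word feature frames
    into a single anchor embedding."""
    n = len(wake_word_frames)
    dim = len(wake_word_frames[0])
    return [sum(frame[i] for frame in wake_word_frames) / n for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consistent_with_anchor(request_frames, anchor, threshold=0.8):
    """Stand-in for the decoder: flag each request frame whose features
    resemble the anchor embedding."""
    return [cosine(frame, anchor) >= threshold for frame in request_frames]


wake_word = [[1.0, 0.0], [0.9, 0.1]]            # toy features for "Alexa"
request = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.0]]  # same talker, other talker, same talker

anchor = encode_anchor(wake_word)
print(consistent_with_anchor(request, anchor))  # [True, False, True]
```

The point of the sketch is only the shape of the computation: the wake word is summarized once, and every later frame is judged relative to that summary.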
  20. Speech synthesis
  21. Speech synthesis: Text “She has 20$ in her pocket.” → Text normalization → “she has twenty dollars in her pocket” → Grapheme-to-phoneme conversion → ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t → Waveform generation → Speech
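
The text normalization step on slide 21 can be sketched for the slide's own example. This is a toy covering only whole-dollar amounts below 100, lower-casing, and punctuation removal; the function names and rules are assumptions, and a production normalizer handles far more cases.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Lower-case, spell out dollar amounts written as '$20' or '20$',
    and drop punctuation."""
    def spell_amount(match):
        amount = int(match.group(1) or match.group(2))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    text = re.sub(r"\$(\d+)|(\d+)\$", spell_amount, text.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


print(normalize("She has 20$ in her pocket."))
```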
  22. Concatenative synthesis: Input ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t → Di-phone unit selection (from the di-phone segment database) → Speech
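
A minimal sketch of the concatenative idea on slide 22: split the phoneme sequence into di-phones (adjacent phoneme pairs), look each one up in a segment database, and join the stored waveform pieces. The database contents, the `#` boundary symbol, and the rule of always taking the first candidate are assumptions for illustration; real unit selection chooses among many candidates per di-phone.

```python
def to_diphones(phonemes, boundary="#"):
    """Pair each phoneme with its successor, padding with a boundary symbol."""
    padded = [boundary, *phonemes, boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def synthesize(phonemes, database):
    """Concatenate the first stored segment for each di-phone."""
    samples = []
    for diphone in to_diphones(phonemes):
        candidates = database[diphone]   # KeyError if the unit is missing
        samples.extend(candidates[0])    # assumption: always pick the first candidate
    return samples


# Hypothetical database: each di-phone maps to a list of candidate waveform snippets.
database = {
    "#-ʃ": [[0.0, 0.1]],
    "ʃ-i": [[0.2, 0.3]],
    "i-#": [[0.1, 0.0]],
}

print(to_diphones(["ʃ", "i"]))           # ['#-ʃ', 'ʃ-i', 'i-#']
print(synthesize(["ʃ", "i"], database))  # [0.0, 0.1, 0.2, 0.3, 0.1, 0.0]
```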
  23. Prosody for natural-sounding reading: • Phonetic features • Linguistic features • Semantic word vectors → Bi-directional recurrent network → targets for segment pitch, duration, intensity
  24. Long-form example: “Over a lunch of diet cokes and lobster salad one balmy fall day in Boston, Joseph Martin, the genial, white-haired, former dean of Harvard medical school, told me how many hours of pain education Harvard med students get during four years of medical school.” Before / After
  25. Collaborative Programs
  26. $2.5M inaugural competition to advance the field of Conversational AI. CHALLENGE: Create a socialbot that can converse coherently and engagingly on popular topics for 20 minutes. ALEXA, LET’S TALK ABOUT AI
  27. What is the Alexa Fund? $100 MM venture capital fund to invest in early-stage and growth-stage companies. We seek to support best-of-breed entrepreneurs and companies that can innovate on the Alexa service: • Products or services which introduce new and compelling voice use cases to Alexa through hardware or software • Enabling technologies that can enhance the capabilities of the Alexa service itself, including natural language understanding (NLU), automatic speech recognition (ASR), artificial intelligence (AI), and text-to-speech (TTS)
  28. The Alexa Accelerator: our new program for early-stage companies, powered by Techstars. Applications Due → 13 Week Program → Demo Day. www.alexa-accelerator.com
  29. Jobs are here: www.amazon.jobs/alexa, speech@amazon.com
