AWS re:Invent 2016: Deep Learning in Alexa (MAC202)



Neural networks have a long and rich history in automatic speech recognition. In this talk, we present a brief primer on the origin of deep learning in spoken language, and then explore today’s world of Alexa. Alexa is the AWS service that understands spoken language and powers Amazon Echo. Alexa relies heavily on machine learning and deep neural networks for speech recognition, text-to-speech, language understanding, and more. We also discuss the Alexa Skills Kit, which lets any developer teach Alexa new skills.

Published in: Technology

  1. 1. Deep Learning in Alexa (MAC202). Nikko Strom, Sr. Principal Scientist; Arpit Gupta, Scientist. November 30, 2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Outline • History of Deep Learning • Deep Learning in Alexa • The Alexa Skills Kit
  3. 3. History of Deep Learning. Periods: intense academic activity, the “neural winter”, the “GPU era”. Timeline marks: 1986 (Hinton, Rumelhart and Williams invent backpropagation training), 1998, 2007, 2014 (Amazon Echo launches!), 2016.
  4. 4. Multilayer perceptron. The input x passes through the “input layer”, two “hidden layers”, and the “output layer” to give the output y: $h_1 = \mathrm{sigmoid}(A_1 x + b_1)$, $h_2 = \mathrm{sigmoid}(A_2 h_1 + b_2)$, $y = \mathrm{sigmoid}(A_o h_2 + b_o)$
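
The three layer equations on the multilayer perceptron slide can be sketched as a forward pass in plain Python. The layer sizes and weight values below are illustrative assumptions, not values from the talk; only the structure (two sigmoid hidden layers followed by a sigmoid output layer) comes from the slide.

```python
import math


def sigmoid(z):
    """Logistic function, applied to one pre-activation value."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(A, x, b):
    """Compute sigmoid(A x + b) for a matrix A (list of rows), vector x, bias b."""
    return [
        sigmoid(sum(a * xi for a, xi in zip(row, x)) + bi)
        for row, bi in zip(A, b)
    ]


def mlp_forward(x, params):
    """Forward pass: h1 = sigmoid(A1 x + b1), h2 = sigmoid(A2 h1 + b2), y = sigmoid(Ao h2 + bo)."""
    (A1, b1), (A2, b2), (Ao, bo) = params
    h1 = layer(A1, x, b1)
    h2 = layer(A2, h1, b2)
    return layer(Ao, h2, bo)


# Hypothetical weights, chosen only to make the example runnable.
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # 2 inputs -> 3 hidden
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),       # 3 hidden -> 2 hidden
    ([[1.0, -1.0]], [0.0]),                                      # 2 hidden -> 1 output
]
y = mlp_forward([1.0, 2.0], params)
print(y)
```

Because every layer ends in a sigmoid, each output component lies strictly between 0 and 1.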
  5. 5. Deep Learning milestones (timeline marks: 1986, 1997, 1998, 2002, 2009, 2010, 2011, 2016; period label: “neural winter”). Hinton, Rumelhart and Williams. Hochreiter and Schmidhuber invent LSTM for recurrent networks with long memory. LeCun, Bottou, Bengio and Haffner publish CNN for computer vision. Salakhutdinov and Hinton discover a method to train very deep neural networks. Mohamed, Dahl and Hinton beat a well-known speech recognition benchmark (TIMIT). Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition. Krizhevsky, Sutskever, and Hinton win the ImageNet object recognition challenge. AlphaGo beats a Go world champion.
  6. 6. Deep Learning in Speech Recognition (timeline marks: 1986, ’89, ’91, ’92, ’96, 1998, 2002, 2009, 2010, 2011, 2016; period label: “neural winter”). Waibel, Hanazawa, Hinton, Shikano, and Lang publish the time-delay neural network (TDNN). Strom combines time-delay NN and RNN (RTDNN). Strom introduces speaker vectors for speaker adaptation. Robinson demonstrates RNN for ASR and gets the best result on TIMIT so far. Bourlard, Morgan, Wooters and Renals introduce context-dependent MLP models. Mohamed, Dahl and Hinton beat a well-known speech recognition benchmark (TIMIT). Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition.
  7. 7. Impact of data corpus size: 16 years = 140,160 hours ≈ 14,016 hours of speech
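
The figures on the corpus-size slide can be checked with a few lines of arithmetic. The 365-day year matches the slide's 140,160 hours exactly; the one-in-ten speaking fraction is not stated on the slide and is inferred here from the ratio of the two figures, so treat it as an assumption.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # leap days ignored, which reproduces the slide's figure
YEARS = 16           # from the slide

total_hours = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY

# Assumption: about one hour in ten contains speech (inferred from the slide's ratio).
SPEECH_ONE_IN = 10
speech_hours = total_hours // SPEECH_ONE_IN

print(total_hours, speech_hours)
```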
  8. 8. Neural winter Impact of data corpus size 8800 GTX 350 GFLOPS 1986 1998 2007 2016
  9. 9. Impact of compute capacity (timeline marks: 1986, 1998, 2007, 2016; period label: “neural winter”). Cray X-MP/48, 1986, 1 GFLOPS. Sun Ultra 60: 1 GFLOPS. ASCI Red: 1 TFLOPS. 8800 GTX: 350 GFLOPS. Roadrunner: 1 PFLOPS. cg1.4xlarge: 1 TFLOPS. p2.16xlarge: 23 TFLOPS (70 TFLOPS single). Taihu: 100 PFLOPS.
  10. 10. Impact of compute infrastructure (timeline marks: 1986, 1998, 2007, 2012, 2016; period labels: “neural winter”, “reign of EM”, “distributed SGD”) • During the “neural winter,” EM became a dominant distributed computing paradigm for machine learning (ML) • ML algorithms that use the EM algorithm benefited greatly • Distributed SGD broke Deep Learning out of the single box (Dean et al.; Strom 2015)
  11. 11. Conclusion – how we got here. We are in a period of massive Deep Learning adoption because: • Theory and algorithm design in the 80s and 90s • Orders of magnitude more data available • Orders of magnitude more computational capacity • A few algorithmic inventions enabled deep networks • The rise of distributed SGD training
  12. 12. Deep Learning in Alexa
  13. 13. Large-scale distributed training Up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model Thousands of hours of speech training data stored in Amazon S3
  14. 14. Large-scale distributed training All nodes must communicate updates to the model to all other nodes. GPUs compute model updates fast – Think updates per second A model update is hundreds of MB
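
A back-of-the-envelope sketch shows why the communication described on this slide is the hard part. It assumes a naive scheme in which every worker sends each full update to every other worker. The 200 MB update size and the rate of one update per second are illustrative assumptions standing in for the slide's “hundreds of MB” and “updates per second”; the worker count of 80 comes from the previous slide. This is not a description of how the talk's system actually communicates.

```python
def per_node_outbound_mb_per_s(update_mb, updates_per_s, n_workers):
    """Traffic one worker sends if it ships every full update to every peer."""
    return update_mb * updates_per_s * (n_workers - 1)


def cluster_traffic_mb_per_s(update_mb, updates_per_s, n_workers):
    """Total traffic across the cluster under the same naive all-to-all scheme."""
    return n_workers * per_node_outbound_mb_per_s(update_mb, updates_per_s, n_workers)


UPDATE_MB = 200     # assumption: "hundreds of MB" per model update
UPDATES_PER_S = 1   # assumption: on the order of one update per second
N_WORKERS = 80      # from the slides: up to 80 GPU instances

per_node = per_node_outbound_mb_per_s(UPDATE_MB, UPDATES_PER_S, N_WORKERS)
total = cluster_traffic_mb_per_s(UPDATE_MB, UPDATES_PER_S, N_WORKERS)
print(f"per node: {per_node} MB/s, cluster: {total} MB/s")
```

Under these assumed numbers each worker would need to send roughly 15.8 GB every second, which is why naive all-to-all exchange of full updates does not scale.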
  15. 15. DNN training speed (plot: frames per second, 0 to 600,000, against number of GPU workers, 0 to 80). Strom, Nikko. "Scalable Distributed DNN Training using Commodity GPU Cloud Computing." INTERSPEECH. Vol. 7. 2015.
  16. 16. Speech Recognition
  17. 17. Speech recognition pipeline: Sound → Signal processing → Feature vectors [4.7, 2.3, -1.4, …] → Acoustic model → Phonetic probabilities [0.1, 0.1, 0.4, …] → Decoder (inference) → Words “increase to 70 degrees” → Post processing → Text “Increase to 70°”
  18. 18. Transfer learning from English to German Hidden layer 1 Hidden layer 2 Last hidden layer æI ɑɜ ʊ … eæI ɑɜ u: … œ Output layer
  19. 19. Natural Language Understanding
  20. 20. Intent and entities. Utterance: “play two steps behind by def leppard”. Intent: PlayMusic. Entities: Song, Artist. Two problems: 1. Words are symbols – not vectors of numbers 2. Requests are of different lengths
  21. 21. Recurrent Neural Networks. Input: “play two steps behind by def leppard” → Recurrent Network → PlayMusic
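
A minimal sketch of how a recurrent network addresses the two problems named on the previous slide: each word is looked up as a vector of numbers, and the network folds a request of any length into a fixed-size state. The vocabulary, the dimensions, and the random, untrained weights are all assumptions for illustration; a real system would train the weights and add an output layer that maps the final state to an intent such as PlayMusic.

```python
import math
import random


def build_vocab(utterances):
    """Map each distinct word (a symbol) to an integer index."""
    vocab = {}
    for utterance in utterances:
        for word in utterance.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def rand_matrix(rows, cols, rng):
    """Random weights; a stand-in for parameters that would normally be trained."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode(utterance, vocab, E, W, U):
    """Simple recurrent update h_t = tanh(W x_t + U h_{t-1}); returns the final state."""
    h = [0.0] * len(W)
    for word in utterance.split():
        x = E[vocab[word]]  # word symbol -> vector of numbers
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W[i], x))
                + sum(u * hj for u, hj in zip(U[i], h))
            )
            for i in range(len(W))
        ]
    return h


rng = random.Random(0)
utterances = ["play two steps behind by def leppard", "play music"]
vocab = build_vocab(utterances)
EMBED, HIDDEN = 4, 3  # hypothetical sizes
E = rand_matrix(len(vocab), EMBED, rng)  # one embedding row per word
W = rand_matrix(HIDDEN, EMBED, rng)      # input-to-hidden weights
U = rand_matrix(HIDDEN, HIDDEN, rng)     # hidden-to-hidden weights

states = [encode(u, vocab, E, W, U) for u in utterances]
for u, s in zip(utterances, states):
    print(len(u.split()), "words ->", len(s), "state values")
```

Both requests, seven words and two words long, end up as a state of the same size, which is what lets a single classifier sit on top.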
  22. 22. Speech synthesis
  23. 23. Speech synthesis pipeline: Text → Text normalization → Grapheme-to-phoneme conversion → Waveform generation → Speech. Example: “She has 20$ in her pocket.” → “she has twenty dollars in her pocket” → ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t
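
The text normalization step in this pipeline can be illustrated with a toy rule-based normalizer that reproduces the slide's example. The rules below (dollar amounts written as digits followed by `$`, integers from 0 to 99, punctuation stripping) are assumptions made for this sketch; the talk does not say how Alexa's normalizer is implemented.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99; larger numbers are out of scope here."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text):
    """Expand dollar amounts and bare numbers, drop punctuation, lowercase."""
    def money(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    text = re.sub(r"(\d+)\$", money, text)  # "20$" -> "twenty dollars"
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    text = re.sub(r"[^\w\s]", "", text)     # strip remaining punctuation
    return " ".join(text.lower().split())


print(normalize("She has 20$ in her pocket."))
```

The output of this step is what the grapheme-to-phoneme stage would consume.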
  24. 24. Concatenative synthesis: Input (ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t) → Di-phone unit selection, drawing on a di-phone segment database → Speech
  25. 25. Prosody for natural sounding reading. Inputs: • Phonetic features • Linguistic features • Semantic word vectors. A bi-directional recurrent network predicts targets for each segment: pitch, duration, intensity.
  26. 26. Long-form example “Over a lunch of diet cokes and lobster salad one balmy fall day in Boston, Joseph Martin, the genial, white-haired, former dean of Harvard medical school, told me how many hours of pain education Harvard med students get during four years of medical school.” Before After
  27. 27. The Alexa Skills Kit
  28. 28. The Alexa Skills Kit (diagram labels: Customers, “Alexa!”, Alexa, Developers)
  29. 29. Growth of Published Skills (plot: number of published skills, 0 to 4,000, from March to September 2016)
  30. 30. Alexa Skills: Examples. Business: Uber, Dominos, Fidelity, Capital One, Home Advisor, 1-800 Flowers. Info: Washington Post, Campbell’s Kitchen, Boston Children’s Hospital, Stocks, Bitcoin Price, History Buff, Savvy Consumer. Fitness: Fitbit, 7-Minute Workout. Automation: Nest, Garageio, Scout Alarm. Misc: Quick Events, Phone Finder, Cat Facts, Famous Quotes. Games: Jeopardy!, Minesweeper, Word Master, Blackjack, Math Puzzles, Guess Number, Spelling Bee.
  31. 31. ASK for Developers (diagram labels: Customers, “Alexa!”, Alexa, Developers)
  32. 32. ASK for Developers • Define a Voice User Interface • Provide a finite number of sample utterances • ASK automatically builds and deploys machine learning models
  33. 33. Developer Input
  34. 34. Model Build Workflow DEVELOPER Developer Portal Website creates/edits skill Skill Model Builder builds/uploads skill models reads skill.json writes skill defn Data Store Runtime Cloud Store
  35. 35. Model Building. Developer Input feeds three components: finite-state transducers (FSTs, exact match), an ML Entity Recognizer, and an ML Intent Recognizer. We build two models: FSTs are for exact matches, machine learning models for fuzzy matches.
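
The exact-then-fuzzy split described on this slide can be sketched as follows. This is a simplified stand-in, not ASK's implementation: compiled regular expressions play the role of the FSTs, and a token-overlap (Jaccard) score plays the role of the trained ML intent recognizer. The intent names `GetRide` and `CancelRide`, the threshold, and the function names are hypothetical; the sample utterances are the ones shown in the talk plus one invented for contrast.

```python
import re

# Developer input: sample utterances per intent, with <Slot> placeholders.
SAMPLES = {
    "GetRide": ["get a car to <Destination>", "get me a car"],
    "CancelRide": ["cancel my car"],  # hypothetical extra intent
}


def template_to_regex(template):
    """Turn 'get a car to <Destination>' into a regex with a named group per slot."""
    parts = re.split(r"<(\w+)>", template)  # literal text at even indices, slot names at odd
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def literal_tokens(template):
    """Words of a template with slot placeholders removed."""
    return set(re.sub(r"<\w+>", " ", template).split())


def understand(utterance, threshold=0.25):
    """Try exact template matches first, then fall back to a fuzzy score."""
    for intent, templates in SAMPLES.items():
        for template in templates:
            match = template_to_regex(template).fullmatch(utterance)
            if match:
                return intent, match.groupdict(), "exact"

    words = set(utterance.split())
    best_intent, best_score = None, 0.0
    for intent, templates in SAMPLES.items():
        for template in templates:
            tokens = literal_tokens(template)
            score = len(words & tokens) / len(words | tokens)  # Jaccard overlap
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score >= threshold:
        return best_intent, {}, "fuzzy"
    return None, {}, "none"


print(understand("get a car to starbucks"))
print(understand("hey uhm i need a car to starbucks"))
```

The first call matches a sample utterance exactly and fills the `Destination` slot; the second matches no template but still overlaps enough with one to be assigned the same intent. In this sketch the fuzzy path does not extract slots, a job the slide assigns to a separate ML entity recognizer.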
  36. 36. ASK Machine Learning. Training (TRAIN): developers provide a finite number of sample utterances, e.g. “get a car to <Destination>”, “get me a car”, … Runtime (MATCH): customers produce an infinite number of possible utterances, e.g. “hey uhm i need a car to starbucks”, which the ASK Machine Learning Model has to match.
  37. 37. ASK Machine Learning (contd.) • Neural Networks (NNs) • Transfer Learning: Use knowledge learned from large related training data • Example: We’ve seen slots like <Destination> before, no need to learn from scratch. Sample utterances: “get a car to <Destination>”, “get me a car”, …
  38. 38. How to Write Great Skills: Slots • Catalogs: Provide as many values as possible. Add representative values of different lengths where appropriate • Use built-in slots where possible (e.g., cities, states, first names) • Do not use too many slots in one utterance (rather, ask for missing slots in a dialog) • Use context around each slot
  39. 39. How to Write Great Skills: Intents • Split heterogeneous intents • Use built-in intents where possible • Provide as many carrier phrases as possible • Use a thesaurus or paraphrasing tools, or ask your friends or Mechanical Turk workers for utterances
  40. 40. Conclusions • ASK connects developers to customers • Developers constantly extend Alexa’s capabilities • We constantly get more data and improve experience via machine learning • Making Alexa more intelligent and powerful, bridging the gap between human and machine
  41. 41. Thank you!
  44. 44. Images used Glove vectors. Produced internally.
  45. 45. Images used Macaw. Public domain. VW. Free for editorial use.
  46. 46. Images used ASCI Red. Public domain. 8800 GTX. Permission by email by Tri Hyunth at Nvidia.