
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)

Neural networks have a long and rich history in automatic speech recognition. In this talk, we present a brief primer on the origin of deep learning in spoken language, and then explore today’s world of Alexa. Alexa is the AWS service that understands spoken language and powers Amazon Echo. Alexa relies heavily on machine learning and deep neural networks for speech recognition, text-to-speech, language understanding, and more. We also discuss the Alexa Skills Kit, which lets any developer teach Alexa new skills.

  1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Nikko Strom, Sr. Principal Scientist; Arpit Gupta, Scientist. November 30, 2016. Deep Learning in Alexa (MAC202)
  2. Outline • History of Deep Learning • Deep Learning in Alexa • The Alexa Skills Kit
  3. History of Deep Learning (timeline, 1986–2016): 1986, Hinton, Rumelhart, and Williams invent backpropagation training; 1986–1998, intense academic activity; 1998–2007, the “neural winter”; 2007–2016, the “GPU era”; 2014, Amazon Echo launches!
  4. Multilayer perceptron: input x (the “input layer”) feeds two “hidden layers” and an “output layer”: h1 = sigmoid(A1·x + b1); h2 = sigmoid(A2·h1 + b2); output y = sigmoid(Ao·h2 + bo).
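The layer equations above can be sketched in pure Python; the toy weights and layer sizes below are made up purely for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(A, x, b):
    """One dense layer: sigmoid(A x + b), with A given as a list of rows."""
    return [sigmoid(sum(a * xi for a, xi in zip(row, x)) + bi)
            for row, bi in zip(A, b)]

def mlp_forward(x, params):
    """h1 = sigmoid(A1 x + b1); h2 = sigmoid(A2 h1 + b2); y = sigmoid(Ao h2 + bo)."""
    h = x
    for A, b in params:
        h = layer(A, h, b)
    return h

# Toy weights: 2 inputs -> 3 hidden -> 3 hidden -> 1 output.
params = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),                # A1, b1
    ([[0.2, 0.2, 0.2], [-0.5, 0.1, 0.3], [0.4, -0.4, 0.0]], [0.0, 0.0, 0.0]),  # A2, b2
    ([[1.0, -1.0, 0.5]], [0.0]),                                               # Ao, bo
]
y = mlp_forward([1.0, 2.0], params)
```

Because every layer ends in a sigmoid, each output lands strictly between 0 and 1.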
  5. Deep Learning milestones: 1986, Hinton, Rumelhart, and Williams (backpropagation); 1997, Hochreiter and Schmidhuber invent the LSTM for recurrent networks with long memory; 1998, LeCun, Bottou, Bengio, and Haffner publish CNNs for computer vision; 2006, Salakhutdinov and Hinton discover a method to train very deep neural networks; 2009, Mohamed, Dahl, and Hinton beat a well-known speech recognition benchmark (TIMIT); 2011, Microsoft and Google demonstrate breakthrough results on large-vocabulary speech recognition; 2012, Krizhevsky, Sutskever, and Hinton win the ImageNet object recognition challenge; 2016, AlphaGo beats a Go world champion.
  6. Deep Learning in speech recognition: ’89, Waibel, Hanazawa, Hinton, Shikano, and Lang publish the time-delay neural network (TDNN); ’91, Robinson demonstrates RNNs for ASR and gets the best TIMIT result so far; ’92, Bourlard, Morgan, Wooters, and Renals introduce context-dependent MLP models; ’96, Strom combines time-delay NNs and RNNs (RTDNN) and introduces speaker vectors for speaker adaptation.
  7. Impact of data corpus size: 16 years of life = 140,160 hours, of which ≈14,016 hours is speech.
  8. Impact of data corpus size (timeline chart, 1986–2016, spanning the “neural winter”).
  9. Impact of compute capacity (timeline, 1986–2016, spanning the “neural winter”): 1986, Cray X-MP/48, 1 GFLOPS; 1998, Sun Ultra 60, 1 GFLOPS; ASCI Red, 1 TFLOPS; 2007, 8800 GTX, 350 GFLOPS; cg1.4xlarge, 1 TFLOPS; Roadrunner, 1 PFLOPS; 2016, p2.16xlarge, 23 TFLOPS (70 TFLOPS single precision); Taihu, 100 PFLOPS.
  10. Impact of compute infrastructure (timeline, 1986–2016) – the reign of EM: • During the “neural winter,” EM became a dominant distributed-computing paradigm for machine learning (ML) • ML algorithms that use the EM algorithm benefited greatly • Distributed SGD (Dean et al.; Strom 2015) broke deep learning out of the single box
  11. Conclusion – how we got here. We are in a period of massive Deep Learning adoption because: • Theory and algorithm design in the ’80s and ’90s • Orders of magnitude more data available • Orders of magnitude more computational capacity • A few algorithmic inventions enabled deep networks • The rise of distributed SGD training
  12. Deep Learning in Alexa
  13. Large-scale distributed training: up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model; thousands of hours of speech training data stored in Amazon S3.
  14. Large-scale distributed training: all nodes must communicate updates to the model to all other nodes. GPUs compute model updates fast – think updates per second – and a model update is hundreds of MB.
  15. DNN training speed (chart: frames per second, 0–600,000, vs. number of GPU workers, 0–80). Strom, Nikko. “Scalable Distributed DNN Training using Commodity GPU Cloud Computing.” INTERSPEECH 2015.
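A minimal sketch of the synchronous data-parallel SGD idea behind the slides above: each worker computes a gradient on its own data shard, and an all-reduce averages the gradients so every node applies the identical update. This is a toy scalar model with equal shard sizes (so the plain mean matches the true gradient), not the compressed update scheme of the cited paper:

```python
def worker_grad(w, shard):
    """Local gradient of the toy loss 0.5*(w - x)^2, averaged over one shard."""
    return sum(w - x for x in shard) / len(shard)

def allreduce_mean(values):
    """The all-to-all exchange: every node ends up holding the average."""
    return sum(values) / len(values)

def distributed_sgd(w, shards, lr=0.5, steps=50):
    for _ in range(steps):
        grads = [worker_grad(w, s) for s in shards]  # computed in parallel on the GPUs
        w -= lr * allreduce_mean(grads)              # identical update on every node
    return w

shards = [[1.0, 2.0], [3.0, 4.0]]   # two workers, two examples each
w = distributed_sgd(0.0, shards)    # converges to the data mean, 2.5
```

In the real system each "value" exchanged is a model update of hundreds of MB, many times per second, which is exactly the communication bottleneck the slide describes.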
  16. Speech Recognition
  17. Speech recognition pipeline: Sound → Signal processing → Feature vectors [4.7, 2.3, -1.4, …] → Acoustic model → Phonetic probabilities [0.1, 0.1, 0.4, …] → Decoder (inference) → Words (“increase to 70 degrees”) → Post processing → Text (“Increase to 70⁰”).
  18. Transfer learning from English to German: hidden layer 1 through the last hidden layer are shared; only the output layer is replaced, mapping from the English phone set (æ, I, ɑ, ɜ, ʊ, …) to the German phone set (e, æ, I, ɑ, ɜ, u:, …, œ).
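The transfer-learning recipe on the slide above – keep the hidden layers trained on plentiful English data, swap only the output layer for the German phone set, then fine-tune – can be sketched as follows (layer sizes and phone counts are illustrative, not Alexa's):

```python
import random

def init_layer(n_out, n_in, rng):
    """Random weight matrix, n_out rows by n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)

# "English" acoustic model: hidden layers trained on large English corpora.
hidden = [init_layer(64, 40, rng), init_layer(64, 64, rng)]
english_output = init_layer(45, 64, rng)   # 45 English phone classes (illustrative)

# Transfer to German: reuse the hidden layers as-is, replace only the output
# layer with one sized for the German phone set, then fine-tune on German data.
german_output = init_layer(50, 64, rng)    # 50 German phone classes (illustrative)
german_model = {"hidden": hidden, "output": german_output}
```

The design point is that the shared hidden layers already encode generic acoustic structure, so only the much smaller output layer must be learned from scratch for the new language.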
  19. Natural Language Understanding
  20. Intent and entities. “play two steps behind by def leppard” → Intent: PlayMusic; Entities: Song = “two steps behind”, Artist = “def leppard”. Two problems: 1. Words are symbols – not vectors of numbers. 2. Requests are of different lengths.
  21. Recurrent Neural Networks: the words “play two steps behind by def leppard” are fed one at a time into a recurrent network, which outputs the intent PlayMusic.
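A minimal sketch of how a recurrent network addresses both problems from the previous slide: an embedding table turns word symbols into vectors, and the recurrence folds a variable-length request into one fixed-size state. The weights here are random and untrained; a real model would add a trained softmax over intents on top of the final state:

```python
import math
import random

rng = random.Random(1)
vocab = {w: i for i, w in enumerate("play two steps behind by def leppard".split())}
EMB, HID = 8, 8
embed = [[rng.uniform(-1, 1) for _ in range(EMB)] for _ in vocab]  # problem 1: word -> vector
W_in = [[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(HID)]
W_rec = [[rng.uniform(-1, 1) for _ in range(HID)] for _ in range(HID)]

def step(h, x):
    """One recurrent step: h' = tanh(W_in x + W_rec h)."""
    return [math.tanh(sum(a * xi for a, xi in zip(W_in[i], x)) +
                      sum(r * hj for r, hj in zip(W_rec[i], h)))
            for i in range(HID)]

def encode(words):
    """Problem 2: fold a request of any length into one fixed-size state."""
    h = [0.0] * HID
    for w in words:
        h = step(h, embed[vocab[w]])
    return h

state = encode("play two steps behind by def leppard".split())
```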
  22. Speech synthesis
  23. Speech synthesis pipeline: Text (“She has 20$ in her pocket.”) → Text normalization (“she has twenty dollars in her pocket”) → Grapheme-to-phoneme conversion (ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət) → Waveform generation → Speech.
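The text-normalization stage can be sketched for this one example. A real normalizer covers dates, abbreviations, ordinals, and much more; this toy handles only small numbers and a trailing currency sign:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def num_to_words(n):
    """Spell out 0-99 (enough for this sketch)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text):
    # Expand currency amounts written like "20$" into words.
    text = re.sub(r"(\d+)\$", lambda m: num_to_words(int(m.group(1))) + " dollars", text)
    # Spell out any remaining bare numbers.
    text = re.sub(r"\d+", lambda m: num_to_words(int(m.group(0))), text)
    # Lowercase and drop sentence-final punctuation.
    return text.lower().rstrip(".!?")
```

For the slide's example, `normalize("She has 20$ in her pocket.")` yields "she has twenty dollars in her pocket", which then feeds the grapheme-to-phoneme stage.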
  24. Concatenative synthesis: input phonemes (ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət) → di-phone unit selection against a di-phone segment database → Speech.
  25. Prosody for natural-sounding reading: a bi-directional recurrent network takes phonetic features, linguistic features, and semantic word vectors, and predicts per-segment targets for pitch, duration, and intensity.
  26. Long-form example (audio demo: before/after): “Over a lunch of diet cokes and lobster salad one balmy fall day in Boston, Joseph Martin, the genial, white-haired, former dean of Harvard medical school, told me how many hours of pain education Harvard med students get during four years of medical school.”
  27. The Alexa Skills Kit
  28. The Alexa Skills Kit connects customers, Alexa, and developers (“Alexa!”).
  29. Growth of published skills (chart: 0–4,000 skills, March–September 2016).
  30. Alexa Skills: Examples. Business: Uber, Dominos, Fidelity, Capital One, Home Advisor, 1-800 Flowers. Info: Washington Post, Campbell’s Kitchen, Boston Children’s Hospital, Stocks, Bitcoin Price, History Buff, Savvy Consumer. Fitness: Fitbit, 7-Minute Workout. Automation: Nest, Garageio, Scout Alarm. Misc: Quick Events, Phone Finder, Cat Facts, Famous Quotes. Games: Jeopardy!, Minesweeper, Word Master, Blackjack, Math Puzzles, Guess Number, Spelling Bee.
  31. ASK for Developers: customers talk to Alexa (“Alexa!”), and ASK connects Alexa to developers.
  32. ASK for Developers • Define a Voice User Interface • Provide a finite number of sample utterances • ASK automatically builds and deploys machine learning models
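One hedged sketch of what “a finite number of sample utterances” might look like as training data: slot-bearing templates expanded against a value catalog. The intent, slot, and catalog names are invented, and this is not the actual ASK build pipeline:

```python
import re

# Hypothetical developer input: slot-bearing sample utterances plus a catalog.
samples = ["get a car to {Destination}", "get me a car",
           "i need a ride to {Destination}"]
catalogs = {"Destination": ["starbucks", "the airport", "downtown"]}

def expand(samples, catalogs):
    """Turn templates and catalogs into concrete training utterances."""
    out = []
    for s in samples:
        slots = re.findall(r"\{(\w+)\}", s)
        if not slots:
            out.append(s)
            continue
        slot = slots[0]  # this sketch assumes at most one slot per sample
        out.extend(s.replace("{" + slot + "}", v) for v in catalogs[slot])
    return out

utterances = expand(samples, catalogs)
```

This is why the slide on slots recommends rich catalogs with values of different lengths: the expanded utterances are what the models learn from.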
  33. Developer Input
  34. Model build workflow: the developer creates/edits a skill on the Developer Portal website; the portal writes the skill definition to a data store; the Skill Model Builder reads skill.json, builds the skill models, and uploads them to the runtime cloud store.
  35. Model building: from the developer input we build two models – finite-state transducers (FSTs) for exact matches, and ML entity and intent recognizers for fuzzy matches.
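A toy illustration of the two-path runtime on the slide above: an exact-match table standing in for the FSTs, and a naive token-overlap score standing in for the ML recognizers. The intent names, threshold, and scoring heuristic are all invented for this sketch:

```python
# Exact-match table (standing in for the FSTs built from sample utterances).
EXACT = {
    "get me a car": ("RequestRide", {}),
    "get a car to starbucks": ("RequestRide", {"Destination": "starbucks"}),
}

def token_overlap(a, b):
    """Naive fuzzy score: Jaccard overlap of the two word sets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def understand(utterance):
    if utterance in EXACT:                    # exact path (FST)
        return EXACT[utterance]
    # Fuzzy path (standing in for the ML models): closest known utterance.
    best = max(EXACT, key=lambda k: token_overlap(utterance, k))
    if token_overlap(utterance, best) > 0.3:  # invented threshold
        return EXACT[best]
    return ("Unknown", {})

intent, slots = understand("hey uhm i need a car to starbucks")
```

Here the noisy runtime request never appeared verbatim in training, so the exact path misses and the fuzzy path recovers the intended intent and slot.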
  36. ASK Machine Learning. Training: developers provide a finite number of sample utterances (“get a car to <Destination>”, “get me a car”, …) used to train the ASK machine learning model. Runtime: customers produce an infinite number of possible utterances (“hey uhm i need a car to starbucks”) that the model must match.
  37. ASK Machine Learning (contd.) • Neural Networks (NNs) • Transfer learning: use knowledge learned from large related training data. Example: we’ve seen slots like <Destination> before, so there is no need to learn them from scratch.
  38. How to Write Great Skills – Slots • Catalogs: provide as many values as possible, and add representative values of different lengths where appropriate • Use built-in slots where possible (e.g., cities, states, first names) • Do not use too many slots in one utterance (rather, ask for missing slots in a dialog) • Use context around each slot
  39. How to Write Great Skills – Intents • Split heterogeneous intents • Use built-in intents where possible • Provide as many carrier phrases as possible • Use a thesaurus or paraphrasing tools, and ask your friends or Mechanical Turk for utterances
  40. Conclusions • ASK connects developers to customers • Developers constantly extend Alexa’s capabilities • We constantly get more data and improve the experience via machine learning • Making Alexa more intelligent and powerful, bridging the gap between human and machine
  41. Thank you!
  42. Remember to complete your evaluations!
  43. Related Sessions
  44. Images used: GloVe vectors. Produced internally.
  45. Images used: Macaw. Public domain. VW. Free for editorial use.
  46. Images used: ASCI Red. Public domain. 8800 GTX. Permission by email by Tri Hyunth at Nvidia.
  47. Images used:,_2009.jpg 025_Former_U.S._President_George_H._W._Bush_congratulates_Sailor_aboard_USS_Harry_S._Truman_(CVN_75).jpg