Proprietary and confidential. Do not distribute.
End-to-end speech recognition with neon
Anthony Ndirango & Tyler Lee
MAKING MACHINES
SMARTER.™
now part of Intel
Nervana Systems Proprietary
Slide 2
• Deep learning synopsis
• Large vocabulary continuous speech recognition
• End-to-end speech recognition systems
• Integrating weighted finite-state transducers for decoding
Slide 3
Back-propagation
End-to-end
Resnet
ImageNet
Word2Vec
Regularization
Convolution
Unrolling
RNN
Generalization
hyperparameters
Video recognition
dropout
Pooling
LSTM
AlexNet
Speech recognition
download neon!
https://github.com/NervanaSystems/neon
git clone git@github.com:NervanaSystems/neon.git
Nervana’s deep learning tutorials:
https://www.nervanasys.com/deep-learning-tutorials/
We are hiring!
https://www.nervanasys.com/careers/
Slide 4
A method for extracting features at multiple levels of abstraction
• Features are discovered from data
• Performance improves with more data
• Network can express complex transformations
• High degree of representational power
Slide 6
Healthcare: Tumor detection
Automotive: Speech interfaces
Finance: Time-series search engine
Agricultural Robotics
Oil & Gas
Proteomics: Sequence analysis
Slide 8
kwɪk → quick
braʊn → brown
fɒks → fox
Slide 10
download aeon!
https://github.com/NervanaSystems/aeon
• Train directly from raw audio, extracting spectral features on-the-fly
• Handles arbitrarily large datasets
• Loads data from disk to device with minimal latency
• Also supports image and video data
Slide 11
Acoustic models in neon: complete source available at
https://github.com/NervanaSystems/deepspeech
Slide 12
• The basic problem to be solved involves mapping a sequence of audio
features to a sequence of characters, with no obvious relationship
between the lengths of the sequences.
• CTC works around this problem by first defining a "collapse" function.
Definition by example:
Collapse(_NNN_ _EE_ _R_ _VVV_AAA_N_AAAA_) = NERVANA
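A minimal sketch of such a collapse function in Python, assuming "_" is the blank symbol (the spaces in the slide's example are just for readability):

```python
def ctc_collapse(path, blank="_"):
    """Collapse a CTC path: merge runs of repeated characters,
    then drop blank symbols."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:   # skip consecutive repeats and blanks
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("_NNN__EE__R__VVV_AAA_N_AAAA_"))  # NERVANA
```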
Slide 13
• For each utterance, model outputs a matrix of frame-wise character probabilities
• Given the "ground truth" transcript, the CTC algorithm:
1. finds all paths which collapse onto ground truth
2. uses the probability matrix to weight each path
Slide 14
CTC Cost
- Input audio with 5 frames
- Ground truth: "CAB"
- Find all strings of length 5, including blank characters, that collapse onto "CAB"
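For a toy alphabet of the three target characters plus the blank, the valid paths can be enumerated by brute force (a sketch; real CTC sums over these paths efficiently with dynamic programming):

```python
from itertools import product

def ctc_collapse(path, blank="_"):
    """Merge runs of repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

alphabet = "CAB_"  # toy assumption: target characters plus blank
paths = ["".join(p) for p in product(alphabet, repeat=5)
         if ctc_collapse("".join(p)) == "CAB"]
print(len(paths))   # 28 length-5 paths collapse onto "CAB"
print(paths[:4])
```

The CTC cost weights each of these paths by its probability under the model's per-frame output distribution and sums them.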
Slide 15
1. Do an argmax for each column
2. Concatenate the resulting characters to obtain a string
3. "Collapse" the string to get the output
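The three steps can be sketched with NumPy on a toy probability matrix (the alphabet ordering here is an assumption):

```python
import numpy as np

def greedy_decode(probs, alphabet="CAB_", blank="_"):
    """Greedy (best-path) CTC decoding.
    `probs` has shape (len(alphabet), n_frames): per-frame character probabilities."""
    best = probs.argmax(axis=0)                    # 1. argmax for each column
    chars = "".join(alphabet[i] for i in best)     # 2. concatenate the characters
    out, prev = [], None                           # 3. collapse repeats and blanks
    for ch in chars:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Toy 5-frame utterance; each column sums to 1.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1, 0.1],   # C
    [0.1, 0.6, 0.6, 0.1, 0.1],   # A
    [0.1, 0.1, 0.1, 0.6, 0.7],   # B
    [0.1, 0.2, 0.2, 0.2, 0.1],   # _
])
print(greedy_decode(probs))  # CAB
```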
Slide 16
decoded:      younited presidentiol is a lefe in surance company
ground truth: united presidential is a life insurance company

decoded:      that was sertainly true last week
ground truth: that was certainly true last week

decoded:      we're now ready to say we're intechnical default a spokesman said
ground truth: we're not ready to say we're in technical default a spokesman said
Slide 17
So we have probabilities of each character at each frame. Now what?

If CER (character error rate) is nearly perfect: we're pretty much set. Just use the best character at each frame.

If CER is too high: we should enforce some rules from the language, e.g.:
- All words must be valid
- Favor likely word sequences
Slide 18
Weighted finite-state transducers: automata whose state transitions map a sequence of input symbols to a sequence of output symbols.
- Directed graph structure
- States enforce language structure
- Transitions choose amongst valid symbols
For an in-depth review, see Mohri, Pereira & Riley (2008).
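As a toy illustration (a hypothetical hand-rolled encoding, not a real FST library's API), a deterministic transducer can be stored as a transition table and walked over an input sequence:

```python
# Transitions map (state, input symbol) -> (output symbol, next state, weight);
# weights act like negative log probabilities and accumulate along the path.
T = {
    (0, "a"): ("x", 1, 0.5),
    (1, "b"): ("y", 0, 0.25),
}
START, FINALS = 0, {0}

def transduce(transitions, inputs, start, finals):
    """Follow arcs for each input symbol, emitting outputs and summing weights.
    Returns None if the input is not accepted."""
    state, outputs, weight = start, [], 0.0
    for sym in inputs:
        if (state, sym) not in transitions:
            return None                       # no valid transition
        out, state, w = transitions[(state, sym)]
        outputs.append(out)
        weight += w
    return (outputs, weight) if state in finals else None

print(transduce(T, ["a", "b"], START, FINALS))  # (['x', 'y'], 0.75)
```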
Slide 19
- A lot of decoding concepts map nicely to FSTs.
- CTC, lexicon (vocabulary) and grammar (language model) can all be easily represented.
- Efficient algorithms exist to combine FSTs, giving a single decoding graph:

𝐷 = 𝑇 ∘ 𝐿 ∘ 𝐺
(decoding graph = CTC graph ∘ lexicon graph ∘ grammar graph)
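Composition itself can be sketched naively in Python (a toy illustration assuming arcs are (src, in, out, dst, weight) tuples; epsilon handling and the lazy algorithms of real toolkits such as OpenFst are omitted):

```python
def compose(arcs_a, arcs_b):
    """Naive FST composition: an arc of A pairs with an arc of B whenever
    A's output symbol matches B's input symbol; weights add."""
    composed = []
    for (s1, i, m1, d1, w1) in arcs_a:
        for (s2, m2, o, d2, w2) in arcs_b:
            if m1 == m2:  # A's output feeds B's input
                composed.append(((s1, s2), i, o, (d1, d2), w1 + w2))
    return composed

# Toy example: A maps "a" -> "x", B maps "x" -> "z",
# so the composition maps "a" -> "z".
A = [(0, "a", "x", 1, 1.0)]
B = [(0, "x", "z", 1, 0.5)]
print(compose(A, B))  # [((0, 0), 'a', 'z', (1, 1), 1.5)]
```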
Slide 20
𝑇 removes repeated characters and blanks ("_").

Input:  "C _ A A A _ B B _"
Output: "C A B"
(the figure steps through the input run by run: "C _" → C, "A A A _" → A, "B B _" → B)
Slide 21
𝐿 maps a sequence of characters or phonemes to words.
Slide 22
- Less end-to-end: a large number of parameters are learned completely separately from the acoustic model
- Memory issues with large vocabularies or complex language models

Graph            | # States   | # Arcs
CTC              | 31         | 91
Lexicon          | 30,629     | 40,516
Trigram          | 3,538,579  | 10,213,039
Composed Trigram | 26,817,696 | 54,104,686
Slide 23
Reference                  | CER (no LM) | WER (no LM) | WER (trigram LM) | WER (trigram LM w/ enhancements)
Hannun et al. (2014)       | 10.7        | 35.8        | 14.1             | N/A
Graves & Jaitly (2014)     | 9.2         | 30.1        | N/A              | 8.7
Hwang & Sung (2016)        | 10.6        | 38.4        | 8.88             | 8.1
Miao et al. (2015) [Eesen] | N/A         | N/A         | 9.1              | 7.3
Bahdanau et al. (2016)     | 6.4         | 18.6        | 10.8             | 9.3
Nervana-Speech             | 8.64        | 32.5        | 8.4              | N/A
decoded:      younited presidentiol is a lefe in surance company
ground truth: united presidential is a life insurance company

decoded:      that was sertainly true last week
ground truth: that was certainly true last week

decoded:      we're now ready to say we're intechnical default a spokesman said
ground truth: we're not ready to say we're in technical default a spokesman said
Slide 24
Nervana’s deep learning tutorials:
https://www.nervanasys.com/deep-learning-tutorials/
Acoustic model source available at:
https://github.com/NervanaSystems/deepspeech
GitHub page:
https://github.com/NervanaSystems/neon
For more information, contact:
info@nervanasys.com
Slide 26
• github.com/NervanaSystems/ModelZoo
• model files, parameters
Slide 27
THANK YOU!
QUESTIONS?
Slide 29
[Figure: 2-D convolution of a 3×3 input with a 2×2 filter]

input:      filter:     output:
0 1 2       0 1         19 25
3 4 5       2 3         37 43
6 7 8

• Each element in the output is the result of a dot product between two vectors, e.g. the top-left output element: (0, 1, 3, 4) · (0, 1, 2, 3) = 19
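The figure's arithmetic can be reproduced in a few lines of Python (a sketch of a "valid" convolution; like most deep learning frameworks, it computes cross-correlation, i.e. the filter is not flipped):

```python
def conv2d_valid(x, f):
    """'Valid' 2-D convolution: slide the filter over the input and
    take a dot product at each position."""
    H, W = len(x), len(x[0])
    kh, kw = len(f), len(f[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(x[i + di][j + dj] * f[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

x = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # input from the slide
f = [[0, 1], [2, 3]]                    # filter from the slide
print(conv2d_valid(x, f))               # [[19, 25], [37, 43]]
```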
Slide 30
[Figure: the same convolution unrolled as a matrix multiplication: the input is flattened into a column vector (0, 1, …, 8) and multiplied by a sparse matrix whose rows place the filter weights (0, 1, 2, 3) at shifted positions, producing the flattened output (19, 25, 37, 43)]
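The same matrix view can be sketched with an im2col-style helper (the name is hypothetical): gather each input window into a row, then dot it with the flattened filter:

```python
def im2col_conv(x, f):
    """Convolution as a matrix product: each input window becomes a row
    (im2col), and every row is dotted with the flattened filter."""
    H, W = len(x), len(x[0])
    kh, kw = len(f), len(f[0])
    flat_f = [f[di][dj] for di in range(kh) for dj in range(kw)]
    out = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            patch = [x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            out.append(sum(p * w for p, w in zip(patch, flat_f)))
    return out

x = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
f = [[0, 1], [2, 3]]
print(im2col_conv(x, f))  # [19, 25, 37, 43]
```

GPUs favor this formulation because the whole computation becomes one dense matrix multiply.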

Intel Nervana Artificial Intelligence Meetup 11/30/16


Editor's Notes

  • #10 Emphasize that we are just replicating DS2's (Deep Speech 2's) acoustic model
  • #27 You don't have to make a model from scratch: there are many examples of pre-trained models. Mention Yinyin's Fast R-CNN, Sathish's C3D, and bAbI.
  • #30 To understand convolutional networks, we should first understand the convolution operation, and then see how that operation is implemented in a network structure.