Intel Nervana Artificial Intelligence Meetup 11/30/16

End-to-end speech recognition in neon, presented by Anthony Ndirango and Tyler Lee

Modern automatic speech recognition systems incorporate a tremendous amount of expert knowledge and a wide array of machine learning techniques. The promise of deep learning is to strip away much of this complexity in favor of the flexibility of neural networks. We will describe our efforts in implementing end-to-end speech recognition in neon by combining convolutional and recurrent neural networks to create an acoustic model, followed by a graph-based decoding scheme. These types of models are trained to go directly from raw waveforms to transcribed speech without requiring any kind of explicit forced alignment. We will also discuss additional challenges that must be overcome to produce state-of-the-art results.

  1. 1. Proprietary and confidential. Do not distribute. End-to-end speech recognition with neon Anthony Ndirango & Tyler Lee MAKING MACHINES SMARTER.™ now part of
  2. 2. Nervana Systems Proprietary 2 • Deep learning synopsis • Large vocabulary continuous speech recognition • End-to-end speech recognition systems • Integrating weighted finite-state transducers for decoding
  3. 3. Back-propagation, End-to-end, ResNet, ImageNet, Word2Vec, Regularization, Convolution, Unrolling, RNN, Generalization, hyperparameters, Video recognition, dropout, Pooling, LSTM, AlexNet, Speech recognition. Download neon! https://github.com/NervanaSystems/neon (git clone git@github.com:NervanaSystems/neon.git) Nervana’s deep learning tutorials: https://www.nervanasys.com/deep-learning-tutorials/ We are hiring! https://www.nervanasys.com/careers/
  4. 4. A method for extracting features at multiple levels of abstraction • Features are discovered from data • Performance improves with more data • Network can express complex transformations • High degree of representational power
  6. 6. Healthcare: tumor detection • Automotive: speech interfaces • Finance: time-series search engine • Agricultural robotics • Oil & gas • Proteomics: sequence analysis
  8. 8. kwɪk braʊn fɒks quick brown fox
  10. 10. Download aeon! https://github.com/NervanaSystems/aeon • Train directly from raw audio, extracting spectral features on-the-fly • Handles arbitrarily large datasets • Loads data from disk to device with minimal latency • Also supports image and video data
  11. 11. Acoustic models in neon: complete source available at https://github.com/NervanaSystems/deepspeech
  12. 12. • The basic problem to be solved involves mapping a sequence of audio features to a sequence of characters, with no obvious relationship between the lengths of the two sequences. • CTC works around this problem by first defining a “collapse” function. Definition by example: Collapse(_NNN_ _EE_ _R_ _VVV_AAA_N_AAAA_) = NERVANA
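The collapse function on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code from the deepspeech repository: the function name `collapse` and the constant `BLANK` are chosen here, and the spaces in the slide's example are assumed to be layout only, so they are omitted from the input string.

```python
from itertools import groupby

BLANK = "_"  # blank symbol, written "_" on the slide


def collapse(path, blank=BLANK):
    """Merge runs of repeated symbols, then drop the blank symbol."""
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != blank)


print(collapse("_NNN__EE__R__VVV_AAA_N_AAAA_"))  # NERVANA
print(collapse("AA_A"))  # AA
```

The second call shows why the blank is needed: a blank between two identical characters keeps them from being merged, which is how doubled letters survive the collapse.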
  13. 13. • For each utterance, the model outputs a matrix of frame-wise character probabilities • Given the “ground truth” transcript, the CTC algorithm: 1. finds all paths which collapse onto the ground truth 2. uses the probability matrix to weight each path
  14. 14. CTC cost - input audio with 5 frames - ground truth: “CAB” - find all strings of length 5, including blank characters, that collapse onto “CAB”
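The 5-frame “CAB” example can be worked through by brute force. The sketch below assumes a toy alphabet of three characters plus the blank and a made-up uniform probability matrix; neither comes from the slides. It enumerates every length-5 string, keeps those that collapse onto the target, and sums their probabilities. Practical CTC implementations compute the same sum with a forward-backward dynamic program instead of enumerating paths.

```python
import math
from itertools import groupby, product

BLANK = "_"
ALPHABET = [BLANK, "A", "B", "C"]  # toy alphabet, assumed for this example


def collapse(path):
    """Merge runs of repeated symbols, then drop blanks."""
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != BLANK)


def paths_for(target, num_frames):
    """All frame-level strings of the given length that collapse onto target."""
    return ["".join(p) for p in product(ALPHABET, repeat=num_frames)
            if collapse(p) == target]


def ctc_cost(probs, target):
    """Negative log of the total probability of all paths for target.

    probs[t][symbol] is the probability of symbol at frame t.
    """
    total = 0.0
    for path in paths_for(target, len(probs)):
        total += math.prod(probs[t][symbol] for t, symbol in enumerate(path))
    return -math.log(total)


paths = paths_for("CAB", 5)
print(len(paths))  # 28

# Hypothetical uniform model output: every symbol equally likely at each frame.
uniform = [{symbol: 0.25 for symbol in ALPHABET} for _ in range(5)]
cost = ctc_cost(uniform, "CAB")
```

With the uniform matrix, each of the 28 paths has probability 0.25^5, so the cost is -log(28/1024). Brute force is only practical for toy sizes, since the number of candidate strings grows exponentially with the number of frames.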
  15. 15. 1. do an argmax for each column 2. concatenate the resulting characters to obtain a string 3. “collapse” the string to get the output
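These three steps can be sketched as follows. The symbol table and the probability values are invented for illustration, and each frame is stored as a row here (the slide describes frames as columns of the matrix).

```python
from itertools import groupby

BLANK = "_"
SYMBOLS = [BLANK, "A", "B", "C"]  # toy symbol table, assumed for this example


def greedy_decode(probs):
    """Greedy CTC decoding.

    probs is a list of frames; each frame is a list of probabilities
    aligned with SYMBOLS.
    """
    # Step 1: argmax per frame.
    best = [SYMBOLS[max(range(len(frame)), key=frame.__getitem__)]
            for frame in probs]
    # Step 2: concatenate the winning characters.
    raw = "".join(best)
    # Step 3: collapse (merge repeats, then remove blanks).
    merged = (symbol for symbol, _ in groupby(raw))
    return "".join(symbol for symbol in merged if symbol != BLANK)


# Made-up frame-wise probabilities for a 5-frame utterance.
probs = [
    [0.1, 0.1, 0.1, 0.7],  # C
    [0.6, 0.2, 0.1, 0.1],  # _
    [0.1, 0.7, 0.1, 0.1],  # A
    [0.2, 0.6, 0.1, 0.1],  # A
    [0.1, 0.1, 0.7, 0.1],  # B
]
print(greedy_decode(probs))  # CAB
```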
  16. 16. Decoded outputs vs. ground truth:
        - decoded: younited presidentiol is a lefe in surance company
          ground truth: united presidential is a life insurance company
        - decoded: that was sertainly true last week
          ground truth: that was certainly true last week
        - decoded: we're now ready to say we're intechnical default a spokesman said
          ground truth: we're not ready to say we're in technical default a spokesman said
  17. 17. So we have probabilities of each character at each frame. Now what? If the CER (character error rate) is nearly perfect, we’re pretty much set: just use the best character at each frame. If the CER is too high, we should enforce some rules from the language, e.g.: - All words must be valid - Favor likely word sequences
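For readers unfamiliar with the metric, character error rate is commonly computed as the edit distance between the decoded string and the reference, divided by the reference length; word error rate (WER) is the same computation over words. The sketch below assumes that common definition, counts spaces as characters, and uses function names chosen here; the slides do not state the exact scoring convention used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


ref = "that was certainly true last week"
hyp = "that was sertainly true last week"
print(edit_distance(ref, hyp))  # 1
print(cer(ref, hyp), wer(ref, hyp))
```

One wrong character out of 33 is a small CER, but it makes one word out of six wrong, which is why a model with a reasonable CER can still have a high WER without a language model.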
  18. 18. Weighted finite-state transducers (WFSTs): automata whose state transitions map a sequence of input symbols to a sequence of output symbols - Directed graph structure - States enforce language structure - Transitions choose amongst valid symbols. For an in-depth review, see Mohri, Pereira & Riley, 2008
  19. 19. - A lot of decoding concepts map nicely to FSTs. - CTC, lexicon (vocabulary) and grammar (language model) can all be easily represented. - Efficient algorithms exist to combine FSTs, giving a single decoding graph: 𝐷 = 𝑇 ∘ 𝐿 ∘ 𝐺, where 𝐷 is the decoding graph, 𝑇 the CTC graph, 𝐿 the lexicon graph, and 𝐺 the grammar graph.
  20. 20. CTC graph 𝑇: removes repeated characters and blanks (“_”). Input: “C _ A A A _ B B _”. Output: “C A B”.
  21. 21. Lexicon graph 𝐿: maps a sequence of characters or phonemes to words
  22. 22. - Less end-to-end: a large number of parameters are learned completely separately from the acoustic model
        - Memory issues with large vocabularies or complex language models

        | Graph            | # States   | # Arcs     |
        |------------------|------------|------------|
        | CTC              | 31         | 91         |
        | Lexicon          | 30,629     | 40,516     |
        | Trigram          | 3,538,579  | 10,213,039 |
        | Composed Trigram | 26,817,696 | 54,104,686 |
  23. 23. Results:

        | Reference                  | CER (no LM) | WER (no LM) | WER (trigram LM) | WER (trigram LM w/ enhancements) |
        |----------------------------|-------------|-------------|------------------|----------------------------------|
        | Hannun et al. (2014)       | 10.7        | 35.8        | 14.1             | N/A                              |
        | Graves-Jaitly (2014)       | 9.2         | 30.1        | N/A              | 8.7                              |
        | Hwang-Sung (2016)          | 10.6        | 38.4        | 8.88             | 8.1                              |
        | Miao et al. (2015) [Eesen] | N/A         | N/A         | 9.1              | 7.3                              |
        | Bahdanau et al. (2016)     | 6.4         | 18.6        | 10.8             | 9.3                              |
        | Nervana-Speech             | 8.64        | 32.5        | 8.4              | N/A                              |

        Decoded outputs vs. ground truth:
        - decoded: younited presidentiol is a lefe in surance company
          ground truth: united presidential is a life insurance company
        - decoded: that was sertainly true last week
          ground truth: that was certainly true last week
        - decoded: we're now ready to say we're intechnical default a spokesman said
          ground truth: we're not ready to say we're in technical default a spokesman said
  24. 24. Nervana’s deep learning tutorials: https://www.nervanasys.com/deep-learning-tutorials/ Acoustic model source available at: https://github.com/NervanaSystems/deepspeech Github page: https://github.com/NervanaSystems/neon For more information, contact: info@nervanasys.com
  26. 26. • github.com/NervanaSystems/ModelZoo • model files, parameters
  27. 27. THANK YOU! QUESTIONS?
  29. 29. Convolution example: input [[0, 1, 2], [3, 4, 5], [6, 7, 8]], filter [[0, 1], [2, 3]], output [[19, 25], [37, 43]] • Each element in the output is the result of a dot product between two vectors, e.g. (0, 1, 3, 4) · (0, 1, 2, 3) = 19
  30. 30. [Figure: the same convolution (input 0 to 8, filter 0 to 3, output 19, 25, 37, 43), with the four filter weights repeated once per output element]
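The numbers on these two backup slides can be reproduced with a short sketch. It assumes the input is a 3×3 grid holding 0 through 8 and the filter a 2×2 grid holding 0 through 3, which is inferred from the values on the slides; the function name is chosen here. As is common in deep learning frameworks, the filter is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output is one dot product."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Flatten the image patch under the kernel into a vector.
            patch = [image[i + di][j + dj]
                     for di in range(kh) for dj in range(kw)]
            # Dot product between the patch vector and the filter vector.
            out_row.append(sum(p * w for p, w in zip(patch, flat_kernel)))
        output.append(out_row)
    return output


image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
kernel = [[0, 1],
          [2, 3]]
print(conv2d_valid(image, kernel))  # [[19, 25], [37, 43]]
```

The top-left output is the patch (0, 1, 3, 4) dotted with the filter (0, 1, 2, 3), giving 0 + 1 + 6 + 12 = 19, which matches the slide.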
