
Deep Learning for Speech Recognition in Cortana at AI NEXT Conference


AI NEXT Conference 2017 Seattle by Jinyu Li
Video: https://www.youtube.com/channel/UCj09XsAWj-RF9kY4UvBJh_A



  1. 1. Jinyu Li, Microsoft
  2. 2. • Review the deep learning trends for automatic speech recognition (ASR) in industry: ◦ Deep Neural Network (DNN) ◦ Long Short-Term Memory (LSTM) ◦ Connectionist Temporal Classification (CTC) • Describe selected key technologies that make deep learning models more effective in a production environment
  3. 3. [Figure: ASR pipeline. Input speech s(n) → feature analysis (spectral analysis) → feature vectors Xn → pattern classification (decoding, search) using the acoustic model (HMM), word lexicon, and language model → confidence scoring → recognized words Ŵ, e.g. "Hey Cortana" (0.9, 0.8)]
  4. 4. [Figure: the same ASR pipeline as slide 3]
  5. 5. • Word sequence: Hey Cortana • Phone sequence: hh ey k ao r t ae n ax • Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r ao-r+t r-t+ae t-ae+n ae-n+ax n-ax+sil • Every triphone is then modeled by a three-state HMM: sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ......, n-ax+sil[3]. The key problem is how to evaluate the state likelihood given the speech signal.
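To make the expansion concrete, here is a minimal Python sketch of it (the word, phone sequence, and naming convention are from the slide; the helper itself is illustrative):

```python
# Minimal sketch: expand a phone sequence into triphones and HMM states.
# The phone sequence for "Hey Cortana" is taken from the slide above.
phones = ["hh", "ey", "k", "ao", "r", "t", "ae", "n", "ax"]

# Pad with silence so the first and last phones have left/right contexts.
padded = ["sil"] + phones + ["sil"]

# Each triphone is written left-phone + center + right-phone, e.g. sil-hh+ey.
triphones = [
    f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
    for i in range(1, len(padded) - 1)
]

# Every triphone is then modeled by a three-state HMM.
states = [f"{tri}[{s}]" for tri in triphones for s in (1, 2, 3)]

print(triphones[0])   # sil-hh+ey
print(states[:3])     # ['sil-hh+ey[1]', 'sil-hh+ey[2]', 'sil-hh+ey[3]']
```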
  6. 6. [Figure: three-state HMM sequence sil-hh+ey[1] → sil-hh+ey[2] → sil-hh+ey[3] → hh-ey+k[1] → … → n-ax+sil[3]]
  7. 7. [Figure: same HMM state sequence as slide 6]
  8. 8. [Figure: same HMM state sequence as slide 6]
  9. 9. [Figure: same HMM state sequence as slide 6]
  10. 10. • ZH-CN is improved by 32% within one year! [Chart: relative character error rate reduction (CERR) on ZH-CN for GMM with MFCC features vs. CE-trained and sequence-trained DNNs with log filterbank (LFB) features; y-axis 0–35%] CE: Cross-Entropy training; SE: SEquence training
  11. 11. DNNs process speech frames independently: $h_t = \sigma(W_{hx} x_t + b)$ [Figure: DNN applied to frames $x_{t-1}$, $x_t$]
  12. 12. The RNN considers the temporal relation over speech frames: $h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b)$. Vulnerable to vanishing and exploding gradients. [Figure: RNN applied to frames $x_{t-1}$, $x_t$ with recurrent connections]
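A minimal NumPy sketch contrasting the two update rules above (sizes and weight names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 40, 128                          # illustrative feature/hidden sizes
W_hx = rng.standard_normal((d_hid, d_in)) * 0.1
W_hh = rng.standard_normal((d_hid, d_hid)) * 0.1
b = np.zeros(d_hid)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dnn_step(x_t):
    # DNN: each frame is processed independently, h_t = sigma(W_hx x_t + b)
    return sigmoid(W_hx @ x_t + b)

def rnn_step(x_t, h_prev):
    # RNN: the previous hidden state feeds back in,
    # h_t = sigma(W_hx x_t + W_hh h_{t-1} + b)
    return sigmoid(W_hx @ x_t + W_hh @ h_prev + b)
```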
  13. 13. Memory cells store the history information. Various gates control the information flow inside the LSTM. Advantageous in learning long- and short-term temporal dependencies. [Figure: LSTM applied to frames $x_{t-1}$, $x_t$]
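A minimal sketch of one LSTM step, showing the memory cell and the gates that control the information flow (a basic cell without peepholes or projection; all names and sizes are illustrative):

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*d, d_in), U: (4*d, d), b: (4*d,).
    The rows of W/U/b are split into input, forget, and output gates
    plus the candidate cell update."""
    d = h_prev.shape[0]
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0*d:1*d])        # input gate: how much new info enters
    f = sigmoid(z[1*d:2*d])        # forget gate: how much history is kept
    o = sigmoid(z[2*d:3*d])        # output gate: how much of the cell is exposed
    g = np.tanh(z[3*d:4*d])        # candidate cell update
    c_t = f * c_prev + i * g       # memory cell stores the history information
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Illustrative usage over 10 random frames:
d_in, d = 40, 128
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * d, d_in)) * 0.1
U = rng.standard_normal((4 * d, d)) * 0.1
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for x_t in rng.standard_normal((10, d_in)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```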
  14. 14. [Chart: relative WER reduction (WERR) of LSTM over DNN on the SMD2015, VS2015, MobileC, Mobile, and Win10C tasks; y-axis 0–20%]
  15. 15. The HMM/GMM or HMM/DNN pipeline is highly complex. Multiple training stages: CI phone, CD senones, … Various resources: lexicon, decision-tree questions, … Many hyper-parameters: number of senones, number of Gaussians, … [Figure: pipeline stages — CI phone, CD senone, GMM, DNN/LSTM, hybrid]
  16. 16. [Figure: the same ASR pipeline as slide 3]
  17. 17. The HMM/GMM or HMM/DNN pipeline is highly complex. Multiple training stages: CI phone, CD senones, … Various resources: lexicon, decision-tree questions, … Many hyper-parameters: number of senones, number of Gaussians, … LM building also requires a large amount of data and a complicated process. Writing an efficient decoder needs experts with years of experience.
  18. 18. (Identical to slide 17.)
  19. 19. End-to-End Model [Figure: a single model mapping speech directly to "Hey Cortana"] • ASR is a sequence-to-sequence learning problem. • A simpler paradigm with a single model (and training stage) is desired.
  20. 20. • CTC is a sequence-to-sequence learning method used to map speech waveforms directly to characters, phonemes, or even words. • CTC paths differ from label sequences in that they: allow repetitions of non-blank labels; add the blank ∅ as an additional label, meaning no (actual) label is emitted. Example: the label sequence z = A B C over observation frames X expands to paths such as A A ∅ ∅ B C ∅, ∅ A A B ∅ C C, or ∅ ∅ ∅ A B C ∅, which all collapse back to A B C.
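A minimal sketch of the collapse operation implied above: merge repeated labels first, then drop blanks, so all three example paths map back to A B C:

```python
def ctc_collapse(path, blank="∅"):
    """Map a CTC path to its label sequence: merge consecutive
    repeated labels, then remove blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# All three example paths from the slide collapse to ['A', 'B', 'C']:
assert ctc_collapse(list("AA∅∅BC∅")) == ["A", "B", "C"]
assert ctc_collapse(list("∅AAB∅CC")) == ["A", "B", "C"]
assert ctc_collapse(list("∅∅∅ABC∅")) == ["A", "B", "C"]
```

Note the order matters: because blanks are removed only after merging, a path like A ∅ A still collapses to the two-label sequence A A.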
  21. 21. [Figure: CTC model — a stack of LSTMs over frames t−1, t, t+1 feeding a softmax over words plus the blank ∅]
  22. 22. • Directly from speech to text: no language model, no decoder, no lexicon, …
  23. 23. • Reduce runtime cost without accuracy loss • Adapt to speakers with low footprints • Reduce the accuracy gap between large and small deep networks • Enable languages with limited training data
  24. 24. [Xue 13]
  25. 25. • The runtime cost of the DNN is much larger than that of the GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of the DNN in order to ship it.
  26. 26. • The runtime cost of the DNN is much larger than that of the GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of the DNN in order to ship it. • We propose a new DNN structure that takes advantage of the low-rank property of the DNN model to compress it.
  27. 27. • How do we reduce the runtime cost of the DNN? SVD! • It also enables speaker personalization and AM modularization.
  28. 28. $$A_{m\times n} = U_{m\times n}\,\Sigma_{n\times n}\,V_{n\times n}^{T} = \begin{bmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{bmatrix} \begin{bmatrix} \epsilon_{11} & & & & \\ & \ddots & & & \\ & & \epsilon_{kk} & & \\ & & & \ddots & \\ & & & & \epsilon_{nn} \end{bmatrix} \begin{bmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{n1} & \cdots & v_{nn} \end{bmatrix}$$
  29. 29. • Number of parameters: mn → mk + nk. • Runtime cost: O(mn) → O(mk + nk). • E.g., m = 2048, n = 2048, k = 192: ~80% runtime cost reduction.
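A minimal NumPy sketch of the compression (the sizes are from the slide; the weight matrix is a random stand-in for a trained layer):

```python
import numpy as np

m, n, k = 2048, 2048, 192
A = np.random.default_rng(0).standard_normal((m, n))  # stand-in for a trained layer

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W1 = U[:, :k] * s[:k]          # (m, k), absorbs the singular values
W2 = Vt[:k, :]                 # (k, n)

# One m x n layer becomes two layers of sizes m x k and k x n.
full, low_rank = m * n, m * k + k * n
print(f"{full} -> {low_rank} params "
      f"({100 * (1 - low_rank / full):.0f}% reduction)")  # ~81%
```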
  30. 30. • Singular Value Decomposition
  31. 31. [Figure: frame skipping with output copying. DNN model: DNNs over frames $x_{t-1}$, $x_t$, $x_{t+1}$, with skipped frames copying the previous output; LSTM model: the same with LSTMs]
  32. 32. Split training utterances through frame skipping. [Figure: frames x1 x2 x3 x4 x5 x6 split into x1 x3 x5 and x2 x4 x6] When skipping 1 frame, odd and even frames are picked as separate utterances. Frame labels are selected accordingly.
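A minimal sketch of this utterance split (frames and labels are illustrative lists):

```python
def split_by_frame_skipping(frames, labels, skip=1):
    """Split one utterance into skip+1 shorter utterances by taking
    every (skip+1)-th frame, starting at each offset; frame labels
    are selected the same way."""
    step = skip + 1
    return [(frames[off::step], labels[off::step]) for off in range(step)]

frames = ["x1", "x2", "x3", "x4", "x5", "x6"]
labels = ["l1", "l2", "l3", "l4", "l5", "l6"]
for fr, la in split_by_frame_skipping(frames, labels, skip=1):
    print(fr, la)
# ['x1', 'x3', 'x5'] ['l1', 'l3', 'l5']   (odd frames)
# ['x2', 'x4', 'x6'] ['l2', 'l4', 'l6']   (even frames)
```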
  33. 33. [Xue 14]
  34. 34. • Speaker personalization with a deep model creates a storage size issue: it is not practical to store an entire deep model for each individual speaker during deployment.
  35. 35. • Speaker personalization with a DNN model creates a storage size issue: it is not practical to store an entire DNN model for each individual speaker during deployment. • We propose a low-footprint DNN personalization method based on the SVD structure.
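One way to realize the low-footprint idea is to insert a small k×k matrix, initialized to the identity, between the two SVD factors and adapt only that matrix per speaker; the sketch below illustrates that reading (an assumption for illustration, not the exact production recipe; all sizes illustrative):

```python
import numpy as np

m, n, k = 2048, 2048, 192
rng = np.random.default_rng(0)
W1 = rng.standard_normal((m, k)) * 0.05   # speaker-independent factor (frozen)
W2 = rng.standard_normal((k, n)) * 0.05   # speaker-independent factor (frozen)

# Per-speaker adaptation matrix, initialized to the identity so the
# adapted layer starts out identical to the speaker-independent one.
S_spk = np.eye(k)

def adapted_layer(x):
    # Only S_spk (k*k ~= 0.04M values here) is stored and updated per
    # speaker, instead of the full m*n (~4.2M) weights of the layer.
    return (x @ W1) @ S_spk @ W2

x = rng.standard_normal((1, m))
print(adapted_layer(x).shape)  # (1, 2048)
```

Summed over the adapted layers of a network, a per-layer footprint of this size is in the same ballpark as the 0.26M-parameter figure on slide 36.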
  36. 36. [Chart: adapting with 100 utterances — relative WER reduction / number of parameters (M): full-size DNN 0% / 30M; SVD DNN 0.36% / 7.4M; standard adaptation 18.64% / 7.4M; SVD adaptation 20.86% / 0.26M]
  37. 37. • SVD matrices are used to reduce the number of DNN parameters and the CPU cost. • Quantization is used for SSE evaluation with single-instruction-multiple-data (SIMD) processing. • Frame skipping is used to remove the evaluation of some frames.
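As a rough illustration of the quantization step, here is a generic symmetric 8-bit scheme (an assumption for illustration, not necessarily the exact scheme used by the SSE kernels):

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-matrix 8-bit quantization: store int8 weights plus
    one float scale, so matrix products can run on integer SIMD units."""
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

W = np.random.default_rng(0).standard_normal((2048, 2048)).astype(np.float32)
W_q, scale = quantize_int8(W)
W_hat = W_q.astype(np.float32) * scale              # dequantized approximation
print(np.abs(W - W_hat).max() <= scale / 2 + 1e-6)  # True: rounding error is bounded
```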
  38. 38. • The industry has a strong interest in running DNN systems on devices, due to the increasingly popular mobile scenarios. • Even with the technologies mentioned above, the large computational cost is still very challenging given the limited processing power of devices. • A common way to fit a CD-DNN-HMM on devices is to reduce the DNN model size by ◦ reducing the number of nodes in the hidden layers ◦ reducing the number of targets in the output layer
  39. 39. • Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation. • The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy. • The problem is solved if the small-size DNN can generate output similar to that of the large-size DNN. [Figure: large (teacher) and small (student) DNNs]
  40. 40. ◦ Use the standard DNN training method to train a large-size teacher DNN using transcribed data. ◦ Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN with a large amount of untranscribed data.
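A minimal sketch of the student objective: with the teacher posteriors as soft targets, minimizing the KL divergence reduces (up to a constant) to cross-entropy against those posteriors, so no transcriptions are needed (shapes and names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_probs):
    """KL(teacher || student) per frame; the teacher-entropy term is
    constant w.r.t. the student, so only the cross-entropy part affects
    the gradient. Works on untranscribed data: no labels needed."""
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -(teacher_probs * log_p_student).sum(axis=-1).mean()

# Illustrative shapes: 100 frames, 9000 senone posteriors.
rng = np.random.default_rng(0)
teacher_probs = softmax(rng.standard_normal((100, 9000)))
student_logits = rng.standard_normal((100, 9000))
print(distillation_loss(student_logits, teacher_probs))
```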
  41. 41. • 2 million parameters for the small-size DNN, compared to 30 million parameters for the teacher DNN. • The footprint is further reduced to 0.5 million parameters when combining with SVD. [Chart: accuracy of the teacher DNN trained with standard sequence training, the small-size DNN trained with standard sequence training, and the student DNN trained with output-distribution learning]
  42. 42. [Huang 13]
  43. 43. • Develop a new language in a new scenario with a small amount of training data.
  44. 44. • Develop a new language in a new scenario with a small amount of training data. • Leverage the resource-rich languages to develop high-quality ASR for resource-limited languages.
  45. 45. [Figure: multilingual DNN. Input layer: a window of acoustic feature frames → many hidden layers forming a shared feature transformation → new output layer for the new language's senones, trained on new-language training or testing samples]
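A minimal sketch of the transfer setup in the figure: reuse hidden layers trained on resource-rich languages as a frozen shared feature transformation and train only a fresh output layer for the new language's senones (all sizes are illustrative):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)

# Hidden layers trained on resource-rich languages: the shared
# feature transformation, kept frozen for the new language.
shared = [rng.standard_normal((2048, 2048)) * 0.02 for _ in range(5)]
W_in = rng.standard_normal((2048, 440)) * 0.02   # e.g. 11-frame window of 40-d features

# Only this new output layer (the new language's senones) is trained.
n_senones_new = 3000
W_out = rng.standard_normal((n_senones_new, 2048)) * 0.02

def forward(x_window):
    h = sigmoid(W_in @ x_window)
    for W in shared:                 # frozen shared layers
        h = sigmoid(W @ h)
    return W_out @ h                 # logits over new-language senones

print(forward(rng.standard_normal(440)).shape)  # (3000,)
```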
  46. 46. [Chart: relative error reduction (y-axis 0–25%) with 3 hrs, 9 hrs, 36 hrs, and 139 hrs of new-language training data]
