
Exploring Automatic Speech Recognition with Tensorflow

https://imatge.upc.edu/web/publications/exploring-automatic-speech-recognition-tensorflow

Speech recognition is the task of identifying words in spoken language and converting them into text. This bachelor's thesis focuses on using deep learning techniques to build an end-to-end Speech Recognition system. As a preliminary step, we review the most relevant methods of the last several years. Then, we study one of the latest proposals for this end-to-end approach, which uses a sequence-to-sequence model with attention-based mechanisms. Next, we successfully reproduce the model and test it on the TIMIT database. We analyze the similarities and differences between the available implementation and the original theoretical work. Finally, we experiment with different parameters (e.g. number of layer units, learning rates and batch sizes) and reduce the Phoneme Error Rate by almost 12% relative.



  1. Exploring Automatic Speech Recognition with Tensorflow. Janna Escur, Xavier Giró, Marta Ruiz. February 5th, 2018
  2. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  3. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  4. Introduction: motivation
  5. Introduction: motivation
  6. Introduction: problem definition. An Automatic Speech Recognition (ASR) system converts a spoken utterance into text, e.g. “Good morning”.
  7. Introduction: difficulties - Variability: children, men, women… - Circumstances: the same speaker produces different speech signals - Non-native speakers - Noisy conditions - More than one speaker - Overlapping conversations - …
  8. Introduction: context. Speech2Signs: Spoken to Sign Language Translation using Neural Networks
  9. Introduction: context. Speech2Signs: Spoken to Sign Language Translation using Neural Networks
  10. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  11. Related work: Automatic Speech Recognition history - No deep learning: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM) - Deep Neural Network (DNN) - HMM - DNN without HMM: Connectionist Temporal Classification - End-to-end trained
  12. Related work: classic ASR system
  13. Related work: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM). Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends® in Signal Processing 1.3 (2008): 195-304.
  14. Related work: Deep Neural Networks - HMM; Recurrent Neural Networks - HMM. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444. Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
  15. Related work: Connectionist Temporal Classification - Does not impose frame-by-frame synchronization between input and output - The system can choose the positions of the output characters. Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8. https://distill.pub/2017/ctc/ Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006.
  16. Related work: Connectionist Temporal Classification - The Recurrent Neural Network (RNN) estimates the per-time-step probabilities. Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8. https://distill.pub/2017/ctc/ Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006.
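As a concrete illustration of these two slides, here is a minimal sketch of the CTC training objective in TensorFlow, assuming TF 2.x and its built-in tf.nn.ctc_loss; the tensor sizes and phoneme ids below are made up for the example, not taken from the thesis.

```python
import tensorflow as tf

batch, frames, num_classes = 2, 50, 40  # e.g. 39 phoneme labels + 1 CTC blank (illustrative sizes)

# Per-time-step class scores that an acoustic RNN would produce (random here, just for shape).
logits = tf.random.normal([frames, batch, num_classes])          # time-major: [frames, batch, classes]
labels = tf.constant([[3, 7, 7, 12, 0, 0],                       # zero-padded phoneme id sequences
                      [5, 1, 9, 0, 0, 0]], dtype=tf.int32)
label_length = tf.constant([4, 3], dtype=tf.int32)               # true label lengths before padding
logit_length = tf.fill([batch], frames)                          # all 50 frames are valid

# CTC marginalizes over every alignment of the labels to the frames, so the
# model never needs frame-by-frame synchronization between input and output.
loss = tf.nn.ctc_loss(labels=labels, logits=logits,
                      label_length=label_length, logit_length=logit_length,
                      logits_time_major=True, blank_index=num_classes - 1)
print(loss)  # one negative log-likelihood per utterance, shape (2,)
```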
  17. Related work: end-to-end ASR. A single DNN replaces the separate acoustic model, pronunciation dictionary and language model. Big models & lots of data!
  18. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  19. Methodology: Listen, Attend and Spell (LAS). LAS learns all the components of a speech recognizer jointly: sequence-to-sequence + attention. Chan, William, et al. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
  20. Methodology: Listen, Attend and Spell (LAS). Listener (encoder): pyramidal BLSTM.
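The listener's key trick is the pyramidal structure: each level concatenates pairs of consecutive frames before the next BLSTM, halving the time resolution. Below is a minimal sketch of that reduction, assuming Keras layers; the unit counts and feature sizes are illustrative (the thesis configurations appear in the experiments slides).

```python
import tensorflow as tf
from tensorflow.keras import layers

def pyramidal_blstm(x, units):
    """One pyramidal BLSTM level: stack pairs of consecutive frames
    (halving the time axis), then run a bidirectional LSTM."""
    b, t = tf.shape(x)[0], tf.shape(x)[1]
    d = x.shape[-1]                             # assumes a statically known feature dim
    x = x[:, : (t // 2) * 2, :]                 # drop an odd trailing frame if present
    x = tf.reshape(x, [b, t // 2, 2 * d])       # concatenate neighbouring frames
    return layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)

# 100 input frames of 40-dim filterbank features -> 25 frames after two reductions.
feats = tf.random.normal([8, 100, 40])
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(feats)
h = pyramidal_blstm(h, 128)
h = pyramidal_blstm(h, 128)
print(h.shape)  # (8, 25, 256)
```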
  21. Methodology: Listen, Attend and Spell (LAS). Speller (decoder): attention-based LSTM.
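At every output step the speller attends over the listener's outputs to decide which part of the utterance to transcribe next. Here is a sketch of the simplest content-based attention step; LAS itself passes the state and features through small learned MLPs before the dot product, a projection omitted here, and all shapes are illustrative.

```python
import tensorflow as tf

def attend(speller_state, listener_outputs):
    """One attention step: score each encoder time step against the current
    decoder state, then return the softmax-weighted context vector."""
    # speller_state: [batch, dim]; listener_outputs: [batch, time, dim]
    scores = tf.einsum('bd,btd->bt', speller_state, listener_outputs)
    alignment = tf.nn.softmax(scores, axis=-1)        # where to "listen" now
    context = tf.einsum('bt,btd->bd', alignment, listener_outputs)
    return context, alignment

# Illustrative shapes: 8 utterances, 25 reduced frames, 256-dim features.
state = tf.random.normal([8, 256])
enc = tf.random.normal([8, 25, 256])
context, alignment = attend(state, enc)
print(context.shape, alignment.shape)  # (8, 256) (8, 25)
```

The context vector is then fed, together with the previous output character, into the speller LSTM to predict the next character.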
  22. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  23. Contribution. LAS implementation by Vincent Renkens (PhD at KU Leuven): Nabu, under development. GitHub code: https://github.com/vrenkens/nabu
  24. Contribution. A total of 10 issues; we closed all of them!
  25. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  26. Experiments: TIMIT database - Complete corpus: 6300 sentences, 10 sentences spoken by each of 630 speakers - Train: 462 speakers - Complete test set: 168 speakers, 8 sentences each - Core test set: 24 speakers, 8 sentences each - Validation set: 50 speakers, 8 sentences each
  28. Evaluation metric: Phoneme Error Rate (PER). PER = (S + D + I) / N, where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of phonemes in the reference sentence.
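In code, PER is just the Levenshtein edit distance between the hypothesis and reference phoneme sequences, divided by the reference length. A small self-contained sketch (the example phoneme strings are made up):

```python
def per(reference, hypothesis):
    """Phoneme Error Rate: (S + D + I) / N via Levenshtein edit distance."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions reach an empty hypothesis
    for j in range(m + 1):
        d[0][j] = j                      # j insertions from an empty reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One substitution ("ng" -> "n") in a nine-phoneme reference: PER ≈ 0.11
print(per("g uh d m ao r n ih ng".split(), "g uh d m ao r n ih n".split()))
```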
  29. Experiments: [1] NABU. Encoder: 2 layers (+ 1 non-pyramidal while training), 128 units/layer per direction. Decoder: 1 layer, 128 units/layer. PER = 31.94%. (Plots: training and validation loss vs. steps.)
  30. Experiments: [2] LAS. Encoder: 3 layers, 256 units/layer per direction. Decoder: 2 layers, 512 units/layer. PER = 27.29%. (Plots: training and validation loss vs. steps.)
  31. Outline - Introduction - Related work - Methodology - Contribution - Experiments - Conclusions and future work
  32. Conclusions and future work - We have reached a better understanding of the techniques used to process speech in the deep learning framework. - We have worked with an implementation that is still under development.
  33. Conclusions and future work - We have trained the first end-to-end sequence-to-sequence model with attention in the TALP research area. - We tried different configurations in order to obtain the lowest Phoneme Error Rate. - The system is ready to be used as the first module of Speech2Signs or as a baseline in further research projects.
  34. Speech2Signs project
