Exploring Automatic Speech Recognition with TensorFlow
Janna Escur, Xavier Giró, Marta Ruiz
February 5th, 2018
Outline
- Introduction
- Related work
- Methodology
- Contribution
- Experiments
- Conclusions and future work
Introduction: motivation
Introduction: problem definition
speech signal → Automatic Speech Recognition (ASR) system → “Good morning”
Introduction: difficulties
- Variability: children, men, women…
- Circumstances: the same speaker produces different speech signals
- Non-native speakers
- Noisy conditions
- More than one speaker
- Overlapping conversations
- ...
Introduction: context
Speech2Signs: Spoken to Sign Language Translation using Neural Networks
Related work: Automatic Speech Recognition history
- No deep learning: Gaussian Mixture Models (GMM) - Hidden Markov Models (HMM)
- Deep Neural Networks (DNN) - HMM
- DNN without HMM: Connectionist Temporal Classification (CTC)
- End-to-end trained
Related work: classic ASR system
Related work: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM)

Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends® in Signal Processing 1.3 (2008): 195-304.
Related work: Deep Neural Networks - HMM / Recurrent Neural Networks - HMM

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
Related work: Connectionist Temporal Classification
- Does not impose frame-by-frame synchronization between input and output
- The system can choose the position of the output characters

Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8. https://distill.pub/2017/ctc/
Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006.
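A minimal sketch of this alignment-free property, assuming the TF 2.x API (not the code used in this work, which predates it): tf.nn.ctc_loss takes the per-frame logits and the unaligned target labels, and itself marginalizes over all valid alignments, so no frame-level segmentation is ever given.

import tensorflow as tf

batch, frames, vocab = 2, 50, 29                    # e.g. 28 symbols + CTC blank
logits = tf.random.normal([frames, batch, vocab])   # time-major RNN outputs

# Unaligned targets: label lengths differ from the 50 input frames, and no
# per-frame alignment is specified anywhere.
labels = tf.constant([[7, 12, 12, 4], [3, 9, 0, 0]], dtype=tf.int32)
label_length = tf.constant([4, 2], dtype=tf.int32)  # trailing zeros are padding
logit_length = tf.fill([batch], frames)

loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                      logits_time_major=True, blank_index=0)
print(loss.shape)   # (2,): one loss per utterance, summed over all alignments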
Related work: Connectionist Temporal Classification
The Recurrent Neural Network (RNN) estimates the per-time-step probabilities.
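A companion sketch of that two-step picture (again assuming the TF 2.x API): an RNN turns acoustic features into per-time-step distributions, and CTC beam-search decoding collapses repeated symbols and removes blanks, so with '-' as blank both "hh-e-ll-lo" and "h-ee-l-lo-" decode to "hello".

import tensorflow as tf

frames, batch, vocab = 50, 1, 29
rnn = tf.keras.layers.LSTM(128, return_sequences=True)
features = tf.random.normal([batch, frames, 40])     # e.g. 40 filterbank coeffs

logits = tf.keras.layers.Dense(vocab)(rnn(features))  # [batch, frames, vocab]
logits = tf.transpose(logits, [1, 0, 2])              # time-major for decoding

# Beam search over the per-frame distributions; repeats and blanks collapse.
decoded, log_probs = tf.nn.ctc_beam_search_decoder(
    logits, sequence_length=tf.fill([batch], frames), beam_width=10)
print(tf.sparse.to_dense(decoded[0]))    # best label sequence per utterance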
Related work: end-to-end ASR
A single DNN replaces the classic acoustic model, pronunciation dictionary, and language model. Big models & lots of data!
Methodology: Listen, Attend and Spell (LAS)
LAS learns all the components of a speech recognizer jointly: sequence-to-sequence learning + attention.

Chan, William, et al. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
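Compactly, in the notation of the Chan et al. paper: the listener encodes the acoustics, and the speller defines a character-level distribution conditioned on them,

h = Listen(x)
P(y | x) = ∏_i P(y_i | x, y_{<i}) = AttendAndSpell(h, y)

so the encoder, the attention mechanism, and the character-level decoder are all trained from the same objective.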
Methodology: Listen, Attend and Spell (LAS)
Listener (encoder): a pyramidal BLSTM.
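A minimal sketch of the pyramidal idea, assuming tf.keras and static shapes (not the Nabu code used in this work): each layer above the first concatenates pairs of consecutive frames, halving the time resolution so the speller has far fewer encoder steps to attend over.

import tensorflow as tf

def listen(features, num_layers=3, units=256):
    """features: [batch, time, dim]; time assumed divisible by 2**(num_layers-1)."""
    h = features
    for layer in range(num_layers):
        if layer > 0:                     # pyramid step: stack frame pairs
            b, t, d = h.shape             # static shapes assumed (eager example)
            h = tf.reshape(h, [b, t // 2, 2 * d])
        h = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True))(h)
    return h                              # [batch, time / 2**(num_layers-1), 2*units]

enc = listen(tf.random.normal([4, 80, 40]))
print(enc.shape)                          # (4, 20, 512): 4x fewer time steps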
Methodology: Listen, Attend and Spell (LAS)
Speller (decoder): an attention-based character decoder.
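One decoding step of the speller, sketched with tf.keras and simple dot-product attention (the paper uses an MLP-based attention; sizes here are assumptions, e.g. encoder output width equal to the decoder units):

import tensorflow as tf

units, vocab = 512, 30                    # assumed sizes, incl. <sos>/<eos>
cell = tf.keras.layers.LSTMCell(units)
embed = tf.keras.layers.Embedding(vocab, units)
project = tf.keras.layers.Dense(vocab)

def spell_step(prev_char, state, enc):    # enc: [batch, enc_time, units]
    out, state = cell(embed(prev_char), state)         # advance decoder RNN
    scores = tf.einsum('bd,btd->bt', out, enc)         # attention energies
    context = tf.einsum('bt,btd->bd', tf.nn.softmax(scores), enc)
    logits = project(tf.concat([out, context], axis=-1))
    return logits, state                  # softmax(logits) = P(y_i | x, y_<i)

batch = 4
enc = tf.random.normal([batch, 20, units])
state = [tf.zeros([batch, units]), tf.zeros([batch, units])]   # (h, c)
logits, state = spell_step(tf.zeros([batch], tf.int32), state, enc)  # 0 = <sos>

At inference, the argmax (or a beam) over softmax(logits) is fed back in as the next prev_char.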
Contribution
LAS implementation by Vincent Renkens (PhD at KU Leuven): Nabu, still under development.
GitHub code: https://github.com/vrenkens/nabu
Contribution
A total of 10 issues. We closed all of them!
Experiments: TIMIT database
- Complete corpus: 6300 sentences; 10 sentences spoken by each of 630 speakers
- Train set: 462 speakers
- Complete test set: 168 speakers, 8 sentences each
- Core test set: 24 speakers, 8 sentences each
- Validation set: 50 speakers, 8 sentences each
Evaluation metric
Phoneme Error Rate (PER):

PER = (S + D + I) / N

where
S: number of substitutions
D: number of deletions
I: number of insertions
N: number of phonemes in the reference sentence
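In other words, S + D + I is the Levenshtein (edit) distance between the reference and the hypothesized phoneme sequences, normalized by the reference length. A small self-contained sketch (the phoneme labels are made up for illustration):

def per(reference, hypothesis):
    """Phoneme Error Rate between two phoneme lists: (S + D + I) / N."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                    # i deletions
    for j in range(m + 1):
        dist[0][j] = j                    # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / n

print(per("sil g uh d".split(), "sil g ih d d".split()))   # 0.5: 1 sub + 1 ins, N = 4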
Experiments: [1] Nabu
Encoder
- Layers: 2 (+ 1 non-pyramidal while training)
- Units/layer: 128 per direction
Decoder
- Layers: 1
- Units/layer: 128
PER = 31.94%
[Figures: training loss and validation loss vs. steps]
Experiments: [2] LAS
Encoder
- Layers: 3
- Units/layer: 256 per direction
Decoder
- Layers: 2
- Units/layer: 512
PER = 27.29%
[Figures: training loss and validation loss vs. steps]
Conclusions and future work
- We have reached a better understanding of the techniques used to process speech in the deep learning framework.
- We have dealt with an implementation that was still under development.
Conclusions and future work
- We have trained the first end-to-end model with sequence-to-sequence learning and attention in the TALP research area.
- We tried different configurations in order to obtain the lowest Phoneme Error Rate.
- The system is ready to be used as the first module of Speech2Signs or as a baseline in further research projects.
Speech2Signs project