Exploring Automatic Speech Recognition with TensorFlow
Janna Escur, Xavier Giró, Marta Ruiz
February 5th, 2018
Outline
- Introduction
- Related work
- Methodology
- Contribution
- Experiments
- Conclusions and future work
Introduction: motivation
Introduction: problem definition
speech signal → Automatic Speech Recognition (ASR) system → “Good morning”
Introduction: difficulties
- Variability: children, men, women…
- Circumstances: the same speaker produces different speech signals
- Non-native speakers
- Noisy conditions
- More than one speaker
- Overlapping conversations
- ...
Introduction: context
Speech2Signs: Spoken to Sign Language Translation using Neural Networks
Related work: Automatic Speech Recognition history
- No deep learning: Gaussian Mixture Models (GMM) - Hidden Markov Models (HMM)
- Deep Neural Networks (DNN) - HMM
- DNN without HMM: Connectionist Temporal Classification (CTC)
- End-to-end trained
Related work: classic ASR system
Related work: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM)

Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends® in Signal Processing 1.3 (2008): 195-304.
Related work: Deep Neural Networks - HMM / Recurrent Neural Networks - HMM

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
Related work: Connectionist Temporal Classification
- Does not impose frame-by-frame synchronization between input and output
- The system can choose the position of the output characters

Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8. https://distill.pub/2017/ctc/
Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006.
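A minimal sketch of this alignment-free property, assuming the TF 2.x API (not the code used in this work, which predates it): tf.nn.ctc_loss takes the per-frame logits and the unaligned target labels, and itself marginalizes over all valid alignments, so no frame-level segmentation is ever given.

import tensorflow as tf

batch, frames, vocab = 2, 50, 29                    # e.g. 28 symbols + CTC blank
logits = tf.random.normal([frames, batch, vocab])   # time-major RNN outputs

# Unaligned targets: label lengths differ from the 50 input frames, and no
# per-frame alignment is specified anywhere.
labels = tf.constant([[7, 12, 12, 4], [3, 9, 0, 0]], dtype=tf.int32)
label_length = tf.constant([4, 2], dtype=tf.int32)  # trailing zeros are padding
logit_length = tf.fill([batch], frames)

loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                      logits_time_major=True, blank_index=0)
print(loss.shape)   # (2,): one loss per utterance, summed over all alignments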
Related work: Connectionist Temporal Classification
The Recurrent Neural Network (RNN) estimates the per-time-step probabilities.
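A companion sketch of that two-step picture (again assuming the TF 2.x API): an RNN turns acoustic features into per-time-step distributions, and CTC beam-search decoding collapses repeated symbols and removes blanks, so with '-' as blank both "hh-e-ll-lo" and "h-ee-l-lo-" decode to "hello".

import tensorflow as tf

frames, batch, vocab = 50, 1, 29
rnn = tf.keras.layers.LSTM(128, return_sequences=True)
features = tf.random.normal([batch, frames, 40])     # e.g. 40 filterbank coeffs

logits = tf.keras.layers.Dense(vocab)(rnn(features))  # [batch, frames, vocab]
logits = tf.transpose(logits, [1, 0, 2])              # time-major for decoding

# Beam search over the per-frame distributions; repeats and blanks collapse.
decoded, log_probs = tf.nn.ctc_beam_search_decoder(
    logits, sequence_length=tf.fill([batch], frames), beam_width=10)
print(tf.sparse.to_dense(decoded[0]))    # best label sequence per utterance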
Related work: end-to-end ASR
A single DNN replaces the classic acoustic model, pronunciation dictionary, and language model. Big models & lots of data!
Methodology: Listen, Attend and Spell (LAS)
LAS learns all the components of a speech recognizer jointly: sequence-to-sequence learning + attention.

Chan, William, et al. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
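Compactly, in the notation of the Chan et al. paper: the listener encodes the acoustics, and the speller defines a character-level distribution conditioned on them,

h = Listen(x)
P(y | x) = ∏_i P(y_i | x, y_{<i}) = AttendAndSpell(h, y)

so the encoder, the attention mechanism, and the character-level decoder are all trained from the same objective.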
Methodology: Listen, Attend and Spell (LAS)
Listener (encoder): a pyramidal BLSTM.
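A minimal sketch of the pyramidal idea, assuming tf.keras and static shapes (not the Nabu code used in this work): each layer above the first concatenates pairs of consecutive frames, halving the time resolution so the speller has far fewer encoder steps to attend over.

import tensorflow as tf

def listen(features, num_layers=3, units=256):
    """features: [batch, time, dim]; time assumed divisible by 2**(num_layers-1)."""
    h = features
    for layer in range(num_layers):
        if layer > 0:                     # pyramid step: stack frame pairs
            b, t, d = h.shape             # static shapes assumed (eager example)
            h = tf.reshape(h, [b, t // 2, 2 * d])
        h = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True))(h)
    return h                              # [batch, time / 2**(num_layers-1), 2*units]

enc = listen(tf.random.normal([4, 80, 40]))
print(enc.shape)                          # (4, 20, 512): 4x fewer time steps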
Methodology: Listen, Attend and Spell (LAS)
Speller (decoder): an attention-based character decoder.
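One decoding step of the speller, sketched with tf.keras and simple dot-product attention (the paper uses an MLP-based attention; sizes here are assumptions, e.g. encoder output width equal to the decoder units):

import tensorflow as tf

units, vocab = 512, 30                    # assumed sizes, incl. <sos>/<eos>
cell = tf.keras.layers.LSTMCell(units)
embed = tf.keras.layers.Embedding(vocab, units)
project = tf.keras.layers.Dense(vocab)

def spell_step(prev_char, state, enc):    # enc: [batch, enc_time, units]
    out, state = cell(embed(prev_char), state)         # advance decoder RNN
    scores = tf.einsum('bd,btd->bt', out, enc)         # attention energies
    context = tf.einsum('bt,btd->bd', tf.nn.softmax(scores), enc)
    logits = project(tf.concat([out, context], axis=-1))
    return logits, state                  # softmax(logits) = P(y_i | x, y_<i)

batch = 4
enc = tf.random.normal([batch, 20, units])
state = [tf.zeros([batch, units]), tf.zeros([batch, units])]   # (h, c)
logits, state = spell_step(tf.zeros([batch], tf.int32), state, enc)  # 0 = <sos>

At inference, the argmax (or a beam) over softmax(logits) is fed back in as the next prev_char.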
Contribution
LAS implementation by Vincent Renkens (PhD at KU Leuven): Nabu, still under development.
GitHub code: https://github.com/vrenkens/nabu
Contribution
A total of 10 issues. We closed all of them!
Experiments: TIMIT database
- Complete corpus: 6300 sentences; 10 sentences spoken by each of 630 speakers
- Train set: 462 speakers
- Complete test set: 168 speakers, 8 sentences each
- Core test set: 24 speakers, 8 sentences each
- Validation set: 50 speakers, 8 sentences each
Evaluation metric
Phoneme Error Rate (PER):

PER = (S + D + I) / N

where
S: number of substitutions
D: number of deletions
I: number of insertions
N: number of phonemes in the reference sentence
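In other words, S + D + I is the Levenshtein (edit) distance between the reference and the hypothesized phoneme sequences, normalized by the reference length. A small self-contained sketch (the phoneme labels are made up for illustration):

def per(reference, hypothesis):
    """Phoneme Error Rate between two phoneme lists: (S + D + I) / N."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                    # i deletions
    for j in range(m + 1):
        dist[0][j] = j                    # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / n

print(per("sil g uh d".split(), "sil g ih d d".split()))   # 0.5: 1 sub + 1 ins, N = 4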
Experiments: [1] Nabu
Encoder
- Layers: 2 (+ 1 non-pyramidal while training)
- Units/layer: 128 per direction
Decoder
- Layers: 1
- Units/layer: 128
PER = 31.94%
[Figures: training loss and validation loss vs. steps]
Experiments: [2] LAS
Encoder
- Layers: 3
- Units/layer: 256 per direction
Decoder
- Layers: 2
- Units/layer: 512
PER = 27.29%
[Figures: training loss and validation loss vs. steps]
Conclusions and future work
- We have reached a better understanding of the techniques used to process speech in the deep learning framework.
- We have dealt with an implementation that was still under development.
Conclusions and future work
- We have trained the first end-to-end model with sequence-to-sequence learning and attention in the TALP research area.
- We tried different configurations in order to obtain the lowest Phoneme Error Rate.
- The system is ready to be used as the first module of Speech2Signs or as a baseline in further research projects.
Speech2Signs project