Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
1. How to choose ASR
AI & BIG DATA DAY
Hamolia Vladyslav
skype: vhamolya
ELEKS
2. Agenda
● History of ASR
● ASR challenges
● General overview of ASR processing
● Speech representation
● Implementing ASR with HMMs and DNNs
● Open Source tools
● Q&A
4. ASR Challenges
● Large vocabulary
● Background noise
● Regional and social dialects
● Spoken language vs written language
● Spontaneous vs read speech
● Ambiguity
6. Speech Representation
Fast Fourier transform (FFT):
- Converts the time domain to the frequency domain
- Shows the energy in different frequency bands
- Complex spectra: all information is preserved (the transform is invertible)
- Supports sound source separation (sources overlap less in the time-frequency domain than in the time domain)
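The FFT step above can be sketched in a few lines of NumPy; the 440 Hz tone and 16 kHz sample rate are illustrative values, not from the slides:

```python
import numpy as np

# A 440 Hz sine sampled at 16 kHz for 1 second (a typical ASR sample rate).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Real FFT: time domain -> frequency domain.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The energy concentrates in the 440 Hz bin.
peak_hz = freqs[np.argmax(np.abs(spectrum))]
print(peak_hz)  # -> 440.0
```

Because the complex spectrum preserves all information, `np.fft.irfft(spectrum)` recovers the original signal.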
7. Spectrum Estimation
The spectrum of an audio signal is typically estimated over short consecutive segments, called frames:
- Real audio signals are not stationary; they vary over time
- Framewise processing assumes the signal is approximately stationary within each frame
- Frame lengths for audio applications range between 10 ms and 100 ms
- The typical frame length for ASR is 25 ms
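A minimal framing sketch, assuming NumPy and the 25 ms frame length mentioned above; the 10 ms hop length is a common default, not stated on the slide:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping frames of frame_ms, advancing by hop_ms."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

x = np.arange(16000, dtype=float)   # 1 s of dummy samples at 16 kHz
frames = frame_signal(x, 16000)
print(frames.shape)                 # -> (98, 400): 98 frames of 400 samples each
```

In practice each frame is also multiplied by a window (e.g. Hamming) before the FFT to reduce spectral leakage.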
8. MFCC
(Bhiksha Raj, Rita Singh: Techniques for Noise Robust Automatic Speech Recognition)
Common path:
- Take the FFT of the signal
- Map the powers of the spectrum obtained above onto the mel scale
- Take the discrete cosine transform of the mel log powers
- The MFCCs are the amplitudes of the resulting spectrum
9. Acoustic Model
- Phonemes are the fundamental units
- "cat" -> /k/, /æ/, /t/
- Each phoneme is split into 3 states
- Using phonemes instead of whole words gives roughly a 10% advantage
- Training a model for each word would require far too much data
10. Phonemes
Observed pronunciation variants of four words:
- "probably": p r aa b iy, p r ay b i, p r aw i uh, p r aa i iy, p r aa b uw, p ow ih, p aa iy, p aa b uh b l iy, p aa ah iy
- "sense": s eh n t s, s ih t s
- "everybody": eh v r ax b ax d iy, eh v er ax d iy, eh ux b ax iy, eh r uw ay, eh b ah iy
- "don't": d ow n, d ow, d ow n t, d ow t, d ah n, ow, n ax, d ax n, ax, n uw
13. HMM-based Recogniser
- For each training example, use the current HMM models to assign feature vectors to HMM states
- Use the Viterbi algorithm to find the most likely path through the composite HMM model
- Group the feature vectors assigned to each HMM state
- Train a GMM on each group to compute P(O|S) (the acoustic model)
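A minimal log-domain Viterbi sketch for the "most likely path" step; the tiny 2-state example at the bottom is made up for illustration:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely HMM state path.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities
    log_emit:  (T, S) log-likelihood of each observation under each state,
               e.g. log P(O_t | S_j) from the per-state GMMs
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: best score so far via i -> j
        back[t] = np.argmax(cand, axis=0)      # best predecessor for each state j
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]             # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny 2-state example: emissions strongly favour state 0, then state 1.
path = viterbi(
    np.log([0.9, 0.1]),
    np.log(np.full((2, 2), 0.5)),
    np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]),
)
print(path)  # -> [0, 1, 1]
```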
15. Language Model
- Word sequence probability: P(w1, ..., wn)
- Bigram approximation: P(wi | wi-1)
- N-gram approximation: P(wi | wi-n+1, ..., wi-1)
- ... and LSTMs
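A maximum-likelihood bigram model can be sketched with simple counting; the toy corpus is illustrative:

```python
from collections import Counter

def bigram_probs(corpus):
    """MLE bigram model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    tokens = corpus.split()
    contexts = Counter(tokens[:-1])            # counts of each word as a context
    bigrams = Counter(zip(tokens, tokens[1:])) # counts of adjacent word pairs
    return {pair: n / contexts[pair[0]] for pair, n in bigrams.items()}

probs = bigram_probs("the cat sat on the mat")
print(probs[("the", "cat")])  # -> 0.5 ("the" is followed by "cat" once out of 2)
```

Real language models add smoothing (e.g. Kneser-Ney) so that unseen n-grams do not get zero probability.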
16. DNN
● Two ways of using a DNN for the ASR task:
○ Extracting nonlinear features (and modeling them with a GMM)
○ Estimating phonetic probabilities directly
● Train the network as a classifier with a softmax across the phonetic units
● It will converge to posteriors across phonetic states
● Architectures:
○ Fully connected
○ Convolutional networks (CNNs)
○ Recurrent (LSTMs, GRUs)
● Dependencies are not long at speech frame rates (100 Hz)
(Figure: example architecture stack: Log Mel features -> Conv -> 3x LSTM -> DNN output)
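A toy forward pass of a fully connected classifier with a softmax over phonetic units, as described above; the layer sizes (13 MFCCs in, 40 units out) and random weights are illustrative only:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(13, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 40)), np.zeros(40)

x = rng.normal(size=13)            # one frame of acoustic features
h = np.tanh(x @ W1 + b1)           # hidden layer
posteriors = softmax(h @ W2 + b2)  # P(phonetic state | frame), sums to 1
print(posteriors.shape)            # -> (40,)
```

Trained with cross-entropy against frame-level state labels, the outputs converge to the state posteriors, which are then divided by state priors to obtain scaled likelihoods for the HMM.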
17. Open Source ASR
● Offline tools
○ CMUSphinx
○ Kaldi
○ Julius
● Libs
○ Time tools
○ Automatic Speech Recognition
○ KerasDeepSpeech
18. Results
Implementation details:
Lib:
Automatic_Speech_Recognition
Dataset: TIMIT
Architecture: BiLSTM
Speakers: 2
Target: Elderly people are often excluded.
Predicted: Early people are often excluded.
Target: Drop five forms in the box before you go out.
Predicted: Drop wave forms in the box before you got it.
Target: one who writes of such an era labours under a troublesome disadvantage
Predicted: one how rights of such an er a labours onder a troubles hom disadvantage
Target: Don't ask me to carry an oily rag like that.
Predicted: Don't ask me to carefully rog like that.
Target: Calcium makes bones and teeth strong.
Predicted: Calcium makes bones and tea strong.