Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
1. How to choose ASR
AI & BIG DATA DAY
Hamolia Vladyslav
skype: vhamolya
ELEKS
2. Agenda
● History of ASR
● ASR challenges
● General overview of ASR processing
● Speech representation
● Implementing ASR with HMMs and DNNs
● Open Source tools
● Q&A
4. ASR Challenges
● Large vocabulary
● Background noise
● Regional and social dialects
● Spoken language vs written language
● Spontaneous vs read speech
● Ambiguity
6. Speech Representation
Fast Fourier transform (FFT):
- Converts the time domain to the frequency domain
- Shows the energy in different frequency bands
- Complex spectra: all information is preserved (the transform is invertible)
- Supports sound source separation (sources overlap less in the time-frequency domain than in the time domain)
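The FFT step above can be sketched in a few lines of NumPy; the 440 Hz tone and 16 kHz sample rate are illustrative values, not from the slides:

```python
import numpy as np

# A 440 Hz sine sampled at 16 kHz for 1 second (a typical ASR sample rate).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Real FFT: time domain -> frequency domain.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The energy concentrates in the 440 Hz bin.
peak_hz = freqs[np.argmax(np.abs(spectrum))]
print(peak_hz)  # -> 440.0
```

Because the complex spectrum preserves all information, `np.fft.irfft(spectrum)` recovers the original signal.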
7. Spectrum Estimation
The spectrum of an audio signal is typically estimated over short consecutive segments, called frames:
- Real audio signals are not stationary; they vary over time
- Framewise processing assumes the signal is approximately stationary within each frame
- Frame lengths for audio applications range between 10 ms and 100 ms
- The typical frame length for ASR is 25 ms
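A minimal framing sketch, assuming NumPy and the 25 ms frame length mentioned above; the 10 ms hop length is a common default, not stated on the slide:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping frames of frame_ms, advancing by hop_ms."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

x = np.arange(16000, dtype=float)   # 1 s of dummy samples at 16 kHz
frames = frame_signal(x, 16000)
print(frames.shape)                 # -> (98, 400): 98 frames of 400 samples each
```

In practice each frame is also multiplied by a window (e.g. Hamming) before the FFT to reduce spectral leakage.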
8. MFCC
(Bhiksha Raj, Rita Singh: Techniques for Noise Robust Automatic Speech Recognition)
Common path:
- Take the FFT of the signal
- Map the powers of the spectrum obtained above onto the mel scale
- Take the discrete cosine transform of the mel log powers
- The MFCCs are the amplitudes of the resulting spectrum
9. Acoustic Model
- Phonemes are the fundamental units
- "cat" -> /k/, /æ/, /t/
- Each phoneme is split into 3 states
- Using phonemes instead of whole words gives roughly a 10% advantage
- Training a model for each word would require far too much data
10. Phonemes
Observed pronunciation variants of four words:
- "probably": p r aa b iy, p r ay b i, p r aw i uh, p r aa i iy, p r aa b uw, p ow ih, p aa iy, p aa b uh b l iy, p aa ah iy
- "sense": s eh n t s, s ih t s
- "everybody": eh v r ax b ax d iy, eh v er ax d iy, eh ux b ax iy, eh r uw ay, eh b ah iy
- "don't": d ow n, d ow, d ow n t, d ow t, d ah n, ow, n ax, d ax n, ax, n uw
13. HMM-based Recogniser
- For each training example, use the current HMM models to assign feature vectors to HMM states
- Use the Viterbi algorithm to find the most likely path through the composite HMM model
- Group the feature vectors assigned to each HMM state
- Train a GMM on each group to compute P(O|S) (the acoustic model)
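A minimal log-domain Viterbi sketch for the "most likely path" step; the tiny 2-state example at the bottom is made up for illustration:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely HMM state path.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities
    log_emit:  (T, S) log-likelihood of each observation under each state,
               e.g. log P(O_t | S_j) from the per-state GMMs
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: best score so far via i -> j
        back[t] = np.argmax(cand, axis=0)      # best predecessor for each state j
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]             # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny 2-state example: emissions strongly favour state 0, then state 1.
path = viterbi(
    np.log([0.9, 0.1]),
    np.log(np.full((2, 2), 0.5)),
    np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]),
)
print(path)  # -> [0, 1, 1]
```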
15. Language Model
- Word sequence probability: P(w1, ..., wn)
- Bigram approximation: P(wi | wi-1)
- N-gram approximation: P(wi | wi-n+1, ..., wi-1)
- ... and LSTMs
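A maximum-likelihood bigram model can be sketched with simple counting; the toy corpus is illustrative:

```python
from collections import Counter

def bigram_probs(corpus):
    """MLE bigram model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    tokens = corpus.split()
    contexts = Counter(tokens[:-1])            # counts of each word as a context
    bigrams = Counter(zip(tokens, tokens[1:])) # counts of adjacent word pairs
    return {pair: n / contexts[pair[0]] for pair, n in bigrams.items()}

probs = bigram_probs("the cat sat on the mat")
print(probs[("the", "cat")])  # -> 0.5 ("the" is followed by "cat" once out of 2)
```

Real language models add smoothing (e.g. Kneser-Ney) so that unseen n-grams do not get zero probability.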
16. DNN
● Two ways of using a DNN for the ASR task:
○ Extracting nonlinear features (and modeling them with a GMM)
○ Estimating phonetic probabilities directly
● Train the network as a classifier with a softmax across the phonetic units
● It will converge to posteriors across phonetic states
● Architectures:
○ Fully connected
○ Convolutional networks (CNNs)
○ Recurrent (LSTMs, GRUs)
● Dependencies are not long at speech frame rates (100 Hz)
(Figure: example architecture stack: Log Mel features -> Conv -> 3x LSTM -> DNN output)
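A toy forward pass of a fully connected classifier with a softmax over phonetic units, as described above; the layer sizes (13 MFCCs in, 40 units out) and random weights are illustrative only:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(13, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 40)), np.zeros(40)

x = rng.normal(size=13)            # one frame of acoustic features
h = np.tanh(x @ W1 + b1)           # hidden layer
posteriors = softmax(h @ W2 + b2)  # P(phonetic state | frame), sums to 1
print(posteriors.shape)            # -> (40,)
```

Trained with cross-entropy against frame-level state labels, the outputs converge to the state posteriors, which are then divided by state priors to obtain scaled likelihoods for the HMM.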
17. Open Source ASR
● Offline tools
○ CMUSphinx
○ Kaldi
○ Julius
● Libs
○ Time tools
○ Automatic Speech Recognition
○ KerasDeepSpeech
18. Results
Implementation details:
Lib:
Automatic_Speech_Recognition
Dataset: TIMIT
Architecture: BiLSTM
Speakers: 2
Target: Elderly people are often excluded.
Predicted: Early people are often excluded.
Target: Drop five forms in the box before you go out.
Predicted: Drop wave forms in the box before you got it.
Target: one who writes of such an era labours under a troublesome disadvantage
Predicted: one how rights of such an er a labours onder a troubles hom disadvantage
Target: Don't ask me to carry an oily rag like that.
Predicted: Don't ask me to carefully rog like that.
Target: Calcium makes bones and teeth strong.
Predicted: Calcium makes bones and tea strong.