This document provides an overview of parallelism techniques in HMM-DNN based automatic speech recognition (ASR) systems as implemented in the Kaldi toolkit. It discusses the stages of a typical ASR pipeline that can benefit from parallelization, including feature extraction, acoustic modelling with neural networks, language modelling, and decoding. Specific examples include using GPUs to speed up MFCC feature extraction by a factor of 97 and neural network training by factors of 10 to 1000. Advanced decoding algorithms such as Viterbi beam search and A* search are also discussed, along with GPU implementations that provide significant speedups.
2. Classical automatic speech recognition system (viable locations for parallelization marked with red arrows)
[Figure: pipeline of a classical ASR system: signal acquisition, feature extraction, acoustic modelling, Viterbi beam search / A* decoding producing N-best sentences or a word lattice, rescoring, and the final utterance; supporting blocks: acoustic model generation and sentence model preparation (phonetic utterance models, sentence model). Numbered arrows 1-6 mark the viable parallelization points.]
3. Neural networks and Deep learning in ASR
Drawbacks of HMM-GMM models:
The conventional HMM-GMM models used for ASR make the following assumptions, which prove detrimental for many applications:
1. First order Markov Chain assumption: HMM assumes the next state of the system is independent of all the previous states given the current
state. This makes capturing long distance semantics tough.
2. Parametric modelling of observations: GMMs model the observations using a mixture of Gaussians. Such parametric assumptions constrain the model, which often fails to capture essential statistics of the data.
3. Lack of generalization: Each HMM state uses only a small fraction of the training data. The absence of data sharing among the HMM states
causes poor generalization to real world variations.
4. Dimensionality reduction: Oftentimes dimensionality reduction is performed to cope with the shortage of training data. This causes loss of valuable information and compromises performance.
• Use of artificial neural networks:
Artificial neural networks have been leveraged in many ways to ameliorate these shortcomings of HMM-GMM systems:
1. Alternative to GMMs for creating the acoustic model: ANNs are used to generate a non-parametric posterior distribution over the HMM states that can be normalized to obtain (scaled) likelihoods of the observations.
2. Efficient dimensionality reduction of feature space: Autoencoders are used in TANDEM neural networks for efficient non-linear dimensionality reduction of the input feature space for use with HMM-GMM models.
3. Modelling dynamics over time: Recurrent neural networks and Time delay neural networks have been used as an alternative for HMMs for
modelling temporal dynamics of the system.
• Relevance of Deep learning:
Recently, several deep learning paradigms have found their way into ASR due to the following unique set of properties:
1. Ability to model highly non-linear functions efficiently
2. Learning of specialized input representations hierarchically
3. Possibility of extensive knowledge and parameter sharing
4. Scope of parallel distributed processing.
4. How ANN is used for ASR
• The objective of automatic speech recognition systems can be mathematically expressed as:
    W* = argmax_W P(W | X)
Where W is a sequence of words and X is the corresponding sequence of observations (the input acoustic signal).
• Using Bayes' rule, the objective function can be written as:
    W* = argmax_W P(X | W) P(W)
• Now the likelihood of the observations can be written approximately as:
    P(X | W) ≈ ∏_t P(x_t | q_t)
Where q_t denotes the state of the system at time t and x_t denotes the observation at the same instant of time.
• We can write the term P(x_t | q_t) as:
    P(x_t | q_t) = P(q_t | x_t) P(x_t) / P(q_t)
• Artificial neural networks are used for non-parametric modelling of P(q_t | x_t), and P(q_t) is estimated from the training data.
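To make the division by the prior concrete, here is a minimal sketch (all numbers made up for illustration) of turning network posteriors into the scaled likelihoods a hybrid HMM-DNN decoder consumes:

```python
import math

# Hypothetical values: a network posterior over 3 HMM states for one frame,
# and state priors estimated by counting state occupancies in the training
# alignments.
posteriors = [0.7, 0.2, 0.1]      # P(q_t | x_t) from the network softmax
state_counts = [500, 300, 200]    # frames aligned to each state in training

priors = [c / sum(state_counts) for c in state_counts]  # P(q_t)

# Scaled likelihoods: P(x_t | q_t) is proportional to P(q_t | x_t) / P(q_t).
# The common factor P(x_t) is the same for every state at time t, so it
# cancels in the argmax over word sequences and can be dropped.
scaled_lik = [p / pr for p, pr in zip(posteriors, priors)]

# Decoders work in the log domain to avoid underflow:
log_scaled_lik = [math.log(s) for s in scaled_lik]
```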
5. Instances of Parallelism
1. Feature extraction: Mel-frequency cepstral coefficients (MFCCs) are the most widely used features for continuous speech recognition. GPUs make the extraction of these features up to 97 times faster.
• Kou H, Shang W, Lane I, Chong J, Optimized MFCC feature extraction on GPU, ICASSP 2013
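As a sketch of what is being parallelized, here is the triangular mel filterbank at the heart of MFCC extraction, in plain Python (function names and sizes are illustrative, not Kaldi's implementation). Every frame's filterbank products and subsequent DCT are independent of every other frame, which is exactly the structure GPU implementations exploit:

```python
import math

def hz_to_mel(f):
    # Standard mel-scale mapping used in MFCC extraction.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale.

    Returns n_filters rows, each of length n_fft // 2 + 1; applying these
    to a frame's power spectrum, taking logs, and a DCT-II yields MFCCs.
    """
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + i * (high - low) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    # FFT bin index of each filter's left edge / centre / right edge.
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    fbank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):           # rising slope
            fbank[i][k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling slope
            fbank[i][k] = (r - k) / max(r - c, 1)
    return fbank
```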
2. Probabilistic modelling of observations: Neural networks are used for dimensionality reduction and probabilistic modelling of likelihoods. Layer-level and data-level parallelism of the neural networks can be achieved on GPUs, speeding up computations by 10x to 1000x.
• Hinton et al. Deep neural networks for acoustic modelling in speech recognition, IEEE Signal Processing Magazine
2012
• Dixon PR, Oonishi T, Furui S, Harnessing graphics processors for the fast computation of acoustic likelihoods in speech
recognition, Computer Speech and Language, Elsevier 2009
3. Phonetic utterance and language modelling: Learning and inference of neural phonetic and language models can be sped up using GPUs.
• Boulanger-Lewandowski N et al. Phone sequence modelling with recurrent neural networks, ICASSP 2014
• Bengio et al. A neural probabilistic language model, JMLR 2003
4. Decoding of optimal utterance: The most likely utterances are searched using techniques like Viterbi beam search and
A* decoding. The use of GPUs achieves remarkable speedup in these tasks.
• Langdon et al. Non-recursive beam search on GPU for formal concept analysis, Research note, University of London
• Zhou Y, Zeng J. Massively parallel A* search on a GPU. Proc. 29th AAAI Conference on Artificial Intelligence
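To make the pruning idea concrete, here is a minimal time-synchronous Viterbi beam search; the interface and beam semantics are illustrative, not those of an actual Kaldi decoder:

```python
def viterbi_beam(obs_loglik, log_trans, log_init, beam):
    """Time-synchronous Viterbi search with beam pruning.

    obs_loglik[t][s]: log-likelihood of frame t under state s (e.g. the
    scaled DNN likelihoods); log_trans[s][s2]: log transition probability;
    log_init[s]: log initial probability. States scoring more than `beam`
    below the frame-best are pruned, trading exactness for speed exactly
    as in LVCSR decoders. Returns (best_score, best_state_sequence).
    """
    n_states = len(log_init)
    # active: state -> (score of best path ending there, that path)
    active = {s: (log_init[s] + obs_loglik[0][s], [s]) for s in range(n_states)}
    for t in range(1, len(obs_loglik)):
        new = {}
        for s, (score, path) in active.items():
            for s2 in range(n_states):
                cand = score + log_trans[s][s2] + obs_loglik[t][s2]
                if s2 not in new or cand > new[s2][0]:
                    new[s2] = (cand, path + [s2])
        best = max(v[0] for v in new.values())
        # Beam pruning: keep only hypotheses within `beam` of the best.
        active = {s: v for s, v in new.items() if v[0] >= best - beam}
    return max(active.values(), key=lambda v: v[0])
```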
6. KALDI: Toolkit for ASR
Kaldi is a toolkit for speech recognition written in C++.
Why Kaldi?
• Open Source: extensive development and contribution.
• Online Decoding: on-the-fly decoding for continuous speech.
• FST Framework: integration with OpenFst makes the system light-weight, extremely efficient for computation, and suitable for parallel distributed processing.
7. Kaldi Workflow
Data and Lexicon Preparation
a) Partition data into training, validation and test sets.
b) Dictionary preparation.
c) Language model initialization.
d) Check data for consistency.
MFCC and CMVN for Datasets
a) Extract MFCC, delta and delta-delta features.
b) Compute CMVN stats for every speaker.
Various Training and Decoding Methods
a) Align the model.
b) Train the system using a scheme.
c) Prepare a combined WFST (Weighted Finite State Transducer), called the HCLG FST, from the acoustic HMM, context information, grammar (trigram) and lexicon (pronunciation).
Further Optimization of the HCLG FST using Training
a) RBM pretraining.
b) Fine-tuning using the cross-entropy error criterion.
c) sMBR sequence-discriminative training.
RESULT
a) Final decoding on the HCLG FST.
b) Generate and store the result.
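The per-speaker CMVN step in the workflow above can be sketched as follows; this is a plain-Python illustration of the idea, not Kaldi's actual compute-cmvn-stats implementation:

```python
import math

def cmvn_stats(frames):
    """Accumulate per-dimension mean and variance stats for one speaker."""
    dim, n = len(frames[0]), len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var

def apply_cmvn(frames, mean, var):
    # Normalize each feature dimension to zero mean and unit variance,
    # removing per-speaker channel and loudness offsets.
    return [[(f[d] - mean[d]) / math.sqrt(var[d] + 1e-10)
             for d in range(len(mean))] for f in frames]
```

In practice the stats are accumulated once per speaker over all of that speaker's utterances, then applied to every frame.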
8. Karel's Implementation
STAGE 0
• Compute FMLLR (feature-space maximum likelihood linear regression) features.
STAGE 1
• Pretrain the DNN in the Deep Belief Network approach.
STAGE 2
• Fine-tune the DNN using the cross-entropy error criterion.
• Decode the HCLG FST (Finite State Transducer).
STAGE 3
• Generate word lattices and alignments.
• sMBR sequence-discriminative training using stochastic gradient descent.
• Six-fold cross-validation using sMBR sequence-discriminative training.
9. Deliverables
• Working HTK system
• Organized MFCC files for TIMIT
• Working Kaldi system
• Proposed optimization in Kaldi
11. Timeline
Preliminary concept build-up (2 weeks)
1) Basic concepts of ASR
2) Working with the TIMIT dataset
3) MFCC generation
4) Running HTK for training and decoding
Getting Kaldi running (3 weeks)
1) Acquaintance with Kaldi
2) Running scripts for training, decoding and Karel's algorithm
3) Identify modules for decoding
4) Figure out segments involved in decoding, the forward pass and word lattice generation
In quest of optimization (2 weeks)
1) Point out operational differences between Kaldi and PocketSphinx
2) Hardware and software optimizations in PocketSphinx
3) Thorough theoretical survey of high-performance algorithms such as Viterbi beam search, A*, N-best sentence search and lattice generation
Drafting proposal (1 week)
1) Identifying possible optimizations in Kaldi
2) Proposing optimizations to parallelise the Kaldi system
13. Analysis

Word Error Rate (%) by toolkit:
ASR Toolkit     | HTK  | PocketSphinx | Kaldi
Word Error Rate | 18.4 | 16.2         | 6.6

Kaldi vs PocketSphinx:
Kaldi: HMM-DNN based system | PocketSphinx: HMM-GMM based system
Kaldi: uses a complex math library (OpenBLAS) | PocketSphinx: can use a simple math library such as Eigen
Kaldi: works only on hardware with floating-point support | PocketSphinx: can work on fixed-point architectures
Kaldi: model represented as a Weighted Finite State Transducer (WFST) | PocketSphinx: model represented as a tree structure
Kaldi: better accuracy (WER = 6.6%) | PocketSphinx: lower accuracy (WER = 16.2%)
Kaldi: capable of working on complex hardware | PocketSphinx: capable of working on simple embedded systems
14. Advanced Decoding Algorithms
• Drawbacks of Viterbi decoding:
a) Biased towards short sentences.
b) Predicts only the best path, ruling out the possibility of iterative decoding.
c) Fails for language models more complex than a bigram.
d) Biased towards words with fewer pronunciation variations.
• N-best sequence of states:
a) Predicts a set of N-best sentence hypotheses.
b) Iterative decoding is difficult to implement.
• Word-lattice based decoding:
a) Iterative decoding is possible, where multiple decoding algorithms can be used.
b) The output of the previous iteration constrains the word lattice of the next iteration.
c) A forward pass prunes the generated word lattice in conjunction with the Viterbi algorithm.
d) The word lattice is generated after the early-pass algorithm.
e) The generated word lattice is rescored using more sophisticated techniques.
f) Decoding schemes such as A* and N-best sentence search are used for fine-tuning.
g) Balances the trade-off between space and time.
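The N-best and lattice-rescoring ideas can be illustrated with a toy search over a hand-made word lattice; the lattice, costs, and function name below are invented for the example, and a uniform-cost (Dijkstra-style) expansion is used — adding an admissible heuristic term to the priority would make it A* proper:

```python
import heapq

def nbest_paths(lattice, start, goal, n):
    """Return the n best (lowest-cost) word sequences through a lattice.

    lattice: dict mapping node -> list of (next_node, word, cost) arcs.
    Paths are popped from the priority queue in order of total cost, so
    the first n complete paths reaching `goal` are the n best.
    """
    heap = [(0.0, start, [])]
    results = []
    while heap and len(results) < n:
        cost, node, words = heapq.heappop(heap)
        if node == goal:
            results.append((cost, words))
            continue
        for nxt, word, c in lattice.get(node, []):
            heapq.heappush(heap, (cost + c, nxt, words + [word]))
    return results
```

A rescoring pass would re-weight the arc costs (e.g. with a stronger language model) and rerun the same search over the pruned lattice.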
15. References
1. D. Jurafsky and J. H. Martin, Speech and Language Processing, 1999.
2. B. H. Juang, "An Introduction to Hidden Markov Models," Jan. 1986.
3. M. K. Ravishankar, "Efficient Algorithms for Speech Recognition," PhD thesis, 1996.
4. G. Saon, D. Povey, and G. Zweig, "Anatomy of an extremely fast LVCSR decoder."
5. H. Murveit, J. Butzberger, and M. Weintraub, "Large-vocabulary dictation using SRI's DECIPHER speech recognition system: progressive search techniques," pp. 319–322, 1993.
6. X. Lei, A. Senior, A. Gruenstein, and J. Sorensen, "Accurate and compact large vocabulary speech recognition on mobile devices," pp. 662–665, Aug. 2013.
7. D. Povey, M. Hannemann, G. Boulianne, A. Ghoshal, et al., "Generating exact lattices in the WFST framework," pp. 4213–4216, 2012.
8. J. W. Klovstad and L. F. Mondshein, "The CASPERS linguistic analysis system," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-23, no. 1, pp. 118–123, 1975.
9. R. Schwartz and Y.-L. Chow, "The N-best algorithm: an efficient and exact procedure for finding the N most likely sentence hypotheses," pp. 2–5.
10. L. R. Bahl and R. L. Mercer, "Design of a linguistic statistical decoder for the recognition of continuous speech," vol. i, pp. 250–256, 1975.
11. S. Haykin, Neural Networks: A Comprehensive Foundation, 1990.
12. E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for automatic speech recognition," vol. 37, pp. 91–126, 2001.
13. H. Ney et al., "Improvements in beam search for 10000-word continuous speech recognition," pp. 9–12, 1992.
14. L. Nguyen and R. Schwartz, "Single-tree method for grammar-directed search."
15. D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition," PhD thesis.
16. D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, "PocketSphinx: a free, real-time continuous speech recognition system for hand-held devices," pp. 185–188, 2006.
17. D. Furcy and S. Koenig, "Limited Discrepancy Beam Search."
18. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, 2015.
19. N. Seshadri and C.-E. W. Sundberg, "List Viterbi decoding algorithms with applications," IEEE Transactions on Communications, vol. 42, 1994.