Implementation and Optimization of
parallelism in HMM-DNN based state
of the art Kaldi ASR Toolkit
BY
Shubham
[Figure: Classical automatic speech recognition system, with viable locations for parallelization marked by red arrows numbered 1 to 6. Blocks: signal acquisition, feature extraction, acoustic modelling, acoustic model generation, phonetic utterance models, sentence model preparation, sentence model, Viterbi beam search / A* decoding, N-best sentences or word lattice, rescoring, final utterance.]
Neural networks and Deep learning in ASR
• Drawbacks of HMM-GMM models:
The conventional HMM-GMM models used for ASR make the following assumptions, which prove detrimental in many applications:
1. First-order Markov chain assumption: An HMM assumes that the next state of the system is independent of all previous states given the current state. This makes long-distance semantics hard to capture.
2. Parametric modelling of observations: GMMs model the observations as a mixture of Gaussians. Such a parametric assumption constrains the model, which often fails to capture essential statistics of the data.
3. Lack of generalization: Each HMM state uses only a small fraction of the training data. The absence of data sharing among the HMM states causes poor generalization to real-world variation.
4. Dimensionality reduction: Dimensionality reduction is often performed to cope with the shortage of training data. This loses valuable information and compromises performance.
• Use of artificial neural networks:
Artificial neural networks (ANNs) have been used in several ways to mitigate these shortcomings of HMM-GMM systems:
1. Alternative to GMMs for creating the acoustic model: ANNs generate a non-parametric posterior distribution over the HMM states, which can be normalized to obtain (scaled) likelihoods of the observations.
2. Efficient dimensionality reduction of the feature space: Autoencoders are used in TANDEM neural networks for efficient non-linear dimensionality reduction of the input feature space, for use with HMM-GMM models.
3. Modelling dynamics over time: Recurrent neural networks and time-delay neural networks have been used as an alternative to HMMs for modelling the temporal dynamics of the system.
• Relevance of deep learning:
Several deep learning paradigms have recently found their way into ASR because of the following properties:
1. Ability to model highly non-linear functions efficiently
2. Hierarchical learning of specialized input representations
3. Possibility of extensive knowledge and parameter sharing
4. Scope for parallel distributed processing
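How ANN is used for ASR
• The objective of an automatic speech recognition system can be expressed mathematically as:
$$\hat{W} = \arg\max_{W} P(W \mid X)$$
where $W$ is a sequence of words and $X$ is the corresponding sequence of observations (the input acoustic signal).
• Applying Bayes' rule and dropping $P(X)$, which does not depend on $W$, the objective function can be written as:
$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)$$
• The likelihood of the observations can be written approximately as:
$$P(X \mid W) \approx \max_{q_1, \dots, q_T} \prod_{t=1}^{T} P(x_t \mid q_t)\, P(q_t \mid q_{t-1})$$
where $q_t$ denotes the state of the system at time $t$ and $x_t$ denotes the observation at the same instant.
• By Bayes' rule, the term $P(x_t \mid q_t)$ can be written as:
$$P(x_t \mid q_t) = \frac{P(q_t \mid x_t)\, P(x_t)}{P(q_t)}$$
• Artificial neural networks are used for non-parametric modelling of $P(q_t \mid x_t)$, and $P(q_t)$ is estimated from the given data. Since $P(x_t)$ does not depend on the state, it can be ignored during decoding.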
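The conversion from network posteriors to scaled likelihoods can be sketched in a few lines. This is only an illustrative sketch, not Kaldi code: the function names, the toy alignment and the toy posteriors are assumptions made for the example. It estimates the state priors P(q) by counting states in a frame-level alignment, then subtracts the log prior from the log posterior of each frame.

```python
import math

def estimate_priors(alignment, num_states):
    """Estimate P(q) as the relative frequency of each state in a frame alignment."""
    counts = [0] * num_states
    for state in alignment:
        counts[state] += 1
    total = sum(counts)
    return [c / total for c in counts]

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Return log P(q|x_t) - log P(q) per frame and state.

    This equals log P(x_t|q) up to the state-independent term log P(x_t),
    which is dropped. Values are floored to avoid log(0).
    """
    log_priors = [math.log(max(p, floor)) for p in priors]
    return [
        [math.log(max(post, floor)) - lp for post, lp in zip(frame, log_priors)]
        for frame in posteriors
    ]

# Hypothetical data: 8 aligned frames over 3 states, and 2 frames of posteriors.
alignment = [0, 0, 0, 1, 1, 2, 0, 0]
priors = estimate_priors(alignment, num_states=3)
posteriors = [[0.7, 0.2, 0.1], [0.5, 0.25, 0.25]]
scores = scaled_log_likelihoods(posteriors, priors)
```

In the second toy frame, state 0 has the highest posterior, but after division by the priors the rare state 2 gets the highest scaled likelihood. This is the effect of removing the prior from the network output.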
Instances of Parallelism
1. Feature extraction: Mel Frequency Cepstral Coefficients (MFCCs) are the most widely used features for continuous speech recognition. GPU-based extraction of these features has been reported to be 97 times faster.
• Kou H, Shang W, Lane I, Chong J, Optimized MFCC feature extraction on GPU, ICASSP 2013
2. Probabilistic modelling of observations: Neural networks are used for dimensionality reduction and probabilistic modelling of likelihoods. Layer-level and data-level parallelism of the neural networks can be achieved on GPUs, speeding up computation by 10x to 1000x.
• Hinton et al. Deep neural networks for acoustic modelling in speech recognition, IEEE Signal Processing Magazine 2012
• Dixon PR, Oonishi T, Furui S, Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition, Computer Speech and Language, Elsevier 2009
3. Phonetic utterance and language modelling: Learning and inference of neural phonetic and language models can be sped up using GPUs.
• Boulanger-Lewandowski N et al. Phone sequence modelling with recurrent neural networks, ICASSP 2014
• Bengio et al. A neural probabilistic language model, JMLR 2003
4. Decoding of the optimal utterance: The most likely utterances are searched for using techniques such as Viterbi beam search and A* decoding. GPUs can give a substantial speed-up in these tasks.
• Langdon et al. Non-recursive beam search on GPU for formal concept analysis, Research note, University of London
• Zhou Y, Zeng J. Massively parallel A* search on a GPU. Proc. 29th AAAI Conference on Artificial Intelligence
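Data-level parallelism, as mentioned in item 2, can be illustrated with a small sketch: the frames of a batch are split into shards, and the same layer is applied to each shard concurrently. This is a structural illustration only. The function names are hypothetical, the layer is a plain sigmoid layer written in pure Python, and a thread pool stands in for GPU cores; in CPython the threads show the partitioning rather than deliver a real speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def dense_sigmoid(frames, weights, biases):
    """Forward pass of one fully connected sigmoid layer over a list of frames."""
    outputs = []
    for x in frames:
        row = []
        for w, b in zip(weights, biases):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            row.append(1.0 / (1.0 + math.exp(-z)))
        outputs.append(row)
    return outputs

def data_parallel_forward(frames, weights, biases, num_workers=4):
    """Split frames into contiguous shards and process the shards concurrently.

    Every worker uses the same weights (shared parameters) on different data.
    """
    shard = max(1, math.ceil(len(frames) / num_workers))
    shards = [frames[i:i + shard] for i in range(0, len(frames), shard)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map keeps the shard order, so frames stay in their original order.
        parts = list(pool.map(lambda s: dense_sigmoid(s, weights, biases), shards))
    return [row for part in parts for row in part]
```

Because each frame is processed independently of the others, the sharded result is identical to the serial one. That independence is what makes acoustic likelihood computation a good fit for GPUs.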
KALDI: Toolkit for ASR
Kaldi is a toolkit for speech recognition written in C++.
Why Kaldi?
• Open source: extensive development and contribution.
• Online decoding: on-the-fly decoding for continuous speech.
• FST framework: integration with OpenFst makes the system lightweight, computationally efficient and suitable for parallel distributed processing.
Kaldi Workflow
Data and Lexicon Preparation
a) Partition data into training, validation and test sets.
b) Dictionary preparation.
c) Language model initialization.
d) Check for consistency of data.
MFCC and CMVN for Datasets
a) Extract MFCC, delta and delta-delta features.
b) Compute CMVN (cepstral mean and variance normalization) stats for every speaker.
Various Training and Decoding Methods
a) Align the model.
b) Train the system using a scheme.
c) Prepare a combined WFST (Weighted Finite State Transducer), called the HCLG FST, from the acoustic HMM (H), context information (C), lexicon/pronunciation (L) and grammar/trigram (G).
Further Optimization of HCLG FST using training
a) RBM pretraining.
b) Fine-tuning using the cross-entropy error criterion.
c) sMBR sequence-discriminative training.
RESULT
a) Final decoding on the HCLG FST.
b) Generate and store the result.
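The per-speaker CMVN step of the workflow can be sketched as follows. This is a minimal illustration in pure Python, not the Kaldi implementation: the function names, the dictionary layout and the toy two-dimensional features are assumptions, and only the utterance-to-speaker mapping idea is borrowed from Kaldi's data layout. Statistics are accumulated over all utterances of a speaker, then each frame is shifted to zero mean and scaled to unit variance.

```python
import math

def compute_cmvn_stats(utterances, utt2spk):
    """Accumulate per-speaker frame count, sum and sum of squares per dimension."""
    stats = {}
    for utt, frames in utterances.items():
        spk = utt2spk[utt]
        dim = len(frames[0])
        if spk not in stats:
            stats[spk] = {"n": 0, "sum": [0.0] * dim, "sumsq": [0.0] * dim}
        s = stats[spk]
        for frame in frames:
            s["n"] += 1
            for d, value in enumerate(frame):
                s["sum"][d] += value
                s["sumsq"][d] += value * value
    return stats

def apply_cmvn(frames, s, var_floor=1e-10):
    """Normalize frames to zero mean and unit variance using one speaker's stats."""
    n = s["n"]
    mean = [total / n for total in s["sum"]]
    var = [max(sq / n - m * m, var_floor) for sq, m in zip(s["sumsq"], mean)]
    std = [math.sqrt(v) for v in var]
    return [[(v - m) / sd for v, m, sd in zip(frame, mean, std)] for frame in frames]

# Hypothetical toy data: two utterances from one speaker, 2-dimensional features.
utterances = {
    "spk1_utt1": [[1.0, 10.0], [3.0, 14.0]],
    "spk1_utt2": [[5.0, 18.0], [7.0, 22.0]],
}
utt2spk = {"spk1_utt1": "spk1", "spk1_utt2": "spk1"}
stats = compute_cmvn_stats(utterances, utt2spk)
normalized = {u: apply_cmvn(f, stats[utt2spk[u]]) for u, f in utterances.items()}
```

Accumulating sums rather than storing frames keeps the statistics small, and the accumulation over utterances can be split across workers and merged by addition.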
Karel's Implementation (Stages 0 to 3)
• Compute fMLLR (feature-space maximum likelihood linear regression) features.
• Pretrain the DNN using the Deep Belief Network approach.
• Fine-tune the DNN using the cross-entropy error criterion.
• Decode using the HCLG FST (Finite State Transducer).
• Generate word lattices and alignments.
• sMBR sequence-discriminative training using stochastic gradient descent.
• Six-fold cross validation using sMBR sequence-discriminative training.
TimeLine
• Basic concept gathering
• Acquaintance with toolkit
• Installation of toolkit
• State of the art ASR system
• Proposal of optimizations
Deliverables
• Working HTK system
• Organized MFCC files for TIMIT
• Working Kaldi system
• Proposed optimizations in Kaldi
Preliminary concept build-up (2 weeks)
1) Basic concepts of ASR
2) Working with the TIMIT dataset
3) MFCC generation
4) Running HTK for training and decoding
Getting Kaldi running (3 weeks)
1) Acquaintance with Kaldi
2) Running scripts for training, decoding and Karel's algorithm
3) Identify modules for decoding
4) Figure out the segments involved in decoding, the forward pass and word lattice generation
In quest of optimization (2 weeks)
1) Point out operational differences between Kaldi and PocketSphinx
2) Hardware and software optimizations in PocketSphinx
3) Thorough theoretical survey of high-performance algorithms such as Viterbi beam search, A*, N-best sentence search and lattice generation
Drafting proposal (1 week)
1) Identify possible optimizations in Kaldi
2) Propose optimizations in Kaldi to parallelise the system
[Figure: Progress of Training. Bar chart of word error rate (axis range 0 to 30) by training method: Monophone Training, Delta + Delta-Delta Training, LDA + MLLT Training, LDA + MLLT + SAT Training, Karel's Implementation.]
Analysis
ASR Toolkit            HTK     PocketSphinx    Kaldi
Word Error Rate (%)    18.4    16.2            6.6
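For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the hypothesis and the reference transcript, divided by the number of reference words. The sketch below is an assumed, minimal scorer for a single sentence pair; it is not the scoring script behind the figures in the table, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: 100 * (S + D + I) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.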
Kaldi | PocketSphinx
HMM-DNN based system | HMM-GMM based system
Uses a complex math library (OpenBLAS) | Can use a simple math library such as Eigen
Works only on hardware with floating-point support | Can work on fixed-point architectures
Model represented as a Weighted Finite State Transducer (WFST) | Model represented as a tree structure
Better accuracy (WER = 6.6%) | Lower accuracy (WER = 16.2%)
Capable of working on complex hardware | Capable of working on simple embedded systems
Advanced Decoding Algorithms
• Drawbacks of Viterbi decoding:
a) Biased towards short sentences.
b) Predicts only the single best path, which rules out iterative decoding.
c) Fails for language models more complex than a bigram.
d) Biased towards words with fewer pronunciation variants.
• N-best sequences of states:
a) Predicts a set of N-best sentence hypotheses.
b) Iterative decoding is difficult to implement.
• The system works using a word lattice algorithm:
a) Iterative decoding is possible, and multiple decoding algorithms can be used.
b) The output of the previous iteration constrains the word lattice of the next iteration.
c) A forward pass with pruning, in conjunction with the Viterbi algorithm, generates the word lattice.
d) The word lattice is generated after this early pass.
e) The generated word lattice is rescored using more sophisticated techniques.
f) A* and N-best sentence decoding are used for fine-tuning.
g) Balances the trade-off between space and time.
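Viterbi decoding with beam pruning can be sketched on a toy two-state HMM. This is a minimal illustration under stated assumptions, not the Kaldi decoder: the function names, the log-domain input layout and the toy probabilities are all hypothetical. At each frame, only hypotheses whose score lies within `beam` of the best score are kept, which trades a small risk of search error for less work per frame.

```python
import math

def prune(active, beam):
    """Keep only states whose score is within `beam` of the best score."""
    best = max(score for score, _ in active.values())
    return {s: sp for s, sp in active.items() if sp[0] >= best - beam}

def viterbi_beam(log_init, log_trans, log_obs, beam=5.0):
    """Return (best log score, best state path).

    log_init[s]      : log initial probability of state s
    log_trans[s][s2] : log transition probability from s to s2
    log_obs[t][s]    : log likelihood of frame t in state s
    """
    num_states = len(log_init)
    active = {s: (log_init[s] + log_obs[0][s], [s]) for s in range(num_states)}
    active = prune(active, beam)
    for t in range(1, len(log_obs)):
        expanded = {}
        for s, (score, path) in active.items():
            for s2 in range(num_states):
                candidate = score + log_trans[s][s2] + log_obs[t][s2]
                # Viterbi recombination: keep only the best path into each state.
                if s2 not in expanded or candidate > expanded[s2][0]:
                    expanded[s2] = (candidate, path + [s2])
        active = prune(expanded, beam)
    return max(active.values(), key=lambda sp: sp[0])

def logs(values):
    return [math.log(v) for v in values]

# Hypothetical toy model with two states and three frames.
log_init = logs([0.6, 0.4])
log_trans = [logs([0.7, 0.3]), logs([0.4, 0.6])]
log_obs = [logs([0.5, 0.1]), logs([0.4, 0.3]), logs([0.1, 0.6])]
best_score, best_path = viterbi_beam(log_init, log_trans, log_obs, beam=10.0)
```

With the wide beam used here nothing is pruned and the result is the exact Viterbi path, state sequence 0, 0, 1 with probability 0.01512. A narrower beam discards more hypotheses per frame and may, in general, miss the best path.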
Thank you

More Related Content

What's hot

Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognitionRichie
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognitionananth
 
Speech recognition techniques
Speech recognition techniquesSpeech recognition techniques
Speech recognition techniquessonukumar142
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Yuki Tomo
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice RecognitionAmrita More
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep LearningAdam Gibson
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in androidAnshuli Mittal
 
Short story presentation
Short story presentationShort story presentation
Short story presentationStutiAgarwal36
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaAlexey Grigorev
 
Arabic named entity recognition using deep learning approach
Arabic named entity recognition using deep learning approachArabic named entity recognition using deep learning approach
Arabic named entity recognition using deep learning approachIJECEIAES
 
A Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemA Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemVani011
 
Speech Recognition
Speech Recognition Speech Recognition
Speech Recognition Goa App
 
ATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliterationATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliterationIJECEIAES
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanismKhang Pham
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningBigDataCloud
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERTshaurya uppal
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Reviewchangedaeoh
 

What's hot (20)

Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognition
 
Speech recognition techniques
Speech recognition techniquesSpeech recognition techniques
Speech recognition techniques
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
 
Language models
Language modelsLanguage models
Language models
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech Recognition
 
Chatbot
ChatbotChatbot
Chatbot
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep Learning
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in android
 
Short story presentation
Short story presentationShort story presentation
Short story presentation
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
Arabic named entity recognition using deep learning approach
Arabic named entity recognition using deep learning approachArabic named entity recognition using deep learning approach
Arabic named entity recognition using deep learning approach
 
A Survey on Speaker Recognition System
A Survey on Speaker Recognition SystemA Survey on Speaker Recognition System
A Survey on Speaker Recognition System
 
Speech Recognition
Speech Recognition Speech Recognition
Speech Recognition
 
ATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliterationATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliteration
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanism
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 

Viewers also liked

Kaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source codeKaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source codeXavier Anguera
 
MASK: Robust Local Features for Audio Fingerprinting
MASK: Robust Local Features for Audio FingerprintingMASK: Robust Local Features for Audio Fingerprinting
MASK: Robust Local Features for Audio FingerprintingXavier Anguera
 
English Preposition
English PrepositionEnglish Preposition
English Prepositionannasnz19
 
Lua: the world's most infuriating language
Lua: the world's most infuriating languageLua: the world's most infuriating language
Lua: the world's most infuriating languagejgrahamc
 
How to use NLP in Business
How to use NLP in BusinessHow to use NLP in Business
How to use NLP in BusinessMorgan PR
 
Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer BuildPetteriTeikariPhD
 

Viewers also liked (7)

Kaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source codeKaldi-voice: Your personal speech recognition server using open source code
Kaldi-voice: Your personal speech recognition server using open source code
 
MASK: Robust Local Features for Audio Fingerprinting
MASK: Robust Local Features for Audio FingerprintingMASK: Robust Local Features for Audio Fingerprinting
MASK: Robust Local Features for Audio Fingerprinting
 
Nlp tech talk
Nlp tech talkNlp tech talk
Nlp tech talk
 
English Preposition
English PrepositionEnglish Preposition
English Preposition
 
Lua: the world's most infuriating language
Lua: the world's most infuriating languageLua: the world's most infuriating language
Lua: the world's most infuriating language
 
How to use NLP in Business
How to use NLP in BusinessHow to use NLP in Business
How to use NLP in Business
 
Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer Build
 

Similar to Implemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit

Text prediction based on Recurrent Neural Network Language Model
Text prediction based on Recurrent Neural Network Language ModelText prediction based on Recurrent Neural Network Language Model
Text prediction based on Recurrent Neural Network Language ModelANIRUDHMALODE2
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"Lviv Startup Club
 
Theses exam 2012 - Wideband Speech Reconstruction
Theses exam 2012 - Wideband Speech ReconstructionTheses exam 2012 - Wideband Speech Reconstruction
Theses exam 2012 - Wideband Speech ReconstructionFitrie Ratnasari
 
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...IJECEIAES
 
Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...eSAT Publishing House
 
Scene Text detection in Images-A Deep Learning Survey
 Scene Text detection in Images-A Deep Learning Survey Scene Text detection in Images-A Deep Learning Survey
Scene Text detection in Images-A Deep Learning SurveySrilalitha Veerubhotla
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition finalArchit Vora
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
 
A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...
A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...
A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...IJECEIAES
 
Cj32980984
Cj32980984Cj32980984
Cj32980984IJMER
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesScott Edmunds
 
Performance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence DetectionalgoPerformance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence DetectionalgoRahul Shirude
 
The Most Important Algorithms
The Most Important AlgorithmsThe Most Important Algorithms
The Most Important Algorithmswensheng wei
 

Similar to Implemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit (20)

Et25897899
Et25897899Et25897899
Et25897899
 
Text prediction based on Recurrent Neural Network Language Model
Text prediction based on Recurrent Neural Network Language ModelText prediction based on Recurrent Neural Network Language Model
Text prediction based on Recurrent Neural Network Language Model
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
 
Theses exam 2012 - Wideband Speech Reconstruction
Theses exam 2012 - Wideband Speech ReconstructionTheses exam 2012 - Wideband Speech Reconstruction
Theses exam 2012 - Wideband Speech Reconstruction
 
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
 
Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...Identification of frequency domain using quantum based optimization neural ne...
Identification of frequency domain using quantum based optimization neural ne...
 
Conv-TasNet.pdf
Conv-TasNet.pdfConv-TasNet.pdf
Conv-TasNet.pdf
 
Scene Text detection in Images-A Deep Learning Survey
 Scene Text detection in Images-A Deep Learning Survey Scene Text detection in Images-A Deep Learning Survey
Scene Text detection in Images-A Deep Learning Survey
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
L046056365
L046056365L046056365
L046056365
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONQUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
 
A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...
A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...
A Summative Comparison of Blind Channel Estimation Techniques for Orthogonal ...
 
Cj32980984
Cj32980984Cj32980984
Cj32980984
 
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueA Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challenges
 
Performance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence DetectionalgoPerformance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence Detectionalgo
 
The Most Important Algorithms
The Most Important AlgorithmsThe Most Important Algorithms
The Most Important Algorithms
 

Implemetation of parallelism in HMM DNN based state of the art kaldi ASR Toolkit

  • 1. Implementation and Optimization of parallelism in HMM-DNN based state of the Kaldi ASR Toolkit BY Shubham
  • 2. Classical automatic speech recognition system (viable locations for parallelization marked with red arrows) Viterbi beam search / A* decoding N-best sentences or word lattice Rescoring FINAL UTTERRENCE Acoustic model generation Sentence model preparation Phonetic utterance models Sentence model 1 2 4 3 5 6Signal acquisition Feature extraction Acoustic modelling
  • 3. Neural networks and Deep learning in ASR  Drawbacks of HMM-GMM models: The conventional HMM-GMM models used for ASR has the following assumptions that proves to be detrimental for many applications: 1. First order Markov Chain assumption: HMM assumes the next state of the system is independent of all the previous states given the current state. This makes capturing long distance semantics tough. 2. Parametric modelling of observations: GMMs are used to model the observations using a mixture of Gaussians. When we make such theoretical assumptions we also get constrained by the limitations and oftentimes the model fails to capture essential statistics of the data. 3. Lack of generalization: Each HMM state uses only a small fraction of the training data. The absence of data sharing among the HMM states causes poor generalization to real world variations. 4. Dimensionality reduction: Oftentimes dimensionality reduction is performed to cope up with the shortage of training data. This causes loss of valuable information and compromised performance. • Use of artificial neural networks: Artificial neural networks have been leveraged in many ways to ameliorate these shortcomings of HMM-GMM systems 1. Alterative for GMMs for creating the acoustic model: ANNs are used to generate a non-parametric posterior distribution over the HMM states that can be normalized to get (scaled) likelihoods of the observations. 2. Efficient dimensionality reduction of feature space: Autoencoders are used in TANDEM neural networks for efficient non-linear dimensionality reduction of input feature space for use with HMM-GMM models 3. Modelling dynamics over time: Recurrent neural networks and Time delay neural networks have been used as an alternative for HMMs for modelling temporal dynamics of the system. • Relevance of Deep learning: Recently several deep learning paradigms have found their ways into ASR due to the following unique set of properties: 1. 
Ability to model highly non-linear functions efficiently 2. Learning of specialized input representations hierarchically 3. Possibility of extensive knowledge and parameter sharing 4. Scope of parallel distributed processing.
  • 4. How ANN is used for ASR • The objective of automatic speech recognition systems can be mathematically expressed as: Where W is a sequence of words and X is the corresponding sequence of observations (the input acoustic signal). • The objective function can be written as: • Now the likelihood of the observations can be written approximately as: Where qt denotes the state of the system at time t and xt denotes the observation at the same instant of time. • We can write the term as: • Artificial neural networks are used for non-parametric modelling of P(qt | xt) and P(qt) is estimated from the given data.
  • 5. Instances of Parallelism
    1. Feature extraction: Mel Frequency Cepstral Coefficients (MFCCs) are the most widely used features for continuous speech recognition. GPUs have been reported to make the extraction of these features 97 times faster.
      • Kou H, Shang W, Lane I, Chong J. Optimized MFCC feature extraction on GPU, ICASSP 2013
    2. Probabilistic modelling of observations: Neural networks are used for dimensionality reduction and probabilistic modelling of likelihoods. Layer-level and data-level parallelism of the neural networks can be achieved using GPUs, which speeds up computations by 10x to 1000x.
      • Hinton et al. Deep neural networks for acoustic modelling in speech recognition, IEEE Signal Processing Magazine 2012
      • Dixon PR, Oonishi T, Furui S. Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition, Computer Speech and Language, Elsevier 2009
    3. Phonetic utterance and language modelling: Learning and inference of neural phonetic and language models can be sped up using GPUs.
      • Boulanger-Lewandowski N et al. Phone sequence modelling with recurrent neural networks, ICASSP 2014
      • Bengio et al. A neural probabilistic language model, JMLR 2003
    4. Decoding of optimal utterance: The most likely utterances are searched using techniques like Viterbi beam search and A* decoding. GPUs achieve a remarkable speedup in these tasks.
      • Langdon et al. Non-recursive beam search on GPU for formal concept analysis, Research note, University of London
      • Zhou Y, Zeng J. Massively parallel A* search on a GPU, Proc. 29th AAAI Conference on Artificial Intelligence
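The data-level parallelism mentioned in item 2 rests on the fact that, in a feed-forward layer, each feature frame is processed independently of the others. The sketch below illustrates only that decomposition: frames are split into chunks and each chunk is handed to a separate worker. It uses CPU threads as a stand-in for GPU threads, so it should not be expected to reproduce the speedups cited above; all names and the toy weights are assumptions made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def layer_forward(frame, weights, biases):
    """Apply one sigmoid layer to a single feature frame."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, frame)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

def forward_chunk(chunk, weights, biases):
    """Process a contiguous chunk of frames sequentially."""
    return [layer_forward(frame, weights, biases) for frame in chunk]

def parallel_forward(frames, weights, biases, num_workers=4):
    """Split the frames into chunks and process the chunks concurrently.
    No chunk needs data from another chunk, which is what makes the
    computation data-parallel."""
    chunk_size = max(1, math.ceil(len(frames) / num_workers))
    chunks = [frames[i:i + chunk_size]
              for i in range(0, len(frames), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(
            lambda chunk: forward_chunk(chunk, weights, biases), chunks))
    # Concatenate per-chunk outputs back into the original frame order.
    return [out for chunk_out in results for out in chunk_out]

# Toy example: 10 two-dimensional frames through a layer with 2 units.
frames = [[0.1 * t, 0.05 * t - 0.2] for t in range(10)]
weights = [[0.5, -0.3], [0.2, 0.8]]
biases = [0.1, -0.1]
activations = parallel_forward(frames, weights, biases)
```

On a GPU the same idea is usually expressed as a single matrix product over a batch of frames rather than as explicit chunks.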
  • 6. KALDI: Toolkit for ASR
    Kaldi is a toolkit for speech recognition written in C++.
    Why Kaldi?
    • Open Source: Extensive development and contribution.
    • Online Decoding: On-the-fly decoding for continuous speech.
    • FST Framework: Integration with OpenFst makes the system lightweight, computationally efficient and suitable for parallel distributed processing.
  • 7. Kaldi Workflow
    • Data and Lexicon Preparation: a) Partition data into training, validation and test sets. b) Dictionary preparation. c) Language model initialization. d) Check the data for consistency.
    • MFCC and CMVN for Datasets: a) Extract MFCC, delta and delta-delta features. b) Compute CMVN (cepstral mean and variance normalization) stats for every speaker.
    • Various Training and Decoding Methods: a) Align the model. b) Train the system using a scheme. c) Prepare a combined WFST (Weighted Finite State Transducer), called the HCLG FST, from the acoustic HMM (H), context information (C), lexicon/pronunciation (L) and grammar/trigram (G).
    • Further Optimization of HCLG FST using training: a) RBM pretraining. b) Fine-tuning using the cross-entropy error criterion. c) sMBR sequence-discriminative training.
    • Result: a) Final decoding on the HCLG FST. b) Generate and store the result.
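The feature step of this workflow (per-speaker CMVN and delta features) can be illustrated with a small sketch. This is not Kaldi's implementation: the function names are hypothetical, the delta is a plain central difference rather than the regression-window formula a toolkit would normally use, and the two-dimensional "features" are toy numbers standing in for MFCC vectors.

```python
import math

def compute_cmvn_stats(utterances):
    """Per-dimension mean and standard deviation over all frames of one
    speaker. `utterances` is a list of utterances, each a list of frames."""
    frames = [frame for utt in utterances for frame in utt]
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dim)]
    # Floor the variance so constant dimensions do not divide by zero.
    stds = [math.sqrt(max(v, 1e-10)) for v in variances]
    return means, stds

def apply_cmvn(utt, means, stds):
    """Normalize each frame to zero mean and unit variance per dimension."""
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in utt]

def add_deltas(utt):
    """Append first-order deltas, computed as a central difference with
    the first and last frames repeated at the edges."""
    last = len(utt) - 1
    out = []
    for t, frame in enumerate(utt):
        prev_frame = utt[max(t - 1, 0)]
        next_frame = utt[min(t + 1, last)]
        delta = [(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)]
        out.append(frame + delta)
    return out

# Toy data: two utterances from the same speaker, two dimensions per frame.
speaker_utts = [[[1.0, 10.0], [3.0, 14.0]],
                [[5.0, 18.0], [7.0, 22.0]]]
means, stds = compute_cmvn_stats(speaker_utts)
normalized = [apply_cmvn(utt, means, stds) for utt in speaker_utts]
features = [add_deltas(utt) for utt in normalized]
```

Computing the statistics per speaker, rather than per utterance or over the whole corpus, is what removes speaker- and channel-specific offsets from the features.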
  • 8. Karel’s Implementation (Stage 0 to Stage 3)
    • Compute fMLLR (feature-space maximum likelihood linear regression) features.
    • Pretrain the DNN using the Deep Belief Network approach.
    • Fine-tune the DNN using the cross-entropy error criterion.
    • Decode the HCLG FST (Finite State Transducer).
    • Generate word lattices and alignments.
    • sMBR sequence-discriminative training using stochastic gradient descent.
    • Six fold cross validation using sMBR sequence-discriminative training.
  • 9. Deliverables
    • Working HTK system
    • Organized MFCC files for TIMIT
    • Working Kaldi system
    • Proposed optimizations in Kaldi
  • 10. Basic concept gathering • Acquaintance with toolkit • Installation of toolkit • State-of-the-art ASR system • Proposal of optimizations
  • 11. Timeline
    • Preliminary concept build-up (2 weeks): 1) Basic concepts of ASR 2) Working with the TIMIT dataset 3) MFCC generation 4) Running HTK for training and decoding
    • Getting Kaldi running (3 weeks): 1) Acquaintance with Kaldi 2) Running scripts for training, decoding and Karel’s algorithm 3) Identify modules for decoding 4) Figure out the segments involved in decoding, forward pass and word lattice
    • In quest of optimization (2 weeks): 1) Point out operational differences between Kaldi and PocketSphinx 2) Hardware and software optimizations in PocketSphinx 3) Thorough theoretical survey of high-performance algorithms such as Viterbi beam search, A*, N-best sentence search and lattice generation
    • Drafting proposal (1 week): 1) Identify possible optimizations in Kaldi 2) Propose optimizations in Kaldi to parallelise the system
  • 12. Progress of Training [Figure: word error rate (y-axis, 0 to 30) for each training method (x-axis): Monophone Training; Delta + Delta-Delta Training; LDA + MLLT Training; LDA + MLLT + SAT Training; Karel's Implementation]
  • 13. Analysis
    ASR Toolkit | Word Error Rate (%)
    HTK | 18.4
    PocketSphinx | 16.2
    Kaldi | 6.6

    Kaldi | PocketSphinx
    HMM-DNN based system | HMM-GMM based system
    Uses a complex math library (OpenBLAS) | Can use a simple math library such as Eigen
    Works only on hardware that supports floating-point operations | Can work on fixed-point architectures
    Model represented as a Weighted Finite State Transducer (WFST) | Model represented as a tree structure
    Better accuracy (WER = 6.6%) | Lower accuracy (WER = 16.2%)
    Capable of working on complex hardware | Capable of working on simple embedded systems
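For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition; the function name and the example sentences are made up, and this is not the scoring tool used to produce the numbers above.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: 100 * (S + D + I) / N, where N is the number
    of words in the reference and S, D, I are the substitutions, deletions
    and insertions in a minimum-cost word alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.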
  • 14. Advanced Decoding Algorithms
    • Drawbacks of Viterbi decoding:
      a) Biased towards short sentences.
      b) Predicts only the best path, which rules out iterative decoding.
      c) Fails for language models more complex than a bigram.
      d) Biased towards words with fewer pronunciation variations.
    • N-best sequence of states:
      a) Predicts a set of N-best sentence hypotheses.
      b) Iterative decoding is difficult to implement.
    • The system works using the word lattice algorithm:
      a) Iterative decoding is possible, and multiple decoding algorithms can be used.
      b) The output of the previous iteration constrains the word lattice of the next iteration.
      c) A forward pass with pruning, in conjunction with the Viterbi algorithm, generates the word lattice.
      d) The word lattice is generated after this early pass.
      e) The generated word lattice is rescored using more sophisticated techniques.
      f) The decoding schemes used for fine-tuning are A* and N-best sentence search.
      g) Balances the trade-off between space and time.
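To make the "predicts only the best path" point concrete, the following is a minimal sketch of Viterbi decoding with beam pruning over a toy HMM. It works in the log domain and keeps, at each frame, only the hypotheses whose score lies within `beam` of the current best. The function names, the beam value and the two-state model are illustrative assumptions; a real decoder such as Kaldi's operates on an HCLG graph rather than a dense transition table.

```python
import math

def prune(hypotheses, beam):
    """Keep only hypotheses whose log score is within `beam` of the best."""
    best = max(score for score, _ in hypotheses.values())
    return {state: hyp for state, hyp in hypotheses.items()
            if hyp[0] >= best - beam}

def viterbi_beam_search(observations, states, log_init, log_trans,
                        log_emit, beam=10.0):
    """Return (best_path, best_log_score). Only one path per state is
    kept at every frame, so the output is a single best state sequence."""
    active = {s: (log_init[s] + log_emit[s][observations[0]], [s])
              for s in states}
    active = prune(active, beam)
    for obs in observations[1:]:
        expanded = {}
        for prev_state, (prev_score, path) in active.items():
            for s in states:
                score = prev_score + log_trans[prev_state][s] + log_emit[s][obs]
                # Viterbi recombination: keep the best path into each state.
                if s not in expanded or score > expanded[s][0]:
                    expanded[s] = (score, path + [s])
        active = prune(expanded, beam)
    best_score, best_path = max(active.values(), key=lambda hyp: hyp[0])
    return best_path, best_score

def to_log(table):
    """Convert a dict (or dict of dicts) of probabilities to log values."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Toy two-state model with three observation symbols.
states = ["Healthy", "Fever"]
log_init = to_log({"Healthy": 0.6, "Fever": 0.4})
log_trans = to_log({"Healthy": {"Healthy": 0.7, "Fever": 0.3},
                    "Fever": {"Healthy": 0.4, "Fever": 0.6}})
log_emit = to_log({"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
                   "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}})
path, log_score = viterbi_beam_search(["normal", "cold", "dizzy"], states,
                                      log_init, log_trans, log_emit)
```

Retaining the pruned alternatives instead of discarding them is, in essence, what N-best and lattice-generating decoders add on top of this procedure.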