This work compares the performance of various acoustic feature extraction methods using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. Acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or sub-word units such as phonemes. First, linear predictive coding (LPC) is used as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), are then also used; these two methods closely resemble the human auditory system. The feature vectors are used to train the LSTM neural network, and the resulting models of different phonemes are compared using statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
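The comparison step described above can be sketched in plain numpy by modelling each phoneme's feature vectors as a multivariate Gaussian and measuring how far apart two phoneme models lie. This is an illustrative sketch, not the paper's implementation; the function names and the shared-covariance assumption in the Mahalanobis case are ours.

```python
import numpy as np

def mahalanobis(mu1, mu2, cov):
    """Mahalanobis distance between two means under a shared covariance."""
    diff = mu1 - mu2
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean-separation term plus covariance-mismatch term.
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)
```

Two identical phoneme models give distance zero under both measures; growing values indicate increasingly dissimilar feature distributions.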
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is used for experimentation. We developed two parallel text-based and audio-based models and then fused these heterogeneous feature maps, taken from intermediate layers, to complete the architecture. The performance measures accuracy, precision, recall, and F1-score are observed to outperform the existing models.
Convolutional Neural Network and Feature Transformation for Distant Speech Re...IJECEIAES
In many applications, speech recognition must operate in conditions where there is some distance between the speakers and the microphones. This is called distant speech recognition (DSR). In this condition, speech recognition must deal with reverberation. Nowadays, deep learning technologies are becoming the main technologies for speech recognition; a deep neural network (DNN) in hybrid with a hidden Markov model (HMM) is the commonly used architecture. However, this system is still not robust against reverberation. Previous studies use convolutional neural networks (CNN), a variation of neural network, to improve the robustness of speech recognition against noise. CNN has a pooling property that finds local correlation between neighboring dimensions in the features; with this property, CNN can be used for feature learning that emphasizes information in neighboring frames. In this study we use CNN to deal with reverberation. We also propose to use feature transformation techniques, linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT), on Mel frequency cepstral coefficients (MFCC) before feeding them to the CNN. We argue that transforming features can produce more discriminative features for the CNN, and hence improve the robustness of speech recognition against reverberation. Our evaluations on the Meeting Recorder Digits (MRD) subset of the Aurora-5 database confirm that the LDA and MLLT transformations improve robustness, giving a 20% relative error reduction compared to a standard DNN-based speech recognition system using the same number of hidden layers.
Deep convolutional neural networks-based features for Indonesian large vocabu...IAESIJAI
There is great interest in developing speech recognition using deep learning technologies due to their capability to model the complexity of pronunciations, syntax, and language rules of speech data better than the traditional hidden Markov model (HMM) does. However, large amounts of data are necessary for deep learning-based speech recognition to be effective. While this is not a problem for mainstream languages such as English or Chinese, it is for non-mainstream languages such as Indonesian. To overcome this limitation, we present deep features based on convolutional neural networks (CNN) for Indonesian large vocabulary continuous speech recognition (LVCSR) in this paper. The CNN is trained discriminatively, which differs from usual deep learning implementations where the networks are trained generatively. Our evaluations on Indonesian speech data show that the proposed method achieves 7.26% and 9.01% error reduction rates over the state-of-the-art deep belief network-deep neural network (DBN-DNN) for LVCSR, with Mel frequency cepstral coefficients (MFCC) and filterbank (FBANK) features, respectively. An error reduction rate of 6.13% is achieved compared to a CNN-DNN with generative training.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information in their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique, performing gender recognition before speaker recognition, to improve the accuracy and timeliness of the baseline speaker recognition system. Based on experiments conducted on the MIT database, we demonstrate that the proposed system improves accuracy over the baseline by approximately 2% while reducing computation time by more than 30%.
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decom...BIJIAM Journal
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied for the extraction of emotional features from speech, and a deep neural network (DNN) is used to classify speech emotions. The method enhances the emotional components in speech signals by combining EMD with the acoustic feature Mel-scale frequency cepstral coefficients (MFCCs) to improve the recognition rates of the DNN classifier. EMD is first used to decompose the speech signals, which contain emotional components, into multiple intrinsic mode functions (IMFs); emotional features are then derived from the IMFs and calculated using MFCC. The emotional features are used to train the DNN model, and the trained model is finally used to identify emotions in speech. Experimental results reveal that the proposed method is effective.
Intelligent Arabic letters speech recognition system based on mel frequency c...IJECEIAES
Speech recognition is one of the important applications of artificial intelligence (AI). It aims to recognize spoken words regardless of who is speaking. The process involves extracting meaningful features from spoken words and then classifying these features into their classes. This paper presents a neural network classification system for Arabic letters and studies the effect of changing the multi-layer perceptron (MLP) artificial neural network (ANN) properties to obtain optimized performance. The proposed system consists of two main stages: first, the recorded spoken letters are transformed from the time domain into the frequency domain using the fast Fourier transform (FFT), and features are extracted using Mel frequency cepstral coefficients (MFCC); second, the extracted features are classified using an MLP ANN with the back-propagation (BP) learning algorithm. The results show that the proposed system, with the extracted features, can classify Arabic spoken letters using two hidden layers with an accuracy of around 86%.
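The FFT-plus-MFCC front end that this and several of the abstracts above rely on centers on a triangular mel-spaced filterbank applied to the FFT power spectrum. Below is a generic textbook construction in numpy, not the code of any cited paper; the function names and parameters are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filters to apply to an FFT power spectrum."""
    # Filter edges equally spaced on the mel scale, then mapped back to Hz.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb
```

MFCCs are then obtained by taking the log of the filterbank energies and applying a DCT; the lowest coefficients form the feature vector.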
Comparative Study of Different Techniques in Speaker Recognition: ReviewIJAEMSJORNAL
Speech is the most basic and essential method of communication used by people. A speaker can be recognized on the basis of the individual information included in the speech signal. Speaker recognition (SR) is useful for identifying the person who is speaking and has in recent years been used in security systems. In this paper we discuss feature extraction techniques such as Mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC), and dynamic time warping (DTW), and, for classification, Gaussian mixture models (GMM), artificial neural networks (ANN), and support vector machines (SVM).
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...kevig
This study investigates the effectiveness of knowledge named entity recognition in Online Judges (OJs). OJs lack topic classification and are limited to problem IDs only, so a lot of time is consumed in finding programming problems, and more specifically knowledge entities. A Bidirectional Long Short-Term Memory (BiLSTM) with Conditional Random Fields (CRF) model is applied to recognize the knowledge named entities found in solution reports. For the test run, more than 2000 solution reports were crawled from the Online Judges and processed for the model output. The stability of the model is also assessed with the higher F1 value. The results obtained through the proposed BiLSTM-CRF model are more effective (F1: 98.96%) and efficient in lead time.
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...TELKOMNIKA JOURNAL
Sundanese is one of the popular languages in Indonesia, which makes research on it essential and is the reason this study was made. The vital parts for achieving high recognition accuracy are feature extraction and the classifier; the main goal of this study was to analyze the first one. Three types of feature extraction were tested: linear predictive coding (LPC), Mel frequency cepstral coefficients (MFCC), and human factor cepstral coefficients (HFCC). The results of the three feature extraction methods became the input of the classifier, for which the study applied hidden Markov models. Before classification, quantization was needed; in this study it was based on clustering. Each result was compared against the number of clusters and hidden states used. The dataset came from four people who each spoke the digits zero to nine 60 times. Finally, all feature extraction methods produced the same performance for the corpus used.
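The clustering-based quantization mentioned above (mapping continuous feature vectors to a discrete codebook before HMM training) can be sketched with a plain k-means in numpy. This is a generic illustration under our own assumptions, not the study's implementation; `vq_codebook` is a hypothetical helper.

```python
import numpy as np

def vq_codebook(features, k, iters=20, seed=0):
    """Cluster feature vectors into a k-entry codebook (plain k-means)."""
    rng = np.random.default_rng(seed)
    # Initialize the codebook with k distinct feature vectors.
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codebook entry.
        d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each codebook entry to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook, labels
```

Each frame's feature vector is then replaced by the index of its nearest codebook entry, which is the discrete observation symbol a standard HMM expects.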
Comparison and Analysis Of LDM and LMS for an Application of a SpeechCSCJournals
Most automatic speech recognition (ASR) systems are based on Gaussian mixture models, whose output depends on subphone states. We often measure and transform the speech signal into another form to enhance our ability to communicate. Speech recognition is the conversion of an acoustic waveform into the equivalent written message. The nature of the speech recognition problem depends heavily on the constraints placed on the speaker, the speaking situation, and the message context. Various speech recognition systems are available; the system that detects the hidden conditions of speech is the best model. LMS is a simple algorithm used to reconstruct speech, and the linear dynamic model (LDM) is used to recognize speech in noisy environments. This paper analyzes and compares the LDM and a simple LMS algorithm for speech recognition purposes.
Wavelet Based Noise Robust Features for Speaker RecognitionCSCJournals
Extraction and selection of the best parametric representation of the acoustic signal is the most important task in designing any speaker recognition system. A wide range of possibilities exists for parametrically representing the speech signal, such as linear prediction coding (LPC), Mel frequency cepstrum coefficients (MFCC), and others. MFCC are currently the most popular choice for speaker recognition systems, though one of their shortcomings is that the signal is assumed to be stationary within the given time frame; MFCC are therefore unable to analyze non-stationary signals and are not suitable for noisy speech. To overcome this problem, several researchers have used different types of AM-FM modulation/demodulation techniques for extracting features from the speech signal, and some approaches use wavelet filterbanks. In this paper a technique for extracting features by combining the above approaches is proposed: features are extracted from the envelope of the signal and then passed through a wavelet filterbank. The proposed method is found to outperform existing feature extraction techniques.
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The efficiency of identification approaches 91.3% using single test file and 100% using two test files. The method shows comparable efficiency results with the well known MFCC method with an advantage of being faster in both modeling and identification.
Speech emotion recognition with light gradient boosting decision trees machineIJECEIAES
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
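The class-weight regulation described above (weights inversely proportional to class sample counts) is a small computation that can be sketched directly; this is a common convention, not necessarily the exact formula used in the paper, and the function name is illustrative.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class sample counts,
    normalized so a perfectly balanced dataset gives weight 1.0 per class."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))
```

A minority class with few samples receives a weight above 1.0, so its misclassifications cost more during gradient-boosting training, counteracting the imbalance.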
Bayesian distance metric learning and its application in automatic speaker re...IJECEIAES
This paper proposes a state-of-the-art automatic speaker recognition (ASR) system based on a Bayesian distance learning metric as a feature extractor. In this modeling, I explored the constraints on the distance between modified and simplified i-vector pairs from the same speaker and from different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. Given a speaker tag, I select the data pairs of different speakers with the highest cosine score to form a set of speaker constraints. This collection captures the most discriminating variability between the speakers in the training data. The Bayesian distance learning approach achieves better performance than the most advanced methods and, compared with cosine scoring, is insensitive to normalization. The method is very effective when training data are limited. The modified supervised i-vector based ASR system is evaluated on the NIST SRE 2008 database. The best combined cosine-score performance, EER 1.767%, was obtained using LDA200 + NCA200 + LDA200, and the best Bayes_dml performance, EER 1.775%, was obtained using LDA200 + NCA200 + LDA100. Bayes_dml outperforms the combined normalized cosine scores and is the best reported result for the short2-short3 condition on NIST SRE 2008 data.
This paper reports on an audio-visual client recognition system, implemented in Matlab, that identifies five clients and can be extended to identify more, depending on the number of clients it is trained on. The visual recognition part was implemented using principal component analysis, linear discriminant analysis, and a nearest neighbour classifier. The audio recognition part was implemented using Mel-frequency cepstrum coefficients, linear discriminant analysis, and a nearest neighbour classifier. The system was tested with images and sounds it had not been trained on, to see whether it could detect an intruder, and it responded to intruders accurately.
Attention gated encoder-decoder for ultrasonic signal denoisingIAESIJAI
Ultrasound imaging is one of the most widely used non-destructive testing methods. The transducer emits pulses that travel through the imaged samples and are reflected by echo-forming impedance. The resulting ultrasonic signals usually contain noise. Most of the traditional noise reduction algorithms require high skills and prior knowledge of noise distribution, which has a crucial impact on their performances. As a result, these methods generally yield a loss of information, significantly influencing the final data and deeply limiting both sensitivity and resolution of imaging devices in medical and industrial applications. In the present study, a denoising method based on an attention-gated convolutional autoencoder is proposed to fill this gap. To evaluate its performance, the suggested protocol is compared to widely used methods such as Butterworth filtering (BF), discrete wavelet transforms (DWT), principal component analysis (PCA), and convolutional autoencoder (CAE) methods. Results proved that better denoising can be achieved, especially when the original signal-to-noise ratio (SNR) is very low and the sound waves' traces are distorted by noise. Moreover, the initial SNR was improved by up to 30 dB and the resulting Pearson correlation coefficient was maintained over 99% even for ultrasonic signals with poor initial SNR.
Effect of Time Derivatives of MFCC Features on HMM Based Speech Recognition S...IDES Editor
In this paper, improvement of an ASR system for
Hindi language, based on Vector quantized MFCC as feature
vectors and HMM as classifier, is discussed. MFCC features
are usually pre-processed before being used for recognition.
One of these pre-processing is to create delta and delta-delta
coefficients and append them to MFCC to create feature vector.
This paper focuses on all digits in Hindi (Zero to Nine), which
is based on isolated word structure. Performance of the system
is evaluated by accurate Recognition Rate (RR). The effect of
the combination of the Delta MFCC (DMFCC) feature along
with the Delta-Delta MFCC (DDMFCC) feature shows
approximately 2.5% further improvement in the RR, with no
additional computational costs involved. RR of the system for
the speakers involved in the training phase is found to give
better recognition accuracy than that for the speakers who
were not involved in the training phase. Word wise RR is
observed to be good in some digits with distinct phones.
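The delta and delta-delta coefficients discussed above are typically computed with a linear regression over a few neighboring frames. A minimal numpy sketch of the standard formula (not the paper's code; the window size `N=2` is a common default, assumed here):

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta coefficients over a window of +/- N frames.
    features: (T, D) array of per-frame feature vectors."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(len(features)):
        # Weighted difference of frames ahead and behind frame t.
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out
```

The final feature vector is the concatenation of the statics with their first and second derivatives, e.g. `np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])` for a `(T, 13)` MFCC array, tripling the dimension at negligible extra cost.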
Mfcc based enlargement of the training set for emotion recognition in speechsipij
Emotional state recognition through speech is a very interesting research topic nowadays. Using subliminal information of speech, known as "prosody", it is possible to recognize the emotional state of a person. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns; this makes the learning process more difficult, due to the generalization problems that arise under these conditions. In this work we propose a solution to this problem: enlarging the training set through the creation of new virtual patterns. In the case of emotional speech, most of the emotional information is carried by speed and pitch variations, so a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We use this prior information to create new patterns by applying a gender-dependent pitch-shift modification in the feature extraction process of the classification system. For this purpose, we propose a gender-dependent frequency scaling modification of the Mel frequency cepstral coefficients used to classify the emotion. This process allows us to synthetically increase the number of available patterns in the training set, increasing the generalization capability of the system and reducing the test error. Results obtained with two different classifiers with different degrees of generalization capability demonstrate the suitability of the proposal.
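The frequency-scaling idea above can be approximated by resampling the power spectrum along the frequency axis before mel filtering, so each warping factor yields an extra virtual training pattern. A minimal sketch under that assumption; the function name and example factors are illustrative, not the paper's.

```python
import numpy as np

def scale_spectrum(power_spec, alpha):
    """Resample a power spectrum along frequency by factor alpha,
    approximating a pitch shift before mel filtering (alpha > 1 shifts up)."""
    n = len(power_spec)
    src = np.arange(n) / alpha          # read positions on the original axis
    # Linear interpolation; bins beyond the original range are zeroed.
    return np.interp(src, np.arange(n), power_spec, right=0.0)

# Hypothetical gender-dependent warp ranges: each utterance produces one
# virtual pattern per factor, multiplying the training set size.
alphas = [0.95, 1.0, 1.05]
```

Because the warp is applied inside feature extraction, the duration and pitch contour shape of the utterance are untouched, consistent with the claim that the expressed emotion is preserved.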
Investigation of the performance of multi-input multi-output detectors based...IJECEIAES
The next generation of wireless cellular communication networks must be energy efficient, extremely reliable, and have low latency, leading to the necessity of using algorithms based on deep neural networks (DNN) which have better bit error rate (BER) or symbol error rate (SER) performance than traditional complex multi-antenna or multi-input multi-output (MIMO) detectors. This paper examines deep neural networks and deep iterative detectors such as OAMP-Net based on information theory criteria such as maximum correntropy criterion (MCC) for the implementation of MIMO detectors in non-Gaussian environments, and the results illustrate that the proposed method has better BER or SER performance.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...IJERA Editor
This paper presents compressive sensing technique used for speech reconstruction using linear predictive coding because the
speech is more sparse in LPC. DCT of a speech is taken and the DCT points of sparse speech are thrown away arbitrarily.
This is achieved by making some point in DCT domain to be zero by multiplying with mask functions. From the incomplete
points in DCT domain, the original speech is reconstructed using compressive sensing and the tool used is Gradient
Projection for Sparse Reconstruction. The performance of the result is compared with direct IDCT subjectively. The
experiment is done and it is observed that the performance is better for compressive sensing than the DCT.
More Related Content
Similar to QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
Comparison and Analysis Of LDM and LMS for an Application of a SpeechCSCJournals
Most of the automatic speech recognition (ASR) systems are based on Guassian Mixtures model. The output of these models depends on subphone states. We often measure and transform the speech signal in another form to enhance our ability to communicate. Speech recognition is the conversion from acoustic waveform into written equivalent message information. The nature of speech recognition problem is heavily dependent upon the constraints placed on the speaker, speaking situation and message context. Various speech recognition systems are available. The system which detects the hidden conditions of speech is the best model. LMS is one of the simple algorithm used to reconstruct the speech and linear dynamic model is also used to recognize the speech in noisy atmosphere..This paper is analysis and comparison between the LDM and a simple LMS algorithm which can be used for speech recognition purpose.
Wavelet Based Noise Robust Features for Speaker RecognitionCSCJournals
Extraction and selection of the best parametric representation of acoustic signal is the most important task in designing any speaker recognition system. A wide range of possibilities exists for parametrically representing the speech signal such as Linear Prediction Coding (LPC) ,Mel frequency Cepstrum coefficients (MFCC) and others. MFCC are currently the most popular choice for any speaker recognition system, though one of the shortcomings of MFCC is that the signal is assumed to be stationary within the given time frame and is therefore unable to analyze the non-stationary signal. Therefore it is not suitable for noisy speech signals. To overcome this problem several researchers used different types of AM-FM modulation/demodulation techniques for extracting features from speech signal. In some approaches it is proposed to use the wavelet filterbanks for extracting the features. In this paper a technique for extracting the features by combining the above mentioned approaches is proposed. Features are extracted from the envelope of the signal and then passed through wavelet filterbank. It is found that the proposed method outperforms the existing feature extraction techniques.
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The efficiency of identification approaches 91.3% using single test file and 100% using two test files. The method shows comparable efficiency results with the well known MFCC method with an advantage of being faster in both modeling and identification.
Speech emotion recognition with light gradient boosting decision trees machineIJECEIAES
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
Bayesian distance metric learning and its application in automatic speaker re...IJECEIAES
This paper proposes state-of the-art Automatic Speaker Recognition System (ASR) based on Bayesian Distance Learning Metric as a feature extractor. In this modeling, I explored the constraints of the distance between modified and simplified i-vector pairs by the same speaker and different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. Given a speaker tag, I select the data pair of the different speakers with the highest cosine score to form a set of speaker constraints. This collection captures the most discriminating variability between the speakers in the training data. This Bayesian distance learning approach achieves better performance than the most advanced methods. Furthermore, this method is insensitive to normalization compared to cosine scores. This method is very effective in the case of limited training data. The modified supervised i-vector based ASR system is evaluated on the NIST SRE 2008 database. The best performance of the combined cosine score EER 1.767% obtained using LDA200 + NCA200 + LDA200, and the best performance of Bayes_dml EER 1.775% obtained using LDA200 + NCA200 + LDA100. Bayesian_dml overcomes the combined norm of cosine scores and is the best result of the short2-short3 condition report for NIST SRE 2008 data.
This paper contains a report on an Audio-Visual Client Recognition System using Matlab software which identifies five clients and can be improved to identify as many clients as possible depending on the number of clients it is trained to identify which was successfully implemented. The implementation was accomplished first by visual recognition system implemented using The Principal Component Analysis, Linear Discriminant Analysis and Nearest Neighbour Classifier. A successful implementation of second part was achieved by audio recognition using Mel-Frequency Cepstrum Coefficient, Linear Discriminant Analysis and Nearest Neighbour Classifier the system was tested using images and sounds that have not been trained to the system to see whether it can detect an intruder which lead us to a very successful result with précised response to intruder.
Attention gated encoder-decoder for ultrasonic signal denoisingIAESIJAI
Ultrasound imaging is one of the most widely used non-destructive testing methods. The transducer emits pulses that travel through the imaged samples and are reflected by echo-forming impedance. The resulting ultrasonic signals usually contain noise. Most of the traditional noise reduction algorithms require high skills and prior knowledge of noise distribution, which has a crucial impact on their performances. As a result, these methods generally yield a loss of information, significantly influencing the final data and deeply limiting both sensitivity and resolution of imaging devices in medical and industrial applications. In the present study, a denoising method based on an attention-gated convolutional autoencoder is proposed to fill this gap. To evaluate its performance, the suggested protocol is compared to widely used methods such as butterworth filtering (BF), discrete wavelet transforms (DWT), principal component analysis (PCA), and convolutional autoencoder (CAE) methods. Results proved that better denoising can be achieved especially when the original signal-to-noise ratio (SNR) is very low and the sound waves’ traces are distorted by noise. Moreover, the initial SNR was improved by up to 30 dB and the resulting Pearson correlation coefficient was maintained over 99% even for ultrasonic signals with poor initial SNR.
Effect of Time Derivatives of MFCC Features on HMM Based Speech Recognition S...IDES Editor
In this paper, improvement of an ASR system for
Hindi language, based on Vector quantized MFCC as feature
vectors and HMM as classifier, is discussed. MFCC features
are usually pre-processed before being used for recognition.
One of these pre-processing is to create delta and delta-delta
coefficients and append them to MFCC to create feature vector.
This paper focuses on all digits in Hindi (Zero to Nine), which
is based on isolated word structure. Performance of the system
is evaluated by accurate Recognition Rate (RR). The effect of
the combination of the Delta MFCC (DMFCC) feature along
with the Delta-Delta MFCC (DDMFCC) feature shows
approximately 2.5% further improvement in the RR, with no
additional computational costs involved. RR of the system for
the speakers involved in the training phase is found to give
better recognition accuracy than that for the speakers who
were not involved in the training phase. Word wise RR is
observed to be good in some digits with distinct phones.
Mfcc based enlargement of the training set for emotion recognition in speechsipij
Emotional state recognition through speech is being a very interesting research topic nowadays. Using
subliminal information of speech, denominated as “prosody”, it is possible to recognize the emotional state
of the person. One of the main problems in the design of automatic emotion recognition systems is the small
number of available patterns. This fact makes the learning process more difficult, due to the generalization
problems that arise under these conditions.
In this work we propose a solution to this problem consisting in enlarging the training set through the
creation the new virtual patterns. In the case of emotional speech, most of the emotional information is
included in speed and pitch variations. So, a change in the average pitch that does not modify neither the
speed nor the pitch variations does not affect the expressed emotion. Thus, we use this prior information in
order to create new patterns applying a gender dependent pitch shift modification in the feature extraction
process of the classification system. For this purpose, we propose a frequency scaling modification of the
Mel Frequency Cepstral Coefficients, used to classify the emotion. For this purpose, we propose a gender
dependent frequency scaling modification. This proposed process allows us to synthetically increase the
number of available patterns in the training set, thus increasing the generalization capability of the system
and reducing the test error. Results carried out with two different classifiers with different degree of
generalization capability demonstrate the suitability of the proposal.
Investigation of the performance of multi-input multi-output detectors based...IJECEIAES
The next generation of wireless cellular communication networks must be energy efficient, extremely reliable, and have low latency, leading to the necessity of using algorithms based on deep neural networks (DNN) which have better bit error rate (BER) or symbol error rate (SER) performance than traditional complex multi-antenna or multi-input multi-output (MIMO) detectors. This paper examines deep neural networks and deep iterative detectors such as OAMP-Net based on information theory criteria such as maximum correntropy criterion (MCC) for the implementation of MIMO detectors in non-Gaussian environments, and the results illustrate that the proposed method has better BER or SER performance.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...IJERA Editor
This paper presents compressive sensing technique used for speech reconstruction using linear predictive coding because the
speech is more sparse in LPC. DCT of a speech is taken and the DCT points of sparse speech are thrown away arbitrarily.
This is achieved by making some point in DCT domain to be zero by multiplying with mask functions. From the incomplete
points in DCT domain, the original speech is reconstructed using compressive sensing and the tool used is Gradient
Projection for Sparse Reconstruction. The performance of the result is compared with direct IDCT subjectively. The
experiment is done and it is observed that the performance is better for compressive sensing than the DCT.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Similar to QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (20)
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
The International Journal of Multimedia & Its Applications (IJMA) Vol. 12, No. 5, October 2020
DOI: 10.5121/ijma.2020.12501

QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION

Nahyan Al Mahmud¹ and Shahfida Amjad Munni²
¹ Department of Electrical and Electronic Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
² Cygnus Innovation Limited, Dhaka, Bangladesh
ABSTRACT
The performance of various acoustic feature extraction methods is compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. Acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), are then applied; these two methods closely model the human auditory system. The feature vectors are used to train the LSTM neural network, and the resulting models of different phonemes are compared using two statistical tools, the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
KEYWORDS
LSTM, Perceptual linear prediction, Mel frequency cepstral coefficients, Bhattacharyya Distance,
Mahalanobis Distance.
1. INTRODUCTION
Speech is the most effective way of communication among people and the most natural way of conveying information. Therefore, studies in a number of fields aim to let people interact vocally with computers: to simulate the human ability to talk, to have computers carry out simple tasks through machine-human interaction, and to turn speech into text through Automatic Speech Recognition (ASR) systems. Recently, signal processing and detection have been applied in a variety of fields such as human activity tracking and recognition [1,2], computer engineering, the physical sciences, health-related applications [3], and natural science and industrial applications [4].
In recent years, neural network technology has significantly enhanced the precision of ASR systems, and many applications are being developed for smartphones and intelligent personal assistants. Much research on end-to-end speech recognition is being conducted to replace the hidden Markov model (HMM) based techniques that have been used for many years. End-to-end models with neural networks include connectionist temporal classification (CTC)-trained recurrent neural networks (RNN), encoder-decoder architectures [5], and RNN transducers [6,7]. Although HMM-based algorithms can be considered computationally efficient, they require many uneven memory accesses and a large memory footprint. On the contrary, RNN-based ASR has the advantage of a low memory footprint; however, it demands many computationally expensive arithmetic operations for real-time inference.
A great deal of work has been done on ASR for the major languages of the world. Regrettably, only a handful of studies have been carried out for Bangla, which is among the most widely spoken languages in the world in terms of number of speakers. Some of these efforts can be found in [8]. However, the majority of these studies focused on simple word-level detection and worked on very small databases. These works also did not account for the various dialects of different parts of the country. The main reason for this is the lack of a proper speech database.
The primary target of this study is to examine the efficiency of various acoustic vectors for Bangla speech detection using an LSTM neural network, and to assess their performance with different statistical measures.
2. DATA PREPARATION
2.1. Bangla Speech Database
At present, the main difficulty in experimenting on Bangla ASR is the absence of a proper Bangla speech database. Existing databases are either not large enough or do not consider the colloquial variations among speakers. The majority of them are also not publicly available, which is a massive hindrance to conducting research. Another difficulty is that a database must be properly segmented and labelled to be usable for supervised learning. To overcome this problem, we had to come up with our own solution [9].
The samples were first divided into two groups: training and testing. The training corpus was prepared by 50 male and 50 female speakers, and a separate set of 20 male and female speakers voiced the test samples. 200 common Bangla sentences were chosen for this database.
A medium-sized room with styrofoam dampers was used to record the samples. A dynamic unidirectional microphone was used to collect the training samples, while a moderate-grade smartphone was used for the testing samples in order to mimic a real-world scenario.
2.2. Acoustic Feature Vectors
Linear predictive coding (LPC) was first introduced in 1966 and is widely used in audio signal processing, speech processing and telephony. LPC embodies the spectral envelope of speech in compressed form, using a linear predictive methodology. “These techniques can thus serve as a tool for modifying the acoustic properties of a given speech signal without degrading the speech quality. Some other potential applications of these techniques are in the areas of efficient storage and transmission of speech, automatic formant and pitch extraction, and speaker and speech recognition” [10].
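To illustrate the idea behind LPC, the predictor coefficients of a single frame can be estimated with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal sketch of the general technique, not the toolkit used in this work; the function name and the synthetic test signal are our own illustrative choices.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate LPC coefficients a[0..order] (a[0] = 1) of one frame with the
    autocorrelation method and the Levinson-Durbin recursion.
    Convention: x[n] is predicted as -(a[1] x[n-1] + ... + a[p] x[n-p])."""
    # Autocorrelation of the frame for lags 0..order
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                       # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]   # update lower-order coefficients
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                   # prediction error shrinks each order
    return a

# Recover the coefficients of a synthetic second-order autoregressive signal
rng = np.random.default_rng(0)
x = np.zeros(8192)
e = rng.standard_normal(8192)
for n in range(2, 8192):
    x[n] = 1.3 * x[n - 1] - 0.4 * x[n - 2] + e[n]
a = lpc_coefficients(x, 2)   # expect a close to [1, -1.3, 0.4]
```

In a full front end, such coefficients (or cepstra derived from them) would be computed per windowed frame and stacked into the feature vector sequence.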
The most efficient acoustic feature extraction methods are mainly based on models of the human hearing system. The human auditory system has some limiting features, such as a nonlinear frequency scale, spectral amplitude compression, and reduced sensitivity at lower frequencies. Mel Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Prediction (PLP) adopt the behaviour of the human auditory system to some extent. MFCC was first introduced in [11]. PLP, introduced in [12], is based on ideas similar to MFCC, but its computational steps are somewhat more elaborate and thus produce marginally better results, which is illustrated in Figure 1.
Figure 1. The steps of PLP and MFCC feature vector extraction.
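The MFCC branch of the pipeline in Figure 1 can be sketched roughly as follows. The frame length, hop size and filterbank sizes below are common illustrative defaults, not the exact settings used in this work, and the DCT is left unnormalized for brevity.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC pipeline: pre-emphasis, framing, Hamming window,
    power spectrum, triangular mel filterbank, log compression, DCT-II."""
    # 1. Pre-emphasis boosts the high-frequency part of the spectrum
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Overlapping frames, each multiplied by a Hamming window
    n_frames = 1 + (len(sig) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(n_fft)
    # 3. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log of the mel filterbank energies (spectral amplitude compression)
    feats = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II decorrelates the energies; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return feats @ dct.T

coeffs = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

The PLP branch replaces steps 4 to 6 with a Bark-scale filterbank, equal-loudness pre-emphasis, cubic-root intensity compression and an all-pole (LPC-style) model, which accounts for its extra computational cost.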
3. EXPERIMENTAL SETUP
3.1. LSTM Neural Network Structure
LSTM and, in general, RNN based ASR systems [13,14,15] trained with CTC [16] have recently been shown to work extremely well when there is an abundance of training data, matching and exceeding the performance of hybrid DNN systems [15]. In a conventional DNN, if the time span is long, the portion of the gradient vector that survives being looped back diminishes, which inherently breaks down the network-updating scheme [17]. LSTM is naturally immune to this problem, as it can employ its comparatively long-term memory structure [5].
LSTM and RNN are very prominent in sequence prediction and labelling; at the same time, LSTM has proved to perform better than RNN and DNN in this context [18]. Recently, following the successful application of DNN to acoustic modelling [19,20,21] and speech enhancement [22], a CTC output layer is imposed on the acoustic model [23,24] in conjunction with a deep LSTM network. Figure 2 shows the LSTM speech recognition block.
Figure 2. LSTM speech detection block
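The CTC criterion mentioned above scores a label sequence by summing the probabilities of all frame-level alignments (with blanks and repeats) that collapse to it. A minimal sketch of the forward, or alpha, recursion, for illustration only:

```python
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """Probability of `labels` given per-frame softmax outputs `probs`
    (shape T x C), computed with the CTC forward (alpha) recursion."""
    # Interleave blanks: l' = [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                 # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]        # advance by one symbol
            # Skipping a blank is allowed only between distinct labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

# Two frames, classes (blank, "a", "b"): the paths aa, a-, -a all collapse to "a"
probs = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1]])
p = ctc_forward(probs, [1])
```

Training maximizes the log of this quantity; production implementations work in log space for numerical stability.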
A standard LSTM neural network structure consists of an input layer, a recurrent LSTM layer and an output layer. In this study, the input layer is a Time Delay Neural Network (TDNN) layer, which is connected to the hidden layer.
The LSTM structure proposed in [23] has been used in this work. As this method is based on CTC training, it yields significant performance gains in speech detection tasks, and the technique has been adopted by Google. The typical LSTM cell structure is illustrated in Figure 3. The LSTM cell consists of three gates: input, forget, and output, which control, respectively, what fraction of the input is passed to the "memory" cell, what fraction of the stored cell memory is retained, and what fraction of the cell memory is output.
4. The International Journal of Multimedia & Its Applications (IJMA) Vol.12, No. 5, October 2020
4
Figure 3. LSTM cell model
Equations (1)-(5) are vector formulas that describe the LSTM cell.
Typically, LSTM parameters are initialized to small random values in an interval. For the forget gate, however, this is a suboptimal choice: small weights effectively close the gate, preventing the cell memory and its gradients from flowing through time. In this work, this issue has been addressed by initializing the forget gate bias to a large value.
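A single step of the cell described by equations (1)-(5), including the large forget-gate bias initialization just discussed, can be sketched as follows. The dimensions, the dictionary-of-weights layout and the elementwise (diagonal) peephole weight are our own illustrative choices, not the exact parameterization of [23].

```python
import numpy as np

def lstm_step(x_t, r_prev, c_prev, W, b):
    """One step of an LSTM cell: input, forget and output gates
    controlling the memory cell c and the output h."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = np.concatenate([x_t, r_prev])                  # input + recurrent state
    i = sigmoid(W["i"] @ z + b["i"])                   # (1) input gate
    f = sigmoid(W["f"] @ z + b["f"])                   # (2) forget gate
    c = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])  # (3) cell update
    o = sigmoid(W["o"] @ z + W["oc"] * c + b["o"])     # (4) output gate, peephole on c
    h = o * np.tanh(c)                                 # (5) cell output
    return h, c

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = {k: 0.1 * rng.standard_normal((n_h, n_in + n_h)) for k in ("i", "f", "c", "o")}
W["oc"] = 0.1 * rng.standard_normal(n_h)   # diagonal peephole weight
b = {k: np.zeros(n_h) for k in ("i", "c", "o")}
b["f"] = np.ones(n_h)   # large forget-gate bias keeps the gate open early in training
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), W, b)
```

With b["f"] = 1, the forget gate starts near sigmoid(1) ≈ 0.73 rather than 0.5, so cell memory and its gradients survive more time steps at the start of training.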
3.2. Bhattacharyya Distance
The Bhattacharyya distance [25,26] measures the similarity of two discrete or continuous probability distributions. It is given by equations (6) and (7), where µi and Σi are the mean and covariance of distribution i.
3.3. Mahalanobis Distance
The Mahalanobis distance [26] is a measure of the distance between a point and a distribution, introduced by P. C. Mahalanobis in 1936. It is given by equation (8), where µ and Σ are the mean and covariance of the distribution and x is the point whose distance is to be measured. In this work, the Mahalanobis distance between frames is measured for different phonemes.
i_t = \sigma(W_{ix} x_t + W_{ir} r_{t-1} + b_i)    (1)
f_t = \sigma(W_{fx} x_t + W_{fr} r_{t-1} + b_f)    (2)
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c)    (3)
o_t = \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)    (4)
h_t = o_t \odot \Phi(c_t)    (5)
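As an illustrative sketch (not the implementation used in this work), a single time step of Equations (1)-(5) can be written in NumPy. The function and parameter names are hypothetical, the activations g and Φ are taken as tanh, and the forget-gate bias is initialized to a large value as discussed in Section 3.1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, r_prev, c_prev, p):
    """One LSTM cell step following Eqs. (1)-(5)."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ r_prev + p["b_i"])      # Eq. (1) input gate
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ r_prev + p["b_f"])      # Eq. (2) forget gate
    c_t = f_t * c_prev + i_t * np.tanh(
        p["W_cx"] @ x_t + p["W_cr"] @ r_prev + p["b_c"])                # Eq. (3) cell update
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ r_prev
                  + p["W_oc"] @ c_t + p["b_o"])                         # Eq. (4) output gate
    h_t = o_t * np.tanh(c_t)                                            # Eq. (5) cell output
    return h_t, c_t

def init_params(n_in, n_cell, rng, forget_bias=1.0):
    """Small random weights; the forget-gate bias is set large (Section 3.1)."""
    def w(rows, cols):
        return rng.uniform(-0.1, 0.1, size=(rows, cols))
    return {
        "W_ix": w(n_cell, n_in), "W_ir": w(n_cell, n_cell), "b_i": np.zeros(n_cell),
        "W_fx": w(n_cell, n_in), "W_fr": w(n_cell, n_cell),
        "b_f": np.full(n_cell, forget_bias),  # large bias keeps the gate open early in training
        "W_cx": w(n_cell, n_in), "W_cr": w(n_cell, n_cell), "b_c": np.zeros(n_cell),
        "W_ox": w(n_cell, n_in), "W_or": w(n_cell, n_cell),
        "W_oc": w(n_cell, n_cell), "b_o": np.zeros(n_cell),
    }
```

Treating W_oc as a full matrix is a simplification; in [23] the cell-to-gate ("peephole") connections are diagonal.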
D_B = \frac{1}{8} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \det \Sigma_2}}    (6)
\Sigma = \frac{\Sigma_1 + \Sigma_2}{2}    (7)
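As a minimal NumPy sketch (the function name is illustrative), Equation (6), with Σ from Equation (7), can be computed for two Gaussian phoneme models as follows:

```python
import numpy as np

def bhattacharyya_distance(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between two Gaussians, Eqs. (6)-(7)."""
    sigma = (sigma1 + sigma2) / 2.0                      # Eq. (7): averaged covariance
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(sigma, diff)  # quadratic (mean-separation) term
    term2 = 0.5 * np.log(np.linalg.det(sigma)
                         / np.sqrt(np.linalg.det(sigma1) * np.linalg.det(sigma2)))
    return term1 + term2
```

The distance is zero for identical distributions and grows as the means separate or the covariances diverge, which is why larger values indicate better separation between two phoneme models.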
D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}    (8)
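Equation (8) can likewise be sketched in NumPy; here x would be a single frame's feature vector and (µ, Σ) a phoneme model (names are illustrative):

```python
import numpy as np

def mahalanobis_distance(x, mu, sigma):
    """Mahalanobis distance of point x from a distribution with mean mu
    and covariance sigma, Eq. (8)."""
    diff = x - mu
    # solve() avoids forming the explicit inverse of sigma
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))
```

With an identity covariance this reduces to the Euclidean distance; a larger variance along a dimension shrinks the contribution of that dimension.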
4. PERFORMANCE ANALYSIS
The next step is training the LSTM model using the recorded training set of Bangla speech data.
Three popular feature extraction methods, namely LPC, MFCC and PLP, have been tried,
training with 5000 male samples and testing the model with 1000 different male samples.
This combination has been chosen carefully as it performed reasonably well in [9].
Statistical distance quantifies the distance between two statistical objects, which can be two
random variables, two probability distributions, or two samples. The distance can also be between
an individual sample point and a population or a wider sample of points.
In this work both the Bhattacharyya distance and the Mahalanobis distance of various phoneme
coefficients have been computed for LPC, MFCC and PLP.
Table 1. Bhattacharyya distance between similar sounding phonemes
Phonemes to be compared LPC MFCC PLP
"r" "er" 1.268892 1.609877 1.935885
"b" "v" 0.602648 1.019706 1.176067
"b" "bh" 0.674364 1.99346 2.228276
"d" "dh" 0.091483 0.732092 0.630423
"aa" "ax" 1.398714 0.257079 0.278304
"ch" "chh" 0.477755 0.288733 0.318336
"ey" "y" 6.73841 1.88483 2.006171
"n" "ng" 1.090365 2.900765 2.830959
"uh" "ih" 1.399962 3.502966 3.659523
"uh" "eh" 9.00648 3.199748 3.829925
"uh" "u" 7.451423 1.100306 1.12853
"dh" "d" 0.091483 0.732092 0.630423
"s" "sh" 5.384669 0.149458 0.150676
"v" "bh" 0.513366 1.427321 1.206649
"oy" "ay" 3.930717 1.342863 1.604389
"hh" "h" 1.363757 2.696534 2.499212
"ah" "ax" 8.392236 0.288995 0.300063
"ay" "oy" 3.930717 1.342863 1.604389
"z" "jh" 3.993217 0.83873 0.956689
"z" "j" 7.540232 0.810661 0.860811
"zh" "jh" 0.934331 1.320837 1.200627
"j" "z" 7.540232 0.810661 0.860811
"j" "zh" 1.583787 1.722148 1.625498
"j" "jh" 3.002899 0.511258 0.524364
Figure 4. Mahalanobis distance between the phoneme ‘sil’ and the word “bideshi bonduk amdani”
Table 2. Timeframe of the word “bideshi” in Figure 4.
Frame start Frame end Phoneme
0 53 sil
53 63 bi
63 69 d
69 74 ey
74 78 eh
78 89 sh
89 94 ih
94 100 sp
5. DISCUSSION
From Table 1 it is evident that both PLP and LPC show some benefit over MFCC for similar
sounding phonemes. However, in most cases LPC produces abnormally large distances, which
should not be that large given that these phonemes sound similar.
If the time frame of the word “bideshi” in Table 2 is matched with Figure 4, both PLP and MFCC
manage to distinguish the 53rd frame from the phoneme ‘sil’, while LPC only detects something
at the onset of the phoneme ‘sh’ at frame 78.
6. CONCLUSION
From Table 1, Table 2 and Figure 4 it is clear that PLP and MFCC show significantly better
results than LPC in terms of statistical distance. Combined with the findings in [9], it can be said
that PLP performs better than MFCC. So, for large data sets or for large sets of training and
testing, PLP seems to be the better alternative. However, RNN and LSTM rely mainly on
sequential processing, which makes them inherently slow. The performance of PLP could shine
through by combining it with a more advanced architecture such as the transformer network [27].
As the transformer network can process sequences in parallel, and therefore faster, PLP should
perform even better with this type of framework.
REFERENCES
[1] Uddin, M.T.; Uddiny, M.A. “Human activity recognition from wearable sensors using extremely
randomized trees”. In Proceedings of the International Conference on Electrical Engineering and
Information Communication Technology (ICEEICT), Dhaka, Bangladesh, 13–15 September 2015;
pp. 1–6.
[2] Jalal, A. “Human activity recognition using the labelled depth body parts information of depth
silhouettes”. In Proceedings of the 6th International Symposium on Sustainable Healthy Buildings,
Seoul, Korea, 27 February 2012; pp. 1–8.
[3] Ahad, M.A.R.; Kobashi, S.; Tavares, J.M.R. “Advancements of image processing and vision” in
healthcare. J. Healthcare Eng. 2018.
[4] Jalal, A.; Quaid, M.A.K.; Hasan, A.S. “Wearable sensor-based human behaviour understanding and
recognition in daily life for smart environments”. In Proceedings of the International Conference on
Frontiers of Information Technology (FIT), Islamabad, Pakistan, 18–20 December 2017.
[5] C. Chiu, T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R.Weiss, K. Rao, E.
Gonina, et al. “State-of-the art speech recognition with sequence-to-sequence models”. In Acoustics,
Speech and Signal Processing (ICASSP), 2018 IEEE International Conference, pages 4774–4778.
IEEE, 2018.
[6] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. “Exploring architectures, data and units for
streaming end-to-end speech recognition with RNN-transducer”. In Automatic Speech Recognition
and Understanding Workshop (ASRU), 2017 IEEE, pages 193–199. IEEE, 2017.
[7] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu,
Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu. “Exploring neural transducers for end-to-end
speech recognition”. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017
IEEE, pages 206–213. IEEE, 2017.
[8] Kishore, S., Black, A., Kumar, R., and Sangal, R. “Experiments with unit selection speech databases
for Indian languages,” National Seminar on Language Technology Tools: Implementation of Telugu,
Hyderabad, India, October 2003.
[9] Nahyan A. M. “Performance analysis of different acoustic features based on LSTM for Bangla
speech recognition.” The International Journal of Multimedia & Its Applications (IJMA) Vol.12, No.
1/2/3/4, August 2020, DOI: 10.5121/ijma.2020.12402.
[10] Rafal J., Wojciech Z., and Ilya S. “An empirical exploration of recurrent network architectures”. In
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France,
6-11 July 2015, pages 2342–2350, 2015.
[11] Atal, B. S., “Speech analysis and synthesis by linear prediction of the speech wave.” The Journal of
The Acoustical Society of America 47 (1970) 65.
[12] Davis, S. B and Mermelstein, P., “Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. ASSP-28, no. 4, pp. 357 – 366, August 1980.
[13] Alex Graves and Navdeep Jaitly. “Towards end-to-end speech recognition with recurrent neural
networks”. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014,
Beijing, China, 21-26 June 2014, pages 1764–1772, 2014.
[14] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan
Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. “Deep speech:
scaling up end-to-end speech recognition”. CoRR, abs/1412.5567, 2014.
[15] Hagen Soltau, Hank Liao, and Hasim Sak. “Neural speech recognizer: acoustic-to-word LSTM model
for large vocabulary speech recognition”. CoRR, abs/1610.09975, 2016.
[16] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. “Connectionist
temporal classification: labelling unsegmented sequence data with recurrent neural networks”. In
Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006),
Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 369–376, 2006.
[17] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory”, Neural Computation, vol. 9,
no. 8, pp. 1735–1780, Nov. 1997.
[18] Mike Schuster and Kuldip K. Paliwal, “Bidirectional recurrent neural networks,” IEEE
Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[19] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton, “Acoustic modeling using deep
belief networks,” IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp.
14–22, 2012.
[20] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural
networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech &
Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
[21] Navdeep Jaitly, Patrick Nguyen, Andrew Senior, and Vincent Vanhoucke, “Application of
pre-trained deep neural networks to large vocabulary speech recognition,” in Proceedings of
INTERSPEECH, 2012.
[22] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “An experimental study on speech enhancement
based on deep neural networks”, IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, Nov. 2013.
[23] Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory based recurrent
neural network architectures for large vocabulary speech recognition”, arXiv e-prints, Feb. 2014.
[24] Z. Chen, Y. Zhuang, Y. Qian, K. Yu, et al. “Phone synchronous speech recognition with CTC
lattices” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25(1): 90–
101, 2017
[25] S. Dubuisson. “The computation of the Bhattacharyya distance between histograms without
histograms”, 2nd International Conference on Image Processing Theory Tools and Applications
(IPTA'10), Jul 2010, Paris, France, pp. 373–378.
[26] W. F. Basener and M. Flynn "Microscene evaluation using the Bhattacharyya distance", Proc. SPIE
10780, Multispectral, Hyperspectral, and Ultraspectral Remote Sensing Technology, Techniques and
Applications VII, 107800R (23 October 2018); https://doi.org/10.1117/12.2327004
[27] Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel (2018).
"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding".
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics: 353–355.
AUTHORS
Nahyan Al Mahmud
Mr. Mahmud graduated from Electrical and Electronic Engineering department of
Ahsanullah University of Science and Technology (AUST), Dhaka in 2008. Mr.
Mahmud has completed the MSc program (EEE) from Bangladesh University of
Engineering & Technology (BUET), Dhaka. Currently he is working as an Assistant
Professor of EEE Department in AUST. His research interests include system and
signal processing, analysis and design.
Shahfida Amjad Munni
Shahfida Amjad Munni completed her master of engineering degree at the Institute of
Information and Communication Technology (IICT) in Bangladesh University of
Engineering and Technology (BUET), Bangladesh in March 2018. Currently she is
working as a software engineer at Cygnus Innovation Limited, Bangladesh. Her
research interests include optical wireless communication, data science, wireless
communication, OFDM modulation and LiFi.