QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
The performance of various acoustic feature extraction methods is compared in this work using a
Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic
features are a series of vectors that represent the speech signals, and they can be classified into either
words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as
the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction
techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), are then
applied; these two methods closely model the human auditory system. The feature vectors are used to
train the LSTM neural network, and the resulting models of different phonemes are compared with
statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the
nature of those acoustic features.
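The two distance measures named above can be sketched in a few lines. The following is a minimal NumPy illustration (not the paper's implementation), assuming each phoneme model is summarized as a multivariate Gaussian with a mean vector and covariance matrix; the toy 2-D models at the bottom are invented for demonstration:

```python
import numpy as np

def mahalanobis_distance(x, mean, cov):
    """Mahalanobis distance of a feature vector x from a Gaussian (mean, cov)."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def bhattacharyya_distance(mean1, cov1, mean2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = (cov1 + cov2) / 2.0
    diff = mean2 - mean1
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)

# Toy 2-D "phoneme models" (e.g. Gaussians fitted to MFCC/PLP vectors of two phonemes)
m1, c1 = np.array([0.0, 0.0]), np.eye(2)
m2, c2 = np.array([3.0, 0.0]), np.eye(2)
d_b = bhattacharyya_distance(m1, c1, m2, c2)   # covariance term vanishes here
d_m = mahalanobis_distance(np.array([3.0, 0.0]), m1, c1)
```

With identical covariances the Bhattacharyya distance reduces to a scaled Mahalanobis term, which is why the two measures often rank phoneme pairs similarly.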
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text-based and audio-based models and then fused these heterogeneous feature maps, taken from intermediate layers, to complete the architecture. The performance measures (accuracy, precision, recall, and F1-score) are observed to outperform the existing models.
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS (ijnlc)
The fundamental techniques used for man-machine communication include speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of the speech signals. HNM analysis and synthesis provide high-quality speech with
fewer parameters. Dynamic time warping is a well-known technique for aligning two given
multidimensional sequences: it locates an optimal match between them, and the improvement in the
alignment is estimated from the corresponding distances. The objective of this research is to investigate
the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals
in the form of twenty-five phrases were recorded; the recorded material was segmented manually and
aligned at sentence, word, and phoneme level, and the Mahalanobis distance (MD) was computed
between the aligned frames. The investigation has shown better alignment in the HNM parametric
domain, and effective speech alignment was possible even at phrase level.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This
type of communication is valuable when our hands and eyes are busy with some other task, such as
driving a vehicle, performing surgery, or firing weapons. Dynamic time warping (DTW) is widely used
for aligning two given multidimensional sequences. It finds an optimal match between the given
sequences; the distance between aligned sequences should be smaller than between unaligned
sequences, so the improvement in the alignment can be estimated from the corresponding distances.
This technique has applications in speech recognition, speech synthesis, and speaker transformation.
The objective of this research is to investigate the amount of improvement in alignment for
sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five
phrases were recorded from each of six speakers (3 male and 3 female). The recorded material was
segmented manually and aligned at sentence and phoneme level. The aligned sentences of different
speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at frame level
using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations have
shown more than a 20% reduction in the average Mahalanobis distances.
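The DTW alignment described above can be sketched compactly. The following is a minimal NumPy version of the classic dynamic-programming recurrence with path backtracking (an illustration, not the authors' code); the two short sequences at the end are invented time-warped versions of the same contour:

```python
import numpy as np

def dtw(a, b):
    """Classic DTW between two sequences of feature vectors.
    Returns the accumulated alignment cost and the warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local Euclidean distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path from (n, m) to (0, 0)
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(D[n, m]), path[::-1]

# Two time-warped renditions of the same contour: DTW aligns them with zero cost
a = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]])
cost, path = dtw(a, b)
```

A frame-by-frame (unwarped) comparison of `a` and `b` would give a nonzero distance; the drop to zero after warping is the kind of distance reduction the abstract quantifies with the Mahalanobis measure.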
Performance Calculation of Speech Synthesis Methods for Hindi Language (iosrjce)
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI... (ijnlc)
Researchers in many nations have developed automatic speech recognition (ASR) for their languages as part of national progress in information and communication technology. This work aims to improve ASR performance for the Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. Thanks to locality and the pooling operation, a CNN can reduce spectral variations and model the spectral correlations that exist in the signal, so the impact of these hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hour data set is used for training, and ASR performance was evaluated on two open
test sets: web news and recorded data. As Myanmar is a syllable-timed language, a syllable-based ASR was built and compared with a word-based ASR. As a result, the system achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
Bayesian distance metric learning and its application in automatic speaker re... (IJECEIAES)
This paper proposes a state-of-the-art automatic speaker recognition (ASR) system based on Bayesian distance metric learning as a feature extractor. In this modeling, I explored the constraints on the distance between modified and simplified i-vector pairs from the same speaker and from different speakers. An approximation of the distance metric, formed as a weighted covariance matrix from the leading eigenvectors of the covariance matrix, is used to estimate the posterior distribution of the metric distance. Given a speaker label, I select the data pairs of different speakers with the highest cosine score to form a set of speaker constraints; this collection captures the most discriminative variability between the speakers in the training data. This Bayesian distance learning approach achieves better performance than the most advanced methods. Furthermore, this method is insensitive to normalization compared to cosine scoring, and it is very effective when training data are limited. The modified supervised i-vector based system is evaluated on the NIST SRE 2008 database. The best performance of the combined cosine score, EER 1.767%, was obtained using LDA200 + NCA200 + LDA200, and the best performance of Bayes_dml, EER 1.775%, was obtained using LDA200 + NCA200 + LDA100. Bayes_dml outperforms the combined normalized cosine scores and is the best reported result for the short2-short3 condition of the NIST SRE 2008 data.
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching (IJERA)
Short tandem repeats (STRs) have become important molecular markers for a broad range of applications,
such as genome mapping and characterization, phenotype mapping, marker-assisted selection of crop
plants, and a range of molecular ecology and diversity studies. These repeated DNA sequences are found
in both plants and bacteria. Most computer programs that find STRs fail to report the number of
occurrences and exact positions of the repeated pattern, and it is difficult to obtain accurate results
from larger datasets, so high-performance computing models are needed to extract such repeats. One
solution is STR detection using parallel string matching, which gives the number of occurrences, with
the corresponding line number and exact location, of each STR in a genome of any length. We
implemented parallel string matching using Java multithreading with multi-core processing: we first
implemented a basic algorithm and compared it with previous algorithms such as Knuth-Morris-Pratt,
Boyer-Moore, and brute-force string matching, and the results show that our new basic algorithm
outperforms them. We then apply this algorithm in parallel string matching using multithreading to
reduce the running time on multi-core processors. The test results show that multi-core processing is
remarkably efficient and powerful, and that the proposed STR search using parallel string matching is
better than the sequential approaches.
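The partitioning idea behind the parallel search can be sketched briefly. The paper uses Java multithreading; the following Python sketch (illustrative only; due to the GIL, Python threads show the chunking scheme rather than real speedup) splits the genome into chunks that overlap by one pattern length so matches straddling a boundary are not lost; the toy genome and repeat are invented:

```python
import threading

def find_occurrences(text, pattern, start, end, results, lock):
    """Scan for matches starting in [start, end); the search window extends
    len(pattern) - 1 past `end` so boundary-straddling matches are found."""
    limit = min(end + len(pattern) - 1, len(text))
    i = text.find(pattern, start, limit)
    while i != -1 and i < end:
        with lock:
            results.append(i)
        i = text.find(pattern, i + 1, limit)

def parallel_str_search(genome, repeat, workers=4):
    """Partition the genome across threads; return sorted match positions."""
    chunk = max(1, len(genome) // workers)
    results, lock, threads = [], threading.Lock(), []
    for w in range(workers):
        start = w * chunk
        end = (w + 1) * chunk if w < workers - 1 else len(genome)
        t = threading.Thread(target=find_occurrences,
                             args=(genome, repeat, start, end, results, lock))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

genome = "ATGATCATCATCGGATCATC"   # toy sequence with the STR "ATC"
positions = parallel_str_search(genome, "ATC")
```

Each worker reports matches that *start* inside its chunk, so no occurrence is counted twice even though the search windows overlap; the same partitioning carries over directly to Java threads or a process pool.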
Identification of frequency domain using quantum based optimization neural ne... (eSAT Publishing House)
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco... (TELKOMNIKA JOURNAL)
Sundanese is one of the popular languages in Indonesia, which makes research on it essential and motivated this study. The vital parts for achieving high recognition accuracy are feature extraction and the classifier, and the goal of this study was to analyze the former. Three types of feature extraction were tested: Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The output of each feature extraction became the input of the classifier, for which the study applied Hidden Markov Models. Before classification, quantization was needed; in this study it was based on clustering. Each result was compared across the number of clusters and hidden states used. The dataset came from four people who each spoke the digits from zero to nine 60 times. Finally, the experiments showed that all three feature extraction methods produced the same performance on the corpus used.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information in their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique that performs gender recognition before speaker recognition so as to improve the accuracy and speed of the baseline speaker recognition system. Based on experiments conducted on the MIT database, we demonstrate that the proposed system improves accuracy over the baseline by approximately 2% while reducing the computational time by more than 30%.
A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD T... (IJNSA Journal)
This paper introduces a multi-layer hybrid text steganography approach that utilizes word tagging and recoloring. Existing approaches typically excel at only one of imperceptibility, high hiding capacity, or robustness. The proposed approach does not use the ordinary sequential insertion process; it overcomes the issues of current approaches by pursuing imperceptibility, high hiding capacity, and robustness together through a hybrid of a linguistic technique and a format-based technique. The linguistic technique divides the cover text into embedding layers, each consisting of a sequence of words sharing a single part of speech detected by a POS tagger, while the format-based technique recolors the letters of the cover text with near-identical RGB color codes to embed 12 bits of the secret message in each letter, which yields high hiding capacity and makes the embedding blind. Robustness is accomplished through the multi-layer embedding process, and the generated stego key significantly strengthens the security of the embedded messages and their size. An experimental comparison shows that the proposed approach is better than currently developed approaches at providing a balance between the imperceptibility, hiding-capacity, and robustness criteria.
Speech Recognition Using HMM with MFCC: An Analysis Using Frequency Spectral De... (sipij)
This paper presents an approach to speech signal recognition that uses frequency spectral information with the Mel frequency scale to improve speech feature representation in an HMM based recognition approach. Frequency spectral information is incorporated into the conventional Mel spectrum based speech recognition approach. The Mel frequency approach observes the speech signal at a fixed resolution, which causes feature overlap between resolutions and limits recognition. Resolution decomposition with frequency separation is used as the mapping approach for the HMM based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
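Since MFCC-style Mel-spectrum features recur throughout these abstracts, a minimal sketch of the standard pipeline may help: power spectrum, triangular Mel filterbank, log, then DCT. This is a generic textbook construction in NumPy/SciPy, not any paper's implementation; the 440 Hz test frame and all parameter values are illustrative:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale, applied to rfft bins."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs of one windowed frame: power spectrum -> Mel energies -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ power
    return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_ceps]

sr = 16000
t = np.arange(512) / sr
frame = np.hamming(512) * np.sin(2 * np.pi * 440.0 * t)   # toy 440 Hz frame
coeffs = mfcc_frame(frame, sr)
```

The Mel spacing concentrates filters at low frequencies, mirroring auditory resolution; the "resolution feature overlapping" the abstract mentions arises because adjacent triangular filters share spectrum bins.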
Transformer models have taken over most natural language inference tasks and have recently beaten
several benchmarks. Chunking means splitting sentences into tokens and then grouping them in a
meaningful way. Chunking has gradually moved from POS-tag-based statistical models to neural
networks using language models such as LSTMs, bidirectional LSTMs, and attention models. Deep
neural network models are deployed indirectly to classify tokens with the tags defined for named
entity recognition tasks; these tags are later used in conjunction with pointer frameworks for the
final chunking task. In our paper, we propose an ensemble model that combines a fine-tuned
transformer model and a recurrent neural network model to predict tags and chunk substructures of
a sentence. We analyzed the shortcomings of the transformer models in predicting different tags and
then trained the BiLSTM+CNN accordingly to compensate for them.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition... (TELKOMNIKA JOURNAL)
Speech recognition can be defined as the process of converting voice signals into a sequence of
words by applying a specific algorithm implemented in a computer program. Research on speech
recognition in Indonesia is relatively limited. This paper studies which feature extraction method,
Linear Predictive Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), is better for speech
recognition in the Indonesian language. This matters because a method that produces high accuracy
for one language does not necessarily produce the same accuracy for other languages, since every
language has different characteristics; this research thus hopefully helps accelerate the use of
automatic speech recognition for Indonesian. There are two main processes in speech recognition:
feature extraction and recognition. The feature extraction methods compared in this study are LPC
and MFCC, while recognition uses a Hidden Markov Model (HMM). The test results showed that MFCC is
better than LPC for Indonesian-language speech recognition.
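The LPC side of such comparisons can also be sketched compactly. The following is a minimal NumPy illustration of the standard autocorrelation method with the Levinson-Durbin recursion (a textbook construction, not this paper's code); the exact AR(1) test frame is invented so the recovered coefficient is known:

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients by the autocorrelation method with Levinson-Durbin.
    Returns filter coefficients a (a[0] = 1) and the final prediction error."""
    n = len(frame)
    # Autocorrelation lags 0..order
    r = [float(np.dot(frame[:n - k], frame[k:])) for k in range(order + 1)]
    a, err = [1.0], r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                                   # reflection coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
    return np.array(a), err

# Toy frame: an exact AR(1) decay x[n] = 0.5 * x[n-1], so a[1] should be -0.5
frame = 0.5 ** np.arange(50)
a, err = lpc(frame, order=2)
```

For a true AR(1) signal the second coefficient comes out (numerically) zero, which is the sense in which LPC "fits" the vocal-tract filter order; MFCC, by contrast, characterizes the spectrum envelope without assuming an all-pole model.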
Comparative study of voice print based acoustic features MFCC and LPCC (IJAEMS Journal)
Voice is the best biometric feature for investigation and authentication, having both biological and behavioural characteristics; the acoustic features are derived from the voice. A speaker recognition system is designed for the automatic authentication of a speaker's identity based purely on the human voice. Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper provides a comparative study of MFCC and LPCC based on the accuracy of results and their working methodology. The results are better when MFCC is used for feature extraction.
SEARCH TIME REDUCTION USING HIDDEN MARKOV MODELS FOR ISOLATED DIGIT RECOGNITION (cscpconf)
This paper reports a word modeling algorithm for Malayalam isolated digit recognition that reduces the search time in the classification process. A recognition experiment is carried out for the 10 Malayalam digits using Mel Frequency Cepstral Coefficient (MFCC) feature parameters and the k-Nearest Neighbor (k-NN) classification algorithm. A word modeling scheme using the Hidden Markov Model (HMM) algorithm is developed. The experimental results show that the proposed algorithm can reduce the search time of the classification process in telephony applications by 80% for first-digit recognition.
Comparative Study of Different Techniques in Speaker Recognition: Review (IJAEMS Journal)
Speech is the most basic and essential method of communication used by people. A speaker is recognized on the basis of the individual information included in the speech signal. Speaker recognition (SR) identifies the person who is speaking, and in recent years it has been used in security systems. In this paper we discuss feature extraction techniques such as Mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC), and dynamic time warping (DTW), and, for classification, Gaussian Mixture Models (GMM), artificial neural networks (ANN), and support vector machines (SVM).
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
The performance of various acoustic feature extraction methods has been compared in this work using
Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic
features are a series of vectors that represents the speech signals. They can be classified in either words or
sub word units such as phonemes. In this work, at first linear predictive coding (LPC) is used as acoustic
vector extraction technique. LPC has been chosen due to its widespread popularity. Then other vector
extraction techniques like Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction
(PLP) have also been used. These two methods closely resemble the human auditory system. These feature
vectors are then trained using the LSTM neural network. Then the obtained models of different phonemes
are compared with different statistical tools namely Bhattacharyya Distance and Mahalanobis Distance to
investigate the nature of those acoustic features.
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...ijnlc
Researchers of many nations have developed automatic speech recognition (ASR) to show their national improvement in information and communication technology for their languages. This work intends to improve the ASR performance for Myanmar language by changing different Convolutional Neural Network (CNN) hyperparameters such as number of feature maps and pooling size. CNN has the abilities of reducing in spectral variations and modeling spectral correlations that exist in the signal due to the locality and pooling operation. Therefore, the impact of the hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hr-data set is used as training data and the ASR performance was evaluated on two open
test sets: web news and recorded data. As Myanmar language is a syllable-timed language, ASR based on syllable was built and compared with ASR based on word. As the result, it gained 16.7% word error rate (WER) and 11.5% syllable error rate (SER) on TestSet1. And it also achieved 21.83% WER and 15.76% SER on TestSet2.
Bayesian distance metric learning and its application in automatic speaker re...IJECEIAES
This paper proposes state-of the-art Automatic Speaker Recognition System (ASR) based on Bayesian Distance Learning Metric as a feature extractor. In this modeling, I explored the constraints of the distance between modified and simplified i-vector pairs by the same speaker and different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. Given a speaker tag, I select the data pair of the different speakers with the highest cosine score to form a set of speaker constraints. This collection captures the most discriminating variability between the speakers in the training data. This Bayesian distance learning approach achieves better performance than the most advanced methods. Furthermore, this method is insensitive to normalization compared to cosine scores. This method is very effective in the case of limited training data. The modified supervised i-vector based ASR system is evaluated on the NIST SRE 2008 database. The best performance of the combined cosine score EER 1.767% obtained using LDA200 + NCA200 + LDA200, and the best performance of Bayes_dml EER 1.775% obtained using LDA200 + NCA200 + LDA100. Bayesian_dml overcomes the combined norm of cosine scores and is the best result of the short2-short3 condition report for NIST SRE 2008 data.
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String MatchingIJERA Editor
Short tandem repeats (STRs) have become important molecular markers for a broad range of applications, such
as genome mapping and characterization, phenotype mapping, marker assisted selection of crop plants and a
range of molecular ecology and diversity studies. These repeated DNA sequences are found in both Plants and
bacteria. Most of the computer programs that find STRs failed to report its number of occurrences of the
repeated pattern, exact position and it is difficult task to obtain accurate results from the larger datasets. So we
need high performance computing models to extract certain repeats. One of the solution is STRs using parallel
string matching, it gives number of occurrences with corresponding line number and exact location or position
of each STR in the genome of any length. In this, we implemented parallel string matching using JAVA Multithreading
with multi core processing, for this we implemented a basic algorithm and made a comparison with
previous algorithms like Knuth Morris Pratt, Boyer Moore and Brute force string matching algorithms and from
the results our new basic algorithm gives better results than the previous algorithms. We apply this algorithm in
parallel string matching using multi-threading concept to reduce the time by running on multicore processors.
From the test results it is shown that the multicore processing is a remarkably efficient and powerful compared
to lower versions and finally this proposed STR using parallel string matching algorithm is better than the
sequential approaches.
Identification of frequency domain using quantum based optimization neural ne...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...TELKOMNIKA JOURNAL
Sundanese language is one of the popular languages in Indonesia. Thus, research in Sundanese language becomes essential to be made. It is the reason this study was being made. The vital parts to get the high accuracy of recognition are feature extraction and classifier. The important goal of this study was to analyze the first one. Three types of feature extraction tested were Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The results of the three feature extraction became the input of the classifier. The study applied Hidden Markov Models as its classifier. However, before the classification was done, we need to do the quantization. In this study, it was based on clustering. Each result was compared against the number of clusters and hidden states used. The dataset came from four people who spoke digits from zero to nine as much as 60 times to do this experiments. Finally, it showed that all feature extraction produced the same performance for the corpus used.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
Automatic speaker recognition system is used to recognize an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition. Our baseline speaker recognition system accuracy, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique by performing gender recognition before speaker recognition so as to improve the accuracy/timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves the accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
A MULTI-LAYER HYBRID TEXT STEGANOGRAPHY FOR SECRET COMMUNICATION USING WORD T...IJNSA Journal
This paper introduces a multi-layer hybrid text steganography approach that utilizes word tagging and recoloring. Existing approaches are designed to achieve either imperceptibility, high hiding capacity, or robustness. The proposed approach does not use the ordinary sequential insertion process and overcomes the issues of current approaches by pursuing imperceptibility, high hiding capacity, and robustness together through its hybrid combination of a linguistic technique and a format-based technique. The linguistic technique divides the cover text into embedding layers, where each layer consists of a sequence of words sharing a single part of speech detected by a POS tagger, while the format-based technique recolors the letters of the cover text with near-identical RGB color codes to embed 12 bits of the secret message in each letter, which yields high hiding capacity and conceals the embedding. Robustness is accomplished through the multi-layer embedding process, and the generated stego key significantly strengthens the security of the embedded messages and their size. An experimental comparison shows that the proposed approach is better than currently developed approaches at providing an ideal balance among the imperceptibility, hiding capacity, and robustness criteria.
Speech Recognition Using HMM with MFCC-An Analysis Using Frequency Specral De...sipij
This paper presents an approach to speech signal recognition that incorporates frequency spectral information into the conventional Mel-spectrum-based, HMM recognition approach to improve the speech feature representation. The Mel frequency approach observes the speech signal at a fixed resolution, which causes overlap between resolution features and limits recognition. Resolution decomposition with frequency separation is therefore used as the mapping approach for the HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition, with respect to computational time and learning accuracy, for the speech recognition system.
Transformer models have taken over most natural language inference tasks and in recent
times have beaten several benchmarks. Chunking means splitting sentences into
tokens and then grouping them in a meaningful way. Chunking is a task that has gradually
moved from POS-tag-based statistical models to neural nets using language models such as
LSTMs, bidirectional LSTMs, attention models, etc. Deep neural net models are deployed
indirectly for classifying tokens into the different tags defined under named entity recognition
tasks. Later these tags are used in conjunction with pointer frameworks for the final chunking
task. In our paper, we propose an ensemble model using a fine-tuned transformer model and a
recurrent neural network model together to predict tags and chunk substructures of a sentence.
We analyzed the shortcomings of the transformer models in predicting different tags and then
trained the BiLSTM+CNN accordingly to compensate for them.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition...TELKOMNIKA JOURNAL
Speech recognition can be defined as the process of converting voice signals into a sequence of
words by applying a specific algorithm implemented in a computer program. Research on speech
recognition in Indonesia is relatively limited. This paper studies which feature extraction method is
the best, between Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficients (MFCC), for
speech recognition in the Indonesian language. This matters because a method that produces high
accuracy for one language does not necessarily produce the same accuracy for other languages,
considering that every language has different characteristics. This research can thus hopefully help
accelerate the adoption of automatic speech recognition for the Indonesian language. There are two main
processes in speech recognition: feature extraction and recognition. The feature extraction methods
compared in this study are LPC and MFCC, while recognition uses a Hidden
Markov Model (HMM). The test results showed that the MFCC method is better than LPC for
Indonesian-language speech recognition.
Comparative Study of Voice Print Based Acoustic Features MFCC and LPCCIJAEMSJORNAL
Voice is the best biometric feature for investigation and authentication. It has both biological and behavioural features, and the acoustic features are related to the voice. A Speaker Recognition System is designed for the automatic authentication of a speaker's identity based purely on the human voice. Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper provides a comparative study of MFCC and LPCC based on the accuracy of their results and their working methodology. The results are better when MFCC is used for feature extraction.
SEARCH TIME REDUCTION USING HIDDEN MARKOV MODELS FOR ISOLATED DIGIT RECOGNITIONcscpconf
This paper reports a word modeling algorithm for Malayalam isolated digit recognition that reduces the search time in the classification process. A recognition experiment is carried out for the 10 Malayalam digits using Mel Frequency Cepstral Coefficient (MFCC) feature parameters and the k-Nearest Neighbor (k-NN) classification algorithm. A word modeling scheme using the Hidden Markov Model (HMM) algorithm is developed. The experimental results show that the proposed algorithm can reduce the search time for the classification process in telephony applications by 80% for first-digit recognition.
Comparative Study of Different Techniques in Speaker Recognition: ReviewIJAEMSJORNAL
Speech is the most basic and essential method of communication used by people. A speaker is recognized on the basis of the individual information included in his or her speech signals. Speaker recognition (SR) is used to identify the person who is speaking and has in recent years been applied in security systems. In this paper we discuss feature extraction techniques such as Mel frequency cepstral coefficients (MFCC), Linear predictive coding (LPC), and Dynamic time warping (DTW), and, for classification, Gaussian Mixture Models (GMM), Artificial neural networks (ANN), and Support vector machines (SVM).
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITIONijma
The performance of various acoustic feature extraction methods is compared in this work using a
Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic
features are a series of vectors that represent the speech signal; they can be classified into either
words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used
as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector
extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction
(PLP), which closely model the human auditory system, are then used as well. These feature
vectors are trained using the LSTM neural network, and the obtained models of different phonemes
are compared using two statistical tools, the Bhattacharyya distance and the Mahalanobis distance, to
investigate the nature of those acoustic features.
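The two statistical tools named above have closed forms for Gaussian phoneme models; here is a generic numpy sketch of those formulas, not the paper's code:

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1 = np.atleast_2d(np.asarray(cov1, dtype=float))
    cov2 = np.atleast_2d(np.asarray(cov2, dtype=float))
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    # mean term: (1/8) (mu1-mu2)^T cov^-1 (mu1-mu2)
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # covariance term: (1/2) ln( det(cov) / sqrt(det(cov1) det(cov2)) )
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(term_mean + term_cov)

def mahalanobis_distance(x, mu, cov):
    """Mahalanobis distance of a vector x from a Gaussian (mu, cov)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

With identical distributions the Bhattacharyya distance is zero, and with an identity covariance the Mahalanobis distance reduces to the Euclidean distance, which makes both easy to sanity-check.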
This paper reports on an Audio-Visual Client Recognition System, implemented in Matlab, which identifies five clients and can be extended to identify as many clients as it is trained for; the system was successfully implemented. The visual recognition part was implemented using Principal Component Analysis, Linear Discriminant Analysis, and a Nearest Neighbour Classifier. The audio recognition part was then successfully implemented using Mel-Frequency Cepstral Coefficients, Linear Discriminant Analysis, and a Nearest Neighbour Classifier. The system was tested with images and sounds it had not been trained on, to see whether it could detect an intruder, and gave very successful results with precise responses to intruders.
This paper proposes a voice morphing system for people suffering from laryngectomy, the surgical removal of all or part of the larynx (voice box), particularly performed in cases of laryngeal cancer. A primitive method of achieving voice morphing is to extract the source's vocal coefficients and then convert them into the target speaker's vocal parameters. In this paper, we deploy Gaussian Mixture Models (GMM) for mapping the coefficients from source to destination. However, the traditional GMM-based mapping approach suffers from over-smoothing of the converted voice. We therefore propose a unique GMM-based method for efficient voice morphing and conversion that overcomes the over-smoothing effects of the traditional method. It uses glottal waveform separation and prediction of excitations, and the results show that not only is over-smoothing eliminated but the transformed vocal tract parameters also match the target. Moreover, the synthesized speech thus obtained is of sufficiently high quality. Voice morphing based on this unique GMM approach is proposed and critically evaluated using various subjective and objective evaluation parameters, and an application of voice morphing for laryngectomees deploying this approach is recommended.
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...sipij
Previous research has found autocorrelation domain as an appropriate domain for signal and noise
separation. This paper discusses a simple and effective method for decreasing the effect of noise on the
autocorrelation of the clean signal. This could later be used in extracting mel cepstral parameters for
speech recognition. Two different methods are proposed to deal with the effect of error introduced by
considering speech and noise completely uncorrelated. The basic approach deals with reducing the effect
of noise via estimation and subtraction of its effect from the noisy speech signal autocorrelation. In order
to improve this method, we consider inserting a speech/noise cross correlation term into the equations used
for the estimation of clean speech autocorrelation, using an estimate of it, found through Kernel method.
Alternatively, we used an estimate of the cross correlation term using an averaging approach. A further
improvement was obtained through introduction of an overestimation parameter in the basic method. We
tested our proposed methods on the Aurora 2 task. The Basic method has shown considerable improvement
over the standard features and some other robust autocorrelation-based features. The proposed techniques
have further increased the robustness of the basic autocorrelation-based method.
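The basic method described above, subtracting an estimate of the noise autocorrelation from the noisy-speech autocorrelation (optionally with an overestimation parameter), can be sketched as follows. The function names are illustrative, not the paper's implementation:

```python
import numpy as np

def autocorr(x, max_lag):
    """One-sided autocorrelation of a frame for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    return np.array([x[: len(x) - k] @ x[k:] for k in range(max_lag + 1)])

def denoised_autocorr(noisy_frame, noise_estimate, max_lag, alpha=1.0):
    """Basic autocorrelation-domain noise subtraction: assuming speech and
    noise are uncorrelated, r_noisy ~= r_clean + r_noise, so an estimate of
    the clean-speech autocorrelation is obtained by subtracting an estimate
    of the noise autocorrelation. alpha > 1 plays the role of the
    overestimation parameter mentioned in the abstract."""
    return autocorr(noisy_frame, max_lag) - alpha * autocorr(noise_estimate, max_lag)
```

The cross-correlation refinements in the abstract amount to adding an estimated speech/noise cross term to this subtraction; the sketch keeps only the uncorrelated-case baseline.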
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemeskevig
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy with some other task such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is widely used
for aligning two given multidimensional sequences. It finds an optimal match between the given sequences,
and the distance between the aligned sequences should be smaller than between unaligned
sequences; the improvement in the alignment may be estimated from the corresponding distances. This
technique has applications in speech recognition, speech synthesis, and speaker transformation. The
objective of this research is to investigate the amount of improvement in the alignment for
sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five
phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was
segmented manually and aligned at the sentence and phoneme levels. The aligned sentences of different speaker
pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW.
Mahalanobis distances were computed for each pair of sentences. The investigation showed more than
20% reduction in the average Mahalanobis distance.
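The DTW alignment step described above can be sketched with the textbook dynamic-programming recursion; this is a generic illustration, not the authors' code:

```python
import numpy as np

def dtw_distance(x, y):
    """Accumulated dynamic-time-warping cost between two feature sequences,
    using Euclidean local distance and the standard step pattern
    (match, insertion, deletion)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:                       # allow scalar-valued sequences
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

A sequence warped against a time-stretched copy of itself gets cost zero, which is exactly why DTW-aligned HNM frames yield smaller Mahalanobis distances than unaligned ones.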
Effect of MFCC Based Features for Speech Signal Alignmentskevig
The fundamental techniques used for man-machine communication include speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of the speech signals, and HNM analysis and synthesis provide high-quality speech with
a small number of parameters. Dynamic time warping is a well-known technique for aligning two given
multidimensional sequences. It locates an optimal match between the given sequences, and the improvement in
the alignment is estimated from the corresponding distances. The objective of this research is to investigate
the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals
in the form of twenty-five phrases were recorded. The recorded material was segmented manually and
aligned at the sentence, word, and phoneme levels, and the Mahalanobis distance (MD) was computed between the
aligned frames. The investigation showed better alignment in the HNM parametric domain. It has
also been seen that effective speech alignment can be carried out even at the phrase level.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...IJERA Editor
This paper presents a compressive sensing technique for speech reconstruction using linear predictive coding, because
speech is sparser in the LPC domain. The DCT of the speech is taken and DCT points of the sparse speech are discarded arbitrarily.
This is achieved by zeroing some points in the DCT domain through multiplication with mask functions. From the incomplete
points in the DCT domain, the original speech is reconstructed using compressive sensing, with Gradient
Projection for Sparse Reconstruction as the tool. The performance of the result is compared subjectively with the direct IDCT. The
experiment shows that the performance is better for compressive sensing than for the direct DCT reconstruction.
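The masking step described above, zeroing arbitrary DCT points of the signal, can be illustrated with an orthonormal DCT matrix. The GPSR solver itself is not reproduced here; this sketch shows only the mask and the direct-IDCT baseline that the paper compares against:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, so the inverse DCT is simply C.T."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = 1.0 / np.sqrt(n)
    return C

n = 256
C = dct_matrix(n)

# toy signal that is exactly sparse in the DCT domain (two nonzero coefficients)
coeffs = np.zeros(n)
coeffs[10], coeffs[40] = 1.0, 0.5
signal = C.T @ coeffs                  # inverse DCT

# "throw away DCT points arbitrarily": a random binary mask zeroes half of them
rng = np.random.default_rng(0)
mask = rng.random(n) > 0.5
kept = mask * (C @ signal)

# the paper's comparison baseline: direct inverse DCT of the masked coefficients
baseline = C.T @ kept
```

A sparse-recovery solver such as GPSR would instead search for the sparsest coefficient vector consistent with the kept points, which is why it can outperform the direct IDCT of the masked spectrum.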
Speech recognition is the next big step that technology needs to take for general users, and an Automatic Speech Recognition (ASR) system will play a major role in bringing new technology to users. Applications of ASR include speech-to-text conversion, voice input in aircraft, data entry, and voice user interfaces such as voice dialing. Speech recognition involves extracting features from the input signal and classifying them into classes using a pattern matching model; this is done using a feature extraction method. This paper presents a general study of automatic speech recognition and of various methods for building an ASR system. General techniques that can be used to implement ASR include artificial neural networks, Hidden Markov models, and the acoustic-phonetic approach.
LPC Models and Different Speech Enhancement Techniques- A Reviewijiert bestjournal
The author has already published one review paper on enhancing the quality of a speech signal by minimizing noise; this is the second paper in the same series. In the last two decades researchers have made continuous efforts to reduce the noise in speech signals, and this paper comments on the various studies and analysis proposals of those researchers for enhancing speech signal quality. Various models, coding schemes, speech quality improvement methods, speaker-dependent codebooks, autocorrelation subtraction, speech restoration, production of speech at low bit rates, compression, and enhancement are the aspects of speech enhancement covered. We present a review of all the above-mentioned technologies in this paper and intend to examine a few of the techniques, in order to analyze the factors affecting them, in an upcoming paper in the series.
Deep convolutional neural networks-based features for Indonesian large vocabu...IAESIJAI
There is great interest in developing speech recognition using deep
learning technologies due to their capability to model the complexity of
pronunciations, syntax, and language rules of speech data better than the
traditional hidden Markov model (HMM) does. However, a large amount of
data is necessary for deep learning-based speech recognition to be
effective. While this is not a problem for mainstream languages such as
English or Chinese, it is for non-mainstream languages such
as Indonesian. To overcome this limitation, we present deep features based
on convolutional neural networks (CNN) for Indonesian large vocabulary
continuous speech recognition in this paper. The CNN is trained
discriminatively, which is different from usual deep learning
implementations where the networks are trained generatively. Our
evaluations show that the proposed method on Indonesian speech data
achieves 7.26% and 9.01% error reduction rates over the state-of-the-art
deep belief networks-deep neural networks (DBN-DNN) for large
vocabulary continuous speech recognition (LVCSR), with Mel frequency
cepstral coefficients (MFCC) and filterbank (FBANK) used as features,
respectively. An error reduction rate of 6.13% is achieved compared to
CNN-DNN with generative training.
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...ijnlc
The quality of Neural Machine Translation (NMT) systems, like that of Statistical Machine Translation (SMT) systems, depends heavily on the size of the training data set, while for some pairs of languages high-quality parallel data are a scarce resource. To respond to this low-resource training data bottleneck, we employ the pivoting approach in both the neural MT and statistical MT frameworks. In our experiments on Persian-Spanish, taken as an under-resourced translation task, we found that this method, in both frameworks, significantly improves translation quality in comparison with the standard direct translation approach.
Isolated word recognition using lpc & vector quantizationeSAT Journals
Abstract: Speech recognition has always been looked upon as a fascinating field in human-computer interaction and is one of the fundamental steps toward understanding human recognition and behavior. This paper explicates the theory and implementation of speech recognition for a speaker-dependent, real-time isolated word recognizer. The core approach is first to obtain feature vectors using LPC, followed by vector quantization; the quantized vectors are then recognized by finding the minimum average distortion. All speech recognition systems contain two main phases, namely a training phase and a testing phase. In the training phase, the features of the words are extracted, and during the recognition phase feature matching takes place: the feature template extracted with LPC analysis is stored in the database, and during recognition the extracted features are compared with the templates in the database. Vector quantization is used to generate the codebooks, and the final recognition decision is made based on the matching score. MATLAB is used to implement this concept to achieve further understanding. Index Terms: Speech Recognition, LPC, Vector Quantization, Code Book.
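The LPC analysis step of the training phase can be sketched via the standard autocorrelation method with Levinson-Durbin recursion; this is a generic textbook sketch, not the paper's MATLAB code:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """LPC analysis of one speech frame via the autocorrelation method and
    Levinson-Durbin recursion. Returns (a, err): predictor coefficients
    a_1..a_order such that s[n] ~= sum_k a_k * s[n-k], and the residual
    (prediction-error) energy err."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # autocorrelation for lags 0..order
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)      # a[0] unused; a[1..order] hold the coefficients
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for this recursion step
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err
```

For a frame that decays as 0.9^n, a first-order analysis recovers a coefficient close to 0.9, which is a convenient sanity check; the resulting coefficient vectors are what the recognizer would then quantize against the VQ codebook.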
PERFORMANCE ANALYSIS OF BARKER CODE BASED ON THEIR CORRELATION PROPERTY IN MU...ijistjournal
Spread-spectrum communication, with its inherent interference attenuation capability, has over the years become an increasingly popular technique in many different systems, offering beneficial features such as anti-jamming, security, and multiple access. This thesis deals with the pseudo codes used in spread-spectrum communication systems. The cross-correlation and autocorrelation properties of long Barker codes are analyzed. The length of a code and its autocorrelation and cross-correlation properties help determine the most suitable code for a particular communication environment. We have tried to find codes with suitable autocorrelation properties along with low cross-correlation values: Barker codes have good autocorrelation properties, and we have found pairs with low cross-correlation so that they can be used in a multi-user environment.
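The autocorrelation property discussed above is easy to verify numerically for the length-13 Barker code, whose aperiodic autocorrelation has a peak of 13 at lag zero and sidelobes of magnitude at most 1:

```python
import numpy as np

# the standard length-13 Barker code
barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1])

def aperiodic_autocorr(code):
    """Aperiodic autocorrelation of a code for lags 0..len(code)-1."""
    n = len(code)
    return np.array([code[: n - k] @ code[k:] for k in range(n)])

acf = aperiodic_autocorr(barker13)
# Barker property: peak equal to the code length at lag 0,
# every sidelobe magnitude at most 1
```

The same routine applied to two different codes (a cross-correlation with one sequence shifted against the other) gives the low-cross-correlation criterion used to pick code pairs for the multi-user case.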
IEEE Transactions 2018 Topics with Abstracts in Audio, Speech, and Language Processing
For Details, Contact TSYS Academic Projects in Adyar.
Ph: 9841103123, 044-42607879
Website: http://www.tsysglobalsolutions.com/
Mail Id: tsysglobalsolutions2014@gmail.com.
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-
Network-Based Speech Waveform Synthesis
ABSTRACT
This paper presents a mel-cepstrum-based quantization noise shaping method for
improving the quality of synthetic speech generated by neural-network-based speech waveform
synthesis systems. Since mel-cepstral coefficients closely match the characteristics of human
auditory perception, the proposed method effectively masks the white noise introduced by the
quantization typically used in neural-network-based speech waveform synthesis systems. The
paper also describes a computationally efficient implementation of the proposed method using
the structure of the mel-log spectrum approximation filter. Experiments using the WaveNet
generative model, which is a state-of-the-art model for neural-network-based speech waveform
synthesis, showed that speech quality is significantly improved by the proposed method.
A Multi-Objective Learning and Ensembling Approach to High-Performance
Speech Enhancement with Compact Neural Network Architectures
ABSTRACT
In this study, we propose a novel deep neural network (DNN) architecture for speech
enhancement (SE) via a multi-objective learning and ensembling (MOLE) framework to achieve
a compact and low-latency design while maintaining good performance in quality evaluations.
MOLE follows the boosting concept when combining weak models into a strong classifier and
consists of two compact deep neural networks (DNNs). The first, called the multi-objective
learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-
frequency cepstral coefficients (MFCCs) and Gammatone frequency cepstral coefficients
(GFCCs), to predict a multi-objective set that includes the clean speech feature, the dynamic noise feature
and ideal ratio mask (IRM). The second, called the multi-objective ensembling DNN (MOE-
DNN), takes the learned features from MOL-DNN as inputs and separately predicts clean LPS
and IRM, clean MFCC and IRM and clean GFCC and IRM using three sets of weak regression
functions. Finally, a post-processing operation can be applied to the estimated clean features by
leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech
corrupted by 15 noise types not seen in model training, the speech enhancement results show that
the MOLE approach, which features a small model size and low run-time latency, can achieve
consistent improvements over both DNN- and long short-term memory (LSTM)-based
techniques in terms of all the objective metrics evaluated in this study for all three cases (the
input contexts contain 1-frame, 4-frame and 7-frame instances). The 1-frame MOLE-based SE
system outperforms the DNN-based SE system with a 7-frame input expansion at a 3-frame
delay, and also achieves better performance than the LSTM-based SE system with a 4-frame,
no-delay expansion including only the 3 previous frames, while requiring 170 times less
processing latency.
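The ideal ratio mask that the MOL-DNN is trained to predict has a standard definition; the sketch below uses that generic form (exponents vary across the literature) with synthetic spectra, and is not the authors' implementation.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power):
    """Generic ideal ratio mask (IRM): the fraction of each
    time-frequency bin's power attributed to clean speech.
    Values lie in [0, 1]; some variants apply a square root."""
    return speech_power / (speech_power + noise_power)

# Synthetic power spectra for two time-frequency bins.
speech = np.array([4.0, 1.0])
noise = np.array([1.0, 3.0])
print(ideal_ratio_mask(speech, noise))   # masks of 0.8 and 0.25
```

Multiplying the noisy magnitude spectrum by such a mask attenuates noise-dominated bins, which is why the IRM is a common training target alongside the clean features.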
Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional
Recurrent Neural Networks
ABSTRACT
In recent years, Deep Bidirectional Recurrent Neural Networks (DBRNN) and DBRNN
with Long Short-Term Memory cells (DBLSTM) have outperformed the most accurate
classifiers for confidence estimation in automatic speech recognition. At the same time, we have
recently shown that speaker adaptation of confidence measures using DBLSTM yields
significant improvements over non-adapted confidence measures. In accordance with these two
recent contributions to the state of the art in confidence estimation, this paper presents a
comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM
models. Firstly, we present new empirical evidence of the superiority of RNN-based confidence
classifiers evaluated over a large speech corpus consisting of the English LibriSpeech and the
Spanish poliMedia tasks. Secondly, we show new results on speaker-adapted confidence
measures considering a multi-task framework in which RNN-based confidence classifiers trained
with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm
that speaker-adapted confidence measures outperform their non-adapted counterparts. Lastly, we
describe an unsupervised adaptation method of the acoustic DBLSTM model based on
confidence measures which results in better automatic speech recognition performance.
Mispronunciation Detection in Children’s Reading of Sentences
ABSTRACT
This work proposes an approach to automatically parse children’s reading of sentences by
detecting word pronunciations and extra content, and to classify words as correctly or incorrectly
pronounced. This approach can be directly helpful for automatic assessment of reading level or
for automatic reading tutors, where a correct reading must be identified. We propose a first
segmentation stage to locate candidate word pronunciations based on allowing repetitions and
false starts of a word’s syllables. A decoding grammar based solely on syllables allows silence to
appear during a word pronunciation. At a second stage, word candidates are classified as
mispronounced or not. The feature that best classifies mispronunciations is found to be the log-
likelihood ratio between a free phone loop and a word spotting model in the very close vicinity
of the candidate segmentation. Additional features are combined in multi-feature models to
further improve classification, including normalizations of the log-likelihood ratio, features
derived from phone likelihoods, and Levenshtein distances between the correct pronunciation and
recognized phonemes through two phoneme recognition approaches. Results show that most
extra events were detected (close to 2% word error rate achieved) and that using automatic
segmentation for mispronunciation classification approaches the performance of manual
segmentation. Although the log-likelihood ratio from a spotting approach is already a good
metric to classify word pronunciations, the combination of additional features provides a relative
reduction of the miss rate of 18% (from 34.03% to 27.79% using manual segmentation and from
35.58% to 29.35% using automatic segmentation, at constant 5% false alarm rate).
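The Levenshtein distance used among the classification features has the standard dynamic-programming form; the sketch below uses illustrative phoneme labels, not the paper's phone set.

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences: the minimum number
    of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Canonical vs. recognized pronunciation of an illustrative word.
print(levenshtein(["k", "ae", "t"], ["k", "ah", "t"]))  # one substitution -> 1
```

A small distance between the recognized phonemes and the canonical pronunciation is evidence of a correct reading; a large one flags a likely mispronunciation.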
Analysis of the Reconstruction of Sparse Signals in the DCT Domain Applied
to Audio Signals
ABSTRACT
Sparse signals can be reconstructed from a reduced set of signal samples using
compressive sensing (CS) methods. The discrete cosine transform (DCT) can provide highly
concentrated representations of audio signals. This property makes the DCT a good sparsity
domain for audio signals. In this paper, the DCT is studied within the context of sparse audio
signal processing using the CS theory and methods. The DCT coefficients of a sparse signal,
calculated with a reduced set of available samples, can be modeled as random variables. It has
been shown that the statistical properties of these variables are closely related to the unique
reconstruction conditions. The main result of the paper is an exact formula for the mean
square reconstruction error in the case of approximately sparse and nonsparse noisy signals,
reconstructed under the sparsity assumption. Based on the presented analysis, a simple and
computationally efficient reconstruction algorithm is proposed. The presented theoretical
concepts and the efficiency of the reconstruction algorithm are verified numerically, including
examples with synthetic and recorded audio signals with unavailable or corrupted samples.
Random disturbances, as well as click-like disturbances and missing-sample (inpainting) scenarios, are
considered. Statistical verification is done on a dataset with experimental signals. Results are
compared with some classical and recent methods used in similar signal and disturbance
scenarios.
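The reconstruction idea can be illustrated with a minimal numpy sketch: estimate the DCT coefficients from the available samples only, detect the largest ones, and refit them by least squares. This is a simplified, noise-free illustration of threshold-based CS reconstruction with illustrative positions and values, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 64, 2, 56                # signal length, sparsity, available samples

# Orthonormal DCT-II matrix: X = C @ x and x = C.T @ X, with C @ C.T = I.
k = np.arange(N)[:, None]
n = np.arange(N)[None, :]
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0] /= np.sqrt(2.0)

# Signal with K nonzero DCT coefficients, observed at M random positions.
X = np.zeros(N)
X[[5, 20]] = [30.0, -20.0]
x = C.T @ X
avail = np.sort(rng.choice(N, M, replace=False))

# DCT estimate computed from the available samples only (rescaled projection);
# its largest-magnitude entries reveal the support of the sparse coefficients.
X0 = (N / M) * C[:, avail] @ x[avail]
support = np.sort(np.argsort(np.abs(X0))[-K:])

# Least-squares refit of the detected coefficients on the available samples.
A = C.T[np.ix_(avail, support)]
X_rec = np.zeros(N)
X_rec[support] = np.linalg.lstsq(A, x[avail], rcond=None)[0]
print(support, np.round(X_rec[support], 3))
```

The partial-sum DCT estimates of the missing coefficients behave like zero-mean random variables, which is exactly the statistical view the paper formalizes to derive its error formula.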
Speech Dereverberation with Context-Aware Recurrent Neural Networks
ABSTRACT
In this paper, we propose a model to perform speech dereverberation by estimating the
spectral magnitude of the clean speech from its reverberant counterpart. Our models are capable
of extracting
features that take into account both short and long-term dependencies in the signal through a
convolutional encoder (which extracts features from a short, bounded context of frames) and a
recurrent neural network for extracting long-term information. Our model outperforms a recently
proposed model that uses different context information depending on the reverberation time,
without requiring any sort of additional input, yielding improvements of up to 0.4 on PESQ, 0.3
on STOI, and 1.0 on POLQA relative to reverberant speech. We also show our model is able to
generalize to real room impulse responses even when only trained with simulated room impulse
responses, different speakers, and high reverberation times. Lastly, listening tests show the
proposed method outperforming benchmark models in reduction of perceived reverberation.
Do we need individual head-related transfer functions for vertical
localization? The case study of a spectral notch distance metric
ABSTRACT
This paper deals with the issue of individualizing the head-related transfer function
(HRTF) rendering process for auditory elevation perception: is it possible to find a
non-individual, personalized HRTF set that allows a listener to achieve localization performance
as accurate as with his/her individual HRTFs? We propose a psychoacoustically
motivated, anthropometry-based mismatch function between HRTF pairs that exploits the close
relation between the listener’s pinna geometry and localization cues. This is evaluated using an
auditory model that computes a mapping between HRTF spectra and perceived spatial locations.
Results on a large number of subjects in the CIPIC and ARI HRTF databases suggest that there
exists a non-individual HRTF set that allows a listener to achieve vertical localization as
accurate as with individual HRTFs. Furthermore, we find the optimal parametrization of
the proposed mismatch function, i.e. the one that best reflects the information given by the
auditory model. Our findings show that the selection procedure yields statistically significant
improvements over dummy-head HRTFs or random HRTF selection, with potentially
high practical impact.
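As a toy illustration of an anthropometry-motivated mismatch, one could compare the center frequencies of the pinna notches extracted from two HRTF sets. The metric and the frequency values below are purely illustrative stand-ins, not the parametrization selected in the paper.

```python
import numpy as np

def notch_mismatch(notches_a, notches_b):
    """Illustrative mismatch between two HRTF sets: mean absolute
    distance (Hz) between corresponding pinna-notch center frequencies,
    one notch per elevation."""
    a = np.asarray(notches_a, dtype=float)
    b = np.asarray(notches_b, dtype=float)
    return float(np.mean(np.abs(a - b)))

# Hypothetical first-notch frequencies at three elevations, for a listener
# and one candidate non-individual HRTF set.
listener = [6200.0, 7400.0, 9100.0]
candidate = [6000.0, 7800.0, 9050.0]
print(notch_mismatch(listener, candidate))   # (200 + 400 + 50) / 3 Hz
```

Ranking all candidate sets by such a mismatch and picking the minimum is the selection idea the paper evaluates against the auditory-model predictions.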
Interaural Coherence Preservation for Binaural Noise Reduction Using
Partial Noise Estimation and Spectral Postfiltering
ABSTRACT
The objective of binaural speech enhancement algorithms is to reduce the undesired noise
component, while preserving the desired speech source and the binaural cues of all sound
sources. For the scenario of a single desired speech source in a diffuse noise field, an extension
of the binaural multi-channel Wiener filter (MWF), namely the MWF-IC, has been recently
proposed, which aims to preserve the interaural coherence (IC) of the noise component.
However, due to the large complexity of the MWF-IC, in this paper we propose several
alternative algorithms at a lower computational complexity. First, we consider a
quasi-distortionless version of the MWF-IC, denoted as MVDR-IC. Secondly, we propose to
preserve the IC of the noise component using the binaural MWF with partial noise estimation
(MWF-N) and the binaural minimum variance distortionless response beamformer with partial
noise estimation (MVDR-N), for which closed-form expressions exist. In addition, we show that
for the MVDR-N a closed-form expression can be derived for the tradeoff parameter yielding a
desired magnitude squared coherence (MSC) for the output noise component. Since, in contrast
to the MWF-IC and the MWF-N, the MVDR-IC and the MVDR-N do not take into account the
spectro-temporal properties of the speech and the noise components, we propose to apply a
spectral postfilter to the filter outputs, improving the noise reduction performance. The
performance of all algorithms is compared in several diffuse noise scenarios. The simulation
results show that both the MVDR-IC and the MVDR-N are able to preserve the MSC of the
noise component, while the MVDR-IC generally shows a slightly better noise reduction
performance at a larger complexity. Further simulation results show that applying a spectral
postfilter leads to a very similar performance for all considered algorithms in terms of noise
reduction and speech distortion.
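The partial-noise-estimation idea behind the MWF-N and MVDR-N can be sketched generically: scale the spatial filter by (1 - eta) and pass a fraction eta of the noisy reference microphone straight through, so part of the diffuse noise, and hence its interaural coherence, is deliberately kept. The weights and signal values below are toy numbers, not the paper's closed-form solutions.

```python
import numpy as np

def partial_noise_output(w_mwf, y, ref_idx, eta):
    """Partial-noise-estimation output at one frequency bin: blend the
    spatial filter output with the noisy reference channel so a fraction
    eta of the noise (and its binaural cues) survives filtering."""
    w_n = (1.0 - eta) * np.asarray(w_mwf, dtype=float).copy()
    w_n[ref_idx] += eta           # pass eta of the reference channel through
    return float(w_n @ np.asarray(y, dtype=float))

# Toy two-microphone snapshot at a single frequency bin.
w = np.array([0.6, 0.4])          # illustrative Wiener weights
y = np.array([1.0, 2.0])          # noisy observations
print(partial_noise_output(w, y, ref_idx=0, eta=0.0))  # pure MWF output: 1.4
print(partial_noise_output(w, y, ref_idx=0, eta=1.0))  # reference passthrough: 1.0
```

Choosing eta trades noise reduction against cue preservation; the paper derives a closed-form eta yielding a desired output MSC, which this sketch does not attempt.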
Gating Neural Network for Large Vocabulary Audiovisual Speech
Recognition
ABSTRACT
Audio-based automatic speech recognition (A-ASR) systems are affected by noisy
conditions in real-world applications. Adding visual cues to the ASR system is an appealing
alternative to improve the robustness of the system, replicating the audiovisual perception
process used during human interactions. A common problem observed when using audiovisual
automatic speech recognition (AV-ASR) is the drop in performance when speech is clean. In this
case, visual features may not provide complementary information, introducing variability that
negatively affects the performance of the system. The experimental evaluation in this study
clearly demonstrates this problem when we train an audiovisual state-of-the-art hybrid system
with a deep neural network (DNN) and hidden Markov models (HMMs). This study proposes a
framework that addresses this problem, improving, or at least, maintaining the performance
when visual features are used. The proposed approach is a deep learning solution with a gating
layer that diminishes the effect of noisy or uninformative visual features, keeping only useful
information. The framework is implemented with a subset of the audiovisual CRSS-4ENGLISH-
14 corpus, which consists of 61 hours of speech from 105 subjects, collected simultaneously with
multiple cameras and microphones. The proposed framework is compared with conventional
HMMs with observation models implemented with either a Gaussian mixture model (GMM) or
DNNs. We also compare the system with a multi-stream hidden Markov model (MS-HMM)
system. The experimental evaluation indicates that the proposed framework outperforms
alternative methods under all configurations, showing the robustness of the gating-based
framework for AV-ASR.
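A gating layer of the kind described can be sketched in a few lines: a learned gate in [0, 1], computed from both modalities, scales the visual stream before fusion. The weight shapes and values below are illustrative, not the trained parameters of the proposed system.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(audio, visual, W_gate, b_gate):
    """Concatenate audio features with gated visual features; near-zero
    gates suppress uninformative visual input (e.g., in clean speech),
    while gates near one let it contribute fully."""
    gate = sigmoid(W_gate @ np.concatenate([audio, visual]) + b_gate)
    return np.concatenate([audio, gate * visual])

audio = np.array([0.5, -1.0])
visual = np.array([2.0, 0.3])
W = np.zeros((2, 4))                                    # untrained toy weights
open_gate = gated_fusion(audio, visual, W, np.full(2, 10.0))   # gate ~ 1
shut_gate = gated_fusion(audio, visual, W, np.full(2, -10.0))  # gate ~ 0
```

When the gate saturates low, the fused vector degenerates to the audio features alone, which is exactly the behavior wanted under clean-speech conditions.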
Bias-Compensated Informed Sound Source Localization Using Relative
Transfer Functions
ABSTRACT
In this paper, we consider the problem of estimating the target sound direction of arrival
(DoA) for a hearing aid (HA) system, which can connect to a wireless microphone worn by the
talker of interest. The wireless microphone “informs” the HA system about the noise-free target
speech. To estimate the DoA, we consider a maximum-likelihood approach, and we assume that
a database of DoA-dependent relative transfer functions (RTFs) has been measured in advance
and is available. The proposed DoA estimator is able to take the available noise-free target
speech, ambient noise characteristics, and the shadowing effect of the user’s head on the received
signals into account, and it supports both monaural and binaural microphone array configurations.
Moreover, we analyze the bias in the proposed estimator analytically and introduce a modified,
bias-compensated estimator. We demonstrate that the proposed method
has lower computational complexity and better performance than recent RTF-based estimators.
Furthermore, to decrease the number of parameters required to be wirelessly exchanged between
the HAs in binaural configurations, we propose an information fusion (IF) strategy, which avoids
transmitting microphone signals between the HAs. An important benefit of the proposed IF
strategy is that the number of parameters to be exchanged between the HAs is independent of the
number of HA microphones. Finally, we investigate the performance of variants of the proposed
estimator extensively in different noisy and reverberant situations.
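The database-matching idea can be shown in a heavily simplified, noise-free, two-microphone sketch: pick the direction whose pre-measured relative transfer function best explains the observed channel ratio. The names, the two-bin RTFs and the least-squares criterion are illustrative stand-ins; the paper's maximum-likelihood estimator additionally exploits the wirelessly received clean speech and the noise statistics.

```python
import numpy as np

def informed_doa(y_left, y_right, rtf_db):
    """Return the index of the database direction whose relative transfer
    function (RTF) best matches the observed right/left channel ratio --
    a least-squares stand-in for the maximum-likelihood criterion."""
    observed = y_right / y_left                 # per-bin RTF estimate
    errors = [np.sum(np.abs(observed - rtf) ** 2) for rtf in rtf_db]
    return int(np.argmin(errors))

# Two hypothetical DoA entries, each an RTF over two frequency bins.
rtf_db = [np.array([1.0 + 0.5j, 0.8 - 0.2j]),
          np.array([0.3 - 0.9j, 1.2 + 0.1j])]
s = np.array([1.0 + 1.0j, -0.5 + 2.0j])        # target speech at the left mic
y_left, y_right = s, rtf_db[1] * s             # noise-free propagation
print(informed_doa(y_left, y_right, rtf_db))   # -> 1
```

Because only the per-direction match scores (not the microphone signals themselves) need to be shared, this style of matching is compatible with the low-rate information fusion between the two hearing aids described above.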