This paper presents an approach to speech recognition that uses frequency spectral information with the Mel frequency scale to improve speech feature representation in an HMM-based recognizer. Frequency spectral information is incorporated into the conventional Mel-spectrum-based speech recognition approach. The Mel frequency approach observes the speech signal at a single given resolution, which leads to overlapping resolution features and limits recognition. Resolution decomposition with frequency separation is proposed as a mapping approach for an HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
SEARCH TIME REDUCTION USING HIDDEN MARKOV MODELS FOR ISOLATED DIGIT RECOGNITION (cscpconf)
This paper reports a word modeling algorithm for Malayalam isolated digit recognition that reduces the search time in the classification process. A recognition experiment is carried out for the 10 Malayalam digits using Mel Frequency Cepstral Coefficient (MFCC) feature parameters and the k-Nearest Neighbor (k-NN) classification algorithm. A word modeling scheme using the Hidden Markov Model (HMM) algorithm is developed. The experimental results show that, in a telephony application, the proposed algorithm reduces the search time of the classification process by 80% for first-digit recognition.
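The k-NN classification step described in this abstract can be sketched in a few lines. This is a minimal illustration only; the toy 2-D vectors stand in for real MFCC frames, and the digit labels are assumptions, not data from the paper:

```python
import math

def knn_classify(train, query, k=3):
    """Classify a feature vector by majority vote among its k nearest
    training vectors under Euclidean distance."""
    dists = sorted((math.dist(vec, query), label) for label, vec in train)
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy 2-D "MFCC-like" vectors for two digit classes (illustrative only).
train = [
    ("one", (0.0, 0.1)), ("one", (0.2, 0.0)), ("one", (0.1, 0.2)),
    ("two", (1.0, 1.1)), ("two", (0.9, 1.0)), ("two", (1.1, 0.9)),
]
print(knn_classify(train, (0.1, 0.1)))  # → one
```

Real systems classify whole utterances, typically by accumulating frame distances or using a fixed-length summary vector; the HMM word model in the paper prunes which templates need to be compared at all.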
FPGA-based implementation of speech recognition for robocar control using MFCC (TELKOMNIKA JOURNAL)
This research proposes a simulation of the speech recognition logic chain, based on MFCC (Mel Frequency Cepstral Coefficients) on FPGA and Euclidean distance, to control the motion of a robotic car. The recognized speech is used as a command to operate the robotic car. MFCC was used in the feature extraction process, while Euclidean distance was applied in the feature classification process; each classified utterance is then forwarded to the decision stage, which issues the control logic to the robot motor. The tests conducted showed that the designed logic chain is precise, as verified by measuring the Mel frequency warping and power cepstrum stages. The logic design was validated by comparing Matlab computation against Xilinx simulation, which should make it straightforward for researchers to continue its implementation on FPGA hardware.
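The Mel frequency warping stage mentioned in the abstract maps linear frequency to the perceptual mel scale. A hedged sketch of the idea follows; the 2595·log10 form is the common convention, and the 4 kHz band edge and filter count are illustrative assumptions, not values from the paper:

```python
import math

def hz_to_mel(f_hz):
    # Common mel-scale formula used in MFCC front ends.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Mel frequency warping: 10 filter centre frequencies equally spaced
# on the mel axis between 0 Hz and an assumed 4 kHz band edge.
lo, hi = hz_to_mel(0.0), hz_to_mel(4000.0)
centres = [mel_to_hz(lo + i * (hi - lo) / 11) for i in range(1, 11)]
print([round(c) for c in centres])
```

Note how the centres crowd together at low frequencies and spread out at high ones, which is exactly the resolution behaviour the mel warp is designed to give.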
Identification of frequency domain using quantum based optimization neural ne... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Comparative Study of Different Techniques in Speaker Recognition: Review (IJAEMS JOURNAL)
Speech is the most basic and essential method of communication used by people. A speaker is recognized on the basis of the individual information included in his or her speech signals. Speaker recognition (SR) is used to identify the person who is speaking, and in recent years it has been applied to security systems. In this paper we discuss feature extraction techniques such as Mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC) and dynamic time warping (DTW), and, for classification, Gaussian mixture models (GMM), artificial neural networks (ANN) and support vector machines (SVM).
Effect of Time Derivatives of MFCC Features on HMM Based Speech Recognition S... (IDES Editor)
In this paper, the improvement of an ASR system for the Hindi language, based on vector-quantized MFCC as feature vectors and an HMM as classifier, is discussed. MFCC features are usually pre-processed before being used for recognition. One such pre-processing step is to create delta and delta-delta coefficients and append them to the MFCC to form the feature vector. This paper focuses on all digits in Hindi (zero to nine), based on an isolated-word structure. Performance of the system is evaluated by the Recognition Rate (RR). Combining the Delta MFCC (DMFCC) and Delta-Delta MFCC (DDMFCC) features yields approximately 2.5% further improvement in the RR, with no additional computational cost involved. The RR for speakers involved in the training phase is found to be better than that for speakers who were not involved in the training phase. Word-wise RR is observed to be good for digits with distinct phones.
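The delta and delta-delta coefficients described above are conventionally computed with a regression over neighbouring frames. A minimal sketch follows; the window width N=2 and the one-coefficient toy frames are assumptions for illustration:

```python
def delta(frames, N=2):
    """Delta coefficients via the standard regression formula
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2),
    clamping frame indices at the utterance edges."""
    T = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for k in range(len(frames[0])):
            num = sum(
                n * (frames[min(t + n, T - 1)][k] - frames[max(t - n, 0)][k])
                for n in range(1, N + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out

mfcc = [[0.0], [1.0], [2.0], [3.0]]   # toy: one coefficient per frame
d = delta(mfcc)                        # delta (edge frames are damped)
dd = delta(d)                          # delta-delta
# Appending d and dd to the static MFCC triples the feature dimension:
feature = [m + a + b for m, a, b in zip(mfcc, d, dd)]
print(d)
```

This tripling of the feature dimension is why the paper can report an RR gain "with no additional computational costs" only relative to the recognition stage: the derivatives are cheap to compute compared with the HMM evaluation itself.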
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in Science and Technology, and to encourage and motivate researchers in challenging areas of Science and Technology.
FEATURE SELECTION USING FISHER'S RATIO TECHNIQUE FOR AUTOMATIC ... (IJCI JOURNAL)
Automatic Speech Recognition (ASR) involves mainly two steps: feature extraction and classification (pattern recognition). Mel Frequency Cepstral Coefficients (MFCC) are among the prominent feature extraction techniques in ASR. Usually, the set of all 12 MFCC coefficients is used as the feature vector in the classification step. But the question is whether the same or improved classification accuracy can be achieved by using a subset of the 12 MFCC as the feature vector. In this paper, Fisher's ratio technique is used for selecting a subset of the 12 MFCC coefficients that contribute more to discriminating a pattern. The selected coefficients are used in classification with the Hidden Markov Model (HMM) algorithm. The classification accuracies obtained using all 12 coefficients and using the selected coefficients are compared.
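Fisher's ratio for a single coefficient between two classes is the squared difference of class means divided by the sum of class variances; coefficients with high ratios discriminate better. A hedged two-class sketch with toy vectors (the data and dimensionality are illustrative assumptions):

```python
def fisher_ratio(class_a, class_b):
    """Per-dimension Fisher's ratio (mu_a - mu_b)^2 / (var_a + var_b)
    between two classes of feature vectors; higher values mark
    dimensions that discriminate the classes better."""
    def stats(x):
        n, d = len(x), len(x[0])
        mu = [sum(v[k] for v in x) / n for k in range(d)]
        var = [sum((v[k] - mu[k]) ** 2 for v in x) / n for k in range(d)]
        return mu, var
    mu_a, var_a = stats(class_a)
    mu_b, var_b = stats(class_b)
    return [(mu_a[k] - mu_b[k]) ** 2 / (var_a[k] + var_b[k])
            for k in range(len(mu_a))]

# Dimension 0 separates the two toy classes; dimension 1 does not.
a = [[0.0, 5.0], [0.2, 4.0], [0.1, 6.0]]
b = [[2.0, 5.1], [2.2, 3.9], [2.1, 6.0]]
ratios = fisher_ratio(a, b)
best = max(range(len(ratios)), key=ratios.__getitem__)
print(best)  # → 0
```

Selecting the top-ranked coefficients by this score, rather than all 12, is the subset-selection idea the abstract investigates (the multi-class case would pool pairwise ratios, a detail the sketch omits).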
A Text-Independent Speaker Identification System based on The Zak Transform (CSCJournals)
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The identification efficiency approaches 91.3% using a single test file and 100% using two test files. The method shows efficiency comparable to the well-known MFCC method, with the advantage of being faster in both modeling and identification.
Speaker Identification From Youtube Obtained Data (sipij)
An efficient and intuitive algorithm is presented for the identification of speakers from a long dataset (such as a long YouTube discussion, or cocktail-party recorded audio or video). The goal of automatic speaker identification is to identify the number of different speakers and to prepare a model for each speaker by extracting and characterizing the speaker-specific information contained in the speech signal. It has many diverse applications, especially in surveillance, immigration control at airports, cyber security, and transcription of recordings with multiple similar sound sources, where it is difficult to assign transcriptions.

The most common speech parameterizations used in speaker verification, k-means and cepstral analysis, are detailed. Gaussian mixture modeling, the speaker modeling technique used, is then explained. Gaussian mixture models (GMM), perhaps the most robust machine learning approach for this task, are introduced to examine text-independent speaker identification. The use of GMMs for analysing speaker identity is motivated by the experience that Gaussian components capture the characteristics of a speaker's spectral pattern, and by the remarkable ability of GMMs to model arbitrary densities. We then illustrate expectation maximization (EM), an iterative algorithm that starts from an arbitrary initial estimate and iterates until convergence is observed. Using speaker models based on vector quantization and Gaussian mixture models, and aiming for 85-95% accuracy, over a number of experiments we obtained an identification rate of 79-82% using vector quantization and 85-92.6% using GMM modeling with EM parameter estimation, depending on the parameter variation.
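The EM procedure the abstract describes (an arbitrary initial estimate, then iteration until convergence) can be sketched for a two-component 1-D Gaussian mixture. The synthetic data, component count and iteration budget are assumptions for illustration; real speaker models use many multivariate components:

```python
import math, random

def em_gmm_1d(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture by expectation
    maximization: E step computes responsibilities, M step
    re-estimates weights, means and variances."""
    mu = [min(data), max(data)]      # arbitrary initial estimates
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M step: re-estimate parameters from the responsibilities.
        for j in (0, 1):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, data)) / nj + 1e-6
    return mu, var, w

random.seed(0)
data = ([random.gauss(0.0, 0.5) for _ in range(200)]
        + [random.gauss(5.0, 0.5) for _ in range(200)])
mu, var, w = em_gmm_1d(data)
print(sorted(round(m, 1) for m in mu))
```

Identification then scores a test utterance against each speaker's fitted mixture and picks the speaker with the highest likelihood.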
This paper reports on an Audio-Visual Client Recognition System, implemented in Matlab, which identifies five clients and can be extended to identify more depending on the number of clients it is trained on. The visual recognition part was implemented using Principal Component Analysis, Linear Discriminant Analysis and a Nearest Neighbour classifier. The audio recognition part was implemented using Mel-Frequency Cepstrum Coefficients, Linear Discriminant Analysis and a Nearest Neighbour classifier. The system was tested with images and sounds it had not been trained on, to check whether it could detect an intruder, and it responded to intruders with high precision.
MULTI USER DETECTOR IN CDMA USING ELLIPTIC CURVE CRYPTOGRAPHY (VLSICS Design)
Code division multiple access (CDMA) is used in various radio communication techniques due to its advantages. One of the most important processes in CDMA is multi-user detection (MUD). There are numerous methods for MUD in CDMA, but most of them identify the exact user while leaving the interference signal high. One of the methods used for MUD in CDMA is elliptic curve cryptography (ECC). Normally, multi-user detection in CDMA using elliptic curve cryptography is performed over a single prime field; the ECC method identifies the exact user and also reduces the interference signal compared with other techniques. To reduce the interference signal further, we propose a new technique for MUD in CDMA using ECC that uses multiple prime numbers for key generation. By generating keys with different prime numbers using ECC, the bit error rate becomes very low. The results show that the proposed method reduces the bit error rate for MUD in CDMA.
The accuracy of a speaker identification system degrades severely in the presence of noise. The auditory-based features called gammatone frequency cepstral coefficients (GFCC) resemble the human ear's ability to identify a speaker in a noisy environment. GFCC is based on a gammatone filter bank, which models the basilar membrane as a sequence of overlapping band-pass filters. The system uses GFCC features and a Gaussian Mixture Model - Universal Background Model (GMM-UBM) for modeling the speaker's features. The experiments are conducted on the English Language Speech Database for Speaker Recognition (ELSDSR). To evaluate the performance of the system, the test utterances are mixed with noise at various SNR levels to simulate channel change. The results show that GFCC features give good identification accuracy not only on clean samples but also under noisy conditions.
Development of Quranic Reciter Identification System using MFCC and GMM Clas... (IJECEIAES)
Nowadays, many beautiful recitations of Al-Quran are available. Quranic recitation has its own characteristics, and the problem of identifying the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop a Quran reciter identification system using Mel-frequency Cepstral Coefficients (MFCC) and a Gaussian Mixture Model (GMM). A database of five Quranic reciters is developed and used in the training and testing phases. We carefully randomized the database across various surahs of the Quran so that the proposed system is sensitive only to the reciter, not to the recited verses. Around 15 Quranic audio samples from the 5 reciters were collected and randomized, of which 10 were used for training the GMM and 5 for testing. Results showed that our proposed system achieves a 100% recognition rate for the five reciters tested. Even when tested with unknown samples, the proposed system is able to reject them.
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION (csandit)
Emotional state recognition through speech is a very interesting research topic nowadays. Using subliminal information in speech, it is possible to recognize the emotional state of the speaker. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns. This fact makes the learning process more difficult, due to the generalization problems that arise under these conditions.

In this work we propose a solution to this problem consisting in enlarging the training set through the creation of new virtual patterns. In the case of emotional speech, most of the emotional information is carried by speed and pitch variations, so a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We use this prior information to create new patterns by applying a pitch-shift modification in the feature extraction process of the classification system. For this purpose, we propose a frequency-scaling modification of the Mel Frequency Cepstral Coefficients used to classify the emotion. This process allows us to synthetically increase the number of available patterns in the training set, thus increasing the generalization capability of the system and reducing the test error.
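One simple way to realize the frequency-scaling idea above is to warp the frequency axis of each magnitude spectrum before the mel filterbank, so that the same utterance yields several pitch-shifted virtual training patterns. This is a sketch in the spirit of the method, not the paper's exact formulation; the linear-interpolation warp and the toy spectrum are assumptions:

```python
def scale_spectrum(spec, alpha):
    """Warp a magnitude spectrum's frequency axis by factor alpha
    (alpha > 1 shifts content upward) via linear interpolation.
    Feeding the warped spectrum into the usual mel filterbank /
    DCT pipeline yields a pitch-shifted virtual MFCC pattern."""
    n = len(spec)
    out = []
    for i in range(n):
        pos = i / alpha              # source bin for target bin i
        lo = int(pos)
        frac = pos - lo
        if lo + 1 < n:
            out.append((1 - frac) * spec[lo] + frac * spec[lo + 1])
        elif lo < n:
            out.append(spec[lo])
        else:
            out.append(0.0)          # shifted past the band edge
    return out

spec = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(scale_spectrum(spec, 2.0))     # each target bin reads bin i/2
```

Generating warped copies at a few values of alpha around 1.0 multiplies the training set size without new recordings, which is exactly the data-scarcity problem the abstract targets.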
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG... (sipij)
Previous research has found the autocorrelation domain to be an appropriate domain for signal and noise separation. This paper discusses a simple and effective method for decreasing the effect of noise on the autocorrelation of the clean signal, which can later be used in extracting mel cepstral parameters for speech recognition. Two different methods are proposed to deal with the error introduced by assuming speech and noise to be completely uncorrelated. The basic approach reduces the effect of noise by estimating its autocorrelation and subtracting it from the noisy speech signal autocorrelation. To improve this method, we consider inserting a speech/noise cross-correlation term into the equations used for estimating the clean speech autocorrelation, using an estimate of it found through a kernel method. Alternatively, we estimate the cross-correlation term using an averaging approach. A further improvement was obtained by introducing an overestimation parameter into the basic method. We tested our proposed methods on the Aurora 2 task. The basic method shows considerable improvement over the standard features and some other robust autocorrelation-based features, and the proposed techniques further increase its robustness.
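The basic approach above can be sketched directly: estimate the noise autocorrelation from noise-only frames, then subtract it (optionally scaled by an overestimation factor) from the noisy-speech autocorrelation. A minimal illustration under the uncorrelatedness assumption; the signals, lag count and factor are toy values:

```python
import math

def autocorr(frame, max_lag):
    # Biased sample autocorrelation for lags 0..max_lag-1.
    return [sum(frame[t] * frame[t + k] for t in range(len(frame) - k))
            for k in range(max_lag)]

def clean_autocorr(noisy_frame, noise_frames, max_lag=8, rho=1.0):
    """Estimate the clean-speech autocorrelation by averaging the
    autocorrelation of noise-only frames and subtracting it (scaled
    by overestimation factor rho) from the noisy autocorrelation.
    Assumes speech and noise are uncorrelated."""
    r_noisy = autocorr(noisy_frame, max_lag)
    r_noise = [sum(autocorr(f, max_lag)[k] for f in noise_frames)
               / len(noise_frames) for k in range(max_lag)]
    return [r_noisy[k] - rho * r_noise[k] for k in range(max_lag)]

speech = [math.sin(0.3 * t) for t in range(64)]      # stand-in clean frame
noise = [0.2 * math.cos(1.1 * t) for t in range(64)]
noisy = [s + n for s, n in zip(speech, noise)]
r_hat = clean_autocorr(noisy, [noise], max_lag=4)
print([round(r, 2) for r in r_hat])
```

Even with a perfect noise estimate, the residual error in `r_hat` is exactly the speech/noise cross-correlation term, which is what the paper's kernel and averaging refinements go on to estimate.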
Limited Data Speaker Verification: Fusion of Features (IJECEIAES)
The present work demonstrates an experimental evaluation of speaker verification for different speech feature extraction techniques under the constraint of limited data (less than 15 seconds). State-of-the-art speaker verification techniques provide good performance given sufficient data (greater than 1 minute); it is a challenging task to develop techniques which perform well for speaker verification under the limited-data condition. In this work, features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Delta (Δ), Delta-Delta (ΔΔ), Linear Prediction Residual (LPR) and Linear Prediction Residual Phase (LPRP) are considered. The performance of the individual features is studied, and, for better verification performance, combinations of these features are attempted. A comparative study is made between the Gaussian mixture model (GMM) and the GMM-universal background model (GMM-UBM) through experimental evaluation on the NIST-2003 database. The experimental results show that the combination of features provides better performance than the individual features, and that GMM-UBM modeling gives a reduced equal error rate (EER) compared to GMM.
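One straightforward way to combine feature streams, as attempted above, is frame-level fusion by concatenation. A hedged sketch; the stream names and toy values are assumptions, and the paper may also use score-level combination:

```python
def fuse(*feature_streams):
    """Frame-level fusion by concatenation: each stream is a list of
    T per-frame vectors; the fused stream stacks them per frame,
    one simple way to combine MFCC, LPCC, deltas, residual features."""
    T = len(feature_streams[0])
    assert all(len(s) == T for s in feature_streams), "streams must align"
    return [sum((list(s[t]) for s in feature_streams), [])
            for t in range(T)]

mfcc = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames x 2 coefficients (toy)
lpcc = [[5.0], [6.0]]             # 2 frames x 1 coefficient (toy)
fused = fuse(mfcc, lpcc)
print(fused)  # → [[1.0, 2.0, 5.0], [3.0, 4.0, 6.0]]
```

The fused vectors then feed the GMM or GMM-UBM model exactly as a single feature stream would; the streams must be frame-synchronized for this to be meaningful.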
Investigating the performance of various channel estimation techniques for mi... (ijmnct)
This paper simulates and investigates the performance of four widely used channel estimation techniques for MIMO-OFDM wireless communication systems, namely the superimposed pilot (SIP), comb-type, space-time block coding (STBC) and space-frequency block coding (SFBC) techniques. The performance is evaluated through a number of MATLAB simulations, where the bit-error rate (BER) and the mean square error (MSE) are estimated and compared for different levels of signal-to-noise ratio (SNR). The simulation results demonstrate that the comb-type channel estimation and SIP techniques outperform the SFBC and STBC techniques in terms of both BER and MSE.
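The comb-type scheme places pilots on every few subcarriers, takes least-squares estimates H = Y/X there, and interpolates across the data subcarriers. A minimal sketch under a noiseless, linearly varying toy channel; the pilot spacing, unit pilot symbols and channel shape are assumptions:

```python
def comb_ls_estimate(rx, tx_pilots, pilot_idx, n_sc):
    """Comb-type LS channel estimation: H = Y/X at the pilot
    subcarriers, then linear interpolation over the remaining
    subcarriers (flat extrapolation past the last pilot)."""
    h_p = [rx[i] / x for i, x in zip(pilot_idx, tx_pilots)]
    h = [0j] * n_sc
    for a in range(len(pilot_idx) - 1):
        i0, i1 = pilot_idx[a], pilot_idx[a + 1]
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)
            h[i] = (1 - t) * h_p[a] + t * h_p[a + 1]
    for i in range(pilot_idx[-1], n_sc):   # hold the last estimate
        h[i] = h_p[-1]
    return h

n_sc = 9
pilot_idx = [0, 4, 8]                       # pilot every 4th subcarrier
pilots = [1 + 0j] * 3                       # known unit pilot symbols
h_true = [complex(1 + 0.25 * i, 0.0) for i in range(n_sc)]
rx = [h_true[i] * 1 for i in range(n_sc)]   # noiseless received symbols
h_est = comb_ls_estimate(rx, pilots, pilot_idx, n_sc)
```

For this linear toy channel the interpolated estimate matches the true channel exactly; with noise and frequency-selective fading, the estimation error at the data subcarriers is what drives the BER and MSE curves the paper compares.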
Speaker and Speech Recognition for Secured Smart Home Applications (Roger Gomes)
The paper discusses the implementation of a robust text-independent speaker recognition system using MFCC extraction of feature vectors, matching using VQ and optimization using LBG; further, the implementation of a text-dependent speech recognition system using the DTW algorithm is discussed in the context of home automation.
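The DTW template matching mentioned above aligns two sequences of different lengths by dynamic programming over a cumulative cost matrix. A minimal sketch; the scalar sequences and absolute-difference cost are toy assumptions (real systems warp MFCC frame sequences under a vector distance):

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between two sequences:
    D[i][j] = local cost + min of the three allowed predecessors."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # insertion
                              D[i][j - 1],      # deletion
                              D[i - 1][j - 1])  # match
    return D[n][m]

# A time-stretched copy of a template aligns perfectly under DTW.
template = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched = [0.0, 0.0, 1.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(template, stretched))  # → 0.0
```

Text-dependent recognition then picks the stored command template with the smallest DTW distance to the incoming utterance.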
Speech recognition is the next big step that technology needs to take for general users. Automatic Speech Recognition (ASR) will play a major role in bringing new technology to users. Applications of ASR include speech-to-text conversion, voice input in aircraft, data entry, and voice user interfaces such as voice dialing. Speech recognition involves extracting features from the input signal and classifying them into classes using a pattern matching model, which can be done with a feature extraction method. This paper is a general study of automatic speech recognition and of various methods to build an ASR system. General techniques that can be used to implement an ASR include artificial neural networks, hidden Markov models and the acoustic-phonetic approach.
The peer-reviewed International Journal of Engineering Inventions (IJEI) is started with a mission to encourage contribution to research in Science and Technology. Encourage and motivate researchers in challenging areas of Sciences and Technology.
F EATURE S ELECTION USING F ISHER ’ S R ATIO T ECHNIQUE FOR A UTOMATIC ...IJCI JOURNAL
Automatic Speech Recognition (ASR) involves mainly
two steps; feature extraction and classification
(pattern recognition). Mel Frequency Cepstral Coeff
icient (MFCC) is used as one of the prominent featu
re
extraction techniques in ASR. Usually, the set of a
ll 12 MFCC coefficients is used as the feature vect
or in
the classification step. But the question is whethe
r the same or improved classification accuracy can
be
achieved by using a subset of 12 MFCC as feature ve
ctor. In this paper, Fisher’s ratio technique is us
ed for
selecting a subset of 12 MFCC coefficients that con
tribute more in discriminating a pattern. The selec
ted
coefficients are used in classification with Hidden
Markov Model (HMM) algorithm. The classification
accuracies that we get by using 12 coefficients and
by using the selected coefficients are compare
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The efficiency of identification approaches 91.3% using single test file and 100% using two test files. The method shows comparable efficiency results with the well known MFCC method with an advantage of being faster in both modeling and identification.
Speaker Identification From Youtube Obtained Datasipij
An efficient, and intuitive algorithm is presented for the identification of speakers from a long dataset (like
YouTube long discussion, Cocktail party recorded audio or video).The goal of automatic speaker
identification is to identify the number of different speakers and prepare a model for that speaker by
extraction, characterization and speaker-specific information contained in the speech signal. It has many
diverse application specially in the field of Surveillance , Immigrations at Airport , cyber security ,
transcription in multi-source of similar sound source, where it is difficult to assign transcription arbitrary.
The most commonly speech parameterization used in speaker verification, K-mean, cepstral analysis, is
detailed. Gaussian mixture modeling, which is the speaker modeling technique is then explained. Gaussian
mixture models (GMM), perhaps the most robust machine learning algorithm has been introduced to
examine and judge carefully speaker identification in text independent. The application or employment of
Gaussian mixture models for monitoring & Analysing speaker identity is encouraged by the familiarity,
awareness, or understanding gained through experience that Gaussian spectrum depict the characteristics
of speaker's spectral conformational pattern and remarkable ability of GMM to construct capricious
densities after that we illustrate 'Expectation maximization' an iterative algorithm which takes some
arbitrary value in initial estimation and carry on the iterative process until the convergence of value is
observed We have tried to obtained 85 ~ 95% of accuracy using speaker modeling of vector quantization
and Gaussian Mixture model ,so by doing various number of experiments we are able to obtain 79 ~ 82%
of identification rate using Vector quantization and 85 ~ 92.6% of identification rate using GMM modeling
by Expectation maximization parameter estimation depending on variation of parameter.
This paper contains a report on an Audio-Visual Client Recognition System using Matlab software which identifies five clients and can be improved to identify as many clients as possible depending on the number of clients it is trained to identify which was successfully implemented. The implementation was accomplished first by visual recognition system implemented using The Principal Component Analysis, Linear Discriminant Analysis and Nearest Neighbour Classifier. A successful implementation of second part was achieved by audio recognition using Mel-Frequency Cepstrum Coefficient, Linear Discriminant Analysis and Nearest Neighbour Classifier the system was tested using images and sounds that have not been trained to the system to see whether it can detect an intruder which lead us to a very successful result with précised response to intruder.
MULTI USER DETECTOR IN CDMA USING ELLIPTIC CURVE CRYPTOGRAPHYVLSICS Design
Code division multiple access (CDMA) is used in various radio communication techniques due to its advantages. In CDMA one of the most important processes is multi user detection (MUD). There are numerous methods for MUD in CDMA, but in most of the methods, they identify the exact user but the interference signal is high. One of the methods used for MUD in CDM A is elliptic curve cryptography (ECC). Normally, the multi user detector in CDMA using elliptic curve cryptography is performed by using one prime field. In ECC method the exact user is identified and also interference signal reduces comparing with other techniques. To reduce the interference signal to very low, here propose a new technique for MUD in CDMA using ECC. The proposed technique uses multiple prime numbers for key generation. By generating key using different prime numbers using ECC, the bit error rate was very low. The results shows the performance of the proposed for reduce in bit error rate for MUD in CDMA.
The accuracy of the speaker identification system reduces severely in the presence of noise. The auditory based features called gammatone frequency ceptral coefficient (GFCC) which resembles to the ability of human ear to identify the speaker identity's in noisy environment. The GFCC is based on gammatone filter bank which models the basilar membrane as a sequence of overlapping band pass filters. The system uses GFCC features and Gaussian Mixture Model - Universal Background Model (GMM - UBM ) for modeling of features of the speaker. The experiment are conducted on the English Language Speech Database for Speaker Recognition (ELSDR) databases. In order to find out the performance of the system, the test utterances are mixed with noises at various SNR levels to simulate the channel change. The results shows that the GFCC features has a good identification accuracy not only in clean sample, but also for noisy condition.
Development of Quranic Reciter Identification System using MFCC and GMM Clas...IJECEIAES
Nowadays, there are many beautiful recitation of Al-Quran available. Quranic recitation has its own characteristics, and the problem to identify the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop Quran reciter identification system using Mel-frequency Cepstral Coefficient (MFCC) and Gaussian Mixture Model (GMM). In this paper, a database of five Quranic reciters is developed and used in training and testing phases. We carefully randomized the database from various surah in the Quran so that the proposed system will not prone to the recited verses but only to the reciter. Around 15 Quranic audio samples from 5 reciters were collected and randomized, in which 10 samples were used for training the GMM and 5 samples were used for testing. Results showed that our proposed system has 100% recognition rate for the five reciters tested. Even when tested with unknown samples, the proposed system is able to reject it.
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONcsandit
Emotional state recognition through speech is a very active research topic nowadays. Using subliminal information in speech, it is possible to recognize the emotional state of the speaker. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns. This fact makes the learning process more difficult, due to the generalization problems that arise under these conditions.
In this work we propose a solution to this problem that consists in enlarging the training set through the creation of new virtual patterns. In the case of emotional speech, most of the emotional information is contained in speed and pitch variations; hence, a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We use this prior information to create new patterns by applying a pitch-shift modification in the feature extraction process of the classification system. For this purpose, we propose a frequency-scaling modification of the Mel Frequency Cepstral Coefficients used to classify the emotion. This process allows us to synthetically increase the number of available patterns in the training set, thereby increasing the generalization capability of the system and reducing the test error.
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG... (sipij)
Previous research has found the autocorrelation domain to be appropriate for signal and noise separation. This paper discusses a simple and effective method for decreasing the effect of noise on the autocorrelation of the clean signal, which can later be used in extracting mel cepstral parameters for speech recognition. Two different methods are proposed to deal with the error introduced by assuming speech and noise to be completely uncorrelated. The basic approach reduces the effect of noise via estimation and subtraction of its effect from the noisy speech signal autocorrelation. To improve this method, we consider inserting a speech/noise cross-correlation term into the equations used for the estimation of the clean speech autocorrelation, using an estimate of it found through a kernel method. Alternatively, we use an estimate of the cross-correlation term obtained with an averaging approach. A further improvement was obtained through the introduction of an overestimation parameter in the basic method. We tested our proposed methods on the Aurora 2 task. The basic method has shown considerable improvement over the standard features and some other robust autocorrelation-based features, and the proposed techniques further increase the robustness of the basic autocorrelation-based method.
Limited Data Speaker Verification: Fusion of Features (IJECEIAES)
The present work demonstrates an experimental evaluation of speaker verification for different speech feature extraction techniques under the constraint of limited data (less than 15 seconds). State-of-the-art speaker verification techniques provide good performance for sufficient data (greater than 1 minute); it is a challenging task to develop techniques which perform well for speaker verification under limited data conditions. In this work different features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Delta (Δ), Delta-Delta (ΔΔ), Linear Prediction Residual (LPR) and Linear Prediction Residual Phase (LPRP) are considered. The performance of individual features is studied, and for better verification performance, a combination of these features is attempted. A comparative study is made between the Gaussian mixture model (GMM) and the GMM-universal background model (GMM-UBM) through experimental evaluation. The experiments are conducted using the NIST-2003 database. The experimental results show that the combination of features provides better performance compared to the individual features. Further, GMM-UBM modeling gives a reduced equal error rate (EER) compared to GMM.
Speaker and Speech Recognition for Secured Smart Home Applications (Roger Gomes)
The paper discusses the implementation of a robust text-independent speaker recognition system using MFCC extraction of feature vectors, matching using VQ and optimization using LBG; the implementation of a text-dependent speech recognition system using the DTW algorithm is also discussed in the context of home automation.
Speech recognition is the next big step that technology needs to take for general users. Automatic Speech Recognition (ASR) will play a major role in bringing new technology to users. Applications of ASR include speech-to-text conversion, voice input in aircraft, data entry, and voice user interfaces such as voice dialing. Speech recognition involves extracting features from the input signal and classifying them into classes using a pattern matching model; this can be done using feature extraction methods. This paper presents a general study of automatic speech recognition and various methods to build an ASR system. General techniques that can be used to implement an ASR include artificial neural networks, Hidden Markov Models, and the acoustic-phonetic approach.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information in their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique by performing gender recognition before speaker recognition so as to improve the accuracy and timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
The performance of speaker identification systems decreases significantly under noisy conditions, especially when there is a mismatch between the recognition and the learning sessions. To improve robustness, in a previous study we proposed auditory features and a robust speaker recognition system using a front-end based on the combination of MFCC and RASTA-PLP methods. In this paper, we further study auditory features by exploring the combination of GFCC and RASTA-PLP. We find that this method performs substantially better than all previously studied methods. Furthermore, our current identification system achieves significant performance improvements of 5.92% over a wide range of signal-to-noise conditions compared with the last studied front-end based on MFCC combined with RASTA-PLP. Experimental results show an average accuracy improvement of 10.11% for GFCC combined with RASTA-PLP over the baseline MFCC technique across various SNRs. This fusion approach allows an appreciable enhancement under highly noisy conditions.
Wavelet Based Noise Robust Features for Speaker Recognition (CSCJournals)
Extraction and selection of the best parametric representation of the acoustic signal is the most important task in designing any speaker recognition system. A wide range of possibilities exists for parametrically representing the speech signal, such as Linear Prediction Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC) and others. MFCC are currently the most popular choice for any speaker recognition system, though one of their shortcomings is that the signal is assumed to be stationary within the given time frame; MFCC are therefore unable to analyze non-stationary signals and are not suitable for noisy speech. To overcome this problem, several researchers have used different types of AM-FM modulation/demodulation techniques for extracting features from the speech signal, while other approaches propose wavelet filterbanks for feature extraction. In this paper a technique for extracting features by combining the above-mentioned approaches is proposed: features are extracted from the envelope of the signal and then passed through a wavelet filterbank. It is found that the proposed method outperforms the existing feature extraction techniques.
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decom... (BIJIAM Journal)
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied in this paper for the extraction of emotional features from speeches, and a deep neural network (DNN) is used to classify speech emotions. This paper enhances the emotional components in speech signals by using EMD with acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates of emotions from speeches using the classifier DNN. In this paper, EMD is first used to decompose the speech signals, which contain emotional components into multiple intrinsic mode functions (IMFs), and then emotional features are derived from the IMFs and are calculated using MFCC. Then, the emotional features are used to train the DNN model. Finally, a trained model that could recognize the emotional signals is then used to identify emotions in speeches. Experimental results reveal that the proposed method is effective.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition... (TELKOMNIKA JOURNAL)
Speech recognition can be defined as the process of converting voice signals into a sequence of words by applying a specific algorithm implemented in a computer program. Research on speech recognition in Indonesia is relatively limited. This paper studies which feature extraction method is better, Linear Predictive Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), for speech recognition in the Indonesian language. This is important because a method that produces high accuracy for a particular language does not necessarily produce the same accuracy for other languages, considering that every language has different characteristics; this research therefore hopes to help accelerate the use of automatic speech recognition for the Indonesian language. There are two main processes in speech recognition: feature extraction and recognition. The feature extraction methods compared in this study are LPC and MFCC, while the recognition method uses the Hidden Markov Model (HMM). The test results showed that the MFCC method is better than LPC for Indonesian-language speech recognition.
This paper proposes a voice morphing system for people suffering from laryngectomy, the surgical removal of all or part of the larynx (the voice box), particularly performed in cases of laryngeal cancer. A primitive method of achieving voice morphing is to extract the source's vocal coefficients and then convert them into the target speaker's vocal parameters. In this paper, we deploy Gaussian Mixture Models (GMM) for mapping the coefficients from source to destination. However, the use of the conventional GMM-based mapping approach results in over-smoothing of the converted voice. Thus, we propose a method to perform efficient voice morphing and conversion based on GMM which overcomes the over-smoothing effects of the traditional method. It uses a technique of glottal waveform separation and prediction of excitations, and the results show that not only is over-smoothing eliminated but the transformed vocal tract parameters also match the target. Moreover, the synthesized speech thus obtained is found to be of sufficiently high quality. The proposed GMM-based voice morphing approach is critically evaluated using various subjective and objective evaluation parameters, and an application of voice morphing for laryngectomees deploying this approach is recommended.
This paper describes the development of an efficient speech recognition system using techniques such as Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ) and Hidden Markov Models (HMM).
It explains how speaker recognition followed by speech recognition is used to recognize speech faster, more efficiently and more accurately. MFCC is used to extract the characteristics from the input speech signal with respect to a particular word uttered by a particular speaker; HMM is then applied to the quantized feature vectors to identify the word by evaluating the maximum log-likelihood values for the spoken word.
Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice. Voice recognition is a combination of the two, where learned aspects of a speaker's voice are used to determine what is being said: such a system cannot recognize speech from random speakers very accurately, but it can reach high accuracy for individual voices it has been trained with, which enables various applications in day-to-day life.
Effect of MFCC Based Features for Speech Signal Alignments (skevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signal. HNM analysis and synthesis provide high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them, and the improvement in the alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the recorded material was segmented manually and aligned at the sentence, word, and phoneme levels. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
The performance of various acoustic feature extraction methods has been compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector extraction technique, chosen due to its widespread popularity. Other vector extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have also been used; these two methods closely resemble the human auditory system. The feature vectors are then trained using the LSTM neural network, and the obtained models of different phonemes are compared with statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R... (IJERA Editor)
This paper presents a compressive sensing technique for speech reconstruction using linear predictive coding, because speech is sparser in the LPC domain. The DCT of the speech is taken, and DCT points of the sparse speech are discarded arbitrarily by setting some points in the DCT domain to zero through multiplication with mask functions. From the incomplete points in the DCT domain, the original speech is reconstructed using compressive sensing with Gradient Projection for Sparse Reconstruction as the tool. The performance of the result is compared subjectively with the direct IDCT; the experiments show that the performance is better for compressive sensing than for the DCT.
Probabilistic Self-Organizing Maps for Text-Independent Speaker Identification (TELKOMNIKA JOURNAL)
The present paper introduces a novel speaker modeling technique for text-independent speaker identification using probabilistic self-organizing maps (PbSOMs). The basic motivation behind the introduced technique is to combine the self-organizing quality of self-organizing maps with the generative power of Gaussian mixture models. Experimental results show that the introduced modeling technique significantly outperforms the traditional technique using classical GMMs and the EM algorithm or its deterministic variant. More precisely, a relative accuracy improvement of roughly 39% is gained, and much lower sensitivity to model-parameter initialization is exhibited by the introduced speaker modeling technique.
Speech Recognition Using HMM with MFCC - An Analysis Using Frequency Spectral Decomposition Technique
Signal & Image Processing : An International Journal (SIPIJ) Vol.1, No.2, December 2010
DOI : 10.5121/sipij.2010.1209
SPEECH RECOGNITION USING HMM WITH MFCC - AN ANALYSIS USING FREQUENCY SPECTRAL DECOMPOSITION TECHNIQUE
Ibrahim Patel¹ and Dr. Y. Srinivas Rao²

¹ Assoc. Prof., Department of BME, Padmasri Dr. B.V. Raju Institute of Technology, Narsapur
Ptlibrahim@gmail.com

² Assoc. Prof., Department of Instrument Technology, Andhra University, Vizag, A.P.
srinniwasarau@gmail.com
ABSTRACT
This paper presents an approach to the recognition of speech signals using frequency spectral information on the Mel frequency scale, to improve speech feature representation in an HMM-based recognition approach. Frequency spectral information is incorporated into the conventional Mel-spectrum-based speech recognition approach. The Mel frequency approach observes the speech signal at a fixed frequency resolution, which causes resolution-dependent feature overlapping and thereby limits recognition. Resolution decomposition with a separating-frequency mapping approach is proposed for an HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
KEYWORDS
Speech-recognition, Mel-frequencies, DCT, frequency decomposition, Mapping Approach, HMM, MFCC.
1. INTRODUCTION
Speech recognition is a process used to recognize speech uttered by a speaker and has been a field of research for more than five decades, since the 1950s [1]. Voice communication is the most effective mode of communication used by humans. Speech recognition is an important and emerging technology with great potential. The significance of speech recognition lies in its simplicity; this simplicity, together with the ease of operating a device using speech, has many advantages. It can be used in many applications such as security devices, household appliances, cellular phones, ATM machines and computers.
With the advancement of automated systems, the complexity of the integration and recognition problem is increasing. The problem is more complex when processing randomly varying analog signals such as speech signals. Although various methods have been proposed for efficient extraction of speech parameters for recognition, the MFCC method with an advanced recognition method such as HMM is the most dominantly used. This system is found to be accurate under a slowly varying environment but fails to recognize speech under a highly varying environment. This leads to the need for an efficient recognition system that can handle such variation.
Research and development on speaker recognition methods and techniques has been undertaken for well over four decades and it continues to be an active area. Approaches have spanned from human auditory [2] and spectrogram comparisons [2], to simple template matching, to dynamic time-warping approaches, to more modern statistical pattern recognition [3], such as neural networks and Hidden Markov Models (HMMs) [4].
It is observed that, to extract and recognize different information from a speech signal in a variable environment, many algorithms for efficient speech recognition have been proposed in the past. Masakiyo Fujimoto and Yasuo Ariki, in their paper "Robust Speech Recognition in Additive and Channel Noise Environments using GMM and EM Algorithm" [5], evaluate speech recognition in real driving-car environments by using a GMM based speech estimation method [6] and an EM algorithm based channel noise estimation method.
A Gaussian mixture model (GMM) based speech estimation method proposed by J.C. Segura et al. [6] estimates the expectation of the mismatch factor between clean speech and noisy speech at each frame by using a GMM of clean speech and the mean vector of noise. This approach shows a significant improvement in recognition accuracy. However, Segura's method considered only additive noise environments and did not consider channel noise problems such as an acoustic transfer function or microphone characteristics.
A Parallel Model Combination (PMC) method [7] proposed by M.J.F. Gales and S.J. Young adapts the speech recognition system to any kind of noise. However, PMC has the problem of requiring a huge quantity of computation to recognize the speech signal. Another method called spectral subtraction (SS) has also been proposed as a conventional noise reduction method [3]. However, spectral subtraction degrades the recognition rate due to spectral distortion caused by over- or under-subtraction. Additionally, spectral subtraction does not consider the time-varying property of noise spectra, because it estimates the noise spectra as mean spectra within the time section assumed to be noise.
Hidden Markov Model (HMM) [4] is a natural and highly robust statistical methodology for automatic speech recognition. It has been tested and proved considerably in a wide range of applications. The model parameters of the HMM are the essence in describing the behavior of the utterance of the speech segments. Many successful heuristic algorithms have been developed to optimize the model parameters in order to best describe the trained observation sequences. The objective of this paper is to develop an efficient speech recognition algorithm extending the existing system following the HMM algorithm. The paper integrates the frequency isolation concept called sub-band decomposition into the existing MFCC approach for the extraction of speech features. The additional feature provides information on the varying speech coefficients at multiple band levels; this feature could enhance the recognition approach beyond the existing one.
2. Hidden Markov Modeling
An HMM is a stochastic finite state automaton defined by the parameters λ = (A, B, π), where A is the state transition probability, π is the initial state probability and B is the emission probability density function of each state, defined by a finite multivariate Gaussian mixture as shown in the figure below. Each model can be used to compute the probability of observing a discrete input sequence O = O1, . . . , OT, P(O|λ); to find the corresponding state sequence that maximizes the probability of the input sequence, P(Q|O, λ); and to induce the model that maximizes the probability of a given sequence P(O|λ).
Figure 1. A typical left-right HMM (aij is the state transition probability from state i to state j; OT is the observation vector at time T and bi(OT) is the probability that OT is generated by state i).
Given the form of the HMM, there are three basic problems to be solved.
Problem 1: calculation of the probability P(O|λ) for a given observation sequence.
For a given observation sequence O = O1, O2, . . . , OT and a model λ, the calculation of the probability of occurrence of the observations is one of the difficult and time consuming tasks. For a given HMM for speech recognition the probability is given by

P(O|λ) = Σi αT(i)

where αT is the forward component of the Forward-Backward procedure.
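As an illustration, Problem 1 can be sketched with the standard forward procedure in the log domain. This is a generic textbook implementation, not the authors' code; the list-based model layout and the `_logsumexp` helper are assumptions.

```python
import math

def forward_log_prob(log_A, log_pi, log_B):
    """Compute log P(O | lambda) with the forward procedure.

    log_A[i][j] : log state-transition probability a_ij
    log_pi[i]   : log initial-state probability
    log_B[i][t] : log emission probability b_i(O_t), precomputed per frame
    """
    N = len(log_pi)       # number of states
    T = len(log_B[0])     # number of frames
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [log_pi[i] + log_B[i][0] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for t in range(1, T):
        alpha = [
            _logsumexp([alpha[i] + log_A[i][j] for i in range(N)]) + log_B[j][t]
            for j in range(N)
        ]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return _logsumexp(alpha)

def _logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))
```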
Problem 2: For a given observation sequence O = O1, O2, . . . , OT and a model λ, the selection of the corresponding state sequence Q = q1, q2, . . . , qT which best "explains" the observations is one more problem faced in speech recognition with HMMs. This problem is solved by using the Viterbi algorithm.
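The Viterbi search can be sketched as follows, in the same log-domain, list-based conventions as the forward procedure above. This is a generic textbook version, not the authors' implementation.

```python
def viterbi(log_A, log_pi, log_B):
    """Most likely state sequence Q* = argmax_Q P(Q, O | lambda).

    log_A[i][j] : log transition probability, log_pi[i] : log initial
    probability, log_B[i][t] : log emission probability of frame t in state i.
    """
    N, T = len(log_pi), len(log_B[0])
    # Initialization: delta_1(i) = pi_i * b_i(O_1)
    delta = [log_pi[i] + log_B[i][0] for i in range(N)]
    back = []  # backpointers psi_t(j)
    for t in range(1, T):
        psi, new_delta = [], []
        for j in range(N):
            # Recursion: delta_t(j) = max_i [delta_{t-1}(i) a_ij] * b_j(O_t)
            scores = [delta[i] + log_A[i][j] for i in range(N)]
            best = max(range(N), key=lambda i: scores[i])
            psi.append(best)
            new_delta.append(scores[best] + log_B[j][t])
        back.append(psi)
        delta = new_delta
    # Backtrack from the best final state
    q = [max(range(N), key=lambda i: delta[i])]
    for psi in reversed(back):
        q.append(psi[q[-1]])
    return list(reversed(q))
```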
Problem 3: Parameter optimization.
For a given model λ, the HMM parameters {A, B, π} should be chosen so as to maximize the probability P(O|λ); this is called the learning procedure. Treating all the temporal feature values as continuous and using a continuous density HMM can optimize the learning process.
2.1 Clustering
Regarding the selection of a dissimilarity measure, a distance function d(i, j) between HMMs is used, with the assumption that similar speech forming one cluster corresponds to one HMM model. In this work a complete-link hierarchical clustering technique is used. The first step is to compute a measure matrix W of HMM dissimilarities. Once the measure matrix W is computed, the clustering technique is applied to obtain K clusters. The algorithm produces a sequence of clusterings with a decreasing number of clusters at each step. The clustering produced at each step results from the previous one by merging two clusters into one. Finally, after isolating the sequences into K groups, we fit one HMM to each cluster using all the observation sequences in the cluster for HMM training.
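A minimal sketch of the complete-link merging loop on a precomputed measure matrix W; the dissimilarity values themselves (e.g. distances between trained HMMs) are assumed given.

```python
def complete_link_clusters(W, K):
    """Agglomerative complete-link clustering on a dissimilarity matrix W.

    W[i][j] is the dissimilarity d(i, j); merging stops when K clusters
    remain. Returns a list of clusters (lists of item indices).
    """
    clusters = [[i] for i in range(len(W))]
    while len(clusters) > K:
        # Complete link: cluster distance is the MAXIMUM pairwise dissimilarity
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(W[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters into one
        clusters[a] += clusters.pop(b)
    return clusters
```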
3. Mel Spectrum Approach
For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector; each input utterance is therefore transformed into a sequence of acoustic vectors. As these vectors are evaluated using a distinct filter spectrum, the feature information obtained is limited to certain frequency resolution information only and needs to be improved. In the following section a frequency decomposition method incorporated into the existing architecture is suggested for HMM training and recognition. The mel spectrum is used as recognition information in a conventional speech recognition system. The spectrum does not exploit the variations in fundamental resolution and hence is lower in accuracy; to improve the accuracy of operation, a spectral decomposition approach is suggested as shown in Figure 2.
Figure (2): Speech process models
Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f / 700)
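The mapping and its inverse can be sketched directly from this formula; the inverse is included as a convenience for placing filter-bank edges and is implied rather than stated in the text.

```python
import math

def hz_to_mel(f):
    """Approximate mel value for a frequency f in Hz: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, useful for converting mel points back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```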
One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on
the mel scale where the filter bank has a triangular band pass frequency response, and the spacing
as well as the bandwidth is determined by a constant mel frequency interval. The modified
spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The
number of Mel spectrum coefficients, K, is typically chosen as 20.
Note that this filter bank is applied in the frequency domain; therefore it simply amounts to taking
those triangle-shape windows in the Figure.3 on the spectrum. A useful way of thinking about this
Mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the
frequency domain.
Figure. 3. An example of mel-spaced filterbank
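A sketch of constructing such a mel-spaced triangular filter bank. The choice of K + 2 equally spaced mel points and the FFT-bin rounding are common conventions assumed here, not specified by the paper.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K, n_fft, sample_rate):
    """K triangular filters spaced uniformly on the mel scale.

    Returns a K x (n_fft//2 + 1) list of weights to apply to a power spectrum.
    """
    n_bins = n_fft // 2 + 1
    # K + 2 equally spaced mel points give K overlapping triangles
    mel_pts = [hz_to_mel(sample_rate / 2) * i / (K + 1) for i in range(K + 2)]
    bin_pts = [int(mel_to_hz(m) * n_fft / sample_rate) for m in mel_pts]
    fbank = [[0.0] * n_bins for _ in range(K)]
    for k in range(1, K + 1):
        left, center, right = bin_pts[k - 1], bin_pts[k], bin_pts[k + 1]
        # Rising slope of triangle k
        for b in range(left, center):
            if center > left:
                fbank[k - 1][b] = (b - left) / (center - left)
        # Falling slope of triangle k
        for b in range(center, right + 1):
            if right > center:
                fbank[k - 1][b] = (right - b) / (right - center)
    return fbank
```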
The log Mel spectrum is converted back to time. The result is called the Mel frequency cepstrum
coefficients (MFCC). The cepstral representation of the speech spectrum provides a good
representation of the local spectral properties of the signal for the given frame analysis. Because
the Mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to
the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote those Mel power spectrum coefficients that are the result of the last step as S̃k, k = 1, 2, . . . , K, we can calculate the MFCCs, c̃n, as

c̃n = Σk=1..K (log S̃k) cos[ n (k − 1/2) π / K ],  n = 1, 2, 3, . . . , K

The first component, c̃0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker specific information, as shown in Figure 3.
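This cosine transform step can be sketched in pure Python with zero-based indexing; the exclusion of c̃0 follows the text, while everything else is a generic implementation.

```python
import math

def mfcc_from_mel(log_mel):
    """Apply the DCT of the text: c_n = sum_k log(S_k) cos[n (k - 1/2) pi / K].

    `log_mel` holds log S~_k for k = 1..K (zero-based here, so the cosine
    argument uses k + 1/2). n runs from 1 to K-1: c_0 is excluded as the
    mean/energy term, and the n = K term vanishes since cos((k + 1/2)pi) = 0.
    """
    K = len(log_mel)
    return [
        sum(log_mel[k] * math.cos(n * (k + 0.5) * math.pi / K)
            for k in range(K))
        for n in range(1, K)
    ]
```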
The normalized cross-correlation based method is used for pitch estimation. The algorithm
assumes a monophonic signal. The method follows the assumption that the signal has a
periodicity corresponding to the fundamental frequency or pitch. Starting with a signal that is
assumed to be periodic, or more precisely, quasi-periodic, it follows that s(t) ≈ αs(t + T) where T
is the quasi-period of the signal and the scalar α accounts for amplitude variations. For training the HMM, twelve frequency spectral coefficients (Mel) using 10 millisecond Hamming windowed frames were extracted. The Mel scale is defined as above: mel(f) = 2595 * log10(1 + f / 700).
The employed filters were triangular and equally spaced along the entire Mel-scale. The pitch
frequency was also computed in the same window. The zero order energy and Frequency zero
were included with the above-mentioned features. In order to take advantage of pitch and spectral dynamics, the velocity (delta) and acceleration parameters were also added to the feature space, as shown in Figure 4.
Figure. 4 The pitch frequency was also computed in the same window
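The normalized cross-correlation pitch search described above can be sketched as follows; the search range [fmin, fmax] and the frame-based interface are assumptions, not taken from the paper.

```python
import math

def pitch_ncc(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pitch estimate via normalized cross-correlation.

    Follows s(t) ~ alpha * s(t + T): each candidate lag is scored by the
    normalized correlation of the frame with its shifted copy, and the
    frequency of the best-scoring lag is returned.
    """
    lo = int(sample_rate / fmax)  # smallest lag to try
    hi = int(sample_rate / fmin)  # largest lag to try
    best_lag, best_score = lo, -1.0
    for lag in range(lo, min(hi, len(frame) - 1) + 1):
        a = frame[:-lag]
        b = frame[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_score, best_lag = score, lag
    return sample_rate / best_lag
```

A narrow search range helps avoid picking a multiple of the true period (an octave error), which is a known weakness of plain correlation-based trackers.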
4. Spectral Decomposition Approach
Filter bank can be regarded as wavelet transform in multi resolution band. Wavelet transform of a
signal is passing the signal through this filter bank. The outputs of the different filter stages are
the wavelet and scaling function transform coefficients. Analyzing a signal by passing it through
a filter bank is not a new idea and has been around for many years under the name sub band
coding. It is used for instance in computer vision applications.
Figure: 5. splitting the signal spectrum with an iterated filter bank.
The filter bank needed in sub-band coding can be built in several ways. One way is to build many band-pass filters to split the spectrum into frequency bands. The advantage is that the width of every band can be chosen freely, in such a way that the spectrum of the signal to analyze is covered in the places of interest; the disadvantage is that it is necessary to design every filter separately, and this can be a time consuming process. Another way is to split the signal spectrum into two equal parts, a low-pass and a high-pass part. The high-pass part contains the smallest details considered here. The low-pass part still contains some details and therefore it can be split again, and again, until the desired number of bands is created. In this way an iterated filter bank is created.
Usually the number of bands is limited by, for instance, the amount of data or computation power available. The process of splitting the spectrum is graphically displayed in Figure 5. The spectral decomposition coefficients obtained can be observed in Figures 6-8.
Figure 6. Output after 1st stage decomposition for a given speech signal
Figure 7. Plot after 2nd stage decomposition
Figure 8. Output after 3rd stage decomposition
The advantage of this scheme is that it is necessary to design only two filters; the disadvantage is
that the signal spectrum coverage is fixed.
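A minimal sketch of this two-filter iterated scheme, using the Haar pair as the low-pass/high-pass filters; the paper does not name its filters, so the Haar choice is an assumption.

```python
import math

def haar_split(x):
    """One two-band split: low-pass (approximation) and high-pass (detail)."""
    h = 1.0 / math.sqrt(2.0)
    low = [(x[2 * i] + x[2 * i + 1]) * h for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) * h for i in range(len(x) // 2)]
    return low, high

def iterated_filter_bank(x, levels):
    """Split repeatedly on the low-pass branch, as in Figure 5.

    Returns [detail_1, detail_2, ..., detail_levels, approximation],
    i.e. a series of band-pass bands with doubling bandwidth plus one
    low-pass band.
    """
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_split(approx)
        bands.append(detail)
    bands.append(approx)
    return bands
```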
Looking at Figure 5 it is observed that what remains after the repeated spectrum splitting is a series of band-pass bands with doubling bandwidth and one low-pass band. The first split gave a high-pass band and a low-pass band; in reality the high-pass band is a band-pass band due to the limited bandwidth of the signal. In other words, the same sub-band analysis can be performed by feeding the signal into a bank of band-pass filters, each filter having a bandwidth twice as wide as its left neighbor, plus one low-pass filter. The wavelets give us the band-pass bands with doubling bandwidth, and the scaling function provides the low-pass band. From this it can be concluded that a wavelet transform is the same thing as a sub-band coding scheme
using a constant-Q filter bank. It can be summarized that, in an implementation of the wavelet transform as an iterated filter bank, it is not necessary to specify the wavelets explicitly. The actual lengths of the detail and approximation coefficient vectors are slightly more than half the length of the original signal. This has to do with the filtering process, which is implemented by convolving the signal with a filter. The spectral decomposition reveals the accuracy of individual resolutions, which was not explored in the mel spectrum. This approach is combined with a mapping concept for speech recognition; a simulation of the proposed system is carried out in MATLAB and the results obtained are outlined below.
5. Mapping Approach
After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. A set of L training vectors from the frequency information is derived using the well-known LBG algorithm [4]. The algorithm is formally implemented by the following recursive procedure. The LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. Figure 9 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure, which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged. A general operational flow diagram of the suggested LBG mapping approach is shown in Figure 9.
Figure: 9. Flow diagram of the LBG algorithm
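The split / cluster-vectors / find-centroids / compute-D loop of Figure 9 can be sketched as follows; the squared-error distortion, the split factor eps = 0.01 and the relative-improvement stopping rule are conventional choices assumed here, and M is assumed to be a power of two.

```python
def lbg_codebook(vectors, M, eps=0.01, tol=1e-4):
    """LBG: grow a codebook by splitting, then refine until D converges.

    Starts from the 1-vector codebook (global centroid), perturbs each
    codeword by (1 +/- eps) to double the codebook size, and refines with
    nearest-neighbor clustering and centroid updates.
    """
    def centroid(vs):
        return [sum(v[d] for v in vs) / len(vs) for d in range(len(vs[0]))]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    codebook = [centroid(vectors)]
    while len(codebook) < M:
        # Split every codeword into a (1+eps)/(1-eps) pair
        codebook = [[x * (1 + s) for x in c] for c in codebook for s in (eps, -eps)]
        prev_D = float("inf")
        while True:
            # "Cluster vectors": nearest-neighbor assignment
            clusters = [[] for _ in codebook]
            D = 0.0
            for v in vectors:
                i = min(range(len(codebook)), key=lambda i: dist2(v, codebook[i]))
                clusters[i].append(v)
                D += dist2(v, codebook[i])
            # "Find centroids": update non-empty clusters
            codebook = [centroid(c) if c else codebook[i]
                        for i, c in enumerate(clusters)]
            # "Compute D": stop when the relative improvement is below tol
            if prev_D < float("inf") and prev_D - D <= tol * prev_D:
                break
            prev_D = D
    return codebook
```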
6. Simulation Observation
For the training of the HMM network for the recognition of speech, a vocabulary consisting of a collection of words is maintained. The vocabulary consists of the words "DISCRETE", "FOURIER", "TRANSFORM", "WISY", "EASY", "TELL", "FELL", "THE", "DEPTH", "WELL", "CELL", "FIVE"; each word in the vocabulary is stored in correspondence with a feature defined as knowledge for each speech word during training of the HMM network. The features are extracted from only one voice sample for the corresponding word.
The test speech utterance used for testing is "its easy to tell the depth of a well", at 16 kHz sampling.
(a)
(b)
(c)
(d)
(e)
(f)
Figure 10. (a) Original speech signal and its noise-affected speech signal, (b) the energy peak points picked for training, (c) the recognition computation time for the MFCC based and the modified MFCC systems, (d) the observed correctly classified symbols for the two methods, (e) the estimation error for the two methods with respect to iteration, (f) the likelihood variation with respect to iteration for the two methods.
7. Conclusion
A speech recognition system robust to noise effects is developed. The conventional MFCC approach, which extracts the features of the speech signal at lower frequency resolution, is modified in this paper. An efficient speech recognition system integrating the MFCC features with frequency sub-band decomposition using sub-band coding is proposed. The two features passed to the HMM network result in better recognition compared to the existing MFCC method. From the observations made on the implemented system, it is seen to have better efficiency for accurate classification and recognition compared to the existing system.
8. References
[1] Varga A.P, and Moore R.K.: Hidden Markov Model decomposition of speech and noise, Proc. IEEE
Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 845-48, (1990)
[2] Allen, J.B.: How do humans process and recognize speech IEEE Trans. on Speech and Audio
Processing, vol. 2, no. 4, pp.567--577 (1994.)
[3] Kim W. Kang, S. and Ko, H. : Spectral subtraction based on phonetic dependency and masking effects,
IEEE Proc.- Vision, Image and Signal Processing, 147(5), pp 423--27 (2000)
[4] R.J. Elliott, L. Aggoun and J.B. Moore: Hidden Markov Models: Estimation and Control, Springer
Verlag, (1995)
[5] Fujimoto, M.; Ariki, Y.: Robust speech recognition in additive and channel noise environments using
GMM and EM algorithm. Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04).
IEEE International Conference, Vol. 1, 17--21 May (2004)
[6] J .C.Segura, A.de la Torre, M.C.Benitez and A.M.Peinado: Model Based Compensation of the Additive
Noise for Continuous Speech Recognition. Experiments Using AURORA II Database and Tasks,
EuroSpeech’01,Vol.I, Page(s):I – 941--944
[7] M.J.F.Gales and S.J.Young : Robust Continuous Speech Recognition Using Parallel Model
Combination, IEEE Trans. Speech and Audio Processing, Vol.4, No.5, pp.352--359(1996).
[8] Renals, S., Morgan, N., Bourlard, H., Cohen, M, and Franco, H. : Connectionist Probability Estimators
in HMM Speech Recognition, IEEE Trans. on Speech and Audio Processing, vol. 2, no. 1, pp. 161--174,
(1994.)
[9] J. Neto, C. Martins and L. Almeida: Speaker-Adaptation in a Hybrid HMM-MLP Recognizer”, in
Proceedings ICASSP '96, Atlanta, Vol. 6, pp.3383--3386 (1996.)
[10] Sadaoki Furui: Digital Speech Processing, Synthesis and Recognition, second edition
[11] Gajic, B.; Paliwal, Kuldip .K. “Robust speech recognition using features based on zero crossings with
peak amplitudes” Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE
International Conference Volume 1, 6-10
[15] Lockwood P. and Boudy J (1992), “Experiments with a non-linear Spectral Subtractor (NSS), HMM
and the projection, for robust speech recognition in cars”, Speech Communication, Vol. 11, Nos. 2-3,
pp.215-228.
[16] Das S., Bakis R., Nadas A., Nahamoo D., and Picheny M., (1993), "Influence of background noise
and microphone on the performance of the IBM TANGORA speech recognition system", Proc. IEEE
ICASSP, Australia, April 1994, VI, pp. 21-23.
[17] Gong Y., (1995), "Speech recognition in noisy environments: A survey", Speech Communication, 16,
pp. 261-291.
[18] Junqua J-C., Haton J-P., (1996), “Robustness in ASR: Fundamentals and Applications”, Kluwer
Academic Publishers.
[19] Tamura S., (1987), "An analysis of a noise reduction multi-layer neural network",
[20] Tamura S., (1990), "Improvements to the noise reduction neural network", IEEE
[21] Xie F. and Campernolle D., (1994),“A family of ‘MLP’ based non-linear spectral estimators for noise
reduction”, IEEE ICASSP ’94, pp. 53-56.
Authors
1) Ibrahim Patel is working as Associate Professor in the BME department, Vishnupur,
Narsapur, Medak (Dist), Andhra Pradesh, India. He received his B.Tech in
E&CE, an M.Tech degree in Biomedical Instrumentation, and is currently pursuing
a Ph.D at Andhra University. He has 16+ years of teaching and research
experience and has published 12 research papers in international conferences and
journals. His main research interest includes voice to sign language.
2) Y. Srinivasa Rao received his Ph.D in Electrical Communication
Engineering from the Indian Institute of Science, Bangalore, in 1998. At present, he
is an Associate Professor in the Instrument Technology Department, AU College of
Engineering, Andhra University, Visakhapatnam, AP, India. He has 15+
years of teaching and research experience and has published 38 research papers in
international journals, presenting several of them at international conferences. He received the
"Emerald Literati Network 2008 Outstanding Paper Award (UK)", the "Best Paper Award" and "Best
Student Paper Award" at the ICSCI International Conference in 2007, and the "Vignan Pratibha Award
2006" from Andhra University; he was selected as a "Science Researcher" for the Asia-Pacific region in 2005
by UNESCO and the Australian Expert Group in Industry Studies (AEGIS) at the University of Western
Sydney (UWS), and received the "Young Scientist Award" from the Department of Science and Technology,
Govt. of India in 2002. His present research focuses on nanotechnology and VLSI. He is a life member of
ISTE and a member of IMAPS (India).