The fundamental techniques used for man-machine communication include speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of speech signals. Harmonic-plus-noise model (HNM) analysis and synthesis provide
high-quality speech with fewer parameters. Dynamic time warping is a well-known technique for
aligning two given multidimensional sequences; it locates an optimal match between them. The
improvement in the alignment is estimated from the corresponding distances. The objective of this
research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based
alignments. Speech signals in the form of twenty-five phrases were recorded. The recorded material
was segmented manually and aligned at sentence, word, and phoneme level. The Mahalanobis distance
(MD) was computed between the aligned frames. The investigation has shown better alignment in the
HNM parametric domain, and it has been seen that effective speech alignment can be carried out even
at phrase level.
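The Mahalanobis distance between aligned frames can be sketched as follows. This is an illustrative NumPy snippet, not code from the study; in practice the inverse covariance matrix `cov_inv` would be estimated from the distribution of the HNM parameter frames.

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between two feature frames x and y,
    given the inverse covariance matrix of the feature distribution."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Illustrative 2-D frames with an identity covariance: the Mahalanobis
# distance then reduces to the ordinary Euclidean distance.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(mahalanobis(x, y, np.eye(2)))  # 5.0
```

With a non-identity covariance, directions of high natural variability contribute less to the distance, which is why MD is preferred over raw Euclidean distance for comparing parameter frames.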
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy with some other task, such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is widely used
for aligning two given multidimensional sequences; it finds an optimal match between them. The distance
between the aligned sequences should be smaller than that between the unaligned sequences, so the
improvement in the alignment may be estimated from the corresponding distances. This technique has
applications in speech recognition, speech synthesis, and speaker transformation. The objective of this
research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based
manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of
six speakers (three male and three female). The recorded material was segmented manually and aligned at
sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM,
and the HNM parameters were further aligned at frame level using DTW. Mahalanobis distances were
computed for each pair of sentences. The investigations have shown more than a 20% reduction in the
average Mahalanobis distance.
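The frame-level DTW alignment described above can be sketched as a standard dynamic program. The snippet below is a minimal illustration with Euclidean frame cost, not the authors' implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping cost between two sequences of
    feature frames (rows of a and b), using Euclidean frame distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # A frame may match, or be repeated on either side (time warp).
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Two toy 1-D "utterances": the second is a time-stretched copy of the
# first, so DTW aligns them with zero residual cost.
a = np.array([[0.0], [1.0], [2.0], [1.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0], [1.0]])
print(dtw_distance(a, b))  # 0.0
```

The same cost computed on unaligned (linearly stretched) sequences would be larger, which is the basis for measuring alignment improvement from distances.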
Performance Calculation of Speech Synthesis Methods for Hindi language (iosrjce)
IOSR Journal of VLSI and Signal Processing (IOSRJVSP) is a double-blind peer-reviewed international journal that publishes articles contributing new results in all areas of VLSI design and signal processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI design and signal processing concepts and to establish new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficients (ijait)
Development of Malayalam speech recognition systems is in its infancy, although much work has been done in other Indian languages. In this paper we present the first work on a speaker-independent Malayalam isolated-speech recognizer based on Perceptual Linear Predictive (PLP) cepstral coefficients and a Hidden Markov Model (HMM). The performance of the developed system has been evaluated with different numbers of HMM states. The system is trained with 21 male and female speakers in the age group of 19 to 41 years, and it obtained an accuracy of 99.5% on unseen data.
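As background for how an HMM-based recognizer scores an utterance (the recognizer picks the digit model with the highest likelihood), here is a minimal forward-algorithm sketch for a discrete-observation HMM. The toy model and numbers are invented for illustration and are not taken from the paper:

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the forward algorithm with per-step scaling to avoid underflow.
    pi: initial state probs, A: state transition matrix,
    B: emission matrix (states x symbols), obs: symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return float(log_lik)

# Toy 2-state, 2-symbol model: sticky states, each state favoring
# one symbol. Changing the number of states changes this model family,
# which is what the paper's evaluation varies.
pi = np.array([1.0, 0.0])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],
              [0.2, 0.8]])
print(forward_log_likelihood(pi, A, B, [0, 0, 1]))
```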
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI... (ijnlc)
Researchers of many nations have developed automatic speech recognition (ASR) to show their national progress in information and communication technology for their languages. This work aims to improve ASR performance for the Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. A CNN can reduce spectral variation and model the spectral correlations present in the signal, owing to its locality and pooling operations; therefore, the impact of these hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hour data set is used for training, and ASR performance was evaluated on two open
test sets: web news and recorded data. As Myanmar is a syllable-timed language, a syllable-based ASR was built and compared with a word-based ASR. As a result, it achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
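Both WER and SER are token-level edit-distance measures. A minimal sketch (tokens are words here; passing syllable tokens through the same routine gives SER):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed as Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i ref words and j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```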
We propose a model for carrying out deep-learning-based multimodal sentiment analysis. The MOUD dataset is taken for experimentation. We developed two parallel models, text-based and audio-based, and then fused these heterogeneous feature maps taken from intermediate layers to complete the architecture. On the performance measures of accuracy, precision, recall, and F1-score, the model is observed to outperform existing models.
Identification of frequency domain using quantum based optimization neural ne... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition... (TELKOMNIKA JOURNAL)
Speech recognition can be defined as the process of converting voice signals into sequences of words
by applying a specific algorithm implemented in a computer program. Research on speech recognition in
Indonesia is relatively limited. This paper studies which feature extraction method, Linear Predictive
Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), is better for speech recognition in the
Indonesian language. This is important because a method that produces high accuracy for a particular
language does not necessarily produce the same accuracy for other languages, considering that every
language has different characteristics. This research therefore hopefully helps accelerate the adoption
of automatic speech recognition for the Indonesian language. There are two main processes in speech
recognition: feature extraction and recognition. The feature extraction methods compared in this study
are LPC and MFCC, while recognition uses a Hidden Markov Model (HMM). The test results showed that
MFCC performs better than LPC in Indonesian-language speech recognition.
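For context on the LPC side of this comparison, the autocorrelation method can be sketched in a few lines of NumPy. This is an illustrative implementation (solving the Yule-Walker normal equations directly), not the paper's code:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Linear-prediction coefficients of one speech frame via the
    autocorrelation method: x[n] is predicted as sum_k a[k] * x[n-k]."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation lags 0..order.
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    # Toeplitz system R a = r[1:]; a are the predictor coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# A noiseless AR(1) signal x[n] = 0.9 * x[n-1] is predicted almost
# exactly, so the order-1 coefficient comes out very close to 0.9.
x = 0.9 ** np.arange(200)
print(lpc_coefficients(x, 1))
```

Real systems typically use an order of 8-16 on windowed 20-30 ms frames and often convert the coefficients to cepstral features before recognition.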
Wavelet Based Noise Robust Features for Speaker Recognition (CSCJournals)
Extraction and selection of the best parametric representation of the acoustic signal is the most important task in designing any speaker recognition system. A wide range of possibilities exists for parametrically representing the speech signal, such as Linear Prediction Coding (LPC), Mel Frequency Cepstrum Coefficients (MFCC), and others. MFCC are currently the most popular choice for any speaker recognition system, though one of their shortcomings is that the signal is assumed to be stationary within the given time frame; MFCC therefore cannot analyze non-stationary signals and are not suitable for noisy speech. To overcome this problem, several researchers have used different types of AM-FM modulation/demodulation techniques for extracting features from the speech signal, and some approaches propose wavelet filterbanks for feature extraction. In this paper, a technique combining the above approaches is proposed: features are extracted from the envelope of the signal and then passed through a wavelet filterbank. It is found that the proposed method outperforms existing feature extraction techniques.
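A wavelet filterbank stage of the kind mentioned above can be illustrated with the simplest case, a one-level Haar decomposition. This sketch is ours, not the authors'; practical systems use longer wavelets (e.g. Daubechies) and several decomposition levels:

```python
import numpy as np

def haar_step(x):
    """One level of a Haar wavelet filterbank: approximation (low-pass)
    and detail (high-pass) coefficients, each at half the input rate."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

# Smooth pairs land in the approximation band; the jump between
# samples shows up in the detail band.
a, d = haar_step([4.0, 4.0, 2.0, 0.0])
print(a)  # [8, 2] / sqrt(2)
print(d)  # [0, 2] / sqrt(2)
```

Repeating `haar_step` on the approximation band yields the usual multi-level subband energies used as features.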
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco... (TELKOMNIKA JOURNAL)
Sundanese is one of the popular languages in Indonesia, so research on the Sundanese language is essential; that is the motivation for this study. The vital parts for achieving high recognition accuracy are feature extraction and the classifier, and the goal of this study was to analyze the former. Three types of feature extraction were tested: Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The results of the three feature extraction methods became the input to the classifier. The study applied Hidden Markov Models as its classifier; however, before classification, the features had to be quantized, which in this study was done with clustering. Each result was compared across the number of clusters and hidden states used. The dataset came from four people who each spoke the digits from zero to nine 60 times. Finally, it was shown that all feature extraction methods produced the same performance on the corpus used.
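The clustering-based quantization step, which maps continuous feature frames to the discrete symbols a discrete HMM consumes, reduces to nearest-codeword assignment once the cluster centers are trained. The codebook and frames below are invented for illustration:

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature frame (row) to the index of its nearest
    codebook vector (Euclidean), i.e. its discrete HMM symbol."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

codebook = np.array([[0.0, 0.0],   # codeword 0
                     [1.0, 1.0]])  # codeword 1
frames = np.array([[0.1, -0.1],
                   [0.9, 1.2],
                   [0.2, 0.3]])
print(quantize(frames, codebook))  # [0 1 0]
```

The codebook itself would come from k-means clustering of training frames, with the number of clusters being one of the parameters the study varies.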
International Journal of Engineering Research and Applications (IJERA) is an open-access online peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation in context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature vector size; curvelets also offer better accuracy and a varying window size, which makes them suitable for non-stationary signals. For better word classification and recognition, a discrete hidden Markov model can be used, as it accounts for the time distribution of speech signals. The HMM classification method attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction methods and the classification phase in a speech recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using the discrete curvelet transform over conventional methods.
In present-day communications, speech signals are contaminated by various sorts of noise that degrade
speech quality and adversely impact speech recognition performance. To overcome these issues, a novel
approach for speech enhancement using modified Wiener filtering is developed: power spectrum
computation is applied to the degraded signal to obtain the noise characteristics from the noisy
spectrum. In the next phase, an MMSE technique is applied, in which the Gaussian distribution of each
signal (original and noisy) is analyzed. The Gaussian distribution provides the spectrum estimate and
spectral-coefficient parameters that can be used for probabilistic model formulation. Moreover,
a-priori SNR computation is incorporated for coefficient updating and noise-presence estimation,
operating similarly to a conventional VAD. However, the conventional VAD scheme is based on a hard
threshold, which cannot deliver satisfactory performance, so a soft-decision threshold is developed to
improve speech enhancement performance. An extensive simulation study is carried out in MATLAB on the
NOIZEUS speech database, and a comparative study shows that the proposed approach performs better than
the existing technique.
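The core of a Wiener-filter enhancer of this kind is a per-frequency gain derived from the a-priori SNR. The sketch below uses plain spectral subtraction for the SNR estimate and is illustrative only; it is not the proposed modified filter or its MMSE/soft-decision machinery:

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=1e-3):
    """Per-frequency Wiener gain G = SNR / (1 + SNR), where the a-priori
    SNR is estimated here by simple spectral subtraction. A gain floor
    avoids zeroing bins entirely (which causes musical noise)."""
    snr_prio = np.maximum(noisy_power - noise_power, 0.0) / noise_power
    return np.maximum(snr_prio / (1.0 + snr_prio), floor)

# Speech-dominated bins keep nearly all their energy (gain near 1),
# while noise-only bins are attenuated down to the floor.
noisy = np.array([100.0, 1.0])  # |Y(f)|^2 per frequency bin
noise = np.array([1.0, 1.0])    # estimated noise power per bin
print(wiener_gain(noisy, noise))  # gains 0.99 and 0.001
```

The enhanced spectrum is the noisy spectrum multiplied by this gain, followed by an inverse STFT and overlap-add.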
Broad phoneme classification using signal based features (ijsc)
Speech is the most efficient and popular means of human communication, and it is produced as a sequence
of phonemes. Phoneme recognition is the first step performed by an automatic speech recognition system.
State-of-the-art recognizers use mel-frequency cepstral coefficient (MFCC) features derived through
short-time analysis, for which the recognition accuracy is limited. Instead, broad phoneme
classification is achieved here using features derived directly from the speech at the signal level
itself. Broad phoneme classes include vowels, nasals, fricatives, stops, approximants, and silence. The
features identified as useful for broad phoneme classification are the voiced/unvoiced decision,
zero-crossing rate (ZCR), short-time energy, most dominant frequency, energy in the most dominant
frequency, spectral flatness measure, and the first three formants. Features derived from short-time
frames of training speech are used to train a multilayer feedforward neural-network classifier with
manually marked class labels as output, and classification accuracy is then tested. This broad phoneme
classifier is later used for broad syllable-structure prediction, which is useful for applications such
as automatic speech recognition and automatic language identification.
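Two of the signal-level features listed, zero-crossing rate and short-time energy, are simple enough to sketch directly. This is illustrative NumPy, not the paper's code; the test signals are synthetic stand-ins for unvoiced and voiced frames:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ: high for
    fricatives and other unvoiced sounds, low for voiced sounds."""
    return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

def short_time_energy(frame):
    """Mean squared amplitude of the frame: high for vowels, near zero
    for silence."""
    return float(np.mean(frame ** 2))

t = np.arange(160)                        # one 160-sample frame
noisy_like = np.cos(np.pi * t)            # alternates sign every sample
voiced_like = np.sin(2 * np.pi * t / 80)  # two slow pitch-like periods
print(zero_crossing_rate(noisy_like))     # 1.0
print(zero_crossing_rate(voiced_like))    # small
```

Thresholding these two values already separates silence, voiced, and unvoiced frames reasonably well, which is why they appear in the feature set above.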
Evaluation of Hidden Markov Model based Marathi Text-To-Speech Synthesis System (IJERA Editor)
The objective of this paper is to evaluate the quality of an HMM-based Marathi TTS system. The main advantage of the HMM technique is its ability to vary the voice easily; the speech output produced by this method better conveys emotion, style, and intonation. Naturalness and intelligibility are the two important parameters for judging the quality of synthetic speech. Depending on the parameters specified, the synthetic speech results are categorized into four categories: natural speech, high-quality synthetic speech, low-quality synthetic speech, and moderate-quality synthetic speech. The results are obtained using CT, DRT, and MOS tests.
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in science and technology and to motivate researchers in challenging areas of science and technology.
CIM Formación: Legislation on companion animals of the Generalitat Valen...
From CIM Formación we present the LEGISLATION ON COMPANION ANIMALS in the Valencian Community, prepared by Amparo Requena.
It reviews Law 4/1994, of 8 July, of the Generalitat Valenciana.
Special mention is made of potentially dangerous animals (Decree 145/2000, of 26 September, of the Valencian Government; the national Law 50/1999, of 23 December, applies subsidiarily),
and of the revision by Organic Law 5/2010, of 22 June.
Effect of Time Derivatives of MFCC Features on HMM Based Speech Recognition S...
In this paper, improvement of an ASR system for the Hindi language, based on vector-quantized MFCC feature vectors and an HMM classifier, is discussed. MFCC features are usually pre-processed before being used for recognition. One such pre-processing step is to compute delta and delta-delta coefficients and append them to the MFCC to create the feature vector. This paper focuses on all digits in Hindi (zero to nine), using an isolated-word structure. Performance of the system is evaluated by the Recognition Rate (RR). Combining the Delta MFCC (DMFCC) feature with the Delta-Delta MFCC (DDMFCC) feature shows approximately 2.5% further improvement in the RR, with no additional computational cost involved. The RR for speakers involved in the training phase is found to be better than that for speakers who were not involved in training. Word-wise RR is observed to be good for digits with distinct phones.
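The delta coefficients described above are usually computed by a regression over neighbouring frames. A minimal Python sketch of the standard formula follows (illustrative only; the abstract does not state the regression window width, so N=2 is assumed):

```python
def delta(coeffs, N=2):
    """Delta (time-derivative) features via the standard regression formula.

    coeffs: list of frames, each a list of cepstral coefficients.
    Frames at the edges are handled by repeating the first/last frame.
    """
    denom = 2 * sum(n * n for n in range(1, N + 1))  # = 10 for N = 2
    T = len(coeffs)
    padded = [coeffs[0]] * N + coeffs + [coeffs[-1]] * N
    deltas = []
    for t in range(T):
        # regression over +/- N neighbouring frames, per coefficient k
        frame = [sum(n * (padded[t + N + n][k] - padded[t + N - n][k])
                     for n in range(1, N + 1)) / denom
                 for k in range(len(coeffs[0]))]
        deltas.append(frame)
    return deltas
```

Delta-delta coefficients are obtained by applying the same function to the delta features, which is why appending them adds essentially no extra computational cost.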
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the phonetic distance between pronunciations, and a critical-distance threshold criterion for the classification. The proposed method consists of two steps: a training step that estimates the critical distance parameter from transcribed data, and a classification step that uses this criterion to classify input utterances into pronunciation variants and OOV words.
The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. A confusion matrix and the precision, recall, and accuracy metrics are used for performance evaluation. Experimental results show significant improvement over existing classifiers.
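DPW itself is the authors' contribution; as a generic illustration of the underlying idea, the phonetic distance between two pronunciations can be sketched as an edit distance over phone sequences, followed by the critical-distance test (hypothetical phone labels and unit edit costs assumed, not the paper's cost model):

```python
def edit_distance(phones_a, phones_b):
    """Levenshtein distance between two phone sequences (unit costs)."""
    m, n = len(phones_a), len(phones_b)
    # dp[j] holds the distance between the current prefix of phones_a and phones_b[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if phones_a[i - 1] == phones_b[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # match / substitution
            prev = cur
    return dp[n]

def classify(distance, critical_distance):
    """Within the critical distance: pronunciation variant; otherwise OOV."""
    return "variant" if distance <= critical_distance else "OOV"
```

A distance of 1 between, say, /t ah m ey t ow/ and /t ah m aa t ow/ would typically fall under the critical distance and be labelled a variant.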
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include formant synthesis and articulatory synthesis. Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory synthesis produces speech by directly modeling human articulator behavior. Second-generation techniques comprise concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech output by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and phases of the harmonics, so the value of the fundamental can be changed while keeping the same basic spectral envelope. The third generation includes Hidden Markov Model (HMM) synthesis and unit selection synthesis. HMM synthesis trains a parametric model and produces high-quality speech. Finally, unit selection operates by selecting the best sequence of units from a large speech database that matches the specification.
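The harmonic model used in sinusoidal synthesis can be sketched as a sum of harmonics of the fundamental. The following illustrative Python fragment (parameter values are assumptions, not from the paper) reconstructs one frame from per-harmonic amplitudes and phases:

```python
import math

def synthesize_harmonic_frame(f0, amplitudes, phases,
                              sample_rate=16000, n_samples=400):
    """Reconstruct one frame as a sum of harmonics of the fundamental f0.

    amplitudes[k] and phases[k] belong to harmonic (k + 1) * f0.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        s = sum(a * math.cos(2 * math.pi * (k + 1) * f0 * t + p)
                for k, (a, p) in enumerate(zip(amplitudes, phases)))
        frame.append(s)
    return frame
```

Pitch modification then amounts to resynthesizing with a different f0 while reusing the same (spectrally resampled) amplitude envelope.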
Machine learning for Arabic phonemes recognition using electrolarynx speech
Automatic speech recognition is one of the essential ways of interacting with machines, and interest in speech-based intelligent systems has grown in the past few decades. There is therefore a need to develop more efficient methods of human speech recognition to ensure reliable communication between individuals and machines. This paper is concerned with Arabic phoneme recognition for the electrolarynx, a device used by cancer patients whose vocal laryngeal cords have been removed. The goal is to find the preferred machine learning model for classifying phonemes produced by the electrolarynx. The phoneme recognition employs different machine learning schemes, including convolutional neural networks, recurrent neural networks, artificial neural networks (ANN), random forests, extreme gradient boosting (XGBoost), and long short-term memory. Modern Standard Arabic is used for the testing and training phases of the recognition system. The dataset covers both ordinary speech and electrolarynx speech recorded by the same person. Mel frequency cepstral coefficients are used as speech features. The results show that the ANN method outperformed the other methods with an accuracy of 75%, a precision of 77%, and a phoneme error rate (PER) of 21.85%.
Speech emotion recognition with light gradient boosting decision trees machine
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation in context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based Feature Extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature-vector size; the varying window size also makes the method suitable for non-stationary signals. For word classification and recognition, a discrete hidden Markov model is used, as it captures the time distribution of speech signals. The HMM classification method attained detection rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction and classification phases of a speech recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using discrete curvelet transforms over conventional methods.
Speaker Recognition System using MFCC and Vector Quantization Approach
This paper presents an approach to speaker recognition using frequency spectral information on the Mel scale to improve speech feature representation in a Vector Quantization (VQ) codebook based recognition approach. The Mel frequency approach extracts features of the speech signal to obtain the training and testing vectors. The VQ codebook approach uses the training vectors to form clusters and recognizes speakers accurately with the help of the LBG algorithm.
Speech to text conversion for visually impaired person using µ-law companding
The paper presents the overall design and implementation of a DSP-based speech recognition and text conversion system. Speech is usually a preferred mode of operation for human beings; this paper presents voice-oriented commands for conversion into text. We intended to compute the entire speech processing in real time. This involves simultaneously accepting input from the user and using software filters to analyse the data. The comparison is then established using correlation and µ-law companding techniques. In this paper, voice recognition is carried out using MATLAB. The voice command is person-independent. The voice commands are stored in the database with the help of the function keys. The real-time input speech is then processed in the speech recognition system, where the required features of the spoken words are extracted, filtered, and matched against the existing samples stored in the database. The required MATLAB processes are then applied to convert the received data into text form.
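The µ-law companding step can be illustrated with the standard compression and expansion formulas (a generic sketch, not the paper's MATLAB code; µ = 255, the value used in North American telephony, is assumed):

```python
import math

def mu_law_compress(x, mu=255):
    """mu-law companding of a sample x in [-1, 1]:
    y = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log(1 + mu * abs(x)) / math.log(1 + mu), x)

def mu_law_expand(y, mu=255):
    """Inverse mu-law: x = sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)
```

Compressing before comparison boosts low-amplitude detail, which is why companding helps when matching quiet speech against stored templates.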
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decom...
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied in this paper for the extraction of emotional features from speeches, and a deep neural network (DNN) is used to classify speech emotions. This paper enhances the emotional components in speech signals by using EMD with acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates of emotions from speeches using the classifier DNN. In this paper, EMD is first used to decompose the speech signals, which contain emotional components into multiple intrinsic mode functions (IMFs), and then emotional features are derived from the IMFs and are calculated using MFCC. Then, the emotional features are used to train the DNN model. Finally, a trained model that could recognize the emotional signals is then used to identify emotions in speeches. Experimental results reveal that the proposed method is effective.
Hindi digits recognition system on speech data collected in different natural...
This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs across speakers, since the data were collected in their respective homes: vehicle horn noise in road-facing rooms, internal background noise such as opening doors, and silence in other rooms. All these recordings are used for training the acoustic model, which is trained on audio data from 8 speakers. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate of the recognizer. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.
Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice. Voice recognition combines the two: it uses learned aspects of a speaker's voice to determine what is being said. Such a system cannot recognize speech from random speakers very accurately, but it can reach high accuracy for the individual voices it has been trained with, which gives it various applications in day-to-day life.
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
The proposed system is a syllable-based Myanmar speech recognition system with three stages: feature extraction, phone recognition, and decoding. In feature extraction, the system transforms the input speech waveform into a sequence of acoustic feature vectors, each representing the information in a small time window of the signal. The likelihood of the observed feature vectors given linguistic units (words, phones, subparts of phones) is then computed in the phone recognition stage. Finally, the decoding stage takes the acoustic model (AM), which consists of this sequence of acoustic likelihoods, plus a phonetic dictionary of word pronunciations, combined with the language model (LM), and produces the most likely sequence of words as output. The system creates the language model for Myanmar using syllable segmentation and a syllable-based n-gram method.
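The syllable-based n-gram language model mentioned above can be illustrated with maximum-likelihood bigram estimation (a generic sketch with made-up syllables, not the system's actual training code):

```python
from collections import Counter

def bigram_counts(syllable_sentences):
    """Count syllable unigrams and bigrams with sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for syllables in syllable_sentences:
        seq = ["<s>"] + syllables + ["</s>"]
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    return unigrams, bigrams

def bigram_prob(prev, cur, unigrams, bigrams):
    """Maximum-likelihood P(cur | prev); no smoothing applied."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
```

A real decoder would smooth these estimates (e.g. back-off) before combining them with the acoustic likelihoods.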
Emotional Telugu speech signals classification based on k-NN classifier
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS
International Journal on Natural Language Computing (IJNLC) Vol. 2, No. 2, April 2013
DOI : 10.5121/ijnlc.2013.2205
EFFECT OF MFCC BASED FEATURES FOR SPEECH
SIGNAL ALIGNMENTS
Jang Bahadur Singh, Radhika Khanna and Parveen Lehana*
Department of Physics and Electronics, University of Jammu, Jammu, India
E-mail: sonajbs@gmail.com, pklehanajournals@gmail.com.
ABSTRACT
The fundamental techniques used for man-machine communication include speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of the speech signal. HNM analysis and synthesis provides high-quality speech with
a smaller number of parameters. Dynamic time warping is a well-known technique for aligning two
given multidimensional sequences: it locates an optimal match between them. The improvement in
the alignment is estimated from the corresponding distances. The objective of this research is to
investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments.
Speech signals in the form of twenty-five phrases were recorded. The recorded material was
segmented manually and aligned at sentence, word, and phoneme level. The Mahalanobis distance (MD)
was computed between the aligned frames. The investigation has shown better alignment in the HNM
parametric domain, and it has been seen that effective speech alignment can be carried out even at
the phrase level.
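For reference, the Mahalanobis distance between aligned frames can be sketched as follows, assuming a diagonal covariance matrix (a simplification for illustration; the paper does not state here which covariance estimate it uses):

```python
import math

def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance between feature vectors x and y under a
    diagonal covariance matrix given by per-dimension variances."""
    return math.sqrt(sum((a - b) ** 2 / v
                         for a, b, v in zip(x, y, variances)))
```

Unlike plain Euclidean distance, each dimension is scaled by its variance, so noisy feature dimensions contribute less to the alignment cost.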
KEYWORDS
Speech recognition, Speech transformation, MFCC, HNM, DTW
1. INTRODUCTION
Speech is one of the most dominant and natural means of communicating and expressing thoughts,
ideas, and emotions between individuals [1]. It is a complicated signal, since various processes
are involved in its generation by human beings. As a result, verbal communication can vary
extensively in accent, pronunciation, articulation, nasality, pitch, volume, and speed [2], [3].
Speech signal processing is one of the fastest-growing areas of research in signal processing.
The aim of digital speech signal processing is to take advantage of digital computing techniques
to process the speech signal for increased understanding, improved communication, and increased
efficiency and productivity in speech-related activities. The major fields of speech signal
processing are speech synthesis, speech recognition, and speech/speaker transformation.
In order to extract features, the speech signal has to be divided into fixed frames using window
functions such as Hanning, Hamming, cosine, or triangular windows. The frame length can be varied
depending on the nature of the feature vectors to be extracted [4]. Two consecutive windows are
overlapped in order to maintain continuity. Feature extraction approaches mostly depend on
modeling the human voice production and perception systems [5]. They provide a compressed
representation of the speech signal by extracting a sequence of feature vectors [6].
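Framing and windowing as described above can be sketched in Python. The frame length and hop size below (25 ms frames with 50 % overlap at a 16 kHz sampling rate) are illustrative assumptions, not values from the paper:

```python
import math

def frame_signal(signal, frame_len=400, hop=200):
    """Split a signal into overlapping frames and apply a Hamming window.

    frame_len=400 and hop=200 give 25 ms frames with 50 % overlap at 16 kHz.
    """
    # Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

Each returned frame is then passed to the feature extractor (e.g. MFCC or LSF analysis) to produce one feature vector per frame.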
2. SPEECH RECOGNITION
It is an alternative and effective method of interacting with a machine. It is also known as
automatic speech recognition (ASR) converts spoken language in text. It has been two decades
since the ASR have started moving from research labs to real-world. Speech recognition is more
difficult than speech generation because of the fact that computers can store and recall enormous
amounts of data, perform mathematical computations at very high speed, and do repetitive tasks
without losing any type of efficiency. Because of these limitations, the accuracy of speech
recognition is reduced. There are two main steps for speech recognition: feature extraction and
feature matching. Isolated words of the input speech signal are analysed to obtain the speech
parameters using Mel frequency cepstral coefficients (MFCC) or line spectral frequencies (LSF).
These parameters provide information about the dynamically changing vocal tract during speech
production. They are then matched against previously stored patterns of spoken words to
recognize the closest match. Similar steps may be used for identity matching of a given speaker
[7]. There are two types of speech recognition: speaker-dependent and speaker-independent.
A speaker-dependent system works by learning the distinctive characteristics of a single
speaker, as in voice recognition, while speaker-independent systems involve no training, as
they are designed to recognize anyone's voice. As the acoustic spaces of the speakers are
multidimensional, reducing their dimensionality is very important [8]. The most common
method for training a speaker recognition system is the hidden Markov model (HMM), and its
latest variant is the combined HMM-GMM [9]. For better alignment or matching, normalization of
the sub-band temporal modulation envelopes may be used [17]. The main factors responsible for
stagnation in the field of speech recognition are environmental noise, channel distortion, and
speaker variability [10], [11], [12].
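The MFCC extraction step mentioned above can be sketched as follows. This is a minimal textbook pipeline (power spectrum, triangular mel filterbank, log compression, DCT) with illustrative constants; it is not the exact configuration used in this work:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """One windowed frame -> power spectrum -> mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # Filter centre frequencies equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energy = np.log(fbank @ spec + 1e-10)
    return dct(log_energy, norm="ortho")[:n_ceps]

# A 25 ms, 1 kHz test-tone frame at 16 kHz sampling (illustrative values).
frame = np.hamming(400) * np.sin(2 * np.pi * 1000 * np.arange(400) / 16000)
ceps = mfcc(frame)
print(ceps.shape)  # (13,)
```

Keeping only the first 13 coefficients discards fine spectral detail and retains the smooth vocal-tract envelope, which is exactly the compressed representation feature matching relies on.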
Fig. 1 Blocks of speech recognition as pattern matching problem [13].
Let us consider simple speech recognition as a pattern matching problem. As shown in fig. 1, the
system accepts an input speech waveform. The feature extraction sub-block converts the input
speech waveform into a set of feature vectors. These vectors represent speech characteristics
suitable for detection and are then matched against reference patterns to find the closest
sample. If the input speech is produced by the same speaker, or under environmental conditions
similar to those of the reference patterns, the system is likely to find the correct pattern;
otherwise it is not [13], [14]. A few valuable approaches used for feature extraction are:
full-band and subband energies [15], spectral divergence between the speech signal and the
background noise [16], pitch estimation [17], zero crossing rate [18], and higher-order
statistics [19], [20], [21], [22]. However, using long-term speech information [23], [24] has
shown considerable benefits for identifying speech signals in noisy environments. Several
methods used for formulating the rules in the decision module are based on distance measures
such as the Euclidean distance [25] and the Itakura-Saito and Kullback-Leibler divergences
[26]. Other methods are based on fuzzy logic [27], support vector machines (SVM) [28], and
genetic algorithms [29], [30].
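A minimal sketch of the Euclidean-distance decision rule mentioned above, with toy templates standing in for real stored reference patterns (a real system would store MFCC template sequences per word):

```python
import numpy as np

# Toy reference patterns: two "words", each a 5-frame sequence of
# 3-dimensional feature vectors (illustrative stand-ins for MFCC templates).
references = {
    "yes": np.zeros((5, 3)),
    "no": np.ones((5, 3)),
}

def nearest_reference(features, references):
    """Decision rule based on the Euclidean distance: return the name of
    the reference pattern with the smallest mean frame-wise distance."""
    dists = {name: float(np.mean(np.linalg.norm(features - ref, axis=1)))
             for name, ref in references.items()}
    return min(dists, key=dists.get)

observed = np.full((5, 3), 0.2)   # closer to the "yes" template
best = nearest_reference(observed, references)
print(best)  # yes
```

This rule assumes the observed and reference sequences have equal length; when they do not, the frame pairing itself must first be found, which is precisely the job DTW performs in the methodology below.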
3. METHODOLOGY
The speech signals were analysed at phrase, word, and phoneme levels using raw recordings of
six speakers in the Hindi language. The speakers were of different age groups and from
different regions of Jammu. A Sony recorder (ICD-UX513F), a 4 GB UX digital voice recorder
with expandable memory that provides high-quality voice recording, was used to record the
speech. The investigation further includes segmentation and labeling of the recorded speech at
the phrase, word, and phoneme levels, followed by feature extraction and, finally, alignment
of the source and target feature vectors using the DTW technique and calculation of the
Mahalanobis distance between them. Fig. 2 shows a sub-section of a speech signal waveform and
spectrogram labeled at the phoneme level.
Fig. 2 Sub-section of speech signal waveform and spectrogram labeled at phoneme level.
The main experiment is divided into two sections. In the first section, feature vectors were
extracted from the segmented source and target speech using the MFCC algorithm. The extracted
features were aligned by means of the DTW technique, and the alignment error was calculated
using the Mahalanobis distance. In the second section, the segmented source and target speech
were first analysed by the HNM module in order to obtain the HNM parameters, after which the
MFCC features were calculated. The extracted features were again aligned by DTW, and the
alignment error was calculated using the Mahalanobis distance. These steps were applied at the
phrase, word, and phoneme levels of the speech signal.
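The DTW alignment and Mahalanobis error computation described above can be sketched as follows. The toy random sequences stand in for real MFCC/HNM feature vectors, and estimating the inverse covariance from the pooled frames is an illustrative choice, not necessarily the estimator used in this paper:

```python
import numpy as np

def dtw_align(src, tgt):
    """Classic DTW over two feature sequences (frames x dimensions).
    Returns the optimal alignment path as (src_index, tgt_index) pairs."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    path, i, j = [], n, m                                # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def mahalanobis(u, v, inv_cov):
    """Mahalanobis distance between two feature vectors."""
    diff = u - v
    return float(np.sqrt(diff @ inv_cov @ diff))

# Toy source/target sequences standing in for extracted feature vectors.
rng = np.random.default_rng(0)
src = rng.standard_normal((8, 3))
tgt = rng.standard_normal((10, 3))
path = dtw_align(src, tgt)
inv_cov = np.linalg.inv(np.cov(np.vstack([src, tgt]).T))  # pooled covariance
error = np.mean([mahalanobis(src[i], tgt[j], inv_cov) for i, j in path])
print(path[0], path[-1], round(error, 3))
```

Unlike the plain Euclidean distance, the Mahalanobis distance weights each feature dimension by its variability, so dimensions that naturally vary widely across frames do not dominate the alignment-error measure.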
4. RESULTS AND DISCUSSION
The results are based on the techniques, namely MFCC, HNM, and DTW, applied to the alignment
of segmented speech at phrase, word, and phoneme levels. The investigations were carried out
with male to male, female to male, and female to female speaker combinations. In order to
compare the alignment techniques, the mean and standard deviation of the Mahalanobis distances
were calculated. Alignment errors, measured as Mahalanobis distances, for the various
male-male combinations are presented in fig. 3 to fig. 5. The speaker pair used as source and
target is written at the bottom of each plot; e.g. aman-s-nakul represents source-target
alignment without HNM at sentence level, while aman-sh-nakul represents source-target
alignment with HNM at sentence level.
From the graphs below it can be seen that the alignment error decreases at all segmentation
levels of the speech when the HNM model is used, as the Mahalanobis distances are reduced.
Therefore, by using the HNM model, the alignment can be further improved in comparison with
feature extraction based on MFCC alone. A decrease in the Mahalanobis distance means a better
alignment between the two sequences.
The overall comparisons of the speech alignment are shown in the form of bar graphs. It can be
seen that better speech alignment is achieved with sentence-level segmentation than with
word-level or phoneme-level segmentation of the speech.
Fig. 3 MFCC based speech alignment errors using Mahalanobis distance at sentence level with male
to male combination. The first column shows the results for alignment without the HNM based
method and the second column shows the results for the HNM based method.
Fig. 4 MFCC based speech alignment errors using Mahalanobis distance at word level with male to
male combination. The first column shows the results for alignment without the HNM based
method and the second column shows the results for the HNM based method.
Fig. 5 MFCC based speech alignment errors using Mahalanobis distance at phoneme level with male to
male combination. The first column shows the results for alignment without the HNM based method
and the second column shows the results for the HNM based method.
Fig. 6 Bar graphs illustrating the mean and standard deviation of the averaged MFCC based speech
alignment errors, measured as Mahalanobis distances, at sentence, word, and phoneme levels for all
male to male, female to male, and female to female combinations. The symbols s, w, p show the
results for alignment without the HNM based method and the symbols s”, w”, p” show the results
for the HNM based method.
5. CONCLUSION
The speech signals in the form of twenty-five phrases were recorded from each of six speakers.
Investigations were carried out with three different speaker combinations: male to male, female
to male, and female to female. The feature vectors were extracted using the MFCC and HNM
techniques. Speech alignment errors, computed as Mahalanobis distances between the labeled
phrases, words, and phonemes aligned by means of DTW, were calculated. From the analysis of the
results it can be observed that the alignment error with the HNM model decreases at all levels
of labeled speech. Therefore, by implementing the HNM model, the alignment error can be further
reduced in comparison with feature extraction based on MFCC alone. A reduction in the alignment
error means a better alignment or matching between the two speech signals. In brief, this
research work concludes the following:
1) HNM based alignment is more promising than the other existing techniques.
2) The accuracy of speech recognition cannot be considerably increased even by labeling the
phrases at the phoneme level. Effective speech alignment can be obtained at phrase level as
well, which saves valuable time and removes an unnecessary step from the algorithm.
3) It is remarkable that the speech alignment error in the female-male case is much larger
than in the male-male and female-female combinations. Therefore such a combination should
be avoided in speech recognition and speaker transformation.
REFERENCES
[1] Zaher A A (2009) “Recent Advances in Signal Processing”, Publisher: InTech, pp. 544.
[2] Tebelskis J (1995) “Speech Recognition using Neural Networks”, Ph.D. dissertation, Dept. School
of Computer Science, Carnegie Mellon University Pittsburgh, Pennsylvania.
[3] http://www.learnartificialneuralnetworks.com/speechrecognition.html
[4] Lindholm S (2010) “A speech recognition system for Swedish running on Android”.
[5] Ursin M (2002) “Triphone Clustering in Finnish Continuous Speech Recognition”, M.S thesis,
Department of Computer Science, Helsinki University of Technology, Finland.
[6] Trabelsi I and Ayed D Ben (2012) “On the Use of Different Feature Extraction Methods for Linear
and Non Linear kernels”, Proc. IEEE, Sciences of Electronic, Technologies of Information and
Telecommunications.
[7] Rabiner L R (1989) “A tutorial on Hidden Markov Models and selected applications in speech
recognition”.
[8] Wang X & Paliwal K K (2003) “Feature extraction and dimensionality reduction algorithms and
their applications in vowel recognition”, Pattern Recognition, vol. 36, pp. 2429.
[9] Muller F & Mertins A (2011) “Contextual invariant-intergration features for improved speaker-
independent speech recognition”, Speech communication, pp. 830.
[10] Lu X, Unoki M & Nakamura S (2011) “Sub-band temporal modulation envelopes and their
normalization for automatic speech recognition in reverberant environments”, Computer Speech
and Language, pp. 571.
[11] Mporas I, Ganchev T, Kacsis O & Fakotakis N (2011) “Context-adaptive pre-processing scheme
for robust speech recognition in fast-varying noise environments”, Signal Processing, vol. 91, pp.
2101.
[12] Dai P & Soon I Y (2011) “A temporal warped 2D psychoacoustic modeling for robust speech
recognition system”, Speech Communication, vol. 53, pp. 229.
[13] Rabiner L R & Juang B H (1993) “Fundamentals of Speech Recognition”, Prentice-Hall.
[14] Doh S J (2000) “Enhancements to Transformation-Based Speaker Adaptation: Principal
Component and Inter-Class Maximum Likelihood Linear Regression”, PhD, Department of
Electrical and Computer Engineering, Carnegie Mellon University Pittsburgh.
[15] Woo K, Yang T, Park K & Lee C (2000) “Robust voice activity detection algorithm for estimating
noise spectrum”, Electronics Letters, vol. 36, pp. 180.
[16] Marzinzik M & Kollmeier B (2002) “Speech pause detection for noise spectrum estimation by
tracking power envelope dynamics”, IEEE Trans. Speech Audio Processing, vol. 10, pp. 341.
[17] Tucker R (1992), “Voice activity detection using a periodicity measure”, Proc. Institute of
Electronic Engineering, vol. 139, pp. 377.
[18] Bachu R G, Kopparthi S, Adapa B. & Barkana B D (2008) “Separation of Voiced and Unvoiced
Using Zero Crossing Rate and Energy of the Speech Signal”, Proc., American Society for
Engineering Education, Zone Conference, pp. 1.
[19] Nemer E, Goubran R & Mahmoud S (2001) “Robust voice activity detection using higher order
statistics in the lpc residual domain”, IEEE Trans. Speech Audio Processing, vol. 9, pp. 217.
[20] Ramirez J, Gorriz J, Segura J C, Puntonet C G & Rubio A (2006) “Speech/Non-speech
Discrimination based on Contextual Information Integrated Bispectrum LRT”, IEEE Signal
Processing Letters, vol. 13, pp. 497.
[21] Gorriz J M, Ramirez J, Puntonet C G & Segura J C (2006) “Generalized LRT-based voice activity
detector”, IEEE Signal Processing Letters, vol. 13, pp. 636.
[22] Ramírez J, Gorriz J M & Segura J C (2007) “Statistical Voice Activity Detection Based on
Integrated Bispectrum Likelihood Ratio Tests”, Journal of the Acoustical Society of America,
vol. 121, pp. 2946.
[23] Ramirez J, Segura J C, Benitez C, Torre A & Rubio A (2004), “Efficient Voice Activity Detection
Algorithms Using Long-Term Speech Information”, Speech Communication, vol. 42, pp. 271.
[24] Ramirez J, Segura J C, Benitez C, Torre A & Rubio A (2005) “An Effective Subband OSF-based
VAD with Noise Reduction for Robust Speech Recognition”, IEEE Transactions on Speech and
Audio Processing, vol. 13, pp. 1119.
[25] Gorriz J M, Ramirez J, Segura J C & Puntonet C G (2006) “An effective cluster-based model for
robust speech detection and speech recognition in noisy environments”, Journal of the Acoustical
Society of America, vol. 120, pp. 470.
[26] Ramirez J, Segura J C, Benitez C, Torre A & Rubio A (2004) “A New Kullback- Leibler VAD for
Robust Speech Recognition”, IEEE Signal Processing Letters, vol. 11, pp. 266.
[27] Beritelli F, Casale S, Rugeri G & Serrano S (2002) “Performance evaluation and comparison of
G.729/AMR/fuzzy voice activity detectors”, IEEE Signal Processing Letters, vol. 9, pp. 85.
[28] Ramirez J, Yelamos P, Gorriz J M & Segura J C (2006) “SVM-based Speech Endpoint Detection
Using Contextual Speech Features”, IEE Electronics Letters, vol. 42.
[29] Estevez P A, Yoma N, Boric N & Ramirez J A (2005) “Genetic programming based voice activity
detection”, Electronics Letters, vol. 41, pp. 1141.
[30] Ramirez J, Gorriz J M & Segura J C (2007) “Voice Activity Detection. Fundamentals and Speech
Recognition System Robustness”, Robust Speech Recognition and Understanding, I-TECH
Education and Publishing, pp. 1.