An efficient and intuitive algorithm is presented for identifying speakers in long recordings (such as long
YouTube discussions, or audio or video recorded at a cocktail party). The goal of automatic speaker
identification is to determine the number of distinct speakers and to build a model for each speaker by
extracting and characterizing the speaker-specific information contained in the speech signal. It has many
diverse applications, especially in surveillance, airport immigration, cyber security, and the transcription
of recordings with several similar-sounding sources, where it is difficult to assign transcriptions to speakers.
The speech parameterization most commonly used in speaker verification, cepstral analysis, is detailed
along with K-means clustering. Gaussian mixture modeling, the speaker modeling technique, is then explained.
Gaussian mixture models (GMMs), perhaps the most robust machine learning algorithm, are introduced to
examine text-independent speaker identification. The use of GMMs for modeling speaker identity is motivated
by the observation that Gaussian components represent the characteristic spectral shapes of a speaker and
by the remarkable ability of GMMs to model arbitrary densities. We then describe expectation maximization
(EM), an iterative algorithm that starts from an arbitrary initial estimate and continues iterating until
the values converge. We aimed for 85 ~ 95% accuracy using vector quantization and Gaussian mixture speaker
models. Across a number of experiments we obtained identification rates of 79 ~ 82% using vector
quantization and 85 ~ 92.6% using GMM modeling with EM parameter estimation, depending on the parameter
settings.
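The EM loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one-dimensional features (real systems use multi-dimensional cepstral vectors), and the names `fit_gmm_em` and `avg_log_likelihood` are hypothetical. It has no log-domain safeguards during training, so it suits only small, well-scaled examples.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_em(data, init_means, tol=1e-8, max_iter=500, var_floor=1e-4):
    """Fit a 1-D GMM by EM, starting from the given initial means."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: responsibility of each component for each sample.
        resp, ll = [], 0.0
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # Stop once the log-likelihood no longer changes.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances, ll

def avg_log_likelihood(data, model):
    """Average log-likelihood of the data under a (weights, means, variances) model."""
    weights, means, variances = model
    total = 0.0
    for x in data:
        p = sum(w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))  # floor avoids log(0) for distant samples
    return total / len(data)
```

Under these assumptions, identification amounts to fitting one model per enrolled speaker and choosing the speaker whose model gives the test features the highest average log-likelihood.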
The accuracy of a speaker identification system degrades severely in the presence of noise. Gammatone frequency cepstral coefficients (GFCC) are auditory-based features that mimic the ability of the human ear to identify a speaker in a noisy environment. GFCC is based on the gammatone filter bank, which models the basilar membrane as a sequence of overlapping band-pass filters. The system uses GFCC features and a Gaussian Mixture Model - Universal Background Model (GMM-UBM) to model the features of each speaker. The experiments are conducted on the English Language Speech Database for Speaker Recognition (ELSDSR). To evaluate the performance of the system, the test utterances are mixed with noise at various SNR levels to simulate channel changes. The results show that the GFCC features give good identification accuracy not only on clean samples but also in noisy conditions.
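Mixing test utterances with noise at a chosen SNR comes down to scaling the noise so that the ratio of speech power to noise power matches the target. The sketch below assumes signals are plain lists of samples; the function names are hypothetical and the abstract does not say how the authors performed the mixing.

```python
import math

def mean_power(signal):
    """Average power (mean of squared samples) of a signal."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that 10*log10(P_speech / P_noise) equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # Noise power required for the requested SNR.
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain
```

Returning the gain alongside the mixture makes it easy to verify the achieved SNR or to reuse the same scaling across feature types.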
FPGA-based implementation of speech recognition for robocar control using MFCC - TELKOMNIKA JOURNAL
This research proposes a simulation of a logic circuit for speech recognition on an FPGA, based on MFCC (Mel Frequency Cepstral Coefficients) and Euclidean distance, to control the motion of a robotic car. Recognized speech is used as a command to operate the robotic car. MFCC is used for feature extraction, while Euclidean distance is used to classify the features of each utterance; the result is forwarded to the decision stage, which generates the control logic for the robot's motors. The tests showed that the designed logic is accurate, as measured on the Mel frequency warping and power cepstrum stages. With the logic design verified by comparing the Matlab computation against the Xilinx simulation, researchers can continue with its implementation on FPGA hardware.
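Two pieces of this pipeline are easy to illustrate in software: Mel frequency warping and nearest-template classification by Euclidean distance. The sketch below uses one common form of the Mel formula; the command names and feature vectors are made up for illustration and are not taken from the paper.

```python
import math

def hz_to_mel(freq_hz):
    """Warp a frequency in Hz to the Mel scale (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_command(features, templates):
    """Return the name of the stored template closest to the input features."""
    return min(templates, key=lambda name: euclidean(features, templates[name]))
```

For example, with hypothetical templates `{"forward": [1.0, 0.0], "stop": [0.0, 1.0]}`, the feature vector `[0.9, 0.2]` is assigned to `"forward"` because it lies closer to that template.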
Automatic speech emotion and speaker recognition based on hybrid gmm and ffbnn - ijcsa
In this paper we present text-dependent speaker recognition enhanced by first detecting the emotion
of the speaker, using a hybrid of the FFBNN and GMM methods. The emotional state of the speaker
influences the recognition system. The Mel-frequency Cepstral Coefficient (MFCC) feature set is used for
experimentation. To recognize the emotional state of a speaker, a Gaussian Mixture Model (GMM) is used in
the training phase and a Feed Forward Back Propagation Neural Network (FFBNN) in the testing phase. A speech
database of 25 speakers recorded in five emotional states (happy, angry, sad, surprise and neutral) is used
for experimentation. The results reveal that the emotional state of the speaker has a significant impact on
the accuracy of speaker recognition.
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK - ijistjournal
This paper describes a Hidden Markov Model-based Punjabi text-to-speech synthesis system (HTS), in which the speech waveform is generated from the Hidden Markov Models themselves, and applies it to Punjabi speech synthesis using the general speech synthesis architecture of HTK (HMM Tool Kit). This Hidden Markov Model based TTS can be used in mobile phones to read out the stored phone directory or messages. Text messages and the caller’s identity in English are mapped to tokens in Punjabi, which are then concatenated to form speech according to certain rules and procedures.
To build the synthesizer we recorded a speech database and segmented it phonetically, first extracting context-independent monophones and then context-dependent triphones. For example, for the word bharat the monophones are a, bh, t, etc., and the triphones include bh-a+r. These speech utterances and their phone-level transcriptions (monophones and triphones) are the inputs to the speech synthesis system. The system outputs the sequence of phonemes after resolving ambiguities in phoneme selection using word network files; for example, for the word Tapas the output phoneme sequence is ਤ,ਪ,ਸ instead of ਟ,ਪ,ਸ.
Design and implementation of a java based virtual laboratory for data communi... - IJECEIAES
Students today often find university engineering courses abstract and difficult, and cannot relate theoretical calculations to real-life scenarios. They consequently lose interest in their coursework and earn poor grades. Simulations of classroom concepts with software such as MATLAB were developed to facilitate the learning experience. This paper describes the development of a virtual laboratory simulation package for teaching data communication concepts such as coding schemes, modulation and filtering. Unlike other simulation packages, it requires no prior knowledge of computer programming for students to grasp these concepts.
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the pronunciation-by-pronunciation phonetic distance and a critical distance threshold for the classification. The proposed method consists of two steps: a training step that estimates a critical distance parameter from transcribed data, and a classification step that uses this critical distance to classify input utterances into pronunciation variants and out-of-vocabulary (OOV) words.
The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. The confusion matrix and the precision, recall and accuracy metrics are used for performance evaluation. Experimental results show a significant performance improvement over existing classifiers.
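The threshold idea can be illustrated with a generic dynamic-programming alignment of two phoneme sequences. The abstract does not give the details of DPW, so everything here is an assumption: the substitution costs (0.5 within the same broad class, 1.0 otherwise), the gap cost, the length normalization, and the tiny vowel set are all hypothetical, and the sketch is in Python rather than the authors' Java.

```python
# Illustrative subset of vowel phonemes; a real system would use a full phone set.
VOWELS = {"aa", "ae", "ah", "eh", "ey", "ih", "iy", "ow", "uw"}

def phone_cost(a, b):
    """Assumed substitution cost: 0 if equal, 0.5 within a class, 1.0 across classes."""
    if a == b:
        return 0.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.5
    return 1.0

def phonetic_distance(p, q, gap=1.0):
    """Length-normalized alignment cost between two phoneme sequences."""
    rows, cols = len(p) + 1, len(q) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + gap,                               # deletion
                d[i][j - 1] + gap,                               # insertion
                d[i - 1][j - 1] + phone_cost(p[i - 1], q[j - 1]),  # substitution
            )
    return d[-1][-1] / max(len(p), len(q), 1)

def classify(candidate, reference, critical_distance):
    """Label the candidate a pronunciation variant if it is within the critical distance."""
    if phonetic_distance(candidate, reference) <= critical_distance:
        return "variant"
    return "OOV"
```

In the paper the critical distance is estimated from transcribed training data; in this sketch it is simply passed in as a parameter.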
A Text-Independent Speaker Identification System based on The Zak Transform - CSCJournals
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The identification efficiency approaches 91.3% using a single test file and 100% using two test files. The method achieves efficiency comparable to the well-known MFCC method, with the advantage of being faster in both modeling and identification.
International Journal of Engineering Research and Applications (IJERA) is a team of researchers, not a publication service or private publisher running journals for monetary benefit. We are an association of scientists and academics who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal system primarily aims to bring out the research talent and work of scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. The journal aims to cover scientific research in a broad sense rather than a niche area, so that researchers from various verticals can publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue further. All published articles are freely available to scientific researchers in government agencies, educators and the general public. We are making serious efforts to promote our journal across the globe in various ways, and we are sure that it will act as a scientific platform for all researchers to publish their work online.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition... - TELKOMNIKA JOURNAL
Speech recognition can be defined as the process of converting voice signals into a sequence of
words by applying a specific algorithm implemented in a computer program. Research on speech
recognition in Indonesia is relatively limited. This paper studies which feature extraction method,
Linear Predictive Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), is better for
speech recognition in the Indonesian language. This matters because a method that produces high
accuracy for one language does not necessarily produce the same accuracy for other languages,
since every language has different characteristics. This research can thus hopefully help
accelerate the use of automatic speech recognition for the Indonesian language. There are two main
processes in speech recognition: feature extraction and recognition. The feature extraction methods
compared in this study are LPC and MFCC, while recognition uses a Hidden
Markov Model (HMM). The test results show that MFCC performs better than LPC in Indonesian
language speech recognition.
Text-Independent Speaker Verification Report - Cody Ray
Provides an introduction to the task of speaker recognition, and describes a not-so-novel speaker recognition system based upon a minimum-distance classification scheme. We describe both the theory and practical details for a reference implementation. Furthermore, we discuss an advanced technique for classification based upon Gaussian Mixture Models (GMM). Finally, we discuss the results of a set of experiments performed using our reference implementation.
SEARCH TIME REDUCTION USING HIDDEN MARKOV MODELS FOR ISOLATED DIGIT RECOGNITION - cscpconf
This paper reports a word modeling algorithm for Malayalam isolated digit recognition that reduces the search time in the classification process. A recognition experiment is carried out for the 10 Malayalam digits using Mel Frequency Cepstral Coefficient (MFCC) feature parameters and the k-Nearest Neighbor (k-NN) classification algorithm. A word modeling schema using the Hidden Markov Model (HMM) algorithm is developed. The experimental results show that the proposed algorithm can reduce the search time of the classification process in telephony applications by 80% for first-digit recognition.
Speaker and Speech Recognition for Secured Smart Home Applications - Roger Gomes
The paper discusses the implementation of a robust text-independent speaker recognition system that uses MFCC to extract feature vectors, VQ to match them, and LBG for optimization. It further discusses a text-dependent speech recognition system implemented with the DTW algorithm in the context of home automation.
Presentation slides discussing the theory and empirical results of a text-independent speaker verification system I developed based upon classification of MFCCs. Both minimum-distance classification and least-likelihood ratio classification using Gaussian Mixture Models were discussed.
FEATURE SELECTION USING FISHER’S RATIO TECHNIQUE FOR AUTOMATIC ... - IJCI JOURNAL
Automatic Speech Recognition (ASR) involves mainly two steps: feature extraction and classification (pattern recognition). Mel Frequency Cepstral Coefficient (MFCC) is used as one of the prominent feature extraction techniques in ASR. Usually, the set of all 12 MFCC coefficients is used as the feature vector in the classification step. But the question is whether the same or improved classification accuracy can be achieved by using a subset of the 12 MFCC as the feature vector. In this paper, Fisher’s ratio technique is used for selecting a subset of the 12 MFCC coefficients that contribute more in discriminating a pattern. The selected coefficients are used in classification with the Hidden Markov Model (HMM) algorithm. The classification accuracies that we get by using 12 coefficients and by using the selected coefficients are compared.
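Fisher's ratio ranks each coefficient by how well it separates classes relative to its spread within them. The abstract does not state the exact formula used, so the sketch below assumes the common two-class form, (difference of class means) squared divided by the sum of class variances; the function names and the toy data in the usage note are hypothetical.

```python
from statistics import mean, pvariance

def fisher_ratios(class_a, class_b, eps=1e-12):
    """Per-feature Fisher ratio for two classes of equal-length feature vectors."""
    ratios = []
    # zip(*rows) iterates over feature columns.
    for col_a, col_b in zip(zip(*class_a), zip(*class_b)):
        between = (mean(col_a) - mean(col_b)) ** 2
        within = pvariance(col_a) + pvariance(col_b) + eps  # eps avoids division by zero
        ratios.append(between / within)
    return ratios

def select_features(class_a, class_b, keep):
    """Indices of the `keep` features with the largest Fisher ratio, in ascending order."""
    ratios = fisher_ratios(class_a, class_b)
    ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return sorted(ranked[:keep])
```

With 12 MFCC coefficients per frame, `select_features(frames_a, frames_b, keep=6)` would return the six coefficient indices that discriminate best between the two classes under this criterion.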
Speaker Recognition System using MFCC and Vector Quantization Approach - ijsrd.com
This paper presents an approach to speaker recognition that uses Mel-frequency spectral information to improve the speech feature representation in a Vector Quantization (VQ) codebook-based recognizer. The Mel frequency approach extracts features from the speech signal to obtain the training and testing vectors. The VQ codebook approach uses the training vectors to form clusters with the help of the LBG algorithm and recognizes speakers from them.
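A VQ codebook built with LBG can be sketched as repeated codeword splitting followed by nearest-neighbour refinement. The version below is a simplified illustration under stated assumptions: the codebook size is a power of two, the split is multiplicative (so an all-zero centroid would not split), and `identify_speaker` is a hypothetical helper that picks the codebook with the lowest average distortion.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def avg_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)

def lbg(vectors, size, eps=0.01, tol=1e-6, max_iter=100):
    """Grow a codebook by splitting until it holds `size` codewords (a power of two)."""
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[x * (1 + s * eps) for x in c] for c in codebook for s in (1, -1)]
        prev = float("inf")
        for _ in range(max_iter):
            # Assign each vector to its nearest codeword.
            cells = [[] for _ in codebook]
            for v in vectors:
                nearest = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
                cells[nearest].append(v)
            # Move each codeword to the centroid of its cell (keep it if the cell is empty).
            codebook = [centroid(cell) if cell else c for cell, c in zip(cells, codebook)]
            dist = avg_distortion(vectors, codebook)
            if prev - dist < tol:
                break
            prev = dist
    return codebook

def identify_speaker(test_vectors, codebooks):
    """Name of the speaker whose codebook quantizes the test vectors with least distortion."""
    return min(codebooks, key=lambda name: avg_distortion(test_vectors, codebooks[name]))
```

In a recognizer of this kind, one codebook is trained per enrolled speaker from that speaker's training vectors, and test vectors are scored against every codebook.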
Performance analysis of image compression using fuzzy logic algorithm - sipij
With growing demand, multimedia production is increasing fast, which contributes to insufficient
network bandwidth and memory storage. Image compression is therefore increasingly important for reducing
data redundancy in order to save memory and transmission bandwidth. An efficient compression technique
is proposed that combines fuzzy logic with Huffman coding. While normalizing the image
pixels, each pixel value belonging to the image foreground is characterized and interpreted. The
image is subdivided into pixels, which are then characterized by a pair of approximation sets. Here,
encoding uses a Huffman code, which is statistically independent, to produce a more efficient code for
compression, and decoding uses rough fuzzy logic to rebuild the pixels of the image. The
method used is rough fuzzy logic with the Huffman coding algorithm (RFHA). Different
compression techniques are compared with Huffman coding, and fuzzy logic is applied to the Huffman
reconstructed image. The results show that high compression rates are achieved with visually negligible
differences between the compressed and original images.
Line Spectral Pairs Based Voice Conversion using Radial Basis Function - IDES Editor
Voice Conversion (VC) is a technique which morphs
the speaker dependent acoustical cues of the source speaker
to those of the target speaker. Speaker dependent acoustical
cues are characterized at different levels such as shape of
vocal tract and glottal excitation. In this paper, vocal tract
parameters and glottal excitations are characterized using
Line Spectral Pairs (LSP) and pitch residual respectively.
Strong generalization ability of Radial Basis Function (RBF)
is utilized to map the acoustical cues namely, LSP and pitch
residual of source speaker to that of target speaker. The
subjective and objective measures are used to evaluate the
comparative performance of RBF and state of the art GMM
based voice conversion system. Objective measures and
simulation results indicate that the RBF transformation model
performed better than GMM model. Subjective evaluations
illustrate that the proposed algorithm maintains target voice
individuality, naturalness and quality of the speech signal.
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION - csandit
Emotional state recognition through speech is currently a very interesting research topic.
Using subliminal information in speech, it is possible to recognize the emotional state of a
person. One of the main problems in the design of automatic emotion recognition systems is the
small number of available patterns. This makes the learning process more difficult because of
the generalization problems that arise under these conditions.
In this work we propose a solution to this problem that consists of enlarging the training set
by creating new virtual patterns. In emotional speech, most of the
emotional information is carried by speed and pitch variations, so a change in the average
pitch that modifies neither the speed nor the pitch variations does not affect the
expressed emotion. We use this prior information to create new patterns by applying
a pitch shift modification in the feature extraction process of the classification system. For this
purpose, we propose a frequency scaling modification of the Mel Frequency Cepstral
Coefficients used to classify the emotion. The proposed process allows us to synthetically
increase the number of available patterns in the training set, thus increasing the generalization
capability of the system and reducing the test error.
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in science and technology, and to encourage and motivate researchers in challenging areas of science and technology.
Parallax Effect Free Mosaicing of Underwater Video Sequence Based on Texture ... - sipij
In this paper, we present a feature-based technique for constructing a mosaic image from an underwater video
sequence that suffers from parallax distortion due to the propagation properties of light in the underwater
environment. Most of the available mosaic tools and underwater image mosaicing techniques yield a
final result with artifacts such as blurring, ghosting and seams because of parallax in the input
images. Removing parallax from the input images may not reduce its effects; instead, it must be corrected
in successive steps of mosaicing. Our approach therefore minimizes the parallax effects by adopting an efficient
local alignment technique after global registration. We extract texture features using the Centre Symmetric
Local Binary Pattern (CS-LBP) descriptor to find feature correspondences, which are then used
to estimate the homography through RANSAC. To increase the accuracy of global registration,
we perform preprocessing such as colour alignment between two selected frames based on colour
distribution adjustment. Because of the existence of 100% overlap in consecutive frames of underwater video,
we select frames with minimum overlap based on mutual offset in order to reduce the computation cost
during mosaicing. Our approach considerably reduces the parallax effects in the final mosaic constructed
from our own underwater video sequences.
Application of parallel algorithm approach for performance optimization of oi... - sipij
This paper gives a detailed study of the performance of an image filter algorithm with various parameters
applied to an image in the RGB model. Many popular image filters consume a large amount
of computing resources. The oil paint image filter is one of the more interesting filters and is
very performance hungry. This research tries to improve the oil paint image filter algorithm by
using a parallel pattern library. With increasing kernel size, the processing time of the oil paint image filter
algorithm increases exponentially. I have also observed that questions about a faster oil paint filter
are asked repeatedly in various blogs and forums.
Offline handwritten signature identification using adaptive window positionin... - sipij
To address this challenge, the paper proposes the Adaptive Window Positioning
technique, which focuses not just on the meaning of the handwritten signature but also on the individuality
of the writer. This technique divides the handwritten signature into 13 small windows of size n x n
(13 x 13). The window size should be large enough to contain ample information about the style of the author and
small enough to ensure good identification performance. The process was tested on a GPDS dataset
containing 4870 signature samples from 90 different writers by comparing the robust features of the test
signature with those of the user’s signature using an appropriate classifier. Experimental results reveal that
adaptive window positioning is an efficient and reliable method for accurate
signature feature extraction for the identification of offline handwritten signatures. The
technique can also be used to detect signatures signed under emotional duress.
The state-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words if they have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses Dynamic Phone Warping (DPW) technique to compute the
pronunciation-by-pronunciation phonetic distance and a threshold critical distance criterion for the classification. The proposed method consists of two steps; a training step to estimate a critical distance
parameter using transcribed data and in the second step, use this critical distance criterion to classify the input utterances into the pronunciation variants and OOV words.
The algorithm is implemented using Java language. The classifier is trained on data sets from TIMIT
speech corpus and CMU pronunciation dictionary. The confusion matrix and precision, recall and accuracy performance metrics are used for the performance evaluation. Experimental results show significant performance improvement over the existing classifiers.
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The efficiency of identification approaches 91.3% using single test file and 100% using two test files. The method shows comparable efficiency results with the well known MFCC method with an advantage of being faster in both modeling and identification.
International Journal of Engineering Research and Applications (IJERA) is a team of researchers not publication services or private publications running the journals for monetary benefits, we are association of scientists and academia who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, all the articles will be archived for real time access.
Our journal system primarily aims to bring out the research talent and the works done by sciaentists, academia, engineers, practitioners, scholars, post graduate students of engineering and science. This journal aims to cover the scientific research in a broader sense and not publishing a niche area of research facilitating researchers from various verticals to publish their papers. It is also aimed to provide a platform for the researchers to publish in a shorter of time, enabling them to continue further All articles published are freely available to scientific researchers in the Government agencies,educators and the general public. We are taking serious efforts to promote our journal across the globe in various ways, we are sure that our journal will act as a scientific platform for all researchers to publish their works online.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition...TELKOMNIKA JOURNAL
Speech recognition can be defined as the process of converting voice signals into the ranks of the
word, by applying a specific algorithm that is implemented in a computer program. The research of speech
recognition in Indonesia is relatively limited. This paper has studied methods of feature extraction which is
the best among the Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficients (MFCC) for
speech recognition in Indonesian language. This is important because the method can produce a high
accuracy for a particular language does not necessarily produce the same accuracy for other languages,
considering every language has different characteristics. Thus this research hopefully can help further
accelerate the use of automatic speech recognition for Indonesian language. There are two main
processes in speech recognition, feature extraction and recognition. The method used for comparison
feature extraction in this study is the LPC and MFCC, while the method of recognition using Hidden
Markov Model (HMM). The test results showed that the MFCC method is better than LPC in Indonesian
language speech recognition.
Text-Independent Speaker Verification ReportCody Ray
Provides an introduction to the task of speaker recognition, and describes a not-so-novel speaker recognition system based upon a minimum-distance classification scheme. We describe both the theory and practical details for a reference implementation. Furthermore, we discuss an advanced technique for classification based upon Gaussian Mixture Models (GMM). Finally, we discuss the results of a set of experiments performed using our reference implementation.
SEARCH TIME REDUCTION USING HIDDEN MARKOV MODELS FOR ISOLATED DIGIT RECOGNITIONcscpconf
This paper reports a word modeling algorithm for the Malayalam isolated digit recognition to reduce the search time in the classification process. A recognition experiment is carried out for the 10 Malayalam digits using the Mel Frequency Cepstral Coefficients (MFCC) feature parameters and k - Nearest Neighbor (k-NN) classification algorithm. A word modeling schema using Hidden Markov Model (HMM) algorithm is developed. From the experimental result it is reported that we can reduce the search time for the classification process using the proposed algorithm in telephony application by a factor of 80% for the first digit recognition.
Speaker and Speech Recognition for Secured Smart Home ApplicationsRoger Gomes
The paper published in discusses implementation of a robust text-independent speaker recognition system using MFCC extraction of feature vectors its matching using VQ and optimization using LBG, further a text dependent speech recognition system using the DTW algorithm's implementation is discussed in the context of home automation.
Presentation slides discussing the theory and empirical results of a text-independent speaker verification system I developed based upon classification of MFCCs. Both minimum-distance classification and least-likelihood ratio classification using Gaussian Mixture Models were discussed.
FEATURE SELECTION USING FISHER’S RATIO TECHNIQUE FOR AUTOMATIC ...IJCI JOURNAL
Automatic Speech Recognition (ASR) involves mainly two steps: feature extraction and classification (pattern recognition). Mel Frequency Cepstral Coefficient (MFCC) is used as one of the prominent feature extraction techniques in ASR. Usually, the set of all 12 MFCC coefficients is used as the feature vector in the classification step. But the question is whether the same or improved classification accuracy can be achieved by using a subset of 12 MFCC as feature vector. In this paper, Fisher’s ratio technique is used for selecting a subset of 12 MFCC coefficients that contribute more in discriminating a pattern. The selected coefficients are used in classification with Hidden Markov Model (HMM) algorithm. The classification accuracies that we get by using 12 coefficients and by using the selected coefficients are compared.
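The ranking idea in this abstract can be sketched in a few lines. This is only an illustration: it assumes the common two-class definition of Fisher's ratio (squared difference of class means divided by the sum of class variances), which the abstract does not spell out. The function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def fisher_ratio(class_a, class_b):
    """Two-class Fisher's ratio for one feature: (mean gap)^2 / (sum of variances)."""
    gap = mean(class_a) - mean(class_b)
    return gap * gap / (pvariance(class_a) + pvariance(class_b))

def rank_features(frames_a, frames_b):
    """Return feature indices sorted from most to least discriminative."""
    n_features = len(frames_a[0])
    scores = []
    for j in range(n_features):
        col_a = [frame[j] for frame in frames_a]
        col_b = [frame[j] for frame in frames_b]
        scores.append(fisher_ratio(col_a, col_b))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Toy data (hypothetical): 3 coefficients per frame, two pattern classes.
frames_a = [[1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [0.8, 4.9, 0.4]]
frames_b = [[3.0, 5.0, 0.5], [3.2, 5.2, 0.1], [2.8, 4.8, 0.6]]
ranking = rank_features(frames_a, frames_b)
```

A subset is then obtained by keeping the first few indices of `ranking`. How many coefficients to keep is a separate choice that this sketch does not address.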
Speaker Recognition System using MFCC and Vector Quantization Approachijsrd.com
This paper presents an approach to speaker recognition using frequency spectral information with Mel frequency for the improvement of speech feature representation in a Vector Quantization codebook based recognition approach. The Mel frequency approach extracts the features of the speech signal to get the training and testing vectors. The VQ Codebook approach uses training vectors to form clusters and recognize accurately with the help of LBG algorithm.
Performance analysis of image compression using fuzzy logic algorithmsipij
With rising demand, the volume of multimedia content is growing fast, which leads to insufficient
network bandwidth and memory storage. Image compression is therefore important for reducing
data redundancy in order to save memory and transmission bandwidth. An efficient compression technique
is proposed which combines fuzzy logic with Huffman coding. While normalizing the image
pixels, each pixel value belonging to the image foreground is characterized and interpreted. The
image is subdivided into pixels, which are then characterized by a pair of approximation sets. Here,
encoding uses a Huffman code, which is statistically independent, to produce a more efficient code for
compression, and decoding uses rough fuzzy logic to rebuild the pixels of the image. The
method used here is rough fuzzy logic with the Huffman coding algorithm (RFHA). Different
compression techniques are compared with Huffman coding, and fuzzy logic is applied to the Huffman
reconstructed image. Results show that high compression rates are achieved with a visually negligible
difference between the compressed images and the original images.
Line Spectral Pairs Based Voice Conversion using Radial Basis FunctionIDES Editor
Voice Conversion (VC) is a technique which morphs
the speaker dependent acoustical cues of the source speaker
to those of the target speaker. Speaker dependent acoustical
cues are characterized at different levels such as shape of
vocal tract and glottal excitation. In this paper, vocal tract
parameters and glottal excitations are characterized using
Line Spectral Pairs (LSP) and pitch residual respectively.
Strong generalization ability of Radial Basis Function (RBF)
is utilized to map the acoustical cues namely, LSP and pitch
residual of source speaker to that of target speaker. The
subjective and objective measures are used to evaluate the
comparative performance of RBF and state of the art GMM
based voice conversion system. Objective measures and
simulation results indicate that the RBF transformation model
performed better than GMM model. Subjective evaluations
illustrate that the proposed algorithm maintains target voice
individuality, naturalness and quality of the speech signal.
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITIONcsandit
Emotional state recognition through speech is a very interesting research topic nowadays.
Using subliminal information of speech, it is possible to recognize the emotional state of the
person. One of the main problems in the design of automatic emotion recognition systems is the
small number of available patterns. This fact makes the learning process more difficult, due to
the generalization problems that arise under these conditions.
In this work we propose a solution to this problem consisting in enlarging the training set
through the creation of new virtual patterns. In the case of emotional speech, most of the
emotional information is included in speed and pitch variations. So, a change in the average
pitch that modifies neither the speed nor the pitch variations does not affect the
expressed emotion. Thus, we use this prior information in order to create new patterns applying
a pitch shift modification in the feature extraction process of the classification system. For this
purpose, we propose a frequency scaling modification of the Mel Frequency Cepstral
Coefficients, used to classify the emotion. This proposed process allows us to synthetically
increase the number of available patterns in the training set, thus increasing the generalization
capability of the system and reducing the test error.
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in science and technology, and to encourage and motivate researchers in challenging areas of science and technology.
Parallax Effect Free Mosaicing of Underwater Video Sequence Based on Texture ...sipij
In this paper, we present a feature-based technique for constructing a mosaic image from an underwater video
sequence, which suffers from parallax distortion due to the propagation properties of light in the underwater
environment. Most of the available mosaic tools and underwater image mosaicing techniques yield a
final result with artifacts such as blurring, ghosting and seams due to the presence of parallax in the input
images. Removing parallax from the input images may not reduce its effects; instead, it must be corrected
in successive steps of mosaicing. Thus, our approach minimizes the parallax effects by adopting an efficient
local alignment technique after global registration. We extract texture features using Centre Symmetric
Local Binary Pattern (CS-LBP) descriptor in order to find feature correspondences, which are used further
for estimation of homography through RANSAC. In order to increase the accuracy of global registration,
we perform preprocessing such as colour alignment between two selected frames based on colour
distribution adjustment. Because of existence of 100% overlap in consecutive frames of underwater video,
we select frames with minimum overlap based on mutual offset in order to reduce the computation cost
during mosaicing. Our approach minimizes the parallax effects considerably in final mosaic constructed
using our own underwater video sequences.
Application of parallel algorithm approach for performance optimization of oi...sipij
This paper gives a detailed study on the performance of image filter algorithm with various parameters
applied on an image of RGB model. There are various popular image filters, which consume large amounts
of computing resources for processing. The oil paint image filter is one of the very interesting filters, and it is
very performance hungry. The current research tries to improve the oil paint image filter algorithm by
using the parallel pattern library. With increasing kernel size, the processing time of the oil paint image filter
algorithm increases exponentially. I have also observed that questions about a faster oil paint filter
are asked repeatedly in various blogs and forums.
Offline handwritten signature identification using adaptive window positionin...sipij
To address this challenge, the paper proposes the use of an Adaptive Window Positioning
technique, which focuses not just on the meaning of the handwritten signature but also on the individuality
of the writer. This innovative technique divides the handwritten signature into 13 small windows of size nxn
(13x13). This size should be large enough to contain ample information about the style of the author and
small enough to ensure a good identification performance. The process was tested with a GPDS dataset
containing 4870 signature samples from 90 different writers by comparing the robust features of the test
signature with that of the user’s signature using an appropriate classifier. Experimental results reveal that
the adaptive window positioning technique is an efficient and reliable method for accurate
signature feature extraction for the identification of offline handwritten signatures. The
technique can also be used to detect signatures signed under emotional duress.
Robust content based watermarking algorithm using singular value decompositio...sipij
Nowadays, image content is frequently subject to different malicious manipulations. To protect images
from these illegal manipulations, the computer science community has recourse to watermarking techniques. To
protect digital multimedia content, we need only embed an invisible watermark into images, which
facilitates the detection of manipulations, duplication, and illegitimate distribution of these images. In
this work, a robust watermarking technique is presented that embeds invisible watermarks into colour
images using a block-by-block singular value decomposition of a robust image transform, the radial
symmetry transform. Each bit of the watermark is inserted in an eight-pixel block of the blue
channel, at a high singular value of the corresponding block in the radial symmetry map. We justify the
insertion in the blue channel by our low sensitivity to perturbations in this colour channel. We
also present results obtained with different tests. We tested the imperceptibility of the mark using this
approach and also its robustness against several attacks.
Lossless image compression using new biorthogonal waveletssipij
Even though a large number of wavelets exist, one needs new wavelets for their specific applications. One
of the basic wavelet categories is orthogonal wavelets. But it was hard to find orthogonal and symmetric
wavelets. Symmetry is required for perfect reconstruction. Hence, a need for orthogonal and symmetric
wavelets arises. The solution came in the form of biorthogonal wavelets, which preserve the perfect reconstruction
condition. Though a number of biorthogonal wavelets have been proposed in the literature, in this paper four new
biorthogonal wavelets are proposed which give better compression performance. The new wavelets are
compared with traditional wavelets by using the design metrics Peak Signal to Noise Ratio (PSNR) and
Compression Ratio (CR). Set Partitioning in Hierarchical Trees (SPIHT) coding algorithm was utilized to
incorporate compression of images.
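The two metrics named in this abstract have standard definitions, sketched below. The pixel values and byte counts are made up for illustration, and the 8-bit peak value of 255 is an assumption. For a strictly lossless codec the reconstruction is exact, so the MSE is zero and PSNR is infinite.

```python
import math

def psnr(original, reconstructed, peak=255):
    """Peak Signal to Noise Ratio in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no reconstruction error
    return 10 * math.log10(peak ** 2 / mse)

def compression_ratio(original_bytes, compressed_bytes):
    """Compression Ratio: size before coding divided by size after coding."""
    return original_bytes / compressed_bytes

# Hypothetical flattened 8-bit pixel values and file sizes.
pixels = [52, 55, 61, 59]
decoded = [52, 54, 61, 60]
quality_db = psnr(pixels, decoded)
ratio = compression_ratio(4096, 1024)
```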
Beamforming with per antenna power constraint and transmit antenna selection ...sipij
In this paper, transmit beamforming and antenna selection techniques are presented for the Cooperative
Distributed Antenna System. Beamforming technique with minimum total weighted transmit power
satisfying threshold SINR and Per-Antenna Power constraints is formulated as a convex optimization
problem for the efficient performance of Distributed Antenna System (DAS). Antenna Selection technique is
implemented in this paper to select the optimum Remote Antenna Units from all the available ones. This
achieves the best compromise between capacity and system complexity. Dual polarized and Triple
Polarized systems are considered. Simulation results show that integrating Beamforming with DAS
enhances its performance, and that using convex optimization in Antenna Selection enhances the
performance of multi polarized systems.
Global threshold and region based active contour model for accurate image seg...sipij
In this contribution, we develop a novel global threshold-based active contour model. This model deploys a new
edge-stopping function to control the direction of the evolution and to stop the evolving contour at weak or
blurred edges. An implementation of the model requires the use of selective binary and Gaussian filtering
regularized level set (SBGFRLS) method. The method uses either a selective local or global segmentation
property. It penalizes the level set function to force it to become a binary function. This procedure is followed by
using a regularisation Gaussian. The Gaussian filters smooth the level set function and stabilises the evolution
process. One of the merits of our proposed model stems from the ability to initialise the contour anywhere inside
the image to extract object boundaries. The proposed method is found to perform well, notably when the
intensities inside and outside the object are homogenous. Our method is applied with satisfactory results on
various types of images, including synthetic, medical and Arabic-characters images.
An intensity based medical image registration using genetic algorithmsipij
Medical imaging plays a vital role to create images of human body for clinical purposes. Biomedical
imaging has taken a leap by entering into the field of image registration. Image registration integrates the
large amount of medical information embedded in the images taken at different time intervals and images
at different orientations. In this paper, an intensity-based real-coded genetic algorithm is used for
registering two MRI images. To demonstrate the efficiency of the algorithm developed, the alignment of the
image is altered and algorithm is tested for better performance. Also the work involves the comparison of
two similarity metrics, and based on the outcome the best metric suited for genetic algorithm is studied.
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...sipij
Hearing impaired people usually use hearing aids that implement speech enhancement
algorithms. Estimation of speech and estimation of noise are the components of a single channel speech
enhancement system. The main objective of any speech enhancement algorithm is to estimate the noise power
spectrum in a non stationary environment. A VAD (Voice Activity Detector) is used to identify speech pauses,
and noise is estimated only during these pauses. The MMSE (Minimum Mean Square Error) speech
enhancement algorithm did not improve intelligibility, quality and listener fatigue, which are the perceptual
aspects of speech. A novel evaluation approach, SR (Signal to Residual spectrum ratio), based on an uncertainty
parameter, is introduced for the benefit of hearing impaired people in non stationary environments to control
distortions. Noise is estimated and updated by dividing the original pure signal into three parts,
namely pure speech, quasi speech and non speech frames, based on multiple threshold conditions. Different
values of SR and LLR demonstrate the amount of attenuation and amplification distortions. The proposed
method is compared with one other method, WAT (Weighted Average Technique), using the
parameters SR (signal to residual spectrum ratio) and LLR (log likelihood ratio), and with MMSE (Minimum Mean
Square Error) in terms of segmented SNR and LLR.
Review of ocr techniques used in automatic mail sorting of postal envelopessipij
This paper presents a review of various OCR techniques used in the automatic mail sorting process. A complete description on various existing methods for address block extraction and digit recognition that were used in the literature is discussed. The objective of this study is to provide a complete overview about the methods and techniques used by many researchers for automating the mail sorting process in postal service in various countries. The significance of Zip code or Pincode recognition is discussed.
IDENTIFICATION OF SUITED QUALITY METRICS FOR NATURAL AND MEDICAL IMAGESsipij
Assessing the quality of the denoised image is one of the important tasks in image denoising applications.
Numerous quality metrics, each with particular characteristics, have been proposed by researchers to date. In
practice, image acquisition system is different for natural and medical images. Hence noise introduced in
these images is also different in nature. Considering this fact, authors in this paper tried to identify the
suited quality metrics for Gaussian, speckle and Poisson corrupted natural, ultrasound and X-ray images
respectively. In this paper, sixteen different quality metrics from full reference category are evaluated with
respect to noise variance and suited quality metric for particular type of noise is identified. Strong need to
develop noise dependent quality metric is also identified in this work.
A combined method of fractal and glcm features for mri and ct scan images cla...sipij
Fractal analysis has been shown to be useful in image processing for characterizing shape and gray-scale
complexity. The fractal feature is a compact descriptor used to give a numerical measure of the degree of
irregularity of the medical images. This descriptor property does not give ownership of the local image
structure. In this paper, we present a combination of this parameter based on Box Counting with GLCM
Features. This powerful combination has proved good results especially in classification of medical texture
from MRI and CT Scan images of trabecular bone. This method has the potential to improve clinical
diagnostics tests for osteoporosis pathologies.
A voting based approach to detect recursive order number of photocopy documen...sipij
Photocopy documents are very common in our normal life. People are permitted to carry and present
photocopied documents to avoid damages to the original documents. But this provision is misused for
temporary benefits by fabricating fake photocopied documents. Fabrication of fake photocopied document
is possible only in 2nd and higher order recursive order of photocopies. Whenever a photocopied document
is submitted, it may be required to check its originality. When the document is 1st order photocopy, chances
of fabrication may be ignored. On the other hand when the photocopy order is 2nd or above, probability of
fabrication may be suspected. Hence when a photocopy document is presented, the recursive order number
of photocopy is to be estimated to ascertain the originality. This requirement demands to investigate
methods to estimate the order number of a photocopy. In this work, a voting based approach is proposed to detect the
recursive order number of the photocopy document using the exponential, extreme
value and lognormal probability distributions. A detailed experimentation is performed on a generated
data set and the method exhibits efficiency close to 89%.
Contrast enhancement using various statistical operations and neighborhood pr...sipij
Histogram Equalization is a simple and effective contrast enhancement technique. In spite of its popularity,
Histogram Equalization still has some limitations: it produces artifacts and unnatural images, and local
details are not considered. Because of these limitations, many other equalization techniques have been
derived from it with some upgrades. Statistics play an important role in image
processing, where statistical operations are applied to the image to get the desired result, such as
manipulation of brightness and contrast. Thus, a novel algorithm using statistical operations and
neighborhood processing has been proposed in this paper where the algorithm has proven to be effective in
contrast enhancement based on the theory and experiment.
Image retrieval and re ranking techniques - a surveysipij
There is a huge amount of research work focusing on the searching, retrieval and re-ranking of images in
the image database. The diverse and scattered work in this domain needs to be collected and organized for
easy and quick reference.
Relating to the above context, this paper gives a brief overview of various image retrieval and re-ranking
techniques. Starting with the introduction to existing system the paper proceeds through the core
architecture of image harvesting and retrieval system to the different Re-ranking techniques. These
techniques are discussed in terms of approaches, methodologies and findings and are listed in tabular form
for quick review.
Feature selection approach in animal classificationsipij
In this paper, we propose a model for automatic classification of Animals using different classifiers Nearest
Neighbour, Probabilistic Neural Network and Symbolic. Animal images are segmented using maximal
region merging segmentation. The Gabor features are extracted from segmented animal images.
Discriminative texture features are then selected using the different feature selection algorithm like
Sequential Forward Selection, Sequential Floating Forward Selection, Sequential Backward Selection and
Sequential Floating Backward Selection. To corroborate the efficacy of the proposed method, an
experiment was conducted on our own data set of 25 classes of animals, containing 2500 samples. The
data set has different animal species with similar appearance (small inter-class variations) across different
classes and varying appearance (large intra-class variations) within a class. In addition, the images of
animals are of different poses, with cluttered background under different lighting and climatic conditions.
Experiment results reveal that Symbolic classifier outperforms Nearest Neighbour and Probabilistic Neural
Network classifiers.
A comparison of different support vector machine kernels for artificial speec...TELKOMNIKA JOURNAL
As the emergence of the voice biometric provides enhanced security and convenience, voice biometric-based applications such as speaker verification were gradually replacing the authentication techniques that were less secure. However, the automatic speaker verification (ASV) systems were exposed to spoofing attacks, especially artificial speech attacks that can be generated with a large amount in a short period of time using state-of-the-art speech synthesis and voice conversion algorithms. Despite the extensively used support vector machine (SVM) in recent works, there were none of the studies shown to investigate the performance of different SVM settings against artificial speech detection. In this paper, the performance of different SVM settings in artificial speech detection will be investigated. The objective is to identify the appropriate SVM kernels for artificial speech detection. An experiment was conducted to find the appropriate combination of the proposed features and SVM kernels. Experimental results showed that the polynomial kernel was able to detect artificial speech effectively, with an equal error rate (EER) of 1.42% when applied to the presented handcrafted features.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
Automatic speaker recognition system is used to recognize an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition. Our baseline speaker recognition system accuracy, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique by performing gender recognition before speaker recognition so as to improve the accuracy/timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves the accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
This paper describes the development of an efficient speech recognition system using different techniques such as Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ) and Hidden Markov Model (HMM).
This paper explains how speaker recognition followed by speech recognition is used to recognize the speech faster, efficiently and accurately. MFCC is used to extract the characteristics from the input speech signal with respect to a particular word uttered by a particular speaker. Then HMM is used on Quantized feature vectors to identify the word by evaluating the maximum log likelihood values
for the spoken word.
Modeling Text Independent Speaker Identification with Vector QuantizationTELKOMNIKA JOURNAL
Speaker identification is one of the most important technologies nowadays. Many fields such as
bioinformatics and security are using speaker identification. Also, almost all electronic devices are using
this technology too. Based on the text used, speaker identification is divided into text-dependent and
text-independent. In many fields, text-independent identification is mostly used because the text is unrestricted.
For the same reason, it is generally more challenging than text-dependent identification. In this research,
text-independent speaker identification with Indonesian speaker data was modelled with Vector Quantization (VQ),
using K-Means initialization. K-Means clustering was used to initialize the means, and
Hierarchical Agglomerative Clustering was used to identify K value for VQ. The best VQ accuracy was
59.67% when k was 5. According to the result, Indonesian language could be modelled by VQ. This
research can be developed using optimization method for VQ parameters such as Genetic Algorithm or
Particle Swarm Optimization.
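A minimal sketch of VQ-based speaker identification follows: one k-means codebook per speaker, and a test utterance assigned to the speaker whose codebook gives the lowest average distortion. It is not the authors' implementation. The value of k is fixed by hand here rather than chosen by hierarchical agglomerative clustering, the initialization is simply the first k vectors, and the 2-D "features" are hypothetical stand-ins for real feature vectors.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_codebook(vectors, k, iterations=20):
    """Plain k-means: returns k centroids (the VQ codebook)."""
    centroids = [list(v) for v in vectors[:k]]  # simple deterministic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)

def identify(test_vectors, codebooks):
    """Pick the speaker whose codebook quantizes the test vectors with least distortion."""
    return min(codebooks, key=lambda name: distortion(test_vectors, codebooks[name]))

# Hypothetical 2-D training features for two speakers.
train = {
    "speaker_a": [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]],
    "speaker_b": [[5.0, 5.1], [5.2, 4.9], [6.0, 6.1], [6.2, 5.9]],
}
codebooks = {name: train_codebook(vecs, k=2) for name, vecs in train.items()}
predicted = identify([[0.1, 0.0], [1.1, 1.0]], codebooks)
```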
Effect of Time Derivatives of MFCC Features on HMM Based Speech Recognition S...IDES Editor
In this paper, improvement of an ASR system for
Hindi language, based on Vector quantized MFCC as feature
vectors and HMM as classifier, is discussed. MFCC features
are usually pre-processed before being used for recognition.
One of these pre-processing steps is to create delta and delta-delta
coefficients and append them to the MFCCs to create the feature vector.
This paper focuses on all digits in Hindi (Zero to Nine), which
is based on isolated word structure. Performance of the system
is evaluated by accurate Recognition Rate (RR). The effect of
the combination of the Delta MFCC (DMFCC) feature along
with the Delta-Delta MFCC (DDMFCC) feature shows
approximately 2.5% further improvement in the RR, with no
additional computational costs involved. RR of the system for
the speakers involved in the training phase is found to give
better recognition accuracy than that for the speakers who
were not involved in the training phase. Word wise RR is
observed to be good in some digits with distinct phones.
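The delta and delta-delta step described above can be sketched as follows. The abstract does not give a formula, so this assumes the widely used regression form of the delta over a window of plus or minus two frames, with edge frames repeated as padding. The frame values are hypothetical.

```python
def delta(frames, window=2):
    """Regression-based delta over +/- `window` frames, repeating edge frames as padding."""
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for j in range(len(frames[0])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][j]
                behind = frames[max(t - n, 0)][j]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def append_deltas(mfcc):
    """Concatenate static, delta, and delta-delta coefficients frame by frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)  # delta-delta is the delta of the delta
    return [c + a + b for c, a, b in zip(mfcc, d1, d2)]

# Hypothetical frames with 2 coefficients each, rising linearly over time.
mfcc = [[float(t), 2.0 * t] for t in range(6)]
features = append_deltas(mfcc)
```

Each output frame is three times as long as the input frame, so a 12-coefficient MFCC vector would become a 36-dimensional feature vector under this scheme.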
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...ijcsit
Speech processing is a crucial and intensive field of research for developing robust and efficient speech recognition systems. However, speech recognition accuracy is still affected by variation in context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based Feature Extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature vector size. The resulting features give better accuracy and support a varying window size, which makes them suitable for non-stationary signals. For better word classification and recognition, a discrete hidden Markov model (HMM) can be used, as it considers the time distribution of speech signals. The HMM classification method attained its maximum accuracy, in terms of identification rate, of 80.1% for informal, 86% for scientific phrases, and 63.8% for control. The objective of this study is to characterize the feature extraction methods and the classification phase in a speech recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that signal recognition accuracy is increased by using the discrete curvelet transform over conventional methods.
A novel automatic voice recognition system based on text-independent in a noi...IJECEIAES
An automatic voice recognition system aims to limit fraudulent access to sensitive areas such as labs. The primary objective of this paper is to increase the accuracy of voice recognition in noisy environments with the Microsoft Research (MSR) identity toolbox. The proposed system lets the user speak into the microphone and then matches the unknown voice against other human voices existing in the database using a statistical model, in order to grant or deny access to the system. The voice recognition was done in two steps: training and testing. During training, a Universal Background Model and a Gaussian Mixture Model (the GMM-UBM models) are calculated based on different sentences pronounced by the human voice(s) used to record the training data. Then, testing of the voice signal in a noisy environment calculates the Log-Likelihood Ratio of the GMM-UBM models in order to classify the user's voice. However, before testing, noise and de-noise methods were applied, and we investigated different MFCC features of the voice to determine the best feature possible as well as a noise filter algorithm that subsequently improved the performance of the automatic voice recognition system.
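The log-likelihood ratio test mentioned in this abstract can be sketched in pure Python. This is an illustration, not the MSR identity toolbox API. The model parameters are made up rather than trained, diagonal covariances are assumed, and the zero threshold is arbitrary.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """log p(x) for a diagonal-covariance GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))  # log-sum-exp

def llr_score(frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio between the speaker model and the UBM."""
    total = sum(gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, ubm)
                for x in frames)
    return total / len(frames)

# Hypothetical 2-component models over 2-D features (not trained on real data).
ubm = [(0.5, [0.0, 0.0], [4.0, 4.0]), (0.5, [3.0, 3.0], [4.0, 4.0])]
speaker = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [1.5, 0.5], [0.5, 0.5])]
genuine = [[1.1, 0.9], [1.4, 0.6], [1.0, 1.0]]
impostor = [[-3.0, 4.0], [5.0, -2.0], [4.5, 4.5]]

THRESHOLD = 0.0  # arbitrary decision threshold for this sketch
accept_genuine = llr_score(genuine, speaker, ubm) > THRESHOLD
accept_impostor = llr_score(impostor, speaker, ubm) > THRESHOLD
```

In a real system the speaker model is usually adapted from the UBM and the threshold is tuned on held-out data. Both are skipped here.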
Bayesian distance metric learning and its application in automatic speaker re...IJECEIAES
This paper proposes state-of the-art Automatic Speaker Recognition System (ASR) based on Bayesian Distance Learning Metric as a feature extractor. In this modeling, I explored the constraints of the distance between modified and simplified i-vector pairs by the same speaker and different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. Given a speaker tag, I select the data pair of the different speakers with the highest cosine score to form a set of speaker constraints. This collection captures the most discriminating variability between the speakers in the training data. This Bayesian distance learning approach achieves better performance than the most advanced methods. Furthermore, this method is insensitive to normalization compared to cosine scores. This method is very effective in the case of limited training data. The modified supervised i-vector based ASR system is evaluated on the NIST SRE 2008 database. The best performance of the combined cosine score EER 1.767% obtained using LDA200 + NCA200 + LDA200, and the best performance of Bayes_dml EER 1.775% obtained using LDA200 + NCA200 + LDA100. Bayesian_dml overcomes the combined norm of cosine scores and is the best result of the short2-short3 condition report for NIST SRE 2008 data.
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...IDES Editor
This paper proposes a multimodal biometric system
using palmprint and speech signal. In this paper, we propose a
novel approaches for both the modalities. We extract the
features using Subband Cepstral Coefficients for speech signal
and Modified Canonical method for palmprint. The individual
feature score are passed to the fusion level. Also we have
proposed a new fusion method called weighted score. This
system is tested on clean and degraded database collected by
the authors for more than 300 subjects. The results show
significant improvement in the recognition rate.
Effect of MFCC Based Features for Speech Signal Alignmentskevig
The fundamental techniques used for man-machine communication include Speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of the speech signals. The HNM analyses and synthesis provides high quality speech with
less number of parameters. Dynamic time warping is well known technique used for aligning two given
multidimensional sequences. It locates an optimal match between the given sequences. The improvement in
the alignment is estimated from the corresponding distances. The objective of this research is to investigate
the effect of dynamic time warping on phrases, words, and phonemes based alignments. The speech signals
in the form of twenty five phrases were recorded. The recorded material was segmented manually and
aligned at sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the
aligned frames. The investigation has shown better alignment in case of HNM parametric domain. It has
been seen that effective speech alignment can be carried out even at phrase level
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTSijnlc
The fundamental techniques used for man-machine communication include Speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of the speech signals. The HNM analyses and synthesis provides high quality speech with
less number of parameters. Dynamic time warping is well known technique used for aligning two given
multidimensional sequences. It locates an optimal match between the given sequences. The improvement in
the alignment is estimated from the corresponding distances. The objective of this research is to investigate
the effect of dynamic time warping on phrases, words, and phonemes based alignments. The speech signals
in the form of twenty five phrases were recorded. The recorded material was segmented manually and
aligned at sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the
aligned frames. The investigation has shown better alignment in case of HNM parametric domain. It has
been seen that effective speech alignment can be carried out even at phrase level.
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Similar to Speaker Identification From Youtube Obtained Data (20)
1. Signal & Image Processing : An International Journal (SIPIJ) Vol.5, No.5, October 2014
SPEAKER IDENTIFICATION FROM YOUTUBE
OBTAINED DATA
Nitesh Kumar Chaudhary¹ and Shraddha Srivastav²
¹Department of Electronics & Communication Engineering, LNMIIT, Jaipur, India
²Bharti School Of Telecommunication, IIT, Delhi, India
ABSTRACT
An efficient and intuitive algorithm is presented for identifying speakers in long recordings (such as
long YouTube discussions or cocktail-party audio and video). The goal of automatic speaker
identification is to determine the number of different speakers and to build a model for each speaker by
extracting and characterizing the speaker-specific information contained in the speech signal. It has many
applications, especially in surveillance, airport immigration, cyber security, and the transcription of
recordings that contain several similar-sounding sources, where it is difficult to assign each transcript to
the correct speaker. Cepstral analysis, the speech parameterization most commonly used in speaker
verification, is detailed together with K-means clustering. Gaussian mixture modeling, the speaker
modeling technique used here, is then explained. Gaussian mixture models (GMMs), perhaps the most
robust machine learning models for this task, are applied to text-independent speaker identification. Their
use is motivated by the observation that Gaussian components can represent the characteristic spectral
shapes of a speaker, and by the remarkable ability of GMMs to model arbitrary densities. We then
describe expectation maximization, an iterative algorithm that starts from an arbitrary initial estimate and
repeats until the values converge. We aimed for 85 ~ 95% accuracy using vector quantization and
Gaussian mixture speaker models. Across a number of experiments we obtained identification rates of
79 ~ 82% using vector quantization and 85 ~ 92.6% using GMMs with parameters estimated by
expectation maximization, depending on the parameter settings.
KEYWORDS
MFCC, Vector Quantization (VQ), Gaussian Mixture Model (GMM), K-mean, Maximum Likelihood,
Expectation maximization.
1. INTRODUCTION
Speaker identification falls into two categories: text-dependent and text-independent. We focus on
text-independent speaker identification and verification because it places no constraints on the
utterances obtained from the speaker: the words can be anything, in any order. The first step is
feature extraction from the speech samples during the enrolment of a speaker; the extracted features
are stored in the database as training data. For better accuracy, the training dataset is updated every
time new utterances are enrolled, by combining them with the data already stored in the database,
and in this way the model of each speaker is trained. In the identification process, a probabilistic
model is used to compare the feature vectors of the test utterance with those of the training dataset,
and a decision is made as to whether the utterance belongs to one of the enrolled speakers.
DOI : 10.5121/sipij.2014.5503
2. FEATURE EXTRACTION FROM VOCAL TRACT
The human vocal tract, the airway used in the production of speech, comprises the organs above the
vocal folds, in particular the passage above the larynx, including the pharynx, mouth, and nasal
cavities. It is formed of the oral part (pharynx, tongue, lips, and jaw) and the nasal tract. When the
glottal pulse signal generated by the vibration of the vocal folds passes through the vocal tract, it is
modified. The vocal tract acts like a filter whose frequency characteristics depend on its resonance
peaks, and the vocal tract configuration can be inferred from the spectral shape of the speech signal,
such as formant positions and spectral tilt. These features can be obtained from the spectrogram of
the speech signal. We use Mel-Frequency Cepstral Coefficient (MFCC) features for speaker
identification because they combine the advantages of cepstral analysis with a perceptual frequency
scale based on critical bands. Although the speech signal is non-stationary, it can be assumed
stationary over a short duration, so analysis is done by framing the speech signal; the frame
width is about 20 to 30 milliseconds, and the frames are shifted by about 10 milliseconds.
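The framing step described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function name `frame_signal`, the 25 ms frame length (inside the stated 20 to 30 ms range), the 16 kHz sampling rate, and the Hamming window are all assumptions made for illustration, and a synthetic tone stands in for recorded speech.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row t holds samples [t * shift, t * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    # Hamming window is an assumption; the paper does not name a window.
    return signal[idx] * np.hamming(frame_len)

# One second of a synthetic 16 kHz tone stands in for recorded speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
frames = frame_signal(np.sin(2 * np.pi * 220.0 * t), sample_rate)
print(frames.shape)  # (number of frames, samples per frame)
```

Each row of `frames` would then be passed to the MFCC computation, giving one feature vector per frame.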
The number of feature vectors obtained from the utterances is usually large, so to reduce computation
the number of feature vectors is reduced, with little loss of information, by K-means clustering. This
results in a small set of vectors called codebook vectors, and a codebook is obtained for each
speaker's utterances using K-means. A codebook consists of many different vectors, which capture
the important characteristics of each speaker. Each codebook is obtained as follows: given a
training set of 16-dimensional MFCC feature vectors, one for each frame of the utterance, which
represent the speaker, find a partition of the feature vector space. Each region of the partition
contains a cluster of vectors that represent the same kind of basic sound. The region is
represented by its centroid, the vector that causes the minimum distortion when the vectors in the
region are mapped to it. Thus, each speaker has a codebook with a number of centroids, prepared
by K-means clustering.
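The codebook construction can be sketched as plain K-means in Python. This is an illustrative sketch only: `train_codebook`, the fixed iteration count, the random initialisation from training vectors, and the synthetic 16-dimensional data are assumptions, not details taken from the paper.

```python
import numpy as np

def train_codebook(features, k, n_iter=20, seed=0):
    """Plain K-means: return k centroid vectors for a (T, D) feature matrix."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct training vectors.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every vector to every centroid.
        dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

# Synthetic stand-in for one speaker's 16-dimensional MFCC vectors.
rng = np.random.default_rng(1)
mfcc = rng.normal(size=(500, 16))
codebook = train_codebook(mfcc, k=16)
print(codebook.shape)  # one row per centroid
```

With K = 16 and 16-dimensional features, each speaker is summarised by a 16 x 16 codebook instead of hundreds of frame vectors.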
3. MODELING USING VECTOR QUANTIZATION AND IDENTIFICATION
Vector quantization (VQ) is one of the simplest text-independent speaker models. VQ is often
used for computational speed-up techniques and lightweight practical implementations. It is a
classical, well-known quantization technique from signal processing that models probability
density functions by the distribution of prototype vectors. VQ is generally applied by partitioning
a large set of MFCC feature vectors into a small number of groups, each represented by its mean;
these means form the codebook of the training dataset.
For identification, the average quantization distortion is computed between the test utterance
feature vectors $X = \{x_1, x_2, \ldots, x_T\}$ and the reference (codebook) vectors
$R = \{r_1, r_2, \ldots, r_K\}$:

$$D_Q(X, R) = \frac{1}{T} \sum_{t=1}^{T} \min_{1 \le k \le K} d(x_t, r_k)$$

where $d(\cdot,\cdot)$ is a distance measure such as the Euclidean distance $\|x_t - r_k\|$. A smaller
value of $D_Q$ indicates a higher likelihood that X and R originate from the same speaker.
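A short Python sketch of this decision rule follows: the average distance from each test vector to its nearest codebook vector is computed for every enrolled speaker, and the speaker with the smallest value is chosen. The function names, the speaker labels and the toy codebooks are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def avg_distortion(test_vectors, codebook):
    """Mean Euclidean distance from each test vector to its nearest codebook vector."""
    diff = test_vectors[:, None, :] - codebook[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # (T, K) distances
    return dist.min(axis=1).mean()

def identify(test_vectors, codebooks):
    """Return the label of the codebook with the smallest average distortion."""
    return min(codebooks, key=lambda name: avg_distortion(test_vectors, codebooks[name]))

# Two toy codebooks (hypothetical speakers) centred at different points.
rng = np.random.default_rng(0)
codebooks = {
    "speaker_a": rng.normal(loc=0.0, size=(8, 16)),
    "speaker_b": rng.normal(loc=5.0, size=(8, 16)),
}
test = rng.normal(loc=5.0, size=(100, 16))
best = identify(test, codebooks)
print(best)
```

Because the test vectors are drawn around the same point as the second codebook, that codebook yields the much smaller distortion and is selected.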
Diagram 1 shows speaker identification using vector quantization modeling: the upper region is
the enrolment process, while in the lower region features are extracted from the test utterance.
Diagram 2 shows the vector quantization distortion measurements when the speaker
identification test was performed for 8 speakers. For K = 16, the distortion between utterances of
the same speaker (speaker i to speaker i) lies between 3.4561 and 7.0723, and the distortion
between different speakers (speaker i to speaker j, i ≠ j) lies between 13.1892 and 29.9038,
where i, j = 1 to 8.
Diagram 4
Table 1. Performance of vector quantization modeling with MFCC feature extraction

| Training utterances | No. of clusters (K-means) | Identification rate (%) |
|---|---|---|
| 8 speaker utterances | K = 32 | 82 |
| 8 speaker utterances | K = 16 | 82 |
| 8 speaker utterances | K = 8 | 80 |

Total number of speaker utterances = 75
4. MODELING USING GAUSSIAN MIXTURE MODEL
A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a
weighted sum of Gaussian component densities. It is generally used as a parametric model of the
probability distribution of continuous measurements such as spectral features. GMM parameters are
computed by expectation maximization, an iterative algorithm that starts from an arbitrary initial
estimate and repeats until the values converge. The whole Gaussian mixture model is defined by the
mean vectors ($\mu_i$), covariance matrices ($\Sigma_i$) and mixture weights ($p_i$) of all
components, collectively written $\lambda = \{p_i, \mu_i, \Sigma_i\}$, and the weighted sum of M
component densities is given by

$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, g(x \mid \mu_i, \Sigma_i)$$

where $x$ is a D-dimensional continuous-valued feature vector (in our case 16-dimensional), $p_i$,
$i = 1, \ldots, M$, are the mixture weights, which satisfy $\sum_{i=1}^{M} p_i = 1$, and
$g(x \mid \mu_i, \Sigma_i)$ are the component densities. Each component density is a D-dimensional
Gaussian function of the form

$$g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right\}$$
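To make the mixture density concrete, the following Python sketch evaluates the per-frame log-likelihood of a GMM. It assumes diagonal covariance matrices, a simplification chosen here to keep the example short (the paper does not state which covariance type was used), and the function name and toy parameters are illustrative.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Per-frame log p(x | lambda) for a diagonal-covariance GMM.

    x: (T, D) frames, weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = x[:, None, :] - means[None, :, :]                      # (T, M, D)
    # Log of each D-dimensional Gaussian component density.
    log_g = -0.5 * (np.log(2 * np.pi * variances)[None].sum(axis=2)
                    + (diff ** 2 / variances[None]).sum(axis=2))  # (T, M)
    log_weighted = np.log(weights)[None, :] + log_g
    # Log-sum-exp over the M components for numerical stability.
    top = log_weighted.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(log_weighted - top).sum(axis=1))

# Toy 2-component model over 16-dimensional features.
rng = np.random.default_rng(0)
weights = np.array([0.4, 0.6])
means = np.stack([np.zeros(16), np.full(16, 3.0)])
variances = np.ones((2, 16))
frames = rng.normal(size=(50, 16))
utterance_score = gmm_log_likelihood(frames, weights, means, variances).sum()
print(utterance_score)
```

Summing the per-frame values gives an utterance-level score; under this sketch, identification would pick the speaker model with the highest score.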
The GMM-based speaker identification model is shown below. In the upper region, features are
extracted from the training dataset and a mixture model is prepared using expectation
maximization; in the lower region, features are extracted from the test utterance. Finally, the
GMM parameters and the features extracted from the test utterance are used to make the
identification decision by evaluating the likelihood of the test features under each speaker model.
One of the powerful attributes of the GMM is its ability to form smooth approximations to
arbitrarily shaped densities. There are several techniques available for estimating the parameters
of a GMM, i.e. ($\mu$, $\Sigma$, $p$), and the most popular and well-established method is maximum
likelihood (ML) estimation. Since the likelihood is a nonlinear function of the parameters, it
cannot be maximized directly in closed form, so the ML parameters are estimated iteratively by
expectation maximization (EM). The basic idea of the EM algorithm is, beginning with an initial
model $\lambda$, to estimate a new model $\bar{\lambda}$ such that
$p(X \mid \bar{\lambda}) \ge p(X \mid \lambda)$, where X is the set of training feature vectors. The
new model then becomes the initial model for the next iteration, and the process is repeated until
some convergence threshold is reached. In our experiments, about 10 utterances were enough to give
a good approximation of the GMM parameters ($\mu$, $\Sigma$, $p$).
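The EM iteration can be sketched in Python as follows. This is a minimal illustration rather than the authors' Matlab implementation: it assumes diagonal covariances, initialises the means from randomly chosen training vectors, applies a small variance floor, and runs a fixed number of iterations instead of testing a convergence threshold. All names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_gmm_em(x, n_components, n_iter=12, seed=0):
    """EM for a diagonal-covariance GMM on (T, D) data.

    Returns weights, means, variances and the log-likelihood before each update.
    """
    rng = np.random.default_rng(seed)
    T, D = x.shape
    weights = np.full(n_components, 1.0 / n_components)
    means = x[rng.choice(T, size=n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0), (n_components, 1))
    trace = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each frame.
        diff = x[:, None, :] - means[None, :, :]
        log_g = -0.5 * (np.log(2 * np.pi * variances)[None].sum(axis=2)
                        + (diff ** 2 / variances[None]).sum(axis=2))
        log_w = np.log(weights)[None, :] + log_g
        top = log_w.max(axis=1, keepdims=True)
        log_px = top + np.log(np.exp(log_w - top).sum(axis=1, keepdims=True))
        trace.append(float(log_px.sum()))
        resp = np.exp(log_w - log_px)            # (T, M), each row sums to 1
        # M-step: re-estimate weights, means and variances.
        n_i = resp.sum(axis=0)
        weights = n_i / T
        means = (resp.T @ x) / n_i[:, None]
        variances = (resp.T @ x ** 2) / n_i[:, None] - means ** 2
        variances = np.maximum(variances, 1e-6)  # variance floor (assumption)
    return weights, means, variances, trace

# Synthetic two-cluster data stands in for one speaker's MFCC vectors.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 16)),
                  rng.normal(4.0, 1.0, size=(200, 16))])
weights, means, variances, trace = fit_gmm_em(data, n_components=2)
print(trace[0], trace[-1])
```

The recorded log-likelihood should not decrease from one iteration to the next, which is the property $p(X \mid \bar{\lambda}) \ge p(X \mid \lambda)$ that EM is designed to provide.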
Table 2. Performance of GMM with MFCC feature based identification

| Training data (seconds) | Testing data (seconds) | No. of components | No. of EM iterations | Rate of correct identification (%) |
|---|---|---|---|---|
| 6 | 6 | 2 | 6 | 78.8342 |
| 6 | 6 | 4 | 6 | 76.6654 |
| 12 | 6 | 4 | 8 | 83.6723 |
| 20 | 6 | 4 | 10 | 84.4348 |
| 30 | 4 to 10 | 4 | 12 | 87.4382 |
| 30 | 4 to 10 | 5 | 12 | 89.5630 |
| 60 | 4 to 10 | 6 | 12 | 92.6643 |
| 120 | 4 to 10 | 7 | 14 | 92.1273 |

Total number of speaker utterances = 75
5. CONCLUSION
In this paper, we have introduced a text-independent speaker identification system and found
that Gaussian mixture models (GMMs) are highly successful for text-independent speaker
identification on long recordings of different speakers. The identification performance of the
Gaussian mixture speaker model is insensitive to the method of model initialization; however,
with parameters estimated by expectation maximization, the identification rate is very sensitive
to the number of clusters and the number of expectation maximization iterations. We performed
the experiments in Matlab R2014a. The model is currently being evaluated on 75 speaker
utterances, and the results indicate that Gaussian mixture models provide a robust speaker
representation for the difficult task of speaker identification using corrupted, unconstrained
speech from cocktail-party or YouTube data. The models are computationally inexpensive and
easily implemented on a real-time platform.
AUTHORS
Nitesh Kumar Chaudhary is a final-year undergraduate student (Electronics & Communication)
at LNMIIT, Jaipur (India). His research interests include audio signal processing,
speaker authentication and identification, and machine learning (probabilistic models). He has
worked as a Visiting Summer Researcher at New York University, as a Summer Research
Fellow at the Indian Institute of Science, Bangalore, and as a Research Intern at the Indian
Institute of Technology, Delhi, in a program organised by the Bharti School Of
Telecommunication and the Foundation of Innovation and Technology Transfer.
More Info : https://sites.google.com/site/nkcniteshkumarchaudhary/
Shraddha Srivastav is a postdoctoral researcher at the Bharti School of Telecommunication, Indian
Institute of Technology, Delhi, and holds a PhD from University College Cork (UCC), Ireland.