Emotional state recognition through speech has become a very active research topic. Using the subliminal information in speech, known as "prosody", it is possible to recognize the emotional state of the speaker. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns, which makes the learning process more difficult because of the generalization problems that arise under these conditions.
In this work we propose a solution to this problem: enlarging the training set by creating new virtual patterns. In emotional speech, most of the emotional information is carried by speed and pitch variations, so a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We therefore use this prior information to create new patterns by applying a gender-dependent pitch-shift modification in the feature extraction stage of the classification system. Concretely, we propose a gender-dependent frequency scaling modification of the Mel Frequency Cepstral Coefficients (MFCC) used to classify the emotion. This process allows us to synthetically increase the number of available patterns in the training set, thereby increasing the generalization capability of the system and reducing the test error. Results obtained with two classifiers of different generalization capability demonstrate the suitability of the proposal.
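As a sketch of the idea, the frequency-scaling augmentation can be emulated by warping the mel filterbank before the cepstral step, so that one speech frame yields both an original and a "pitch-shifted" training pattern without touching the timing. The helper names, frame sizes, and the simple type-II DCT below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, scale=1.0):
    """Triangular mel filterbank; `scale` warps the frequency axis
    (scale > 1 moves the filters up, emulating a pitch-shifted voice)."""
    f_max = sr / 2.0 / max(scale, 1.0)  # keep warped filters below Nyquist
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_mels + 2)
    hz = mel_to_hz(mels) * scale
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_like(power_spec, fb, n_coef=12):
    """Log mel energies followed by a type-II DCT: MFCC-style features."""
    logmel = np.log(fb @ power_spec + 1e-10)
    n = len(logmel)
    k = np.arange(n)
    dct = np.cos(np.pi / n * (k[:, None] + 0.5) * np.arange(1, n_coef + 1)[None, :])
    return logmel @ dct
```

Computing `mfcc_like(spec, mel_filterbank(16000, 512, 26))` and the same with `scale=1.1` gives two feature vectors for one frame: the original pattern and a synthetic, pitch-shifted virtual pattern.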
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION (csandit)
Automatic Speech Emotion and Speaker Recognition Based on Hybrid GMM and FFBNN (ijcsa)
In this paper we present text-dependent speaker recognition with an enhancement that first detects the emotion of the speaker, using hybrid FFBNN and GMM methods, since the emotional state of the speaker influences the recognition system. The Mel-Frequency Cepstral Coefficient (MFCC) feature set is used for experimentation. To recognize the emotional state of a speaker, a Gaussian Mixture Model (GMM) is used in the training phase and a Feed Forward Back Propagation Neural Network (FFBNN) in the testing phase. A speech database of 25 speakers recorded in five different emotional states (happy, angry, sad, surprise and neutral) is used for experimentation. The results reveal that the emotional state of the speaker has a significant impact on the accuracy of speaker recognition.
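The train-a-model-per-emotion / score-by-likelihood pipeline can be illustrated with a stripped-down stand-in: one diagonal Gaussian per emotion class (i.e., a 1-component GMM), classifying by maximum log-likelihood. The class structure and shapes here are illustrative assumptions, not the paper's hybrid system:

```python
import numpy as np

class DiagonalGaussianClassifier:
    """One diagonal Gaussian per emotion class -- a minimal 1-component
    stand-in for the per-state GMMs described above."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.var_ = {c: X[y == c].var(axis=0) + 1e-6 for c in self.classes_}
        return self

    def log_likelihood(self, X, c):
        # Gaussian log-density with a diagonal covariance, summed over features
        mu, var = self.mu_[c], self.var_[c]
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

    def predict(self, X):
        scores = np.stack([self.log_likelihood(X, c) for c in self.classes_], axis=1)
        return self.classes_[np.argmax(scores, axis=1)]
```

Fitting on per-utterance MFCC vectors labelled by emotion and predicting by the best-scoring class mirrors the GMM training/testing split the abstract describes, with the neural network replaced by a likelihood comparison for brevity.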
FEATURE SELECTION USING FISHER'S RATIO TECHNIQUE FOR AUTOMATIC ... (IJCI JOURNAL)
Automatic Speech Recognition (ASR) involves mainly two steps: feature extraction and classification (pattern recognition). The Mel Frequency Cepstral Coefficient (MFCC) is one of the prominent feature extraction techniques in ASR. Usually, the set of all 12 MFCC coefficients is used as the feature vector in the classification step, but the question is whether the same or improved classification accuracy can be achieved by using a subset of the 12 MFCCs as the feature vector. In this paper, Fisher's ratio technique is used for selecting a subset of the 12 MFCC coefficients that contribute more to discriminating a pattern. The selected coefficients are used in classification with the Hidden Markov Model (HMM) algorithm, and the classification accuracies obtained using all 12 coefficients and using the selected coefficients are compared.
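Fisher's ratio for a single coefficient is the scatter of the per-class means relative to the average within-class variance; coefficients with a large ratio discriminate classes better. A compact sketch (the exact normalisation used in the paper may differ):

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher ratio: variance of the class means divided by
    the mean within-class variance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return means.var(axis=0) / (variances.mean(axis=0) + 1e-12)

def select_top_k(X, y, k):
    """Indices of the k coefficients with the largest Fisher ratio."""
    return np.argsort(fisher_ratio(X, y))[::-1][:k]
```

Given a matrix of 12 MFCCs per frame and class labels, `select_top_k(X, y, k)` returns the subset fed to the HMM classifier in place of the full 12-dimensional vector.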
Speech Emotion Recognition is a recent research topic in the Human-Computer Interaction (HCI) field. As computers have become an integral part of our lives, the need has arisen for a more natural communication interface between humans and computers, and a lot of work is currently going on to improve that interaction. To achieve this goal, a computer would have to be able to assess its present situation and respond differently depending on that observation; part of this process involves understanding the user's emotional state. To make human-computer interaction more natural, the objective is that the computer should be able to recognize emotional states in the same way a human does. The efficiency of an emotion recognition system depends on the type of features extracted and on the classifier used for detection of emotions. The proposed system aims at identifying basic emotional states such as anger, joy, neutral and sadness from human speech. For classifying the different emotions, MFCC (Mel Frequency Cepstral Coefficient) and energy features are used. In this paper, a standard emotional database (an English database) is used, which gives more satisfactory detection of emotions than recorded samples. The methodology describes and compares the performance of a Learning Vector Quantization Neural Network (LVQ NN), a Multiclass Support Vector Machine (SVM) and their combination for emotion recognition.
Development of Quranic Reciter Identification System using MFCC and GMM Clas... (IJECEIAES)
Nowadays, many beautiful recitations of the Al-Quran are available. Quranic recitation has its own characteristics, and the problem of identifying the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop a Quran reciter identification system using Mel-frequency Cepstral Coefficients (MFCC) and a Gaussian Mixture Model (GMM). A database of five Quranic reciters is developed and used in the training and testing phases. We carefully randomized the database over various surahs of the Quran so that the proposed system is sensitive only to the reciter, not to the recited verses. Around 15 Quranic audio samples from the 5 reciters were collected and randomized, of which 10 samples were used for training the GMM and 5 for testing. Results showed that the proposed system achieves a 100% recognition rate for the five reciters tested, and it is able to reject unknown samples.
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S... (mathsjournal)
Speech emotion recognition enables a computer system to record sounds and recognize the emotion of the speaker. We are still far from natural interaction between human and machine because machines cannot distinguish the emotion of the speaker; for this reason a new research field has been established, namely speech emotion recognition systems. The accuracy of these systems depends on various factors, such as the type and number of emotion states and the classifier type. In this paper, the classification methods C5.0, Support Vector Machine (SVM), and the combination of C5.0 and SVM (SVM-C5.0) are evaluated, and their efficiencies in speech emotion recognition are compared. The features used in this research include energy, Zero Crossing Rate (ZCR), pitch, and Mel-scale Frequency Cepstral Coefficients (MFCC). The results demonstrate that the proposed SVM-C5.0 classification method is more efficient at recognizing the emotion of the speaker than SVM or C5.0 alone, by margins between -5.5% and 8.9% depending on the number of emotion states.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy in some other task such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used
for aligning two given multidimensional sequences. It finds an optimal match between the given sequences.
The distance between the aligned sequences should be noticeably smaller than between unaligned
sequences. The improvement in the alignment may be estimated from the corresponding distances. This
technique has applications in speech recognition, speech synthesis, and speaker transformation. The
objective of this research is to investigate the amount of improvement in the alignment corresponding to the
sentence based and phoneme based manually aligned phrases. The speech signals in the form of twenty five
phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was
segmented manually and aligned at sentence and phoneme level. The aligned sentences of different speaker
pairs were analyzed using HNM and the HNM parameters were further aligned at frame level using DTW.
Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than
20 % reduction in the average Mahalanobis distances.
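The core DTW recurrence is compact enough to state directly. This is the textbook O(nm) form with Euclidean local cost, a generic sketch rather than the specific configuration used in the study:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two sequences of feature
    vectors (one vector per row), with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

The quantity the study measures is exactly this kind of accumulated distance (with Mahalanobis rather than Euclidean local cost) before and after alignment: a well-aligned pair yields a smaller value than an unaligned one.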
Broad Phoneme Classification Using Signal Based Features (ijsc)
Speech is the most efficient and popular means of human communication, and it is produced as a sequence of phonemes. Phoneme recognition is the first step performed by an automatic speech recognition system. State-of-the-art recognizers use mel-frequency cepstral coefficient (MFCC) features derived through short-time analysis, for which the recognition accuracy is limited. Instead, broad phoneme classification is achieved here using features derived directly from the speech at the signal level itself. Broad
phoneme classes include vowels, nasals, fricatives, stops, approximants and silence. The features identified
useful for broad phoneme classification are voiced/unvoiced decision, zero crossing rate (ZCR), short time
energy, most dominant frequency, energy in most dominant frequency, spectral flatness measure and first
three formants. Features derived from short time frames of training speech are used to train a multilayer
feedforward neural network based classifier with manually marked class label as output and classification
accuracy is then tested. Later this broad phoneme classifier is used for broad syllable structure prediction
which is useful for applications such as automatic speech recognition and automatic language
identification.
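Two of the listed signal-level features, zero-crossing rate and short-time energy, can be computed per frame as below. The frame and hop lengths are illustrative values for 16 kHz speech, not the paper's settings:

```python
import numpy as np

def short_time_features(signal, frame_len=400, hop=160):
    """Per-frame zero-crossing rate and log energy -- two of the
    signal-level features used for broad phoneme classification."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # fraction of consecutive samples whose sign changes
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        energy = np.log(np.sum(frame ** 2) + 1e-10)
        feats.append((zcr, energy))
    return np.array(feats)
```

High ZCR with low energy is typical of unvoiced fricatives, while low ZCR with high energy suggests vowels, which is why these cheap features already separate several broad classes.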
A Survey on: Hyper Spectral Image Segmentation and Classification Using FODPSO (rahulmonikasharma)
The spatial analysis of an image sensed and captured from a satellite provides less accurate information about a remote location; hence spectral analysis becomes essential. Hyperspectral images are one kind of remotely sensed image, and they are superior to multispectral images in providing spectral information. Detection of targets is a significant requirement in many areas such as military and agriculture. This paper gives an analysis of hyperspectral image segmentation using the fuzzy C-means (FCM) clustering technique with the FODPSO classifier algorithm. A 2D adaptive log filter is proposed to denoise the sensed and captured hyperspectral image in order to remove speckle noise.
Optimal Coefficient Selection For Medical Image Fusion (IJERA Editor)
Medical image fusion is one of the major research fields in image processing. Medical imaging has become a vital component of major clinical applications such as detection/diagnosis and treatment. Joint analysis of medical data collected from the same patient using different modalities is required in many clinical applications. This paper introduces an optimal fusion technique for multiscale-decomposition-based fusion of medical images and measures its performance against existing fusion techniques. The approach incorporates a genetic algorithm for optimal coefficient selection and employs various multiscale filters for noise removal. Experiments demonstrate that the proposed fusion technique generates better results than existing rules, and its performance is found to be superior to existing schemes reported in the literature.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information in their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique that performs gender recognition before speaker recognition so as to improve the accuracy and timeliness of the baseline speaker recognition system. Based on experiments conducted on the MIT database, we demonstrate that the proposed system improves accuracy over the baseline by approximately 2% while reducing the computational time by more than 30%.
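The state-space pruning idea reduces to a two-stage search: classify gender first, then score only that gender's speaker models. The hooks below (`gender_of`, `speakers_by_gender`, `score`) are hypothetical placeholders for the statistical models the paper actually uses:

```python
def hierarchical_identify(x, gender_of, speakers_by_gender, score):
    """Two-stage identification: pick the gender first, then search only
    that gender's speaker models. Roughly halving the candidate set is
    what yields the reported reduction in computation time."""
    g = gender_of(x)                      # stage 1: gender recognition
    candidates = speakers_by_gender[g]    # pruned state space
    return max(candidates, key=lambda spk: score(x, spk))  # stage 2
```

Any scoring function works here; in the paper's setting it would be the likelihood of the utterance under each remaining speaker's statistical model.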
Movie Sentiment Analysis using Deep Learning RNN (ijtsrd)
Sentiment analysis, or opinion mining, is the process of obtaining the sentiment of given textual data using methods such as deep learning algorithms. The analysis is used to determine the polarity of the data as either positive or negative. This classification can help automate data representation in sectors that have a public feedback structure. In this paper we perform sentiment analysis on the well-known IMDB database, which consists of 50000 movie reviews, training on 25000 instances and testing on the other 25000 to determine the performance of the model. The model uses a variant of the RNN algorithm, LSTM (Long Short-Term Memory), to decide the polarity between 0 and 1. This approach has an accuracy of 88.04%. Nirsen Amal A | Vijayakumar A, "Movie Sentiment Analysis using Deep Learning - RNN", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42414.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/42414/movie-sentiment-analysis-using-deep-learning--rnn/nirsen-amal-a
A Novel Approach for User Search Results Using Feedback Sessions (IJMER)
In the present scenario, queries are submitted to search engines to represent the information needs of users; this work addresses user search results using the fuzzy c-means algorithm. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Pseudo-documents are generated to better understand the clustered feedback. The fuzzy c-means clustering algorithm is used to cluster the feedback, with data bound to each cluster by means of a membership function; clustering the feedback can effectively reflect the users' needs. The fuzzy c-means algorithm uses the reciprocals of distances to decide the cluster centers. A ranking model is used to rank URLs based on user search feedback, and performance is evaluated using Classified Average Precision (CAP) on the user search results.
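The membership rule the abstract alludes to ("reciprocals of distances") is the standard fuzzy c-means update. A compact version, with fuzzifier m = 2 as a typical default and a deterministic farthest-point initialisation, both illustrative choices rather than the paper's configuration:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=50):
    """Plain fuzzy c-means: membership of point i in cluster k is the
    reciprocal of its relative distance to centre k, as described above."""
    # farthest-point initialisation keeps the sketch deterministic
    centers = [X[0]]
    for _ in range(c - 1):
        d = np.min([np.linalg.norm(X - ctr, axis=1) for ctr in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U
```

Each row of `U` sums to 1, so every feedback session is softly bound to all clusters at once, which is exactly the membership-function behaviour the abstract relies on.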
Image Similarity Using Symbolic Representation and Its Variations (sipij)
This paper proposes a new method for image/object retrieval. A pre-processing technique is applied to describe the object in a one-dimensional representation, as a pseudo time series. The proposed algorithm develops modified versions of the SAX representation: it applies an approach called Extended SAX (ESAX) in order to achieve efficient and accurate discovery of important patterns, necessary for retrieving the most plausibly similar objects. Our approach depends upon a table that contains the breakpoints dividing a Gaussian distribution into an arbitrary number of equiprobable regions; each breakpoint has more than one cardinality. A distance measure is used to decide the most plausible matching between strings of symbolic words. The experimental results show that our approach improves detection accuracy.
Detection of Fabrication in Photocopy Document Using Texture Features through... (sipij)
Photocopy documents are very common in daily life. People frequently carry and produce photocopied documents to avoid damaging or losing the originals, but this provision is misused for temporary benefit by fabricating fake photocopied documents. When a photocopied document is produced, it may be necessary to check its originality, and an attempt is made in this direction to detect such fabricated photocopied documents. This paper proposes an unsupervised system to detect fabrication in photocopied documents using texture features. The work mainly focuses on detecting fabrication in photocopied documents in which some contents have been manipulated by placing new content over them in different ways. A detailed experimental study has been performed using a collected sample set of considerable size, and a decision model is developed for classification. Testing performed with a different set of collected samples resulted in an average detection rate of 89%.
Performance Analysis of High Resolution Images Using Interpolation Techniques... (sipij)
This paper presents various types of interpolation techniques for obtaining a high-quality image. The difference between the proposed algorithm and conventional algorithms, in estimating a missing pixel value, is that the standard deviation of the image is used to calculate the pixel value rather than the value of the nearest neighbour, which gives a better result. The proposed method demonstrated higher performance in terms of PSNR and SSIM when compared to the conventional interpolation algorithms mentioned.
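The comparison metric can be sketched directly: PSNR over a reference/estimate pair, alongside a bare nearest-neighbour upscaler as the conventional baseline. The paper's standard-deviation rule itself is not specified in enough detail to reproduce, so only the baseline and the metric appear here:

```python
import numpy as np

def nn_upscale(img, factor):
    """Nearest-neighbour interpolation: each pixel is repeated
    `factor` times along both axes."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a reference image and an
    interpolated estimate; higher is better."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Evaluating `psnr(ground_truth, upscaled)` for each interpolator is exactly the comparison protocol the abstract reports, with SSIM computed analogously from local means and variances.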
A Novel Approach to Generate Face Biometric Template Using Binary Discriminat... (sipij)
In identity management systems, the commonly used biometric recognition system needs attention to the issue of biometric template protection as far as a more reliable solution is concerned. In view of this, a biometric template protection algorithm should satisfy security, discriminability and cancelability. As no single template protection method is capable of satisfying these basic requirements, a novel technique for face biometric template generation and protection is proposed, providing security and accuracy in new-user enrolment as well as in the verification process. The technique takes advantage of both a hybrid approach and the binary discriminant analysis algorithm, and is designed on the basis of random projection, binary discriminant analysis and a fuzzy commitment scheme. Three publicly available benchmark face databases (FERET, FRGC, CMU-PIE) are used for evaluation. The proposed technique enhances discriminability and recognition accuracy in terms of the matching score of the face images and provides high security. This paper discusses the corresponding results.
EFFICIENT IMAGE RETRIEVAL USING REGION BASED IMAGE RETRIEVAL (sipij)
Early image retrieval techniques were based on textual annotation of images. Manual annotation is a burdensome and expensive task for a huge image database, and it is often subjective, context-sensitive and crude. Content-based image retrieval instead uses the visual constituents of an image, such as shape, colour, spatial layout and texture, to represent and index the image. The Region Based Image Retrieval (RBIR) system uses the Discrete Wavelet Transform (DWT) and a k-means clustering algorithm to segment an image into regions. Each region of the image is represented by a set of visual characteristics, and the likeness between regions is measured using a particular metric function on those characteristics.
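The segmentation pipeline (one DWT level, then k-means on the low-frequency band) can be sketched as follows. A single Haar level, intensity-only features, and farthest-point initialisation are simplifying assumptions, not the RBIR system's exact configuration:

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2-D Haar wavelet transform: approximation (LL)
    plus horizontal/vertical/diagonal detail bands (LH, HL, HH)."""
    a = (img[0::2, :] + img[1::2, :]) / 2   # row averages
    d = (img[0::2, :] - img[1::2, :]) / 2   # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2
    lh = (a[:, 0::2] - a[:, 1::2]) / 2
    hl = (d[:, 0::2] + d[:, 1::2]) / 2
    hh = (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, lh, hl, hh

def kmeans(X, k, iters=20):
    """Plain k-means with farthest-point initialisation, used here to
    group the LL-band pixels into regions."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - ctr, axis=1) for ctr in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```

Clustering in the LL band rather than the full image makes the regions cheaper to compute and less sensitive to high-frequency noise, which is the usual motivation for combining DWT with k-means.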
Development of Quranic Reciter Identification System using MFCC and GMM Clas...IJECEIAES
Nowadays, there are many beautiful recitation of Al-Quran available. Quranic recitation has its own characteristics, and the problem to identify the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop Quran reciter identification system using Mel-frequency Cepstral Coefficient (MFCC) and Gaussian Mixture Model (GMM). In this paper, a database of five Quranic reciters is developed and used in training and testing phases. We carefully randomized the database from various surah in the Quran so that the proposed system will not prone to the recited verses but only to the reciter. Around 15 Quranic audio samples from 5 reciters were collected and randomized, in which 10 samples were used for training the GMM and 5 samples were used for testing. Results showed that our proposed system has 100% recognition rate for the five reciters tested. Even when tested with unknown samples, the proposed system is able to reject it.
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S...mathsjournal
Speech emotion recognition enables a computer system to records sounds and realizes the emotion of the
speaker. we are still far from having a natural interaction between the human and machine because
machines cannot distinguishes the emotion of the speaker. For this reason it has been established a new
investigation field, namely “the speech emotion recognition systems”. The accuracy of these systems
depend on the various factors such as the type and the number of the emotion states and also the classifier
type. In this paper, the classification methods of C5.0, Support Vector Machine (SVM), and the
combination of C5.0 and SVM (SVM-C5.0) are verified, and their efficiencies in speech emotion
recognition are compared. The utilized features in this research include energy, Zero Crossing Rate (ZCR),
pitch, and Mel-scale Frequency Cepstral Coefficients (MFCC). The results of paper demonstrate that the
effectiveness proposed SVM-C5.0 classification method is more efficient in recognizing the emotion of the
between -5.5 % and 8.9 % depending on the number of emotion states than SVM, C5.0.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMESkevig
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy in some other task such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used
for aligning two given multidimensional sequences. It finds an optimal match between the given sequences.
The distance between the aligned sequences should be relatively lesser as compared to unaligned
sequences. The improvement in the alignment may be estimated from the corresponding distances. This
technique has applications in speech recognition, speech synthesis, and speaker transformation. The
objective of this research is to investigate the amount of improvement in the alignment corresponding to the
sentence based and phoneme based manually aligned phrases. The speech signals in the form of twenty five
phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was
segmented manually and aligned at sentence and phoneme level. The aligned sentences of different speaker
pairs were analyzed using HNM and the HNM parameters were further aligned at frame level using DTW.
Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than
20 % reduction in the average Mahalanobis distances.
Broad phoneme classification using signal based featuresijsc
Speech is the most efficient and popular means of human communication Speech is produced as a sequence
of phonemes. Phoneme recognition is the first step performed by automatic speech recognition system. The
state-of-the-art recognizers use mel-frequency cepstral coefficients (MFCC) features derived through short
time analysis, for which the recognition accuracy is limited. Instead of this, here broad phoneme
classification is achieved using features derived directly from the speech at the signal level itself. Broad
phoneme classes include vowels, nasals, fricatives, stops, approximants and silence. The features identified
useful for broad phoneme classification are voiced/unvoiced decision, zero crossing rate (ZCR), short time
energy, most dominant frequency, energy in most dominant frequency, spectral flatness measure and first
three formants. Features derived from short time frames of training speech are used to train a multilayer
feedforward neural network based classifier with manually marked class label as output and classification
accuracy is then tested. Later this broad phoneme classifier is used for broad syllable structure prediction
which is useful for applications such as automatic speech recognition and automatic language
identification.
A Survey on: Hyper Spectral Image Segmentation and Classification Using FODPSOrahulmonikasharma
The Spatial analysis of image sensed and captured from a satellite provides less accurate information about a remote location. Hence analyzing spectral becomes essential. Hyper spectral images are one of the remotely sensed images, they are superior to multispectral images in providing spectral information. Detection of target is one of the significant requirements in many are assuc has military, agriculture etc. This paper gives the analysis of hyper spectral image segmentation using fuzzy C-Mean (FCM)clustering technique with FODPSO classifier algorithm. The 2D adaptive log filter is proposed to denoise the sensed and captured hyper spectral image in order to remove the speckle noise.
Optimal Coefficient Selection For Medical Image FusionIJERA Editor
Medical image fusion is one of the major research fields in image processing. Medical imaging has become a
vital component in major clinical applications such as detection/diagnosis and treatment. Joint analysis of
medical data collected from the same patient using different modalities is required in many clinical
applications. This paper introduces an optimal fusion technique for multiscale-decomposition-based fusion
of medical images and measures its performance against existing fusion techniques. This approach
incorporates a genetic algorithm for optimal coefficient selection and employs various multiscale filters for
noise removal. Experiments demonstrate that the proposed fusion technique generates better results than
existing rules, and the performance of the proposed system is found to be superior to existing schemes in
the literature.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
An automatic speaker recognition system is used to recognize an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique by performing gender recognition before speaker recognition so as to improve the accuracy and timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves the accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
Movie Sentiment Analysis using Deep Learning RNN
Sentiment analysis, or opinion mining, is the process of obtaining sentiments from given textual data using deep learning algorithms. The analysis is used to determine the polarity of the data as either positive or negative. This classification can help automate data representation in various sectors that have a public feedback structure. In this paper, we perform sentiment analysis on the well-known IMDB database, which consists of 50000 movie reviews; we train on 25000 instances and test on the other 25000 to determine the performance of the model. The model uses a variant of the RNN algorithm, LSTM (Long Short-Term Memory), to build a model that decides the polarity between 0 and 1. This approach has an accuracy of 88.04%. Nirsen Amal A | Vijayakumar A "Movie Sentiment Analysis using Deep Learning - RNN" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42414.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/42414/movie-sentiment-analysis-using-deep-learning--rnn/nirsen-amal-a
A Novel Approach for User Search Results Using Feedback Sessions
In the present scenario, queries are submitted to search engines to represent the information needs of
users. The proposed feedback sessions are clustered, with data bound to each cluster by means of a
membership function. Feedback sessions are constructed from user click-through logs and can efficiently
reflect the information needs of users. Pseudo-documents are generated to better understand the clustered
feedback. The fuzzy C-means clustering algorithm is used to cluster the feedback, and clustering the
feedback can effectively reflect user needs. The fuzzy C-means algorithm uses the reciprocal of distances
to decide the cluster centers. A ranking model is used to assign ranks to the URLs based on the user search
feedback. Performance is evaluated using the "Classified Average Precision (CAP)" for user search
results.
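The membership function mentioned above is the heart of fuzzy C-means: each feedback session gets a degree of membership in every cluster, computed from reciprocal distances. A minimal one-dimensional sketch of the standard FCM membership update (the data here are illustrative, not the paper's feedback sessions):

```python
def fcm_memberships(points, centers, m=2.0):
    """One fuzzy C-means membership update: each point receives a degree of
    membership in every cluster, inversely related to its distance from the
    cluster centre (m is the usual fuzzifier)."""
    memberships = []
    p = 2.0 / (m - 1.0)
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # Point coincides with a centre: full membership there.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            row = [1.0 / sum((d / dk) ** p for dk in dists) for d in dists]
        memberships.append(row)
    return memberships
```

A point halfway between two centres receives membership 0.5 in each, which is exactly the "soft" assignment that distinguishes FCM from hard k-means.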
Image similarity using symbolic representation and its variations
This paper proposes a new method for image/object retrieval. A pre-processing technique is applied to
describe the object, in a one-dimensional representation, as a pseudo time series. The proposed algorithm
develops modified versions of the SAX representation, applying an approach called Extended SAX (ESAX)
in order to efficiently and accurately discover the important patterns necessary for retrieving the most
plausible similar objects. Our approach depends upon a table containing the break-points that divide a
Gaussian distribution into an arbitrary number of equiprobable regions, where each breakpoint has more
than one cardinality. A distance measure is used to decide the most plausible matching between strings of
symbolic words. The experimental results have shown that our approach improves detection accuracy.
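The basic SAX pipeline that ESAX builds on (z-normalise, reduce with piecewise aggregate approximation, then map segment means to symbols via Gaussian breakpoints) can be sketched as follows; the 4-symbol breakpoint values come from the standard SAX lookup table, and ESAX's extensions are not reproduced here:

```python
import statistics

# Breakpoints dividing a standard Gaussian into 4 equiprobable regions
# (values from the standard SAX lookup table for alphabet size 4).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"

def to_sax(series, segments):
    """Z-normalise, reduce with piecewise aggregate approximation (PAA),
    then map each segment mean to a symbol via the breakpoints."""
    mu, sigma = statistics.mean(series), statistics.pstdev(series)
    z = [(x - mu) / sigma for x in series]
    seg_len = len(z) // segments
    word = []
    for i in range(segments):
        seg_mean = statistics.mean(z[i * seg_len:(i + 1) * seg_len])
        word.append(ALPHABET[sum(seg_mean > b for b in BREAKPOINTS)])
    return "".join(word)
```

Two object contours converted to such words can then be compared with the symbol-wise distance measure the abstract mentions.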
Detection of fabrication in photocopy document using texture features through...
Photocopied documents are very common in everyday life. People frequently carry and produce
photocopied documents to avoid damaging or losing the original documents, but this provision is misused
for temporary benefit by fabricating fake photocopied documents. When a photocopied document is
produced, it may be required to check its originality. An attempt is made in this direction to detect such
fabricated photocopied documents. This paper proposes an unsupervised system to detect fabrication in
photocopied documents using texture features. The work mainly focuses on detecting fabrication in
photocopied documents in which some contents are manipulated by placing new contents above them in
different ways. A detailed experimental study has been performed using a collected sample set of
considerable size, and a decision model is developed for classification. Testing performed with a different
set of collected samples resulted in an average detection rate of 89%.
Performance analysis of high resolution images using interpolation techniques...
This paper presents various types of interpolation techniques to obtain a high-quality image. The
difference between the proposed algorithm and conventional algorithms in estimating a missing pixel value
is that the standard deviation of the image, rather than the value of the nearest neighbor, is used to
calculate the pixel value, which gives better results. The proposed method demonstrated higher
performance in terms of PSNR and SSIM when compared to the conventional interpolation algorithms
mentioned.
A novel approach to generate face biometric template using binary discriminat...
In identity management systems, the commonly used biometric recognition systems need attention to the
issue of biometric template protection if a more reliable solution is to be achieved. In view of this, a
biometric template protection algorithm should satisfy security, discriminability and cancelability. As no
single template protection method is capable of satisfying these basic requirements, a novel technique for
face biometric template generation and protection is proposed. The novel approach provides security and
accuracy in new user enrolment as well as in the verification process. This technique takes advantage of
both the hybrid approach and the binary discriminant analysis algorithm, and is designed on the basis of
random projection, binary discriminant analysis and a fuzzy commitment scheme. Three publicly available
benchmark face databases (FERET, FRGC, CMU-PIE) are used for evaluation. The proposed technique
enhances the discriminability and recognition accuracy of the face images in terms of matching score and
provides high security. This paper discusses the corresponding results.
EFFICIENT IMAGE RETRIEVAL USING REGION BASED IMAGE RETRIEVAL
Early image retrieval techniques were based on textual annotation of images. Manual annotation of images
is a burdensome and expensive task for a huge image database, and it is often subjective, context-sensitive
and crude. Content-based image retrieval is implemented using the visual constituents of an image, such
as shape, colour, spatial layout, and texture, to represent and index the image. The Region Based Image
Retrieval (RBIR) system uses the Discrete Wavelet Transform (DWT) and a k-means clustering algorithm
to segment an image into regions. Each region of the image is represented by a set of visual
characteristics, and the likeness between regions is measured using a particular metric function on those
characteristics.
ALGORITHM AND TECHNIQUE ON VARIOUS EDGE DETECTION: A SURVEY
An edge may be defined as a set of connected pixels that forms a boundary between two disjoint regions.
Edge detection is basically a method of segmenting an image into regions of discontinuity, and it plays an
important role in digital image processing and in practical aspects of our lives. In this paper we study
various edge detection techniques: the Prewitt, Robert, Sobel, Marr-Hildreth and Canny operators. On
comparing them we can see that the Canny edge detector performs better than all the other edge detectors
on various aspects: it is adaptive in nature, performs better for noisy images, gives sharp edges, and has a
low probability of detecting false edges.
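The surveyed gradient operators share the same skeleton: convolve the image with small derivative kernels and take the gradient magnitude. A minimal Sobel sketch in pure Python (no border handling beyond skipping edge pixels; Canny adds smoothing, non-maximum suppression and hysteresis on top of this):

```python
def sobel_magnitude(img):
    """Approximate the gradient magnitude with 3x3 Sobel kernels."""
    h, w = len(img), len(img[0])
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal derivative
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical derivative
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(gx_k[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(gy_k[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

On a vertical step edge the response is large at the boundary and zero in flat regions, which is exactly the discontinuity the survey's definition of an edge describes.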
Retinal image analysis using morphological process and clustering technique
This paper proposes a method for retinal image analysis through efficient detection of exudates, and it
determines whether the retina is normal or abnormal. The contrast of the image is enhanced by the
curvelet transform. Morphology operators are then applied to the enhanced image in order to find the
retinal image ridges, and a simple thresholding method along with opening and closing operations
identifies the remaining ridges belonging to vessels. A clustering method is used for effective detection of
the exudates of the eye. Experimental results prove that the blood vessels and exudates can be effectively
detected by applying this method to retinal images. Fundus images of the retina were collected from a
reputed eye clinic, and 110 images were trained and tested in order to extract the exudates and blood
vessels. In this system we use a Probabilistic Neural Network (PNN) for training and testing the
pre-processed images. The results show whether the retina is normal or abnormal, thereby analyzing the
retinal image efficiently, with 98% accuracy in the detection of the exudates in the retina.
Performance analysis of image compression using fuzzy logic algorithm
With the increase in demand, the production of multimedia is growing fast, contributing to insufficient
network bandwidth and memory storage. Image compression is therefore important for reducing data
redundancy to save memory and transmission bandwidth. An efficient compression technique has been
proposed which combines fuzzy logic with Huffman coding. While normalizing the image pixels, each pixel
value belonging to the image foreground is characterized and interpreted. The image is subdivided into
pixels which are then characterized by a pair of approximation sets. Here, encoding uses Huffman codes,
which are statistically independent and produce a more efficient code for compression, while decoding
uses rough fuzzy logic to rebuild the image pixels. The method used here is rough fuzzy logic with the
Huffman coding algorithm (RFHA). A comparison of different compression techniques with Huffman
coding is made, and fuzzy logic is applied to the Huffman-reconstructed image. Results show that high
compression rates are achieved, with visually negligible differences between the compressed and original
images.
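The Huffman half of the proposed RFHA method assigns shorter codewords to more frequent pixel values. A standard Huffman code construction can be sketched as follows (the rough-fuzzy half is specific to the paper and not reproduced here):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a prefix code: frequent symbols get shorter codewords."""
    freq = Counter(data)
    # Heap entries: (frequency, tie-breaker, tree), tree = symbol or (left, right).
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))
        i += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

Because the code is prefix-free, the bit stream decodes unambiguously, which is what makes the encoding step "statistically independent" of the fuzzy reconstruction applied afterwards.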
Wound image analysis classifier for efficient tracking of wound healing status
Wounds evolve through an increase in the number of damaged tissues. The traditional way of assessing
wound healing status is to periodically measure the area covered by the wound; this is tedious to measure,
and periodic assessment is cumbersome. Basically, methods for assessing the healing status of a wound
can be classified as contact methods and non-contact methods. The purpose of this research work is to
accurately assess the healing status of the wound. To do so, capturing the wound images is the first task to
be performed, and there are tools such as the photographic wound assessment tool (PWAT) to acquire
good wound images. Since the characteristics of different types of wounds (venous, pressure, diabetic, and
arterial ulcers) vary markedly, determining the reliability and validity of using the PWAT to assess wound
appearance for both chronic pressure ulcers and leg ulcers due to vascular insufficiency is important. The
area of the wound is segmented from the wound image using efficient segmentation techniques, and the
segmented wound is preprocessed to reduce noise using efficient filters and denoising techniques. Efficient
classifiers are needed to classify the wound images; one such classifier is the Wound Image Analysis
Classifier (WIAC). An experimental evaluation has been made comparing various classifiers, namely
SVM, KNN, and WIAC.
A binarization technique for extraction of devanagari text from camera based ...
This paper presents a binarization method for camera-based natural scene (NS) images based on edge
analysis and morphological dilation. The image is converted to a grey-scale image, and edge detection is
carried out using Canny edge detection. The edge image is dilated using morphological dilation and
analyzed to remove edges corresponding to non-text regions. The image is binarized using the mean and
standard deviation of the edge pixels, and post-processing of the resulting images is done to fill gaps and
to smooth text strokes. The algorithm is tested on a variety of NS images captured using a digital camera
under variable resolutions and lighting conditions, with text of different fonts, styles and backgrounds, and
the results are compared with other standard techniques. The method is fast and works well for
camera-based natural scene images.
PERFORMANCE ANALYSIS OF LMS ADAPTIVE FIR FILTER AND RLS ADAPTIVE FIR FILTER FO...
Interest in adaptive filters continues to grow as they find practical real-time applications in areas such as
channel equalization, echo cancellation, noise cancellation and many other adaptive signal processing
applications. The key to successful adaptive signal processing is understanding the fundamental properties
of adaptive algorithms such as LMS and RLS. An adaptive filter is used for the cancellation of a noise
component that overlaps the desired signal in the same frequency range. This paper presents the design,
implementation and performance comparison of adaptive FIR filters using the LMS and RLS algorithms.
The MATLAB Simulink environment is used for the simulations.
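The LMS update at the core of the compared filters is only a few lines: filter, compute the error against the desired signal, and nudge the weights along the input. A minimal system-identification sketch (the plant, step size and tap count here are illustrative, not the paper's Simulink setup):

```python
import random

def lms_filter(x, d, taps=4, mu=0.1):
    """Adapt FIR weights so the filter output tracks the desired signal d."""
    w = [0.0] * taps
    buf = [0.0] * taps
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]                       # newest sample first
        y = sum(wi * bi for wi, bi in zip(w, buf))  # filter output
        e = dn - y                                  # error against desired
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]
        errors.append(e)
    return w, errors

# Identify a trivial "plant" d[n] = 0.5 * x[n] from white input.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [0.5 * xi for xi in x]
w, errors = lms_filter(x, d)
```

RLS replaces the single scalar step size with a recursively updated inverse correlation matrix, converging faster at a higher cost per sample, which is the trade-off the paper measures.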
WAVELET BASED AUTHENTICATION/SECRET TRANSMISSION THROUGH IMAGE RESIZING (WA...
This paper presents a wavelet-based steganographic/watermarking technique in the frequency domain,
termed WASTIR, for secret message/image transmission and image authentication. Number-system
conversion of the secret image, changing the radix from decimal to quaternary, is the pre-processing step
of the technique. The cover image is scaled through inverse discrete wavelet transformation, and the false
horizontal and vertical coefficients are embedded with the quaternary digits through a hash function and a
secret key. Experimental results are computed and compared with existing steganographic techniques such
as WTSIC, Yuancheng Li's method and Region-Based, in terms of Mean Square Error (MSE), Peak Signal
to Noise Ratio (PSNR) and Image Fidelity (IF), which show better performance for WASTIR.
AUTOMATIC THRESHOLDING TECHNIQUES FOR OPTICAL IMAGES
Image segmentation is one of the important tasks in computer vision and image processing, and
thresholding is a simple but very effective segmentation technique. It is based on classifying image pixels
into object and background depending on the relation between the gray-level value of each pixel and the
threshold. The Otsu technique is a robust and fast thresholding technique for most real-world images with
regard to uniformity and shape measures; it splits the object from the background by maximising the
separability between the classes. Our aims in this work are (1) to compare five thresholding techniques
(the Otsu technique, valley emphasis technique, neighborhood valley emphasis technique, variance and
intensity contrast technique, and variance discrepancy technique) on different applications, and (2) to
determine the best thresholding technique for extracting the object from the background. Our experimental
results show that each thresholding technique performs best on a specific type of bimodal image.
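The Otsu technique referred to above picks the threshold that maximises the between-class variance computed from the grey-level histogram. A compact sketch (histogram-based, so it runs in one pass over the levels):

```python
def otsu_threshold(pixels, levels=256):
    """Pick the threshold that maximises between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(levels):
        w0 += hist[t]                  # background weight up to level t
        sum0 += t * hist[t]
        w1 = total - w0                # foreground weight
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (total_sum - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

The valley-emphasis variants the paper compares weight this same objective by the histogram value at the candidate threshold, favouring thresholds that fall in a histogram valley.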
Image denoising using new adaptive based median filter
Noise is a major issue when transferring images through any kind of electronic communication. One of the
most common noises in electronic communication is impulse noise, which is caused by unstable voltage. In
this paper, known image denoising techniques are compared, and a new technique using a decision-based
approach is presented for the removal of impulse noise. All these methods can largely preserve image
details while suppressing impulse noise. The principle of each technique is first introduced and then
analysed with various simulation results using MATLAB. Most of the previously known techniques are
applicable to denoising images corrupted with low noise density. Here, a new decision-based technique is
presented which shows better performance than those already in use. The comparisons are made based on
visual appreciation and, quantitatively, on the Mean Square Error (MSE) and Peak Signal to Noise Ratio
(PSNR) of the different filtered images.
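A decision-based impulse filter of the kind described replaces only pixels flagged as noise (extreme values, for salt-and-pepper noise) with a local median, leaving clean pixels untouched; this is what preserves image detail. A minimal 3x3 sketch (the paper's exact decision rule may differ):

```python
def decision_median_filter(img, low=0, high=255):
    """Replace only pixels that look like impulse noise (extreme values)
    with the median of their 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            p = img[y][x]
            if p != low and p != high:
                continue               # decision step: keep clean pixels
            window = sorted(img[y + j][x + i]
                            for j in (-1, 0, 1) for i in (-1, 0, 1))
            out[y][x] = window[4]      # median of the 9 neighbours
    return out
```

A plain median filter, by contrast, smooths every pixel, which is why it blurs detail at higher noise densities.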
An ensemble classification algorithm for hyperspectral images
Hyperspectral image analysis has been used for many purposes in environmental monitoring, remote
sensing, vegetation research and also for land cover classification. A hyperspectral image consists of many
layers, each representing a specific wavelength; the layers stack on top of one another, making a cube-like
image covering the entire spectrum. This work aims to classify hyperspectral images and to produce an
accurate thematic map. Spatial information of the hyperspectral images is collected by applying
morphological profiles and local binary patterns. A support vector machine is an efficient algorithm for
classifying hyperspectral images, and a genetic algorithm is used to obtain the best feature subset for
classification. The selected features are classified to obtain the classes and to produce a thematic map.
Experiments are carried out with the AVIRIS Indian Pines and ROSIS Pavia University datasets; the
proposed method achieves 93% accuracy on Indian Pines and 92% on Pavia University.
SYNTHETICAL ENLARGEMENT OF MFCC BASED TRAINING SETS FOR EMOTION RECOGNITION
Emotional state recognition through speech has become a very interesting research topic nowadays.
Using subliminal information in speech, it is possible to recognize the emotional state of the person.
One of the main problems in the design of automatic emotion recognition systems is the small number
of available patterns. This fact makes the learning process more difficult, due to the generalization
problems that arise under these conditions.
In this work we propose a solution to this problem consisting in enlarging the training set through the
creation of new virtual patterns. In the case of emotional speech, most of the emotional information is
contained in speed and pitch variations, so a change in the average pitch that modifies neither the
speed nor the pitch variations does not affect the expressed emotion. Thus, we use this prior
information to create new patterns by applying a pitch shift modification in the feature extraction
process of the classification system. For this purpose, we propose a frequency scaling modification of
the Mel Frequency Cepstral Coefficients used to classify the emotion. This process allows us to
synthetically increase the number of available patterns in the training set, thus increasing the
generalization capability of the system and reducing the test error.
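The frequency-scaling idea can be illustrated on the mel filterbank used for MFCC extraction: warping the filter centre frequencies by a factor alpha emulates an average pitch shift while leaving speed and pitch variations untouched. The sketch below uses the common HTK mel formula; the factor alpha and the point where it is applied are assumptions for illustration, not the authors' exact procedure:

```python
import math

def hz_to_mel(f):
    """HTK-style Hz-to-mel conversion."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def scaled_filterbank_centers(n_filters, sr, alpha):
    """Centre frequencies of a mel filterbank, warped by a factor alpha.
    alpha != 1 shifts the analysed spectrum, emulating an average pitch
    change without touching speed or pitch *variations* (illustrative)."""
    m_lo, m_hi = hz_to_mel(0.0), hz_to_mel(sr / 2.0)
    mels = [m_lo + (m_hi - m_lo) * (i + 1) / (n_filters + 1)
            for i in range(n_filters)]
    return [min(alpha * mel_to_hz(m), sr / 2.0) for m in mels]
```

Extracting MFCCs with several such warped filterbanks from one utterance yields several virtual training patterns carrying the same emotional content.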
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decom...
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied in this paper for the extraction of emotional features from speeches, and a deep neural network (DNN) is used to classify speech emotions. This paper enhances the emotional components in speech signals by using EMD with acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates of emotions from speeches using the classifier DNN. In this paper, EMD is first used to decompose the speech signals, which contain emotional components into multiple intrinsic mode functions (IMFs), and then emotional features are derived from the IMFs and are calculated using MFCC. Then, the emotional features are used to train the DNN model. Finally, a trained model that could recognize the emotional signals is then used to identify emotions in speeches. Experimental results reveal that the proposed method is effective.
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition...
Speech recognition can be defined as the process of converting voice signals into a sequence of words by
applying a specific algorithm implemented in a computer program. Research on speech recognition in
Indonesia is relatively limited. This paper studies which feature extraction method, Linear Predictive
Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), is best for speech recognition in the
Indonesian language. This is important because a method that produces high accuracy for a particular
language does not necessarily produce the same accuracy for other languages, considering that every
language has different characteristics. Thus, this research can hopefully help accelerate the use of
automatic speech recognition for the Indonesian language. There are two main processes in speech
recognition: feature extraction and recognition. The feature extraction methods compared in this study are
LPC and MFCC, while recognition uses the Hidden Markov Model (HMM). The test results show that the
MFCC method is better than LPC for Indonesian-language speech recognition.
Audio/Speech Signal Analysis for Depression
The word "depressed" is a common everyday word. People might say "I am depressed" when in fact they mean "I am fed up because I have had a row, or failed an exam, or lost my job", etc. These ups and downs of life are common and normal, and most people recover quite quickly. Depression is identified by different methods; here we identify depression using the MFCC (Mel Frequency Cepstral Coefficient) method. There are different parameters used to distinguish depressed speech from normal speech, but MFCC-based parameters are the most applicable because depressive speech or audio signals can contain more information in the higher energy bands when compared with normal speech.
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text-based and audio-based models and then fused these heterogeneous feature maps, taken from intermediate layers, to complete the architecture. The performance measures (accuracy, precision, recall and F1-score) are observed to outperform the existing models.
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
From the existing research it has been observed that many techniques and methodologies are available for performing every step of an Automatic Speech Recognition (ASR) system, but the performance (minimization of the Word Error Rate, WER, and maximization of the Word Accuracy Rate, WAR) does not depend only on the technique applied in a given method. The research work indicates that performance mainly depends on the category of noise, the level of noise, and the variable sizes of the window, frame, frame overlap, etc. considered in the existing methods. The main aim of the work presented in this paper is to use variable sizes of parameters such as window size, frame size and frame overlap percentage to observe the performance of algorithms for various categories of noise at different levels, and also to train the system for all parameter sizes and categories of real-world noisy environments to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and accuracy tests obtained by applying variable parameter sizes. It is observed that it is very hard to evaluate test results and decide parameter sizes for ASR performance improvement and its resultant optimization. Hence, this study further suggests feasible and optimal parameter sizes using a Fuzzy Inference System (FIS) for enhancing the resultant accuracy in adverse real-world noisy environmental conditions. This work will be helpful for discriminative training of ubiquitous ASR systems for better Human Computer Interaction (HCI). Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, ubiquitous ASR system, Human Computer Interaction (HCI)
On the use of voice activity detection in speech emotion recognition
Emotion recognition through speech has many potential applications; however, the challenge comes from achieving high emotion recognition accuracy while using limited resources or in the presence of interference such as noise. In this paper we have explored the possibility of improving speech emotion recognition by utilizing the voice activity detection (VAD) concept. The emotional voice data from the Berlin Emotion Database (EMO-DB) and a custom-made database, the LQ Audio Dataset, are first preprocessed by VAD before feature extraction. The features are then passed to a deep neural network for classification. In this paper, we have chosen MFCC to be the sole determinant feature. From the results obtained with and without VAD, we have found that VAD improved the recognition rate of 5 emotions (happy, angry, sad, fear, and neutral) by 3.7% when recognizing clean signals, while using VAD when training the network with both clean and noisy signals improved our previous results by 50%.
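A simple energy-based VAD of the kind that could precede feature extraction marks a frame as speech when its energy exceeds a fraction of the peak energy (the paper's actual VAD algorithm is not specified here; the threshold ratio is illustrative):

```python
def energy_vad(frames, threshold_ratio=0.1):
    """Mark frames as speech when their energy exceeds a fraction of the
    peak frame energy (a toy VAD; real VADs also use spectral cues)."""
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    thr = threshold_ratio * max(energies)
    return [e > thr for e in energies]
```

Dropping the frames flagged as non-speech before computing MFCCs keeps silence and low-energy noise out of the emotion classifier's input.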
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION
The performance of various acoustic feature extraction methods has been compared in this work using a
Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic
features are a series of vectors that represent the speech signal; they can be classified into either words or
sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic
vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel
frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), which closely resemble the
human auditory system, have also been used. These feature vectors are then trained using the LSTM neural
network, and the obtained models of different phonemes are compared with statistical tools, namely the
Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
Speech emotion recognition with light gradient boosting decision trees machine
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
This paper reports on an Audio-Visual Client Recognition System, implemented using MATLAB, which identifies five clients and can be extended to identify as many clients as it is trained for. The visual recognition part was implemented using Principal Component Analysis, Linear Discriminant Analysis and a Nearest Neighbour Classifier. The audio recognition part was successfully implemented using Mel-Frequency Cepstrum Coefficients, Linear Discriminant Analysis and a Nearest Neighbour Classifier. The system was tested using images and sounds that it had not been trained on, to see whether it could detect an intruder; this led to a very successful result, with a precise response to the intruder.
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient
Development of Malayalam speech recognition systems is in its infancy, although much work has been done in other Indian languages. In this paper we present the first work on a speaker-independent Malayalam isolated-speech recognizer based on PLP (Perceptual Linear Predictive) cepstral coefficients and the Hidden Markov Model (HMM). The performance of the developed system has been evaluated with different numbers of HMM states. The system is trained with 21 male and female speakers in the age group ranging from 19 to 41 years, and it obtained an accuracy of 99.5% on unseen data.
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
A novel text-independent speaker identification system based on the Zak transform is implemented. The data used in this paper are drawn from the ELSDSR database. The efficiency of identification approaches 91.3% using single test file and 100% using two test files. The method shows comparable efficiency results with the well known MFCC method with an advantage of being faster in both modeling and identification.
Effect of MFCC Based Features for Speech Signal Alignmentskevig
The fundamental techniques used for man-machine communication include Speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of the speech signals. The HNM analyses and synthesis provides high quality speech with
less number of parameters. Dynamic time warping is well known technique used for aligning two given
multidimensional sequences. It locates an optimal match between the given sequences. The improvement in
the alignment is estimated from the corresponding distances. The objective of this research is to investigate
the effect of dynamic time warping on phrases, words, and phonemes based alignments. The speech signals
in the form of twenty five phrases were recorded. The recorded material was segmented manually and
aligned at sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the
aligned frames. The investigation has shown better alignment in case of HNM parametric domain. It has
been seen that effective speech alignment can be carried out even at phrase level
Similar to Mfcc based enlargement of the training set for emotion recognition in speech (20)
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Mfcc based enlargement of the training set for emotion recognition in speech
Signal & Image Processing : An International Journal (SIPIJ) Vol.5, No.1, February 2014
DOI : 10.5121/sipij.2014.5103
MFCC BASED ENLARGEMENT OF THE TRAINING SET FOR EMOTION RECOGNITION IN SPEECH

Inma Mohino-Herranz¹, Roberto Gil-Pita¹, Sagrario Alonso-Diaz² and Manuel Rosa-Zurera¹

¹Department of Signal Theory and Communications, University of Alcala, Spain
²Human Factors Unit, Technological Institute "La Marañosa", Ministry of Defense, Madrid, Spain
ABSTRACT
Emotional state recognition through speech has become a very interesting research topic nowadays. Using subliminal information of speech, denominated "prosody", it is possible to recognize the emotional state of the person. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns. This fact makes the learning process more difficult, due to the generalization problems that arise under these conditions.

In this work we propose a solution to this problem, consisting in enlarging the training set through the creation of new virtual patterns. In the case of emotional speech, most of the emotional information is carried by speed and pitch variations. Hence, a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We use this prior information to create new patterns by applying a gender-dependent pitch shift in the feature extraction process of the classification system. For this purpose, we propose a gender-dependent frequency scaling modification of the Mel Frequency Cepstral Coefficients used to classify the emotion. This process allows us to synthetically increase the number of available patterns in the training set, thus increasing the generalization capability of the system and reducing the test error. Experiments carried out with two classifiers of differing generalization capability demonstrate the suitability of the proposal.
KEYWORDS
Enlarged training set, MFCC, emotion recognition, pitch analysis
1. INTRODUCTION
Emotional state recognition (ESR) through speech has become a very interesting research topic nowadays. Using subliminal information of speech, it is possible to recognize the emotional state of the person. This information, denominated "prosody", reflects some features of the speaker and adds information to the communication [1], [2].
The standard scheme of an ESR system consists of a feature extraction stage followed by a classification stage. Some of the most useful features in speech-based ESR systems are the Mel-Frequency Cepstral Coefficients (MFCCs), which are among the most powerful features used in speech information retrieval [3]. The classification stage uses artificial intelligence techniques to learn the classification rule from data. In order to avoid loss of generalization in the results, it is necessary to split the available data into two sets, one for training the system and another for testing it.
One of the main problems in the design of automatic ESR systems is the small number of
available patterns. This fact makes the learning process more difficult, due to the generalization
problems that arise under these conditions [4], [5].
A possible solution to this problem consists in enlarging the training set through the creation of new virtual patterns. This idea, originally proposed in [6], consists in the use of auxiliary information about the target function, denominated hints, to guide the learning process. The use of hints has been proposed in several applications, such as automatic target recognition [7] or face recognition [8].
In the case of emotional speech, it is important to highlight that most of the information is carried by speed and pitch variations [9]. On the other hand, the average pitch of the sentence mainly depends on the individual. Since the emotion recognition task in the problem at hand must be carried out independently of the specific characteristics of a given subject, a change in the average pitch value that modifies neither the speed nor the pitch variations does not affect the expressed emotion. Note that for ESR systems tailored to a specific individual the proposed solution may not be suitable, since in that case a change in the global pitch could represent a variation in the emotion rather than a variation in the characteristics of the subject.
In this work we propose the creation of new patterns by applying a pitch shift modification in the
feature extraction process of a multi-subject ESR system. For every pattern in the training set, we
apply a set of pitch shifts through frequency scaling in the MFCC extraction process. So, several
new virtual patterns are generated from each pattern in the training set using a range of shifts for
the pitch. The size, density and shape of the range of the applied pitch shifts are parameters of the
proposed enlargement method, and their effect on the final results is also studied. Furthermore,
since the gender is related to the range of possible valid pitches [10], [11], we propose to use it in
order to modulate the shape of the range of pitch variation, avoiding the creation of non-valid
pitches. This proposed process allows us to synthetically increase the number of available
patterns in the training set, thus increasing the generalization capability of the system and
reducing the test error.
In order to demonstrate the suitability of the proposal, two different classifiers (the Least Square
Linear Classifier and the Least Square Diagonal Quadratic Classifier) have been tested under a set
of experiments using an available database. These classifiers have different generalization
capabilities and serve to demonstrate the performance of the proposed enlargement under
different scenarios of generalization problems.
2. MATERIALS AND METHODS
This section explains the two main stages of an ESR system: the feature extraction stage and the
classification stage, describing the configuration of the ESR system used in the experiments.
2.1. Feature extraction: Mel-Frequency Cepstral Coefficients (MFCCs)
Obtaining MFCC coefficients [12] has been regarded as one of the techniques of parameterization
most important used in speech processing. They provide a compact representation of the spectral
envelope, so that most of the energy is concentrated in the first coefficients. Perceptual analysis
emulates human ear non-linear frequency response by creating a set of filters on non-linearly
spaced frequency bands. Mel cepstral analysis uses the Mel scale and a cepstral smoothing in
order to get the final smoothed spectrum. Figure 1 shows the scheme for the MFCC evaluation.
The main stages of MFCC analysis are:

- Windowing: In order to overcome the non-stationarity of speech, it is necessary to analyse the signal in short time periods, in which it can be considered almost stationary. So, time frames or segments are obtained by dividing the signal. This process is called windowing. In order to maintain the continuity of the signal information, it is common to use frame blocks that overlap one another, so that no information is lost in the transition between windows.

- DFT: After windowing, the DFT of x_t[n], the result of windowing the t-th time frame with a window of length N, is calculated:

X_t[k] = \sum_{n=0}^{N-1} x_t[n] \, e^{-j 2\pi kn/N}, \quad 0 \le k \le N-1    (1)

From this moment, the phase is discarded and we work with the energy of the speech signal, |X_t[k]|^2.
- Filter bank: The signal |X_t[k]|^2 is then multiplied by a triangular filter bank, as expressed in Equation (2):

S_t[m] = \sum_{k=0}^{N-1} |X_t[k]|^2 \, H_m[k], \quad 1 \le m \le F    (2)

where H_m[k] are the triangular filter responses, whose area is unity. These triangles are spaced according to the Mel frequency scale. The bandwidth of the triangular filters is determined by the distribution of the central frequencies f[m], which are a function of the sampling frequency and the number of filters. If the number of filters is increased, the bandwidth is reduced.
So, in order to determine the central frequencies of the filters f[m], the behaviour of the human psychoacoustic system is approximated through B(f), the frequency in Mel scale, in Equation (3):

B(f) = 1125 \ln\left(1 + \frac{f}{700}\right)    (3)

where f corresponds to the frequency represented on a linear scale axis.

Therefore, the triangular filters, normalized to unit area, can be expressed as:

H_m[k] = \begin{cases} 0, & k < f[m-1] \\ \dfrac{2\,(k - f[m-1])}{(f[m+1]-f[m-1])(f[m]-f[m-1])}, & f[m-1] \le k \le f[m] \\ \dfrac{2\,(f[m+1] - k)}{(f[m+1]-f[m-1])(f[m+1]-f[m])}, & f[m] < k \le f[m+1] \\ 0, & k > f[m+1] \end{cases}    (4)

where 1 ≤ m ≤ F, F being the number of filters, and the central frequency f[m] of the m-th frequency band is:

f[m] = \left(\frac{N}{F_s}\right) B^{-1}\!\left( B(f_l) + m\,\frac{B(f_h) - B(f_l)}{F+1} \right)    (5)

where B^{-1}(b) = 700\,(e^{b/1125} - 1), f_l and f_h are the lowest and highest frequencies of the filter bank, and F_s is the sampling frequency.
- DCT (Discrete Cosine Transform): Through the DCT, expressed in Equation (6), the log filter bank energies are transformed, so the spectral coefficients are converted to cepstral coefficients:

c_t[j] = \sum_{m=1}^{F} \ln\left(S_t[m]\right) \cos\left( j \left(m - \frac{1}{2}\right) \frac{\pi}{F} \right)    (6)
Once the MFCCs are evaluated, features are determined from statistics of each MFCC. Some of the most commonly used statistics are the mean and the standard deviation. It is also habitual to use statistics from differential values of the MFCCs, denominated delta MFCCs, or ∆MFCCs. These ∆MFCCs are determined using Equation (7):

\Delta c_t[j] = c_t[j] - c_{t-d}[j]    (7)

where d determines the differentiation shift. In this paper we use as features the mean and standard deviation of the MFCCs, and the standard deviation of the ∆MFCCs with d = 2, since we have found that these values obtain very good results with a considerably low number of features.
Figure 1. Scheme of the MFCC calculation: windowing, |DFT|², Mel filter bank, log, and DCT.
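The pipeline just described can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: the Mel mapping B(f) = 1125 ln(1 + f/700), the Hamming window, and the unit-area normalization of the triangles are common conventions assumed here.

```python
import numpy as np

def mel(f):
    # Mel mapping B(f); the 1125*ln(1 + f/700) form is one common convention.
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(b):
    # Inverse Mel mapping B^{-1}(b).
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    # Triangular filters centered on Mel-spaced frequencies f[m],
    # normalized so each filter has unit (discrete) area.
    if f_high is None:
        f_high = fs / 2.0
    mel_points = np.linspace(mel(f_low), mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_points) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                    # rising edge
            H[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge
            H[m - 1, k] = (r - k) / max(r - c, 1)
        area = H[m - 1].sum()
        if area > 0:
            H[m - 1] /= area                     # unit-area normalization
    return H

def mfcc_frame(x, H, n_ceps):
    # One frame: Hamming window -> |DFT|^2 -> filter bank -> log -> DCT.
    X = np.fft.rfft(x * np.hamming(len(x)))
    S = np.log(H @ (np.abs(X) ** 2) + 1e-12)     # log filter bank energies
    m = np.arange(1, len(S) + 1)
    return np.array([np.sum(S * np.cos(j * (m - 0.5) * np.pi / len(S)))
                     for j in range(n_ceps)])
```

With Fs = 16 kHz and N = 512, as used later in the experiments, each frame spans 32 ms.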
2.2. Classification stage
This section explains the two classifiers used: the Least Square Linear Classifier and the Least Square Diagonal Quadratic Classifier.
2.2.1. Least Square Linear Classifier
Let us define a set of training patterns x = [x_1, x_2, ..., x_L]^T, where each of these patterns corresponds to one of the possible classes denoted C_i, i = 1, ..., K. In a linear classifier, the decision rule is obtained using a set of K linear combinations of the training patterns, as shown in Equation (8):

y_k = \sum_{n=1}^{L} w_{kn} x_n + w_k    (8)

The design of the classifier consists in finding the best values of w_{kn} and w_k in order to minimize the classification error.

The output of the linear combinations y_k is used to determine the decision rule. For instance, if the component y_k gives the maximum value of the vector, then the k-th class is assigned to the pattern. In order to determine the values of the weights, the mean squared error is minimized, which can be carried out using the Wiener-Hopf equations [13].

This classifier is very simple, because the boundaries are hyperplanes. Thus, due to the simplicity of the implemented decision boundaries, both the error performance and the generalization capability tend to be high.
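A least-squares linear classifier of this kind can be sketched as follows: one-hot class targets are regressed on the bias-augmented patterns, which is the batch equivalent of solving the Wiener-Hopf equations. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def train_ls_linear(X, y, n_classes):
    # X: (P, L) patterns; y: integer labels. Append a constant for the
    # bias w_k, then solve the least-squares problem for one-hot targets.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    T = np.eye(n_classes)[y]                 # one-hot target matrix
    W, *_ = np.linalg.lstsq(Xa, T, rcond=None)
    return W                                 # (L+1, K) weight matrix

def classify_ls_linear(X, W):
    # Assign the class k whose output y_k is maximal, as in Equation (8).
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xa @ W, axis=1)
```

Training amounts to a single `lstsq` call; classification picks the k that maximizes y_k.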
2.2.2. Least Square Diagonal Quadratic Classifier

The Least Square Diagonal Quadratic Classifier renders very good results with a very fast learning process [14], and therefore it has been selected for the experiments carried out in this paper. Let us consider a set of training patterns x = [x_1, x_2, ..., x_L]^T, where each of these patterns is assigned to one of the possible classes denoted C_i, i = 1, ..., K. In a quadratic classifier, the decision rule can be obtained using a set of K combinations, as shown in Equation (9):

y_k = \sum_{n=1}^{L} w_{kn} x_n + \sum_{m=1}^{L} \sum_{n=1}^{L} v_{mnk} x_m x_n + w_k    (9)

where w_{kn} and v_{mnk} are the linear and quadratic weights, respectively. In the diagonal version, only the terms v_{nnk}, which weight the squares x_n^2, are retained.

This classifier is similar to the previous one, but the boundaries are quadratic functions, which implies a more complex system. This added complexity allows the classifier to obtain better results in terms of error probability. However, this classifier presents higher generalization problems than the Linear Classifier.
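Since the diagonal quadratic classifier only adds squared terms, it can be trained with the same least-squares machinery as the linear classifier after augmenting each pattern with its squares. The augmentation helper below is ours, not the paper's.

```python
import numpy as np

def quad_diag_features(X):
    # Augment each pattern with its element-wise squares: [x, x^2].
    # Restricting the quadratic term to the diagonal (v_mnk = 0 for m != n)
    # keeps the feature count at 2L instead of L + L(L+1)/2.
    return np.hstack([X, X ** 2])
```

Feeding `quad_diag_features(X)` to the same least-squares training yields axis-aligned quadratic decision boundaries.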
3. PROPOSED MFCC-BASED ENLARGEMENT OF THE TRAINING SET
As stated in the introduction, most of the information of emotional speech is carried by speed and pitch variations [14]. So, a change in the average pitch value that modifies neither the speed nor the pitch variations does not affect the expressed emotion.
In this paper we propose to modify the MFCC extraction in order to implement frequency scaling, allowing us to create new patterns for the training set. The MFCCs can be easily pitch-shifted through a scale factor applied in the frequency domain. This modification is applied to each pattern in the database, enlarging the training set.
Let us define the Pitch Shift Factor (PSF) as a global change of the pitch, measured in semitones. This shift in the pitch is equivalent to scaling the frequency by a Frequency Scale Factor (FSF), since 12 semitones correspond to one octave. The relationship between PSF and FSF is expressed in Equation (10):

FSF = 2^{PSF/12}    (10)
In order to apply this frequency scaling in the MFCC process, the central frequencies f[m] of the triangular filters are modified, taking into account the frequency scale factor. Equation (11) shows the relationship between the original and the synthetic central frequencies:

\hat{f}[m] = FSF \cdot f[m]    (11)

the new frequency scale being given by Equation (12):

\hat{B}(f) = B(FSF \cdot f)    (12)
Figure 2 shows the difference between the standard central frequencies of the filters and those obtained when the frequency has been scaled.

As an example, the difference between the MFCC filter banks calculated with PSF = 0 and PSF = 1 is shown in Figure 3, which compares, in logarithmic scale, the filter responses without frequency shift and those shifted in frequency by one semitone.
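The PSF-to-FSF relation of Equation (10) and the scaling of the center frequencies translate directly into code. A minimal sketch follows; the helper names are ours.

```python
import numpy as np

def fsf(psf_semitones):
    # A pitch shift of PSF semitones corresponds to scaling frequency
    # by 2^(PSF/12), since 12 semitones double the frequency.
    return 2.0 ** (psf_semitones / 12.0)

def scaled_center_freqs(f_m, psf_semitones):
    # Scale the Mel filter center frequencies f[m]; recomputing the
    # triangular filters on these frequencies and repeating the MFCC
    # computation yields one new virtual pattern per PSF value.
    return np.asarray(f_m) * fsf(psf_semitones)
```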
As mentioned, we propose the creation of new patterns by applying a pitch shift modification in
the feature extraction process of a multi-subject ESR system. For every pattern in the training set,
we apply a set of pitch shifts through frequency scaling in the MFCC extraction process. So,
several new virtual patterns are generated from each pattern in the training set using a range of
shifts for the pitch. Furthermore, since the gender is related to the range of possible valid pitches,
we have used this information in order to modulate the shape of the range of pitch variations,
avoiding the creation of non-valid pitches.
In order to implement the enlargement of the database using pitch shifting, three factors must be taken into account: the range of the pitch shifting (R), the step of the pitch shifting (S), and the symmetry factor (K).
- Range (R): The range defines the maximum absolute variation in the pitch modification process
in semitones. With this parameter it is possible to change the upper and lower limits of the shift
variations.
- Step (S): The step defines the smallest change in the pitch that is produced in the pitch shifting
process in semitones.
- Symmetry factor (K): This factor controls the symmetry of the range, that is, the relationship between the maximum positive variation of the pitch and the minimum negative variation of the pitch. The key point is that files corresponding to a male speaker are mainly positively shifted, and files with a female speaker are mainly negatively shifted. So, K modulates the minimum value of the range of variation for males and the maximum value of the range of variation for females. Thus, the PSF range is [−K·R, R] for males and [−R, K·R] for females. For instance, with R = 2, S = 0.5 and K = 0.75, the range for males is [−1.5, 2] and the range for females is [−2, 1.5].
Taking into account these three factors, it is possible to determine the enlargement factor (EF), that is, the number of times that the size of the training set is increased:

EF = \frac{R(1+K)}{S} + 1    (13)
Figure 2. Central frequencies f[m] of the triangular filters, original and modified.
Figure 3. Triangular filters H_m[k] for the calculation of the MFCCs, original and modified.
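Putting the three parameters together, the gender-dependent PSF grid and the enlargement factor of Equation (13) can be sketched as follows; the helper names are hypothetical.

```python
import numpy as np

def psf_grid(R, S, K, gender):
    # Males are shifted mostly upward: PSF in [-K*R, R];
    # females mostly downward: PSF in [-R, K*R]; step S semitones.
    lo, hi = (-K * R, R) if gender == "male" else (-R, K * R)
    return np.arange(lo, hi + S / 2, S)

def enlargement_factor(R, S, K):
    # Number of virtual patterns generated per original pattern, Eq. (13).
    return R * (1 + K) / S + 1
```

For R = 2, S = 0.5 and K = 0.75, both genders obtain 8 shifts per utterance, matching EF = 8.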
4. RESULTS
This section describes the experiments carried out and the most relevant results.
In this study, we have used the public database "The Berlin Database of Emotional Speech" [15]. This database consists of 800 files of 10 actors (5 males and 5 females), where each actor produces 10 German utterances (5 short and 5 longer sentences) simulating seven different emotions: Neutral, Anger, Fear, Happiness, Sadness, Disgust, and Boredom. The recordings were made at a sampling frequency of 48 kHz and later downsampled to 16 kHz. Although the database consists of 800 files, 265 were eliminated, since only those utterances with a recognition rate better than 80% and naturalness better than 60% were finally chosen. So, the database consists of 535 files.
Since the size of the database is not very large, and in order to ensure that the results are independent of the partition between training and test sets, we have used the validation method denominated Leave One Out [16], [17]. This is a model validation technique used to evaluate how the results of a statistical analysis generalize to an independent data set. It is used in environments where the main goal is prediction and we want to estimate how accurate a model will be in practice.
This technique basically consists of three stages:
- First, the database is divided into complementary subsets, a training set and a test set, where
the test set contains only one pattern of the database.
- Then, the parameters of the classification system are obtained using the training set.
- Finally, the performance of the classification system is measured on the test set.
In order to increase the accuracy of the error estimation while maximizing the size of the
training set, multiple iterations of this process are performed using a different partition each
time, and the test results are averaged over the iterations.
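The three stages above can be sketched as a generic cross-validation loop; the toy classifier used here (a nearest-class-mean rule) is only a hypothetical stand-in for illustration, not one of the classifiers used in the paper:

```python
import numpy as np

def leave_one_out_error(features, labels, fit, predict):
    """Estimate the test error by leaving one pattern out at a time.

    fit(X, y) -> model and predict(model, x) -> label are supplied by the caller.
    """
    n = len(labels)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i              # training set: all patterns except i
        model = fit(features[mask], labels[mask])
        errors += predict(model, features[i]) != labels[i]
    return errors / n                         # error averaged over the n iterations

# Toy usage: nearest-class-mean classifier on two separable clusters.
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, x):
    return min(model, key=lambda c: np.linalg.norm(x - model[c]))

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(leave_one_out_error(X, y, fit, predict))  # 0.0 on this separable toy set
```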
In this paper we use an adaptation of this technique to the problem at hand, which we call
Leave One Couple Out. We work with the database described above, which contains 5 male and 5
female speakers. In this case, we use 4 males and 4 females for each training set and 1 male
and 1 female for each test set. This division guarantees complete independence between training
and test data while keeping the gender balance. Our Leave One Couple Out procedure is therefore
repeated 25 times, using different training and test sets in each iteration.
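This gender-balanced partitioning can be sketched as follows; the speaker identifiers are hypothetical placeholders, since the Berlin database labels its actors differently:

```python
from itertools import product

males = ["m1", "m2", "m3", "m4", "m5"]      # hypothetical speaker IDs
females = ["f1", "f2", "f3", "f4", "f5"]

# One iteration per (male, female) couple: that couple forms the test set,
# the remaining 4 + 4 speakers form the training set.
splits = []
for m, f in product(males, females):
    test = {m, f}
    train = [s for s in males + females if s not in test]
    splits.append((train, sorted(test)))

print(len(splits))   # 25 iterations, one per couple
```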
Concerning the features, a window size of N = 512 samples has been used, which at 16 kHz
implies time frames of 32 ms. We then select the mean and standard deviation of 25 MFCCs, and
the standard deviation of the 25 2-ΔMFCCs, resulting in a total of 75 features, which are used
to design a linear and a quadratic classifier. In order to interpret the results obtained, it
is necessary to analyse the error probability for the training set, the error probability for
the test set, and the enlargement factor (EF).
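The construction of the 75-dimensional feature vector can be sketched as below; interpreting "2-ΔMFCC" as second-order (acceleration) coefficients and approximating them with a double frame difference is our assumption, not the paper's stated formulation:

```python
import numpy as np

def utterance_features(mfcc):
    """Collapse a (n_frames, 25) MFCC matrix into one 75-dimensional pattern.

    Assumes "2-ΔMFCC" denotes second-order (acceleration) coefficients,
    approximated here by a simple double frame difference.
    """
    delta2 = np.diff(mfcc, n=2, axis=0)      # 2nd-order frame differences
    return np.concatenate([
        mfcc.mean(axis=0),                   # 25 means
        mfcc.std(axis=0),                    # 25 standard deviations
        delta2.std(axis=0),                  # 25 ΔΔ standard deviations
    ])                                       # -> 75 features in total

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 25))      # stand-in for real MFCC frames
print(utterance_features(frames).shape)      # (75,)
```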
This work aims at enlarging the available database. For this purpose we have modified the
average pitch of the available patterns through frequency scaling of the MFCCs. It is important
to highlight that, in order to keep the results comparable, the test set has been neither
enlarged nor modified in the experiments carried out in this paper.
In order to demonstrate the behaviour of the proposal, we have carried out experiments with
different parameters. The main parameters varied in the experiments are the Range (R), the
Step (S) and the Symmetry coefficient (K), and a wide range of values has been used for each of
them. The values used for the Range (R) are 0, 0.5, 1, 2, 3, 4, 5, 6, 8, 10 and 12, chosen to
clearly show the evolution of the results. For the Step (S), we have taken values from 1/32 to
2, namely 1/32, 1/16, 1/8, 1/4, 1/2, 1 and 2.
Finally, only two values of the Symmetry coefficient are reported, K = 1 and K = 0.75. We have
checked other values of K, but these two are enough to show the response of the system under
different conditions.
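One plausible reading of these parameters, which we state here as an inference rather than a definition from the text, is that the pitch shifts applied to each training pattern run from -K·R to +R in steps of S, so that EF = R(1+K)/S + 1. This reading reproduces the enlargement factors quoted with the tables (EF = 29 for R=4, S=1/4, K=0.75, and EF = 113 for R=4, S=1/16, K=0.75):

```python
def shift_grid(R, S, K):
    """Pitch-shift values from -K*R to +R in steps of S (one reading of the
    Range/Step/Symmetry parameters); the grid length is the Enlargement Factor."""
    n_down = round(K * R / S)        # steps below the original pitch
    n_up = round(R / S)              # steps above the original pitch
    return [s * S for s in range(-n_down, n_up + 1)]

ef = lambda R, S, K: len(shift_grid(R, S, K))
print(ef(4, 1 / 4, 0.75))    # 29, the EF at the linear-classifier optimum
print(ef(4, 1 / 16, 0.75))   # 113, the EF at the quadratic-classifier optimum
```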
Figure 4 shows the relationship between the error probability for the test set and the
Enlargement Factor using the Linear Classifier. The results are very similar for both values of
the symmetry coefficient, K=1 and K=0.75, although the error probability tends to increase
slightly for large EF. The minimum error probability for the test set is around 30.80% using
K=0.75, very similar to the result obtained with K=1. Figure 4 also shows the error probability
as a function of the EF: it is around 33.7% with EF=1, whereas with EF=29 it drops to around
30.8%. This implies that, by enlarging the training set for emotion recognition in speech, it
is possible to reduce the error probability for the test set.
Figure 4 Relationship between the Error Probability for the test set and the Enlargement Factor.
Linear Classifier
However, if we study the performance of the system using the Quadratic Classifier, as shown in
Figure 5, the error probability for the test set decreases by around 3% with respect to the
Linear Classifier. Additionally, the difference between the results obtained with the two
symmetry coefficients shows the importance of this parameter, since K=0.75 reduces the error
probability for almost every value of S.
Figure 5 Relationship between the Error Probability for the test set and the Enlargement Factor.
Quadratic Classifier
As Figures 4 and 5 show, the error probability for the test set improves with K=0.75. In order
to study the response of the system, we have checked several values of K, but the best results
correspond to K=0.75, so this value has been selected for the tables below.
Figure 6 shows the evolution and trend of the error probability for the test and training sets
with both the Linear and the Quadratic Classifier. The error probability for the test set is
around 30%, and it can be seen that with Range R=4 it is lower than for any other value of R.
However, if we pay attention to the trend of the error probability for the training set, it is
easy to observe a continuous increase with the Range (R).
[Figure 6 plots the error probability (0 to 0.4) against the Range (0 to 12) for K=0.75 and
S=1/4, with four curves: training and test error for the Linear and the Quadratic Classifier.]
Figure 6 Linear Classifier vs. Quadratic Classifier. K = 0.75 and S=1/4
Table 1 shows the error probability for the test set using the Linear Classifier for the
different values of the Range (R) and the Step (S). The symmetry coefficient used is K=0.75,
since it was shown above to provide better results. The table shows the trend of the error
probability with R; furthermore, the Step that provides the best results is S=1/4, with an
error probability of around 30%. Reaching this result requires an enlargement factor (EF) of
29, that is, the pattern set must be increased 29 times. Comparing the results, with EF=1 the
error probability for the test set is 33.68%, whereas when the EF is increased to 29 it drops
to 30.88%, which confirms the improved performance obtained by enlarging the training set.
The EF for each case is shown in Table 3.
Table 1 Error Probability for the test set using the Linear Classifier
Table 2 shows the error probability for the test set using the Quadratic Classifier. The error
probabilities are lower than those obtained with the Linear Classifier in Table 1. In the case
of the Quadratic Classifier, it is possible to reduce the error probability to 26.69%, obtained
with R=4, S=1/16 and EF=113 (the enlargement factors used are shown in Table 3). With EF=1, in
contrast, the error probability for the test set is 33.94%. This demonstrates that increasing
the EF up to certain values improves the results obtained.
Table 2 Error Probability for the test set using the Quadratic Classifier
Table 3 Enlargement Factor
5. CONCLUSIONS
In the study of emotions in speech, one of the main problems is the small number of available
patterns. This fact makes the learning process more difficult, due to the generalization
problems that arise in the learning stage. In this work we propose a solution to this problem,
consisting in enlarging the training set through the creation of new virtual patterns. In the
case of emotional speech, most of the emotional information is contained in speed and pitch
variations. Thus, a change in the average pitch value that modifies neither the speed nor the
pitch variations does not affect the expressed emotion. We use this prior information to create
new patterns by applying a pitch shift modification in the feature extraction process of the
classification system. For this purpose, we propose a gender dependent frequency scaling
modification. This process allows us to synthetically increase the number of available patterns
in the training set, thus increasing the generalization capability of the system and reducing
the test error.
In order to demonstrate the suitability of the proposal, two different classifiers (the Least
Squares Linear Classifier and the Least Squares Diagonal Quadratic Classifier) have been tested
in a set of experiments using a publicly available database. The results demonstrate that the
Quadratic Classifier provides lower errors than the Linear Classifier; however, the
generalization problems are smaller with the Linear Classifier.
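A minimal least-squares classifier of the kind compared here can be sketched as follows; appending squared features as a stand-in for the diagonal quadratic discriminant is our simplification, not the paper's exact formulation:

```python
import numpy as np

def ls_fit(X, y, quadratic=False):
    """Least-squares classifier: regress one-hot class targets on the features.

    With quadratic=True, squared features are appended, a simple stand-in for
    a diagonal quadratic discriminant (no cross-terms between features).
    """
    if quadratic:
        X = np.hstack([X, X**2])
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append a bias term
    T = np.eye(y.max() + 1)[y]                  # one-hot class targets
    W, *_ = np.linalg.lstsq(X1, T, rcond=None)
    return W

def ls_predict(W, X, quadratic=False):
    if quadratic:
        X = np.hstack([X, X**2])
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return (X1 @ W).argmax(axis=1)              # class with the largest output

# Toy check on two linearly separable classes.
X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([0, 0, 1, 1])
W = ls_fit(X, y)
print(ls_predict(W, X))    # [0 0 1 1]
```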
Using the MFCC-based enlargement of the training set, the system has an appropriate number of
patterns and can be trained correctly. With this enlargement of the database, it is possible to
reduce the error probability in emotion recognition by around 7 percentage points (from 33.94%
to 26.69% with the Quadratic Classifier), which is a considerable improvement in performance.
ACKNOWLEDGEMENTS
This work has been funded by the Spanish Ministry of Education and Science (TEC2012-38142-
C04-02), by the Spanish Ministry of Defense (DN8644-ATREC) and by the University of Alcala
under project CCG2013/EXP-074.
REFERENCES
[1] Ververidis, D. & Kotropoulos, C., (2006) “Emotional speech recognition: Resources, features, and
methods”, Elsevier Speech Communication, Vol. 48, No. 9, pp1162-1181.
[2] Schuller, B., Batliner, A., Steidl, S. & Seppi, D., (2011) “Recognising realistic emotions and affect in
speech: State of the art and lessons learnt from the first challenge”, Elsevier Speech Communication,
Vol. 53, No. 9, pp1062-1087
[3] Mohino, I., Goñi, M., Alvarez, L., Llerena, C. & Gil-Pita, R., (2013) “Detection of emotions and
stress through speech analysis”, International Association of Science and Technology for Development.
[4] Öztürk, N., (2003) “Use of genetic algorithm to design optimal neural network structure”, MCB UP
Ltd Engineering Computations, Vol. 20, No.8, pp979-997.
[5] Mori, R., Suzuki, S. & Takahara, H., (2007) “Optimization of Neural Network Modeling for Human
Landing Control Analysis”, AIAA Infotech@ Aerospace 2007 Conference and Exhibit, pp7-10.
[6] Abu-Mostafa, Yaser S, (1995) “Hints”, MIT Press Neural Computation, Vol. 7, No. 4, pp639-671.
[7] Gil-Pita, R., Jarabo-Amores, P., Rosa-Zurera, M., Lopez-Ferreras, F., (2002) “Improving neural
classifiers for ATR using a kernel method for generating synthetic training sets”, IEEE Neural
Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on, pp425-434.
[8] Niyogi, Partha & Girosi, Federico & Poggio, Tomaso, (1998) “Incorporating prior information in
machine learning by creating virtual examples”, IEEE Proceedings of the IEEE, Vol. 86, No. 11,
pp2196-2209.
[9] Vroomen, J., Collier, R. & Mozziconacci, Sylvie JL, (1993) “Duration and intonation in emotional
speech”, Eurospeech.
[10] Ting, Huang and Yingchun, Yang and Zhaohui, Wu (2006) “Combining MFCC and pitch to enhance
the performance of the gender recognition”.
[11] Zeng, Yu-Min and Wu, Zhen-Yang and Falk, Tiago and Chan, W-Y (2006) “Robust GMM based
gender classification using pitch and RASTA-PLP parameters of speech”, 3376-3379.
[12] Davis, S. & Mermelstein P., (1980) “Experiments in syllable-based recognition of continuous
speech”, IEEE Transactions on Acoustics Speech and Signal Processing, Vol. 28, pp357-366.
[13] H.L. Van Trees, “Detection, estimation, and modulation theory”, vol. 1. Wiley, 1968.
[14] Gil-Pita, R. & Alvarez-Perez, L. & Mohino, Inma, (2012) “Evolutionary diagonal quadratic
discriminant for speech separation in binaural hearing aids”, Advances in Computer science, Vol. 20,
No. 5, pp227-232.
[15] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F. & Weiss, B., (2005) “A database of
German emotional speech”, Interspeech, pp1517-1520.
[16] Chang, M.-W. and Lin, C.-J., (2005) “Leave-one-out bounds for support vector regression model
selection”, MIT Press Neural Computation, Vol. 17, No. 5, pp1188-1222.
[17] Cawley, G. C. & Talbot, N. L., (2004) “Fast exact leave-one-out cross-validation of sparse least-
squares support vector machines”, Vol. 16, No. 10, pp1467-1475.
AUTHORS
Inma Mohino-Herranz
Telecommunication Engineer, Alcalá University, 2010. PhD student in Information and
Communication Technologies. Area of research: Signal Processing.
Roberto Gil-Pita
Telecommunication Engineer, Alcalá University, 2001. Position: Associate Professor,
Polytechnic School, Department of Signal Theory and Communications. His research
interests include audio, speech, image and biological signals.
Sagrario Alonso-Díaz
PhD in Psychology. Researcher in the Human Factors Unit, Technological Institute “La
Marañosa”, MoD.
Manuel Rosa-Zurera
Telecommunication Engineer, Polytechnic University of Madrid, 1995. Position: Full
Professor and Dean of the Polytechnic School, University of Alcalá. His areas of
interest are audio, radar, speech and source separation.