This document presents a comparative analysis of different classifiers for emotion recognition through acoustic features. It analyzes prosody features like energy and pitch as well as spectral features like MFCCs. Feature fusion, which combines prosody and spectral features, improves classification performance for LDA, RDA, SVM and kNN classifiers by around 20% compared to using features individually. Results on the Berlin and Spanish emotional speech databases show that RDA performs best as it avoids the singularity problem that affects LDA when dimensionality is high relative to the number of training samples.
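As a rough illustration of the fusion idea, the sketch below concatenates prosody and spectral feature vectors and compares the four classifier families with scikit-learn; the synthetic data, feature dimensions, and the shrinkage-LDA stand-in for RDA are all assumptions, not the paper's setup.

```python
# Minimal sketch of prosody + spectral feature fusion, assuming precomputed
# per-utterance feature matrices; not the paper's exact pipeline.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200                                   # utterances (synthetic stand-ins)
prosody = rng.normal(size=(n, 6))         # e.g. pitch/energy statistics
spectral = rng.normal(size=(n, 39))       # e.g. MFCC statistics
y = rng.integers(0, 4, size=n)            # 4 emotion labels

fused = np.hstack([prosody, spectral])    # feature fusion = concatenation

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    # Shrinkage regularizes the covariance estimate and avoids the
    # singularity problem, in the spirit of RDA.
    "RDA-like": LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),
    "SVM": SVC(kernel="rbf"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    for feats, label in [(prosody, "prosody"), (spectral, "spectral"), (fused, "fused")]:
        acc = cross_val_score(clf, feats, y, cv=5).mean()
        print(f"{name:8s} {label:8s} accuracy: {acc:.2f}")
```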
This document summarizes Dongang Wang's speech emotion recognition project which compares feature selection and classification methods. Wang selects mel-frequency cepstral coefficients (MFCCs) and energy as features. For methods, Wang tests Gaussian mixture models (GMMs), discrete hidden Markov models (HMMs), and continuous HMMs including Kalman filters. Testing on German and English corpora, continuous HMMs achieved the best average accuracy of 61.67%, outperforming GMMs and discrete HMMs. While results are promising, Wang notes challenges in recognizing emotion across languages and speakers.
This document summarizes a research paper on classifying speech using Mel frequency cepstrum coefficients (MFCC) and power spectrum analysis. The paper reviews different classifiers used for speech emotion recognition, including neural networks, Gaussian mixture models, and support vector machines. It proposes using MFCC and power spectrum features as inputs to an artificial neural network classifier to identify emotions in speech, such as anger, happiness, sadness, and neutral states. Testing is performed on emotional speech samples to evaluate the performance and limitations of the proposed speech emotion recognition system.
This is the presentation of our IEEE ICASSP 2021 paper "Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset".
VAW-GAN for disentanglement and recomposition of emotional elements in speech (KunZhou18)
The document describes a framework for emotional voice conversion using VAW-GAN that can disentangle and recompose emotional elements in speech. It proposes using VAW-GAN with continuous wavelet transform to model prosody and decompose fundamental frequency into different time scales. Conditioning the decoder on fundamental frequency is shown to improve emotion conversion performance. Experiments demonstrate the effectiveness of the approach on an English emotional speech database.
Predicting the Level of Emotion by Means of Indonesian Speech Signal (TELKOMNIKA JOURNAL)
This document summarizes a study that aimed to evaluate how well Mel-Frequency Cepstral Coefficients (MFCC) features extracted from Indonesian speech signals relate to four emotions: happy, sad, angry, and fear. Nearly 300 speech signals were collected from actors speaking Indonesian sentences with different emotions. Using support vector machine classification, the study found that the Teager energy feature and the first MFCC coefficient were most crucial for prediction, achieving 86% accuracy. Additional initial MFCC features increased accuracy slightly, but more than four features had negligible effects.
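For reference, the Teager energy operator mentioned above has the standard discrete form Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch below computes a per-frame Teager energy feature; the frame length and the mean pooling are illustrative assumptions, not the study's exact procedure.

```python
# The discrete Teager energy operator (TEO): psi[x](n) = x(n)^2 - x(n-1)*x(n+1).
# A minimal sketch of a per-frame Teager energy feature; frame length and the
# pairing with MFCCs are assumptions, not the study's exact setup.
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    """Pointwise Teager energy of a 1-D signal."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def frame_teager_feature(signal: np.ndarray, frame_len: int = 480) -> np.ndarray:
    """Mean Teager energy per non-overlapping frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.array([teager_energy(f).mean() for f in frames])
```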
The document summarizes Kun Zhou's PhD research on emotional voice conversion with non-parallel data at the National University of Singapore. It introduces emotional voice conversion and its challenges, including the lack of parallel training data. It then summarizes Kun's publications, which propose CycleGAN-based and VAW-GAN approaches to model prosody for speaker-dependent and independent emotional voice conversion. One publication introduces a method for transferring both seen and unseen emotional styles using a pre-trained speech emotion recognizer to describe emotional styles.
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S... (mathsjournal)
This document summarizes a research paper that evaluates different classification methods for speech emotion recognition, including Support Vector Machine (SVM), C5.0, and a combination of SVM and C5.0 (SVM-C5.0). The paper extracts features like energy, zero crossing rate, pitch, and MFCCs from speech samples in the Berlin Emotional Speech Database, which contains utterances expressing seven emotions. These features are classified using SVM, C5.0, and SVM-C5.0, and the results show that SVM-C5.0 performs best, achieving recognition rates 5.5% to 8.9% higher than SVM or C5.0 alone, depending on the number of emotions.
This document summarizes a research paper that proposes a method for emotion identification in continuous speech using cepstral analysis and generalized gamma mixture modeling. The key contributions are:
1) It extracts MFCC and LPC features from speech signals to model emotions such as happiness, anger, boredom, and sadness.
2) It uses a generalized gamma distribution (GGD) instead of a GMM for more accurate feature modeling and classification, as the GGD captures speech signal variations better (a brief fitting sketch follows this list).
3) An experiment is conducted on a database of 50 speakers' speech in 5 emotions, achieving over 90% recognition accuracy using the proposed MFCC-LPC features and GGD modeling.
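As a rough sketch of the distribution-modeling step, SciPy's stats.gengamma can fit a single generalized gamma density to a one-dimensional feature; the paper's generalized gamma *mixture* would wrap fits like this in an EM loop, and the synthetic feature here is only a stand-in.

```python
# Rough illustration only: fitting a single generalized gamma density to a
# 1-D feature with SciPy; the paper uses a *mixture* of generalized gammas,
# which would need an EM loop around fits like this one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
feature = rng.gamma(shape=2.0, scale=1.5, size=1000)    # stand-in for an MFCC/LPC feature

a, c, loc, scale = stats.gengamma.fit(feature, floc=0)  # shapes a and c, location, scale
print(f"fitted generalized gamma: a={a:.2f}, c={c:.2f}, scale={scale:.2f}")

# Class-conditional densities fitted this way can score a test feature x:
x = 2.0
print("log-likelihood at x:", stats.gengamma.logpdf(x, a, c, loc=0, scale=scale))
```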
ASERS-LSTM: Arabic Speech Emotion Recognition System Based on LSTM Model (sipij)
The swift progress of human-computer interaction (HCI) research has increased interest in speech emotion recognition (SER) systems, which identify the emotional states of human beings from their voice. There is substantial work on SER for various languages, but few systems have been implemented for Arabic because of the shortage of available Arabic emotional speech databases; English and other European and Asian languages are the most commonly considered. Researchers have used several machine-learning classifiers to distinguish emotional classes: SVMs, random forests (RFs), the kNN algorithm, hidden Markov models (HMMs), MLPs, and deep learning. This paper proposes ASERS-LSTM, a model for Arabic speech emotion recognition based on LSTMs. Five features are extracted from the speech: Mel-frequency cepstral coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid features (tonnetz). The model is evaluated on an Arabic speech dataset, the Basic Arabic Expressive Speech corpus (BAES-DB). A DNN is also constructed to classify the emotions, and the accuracies of the two models are compared: 93.34% for the DNN and 96.81% for the LSTM.
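A minimal sketch of extracting the five named feature types with librosa is shown below; mean-pooling each feature over time into a fixed-length vector is a common convention and an assumption here, not necessarily the paper's exact pooling.

```python
# Sketch of the five utterance-level features named above, using librosa;
# mean-pooling over time is an assumption, not necessarily the paper's pooling.
import numpy as np
import librosa

def extract_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])
```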
ASERS-CNN: ARABIC SPEECH EMOTION RECOGNITION SYSTEM BASED ON CNN MODEL (sipij)
This document presents research on developing an Arabic speech emotion recognition system using a convolutional neural network (CNN) model. The researchers propose a model called ASERS-CNN and evaluate it on an Arabic speech dataset containing recordings of 4 emotions. Their results show the ASERS-CNN achieves 98.18% accuracy, outperforming their previous ASERS-LSTM model which achieved 97.44% accuracy. They also find that using 5 acoustic feature types and 50 training epochs leads to the best ASERS-CNN performance of 98.52% accuracy.
Novel Methodologies for Classifying Gender and Emotions Using Machine Learnin... (BRIGHT WORLD INNOVATIONS)
This document proposes a system for classifying gender and emotions from human voice using machine learning algorithms. It involves preprocessing voice data, removing noise using hidden Markov models, extracting features using discrete wavelet transforms, and classifying gender and emotion using k-nearest neighbors. The system combines gender and emotion classification into a single system, achieving higher accuracy than systems that only classify one or the other. Evaluation shows the proposed system achieves 97% accuracy, outperforming existing systems with accuracies of 75-86%.
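A hedged sketch of the wavelet-plus-kNN idea follows, using PyWavelets; the db4 wavelet, decomposition level, band-energy features, and k value are illustrative choices rather than the paper's reported configuration.

```python
# Sketch of DWT-based features plus a kNN classifier, assuming PyWavelets;
# wavelet family, level, and k are illustrative, not the paper's settings.
import numpy as np
import pywt
from sklearn.neighbors import KNeighborsClassifier

def dwt_energy_features(signal: np.ndarray, wavelet: str = "db4", level: int = 4) -> np.ndarray:
    """Log-energy of each wavelet sub-band as a compact feature vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

# Hypothetical usage: X_train holds one preprocessed voice signal per row,
# y_train the gender/emotion labels.
# knn = KNeighborsClassifier(n_neighbors=5).fit(
#     np.array([dwt_energy_features(x) for x in X_train]), y_train)
```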
Upgrading the Performance of Speech Emotion Recognition at the Segmental Level (IOSR Journals)
This document presents research on improving the accuracy of automatic speech emotion recognition using minimal inputs and features. The researchers used only the vowel formants from English speech recordings of 10 female speakers producing neutral and 6 basic emotions. They analyzed the vowel formants using statistical analysis and 3 classifiers to identify the best performing formants. An artificial neural network using the selected formant values achieved 95.6% accuracy in classifying emotions, higher than previous studies. The approach requires fewer features and less complex processing while achieving good recognition rates.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Estimating the quality of digitally transmitted speech over satellite communi... (Alexander Decker)
This document discusses methods for estimating the quality of digitally transmitted speech over satellite communication channels. It proposes a methodology that divides the speech signal into segments and calculates the segmental signal-to-noise ratio (SNR) for each segment, which allows assessing changes in speech quality over time through statistical modeling. It also considers the spectral characteristics of the speech signal, providing a better measure than signal power alone. The document reviews other existing objective and subjective methods for quality estimation and argues that the proposed segmental SNR approach has advantages over them.
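A minimal NumPy sketch of the segmental SNR computation is given below; the 256-sample frames and the conventional clamping of per-frame values to [-10, 35] dB are common practice and assumptions here, not details taken from the document.

```python
# Minimal segmental SNR sketch in NumPy, assuming aligned clean and received
# signals; frame length and the [-10, 35] dB clamping are conventions.
import numpy as np

def segmental_snr(clean: np.ndarray, received: np.ndarray, frame_len: int = 256) -> float:
    n_frames = min(len(clean), len(received)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - received[i * frame_len:(i + 1) * frame_len]
        snr = 10 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, -10.0, 35.0))  # limit pathological frames
    return float(np.mean(snrs))                 # average over time
```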
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers under variation of context, speaker variability, and environmental conditions. This paper presents a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments: the input speech signal is decomposed into different frequency channels using the curvelet transform, which reduces computational complexity and feature vector size while offering better accuracy and a varying window size suited to non-stationary signals. For word classification and recognition, a discrete hidden Markov model is used, as it captures the time distribution of speech signals. The HMM classifier attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of the study is to characterize the feature extraction and classification phases of a speech recognition system, and the various approaches available for developing such systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using discrete curvelet transforms over conventional methods.
Signal Processing Tool for Emotion Recognition (idescitation)
In realizing modern-day robots that not only perform tasks but also behave like human beings when interacting with the natural environment, it is essential to impart to the robots knowledge of the underlying emotions in human spoken utterances, enabling them to be consistent and complete. To this end, they must understand and identify human emotions, so much current work studies the emotional content of speech, and speech emotion recognition engines have accordingly been proposed. This paper surveys the main aspects of speech emotion recognition: extraction and types of commonly used features, selection of the most informative features from the original feature set, and classification of the features by different techniques, with reference to the databases commonly used for speech emotion recognition.
Parameters Optimization for Improving ASR Performance in Adverse Real World N... (Waqas Tariq)
Existing research shows that many techniques and methodologies are available for every step of an Automatic Speech Recognition (ASR) system, but performance (minimizing the Word Error Rate, WER, and maximizing the Word Accuracy Rate, WAR) does not depend on the applied technique alone: it depends mainly on the category of the noise, the level of the noise, and the window size, frame size, frame overlap, and similar parameters considered in existing methods. The main aim of the work presented in this paper is to vary parameters such as window size, frame size, and frame-overlap percentage, observe the performance of the algorithms for various noise categories and levels, and train the system across all parameter sizes and categories of real-world noisy environments to improve the performance of the speech recognition system. The paper presents the results of signal-to-noise ratio (SNR) and accuracy tests under these varying parameters. Because evaluating the test results and deciding parameter sizes for optimal ASR performance proves very hard to do by hand, the study further suggests feasible and optimal parameter sizes using a Fuzzy Inference System (FIS) to enhance accuracy in adverse real-world noisy environmental conditions. This work will be helpful for discriminative training of ubiquitous ASR systems for better Human-Computer Interaction (HCI). Keywords: ASR performance, ASR parameter optimization, multi-environmental training, Fuzzy Inference System for ASR, ubiquitous ASR systems, Human-Computer Interaction (HCI)
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Emotion Detection from Voice Based Classified Frame-Energy Signal Using K-Mea... (ijseajournal)
Emotion detection is a new research area in health informatics and forensic technology. Despite some challenges, voice-based emotion recognition is growing popular, since in situations where no facial image is available, the voice is the only way to detect a person's emotional or psychiatric condition. However, the voice signal is highly dynamic even within a short time frame, so the voice of the same person can differ within a very subtle period of time. This research therefore rests on two key criteria: first, the training data must be partitioned according to the emotional stage of each individual speaker; second, rather than using the entire voice signal, a few significant short-time frames suffice to identify the emotional condition of the speaker. Cepstral coefficients (CC) are used as the voice feature, and a fixed-value k-means clustering method is used for feature classification, where the value of k follows the number of emotional states under evaluation rather than the volume of the experimental dataset. In the experiment, three emotional conditions (happy, angry, and sad) are detected from eight female and seven male voice signals. The methodology increases emotion detection accuracy significantly compared with some recent works and also reduces the CPU time for cluster formation and matching.
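The fixed-k clustering idea might look like the following scikit-learn sketch, where k equals the number of emotional states and a test frame is assigned to its nearest centroid; the feature dimensions and data are placeholders, not the paper's.

```python
# Sketch of the fixed-k clustering idea: cluster a speaker's cepstral frames
# with k equal to the number of emotional states, then label a test frame by
# its nearest centroid. Feature extraction details are assumptions.
import numpy as np
from sklearn.cluster import KMeans

K = 3  # happy, angry, sad: k follows the number of emotion states, not data size

# frames: (n_frames, n_cepstral_coeffs) matrix of significant training frames
frames = np.random.default_rng(0).normal(size=(500, 13))
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(frames)

test_frame = frames[0:1]
print("closest emotion cluster:", km.predict(test_frame)[0])
```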
Graphical visualization of musical emotions (Pranay Prasoon)
The document discusses graphical visualization of musical emotions using artificial neural networks. 13 audio features are extracted from Hindustani classical music clips labeled as happy or sad. An ANN model with backpropagation algorithm is trained on 70% of data, validated on 15% and tested on 15%. The model correctly classified 15 of 17 happy clips and 21 of 22 sad clips. Testing was repeated 10 times with over 90% accuracy each time, showing the model effectively recognizes musical emotions. Future work involves expanding the model to recognize additional emotions and incorporating physiological features.
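A sketch of the 70/15/15 protocol with a backpropagation-trained MLP is shown below using scikit-learn; the input dimension of 13 matches the features described, but the layer size, dataset, and labels are illustrative stand-ins.

```python
# Sketch of the 70/15/15 protocol with an MLP trained by backpropagation;
# the 13-feature input is from the summary, everything else is a stand-in.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(39, 13))            # 13 audio features per music clip
y = rng.integers(0, 2, size=39)          # 0 = sad, 1 = happy

X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_val, y_val))
print("test accuracy:", clf.score(X_te, y_te))
```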
This document describes a system to help deaf and mute people communicate through sign language and voice recognition. The system uses algorithms like support vector machines and hidden Markov models to recognize hand gestures and speech. It can translate sign language into text and voice into sign language representations. The system aims to reduce communication barriers for deaf/mute communities by converting between sign language, text, and voice. It outlines the implementation process which includes steps like skin color detection, hand location detection, finger region detection, and pattern matching to recognize gestures from video input.
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB (sipij)
1) The document describes the development of a speaker-dependent speech recognition system using MATLAB. It uses Gaussian mixture models for acoustic modeling and mel-frequency cepstral coefficients for feature extraction.
2) The system is designed to recognize isolated digits 0-9. Voice activity detection is performed to detect segments of speech. Various windowing functions are evaluated to reduce spectral leakage during feature extraction.
3) 13 MFCCs plus energy are extracted from each 30 ms frame with a 10 ms shift. The first and second derivatives are also calculated to capture dynamic information, resulting in a 39-dimensional feature vector (a Python analogue is sketched below).
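Below is a hedged Python analogue of this 39-dimensional front end using librosa (the paper itself works in MATLAB); note that many recipes replace the zeroth cepstral coefficient with log energy, a detail glossed over here, and the filename is hypothetical.

```python
# Python analogue of the 39-dimensional front end described above: 13 MFCCs
# from 30 ms frames with a 10 ms shift, plus first and second derivatives.
import numpy as np
import librosa

y, sr = librosa.load("digit.wav", sr=16000)            # hypothetical recording
n_fft, hop = int(0.030 * sr), int(0.010 * sr)          # 30 ms frame, 10 ms shift
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
d1 = librosa.feature.delta(mfcc)                       # first derivative
d2 = librosa.feature.delta(mfcc, order=2)              # second derivative
features = np.vstack([mfcc, d1, d2])                   # shape: (39, n_frames)
```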
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK (ijitcs)
Speech technology is an emerging field, and automatic speech recognition has advanced in recent years. Much research has been performed for many foreign and regional languages, and multilingual speech processing is now attracting research attention. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network (kevig)
In recent years, social media use has grown among people in Myanmar, and writing reviews on social media pages about products, movies, and trips has become popular. Most people look for review pages about a product before deciding whether to buy it, so extracting and receiving useful reviews about products of interest is important and time-consuming. Sentiment analysis is one of the key processes for extracting useful reviews of products. In this paper, a Convolutional LSTM neural network architecture is proposed for sentiment classification of cosmetic reviews written in the Myanmar language. The paper also intends to build a cosmetic-review dataset for deep learning and a sentiment lexicon in the Myanmar language.
Emotion Recognition Based On Audio Speech (IOSR Journals)
This document summarizes a research paper on emotion recognition based on audio speech. It discusses how acoustic features are extracted from speech signals by applying preprocessing techniques like preemphasis and framing. It describes extracting features like Mel frequency cepstral coefficients (MFCCs) that capture characteristics of the vocal tract. Support vector machines (SVMs) are used as pattern classification methods to build models for each emotion and compare test speech features to recognize emotions. The paper confirms the advantage of its audio-based emotion recognition approach through experimental results and discusses potential improvements and future work on increasing efficiency and recognizing emotion intensity.
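The preprocessing steps named above (preemphasis and framing) can be sketched as follows; the 0.97 coefficient and Hamming-windowed frames are widespread conventions, assumed rather than taken from the paper.

```python
# Sketch of the preprocessing steps named above; the 0.97 preemphasis
# coefficient and Hamming windowing are common conventions.
import numpy as np

def preemphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """y[n] = x[n] - alpha * x[n-1], boosting high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a signal into overlapping windowed frames, one per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```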
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in Science and Technology, and to encourage and motivate researchers in challenging areas of Sciences and Technology.
Speech retrieval involves extracting information from audio recordings containing spoken text to facilitate content-based retrieval. There are several areas of speech retrieval including stress and emotion classification, speaker diarization, multilingual audio analysis, and audio fingerprinting. Speaker diarization divides speech audio into segments according to distinct speakers, answering the question "who spoke when?". Audio fingerprinting creates a condensed digital summary of audio that can identify samples or locate similar items in a database. The presentation provided an overview of current speech retrieval techniques and their applications.
Advantages and disadvantages of hidden markov model (joshiblog)
An HMM is a statistical model that generates observable sequences through a series of unobserved states. It can be visualized as a finite state machine that emits symbols as it progresses through states. HMMs have a strong statistical foundation and efficient learning algorithms, allowing them to handle inputs of variable length. They have been applied to problems like multiple sequence alignment, data mining, and modeling protein domains. However, HMMs also have limitations like many unstructured parameters, an inability to capture long-term dependencies, and not fully representing protein 3D structures.
The document provides information about articulation and speech sounds. It discusses the importance of clear articulation and natural speech. It also covers the classification of consonant and vowel sounds, including their place and method of articulation. The document encourages exercises to improve articulation and recommends studying phonetic symbols to distinguish speech sounds.
This document provides an introduction to hidden Markov models (HMMs). It discusses that HMMs can be used to model sequential processes where the underlying states cannot be directly observed, only the outputs of the states. The document outlines the basic components of an HMM including states, transition probabilities, emission probabilities, and gives examples of HMMs for coin tossing and weather. It also briefly discusses the history of HMMs and their applications in fields like bioinformatics for problems such as gene finding.
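To make the model-evaluation problem concrete, here is a tiny forward-algorithm sketch for a weather-style HMM like the one described; the two states, transition matrix, and emission table are invented illustrative numbers.

```python
# Tiny forward-algorithm sketch for a two-state weather HMM; all
# probabilities are illustrative, not from the document.
import numpy as np

pi = np.array([0.6, 0.4])                  # initial: P(sunny), P(rainy)
A = np.array([[0.7, 0.3],                  # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.8, 0.2],                  # emission: P(walk|state), P(umbrella|state)
              [0.3, 0.7]])

obs = [0, 1, 1]                            # observed: walk, umbrella, umbrella
alpha = pi * B[:, obs[0]]                  # forward variable at t = 0
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]          # induction step
print("P(observations | model):", alpha.sum())   # the model-evaluation problem
```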
The document proposes developing Android applications that sense emotions using smartphones, for better health and human-machine interaction. It discusses detecting emotions through passive sensors such as cameras, microphones, and accelerometers, which can capture facial expressions, speech, and heart rate without requiring active input. Recognition involves extracting meaningful patterns from sensor data, using techniques such as speech recognition and facial-expression detection to produce labels or feed inference algorithms. Specific techniques are discussed for recognizing emotions from speech, from facial expressions based on the Facial Action Coding System, and from heart rate variability. The conclusion states that understanding emotions with smartphones can help people succeed and makes research easier.
This document discusses applications of emotion recognition technology across several domains:
1) Medicine - To help monitor patients and enhance healthcare like rehabilitation, companion robots, and counseling. It could also help with autism therapy and music therapy.
2) E-learning - To adjust an online tutor's presentation based on a learner's emotions, making the tutoring experience more interactive and effective.
3) Monitoring - Such as detecting emotions in car drivers to alert other cars, warning employees before angry emails, and prioritizing angry calls at call centers.
4) Entertainment - Like music players that satisfy users based on recognized moods.
5) Marketing - To improve advertising by understanding emotional impacts and engagement.
- Hidden Markov models (HMMs) are statistical models where the system is assumed to be a Markov process with hidden states. Each state has a number of possible transitions to other states, each with an assigned probability.
- There are three main issues in HMMs: model evaluation, decoding the most probable path, and model training.
- HMMs have applications in areas like speech recognition, gesture recognition, language modeling, and video analysis.
Space Vector Modulation (SVM) Technique for PWM Inverter (Purushotam Kumar)
This document discusses space vector pulse width modulation (SVM) for three-phase voltage source inverters. It begins by introducing SVM and its benefits over other PWM techniques, such as reduced total harmonic distortion. It then provides details on how SVM works, including transforming a three-phase reference signal to a rotating vector in the d-q reference frame. The document explains the eight possible switching states, sectors, and how to calculate switching times to synthesize the reference signal using adjacent active vectors and zero vectors. It concludes by comparing SVM to sinusoidal PWM, showing SVM offers better voltage utilization and harmonic performance.
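The sector and dwell-time calculation can be sketched as below, using the standard two-level SVM relations T1 = √3·Ts·(Vref/Vdc)·sin(π/3 − θ), T2 = √3·Ts·(Vref/Vdc)·sin(θ), T0 = Ts − T1 − T2, with θ the angle inside the sector; the numeric values are illustrative, not from the document.

```python
# Sketch of sector selection and dwell-time calculation for two-level SVM,
# using the standard formulas; input values are illustrative only.
import math

def svm_dwell_times(v_ref: float, angle: float, v_dc: float, t_s: float):
    """Dwell times for the two adjacent active vectors and the zero vectors."""
    sector = int(angle // (math.pi / 3)) % 6          # sectors 0..5
    theta = angle - sector * (math.pi / 3)            # angle inside the sector
    k = math.sqrt(3) * t_s * v_ref / v_dc
    t1 = k * math.sin(math.pi / 3 - theta)            # first adjacent vector
    t2 = k * math.sin(theta)                          # second adjacent vector
    t0 = t_s - t1 - t2                                # shared by V0 and V7
    return sector, t1, t2, t0

print(svm_dwell_times(v_ref=300.0, angle=math.radians(45), v_dc=600.0, t_s=1e-4))
```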
This document summarizes a research paper on understanding and estimating emotional expression using acoustic analysis of natural speech. The paper explores identifying seven emotional states (anger, surprise, sadness, happiness, fear, disgust, and neutral) using fifteen acoustic features extracted from the SAVEE speech database. Three models using different combinations of features were evaluated using various machine learning algorithms. The results showed that Model 2, using energy intensity, pitch, standard deviation, jitter, and shimmer, achieved the highest classification accuracy. Estimation of emotions using confidence intervals showed that most emotions could be accurately estimated using energy intensity and pitch. The paper concludes that expanding the study to include more features and databases could improve emotional state recognition.
This paper introduces new features based on histograms of MFCC extracted from audio files to improve emotion recognition from speech. Experimental results on the Berlin and PAU databases using SVM and Random Forest classifiers show the proposed features achieve better classification results than current methods. Detailed analysis is provided on speech type (acted vs natural) and gender.
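The histogram-of-MFCC feature could be sketched as follows; the bin count and per-coefficient normalization are assumptions, since the paper's exact binning is not given here.

```python
# Sketch of the histogram-of-MFCC idea: per coefficient, a fixed-bin
# histogram over all frames of the utterance; bin count is an assumption.
import numpy as np

def mfcc_histogram_features(mfcc: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """mfcc: (n_coeffs, n_frames) matrix -> concatenated normalized histograms."""
    feats = []
    for coeff in mfcc:                              # one histogram per coefficient
        hist, _ = np.histogram(coeff, bins=n_bins, density=True)
        feats.append(hist)
    return np.concatenate(feats)                    # length: n_coeffs * n_bins
```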
Signal & Image Processing : An International Journal (sipij)
Signal & Image Processing : An International Journal is an Open Access peer-reviewed journal intended for researchers from academia and industry who are active in the multidisciplinary field of signal and image processing. The scope of the journal covers all theoretical and practical aspects of digital signal processing and image processing, from basic research to the development of applications.
Authors are solicited to contribute to the journal by submitting articles that present research results, projects, survey works, and industrial experiences describing significant advances in the areas of signal and image processing.
Emotion recognition based on the energy distribution of plosive syllables (IJECEIAES)
We usually encounter two problems during speech emotion recognition (SER): expression and perception problems, which vary considerably between speakers, languages, and sentence pronunciation. Finding an optimal system that characterizes emotions despite all these differences is therefore a promising prospect. In this perspective, we considered two emotional databases: the Moroccan Arabic dialect emotional database (MADED) and the Ryerson audio-visual database of emotional speech and song (RAVDESS), which present notable differences in type (natural vs. acted) and language (Arabic vs. English). We propose a detection process based on 27 acoustic features extracted from the consonant-vowel (CV) syllabic units /ba/, /du/, /ki/, and /ta/, which are common to both databases. We tested two classification strategies: multiclass (all emotions combined: joy, sadness, neutral, anger) and binary (neutral vs. others; positive emotions (joy) vs. negative emotions (sadness, anger); sadness vs. anger). These strategies were tested three times: i) on MADED, ii) on RAVDESS, iii) on MADED and RAVDESS combined. The proposed method gave better recognition accuracy in the case of binary classification: the rates reach an average of 78% for the multiclass classification, 100% for neutral vs. others, 100% for the negative emotions (i.e. anger vs. sadness), and 96% for positive vs. negative emotions.
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg... (ijtsrd)
Suctioning is a common procedure performed by nurses to maintain gas exchange, adequate oxygenation, and alveolar ventilation in critically ill patients under mechanical ventilation, and the aim of this research is to provide knowledge about maintaining airway patency through suctioning care, which will help improve the quality of nursing care and ultimately lead to better results. The planned study is a pre-experimental study to assess the effectiveness of a planned teaching programme on knowledge regarding airway patency in patients on mechanical ventilators among B.Sc. internship students of a selected college of nursing at Moradabad. Its objectives are to assess the level of knowledge regarding maintaining airway patency in patients on mechanical ventilators among B.Sc. nursing internship students, to assess the effectiveness of the planned teaching programme in terms of that knowledge, and to examine the association between knowledge and selected demographic variables of the students. The study was conducted among 86 participants selected by a non-probability convenient sampling method; a demographic proforma and a self-structured questionnaire were used to collect the data. Nafees Ahmed | Sana Usmani "A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledge Regarding Maintaining Airway Patency in Patients with Mechanical Ventilator" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-1, December 2021, URL: https://www.ijtsrd.com/papers/ijtsrd47917.pdf Paper URL: https://www.ijtsrd.com/medicine/nursing/47917/a-study-to-assess-the-effectiveness-of-planned-teaching-programme-on-knowledge-regarding-maintaining-airway-patency-in-patients-with-mechanical-ventilator/nafees-ahmed
Speech Emotion Recognition Using Neural Networks (ijtsrd)
Speech is the most natural and easy method for people to communicate, and interpreting speech is one of the most sophisticated tasks the human brain performs. The goal of Speech Emotion Recognition (SER) is to identify human emotion from speech, since the tone and pitch of the voice frequently reflect underlying emotions. Librosa was used to analyse audio and music, soundfile was used to read and write sampled sound file formats, and sklearn was used to create the model. The study examined the effectiveness of Convolutional Neural Networks (CNN) in recognising spoken emotions. The network's input features are spectrograms of voice samples, and Mel Frequency Cepstral Coefficients (MFCC) are used to extract characteristics from the audio. The authors' own voice dataset is used to train and test the algorithms, and the emotions of the speech (happy, sad, angry, neutral, shocked, disgusted) are determined based on the evaluation. Anirban Chakraborty "Speech Emotion Recognition Using Neural Networks" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-1, December 2021, URL: https://www.ijtsrd.com/papers/ijtsrd47958.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/47958/speech-emotion-recognition-using-neural-networks/anirban-chakraborty
Emotion Recognition based on Speech and EEG Using Machine Learning Techniques... (HanzalaSiddiqui8)
This research investigates emotion recognition using both speech and EEG data through machine learning techniques, achieving 97% accuracy. An artificial neural network model is designed to effectively extract features and learn representations from the integrated speech and EEG data, demonstrating the potential of this multimodal approach. The high accuracy attained underscores applications in human-computer interaction, healthcare, and more.
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S...mathsjournal
Speech emotion recognition enables a computer system to record sounds and recognize the emotion of the speaker. We are still far from natural interaction between human and machine because machines cannot distinguish the emotion of the speaker. For this reason a new field of investigation has been established, namely speech emotion recognition systems. The accuracy of these systems depends on various factors, such as the type and number of emotion states and the classifier type. In this paper, the classification methods C5.0, Support Vector Machine (SVM), and the combination of C5.0 and SVM (SVM-C5.0) are verified, and their efficiencies in speech emotion recognition are compared. The features used in this research include energy, Zero Crossing Rate (ZCR), pitch, and Mel-scale Frequency Cepstral Coefficients (MFCC). The results demonstrate that the proposed SVM-C5.0 classification method is between 5.5% and 8.9% more efficient in recognizing the emotion of the speaker than SVM and C5.0 individually, depending on the number of emotion states.
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S...mathsjournal
This document summarizes a research paper that evaluated different machine learning classifiers for speech emotion recognition using the Berlin Emotional Speech Database. The paper compared the performance of Support Vector Machine (SVM), C5.0, and a combination of SVM and C5.0 (SVM-C5.0) classifiers. Features extracted from the speech data included energy, zero crossing rate, pitch, and Mel-frequency cepstral coefficients. The results showed that the proposed SVM-C5.0 method achieved 5.5% to 8.9% better emotion recognition accuracy than SVM and C5.0 alone, depending on the number of emotion states.
This document discusses an analysis of an emotion recognition system for speech signals using K-nearest neighbors (KNN) and Gaussian mixture model (GMM) classifiers. It provides background on the challenges of automatic emotion recognition from speech and describes common features extracted from speech, such as mel frequency cepstrum coefficients and prosodic features. The document outlines the stages of an emotion recognition system, including feature extraction, training classifiers on a speech database, and classifying emotions. It then gives more detail on the KNN and GMM classifiers and how they were used to classify six emotional states from the Berlin emotional speech database.
Literature Review On: ”Speech Emotion Recognition Using Deep Neural Network”IRJET Journal
The document discusses speech emotion recognition using deep neural networks. It first provides an overview of SER and the challenges in the field. It then reviews 20 research papers on the topic, finding that most use deep neural network techniques like CNNs and DNNs for model building. The papers evaluated various datasets and algorithms, with accuracy ranging from 84% to 90%. Overall limitations identified included the need for more data, handling of multiple simultaneous emotions, and improving cross-corpus performance. The literature review contributes to knowledge in using machine learning for SER.
Hidden Markov Model Approach Towards Emotion Detection From Speech S...csandit
Emotions carry the token indicating a human's mental state. Understanding the emotion exhibited becomes difficult for people suffering from autism and alexithymia. Assessment of emotions can also be beneficial in interactions involving a human and a machine. A system is developed to recognize the universally accepted emotions such as happy, anger, sad, disgust, fear and surprise. The gender of the speaker helps to obtain better clarity for identifying the emotion; a Hidden Markov Model serves the purpose of gender identification.
BASIC ANALYSIS ON PROSODIC FEATURES IN EMOTIONAL SPEECHIJCSEA Journal
Speech is a rich source of information which conveys not only what a speaker says, but also the speaker's attitude toward the listener and toward the topic under discussion, as well as the speaker's own current state of mind. Increasing attention has recently been directed to the study of the emotional content of speech signals, and hence many systems have been proposed to identify the emotional content of a spoken utterance. The focus of this research work is to enhance the man-machine interface by focusing on the user's speech emotion. This paper gives the results of a basic analysis of prosodic features, and compares the prosodic features of various types and degrees of emotional expression in Tamil speech based on the auditory impressions of speakers and listeners of both genders. The speech samples consist of "neutral" speech as well as speech with three types of emotion ("anger", "joy" and "sadness") of three degrees ("light", "medium" and "strong"). A listening test was conducted using 300 speech samples uttered by students aged 19-22 years. The prosodic parameters of the emotional speech, classified according to the auditory impressions of the subjects, are analyzed. The results suggest that the prosodic features that identify emotions and their degrees depend not only on the speaker's gender but also on the listener's gender.
ASERS-CNN: Arabic Speech Emotion Recognition System based on CNN Modelsipij
When two people are on the phone, although they cannot observe each other's facial expressions and physiological state, it is possible to roughly estimate the speaker's emotional state from the voice. In medical care, if the emotional state of a patient, especially a patient with an expression disorder, can be known, different care measures can be taken according to the patient's mood to improve the quality of care. A system capable of recognizing the emotional state of a human being from speech is known as a speech emotion recognition (SER) system. Deep learning is one of the techniques most widely used in emotion recognition studies. In this paper we implement a CNN model for Arabic speech emotion recognition, proposing the ASERS-CNN model. We evaluate the model on an Arabic speech dataset named the Basic Arabic Expressive Speech corpus (BAES-DB). In addition, we compare the accuracy of our previous ASERS-LSTM model and the new ASERS-CNN model proposed in this paper, finding that the new model outperforms ASERS-LSTM with 98.18% accuracy.
Signal & Image Processing : An International Journalsipij
This document presents research on developing an Arabic speech emotion recognition system using a convolutional neural network (CNN) model. The researchers propose a model called ASERS-CNN and evaluate it on an Arabic speech dataset containing recordings of 4 emotions. Their best performing model achieves 98.52% accuracy using 5 acoustic features extracted from the speech and preprocessed with normalization and silence removal. They compare this result to their previous ASERS-LSTM model and other state-of-the-art methods, finding that ASERS-CNN outperforms ASERS-LSTM and existing approaches.
A Review Paper on Speech Based Emotion Detection Using Deep LearningIRJET Journal
This document reviews techniques for speech-based emotion detection using deep learning. It discusses how deep learning techniques have been proposed as an alternative to traditional methods for speech emotion recognition. Feature extraction is an important part of speech emotion recognition, and deep learning can help minimize the complexity of extracted features. The document surveys related work on speech emotion recognition using techniques like deep neural networks, convolutional neural networks, recurrent neural networks, and more. It examines the limitations of current approaches and the potential for deep learning to improve speech-based emotion detection.
SPEECH EMOTION RECOGNITION SYSTEM USING RNNIRJET Journal
This document discusses a speech emotion recognition system using recurrent neural networks (RNNs). It begins with an abstract describing speech emotion recognition and its importance. Then it provides background on speech emotion databases, feature extraction using MFCC, and classification approaches like RNNs. It reviews related work on speech emotion recognition using various methods. Finally, it concludes that MFCC feature extraction and RNN classification was used in the proposed system to take advantage of their performance in machine learning applications. The system aims to help machines understand human interaction and respond based on the user's emotion.
Fig. 1 Speech samples of different emotions when the driver is in different states: (a) happy, (b) neutral, (c) anger, (d) sad
…regularized discriminant analysis (RDA) by Ye et al. (2006) and support vector machine (SVM) by Koolagudi et al. (2011). Among these, the classifiers considered in this paper are LDA, RDA, SVM and kNN.
Most traditional systems in speech emotion recognition have focused on either prosody or spectral features (Ververidis and Kotropoulos 2006; Tato et al. 2002), but the recognition accuracy is not good enough using either of them alone (Zhou et al. 2009). The innovative step in this direction is to fuse the spectral and prosody features to improve the emotion recognition accuracy (Zhou et al. 2009). This paper focuses on the implementation and analysis of results with both feature fusion and individual features, using the four classifiers specified earlier on both the Berlin and Spanish databases.
This paper is organized as follows: Section 2 describes the databases used in the experiments. Section 3 describes the methodology and the extracted features, followed by a description of the different classifier techniques used in this paper. Section 4 presents the experimental results and an extensive comparative study of the classifiers on the different databases, and Sect. 5 concludes the paper.
2 Speech databases
In this paper the experiments are conducted on the Berlin and Spanish emotional speech databases. These databases were chosen because both are multi-speaker databases, so it is possible to perform speaker-independent tests. The databases contain the following emotions: anger, boredom, disgust, fear, happiness, sadness and neutral. The Berlin database contains 535 speech samples, simulated by ten professional native German actors, five male and five female (Burkhardt et al. 2005). The number of speech files per emotion is shown in Table 2.

Table 2 Number of files in the Berlin and Spanish databases for each emotion

Emotion    Berlin   Spanish
Anger      127      184
Boredom    81       184
Happy      71       184
Fear       69       184
Sad        62       184
Disgust    46       184
Neutral    79       184
Total      535      1288

The present work is demonstrated on these databases.

The Spanish database contains 184 sentences for each emotion, including numbers, words, sentences, etc., as shown in Table 2. The corpus comprises recordings from two professional actors, one male and one female. Among the 184 files, 1–100 are affirmative sentences, 101–134 are interrogative sentences, 135–150 are paragraphs, 151–160 are digits, and 161–184 are isolated words.
3 Methodology
The block diagram of the overall speech emotion recognition process is shown in Fig. 2. It shows how an input speech sample is processed to obtain the correct emotion of the speaker: a test speech sample is classified using one of the classifiers together with the information provided by the training speech samples.

Fig. 2 Various stages in extracting emotions from a speech sample: pre-processing, feature extraction, classification and emotion recognition
Table 1 The two most efficient feature types of a speech signal are prosody and spectral features; their differences are summarized below.

Prosody features:
- Occur when sounds are put together in connected speech
- Influenced by vocal fold activity
- Discriminate high arousal emotions (happy, anger) from low arousal emotions (sadness, boredom) (Luengo et al. 2010; Scherer 2003)
- Examples: energy, pitch, speaking rate, etc.

Spectral features:
- Extracted from the spectral content of the speech signal
- Influenced by vocal tract activity
- Discriminate emotions within the same arousal state, high or low (Luengo et al. 2010; Scherer 2003)
- Examples: MFCC, LPCC, LFPC, formants, etc.

3.1 Pre-processing

Speech signals are normally pre-processed before features are extracted, to improve the accuracy and efficiency of the feature extraction process. Filtering, framing and windowing are the steps performed during pre-processing.

Filtering is applied to reduce the noise introduced by the environmental conditions under which the speech sample was recorded; a high pass filter is used for this purpose.

Instead of analyzing the entire signal at once, the signal is split into frames of 256 samples each, and a feature vector is created for each frame. Dividing the continuous speech signal into frames introduces discontinuities in the speech sample; to maintain continuity, an overlap of 100 samples is applied between consecutive frames and each frame is weighted with a Hamming window. The procedure for framing and windowing is shown in Fig. 3.

Fig. 3 The process of splitting the speech signal into several frames, applying a Hamming window to each frame, and forming the corresponding feature vector
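As a concrete illustration of this pre-processing stage, the following Python sketch (our own, not the authors' code; the 100 Hz high-pass cutoff is an assumption, since the paper does not specify one) filters the signal and splits it into overlapping Hamming-windowed frames of 256 samples with a 100-sample overlap:

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(signal, sr, frame_len=256, overlap=100, cutoff_hz=100):
    """High-pass filter the signal, then split it into overlapping
    Hamming-windowed frames, as described in Sect. 3.1.
    The 100 Hz cutoff is an assumption; the paper does not give one."""
    # Filtering: suppress low-frequency environmental recording noise
    b, a = butter(4, cutoff_hz / (sr / 2), btype="highpass")
    filtered = lfilter(b, a, signal)

    # Framing and windowing: 256-sample frames with 100-sample overlap
    hop = frame_len - overlap              # 156-sample hop
    window = np.hamming(frame_len)
    n_frames = 1 + (len(filtered) - frame_len) // hop
    frames = np.stack([filtered[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames                          # shape: (n_frames, frame_len)
```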
3.2 Feature extraction

An important module in the design of a speech emotion recognition system is the selection of the features which best classify the emotions. In this paper some of the key prosody and spectral features are used; these features distinguish the emotions of the different classes of speech samples. They are summarized using six simple statistics, as shown in Table 3.

Table 3 The statistics and their corresponding symbols used in this paper

Statistic   Symbol
Mean        E
Variance    V
Minimum     Min
Range       R
Skewness    Sk
Kurtosis    K
The prosodic features extracted from the speech signal are energy and pitch. Their first and second order derivatives provide additional useful information, so the information provided by the derivatives is also considered (Luengo et al. 2005). Energy and pitch are estimated for each frame together with their first and second derivatives, giving six values per frame; applying the statistics of Table 3 to each of these gives a total of 36 prosodic features.
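A minimal sketch of how these 36 prosodic statistics could be computed; librosa's pyin pitch tracker and the 50-500 Hz search range are our assumptions, since the paper does not name its pitch estimator:

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

STATS = (np.mean, np.var, np.min, np.ptp, skew, kurtosis)  # Table 3

def prosodic_features(y, sr, hop=156):
    """Sketch of the 36 prosodic features: per-frame energy and pitch
    contours, their first and second derivatives (6 contours in total),
    each summarized with the 6 statistics of Table 3."""
    energy = librosa.feature.rms(y=y, hop_length=hop)[0]
    f0, _, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)                 # unvoiced frames -> 0
    n = min(len(energy), len(f0))

    feats = []
    for contour in (energy[:n], f0[:n]):
        d1 = np.gradient(contour)          # first derivative
        d2 = np.gradient(d1)               # second derivative
        for track in (contour, d1, d2):
            feats.extend(stat(track) for stat in STATS)
    return np.array(feats)                 # 2 x 3 x 6 = 36 features
```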
Mel frequency cepstral coefficients (MFCC) are the most widely used spectral representation of speech, and they are very efficient at capturing the emotional state of a speech sample (Luengo et al. 2010). MFCC features represent the short term power spectrum of a speech signal, based on a linear cosine transform of a log power spectrum on a non-linear mel scale of frequency. The procedure for implementing MFCC is described in Sato and Obuchi (2007) and Vankayalapati and Anne (2011). Similar to the prosody statistics, the spectral statistics are calculated using the statistics shown in Table 3. Eighteen MFCCs are extracted using 24 filter banks. The eighteen MFCC coefficients and their first and second derivatives are estimated for each frame, giving 54 spectral values per frame; applying the statistics mentioned earlier to these 54 values yields 54 × 6 = 324 different features. The speech signal and its spectral and prosody features are shown in Fig. 4.
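The spectral statistics could be computed along these lines; the sketch below assumes librosa's MFCC and delta implementations, with 18 coefficients and 24 mel filter banks as stated above:

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def spectral_features(y, sr, hop=156):
    """Sketch of the 324 spectral features: 18 MFCCs per frame
    (24 mel filter banks), their deltas and delta-deltas
    (54 coefficient tracks), each summarized with the 6 statistics
    of Table 3: 54 x 6 = 324 features."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=18, n_mels=24,
                                hop_length=hop)
    coeffs = np.vstack([mfcc,
                        librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])
    feats = []
    for row in coeffs:                     # 54 coefficient tracks
        feats += [np.mean(row), np.var(row), np.min(row),
                  np.ptp(row), skew(row), kurtosis(row)]
    return np.array(feats)                 # 54 x 6 = 324 features
```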
Fig. 4 Prosody and spectral features of the speech signal: (a) original speech signal, (b) energy contour, (c) pitch track, (d) mel frequency cepstral coefficients

3.3 Classification

The features extracted by the feature extraction module are given as input to the classifier, whose purpose is to optimally classify the emotional state of a speech sample. The speech samples considered in this paper belong to four emotional classes (happy, neutral, anger and sad) from two databases (Berlin, Spanish). For each database, two-thirds of the samples are used for training and one-third for testing; the classifiers are trained on the training set and the classification accuracy is estimated on the test set. The training and test samples are chosen randomly. Once testing is completed on the different test samples, the results are gathered and used to calculate the emotion recognition accuracy of each classifier. The classifiers used in this paper are described as follows.
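A sketch of the random two-thirds/one-third split described above, assuming scikit-learn and a precomputed feature matrix X with emotion labels y; stratifying by emotion is our assumption, the paper only states that the split is random:

```python
from sklearn.model_selection import train_test_split

# Random 2/3-1/3 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```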
3.3.1 Linear discriminant analysis (LDA)
LDA is a well known statistical method for classification as well as dimensionality reduction of speech samples. It computes an optimal transformation matrix by maximizing the between-class covariance and minimizing the within-class covariance of the speech samples. In other words, it groups the wav files belonging to one class and separates the wav files belonging to different emotional classes, where a class is a collection of speech samples belonging to the same emotion. When dealing with high dimensional, low sample size speech data, LDA suffers from the singularity problem (Ji and Ye 2008), as a consequence of which the results may not be accurate. To improve the results, regularized discriminant analysis is applied to these speech samples.
3.3.2 Regularized discriminant analysis (RDA)

The key idea behind regularized discriminant analysis is to add a constant to the diagonal elements of the within-class scatter matrix $S_w$ of the speech samples, as shown in Eq. 1:

$\tilde{S}_w = S_w + \lambda I \quad (1)$

where $\lambda$ is the regularization parameter, chosen small enough that $\tilde{S}_w$ is positive definite; in this paper the value of $\lambda$ is 0.001. Estimating the regularization parameter in RDA is somewhat difficult, since higher values of $\lambda$ disturb the information in the within-class scatter matrix while lower values do not solve the singularity problem of LDA (Vankayalapati et al. 2010; Ye et al. 2006).

The transformation matrix $W$ is obtained from the eigenvalue problem $\tilde{S}_w^{-1} S_b W = W \Lambda$. Once the transformation matrix $W$ is given, the speech samples are projected onto $W$. After projection, the Euclidean distance between each training speech sample and the test speech sample is calculated, and the minimum distance determines the classification result.
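A minimal NumPy sketch of this procedure, not the authors' implementation, assuming a feature matrix X of shape (n_samples, n_features) and label vector y:

```python
import numpy as np

def fit_rda(X, y, lam=1e-3):
    """Regularize the within-class scatter (Eq. 1), solve
    Sw^-1 Sb W = W Lambda, and keep the leading eigenvectors as W."""
    classes = np.unique(y)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)       # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)     # between-class scatter
    Sw += lam * np.eye(d)                   # Eq. 1: Sw + lambda*I
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[: len(classes) - 1]
    return evecs.real[:, order]             # projection matrix W

def rda_predict(W, X_train, y_train, x_test):
    """Classify by minimum Euclidean distance to a training sample
    in the projected space, as described in the text."""
    P_train = X_train @ W
    p = x_test @ W
    return y_train[np.argmin(np.linalg.norm(P_train - p, axis=1))]
```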
3.3.3 Support vector machine (SVM)
SVM is used for speech emotion classification as it can handle linearly non-separable classes. For multi-class classification of speech samples a one-against-all approach is used. A set of prosody and spectral features is extracted from the speech sample and given as input to train the SVM. Among the many possible hyperplanes, the SVM classifier selects the hyperplane which optimally separates the training speech samples with a maximum margin. To construct an optimal hyperplane, support vectors are calculated for the speech samples in the training phase, and these determine the margin (Milton et al. 2013): the higher the margin, the better the speech signal classification performance, and vice versa.

The decision function of an SVM to classify the speech samples is given by Eq. 2:

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b \right) \quad (2)$

where $k(x_i, x) = x_i \cdot x$ is a linear kernel and $N$ is the number of support vectors. The dot product is applied between each support vector and the test speech sample. The support vectors, alpha and bias values are obtained during the training phase. As a result, four models are obtained in the training phase, one for each emotional class; based on these models and the generated support vectors, the emotion of the test speech sample is classified.
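A sketch of the one-against-all linear-kernel SVM using scikit-learn; feature standardization is our assumption, not stated in the paper:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# One model per emotional class (four in total), linear kernel as in Eq. 2
svm = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(SVC(kernel="linear")),
)
svm.fit(X_train, y_train)             # X_train: fused feature vectors
accuracy = svm.score(X_test, y_test)
```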
3.3.4 K-nearest neighbor (kNN)

kNN is a non-parametric method for classifying speech samples based on the closest training samples in the feature space (Ravikumar and Suresha 2013). As with SVM and RDA, the speech samples are given as input to the kNN classifier. Nearest neighbor classification uses the training samples directly rather than a model derived from those samples. It represents each speech sample in a d-dimensional space, where d is the number of features. The similarity of a test sample to the training samples is measured using the Euclidean distance. Once the list of nearest neighbor speech samples is obtained, the input speech sample is classified based on the majority class of its nearest neighbors.
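A sketch of this classifier with scikit-learn; the Euclidean metric follows the text, while the neighborhood size k = 5 is an assumption, since the paper does not report the value of k:

```python
from sklearn.neighbors import KNeighborsClassifier

# Majority vote among the k nearest training samples
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
predicted_emotions = knn.predict(X_test)
```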
4 Experimental evaluation
In order to validate the results of the different classifiers on the happy, neutral, anger and sad emotional classes, recognition tests were carried out in two phases: baseline results and feature fusion results.
4.1 Baseline results
In the first phase, all the classifiers are trained using the prosody and spectral features of the Berlin and Spanish emotional speech samples; the results are shown in Table 4. Spectral features provide higher recognition accuracies than prosody features. Even though the prosody parameters show very low class separability, they do not perform badly, with accuracies in the range of 40–67% for both databases. Results with the spectral statistics are rather modest, reaching accuracies of 50–70% for all the classifiers except RDA, which reaches 78.75%. To further improve the performance of all the classifiers, the feature fusion technique is used in the next phase.

Table 4 Emotion recognition accuracy (%) of the classifiers (LDA, RDA, SVM and kNN) on the Berlin and Spanish databases using prosody and spectral features

             Berlin                Spanish
Classifier   Prosody   Spectral    Prosody   Spectral
LDA          42.0      51.0        40.0      49.0
RDA          67.0      78.75       44.0      61.0
SVM          55.0      68.75       60.25     64.5
kNN          57.0      60.7        58.25     67.75
4.2 Feature fusion results
Feature fusion is done by combining both the prosody and spectral features. The emotion recognition performance of the classifiers is effectively improved compared with the baseline results, as shown in Table 5. Among the classifiers, RDA and SVM perform considerably better than the rest.

Table 5 Emotion recognition accuracy (%) of the classifiers (LDA, RDA, SVM and kNN) on the Berlin and Spanish databases using feature fusion

Classifier   Berlin   Spanish
LDA          62       60
RDA          80.7     74
SVM          75.5     73
kNN          72       71.5

Fig. 5 Comparison of emotion recognition accuracy of the different classifiers using the Berlin database
The overall recognition performance of these four classifiers on the Berlin and Spanish databases is shown in Fig. 5, where the horizontal axis gives the classifier and the vertical axis the accuracy rate. The efficiency of each classifier improves by approximately 20% compared with the baseline results, which demonstrates that feature fusion is an effective technique for improving the efficiency of the classifiers.
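In code, the fusion itself reduces to concatenating the two feature vectors; a minimal sketch reusing the feature functions sketched earlier:

```python
import numpy as np

def fused_features(y_signal, sr):
    """Feature fusion as described above: concatenate the 36 prosodic
    statistics and the 324 spectral statistics into one
    360-dimensional vector."""
    return np.concatenate([prosodic_features(y_signal, sr),
                           spectral_features(y_signal, sr)])
```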
4.3 Analysis of results with each emotion
The confusion matrices of the RDA classifier for both databases are shown in Table 6. The results show that the emotion happy is confused with anger, and the emotion neutral is confused with sad. This occurs for all the features, but the amount of confusion is larger with individual features and smaller with feature fusion. The reason for this confusion can be explained using the valence-arousal space: prosodic features are able to discriminate the emotions in the high arousal space (happy, anger) from those in the low arousal space (neutral, sad), but confusion remains among the emotions within the same arousal state. By using spectral features and feature fusion, the confusion among the emotions in the same arousal state is reduced.

Table 6 Emotion classification performance (%) of RDA using the feature fusion technique on (a) Berlin database and (b) Spanish database speech utterances

(a) Berlin    Happy   Neutral   Anger   Sad
Happy         73      4         20      3
Neutral       9       70        –       21
Anger         14      –         83      3
Sad           –       3         –       97

(b) Spanish   Happy   Neutral   Anger   Sad
Happy         71      5         15      9
Neutral       –       60        14      26
Anger         22      –         67      11
Sad           1       2         –       97

Table 7 Recognition accuracy (%) for each emotion (happy, neutral, anger and sad) with the various classifiers using feature fusion on (a) the Berlin database and (b) the Spanish database

(a) Berlin
Algorithm   Happy   Neutral   Anger   Sad
LDA         49      59        68      72
RDA         73      70        83      97
SVM         70      65        74      93
kNN         55      63        93      77

(b) Spanish
Algorithm   Happy   Neutral   Anger   Sad
LDA         49      56        65      70
RDA         71      60        67      97
SVM         67      77        65      83
kNN         63      70        88      65
The recognition accuracy of each emotion is examined separately with all the classifiers, as shown in Table 7. The left column of Table 7 lists the classifiers and the top row lists the emotions; each cell gives the recognition accuracy of that emotion by the corresponding classifier. With the Berlin database, the emotion recognition rates of RDA and SVM are comparable for all the emotions. The emotion anger is identified best by the kNN classifier on both databases.
The graphical representation of the efficiencies of these classifiers is shown in Fig. 6. The blue, red, green and violet bars (from front to back) represent the efficiencies of LDA, kNN, SVM and RDA respectively, and the bars from left to right correspond to the emotions happy, neutral, anger and sad.
4.4 Analysis of results by ROC curves
The shape of an ROC curve and the area under it help to estimate the discriminative power of a classifier. The area under the curve (AUC) can take any value between 0 and 1 and is a good indicator of the goodness of the test. The relationship between the AUC and diagnostic accuracy reported in the literature is shown in Table 8: as the AUC approaches 1.0 the classification technique is excellent at classifying emotions, between 0.6 and 0.9 the results are satisfactory, below 0.6 the performance is poor, and below 0.5 the classification technique is not useful.

To plot ROC curves for our four emotional classes, the speech samples are first divided into positive and negative emotions: happy and neutral speech samples are positive emotions, while anger and sad speech samples are negative emotions. Sensitivity evaluates how good the test is at detecting positive emotions, and specificity evaluates how well negative emotions are separated from the positive ones. The ROC curve is a graphical representation of the relationship between sensitivity and specificity.
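A sketch of this ROC analysis with scikit-learn; `scores` is assumed to hold each test sample's score for the positive class (e.g. a decision_function output), which the paper does not specify:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Happy and neutral form the positive class, anger and sad the negative
positive = {"happy", "neutral"}
y_binary = np.array([1 if label in positive else 0 for label in y_test])

fpr, tpr, _ = roc_curve(y_binary, scores)  # tpr = sensitivity
specificity = 1 - fpr                      # fpr = 1 - specificity
auc = roc_auc_score(y_binary, scores)
```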
Fig. 7 shows the performance comparison of the different classifiers, and Table 9 gives the values extracted from the ROC plots for the different classification algorithms on the Berlin and Spanish databases. From Table 9 it is observed that the AUC is above 0.8 for RDA and above 0.7 for SVM, which according to Table 8 means that the diagnostic accuracy of these classifiers is very good and good respectively; similarly, the AUC is above 0.6 for kNN and LDA, which means that the diagnostic accuracy of these classifiers is sufficient.
4.5 Analysis of results for driver applications
In the driving task, the emotion of the driver has an impact on driving performance, which can be improved if the automobile actively responds to the emotional state of the driver. It is therefore important for a smart driver support system to monitor the driver's state accurately and robustly.

The analysis presented in this paper has been performed using two acted emotional speech databases, from which emotions are extracted to a moderate level of accuracy; this success is obtained through the implementation of fused features. To see whether these results are applicable to a real-life driver emotion identification system, a database of speech samples collected naturally from drivers would have to be created. The speech samples of the Berlin and Spanish databases were collected in a noise-free environment, whereas speech samples collected in real life contain noise from the vehicle, the environment and the voices of co-passengers. To obtain noise-free samples from a driver, voice activity detection (VAD) is used as a pre-processing step to eliminate noise. After extracting the noise-free samples, features are extracted and combined using feature fusion, which captures the emotion-specific information of the driver. Finally, using this information, a classifier determines the emotional state of the driver, and based on the result the smart driver support system alerts the driver.

Fig. 6 Comparison of emotion recognition performances of the different classifiers using (a) the Berlin database, (b) the Spanish database

Fig. 7 Comparison of performances of the different classifiers using (a) the Berlin database, (b) the Spanish database

Table 8 The relationship between the area under the curve and diagnostic accuracy

Area under curve   Diagnostic accuracy
0.9–1.0            Excellent
0.8–0.9            Very good
0.7–0.8            Good
0.6–0.7            Sufficient
0.5–0.6            Bad
<0.5               Test not useful

Table 9 Accuracy (Accu), sensitivity (Sens), specificity (Spec) and area under the curve (AUC) extracted from the ROC plots of the classifiers for (a) the Spanish database and (b) the Berlin database

(a) Spanish
Algorithm   Accu (%)   Sens (%)   Spec (%)   AUC
RDA         74.0       70.7       78.4       0.814
SVM         72.0       72.2       71.8       0.734
kNN         71.8       75.1       69.2       0.688
LDA         62.0       60.3       64.3       0.619

(b) Berlin
Algorithm   Accu (%)   Sens (%)   Spec (%)   AUC
RDA         80.8       75.7       88.2       0.854
SVM         75.5       72.0       80.4       0.806
kNN         72.0       67.5       79.7       0.709
LDA         60.0       57.7       60.8       0.601
5 Conclusion
The emotion recognition accuracy achievable from acoustic information generated by the driver is systematically evaluated using the Berlin and Spanish emotional databases. This has been implemented with a variety of classifiers, LDA, RDA, SVM and kNN, using various prosody and spectral features, and an extensive comparative study has been made of these classifiers. The emotion recognition performance of the classifiers is obtained effectively with the feature fusion technique. The results of the evaluation show that RDA yields the best recognition performance and that SVM also gives good recognition performance compared with the other classifiers. Using feature fusion instead of individual features enhances recognition performance: the experimental results suggest that recognition accuracy improves by approximately 20% for each classifier with feature fusion.
Acknowledgments This work was supported by the research project "Non-intrusive real time driving process ergonomics monitoring system to improve road safety in a car-PC environment" funded by DST, New Delhi.
References
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In: Interspeech, pp. 1517–1520.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.
El Ayadi, M. M., Kamel, M. S., & Karray, F. (2007). Speech emotion recognition using Gaussian mixture vector autoregressive models. In: Acoustics, Speech and Signal Processing (ICASSP 2007), IEEE, vol. 4, pp. IV-957.
Ji, S., & Ye, J. (2008). Generalized linear discriminant analysis: A unified framework and efficient model selection. IEEE Transactions on Neural Networks, 19(10), 1768–1782.
Koolagudi, S. G., & Rao, K. S. (2012). Emotion recognition from speech: a review. International Journal of Speech Technology, 15(2), 99–117.
Koolagudi, S. G., Kumar, N., & Rao, K. S. (2011). Speech emotion recognition using segmental level prosodic analysis. In: Devices and Communications (ICDeCom), 2011 International Conference on, IEEE, pp. 1–5.
Luengo, I., Navas, E., Hernáez, I., & Sánchez, J. (2005). Automatic emotion recognition using prosodic parameters. In: Interspeech, pp. 493–496.
Luengo, I., Navas, E., & Hernáez, I. (2010). Feature analysis and evaluation for automatic emotion identification in speech. IEEE Transactions on Multimedia, 12(6), 490–501.
Milton, A., Roy, S. S., & Selvi, S. (2013). SVM scheme for speech emotion recognition using MFCC feature. International Journal of Computer Applications, 69(9), 34–39.
Nicholson, J., Takahashi, K., & Nakatsu, R. (2000). Emotion recognition in speech using neural networks. Neural Computing and Applications, 9(4), 290–296.
Nwe, T. L., Foo, S. W., & De Silva, L. C. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41(4), 603–623.
Ravikumar, M., & Suresha, M. (2013). Dimensionality reduction and classification of color features data using SVM and kNN. International Journal of Image Processing and Visual Communication, 1(4), 16–21.
Sato, N., & Obuchi, Y. (2007). Emotion recognition using mel-frequency cepstral coefficients. Information and Media Technologies, 2(3), 835–848.
Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1), 227–256.
Schuller, B., Rigoll, G., & Lang, M. (2003). Hidden Markov model-based speech emotion recognition. In: Acoustics, Speech, and Signal Processing (ICASSP 2003), IEEE, vol. 2, pp. II-1.
Tato, R., Santos, R., Kompe, R., & Pardo, J. M. (2002). Emotional space improves emotion recognition. In: Interspeech.
Vankayalapati, H., Anne, K., & Kyamakya, K. (2010). Extraction of visual and acoustic features of the driver for monitoring driver ergonomics applied to extended driver assistance systems. In: Data and Mobility, Springer, pp. 83–94.
Vankayalapati, H. D., & SVKK Anne, K. R. (2011). Driver emotion detection from the acoustic features of the driver for real-time assessment of driving ergonomics process. ISAST Transactions on Computers and Intelligent Systems, 3(1), 65–73.
Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9), 1162–1181.
Vogt, T., André, E., & Wagner, J. (2008). Automatic recognition of emotions from speech: a review of the literature and recommendations for practical realisation. In C. Peter & R. Beale (Eds.), Affect and emotion in human-computer interaction (pp. 75–91). Springer.
Ye, J., Xiong, T., Li, Q., Janardan, R., Bi, J., Cherkassky, V., & Kambhamettu, C. (2006). Efficient model selection for regularized linear discriminant analysis. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, ACM, pp. 532–539.
Zhou, Y., Sun, Y., Zhang, J., & Yan, Y. (2009). Speech emotion recognition using both spectral and prosodic features. In: Information Engineering and Computer Science (ICIECS 2009), IEEE, pp. 1–4.