This paper proposes a novel method for speech emotion recognition that applies empirical mode decomposition (EMD) to extract emotional features from speech and a deep neural network (DNN) to classify the emotions. The method enhances the emotional components of speech signals by combining EMD with the acoustic feature Mel-frequency cepstral coefficients (MFCCs) to improve recognition rates with the DNN classifier. EMD first decomposes the speech signal, which contains emotional components, into multiple intrinsic mode functions (IMFs); emotional features are then derived from the IMFs by computing MFCCs. These features are used to train the DNN model, and the trained model is then used to identify emotions in speech. Experimental results reveal that the proposed method is effective.
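As a rough illustration of the pipeline this abstract describes, the sketch below decomposes a signal with EMD, computes MFCCs per IMF, and trains a small DNN. It assumes the PyEMD package ("EMD-signal" on PyPI), librosa, and scikit-learn's MLPClassifier; `train_paths`, `train_labels`, the number of IMFs kept, and the network size are hypothetical choices, not the paper's settings.

```python
# Minimal sketch of the EMD -> MFCC -> DNN pipeline, under assumed settings.
import numpy as np
import librosa
from PyEMD import EMD
from sklearn.neural_network import MLPClassifier

def emd_mfcc_features(path, n_imfs=4, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    imfs = EMD().emd(y)                      # decompose into intrinsic mode functions
    feats = []
    for imf in imfs[:n_imfs]:                # keep the first few (highest-frequency) IMFs
        mfcc = librosa.feature.mfcc(y=imf, sr=sr, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))      # average over frames -> fixed-length vector
    return np.concatenate(feats)

# train_paths / train_labels are hypothetical: one feature vector per utterance
X = np.array([emd_mfcc_features(p) for p in train_paths])
clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500).fit(X, train_labels)
```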
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text-based and audio-based models and then fused these heterogeneous feature maps, taken from intermediate layers, to complete the architecture. The performance measures (accuracy, precision, recall, and F1-score) are observed to outperform the existing models.
Emotion Recognition Based On Audio Speech (IOSR Journals)
This document summarizes a research paper on emotion recognition based on audio speech. It discusses how acoustic features are extracted from speech signals by applying preprocessing techniques like preemphasis and framing. It describes extracting features like Mel frequency cepstral coefficients (MFCCs) that capture characteristics of the vocal tract. Support vector machines (SVMs) are used as pattern classification methods to build models for each emotion and compare test speech features to recognize emotions. The paper confirms the advantage of its audio-based emotion recognition approach through experimental results and discusses potential improvements and future work on increasing efficiency and recognizing emotion intensity.
Speech emotion recognition with light gradient boosting decision trees machine (IJECEIAES)
Speech emotion recognition aims to identify the emotion expressed in speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as frequency-domain and temporal-domain features. For classification, a light gradient boosting machine is leveraged, with hyperparameter tuning performed to determine the optimal settings. As speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution, so minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB), 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS), and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
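A minimal sketch of the class-weighting scheme described here, assuming LightGBM's scikit-learn interface; `X_train`, `y_train`, `X_test`, `y_test`, and the hyperparameter values are placeholders, not the paper's tuned settings.

```python
# Class weights inversely proportional to class frequency, fed to LightGBM.
import numpy as np
from lightgbm import LGBMClassifier

classes, counts = np.unique(y_train, return_counts=True)
weights = {c: len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}

clf = LGBMClassifier(
    n_estimators=400, learning_rate=0.05,   # illustrative values; the paper tunes these
    class_weight=weights,                    # minority classes receive higher weight
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```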
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
The performance of various acoustic feature extraction methods is compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a series of vectors that represent the speech signals, and they can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector extraction technique, chosen for its widespread popularity. Other extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), are then applied; these two methods closely resemble the human auditory system. The feature vectors are used to train the LSTM neural network, and the resulting models of different phonemes are compared with statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of the acoustic features.
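The distance-based model comparison can be sketched as follows, assuming each phoneme is summarized by a Gaussian over its feature frames; the formulas are the standard Mahalanobis and Bhattacharyya distances between Gaussians, and the `plp_frames_*` arrays are hypothetical (n_frames, n_dims) inputs.

```python
# Fit a Gaussian per phoneme, then compare phonemes with two distances.
import numpy as np

def gaussian_fit(frames):
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def mahalanobis(mu1, mu2, cov):
    d = mu1 - mu2
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = (cov1 + cov2) / 2
    d = mu1 - mu2
    term1 = d @ np.linalg.inv(cov) @ d / 8
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)

mu_a, cov_a = gaussian_fit(plp_frames_phoneme_a)   # hypothetical PLP frames
mu_b, cov_b = gaussian_fit(plp_frames_phoneme_b)
print(mahalanobis(mu_a, mu_b, (cov_a + cov_b) / 2))
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))
```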
This paper introduces new features based on histograms of MFCC extracted from audio files to improve emotion recognition from speech. Experimental results on the Berlin and PAU databases using SVM and Random Forest classifiers show the proposed features achieve better classification results than current methods. Detailed analysis is provided on speech type (acted vs natural) and gender.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
This document summarizes a study that compares different acoustic feature extraction methods (LPC, MFCC, PLP) for a Bangla speech recognition system using LSTM neural networks. It finds that PLP outperforms MFCC and LPC based on statistical distance measurements of phoneme coefficients. PLP shows better distinction between phonemes compared to MFCC and LPC. While RNN/LSTM are inherently slow, combining PLP with faster networks like Transformers may improve performance for large datasets.
MFCC based enlargement of the training set for emotion recognition in speech (sipij)
Emotional state recognition through speech is a very active research topic nowadays. Using subliminal information in speech, known as prosody, it is possible to recognize the emotional state of a person. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns, which makes the learning process more difficult due to the generalization problems that arise under these conditions.
In this work we propose a solution to this problem: enlarging the training set through the creation of new virtual patterns. In emotional speech, most of the emotional information is carried by speed and pitch variations, so a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We use this prior information to create new patterns by applying a gender-dependent pitch-shift modification in the feature extraction process of the classification system, realized as a gender-dependent frequency scaling of the Mel frequency cepstral coefficients used to classify the emotion. This process synthetically increases the number of available patterns in the training set, improving the generalization capability of the system and reducing the test error. Results obtained with two classifiers of different generalization capability demonstrate the suitability of the proposal.
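A hedged sketch of the virtual-pattern idea: the paper applies the frequency scaling inside the MFCC computation itself, while this stand-in simply pitch-shifts the waveform before a standard MFCC extraction; the gender-dependent shift sizes are assumed, not taken from the paper.

```python
# Enlarge the training set with pitch-shifted variants of each utterance.
import librosa

def virtual_patterns(y, sr, gender, n_mfcc=13):
    # smaller shifts for male voices, larger for female: an assumed policy
    steps = [-1.0, 1.0] if gender == "male" else [-2.0, 2.0]
    patterns = [librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)]
    for s in steps:
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=s)
        patterns.append(librosa.feature.mfcc(y=shifted, sr=sr, n_mfcc=n_mfcc))
    return patterns   # original plus synthetic variants enlarge the training set
```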
On the use of voice activity detection in speech emotion recognition (journalBEEI)
Emotion recognition through speech has many potential applications; however, the challenge lies in achieving high recognition accuracy with limited resources or under interference such as noise. In this paper we explore the possibility of improving speech emotion recognition by utilizing the voice activity detection (VAD) concept. The emotional voice data from the Berlin Emotion Database (EMO-DB) and a custom-made database, the LQ Audio Dataset, are first preprocessed by VAD before feature extraction, and the features are then passed to a deep neural network for classification. We chose MFCC as the sole determinant feature. Comparing the results obtained with and without VAD, we found that VAD improved the recognition rate of five emotions (happy, angry, sad, fear, and neutral) by 3.7% when recognizing clean signals, while using VAD when training a network with both clean and noisy signals improved our previous results by 50%.
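The VAD-then-features idea might look like the following, with librosa's energy-based `effects.split` standing in for the paper's VAD; the file name and the `top_db` threshold are illustrative assumptions.

```python
# Drop silent regions, then extract MFCCs from the remaining speech only.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)      # hypothetical file
intervals = librosa.effects.split(y, top_db=30)     # (start, end) sample indices of speech
voiced = np.concatenate([y[s:e] for s, e in intervals])
mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=13)   # features from voiced audio only
```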
This document presents research on emotion recognition from speech using a combination of MFCC and LPCC features with support vector machine (SVM) classification. Two databases were used: the Berlin Emotional Database and SAVEE database. MFCC and LPCC features were extracted from the speech samples and combined. SVM with radial basis function kernel achieved the highest accuracy of 88.59% for emotion recognition on the Berlin database using the combined features. Confusion matrices are presented to evaluate performance on each database.
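A minimal sketch of the feature fusion just described, with librosa's LPC coefficients standing in for the paper's LPCC and an RBF-kernel SVM from scikit-learn; `train_paths`, `train_labels`, and the coefficient orders are assumptions.

```python
# Concatenate MFCC statistics with LPC coefficients, classify with RBF SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def fused_features(path, n_mfcc=13, lpc_order=12):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    lpc = librosa.lpc(y, order=lpc_order)[1:]   # drop the leading 1.0 coefficient
    return np.concatenate([mfcc, lpc])

X = np.array([fused_features(p) for p in train_paths])   # hypothetical path list
clf = SVC(kernel="rbf").fit(X, train_labels)
```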
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signals, and HNM analysis and synthesis provide high-quality speech with a small number of parameters. Dynamic time warping (DTW) is a well-known technique for aligning two multidimensional sequences: it locates an optimal match between the given sequences, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded, segmented manually, and aligned at the sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
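The alignment step could be sketched as a plain DTW with a Mahalanobis frame distance, as below; the pooled covariance estimate and the length normalization are assumed choices, not details from the paper. `X` and `Y` are (n_frames, n_dims) MFCC or HNM-parameter matrices.

```python
# DTW between two feature sequences using a Mahalanobis frame distance.
import numpy as np

def dtw_mahalanobis(X, Y):
    VI = np.linalg.inv(np.cov(np.vstack([X, Y]), rowvar=False))  # inverse pooled covariance
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = X[i - 1] - Y[j - 1]
            cost = np.sqrt(d @ VI @ d)                 # Mahalanobis frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)     # length-normalized alignment cost
```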
This document proposes an automatic emotion recognition system that analyzes audio information to classify human emotions. It uses spectral features and MFCC coefficients for feature extraction from voice signals. Then, a deep learning-based LSTM algorithm is used for classification. The system is evaluated on three audio datasets. Recurrent convolutional neural networks are proposed to capture temporal and frequency dependencies in speech spectrograms. The system aims to improve on existing methods which have lower accuracy and require more computational resources for implementation.
Speech emotion recognition is a recent research topic in the human-computer interaction (HCI) field. The need has arisen for a more natural communication interface between humans and computers, as computers have become an integral part of our lives, and a lot of work is currently going on to improve this interaction. To achieve this goal, a computer would have to be able to assess its present situation and respond differently depending on that observation, and part of this process involves understanding a user's emotional state. To make human-computer interaction more natural, the objective is for the computer to recognize emotional states in the same way a human does. The efficiency of an emotion recognition system depends on the type of features extracted and the classifier used to detect emotions. The proposed system aims at identifying basic emotional states such as anger, joy, neutral, and sadness from human speech. Features such as MFCC (Mel frequency cepstral coefficients) and energy are used to classify the different emotions. In this paper, a standard emotional database (an English database) is used, which gives more satisfactory detection of emotions than recorded samples. The methodology describes and compares the performance of a Learning Vector Quantization neural network (LVQ NN), a multiclass support vector machine (SVM), and their combination for emotion recognition.
Automatic speech emotion and speaker recognition based on hybrid GMM and FFBNN (ijcsa)
In this paper we present text-dependent speaker recognition with an enhancement that first detects the emotion of the speaker, using hybrid FFBNN and GMM methods, since the emotional state of the speaker influences the recognition system. The Mel-frequency cepstral coefficient (MFCC) feature set is used for experimentation. To recognize the emotional state of a speaker, a Gaussian mixture model (GMM) is used in the training phase and a feed-forward back-propagation neural network (FFBNN) in the testing phase. A speech database consisting of 25 speakers recorded in five emotional states (happy, angry, sad, surprise, and neutral) is used for experimentation. The results reveal that the emotional state of the speaker has a significant impact on the accuracy of speaker recognition.
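The GMM side of this hybrid might be sketched as one mixture per emotion with a maximum-likelihood decision; the FFBNN testing stage is omitted here, and `mfcc_frames_by_emotion` and the component count are hypothetical.

```python
# One Gaussian mixture per emotion, trained on MFCC frames.
import numpy as np
from sklearn.mixture import GaussianMixture

models = {}
for emotion, frames in mfcc_frames_by_emotion.items():   # hypothetical: label -> (n, d) array
    models[emotion] = GaussianMixture(n_components=8).fit(frames)

def classify(frames):
    scores = {e: m.score(frames) for e, m in models.items()}  # mean log-likelihood per frame
    return max(scores, key=scores.get)
```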
This document provides a course report on speech recognition using MATLAB. It includes an abstract, introduction, report outline, design of the system, process and results, flowchart of the experiment, simulation and analysis, discussion, and conclusion sections. The report details using MFCC and GMM algorithms for speech recognition, including signal preprocessing, parameter extraction, and training. Simulation results showed recognition rates approaching 100% for same speakers in quiet environments, but performance was impacted by noise and different speakers. The system demonstrated speech recognition capabilities in MATLAB but had limitations.
This document discusses speaker recognition techniques and applications. It provides an overview of common feature extraction methods for speaker recognition systems such as Linear Predictive Coding (LPC), Perceptual Linear Predictive Coefficients (PLPC), Mel Frequency Cepstral Coefficients (MFCC), and Gamma Tone Frequency Cepstral Coefficients (GFCC). It compares these techniques and finds that MFCC and PLPC are most often used due to being derived from models of human auditory processing. GFCC has benefits of higher frequency resolution and noise robustness compared to MFCC. Speaker recognition has applications in security systems for access control.
This document presents a text-dependent speaker recognition system using neural networks that aims to improve recognition accuracy. It proposes changing the number of Mel Frequency Cepstral Coefficients (MFCCs) used in training. Voice Activity Detection is also used as a preprocessing step. Experimental results show recognition accuracy increases from 70.41% to a maximum of 92.91% as the number of MFCCs increases from 14 to 20, but then decreases with more MFCCs. The system is implemented on a Raspberry Pi for hardware acceleration.
Utterance Based Speaker Identification Using ANN (IJCSEA Journal)
This document summarizes a research paper on speaker identification using artificial neural networks. The paper presents a speaker identification system that uses digital signal processing and ANN techniques. Speech features are extracted from utterances using FFT and windowing. These features are used to train a multi-layer perceptron network to classify speakers. The system was tested on Bangla speech and achieved accurate identification of speakers from their utterances.
Utterance Based Speaker Identification Using ANN (IJCSEA Journal)
In this paper we present the implementation of speaker identification system using artificial neural network with digital signal processing. The system is designed to work with the text-dependent speaker identification for Bangla Speech. The utterances of speakers are recorded for specific Bangla words using an audio wave recorder. The speech features are acquired by the digital signal processing technique. The identification of speaker using frequency domain data is performed using back propagation algorithm. Hamming window and Blackman-Harris window are used to investigate better speaker identification performance. Endpoint detection of speech is developed in order to achieve high accuracy of the system.
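The window comparison in this system reduces to applying each window to a frame before the FFT used for feature extraction, as in this sketch; `speech_samples` and the frame length are hypothetical.

```python
# Compare Hamming and Blackman-Harris windows on one speech frame.
import numpy as np
from scipy.signal import get_window

frame = speech_samples[:512]                      # hypothetical 512-sample frame
for name in ("hamming", "blackmanharris"):
    w = get_window(name, len(frame))
    spectrum = np.abs(np.fft.rfft(frame * w))     # frequency-domain features per window
    print(name, spectrum[:5])
```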
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation in context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the curvelet transform, which reduces computational complexity and the feature vector size; its varying window size also makes it well suited to non-stationary signals. For word classification and recognition, a discrete hidden Markov model (HMM) is used, as it accounts for the time distribution of speech signals. The HMM classification method attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction and classification phases of a speech recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using the discrete curvelet transform over conventional methods.
ASERS-LSTM: Arabic Speech Emotion Recognition System Based on LSTM Model (sipij)
The swift progress in the field of human-computer interaction (HCI) has increased interest in speech emotion recognition (SER) systems, which identify the emotional states of human beings from their voice. There is substantial work on SER for various languages, but few studies have implemented Arabic SER systems because of the shortage of available Arabic speech emotion databases; the most commonly considered languages for SER are English and other European and Asian languages. Several machine learning classifiers have been used by researchers to distinguish emotional classes: SVMs, random forests, the KNN algorithm, hidden Markov models (HMMs), MLPs, and deep learning. In this paper we propose the ASERS-LSTM model for Arabic speech emotion recognition, based on an LSTM model. We extract five features from the speech: Mel-frequency cepstral coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid (tonnetz) features. We evaluate our model on an Arabic speech dataset named the Basic Arabic Expressive Speech corpus (BAES-DB). In addition, we construct a DNN to classify the emotions and compare the accuracy of the LSTM and DNN models: the DNN reaches 93.34% accuracy and the LSTM 96.81%.
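All five features named here have standard librosa extractors, so the front end might be sketched as below; the frame-averaging pooling and `n_mfcc=40` are assumed choices, not the paper's.

```python
# Extract the five-feature set: MFCC, chroma, mel spectrogram, contrast, tonnetz.
import numpy as np
import librosa

def asers_features(path):
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.melspectrogram(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])  # fixed-length vector
```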
This paper describes the development of an efficient speech recognition system using different techniques such as Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ) and Hidden Markov Model (HMM).
This paper explains how speaker recognition followed by speech recognition is used to recognize speech faster, more efficiently, and more accurately. MFCC is used to extract the characteristics of the input speech signal with respect to a particular word uttered by a particular speaker. HMM is then applied to the quantized feature vectors to identify the word by evaluating the maximum log-likelihood value for the spoken word.
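The maximum-likelihood word decision might be sketched with hmmlearn, using a continuous GaussianHMM as a stand-in for the paper's VQ plus discrete HMM; `mfcc_by_word` and the state count are hypothetical.

```python
# One HMM per vocabulary word; recognize by maximum log-likelihood.
import numpy as np
from hmmlearn import hmm

models = {}
for word, sequences in mfcc_by_word.items():        # hypothetical: word -> list of (n, d) arrays
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    m = hmm.GaussianHMM(n_components=5)
    m.fit(X, lengths)                               # per-word training
    models[word] = m

def recognize(mfcc_seq):
    return max(models, key=lambda w: models[w].score(mfcc_seq))  # max log-likelihood
```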
SPEECH EMOTION RECOGNITION SYSTEM USING RNN (IRJET Journal)
This document discusses a speech emotion recognition system using recurrent neural networks (RNNs). It begins with an abstract describing speech emotion recognition and its importance. Then it provides background on speech emotion databases, feature extraction using MFCC, and classification approaches like RNNs. It reviews related work on speech emotion recognition using various methods. Finally, it concludes that MFCC feature extraction and RNN classification was used in the proposed system to take advantage of their performance in machine learning applications. The system aims to help machines understand human interaction and respond based on the user's emotion.
Air quality forecasting using convolutional neural networks (BIJIAM Journal)
Air pollution is now one of the biggest environmental risks, which causes more than 6 million premature deaths each year from heart diseases, stroke, diabetes, respiratory disease, and so on. Protecting humans from the damage which is caused by air pollution is one of the major issues for the global community. The prediction of air pollution can be done by machine learning (ML) algorithms. ML combines statistics and computer science to maximize the prediction power. ML can also be used to predict the air quality index (AQI). The aim of this research is to develop a convolutional neural network (CNN) model to predict air quality from the unseen data set, which includes concentrations of nitrogen dioxide (NO2), carbon monoxide (CO), and sulfur dioxide (SO2). The proposed system will be implemented in two steps; the first step will focus on data analysis and pre-processing, including filtering, feature extraction, constructing convolutional neural network layers, and optimizing the parameters of each layer, while the second step is used to evaluate its model accuracy. The output is predicted as AQI for the developed CNN model. The developed CNN model achieves a root-mean-square error of 13.4150 and a high accuracy of 86.585%. The overall model is implemented using MATLAB software.
Prevention of fire and hunting in forests using machine learning for sustaina... (BIJIAM Journal)
Deforestation, illegal hunting, and forest fires are a few current issues that have an impact on the diversity and ecosystem of forests. To increase the biodiversity of species and ecosystems, it becomes imperative to preserve the forest. The conventional techniques employed to prevent these issues are costly, less effective, and insecure. The current systems are unreliable and use more energy. By utilizing an Internet of Things (IoT) system, this technology offers a more practical and economical method of continuously maintaining and monitoring the status of the forest. To guarantee excellent security, this system combines a number of sensors, alarms, cameras, lights, and microphones. It aids in reducing forest loss, animal trafficking, and forest fires. In the suggested system, sensors are used for monitoring, and cloud storage is used for data storage. Through the use of machine learning, the Raspberry Pi camera module significantly aids in the prevention of unlawful wildlife hunting as well as the detection and prevention of forest fires.
Object detection and ship classification using YOLO and Amazon Rekognition (BIJIAM Journal)
The need for reliable automated object detection and classification in the maritime sphere has increased significantly over the last decade. The increased use of unmanned marine vehicles necessitates the development and use of autonomous object detection and classification systems that utilize onboard cameras and operate reliably, efficiently, and accurately without the need for human intervention. As the autonomous vehicle realm moves further away from active human supervision, the requirements for a detection and classification suite call for a higher level of accuracy in the classification models. This paper presents a comparative study using different classification models on a large maritime vessel image dataset. For this study, additional images were annotated using Roboflow, focusing on the types of subclasses that had the lowest detection performance in a previous work by Brown et al. (1). The present study uses a dataset of over 5,000 images. Using the enhanced set of image annotations, models were created in the cloud using Google Colab with YOLOv5, YOLOv7, and YOLOv8, as well as in Amazon Web Services using Amazon Rekognition. The models were tuned and run for five runs of 150 epochs each to collect efficiency and performance metrics. Comparison of these metrics from the YOLO models yielded interesting improvements in classification accuracies for the subclasses that previously had lower scores, but at the expense of the overall model accuracy. Furthermore, training time efficiency was improved drastically using the newer YOLO APIs. In contrast, using Amazon Rekognition yielded superior accuracy results across the board, but at the expense of the lowest training efficiency.
Machine learning for autonomous online exam fraud detection: A concept designBIJIAM Journal
E-learning (EL) has emerged as one of the most valuable means for continuing education around the world, especially in the aftermath of the global pandemic, which brought a variety of obstacles. Real-time online assessments have become a significant concern for educational organizations. Instances of fraudulent behavior during online exams (OEs) have created considerable challenges for exam invigilators, who are unable to identify and remove such dishonest behavior. In response to this significant issue, educational institutions have used a variety of manual procedures to alleviate the situation, but none of these measures have proven particularly innovative or effective. The current study presents a novel strategy for detecting fraudulent actions in real time during OEs that uses convolutional neural network (CNN) algorithms and image processing. The developed model will be trained using the CK and CK++ datasets. The training procedure will use 80% of the selected dataset, with the remaining 20% used for model testing to confirm the model's efficacy and generalization capacity. This project intends to revolutionize the monitoring and prevention of fraudulent actions during online tests by integrating CNN techniques and image processing. The use of the CK and CK++ datasets, as well as an 80–20 split for training and testing, contributes to the study's thorough and rigorous approach. By successfully using this unique technique, educational institutions can improve their assessment procedures and maintain the credibility of EL as a credible and equitable way of continuing education.
Prognostication of the placement of students applying machine learning algori...BIJIAM Journal
Placement is the process of connecting a selected candidate with an employer. Every student may dream of having a job offer by the time he or she completes the course. All educational institutions aim at having their students well placed in good organizations. The reputation of any institution depends on the placement of its students. Hence, many institutions try hard to have a good placement cell. Classification using machine learning may be utilized to retrieve data from student databases. A prediction model that can foretell the eligibility of students based on their academic and extracurricular achievements is proposed. Related data was collected from many institutions for which the placement prediction is made. This paradigm is weighed against the existing algorithms, and findings have been made regarding the accuracy of predictions. It was found that the proposed algorithm performed significantly better and yielded good results.
Drug recommendation using recurrent neural networks augmented with cellular a...BIJIAM Journal
Drug recommendation systems are systems that have the capability to recommend drugs. On a daily basis, a huge amount of data is generated by patients. All this valuable data can be properly utilized to create a reliable drug recommendation system. In this paper, we propose a system for drug recommendations. The main scope of our system is to predict the correct medication based on reviews and ratings. Our proposed system uses natural language processing (NLP) techniques, recurrent neural networks (RNN), and cellular automata (CA). We also considered various metrics like precision, recall, accuracy, F1 score, and the ROC curve as measures of our system's performance. NLP techniques are used for gathering useful information from patient data, and RNN is a machine learning methodology that works well for analyzing textual data. The system considers various patient data attributes like age, gender, dosage, medical history, and symptoms in order to make appropriate predictions. The proposed system has the potential to help medical professionals make informed drug recommendations.
Discrete clusters formulation through the exploitation of optimized k-modes a...BIJIAM Journal
This article focuses on the results of self-funded quantitative research conducted by social workers working in the “refugee” crisis and social services in Greece (1). The research, among other findings, argues that front-line professionals possess specific characteristics regarding their working profile. The statistical methods in the research performed significance tests to validate the initial hypotheses concerning the correlation between dataset variables. In contrast to that concept, in this work we present an alternative approach for validating initial hypotheses through the exploitation of clustering algorithms. Toward that goal, we evaluated several frequently used clustering algorithms regarding their efficiency in feature selection processes, and we finally propose a modified k-Modes algorithm for efficient feature subset selection.
Enhanced 3D Brain Tumor Segmentation Using Assorted Precision TrainingBIJIAM Journal
A brain tumor is a medical disorder faced by individuals of all demographics. Medically, it is described as the spread of nonessential cells close to or throughout the brain. Symptoms of this ailment include headaches, seizures, and sensory changes. This research explores two main categories of brain tumors: benign and malignant. Benign tumors spread slowly, while the rapid growth of malignant tumors makes them dangerous. Early identification of brain tumors is a crucial factor for the survival of patients. This research provides a state-of-the-art approach to the early identification of tumors within the brain. We implemented the SegResNet architecture, a widely adopted architecture for three-dimensional segmentation, and trained it using the automatic multi-precision method. We incorporated the Dice loss function and the Dice metric for evaluating the model. For the tumor core, we got a Dice score of 0.84; for the whole tumor, 0.90; and for the enhanced tumor, 0.79.
Hybrid Facial Expression Recognition (FER2013) Model for Real-Time Emotion Cl...BIJIAM Journal
Facial expression recognition is a vital research topic in fields ranging from artificial intelligence and gaming to human–computer interaction (HCI) and psychology. This paper proposes a hybrid model for facial expression recognition, which comprises a deep convolutional neural network (DCNN) and a Haar Cascade deep learning architecture. The objective is to classify real-time and digital facial images into one of the seven facial emotion categories considered. The DCNN employed in this research has more convolutional layers, ReLU activation functions, and multiple kernels to enhance filtering depth and facial feature extraction. In addition, a Haar Cascade model was used jointly to detect facial features in real-time images and video frames. Grayscale images from the Kaggle repository (FER-2013) were used, and graphics processing unit (GPU) computation was exploited to expedite the training and validation process. Pre-processing and data augmentation techniques are applied to improve training efficiency and classification performance. The experimental results show a significantly improved classification performance compared to state-of-the-art (SoTA) experiments and research. Also, compared to other conventional models, this paper validates that the proposed architecture is superior in classification performance, with an improvement of up to 6%, totaling up to 70% accuracy, and with a lower execution time of 2,098.8 s.
Role of Artificial Intelligence in Supply Chain ManagementBIJIAM Journal
The term “supply chain” refers to a network of facilities that includes a variety of companies. To minimise the entire cost of the supply chain, these entities must collaborate. This research focuses on the use of Artificial Intelligence techniques in supply chain management. It includes supply chain management examples such as demand forecasting, supply forecasting, text analytics, pricing planning, and more to help companies improve their processes, lower costs and risk, and boost revenue. It gives a quick rundown of all the key principles of economics and how to comprehend and use them effectively.
An Internet-of-Things Enabled Smart Fire Fighting SystemBIJIAM Journal
A fire accident is a mishap that can be either man-made or natural. Fire accidents occur frequently and can usually be controlled, but they may at times result in severe loss of life and property. Firefighters often struggle to locate the exact source of a fire as it keeps burning and spreads all over the area. To address this concern, we have designed an Internet-of-Things (IoT) based device that locates the exact source of a fire through software and hardware devices; the complete details of an area, preinstalled in the software, can also be visualized by firefighters. Because of our system's intelligence in decision making during firefighting, the proposed system is termed a 'Smart Fire Fighting System'. The implementation of the smart fire fighting system enables firefighters to analyze the current situation immediately and make decisions more quickly and effectively. With this system, fire losses in a building can be greatly reduced, many lives can be rescued immediately, and the spread of fire can also be restricted.
Evaluate Sustainability of Using Autonomous Vehicles for the Last-mile Delive...BIJIAM Journal
This study aims to determine whether autonomous vehicles (AVs) for last-mile deliveries are sustainable from three perspectives: social, environmental, and economic. Because of the relevance of AV application to last-mile delivery, safety was addressed only as part of societal sustainability. According to this study, it is rather safe to deploy AVs for delivery, since current speed restrictions and decent road conditions provide the foundation for AVs to run safely for last-mile delivery in metropolitan areas. Furthermore, AVs have a distinct advantage in the event of a pandemic. The biggest worry for environmental sustainability is the emission problem. In terms of a series of emissions, it is established that AVs have a substantial benefit in terms of emission reduction. This is mostly due to the differences in driving behaviour between autonomous and human-driven vehicles. Because of the AV's commercial character, this research used a quantitative approach to highlight the economic sustainability of AVs. The study demonstrates the economic benefit of AVs for various carrying capacities.
Automation Attendance Systems Approaches: A Practical ReviewBIJIAM Journal
Accounting for people is the first step of every manpower-based organization in today's world. Hence, it takes a significant amount of energy and money from the respective organizations both to implement a suitable system for manpower management and to maintain that same system. Although this expenditure is negligible for big organizations, rather just a formality, the same does not hold for small organizations such as schools, colleges, and even universities to a certain degree. This is the first point. The second point for discussion is that much work has been done to solve this issue. Various technologies like biometrics, RFID, Bluetooth, GPS, QR codes, etc., have been used to tackle the issues of attendance collection. This study paves the path for researchers by reviewing practical methods and technologies used in existing attendance systems.
Transforming African Education Systems through the Application of IOTBIJIAM Journal
The project sought to provide a paradigm for improving African education systems through the use of the IoT. The created IoT model for Africa will enable African countries, notably Namibia, to exchange educational content and resources with other African countries. The objective behind the IoT paradigm in Africa's education sectors is to provide open access to knowledge and information. The study revealed that there are no recognised platforms in African education systems that are utilised by African governments to interact, communicate, and share educational material directly with African institutions. As a result, the current research developed a model for transforming African education systems using the IoT in the Namibian context, which will serve as a centralised online platform for self-study, new skill acquisition, and self-improvement using materials provided by African institutions of higher learning. Everyone is welcome to use the platform, including students, instructors, and members of the general public.
Deep Learning Analysis for Estimating Sleep Syndrome Detection Utilizing the ...BIJIAM Journal
Manual sleep stage scoring is frequently performed by sleep specialists by visually evaluating the patient's neurophysiological signals acquired in sleep laboratories. This is a difficult, time-consuming, and laborious process. Because of the limits of human sleep stage scoring, there is a greater need for creating Automatic Sleep Stage Classification (ASSC) systems. Sleep stage categorization is the process of distinguishing the distinct stages of sleep and is an important step in assisting physicians in the diagnosis and treatment of associated sleep disorders. In this research, we offer a unique method and a practical strategy for predicting early onsets of sleep disorders, such as restless leg syndrome and insomnia, using the Twin Convolutional Model FTC2, based on an algorithm composed of two modules. To provide localised time-frequency information, 30-second-long epochs of EEG recordings are subjected to a Fast Fourier Transform, and a deep convolutional LSTM neural network is trained for sleep stage categorization. Automating sleep stage detection from EEG data offers great potential for tackling sleep irregularities on a daily basis. Thereby, a novel approach for sleep stage classification is proposed which combines the best of signal processing and statistics. In this study, we used the PhysioNet Sleep European Data Format (EDF) Database. The code evaluation showed impressive results, reaching an accuracy of 90.43, a precision of 77.76, a recall of 93.32, and an F1-score of 89.12, with a final mean false error loss of 0.09. All the source code is available at https://github.com/timothy102/eeg.
Robotic Mushroom Harvesting by Employing Probabilistic Road Map and Inverse K...BIJIAM Journal
A collision-free path to a destination position in a random farm is determined using a probabilistic roadmap (PRM) that can manage static and dynamic obstacles. The position of ripening mushrooms is obtained through image processing. A mushroom harvesting robot is explored that uses inverse kinematics (IK) at the target position to compute the state of a robotic hand for grasping a ripening mushroom and plucking it. The Denavit-Hartenberg approach was used to create a kinematic model of a two-finger dexterous hand with three degrees of freedom for mushroom picking. Unlike prior experiments in mushroom harvesting, the mushrooms are not planted in a grid or pattern but are randomly scattered. At no point throughout the harvesting process is human interaction necessary.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from discrete devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, is confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI, and sensors give misleading values, which can prove fatal in the case of automobiles. In this paper, the authors have non-exhaustively reviewed research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
A three-day training on academic research, focusing on analytical tools, at United Technical College, supported by the University Grants Commission, Nepal, 24–26 May 2024.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMHODECEDSIET
Time Division Multiplexing (TDM) is a method of transmitting multiple signals over a single communication channel by dividing the signal into many segments, each having a very short duration of time. These time slots are then allocated to different data streams, allowing multiple signals to share the same transmission medium efficiently. TDM is widely used in telecommunications and data communication systems.
### How TDM Works
1. **Time Slots Allocation**: The core principle of TDM is to assign distinct time slots to each signal. During each time slot, the respective signal is transmitted, and then the process repeats cyclically. For example, if there are four signals to be transmitted, the TDM cycle will divide time into four slots, each assigned to one signal.
2. **Synchronization**: Synchronization is crucial in TDM systems to ensure that the signals are correctly aligned with their respective time slots. Both the transmitter and receiver must be synchronized to avoid any overlap or loss of data. This synchronization is typically maintained by a clock signal that ensures time slots are accurately aligned.
3. **Frame Structure**: TDM data is organized into frames, where each frame consists of a set of time slots. Each frame is repeated at regular intervals, ensuring continuous transmission of data streams. The frame structure helps in managing the data streams and maintaining the synchronization between the transmitter and receiver.
4. **Multiplexer and Demultiplexer**: At the transmitting end, a multiplexer combines multiple input signals into a single composite signal by assigning each signal to a specific time slot. At the receiving end, a demultiplexer separates the composite signal back into individual signals based on their respective time slots.
### Types of TDM
1. **Synchronous TDM**: In synchronous TDM, time slots are pre-assigned to each signal, regardless of whether the signal has data to transmit or not. This can lead to inefficiencies if some time slots remain empty due to the absence of data.
2. **Asynchronous TDM (or Statistical TDM)**: Asynchronous TDM addresses the inefficiencies of synchronous TDM by allocating time slots dynamically based on the presence of data. Time slots are assigned only when there is data to transmit, which optimizes the use of the communication channel.
### Applications of TDM
- **Telecommunications**: TDM is extensively used in telecommunication systems, such as in T1 and E1 lines, where multiple telephone calls are transmitted over a single line by assigning each call to a specific time slot.
- **Digital Audio and Video Broadcasting**: TDM is used in broadcasting systems to transmit multiple audio or video streams over a single channel, ensuring efficient use of bandwidth.
- **Computer Networks**: TDM is used in network protocols and systems to manage the transmission of data from multiple sources over a single network medium.
### Advantages of TDM
- **Efficient Use of Bandwidth**: TDM allows multiple signals to share a single channel, making efficient use of the available transmission bandwidth.
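As a rough illustration of the slot-interleaving idea described above, the following Python sketch multiplexes four toy streams into one composite signal and recovers them again; the stream contents, function names, and one-sample slot size are illustrative assumptions, not part of any standard.

```python
# Minimal synchronous TDM sketch: interleave fixed-size time slots from
# several equal-length input streams into one composite signal, then
# recover the original streams at the receiving end.

def tdm_multiplex(streams, slot_size=1):
    """Round-robin interleave equal-length streams into one composite signal."""
    frame_count = len(streams[0]) // slot_size
    composite = []
    for f in range(frame_count):
        for s in streams:  # one slot per stream per frame
            composite.extend(s[f * slot_size:(f + 1) * slot_size])
    return composite

def tdm_demultiplex(composite, n_streams, slot_size=1):
    """Separate the composite signal back into its individual streams."""
    streams = [[] for _ in range(n_streams)]
    slot = 0
    for i in range(0, len(composite), slot_size):
        streams[slot % n_streams].extend(composite[i:i + slot_size])
        slot += 1
    return streams

signals = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
mux = tdm_multiplex(signals)            # [1, 2, 3, 4, 1, 2, 3, 4, ...]
assert tdm_demultiplex(mux, 4) == signals
```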
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Introduction – e-waste: definition, sources of e-waste, hazardous substances in e-waste, effects of e-waste on environment and human health, need for e-waste management, e-waste handling rules, waste minimization techniques for managing e-waste, recycling of e-waste, disposal and treatment methods of e-waste, mechanism of extraction of precious metals from leaching solution, global scenario of e-waste, e-waste in India, case studies.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decomposition and Deep Neural Network
BOHR International Journal of Internet of Things,
Artificial Intelligence and Machine Learning
2023, Vol. 2, No. 1, pp. 1–10
DOI: 10.54646/bijiam.2023.11
www.bohrpub.com
METHODS
Emotion recognition based on speech signals by combining
empirical mode decomposition and deep neural network
Shing-Tai Pan1*, Ching-Fa Chen2 and Chuan-Cheng Hong3
1Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, R.O.C
2Department of Electronic Engineering, Kao Yuan University, Kaohsiung, Taiwan, R.O.C
3Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, R.O.C
*Correspondence:
Shing-Tai Pan,
stpan@nu.edu.tw
Received: 05 January 2023; Accepted: 17 January 2023; Published: 30 January 2023
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is
applied in this paper for the extraction of emotional features from speeches, and a deep neural network (DNN)
is used to classify speech emotions. This paper enhances the emotional components in speech signals by using
EMD with acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates
of emotions from speeches using the classifier DNN. In this paper, EMD is first used to decompose the speech
signals, which contain emotional components into multiple intrinsic mode functions (IMFs), and then emotional
features are derived from the IMFs and are calculated using MFCC. Then, the emotional features are used to train
the DNN model. Finally, a trained model that could recognize the emotional signals is then used to identify emotions
in speeches. Experimental results reveal that the proposed method is effective.
Keywords: speech emotion recognition, empirical mode decomposition, deep neural network, mel-scale
frequency cepstral coefficients, hidden markov model
1. Introduction
People can, most of the time, sense precisely the emotion
of a speaker during communication. For example, people
can detect an angry emotion from a loud harsh voice and
a happy emotion from a voice full of laughter. This means
that people can easily get information about the mood of
a person simply by listening to them. In fact, emotion is a
piece of vital information that speech signals carry apart from
the verbal corpus (1). Human-computer interface (HCI)
that can automatically detect emotions from the speech is
then reasonable and promising. Recently, studies concerning
automatic emotion recognition from speech have attracted a lot of
attention. These studies span many fields,
for example, psychology, sociology, biomedical science, and
education. All of them focus on the impact of emotion on
health and on how to recognize a person's emotional state
from speech. Speech is the most important medium for
these studies due to the following reasons: (a) the availability
of fast computing systems, (b) the effectiveness of various
signal processing algorithms, and (c) the acoustic differences
in speech signals that are naturally embedded in various
emotional situations (2).
Computing ability has enormously improved in this
decade. Hence, it becomes possible and innovative to
develop a system with machine learning methods or
deep learning methods, which can recognize automatically
people’s emotions from their speeches. There is some
literature on this topic; refer, for example, to
papers (1–6). In the paper (1), the Fuzzy Rank-Based
Ensemble of Transfer Learning Model is used for speech
emotion recognition. In the paper (2), the empirical
mode decomposition (EMD) is applied to decompose
speech signals and obtain non-linear features for emotion
recognition. This paper (3) uses Hidden Markov Model
(HMM) for speech emotion recognition. A hybrid system
of using signals from faces and voices to recognize people’s
emotions is proposed in the paper (4). An exploration of
various models and speech features for speech emotion
recognition is introduced in papers (5, 6). However,
according to the studies (7, 8), the main topics of automatic
emotion recognition from speeches include the selection of
a database, feature extraction problems, and development
of recognition algorithms (8). The exploration of the
recognition algorithm is an important issue in the emotion
recognition problem. There are some algorithms used for
this application, for example, HMM (9), Support Vector
Machine (SVM) (10, 11), Gaussian Mixture Model (GMM)
(12), K-Nearest Neighbors (KNN) (13), and Artificial Neural
Network (ANN) (14). A method of combining speech and
image for emotion recognition is explored in (15). However,
this method will take more computation time, and more
hardware resources are required. Hence, this paper focuses
on emotion recognition based on speech signals. Since
ANN mimics the architecture of neurons in organisms
to process signals, it has some advantages over the other
methods: excellent fault tolerance, good learning
ability, and suitability for nonlinear regression problems.
Hence, this paper adopts ANN as an emotion recognition
algorithm. A supervised ANN, Deep Neural Network
(DNN), is used in this paper for training the emotion model
of speeches and then recognizing emotions from speeches.
The objective of this paper is to improve the emotion
recognition rates based on speech signals by applying
deep learning methods due to the massive progress in
the capability of deep learning methods in recent years.
A DNN will be adopted in this paper for this purpose. First,
this paper applies the EMD method to improve emotional
feature extraction. The weighted IMFs decomposed by using
EMD will be summed to obtain the emotional features
from speeches. The weights for the IMFs are designed by
using genetic algorithms. The weighted sum of IMFs is
then calculated to extract MFCC features. The MFCCs are
then used to train classifiers for emotion recognition. As
to the classifier, since HMM has been used for decades for
speech recognition and has been successfully applied in many
applications of speech recognition, this paper will use the
emotion recognition results from HMM for comparison.
Besides, for the purpose of saving computation time and
using fewer hardware resources, the DNN architecture
will be designed to be as simple as possible to achieve
better emotion recognition rates compared with those
obtained by using HMM.
The organization of this paper is as follows. For readability,
Section 2 will briefly introduce the preprocessing and
feature extraction methods for speeches in this paper. The
EMD method used for extracting emotional features is
introduced in Section 3. The classifiers HMM and DNN are
then introduced in Section 4 and Section 5, respectively.
The experimental results of emotion recognition using the
proposed methods are revealed in Section 6. Finally, Section
7 concludes the paper.
2. Preprocessing and feature
extraction for speech
2.1. Framing speech
Speech signals are non-stationary signals and vary with time.
It is necessary for speech signal processing to divide a speech
into several short blocks to get more stationary signals.
Hence, frames are taken from speech signals at the first step.
The extracted frames always overlap so that each
frame contains some information from the previous one. Different rates of
overlap affect the features of the speech signals,
so some experiments are required to choose a suitable
overlapping rate. A frame with 256 points is used in this
paper. That is, for a speech with 8 kHz sampling rates and 1
second length of time, there will be about 32 frames obtained
from framing the speech. In this paper, uniform sampling of
speech signals is used since it is more robust and less biased
than non-uniform sampling (16). However, since emotional
speeches are always different in length of time, there are
different numbers of frames after framing different speeches.
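To make the framing step concrete, here is a minimal Python sketch under the settings above: 256-point frames taken from an 8 kHz signal. The hop of 128 samples (50% overlap) is an illustrative assumption, since the paper notes that a suitable overlapping rate must be chosen experimentally; the count of about 32 frames per second quoted above corresponds to non-overlapping frames.

```python
import numpy as np

def frame_signal(speech, frame_len=256, hop=128):
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + max(0, (len(speech) - frame_len) // hop)
    return np.stack([speech[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])

speech = np.random.randn(8000)   # one second of toy signal at 8 kHz
frames = frame_signal(speech)
print(frames.shape)              # (61, 256) with a 128-sample hop
```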
2.2. Speech preemphasis
As speech signals propagate through the air, their high-frequency
components are attenuated more than their low-frequency
components. Hence, a high-pass finite-impulse-response
(FIR) filter is applied to speech signals to enhance the high-
frequency components. A high-pass filter can be described as
follows (8):
$$S_{pe}(n) = S_{of}(n) - 0.97 \times S_{of}(n-1), \quad 1 \le n \le N \tag{1}$$
in which $S_{pe}(n)$ is the output of the FIR filter, $S_{of}(n)$ is the original speech signal, and $N$ is the number of points in a frame.
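The filter in equation (1) amounts to a one-line difference operation; a minimal NumPy sketch, with the first sample left unchanged, could be:

```python
import numpy as np

def preemphasis(frame, alpha=0.97):
    """Equation (1): subtract 0.97 times the previous sample from each
    sample, boosting the attenuated high-frequency components."""
    out = np.copy(frame)
    out[1:] = frame[1:] - alpha * frame[:-1]
    return out
```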
2.3. Applying hamming window
Fourier transform is used commonly to calculate features of
the speeches. However, due to the discontinuity at the start
and at the end of a frame, high-frequency noisy signals may
occur when the Fourier transform is taken on the frame. To
solve this problem, a Hamming window is applied to
the frames to reduce these noise effects. The
Hamming window is described by the following equation (9):
$$W(n) = \begin{cases} 0.54 - 0.46 \cos\left( \dfrac{2\pi n}{N-1} \right), & 0 \le n \le N-1 \\[2pt] 0, & \text{otherwise} \end{cases} \tag{2}$$
FIGURE 3 | Flowchart for the training of HMM.
FIGURE 4 | The structure of the artificial neural network (ANN).
TABLE 1 | Description of the Berlin emotional database.

Name of database: Berlin emotional database
Language: German
Speakers: 5 male and 5 female (10 in total)
Sentences: 5 long and 5 short (10 in total)
Emotions (7 types): anger, joy, sadness, fear, disgust, boredom, neutral
Recording facility: Sennheiser MKH 40-P48 microphone; Tascam DA-P1 portable DAT recorder
Data acquisition: sampling rate 16 kHz; resolution 16 bits; channel: mono; format: wav
Number of files: 535
where $N$ is the number of points in a frame, and
$$F(n) = W(n) \times S(n) \tag{3}$$
in which $S(n)$ is the $n$th point in a frame and $F(n)$ is the resulting signal after applying the Hamming window to the frame.
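Equations (2) and (3) translate directly into a short sketch; the window below is written out to mirror equation (2), although numpy.hamming produces the same coefficients:

```python
import numpy as np

def apply_hamming(frame):
    """Equations (2)-(3): build the Hamming window and apply it pointwise
    to soften the discontinuities at the frame edges."""
    N = len(frame)
    n = np.arange(N)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))
    return window * frame        # F(n) = W(n) * S(n)
```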
2.4. Fast fourier transform
To calculate the Mel-Frequency Cepstral Coefficients
(MFCC) for a frame, the speech signals must be
represented in the frequency domain. Since the speech
signals are initially represented in the time domain,
fast Fourier transform (FFT) will be applied to the
frames to transform them into frequency-domain
representation. FFT can be described as follows
(10):
$$X_k = \sum_{n=0}^{N-1} S_w(n) \times W_N^{kn}, \quad 0 \le k \le N-1 \tag{4}$$
$$W_N = e^{-j2\pi/N} \tag{5}$$
2.5. Mel-frequency cepstral coefficients
The feature Mel-frequency cepstrum simulates the reception
properties of human ears. The MFCC will be calculated for
each frame. To calculate MFCC, FFT is applied to speech
frames first. Then, a Mel triangular band-pass filter is applied
to the results of the FFT, $X_k$. The Mel triangular band-pass
filter is described by the following equation:
$$B_m(k) = \begin{cases} 0, & k < f_{m-1} \\[2pt] \dfrac{k - f_{m-1}}{f_m - f_{m-1}}, & f_{m-1} \le k \le f_m \\[2pt] \dfrac{f_{m+1} - k}{f_{m+1} - f_m}, & f_m \le k \le f_{m+1} \\[2pt] 0, & k > f_{m+1} \end{cases} \tag{6}$$
where M denotes the number of filters and 1 ≤ m ≤ M. The
logarithm is then taken on the summation of the product
of the frequency representation $X_k$ and the Mel triangular
band-pass filter $B_m(k)$ as follows:
$$Y(m) = \log\left( \sum_{k=f_{m-1}}^{f_{m+1}} \left| X_k \right| B_m(k) \right). \tag{7}$$
FIGURE 5 | The steps of the experiments.
TABLE 2 | Parameters and activation function for the deep neural network (DNN).

Model: DNN
No. of hidden layers: 5
No. of neurons per hidden layer: 13
Activation function: hyperbolic tangent
Iterations: 100,000
Goal of error: 10^(-15)
Learning rate: 0.1
Then, the discrete cosine transform is applied to Y(m) as
follows:
$$c_x(n) = \frac{1}{M} \sum_{m=1}^{M} Y(m) \times \cos\left( \frac{\pi n \left( m - \frac{1}{2} \right)}{M} \right) \tag{8}$$
in which $c_x(n)$ is the MFCC. In this paper, the first 13
coefficients of $c_x(n)$ are calculated and then formed into
a feature vector. The MFCCs of the frames of training
speeches are used to train the DNN, and those of testing speeches
are tested with the trained DNN.
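The feature chain of Sections 2.4 and 2.5 (FFT magnitude, Mel filter bank, log energies, and DCT) can be sketched in Python as follows. The filter count M = 26 is an illustrative assumption, as the paper does not state it; the DCT follows equation (8), keeping the first 13 coefficients as described above.

```python
import numpy as np

def mel_filterbank(M=26, n_fft=256, fs=8000):
    """Triangular filters of equation (6) with Mel-spaced center bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0), mel(fs / 2), M + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / fs).astype(int)
    fb = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        f_lo, f_c, f_hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(f_lo, f_c):                # rising slope
            fb[m - 1, k] = (k - f_lo) / (f_c - f_lo)
        for k in range(f_c, f_hi):                # falling slope
            fb[m - 1, k] = (f_hi - k) / (f_hi - f_c)
    return fb

def mfcc(frame, n_coeffs=13, fb=mel_filterbank()):
    spectrum = np.abs(np.fft.rfft(frame))         # |X_k|, equation (4)
    energies = np.log(fb @ spectrum + 1e-10)      # Y(m), equation (7)
    M = len(energies)
    m = np.arange(1, M + 1)
    # DCT of equation (8); only the first 13 coefficients are kept
    return np.array([np.sum(energies * np.cos(np.pi * n * (m - 0.5) / M)) / M
                     for n in range(n_coeffs)])
```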
3. Empirical mode decomposition
In this paper, EMD is used to decompose emotional
speech signals into various emotional components, which are
defined as intrinsic mode functions (IMFs). An IMF must
satisfy the following two conditions (17):
(1) The number of local extremes and the number of
zero-crossings differ at most by one.
(2) Upper and lower envelopes of the
function are symmetric.
The steps of EMD are described as follows. Note
that, in this paper, the cubic spline (17) is adopted to
construct the upper and lower envelopes of the signals
in the process of deriving IMFs. Let the original signal be X(t)
and Temp(t) = X(t).
Step 1:
Find the upper envelope U(t) and lower envelope L(t)
of the signal Temp(t). Calculate the mean of the two
envelopes, m(t) = [U(t) + L(t)]/2. The intermediate signal
TABLE 3 | Recognition rates using DNN without empirical mode decomposition (EMD) for 7 emotions based on 10-fold validation.
Experiment Anger (%) Joy (%) Sadness (%) Fear (%) Disgust (%) Boredom (%) Neutral (%)
1 65.00 70.00 85.00 80.00 95.00 40.00 45.00
2 90.00 55.00 80.00 65.00 90.00 75.00 65.00
3 75.00 75.00 90.00 60.00 95.00 45.00 40.00
4 60.00 50.00 80.00 10.00 100.00 15.00 60.00
5 0 90.00 0 85.00 100.00 100.00 55.00
6 75.00 70.00 90.00 35.00 95.00 20.00 28.57
7 65.00 35.00 70.00 10.00 100.00 25.00 25.00
8 70.00 70.00 100.00 75.00 100.00 60.00 65.00
9 75.00 70.00 90.00 70.00 95.00 75.00 60.00
10 80.00 75.00 85.00 75.00 100.00 55.00 60.00
Avg. 65.50 66.00 77.00 56.50 97.00 51.00 50.36
TABLE 4 | Recognition rates using DNN with EMD for 7 emotions based on 10-fold validation.
Experiment Anger (%) Joy (%) Sadness (%) Fear (%) Disgust (%) Boredom (%) Neutral (%)
1 55.00 80.00 80.00 80.00 95.00 60.00 40.00
2 85.00 80.00 80.00 60.00 90.00 75.00 70.00
3 90.00 70.00 80.00 60.00 95.00 50.00 45.00
4 80.00 70.00 90.00 20.00 100.00 20.00 60.00
5 0 75.00 0 85.00 100.00 100.00 60.00
6 65.00 70.00 85.00 25.00 95.00 25.00 28.57
7 65.00 50.00 65.00 10.00 100.00 15.00 45.00
8 75.00 75.00 100.00 70.00 95.00 70.00 70.00
9 85.00 75.00 90.00 70.00 95.00 75.00 55.00
10 95.00 85.00 95.00 85.00 100.00 60.00 65.00
Avg. 69.50 73.00 76.50 56.50 96.50 55.00 53.86
h(t) is calculated as follows:
$$h(t) = \mathrm{Temp}(t) - m(t) \tag{9}$$
Step 2:
Check whether the intermediate signal h(t) satisfies the
conditions of an IMF. If it does, the first IMF is
obtained as imf1(t) = h(t) and we move to the next
step; otherwise, we assign the intermediate signal h(t) to Temp(t) and
go back to Step 1.
Step 3:
Calculate the residue r1(t) as follows:
$$r_1(t) = \mathrm{Temp}(t) - \mathrm{imf}_1(t). \tag{10}$$
Assign the signal $r_1(t)$ as $X(t)$ and repeat Step 1 and Step 2 to
find $\mathrm{imf}_2(t)$.
Step 4:
Repeat Step 1 to Step 3 to find the subsequent IMFs as
follows:
$$r_n(t) = r_{n-1}(t) - \mathrm{imf}_n(t), \quad n = 2, 3, 4, \ldots \tag{11}$$
If the signal rn(t) is constant or a monotone function,
then the EMD procedure is completed. Then, the following
decomposition of X(t) is obtained as follows:
$$X(t) = \sum_{i=1}^{n} \mathrm{imf}_i(t) + r_n(t) \tag{12}$$
The flowchart of EMD is depicted in Figure 1. In this paper,
a weighted sum of the IMFs is proposed to recover the emotional
components and is written as the following equation (13).
The values of the weights $w_i$ will be set according to the results
in (7).
$$X'(t) = \sum_{i=1}^{n} w_i \cdot \mathrm{imf}_i(t) \tag{13}$$
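A simplified Python sketch of the sifting procedure in Steps 1-4 and of the weighted recombination in equation (13) is given below. Using a fixed number of sifting passes in place of the formal IMF test, and SciPy's spline and extrema routines, are assumptions made for brevity.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift(x, n_sifts=10):
    """Steps 1-2: repeatedly subtract the mean of the cubic-spline
    envelopes (equation (9)); a fixed pass count replaces the IMF test."""
    h = x.copy()
    t = np.arange(len(x))
    for _ in range(n_sifts):
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper = CubicSpline(maxima, h[maxima])(t)   # U(t)
        lower = CubicSpline(minima, h[minima])(t)   # L(t)
        h = h - (upper + lower) / 2.0               # h(t) = Temp(t) - m(t)
    return h

def emd(x, max_imfs=6):
    """Steps 3-4: peel off IMFs until the residue is (nearly) monotone."""
    imfs, residue = [], x.copy()
    for _ in range(max_imfs):
        imf = sift(residue)
        imfs.append(imf)
        residue = residue - imf                     # equations (10)-(11)
        if len(argrelextrema(residue, np.greater)[0]) < 2:
            break
    return imfs, residue

def weighted_recombine(imfs, weights):
    """Equation (13): weighted sum of IMFs recovering emotional content."""
    return sum(w * imf for w, imf in zip(weights, imfs))
```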
4. Hidden markov model
In this paper, a discrete HMM is used as a comparison with
the proposed DNN method. The feature MFCCs that are
extracted from the speech signals after EMD processing are
used to train HMM and then for testing. The MFCC features
of speech signals are arranged as a time series according
to the order of frames obtained from framing each speech
signal. The time series of MFCCs is treated as the observation
of the HMM model, and the hidden states of the model
will be estimated using the Viterbi Algorithm (18). Figure 2
shows the mechanism of the HMM model with the features,
observations, and hidden states. The parameters in HMM
λ = (A, B, π) are explained as follows (18–20):
$A = [a_{ij}]$, $a_{ij} = P(q_t = x_j \mid q_{t-1} = x_i)$: the probability that hidden state $x_i$ transfers to hidden state $x_j$;
$B = [b_j(k)]$, $b_j(k) = P(o_t = v_k \mid q_t = x_j)$: the probability that the $k$th observation $v_k$ occurs at the $j$th hidden state $x_j$;
$\pi = [\pi_i]$, $\pi_i = P(q_1 = x_i)$: the probability that hidden state $x_i$ occurs at the start of the time series;
$X = (x_1, x_2, \cdots, x_N)$: the hidden states of the HMM.
The training process for modeling HMM is depicted in
Figure 3. The initial values of matrices A, B, and π are
given randomly. A trained codebook is then used to quantize
MFCC features. The matrices A, B, and π are then updated
using the Viterbi Algorithm (18). This process will repeat
until the parameters in A, B, and π converge. Then, the
training process for HMM is completed.
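To illustrate the decoding step, the following minimal Python sketch recovers the most likely hidden-state path with the Viterbi algorithm, given trained parameters (A, B, π) and a sequence of codebook indices; the re-estimation loop of Figure 3 is omitted, and the array shapes are assumptions that follow the definitions above.

```python
import numpy as np

def viterbi(A, B, pi, observations):
    """Most likely state path for a discrete HMM. A is (N, N), B is
    (N, K) over K codebook symbols, pi is (N,); probabilities must be
    positive so the logarithms are finite."""
    N, T = A.shape[0], len(observations)
    delta = np.zeros((T, N))              # best log-probability so far
    psi = np.zeros((T, N), dtype=int)     # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, observations[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]
        psi[t] = np.argmax(scores, axis=0)           # best predecessor of j
        delta[t] = scores[psi[t], np.arange(N)] + np.log(B[:, observations[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                    # trace back
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```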
5. Deep neural network
The architecture of an ANN is constructed based on
connections of multilayer perceptrons. Each layer
comprises several neurons. The transition of signals between
neurons is similar to that between neurons in an organism.
In this paper, a deep neural network structure, i.e., the
network with several hidden layers, is proposed and will be
TABLE 5 | Comparison of recognition rates for 10-fold experiments using DNN with and without EMD.
Set 1 (%) Set 2 (%) Set 3 (%) Set 4 (%) Set 5 (%) Set 6 (%) Set 7 (%) Set 8 (%) Set 9 (%) Set 10 (%) Avg. (%)
Without EMD 68.571 74.286 68.571 53.571 61.429 58.865 47.143 77.143 76.429 75.714 66.172
With EMD 70.0 77.143 70.0 62.857 60.0 56.028 50.0 79.286 77.857 83.571 68.674
TABLE 6 | Comparison of recognition rates for 7 emotions using DNN with and without EMD.
Anger (%) Joy (%) Sadness (%) Fear (%) Disgust (%) Boredom (%) Neutral (%) Avg. (%)
Without EMD 65.50 66.00 77.00 56.50 97.00 51.00 50.36 66.19
With EMD 69.50 73.00 76.50 56.50 96.50 55.00 53.86 68.69
Moreover, the results gained by the proposed method are compared with those in ref. (9), in which HMM was used. Table 7 shows the results of the comparison. The
proposed method has higher recognition rates at most folds of the experiments, as well as in the average recognition rate of the 10-fold experiments, in both cases, whether EMD is used or
not. This verifies the performance of the proposed method.
compared with HMM. This structure allows us to train the
neural network deeply. The structure of a three-layer ANN is
shown in Figure 4 (21).
In Figure 4, $P_n$ is the $n$th input; $w_{i,j}^k$ is the weight between the $i$th neuron in the $(k-1)$th layer and the $j$th neuron in the $k$th layer; $o_n^k$ is the $n$th output in the $k$th layer; and $b_n^k$ is the bias of the $n$th neuron in the $k$th layer.
After completing the calculation of MFCC for all
emotional speech signals, we started to train ANN using the
MFCCs obtained from training speech signals. The MFCC
with a label of emotion will be fed to the input of ANN, and
then the output of the ANN is used to compute the errors
with respect to the label of MFCC in the input. The output
function of ANN is described in equations (14) and (15).
$$o_n^k = f\left( \sum_{\forall i} w_{i,n}^k \, o_i^{k-1} - b_n^k \right) \tag{14}$$
$$f(x) = \tanh(x) \tag{15}$$
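Equations (14) and (15) amount to the following forward pass; the layer sizes follow Table 2 (five hidden layers of 13 neurons), while the random weights below are placeholders rather than trained values.

```python
import numpy as np

def layer_forward(o_prev, W, b):
    """Equation (14) with f = tanh: weighted sum of the previous layer's
    outputs minus the bias, passed through the activation."""
    return np.tanh(o_prev @ W - b)

rng = np.random.default_rng(0)
x = rng.standard_normal(13)          # one 13-dimensional MFCC vector
out = x
for _ in range(5):                   # five hidden layers, as in Table 2
    out = layer_forward(out, rng.standard_normal((13, 13)),
                        rng.standard_normal(13))
```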
The back-propagation algorithm with the steepest descent
method (SDM) is used to train ANN by updating the weights
based on the error function between the output and the
goal to find the optimal parameters of ANN. The back-
propagation algorithm is described in equations (16)-(21).
$$\delta_n = (T_n - o_n^2) \times f'\left( \sum_i w_{i,n}^2 \, o_i^1 - b_n^2 \right) \tag{16}$$
$$\Delta W_{i,n}^2 = \eta \times \delta_n \times o_i^1 \tag{17}$$
$$\Delta b_n^2 = -\eta \times \delta_n \tag{18}$$
$$\delta_i = \left( \sum_n \delta_n w_{i,n}^2 \right) \times f'\left( \sum_r w_{r,i}^1 P_r - b_i^1 \right) \tag{19}$$
TABLE 7 | Comparison of the results by the proposed method and
those by the Hidden Markov Model (HMM) (9).
Set HMM (%) HMM+EMD (%) DNN (%) DNN+EMD (%)
1 57.14 62.86 68.571 70.0
2 56.43 67.14 74.286 77.143
3 55.71 67.86 68.571 70.0
4 47.86 54.29 53.571 62.857
5 66.43 76.43 61.429 60.0
6 47.14 58.57 58.865 56.028
7 39.29 52.86 47.143 50.0
8 64.29 74.29 77.143 79.286
9 69.29 73.57 76.429 77.857
10 65.71 77.14 75.714 83.571
Avg. 56.93 66.50 66.172 68.674
$$\Delta W_{r,i}^1 = \eta \times \delta_i \times P_r \tag{20}$$
$$\Delta b_i^1 = -\eta \times \delta_i. \tag{21}$$
where $T_n$ is the target output of the ANN and $\eta$ is
the learning step.
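For a single hidden layer, equations (16)-(21) correspond to one steepest-descent update such as the sketch below; the derivative f'(x) = 1 - tanh²(x) is used, and the toy dimensions (13 inputs, 13 hidden neurons, 7 outputs for the 7 emotions) are assumptions.

```python
import numpy as np

def backprop_step(P, T, W1, b1, W2, b2, eta=0.1):
    """One update of equations (16)-(21) for input P and target T."""
    o1 = np.tanh(P @ W1 - b1)                 # hidden layer output
    o2 = np.tanh(o1 @ W2 - b2)                # network output
    d_out = (T - o2) * (1.0 - o2 ** 2)        # equation (16); tanh' = 1 - tanh^2
    d_hid = (W2 @ d_out) * (1.0 - o1 ** 2)    # equation (19)
    W2 += eta * np.outer(o1, d_out)           # equation (17)
    b2 += -eta * d_out                        # equation (18)
    W1 += eta * np.outer(P, d_hid)            # equation (20)
    b1 += -eta * d_hid                        # equation (21)
    return W1, b1, W2, b2

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((13, 13)), rng.standard_normal(13)
W2, b2 = rng.standard_normal((13, 7)), rng.standard_normal(7)
P, T = rng.standard_normal(13), np.eye(7)[2]  # toy input, one-hot target
backprop_step(P, T, W1, b1, W2, b2)
```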
6. Experimental results
The experiments conducted in this paper are performed
on a personal computer (PC), and the algorithms for
the experiments are implemented using MATLAB. The
emotional speech database used for the experiments in this
paper is the Berlin emotional database (22). The database was
recorded in German by 10 professional actors,
5 male and 5 female. All the
speeches in this database are sampled at 8 kHz with a 16-bit
resolution, in .wav format. The details of the Berlin emotional
database are described in Table 1. A 10-fold cross-validation
method is adopted for this experiment.
The experiments conducted in this paper are performed by
the following steps:
1. Split the dataset from the Berlin emotional database
into a training dataset and a testing dataset.
2. Separate all the speeches in the training dataset and
testing dataset into various IMFs and recombine the
IMFs with the weights in ref. (9).
3. Calculate MFCC for the results in Step 2.
4. Train the DNN model and the HMM model using
the feature MFCC obtained in Step 3.
5. Repeat Step 1 to Step 4 until the recognition rate
meets the goal set in this experiment.
6. Use the model trained in Step 5 for testing.
The steps of performing the experiments in this paper are
described in Figure 5.
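Schematically, the 10-fold loop of Steps 1-6 could be organized as follows, where train_dnn and evaluate are hypothetical stand-ins for the model-specific routines described above:

```python
import numpy as np

def ten_fold(X, y, train_dnn, evaluate, k=10):
    """10-fold cross-validation over feature matrix X (one MFCC-based
    vector per utterance) and emotion labels y."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, k)
    rates = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_dnn(X[train], y[train])             # Step 4
        rates.append(evaluate(model, X[test], y[test]))   # Step 6
    return np.mean(rates)
```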
In this experiment, the structure of DNN, activation
functions, and settings for the experiment are described in
Table 2. The number of hidden layers is set to 5 to fulfill
a deep learning architecture. A commonly used activation
function, the hyperbolic tangent, is adopted here.
The results of the experiments with and without EMD
will be compared to verify the advantage of applying EMD
in the experiments. First, the experimental numerical results
without EMD for the 7 emotions in the Berlin database
with 10-fold validation are shown in Table 3. The average
recognition rates of the 7 emotions range from 50.36% to 97%.
The recognition rates for some emotions are high, especially
the emotions “disgust” and “sadness”. This is because the
features of these two emotional speeches are distinct while
the emotions “fear”, “boredom,” and “neutral” have similar
features to each other. Hence, the recognition rates are
relatively low. Then, EMD is applied for emotion extractions
of speeches. The experimental results for the 7 emotions in
the Berlin database with 10-fold validations are shown in
Table 4.
The comparisons of the recognition rates between the
DNN with and without EMD for 10-fold experiments are
shown in Table 5. It can be seen from the table
that the recognition rates of most runs and
the average recognition rate in the experiment are better
when EMD is applied. Moreover, according to Table 6, when
EMD is applied for extractions of emotion components,
better recognition rates are gained for emotions “anger”,
“joy”, “boredom,” and “neutral”. The recognition rate for the
emotion “fear” remains the same. The emotions “sadness”
and “disgust” have slightly lower recognition rates. The
average recognition rate is better than that without using
EMD. Please refer to Table 6 for more
details. These experimental results verify the advantage of
using EMD to extract emotional components.
7. Conclusion
In this paper, EMD is applied to extract the emotional
features from speeches. The experimental results of this
paper reveal that the emotion recognition rates are better
for both classifiers, i.e., HMM and DNN, after applying
EMD for emotional feature extractions. However, according
to Table 6, EMD does not work well for two emotions,
i.e., “sadness” and “disgust”. It is likely that the features of
the two emotions are similar, and EMD cannot effectively
distinguish them. In the future work, some advanced EMD,
such as Ensemble EMD, may be used to get a better
extraction of emotional features from emotional speeches
and hence get better emotion recognition rates. Besides,
in this paper, a simple DNN is designed to get better
recognition rates than those gained by using HMM in
both cases, whether EMD is applied or not. According to
Table 7, the improvement in recognition rate over HMM is about 10%
when EMD is not applied and about 2% when it is applied
to the speech signals. However, in our experiments, only a
few minutes are needed to train the HMM, while the DNN used
in this paper takes more than 40 minutes to train. Consequently,
reducing the time consumption of the DNN is still
an open problem.
Author contributions
S-TP: Conceptualization, methodology, investigation, and
writing—original draft preparation. C-FC: validation,
formal analysis, and writing—review and editing. C-CH:
software and resources.
Funding
This research was funded by the Ministry of Science
and Technology of the Republic of China, grant
number MOST 109-2221-E-390-014-MY2. This research
work was supported by the Ministry of Science and
Technology of the Republic of China under contract MOST
108-2221-E-390-018.
Conflict of interest
The authors declare that the research was conducted in the
absence of any commercial or financial relationships that
could be construed as a potential conflict of interest.
References
1. Sahoo KK, Dutta I, Ijaz MF, Wozniak M, Singh PK. TLEFuzzyNet:
fuzzy rank-based ensemble of transfer learning models for
emotion recognition from human speeches. IEEE Access. (2021)
9:166518–30.
2. Krishnan PT, Joseph Raj AN, Rajangam V. Emotion classification from
speech signal based on empirical mode decomposition and non-linear
features. Complex Intelligent Syst. (2021) 7:1919–34.
3. Schuller B, Rigoll G, Lang M. Hidden markov model-based speech
emotion recognition. In: Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003).
Baltimore, MD: IEEE (2003). p. 1-4.
4. Fragopanagos N, Taylor JG. Emotion recognition in human-computer
interaction. Neural Networks. (2005). 18:389–405.
5. Cen L, Ser W, Yu ZL, Cen W. Automatic recognition of emotional states
from human speeches. In: Herout A, editor. Pattern Recognition Recent
Advances. London: IntechOpen (2011).
6. Wu S, Falk TH, Chan WY. Automatic speech emotion recognition
using modulation spectral features. Speech Commun. (2011)
53:768–85.
7. Picard RW. Affective Computing. Cambridge, MA: MIT press (1997).
8. Basu S, Chakraborty J, Bag A, Aftabuddin M. A review on emotion
recognition using speech. In: Proceedings of the International Conference
on Inventive Communication and Computational Technologies
(ICICCT). Coimbatore (2017). p. 109–14.
9. Lee YW, Pan ST. Applications of Empirical Mode Decomposition on the
Computation of Emotional Speech Features. Taiwan: National University
of Kaohsiung (2012).
10. Zhu L, Chen L, Zhao D, Zhou J, Zhang W. Emotion recognition from
chinese speech for smart affective services using a combination of SVM
and DBN. Sensors. (2017) 17:1694.
11. Trabelsi I, Bouhlel MS. Feature selection for GUMI kernel-based SVM
in speech emotion recognition. In: Information Resources Management
Association, editor. Artificial Intelligence: Concepts, Methodologies,
Tools, and Applications. Pennsylvania: IGI Global (2017). p. 941–53.
12. Patel P, Chaudhari A, Kale R, Pund M. Emotion recognition from speech
with gaussian mixture models via boosted GMM. Int J Res Sci Eng.
(2017) 3: 47–53.
13. Jo Y, Lee H, Cho A, Whang M. Emotion recognition through
cardiovascular response in daily life using KNN classifier. In: J. J.
Park, editor. Advances in Computer Science and Ubiquitous Computing.
Singapore: Springer (2017). p. 1451–6.
14. Alhagry S, Fahmy AA, El-Khoribi RA. Emotion Recognition based on
EEG using LSTM Recurrent Neural Network. Emotion. (2017) 8:355–8.
15. Ghaleb E, Popa M, Asteriadis S. Metric Learning-based multimodal
audio-visual emotion recognition. IEEE Multimedia. (2020)
27:1–8.
16. Shang Y. Subgraph robustness of complex networks under attacks. IEEE
Trans Syst Man Cybernet. (2019) 49:821–32.
17. Huang NE. The empirical mode decomposition and the Hilbert
spectrum for nonlinear and non-stationary time series analysis. Proc. R.
Soc. Lond. Ser. A Math. Phys. Eng. Sci. (1996) 454:903–95.
18. Blunsom P. Hidden Markov Model. Parkville, VIC: The University of
Melbourne (2004)
19. Pan ST, Hong TP. Robust speech recognition by DHMM with a
codebook trained by genetic algorithm. J Informat Hiding Multimedia
Signal Processing. (2012) 3:306–19.
20. Pan ST, Li WC. Fuzzy-HMM modeling for emotion detection using
electrocardiogram signals. Asian J Control. (2020) 22:2206–16.
21. Pan ST, Lan ML. An efficient hybrid learning algorithm for neural
network–based speech recognition systems on FPGA chip. Neural
Comput Appl. (2014) 24:1879–85.
22. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B. A database
of German emotional speech. Interspeech. (2005) 5:1517–20.
23. Mathews JH, Fink KD. Numerical Methods Using MATLAB. 4th ed.
Hoboken, NJ: Prentice-Hall (2004)