This document presents research on developing an Arabic speech emotion recognition system based on a convolutional neural network (CNN). The researchers propose a model called ASERS-CNN and evaluate it on an Arabic speech dataset containing recordings of four emotions. Their results show that ASERS-CNN achieves 98.18% accuracy, outperforming their previous ASERS-LSTM model, which achieved 97.44%. They also find that using five acoustic feature types and 50 training epochs yields the best ASERS-CNN performance of 98.52% accuracy.
ASERS-LSTM: Arabic Speech Emotion Recognition System Based on LSTM Model (sipij)
The swift progress in the field of human-computer interaction (HCI) has increased interest in speech emotion recognition (SER) systems. A speech emotion recognition system identifies the emotional states of human beings from their voice. There is substantial work on speech emotion recognition for various languages, but few studies address Arabic SER systems, largely because of the shortage of available Arabic speech emotion databases; the most commonly considered languages for SER are English and other European and Asian languages. Researchers have used several machine learning classifiers to distinguish emotional classes: SVMs, random forests (RFs), the KNN algorithm, hidden Markov models (HMMs), MLPs, and deep learning. In this paper we propose ASERS-LSTM, a model for Arabic speech emotion recognition based on LSTMs. We extracted five features from the speech: Mel-frequency cepstral coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid features (tonnetz). We evaluated our model on an Arabic speech dataset, the Basic Arabic Expressive Speech corpus (BAES-DB). In addition, we constructed a DNN to classify the emotions and compared the accuracy of the LSTM and DNN models: the DNN achieves 93.34% accuracy and the LSTM achieves 96.81%.
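The five feature types named in this abstract are all available in the librosa library. Below is a minimal sketch, assuming librosa defaults (40 MFCCs, 128 mel bands) and simple time-averaging of each feature; the file path, the aggregation, and the resulting dimensions are illustrative choices, not details taken from the paper.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050):
    """Return a single per-utterance vector built from five feature types."""
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).T, axis=0)
    # 40 + 12 + 128 + 7 + 6 = 193 values with librosa defaults
    return np.hstack([mfcc, chroma, mel, contrast, tonnetz])
```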
This document discusses speech-based emotion recognition using Gaussian mixture models (GMM). GMMs are statistical models that are well-suited for developing emotion recognition systems from large feature datasets. The document proposes using GMMs trained on excitation features extracted from speech signals to classify emotions into categories like happy, angry, sad, and neutral. It describes extracting excitation source features through linear predictive coding analysis to capture information about a speaker's vocal excitation source. The goal is to develop a GMM-based emotion recognition system that can classify emotions in conversations.
This document discusses an analysis of an emotion recognition system for speech signals using K-nearest neighbors (KNN) and Gaussian mixture model (GMM) classifiers. It provides background on the challenges of automatic emotion recognition from speech and describes common features extracted from speech, such as mel frequency cepstrum coefficients and prosodic features. The document outlines the process of an emotion recognition system, including feature extraction, training classifiers on a speech database, and classifying emotions. It then gives more detail on the KNN and GMM classifiers and how they were used to classify six emotional states from the Berlin emotional speech database.
This document summarizes a research paper on classifying speech using Mel frequency cepstrum coefficients (MFCC) and power spectrum analysis. The paper reviews different classifiers used for speech emotion recognition, including neural networks, Gaussian mixture models, and support vector machines. It proposes using MFCC and power spectrum features as inputs to an artificial neural network classifier to identify emotions in speech, such as anger, happiness, sadness, and neutral states. Testing is performed on emotional speech samples to evaluate the performance and limitations of the proposed speech emotion recognition system.
A critical insight into multi-languages speech emotion databases (journalBEEI)
With increased interest in human-computer and human-human interaction, systems that deduce and identify the emotional aspects of a speech signal have emerged as a hot research topic. Recent research is directed towards the development of automated and intelligent analysis of human utterances. Although numerous studies have addressed the design of systems, algorithms, and classifiers in this field, the area is still far from standardization. Considerable uncertainty remains regarding aspects such as the most influential features, the better-performing algorithms, and the number of emotion classes. Among the influencing factors, differences between speech databases, such as the data collection method, are accepted as significant by the research community. A speech emotion database is essentially a repository of varied human speech samples collected and sampled using a specified method. This paper reviews 34 speech emotion databases for their characteristics and specifications, and also highlights critical insight into their limitations.
This document summarizes Dongang Wang's speech emotion recognition project which compares feature selection and classification methods. Wang selects mel-frequency cepstral coefficients (MFCCs) and energy as features. For methods, Wang tests Gaussian mixture models (GMMs), discrete hidden Markov models (HMMs), and continuous HMMs including Kalman filters. Testing on German and English corpora, continuous HMMs achieved the best average accuracy of 61.67%, outperforming GMMs and discrete HMMs. While results are promising, Wang notes challenges in recognizing emotion across languages and speakers.
IRJET - Emotion recognition using Speech Signal: A Review (IRJET Journal)
This document provides a review of speech emotion recognition techniques. It discusses how speech emotion recognition systems work, including common features extracted from speech like MFCCs and LPC coefficients. Classification techniques used in these systems are also examined, such as DTW, ANN, GMM, and K-NN. The document concludes that speech emotion recognition could be useful for applications requiring natural human-computer interaction, like car systems that monitor driver emotion or educational tutorials that adapt based on student emotion.
Emotion Recognition Based On Audio Speech (IOSR Journals)
This document summarizes a research paper on emotion recognition based on audio speech. It discusses how acoustic features are extracted from speech signals by applying preprocessing techniques like preemphasis and framing. It describes extracting features like Mel frequency cepstral coefficients (MFCCs) that capture characteristics of the vocal tract. Support vector machines (SVMs) are used as pattern classification methods to build models for each emotion and compare test speech features to recognize emotions. The paper confirms the advantage of its audio-based emotion recognition approach through experimental results and discusses potential improvements and future work on increasing efficiency and recognizing emotion intensity.
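As a rough illustration of the preprocessing steps mentioned above (pre-emphasis and framing), the following sketch uses a 0.97 pre-emphasis coefficient and a 25 ms / 10 ms Hamming-windowed frame layout; these are common defaults, not values reported by the paper.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1] boosts high frequencies before spectral analysis
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping Hamming-windowed frames (assumes len(signal) >= one frame)."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)  # shape (n_frames, frame_len)
```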
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S... (mathsjournal)
This document summarizes a research paper that evaluates different classification methods for speech emotion recognition, including Support Vector Machine (SVM), C5.0, and a combination of SVM and C5.0 (SVM-C5.0). The paper extracts features like energy, zero crossing rate, pitch, and MFCCs from speech samples in the Berlin Emotional Speech Database, which contains utterances expressing seven emotions. These features are classified using SVM, C5.0, and SVM-C5.0, and the results show that SVM-C5.0 performs best, achieving recognition rates between 5.5-8.9% higher than SVM or C5.0 alone depending on the number of emotions.
This document presents research on emotion recognition from speech using a combination of MFCC and LPCC features with support vector machine (SVM) classification. Two databases were used: the Berlin Emotional Database and SAVEE database. MFCC and LPCC features were extracted from the speech samples and combined. SVM with radial basis function kernel achieved the highest accuracy of 88.59% for emotion recognition on the Berlin database using the combined features. Confusion matrices are presented to evaluate performance on each database.
This document summarizes a research paper that proposes a method for emotion identification in continuous speech using cepstral analysis and generalized gamma mixture modeling. The key contributions are:
1) It extracts MFCC and LPC features from speech signals to model emotions like happy, angry, boredom and sad.
2) It uses a generalized gamma distribution instead of GMM for more accurate feature extraction and classification, as GGD can model speech signal variations better.
3) An experiment is conducted on a database of 50 speakers' speech in 5 emotions, achieving over 90% recognition accuracy using the proposed MFCC-LPC features and GGD modeling.
This document discusses a proposed system for classifying audio scenes in action movies. It aims to provide scene recognition and detection by separating audio classes and obtaining better sound classification accuracy. The system extracts audio features like zero-crossing rate, short-time energy, volume root mean square, and volume dynamic range. It then uses hidden Markov models and support vector machines to classify audio scenes, labeling them as happy, miserable, or action scenes. Sound event types classified include gunshots, screams, car crashes, talking, laughter, fighting, shouting, and background crowd noise. The goal is to index and retrieve interesting events from action movies to engage viewers.
This document discusses issues in sentiment analysis and emotion extraction from text. It provides an overview of natural language processing and its applications. The document then discusses the need for sentiment analysis in areas like artificial intelligence. It proceeds to compare different techniques for emotion extraction from text, including text mining, empirical studies, emotion extraction engines, vector space models, and emotion markup languages. For each technique, it outlines the general approach and provides examples or tables to illustrate how emotions can be identified from text. However, it notes that current applications have not achieved 100% accuracy in realistic sentiment analysis.
The document summarizes Kun Zhou's PhD research on emotional voice conversion with non-parallel data at the National University of Singapore. It introduces emotional voice conversion and its challenges, including the lack of parallel training data. It then summarizes Kun's publications, which propose CycleGAN-based and VAW-GAN approaches to model prosody for speaker-dependent and independent emotional voice conversion. One publication introduces a method for transferring both seen and unseen emotional styles using a pre-trained speech emotion recognizer to describe emotional styles.
This document describes a system to help deaf and mute people communicate through sign language and voice recognition. The system uses algorithms like support vector machines and hidden Markov models to recognize hand gestures and speech. It can translate sign language into text and voice into sign language representations. The system aims to reduce communication barriers for deaf/mute communities by converting between sign language, text, and voice. It outlines the implementation process which includes steps like skin color detection, hand location detection, finger region detection, and pattern matching to recognize gestures from video input.
This is the presentation of our IEEE ICASSP 2021 paper "seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset".
VAW-GAN for disentanglement and recomposition of emotional elements in speech (KunZhou18)
The document describes a framework for emotional voice conversion using VAW-GAN that can disentangle and recompose emotional elements in speech. It proposes using VAW-GAN with continuous wavelet transform to model prosody and decompose fundamental frequency into different time scales. Conditioning the decoder on fundamental frequency is shown to improve emotion conversion performance. Experiments demonstrate the effectiveness of the approach on an English emotional speech database.
Marathi Isolated Word Recognition System using MFCC and DTW Features (IDES Editor)
This paper presents a Marathi database and an isolated word recognition system based on Mel-frequency cepstral coefficients (MFCC) and Dynamic Time Warping (DTW) as features. For feature extraction, a Marathi speech database was designed using the Computerized Speech Lab. The database consists of the Marathi vowels, isolated words starting with each vowel, and simple Marathi sentences. Each word has been repeated three times by the 35 speakers. This paper presents the comparative recognition accuracy of DTW and MFCC.
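A minimal dynamic-programming DTW sketch for comparing two MFCC sequences, as used in template matching for isolated-word recognition; this is a generic implementation, not the authors' code.

```python
import numpy as np

def dtw_distance(a, b):
    """a, b: arrays of shape (frames, coeffs). Returns the cumulative alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

Recognition then amounts to assigning a test utterance the label of the reference template with the smallest DTW distance.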
AI based character recognition and speech synthesis (Ankita Jadhao)
The document discusses an AI seminar on character recognition and speech synthesis. It describes how optical character recognition can convert scanned images or text into machine code, and speech synthesis can artificially produce human speech. It provides details on preprocessing techniques for character recognition, such as de-noising and binarization of images. It also explains the processes of text analysis, phoneme generation and prosody generation used in speech synthesis engines.
IRJET - Study of Effect of PCA on Speech Emotion Recognition (IRJET Journal)
This document discusses speech emotion recognition using principal component analysis (PCA). It analyzes speech features like mel frequency cepstral coefficients, pitch, energy, and formant frequency from the Berlin database containing emotions like anger, sadness, happiness, and fear. PCA is applied to reduce the feature dimension and decorrelate the features. A support vector machine classifier is then used to classify emotions based on the PCA-processed features. Results show that applying PCA improves the classification accuracy compared to not using PCA, with average accuracy increasing from 64.5% to 68%.
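A minimal sketch of the PCA-plus-SVM setup described above, using scikit-learn; the feature matrix X, the labels y, and the number of retained components are assumptions for illustration, not values from the paper.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pca_svm_accuracy(X, y, n_components=20):
    """X: (n_samples, n_features) acoustic features, y: emotion labels."""
    model = make_pipeline(StandardScaler(),              # normalize each feature
                          PCA(n_components=n_components),  # reduce and decorrelate
                          SVC(kernel="rbf"))              # classify emotions
    return cross_val_score(model, X, y, cv=5).mean()
```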
This document summarizes research on speaker recognition in noisy environments. It begins with an introduction discussing the goals of speaker identification and verification and their applications. It then provides details on the basic components of a speaker recognition system, including feature extraction and classification. The document focuses on methods for modeling noise, including generating multiple noisy training conditions and focusing matching on unaffected features. Experimental results are shown through snapshots of a prototype system interface that allows adding and recognizing speakers based on voice samples. The system is able to identify speakers in the presence of noise by comparing features to stored codebooks generated during training.
This paper introduces new features based on histograms of MFCC extracted from audio files to improve emotion recognition from speech. Experimental results on the Berlin and PAU databases using SVM and Random Forest classifiers show the proposed features achieve better classification results than current methods. Detailed analysis is provided on speech type (acted vs natural) and gender.
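A minimal sketch of the histogram-of-MFCC idea, assuming each cepstral coefficient's trajectory over time is summarized by a fixed-bin histogram and the histograms are concatenated into one feature vector; the bin count and normalization are illustrative, not the paper's exact feature definition.

```python
import numpy as np
import librosa

def mfcc_histogram_features(path, n_mfcc=13, bins=10):
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    feats = []
    for coeff in mfcc:                       # one histogram per cepstral coefficient
        hist, _ = np.histogram(coeff, bins=bins, density=True)
        feats.append(hist)
    return np.concatenate(feats)             # n_mfcc * bins values per utterance
```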
This document summarizes a paper on multimodal emotion recognition from speech, text, and video data. It discusses how combining multiple modalities can provide richer information than single modalities alone. It presents the IEMOCAP and CMU-MOSEI datasets and compares their modalities. Techniques for fusing modalities include early and late fusion. The paper proposes a solution that filters ineffective data, regenerates proxy features, and uses multiplicative fusion to boost stronger modalities. It evaluates the approach on the CMU-MOSEI dataset using speech, text, and video features and discusses limitations in distinguishing some emotions.
Deep Learning in practice: Speech recognition and beyond - Meetup (LINAGORA)
Presentation from our Meetup of 27 September 2017, given by our collaborator Abdelwahab HEBA: Deep Learning in practice: Speech recognition and beyond.
Emotion Detection from Voice Based Classified Frame-Energy Signal Using K-Mea... (ijseajournal)
Emotion detection is a new research area in health informatics and forensic technology. Despite some challenges, voice-based emotion recognition is becoming popular: in situations where a facial image is not available, the voice is the only way to detect the emotional or psychiatric condition of a person. However, the voice signal is highly dynamic even within a short time frame, so the voice of the same person can differ within a very subtle period of time. Therefore, this research considers two key criteria: first, the training data must be partitioned according to the emotional stage of each individual speaker; second, rather than using the entire voice signal, short significant frames can be used, which are enough to identify the emotional condition of the speaker. In this research, cepstral coefficients (CC) are used as the voice feature and a fixed-value k-means clustering method is used for feature classification. The value of k depends on the number of emotional states being evaluated; consequently, it does not depend on the volume of the experimental dataset. In this experiment, three emotional conditions (happy, angry, and sad) were detected from eight female and seven male voice signals. This methodology increased the emotion detection accuracy rate significantly compared to some recent works and also reduced the CPU time for cluster formation and matching.
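A loose illustration of the fixed-k clustering idea, assuming frame-level cepstral features are clustered with k equal to the number of target emotions (three here) and test frames vote for the nearest centroid; this is only a sketch, not the paper's exact per-speaker partitioning scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_frames(frame_features, k=3):
    """frame_features: (n_frames, n_coeffs) cepstral coefficients pooled from training data."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_features)

def assign_utterance(km, test_frames):
    # Assign each test frame to a cluster, then take the majority cluster for the utterance
    labels = km.predict(test_frames)
    return np.bincount(labels).argmax()
```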
Signal & Image Processing: An International Journal (sipij)
Signal & Image Processing: An International Journal is an open-access, peer-reviewed journal intended for researchers from academia and industry who are active in the multidisciplinary field of signal and image processing. The scope of the journal covers all theoretical and practical aspects of digital signal processing and image processing, from basic research to the development of applications.
Authors are solicited to contribute to the journal by submitting articles that illustrate research results, projects, survey works, and industrial experiences describing significant advances in the areas of signal and image processing.
Emotion recognition based on the energy distribution of plosive syllables (IJECEIAES)
We usually encounter two problems during speech emotion recognition (SER): expression and perception problems, which vary considerably between speakers, languages, and sentence pronunciations. Finding an optimal system that characterizes emotions despite all these differences is therefore a promising prospect. From this perspective, we considered two emotional databases: the Moroccan Arabic dialect emotional database (MADED) and the Ryerson audio-visual database of emotional speech and song (RAVDESS), which present notable differences in type (natural/acted) and language (Arabic/English). We proposed a detection process based on 27 acoustic features extracted from the consonant-vowel (CV) syllabic units /ba/, /du/, /ki/, /ta/ common to both databases. We tested two classification strategies: multiclass (all emotions combined: joy, sadness, neutral, anger) and binary (neutral vs. others, positive emotions (joy) vs. negative emotions (sadness, anger), sadness vs. anger). These strategies were tested three times: i) on MADED, ii) on RAVDESS, iii) on MADED and RAVDESS combined. The proposed method gave better recognition accuracy for binary classification. The rates reach an average of 78% for multiclass classification, 100% for neutral vs. others, 100% for the negative emotions (anger vs. sadness), and 96% for positive vs. negative emotions.
SPEECH EMOTION RECOGNITION SYSTEM USING RNN (IRJET Journal)
This document discusses a speech emotion recognition system using recurrent neural networks (RNNs). It begins with an abstract describing speech emotion recognition and its importance. Then it provides background on speech emotion databases, feature extraction using MFCC, and classification approaches like RNNs. It reviews related work on speech emotion recognition using various methods. Finally, it concludes that MFCC feature extraction and RNN classification was used in the proposed system to take advantage of their performance in machine learning applications. The system aims to help machines understand human interaction and respond based on the user's emotion.
Literature Review On: "Speech Emotion Recognition Using Deep Neural Network" (IRJET Journal)
The document discusses speech emotion recognition using deep neural networks. It first provides an overview of SER and the challenges in the field. It then reviews 20 research papers on the topic, finding that most use deep neural network techniques like CNNs and DNNs for model building. The papers evaluated various datasets and algorithms, with accuracy ranging from 84% to 90%. Overall limitations identified included the need for more data, handling of multiple simultaneous emotions, and improving cross-corpus performance. The literature review contributes to knowledge in using machine learning for SER.
Signal Processing Tool for Emotion Recognition (idescitation)
In the course of building modern robots, which not only perform tasks but also behave like human beings during their interaction with the natural environment, it is essential to impart to them knowledge of the emotions underlying spoken human utterances, enabling them to be consistent, whole, and complete. To this end, they must understand and identify human emotions. For this reason, emphasis is now placed on studying the emotional content of speech, and speech emotion recognition engines have accordingly been proposed. This paper surveys the main aspects of speech emotion recognition: feature extraction and the types of features commonly used, selection of the most informative features from the original feature set, and classification of the features using different techniques, with reference to databases commonly used for speech emotion recognition.
A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledg... (ijtsrd)
Suctioning is a common procedure performed by nurses to maintain gas exchange, adequate oxygenation, and alveolar ventilation in critically ill patients under mechanical ventilation, and the aim of this research is to provide knowledge regarding maintaining airway patency with suctioning care, which will help improve the quality of nursing care and eventually lead to better results. The planned study is a pre-experimental study to assess the effectiveness of a planned teaching programme on knowledge regarding airway patency in patients on mechanical ventilators among the B.Sc. internship students of a selected college of nursing at Moradabad: to assess the level of knowledge regarding maintaining airway patency in patients on mechanical ventilators among B.Sc. nursing internship students, and to assess the effectiveness of the planned teaching programme in terms of that knowledge. The purpose of this study is to examine the association between knowledge and effectiveness regarding airway patency among B.Sc. nursing internship students and their selected demographic variables. A pre-experimental study was conducted among 86 participants, selected by a non-probability convenient sampling method. A demographic proforma and a self-structured questionnaire were used to collect the data from the B.Sc. internship students. Nafees Ahmed | Sana Usmani, "A Study to Assess the Effectiveness of Planned Teaching Programme on Knowledge Regarding Maintaining Airway Patency in Patients with Mechanical Ventilator", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6, Issue-1, December 2021, URL: https://www.ijtsrd.com/papers/ijtsrd47917.pdf Paper URL: https://www.ijtsrd.com/medicine/nursing/47917/a-study-to-assess-the-effectiveness-of-planned-teaching-programme-on-knowledge-regarding-maintaining-airway-patency-in-patients-with-mechanical-ventilator/nafees-ahmed
Speech Emotion Recognition Using Neural Networks (ijtsrd)
Speech is the most natural and easy method for people to communicate, and interpreting speech is one of the most sophisticated tasks that the human brain performs. The goal of speech emotion recognition (SER) is to identify human emotion from speech, since the tone and pitch of the voice frequently reflect underlying emotions. Librosa was used to analyse audio and music, soundfile was used to read and write sampled sound file formats, and sklearn was used to create the model. The current study looked at the effectiveness of convolutional neural networks (CNN) in recognising spoken emotions. The network's input features are spectrograms of voice samples, and Mel-frequency cepstral coefficients (MFCC) are used to extract characteristics from the audio. Our own voice dataset is used to train and test the algorithms. The emotions of the speech (happy, sad, angry, neutral, shocked, disgusted) are determined based on the evaluation. Anirban Chakraborty, "Speech Emotion Recognition Using Neural Networks", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6, Issue-1, December 2021, URL: https://www.ijtsrd.com/papers/ijtsrd47958.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/47958/speech-emotion-recognition-using-neural-networks/anirban-chakraborty
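A minimal sketch of a small 2D-CNN over MFCC inputs of the kind described above; the use of tf.keras, the layer sizes, the input shape, and the six-way output are assumptions for illustration, since the summary only names the libraries and the general approach.

```python
import tensorflow as tf

def build_cnn(input_shape=(40, 174, 1), n_classes=6):
    """Small 2D-CNN over MFCC 'images' of shape (n_mfcc, frames, 1)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```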
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S... (mathsjournal)
Speech emotion recognition enables a computer system to record sounds and recognize the emotion of the speaker. We are still far from natural interaction between human and machine because machines cannot distinguish the emotion of the speaker. For this reason a new field of investigation has been established, namely speech emotion recognition systems. The accuracy of these systems depends on various factors such as the type and number of emotion states and the classifier type. In this paper, the classification methods C5.0, Support Vector Machine (SVM), and the combination of C5.0 and SVM (SVM-C5.0) are verified, and their efficiencies in speech emotion recognition are compared. The features used in this research include energy, zero crossing rate (ZCR), pitch, and Mel-scale frequency cepstral coefficients (MFCC). The results demonstrate that the proposed SVM-C5.0 classification method is more efficient in recognizing emotion than SVM or C5.0, by between 5.5% and 8.9% depending on the number of emotion states.
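A minimal sketch that computes the four feature families listed in this abstract (energy, ZCR, pitch, MFCC) with librosa; the frame parameters, the use of YIN for pitch, and the mean/standard-deviation pooling are illustrative choices, not taken from the paper.

```python
import numpy as np
import librosa

def prosodic_spectral_features(path):
    y, sr = librosa.load(path)
    energy = librosa.feature.rms(y=y)[0]                 # frame-level energy (RMS)
    zcr = librosa.feature.zero_crossing_rate(y)[0]       # zero-crossing rate per frame
    pitch = librosa.yin(y, fmin=65, fmax=400, sr=sr)     # fundamental frequency track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral coefficients
    stats = lambda v: [np.mean(v), np.std(v)]            # simple per-utterance pooling
    return np.array(stats(energy) + stats(zcr) + stats(pitch) + stats(mfcc.flatten()))
```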
Speech Emotion Recognition by Using Combinations of Support Vector Machine (S... (mathsjournal)
This document summarizes a research paper that evaluated different machine learning classifiers for speech emotion recognition using the Berlin Emotional Speech Database. The paper compared the performance of Support Vector Machine (SVM), C5.0, and a combination of SVM and C5.0 (SVM-C5.0). Features extracted from the speech data included energy, zero crossing rate, pitch, and Mel-frequency cepstral coefficients. The results showed that the proposed SVM-C5.0 method achieved 5.5% to 8.9% better emotion recognition accuracy than SVM and C5.0 alone, depending on the number of emotion states.
Human emotion recognition is an emerging research field in human-computer interaction based on facial gestures and is being used for real-time analysis in classifying cognitive-affective states from facial video data. Since computers have become an integral part of life, many researchers are using emotion recognition and classification of data based on audio and text, but these approaches offer limited accuracy and relevance in emotion classification. Therefore we have introduced and analyzed a hybrid approach, supported by the selection of audio and video data characteristics for classification, that could outperform the existing strategies. The research uses SVM for classifying data from the audio-visual SAVEE database, and the results show that the maximum classification accuracy with respect to audio data, about 91.6%, could be improved to 99.2% after applying the hybrid strategy.
Speech emotion recognition using 2D-convolutional neural network (IJECEIAES)
This research proposes a speech emotion recognition model to predict human emotions using a convolutional neural network (CNN) by learning segmented audio of specific emotions. Speech emotion recognition utilizes extracted features of audio waves to learn speech emotion characteristics; one of them is the mel frequency cepstral coefficient (MFCC). The dataset plays a vital role in obtaining valuable results in model learning, so this research leverages a combination of datasets. The model learns the combined dataset with audio segmentation and zero padding using a 2D-CNN; segmentation and zero padding equalize the extracted audio features so their characteristics can be learned. The model achieves 83.69% accuracy in predicting seven emotions (neutral, happy, sad, angry, fear, disgust, and surprise) from the combined dataset with segmentation of the audio files.
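A minimal sketch of the zero-padding step described above: each MFCC matrix is truncated or right-padded with zeros to a fixed number of frames so the 2D-CNN sees equally sized inputs; the target length of 200 frames is an illustrative value, not the paper's setting.

```python
import numpy as np

def pad_to_fixed_length(mfcc, target_frames=200):
    """mfcc: array of shape (n_mfcc, frames); returns shape (n_mfcc, target_frames)."""
    n_mfcc, frames = mfcc.shape
    if frames >= target_frames:
        return mfcc[:, :target_frames]          # truncate long segments
    padded = np.zeros((n_mfcc, target_frames), dtype=mfcc.dtype)
    padded[:, :frames] = mfcc                   # right-pad short segments with zeros
    return padded
```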
On the use of voice activity detection in speech emotion recognition (journalBEEI)
Emotion recognition through speech has many potential applications; however, the challenge is achieving high emotion recognition accuracy while working with limited resources or under interference such as noise. In this paper we explore the possibility of improving speech emotion recognition by utilizing the voice activity detection (VAD) concept. The emotional voice data from the Berlin Emotion Database (EMO-DB) and a custom-made database, the LQ Audio Dataset, are first preprocessed by VAD before feature extraction. The features are then passed to a deep neural network for classification. In this paper we have chosen MFCC as the sole feature. Comparing the results obtained with and without VAD, we found that VAD improved the recognition rate of five emotions (happy, angry, sad, fear, and neutral) by 3.7% when recognizing clean signals, while using VAD when training a network with both clean and noisy signals improved our previous results by 50%.
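A minimal sketch of an energy-threshold VAD used as a trimming step before MFCC extraction; the threshold rule (a fraction of the utterance's peak RMS) is an illustrative choice, not the VAD method used in the paper.

```python
import numpy as np
import librosa

def voice_activity_trim(y, sr, frame_length=2048, hop_length=512, ratio=0.1):
    """Keep only the span between the first and last frames whose RMS exceeds the threshold."""
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    active = rms > ratio * rms.max()            # frames above the energy threshold
    if not active.any():
        return y                                # nothing detected: fall back to the full signal
    frames = np.flatnonzero(active)
    start = frames[0] * hop_length
    end = min(len(y), (frames[-1] + 1) * hop_length + frame_length)
    return y[start:end]                         # voiced region passed on to MFCC extraction
```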
Efficient Speech Emotion Recognition using SVM and Decision Trees (IRJET Journal)
This document discusses efficient speech emotion recognition using support vector machines and decision trees. It summarizes a research paper that extracted speech features like variance, standard deviation, energy and pitch from an emotional speech corpus containing 535 speech segments expressing seven emotions. The extracted features were used to train and test an SVM classifier for emotion recognition. The classifier achieved an average accuracy of 85% across training and test sets at recognizing the seven emotions. Feature selection techniques were used to address the curse of dimensionality caused by the large number of extracted features.
Emotion Recognition through Speech Analysis using various Deep Learning Algor... (IRJET Journal)
This document summarizes a research paper on emotion recognition through speech analysis using deep learning algorithms. The researchers used datasets containing speech samples labeled with seven emotions to train and test convolutional neural network (CNN), support vector machine (SVM), recurrent neural network (RNN), and random forest models. They found that the RNN model achieved the highest testing accuracy at 75.31% for emotion recognition. The researchers concluded that speech emotion recognition systems could be useful for applications like dialogue systems, call centers, student voice reviews, and building healthy relationships.
The document describes AffectNet, a new database of over 1 million facial images collected from the internet and annotated for facial expressions, valence, and arousal. About 450,000 images were manually annotated for the presence of 7 discrete facial expressions in the categorical model and intensity of valence and arousal in the dimensional model. This makes AffectNet the largest database of facial expressions in the wild annotated for both categorical and dimensional models of affect. Two neural network baselines are used to classify images by expression in the categorical model and predict valence and arousal in the dimensional model, showing improved performance over conventional methods.
IRJET - Facial Expression Recognition System using Neural Network based on... (IRJET Journal)
This document describes a facial expression recognition system using a neural network approach. It uses the Japanese Female Facial Expressions (JAFFE) database to classify 7 facial expressions. The system extracts features using 2D discrete cosine transform (DCT), local binary patterns (LBP), and histogram of oriented gradients (HOG). These features are used to create a hybrid feature vector for each image. A single hidden layer feedforward neural network is trained on the feature vectors using different learning algorithms to classify the expressions. Experimental results show that a neural network trained with gradient descent and adaptive learning rate achieves the highest average accuracy of 97.2% for classifying expressions in the JAFFE database.
A comparative analysis of classifiers in emotion recognition thru acoustic fea... (Pravena Duplex)
This document presents a comparative analysis of different classifiers for emotion recognition through acoustic features. It analyzes prosody features like energy and pitch as well as spectral features like MFCCs. Feature fusion, which combines prosody and spectral features, improves classification performance for LDA, RDA, SVM and kNN classifiers by around 20% compared to using features individually. Results on the Berlin and Spanish emotional speech databases show that RDA performs best as it avoids the singularity problem that affects LDA when dimensionality is high relative to the number of training samples.
ASERS-CNN: ARABIC SPEECH EMOTION RECOGNITION SYSTEM BASED ON CNN MODEL
Signal & Image Processing: An International Journal (SIPIJ) Vol.13, No.1, February 2022
DOI: 10.5121/sipij.2022.13104
Mohammed Tajalsir 1, Susana Muñoz Hernández 2 and Fatima Abdalbagi Mohammed 3
1 Department of Computer Science, Sudan University of Science and Technology, Khartoum, Sudan
2 Technical University of Madrid (UPM), Computer Science School (FI), Madrid, Spain
3 Department of Computer Science, Sudan University of Science and Technology, Khartoum, Sudan
ABSTRACT
When two people talk on the phone, they cannot observe each other's facial expressions or physiological state, yet it is still possible to roughly estimate the speaker's emotional state from the voice. In medical care, if the emotional state of a patient, especially a patient with an expression disorder, can be known, different care measures can be taken according to the patient's mood to improve the care provided. A system capable of recognizing the emotional state of a human being from speech is known as a Speech Emotion Recognition (SER) system. Deep learning is one of the techniques most widely used in emotion recognition studies; in this paper we implement a CNN model for Arabic speech emotion recognition. We propose the ASERS-CNN model for Arabic Speech Emotion Recognition based on a CNN architecture. We evaluated our model using an Arabic speech dataset named the Basic Arabic Expressive Speech corpus (BAES-DB). In addition, we compare the accuracy of our previous ASERS-LSTM model and the new ASERS-CNN model proposed in this paper, and we find that the new proposed model outperforms the ASERS-LSTM model, reaching 98.18% accuracy.
KEYWORDS:
BAES-DB, ASERS-LSTM, Deep learning, Speech emotion recognition.
1. INTRODUCTION
Speech is composed of words appropriate to the situation, used for communication and to express thoughts and opinions. Emotions are mental states such as happiness, love, fear, anger, or joy that affect human behavior. It is easy for humans, using their available senses, to detect emotional states from a speaker's speech, but this is a very difficult task for machines. Research work on speech emotion recognition generally starts with the study of phonetic features in the field of phonetics, and some studies make use of linguistic characteristics, for example special vocabulary, syntax, etc. In general, all studies have shown that the performance of automatic speech emotion recognition is inferior to the ability of humans to recognize emotions.
The Arabic language is spoken by more than 300 million people. There is therefore a need to develop Arabic speech recognition systems so that the computer can not only recognize what is spoken, but also how it is spoken. However, Arabic speech emotion recognition systems are challenged by many factors, such as cultural effects and determining the suitable
features that classify the emotions. There are few Arabic speech datasets available for training Arabic speech emotion recognition models. The most difficult task related to the speech samples is finding reliable data: most datasets are not public, and most datasets related to sentiment analysis are acted or simulated rather than real, since it is very difficult to record people's real emotions. In some works related to detecting lying or stress, people can be forced to lie or be stressed, but this is very difficult for other interesting emotions such as happiness, sadness, and anger.
Deep Learning is a newer area of Machine Learning research, introduced with the objective of moving Machine Learning closer to one of its original goals: artificial intelligence.
There are few research works that have focused on Arabic emotion recognition, such as the work in [2], [3], [4], [5], [6], [7] and [8]. In [2], emotion recognition was performed on Arabic natural real-life utterances for the first time. A realistic speech corpus was collected from Arabic TV shows, and three emotions, happy, angry and surprised, were recognized. Low-level descriptors (LLDs), acoustic and spectral features, were extracted; 15 statistical functions and the delta coefficient were computed for every feature, and ineffective features were then removed using the Kruskal-Wallis non-parametric test, leading to a new database with 845 features. Thirty-five classifiers belonging to six classification groups (Trees, Rules, Bayes, Lazy, Functions and Meta) were applied over the extracted features. The Sequential Minimal Optimization (SMO) classifier (Functions group) outperformed the others, giving 95.52% accuracy.
In [3] they enhanced the emotion recognition system by proposing a new two-phase approach. The first phase aims to remove the units that were misclassified by most classifiers from the original corpora, by labelling each video unit misclassified by more than twenty-four methods as "1" and well-classified units as "0"; the second phase removes all videos labelled "1" from the original database and then re-runs all classification models over the new database. The new enhanced model improved the accuracy by 3% for all thirty-five classification models. The accuracy of the Sequential Minimal Optimization (SMO) classifier improved from 95.52% to 98.04%.
In [4], two neural architectures were built to develop an emotion recognition system for Arabic data using the KSUEmotions dataset: an attention-based CNN-LSTM-DNN model and a strong deep CNN model as a baseline. In the first emotion classifier, CNN layers are used to extract audio signal features and bi-directional LSTM (BLSTM) layers are used to model the sequential phenomena of the speech signal; these are followed by an attention layer that extracts a summary vector, which is fed to a DNN layer that finally connects to a softmax layer. The results show that the attention-based CNN-LSTM-DNN approach produces significant improvements (2.2% absolute) over the baseline system.
A semi-natural Egyptian Arabic speech emotion (EYASE) database is introduced in [5]; the EYASE database was created from an Egyptian TV series. Prosodic, spectral and wavelet features, including pitch, intensity, formants, Mel-frequency cepstral coefficients (MFCC), long-term average spectrum (LTAS) and wavelet parameters, are extracted to recognize four emotions: angry, happy, neutral and sad. Several experiments were performed to detect emotions: emotion vs. neutral classification, arousal and valence classification, and multi-emotion classification, for both speaker-independent and speaker-dependent settings. The analysis finds that gender and culture affect SER. Furthermore, for the EYASE database, anger was the most readily detected emotion while happiness was the most challenging. Arousal (angry/sad) recognition rates were shown to be superior to valence (angry/happy) recognition rates. In most cases the speaker-dependent SER performance exceeded the speaker-independent SER performance.
In [6], five ensemble models (Bagging, AdaBoost, LogitBoost, Random Subspace and Random Committee) were employed and their effect on a speech emotion recognition system was studied. The highest accuracy among all single classifiers, 95.52%, was obtained by SMO for recognizing the happy, angry, and surprised emotions from the Arabic Natural Audio Dataset (ANAD). After applying the ensemble models to 19 single classifiers, the accuracy of the SMO classifier improved from 95.52% to 95.95%, the best enhancement obtained. The boosting technique with Naïve Bayes Multinomial as base classifier achieved the highest improvement in accuracy, 19.09%.
In [7] a system was built to automatically recognize emotion in speech using a corpus of phonetically balanced Arabic expressive sentences, and the influence of speaker dependency on the results was studied. The joy, sadness, anger and neutral emotions were recognized after extracting the cepstral features, their first and second derivatives, shimmer, jitter and duration. A multilayer perceptron neural network (MLP) was used to recognize emotion. The experiments show that in the intra-speaker case the recognition rate can reach more than 98%, whereas in the inter-speaker case it reaches 54.75%; the system's dependence on the speaker is therefore obvious.
A natural Arabic audio-visual dataset was designed in [8], consisting of audio-visual recordings from the Algerian TV talk show "Red line". The dataset was recorded by 14 speakers with 1,443 complete spoken sentences. The openSMILE feature extraction tool was used to extract a variety of acoustic features (energy, pitch, ZCR, spectral features, MFCCs, and Line Spectral Frequencies (LSP)) to recognize five emotions: enthusiasm, admiration, disapproval, neutral, and joy. The min, max, range, standard deviation and mean statistical functions were applied over all the extracted features. The WEKA toolkit was used to run several classification algorithms. The experimental results show that using the SMO classifier with the energy, pitch, ZCR, spectral, MFCC and LSP feature set (430 features) achieves the best classification result (0.48), measured by a weighted average of the F-measure, and that enthusiasm is the most difficult of the five emotions to recognize.
The rest of this paper is organized as follows: Section 2 briefly describes the methods and techniques used in this work. The results are detailed and discussed in Section 3. Finally, the conclusions are drawn in Section 4.
2. METHODS AND TECHNIQUES
Our methodology to recognize emotion consists of the following phases: pre-processing the signal, feature extraction, training the model, and finally testing the model, as shown in Figure 1.
Figure 1: The Methodology to recognize Emotion
2.1. Dataset
The Basic Arabic Expressive Speech corpus (BAES-DB) is the dataset that we used for our experiments. The corpus covers 13 speakers, 4 emotions and 10 sentences; in total it contains 520 sound files. Each file contains one of the ten sentences uttered in one of the four emotional states by one of the thirteen speakers. The four selected emotions are neutral, joy, sadness and anger. Each speaker recorded all ten sentences in one emotional state before moving on to the next. The first four speakers were recorded while sitting, whereas the other nine were standing.
2.2. Preprocessing
For the preprocessing step, pre-emphasis and silence removal are the two preprocessing algorithms that we apply in this paper, the same as in our previous work.
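As a rough illustration, the following Python sketch shows how these two steps could be implemented with librosa and NumPy. The pre-emphasis coefficient (0.97), the 30 dB silence threshold and the use of librosa itself are assumptions made for illustration; the paper does not report its exact preprocessing parameters.

import librosa
import numpy as np

def preprocess(path, pre_emphasis=0.97, top_db=30):
    """Load a wav file, apply pre-emphasis and remove silent segments.

    The 0.97 coefficient and the 30 dB threshold are illustrative defaults,
    not values taken from the paper.
    """
    signal, sr = librosa.load(path, sr=None)  # keep the original sampling rate
    # First-order pre-emphasis filter: y[t] = x[t] - a * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Energy-based silence removal: keep only intervals louder than top_db below the peak
    intervals = librosa.effects.split(emphasized, top_db=top_db)
    voiced = np.concatenate([emphasized[start:end] for start, end in intervals])
    return voiced, sr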
2.3. Feature Extraction
Human beings can infer the emotional state of a speaker from the voice of the other party. Subjectively, we know that when a person is angry, the speech rate increases, the loudness of the voice increases, and the pitch rises; when a person is sad, the speech rate slows down, the loudness decreases, and the pitch drops. This shows that the human voice is significantly affected by the speaker's emotions: under different emotions, the sounds produced have different acoustic characteristics, generally referred to as features. We therefore first need to extract features from the speech in order to detect the speaker's emotions. In this paper five different feature types are investigated with the CNN architecture: Mel-Frequency Cepstral Coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast and tonal centroid features (tonnetz).
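The snippet below is a minimal sketch of how these five feature types can be extracted with librosa. The per-feature dimensions (40 MFCCs, 12 chroma bins, 128 mel bands, 7 contrast bands, 6 tonnetz dimensions) and the time-averaging into a single fixed-length vector are illustrative assumptions, not details taken from the paper.

import librosa
import numpy as np

def extract_features(signal, sr):
    """Extract the five feature types and average each over time.

    Time-averaging and the per-feature sizes are assumptions for illustration.
    """
    stft = np.abs(librosa.stft(signal))
    mfcc = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(signal), sr=sr), axis=1)
    # Concatenate into a single fixed-length vector (40 + 12 + 128 + 7 + 6 = 193 values)
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])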
2.4. Convolutional Neural Network
A convolutional neural network (CNN) is one of the most popular algorithms for deep learning with images and video [10]. Like other neural networks, a CNN is composed of an input layer, an output layer, and many hidden layers in between.
2.4.1. Training CNN model
We implemented the CNN model for emotion classification using the Keras deep learning library with a TensorFlow backend on a laptop with an Intel(R) Core(TM) i7-3520 CPU @ 2.90 GHz, 8 GB RAM and a 64-bit operating system. The ASERS-CNN model architecture proposed in this paper consists of 4 blocks, each of them containing convolution, batch normalization, dropout and max-pooling layers, as shown in Figure 2.
Figure 2: ASERS-CNN model
The model was trained using the Adam optimization algorithm with a dropout rate of 0.2. The initial learning rate was set to 0.001 and the batch size to 32 for 50 epochs, and the categorical cross-entropy loss function was used. The accuracy and loss curves for the model are shown in Figure 3.
Figure 3: (a) CNN Model Accuracy curve (b) CNN Model Loss curve
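The following Keras sketch illustrates a 4-block architecture of this kind together with the training settings stated above (Adam, learning rate 0.001, dropout 0.2, batch size 32, categorical cross-entropy). The filter counts, kernel sizes, input length and the final dense layer are assumptions, since the paper does not list them; this is a sketch in the spirit of ASERS-CNN, not the authors' exact model.

from tensorflow import keras
from tensorflow.keras import layers

def build_asers_cnn(input_length=193, num_classes=4, dropout=0.2):
    """Sketch of a 4-block 1-D CNN. Only the block structure
    (Conv -> BatchNorm -> Dropout -> MaxPooling) and the training settings
    come from the paper; the layer sizes are illustrative assumptions."""
    model = keras.Sequential([layers.Input(shape=(input_length, 1))])
    for filters in (64, 128, 128, 256):  # assumed filter counts
        model.add(layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(dropout))
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example training call with the stated batch size and epoch count:
# model.fit(X_train, y_train, batch_size=32, epochs=50, validation_data=(X_val, y_val))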
3. RESULTS AND DISCUSSION
In order to evaluate the performance of the proposed CNN model we used the Basic Arabic Expressive Speech corpus (BAES-DB), which contains 520 wav files from 13 speakers. For training and evaluating the model, we used four categorical emotions, Angry, Happy, Sad and Neutral, which represent the majority of the emotion categories in the database. For low-level acoustic features, we extracted 5 feature types: Mel-Frequency Cepstral Coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast and tonal centroid features (tonnetz). To determine whether the number of training epochs has any effect on the accuracy of emotion classification, we trained the model using 200 and 50 epochs. We also studied the effect of the number of features extracted from each wav file.
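A minimal sketch of how such an epoch comparison could be scripted is shown below, assuming precomputed feature vectors X with integer emotion labels y and the build_asers_cnn helper sketched earlier; the 80/20 stratified split is an assumption, as the paper does not describe its train/test split.

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# X: feature matrix (n_samples, n_features), y: integer emotion labels 0..3;
# the 80/20 stratified split below is an assumption, not taken from the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X[..., None], to_categorical(y, 4), test_size=0.2, stratify=y, random_state=0)

for epochs in (200, 50):
    model = build_asers_cnn(input_length=X.shape[1])
    model.fit(X_train, y_train, batch_size=32, epochs=epochs, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"epochs={epochs}: test accuracy={acc:.4f}")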
Table 1 summarizes the comparison between the three models, DNN, LSTM and CNN. With 5 features, the DNN obtains an accuracy of 93.34% without the pre-processing step and 97.78% with pre-processing; with 7 features it obtains 92.95% without pre-processing and 98.06% with pre-processing. When the number of epochs is decreased from 200 to 50 it obtains 97.04%. With 5 features, the CNN model obtains 98.18% without pre-processing and 98.52% with pre-processing; with 7 features it obtains 98.40% without pre-processing and 98.46% with pre-processing. When the number of epochs is decreased from 200 to 50 it obtains 95.79%. With 5 features, the LSTM model obtains 96.81% without pre-processing and 97.44% with pre-processing; with 7 features it obtains 96.98% without pre-processing and 95.84% with pre-processing. When the number of epochs is decreased from 200 to 50 it obtains 95.16%. As Table 1 shows, the best accuracy for the CNN model is obtained when using 200 epochs, the two pre-processing steps and five extracted features.
Table 1: Comparison of the accuracy of the CNN, DNN and LSTM models
Accuracy (%)

Classifier   200 Epochs, 5 Features          200 Epochs, 7 Features          50 Epochs, 7 Features
             w/o preproc.   with preproc.    w/o preproc.   with preproc.    with preproc.
DNN          93.34%         97.78%           92.95%         98.06%           97.04%
CNN          98.18%         98.52%           98.40%         98.46%           95.79%
LSTM         96.81%         97.44%           96.98%         95.84%           95.16%
The results are compared with some related works, including (Klaylat et al., 2018a), (Klaylat et al., 2018b), (Schuller, Rigoll and Lang, 2003), (Shaw, 2016) and (Farooque et al., 2004), as shown in Table 2 below. It is obvious from Table 2 that our proposed CNN outperforms the other state-of-the-art models, obtaining 98.52%. (Klaylat et al., 2018b) obtains better accuracy than our proposed LSTM, reaching 98.04% against 97.44% for our proposed LSTM. Our proposed DNN outperforms (Klaylat et al., 2018a), [12], [13] and [14].
Table 2: Comparison between our proposed models and some related works
Ref        Classifier                                        Features                                                           Database           Accuracy
[2]        Thirty-five classifiers belonging to six groups   Low-level descriptors (LLDs); acoustic and spectral features       Created database   95.52%
[3]        Thirty-five classifiers belonging to six groups   Low-level descriptors (LLDs); acoustic and spectral features       Created database   98.04%
[12]       HMM, GMM                                          Pitch and energy                                                   -                  78%
[13]       ANN                                               Pitch, energy, MFCC, formant frequencies                           Created database   85%
[14]       RFuzzy model                                      SP-IN, TDIFFVUV, and TDIFFW                                        Created database   90%
Our DNN    DNN network                                       MFCC, chromagram, Mel-scaled spectrogram, spectral contrast        BAES-DB            98.06%
                                                             and tonal centroid features (tonnetz)
Our CNN    CNN network                                       Same five features as above                                        BAES-DB            98.52%
Our LSTM   LSTM network                                      Same five features as above                                        BAES-DB            97.44%
4. CONCLUSION
In this paper we propose ASERS-CNN, a deep learning model for Arabic speech emotion recognition (ASER). We use the Arabic speech dataset named the Basic Arabic Expressive Speech corpus (BAES-DB) for our evaluation. The model obtains an accuracy of 98.18% before applying any pre-processing and 98.52% after pre-processing when using 5 features. When using 7 features it obtains 98.40% before pre-processing and 98.46% after pre-processing. When we decrease the number of epochs from 200 to 50 it obtains 95.79%. We compared the accuracy of the proposed ASERS-CNN model with our previous LSTM and DNN models. The proposed ASERS-CNN model outperforms the previous DNN and LSTM models, reaching 98.52% accuracy, while the DNN and LSTM reached 98.06% and 97.44%, respectively. We also compared the accuracy of the proposed ASERS-CNN model with some related works, including (Klaylat et al., 2018a), (Klaylat et al., 2018b), (Schuller, Rigoll and Lang, 2003), (Shaw, 2016) and (Farooque et al., 2004), as shown in Table 2. As is obvious from Table 2, our proposed CNN outperforms the other state-of-the-art models.
REFERENCES
[1] F. Chollet, Deep Learning with Python. Manning Publications, 2017.
[2] S. Klaylat, Z. Osman, L. Hamandi, and R. Zantout, “Emotion recognition in Arabic speech,” Analog
Integr. Circuits Signal Process., vol. 96, no. 2, pp. 337–351, 2018.
[3] S. Klaylat, Z. Osman, L. Hamandi, and R. Zantout, “Enhancement of an Arabic Speech Emotion
Recognition System,” Int. J. Appl. Eng. Res., vol. 13, no. 5, pp. 2380–2389, 2018.
[4] Y. Hifny and A. Ali, “Efficient Arabic emotion recognition using deep neural networks,” pp. 6710–
6714, 2019.
[5] L. Abdel-Hamid, “Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet
features,” Speech Commun., vol. 122, pp. 19–30, 2020.
[6] S. Klaylat, Z. Osman, L. Hamandi, and R. Zantout, “Ensemble Models for Enhancement of an Arabic
Speech Emotion Recognition System,” vol. 70, no. January, pp. 299–311, 2020.
[7] I. Hadjadji, L. Falek, L. Demri, and H. Teffahi, “Emotion recognition in Arabic speech,” 2019 Int.
Conf. Adv. Electr. Eng. ICAEE 2019, 2019.
[8] H. Dahmani, H. Hussein, B. Meyer-Sickendiek, and O. Jokisch, “Natural Arabic Language Resources
for Emotion Recognition in Algerian Dialect,” vol. 2, no. October, pp. 18–33, 2019.
[9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[10] Mathworks, “Introducing Deep Learning with MATLAB,” Introd. Deep Learn. with MATLAB, p. 15,
2017.
[11] S. Hung-Il, “Chapter 1 - An Introduction to Neural Networks and Deep Learning,” Deep Learning for
Medical Image Analysis. pp. 3–24, 2017.
[12] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov Model-based Speech Emotion Recognition,” in Proc. IEEE ICASSP, 2003.
[13] A. Shaw, “Emotion Recognition and Classification in Speech using Artificial Neural Networks,” vol.
145, no. 8, pp. 5–9, 2016.
[14] M. Farooque and S. Munoz-Hernandez, “Prototype from Voice Speech Analysis,” 2004.