Emotion Recognition based on Speech and EEG Using Machine
Learning Techniques
Hanzala Javed (2022-MSCS-17), Muhammad Sarfraz (2022-MSCS-48)
Abstract:
This research investigates the synergy of speech and EEG data for emotion recognition using machine learning, with a
focus on Artificial Neural Networks (ANNs). The study achieves a notable 97 percent accuracy in discerning emotions,
employing a diverse dataset encompassing various emotional states. The integrated analysis of speech and EEG data
enhances the model's robustness, capturing both physiological and vocal cues. The ANN architecture is meticulously
designed for effective feature extraction and representation learning, demonstrating its efficacy in discerning subtle
emotional nuances. The high accuracy attained underscores the potential of this approach in human-computer interaction,
affective computing, and mental health monitoring. This research contributes to the evolving field of emotion recognition,
offering a novel multimodal approach with practical implications for creating more intuitive and responsive systems in
various applications.
Key Words:
Emotion Recognition
Speech Analysis
EEG Signals
Machine Learning
Artificial Neural Networks (ANNs)
Introduction:
Emotion recognition has emerged as a pivotal area of research within the broader scope of artificial intelligence,
contributing significantly to human-computer interaction and affective computing. Understanding and accurately
interpreting human emotions through various modalities, such as speech and electroencephalogram (EEG) signals, hold
immense potential for applications ranging from healthcare to human-machine interfaces. This introduction seeks to
provide a comprehensive overview of the literature landscape in emotion recognition, drawing upon key studies that have
explored diverse modalities and methodologies.
The exploration of emotion recognition using audio features has been a prominent focus in recent research [1]. Aouani et
al. delved into the intricacies of leveraging Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR),
Harmonic to Noise Rate (HNR), and Teager Energy Operator (TEO) for identifying emotions [1]. The utilization of an
Auto-Encoder for feature selection and a Support Vector Machine (SVM) as a classifier underscored the
effectiveness of this two-step approach. The study conducted experiments on the Ryerson Multimedia Laboratory (RML)
dataset, highlighting the potential for advancements in the field [1].
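Two of the audio descriptors named above are simple enough to sketch directly. The following NumPy fragment is an illustrative sketch of the Zero Crossing Rate and the Teager Energy Operator, not the implementation used in [1]; the toy signals and the 16 kHz sampling rate are our own assumptions:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[:-1] != signs[1:])

def teager_energy(frame):
    # Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]
    x = np.asarray(frame, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Toy signals: a 220 Hz tone crosses zero far less often than white
# noise, and its Teager energy is strictly positive and nearly constant.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 220 * t)
noise = np.random.default_rng(0).standard_normal(16000)
print(zero_crossing_rate(tone) < zero_crossing_rate(noise))  # True
```

In practice such statistics are computed per short frame (tens of milliseconds) rather than over a whole utterance, and then pooled into a fixed-length feature vector.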
A review by Maithri et al. spanning the years 2016 to 2021 meticulously analyzed state-of-the-art models in automated
emotion recognition, emphasizing the dominance of deep learning techniques and their impressive performance metrics,
particularly in controlled environments [4]. The review identified a critical gap in the literature regarding models tailored
for dynamic, uncontrolled settings, emphasizing the need for future research to address this limitation [4].
Speech emotion recognition was addressed by Issa et al., who introduced a novel architecture utilizing a one-dimensional
Convolutional Neural Network (CNN) and various audio features [6]. The proposed model outperformed existing
frameworks and set a new state-of-the-art, emphasizing the complexity of speech emotion recognition and suggesting
avenues for further research [6].
Building upon the rich foundation laid by these studies, the present research aims to contribute to the field of emotion
recognition based on speech and EEG signals. Leveraging machine learning techniques, specifically an Artificial Neural
Network (ANN) model, our study achieved a remarkable 97 percent accuracy. This paper aims to elucidate the
methodology, experimental setup, and findings, further enhancing our understanding of emotion recognition and its
potential applications.
Literature Survey:
[1] Aouani et al. explored emotion recognition using audio features, specifically a 39-coefficient
set of Mel Frequency Cepstral Coefficients (MFCC) together with Zero Crossing Rate (ZCR), Harmonic to Noise
Rate (HNR), and the Teager Energy Operator (TEO). The authors propose a two-step approach, first utilizing an
Auto-Encoder for selecting relevant parameters from the initially extracted features and then employing
a Support Vector Machine (SVM) as the classifier. The experiments are conducted on the
Ryerson Multimedia Laboratory (RML) dataset. The paper concludes by presenting the performance of
the proposed systems, emphasizing the fusion of HNR with widely recognized emotion features. The use
of auto-encoder dimension reduction is highlighted for improving identification rates. The authors
suggest future research directions, including the exploration of different feature types, application on
larger datasets, alternative methods for feature dimension reduction, and potential incorporation of
audiovisual data to enhance emotion recognition rates. The study positions its findings as effective
compared to other emotion recognition systems, underscoring the potential for further advancements
in the field.
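The two-step design described here (compress the features, then classify the compressed representation) can be sketched with scikit-learn. As a deliberate simplification, PCA stands in for the auto-encoder and synthetic 39-dimensional features replace the RML data, so this illustrates only the shape of the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 39-dimensional "MFCC-like" vectors for 4 emotion classes.
X, y = make_classification(n_samples=400, n_features=39, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: compress the features; Step 2: classify the compressed vectors.
clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```

An actual auto-encoder learns a nonlinear compression rather than PCA's linear projection, but the downstream SVM stage is wired identically.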
[2] Wang et al. address the critical aspect of Automatic Emotion Recognition (AER) for enhancing
Human–Machine Interaction (HMI) by introducing a novel multi-modal emotion database, MED4,
comprising EEG, photoplethysmography, speech, and facial images. Conducted in two environmental
conditions, a research lab and an anechoic chamber, the study employs four baseline algorithms to
assess AER methods and presents two fusion strategies at feature and decision levels. Results indicate
that EEG signals outperform speech signals in emotion recognition, and their fusion significantly
improves accuracy. The paper emphasizes the robustness of AER in noisy environments and the
database's availability for global research collaboration. The conclusion discusses the unique
contributions of MED4, including its multi-modality and multi-environmental design, the effectiveness of
EEG signals, the impact of environmental noise, and the potential for future research directions,
highlighting the paper's significance in advancing the field of AER. Moreover, the paper evaluates the
impact of variable-length EEG on AER performance, finding it to outperform other methods, particularly
in recognizing happy emotions. The study also explores single-modality emotion recognition based on
speech and EEG signals, revealing that EEG signals exhibit high accuracy, especially in identifying happy
emotions, while speech signals are more effective in recognizing neutral and angry emotions. The
analysis of environmental noise indicates a more stable performance using EEG signals across different
environments compared to speech signals, emphasizing the reliability of EEG in suboptimal acoustic
conditions.
[3] Jafari et al. investigate the role of Deep Learning (DL) techniques in emotion recognition from
Electroencephalogram (EEG) signals, acknowledging emotions' crucial influence on human decision-
making and mental states. It discusses the challenges associated with EEG-based emotion recognition,
such as signal variability and the lack of a universal processing standard. The study emphasizes the
advantages of EEG signals in terms of spatial resolution and ease of recording. The conclusion reviews
recent research efforts employing DL models for emotion recognition, especially focusing on diverse
emotions and associated psychological conditions. The paper presents a comprehensive literature
review, covering the period from 2016 to 2023, discussing the challenges faced, summarizing studies on
DL techniques in EEG-based emotion recognition, and proposing future research directions. The
comprehensive review positions the paper as a valuable resource for understanding the current state,
challenges, and potential advancements in the field of emotion recognition using EEG signals and DL
techniques. Furthermore, the paper provides a systematic review of the literature, categorizing articles
based on DL techniques, EEG signal processing steps, and challenges encountered in emotion
recognition. The thorough exploration of challenges, DL methods, and potential future directions
enhances the paper's contribution to the field, offering insights for researchers and practitioners aiming
to develop more robust and effective systems for emotion recognition from EEG signals.
[4] Maithri et al. provide a review of automated emotion recognition (ER)
methodologies spanning the years 2016 to 2021, with a specific emphasis on electroencephalogram
(EEG), facial, and speech signals. A meticulous analysis of state-of-the-art models reveals a conspicuous
upswing in the adoption of deep learning techniques for ER. Notably, these approaches have showcased
impressive performance metrics, particularly within controlled environments, signifying a discernible
trend in the landscape. The comprehensive summary meticulously categorizes the diverse
methodologies employed, the modalities considered (EEG, facial, and speech signals), and the
corresponding performance metrics, underscoring the dominance of deep learning in achieving
heightened accuracy and efficiency. Despite the notable successes in controlled environments, the
review accentuates a critical gap in the literature: the scarcity of models tailored for dynamic,
uncontrolled settings. These scenarios, marked by subject movements and sudden shifts between
expressions, present a substantial challenge for existing automated ER systems. The review critically
underscores the imperative for future research to address this gap. Developing and refining automated
ER systems capable of operating effectively in real-time, unpredictable scenarios becomes crucial for
their broader practical deployment. Such advancements have far-reaching implications across diverse
domains, including healthcare, e-learning, and surveillance, where the practical application of ER
technologies often involves uncontrolled and dynamic settings. Bridging this gap not only enhances the
robustness of automated ER systems but also expands their real-world applicability, ensuring their
efficacy in capturing the nuances of human emotions in various contexts.
[5] Yu and Wang delve into the crucial realm of emotion recognition, emphasizing its pivotal role in
artificial intelligence and human-computer interaction. Focusing on electroencephalogram (EEG) signals,
directly generated by the central nervous system and intimately linked with human emotions, the paper
reviews recent advancements in emotion recognition methodologies. It covers various aspects, including
emotion induction, EEG preprocessing, feature extraction, and emotion classification. The paper
critically compares the strengths and weaknesses of these methods while highlighting the existing
challenges in current research methodologies. The conclusion underscores the foundational importance
of emotion recognition in human-computer emotion interaction and its broad application value in
enhancing various aspects of human life. Notably, with the continuous progress in brain-computer
interface technology and the development of artificial intelligence, emotion recognition based on EEG
signals emerges as a promising avenue, garnering extensive attention. The paper emphasizes the impact
of EEG signal acquisition and preprocessing on classification accuracy and notes the successful
integration of deep learning techniques, particularly neural networks, in advancing emotion recognition
models within the domain of brain-computer interfaces. The research direction highlighted in the
conclusion underscores the evolving landscape of emotion classification through the integration of deep
learning and EEG signals.
[6] Issa et al. address the challenging task of speech emotion recognition through the introduction
of a novel architecture utilizing a one-dimensional Convolutional Neural Network (CNN). The proposed
model extracts diverse audio features, including mel-frequency cepstral coefficients, chromagram, mel-
scale spectrogram, Tonnetz representation, and spectral contrast features, from raw sound files. The
datasets employed for evaluation encompass the Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS), Berlin (EMO-DB), and Interactive Emotional Dyadic Motion Capture (IEMOCAP).
The study employs an incremental method for refining the initial model, resulting in enhanced
classification accuracy. Unlike some prior approaches, the proposed framework operates directly with
raw sound data, avoiding the need for conversion to visual representations. In conclusion, the paper
emphasizes the complexity of speech emotion recognition, addressing the key challenges of feature
extraction and classification. The proposed one-dimensional deep CNN, combined with a variety of
audio features, outperforms existing frameworks for RAVDESS and IEMOCAP, setting a new state-of-the-
art. For EMO-DB, the paper presents an incremental set of models to enhance performance, achieving
competitive results compared to prior works in terms of generality, simplicity, and applicability. The
authors acknowledge the potential for further research, suggesting exploration of alternative features or
the integration of auxiliary neural networks for high-level feature extraction. Additionally,
comprehensive data augmentation techniques and the incorporation of additional layers of Long Short-
Term Memory (LSTM) are identified as potential avenues for improving classification accuracy. The
paper also highlights the significance of the order of stacking sound features and proposes it as a subject
for future investigation, reflecting a commitment to ongoing refinement and optimization in the field of
speech emotion recognition.
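The building block of such a model, a kernel sliding along one axis of the feature sequence, reduces to a short NumPy routine. This is a generic sketch of valid-mode 1-D convolution (implemented as cross-correlation, the convention CNN layers use), not the layer configuration from [6]:

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid-mode 1-D convolution (cross-correlation, as in CNN layers)."""
    n_out = (len(x) - len(kernel)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(kernel)], kernel)
                     for i in range(n_out)])

x = np.array([1., 2., 3., 4., 5., 6.])
k = np.array([1., 0., -1.])            # simple difference-style kernel
print(conv1d(x, k))                    # [-2. -2. -2. -2.]
```

A real CNN layer applies many such kernels in parallel, adds a bias and a nonlinearity, and learns the kernel weights by backpropagation.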
[7] Alhalaseh and Alasasfeh explore the development of an automated model for identifying emotions based
on EEG signals. Addressing the challenges of using brain signals for emotion recognition due to their
inherent instability, the study proposes a novel approach employing empirical mode
decomposition/intrinsic mode functions (EMD/IMF) and variational mode decomposition (VMD) for
signal processing. Distinct from previous works, the paper focuses on the application of EMD/IMFs and
VMD, which are not commonly utilized in emotion recognition literature. The feature extraction stage
utilizes entropy and Higuchi's fractal dimension (HFD); in the classification stage, four methods are
employed: naïve Bayes, k-nearest neighbor (k-NN), convolutional neural network (CNN), and decision
tree (DT). The study evaluates the proposed model using the DEAP database and various
performance metrics, achieving a remarkable 95.20% accuracy with the CNN-based method.
The conclusion highlights the significance of advancements in sensor and signal recording technologies,
enabling the utilization of signals extracted from human organs for condition identification. Categorizing
emotions based on EEG signals presents a complex application, aiming to discern a person's emotional
state, reflecting potential issues. The proposed model, encompassing signal processing, feature
extraction, and classification stages, utilizes innovative techniques like EMD/IMFs and VMD. The study
underscores the superiority of the CNN classifier in terms of accuracy and runtime, outperforming other
classifiers and demonstrating favorable results in comparison to existing literature. The comprehensive
evaluation considers metrics such as accuracy, precision, recall, and F1-measure, consistently
showcasing CNN's superior performance in EEG signal categorization. The paper emphasizes the
potential of this research for advancing emotion recognition systems and contributes unique insights
through its novel signal processing approach and classifier performance analysis.
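Of the features used in [7], entropy is the easiest to illustrate. The sketch below computes Shannon entropy over an amplitude histogram; the bin count and toy signals are our own assumptions, not parameters from the paper:

```python
import numpy as np

def shannon_entropy(signal, n_bins=16):
    """Shannon entropy (in bits) of a signal's amplitude histogram."""
    counts, _ = np.histogram(signal, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty bins: 0 * log(0) -> 0
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
uniform = rng.uniform(-1, 1, 4096)    # fills all bins -> high entropy
constant = np.full(4096, 0.5)         # one bin -> zero entropy
print(shannon_entropy(uniform) > shannon_entropy(constant))  # True
```

Intuitively, an EEG segment whose amplitudes spread across many levels scores higher than a flat or saturated one, which is why entropy works as a discriminative feature.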
[8] Pan et al. introduce a novel approach, Deep-Emotion, for Multimodal Emotion
Recognition (MER) using facial expressions, speech, and Electroencephalogram (EEG). The authors
identify existing challenges in emotion recognition, including effective utilization of different modalities
and real-time detection in the context of increasing computing power demands. The proposed Deep-
Emotion framework comprises three branches: facial, speech, and EEG, each utilizing specialized neural
networks for feature extraction. The facial branch employs an improved GhostNet neural network to
address overfitting issues and enhance classification accuracy. The speech branch introduces a
lightweight fully convolutional neural network (LFCNN) for efficient speech emotion feature extraction.
For the EEG branch, the authors propose a tree-like Long Short-Term Memory (tLSTM) model capable of
fusing multi-stage features. Decision-level fusion is then adopted to integrate recognition results from
the three branches, ensuring comprehensive and accurate performance. The authors conduct extensive
experiments on CK+, EMO-DB, and MAHNOB-HCI datasets, demonstrating the superior performance of
the Deep-Emotion method. This paper represents the first attempt to combine facial expressions,
speech, and EEG for MER. The improved GhostNet for facial expressions, LFCNN for speech signals, and
tLSTM for EEG signals contribute to enhanced accuracy and robustness. The study also introduces an
optimal weight distribution search algorithm for decision-level fusion, further improving reliability. The
experimental results validate the feasibility of the proposed method across multiple public datasets,
suggesting the potential for future enhancements, particularly in refining dynamic weight allocation for
improved overall algorithm robustness.
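Decision-level fusion of the kind described here can be sketched as a weighted average of per-branch class probabilities followed by an argmax. The weights and probabilities below are hypothetical; the paper's optimal weight distribution search is not reproduced:

```python
import numpy as np

def decision_fusion(prob_face, prob_speech, prob_eeg, weights=(0.4, 0.3, 0.3)):
    """Weighted average of per-branch class probabilities, then argmax."""
    stacked = np.stack([prob_face, prob_speech, prob_eeg])  # (3, n, classes)
    fused = np.tensordot(np.asarray(weights), stacked, axes=1)
    return fused.argmax(axis=-1)

# One sample, 3 classes: two branches favour class 1, one favours class 0.
p_face   = np.array([[0.6, 0.3, 0.1]])
p_speech = np.array([[0.2, 0.7, 0.1]])
p_eeg    = np.array([[0.1, 0.8, 0.1]])
print(decision_fusion(p_face, p_speech, p_eeg))  # [1]
```

The fused score for class 1 (0.57) outweighs class 0 (0.33), so the two agreeing branches overrule the facial branch; tuning the weights is exactly what the search algorithm in [8] automates.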
[9] Houssein et al. present a comprehensive literature review on emotion recognition, with a
specific focus on methods utilizing multi-channel Electroencephalogram (EEG) signals in Brain-Computer
Interfaces (BCIs). Affective computing, a subset of artificial intelligence, is highlighted for its role in
detecting, interpreting, and mimicking human emotions. The authors emphasize the limitations of
traditional modalities such as facial expressions, speech, and behavior, which may be influenced by
conscious or unconscious social masking, and advocate for the efficacy of physiological signals,
particularly EEG, in providing more accurate and objective emotion recognition. The review covers the
period from 2015 to 2021 and encompasses over 195 publications. The authors explore EEG-based BCI
emotion recognition approaches, detailing the entire process, including data collection, preprocessing,
feature extraction, feature selection, classification, and performance evaluation. Emphasis is placed on
the real-time responsiveness and authenticity of EEG signals, which react to emotional changes, making
them a reliable source for emotion recognition. The paper extensively surveys EEG feature extraction
techniques, feature selection/dimensionality reduction methods, and various machine and deep
learning classification techniques, including k-nearest neighbor, support vector machine, decision tree,
artificial neural network, convolutional and recurrent neural networks with long short-term memory.
The review delves into EEG rhythms associated with emotions and the intricate relationship between
distinct brain areas and emotional states. The authors discuss challenges and future research directions
in EEG-based emotion recognition, anticipating resolution of current obstacles and envisioning diverse
applications. The paper aims to provide valuable insights for researchers, especially those new to the
field, offering a snapshot of the current state of research in emotional-oriented EEG features recognition
and categorization. The funding sources for the study are acknowledged from The Science, Technology
& Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Research Methodology:
The research methodology involved leveraging a dataset sourced from Kaggle, comprising EEG signals
and speech data, for the purpose of emotion recognition. The EEG signals were preprocessed,
addressing artifacts and segmenting into relevant time intervals, while the speech data underwent noise
reduction, feature extraction, and normalization. Subsequently, both sets of preprocessed features were
integrated to form a comprehensive dataset. An Artificial Neural Network (ANN) was designed to
process this combined data, with careful consideration given to input layer configuration to
accommodate the multidimensional nature of the features. The model was trained and validated using
appropriate datasets, and performance evaluation was conducted using metrics such as accuracy,
precision, and recall. Results were analyzed and discussed in the context of research objectives,
emphasizing potential applications and future directions. Figure 1 depicts the flow diagram of the
research methodology, illustrating the sequential steps from data collection to model evaluation. The
diagram serves as a visual representation of the systematic approach employed in the study.
Figure 1: Flow diagram of the research methodology.
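A minimal sketch of this pipeline, assuming feature-level fusion by concatenation and a small fully connected network, is given below. The feature dimensions, synthetic data, and scikit-learn MLPClassifier are stand-ins for the actual Kaggle data and ANN architecture, which are not specified in detail here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins for preprocessed features: 39-d speech vectors and 32-d EEG
# vectors per sample, correlated so that fusion is meaningful.
X_speech, y = make_classification(n_samples=600, n_features=39,
                                  n_informative=8, n_classes=3,
                                  random_state=0)
X_eeg = (X_speech[:, :8] @ rng.standard_normal((8, 32))
         + 0.5 * rng.standard_normal((600, 32)))

# Feature-level fusion: concatenate both modalities into one input vector.
X = StandardScaler().fit_transform(np.hstack([X_speech, X_eeg]))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0)
ann.fit(X_tr, y_tr)
print(round(ann.score(X_te, y_te), 2))
```

Standardizing the concatenated vector matters because speech and EEG features live on very different scales; without it, one modality can dominate the network's early layers.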
Results and Discussions:
The integration of Speech and EEG data for Emotion Recognition, utilizing an Artificial Neural Network
(ANN) model, demonstrated significant success, achieving a notable 97 percent accuracy on the Kaggle
dataset. The results showcase the model's proficiency in accurately classifying diverse emotional states,
highlighting its robustness in capturing both vocal and physiological cues. Precision, recall, and F1-score
metrics further validate the model's effectiveness across various emotions. The confusion matrix,
illustrated in Figure 2, provides a visual representation of the model's performance on individual
emotion classes, emphasizing its capability to discern subtle nuances. Specifically sourced from Kaggle,
the dataset's diversity contributes to the generalizability of the model across a wide range of emotional
expressions. These findings underscore the potential of the proposed multimodal approach in real-world
applications such as human-computer interaction and affective computing. The achieved high accuracy
substantiates the efficacy of leveraging Kaggle's EEG signals and speech data, showcasing the success of
the applied machine learning techniques in emotion recognition.
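The reported metrics follow directly from standard scikit-learn calls. The labels below are hypothetical stand-ins used only to show how accuracy, macro-averaged precision/recall/F1, and the confusion matrix are computed:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical predictions for a 3-class emotion task (labels 0, 1, 2).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 0, 1, 1])

print(accuracy_score(y_true, y_pred))                  # 0.8
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(round(f1, 2))
print(confusion_matrix(y_true, y_pred))                # rows: true class
```

The off-diagonal entries of the confusion matrix pinpoint which emotion pairs the model confuses, which is exactly the nuance the in-text discussion refers to.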
References:
[1] Aouani, H., & Ayed, Y. B. (2020). Speech emotion recognition with deep learning. Procedia Computer
Science, 176, 251-260.
[2] Wang, Q., Wang, M., Yang, Y., & Zhang, X. (2022). Multi-modal emotion recognition using EEG and
speech signals. Computers in Biology and Medicine, 149, 105907.
[3] Jafari, M., Shoeibi, A., Khodatars, M., Bagherzadeh, S., Shalbaf, A., García, D. L., ... & Acharya, U. R.
(2023). Emotion recognition in EEG signals using deep learning methods: A review. Computers in Biology
and Medicine, 107450.
[4] Maithri, M., Raghavendra, U., Gudigar, A., Samanth, J., Barua, P. D., Murugappan, M., ... & Acharya,
U. R. (2022). Automated emotion recognition: Current trends and future perspectives. Computer methods
and programs in biomedicine, 215, 106646.
[5] Yu, C., & Wang, M. (2022). Survey of emotion recognition methods using EEG information. Cognitive
Robotics, 2, 132-146.
[6] Issa, D., Demirci, M. F., & Yazici, A. (2020). Speech emotion recognition with deep convolutional
neural networks. Biomedical Signal Processing and Control, 59, 101894.
[7] Alhalaseh, R., & Alasasfeh, S. (2020). Machine-learning-based emotion recognition system using EEG
signals. Computers, 9(4), 95.
[8] Pan, J., Fang, W., Zhang, Z., Chen, B., Zhang, Z., & Wang, S. (2023). Multimodal emotion recognition
based on facial expressions, speech, and EEG. IEEE Open Journal of Engineering in Medicine and
Biology.
[9] Houssein, E. H., Hammad, A., & Ali, A. A. (2022). Human emotion recognition from EEG-based brain–
computer interface using machine learning: a comprehensive review. Neural Computing and
Applications, 34(15), 12527-12557.