Speech Emotion Recognition
Guided by:
Mrs. R.K. Patole
111707049-Pragya Sharma
141807005-Kanchan Itankar
141807008-Saniya Shaikh
141807009-Triveni Vyavahare
Aim
Speech Emotion Recognition using Machine Learning
Literature Review
[1] Speech Emotion Recognition using Neural Network and MLP Classifier (Jerry Joy, Aparna Kannan, Shreya Ram, S. Rama)
● MLP Classifier
● 5 features extracted: MFCC, Contrast, Mel Spectrogram, Chroma and Tonnetz
● Accuracy: 70.28%
[2] Voice Emotion Recognition using CNN and Decision Tree (Navya Damodar, Vani H Y, Anusuya M A)
● Decision Tree, CNN
● MFCCs extracted
● Accuracy: 72% (CNN), 63% (Decision Tree)
Objective
● To build a model to recognize emotion from speech using the librosa and sklearn libraries and the RAVDESS dataset.
● To present a classification model of emotion elicited by speech, based on an MLP (deep neural network) classifier trained on acoustic features such as Mel Frequency Cepstral Coefficients (MFCC). The model has been trained to classify eight different emotions (calm, happy, fearful, disgust, angry, neutral, surprised, sad).
Applications
Business Marketing
Suicide prevention
Voice Assistant
Motivation
● As human beings, speech is among the most natural ways we express ourselves. We depend on it so much that we recognize its importance when resorting to other forms of communication, such as emails and text messages, where we often use emojis to express the emotions associated with the message. Since emotions play a vital role in communication, their detection and analysis is of vital importance in today’s digital world of remote communication.
● Emotion detection is a challenging task, because emotions are subjective and there is no common consensus on how to measure them. We define a Speech Emotion Recognition system as a collection of methodologies that process and classify speech signals to detect the emotions embedded in them.
Introduction
● Human-machine interaction is widely used nowadays in many applications, and one of its mediums is speech. One of the main challenges in human-machine interaction is the detection of emotion from speech.
● Emotion can play an important role in decision making. Emotion can also be detected from various physiological signals. If emotion can be recognized properly from speech, then a system can act accordingly. Emotion is identified by extracting features or characteristics from the speech, and training on a large speech database is needed to make the system accurate.
● The emotional speech dataset RAVDESS is selected, emotion-specific features are extracted from the recordings, and finally an MLP classification model is used to recognize the emotions.
Methodology
System block diagram: Preprocessing → Feature Extraction → Classification
1. Preprocessing
Removal of unwanted noise from the speech signal (a minimal sketch follows the list):
➢ Silence removal
➢ Background noise removal
➢ Windowing
➢ Normalization
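As a minimal sketch of the first and last of these steps, assuming librosa is available and a hypothetical input file speech.wav (windowing happens later, inside the STFT-based feature extractors):

```python
import numpy as np
import librosa

# Load the clip at its native sampling rate ("speech.wav" is a placeholder path).
y, sr = librosa.load("speech.wav", sr=None)

# Silence removal: trim leading/trailing audio more than 30 dB below the peak.
y, _ = librosa.effects.trim(y, top_db=30)

# Normalization: scale the waveform to unit peak amplitude.
y = y / np.max(np.abs(y))

# Windowing is applied implicitly by the STFT inside librosa's feature
# extractors; background-noise removal (e.g. spectral gating) is omitted here.
```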
2. Feature Extraction
● Extract features from the audio file.
● Used to characterize how we speak (a sketch of two of these follows the list):
➢ Pitch
➢ Loudness
➢ Rhythm, etc.
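As a hedged illustration of two of these prosodic cues, here is one way to estimate per-frame pitch and loudness with librosa; the file path and the pitch search range are assumptions for illustration, not values from the project:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)  # placeholder path

# Pitch: per-frame fundamental frequency estimated with the YIN algorithm;
# the 65-2093 Hz search range (roughly C2-C7) is an assumption.
f0 = librosa.yin(y, fmin=65, fmax=2093, sr=sr)

# Loudness proxy: per-frame root-mean-square energy.
rms = librosa.feature.rms(y=y)[0]

print(f"mean pitch {np.mean(f0):.1f} Hz, mean RMS {np.mean(rms):.4f}")
```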
Dataset
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).
● The RAVDESS dataset [3] has recordings of 24 actors (12 male, 12 female), numbered 01 to 24, speaking with a North American accent.
● All emotional expressions are uttered at two levels of intensity, normal and strong, except for the ‘neutral’ emotion, which is produced only at normal intensity. The portion of RAVDESS that we use therefore contains 60 trials for each of the 24 actors, 1440 files in total.
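RAVDESS encodes each clip’s metadata in its filename as seven dash-separated numeric fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). A small sketch of recovering the emotion label, assuming the standard RAVDESS naming convention:

```python
import os

# RAVDESS emotion codes (third dash-separated field of each filename).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path: str) -> str:
    """Return the emotion label encoded in a RAVDESS filename."""
    return EMOTIONS[os.path.basename(path).split("-")[2]]

print(emotion_from_filename("03-01-06-01-02-01-12.wav"))  # -> fearful
```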
Fig: Training process workflow [1]
Fig: Testing process workflow [1]
3. Classification
● Match the extracted features to the corresponding emotions using a Multilayer Perceptron.
Multi-Layer Perceptron Classifier
● A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN).
● An MLP consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer.
● MLPs are suitable for classification prediction problems where inputs are assigned a class or label.
Building the MLP Classifier involves the following steps (a minimal sketch follows the list):
1. Initialise the MLP Classifier.
2. Train the neural network.
3. Prediction.
4. Accuracy calculation.
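A minimal sketch of these four steps with sklearn’s MLPClassifier; the feature matrix, labels and hyperparameters below are illustrative placeholders, not the project’s actual settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: one 180-dim feature vector per RAVDESS clip.
X = np.random.rand(1440, 180)
y = np.random.choice(["calm", "happy", "sad"], size=1440)

# 1. Initialise the MLP Classifier (hyperparameters are illustrative).
model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                      batch_size=256, max_iter=500)

# 2. Train the neural network on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model.fit(X_train, y_train)

# 3. Prediction.
y_pred = model.predict(X_test)

# 4. Accuracy calculation.
print(f"accuracy: {accuracy_score(y_test, y_pred):.2%}")
```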
Fig: Multi-Layer Perceptron Classifier
Feature Extraction
From the audio data we extract three key features (an extraction sketch follows their descriptions below), namely:
● MFCC (Mel Frequency Cepstral Coefficients)
● Mel Spectrogram
● Chroma
MFCC (Mel Frequency Cepstral Coefficients)
MFCCs compactly describe the shape of the short-term power spectrum: the log Mel spectrogram is decorrelated with a discrete cosine transform, and the first few coefficients capture the spectral envelope of the voice.
Mel Spectrogram
A Fast Fourier Transform is computed on overlapping windowed segments of the signal, giving what is called the spectrogram. A Mel spectrogram is simply this spectrogram with its frequency axis mapped onto the Mel scale.
Chroma
A Chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class of the standard chromatic scale is present in the signal.
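A minimal sketch of extracting all three features with librosa and averaging them over time into one fixed-length vector per clip; the choice of 40 MFCCs is an assumption for illustration:

```python
import numpy as np
import librosa

def extract_features(path):
    """One fixed-length vector per clip: time-averaged MFCC, Mel spectrogram
    and chroma features (40 + 128 + 12 = 180 values)."""
    y, sr = librosa.load(path, sr=None)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    return np.concatenate([mfcc, mel, chroma])
```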
Fig: MFCC and Chroma feature plots
Accuracy
Classification Matrix
Confusion Matrix
Classes (axis order): 1. angry, 2. calm, 3. disgust, 4. fearful, 5. happy, 6. neutral, 7. sad, 8. surprised
(A sketch of computing these matrices follows.)
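For reference, a hedged sketch of producing the confusion matrix and per-class report with sklearn; y_test and y_pred would come from the MLP sketch above, with tiny placeholders standing in for the real outputs here:

```python
from sklearn.metrics import confusion_matrix, classification_report

labels = ["angry", "calm", "disgust", "fearful", "happy",
          "neutral", "sad", "surprised"]

# y_test / y_pred come from the MLP sketch above; tiny placeholders here.
y_test = ["calm", "happy", "calm", "sad"]
y_pred = ["calm", "surprised", "neutral", "sad"]

# Rows are true emotions, columns are predictions, in the label order above.
print(confusion_matrix(y_test, y_pred, labels=labels))
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))
```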
Conclusion
● The proposed model achieved an accuracy of 66.67%.
● Calm was the best-identified emotion.
● The model gets confused between similar emotions, such as calm-neutral and happy-surprised.
● We tested the model on a recording of our own voice for the sentence “Dogs are sitting by the door”, and it identified the emotion correctly.
Future Work
● The system could take into consideration multiple speakers from different geographic locations speaking with different accents.
● Though a standard feed-forward MLP is a powerful tool for classification problems, CNN and RNN models could be used with larger datasets on machines with more computational power, and the results compared.
● Studies show that people with autism have difficulty expressing their emotions explicitly; real-time emotion recognition combining speech with image-based processing could prove to be of great assistance.
References
[1] Jerry Joy, Aparna Kannan, Shreya Ram, S. Rama, “Speech Emotion Recognition using Neural Network and MLP Classifier”, IJESC, April 2020.
[2] Navya Damodar, Vani H Y, Anusuya M A, “Voice Emotion Recognition using CNN and Decision Tree”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), October 2019.
[3] RAVDESS Dataset: https://zenodo.org/record/1188976#.X5r20ogzZPZ
[4] MLP/CNN/RNN Classification: https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/
[5] MFCC: https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd
