Classifying Gender and Emotions from Voice Using Machine Learning

© May 2019 | IJIRT | Volume 5 Issue 12 | ISSN: 2349-6002
IJIRT 148155 | INTERNATIONAL JOURNAL OF INNOVATIVE RESEARCH IN TECHNOLOGY | 190
Novel Methodologies for Classifying Gender and Emotions
Using Machine Learning Algorithms
G. Neema¹, Mr. C. J. Profun, M.E. (Ph.D.)²
¹PG Student, Department of Electronics and Communication Engineering,
²Assistant Professor, Department of Electronics and Communication Engineering,
DMI College of Engineering, Chennai, Tamil Nadu, India
Abstract- In this paper, we propose a system that detects and classifies emotion and gender from the human voice. Security plays a very important role in the real world, and to improve it we combine emotion and gender detection in a single system. In existing systems, gender is classified using images or video only, and emotion detection is done separately. The proposed system works on voice: the noise is first classified, noise removal is based on a Hidden Markov Model (HMM), feature extraction uses the Discrete Wavelet Transform (DWT), and finally classification is done using K-nearest neighbours (KNN). The reported performance is 97%, compared with 75%-86% for the existing systems. The final output is audio selected according to the detected emotion.
Index Terms- emotion and gender detection, Discrete Wavelet Transform, Hidden Markov Model (HMM), K-nearest neighbours (KNN).
I. INTRODUCTION
human life. From a human perspective, emotions are an essential medium for expressing one's psychological state to others. Humans have a natural ability to recognize the emotions of their communication partner using all their available senses: they hear the sound, they read lips, and they interpret gestures and facial expressions. Humans can readily recognize an emotion from spoken words, but a machine has no built-in capability to analyze emotions in a speech signal, so emotion recognition from speech is a very difficult task for a machine. Automatic emotion recognition focuses on identifying the emotional state of a speaker from the voice signal. Emotion plays a key role in better decision making, and there is a clear need for intelligent machine-human interfaces.
Speech emotion recognition is a complex task because, for a given speech sample, there are a number of tentative answers for the recognized emotion. The vocal emotions may be acted or elicited from “real” life situations. Emotion recognition through speech means identifying the human emotional state from the voice signal or from features extracted from the speech signal. It is principally useful for applications that require natural machine-human interaction, such as e-tutoring, electronic machine pets, storytelling and intelligent sensing toys, and for in-car board systems, where knowing the detected emotion of the user makes the system more practical. Emotion recognition from the speech signal is useful for enhancing the naturalness of speech-based human-machine interaction.
Automatic emotion recognition through speech has further applications. A speech emotion recognition system can be used in aircraft cockpits to analyze the psychological state of the pilot and so help avoid accidents. Speech emotion recognition systems are also used to recognize stress in speech for better lie detection performance; in call centre conversations, to study the behaviour of customers and so help improve the quality of service of a call attendant; and in the medical field, for psychiatric diagnosis. Emotion analysis of conversations between criminals would help crime investigation departments. If machines were able to understand emotions as humans do, conversation with robotic toys would be more realistic and enjoyable, and interactive movies and remote teaching would be more practical.
Emotion recognition from a speaker's voice is difficult for several reasons. Differences in speaking styles, speakers, sentences, languages and speaking rates introduce acoustic variability that affects the voice features, so a particular feature of speech may not be able to distinguish between the various emotions. In addition, each emotion may correspond to different portions of the spoken utterance. In this project, a K-nearest neighbour classifier is used to classify six basic emotional states: anger, happiness, sadness, fear, disgust and the neutral state, in which no distinct emotion is observed.
At present, the use of emotion in computing is an increasingly important field within human-computer interaction. Affective computing is becoming a focus of interactive technological systems and is increasingly essential for communication, decision-making and behavior. There is a rising need for emotional state recognition in several domains, such as health monitoring, video games and human-computer interaction.
Fig 1: Three Layer Model
II. LITERATURE REVIEW
Speech emotion recognition systems differ in the features they extract and in the classifiers they use for classification. Both spectral features and prosodic features are used for recognizing emotion from the speech signal, because both contain a large amount of emotional information. Spectral features include Mel-frequency cepstrum coefficients (MFCC) and linear predictive cepstrum coefficients (LPCC). Prosodic features include formants, fundamental frequency, loudness, pitch, energy, speech intensity and glottal parameters. Semantic labels and linguistic and phonetic features are also used for detecting emotions through speech.
Various types of classifiers are used for emotion recognition in order to make human-machine interaction more powerful, such as the Gaussian Mixture Model (GMM), k-nearest neighbours (KNN), Hidden Markov Model (HMM), Artificial Neural Network (ANN), GMM supervector based SVM classifier, and Support Vector Machine (SVM). A. Bombatkar et al. studied the K-nearest neighbour classifier, which gave a classification accuracy of up to 86.02% using energy, entropy, MFCC, ZCC and pitch features. Xianglin et al. performed emotion classification using GMM and obtained a recognition rate of 79% for the best features. With GMM, a typical performance of 75% was obtained for speaker-independent recognition and 89.12% for speaker-dependent recognition, in a study limited to pitch and MFCC features. M. Khan et al. performed emotion classification using a K-NN classifier with forward feature selection and obtained an average accuracy of 91.71%, while the SVM classifier had an accuracy of 76.57%; they also show SVM classification for the neutral and fear emotions.
III. PROPOSED ARCHITECTURE
The proposed system combines emotion and gender detection and classification using the human voice signal. The voice signal is given as the input, a pre-processing step is carried out, and the noise is then classified according to its frequency range. The Fast Fourier Transform (FFT) converts the signal from the time domain to the frequency domain, and the noise is then removed from the original signal using a Hidden Markov Model (HMM).
An HMM is a powerful statistical tool with many practical applications in temporal pattern recognition, including speech enhancement, speech de-noising, speech recognition and related tasks. At present there is a limited number of efficient approaches to single-channel speech denoising (i.e., where only one sensor or microphone is available in the system under consideration). The HMM-based approach provides a viable alternative to methods such as spectral subtraction and is in many ways considered more powerful. The main reason is that spectral subtraction assumes that the distractor (i.e., the undesired signal, such as noise) is stationary, whereas the HMM is not bound by this limiting assumption: it is intended to work with non-stationary distractors as well.
Feature extraction is then done using the Discrete Wavelet Transform (DWT). Feature selection prior to classification plays a vital role, and the feature selection technique used combines the DWT with a moving window technique. The approximation coefficients of the DWT, together with some useful features from the high-frequency coefficients selected by the maximum modulus method, are used as features. A novel way to think of microarray data is as a set of signals: the number of genes is the length of the signals, and hence signal processing techniques such as the wavelet transform can be used to perform microarray data analysis.
Finally, classification is done with the KNN classifier. KNN is a non-parametric, lazy learning algorithm. Non-parametric means that no assumption is made about the underlying data distribution; in other words, the model structure is determined from the dataset. This is very helpful in practice, where most real-world datasets do not follow theoretical mathematical assumptions. Lazy means that the algorithm does not build a model from the training data points in advance: all the training data is used in the testing phase. This makes training faster and the testing phase slower and costlier in both time and memory. In the worst case, KNN must scan all data points to classify one sample, and all the training data must be kept in memory.
BLOCK DIAGRAM:
Fig 2: Overall block Diagram
IV. IMPLEMENTATION
A. PRE-PROCESSING
In speech processing it is often advantageous to divide the signal into frames to achieve stationarity. This section describes how to split speech into frames and how to combine the frames into a speech signal. Normally a speech signal is not stationary, but seen from a short-time point of view it is. This results from the fact that the glottal system cannot change immediately. XXX states that a speech signal typically is stationary in windows of 20 ms.
When the signal is framed, it is necessary to consider how to treat the edges of the frame, because the abrupt edges add harmonics. It is therefore expedient to use a window to tone down the edges. As a consequence, the samples are not assigned the same weight in the following computations, and for this reason it is prudent to use an overlap.
Fig 3: Illustration of Framing
Figure 3 shows how a speech signal is divided into frames. Each frame shares its first part with the previous frame and its last part with the next frame. The frame step t_fs is the time between the start times of consecutive frames. The overlap is defined as the time from when a new frame starts until the current frame stops.
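
The framing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `hamming` and `split_into_frames`, the 20 ms frame length, the 10 ms frame step and the choice of a Hamming window are all chosen for the example.

```python
import math


def hamming(n):
    """Hamming window of length n, used to tone down the frame edges."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def split_into_frames(signal, sample_rate, frame_ms=20.0, step_ms=10.0):
    """Split a signal into overlapping, windowed frames.

    frame_ms is the frame length and step_ms the frame step (t_fs);
    the overlap is frame_ms - step_ms. Both defaults are illustrative.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# Example: 100 ms of a 200 Hz tone sampled at 8 kHz (synthetic test data).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
frames = split_into_frames(signal, sample_rate)
print(len(frames), len(frames[0]))
```

With a frame step of half the frame length, every sample away from the signal boundaries falls in two frames, which compensates for the low weight the window gives to samples near a frame edge.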
B. FAST FOURIER TRANSFORM:
The FFT is a fast algorithm for computing the DFT. If we take the 2-point DFT and the 4-point DFT and generalize them to 8-point, 16-point, ..., 2^r-point, we get the FFT algorithm. Computing the DFT of an N-point sequence using equation (1) would take O(N^2) multiplies and adds. The FFT algorithm computes the DFT using O(N log N) multiplies and adds. There are many variants of the FFT algorithm. We discuss one of them, the “decimation-in-time” FFT algorithm for sequences whose length is a power of two (N = 2^r for some integer r). The FFT algorithm decomposes the DFT into log2 N stages, each of which consists of N/2 butterfly computations. Each butterfly takes two complex numbers p and q and computes from them two other numbers, p + αq and p − αq, where α is a complex number.
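
A decimation-in-time FFT of the kind described in this subsection can be sketched recursively. The sketch below is illustrative only: the function names `fft` and `dft` are chosen for the example, and the direct DFT is included solely to check the fast version against the O(N^2) definition.

```python
import cmath


def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]
    if n % 2 != 0:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # alpha is the complex factor of the butterfly for output index k.
        alpha = cmath.exp(-2j * cmath.pi * k / n)
        p, q = even[k], odd[k]
        out[k] = p + alpha * q
        out[k + n // 2] = p - alpha * q
    return out


def dft(x):
    """Direct O(N^2) DFT, used here only to verify the FFT."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
fast, slow = fft(x), dft(x)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

Each level of the recursion performs N/2 butterflies, and there are log2 N levels, which is where the O(N log N) operation count comes from.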

C. HIDDEN MARKOV MODEL:
As mentioned in the introduction, the method used for speech recognition is the Hidden Markov Model (HMM). With this method a model is trained to represent an utterance of a spoken word. The trained model is later used to test an utterance by calculating the probability that the model generated the observed vector sequence.
Once the MFCCs have been obtained, all the given training utterances need to be normalized. The feature matrix is then divided according to the number of states and the number of coefficients, and the resulting parts are used to calculate the mean and variance. During experimentation with the number of iterations in the re-estimation of A, the final estimated values of A were seen to deviate considerably from the initial estimates. The initial values of A are therefore set instead to values that are closer to the re-estimated ones (the re-estimation problem is dealt with later in this section). Changing the initial values is not critical, because the re-estimation procedure adjusts the values to the correct ones.
D. FEATURE EXTRACTION: DISCRETE WAVELET TRANSFORM:
The Wavelet Transform (WT) is a technique for analyzing signals. It was developed as an alternative to the Short-Time Fourier Transform (STFT) to overcome problems related to the STFT's frequency and time resolution properties. More specifically, unlike the STFT, which provides uniform time resolution for all frequencies, the DWT provides high time resolution and low frequency resolution for high frequencies, and high frequency resolution and low time resolution for low frequencies. In that respect it is similar to the human ear, which exhibits similar time-frequency resolution characteristics. The Discrete Wavelet Transform (DWT) is a special case of the WT that provides a compact representation of a signal in time and frequency and can be computed efficiently.
Viewed as a multirate filter bank, the DWT is a constant-Q filter bank with octave spacing between the centres of the filters. Each subband contains half the samples of the neighbouring higher-frequency subband. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolutions by decomposing it into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time-domain signal.
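
The pyramidal algorithm can be illustrated with the simplest wavelet. The sketch below assumes the Haar wavelet, which the paper does not name, and the function names `haar_step` and `haar_dwt` are chosen for the example; it is meant only to show the repeated low-pass and high-pass split, not the authors' implementation.

```python
import math


def haar_step(signal):
    """One decomposition step with the Haar wavelet.

    Returns the low-pass output (coarse approximation) and the high-pass
    output (detail), each downsampled by two. An even length is assumed;
    a trailing odd sample would be dropped.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def haar_dwt(signal, levels):
    """Pyramidal DWT: repeatedly decompose the coarse approximation.

    Returns [approx_L, detail_L, ..., detail_1], so each subband holds
    half the samples of the neighbouring higher-frequency subband.
    """
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # synthetic test data
coeffs = haar_dwt(signal, levels=3)
print([len(c) for c in coeffs])
```

Because the Haar filters used here are orthonormal, the total energy of the coefficients equals the energy of the input signal, which is a convenient sanity check for any implementation.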
The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. To further reduce the dimensionality of the extracted feature vectors, statistics over the set of wavelet coefficients are used. In this way the statistical characteristics of the “texture” or “music surface” of the piece can be represented. For example, the distribution of energy in time and frequency for music is different from that of speech.
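
Reducing each subband to a few statistics might look like the sketch below. The paper does not say which statistics are used, so the mean absolute value, standard deviation and energy are assumptions, as are the function names and the hypothetical coefficient values.

```python
import math


def subband_statistics(subband):
    """Mean absolute value, standard deviation and energy of one subband."""
    n = len(subband)
    mean_abs = sum(abs(c) for c in subband) / n
    mean = sum(subband) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in subband) / n)
    energy = sum(c * c for c in subband)
    return [mean_abs, std, energy]


def feature_vector(subbands):
    """Concatenate the statistics of every subband into one feature vector."""
    features = []
    for band in subbands:
        features.extend(subband_statistics(band))
    return features


# Hypothetical wavelet coefficients for three subbands.
subbands = [[3.0, -1.0], [0.5, -0.5, 1.5, -1.5], [2.0, 2.0, 2.0, 2.0]]
features = feature_vector(subbands)
print(len(features))
```

The feature vector has a fixed length (three values per subband) regardless of how many coefficients each subband holds, which is what makes it suitable as input to a distance-based classifier.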
E. CLASSIFICATION: KNN CLASSIFIER:
In pattern recognition, the k-Nearest Neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases the input consists of the k closest training examples in the feature space; the output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer). If k = 1, the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. For both classification and regression, it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor.
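
The majority vote and the 1/d weighting can be sketched in a few lines. The function name `knn_classify`, the Euclidean distance, the two-dimensional feature vectors and the labels are all assumptions made for the illustration; they are not taken from the paper's experiments.

```python
import math
from collections import defaultdict


def knn_classify(train_x, train_y, query, k=3, weighted=False):
    """Classify `query` by a vote among its k nearest training points.

    With weighted=True each neighbour votes with weight 1/d, where d is
    its Euclidean distance to the query; otherwise every vote counts 1.
    """
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for d, label in distances[:k]:
        if weighted:
            # An exact match (d == 0) dominates the vote.
            votes[label] += 1.0 / d if d > 0 else float("inf")
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical two-dimensional feature vectors and emotion labels.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9)]
train_y = ["neutral", "neutral", "neutral", "angry", "angry"]
query = (2.8, 2.9)

print(knn_classify(train_x, train_y, query, k=5))
print(knn_classify(train_x, train_y, query, k=5, weighted=True))
```

With k = 5 the unweighted vote is decided by the three distant "neutral" points, while the 1/d weighting lets the two nearby "angry" points win; this is the situation in which weighting the neighbors is useful.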
The k-Nearest-Neighbours (kNN) method of
classification is one of the simplest methods in machine
learning, and is a great way to introduce yourself to
machine learning and classification in general. At its
most basic level, it is essentially classification by
finding the most similar data points in the training data,
and making an educated guess based on their
classifications.

V. RESULTS
Several experiments were performed to evaluate the accuracy of the classifiers in determining emotion and gender. In the DWT analysis, the feature that gives the highest classification accuracy for the two emotions angry and excited is the HMM. The KNN classifier is used for gender classification. With the K-nearest neighbour classifier, the average correct classification of emotions is 97%. The feature extracted from the speech signal is the HMM, and the KNN classifier is employed to obtain the emotion class label. The KNN experiments were conducted in MATLAB, and all results are based on cross-validation [13]. The acoustic features were taken from the best feature combination found among all features by this classifier. Experiments were also conducted using MATLAB and RapidMiner with a Naive Bayes classifier, and all results are based on cross-validation. The acoustic features, such as shimmer, jitter, energy and pitch, were taken from the best feature combination found among all features by this classifier [14].
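
For readers unfamiliar with cross-validation, the idea can be sketched as follows. This is an illustration only: the paper does not state the number of folds or how they were formed, so the 5 interleaved folds, the 1-nearest-neighbour classifier, the function names and the synthetic data are all assumptions.

```python
import math


def nearest_label(train, query):
    """1-NN: label of the training sample closest to the query."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def cross_validate(samples, folds=5):
    """k-fold cross-validation accuracy of a 1-NN classifier.

    `samples` is a list of (feature_vector, label) pairs. Fold f tests on
    every folds-th sample starting at index f and trains on the rest.
    """
    correct = 0
    for f in range(folds):
        test = samples[f::folds]
        train = [s for i, s in enumerate(samples) if i % folds != f]
        correct += sum(nearest_label(train, x) == y for x, y in test)
    return correct / len(samples)


# Synthetic, well-separated data for two hypothetical classes.
samples = [((i * 0.1, 0.0), "a") for i in range(5)]
samples += [((5.0 + i * 0.1, 5.0), "b") for i in range(5)]
accuracy = cross_validate(samples, folds=5)
print(accuracy)
```

Every sample is used for testing exactly once and is never part of the training set of its own fold, so the resulting accuracy is an estimate of performance on unseen data.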
1. INPUT SIGNAL
2. OUTPUT: MALE:
3. FEMALE:
4. PERFORMANCE MATRIX
VI. CONCLUSION
The proposed system combines emotion and gender detection based on voice. It uses noise classification, noise removal based on a Hidden Markov Model (HMM), feature extraction using the Discrete Wavelet Transform, and finally classification using K-nearest neighbours (KNN). It gives more accurate output than the other classifiers it was compared with. The limitation of the existing system, which detects only two emotions, is overcome by the proposed system.
REFERENCES
[1] M. E. Ayadi, M. S. Kamel and F. Karray, “Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases”, Pattern Recognition, 44 (16), 572-587, 2011.
[2] A. S. Utane and S. L. Nalbalwar, “Emotion Recognition through Speech Using Gaussian Mixture Model & Support Vector Machine”, International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May 2013.
[3] I. Chiriacescu, “Automatic Emotion Analysis Based On Speech”, M.Sc. Thesis, Department of Electrical Engineering, Delft University of Technology, 2009.
[4] N. Thapliyal and G. Amoli, “Speech based Emotion Recognition with Gaussian Mixture Model”, International Journal of Advanced Research in Computer Engineering & Technology, Volume 1, Issue 5, July 2012.
[5] Y. Zhou, Y. Sun, J. Zhang and Y. Yan, “Speech Emotion Recognition using Both Spectral and Prosodic Features”, IEEE, 23(5), 545-549, 2009.
