Proceedings of the 2nd International Conference on Current Trends in Engineering and Management ICCTEM-2014, 17 – 19 July 2014, Mysore, Karnataka, India
emotion based on auditory impressions, and the Mean Opinion Score was collected. Speaker-emotion identification of the sample sentences was then carried out with a probabilistic neural network (PNN) and k-nearest neighbours (k-NN) using LPC, and the PRAAT software package was subsequently used to extract the pattern of acoustic parameters for the sample sentences [2].
II. EMOTIONAL DATABASE
Obtaining an emotional corpus is difficult in itself. Various methods have been used in the past, such as acted speech, speech obtained from movies or television shows, and speech recorded during event recall [2, 5, 6].
The database is composed of 4 different emotions (happy, sad, anger and fear) plus neutral, as uttered by two male Kannada actors, and consists of a total of 60 sentences of minimum 3 to maximum 7 words. The first step was to record the voice for each word and sentence; all recordings were made in a recording studio at a sample rate of 44100 Hz with a mono channel. The sentences used for statistical analysis are listed in Table 1.
Table 1: Sentences used in analysis (English glosses; the original Kannada script did not survive extraction)

Sent.  English gloss
S2     Long live like a wind.
S3     I am blessed, as I protected the lives of elders.
S4     I have fought and experienced with so many people like you.
S5     Aravinda is my disciple.
S6     I study during night time.
S7     He might be a Brahmin; there is no doubt about it.
S8     Father, who is that fellow who troubles us?
III. ANALYSIS
Pitch is strongly correlated with the fundamental frequency of the sound. It occupies a central place in the study of prosodic attributes, as it is the perceived fundamental frequency of the sound [3, 4, 8]. It can differ from the actual fundamental frequency because of the overtones inherent in the sound.
Fig. 1 to Fig. 5 show the pitch and intensity of Sentence 6 in the different emotions. Table 2 lists the mean pitch of each emotion, and Fig. 6 shows the variation of mean pitch across emotions: mean pitch is highest in fear and lowest in sadness compared with the other emotions.
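Pitch extraction in this work was done with PRAAT; purely as an illustration of the underlying idea, a minimal autocorrelation-based F0 estimator can be sketched as follows (the function, lag search range and synthetic test tone are our own assumptions, not the paper's procedure):

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=500.0):
    """Crude F0 estimate of a voiced frame via the autocorrelation peak."""
    frame = frame - np.mean(frame)
    # one-sided autocorrelation: ac[lag] for lag = 0 .. len(frame)-1
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lags corresponding to fmax..fmin
    lag = lo + int(np.argmax(ac[lo:hi]))      # strongest periodicity in range
    return sr / lag

# Synthetic 200 Hz tone standing in for a voiced frame (44100 Hz, as in the recordings)
sr = 44100
t = np.arange(int(0.04 * sr)) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 200.0 * t), sr)
```

PRAAT's own algorithm is considerably more robust (windowing, candidate paths, octave costs); this sketch only shows why the autocorrelation peak tracks the perceived pitch.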
Figure 1: Pitch and intensity of neutral sentence
Figure 2: Pitch and intensity of emotion (sad)
Figure 3: Pitch and intensity of emotion (fear)
Figure 4: Pitch and intensity of emotion (anger)
Figure 5: Pitch and intensity of emotion (happy)
Table 2: Mean pitch of sentences in different emotion (Hz)
Sent Neutral Sadness Fear Anger Happy
S1 129.12 119.71 209.53 189 140.4
S2 116.95 137.37 198.84 189 135.2
S3 123.33 131.45 195.83 210 176.3
S4 113.37 116.56 164.74 177 162.7
S5 125.55 156.28 226.61 195 172.5
S6 103.04 160.46 202.5 223 153.2
S7 108.97 124.59 192.17 174 127.7
S8 108.87 107.61 165.21 136 110
Table 3 lists the intensity of the different emotions, and Fig. 7 shows its variation: intensity is highest in anger and lowest in fear.
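The intensity values in Table 3 are frame intensities in dB. As a sketch of how such a value is derived from samples (assuming the samples are calibrated to pascals against the standard 20 µPa reference; with normalized digital audio the numbers are only meaningful in a relative sense):

```python
import numpy as np

P_REF = 2e-5  # 20 micropascal reference pressure

def intensity_db(frame):
    """Mean intensity of a frame in dB relative to 20 uPa."""
    rms = np.sqrt(np.mean(np.asarray(frame, dtype=float) ** 2))
    return 20.0 * np.log10(rms / P_REF)

# A full-scale sine (RMS = 1/sqrt(2)) comes out near 91 dB,
# the same order of magnitude as the values in Table 3.
t = np.arange(1000) / 1000.0
level = intensity_db(np.sin(2 * np.pi * 50.0 * t))
```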
Figure.6: Mean pitch of 8 sentences in different emotion
Table 3: Intensity of different emotion in dB
Sent.No Neutral Sad Fear Anger Happy
S1 85.64 84.94 88.78 90.88 90.39
S2 85.50 79.29 78.29 83.15 84.17
S3 87.33 84.82 87.70 89.17 90.51
S4 83.29 88.01 88.99 91.98 86.93
S5 86.39 86.98 89.16 91.30 90.61
S6 83.22 85.35 88.98 87.28 85.59
S7 88.92 86.48 88.00 92.16 85.74
S8 88.14 87.70 87.26 87.95 85.51
Figure 7: Intensity of different emotions
For analysis, the speech signal is decomposed into a number of frames, each of which may be voiced or unvoiced. While the voiced frames carry the prosodic features, the unvoiced frames carry excitation features along with the prosodic ones, so it is necessary to analyse the unvoiced frames as well.
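One simple way to separate voiced from unvoiced frames, sketched here with illustrative short-time energy and zero-crossing-rate thresholds (not the thresholds used in the paper), is:

```python
import numpy as np

def unvoiced_ratio(x, sr, frame_ms=25, hop_ms=10,
                   energy_thresh=0.1, zcr_thresh=0.3):
    """Fraction of frames judged unvoiced by energy and zero-crossing rate."""
    fl, hp = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + (len(x) - fl) // hp
    frames = np.stack([x[i * hp:i * hp + fl] for i in range(n)])
    energy = np.mean(frames ** 2, axis=1)
    # ZCR: proportion of adjacent-sample sign changes within each frame
    zcr = np.mean(np.diff(np.sign(frames), axis=1) != 0, axis=1)
    # unvoiced: weak relative to the loudest frame, or noise-like ZCR
    unvoiced = (energy < energy_thresh * energy.max()) | (zcr > zcr_thresh)
    return float(np.mean(unvoiced))

sr = 16000
t = np.arange(sr) / sr
voiced_like = np.sin(2 * np.pi * 150.0 * t)      # periodic, low ZCR
rng = np.random.default_rng(0)
unvoiced_like = rng.standard_normal(sr)          # noise-like, high ZCR
```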
Table 4 contains the percentage of unvoiced frames per sentence in all emotions. Fig. 8 shows that the proportion of unvoiced frames is highest in fear and lowest in happy compared with the other emotions. Sound pressure influences the intensity, which in turn affects the power at each formant. The power spectral density (PSD) of the different emotions is plotted in Fig. 9 and the sound pressure in Fig. 10. Irrespective of emotion, the lip radiation for a given sentence or utterance remains the same, but the rate of vocal-fold vibration changes across emotions, causing a smaller spectral tilt, which greatly influences the emotions. This indicates that not only prosodic features but also excitation-source features convey emotion. Fig. 11 shows the vocal-fold variations in the different emotions.
Table 4: Percentage of unvoiced frames in different emotions
Sent.No Neutral Sadness Fear Anger Happy
S1 17.88% 43.08% 54.37% 28.41% 25.73%
S2 31.14% 33.93% 39.02% 19.41% 24.74%
S3 14.86% 28.37% 29.43% 23.17% 27.32%
S4 30.77% 25.65% 43.56% 19.16% 20.15%
S5 34.04% 43.28% 50.00% 37.69% 38.53%
S6 29.44% 27.38% 53.40% 31.10% 30.09%
S7 23.61% 32.16% 41.76% 22.55% 27.25%
S8 25.94% 27.45% 29.13% 32.28% 40.29%
Figure 8: Percentage of unvoiced frames in different emotions
Figure 9: PSD in different emotions
Figure 10: Pressure of sound in different emotions
By analysing individual parameters such as intensity, pitch, number of unvoiced frames, sound pressure, PSD and vocal-fold influence, it is very difficult to characterize each emotion, and with the statistical variance of these values it is more difficult still. It is therefore necessary to design an envelope that considers all of the above characteristics. This can be done using LPC, LSF, MFCC or LFCC; in this work we make use of LPC.
Figure 11: Vocal fold variance in different emotions
Figure 12: Spectrogram of the neutral sentence
Figure 13: Spectrogram of Emotion (sad)
Figure 14: Spectrogram of Emotion (Fear)
Figure 15: Spectrogram of Emotion (Anger)
Figure 16: Spectrogram of Emotion (Happy)
The effects of excitation, which cannot be seen in prosodic analysis, can be seen in spectrogram analysis, carried out using nonparametric methods for non-stationary signals.
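Such a nonparametric spectrogram can be computed as a short-time periodogram, for example with SciPy; the synthetic stand-in signal and STFT parameters below are our own assumptions, not the paper's recordings:

```python
import numpy as np
from scipy.signal import spectrogram

sr = 44100                                  # recording rate used in the paper
t = np.arange(sr) / sr                      # 1 s synthetic "utterance"
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 200.0 * t) + 0.5 * rng.standard_normal(sr)

# Nonparametric time-frequency analysis (windowed short-time periodogram)
f, times, Sxx = spectrogram(x, fs=sr, nperseg=1024, noverlap=512)
peak_hz = float(f[np.argmax(Sxx.mean(axis=1))])   # dominant frequency bin
```

Averaging the spectrogram over time and locating its peak recovers the 200 Hz component despite the added noise, which is the kind of excitation-related structure the spectrogram figures make visible.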
IV. FEATURE EXTRACTION
The performance of an emotion classifier relies heavily on the quality of the speech data. LPC is a powerful speech-signal analysis technique: it determines the coefficients of a forward linear predictor by minimizing the prediction error in the least-squares sense. It has applications in filter design and speech coding, since LPC provides a good approximation of the vocal-tract spectral envelope. LPC finds the coefficients of a pth-order linear predictor (an FIR filter) that predicts the current value of the real-valued time series x from past samples.
Figure 17: Block diagram of LPC
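MATLAB's lpc, whose interface is described in the surrounding text, implements the autocorrelation method. A minimal stand-in using the Levinson-Durbin recursion can be sketched as follows (the synthetic AR(2) check is our own test case):

```python
import numpy as np

def lpc(x, p):
    """Order-p LPC via the autocorrelation method (Levinson-Durbin).
    Returns a = [1, a(2), ..., a(p+1)], as in MATLAB's lpc."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]  # r[0..p]
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= 1.0 - k * k                   # updated prediction-error power
    return a

# AR(2) test signal: x[n] = 0.5 x[n-1] - 0.3 x[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(21000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.5 * x[n - 1] - 0.3 * x[n - 2] + e[n]
a = lpc(x[1000:], 2)   # expect roughly [1, -0.5, 0.3]
```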
Here p is the order of the prediction filter polynomial, a = [1, a(2), ..., a(p+1)]. If p is unspecified, lpc uses the default p = length(x)-1. If x is a matrix containing a separate signal in each column, lpc returns a model estimate in each row of a coefficient matrix, together with a column vector of prediction-error variances g. The order p must be less than or equal to the length of x.
LPC analyses the speech signal by estimating the formants, removing their effect, and then estimating the intensity and frequency of the remaining buzz. The process is called inverse filtering, and what remains is called the residue. The excitation signal obtained from LPC analysis is mostly viewed as an error signal, and it contains higher-order relations: the strength of excitation, the characteristics of the glottal volume-velocity waveform, the shape of the glottal pulse, and the variance of the vocal folds.
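The inverse-filtering step just described can be sketched end-to-end: estimate the predictor polynomial A(z) from the signal, then filter the signal through A(z) so that only the residue remains. The least-squares LPC helper and the one-pole "vocal tract" are illustrative assumptions, not the paper's data:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_ls(x, p):
    """Order-p LPC by least squares; returns [1, -c1, ..., -cp]."""
    # row n-p of X holds [x[n-1], ..., x[n-p]] for n = p .. len(x)-1
    X = np.stack([x[p - k - 1:len(x) - k - 1] for k in range(p)], axis=1)
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return np.concatenate(([1.0], -coef))

sr = 8000
t = np.arange(sr) / sr
buzz = np.sin(2 * np.pi * 100.0 * t)        # crude periodic excitation
x = lfilter([1.0], [1.0, -0.9], buzz)       # shaped by a one-pole "tract"

a = lpc_ls(x, 8)
residual = lfilter(a, [1.0], x)   # inverse filtering: apply A(z) to x
# After the initial transient the residual energy is far below the signal's,
# showing that A(z) has absorbed the spectral envelope.
```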
V. EVALUATION
Evaluation is carried out in two ways.
Evaluation by listener: A perception test is conducted and the Mean Opinion Score (MOS) is taken; the main objective of the test is to validate the recorded voice for recognition of emotion. The perception test involved 25 people from various backgrounds. Sentences were played to the listeners in random order, and they were asked to identify the emotion expressed in each utterance, choosing from the list of 4 emotions along with neutral. The MOS of the test was then calculated.
Evaluation by classifier
Probabilistic neural network (PNN): The PNN is closely related to the Parzen-window probability density function (PDF) estimator. A PNN consists of several sub-networks, each of which is a Parzen-window PDF estimator for one of the classes. The input nodes take the set of measurements; the second layer consists of Gaussian functions centred on the given data points; the third layer averages the outputs of the second layer for each class; and the fourth layer performs a vote, selecting the largest value. The associated class label is then determined.
Figure 18: PNN classifier
In general, a PNN for M classes is defined as

y_j(x) = (1/n_j) * Σ_{i=1}^{n_j} exp( -||x_{j,i} - x||^2 / (2σ^2) )          ---------(1)

where n_j denotes the number of data points in class j and σ is the smoothing parameter of the Gaussian kernels. The PNN assigns x to class k if y_k(x) ≥ y_j(x) for all j ∈ [1, ..., M]; ||x_{j,i} - x||^2 is calculated as a sum of squares.
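A compact NumPy sketch of this four-layer PNN (Gaussian kernel per training point, per-class averaging, then an argmax vote); the toy two-class data and the σ value are illustrative assumptions:

```python
import numpy as np

def pnn_classify(X_train, y_train, x, sigma=0.5):
    """Assign x to the class whose Parzen-window average y_j(x) is largest."""
    best_class, best_score = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                         # sub-network for class c
        d2 = np.sum((Xc - x) ** 2, axis=1)                 # squared distances (layer 2)
        score = np.mean(np.exp(-d2 / (2.0 * sigma ** 2)))  # class average (layer 3)
        if score > best_score:                             # vote (layer 4)
            best_class, best_score = c, score
    return best_class

# Two well-separated toy clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 1, 1])
label = pnn_classify(X, y, np.array([0.2, 0.1]))
```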
k-Nearest neighbours (k-NN): In pattern recognition, the k-nearest-neighbours algorithm is a non-parametric method used for classification, and its output depends on the value of k.
In k-NN classification the output is a class membership. An object is classified by a majority vote of its neighbours, being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbour.
Figure 19: Block diagram of emotion recognition
In k-NN regression, the output is the property value for the object. This value is the average
of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning,
where the function is only approximated locally and all computation is deferred until classification.
The k-NN algorithm is among the simplest of all machine learning algorithms.
For classification it can be useful to weight the contributions of the neighbours, so that nearer neighbours contribute more to the vote than distant ones. A common weighting scheme, for example, gives each neighbour a weight of 1/d, where d is the distance to the neighbour.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or
the object property value (for k-NN regression) is known. This can be thought of as the training set
for the algorithm, though no explicit training step is required.
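The k-NN vote, including the optional 1/d weighting mentioned above, can be sketched in a few lines (the toy data and the tie-breaking details are our own assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3, weighted=False):
    """k-NN majority vote, optionally weighting each neighbour by 1/d."""
    d = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances
    idx = np.argsort(d)[:k]                          # k nearest neighbours
    if not weighted:
        return Counter(y_train[idx].tolist()).most_common(1)[0][0]
    votes = {}
    for i in idx:
        w = 1.0 / (d[i] + 1e-12)       # small epsilon avoids division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
label = knn_classify(X, y, np.array([0.1, 0.0]), k=3)
```

No training step is needed: the "training set" is simply stored and consulted at classification time, which is why k-NN is called a lazy learner.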
VI. RESULTS AND DISCUSSION
Evaluation of emotion
Evaluation by people: The confusion matrix created after calculating the MOS is shown in Table 5. The most recognised emotion was anger (91%), while the least recognised was fear (70%); fear was most often confused with sadness. The average emotion-recognition rate was 81%, and the order of recognition is anger > neutral > sadness > happy > fear.
Table 5: Confusion matrix of perception test
Category Neutral Sadness Fear Anger Happy
Neutral 89% 2% 1% 6% 2%
Sadness 4% 78% 11% 4% 3%
Fear 3% 18% 70% 7% 2%
Anger 5% 1% 1% 91% 2%
Happy 10% 1% 1% 11% 77%
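The reported figures can be checked directly from Table 5: the per-class recognition rates are the diagonal entries, their mean gives the 81% average, and sorting them reproduces the stated recognition order.

```python
import numpy as np

# Rows of Table 5 (percentages; rows = true emotion, columns = judged emotion)
conf = np.array([
    [89, 2, 1, 6, 2],
    [4, 78, 11, 4, 3],
    [3, 18, 70, 7, 2],
    [5, 1, 1, 91, 2],
    [10, 1, 1, 11, 77],
], dtype=float)
labels = ["neutral", "sadness", "fear", "anger", "happy"]

per_class = np.diag(conf)                      # correct-recognition rate per emotion
avg = float(per_class.mean())                  # average recognition rate
order = [labels[i] for i in np.argsort(-per_class)]  # best-recognised first
```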
Evaluation by classifiers: The LPC coefficients are fed as input to both algorithms for classification of emotions. The results obtained with the two methods are almost the same: as the number of coefficients and the value of k increase, accuracy in detecting emotions such as sadness and fear increases, but ambiguity in detecting the other emotions (neutral, happy, anger) also increases. As the number of coefficients and k decrease, accuracy in detecting neutral, happy and anger increases, while ambiguity remains between sad and fear.
Table 6: Confusion matrix of evaluation of emotions by k-NN and PNN
LPC = 50, k = 1
Category Neutral Sadness Fear Anger Happy
Neutral 70% 2% 5% 3% 20%
Sadness 30% 11% 6% 30% 23%
Fear 35% 10% 5% 25% 25%
Anger 12% 5% 8% 65% 10%
Happy 5% 2% 5% 20% 68%
LPC = 500, k = 5
Category Neutral Sadness Fear Anger Happy
Neutral 20% 2% 8% 30% 40%
Sadness 6% 69% 20% 5% 0%
Fear 2% 11% 68% 19% 0%
Anger 30% 5% 8% 22% 35%
Happy 20% 25% 5% 30% 20%
VII. CONCLUSION
In this paper, the prosodic and excitation features of Kannada speech have been analysed from spoken sentences for important categories of emotion. It has been observed that the prosodic features (F0, A0, D), along with the excitation parameters (PSD, sound pressure and vocal-fold variance), play a significant role in the expression of emotion. Evaluation has been conducted using the database created to express the emotions, with the excitation parameters used alongside the prosodic parameters to train the PNN and k-NN classifiers. The results show an ambiguity between neutral, anger and happy on the one hand and sad and fear on the other as the number of LPC coefficients and the value of k vary. This work can be enhanced using MFCC, LFCC and PFCC, and further studies should be conducted using a database created from natural conversations.