Mr. Saurabh Padmawar, Prof. Mrs. P.S. Deshpande / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 1, January-February 2013, pp. 1451-1454

"Classification of Speech Using MFCC and Power Spectrum"

Mr. Saurabh Padmawar, Prof. Mrs. P.S. Deshpande
Dept. of E&TC, S.K.N.C.O.E., Pune

Abstract
In human-machine interface applications, emotion recognition from the speech signal has been a research topic for many years, and many systems have been developed to identify emotions from speech. This paper reviews speech emotion recognition based on previous technologies that use different classifiers, and suggests which classifier is best suited to the task. The classifiers are used to differentiate emotions such as anger, happiness, sadness, surprise, and the neutral state. The database for the speech emotion recognition system consists of emotional speech samples, and the features extracted from these samples are the energy, pitch, linear prediction cepstrum coefficients (LPCC), Mel frequency cepstrum coefficients (MFCC), and the power spectrum. The classification performance is based on the extracted features. Inferences about the performance and limitations of speech emotion recognition systems based on the different classifiers are also discussed.

Keywords – Back propagation neural network, Mel frequency cepstrum coefficient, power spectrum.

I. Introduction
Speech and emotion recognition improve the quality of human-computer interaction and allow easier-to-use interfaces for every level of user in software applications. This paper is about an emotion recognition system that classifies voice signals. Speech is one of the oldest tools humans use to interact with each other, and it is therefore one of the most natural ways to interact with computers as well. Although speech recognition is now good enough to allow speech-to-text engines, emotion recognition can increase the overall efficiency of interaction and may provide everyone a more comfortable user interface.

It is often trivial for humans to perceive the emotion of a speaker and adjust their behaviour accordingly. Emotion recognition gives the programmer a chance to develop an artificial intelligence that responds to the speaker's feelings, which can be used in many scenarios from computer games to virtual sales programs.

Speech is one of the natural forms of communication. Emotional speech recognition aims at automatically identifying the emotional or physical state of a human being from his or her voice. The voice may be normal speech or speech produced with an emotion such as happiness, sadness, fear, or anger. Voices also differ between men and women in several aspects, such as speaking pitch, pitch range, the space between the vocal folds, formant frequencies, and the incidence of voice problems. Females speak with a higher fundamental frequency (voice pitch) than males [1]: the vocal folds vibrate, or come together, almost twice as many times per second in females as in males. Other differences in voice quality are caused by the way the vocal folds vibrate; usually males speak creakier than females, and females speak breathier than males.

Two base emotions, question/exclamatory and neutral, are taken into account. Various speech sets belonging to these emotion groups are used for training and testing, and the ERNN (emotion recognition neural network) should be capable of distinguishing the test samples. Neural networks are chosen for the solution because a basic formula cannot be devised for the problem. Neural networks are also quick to respond, which is a requirement since the emotion should be determined almost instantly. Training takes a long time, but this is not critical, as training will be performed both off-line and on-line.

II. Emotion Recognition System
Emotion recognition is not a new topic; both research and applications exist using various methods, most of which require extracting certain features from the speech [1]. A common problem is determining emotion from noisy speech, where the background noise complicates the extraction [2]. Emotion recognition from speech may be speaker dependent or speaker independent. The available classifiers include k-nearest neighbours (KNN), the Hidden Markov Model (HMM), the Support Vector Machine (SVM), the Artificial Neural Network (ANN), and the Gaussian Mixture Model (GMM); this paper reviews these classifiers [4]. Applications of speech emotion recognition include psychiatric diagnosis, intelligent toys, lie detection, call-centre conversations (the most important application of automated emotion recognition from speech), and in-car systems, where information about the mental state of the driver may be provided to the system to protect his or her safety [1].

Emotion recognition from a speaker's speech is difficult for several reasons.
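As a concrete illustration of the classifier families mentioned in Section II, the sketch below implements a minimal k-nearest-neighbour (KNN) classifier over pre-computed per-utterance feature vectors. The two-dimensional vectors, the emotion labels, and the choice of k are hypothetical placeholders for illustration only, not the paper's data:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query, k=3):
    """Label a query utterance by majority vote of its k nearest
    training utterances in feature space (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Hypothetical per-utterance feature vectors (e.g. averaged MFCCs) and labels.
train_feats = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
train_labels = ["neutral", "neutral", "question", "question"]

print(knn_classify(train_feats, train_labels, np.array([0.85, 0.15])))  # question
```

In practice the feature vectors would be normalised before the distance computation, since KNN is sensitive to feature scaling.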
First, it is not clear which particular speech features are most useful for differentiating between emotions. Second, the existence of different sentences, speakers, speaking styles, and speaking rates introduces acoustic variability that directly affects the speech features. Third, the same utterance may convey different emotions, and each emotion may correspond to different portions of the spoken utterance, which are very difficult to separate. Another problem is that the expression of emotion depends on the speaker and his or her culture and environment: as the culture and environment change, the speaking style also changes, which is another challenge for speech emotion recognition. Finally, there may be two or more kinds of emotion present, a long-term emotion and a transient one, so it is not clear which of them the recognizer will detect [1].

The following figure shows the proposed system for speech classification, which consists of different blocks, each with its own function. The feature extraction block uses parameters such as the power spectrum, MFCC, and LPCC. Similarly, the recognition algorithm block can use different algorithms; among them the ANN has the highest efficiency, so we will use an Artificial Neural Network as the classifier in the proposed system.

Input speech → Sampling → Feature extraction → Recognition algorithm → Output
Fig. 2.1 Proposed system for speech classification

III. FEATURE EXTRACTION
Any emotion in a speaker's speech is represented by a large number of parameters contained in the speech, and changes in these parameters result in corresponding changes in emotion. The extraction of speech features that represent emotions is therefore an important factor in a speech emotion recognition system [8]. Speech features can be divided into two main categories, long-term and short-term features. The region of analysis of the speech signal used for feature extraction is an important issue to be considered: the speech signal is divided into small intervals referred to as frames [8]. The prosodic features are known as the primary indicators of the speaker's emotional state. Research on emotional speech indicates that pitch, energy, duration, formants, power, Mel frequency cepstrum coefficients (MFCC), and linear prediction cepstrum coefficients (LPCC) are the important features [5, 6]. With a different emotional state, corresponding changes occur in the speaking rate, pitch, energy, and spectrum.

In feature extraction, not all of the basic speech features extracted may be helpful and essential for speech emotion recognition. Giving all of the extracted features as input to the classifier does not guarantee the best system performance, which shows that there is a need to remove the unhelpful features from the base set. Systematic feature selection is therefore needed to reduce the features. For this system we have used only two features, MFCC and the power spectrum.

IV. RECOGNITION ALGORITHM
In the speech emotion recognition system, after the features are calculated, the best features are provided to the classifier, which recognizes the emotion in the speaker's speech utterance. Various types of classifier have been proposed for speech emotion recognition: the Gaussian Mixture Model (GMM), k-nearest neighbours (KNN), the Hidden Markov Model (HMM), the Support Vector Machine (SVM), the Artificial Neural Network (ANN), and so on. Each classifier has advantages and limitations relative to the others.

The Gaussian Mixture Model is more suitable for speech emotion recognition only when global features are extracted from the training utterances. All the training and testing equations are based on the assumption that all vectors are independent, so a GMM cannot model the temporal structure of the training data. For the best features, a maximum accuracy of 78.77% could be achieved using a GMM; typical performance is 75% for speaker-independent recognition and 89.12% for speaker-dependent recognition [10].

In speech recognition tasks such as isolated word recognition and speech emotion recognition, the hidden Markov model is generally used; the main reason is its physical relation to the speech production mechanism. In speech emotion recognition, the HMM has achieved great success in modeling temporal information in the speech spectrum. The HMM is a doubly stochastic process consisting of a first-order Markov chain whose states are hidden from the observer [10]. For speech emotion recognition, typically a single HMM is trained for each emotion, and an unknown sample is classified according to the model that best explains the derived feature sequence [11]. The HMM has two important advantages: the temporal dynamics of the speech features can be captured, and well-established procedures are available for optimizing the recognition framework. The main problem in building an HMM-based recognition model is the feature selection process.
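Section III above describes dividing the speech signal into frames and extracting MFCC and power-spectrum features. As a concrete sketch, the standard MFCC pipeline for a single frame (Hamming window, power spectrum, triangular mel filterbank, log, DCT-II) can be written as follows; the 25 ms frame length, filter count, and the synthetic test tone are common textbook defaults used here for illustration, not the authors' exact settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    """MFCCs for one speech frame: window -> power spectrum
    -> mel filterbank -> log -> DCT-II."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2          # power spectrum
    # Triangular filters spaced evenly on the mel scale.
    hz_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_energies = np.log(fbank @ power + 1e-10)
    # DCT-II of the log filterbank energies gives the cepstral coefficients.
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)) @ log_energies

# A 25 ms frame of a synthetic 440 Hz tone at 16 kHz (hypothetical input).
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)  # (12,)
```

The resulting 12-dimensional vector per frame is the kind of feature sequence that the classifiers discussed in this section consume.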
It is not enough that the features carry information about the emotional states; they must also fit the HMM structure. The HMM provides better classification accuracies for speech emotion recognition than the other classifiers [9], although HMM classifiers using prosody and formant features have considerably lower recall rates than classifiers using spectral features [9]. In a previous study, the accuracy of speech emotion recognition using an HMM classifier was observed to be 46.12% for speaker-dependent recognition and 44.77% for speaker-independent recognition [1].

The main idea behind the support vector machine (SVM) classifier is to transform the original feature set into a high-dimensional feature space using a kernel function, which leads to optimum classification in the new feature space. Kernel functions such as the linear, polynomial, and radial basis function (RBF) kernels are widely used in SVM models. SVM classifiers are generally used in applications such as pattern recognition and classification problems, and for that reason they are also used in speech emotion recognition. The SVM has much better classification performance than other classifiers [10, 4]. The emotional states can be separated by a large margin using an SVM classifier; this margin is the width of the largest tube, containing no utterances, that can be obtained around the decision boundary. The support vectors are the measurement vectors that define the boundaries of the margin. The original SVM classifier was designed only for two-class problems, but it can be extended to more classes. Because its training is oriented towards structural risk minimization, the SVM has high generalization capability. The accuracy of the SVM is 65% for speaker-independent classification and above 80% for speaker-dependent classification [10, 7].

Another classifier used for emotion classification is the artificial neural network (ANN), chosen for its ability to find the nonlinear boundaries separating the emotional states. Of the many types, the back propagation neural network is used most frequently in speech emotion recognition [5]. Multilayer perceptron networks are relatively common in speech emotion recognition because they are easy to implement and have a well-defined training algorithm [10]. ANN-based classifiers may achieve a correct classification rate of 71.19% in speaker-dependent recognition and 72.87% in speaker-independent recognition. For this system we therefore choose an ANN classifier trained with the back propagation algorithm.

V. Conclusion
In this proposed work we will develop a neural network designed to classify voice signals with the help of MFCC and power spectrum features. In speech signal classification, many biological and emotional factors on the speaker's side are involved, so it is difficult to model mathematically all the parameters that characterize an emotion perfectly; a neural network approach, which relies on probabilistic training, is therefore better suited.

In future, this neural network will work with near-real-time applications such as computer games: role-playing games can use the player's acting capabilities as another input, which might improve the gaming experience. The system can also be extended to classifying gender for other emotional speech such as fear, happiness, anger, surprise, and disgust. By including more subjects from various ethnic groups, a user-independent, general system for gender classification from any type of speech, normal or emotional, can be developed. In conventional text-to-speech (TTS) the toning of speech is monotonic, which sounds very robotic; if this system is interfaced with such a TTS engine, it can help enhance its working and effectiveness.

REFERENCES
[1] M. S. Unluturk, K. Oguz, and C. Atay, "Emotion Recognition Using Neural Networks," Proceedings of the 10th WSEAS International Conference on Neural Networks, 2010.
[2] M. Murugappan, N. Q. I. Baharuddin, and S. Jerritta, "DWT and MFCC Based Human Emotional Speech Classification Using LDA," 2012 International Conference on Biomedical Engineering (ICoBE), Penang, 27-28 February 2012.
[3] J. A. Freeman and D. M. Skapura, Neural Networks: Algorithms, Applications and Programming Techniques, Addison Wesley Publishing Company, 1991.
[4] T. Masters, Practical Neural Network Recipes in C++, Academic Press Publishing Company, 1993.
[5] A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing, John Wiley & Sons Publishing Company, 1993.
[6] P. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, John Wiley & Sons, New York, 1994.
[7] M. P. Gelfer and V. A. Mikos, "The Relative Contributions of Speaking Fundamental Frequency and Formant Frequencies to Gender Identification Based on Isolated Vowels," Journal of Voice, 1 December 2005.
[8] L. Muda, M. Begam, and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol. 2, March 2010.
[9] A. B. Ingale and D. S. Chaudhari, "Speech Emotion Recognition," International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, vol. 2, issue 1, March 2012.
[10] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases," Pattern Recognition 44, pp. 572-587, 2011.
[11] T. Vogt, E. André, and J. Wagner, "Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realization," LNCS 4868, pp. 75-91, 2008.

AUTHORS
Mr. Saurabh Padmawar, ME, S.K.N.C.O.E., Pune.
Prof. Mrs. Pallavi S. Deshpande, M-Tech, S.K.N.C.O.E., Pune.