IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details, or to submit your article, please visit www.ijera.com

Published in: Technology
Subrata Mondal, Shiladitya Pujari, Tapas Sangiri / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 6, November-December 2012, pp. 123-129

Environmental Natural Sound Detection And Classification Using Content-Based Retrieval (CBR) And MFCC

Subrata Mondal1, Shiladitya Pujari2, Tapas Sangiri3
1 Technical Assistant, Department of Computer Science, Bankura Unnayani Institute of Engineering, Bankura, West Bengal, India
2 Assistant Professor, Department of IT, University Institute of Technology, Burdwan University, Burdwan, West Bengal, India
3 Assistant Professor, Department of IT, Bankura Unnayani Institute of Engineering, Bankura, West Bengal, India

ABSTRACT
This paper deals with the extraction and classification of environmental natural sounds with the help of a content-based retrieval method. Environmental sounds are extremely unpredictable and are very hard to classify and store in clusters according to their feature contents. Environmental sounds provide many contextual clues that enable us to recognize important aspects of our surrounding environment. This paper presents techniques that allow a computer system to extract and classify features from predefined classes of sounds in the environment.

Keywords- content-based retrieval, feature extraction, environmental sound, Mel frequency cepstral coefficient.

I. INTRODUCTION
The development of high-speed computers has created a tremendous opportunity to handle large amounts of data from different media. The proliferation of the World Wide Web has further accelerated this growth, and the trend is expected to continue at an even faster pace in the coming years. Even in the past decade, textual data was the major form of data handled by computers. Advances in technology have since led to the use of non-textual data such as video, audio and images, which are known as multimedia data, and a database that deals with this type of data is called a multimedia database. Such a database has much more capability than a traditional database. One of the best features of a multimedia database is content-based retrieval (CBR). The major advantage of content-based retrieval is that it can search data by its content rather than by textual indexing. This paper deals with the classification of environmental sounds with the help of content-based retrieval and Mel frequency cepstral coefficients (MFCC). Environmental sounds are extremely unpredictable, so it is really hard to classify them and store them in clusters according to their feature contents. Environmental sounds provide many contextual clues or characteristics that enable us to recognize important aspects of our surrounding environment. We present the different phases of the entire process: capturing the environmental audio, pre-processing of audio data, feature extraction, training and, lastly, the testing phase.

The major steps involved in the entire method are as follows:
  • Extraction of features for classifying highly diversified natural sounds.
  • Making clusters according to their feature similarity.
  • Finding a match for a particular sound query from the cluster.

The audio files used for testing vary in sample rate, bit rate and number of channels, but all files have been digitized to a common format, i.e. 22 kHz, 16 bit, mono channel. All the sounds were then normalized to -6 dB and their DC offset corrected to 0%. This preprocessing makes the low-level descriptors of the sound files comparable, so we can proceed without any drawback in the primary classification methods.

II. RELATED WORKS
A brief survey of previous works relevant to our work is presented here. Parekh R. [1] proposed a statistical method of classification based on the zero-crossing rate (ZCR) and root mean square (RMS) value of an audio file. For this purpose sounds were categorized into three semantic classes: 1. aircraft sound, 2. bird sound, 3. car sound. The first and third are inorganic in nature, whereas the second is organic. The author extracted samples, amplitudes and frequencies of those sounds. To generate the feature vector, the ZCR and RMS methods are used. Statistical measures are applied to these features and the values σZ, μZ, μR, σR, T, R are calculated.

Goldhor R. S. et al. [2] presented a classification technique based on two-dimensional cepstral coefficients. They segmented the sound wave into tokens of sound according to the signal power histogram. From the histogram statistics, a threshold power level was calculated to separate portions of
the signal stream containing the sound from portions containing only background noise. The result is derived from a high-SNR recording. As the feature vector they used two-dimensional cepstral coefficients. They calculated cepstral coefficients for each token using the Hamming window technique. They also used two types of cepstral coefficients: the linear cepstral coefficient and the Mel frequency cepstral coefficient.

In their paper, Toyoda Y. et al. [3] presented the use of neural networks to classify certain types of sound such as crashing sounds, clapping sounds, bell sounds, switch typing sounds, etc. All these sounds were classified into three categories: burst sound, repeated sound and continuous sound. For this purpose they fetched the environmental sounds from the sound database RWCP-OB. For further improvement, they developed a multistage neural network. In this work, a three-layer perceptron neural network is used. In the first step of recognition they used the long-time power pattern. This classifies the sound as a single impact. The second stage of classification was done by a combination of the short-time power pattern and the frequency at the power peak.

Hattori Y. et al. [4] proposed an environmental sound recognition system using LPC-cepstral coefficients for characterization and an artificial neural network as the verification method. This system was evaluated using a database containing files of different sound sources under a variety of recording conditions. Two neural networks were trained with the magnitude of the discrete Fourier transform of the LPC cepstral matrices. The overall verification rate was 96.66%. First they collected a common sound database. A segmentation algorithm is applied to each file. For this purpose they use a 240-point Hamming window with a frame overlap of 50%. The LPC cepstral coefficients are computed and the DFT is measured. For calculating the LPC cepstrum they used the Levinson-Durbin recursion.

Wang J. et al. [5] focused on repeat recognition in environmental sounds. For this purpose they designed a three-module system: 1) unit of the repetitions; 2) distance between units; 3) algorithm to detect the repeating part. For unit extraction they used the concept of a peak in the power envelope. They extracted peak power with the help of the following criteria:
1. The power is below the threshold T1.
2. The interval between the maximum and the neighbouring maximum is compared with the threshold T2.
3. The ratio of the power to the power of the neighbouring local minimum is compared with the threshold T3.
The distance between every pair of units was calculated using dynamic time warping (DTW) on MFCC.

The Mel frequency cepstral coefficient (MFCC) builds on the Mel scale introduced by Stevens and was designed to match human perception. They developed an algorithm based on approximate matching of repeating parts of the unit sequence. To separate patterns they use the fact that silent segments affect human perception and are recognized as the separators of perceived patterns; the length of the silent portion is determined dynamically from a histogram of silent-segment lengths using a threshold method. Finally, the distances between every pair of units are computed by DTW, which minimizes the total distance between units by aligning their time series.

Janku L. [6] used MPEG-7 low-level descriptors for sound recognition with the help of a hidden Markov model. The author used four low-level descriptors defined in MPEG-7: audio spectrum envelope, audio spectrum centroid, audio spectrum spread and audio spectrum flatness. The testing method described is a three-step procedure:
(a) Receiving the audio wave file and taking a 512-point short-time Fourier transform.
(b) Taking the associated parameters from the resulting spectrogram.
(c) Taking the SVD of the flatness and running the Viterbi algorithm to get the sound class candidate.

Chen Z. and Maher R. C. [7] dealt with automatic off-line classification and recognition of bird vocalizations. They focused on syllable-level discrimination of bird sounds. For this purpose they developed a set of parameters: spectral frequencies, frequency differences, track shape, spectral power and track duration. The parameters include the frequencies, the frequency differences, the relative shape and the duration of the spectral peak tracks. These parameters were selected based on several key insights. For feature extraction they had two choices: MPEG-7 audio features and Mel-scale Frequency Cepstral Coefficients (MFCC). For classification they had two choices: Maximum Likelihood Hidden Markov Models (ML-HMM) and Entropic Prior HMM (EP-HMM). They achieved an accuracy of 90%, with the best and second-best results coming from MPEG-7 features with EP-HMM and MFCC with ML-HMM, respectively.

In their paper, Constantine G. et al. [8] presented a comparison of three audio taxonomy methods for MPEG-7 sound classification. For the low-level descriptors they used, low-dimensional features based on spectral basis descriptors are produced in three stages:
  • Normalized audio spectrum envelope.
  • Principal component analysis.
  • Independent component analysis.

They used high-level description schemes to describe the modeling of audio features, the procedure of audio classification, and retrieval. For classification they tested three approaches:
  • The direct approach.
  • The hierarchical approach without hints.
  • The hierarchical approach with hints.

They suggested that the best approach is the hierarchical approach with hints, which results in a classification accuracy of around 99%. The direct approach produces the second-best results, and the hierarchical approach without hints the third-best results.

III. SOUND ANALYSIS TECHNIQUE
Undoubtedly the most important part of the entire process is extracting the features from the audio signal. Audio feature extraction in a categorization problem is about reducing the dimensionality of the input vector while maintaining the discriminating power of the signal. It is clear from the above discussion of environmental sound identification and verification systems that the number of training and test vectors needed for the classification problem grows with the dimension of the given input vector [2-3], so feature extraction is necessary. More variety in sound needs more variety in features for a particular genre of sound.

There are various techniques suitable for environmental sound recognition. Recognition of sound (whether speech or environmental) generally consists of two phases:
  • The feature extraction phase, followed by
  • The classification (using artificial intelligence techniques or any other) phase.

In the feature extraction phase, the sound is manipulated in order to produce a set of characteristics. The classification phase recognizes the sound by cataloguing the features of existing sounds during the training session and then comparing the test sound to this database of features (which is called the testing session). These two phases are described in the following text.

A. Feature Extraction
Feature extraction is mainly of two types: (a) frequency based feature extraction and (b) time-frequency based feature extraction.

The frequency based feature extraction method produces an overall result detailing the contents of the entire signal. According to Cowling [11], frequency extraction is stationary feature extraction and time-frequency extraction is non-stationary feature extraction.

There are eight popular stationary feature extraction methods:
  • Frequency extraction (music and speech).
  • Homomorphic cepstral coefficients.
  • Mel frequency cepstral coefficients (music and speech).
  • Linear prediction cepstral (LPC) coefficients.
  • Mel frequency LPC coefficients.
  • Bark frequency cepstral coefficients.
  • Bark frequency LPC coefficients.
  • Perceptual linear prediction (PLP) features.

Cepstral frequency based feature extractions are pseudo-stationary feature extraction techniques because they split the signal into many small parts in the time domain. Techniques based on LPC coefficients are based on the idea of a vocoder, which is a simulation of the human vocal tract. Since the human vocal tract does not produce environmental sounds, these techniques typically seem to highlight non-unique features in the sound and are therefore not appropriate for recognition of non-speech sounds.

The main non-stationary feature extraction techniques are:
  • Short-time Fourier transform (STFT).
  • Fast (discrete) wavelet transform (FWT).
  • Continuous wavelet transform (CWT).
  • Wigner-Ville distribution (WVD).

These techniques use different algorithms to produce a time-frequency representation of signals. The short-time Fourier transform uses standard Fourier transforms of the signal, whereas wavelet based techniques apply a mother wavelet to a waveform to surmount the resolution issues inherent in the STFT. The WVD is a bilinear time-frequency distribution.

Moreover, there are some other features which are suitable for classification of environmental sounds. These features are described below.

Zero-crossing rate:
The zero-crossing rate is a measure of the number of times the signal value crosses the zero axis. Periodic sounds tend to have a small zero-crossing rate, while noisy sounds tend to have a high one. It is computed in each time frame of the signal.

Spectral Features:
Spectral features are single-valued features, calculated using the frequency spectrum of a signal. Thus, for the time-domain signal x(t), with F denoting the Fourier transform:
$A(f) = F[x(t)]$

Spectral Centroid:
The spectral centroid (μ) is the barycenter of the spectrum. It is a weighted average of the
probability to observe the normalized amplitude. Given A(f) from the equation mentioned above:
$\mu = \int f \, p(f) \, \delta f$
where $p(f) = A(f) / \sum A(f)$.

Spectral spread:
The spectral spread (σ) is the spread of the spectrum around the mean value, calculated with the following equation:
$\sigma^2 = \int (f - \mu)^2 \, p(f) \, \delta f$

Spectral Roll-off:
The spectral roll-off point (fc) is the frequency below which 95% of the signal energy lies.

IV. DESIGN ARCHITECTURE
In the context of a multimedia database system, the major challenge is to classify environmental sounds, because they are so unpredictable in nature that they are hard to categorize into a small number of classes. The classification problem is concerned with determining whether two samples match each other under some similarity criterion, and with assigning samples to classes so that similar samples are kept in the same class. In a Content Based Retrieval (CBR) system we must consider how people might access a sound from the database. The query may be for one similar sound or for a group of sounds sharing some characteristics, for example the class of bird sounds or the class of car sounds, where the system has previously been trained on other sounds of a similar class. To achieve the desired result, we have divided the entire procedure into two sessions: a training session and a testing session.

In order to make an automated classification of environmental sound, the main part is to find the right features of the sound. In this paper, we have selected the Mel frequency cepstral coefficient (MFCC) to extract features from the sound signal and to compare the unknown signal with the existing signals in the multimedia database.

Figure 1: Mel Frequency Cepstral Coefficient pipeline

B. Sampling:
Sampling is the process of converting a continuous signal into a discrete signal. Sampling can be done for signals varying in space, time, or any other dimension, and similar results are obtained in two or more dimensions. For signals that vary with time, let x(t) be a continuous signal to be sampled, and let sampling be performed by measuring the value of the continuous signal every T seconds, where T is called the sampling interval. Thus, the sampled signal x[n] is given by:
x[n] = x(nT), with n = 0, 1, 2, 3, ...

The sampling frequency or sampling rate fs is defined as the number of samples obtained in one second, i.e. fs = 1/T. The sampling rate is measured in hertz or in samples per second.

C. Pre-emphasis:
In the processing of electronic audio signals, pre-emphasis refers to a system process designed to increase (within a frequency band) the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies, in order to improve the overall signal-to-noise ratio (SNR) by minimizing adverse effects.

D. Windowing:
In signal processing, a window function (also known as a tapering function) is a mathematical function that is zero-valued outside of some chosen interval. For instance, a function that is constant inside the interval and zero elsewhere is called a rectangular window, which describes the shape of its graphical representation.

Window Terminology:
  • N represents the width, in samples, of a discrete-time, symmetrical window function.
  • n is an integer, with values 0 ≤ n ≤ N-1.

1) Hamming window:
The Hamming window is also called the raised cosine window. The equation and plot for the Hamming window are shown below. When a signal or any other function is multiplied by a window function, the product is also zero-valued outside the interval. Windowing is done to avoid problems due to truncation of the signal. Window functions have other applications as well, such as spectral analysis, filter design and audio data compression.
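The pre-emphasis and windowing steps described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the pre-emphasis filter y[n] = x[n] - alpha * x[n-1] with alpha = 0.97, the Hamming coefficients 0.54 and 0.46, and the function names are all assumptions based on common practice, since the paper does not state them.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is a conventional value, assumed here (not from the paper)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def hamming(N):
    """Hamming window of width N, for 0 <= n <= N-1, using the commonly
    quoted coefficients 0.54 and 0.46 (assumed, the paper's equation is lost)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def frame_and_window(samples, frame_len, hop):
    """Cut the signal into overlapping frames and taper each with a Hamming window."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# Sampling x[n] = x(nT) of a 440 Hz tone at fs = 22050 Hz (fs = 1/T).
fs = 22050
signal = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(2205)]
# 240-sample frames with 50% overlap, the framing quoted in the related work.
frames = frame_and_window(pre_emphasis(signal), frame_len=240, hop=120)
```

Each frame produced this way would then be passed to the Fourier transform stage of the pipeline; the frame length and hop are parameters to tune, not values fixed by the method.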
Figure 2: Example of Hamming window

The raised cosine with these particular coefficients was proposed by Richard W. Hamming. The window is optimized to minimize the maximum (nearest) side lobe, giving it a height of about one-fifth that of the Hann window, a raised cosine with simpler coefficients. The function is plotted in the graph below.

Figure 3: Function is plotted

E. Cepstrum:
The name cepstrum was derived from "spectrum" by reversing its first four letters. The cepstrum can be described as the Fourier transform of the logarithm (with phase) of the Fourier transform of the signal. In the ordinary cepstrum the frequency bands are not positioned logarithmically. Because the frequency bands in MFCC are positioned logarithmically, MFCC approximates the response of the human auditory system more closely than linearly spaced bands do. These coefficients allow better processing of data. The Mel frequency cepstral coefficients are calculated in the same way as the real cepstrum, except that the frequency axis is warped to correspond to the Mel scale. The Mel scale was proposed by Stevens, Volkmann and Newman in 1937. The Mel scale is based on a study of how humans perceive pitch or frequency. The scale is divided into units called mel. In the test, the listener started out hearing a frequency of 1000 Hz, which was labeled 1000 mel. Listeners were asked to change the frequency until it was perceived as twice the pitch of the reference frequency. This frequency is then labeled 2000 mel. The process is repeated for half of the perceived pitch, which is labeled 500 mel, and so on. On this basis the normal frequency is mapped into the mel frequency. The mel scale is approximately linear below 1000 Hz and logarithmic above 1000 Hz. Transformation of normal frequency into mel frequency and vice versa can be obtained using the following formulas:

$m = 1127.01048 \, \ln\left(1 + \frac{f}{700}\right)$ ............ (1)

where m is the mel frequency and f is the normal frequency.

$f = 700 \left(e^{m / 1127.01048} - 1\right)$ ............ (2)

Here the mel frequency is transformed back into the normal frequency. Figure 4 shows the mapping of normal frequencies to mel frequencies.

Figure 4: Mapping of normal frequency into Mel frequency

F. Distance measure:
In the sound recognition phase, an unknown sound is represented by a sequence of feature vectors (x1, x2, ..., xi), and it is then compared with the other sounds from the database. The unknown sound is identified by measuring the distortion distance between two vector sets, based on minimizing the Euclidean distance. The Euclidean distance is the scalar distance between two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem. The Euclidean distance between two points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn) is:

$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
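Equations (1) and (2) and the Euclidean distance can be checked numerically with a short sketch. The constants come from the equations above; the function names, the `best_match` helper and the example vectors are hypothetical and serve only to illustrate nearest-reference matching.

```python
import math

MEL_CONSTANT = 1127.01048  # constant appearing in equations (1) and (2)

def hz_to_mel(f):
    """Equation (1): normal frequency f (Hz) to mel frequency m."""
    return MEL_CONSTANT * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Equation (2): mel frequency m back to normal frequency f (Hz)."""
    return 700.0 * (math.exp(m / MEL_CONSTANT) - 1.0)

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def best_match(query, reference_models):
    """Return the label whose reference vector has the lowest distance
    to the query (hypothetical helper, a dict of label -> vector is assumed)."""
    return min(reference_models,
               key=lambda label: euclidean(query, reference_models[label]))

# Illustrative two-dimensional reference vectors (made-up values).
references = {"bird": [1.0, 1.2], "car": [5.0, 5.0]}
label = best_match([1.0, 1.0], references)
```

With these formulas 1000 Hz maps to very nearly 1000 mel, which matches the anchoring of the scale described in the text.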
The sound signal with the lowest distortion distance is chosen as the identity of the unknown signal.

G. Fast Fourier Transformation:
FFTs are of great importance to a wide variety of applications, from digital signal processing and solving partial differential equations to algorithms for quick multiplication of large integers.

H. Absolute value:
In mathematics, the absolute value (or modulus) |a| of a real number a is the numerical value of a without its sign. The absolute value of a number may be thought of as its distance from zero. Generalizations of the absolute value for real numbers occur in a wide variety of mathematical settings. For example, an absolute value is also defined for the complex numbers, the quaternions, ordered rings, fields and vector spaces. The absolute value is closely related to the notions of magnitude, distance and norm in various mathematical and physical contexts.

I. DCT:
A discrete cosine transform (DCT) is a Fourier-related transform similar to the discrete Fourier transform (DFT), but it uses only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are commonly used.

J. Linear Discriminant Analysis (LDA):
Linear discriminant analysis (LDA) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.

LDA is closely related to ANOVA (analysis of variance) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. In the other two methods, however, the dependent variable is a numerical quantity, while for LDA it is a categorical variable (i.e. the class label). Logistic regression is more similar to LDA, as it also explains a categorical variable. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

V. TRAINING AND TESTING SESSIONS
As stated earlier, to achieve the best result the entire sound detection and classification process is divided into two sessions. One is the training session, during which the system is trained with known sound signals and the features of those known signals are extracted. The extracted features are then stored in the multimedia database as the reference model for future reference. Figure 5 shows the processes during the training session. From this figure we can see that the known sound signal is first re-sampled, then re-referenced and normalized. The normalized signal is then passed through the process that extracts the features. The system is then trained with the help of these features and the reference model is built.

Figure 5: Flow chart of Training Session

Through the training session, the multimedia database is populated with feature vectors of known sound signals so that those known environmental sounds can be detected and classified later. In the case of unknown signals, the signals are newly classified during the training session and detection can be done with that classification.
The other session of the detection and classification process is the testing session. During this session, depicted in Figure 6, the input signal is sampled, its DC offset is determined and then normalization is done. After normalization, features are extracted from the input signal. Then, for a particular sound, feature vectors are retrieved from the multimedia database using the content-based retrieval (CBR) method. The retrieved features are then matched with the feature vector of the input sound signal.

Figure 6: Flowchart of Testing Session

After feature extraction, the matching process is carried out and the feature vector of the input signal is matched against every reference model of the sound signals stored in the database. Whenever a maximum match is found, the input signal is said to be detected. This is the entire process of detection and classification of environmental sounds. In both sessions, feature extraction is done using the MFCC pipeline.

VI. CONCLUSION
This method of environmental sound detection and classification is developed using the MFCC pipeline for extraction of the features of a particular sound and CBR for retrieval of sound features from the multimedia database. The method can be applied in the domain of robotics, where sound detection and recognition may be possible up to a satisfactory level. If the method is properly combined with computer vision, the human-computer interaction process can be improved considerably. MFCC is an efficient feature extraction method because it is designed with emphasis on human perception. Using more than one feature of a sound may improve the performance of the method. Accuracy can also be boosted by applying a clustering technique. Another good feature available today is the audio spectrum projection provided by the MPEG-7 specification. Inclusion of this feature may increase the performance of the method.

REFERENCES
[1] Ranjan Parekh, "Classification Of Environmental Sounds", ICEMC2 2007, Second International Conference on Embedded Mobile Communication and Computing, ICEMC2 2007.
[2] R. S. Goldhor, "Recognition of Environmental Sounds", Proceedings of ICASSP, vol. 1, pp. 149-152, April 1993.
[3] Yoshiyuki Toyoda, Jie Huang, Shuxue Ding, Yong Liu, "Environmental Sound Recognition by Multi layered Neural Networks", The Fourth International Conference on Computer and Information Technology (CIT04), School of Education Technology, Jadavpur University, Kolkata 32, pp. 123-127, 2004.
[4] Yuya Hattori, Kazushi Ishihara, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno, "Repeat Recognition for Environmental Sounds", Proceedings of the 2004 IEEE International Workshop on Robot and Human Interactive Communication, Kurashiki, Okayama, Japan, September 20-22, pp. 83-88, 2004.
[5] Jia-Ching Wang, Jhing-Fa Wang, Kuok Wai He, and Cheng-Shu Hsu, "Environmental Sound Classification Using Hybrid SVM/KNN Classifier and MPEG-7 Audio Low-Level Descriptor", 2006 International Joint Conference on Neural Networks, Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada, July 16-21, 2006.
[6] Ladislava Janku, "Artificial Perception: Auditory Decomposition of Mixtures of Environmental Sounds - Combining Information Theoretical and Supervised Pattern Recognition Approaches", P. Van Emde Boas et al. (Eds.): SOFSEM 2004, LNCS 2932, pp. 229-240, 2004.
[7] Z. Chen, R. C. Maher, "Semi-automatic classification of bird vocalizations", J. Acoust. Soc. Am., Vol. 120, No. 5, pp. 2974-2984, November 2006.
[8] G. Constantine, A. Rizzi, and D. Casali, "Recognition of musical instruments by generalized min-max classifiers", in IEEE Workshop on Neural Networks for Signal Processing, pp. 555-564, 2003.