Environmental Sound Detection Using MFCC Technique
ENVIRONMENTAL SOUND DETECTION AND CLASSIFICATION USING
CONTENT BASED RETRIEVAL (CBR) AND MFCC
Project Mentor: Shiladitya Pujari
Project group member: Parth Sinha (20093043)
scope & conclusion
To develop an Environmental Sound Detection &
Classification technique (using Content Based
Retrieval & MFCC) so that a computer system can
predict and understand “SOUND” more
To make computer systems more intelligent &
reliable in understanding their environment based
on this technique.
WHAT ARE MFCCS?
In sound processing, the Mel-frequency cepstrum (MFC) is a
representation of the short-term power spectrum of a sound, based
on a linear cosine transform of a log power spectrum on a
nonlinear Mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients
that collectively make up an MFC. They are derived from a type
of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum").
The difference between the cepstrum and the Mel-frequency
cepstrum is that in the MFC, the frequency bands are equally spaced
on the Mel scale, which approximates the human auditory system's
response more closely than the linearly-spaced frequency bands used
in the normal cepstrum. This frequency warping can allow for better
representation of sound, for example, in audio compression.
MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the Mel
scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the Mel frequencies.
4. Take the discrete cosine transform of the list of Mel log powers,
as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
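The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the Hamming window, the mel formula 2595·log10(1 + f/700), the filter count, the number of coefficients, and the 440 Hz test tone are all choices made here for the example.

```python
import cmath
import math

def hz_to_mel(f):
    # A common mel-scale formula (assumed here; other variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step 1: window the frame (Hamming) and take the DFT power (naive O(N^2))."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(s) ** 2 / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Step 2: triangular overlapping filters equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=10, n_coeffs=5):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Step 3: log of the energy in each mel band (floored to avoid log(0)).
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in bank]
    # Steps 4-5: DCT-II of the log energies; its amplitudes are the MFCCs.
    m = len(log_e)
    return [sum(log_e[i] * math.cos(math.pi * k * (i + 0.5) / m) for i in range(m))
            for k in range(n_coeffs)]

# Hypothetical input: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = mfcc(frame, 8000)
```

The naive DFT keeps the sketch dependency-free; a real system would use an FFT and process many overlapping frames per clip.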
MFCCs are commonly used as features in speech
recognition systems, such as the systems which can
automatically recognize numbers spoken into a telephone. They
are also common in speaker recognition, which is the task of
recognizing people from their voices.
MFCCs are also increasingly finding uses in music information
retrieval applications such as genre classification,
audio similarity measures, etc.
Content Based Retrieval (CBR) means that retrieval
and search are based on analysis
of the actual contents of the data (here, sound)
rather than metadata such as keywords, tags
and/or descriptions associated with the sounds.
In our project we use a multimedia database
that provides Content Based Retrieval.
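As a rough illustration of retrieval by content rather than by metadata, the sketch below ranks stored sounds by the Euclidean distance between feature vectors. The database, its labels, the feature values, and the distance measure are hypothetical; the multimedia database used in the project may work differently.

```python
import math

# Hypothetical database: label -> feature vector (e.g. averaged MFCCs).
database = {
    "rain": [1.0, 0.2, -0.5],
    "dog_bark": [-0.8, 1.5, 0.3],
    "car_horn": [0.1, -1.2, 2.0],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, db, k=1):
    """Return the k labels whose stored features are closest to the query."""
    ranked = sorted(db, key=lambda label: euclidean(query, db[label]))
    return ranked[:k]

print(retrieve([0.9, 0.3, -0.4], database))  # ['rain']
```

The query is matched on the features themselves, so no keyword or tag is needed for the lookup.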
The major steps involved in the entire method
are as follows:
of feature for classifying highly diversified
clusters according to their feature similarity.
a match for a particular sound query from the
First we take the input sound (an audio signal of any format).
Then some preprocessing will be done to normalize the
Feature Extraction of the audio signal.
Next will be the Classification phase (consisting of two
Fig: Mel Frequency Cepstral Coefficient pipeline
Sampling is the process of converting a continuous signal into a discrete signal. Sampling can be done for
signals varying in space, time, or any other dimension, and similar results are obtained in two or
more dimensions.
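A minimal sketch of sampling in time: a continuous signal, modelled here as a Python function of time, is evaluated at evenly spaced instants. The 5 Hz tone and the 100 Hz sample rate are arbitrary values chosen for the example.

```python
import math

def sample(signal, sample_rate, duration):
    """Evaluate a continuous-time function at t = n / sample_rate, n = 0, 1, ..."""
    n_samples = int(round(sample_rate * duration))
    return [signal(n / sample_rate) for n in range(n_samples)]

def tone(t):
    # Hypothetical continuous signal: a 5 Hz sine wave.
    return math.sin(2 * math.pi * 5.0 * t)

samples = sample(tone, sample_rate=100, duration=1.0)
```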
In processing of electronic audio signals, pre-emphasis refers to a system process designed to
increase (within a frequency band) the magnitude of some (usually higher) frequencies with respect
to the magnitude of other (usually lower) frequencies in order to improve the overall signal-to-noise
ratio (SNR) by minimizing the adverse effects.
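In MFCC front ends, pre-emphasis is often realised as a first-order high-pass filter, y[n] = x[n] - alpha * x[n-1]. The sketch below assumes that form and a coefficient of 0.97, a commonly quoted value; neither is stated in the slides.

```python
def pre_emphasis(samples, alpha=0.97):
    """Boost high frequencies: y[0] = x[0], y[n] = x[n] - alpha * x[n-1]."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

# A constant (low-frequency) input is strongly attenuated after the first sample,
# while a rapidly alternating (high-frequency) input is amplified.
flat = pre_emphasis([1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```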
In signal processing, a window function (also known as tapering function) is a mathematical
function that is zero-valued outside of some chosen interval. For instance, a function that is
constant inside the interval and zero elsewhere is called a rectangular window, which describes the
shape of its graphical representation.
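The sketch below builds the rectangular window described above and, for comparison, a Hamming window, which is a common choice in MFCC pipelines. The choice of Hamming is an assumption here; the slides do not say which window the project uses.

```python
import math

def rectangular(n):
    """Constant inside the interval (and implicitly zero outside it)."""
    return [1.0] * n

def hamming(n):
    """Tapered window, requires n >= 2: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    return [x * w for x, w in zip(frame, window)]

frame = [1.0, 1.0, 1.0, 1.0, 1.0]
tapered = apply_window(frame, hamming(len(frame)))
```

Tapering the frame edges toward zero reduces the spectral leakage that a rectangular window introduces when the frame is passed to the Fourier transform.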
Fast Fourier Transform
FFTs are of great importance to a wide variety of applications, from digital signal processing and
solving partial differential equations to algorithms for quick multiplication of large integers.
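For illustration, a recursive radix-2 Cooley-Tukey FFT can be written in a few lines. This is a teaching sketch that only handles power-of-two lengths; the 8-sample cosine input is an arbitrary example. Taking the absolute value of each complex output gives the magnitude spectrum used in the next stage of the pipeline.

```python
import cmath
import math

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n <= 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# One full cosine cycle over 8 samples: the energy lands in bins 1 and 7.
signal = [math.cos(2 * math.pi * n / 8) for n in range(8)]
magnitudes = [abs(c) for c in fft(signal)]
```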
In mathematics, the absolute value (or modulus) |a| of a real number a is the numerical value of a
without its sign. The absolute value of a number may be thought of as its distance from zero.
Discrete cosine transform (DCT)
A DCT is a Fourier-related transform similar to the discrete Fourier transform
(DFT), but uses only real numbers. DCTs are equivalent to DFTs of roughly twice the length,
operating on real data with even symmetry (since the Fourier transform of a real and even
function is real and even), where in some variants the input and/or output data are shifted by
half a sample. There are eight standard DCT variants, of which four are commonly used.
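The variant most often used for MFCCs is the DCT-II. The sketch below is a direct, unnormalised O(N^2) evaluation of its definition, assuming that variant; the slides do not say which of the variants the project uses.

```python
import math

def dct2(x):
    """Unnormalised DCT-II: X[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

# A constant input has all of its energy in the first (k = 0) coefficient.
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
```

The output is real for real input, and the half-sample shift mentioned above appears as the `i + 0.5` term.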
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) and the related Fisher's linear discriminant are methods
used in statistics, pattern recognition and machine learning to find a linear combination of
features which characterizes or separates two or more classes of objects or events. The
resulting combination may be used as a linear classifier or, more commonly, for
dimensionality reduction before later classification.
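As a small worked example, Fisher's linear discriminant for two classes finds the direction w = Sw^-1 (m_b - m_a), where Sw is the within-class scatter matrix and m_a, m_b are the class means. The sketch below is restricted to 2-D features so the matrix inverse can be written in closed form; the data points are made up for illustration.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix about mean m, returned as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (m_b - m_a) for two classes of 2-D points."""
    ma, mb = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, ma)
    bxx, bxy, byy = scatter(class_b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy   # within-class scatter Sw
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # Closed-form inverse of the symmetric 2x2 matrix Sw applied to (dx, dy).
    return [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]

def project(point, w):
    """Reduce a 2-D point to one dimension along w."""
    return point[0] * w[0] + point[1] * w[1]

# Hypothetical 2-D feature vectors for two sound classes.
class_a = [(1, 2), (2, 3), (3, 3), (2, 1)]
class_b = [(6, 5), (7, 8), (8, 6), (7, 7)]
w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
```

After projection each sample is a single number, and for this data the two classes occupy disjoint ranges, which is the dimensionality reduction before classification described above.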
TRAINING AND TESTING
Fig: Flow chart of Training Session
Fig: Flowchart of Testing Session
Using the above-mentioned approaches (MFCC and
CBR) for the sound detection and classification system, we find
that the recognition rate is very high.
The rejection rate, however, is not yet
good enough.
That is, if the sound being tested is already
present in the database, the matching
process is very accurate. If the sound is not present in
the database, the system does not reject it (or
stop the matching); instead, it matches it with the
closest sounds in terms of features.
Future scope and applications
Audio similarity measures
This method of environmental sound detection and classification uses the MFCC
pipeline to extract the features of a sound and CBR to retrieve sound
features from the multimedia database. The method can be applied in the
domain of robotics, where sound detection and recognition may be possible to a satisfactory
level. If the method is properly combined with computer vision, human-computer
interaction can be improved considerably. MFCC is an efficient
feature extraction method because its design emphasizes human auditory
perception. Using more than one feature of a sound may improve the performance of the
method. Applying a clustering technique can further boost accuracy. Another useful feature
available today is Audio Spectrum Projection, provided by the MPEG-7 specification. Including this
feature may improve the performance of the method.