1) The document proposes using linear prediction (LP) residual and neural networks to classify audio clips. LP residual captures audio-specific information not present in spectral features alone.
2) Autoassociative neural networks (AANN) are used to capture information from the LP residual, which is difficult to extract using signal processing. Multilayer perceptrons (MLP) then classify the audio using AANN-extracted features.
3) The approach is tested on classifying clips into 5 categories (speech, music, noise, cartoon, advertisement) using residual features captured by AANN in addition to spectral features, achieving better performance than spectral features alone.
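For readers unfamiliar with the LP residual, the sketch below (an illustration, not the paper's implementation) computes it for a single frame in Python; librosa and SciPy, the frame length, and the LP order are assumptions rather than the paper's settings.

```python
# Illustrative LP-residual extraction; librosa/SciPy and the parameter values
# are assumptions, not taken from the paper.
import numpy as np
import librosa
import scipy.signal

sr = 8000
t = np.arange(2048) / sr
# Stand-in "voiced" frame: a few harmonics of a 150 Hz fundamental.
frame = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 5))
frame = frame * np.hamming(len(frame))

a = librosa.lpc(frame, order=10)            # a = [1, a1, ..., ap]

# The LP residual is the prediction error, i.e. the frame passed through
# the inverse (analysis) filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
residual = scipy.signal.lfilter(a, [1.0], frame)

print("frame energy   :", float(np.sum(frame ** 2)))
print("residual energy:", float(np.sum(residual ** 2)))
```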
IJERA (International Journal of Engineering Research and Applications) is an international, online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
This document proposes an automatic emotion recognition system that analyzes audio information to classify human emotions. It uses spectral features and MFCC coefficients for feature extraction from voice signals. Then, a deep learning-based LSTM algorithm is used for classification. The system is evaluated on three audio datasets. Recurrent convolutional neural networks are proposed to capture temporal and frequency dependencies in speech spectrograms. The system aims to improve on existing methods which have lower accuracy and require more computational resources for implementation.
IJERD (www.ijerd.com), International Journal of Engineering Research and Devel... (IJERD Editor)
This document summarizes and compares several techniques for enhancing the intelligibility of speech signals corrupted by noise. It describes single channel techniques like spectral subtraction, spectral subtraction with oversubtraction, and nonlinear spectral subtraction. It also covers multi-channel techniques such as adaptive noise cancellation and multisensory beamforming. Additionally, it discusses spectral subtraction using adaptive averaging, noise reduction using enhanced Wiener filtering, and other adaptive neuro-fuzzy techniques for speech enhancement. The goal of these techniques is to improve the quality and intelligibility of noisy speech signals.
T. Silva, D. D. Karunaratna, G. N. Wikramanayake, K. P. Hewagamage, G. K. A. Dias (2004), "Speaker Search and Indexing for Multimedia Databases", in: 6th International Information Technology Conference, edited by V. K. Samaranayake et al., pp. 157-162. Infotel Lanka Society, Colombo, Sri Lanka: IITC, Nov 29 - Dec 1. ISBN: 955-8974-01-3
KNN: A Machine Learning Approach to Recognize a Musical Instrument (IJARIIT)
An outline is provided of a proposed system to recognize musical instruments using machine learning techniques. The system first extracts features from audio files using the MIR toolbox in Matlab. It then uses a hybrid feature selection method and vector quantization to identify instruments. Specifically, the key audio descriptors are selected and feature vectors are generated and matched to standard vectors to classify the instrument. The k-nearest neighbors algorithm is used for classification. Preliminary results show the system can accurately recognize instruments based on extracted acoustic features.
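The description above names the MIR toolbox (Matlab) and k-NN matching; the fragment below is only a rough Python analogue of the classification step, with random numbers standing in for the extracted audio descriptors and hypothetical instrument labels.

```python
# Rough analogue of the k-NN classification stage; the feature matrix is a
# stand-in for descriptors extracted elsewhere (e.g. by the MIR toolbox).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_clips, n_features = 200, 20
X = rng.normal(size=(n_clips, n_features))     # one feature vector per audio clip
y = rng.integers(0, 4, size=n_clips)           # 4 hypothetical instrument classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)      # k = 5 is an arbitrary choice
knn.fit(X_tr, y_tr)
print("held-out accuracy:", knn.score(X_te, y_te))
```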
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text-based and audio-based models and then fused their heterogeneous feature maps, taken from intermediate layers, to complete the architecture. Performance measures (accuracy, precision, recall and F1-score) are observed to outperform those of the existing models.
Speech Emotion Recognition is a recent research topic in the Human-Computer Interaction (HCI) field. The need has arisen for a more natural communication interface between humans and computers, as computers have become an integral part of our lives. A lot of work is currently going on to improve the interaction between humans and computers. To achieve this goal, a computer would have to be able to assess its present situation and respond differently depending on that observation. Part of this process involves understanding a user's emotional state. To make human-computer interaction more natural, the objective is that the computer should be able to recognize emotional states in the same way a human does. The efficiency of an emotion recognition system depends on the type of features extracted and the classifier used for detection of emotions. The proposed system aims at identification of basic emotional states such as anger, joy, neutral and sadness from human speech. While classifying different emotions, features such as MFCC (Mel-Frequency Cepstral Coefficients) and energy are used. In this paper, a standard emotional database (an English database) is used, which gives more satisfactory detection of emotions than recorded samples of emotions. The methodology describes and compares the performances of a Learning Vector Quantization Neural Network (LVQ NN), a multiclass Support Vector Machine (SVM) and their combination for emotion recognition.
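As a concrete illustration of the feature side of such a system (not the paper's exact pipeline), the snippet below extracts MFCCs and frame energy with librosa and pools them into one vector per utterance; the waveform and parameter choices are placeholders, and a multiclass SVM would be trained on many such vectors.

```python
# MFCC + frame-energy features pooled per utterance; the synthetic waveform and
# parameter choices are placeholders, not the paper's data or settings.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = 0.1 * np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))  # stand-in speech

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
energy = librosa.feature.rms(y=y)                    # frame-level energy proxy, (1, n_frames)

feats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                        energy.mean(axis=1), energy.std(axis=1)])
print("utterance feature vector length:", feats.shape[0])
# With labelled utterances, sklearn.svm.SVC (or an LVQ network) is fit on these vectors.
```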
This document proposes a video genre classification method using only audio features extracted from video clips. It uses Multivariate Adaptive Regression Splines (MARS) to build classification models for different genres based on low-level audio features (MFCCs, zero-crossing rate, short-time energy, etc.) extracted from a dataset of news, cartoon, sports, music and drama video clips. The models accurately classify video genres with an overall classification rate of 91.83%, based on the important audio features identified for each genre by the MARS models.
This document discusses using deep neural networks for speech enhancement by finding a mapping between noisy and clean speech signals. It aims to handle a wide range of noises by using a large training dataset with many noise/speech combinations. Techniques like global variance equalization and dropout are used to improve generalization. Experimental results show improvements over MMSE techniques, with the ability to suppress nonstationary noise and avoid musical artifacts. The introduction provides background on speech enhancement, recognition using HMMs and other models, and the role of deep learning advances.
Designing an Efficient Multimodal Biometric System using Palmprint and Speech... (IDES Editor)
This document summarizes a research paper that proposes a multimodal biometric system using palmprint and speech signals. It extracts features from each modality using different methods. For speech, it uses Subband Cepstral Coefficients extracted via a wavelet packet transform. For palmprint, it uses a Modified Canonical Form method. The features are fused at the score level using a weighted sum rule. The system is tested on a database of over 300 subjects, and results show improved recognition rates compared to single modalities.
Novel Approach of Implementing Psychoacoustic Model for MPEG-1 Audio (inventy)
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
This document discusses optimization techniques for designing ultra-wideband planar monopole antennas. It presents two powerful design methodologies: size optimization using design of experiments, and topology optimization using binary particle swarm optimization. Size optimization is a systematic approach that varies geometric parameters to achieve design goals with a small number of simulations. Topology optimization determines the optimal metal distribution within the design area without a predefined shape using an automatic approach. These techniques are demonstrated by designing UWB antennas and band-notched UWB antennas, improving efficiency over trial-and-error approaches.
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY (csandit)
The rapid increase in data transmission over the internet places emphasis on information security. Audio steganography is used for secure transmission of secret data with an audio signal as the carrier. In the proposed method, the cover audio file is transformed from the space domain to the wavelet domain using a lifting scheme, leading to secure data hiding. The text message is encrypted using a dynamic encryption algorithm, and the cipher text is then hidden in the wavelet coefficients of the cover audio signal. Signal-to-Noise Ratio (SNR) and Squared Pearson Correlation Coefficient (SPCC) values are computed to judge the quality of the stego audio signal. Results show that the stego audio signal is perceptually indistinguishable from the cover audio signal and remains robust even in the presence of external noise. The proposed method provides secure data extraction with minimal error.
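A much-simplified sketch of the embedding idea follows; it uses a plain DWT from PyWavelets in place of the paper's lifting-scheme transform, skips the encryption step, and invents the quantisation step size, so it should be read as an illustration only.

```python
# Toy wavelet-domain embedding/extraction; PyWavelets' standard DWT stands in
# for the lifting scheme, and all parameters are illustrative assumptions.
import numpy as np
import pywt

sr = 8000
t = np.arange(sr) / sr
cover = 0.5 * np.sin(2 * np.pi * 440 * t)                  # stand-in cover audio

approx, detail = pywt.dwt(cover, "db2", mode="periodization")

bits = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # secret bits (assumed already encrypted)
step = 0.01                                                 # quantisation step (assumption)

# Quantisation-index style embedding: push each chosen detail coefficient onto
# an even or odd multiple of `step` according to the bit value.
stego_detail = detail.copy()
for i, b in enumerate(bits):
    q = int(np.round(stego_detail[i] / step))
    if q % 2 != b:
        q += 1
    stego_detail[i] = q * step

stego = pywt.idwt(approx, stego_detail, "db2", mode="periodization")

# Quality check: SNR of the stego signal relative to the cover.
snr = 10 * np.log10(np.sum(cover ** 2) / np.sum((cover - stego) ** 2))
print(f"SNR of stego audio: {snr:.1f} dB")

# Extraction: transform the stego signal again and read back coefficient parity.
_, d2 = pywt.dwt(stego, "db2", mode="periodization")
recovered = [int(np.round(d2[i] / step)) % 2 for i in range(len(bits))]
print("recovered bits:", recovered)
```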
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
1. The document is a scheme of work for Form 4 students at SM Sains Seremban covering 10 units from 2010.
2. Each unit covers obtaining information, processing information, presenting information, grammar, vocabulary and educational emphasis over 3 levels of difficulty.
3. The units cover topics in science like the Earth, heat, natural resources, energy, forces, human body systems, genetics, nutrition and more.
The document is a research paper that studies using a neural network model for fingerprint recognition. It discusses how fingerprint recognition is an important technique for security and restricting intruders. The paper proposes using an artificial neural network with backpropagation training to recognize fingerprints. It describes collecting fingerprint images, classifying them, enhancing the images, and training the neural network to match images and recognize fingerprints with high accuracy. The methodology, implementation, and results of using a backpropagation neural network for fingerprint recognition are analyzed.
This document presents a text-dependent speaker recognition system using neural networks that aims to improve recognition accuracy. It proposes changing the number of Mel Frequency Cepstral Coefficients (MFCCs) used in training. Voice Activity Detection is also used as a preprocessing step. Experimental results show recognition accuracy increases from 70.41% to a maximum of 92.91% as the number of MFCCs increases from 14 to 20, but then decreases with more MFCCs. The system is implemented on a Raspberry Pi for hardware acceleration.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
This document summarizes a study that compares different acoustic feature extraction methods (LPC, MFCC, PLP) for a Bangla speech recognition system using LSTM neural networks. It finds that PLP outperforms MFCC and LPC based on statistical distance measurements of phoneme coefficients. PLP shows better distinction between phonemes compared to MFCC and LPC. While RNN/LSTM are inherently slow, combining PLP with faster networks like Transformers may improve performance for large datasets.
International Journal of Engineering and Science Invention (IJESI), inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field of Engineering, Science and Technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Performance estimation based recurrent-convolutional encoder decoder for spee... (karthik annam)
This document discusses a proposed Recurrent-Convolutional Encoder-Decoder (R-CED) network for speech enhancement. The R-CED network aims to overcome challenges with existing methods by estimating the a priori and a posteriori signal-to-noise ratios to separate noise from speech. The R-CED consists of convolutional layers with increasing and then decreasing numbers of filters to encode and decode features. Performance will be evaluated using metrics such as PESQ, STOI, CER, MSE, SNR, and SDR. The proposed method aims to improve speech enhancement accuracy and recover enhanced speech quality compared to other techniques.
The document proposes a new localization method called A2L (Angle to Landmark) for wireless sensor networks. A2L uses angle of arrival measurements between sensor nodes and a subset of nodes equipped with GPS (landmarks) to determine the positions of non-landmark nodes. Compared to previous methods like APS and AHLoS that also use angle and distance measurements, simulations show that A2L can locate a greater number of nodes with higher accuracy while requiring fewer connections between nodes. The method is also low-cost since it does not require each node to have GPS or other expensive equipment.
1. The document is a scheme of work for Form 4 students in Sains Seremban, Seremban for the year 2011. It outlines 11 units to be covered from weeks 1-34 with objectives, activities, and emphasis for each unit.
2. The units cover topics in science, technology, environment and other subjects. For each unit, students will obtain information through listening, reading, and instructions. They will then process the information by identifying definitions, classifying data, and making inferences.
3. Students will present information using methods like notes, reports, diagrams and charts. Grammar, vocabulary, and 21st century skills are integrated into the lessons with an emphasis on thinking skills,
Bayesian distance metric learning and its application in automatic speaker re... (IJECEIAES)
This document proposes a state-of-the-art automatic speaker recognition system based on Bayesian distance metric learning as a feature extractor. It explores constraints on the distance between modified and simplified i-vector pairs from the same speaker and different speakers. An approximation of the distance metric is used as a weighted covariance matrix from the higher eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric distance. This Bayesian distance learning approach achieves better performance than advanced methods and is insensitive to normalization compared to cosine scores. It is also effective with limited training data.
IRJET - Music Genre Recognition using Convolution Neural Network (IRJET Journal)
1. The document describes a study that uses a Convolutional Neural Network (CNN) model to classify music genres based on labeled Mel spectrograms of audio clips.
2. A CNN model is trained on a dataset of 1000 audio clips across 10 genres. The trained model is then used to classify new, unlabeled audio clips by genre based on their Mel spectrogram representation.
3. CNNs are well-suited for this task as their convolutional layers can extract hierarchical features from the Mel spectrogram images that are indicative of different genres. The study aims to develop an automated music genre classification system using deep learning techniques.
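The fragment below is a minimal sketch of that pipeline under assumed settings (librosa for the Mel spectrogram, a tiny Keras CNN, 10 genre classes); it is not the study's actual architecture.

```python
# Illustrative Mel-spectrogram + small CNN classifier; layer sizes, clip length
# and the 10-class output are assumptions, not the study's model.
import numpy as np
import librosa
import tensorflow as tf

sr = 22050
y = np.random.randn(sr * 3).astype(np.float32)            # stand-in 3-second clip

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)              # shape: (128, n_frames)

x = mel_db[np.newaxis, ..., np.newaxis]                    # (batch, 128, frames, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=x.shape[1:]),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),       # 10 genres, as in GTZAN
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
print(model.predict(x, verbose=0).shape)                    # (1, 10) genre probabilities
```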
This document discusses a proposed system for classifying audio scenes in action movies. It aims to provide scene recognition and detection by separating audio classes and obtaining better sound classification accuracy. The system extracts audio features like zero-crossing rate, short-time energy, volume root mean square, and volume dynamic range. It then uses hidden Markov models and support vector machines to classify audio scenes, labeling them as happy, miserable, or action scenes. Sound event types classified include gunshots, screams, car crashes, talking, laughter, fighting, shouting, and background crowd noise. The goal is to index and retrieve interesting events from action movies to engage viewers.
A computationally efficient learning model to classify audio signal attributes (IJECEIAES)
The era of machine learning has opened up groundbreaking realities and opportunities in the field of medical diagnosis. However, it is also observed that faster and proper diagnosis of any disease or medical condition requires proper analysis and classification of digital signal data, for example the proper identification of tumors in the brain. Brain magnetic resonance imaging (MRI) data has to be appropriately classified, and similarly, pulse signal analysis is required to evaluate the operating condition of the human heart. Several studies have used machine learning (ML) modeling to classify speech signals, but very few have explored the classification of audio signal attributes in the context of intelligent healthcare monitoring. The study thereby aims to introduce novel mathematical modeling to analyze and classify synthetic pulse audio signal attributes with cost-effective computation. The numerical modeling is composed of several functional blocks in which deep neural network-based learning (DNNL) plays a crucial role during the training phase, and it is further combined with a recurrent structure of long short-term memory (R-LSTM) feedback connections (FCs). The design approach is evaluated in a numerical computing environment in terms of accuracy and computational cost. The classification outcome of the proposed approach shows that it attains approximately 85% accuracy, which is comparable to the baseline approaches in both accuracy and execution time.
Audio Features Based Steganography Detection in WAV File (ijtsrd)
Whether audio signals contain secret information or not is a security issue addressed in the context of steganalysis. The conceptual idea lies in the difference in the distribution of various statistical distance measures between cover audio signals and stego audio signals. The aim of the proposed system is to analyze an audio signal to determine whether information-hiding behavior is present or not. Mel-frequency cepstral coefficient, zero-crossing rate, spectral flux and short-time energy features of the audio signal are extracted and combined with the features extracted from a modified version generated by randomly modifying significant bits. The extracted features are then classified with a support vector machine. Experimental results show that the proposed method performs well in steganalysis of audio steganograms produced using S-Tools 4. Khin Myo Kyi, "Audio Features Based Steganography Detection in WAV File", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26807.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/26807/audio-features-based-steganography-detection-in-wav-file/khin-myo-kyi
IRJET - Machine Learning and Noise Reduction Techniques for Music Genre Classi... (IRJET Journal)
This document discusses using machine learning and deep learning techniques to classify music genres automatically. It proposes applying noise reduction techniques to audio files using Fourier analysis before feeding them into models. A convolutional neural network is trained on mel-spectrograms of audio to classify genres. Supervised machine learning models like random forest and XGBoost are also explored using extracted audio features. The proposed system applies noise reduction to preprocessed audio then uses a CNN or supervised learning models to classify music genres.
This document summarizes the key components of a voice recognition system, including signal modeling and pattern matching. Signal modeling represents converting speech signals into parameters through operations like spectral shaping and feature extraction. Feature extraction analyzes speech signals through temporal and spectral analysis techniques to obtain parameters like power, pitch, and vocal tract configuration. Pattern matching finds the parameter set from memory that most closely matches the input speech parameters. The document then discusses specific temporal analysis techniques like power and energy analysis, and spectral analysis techniques like filter banks, cepstral analysis, and linear predictive coding analysis used for feature extraction in voice recognition systems.
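To make the temporal-analysis step concrete, here is a small NumPy illustration (with an artificial waveform and assumed frame sizes) of splitting a signal into overlapping frames and computing per-frame power, one of the parameters mentioned above.

```python
# Short-time power analysis on a stand-in signal; frame length and hop size
# are typical choices (25 ms / 10 ms), not values taken from the document.
import numpy as np

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 200 * t) * (t < 0.5)           # stand-in: tone then silence

frame_len, hop = 400, 160                                   # 25 ms frames, 10 ms hop
n_frames = 1 + (len(speech) - frame_len) // hop
frames = np.stack([speech[i * hop: i * hop + frame_len] for i in range(n_frames)])

power = np.mean(frames ** 2, axis=1)                        # short-time power
log_energy = 10 * np.log10(power + 1e-10)                   # dB scale, floor avoids log(0)
print("frames:", n_frames, " max / min log-energy (dB):",
      round(log_energy.max(), 1), "/", round(log_energy.min(), 1))
```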
Utterance Based Speaker Identification Using ANN (IJCSEA Journal)
This document summarizes a research paper on speaker identification using artificial neural networks. The paper presents a speaker identification system that uses digital signal processing and ANN techniques. Speech features are extracted from utterances using FFT and windowing. These features are used to train a multi-layer perceptron network to classify speakers. The system was tested on Bangla speech and achieved accurate identification of speakers from their utterances.
In this paper we present the implementation of speaker identification system using artificial neural network with digital signal processing. The system is designed to work with the text-dependent speaker identification for Bangla Speech. The utterances of speakers are recorded for specific Bangla words using an audio wave recorder. The speech features are acquired by the digital signal processing technique. The identification of speaker using frequency domain data is performed using back propagation algorithm. Hamming window and Blackman-Harris window are used to investigate better speaker identification performance. Endpoint detection of speech is developed in order to achieve high accuracy of the system.
The document discusses using suprasegmental features present in linear prediction (LP) residual for audio clip classification. It explains that existing audio classification approaches miss important suprasegmental information and that statistics of the autocorrelation sequence of the Hilbert envelope of the LP residual contain audio-specific suprasegmental information that can enhance classification. An experiment is described that demonstrates classifying audio clips into 5 categories using support vector machines based on the variance of the autocorrelation sequence, achieving over 50% accuracy on average. Future work to improve classification performance by combining suprasegmental and other features is discussed.
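The residual-based feature described here can be sketched compactly in Python; the snippet below, using an artificial signal and assumed LP order, signal length and lag range rather than the paper's settings, computes the Hilbert envelope of the LP residual, its autocorrelation sequence, and the variance of that sequence.

```python
# Minimal sketch of the residual feature: Hilbert envelope of the LP residual,
# its autocorrelation, and the variance of that sequence. All parameters are
# assumptions, not the paper's settings.
import numpy as np
import librosa
import scipy.signal

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 450 * t)   # stand-in audio

a = librosa.lpc(x, order=10)
residual = scipy.signal.lfilter(a, [1.0], x)               # LP residual

envelope = np.abs(scipy.signal.hilbert(residual))          # Hilbert envelope
envelope = envelope - envelope.mean()

acf = np.correlate(envelope, envelope, mode="full")
acf = acf[acf.size // 2:]                                   # non-negative lags only
acf = acf / acf[0]                                          # normalise so acf[0] = 1

feature = np.var(acf[1:400])                                # variance over ~50 ms of lags
print("autocorrelation-variance feature:", feature)
```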
IRJET - Implementing Musical Instrument Recognition using CNN and SVM (IRJET Journal)
This document summarizes research on implementing musical instrument recognition using convolutional neural networks (CNNs) and support vector machines (SVMs). The researchers aim to preprocess audio excerpts into images and use CNNs to achieve high accuracy in instrument classification. They will then combine CNN and SVM classifications and take a weighted average to achieve even higher accuracy. The document reviews several related works that used features like MFCCs and classifiers like SVMs, GMMs, and neural networks for instrument recognition. The researchers intend to use mel spectrograms and MFCCs to represent audio as images for CNN classification and improve music information retrieval and organization.
This document provides an overview of recent developments in sound recognition techniques. It discusses several methods for sound recognition, including matching pursuit algorithms with MFCC features, probabilistic distance support vector machines using generalized gamma modeling of STE features, and frequency vector principal component analysis. The document also reviews related literature on environmental sound recognition using time-frequency audio features and sound event recognition. It aims to present an updated survey on sound recognition methods and discuss future research trends in the field.
Literature Survey for Music Genre Classification Using Neural Network (IRJET Journal)
The document discusses literature on classifying music genres using neural networks. It summarizes several past studies that used techniques like convolutional neural networks (CNNs) and mel-frequency cepstral coefficients (MFCCs) on datasets like GTZAN to classify music into genres like blues, classical, country, etc. The document also outlines the system design for a proposed music genre classification system, including collecting the GTZAN dataset, preprocessing the audio files into mel-spectrograms, extracting features using MFCCs, and training a CNN model to classify segments of songs into genres. Classification accuracy of different models from prior studies ranged from 40-80%.
Speech emotion recognition with light gradient boosting decision trees machine (IJECEIAES)
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
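For the classifier side only, a hedged sketch of a LightGBM model with class weights inversely proportional to class frequency is shown below; the data is synthetic stand-in material, not the paper's augmented audio features.

```python
# LightGBM with inverse-frequency class weights on an imbalanced, synthetic
# 4-class problem standing in for emotion labels; all settings are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=2000, n_features=40, n_informative=20,
                           n_classes=4, weights=[0.5, 0.3, 0.15, 0.05],
                           random_state=0)                  # imbalanced class distribution

counts = np.bincount(y)
class_weight = {c: len(y) / (len(counts) * n) for c, n in enumerate(counts)}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LGBMClassifier(n_estimators=200, class_weight=class_weight)
clf.fit(X_tr, y_tr)
print("held-out accuracy with class weights:", clf.score(X_te, y_te))
```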
Optimized audio classification and segmentation algorithm by using ensemble m... (Venkat Projects)
The document proposes an optimized audio classification and segmentation algorithm that segments audio streams into four types - pure speech, music, environment sound, and silence - using ensemble methods. It uses a hybrid classification approach of bagged support vector machines and artificial neural networks. The algorithm aims to accurately segment audio with minimum misclassification and requires less training data, making it suitable for real-time applications. It segments non-speech portions into music or environment sound and further divides speech into silence and pure speech. The algorithm achieves approximately 98% accurate segmentation.
IRJET - Musical Instrument Recognition using CNN and SVM (IRJET Journal)
This document discusses a study that uses convolutional neural networks (CNNs) and support vector machines (SVMs) to recognize musical instruments in audio recordings. The researchers aim to convert audio excerpts to images and use CNNs to classify instruments, then combine the CNN classifications with SVM classifications to improve accuracy. They discuss related work on instrument recognition using other methods. The proposed model uses MFCC features with SVM and passes audio converted to images through four convolutional layers and fully connected layers in the CNN. Combining the CNN and SVM results through weighted averaging is expected to provide higher accuracy than either method alone for classifying instruments in the IRMAS dataset.
Recognition of music genres using deep learning (IRJET Journal)
This document discusses using deep learning techniques to recognize music genres from audio files. It evaluates three approaches: extracting Mel-spectrograms, MFCC plots, and chroma STFT features from audio and using those as input to CNN models. A CNN architecture with 5 conv layers performed best on Mel-spectrograms, achieving over 90% accuracy. MFCC plots achieved over 70% accuracy. Chroma STFT features performed worst at around 57% accuracy. In conclusion, Mel-spectrograms were found to be the most effective audio feature for music genre classification using deep learning.
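For reference, the three input representations compared in that study can be computed with librosa as below (stand-in audio, default frame parameters); each is a 2-D array that can be treated as an image by a CNN.

```python
# The three representations compared above, computed on a stand-in clip with
# librosa defaults; shapes only, no claim about which performs best.
import numpy as np
import librosa

sr = 22050
y = np.random.randn(sr * 2).astype(np.float32)

mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

for name, feat in [("mel-spectrogram", mel), ("MFCC", mfcc), ("chroma STFT", chroma)]:
    print(f"{name:15s} shape: {feat.shape}")   # each becomes a 2-D 'image' for the CNN
```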
Automatic Music Generation Using Deep Learning (IRJET Journal)
This document discusses automatic music generation using deep learning. It begins with an abstract describing how music is generated in the form of a sequence of ABC notes using deep learning concepts. LSTM or GRUs are commonly used for music generation as recurrent neural networks that can efficiently model sequences. The main purpose of the project described is to generate melodious and rhythmic music automatically using a recurrent neural network. It reviews approaches like WaveNet and LSTM for music generation and tools like Magenta and DeepJazz. The design uses a character RNN and LSTM network to classify and predict the next character in an ABC notation sequence to generate music.
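The character-level LSTM idea can be illustrated with a toy sketch like the one below; the ABC fragment, vocabulary and layer sizes are placeholders rather than the project's actual corpus or model.

```python
# Toy character-level LSTM for next-character prediction over an ABC fragment;
# corpus, sequence length and layer sizes are placeholders, not the project's.
import numpy as np
import tensorflow as tf

abc = "X:1\nT:Demo\nK:C\nCDEF GABc|cBAG FEDC|"            # stand-in ABC notation
chars = sorted(set(abc))
idx = {c: i for i, c in enumerate(chars)}

seq_len = 8
X = np.array([[idx[c] for c in abc[i:i + seq_len]] for i in range(len(abc) - seq_len)])
y = np.array([idx[abc[i + seq_len]] for i in range(len(abc) - seq_len)])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(chars), 16),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(len(chars), activation="softmax"),  # next-character distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=5, verbose=0)

# Generation: repeatedly predict the next character and append it to the seed.
seed = list(abc[:seq_len])
for _ in range(20):
    probs = model.predict(np.array([[idx[c] for c in seed[-seq_len:]]]), verbose=0)[0]
    seed.append(chars[int(np.argmax(probs))])
print("".join(seed))
```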
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECHNIQUES (AM Publications)
Audio signals, which include speech, music and environmental sounds, are an important type of media. The problem of distinguishing audio signals into these different audio types is thus becoming increasingly significant. A human listener can easily distinguish between different audio types by just listening to a short segment of an audio signal. However, solving this problem using computers has proven to be very difficult. Nevertheless, many systems with modest accuracy can still be implemented. The experimental results demonstrate the effectiveness of our classification system. The complete system is developed using ANN techniques with an autonomic computing system.
This document discusses audio indexing and classification. It notes that (1) most stored data is multimedia like audio which is difficult to handle manually due to its large volume, and (2) an automatic method is needed to organize and use multimedia data appropriately. It then explores using linear prediction residual and suprasegmental features of audio signals to classify audio clips, as these carry additional perceptual information not captured by existing spectral analysis methods. The residual and suprasegmental features are shown to provide discriminative information between different audio classes.
AUDIO CLIP CLASSIFICATION USING LP RESIDUAL AND NEURAL NETWORKS MODELS

Anvita Bajpai and B. Yegnanarayana
Department of Computer Science and Engineering
Indian Institute of Technology Madras, Chennai - 600 036, India
{anvita, yegna}@cs.iitm.ernet.in

ABSTRACT

In this paper, we demonstrate the presence of audio-specific information in the linear prediction (LP) residual, obtained after removing the predictable part of the signal. We emphasize the importance of the information present in the LP residual of audio signals, which, if added to the spectral information, can give a better performing system. Since it is difficult to extract information from the residual using known signal processing algorithms, neural network (NN) models are proposed. In this paper, autoassociative neural network (AANN) models are used to capture the audio-specific information from the LP residual of the signals. Multilayer feedforward neural network (MLFFNN) models, or multilayer perceptrons (MLP), are used to classify the audio data using the audio-specific information captured by the AANN models.

1. INTRODUCTION

In this era of information technology, the data that we use is mostly in the form of audio, video and multimedia. The data, once recorded and stored digitally, conveys no significant information that would help to organize and use it. The volume of data is large and increasing daily, so it is difficult to organize the data manually. We need an automatic method to index the data for further search and retrieval. Audio plays an important role in classifying multimedia data, as it contains significant information and is easier to process than video data. For these reasons, commercial audio retrieval products are emerging, e.g., (http://www.musclefish.com) [1]. Content-based classification of data into different categories is one important step in building an audio indexing system.

In the traditional approach to audio indexing, audio is first converted to text, which is then given to text-based search engines [2]. The drawbacks of this approach are: (a) the lack of an accurate speech recognizer, (b) not using the speech information present in the form of prosody, and (c) not being applicable to non-speech data such as music. An elaborate audio content categorization is proposed by Wold et al. [1], which divides the audio content into sixteen groups. The authors use the mean, variance and autocorrelation of loudness, pitch and bandwidth as audio features, and a nearest-neighbor classifier for the task. They quote 81% classification accuracy on an audio database of 400 sound files. Guo et al. [3] use features consisting of total power, subband energies, bandwidth, pitch and MFCCs, and support vector machines (SVMs) for classification. Wang et al. classify audio into five categories of television (TV) programs using spectral features [4]. Features based on amplitude, zero-crossing, bandwidth, band energy in the subbands, spectrum and periodicity properties, along with hidden Markov models (HMM) for classification, are explored for audio indexing applications in [5]. But it has been shown that perceptually significant information of audio data is present in the form of a sequence of events, which can be obtained after removing the predictable part of the audio data. Perceptually, there are some discriminating features present in the residual which could help in various audio indexing tasks. The challenge lies in developing algorithms to capture these perceptually significant features from the residual, as it is difficult to extract this information using known signal processing algorithms.

The objective of this study is to explore features in addition to those currently used, in order to improve the performance of an audio indexing system. In particular, features not used explicitly or implicitly in current systems are investigated. Many interesting and perceptually important features are present in the residual signal obtained after removing the predictable part. Thus the main objective of this study is to explore the features present in the linear prediction (LP) residual for the audio clip classification task. The reason for considering the residual data is that the residual part of the signal is generally subject to less degradation than the system part [6]. The residual data contains higher-order correlations among samples. As known signal processing and statistical techniques are not suitable for capturing these correlations, an autoassociative neural network (AANN) model is proposed to capture the higher-order correlations among samples of the residual of the audio data. AANN models have already been used to capture information from the residual data for tasks such as speaker recognition [7]. Further, multilayer feedforward neural network (MLFFNN) models, or multilayer perceptrons (MLP), are proposed for the decision-making task using the audio-specific information captured by the AANN models.

The paper is organized as follows: Section 2 discusses extraction of the LP residual from audio data. Section 3 discusses AANN models for capturing features in the LP residual for audio clip classification. Section 4 discusses MLP models for decision making. Section 5 presents the workflow of the system. The results of the experimental studies are presented in Section 6. Various issues addressed in this paper and possible directions for future study are summarized in Section 7.
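As a rough illustration of the residual extraction described above, the following Python sketch computes frame-wise LP coefficients with the autocorrelation (Levinson-Durbin) method and inverse-filters each frame to obtain the LP residual. The LP order, frame length and hop size are illustrative assumptions, not the settings reported in the paper.

```python
# Minimal sketch of frame-wise LP residual extraction (assumed parameters:
# 10th-order LP, 20 ms frames at 16 kHz, 50% overlap).
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(frame, order):
    """LP coefficients a (with a[0] = 1) via autocorrelation + Levinson-Durbin."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                      # guard against an all-zero frame
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lp_residual(signal, order=10, frame_len=320, hop=160):
    """Inverse-filter each frame with its own LP filter to obtain the residual."""
    signal = np.asarray(signal, dtype=float)
    residual = np.zeros_like(signal)
    window = np.hamming(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        a = lp_coefficients(frame * window, order)
        # Prediction error e[n] = x[n] + sum_k a[k] * x[n-k]
        residual[start:start + frame_len] = lfilter(a, [1.0], frame)
    return residual
```

Applying the all-zero filter A(z) with lfilter removes the part of each sample predictable from the previous `order` samples, so the output is exactly the prediction error (the LP residual) for that frame.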
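To make the two-stage architecture concrete, here is a hedged Keras sketch of a five-layer autoassociative network that reconstructs blocks of residual samples, and an MLP that maps per-class AANN confidence scores to an audio class. The block size, the 40-48-12-48-40 layer structure, the activations and the exponential confidence measure are assumptions in the style of related AANN work, not the configuration reported in this paper.

```python
# Hedged sketch: one AANN per audio class reconstructs blocks of LP residual
# samples; an MLP classifies a clip from the vector of per-class confidences.
# All sizes and the confidence measure are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

BLOCK = 40  # residual samples per AANN input block (assumed)

def build_aann(block=BLOCK):
    """Five-layer AANN: input -> expansion -> compression -> expansion -> output."""
    model = keras.Sequential([
        keras.Input(shape=(block,)),
        layers.Dense(48, activation="tanh"),
        layers.Dense(12, activation="tanh"),    # compression (bottleneck) layer
        layers.Dense(48, activation="tanh"),
        layers.Dense(block, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def aann_confidence(model, blocks):
    """Mean confidence c = exp(-||x - x_hat||^2 / ||x||^2) over a clip's blocks."""
    recon = model.predict(blocks, verbose=0)
    err = np.sum((blocks - recon) ** 2, axis=1) / (np.sum(blocks ** 2, axis=1) + 1e-9)
    return float(np.mean(np.exp(-err)))

def build_mlp(n_classes=5):
    """MLP mapping the per-class confidence vector to an audio class label."""
    model = keras.Sequential([
        keras.Input(shape=(n_classes,)),
        layers.Dense(16, activation="tanh"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One plausible wiring consistent with the description above: train one AANN per audio class on residual blocks from that class, then form, for each clip, the vector of per-class confidences and use it as the input feature vector to the MLP for the final class decision.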