Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Anvita Eusipco 2004


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Anvita Eusipco 2004

  1. 1. AUDIO CLIP CLASSIFICATION USING LP RESIDUAL AND NEURAL NETWORKS MODELS Anvita Bajpai and B. Yegnanarayana Department of Computer Science and Engineering Indian Institute of Technology Madras, Chennai- 600 036, India anvita, yegna¡   ABSTRACT et al. classify audio into five categories of television (TV) programs using spectral features [4]. Features based on In this paper, we demonstrate the presence of audio-specific amplitude, zero-crossing, bandwidth, band energy in the information in the linear prediction (LP) residual, obtained subbands, spectrum and periodicity properties, along with after removing the predictable part of the signal. We empha- hidden Markov model (HMM) for classification are explored size the importance of information present in the LP resid- for audio indexing applications in [5]. But it was shown ual of audio signals, which if added to the spectral informa- that perceptually significant information of audio data is tion, can give a better performing system. Since it is dif- present in the form of sequence of events, which can be ficult to extract information from the residual using known obtained after removing the predictable part in the audio signal processing algorithms, neural networks (NN) models data. Perceptually, there are some discriminating features are proposed. In this paper, autoassociative neural networks present in the residual which could help in various audio (AANN) models are used to capture the audio-specific infor- indexing tasks. The challenge lies in developing algorithms mation from the LP residual of signals. Multilayer feedfor- to capture these perceptually significant features from the ward neural networks (MLFFNN) models or multilayer per- residual, as it is difficult to extract information using known ceptron (MLP) are used to classify the audio data using the signal processing algorithms. audio-specific information captured by AANN models. Objective of this study is to explore the features in addition to the features that are currently used to improve the performance of an audio indexing system. In particular, fea- 1. INTRODUCTION tures not used explicitly or implicitly in the current system In this era of information technology, the data that we use is are being investigated. Many interesting and perceptually mostly in the form of audio, video and multimedia. The data, important features are present in the residual signal obtained once recorded and stored digitally, conveys no significant after removing the predictable part. Thus the main objective information in order to organize and use it. The volume of of this study is to explore the features present in the linear data is large, and is increasing daily. Therefore it is difficult prediction (LP) residual for audio clip classification task. to organize the data manually. We need to have an automatic The reason for considering the residual data for study is that method to index the data, for further search and retrieval. the residual part of the signal is generally subject to less Audio plays an important role in classifying multimedia degradation as compared to the system part [6]. The residual data as it contains significant information, and is easier to data contains higher order correlation among samples. As process when compared to video data. For these reasons, known signal processing and statistical techniques are not commercial products of audio retrieval are emerging, e.g., suitable to capture this correlation, an autoassociative neural ( [1]. Content-based classifica- networks (AANN) model is proposed to capture these higher tion of data into different categories is one important step for order correlations among samples of the residual of the building an audio indexing system. audio data. AANN have already been studied to capture information from the residual data for tasks such as speaker In the traditional approach of audio indexing, audio recognition [7]. Further, multilayer feedforward neural is first converted to text, and then it is given to text-based networks (MLFFNN) models or multilayer perceptron search engines [2]. Drawbacks of this approach are: a£ (MLP) are proposed for decision making task using the ¢ not having accurate speech recognizer, b£ not using speech audio-specific information captured by AANN models. ¢ information present in form of prosody, and c£ not appli- ¢ cable for non-speech data like music. An elaborate audio content categorization is proposed by Wold et al. [1], which The paper is organized as follows: Section 2 discusses divides the audio content into sixteen groups. The authors extraction of the LP residual from audio data. Section 3 dis- have used mean, variance and autocorrelation of loudness, cusses AANN models for capturing features in LP residual pitch and bandwidth as audio features and a nearest neigh- for the audio clip classification. Section 4 discusses MLP borhood classifier for the task. The authors quote 81% models for decision making. Section 5 presents the work- classification accuracy for an audio database with 400 sound flow of the system. The results of the experimental studies files. Guo et al. [3] have used features consisting of total are presented in Section 6. Various issues addressed in this power, subband energies, bandwidth, pitch and MFCCs, and paper and possible directions for the future study are summa- support vector machines (SVMs) for classification. Wang rized in Section 7. 2299
  2. 2. 2300 which is non-speech and non-music). There are significant formation from the LP residual. selected for the study are speech, music and noise (audio section, we discuss methods to capture the audio-specific in- in audio is used for classification task. The components information could be perceived while listening. In the next components. The knowledge of these components present cannot be observed in the residual signal, the audio-specific while advertisement audio has music and speech as major shown in Figure 1. For some cases even if the difference components. For example, news audio has clean speech, within it. The residual signal of the five different classes are In an audio category, there could be one or more audio the most difficult class to study, as it has many variations more in the case of football audio. Advertisement audio is AANN Models speech and other background sounds, like noise. Noise is 3.1 Use of Audio Component Knowledge for Building is a part of cartoon audio. Cricket and football have casual gory differs from the news speech in terms of prosody. Music information present in LP residual signal. vironment) has clean speech, while speech for cartoon cate- for building AANN models for capturing the audio-specific variations among them. News audio (in closed studio en- section, we discuss the use of audio component knowledge The five classes considered for the present study show on the structure of the network [7]. In the following sub- The performance of the network does not critically depend ing to five different categories. of nodes in hidden layers have been decided experimentally. Figure 1: LP residual for the segments of audio clips belong- sidered for study), and may vary for other audio. The number speech knowledge (which is suitable for the categories con- No. of Residual Samples ual signal to be considered for study has been derived using 400 350 300 250 200 150 100 50 0 −0.5 and output layer is 40. However, this the duration of resid- sampling frequency and hence the number of nodes in input 0 the residual signal is chosen. The signal is recorded at 8 kHz News 0.5 we are interested in the source features, the 5 ms duration of 400 350 300 250 200 150 100 50 0 −0.5 A tanh is used as the non-linear activation function. Since, 40L, where L refers to linear units, and N to non-linear units. 0 ture of the network used in our study is 40L 48N 12N 40N Football 0.5 layer AANN model as shown in Figure 2 is used. The struc- 400 350 300 250 200 150 100 50 0 −0.5 the audio information present in the LP residual signal, a five specific to the speaker in the LP residual [7]. For capturing 0 models have been shown to capture excitation source features Cricket 0.5 forming an identity mapping of the input space [10]. AANN 400 350 300 250 200 150 100 50 0 −0.5 AANN models are feedforward neural networks per- 0 Residual Amplitude for Audio Clips Cartoon audio-specific information. 0.5 400 350 300 250 200 150 100 50 0 Figure 2: Structure of AANN model used for capturing −0.5 0 Layer Output Layer Input Layer Compression H HHH GGGG Commercials 0.5 4 44 GHGHGHGH BI BI BI BI 12121212 bb bb 56CDab‚56CDab‚56CDab‚56CDab‚ aaaa ‚‚‚‚ 78BI78BI78BI78BI c4 GHGHG 39@439@439@4 12EFWXY`12EFWXY`12EFWXY`12EFWXY` `EFWX `EFWX `EFWX `EFWX EEEE BIBI 78BI78BI78BI78BI 6abstw‚ƒ„56abst6abstw‚ƒ„56abst6abstw‚ƒ„56abst6abstw‚ƒ„56abst 56ab56abab56CDab‚56CDab‚56CDab‚ 5555 FFFF 349@39@44349@39@44349@349@4 12EFWXY`12EFGHWX12EFWXY`12EFGHWX1EWY1EGW c34349349349 2222 34349@349@349@ 1111 5xx 5xx 5xx 5xx XXXX XY`XY`XY`XY` WWWWWWWW 787878BI78BI78BI78BI d 9@9@9@9@9@9@ EFY`EFY`EFY`EFY` 34@ 34@ 34@ 2E2E2E2E 66abst66abstabst )5555 BIBIBIBI abstwxabstwx56‚ababstwxabstwx56‚ab56‚ab PPPP QQQQ ƒƒƒƒ A778A778A778A778 RRRR TT TT ‚ƒ„6‚ƒ„6‚ƒ„6‚ƒ„6 d34d 444 2222 34333 1111 349@9@349@9@349@9@ 12EFWXY`E`12EFWXY`E`1EWYE stwx‚ƒ„stwx‚ƒ„stwx‚ƒ„stwx‚ƒ„ 7777 )056abƒ„56abƒ„6ƒ556abƒ„56abstwx‚ƒ„56abstwx‚ƒ„6ƒ556abstwx‚ƒ„x56abstwx‚ƒ„56abstwx‚ƒ„6ƒ556abstwx‚ƒ„x56abstwx‚ƒ„56abstwx‚ƒ„6ƒ556abstwx‚ƒ„x hhh X1X1X1X1 bbbb APAPAPAP 56a56a56a56a RSTUVRSTUVRSTUVRSTUV „„„„ 8888 8888 QQQQ QRQRQRQR ipipip 444 pef pef pef 333 56xƒabstw‚56xƒabstw‚56xƒabstw‚56xƒabstw‚ 4444 12E12E12E12E 78UV77878APQRSTUV78APQRSTUVA78BI78APQRSTUV78APQRSTUV78ABI 33349@ef33349@ef349@ef `12EFWXY`12WX```12EFWXY`12WX1EWY1W ipqrefgipqrefgipqrefg W W W W 4 hhh ```` yyyy 3efgefgefg FWXYFWXYFWXYFWXY aaaa €€€€ ssss ‚„‚„‚„‚„ ƒ„ƒ„ƒ„ƒ„ qrhipqrhipqrhipqr FWXYFWXYFWXYFWXY 78QRSTAPUV78QRSTAPUV78APQRSTUV7PQRT78QRSTAPUV78QRSTAPUV78APQRSTUV7PQRT 56ƒb„ssss minimizing the mean squared error over an analysis frame. 3fff aaaa €€€€ xxxx €‚€‚€‚€‚ xxxx hhh tttt uuuu 34e34efghip34e34e34efghip34efghip Y`12EFWXY`Y`Y`Y`12EFWXY`1EWY !quot; refghiqr34pefghiqr34pefghiqr34p 12WXY12EFWXY`12WXY12WXY12WXY12EFWXY`1EWY bbbb 'bbbb yyyy vwvwvwvw yyyy wwww 4ggg 12E12E12E12E ggg 78UV78UV878QRSTAPUV78APQRSTUV8QR78QRSTAPUV78APQRSTUV8QR %34efgghhipqr34p34efgghhipqr34p34efgghhipqr34p 112WXYY`11112WXYY`1WYY 78QRSTUV78QRSTUV78QRSTUV78QRSTUV © abuvwxy€abuvwxy€56xƒbw„abuvwxy€abuvwxy€56xƒbw„56bwxƒ„ „„„„ ‚ƒ‚ƒ‚ƒ‚ƒ % VVVV 34q34efghipqr34efghipqr34efghipqr 2WWXY`2WWXY`2WWXY`2WWXY` ¨ '(abuvƒ„abuvƒ„abbuvƒ„aa„abuvwxy€‚ƒ„abbstuvwwxxyy€€‚ƒ„aassty€‚‚„abuvwxy€‚ƒ„abbstuvwwxxyy€€‚ƒ„aassty€‚‚„abuvwxy€‚ƒ„abbstuvwwxxyy€€‚ƒ„aassty€‚‚„ abuvwxy€‚ƒ„abuvwxy€‚ƒ„abuvwxy€‚ƒ„abuvwxy€‚ƒ„ 78QRT78QRT78QRT78QRT SSUVUSSUVUSSUVUSSUVU $34qr34ghipqr34ghipqr34ghipqr Y`Y`Y`Y` r rr The linear prediction coefficients ak are determined by RUVRUVRUVRUV QQQQ 8888 78UV78UUVV78QRSSTUV78QRSTUVUV78QRSSTUV78QRSTUUVV abuvwxy€‚ƒ„abuvwxy€‚ƒ„abuvwxy€‚ƒ„abuvwxy€‚ƒ„ „„„„ QSTQSTQSTQST 8888 $3434efipqr343434efipqr34efipqr 12WX`12WX`1W #prprpr XXXX ƒƒƒƒ 7777 7777 #34qriii WWWW ripipip XXXX LP Residual LP Residual 33434ghipqr34ghipqr34ghipqr Y`WXY`WY`Y`Y`WXY`WWYW 8888 qrghpqrghghpqrghpqrghgh WXYYWXYWXYWXYYY uvwxy€‚ƒ„uvwxy€‚ƒ„abuvwxy€‚ƒ„uvwxy€‚ƒ„uvwxy€‚ƒ„abuvwxy€‚ƒ„abuvwxy€‚ƒ„ 7777 VVVV STUSTUSTUSTU  78UV7788UV78STUV7788QQRSTUV78STUV7788QQRSTUV ipqrighq3ipqrighq3ipqrighq3 ```` uvƒ„abuvƒƒ„„uvwxy€‚ƒ„abuvwxy€‚ƒ„ƒ„uvwxy€‚ƒ„abuvwxy€‚ƒ„ƒ„uvwxy€‚ƒ„abuvwxy€‚ƒƒ„„ qrghq34iprrghq34iprr34ghipqrr Y`WXY`XY`Y`Y`WXY`XWY uv‚ƒ„uv‚ƒ„uv‚ƒ„uv‚ƒ„ VVVV TUTUTUTU qripqrghipqr3ipipqrghipqr3ipipqrghipqr3ip Y`Y`Y`Y`YY 78S78S78S78S UV78UV78STUV78STUV vvvv „„„„ ‚ƒ‚ƒ‚ƒ‚ƒ ipqripqripqr STUUVSTUUVSTUUVSTUUV uuuu vv vv uvƒ„uvƒ„uv‚ƒ„uv‚ƒ„uv‚ƒ„uv‚ƒ„uv‚ƒ„uv‚ƒ„ k¥ 1 „„„„ ‚‚‚‚ ƒƒƒƒ rqrqqq `` `` `` `` qqipqripqripqr YYYY STUVSTUVSTUVSTUV VVVV ¢ „u„u„u„u ƒ„uƒ„vƒ„u‚ƒ„vƒ„u‚ƒ„vƒ„u‚ƒ„v ƒƒƒƒ iii Y`YYY` pqripqrpqrpqripqripqr Y`Y`Y UVUVSTUVUVUVSTUV qrqripqrqrqripqripqr YYY UVUVVUUVSTUVVUUVSTUVVU UVUVUVUV £ ¡ ¢ £ ¦ ¡ ¢ qrqrqripqrqqrqrqripqrqqripqrq UVUVUVUVUVUV UV UV UV UV k£ s¢ n£ s  n£ s¢ n£ e¢ n£ (2) 12 § qrqrqrqrqrqrqrqr ∑ ak s¢ n 40 Nodes 40 qrqrqrqr qrqrqrqrqrqr p 48 48 given by, ple value is termed as prediction error or residual, which is information from the LP residual [9]. The difference between the actual, and predictable sam- propose neural network models to capture the audio-specific k¥ 1 such an information may involve non-linear processing, we ¢ £ £¡ ¤¢ the LP residual is not clear, and also since the extraction of k£ s  n£ (1) ∑ ak s¢ n p cific set of parameters to represent the audio information in tions among the samples of the residual signal. Since spe- past p samples as, the audio features may be present in the higher order rela- If s¢ n£ is the present sample, then it is predicted by the acteristics are present in the LP residual. We conjecture that past p samples, where p represents the order for prediction. the audio production system. But the excitation source char- ysis each sample is predicted as a linear weighted sum of the does not contain any significant second order correlations of nal using linear prediction (LP) analysis [8]. In the LP anal- tures through the autocorrelation functions, the LP residual The first step is to extract the LP residual from the audio sig- Since LP analysis extracts the second order statistical fea- INFORMATION IN LP RESIDUAL CLIP CLASSIFICATION 3. AANN MODELS FOR CAPTURING AUDIO 2. SIGNIFICANCE OF LP RESIDUAL FOR AUDIO