Anvita Audio Classification Presentation



  1. Audio Clip Classification. Anvita Bajpai
  2. Source:
  3. Exploding information
     ● One hour of TV broadcast across the world amounts to 100 petabytes.
     ● Source:
  4. Audio indexing
     ● Reasons for choosing audio data for the study
       – Easier to process
       – Contains significant information
     ● Indexing: a method of organizing data for further search and retrieval
       – Example: book indexing
     ● Audio indexing: indexing non-text data using its audio part
  5. Example of an audio indexing system. Source: J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, and A. Srivastava, "Speech and language technologies for audio indexing and retrieval," in Proc. of the IEEE, 88(8), pp. 1338-1353, 2000.
  6. More examples of audio indexing tasks
     ● Spoken document retrieval
     ● Speaker identification
     ● Language identification
     ● Music classification
     ● Music/speech discrimination
     ● Audio classification
       – An important step in building an audio indexing system
  7. Levels of information in an audio signal
     ● Subsegmental information
       – Related to excitation source characteristics
     ● Segmental information
       – Related to system / physiological characteristics
     ● Suprasegmental information
       – Related to behavioural characteristics of the audio
  8. Audio clip classification
     ● A closed-set problem
     ● To classify a given audio clip into one of the following predefined categories:
       – Advertisement
       – Cartoon
       – Cricket
       – Football
       – News
  9. Issues in audio clip classification
     ● Feature extraction
       – Effective representation of the data, capturing all properties of the audio significant for the task
       – Robustness under various conditions
     ● Classification
       – Formulation of a distance measure and rules/models
     ● Training models for the task
     ● Testing: the actual classification task
     ● Combining evidence from different systems
  10. Missing component in existing approaches and its importance
     ● Features derived from spectral analysis
       – Carry significant properties of the audio data at the segmental level
       – Miss information present at the subsegmental and suprasegmental levels
     ● Perceptually significant information in the linear prediction (LP) residual of the signal
       – Complementary in nature to the spectral information
       – Subsegmental and suprasegmental information is not used in current systems
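As a concrete sketch of the LP residual discussed above: linear prediction fits each sample as a weighted sum of the previous ones (the segmental, system part), and the residual is the prediction error that remains (the excitation part). This is an illustrative NumPy implementation using the autocorrelation method with the Levinson-Durbin recursion, not the thesis code; on a synthetic AR signal driven by an impulse train, most of the signal energy is removed by the predictor.

```python
import numpy as np

def lp_coefficients(x, order):
    """Levinson-Durbin recursion on the autocorrelation of x.
    Returns a such that x[n] is approximated by sum_j a[j] * x[n-1-j]."""
    r = [float(np.dot(x[:len(x) - k], x[k:])) for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)               # remaining prediction error
    return np.array(a)

def lp_residual(x, order):
    """Prediction error e[n] = x[n] - sum_j a[j] * x[n-1-j]."""
    a = lp_coefficients(x, order)
    pred = np.zeros_like(x)
    for j in range(order):
        pred[order:] += a[j] * x[order - 1 - j : len(x) - 1 - j]
    e = x - pred
    e[:order] = 0.0                        # no full history for first samples
    return e

# Synthetic "speech-like" signal: an AR(2) resonator (the system) driven
# by a sparse impulse train (the excitation source).
exc = np.zeros(400)
exc[::40] = 1.0
sig = np.zeros(400)
for n in range(400):
    sig[n] = exc[n]
    if n >= 1:
        sig[n] += 1.6 * sig[n - 1]
    if n >= 2:
        sig[n] -= 0.81 * sig[n - 2]

res = lp_residual(sig, order=10)
```

After whitening, the residual is close to the sparse excitation, so its energy is a small fraction of the signal energy; that is what makes the residual a distinct, source-level carrier of information.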
  11. Presence of audio-specific information in the LP residual. (Figure: original signal Aa1.wav and its LP residual Aa_res.wav.)
  12. Extracting audio-specific information from the LP residual
     ● LP residual
       – May contain higher-order correlations among samples
       – Difficult to extract using standard signal processing and statistical techniques
     ● Hence autoassociative neural network (AANN) models are proposed to capture information from the residual
       – Previously used to capture features for the speaker recognition task
     ● Structure of the network: 40L 48N 12N 48N 40L
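An AANN such as the 40L 48N 12N 48N 40L network is an autoencoder: it is trained to reproduce its input through a narrow bottleneck, and a clip's match to a model is judged by reconstruction error. The principle can be shown with a deliberately tiny stand-in (a linear autoassociative net with a one-unit bottleneck on 2-D data, not the five-layer nonlinear structure above): data resembling the training distribution reconstructs well, other data does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data: 2-D points lying almost on one direction, a stand-in for
# feature vectors drawn from a single audio component.
d = np.array([1.0, 2.0]) / np.sqrt(5.0)
t = rng.uniform(-1.0, 1.0, size=(200, 1))
X = t * d + 0.01 * rng.standard_normal((200, 2))

# Linear autoassociative net with a 1-unit bottleneck: x_hat = W2 @ W1 @ x.
W1 = 0.1 * rng.standard_normal((1, 2))
W2 = 0.1 * rng.standard_normal((2, 1))
lr = 0.05
for epoch in range(2000):              # the slides also train for 2000 epochs
    H = X @ W1.T                       # bottleneck activations, shape (200, 1)
    Xhat = H @ W2.T                    # reconstructions, shape (200, 2)
    E = X - Xhat
    # Gradients of the mean squared reconstruction error.
    gW2 = -2.0 * E.T @ H / len(X)
    gW1 = -2.0 * (E @ W2).T @ X / len(X)
    W2 -= lr * gW2
    W1 -= lr * gW1

def recon_error(x):
    return float(np.sum((x - W2 @ (W1 @ x)) ** 2))

err_in = recon_error(0.5 * d)                                   # in-distribution
err_out = recon_error(0.5 * np.array([2.0, -1.0]) / np.sqrt(5.0))  # orthogonal
```

One model is trained per audio component, and at test time the model with the smallest reconstruction error (highest confidence) claims the frame.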
  13. Use of audio component knowledge
     ● Audio category
       – Composed of one or more audio components
     ● Audio component
       – Specific to an audio category
     ● Six components chosen for the study
       – Music
       – Speech: conversational, cartoon, clean
       – Noise: football, cricket
  14. Training phase of AANN models
     ● One AANN model trained for each of the six components
     ● Models trained for 2000 epochs
     ● (Figure: AANN training error curve)
  15. Testing phase: confidence scores output by the 6 AANN models for a news test clip. (a) For a segment of the clip; (b) expanded version of the same. Total duration of the test clip is 10 sec.
  16. Workflow diagram (of the 6 components). MLP: multilayer perceptron.
  17. MLP for the decision-making task
     ● MLP operates on the audio-specific information captured by the AANNs, as is
     ● Suitable for pattern recognition tasks
       – Able to form complex decision surfaces using discriminative learning algorithms
     ● Structure of the MLP used: 6L 24N 12N 5N
  18. (Contd.) MLP structure: input layer of 6 nodes (confidence scores of the component AANN models M, S1, S2, S3, N1, N2), hidden layers of 24 and 12 nodes, and an output layer of 5 nodes (audio categories A, C, K, F, N).
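A forward-pass sketch of the 6L 24N 12N 5N decision network: six AANN confidence scores in, five category scores out. The weights below are random placeholders just to show the shapes and the nonlinearities; the softmax output is an assumption for illustration (the slides do not state the output activation), and the real network would of course be trained discriminatively first.

```python
import numpy as np

rng = np.random.default_rng(2)

# 6 input nodes (component confidences M, S1, S2, S3, N1, N2),
# hidden layers of 24 and 12 nodes, 5 output nodes (A, C, K, F, N).
sizes = [6, 24, 12, 5]
weights = [rng.standard_normal((m, n)) * 0.3
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def mlp_forward(scores):
    """Map six component confidence scores to five category probabilities."""
    h = np.asarray(scores, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)             # nonlinear hidden layers
    logits = weights[-1] @ h + biases[-1]
    e = np.exp(logits - logits.max())      # softmax over the 5 categories
    return e / e.sum()

categories = ["Advertisement", "Cartoon", "Cricket", "Football", "News"]
probs = mlp_forward([0.9, 0.1, 0.2, 0.1, 0.6, 0.7])
decision = categories[int(np.argmax(probs))]
```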
  19. Classification results

      Audio class      % of clips correctly classified
                       DB1        DB2
      Advertisement    83.00%     43.50%
      Cartoon          88.00%     45.50%
      Cricket          86.00%     38.50%
      Football         90.50%     75.50%
      News             85.50%     63.30%
      Average          86.60%     53.26%

      DB1: data collected from a single TV channel; 200 clips, 40 per category.
      DB2: data collected across all broadcast channels; 1659 clips (Advertisement 226, Cartoon 208, Cricket 318, Football 600, News 306).
  20. Classification results for the spectral-features-based system [1]

      Audio class      Spectral features-based   LP residual-based
                       DB1        DB2            DB1        DB2
      Advertisement    85.00%     65.00%         83.00%     43.50%
      Cartoon          90.00%     75.00%         88.00%     45.50%
      Cricket          90.00%     65.00%         86.00%     38.50%
      Football         92.50%     40.00%         90.50%     75.60%
      News             87.50%     65.30%         85.50%     63.30%
      Average          89.00%     62.06%         86.60%     53.26%

      [1] Gaurav Aggarwal, "Features for Audio Indexing", M.Tech report, CSE Dept., IIT Madras, Apr. 2002.
  21. Classification results from the source- and spectral-features-based systems (set diagram)
     ● A: all test audio clips (DB2)
     ● System 1: clips recognised using the spectral-features-based system
     ● System 2: clips recognised using the excitation-source (LP residual) based system
  22. Results of the combined (subsegmental and segmental) system for DB2

      Audio class      % of clips correctly classified
                       Spectral-   LP residual-   Abstract-level   Rank + measurement-level
                       based       based          combination      combination
      Advertisement    65.00%      43.50%         83.00%           92.47%
      Cartoon          75.00%      45.50%         92.00%           98.55%
      Cricket          65.00%      38.50%         87.50%           88.67%
      Football         40.00%      75.60%         87.00%           91.16%
      News             65.30%      63.30%         86.30%           95.10%
      Average          62.06%      53.26%         87.25%           93.18%
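The two combination styles in the table can be sketched generically. An abstract-level rule looks only at each system's top label; a measurement-level rule combines the score vectors themselves before deciding. The scores and the back-off/weighting choices below are hypothetical illustrations; the slide does not give the exact combination rules used.

```python
# Hypothetical per-class scores from the two systems for one test clip.
classes = ["Advertisement", "Cartoon", "Cricket", "Football", "News"]
spectral = [0.30, 0.10, 0.15, 0.20, 0.25]   # spectral-features-based system
residual = [0.10, 0.05, 0.10, 0.60, 0.15]   # LP-residual-based system

def abstract_level(s1, s2):
    """Use only the top labels: if the systems agree, take that label,
    otherwise back off to the label with the higher winning score."""
    l1 = max(range(len(s1)), key=lambda i: s1[i])
    l2 = max(range(len(s2)), key=lambda i: s2[i])
    if l1 == l2:
        return classes[l1]
    return classes[l1] if s1[l1] >= s2[l2] else classes[l2]

def measurement_level(s1, s2, w=0.5):
    """Weighted sum of the two score vectors, then argmax."""
    combined = [w * a + (1 - w) * b for a, b in zip(s1, s2)]
    return classes[max(range(len(combined)), key=lambda i: combined[i])]
```

Measurement-level fusion keeps the full evidence from both systems, which is consistent with it outperforming the abstract-level combination in the table.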
  23. Suprasegmental information in the Hilbert envelope of the LP residual of an audio signal
  24. Suprasegmental information in the LP residual for audio clip classification. (Figure: autocorrelation samples of the Hilbert envelope of the LP residual for the 5 audio classes.)
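The suprasegmental evidence here comes from periodicity in the Hilbert envelope of the LP residual. A sketch of that pipeline on a synthetic residual-like signal, assuming NumPy only (the analytic signal is built with the standard FFT construction rather than scipy.signal.hilbert): the envelope's autocorrelation peaks at the excitation period.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, via the FFT one-sided trick."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

# Synthetic residual-like signal: a carrier amplitude-modulated by a pulse
# train with period 50 samples (a stand-in for excitation epochs).
n = np.arange(1000)
am = np.zeros(1000)
am[::50] = 1.0
am = np.convolve(am, np.exp(-np.arange(20) / 5.0))[:1000]  # smooth the pulses
x = am * np.cos(0.4 * np.pi * n)

env = hilbert_envelope(x)
env = env - env.mean()
ac = np.correlate(env, env, mode="full")[len(env) - 1:]    # lags 0..N-1
period = int(np.argmax(ac[25:100])) + 25  # strongest peak away from lag 0
```

The recovered period (about 50 samples here) is exactly the kind of slowly varying, behaviour-level regularity that segmental spectral features miss.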
  25. Statistics of the autocorrelation sequence. Correction: these are statistics of the autocorrelation-sequence peaks of the Hilbert envelope (not of the LP residual).
  26. Statistics of the autocorrelation sequence (contd.)
  27. Scope of future work
     ● Extending the framework to other audio indexing applications
     ● Exploring methods to add suprasegmental information to the combined system (though far away...)
     ● Building a multimedia indexing system
  28. Summary and conclusions
     ● Need to organize audio data, given its large volume and its use in real-life applications
     ● Presence of audio-specific information in the LP residual
     ● Ability of AANN models to capture subsegmental information from the residual for the task
     ● Use of an MLP for decision making on the information captured by the AANNs
     ● Complementary nature of the source information to the system information
     ● Presence of audio-specific suprasegmental information in the LP residual
  29. Major contributions
     ● Extraction of audio-specific information from the LP residual using neural network models
     ● Demonstration of the complementary nature of source and system information for the audio clip classification task
     ● Demonstration of the presence of audio-specific suprasegmental information in the LP residual
  30. References
      1. T. Zhang and C.-C. J. Kuo, "Content-based classification and retrieval of audio," in Conference on Advanced Signal Processing Algorithms, Architectures, and Implementations VIII, San Diego, California, July 1998, vol. 3461 of Proc. of SPIE.
      2. J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, and A. Srivastava, "Speech and language technologies for audio indexing and retrieval," in Proc. of the IEEE, 88(8), pp. 1338-1353, 2000.
      3. Y. Wang, Z. Liu, and J. Huang, "Multimedia content analysis using audio and visual clues," IEEE SP Magazine, 17(6), Nov. 2000.
      4. M.A. Kramer, "Nonlinear principal component analysis using autoassociative neural networks," AIChE Journal, vol. 37, pp. 233-243, Feb. 1991.
      5. J. Makhoul, "Linear prediction: A tutorial review," in Proc. IEEE, vol. 63, pp. 561-580, 1975.
      6. B. Yegnanarayana, S.R.M. Prasanna, and K.S. Rao, "Speech enhancement using excitation source information," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Orlando, FL, USA, May 2002.
      7. S.R.M. Prasanna, Ch.S. Gupta, and B. Yegnanarayana, "Autoassociative neural network models for speaker verification using source features," in Proc. Sixth Int. Conf. Cognitive Neural Systems, Boston University, Boston, USA, May-June 2002.
      8. B. Yegnanarayana, Artificial Neural Networks, Prentice Hall of India, New Delhi, 1999.
  31. Related publications
      1. Anvita Bajpai and B. Yegnanarayana, "Audio Clip Classification using LP Residual and Neural Networks Models", European Signal and Image Processing Conference (EUSIPCO-2004), Vienna, Austria, 6-10 September 2004.
      2. Anvita Bajpai and B. Yegnanarayana, "Exploring Features for Audio Indexing using LP Residual and AANN Models", accepted for The 17th International FLAIRS Conference (FLAIRS-2004), Miami Beach, Florida, 17-19 May 2004.
      3. Anvita Bajpai and B. Yegnanarayana, "Exploring Features for Audio Clip Classification using LP Residual and Neural Networks Models", International Conference on Intelligent Signal and Image Processing (ICISIP-2004), Chennai, India, 4-7 January 2004.
      4. Gaurav Aggarwal, Anvita Bajpai and B. Yegnanarayana, "Exploring Features for Audio Indexing", Indian Research Scholar Seminar (IRIS-2002), Indian Institute of Science, Bangalore, India, March 2002.
  32. The following are extra slides, not part of the main presentation.
  33. Effect of the number of epochs used for AANN training. (Figure: confidence scores output by the 6 AANN models for a news test clip.)
  34. Even well-trained humans don't always react the way they were trained.
     ● Source: 0103/random/r1014.pdf, by Bob Colwell
  35. Classification of audio using spectral features
     ● Extraction of features based on:
       – Volume: standard deviation and dynamic range of volume, volume undulation, 4 Hz modulation energy
       – Zero-crossing rate: standard deviation of ZCR, silence/non-silence ratio
       – Pitch: pitch contour, pitch standard deviation, similar-pitch ratio, pitch/non-pitch ratio
       – Spectrum: frequency centroid, bandwidth, ratio of energy in various frequency sub-bands
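Two of the frame-level features listed above (zero-crossing rate and volume) and their clip-level statistics can be sketched directly. The frame size of 256 samples and the toy signals are assumptions for illustration; the slides do not state the exact analysis settings. A gated "speech-like" signal that alternates loud voiced stretches with near-silence varies far more, in both ZCR and volume, than a steady tone.

```python
import numpy as np

def frame_features(x, frame=256):
    """Per-frame zero-crossing rate and RMS volume, summarized by the
    clip-level statistics named on the slide."""
    nf = len(x) // frame
    zcr, vol = [], []
    for i in range(nf):
        f = x[i * frame:(i + 1) * frame]
        signs = np.sign(f)
        zcr.append(float(np.mean(signs[1:] != signs[:-1])))  # zero-crossing rate
        vol.append(float(np.sqrt(np.mean(f ** 2))))          # RMS volume
    zcr, vol = np.array(zcr), np.array(vol)
    return {
        "zcr_std": float(zcr.std()),
        "vol_std": float(vol.std()),
        "vol_dyn_range": float((vol.max() - vol.min()) / max(vol.max(), 1e-12)),
    }

n = np.arange(8192)
# Steady "music-like" tone: nearly constant ZCR and volume per frame.
steady = np.sin(0.1 * np.pi * n)
# "Speech-like" toy signal: loud low-frequency stretches alternating with
# near-silent high-frequency stretches every 1024 samples.
gate = (n // 1024) % 2
speechy = (np.sin(0.1 * np.pi * n) * gate
           + 0.001 * np.sin(0.8 * np.pi * n) * (1 - gate))
```

Clip-level statistics like these standard deviations are what separate, say, commentary-over-crowd-noise clips from continuous music.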
  36. Features for categorization of audio clips: 4 Hz modulation energy. (Panels: Cricket, Football, News.)
  37. Features for categorization of audio clips (contd.): similar-pitch ratio. (Panels: Cricket, Football, News.)
  38. Importance of task-dependent features: standard deviation of ZCR. (Panels: Speaker 1, Speaker 2, Music.)