Context-based modeling of audio signals toward information retrieval


Presented at IEEK DSP Workshop 2013

Published in: Technology, Education

  • Ambiguities in sound. Heterogeneous: a mixture of multiple sound sources. Dependency on context: similar audio content may represent different meanings depending on the surrounding sound.
  • Acoustic words play a role similar to that of words in text: they transform audio signals into text-like signals. We have tried various strategies, such as ASR and onomatopoeic words, but MFCC-VQ worked best.
  • Context-based modeling of audio signals toward information retrieval

    1. Contextual modeling of audio signals toward information retrieval. Samuel Kim, Ph.D., Given Zone, LLC. allthatsignal@gmail.com, http://allthatsignal.com. All that signal; for the people, by the people, of the people © Given Zone, LLC
    2. Audio Information Retrieval
    3. Motivation
    4. Open Challenges: Heterogeneous …
    5. Proposed approach: context-based approach rather than content-based
    6. Hypothesis: An acoustic scene consists of a set of acoustic topics. Each acoustic topic has a probability distribution over acoustic words. [Diagram labels: acoustic scene, acoustic topics, acoustic words (signal characteristics)]
    7. Problem formulation
          • Ball-picking problem (a.k.a. urn problem): multinomial distribution
          • What if the number of balls, i.e., , is random?
          • Conjugate prior of the multinomial: Dirichlet distribution
    8. Latent Dirichlet Allocation (LDA): graphical representation of LDA. [Diagram labels: Dirichlet parameter, topic distributions, topic, word, word distribution given topic, Dirichlet parameter]
    9. Acoustic Words
    10. Approximation
          • Inference/modeling process: involves intractable computations, such as
          • Approximation methods: Gibbs sampling [Steyvers 2007], a form of Markov Chain Monte Carlo (MCMC); variational approximation [Blei2003]; etc.
    11. Interpretation
          • Latent acoustic topics: probabilistic (soft) clustering in terms of the co-occurrence of acoustic words
          • Acoustic topics: probabilistic assignment to individual topics (in the figure, size represents probability)
          • Acoustic words: discrete symbols of acoustic characteristics that play a role similar to text words
          • Audio features
    12. Two-step Learning for Classification Applications
          • Training phase: Training DB → Acoustic Topic Model (unsupervised modeling) → topic distribution probability → classifiers (multiclass SVM; supervised classifier)
          • Test phase: test signal → Acoustic Topic Model → topic distribution probability → classifiers
    13. Possible Applications
          • Content identification: music information retrieval [Levi2009]; audio fingerprinting [Kim2012]
          • Audio scene analysis: understanding auditory scenes [Kim2009]; environmental sound classification [Kim2010]
          • User modeling: behavioral analysis [Kim2011]; emotion recognition [Kim2013]
    14. Target application: automatic classification of TV genre using audio content
    15. Scenarios
          • Off-line: assumes the system knows when the program starts and ends; prior segmentation is required.
          • On-line: makes decisions without prior segmentation, every X seconds (online scene detection, etc.).
    16. Scenarios: models are trained in an off-line manner.
          • Training phase: Training DB → Acoustic Topic Model → topic distribution probability → classifiers (multiclass SVM)
          • Test phase: test signal, with segmentation switched on or off, yielding an off-line result or an on-line result
    17. Dataset: RAI dataset
          • Provides a benchmarking test-bed (6-fold cross-validation)
          • Italian TV broadcast programs
          • 7 genres
          • 262 programs (15 min per program)
    18. Off-line classification
          • Overall accuracy: comparison with conventional content-based approaches
          • Competitive results using only audio content

          Accuracy (%)  | MLP* [2007] | MLP* [2009] | SVM** [2010] | GMM (64 mixtures) | ATM (64 topics, 2,048 words)
          Audio only    | -           | -           | 86.6         | 93.6              | 94.3
          Audio-visual  | 92.0        | 94.9        | 99.6         | -                 | -

          * MLP: Multilayer Perceptron. ** SVM: Support Vector Machine.

          [2007] M. Montagnuolo and A. Messina, “TV Genre Classification Using Multimodal Information and Multilayer Perceptrons,” LNAI 4733, 2007.
          [2009] M. Montagnuolo and A. Messina, “Parallel neural networks for multimodal video genre classification,” Multimedia Tools and Applications, vol. 41, 2009.
          [2010] H. Ekenel, et al., “Content-based Video Genre Classification Using Multiple Cues,” ACM, 2010.
    19. Off-line classification: confusion matrices for ATM and GMM. Class labels: CT = Cartoon, CM = Commercial, FB = Football, MU = Music, NE = News, TS = Talk show, WF = Weather Forecast.
    20. On-line classification: accuracy according to length of segments. [Figure: accuracy (%) versus segment length, time (s), from 0 to 6 s; curves for ATM and GMM]
    21. On-line classification: per-class F-measure. [Figure panels: 1 second, 6 seconds]
    22. Summary
          • The genre of TV programs can be detected using only audio content, with a context-based approach.
          • Off-line tasks: results are competitive with conventional audio-visual approaches.
          • On-line tasks: ATM outperforms GMM if the segments are long enough.
    23. Conclusions: Acoustic Topic Model (ATM)
          • Captures contextual information of audio signals by modeling the co-occurrence of text-like audio signals
          • Can be used in various classification applications in combination with a supervised classifier
    24. Thank you very much
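
The acoustic-word step (slide 9 and the MFCC-VQ speaker note) can be illustrated roughly as follows. This is a minimal sketch, assuming frame-level feature vectors such as MFCCs have already been extracted; the codebook here is learned with plain k-means, which is one common way to do vector quantization and not necessarily the procedure used in the talk. The function names `train_codebook`, `quantize` and `word_histogram` are hypothetical, and the input frames are synthetic random vectors.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codeword (its acoustic word)."""
    # Squared Euclidean distance between every frame and every codeword.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def train_codebook(frames, n_words, n_iter=20, seed=0):
    """Learn a VQ codebook with plain k-means (an assumed choice)."""
    rng = np.random.default_rng(seed)
    # Initialise codewords from randomly chosen frames.
    picks = rng.choice(len(frames), size=n_words, replace=False)
    codebook = frames[picks].copy()
    for _ in range(n_iter):
        labels = quantize(frames, codebook)
        for k in range(n_words):
            members = frames[labels == k]
            if len(members) > 0:  # keep the old codeword if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def word_histogram(words, n_words):
    """Bag-of-acoustic-words count vector for one clip or segment."""
    return np.bincount(words, minlength=n_words)

# Stand-in for MFCC frames: 500 random 13-dimensional vectors (synthetic data).
rng = np.random.default_rng(1)
frames = rng.normal(size=(500, 13))
codebook = train_codebook(frames, n_words=16)
words = quantize(frames, codebook)
hist = word_histogram(words, n_words=16)
```

The resulting sequence of integer indices is what lets an audio clip be treated like a text document in the later topic-modeling step.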
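
The symbols on slides 7, 8 and 10 did not survive in this transcript, so the following is a sketch in the standard notation of [Blei2003] rather than a reproduction of the slides. The names $\alpha$, $\eta$, $\theta_d$, $\beta_k$, $z_{d,i}$ and $w_{d,i}$ are assumptions: $\theta_d$ is the topic distribution of acoustic scene $d$, $\beta_k$ the acoustic-word distribution of topic $k$, and $\mathbf{n}$ a vector of observed counts.

```latex
% Dirichlet-multinomial conjugacy (urn problem, slide 7)
\begin{align*}
\theta \sim \mathrm{Dir}(\alpha), \quad
\mathbf{n} \mid \theta \sim \mathrm{Mult}(N, \theta)
\;\Longrightarrow\;
\theta \mid \mathbf{n} \sim \mathrm{Dir}(\alpha + \mathbf{n})
\end{align*}

% LDA generative process (slide 8)
\begin{align*}
\beta_k &\sim \mathrm{Dir}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dir}(\alpha), & d &= 1, \dots, D \\
z_{d,i} \mid \theta_d &\sim \mathrm{Mult}(\theta_d), & i &= 1, \dots, N_d \\
w_{d,i} \mid z_{d,i} &\sim \mathrm{Mult}(\beta_{z_{d,i}})
\end{align*}

% Marginal likelihood of one document, the usual source of intractability (slide 10)
\begin{align*}
p(\mathbf{w}_d \mid \alpha, \beta)
= \int p(\theta_d \mid \alpha)
\prod_{i=1}^{N_d} \sum_{k=1}^{K}
p(z_{d,i} = k \mid \theta_d)\, p(w_{d,i} \mid z_{d,i} = k, \beta)\, d\theta_d
\end{align*}
```

The integral couples $\theta_d$ and $\beta$ inside the sum over topics, which is why approximate inference such as Gibbs sampling or variational methods is used in practice.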
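
Slide 10 names Gibbs sampling as one approximation method. A compact way to see what that involves is a collapsed Gibbs sampler for LDA over documents of acoustic-word indices. This is an illustrative sketch under assumptions, not the author's implementation: the function name `collapsed_gibbs_lda`, the hyperparameter values, the iteration count and the synthetic documents are all hypothetical.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, n_words, alpha=0.1, beta=0.01,
                        n_iter=30, seed=0):
    """Collapsed Gibbs sampler for LDA; docs are arrays of word indices."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))   # topic counts per document
    topic_word = np.zeros((n_topics, n_words))    # word counts per topic
    topic_total = np.zeros(n_topics)              # total tokens per topic

    # Random initial topic assignment for every word token.
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + n_words * beta))
                k = rng.choice(n_topics, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1

    # Point estimates of per-document topic and per-topic word distributions.
    theta = doc_topic + alpha
    theta /= theta.sum(axis=1, keepdims=True)
    phi = topic_word + beta
    phi /= phi.sum(axis=1, keepdims=True)
    return theta, phi

# Synthetic corpus: five documents over words 0-4, five over words 5-9.
rng = np.random.default_rng(1)
docs = ([rng.integers(0, 5, size=40) for _ in range(5)]
        + [rng.integers(5, 10, size=40) for _ in range(5)])
theta, phi = collapsed_gibbs_lda(docs, n_topics=2, n_words=10)
```

Each row of `theta` is the kind of topic distribution that the two-step scheme on slide 12 would pass to a supervised classifier.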
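
The on-line scenario on slide 15 makes a decision every X seconds without prior segmentation. One simple way to prepare the input for such decisions is to cut the acoustic-word stream into fixed-length segments and build one histogram per segment. The sketch below assumes a frame rate of 100 frames per second and a hypothetical helper named `segment_histograms`; neither value nor name comes from the slides.

```python
import numpy as np

def segment_histograms(words, n_words, frames_per_second, segment_seconds):
    """Split a stream of acoustic-word indices into fixed-length segments
    and return one word-count histogram per complete segment."""
    seg_len = int(round(frames_per_second * segment_seconds))
    n_segments = len(words) // seg_len  # a trailing partial segment is dropped
    hists = np.zeros((n_segments, n_words), dtype=int)
    for s in range(n_segments):
        chunk = words[s * seg_len:(s + 1) * seg_len]
        hists[s] = np.bincount(chunk, minlength=n_words)
    return hists

# Synthetic stream: 1,000 frames cycling through 8 acoustic words.
words = np.arange(1000) % 8
hists = segment_histograms(words, n_words=8,
                           frames_per_second=100, segment_seconds=2)
```

Each row of `hists` can then be mapped to a topic distribution and classified independently, which is what allows a decision per segment.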
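
Slide 21 reports a per-class F-measure. For reference, it can be computed from a confusion matrix as the harmonic mean of per-class precision and recall. The sketch below assumes rows are true classes and columns are predicted classes; the function name `per_class_f_measure` and the 2-by-2 matrix are made up for illustration and are not results from the talk.

```python
import numpy as np

def per_class_f_measure(confusion):
    """Per-class F-measure from a confusion matrix
    (rows: true class, columns: predicted class)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    predicted = confusion.sum(axis=0)   # column sums: items predicted as class
    actual = confusion.sum(axis=1)      # row sums: items truly in class
    precision = np.divide(tp, predicted,
                          out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual,
                       out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom,
                     out=np.zeros_like(tp), where=denom > 0)

# Toy two-class example (illustrative numbers only).
toy_confusion = [[8, 2],
                 [1, 9]]
f_scores = per_class_f_measure(toy_confusion)
```

Classes with no predictions or no true examples get an F-measure of 0 here instead of raising a division error, which is a design choice of this sketch.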