ARF @ MediaEval 2012: Multimodal Video Classification

  1. ~ Multimodal Video Classification ~
     ARF (Austria-Romania-France) team
     Bogdan IONESCU (1,3) bionescu@imag.pub.ro, Ionuț MIRONICĂ (1) imironica@imag.pub.ro, Klaus SEYERLEHNER (2) music@cp.jku.at, Peter KNEES (2) peter.knees@jku.at, Jan SCHLÜTER (4) jan.schlueter@ofai.at, Markus SCHEDL (2) markus.schedl@jku.at, Horia CUCU (1) horia.cucu@upb.ro, Andi BUZO (1) andi.buzo@upb.ro, Patrick LAMBERT (3) patrick.lambert@univ-savoie.fr
     (1) University POLITEHNICA of Bucharest, (2) Johannes Kepler University Linz, (3) Université de Savoie, (4) Austrian Research Institute for Artificial Intelligence
     *This work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
     MediaEval Workshop, Pisa, Italy, 4-5 October 2012
  2. Presentation outline
     • The approach
     • Video content description
     • Experimental results
     • Conclusions and future work
  3. The approach
     > Challenge: find a way to assign (genre) tags to unknown videos.
     > Approach: machine learning paradigm.
     [Diagram: labeled data from a tagged video database are used to train a classifier, which then assigns genre labels (e.g. web, food, autos) to an unlabeled video database.]
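The paradigm on this slide is standard supervised classification. As a minimal illustration only (scikit-learn instead of the Weka toolchain used by the team, random placeholder descriptors standing in for the real video features):

```python
# Minimal sketch of tagging-by-classification: train on the labeled (tagged)
# video database, then predict genre labels for unlabeled videos.
import numpy as np
from sklearn.svm import LinearSVC

X_labeled = np.random.rand(100, 64)                         # descriptors of tagged videos
y_labels = np.random.choice(["web", "food", "autos"], 100)  # their genre tags

clf = LinearSVC().fit(X_labeled, y_labels)                  # train the classifier

X_unlabeled = np.random.rand(5, 64)                         # descriptors of unknown videos
print(clf.predict(X_unlabeled))                             # assigned genre tags
```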
  4. The approach: classification
     > The entire process relies on the concept of "similarity" computed between content annotations (numeric features).
     > This year the focus is on:
       - objective 1: go (truly) multimodal: visual + audio + text;
       - objective 2: test a broad range of classifiers and descriptor combinations.
  5. Video content description: audio
     Block-level audio features (computed on overlapping blocks of spectrum frames, e.g. 50% overlap, summarized with statistics such as the average, median and variance; they also capture local temporal information):
     • Spectral Pattern ~ soundtrack's timbre;
     • Delta Spectral Pattern ~ strength of onsets;
     • Variance Delta Spectral Pattern ~ variation of the onset strength;
     • Logarithmic Fluctuation Pattern ~ rhythmic aspects;
     • Correlation Pattern ~ loudness changes;
     • Spectral Contrast Pattern ~ "tone-ness";
     • Local Single Gaussian model ~ timbral;
     • George Tzanetakis model ~ timbral.
     [Klaus Seyerlehner et al., MIREX'11, USA]
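The actual descriptors are those of Seyerlehner et al.; purely to illustrate the underlying block-level idea (cut the spectrogram into 50%-overlapping blocks of frames and summarize them into one fixed-length vector), here is a hedged sketch assuming librosa and a hypothetical soundtrack file:

```python
# Sketch of the block-level idea only (not the actual block-level descriptors).
import numpy as np
import librosa

y, sr = librosa.load("video_soundtrack.wav", sr=22050)      # hypothetical file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)   # bands x frames

block_len, hop = 64, 32                                     # 50% overlap
blocks = [S[:, i:i + block_len]
          for i in range(0, S.shape[1] - block_len + 1, hop)]

# e.g. summarize the per-band energy of each block, then take the median
# over all blocks to obtain one fixed-length vector per soundtrack
block_stats = np.array([b.mean(axis=1) for b in blocks])
feature = np.median(block_stats, axis=0)
print(feature.shape)                                        # (40,)
```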
  6. Video content description: audio
     Standard audio features (audio frame-based):
     • Zero-Crossing Rate;
     • Linear Predictive Coefficients;
     • Line Spectral Pairs;
     • Mel-Frequency Cepstral Coefficients;
     • spectral centroid, flux, rolloff and kurtosis.
     The global feature is the mean and variance of each per-frame feature (f1, f2, …, fn) over a certain window.
     [B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]
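The team used the Yaafe toolbox; as an illustrative stand-in, the same frame-based-then-aggregate scheme can be sketched with librosa for a subset of the listed features (file name hypothetical):

```python
# Per-frame audio features, aggregated into a global mean + variance vector.
import numpy as np
import librosa

y, sr = librosa.load("video_soundtrack.wav", sr=22050)       # hypothetical file

zcr = librosa.feature.zero_crossing_rate(y)                  # (1, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # (13, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # (1, n_frames)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)       # (1, n_frames)

frames = np.vstack([zcr, mfcc, centroid, rolloff])           # features x frames
global_feature = np.hstack([frames.mean(axis=1), frames.var(axis=1)])
print(global_feature.shape)
```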
  7. Video content description: visual
     MPEG-7 & color/texture descriptors (visual frame-based):
     • Local Binary Pattern;
     • Autocorrelogram;
     • Color Coherence Vector;
     • Color Layout Pattern;
     • Edge Histogram;
     • classic color histogram;
     • Scalable Color Descriptor;
     • color moments.
     The global feature is the mean, dispersion, skewness, kurtosis, median and root mean square of each per-frame descriptor (f1, f2, …, fn) over time.
     [OpenCV toolbox, http://opencv.willowgarage.com]
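To illustrate the frame-based visual scheme, here is a sketch with OpenCV using only the classic color histogram; the other descriptors are aggregated over time in the same way (file name hypothetical, statistics as listed on the slide):

```python
# Per-frame color histogram, aggregated over the video with simple statistics.
import cv2
import numpy as np
from scipy.stats import skew, kurtosis

cap = cv2.VideoCapture("video.mp4")                  # hypothetical input video
per_frame = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])    # 8x8x8 color histogram
    per_frame.append(cv2.normalize(hist, None).flatten())
cap.release()

F = np.array(per_frame)                              # frames x histogram bins
video_feature = np.hstack([F.mean(0), F.std(0), skew(F, 0), kurtosis(F, 0),
                           np.median(F, 0), np.sqrt((F ** 2).mean(0))])
```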
  8. Video content description: visual
     Feature descriptors (visual frame-based):
     • Histogram of oriented Gradients (HoG) ~ counts occurrences of gradient orientations in localized portions of an image (20° per bin);
     • Harris corner detector ~ feature (interest) points;
     • Speeded Up Robust Features (SURF).
     [image source: http://www.ifp.illinois.edu/~yuhuang]
     [OpenCV toolbox, http://opencv.willowgarage.com]
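For illustration, two of these descriptors can be obtained from OpenCV as sketched below, on a single hypothetical keyframe (SURF lives in opencv-contrib and is omitted here):

```python
# HoG over a resized grayscale frame, plus a crude Harris interest-point count.
import cv2
import numpy as np

frame = cv2.imread("keyframe.jpg")                      # hypothetical keyframe
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

hog = cv2.HOGDescriptor()                               # default: 9 bins over 180°, i.e. 20° per bin
hog_vec = hog.compute(cv2.resize(gray, (64, 128)))      # gradient-orientation histogram

corners = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
n_corners = int((corners > 0.01 * corners.max()).sum()) # crude interest-point count

print(hog_vec.shape, n_corners)
```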
  9. Video content description: text
     TF-IDF descriptors (Term Frequency-Inverse Document Frequency); text sources: ASR transcripts and metadata.
     1. remove XML markup;
     2. remove terms below the 5%-percentile of the frequency distribution;
     3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR and 20 for metadata) with the highest χ² values that occur more frequently than in the complement classes;
     4. represent each document by its TF-IDF values.
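A rough scikit-learn sketch of this text pipeline (placeholder documents and labels; min_df is a crude stand-in for the percentile pruning, and the "more frequent than in complement classes" check is omitted):

```python
# TF-IDF with per-genre chi-squared term selection, roughly following the slide.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2

docs = ["asr transcript of video one ...", "metadata of video two ..."]  # placeholders
labels = np.array(["food", "autos"])
m = 150                                          # e.g. 150 terms per genre for ASR

counts = CountVectorizer(min_df=0.05).fit(docs)  # drop very rare terms
X = counts.transform(docs)

corpus_terms = set()
terms = np.array(counts.get_feature_names_out())
for genre in np.unique(labels):
    scores, _ = chi2(X, labels == genre)         # one-vs-rest chi2 per genre
    corpus_terms.update(terms[np.argsort(scores)[::-1][:m]])

tfidf = TfidfVectorizer(vocabulary=sorted(corpus_terms))
X_tfidf = tfidf.fit_transform(docs)              # final TF-IDF representation
```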
  10. Experimental results: devset (5,127 sequences)
      > classifiers from Weka (Bayes, lazy, functional, trees, etc.);
      > cross-validation (train 50% / test 50%), average F-score over all genres:
        - visual descriptors reach around 30% ± 10%;
        - using more visual descriptors is not more accurate than using a few;
        - best combination: LBP + CCV + histogram (F-score = 41.2%).
      [Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
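The experiments used Weka; an equivalent protocol sketched with scikit-learn (placeholder descriptors and labels) would look like this:

```python
# 50%/50% train/test evaluation of several classifier families, scored by the
# average (macro) F-score over genres.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

X = np.random.rand(500, 64)                              # placeholder descriptors
y = np.random.choice(["autos", "food", "religion"], 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for clf in (GaussianNB(), KNeighborsClassifier(), DecisionTreeClassifier(), LinearSVC()):
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(type(clf).__name__, f1_score(y_te, y_pred, average="macro"))
```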
  11. Experimental results: devset (5,127 sequences)
      > cross-validation (train 50% / test 50%), average F-score over all genres:
        - audio is still better than visual (improvement of ~6%);
        - the proposed block-based features are better than the standard ones (by ~10%).
      [Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
  12. Experimental results: devset (5,127 sequences)
      > cross-validation (train 50% / test 50%), average F-score over all genres:
        - ASR from LIMSI is more representative than LIUM (~3%);
        - best performance: ASR LIMSI + metadata (F-score = 68%).
      [Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
  13. Experimental results: devset (5,127 sequences)
      > cross-validation (train 50% / test 50%), average F-score over all genres:
        - audio-visual is close to text (ASR) among the automatic descriptors;
        - increasing the number of modalities increases the performance.
      [Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
  14. Experimental results: official runs (9,550 sequences)
      > train on devset, test on testset (linear SVM):
        - Run 1: LBP + CCV + histogram + audio block-based;
        - Run 2: TF-IDF on ASR LIMSI;
        - Run 3: audio block-based + LBP + CCV + histogram + TF-IDF on ASR LIMSI;
        - Run 4: audio block-based + metadata;
        - Run 5: TF-IDF on metadata + ASR LIMSI.
      [Bar chart of run scores; reference markers: MediaEval 2011 MAP 12% and MAP 10.3%]
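A hedged sketch of how a combined run such as Run 3 might be assembled: early fusion by concatenating the modality descriptors, then a linear SVM trained on the devset and applied to the testset (all matrices are placeholders, not the actual features):

```python
# Early fusion of audio, visual and text descriptors + linear SVM.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

n_dev, n_test = 5127, 9550
audio = np.random.rand(n_dev, 100)       # block-based audio descriptors
visual = np.random.rand(n_dev, 80)       # LBP + CCV + color histogram
text = np.random.rand(n_dev, 150)        # TF-IDF on ASR
y_dev = np.random.choice(["autos", "food", "religion"], n_dev)

X_dev = np.hstack([audio, visual, text]) # early fusion: concatenate modalities
model = make_pipeline(StandardScaler(), LinearSVC()).fit(X_dev, y_dev)

X_test = np.random.rand(n_test, X_dev.shape[1])
predictions = model.predict(X_test)      # genre labels for the testset
```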
  15. Experimental results: official runs (9,550 sequences)
      > per-genre MAP for Run 5 (TF-IDF on ASR + metadata) and Run 1 (visual + audio);
      [Per-genre bar chart with example keyframes; highlighted genres: autos 52%, gaming 71%, religion 71%, environment 50%]
  16. Conclusions and future work
      > classification adapts to the corpus: changing the corpus will change the performance;
      > audio-visual descriptors are inherently limited;
      > how far can we go with ad-hoc classification without human intervention?
      > future work:
        - a more elaborate late-fusion scheme?
        - pursue tests on the entire data set;
        - perhaps a more elaborate Bag-of-Visual-Words.
      Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.
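As a starting point for the "more elaborate late fusion" item, a simple score-level fusion baseline can be sketched as follows: one classifier per modality, fused by averaging their predicted class probabilities (placeholder data, not the ARF system):

```python
# Late fusion baseline: average per-modality class probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

y_train = np.random.choice([0, 1, 2], size=200)                     # genre ids
modalities_train = [np.random.rand(200, d) for d in (60, 80, 150)]  # audio / visual / text
modalities_test = [np.random.rand(50, d) for d in (60, 80, 150)]

models = [LogisticRegression(max_iter=1000).fit(X, y_train)
          for X in modalities_train]
probs = np.mean([m.predict_proba(X) for m, X in zip(models, modalities_test)],
                axis=0)                                              # fused scores
fused_labels = probs.argmax(axis=1)
```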
  17. Thank you! Any questions?
