ARF @ MediaEval 2012: Multimodal Video Classification

~ Multimodal Video Classification ~

ARF (Austria-Romania-France) team

Bogdan IONESCU*1,3 Ionuț MIRONICĂ1 Klaus SEYERLEHNER2
bionescu@imag.pub.ro imironica@imag.pub.ro music@cp.jku.at

Peter KNEES2 Jan SCHLÜTER4 Markus SCHEDL2
peter.knees@jku.at jan.schlueter@ofai.at markus.schedl@jku.at

Horia CUCU1 Andi BUZO1 Patrick LAMBERT3
horia.cucu@upb.ro andi.buzo@upb.ro patrick.lambert@univ-savoie.fr

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
1 University POLITEHNICA of Bucharest
4 Austrian Research Institute for Artificial Intelligence

Presentation outline

• The approach

• Video content description

• Experimental results

• Conclusions and future work

MediaEval - Pisa, Italy, 4-5 October 2012

The approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm;

[Diagram: labeled data (genre tags such as web, food, autos, …) is used to train a classifier, which then labels the unlabeled video database to produce a tagged video database.]
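A minimal sketch of the train-then-tag loop in the diagram, assuming a 1-nearest-neighbour rule with Euclidean distance. This classifier, the two-dimensional feature vectors, and the names `tag_video`, `labeled` and `unlabeled` are illustrative assumptions, not the team's actual setup.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tag_video(features, labeled):
    """Return the genre of the closest labeled example (1-NN, an assumption)."""
    genre, _ = min(
        ((g, euclidean(features, vec)) for vec, g in labeled),
        key=lambda pair: pair[1],
    )
    return genre

# Hypothetical labeled data: (feature vector, genre tag).
labeled = [([0.9, 0.1], "food"), ([0.1, 0.8], "autos"), ([0.5, 0.5], "web")]

# Hypothetical unlabeled videos to be tagged.
unlabeled = {"video_a": [0.85, 0.2], "video_b": [0.2, 0.9]}

tags = {name: tag_video(vec, labeled) for name, vec in unlabeled.items()}
print(tags)
```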

The approach: classification
> the entire process relies on the concept of “similarity” computed
between content annotations (numeric features),

> this year's focus is on:

objective 1: go (truly) multimodal: visual + audio + text;

objective 2: test a broad range of classifiers and descriptor combinations.
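One simple way to combine modalities is to concatenate their descriptors into a single vector before classification. The sketch below assumes this kind of early fusion with per-modality L2 normalization; the slides do not state how descriptors were combined, and `fuse` and the toy vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(*modalities):
    """Concatenate normalized per-modality descriptors (early fusion, an assumption)."""
    fused = []
    for vec in modalities:
        fused.extend(l2_normalize(vec))
    return fused

# Hypothetical descriptors for one video.
visual = [3.0, 4.0]
audio = [0.0, 2.0]
text = [1.0, 0.0, 0.0]

fused = fuse(visual, audio, text)
print(fused)
```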


Video content description - audio
> block-level audio features
(capture also local temporal information)
[Diagram: the signal is processed in blocks of frames, e.g. 50% overlapping; blocks are summarized by average, median, variance, ...]

• Spectral Pattern ~ soundtrack’s timbre;
• delta Spectral Pattern ~ strength of onsets;
• variance delta Spectral Pattern ~ variation of the onset strength;
• Logarithmic Fluctuation Pattern ~ rhythmic aspects;
• Correlation Pattern ~ loudness changes;
• Spectral Contrast Pattern ~ “toneness”;
• Local Single Gaussian model ~ timbral;
• George Tzanetakis model ~ timbral;

[Klaus Seyerlehner et al., MIREX’11, USA]
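The block-processing idea can be sketched as follows, assuming one scalar value per audio frame for brevity (the actual block-level features are spectral patterns, not scalars). The helper names `make_blocks` and `summarize` are hypothetical.

```python
import statistics

def make_blocks(frames, block_size, overlap=0.5):
    """Split a frame sequence into blocks; overlap=0.5 gives 50% overlapping blocks."""
    hop = max(1, int(block_size * (1 - overlap)))
    return [frames[i:i + block_size]
            for i in range(0, len(frames) - block_size + 1, hop)]

def summarize(block_values, how="median"):
    """Collapse per-block values into one global value."""
    functions = {
        "average": statistics.fmean,
        "median": statistics.median,
        "variance": statistics.pvariance,
    }
    return functions[how](block_values)

# Hypothetical per-frame values (one scalar per frame).
frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

frame_blocks = make_blocks(frames, block_size=4)   # hop of 2 frames
block_means = [statistics.fmean(b) for b in frame_blocks]
global_feature = summarize(block_means, "median")
print(frame_blocks, block_means, global_feature)
```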

Video content description - audio
> standard audio features
(audio frame-based)

• Zero-Crossing Rate,
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• spectral centroid, flux, rolloff, and kurtosis,
+ mean & variance of each feature over a certain window.

[Diagram: per-frame features f1, f2, …, fn over time; global feature = mean and variance of each feature]

[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]
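As an illustration of a frame-based feature followed by window statistics, here is a sketch of the Zero-Crossing Rate with its mean and variance over a set of frames. This is not the Yaafe implementation; the frame length, the toy signal and the function names are assumptions.

```python
import statistics

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def global_feature(per_frame_values):
    """Mean and (population) variance of a frame-level feature over a window."""
    return statistics.fmean(per_frame_values), statistics.pvariance(per_frame_values)

# Hypothetical signal cut into two non-overlapping frames of 4 samples.
signal = [1, -1, 1, -1, 1, 1, 1, 1]
frames = [signal[0:4], signal[4:8]]

zcr = [zero_crossing_rate(f) for f in frames]
mean, variance = global_feature(zcr)
print(zcr, mean, variance)
```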

Video content description - visual
> MPEG-7 & color/texture descriptors
(visual frame-based)

• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Classic color histogram,
• Scalable Color Descriptor,
• Color moments.

[Diagram: per-frame features f1, f2, …, fn over time; global feature = mean & dispersion & skewness & kurtosis & median & root mean square]

[OpenCV toolbox, http://opencv.willowgarage.com]
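The temporal aggregation into a global feature can be sketched as below. It assumes "dispersion" means the standard deviation and uses population (non-bias-corrected) moments; both choices, and the function names, are assumptions rather than the team's stated definitions.

```python
import math
import statistics

def temporal_stats(values):
    """Mean, std, skewness, kurtosis, median and RMS of one feature over time."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std > 0:
        skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
        kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    else:
        skew, kurt = 0.0, 0.0  # constant feature: higher moments set to zero
    rms = math.sqrt(sum(v * v for v in values) / n)
    return [mean, std, skew, kurt, statistics.median(values), rms]

def global_descriptor(frame_features):
    """Aggregate a list of per-frame vectors into one fixed-length vector."""
    descriptor = []
    for dimension in zip(*frame_features):
        descriptor.extend(temporal_stats(list(dimension)))
    return descriptor

# Hypothetical per-frame descriptors (3 frames, 2 dimensions each).
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
desc = global_descriptor(frames)
print(desc)
```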

Video content description - visual
> feature descriptors
(visual frame-based)

• Histogram of oriented Gradients (HoG)
~ counts occurrences of gradient orientation in localized portions of an image (20° per bin)

• Harris corner detector

• Speeded Up Robust Features (SURF)

[Figure: feature points (e.g. Harris); image source http://www.ifp.illinois.edu/~yuhuang]

[OpenCV toolbox, http://opencv.willowgarage.com]
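The orientation-counting step of HoG can be sketched for a single image cell as follows. This assumes unsigned orientations in [0°, 180°), which gives 9 bins at 20° per bin, and central-difference gradients; the slide only states the bin width, so the remaining choices and the function name are assumptions. It is not the OpenCV implementation.

```python
import math

def orientation_histogram(cell, bin_width_deg=20):
    """Count gradient orientations of interior pixels of a grayscale cell."""
    n_bins = 180 // bin_width_deg
    hist = [0] * n_bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical central difference
            if gx == 0 and gy == 0:
                continue                            # flat region: no orientation
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width_deg) % n_bins] += 1
    return hist

# Hypothetical 4x4 cell containing a vertical edge.
cell = [[0, 0, 10, 10] for _ in range(4)]
hist = orientation_histogram(cell)
print(hist)
```

A full HoG descriptor typically also weights votes by gradient magnitude and normalizes over blocks of cells; those steps are omitted here.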

Video content description - text
> TF-IDF descriptors
(Term Frequency-Inverse Document Frequency)

> text sources: ASR and metadata,

1. remove XML markup,

2. remove terms below the 5th percentile of the frequency distribution,

3. select the term corpus: for each genre class, retain the m terms (e.g. m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently in that class than in the complement classes,

4. represent each document by the TF-IDF values of the selected terms.
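Steps 3 and 4 can be sketched as below, assuming a 2x2 χ2 statistic on term presence per genre, document frequency as the "occurs more frequently" criterion, and TF-IDF computed as relative term frequency times log(N/df). These exact definitions, the toy documents and all function names are assumptions; the percentile filter of step 2 is omitted.

```python
import math
from collections import Counter

def chi2(term, genre, docs):
    """2x2 chi-squared statistic of term presence versus genre membership."""
    a = sum(1 for tokens, g in docs if term in tokens and g == genre)
    b = sum(1 for tokens, g in docs if term in tokens and g != genre)
    c = sum(1 for tokens, g in docs if term not in tokens and g == genre)
    d = sum(1 for tokens, g in docs if term not in tokens and g != genre)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom if denom else 0.0

def doc_rate(term, token_lists):
    """Fraction of documents in token_lists that contain the term."""
    if not token_lists:
        return 0.0
    return sum(term in tokens for tokens in token_lists) / len(token_lists)

def select_terms(docs, m):
    """Per genre, keep the m highest-chi2 terms that are more frequent inside it."""
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    selected = set()
    for genre in sorted({g for _, g in docs}):
        inside = [tokens for tokens, g in docs if g == genre]
        outside = [tokens for tokens, g in docs if g != genre]
        candidates = [t for t in vocab
                      if doc_rate(t, inside) > doc_rate(t, outside)]
        candidates.sort(key=lambda t: -chi2(t, genre, docs))
        selected.update(candidates[:m])
    return sorted(selected)

def tf_idf(tokens, corpus, docs):
    """TF-IDF vector of one document over the selected term corpus."""
    counts = Counter(tokens)
    vector = []
    for term in corpus:
        df = sum(1 for doc_tokens, _ in docs if term in doc_tokens)
        idf = math.log(len(docs) / df) if df else 0.0
        tf = counts[term] / len(tokens) if tokens else 0.0
        vector.append(tf * idf)
    return vector

# Hypothetical tokenized documents with genre labels.
docs = [
    (["engine", "car", "road"], "autos"),
    (["car", "wheel", "the"], "autos"),
    (["recipe", "oven", "the"], "food"),
    (["recipe", "salt", "road"], "food"),
]

corpus = select_terms(docs, m=1)
vectors = [tf_idf(tokens, corpus, docs) for tokens, _ in docs]
print(corpus, vectors)
```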

Experimental results: devset (5,127 seq.)
> classifiers from Weka (Bayes, lazy, functional, trees, etc.),
> cross-validation (train 50% – test 50%),
[Plot: avg. Fscore (over all genres)]

- visual descriptors alone reach an Fscore of 30%±10%,
- using more visual descriptors is not more accurate than using a few,
- best: LBP+CCV+histogram (Fscore=41.2%).
[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
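The reported measure, the Fscore averaged over all genres, can be sketched as below. It assumes the balanced F1 score and an unweighted mean over genres; the slides do not spell out the exact averaging, and the toy labels are hypothetical.

```python
def f_score(true, pred, genre):
    """F1 score (harmonic mean of precision and recall) for one genre."""
    tp = sum(1 for t, p in zip(true, pred) if t == genre and p == genre)
    fp = sum(1 for t, p in zip(true, pred) if t != genre and p == genre)
    fn = sum(1 for t, p in zip(true, pred) if t == genre and p != genre)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f_score(true, pred):
    """Unweighted mean of the per-genre F1 scores."""
    genres = sorted(set(true))
    return sum(f_score(true, pred, g) for g in genres) / len(genres)

# Hypothetical ground truth and predictions on a held-out half of the data.
true = ["autos", "autos", "food", "food"]
pred = ["autos", "food", "food", "food"]

avg = average_f_score(true, pred)
print(round(avg, 4))
```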




- audio is still better than visual (improvement of ~6%),

- the proposed block-based features are better than the standard ones (by ~10%).



- ASR from LIMSI is more representative than ASR from LIUM (~3%),

- best performance: ASR LIMSI + metadata (Fscore=68%).



- audio-visual close to text (ASR) for the automatic descriptors,

- increasing the number of modalities increases the performance.



Experimental results: official runs (9,550 seq.)
> train on devset, test on testset (SVM linear),

[Chart: results per run, annotated with "MediaEval 2011 MAP 12%" and "MediaEval 2011 MAP 10.3%"]

Run1: LBP + CCV + hist + audio block-based
Run2: TF-IDF on ASR LIMSI
Run3: audio block-based + LBP + CCV + hist + TF-IDF on ASR LIMSI
Run4: audio block-based
Run5: TF-IDF on metadata + ASR LIMSI
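MAP (Mean Average Precision) can be sketched as below, assuming each genre yields a ranked list of videos with a relevance flag and that every relevant video appears somewhere in the ranking. The function names and the toy rankings are hypothetical; this is not the official evaluation script.

```python
def average_precision(ranked_relevance):
    """Mean of precision@k taken at each rank k where a relevant item occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings_per_genre):
    """Unweighted mean of the per-genre average precision values."""
    values = [average_precision(r) for r in rankings_per_genre.values()]
    return sum(values) / len(values)

# Hypothetical rankings: True marks a video that truly belongs to the genre.
rankings = {
    "autos": [True, False, True],
    "food": [False, True],
}

map_score = mean_average_precision(rankings)
print(round(map_score, 4))
```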

Experimental results: official runs (9,550 seq.)
> genre MAP for Run 5 (TF-IDF on ASR + metadata) and Run 1 (visual + audio),
[Chart: per-genre MAP; values called out: autos 52%, gaming 71%, religion 71%, environment 50%]

Conclusions and future work
> classification adapts to the corpus: changing the corpus will change the performance;
> audio-visual descriptors are inherently limited;
> how far can we go with ad-hoc classification without human intervention?

> future work:
- more elaborate late fusion?
- pursue tests on the entire data set;
- perhaps a more elaborate Bag-of-Visual-Words.

Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.

Thank you!
Any questions?
