This document summarizes the results of five systems for detecting violent scenes in videos, submitted by Technicolor, INRIA, and Imperial College London to the MediaEval 2012 Violent Scene Detection Task. System 1 used similarity measures between frames, System 2 used bag-of-audio-words modeling, System 3 used Bayesian network structure learning, System 4 used a naive Bayesian classifier, and System 5 fused the outputs of Systems 2, 3, and 4. System 3 performed best with a MAP of 61.82%, while the fusion system was the 4th best run overall. The document concludes with perspectives on improving the different approaches.
Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene Detection Task
1. Technicolor / INRIA / Imperial College London at the MediaEval 2012 Violent Scene Detection Task
PENET Cédric – Technicolor, INRIA
DEMARTY Claire-Hélène – Technicolor
SOLEYMANI Mohammad – Imperial College London
GRAVIER Guillaume – CNRS, IRISA
GROS Patrick – INRIA
MediaEval 2012 Pisa Workshop
October 4th, 2012
2. Outline
Introduction
Systems description
Results and conclusion
3. Outline
Introduction
Systems description
Results and conclusion
4. Introduction
Joint effort between Technicolor / INRIA / Imperial College London
5 runs, 5 different systems
Re-use of last year's systems with a few differences
Bayesian network structure learning (Technicolor/INRIA)
Naive Bayesian classifier (ICL)
Two new systems from Technicolor/INRIA
Exploiting similarity
Bag-of-Audio words
Fusion of three systems (Technicolor/INRIA – ICL)
5. Outline
Introduction
Systems description
Results and conclusion
6. Run 1: Exploiting Similarity
Idea: can we get the same results as last year using only similarity measures?
Video features for each frame
Motion activity
Three color harmonisation features: harmonisation template, angle and energy
Decision: KNN using only closest neighbour
10 movies used to populate the KNN
Test frames labelled according to their closest neighbour
If one frame of a shot is labelled violent, the whole shot is labelled violent (a sketch of this rule follows)
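A minimal sketch of this decision rule, assuming the per-frame features (motion activity plus the three colour-harmonisation features) have already been extracted into arrays; the function and variable names are hypothetical, not the actual Technicolor/INRIA code:

```python
# 1-NN frame labelling with violent-shot aggregation (illustrative only).
from sklearn.neighbors import KNeighborsClassifier

def label_shots(dev_feats, dev_labels, test_feats, test_shot_ids):
    """dev_feats/test_feats: one feature row per frame;
    test_shot_ids: the shot index of each test frame."""
    knn = KNeighborsClassifier(n_neighbors=1)  # only the closest neighbour
    knn.fit(dev_feats, dev_labels)
    frame_labels = knn.predict(test_feats)
    # A shot is labelled violent as soon as one of its frames is violent.
    shot_labels = {}
    for shot, lab in zip(test_shot_ids, frame_labels):
        shot_labels[shot] = shot_labels.get(shot, 0) or int(lab)
    return shot_labels
```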
7. Run 2: Bag-of-Audio words
Audio feature extraction
Extraction of MFCC audio features (with Δ and ΔΔ) - 20 ms windows, 10 ms overlap
Extraction of silence segments with SPro
Extraction of coherent audio segments - André-Obrecht 1988
K-Means on non-silent audio segments for vocabulary (of size 128)
Each audio segment replaced by closest centroid
Construction of TF-IDF histograms
Each shot is a document
Classification using SVM
χ² and histogram intersection kernels
Weighting applied to the SVM cost parameter (to handle class imbalance)
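A hedged sketch of this bag-of-audio-words pipeline, assuming segment-level MFCC vectors are already extracted; scikit-learn's balanced class weighting stands in for whatever SVM weighting the run actually used, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def boaw_train(segment_feats, shot_of_segment, n_shots, shot_labels):
    # 128-word vocabulary learned on non-silent audio segments.
    vocab = KMeans(n_clusters=128, n_init=10).fit(segment_feats)
    words = vocab.predict(segment_feats)
    # Each shot is a document: count the audio words falling in it.
    counts = np.zeros((n_shots, 128))
    for shot, w in zip(shot_of_segment, words):
        counts[shot, w] += 1
    hists = TfidfTransformer().fit_transform(counts).toarray()
    # Chi-squared kernel SVM (a histogram-intersection kernel is the other
    # option above); at test time use chi2_kernel(test_hists, hists).
    gram = chi2_kernel(hists)
    svm = SVC(kernel="precomputed", class_weight="balanced").fit(gram, shot_labels)
    return vocab, hists, svm
```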
8. Run 3: Bayesian network structure learning
Re-use of Technicolor's system from last year with additional features
Audio features: energy, asymmetry, centroid, ZCR, flatness and roll-off at 90%
Video features: shot length, flashes, blood, activity, color coherence, average luminance, fire and color harmonisation features
Features are averaged over a video shot
Graphical model for modeling conditional probability distributions along with contextual features and temporal smoothing
Naive Bayesian network (NB)
Bayesian network example
Graph structure learning (see the sketch below)
Forest augmented naive Bayesian network (FAN)
K2
Late fusion of modalities using a simple rule
Source: https://controls.engin.umich.edu/wiki/index.php/Bayesian_network_theory
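As an illustration of score-based structure learning of this kind, here is a minimal sketch using the pgmpy library, assuming its HillClimbSearch/K2Score API; hill climbing guided by the K2 score stands in for the K2 algorithm used in the actual run, and the discretised per-shot feature columns (including a "violent" target) are hypothetical:

```python
# Learn a Bayesian network structure over discretised per-shot features,
# then classify shots by inference (a sketch under the stated assumptions).
from pgmpy.estimators import HillClimbSearch, K2Score
from pgmpy.models import BayesianNetwork

def learn_and_classify(train_df, test_df, target="violent"):
    # Score-based structure search: hill climbing with the K2 score.
    dag = HillClimbSearch(train_df).estimate(scoring_method=K2Score(train_df))
    bn = BayesianNetwork(dag.edges())
    bn.fit(train_df)  # maximum-likelihood estimation of the CPDs
    # Predict the missing "violent" column for the test shots.
    return bn.predict(test_df.drop(columns=[target]))
```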
9. Run 4: Naïve Bayesian classifier
Audio modality
Classical low level features extracted from non-silent segments
RMS energy, pitch, MFCC, ZCR, spectral flux, spectral roll-off
Averaged over shots
Video modality
Shot duration, luminance, average activity, motion component
Averaged over shots
Text features
Simple features such as the number of spoken words and the average valence and arousal per shot (from the Dictionary of Affect in Language)
The results were poor, so we decided not to include them in the final submission
A Naïve Bayesian classifier on each modality
Modality fusion using a weighted sum of posterior probabilities: 0.95 × audio score + 0.05 × visual score (see the sketch below)
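A toy version of this per-modality classification and late fusion, assuming shot-averaged audio and video feature matrices; GaussianNB is one plausible naive Bayes variant, not necessarily the one used in the run:

```python
from sklearn.naive_bayes import GaussianNB

def fused_scores(Xa_train, Xv_train, y_train, Xa_test, Xv_test):
    audio_nb = GaussianNB().fit(Xa_train, y_train)
    video_nb = GaussianNB().fit(Xv_train, y_train)
    p_audio = audio_nb.predict_proba(Xa_test)[:, 1]  # P(violent | audio)
    p_video = video_nb.predict_proba(Xv_test)[:, 1]  # P(violent | video)
    # Weighted-sum fusion with the weights from the slide.
    return 0.95 * p_audio + 0.05 * p_video
```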
10. Run 5: Systems fusion
Simple fusion of three systems
Run 2: Bag-of-Audio words
Run 3: Bayesian network structure learning
Run 4: Naive Bayesian classifier
Fusion by multiplication of probabilities (see the sketch below)
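In code, this multiplicative late fusion is just an element-wise product of the three systems' per-shot probabilities (assuming aligned score lists in [0, 1]):

```python
def product_fusion(p_boaw, p_bnsl, p_nbn):
    # Shots only score high when all three systems agree.
    return [a * b * c for a, b, c in zip(p_boaw, p_bnsl, p_nbn)]
```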
11. Outline
Introduction
Systems description
Results and conclusions
12. Results
N°  Technique    MAP@100 (%)   AP-1 (%)   AP-2 (%)   AP-3 (%)   STD (%)   MediaEval Cost
1   Similarity      13.89        0.00      12.91      28.77      14.41        2.29
2   BoAW            40.54       10.85      52.98      57.77      25.82        2.50
3   BN-SL           61.82       60.56      53.15      71.76       9.37        3.57
4   NBN             46.27       40.03      22.97      75.82      26.97        3.64
5   Fusion          57.47       64.52      37.21      70.69      17.82        4.60
Average Precision (AP) for Dead Poets Society (AP-1), Fight Club (AP-2) and Independence Day (AP-3)
STD: standard deviation over the three test movies
High variation between movies
Best results on Independence Day (similar to the development movie Armageddon)
More movies would be needed to compute a reliable MAP (an AP@100 sketch follows)
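For reference, one common way to compute the average precision at 100 underlying the MAP@100 column (the exact MediaEval evaluation protocol may differ in details such as the normalisation):

```python
def ap_at_k(ranked_relevance, k=100):
    # ranked_relevance: 1/0 ground-truth labels of the top-ranked shots.
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / i
    return precision_sum / max(hits, 1)  # 0.0 if nothing relevant retrieved
```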
13. Conclusion & perspectives
Similarity search
MAP is poor, but the MediaEval cost is one of the best (6th out of 35)
Adding features and merging decisions from different KNNs might improve the results
Fusion
4th best run overall (out of 35)
Results not as good as expected
Improves precision at the cost of recall (false alarms reduced by a factor of two)
Test smarter fusion techniques
Bayesian Networks – Structure Learning
3rd best run overall (out of 35)
Very low standard deviation over three movies
Bayesian networks for intermediate concepts
14. Conclusion & perspectives
Bag-of-Audio words
MAP is not bad (11th out of 35)
False alarms and missed detections are pretty low too
Simple techniques proved effective - more investigation needed
Naive Bayesian classifier
A simple classifier with audio features can achieve moderately good results (10th out of 35)
Text features don’t work
Use a classifier that can learn temporal dynamics