This paper describes the participation of the TUB-IRML group in the MediaEval 2014 Violent Scenes Detection (VSD) affect task. We employ low- and mid-level audio-visual features fused at the decision level. We partition the feature space of training samples through k-means clustering and train a separate model for each cluster. These models are then used to predict the violence level of videos by employing two-class support vector machines (SVMs) and a classifier selection approach. The experimental results obtained on Hollywood movies and short Web videos show the superiority of mid-level audio features over visual features in terms of discriminative power, and a further performance gain from fusing audio-visual cues at the decision level. Finally, the results also demonstrate that partitioning the feature space and training multiple models outperforms a unique violence detection model.
http://ceur-ws.org/Vol-1263/mediaeval2014_submission_68.pdf
1. Competence Center Information Retrieval & Machine Learning
TUB-IRML at MediaEval 2014 Violent Scenes Detection Task: Violence Modeling through Feature Space Partitioning
Esra Acar, Sahin Albayrak
2. Outline
►The Violence Detection Method
Video Representation
Violence Detection Model
►Results & Discussion
►Conclusions & Future Work
3. The Violence Detection Method
►The two main components of our method are:
(1) the representation of video segments, and
(2) the learning of a violence model.
4. Video Representation (1)
The generation process of sparse-coding-based audio and visual representations for video segments.
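As a rough illustration of this encoding step, the Python sketch below sparse-codes the per-frame MFCCs of a segment's audio track against a learned dictionary and max-pools the activations into one mid-level vector. The function name `encode_segment`, the OMP solver and the sparsity level are our assumptions, not the exact settings from the paper.

```python
import numpy as np
import librosa
from sklearn.decomposition import SparseCoder

def encode_segment(wav_path, dictionary, n_mfcc=20):
    """Sparse-code the MFCC frames of one video segment's audio track."""
    y, sr = librosa.load(wav_path, sr=None)
    # Per-frame MFCCs, transposed to (n_frames, n_mfcc).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    # Encode each frame against the dictionary (n_atoms, n_mfcc) with
    # orthogonal matching pursuit; the sparsity level is a placeholder.
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=5)
    codes = coder.transform(mfcc)          # (n_frames, n_atoms)
    # Max-pool absolute activations over frames -> one mid-level vector.
    return np.abs(codes).max(axis=0)
```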
5. Video Representation (2)
The generation of audio and visual dictionaries with sparse coding.
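The dictionaries themselves can be learned offline from pooled training frames; below is a minimal sketch using scikit-learn's MiniBatchDictionaryLearning, where the atom count and sparsity penalty are placeholder values rather than the paper's settings.

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_dictionary(frame_features, n_atoms=128):
    """Learn a sparse-coding dictionary from stacked low-level descriptors.

    frame_features: array of shape (n_frames_total, n_dims), e.g. MFCCs
    for the audio dictionary or HoG/HoF descriptors for the visual one.
    """
    dl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                     batch_size=256, random_state=0)
    dl.fit(frame_features)
    return dl.components_  # (n_atoms, n_dims): one atom per row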
6. Video Representation (3)
► In addition to the mid-level audio and visual representations, we use the following low-level features:
Motion-related descriptors – Violent Flows (ViF), a descriptor proposed for real-time detection of violent crowd behavior, and
Static content representations – affect-related color descriptors such as statistics on saturation, brightness and hue in the HSL color space, and colorfulness (a sketch of these color statistics follows below).
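The slides do not spell out the exact statistics, so the sketch below computes one plausible variant per frame: moments of hue, lightness and saturation via OpenCV's HLS conversion, plus the Hasler & Süsstrunk colorfulness measure, which we assume is the colorfulness meant here.

```python
import cv2
import numpy as np

def color_affect_features(frame_bgr):
    """Affect-related color statistics for a single video frame (uint8 BGR)."""
    # OpenCV's HLS channel ordering is (hue, lightness, saturation).
    hls = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS).astype(np.float32)
    h, l, s = hls[..., 0], hls[..., 1], hls[..., 2]
    # Colorfulness (Hasler & Suesstrunk, 2003) from opponent color channels.
    b, g, r = cv2.split(frame_bgr.astype(np.float32))
    rg, yb = r - g, 0.5 * (r + g) - b
    colorfulness = (np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                    + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))
    # Mean/std of hue, lightness (brightness) and saturation + colorfulness.
    return np.array([h.mean(), h.std(), l.mean(), l.std(),
                     s.mean(), s.std(), colorfulness])
```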
7. Violence Detection Model
► Violence is a concept that can be expressed audio-visually in diverse manners.
► We therefore learn multiple models for the violence concept instead of a single one:
we partition the feature space by clustering the video segments of the training dataset, and
learn a separate model for each violence sub-concept.
► We perform classifier selection to solve the classifier combination issue (see the sketch below).
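A minimal sketch of this partition-then-select scheme, assuming each k-means cluster contains both violent and non-violent samples; the cluster count, the RBF kernel and the nearest-centroid selection rule are our placeholder choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_partitioned_models(X, y, n_clusters=4):
    """Cluster the training feature space and fit one SVM per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    models = []
    for c in range(n_clusters):
        mask = km.labels_ == c
        # Two-class SVM (violent vs. non-violent) for this sub-concept.
        models.append(SVC(kernel="rbf", probability=True).fit(X[mask], y[mask]))
    return km, models

def violence_score(x, km, models):
    """Classifier selection: apply the model of the nearest cluster centroid."""
    c = km.predict(x.reshape(1, -1))[0]
    return models[c].predict_proba(x.reshape(1, -1))[0, 1]
```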
8. Results & Discussion (1)
Method                   MAP2014 (Movies)   MAP@100 (Movies)   MAP2014 (Web videos)   MAP@100 (Web videos)
Run1                     0.169              0.368              0.517                  0.582
Run2                     0.139              0.284              0.371                  0.478
Run3                     0.080              0.208              0.477                  0.495
Run4                     0.172              0.409              0.489                  0.586
Run5                     0.170              0.406              0.479                  0.567
SVM-based unique model   0.093              0.302              -                      -

Run1: MFCC-based mid-level audio representations
Run2: HoG- and HoF-based mid-level features and ViF
Run3: Affect-related color features
Run4: Audio and visual features (except color)
Run5: All audio-visual representations linearly fused at the decision level

Table: The MAP2014 and MAP@100 scores of our method with different representations.
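Run5's linear decision-level fusion can be pictured as a weighted average of per-modality violence scores; the sketch below assumes uniform weights, since the slides do not give the actual coefficients.

```python
import numpy as np

def fuse_decisions(score_matrix, weights=None):
    """Late fusion: weighted linear combination of per-modality scores.

    score_matrix: (n_modalities, n_segments) violence scores in [0, 1].
    """
    scores = np.asarray(score_matrix, dtype=float)
    if weights is None:
        # Uniform weights are an assumption; real weights would be tuned.
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    return np.average(scores, axis=0, weights=weights)
```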
9. Results & Discussion (2)
► The mid-level audio representation (Run1) provides promising performance and outperforms all other representations (Runs 2 & 3).
► The performance is further improved by decision-level fusion (Run4).
► Affect-related color features do NOT help much (Run5).
► The results on the Web video dataset are superior (i.e., our method generalizes well).
► Affect-related color features seem to provide better results on the Web video dataset (Run3).
► Our method outperforms the SVM-based unique model.
10. Conclusions & Future Work
► The mid-level audio representation based on MFCC and sparse coding
provides promising performance in terms of the MAP2014 and MAP@100 metrics, and
also outperforms our visual representations.
► As future work, we plan to
extend and improve our visual representation set, and
further investigate the feature space partitioning concept.
11. Competence Center Information Retrieval & Machine Learning
www.dai-labor.de
Phone: +49 (0) 30 / 314 – 74 013
Fax: +49 (0) 30 / 314 – 74 003
DAI-Labor
Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14
Ernst-Reuter-Platz 7
10587 Berlin, Germany
Esra Acar, M.Sc.
Researcher
esra.acar@tu-berlin.de
Thanks!