ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies
1. An Uninformed Approach to Violence Detection in Hollywood Movies
ARF (Austria-Romania-France) team
Jan SCHLÜTER+,1 (jan.schlueter@ofai.at)
Bogdan IONESCU*,2,4 (bionescu@imag.pub.ro)
Ionuț MIRONICĂ2 (imironica@imag.pub.ro)
Markus SCHEDL3 (markus.schedl@jku.at)
+ this work was supported by the Austrian Science Fund (FWF) under project no. Z159.
* this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
1 Austrian Research Institute for Artificial Intelligence
2 University POLITEHNICA of Bucharest
3, 4 (names not given in the slide text)
2. Presentation outline
• The approach
• Video content description & classification
• Experimental results
• Conclusions and future work
MediaEval - Pisa, Italy, 4-5 October 2012
3. The approach
> challenge: find a way to tag violence in movies;
> what approach?
• different correlations between violence and concepts;
• high variability in appearance of violent scenes from movie to movie;
• training a classifier on ground truth to directly predict the violent frames is questionable.
[Figure: correlation matrix between violence and concepts (on ground truth), e.g. for the movies Harry Potter, Armageddon, Kill Bill and The Wicker Man; color scale from high to low]
4. The approach: machine learning
> approach: training
low-level features -> mid-level prediction -> predicting violence
• frame-level descriptors are extracted from the movies & ground truth (annotations);
• one predictor per concept (blood, …, fire, …, screams) is trained; its predictions are real values;
• the violence predictor is trained & optimized to output violence yes/no (+ score).
5. The approach: machine learning
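> approach: testing
low-level features -> mid-level prediction -> predicting violence
• frame-level descriptors are extracted from the unseen movie;
• the concept predictors (blood, …, fire, …, screams) output their predictions;
• the violence predictor outputs violence yes/no (+ score).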
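The two-stage flow above can be sketched as follows. This is only an illustration, assuming toy linear predictors with hand-picked weights in place of the trained models; the names `linear_predictor` and `predict_violence`, the weights, and the 0.5 threshold are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_predictor(weights, bias):
    """Return a toy predictor mapping a feature vector to a score in (0, 1)."""
    def predict(features):
        return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    return predict

def predict_violence(descriptor, concept_predictors, violence_predictor, threshold=0.5):
    """Stage 1: concept scores. Stage 2: violence score from descriptor + concept scores."""
    concept_scores = [p(descriptor) for p in concept_predictors]
    score = violence_predictor(list(descriptor) + concept_scores)
    return score >= threshold, score

# Hypothetical weights standing in for trained models (not from the slides).
concepts = [linear_predictor([2.0, 0.0], -1.0),   # e.g. "blood"
            linear_predictor([0.0, 2.0], -1.0)]   # e.g. "screams"
violence = linear_predictor([0.0, 0.0, 3.0, 3.0], -3.0)

is_violent, score = predict_violence([1.5, 1.5], concepts, violence)
```

The violence predictor sees the raw descriptor concatenated with the real-valued concept scores, which mirrors the "descriptors + mid-level predictions" input named later in the deck.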
6. Video content description - audio
standard audio features (frame-level):
• Zero-Crossing Rate,
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• spectral centroid, flux, rolloff, and kurtosis,
+ mean & variance of each feature over a certain window.
[Figure: frame-level features f1, f2, …, fn over time are summarized into a global feature by their mean and variance (var{f2}, …, var{fn})]
[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]
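The mean-and-variance summary over a window can be illustrated with a minimal sketch. The window length, the hop size, and the function name `window_statistics` are assumptions, since the slide only says "a certain window"; the frame-level features themselves would come from a toolbox such as Yaafe.

```python
from statistics import fmean, pvariance

def window_statistics(frames, window, hop):
    """Summarize frame-level feature vectors by per-feature mean and variance per window."""
    blocks = []
    for start in range(0, len(frames) - window + 1, hop):
        chunk = frames[start:start + window]
        columns = list(zip(*chunk))            # one tuple per feature
        means = [fmean(c) for c in columns]
        variances = [pvariance(c) for c in columns]
        blocks.append(means + variances)       # [mean f1..fn, var f1..fn]
    return blocks

# Four frames with two features each (made-up values).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
stats = window_statistics(frames, window=2, hop=2)
```

Each output vector is twice as long as a frame-level vector: the means come first, followed by the variances.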
7. Video content description - visual
feature descriptors (frame-level)
• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientations in localized portions of an image (20° per bin);
color descriptors (frame-level)
• Color naming histogram ~ projects colors onto 11 universal color names (black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow);
[J. van de Weijer et al., IEEE TIP’09]
visual activity (frame-level)
• high values account for important visual changes ~ action;
[B. Ionescu et al., IEEE ICASSP’06]
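To make the HoG idea concrete, here is a simplified sketch of the orientation histogram for a single image cell. It assumes unsigned gradients (0° to 180°, hence 9 bins of 20°), central differences, and magnitude weighting; these choices and the name `orientation_histogram` are assumptions, and the cell grid and block normalization of a full HoG descriptor are left out.

```python
import math

def orientation_histogram(image, bin_width_deg=20.0):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    n_bins = int(180 / bin_width_deg)
    hist = [0.0] * n_bins
    rows, cols = len(image), len(image[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # horizontal central difference
            gy = image[y + 1][x] - image[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width_deg) % n_bins] += magnitude
    return hist

# Vertical edge: intensity rises from left to right, so gradients point along x.
cell = [[0, 0, 1, 1]] * 4
hist = orientation_histogram(cell)
```

For this vertical edge all gradient energy falls into the first bin (orientations from 0° up to 20°).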
8. Classifier: multi-layer perceptron
- layers: input (descriptor dimension) -> 512 units -> 1-5 output units (~concept tags),
- training using back-propagation,
- use 'dropout' to reduce overfitting: a fraction of units is randomly
omitted for each training case so a unit cannot rely on all other units
being present. [G. Hinton et al. arXiv.org’12]
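A minimal sketch of dropout on a 512-unit layer follows. It assumes sigmoid units, a drop probability of 0.5, and the "inverted" variant that rescales the surviving units during training; none of these details are stated on the slide, and the input size, random weights, and the name `hidden_layer` are made up for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_layer(inputs, weights, biases, drop_prob=0.5, training=False, rng=random):
    """Sigmoid layer with inverted dropout: units are dropped only during training."""
    activations = []
    for unit_weights, bias in zip(weights, biases):
        a = sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        if training:
            # Omit the unit with probability drop_prob; rescale survivors so the
            # expected activation matches the one used at test time.
            a = 0.0 if rng.random() < drop_prob else a / (1.0 - drop_prob)
        activations.append(a)
    return activations

rng = random.Random(0)
n_inputs, n_hidden = 8, 512                      # 8 is an arbitrary descriptor size
weights = [[rng.gauss(0.0, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
biases = [0.0] * n_hidden
x = [rng.random() for _ in range(n_inputs)]

train_out = hidden_layer(x, weights, biases, training=True, rng=rng)
test_out = hidden_layer(x, weights, biases, training=False)
dropped = sum(1 for a in train_out if a == 0.0)
```

Because a different random subset of units is silenced for each training case, no unit can rely on a specific other unit being present.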
9. Experimental results: concept prediction
> validation of the concept predictor (on the 15 train movies);
> use concept ground truth;
> leave-one-movie-out cross-validation;
• the purely visual concepts obtain a high F-score* mainly because they are rare,
• the blood detector is not that accurate (e.g. it missed most blood in “Kill Bill”),
• best results for fire and explosions (prominent yellow tones), gunshots and screams.
*results reported for an optimum threshold
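The evaluation protocol (leave-one-movie-out splits, F-score at an optimum threshold) can be sketched as below. The helper names and the toy scores are hypothetical; candidate thresholds are taken from the observed scores, which is one simple choice among several.

```python
def f_score(scores, labels, threshold):
    """F-score of thresholded scores against binary labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    """Threshold (among the observed scores) that maximizes the F-score."""
    return max(sorted(set(scores)), key=lambda t: f_score(scores, labels, t))

def leave_one_movie_out(movies):
    """Yield (training movies, held-out movie) pairs, one per movie."""
    for held_out in movies:
        yield [m for m in movies if m != held_out], held_out

splits = list(leave_one_movie_out(["A", "B", "C"]))   # placeholder movie names
scores = [0.9, 0.8, 0.4, 0.3, 0.1]                    # made-up frame scores
labels = [1, 1, 0, 1, 0]                              # made-up ground truth
t = best_threshold(scores, labels)
```

Picking the threshold on the same data it is evaluated on gives an optimistic figure, which is why the slide flags its results as reported for an optimum threshold.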
10. Experimental results: violence prediction
> validation of the violence predictor (on the 15 train movies);
> input: descriptors + mid-level predictions (real numbers);
> use violence ground truth;
> leave-one-movie-out cross-validation;
[Figure: two bar charts of precision, recall and F-score at the optimal threshold; plotted values 0.41, 0.3, 0.23 without and 0.46, 0.34, 0.27 with median filtering of the predictions]
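Median filtering of the frame-wise predictions can be sketched as follows. The window size is an assumption (the slide does not state it), as is the handling of the sequence borders by truncating the window.

```python
from statistics import median

def median_filter(values, size=5):
    """Sliding-window median; the window is truncated at the sequence borders."""
    half = size // 2
    return [median(values[max(0, i - half):i + half + 1]) for i in range(len(values))]

# Made-up frame scores: one isolated spike, then a sustained high-score run.
scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.8, 0.1]
smoothed = median_filter(scores, size=3)
```

The isolated spike at index 2 is suppressed, while the sustained run at indices 5 to 7 survives, which is the kind of temporal smoothing that raised all three measures in the validation above.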
11. Experimental results: official runs
> segment/shot violence decision: assign the frame-wise highest
prediction score + thresholding;
> segment-level results:
precision 0.28, recall 0.49, F-score 0.36, MAP@100 0.55;
> shot-level results: vary significantly with the movie.
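The segment/shot decision rule (highest frame-wise score, then thresholding) can be sketched as below. The shot boundaries, frame scores, and threshold are invented for illustration; the last line only checks that the reported segment-level F-score is consistent with the reported precision and recall.

```python
def shot_decisions(frame_scores, shots, threshold):
    """For each (start, end) shot, keep the highest frame score and threshold it."""
    results = []
    for start, end in shots:
        score = max(frame_scores[start:end])
        results.append((score >= threshold, score))
    return results

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

frame_scores = [0.1, 0.2, 0.7, 0.3, 0.2, 0.1]   # made-up frame-wise predictions
shots = [(0, 3), (3, 6)]                        # hypothetical boundaries, end exclusive
decisions = shot_decisions(frame_scores, shots, threshold=0.5)
segment_f = f_score(0.28, 0.49)                 # precision and recall from the slide
```

Taking the maximum means a single high-scoring frame is enough to flag a whole shot, which tends to favor recall over precision.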
12. Experimental results: official runs
> shot-level comparative results:
[Figure: bar charts of MAP and MAP@100 for the submitted runs of the participating teams (labels DYNI, TUB, TEC, NII, LIG, TUM, Shanghai-Hongkong and ARF-1)]
13. Conclusions and future work
> fair performance for a naïve attempt at violence detection;
> a high baseline to be challenged by more sophisticated
approaches;
> future work:
• investigate whether the concept predictions actually helped,
• investigate the contribution of the modalities,
• investigate dropout vs. classic learning.
14. Thank you!
Any questions?