This document summarizes the MediaEval 2012 Violent Scenes Detection task. The goal is to detect violent segments in movies to help users choose content that is suitable for children. Violence is defined as physical violence or accidents resulting in human injury or pain. Participants were provided with 18 Hollywood movies annotated for violence at the shot level. Runs detecting violence at the shot level and at the segment level were submitted. Evaluation used mean average precision (MAP@100) and a cost-based metric. 11 teams from 9 countries registered, and 8 teams submitted 36 runs in total. The best-performing run achieved a MAP@100 of 65.05. Participation was higher than in the previous year, with more joint submissions and higher workshop attendance.
2. Task definition
Second year of the task!
Derives from a Technicolor use case:
Helping users choose movies that are suitable for the children in their family by
proposing a preview of the most violent segments
Same definition as in 2011:
“Physical violence or accident resulting in human injury or pain”
As objective as possible
But:
Dead people shown without seeing how they died => not annotated
Somebody hurting themselves while shaving => annotated
Such cases do not match the use case…
3. Task definition
Two types of runs:
Primary and required run at shot level,
i.e. a violent / non-violent decision must be provided for each movie shot
Optional run at segment level,
i.e. violent segments (starting and ending times) must be extracted by the
participants
Scores are required to compute the official measure (a toy run-file sketch follows after this list)
Rules:
Any features automatically extracted from the DVDs can be used
This includes audio, video and subtitles
No external additional data (e.g. from the Internet)
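Purely as an illustration of the shot-level run type, here is a minimal Python sketch that writes one line per shot with a score and a binary decision. The file name, column layout and threshold are hypothetical and do not reproduce the official submission format.

```python
# Hypothetical shot-level run writer; the actual MediaEval submission
# format is not reproduced here, all fields are illustrative only.

def write_shot_run(path, shots, threshold=0.5):
    """shots: iterable of (movie, shot_index, violence_score) tuples."""
    with open(path, "w") as f:
        for movie, shot_index, score in shots:
            decision = 1 if score >= threshold else 0  # violent / non-violent
            f.write(f"{movie} {shot_index} {score:.4f} {decision}\n")

# Toy usage with made-up scores for two shots of one test movie.
write_shot_run("run_shot_level.txt",
               [("Fight_Club", 0, 0.12), ("Fight_Club", 1, 0.87)])
```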
4. Data set
18 Hollywood movies, purchased by the participants
Of different genres (from extremely violent to non-violent), in both the learning and
test sets
5. Data set – development set
Movie | Duration (s) | # Shots | Violence duration (%) | Violent shots (%)
Armageddon | 8680.16 | 3562 | 14.03 | 14.6
Billy Elliot | 6349.44 | 1236 | 5.14 | 4.21
Eragon | 5985.44 | 1663 | 11.02 | 16.6
Harry Potter 5 | 7953.52 | 1891 | 10.46 | 13.43
I Am Legend | 5779.92 | 1547 | 12.75 | 20.43
Leon | 6344.56 | 1547 | 4.3 | 7.24
Midnight Express | 6961.04 | 1677 | 7.28 | 11.15
Pirates of the Caribbean | 8239.44 | 2534 | 11.3 | 12.47
Reservoir Dogs | 5712.96 | 856 | 11.55 | 12.38
Saving Private Ryan | 9751.0 | 2494 | 12.92 | 18.81
The Sixth Sense | 6178.04 | 963 | 1.34 | 2.80
The Wicker Man | 5870.44 | 1638 | 8.36 | 6.72
Kill Bill 1 | 5626.6 | 1597 | 17.4 | 24.8
The Bourne Identity | 5877.6 | 1995 | 7.5 | 9.3
The Wizard of Oz | 5415.7 | 908 | 5.5 | 5.0
TOTAL | 100725.8 (27h58min) | 26108 | 9.39 | 11.99
6. Data set – test set
Movie | Duration (s) | # Shots | Violence duration (%) | Violent shots (%)
Dead Poets Society | 7413.24 | 1583 | 0.75 | 2.15
Fight Club | 8005.72 | 2335 | 7.61 | 13.28
Independence Day | 8834.32 | 2652 | 6.4 | 13.99
TOTAL | 24253.28 (6h44min) | 6570 | 4.92 | 9.80
7. Annotations & additional data
Ground truth manually created by 7 human assessors:
Segments containing violent events according to the definition
One single violent action per segment wherever possible
Otherwise tagged ‘multiple_action_scenes’
7 high-level video concepts:
Presence of blood
Presence of fire
Presence of firearms or similar weapons
Presence of cold weapons (knives or similar weapons)
Fights (1 against 1, small, large, distant attack)
Car chases
Gory scenes (graphic images of bloodletting and/or tissue damage)
3 high-level audio concepts:
Gunshots, cannon fire
Screams, effort noises
Explosions
Automatically generated shot boundaries with key frames
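As a rough illustration of what these annotations contain, here is a minimal Python sketch of a possible in-memory representation. All class, field and concept names are made up for illustration; the official annotation files may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative container for the ground truth described above.
# Names are hypothetical; not the official annotation file format.

@dataclass
class ViolentSegment:
    start: float                          # seconds
    end: float                            # seconds
    multiple_action_scene: bool = False   # set when several actions share one segment

@dataclass
class MovieAnnotation:
    title: str
    violent_segments: List[ViolentSegment] = field(default_factory=list)
    # high-level concepts as lists of (start, end) intervals keyed by name,
    # e.g. "blood", "fire", "gunshots", "explosions"
    video_concepts: Dict[str, List[Tuple[float, float]]] = field(default_factory=dict)
    audio_concepts: Dict[str, List[Tuple[float, float]]] = field(default_factory=dict)
    shot_boundaries: List[float] = field(default_factory=list)  # automatic cuts (seconds)

ann = MovieAnnotation(title="Leon")
ann.violent_segments.append(ViolentSegment(start=120.4, end=131.9))
ann.video_concepts["blood"] = [(121.0, 125.5)]
```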
9. Evaluation metrics
Official measure: Mean Average Precision at 100 (MAP@100)
Average precision over the 100 top-ranked violent shots, averaged over the 3 test movies
For comparison purposes with 2011, the MediaEval cost (both measures are sketched in code after this list):
C = C_fa · P_fa + C_miss · P_miss, with C_fa = 1 and C_miss = 10,
where P_fa and P_miss are the estimated probabilities of false alarm and missed detection
Additional metrics:
false alarm rate, missed detection rate, precision, recall, F-measure, MAP@20, MAP
Detection error trade-off (DET) curves
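A toy re-implementation of the two measures, assuming binary per-shot ground-truth labels. This is not the official scoring tool, and the exact AP@100 normalization used by the organizers may differ from the one chosen here.

```python
# Sketch of MAP@100 and the MediaEval cost, given per-shot ground truth
# (1 = violent, 0 = non-violent). Not the official evaluation script.

def average_precision_at_k(ranked_labels, k=100):
    """ranked_labels: labels of one movie's shots, sorted by decreasing score.
    Normalizes by the number of violent shots retrieved in the top k; the
    official tool may use a different normalization."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def map_at_100(per_movie_ranked_labels):
    """Mean AP@100 over the test movies (3 movies in this task)."""
    aps = [average_precision_at_k(labels) for labels in per_movie_ranked_labels]
    return sum(aps) / len(aps)

def mediaeval_cost(decisions, labels, c_fa=1.0, c_miss=10.0):
    """C = C_fa * P_fa + C_miss * P_miss, estimated from binary decisions."""
    false_alarms = sum(d == 1 and y == 0 for d, y in zip(decisions, labels))
    misses = sum(d == 0 and y == 1 for d, y in zip(decisions, labels))
    negatives = sum(y == 0 for y in labels)
    positives = sum(y == 1 for y in labels)
    p_fa = false_alarms / negatives if negatives else 0.0
    p_miss = misses / positives if positives else 0.0
    return c_fa * p_fa + c_miss * p_miss
```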
10. Task participation
Survey:
35 teams expressed interest in the task (among which 12 were very interested)
2011: 13 teams
Registration:
11 teams = 6 core participants + 1 organizers’ team + 4 additional teams
At least 3 joint submissions - 16 research teams - 9 countries
3 teams had already worked on the detection of violence in movies
2011: 6 teams = 4 + 2 organizers, 1 joint submission, 4 countries
Submission:
7 teams + 1 organizers’ team
We lost 3 teams (corpus availability, economic issues, low performance)
Grand total of 36 runs: 35 at shot level and 1 brave submission at segment level!
2011: 29 runs at shot level, 4 teams + 2 organizers’ teams
Workshop participation:
6 teams
2011: 3 teams
11. Task baseline – random classification
Movie | MAP@100
Dead Poets Society | 2.17
Fight Club | 13.27
Independence Day | 13.98
Total | 9.08
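As a rough check of where such numbers come from, here is a self-contained Python sketch that estimates the expected AP@100 of a random ranking; the shot counts below only mimic Dead Poets Society, and this is not the organizers' baseline script.

```python
import random

# Estimate the random-classification AP@100 for one movie by averaging
# over many random rankings of its shots. Illustrative only.

def ap_at_k(ranked_labels, k=100):
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def random_baseline_ap(labels, k=100, trials=1000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = labels[:]
        rng.shuffle(shuffled)           # random ranking of the movie's shots
        total += ap_at_k(shuffled, k)
    return total / trials

# Roughly Dead Poets Society: 1583 shots, about 2% of them violent.
labels = [1] * 34 + [0] * (1583 - 34)
print(round(100 * random_baseline_ap(labels), 2))
```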
12. Task participation
Team | Country | Runs submitted | 2011 / workshop participation | MAP@100 | MediaEval cost
ARF | Austria | 1 (shot) + 1 (segment) | X | 65.05 (shot) / 54.82 (segment) | 3.56 (shot) / 5.13 (segment)
DYNI – LSIS | France | 5 | X | 12.44 | 7.96
NII - Video Processing Lab | Japan | 5 | X | 30.82 | 1.28
Shanghai-Hongkong | China | 5 | X | 62.38 | 5.52
TUB - DAI | Germany | 5 | X / X | 18.53 | 4.20
TUM | Germany-Austria | 5 | X | 48.43 | 7.83
LIG - MRIM | France | 4 | X / X | 31.37 | 4.16
TEC* | France-UK | 5 | X / X | 61.82 | 3.56
Total: 8 teams (23%) | | 36 | 5 / 6 (75%) | |
Random classification | | | | 9.8 |
*: task organizer
Best run according to MAP@100: ARF, 65.05 (shot level).
15. Learned points
Features:
Mainly classic low-level features, either audio or video
Mainly computed at frame level
Classification step (a toy pipeline of this kind is sketched after this list):
Mainly supervised machine-learning systems
Mostly SVM-based; 1 NN, 1 BN
Two systems based on similarity computation (k-NN)
Multimodality:
Which is more informative: audio, video, or both together? No real convergence
No use of text features
Mid-level concepts:
YES! This year they were largely used (4 teams out of 8)
Seems promising for some of them (except blood)
But how to use them? (as additional features, or as an intermediate step)
Test set: it seems that…
Systems worked better on Independence Day, and Dead Poets Society was more difficult.
Due to some similarity with other movies in the dev set?
A generalization issue?
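To make the dominant design concrete, here is a minimal sketch of a shot-level SVM classifier in which mid-level concept scores are concatenated with low-level features. It uses scikit-learn with random placeholder arrays and is not any team's actual system.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative pipeline: low-level audio/video features per shot, optionally
# concatenated with mid-level concept scores, fed to an SVM.
# All arrays below are random placeholders, not real features.

n_train, n_test, n_lowlevel, n_concepts = 500, 100, 64, 10
rng = np.random.default_rng(0)

X_low = rng.normal(size=(n_train, n_lowlevel))     # e.g. audio/colour statistics per shot
X_mid = rng.uniform(size=(n_train, n_concepts))    # e.g. blood, fire, gunshot detector scores
y = rng.integers(0, 2, size=n_train)               # 1 = violent shot (placeholder labels)

X_train = np.hstack([X_low, X_mid])                # concepts used as additional features

clf = make_pipeline(StandardScaler(), SVC(probability=True))
clf.fit(X_train, y)

X_test = rng.normal(size=(n_test, n_lowlevel + n_concepts))
scores = clf.predict_proba(X_test)[:, 1]           # ranking scores (for MAP@100)
decisions = (scores >= 0.5).astype(int)            # binary violent / non-violent decisions
```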
18. Conclusions & perspectives
Success of the task:
Increased number of participants
Attracted people from the domain
Quality of results has increased markedly
MediaEval 2013:
Which task definition?
How to go one step further with multimodality?
Text is still not used
Who will join the organizers’ group next year?