Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene Detection Task

  1. Technicolor / INRIA / Imperial College London at the MediaEval 2012 Violent Scene Detection Task
     PENET Cédric – Technicolor, INRIA
     DEMARTY Claire-Hélène – Technicolor
     SOLEYMANI Mohammad – Imperial College London
     GRAVIER Guillaume – CNRS, IRISA
     GROS Patrick – INRIA
     MediaEval 2012 Pisa Workshop, October 4th, 2012
  2. Outline
     - Introduction
     - Systems description
     - Results and conclusion

  3. Outline
     - Introduction
     - Systems description
     - Results and conclusion
  4. Introduction
     Joint effort between Technicolor / INRIA / Imperial College London.
     - 5 runs → 5 different systems
     - Re-use of last year's systems, with a few differences:
       - Bayesian networks structure learning (Technicolor/INRIA)
       - Naive Bayesian classifier (ICL)
     - Two new systems from Technicolor/INRIA:
       - Exploiting similarity
       - Bag-of-Audio words
     - Fusion of three systems (Technicolor/INRIA – ICL)
  5. Outline
     - Introduction
     - Systems description
     - Results and conclusion
  6. Run 1: Exploiting Similarity
     Idea: can we get the same results as last year using only similarity measures?
     - Video features for each frame:
       - Motion activity
       - Three color harmonisation features: harmonisation template, angle and energy
     - Decision: k-NN using only the closest neighbour (see the sketch below)
       - 10 movies used to populate the k-NN
       - Test frames labelled according to their closest neighbour
       - If at least one frame of a shot is labelled violent → the shot is labelled violent
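A minimal sketch of this 1-NN decision rule, assuming scikit-learn; the array sizes, feature values and labels are hypothetical stand-ins, not the actual MediaEval data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-ins: 4 features per frame (motion activity +
# harmonisation template, angle, energy) and binary violence labels.
rng = np.random.default_rng(0)
train_frames = rng.random((5000, 4))      # frames from the 10 training movies
train_labels = rng.integers(0, 2, 5000)   # 1 = violent, 0 = non-violent
test_frames = rng.random((24, 4))         # frames of one test shot

# k-NN with k=1: each test frame takes the label of its closest neighbour.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_frames, train_labels)
frame_labels = knn.predict(test_frames)

# One violent frame is enough to label the whole shot violent.
shot_is_violent = bool(frame_labels.any())
```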
  7. Run 2: Bag-of-Audio words
     - Audio feature extraction:
       - MFCC audio features (with Δ and ΔΔ) – 20 ms windows, 10 ms overlap
       - Extraction of silence segments with SPro
       - Extraction of coherent audio segments (André-Obrecht, 1988)
     - K-means on non-silent audio segments to build the vocabulary (of size 128):
       - Each audio segment replaced by its closest centroid
     - Construction of TF-IDF histograms:
       - Each shot is a document
     - Classification using SVM (see the sketch below):
       - χ² and histogram-intersection kernels
       - Weight applied to the SVM parameter
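This pipeline maps directly onto standard tooling; below is a minimal sketch assuming scikit-learn, with random stand-ins for the MFCC segment features and shot labels (the real system uses SPro silence detection and André-Obrecht segmentation upstream):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
n_segments, n_shots = 10_000, 300
segments = rng.random((n_segments, 39))          # 13 MFCC + Δ + ΔΔ per segment
shot_of_segment = rng.integers(0, n_shots, n_segments)
shot_labels = rng.integers(0, 2, n_shots)        # 1 = violent shot

# Vocabulary: 128 audio words learned by K-means on non-silent segments;
# each segment is replaced by the index of its closest centroid.
words = KMeans(n_clusters=128, n_init=10, random_state=0).fit_predict(segments)

# One term-count histogram per shot ("each shot is a document"), then TF-IDF.
counts = np.zeros((n_shots, 128))
for shot, word in zip(shot_of_segment, words):
    counts[shot, word] += 1
histograms = TfidfTransformer().fit_transform(counts).toarray()

# SVM with a chi-square kernel; class_weight="balanced" compensates for the
# rarity of violent shots (one reading of the "weight on the SVM parameter").
svm = SVC(kernel=chi2_kernel, class_weight="balanced").fit(histograms, shot_labels)
scores = svm.decision_function(histograms)
```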
  8. Run 3: Bayesian Networks structure learning
     Re-use of Technicolor's system from last year, with additional features:
     - Audio features: energy, asymmetry, centroid, ZCR, flatness and roll-off at 90%
     - Video features: shot length, flashes, blood, activity, color coherence, average luminance, fire and color harmonisation features
     - Features are averaged over a video shot
     Graphical model for modeling conditional probability distributions, along with contextual features and temporal smoothing:
     - Naive Bayesian network (NB)
     - Graph structure learning:
       - Forest-augmented naive Bayesian network (FAN)
       - K2
     Late fusion of the modalities using a simple rule.
     [Slide figure: Bayesian network example. Source: https://controls.engin.umich.edu/wiki/index.php/Bayesian_network_theory]
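The structure-learning code itself (FAN, K2) is not shown on the slide; the sketch below only illustrates the two supporting steps that are spelled out, shot-level feature averaging plus contextual features and temporal smoothing, in plain NumPy. The function names, the neighbour-concatenation reading of "contextual features" and the moving-average window are all assumptions:

```python
import numpy as np

def shot_features(frame_feats, shot_ids):
    """Average frame-level features over each video shot (as on the slide)."""
    shots = np.unique(shot_ids)
    return np.array([frame_feats[shot_ids == s].mean(axis=0) for s in shots])

def add_context(shot_feats, width=1):
    """Contextual features: concatenate each shot's features with its
    neighbours' (one possible reading of 'contextual features')."""
    padded = np.pad(shot_feats, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(shot_feats)]
                      for i in range(2 * width + 1)])

def smooth(posteriors, width=1):
    """Temporal smoothing: moving average of per-shot violence posteriors."""
    kernel = np.ones(2 * width + 1) / (2 * width + 1)
    return np.convolve(posteriors, kernel, mode="same")
```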
  9. Run 4: Naïve Bayesian classifier
     Audio modality:
     - Classical low-level features extracted from non-silent segments: RMS energy, pitch, MFCC, ZCR, spectrum flux, spectral roll-off
     - Averaged over shots
     Video modality:
     - Shot duration, luminance, average activity, motion component
     - Averaged over shots
     Text features:
     - Simple features such as the number of spoken words and the average valence and arousal per shot (from the Dictionary of Affect in Language)
     - The results were bad, and we decided not to include them in the final submission
     A Naïve Bayesian classifier on each modality.
     Modality fusion using a weighted sum of posterior probabilities: 0.95 × audio score + 0.05 × visual score (see the sketch below).
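The fusion rule is explicit on the slide; here is a minimal sketch using scikit-learn's Gaussian naive Bayes, where the feature arrays are hypothetical stand-ins and the Gaussian variant is an assumption about the exact classifier used:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
audio_feats = rng.random((300, 18))   # shot-averaged audio features
video_feats = rng.random((300, 4))    # shot-averaged video features
labels = rng.integers(0, 2, 300)      # 1 = violent shot

# One naive Bayesian classifier per modality.
audio_nb = GaussianNB().fit(audio_feats, labels)
video_nb = GaussianNB().fit(video_feats, labels)

# Weighted sum of posterior probabilities: 0.95 * audio + 0.05 * visual.
p_audio = audio_nb.predict_proba(audio_feats)[:, 1]
p_video = video_nb.predict_proba(video_feats)[:, 1]
violence_score = 0.95 * p_audio + 0.05 * p_video
```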
 10. Run 5: Systems fusion
     Simple fusion of three systems:
     - Run 2: Bag-of-Audio words
     - Run 3: Bayesian networks structure learning
     - Run 4: Naive Bayesian classifier
     Fusion by multiplication of probabilities (see the sketch below).
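The fusion itself is a one-liner; a sketch with hypothetical per-shot probability vectors standing in for the three systems' outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-shot violence probabilities from runs 2, 3 and 4.
p_boaw, p_bnsl, p_nbn = rng.random(300), rng.random(300), rng.random(300)

# Fusion by multiplication of probabilities.
fused = p_boaw * p_bnsl * p_nbn
```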
 11. Outline
     - Introduction
     - Systems description
     - Results and conclusions
 12. Results

     N°  Technique   MAP@100 (%)  AP-1 (%)  AP-2 (%)  AP-3 (%)  STD (%)  MediaEval Cost
     1   Similarity        13.89      0.00     12.91     28.77    14.41            2.29
     2   BoAW              40.54     10.85     52.98     57.77    25.82            2.50
     3   BN-SL             61.82     60.56     53.15     71.76     9.37            3.57
     4   NBN               46.27     40.03     22.97     75.82    26.97            3.64
     5   Fusion            57.47     64.52     37.21     70.69    17.82            4.60

     Average Precision (AP) on Dead Poets Society (AP-1), Fight Club (AP-2) and Independence Day (AP-3); STD is the standard deviation over the three test movies.
     - High variation between movies
     - Best results on Independence Day (similar to Armageddon)
     - More movies are needed to compute a reliable MAP
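For readers unfamiliar with the metric, a minimal sketch of MAP@100 as it is commonly computed (average precision over the top 100 ranked shots, averaged over the test movies); this is an assumed reading, not the official MediaEval evaluation code:

```python
import numpy as np

def average_precision_at_k(ranked_labels, k=100):
    """AP over the top-k shots of a ranked list (1 = violent, 0 = not)."""
    labels = np.asarray(ranked_labels[:k], dtype=float)
    if labels.sum() == 0:
        return 0.0
    precisions = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precisions * labels).sum() / labels.sum())

def map_at_k(per_movie_rankings, k=100):
    """MAP@k: mean of the per-movie average precisions."""
    return float(np.mean([average_precision_at_k(r, k)
                          for r in per_movie_rankings]))
```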
 13. Conclusion & perspectives
     Similarity search:
     - MAP is bad, but the MediaEval Cost is among the best (6th out of 35)
     - Adding features and merging decisions from different k-NNs might improve the results
     Fusion:
     - 4th best run overall (out of 35)
     - Results not as good as expected
     - Improves precision at the cost of recall (false alarms reduced by a factor of two)
     - Next step: test smarter fusion techniques
     Bayesian Networks – Structure Learning:
     - 3rd best run overall (out of 35)
     - Very low standard deviation over the three test movies
     - Next step: Bayesian networks for intermediate concepts
 14. Conclusion & perspectives
     Bag-of-Audio words:
     - MAP is not bad (11th out of 35)
     - False alarms and missed detections are fairly low too
     - Simple tests proved efficient – more investigation needed
     Naive Bayesian classifier:
     - A simple classifier with audio features can achieve moderately good results (10th out of 35)
     - Text features don't work
     - Next step: use a classifier that can learn temporal dynamics
 15. Thanks for your attention!
