Detecting Violent Content in Hollywood Movies by Mid-level Audio Representations
Published in: Technology, Business

  • 1. Detecting Violent Content in Hollywood Movies by Mid-level Audio Representations
      Esra Acar, Frank Hopfgartner, Sahin Albayrak
      Competence Center Information Retrieval & Machine Learning
      11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 2013
  • 2. Outline (17 June 2013, CBMI 2013)
      ► Motivation
      ► The Violence Detection Method
          - Audio Representation of Videos
          - Learning Violence Detection Model
      ► Performance Evaluation
      ► Conclusions & Future Work
  • 3. Motivation
      ► Goal: detecting the most violent scenes in Hollywood movies.
      ► Use case: parents select or reject movies by previewing the parts that contain the most violent moments.
      ► We investigate the discriminative power of mid-level audio features:
          - Bag-of-Audio-Words (BoAW) representations based on Mel-Frequency Cepstral Coefficients (MFCCs)
          - Two different BoAW construction methods: a vector quantization-based (VQ-based) method and a sparse coding-based (SC-based) method
  • 4. The Violence Detection Method
      ► Definition of violence: “physical violence or accident resulting in human injury or pain”, as defined in the MediaEval Violent Scenes Detection (VSD) task.
      ► Two main components of the method:
          - The representation of video shots
          - The learning of a violence model
  • 5. Audio Representation of Videos (1)
      ► Mel-Frequency Cepstral Coefficients (MFCCs)
          - are commonly used in speech recognition and music information retrieval (e.g., genre classification);
          - relate better to human perception;
          - work well for detecting excitement/non-excitement (i.e., they indicate the excitement level of video segments).
      ► An MFCC-based audio representation is employed to describe the audio content of Hollywood movies.
      ► Mid-level representations may help model video segments one step closer to human perception. Examples are:
          - bags of features,
          - the upper units of convolutional networks or deep belief networks.
  • 6. Audio Representation of Videos (2)
      ► We use mid-level audio features based on MFCCs (i.e., the BoAW approach).
      ► The BoAW approach is applied with two different coding schemes:
          - Vector quantization (by k-means clustering): dividing feature vectors into groups, where each group is represented by its centroid point.
          - Sparse coding (by the LARS algorithm): representing a feature vector as a linear combination of an over-complete set of basis vectors.
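The VQ-based coding scheme described on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `boaw_histogram`), the evenly spaced centroid initialisation, the fixed iteration count, and the 2-D toy vectors standing in for MFCC frames are all hypothetical choices made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid (audio word) closest to v."""
    return min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))


def kmeans(frames: List[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means; returns k centroids (the audio dictionary).

    Assumption: centroids start from evenly spaced training frames, a
    simplification chosen here to keep the example deterministic.
    Requires k <= len(frames).
    """
    step = len(frames) // k
    centroids = [list(frames[i * step]) for i in range(k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for f in frames:
            groups[nearest(f, centroids)].append(f)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def boaw_histogram(frames: List[Vector], centroids: List[List[float]]) -> List[float]:
    """Normalised count of how often each audio word is the nearest one."""
    counts = [0] * len(centroids)
    for f in frames:
        counts[nearest(f, centroids)] += 1
    total = len(frames)
    return [c / total for c in counts] if total else [0.0] * len(centroids)


# Toy 2-D vectors standing in for MFCC frames (hypothetical data).
train = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
codebook = kmeans(train, k=2)

# Frames of one video shot, encoded against the learned dictionary.
shot = [(0.05, 0.05), (4.9, 5.0), (5.2, 5.1)]
hist = boaw_histogram(shot, codebook)
print(hist)
```

In this sketch each shot becomes a fixed-length histogram over audio words, which is the kind of vector a classifier can consume. The SC-based variant would replace the hard nearest-centroid assignment with sparse coefficients over an over-complete dictionary; that step is not shown here.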
  • 7. Audio Representation of Videos (3)
      [Figure: Dictionary Generation Phase]
  • 8. Audio Representation of Videos (4)
      [Figure: Representation Construction Phase]
  • 9. Learning Violence Detection Model
      [Figure: Learning a Violence Model]
  • 10. Performance Evaluation
      ► Dataset: 32,708 video shots from 18 Hollywood movies of different genres (ranging from extremely violent movies to movies without violence).
          - Training set: 26,138 video shots from 15 movies.
          - Test set: 6,570 video shots from 3 movies.
      ► Ground truth: generated by 7 human assessors.
          - Violent movie segments are annotated at the frame level.
          - Each video shot is labeled as violent or non-violent.
      [Table: The characteristics of training and test datasets]
  • 11. Evaluation Metrics
      ► The ranking of violent shots is more important for the use case.
      ► Metrics other than precision and recall are therefore required to compare performance.
      ► Average precision at 20 and at 100 are used (official metrics in the MediaEval VSD task).
      ► R-precision is also reported; it can be seen as an alternative to precision at k.
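The ranking metrics named on this slide can be sketched as below. This is a hedged illustration rather than the official MediaEval scoring code: the function names are hypothetical, and average precision at k is assumed here to be normalised by the number of violent shots found within the top k (other normalisations exist). The input is a list of ground-truth labels (1 = violent, 0 = non-violent) ordered by the classifier's ranking.

```python
from typing import Sequence


def precision_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked shots that are violent (label 1)."""
    if k <= 0:
        return 0.0
    return sum(ranked_labels[:k]) / k


def average_precision_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """Mean of precision@i over the ranks i <= k that hold a violent shot.

    Assumption: normalised by the number of violent shots in the top k.
    """
    hits = 0
    total = 0.0
    for i, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def r_precision(ranked_labels: Sequence[int]) -> float:
    """Precision at rank R, where R is the total number of violent shots."""
    r = sum(ranked_labels)
    return precision_at_k(ranked_labels, r) if r else 0.0


# Hypothetical ranking of eight shots, most confident first.
ranked = [1, 0, 1, 1, 0, 0, 1, 0]
print(precision_at_k(ranked, 5))
print(average_precision_at_k(ranked, 5))
print(r_precision(ranked))
```

Unlike precision at a fixed k, R-precision adapts the cut-off to the number of violent shots in each movie, which matters when movies differ widely in how much violence they contain.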
  • 12. Results & Discussions (1)
      [Figure: Average Precision at 100 for the Baseline and Our Methods]
      [Figure: Average Precision at 20 & 100 and R-precision for the VQ- and SC-based methods]
  • 13. Results & Discussions (2)
      [Figure: Average Precision at 20 & 100 and R-precision on Independence Day]
      [Figure: Average Precision at 20 & 100 and R-precision on Dead Poets Society]
      [Figure: Average Precision at 20 & 100 and R-precision on Fight Club]
  • 14. Results & Discussions (3)
      Team               | Features                                                   | Modality     | AP at 100*
      ARF                | Color, texture, audio and concepts                         | audio-visual | 0.651
      Shanghai-Hong Kong | Trajectory-based features, SIFT, STIP, MFCCs               | audio-visual | 0.624
      TEC                | Color, motion, acoustic features                           | audio-visual | 0.618
      TUM                | Acoustic energy and spectral, color, texture, optical flow | audio-visual | 0.484
      SC-based (ours)    | BoAW with sparse coding                                    | audio        | 0.444
      VQ-based (ours)    | BoAW with vector quantization                              | audio        | 0.387
      LIG-MIRM           | Color, texture, bag of SIFT and MFCCs                      | audio-visual | 0.314
      NII                | Visual concepts learned from color and texture             | visual       | 0.308
      DYNI-LSIS          | Multi-scale local binary pattern                           | visual       | 0.125
      * Average Precision at 100 (the official evaluation metric of the MediaEval VSD task)
  • 15. Sample Video Shots (Correctly Classified)
  • 16. Sample Video Shots (Wrongly Classified)
  • 17. Conclusions
      ► An approach for detecting violent content in movies at the video shot level is presented.
      ► Mid-level audio features based on the BoAW approach with two different coding schemes are employed.
      ► Promising results are obtained:
          - the SC-based BoAW outperforms all uni-modal submissions in the MediaEval VSD task except one vision-based method.
      ► One significant point is that the average precision of the proposed method varies strongly across movies of varying violence levels.
  • 18. Future Work
      ► Construct more sophisticated mid-level representations for video content analysis.
      ► Augment the feature set with visual features (both low-level and mid-level) to further improve classification.
      ► Extend our approach to user-generated videos.
          - Unlike Hollywood movies, these videos are not professionally edited, e.g., to enhance dramatic scenes.
  • 19. Thanks! Questions?