Detecting Violent Content in Hollywood Movies by Mid-level Audio Representations

Esra Acar, Frank Hopfgartner, Sahin Albayrak
Competence Center Information Retrieval & Machine Learning
11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 17 June 2013

Outline
► Motivation
► The Violence Detection Method
   - Audio Representation of Videos
   - Learning the Violence Detection Model
► Performance Evaluation
► Conclusions & Future Work
Motivation
► Goal: the detection of the most violent scenes in Hollywood movies.
► Use case: parents select or reject movies by previewing the parts of the movies that include the most violent moments.
► We investigate the discriminative power of mid-level audio features:
   - Bag-of-Audio-Words (BoAW) representations based on Mel-Frequency Cepstral Coefficients (MFCCs)
   - Two different BoAW construction methods:
     - Vector quantization-based (VQ-based) method, and
     - Sparse coding-based (SC-based) method
The Violence Detection Method
► The definition of violence: “physical violence or accident resulting in human injury or pain”
   - “violence” as defined in the MediaEval Violent Scenes Detection (VSD) task
► Two main components of the method:
   - The representation of video shots
   - The learning of a violence model
Audio Representation of Videos (1)
► Mel-Frequency Cepstral Coefficients (MFCCs)
   - are commonly used in speech recognition and music information retrieval (e.g., genre classification),
   - relate better to human perception, and
   - work well for the detection of excitement/non-excitement (i.e., indicators of the excitement level of video segments).
► An MFCC-based audio representation is employed to describe the audio content of Hollywood movies.
► Using mid-level representations may help model video segments one step closer to human perception. Examples are:
   - bags of features,
   - the upper units of convolutional networks or deep belief networks.
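Since the slides lean on MFCCs throughout, a minimal sketch of how a single MFCC vector is computed from one audio frame (power spectrum, mel filterbank, log, DCT) may help. The sample rate, frame length, and filter counts below are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, sr=16000, n_mels=26, n_coeff=13):
    """Compute MFCCs for one windowed audio frame (illustrative parameters)."""
    # Power spectrum of the Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    # Triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, c, hi = pts[i], pts[i + 1], pts[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                   (hi - freqs) / (hi - c)), 0, None)
    # Log filterbank energies -> DCT -> keep the first n_coeff coefficients
    energies = np.log(fb @ spec + 1e-10)
    return dct(energies, norm="ortho")[:n_coeff]

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # 25 ms of a 440 Hz tone
print(mfcc(frame).shape)  # (13,)
```

In practice a shot's soundtrack would be split into overlapping frames and this computation applied to each, yielding the per-frame MFCC vectors the BoAW step consumes.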
Audio Representation of Videos (2)
► We use mid-level audio features based on MFCCs (i.e., the BoAW approach).
► The BoAW approach with two different coding schemes:
   - Vector quantization (by k-means clustering): dividing feature vectors into groups, where each group is represented by its centroid.
   - Sparse coding (by the LARS algorithm): representing a feature vector as a linear combination of an over-complete set of basis vectors.
Audio Representation of Videos (3)
Dictionary Generation Phase
[Figure: overview of the dictionary generation phase]
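The dictionary generation phase can be sketched roughly as follows: MFCC frames pooled from the training data are turned into an audio-word dictionary, either by k-means clustering (VQ) or by learning an over-complete sparse-coding dictionary. The dictionary sizes and the synthetic data are illustrative assumptions, and scikit-learn's `DictionaryLearning` stands in for the LARS-based learner named in the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import DictionaryLearning

# Synthetic stand-in for MFCC vectors pooled from the training shots
rng = np.random.default_rng(0)
mfccs = rng.normal(size=(2000, 13))  # 2000 frames x 13 MFCC coefficients

# VQ dictionary: k-means centroids serve as the audio words
vq_dict = KMeans(n_clusters=32, n_init=10, random_state=0).fit(mfccs)

# SC dictionary: an over-complete basis (32 atoms > 13 dimensions)
# learned by sparse coding; LARS is used for the sparse-coding step
sc_dict = DictionaryLearning(n_components=32, transform_algorithm="lars",
                             max_iter=10, random_state=0).fit(mfccs)

print(vq_dict.cluster_centers_.shape)  # (32, 13)
print(sc_dict.components_.shape)       # (32, 13)
```

Both learners produce a dictionary of the same shape; they differ in how a new MFCC frame is later encoded against it (hard assignment vs. a sparse linear combination).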
Audio Representation of Videos (4)
Representation Construction Phase
[Figure: overview of the representation construction phase]
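The representation construction phase can be illustrated with the VQ scheme: each MFCC frame of a shot is assigned to its nearest audio word, and the shot is described by the normalized histogram of those assignments. For the SC scheme, the hard assignment would be replaced by sparse codes pooled over the shot's frames. All sizes and data here are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Codebook from the dictionary generation phase (synthetic stand-in)
train_mfccs = rng.normal(size=(1000, 13))
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(train_mfccs)

# MFCC frames of a single video shot
shot_mfccs = rng.normal(size=(120, 13))

# VQ-based BoAW: histogram of nearest-audio-word assignments, L1-normalized
words = codebook.predict(shot_mfccs)
boaw = np.bincount(words, minlength=16).astype(float)
boaw /= boaw.sum()

print(boaw.shape)  # (16,)
```

The resulting fixed-length vector is what the classifier in the next step consumes, regardless of how many frames the shot contains.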
Learning the Violence Detection Model
[Figure: learning a violence model]
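As a hedged sketch of this step: a binary classifier is trained on the BoAW shot descriptors, and its decision scores are used to rank shots from most to least violent, which is what the preview use case needs. The slide itself does not spell out the learner, so the RBF-kernel SVM below is an assumption, and the data and dimensions are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# BoAW descriptors of training shots with violent / non-violent labels
X_train = rng.random((200, 16))
y_train = rng.integers(0, 2, 200)  # 1 = violent, 0 = non-violent

clf = SVC(kernel="rbf").fit(X_train, y_train)

# Rank unseen shots by their signed distance to the decision boundary
X_test = rng.random((10, 16))
scores = clf.decision_function(X_test)
ranking = np.argsort(-scores)  # indices of shots, most violent first

print(ranking.shape)  # (10,)
```

Ranking by decision score rather than thresholding to a hard label is what makes the average-precision metrics on the next slides meaningful.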
Performance Evaluation
► Dataset: 32,708 video shots from 18 Hollywood movies of different genres (ranging from extremely violent movies to movies without violence).
   - Training set: 26,138 video shots from 15 movies.
   - Test set: 6,570 video shots from 3 movies.
► Ground truth: generated by 7 human assessors.
   - Violent movie segments are annotated at the frame level.
   - Each video shot is labeled as violent or non-violent.
[Table: the characteristics of the training and test datasets]
Evaluation Metrics
► The ranking of violent shots is more important for the use case.
► Metrics other than precision and recall are required to compare the performance.
► Average precision at 20 & 100 are used (the official metrics in the MediaEval VSD task).
► R-precision, which can be seen as an alternative to precision at k, is also reported.
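The two ranking metrics can be made concrete with small helpers over a ranked list of shot labels. Note that AP@k has several variants in the literature; the normalization below (dividing by the number of hits in the top k) is one common choice, not necessarily the exact MediaEval formula, and the label list is invented for illustration.

```python
def average_precision_at_k(ranked_labels, k):
    """AP@k: mean of precision@i over the relevant items in the top k."""
    hits, score = 0, 0.0
    for i, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def r_precision(ranked_labels):
    """Precision at R, where R is the total number of relevant items."""
    r = sum(ranked_labels)
    return sum(ranked_labels[:r]) / r if r else 0.0

ranked = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = violent shot, ordered by score
print(round(average_precision_at_k(ranked, 4), 3))  # 0.917
print(round(r_precision(ranked), 2))                # 0.75
```

Both metrics reward placing the violent shots near the top of the ranking, which matches the preview use case better than plain precision/recall over all shots.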
Results & Discussions (1)
[Figure: Average Precision at 100 for the baseline and our methods]
[Table: Average Precision at 20 & 100 and R-precision for the VQ- and SC-based methods]
Results & Discussions (2)
[Table: Average Precision at 20 & 100 and R-precision on Independence Day]
[Table: Average Precision at 20 & 100 and R-precision on Dead Poets Society]
[Table: Average Precision at 20 & 100 and R-precision on Fight Club]
Results & Discussions (3)

Team                 Features                                                     Modality      AP at 100*
ARF                  Color, texture, audio and concepts                           audio-visual  0.651
Shanghai-Hong Kong   Trajectory-based features, SIFT, STIP, MFCCs                 audio-visual  0.624
TEC                  Color, motion, acoustic features                             audio-visual  0.618
TUM                  Acoustic energy and spectral, color, texture, optical flow   audio-visual  0.484
SC-based (ours)      BoAW with sparse coding                                      audio         0.444
VQ-based (ours)      BoAW with vector quantization                                audio         0.387
LIG-MIRM             Color, texture, bag of SIFT and MFCCs                        audio-visual  0.314
NII                  Visual concepts learned from color and texture               visual        0.308
DYNI-LSIS            Multi-scale local binary pattern                             visual        0.125

* Average Precision at 100 (the official evaluation metric of the MediaEval VSD task)
Sample Video Shots (Correctly Classified)
[Figure: example video shots correctly classified by the method]
Sample Video Shots (Wrongly Classified)
[Figure: example video shots wrongly classified by the method]
Conclusions
► An approach for movie violent content detection at the video shot level is presented.
► Mid-level audio features based on the BoAW approach with two different coding schemes are employed.
► Promising results are obtained:
   - The SC-based BoAW outperforms all uni-modal submissions in the MediaEval VSD task except one vision-based method.
► One significant point is that the average precision of the proposed method varies considerably across movies of different violence levels.
Future Work
► Construct more sophisticated mid-level representations for video content analysis.
► Augment the feature set with visual features (both low-level and mid-level) to further improve classification.
► Extend our approach to user-generated videos.
   - Different from Hollywood movies, these videos are not professionally edited, e.g., to enhance dramatic scenes.