The Shanghai-Hongkong Team at MediaEval2012: Violent Scene Detection Using Trajectory-based Features
Yu-Gang Jiang*, Qi Dai*, Chun Chet Tan**, Xiangyang Xue*, Chong-Wah Ngo**
*School of Computer Science, Fudan University, Shanghai
**Department of Computer Science, City University of Hong Kong, HK
MediaEval 2012 Workshop, Oct 4-5, Pisa, Italy
Introduction
• Violent Scene Detection task
  - a practical challenge with great potential in applications.
• Focus on novel features.
• Top performance in mAP@20, runner-up in mAP@100.
C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2012 Affect Task: Violent Scenes Detection. In MediaEval 2012 Workshop, Pisa, Italy, 2012.
Framework
[Figure: system pipeline. Video shots → feature extraction (trajectory-based, 7 features; SIFT; spatial-temporal interest points; MFCC audio feature; concept-based) → χ2 kernel SVM classifiers → score-level temporal smoothing → detection. A separate branch applies temporal feature smoothing to all features except the concept-based ones before a χ2 kernel SVM. The circled numbers 1 to 5 indicate the 5 submitted runs.]
Feature Extraction
• Trajectory-based features:
  - dense trajectory, HOG, HOF, MBH
  - TrajMF (relative locations and motions between trajectory pairs)
  - trajectory shape feature
• Advantages: robust to camera movement, rich information, implicitly capture object-object and object-background relationships.
Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
Feature Extraction (continued)
• SIFT
• STIP
• MFCC
• Concept-based features (10 concepts: blood, carchase, coldarms, fights, fire, firearms, gore, explosions, gunshots, screams)
I. Laptev. On space-time interest points. International Journal of Computer Vision, 64:107-123, 2005.
D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
Classifiers
• BoW (bag-of-words) representation
• Chi-squared (χ2) kernel SVMs
• Kernel-level early fusion is used to combine multiple features
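The kernel-level fusion step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exponential form of the χ2 kernel, the choice of gamma as the inverse mean distance, and fusion by plain averaging are all assumptions, since the slides do not specify them. The function names and the toy histograms are hypothetical.

```python
import math

def chi2_distance(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative BoW histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))

def chi2_kernel(X, Y, gamma=None):
    """Kernel matrix K[i][j] = exp(-gamma * chi2_distance(X[i], Y[j]))."""
    D = [[chi2_distance(x, y) for y in Y] for x in X]
    if gamma is None:
        # Assumption: gamma = 1 / mean distance (a common heuristic).
        flat = [d for row in D for d in row]
        mean_d = sum(flat) / len(flat)
        gamma = 1.0 / mean_d if mean_d > 0 else 1.0
    return [[math.exp(-gamma * d) for d in row] for row in D]

def fuse_kernels(kernels):
    """Kernel-level fusion, assumed here to be an element-wise average."""
    n = len(kernels)
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(K[i][j] for K in kernels) / n for j in range(cols)]
            for i in range(rows)]

# Toy BoW histograms for 3 shots under two hypothetical feature types.
bow_traj = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.1, 0.9]]
bow_mfcc = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]
K = fuse_kernels([chi2_kernel(bow_traj, bow_traj),
                  chi2_kernel(bow_mfcc, bow_mfcc)])
```

The fused matrix K can then be handed to any SVM implementation that accepts a precomputed kernel. Fusing at the kernel level lets each feature keep its own codebook size, because only shot-to-shot similarities are combined.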
Temporal Smoothing
• Feature smoothing: features are averaged over a three-shot window.
• Score smoothing: prediction scores are averaged over a three-shot window.
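Both smoothing variants reduce to a moving average over neighboring shots. The sketch below assumes a window centered on the current shot, shots given in temporal order within one movie, and a window truncated at the first and last shot; the slides do not state how boundaries are handled. The function names are hypothetical.

```python
def smooth_scores(scores, window=3):
    """Average each shot's score with its neighbors in a centered window."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        # Assumption: the window is truncated at the start and end of the movie.
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def smooth_features(features, window=3):
    """Apply the same window average to every dimension of per-shot vectors."""
    columns = [smooth_scores(list(col), window) for col in zip(*features)]
    return [list(row) for row in zip(*columns)]

shot_scores = [0.0, 3.0, 0.0]
smoothed = smooth_scores(shot_scores)  # [1.5, 1.0, 1.5]
```

Feature smoothing is applied before classification and score smoothing after it, so the same helper serves both cases.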
Results (mAP@20)
[Figure: bar chart of Mean Average Precision at 20 for the five runs; y-axis from 0 to 0.8; bars ordered r3, r2, r5, r4, r1.]
• Run 5: 7 dense trajectory features
• Run 4: Run 5 + SIFT + STIP + MFCC
• Run 3: Run 4 + concept scores
• Run 2: Run 4 + feature smoothing
• Run 1: Run 4 + score smoothing
Results (mAP@100)
[Figure: bar chart of Mean Average Precision at 100 for the five runs; y-axis from 0 to 0.7; bars ordered r3, r4, r5, r2, r1.]
• Run 5: 7 dense trajectory features
• Run 4: Run 5 + SIFT + STIP + MFCC
• Run 3: Run 4 + concept scores
• Run 2: Run 4 + feature smoothing
• Run 1: Run 4 + score smoothing
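For readers unfamiliar with the metric, average precision at k can be sketched as below. This follows one common convention (normalizing by the number of relevant shots found in the top k) and assumes one ranked list per test movie; the task's official evaluation tool may normalize differently. The function names are hypothetical.

```python
def average_precision_at_k(ranked_labels, k):
    """AP over the top-k ranked shots; labels are 1 (violent) or 0."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Assumption: normalize by the number of relevant shots in the top k.
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at_k(per_movie_rankings, k):
    """Mean of AP@k over movies (assumes one ranking per test movie)."""
    aps = [average_precision_at_k(r, k) for r in per_movie_rankings]
    return sum(aps) / len(aps)

# Shots sorted by descending detection score, labeled 1 if truly violent.
example_ap = average_precision_at_k([1, 0, 1, 0], k=4)
```

Because only the top k shots count, mAP@20 rewards a clean head of the ranking, while mAP@100 also reflects how well violent shots are retrieved further down the list.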
Discussions
• Adding SIFT + STIP + MFCC yields an insignificant improvement: TrajMF already encodes the rich information of SIFT and STIP.
• Concept-based scores do not improve performance, as the SVMs overfit due to insufficient training data. Even so, using mid-level concept detectors remains a promising direction.
• Score smoothing boosts performance. Feature smoothing, which "blurs" the features across shots, might not be a good option.