Many video analysis tasks require temporal localization to detect content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large-scale annotation of temporal boundaries in untrimmed videos is expensive; consequently, no suitable datasets exist for pre-training in a manner sensitive to temporal boundaries. In this paper, for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries from existing video action classification datasets. Given different ways of synthesizing boundaries, BSP can then be conducted simply in a self-supervised manner by classifying the boundary types. This yields video representations that transfer much better to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification-based pre-training, achieving new state-of-the-art performance on several temporal localization tasks.
Boundary-sensitive Pre-training for Temporal Localization in Videos (CVPR 2021 talk)
1. Mengmeng Xu (Frost), @SAIC-KAUST
Boundary-sensitive Pre-training for
Temporal Localization
An Application of Generic Event Boundary
Our team
Juan-Manuel Pérez-Rúa Victor Escorcia Brais Martinez Xiatian Zhu
Li Zhang Bernard Ghanem Tao Xiang (Tony)
Table of Contents
• Temporal Action Localization (TAL)
• Generic Event Boundary in TAL
• Boundary Synthesis and Pretraining
• Results and Visualizations
• Future Directions
Let’s Start By Defining The Task
Temporal Action Localization
When is the Activity Happening?
[(1:20, 1:32), (1:43, 1:59), …]
What Activity is Happening?
[Long Jump, Long Jump, …]
Let’s Start By Defining The Task
Temporal Action Localization
Input: a long untrimmed video
Output: temporally localized activities, e.g., "Polishing Furniture" with its start time and end time
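The input/output interface above can be sketched as a small data structure. This is an illustrative sketch only; the `ActionSegment` type, `to_seconds` helper, and the confidence scores are hypothetical and not part of the talk:

```python
# Illustrative sketch of the TAL task interface (names are hypothetical).
# Input: a long untrimmed video; output: scored temporal segments with labels.
from dataclasses import dataclass


@dataclass
class ActionSegment:
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    label: str     # predicted activity class
    score: float   # detection confidence, assumed in [0, 1]


def to_seconds(mm_ss: str) -> float:
    """Convert an 'M:SS' timestamp to seconds."""
    minutes, seconds = mm_ss.split(":")
    return 60 * int(minutes) + int(seconds)


# The example predictions from the earlier slide, with made-up scores:
predictions = [
    ActionSegment(to_seconds("1:20"), to_seconds("1:32"), "Long Jump", 0.91),
    ActionSegment(to_seconds("1:43"), to_seconds("1:59"), "Long Jump", 0.87),
]

for seg in predictions:
    print(f"{seg.label}: [{seg.start:.0f}s, {seg.end:.0f}s] (score {seg.score:.2f})")
```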
Temporal Action Localization
Our goal is to encourage the development of
automated systems to recognize and
localize human activities in videos
Generic Event Boundary in TAL
Examples in TAL datasets (action change)
Collected from ANET [4]
Generic Event Boundary in TAL
Examples in TAL datasets (new subject)
Collected from ANET [4]
Boundary Synthesis and Pretraining
[Figure: comparison of pre-training schemes. Vanilla pre-training: an encoder followed by a classifier over action labels (Long Jump, Zumba, Cricket, ...). BSP pre-training: two clips are concatenated and the encoder's classifier predicts the synthesized boundary type (same-class, diff-class, diff-speed). Integration of BSP: the pre-trained encoder supplies features as TAL input; figure adapted from G-TAD [5].]
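The BSP pretext task can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, splice rule, and the diff-speed resampling factor are all my own choices, while the three boundary classes come from the slide:

```python
# Hedged sketch of BSP boundary synthesis (not the paper's actual code).
# Two trimmed clips are spliced to create an artificial temporal boundary,
# and the self-supervised label is the boundary type from the slide:
# same-class, diff-class, or diff-speed.
import numpy as np

BOUNDARY_CLASSES = ["same-class", "diff-class", "diff-speed"]


def synthesize_boundary(clip_a, clip_b, boundary_type, rng):
    """Splice two clips (T x H x W x C frame arrays) at a random point.

    - same-class : clip_b is another clip of the same action as clip_a
    - diff-class : clip_b comes from a different action class
    - diff-speed : clip_b is clip_a resampled at a different frame rate
    """
    if boundary_type == "diff-speed":
        clip_b = clip_a[::2]  # e.g. 2x playback speed of the same clip
    t = rng.integers(1, len(clip_a))  # random splice point inside clip_a
    spliced = np.concatenate([clip_a[:t], clip_b], axis=0)
    label = BOUNDARY_CLASSES.index(boundary_type)
    return spliced, label


# Toy stand-ins for decoded video frames (16 frames of 8x8 RGB):
rng = np.random.default_rng(0)
clip_a = np.zeros((16, 8, 8, 3))
clip_b = np.ones((16, 8, 8, 3))
frames, label = synthesize_boundary(clip_a, clip_b, "diff-class", rng)
```

A classifier trained to predict `label` from `frames` never needs manual boundary annotations, which is the point of the BSP pretext task.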
• Backbone pre-training on the GEBD dataset
• Advanced integration of boundary encoding
• Boundary-aware TAL solutions
Future Directions
References
[1] Lin, Tianwei, et al. "BSN: Boundary Sensitive Network for Temporal Action Proposal Generation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[2] Shou, Mike Zheng, et al. "Generic Event Boundary Detection: A Benchmark for Event Segmentation." arXiv preprint arXiv:2101.10511 (2021).
[3] Zhao, Hang, et al. "HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[4] Caba Heilbron, Fabian, et al. "ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[5] Xu, Mengmeng, et al. "G-TAD: Sub-Graph Localization for Temporal Action Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
Hi everyone. I am Mengmeng Xu. Thanks for coming to my talk. I will introduce our submission to the second track of the LOVEU workshop.
Most recent work focuses only on the development of the TAL head. In this talk, I will show that we can also improve the pre-training of the video encoder.