2. 2
Video Grounding?
Video Grounding tries to determine the temporal boundaries
of the video moment corresponding to the given sentence [1].
Natural Language Video Localization is retrieving a
specific temporal segment, or moment, from a video given a
natural language text description [2].
Video Moment Retrieval aims to extract a video moment
from the untrimmed video that best matches the query [3].
[1] Zhang, Zhu, et al. "Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding." NeurIPS 2020
[2] Hendricks, Lisa Anne, et al. "Localizing moments in video with natural language." ICCV 2017
[3] Zeng, Yawen, et al. "Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval." CVPR 2021
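
As a rough illustration of the task the three definitions above describe, the sketch below picks a moment from per-clip relevance scores and compares it with an annotated moment using temporal IoU, a commonly used measure of overlap between two time intervals. The scores, the 2-second clip length, the 0.5 threshold, and the names `best_moment` and `temporal_iou` are all assumptions made for illustration. This is not the method of any of the cited papers, which learn the query-video matching rather than thresholding given scores.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Moment, gold: Moment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def best_moment(clip_scores: List[float], clip_len: float,
                threshold: float = 0.5) -> Moment:
    """Pick the contiguous run of clips whose scores most exceed `threshold`.

    `clip_scores[i]` is a hypothetical query-to-clip relevance score for the
    i-th fixed-length clip of an untrimmed video (assumed to come from some
    model that is not shown here).
    """
    if not clip_scores:
        raise ValueError("clip_scores must not be empty")
    best_value = float("-inf")
    best_span = (0, 0)  # inclusive clip indices
    for start in range(len(clip_scores)):
        running = 0.0
        for end in range(start, len(clip_scores)):
            # Clips above the threshold add to the span's value,
            # clips below it subtract from it.
            running += clip_scores[end] - threshold
            if running > best_value:
                best_value = running
                best_span = (start, end)
    start, end = best_span
    return (start * clip_len, (end + 1) * clip_len)


# Toy example: six 2-second clips; the query matches clips 2 to 4 best.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
predicted = best_moment(scores, clip_len=2.0)
annotated = (5.0, 10.0)  # made-up ground-truth moment
print(predicted, round(temporal_iou(predicted, annotated), 3))
```

The input is an untrimmed video (here reduced to clip scores) plus a sentence, and the output is a single pair of start and end times, which is the shape of the problem regardless of which of the three names is used.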
3. 3
Keywords
Referring Expression Comprehension
Temporal Language Grounding
Natural Language Video Localization
Video Description Grounding
Phrase Localization
Natural Language Object Retrieval
Video Moment Retrieval
Temporal Video Grounding
All keywords (Grounding, Retrieval, Localization, Referring) are interchangeable!
4. 4
Natural Language Video Localization
Hendricks, Lisa Anne, et al. "Localizing moments in video with natural language." ICCV 2017
Chen, Jingyuan, et al. "Localizing natural language in videos." AAAI 2019
7. 7
Spatio-Temporal Video Grounding
Tang, Zongheng, et al. "Human-centric spatio-temporal video grounding with visual transformers." TCSVT 2021
Zhang, Zhu, et al. "Where does it exist: Spatio-temporal video grounding for multi-form sentences." CVPR 2020
Yamaguchi, Masataka, et al. "Spatio-temporal person retrieval via natural language queries." ICCV 2017
8. 8
Applications?
Lei, Jie, et al. "TVQA+: Spatio-temporal grounding for video question answering." ACL 2020
https://www.youtube.com/results?search_query=A+man+is+holding+a+woman+while+the+woman+is+spreading+her+arms+at+the+front+of+the+ship
Video Question Answering
Searching Videos in YouTube
“A man is holding a woman while the woman is spreading her arms at the front of the ship.”