Paper introduction: Temporal Sentence Grounding in Videos: A Survey and Future Directions
1. Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou
TPAMI
仁田智也 (Nagoya Institute of Technology)
2. Overview
■ Temporal Sentence Grounding in Videos (TSGV)
• Input
  • A video
  • A sentence (query)
• Output
  • The segment of the video relevant to the sentence
• Enables text-based search for a specific part of a video
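The input/output contract above can be sketched as a minimal interface. This is an illustrative stub, not any model from the survey; the type names (`TSGVExample`, `ground`) are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TSGVExample:
    frames: List[object]   # video frames or precomputed clip features
    query: str             # natural-language sentence

def ground(example: TSGVExample) -> Tuple[float, float]:
    """Placeholder grounding model: a real TSGV model scores candidate
    spans against the query; this stub just returns the whole video."""
    return (0.0, float(len(example.frames)))
```

A real model would replace the body of `ground` with the feature-extraction, fusion, and localization stages described in the later slides.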
1 INTRODUCTION
VIDEO has gradually become a major type of information
transmission media, thanks to the fast development and
innovation in communication and media creation technologies.
A video is formed from a sequence of continuous image frames
possibly accompanied by audio and subtitles. Compared to image
and text, video conveys richer semantic knowledge, as well as
more diverse and complex activities. Despite the strengths of
video, searching for content from the video is challenging. Thus,
there is a high demand for techniques that could quickly retrieve
video segments of user interest, specified in natural language.
1.1 Definition and History
Given an untrimmed video, temporal sentence grounding in videos
(TSGV) is to retrieve a video segment, also known as a temporal
moment, that semantically corresponds to a query in natural
language, i.e., a sentence. As illustrated in Fig. 1, for the query “A
person is putting clothes in the washing machine.”, TSGV needs
to return the start and end timestamps (i.e., 9.6s and 24.5s) of
a video moment from the input video as the answer. The answer
moment should contain actions or events described by the query.
Fig. 1. An illustration of temporal sentence grounding in videos (TSGV). [Figure: a TSGV model takes the video and the query “A person is putting clothes in the washing machine.” and returns the span from 9.6s to 24.5s.]
As a fundamental vision-language problem, TSGV connects computer vision (CV) and natural language processing (NLP) and benefits from the advancements made in both areas.
TSGV also shares similarities with some classical tasks in both
CV and NLP. For instance, video action recognition (VAR) [1]–
[4] in CV is to detect video segments, which perform specific
actions in video. Although VAR localizes temporal segments with
activity information, it is constrained by the predefined action
categories. TSGV is more flexible and aims to retrieve complicated
and diverse activities from video via arbitrary language queries.
In this sense, TSGV needs a semantic understanding of both
arXiv:2201.08071v3 [cs.CV] 13 Mar
ACCEPTED BY IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
Fig. 2. Statistics of the collected papers in this survey. Left: number of papers published each year (till September 2022). Right: distribution of papers by venue, where *ACL denotes the series of conferences hosted by the Association for Computational Linguistics. [Figure: left, a bar chart of papers per year for 2017–2022*; right, a pie chart of venues: CVPR 12%, AAAI 11%, ArXiv 11%, ACM MM 10%, *ACL 8%, TMM 6%, TIP 5%, ICCV 4%, ECCV 4%, SIGIR 4%, TPAMI 2%, NeurIPS 2%, other confs/jnls 21%.]
Number of published papers
3. Datasets
■ DiDeMo (Hendricks+, ICCV2017)
• Ground-truth annotation spans are fixed at 5 seconds
• Videos are distributed as pre-extracted features
■ Charades-STA (Gao+, ICCV2017)
• Average video length: 30 seconds
■ ActivityNet Captions (Krishna+, ICCV2017)
• Reuses a dataset built for Dense Video Captioning
■ TACoS (Regneri+, TACL2013)
• A modified version, TACoS 2DTAN (Zhang+, AAAI2020), is available
■ MAD (Soldan+, CVPR2022)
• Average video length: 110 minutes
• Average ground-truth span length: 4.1 seconds
Fig. 7. Statistics of query length and normalized moment length over
benchmark datasets.
query pairs. Average video and moment lengths are 286.59 and
6.10 seconds, and average query length is 10.05 words. Each
video in TACoS contains 148.17 annotations on average. We name
this dataset TACoS_org in Table 1. A modified version, TACoS_2DTAN,
is made available by Zhang et al. [18]. TACoS_2DTAN has 18,227
annotations. On average, there are 143.52 annotations per video.
The average moment length and query length after modification
are 27.88 seconds and 9.42 words, respectively.
MAD [121] is a large-scale dataset containing mainstream movies.
Query length in words
Ground-truth span length as a fraction of video duration
5. Evaluation Metrics
■ dR@n, IoU@m
• A refinement of R@n, IoU@m
• dR@n,IoU@m = (1/N_q) · Σ_{i=1}^{N_q} r(n, m, q_i) · α_i^s · α_i^e
• α_i^s and α_i^e are discount ratios
• α_i^* = 1 − |p_i^* − g_i^*|, where p_i^* and g_i^* are the normalized predicted and ground-truth timestamps
• Even when the IoU exceeds the threshold m, the score is discounted by the boundary deviation from the ground truth
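The metric above can be sketched in a few lines. This is a simplified reading of the formula, with timestamps normalized by the video length; function names are invented for the example, and the exact handling of which ranked prediction contributes the discount may differ from the paper.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal spans given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_hit(preds, gt, n, m, video_len):
    """r(n, m, q) * alpha_s * alpha_e for one query: 1 if any top-n
    prediction reaches IoU >= m, discounted by the normalized distance
    between that prediction's boundaries and the ground truth."""
    for p in preds[:n]:
        if temporal_iou(p, gt) >= m:
            a_s = 1.0 - abs(p[0] - gt[0]) / video_len
            a_e = 1.0 - abs(p[1] - gt[1]) / video_len
            return a_s * a_e
    return 0.0

def dR_at_n_iou_m(all_preds, all_gts, video_lens, n, m):
    """Average the discounted hit over all queries."""
    hits = [discounted_hit(p, g, n, m, length)
            for p, g, length in zip(all_preds, all_gts, video_lens)]
    return sum(hits) / len(hits)
```

A perfect prediction scores 1.0; a prediction that passes the IoU threshold but is offset from the ground truth is discounted toward 0.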
(a) An illustration of temporal IoU. [Figure: a ground-truth moment (g^s, g^e) compared against prediction moments (p^s, p^e).]
(b) An illustration of dR@n,IoU@m. [Figure: the boundary deviations |p^s − g^s| and |p^e − g^e| that determine the discount ratios.]
6. Methods
■ Taxonomy of TSGV methods
• Supervised learning
  • Proposal-based
  • Proposal-free
  • Reinforcement learning-based
  • Others
• Weakly-supervised learning
[Figure: taxonomy of TSGV methods.
TSGV Methods
├─ Supervised Methods
│   ├─ Proposal-based Methods: Sliding window-based, Proposal generated, Anchor-based (Standard anchor-based, 2D-Map-based)
│   ├─ Proposal-free Methods: Regression-based, Span-based
│   ├─ Reinforcement learning-based Methods
│   └─ Other Supervised Methods
└─ Weakly-supervised Methods: Multi-Instance Learning-based, Reconstruction-based, Other Weakly-supervised Methods]
7. Supervised Learning
1. Feature extraction
• Extract features with a dedicated extractor for each modality
2. Encoder
• Transform each modality's features so that they can be fused
3. Multimodal interactor
• Fuse the two modalities and reason over them jointly
4. Moment localizer
• Predict the target span
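The four stages above can be sketched as a toy pipeline. Everything here is illustrative (random stand-in features, mean-pooled query, a trivial window localizer); it only shows how the stages compose, not any specific method.

```python
import random

# Hypothetical four-stage supervised TSGV pipeline (all names illustrative).
def extract_features(frames, words, d=8):
    rng = random.Random(0)
    vis = [[rng.gauss(0, 1) for _ in range(d)] for _ in frames]
    txt = [[rng.gauss(0, 1) for _ in range(d)] for _ in words]
    return vis, txt

def encode(vis, txt):
    # a real encoder projects both modalities into a shared space
    return vis, txt

def interact(vis, txt):
    # fuse: score each frame by its dot product with the mean word feature
    d = len(txt[0])
    q = [sum(w[k] for w in txt) / len(txt) for k in range(d)]
    return [sum(f[k] * q[k] for k in range(d)) for f in vis]

def localize(frame_scores, win=2):
    # moment localizer: highest-scoring contiguous window of `win` frames
    best = max(range(len(frame_scores) - win + 1),
               key=lambda s: sum(frame_scores[s:s + win]))
    return best, best + win
```

The later slides differ mainly in how `interact` and `localize` are realized (proposals, regression, spans, or an RL agent).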
[Figure: the generic supervised TSGV pipeline for the query “The man speeds up then returns to his initial speed.”: (a) preprocessor, (b) visual/textual feature extractors, (c) visual/textual feature encoders, (d) feature interactor (multi-modal interaction / cross-modal reasoning), (e) answer predictor (moment localizer, with an optional proposal generator); stages 1–4 correspond to (b)–(e).]
9. Proposal-based Methods
■ Sliding Window (SW)
• Proposal module: sliding window
• Slides a window over the video in steps of ς frames
• CTRL (Gao+, ICCV2017)
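The SW strategy can be sketched as a simple enumeration. The window sizes and stride below are illustrative defaults, not values from any particular paper.

```python
def sliding_window_proposals(num_frames, window_sizes=(16, 32), stride=8):
    """Enumerate candidate (start, end) frame spans with a sliding window.
    `window_sizes` and `stride` (the per-step shift in frames) are
    illustrative; real methods tune them per dataset."""
    proposals = []
    for w in window_sizes:
        for start in range(0, max(num_frames - w, 0) + 1, stride):
            proposals.append((start, start + w))
    return proposals
```

Each candidate is then scored against the query (e.g., by CTRL's alignment head) and optionally refined by boundary regression.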
[Figure: Fig. 5(a), the sliding window (SW) strategy: overlapping windows over the frame sequence produce proposal candidates. From Fig. 5, illustration of sliding window, proposal generated, anchor-based, and 2D-Map strategies.]
[Figure: CTRL architecture for the query “The man speeds up then returns to his initial speed.”: a visual feature extractor (C3D) over proposal candidates and a textual feature extractor (SkipThoughts or WE+LSTM) feed a multimodal fusion module (concat + FC layers), which outputs an alignment score and a location regressor.]
CTRL (Gao+, ICCV2017)
10. Proposal-based Methods
■ Proposal Generation (PG)
• Proposal module: proposal generation module
• Generates candidate spans conditioned on the input video and the query
• QSPN (Xu+, AAAI2019)
[Figure: QSPN for the query “The man speeds up then returns to his initial speed.”: (a) a query-guided segment proposal network with a C3D visual feature extractor, a WE+LSTM textual feature extractor, convolutional blocks, temporal attention weights, max pooling, and 3D ROI pooling; (b) an early-fusion retrieval model built from LSTM chains and FC layers. Also from Fig. 5: the proposal generated (PG) strategy, where a proposal detector takes the video and the language query and outputs proposal candidates.]
QSPN (Xu+, AAAI2019)
11. Proposal-based Methods
■ Anchor-based (AN)
• Proposal module: anchors
• Proposes spans over the extracted feature sequence using anchors as references
• TGN (Chen+, EMNLP2018)
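Unlike a sliding window, the anchor-based strategy attaches multi-scale candidate spans to each time step of the feature sequence. A minimal sketch, with illustrative anchor widths (TGN's actual widths and scoring differ):

```python
def anchor_proposals(num_steps, anchor_widths=(2, 4, 8)):
    """Anchor-based strategy: at every time step t of the feature sequence,
    propose spans of several widths that all end at t, so candidates can be
    scored in a single sequential pass. Widths are illustrative."""
    proposals = []
    for t in range(1, num_steps + 1):
        for w in anchor_widths:
            if t - w >= 0:
                proposals.append((t - w, t))
    return proposals
```

Because every anchor ends at the current step, a recurrent model such as TGN can score all anchors at step t from its hidden state at t.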
[Figure: Fig. 5(c), the anchor-based strategy: multi-scale anchors placed over the visual feature sequence. Fig. 14, TGN architecture, reproduced from Chen et al. [15]: visual and textual feature extractors (encoder), interaction LSTMs (interactor), and a grounder, for the query “The man speeds up then returns to his initial speed.”]
TGN (Chen+, EMNLP2018)
12. Proposal-based Methods
■ 2D-Map Anchor-based (2D)
• Proposal module: 2D-map anchors
• Represents candidate spans on a 2D map indexed by start time and end time
• 2D-TAN (Zhang+, AAAI2020)
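The 2D-map idea can be sketched directly: cell (i, j) of the map stands for the moment from clip i to clip j, and only the upper triangle (j ≥ i) is valid. Function names are invented; 2D-TAN additionally sparsifies the map and runs convolutions over it.

```python
def build_2d_map(num_clips):
    """2D-Map strategy: cell (i, j) with j >= i represents the candidate
    moment starting at clip i and ending at clip j; cells below the
    diagonal are invalid."""
    return [[j >= i for j in range(num_clips)] for i in range(num_clips)]

def map_cell_to_span(i, j, clip_len):
    # convert map indices back to timestamps, assuming uniform clip length
    return i * clip_len, (j + 1) * clip_len
```

Laying proposals out as a 2D map lets adjacent candidates share context through 2D convolutions, which is the core design choice of 2D-TAN.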
[Figure: Fig. 5(d), the 2D-Map strategy: a 2D map whose axes are the start time index and the end time index; e.g., cell (1,5) is a possible proposal. Fig. 15, 2D-TAN architecture, reproduced from Zhang et al. [18]: a visual feature extractor (C3D or VGG) builds a 2D temporal feature map via max pooling, a WE+LSTM query encoding is fused by Hadamard product, and ConvNets over the map (the Temporal Adjacent Network) score each (start idx, end idx) cell.]
2D-TAN (Zhang+, AAAI2020)
13. Proposal-free Methods
■ Regression-based (RG)
• Regresses the start and end times (t^s, t^e) of the span
• Output: (t^s, t^e) ∈ ℝ²
• ABLR (Yuan+, AAAI2019)
[Figure: Fig. 16, ABLR architecture, reproduced from Yuan et al. [19]: C3D visual features and GloVe textual features encoded by BiLSTMs, multi-modal co-attention interaction, then attention-based coordinate regression (either attention-weight-based or attention-feature-based).]
ABLR (Yuan+, AAAI2019)
14. Proposal-free Methods
■ Span-based (SN)
• Estimates, for each frame, the probability of being the start or end of the span
• Output: (P^s, P^e) ∈ ℝ^{T×2}, where T is the number of frames
• VSLNet (Zhang+, ACL2020)
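Given the per-frame start/end probabilities, a standard decoding step is to pick the pair (s, e) with s ≤ e maximizing P^s[s] · P^e[e]. A minimal sketch (the function name is ours; VSLNet's training adds more on top of this):

```python
def best_span(p_start, p_end, max_len=None):
    """Pick (s, e), s <= e, maximizing P_s[s] * P_e[e]; a common decoding
    step for span-based methods. `max_len` optionally caps the span length."""
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        last = len(p_end) if max_len is None else min(len(p_end), s + max_len)
        for e in range(s, last):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

This brute-force search is O(T²); with a length cap it drops to O(T · max_len), which is how it is usually run in practice.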
[Figure: Fig. 17, VSLNet architecture, reproduced from Zhang et al. [23]: an I3D visual feature extractor and a GloVe textual feature extractor with shared feature encoders, context-query attention, query-guided highlighting, and a conditioned span predictor.]
VSLNet (Zhang+, ACL2020)
15. Reinforcement Learning-based Methods
■ Reinforcement Learning-based (RL)
• The agent receives rewards for expanding, shrinking, or shifting the predicted span
• RWM-RL (He+, AAAI2019)
• Adjusts the predicted span through a policy over 7 actions
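A 7-action window-adjustment policy of this kind can be sketched as a transition function. The action names and step size below are illustrative, not the exact RWM-RL action set.

```python
def apply_action(span, action, step=1.0, video_len=30.0):
    """One hypothetical 7-action space for RL-based TSGV: shift the whole
    window, move either boundary, or stop; the agent is rewarded when the
    new span overlaps the ground truth better."""
    s, e = span
    moves = {
        "shift_left":   (s - step, e - step),
        "shift_right":  (s + step, e + step),
        "extend_left":  (s - step, e),
        "extend_right": (s, e + step),
        "shrink_left":  (s + step, e),
        "shrink_right": (s, e - step),
        "stop":         (s, e),
    }
    ns, ne = moves[action]
    ns, ne = max(0.0, ns), min(video_len, ne)
    return (ns, ne) if ns < ne else (s, e)   # reject degenerate spans
```

Training then amounts to learning a policy over these actions (e.g., actor-critic in RWM-RL) with an IoU-based reward.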
DeNet [92] disentangles the query into relations and modified features and devises a debias mechanism to alleviate both query uncertainty and annotation bias issues.
There are also regression methods [93], [98], [99], [152] incorporating additional modalities from video to improve the localization performance. For instance, HVTG [98] and MARN [152] extract both appearance and motion features from video. In addition to appearance and motion, PMI [99] further exploits audio features from the video extracted by SoundNet [153]. DRFT [93] leverages the visual, optical flow, and depth flow features of video, and analyzes the retrieval results of different feature view combinations.
Fig. 18. Illustration of sequence decision making formulation in TSGV. [Figure: the environment (video + query “A person is putting clothes in the washing machine.”) and the agent (a deep model) exchange state value, reward, policy, and action.]
Fig. 19. RWM-RL architecture, reproduced from He et al. [24]. [Figure: a C3D visual feature encoder and a Skip-Thought textual feature encoder feed an observation network combining query, global visual, local visual, and location features through FC layers and a GRU; a multi-task actor-critic head outputs a softmax policy over the action space, a state value, and IoU and location regression.]
RWM-RL (He+, AAAI2019)
17. Weakly-supervised Learning
■ Multi-Instance Learning (MIL)
• A model trained with metric learning between text and video
• Can compute a correlation score between a video and a text
• TGA (Mithun+, CVPR2019)
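The metric-learning objective behind MIL-style weak supervision can be sketched as a triplet ranking loss over a joint video-text space. This is the generic loss, not TGA's exact formulation, and the helper names are ours.

```python
def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def triplet_ranking_loss(video_vec, pos_text, neg_text, margin=0.2):
    """A matched video-text pair should score higher than a mismatched one
    by at least `margin`; only video-level pairing is needed, no span labels."""
    return max(0.0, margin - cosine(video_vec, pos_text) + cosine(video_vec, neg_text))
```

After training, the same similarity score ranks temporal segments against the query, which is how weakly-supervised methods localize moments without span annotations.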
[Figure: Fig. 21, TGA architecture, reproduced from Mithun et al. [27]: a visual feature extractor (C3D or VGG) with text-guided attention and weighted pooling, a textual feature extractor (WE+GRU), and FC layers projecting both into a joint video-text space, for the query “The man speeds up then returns to his initial speed.”]
TGA (Mithun+, CVPR2019)
18. Weakly-supervised Learning
■ Reconstruction-based (REC)
• Selects the span whose captioning output best reconstructs the input query
• SCN (Zhao+, AAAI2020)
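The selection principle can be illustrated with a toy proxy: score each proposal by how well its caption reconstructs the query. This is only an analogy with invented names; SCN actually reconstructs masked query words with a decoder and uses the reconstruction loss as the reward.

```python
def reconstruction_score(proposal_caption, query):
    """Toy stand-in for reconstruction quality: fraction of query words
    recovered in the proposal's caption (SCN uses a learned decoder loss)."""
    q = query.lower().split()
    c = set(proposal_caption.lower().split())
    return sum(w in c for w in q) / len(q)

def select_proposal(proposals_with_captions, query):
    # keep the proposal whose caption best reconstructs the query
    return max(proposals_with_captions,
               key=lambda pc: reconstruction_score(pc[1], query))
```

The key property is that the supervision signal comes from the query itself, so no temporal annotations are needed.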
Fig. 22. WS-DEC architecture, reproduced from Duan et al. [29].
[Figure: Fig. 23, SCN architecture, reproduced from Lin et al. [30], for the query “The man speeds up then returns to his initial speed.”: a proposal generation module (C3D video features, GloVe embeddings, textual encoder, visual decoder) assigns confidence scores to proposal candidates; a semantic completion module (C3D + visual encoder, textual decoder) reconstructs a masked query (“The man <mask> up then <mask> to his initial speed.”) for the selected top-K proposals; the reconstruction loss yields top-K rewards and a rank loss (softmax) used to refine proposal selection.]
SCN (Zhao+, AAAI2020)
22. Discussion
■ Data uncertainty
• Annotations
  • Errors in annotated spans
  • Bias from individual annotators
  • Only one annotator per video
■ Data bias
• Word usage
  • The vocabulary used in queries is skewed
• Annotated spans
  • The train and test splits have similar distributions of annotated spans
[Table excerpt: selected results by category, e.g., AN: MAN [16] CVPR'19; 2D: TMN [70] ECCV'18, VLG-Net [75] ICCV'21; SN: L-Net [21] AAAI'19; RL: SM-RL [25] CVPR'19; the metric columns are not recoverable.]
[Figure: a timeline with temporal annotations 1–5 by different annotators, and three queries: #1 “the person washes their hands at the kitchen sink.”, #2 “a person is washing their hands in the sink.”, #3 “a person washes their hands in the kitchen sink.”]
Fig. 24. Illustration of data uncertainty in TSGV benchmarks. Annotation uncertainty means disagreements of annotated ground-truth moments across different annotators. Query uncertainty means various query expressions for one ground-truth moment.
As shown in Table 7, improving the feature extractor or exploiting more diverse features leads to better accuracy.
Moreover, we also conduct the efficiency comparison among different method categories by selecting one representative model from each category. The results and discussion are presented in Section A.1 in the Appendix.
6 CHALLENGES AND FUTURE DIRECTIONS
6.1 Critical Analysis
Data Uncertainty. Recent studies [92], [224] observe that data samples among current benchmark datasets are ambiguous…
Fig. 25. The top-30 frequent actions in Charades-STA and ActivityNet Captions datasets. [Figure: two panels, (a) Charades-STA and (b) ActivityNet Captions.]
Fig. 26. An illustration of moment distributions for Charades-STA and ActivityNet Captions, where “Start” and “End” axes represent the normalized start and end timestamps, respectively. The deeper the color, the larger the density (i.e., more annotations) in the dataset. [Figure: two panels, (a) Charades-STA and (b) ActivityNet Captions.]
To investigate the effects of distributional bias among existing TSGV methods, Yuan et al. [126] further re-organize the two benchmark datasets and develop the Charades-CD and ActivityNet-CD datasets. Each dataset contains two test sets, i.e., the independent-and-identically-distributed (iid) test set and the out-of-distribution (ood) test set (see Fig. 27). Then, Yuan et al. [126] collect a set of SOTA TSGV baselines and evaluate them on the reorganized benchmark datasets. Results show that baselines generally achieve impressive performance on the iid test set, but fail to generalize to the ood test set. It is worth noting that weakly-supervised methods are naturally immune to distributional bias since they do not require annotated samples for training.