Shresth Grover, Vibhav Vineet, Yogesh S Rawat, "Revealing the unseen: Benchmarking video action recognition under occlusion" NeurIPS2023
https://openreview.net/forum?id=1jrYSOG7DR
Paper introduction: Revealing the unseen: Benchmarking video action recognition under occlusion
1. Revealing the unseen:
Benchmarking video action
recognition under occlusion
Shresth Grover, Vibhav Vineet, Yogesh S Rawat, NeurIPS2023
福沢匠(名工大)
2023/11/30
11. CTx-Net
■ Attends only to the regions relevant to the action
■ Backbone: MViT [Fan+, ICCV2021]
■ Occluder kernels
• Indicate which features are occluded
• Trained also on images unrelated to the actions
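The occluder-kernel idea above can be sketched as follows. This is a minimal illustration only, assuming the kernels act as learned feature prototypes of occluders and that "how occluded" a feature is can be scored by cosine similarity; the function names and the hard mask are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def occlusion_scores(features, occluder_kernels):
    """Score how 'occluder-like' each spatial feature vector is.

    features: (N, D) array of D-dim feature vectors (e.g. flattened tokens).
    occluder_kernels: (K, D) array of K occluder prototypes.
    Returns: (N,) array in [-1, 1]; high values = likely occluded.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    k = occluder_kernels / (np.linalg.norm(occluder_kernels, axis=1, keepdims=True) + 1e-8)
    sim = f @ k.T              # (N, K) cosine similarity to each kernel
    return sim.max(axis=1)     # best match against any occluder kernel

def mask_occluded(features, occluder_kernels, threshold=0.5):
    """Zero out positions judged to be occluded, so downstream layers
    attend only to action-relevant regions (hard mask for clarity)."""
    scores = occlusion_scores(features, occluder_kernels)
    weights = np.where(scores > threshold, 0.0, 1.0)
    return features * weights[:, None]
```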
12. Experimental setup
■ Epochs: 15
■ Learning rate: 1e-4
■ Data augmentation
• Composite objects from Pascal VOC onto the videos
• The object classes used in the baseline evaluation are excluded
• Occluders move with linear motion
• Occlusion ratio: 20-40%
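The augmentation above can be sketched roughly as follows. The function name, the NumPy-only nearest-neighbour resize, and the velocity range are illustrative assumptions; in practice the occluder would be a Pascal VOC object crop rather than an arbitrary patch.

```python
import numpy as np

def add_moving_occluder(video, occluder, area_ratio=0.3, rng=None):
    """Composite a linearly moving occluder patch onto a video clip.

    video: (T, H, W, C) uint8 array; occluder: (h, w, C) uint8 array.
    area_ratio: target fraction of frame area covered (0.2-0.4 per the slide).
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, _ = video.shape
    # Scale the occluder so it covers roughly `area_ratio` of the frame.
    h, w = occluder.shape[:2]
    scale = np.sqrt(area_ratio * H * W / (h * w))
    oh = max(1, min(H, int(h * scale)))
    ow = max(1, min(W, int(w * scale)))
    # Nearest-neighbour resize in pure NumPy to stay dependency-free.
    ys = np.arange(oh) * h // oh
    xs = np.arange(ow) * w // ow
    patch = occluder[ys][:, xs]
    # Linear motion: random start position plus constant per-frame velocity.
    y0, x0 = rng.integers(0, H - oh + 1), rng.integers(0, W - ow + 1)
    vy, vx = rng.integers(-3, 4), rng.integers(-3, 4)
    out = video.copy()
    for t in range(T):
        y = int(np.clip(y0 + vy * t, 0, H - oh))
        x = int(np.clip(x0 + vx * t, 0, W - ow))
        out[t, y:y + oh, x:x + ow] = patch
    return out
```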
■ Evaluation metrics
• Absolute robustness
• Relative robustness
• A_c: accuracy on the clean data; A_p: accuracy on the occluded data
Paper excerpt (training setup and evaluation metric):

Except for VideoMAE, all employed models underwent fully supervised training on the training split of the K-400 dataset. For the evaluation on UCF-101, all models were initialized with weights derived from the preliminary K-400 training phase and sequentially fine-tuned on the training subset of the UCF-101 dataset. The VideoMAE model underwent self-supervised training on K-400 before subsequently transitioning to fully supervised training on the K-400 training segment. On UCF-101, weights originating from fully supervised training on the K-400 training subset were employed. Augmented models were initialized with pre-trained K-400 weights. Following the introduction of synthetic occlusions to each video, each model was fine-tuned in a fully supervised manner using the UCF-101 training split.

3.4 Evaluation metric
We use accuracy and robustness scores as the metrics for evaluation. The robustness metric is defined in two ways, relative and absolute [38]. The absolute robustness score is given as a_p = 1 - (A_c - A_p)/100, where a_p is the absolute robustness score for severity level p, A_c is the accuracy on the clean dataset, and A_p is the accuracy on the occluded dataset with severity level p. Relative robustness is defined as r_p = 1 - (A_c - A_p)/A_c, where r_p is the relative robustness score.
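The two robustness scores follow directly from the definitions above; checking against the R50 row of Table 1 (A_c = 89.4, A_p = 39.6) reproduces the reported a = 0.50 and r = 0.44.

```python
def absolute_robustness(acc_clean: float, acc_occluded: float) -> float:
    """a_p = 1 - (A_c - A_p) / 100, with accuracies in percent."""
    return 1.0 - (acc_clean - acc_occluded) / 100.0

def relative_robustness(acc_clean: float, acc_occluded: float) -> float:
    """r_p = 1 - (A_c - A_p) / A_c."""
    return 1.0 - (acc_clean - acc_occluded) / acc_clean

# R50 row of Table 1: UCF-101 89.4 -> UCF-101-O 39.6
print(round(absolute_robustness(89.4, 39.6), 2))  # 0.5
print(round(relative_robustness(89.4, 39.6), 2))  # 0.44
```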
13. Synthetic dataset
Table 1: Comparison of performance on the UCF-101, UCF-101-O and UCF-19-Y-OCC datasets.

Models      Aug   UCF-101      UCF-101-O               UCF-19-Y-OCC
                  Top-1-Acc    Top-1-Acc   a     r     Top-1-Acc   a     r
R50          ✗    89.4         39.6        0.50  0.44  50.9        0.61  0.57
R2P1D        ✗    88.3         48.9        0.61  0.55  49.3        0.61  0.55
X3D          ✗    91.2         62.4        0.71  0.68  56.3        0.65  0.62
I3D          ✗    89.1         59.4        0.70  0.67  52.9        0.64  0.59
MViT         ✗    93.5         81.9        0.89  0.88  64.3        0.71  0.69
VideoMAE     ✗    96.0         79.2        0.83  0.82  65.4        0.69  0.68
MViTv2       ✗    96.5         83.5        0.87  0.86  66.3        0.70  0.69
R2P1D        ✓    83.0         71.4        0.89  0.86  37.5        0.55  0.45
MViT         ✓    92.7         84.3        0.92  0.91  59.7        0.67  0.64
MViTv2       ✓    95.7         88.3        0.92  0.92  65.3        0.69  0.68
VideoMAE     ✓    95.8         87.1        0.91  0.91  64.2        0.68  0.67
CTx-Net      ✗    93.4         86.4        0.93  0.92  67.4        0.74  0.72
Table 2: Comparison of accuracy and robustness scores of the studied models on the K-400-O dataset.
CTx-Net2 uses an MViTv2 backbone and CTx-Net-mae uses VideoMAE as the pretrained backbone.
Dataset K-400 K-400-O
■ Transformers are more robust to occlusion than CNNs
■ Augmentation with synthetic data is effective against synthetic occlusion
■ CTx-Net is also somewhat effective against natural occlusion
14. Comparison of CNNs and Transformers
Figure 4: t-SNE feature analysis under occlusion comparing CNN and transformer architectures. Left two plots: R2P1D features on UCF-101 and UCF-101-O respectively; right two plots: MViT features on UCF-101 and UCF-101-O respectively.
■ Model features plotted in 3D with t-SNE
■ Transformer (MViT)
• Left: UCF-101
• Right: UCF-101-O
■ CNN (R2P1D [Tran+, CVPR2018])
• Left: UCF-101
• Right: UCF-101-O
15. Performance under different occlusion ratios
■ UCF-101-O
■ L0: 0%, L1: (0-20)%, L2: (20-40)%, L3: (40-60)%
■ S: static occlusion, D: dynamic occlusion
Table 2 rows:

Models        K-400       K-400-O
              Top-1-Acc   Top-1-Acc   a     r
MViT          76.4        54.4        0.78  0.71
MViTv2        77.6        56.7        0.79  0.73
VideoMAE      78.4        58.1        0.79  0.74
CTx-Net       75.9        58.4        0.79  0.77
CTx-Net2      77.4        59.7        0.82  0.77
CTx-Net-mae   76.8        58.2        0.81  0.76
Table 3: Performance comparison showing accuracy at different severity levels and occluder motions on the UCF-101-O dataset. Here L0 is 0%, L1 is (0-20)%, L2 is (20-40)%, and L3 is (40-60)% occlusion. S and D denote static and dynamic occluders respectively.

Models     L0     L1-S   L1-D   L2-S   L2-D   L3-S   L3-D   Avg Acc
R50        89.4   84.7   52.9   71.3   34.3   46.7   16.5   56.5
R2P1D      88.3   79.3   45.3   55.3   27.2   33.3   19.3   49.7
I3D        89.1   85.2   56.7   71.7   40.2   46.8   23.7   59.1
X3D        90.3   85.3   46.6   73.5   34.4   47.0   21.8   56.9
MViT       93.5   91.8   86.6   87.2   80.3   80.2   70.1   84.2
MViTv2     96.5   94.5   87.2   89.5   82.3   81.3   72.4   86.2
VideoMAE   96.0   94.7   87.8   91.2   80.1   81.5   66.7   85.4
CTx-Net    93.4   92.4   89.7   87.6   82.4   82.2   78.9   86.7
16. CTx-Net ablations
■ Pretraining on occluded data degrades performance
• The model learns occluders as cues needed for inference
• It can then no longer properly infer the occluded regions
Figure 6: Effect of pretrained weights on model performance on the UCF-101-O dataset. Left: accuracy, middle: relative robustness r, right: absolute robustness a.
Table 5: Ablations to study the effect of different components of CTx-Net on the UCF-101-O dataset.

Models                                     UCF-101     UCF-101-O
                                           Top-1-Acc   Top-1-Acc   a     r
CTx-Net (CNN)                              24.2        10.3        0.86  0.43
CTx-Net (spatial occluder kernels)         90.3        84.2        0.94  0.93
CTx-Net (w/o class encoding)               91.2        83.3        0.92  0.91
CTx-Net (pretrained with occluded data)    85.2        75.4        0.89  0.88
CTx-Net                                    93.4        86.4        0.93  0.92