Examples of actions
train/baking/0-2-8-2-0-8-1-6-3302820816_13 train/bending/-DczhmCwr38_40
MiT [Monfort+, TPAMI2019]
train/100/"Putting [something] and [something] on the table"
train/1/"Putting [something similar to other things that are already on the table]"
SSv2 [Goyal+, ICCV2017]
shaking hands/-5IoqELcxSU_000138_000148 robot dancing/-ADLGQ4KR0U_000019_000029
Kinetics [Kay+, arXiv2017]: shaking hands; robot dancing
SSv2: putting something down; putting down more of the same thing
MiT: cooking; working while bending over
From image recognition to video recognition
◼ Image recognition with CNNs
• LeNet [LeCun+, Proc. IEEE, 1998]
• AlexNet [Krizhevsky+, NIPS2012]
• VGG [Simonyan&Zisserman, ICLR2015]
• GoogLeNet / Inception [Szegedy+, CVPR2015]
• ResNet [He+, CVPR2016]
◼ Applying a 2D CNN to each video frame
• Reuses image recognition models
• Models temporal information only weakly
◼ Applying 3D CNNs to video
• C3D [Tran+, ICCV2015]
• I3D [Carreira&Zisserman, CVPR2017]
• 3D ResNet [Hara+, CVPR2018]
• SlowFast [Feichtenhofer+, ICCV2019]
• X3D [Feichtenhofer, CVPR2020]
• Increased computational cost
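The two trade-offs in the list above can be illustrated with a small back-of-the-envelope sketch. It is not taken from any of the cited papers. The layer sizes (64 channels, 3x3 kernels, 56x56 feature maps, 16-frame clips), the helper names, and the choice of averaging per-frame scores are all assumptions made for illustration. The sketch counts parameters and multiply-accumulates (MACs) for one convolutional layer, assuming padding that keeps the output size equal to the input size, and shows that averaged per-frame scores do not depend on frame order.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of one 2D conv layer (bias ignored)."""
    params = c_in * c_out * k * k
    macs = params * h_out * w_out
    return params, macs


def conv3d_cost(c_in, c_out, k, k_t, t_out, h_out, w_out):
    """Parameters and MACs of one 3D conv layer (bias ignored)."""
    params = c_in * c_out * k_t * k * k
    macs = params * t_out * h_out * w_out
    return params, macs


def average_frame_scores(frame_scores):
    """Hypothetical aggregation: mean of per-frame class scores."""
    return sum(frame_scores) / len(frame_scores)


# Hypothetical layer and clip sizes (assumptions, not from the slides).
T = 16  # frames per clip
p2d, m2d = conv2d_cost(64, 64, 3, 56, 56)
per_frame_macs = T * m2d  # 2D layer applied to every frame
p3d, m3d = conv3d_cost(64, 64, 3, 3, T, 56, 56)

# With a temporal kernel size of 3, both parameters and MACs grow 3x.
print("parameter ratio 3D/2D:", p3d // p2d)
print("MAC ratio 3D/per-frame 2D:", m3d // per_frame_macs)

# Averaged per-frame scores ignore frame order: playing the clip
# backwards gives the same video-level score.
scores = [0.1, 0.7, 0.4, 0.9]
reversed_scores = scores[::-1]
same = abs(average_frame_scores(scores)
           - average_frame_scores(reversed_scores)) < 1e-12
print("order-invariant:", same)
```

Under these assumptions the 3D layer costs three times the per-frame 2D layer, which is one concrete reading of "increased computational cost". The order invariance is one reason per-frame models capture temporal information only weakly.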
[Karpathy+, CVPR2014]
Figure 1: Explored approaches for fusing information over temporal dimension through the network. Red, green and blue boxes indicate convolutional, normalization and pooling layers respectively. In the Slow Fusion model, the depicted columns share parameters.
[LeCun+, Proc. IEEE, 1998]
[Feichtenhofer, CVPR2020]