SlowFast Networks for Video Recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
ICCV2019
仁田智也 (Tamaki Lab, Nagoya Institute of Technology)
Introduction
■ Proposal of the SlowFast model
• A two-stream model
• The two streams take inputs with different numbers of frames
• Lighter than existing models
• End-to-end
• Improved recognition accuracy
■ Evaluated benchmarks
• Kinetics-400 (Kay+, arXiv2017)
• Kinetics-600 (Carreira+, arXiv2018)
• Charades (Sigurdsson+, ECCV2016)
• AVA (Gu+, CVPR2018)
Facebook AI Research (FAIR)
Abstract
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast.
1. Introduction
It is customary in the recognition of images I(x, y) to
treat the two spatial dimensions x and y symmetrically. This
is justified by the statistics of natural images, which are to
a first approximation isotropic—all orientations are equally
Figure 1. A SlowFast network has a low frame rate, low temporal
resolution Slow pathway and a high frame rate, α× higher temporal
resolution Fast pathway. The Fast pathway is lightweight by using
a fraction (β, e.g., 1/8) of channels. Lateral connections fuse them.
For example, waving hands do not change their identity as
“hands” over the span of the waving action, and a person
is always in the “person” category even though he/she can
transit from walking to running. So the recognition of the cat-
egorical semantics (as well as their colors, textures, lighting
etc.) can be refreshed relatively slowly. On the other hand,
Method
■ Slow pathway
• Processes 1/τ of the input frames (see the frame-sampling sketch below)
• Early convolutions are 2D (no temporal kernels)
• No temporal downsampling
• Output via global average pooling (GAP)
■ Fast pathway
• Processes α× as many frames as Slow
• Channel count is β× that of Slow (0 < β < 1)
• Connected to Slow at every stage
• Similar in spirit to existing two-stream models
• No temporal downsampling
• Output via GAP
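To make the dual-rate sampling concrete, here is a minimal sketch (Python/PyTorch, not the authors' code) of how one raw clip could be split into the two pathway inputs, assuming T_raw = 64, τ = 16 and α = 8 as in the paper's example instantiation (Table 1):

```python
import torch

def sample_pathway_inputs(clip: torch.Tensor, tau: int = 16, alpha: int = 8):
    """Split one raw clip into the Slow and Fast pathway inputs.

    clip: (C, T_raw, H, W) video tensor, e.g. T_raw = 64 frames.
    The Slow pathway keeps every tau-th frame (low frame rate);
    the Fast pathway keeps alpha times as many frames (every tau/alpha-th frame).
    """
    slow = clip[:, ::tau]            # (C, T_raw // tau, H, W)         -> 4 frames
    fast = clip[:, ::tau // alpha]   # (C, alpha * T_raw // tau, H, W) -> 32 frames
    return slow, fast

clip = torch.randn(3, 64, 224, 224)          # a 64-frame RGB clip
slow, fast = sample_pathway_inputs(clip)
print(slow.shape, fast.shape)                # [3, 4, 224, 224] and [3, 32, 224, 224]
```

With these values the Slow pathway sees 4 frames and the Fast pathway 32 frames of the same clip, matching the T and αT temporal sizes in Table 1.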
Method
■ Lateral connection variants (a sketch of the T-conv variant follows the diagram note below)
• Time-to-channel (TtoC): packs the α frames into the channel dimension
• Time-strided sampling (T-sample): samples features every α frames
• Time-strided convolution (T-conv): 3D convolution with temporal stride α
Connections go only in the Fast-to-Slow direction.
(Diagram: the TtoC, T-sample and T-conv variants each map a Fast feature of temporal length αT onto the Slow length T; T-conv does so with a strided convolution.)
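As a rough illustration of the T-conv variant, the sketch below (assuming PyTorch; the 5×1² kernel, 2βC output channels and temporal stride α follow the paper's description, everything else is an assumption) reduces a Fast feature map of temporal length αT to length T so it can be fused, here by concatenation, into the Slow pathway:

```python
import torch
import torch.nn as nn

class TConvLateral(nn.Module):
    """Sketch of a time-strided convolution (T-conv) lateral connection.

    Maps a Fast feature map (N, beta_C, alpha*T, H, W) to (N, 2*beta_C, T, H, W)
    with a 5x1x1 3D convolution of temporal stride alpha, so the result can be
    concatenated into the Slow pathway.
    """
    def __init__(self, beta_c: int, alpha: int = 8):
        super().__init__()
        self.conv = nn.Conv3d(beta_c, 2 * beta_c, kernel_size=(5, 1, 1),
                              stride=(alpha, 1, 1), padding=(2, 0, 0), bias=False)

    def forward(self, fast_feat: torch.Tensor) -> torch.Tensor:
        return self.conv(fast_feat)

# Example with res2-sized features (Table 1): Slow T=4, C=256; Fast alpha*T=32, beta*C=32.
fast_feat = torch.randn(1, 32, 32, 56, 56)
slow_feat = torch.randn(1, 256, 4, 56, 56)
fused = torch.cat([slow_feat, TConvLateral(beta_c=32, alpha=8)(fast_feat)], dim=1)
print(fused.shape)  # torch.Size([1, 320, 4, 56, 56])
```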
Experimental setup (classification)
■ Datasets
• Kinetics-400 (Kay+, arXiv2017)
• Kinetics-600 (Carreira+, arXiv2018)
• Charades (Sigurdsson+, ECCV2016)
■ Training
• Random 224×224 crops
• Random horizontal flips
• Sample T×τ frames from the original video
■ Inference
• Average 3×10 predictions per video (see the sketch below)
• Spatial axis: three 256×256 crops
• Temporal axis: sample 10 clips of T×τ frames
(Figure: keyframes from five videos in Charades, where multiple actions occur together. Charades (Sigurdsson+, ECCV2016).)
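A minimal sketch of that 30-view inference protocol (3 spatial crops × 10 temporal clips, averaged). The `model` argument, the uniform clip spacing and the crop placement are assumptions for illustration, not the authors' exact pipeline:

```python
import torch

def multi_view_inference(model, video: torch.Tensor,
                         num_clips: int = 10, num_crops: int = 3,
                         clip_len: int = 64, crop_size: int = 256):
    """Average predictions over 10 temporal clips x 3 spatial crops = 30 views.

    video: (C, T, H, W) tensor whose shorter spatial side is assumed to already
    be rescaled to crop_size. `model` maps a (1, C, clip_len, S, S) view to logits.
    """
    c, t, h, w = video.shape
    starts = torch.linspace(0, max(t - clip_len, 0), num_clips).long().tolist()
    if w >= h:  # three crops along the longer spatial axis
        offsets = [(0, int(x)) for x in torch.linspace(0, w - crop_size, num_crops).long()]
    else:
        offsets = [(int(y), 0) for y in torch.linspace(0, h - crop_size, num_crops).long()]

    scores = []
    for s in starts:
        clip = video[:, s:s + clip_len]
        for (y, x) in offsets:
            view = clip[:, :, y:y + crop_size, x:x + crop_size].unsqueeze(0)
            with torch.no_grad():
                scores.append(model(view).softmax(dim=-1))
    return torch.stack(scores).mean(dim=0)  # class probabilities averaged over 30 views
```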
Experimental results (classification)
model flow pretrain top-1 top-5 GFLOPs×views
I3D [5] ImageNet 72.1 90.3 108 × N/A
Two-Stream I3D [5] ✓ ImageNet 75.7 92.0 216 × N/A
S3D-G [61] ✓ ImageNet 77.2 93.0 143 × N/A
Nonlocal R50 [56] ImageNet 76.5 92.6 282 × 30
Nonlocal R101 [56] ImageNet 77.7 93.3 359 × 30
R(2+1)D Flow [50] ✓ - 67.5 87.2 152 × 115
STC [9] - 68.7 88.5 N/A × N/A
ARTNet [54] - 69.2 88.3 23.5 × 250
S3D [61] - 69.4 89.1 66.4 × N/A
ECO [63] - 70.0 89.4 N/A × N/A
I3D [5] ✓ - 71.6 90.0 216 × N/A
R(2+1)D [50] - 72.0 90.0 152 × 115
R(2+1)D [50] ✓ - 73.9 90.9 304 × 115
SlowFast 4×16, R50 - 75.6 92.1 36.1 × 30
SlowFast 8×8, R50 - 77.0 92.6 65.7 × 30
SlowFast 8×8, R101 - 77.9 93.2 106 × 30
SlowFast 16×8, R101 - 78.9 93.5 213 × 30
SlowFast 16×8, R101+NL - 79.8 93.9 234 × 30
Table 2. Comparison with the state-of-the-art on Kinetics-400.
In the last column, we report the inference cost with a single “view"
(temporal clip with spatial crop) × the numbers of such views used.
The SlowFast models are with different input sampling (T×τ) and
backbones (R-50, R-101, NL). “N/A” indicates the numbers are
not available for us.
model pretrain top-1 top-5 GFLOPs×views
I3D [3] - 71.9 90.1 108 × N/A
StNet-IRv2 RGB [21] ImgNet+Kin400 79.0 N/A N/A
SlowFast 4×16, R50 - 78.8 94.0 36.1 × 30
SlowFast 8×8, R50 - 79.9 94.5 65.7 ×30
SlowFast 8×8, R101 - 80.4 94.8 106 × 30
SlowFast 16×8, R101 - 81.1 95.1 213 × 30
SlowFast 16×8, R101+NL - 81.8 95.1 234 × 30
Table 3. Comparison with the state-of-the-art on Kinetics-600.
SlowFast models the same as in Table 2.
model pretrain mAP GFLOPs×views
CoViAR, R-50 [59] ImageNet 21.9 N/A
Asyn-TF, VGG16 [42] ImageNet 22.4 N/A
MultiScale TRN [62] ImageNet 25.2 N/A
Nonlocal, R101 [56] ImageNet+Kinetics400 37.5 544 × 30
STRG, R101+NL [57] ImageNet+Kinetics400 39.7 630 × 30
our baseline (Slow-only) Kinetics-400 39.0 187 × 30
SlowFast Kinetics-400 42.1 213 × 30
SlowFast, +NL Kinetics-400 42.5 234 × 30
SlowFast, +NL Kinetics-600 45.2 234 × 30
Table 4. Comparison with the state-of-the-art on Charades. All
our variants are based on T×τ = 16×8, R-101.
single-model, single-modality accuracy of 79.0%. Our vari-
ants show good performance with the best model at 81.8%.
SlowFast results on the recent Kinetics-700 [4] are in [11].
Note: NL denotes Non-local Neural Networks (Wang+, CVPR2018). In Tables 2-4, the upper rows are existing methods (with and without pretraining) and the lower rows are the proposed SlowFast models.
Experimental results (classification)
Adding optical flow to existing two-stream methods nearly doubles the computation, whereas SlowFast improves accuracy with far less computation.
(Figure 2. Accuracy/complexity tradeoff on Kinetics-400: Kinetics top-1 accuracy (%) vs. model capacity in GFLOPs for a single clip with 256² spatial size. SlowFast variants (2×32, 4×16, 8×8 on R-50; 4×16, 8×8, 16×8 on R-101) improve over their Slow-only counterparts by +1.7 to +3.4 points.)
Experimental results (classification)
Lateral connection ablation: fusion by sum vs. concat
lateral top-1 top-5 GFLOPs
Slow-only - 72.6 90.3 27.3
Fast-only - 51.7 78.5 6.4
SlowFast - 73.5 90.3 34.2
SlowFast TtoC, sum 74.5 91.3 34.2
SlowFast TtoC, concat 74.3 91.0 39.8
SlowFast T-sample 75.4 91.8 34.9
SlowFast T-conv 75.6 92.1 36.1
(a) SlowFast fusion: Fusing Slow and Fast pathways
with various types of lateral connections throughout
the network hierarchy is consistently better than the
Slow and Fast only baselines.
top-1 top-5 GFLOPs
Slow-only 72.6 90.3 27.3
β = 1/4 75.6 91.7 54.5
1/6 75.8 92.0 41.8
1/8 75.6 92.1 36.1
1/12 75.2 91.8 32.8
1/16 75.1 91.7 30.6
1/32 74.2 91.3 28.6
(b) Channel capacity ratio: Varying
values of β, the channel capacity ratio
of the Fast pathway to make SlowFast
lightweight.
Fast pathway spatial top-1 top-5 GFLOPs
RGB - 75.6 92.1 36.1
RGB, β=1/4 half 74.7 91.8 34.4
gray-scale - 75.5 91.9 34.1
time diff - 74.5 91.6 34.2
optical flow - 73.8 91.3 35.1
(c) Weaker spatial input to Fast pathway: Alter-
native ways of weakening spatial inputs to the Fast
pathway in SlowFast models. β=1/8 unless speci-
fied otherwise.
Table 5. Ablations on the Fast pathway design on Kinetics-400. We show top-1 and top-5 classification accuracy (%), as well as computational complexity measured in GFLOPs (floating-point operations, in # of multiply-adds ×10⁹) for a single clip input of spatial size 256². Inference-time computational cost is proportional to this, as a fixed number of 30 views is used. Backbone: 4×16, R-50.
Individual pathways. The first two rows in Table 5a show
the results for using the structure of one individual pathway
alone. The default instantiations of the Slow and Fast path-
way are very lightweight with only 27.3 and 6.4 GFLOPs,
32.4M and 0.53M parameters, producing 72.6% and 51.7%
Number of channels in the Fast pathway: effective even with fewer channels.
Input to the Fast pathway: gray-scale input is comparable to RGB while reducing computation (see the sketch below).
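For reference, the "weaker spatial input" variants of Table 5c can be pictured with a small sketch; the gray-scale conversion weights and the first-frame padding for the time-difference variant are assumptions, since the paper does not spell them out:

```python
import torch

def weaken_fast_input(fast: torch.Tensor, mode: str = "gray") -> torch.Tensor:
    """Illustrative versions of the weaker Fast-pathway inputs in Table 5c.

    fast: (3, T, H, W) RGB frames.
    'gray'      -> single-channel luminance frames
    'time_diff' -> frame differences (current frame minus previous frame)
    """
    if mode == "gray":
        # ITU-R BT.601 luma weights (an assumption; the paper does not specify them).
        w = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1, 1)
        return (fast * w).sum(dim=0, keepdim=True)
    if mode == "time_diff":
        prev = torch.cat([fast[:, :1], fast[:, :-1]], dim=1)  # repeat the first frame
        return fast - prev
    raise ValueError(mode)

frames = torch.rand(3, 32, 224, 224)
print(weaken_fast_input(frames, "gray").shape)       # torch.Size([1, 32, 224, 224])
print(weaken_fast_input(frames, "time_diff").shape)  # torch.Size([3, 32, 224, 224])
```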
Experimental setup (action detection)
■ Dataset
• AVA
• Number of classes: 60
• IoU threshold: 0.5 (see the IoU sketch below)
■ Training
• Input
• Spatial axis: 224×224
• Temporal axis: T×τ
■ Inference
• Resize to 256×256
• Pretrained on Kinetics
AVA (Gu+, CVPR2018)
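For readers unfamiliar with the metric, the IoU ≥ 0.5 matching criterion used by AVA's frame-level mAP can be illustrated with a tiny helper (illustrative only):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A predicted person box counts as correct only when IoU with a ground-truth box >= 0.5.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39, so this pair would not match
```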
Experimental results (action detection)
model flow video pretrain val mAP test mAP
I3D [20] Kinetics-400 14.5 -
I3D [20] ! Kinetics-400 15.6 -
ACRN, S3D [46] ! Kinetics-400 17.4 -
ATR, R50+NL [29] Kinetics-400 20.0 -
ATR, R50+NL [29] ! Kinetics-400 21.7 -
9-model ensemble [29] ! Kinetics-400 25.6 21.1
I3D [16] Kinetics-600 21.9 21.0
SlowFast Kinetics-400 26.3 -
SlowFast Kinetics-600 26.8 -
SlowFast, +NL Kinetics-600 27.3 27.1
SlowFast*, +NL Kinetics-600 28.2 -
Table 7. Comparison with the state-of-the-art on AVA v2.1. All
our variants are based on T×τ = 8×8, R101. Here “*” indicates a
version of our method that uses our region proposals for training.
Performance improves with the two-stream model; performance improves further by changing the IoU threshold.
Summary
■ Feeds the video to the network at two different frame rates
• A design close to the human visual pathway
■ Lighter than existing models, with improved performance
■ High performance end-to-end
• No optical flow or other hand-crafted inputs
■ Higher recognition accuracy than existing methods
■ Adjusting the input to the Fast pathway gives comparable accuracy at an even lower cost
Backup slides
SlowFast architecture
stage | Slow pathway | Fast pathway | output sizes T×S²
raw clip | - | - | 64×224²
data layer | stride 16, 1² | stride 2, 1² | Slow: 4×224²; Fast: 32×224²
conv1 | 1×7², 64, stride 1, 2² | 5×7², 8, stride 1, 2² | Slow: 4×112²; Fast: 32×112²
pool1 | 1×3² max, stride 1, 2² | 1×3² max, stride 1, 2² | Slow: 4×56²; Fast: 32×56²
res2 | [1×1², 64; 1×3², 64; 1×1², 256]×3 | [3×1², 8; 1×3², 8; 1×1², 32]×3 | Slow: 4×56²; Fast: 32×56²
res3 | [1×1², 128; 1×3², 128; 1×1², 512]×4 | [3×1², 16; 1×3², 16; 1×1², 64]×4 | Slow: 4×28²; Fast: 32×28²
res4 | [3×1², 256; 1×3², 256; 1×1², 1024]×6 | [3×1², 32; 1×3², 32; 1×1², 128]×6 | Slow: 4×14²; Fast: 32×14²
res5 | [3×1², 512; 1×3², 512; 1×1², 2048]×3 | [3×1², 64; 1×3², 64; 1×1², 256]×3 | Slow: 4×7²; Fast: 32×7²
(head) | global average pool, concat, fc | | # classes
Table 1. An example instantiation of the SlowFast network.
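A small consistency check of this instantiation (illustrative Python, not the released code): Fast-pathway widths are β times the Slow widths, and the temporal lengths T = 64/τ = 4 and αT = 32 are preserved through every stage:

```python
# Illustrative consistency check against Table 1, not the released code.
T_RAW, TAU, ALPHA, BETA = 64, 16, 8, 1 / 8

slow_channels = {"res2": 256, "res3": 512, "res4": 1024, "res5": 2048}
fast_channels = {stage: int(BETA * c) for stage, c in slow_channels.items()}

print(fast_channels)                        # {'res2': 32, 'res3': 64, 'res4': 128, 'res5': 256}
print(T_RAW // TAU, ALPHA * T_RAW // TAU)   # 4 32: Slow vs Fast temporal length at every stage
```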
Experimental results (classification): training from scratch
■ Backbone
• 3D R-50
■ Training recipes
• The recipe of Non-local Neural Networks (Wang+, CVPR2018)
• The recipe of this paper
• Better suited to training from scratch
model pre-train top-1 top-5 GFLOPs
3D R-50 [56] ImageNet 73.4 90.9 36.7
3D R-50, recipe in [56] - 69.4 88.6 36.7
3D R-50, our recipe - 73.5 90.8 36.7
Table 6. Baselines trained from scratch: Using the same network
structure as [56], our training recipe achieves comparable results
Experimental results (action detection): AVA v2.2
model flow video pretrain val mAP test mAP
SlowFast, 8×8 Kinetics-600 29.0 -
SlowFast, 16×8 Kinetics-600 29.8 -
SlowFast++, 16×8 Kinetics-600 30.7 -
SlowFast++, ensemble Kinetics-600 - 34.3
Table 8. SlowFast models on AVA v2.2. Here “++” indicates a
version of our method that is tested with multi-scale and horizontal
flipping augmentation. The backbone is R-101+NL and region
proposals are used for training.
5.1. Main Results
We compare with previous results on AVA in Table 7. An
interesting observation is on the potential benefit of using
optical flow (see column ‘flow’ in Table 7). Existing works
have observed mild improvements: +1.1 mAP for I3D in
[20], and +1.7 mAP for ATR in [29]. In contrast, our base-
line improves by the Fast pathway by +5.2 mAP (see Table 9
in our ablation experiments in the next section). Moreover,
two-stream methods using optical flow can double the com-
putational cost, whereas our Fast pathway is lightweight.
As system-level comparisons, our SlowFast model has
Hyperparameter tuning
(Figure 3 bar chart: per-category AP (%) on AVA over the 60 action classes, comparing Slow-only (19.0 mAP) and SlowFast (24.2 mAP); see the caption below.)
Figure 3. Per-category AP on AVA: a Slow-only baseline (19.0 mAP) vs. its SlowFast counterpart (24.2 mAP). The highlighted categories are the 5 highest absolute increases (black) or 5 highest relative increases ... of examples. Note that the SlowFast instantiation in this ablation ...
model T × τ α mAP
Slow-only, R-50 4×16 - 19.0
SlowFast, R-50 4×16 8 24.2
Table 9. AVA action detection baselines: Slow-only vs. SlowFast.
Finally, we create an ensemble of 7 models and submit it to the official test server for the ActivityNet challenge 2019 [1]. As shown in Table 8 this entry (SlowFast++, ensemble) achieved 34.3 mAP accuracy on the test set, ranking first in the AVA action detection challenge 2019. Further details on our winning solution are provided in the corresponding technical report [11].
5.2. Ablation Experiments
Per-class AP analysis with SlowFast.
Comparison of Slow-only vs. SlowFast.
More Related Content

What's hot

動作認識の最前線:手法,タスク,データセット
動作認識の最前線:手法,タスク,データセット動作認識の最前線:手法,タスク,データセット
動作認識の最前線:手法,タスク,データセットToru Tamaki
 
3D CNNによる人物行動認識の動向
3D CNNによる人物行動認識の動向3D CNNによる人物行動認識の動向
3D CNNによる人物行動認識の動向Kensho Hara
 
スパースモデリングによる多次元信号・画像復元
スパースモデリングによる多次元信号・画像復元スパースモデリングによる多次元信号・画像復元
スパースモデリングによる多次元信号・画像復元Shogo Muramatsu
 
【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision
【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision
【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-SupervisionDeep Learning JP
 
Towards Performant Video Recognition
Towards Performant Video RecognitionTowards Performant Video Recognition
Towards Performant Video Recognitioncvpaper. challenge
 
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language SupervisionDeep Learning JP
 
[DLHacks]StyleGANとBigGANのStyle mixing, morphing
[DLHacks]StyleGANとBigGANのStyle mixing, morphing[DLHacks]StyleGANとBigGANのStyle mixing, morphing
[DLHacks]StyleGANとBigGANのStyle mixing, morphingDeep Learning JP
 
画像処理ライブラリ OpenCV で 出来ること・出来ないこと
画像処理ライブラリ OpenCV で 出来ること・出来ないこと画像処理ライブラリ OpenCV で 出来ること・出来ないこと
画像処理ライブラリ OpenCV で 出来ること・出来ないことNorishige Fukushima
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Yusuke Uchida
 
StyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAStyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAKento Doi
 
ドメイン適応の原理と応用
ドメイン適応の原理と応用ドメイン適応の原理と応用
ドメイン適応の原理と応用Yoshitaka Ushiku
 
[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展Deep Learning JP
 
【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識Hirokatsu Kataoka
 
文献紹介:X3D: Expanding Architectures for Efficient Video Recognition
文献紹介:X3D: Expanding Architectures for Efficient Video Recognition文献紹介:X3D: Expanding Architectures for Efficient Video Recognition
文献紹介:X3D: Expanding Architectures for Efficient Video RecognitionToru Tamaki
 
20160724_cv_sfm_revisited
20160724_cv_sfm_revisited20160724_cv_sfm_revisited
20160724_cv_sfm_revisitedKyohei Unno
 
SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~
SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~
SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~SSII
 
最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向ohken
 
Optimizer入門&最新動向
Optimizer入門&最新動向Optimizer入門&最新動向
Optimizer入門&最新動向Motokawa Tetsuya
 
最適輸送入門
最適輸送入門最適輸送入門
最適輸送入門joisino
 

What's hot (20)

動作認識の最前線:手法,タスク,データセット
動作認識の最前線:手法,タスク,データセット動作認識の最前線:手法,タスク,データセット
動作認識の最前線:手法,タスク,データセット
 
3D CNNによる人物行動認識の動向
3D CNNによる人物行動認識の動向3D CNNによる人物行動認識の動向
3D CNNによる人物行動認識の動向
 
スパースモデリングによる多次元信号・画像復元
スパースモデリングによる多次元信号・画像復元スパースモデリングによる多次元信号・画像復元
スパースモデリングによる多次元信号・画像復元
 
【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision
【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision
【DL輪読会】Unpaired Image Super-Resolution Using Pseudo-Supervision
 
Towards Performant Video Recognition
Towards Performant Video RecognitionTowards Performant Video Recognition
Towards Performant Video Recognition
 
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
 
[DLHacks]StyleGANとBigGANのStyle mixing, morphing
[DLHacks]StyleGANとBigGANのStyle mixing, morphing[DLHacks]StyleGANとBigGANのStyle mixing, morphing
[DLHacks]StyleGANとBigGANのStyle mixing, morphing
 
画像処理ライブラリ OpenCV で 出来ること・出来ないこと
画像処理ライブラリ OpenCV で 出来ること・出来ないこと画像処理ライブラリ OpenCV で 出来ること・出来ないこと
画像処理ライブラリ OpenCV で 出来ること・出来ないこと
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
 
StyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNAStyleGAN解説 CVPR2019読み会@DeNA
StyleGAN解説 CVPR2019読み会@DeNA
 
ドメイン適応の原理と応用
ドメイン適応の原理と応用ドメイン適応の原理と応用
ドメイン適応の原理と応用
 
[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展
 
【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識
 
文献紹介:X3D: Expanding Architectures for Efficient Video Recognition
文献紹介:X3D: Expanding Architectures for Efficient Video Recognition文献紹介:X3D: Expanding Architectures for Efficient Video Recognition
文献紹介:X3D: Expanding Architectures for Efficient Video Recognition
 
20160724_cv_sfm_revisited
20160724_cv_sfm_revisited20160724_cv_sfm_revisited
20160724_cv_sfm_revisited
 
SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~
SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~
SSII2019OS: 深層学習にかかる時間を短くしてみませんか? ~分散学習の勧め~
 
最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向最適輸送の計算アルゴリズムの研究動向
最適輸送の計算アルゴリズムの研究動向
 
Optimizer入門&最新動向
Optimizer入門&最新動向Optimizer入門&最新動向
Optimizer入門&最新動向
 
最適輸送入門
最適輸送入門最適輸送入門
最適輸送入門
 

Similar to 文献紹介:SlowFast Networks for Video Recognition

文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...Toru Tamaki
 
文献紹介:Token Shift Transformer for Video Classification
文献紹介:Token Shift Transformer for Video Classification文献紹介:Token Shift Transformer for Video Classification
文献紹介:Token Shift Transformer for Video ClassificationToru Tamaki
 
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition InferenceToru Tamaki
 
文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action Recognition文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action RecognitionToru Tamaki
 
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...Toru Tamaki
 
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at ScaleToru Tamaki
 
文献紹介:Learning Video Stabilization Using Optical Flow
文献紹介:Learning Video Stabilization Using Optical Flow文献紹介:Learning Video Stabilization Using Optical Flow
文献紹介:Learning Video Stabilization Using Optical FlowToru Tamaki
 
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...Toru Tamaki
 
文献紹介:Video Transformer Network
文献紹介:Video Transformer Network文献紹介:Video Transformer Network
文献紹介:Video Transformer NetworkToru Tamaki
 
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video InpaintingToru Tamaki
 
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...Toru Tamaki
 
猫でも分かるVariational AutoEncoder
猫でも分かるVariational AutoEncoder猫でも分かるVariational AutoEncoder
猫でも分かるVariational AutoEncoderSho Tatsuno
 
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video ClassificationToru Tamaki
 
10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920 10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920 Nobuaki Oshiro
 
[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜
[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜
[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜Michiharu Niimi
 
プロファイラGuiを用いたコード分析 20160610
プロファイラGuiを用いたコード分析 20160610プロファイラGuiを用いたコード分析 20160610
プロファイラGuiを用いたコード分析 20160610HIDEOMI SUZUKI
 
SSII2014 チュートリアル資料
SSII2014 チュートリアル資料SSII2014 チュートリアル資料
SSII2014 チュートリアル資料Masayuki Tanaka
 
introduce "Stealing Machine Learning Models via Prediction APIs"
introduce "Stealing Machine Learning Models  via Prediction APIs"introduce "Stealing Machine Learning Models  via Prediction APIs"
introduce "Stealing Machine Learning Models via Prediction APIs"Isao Takaesu
 
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Parallel WaveNet: Fast High-Fidelity Speech SynthesisParallel WaveNet: Fast High-Fidelity Speech Synthesis
Parallel WaveNet: Fast High-Fidelity Speech Synthesisdhigurashi
 

Similar to 文献紹介:SlowFast Networks for Video Recognition (20)

文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
文献紹介:SegFormer: Simple and Efficient Design for Semantic Segmentation with Tr...
 
文献紹介:Token Shift Transformer for Video Classification
文献紹介:Token Shift Transformer for Video Classification文献紹介:Token Shift Transformer for Video Classification
文献紹介:Token Shift Transformer for Video Classification
 
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
文献紹介:Selective Feature Compression for Efficient Activity Recognition Inference
 
文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action Recognition文献紹介:Gate-Shift Networks for Video Action Recognition
文献紹介:Gate-Shift Networks for Video Action Recognition
 
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
文献紹介:2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Reco...
 
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
文献紹介:An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
 
文献紹介:Learning Video Stabilization Using Optical Flow
文献紹介:Learning Video Stabilization Using Optical Flow文献紹介:Learning Video Stabilization Using Optical Flow
文献紹介:Learning Video Stabilization Using Optical Flow
 
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
文献紹介:Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Re...
 
文献紹介:Video Transformer Network
文献紹介:Video Transformer Network文献紹介:Video Transformer Network
文献紹介:Video Transformer Network
 
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
文献紹介:Learnable Gated Temporal Shift Module for Free-form Video Inpainting
 
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
文献紹介:Extreme Low-Resolution Activity Recognition Using a Super-Resolution-Ori...
 
猫でも分かるVariational AutoEncoder
猫でも分かるVariational AutoEncoder猫でも分かるVariational AutoEncoder
猫でも分かるVariational AutoEncoder
 
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
 
10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920 10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920
 
[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜
[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜
[チュートリアル講演]画像データを対象とする情報ハイディング〜JPEG画像を利用したハイディング〜
 
プロファイラGuiを用いたコード分析 20160610
プロファイラGuiを用いたコード分析 20160610プロファイラGuiを用いたコード分析 20160610
プロファイラGuiを用いたコード分析 20160610
 
SSII2014 チュートリアル資料
SSII2014 チュートリアル資料SSII2014 チュートリアル資料
SSII2014 チュートリアル資料
 
Prelude to Halide
Prelude to HalidePrelude to Halide
Prelude to Halide
 
introduce "Stealing Machine Learning Models via Prediction APIs"
introduce "Stealing Machine Learning Models  via Prediction APIs"introduce "Stealing Machine Learning Models  via Prediction APIs"
introduce "Stealing Machine Learning Models via Prediction APIs"
 
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Parallel WaveNet: Fast High-Fidelity Speech SynthesisParallel WaveNet: Fast High-Fidelity Speech Synthesis
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
 

More from Toru Tamaki

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...Toru Tamaki
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...Toru Tamaki
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNetToru Tamaki
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A surveyToru Tamaki
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex ScenesToru Tamaki
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...Toru Tamaki
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video SegmentationToru Tamaki
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New HopeToru Tamaki
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...Toru Tamaki
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt TuningToru Tamaki
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in MoviesToru Tamaki
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICAToru Tamaki
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context RefinementToru Tamaki
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...Toru Tamaki
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...Toru Tamaki
 
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusionToru Tamaki
 
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous DrivingToru Tamaki
 
論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large MotionToru Tamaki
 
論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense PredictionsToru Tamaki
 
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understandingToru Tamaki
 

More from Toru Tamaki (20)

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
 
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
 
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
 
論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion
 
論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions
 
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
 

Recently uploaded

業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成Hiroshi Tomioka
 
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdfAWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdfFumieNakayama
 
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?akihisamiyanaga1
 
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)UEHARA, Tetsutaro
 
【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)
【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)
【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)Hiroki Ichikura
 
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdfクラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdfFumieNakayama
 
TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案
TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案
TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案sugiuralab
 
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineerYuki Kikuchi
 
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...博三 太田
 

Recently uploaded (9)

業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版) 2024年4月作成
 
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdfAWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
 
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
 
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
 
【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)
【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)
【早稲田AI研究会 講義資料】3DスキャンとTextTo3Dのツールを知ろう!(Vol.1)
 
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdfクラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
 
TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案
TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案
TataPixel: 畳の異方性を利用した切り替え可能なディスプレイの提案
 
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
 
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...
 

文献紹介:SlowFast Networks for Video Recognition

  • 1. !"#$%&'( )*($#+,'-.#+ /01*#-2*3#450(0#5 !"#$%&'" ()$*"+,)"&-)#./01&2$ (1,./3$+),4#1/516$7./81$9$,: 0) ;!!<=>?@ 仁田智也(名工大玉木研)
  • 2. !"#$%&'(#)%" nA6&B(1%+モデルの提案 • 2ストリームモデル • フレーム数を変えて入力 • 既存モデルよりも軽量 • エンドツーエンド • 認識率の向上 • 8$,)+$*%CD>>/E81FG./1#H$I=>?JK • 8$,)+$*%CL>>/E!1##)#$1G./1#H$I=>?MK • !"1#14)%/EA$:N#4%%&,G./O!!<=>?LK • P<P/EQNG./!<RS=>?MK Facebook AI Research (FAIR) Abstract We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast path- way, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improve- ments are pin-pointed as contributions by our SlowFast con- cept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/ facebookresearch/SlowFast. 1. Introduction It is customary in the recognition of images I(x, y) to treat the two spatial dimensions x and y symmetrically. This is justified by the statistics of natural images, which are to a first approximation isotropic—all orientations are equally T C H,W prediction High frame rate C αT C C αT αT βC βC βC T T T Low frame rate Figure 1. A SlowFast network has a low frame rate, low temporal resolution Slow pathway and a high frame rate, α× higher temporal resolution Fast pathway. The Fast pathway is lightweight by using a fraction (β, e.g., 1/8) of channels. Lateral connections fuse them. For example, waving hands do not change their identity as “hands” over the span of the waving action, and a person is always in the “person” category even though he/she can transit from walking to running. So the recognition of the cat- egorical semantics (as well as their colors, textures, lighting etc.) can be refreshed relatively slowly. On the other hand,
  • 3. Method
  Slow pathway:
  • processes 1/τ of the input frames
  • the early convolutions are 2D (no temporal extent)
  • no temporal downsampling
  • output via global average pooling (GAP)
  Fast pathway:
  • processes α× as many frames as Slow
  • uses β× the channels of Slow (0 < β < 1)
  • connected to the Slow pathway at every stage, as in existing two-stream models
  • no temporal downsampling
  • output via GAP
  (The slide also reproduces the paper's author list, abstract, and Figure 1.)
  A minimal code sketch of the two pathways follows below.
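  The following PyTorch-style sketch is only meant to make the α, β, and τ roles concrete. It is not the authors' implementation: the class name ToySlowFast, the toy single-conv stems, and all layer sizes are illustrative assumptions (the real model uses full 3D ResNets for both pathways).

```python
import torch
import torch.nn as nn

class ToySlowFast(nn.Module):
    """Minimal sketch of the SlowFast idea: the Slow pathway sees 1 frame out of
    every tau with C channels, the Fast pathway sees alpha x more frames with
    beta * C channels; both end in global average pooling (GAP)."""

    def __init__(self, num_classes=400, C=64, alpha=8, beta=1 / 8, tau=16):
        super().__init__()
        self.alpha, self.tau = alpha, tau
        fast_C = max(1, int(C * beta))
        # Toy single-conv stems standing in for the real 3D-ResNet pathways.
        # Slow: no temporal extent in the early kernel (2D-like, size 1 in time).
        self.slow = nn.Sequential(
            nn.Conv3d(3, C, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
            nn.ReLU(inplace=True),
        )
        # Fast: temporal kernel > 1, but beta x fewer channels, so it stays light.
        self.fast = nn.Sequential(
            nn.Conv3d(3, fast_C, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(C + fast_C, num_classes)

    def forward(self, clip):                              # clip: (B, 3, frames, H, W)
        slow_in = clip[:, :, ::self.tau]                  # 1 frame out of every tau
        fast_in = clip[:, :, ::self.tau // self.alpha]    # alpha x more frames
        s = self.slow(slow_in).mean(dim=(2, 3, 4))        # GAP over (T, H, W)
        f = self.fast(fast_in).mean(dim=(2, 3, 4))
        return self.fc(torch.cat([s, f], dim=1))          # concatenate, then classify

# Example: a 64-frame 224x224 clip -> Slow sees 4 frames, Fast sees 32.
logits = ToySlowFast()(torch.randn(1, 3, 64, 224, 224))
```

  In the actual network the Fast features are also fused into the Slow pathway at every stage through lateral connections (next slide).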
  • 4. Method: lateral connections
  • Time-to-channel (TtoC): packs the α frames into the channel dimension
  • Time-strided sampling (T-sample): takes one feature map out of every α frames
  • Time-strided convolution (T-conv): a 3D convolution with temporal stride α
  Connections run only in the Fast-to-Slow direction.
  Shape transforms shown on the slide: TtoC: {βC, αT} → {αβC, T}; T-sample: {βC, αT} → {βC, T}; T-conv: {βC, αT} → (convolution) → {2βC, T}.
  (The slide also reproduces an excerpt of the paper's first page.)
  A shape-level sketch of the three options follows below.
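  As a shape-level illustration of the three options, the snippet below transforms a Fast feature of shape (B, βC, αT, H, W) down to T time steps and fuses it with the Slow feature. The toy sizes, variable names, and the concatenation at the end are my assumptions; the {βC, αT} → {αβC / βC / 2βC, T} mappings follow the slide.

```python
import torch
import torch.nn as nn

B, C, T, H, W = 2, 64, 4, 56, 56
alpha, beta = 8, 1 / 8
betaC = int(C * beta)                            # Fast channel width

slow = torch.randn(B, C, T, H, W)                # Slow feature: T time steps
fast = torch.randn(B, betaC, alpha * T, H, W)    # Fast feature: alpha * T time steps

# (1) Time-to-channel (TtoC): fold each group of alpha frames into channels,
#     {betaC, alpha*T} -> {alpha*betaC, T}.
ttoc = fast.reshape(B, betaC, T, alpha, H, W).permute(0, 1, 3, 2, 4, 5)
ttoc = ttoc.reshape(B, alpha * betaC, T, H, W)

# (2) Time-strided sampling (T-sample): keep one Fast frame out of every alpha,
#     {betaC, alpha*T} -> {betaC, T}.
tsample = fast[:, :, ::alpha]

# (3) Time-strided convolution (T-conv): a 3D conv with temporal stride alpha,
#     {betaC, alpha*T} -> {2*betaC, T}.
tconv = nn.Conv3d(betaC, 2 * betaC, kernel_size=(5, 1, 1),
                  stride=(alpha, 1, 1), padding=(2, 0, 0))(fast)

# One-way fusion (Fast -> Slow), e.g. by channel concatenation into Slow:
fused = torch.cat([slow, tconv], dim=1)          # (B, C + 2*betaC, T, H, W)
print(ttoc.shape, tsample.shape, tconv.shape, fused.shape)
```

  The ablation on slide 8 finds T-conv to be the strongest of the three.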
  • 5. Experimental setup (classification)
  Datasets: Kinetics-400 (Kay+, arXiv 2017), Kinetics-600 (Carreira+, arXiv 2018), Charades (Sigurdsson+, ECCV 2016)
  Training:
  • random 224×224 crop
  • random horizontal flip
  • a T×τ clip sampled from the original video
  Inference:
  • average of 3×10 predictions per video
  • spatial: 3 crops of 256×256
  • temporal: 10 clips of T×τ frames sampled per video
  (The slide shows Charades keyframes: "Fig. 4. Keyframes from five videos in Charades. We see that actions occur together in Charades", Sigurdsson+, ECCV 2016.)
  A sketch of this 30-view inference follows below.
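  A rough sketch of this 3 × 10 = 30-view inference as plain score averaging is shown below; predict_video, its arguments, and the dummy classifier are illustrative placeholders, not the released evaluation code.

```python
import torch

def predict_video(model, video, T=8, tau=8, n_clips=10, crop=256):
    """Sketch of 30-view inference: softmax scores are averaged over
    10 uniformly spaced T*tau-frame clips x 3 spatial crops of crop x crop.
    Assumes `video` is (3, num_frames, crop, W) with W >= crop and
    num_frames >= T * tau; `model` maps a (1, 3, T*tau, crop, crop) clip to logits."""
    _, num_frames, _, W = video.shape
    clip_len = T * tau
    starts = torch.linspace(0, num_frames - clip_len, n_clips).long()
    scores = []
    for s in starts:
        clip = video[:, s:s + clip_len]
        for x0 in (0, (W - crop) // 2, W - crop):     # left / center / right crops
            view = clip[:, :, :, x0:x0 + crop]
            scores.append(model(view.unsqueeze(0)).softmax(dim=1))
    return torch.stack(scores).mean(dim=0)            # averaged class probabilities

# Toy usage with a dummy classifier standing in for a trained SlowFast model.
dummy_model = lambda clip: torch.zeros(clip.shape[0], 400)
probs = predict_video(dummy_model, torch.randn(3, 300, 256, 320))
```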
  • 6. Experimental results (classification)
  The slide reproduces the paper's Tables 2-4, with annotations separating the existing methods (with and without pretraining) from the proposed method:
  • Kinetics-400 (Table 2): without any pretraining, SlowFast reaches 75.6% (4×16, R50) to 79.8% (16×8, R101+NL) top-1 at 36.1 to 234 GFLOPs per view, surpassing ImageNet-pretrained baselines such as Nonlocal R101 (77.7%, 359 GFLOPs) and flow-based models such as Two-Stream I3D (75.7%).
  • Kinetics-600 (Table 3): SlowFast reaches 78.8% to 81.8% top-1, versus 79.0% for StNet-IRv2 RGB.
  • Charades (Table 4): with T×τ = 16×8, R-101, SlowFast reaches 42.1 mAP (Kinetics-400 pretraining), 42.5 with NL, and 45.2 with Kinetics-600 pretraining, versus 39.7 for STRG R101+NL and 39.0 for the Slow-only baseline.
  The slide also notes that sigmoid outputs are applied for Charades.
  • 7. Experimental results (classification)
  The slide shows Figure 2 of the paper: the accuracy/complexity trade-off on Kinetics-400 (top-1 accuracy vs. GFLOPs for a single clip of 256² spatial size) for Slow-only and SlowFast variants from 2×32, R50 up to 16×8, R101.
  • Across the variants, computation grows to nearly 9× that of the smallest model.
  • SlowFast improves top-1 accuracy by +1.7 to +3.4 points over the corresponding Slow-only models, i.e. higher recognition accuracy for a small amount of extra computation.
  • 8. Experimental results (classification): ablations on the Fast pathway (Table 5; Kinetics-400, backbone 4×16, R-50; top-1/top-5 accuracy and GFLOPs for a single 256² clip; inference uses a fixed 30 views).
  • Fusion method (Table 5a): Slow-only 72.6% and Fast-only 51.7% top-1; SlowFast without lateral connections 73.5%; every lateral connection type (TtoC sum 74.5%, TtoC concat 74.3%, T-sample 75.4%, T-conv 75.6%) is consistently better, with T-conv the best.
  • Channel capacity ratio β of the Fast pathway (Table 5b): top-1 stays between 74.2% and 75.8% for β from 1/32 to 1/4, so the Fast pathway remains effective even with very few channels; β = 1/8 (75.6%, 36.1 GFLOPs) is used.
  • Weaker spatial input to the Fast pathway (Table 5c): gray-scale input matches RGB (75.5% vs. 75.6% top-1) while reducing computation; time-difference and optical-flow inputs perform worse.
  • 9. Experimental setup (action detection)
  Dataset: AVA (Gu+, CVPR 2018)
  • 60 action classes
  • mAP evaluated at an IoU threshold of 0.5
  Training:
  • spatial input 224×224, temporal input T×τ frames
  Inference:
  • frames resized to 256×256
  • the backbone is pretrained on Kinetics
  A minimal check of the IoU ≥ 0.5 criterion is sketched below.
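  For reference, the IoU ≥ 0.5 matching criterion used by the mAP evaluation can be checked with a few lines of plain Python; the helper name and the (x1, y1, x2, y2) box convention are assumptions for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A predicted box is a true positive for the AVA-style mAP only if it overlaps
# an unmatched ground-truth box of the same class with IoU >= 0.5.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5)   # IoU = 1/3 -> False
```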
  • 10. Experimental results (action detection)
  Table 7 compares with the state of the art on AVA v2.1 (all SlowFast variants use T×τ = 8×8, R101; "*" uses the authors' region proposals for training):
  • prior work: I3D 14.5 / 15.6 val mAP (RGB / with flow), ACRN 17.4, ATR 20.0 / 21.7, a 9-model ensemble 25.6 (21.1 test), I3D pretrained on Kinetics-600 21.9;
  • SlowFast: 26.3 (Kinetics-400), 26.8 (Kinetics-600), 27.3 with NL (27.1 test), and 28.2 with the region proposals ("*").
  Slide notes: the two-stream model improves performance, and performance also improves when the IoU threshold is changed.
  • 13. SlowFast structure
  Table 1 of the paper: an example instantiation (kernels are written as {T×S², C}; the raw clip has 64 frames of 224², the Slow pathway samples with temporal stride 16, the Fast pathway with stride 2, i.e. α = 8, and the Fast channel width is β = 1/8 of the Slow width):
  stage      | Slow pathway                          | Fast pathway                        | output size T×S² (Slow / Fast)
  raw clip   | -                                     | -                                   | 64×224²
  data layer | stride 16, 1²                         | stride 2, 1²                        | 4×224² / 32×224²
  conv1      | 1×7², 64, stride 1, 2²                | 5×7², 8, stride 1, 2²               | 4×112² / 32×112²
  pool1      | 1×3² max, stride 1, 2²                | 1×3² max, stride 1, 2²              | 4×56² / 32×56²
  res2       | {1×1², 64; 1×3², 64; 1×1², 256}×3     | {3×1², 8; 1×3², 8; 1×1², 32}×3      | 4×56² / 32×56²
  res3       | {1×1², 128; 1×3², 128; 1×1², 512}×4   | {3×1², 16; 1×3², 16; 1×1², 64}×4    | 4×28² / 32×28²
  res4       | {3×1², 256; 1×3², 256; 1×1², 1024}×6  | {3×1², 32; 1×3², 32; 1×1², 128}×6   | 4×14² / 32×14²
  res5       | {3×1², 512; 1×3², 512; 1×1², 2048}×3  | {3×1², 64; 1×3², 64; 1×1², 256}×3   | 4×7² / 32×7²
  head       | global average pool, concatenate, fc (# classes)
  The stage pattern is restated as a small configuration sketch below.
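  A compact way to restate the stage pattern above (temporal kernel of the first bottleneck conv and the bottleneck width per pathway) is a small configuration dictionary; the key names are my own, the numbers are taken from Table 1.

```python
# Per-stage pattern from Table 1: (temporal kernel of the first bottleneck conv,
# bottleneck width). Slow keeps 2D-like convs (temporal kernel 1) until res4,
# Fast uses temporal kernel 3 everywhere with beta = 1/8 of the channels.
SLOWFAST_STAGES = {
    # stage: slow = (t_kernel, width), fast = (t_kernel, width), number of blocks
    "res2": {"slow": (1, 64),  "fast": (3, 8),  "blocks": 3},
    "res3": {"slow": (1, 128), "fast": (3, 16), "blocks": 4},
    "res4": {"slow": (3, 256), "fast": (3, 32), "blocks": 6},
    "res5": {"slow": (3, 512), "fast": (3, 64), "blocks": 3},
}

# Example check: the channel ratio is a constant beta = 1/8 at every stage.
assert all(cfg["fast"][1] * 8 == cfg["slow"][1] for cfg in SLOWFAST_STAGES.values())
```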
  • 14. Experimental results (classification)
  Backbone: 3D R-50.
  Training recipe (Table 6, trained from scratch, no ImageNet pretraining):
  • with the recipe of Non-local Neural Networks (Wang+, CVPR 2018): 69.4% top-1 from scratch (73.4% when ImageNet-pretrained)
  • with the recipe of this paper: 73.5% top-1 from scratch, comparable to the ImageNet-pretrained result, i.e. the recipe is well suited to training from scratch
  (Parts of Tables 5b and 5c are also reproduced on the slide.)
  • 15. Experimental results (action detection)
  Table 8 (AVA v2.2; backbone R-101+NL, region proposals used for training): SlowFast 8×8 reaches 29.0 val mAP, 16×8 reaches 29.8, and 30.7 with multi-scale and horizontal-flip test augmentation ("++"); an ensemble of 7 models reaches 34.3 test mAP, ranking first in the AVA action detection challenge 2019. The slide notes that these results involve hyperparameter tuning.
  The paper also notes that optical flow gave prior methods only mild gains (+1.1 mAP for I3D, +1.7 mAP for ATR) and doubles the computation, whereas the lightweight Fast pathway improves the baseline by +5.2 mAP.
  Per-class verification of SlowFast (Figure 3 and Table 9, T×τ = 4×16, R-50): Slow-only reaches 19.0 mAP versus 24.2 mAP for SlowFast with α = 8; the highlighted categories in Figure 3 are those with the 5 largest absolute and the 5 largest relative per-class AP increases.