!"#$%!&'()*+,%"-./0%#)12,&%
/)*%3//.4.&50%6.1&)%
751&*80+51.59
!"#$"%&#'()*%+#,*%&#-.%+#/*%&#0''12345
橋口凌大(名工大玉木研)
英語論文紹介2324643647
Overview
■ Trade-off in video recognition
  • 2D CNN: high efficiency, low performance
  • 3D CNN: low efficiency, high performance
■ Proposal of the Temporal Shift Module (TSM)
  • 2D-CNN-based, yet takes temporal information into account
■ Real-time, low-latency video recognition
[Embedded from the TSM paper: Table 4 (Something-Something-V2 results) and Figure 5, "TSM enjoys better accuracy-cost trade-off than I3D family and ECO family on Something-Something-V1 [14] dataset (GCN includes the cost of ResNet-50 RPN to generate region proposals)". Both are reproduced in full on the trade-off slide below.]
[Slide diagram: efficiency (high ↔ low) vs. performance (low ↔ high) axes illustrating the 2D CNN / 3D CNN trade-off]
!"#$$の計算量削減
n:8'99の計算量を減らす
• 9.%#@.E*@ =.D)@<#F9$G
HI*%+J&#'1KL234MN
• O'P#HQ.@A*+(*?"J&#O''1234MN
• 前半28'99
• 後半:8'99
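To make the ECO-style design above concrete, here is a minimal PyTorch-style sketch, not the official ECO implementation: the class name Eco2D3DSketch, the channel counts, and the tiny 2D/3D stacks are illustrative stand-ins for the BNInception trunk and 3D-ResNet18 part mentioned later in the tables.

```python
# Sketch of the ECO idea: a 2D CNN processes each sampled frame independently,
# the per-frame feature maps are stacked along time, and a small 3D CNN
# reasons over the stack. All layer sizes here are illustrative.
import torch
import torch.nn as nn

class Eco2D3DSketch(nn.Module):
    def __init__(self, num_classes=174, feat_ch=96):
        super().__init__()
        # hypothetical lightweight 2D head (stands in for the 2D trunk)
        self.net2d = nn.Sequential(
            nn.Conv2d(3, feat_ch, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        # hypothetical small 3D network (stands in for the 3D part)
        self.net3d = nn.Sequential(
            nn.Conv3d(feat_ch, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, video):              # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)       # (B*T, 3, H, W): 2D CNN per frame
        feat = self.net2d(frames)          # (B*T, C, H', W')
        feat = feat.view(b, t, *feat.shape[1:]).permute(0, 2, 1, 3, 4)  # (B, C, T, H', W')
        out = self.net3d(feat).flatten(1)  # temporal reasoning with 3D convs
        return self.fc(out)

logits = Eco2D3DSketch()(torch.randn(2, 8, 3, 112, 112))  # e.g. 8 sampled frames
```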
[Embedded figures: from Wang et al., Figure 2, "A spacetime non-local block" (feature maps shown with their tensor shapes, e.g. T×H×W×1024; softmax applied to each row; blue boxes are 1×1×1 convolutions; the embedded-Gaussian version with a 512-channel bottleneck), together with surrounding paper text relating the non-local block to self-attention; and from Zolfaghari et al., the ECO architecture figure (per-frame 2D Net H2d with weight sharing over K sampled frames, temporal stacking of the feature maps, and a 3D Net H3d producing the action prediction). The slide annotates the two parts with "2D" and "3D".]
Shifting 2D CNN features for video recognition
■ Shift variants
  • Offline: exchange part of the channels with the previous and next frames (bi-directional)
  • Online: pass part of the channels on to the next frame (uni-directional)
■ Insertion into the network (see the code sketch after this list)
  • In-place: the skip connection carries the shifted features
  • Residual: the skip connection carries the features before the shift
■ Residual insertion performs better
  • The features before and after the shift are added together
  • Temporal information is fused without destroying spatial information
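As a rough illustration of the offline bi-directional shift and the residual insertion described above, here is a minimal PyTorch-style sketch, not the authors' released code: the fold_div=8 split mirrors the paper's default partial-shift proportion, while the helper names (temporal_shift, ResidualTSMBlock) and the 3×3 convolution are illustrative.

```python
import torch
import torch.nn as nn

def temporal_shift(x, fold_div=8):
    """x: (B, T, C, H, W). Move C/fold_div channels to the next frame,
    C/fold_div channels to the previous frame, and keep the rest in place."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted channels
    return out

class ResidualTSMBlock(nn.Module):
    """Residual TSM: the shift happens inside the branch, the skip connection
    carries the un-shifted features, so spatial information is preserved."""
    def __init__(self, channels, n_frames):
        super().__init__()
        self.n_frames = n_frames
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (B*T, C, H, W), as in a 2D CNN
        bt, c, h, w = x.shape
        xt = x.view(bt // self.n_frames, self.n_frames, c, h, w)
        shifted = temporal_shift(xt).view(bt, c, h, w)
        return x + self.conv(shifted)      # residual insertion

y = ResidualTSMBlock(64, n_frames=8)(torch.randn(2 * 8, 64, 56, 56))
```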
[Embedded from the TSM paper: Figure 3, "Residual shift is better than in-place shift. In-place shift happens before a convolution layer (or a residual block). Residual shift fuses temporal information inside a residual branch."; Figure 1, "Temporal Shift Module (TSM) performs efficient temporal modeling by moving the feature map along the temporal dimension. It is computationally free on top of a 2D convolution.", with panels (a) the original tensor without shift, (b) offline temporal shift (bi-direction), and (c) online temporal shift (uni-direction); plus fragments of the paper's title and author block (Ji Lin, Chuang Gan, Song Han).]
Spatial shift in image recognition and its problems
■ Shift all feature channels instead of applying a spatial convolution [Wu+, CVPR2018] (a rough sketch follows this list)
■ We would like to apply the idea to video recognition as well
■ Problems with naively shifting everything
  • Increased latency
  • Spatial features can no longer be captured
[Embedded from the TSM paper: an excerpt of Sec. 2.3 "Efficient Neural Networks" (efficient 2D CNN design, neural architecture search, pruning/quantization/compression for deployment, and address shift as a hardware-friendly primitive for compact 2D CNNs), and two ablation plots: (a) latency overhead vs. shift proportion on P100/TX2/CPU, where a naive full shift has large overhead, and (b) accuracy of in-place vs. residual TSM vs. shift proportion, where a naive shift gives low accuracy; the paper's chosen proportion is marked as "Our Choice".]
[Embedded from Wu et al.: Figure 2, "Illustration of (a) spatial convolutions, (b) depth-wise convolutions and (c) shift. In (c), the 3x3 grids denote a shift matrix with a kernel size of 3. The lighted cell denotes a 1 at that position and white cells denote 0s.", and the surrounding text introducing the shift operation (zero FLOPs, zero parameters, interleaved with point-wise convolutions; orthogonal to model compression, tensor factorization and low-bit networks). The excerpt notes that a spatial convolution needs $M \times N \times D_K^2$ parameters and $M \times N \times D_K^2 \times D_F^2$ multiply-adds, both growing quadratically with the kernel size $D_K$, whereas a depth-wise convolution aggregates spatial information from a $D_K \times D_K$ patch within each channel.]
Datasets and baseline
■ Datasets where temporal information is important
  • Something-Something V1/V2 [Goyal+, ICCV2017]
  • Charades [Gunnar+, 2016]
  • Jester [Materzynska+]
■ Datasets where temporal information matters less
  • UCF101 [Soomro+, 2012]
  • Kinetics [Kay+, 2017]
  • HMDB51 [Kuehne+, 2011]
■ Baseline
  • Temporal Segment Network (TSN) [Wang+, ECCV2016]
[Example frames from Something-Something]
Effect of TSM on each dataset
■ Large gains on the "More Temporal" datasets
■ Gains also on the "Less Temporal" datasets
Table 1. Our method consistently outperforms 2D counterparts on multiple datasets
at zero extra computation (protocol: ResNet-50, 8-frame input, 10 clips for
Kinetics, 2 for others, full resolution). Acc1 / Acc5 and the Acc1 gain:

Less Temporal
  Kinetics      TSN 70.6 / 89.2   Ours 74.1 / 91.2   (+3.5)
  UCF101        TSN 91.7 / 99.2   Ours 95.9 / 99.7   (+4.2)
  HMDB51        TSN 64.7 / 89.9   Ours 73.5 / 94.3   (+8.8)
More Temporal
  Something V1  TSN 20.5 / 47.5   Ours 47.3 / 76.2   (+28.0)
  Something V2  TSN 30.4 / 61.0   Ours 61.7 / 87.4   (+31.3)
  Jester        TSN 83.9 / 99.6   Ours 97.0 / 99.9   (+11.7)
Table 3. TSM consistently improves the performance over different backbones on the Kinetics dataset.

  Backbone   Mb-V2   R-50   RX-101   NL R-50
  TSN        66.5    70.7   72.4     74.6
  TSM        69.5    74.1   76.3     75.7
  ΔAcc.      +3.0    +3.4   +3.9     +1.1
Performance improves even for models that already consider temporal information
■ Gains over every baseline backbone
■ Also outperforms models that already perform temporal modeling
■ TSM has strong temporal-information modeling capability
')&*との比較
n28ベースラインを大幅に改善
n:8の手法より性能が良い
Table 2. Comparing TSM against other methods on the Something-Something dataset
(center crop, 1 clip/video unless otherwise specified).

Model                           Backbone           #Frame     FLOPs/Video  #Param.  Val Top-1  Val Top-5  Test Top-1
TSN [58]                        BNInception        8          16G          10.7M    19.5       -          -
TSN (our impl.)                 ResNet-50          8          33G          24.3M    19.7       46.6       -
TRN-Multiscale [58]             BNInception        8          16G          18.3M    34.4       -          33.6
TRN-Multiscale (our impl.)      ResNet-50          8          33G          31.8M    38.9       68.1       -
Two-stream TRN RGB+Flow [58]    BNInception        8+8        -            36.6M    42.0       -          40.7
ECO [61]                        BNIncep+3D Res18   8          32G          47.5M    39.6       -          -
ECO [61]                        BNIncep+3D Res18   16         64G          47.5M    41.4       -          -
ECOEnLite [61]                  BNIncep+3D Res18   92         267G         150M     46.4       -          42.3
ECOEnLite RGB+Flow [61]         BNIncep+3D Res18   92+92      -            300M     49.5       -          43.9
I3D from [50]                   3D ResNet-50       32×2clip   153G×2       28.0M    41.6       72.2       -
Non-local I3D from [50]         3D ResNet-50       32×2clip   168G×2       35.3M    44.4       76.0       -
Non-local I3D + GCN [50]        3D ResNet-50+GCN   32×2clip   303G×2       62.2M    46.1       76.8       45.0
TSM                             ResNet-50          8          33G          24.3M    45.6       74.2       -
TSM                             ResNet-50          16         65G          24.3M    47.2       77.1       46.0
TSMEn                           ResNet-50          24         98G          48.6M    49.7       78.5       -
TSM RGB+Flow                    ResNet-50          16+16      -            48.6M    52.6       81.9       50.7
(Rows grouped on the slide as: 2D-based methods / 3D-based methods / proposed method)
Trade-off
■ Comparison on Something-Something V1
  • 3× less computation than ECO
  • 6× less computation than NL-I3D [Wang+, CVPR2018]
Table 4. Results on Something-Something-V2. Our TSM achieves state-of-the-art performance.

Method                Val Top-1  Val Top-5  Test Top-1  Test Top-5
TSN (our impl.)       30.0       60.5       -           -
MultiScale TRN [58]   48.8       77.6       50.9        79.3
2-Stream TRN [58]     55.5       83.1       56.2        83.2
TSM8F                 59.1       85.6       -           -
TSM16F                63.4       88.5       64.3        89.6
TSM RGB+Flow          66.0       90.5       66.6        91.3
[Figure 5. TSM enjoys better accuracy-cost trade-off than the I3D family and the ECO family on the Something-Something-V1 [14] dataset. Accuracy (%) is plotted against FLOPs/Video (G); circle area indicates the number of parameters; GCN includes the cost of the ResNet-50 RPN used to generate region proposals.]
Table 5. TSM enjoys low GPU inference latency and high throughput. V/s means
videos per second, higher is better (measured on an NVIDIA Tesla P100 GPU).

Model            FLOPs  Param.  Latency   Throughput  Sth. Acc.  Kinetics Acc.
I3D from [50]    306G   35.3M   165.3ms   6.1 V/s     41.6%      -
ECO16F [61]      64G    47.5M   30.6ms    45.6 V/s    41.4%      -
I3D from [49]    33G    29.3M   25.8ms    42.4 V/s    -          73.3%
I3Dreplace       48G    33.0M   28.0ms    37.9 V/s    44.9%      -
TSM8F            33G    24.3M   17.4ms    77.4 V/s    45.6%      74.1%
TSM16F           65G    24.3M   29.0ms    39.5 V/s    47.2%      74.7%
[Paper excerpt] "... [34] to extract bounding boxes, whose cost is also considered in the chart. Note that the computation cost of optical flow extraction is usually larger than the video recognition model itself. Therefore, we do not report the FLOPs of two-stream based methods. We show the accuracy, FLOPs, and number of parameters trade-off in Figure 5. The accuracy is tested on the validation set of Something-Something-V1 dataset, and the number of parameters is indicated by the area of the circles. We can see that our TSM based methods have a better Pareto curve than both previous state-of-the-art efficient models (ECO based models) and high-performance models (non-local I3D based models). TSM models are both efficient and accurate."
Advantage of fast recognition
■ Recognition accuracy vs. observed portion of the video on UCF101
■ When only the first 10% of the video has been observed
  • TSM is about 7% more accurate than ECO
■ High accuracy from the very beginning of the observation
■ In the online setting, features from the previous frame can be used from the second frame onward (see the sketch after this list)
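A minimal sketch of how the online, uni-directional variant could be served frame by frame, assuming a per-layer cache of the shifted channels; the class OnlineShiftConv, the fold_div=8 split, and the caching layout are illustrative and are not taken from the authors' deployment code.

```python
import torch
import torch.nn as nn

class OnlineShiftConv(nn.Module):
    """At each time step, 1/fold_div of the channels is replaced by the cached
    features of the previous frame, so from the 2nd frame onward the model
    sees past information with negligible overhead."""
    def __init__(self, channels, fold_div=8):
        super().__init__()
        self.fold = channels // fold_div
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.cache = None                        # features kept from frame t-1

    @torch.no_grad()
    def forward(self, frame_feat):               # frame_feat: (1, C, H, W)
        shifted = frame_feat.clone()
        if self.cache is not None:
            shifted[:, :self.fold] = self.cache  # inject previous-frame channels
        # store the current frame's first channels for the next time step
        self.cache = frame_feat[:, :self.fold].clone()
        return self.conv(shifted)

layer = OnlineShiftConv(64).eval()
for t in range(4):                               # simulate a frame stream
    out = layer(torch.randn(1, 64, 56, 56))
```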
Table 6. Comparing the accuracy of offline TSM and online TSM on different
datasets. Online TSM brings negligible latency overhead.

Model     Latency  Kinetics  UCF101  HMDB51  Something
TSN       4.7ms    70.6%     91.7%   64.7%   20.5%
+Offline  -        74.1%     95.9%   73.5%   47.3%
+Online   4.8ms    74.3%     95.5%   73.6%   46.3%

[Figure 6. Early recognition on UCF101 (accuracy vs. observed portion of the video). TSM gives high prediction accuracy after only observing a small portion of the video, compared with ECO at s=8, 12, 20.]

[Paper excerpt] "... we replace every TSM primitive with 3 × 1 × 1 convolution and denote this model as I3Dreplace. ..."
Summary
■ Proposed the Temporal Shift Module (TSM)
  • Inserted into a 2D CNN, it enables modeling along the temporal dimension
  • No additional cost
    • Zero extra FLOPs
    • Zero extra parameters
■ Low-latency video recognition
  • Efficient and accurate
  • Enables low-latency video recognition on edge devices