Ji Lin, Chuang Gan, Song Han; TSM: Temporal Shift Module for Efficient Video Understanding, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7083-7093
https://openaccess.thecvf.com/content_ICCV_2019/html/Lin_TSM_Temporal_Shift_Module_for_Efficient_Video_Understanding_ICCV_2019_paper.html
2. Overview
■ Trade-off
  • 2D CNN: high efficiency, low performance
  • 3D CNN: low efficiency, high performance
■ Proposal: Temporal Shift Module (TSM)
  • 2D CNN based, yet takes temporal information into account
■ Real-time, low-latency video recognition
(Pasted from the paper: Table 4, results on Something-Something-V2, and Figure 5, accuracy vs. FLOPs/video trade-off on Something-Something-V1; both are also shown on slide 9 below.)
(Slide diagram: a trade-off chart with an "efficiency" axis (high/low) and a "performance" axis (low/high).)
3. Reducing the computation of 3D CNNs
■ Reduce the computational cost of 3D CNNs
  • Non-local module (NL) [Wang+, CVPR2018]
  • ECO [Zolfaghari+, ECCV2018]
    • 2D CNN in the early layers
    • 3D CNN in the later layers
(Pasted from Non-local Neural Networks [Wang+, CVPR2018]: an excerpt of Sec. 3 contrasting the non-local operation with fully-connected layers, and Figure 2, "A spacetime non-local block": a T×H×W×1024 input, 1×1×1 convolutions θ, φ, g with a 512-channel bottleneck, a softmax over the THW×THW affinity matrix, and an element-wise sum with the input; the embedded Gaussian version is shown.)
(Pasted from ECO [Zolfaghari+, ECCV2018]: the architecture figure, where a weight-shared 2D Net (H2d) processes each of the K sampled frames, the resulting feature maps are stacked along time, and a 3D Net (H3d) predicts the action.)
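The non-local block summarized above maps onto a few tensor operations. Below is a minimal PyTorch-style sketch of the embedded-Gaussian spacetime non-local block as drawn in the pasted figure (1×1×1 convolutions θ, φ, g with a 512-channel bottleneck on a 1024-channel input, a softmax-normalized THW×THW affinity, and a residual sum with the input); the class and variable names are mine, not the authors' released code.

import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Embedded-Gaussian spacetime non-local block (minimal sketch).

    Input/output shape: (B, C, T, H, W); the bottleneck uses C // 2 channels,
    matching the 1024 -> 512 example in the figure.
    """
    def __init__(self, channels: int = 1024):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        n = t * h * w
        theta = self.theta(x).reshape(b, -1, n)            # (B, C/2, THW)
        phi = self.phi(x).reshape(b, -1, n)                # (B, C/2, THW)
        g = self.g(x).reshape(b, -1, n)                    # (B, C/2, THW)
        # Pairwise affinity over all space-time positions, softmax per row.
        attn = torch.softmax(theta.transpose(1, 2) @ phi, dim=-1)   # (B, THW, THW)
        y = (attn @ g.transpose(1, 2)).transpose(1, 2)     # (B, C/2, THW)
        y = y.reshape(b, -1, t, h, w)
        return x + self.out(y)                             # residual connection

# Usage: a clip-level feature map of 8 frames at 14x14 with 1024 channels.
feat = torch.randn(2, 1024, 8, 14, 14)
print(NonLocalBlock3D(1024)(feat).shape)  # torch.Size([2, 1024, 8, 14, 14])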
4. Shifting 2D CNN features for video recognition
■ Shift directions
  • Offline: swap part of the feature channels with the previous and next frames
  • Online: pass part of the feature channels on to the next frame
■ Insertion into the network (see the sketch below)
  • In-place: the shifted features also pass through the skip connection
  • Residual: the pre-shift features pass through the skip connection
■ Residual works better
  • Adds the features from before and after the shift
  • Fuses temporal information without destroying spatial information
(Pasted from the paper: Figure 3, "Residual shift is better than in-place shift. In-place shift happens before a convolution layer (or a residual block). Residual shift fuses temporal information inside a residual branch."; and Figure 1, "Temporal Shift Module (TSM) performs efficient temporal modeling by moving the feature map along the temporal dimension": (a) the original tensor without shift, (b) offline bi-directional temporal shift with zero padding and truncation, (c) online uni-directional temporal shift. It is computationally free on top of a 2D convolution.)
5. Spatial shift in image recognition and its problems
■ Shift every feature map instead of convolving [Wu+, CVPR2018]
■ We would like to apply this idea to video recognition as well
■ Problems
  • Increased latency
  • Spatial features can no longer be captured properly
(Pasted from the paper, Sec. 2.3 "Efficient Neural Networks": a survey of efficient 2D CNN design, neural architecture search, and pruning/quantization/compression, ending with: "Address shift, which is a hardware-friendly primitive, has also been exploited for compact 2D CNN design on image recognition tasks [51, 57].")
(Pasted from the paper: Figure 4(a), latency overhead vs. shift proportion on P100/TX2/CPU, where naive shift (shifting all channels) incurs a large overhead while the chosen partial shift keeps it small; Figure 4(b), accuracy of in-place vs. residual TSM over shift proportions, where naive shift gives low accuracy and residual TSM outperforms both in-place TSM and the 2D baseline.)
(Pasted from the spatial-shift paper [Wu+, CVPR2018]: Figure 2, "Illustration of (a) spatial convolutions, (b) depth-wise convolutions and (c) shift. In (c), the 3x3 grids denote a shift matrix with a kernel size of 3. The lighted cell denotes a 1 at that position and white cells denote 0s."; and an excerpt: the shift operation moves each channel of its input tensor in a different spatial direction, and a shift-based module interleaves shifts with point-wise (1x1) convolutions to mix spatial information across channels; unlike spatial convolutions, the shift itself requires zero FLOPs and zero parameters. For a spatial convolution the number of parameters is M × N × D_K^2 and the computational cost is M × N × D_K^2 × D_F^2, both growing quadratically with the kernel size D_K; a depth-wise convolution instead aggregates spatial information from a D_K × D_K patch within each channel and is usually followed by a point-wise convolution.)
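As a quick sanity check on the cost formulas recovered above (a spatial convolution needs M × N × D_K^2 parameters and M × N × D_K^2 × D_F^2 multiply-accumulates), the short calculation below compares a spatial convolution, a depth-wise + point-wise pair, and shift + point-wise for one layer. The concrete sizes (M = N = 256, D_K = 3, D_F = 56) are illustrative values of mine, not taken from either paper.

# Cost comparison for one layer, following the formulas in the excerpt above:
#   spatial conv:       params = M*N*Dk^2,          FLOPs = M*N*Dk^2 * Df^2
#   depth-wise + 1x1:   params = M*Dk^2 + M*N,      FLOPs = (M*Dk^2 + M*N) * Df^2
#   shift + 1x1:        params = M*N (shift is free), FLOPs = M*N * Df^2
M = N = 256   # input / output channels (illustrative)
Dk = 3        # kernel size
Df = 56       # spatial size of the feature map

spatial = (M * N * Dk**2, M * N * Dk**2 * Df**2)
depthwise = (M * Dk**2 + M * N, (M * Dk**2 + M * N) * Df**2)
shift = (M * N, M * N * Df**2)

for name, (params, flops) in [("spatial", spatial), ("depthwise+1x1", depthwise), ("shift+1x1", shift)]:
    print(f"{name:14s} params={params:>10,d}  FLOPs={flops:>14,d}")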
9. Trade-off
■ Comparison on Something-Something v1
  • About 3× less computation than ECO
  • Several times less computation than NL I3D [Wang+, CVPR2018]
(Pasted from the paper:)

Table 4. Results on Something-Something-V2. Our TSM achieves state-of-the-art performance.
  Method              | Val Top-1 | Val Top-5 | Test Top-1 | Test Top-5
  TSN (our impl.)     | 30.0      | 60.5      | -          | -
  MultiScale TRN [58] | 48.8      | 77.6      | 50.9       | 79.3
  2-Stream TRN [58]   | 55.5      | 83.1      | 56.2       | 83.2
  TSM8F               | 59.1      | 85.6      | -          | -
  TSM16F              | 63.4      | 88.5      | 64.3       | 89.6
  TSM RGB+Flow        | 66.0      | 90.5      | 66.6       | 91.3

Figure 5. TSM enjoys better accuracy-cost trade-off than I3D family and ECO family on Something-Something-V1 [14] dataset. (GCN includes the cost of ResNet-50 RPN to generate region proposals.)

Table 5. TSM enjoys low GPU inference latency and high throughput. V/s means videos per second, higher the better (measured on an NVIDIA Tesla P100 GPU).
  Model          | FLOPs | Param. | Latency | Throughput | Sth. acc. | Kinetics acc.
  I3D from [50]  | 306G  | 35.3M  | 165.3ms | 6.1 V/s    | 41.6%     | -
  ECO16F [61]    | 64G   | 47.5M  | 30.6ms  | 45.6 V/s   | 41.4%     | -
  I3D from [49]  | 33G   | 29.3M  | 25.8ms  | 42.4 V/s   | -         | 73.3%
  I3Dreplace     | 48G   | 33.0M  | 28.0ms  | 37.9 V/s   | 44.9%     | -
  TSM8F          | 33G   | 24.3M  | 17.4ms  | 77.4 V/s   | 45.6%     | 74.1%
  TSM16F         | 65G   | 24.3M  | 29.0ms  | 39.5 V/s   | 47.2%     | 74.7%

Excerpt: "Note that the computation cost of optical flow extraction is usually larger than the video recognition model itself. Therefore, we do not report the FLOPs of two-stream based methods. We show the accuracy, FLOPs, and number of parameters trade-off in Figure 5. The accuracy is tested on the validation set of Something-Something-V1 dataset, and the number of parameters is indicated by the area of the circles. We can see that our TSM based methods have a better Pareto curve than both previous state-of-the-art efficient models (ECO based models) and high-performance models (non-local I3D based models). TSM models are both efficient and accurate."
10. Advantages of fast recognition
■ Accuracy on UCF101 as a function of the observed portion of the video
■ When only the first 10% of the video has been observed
  • TSM is already markedly more accurate than ECO
■ High accuracy from the earliest stage of observation
■ Online, features from the previous frame can be taken into account from the second frame onward
(Pasted from the paper:)

Table 6. Comparing the accuracy of offline TSM and online TSM on different datasets. Online TSM brings negligible latency overhead.
  Model    | Latency | Kinetics | UCF101 | HMDB51 | Something
  TSN      | 4.7ms   | 70.6%    | 91.7%  | 64.7%  | 20.5%
  +Offline | -       | 74.1%    | 95.9%  | 73.5%  | 47.3%
  +Online  | 4.8ms   | 74.3%    | 95.5%  | 73.6%  | 46.3%

Figure 6. Early recognition on UCF101. TSM gives high prediction accuracy after only observing a small portion of the video.
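Slide 10's last point, that online inference can use the previous frame's features from the second frame onward, corresponds to the uni-directional shift of Figure 1(c): each frame hands a fraction of its channels to the next frame through a cache instead of looking ahead. The sketch below is a minimal streaming version under that assumption; the function name and cache handling are mine, not the released online demo code.

from typing import Optional
import torch

def online_shift_step(frame_feat: torch.Tensor,
                      cache: Optional[torch.Tensor],
                      fold_div: int = 8):
    """Uni-directional temporal shift for streaming inference.

    frame_feat: (C, H, W) features of the current frame.
    cache: the first C/fold_div channels saved from the previous frame, or None at t=0.
    Returns the shifted features and the cache to carry to the next frame.
    """
    c = frame_feat.shape[0]
    fold = c // fold_div
    new_cache = frame_feat[:fold].clone()    # hand these channels to the next frame
    shifted = frame_feat.clone()
    if cache is None:                        # first frame: nothing to receive, pad with zeros
        shifted[:fold] = 0.0
    else:
        shifted[:fold] = cache               # receive channels from the previous frame
    return shifted, new_cache

# Usage: stream 4 frames through one shift layer.
cache = None
for t in range(4):
    feat = torch.randn(64, 56, 56)
    feat, cache = online_shift_step(feat, cache)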