Christoph Feichtenhofer; X3D: Expanding Architectures for Efficient Video Recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 203-213
https://openaccess.thecvf.com/content_CVPR_2020/html/Feichtenhofer_X3D_Expanding_Architectures_for_Efficient_Video_Recognition_CVPR_2020_paper.html
ConvMixer is a simple CNN-based model that achieves competitive results on ImageNet classification. It divides the input image into patches and embeds them into high-dimensional vectors, similar to ViT. Unlike ViT, however, it does not use attention; it applies simple convolutional layers between the patch embedding and classification layers. Experiments show that despite its simplicity, ConvMixer outperforms comparable ResNet, ViT, and MLP-Mixer models on ImageNet, suggesting that patch embeddings may be as important as attention mechanisms for vision tasks.
[DL Reading Group] Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation (Deep Learning JP)
1) The document proposes a simple method called Standardized Max Logits (SML) to detect unexpected road obstacles in semantic segmentation. SML normalizes the maximum logit values for each class to account for differences between in-distribution classes and better identify anomalies.
2) SML is combined with iterative boundary suppression and dilated smoothing techniques to gradually remove false positives and negatives, especially around boundaries.
3) Experiments on three datasets demonstrate SML achieves state-of-the-art performance in detecting anomalies without requiring retraining or additional out-of-distribution data, while maintaining efficient computation.
Paper introduction: TSM: Temporal Shift Module for Efficient Video Understanding (Toru Tamaki)
Ji Lin, Chuang Gan, Song Han; TSM: Temporal Shift Module for Efficient Video Understanding, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7083-7093
https://openaccess.thecvf.com/content_ICCV_2019/html/Lin_TSM_Temporal_Shift_Module_for_Efficient_Video_Understanding_ICCV_2019_paper.html
SAM (Segment Anything Model) is a segmentation model that can segment objects in images from prompts. It was trained on the SA-1B dataset of over 1.1 billion masks on 11 million images, collected with a model-in-the-loop approach. SAM uses a transformer-based architecture with an image encoder and encoders for prompts such as points, bounding boxes, masks, and text. It achieves strong zero-shot segmentation performance without any fine-tuning on target datasets.
Paper introduction: SlowFast Networks for Video Recognition (Toru Tamaki)
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He, SlowFast Networks for Video Recognition, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6202-6211
https://openaccess.thecvf.com/content_ICCV_2019/html/Feichtenhofer_SlowFast_Networks_for_Video_Recognition_ICCV_2019_paper.html
Paper introduction: Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Recognition (Toru Tamaki)
Chun-Fu Richard Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John Cohn, Aude Oliva, Quanfu Fan; Deep Analysis of CNN-Based Spatio-Temporal Representations for Action Recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6165-6175
https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Deep_Analysis_of_CNN-Based_Spatio-Temporal_Representations_for_Action_Recognition_CVPR_2021_paper.html
Paper introduction: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (Toru Tamaki)
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
https://proceedings.neurips.cc/paper/2021/hash/64f1f27bf1b4ec22924fd0acb550c235-Abstract.html
https://arxiv.org/abs/2105.15203
Paper introduction: 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition (Toru Tamaki)
Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, Larry S. Davis; 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6155-6164
https://openaccess.thecvf.com/content/CVPR2021/html/Li_2D_or_not_2D_Adaptive_3D_Convolution_Selection_for_Efficient_CVPR_2021_paper.html
Paper introduction: Token Shift Transformer for Video Classification (Toru Tamaki)
Hao Zhang, Yanbin Hao, Chong-Wah Ngo, Token Shift Transformer for Video Classification, ACM MM '21: Proceedings of the 29th ACM International Conference on Multimedia, October 2021, pp. 917-925. https://doi.org/10.1145/3474085.3475272
http://vireo.cs.cityu.edu.hk/papers/Hao_MM2021.pdf
http://arxiv.org/abs/2108.02432
https://dl.acm.org/doi/abs/10.1145/3474085.3475272
Paper introduction: Selective Feature Compression for Efficient Activity Recognition Inference (Toru Tamaki)
Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe; Selective Feature Compression for Efficient Activity Recognition Inference, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13628-13637
https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Selective_Feature_Compression_for_Efficient_Activity_Recognition_Inference_ICCV_2021_paper.html
Paper introduction: Gate-Shift Networks for Video Action Recognition (Toru Tamaki)
Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz; Gate-Shift Networks for Video Action Recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1102-1111
https://openaccess.thecvf.com/content_CVPR_2020/html/Sudhakaran_Gate-Shift_Networks_for_Video_Action_Recognition_CVPR_2020_paper.html
Paper introduction: Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy (Toru Tamaki)
Jaejun Yoo, Namhyuk Ahn, Kyung-Ah Sohn; Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8375-8384
https://openaccess.thecvf.com/content_CVPR_2020/html/Yoo_Rethinking_Data_Augmentation_for_Image_Super-resolution_A_Comprehensive_Analysis_and_CVPR_2020_paper.html
Paper introduction: A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future (Toru Tamaki)
Chaoyang Zhu, Long Chen, "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv 2023
https://arxiv.org/abs/2307.09220
Paper introduction: Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers (Toru Tamaki)
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim, "Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers", arXiv 2024
https://arxiv.org/abs/2403.10030
Paper introduction: ArcFace: Additive Angular Margin Loss for Deep Face Recognition (Toru Tamaki)
Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", CVPR 2019
https://openaccess.thecvf.com/content_CVPR_2019/html/Deng_ArcFace_Additive_Angular_Margin_Loss_for_Deep_Face_Recognition_CVPR_2019_paper.html
Paper introduction: Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers (Toru Tamaki)
Lei Ke, Yu-Wing Tai, Chi-Keung Tang, "Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers", CVPR 2021
https://openaccess.thecvf.com/content/CVPR2021/html/Ke_Deep_Occlusion-Aware_Instance_Segmentation_With_Overlapping_BiLayers_CVPR_2021_paper.html
Paper introduction: Automated Classification of Model Errors on ImageNet (Toru Tamaki)
Momchil Peychev, Mark Müller, Marc Fischer, Martin Vechev, "Automated Classification of Model Errors on ImageNet", NeurIPS 2023
https://proceedings.neurips.cc/paper_files/paper/2023/hash/7480ed13740773505262791131c12b89-Abstract-Conference.html
Paper introduction: MOSE: A New Dataset for Video Object Segmentation in Complex Scenes (Toru Tamaki)
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H.S. Torr, Song Bai, "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV 2023
Paper introduction: Tracking Anything with Decoupled Video Segmentation (Toru Tamaki)
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee, "Tracking Anything with Decoupled Video Segmentation", ICCV 2023
https://openaccess.thecvf.com/content/ICCV2023/html/Cheng_Tracking_Anything_with_Decoupled_Video_Segmentation_ICCV_2023_paper.html
Paper introduction: Real-Time Evaluation in Online Continual Learning: A New Hope (Toru Tamaki)
Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip H.S. Torr, Bernard Ghanem, "Real-Time Evaluation in Online Continual Learning: A New Hope", CVPR 2023
https://openaccess.thecvf.com/content/CVPR2023/html/Ghunaim_Real-Time_Evaluation_in_Online_Continual_Learning_A_New_Hope_CVPR_2023_paper.html
Paper introduction: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (Toru Tamaki)
Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas, "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", CVPR 2017
https://openaccess.thecvf.com/content_cvpr_2017/html/Qi_PointNet_Deep_Learning_CVPR_2017_paper.html
4. !"#
n426
• ベースとなる26モデル
• 軸の探索をするモデル
• Tつの軸を全てCに設定
n426からTつの軸を探索
• フレームレート:𝛾!
• フレーム数:𝛾"
• 解像度:𝛾#
• チャネル幅:𝛾$
• ボトルネック幅:𝛾%
• 深さ:𝛾&
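To make the six axes concrete, here is a minimal Python sketch (an illustration under assumptions, not the authors' code). The `ExpansionFactors` dataclass and `clip_shape` helper are hypothetical names; the base clip shape of 1 frame at 112x112 is read from the data-layer row of Table 1 below.

```python
# Hypothetical sketch: the six X3D expansion factors applied to the X2D base.
# All factors equal 1 for X2D; expansion enlarges one axis at a time.
from dataclasses import dataclass

@dataclass
class ExpansionFactors:
    gamma_tau: float = 1.0  # frame rate (temporal sampling stride)
    gamma_t: float = 1.0    # number of frames per clip
    gamma_s: float = 1.0    # spatial resolution
    gamma_w: float = 1.0    # channel width
    gamma_b: float = 1.0    # bottleneck width (expansion ratio inside blocks)
    gamma_d: float = 1.0    # network depth (blocks per residual stage)

def clip_shape(f: ExpansionFactors, base_frames: int = 1, base_side: int = 112):
    """Input clip shape T x S x S implied by the data-layer row of Table 1."""
    t = round(base_frames * f.gamma_t)
    s = round(base_side * f.gamma_s)
    return t, s, s

print(clip_shape(ExpansionFactors()))                        # X2D: (1, 112, 112)
print(clip_shape(ExpansionFactors(gamma_t=16, gamma_s=2)))   # e.g. (16, 224, 224)
```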
In relation to most of these works, our approach does not assume a fixed inherited design from 2D networks, but expands a tiny architecture across several axes in space, time, channels and depth to achieve a good efficiency trade-off.
3. X3D Networks
Image classification architectures have gone through an evolution of architecture design with progressively expanding existing models along network depth [7,23,37,51,58,81], input resolution [27,57,60] or channel width [75,80]. Similar progress can be observed for the mobile image classification domain, where contracting modifications (shallower networks, lower resolution, thinner layers, separable convolution [24,25,29,48,82]) allowed operating at a lower computational budget. Given this history in image ConvNet design, a similar progress has not been observed for video architectures, as these were customarily based on direct temporal extensions of image models. However, is single expansion of a fixed 2D architecture to 3D ideal, or is it better to expand ...
Table 1 (X3D architecture template):
stage      | filters                                            | output sizes T×S²
data layer | stride γτ, 1²                                      | 1γt × (112γs)²
conv1      | 1×3², 3×1, 24γw                                    | 1γt × (56γs)²
res2       | [1×1², 24γbγw; 3×3², 24γbγw; 1×1², 24γw] × γd      | 1γt × (28γs)²
res3       | [1×1², 48γbγw; 3×3², 48γbγw; 1×1², 48γw] × 2γd     | 1γt × (14γs)²
res4       | [1×1², 96γbγw; 3×3², 96γbγw; 1×1², 96γw] × 5γd     | 1γt × (7γs)²
res5       | [1×1², 192γbγw; 3×3², 192γbγw; 1×1², 192γw] × 3γd  | 1γt × (4γs)²
conv5      | 1×1², 192γbγw                                      | 1γt × (4γs)²
pool5      | 1γt × (4γs)²                                       | 1×1×1
fc1        | 1×1², 2048                                         | 1×1×1
fc2        | 1×1², #classes                                     | 1×1×1
(Since γt is 1 in the base model, the network is effectively 2D: this is X2D.)
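Reading the table programmatically: the sketch below (hypothetical helpers `round_width` and `stage_spec`, not the released implementation) shows how γw, γb, and γd scale the base stage widths 24/48/96/192 and depths 1/2/5/3 from the res2-res5 rows. The width-rounding rule is an assumption.

```python
# A minimal sketch of how the gamma factors in Table 1 scale the residual stages.
import math

def round_width(w: float, multiple: int = 8) -> int:
    # Round widths to a hardware-friendly multiple; the exact rounding rule
    # used by the authors may differ -- this is an assumption.
    return int(multiple * max(1, round(w / multiple)))

def stage_spec(gamma_w: float, gamma_b: float, gamma_d: float):
    base = [(24, 1), (48, 2), (96, 5), (192, 3)]  # (width, depth) for res2..res5
    spec = []
    for width, depth in base:
        spec.append({
            "out_channels": round_width(width * gamma_w),
            "bottleneck_channels": round_width(width * gamma_b * gamma_w),
            "num_blocks": int(math.ceil(depth * gamma_d)),
        })
    return spec

for s in stage_spec(gamma_w=1.0, gamma_b=2.25, gamma_d=2.2):
    print(s)  # e.g. res2: 24 out channels, 54 -> 56 bottleneck channels, 3 blocks
```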
6. Expansion to X3D
[Figure 2 shows the expansion path from X2D through X3D-XS, X3D-S, X3D-M, X3D-L to X3D-XL, with one axis (γτ, γt, γs, γw, γb, γd) expanded at each step. Horizontal axis: model capacity in GFLOPs (# of multiply-adds × 10⁹); vertical axis: Kinetics top-1 accuracy (%).]
Figure 2. Progressive network expansion of X3D. The X2D base model is expanded 1st across bottleneck width (γb), 2nd temporal resolution (γτ), 3rd spatial resolution (γs), 4th depth (γd), and subsequently across the remaining axes.
model  | top-1 | top-5 | regime  | FLOPs budget (G) | FLOPs (G) | Params (M)
X3D-XS | 68.6  | 87.9  | X-Small | ≤ 0.6            | 0.60      | 3.76
X3D-S  | 72.9  | 90.5  | Small   | ≤ 2              | 1.96      | 3.76
X3D-M  | 74.6  | 91.7  | Medium  | ≤ 5              | 4.73      | 3.76
X3D-L  | 76.8  | 92.5  | Large   | ≤ 20             | 18.37     | 6.08
X3D-XL | 78.4  | 93.6  | X-Large | ≤ 40             | 35.84     | 11.0
Table 2. Expanded instances on K400-val. 10-Center clip testing is used. We show top-1 and top-5 classification accuracy (%), as well as computational complexity measured in GFLOPs (floating-point operations, in # of multiply-adds ×10⁹) for a single clip input. Inference-time computational cost is proportional to 10× of this, as a fixed number of 10 clips is used per video.
4.1. Expanded networks
The accuracy/complexity trade-off curve for the expansion process on K400 is shown in Fig. 2. Expansion starts from X2D, which produces 47.75% top-1 accuracy (vertical axis) with 1.63M parameters and 20.67M FLOPs per clip (horizontal axis); complexity is roughly doubled in each progressive step. We use 10-Center clip testing as our default test setting for expansion, so the overall cost per video is ×10.
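A small worked example of the ×10 rule, using the per-clip GFLOPs from Table 2:

```python
# With 10-Center clip testing, per-video cost is 10x the per-clip GFLOPs.
per_clip_gflops = {"X3D-XS": 0.60, "X3D-S": 1.96, "X3D-M": 4.73,
                   "X3D-L": 18.37, "X3D-XL": 35.84}
for model, g in per_clip_gflops.items():
    print(f"{model}: {g:.2f} GFLOPs/clip -> {10 * g:.1f} GFLOPs/video (10 clips)")
```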
Annotations on Fig. 2: each point is the result of one expansion step; the highlighted point is the model whose expanded axis performed best at that step; the best-performing expansion is kept, and the five models (X3D-XS/S/M/L/XL) are extracted from the resulting trajectory.
Expansion cost (see the sketch below):
• X3D: 6 models are trained per step, i.e. 6 × (number of steps) training runs
• EfficientNet: grid search, training every combination exhaustively
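The stepwise expansion described above can be summarized in a short sketch (my paraphrase of the procedure, not the authors' released code). `train_and_eval` and `expand_axis` are hypothetical stand-ins: `expand_axis(f, axis)` returns a copy of the factors with one axis enlarged so that model complexity roughly doubles, and `train_and_eval` returns validation accuracy.

```python
# Greedy coordinate-wise expansion: 6 candidate models per step, keep the best.
AXES = ["gamma_tau", "gamma_t", "gamma_s", "gamma_w", "gamma_b", "gamma_d"]

def progressive_expand(factors, train_and_eval, expand_axis, num_steps):
    history = [factors]
    for _ in range(num_steps):
        # Train 6 candidate models per step, one per expanded axis...
        candidates = [expand_axis(factors, axis) for axis in AXES]
        scores = [train_and_eval(c) for c in candidates]
        # ...and keep the candidate with the best accuracy at ~2x complexity.
        factors = candidates[scores.index(max(scores))]
        history.append(factors)
    return history  # intermediate steps yield the X3D-XS/S/M/L/XL cut-offs
```

This is 6 × (number of steps) training runs in total, versus an exhaustive grid search over all axis combinations.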
8. X3D performance comparison
model                       | pre      | top-1 | top-5 | test | GFLOPs×views | Param
I3D [6]                     | ImageNet | 71.1  | 90.3  | 80.2 | 108 × N/A    | 12M
Two-Stream I3D [6]          | ImageNet | 75.7  | 92.0  | 82.8 | 216 × N/A    | 25M
Two-Stream S3D-G [76]       | ImageNet | 77.2  | 93.0  | -    | 143 × N/A    | 23.1M
MF-Net [9]                  | ImageNet | 72.8  | 90.4  | -    | 11.1 × 50    | 8.0M
TSM R50 [41]                | ImageNet | 74.7  | N/A   | -    | 65 × 10      | 24.3M
Nonlocal R50 [68]           | ImageNet | 76.5  | 92.6  | -    | 282 × 30     | 35.3M
Nonlocal R101 [68]          | ImageNet | 77.7  | 93.3  | 83.8 | 359 × 30     | 54.3M
Two-Stream I3D [6]          | -        | 71.6  | 90.0  | -    | 216 × N/A    | 25.0M
R(2+1)D [64]                | -        | 72.0  | 90.0  | -    | 152 × 115    | 63.6M
Two-Stream R(2+1)D [64]     | -        | 73.9  | 90.9  | -    | 304 × 115    | 127.2M
Oct-I3D + NL [8]            | -        | 75.7  | N/A   | -    | 28.9 × 30    | 33.6M
ip-CSN-152 [63]             | -        | 77.8  | 92.8  | -    | 109 × 30     | 32.8M
SlowFast 4×16, R50 [14]     | -        | 75.6  | 92.1  | -    | 36.1 × 30    | 34.4M
SlowFast 8×8, R101 [14]     | -        | 77.9  | 93.2  | 84.2 | 106 × 30     | 53.7M
SlowFast 8×8, R101+NL [14]  | -        | 78.7  | 93.5  | 84.9 | 116 × 30     | 59.9M
SlowFast 16×8, R101+NL [14] | -        | 79.8  | 93.9  | 85.7 | 234 × 30     | 59.9M
X3D-M                       | -        | 76.0  | 92.3  | 82.9 | 6.2 × 30     | 3.8M
X3D-L                       | -        | 77.5  | 92.9  | 83.8 | 24.8 × 30    | 6.1M
X3D-XL                      | -        | 79.1  | 93.9  | 85.3 | 48.4 × 30    | 11.0M
Table 4. Comparison to the state-of-the-art on K400-val & test. We report the inference cost with a single "view" (temporal clip with spatial crop) × the number of such views used (GFLOPs×views). "N/A" indicates the numbers are not available for us. The "test" column shows the average of top-1 and top-5 on the Kinetics-400 test set.

model              | pretrain             | mAP  | GFLOPs×views | Param
Nonlocal [68]      | ImageNet+Kinetics400 | 37.5 | 544 × 30     | 54.3M
STRG, +NL [69]     | ImageNet+Kinetics400 | 39.7 | 630 × 30     | 58.3M
Timeception [28]   | Kinetics-400         | 41.1 | N/A × N/A    | N/A
LFB, +NL [71]      | Kinetics-400         | 42.5 | 529 × 30     | 122M
SlowFast, +NL [14] | Kinetics-400         | 42.5 | 234 × 30     | 59.9M
SlowFast, +NL [14] | Kinetics-600         | 45.2 | 234 × 30     | 59.9M
X3D-XL             | Kinetics-400         | 43.4 | 48.4 × 30    | 11.0M
X3D-XL             | Kinetics-600         | 47.1 | 48.4 × 30    | 11.0M
Table 6. Comparison with the state-of-the-art on Charades. SlowFast variants are based on T×τ = 16×8.
model                       | pretrain | top-1 | top-5 | GFLOPs×views | Param
I3D [3]                     | -        | 71.9  | 90.1  | 108 × N/A    | 12M
Oct-I3D + NL [8]            | ImageNet | 76.0  | N/A   | 25.6 × 30    | 12M
SlowFast 4×16, R50 [14]     | -        | 78.8  | 94.0  | 36.1 × 30    | 34.4M
SlowFast 16×8, R101+NL [14] | -        | 81.8  | 95.1  | 234 × 30     | 59.9M
X3D-M                       | -        | 78.8  | 94.5  | 6.2 × 30     | 3.8M
X3D-XL                      | -        | 81.9  | 95.5  | 48.4 × 30    | 11.0M
Table 5. Comparison with the state-of-the-art on Kinetics-600. Results are consistent with K400 in Table 4 above.
Kinetics-600 is a larger version of Kinetics that shall demonstrate further generalization of our approach. Results are shown in Table 5. Our variants demonstrate similar performance as above, with the best model now providing slightly better performance than the previous state-of-the-art SlowFast 16×8, R101+NL [14], again for 4.8× fewer FLOPs (i.e. multiply-add operations) and 5.5× fewer parameters. In the lower computation regime, X3D-M is comparable to SlowFast 4×16, R50 but requires 5.8× fewer FLOPs and 9.1× fewer parameters.
Charades [49] is a dataset with longer-range activities; results are in Table 6 above.
"#$%&#'()*++
"#$%&#'(),++
-./0/1%(
事前
学習
既存
手法
同等の性能よりも
計算量削減
9. X3D performance comparison
K400-val:
model             | top-1       | top-5       | FLOPs (G)    | Params (M)
EfficientNet3D-B0 | 66.7        | 86.6        | 0.74         | 3.30
X3D-XS            | 68.6 (+1.9) | 87.9 (+1.3) | 0.60 (−0.14) | 3.76 (+0.5)
EfficientNet3D-B3 | 72.4        | 89.6        | 6.91         | 8.19
X3D-M             | 74.6 (+2.2) | 91.7 (+2.1) | 4.73 (−2.2)  | 3.76 (−4.4)
EfficientNet3D-B4 | 74.5        | 90.6        | 23.80        | 12.16
X3D-L             | 76.8 (+2.3) | 92.5 (+1.9) | 18.37 (−5.4) | 6.08 (−6.1)
K400-test:
model             | top-1       | top-5       | FLOPs (G)    | Params (M)
EfficientNet3D-B0 | 64.8        | 85.4        | 0.74         | 3.30
X3D-XS            | 66.6 (+1.8) | 86.8 (+1.4) | 0.60 (−0.14) | 3.76 (+0.5)
EfficientNet3D-B3 | 69.9        | 88.1        | 6.91         | 8.19
X3D-M             | 73.0 (+2.1) | 90.8 (+2.7) | 4.73 (−2.2)  | 3.76 (−4.4)
EfficientNet3D-B4 | 71.8        | 88.9        | 23.80        | 12.16
X3D-L             | 74.6 (+2.8) | 91.4 (+2.5) | 18.37 (−5.4) | 6.08 (−6.1)
Table 7. Comparison to EfficientNet3D: We compare to a 3D version of EfficientNet on K400-val and test. 10-Center clip testing is used. EfficientNet3D has the same mobile components as X3D
but was found by searching a large number of models for an optimal trade-off on image classification. This ablation studies whether a direct extension of EfficientNet to 3D is comparable to X3D (which is expanded by training only a few models). EfficientNet models are provided for various complexity ranges.
Comparison with EfficientNet3D: X3D's hyperparameter search (progressive expansion) proves superior.
[Figure 3 compares SlowFast 8×8 R101+NL, SlowFast 8×8 R101, X3D-XL/M/S, CSN (Channel-Separated Networks), and TSM (Temporal Shift Module). Horizontal axis: inference cost per video in TFLOPs (# of multiply-adds × 10¹²); vertical axis: Kinetics top-1 accuracy (%).]
Figure 3. Accuracy/complexity trade-off on Kinetics-400 for different number of inference clips per video. The top-1 accuracy (vertical axis) is obtained by K-Center clip testing where the number of temporal clips K ∈ {1, 3, 5, 7, 10} is shown in each curve. The horizontal axis shows the full inference cost per video.
Inference cost. In many cases, like the experiments before, ...
Number of inference clips vs. recognition accuracy: comparing 1 to 10 clips per video, accuracy gains beyond 5 clips are poor relative to the added computation (see the sketch below).
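A minimal sketch of K-Center clip testing as I read the protocol (the `model` callable and tensor layout are assumptions, not reference code): sample K temporally uniform clips, run the model on each, and average the softmax scores.

```python
import torch

@torch.no_grad()
def k_center_clip_test(model, video, k=10, clip_len=16):
    """video: tensor of shape (C, T, H, W), already center-cropped spatially."""
    C, T, H, W = video.shape
    # K temporally uniform clip start positions across the video.
    starts = torch.linspace(0, max(T - clip_len, 0), k).long()
    probs = []
    for s in starts:
        s = int(s)
        clip = video[:, s:s + clip_len]      # (C, clip_len, H, W)
        logits = model(clip.unsqueeze(0))    # (1, num_classes)
        probs.append(logits.softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)    # class probabilities averaged over K clips
```

Per-video cost therefore scales linearly in K, which is why the curves in Fig. 3 flatten: each extra clip adds a full clip's FLOPs for a shrinking accuracy gain.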
17. X3D performance comparison (action detection)
• Dataset: AVA (v2.1 / v2.2), with Kinetics pretraining
• Inference: single center crop (crop size scaled by γs)
model                      | AVA  | pre  | val mAP | GFLOPs | Param
LFB, R50+NL [71]           | v2.1 | K400 | 25.8    | 529    | 73.6M
LFB, R101+NL [71]          | v2.1 | K400 | 26.8    | 677    | 122M
X3D-XL                     | v2.1 | K400 | 26.1    | 48.4   | 11.0M
SlowFast 4×16, R50 [14]    | v2.2 | K600 | 24.7    | 52.5   | 33.7M
SlowFast 8×8, R101+NL [14] | v2.2 | K600 | 27.4    | 146    | 59.2M
X3D-M                      | v2.2 | K600 | 23.2    | 6.2    | 3.1M
X3D-XL                     | v2.2 | K600 | 27.4    | 48.4   | 11.0M
Table 8. Comparison with the state-of-the-art on AVA. All methods use single center crop inference; full testing cost is directly ...
18. Inference and computational cost
[Figure A.1 plots SlowFast 8×8 R101(+NL), X3D-XL/M/S, EfficientNet3D-B3/B4, CSN (Channel-Separated Networks), and TSM (Temporal Shift Module) on K400-val (top) and K400-test (bottom), once against a linear axis of inference cost per video in TFLOPs (# of multiply-adds × 10¹²) and once against a log-scale FLOPs axis (10¹⁰ to 10¹²); vertical axes: Kinetics top-1 val/test accuracy (%).]
Figure A.1. Accuracy/complexity trade-off on K400-val (top) & test (bottom) for varying # of inference clips per video. The top-1 accuracy (vertical axis) is obtained by K-Center clip testing where the number of temporal clips K ∈ {1, 3, 5, 7, 10} is shown in each curve.
(Annotations: Kinetics-400 val, Kinetics-400 test; the horizontal axis of the right-hand plots is log scale.)
19. Additional experiments
n6+(&"F=$%+)%+(>#>89+),'-B'9G&$'-
• 通常の56畳み込みに変更し𝛾%を変更
• パラメータ数が同等になるように
• 通常の56畳み込みに変更
nJ=$%"
• 1+N[に変更で性能が少し低下
• メモリを優先するなら1+N[
nJKブロック
• 削除で性能低下
• パフォーマンスに大きな影響を与えている
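For reference, a PyTorch sketch of the mobile components being ablated (my rendering under stated assumptions, not the released X3D code): an inverted-bottleneck residual block with a channel-wise (depthwise) separable 3×3×3 convolution, swish/SiLU activation, and an SE branch. The exact placement of SE and activations in X3D may differ.

```python
import torch
from torch import nn

class SE3d(nn.Module):
    """Squeeze-and-excitation for 3D feature maps (assumed reduction ratio 16)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc1 = nn.Conv3d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv3d(hidden, channels, kernel_size=1)

    def forward(self, x):
        w = x.mean(dim=(2, 3, 4), keepdim=True)          # squeeze: global avg pool
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))
        return x * w                                     # excitation: channel re-weight

class X3DBottleneck(nn.Module):
    def __init__(self, dim: int, bottleneck_ratio: float = 2.25):
        super().__init__()
        inner = int(dim * bottleneck_ratio)
        self.block = nn.Sequential(
            nn.Conv3d(dim, inner, 1, bias=False), nn.BatchNorm3d(inner), nn.SiLU(),
            # Channel-wise separable conv: groups == channels. The "CW conv"
            # ablation replaces this with a dense Conv3d (groups=1), which is
            # what blows up FLOPs and parameters in Table A.1.
            nn.Conv3d(inner, inner, 3, padding=1, groups=inner, bias=False),
            nn.BatchNorm3d(inner), SE3d(inner), nn.SiLU(),
            nn.Conv3d(inner, dim, 1, bias=False), nn.BatchNorm3d(dim),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))

x = torch.randn(1, 24, 4, 56, 56)   # (N, C, T, H, W)
print(X3DBottleneck(24)(x).shape)   # torch.Size([1, 24, 4, 56, 56])
```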
model          | top-1 | top-5 | FLOPs (G) | Param (M)
X3D-S          | 72.9  | 90.5  | 1.96      | 3.76
CW conv, b=0.6 | 68.9  | 88.8  | 1.95      | 3.16
CW conv        | 73.2  | 90.4  | 17.6      | 22.1
swish          | 72.0  | 90.4  | 1.96      | 3.76
SE             | 71.3  | 89.9  | 1.96      | 3.60
(a) Ablating mobile components on a Small model.

model           | top-1 | top-5 | FLOPs (G) | Param (M)
X3D-XL          | 78.4  | 93.6  | 35.84     | 11.0
CW conv, b=0.56 | 76.0  | 92.6  | 34.80     | 9.73
CW conv         | 79.2  | 93.5  | 365.4     | 95.1
swish           | 78.0  | 93.4  | 35.84     | 11.0
SE              | 77.1  | 93.0  | 35.84     | 10.4
(b) Ablating mobile components on an X-Large model.

Table A.1. Ablations of mobile components for video classification on K400-val. We show top-1 and top-5 classification accuracy (%), parameters, and computational complexity measured in GFLOPs (floating-point operations, in # of multiply-adds ×10⁹) for a single clip input of size γt × (112γs)². Inference-time computational cost is reported as GFLOPs ×10, as a fixed number of 10-Center views is used. The results show that removing channel-wise separable convolution (CW conv) with unchanged bottleneck expansion ratio, b, drastically increases multiply-adds and parameters at slightly higher accuracy, while swish has a smaller effect on performance than SE.