Paper introduction: Selective Feature Compression for Efficient Activity Recognition Inference (Toru Tamaki)
Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe; Selective Feature Compression for Efficient Activity Recognition Inference, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13628-13637
https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Selective_Feature_Compression_for_Efficient_Activity_Recognition_Inference_ICCV_2021_paper.html
Paper introduction: VideoMix: Rethinking Data Augmentation for Video Classification (Toru Tamaki)
Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Jinhyung Kim, VideoMix: Rethinking Data Augmentation for Video Classification, arXiv:2012.03457
https://arxiv.org/abs/2012.03457
Paper introduction: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (Toru Tamaki)
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
https://proceedings.neurips.cc/paper/2021/hash/64f1f27bf1b4ec22924fd0acb550c235-Abstract.html
https://arxiv.org/abs/2105.15203
Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann; Video Transformer Network, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 3163-3172
https://openaccess.thecvf.com/content/ICCV2021W/CVEU/html/Neimark_Video_Transformer_Network_ICCVW_2021_paper.html
https://arxiv.org/abs/2102.00719
Paper introduction: Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers (Toru Tamaki)
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim, "Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers", arXiv, 2024
https://arxiv.org/abs/2403.10030
Paper introduction: ArcFace: Additive Angular Margin Loss for Deep Face Recognition (Toru Tamaki)
Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", CVPR 2019
https://openaccess.thecvf.com/content_CVPR_2019/html/Deng_ArcFace_Additive_Angular_Margin_Loss_for_Deep_Face_Recognition_CVPR_2019_paper.html
Paper introduction: Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers (Toru Tamaki)
Lei Ke, Yu-Wing Tai, Chi-Keung Tang, "Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers", CVPR 2021
https://openaccess.thecvf.com/content/CVPR2021/html/Ke_Deep_Occlusion-Aware_Instance_Segmentation_With_Overlapping_BiLayers_CVPR_2021_paper.html
Paper introduction: Automated Classification of Model Errors on ImageNet (Toru Tamaki)
Momchil Peychev, Mark Müller, Marc Fischer, Martin Vechev, "Automated Classification of Model Errors on ImageNet", NeurIPS 2023
https://proceedings.neurips.cc/paper_files/paper/2023/hash/7480ed13740773505262791131c12b89-Abstract-Conference.html
Paper introduction: MOSE: A New Dataset for Video Object Segmentation in Complex Scenes (Toru Tamaki)
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H.S. Torr, Song Bai, "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV 2023
Paper introduction: Tracking Anything with Decoupled Video Segmentation (Toru Tamaki)
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee, "Tracking Anything with Decoupled Video Segmentation", ICCV 2023
https://openaccess.thecvf.com/content/ICCV2023/html/Cheng_Tracking_Anything_with_Decoupled_Video_Segmentation_ICCV_2023_paper.html
Paper introduction: Real-Time Evaluation in Online Continual Learning: A New Hope (Toru Tamaki)
Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip H.S. Torr, Bernard Ghanem, "Real-Time Evaluation in Online Continual Learning: A New Hope", CVPR 2023
https://openaccess.thecvf.com/content/CVPR2023/html/Ghunaim_Real-Time_Evaluation_in_Online_Continual_Learning_A_New_Hope_CVPR_2023_paper.html
Paper introduction: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (Toru Tamaki)
Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas, "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", CVPR 2017
https://openaccess.thecvf.com/content_cvpr_2017/html/Qi_PointNet_Deep_Learning_CVPR_2017_paper.html
[DL Seminar] XFeat: Accelerated Features for Lightweight Image Matching (harmonylab)
URL: https://arxiv.org/pdf/2404.19174
Source: Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, Erickson R. Nascimento: XFeat: Accelerated Features for Lightweight Image Matching, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Abstract: This work proposes XFeat (Accelerated Features), a lightweight architecture for resource-efficient feature-point matching. The method revisits the basic design of convolutional neural networks for detecting, extracting, and matching local features. Because resource-constrained devices demand fast yet robust algorithms, the network limits its channel count while keeping the spatial resolution as high as possible. The design also lets users opt into sparse matching, making it well suited to applications such as navigation and AR. XFeat is fast, achieves accuracy on par with or better than comparable methods, and runs in real time on an ordinary laptop CPU.
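The sparse-matching option mentioned above typically boils down to a mutual nearest-neighbor filter between two descriptor sets. The sketch below illustrates that filtering step in NumPy; it is not XFeat's actual code, and the function name is my own.

```python
import numpy as np

def mutual_nearest_neighbors(desc_a, desc_b):
    """Match two sets of L2-normalized descriptors: keep pair (i, j)
    only when a[i] and b[j] are each other's nearest neighbor under
    cosine similarity, a standard filter for sparse matching."""
    sim = desc_a @ desc_b.T          # (N, M) cosine similarity matrix
    nn_ab = sim.argmax(axis=1)       # best match in b for each a[i]
    nn_ba = sim.argmax(axis=0)       # best match in a for each b[j]
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

The mutual check discards one-sided matches, which is what makes this filter robust despite its simplicity; the cost is one O(N·M) similarity matrix, cheap for sparse keypoint sets.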
The use of robots in cell-production systems faces various problems, one of which is the assembly of three or more objects. When assembling multiple objects simultaneously, each target part is generally held independently by a robot arm or a jig. However, this approach requires as many arms or jigs as there are parts, and as the part count grows it becomes increasingly wasteful in cost and installation space. To address this issue, 音𣷓 et al. analyzed the contact forces acting on the assembly targets and derived conditions under which an object not fixed by a jig is unlikely to move during assembly; that is, they plan assembly conditions by considering the robustness of ungrasped objects in the environment. Based on this strategy, this study aims to execute multi-object assembly with a single-arm manipulator. By considering object robustness, we propose a method for simultaneously handling multiple temporarily assembled objects. Taking pipe-joint assembly as the target task, we show that a single-arm manipulator can grasp multiple objects simultaneously using a simple tool. Furthermore, to improve the task success rate, we implement robot control and motion planning based on object position detection with an RGB-D camera.
This paper discusses assembly operations that use a single manipulator and a parallel gripper to simultaneously grasp multiple objects and hold the group of temporarily assembled objects. Assembly tasks are generally performed by multiple robots and jigs that constrain the target objects mechanically or geometrically to prevent them from moving. To achieve such tasks with a single gripper, the physical interaction between the objects must be analyzed to provide those constraints. In this paper, we focus on assembling pipe joints as an example and discuss how to constrain the motion of the objects. Our demonstration shows that a simple tool can facilitate holding multiple objects with a single gripper.
5. VLP: A Survey on Vision-Language Pre-training
[Fig. 1 of the survey: illustration of the two types of VLP model architectures. (a) Single-stream: one Self-Attn + Feedforward block over joint visual and textual features. (b) Dual-stream: separate Cross-Attn + Self-Attn + Feedforward blocks for the visual and textual features.]
Pre-training model architectures
■ Single-stream or dual-stream
6. VLP: A Survey on Vision-Language Pre-training
Pre-training model architectures
■ Single-stream
• [Li+, arXiv 2019][Chen+, ECCV 2020][Chen+, IEEE 2022]
• Concatenates textual and visual features
• Feeds them into a single Transformer block
• Uses the same parameters for both modalities
• High parameter efficiency
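The single-stream pattern above can be sketched in a few lines: concatenate the two token sequences and run one shared attention block over the joint sequence. This is a toy single-head NumPy sketch, not any specific model's code; all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over tokens x (T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def single_stream_fusion(visual, textual, w_q, w_k, w_v):
    """Single-stream fusion: concatenate visual (Nv, D) and textual
    (Nt, D) tokens and apply ONE attention block, so both modalities
    share the same parameters w_q, w_k, w_v."""
    joint = np.concatenate([visual, textual], axis=0)   # (Nv+Nt, D)
    return self_attention(joint, w_q, w_k, w_v)
```

Because every token attends to every other token in the joint sequence, cross-modal interaction comes for free, which is why this variant needs no separate cross-attention and fewer parameters.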
7. VLP: A Survey on Vision-Language Pre-training
Pre-training model architectures
■ Dual-stream
• [Zhang+, ACM MM 2022][Dou+, arXiv 2021]
• Keeps textual and visual features independent
• Feeds them into separate Transformer blocks
• No parameter sharing between blocks
• Uses cross-attention to achieve high performance
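The dual-stream pattern can likewise be sketched in NumPy: each modality keeps its own parameters and exchanges information only through cross-attention. Again a toy single-head sketch with illustrative names, not any specific model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, context, w_q, w_k, w_v):
    """Queries come from x; keys and values come from the other stream."""
    q = x @ w_q
    k, v = context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def dual_stream_step(visual, textual, params_v, params_t):
    """Dual-stream fusion: each modality has its OWN parameter triple
    (params_v vs. params_t, no sharing) and interacts with the other
    stream only through cross-attention."""
    new_visual = cross_attention(visual, textual, *params_v)
    new_textual = cross_attention(textual, visual, *params_t)
    return new_visual, new_textual
```

Compared with the single-stream sketch, this doubles the parameter count but lets each stream specialize, which is the trade-off the survey's dual-stream models accept for higher performance.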
14. SOTA: Image-Text VLP models
Table 2 The summary of mainstream image-text VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in Datasets column: self-constructed or self-collected. MTL in Datasets column: all datasets for multi-task learning in the corresponding work. See other abbreviations in the Datasets column in Table 1.
Model Domain Vision FE Language FE Multimodal Fusion Decoder PT Objectives PT Datasets Downstream Tasks
VisualBERT [9] Image OD-RFs Emb Single-stream No MLM+VLM COCO GRE+NLVR+VCR+VQA
ViLBERT [8] Image OD-RFs Emb Dual-stream No MLM+VLM+MVM COCO+VG VLR+NLVR+VE+VQA
LXMERT [108] Image OD-RFs+Xformer Xformer Dual-stream No MLM+VLM+MVM+VQA COCO+VG+VQA+GQA+VGQA GQA+NLVR+VQA
B2T2 [109] Image CNN-GFs Emb Single-stream No MLM+VLM CC3M VCR
Unicoder-VL [13] Image OD-RFs Emb Single-stream No MLM+VLM+MVM CC3M+SBU VLR+VCR
VL-BERT [110] Image OD-RFs Emb Single-stream No MLM+MVM CC3M GRE+VCR+VQA
VLP [111] Image OD-RFs Emb Dual-stream Yes MLM+LM CC3M VC+VQA
UNITER [30] Image OD-RFs Emb Single-stream No MLM+VLM+MVM+WRA COCO+VG+SBU+CC3M GRE+VLR+NLVR+VCR+VE+VQA
12-IN-1 [112] Image OD-RFs Emb Single-stream No MLM+MVM MTL GQA+GRE+VC+NLVR+VE+VQA
VisDial-BERT [113] Image OD-RFs Emb Dual-stream No MLM+VLM+MVM CC3M+VQA VD
ImageBERT [53] Image OD-RFs Emb Single-stream No MLM+VLM+MVM LAIT+CC3M+SBU VLR
PREVALENT [114] Image CNN-GFs+Xformer Xformer Single-stream No MLM+MVM Matterport3D VLN
XGPT [41] Image OD-RFs Emb Dual-stream Yes MLM+IDA+VC+TIFG CC3M VC+VLR
InterBERT [115] Image OD-RFs Emb Single-stream No MLM+VLM+MVM COCO+CC3M+SBU VLR+VCR
PixelBERT [116] Image CNN-GFs Emb Single-stream No MLM+VLM COCO+VG VLR+NLVR+VQA
OSCAR [10] Image OD-RFs Emb Single-stream No MLM+VLM COCO+SBU+CC3M+FLKR+VQA+GQA+VGQA GQA+VC+VLR+NLVR+NoCaps+VQA
VLN-BERT [117] Image OD-RFs Emb Dual-stream No MLM+VLM+MVM CC3M VLN
FashionBERT [118] Image Xformer Emb Single-stream No MLM+VLM+MVM FashionGen VLR
VILLA [119] Image OD-RFs+Xformer Xformer Single-stream No MLM+VLM+MVM COCO+VG+CC3M+SBU GRE+VLR+NLVR+VCR+VE+VQA
ERNIE-ViL [120] Image OD-RFs Emb Single-stream No MLM+MVM CC3M+SBU GRE+VLR+VCR+VQA
RVL-BERT [121] Image OD-RFs Emb Single-stream No MLM+VLM+MVM CC3M VC+VQA
VinVL [27] Image OD-RFs Emb Single-stream No MLM+VLM COCO+CC3M+SBU+FLKR+VQA+GQA+VGQA GQA+VC+VLR+NLVR+NoCaps+VQA
VL-T5 [122] Image OD-RFs Emb Single-stream Yes MLM+VLM+VQA+GRE+VC COCO+VG+VQA+GQA+VGQA GQA+GRE+VC+MMT+NLVR+VCR+VQA
ViLT [123] Image ViT-PFs Emb Single-stream No MLM+VLM COCO+VG+SBU+CC3M VLR+NLVR+VQA
ALIGN [55] Image CNN-GFs Xformer Dual-stream No VLC ALIGN VLR
Kaleido-BERT [43] Image CNN-GFs Emb Single-stream No MLM+VLM+AKPM FashionGen CR+VC+VLR
MDETR [42] Image Xformer Xformer Single-stream Yes OD+MLM+VLC COCO+VG+FLKR+GQA GQA+VQA
SOHO [124] Image CNN-GFs Emb Single-stream No MLM+VLM+MVM COCO+VG VLR+NLVR+VE+VQA
E2E-VLP [125] Image CNN-GFs Emb Single-stream Yes OD+MLM+VLM COCO+VG VC+VLR+NLVR+VQA
Visual Parsing [126] Image Xformer Emb Single-stream No MLM+VLM+MVM COCO+VG VLR+VCR+VE+VQA
CLIP-ViL [127] Image CNN-GFs Emb Single-stream Yes MLM+VLM+VQA COCO+VG+VQA+GQA+VGQA VE+VLN+VQA
ALBEF [35] Image Xformer Xformer Dual-stream No MLM+VLM+VLC COCO+VG+CC3M+SBU VLR+NLVR+VQA
SimVLM [14] Image CNN-GFs Emb Single-stream Yes PrefixLM ALIGN VC+NLVR+VE+VQA
MURAL [128] Image CNN-GFs Xformer Dual-stream No VLC CC12M+ALIGN VC+VLR
VLMO [129] Image ViT-PFs Emb Single-stream No MLM+VLC+VLM COCO+VG+CC3M+SBU VQA+NLVR+VLR
METER [33] Image Xformer Xformer Dual-stream No MLM+VLM COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
X-VLM [28] Image Xformer Xformer Single-stream No MLM+VLM+VG COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
TCL [130] Image Xformer Xformer Single-stream No MLM+VLM+TCL COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
15. SOTA: Video-Text VLP models
Table 3 The summary of mainstream video-text VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in Datasets column: self-constructed or self-collected. MTL in Datasets column: all datasets for multi-task learning in the corresponding work. See other abbreviations in the Datasets column in Table 1.
Model Domain Vision FE Language FE Multimodal Fusion Decoder PT Objectives PT Datasets Downstream Tasks
VideoBERT [58] Video CNN-GFs Emb Single-stream No MLM+VLM+MVM SC AC+VC
CBT [105] Video CNN-GFs+Xformer Xformer Single-stream No VLC Kinetics AC+AS+VC
UniVL [106] Video CNN-GFs Xformer Dual-stream Yes MLM+VLM+VC HT100M AS+ASL+MSA+VC+VLR
HERO [36] Video CNN-GFs+Xformer Xformer Single-stream No MLM+VLM+MVM+FOM HT100M+TV VC+VLI+VQA+VLR
MMFT-BERT [131] Video OD-RFs+Xformer Xformer Single-stream No VQA TV VQA
ActBERT [132] Video OD-RFs+CNN Emb Single-stream No MLM+VLM+MVM HT100M AS+ASL+VC+VQA+VLR
CLIP [16] Image / Video CNN/Xformer Xformer Dual-stream No VLC SC OCR +AC etc.
Frozen [57] Video ViT-PFs Emb Dual-stream No VLC WebVid2M+CC3M VLR
Region-Learner [133] Video ViT-PFs Emb Dual-stream No VLC WebVid2M+CC3M VLR
CLIP4Clip [17] Video ViT-PFs Emb Dual-stream No VLC WebVid2M+CC3M VLR
CLIP2Video [18] Video ViT-PFs Emb Dual-stream No VLC WebVid2M+CC3M VLR