OMNIVORE: A Single Model for Many Visual Modalities
Rohit Girdhar, Mannat Singh, Nikhila Ravi,
Laurens van der Maaten, Armand Joulin, Ishan Misra
CVPR 2022
2023/4/6
◼
•
◼ single-view 3D
◼
• single-view 3D
•
◼
•
•
•
◼ ConvNet
•
◼ Transformer [ ]
•
◼
•
◼
•
OMNIVORE
◼
• Swin Transformer [ ]
• self-attention
•
◼ Omnivore
•
• Transformer
◼
•
• SGD
OMNIVORE
◼
• Swin Transformer
• Swin-T
• Swin-S
• Swin-B
• Swin-L
•
• T < S < B < L
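For reference, the four variants differ in channel width and depth. The sketch below lists the hyperparameters and approximate parameter counts reported in the Swin Transformer paper; the `SwinConfig` class and `order_by_size` helper are illustrative names, not part of any library or of the Omnivore code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwinConfig:
    embed_dim: int        # channel width C of the first stage
    depths: tuple         # number of blocks in each of the 4 stages
    num_heads: tuple      # attention heads in each stage
    approx_params_m: int  # approximate parameter count in millions


# Values as reported in the Swin Transformer paper (image classification models).
SWIN_VARIANTS = {
    "Swin-T": SwinConfig(96, (2, 2, 6, 2), (3, 6, 12, 24), 29),
    "Swin-S": SwinConfig(96, (2, 2, 18, 2), (3, 6, 12, 24), 50),
    "Swin-B": SwinConfig(128, (2, 2, 18, 2), (4, 8, 16, 32), 88),
    "Swin-L": SwinConfig(192, (2, 2, 18, 2), (6, 12, 24, 48), 197),
}


def order_by_size(variants):
    """Return variant names sorted from smallest to largest model."""
    return sorted(variants, key=lambda name: variants[name].approx_params_m)


# Prints the size ordering using only the suffix letter of each variant.
print(" < ".join(name.split("-")[1] for name in order_by_size(SWIN_VARIANTS)))
```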
◼
• 500
• 1
• IN1K K400 1
• SUN 10
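The replication factors above (IN1K and K400 once, SUN ten times per epoch) can be sketched as follows. This is a minimal illustration under assumptions: the dataset sizes are toy numbers chosen for readability, and `build_epoch` is a hypothetical helper, not the authors' data loader.

```python
import random


def build_epoch(sizes, repeats, seed=0):
    """Return a shuffled list of (dataset_name, sample_index) pairs for one joint epoch.

    Each dataset contributes `repeats[name]` full passes over its samples.
    """
    items = []
    for name, size in sizes.items():
        for _ in range(repeats[name]):
            items.extend((name, i) for i in range(size))
    random.Random(seed).shuffle(items)
    return items


# Toy dataset sizes (assumption for illustration; the real datasets are far larger).
sizes = {"IN1K": 1000, "K400": 200, "SUN": 5}
# Replication factors taken from the slide: IN1K x1, K400 x1, SUN x10.
repeats = {"IN1K": 1, "K400": 1, "SUN": 10}

epoch = build_epoch(sizes, repeats)
counts = {name: sum(1 for n, _ in epoch if n == name) for name in sizes}
print(counts)  # {'IN1K': 1000, 'K400': 200, 'SUN': 50}
```

Replicating the small dataset keeps it from being drowned out by the larger ones within a single epoch.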
◼
• IN1K
•
•
• K400
•
• SUN
•
•
◼ 1
• IN1K, K400, SUN
• IN1K, K400, SUN
◼ 2
• IN1K, K400, SUN
•
➢ OMNIVORE
• Swin Transformer
•
• IN1K
• K400
• SUN Single-view 3D
◼
• Swin Transformer
•
➢ ImageSwin [Liu+, ICCV2021]
•
• IN1K
➢ VideoSwin [Liu+, arXiv2021]
•
• ImageSwin
➢ DepthSwin
• Single-view 3D
• ImageSwin
1
◼
2
◼
• 7
• Specific
◼
• MViT-B-24 [ ]
• ViT-L/16 [Dosovitskiy+, ICLR2021]
◼
• ViT-B-VTN [Neimark+, arXiv2021]
• TimeSformer-L [Bertasius+, ICML2021]
◼ 3D
• DF2Net [Li+, AAAI2018]
• G-L-SOOR [Song+, TIP2020]
◼ OMNIVORE
• Swin-B
• Swin-L
Ablation study
◼
• IN21K
•
• IN1K, K400, SUN
◼
• IN21K, IN1K, K400, SUN
• SUN
•
•
• IN21K
•
•
◼
•
•
◼
• RGBD
RGB
•
Ablation study
◼ Baseline
•
• 0.1:1:1:50
•
•
◼ Finetuned
•
◼ Data ratio
•
◼ Batching
•
◼ Patch embedding
• RGBD 4
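The two patch-embedding options compared in this ablation can be sketched in plain Python. This is a toy illustration under assumptions: Option A treats RGBD as a single 4-channel input, Option B embeds RGB and depth separately and sums the results. All sizes, weights, and helper names (`patchify`, `linear`, `rand_matrix`) are hypothetical and not the authors' implementation.

```python
import random


def patchify(img, p):
    """Split an H x W x C nested list into non-overlapping p x p patches.

    Each patch is flattened to a list of p*p*C values.
    """
    h, w = len(img), len(img[0])
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append([v
                            for y in range(top, top + p)
                            for x in range(left, left + p)
                            for v in img[y][x]])
    return patches


def linear(patches, weight):
    """Project each flattened patch (length K) with a K x D weight matrix."""
    dim = len(weight[0])
    return [[sum(a * row[d] for a, row in zip(patch, weight)) for d in range(dim)]
            for patch in patches]


def rand_matrix(rows, cols, rng):
    """Random K x D weight matrix standing in for a learned projection."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H = W = 8   # toy image size
P = 4       # patch size
DIM = 6     # toy embedding dimension

rgb = [[[rng.random() for _ in range(3)] for _ in range(W)] for _ in range(H)]
depth = [[[rng.random()] for _ in range(W)] for _ in range(H)]

# Option A: concatenate depth as a fourth channel and use one projection.
rgbd = [[rgb[y][x] + depth[y][x] for x in range(W)] for y in range(H)]
tokens_a = linear(patchify(rgbd, P), rand_matrix(P * P * 4, DIM, rng))

# Option B: separate projections for RGB and depth, summed per patch.
tok_rgb = linear(patchify(rgb, P), rand_matrix(P * P * 3, DIM, rng))
tok_d = linear(patchify(depth, P), rand_matrix(P * P * 1, DIM, rng))
tokens_b = [[a + b for a, b in zip(r, d)] for r, d in zip(tok_rgb, tok_d)]

print(len(tokens_a), len(tokens_a[0]))  # 4 6
print(len(tokens_b), len(tokens_b[0]))  # 4 6
```

Both options yield the same token shape. In Option B the RGB projection can be reused unchanged for inputs that have no depth channel.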
◼
• Single-view 3D
◼
◼
• Single-view 3D 3D
•

Paper introduction: Omnivore: A Single Model for Many Visual Modalities