State of transformers in Computer Vision

•Download as PPTX, PDF•

1 like•149 views

Transformers have rapidly come up as a challenger network architecture to traditional convnets in computer vision. Here is a quick landscape analysis of the state of transformers in vision, as of 2021.

Data & Analytics

Transformers are getting popular in vision after taking over
NLP
NLP
Vision

Self-attention also exists in image processing

Transformers in Vision
Self-attention
Single-head Multi-head
Non-local
Criss-cross
Single-scale
Multi-scale
+Convolutions
Self-supervised
ViT
T-in-T
LV-ViT
XCit
PViT* **
SegFormer*
CvT**
CLIP
CoAtNet
DINO

ViT: Vision Transformer
Recipe for success:
- Large deep model
- Pretrained on large image
database

T-in-T: Transformer in Transformer
Recipe for success:
Computing attention at two levels

LV-ViT: Token Labeling Transformers
Recipe for success:
Auxiliary loss based on image patches

XCiT: Cross-covariance Transformers
Recipe for success:
- Efficient self-attention
- Processing of high-resolution
images

PViT: Pyramid ViT* **
Recipe for success:
- Progressive shrinking to capture
different scales
- Spatial reduction attention

SegFormer: Segmentation Transformer*
Recipe for success:
- Progressive shrinking to capture
different scales
- Overlapping windows

CvT: Convolutional Vision Transformers**
Recipe for success:
Convolution based projection to capture
the spatial structure and low-level
details

CoAtNet: Marrying Convolutions and Attention
Recipe for success:
Fusing the useful properties of
convolution and attention

DINO: Self distillation without labels
Recipe for success:
Self supervised pre training

CLIP: Multi modal self supervision
Recipe for success:
Fusing text and images in pre
training

Big models with big challenges
Cons
- Need large amounts of data
- Lack of inductive bias, as in
convolutions
- High compute costs
- Because of large amounts of data
- Need for specific hardware
- No Interpretability
- Multi head attention is hard to interpret
Pros
- High “capacity” to learn general
features
- Loose argument: self attention is more
global than convolutions
- Ideally suited transfer learning
paradigm
- Train once, use always

What's hot

210523 swin transformer v1.5taeseon ryu

Swin transformerJAEMINJEONG5

ViT.pptxChangjin Lee

lecun-01.pptVenkyChinna8

Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAILviv Startup Club

【DL輪読会】ViT + Self Supervised LearningまとめDeep Learning JP

Few shot learning/ one shot learning/ machine learningﺁﺻﻒ ﻋﻠﯽ ﻣﯿﺮ

Introduction to Visual transformers leopauly

BERTKhang Pham

Emerging Properties in Self-Supervised Vision TransformersSungchul Kim

A brief introduction to recent segmentation methodsShunta Saito

“How Transformers are Changing the Direction of Deep Learning Architectures,”...Edge AI and Vision Alliance

[DL輪読会]"CyCADA: Cycle-Consistent Adversarial Domain Adaptation"&"Learning Se...Deep Learning JP

【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)Deep Learning JP

Survey of Attention mechanismSwatiNarkhede1

Bayesian Neural NetworksNatan Katz

Self-Attention with Linear ComplexitySangwoo Mo

Attention Is All You NeedIllia Polosukhin

【DL輪読会】A Time Series is Worth 64 Words: Long-term Forecasting with TransformersDeep Learning JP

AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)Fellowship at Vodafone FutureLab

What's hot (20)

210523 swin transformer v1.5

Swin transformer

ViT.pptx

lecun-01.ppt

Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAI

【DL輪読会】ViT + Self Supervised Learningまとめ

Few shot learning/ one shot learning/ machine learning

Introduction to Visual transformers

BERT

Emerging Properties in Self-Supervised Vision Transformers

A brief introduction to recent segmentation methods

“How Transformers are Changing the Direction of Deep Learning Architectures,”...

[DL輪読会]"CyCADA: Cycle-Consistent Adversarial Domain Adaptation"&"Learning Se...

【DL輪読会】言語以外でのTransformerのまとめ (ViT, Perceiver, Frozen Pretrained Transformer etc)

Survey of Attention mechanism

Bayesian Neural Networks

Self-Attention with Linear Complexity

Attention Is All You Need

【DL輪読会】A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdfHuman37

Data Science Jobs and Salaries Analysis.pptxFurkanTasci3

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal

Brighton SEO | April 2024 | Data StorytellingNeil Barnes

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo

办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort

Decoding Loan Approval: Predictive Modeling in ActionBoston Institute of Analytics

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Universitat Politècnica de Catalunya

9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh

Call Girls in Saket 99530🔝 56974 Escort Service9953056974 Low Rate Call Girls In Saket, Delhi NCR

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083

From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck

Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson

1:1定制(UQ毕业证）昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf

Data Science Jobs and Salaries Analysis.pptx

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...

Brighton SEO | April 2024 | Data Storytelling

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改

办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)

Decoding Loan Approval: Predictive Modeling in Action

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)

9654467111 Call Girls In Munirka Hotel And Home Service

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...

Call Girls in Saket 99530🔝 56974 Escort Service

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call

From idea to production in a day – Leveraging Azure ML and Streamlit to build...

Schema on read is obsolete. Welcome metaprogramming..pdf

1:1定制(UQ毕业证）昆士兰大学毕业证成绩单修改留信学历认证原版一模一样

State of transformers in Computer Vision

1. Transformers in Vision Deep Kayal

2. Transformers are getting popular in vision after taking over NLP NLP Vision

3. Ingredient 1: Multi-head self-attention

4. Ingredient 2: Transformer architecture

5. Ingredient 3: Self-supervision NLP

6. Ingredient 3: Self-supervision Vision

7. Self-attention also exists in image processing

8. ...with efficient variants

9. Transformers in Vision Self-attention Single-head Multi-head Non-local Criss-cross Single-scale Multi-scale +Convolutions Self-supervised ViT T-in-T LV-ViT XCit PViT* ** SegFormer* CvT** CLIP CoAtNet DINO

10. ViT: Vision Transformer Recipe for success: - Large deep model - Pretrained on large image database

11. T-in-T: Transformer in Transformer Recipe for success: Computing attention at two levels

12. LV-ViT: Token Labeling Transformers Recipe for success: Auxiliary loss based on image patches

13. XCiT: Cross-covariance Transformers Recipe for success: - Efficient self-attention - Processing of high-resolution images

14. PViT: Pyramid ViT* ** Recipe for success: - Progressive shrinking to capture different scales - Spatial reduction attention

15. SegFormer: Segmentation Transformer* Recipe for success: - Progressive shrinking to capture different scales - Overlapping windows

16. CvT: Convolutional Vision Transformers** Recipe for success: Convolution based projection to capture the spatial structure and low-level details

17. CoAtNet: Marrying Convolutions and Attention Recipe for success: Fusing the useful properties of convolution and attention

18. DINO: Self distillation without labels Recipe for success: Self supervised pre training

19. CLIP: Multi modal self supervision Recipe for success: Fusing text and images in pre training

20. Big models with big challenges Cons - Need large amounts of data - Lack of inductive bias, as in convolutions - High compute costs - Because of large amounts of data - Need for specific hardware - No Interpretability - Multi head attention is hard to interpret Pros - High “capacity” to learn general features - Loose argument: self attention is more global than convolutions - Ideally suited transfer learning paradigm - Train once, use always

Editor's Notes

Why does it capture better features?

State of transformers in Computer Vision

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

More from Deep Kayal

More from Deep Kayal (6)

Recently uploaded

Recently uploaded (20)

State of transformers in Computer Vision

Editor's Notes