Transformers have rapidly emerged as a challenger to traditional convnets as a network architecture for computer vision. Here is a quick landscape analysis of the state of transformers in vision, as of 2021.
19. CLIP: Multi-modal self-supervision
Recipe for success:
Fusing text and images in pre-training
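The core of that recipe is a contrastive objective: embed each image and its caption, then push matching pairs together and mismatched pairs apart. Below is a minimal NumPy sketch of such a symmetric contrastive (InfoNCE-style) loss; the function names and the temperature default are illustrative assumptions, not CLIP's actual implementation.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumes row i of image_emb and row i of text_emb describe the same
    underlying image (hypothetical sketch, not CLIP's exact code).
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # All-pairs similarity matrix, sharpened by a temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    diag = np.arange(n)
    # Matching pairs sit on the diagonal; take cross-entropy both ways
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2
```

With perfectly aligned pairs the loss is close to zero; with unrelated embeddings it approaches log(batch_size), which is what makes it a useful pre-training signal.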
20. Big models with big challenges
Cons
- Need large amounts of data
- Lack the inductive biases that convolutions provide (e.g. locality, translation equivariance)
- High compute costs
  - Driven by the large amounts of training data
  - Need for specialized hardware
- Low interpretability
  - Multi-head attention is hard to interpret
Pros
- High “capacity” to learn general features
- Loose argument: self-attention is more global than convolutions
- Ideally suited to the transfer-learning paradigm
  - Train once, reuse widely
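The "more global" point above can be made concrete by counting layers needed for global context. A stride-1 convolution only grows its receptive field by (kernel_size - 1) positions per layer, while full self-attention connects every pair of tokens in a single layer. A small sketch (a 1D simplification with hypothetical function names):

```python
import math

def conv_layers_for_global_context(num_positions, kernel_size=3):
    """Layers of stride-1, undilated convolution needed before every
    position can influence every other (1D simplification)."""
    return math.ceil((num_positions - 1) / (kernel_size - 1))

# Full self-attention connects every pair of tokens in one layer,
# so a single layer already gives global context.
attention_layers_for_global_context = 1
```

For example, along one 14-patch side of a ViT-style grid (a 224x224 image split into 16x16 patches), a stack of 3-wide convolutions needs `conv_layers_for_global_context(14)` = 7 layers to see the whole row, whereas one self-attention layer suffices. This is the loose sense in which self-attention is "more global".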