This document discusses transformer models applied to vision tasks, including image generation, action recognition, and object detection. It traces the evolution of transformer architectures from their original formulation to their applications in image and video processing, and highlights the benefits of large-scale training. The conclusion emphasizes the ability of transformers to model relationships between pixels and their potential to replace CNNs when adequately pre-trained.