This document reviews key concepts behind visual transformers, including query-key-value attention, pooling, multi-head attention, and unsupervised representation learning. It then summarizes several state-of-the-art papers applying transformers to computer vision tasks, such as image classification with ViT, object detection with DETR, and generative pretraining from pixels. Additional work extending visual transformers to segmentation, video analysis, and captioning is also briefly mentioned.
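Since query-key-value attention and multi-head attention are the core concepts named above, here is a minimal, illustrative PyTorch sketch of both. It is not taken from any of the summarized papers; the module name, dimensions, and head count are arbitrary choices for the example.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # similarity of each query to every key
    weights = F.softmax(scores, dim=-1)           # attention weights sum to 1 over keys
    return weights @ v                            # weighted sum of values


class MultiHeadAttention(torch.nn.Module):
    """Illustrative multi-head self-attention (not the exact ViT/DETR implementation)."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # linear projections producing queries, keys, and values for all heads at once
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.out_proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        b, n, dim = x.shape

        def split_heads(t):
            # reshape (b, n, dim) -> (b, heads, n, head_dim) so each head attends independently
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))
        out = scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, dim)  # concatenate heads back together
        return self.out_proj(out)


# Toy usage: a batch of 2 "images", each as 16 patch embeddings of dimension 64.
x = torch.randn(2, 16, 64)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 16, 64])
```

In a ViT-style model, `x` would be the sequence of linearly projected image patches (plus a class token and position embeddings), and blocks like this alternate with MLP layers.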