The document summarizes the Vision Transformer (ViT) model, which applies a transformer architecture to image classification. ViT splits an image into fixed-size patches, maps each patch to an embedding with a learned linear projection, adds position embeddings, and feeds the resulting sequence (with a prepended classification token) into a standard transformer encoder. Unlike CNNs, ViT has weak inductive biases for 2D structure such as locality and translation equivariance, so it must learn spatial relationships from data and therefore benefits from large-scale pre-training. With sufficient data, however, ViT can match or outperform comparable CNNs by leveraging global self-attention across the whole image. A minimal sketch of this pipeline follows.
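
The sketch below illustrates the patch-embed, add-positions, encode, classify-from-token flow described above. It is a minimal illustration, not the reference implementation: the class name `MiniViT`, the 224x224 input, 16x16 patches, and all hyperparameters (embedding width, depth, head count) are assumptions chosen for brevity, and it uses PyTorch's built-in `nn.TransformerEncoder` rather than the paper's exact encoder details (e.g. pre-norm placement).

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier; sizes are illustrative, not the paper's."""

    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=256, depth=4, num_heads=4, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: a conv with kernel = stride = patch size splits the
        # image into non-overlapping patches and projects each one to embed_dim.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable classification token and position embeddings
        # (one position per patch plus one for the class token).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard transformer encoder over the patch sequence.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, embed_dim, H/p, W/p) -> (B, num_patches, embed_dim)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)

        # Prepend the class token and add position embeddings.
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed

        x = self.encoder(x)

        # Classify from the class-token representation.
        return self.head(self.norm(x[:, 0]))

# Usage: a batch of two 224x224 RGB images produces two logit vectors.
model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```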