The document summarizes the DINO self-supervised learning approach for vision transformers. DINO uses a teacher-student framework where the teacher's predictions are used to supervise the student through knowledge distillation. Two global and several local views of an image are passed through the student, while only global views are passed through the teacher. The student is trained to match the teacher's predictions for local views. DINO achieves state-of-the-art results on ImageNet with linear evaluation and transfers well to downstream tasks. It also enables vision transformers to discover object boundaries and semantic layouts.