PR-318: Emerging Properties in Self-Supervised Vision Transformers (DINO)
Sungchul Kim
https://arxiv.org/abs/2104.14294
https://github.com/facebookresearch/dino
https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
Contents
1. Introduction
2. Related Work
3. Approach
4. Main Results
5. Ablation Study of DINO
2
Introduction
3
• The resulting Vision Transformers (ViT) are competitive with convnets
• computationally more demanding
• require more training data
• their features do not exhibit unique properties
• In this paper,
• The impact of self-supervised pretraining on ViT features
• Several interesting properties that do not emerge with supervised ViTs, nor with convnets
• The scene layout and, in particular, object boundaries
• A basic nearest neighbors classifier (k-NN) without any finetuning
Related Work
4
• Self-supervised learning
• Instance classification : SimCLR (PR-231), MoCo (PR-260), …
• Without discriminating between images : BYOL, SimSiam (PR-305), …
• Mean Teacher self-distillation
• Self-training and knowledge distillation
• Knowledge Distillation (G. Hinton et al., 2015)
• Pseudo-label (Hard), Noisy Student (Soft)
• Self-supervised learning + Knowledge Distillation
• Self-supervised model compression (SEED)
• Performance gains (SimCLR v2)
SEED: Self-supervised Distillation for Visual Representation
https://arxiv.org/pdf/2101.04731.pdf
Big Self-Supervised Models are Strong Semi-Supervised Learners (SimCLR v2)
https://arxiv.org/pdf/2006.10029.pdf
Approach
5
• Self-Supervised Learning with Knowledge Distillation
• The same overall structure as recent self-supervised approaches + Knowledge Distillation
Approach
6
• Self-Supervised Learning with Knowledge Distillation
• Knowledge Distillation (KD)
• Student network 𝑔𝜃𝑠 is trained to match the output of a given teacher network 𝑔𝜃𝑡
• 𝑃𝑠 and 𝑃𝑡 : output probability distributions over 𝐾 dimensions of both networks
$$P_s(x)^{(i)} = \frac{\exp\big(g_{\theta_s}(x)^{(i)} / \tau_s\big)}{\sum_{k=1}^{K} \exp\big(g_{\theta_s}(x)^{(k)} / \tau_s\big)}$$
• 𝜏𝑠 > 0 : a temperature parameter that controls the sharpness of the output distribution
• Match these distributions by minimizing the cross-entropy loss with a fixed teacher network 𝑔𝜃𝑡
$$\min_{\theta_s} H\big(P_t(x),\, P_s(x)\big)$$
• $H(a, b) = -a \log b$
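A minimal PyTorch sketch of the two equations above (function and variable names are illustrative, not taken from the DINO code): temperature-scaled softmax for the student, a detached fixed-teacher distribution, and the cross-entropy H(P_t, P_s).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    # Student distribution via log-softmax for numerical stability.
    log_p_s = F.log_softmax(student_logits / tau_s, dim=-1)
    # Teacher is fixed: no gradient flows through it.
    p_t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    # H(P_t, P_s) = -sum_k P_t log P_s, averaged over the batch.
    return -(p_t * log_p_s).sum(dim=-1).mean()
```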
Approach
7
• Self-Supervised Learning with Knowledge Distillation
• How do we adapt the KD problem to self-supervised learning (SSL)?
• Different distorted views, or crops, of an image with multi crop strategy
• Two global views (𝑥1
𝑔
and 𝑥2
𝑔
) and several local views of smaller resolution : a set 𝑽 of different views
• All crops are passed through the student & Only the global views are passed through the teacher
→ “local-to-global” correspondences
$$\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\sum_{\substack{x' \in V \\ x' \neq x}} H\big(P_t(x),\, P_s(x')\big)$$
• 2 global views at resolution 224² covering a large area (>50%) of the original image
• Several local views at resolution 96² covering only small areas (<50%) of the original image
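A sketch of the multi-crop objective above, reusing `distillation_loss` from the previous sketch; the teacher only sees the two global views, the student sees every view, and the term with x' = x is skipped. How the views are listed and ordered is an assumption (the two global views are assumed to come first in the student list, matching the teacher list).

```python
def multicrop_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    # student_out: list of logits for all views; teacher_out: logits for the 2 global views.
    total, n_terms = 0.0, 0
    for t_idx, t_logits in enumerate(teacher_out):       # global views only
        for s_idx, s_logits in enumerate(student_out):   # all views
            if s_idx == t_idx:                           # skip the identical view
                continue
            total = total + distillation_loss(s_logits, t_logits, tau_s, tau_t)
            n_terms += 1
    return total / n_terms
```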
Approach
8
• Self-Supervised Learning with Knowledge Distillation
• Teacher network
• Two types of different update rules for the teacher
• Freezing the teacher network over an epoch
→ Works surprisingly well!
• Exponential Moving Average (EMA) on the student weights (e.g., MoCo, BYOL, mean teacher)
• $\theta_t \leftarrow \lambda \theta_t + (1 - \lambda)\theta_s$, with $\lambda$ following a cosine schedule from 0.996 to 1 during training
• Stochastic Weight Averaging (SWA) (Polyak-Ruppert averaging) : the EMA teacher acts as a form of model ensembling
→ Performs best among the update rules tried
• Copying the student weight for the teacher
→ Fails to converge
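A hedged sketch of the EMA teacher update with the cosine momentum schedule described above (helper name and per-step granularity are illustrative choices, not the DINO API).

```python
import math
import torch

@torch.no_grad()
def update_teacher(teacher, student, step, total_steps, base_m=0.996):
    # Cosine schedule: momentum goes from base_m (0.996) to 1.0 over training.
    m = 1.0 - (1.0 - base_m) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
    # theta_t <- m * theta_t + (1 - m) * theta_s
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_((1.0 - m) * p_s.detach())
```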
Approach
9
• Self-Supervised Learning with Knowledge Distillation
• Network architecture
• The neural network 𝑔 : A backbone 𝒇 (ViT or ResNet) + A projection head 𝒉 → 𝑔 = ℎ ∘ 𝑓
• A projection head : a 3-layer multi-layer perceptron (MLP) with 2048 dim + 𝑙2 norm + weight normalized FC with 𝐾 dim (SwAV)
• NO predictor → the same architecture in both student & teacher
• For ViT backbone → NO BN in the projection heads : entirely BN-free
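A sketch of a DINO-style projection head as described above: a 3-layer MLP with 2048 hidden dims, an l2-normalization bottleneck, and a weight-normalized linear layer to K output dimensions. The 256-dim bottleneck and K = 65536 follow the public DINO implementation defaults and are assumptions, not stated on this slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        # 3-layer MLP (no BN, so the ViT variant stays entirely BN-free).
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Weight-normalized final layer producing the K-dimensional output.
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)   # l2 normalization before the last layer
        return self.last_layer(x)
```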
• Avoiding collapse
• Only a centering and sharpening of the momentum teacher outputs
• Centering : less dependence on the batch
→ prevents any one dimension from dominating, but encourages collapse to the uniform distribution
$$g_t(x) \leftarrow g_t(x) + c, \qquad c \leftarrow m c + (1 - m)\,\frac{1}{B}\sum_{i=1}^{B} g_{\theta_t}(x_i)$$
• Sharpening : using a low value for the temperature 𝜏𝑡 in the teacher softmax normalization
→ the opposite effect
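A sketch of centering + sharpening on the teacher side: the running center c is an EMA of the teacher's batch outputs, and a low teacher temperature sharpens P_t. Note that the public DINO code applies the center by subtracting it from the teacher logits before the sharpened softmax, while the paper phrases centering as adding a bias term; class and parameter names here are illustrative.

```python
import torch
import torch.nn.functional as F

class TeacherPostprocess:
    def __init__(self, out_dim, momentum=0.9, tau_t=0.04):
        self.center = torch.zeros(out_dim)
        self.m = momentum
        self.tau_t = tau_t

    @torch.no_grad()
    def __call__(self, teacher_logits):
        self.center = self.center.to(teacher_logits.device)
        # Centered + sharpened teacher distribution (center subtracted, as in the public code).
        p_t = F.softmax((teacher_logits - self.center) / self.tau_t, dim=-1)
        # EMA update of the center: c <- m*c + (1 - m) * batch mean of teacher outputs.
        self.center = self.m * self.center + (1.0 - self.m) * teacher_logits.mean(dim=0)
        return p_t
```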
Approach
10
• Implementation and evaluation protocols
• Vision Transformer (ViT)
• Transformer → ViT → DeiT
https://arxiv.org/pdf/1706.03762.pdf
https://arxiv.org/pdf/2010.11929.pdf
https://arxiv.org/pdf/2012.12877.pdf
Approach
11
• Implementation and evaluation protocols
• Vision Transformer (ViT)
• N=16 (“/16”) or N=8 (“/8”)
• The patches → linear layer → a set of embeddings
• The class token [CLS] for consistency with previous works
• NOT attached to any label or supervision
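A sketch of the ViT input pipeline described above: the image is split into patches with a strided convolution (the usual linear patch projection) and a learnable [CLS] token is prepended. embed_dim = 384 corresponds to DeiT-S and is an assumption; shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv with stride = kernel = patch_size acts as the linear patch projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)              # prepend the [CLS] token
```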
Approach
12
• Implementation and evaluation protocols
• Implementation details
• Pretrain the models on the ImageNet dataset without labels
• AdamW optimizer & 1024 batch size (16 GPUs, DeiT-S/16)
• Learning rate warmup : 𝑙𝑟 = 0.0005 ∗ batchsize/256 for the first 10 epochs
• After warmup : learning rate decay with a cosine schedule
• Weight decay : a cosine schedule from 0.04 to 0.4
• Temperature : 𝜏𝑠 = 0.1 / 𝜏𝑡 = 0.04~0.07 (a linear warm-up for the first 30 epochs)
• Augmentation of BYOL
• Color jittering, Gaussian blur, Solarization, Multi-crop with a bicubic interpolation
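A sketch of the schedules listed above: a linear warm-up followed by a cosine ramp, reused for the learning rate (0.0005 × batchsize/256, 10-epoch warm-up, then decay) and for the weight decay (ramped from 0.04 up to 0.4). The final lr value and the steps-per-epoch numbers are illustrative assumptions.

```python
import math

def cosine_schedule(base, final, step, total_steps, warmup_steps=0):
    # Linear warm-up from 0 to `base`, then cosine interpolation from `base` to `final`.
    if step < warmup_steps:
        return base * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final + 0.5 * (base - final) * (1.0 + math.cos(math.pi * t))

batch_size, steps_per_epoch, total_epochs = 1024, 1251, 100   # illustrative numbers
total_steps = total_epochs * steps_per_epoch
lr_now = cosine_schedule(0.0005 * batch_size / 256, 1e-6, step=5000,
                         total_steps=total_steps, warmup_steps=10 * steps_per_epoch)
wd_now = cosine_schedule(0.04, 0.4, step=5000, total_steps=total_steps)
```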
Approach
13
• Implementation and evaluation protocols
• Evaluation protocols
• Linear evaluation (with frozen backbone)
• Random resize crops & Horizontal flips (training) / Central crop (test)
• Fine-tuning
• Initialize networks with pre-trained weights & Adapt them during training
→ So sensitive to hyperparameters!
• A simple weighted nearest neighbor classifier (k-NN)
• Compute and store the features of the training data of the downstream task
• Match the feature of an image to the k nearest stored features, which vote for the label
• NO hyperparameter tuning & data augmentation and SIMPLE!
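A sketch of the weighted k-NN evaluation: store l2-normalized training features, take the k nearest by cosine similarity, and let them cast similarity-weighted votes for the label. k = 20 and T = 0.07 follow the paper's defaults; the function name is illustrative.

```python
import torch

@torch.no_grad()
def knn_predict(query_feat, train_feats, train_labels, num_classes, k=20, T=0.07):
    # query_feat: (D,), train_feats: (N, D), both l2-normalized; train_labels: (N,) long tensor.
    sims = train_feats @ query_feat                 # cosine similarities, shape (N,)
    topk_sims, topk_idx = sims.topk(k)
    weights = (topk_sims / T).exp()                 # similarity-weighted votes
    votes = torch.zeros(num_classes)
    votes.index_add_(0, train_labels[topk_idx], weights)
    return votes.argmax().item()
```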
Main Results
14
• Comparing with SSL frameworks on ImageNet
• Comparing with the same architecture
Main Results
15
• Comparing with SSL frameworks on ImageNet
• Comparing across architectures
Main Results
16
• Properties of ViT trained with SSL
• Nearest neighbor retrieval with DINO ViT
• Image Retrieval
• Oxford and Paris image retrieval datasets
• 3 different splits of gradual difficulty with query/database pairs
• Mean Average Precision (mAP) for the Medium (M) and Hard (H) splits
https://arxiv.org/pdf/1803.11285.pdf
Main Results
17
• Properties of ViT trained with SSL
• Nearest neighbor retrieval with DINO ViT
• Copy detection
• The “strong” subset of the INRIA Copydays dataset
• To recognize images that have been distorted by blur, insertions, printing and scanning, etc.
• YFCC100M (Yahoo Flickr Creative Commons 100 Million) dataset
• Randomly sampled 10k distractor images & an extra 20k random images for learning a whitening transformation
https://hal.inria.fr/inria-00394212/document
http://projects.dfki.uni-kl.de/yfcc100m/
Main Results
18
• Properties of ViT trained with SSL
• Discovering the semantic layout of scenes
• Video instance segmentation
• DAVIS-2017 video instance segmentation benchmark
• Segment scenes by nearest-neighbor propagation between consecutive frames
https://davischallenge.org/
https://arxiv.org/pdf/2006.14613.pdf
Main Results
19
• Properties of ViT trained with SSL
• Discovering the semantic layout of scenes
• Probing the self-attention map
• Different heads can attend to different semantic regions of an image
Main Results
20
• Properties of ViT trained with SSL
• Discovering the semantic layout of scenes
• Probing the self-attention map
• Different heads can attend to different semantic regions of an image
• Attribution Guided Factorization Visualization
https://arxiv.org/pdf/2012.02166.pdf
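A sketch of probing the [CLS] self-attention of the last block, one map per head. It assumes the `get_last_selfattention` helper exposed by the VisionTransformer in the public facebookresearch/dino repo; the reshape to a patch grid is standard, and variable names are illustrative.

```python
import torch

@torch.no_grad()
def cls_attention_maps(vit, image, patch_size=16):
    # attn: (B, num_heads, 1 + num_patches, 1 + num_patches)
    attn = vit.get_last_selfattention(image.unsqueeze(0))
    n_heads = attn.shape[1]
    h = image.shape[-2] // patch_size
    w = image.shape[-1] // patch_size
    # Attention of the [CLS] token (query 0) over all patch tokens, per head.
    maps = attn[0, :, 0, 1:].reshape(n_heads, h, w)
    return maps   # one (h, w) attention map per head
```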
Main Results
21
• Properties of ViT trained with SSL
• Transfer learning on downstream tasks
• Evaluate the quality of the features pretrained with DINO on different downstream tasks
• Compare with features from the same architectures trained with supervision on ImageNet
• Follow the DeiT training protocol & fine-tune the features on each downstream task
Ablation Study of DINO
22
• Importance of the Different Components
• Different model variants as we add or remove components
Ablation Study of DINO
23
• Importance of the Different Components
• Importance of the patch size
• Different patch sizes : DeiT-S : 16x16, 8x8, 5x5 / ViT-B : 16x16, 8x8
• Trade-off : improved performance vs. reduced throughput (as the patch size decreases)
• Impact of the choice of Teacher Network
• Building different teachers from the student & Analyzing the training dynamics
Ablation Study of DINO
24
• Avoiding collapse
• Two forms of collapse : regardless of the input, the model output is either uniform along all dimensions or dominated by one dimension
$$H(P_t, P_s) = h(P_t) + D_{\mathrm{KL}}(P_t \,\|\, P_s)$$
• If KL → 0 : a constant output (collapse)
• If centering or sharpening is missing → collapse, but the entropy ℎ converges to a different value!
• Centering only : collapse to a uniform output → high entropy!
• Sharpening only : the opposite, collapse to one dominant dimension → entropy near zero!
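A sketch of the decomposition above: tracking the teacher entropy h(P_t) and KL(P_t ∥ P_s) separately shows which kind of collapse occurs (KL → 0 with high h for uniform collapse, h → 0 when one dimension dominates). Names are illustrative.

```python
import torch

def entropy_and_kl(p_t, p_s, eps=1e-8):
    # p_t, p_s: (B, K) probability distributions.
    h = -(p_t * (p_t + eps).log()).sum(dim=-1).mean()              # teacher entropy h(P_t)
    kl = (p_t * ((p_t + eps).log() - (p_s + eps).log())).sum(dim=-1).mean()
    cross_entropy = h + kl                                         # equals H(P_t, P_s)
    return h, kl, cross_entropy
```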
Ablation Study of DINO
25
• Compute requirements
• DeiT-S/16 DINO on two 8 GPU machines
• The “local-to-global” augmentation
• Training with small batches
• Scale the learning rate linearly with the batch size
𝑙𝑟 = 0.0005 ∗ batchsize/256
• 128 batch size : 1 GPU
• 8 batch size : 35.2% after 50 epochs
(Figure: example image clusters, originally labeled in Korean: birds, snakes, reptiles, monkeys and orangutans, cats, dogs, vehicles, miscellaneous objects, food)
Thank you!
28
