PR-318: Emerging Properties in Self-Supervised Vision Transformers (DINO)
Sungchul Kim
https://arxiv.org/abs/2104.14294
https://github.com/facebookresearch/dino
https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
Contents
1. Introduction
2. Related Work
3. Approach
4. Main Results
5. Ablation Study of DINO
2
Introduction
3
• The resulting Vision Transformers (ViT) are competitive with convnets
• computationally more demanding
• require more training data
• their features do not exhibit unique properties
• In this paper,
• The impact of self-supervised pretraining on ViT features
• Several interesting properties that do not emerge with supervised ViTs, nor with convnets
• The scene layout and, in particular, object boundaries
• A basic nearest neighbors classifier (k-NN) without any finetuning
Related Work
4
• Self-supervised learning
• Instance classification : SimCLR (PR-231), MoCo (PR-260), …
• Without discriminating between images : BYOL, SimSiam (PR-305), …
• Mean Teacher self-distillation
• Self-training and knowledge distillation
• Knowledge Distillation (G. Hinton et al., 2015)
• Pseudo-label (Hard), Noisy Student (Soft)
• Self-supervised learning + Knowledge Distillation
• Self-supervised model compression (SEED)
• Performance gains (SimCLR v2)
SEED: Self-supervised Distillation for Visual Representation
https://arxiv.org/pdf/2101.04731.pdf
Big Self-Supervised Models are Strong Semi-Supervised Learners (SimCLR v2)
https://arxiv.org/pdf/2006.10029.pdf
Approach
5
• Self-Supervised Learning with Knowledge Distillation
• The same overall structure as recent self-supervised approaches + Knowledge Distillation
Approach
6
• Self-Supervised Learning with Knowledge Distillation
• Knowledge Distillation (KD)
• Student network 𝑔𝜃𝑠 is trained to match the output of a given teacher network 𝑔𝜃𝑡
• 𝑃𝑠 and 𝑃𝑡 : output probability distributions over 𝐾 dimensions of both networks
$$P_s(x)^{(i)} = \frac{\exp\big(g_{\theta_s}(x)^{(i)} / \tau_s\big)}{\sum_{k=1}^{K} \exp\big(g_{\theta_s}(x)^{(k)} / \tau_s\big)}$$
• 𝜏𝑠 > 0 : a temperature parameter that controls the sharpness of the output distribution
• Match these distributions by minimizing the cross-entropy loss with a fixed teacher network 𝑔𝜃𝑡
$$\min_{\theta_s} H\big(P_t(x),\, P_s(x)\big)$$
• $H(a, b) = -a \log b$
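A minimal PyTorch sketch of the two equations above (function and variable names are illustrative, not taken from the DINO code): temperature-scaled softmax for the student, a detached fixed-teacher distribution, and the cross-entropy H(P_t, P_s).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    # Student distribution via log-softmax for numerical stability.
    log_p_s = F.log_softmax(student_logits / tau_s, dim=-1)
    # Teacher is fixed: no gradient flows through it.
    p_t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    # H(P_t, P_s) = -sum_k P_t log P_s, averaged over the batch.
    return -(p_t * log_p_s).sum(dim=-1).mean()
```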
Approach
7
• Self-Supervised Learning with Knowledge Distillation
• How do we adapt the KD problem to self-supervised learning (SSL)?
• Different distorted views, or crops, of an image with multi crop strategy
• Two global views (𝑥1
𝑔
and 𝑥2
𝑔
) and several local views of smaller resolution : a set 𝑽 of different views
• All crops are passed through the student & Only the global views are passed through the teacher
→ “local-to-global” correspondences
$$\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\sum_{\substack{x' \in V \\ x' \neq x}} H\big(P_t(x),\, P_s(x')\big)$$
• 2 global views at resolution 224² covering a large area (>50%) of the original image
• Several local views at resolution 96² covering only small areas (<50%) of the original image
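A sketch of the multi-crop objective above, reusing `distillation_loss` from the previous sketch; the teacher only sees the two global views, the student sees every view, and the term with x' = x is skipped. How the views are listed and ordered is an assumption (the two global views are assumed to come first in the student list, matching the teacher list).

```python
def multicrop_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    # student_out: list of logits for all views; teacher_out: logits for the 2 global views.
    total, n_terms = 0.0, 0
    for t_idx, t_logits in enumerate(teacher_out):       # global views only
        for s_idx, s_logits in enumerate(student_out):   # all views
            if s_idx == t_idx:                           # skip the identical view
                continue
            total = total + distillation_loss(s_logits, t_logits, tau_s, tau_t)
            n_terms += 1
    return total / n_terms
```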
Approach
8
• Self-Supervised Learning with Knowledge Distillation
• Teacher network
• Two types of different update rules for the teacher
• Freezing the teacher network over an epoch
→ Works surprisingly well!
• Exponential Moving Average (EMA) on the student weights (e.g., MoCo, BYOL, mean teacher)
• $\theta_t \leftarrow \lambda \theta_t + (1 - \lambda)\theta_s$, with $\lambda$ following a cosine schedule from 0.996 to 1 during training
• Stochastic Weight Averaging (SWA) (Polyak-Ruppert averaging) : the EMA teacher acts as a form of model ensembling
→ Performs best among the update rules tried
• Copying the student weight for the teacher
→ Fails to converge
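A hedged sketch of the EMA teacher update with the cosine momentum schedule described above (helper name and per-step granularity are illustrative choices, not the DINO API).

```python
import math
import torch

@torch.no_grad()
def update_teacher(teacher, student, step, total_steps, base_m=0.996):
    # Cosine schedule: momentum goes from base_m (0.996) to 1.0 over training.
    m = 1.0 - (1.0 - base_m) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
    # theta_t <- m * theta_t + (1 - m) * theta_s
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_((1.0 - m) * p_s.detach())
```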
Approach
9
• Self-Supervised Learning with Knowledge Distillation
• Network architecture
• The neural network 𝑔 : A backbone 𝒇 (ViT or ResNet) + A projection head 𝒉 → 𝑔 = ℎ ∘ 𝑓
• A projection head : a 3-layer multi-layer perceptron (MLP) with 2048 dim + 𝑙2 norm + weight normalized FC with 𝐾 dim (SwAV)
• NO predictor → the same architecture in both student & teacher
• For ViT backbone → NO BN in the projection heads : entirely BN-free
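A sketch of a DINO-style projection head as described above: a 3-layer MLP with 2048 hidden dims, an l2-normalization bottleneck, and a weight-normalized linear layer to K output dimensions. The 256-dim bottleneck and K = 65536 follow the public DINO implementation defaults and are assumptions, not stated on this slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        # 3-layer MLP (no BN, so the ViT variant stays entirely BN-free).
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Weight-normalized final layer producing the K-dimensional output.
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)   # l2 normalization before the last layer
        return self.last_layer(x)
```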
• Avoiding collapse
• Only a centering and sharpening of the momentum teacher outputs
• Centering : less dependence on the batch
→ prevents any one dimension from dominating, but encourages collapse to the uniform distribution
$$g_t(x) \leftarrow g_t(x) + c, \qquad c \leftarrow m c + (1 - m)\,\frac{1}{B}\sum_{i=1}^{B} g_{\theta_t}(x_i)$$
• Sharpening : using a low value for the temperature 𝜏𝑡 in the teacher softmax normalization
→ the opposite effect
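A sketch of centering + sharpening on the teacher side: the running center c is an EMA of the teacher's batch outputs, and a low teacher temperature sharpens P_t. Note that the public DINO code applies the center by subtracting it from the teacher logits before the sharpened softmax, while the paper phrases centering as adding a bias term; class and parameter names here are illustrative.

```python
import torch
import torch.nn.functional as F

class TeacherPostprocess:
    def __init__(self, out_dim, momentum=0.9, tau_t=0.04):
        self.center = torch.zeros(out_dim)
        self.m = momentum
        self.tau_t = tau_t

    @torch.no_grad()
    def __call__(self, teacher_logits):
        self.center = self.center.to(teacher_logits.device)
        # Centered + sharpened teacher distribution (center subtracted, as in the public code).
        p_t = F.softmax((teacher_logits - self.center) / self.tau_t, dim=-1)
        # EMA update of the center: c <- m*c + (1 - m) * batch mean of teacher outputs.
        self.center = self.m * self.center + (1.0 - self.m) * teacher_logits.mean(dim=0)
        return p_t
```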
Approach
10
• Implementation and evaluation protocols
• Vision Transformer (ViT)
• Transformer → ViT → DeiT
https://arxiv.org/pdf/1706.03762.pdf
https://arxiv.org/pdf/2010.11929.pdf
https://arxiv.org/pdf/2012.12877.pdf
Approach
11
• Implementation and evaluation protocols
• Vision Transformer (ViT)
• N=16 (“/16”) or N=8 (“/8”)
• The patches → linear layer → a set of embeddings
• The class token [CLS] for consistency with previous works
• NOT attached to any label or supervision
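A sketch of the ViT input pipeline described above: the image is split into patches with a strided convolution (the usual linear patch projection) and a learnable [CLS] token is prepended. embed_dim = 384 corresponds to DeiT-S and is an assumption; shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv with stride = kernel = patch_size acts as the linear patch projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)              # prepend the [CLS] token
```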
Approach
12
• Implementation and evaluation protocols
• Implementation details
• Pretrain the models on the ImageNet dataset without labels
• AdamW optimizer & 1024 batch size (16 GPUs, DeiT-S/16)
• Learning rate warmup : 𝑙𝑟 = 0.0005 ∗ batchsize/256 for the first 10 epochs
• After warmup : learning rate decay with a cosine schedule
• Weight decay : a cosine schedule from 0.04 to 0.4
• Temperature : 𝜏𝑠 = 0.1 / 𝜏𝑡 = 0.04~0.07 (a linear warm-up for the first 30 epochs)
• Augmentation of BYOL
• Color jittering, Gaussian blur, Solarization, Multi-crop with a bicubic interpolation
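A sketch of the schedules listed above: a linear warm-up followed by a cosine ramp, reused for the learning rate (0.0005 × batchsize/256, 10-epoch warm-up, then decay) and for the weight decay (ramped from 0.04 up to 0.4). The final lr value and the steps-per-epoch numbers are illustrative assumptions.

```python
import math

def cosine_schedule(base, final, step, total_steps, warmup_steps=0):
    # Linear warm-up from 0 to `base`, then cosine interpolation from `base` to `final`.
    if step < warmup_steps:
        return base * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final + 0.5 * (base - final) * (1.0 + math.cos(math.pi * t))

batch_size, steps_per_epoch, total_epochs = 1024, 1251, 100   # illustrative numbers
total_steps = total_epochs * steps_per_epoch
lr_now = cosine_schedule(0.0005 * batch_size / 256, 1e-6, step=5000,
                         total_steps=total_steps, warmup_steps=10 * steps_per_epoch)
wd_now = cosine_schedule(0.04, 0.4, step=5000, total_steps=total_steps)
```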
Approach
13
• Implementation and evaluation protocols
• Evaluation protocols
• Linear evaluation (with frozen backbone)
• Random resize crops & Horizontal flips (training) / Central crop (test)
• Fine-tuning
• Initialize networks with pre-trained weights & Adapt them during training
→ So sensitive to hyperparameters!
• A simple weighted nearest neighbor classifier (k-NN)
• Compute and store the features of the training data of the downstream task
• Match the feature of an image to the k nearest stored features, which vote for the label
• NO hyperparameter tuning & data augmentation and SIMPLE!
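A sketch of the weighted k-NN evaluation: store l2-normalized training features, take the k nearest by cosine similarity, and let them cast similarity-weighted votes for the label. k = 20 and T = 0.07 follow the paper's defaults; the function name is illustrative.

```python
import torch

@torch.no_grad()
def knn_predict(query_feat, train_feats, train_labels, num_classes, k=20, T=0.07):
    # query_feat: (D,), train_feats: (N, D), both l2-normalized; train_labels: (N,) long tensor.
    sims = train_feats @ query_feat                 # cosine similarities, shape (N,)
    topk_sims, topk_idx = sims.topk(k)
    weights = (topk_sims / T).exp()                 # similarity-weighted votes
    votes = torch.zeros(num_classes)
    votes.index_add_(0, train_labels[topk_idx], weights)
    return votes.argmax().item()
```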
Main Results
14
• Comparing with SSL frameworks on ImageNet
• Comparing with the same architecture
Main Results
15
• Comparing with SSL frameworks on ImageNet
• Comparing across architectures
Main Results
16
• Properties of ViT trained with SSL
• Nearest neighbor retrieval with DINO ViT
• Image Retrieval
• Oxford and Paris image retrieval datasets
• 3 different splits of gradual difficulty with query/database pairs
• Mean Average Precision (mAP) for the Medium (M) and Hard (H) splits
https://arxiv.org/pdf/1803.11285.pdf
Main Results
17
• Properties of ViT trained with SSL
• Nearest neighbor retrieval with DINO ViT
• Copy detection
• The “strong” subset of the INRIA Copydays dataset
• To recognize images that have been distorted by blur, insertions, printing and scanning, etc.
• YFCC100M (Yahoo Flickr Creative Commons 100 Million) dataset
• Randomly sampled 10k distractor images & an extra 20k random images for learning a whitening transformation
https://hal.inria.fr/inria-00394212/document
http://projects.dfki.uni-kl.de/yfcc100m/
Main Results
18
• Properties of ViT trained with SSL
• Discovering the semantic layout of scenes
• Video instance segmentation
• DAVIS-2017 video instance segmentation benchmark
• Segment scenes by nearest-neighbor propagation between consecutive frames
https://davischallenge.org/
https://arxiv.org/pdf/2006.14613.pdf
Main Results
19
• Properties of ViT trained with SSL
• Discovering the semantic layout of scenes
• Probing the self-attention map
• Different heads can attend to different semantic regions of an image
Main Results
20
• Properties of ViT trained with SSL
• Discovering the semantic layout of scenes
• Probing the self-attention map
• Different heads can attend to different semantic regions of an image
• Attribution Guided Factorization Visualization
https://arxiv.org/pdf/2012.02166.pdf
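A sketch of probing the [CLS] self-attention of the last block, one map per head. It assumes the `get_last_selfattention` helper exposed by the VisionTransformer in the public facebookresearch/dino repo; the reshape to a patch grid is standard, and variable names are illustrative.

```python
import torch

@torch.no_grad()
def cls_attention_maps(vit, image, patch_size=16):
    # attn: (B, num_heads, 1 + num_patches, 1 + num_patches)
    attn = vit.get_last_selfattention(image.unsqueeze(0))
    n_heads = attn.shape[1]
    h = image.shape[-2] // patch_size
    w = image.shape[-1] // patch_size
    # Attention of the [CLS] token (query 0) over all patch tokens, per head.
    maps = attn[0, :, 0, 1:].reshape(n_heads, h, w)
    return maps   # one (h, w) attention map per head
```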
Main Results
21
• Properties of ViT trained with SSL
• Transfer learning on downstream tasks
• Evaluate the quality of the features pretrained with DINO on different downstream tasks
• Compare with features from the same architectures trained with supervision on ImageNet
• Follow the DeiT training protocol & fine-tune the features on each downstream task
Ablation Study of DINO
22
• Importance of the Different Components
• Different model variants as we add or remove components
Ablation Study of DINO
23
• Importance of the Different Components
• Importance of the patch size
• Different patch sizes : DeiT-S : 16x16, 8x8, 5x5 / ViT-B : 16x16, 8x8
• Trade-off : improved performance vs. reduced throughput (as the patch size decreases)
• Impact of the choice of Teacher Network
• Building different teachers from the student & Analyzing the training dynamics
Ablation Study of DINO
24
• Avoiding collapse
• Two forms of collapse : regardless of the input, the model output is either uniform along all dimensions or dominated by one dimension
$$H(P_t, P_s) = h(P_t) + D_{\mathrm{KL}}(P_t \,\|\, P_s)$$
• If KL → 0 : a constant output (collapse)
• If centering or sharpening is missing → collapse, but the entropy ℎ converges to a different value!
• Centering only : collapse to a uniform output → high entropy!
• Sharpening only : the opposite, collapse to one dominant dimension → entropy near zero!
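A sketch of the decomposition above: tracking the teacher entropy h(P_t) and KL(P_t ∥ P_s) separately shows which kind of collapse occurs (KL → 0 with high h for uniform collapse, h → 0 when one dimension dominates). Names are illustrative.

```python
import torch

def entropy_and_kl(p_t, p_s, eps=1e-8):
    # p_t, p_s: (B, K) probability distributions.
    h = -(p_t * (p_t + eps).log()).sum(dim=-1).mean()              # teacher entropy h(P_t)
    kl = (p_t * ((p_t + eps).log() - (p_s + eps).log())).sum(dim=-1).mean()
    cross_entropy = h + kl                                         # equals H(P_t, P_s)
    return h, kl, cross_entropy
```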
Ablation Study of DINO
25
• Compute requirements
• DeiT-S/16 DINO on two 8 GPU machines
• The “local-to-global” augmentation
• Training with small batches
• Scale the learning rate linearly with the batch size
𝑙𝑟 = 0.0005 ∗ batchsize/256
• 128 batch size : 1 GPU
• 8 batch size : 35.2% after 50 epochs
(Figure: example image clusters, originally labeled in Korean: birds, snakes, reptiles, monkeys and orangutans, cats, dogs, vehicles, miscellaneous objects, food)
Thank you!
28
