PR-318
Sungchul Kim
https://arxiv.org/abs/2104.14294
https://github.com/facebookresearch/dino
https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
Contents
1. Introduction
2. Related Work
3. Approach
4. Main Results
5. Ablation Study of DINO
Introduction
• Vision Transformers (ViT) are competitive with convnets, but have not yet shown clear benefits over them
• They are computationally more demanding
• They require more training data
• Their features do not exhibit unique properties
• In this paper,
• The impact of self-supervised pretraining on ViT features is studied
• Several interesting properties emerge that do not appear with supervised ViTs, nor with convnets
• The features explicitly contain the scene layout and, in particular, object boundaries
• The features perform well with a basic nearest neighbors classifier (k-NN) without any finetuning
Related Work
• Self-supervised learning
• Instance classification : SimCLR (PR-231), MoCo (PR-260), …
• Without discriminating between images : BYOL, SimSiam (PR-305), …
• Mean Teacher self-distillation
• Self-training and knowledge distillation
• Knowledge Distillation (G. Hinton et al., 2015)
• Pseudo-label (Hard), Noisy Student (Soft)
• Self-supervised learning + Knowledge Distillation
• Self-supervised model compression (SEED)
• Performance gains (SimCLR v2)
SEED: Self-supervised distillation for visual representation : https://arxiv.org/pdf/2101.04731.pdf
Big Self-Supervised Models are Strong Semi-Supervised Learners (SimCLR v2) : https://arxiv.org/pdf/2006.10029.pdf
Approach
• Self-Supervised Learning with Knowledge Distillation
• The same overall structure as recent self-supervised approaches + Knowledge Distillation
• Knowledge Distillation (KD)
• Student network $g_{\theta_s}$
→ Given teacher network $g_{\theta_t}$
• $P_s$ and $P_t$ : output probability distributions over $K$ dimensions of both networks
$$P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)}$$
• $\tau_s > 0$ : a temperature parameter that controls the sharpness of the output distribution
• Match these distributions by minimizing the cross-entropy loss with a fixed teacher network $g_{\theta_t}$
$$\min_{\theta_s} H\left(P_t(x), P_s(x)\right)$$
• $H(a, b) = -a \log b$
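The temperature softmax and the cross-entropy objective above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the K = 4 logits, and the small `eps` guard are assumptions made for the example, and the temperature values are only illustrative.

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau; subtracting the max keeps exp() stable."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_teacher, p_student, eps=1e-12):
    """H(a, b) = -sum_k a_k * log(b_k); eps (an assumption) avoids log(0)."""
    return -sum(a * math.log(b + eps) for a, b in zip(p_teacher, p_student))

# Hypothetical K = 4 network outputs for one input x.
student_logits = [1.0, 0.5, -0.2, 0.1]
teacher_logits = [1.2, 0.3, -0.1, 0.0]

p_s = softmax_with_temperature(student_logits, tau=0.1)
p_t = softmax_with_temperature(teacher_logits, tau=0.04)
loss = cross_entropy(p_t, p_s)  # quantity minimized w.r.t. the student
```

A lower temperature concentrates the probability mass on the largest logit, which is what "sharpness" refers to here.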
• How is the KD problem adapted to self-supervised learning (SSL)?
• Different distorted views, or crops, of an image with the multi-crop strategy
• Two global views ($x_1^g$ and $x_2^g$) and several local views of smaller resolution : a set $V$ of different views
• All crops are passed through the student & Only the global views are passed through the teacher
→ “local-to-global” correspondences
$$\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} H\left(P_t(x), P_s(x')\right)$$
• 2 global views at resolution $224^2$ covering a large area (>50%) of the original image
• Several local views of resolution $96^2$ covering only small areas (<50%) of the original image
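The double sum in the multi-crop objective can be made concrete with a small sketch. It assumes the teacher and student outputs have already been turned into probability distributions; the function names, the choice of 6 local views, and the uniform toy outputs are hypothetical and only serve to count the loss terms.

```python
import math

def cross_entropy(p_t, p_s, eps=1e-12):
    """H(a, b) = -sum_k a_k * log(b_k); eps guards against log(0)."""
    return -sum(a * math.log(b + eps) for a, b in zip(p_t, p_s))

def multicrop_loss(teacher_probs, student_probs):
    """Sum H(P_t(x), P_s(x')) over global views x and all views x' != x.

    teacher_probs: distributions for the global views only.
    student_probs: distributions for all views, global views first,
                   so index i refers to the same view in both lists.
    Returns (summed loss, number of terms).
    """
    total, n_terms = 0.0, 0
    for i, p_t in enumerate(teacher_probs):
        for j, p_s in enumerate(student_probs):
            if i == j:  # skip the pair where x' is the same view as x
                continue
            total += cross_entropy(p_t, p_s)
            n_terms += 1
    return total, n_terms

# Toy setup: K = 4, 2 global views + 6 local views, all outputs uniform.
uniform = [0.25] * 4
loss, n_terms = multicrop_loss([uniform] * 2, [uniform] * 8)
```

With 2 global views and 8 views in total, each global view is compared against the 7 other views, giving 14 terms. Implementations may average over the terms rather than sum them; the sum is used here to match the formula above.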
• Teacher network
• Different update rules for the teacher are compared
• Freezing the teacher network over an epoch
→ Works surprisingly well!
• Exponential Moving Average (EMA) on the student weights (e.g., MoCo, BYOL, mean teacher)
• $\theta_t \leftarrow \lambda \theta_t + (1 - \lambda) \theta_s$, with $\lambda$ following a cosine schedule from 0.996 to 1 during training
• Related to Stochastic Weight Averaging (SWA) and Polyak-Ruppert averaging : the teacher acts as a form of model ensembling
• Copying the student weights to the teacher
→ Fails to converge
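The EMA rule and its cosine schedule can be sketched as follows. This is an illustration under stated assumptions: the weights are flattened into plain Python lists, the function names are hypothetical, and the schedule is indexed by training step, which the slide does not specify.

```python
import math

def momentum_schedule(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule for lambda, rising from base (step 0) to final (last step)."""
    progress = step / max(1, total_steps - 1)
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(teacher, student, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element by element."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]

# Toy weights: the teacher drifts slowly toward the student.
teacher = [1.0, 1.0]
student = [0.0, 2.0]
total_steps = 100
for step in range(total_steps):
    lam = momentum_schedule(step, total_steps)
    teacher = ema_update(teacher, student, lam)
```

Because lambda stays close to 1, each update moves the teacher only slightly, and at the end of the schedule (lambda = 1) the teacher is no longer updated at all.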
• Network architecture
• The neural network 𝑔 : A backbone 𝒇 (ViT or ResNet) + A projection head 𝒉 → 𝑔 = ℎ ∘ 𝑓
• A projection head : a 3-layer multi-layer perceptron (MLP) with 2048 dim + 𝑙2 norm + weight normalized FC with 𝐾 dim (SwAV)
• NO predictor : exactly the same architecture in both student & teacher
• For ViT backbone → NO BN in the projection heads : entirely BN-free
• Avoiding collapse
• Only a centering and sharpening of the momentum teacher outputs
• Centering : less dependence over the batch
→ prevents one dimension from dominating but encourages collapse to the uniform distribution
$$g_t(x) \leftarrow g_t(x) + c, \qquad c \leftarrow m c + (1 - m) \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)$$
• Sharpening : using a low value for the temperature 𝜏𝑡 in the teacher softmax normalization
→ the opposite effect
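Centering and sharpening of the teacher output can be sketched together. This is a hedged illustration, not the reference implementation: the function names and the momentum `m=0.9` are assumptions, and the sketch follows the sign convention of the paper's pseudocode, where the running mean is subtracted from the teacher output before the softmax (so the bias $c$ in the formula above plays the role of the negative running mean).

```python
import math

def softmax(logits, tau):
    """Temperature softmax with max-subtraction for numerical stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, teacher_batch, m=0.9):
    """c <- m * c + (1 - m) * mean over the batch of teacher outputs."""
    batch_size = len(teacher_batch)
    batch_mean = [sum(row[k] for row in teacher_batch) / batch_size
                  for k in range(len(center))]
    return [m * c + (1 - m) * bm for c, bm in zip(center, batch_mean)]

def teacher_probs(teacher_out, center, tau_t=0.04):
    """Center first (subtract the running mean), then sharpen with a low tau_t."""
    centered = [z - c for z, c in zip(teacher_out, center)]
    return softmax(centered, tau_t)

# Toy batch of B = 2 teacher outputs with K = 2 dimensions.
center = update_center([0.0, 0.0], [[1.0, 3.0], [3.0, 5.0]], m=0.5)
probs = teacher_probs([5.0, 0.0], center=[5.0, 0.0])
```

In the last line the first dimension would dominate without centering; once the center is removed the two dimensions are tied, which illustrates why centering alone pushes toward the uniform distribution and why sharpening is needed to counteract it.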
• Implementation and evaluation protocols
• Vision Transformer (ViT)
• Transformer → ViT → DeiT
https://arxiv.org/pdf/1706.03762.pdf
https://arxiv.org/pdf/2010.11929.pdf
https://arxiv.org/pdf/2012.12877.pdf
• Input : a grid of non-overlapping image patches of resolution N×N, with N=16 (“/16”) or N=8 (“/8”)
• The patches → linear layer → a set of embeddings
• The class token [CLS] for consistency with previous works
• NOT attached to any label nor supervision
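As a worked example of the patch grid, the following sketch counts the tokens a ViT sees for a given crop and patch size. The helper names are hypothetical, and the sketch assumes the image side is an exact multiple of the patch size.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping patch_size x patch_size patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size in this sketch")
    per_side = image_size // patch_size
    return per_side * per_side

def sequence_length(image_size, patch_size):
    """Patch embeddings plus one extra [CLS] token."""
    return num_patches(image_size, patch_size) + 1

global_16 = sequence_length(224, 16)  # 14 x 14 patches + [CLS]
global_8 = sequence_length(224, 8)    # 28 x 28 patches + [CLS]
local_16 = sequence_length(96, 16)    # 6 x 6 patches + [CLS]
```

Halving the patch size quadruples the number of patches, which is why the “/8” models are more expensive to run than the “/16” models.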
• Implementation details
• Pretrain the models on the ImageNet dataset without labels
• AdamW optimizer & 1024 batch size (16 GPUs, DeiT-S/16)
• Learning rate : linearly warmed up during the first 10 epochs to the base value 𝑙𝑟 = 0.0005 ∗ batchsize/256
• After warmup : learning rate decay with a cosine schedule
• Weight decay : a cosine schedule from 0.04 to 0.4
• Temperature : 𝜏𝑠 = 0.1 ; 𝜏𝑡 linearly warmed up from 0.04 to 0.07 during the first 30 epochs
• Augmentation of BYOL
• Color jittering, Gaussian blur, Solarization, Multi-crop with a bicubic interpolation
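The schedules listed above can be sketched per epoch. This is an illustration only: the function names are hypothetical, the warmup is assumed to start from 0, the final learning rate `final_lr=1e-6` is an assumed value not given on the slide, and real training code would typically step these schedules per iteration rather than per epoch.

```python
import math

def scaled_lr(batch_size, base=0.0005):
    """Linear scaling rule: lr = 0.0005 * batchsize / 256."""
    return base * batch_size / 256

def cosine(start, end, progress):
    """Cosine interpolation from start (progress 0) to end (progress 1)."""
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))

def lr_at_epoch(epoch, total_epochs, peak_lr, warmup_epochs=10, final_lr=1e-6):
    """Linear warmup to peak_lr, then cosine decay (final_lr is an assumption)."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return cosine(peak_lr, final_lr, progress)

def weight_decay_at_epoch(epoch, total_epochs, start=0.04, end=0.4):
    """Weight decay follows a cosine schedule from 0.04 to 0.4."""
    return cosine(start, end, epoch / total_epochs)

def teacher_temp_at_epoch(epoch, start=0.04, end=0.07, warmup_epochs=30):
    """Teacher temperature: linear warmup from 0.04 to 0.07, then constant."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

peak = scaled_lr(1024)  # batch size 1024 gives 0.0005 * 4 = 0.002
```

For the batch size of 1024 mentioned above, the scaling rule gives a peak learning rate of 0.002.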
• Evaluation protocols
• Linear evaluation (with frozen backbone)
• Random resize crops & Horizontal flips (training) / Central crop (test)
• Fine-tuning
• Initialize networks with pre-trained weights & Adapt them during training
→ So sensitive to hyperparameters!
• A simple weighted nearest neighbor classifier (k-NN)
• Compute and store the features of the training data of the downstream task
• Match the feature of an image to the k nearest stored features, which vote for the label
• Simple : no hyperparameter tuning and no data augmentation
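The weighted k-NN protocol above can be sketched in plain Python. The weighting scheme is an assumption for illustration: each neighbor votes with weight exp(similarity / temperature) on cosine similarity, and the defaults `k=20` and `temperature=0.07` are illustrative values rather than settings stated on the slide. The function names and toy features are hypothetical.

```python
import math
from collections import defaultdict

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def knn_predict(query, train_feats, train_labels, k=20, temperature=0.07):
    """Weighted k-NN: the k most similar stored features vote for the label."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += math.exp(sim / temperature)  # closer neighbors weigh more
    return max(votes, key=votes.get)

# Toy stored features of a downstream training set (frozen backbone outputs).
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([1.0, 0.05], train_feats, train_labels, k=3)
```

No parameters are trained here: the stored features are computed once, and classification reduces to a similarity search.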
Main Results
• Comparing with SSL frameworks on ImageNet
• Comparing with the same architecture
• Comparing across architectures
• Properties of ViT trained with SSL
• Nearest neighbor retrieval with DINO ViT
• Image Retrieval
• Oxford and Paris image retrieval datasets
• 3 different splits of gradual difficulty with query/database pairs
• Mean Average Precision (mAP) for the Medium (M) and Hard (H) splits
https://arxiv.org/pdf/1803.11285.pdf
• Copy detection
• The “strong” subset of the INRIA Copydays dataset
• To recognize images that have been distorted by blur, insertions, print and scan, etc.
• YFCC100M (Yahoo Flickr Creative Commons 100 Million) dataset
• 10k randomly sampled distractor images & an extra 20k random images for learning the whitening transformation
https://hal.inria.fr/inria-00394212/document
http://projects.dfki.uni-kl.de/yfcc100m/
• Discovering the semantic layout of scenes
• Video instance segmentation
• DAVIS-2017 video instance segmentation benchmark
• Segment scenes with a nearest neighbor between consecutive frames
https://davischallenge.org/
https://arxiv.org/pdf/2006.14613.pdf
• Probing the self-attention map
• Different heads can attend to different semantic regions of an image
• Attribution Guided Factorization Visualization
https://arxiv.org/pdf/2012.02166.pdf
• Transfer learning on downstream tasks
• Evaluate the quality of the features pretrained with DINO on different downstream tasks
• Compare with features from the same architectures trained with supervision on ImageNet
• Training protocol of DeiT & finetune the features on each downstream task
Ablation Study of DINO
• Importance of the Different Components
• Different model variants as we add or remove components
• Importance of the patch size
• Different patch sizes : DeiT-S : 16x16, 8x8, 5x5 / ViT-B : 16x16, 8x8
• Trade-off : smaller patches improve performance but reduce throughput
• Impact of the choice of Teacher Network
• Building different teachers from the student & Analyzing the training dynamic
• Avoiding collapse
• Collapse : regardless of the input, the model output is uniform along all the dimensions, or dominated by one dimension
$$H(P_t, P_s) = h(P_t) + D_{KL}(P_t \,\|\, P_s)$$
• If KL → 0 : a constant output (collapse)
• If either centering or sharpening is missing → collapse, but the entropy ℎ converges to a different value in each case!
• Centering : uniform output → high entropy!
• Sharpening : opposite effect → low entropy!
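The decomposition of the cross-entropy into entropy and KL divergence, and the two collapse regimes, can be checked numerically. This is a small sketch with hypothetical distributions over K = 4 dimensions; the function names are illustrative.

```python
import math

def entropy(p):
    """h(p) = -sum_k p_k * log(p_k), with 0 * log(0) taken as 0."""
    return -sum(a * math.log(a) for a in p if a > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k)."""
    return -sum(a * math.log(b) for a, b in zip(p, q) if a > 0)

# Hypothetical teacher and student distributions.
p_t = [0.7, 0.2, 0.05, 0.05]
p_s = [0.4, 0.3, 0.2, 0.1]
lhs = cross_entropy(p_t, p_s)
rhs = entropy(p_t) + kl_divergence(p_t, p_s)

# The two collapse modes differ in entropy.
uniform = [0.25] * 4            # uniform output: entropy log(K)
one_hot = [1.0, 0.0, 0.0, 0.0]  # one dimension dominates: entropy 0
```

A KL term near zero only says that student and teacher agree; the entropy term is what distinguishes a uniform collapse (entropy log K) from a collapse onto a single dimension (entropy 0).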
• Compute requirements
• DeiT-S/16 DINO on two 8 GPU machines
• The “local-to-global” augmentation
• Training with small batches
• Scale the learning rate linearly with the batch size
𝑙𝑟 = 0.0005 ∗ batchsize/256
• Batch size 128 : runs on 1 GPU
• Batch size 8 : 35.2% after 50 epochs
Birds
Snakes
Reptiles
Monkeys, orangutans
Cats
Dogs
Vehicles
Miscellaneous objects
Food
Thank you!
Search to Distill: Pearls are Everywhere but not the Eyes
 
Supervised Constrastive Learning
Supervised Constrastive LearningSupervised Constrastive Learning
Supervised Constrastive Learning
 
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
Bootstrap Your Own Latent: A New Approach to Self-Supervised LearningBootstrap Your Own Latent: A New Approach to Self-Supervised Learning
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
 
FickleNet: Weakly and Semi-supervised Semantic Image Segmentation using Stoch...
FickleNet: Weakly and Semi-supervised Semantic Image Segmentation using Stoch...FickleNet: Weakly and Semi-supervised Semantic Image Segmentation using Stoch...
FickleNet: Weakly and Semi-supervised Semantic Image Segmentation using Stoch...
 

Emerging Properties in Self-Supervised Vision Transformers

  • 2. Contents 1. Introduction 2. Related Work 3. Approach 4. Main Results 5. Ablation Study of DINO 2
  • 3. Introduction 3 • The resulting Vision Transformers (ViT) are competitive with convnets • computationally more demanding • require more training data • their features do not exhibit unique properties • In this paper, • The impact of self-supervised pretraining on ViT features • Several interesting properties that do not emerge with supervised ViTs, nor with convnets • The scene layout and, in particular, object boundaries • A basic nearest neighbors classifier (k-NN) without any finetuning
  • 4. Related Work 4 • Self-supervised learning • Instance classification : SimCLR (PR-231), MoCo (PR-260), … • Without discriminating between images : BYOL, SimSiam (PR-305), … • Mean Teacher self-distillation • Self-training and knowledge distillation • Knowledge Distillation (G. Hinton et al., 2015) • Pseudo-label (Hard), Noisy Student (Soft) • Self-supervised learning + Knowledge Distillation • Self-supervised model compression (SEED) • Performance gains (SimCLR v2) • SEED: Self-supervised distillation for visual representation https://arxiv.org/pdf/2101.04731.pdf • Big Self-Supervised Models are Strong Semi-Supervised Learners (SimCLR v2) https://arxiv.org/pdf/2006.10029.pdf
  • 5. Approach 5 • Self-Supervised Learning with Knowledge Distillation • The same overall structure as recent self-supervised approaches + Knowledge Distillation
  • 6. Approach 6 • Self-Supervised Learning with Knowledge Distillation • Knowledge Distillation (KD) • A student network $g_{\theta_s}$ is trained to match the output of a given teacher network $g_{\theta_t}$ • $P_s$ and $P_t$ : output probability distributions over $K$ dimensions of both networks • $P_s(x)^{(i)} = \frac{\exp(g_{\theta_s}(x)^{(i)} / \tau_s)}{\sum_{k=1}^{K} \exp(g_{\theta_s}(x)^{(k)} / \tau_s)}$ • $\tau_s > 0$ : a temperature parameter that controls the sharpness of the output distribution • Match these distributions by minimizing the cross-entropy loss with a fixed teacher network $g_{\theta_t}$ : $\min_{\theta_s} H(P_t(x), P_s(x))$ • $H(a, b) = -a \log b$
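The temperature softmax and the cross-entropy objective on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' code: the function names, the example logits, and the `eps` guard are assumptions; the temperatures 0.1 and 0.04 are the values quoted later in the deck.

```python
import math

def softmax_with_temperature(logits, tau):
    """P(x)^(i) = exp(g(x)^(i) / tau) / sum_k exp(g(x)^(k) / tau)."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_teacher, p_student, eps=1e-12):
    """H(a, b) = -sum_i a_i * log(b_i); eps (an assumption) guards log(0)."""
    return -sum(a * math.log(b + eps) for a, b in zip(p_teacher, p_student))

# Hypothetical K = 3 outputs of the two networks for one image.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]

p_t = softmax_with_temperature(teacher_logits, tau=0.04)  # sharper
p_s = softmax_with_temperature(student_logits, tau=0.1)
loss = cross_entropy(p_t, p_s)
print(p_t, p_s, loss)
```

A lower temperature concentrates the probability mass on the largest logit, which is why the teacher distribution comes out sharper than the student's here.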
  • 7. Approach 7 • Self-Supervised Learning with Knowledge Distillation • How do we adapt the KD problem to self-supervised learning (SSL)? • Different distorted views, or crops, of an image with the multi-crop strategy • Two global views ($x_1^g$ and $x_2^g$) and several local views of smaller resolution : a set $V$ of different views • All crops are passed through the student & only the global views are passed through the teacher → “local-to-global” correspondences • $\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \sum_{x' \in V,\, x' \neq x} H(P_t(x), P_s(x'))$ • 2 global views at resolution $224^2$ covering a large area (>50%) of the original image • Several local views of resolution $96^2$ covering only small areas (<50%) of the original image
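The double sum in the multi-crop objective can be made concrete with a small sketch. It assumes the probability vectors have already been computed, that the first entries of the student list correspond to the same global views the teacher sees, and that 6 local views are used; the function name and the toy distributions are hypothetical.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    return -sum(a * math.log(b + eps) for a, b in zip(p, q))

def multicrop_loss(teacher_probs, student_probs):
    """Sum of H(P_t(x), P_s(x')) over global views x and all views x' != x.

    teacher_probs: distributions for the global views only.
    student_probs: distributions for all views; entry i (for i below the
    number of global views) is assumed to be the same view as teacher entry i.
    Returns the summed loss and the number of (x, x') pairs.
    """
    total, n_terms = 0.0, 0
    for i, p_t in enumerate(teacher_probs):
        for j, p_s in enumerate(student_probs):
            if i == j:  # skip the pair where student and teacher see the same view
                continue
            total += cross_entropy(p_t, p_s)
            n_terms += 1
    return total, n_terms

# Toy case: K = 4, uniform outputs, 2 global + 6 local views (assumed count).
uniform = [0.25] * 4
total, n_terms = multicrop_loss([uniform] * 2, [uniform] * 8)
print(total, n_terms)
```

With 2 global views and 8 views in total, the loss has 2 × 7 = 14 terms; implementations often divide by this count, which only rescales the objective.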
  • 8. Approach 8 • Self-Supervised Learning with Knowledge Distillation • Teacher network • Different types of update rules for the teacher • Freezing the teacher network over an epoch → Works surprisingly well! • Exponential Moving Average (EMA) on the student weights (e.g., MoCo, BYOL, mean teacher) • $\theta_t \leftarrow \lambda \theta_t + (1 - \lambda) \theta_s$, with $\lambda$ following a cosine schedule from 0.996 to 1 during training • Stochastic Weight Averaging (SWA) (Polyak-Ruppert averaging) • Well…. • Copying the student weights for the teacher → Fails to converge
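The EMA teacher update with a cosine momentum schedule can be sketched as follows. Weights are represented as a flat list of floats for simplicity, and the per-step granularity of the schedule and the function names are assumptions; only the update rule and the 0.996 → 1 range come from the slide.

```python
import math

def momentum_at(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule that moves lambda from `base` to `final`."""
    progress = step / max(1, total_steps - 1)
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(teacher, student, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element by element."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]

# Hypothetical two-parameter networks.
teacher = [1.0, -2.0]
student = [0.0, 0.0]
total_steps = 100
for step in range(total_steps):
    teacher = ema_update(teacher, student, momentum_at(step, total_steps))
print(teacher)
```

As lambda approaches 1 the teacher changes more and more slowly, so late in training it behaves almost like a frozen network.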
  • 9. Approach 9 • Self-Supervised Learning with Knowledge Distillation • Network architecture • The neural network $g$ : A backbone $f$ (ViT or ResNet) + A projection head $h$ → $g = h \circ f$ • A projection head : a 3-layer multi-layer perceptron (MLP) with 2048 dim + $\ell_2$ normalization + weight-normalized FC with $K$ dim (SwAV) • NO predictor : the same architecture in both student & teacher • For ViT backbone → NO BN in the projection heads : entirely BN-free • Avoiding collapse • Only a centering and sharpening of the momentum teacher outputs • Centering : less dependence over the batch → prevents one dimension from dominating but encourages collapse to the uniform distribution • $g_t(x) \leftarrow g_t(x) + c$, $c \leftarrow m c + (1 - m) \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)$ • Sharpening : using a low value for the temperature $\tau_t$ in the teacher softmax normalization → the opposite effect
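A rough sketch of centering and sharpening on the teacher outputs follows. Two things here are assumptions rather than statements from the slide: the center momentum value `m = 0.9`, and the sign convention, since the slide writes the bias as `+ c` while this sketch subtracts the running mean of the teacher outputs before the softmax. Function names and example values are hypothetical.

```python
import math

def softmax(logits, tau):
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, batch_outputs, m=0.9):
    """c <- m * c + (1 - m) * mean over the batch of teacher outputs."""
    batch_size = len(batch_outputs)
    batch_mean = [sum(col) / batch_size for col in zip(*batch_outputs)]
    return [m * c + (1 - m) * b for c, b in zip(center, batch_mean)]

def teacher_probs(logits, center, tau_t=0.04):
    """Centering (shift by the running mean) followed by sharpening (low tau_t)."""
    centered = [z - c for z, c in zip(logits, center)]  # sign is an assumption
    return softmax(centered, tau_t)

# Hypothetical batch of B = 2 teacher outputs with K = 2 dimensions.
center = update_center([0.0, 0.0], [[1.0, 3.0], [3.0, 5.0]])
probs = teacher_probs([1.0, 2.0], center)
print(center, probs)
```

Centering pushes the teacher toward the uniform distribution and sharpening pushes it toward a peaked one; the slide's point is that the two effects balance each other.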
  • 10. Approach 10 • Implementation and evaluation protocols • Vision Transformer (ViT) • Transformer → ViT → DeiT https://arxiv.org/pdf/1706.03762.pdf https://arxiv.org/pdf/2010.11929.pdf https://arxiv.org/pdf/2012.12877.pdf
  • 11. Approach 11 • Implementation and evaluation protocols • Vision Transformer (ViT) • N=16 (“/16”) or N=8 (“/8”) • The patches → linear layer → a set of embeddings • The class token [CLS] for consistency with previous works • NOT attached to any label nor supervision
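The effect of the patch size on the sequence length can be worked out with simple arithmetic. The sketch below assumes a square input at the 224 global-view resolution mentioned earlier in the deck and one extra [CLS] token; the function name is hypothetical.

```python
def num_tokens(image_size, patch_size):
    """Number of N x N patches in a square image, plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by the patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

tokens_16 = num_tokens(224, 16)  # "/16" models
tokens_8 = num_tokens(224, 8)    # "/8" models
print(tokens_16, tokens_8)
```

Halving the patch size roughly quadruples the number of tokens, which is the source of the extra compute for the "/8" models.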
  • 12. Approach 12 • Implementation and evaluation protocols • Implementation details • Pretrain the models on the ImageNet dataset without labels • AdamW optimizer & 1024 batch size (16 GPUs, DeiT-S/16) • Learning rate warmup : linear ramp over the first 10 epochs up to the base value $lr = 0.0005 \times \text{batchsize}/256$ • After warmup : learning rate decay with a cosine schedule • Weight decay : a cosine schedule from 0.04 to 0.4 • Temperature : $\tau_s = 0.1$ / $\tau_t$ from 0.04 to 0.07 (a linear warm-up for the first 30 epochs) • Augmentation of BYOL • Color jittering, Gaussian blur, Solarization, Multi-crop with a bicubic interpolation
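The learning-rate rule on this slide (linear scaling with the batch size, warmup, then cosine decay) can be sketched as below. The per-epoch granularity, the total of 100 epochs, and the final value `final_lr = 0.0` are assumptions made for illustration; the scaling rule and the 10 warmup epochs come from the slide.

```python
import math

def base_lr(batch_size, ref_lr=0.0005, ref_batch=256):
    """Linear scaling rule: lr = 0.0005 * batchsize / 256."""
    return ref_lr * batch_size / ref_batch

def lr_at(epoch, total_epochs, batch_size, warmup_epochs=10, final_lr=0.0):
    """Linear warmup to the base value, then cosine decay to final_lr."""
    peak = base_lr(batch_size)
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + (peak - final_lr) * 0.5 * (1 + math.cos(math.pi * progress))

schedule = [lr_at(e, total_epochs=100, batch_size=1024) for e in range(101)]
print(base_lr(1024), schedule[0], schedule[10], schedule[100])
```

For the batch size of 1024 quoted on the slide, the base value works out to 0.0005 × 1024 / 256 = 0.002.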
  • 13. Approach 13 • Implementation and evaluation protocols • Evaluation protocols • Linear evaluation (with frozen backbone) • Random resize crops & Horizontal flips (training) / Central crop (test) • Fine-tuning • Initialize networks with pre-trained weights & Adapt them during training → So sensitive to hyperparameters! • A simple weighted nearest neighbor classifier (k-NN) • Compute and store the features of the training data of the downstream task • Match the feature of an image to the k nearest stored features that vote for the label • NO hyperparameter tuning & data augmentation and SIMPLE!
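The weighted k-NN protocol (store features, find the k nearest, let them vote) can be illustrated with a tiny example. The cosine similarity, the exponential vote weights with a temperature of 0.07, the value of `k`, and the toy 2-D features are all assumptions chosen for the sketch, not details taken from the slide.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, bank_features, bank_labels, k=3, temperature=0.07):
    """Weighted vote among the k stored features most similar to the query."""
    scored = [(cosine(query, f), y) for f, y in zip(bank_features, bank_labels)]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in nearest:
        votes[label] += math.exp(sim / temperature)  # closer neighbors weigh more
    return max(votes, key=votes.get)

# Hypothetical frozen features of a labeled training set.
bank_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bank_labels = ["cat", "cat", "dog", "dog"]

prediction = knn_predict([0.8, 0.2], bank_features, bank_labels)
print(prediction)
```

No parameters are trained here: the backbone stays frozen and the only work is computing and comparing features.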
  • 14. Main Results 14 • Comparing with SSL frameworks on ImageNet • Comparing with the same architecture
  • 15. Main Results 15 • Comparing with SSL frameworks on ImageNet • Comparing across architectures
  • 16. Main Results 16 • Properties of ViT trained with SSL • Nearest neighbor retrieval with DINO ViT • Image Retrieval • Oxford and Paris image retrieval datasets • 3 different splits of gradual difficulty with query/database pairs • Mean Average Precision (mAP) for the Medium (M) and Hard (H) splits https://arxiv.org/pdf/1803.11285.pdf
  • 17. Main Results 17 • Properties of ViT trained with SSL • Nearest neighbor retrieval with DINO ViT • Copy detection • The “strong” subset of the INRIA Copydays dataset • To recognize images that have been distorted by blur, insertions, print, and scan, etc. • YFCC100M (Yahoo Flickr Creative Commons 100 Million) dataset • Randomly sampled 10k distractor images & an extra 20k random images for learning transformation https://hal.inria.fr/inria-00394212/document http://projects.dfki.uni-kl.de/yfcc100m/
  • 18. Main Results 18 • Properties of ViT trained with SSL • Discovering the semantic layout of scenes • Video instance segmentation • DAVIS-2017 video instance segmentation benchmark • Segment scenes with a nearest neighbor between consecutive frames https://davischallenge.org/ https://arxiv.org/pdf/2006.14613.pdf
  • 19. Main Results 19 • Properties of ViT trained with SSL • Discovering the semantic layout of scenes • Probing the self-attention map • Different heads can attend to different semantic regions of an image
  • 20. Main Results 20 • Properties of ViT trained with SSL • Discovering the semantic layout of scenes • Probing the self-attention map • Different heads can attend to different semantic regions of an image • Attribution Guided Factorization Visualization https://arxiv.org/pdf/2012.02166.pdf
  • 21. Main Results 21 • Properties of ViT trained with SSL • Transfer learning on downstream tasks • Evaluate the quality of the features pretrained with DINO on different downstream tasks • Compare with features from the same architectures trained with supervision on ImageNet • Training protocol of DeiT & finetune the features on each downstream task
  • 22. Ablation Study of DINO 22 • Importance of the Different Components • Different model variants as we add or remove components
  • 23. Ablation Study of DINO 23 • Importance of the Different Components • Importance of the patch size • Different patch sizes : DeiT-S : 16x16, 8x8, 5x5 / ViT-B : 16x16, 8x8 • Trade-off : Improved performance vs. reduced throughput (with decreased patch size) • Impact of the choice of Teacher Network • Building different teachers from the student & Analyzing the training dynamics
  • 24. Ablation Study of DINO 24 • Avoiding collapse • Regardless of the input, the model output is uniform along all the dimensions / dominated by one dimension. • $H(P_t, P_s) = h(P_t) + D_{KL}(P_t \| P_s)$ • If KL → 0 : a constant output (collapse) • If centering or sharpening is missing → collapse, but the entropy $h$ converges to a different value! • Centering : uniform output → high entropy! • Sharpening : opposite effect → low entropy!
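The decomposition of the cross-entropy into entropy plus KL divergence, and the two entropy extremes that the slide associates with collapse, can be checked numerically. The distributions below are hypothetical and the helper names are illustrative.

```python
import math

def entropy(p):
    """h(p) = -sum_i p_i log p_i, with 0 * log 0 treated as 0."""
    return -sum(a * math.log(a) for a in p if a > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(a * math.log(b) for a, b in zip(p, q) if a > 0)

p_t = [0.7, 0.2, 0.1]  # hypothetical teacher distribution
p_s = [0.5, 0.3, 0.2]  # hypothetical student distribution
gap = cross_entropy(p_t, p_s) - (entropy(p_t) + kl_divergence(p_t, p_s))

uniform_entropy = entropy([0.25] * 4)            # collapse to uniform
one_hot_entropy = entropy([1.0, 0.0, 0.0, 0.0])  # collapse to one dimension
print(gap, uniform_entropy, one_hot_entropy)
```

A uniform output over K dimensions has the maximum entropy log K, while a one-hot output has entropy 0, which matches the high-entropy and low-entropy collapse modes listed on the slide.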
  • 25. Ablation Study of DINO 25 • Compute requirements • DeiT-S/16 DINO on two 8 GPU machines • The “local-to-global” augmentation • Training with small batches • Scale the learning rate linearly with the batch size 𝑙𝑟 = 0.0005 ∗ batchsize/256 • 128 batch size : 1 GPU • 8 batch size : 35.2% after 50 epochs