Contrastive Learning of Visual Representations
Ting Chen
Google Research, Brain Team
Tutorial
What’s covered in this talk?
● Motivation for contrastive learning
● Contrastive learning with negative examples
● Contrastive learning without negative examples
● Important design choices in contrastive learning
● Open challenges for contrastive learning
Motivation for contrastive learning
The paradigm of learning “foundation” models
(Diagram: labeled data (e.g. images & labels) via supervised learning, or unlabeled data (e.g. images) via self-supervised learning → pretrained “foundation” network → transfer to downstream applications.)
“Self-supervised learning” is “supervised learning” without specific task annotations.
Learning by prediction in an abstract space
● One form of intelligence is the ability to predict
● Predict abstracted states instead of raw inputs
● Avoid collapse by a contrastive loss
○ Pull together positive states
○ Push away negative states
● But what to predict?
[Oord et al, Representation Learning with Contrastive Predictive Coding, 2018]
A multi-view agreement prediction task
● Predict instance identity (each instance as a class)
● Predict other views of the same example
[Dosovitskiy et al, NeurIPS’14] [Wu et al, CVPR’18]
[Becker & Hinton, Nature’92] [Bachman et al, NeurIPS’19] [Tian et al, ECCV’20] [Misra et al, CVPR’20]
(and many others…)
Many exciting results
Linear evaluation of representations: SimCLR (ICML’20), MoCo (CVPR’20), SwAV (NeurIPS’20), BYOL (NeurIPS’20)
Semi-supervised learning
SimCLR as an example: a strong semi-supervised learner that outperforms AlexNet with 100X fewer labels.
Transfer learning
SimCLR as an example: matches or surpasses supervised ImageNet pretraining when transferring to other classification tasks.
* The two datasets where the supervised ImageNet-pretrained model is better are Pets and Flowers, which share a portion of their labels with ImageNet.
(with SimCLR and MoCo as examples)
Contrastive learning
with negative examples
SimCLR
Maximizing the agreement of representations under data transformation,
using a contrastive loss in the latent/feature space.
[Chen et al, A simple framework for contrastive learning of visual representations, ICML’20]
SimCLR component: data augmentation
We use random crop and color distortion for augmentation.
Examples of augmentations applied to the leftmost images:
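A minimal sketch of such an augmentation pipeline, assuming PyTorch/torchvision; the jitter strengths, blur kernel, and probabilities below are illustrative rather than the exact SimCLR settings:

import torchvision.transforms as T

def simclr_augment(image_size=224, s=1.0):
    # Random crop (with resize and flip) + color distortion + Gaussian blur.
    color_jitter = T.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return T.Compose([
        T.RandomResizedCrop(image_size),
        T.RandomHorizontalFlip(),
        T.RandomApply([color_jitter], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
        T.ToTensor(),
    ])

# Two independently augmented views of the same image form a positive pair:
# transform = simclr_augment()
# view1, view2 = transform(img), transform(img)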
SimCLR component: encoder
f(x) is the base network that computes the internal representation.
We use an (unconstrained) ResNet in this work, but other networks can be used.
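For illustration, a sketch of the encoder f(x) using a torchvision ResNet-50 with the classification layer removed, so the forward pass returns the pooled representation h (2048-dimensional for ResNet-50); this assumes a recent torchvision API:

import torch.nn as nn
import torchvision.models as models

# Encoder f(x): ResNet with the classifier replaced by an identity,
# so encoder(x) returns the representation h after average pooling.
encoder = models.resnet50(weights=None)
encoder.fc = nn.Identity()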
SimCLR component: projection head
g(h) is a projection network that projects the representation to a latent space.
We use an MLP (with a non-linearity).
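A sketch of the projection head g(h) as an MLP with one hidden layer and a ReLU; the 2048 → 2048 → 128 sizes follow a common SimCLR configuration but are assumptions here:

import torch.nn as nn

# Projection head g(h): the contrastive loss is applied to z = g(h), not to h.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)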
SimCLR component: contrastive loss
Maximize agreement using a contrastive task:
Given a set {x_k} in which two examples x_i and x_j form a positive pair, identify x_j among {x_k}_{k≠i} for x_i.
(Figure: original image, crop 1, crop 2, contrastive image.)
Loss function:
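The loss is the normalized temperature-scaled cross-entropy (NT-Xent): for a positive pair (i, j), ℓ(i, j) = −log[ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ], where sim is cosine similarity and τ the temperature. A minimal PyTorch sketch, assuming the 2N projections of a batch of N images are stacked so that rows i and i+N are the two views of the same image:

import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.1):
    # z: (2N, d) projections; rows i and i+N are positives of each other.
    n = z.shape[0] // 2
    z = F.normalize(z, dim=1)              # cosine similarity via dot products
    sim = z @ z.t() / temperature          # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float('-inf'))      # exclude self-similarity (k != i)
    targets = torch.arange(2 * n, device=z.device)
    targets = (targets + n) % (2 * n)      # the positive of row i is row i+N
    return F.cross_entropy(sim, targets)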
SimCLR pseudo code and illustration
GIF credit: Tom Small
More negative examples: MoCo
● SimCLR uses images in the same mini-batch as negative examples, so batch size and the number of negatives are tied.
● MoCo decouples batch size and negatives by introducing a momentum encoder and a queue of activations.
[He et al, Momentum Contrast for Unsupervised Visual Representation Learning, CVPR’20]
The momentum encoder is updated with an exponential moving average of the query encoder (instead of backpropagation): θ_k ← m·θ_k + (1 − m)·θ_q.
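A sketch of that update in PyTorch; m is the momentum coefficient (MoCo's default is around 0.999), and encoder_q / encoder_k are the query and key (momentum) encoders:

import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; no gradients flow into encoder_k.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)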
(Using BYOL and SimSiam as main examples)
Contrastive learning
without negative examples
BYOL
● With a momentum encoder and an additional predictor network on top of the projection head, the model does not collapse even without negative examples.
[Grill et al, Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS’20]
BYOL
● Both the predictor and the momentum encoder play an important role in preventing collapse.
[Grill et al, Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS’20]
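A sketch of the BYOL objective, assuming the online branch's projection is passed through the predictor and regressed onto a stop-gradient target projection from the momentum (target) network, which is itself EMA-updated as in the MoCo sketch above; the module names in the comment are hypothetical:

import torch.nn.functional as F

def byol_loss(p_online, z_target):
    # Normalized MSE between predictor output and target projection,
    # equivalent to 2 - 2 * cosine similarity; the target gets no gradient.
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)
    return (2.0 - 2.0 * (p * z).sum(dim=1)).mean()

# Symmetrized over the two views v1 and v2 (hypothetical module names):
# loss = byol_loss(predictor(proj_online(f_online(v1))), proj_target(f_target(v2))) + \
#        byol_loss(predictor(proj_online(f_online(v2))), proj_target(f_target(v1)))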
SimSiam
● Further simplifies the framework by removing the
momentum encoder.
[Chen and He, Exploring Simple Siamese Representation Learning, CVPR 2021]
SimSiam
● Key ingredients:
○ No momentum encoder, but stop-gradient is kept
■ equivalent to setting the EMA momentum factor to 0
○ Careful design of the predictor:
■ batch norm, a bottleneck structure, and no learning-rate decay
[Chen and He, Exploring Simple Siamese Representation Learning, CVPR 2021]
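A sketch of the SimSiam predictor and symmetric loss; the bottleneck dimensions (2048 → 512 → 2048) follow the paper's description, and the stop-gradient takes the place of the momentum encoder:

import torch.nn as nn
import torch.nn.functional as F

# Predictor: bottleneck MLP with batch norm on the hidden layer.
predictor = nn.Sequential(
    nn.Linear(2048, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Linear(512, 2048),
)

def simsiam_loss(p1, p2, z1, z2):
    # Negative cosine similarity with stop-gradient on the projection branch,
    # so gradients reach the encoder only through the predictor outputs p1, p2.
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)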
Avoid collapse by using other batch statistics
● We can avoid representation collapse using neither negative examples nor a predictor: other batch statistics can work.
Barlow Twins
[Zbontar et al, ICML’21]
SWAV
[Caron et al, NeurIPS’20]
DINO
[Caron et al, CVPR‘21]
Also: SWD distribution loss [Chen et al, Intriguing Properties of Contrastive Losses, 2021]
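As one concrete instance, a sketch of the Barlow Twins objective, which uses batch statistics (the cross-correlation matrix between the two views' embeddings) instead of negatives or a predictor; lambd is a trade-off weight whose value here is only an assumption:

import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    # z1, z2: (N, D) projections of two augmented views of the same batch.
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # standardize each dimension over the batch
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / n                # (D, D) cross-correlation matrix
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()                # invariance: diagonal -> 1
    off_diag = (c - torch.diag(diag)).pow(2).sum()   # redundancy reduction: off-diagonal -> 0
    return on_diag + lambd * off_diag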
(using SimCLR as main example)
Some important design choices
in contrastive learning
Evaluation setup
Main dataset for self-supervised pretraining:
● ImageNet (without labels)
Two evaluation protocols for the remaining slides
● Linear classifier trained on learned features
● Fine-tune the model (with few labels)
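For concreteness, a sketch of the linear-evaluation protocol: freeze the pretrained encoder, extract features, and fit a linear classifier on top; scikit-learn's logistic regression and the loader names are used purely for illustration:

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(encoder, loader, device='cpu'):
    # Frozen encoder f(x); returns pooled features h and labels as numpy arrays.
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Hypothetical loaders over labeled evaluation data:
# X_train, y_train = extract_features(encoder, train_loader)
# X_test, y_test = extract_features(encoder, test_loader)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# print('linear eval accuracy:', clf.score(X_test, y_test))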
Important design choice in Contrastive Learning:
1. Data Augmentation is critical
A set of transformations studied in SimCLR
Systematically study a set of augmentations.
* Note that we test these only for ablation; the augmentation policy used to train our models involves only random crop (with flip and resize) + color distortion + Gaussian blur.
[Figures from SimCLR paper]
Studying single or a pair of augmentations
● ImageNet images are of different resolutions, so random crops are
typically applied.
● To remove confounding:
○ First random crop an image and resize it to a standard resolution.
○ Then apply a single augmentation or a pair of augmentations to one branch, while keeping the other branch as an identity mapping.
○ This is suboptimal compared to applying augmentations to both branches, but sufficient for ablation.
(Figure: crop and resize to a standard size, 224x224x3; one branch receives no augmentation, the other a single or a pair of augmentations.)
Composition of augmentations is crucial
The composition of crop and color distortion stands out!
[Figures from SimCLR paper]
Random cropping gives the major learning signal
Simply via random crop (with resize to a standard size), we can mimic (1) global-to-local view prediction and (2) neighboring view prediction.
This simple transformation defines a family of predictive tasks.
[Figures from SimCLR paper]
An enhancement of random cropping
Instead of taking two crops of the same size, one may take multiple crops
of different sizes.
SwAV [Caron et al, NeurIPS 2020]
DINO [Caron et al, CVPR 2021]
Patch masking as additional augmentation
Recently, methods have leveraged image patch masking as an additional augmentation. Some examples:
[Zhou et al, iBOT, ICLR’22]
[Anonymous authors, CAN, ICLR’23 submission]
Important design choice in Contrastive Learning:
2. Projection head is important
A nonlinear projection head improves the representation quality
of the layer before it
Compare projection heads (after average pooling of ResNet) in SimCLR:
● Identity mapping
● Linear projection
● Nonlinear projection with one additional hidden layer (and ReLU
activation)
Even when a nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h).
[Figures from SimCLR paper]
A nonlinear projection head improves the representation quality
of the layer before it
To understand why this happens, we measure the information contained in h and z = g(h).
The contrastive loss can remove or dampen rotation information in the last layer when the model is asked to identify rotated variants of an image.
[Figures from SimCLR paper]
Important design choice in Contrastive Learning:
3. Model size
Unsupervised contrastive learning benefits (more) from
bigger models
[Figures from SimCLR paper]
Bigger models are more label-efficient
● Using pre-training + fine-tuning, “the fewer the labels, the bigger the model”.
● Increasing the model size by 10X reduces the labels required to achieve a certain accuracy by 10X.
(Figure axes: ImageNet top-1 (%); relative improvement (%).)
[Figures from SimCLRv2 paper]
Distill/Self-train with unlabeled data to reduce the model size
[Figures from SimCLRv2 paper]
Distillation / self-training with unlabeled data
● To distill the task-specific knowledge, we use the teacher model to provide task-specific labels on unlabeled examples, on which a student model is then trained:
● Evaluation on ImageNet with only 1%/10% of labels (all images available).
Distillation on the labeled dataset alone is not sufficient; using unlabeled examples largely improves distillation.
[Figures from SimCLRv2 paper]
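A sketch of that distillation loss, assuming the fine-tuned teacher's softened predictions on unlabeled images serve as targets for the student; the temperature tau and variable names are illustrative:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    # Cross-entropy between the teacher's and student's predictive distributions
    # on unlabeled examples; no ground-truth labels are needed.
    teacher_probs = F.softmax(teacher_logits.detach() / tau, dim=1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()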
Distillation with unlabeled data improves all model sizes
● Both self-distillation and big-model-to-small-model distillation help.
● With 10% of labels, SimCLRv2 can achieve better performance than
standard supervised training with 100% of labels.
(Figure: self-distillation / self-training vs. distillation from the biggest self-distilled model, on ImageNet with a 10% label fraction.)
[Figures from SimCLRv2 paper]
Important design choice in Contrastive Learning:
4. Some hyper-parameters (e.g., in the contrastive loss)
Tune normalization and temperature
Compare variants of contrastive (NT-Xent) loss in SimCLR
● L2 normalization with temperature scaling makes a better loss.
● Contrastive task accuracy is not correlated with linear-evaluation accuracy when L2 normalization and/or temperature are changed.
[Figures from SimCLR paper]
Contrastive learning benefits from longer training
Compare training epochs & batch sizes in SimCLR (hyper-parameters tuned with a batch size of 4096).
[Figures from SimCLR paper]
Small batch sizes also work well with good hyper-parameter tuning
The original SimCLR was developed with a large batch size, so the hyper-parameters in the batch-size study above were not optimized for smaller batches.
With proper tuning of the learning rate, temperature, and a deeper projection head, the difference between batch sizes becomes smaller.
Table from “Intriguing Properties of Contrastive Losses” (Chen et al, 2021)