A Simple Framework for Contrastive Learning of
Visual Representations
Hwang Seung Hyun
Yonsei University Severance Hospital CCIDS
Google Research Team, Geoffrey Hinton | ICML 2020
2020.07.19
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
SimCLR
Introduction – Proposal
• Most mainstream approaches for unsupervised visual representation learning fall into one of two classes: generative or discriminative.
Introduction / Related Work / Methods and Experiments / Conclusion
01
Predict rotation
Autoencoder
Jigsaw Puzzle
SimCLR
Introduction – Proposal
• Discriminative approaches based on Contrastive Learning in the latent space have
recently shown state-of-the-art results.
Introduction / Related Work / Methods and Experiments / Conclusion
02
[AMDIM]
SimCLR
Introduction – Proposal
Introduction / Related Work / Methods and Experiments / Conclusion
• SimCLR outperforms previous work while being simpler.
• SimCLR achieves 76.5% top-1 accuracy, a 7% relative improvement over the previous state-of-the-art method.
• When fine-tuned with only 1% of the ImageNet labels, SimCLR achieves 85.8% top-5 accuracy.
03
SimCLR
Introduction – Contributions
• Composition of multiple data augmentation operations is crucial in unsupervised contrastive learning.
• A learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.
• Contrastive learning benefits from larger batch sizes and longer training.
• Like supervised learning, contrastive learning benefits from deeper and wider networks.
• Representation learning with a contrastive cross-entropy loss benefits from normalized embeddings and an appropriately adjusted temperature parameter.
Introduction / Related Work / Methods and Experiments / Conclusion
04
Related Work
Introduction / Related Work / Methods and Experiments / Conclusion
05
Handcrafted pretext tasks
• Relative patch prediction
• Jigsaw puzzles
• Rotation Prediction
• Colorization Prediction
…
Limits the GENERALITY of learned representations!
Related Work
Introduction / Related Work / Methods and Experiments / Conclusion
06
Contrastive Visual Representation learning
• CPC V2
• AMDIM
• Rotation Prediction
• MoCo (by Facebook)
…
“SimCLR” is a composition of ideas from these approaches!
Methods and Experiments
Overall Architecture
Introduction / Related Work / Methods and Experiments / Conclusion
07
https://www.youtube.com/watch?v=5lsmGWtxnKA
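As a rough sketch of the pipeline, assuming PyTorch/torchvision rather than the authors' TensorFlow code: each image is augmented into two views, passed through the base encoder f(·) (ResNet-50) to obtain the representation h, then through a small MLP projection head g(·) to obtain z, on which the contrastive loss is computed. The class name and dimensions below are illustrative.

import torch.nn as nn
import torchvision

class SimCLRModel(nn.Module):
    """Illustrative sketch: base encoder f(.) plus projection head g(.)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        feat_dim = resnet.fc.in_features            # 2048 for ResNet-50
        resnet.fc = nn.Identity()                   # drop the supervised classifier
        self.encoder = resnet                       # f(.): x -> h
        self.projection = nn.Sequential(            # g(.): h -> z, a 2-layer MLP
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)       # representation kept for downstream tasks
        z = self.projection(h)    # embedding used only by the contrastive loss
        return h, z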
Methods and Experiments
Architecture – Data Augmentation
Introduction / Related Work / Methods and Experiments / Conclusion
08
https://www.youtube.com/watch?v=5lsmGWtxnKA
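A minimal sketch of the augmentation pipeline (random crop and resize with flip, color distortion, Gaussian blur) using torchvision transforms; the jitter strength s and the blur kernel size are assumptions in the spirit of the paper's defaults, not the authors' exact values.

from torchvision import transforms

def simclr_augmentation(image_size=224, s=1.0):
    """Default-style SimCLR augmentation: random crop+resize, color distortion, blur."""
    color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([color_jitter], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=23),   # assumed kernel (~10% of image size)
        transforms.ToTensor(),
    ])

augment = simclr_augmentation()
# Each training image is augmented twice: view1, view2 = augment(img), augment(img)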
Methods and Experiments
Architecture – loss function
Introduction / Related Work / Methods and Experiments / Conclusion
09
https://www.youtube.com/watch?v=5lsmGWtxnKA
Methods and Experiments
Introduction / Related Work / Methods and Experiments / Conclusion
10
https://www.youtube.com/watch?v=5lsmGWtxnKA
Architecture – loss function
Final Loss
[Normalized temperature-scaled cross entropy loss (NT-Xent)]
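Written out, the NT-Xent loss for a positive pair (i, j) among the 2N augmented views, with cosine similarity sim(u, v) = u^T v / (||u|| ||v||) and temperature τ, is:

\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

The final loss averages \ell over all positive pairs, both (i, j) and (j, i), in the batch.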
Methods and Experiments
Introduction / Related Work / Methods and Experiments / Conclusion
11
Algorithm
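A compact Python/PyTorch sketch in the spirit of the paper's Algorithm 1 (the authors' own implementation is in TensorFlow); model, augment, and optimizer are assumed to exist as in the earlier sketches.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent: z1[k] and z2[k] are the projections of the two views of example k."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit-normalized
    sim = z @ z.t() / temperature                         # pairwise cosine sims / tau
    sim.fill_diagonal_(-9e15)                             # mask self-similarity
    # The positive for row i is its other view: i + N for i < N, i - N otherwise.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# One training step (sketch):
#   x1, x2 = augment(batch), augment(batch)   # two views per image
#   _, z1 = model(x1)
#   _, z2 = model(x2)
#   loss = nt_xent_loss(z1, z2)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()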
Methods and Experiments
Other Methods
Introduction / Related Work / Methods and Experiments / Conclusion
12
• Large Batch Size
- Uses a training batch size of 4096
- Uses the LARS optimizer, since a standard SGD/momentum optimizer can be unstable with large batch sizes.
• Global BN
- When training with data parallelism, BN mean and variance are
typically aggregated locally per device.
- SimCLR instead aggregates BN mean and variance over all devices during training.
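In PyTorch, a comparable "global BN" effect can be obtained by converting BatchNorm layers to synchronized BatchNorm under DistributedDataParallel; this is a hedged sketch, not the authors' TensorFlow setup (LARS is likewise not in core PyTorch and would come from a separate implementation).

import torch.nn as nn
import torchvision

# Sketch: share BN statistics across GPUs ("global BN") when training with DDP.
# Assumes torch.distributed has already been initialized.
model = torchvision.models.resnet50(weights=None)
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)    # aggregate BN over devices
model = nn.parallel.DistributedDataParallel(model.cuda())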
Methods and Experiments
Evaluation Protocol
Introduction / Related Work / Methods and Experiments / Conclusion
13
• Dataset and Metrics
- ImageNet
- Transfer learning on a wide range of datasets (CIFAR-10, CIFAR-100, etc.)
• Default Setting
- Random crop and resize, Color distortions, Gaussian blur
- ResNet-50 as base encoder network
- 2-layer MLP projection head to project the representation to a 128-dimensional latent space
- Trained at batch size 4096 for 100 epochs (defaults collected in the sketch below)
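These defaults can be collected in a small config object; the field names below are illustrative, and the temperature and MLP hidden width are assumptions rather than values stated on this slide.

from dataclasses import dataclass

@dataclass
class SimCLRConfig:
    encoder: str = "resnet50"      # base encoder f(.)
    proj_hidden_dim: int = 2048    # MLP hidden width (assumed equal to encoder output)
    proj_dim: int = 128            # dimensionality of z
    batch_size: int = 4096
    epochs: int = 100
    optimizer: str = "lars"
    temperature: float = 0.5       # tau; assumed value, tuned in the paper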
Methods and Experiments
Ablation Studies – Data Augmentation
Introduction / Related Work / Methods and Experiments / Conclusion
14
“Color distortion” and “Crop” are the crucial augmentations
Methods and Experiments
Ablation Studies – Data Augmentation
Introduction / Related Work / Methods and Experiments / Conclusion
15
Methods and Experiments
Ablation Studies – Nonlinear Projection head
Introduction / Related Work / Methods and Experiments / Conclusion
16
• The hidden layer before the projection head is a better representation than the layer after it.
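In practice this means downstream evaluation (e.g., linear evaluation or transfer) is run on h = f(x), the encoder output before the projection head, rather than on z = g(h). A minimal sketch, assuming the SimCLRModel defined earlier:

import torch

@torch.no_grad()
def extract_features(model, loader):
    """Collect h = f(x), the pre-projection features, for linear evaluation / transfer."""
    model.eval()
    feats = []
    for x, _ in loader:
        h, _z = model(x)          # use h, not the projected z
        feats.append(h)
    return torch.cat(feats)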
Methods and Experiments
Ablation Studies – Batch Size
Introduction / Related Work / Methods and Experiments / Conclusion
17
Methods and Experiments
Results – ImageNet
Introduction / Related Work / Methods and Experiments / Conclusion
18
Methods and Experiments
Results – semi-supervised learning
Introduction / Related Work / Methods and Experiments / Conclusion
19
Methods and Experiments
Results – Transfer Learning
Introduction / Related Work / Methods and Experiments / Conclusion
20
Conclusion
Introduction / Related Work / Methods and Experiments / Conclusion
• SimCLR improves considerably over previous methods for self-supervised, semi-supervised, and transfer learning.
• SimCLR differs from standard supervised learning on ImageNet only in the choice of data augmentation, the use of a nonlinear projection head, and the loss function.
• Despite a recent surge in interest, self-supervised learning remains undervalued.
21
