Members : 김상현, 고형권, 허다운, 전선영, 김준철, 조경진
Presenter : 조경진
2021.10.10
Joint Contrastive Learning with Infinite Possibilities
(Qi Cai, Yu Wang, Yingwei Pan, Ting Yao, Tao Mei , NeurIPS, 2020)
https://arxiv.org/abs/2009.14776
Contents
1. Introduction
2. Related Work
3. Methods
4. Experiments
5. Conclusion
1
Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Contrastive learning is a machine learning technique used to learn the general features of a dataset
without labels by teaching the model which data points are similar or different.
https://amitness.com/2020/03/illustrated-simclr/ 2
Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Contrastive learning is a machine learning technique used to learn the general features of a dataset
without labels by teaching the model which data points are similar or different.
https://amitness.com/2020/03/illustrated-simclr/ 3
Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Contrastive learning is a machine learning technique used to learn the general features of a dataset
without labels by teaching the model which data points are similar or different.
CNN
Supervised learning
1. Label cost
2. Task-specific solution (poor generalizability) 4
Introduction Related Work Methods Experiments Conclusions
❖ What is Contrastive learning?
Contrastive learning is a machine learning technique used to learn the general features of a dataset
without labels by teaching the model which data points are similar or different.
5
Contrastive learning
1. No label cost
2. Task-agnostic solution (good generalizability)
How can we train a model using contrastive learning?
Introduction Related Work Methods Experiments Conclusions
6
Encoder
Mechanism to get representations that allow the machine to
understand an image.
Introduction Related Work Methods Experiments Conclusions
❖ How do we train with contrastive learning?
CNN
Image Representation
Similarity measure
Mechanism to compute the similarity of two images.
Similarity ( , )
7
Similar and dissimilar images
Example pairs of similar and dissimilar images.
Image Same
Different
Different
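As a rough illustration of the two mechanisms above (an encoder produces a representation, and a similarity function compares two of them), here is a minimal Python sketch. The ResNet-18 backbone and cosine similarity are illustrative choices, not mandated by the slide, and the `weights=None` keyword assumes a recent torchvision (older versions use `pretrained=False`).

```python
import torch
import torchvision.models as models

# Encoder: a CNN backbone that maps an image to a representation vector.
encoder = models.resnet18(weights=None)
encoder.fc = torch.nn.Identity()          # drop the classification head

def similarity(img_a, img_b):
    """Cosine similarity between the representations of two images (1, 3, H, W)."""
    with torch.no_grad():
        z_a = encoder(img_a)
        z_b = encoder(img_b)
    return torch.nn.functional.cosine_similarity(z_a, z_b).item()
```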
Introduction Related Work Methods Experiments Conclusions
❖ How do we train with contrastive learning?
The positive sample should be recognized as a similar image, and the negative sample as a different image.
Score ( , )
maximize
https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html
Score ( , )
Score ( , ) Score ( , )
(the denominator additionally sums the scores over the K−1 negative samples)
Applying a softmax to these scores and taking the negative log-likelihood (NLL) yields the InfoNCE loss (a minimal code sketch follows below).
InfoNCE
8
Score ( , )
minimize
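Here is a minimal PyTorch-style sketch of the InfoNCE loss described above, over one query, one positive key, and K−1 negative keys. The tensor names (`q`, `k_pos`, `k_neg`) and the temperature value are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg, temperature=0.2):
    """Minimal InfoNCE sketch.

    q:     (N, D)  L2-normalized query embeddings
    k_pos: (N, D)  L2-normalized positive key embeddings
    k_neg: (K, D)  L2-normalized negative key embeddings (e.g. a queue)
    """
    # Similarity scores: one positive logit and K negative logits per query.
    l_pos = torch.einsum("nd,nd->n", q, k_pos).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum("nd,kd->nk", q, k_neg)                # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature    # (N, 1+K)

    # Softmax + NLL: the positive key is always class 0.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```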
Introduction Related Work Methods Experiments Conclusions
❖ InfoNCE loss
Maximize mutual Information
https://arxiv.org/pdf/1807.03748.pdf
Mutual Information
MI maximize
9
InfoNCE
❖ InfoNCE loss
Maximize mutual Information
https://arxiv.org/pdf/1807.03748.pdf
MI maximize
Mutual Information
Introduction Related Work Methods Experiments Conclusions
10
InfoNCE
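For reference, the InfoNCE objective from the slides above (the linked paper, arXiv:1807.03748) can be written as below; the relation to mutual information is the standard lower bound, stated here as a sketch rather than a derivation, with K the total number of samples in the softmax.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\,\mathbb{E}\!\left[
      \log \frac{\exp\!\big(q^{\top}k^{+}/\tau\big)}
                {\exp\!\big(q^{\top}k^{+}/\tau\big)
                 + \sum_{j=1}^{K-1}\exp\!\big(q^{\top}k_{j}^{-}/\tau\big)}
    \right],
\qquad
I\!\left(q;\,k^{+}\right) \;\geq\; \log K \;-\; \mathcal{L}_{\text{InfoNCE}}.
```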
Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning
Pretext Task: pre-designed tasks for networks to solve; visual features are learned by optimizing the objective functions of these pretext tasks.
Downstream Task: computer vision applications that are used to evaluate the quality of the features learned by self-supervised learning.
No label
Label
11
Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning: pretext tasks
https://arxiv.org/pdf/1803.07728.pdf,https://arxiv.org/pdf/1603.09246.pdf
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Doersch_Unsupervised_Visual_Representation_ICCV_2015_paper.pdf 12
Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning: pretext tasks
https://arxiv.org/pdf/1803.07728.pdf,https://arxiv.org/pdf/1603.09246.pdf
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Doersch_Unsupervised_Visual_Representation_ICCV_2015_paper.pdf
Rotation
0, 90, 180, 270 degree random rotation
4-class classification problem
13
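A minimal sketch of how the rotation pretext task above can be set up: rotate each image by one of {0°, 90°, 180°, 270°} and train a 4-way classifier on the rotation index. The function name and use of `torch.rot90` are illustrative assumptions, and square images are assumed so that rotated tensors stack cleanly.

```python
import torch

def make_rotation_batch(images):
    """images: (N, C, H, W), assumed square. Returns rotated images and 4-class rotation labels."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2))   # rotate by k * 90 degrees in the H-W plane
        for img, k in zip(images, labels)
    ])
    return rotated, labels   # train a network to predict `labels` from `rotated`
```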
Introduction Related Work Methods Experiments Conclusions
❖ Self-supervised learning: pretext tasks
https://arxiv.org/pdf/1803.07728.pdf,https://arxiv.org/pdf/1603.09246.pdf
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Doersch_Unsupervised_Visual_Representation_ICCV_2015_paper.pdf
Jigsaw Puzzle
Randomly shuffle the patches, then predict the permutation
Context Prediction
Predict the relative location of patches
14
Introduction Related Work Methods Experiments Conclusions
❖ MoCo
https://arxiv.org/abs/1911.05722 ,
https://arxiv.org/abs/2003.04297
● Dictionary as a queue
(JCL adopted)
● Momentum update
● Memory bank
● Contrastive loss function
○ one positive pair
○ many negative pairs
❖ SimCLR
https://arxiv.org/abs/2002.05709
● Random cropping with resize to the original size, color distortion, Gaussian blur
● Encoder f (ᐧ)
● Projection head g (ᐧ) mapping
● Contrastive loss function
○ one positive pair
○ many negative pairs
15
● SimCLR augmentation strategy
● MLP projection head
● Cosine LR scheduler
● Same loss function as MoCo v1
❖ MoCo v2
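A hedged sketch of two MoCo ingredients listed above: the momentum update of the key encoder and the dictionary-as-a-queue. Variable names and the momentum value 0.999 follow common convention but are assumptions here, not a verbatim reproduction of the official code.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder parameters follow the query encoder as an exponential moving average."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue_dequeue(queue, keys, ptr):
    """Replace the oldest keys in the (K, D) queue with the newest batch of keys."""
    batch = keys.size(0)
    queue[ptr:ptr + batch] = keys          # assumes K is divisible by the batch size
    return (ptr + batch) % queue.size(0)   # advance the queue pointer
```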
How does JCL actually work?
Introduction Related Work Methods Experiments Conclusions
16
Introduction Related Work Methods Experiments Conclusions
● Most existing contrastive learning methods only consider independently penalizing the incompatibility of
each single positive query-key pair at a time.
● This does not fully leverage the assumption that all augmentations corresponding to a specific
image are statistically dependent on each other, and are simultaneously similar to the query.
❖ Positive, Negative key pairs
17
Introduction Related Work Methods Experiments Conclusions
● Each instance xi is first augmented M+1 times (M augmentations for positive keys and 1 extra for the query itself).
● Unfortunately, this is not computationally feasible, as carrying all (M+1)×N pairs in a mini-batch would quickly drain the GPU memory even when M is moderately small.
● Capitalizing on this infinity limit, the statistics of the data become sufficient to reach the same goal as multiple pairing; see the limit sketched below.
❖ Positive, Negative key pairs
18
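The "infinity limit" mentioned above can be read as a law-of-large-numbers statement: as the number of sampled positive keys M grows, the per-instance average loss approaches an expectation over the positive-key distribution, so the statistics of the keys (mean and covariance) suffice. This is a hedged paraphrase of the paper's argument, not its exact equation (the paper combines the pairs jointly inside the loss).

```latex
\lim_{M\to\infty}\;\frac{1}{M}\sum_{m=1}^{M}
  \ell\!\left(q_{i},\,k_{i,m}^{+}\right)
  \;=\;
  \mathbb{E}_{\,k^{+}\sim p(k^{+}\mid x_{i})}\!\left[\ell\!\left(q_{i},\,k^{+}\right)\right].
```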
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
https://en.wikipedia.org/wiki/Jensen%27s_inequality 19
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
https://en.wikipedia.org/wiki/Jensen%27s_inequality 20
● Jensen's inequality
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
https://en.wikipedia.org/wiki/Jensen%27s_inequality 21
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
Implicit
Explicit
22
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
23
● Under this Gaussian assumption, Eq.(7) eventually reduces to Eq.(8).
● For any random variable x that follows a Gaussian distribution x ∼ N(μ, Σ), where μ is the expectation of x and Σ is the covariance of x, the moment generating function satisfies the identity sketched below.
● We scale the influence of Σk+ by multiplying it by a scalar λ; tuning λ helps stabilize the training.
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
24
● Under this Gaussian assumption, Eq.(7) eventually reduces to Eq.(8).
● For any random variable x that follows a Gaussian distribution x ∼ N(μ, Σ), where μ is the expectation of x and Σ is the covariance of x, the moment generating function satisfies the identity sketched below.
● We scale the influence of Σk+ by multiplying it by a scalar λ; tuning λ helps stabilize the training.
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
25
● Under this Gaussian assumption, Eq.(7) eventually reduces to Eq.(8).
● For any random variable x that follows a Gaussian distribution x ∼ N(μ, Σ), where μ is the expectation of x and Σ is the covariance of x, the moment generating function satisfies the identity sketched below.
● We scale the influence of Σk+ by multiplying it by a scalar λ; tuning λ helps stabilize the training.
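The moment generating function referred to above is the standard Gaussian identity; applied roughly with t = qi/τ it is what turns the expectation over positive keys into a closed-form mean-plus-covariance term (in JCL, Σk+ is additionally scaled by λ as described in the bullets). The precise form of Eq.(8) should be checked against the paper; this is only a sketch using the slide's symbols qi, τ, μk+, Σk+.

```latex
x \sim \mathcal{N}(\mu, \Sigma)
\;\;\Longrightarrow\;\;
\mathbb{E}\!\left[e^{\,t^{\top}x}\right]
  = \exp\!\Big(t^{\top}\mu + \tfrac{1}{2}\,t^{\top}\Sigma\,t\Big),
\quad\text{so}\quad
\mathbb{E}_{k^{+}}\!\left[e^{\,q_{i}^{\top}k^{+}/\tau}\right]
  = \exp\!\Big(\tfrac{1}{\tau}\,q_{i}^{\top}\mu_{k^{+}}
               + \tfrac{1}{2\tau^{2}}\,q_{i}^{\top}\Sigma_{k^{+}}\,q_{i}\Big).
```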
Question?
Introduction Related Work Methods Experiments Conclusions
26
Introduction Related Work Methods Experiments Conclusions
❖ Implicit Semantic Data Augmentation
https://arxiv.org/abs/1909.12220
https://neurohive.io/en/news/semantic-data-augmentation-improves-neural-network-s-generalization/
● We hope that directions corresponding to meaningful transformations for each class
are well represented by the principal components of the covariance matrix of that class.
● Consider training a deep network G with weights Θ on a training set
D = {(xi,yi)}Ni=1, where yi ∈ {1, . . . , C} is the label of the i-th sample xi over C classes.
● Let the A-dimensional vector ai = [ai1, . . . , aiA]T = G(xi, Θ) denote the deep features of xi
learned by G, and aij indicate the j-th element of ai.
● To obtain semantic directions for augmenting ai, we randomly sample vectors from a zero-mean multivariate normal distribution N(0, Σyi), where Σyi is the class-conditional covariance matrix estimated from the features of all the samples in class yi.
● During training, C covariance matrices are computed, one for each class. The augmented feature ãi is obtained by translating ai along a random direction sampled from N(0, λΣyi); a minimal sketch of this sampling follows below.
27
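A minimal sketch of the feature-level augmentation described above: sample a semantic direction from the class-conditional Gaussian N(0, λΣy) and translate the feature along it. The function name, the per-class covariance dictionary, and the constant λ value are illustrative assumptions, and each covariance is assumed to be positive definite so that `MultivariateNormal` accepts it.

```python
import torch
from torch.distributions import MultivariateNormal

def semantic_augment(features, labels, class_cov, lam=0.5):
    """features: (N, A); labels: (N,); class_cov: dict mapping class index -> (A, A) covariance."""
    augmented = []
    for a_i, y_i in zip(features, labels):
        cov = lam * class_cov[int(y_i)]                       # lambda-scaled class covariance
        delta = MultivariateNormal(torch.zeros_like(a_i),
                                   covariance_matrix=cov).sample()
        augmented.append(a_i + delta)                         # translate along a sampled semantic direction
    return torch.stack(augmented)
```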
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
● The variable k+i follows a Gaussian distribution, where μk+ and Σk+ are respectively the mean and the covariance matrix of the positive keys for query qi.
● This assumption is legitimate, as positive keys more or less share similarities in the embedding space around some mean value, since they all mirror the nature of the query to some extent.
Explicit representation in negative keys
https://blog.daum.net/shksjy/228
https://neurohive.io/en/news/semantic-data-augmentation-improves-neural-network-s-generalization/
Implicit representation in positive keys
1 2 3 4
28
Introduction Related Work Methods Experiments Conclusions
❖ Joint Contrastive Learning
29
Features are directly extracted from a JCL pre-trained and a MoCo v2 pre-trained ResNet-18 network.
Fig. 3(a): cosine similarities are computed for each pair of features (every pair of 2 out of the 32 augmentations) belonging to the same identity image.
Fig. 3(b): accordingly, the variance of the obtained features belonging to the same image is much smaller.
(Figure labels: non-contrastive / contrastive, supervised / unsupervised)
Hyper-parameters: positive key number M = 5, τ = 0.2, λ = 4.0, batch size N = 512
Introduction Related Work Methods Experiments Conclusions
❖ Conclusion in Joint Contrastive Learning
● JCL implicitly involves the joint learning of an infinite number of query-key pairs for each instance.
● By applying rigorous bounding techniques to the proposed formulation, we transform the originally intractable loss function into a practical implementation.
● Most notably, although JCL is an unsupervised algorithm, JCL pre-trained networks even outperform their supervised counterparts in many scenarios.
30
Question?
Introduction Related Work Methods Experiments Conclusions
31
Editor's Notes
  1. Joint Contrastive Learning with Infinite Possibilities : https://arxiv.org/abs/2009.14776 contrastive learning
  2. Implicit semantic augmentation does not explicitly teach the model with many augmentations directly; instead, to teach meaningful directions on the latent feature map, it assumes a Gaussian distribution and samples from it. The mean and variance of the corresponding Gaussian distribution are learned, and the principal components are extracted so that meaningful directions are obtained.