Exploring Simple Siamese Representation Learning
Sungchul Kim
Contents
1. Introduction
2. Method
3. Empirical Study
4. Hypothesis
5. Comparisons
Introduction
• Un-/Self-supervised representation learning
• Pretext (auxiliary) task : image → model (+ extra module) → pseudo-task
• Downstream task : image → model (+ new extra module)
• Linear evaluation : freeze the extractor and train a 1-layer NN
• Finetune : train all networks for the downstream task
• Transfer learning : detection / segmentation
• Retrieval : calculate recall among kNN
Introduction
• Siamese networks
• An undesired trivial solution : all outputs “collapsing” to a constant
• Contrastive Learning
• Attract the positive sample pairs and repulse the negative sample pairs
• SimCLR, MoCo, …
• Clustering
• Alternate between clustering the representations and learning to predict the cluster assignment
• SwAV
• BYOL
• A Siamese network in which one branch is a momentum encoder
Introduction
• SimSiam
• Without the momentum encoder, negative pairs, and online clustering
• Does not cause collapsing and can perform competitively
• Siamese networks : inductive biases for modeling invariance
• Invariance : two observations of the same concept should produce the same outputs.
• The weight-sharing Siamese networks can model invariance w.r.t. more complicated transformations (augmentations).
Method
• SimSiam
• Two randomly augmented views 𝑥1 and 𝑥2 from an image 𝑥
• An encoder network 𝑓 consisting of a backbone and a projection MLP head
• Two output vectors: $p_1 \triangleq h(f(x_1))$ and $z_2 \triangleq f(x_2)$
• Negative cosine similarity: $\mathcal{D}(p_1, z_2) = -\dfrac{p_1}{\|p_1\|_2} \cdot \dfrac{z_2}{\|z_2\|_2}$
• A symmetrized loss: $\mathcal{L} = \tfrac{1}{2}\,\mathcal{D}(p_1, z_2) + \tfrac{1}{2}\,\mathcal{D}(p_2, z_1)$
• Stop-gradient operation: $\mathcal{L} = \tfrac{1}{2}\,\mathcal{D}(p_1, \mathrm{stopgrad}(z_2)) + \tfrac{1}{2}\,\mathcal{D}(p_2, \mathrm{stopgrad}(z_1))$ (see the training-step sketch below)
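Putting the two views, the predictor, and the stop-gradient together, a minimal PyTorch-style sketch of one training step could look as follows (close in spirit to the paper's pseudocode; the module names f and h and the helper D are illustrative, not the authors' exact code):

```python
import torch.nn.functional as F

def D(p, z):
    # negative cosine similarity; z.detach() implements stopgrad(z)
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_step(f, h, x1, x2, optimizer):
    """One training step on two augmented views x1, x2 of the same images.
    f: backbone + projection MLP, h: prediction MLP (both nn.Module)."""
    z1, z2 = f(x1), f(x2)                        # projections
    p1, p2 = h(z1), h(z2)                        # predictions
    loss = 0.5 * D(p1, z2) + 0.5 * D(p2, z1)     # symmetrized loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```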
Method
• SimSiam
• Baseline Settings
• Optimizer
• SGD with 0.9 momentum and weight decay 0.0001
• Learning rate : 𝑙𝑟 × BatchSize/256 with a base 𝑙𝑟 = 0.05
• Batch size : 512 / Batch normalization synchronized across devices
• Projection MLP
• (FC (2048) + BN + ReLU) x 2 + FC (𝑑=2048) + BN (𝑧)
• Prediction MLP
• FC (512)+ BN + ReLU + FC (𝑑=2048) (𝑝)
• Experimental setup
• Pre-training : 1000-class ImageNet training set
• Linear evaluation : training a supervised linear classifier on frozen representations
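For concreteness, the projection and prediction heads described above can be written as small nn.Sequential stacks; a sketch assuming a 2048-d backbone output such as ResNet-50 (dropping the bias before BN is a common convention and an assumption here, not stated on the slide):

```python
import torch.nn as nn

def projection_mlp(in_dim=2048, hidden=2048, out_dim=2048):
    # (FC + BN + ReLU) x 2, then FC + BN (no ReLU), producing z
    return nn.Sequential(
        nn.Linear(in_dim, hidden, bias=False), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden, bias=False), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim, bias=False), nn.BatchNorm1d(out_dim),
    )

def prediction_mlp(in_dim=2048, hidden=512, out_dim=2048):
    # FC + BN + ReLU, then a plain FC, producing p
    return nn.Sequential(
        nn.Linear(in_dim, hidden, bias=False), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim),
    )
```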
Empirical Study
• Stop-gradient
• (Left) Without stop-gradient, the optimizer quickly finds a degenerated solution and reaches the minimum possible loss of −1.
• (Middle) std of the $\ell_2$-normalized output $z' = z/\|z\|_2$
• If the outputs collapse to a constant vector → the std over all samples should be zero for each channel (the red curve)
• If the output $z$ has a zero-mean isotropic Gaussian distribution → the std of $z'$ is $1/\sqrt{d}$
→ With stop-gradient, the outputs do not collapse; they are scattered on the unit hypersphere!
• $z'_i = z_i \big/ \big(\sum_{j=1}^{d} z_j^2\big)^{1/2} \approx z_i / d^{1/2}$ (if $z_j$ follows an i.i.d. Gaussian distribution) → $\mathrm{std}[z'_i] \approx 1/\sqrt{d}$ (see the numerical check below)
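The $1/\sqrt{d}$ value is easy to sanity-check numerically; a quick sketch with illustrative sizes:

```python
import torch

d, n = 2048, 10000
z = torch.randn(n, d)                       # zero-mean isotropic Gaussian outputs
z_norm = z / z.norm(dim=1, keepdim=True)    # l2-normalize each output vector
print(z_norm.std(dim=0).mean().item())      # per-channel std, roughly 1/sqrt(d)
print(1 / d ** 0.5)                         # ~0.0221 for d = 2048
```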
Empirical Study
• Stop-gradient
• (Right) the validation accuracy of a k-nearest-neighbor (kNN) classifier
• The kNN classifier can serve as a monitor of the training progress (see the sketch below).
• With stop-gradient, the kNN monitor shows a steadily improving accuracy!
• (Table) the linear evaluation result
• w/ stop-grad : a nontrivial accuracy of 67.7%
• w/o stop-grad : only 0.1% (chance-level accuracy)
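A kNN monitor of this kind can be sketched as below (a plain majority-vote cosine kNN over frozen features; this is simpler than the weighted kNN monitors commonly used in practice, and k = 200 is an assumption):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=200):
    """train/val_feats: (N, d) frozen backbone features; labels: (N,) int64 tensors."""
    train_feats = F.normalize(train_feats, dim=1)
    val_feats = F.normalize(val_feats, dim=1)
    sims = val_feats @ train_feats.t()            # cosine similarities
    idx = sims.topk(k, dim=1).indices             # k nearest training samples per val sample
    votes = train_labels[idx]                     # (N_val, k) neighbor labels
    preds = votes.mode(dim=1).values              # majority vote
    return (preds == val_labels).float().mean().item()
```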
Empirical Study
• Stop-gradient
• Discussion
• There exist collapsing solutions.
• Collapse can be observed by the minimum possible loss together with constant outputs (reaching the minimum loss alone is not sufficient to indicate collapsing).
• It is insufficient for our method to prevent collapsing solely by the architecture designs (e.g., predictor, BN, 𝑙2-norm).
Empirical Study
• Predictor
• (Table 1) the predictor MLP (ℎ)’s effect
• (a) removing ℎ (i.e., ℎ is the identity mapping) → the model does not work!
• $\mathcal{L}_{(a)} = \tfrac{1}{2}\,\mathcal{D}(z_1, \mathrm{stopgrad}(z_2)) + \tfrac{1}{2}\,\mathcal{D}(z_2, \mathrm{stopgrad}(z_1))$
• Its gradient has the same direction as the gradient of $\mathcal{D}(z_1, z_2)$, with the magnitude scaled by 1/2 → collapsing! (see the autograd check below)
• The asymmetric variant $\mathcal{D}(p_1, \mathrm{stopgrad}(z_2))$ also fails when ℎ is removed, while it can work if ℎ is kept.
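The 1/2-scaled-gradient claim can be checked directly with autograd on a toy shared encoder; a sketch (the linear encoder and sizes are purely illustrative):

```python
import torch
import torch.nn.functional as F

def D(a, b):
    # negative cosine similarity, symmetric in its arguments
    return -F.cosine_similarity(a, b, dim=-1).mean()

torch.manual_seed(0)
W = torch.randn(8, 16, requires_grad=True)          # toy shared (weight-sharing) encoder
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)     # two "views"
z1, z2 = x1 @ W.t(), x2 @ W.t()

g_full = torch.autograd.grad(D(z1, z2), W, retain_graph=True)[0]
loss_a = 0.5 * D(z1, z2.detach()) + 0.5 * D(z2, z1.detach())   # variant (a), h removed
g_a = torch.autograd.grad(loss_a, W)[0]
print(torch.allclose(g_a, 0.5 * g_full, atol=1e-6))            # True: same direction, half magnitude
```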
Empirical Study
• Predictor
• (Table 1) the predictor MLP (ℎ)’s effect
• (b) ℎ is fixed as random initialization → the model does not work!
• This failure is not about collapsing. The training does not converge, and the loss remains high.
• (c) ℎ with a constant 𝑙𝑟 (without decay) → works well!
• It is not necessary to force ℎ to converge (by reducing 𝑙𝑟) before the representations are sufficiently trained.
Empirical Study
• Batch Size
• (Table 2) the results with a batch size from 64 to 4096
• Linear scaling rule (𝑙𝑟 × BatchSize/256) with base 𝑙𝑟 = 0.05 (see the schedule sketch below)
• 10 epochs of warm-up for batch sizes ≥ 1024
• SGD optimizer, not LARS
• 64 and 128 : a drop of 2.0% and 0.8% in accuracy, respectively
• 256 to 2048 : similarly good
→ SimSiam, SimCLR, SwAV : Siamese network with direct weight-sharing
→ SimCLR and SwAV both require a large batch (e.g., 4096) to work well.
→ The standard SGD optimizer does not work well when the batch is too large.
→ A specialized optimizer is not necessary for preventing collapsing.
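The learning-rate handling behind this table is just the linear scaling rule, warm-up for large batches, and the cosine decay used for pre-training; a simplified sketch (how warm-up and decay are combined here is an assumption):

```python
import math

def simsiam_lr(epoch, batch_size, base_lr=0.05, total_epochs=100, warmup_epochs=10):
    """Linear scaling rule with warm-up for large batches and cosine decay (simplified)."""
    peak_lr = base_lr * batch_size / 256
    if batch_size >= 1024 and epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs                           # linear warm-up
    return peak_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))      # cosine decay
```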
Empirical Study
• Batch Normalization
• (Table 3) the configurations of BN on the MLP heads
• (a) remove all BN layers in the MLP heads
• This variant does not cause collapse, although the accuracy is low (34.6%).
• (b) adding BN to the hidden layers
• It increases accuracy to 67.4%.
• (c) adding BN to the output of the projection MLP (default)
• It boosts accuracy to 68.1%.
• Disabling learnable affine transformation (scale and offset) in 𝑓’s output BN : 68.2%
• (d) adding BN to the output of the prediction MLP ℎ → does not work well!
• This failure is not about collapsing; the training is unstable and the loss oscillates.
→ BN is helpful for optimization when used appropriately.
→ No evidence that BN helps to prevent collapsing.
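One way to see how these variants differ is to expose the BN placements as flags on the projection MLP; a sketch (the builder and flag names are illustrative; variant (d) would additionally append BN to the prediction MLP's output):

```python
import torch.nn as nn

def projection_mlp_variant(dim=2048, bn_hidden=True, bn_output=True, output_affine=True):
    """Table 3 variants: (a) bn_hidden=bn_output=False, (b) bn_output=False,
    (c) default; output_affine=False corresponds to the 68.2% row."""
    layers = []
    for _ in range(2):                                   # two hidden FC (+ BN) + ReLU blocks
        layers += [nn.Linear(dim, dim)]
        if bn_hidden:
            layers += [nn.BatchNorm1d(dim)]
        layers += [nn.ReLU(inplace=True)]
    layers += [nn.Linear(dim, dim)]                      # output FC producing z
    if bn_output:
        layers += [nn.BatchNorm1d(dim, affine=output_affine)]
    return nn.Sequential(*layers)
```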
Empirical Study
• Similarity Function
• Modify 𝒟 as: $\mathcal{D}(p_1, z_2) = -\mathrm{softmax}(z_2) \cdot \log \mathrm{softmax}(p_1)$ (see the sketch below)
• The output of softmax can be thought of as the probabilities of belonging to each of 𝑑 pseudo-categories.
• Simply replace the cosine similarity with the cross-entropy similarity, and symmetrize it.
• The cross-entropy variant can converge to a reasonable result without collapsing.
→ The collapsing prevention behavior is not just about the cosine similarity.
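A sketch of this cross-entropy variant (the function name D_ce is illustrative; stop-gradient and symmetrization are applied exactly as for the cosine version, with the caller passing z.detach()):

```python
import torch.nn.functional as F

def D_ce(p, z):
    # softmax(z) acts as soft pseudo-labels over d pseudo-categories for softmax(p)
    return -(F.softmax(z, dim=1) * F.log_softmax(p, dim=1)).sum(dim=1).mean()
```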
Empirical Study
• Symmetrization
• Observe that SimSiam’s behavior of preventing collapsing does not depend on symmetrization.
• Compare with the asymmetric variant $\mathcal{D}(p_1, \mathrm{stopgrad}(z_2))$.
• The asymmetric variant achieves reasonable results.
→ Symmetrization is helpful for boosting accuracy, but it is not related to collapse prevention.
• Sampling two pairs for each image in the asymmetric version (“2x”)
→ It makes the gap smaller.
Empirical Study
• Summary
• SimSiam can produce meaningful results without collapsing.
• The optimizer (batch size), batch normalization, similarity function, and symmetrization
→ affect accuracy, but we have seen no evidence that they are related to preventing collapse
→ the stop-gradient operation may play an essential role!
Hypothesis
• What is implicitly optimized by SimSiam?
• Formulation
• SimSiam is an implementation of an Expectation-Maximization (EM)-like algorithm.
$\mathcal{L}(\theta, \eta) = \mathbb{E}_{x,\mathcal{T}}\big[\, \|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_x\|_2^2 \,\big]$
• ℱ : a network parameterized by 𝜃
• 𝒯 : the augmentation / 𝑥 : an image
• 𝜂 : another set of variables (proportional to the number of images)
• 𝜂𝑥 : the representation of the image 𝑥 / the subscript 𝑥 : using the image index to access a sub-vector of 𝜂
• The expectation 𝔼 ⋅ is over the distribution of images and augmentations.
• The mean squared error $\|\cdot\|_2^2$ is equivalent to the cosine similarity if the vectors are $\ell_2$-normalized.
Hypothesis
• Formulation
• Solve $\min_{\theta, \eta} \mathcal{L}(\theta, \eta)$
• This formulation is analogous to k-means clustering.
• The variable $\theta$ is analogous to the clustering centers (the learnable parameters of an encoder)
• The variable $\eta_x$ is analogous to the assignment vector of the sample $x$ (a one-hot vector in k-means); here it is the representation of $x$
• An alternating algorithm (see the sketch below)
• Solving for $\theta$ : $\theta^t \leftarrow \arg\min_{\theta} \mathcal{L}(\theta, \eta^{t-1})$
• The stop-gradient operation is a natural consequence: the gradient does not back-propagate to $\eta^{t-1}$, which is a constant.
• Solving for $\eta$ : $\eta^t \leftarrow \arg\min_{\eta} \mathcal{L}(\theta^t, \eta)$
• Minimize $\mathbb{E}_{\mathcal{T}}\big[\|\mathcal{F}_{\theta^t}(\mathcal{T}(x)) - \eta_x\|_2^2\big]$ for each image $x$ : $\eta_x^t \leftarrow \mathbb{E}_{\mathcal{T}}\big[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\big]$ (due to the mean squared error)
• $\eta_x$ is assigned the average representation of $x$ over the distribution of augmentations.
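The two sub-problems can be sketched directly in code (a schematic under this section's notation: f plays the role of F_theta, augment is a random augmentation, x is a batch of images; all names and the averaging over a few sampled views are illustrative):

```python
import torch

@torch.no_grad()
def solve_eta(f, x, augment, n_views=8):
    # eta_x <- E_T[ F_theta(T(x)) ], approximated by averaging a few sampled views
    return torch.stack([f(augment(x)) for _ in range(n_views)]).mean(dim=0)

def solve_theta_step(f, x, augment, eta, optimizer):
    # one SGD step on  E_{x,T} || F_theta(T(x)) - eta_x ||_2^2  with eta held fixed (stop-gradient)
    loss = ((f(augment(x)) - eta) ** 2).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With n_views=1 and a single SGD step per alternation, this reduces to the one-step alternation of the next slide, i.e., SimSiam without the predictor.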
Hypothesis
• Formulation
• One-step alternation
• Approximate $\eta_x^t \leftarrow \mathbb{E}_{\mathcal{T}}\big[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\big]$ by sampling the augmentation only once ($\mathcal{T}'$) and ignoring $\mathbb{E}_{\mathcal{T}}[\cdot]$: $\eta_x^t \leftarrow \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$
• Then, $\theta^{t+1} \leftarrow \arg\min_{\theta} \mathbb{E}_{x,\mathcal{T}}\big[\|\mathcal{F}_{\theta}(\mathcal{T}(x)) - \mathcal{F}_{\theta^t}(\mathcal{T}'(x))\|_2^2\big]$
• $\theta^t$ : a constant in this sub-problem
• $\mathcal{T}'$ : another view due to its random nature
→ The SimSiam algorithm: a Siamese network with stop-gradient naturally applied.
Hypothesis
• Formulation
• Predictor
• Assumption: ℎ is helpful in our method because of the one-step approximation $\eta_x^t \leftarrow \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ above.
• Minimize $\mathbb{E}_z\big[\|h(z_1) - z_2\|_2^2\big]$ → the optimal solution: $h(z_1) = \mathbb{E}_z[z_2] = \mathbb{E}_{\mathcal{T}}\big[f(\mathcal{T}(x))\big]$ for any image $x$
• In practice, it would be unrealistic to actually compute the expectation 𝔼𝒯
• But it may be possible for a neural network (e.g., the predictor ℎ) to learn to predict the expectation,
while the sampling of 𝒯 is implicitly distributed across multiple epochs.
• Symmetrization
• Symmetrization is like denser sampling of $\mathcal{T}$.
• $\mathbb{E}_{x,\mathcal{T}}[\cdot]$ is approximated by sampling a batch of images and one pair of augmentations $(\mathcal{T}_1, \mathcal{T}_2)$
• Symmetrization supplies an extra pair $(\mathcal{T}_2, \mathcal{T}_1)$
• Not necessary, yet improving accuracy
Hypothesis
• Proof of concept
• Multi-step alternation
• SimSiam : alternating between updating $\theta^t$ and $\eta^t$, with an interval of one SGD step → here the interval has multiple SGD steps (see the sketch below)
• $t$ : the index of an outer loop
• $\theta^t$ is updated by an inner loop of $k$ SGD steps
• Pre-compute the $\eta_x$ required for all $k$ SGD steps and cache them in memory
• 1-step : SimSiam / 1-epoch : the $k$ steps required for one epoch
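A schematic of the multi-step variant, caching η for k batches before running the inner loop (the iterator protocol and names are assumptions; it reuses the squared-error objective from the alternating sketch above):

```python
import torch

def multi_step_outer_iteration(f, batches_iter, augment, optimizer, k):
    """One outer-loop step: pre-compute eta for k batches with theta frozen,
    then run k inner SGD steps on theta against the cached targets."""
    batches = [next(batches_iter) for _ in range(k)]
    with torch.no_grad():
        eta_cache = [f(augment(x)) for x in batches]         # eta_x <- F_theta(T'(x))
    for x, eta in zip(batches, eta_cache):                   # inner loop: k SGD steps
        loss = ((f(augment(x)) - eta) ** 2).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```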
Hypothesis
• Proof of concept
• Expectation over augmentations
• Do not update $\eta_x$ directly; maintain a moving average (see the sketch below):
$\eta_x^t \leftarrow m \cdot \eta_x^{t-1} + (1 - m) \cdot \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$, with $m = 0.8$
• This moving average provides an approximated expectation over multiple views.
• 55.0% accuracy without the predictor ℎ
• Fails completely if ℎ is removed but the moving average is not maintained (Table 1a)
→ The usage of the predictor ℎ is related to approximating $\mathbb{E}_{\mathcal{T}}[\cdot]$
• Discussion
• It does not explain why collapsing is prevented.
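A sketch of the moving-average targets that stand in for the predictor in this experiment (m = 0.8 from the slide; the per-image cache keyed by dataset index is an assumption):

```python
import torch

@torch.no_grad()
def update_eta_moving_average(f, view, indices, eta_cache, m=0.8):
    """eta_x^t <- m * eta_x^(t-1) + (1 - m) * F_theta(T'(x)); an approximate E_T
    accumulated over epochs. `view` is an augmented batch, `indices` its dataset indices."""
    feats = f(view)
    targets = []
    for j, i in enumerate(indices.tolist()):
        prev = eta_cache.get(i, feats[j])                 # initialize on first visit
        eta_cache[i] = m * prev + (1 - m) * feats[j]
        targets.append(eta_cache[i])
    return torch.stack(targets)                           # used as the (stop-gradient) target
```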
Comparisons
• Result Comparisons
• ImageNet
• SimSiam achieves competitive results despite its simplicity.
• The highest accuracy among all methods under 100-epoch pre-training
• Better results than SimCLR in all cases
Comparisons
• Result Comparisons
• Transfer Learning
• VOC (object detection) and COCO (object detection and instance segmentation)
• SimSiam, base : the baseline pre-training recipe
• SimSiam, optimal : optimal recipe (𝑙𝑟=0.5 and 𝑤𝑑=1e-5)
• The Siamese structure is a core factor for the general success of these methods.
Comparisons
• Methodology Comparisons
• Relation to SimCLR
• SimSiam is SimCLR without negatives.
• Negative samples (“dissimilarity”) to prevent collapsing
• Appending the prediction MLP ℎ and stop-gradient to SimCLR: neither is necessary or helpful for SimCLR.
• The effectiveness of these components in SimSiam is presumably a consequence of another underlying optimization problem, different from the contrastive learning problem!
• Relation to BYOL
• SimSiam is BYOL without the momentum encoder.
• The momentum encoder may be beneficial for accuracy, but it is not necessary for preventing collapsing.
Thank you