Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750-15758).
1. Sangmin Woo
School of Electrical Engineering and Computer Science
Gwangju Institute of Science and Technology (GIST)
2021.06.23
2. 2 / 32
In a nutshell…
• Siamese networks [1] maximize the similarity between two augmentations of
one image, subject to certain conditions for avoiding collapsing solutions
• Collapsing solutions do exist for the loss and structure, but a stop-gradient
operation plays an essential role in preventing collapsing
• A simple Siamese network (SimSiam) shows surprising results without using:
• Negative sample pairs
• Large batches
• Momentum encoders
[1] Koch, G., Zemel, R., & Salakhutdinov, R. (2015, July). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop (Vol. 2).
3. 3 / 32
In a nutshell…
Siamese networks are weight-sharing neural networks applied on two or more inputs.
They are natural tools for comparing entities.
4. 4 / 32
In a nutshell…
An undesired trivial solution to Siamese networks is
all outputs “collapsing” to a constant.
5. 5 / 32
In a nutshell…
Stop-gradient: stop the gradient from flowing backward through one branch
TensorFlow: stop_gradient / PyTorch: detach
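A minimal PyTorch sketch of what the operation does (illustrative; the variable names are mine): detach returns a tensor that shares values with its source but is cut out of the autograd graph, so no gradient flows back through that branch.

    import torch

    x = torch.randn(4, requires_grad=True)
    a = x * 3.0                  # normal branch: gradient flows
    b = x.detach() * 3.0         # stop-gradient branch: cut from the graph

    (a + b).sum().backward()
    print(x.grad)                # tensor([3., 3., 3., 3.]) -- only the `a` path
                                 # contributes; tf.stop_gradient behaves the same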
6. 6 / 32
In a nutshell…
Typical strategies for preventing collapsing
9. 9 / 32
Siamese Networks
• Recent methods define the inputs as two augmentations of one image, and
maximize the similarity subject to different conditions.
• Siamese networks can naturally introduce inductive biases for modeling
“invariance”
• Convolution is a successful inductive bias via weight-sharing for modeling
translation-invariance
• Weight-sharing Siamese networks can model augmentation-invariance
10. 10 / 32
“Collapsing” Preventing Strategies
• Contrastive Learning
• Repulses different images (negative pairs) while attracting the same image’s two
views (positive pairs)
• SimCLR [2] (Google; cited by ~1500)
• Online Clustering
• SwAV [3] (FAIR; cited by ~200)
• Momentum Encoder
• MoCo [4] (FAIR; cited by ~1250), BYOL [5] (DeepMind; cited by ~350)
[2] Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, November). A simple framework for contrastive learning of visual representations. In International
conference on machine learning (pp. 1597-1607). PMLR.
[3] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster
assignments. arXiv preprint arXiv:2006.09882.
[4] He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 9729-9738).
[5] Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap your own latent: A new approach to self-
supervised learning. arXiv preprint arXiv:2006.07733.
11. 11 / 32
Contrastive Learning
• SimCLR [2]
[2] Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, November). A simple framework for contrastive learning of visual representations. In International
conference on machine learning (pp. 1597-1607). PMLR.
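A hedged sketch of the NT-Xent (InfoNCE) loss that SimCLR optimizes, written in PyTorch; the function name and temperature value are my own choices, not SimCLR's reference code. Every other image in the batch serves as a negative, which is why SimCLR needs large batches to work well.

    import torch
    import torch.nn.functional as F

    def nt_xent(z1, z2, temperature=0.5):
        # z1, z2: (N, d) projections of two views of the same N images.
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d)
        sim = z @ z.t() / temperature                       # pairwise similarities
        sim.fill_diagonal_(float('-inf'))                   # exclude self-pairs
        n = z1.size(0)
        # Positive of row i is row i+N (and vice versa); all others are negatives.
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)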
12. 12 / 32
Clustering
• SwAV [3]
[3] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster
assignments. arXiv preprint arXiv:2006.09882.
13. 13 / 32
Momentum Encoder
• MoCo [4]
• Only the parameters 𝜃𝑞 are updated by back-propagation
• The momentum update makes 𝜃𝑘 evolve more smoothly than 𝜃𝑞
• As a result, though the keys in the queue are encoded by different encoders (in
different mini-batches), the difference among these encoders can be made small.
[4] He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 9729-9738).
Momentum update rule: 𝜃𝑘 ← 𝑚 𝜃𝑘 + (1 − 𝑚) 𝜃𝑞, where 𝑚 ∈ [0, 1) is the momentum coefficient
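A minimal sketch of this EMA update in PyTorch (my own naming; MoCo's reference implementation differs in detail):

    import torch

    @torch.no_grad()
    def momentum_update(encoder_q, encoder_k, m=0.999):
        # theta_k <- m * theta_k + (1 - m) * theta_q
        # Only encoder_q is trained by back-propagation; encoder_k trails it
        # slowly, keeping the keys in the queue encoded near-consistently.
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)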
14. 14 / 32
Momentum Encoder
[5] Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap your own latent: A new approach to self-
supervised learning. arXiv preprint arXiv:2006.07733.
• BYOL [5]
• Relies only on positive pairs, but does not collapse when a momentum encoder is used
• BYOL minimizes a similarity loss between 𝑞𝜃(𝑧𝜃) and sg(𝑧′𝜉)
• where 𝜃 are the trained weights, 𝜉 is an exponential moving average of 𝜃, and sg means stop-gradient
• At the end of training, everything but 𝑓𝜃 is discarded, and 𝑦𝜃 is used as the image
representation
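A hedged sketch of BYOL's loss under these definitions; the module names `online`, `target`, and `predictor` are hypothetical stand-ins for 𝑔𝜃∘𝑓𝜃, 𝑔𝜉∘𝑓𝜉, and 𝑞𝜃. The target branch runs under no_grad, which is the stop-gradient; its weights are updated only by the EMA rule, not by this loss.

    import torch
    import torch.nn.functional as F

    def byol_loss(online, target, predictor, v1, v2):
        p1, p2 = predictor(online(v1)), predictor(online(v2))
        with torch.no_grad():                 # sg(z'_xi): stop-gradient target
            z1, z2 = target(v1), target(v2)
        def d(p, z):                          # MSE of l2-normalized vectors
            return 2 - 2 * F.cosine_similarity(p, z, dim=-1).mean()
        return d(p1, z2) + d(p2, z1)          # symmetrized over the two views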
15. 15 / 32
SimSiam
• SimSiam works surprisingly well without any of the usual collapse-preventing strategies:
• Negative pairs
• Momentum encoder
• Large-batch training
16. 16 / 32
SimSiam
[Figure: SimSiam architecture — two weight-shared encoders process two augmented views; a stop-gradient blocks the gradient flow on one branch, so that branch is not updated by its parameters' gradients]
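A runnable PyTorch sketch of the training step the figure describes, close to the paper's published pseudocode; `f` (backbone + projection MLP) and `h` (prediction MLP) are placeholders for any suitable modules.

    import torch.nn.functional as F

    def D(p, z):
        # Negative cosine similarity; z.detach() is the stop-gradient.
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

    def simsiam_loss(f, h, x1, x2):
        # f is weight-shared across both views; h is applied on one side only.
        z1, z2 = f(x1), f(x2)
        p1, p2 = h(z1), h(z2)
        # Symmetrized: each branch predicts the other branch's detached output.
        return D(p1, z2) / 2 + D(p2, z1) / 2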
17. 17 / 32
SimSiam
• “SimSiam” = “BYOL without Momentum Encoder”
• Hypothesis of factors involved in preventing “collapsing”
• BYOL: momentum encoder
• SimSiam: Stop-gradient operation
• This discovery can be obscured by the use of a momentum encoder, which is always accompanied by stop-gradient (as it is not updated by its parameters' gradients).
• While the moving-average behavior may improve accuracy with an appropriate momentum coefficient, experiments show that the momentum encoder is not directly related to preventing collapse.
21. 21 / 32
Empirical Study
• with vs. without Stop-gradient
• (left) Without stop-gradient, the optimizer quickly finds a degenerate solution and reaches the minimum possible loss of −1
• (mid) If the outputs collapse to a constant vector, their standard deviation over all samples becomes zero for each channel
• (right) The validation accuracy of a kNN classifier can serve as a monitor of progress. With stop-gradient, the kNN monitor shows steadily improving accuracy.
• (table) Linear evaluation result
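A small sketch (my own construction, not the paper's code) of the per-channel standard-deviation check in the (mid) bullet above: if the outputs collapse to a constant, the std of the l2-normalized output is ~0 in every channel, whereas well-spread outputs sit near 1/√d.

    import torch
    import torch.nn.functional as F

    def output_std(z):
        # z: (N, d) batch of output representations.
        z = F.normalize(z, dim=1)      # l2-normalize each sample
        return z.std(dim=0).mean()     # ~0 if collapsed, ~(1/d)**0.5 otherwise

    print(output_std(torch.randn(512, 2048)))  # ≈ (1/2048)**0.5 ≈ 0.022
    print(output_std(torch.ones(512, 2048)))   # 0.0 (collapsed)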
22. 22 / 32
Discussion
• Stop-gradient
• Experiments show that collapsing solutions do exist.
• However, a stop-gradient operation can play a key role in preventing such
solutions.
• The optimizer (batch size), batch normalization, similarity function, and
symmetrization may affect accuracy, but we have seen no evidence that they are
related to collapse prevention.
• The importance of stop-gradient suggests that there should be a different
underlying optimization problem that is being solved.
• The authors hypothesize that there are implicitly two sets of variables, and
SimSiam behaves like alternating between optimizing each set.
23. 23 / 32
Hypothesis
• “SimSiam” = an “Expectation-Maximization”-like algorithm
• It implicitly involves two sets of variables, and solves two underlying sub-problems
• Loss function: ℒ(𝜃, 𝜂) = 𝔼_{𝑥,𝒯}[ ‖ℱ_𝜃(𝒯(𝑥)) − 𝜂𝑥‖²₂ ]
• ℱ: network parameterized by 𝜃
• 𝒯: augmentation
• 𝑥: image
• 𝔼: expectation is over the distribution of images (𝑥) and augmentations (𝒯)
• ‖⋅‖²₂: mean squared error (equivalent to cosine similarity for 𝑙2-normalized vectors)
• 𝜂𝑥: representation of the image 𝑥 (another set of variables)
24. 24 / 32
Hypothesis
• Objective: min_{𝜃, 𝜂} ℒ(𝜃, 𝜂)
• where ℒ(𝜃, 𝜂) = 𝔼_{𝑥,𝒯}[ ‖ℱ_𝜃(𝒯(𝑥)) − 𝜂𝑥‖²₂ ], as defined above
• This formulation is analogous to k-means clustering
• Variable 𝜃 is analogous to the clustering centers
• i.e., learnable parameters of an encoder
• Variable 𝜂𝑥 is analogous to the assignment vector of the sample 𝒙 (one-hot vector
in k-means)
• i.e., representation of 𝑥
25. 25 / 32
Hypothesis
• Solve by an alternating algorithm
• Fixing one set of variables and solving for the other set:
• 𝜃ᵗ ← argmin_𝜃 ℒ(𝜃, 𝜂ᵗ⁻¹) ⋯ (1)
• 𝜂ᵗ ← argmin_𝜂 ℒ(𝜃ᵗ, 𝜂) ⋯ (2)
• 𝑡: index of alternation
• Solving for 𝜽 (clustering centers): SGD can be used to solve (1)
• The stop-gradient operation is a natural consequence, because the gradient does not back-propagate to 𝜂ᵗ⁻¹, which is a constant in this sub-problem
• Solving for 𝜼 (assignment vectors): (2) can be solved independently for each 𝜂𝑥. The problem is now to minimize 𝔼_𝒯[ ‖ℱ_{𝜃ᵗ}(𝒯(𝑥)) − 𝜂𝑥‖²₂ ] for each image 𝑥
• 𝜂𝑥 is assigned the average representation of 𝑥 over the distribution of augmentations: 𝜂𝑥ᵗ ← 𝔼_𝒯[ ℱ_{𝜃ᵗ}(𝒯(𝑥)) ]
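A toy sketch of this alternation (illustrative only; `augment` and all names are mine): the 𝜂 step averages representations over sampled augmentations with gradients disabled, and the 𝜃 step runs SGD against those cached constants, so the stop-gradient emerges naturally.

    import torch

    def eta_step(f, images, augment, n_aug=8):
        # (2): eta_x <- E_T[F_theta(T(x))], approximated with n_aug sampled views.
        with torch.no_grad():              # eta is a constant w.r.t. theta
            return torch.stack(
                [f(augment(images)) for _ in range(n_aug)]).mean(dim=0)

    def theta_step(f, opt, images, augment, eta):
        # (1): one SGD step on ||F_theta(T(x)) - eta_x||^2 with eta held fixed;
        # no gradient reaches eta because it carries no autograd graph.
        opt.zero_grad()
        loss = (f(augment(images)) - eta).pow(2).sum(dim=1).mean()
        loss.backward()
        opt.step()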
26. 26 / 32
Hypothesis
• One-step alternation
• The expectation can be approximated by sampling the augmentation 𝒯 only once, denoted as 𝒯′, and ignoring 𝔼_𝒯[⋅]: 𝜂𝑥ᵗ ← ℱ_{𝜃ᵗ}(𝒯′(𝑥))
• Inserting it into sub-problem (1), we have: 𝜃ᵗ⁺¹ ← argmin_𝜃 𝔼_{𝑥,𝒯}[ ‖ℱ_𝜃(𝒯(𝑥)) − ℱ_{𝜃ᵗ}(𝒯′(𝑥))‖²₂ ]
• 𝜃ᵗ is a constant in this sub-problem, and 𝒯′ implies another view due to its random nature.
27. 27 / 32
Proof-of-concept
• Multi-step alternation
• Hypothesis: “SimSiam algorithm is like alternating between (1) and (2)”
• In this variant, 𝒕 is treated as the index of an outer loop
• And the sub-problem in (1) is updated by an inner loop of 𝒌 SGD steps
• In each alternation, 𝜂𝑥 required for all 𝑘 SGD steps are pre-computed and cached in
memory. Then, 𝑘 SGD steps are performed to update 𝜃
• 1-step = SimSiam
• Alternating optimization is a valid formulation
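A hedged driver for the multi-step variant, reusing the eta_step / theta_step helpers sketched above (toy setup; the paper trains a ResNet on ImageNet): 𝜂 is pre-computed and cached once per outer iteration, then 𝑘 SGD steps update 𝜃 against it.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
    opt = torch.optim.SGD(f.parameters(), lr=0.05)
    images = torch.randn(64, 32)
    augment = lambda v: v + 0.1 * torch.randn_like(v)  # stand-in "augmentation"

    k = 10                                   # inner SGD steps per alternation
    for t in range(20):                      # t: outer-loop alternation index
        eta = eta_step(f, images, augment)   # pre-compute and cache eta ... (2)
        for _ in range(k):                   # update theta, eta held fixed ... (1)
            theta_step(f, opt, images, augment, eta)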
28. 28 / 32
Discussion
• SimSiam’s non-collapsing behavior remains an open question…
• This is the authors’ understanding:
• Alternating optimization provides a different trajectory, and the trajectory depends
on the initialization
• It is unlikely that the initialized 𝜼, which is the output of a randomly initialized
network, would be a constant
• Starting from this initialization, it may be difficult for the alternating optimizer to
approach a constant 𝜼𝒙 for all 𝒙
• because the method does not compute the gradients w.r.t. 𝜼 jointly for all 𝒙
29. 29 / 32
Comparisons
• ImageNet Linear Evaluation
• All are based on pre-trained ResNet-50, with two 224×224 views
• Small batch size
• No negative pairs
• No momentum encoder
• Achieves the highest accuracy among all methods under 100-epoch pre-training
30. 30 / 32
Comparisons
• Transfer Learning
• Representation qualities are compared by transferring them to other tasks
• The pre-trained models are fine-tuned end-to-end on the target datasets
• SimSiam, base: same pre-training recipe as in ImageNet experiment
• SimSiam, optimal: another recipe producing better results in all tasks
31. 31 / 32
Takeaways
• Siamese networks are a natural and effective tool for modeling invariance,
which is a focus of representation learning
• The Siamese structure can be a core reason for recent methods’ success
• Stop-gradient helps to avoid collapsing solutions
Invariance means that two observations of the same concept should produce the same outputs
There have been several general strategies for preventing Siamese networks from collapsing.
SimCLR comes from Google (Hinton is a co-author).
SimCLR directly uses negative samples coexisting in the current batch, and it requires a large batch size to work well.
They alternate between clustering the representations and learning to predict the cluster assignment.
SwAV incorporates clustering into a Siamese network, by computing the assignment from one view and predicting it from another view.
In a Siamese network, MoCo maintains a queue of negative samples and turns one branch into a momentum encoder to improve consistency of the queue.
Two randomly augmented views of one image are processed by the same encoder network f (a backbone plus a projection MLP).
Then a prediction MLP h is applied on one side, and a stop-gradient operation is applied on the other side.
The model maximizes the similarity between both sides.
It uses neither negative pairs nor a momentum encoder.
The architectures and all hyper-parameters are kept unchanged, and stop-gradient is the only difference.