5. Method
• Motivation
• Step 1 → 1.4% Top 1 Acc
• Fixed randomly initialized encoder + trainable linear layer
• Trained on the labeled dataset
• Step 2 → 18.8% Top 1 Acc
• Predict labels for an unlabeled dataset with the randomly initialized encoder & linear layer
• Use the predicted labels to pretrain a new fixed randomly initialized encoder + trainable linear layer
• Then train on the actual labeled dataset
→ The online network used in practice needs a target network!
(Self-knowledge distillation + semi-supervised learning?)
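The two-step bootstrap experiment above can be sketched with toy stand-ins. Everything here is an illustrative assumption, not the paper's setup: a frozen random projection plays the encoder (the paper uses a ResNet), and a least-squares fit plays the trained linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_encoder(d_in, d_feat, rng):
    """A fixed, randomly initialized encoder: frozen random projection + ReLU."""
    W = rng.standard_normal((d_in, d_feat)) / np.sqrt(d_in)
    return lambda x: np.maximum(x @ W, 0.0)

def fit_linear_head(feats, onehot):
    """Train only the linear layer on top of frozen features (least squares)."""
    return np.linalg.lstsq(feats, onehot, rcond=None)[0]

d_in, d_feat, n_cls = 32, 64, 10
x_labeled = rng.standard_normal((500, d_in))
y_labeled = rng.integers(0, n_cls, 500)
x_unlabeled = rng.standard_normal((2000, d_in))
onehot = np.eye(n_cls)[y_labeled]

# Step 1: fixed random encoder + trainable linear head on the labeled set.
enc1 = random_encoder(d_in, d_feat, rng)
head1 = fit_linear_head(enc1(x_labeled), onehot)

# Step 2a: step-1 predictions on the unlabeled set become pseudo-labels
# ("bootstrap"), used to pretrain a head on a NEW random encoder.
pseudo = np.eye(n_cls)[(enc1(x_unlabeled) @ head1).argmax(axis=1)]
enc2 = random_encoder(d_in, d_feat, rng)
head2_pre = fit_linear_head(enc2(x_unlabeled), pseudo)

# Step 2b: then train on the real labeled set. (A closed-form refit ignores
# the warm start; with iterative SGD training, the pretraining is what helps.)
head2 = fit_linear_head(enc2(x_labeled), onehot)
```

The point of the toy is the data flow, not the numbers: predictions of one randomly initialized network serve as targets for the next, which is the bootstrapping idea BYOL builds on.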
https://hoya012.github.io/blog/byol/
6. Method
• Terminology
• 𝜃 : a set of weights of the online network
• 𝜉 : a set of weights of the target network
[Figure: BYOL pipeline — Augmentation → Encoder → Projector → Predictor on the online branch; stop-gradient on the target branch]
7. Method
• Description of BYOL
• The target network produces the regression target that the online network is trained to predict
• The target network's parameters 𝜉 are an exponential moving average of the online parameters 𝜃:
𝜉 ⟵ 𝜏𝜉 + (1 − 𝜏)𝜃, 𝜏 ∈ [0, 1]
• Loss: mean squared error between the ℓ2-normalized prediction q̄_𝜃(z_𝜃) and z̄′_𝜉:
ℒ_𝜃^BYOL ≜ ‖q̄_𝜃(z_𝜃) − z̄′_𝜉‖₂² = 2 − 2 ⋅ ⟨q_𝜃(z_𝜃), z′_𝜉⟩ / (‖q_𝜃(z_𝜃)‖₂ ⋅ ‖z′_𝜉‖₂)
where q̄_𝜃(z_𝜃) ≜ q_𝜃(z_𝜃) / ‖q_𝜃(z_𝜃)‖₂ and z̄′_𝜉 ≜ z′_𝜉 / ‖z′_𝜉‖₂
• Swapping the two networks' inputs (feeding each augmented view to the other branch) gives ℒ̃_𝜃^BYOL
• The symmetrized loss ℒ_𝜃^BYOL + ℒ̃_𝜃^BYOL is minimized with respect to the online network's parameters 𝜃 only
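The loss and the EMA update above can be sketched in a few lines of numpy. The batch of random vectors and the 256-dimensional shape are assumptions for illustration; the check at the end confirms the two algebraic forms of the loss agree.

```python
import numpy as np

def byol_loss(q, z_prime):
    """MSE between the l2-normalized prediction q_theta(z_theta) and the
    target projection z'_xi; algebraically equal to 2 - 2 * cosine similarity."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    zn = z_prime / np.linalg.norm(z_prime, axis=-1, keepdims=True)
    return np.sum((qn - zn) ** 2, axis=-1)

def ema_update(xi, theta, tau=0.996):
    """Target parameters track the online parameters: xi <- tau*xi + (1-tau)*theta.
    No gradient flows into xi (stop-gradient on the target branch)."""
    return tau * xi + (1.0 - tau) * theta

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 256))   # online prediction q_theta(z_theta)
z = rng.standard_normal((4, 256))   # target projection z'_xi

loss = byol_loss(q, z)              # per-sample loss, in [0, 4]
cos = np.sum(q * z, axis=-1) / (
    np.linalg.norm(q, axis=-1) * np.linalg.norm(z, axis=-1))
print(np.allclose(loss, 2.0 - 2.0 * cos))  # the two forms of the loss agree
```

Because only the angle between q and z′ matters after normalization, the loss is bounded and identical vectors give exactly zero.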
9. Method
• Implementation details
• Architecture
• ResNet50
• 4096-dimension MLP (projection) with no batch normalization
• 256-dimension prediction layer
• Optimization
• LARS optimizer
• 1000 epochs with warm-up period of 10 epochs
• Linear scaled learning rate 0.2 (LearningRate = 0.2 x BatchSize/256)
• Global weight decay of 1.5 ⋅ 10⁻⁶
• 𝜏_base = 0.996 and 𝜏 ≜ 1 − (1 − 𝜏_base) ⋅ (cos(𝜋𝑘/𝐾) + 1)/2, with k the current training step and K the total number of steps
• 4096 batch size split over 512 Cloud TPU v3 cores
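The learning-rate scaling rule and the 𝜏 schedule above are simple enough to verify directly. This is a minimal sketch; the function names are ours, not from the paper.

```python
import math

def scaled_lr(base_lr=0.2, batch_size=4096):
    """Linear LR scaling: LearningRate = 0.2 * BatchSize / 256."""
    return base_lr * batch_size / 256

def tau_schedule(k, K, tau_base=0.996):
    """EMA coefficient increased from tau_base to 1 over training:
    tau = 1 - (1 - tau_base) * (cos(pi * k / K) + 1) / 2."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * k / K) + 1.0) / 2.0

print(scaled_lr())            # 3.2 for batch size 4096
print(tau_schedule(0, 1000))  # starts at tau_base = 0.996
print(tau_schedule(1000, 1000))  # ends at 1.0 (target frozen to the EMA limit)
```

The schedule makes the target network update quickly early on and become nearly static by the end of training.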
17. Conclusion
• Learns representations without negative pairs
• But a large batch size is still required
• State-of-the-art on several tasks
• Though already surpassed by SimCLRv2..
• Robust to the choice of augmentation options
• Still, finding suitable augmentations remains necessary