220109 dl paper review grokking (iclr 2021 workshop)


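Today's paper is about the so-called grokking phenomenon: a model generalizing after it has already overfit a small dataset. Once a model has fit its training set, training error keeps decreasing as iterations continue, while test error tends to reach a minimum and then rise again. Training is normally stopped at the point where training and test error are jointly lowest, and the model is then said to generalize well. Once overfitting sets in, the model tends to fail at inference on the test set. The authors of this paper, however, found that if an overfit model keeps being trained, at some point it suddenly succeeds in generalizing, and they named this the grokking phenomenon. 이근배 of the Fundamental Team gives a detailed review of the methodology, the experiments, and the grokking phenomenon. Thank you in advance for your interest! https://youtu.be/mcnSN645xUE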

Paper review
2022/1/9
Presenter 이근배
Fundamental Team 김동현, 김채현, 박종익, 송헌, 양현모, 오대환, 이재윤, 조남경
1st Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR 2021
https://alogs.theguntretort.com/.media/21f57cd5af2ccd6a1e95ee2ec1dc91c538a70f7375d6e98e50a58eabf8fbc197.pdf

Image credit: Different methods for mitigating overfitting on Neural Networks, Pablo Sanchez https://quantdare.com/mitigating-overfitting-neural-networks/
Recap: Model generalization

Grokking: A dramatic example of generalization far after overfitting on an
algorithmic dataset
Left: Figure 1, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Right: Figure 4, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.

Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained) https://youtu.be/dND-7llwrpw

Contributions
• Long after severely overfitting, validation accuracy sometimes suddenly
begins to increase from chance level toward perfect generalization. We call
this phenomenon ‘grokking’.
• We find that weight decay is particularly effective at improving
generalization on the tasks we study.

Dataset: Binary operations
Appendix A, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
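The dataset slide refers to tables of equations of the form a ∘ b = c. Below is a minimal sketch of how such a table might be enumerated and split into training and validation sets, assuming addition modulo a small prime as the operation. The function names, the modulus p = 7, and the 50% training fraction are illustrative choices, not values taken from the slides.

```python
import random


def add_mod(a, b, p):
    """Example binary operation (an assumption): addition modulo p."""
    return (a + b) % p


def make_binary_op_dataset(p=7, op=add_mod, train_fraction=0.5, seed=0):
    """Enumerate every equation (a, b, a ∘ b) over 0..p-1 and split it at random."""
    equations = [(a, b, op(a, b, p)) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(equations)
    n_train = int(train_fraction * len(equations))
    return equations[:n_train], equations[n_train:]


train, val = make_binary_op_dataset(p=7, train_fraction=0.5)
print(len(train), len(val))
```

Lowering `train_fraction` gives the model fewer cells of the table to learn from, which is the quantity varied in the "training data fraction" experiments cited on the later slides.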

Tuning optimization hyperparameters
1. Adam w/ full batch
2. Adam
3. Adam w/ full batch and Gaussian noise added to the update direction for each
parameter (W ← W + lr · (∆W + ε), where ε is sampled from unit Gaussian and ∆W
is the standard Adam weight update)
4. Adam w/ dropout = 0.1
5. AdamW w/ weight decay = 1
6. AdamW w/ weight decay = 1 towards the initialization instead of the origin
7. Adam w/ lr = 3 · 10^-4
8. Adam w/ lr = 3 · 10^-3
9. Adam w/ Gaussian weight noise of standard deviation = 0.01 (i.e. each
parameter W replaced by W + 0.01 · ε in the model, with ε sampled from unit
Gaussian).
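Items 3, 5, 6, and 9 in the list above differ only in one extra term applied to the parameters. The sketch below shows those terms on plain Python lists, assuming the decoupled form of weight decay (each parameter shrinks toward a target by lr · weight_decay per step). The helper names are hypothetical, the Adam update ∆W is taken as given rather than computed, and this is not the authors' code.

```python
import random


def decay_towards(w, target, lr, weight_decay):
    """Decoupled weight decay toward `target`.

    target = zeros gives decay toward the origin (item 5);
    target = initial weights gives decay toward the initialization (item 6).
    """
    return [wi - lr * weight_decay * (wi - ti) for wi, ti in zip(w, target)]


def noisy_update(w, delta, lr, rng):
    """Item 3: W <- W + lr * (delta + eps), with eps ~ N(0, 1) per parameter."""
    return [wi + lr * (di + rng.gauss(0.0, 1.0)) for wi, di in zip(w, delta)]


def weight_noise(w, std, rng):
    """Item 9: the model is evaluated with W + std * eps in place of W."""
    return [wi + std * rng.gauss(0.0, 1.0) for wi in w]


rng = random.Random(0)
w_init = [0.5, -0.5]
w = [1.0, -2.0]
print(decay_towards(w, [0.0, 0.0], lr=0.1, weight_decay=1.0))  # toward origin
print(decay_towards(w, w_init, lr=0.1, weight_decay=1.0))      # toward init
print(noisy_update(w, [0.1, -0.1], lr=1e-3, rng=rng))
print(weight_noise(w, std=0.01, rng=rng))
```

Decay toward the initialization leaves a parameter untouched when it still equals its initial value, whereas decay toward the origin always pulls it toward zero.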

Figure 1, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Training time required to reach 99% validation accuracy
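The quantity named in this caption can be read off a logged accuracy curve. Here is a small sketch, assuming accuracy is logged as (step, accuracy) pairs in increasing step order; the helper name and the log values are made up for illustration and are not results from the paper.

```python
def first_step_reaching(log, threshold=0.99):
    """Return the first logged step with accuracy >= threshold, or None."""
    for step, acc in log:
        if acc >= threshold:
            return step
    return None


# Hypothetical logs: training accuracy saturates early, validation much later.
train_log = [(10, 0.40), (100, 0.995), (1000, 1.0), (10000, 1.0), (100000, 1.0)]
val_log = [(10, 0.01), (100, 0.02), (1000, 0.03), (10000, 0.50), (100000, 0.995)]

train_step = first_step_reaching(train_log)
val_step = first_step_reaching(val_log)
# Gap between fitting the training set and generalizing to validation data.
delay = None if train_step is None or val_step is None else val_step - train_step
print(train_step, val_step, delay)
```

A large gap between the two steps is the signature described on the Contributions slide: validation accuracy rises long after the training set has been fit.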

Figure 2, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Best validation accuracy achieved after 10^5 steps

Figure 2, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Different optimization algorithms lead to different amounts of generalization

Figure 6, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Generalization with memorizing several outliers
