Paper review
2022/1/9
Presenter: 이근배
Fundamental Team: 김동현, 김채현, 박종익, 송헌, 양현모, 오대환, 이재윤, 조남경
1st Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR 2021
https://alogs.theguntretort.com/.media/21f57cd5af2ccd6a1e95ee2ec1dc91c538a70f7375d6e98e50a58eabf8fbc197.pdf
Image credit: Different methods for mitigating overfitting on Neural Networks, Pablo Sanchez https://quantdare.com/mitigating-overfitting-neural-networks/
Recap: Model generalization
Grokking: A dramatic example of generalization long after overfitting on an algorithmic dataset
Left: Figure 1, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Right: Figure 4, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained) https://youtu.be/dND-7llwrpw
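The phenomenon only becomes visible if training is continued far past the point where training accuracy saturates, with validation accuracy logged throughout and plotted against a log-scaled step axis. The sketch below is a minimal illustration of such a monitoring loop, not the authors' implementation; `model`, `train_batches`, and `val_loader` are assumed placeholders, and the optimizer settings are illustrative.

```python
# Minimal monitoring-loop sketch (NOT the authors' code).
# Assumed placeholders: `model` (a classifier), `train_batches` (an infinite
# iterator of (x, y) minibatches), `val_loader` (a DataLoader of held-out pairs).
import torch

def accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=-1) == y).sum().item()
            total += y.numel()
    return correct / total

# Illustrative hyperparameters; weight decay = 1 matches one of the variants below.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = torch.nn.CrossEntropyLoss()
log = []  # (step, validation accuracy) pairs

for step in range(10**5):          # keep training long after train accuracy hits 100%
    model.train()
    x, y = next(train_batches)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 1000 == 0:
        log.append((step, accuracy(model, val_loader)))
```

Plotting the logged pairs with a logarithmic x-axis is what makes the delayed jump from chance-level to near-perfect validation accuracy visible when it occurs.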
Contributions
• Long after severely overfitting, validation accuracy sometimes suddenly
begins to increase from chance level toward perfect generalization. We call
this phenomenon ‘grokking’.
• We find that weight decay is particularly effective at improving
generalization on the tasks we study.
Dataset: Binary operations
Appendix A, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
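Each dataset is the complete table of a binary operation over a small finite set, with every equation rendered as a short token sequence and a fixed fraction of the table held out for validation; the training fraction is the main variable the paper sweeps. The snippet below is a hedged sketch of how such a table could be built and split. Modular addition, p = 97, the 50/50 split, and the token IDs are illustrative assumptions, not the paper's exact configuration for every operation.

```python
# Hedged sketch of a binary-operation dataset in the spirit of Appendix A.
# Assumptions: modular addition, p = 97, 50% train / 50% validation split.
import random

p = 97
equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]  # full op table

random.seed(0)
random.shuffle(equations)
split = int(0.5 * len(equations))      # training fraction is the key experimental knob
train_set, val_set = equations[:split], equations[split:]

# Each equation is fed to the network as a token sequence, with the result
# token as the prediction target. Token IDs here are purely illustrative.
def tokens(a, b, c):
    return [a, p, b, p + 1, c]         # p = "+" token, p + 1 = "=" token
```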
Tuning optimization hyperparameters
1. Adam w/ full batch
2. Adam
3. Adam w/ full batch and Gaussian noise added to the update direction for each
parameter (W ← W + lr · (∆W + ε), where ε is sampled from a unit Gaussian and ∆W
is the standard Adam weight update)
4. Adam w/ dropout = 0.1
5. AdamW w/ weight decay = 1
6. AdamW w/ weight decay = 1 toward the initialization instead of the origin (see
the sketch after this list)
7. Adam w/ lr = 3 · 10⁻⁴
8. Adam w/ lr = 3 · 10⁻³
9. Adam w/ Gaussian weight noise of standard deviation 0.01 (i.e. each
parameter W is replaced by W + 0.01 · ε in the model, with ε sampled from a unit
Gaussian)
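Variant 6 decays weights toward their initial values rather than toward zero: where decoupled weight decay subtracts lr · λ · W after the Adam update, the decay term is instead taken relative to a snapshot of the initialization. The following is a hedged reconstruction of that idea, not the authors' code; the helper name and hyperparameters are illustrative.

```python
# Hedged sketch of "weight decay toward the initialization" (variant 6).
import torch

def make_decay_to_init(model, lr=1e-3, weight_decay=1.0):
    # Snapshot of the initial parameters; the decay term pulls toward these.
    init = {name: p.detach().clone() for name, p in model.named_parameters()}
    # Plain Adam, no built-in decay; the decoupled decay is applied manually below.
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    def step():
        opt.step()
        with torch.no_grad():
            for name, p in model.named_parameters():
                p -= lr * weight_decay * (p - init[name])   # pull toward W_init, not 0
    return opt, step
```

In a training loop one would call the returned step() in place of optimizer.step(); setting weight_decay = 0 recovers plain Adam, and replacing (p − init[name]) with p recovers ordinary decoupled weight decay toward the origin.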
Figure 1, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Training time required to reach 99% validation accuracy
Figure 2, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Best validation accuracy achieved after 10⁵ steps
Figure 2, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Different optimization algorithms lead to different amounts of generalization
Figure 6, Power, Alethea, et al. "Grokking: Generalization beyond overfitting on small algorithmic datasets." ICLR MATH-AI Workshop. 2021.
Generalization with memorizing several outliers