High-Performance Large-Scale Image
Recognition Without Normalization
New SOTA validation accuracies on ImageNet by DeepMind
Image team: 김병현, 박동훈, 안종식, 이찬혁, 허다운, 홍은기
Presenter: 박동훈
https://arxiv.org/abs/2102.06171
Contents
1. Performance
2. Batch Normalization: pros and cons
3. Previous Normalizer-Free Networks
4. Proposed Method : Adaptive gradient clipping
5. Experimental Results
6. Conclusion
Performance
• Matches the test accuracy of EfficientNet-B7 on ImageNet while training up to 8.7x faster
• New state-of-the-art top-1 accuracy of 86.5%
- After pre-training on 300 million labeled images and then fine-tuning on ImageNet, it achieves 89.2% top-1 accuracy
• Image Classification on ImageNet [1]
[1] https://paperswithcode.com/sota/image-classification-on-imagenet
Batch Normalization (background)
• Changes in the distribution of each layer's inputs during training cause "internal covariate shift" [2]
Effects of Batch Normalization
· Downscales the residual branch
· Regularizing effect
· Eliminates mean-shift
· Enables efficient large-batch training
[2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture07.pdf
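Train
Also compute exponential moving averages of the mean and variance, where $\mu_{\beta}$ and $\sigma_{\beta}^2$ are the statistics of the current mini-batch:
$\mu \leftarrow (1 - \alpha)\,\mu + \alpha\,\mu_{\beta}$
$\sigma^2 \leftarrow (1 - \alpha)\,\sigma^2 + \alpha\,\sigma_{\beta}^2$
Test
$\mathrm{BN}_{\text{test}}(x) = \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$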
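The train/test split in batch normalization can be illustrated with a small pure-Python sketch. It is only a toy under stated assumptions: inputs are scalars rather than feature maps, and the function names (`batch_stats`, `update_running`, `bn_test`) and the default values of `alpha` and `eps` are hypothetical choices, not taken from the slides.

```python
import math

def batch_stats(xs):
    """Mean and (biased) variance of one mini-batch of scalars."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def update_running(running_mean, running_var, xs, alpha=0.1):
    """Training step: blend the batch statistics into the moving averages."""
    mean_b, var_b = batch_stats(xs)
    running_mean = (1 - alpha) * running_mean + alpha * mean_b
    running_var = (1 - alpha) * running_var + alpha * var_b
    return running_mean, running_var

def bn_test(x, running_mean, running_var, eps=1e-5):
    """Inference: normalize with the stored averages, not batch statistics."""
    return (x - running_mean) / math.sqrt(running_var + eps)

# Two training batches update the moving averages (alpha = 0.1 is assumed).
running_mean, running_var = 0.0, 1.0
for batch in ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]):
    running_mean, running_var = update_running(running_mean, running_var, batch)
```

Because inference uses the stored averages, the output for a given input no longer depends on which other examples share its batch, which is the source of the train/inference discrepancy discussed on the following slides.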
Batch Normalization - Bad
• It is a surprisingly expensive computational primitive that incurs memory overhead
→ Computational overhead
• It introduces a discrepancy between the model's behavior during training and at inference time
→ Training and inference behave differently
• It breaks the independence between training examples in the minibatch
→ Examples within a mini-batch are no longer independent
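Batch Normalization - Bad (continued)
• It allows residual networks to be trained with large learning rates, but this is only beneficial if the batch size is also large
→ Batch normalization permits a large learning rate, but it only pays off when the batch size is large as well
• Batch normalization is often the cause of subtle implementation errors, especially during distributed training (Pham et al., 2019)
→ (According to practitioners) results with batch normalization varied from run to run depending on the hardware, and subtle implementation errors arose, particularly in distributed training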
https://www.youtube.com/watch?v=rNkHjZtH0RQ
Q & A
Previous Normalizer-Free Networks
De, S. and Smith, S. Batch normalization biases residual blocks towards the identity function in deep networks. In NeurIPS, 2020.
"If our theory is correct, it should be possible to train deep residual networks without normalization, simply by downscaling the residual branch."
[Figure: three panels (fully connected linear unnormalized residual network; fully connected linear normalized residual network; normalized convolutional residual network), annotated to show variance reduction on the residual branch. Source: "Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks" (NeurIPS 2020)]
Previous Normalizer-Free Networks
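• Residual block: $h_{i+1} = h_i + \alpha f_i(h_i / \beta_i)$
• $h_i$: input to the $i$-th residual block
• $f_i$ preserves variance: $\mathrm{Var}(f_i(z)) = \mathrm{Var}(z)$
• $\alpha$: scalar multiplied onto the output of the residual branch to limit variance growth, e.g. $\alpha = 0.2$
• $\beta_i = \sqrt{\mathrm{Var}(h_i)}$, where $\mathrm{Var}(h_{i+1}) = \mathrm{Var}(h_i) + \alpha^2$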
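[Figure: the original residual unit proposed by K. He et al. next to the "Normalizer-Free ResNets" (NF-ResNets) block (Brock et al., 2021), in which the residual branch input is scaled by $1/\beta$ and its output by $\alpha$]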
Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized resnets. In ICLR, 2021
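The variance bookkeeping behind the NF-ResNet block can be sketched in a few lines of Python. This is a minimal illustration, assuming the variance-preserving property Var(f_i(z)) = Var(z) holds exactly and that the input has unit variance; the helper names `expected_betas` and `nf_residual_block` are hypothetical.

```python
import math

def expected_betas(num_blocks, alpha=0.2, var_in=1.0):
    """Predict beta_i = sqrt(Var(h_i)) analytically for each residual block.

    Assumes Var(f_i(z)) = Var(z), so every block adds alpha**2 to the variance.
    """
    betas = []
    var = var_in
    for _ in range(num_blocks):
        betas.append(math.sqrt(var))
        var += alpha ** 2              # Var(h_{i+1}) = Var(h_i) + alpha^2
    return betas, var

def nf_residual_block(h, f, alpha, beta):
    """h_{i+1} = h_i + alpha * f_i(h_i / beta_i), for h given as a list."""
    branch = f([x / beta for x in h])
    return [x + alpha * b for x, b in zip(h, branch)]

betas, var_out = expected_betas(num_blocks=4, alpha=0.2)
out = nf_residual_block([2.0, 4.0], lambda z: z, alpha=0.5, beta=2.0)
```

Since the betas follow from a closed-form recursion, they can be computed once ahead of training instead of being estimated from batch statistics.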
Previous Normalizer-Free Networks
Other techniques
• Scaled Weight Standardization (Huang et al., 2017; Qiao et al., 2019)
$\hat{W}_{ij} = \dfrac{W_{ij} - \mu_i}{\sqrt{N}\,\sigma_i}$, where $\mu_i = \frac{1}{N}\sum_j W_{ij}$ and $\sigma_i^2 = \frac{1}{N}\sum_j (W_{ij} - \mu_i)^2$
• Dropout (Srivastava et al., 2014)
• Stochastic Depth (Huang et al., 2016)
Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In ICCV 2017.
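The weight standardization formula above can be written out as a short Python sketch. It assumes the weight is a plain list of rows, each row holding the N fan-in weights of one output unit; the function name and the `eps` guard against zero variance are additions for illustration and do not appear in the slides.

```python
import math

def scaled_weight_standardize(weight, eps=1e-8):
    """Return W_hat[i][j] = (W[i][j] - mu_i) / (sqrt(N) * sigma_i) per row i.

    `eps` is an assumed safeguard added inside the square root.
    """
    out = []
    for row in weight:
        n = len(row)
        mu = sum(row) / n
        var = sum((w - mu) ** 2 for w in row) / n
        scale = math.sqrt(n * (var + eps))   # sqrt(N) * sigma_i
        out.append([(w - mu) / scale for w in row])
    return out

w_hat = scaled_weight_standardize([[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]])
```

After standardization each row has zero mean and a squared norm of approximately one, independent of the scale of the raw weights.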
Q & A
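Gradient Clipping (background)
• Helps gradient descent behave reasonably on steep cliffs of the loss surface
• Widely used when training RNN-family models
• The threshold $\lambda$ is hard to tune
• But it enables training with larger batch sizes
If the gradient norm $\|g\|$ exceeds the threshold $\lambda$,
→ rescale the gradient so that its norm equals the threshold: $g \leftarrow \lambda\, g / \|g\|$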
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In ICML, 2013.
$\hat{g}$: gradient vector
$\epsilon$: loss
$\theta$: parameter vector
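Standard gradient clipping by norm, as described on this slide, can be sketched as follows. The gradient is treated as a flat list of floats, and the function name `clip_by_norm` is a hypothetical choice for this example.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale `grad` to norm `threshold` if its L2 norm exceeds it."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [threshold * g / norm for g in grad]
    return list(grad)

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)     # norm 5 is above 1
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)   # norm 0.5 is below 1
```

The direction of the gradient is preserved; only its magnitude is capped, which is why the result is sensitive to the choice of threshold.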
Adaptive Gradient Clipping
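• Motivated by the observation that the ratio $\|G^l\| / \|W^l\|$ can serve as a unit of learning (the relative size of a weight update)
• $W^l \in \mathbb{R}^{N \times M}$: weight matrix of the $l$-th layer
• $G^l \in \mathbb{R}^{N \times M}$: gradient with respect to $W^l$
• $\|W^l\|_F = \sqrt{\sum_i^N \sum_j^M (W^l_{i,j})^2}$, $\|W^l\| = \max(\|W^l\|_F, \epsilon)$, $\epsilon = 10^{-3}$
• $\lambda \in \{0.01, 0.02, 0.04, 0.08, 0.16\}$
For a gradient descent step with learning rate $h$:
$\Delta W^l = -h\,G^l$
$\dfrac{\|\Delta W^l\|}{\|W^l\|} = h\,\dfrac{\|G^l\|}{\|W^l\|}$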
[Figure: "Training" illustration [1], with the label "Adaptive" marking the $1/\|W^l\|$ factor]
[1] https://www.youtube.com/watch?v=o_peo6U7IRM
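A minimal Python sketch of adaptive gradient clipping, using the quantities defined on this slide, might look like the following. It clips when the ratio of gradient norm to weight norm exceeds $\lambda$. Note the simplifying assumption: the norms here are taken over the whole layer matrix to match the slide's notation, whereas the paper applies the rule unit-wise (per row). The function names are hypothetical.

```python
import math

def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def adaptive_grad_clip(weight, grad, lam=0.01, eps=1e-3):
    """Clip `grad` so that ||G|| / max(||W||_F, eps) never exceeds `lam`."""
    w_norm = max(frobenius_norm(weight), eps)   # eps keeps zero-init weights trainable
    g_norm = frobenius_norm(grad)
    if g_norm / w_norm > lam:
        scale = lam * w_norm / g_norm
        return [[scale * g for g in row] for row in grad]
    return [list(row) for row in grad]

big = adaptive_grad_clip([[3.0, 4.0]], [[6.0, 8.0]], lam=0.01)
small = adaptive_grad_clip([[3.0, 4.0]], [[0.003, 0.004]], lam=0.01)
zero_w = adaptive_grad_clip([[0.0, 0.0]], [[1.0, 0.0]], lam=0.01)
```

Unlike a fixed threshold, the effective clipping level scales with the weight norm, so layers with small weights are clipped more aggressively than layers with large ones.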
Experimental Results
Model Details
https://github.com/deepmind/deepmind-research/tree/master/nfnets
[Training Details]
• Softmax cross-entropy loss with label smoothing of 0.1
• Stochastic gradient descent with Nesterov momentum of 0.9
• Weight decay coefficient of 2 × 10^-5
• Dropout and Stochastic Depth (0.25)
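To make the loss in the training details concrete, here is a small Python sketch of softmax cross-entropy with label smoothing. It assumes the common convention in which the smoothing mass is spread uniformly over all K classes; the slides do not say which convention was used, and the function name is hypothetical.

```python
import math

def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Softmax cross-entropy against a smoothed one-hot target."""
    k = len(logits)
    m = max(logits)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]
    # Smoothed target: (1 - s) on the true class plus s / K on every class.
    q = [smoothing / k + (1.0 - smoothing) * (i == target) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

uniform_loss = label_smoothed_cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
smoothed = label_smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, smoothing=0.1)
hard = label_smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, smoothing=0.0)
```

With smoothing enabled, a very confident correct prediction still incurs some loss, which discourages the logits from growing without bound.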
Conclusion
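• The first model that, without batch normalization, surpasses the performance of batch-normalized models when trained with large batch sizes
• It achieves accuracy comparable to batch-normalized models while training faster
• The authors built a family of models that apply the AGC technique
• They showed that normalizer-free models actually perform better when fine-tuned after large-scale pre-training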
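Personal thoughts
• This team has done a great deal of research on training without normalization.
• They wanted to speed up training so that models could be validated faster.
• They tried training without normalization, but performance got worse once the batch size was increased.
• The final result came out of experimenting with many different ideas.
Thank you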
More Related Content

What's hot

Robot, Learning From Data
Robot, Learning From DataRobot, Learning From Data
Robot, Learning From DataSungjoon Choi
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsSungchul Kim
 
Kernel, RKHS, and Gaussian Processes
Kernel, RKHS, and Gaussian ProcessesKernel, RKHS, and Gaussian Processes
Kernel, RKHS, and Gaussian ProcessesSungjoon Choi
 
Decision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer: Reinforcement Learning via Sequence ModelingDecision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer: Reinforcement Learning via Sequence ModelingTomoya Oda
 
Image classification using cnn
Image classification using cnnImage classification using cnn
Image classification using cnnDebarko De
 
increasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningincreasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningRyo Iwaki
 
Super Resolution with OCR Optimization
Super Resolution with OCR OptimizationSuper Resolution with OCR Optimization
Super Resolution with OCR OptimizationniveditJain
 
Online Coreset Selection for Rehearsal-based Continual Learning
Online Coreset Selection for Rehearsal-based Continual LearningOnline Coreset Selection for Rehearsal-based Continual Learning
Online Coreset Selection for Rehearsal-based Continual LearningMLAI2
 
自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用Ryo Iwaki
 
Clustering introduction
Clustering introductionClustering introduction
Clustering introductionYan Xu
 
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...MLAI2
 
方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用Ryo Iwaki
 
Convolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision ApplicationsConvolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision ApplicationsAlex Conway
 
Gan seminar
Gan seminarGan seminar
Gan seminarSan Kim
 
KDD Poster Nurjahan Begum
KDD Poster Nurjahan BegumKDD Poster Nurjahan Begum
KDD Poster Nurjahan BegumNurjahan Begum
 
Convolutional neural network
Convolutional neural network Convolutional neural network
Convolutional neural network Yan Xu
 
ゆるふわ強化学習入門
ゆるふわ強化学習入門ゆるふわ強化学習入門
ゆるふわ強化学習入門Ryo Iwaki
 

What's hot (20)

Robot, Learning From Data
Robot, Learning From DataRobot, Learning From Data
Robot, Learning From Data
 
Score based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential EquationsScore based Generative Modeling through Stochastic Differential Equations
Score based Generative Modeling through Stochastic Differential Equations
 
Kernel, RKHS, and Gaussian Processes
Kernel, RKHS, and Gaussian ProcessesKernel, RKHS, and Gaussian Processes
Kernel, RKHS, and Gaussian Processes
 
InfoGAIL
InfoGAIL InfoGAIL
InfoGAIL
 
TensorFlow in 3 sentences
TensorFlow in 3 sentencesTensorFlow in 3 sentences
TensorFlow in 3 sentences
 
Decision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer: Reinforcement Learning via Sequence ModelingDecision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer: Reinforcement Learning via Sequence Modeling
 
Image classification using cnn
Image classification using cnnImage classification using cnn
Image classification using cnn
 
increasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningincreasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learning
 
Super Resolution with OCR Optimization
Super Resolution with OCR OptimizationSuper Resolution with OCR Optimization
Super Resolution with OCR Optimization
 
Online Coreset Selection for Rehearsal-based Continual Learning
Online Coreset Selection for Rehearsal-based Continual LearningOnline Coreset Selection for Rehearsal-based Continual Learning
Online Coreset Selection for Rehearsal-based Continual Learning
 
自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用
 
Clustering introduction
Clustering introductionClustering introduction
Clustering introduction
 
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...
Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Unce...
 
Deep Learning for Computer Vision: Visualization (UPC 2016)
Deep Learning for Computer Vision: Visualization (UPC 2016)Deep Learning for Computer Vision: Visualization (UPC 2016)
Deep Learning for Computer Vision: Visualization (UPC 2016)
 
方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用
 
Convolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision ApplicationsConvolutional Neural Networks for Computer vision Applications
Convolutional Neural Networks for Computer vision Applications
 
Gan seminar
Gan seminarGan seminar
Gan seminar
 
KDD Poster Nurjahan Begum
KDD Poster Nurjahan BegumKDD Poster Nurjahan Begum
KDD Poster Nurjahan Begum
 
Convolutional neural network
Convolutional neural network Convolutional neural network
Convolutional neural network
 
ゆるふわ強化学習入門
ゆるふわ強化学習入門ゆるふわ強化学習入門
ゆるふわ強化学習入門
 

Similar to High performance large-scale image recognition without normalization

Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentationOwin Will
 
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks IISang Jun Lee
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesValue Amplify Consulting
 
Recent advances of AI for medical imaging : Engineering perspectives
Recent advances of AI for medical imaging : Engineering perspectivesRecent advances of AI for medical imaging : Engineering perspectives
Recent advances of AI for medical imaging : Engineering perspectivesNamkug Kim
 
Restricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for AttributionRestricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for Attributiontaeseon ryu
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networksAkash Goel
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)DonghyunKang12
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer VisionSungjoon Choi
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...Edge AI and Vision Alliance
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryAhmed Yousry
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfDuy-Hieu Bui
 
NITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxNITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxDrKBManwade
 
NITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxNITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxssuserd23711
 
Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...IJECEIAES
 
Deep Learning for Computer Vision - PyconDE 2017
Deep Learning for Computer Vision - PyconDE 2017Deep Learning for Computer Vision - PyconDE 2017
Deep Learning for Computer Vision - PyconDE 2017Alex Conway
 
Deep learning in Computer Vision
Deep learning in Computer VisionDeep learning in Computer Vision
Deep learning in Computer VisionDavid Dao
 

Similar to High performance large-scale image recognition without normalization (20)

Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
 
Recent advances of AI for medical imaging : Engineering perspectives
Recent advances of AI for medical imaging : Engineering perspectivesRecent advances of AI for medical imaging : Engineering perspectives
Recent advances of AI for medical imaging : Engineering perspectives
 
Restricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for AttributionRestricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for Attribution
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
N ns 1
N ns 1N ns 1
N ns 1
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
NITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptxNITW_Improving Deep Neural Networks (1).pptx
NITW_Improving Deep Neural Networks (1).pptx
 
NITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptxNITW_Improving Deep Neural Networks.pptx
NITW_Improving Deep Neural Networks.pptx
 
Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...Background Estimation Using Principal Component Analysis Based on Limited Mem...
Background Estimation Using Principal Component Analysis Based on Limited Mem...
 
Deep Learning for Computer Vision - PyconDE 2017
Deep Learning for Computer Vision - PyconDE 2017Deep Learning for Computer Vision - PyconDE 2017
Deep Learning for Computer Vision - PyconDE 2017
 
Deeplearning
Deeplearning Deeplearning
Deeplearning
 
Deep learning in Computer Vision
Deep learning in Computer VisionDeep learning in Computer Vision
Deep learning in Computer Vision
 
2021 05-04-u2-net
2021 05-04-u2-net2021 05-04-u2-net
2021 05-04-u2-net
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 

More from taeseon ryu

OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...taeseon ryu
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splattingtaeseon ryu
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptxtaeseon ryu
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정taeseon ryu
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdftaeseon ryu
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories taeseon ryu
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extractiontaeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learningtaeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuningtaeseon ryu
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdftaeseon ryu
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdftaeseon ryu
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithmtaeseon ryu
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networkstaeseon ryu
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarizationtaeseon ryu
 

More from taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 

Recently uploaded

Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Sérgio Sacani
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsDanielBaumann11
 
The Sensory Organs, Anatomy and Function
The Sensory Organs, Anatomy and FunctionThe Sensory Organs, Anatomy and Function
The Sensory Organs, Anatomy and FunctionJadeNovelo1
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxGiDMOh
 
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyLAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyChayanika Das
 
Probability.pptx, Types of Probability, UG
Probability.pptx, Types of Probability, UGProbability.pptx, Types of Probability, UG
Probability.pptx, Types of Probability, UGSoniaBajaj10
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGiovaniTrinidad
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learningvschiavoni
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and AnnovaMansi Rastogi
 
Total Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of CannabinoidsTotal Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of CannabinoidsMarkus Roggen
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx201bo007
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfGABYFIORELAMALPARTID1
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxtuking87
 
dll general biology week 1 - Copy.docx
dll general biology   week 1 - Copy.docxdll general biology   week 1 - Copy.docx
dll general biology week 1 - Copy.docxkarenmillo
 

Recently uploaded (20)

Interferons.pptx.
Interferons.pptx.Interferons.pptx.
Interferons.pptx.
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
 

High performance large-scale image recognition without normalization

  • 1. High-Performance Large-Scale Image Recognition Without Normalization. New SOTA validation accuracies on ImageNet by DeepMind. Image team: 김병현, 박동훈, 안종식, 이찬혁, 허다운, 홍은기. Presenter: 박동훈. https://arxiv.org/abs/2102.06171
  • 2. Contents: 1. Performance 2. Batch Normalization: pros and cons 3. Previous Normalizer-Free Networks 4. Proposed Method: Adaptive gradient clipping 5. Experimental Results 6. Conclusion
  • 3. Performance
      • Matches the test accuracy of EfficientNet-B7 on ImageNet while being up to 8.7x faster to train.
      • New state-of-the-art top-1 accuracy of 86.5%.
      • After pre-training on 300 million labeled images and fine-tuning on ImageNet, it achieves 89.2%.
      • Leaderboard: Image Classification on ImageNet [1]
      [1] https://paperswithcode.com/sota/image-classification-on-imagenet
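  • 4. Batch Normalization (background)
      • Internal covariate shift: the distribution of each layer's inputs changes during training [2].
      • Effects of batch normalization: downscales the residual branch; has a regularizing effect; eliminates mean-shift; enables efficient large-batch training.
      • Train: also maintain exponential moving averages of the mean and variance, $\mu \leftarrow (1-\alpha)\mu + \alpha\mu_\beta$ and $\sigma^2 \leftarrow (1-\alpha)\sigma^2 + \alpha\sigma_\beta^2$, where $\mu_\beta$ and $\sigma_\beta^2$ are the mean and variance of the current mini-batch.
      • Test: $\mathrm{BN}_{\text{test}}(x) = \dfrac{x-\mu}{\sqrt{\sigma^2+\epsilon}}$
      [2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture07.pdf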
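The train/test split on this slide can be made concrete with a small sketch. It is a minimal illustration for a single scalar feature, not the reference implementation: the class name `RunningBatchNorm`, the default `alpha`, and the default `eps` are assumptions, and the learnable scale and shift of real batch normalization are left out.

```python
import math


class RunningBatchNorm:
    """Hypothetical single-feature batch norm without learnable scale/shift."""

    def __init__(self, alpha=0.1, eps=1e-5):
        self.alpha = alpha  # moving-average rate (assumed default)
        self.eps = eps      # numerical-stability constant (assumed default)
        self.mu = 0.0       # running mean
        self.var = 1.0      # running variance

    def train_step(self, batch):
        # Statistics of the current mini-batch (mu_beta, sigma_beta^2).
        mu_b = sum(batch) / len(batch)
        var_b = sum((x - mu_b) ** 2 for x in batch) / len(batch)
        # Exponential moving averages, as on the slide.
        self.mu = (1 - self.alpha) * self.mu + self.alpha * mu_b
        self.var = (1 - self.alpha) * self.var + self.alpha * var_b
        # During training the batch itself is normalized with batch statistics.
        return [(x - mu_b) / math.sqrt(var_b + self.eps) for x in batch]

    def test_step(self, x):
        # At inference the running statistics are used instead.
        return (x - self.mu) / math.sqrt(self.var + self.eps)
```

Because `train_step` depends on the other examples in the batch while `test_step` does not, the sketch also shows where the train/inference discrepancy listed on the next slide comes from.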
  • 5. Batch Normalization - Bad
      • It is a surprisingly expensive computational primitive that incurs memory overhead. → Computational overhead.
      • The model behaves differently during training and at inference time. → Training and inference diverge.
      • It breaks the independence between training examples in the mini-batch.
  • 6. Batch Normalization - Bad
      • Batch normalization makes it possible to train residual networks with a large learning rate, but this only helps if the batch size is also large.
      • Batch normalization is often the cause of subtle implementation errors, especially during distributed training (Pham et al., 2019). → According to practitioners, results with batch normalization differed from run to run depending on the hardware, and subtle implementation errors appeared especially in distributed training.
      https://www.youtube.com/watch?v=rNkHjZtH0RQ
  • 8. Previous Normalizer-Free Networks
      • "If our theory is correct, it should be possible to train deep residual networks without normalization, simply by downscaling the residual branch." (De and Smith, 2020)
      • Figure panels: fully connected linear unnormalized residual network; fully connected linear normalized residual network; normalized convolutional residual network. Annotation: variance reduction on the residual branch.
      De, S. and Smith, S. Batch normalization biases residual blocks towards the identity function in deep networks. In NeurIPS, 2020.
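  • 9. Previous Normalizer-Free Networks: "Normalizer-Free ResNets" (NF-ResNets) (Brock et al., 2021)
      • Residual block: $h_{i+1} = h_i + \alpha f_i(h_i / \beta_i)$
      • $h_i$: input to the $i$-th residual block
      • $f_i$ preserves variance: $\mathrm{Var}(f_i(z)) = \mathrm{Var}(z)$
      • $\alpha$ is a scalar multiplied onto the output of the residual branch to reduce its variance, e.g. $\alpha = 0.2$
      • $\beta_i = \sqrt{\mathrm{Var}(h_i)}$, where $\mathrm{Var}(h_{i+1}) = \mathrm{Var}(h_i) + \alpha^2$
      • Figure: the original residual unit proposed by K. He et al., with the branch input scaled by $1/\beta$ and the branch output scaled by $\alpha$.
      Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In ICLR, 2021.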
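Since $\beta_i$ depends only on the predicted variance, it can be computed ahead of training. The following is a minimal sketch of that recurrence; the function name `nf_resnet_betas` and the starting variance `var_in = 1.0` are assumptions made for illustration.

```python
import math


def nf_resnet_betas(num_blocks, alpha=0.2, var_in=1.0):
    """Predict beta_i and Var(h_i) for a stack of residual blocks.

    Assumes each branch f_i preserves variance, so that
    Var(h_{i+1}) = Var(h_i) + alpha**2, as on the slide.
    """
    betas = []
    variances = [var_in]  # variances[i] is the predicted Var(h_i)
    for _ in range(num_blocks):
        var = variances[-1]
        betas.append(math.sqrt(var))        # beta_i = sqrt(Var(h_i))
        variances.append(var + alpha ** 2)  # variance grows by alpha^2 per block
    return betas, variances
```

With `alpha = 0.2` the variance grows by only 0.04 per block, which illustrates why a small $\alpha$ keeps the signal scale under control without any normalization layer.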
  • 10. Previous Normalizer-Free Networks: other components
      • Weight Standardization (Huang et al., 2017; Qiao et al., 2019): $\hat{W}_{ij} = \dfrac{W_{ij}-\mu_i}{\sqrt{N}\,\sigma_i}$, with $\mu_i = \frac{1}{N}\sum_j W_{ij}$ and $\sigma_i^2 = \frac{1}{N}\sum_j (W_{ij}-\mu_i)^2$, where $N$ is the fan-in.
      • Dropout (Srivastava et al., 2014)
      • Stochastic Depth (Huang et al., 2016)
      Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In ICCV, 2017.
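The weight standardization formula can be written out directly. This is a rough sketch on plain nested lists, where each row is assumed to hold the $N$ fan-in weights of one output unit; the function name `standardize_weights` and the small `eps` guard against division by zero are assumptions, not part of the slide.

```python
import math


def standardize_weights(W, eps=1e-8):
    """Return W_hat with W_hat[i][j] = (W[i][j] - mu_i) / (sqrt(N) * sigma_i)."""
    W_hat = []
    for row in W:
        n = len(row)                                  # fan-in N
        mu = sum(row) / n                             # mu_i
        var = sum((w - mu) ** 2 for w in row) / n     # sigma_i^2
        denom = math.sqrt(n * var) + eps              # sqrt(N) * sigma_i (+ assumed eps)
        W_hat.append([(w - mu) / denom for w in row])
    return W_hat
```

After standardization every row has zero mean and unit squared norm, so the scale of the layer output no longer depends on the raw weight magnitudes.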
  • 11. Q & A
  • 12. Gradient Clipping (background)
      • Helps gradient descent behave sensibly when it reaches a steep cliff in the loss surface.
      • Widely used when training RNN-family models.
      • The threshold $\lambda$ is hard to tune.
      • But it enables training with larger batch sizes.
      • If the gradient norm $\|g\|$ exceeds the threshold, the gradient is rescaled so that its norm equals the threshold.
      • Notation: $\hat{g}$: gradient vector; $\epsilon$: loss; $\theta$: parameter vector
      Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In ICML, 2013.
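The rescaling rule described on this slide fits in a few lines. The sketch below treats the gradient as a flat list of floats; the function name `clip_by_norm` is hypothetical.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold (direction unchanged)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    # Gradients already within the threshold pass through untouched.
    return list(grad)
```

For example, `clip_by_norm([3.0, 4.0], 1.0)` shrinks a gradient of norm 5 to norm 1 while keeping its direction, whereas `clip_by_norm([0.3, 0.4], 1.0)` is returned unchanged.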
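  • 13. Adaptive Gradient Clipping
      • Motivated by the observation that the ratio $\|G^l\| / \|W^l\|$ can serve as a unit of how much a training step changes the weights.
      • $W^l \in \mathbb{R}^{N \times M}$: weight matrix of the $l$-th layer
      • $G^l \in \mathbb{R}^{N \times M}$: gradient with respect to $W^l$
      • $\|W^l\|_F = \sqrt{\sum_{i}^{N}\sum_{j}^{M} (W^l_{i,j})^2}$, $\|W^l\| = \max(\|W^l\|_F, \epsilon)$, $\epsilon = 10^{-3}$
      • $\lambda \in \{0.01, 0.02, 0.04, 0.08, 0.16\}$
      • For a gradient-descent step with learning rate $h$: $\Delta W^l = -h\,G^l$, so $\dfrac{\|\Delta W^l\|}{\|W^l\|} = h\,\dfrac{\|G^l\|}{\|W^l\|}$
      • Figure labels: Training [1]; $1/\|W^l\|$; Adaptive.
      [1] https://www.youtube.com/watch?v=o_peo6U7IRM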
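The clipping rule itself is not part of this transcript. The sketch below follows the rule given in the NFNets paper (rescale $G$ to $\lambda \frac{\|W\|}{\|G\|} G$ whenever $\|G\| / \|W\| > \lambda$) and, as a simplifying assumption, applies it to a whole layer using the Frobenius norm defined on the slide, whereas the paper applies it unit-wise. The function names `frobenius_norm` and `adaptive_clip` are hypothetical.

```python
import math


def frobenius_norm(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))


def adaptive_clip(W, G, lam=0.01, eps=1e-3):
    """Clip gradient G relative to the size of the weights W it belongs to."""
    w_norm = max(frobenius_norm(W), eps)  # eps keeps zero-initialized weights trainable
    g_norm = frobenius_norm(G)
    if g_norm / w_norm > lam:
        # Rescale so that ||G|| / ||W|| equals lam exactly.
        scale = lam * w_norm / g_norm
        return [[g * scale for g in row] for row in G]
    # Small relative updates are left as they are.
    return [row[:] for row in G]
```

Unlike plain gradient clipping, the effective threshold here is `lam * w_norm`, so layers with large weights tolerate proportionally larger gradients.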
  • 15. Model Detail
      • Code: https://github.com/deepmind/deepmind-research/tree/master/nfnets
      • Training details:
          • Softmax cross-entropy loss with label smoothing of 0.1
          • Stochastic gradient descent with Nesterov momentum 0.9
          • Weight decay coefficient $2 \times 10^{-5}$
          • Dropout and Stochastic Depth (0.25)
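As an illustration of the first training detail, here is a minimal sketch of softmax cross-entropy with label smoothing. It assumes the common convention in which the target distribution is $(1-s)$ times the one-hot label plus $s/K$ spread over all $K$ classes; the slide does not say which convention the authors use, and the function name `smoothed_cross_entropy` is hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy between a smoothed one-hot target and softmax(logits)."""
    k = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Assumed convention: (1 - s) on the true class plus s / K on every class.
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * lp
    return loss
```

With `smoothing=0.0` this reduces to the ordinary cross-entropy, and with `smoothing=0.1` an over-confident correct prediction is penalized slightly more, which is the intended regularizing effect.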
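  • 16. Conclusion
      • The first model that, without batch normalization, outperforms batch-normalized models when trained with large batch sizes.
      • Reaches accuracy comparable to batch-normalized models while training faster.
      • Introduces a family of models built with AGC.
      • Shows that normalizer-free models achieve even better performance when fine-tuned (after pre-training on a dataset such as ImageNet).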
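  • 17. Personal thoughts
      • This team has done a lot of research on training without normalization.
      • They wanted faster training in order to validate models more quickly.
      • They tried training without normalization, but performance got worse once the batch size was increased.
      • The result came out of experimenting with several ideas.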