High-Performance Large-Scale Image
Recognition Without Normalization
New SOTA validation accuracies on ImageNet by DeepMind
Image team: 김병현, 박동훈, 안종식, 이찬혁, 홍은기
Presenter: 박동훈
https://arxiv.org/abs/2102.06171
Contents
1. Performance
2. Batch Normalization: pros and cons
3. Previous Normalizer-Free Networks
4. Proposed Method: Adaptive Gradient Clipping
5. Experimental Results
6. Conclusion
Performance
• Matches EfficientNet-B7's top-1 accuracy on ImageNet while being 8.7× faster to train
• New state-of-the-art top-1 accuracy of 86.5%
- After pre-training on 300 million labeled images and fine-tuning on ImageNet, it achieves 89.2%
• Image Classification on ImageNet[1]
[1] https://paperswithcode.com/sota/image-classification-on-imagenet
Batch Normalization (previous knowledge)
• The change in the distribution of each layer's inputs causes 'internal covariate shift'[2]
Effects of Batch Normalization:
· Downscales the residual branch
· Regularizing effect
· Eliminates mean-shift
· Enables efficient large-batch training
[2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture07.pdf
Train
Also track exponential moving averages of the batch mean and variance:
μ ← (1 − α) μ + α μ_B
σ² ← (1 − α) σ² + α σ_B²
Test
BN_test(x) = (x − μ) / √(σ² + ε)
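A minimal NumPy sketch of this train/test split (the class name, feature layout, and EMA rate α = 0.1 are illustrative assumptions, not from the slides): training normalizes with batch statistics while updating running averages, and inference reuses those running averages.

```python
import numpy as np

class BatchNormSketch:
    """Per-feature batch-norm statistics; x has shape (batch, features)."""
    def __init__(self, dim, alpha=0.1, eps=1e-5):
        self.mu = np.zeros(dim)    # running mean (EMA)
        self.var = np.ones(dim)    # running variance (EMA)
        self.alpha, self.eps = alpha, eps

    def train_step(self, x):
        mu_b, var_b = x.mean(axis=0), x.var(axis=0)  # batch statistics
        # EMA updates: mu <- (1 - alpha) mu + alpha mu_B, same for variance
        self.mu = (1 - self.alpha) * self.mu + self.alpha * mu_b
        self.var = (1 - self.alpha) * self.var + self.alpha * var_b
        return (x - mu_b) / np.sqrt(var_b + self.eps)  # uses *batch* stats

    def test_step(self, x):
        # uses *running* stats -> the train/inference discrepancy noted below
        return (x - self.mu) / np.sqrt(self.var + self.eps)
```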
Batch Normalization - Bad
• First, it is a surprisingly expensive computational primitive that incurs memory overhead
→ computational overhead
• Discrepancy between the behavior of the model during training and at inference time
→ training and inference behave differently
• Breaks the independence between training examples in the minibatch
→ breaks the independence of examples within the mini-batch
Batch Normalization - Bad
• Lets residual networks train with large learning rates, but this only helps when the batch size is also large.
→ batch normalization enables large learning rates, but pays off only with large batch sizes
• Batch normalization is often the cause of subtle implementation errors, especially during distributed training (Pham et al., 2019)
→ (according to practitioners) batch-normalized results varied across hardware, and subtle implementation errors arose, especially in distributed training
https://www.youtube.com/watch?v=rNkHjZtH0RQ
Q & A
Previous Normalizer-Free Networks
De, S. and Smith, S. Batch normalization biases residual blocks towards the identity function in deep networks. In NeurIPS, 2020.
"If our theory is correct, it should be possible to train deep residual networks without normalization, simply by downscaling the residual branch."
[Figure: per-depth signal variance in a fully connected linear unnormalized residual network, a fully connected linear normalized residual network, and a normalized convolutional residual network; normalization reduces the variance on the residual branch]
"Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks”
(In NIPS 2020)
Previous Normalizer-Free Networks
• Residual block: hᵢ₊₁ = hᵢ + α fᵢ(hᵢ / βᵢ)
• hᵢ: input to the i-th residual block
• The branch is variance-preserving: Var(fᵢ(z)) = Var(z)
• α scales the residual branch output to limit the variance growth, e.g. α = 0.2
• βᵢ = √Var(hᵢ), where the variance evolves as Var(hᵢ₊₁) = Var(hᵢ) + α²
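A minimal PyTorch sketch of this update rule (the class name and the Identity stand-in for the branch are illustrative assumptions; the real NF-ResNet branches are convolutional blocks):

```python
import torch
import torch.nn as nn

class NFResidualBlock(nn.Module):
    """h_{i+1} = h_i + alpha * f_i(h_i / beta_i), with beta tracked analytically."""
    def __init__(self, branch: nn.Module, alpha: float, input_var: float):
        super().__init__()
        self.branch = branch              # variance-preserving f_i
        self.alpha = alpha
        self.beta = input_var ** 0.5      # beta_i = sqrt(Var(h_i))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.alpha * self.branch(h / self.beta)

# Stacking blocks: the analytic variance grows by alpha^2 per block.
alpha, var = 0.2, 1.0
blocks = []
for _ in range(4):
    blocks.append(NFResidualBlock(nn.Identity(), alpha, var))
    var += alpha ** 2                     # Var(h_{i+1}) = Var(h_i) + alpha^2
model = nn.Sequential(*blocks)
```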
[Figure: the original residual unit proposed by K. He et al. next to the "Normalizer-Free ResNet" (NF-ResNet) block (Brock et al., 2021), which scales the branch input by 1/β and the branch output by α]
Brock, A., De, S., and Smith, S. L. Characterizing signal propagation to close the performance gap in unnormalized resnets. In ICLR, 2021
Previous Normalizer-Free Networks
Other techniques:
• Weight Standardization (Huang et al., 2017; Qiao et al., 2019), sketched after this list:
Ŵᵢⱼ = (Wᵢⱼ − μᵢ) / (√N σᵢ), where μᵢ = (1/N) Σⱼ Wᵢⱼ and σᵢ² = (1/N) Σⱼ (Wᵢⱼ − μᵢ)²
• Dropout (Srivastava et al., 2014)
• Stochastic Depth (Huang et al., 2016)
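A minimal sketch of the weight-standardization formula above, assuming W has shape (output units, fan-in N); the small eps inside the square root is added here for numerical stability and is not part of the slide's formula:

```python
import torch

def standardize_weight(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """W_hat[i, j] = (W[i, j] - mu_i) / (sqrt(N) * sigma_i), stats over fan-in."""
    n = w.shape[1]                                      # fan-in N
    mu = w.mean(dim=1, keepdim=True)                    # mu_i
    var = w.var(dim=1, unbiased=False, keepdim=True)    # sigma_i^2
    return (w - mu) / torch.sqrt(n * var + eps)         # sqrt(N) * sigma_i

w_hat = standardize_weight(torch.randn(64, 128))
```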
Huang, L., Liu, X., Liu, Y., Lang, B., and Tao, D. Centered weight normalization in accelerating training of deep neural networks. In ICCV 2017.
Q & A
Gradient Clipping (previous knowledge)
• Helps gradient descent behave reasonably on steep cliffs of the loss surface.
• Widely used when training RNN-family models.
• The clipping threshold λ is hard to tune.
• But it enables training with larger batch sizes.
If the gradient norm ‖g‖ exceeds the threshold, the gradient is rescaled to that norm:
g ← λ g / ‖g‖, where g = ∂ε/∂θ is the gradient of the loss ε with respect to the parameter vector θ.
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In ICML, 2013.
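In PyTorch this rule is available as torch.nn.utils.clip_grad_norm_; a toy usage sketch (the model, data, and threshold value are placeholders):

```python
import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = F.mse_loss(model(x), y)
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)  # g <- λ g / ‖g‖ if ‖g‖ > λ
opt.step()
```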
Adaptive Gradient Clipping
• Inspired by the observation that the ratio ‖Gˡ‖ / ‖Wˡ‖ measures how much a single optimization step changes layer l's weights[1]: for plain SGD, ΔWˡ = −h Gˡ, so ‖ΔWˡ‖ / ‖Wˡ‖ = h ‖Gˡ‖ / ‖Wˡ‖ (h: learning rate).
• Wˡ ∈ ℝ^(N×M): weight matrix of the l-th layer
• Gˡ ∈ ℝ^(N×M): gradient corresponding to Wˡ
• ‖Wˡ‖_F = √(Σᵢ Σⱼ (Wˡᵢⱼ)²), and the clipping rule uses ‖Wˡ‖ = max(‖Wˡ‖_F, ε) with ε = 10⁻³
• λ is swept over [0.01, 0.02, 0.04, 0.08, 0.16]
• The gradient is clipped whenever ‖Gˡ‖ / ‖Wˡ‖ > λ, by rescaling it by λ ‖Wˡ‖ / ‖Gˡ‖; unlike standard clipping, the effective threshold adapts to the weight norm through the 1/‖Wˡ‖ factor.
[1] https://www.youtube.com/watch?v=o_peo6U7IRM
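A minimal PyTorch sketch of AGC under these definitions (the function name is an assumption; treating every parameter with ≥ 2 dimensions uniformly, with one "unit" per output row, is a simplification of the paper's unit-wise clipping):

```python
import torch

def adaptive_grad_clip(params, lam: float = 0.02, eps: float = 1e-3):
    """Rescale each unit's gradient so that ||G_i|| / max(||W_i||, eps) <= lam."""
    for p in params:
        if p.grad is None or p.ndim < 2:
            continue                                   # skip biases/gains
        w_norm = p.detach().flatten(1).norm(dim=1).clamp_min(eps)  # max(||W_i||, eps)
        g_norm = p.grad.detach().flatten(1).norm(dim=1)            # ||G_i||
        # where the ratio exceeds lam, scale by lam * ||W_i|| / ||G_i||; else keep
        scale = torch.where(g_norm > lam * w_norm,
                            lam * w_norm / g_norm.clamp_min(1e-6),
                            torch.ones_like(g_norm))
        p.grad.mul_(scale.view(-1, *([1] * (p.ndim - 1))))

# usage: call between loss.backward() and optimizer.step()
```

The paper reports it is better not to apply AGC to the final linear layer.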
Experimental Result
Model Detail
https://github.com/deepmind/deepmind-research/tree/master/nfnets
[Training Detail]
• Softmax cross-entropy loss with label smoothing of 0.1
• Stochastic gradient descent with Nesterov momentum 0.9
• Weight decay coefficient 2 × 10⁻⁵
• Dropout and Stochastic Depth (rate 0.25)
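These settings map onto PyTorch as follows (a sketch: the model and learning rate are placeholders, and the official implementation linked above is JAX, not PyTorch):

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 1000)                          # stand-in for an NFNet
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # softmax CE + smoothing
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, nesterov=True,
                            weight_decay=2e-5)
```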
Conclusion
• The first models to beat the performance of batch-normalized networks when trained at large batch sizes, without using batch normalization.
• They train faster than batch-normalized models while matching their performance.
• Built a family of models using the AGC technique.
• Showed that normalizer-free models actually achieve better performance when fine-tuned (after pre-training, e.g., on ImageNet-scale data).
Personal thoughts
• This team has done a lot of research on training without normalization.
• They want faster training so they can validate models more quickly.
• They tried training without normalization, but performance actually got worse as the batch size grew.
• This result came out of experimenting with several ideas to address that.
Thank you