SWITCH TRANSFORMERS: SCALING TO
TRILLION PARAMETER MODELS WITH
SIMPLE AND EFFICIENT SPARSITY
NLP Team: 박희수 (presenter), 백지윤, 진명훈
Motivation
Just how much energy does it take to train a single NLP model?
Energy and Policy Considerations for Deep Learning in NLP
Training a single Transformer model via NAS makes it a major culprit of global warming…
Motivation
And yet NLP model sizes keep growing…!
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
Motivation
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Sparsely-activated transformers based on Mixture of Experts (MoE)
MoE has three problems that keep it from being widely used:
(1) complexity,
(2) communication costs, and
(3) training instabilities.
Let's build efficient sparsely-activated transformers that solve these three problems!
What is the Mixture of Experts (MoE)?
Two ImageNet images from the same class
Needs the ability to recognize the dog from seeing only a part of it
Needs the ability to find the dog among the background and various other objects
Too much load!
What is the Mixture of Experts (MoE)?
Two ImageNet images from the same class
Background-separation expert
Object-detection expert
Dog-part-recognition expert
MoE Layer
Two ImageNet images from the same class
$h(x) = W_r \cdot x \;\rightarrow\; p = \mathrm{softmax}(h(x)) \;\rightarrow\; \text{select top-}k$
$y = \sum_{i \in \text{top-}k} p_i(x) \cdot E_i(x)$
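To make the gate concrete, here is a minimal PyTorch sketch of the top-k MoE layer defined above (an illustration, not the paper's code; the sizes and expert networks are hypothetical):

```python
import torch
import torch.nn.functional as F

def moe_layer(x, W_r, experts, k=2):
    """x: [d_model] token vector; W_r: [num_experts, d_model]; experts: list of nets."""
    h = W_r @ x                                   # h(x) = W_r . x  (router logits)
    p = F.softmax(h, dim=-1)                      # p = softmax(h(x))
    topk_p, topk_i = torch.topk(p, k)             # select top-k experts
    # y = sum over the selected experts of p_i(x) * E_i(x)
    return sum(w * experts[i](x)
               for w, i in zip(topk_p.tolist(), topk_i.tolist()))

# Hypothetical sizes for illustration
d_model, num_experts = 8, 4
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, 16), torch.nn.ReLU(),
                               torch.nn.Linear(16, d_model))
           for _ in range(num_experts)]
y = moe_layer(torch.randn(d_model), torch.randn(num_experts, d_model), experts)
```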
MoE for Transformer
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
In the Transformer, MoE is applied only to the feed-forward network
Routing is applied per token
Any Questions?
Basic idea for Switch Transformer
Select only a single expert!
1. Using a single expert reduces the router computation (top-k → top-1).
2. Selecting k experts requires replicating the data k times, while selecting one expert does not, so the batch size per expert is effectively reduced compared to standard MoE.
3. Routing is followed by communication across devices, and selecting a single expert reduces this communication cost.
k = 1 routing strategy
(1) Distributed Switch Implementation.
$$\text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}$$
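Below is a hedged sketch of the capacity formula and top-1 dispatch (an illustration under the paper's definitions, not its actual Mesh-TensorFlow implementation); tokens that overflow an expert's capacity are dropped and flow through the residual connection:

```python
import torch
import torch.nn.functional as F

def top1_dispatch(router_logits, capacity_factor=1.0):
    """router_logits: [tokens_per_batch, num_experts]."""
    tokens, num_experts = router_logits.shape
    expert_capacity = int(tokens / num_experts * capacity_factor)
    probs = F.softmax(router_logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)               # top-1 expert per token
    mask = F.one_hot(expert_idx, num_experts)          # [tokens, num_experts]
    position_in_expert = (mask.cumsum(dim=0) - 1) * mask
    kept = position_in_expert.sum(dim=-1) < expert_capacity  # overflow -> dropped
    return gate, expert_idx, kept

# 16 tokens routed over 4 experts -> capacity = 16/4 * 1.25 = 5 tokens/expert
gate, idx, kept = top1_dispatch(torch.randn(16, 4), capacity_factor=1.25)
```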
k = 1 routing strategy
(2) A Differentiable Load Balancing Loss.
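The paper's auxiliary loss is $\alpha \cdot N \cdot \sum_i f_i \cdot P_i$, where $f_i$ is the fraction of tokens dispatched to expert $i$, $P_i$ is the mean router probability assigned to expert $i$, and $\alpha = 10^{-2}$. A minimal sketch, assuming the top-1 routing above:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, alpha=1e-2):
    """alpha * N * sum_i f_i * P_i; differentiable through P_i."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                     # [tokens, N]
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)   # dispatch fraction
    P = probs.mean(dim=0)                                        # mean router prob
    return alpha * num_experts * torch.sum(f * P)
```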
Benchmarking Switch versus MoE
a quality threshold of Neg. Log Perp. = −1.495.
Improved Training and Fine-Tuning Techniques
(1) Selective Precision
Float32 precision is only used within the body of the
router function – on computations local to that device.
(Diagram: the router body runs in float32; inputs and outputs on either side remain in bfloat16.)
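A sketch of the selective-precision trick (the exact cast points are an assumption): run the router body in float32 locally, then cast back to bfloat16 before anything is communicated across devices:

```python
import torch
import torch.nn.functional as F

def router(x, W_r):
    """x: [tokens, d_model] in bfloat16; W_r: [d_model, num_experts]."""
    logits = x.float() @ W_r.float()            # router math in float32
    probs = F.softmax(logits, dim=-1)           # exponentiation stays stable
    gate, expert_idx = probs.max(dim=-1)
    return gate.to(torch.bfloat16), expert_idx  # cast back before dispatch
```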
Improved Training and Fine-Tuning Techniques
(1) Selective Precision
Improved Training and Fine-Tuning Techniques
(2) Selective Dropout
Optimal performance is obtained when the expert dropout rate is balanced appropriately against the standard dropout rate.
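A minimal sketch of this "selective dropout" (the 0.1/0.4 rates are the best setting reported in the paper's fine-tuning experiments, stated here as an assumption):

```python
import torch.nn as nn

dropout = nn.Dropout(p=0.1)         # standard dropout at non-expert layers
expert_dropout = nn.Dropout(p=0.4)  # higher rate inside expert feed-forward layers
```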
Improved Training and Fine-Tuning Techniques
(3) A Better Initialization
Initializing from a truncated normal distribution and multiplying by 0.1 gave the smallest variance across runs → stable results
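A sketch of the scaled initialization (assuming the paper's truncated normal with standard deviation $\sqrt{s/\text{fan-in}}$, with the scale $s$ reduced from the default 1.0 to 0.1):

```python
import math
import torch

def scaled_trunc_normal_(weight, s=0.1):
    """Truncated normal with std = sqrt(s / fan_in), truncated at +/- 2 std."""
    fan_in = weight.shape[0]
    std = math.sqrt(s / fan_in)
    torch.nn.init.trunc_normal_(weight, mean=0.0, std=std, a=-2 * std, b=2 * std)

W = torch.empty(1024, 4096)
scaled_trunc_normal_(W)   # s = 0.1 instead of 1.0 -> smaller, stabler weights
```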
Any Questions?
Scaling Properties
(1) Results on a step basis (total training steps fixed)
• Comparison with all other variables, such as FLOPs, held constant
• Results: (1) performance improves as the number of experts increases; (2) Switch reaches in only 60K steps the quality the T5 base model reaches at 450K steps
Scaling Properties
(2) Results on a time basis (total training time fixed)
• T5 and Switch compared again under limited time and memory budgets
• Result: reaches the same quality 7× faster than T5
Scaling Properties
(3) Scaling vs. a Large Dense Model (total parameter count fixed)
• T5-large performs 3.5× more computation per token
• Result 1: reaches the same quality 7× faster than T5-base
• Result 2: reaches the same quality 2.5× faster than T5-large
Any Questions?
Fine-tuning
(1) Baseline and Switch models used for fine-tuning
Fine-tuning
(2) Fine-tuning tasks and datasets.
Distillation
(1) Distillation techniques.
Distillation
(2) Achievable compression rates.
Distillation
(3) Distilling a fine-tuned model.
Multilingual Learning
Designing models with data, model, and expert-parallelism
Data parallelism: This has the advantage that no communication is needed until the entire forward and backward pass is finished, at which point the gradients are aggregated across all cores.
(Diagram: the FFN computation — input $x$ of shape $[B, d_{model}]$; intermediate $h = xW_{in}$ of shape $[B, d_{ff}]$, with $d_{ff} \gg d_{model}$; output $y = \mathrm{ReLU}(h)\,W_{out}$ of shape $[B, d_{model}]$. Partitioning: $n = N$, $m = 1$.)
Designing models with data, model, and expert-parallelism
Model parallelism: All cores must keep the full $B$ tokens, and each core holds a unique slice of the weights. A communication cost is now incurred on every forward and backward pass.
(Diagram: same FFN computation; partitioning $n = 1$, $m = N$.)
Designing models with data, model, and expert-parallelism
Model and data parallelism: Each core is responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and the intermediate activation.
(Diagram: same FFN computation; partitioning $n = 4$, $m = 4$.)
Designing models with data, model, and expert-parallelism
Expert and data parallelism: Each core is responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and the intermediate activation.
(Diagram: same FFN computation, with one expert per core; partitioning $n = N$, $m = 1$.)
Designing models with data, model, and expert-parallelism
Expert, data, and model parallelism: Each core is responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and the intermediate activation.
(Diagram: same FFN computation; partitioning $n = 4$ (= number of experts), $m = 4$.)
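To summarize the four partitioning layouts above, here is a back-of-the-envelope sketch (plain Python, not the paper's Mesh-TensorFlow code) of the per-core tensor shapes; $n$ is the number of data-parallel ways and $m$ the number of model-parallel ways, with $n \times m = N$ total cores:

```python
def shard_shapes(B, d_model, d_ff, n, m):
    """Per-core shapes for y = ReLU(x @ W_in) @ W_out."""
    return {
        "x":     (B // n, d_model),      # tokens are split across n cores
        "W_in":  (d_model, d_ff // m),   # hidden dimension split across m cores
        "h":     (B // n, d_ff // m),    # intermediate activation
        "W_out": (d_ff // m, d_model),
        "y":     (B // n, d_model),
    }

B, d_model, d_ff, N = 4096, 1024, 16384, 16  # hypothetical sizes
print("data parallel:  ", shard_shapes(B, d_model, d_ff, n=N, m=1))
print("model parallel: ", shard_shapes(B, d_model, d_ff, n=1, m=N))
print("data + model:   ", shard_shapes(B, d_model, d_ff, n=4, m=4))
# Expert parallelism reuses the n = N, m = 1 shapes, but each core holds a
# *different* expert's W_in / W_out rather than a replica of the same weights.
```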
Designing models with data, model, and expert-parallelism
Sample efficiency versus T5-XXL: The gap continues to increase with additional training, with the Switch-XXL model outperforming T5-XXL by 0.087 by 500k steps.
Training instability: We find that the larger Switch-C model, with 1.6T parameters and 2048 experts, exhibits no training instability at all. In contrast, the Switch-XXL version, with nearly 10× larger FLOPs per sequence, is sometimes unstable.
DISCUSSION
• Isn't Switch Transformer better due to sheer parameter count?
Yes, and by design! Parameters, independent of the total FLOPs used, are a useful axis along which to scale neural language models.
• I don't have access to a supercomputer – is this still useful for me?
Though this work has focused on extremely large models, models with as few as two experts improve performance while easily fitting within the memory constraints of commonly available GPUs or TPUs.
• Do sparse models outperform dense models on the speed-accuracy Pareto curve?
Yes. Across a wide variety of model sizes, sparse models outperform dense models both per step and in wall-clock time.
• I can't deploy a trillion-parameter model – can we shrink these models?
We cannot fully preserve model quality, but compression rates of 10 to 100× are achievable by distilling the sparse models into dense models while retaining ≈30% of the quality gain of the expert model.
Any Questions?
  • 35. • Isn’t Switch Transformer better due to sheer parameter count? Yes, and by design! Parameters, independent of the total FLOPs used, are a useful axis to scale neural language models. • I don’t have access to a supercomputer – is this still useful for me? Though this work has focused on extremely large models, we also find that models with as few as two experts improves performance while easily fitting within memory constraints of commonly available GPUs or TPUs • Do sparse models outperform dense models on the speed-accuracy pareto curve? Yes. Across a wide variety of different model’s sizes, sparse models outperform dense models per step and on wall clock time. • I can’t deploy a trillion parameters model – can we shrink these models? We cannot fully preserve the model quality, but compression rates of 10 to 100x are achievable by distilling our sparse models into dense models while achieving ≈30% of the quality gain of the expert model. DISCUSSION