SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY
NLP Team: 박희수 (presenter), 백지윤, 진명훈
Motivation
How much energy does training a single NLP model actually consume?
Energy and Policy Considerations for Deep Learning in NLP
Training a single Transformer model via neural architecture search (NAS) makes it a prime culprit of global warming…
Motivation
And yet NLP model sizes keep growing…!
http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__High_Performance_NLP.pdf
Motivation
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Sparsely-activated transformers based on Mixture of Experts (MoE)
MoE has not been widely adopted because of three problems:
(1) complexity,
(2) communication costs, and
(3) training instabilities.
Goal: build efficient sparsely-activated transformers that solve all three!
What is the Mixture of Experts (MoE)?
Two ImageNet images from the same class
One image calls for recognizing the dog from just a part of it;
the other calls for picking the dog out from the background and several other objects.
Too much load!
What is the Mixture of Experts (MoE)?
Two ImageNet images from the same class
Background-separation expert
Object-detection expert
Dog-part-recognition expert
MoE Layer
Two ImageNet images from the same class
$h(x) = W_r \cdot x \;\rightarrow\; p = \mathrm{softmax}(h(x)) \;\rightarrow\; \text{select top-}k$
$y = \sum_{i \in \text{top-}k} p_i(x) \cdot E_i(x)$
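To make the gating concrete, here is a minimal NumPy sketch of this computation; the names (`moe_layer`, `W_r`, `experts`) and the toy linear experts are illustrative, not from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, W_r, experts, k=2):
    """Mix the outputs of the top-k experts for one token.

    x:       [d_model] token representation
    W_r:     [num_experts, d_model] router weights
    experts: list of callables, experts[i](x) -> [d_model]
    """
    h = W_r @ x                    # router logits, h(x) = W_r . x
    p = softmax(h)                 # gate probabilities
    top_k = np.argsort(p)[-k:]     # indices of the k largest gates
    return sum(p[i] * experts[i](x) for i in top_k)

# toy usage: 4 linear experts on an 8-dim token
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((8, 8)): W @ x for _ in range(4)]
y = moe_layer(rng.standard_normal(8), rng.standard_normal((4, 8)), experts)
```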
MoE for Transformer
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
In the Transformer, MoE is applied only to the feed-forward network (FFN) sublayer.
Routing is applied per token.
Any Questions?
Basic idea for Switch Transformer
Select only one expert!
1. A single expert reduces the router computation (top-k → top-1).
2. Selecting k experts requires copying each token to k experts; with a single expert no copying is needed, so the effective batch size shrinks compared to standard MoE.
3. Routing is followed by communication between devices; selecting only one expert reduces this communication cost as well.
(A toy routing sketch appears after the capacity formula below.)
$\text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}$
k = 1 routing strategy
(1) Distributed Switch Implementation.
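As a sketch of what the distributed implementation must enforce, the toy router below applies top-1 routing under the capacity formula above and drops tokens that exceed an expert's capacity (in the real model such tokens simply pass through the residual connection). All names are illustrative:

```python
import numpy as np

def switch_route(router_probs, num_experts, capacity_factor=1.0):
    """Top-1 routing under a fixed per-expert capacity.

    router_probs: [tokens, num_experts] router softmax outputs
    Returns (expert_index, keep_mask); tokens with keep_mask == False
    exceeded their expert's capacity and are dropped.
    """
    tokens = router_probs.shape[0]
    # expert capacity = tokens per batch / number of experts * capacity factor
    capacity = int(tokens / num_experts * capacity_factor)
    expert_index = router_probs.argmax(axis=-1)    # top-1 expert per token
    keep = np.zeros(tokens, dtype=bool)
    counts = np.zeros(num_experts, dtype=int)
    for t, e in enumerate(expert_index):           # first come, first served
        if counts[e] < capacity:
            keep[t] = True
            counts[e] += 1
    return expert_index, keep
```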
k = 1 routing strategy
(2) A Differentiable Load Balancing Loss.
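The paper's auxiliary loss is $\text{loss} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$, where $f_i$ is the fraction of tokens dispatched to expert $i$, $P_i$ is the mean router probability assigned to expert $i$, and $\alpha \approx 10^{-2}$; it is minimized when routing is uniform. A minimal NumPy sketch (names illustrative):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, num_experts, alpha=1e-2):
    """Auxiliary loss = alpha * N * sum_i f_i * P_i.

    router_probs: [tokens, num_experts] router softmax outputs
    expert_index: [tokens] top-1 expert chosen per token
    f_i comes from hard assignments (no gradient); the gradient flows
    through P_i, which is what makes the loss differentiable in practice.
    """
    f = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    P = router_probs.mean(axis=0)   # mean router probability per expert
    return alpha * num_experts * np.sum(f * P)
```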
Benchmarking Switch versus MoE
Models are compared at a fixed quality threshold of Neg. Log Perp. = -1.495.
(1) Selective Precision
Improved Training and Fine-Tuning Techniques
Float32 precision is only used within the body of the
router function – on computations local to that device.
[Figure: inputs arrive in bfloat16, the router body runs in float32, and its outputs are cast back to bfloat16.]
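A sketch of this cast discipline; NumPy has no bfloat16, so float16 stands in for the low-precision dtype here, and the function name is illustrative:

```python
import numpy as np

def router_probs_selective(x_low, W_r, low_dtype=np.float16):
    """Compute router probabilities in float32, cast back to low precision.

    x_low: [tokens, d_model] activations in the model's low precision.
    The float32 region is local to each device, so nothing is
    communicated at full precision.
    """
    h = x_low.astype(np.float32) @ W_r.astype(np.float32)  # float32 logits
    h -= h.max(axis=-1, keepdims=True)
    p = np.exp(h)
    p /= p.sum(axis=-1, keepdims=True)   # stable softmax, still float32
    return p.astype(low_dtype)           # rest of the model stays low precision
```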
(1) Selective Precision
Improved Training and Fine-Tuning Techniques
Improved Training and Fine-Tuning Techniques
(2) Selective Dropout
Optimal performance is obtained when the expert dropout rate and the standard dropout rate are balanced appropriately.
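A sketch of the two-rate setup; the 0.1/0.4 pair reflects the combination the paper reports as best in its fine-tuning ablation, and the tensor names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate):
    """Inverted dropout: zero a fraction `rate`, rescale the survivors."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

hidden = rng.standard_normal((4, 8))          # a non-expert activation
expert_hidden = rng.standard_normal((4, 32))  # an activation inside an expert

h = dropout(hidden, rate=0.1)         # small dropout everywhere else
e = dropout(expert_hidden, rate=0.4)  # larger dropout inside the experts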
Improved Training and Fine-Tuning Techniques
(3) A Better Initialization
Multiplying values initialized from a truncated normal distribution by 0.1 gave the smallest variance across runs → stable results.
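In the paper the weights are drawn from a truncated normal with std $\sqrt{s/n}$ ($n$ = fan-in), and the default scale $s = 1.0$ is reduced tenfold to $s = 0.1$. A sketch under that reading, with truncation approximated by clipping:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal_init(shape, fan_in, scale=0.1):
    """Weights ~ truncated normal with std = sqrt(scale / fan_in).

    scale=0.1 is the 10x-reduced setting the slide describes as giving
    stable results. Truncation is approximated here by clipping at
    +/- 2 std (a proper version would resample out-of-range draws).
    """
    std = np.sqrt(scale / fan_in)
    w = rng.standard_normal(shape) * std
    return np.clip(w, -2 * std, 2 * std)

W_in = truncated_normal_init((512, 2048), fan_in=512)
```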
Any Questions?
• Comparison with all other variables (e.g., FLOPs) held equal
• Results: (1) quality improves as the number of experts grows; (2) the quality the T5-base model reaches at 450K steps is reached in only 60K steps
Scaling Properties
(1) Results on a step-basis (total training steps fixed)
• T5 and Switch compared again under fixed time and memory budgets
• Result: reaches the same quality 7x faster than T5
Scaling Properties
(2) Results on a time-basis (total training time fixed)
• T5-large spends 3.5x more compute per token
• Result 1: reaches target quality 7x faster than T5-base
• Result 2: reaches target quality 2.5x faster than T5-large
Scaling Properties
(3) Scaling vs. A Large Dense Model (total parameter count fixed)
Any Questions?
Fine-tuning
(1) Baseline and Switch models used for fine-tuning
Fine-tuning
(2) Fine-tuning tasks and datasets.
Distillation
(1) Distillation techniques.
Distillation
(2) Achievable compression rates.
Distillation
(3) Distilling a fine-tuned model.
Multilingual Learning
[Figure: the FFN block being partitioned. Input $x$ is $B \times d_{model}$; intermediate $h = xW_{in}$ is $B \times d_{ff}$; output $y = \mathrm{ReLU}(h)W_{out}$ is $B \times d_{model}$, with $d_{ff} \gg d_{model}$. Data parallelism: $n = N$, $m = 1$.]
Designing models with data, model, and expert-parallelism
Data parallelism: This has the advantage that no communication is needed until the entire forward and backward pass is finished, at which point the gradients are aggregated across all cores.
Model parallelism: All cores must keep the full B tokens and each core will contain a
unique slice of the weights. For each forward and backward pass, a communication cost is
now incurred.
[Figure: same FFN, $y = \mathrm{ReLU}(xW_{in})W_{out}$ with $d_{ff} \gg d_{model}$. Model parallelism: $n = 1$, $m = N$ (weights split along $d_{ff}$).]
Designing models with data, model, and expert-parallelism
Model and data parallelism: Each core will be responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and intermediate activation.
[Figure: same FFN. Model + data parallelism: $n = 4$, $m = 4$ (tokens split over $n$, weights and activations split over $m$).]
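The three layouts above differ only in how the batch and the weight matrices are cut; a shape-only sketch (sizes illustrative):

```python
# Per-core tensor shapes for the FFN y = ReLU(x @ W_in) @ W_out on an
# N = n * m core mesh (n = data axis, m = model axis).
B, d_model, d_ff = 1024, 512, 2048

def per_core_shapes(n, m):
    return {
        "x":     (B // n, d_model),     # each core holds B/n tokens
        "W_in":  (d_model, d_ff // m),  # each core holds d_ff/m of W_in
        "h":     (B // n, d_ff // m),   # intermediate activation shard
        "W_out": (d_ff // m, d_model),
    }

print(per_core_shapes(n=16, m=1))  # data parallelism (N = n)
print(per_core_shapes(n=1, m=16))  # model parallelism (N = m)
print(per_core_shapes(n=4, m=4))   # model + data parallelism
```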
Designing models with data, model, and expert-parallelism
Expert and data parallelism: Each core will be responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and intermediate activation.
[Figure: same FFN, now one expert per core. Expert + data parallelism: $n = N$, $m = 1$ (each core holds the full weights of one expert).]
Designing models with data, model, and expert-parallelism
Expert, data, and model parallelism: Each core will be responsible for $B/n$ tokens and $d_{ff}/m$ of both the weights and intermediate activation.
[Figure: same FFN. Expert + data + model parallelism: $n$ = number of experts, $m = 4$ (each expert's weights are themselves split over $m$ cores).]
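In the same shape notation, the two expert layouts look as follows (sizes illustrative; the all-to-all note reflects how MoE dispatch is typically implemented):

```python
# Expert + data parallelism (n = N, m = 1): each core stores the full FFN
# weights of exactly one expert plus B/n local tokens; routed tokens reach
# their expert's core through an all-to-all exchange.
B, d_model, d_ff, N = 1024, 512, 2048, 16

expert_data = {
    "x":     (B // N, d_model),  # local token shard before routing
    "W_in":  (d_model, d_ff),    # one expert's full W_in (m = 1)
    "W_out": (d_ff, d_model),
}

# Adding model parallelism (m > 1) splits each expert's weights as well:
m = 4
expert_data_model = {
    "x":     (B // N, d_model),
    "W_in":  (d_model, d_ff // m),  # each expert sharded over m cores
    "W_out": (d_ff // m, d_model),
}
```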
Designing models with data, model, and expert-parallelism
Sample efficiency versus T5-XXL: The gap continues to increase with additional training, with the Switch-XXL model outperforming T5-XXL by 0.087 (Neg. Log Perp.) by 500k steps.
Training instability: The larger Switch-C model, with 1.6T parameters and 2048 experts, exhibits no training instability at all. In contrast, the Switch-XXL version, with nearly 10x larger FLOPs per sequence, is sometimes unstable.
Designing models with data, model, and expert-parallelism
• Isn't Switch Transformer better simply because of its sheer parameter count?
Yes, and by design! Parameters, independent of the total FLOPs used, are a useful axis along which to scale neural language models.
• I don't have access to a supercomputer – is this still useful for me?
Though this work focuses on extremely large models, models with as few as two experts also improve performance while easily fitting within the memory constraints of commonly available GPUs or TPUs.
• Do sparse models outperform dense models on the speed-accuracy Pareto curve?
Yes. Across a wide variety of model sizes, sparse models outperform dense models per step and on wall-clock time.
• I can't deploy a trillion-parameter model – can we shrink these models?
We cannot fully preserve model quality, but compression rates of 10 to 100x are achievable by distilling the sparse models into dense models while retaining ≈30% of the expert model's quality gain.
DISCUSSION
Any Questions?