ALBERT
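Problems with prior work (original BERT)
Three problems: (a) excessive memory requirements, (b) excessive training time, (c) poorly optimized parameter count
1. Hardware issues caused by the growth in required parameters and memory (※ the paper uses MRC tasks as its main example)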
An obstacle to answering this question is the memory limitations of available hardware. Given that current
state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these
limitations as we try to scale our models.
2. Training time (8 days on 16 TPUs; several months on typical GPUs)
Training speed can also be significantly hampered in distributed training, as the communication overhead is
directly proportional to the number of parameters in the model
3. More parameters do not translate into better performance (scaling to 2x the Large model actually lowers performance)
We also observe that simply growing the hidden size of a model such as BERT-large (Devlin et al., 2019)
can lead to worse performance
Proposed method
Factorized embedding parameterization
1. Factorized embedding parameterization
As such, untying the WordPiece embedding size E from the hidden layer size H allows us to make a
more efficient usage of the total model parameters as informed by modeling needs, which dictate that H
> E. Therefore, for ALBERT we use a factorization of the embedding parameters, decomposing them
into two smaller matrices. Instead of projecting the one-hot vectors directly into the hidden space of size
H, we first project them into a lower dimensional embedding space of size E, and then project it to the
hidden space.
O(V × H) ➔ O(V × E + E × H)
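[Figure: the V × H embedding matrix (one row per vocabulary token A, B, C, ...) is replaced by a V × E matrix multiplied by an E × H matrix, with H > E.]
Example: 30,000 × 768 ≈ 23 million parameters > 30,000 × 200 = 6 million plus 200 × 768 ≈ 150 thousand parameters.
※ The number of parameters needed for the embedding decreases.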
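The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `embedding_params` is hypothetical, V = 30,000 and H = 768 come from the slide, E = 200 is the slide's example value, and E = 128 is the value the paper settles on.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, ignoring position/segment embeddings and biases.

    embedding_size=None means the BERT-style single V x H matrix;
    otherwise the factorized V x E plus E x H pair is counted.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H = 30_000, 768  # values used on the slide

full = embedding_params(V, H)
print(f"V x H           : {full:,}")  # 23,040,000

for E in (200, 128):
    factorized = embedding_params(V, H, E)
    print(f"E = {E:3d}, VE + EH: {factorized:,}")
# E = 200 gives 6,153,600 and E = 128 gives 3,938,304
```

The saving comes almost entirely from the V × E term, because V is much larger than H.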
Proposed method
Factorized embedding parameterization
1. Tried to locate the factorized embedding parameterization in the huggingface GitHub repository, without success
- ALBERT embedding implementation
- ALBERT git repositories
https://github.com/huggingface/transformers
https://github.com/brightmart/albert_zh/
https://github.com/google-research/ALBERT
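Independently of the repositories listed above, the two-step projection can be sketched in plain Python. This is a toy illustration, not code from any of those projects: the class name `FactorizedEmbedding`, the helper `make_matrix`, the tiny sizes, and the random initialization are all assumptions, and biases are left out.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random values (toy initialization)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class FactorizedEmbedding:
    """Toy version of the V x E lookup followed by an E x H projection."""

    def __init__(self, vocab_size, embedding_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.word_embeddings = make_matrix(vocab_size, embedding_size, rng)  # V x E
        self.projection = make_matrix(embedding_size, hidden_size, rng)      # E x H

    def __call__(self, token_ids):
        hidden_states = []
        for token_id in token_ids:
            # Row lookup is equivalent to multiplying a one-hot vector by the V x E matrix.
            low_dim = self.word_embeddings[token_id]
            # Project the E-dimensional vector up to the hidden size H.
            hidden = [
                sum(low_dim[i] * self.projection[i][j] for i in range(self.embedding_size))
                for j in range(self.hidden_size)
            ]
            hidden_states.append(hidden)
        return hidden_states

    def num_parameters(self):
        rows = len(self.word_embeddings)
        return rows * self.embedding_size + self.embedding_size * self.hidden_size


embedding = FactorizedEmbedding(vocab_size=100, embedding_size=8, hidden_size=32)
outputs = embedding([3, 17, 42])
print(len(outputs), len(outputs[0]), embedding.num_parameters())
```

Each token ends up with a vector of size H, yet only V × E + E × H values are stored.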
Proposed method
Factorized embedding parameterization
- BERT embedding implementation
Proposed method
Factorized embedding parameterization
1. Factorized embedding parameterization
[The results table] shows the effect of changing the vocabulary embedding size E using an ALBERT-base configuration
setting (see Table 2), using the same set of representative downstream tasks. Under the non-shared
condition (BERT-style), larger embedding sizes give better performance, but not by much. Under the
all-shared condition (ALBERT-style), an embedding of size 128 appears to be the best. Based on these
results, we use an embedding size E = 128 in all future settings, as a necessary step to do further
scaling.
Proposed method
Cross-layer parameter sharing
2. Cross-layer parameter sharing
For ALBERT, we propose cross-layer parameter sharing as another way to improve parameter
efficiency. There are multiple ways to share parameters, e.g., only sharing feed-forward network
(FFN) parameters across layers, or only sharing attention parameters.
[Figure: two stacked Transformer blocks (Block-1 and Block-2), each made of Multi-Head Attention (scaled dot-product attention), Add & Norm, Feed Forward, and Add & Norm; the blocks are marked as shared.]
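The sharing shown in the figure can be illustrated with a toy stack in which one layer object is applied repeatedly. This is a sketch under loud assumptions: `ToyLayer` is a hypothetical stand-in with a single scalar weight and bias, not a real Transformer block, and `build_stack`, `forward`, and `num_unique_layers` are names made up for the example.

```python
import random


class ToyLayer:
    """Stand-in for one Transformer block: a single scalar weight and bias."""

    def __init__(self, rng):
        self.weight = rng.uniform(0.5, 1.5)
        self.bias = rng.uniform(-0.1, 0.1)

    def __call__(self, values):
        return [self.weight * v + self.bias for v in values]


def build_stack(num_layers, shared, seed=0):
    """Build a layer stack; with shared=True every position reuses one layer object."""
    rng = random.Random(seed)
    if shared:
        layer = ToyLayer(rng)
        return [layer] * num_layers
    return [ToyLayer(rng) for _ in range(num_layers)]


def forward(stack, values):
    """Apply the layers in order; depth is the same whether or not weights are shared."""
    for layer in stack:
        values = layer(values)
    return values


def num_unique_layers(stack):
    """Count distinct layer objects, i.e. distinct sets of weights."""
    return len({id(layer) for layer in stack})


bert_style = build_stack(12, shared=False)
albert_style = build_stack(12, shared=True)
print(num_unique_layers(bert_style), num_unique_layers(albert_style))  # 12 1
print(forward(albert_style, [1.0, 2.0]))
```

Both stacks run twelve layer applications per input, so sharing reduces stored weights rather than the amount of computation.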
Proposed method
Cross-layer parameter sharing
- ALBERT layer groups implementation
Proposed method
Cross-layer parameter sharing
- BERT layer implementation
Proposed method
Cross-layer parameter sharing
- ALBERT vs. BERT architecture
Proposed method
Cross-layer parameter sharing
- Parameter sharing types (all, attention, FFN)
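To see how the three sharing types differ, the encoder parameters can be counted under each one. This is a rough sketch, not the paper's accounting: the names `layer_params` and `encoder_params` are hypothetical, embeddings are excluded, the feed-forward width is assumed to be 4H, and layer-norm parameters are assumed to be shared only in the "all" setting.

```python
def layer_params(hidden):
    """Parameter counts for the parts of one Transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)       # Q, K, V, output projections with biases
    ffn = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    layer_norm = 2 * (2 * hidden)                    # two layer norms, scale and shift each
    return {"attention": attention, "ffn": ffn, "layer_norm": layer_norm}


def encoder_params(hidden, num_layers, sharing="none"):
    """Count encoder parameters for sharing in {'none', 'attention', 'ffn', 'all'}."""
    if sharing not in ("none", "attention", "ffn", "all"):
        raise ValueError(f"unknown sharing type: {sharing}")
    parts = layer_params(hidden)
    total = 0
    for name, count in parts.items():
        # A part is stored once if it is shared, otherwise once per layer.
        is_shared = sharing == "all" or sharing == name
        total += count * (1 if is_shared else num_layers)
    return total


for sharing in ("none", "attention", "ffn", "all"):
    print(f"{sharing:9s} {encoder_params(768, 12, sharing):,}")
```

With H = 768 and 12 layers, sharing only the attention weights saves less than sharing only the FFN weights, because the FFN holds roughly twice as many parameters per layer.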
Proposed method
Inter-sentence coherence loss
3. Inter-sentence coherence loss
However, subsequent studies (Yang et al., 2019; Liu et al., 2019) found NSP’s impact unreliable and
decided to eliminate it, a decision supported by an improvement in downstream task performance
across several tasks.
We conjecture that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task.
That is, for ALBERT, we use a sentence-order prediction (SOP) loss. The SOP loss uses as
positive examples the same technique as BERT (two consecutive segments from the same
document), and as negative examples the same two consecutive segments but with their order
swapped
NSP (Next Sentence Prediction) vs. SOP (Sentence Order Prediction)
Positive example: NSP = Seq A sentence + Seq B sentence (True); SOP = Seq A sentence + Seq B sentence (True)
Negative example: NSP = Seq A sentence + random sentence (False); SOP = Seq B sentence + Seq A sentence (False)
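The difference between the two objectives comes down to how the negative example is built, which the sketch below makes concrete. The function names `sop_examples` and `nsp_examples`, the label encoding (1 for True, 0 for False), and the sample sentences are assumptions made for illustration.

```python
import random


def sop_examples(segment_a, segment_b):
    """SOP: positive keeps the original order, negative swaps the same two segments."""
    return [((segment_a, segment_b), 1), ((segment_b, segment_a), 0)]


def nsp_examples(segment_a, segment_b, other_document, rng):
    """NSP: positive is the true next segment, negative is a segment from another document."""
    random_segment = rng.choice(other_document)
    return [((segment_a, segment_b), 1), ((segment_a, random_segment), 0)]


# Hypothetical data: two consecutive segments from one document, plus another document.
seq_a = "ALBERT shares parameters across layers."
seq_b = "This reduces the number of parameters."
other_doc = ["The weather was mild that week.", "Ticket sales opened on Monday."]

rng = random.Random(0)
for pair, label in sop_examples(seq_a, seq_b):
    print("SOP", label, pair)
for pair, label in nsp_examples(seq_a, seq_b, other_doc, rng):
    print("NSP", label, pair)
```

An NSP negative can often be spotted from the topic change alone, whereas an SOP negative contains exactly the same text as its positive, so only the ordering can distinguish them.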
Proposed method
Inter-sentence coherence loss
- SOP results
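Additional experiments
Inter-sentence coherence loss
- Trained the MLM task with additional data (the datasets used by XLNet and RoBERTa)
- Removing dropout improved performance
- With the additional data, performance on SQuAD got worse (probably because out-of-domain data was mixed in)
Overall experimental results (※ results from experiments with our own implementation)
Our test code is available here: https://github.com/jeehyun100/text_proj
BERT (parameter size: 414M)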
{
"exact": 80.21759697256385,
"f1": 87.94263692549254,
"total": 10570,
}
ALBERT (parameter size: 48M)
{
"exact": 80.87038789025544,
"f1": 88.67964179873631,
"total": 10570,
}
Higher performance achieved with a smaller parameter size.
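The gap between the two runs can be read off with a few lines of arithmetic. The figures are copied from the results above; the dictionary keys and the unit label `size_m` (the "M" value as reported on the slide) are naming assumptions.

```python
# Figures copied from the slide; "size_m" is the reported parameter size in "M".
bert = {"size_m": 414, "exact": 80.21759697256385, "f1": 87.94263692549254}
albert = {"size_m": 48, "exact": 80.87038789025544, "f1": 88.67964179873631}

# Absolute score differences (ALBERT minus BERT) on the same 10,570 examples.
delta = {metric: albert[metric] - bert[metric] for metric in ("exact", "f1")}
size_ratio = bert["size_m"] / albert["size_m"]

print(f"exact match: {delta['exact']:+.2f} points")
print(f"F1         : {delta['f1']:+.2f} points")
print(f"BERT is {size_ratio:.1f}x the reported size of ALBERT")
```

Both differences are under one point, so the main takeaway is comparable accuracy at a much smaller reported size.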

Improve BERT with Factorized Embeddings & Parameter Sharing
