Super Tickets in Pre-Trained Language Models
1. Super Tickets in Pre-Trained Language Models: From
Model Compression to Improving Generalization
Chen Liang, Simiao Zuo, Minshuo Chen, Haoming Jiang, Xiaodong Liu, Pengcheng He, Tuo Zhao, Weizhu Chen
2. Lottery Ticket Hypothesis
• A randomly-initialized, dense neural network contains a subnetwork
that is initialized such that—when trained in isolation—it can match the
test accuracy of the original network after training for at most the same
number of iterations.
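The "subnetwork" in the hypothesis is usually instantiated by masking out low-magnitude weights and retraining the survivors from their original initialization. A minimal sketch (the function name and flat weight list are illustrative, not from the paper):

```python
def magnitude_mask(weights, sparsity):
    """Binary mask that keeps the largest-magnitude weights.

    A common way to instantiate an LTH subnetwork: prune the
    `sparsity` fraction of weights with the smallest magnitude,
    then retrain the survivors from their original initialization.
    """
    k = int(len(weights) * sparsity)  # number of weights to prune
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])           # indices of the smallest-magnitude weights
    return [0 if i in pruned else 1 for i in range(len(weights))]

w = [0.5, -0.01, 0.2, -0.9]
mask = magnitude_mask(w, 0.5)                      # -> [1, 0, 0, 1]
subnet = [wi * mi for wi, mi in zip(w, mask)]      # pruned entries stay zero
```

The subnetwork then trains with `w * mask`; pruned entries never receive updates.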
4. Phase Transition on LTH
1) Phase Transition: the change in the test accuracy of the compressed model as the weight-remaining ratio varies
2) Super Ticket: the subnetwork at the best weight-remaining ratio (in this paper, at the transition between Phase 1 and Phase 2)
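Selecting the super ticket can be sketched as a sweep over weight-remaining ratios, picking the ratio whose compressed model generalizes best; the accuracy curve over the ratios traces the phase transition. The `evaluate` callable and the toy accuracy values below are placeholders, not the paper's numbers:

```python
def find_super_ticket(ratios, evaluate):
    """Sweep weight-remaining ratios and keep the best one.

    `evaluate(ratio)` is a placeholder: it should prune the model to
    the given ratio, retrain, and return validation accuracy. The
    ratio with the highest accuracy marks the super ticket.
    """
    scores = {r: evaluate(r) for r in ratios}
    best = max(scores, key=scores.get)
    return best, scores

# Toy curve: light pruning helps (Phase 1), heavy pruning hurts (Phase 2).
toy_curve = {1.0: 0.90, 0.9: 0.91, 0.8: 0.92, 0.6: 0.89, 0.4: 0.80}
best, scores = find_super_ticket(toy_curve.keys(), toy_curve.get)
# best == 0.8: the super ticket sits at the phase-transition peak
```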
5. Contributions
• The first to identify the phase-transition phenomenon in pruning large neural language models
• The first to show that pruning can improve generalization when models are lightly compressed
• Propose a new pruning approach for multi-task fine-tuning of neural language models
7. Finding Super Tickets
• Pruning of attention heads and feed-forward layers.
• Adopt an importance score:
Low importance score: small contribution to the output
High importance score: high expressive power for the output
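A sensitivity-style importance score (in the spirit of Michel et al.'s head-importance metric) averages |gate value × loss gradient| over batches. The sketch below assumes precomputed gate values and gradients; the function name and data layout are illustrative:

```python
def importance_score(gate_values, gate_grads):
    """Sensitivity-style importance: average of |xi * dL/dxi| over batches.

    `gate_values[t][i]` and `gate_grads[t][i]` are the gate value and its
    loss gradient for unit i on batch t (names are illustrative; the paper
    scores attention heads and feed-forward layers in this spirit).
    A low score means a small contribution to the output: prune it first.
    """
    n_units = len(gate_values[0])
    n_batches = len(gate_values)
    scores = [0.0] * n_units
    for vals, grads in zip(gate_values, gate_grads):
        for i in range(n_units):
            scores[i] += abs(vals[i] * grads[i]) / n_batches
    return scores

# Two batches, three gated units; unit 1 barely moves the loss.
vals  = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grads = [[0.8, 0.05, -0.6], [1.2, -0.05, -0.4]]
scores = importance_score(vals, grads)     # unit 1 gets the lowest score
```

Units are then ranked by score and the lowest-scoring heads/layers are pruned first.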
11. Experiment Results on GLUE Benchmarks
On all tasks, SuperT consistently achieves better generalization than ST-DNN.
The performance gain of the super tickets is more significant on small tasks.
The performance of the super tickets is related to model size: in large models, more non-expressive tickets can be pruned without performance degradation.
12. Experiment Results on GLUE Benchmarks
Single-task fine-tuning evaluation results of:
1) super tickets (blue), 2) random tickets (orange), 3) losing tickets, each at 8 different sparsity levels
13. Experiments – Multi-Task
Baseline:
1) MT-DNN (Base/Large): BERT (Base/Large) with task-shared layers
2) MT-DNN (Base/Large) + ST Fine-tuning: MT-DNN further trained on each individual downstream task
• Models: same as those of the single-task setting.
Proposed:
1) Ticket-Share (Base/Large): MT-DNN refined through the ticket-sharing strategy.
2) Ticket-Share (Base/Large) + ST Fine-tuning: a fine-tuned single-task Ticket-Share model.
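One way to picture the ticket-sharing idea: each task carries its own super-ticket mask over the shared weights, so a task's forward pass touches only its surviving tickets, while shared tickets are updated by every task that keeps them. This is a sketch of the idea under that assumption, not a line-by-line reproduction of the paper's algorithm; the task names and masks below are illustrative:

```python
def task_forward_weights(shared_weights, task_masks, task):
    """Ticket sharing, sketched: each task uses only its own surviving
    tickets over the shared weight vector during multi-task training.

    `task_masks[task]` is that task's binary super-ticket mask.
    Tickets kept by several tasks receive updates from all of them.
    """
    mask = task_masks[task]
    return [w * m for w, m in zip(shared_weights, mask)]

shared = [0.3, -0.7, 1.1, 0.2]
masks = {
    "CoLA":  [1, 1, 0, 0],   # tickets CoLA keeps
    "STS-B": [0, 1, 1, 0],   # STS-B shares ticket 1 with CoLA
}
# A CoLA batch only sees weights 0 and 1; ticket 1 is shared across tasks.
cola_w = task_forward_weights(shared, masks, "CoLA")   # -> [0.3, -0.7, 0.0, 0.0]
```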
16. Analysis
• Sensitivity to Random Seed
Training with super tickets effectively reduces the variance in model performance caused by random initialization.
• Ticket Importance Across Tasks
SST-2 benefits little from ticket sharing (see Figure 6(a)).
Some tickets are dominated by a single task, e.g., CoLA (Figure 6(c)), or jointly by two tasks, e.g., CoLA and STS-B (Figure 6(d)).
Thus, some tickets learn only task-specific knowledge, and two tasks may share certain task-specific knowledge.
In multi-task learning, the shared model is highly over-parameterized to ensure sufficient capacity for fitting individual tasks.
The multi-task model therefore inevitably exhibits task-dependent redundancy when adapted to individual tasks.