4. PERMs
Abstract
In this paper, they present a comprehensive evaluation of parameter
efficient learning methods (PERMs) for generation tasks in natural
language processing.
They compare PERMs to finetuning from three new perspectives:
1 The impact of sample and model size
2 Generalization to unseen domains and datasets
3 Faithfulness of generations
5. PERMs
Abstract
Their results show that PERMs can outperform finetuning in
certain scenarios, particularly when training with fewer samples
and using larger pre-trained language models.
This study provides valuable insights into the effectiveness of PERMs
for adapting pre-trained language models to downstream tasks.
7. PERMs
Introduction
The recent advancements in pre-trained language models (PLMs)
have revolutionized the field of natural language processing (NLP),
enabling state-of-the-art performance on a wide range of tasks.
However, adapting these large and complex models to specific
downstream tasks can be computationally expensive and
time-consuming.
Parameter efficient learning methods (PERMs) have emerged as a
promising solution to this challenge, providing an efficient way for
PLMs to adapt to new tasks with limited training data.
8. PERMs
Introduction
In this paper, they present a comprehensive evaluation of PERMs
for generation tasks in NLP, comparing their performance to
finetuning from three new perspectives.
Their study sheds light on the effectiveness of PERMs for adapting
PLMs to downstream tasks and provides valuable insights into their
potential applications in real-world scenarios.
9. PERMs
Introduction
Contribution
They conducted a thorough evaluation of parameter efficient
learning methods (PERMs) for generating natural language text
They compared PERMs to finetuning from three new
perspectives, including the impact of sample and model size,
generalization to new domains and datasets, and the
faithfulness of generated text
Their study provides insights into how PERMs can help
pre-trained language models (PLMs) adapt to new tasks with
limited training data
They offer valuable information on how PERMs can be used in
real-world scenarios where training large models is difficult or
expensive
11. PERMs
Methodology
They compare the following four PERMs to finetuning (FT) using
GPT-style models from Megatron-LM
1 Adapter (AP)
2 Prefix Tuning (PF)
3 Prompt Tuning (PT)
4 P-tuning
12. PERMs
Methodology
Adapter
This method adds an extra layer with a bottleneck structure by first
projecting the input $h$ to a low dimension using trainable weights
$W_{\mathrm{down}}$ and then projecting back up to the original dimension
using trainable weights $W_{\mathrm{up}}$.

$$\mathrm{Adapter}(h) = h + g(h W_{\mathrm{down}}) W_{\mathrm{up}}$$

where $g$ is the activation function.
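A minimal PyTorch sketch of this bottleneck follows (not the paper's code; the class name, hidden size, and bottleneck size are illustrative assumptions).

```python
# Minimal Adapter sketch (PyTorch); only these weights would be trained,
# while the surrounding PLM stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)  # W_down: project to low dimension
        self.up = nn.Linear(d_bottleneck, d_model)    # W_up: project back to d_model
        self.act = nn.ReLU()                          # g: activation function

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Adapter(h) = h + g(h W_down) W_up; the skip connection means the
        # adapter only adds a small deviation to the activation.
        return h + self.up(self.act(self.down(h)))

# Example: apply to a (batch, sequence, hidden) activation.
h = torch.randn(2, 16, 768)
out = Adapter(768, 64)(h)  # same shape as h
```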
13. PERMs
Methodology
Prefix Tuning
It prepends trainable prefix vectors to the keys and values of the
attention in each transformer block.

$$K \leftarrow \mathrm{concat}([W_K; K]), \qquad V \leftarrow \mathrm{concat}([W_V; V])$$
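A minimal single-head PyTorch sketch of this key/value concatenation follows (not the paper's implementation; the class name, shapes, and the single-head simplification are assumptions for illustration).

```python
# Prefix tuning sketch: trainable prefixes W_K, W_V are concatenated to the
# keys and values of a (frozen) attention block; only the prefixes are trained.
import torch
import torch.nn as nn

class PrefixedAttention(nn.Module):
    def __init__(self, d_model: int, prefix_len: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.W_K = nn.Parameter(torch.randn(prefix_len, d_model))  # trainable prefix keys
        self.W_V = nn.Parameter(torch.randn(prefix_len, d_model))  # trainable prefix values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.size(0)
        Q, K, V = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # K <- concat([W_K; K]), V <- concat([W_V; V]) along the sequence axis
        K = torch.cat([self.W_K.expand(B, -1, -1), K], dim=1)
        V = torch.cat([self.W_V.expand(B, -1, -1), V], dim=1)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5, dim=-1)
        return attn @ V  # (batch, sequence, d_model)

# Example: x of shape (batch, sequence, hidden).
x = torch.randn(2, 16, 768)
out = PrefixedAttention(768, prefix_len=10)(x)
```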
17. PERMs
Experimental Setup
Datasets
1 Summarization (XSum): they split the XSum dataset into news
articles for training and sports articles for testing.
2 Dialogue (Wizard of Wikipedia / CMU DoG): they skip the
knowledge retrieval step and use the gold knowledge for response
generation. They evaluate on all test-set dialogue turns except
the first one.
Metrics
1 Quality Metrics
2 Faithfulness Metrics
20. PERMs
Results
In-domain Results
They attribute this result to the structural bias of Adapter.
The skip-connection structure allows Adapter to add only a small
deviation to the activation, which makes optimization from the PLM
checkpoint smoother.
22. PERMs
Results
Scaling up to the 530B model
Because Adapter performs better than the other methods, they apply
AP to one of the largest GPT models, MT-NLG (530B).
This result shows that a decoder-only model can still beat an
encoder-decoder model, but it requires a much larger model size.
23. PERMs
Results
Scaling up varying parameter sizes for PERMs
Here, "model size" refers to the number of trainable parameters, while
"parameters" refers to the extra parameters added at inference.
28. PERMs
Conclusion
In this paper, they extensively compare PERMs with finetuning over
three main areas:
1 In-domain evaluation by scaling both the sample size and model
size
2 Cross-domain and cross-dataset generalization
3 Faithfulness of generations
Not all PERMs can easily achieve better cross-domain and
cross-dataset scores than finetuning, even with a large PLM;
Adapter is a better choice than the other PERMs in such cases.
Prefix Tuning is the best method for faithfulness.
30. PERMs
Limitations
They are only able to qualitatively show the crossover point at
which FT becomes better than AP
Their recommendations for choosing between these methods cover
only summarization and dialogue generation
For faithfulness, when both the model and the dataset are large
enough, PF achieves scores quite close to FT