4. PERMs
Abstract
In this paper, they present a comprehensive evaluation of parameter
efficient learning methods (PERMs) for generation tasks in natural
language processing.
They compare PERMs to finetuning from three new perspectives:
1 The impact of sample and model size
2 Generalization to unseen domains and datasets
3 Faithfulness of generations
5. PERMs
Abstract
Their results show that PERMs can outperform finetuning in
certain scenarios, particularly when training with fewer samples
and using larger pre-trained language models.
This study provides valuable insights into the effectiveness of PERMs
for adapting pre-trained language models to downstream tasks.
7. PERMs
Introduction
The recent advancements in pre-trained language models (PLMs)
have revolutionized the field of natural language processing (NLP),
enabling state-of-the-art performance on a wide range of tasks.
However, adapting these large and complex models to specific
downstream tasks can be computationally expensive and
time-consuming.
Parameter efficient learning methods (PERMs) have emerged as a
promising solution to this challenge, providing an efficient way for
PLMs to adapt to new tasks with limited training data.
8. PERMs
Introduction
In this paper, they present a comprehensive evaluation of PERMs
for generation tasks in NLP, comparing their performance to
finetuning from three new perspectives.
Their study sheds light on the effectiveness of PERMs for adapting
PLMs to downstream tasks and provides valuable insights into their
potential applications in real-world scenarios.
9. PERMs
Introduction
Contribution
They conducted a thorough evaluation of parameter efficient
learning methods (PERMs) for generating natural language text
They compared PERMs to finetuning from three new
perspectives, including the impact of sample and model size,
generalization to new domains and datasets, and the
faithfulness of generated text
Their study provides insights into how PERMs can help
pre-trained language models (PLMs) adapt to new tasks with
limited training data
They offer valuable information on how PERMs can be used in
real-world scenarios where training large models is difficult or
expensive
11. PERMs
Methodology
They compare the following four PERMs to finetuning (FT) using
GPT-style models from Megatron-LM
1 Adapter (AP)
2 Prefix Tuning (PF)
3 Prompt Tuning (PT)
4 P-tuning
12. PERMs
Methodology
Adapter
This method adds an extra layer with a bottleneck structure by first
projecting the input $h$ to a low dimension using trainable weights
$W_{\mathrm{down}}$ and then projecting back up to the original dimension
using trainable weights $W_{\mathrm{up}}$.

$$\mathrm{Adapter}(h) = h + g(h W_{\mathrm{down}}) W_{\mathrm{up}}$$

where $g$ is the activation function.
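A minimal PyTorch sketch of this bottleneck follows (not the paper's code; the class name, hidden size, and bottleneck size are illustrative assumptions).

```python
# Minimal Adapter sketch (PyTorch); only these weights would be trained,
# while the surrounding PLM stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)  # W_down: project to low dimension
        self.up = nn.Linear(d_bottleneck, d_model)    # W_up: project back to d_model
        self.act = nn.ReLU()                          # g: activation function

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Adapter(h) = h + g(h W_down) W_up; the skip connection means the
        # adapter only adds a small deviation to the activation.
        return h + self.up(self.act(self.down(h)))

# Example: apply to a (batch, sequence, hidden) activation.
h = torch.randn(2, 16, 768)
out = Adapter(768, 64)(h)  # same shape as h
```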
13. PERMs
Methodology
Prefix Tuning
It prepends trainable prefix vectors to the keys and values of the
attention in each transformer block.

$$K \leftarrow \mathrm{concat}([W_K; K]), \qquad V \leftarrow \mathrm{concat}([W_V; V])$$
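A minimal single-head PyTorch sketch of this key/value concatenation follows (not the paper's implementation; the class name, shapes, and the single-head simplification are assumptions for illustration).

```python
# Prefix tuning sketch: trainable prefixes W_K, W_V are concatenated to the
# keys and values of a (frozen) attention block; only the prefixes are trained.
import torch
import torch.nn as nn

class PrefixedAttention(nn.Module):
    def __init__(self, d_model: int, prefix_len: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.W_K = nn.Parameter(torch.randn(prefix_len, d_model))  # trainable prefix keys
        self.W_V = nn.Parameter(torch.randn(prefix_len, d_model))  # trainable prefix values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.size(0)
        Q, K, V = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # K <- concat([W_K; K]), V <- concat([W_V; V]) along the sequence axis
        K = torch.cat([self.W_K.expand(B, -1, -1), K], dim=1)
        V = torch.cat([self.W_V.expand(B, -1, -1), V], dim=1)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5, dim=-1)
        return attn @ V  # (batch, sequence, d_model)

# Example: x of shape (batch, sequence, hidden).
x = torch.randn(2, 16, 768)
out = PrefixedAttention(768, prefix_len=10)(x)
```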
17. PERMs
Experimental Setup
Datasets
1 Summarization (XSum): they split the XSum dataset into news
articles for training and sports articles for testing.
2 Dialogue (Wizard of Wikipedia / CMU DoG): they skip the
knowledge retrieval step and use the gold knowledge for response
generation. They evaluate on all test-set dialogue turns except
the first one.
Metrics
1 Quality Metrics
2 Faithfulness Metrics
20. PERMs
Results
In-domain Results
They attribute this result to the structural bias of Adapter.
The skip-connection structure allows Adapter to add only a small
deviation to the activation, which makes optimization from the PLM
checkpoint smoother.
22. PERMs
Results
Scaling up to the 530B model
Because Adapter performs better than the other methods, they apply
AP to one of the largest GPT models, MT-NLG (530B).
This result shows that a decoder-only model can still beat an
encoder-decoder model, but it requires a much larger model size.
23. PERMs
Results
Scaling up varying parameter sizes for PERMs
Here, "model size" refers to the number of trainable parameters, while
"parameters" refers to the extra parameters added at inference.
28. PERMs
Conclusion
In this paper, they extensively compare PERMs with finetuning over
three main areas:
1 In-domain evaluation by scaling both the sample size and model
size
2 Cross-domain and cross-dataset generalization
3 Faithfulness of generations
Not all PERMs can easily achieve better cross-domain and
cross-dataset scores than finetuning, even with a large PLM;
Adapter is a better choice than the other PERMs in such cases.
Prefix Tuning is the best method for faithfulness.
30. PERMs
Limitations
They are only able to qualitatively show the crossover point at
which FT becomes better than AP
Their recommendations for choosing between these methods cover
only summarization and dialogue generation
For faithfulness, when both the model and the dataset are large
enough, PF achieves scores quite close to FT