Super Tickets in Pre-Trained Language Models
1. Super Tickets in Pre-Trained Language Models: From
Model Compression to Improving Generalization
Chen Liang, Simiao Zuo, Minshuo Chen, Haoming Jiang, Xiaodong Liu, Pengcheng He, Tuo Zhao, Weizhu Chen
2. Lottery Ticket Hypothesis
• A randomly-initialized, dense neural network contains a subnetwork
that is initialized such that—when trained in isolation—it can match the
test accuracy of the original network after training for at most the same
number of iterations.
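The "subnetwork" in the hypothesis is usually instantiated by masking out low-magnitude weights and retraining the survivors from their original initialization. A minimal sketch (the function name and flat weight list are illustrative, not from the paper):

```python
def magnitude_mask(weights, sparsity):
    """Binary mask that keeps the largest-magnitude weights.

    A common way to instantiate an LTH subnetwork: prune the
    `sparsity` fraction of weights with the smallest magnitude,
    then retrain the survivors from their original initialization.
    """
    k = int(len(weights) * sparsity)  # number of weights to prune
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])           # indices of the smallest-magnitude weights
    return [0 if i in pruned else 1 for i in range(len(weights))]

w = [0.5, -0.01, 0.2, -0.9]
mask = magnitude_mask(w, 0.5)                      # -> [1, 0, 0, 1]
subnet = [wi * mi for wi, mi in zip(w, mask)]      # pruned entries stay zero
```

The subnetwork then trains with `w * mask`; pruned entries never receive updates.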
4. Phase Transition on LTH
1) Phase Transition: the change in the test accuracy of the compressed model as the weight-remaining ratio varies
2) Super Ticket: the subnetwork at the best weight-remaining ratio (in this paper, at the transition between Phase 1 and Phase 2)
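Selecting the super ticket can be sketched as a sweep over weight-remaining ratios, picking the ratio whose compressed model generalizes best; the accuracy curve over the ratios traces the phase transition. The `evaluate` callable and the toy accuracy values below are placeholders, not the paper's numbers:

```python
def find_super_ticket(ratios, evaluate):
    """Sweep weight-remaining ratios and keep the best one.

    `evaluate(ratio)` is a placeholder: it should prune the model to
    the given ratio, retrain, and return validation accuracy. The
    ratio with the highest accuracy marks the super ticket.
    """
    scores = {r: evaluate(r) for r in ratios}
    best = max(scores, key=scores.get)
    return best, scores

# Toy curve: light pruning helps (Phase 1), heavy pruning hurts (Phase 2).
toy_curve = {1.0: 0.90, 0.9: 0.91, 0.8: 0.92, 0.6: 0.89, 0.4: 0.80}
best, scores = find_super_ticket(toy_curve.keys(), toy_curve.get)
# best == 0.8: the super ticket sits at the phase-transition peak
```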
5. Contributions
• The first to identify the phase-transition phenomenon in pruning large neural language models
• The first to show that pruning can improve generalization when models are lightly compressed
• Propose a new pruning approach for multi-task fine-tuning of neural language models
7. Finding Super Tickets
• Pruning of attention heads and feed-forward layers.
• Adopt an importance score:
Low importance score: small contribution to the output
High importance score: high expressive power for the output
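A sensitivity-style importance score (in the spirit of Michel et al.'s head-importance metric) averages |gate value × loss gradient| over batches. The sketch below assumes precomputed gate values and gradients; the function name and data layout are illustrative:

```python
def importance_score(gate_values, gate_grads):
    """Sensitivity-style importance: average of |xi * dL/dxi| over batches.

    `gate_values[t][i]` and `gate_grads[t][i]` are the gate value and its
    loss gradient for unit i on batch t (names are illustrative; the paper
    scores attention heads and feed-forward layers in this spirit).
    A low score means a small contribution to the output: prune it first.
    """
    n_units = len(gate_values[0])
    n_batches = len(gate_values)
    scores = [0.0] * n_units
    for vals, grads in zip(gate_values, gate_grads):
        for i in range(n_units):
            scores[i] += abs(vals[i] * grads[i]) / n_batches
    return scores

# Two batches, three gated units; unit 1 barely moves the loss.
vals  = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grads = [[0.8, 0.05, -0.6], [1.2, -0.05, -0.4]]
scores = importance_score(vals, grads)     # unit 1 gets the lowest score
```

Units are then ranked by score and the lowest-scoring heads/layers are pruned first.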
11. Experiment Results on GLUE Benchmarks
On all tasks, SuperT consistently achieves better generalization than ST-DNN.
The performance gain of the super tickets is more significant on small tasks.
The performance of the super tickets is related to model size: in large models, more non-expressive tickets can be pruned without performance degradation.
12. Experiment Results on GLUE Benchmarks
Single-task fine-tuning evaluation results of:
1) super tickets (blue), 2) random tickets (orange), 3) losing tickets, each at 8 different sparsity levels
13. Experiments – Multi-Task
Baseline:
1) MT-DNN (Base/Large): BERT (Base/Large) with task-shared layers
2) MT-DNN (Base/Large) + ST Fine-tuning: MT-DNN further trained on each individual downstream task
• Models: same as those of the single-task setting.
Proposed:
1) Ticket-Share (Base/Large): MT-DNN refined through the ticket-sharing strategy.
2) Ticket-Share (Base/Large) + ST Fine-tuning: a fine-tuned single-task Ticket-Share model.
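One way to picture the ticket-sharing idea: each task carries its own super-ticket mask over the shared weights, so a task's forward pass touches only its surviving tickets, while shared tickets are updated by every task that keeps them. This is a sketch of the idea under that assumption, not a line-by-line reproduction of the paper's algorithm; the task names and masks below are illustrative:

```python
def task_forward_weights(shared_weights, task_masks, task):
    """Ticket sharing, sketched: each task uses only its own surviving
    tickets over the shared weight vector during multi-task training.

    `task_masks[task]` is that task's binary super-ticket mask.
    Tickets kept by several tasks receive updates from all of them.
    """
    mask = task_masks[task]
    return [w * m for w, m in zip(shared_weights, mask)]

shared = [0.3, -0.7, 1.1, 0.2]
masks = {
    "CoLA":  [1, 1, 0, 0],   # tickets CoLA keeps
    "STS-B": [0, 1, 1, 0],   # STS-B shares ticket 1 with CoLA
}
# A CoLA batch only sees weights 0 and 1; ticket 1 is shared across tasks.
cola_w = task_forward_weights(shared, masks, "CoLA")   # -> [0.3, -0.7, 0.0, 0.0]
```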
16. Analysis
• Sensitivity to Random Seed
Training with super tickets effectively reduces the variance in model performance caused by random initialization.
• Ticket Importance Across Tasks
SST-2 benefits little from ticket sharing (see Figure 6(a)).
Some tickets are dominated by a single task, e.g., CoLA (Figure 6(c)), or jointly by two tasks, e.g., CoLA and STS-B (Figure 6(d)).
Thus, some tickets learn only task-specific knowledge, and two tasks may share certain task-specific knowledge.
In multi-task learning, the shared model is highly over-parameterized to ensure sufficient capacity for fitting individual tasks.
The multi-task model therefore inevitably exhibits task-dependent redundancy when adapted to individual tasks.