Multi-Task Learning in Transformer-Based Architectures for NLP
ABOUT ME
TIN FERKOVIĆ
Data Science Student at FER, Zagreb
NLP Researcher at Doxray
TABLE OF CONTENTS
0. SINGLE TASK LEARNING (STL): N tasks = N models
1. SHARED ENCODER: + task-specific heads
2. ADAPTERS: small, modular components
3. HYPERNETWORKS: generate parameters
SINGLE TASK LEARNING (STL)
N tasks = N models
0

SINGLE TASK LEARNING
BERT (large)
- PARAMETERS: 345 M (1.34 GB)
- GPU MEMORY: 5.65 GB
- PRE-TRAINING: 64 TPUs, 4 days, ~$7,000
- CHECKPOINT: 4 GB
- CO2 EMISSION: 284 t of CO2 (average transatlantic flight)
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." 2018.
SINGLE TASK LEARNING
And for today's SOTA LLMs…
- 6,000 TPU chips
- Training time: 50 days
- Estimated training cost: $10M
SHARED ENCODER
+ TASK-SPECIFIC HEADS
1
SHARED ENCODERS
MOTIVATION
- SINGLE MODEL: N-times storage reduction
- DATA EFFICIENCY: low-resource tasks benefit
- KNOWLEDGE SHARING: gradient updates from other tasks
SHARED ENCODERS
ARCHITECTURE
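A minimal sketch of this architecture, assuming PyTorch and the Hugging Face Transformers library; the task names, label counts, and the choice of the [CLS] token for pooling are illustrative, not taken from the slides:

```python
import torch.nn as nn
from transformers import AutoModel

class SharedEncoderMTL(nn.Module):
    """One shared encoder, one small output head per task."""

    def __init__(self, model_name="bert-base-uncased", task_num_labels=None):
        super().__init__()
        # Hypothetical tasks and label counts, for illustration only.
        task_num_labels = task_num_labels or {"sentiment": 2, "topic": 4}
        self.encoder = AutoModel.from_pretrained(model_name)   # shared by all tasks
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]     # [CLS] token as sentence vector
        return self.heads[task](cls)          # logits for the requested task
```

Only the tiny heads differ per task; every gradient step updates the shared encoder.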
SHARED ENCODERS
DECISIONS: BATCHES
- Sequential: catastrophic forgetting
- Homogeneous: 1 batch = 1 task
- Heterogeneous: multiple tasks in a batch
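A rough sketch of the three batching schemes, assuming one PyTorch DataLoader per task; the function names and the simple random interleaving are illustrative, not a prescribed implementation:

```python
import random

def sequential_batches(loaders):
    # Sequential: finish task A, then task B, ... -> prone to catastrophic forgetting.
    for task, loader in loaders.items():
        for batch in loader:
            yield task, batch

def homogeneous_batches(loaders, num_steps):
    # Homogeneous: every batch contains exactly one task, tasks are interleaved.
    iterators = {task: iter(dl) for task, dl in loaders.items()}
    for _ in range(num_steps):
        task = random.choice(list(iterators))
        yield task, next(iterators[task])      # StopIteration handling omitted

def heterogeneous_batches(loaders, num_steps):
    # Heterogeneous: one step draws examples from several tasks at once;
    # a real implementation would collate them into a single padded batch.
    iterators = {task: iter(dl) for task, dl in loaders.items()}
    for _ in range(num_steps):
        yield [(task, next(it)) for task, it in iterators.items()]
```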
SHARED ENCODERS
DECISIONS: SAMPLING
- Proportional / Random: simple, but causes underfitting
- Temperature: eases the differences in dataset sizes
- Annealed: tasks are trained equally towards the end of training
- Uncertainty: active learning
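The temperature and annealed schemes are commonly implemented roughly as below, with sampling probabilities p_i proportional to n_i^(1/T); the concrete temperature values and the linear schedule are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sampling_probs(dataset_sizes, T=1.0):
    # p_i proportional to n_i ** (1/T):
    #   T = 1      -> proportional sampling (large tasks dominate)
    #   T -> inf   -> uniform sampling over tasks
    p = np.asarray(dataset_sizes, dtype=float) ** (1.0 / T)
    return p / p.sum()

def annealed_temperature(step, total_steps, T_start=1.0, T_end=5.0):
    # Annealing: start close to proportional, end close to uniform,
    # so all tasks are trained (roughly) equally towards the end.
    return T_start + (step / total_steps) * (T_end - T_start)

# Illustration: three tasks with very different dataset sizes.
sizes = [1_000_000, 50_000, 5_000]
print(sampling_probs(sizes, T=1.0))   # heavily skewed towards the big task
print(sampling_probs(sizes, T=5.0))   # much closer to uniform
```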
SHARED ENCODERS
DECISIONS: LOSS WEIGHTING
- Uncertainty: gap between the logit and the ground truth
- Learning speed: convergence
- Performance: is it the same metric across tasks?
- Normalized: dividing the point-wise loss by log(n)
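As one concrete example from the uncertainty family, here is a sketch of learnable homoscedastic-uncertainty weighting (Kendall et al., 2018); this is a generic formulation and not necessarily the exact variant meant on the slide:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    # total_loss = sum_i exp(-s_i) * L_i + s_i,  with s_i = log(sigma_i^2)
    # learned per task: noisy/hard tasks get down-weighted automatically.
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```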
SHARED ENCODERS
DRAWBACKS
- INTERFERENCE: different gradient directions
- SAMPLING: difficult
- PERFORMANCE: easily outperformed (e.g. by single-task fine-tuning)
- UNDER-/OVERFITTING: problems with different dataset sizes
- LOSS WEIGHTING: difficult
- TASK GROUPING: identifying compatible tasks is hard
ADAPTERS
small, modular components
2
ADAPTERS
MOTIVATION
- TRAINABLE PARAMETERS: 0.5 %
- MODULARITY: train, save, inject
- PERFORMANCE: on par with STL
ADAPTERS
COMPOSITIONS
- FUNCTION: augment the base model with new task-specific sub-functions
- INPUT: augment the function's input by concatenating a parameter vector
- PARAMETER: directly augment the parameters of the base model
ADAPTERS
- FUNCTION composition: Bottleneck Adapter
- INPUT composition: Prefix-Tuning
- PARAMETER composition: LoRA
ADAPTERS
FUNCTION COMPOSITION: BOTTLENECK ADAPTER
Houlsby, Neil, et al. "Parameter-Efficient Transfer Learning for NLP." International Conference on Machine Learning, PMLR, 2019.
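A minimal sketch of the bottleneck adapter block; the bottleneck width and the GELU activation are common but illustrative choices (Houlsby et al. insert two such blocks per transformer layer):

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Down-project -> non-linearity -> up-project -> residual connection.
    # Inserted after a (frozen) transformer sub-layer; only these few
    # weights are trained for a given task.
    def __init__(self, hidden_size, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```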
ADAPTERS
INPUT COMPOSITION: PREFIX-TUNING
Li, Xiang Lisa, and Percy Liang. "Prefix-Tuning: Optimizing Continuous Prompts for Generation." 2021.
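A sketch of the core idea, assuming PyTorch: trainable key/value "prefix" vectors are prepended at every attention layer while the base model stays frozen. How the prefix is wired into the frozen model's attention is implementation-specific, and the reparameterization used in the paper is omitted here:

```python
import torch
import torch.nn as nn

class Prefix(nn.Module):
    # Trainable key/value vectors, one prefix per transformer layer.
    def __init__(self, num_layers, num_heads, head_dim, prefix_len=20):
        super().__init__()
        self.keys = nn.Parameter(
            torch.randn(num_layers, prefix_len, num_heads, head_dim) * 0.02)
        self.values = nn.Parameter(
            torch.randn(num_layers, prefix_len, num_heads, head_dim) * 0.02)

    def extend_kv(self, layer_idx, k, v):
        # k, v: (batch, seq_len, num_heads, head_dim) from the frozen model.
        # The learned prefix is broadcast over the batch and prepended, so
        # every token can attend to it; the base model is never updated.
        b = k.size(0)
        pk = self.keys[layer_idx].unsqueeze(0).expand(b, -1, -1, -1)
        pv = self.values[layer_idx].unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([pk, k], dim=1), torch.cat([pv, v], dim=1)
```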
ADAPTERS
PARAMETER COMPOSITION: LoRA
Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." 2021.
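A minimal LoRA sketch wrapped around an existing nn.Linear; the rank r, scaling alpha, and initialization follow common practice rather than being the only possible choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # y = x W^T + (alpha / r) * x A^T B^T, where delta_W = B @ A is low-rank.
    def __init__(self, base_linear: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)            # freeze W
        self.A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, r))  # delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

After training, the low-rank update can be merged into W, so inference costs nothing extra.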
ADAPTERS
COMPARISON OF COMPOSITION TYPES

COMPOSITION                PARAMETER    TRAINING     INFERENCE
                           EFFICIENCY   EFFICIENCY   EFFICIENCY   PERFORMANCE   COMPOSITIONALITY
FUNCTION (Bottleneck)      X            ✓            X            ✓✓            ✓
INPUT (Prefix-Tuning)      ✓✓           X            X            X             ✓
PARAMETER (LoRA)           ✓            X            ✓✓           ✓             ✓
ADAPTERS
METHOD COMBINATIONS: UniPELT
Combines the bottleneck adapter, prefix-tuning, and LoRA behind learned gates.
Mao, Yuning, et al. "UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning." 2022.
ADAPTERS
METHOD COMBINATIONS: ADAPTER FUSION
Pfeiffer, Jonas, et al. "AdapterFusion: Non-Destructive Task Composition for Transfer Learning." 2021.
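A simplified sketch of the fusion step: an attention layer learns to mix the outputs of several frozen, pre-trained task adapters, with the transformer sub-layer output acting as the query. The single-head formulation and shapes are simplifications, not the paper's exact parameterization:

```python
import torch.nn as nn

class AdapterFusion(nn.Module):
    # Attention over the outputs of N frozen task adapters.
    def __init__(self, hidden_size):
        super().__init__()
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)

    def forward(self, layer_output, adapter_outputs):
        # layer_output:    (batch, seq_len, hidden)
        # adapter_outputs: (batch, seq_len, num_adapters, hidden)
        q = self.q(layer_output).unsqueeze(2)            # (b, s, 1, h)
        k = self.k(adapter_outputs)                      # (b, s, n, h)
        v = self.v(adapter_outputs)
        weights = (q * k).sum(-1).softmax(dim=-1)        # (b, s, n)
        return (weights.unsqueeze(-1) * v).sum(dim=2)    # fused (b, s, h)
```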
HYPERNETWORKS
generate parameters
3
HYPERNETWORKS
MOTIVATION
- KNOWLEDGE SHARING: one shared hypernetwork
- TASK-SPECIFIC: generated components
HYPERNETWORKS
ARCHITECTURE
Üstün, Ahmet, et al. "Hyper-X: A Unified Hypernetwork for Multi-Task Multilingual Transfer." 2022.
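A sketch of the basic mechanism, assuming the hypernetwork generates bottleneck-adapter weights from a learned task embedding; all names and sizes are illustrative (Hyper-X additionally conditions on language and layer embeddings):

```python
import torch
import torch.nn as nn

class AdapterHypernetwork(nn.Module):
    # A shared generator maps a learned task embedding to the weights of a
    # bottleneck adapter: knowledge is shared through the generator, while
    # the generated components remain task-specific.
    def __init__(self, num_tasks, task_emb_dim=64, hidden_size=768, bottleneck=48):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, task_emb_dim)
        self.gen_down = nn.Linear(task_emb_dim, bottleneck * hidden_size)
        self.gen_up = nn.Linear(task_emb_dim, hidden_size * bottleneck)
        self.hidden_size, self.bottleneck = hidden_size, bottleneck

    def forward(self, task_id, hidden_states):
        z = self.task_emb(task_id)                        # (task_emb_dim,)
        w_down = self.gen_down(z).view(self.bottleneck, self.hidden_size)
        w_up = self.gen_up(z).view(self.hidden_size, self.bottleneck)
        h = torch.relu(hidden_states @ w_down.T)          # generated down-projection
        return hidden_states + h @ w_up.T                 # generated up-projection + residual

# Usage sketch: task_id = torch.tensor(0) selects which task's adapter to generate.
```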
HYPERNETWORKS
DRAWBACKS
- INTERFERENCE: different gradient directions
- SAMPLING: difficult
- UNDER-/OVERFITTING: problems with different dataset sizes
- LOSS WEIGHTING: difficult
"MULTI-TASK LEARNING: BECAUSE
DOING ONE THING AT A TIME IS SO
LAST YEAR."
— ChatGPT
THANKS
Do you have any questions?
tin.ferkovic@doxray.com
linkedin.com/in/tinferkovic/
doxray.com
CREDITS: Slidesgo, Flaticon, Freepik
