Multilingual models jointly pretrained on multiple languages have achieved remarkable performance on various multilingual downstream tasks. Moreover, models finetuned on a single monolingual downstream task have been shown to generalize to unseen languages. In this paper, we first show that it is crucial to align gradients between those tasks in order to maximize knowledge transfer while minimizing negative transfer. Despite its importance, existing methods for gradient alignment either have a completely different purpose, ignore inter-task alignment, or aim to solve continual learning problems in rather inefficient ways. As a result of misaligned gradients between tasks, the model suffers from severe negative transfer in the form of catastrophic forgetting of the knowledge acquired during pretraining. To overcome these limitations, we propose a simple yet effective method that can efficiently align gradients between tasks. Specifically, we perform each inner optimization by sequentially sampling batches from all the tasks, followed by a Reptile outer update. Thanks to the gradients aligned between tasks by our method, the model becomes less vulnerable to negative transfer and catastrophic forgetting. We extensively validate our method on various multi-task learning and zero-shot cross-lingual transfer tasks, where our method largely outperforms all the relevant baselines we consider.
1. Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning
Seanie Lee, Hae Beom Lee, Juho Lee, Sung Ju Hwang
2. Data Scarcity
There is not enough labeled data for non-English languages.
[Figure: examples of low-resource languages such as Finnish, Indonesian, Bengali, Telugu, Yoruba, and Swahili]
3. Pretrained Multilingual Language Model
Language models pretrained on multilingual corpora show impressive performance on low-resource languages.
Multilingual BERT
XLM
Multilingual T5
4. Multi-Task Learning (1)
Assuming there is a common structure across languages, we can leverage multi-task learning to mitigate data scarcity.
[Figure: tasks in six languages (Finnish, Indonesian, Bengali, Telugu, Yoruba, Swahili) trained jointly with a single multilingual model]
5. Multi-Task Learning (2)
Given $T$ tasks, we want to estimate a parameter $\theta$ minimizing the sum of the task losses:

$$\min_{\theta} \; \sum_{t=1}^{T} \mathcal{L}_t(\theta)$$
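As a minimal sketch of this objective (the names `model`, `batches`, and `loss_fn` are illustrative, not from the paper), one training step simply sums the per-task losses:

```python
def mtl_loss(model, batches, loss_fn):
    """Plain multi-task objective: the sum of per-task losses, where
    `batches` maps each task t to one mini-batch (x_t, y_t)."""
    return sum(loss_fn(model(x), y) for (x, y) in batches.values())
```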
6. Catastrophic Forgetting
Finetuning pretrained language models leads to catastrophic forgetting of the knowledge acquired during pretraining [1,2].
[1] Lee et al. Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. ICLR 2020.
[2] Chen et al. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. EMNLP 2020.
[Figure: QA example. Paragraph: "Philadelphia has more murals than any other U.S. city, thanks in part to the 1984 …"; question: "Which city has more murals than any other city?"; answer: "Philadelphia". After finetuning on QA, the model fails the masked-LM probe "[MASK] city has more murals than any other [MASK]?"]
7. Gradient Alignment and Conflict
For MTL, we need to maximize knowledge transfer between languages and minimize negative interference.
We need to align task gradients and avoid gradient conflict, which prevents the model from memorizing task-specific knowledge.

[Figure: task gradients in parameter space $\theta$ under gradient conflict vs. gradient alignment]
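One way to make conflict concrete is to measure the cosine similarity between task gradients; a minimal PyTorch sketch (helper names are assumptions for illustration, with the two losses computed from separate forward passes):

```python
import torch
import torch.nn.functional as F

def flat_grad(model, loss):
    """Flatten the gradient of one task's loss w.r.t. the shared parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def grad_alignment(model, loss_i, loss_j):
    """Cosine similarity between two task gradients:
    > 0 means the tasks are aligned, < 0 means they conflict."""
    return F.cosine_similarity(flat_grad(model, loss_i),
                               flat_grad(model, loss_j), dim=0)
```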
8. Related Work on Gradient Alignment
PCGrad [3] and GradVac [4] manually alter gradients to aggressively minimize the MTL objective.
[3] Yu et al. Gradient Surgery for Multi-Task Learning. NeurIPS 2020.
[4] Wang et al. Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models. ICLR 2021.
[Figure: gradient modification in PCGrad and GradVac]
9. Explicit Gradient Alignment
Explicitly maximizing the dot product of task gradients is expensive: it requires computing the Hessian with respect to the model parameters $\theta$.
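To see where the cost comes from, differentiate the alignment term directly; with $g_t = \nabla_\theta \mathcal{L}_t(\theta)$, the chain rule yields Hessian-vector products:

$$\nabla_\theta \big( g_i^{\top} g_j \big) = \nabla_\theta^{2} \mathcal{L}_i(\theta)\, g_j + \nabla_\theta^{2} \mathcal{L}_j(\theta)\, g_i = H_i\, g_j + H_j\, g_i$$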
10. Implicit Gradient Alignment
Reptile [5] shows that SGD implicitly aligns gradients of mini-batches within a task, without any second-order derivatives.
[5] Nichol et al. On First-Order Meta-Learning Algorithms. arXiv 2018.
[Figure: a Reptile inner-loop trajectory and the resulting outer update]
11. Limitation of Reptile
Reptile performs inner optimization independently for each task, so it cannot align gradients across tasks.

[Figure: separate per-task inner trajectories $\theta^{(1)}, \theta^{(2)}$ in vanilla Reptile]
12. Sequential Reptile
We propose Sequential Reptile, where the inner trajectory consists of mini-batches from all tasks.

[Figure: a single inner trajectory interleaving mini-batches from all tasks, followed by the Reptile outer update]
13. Experimental Setup
• Tasks
- Multilingual NLP tasks (QA, NER, NLI)
- Each language serves as a distinct task for MTL.
• Baselines
1) Base MTL
2) PCGrad
3) GradVac
4) RecAdam [6]
5) GradNorm [7]
6) Reptile
[6] Chen et al. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. EMNLP 2020.
[7] Chen et al. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. ICML 2018.
14. Experimental Result - QA
We train multilingual BERT (m-BERT) on the TyDiQA dataset for question answering.
15. Experimental Result - NER
We train multilingual BERT (m-BERT) on the WikiAnn dataset for named entity recognition.
17. Analysis (2)
Sequential Reptile achieves low masked language modeling loss and low $\ell_2$ distance from the pretrained model.
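The $\ell_2$ distance here can be computed directly from the two checkpoints; a small sketch (the model handles are illustrative):

```python
import torch

def l2_from_pretrained(finetuned, pretrained):
    """l2 distance between finetuned and pretrained parameters; a small
    value means finetuning stayed close to the pretrained solution."""
    sq = sum((p - q).pow(2).sum()
             for p, q in zip(finetuned.parameters(), pretrained.parameters()))
    return torch.sqrt(sq)
```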
18. Analysis (3)
To verify that Sequential Reptile retains general knowledge across languages, we perform zero-shot cross-lingual transfer experiments.
Seen Languages: ar, bn, en, fi, id, ko, ru, sw, te
19. Zero-Shot Cross-Lingual Transfer
Train m-BERT on English labeled data and evaluate it on unseen languages.
We partition the English data into four disjoint clusters and treat each cluster as a task.
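One plausible way to build such a partition is to cluster example embeddings with k-means; note that k-means and the embedding source are assumptions here for illustration, as the slide only specifies four disjoint clusters treated as tasks:

```python
from sklearn.cluster import KMeans

def partition_into_tasks(embeddings, num_clusters=4, seed=0):
    """Split English training examples into disjoint pseudo-tasks by
    clustering their embeddings (k-means is an illustrative choice).
    Returns the example indices belonging to each cluster."""
    labels = KMeans(n_clusters=num_clusters, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    return [[i for i, lab in enumerate(labels) if lab == c]
            for c in range(num_clusters)]
```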
20. Experimental Result - QA
We train multilingual BERT (m-BERT) on SQuAD and evaluate it on the MLQA dataset for question answering.
21. Experimental Result - NLI
We train multilingual BERT (m-BERT) on MNLI and evaluate it on the XNLI dataset for natural language inference.
22. Experimental Result - Image Classification
We finetune a ResNet-18 pretrained on ImageNet on 8 different image classification datasets.
24. Conclusion
• We observe that gradient alignment is important for knowledge transfer and for preventing catastrophic forgetting.
• We propose an efficient algorithm that aligns task gradients without computing second-order derivatives.
• We verify the efficacy of Sequential Reptile on various datasets, including NLP and vision tasks.