Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning
Seanie Lee, Hae Beom Lee, Juho Lee, Sung Ju Hwang
Data Scarcity
There is not enough labeled data for non-English languages.
Examples shown: Finnish, Indonesian, Bengali, Telugu, Yoruba, Swahili
Pretrained Multilingual Language Model
Language models pretrained on multilingual corpora show impressive performance on low-resource languages.
Examples: Multilingual BERT, XLM, Multilingual T5
Multi-Task Learning – (1)
Assuming there is a common structure across languages, we can leverage multi-task learning to mitigate data scarcity.
[Figure: Finnish, Indonesian, Bengali, Telugu, Yoruba, Swahili → Multilingual Model]
Multi-Task Learning – (2)
Given 𝑇 tasks, we want to estimate parameters 𝜙 that minimize the sum of the task losses.
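The multi-task objective can be written compactly. The notation is an assumption for illustration: $\mathcal{L}_t$ stands for the loss of task $t$, a symbol the slide itself does not name.

```latex
\min_{\phi} \; \sum_{t=1}^{T} \mathcal{L}_t(\phi)
```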
Catastrophic Forgetting
Fine-tuning pretrained language models leads to catastrophic forgetting of knowledge from pretraining [1,2].
[1] Lee et al., 2020. Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models. ICLR 2020.
[2] Chen et al., 2020. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. EMNLP 2020.
Paragraph: Philadelphia has more murals than any other U.S. city, thanks in part to the 1984 …
Question: Which city has more murals than any other city?
Answer: Philadelphia
Masked input: [MASK] city has more murals than any other [MASK]? → ??
Gradient Alignment and Conflict
For MTL, we need to maximize knowledge transfer between languages
and minimize negative interference.
We need to align task gradients and avoid gradient conflict, which prevents models from memorizing task-specific knowledge.
[Figure: task gradients at 𝜙, shown for gradient conflict and for gradient alignment]
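One common way to quantify alignment versus conflict is the cosine similarity between two task gradients: positive values indicate alignment, negative values indicate conflict. The sketch below is illustrative only; `cosine_similarity` is a hypothetical helper, and plain Python lists stand in for flattened gradient vectors.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors (assumes both are nonzero)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)

# Two gradients pointing the same way: similarity close to +1 (alignment).
aligned = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Two gradients pointing in opposite directions: similarity of -1 (conflict).
conflicting = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```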
Related Work on Gradient Alignment
PCGrad [3] and GradVac [4] manually alter gradients to aggressively minimize the MTL objective.
[3] Yu et al., 2020. Gradient Surgery for Multi-Task Learning. NeurIPS 2020.
[4] Wang et al., 2021. Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Model. ICLR 2021.
[Figure panels: PCGrad, GradVac]
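As an illustration of what "manually altering gradients" means, here is a minimal sketch of the projection step described in the PCGrad paper [3]: when two task gradients have a negative dot product, the conflicting component is projected away. The function names and the use of plain Python lists are assumptions for illustration, not the authors' code.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcgrad(task_grads, rng=random):
    """Sum of task gradients after projecting away pairwise conflicts."""
    projected = []
    for i, g_i in enumerate(task_grads):
        g = list(g_i)
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)  # the other tasks are visited in random order
        for j in others:
            g_j = task_grads[j]
            inner = dot(g, g_j)
            if inner < 0:  # conflict: remove the component of g along g_j
                scale = inner / dot(g_j, g_j)
                g = [x - scale * y for x, y in zip(g, g_j)]
        projected.append(g)
    # Combine the projected per-task gradients into one update direction.
    return [sum(column) for column in zip(*projected)]

# Two conflicting gradients (their dot product is -1).
update = pcgrad([[1.0, 0.0], [-1.0, 1.0]])  # [0.5, 1.5]
```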
Explicit Gradient Alignment
Explicitly maximizing the dot product of task gradients is expensive.
It requires computing the Hessian with respect to the model parameters 𝜙.
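To see where the Hessian enters, consider two tasks $i$ and $j$ with losses $\mathcal{L}_i$ and $\mathcal{L}_j$ (notation assumed here, not taken from the slide). Differentiating the dot product of their gradients with respect to $\phi$ yields Hessian-vector products:

```latex
\nabla_{\phi}\Big(\nabla_{\phi}\mathcal{L}_i(\phi)^{\top}\,\nabla_{\phi}\mathcal{L}_j(\phi)\Big)
= \nabla_{\phi}^{2}\mathcal{L}_i(\phi)\,\nabla_{\phi}\mathcal{L}_j(\phi)
+ \nabla_{\phi}^{2}\mathcal{L}_j(\phi)\,\nabla_{\phi}\mathcal{L}_i(\phi)
```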
Implicit Gradient Alignment
Reptile [5] shows that SGD implicitly aligns gradients of mini-batches within a task without any second-order derivatives.
[5] Nichol et al., 2018. On First-Order Meta-Learning Algorithms. arXiv 2018.
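A minimal sketch of a single-task Reptile update as described in [5]: run a few SGD steps on mini-batches of one task starting from 𝜙, then move 𝜙 toward the end point of that trajectory. The toy quadratic losses and the names `sgd_trajectory`, `reptile_step`, and `make_grad` are assumptions for illustration.

```python
def sgd_trajectory(phi, grad_fns, lr):
    """Run one SGD step per mini-batch gradient function, starting from phi."""
    theta = list(phi)
    for grad_fn in grad_fns:
        g = grad_fn(theta)
        theta = [p - lr * gi for p, gi in zip(theta, g)]
    return theta

def reptile_step(phi, batch_grad_fns, inner_lr, outer_lr):
    """One Reptile update: phi <- phi + outer_lr * (theta_K - phi)."""
    theta = sgd_trajectory(phi, batch_grad_fns, inner_lr)
    return [p + outer_lr * (t - p) for p, t in zip(phi, theta)]

def make_grad(center):
    """Gradient of the toy mini-batch loss 0.5 * ||theta - center||^2."""
    return lambda theta: [t - c for t, c in zip(theta, center)]

# Two mini-batches of the same task, with slightly different optima.
batches = [make_grad([1.0]), make_grad([3.0])]
phi = reptile_step([0.0], batches, inner_lr=0.5, outer_lr=1.0)  # [1.75]
```

Only first-order gradients appear in the update, which is the property the slide refers to.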
Limitation of Reptile
Reptile performs inner optimization independently for each task, so it cannot align gradients across tasks.
[Figure: one inner trajectory per task, each starting from 𝜙]
Sequential Reptile
We propose Sequential Reptile, where the inner trajectory consists of mini-batches from all tasks.
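A minimal sketch of this idea on a toy problem: each inner SGD step draws its mini-batch from a task chosen at random, so a single trajectory mixes all tasks before 𝜙 is moved toward its end point. Uniform task sampling, the quadratic task losses, and all function names are assumptions for illustration; the slide does not specify them.

```python
import random

def sequential_reptile_step(phi, task_grad_fns, inner_steps, inner_lr, outer_lr, rng):
    """One outer update whose inner trajectory interleaves mini-batches of all tasks."""
    theta = list(phi)
    for _ in range(inner_steps):
        grad_fn = rng.choice(task_grad_fns)  # assumed: uniform task sampling per step
        g = grad_fn(theta)
        theta = [p - inner_lr * gi for p, gi in zip(theta, g)]
    # Reptile-style interpolation toward the end of the shared trajectory.
    return [p + outer_lr * (t - p) for p, t in zip(phi, theta)]

def make_grad(center):
    """Gradient of the toy task loss 0.5 * ||theta - center||^2."""
    return lambda theta: [t - c for t, c in zip(theta, center)]

# Two toy "languages" whose optima differ.
tasks = [make_grad([1.0, 0.0]), make_grad([0.0, 1.0])]
rng = random.Random(0)
phi = [0.0, 0.0]
for _ in range(50):
    phi = sequential_reptile_step(phi, tasks, inner_steps=4,
                                  inner_lr=0.1, outer_lr=0.5, rng=rng)
```

Compared with running a separate trajectory per task, the only change is where the mini-batch for each inner step comes from.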
Experimental Setup
• Tasks
- Multilingual NLP tasks (QA, NER, NLI)
- Each language serves as a distinct task for MTL.
• Baselines
1) Base MTL
2) PCGrad
3) GradVac
4) RecAdam [6]
5) GradNorm [7]
6) Reptile
[6] Chen et al., 2020. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. EMNLP 2020.
[7] Chen et al., 2018. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. ICML 2018.
Experimental Result - QA
We train multilingual-BERT (m-BERT) on the TyDi QA dataset for question answering.
Experimental Result - NER
We train multilingual-BERT (m-BERT) on WikiAnn datasets for named
entity recognition.
Analysis – (1)
Sequential Reptile achieves high cosine similarity of task gradients.
Analysis – (2)
Sequential Reptile achieves low masked language modeling loss and
low l2 distance from the pretrained model.
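The l2 distance mentioned here can be computed as below. This is a sketch only: `l2_distance` is a hypothetical helper, and flat Python lists stand in for the fine-tuned and pretrained model weights.

```python
import math

def l2_distance(params, pretrained_params):
    """Euclidean distance between two flattened parameter vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(params, pretrained_params)))

# Toy example: fine-tuned weights [3, 4] versus pretrained weights [0, 0].
distance = l2_distance([3.0, 4.0], [0.0, 0.0])  # 5.0
```

A small distance means fine-tuning moved the weights only slightly away from the pretrained model.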
Analysis – (3)
To verify that Sequential Reptile captures general knowledge across languages,
we perform zero-shot cross-lingual transfer experiments.
Seen Languages: ar, bn, en, fi, id, ko, ru, sw, te
Zero-Shot Cross-Lingual Transfer
Train mBERT on English labeled data and evaluate it on unseen languages.
We partition English data into four disjoint clusters and consider each
cluster as a task.
Experimental Result - QA
We train multilingual-BERT (m-BERT) on SQuAD and evaluate it on
MLQA datasets for question answering.
Experimental Result - NLI
We train multilingual-BERT (m-BERT) on MNLI and evaluate it on
XNLI datasets for NLI.
Experimental Result – Image Classification
We fine-tune an ImageNet-pretrained ResNet18 on 8 different image classification datasets.
Analysis
Sequential Reptile achieves a better trade-off between MTL loss and task cosine similarity. The learning rate controls the trade-off.
Conclusion
• We observe that gradient alignment is important for knowledge transfer and for preventing catastrophic forgetting.
• We propose an efficient algorithm that aligns task gradients without computing second-order derivatives.
• We verify the efficacy of Sequential Reptile on various datasets, including NLP and vision tasks.
