Transfer Learning in NLP:
A Survey
Nupur Yadav
San Jose State University
Emerging Technologies in ML
Introduction
• The limitations of deep learning models, such as the need for large amounts of
training data and for huge computing resources, motivate research into knowledge
transfer.
• Many large DL models are emerging, which further increases the need for transfer
learning.
Models used for NLP
1. Recurrent-Based Models
2. Attention-Based Models
3. CNN-Based Models
Recurrent-Based Models
• In RNNs we pass the previous hidden state along with each input so the model learns
the sequential context. They work well for many tasks such as speech recognition,
translation, text generation, time-series classification, and biological modeling.
• They suffer from vanishing gradients because of backpropagation through time and
their sequential nature.
• To overcome this issue, several ideas emerged, such as using the Rectified Linear
Unit (ReLU) as the activation function, the Long Short-Term Memory (LSTM)
architecture, bidirectional LSTMs, and Gated Recurrent Units (GRUs).
• GRUs are a faster, simplified variant of LSTMs and can beat LSTMs on some tasks,
such as automatically capturing the grammatical properties of input sentences. A
minimal recurrent classifier is sketched below.
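The following is a minimal PyTorch sketch (not taken from the survey) of a
bidirectional GRU/LSTM sentence classifier in the spirit described above; the
vocabulary size, dimensions, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_classes=2, cell="gru"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        rnn_cls = nn.GRU if cell == "gru" else nn.LSTM
        # bidirectional=True gives the bidirectional LSTM/GRU variant from the slide
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        outputs, _ = self.rnn(x)              # (batch, seq_len, 2 * hidden_dim)
        # use the last time step's activations as a summary of the whole sequence
        return self.fc(outputs[:, -1, :])

model = RecurrentClassifier(vocab_size=10_000, cell="gru")
logits = model(torch.randint(0, 10_000, (4, 20)))   # 4 sequences of 20 token ids
```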
Attention-Based Models
• RNNs aggregate the sequence activations into one vector, which causes the learning
process to forget words that were fed in early in the sequence.
• Attention-based models, on the other hand, attend to each input word differently
based on a similarity score (a minimal sketch follows).
• Attention can be applied between different sequences or within the same sequence,
which is called self-attention.
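As a hedged illustration of the similarity-score idea, here is a minimal scaled
dot-product self-attention function; the projection matrices and dimensions are
illustrative, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # similarity score between every pair of positions, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # each token attends to every token
    return weights @ v                    # weighted sum of the value vectors

x = torch.randn(2, 5, 64)                 # 2 sequences, 5 tokens, d_model = 64
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)    # (2, 5, 64)
```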
CNN-Based Models
• CNN-based models use convolutional layers and max-pooling layers for sub-sampling.
Convolutional layers extract features and pooling layers reduce the spatial size of
the extracted features.
• In NLP, CNNs have been used successfully for sentence classification tasks such as
movie-review classification and question classification (a minimal example is
sketched below).
• They are also used in language modeling, where gated convolutional layers preserve
larger contexts and can be parallelized, unlike traditional recurrent neural networks.
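A minimal sketch, assuming a convolution-plus-max-pooling-over-time setup, of a CNN
sentence classifier; filter counts and window sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # one 1-D convolution per n-gram window size
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        # convolutions extract n-gram features; max-pooling keeps the strongest one
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = CNNSentenceClassifier(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 30)))   # 4 sentences of 30 token ids
```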
Language Models
1. Unidirectional LM: Each token attends only to tokens on one side of the current
context, either to the left or to the right.
2. Bidirectional LM: Each token can attend to any other token in the current context.
3. Masked LM: Used in bidirectional LMs; we randomly mask some tokens in the current
context and then predict these masked tokens (a toy example follows this list).
4. Sequence-to-sequence LM: Involves splitting the input into two separate parts. In
the first part, every token can see the context of any other token in that part, but
in the second part, every token can only attend to tokens to its left.
5. Permutation LM: Combines the benefits of both auto-regressive and auto-encoding
objectives.
6. Encoder-decoder based LM: In comparison to approaches that use a single stack of
encoder or decoder blocks, this approach uses both.
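A toy sketch of the masked-LM objective: randomly replace a fraction of tokens with a
mask token and ask the model to predict the originals. The 15% rate and the token ids
are illustrative assumptions.

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Return (masked inputs, labels); label is -100 where no prediction is required."""
    mask = torch.rand(token_ids.shape) < mask_prob
    labels = token_ids.clone()
    labels[~mask] = -100                   # positions the loss should ignore
    inputs = token_ids.clone()
    inputs[mask] = mask_token_id           # replace the chosen tokens with [MASK]
    return inputs, labels

ids = torch.randint(5, 1000, (2, 10))      # 2 sequences of 10 token ids
inputs, labels = mask_tokens(ids, mask_token_id=4)
```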
Transfer Learning
• Given a source domain-task tuple (Ds, Ts) and a different target domain-task tuple
(Dt, Tt), transfer learning can be defined as the process of using the source domain
and task in the learning process of the target domain and task.
• Mathematically, the objective of transfer learning is to learn the target
conditional probability distribution P(Yt | Xt) in Dt with the information gained
from Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt.
Types of Transfer Learning
Transductive Transfer Learning: the source and target tasks are the same, but the
target domain has no labeled data or only very few labeled samples.
It can further be divided into the following sub-categories:
• Domain Adaptation
• Cross-lingual transfer learning
Domain Adaptation
• The process of adapting to a new domain. It is usually needed when we want to learn a
different data distribution in the target domain.
• Useful if the new task to train on has a different distribution or the amount of
labeled data is scarce.
• One recent work uses adversarial domain adaptation for the detection of duplicate
questions. The approach has three main components: an encoder, a similarity function,
and a domain adaptation module (a simplified sketch follows this slide).
• The encoder encodes the question and is optimized to fool the domain classifier into
believing the question came from the target domain.
• The similarity function computes the probability that a pair of questions are
similar or duplicates.
• The domain adaptation component is used to decrease the difference between the
target and source domain distributions.
• This approach achieved an average improvement of around 5.6% over the best benchmark
across different pairs of domains.
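A simplified sketch of the adversarial idea, assuming a gradient-reversal layer as the
mechanism that lets the encoder fool the domain classifier; the module shapes and the
cosine similarity function are illustrative stand-ins, not the paper's exact
architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output                # the encoder learns to fool the domain classifier

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())   # question encoder
domain_clf = nn.Linear(128, 2)                            # source vs. target domain
similarity = nn.CosineSimilarity(dim=-1)                  # duplicate-question score

q1, q2 = torch.randn(8, 300), torch.randn(8, 300)         # toy question vectors
h1, h2 = encoder(q1), encoder(q2)
dup_score = similarity(h1, h2)                             # similarity component
domain_logits = domain_clf(GradReverse.apply(h1))          # adversarial component
```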
Cross-lingual transfer learning
• This involves adapting to a different language in the target domain.
• Useful when we want to use a high-resource language to learn corresponding tasks in
a low-resource language.
• One recent work introduces a new dataset to evaluate three different cross-lingual
transfer methods on the tasks of user intent classification and time-slot detection.
• The dataset contains 57k annotated utterances in English, Thai, and Spanish,
categorized into three domains: reminder, weather, and alarm.
• The three cross-lingual transfer methods were translating the training data, using
cross-lingual pre-trained embeddings, and a novel method of using multilingual
machine translation encoders as contextual word representations (a sketch with a
multilingual encoder follows).
• The latter two methods outperformed the translation method on target languages with
only several hundred training examples, i.e., low-resource target languages.
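A hedged sketch of cross-lingual transfer with a shared multilingual encoder:
fine-tune on high-resource (English) intent data and reuse the same model for other
languages. XLM-RoBERTa is used here only as a stand-in for the multilingual encoders
in the paper; the checkpoint name and label set are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)      # e.g. reminder / weather / alarm intents

# after fine-tuning on English utterances, low-resource languages reuse the same encoder
batch = tokenizer(["set an alarm for 7 am", "pon una alarma a las 7"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    intent_logits = model(**batch).logits
```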
Types of Transfer Learning
Inductive Transfer Learning: the source and target tasks are different, and labeled
data is available in the target domain only.
It can further be divided into the following sub-categories:
• Sequential Transfer Learning
• Multi-task Transfer Learning
Sequential Transfer Learning
It involves learning multiple tasks in a sequential fashion. It is further divided
into several sub-categories:
1. Sequential Fine-Tuning:
• Fine-tuning involves training the pre-trained model on the target task (a minimal
sketch follows this slide).
• One recent work proposes a unified pre-trained language model, UniLM.
• It combines three different training objectives (unidirectional, bidirectional, and
sequence-to-sequence) to pre-train a model in a unified way.
• The UniLM model achieved state-of-the-art results on different tasks including
generative question answering, abstractive summarization, and document-grounded
dialog response generation.
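A minimal sketch of sequential fine-tuning with the Hugging Face transformers library:
load a pre-trained checkpoint and continue training all of its weights on the target
task. The checkpoint, labels, and toy batch are illustrative; this is not the UniLM
setup itself.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss   # task loss on the target dataset
loss.backward()
optimizer.step()                            # every pre-trained weight gets updated
```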
Sequential Transfer Learning
2. Adapter Modules:
• Adapters are a compact and extensible transfer learning method for NLP.
• They provide parameter efficiency by adding only a few trainable parameters per
task, and as new tasks are added, previous ones don't require revisiting (a minimal
adapter is sketched below).
• In recent work, adapter modules were used to share parameters between different
tasks while fine-tuning the BERT model.
• The model was evaluated on the GLUE tasks and obtained state-of-the-art results on
text entailment while achieving parameter efficiency.
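A minimal sketch of a bottleneck adapter, assuming the down-projection, non-linearity,
up-projection design with a residual connection; the hidden and bottleneck sizes are
illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # few trainable parameters
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states):
        # the residual connection keeps the frozen backbone's representation intact
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

adapter = Adapter()
out = adapter(torch.randn(2, 10, 768))   # inserted after a frozen transformer sub-layer
```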
Sequential Transfer Learning
3. Feature-Based:
• The representations of a pre-trained model are fed to another model. This provides
the benefit of reusing the task-specific model for similar data.
• Also, extracting features once saves a lot of computing resources if the same data
is used repeatedly (a sketch follows this slide).
• In one recent work, researchers used a semi-supervised approach for the task of
sequence labeling.
• A pre-trained neural language model was used that had been trained in an
unsupervised way. It was a bidirectional language model in which the forward and
backward hidden states are concatenated.
• The output was then concatenated with the token representations and fed to a
supervised sequence tagging model (TagLM), which was trained in a supervised way to
output the tag of each token. The datasets used were CoNLL 2003 NER and CoNLL 2000
chunking.
• The model achieved state-of-the-art results on both tasks compared to other forms of
transfer learning.
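A hedged sketch of the feature-based recipe: freeze a pre-trained encoder, extract its
contextual representations, and train only a small tagger on top. The checkpoint and
tag count are assumptions; this is not the exact TagLM architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
lm = AutoModel.from_pretrained("bert-base-cased")
for p in lm.parameters():
    p.requires_grad = False                    # features are extracted, never fine-tuned

tagger = nn.Linear(lm.config.hidden_size, 9)   # e.g. 9 NER tags; only this is trained

batch = tokenizer(["John lives in San Jose"], return_tensors="pt")
with torch.no_grad():
    features = lm(**batch).last_hidden_state   # contextual representation per token
tag_logits = tagger(features)                  # per-token tag scores
```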
Sequential Transfer Learning
4. Zero-Shot:
• The simplest approach: for a given pre-trained model, we don't apply any training
procedure to optimize or learn new parameters.
• In a recent study, researchers used zero-shot transfer for text classification.
• Each classification task was modeled as a text entailment problem, where the
positive class meant entailment and the negative class meant there was none.
• A pre-trained BERT model was then used in a zero-shot scenario to classify texts in
different tasks like emotion detection, topic categorization, and situation frame
detection (a sketch with an off-the-shelf entailment model follows).
• This approach achieved better accuracy on two out of the three tasks compared to
unsupervised baselines like Word2Vec.
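A sketch of zero-shot classification framed as entailment, using an off-the-shelf NLI
checkpoint through the Hugging Face zero-shot-classification pipeline; the model name
and candidate labels are illustrative, not those of the study.

```python
from transformers import pipeline

# an NLI (entailment) model classifies unseen labels without any task-specific training
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "I just got promoted and I can't stop smiling!",
    candidate_labels=["joy", "anger", "sadness"],   # emotion-detection labels
)
print(result["labels"][0])                          # label with the highest entailment score
```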
Multi-task Transfer Learning
It involves learning multiple tasks at the same time. For instance, if we are given a
pre-trained model and want to transfer the learning to multiple tasks, then all tasks
are learned in a parallel fashion.
Multi-task Fine-Tuning:
• In recent work, researchers used this approach to explore the effect of a unified
text-to-text transfer Transformer (T5).
• The architecture is similar to the Transformer model with an encoder-decoder
network, but it uses fully-visible masking instead of causal masking, especially for
inputs that require predictions based on a prefix, as in translation.
• The dataset used for training the models was created from the Common Crawl corpus
and was around 750 GB. The largest model had around 11 billion parameters to be
trained on such a large dataset.
• Multi-task pre-trained models were trained on different tasks by using prefixes such
as "Translate English to German" (a sketch follows this slide). With fine-tuning, the
model achieved state-of-the-art results on different tasks like text classification,
summarization, and question answering.
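A minimal sketch of the text-to-text idea: one T5 model handles different tasks
selected purely by a natural-language prefix. The small public checkpoint here stands
in for the 11-billion-parameter model mentioned above.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the task is selected by the natural-language prefix in the input text
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```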
Conclusion
• We see that attention-based models are more popular than RNN-based and CNN-based
language models.
• BERT appears to be the default architecture for language modeling because its
bidirectional architecture makes it successful in many downstream tasks.
• Among the transfer learning techniques used for NLP, sequential fine-tuning seems to
be the most popular approach.
• Multi-task fine-tuning has gained popularity in recent years, as many studies found
that training on multiple tasks at the same time yields better results.
• Also, text classification datasets seem to be more widely used than those for other
NLP tasks because fine-tuning models on such tasks is easier.
Future Work
• For future work, it is recommended to use bidirectional models like BERT for
specific tasks like abstractive question answering, sentiment classification, and
part-of-speech tagging, and models like GPT-2 and T5 for generative tasks like
generative question answering, summarization, and text generation.
• Adapter modules can replace sequential fine-tuning as they show comparable results
to traditional fine-tuning and are faster and more compact due to parameter sharing.
• Also, extensive research should be done to reduce the size of these large language
models so that they can be easily deployed on embedded devices and on the web.