Transfer Learning in NLP: Concepts and Tools
Thomas Wolf
HuggingFace Inc.
Overview
● Concepts and History
● Anatomy of a State-of-the-art Model
● Open source tools
● Current Trends
● Limits and Open Questions
Some slides are adapted from our NAACL 2019 Tutorial on Transfer Learning in NLP with my collaborators: Sebastian Ruder, Matthew Peters, Swabha Swayamdipta
Slides: http://tiny.cc/NAACLTransfer
Concepts & History
What is Transfer Learning?
Pan and Yang (2010)
Why Transfer Learning in NLP? (intuitively)
Why should Transfer Learning work in NLP?
● Many NLP tasks share common knowledge about language (linguistic
representations, structural similarities...)
● Tasks can inform each other—e.g. syntax and semantics
● Annotated data is rare, so we should make use of as much supervision as is available.
● Unlabelled data is extremely abundant (the internet), so we should try to use it.
Empirically, transfer learning has led to state-of-the-art results on many supervised NLP tasks (e.g. classification, information extraction, question answering, etc.).
Why Transfer Learning in NLP? (empirically)
Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time
Types of transfer learning in NLP
Taxonomy from Ruder (2019). We will focus on sequential transfer learning.
Training: Sequential Transfer Learning
Learn on one task / dataset, then transfer to another task / dataset
Pretraining (word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT) ⇨ Adaptation (classification, sequence labeling, Q&A, ...)
History
● Word vectors: e.g. cats = [0.2, -0.3, …], dogs = [0.4, -0.5, …]
● Sentence/document vectors: e.g. "It’s raining cats and dogs." ⇨ [0.8, 0.9, …], "We have two cats." ⇨ [-1.2, 0.0, …]
● Word-in-context vectors: the vector for "cats" depends on its context, e.g. "We have two cats." vs. "It’s raining cats and dogs."
History: The rise of language modeling
Many currently successful pretraining approaches are based on language
modeling, i.e. learning to predict:
● empirical probability of text: Pϴ(text)
● empirical conditional probability of text (e.g. translation): Pϴ(text | other text)
Advantages:
● Doesn’t require human annotation
● Many languages have enough text to learn a high-capacity model
● Versatile—can learn both sentence and word representations with a variety of objective functions (autoregressive language modeling, masked language modeling, span prediction, skip-thoughts, cross-view training, ...)
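To make the objective concrete, here is a minimal toy sketch of the autoregressive language modeling loss in PyTorch (the LSTM architecture, vocabulary size and random token ids are illustrative placeholders, not the talk's code; large pretrained models use Transformers but optimize the same objective):

```python
import torch
import torch.nn as nn

# Toy autoregressive language model: predict each token from its left context.
vocab_size, embed_dim, hidden_dim = 100, 32, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))
        return self.head(hidden)             # logits: (batch, seq_len, vocab_size)

model = TinyLM()
tokens = torch.randint(0, vocab_size, (8, 20))   # pretend these are word ids from a corpus

logits = model(tokens[:, :-1])                   # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),              # (batch * seq, vocab)
    tokens[:, 1:].reshape(-1),                   # shifted targets
)
loss.backward()   # minimizing this loss maximizes log P_theta(text) = sum_t log P_theta(w_t | w_<t)
```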
● Language modeling is a very difficult task, even for humans.
● Language models are expected to compress any possible context into a
vector that generalizes over possible completions.
○ E.g. “I think this is the beginning of a beautiful ???”
● To have any chance at solving this task, a model is forced to learn syntax,
semantics, encode facts about the world, etc.
● Given enough data and compute, a big model can do a reasonable job!
Anatomy of a State-of-the-Art Transfer Learning Model
A State-of-the-Art Transfer Learning Model
Two essential components: model & training
● The model: pre-training architecture and adaptations for fine-tuning
○ Current large architectures are mostly based on Transformers (ULMFiT being a notable exception)
○ Unclear advantage of a smarter architecture (XLNet) versus simply more data (RoBERTa)
○ Trend toward larger models: XLM (664M), GPT-2 (1.5B), Megatron-LM (8.5B)
● The training: pre-training and adaptation phases
○ Learning long-term dependencies => a long stream of continuous text (books, Wikipedia)
○ Trend toward using more data in both phases: RoBERTa (160GB), MT-DNN (WNLI)
○ Quality of the data is important
Model: Using a typical Transfer Learning model
Components: Tokenizer ⇨ Pretrained model ⇨ Adaptation head
Example: "Jim Henson was a puppeteer"
● Tokenization: Jim / Henson / was / a / puppet / ##eer
● Convert to vocabulary indices: 11067, 5567, 245, 120, 7756, 9908
● Pretrained model: produces one hidden-state vector per token
● Classifier head: maps the hidden states to the task output, e.g. True: 0.7886, False: -0.223
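A minimal sketch of this pipeline with the pytorch-transformers library (the bert-base-uncased checkpoint and the randomly initialized binary head are illustrative assumptions, not the exact code behind the slide):

```python
import torch
import torch.nn as nn
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
backbone = BertModel.from_pretrained('bert-base-uncased')
backbone.eval()

# 1. Tokenization and 2. conversion to vocabulary indices
tokens = tokenizer.tokenize("Jim Henson was a puppeteer")      # ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

# 3. Pretrained model: one contextual hidden-state vector per token
with torch.no_grad():
    hidden_states = backbone(input_ids)[0]                     # shape: (1, num_tokens, 768)

# 4. Adaptation head: here a randomly initialized binary classifier on the first token's state
head = nn.Linear(backbone.config.hidden_size, 2)
logits = head(hidden_states[:, 0])                             # shape: (1, 2), e.g. scores for True/False
print(logits)
```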
Model: From shallow to deep
From Bengio et al. (2003), "A Neural Probabilistic Language Model" (1 layer), to Devlin et al. (2019), "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (24 layers).
Model: the example of BERT
BERT is pretrained for both sentence and contextual word representations, using masked language modeling and next sentence prediction.
● Pretrained model: BERT-large has 340M parameters, 24 layers
● Adaptation head: just a linear layer on top of the representation output by the pretrained model.
See also: Logeswaran and Lee, ICLR 2018; Devlin et al., 2019
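To illustrate the masked language modeling objective, a minimal sketch with pytorch-transformers (the sentence is made up, and BERT-base is used instead of BERT-large only to keep the example light):

```python
import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Mask one token and let the pretrained model fill it in.
tokens = ['[CLS]', 'jim', 'henson', 'was', 'a', '[MASK]', '.', '[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
masked_index = tokens.index('[MASK]')

with torch.no_grad():
    prediction_scores = model(input_ids)[0]          # shape: (1, seq_len, vocab_size)

predicted_id = prediction_scores[0, masked_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))   # the single most likely token for the masked position
```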
Model: Inside BERT, GPT-2, XLNet, RoBERTa
Large-scale transformer architectures (GPT-2, BERT, XLM…) are very similar to each other and consist of:
● summing word and position embeddings
● applying a succession of transformer blocks (a code sketch of one block follows below), each with:
○ layer normalisation
○ a self-attention module
○ dropout and a residual connection
○ another layer normalisation
○ a feed-forward module with one hidden layer and a non-linearity: Linear ⇨ ReLU/GELU ⇨ Linear
○ dropout and a residual connection
Main differences between BERT, GPT-2, XLNet: the pretraining objective
● causal language modeling for GPT
● masked language modeling for BERT (+ next sentence prediction)
(Child et al, 2019)
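A minimal PyTorch sketch of one such block (pre-norm variant; the dimensions and the use of nn.MultiheadAttention are illustrative choices, not any particular model's exact implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: LayerNorm -> self-attention -> dropout + residual,
    then LayerNorm -> feed-forward (Linear -> GELU -> Linear) -> dropout + residual."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (seq_len, batch, d_model), summed word + position embeddings
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)      # self-attention: queries = keys = values
        x = x + self.dropout(attn_out)        # residual connection
        x = x + self.dropout(self.ff(self.ln2(x)))
        return x

x = torch.randn(16, 2, 768)                   # (seq_len, batch, d_model)
print(TransformerBlock()(x).shape)            # torch.Size([16, 2, 768])
```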
Model: Adapting for target task
General workflow:
1. Remove the pretraining task head if it is not useful for the target task (e.g. remove the softmax classifier)
2. Add target task-specific layers on top/bottom of the pretrained model
○ Simple: add linear layer(s) on top of the pretrained model
○ More complex: use the model output as input for a separate model
Sometimes more complex: adapting to a structurally different task
○ Ex: pretraining with a single input sequence and adapting to a task with several input sequences (e.g. translation, conditional generation...)
➯ Use the pretrained model to initialize as much as possible of the target model (Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)
Training: Adaptation on a text classification task
Replace the pretraining head with a classification head: a linear layer which takes as input the hidden state of a token (see the sketch below).
Keep the pretrained model unchanged as the backbone.
Initialization of the model:
● Initialize the weights of the model (in particular the added parameters)
● Reload common weights from the pretrained model.
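A minimal sketch of this adaptation with pytorch-transformers (the two-label head and the use of the first token's hidden state are illustrative assumptions):

```python
import torch
import torch.nn as nn
from pytorch_transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        # Reload the common weights from the pretrained model (the backbone)...
        self.backbone = BertModel.from_pretrained('bert-base-uncased')
        # ...and freshly initialize only the added parameters (the classification head).
        self.head = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids):
        hidden_states = self.backbone(input_ids)[0]      # (batch, seq_len, hidden_size)
        first_token_state = hidden_states[:, 0]          # hidden state of the first ([CLS]) token
        return self.head(first_token_state)              # (batch, num_labels)

model = BertClassifier()
logits = model(torch.tensor([[101, 7592, 2088, 102]]))   # dummy ids, roughly "[CLS] hello world [SEP]"
print(logits.shape)                                      # torch.Size([1, 2])
```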
Training: Adaptation on a text classification task
We are at the state of the art (ULMFiT).
Remarks:
❏ The error rate goes down quickly! After one epoch we already have >90% accuracy.
⇨ Fine-tuning is highly data efficient in Transfer Learning
❏ We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
⇨ Fine-tuning is often robust to the exact choice of hyper-parameters
Training: Adaptation on a text classification task
A few words on robustness & variance.
❏ Large pretrained models (e.g. BERT large) are
prone to degenerate performance when fine-tuned
on tasks with small training sets.
❏ Observed behavior is often “on-off”: it either works
very well or doesn’t work at all.
❏ Understanding the conditions and causes of this
behavior (models, adaptation schemes) is an
open research question.
Phang et al., 2018
Open-source tools
Hubs and Libraries
Open-sourcing: practical considerations
● Pretraining large-scale models is costly:
○ use open-source models
○ share your pretrained models
“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019
● Sharing/accessing pretrained models
○ Hubs: TensorFlow Hub, PyTorch Hub
○ Author-released checkpoints: e.g. BERT, GPT...
○ Third-party libraries: AllenNLP, fast.ai, HuggingFace
● Design considerations
○ Hubs/libraries:
■ Simple to use but can be difficult to modify model internal architecture
○ Author released checkpoints:
■ More difficult to use but you have full control over the model internals
PyTorch Hub
● Based on GitHub repositories: a model is shared by adding a file to the GitHub repository.
● PyTorch Hub can fetch the model from the master branch on GitHub, so you don’t need to package your model (pip) and users can always access the most recent version.
● Both model definitions and pre-trained weights can be shared.
● More details: https://pytorch.org/hub and https://pytorch.org/docs/stable/hub.html
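For concreteness, a hedged sketch of both sides: the entrypoint file a repository author adds (the my_model entrypoint is hypothetical), and loading a model on the user side via torchvision's public hub entrypoints (assuming the pytorch/vision v0.10.0 tag still exposes resnet18):

```python
# --- hubconf.py at the root of the shared GitHub repository (hypothetical example) ---
dependencies = ['torch']                 # pip packages the entrypoints need

import torch.nn as nn

def my_model(pretrained=False, **kwargs):
    """Entrypoint: returns a model; pretrained weights could be downloaded here."""
    model = nn.Linear(10, 2, **kwargs)   # stand-in for a real model definition
    if pretrained:
        pass                             # e.g. load a state_dict from a release URL
    return model

# --- user side: fetch a model straight from GitHub ---
import torch

print(torch.hub.list('pytorch/vision:v0.10.0'))   # entrypoints exposed by that repo's hubconf.py
resnet = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
resnet.eval()
```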
TensorFlow Hub
● TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of
TensorFlow graph with their weights and assets.
● Modules are automatically downloaded and cached when instantiated.
● Each time a module is called, it adds operations to the current TensorFlow graph.
● More details: https://tensorflow.org/hub
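A minimal usage sketch in TF1-style graph mode, matching the description above (the Universal Sentence Encoder module URL is an illustrative choice; this will not run as-is under TensorFlow 2 eager mode):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Instantiating the module downloads and caches it;
# calling it adds ops to the current TensorFlow graph (TF1 style).
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["The quick brown fox jumps over the lazy dog."])

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings))        # a 512-dimensional sentence embedding
```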
Main limitations of Hubs
● No access to the source code of the model (black box)
● Not possible to modify the internals of the model (e.g. to add Adapters)
HuggingFace library with Transformers 👾
We’ve built an opinionated library of pretrained models (pytorch-transformers) for NLP researchers and practitioners seeking to use/study/modify large-scale pretrained transformer models such as BERT, GPT, GPT-2, XLNet, RoBERTa...
The library was designed with two strong principles in mind:
● be as easy to use and as fast to onboard as possible:
○ almost no abstractions to learn: models, tokenizers and configurations,
○ a common from_pretrained() method takes care of downloading/caching/loading classes from pretrained instances supplied in the library or the user’s saved instances,
○ to build upon the library, the user can use regular PyTorch modules.
● provide state-of-the-art models identical to the original models:
○ examples reproducing official results,
○ carefully drafted code as close as possible to the original computation graph.
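A short usage sketch of the three abstractions and from_pretrained()/save_pretrained() (checkpoint names are the standard published ones; the package has since been renamed transformers, so the import may need adjusting):

```python
import os
import torch
from pytorch_transformers import BertConfig, BertTokenizer, BertModel

# The three abstractions: configuration, tokenizer, model.
config = BertConfig.from_pretrained('bert-base-uncased')
print(config.num_hidden_layers)                       # 12 for BERT-base
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# from_pretrained() downloads and caches weights; the result is a regular torch.nn.Module.
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(
    tokenizer.tokenize("Transfer learning in NLP"))])
hidden_states = model(input_ids)[0]

# Saved (e.g. fine-tuned) instances can be reloaded with the same method.
os.makedirs('./my-finetuned-bert', exist_ok=True)
model.save_pretrained('./my-finetuned-bert')
tokenizer.save_pretrained('./my-finetuned-bert')
reloaded = BertModel.from_pretrained('./my-finetuned-bert')
```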
Current Trends
Larger models
(Chart: number of parameters of the model, in millions)
Larger models on larger datasets
A minimum amount of data is required to unlock the potential of Transfer Learning.
Question: Did no one think of this before? Why did it only start in ‘18 (ELMo)?
J. Devlin’s answer: Getting good results from pre-training is >1,000x to 100,000x more expensive than supervised training.
○ E.g., 10x-100x bigger model trained for 100x-1,000x as many steps.
○ Imagine in 2013: well-tuned 2-layer, LSTM gets 80% accuracy on sentiment
analysis, training for 8 hours.
○ Pre-train large-scale language model on same architecture for a week, get +0.5%.
○ Reviewers: “Who would do something so expensive for such a small gain?”
Devlin et al.
Larger models on larger datasets
Diminishing returns of using more data/bigger models:
➭ For a linear gain in performance, an exponentially larger model is required.
(Sources: Radford and Wu et al.; Devlin et al.; Hancock @ Fwdays’19)
A trend for smaller models
And a lot of very recent work to be published around the end of the year: Tsai et al., Turc et al., Tang et al., ...
(Chart: number of parameters of the model, in millions)
Smaller models: Distilling large models
Training costs make headlines, but as large-scale models reach production, inference time will likely account for most of a model's total environmental cost.
Distilling larger models into smaller ones:
● reduces inference cost
● capitalizes on the inductive biases learned by a large model.
95% of the performance of a model like BERT can be preserved in a distilled model that is 40% smaller and 60% faster (our team's work on DistilBERT, open-sourced in our pytorch-transformers library).
Smaller models: Distillation from large models
Distillation: two main tricks to train a student model from a teacher model:
1. Start from a high-quality weight initialization derived from the teacher
2. Train the student to mimic the full output distribution of the teacher (a loss sketch follows below)
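A minimal sketch of the second trick, a soft-target distillation loss in PyTorch (the temperature and toy logits are illustrative; the actual DistilBERT training combines this with additional loss terms):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.

    Training the student on the teacher's full output distribution (not just the
    argmax label) transfers much more information per example."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities for the target.
    return F.kl_div(log_soft_student, soft_teacher, reduction='batchmean') * (t ** 2)

# Toy example: teacher and student logits over a 5-way output.
teacher_logits = torch.tensor([[4.0, 1.0, 0.2, -1.0, -2.0]])
student_logits = torch.randn(1, 5, requires_grad=True)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()          # gradients flow only into the student
print(loss.item())
```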
Limits, Open Questions
Shortcomings of pretrained language models
Large, pretrained language models can be difficult to optimize.
● Fine-tuning is often unstable and has a high variance, particularly if the
target datasets are very small. BERT large is prone to degenerate
performance; multiple random restarts can be necessary (Phang et al., 2018)
● Do we really need all these parameters?
● Recent work shows that only a few of the attention heads in BERT are
required (Voita et al., ACL 2019, Michel et al.).
● More work needed to understand model parameters.
● Pruning and distillation are two ways to deal with this.
● See also: the lottery ticket hypothesis (Frankle et al., ICLR 2019).
Shortcomings of language modeling in general
The most successful current pretraining methods are based on variants of language modeling, but these have many shortcomings:
● Not appropriate for all models
○ If we condition on more inputs (video/sound), need to pretrain those parts
● Weak signal for semantics and long-term context vs. strong signal for
syntax and short-term word co-occurrences
● Pretrained language models are bad at:
○ fine-grained linguistic tasks (Liu et al., NAACL 2019)
○ common sense (when you actually make it difficult; Zellers et al., ACL 2019)
○ coherent natural language generation
● They tend to overfit to surface form information when fine-tuned: ‘rapid surface learners’
Shortcomings of language modeling in general
Need for grounded representations
● Limits of distributional hypothesis—difficult to learn certain types of
information from raw text
○ Human reporting bias: not stating the obvious (Gordon and Van Durme, 2013)
○ Common sense isn’t written down
○ Facts about named entities
○ No grounding to other modalities
● Possible solutions:
○ Incorporate structured knowledge (e.g. databases - ERNIE: Zhang et al 2019)
○ Multimodal learning (e.g. visual representations - VideoBERT: Sun et al. 2019)
○ Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al. 2018)
That’s all folks!
