Thomas Wolf – "An Introduction to Transfer Learning and Hugging Face"
Concepts
What is Transfer Learning?
What is Transfer Learning?
Pan and Yang (2010)
Adapted from NAACL 2019 Tutorial: https://tinyurl.com/NAACLTransfer
Sequential Transfer Learning
Learn on one task/dataset, transfer to another task/dataset.
Pretraining (computationally intensive step → general-purpose model): word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, ...
Adaptation: text classification, word labeling, question answering, ...
Training: The rise of language modeling pretraining
Many currently successful pretraining approaches are based on language
modeling: learning to predict P_θ(text) or P_θ(text | other text); a toy sketch of
this objective is given below.
Advantages:
● Doesn't require human annotation – self-supervised
● Many languages have enough text to learn a high-capacity model
● Versatile – can be used to learn both sentence and word representations with
a variety of objective functions
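To make the objective concrete, here is a toy sketch of the causal language-modeling loss in PyTorch. Random tensors stand in for a real corpus and model; maximizing P_θ(text) is equivalent to minimizing the per-token cross-entropy of each next token given its left context.

```python
import torch
import torch.nn.functional as F

# Toy causal language-modeling objective: predict token t given tokens < t.
# `logits` would normally come from an autoregressive model (GPT-style); here they are random.
vocab_size, seq_len, batch = 100, 8, 2
token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for tokenized text
logits = torch.randn(batch, seq_len, vocab_size)            # stand-in for model outputs

# Shift so that position t predicts token t+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)

# Negative log-likelihood of the text under the model = cross-entropy loss.
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```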
Pretraining Transformer models (BERT, GPT…)
Sequential Transfer Learning
Learn on one task/dataset, transfer to another task/dataset.
Pretraining (computationally intensive step → general-purpose model): word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, ...
Adaptation (data-efficient step → task-specific, high-performance model): text classification, word labeling, question answering, ...
Model: Adapting for the target task
General workflow:
1. Remove the pretraining task head (if it is not used for the target task)
2. Add target-task-specific elements on top/bottom:
- simple: linear layer(s) – a minimal sketch is given below
- complex: a full LSTM on top
Sometimes very complex: adapting to a structurally different task.
Example: pretraining with a single input sequence and adapting to a task with
several input sequences (e.g. translation, conditional generation...)
➯ Use the pretrained model to initialize as much as possible of the target model
(Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)
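A minimal sketch of the "simple: linear layer(s)" case, assuming the 🤗 Transformers AutoModel API; the checkpoint name, the number of labels and the use of the first token's vector as a sentence representation are illustrative choices, not a prescribed recipe.

```python
import torch
from transformers import AutoModel

class ClassifierWithPretrainedEncoder(torch.nn.Module):
    """Pretrained encoder (no pretraining head) + a new linear task head on top."""

    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # body only, pretraining head removed
        hidden_size = self.encoder.config.hidden_size
        self.head = torch.nn.Linear(hidden_size, num_labels)   # new task-specific layer

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs[0][:, 0]   # representation of the first ([CLS]) token
        return self.head(cls_vector)    # task logits
```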
Downstream tasks and Model Adaptation: Quick Examples
A – Transfer Learning for text classification
Pipeline components: Tokenizer → Pretrained model → Adaptation head
Example input: "Jim Henson was a puppeteer"
1. Tokenization: Jim | Henson | was | a | puppet | ##eer
2. Convert to vocabulary indices: 11067, 5567, 245, 120, 7756, 9908
3. Pretrained model: produces one contextual vector per token, e.g.
    1.2   2.7   0.6  -0.2
    3.7   9.1  -2.1   3.1
    1.5  -4.7   2.4   6.7
    6.1   2.4   7.3  -0.6
   -3.1   2.5   1.9  -0.1
    0.7   2.1   4.2  -3.1
4. Classifier model (adaptation head): maps these vectors to task scores, e.g. True 0.7886, False -0.223
A code sketch of this pipeline is given below.
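A hedged sketch of the same pipeline with 🤗 Transformers; the checkpoint name is an example, and a freshly added classification head would of course not reproduce the scores shown above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any BERT-like model would illustrate the same flow.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# 1. Tokenization -> sub-word tokens, 2. conversion -> vocabulary indices.
tokens = tokenizer.tokenize("Jim Henson was a puppeteer")
print(tokens)                                     # e.g. ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)                                  # vocabulary indices (will differ from the slide)

# 3.-4. Pretrained model + classification head -> one score per class.
with torch.no_grad():
    outputs = model(torch.tensor([tokenizer.encode("Jim Henson was a puppeteer")]))
print(outputs[0])                                 # shape (1, 2): one logit per class
```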
A – Transfer Learning for text classification
Remarks:
❏ The error rate goes down quickly! After one epoch we already have >90% accuracy.
⇨ Fine-tuning is highly data efficient in Transfer Learning
❏ We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
⇨ Fine-tuning is often robust to the exact choice of hyper-parameters
B – Transfer Learning for Language Generation
A dialog generation task: more complex adaptation is needed.
❏ Duplicate the model to initialize an encoder-decoder structure
(e.g. Lample & Conneau, 2019; Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019)
❏ Use a single model with concatenated inputs – sketched below
(see e.g. Wolf et al., 2019; Khandelwal et al., 2019)
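As an illustration of the second option, here is a toy sketch of building a single concatenated input sequence for dialog generation, in the spirit of Wolf et al., 2019; the special tokens and the helper function are hypothetical, not a fixed library API.

```python
# Flatten persona, dialog history and reply into one sequence that a single
# pretrained language model can consume.
from typing import List

BOS, SPEAKER1, SPEAKER2, EOS = "<bos>", "<speaker1>", "<speaker2>", "<eos>"

def build_input(persona: List[str], history: List[str], reply: str) -> List[str]:
    """Concatenate persona sentences, alternating dialog turns and the reply."""
    words: List[str] = [BOS] + " ".join(persona).split()
    for i, utterance in enumerate(history):
        speaker = SPEAKER1 if i % 2 == 0 else SPEAKER2
        words += [speaker] + utterance.split()
    words += [SPEAKER2] + reply.split() + [EOS]
    return words

print(build_input(
    persona=["i like to ski .", "i have two dogs ."],
    history=["hi , how are you ?", "great , just back from the slopes !"],
    reply="nice ! i love skiing too .",
))
```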
B – Transfer Learning for Language Generation
The Conversational Intelligence Challenge 2 (« ConvAI2 », NIPS 2018 competition)
Final Automatic Evaluation Leaderboard (hidden test set)
Trends and limits of Transfer Learning in NLP
Model size and Computational efficiency
❏ Recent trends
❏ Going big on model sizes – over 1 billion parameters has become the norm for SOTA
(e.g. Google GShard: 600B parameters; GPT-3: 175B parameters)
Model size and Computational efficiency
Why is this a problem?
❏ Narrowing of the research competition field
❏ What is the place of academia in today's NLP? Fine-tuning? Analysis and BERTology? Critique?
❏ Environmental costs ("Energy and Policy Considerations for Deep Learning in NLP" – Strubell, Ganesh and McCallum, ACL 2019)
❏ Is bigger-is-better a scientific research program?
Model size and Computational efficiency
Reducing the size of a pretrained model
Three main techniques are currently investigated:
❏ Distillation – DistilBERT reaches 95% of BERT's performance in a model that is 40% smaller and 60% faster
❏ Pruning
❏ Quantization – from FP32 to INT8
A sketch of the distillation objective follows.
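As a concrete illustration of distillation, here is a minimal sketch of a knowledge-distillation loss (temperature-softened KL divergence between teacher and student logits plus the usual supervised cross-entropy); the temperature and weighting are illustrative, not the exact DistilBERT recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of (1) KL divergence between temperature-softened teacher and student
    distributions and (2) standard cross-entropy on the gold labels."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits for a 3-class problem.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```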
The generalization problem:
❏ Models are brittle: they fail when the text is modified, even when the meaning is preserved
❏ Models are spurious: they memorize artifacts and biases instead of truly learning
References:
Robin Jia and Percy Liang, "Adversarial Examples for Evaluating Reading Comprehension Systems", arXiv:1707.07328 [cs], July 2017, http://arxiv.org/abs/1707.07328
R. Thomas McCoy, Junghyun Min and Tal Linzen, "BERTs of a Feather Do Not Generalize Together: Large Variability in Generalization across Models with Similar Test Set Performance", arXiv:1911.02969 [cs], November 2019, http://arxiv.org/abs/1911.02969
The inductive bias question
Shortcomings of language modeling in general:
the need for grounded representations
● Limits of the distributional hypothesis – it is difficult to learn certain types of
information from raw text:
○ Human reporting bias: people don't state the obvious (Gordon and Van Durme, 2013)
○ Common sense isn't written down
○ Facts about named entities
○ No relation to other modalities (image, audio…)
● Possible solutions:
○ Incorporate structured knowledge (e.g. databases – ERNIE: Zhang et al., 2019)
○ Multimodal learning (e.g. visual representations – VideoBERT: Sun et al., 2019)
○ Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al., 2018)
Continual and meta-learning
Current transfer learning performs adaptation once.
❏ Ultimately, we'd like to have models that continue to retain and accumulate knowledge across many
tasks (Yogatama et al., 2019).
❏ No distinction between pretraining and adaptation; just one stream of tasks.
Main challenge: catastrophic forgetting.
Different approaches from the literature:
❏ Memory
❏ Regularization
❏ Task-specific weights, etc.
Hugging Face Inc.
Hugging Face: Democratizing NLP
● Core research goals:
○ For most: intelligence as making sense of data
○ For us: intelligence as creativity, interaction, adaptability
● Started with Conversational AI (text/image/sound interaction):
○ Neural Language Generation in a Conversational AI game
○ Product used by more than 3M users, 600M+ messages exchanged
● Develop & open-source tools for Transfer Learning in NLP
● We want to accelerate, catalyse and democratize research-level work in
Natural Language Understanding as well as Natural Language Generation
Democratizing NLP – sharing knowledge, code, data
● Knowledge sharing
○ NAACL 2019 / EMNLP 2020 Tutorial (Transfer Learning / Neural Lang Generation)
○ Workshop NeuralGen 2019 (Language Generation with Neural Networks)
○ Workshop SustaiNLP 2020 (environmentally/computationally friendly NLP)
○ EurNLP Summit (European NLP summit)
● Code & model sharing: Open-sourcing the “right way”
○ Two extremes: 1000-command research code ⟺ 1-command production code
■ To target the widest community – our goal is to be right in the middle
○ Breaking barriers
■ Researchers / Practitioners
■ PyTorch / TensorFlow
○ Speeding up and fueling research in Natural Language Processing
■ Make people stand on the shoulders of giants
Libraries
Transformers library
We’ve built an opinionated framework providing state-of-the-art general-purpose
tools for Natural Language Understanding and Generation.
Features:
● Super easy to use – fast to get started
● For everyone – NLP researchers, practitioners, educators
● State-of-the-art performance – on both NLU and NLG tasks
● Reduce costs/footprint – 30+ pretrained models in 100+ languages
● Deep interoperability between TensorFlow 2.0 and PyTorch
Transformers library: code example
💥 Check it out at 💥
https://github.com/huggingface/transformers
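A minimal usage sketch, assuming the library's high-level pipeline API; the example sentence and the printed output are illustrative, and the default checkpoint downloaded for the task is not pinned here.

```python
from transformers import pipeline

# High-level pipeline API: one call to get a pretrained model performing a task.
classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes NLP much more accessible!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```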
Tokenizers library
Now that neural nets have fast implementations, a bottleneck in Deep-Learning
based NLP pipelines is often tokenization: converting strings ➡ model inputs.
We have just released 🤗Tokenizers: ultra-fast & versatile tokenization
Features:
● Encode 1GB in 20sec
● BPE/byte-level-BPE/WordPiece/SentencePiece...
● Bindings in python/js/rust…
● Link: https://github.com/huggingface/tokenizers
Tokenizers library: code example
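A minimal usage sketch, assuming the ByteLevelBPETokenizer class from 🤗 Tokenizers; the corpus path and vocabulary size are placeholders.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on a local text file
# ("my_corpus.txt" is a placeholder path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=30_000, min_frequency=2)

# Convert a string into model inputs: sub-word tokens and vocabulary indices.
encoding = tokenizer.encode("Jim Henson was a puppeteer")
print(encoding.tokens)
print(encoding.ids)
```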
nlp library
The full data-processing pipeline goes beyond tokenization and models to include
data access and preprocessing at one end and model evaluation at the other end.
We have recently released 🤗nlp to tackle these.
Pipeline stages and libraries: Data (🤗 nlp) → Tokenization (🤗 Tokenizers) → Prediction (🤗 Transformers) → Metrics (🤗 nlp)
nlp library
🤗 nlp is a lightweight and extensible library to easily share, access and use
datasets and evaluation metrics for Natural Language Processing (NLP).
Features:
● One-line access to 150+ datasets and metrics – open/collaborative hub
● Built-in interoperability with NumPy, pandas, PyTorch and TensorFlow 2
● Lightweight and fast with a transparent and Pythonic API
● Thrives on large datasets: Wikipedia (18GB) takes only about 9 MB of RAM when loaded, since datasets are memory-mapped from disk
● Smart caching: never wait for your data to be processed several times
● Link: https://github.com/huggingface/nlp (a minimal usage sketch follows)
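A minimal sketch of the one-line access, assuming the load_dataset/load_metric entry points; the dataset and metric names are examples (the library has since been renamed to `datasets`, where the same calls still work).

```python
from nlp import load_dataset, load_metric

# One-line access to a dataset and a metric.
dataset = load_dataset("squad", split="validation")
metric = load_metric("squad")

print(dataset[0]["question"])   # inspect one example
print(dataset.features)         # column names and types
```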
nlp library: online viewer
💥 Check it out at 💥
https://github.com/huggingface/nlp
Hands-on!
Thanks for listening!