Thomas Wolf – "An Introduction to Transfer Learning and Hugging Face"
Concepts
What is Transfer Learning?
What is Transfer Learning?
Pan and Yang (2010)
Adapted from NAACL 2019 Tutorial: https://tinyurl.com/NAACLTransfer
Sequential Transfer Learning
Learn on one task/dataset, transfer to another task/dataset.
Pretraining (computationally intensive step → general-purpose model): word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, ...
Adaptation: text classification, word labeling, question answering, ...
Training: The rise of language modeling pretraining
Many currently successful pretraining approaches are based on language
modeling: learning to predict P_θ(text) or P_θ(text | other text); a toy sketch of
this objective is given below.
Advantages:
● Doesn't require human annotation – self-supervised
● Many languages have enough text to learn a high-capacity model
● Versatile – can be used to learn both sentence and word representations with
a variety of objective functions
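To make the objective concrete, here is a toy sketch of the causal language-modeling loss in PyTorch. Random tensors stand in for a real corpus and model; maximizing P_θ(text) is equivalent to minimizing the per-token cross-entropy of each next token given its left context.

```python
import torch
import torch.nn.functional as F

# Toy causal language-modeling objective: predict token t given tokens < t.
# `logits` would normally come from an autoregressive model (GPT-style); here they are random.
vocab_size, seq_len, batch = 100, 8, 2
token_ids = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for tokenized text
logits = torch.randn(batch, seq_len, vocab_size)            # stand-in for model outputs

# Shift so that position t predicts token t+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)

# Negative log-likelihood of the text under the model = cross-entropy loss.
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```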
Pretraining Transformer models (BERT, GPT…)
Sequential Transfer Learning
Learn on one task/dataset, transfer to another task/dataset.
Pretraining (computationally intensive step → general-purpose model): word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, ...
Adaptation (data-efficient step → task-specific, high-performance model): text classification, word labeling, question answering, ...
Model: Adapting for the target task
General workflow:
1. Remove the pretraining task head (if it is not used for the target task)
2. Add target-task-specific elements on top/bottom:
- simple: linear layer(s) – a minimal sketch is given below
- complex: a full LSTM on top
Sometimes very complex: adapting to a structurally different task.
Example: pretraining with a single input sequence and adapting to a task with
several input sequences (e.g. translation, conditional generation...)
➯ Use the pretrained model to initialize as much as possible of the target model
(Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)
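A minimal sketch of the "simple: linear layer(s)" case, assuming the 🤗 Transformers AutoModel API; the checkpoint name, the number of labels and the use of the first token's vector as a sentence representation are illustrative choices, not a prescribed recipe.

```python
import torch
from transformers import AutoModel

class ClassifierWithPretrainedEncoder(torch.nn.Module):
    """Pretrained encoder (no pretraining head) + a new linear task head on top."""

    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # body only, pretraining head removed
        hidden_size = self.encoder.config.hidden_size
        self.head = torch.nn.Linear(hidden_size, num_labels)   # new task-specific layer

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs[0][:, 0]   # representation of the first ([CLS]) token
        return self.head(cls_vector)    # task logits
```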
Downstream tasks and Model Adaptation: Quick Examples
A – Transfer Learning for text classification
Pipeline components: Tokenizer → Pretrained model → Adaptation head
Example input: "Jim Henson was a puppeteer"
1. Tokenization: Jim | Henson | was | a | puppet | ##eer
2. Convert to vocabulary indices: 11067, 5567, 245, 120, 7756, 9908
3. Pretrained model: produces one contextual vector per token, e.g.
    1.2   2.7   0.6  -0.2
    3.7   9.1  -2.1   3.1
    1.5  -4.7   2.4   6.7
    6.1   2.4   7.3  -0.6
   -3.1   2.5   1.9  -0.1
    0.7   2.1   4.2  -3.1
4. Classifier model (adaptation head): maps these vectors to task scores, e.g. True 0.7886, False -0.223
A code sketch of this pipeline is given below.
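A hedged sketch of the same pipeline with 🤗 Transformers; the checkpoint name is an example, and a freshly added classification head would of course not reproduce the scores shown above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any BERT-like model would illustrate the same flow.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# 1. Tokenization -> sub-word tokens, 2. conversion -> vocabulary indices.
tokens = tokenizer.tokenize("Jim Henson was a puppeteer")
print(tokens)                                     # e.g. ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)                                  # vocabulary indices (will differ from the slide)

# 3.-4. Pretrained model + classification head -> one score per class.
with torch.no_grad():
    outputs = model(torch.tensor([tokenizer.encode("Jim Henson was a puppeteer")]))
print(outputs[0])                                 # shape (1, 2): one logit per class
```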
A – Transfer Learning for text classification
Remarks:
❏ The error rate goes down quickly! After one epoch we already have >90% accuracy.
⇨ Fine-tuning is highly data efficient in Transfer Learning
❏ We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
⇨ Fine-tuning is often robust to the exact choice of hyper-parameters
B – Transfer Learning for Language Generation
A dialog generation task: more complex adaptation is needed.
❏ Duplicate the model to initialize an encoder-decoder structure
(e.g. Lample & Conneau, 2019; Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019)
❏ Use a single model with concatenated inputs – sketched below
(see e.g. Wolf et al., 2019; Khandelwal et al., 2019)
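As an illustration of the second option, here is a toy sketch of building a single concatenated input sequence for dialog generation, in the spirit of Wolf et al., 2019; the special tokens and the helper function are hypothetical, not a fixed library API.

```python
# Flatten persona, dialog history and reply into one sequence that a single
# pretrained language model can consume.
from typing import List

BOS, SPEAKER1, SPEAKER2, EOS = "<bos>", "<speaker1>", "<speaker2>", "<eos>"

def build_input(persona: List[str], history: List[str], reply: str) -> List[str]:
    """Concatenate persona sentences, alternating dialog turns and the reply."""
    words: List[str] = [BOS] + " ".join(persona).split()
    for i, utterance in enumerate(history):
        speaker = SPEAKER1 if i % 2 == 0 else SPEAKER2
        words += [speaker] + utterance.split()
    words += [SPEAKER2] + reply.split() + [EOS]
    return words

print(build_input(
    persona=["i like to ski .", "i have two dogs ."],
    history=["hi , how are you ?", "great , just back from the slopes !"],
    reply="nice ! i love skiing too .",
))
```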
B – Transfer Learning for Language Generation
The Conversational Intelligence Challenge 2 (« ConvAI2 », NIPS 2018 competition)
Final Automatic Evaluation Leaderboard (hidden test set)
Trends and limits of Transfer Learning in NLP
Model size and Computational efficiency
❏ Recent trends
❏ Going big on model sizes – over 1 billion parameters has become the norm for SOTA
(e.g. Google GShard: 600B parameters; GPT-3: 175B parameters)
Model size and Computational efficiency
Why is this a problem?
❏ Narrowing of the research competition field
❏ What is the place of academia in today's NLP? Fine-tuning? Analysis and BERTology? Critique?
❏ Environmental costs ("Energy and Policy Considerations for Deep Learning in NLP" – Strubell, Ganesh and McCallum, ACL 2019)
❏ Is bigger-is-better a scientific research program?
Model size and Computational efficiency
Reducing the size of a pretrained model
Three main techniques are currently investigated:
❏ Distillation – DistilBERT reaches 95% of BERT's performance in a model that is 40% smaller and 60% faster
❏ Pruning
❏ Quantization – from FP32 to INT8
A sketch of the distillation objective follows.
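As a concrete illustration of distillation, here is a minimal sketch of a knowledge-distillation loss (temperature-softened KL divergence between teacher and student logits plus the usual supervised cross-entropy); the temperature and weighting are illustrative, not the exact DistilBERT recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of (1) KL divergence between temperature-softened teacher and student
    distributions and (2) standard cross-entropy on the gold labels."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits for a 3-class problem.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```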
The generalization problem:
❏ Models are brittle: they fail when the text is modified, even when the meaning is preserved
❏ Models are spurious: they memorize artifacts and biases instead of truly learning
References:
Robin Jia and Percy Liang, "Adversarial Examples for Evaluating Reading Comprehension Systems", arXiv:1707.07328 [cs], July 2017, http://arxiv.org/abs/1707.07328
R. Thomas McCoy, Junghyun Min and Tal Linzen, "BERTs of a Feather Do Not Generalize Together: Large Variability in Generalization across Models with Similar Test Set Performance", arXiv:1911.02969 [cs], November 2019, http://arxiv.org/abs/1911.02969
The inductive bias question
Shortcomings of language modeling in general:
the need for grounded representations
● Limits of the distributional hypothesis – it is difficult to learn certain types of
information from raw text:
○ Human reporting bias: people don't state the obvious (Gordon and Van Durme, 2013)
○ Common sense isn't written down
○ Facts about named entities
○ No relation to other modalities (image, audio…)
● Possible solutions:
○ Incorporate structured knowledge (e.g. databases – ERNIE: Zhang et al., 2019)
○ Multimodal learning (e.g. visual representations – VideoBERT: Sun et al., 2019)
○ Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al., 2018)
Continual and meta-learning
Current transfer learning performs adaptation once.
❏ Ultimately, we'd like to have models that continue to retain and accumulate knowledge across many
tasks (Yogatama et al., 2019).
❏ No distinction between pretraining and adaptation; just one stream of tasks.
Main challenge: catastrophic forgetting.
Different approaches from the literature:
❏ Memory
❏ Regularization
❏ Task-specific weights, etc.
Hugging Face Inc.
Hugging Face: Democratizing NLP
● Core research goals:
○ For most: intelligence as making sense of data
○ For us: intelligence as creativity, interaction, adaptability
● Started with Conversational AI (text/image/sound interaction):
○ Neural Language Generation in a Conversational AI game
○ Product used by more than 3M users, 600M+ messages exchanged
● Develop & open-source tools for Transfer Learning in NLP
● We want to accelerate, catalyse and democratize research-level work in
Natural Language Understanding as well as Natural Language Generation
Democratizing NLP – sharing knowledge, code, data
● Knowledge sharing
○ NAACL 2019 / EMNLP 2020 Tutorial (Transfer Learning / Neural Lang Generation)
○ Workshop NeuralGen 2019 (Language Generation with Neural Networks)
○ Workshop SustaiNLP 2020 (environmentally/computationally friendly NLP)
○ EurNLP Summit (European NLP summit)
● Code & model sharing: Open-sourcing the “right way”
○ Two extremes: 1000-command research code ⟺ 1-command production code
■ To target the widest community – our goal is to be right in the middle
○ Breaking barriers
■ Researchers / Practitioners
■ PyTorch / TensorFlow
○ Speeding up and fueling research in Natural Language Processing
■ Make people stand on the shoulders of giants
Libraries
Transformers library
We’ve built an opinionated framework providing state-of-the-art general-purpose
tools for Natural Language Understanding and Generation.
Features:
● Super easy to use – fast to get started
● For everyone – NLP researchers, practitioners, educators
● State-of-the-art performance – on both NLU and NLG tasks
● Reduce costs/footprint – 30+ pretrained models in 100+ languages
● Deep interoperability between TensorFlow 2.0 and PyTorch
Transformers library: code example
💥 Check it out at 💥
https://github.com/huggingface/transformers
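A minimal usage sketch, assuming the library's high-level pipeline API; the example sentence and the printed output are illustrative, and the default checkpoint downloaded for the task is not pinned here.

```python
from transformers import pipeline

# High-level pipeline API: one call to get a pretrained model performing a task.
classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes NLP much more accessible!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```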
Tokenizers library
Now that neural nets have fast implementations, a bottleneck in Deep-Learning
based NLP pipelines is often tokenization: converting strings ➡ model inputs.
We have just released 🤗Tokenizers: ultra-fast & versatile tokenization
Features:
● Encode 1GB in 20sec
● BPE/byte-level-BPE/WordPiece/SentencePiece...
● Bindings in python/js/rust…
● Link: https://github.com/huggingface/tokenizers
Tokenizers library: code example
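A minimal usage sketch, assuming the ByteLevelBPETokenizer class from 🤗 Tokenizers; the corpus path and vocabulary size are placeholders.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on a local text file
# ("my_corpus.txt" is a placeholder path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=30_000, min_frequency=2)

# Convert a string into model inputs: sub-word tokens and vocabulary indices.
encoding = tokenizer.encode("Jim Henson was a puppeteer")
print(encoding.tokens)
print(encoding.ids)
```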
nlp library
The full data-processing pipeline goes beyond tokenization and models to include
data access and preprocessing at one end and model evaluation at the other end.
We have recently released 🤗nlp to tackle these.
Pipeline stages and libraries: Data (🤗 nlp) → Tokenization (🤗 Tokenizers) → Prediction (🤗 Transformers) → Metrics (🤗 nlp)
nlp library
🤗 nlp is a lightweight and extensible library to easily share, access and use
datasets and evaluation metrics for Natural Language Processing (NLP).
Features:
● One-line access to 150+ datasets and metrics – open/collaborative hub
● Built-in interoperability with NumPy, pandas, PyTorch and TensorFlow 2
● Lightweight and fast with a transparent and Pythonic API
● Thrives on large datasets: Wikipedia (18GB) takes only about 9 MB of RAM when loaded, since datasets are memory-mapped from disk
● Smart caching: never wait for your data to be processed several times
● Link: https://github.com/huggingface/nlp (a minimal usage sketch follows)
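A minimal sketch of the one-line access, assuming the load_dataset/load_metric entry points; the dataset and metric names are examples (the library has since been renamed to `datasets`, where the same calls still work).

```python
from nlp import load_dataset, load_metric

# One-line access to a dataset and a metric.
dataset = load_dataset("squad", split="validation")
metric = load_metric("squad")

print(dataset[0]["question"])   # inspect one example
print(dataset.features)         # column names and types
```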
nlp library: online viewer
💥 Check it out at 💥
https://github.com/huggingface/nlp
Hands-on!
Thanks for listening!