4. GPT-3: Introduction
https://arxiv.org/pdf/2005.14165.pdf
Recent NLP paradigm - fine-tuning a pre-trained LM on downstream tasks
- has led to substantial progress on many challenging NLP tasks
- entirely removes the need for task-specific architectures.
However,
- needs large task-specific datasets
- needs task-specific fine-tuning
Humans do not require large supervised datasets to learn.
A brief directive / a tiny number of demonstrations is often sufficient.
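This few-shot setup can be made concrete as a prompt: a brief directive followed by a handful of demonstrations, with no gradient updates at all. The sketch below uses GPT-2 through Hugging Face transformers as a stand-in (GPT-3's weights are not publicly available); the sentiment task, labels, and review texts are invented for illustration.

# Minimal sketch of few-shot ("in-context") prompting.
# GPT-2 stands in for GPT-3; the task and examples are invented.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A brief directive plus a tiny number of demonstrations -- the "learning"
# happens purely inside the prompt, without any fine-tuning.
prompt = (
    "Classify the review as positive or negative.\n"
    "Review: The plot was dull and predictable. Sentiment: negative\n"
    "Review: A moving, beautifully shot film. Sentiment: positive\n"
    "Review: I want those two hours of my life back. Sentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=2,                     # only the label word is needed
    do_sample=False,                      # greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print(completion.strip())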
11. GPT-3: Summary
GPT-3 shows how powerful language models can be.
Drawbacks:
- It requires a gigantic LM to work well, making it unusable in most real-world settings.
- It does not scale beyond a few examples, since the context window of most LMs is limited to a few hundred tokens.
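The context-window drawback can be made concrete by counting tokens: every added demonstration eats into a fixed prompt budget. A rough sketch, assuming GPT-2's tokenizer and its 1024-token window as a stand-in (GPT-3's window is larger) and an invented demonstration text:

# Rough sketch: prompt length grows linearly with the number of in-context
# examples. The 1024-token limit is GPT-2's context window, used here as an
# assumed stand-in; the demonstration text is invented.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_limit = 1024

directive = "Classify the review as positive or negative.\n"
demo = "Review: A moving, beautifully shot film. Sentiment: positive\n"

for k in (1, 4, 16, 64):
    prompt = directive + demo * k
    n_tokens = len(tokenizer(prompt)["input_ids"])
    print(f"{k:3d} demonstrations -> {n_tokens:5d} tokens "
          f"({'fits' if n_tokens <= context_limit else 'exceeds limit'})")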
13. PET: Semi-supervised Knowledge Distillation
1. Various patterns are used for fine-tuning language models (see the code sketch after this list).
2. The ensemble of fine-tuned language models annotates unlabeled data.
3. A classifier is trained on the resulting soft-labeled dataset.
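A rough sketch of steps 1 and 2: a pattern rewrites the input as a cloze question, a verbalizer maps each label to a single token, and the masked LM's probabilities at the mask position give a soft label. The fine-tuning loop and the full ensemble are omitted; the sentiment task, the pattern, and the verbalizer words are illustrative assumptions, not the paper's exact choices.

# Sketch of a single pattern-verbalizer pair (PVP) scoring one example.
# In PET, several such patterns are fine-tuned on the few labeled examples,
# then the ensemble's averaged probabilities become soft labels for unlabeled data.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Verbalizer: one token per class (illustrative choice).
verbalizer = {"positive": "great", "negative": "terrible"}

def pvp_soft_label(text):
    # Pattern: rewrite the input as a cloze question containing [MASK].
    cloze = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(cloze, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the mask token in the sequence.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    label_ids = [tokenizer.convert_tokens_to_ids(w) for w in verbalizer.values()]
    # Softmax over the verbalizer tokens only -> soft label distribution.
    probs = torch.softmax(logits[0, mask_pos, label_ids], dim=-1)
    return dict(zip(verbalizer.keys(), probs.tolist()))

print(pvp_soft_label("The plot was dull and predictable."))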
14. PET: Without Knowledge Distillation
Without the knowledge-distillation step on the unlabeled dataset, PET performs even better.
But the resulting ensemble is n*k times larger than the distilled model (k = number of PVPs, n = number of LMs per PVP); e.g., with k = 4 patterns and n = 3 models each, 12 fine-tuned LMs must be kept at inference instead of one distilled classifier.
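The distillation step this slide skips (step 3 on the previous slide) amounts to training one compact classifier on the ensemble's averaged soft labels, so only a single model has to be kept at inference. A minimal, self-contained sketch with random placeholder features and soft labels, using KL divergence as one common distillation loss:

# Sketch of the final distillation step: a single classifier is trained on
# soft labels produced by the PVP ensemble, replacing n*k fine-tuned LMs.
# Features and soft labels are random placeholders standing in for real
# sentence representations and ensemble outputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_examples, feat_dim, num_classes = 256, 64, 2

features = torch.randn(num_examples, feat_dim)  # stand-in for encoded unlabeled texts
soft_labels = torch.softmax(torch.randn(num_examples, num_classes), dim=-1)  # stand-in for ensemble output

classifier = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
kl = nn.KLDivLoss(reduction="batchmean")

for epoch in range(10):
    log_probs = torch.log_softmax(classifier(features), dim=-1)
    loss = kl(log_probs, soft_labels)  # match the ensemble's distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final distillation loss: {loss.item():.4f}")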
19. Conclusion: Future of NLP
👍
- Unsupervised / semi-supervised learning
- No / few labeled data
- Text in, text out using a language model
👎
- Supervised learning
- Large amounts of labeled data
- Task-specific fine-tuning
21. Tool: Next Word Prediction
https://github.com/renatoviolin/next_word_prediction
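Independent of how the linked tool is implemented, the core operation of next-word prediction can be sketched as taking the top-k most likely continuation tokens under a language model. A minimal version using GPT-2 via transformers (the model choice and top_k value are assumptions, not necessarily what the tool uses):

# Minimal next-word prediction: rank the most likely next tokens under GPT-2.
# Model choice and top_k are illustrative; the linked tool may differ.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_words(prefix, top_k=5):
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Distribution over the vocabulary for the token following the prefix.
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, top_k)
    return [(tokenizer.decode(int(i)).strip(), p.item())
            for i, p in zip(top.indices, top.values)]

print(next_words("The future of NLP is"))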
22. Reference
Language Models are Few-Shot Learners
- https://arxiv.org/pdf/2005.14165.pdf
Exploiting Cloze Questions for Few Shot Text Classification and
Natural Language Inference
- https://arxiv.org/pdf/2001.07676.pdf
It’s Not Just Size That Matters: Small Language Models Are Also
Few-Shot Learners
- https://arxiv.org/pdf/2009.07118.pdf