Pre-trained Language Models - from ELMo to GPT-2
Jiwon Kim
jiwon.kim@dpschool.io
Why ‘pre-trained language model’?
● Achieved great performance on a variety of language tasks.
● Similar to how ImageNet classification pre-training helps many vision tasks (*)
● Even better than in CV, it does not require labeled data for pre-training.
(*) Although He et al. (2018) recently found that pre-training might not be necessary for the image segmentation task.
The problem of previous embeddings
feat. Gensim, NLTK, Scikit-learn
Then, what’s the difference between
“I’m eating an apple”
and
“I have an Apple pencil”?
The Gradient - Sebastian Ruder
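A toy sketch of the problem, assuming gensim 4.x: a static word2vec model looks embeddings up by surface form only, so the fruit "apple" and the "Apple" pencil end up with one and the same vector.

```python
# Toy sketch (assumes gensim 4.x): a static word2vec model assigns "apple"
# a single vector, regardless of the context it appears in.
from gensim.models import Word2Vec

sentences = [
    ["i", "am", "eating", "an", "apple"],
    ["i", "have", "an", "apple", "pencil"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)

# Lookup is purely by surface form: the fruit and the product share one embedding.
print(model.wv["apple"])                        # a single 50-dim vector
print(model.wv.most_similar("apple", topn=3))   # neighbours, same for both senses
```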
Embeddings Dependent on Context
ELMo
ULMFiT
BERT
GPT-2
ELMo: Deep contextualized word
representations
AllenNLP
ELMo - core structure
ELMo - code
feat. TensorFlow Hub
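A minimal sketch of what this demo might look like, assuming TensorFlow 1.x and the public "elmo" module on TensorFlow Hub:

```python
# Minimal sketch (assumes TensorFlow 1.x and the tfhub.dev "elmo" module).
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained ELMo module; trainable=True lets the scalar mixing
# weights over the biLM layers be fine-tuned as well.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

sentences = ["I'm eating an apple", "I have an Apple pencil"]

# "elmo" output: contextual embeddings, shape [batch, max_seq_len, 1024].
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embeddings)
    print(vectors.shape)  # the two "apple"/"Apple" vectors now differ by context
```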
ULMFiT: Universal Language Model
Fine-tuning for Text Classification
fast.ai
ULMFiT - core structure
Transfer-learning!
ULMFiT - code
feat. fastai
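A minimal sketch of the ULMFiT recipe with fastai v1 (the API differs in fastai v2; the IMDB sample dataset and column layout are assumptions taken from the library's text tutorial):

```python
# Minimal sketch (assumes fastai v1 and its bundled IMDB sample).
from fastai.text import *

path = untar_data(URLs.IMDB_SAMPLE)

# Stage 2: fine-tune the pre-trained AWD-LSTM language model on the target corpus.
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')

# Stage 3: fine-tune a classifier on top of the adapted encoder,
# with gradual unfreezing of the layer groups.
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fit_one_cycle(1, 1e-2)
learn_clf.freeze_to(-2)                          # unfreeze one more layer group
learn_clf.fit_one_cycle(1, slice(5e-3/2., 5e-3))
```

Stage 1 (general-domain language model pre-training on Wikitext-103) comes for free as the pre-trained AWD_LSTM weights shipped with the library.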
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
Google
BERT - core structure
BERT - code
feat. huggingface
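A minimal sketch using the Hugging Face `transformers` package (an assumption; the original demo likely used its predecessor, pytorch-pretrained-bert):

```python
# Minimal sketch (assumes the Hugging Face `transformers` package).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("I have an Apple pencil", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: a contextual embedding for every WordPiece token,
# shape [1, seq_len, 768] for bert-base.
print(outputs.last_hidden_state.shape)
```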
“You know, we could just
throw more GPUs and data at
it.”
GPT-2: "Language Models are
Unsupervised Multitask Learners"
OpenAI
GPT-2 - core structure
GPT-2 - code
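A minimal sketch of sampling from GPT-2, assuming the Hugging Face `transformers` package (OpenAI's own release ships TensorFlow sampling scripts instead); the prompt is the unicorn prompt from OpenAI's published sample:

```python
# Minimal sketch (assumes the Hugging Face `transformers` package).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation; top-k / top-p sampling keeps the text coherent but varied.
output_ids = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_k=40,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```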
A text generation sample from OpenAI’s GPT-2 language model
Comparing the Models
What tasks can we do with these models?
1 = tricky to implement and not good accuracy; 5 = possible and easy

         SQuAD  NER  SRL  Sentiment  Coref  Text Generation
ELMo       4     3    3       3        3          2
ULMFiT     1     2    2       5        2          4
BERT       3     4    4       4        4          3
GPT-2      5     1    1       2        1          5


Editor's Notes

  • #3  Allowing us to experiment with increased training scale, up to our very limit.
  • #4 The idea is simple: learn a vector for each word, which is usually called word2vec.
  • #5 The two “apple” words refer to very different things, but they would still share the same word embedding vector.
  • #10 How does the ELMo embedding come about?
  • #13 The start of the era of transfer learning.