Transfer learning in NLP
What has changed and why is it important for business?
Jakub Nowacki, PhD
Lead Machine Learning Engineer @ Sotrender
Trainer @ Sages
Transfer Learning
https://medium.com/@pierre_guillou/understand-how-works-resnet-without-talking-about-residual-64698f157e0
😭😎
Embeddings (Word2Vec, FastText etc.)
https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c
So what is wrong with that?
[0.0, 0.0, …, 0.0]
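A minimal sketch of this out-of-vocabulary problem, assuming gensim (4.x parameter names) and a toy corpus made up for illustration: a plain Word2Vec model has no vector at all for an unseen word, while FastText can still compose one from character n-grams.

    import gensim

    # Tiny illustrative corpus; real models are trained on far more text.
    sentences = [["transfer", "learning", "in", "nlp"],
                 ["nlp", "models", "learn", "word", "embeddings"]]

    # Word2Vec keeps exactly one vector per vocabulary word.
    w2v = gensim.models.Word2Vec(sentences, vector_size=50, min_count=1)
    print("nlp" in w2v.wv)    # True: seen during training
    print("nlps" in w2v.wv)   # False: out of vocabulary, no vector at all

    # FastText builds word vectors from character n-grams, so even an
    # unseen or inflected form still gets a (subword-based) embedding.
    ft = gensim.models.FastText(sentences, vector_size=50, min_count=1)
    print(ft.wv["nlps"][:5])  # OOV word, but a vector is still produced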
Contextualized word-embeddings
http://jalammar.github.io/illustrated-bert/
Language model
https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence.
Wikipedia: https://en.wikipedia.org/wiki/Language_model
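In practice this joint probability is factored word by word, so the model only ever has to predict the next word from its preceding context:

P(w_1, ..., w_m) = P(w_1) · P(w_2 | w_1) · ... · P(w_m | w_1, ..., w_{m-1})

A bidirectional language model (as in the link above) additionally runs the same prediction task from right to left.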
ELMo
http://jalammar.github.io/illustrated-bert/
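A rough sketch of getting contextual vectors out of a pretrained ELMo, assuming the ElmoEmbedder API from the allennlp library of that period (around version 0.9); the sentences are made up for illustration:

    from allennlp.commands.elmo import ElmoEmbedder

    elmo = ElmoEmbedder()  # downloads the default pretrained English weights

    # The same surface word gets a different vector in each context.
    v1 = elmo.embed_sentence(["I", "deposited", "cash", "at", "the", "bank"])
    v2 = elmo.embed_sentence(["We", "sat", "on", "the", "river", "bank"])

    # embed_sentence returns an array of shape (3 layers, n_tokens, 1024);
    # compare the top-layer vectors of "bank" (position 5) in both sentences.
    print(v1[2, 5, :5])
    print(v2[2, 5, :5])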
LSTM vs Transformer
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
BERT
http://jalammar.github.io/illustrated-bert/
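As a sketch only: with the Hugging Face transformers library and the public bert-base-uncased checkpoint, a pretrained BERT can be used off the shelf as a contextual feature extractor (it can also be fine-tuned end to end, as discussed later):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Transfer learning is changing NLP.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One contextual vector per (sub)word token, 768 dimensions for the base model.
    print(outputs.last_hidden_state.shape)  # torch.Size([1, n_tokens, 768])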
Transfer learning in NLP
http://jalammar.github.io/illustrated-bert/
NLP’s ImageNet moment
http://ruder.io/nlp-imagenet/
Cyberbullying
https://kidshelpline.com.au/teens/issues/cyberbullying
PolEval 2019 Cyberbullying
http://poleval.pl/tasks/task6
Precision = 0.5
Recall = 0.5522
F1-score = 0.5248 (balanced)
Accuracy = 0.866
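The balanced F1 score above is simply the harmonic mean of precision and recall, which a couple of lines of Python confirm:

    precision, recall = 0.5, 0.5522
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 4))  # 0.5248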
🏆
4th place (theoretically, since we didn’t take part)

Model architecture: Document → Word embeddings (FastText) + Flair embeddings (forward) + Flair embeddings (backward) → Stacked embeddings → BiLSTM (with dropouts) → Linear → Harmful?
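A rough sketch of how such a stacked-embedding setup can be assembled with the Flair library; the Polish model names ("pl", "pl-forward", "pl-backward") and the hyperparameters here are assumptions, not the exact configuration used for the submission:

    from flair.data import Sentence
    from flair.embeddings import (WordEmbeddings, FlairEmbeddings,
                                  DocumentRNNEmbeddings)

    # Stack FastText word embeddings with forward and backward Flair
    # character-LM embeddings, then pool them with a bidirectional LSTM.
    document_embeddings = DocumentRNNEmbeddings(
        [WordEmbeddings("pl"),             # Polish FastText vectors
         FlairEmbeddings("pl-forward"),    # forward character language model
         FlairEmbeddings("pl-backward")],  # backward character language model
        hidden_size=256,
        rnn_type="LSTM",
        bidirectional=True,
        dropout=0.5,
    )

    # Each tweet becomes a single document vector; Flair's TextClassifier
    # then adds a linear layer on top to predict harmful / non-harmful.
    sentence = Sentence("@anonymized_account Czyżby Madryt brał przykład z Warszawy?")
    document_embeddings.embed(sentence)
    print(sentence.get_embedding().shape)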
PolEval 2019 Cyberbullying
http://poleval.pl/tasks/task6
@anonymized_account Czyżby Madryt brał przykład z Warszawy?
[Translation: Could it be that Madrid is taking its cue from Warsaw?]
@anonymized_account @anonymized_account No to Skończmy k**** z tym wersalem w j****** szczujni
[Translation: Well then, let's f****** put an end to these niceties in that f****** hate-mongering outlet]
The pros and cons
Shallow embeddings (Word2Vec, FastText etc.)
Pros:
• Easy to train
• Small
• A lot of existing models
Cons:
• Same embedding for different meanings
• May have issues with inflection
• May have issues with out-of-vocabulary (OOV) words
Contextualized embeddings (ELMo, Flair etc.)
Pros:
• Embeddings based on the context
• Moderate size and training speed
• Existing models
• No OOV problem
Cons:
• Require an extra network architecture
• LSTMs are rather slow
• Should be used along with shallow embeddings
Transformer-based models (e.g. BERT)
Pros:
• Task-agnostic model
• Can be used as embeddings or fine-tuned (see the sketch after this list)
• Existing models
• Faster than LSTMs
• No OOV problem
Cons:
• Can be really large
• Hard to fine-tune and even harder to pre-train from scratch (TPUs almost a must)
• Multilingual versions are very large
https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html
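To illustrate the "used as embeddings or fine-tuned" point from the list above, here is a minimal fine-tuning sketch with the Hugging Face transformers library and the public multilingual BERT checkpoint; the two example texts and labels are placeholders, not real data:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    # Placeholder mini-batch: two texts with binary harmful / non-harmful labels.
    batch = tokenizer(["example harmful tweet", "example harmless tweet"],
                      padding=True, return_tensors="pt")
    labels = torch.tensor([1, 0])

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    outputs = model(**batch, labels=labels)  # the model adds a classification head
    outputs.loss.backward()                  # one gradient step of fine-tuning
    optimizer.step()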
Thank you!
Questions?

DSS 2019 Transfer Learning in NLP