Small Data for Big Problems
Practical Transfer Learning for NLP
Who Am I?
• Founder & CTO, Indico
• Research done at Olin College of Engineering
• Indico focuses on Intelligent Process Automation for Unstructured Content
• Leverages Indico innovation in Transfer Learning for text and image content
Agenda
• Overview of Traditional Approaches to Feature Engineering in NLP
• Introduction to Transfer Learning and Text Embeddings
• Word Embeddings vs. Text Embeddings
• Takeaways and Resources
Assumed Knowledge
• Traditional NLP Basics (e.g. tf-idf vectors)
• Traditional Data Science Basics (e.g. Logistic Regression)
• Generic Math Background (e.g. Vector Spaces)
The Problem With Text
John Malkovitch plays tennis in Winchester. He has been reporting soreness in his elbow. His 60th birthday is in two weeks. After he returns from his birthday trip to Casablanca we will recommend a steroid shot to reduce inflammation.
Feature(s)
• Name
The Problem With Text
(example text as above)
Feature(s)
Name
Traditional Solution(s)
• Tf-idf
• Soundex/NYSIIS encoding
• Ignore – low algorithmic value
The Problem With Text
(example text as above)
Feature(s)
Name
Issue(s)
• Out of Vocabulary
Traditional Solution(s)
• Tf-idf
• Soundex/NYSIIS encoding
• Ignore – low algorithmic value
The Problem With Text
(example text as above)
Feature(s)
• Gender
• Location
• Age
The Problem With Text
(example text as above)
Feature(s)
• Gender
• Location
• Age
Traditional Solution(s)
• Tf-idf
• Hand-coded features (e.g. gender)
• Location dictionary
The Problem With Text
(example text as above)
Feature(s)
• Gender
• Location
• Age
Issue(s)
• Local Context: His birthday vs
his daughter’s birthday
• Brittle gender detection
• Location detection
Traditional Solution(s)
• Tf-idf
• Hand-coded features (e.g. gender)
• Location dictionary
The Problem With Text
(example text as above)
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
The Problem With Text
(example text as above)
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
Traditional Solution(s)
• Tf-idf
• Parse trees (soreness → elbow)
• Domain-specific lexicon
The Problem With Text
(example text as above)
Feature(s)
• Activity
• Prior Affliction/Treatment
• Travel
Issue(s)
• Linguistic Context (Semantics)
• Error-prone parse trees
• Maintaining the lexicon
Traditional Solution(s)
• Tf-idf
• Parse trees (soreness → elbow)
• Domain-specific lexicon
The Problem With Text
Problem: Linguistic Context
• Traditional Solutions: Stemming, Synonym sets, Lexicons
• Traditional Problems: Brittle, Labor-intensive, Messy real-world data
Problem: Local Context
• Traditional Solutions: Parse trees, N-grams, Phrase lexicon
• Traditional Problems: Inaccurate parsing, Limited context, Messy real-world data
Problem: Out of Vocabulary Issues
• Traditional Solutions: Lemmatization, Expanded vocabulary, Ignore
• Traditional Problems: Computationally expensive, Diminishing returns, Messy real-world data
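The tf-idf baseline that recurs in the solutions above can be sketched in a few lines of pure Python. This is a minimal illustration; the toy corpus and tokenization are hypothetical, and real pipelines add smoothing, normalization, and sparse storage.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency.
        out.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

# Hypothetical toy corpus loosely based on the running example.
docs = [["steroid", "shot", "elbow"], ["elbow", "soreness"], ["birthday", "trip"]]
weights = tfidf(docs)
```

Terms unique to one document (like "steroid") receive a higher idf than "elbow", which appears in two documents, so tf-idf down-weights common terms exactly as the slides assume.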
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Enter Embeddings: Transfer Learning
What is an Embedding?
Text Space (e.g. English) → Embedding Method (e.g. Word2Vec) → Embedding Space (e.g. ℝ^300)
[0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]
What is an Embedding?
Text Space (e.g. English) → Embedding Method (e.g. Word2Vec) → Embedding Space (e.g. ℝ^300)
[0.1, 0.2, 0.8, 0.1, 0.3, 0.6, 0.8, 0.3, …]
Linguistic Context (e.g. Wikipedia) feeds the Embedding Method
Pitfalls
• Sufficient, Diverse Linguistic Context
• Clean Test/Train Splits
• The Curse of Dimensionality
• Effective Benchmarking
King − man + woman ≈ Queen (Royalty)
How do Embeddings Work?
• Meaning is “encoded” into the
embedding space
• Individual dimensions are not
human interpretable
• Embedding method learns by
examining large corpora of
generic language
• Goal is accurate language
representation as a proxy for
downstream performance
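The king/queen analogy above can be reproduced with plain vector arithmetic and cosine similarity. The 4-dimensional vectors here are made-up stand-ins for real learned embeddings (e.g. 300-d GloVe vectors), chosen only so the analogy works:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors; real embeddings are learned from large corpora.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

# king - man + woman lands nearest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
```

No individual dimension here "means" royalty or gender; the relationship is encoded in the geometry, which is the point of the slide.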
“Word” Embeddings
Examples
• Word2vec
• GloVe
• fastText
“Word” Embeddings
Token Value
“great” [0.1, 0.3, …]
… …
Examples In Practice
• Word2vec
• GloVe
• fastText
“Word” Embeddings
Token Value
“great” [0.1, 0.3, …]
… …
Examples In Practice
Training
• CBOW: The quick brown fox _____ over the lazy dog (predict the center word from its context)
• Skip-gram: ___ ___ ____ ___ jumps ___ __ ___ ___ (predict the context from the center word)
• Word2vec
• GloVe
• fastText
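The skip-gram objective above trains on (center, context) pairs drawn from a sliding window. A minimal sketch of how those pairs are generated (the window size is a free parameter; real implementations also subsample frequent words):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every token within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sent = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sent)
# "jumps" is paired with its four neighbours: brown, fox, over, the
```

CBOW simply inverts each pair: the model sees the whole context window and predicts the center word instead.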
Do They Really Preserve Algorithmic Value?
• Embeddings generally
outperform raw text at low data
volumes
• Leveraging large, generic text
corpora improves
generalizability
• This is 4-year-old tech; embeddings have improved drastically since, while raw text features have not.
Reported numbers are the average of 5 runs of randomly sampled test/train splits, each reporting the average of a 5-fold CV within which Logistic Regression hyperparameters are optimized. Generated using Enso.
[Chart: GloVe Benchmark (Movie Review Sentiment Analysis) — Accuracy (0.5–0.9) vs. Number of Data Points (50–500); series: tf-idf, GloVe]
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Text Embeddings
Examples
• Doc2vec
• ELMo
• ULMFiT
Text Embeddings
Examples
In Practice
Often built on top of pre-trained word embeddings
• Doc2vec
• ELMo
• ULMFiT
Text Embeddings
Examples In Practice
Training
"The quick brown fox jumps over the lazy" → sequence of per-token embedding vectors →
• Language-model objective: predict the next word ("dog")
• Supervised objective: predict the task label (True)
Often built on top of pre-trained word embeddings
• Doc2vec
• ELMo
• ULMFiT
Text Embeddings
CNN-Style
"The quick brown fox jumps over the lazy" → token-embedding matrix → convolution + max-pooling → Prediction
https://arxiv.org/pdf/1408.5882.pdf
Example
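A minimal NumPy sketch of the CNN-style encoder: filters slide over windows of consecutive token vectors and the results are max-pooled into a fixed-size feature vector. Random vectors stand in for pre-trained embeddings, and the filters here are random rather than learned as in the linked paper:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the quick brown fox jumps over the lazy".split()
d, k, n_filters = 8, 3, 4          # embedding dim, filter width, filter count

# Hypothetical random embeddings standing in for pre-trained word vectors.
emb = {w: rng.standard_normal(d) for w in set(tokens)}
x = np.stack([emb[w] for w in tokens])                    # (8 tokens, d)

filters = rng.standard_normal((n_filters, k, d))

# Apply each filter to every window of k consecutive tokens,
# then max-pool over positions to get one feature per filter.
windows = np.stack([x[i:i + k] for i in range(len(tokens) - k + 1)])  # (6, k, d)
feature_maps = np.einsum('wkd,fkd->wf', windows, filters)             # (6, n_filters)
features = feature_maps.max(axis=0)                       # (n_filters,) pooled features
```

Max-pooling is what makes the output length-independent: however many tokens come in, the classifier always sees `n_filters` numbers.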
Text Embeddings
RNN-Style
"The quick brown fox jumps over the lazy" → token vectors fed in sequence → Memory state (initialized to zeros) updated through a σ unit at each step → Output → Prediction
https://arxiv.org/pdf/1802.05365.pdf
Example
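The memory-update loop above can be sketched as a vanilla RNN step in NumPy. Random weights and embeddings are hypothetical, and real models such as ELMo use LSTM cells with gating, not this plain σ update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
tokens = "the quick brown fox jumps over the lazy".split()
d, h = 8, 5                          # embedding dim, hidden ("memory") size

emb = {w: rng.standard_normal(d) for w in set(tokens)}   # hypothetical vectors
W_x = rng.standard_normal((h, d)) * 0.1                  # input-to-hidden weights
W_h = rng.standard_normal((h, h)) * 0.1                  # hidden-to-hidden weights

state = np.zeros(h)                  # memory starts at all zeros, as in the diagram
for w in tokens:                     # feed the token vectors in order
    state = sigmoid(W_x @ emb[w] + W_h @ state)

# `state` now summarizes the whole sequence and would feed a prediction layer.
```

Because each update mixes the current token with the previous state, the final vector carries local context from the entire sentence, which is what the word-embedding table alone cannot do.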
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Problems with
Small Data
The Power of Context
We used a bytepair encoding (BPE) vocabulary…
significantly improving upon the state of the art in 9 out of
the 12 tasks studied
- Improving Language Understanding by Generative Pre-Training*
* https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
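Byte-pair encoding itself is simple to sketch: repeatedly merge the most frequent adjacent symbol pair into a new vocabulary symbol. A character-level toy version (the word frequencies are hypothetical, and production BPE as used in the quoted paper differs in details like end-of-word markers):

```python
from collections import Counter

def bpe_merges(words, n_merges):
    """Learn byte-pair-encoding merges from a toy word-frequency dict."""
    vocab = {tuple(w): c for w, c in words.items()}   # word as symbol tuple -> count
    merges = []
    for _ in range(n_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for sym, c in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_vocab = {}
        for sym, c in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges

merges = bpe_merges({"lower": 5, "lowest": 2, "newer": 6}, n_merges=2)
```

Because every word decomposes into known subword symbols, BPE sidesteps the out-of-vocabulary problem from the earlier slides while keeping the vocabulary small.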
Problems with
Small Data
Add Linguistic Context (Semantics)
Add Local Context
Prevent Out of Vocabulary Issues
Do They Really Preserve Algorithmic Value?
• Newer transfer learning techniques have made deep learning at low data volumes tractable
• Even when operating on top of byte-pair encodings, sufficient context is retained to achieve state-of-the-art performance
• 4x error reduction over tf-idf
Reported numbers are the average of 5 runs of randomly sampled test/train splits
each reporting the average of a 5-fold cv, within which Logistic Regression
hyperparameters are optimized. Generated using Enso
[Chart: Finetune Benchmark (Movie Review Sentiment Analysis) — Accuracy (0.5–0.9) vs. Number of Data Points (50–500); series: tf-idf, GloVe, Finetune]
Takeaways
• At low data volumes, embeddings drastically improve accuracy via transfer learning
• The transfer learning space moves very quickly: GloVe adoption is still low, yet GloVe is already out of date
• This is just basic framing; practical use of embeddings is more complex. See our session at DSS to learn more
Resources
• GitHub library – Finetune (https://github.com/indicodatasolutions/finetune)
• GitHub library – Enso (https://github.com/indicodatasolutions/enso)
• Indico Machine learning newsletter
(indico.io)
• Deep Learning Book
(https://www.deeplearningbook.org/)
Questions?
• slater@indico.io
• Quora: https://www.quora.com/profile/Slater-Ryan-Victoroff
The Real Problem With Text
Select Features → Optimize Hyperparameters → Test/Train Split → Train Model → Evaluate Errors and View Test Error
Feature Engineering?
Standard Data Science?
The Real Problem With Text
(workflow as above)
Overfitting
Test/Train Contamination
The Real Problem With Text
(workflow and issues as above)
Manual feature engineering leads to inaccurate perceptions of performance
