AN EXPLORATION OF NON-LABEL-PRESERVING DATA AUGMENTATIONS FOR ACTIVE LEARNING
1. AN EXPLORATION OF NON-LABEL-PRESERVING DATA AUGMENTATIONS
Jonathan Zarecki
To appear in IJCAI 2020 as "Textual Membership Queries"
2. About me
◦ Jonathan Zarecki
◦ MSc in ML & Active Learning with Prof. Shaul Markovitch (Technion)
◦ Currently pursuing a PhD in CS with Prof. Gal Chechik (BIU & Nvidia)
3. Overview
◦ Potential problems of traditional data augmentations in text
◦ Quick overview of active-learning
◦ Definition of new textual modification operators
◦ Applying heuristic-search with modification operators for active-learning
◦ Empirical evaluation of this method on several datasets
5. Data Augmentations (quickly) – Now with text
EDA – Wei & Zou (EMNLP 19) applies four operations to a sentence such as "Batman is really awesome":
◦ Random deletion → "is really awesome"
◦ Random insertion → "Batman is really not awesome"
◦ Random swap → "Awesome is really Batman"
◦ Synonym replacement → "Batman is really great"
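The four EDA operations can be sketched on token lists. The following is a minimal illustration (not the EDA authors' code), with positions hand-picked to reproduce the slide's examples rather than chosen randomly:

```python
import random

def random_deletion(tokens, i):
    """Drop the token at position i (EDA: random deletion)."""
    return tokens[:i] + tokens[i + 1:]

def random_insertion(tokens, word, i):
    """Insert a word at position i (EDA: random insertion)."""
    return tokens[:i] + [word] + tokens[i:]

def random_swap(tokens, i, j):
    """Swap the tokens at positions i and j (EDA: random swap)."""
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

def synonym_replacement(tokens, i, synonyms):
    """Replace the token at position i with a random synonym."""
    out = list(tokens)
    out[i] = random.choice(synonyms)
    return out

sent = "Batman is really awesome".split()
print(random_deletion(sent, 0))                 # ['is', 'really', 'awesome']
print(random_insertion(sent, "not", 3))         # ['Batman', 'is', 'really', 'not', 'awesome']
print(random_swap(sent, 0, 3))                  # ['awesome', 'is', 'really', 'Batman']
print(synonym_replacement(sent, 3, ["great"]))  # ['Batman', 'is', 'really', 'great']
```

In real EDA the positions and synonyms are sampled at random, which is exactly what makes validity hard to guarantee, as the next slide shows.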
6. Data Augmentations (quickly) – Now with text
◦ In textual augmentations it is not always trivial to keep the sentence valid or readable.
(Same EDA examples as the previous slide: random deletion, random insertion, random swap, synonym replacement. EDA – Wei & Zou, EMNLP 19.)
7. Non-Restrictive Textual Augmentations
◦ What happens if we let loose and apply any augmentation we want?
"My favorite movie so far"
→ (add "computer") "My computer favorite movie so far"
→ (remove "far") "My computer favorite movie so"
10. Non-Restrictive Textual Augmentations
◦ But let's leave unreadable sentences aside.
◦ Another important property of more expressive augmentations is that the label might change!
"Batman is really awesome" → "Batman is really bad"
11. Non-Label-Preserving (LP) Augmentations
We want augmentations which will:
1. Change the sentence's meaning significantly
2. Keep the sentence fully readable
(Somewhat) unlike image augmentations.
Using more expressive textual augmentations risks making the resulting sentence gibberish or completely changing its label.
Since we do not know a generated example's label, we arrive at the field of active learning.
12. Overview (progress marker; current item: potential problems of traditional data augmentations in text)
13. Overview (progress marker; current item: quick overview of active-learning)
22. Overview (progress marker; current item: quick overview of active-learning)
23. Overview (progress marker; current item: definition of new textual modification operators)
24. Why are textual modifications hard?
◦ When sentences are not built carefully they can easily become unreadable:
◦ Sentences have to comply with syntactic rules.
◦ But also with semantic rules.
"Took I the dog to" – does not comply with syntactic rules.
"I ate a book for breakfast" – does not "make sense".
25. Modification Operators Definition
◦ First we find all "replaceable words" in the sentence:
◦ Nouns, verbs & adjectives
◦ For each replaceable word we look at the knowledge-base and find words to replace it.
◦ All options returned are the modification operators for a given sentence.
Example – "I hate all the cats":
◦ hate → despise, adore, dislike, detest
◦ cats → dogs, wolves, lions, pigs
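The operator definition above can be sketched as follows. The toy dictionary stands in for the knowledge-base (the talk uses Dependency Word2vec), and the helper names are illustrative, not from the paper:

```python
# Toy knowledge-base mapping replaceable words (nouns, verbs, adjectives)
# to functionally similar candidates.  In the talk this role is played by
# Dependency Word2vec; the dict here is only illustrative.
KB = {
    "hate": ["despise", "adore", "dislike", "detest"],
    "cats": ["dogs", "wolves", "lions", "pigs"],
}

def modification_operators(tokens):
    """Enumerate all single-word-replacement operators for a sentence.

    Each operator is a (position, replacement) pair; applying it yields
    a new, fully readable sentence whose label may have changed.
    """
    return [(i, rep)
            for i, tok in enumerate(tokens)
            for rep in KB.get(tok.lower(), [])]

def apply_operator(tokens, op):
    """Apply one (position, replacement) operator to a token list."""
    i, rep = op
    return tokens[:i] + [rep] + tokens[i + 1:]

sent = "I hate all the cats".split()
ops = modification_operators(sent)
print(len(ops))                                   # 8: four for "hate", four for "cats"
print(" ".join(apply_operator(sent, ops[0])))     # I despise all the cats
```

A real implementation would first POS-tag the sentence to restrict replacements to nouns, verbs and adjectives; the dictionary lookup here skips that step for brevity.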
26. So how does it look?
Starting from "I hate all the cats" (hate speech), the operators produce:
◦ "I hate all the dogs"
◦ "I despise all the cats"
◦ "I adore all the cats"
◦ "I adore all the dogs" (non-hate speech – the label changed)
27. Semantic Knowledge-bases
◦ In order for our modification operators to work we need to find meaningful replacements.
◦ Replacements should be functionally similar – behave the same as the replaced word.
◦ We need a knowledge-base where we can find such words.
◦ Options include word2vec, WordNet and more.
◦ We chose Dependency Word2vec (Levy & Goldberg, ACL 2014) as our knowledge-base.
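Whatever knowledge-base is chosen, candidate replacements are ranked by vector similarity. A minimal sketch with hand-made toy vectors (a real system would load pretrained Dependency Word2vec embeddings from disk; the vectors and function names here are assumptions for illustration):

```python
import math

# Toy 3-d embeddings standing in for a pretrained model such as
# Dependency Word2vec.
VECS = {
    "hate":    (1.0, 0.1, 0.0),
    "despise": (0.9, 0.2, 0.1),
    "adore":   (0.8, 0.3, 0.2),
    "table":   (0.0, 1.0, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word, k=2):
    """Return the k words most similar to `word` in the knowledge-base."""
    q = VECS[word]
    cands = [(w, cosine(q, v)) for w, v in VECS.items() if w != word]
    return [w for w, _ in sorted(cands, key=lambda p: -p[1])[:k]]

print(nearest("hate"))  # ['despise', 'adore'] – the unrelated "table" is ranked out
```

The whole point of preferring Dependency Word2vec is that its nearest neighbours tend to be functionally similar (drop-in replacements) rather than merely topically related, as the next slides show.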
28. Qualitative analysis of the knowledge-bases
Dependency Word2vec (Levy & Goldberg 2014) introduces a subtle change in the word2vec context.
Example sentence: "Australian scientist discovers star with telescope"
◦ Word2vec: contexts are the linearly adjacent words.
◦ Dependency word2vec: contexts are the words connected by dependency arcs (nsubj, dobj, prep_with).
29. Qualitative analysis of the knowledge-bases
Functional similarity is exhibited very well in Dependency Word2vec. Nearest neighbours of "hogwarts":
◦ w2v: dumbledore, hallows, half-blood, malfoy, snape (related to Harry Potter)
◦ Dep w2v: sunnydale, collinwood, calarts, greendale, millfield (schools)
30. Full example of modification operators
"Batman is really awesome":
◦ Batman → superman, superboy, supergirl, catwoman, aquaman
◦ awesome → terrific, marvelous, wonderful, lousy, awful
Further analysis of 4 different knowledge-bases can be found in the full paper.
31. Overview (progress marker; current item: definition of new textual modification operators)
32. Overview (progress marker; current item: applying heuristic-search with modification operators for active-learning)
33. Stochastic Synthesis Algorithm
◦ A simple way to use the operators is to apply them randomly.
◦ Until enough instances have been generated:
1. Randomly choose an instance from the available examples
2. Apply a random operator to it
3. Return it as a new MQ (membership query)
(Diagram: each example 𝜙ᵢ branches into modified examples 𝜙ᵢ¹, 𝜙ᵢ², 𝜙ᵢ³.)
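The three steps above can be sketched directly. This is a schematic of the stochastic approach, not the paper's implementation; `pool` and `operators_for` stand in for the labeled examples and the operator enumeration from the previous slides:

```python
import random

def stochastic_synthesis(pool, operators_for, n, rng=random):
    """Stochastic MQ synthesis: generate n membership queries by
    repeatedly applying a random operator to a random example."""
    queries = []
    while len(queries) < n:
        sent = rng.choice(pool)               # 1. random instance
        op = rng.choice(operators_for(sent))  # 2. random operator
        queries.append(op(sent))              # 3. return as a new MQ
    return queries

# Toy usage: a single operator that appends an exclamation mark.
pool = ["Batman is really awesome", "I hate all the cats"]
ops = lambda s: [lambda t: t + " !"]
print(stochastic_synthesis(pool, ops, 3))
```

Because every choice is uniform, nothing steers generation toward informative examples; that is the gap the search-based variant below closes.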
34. Using search algorithms to generate examples
◦ Repeatedly applying these operators gives us many options.
◦ Using search algorithms we can actively look for the most informative examples.
◦ But how do we direct the search?
(Diagram: a search tree – applying operators to 𝜙₁ yields 𝜙₂, applying operators to 𝜙₂ yields 𝜙₃, and so on.)
37. Search Heuristic Function
◦ To direct the search we need a function that gives a higher score to more informative instances.
◦ We used existing active-learning functions:
◦ Uncertainty sampling (Lewis & Gale, 1994)
◦ Expected model change (Lindenbaum, Markovitch, & Rusakov, 2004)
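Uncertainty sampling can be written as an entropy score over the model's predicted class probabilities. A minimal sketch (one common formulation of the idea, not necessarily the paper's exact scoring):

```python
import math

def uncertainty(probs):
    """Entropy-based uncertainty sampling score: higher means the model
    is less sure about the instance, so labeling it is more informative."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A 50/50 prediction is maximally informative; a confident one is not.
print(uncertainty([0.5, 0.5]) > uncertainty([0.99, 0.01]))  # True
print(uncertainty([1.0, 0.0]) == 0.0)  # True – a certain prediction has zero entropy
```

Plugged into a search, this function scores every candidate sentence produced by the modification operators.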
38. Heuristic-Search Generation
◦ Similar to the stochastic approach, but apply a search algorithm to pick the best example.
◦ Until enough instances have been generated:
1. Randomly choose an instance from the available examples
2. Run a heuristic search on that instance
3. Return the result as a new example
(Diagram: uncertainty sampling directs the search from 𝜙₁ through 𝜙₂ to 𝜙₃.)
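The hill-climbing variant of step 2 can be sketched generically. This is a schematic of greedy local search under the stated heuristic, not the paper's code; the toy usage climbs on integers purely to keep the example self-contained:

```python
def hill_climb(start, neighbors, score, max_steps=10):
    """Greedy hill-climbing: repeatedly move to the highest-scoring
    neighbor until no single modification improves the heuristic.

    `neighbors(s)` enumerates states one modification operator away and
    `score` is the active-learning heuristic (e.g. uncertainty sampling).
    """
    current = start
    for _ in range(max_steps):
        cands = neighbors(current)
        if not cands:
            break
        best = max(cands, key=score)
        if score(best) <= score(current):
            break  # local optimum: no operator increases informativeness
        current = best
    return current

# Toy usage on integers: the heuristic peaks at 7, so a climb from 3
# walks upward one step at a time and stops at the peak.
print(hill_climb(3, lambda n: [n - 1, n + 1], lambda n: -abs(n - 7)))  # 7
```

In the talk's setting, `neighbors` is the operator enumeration from slide 25 and `score` is uncertainty sampling; beam search replaces the single `current` state with the k best candidates at each step.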
39. Heuristic-Search Generation
(Same procedure as the previous slide; the diagram advances to the next instance, with uncertainty sampling directing a fresh search from 𝜙′₁ through 𝜙′₂ to 𝜙′₃.)
40. Overview (progress marker; current item: applying heuristic-search with modification operators for active-learning)
41. Overview (progress marker; current item: empirical evaluation of this method on several datasets)
42. Sentence Quality – Human Evaluation
◦ We already discussed how hard generating a readable sentence can be – do these operators cope with that?
◦ We randomly chose 1000 sentences from each category and asked: "Is this sentence fully readable to you?"
◦ Original sentences: 96% answered yes
◦ HS (heuristic-search) sentences: 95% answered yes
◦ Wikipedia LSTM sentences: 21% answered yes
46. Datasets
◦ Sentiment Analysis:
◦ CMR: Cornell sentiment polarity dataset
◦ SST: Stanford sentiment treebank, a sentence sentiment analysis dataset
◦ KS: A Kaggle short sentence sentiment analysis dataset
◦ Subjectivity/Objectivity Detection
◦ SUBJ: Cornell sentence subjective / objective dataset
◦ Offensive-language and Hate-speech Detection:
◦ HS: Hate speech and offensive language classification dataset
47. Compared Methods
◦ Our methods:
◦ Uncertainty sampling Hill-climbing MQ synthesis (US-HC-MQ)
◦ Uncertainty sampling Beam-search MQ synthesis (US-BS-MQ)
◦ Stochastic Synthesis (S-MQ)
◦ Competitor methods:
◦ WordNet-based Synonym-replacement (WNA) (Lecun et al. 2016)
◦ Original examples (IDEAL)
◦ LSTM Generator (RNN) – pretrained on English Wikipedia (uses unlabeled data)
48. Results – Experiment 1
• We can see that our methods consistently improved the initial accuracy.
• The search-based methods are superior on almost all datasets.
53. What did we see?
◦ Potential problems of non-restrictive augmentations in text
◦ Definition of new modification operators (= non-label-preserving augmentations) in the textual domain
◦ Using heuristic-search with modification operators for generating new examples for active-learning
◦ Empirical evaluation of this method on several datasets
54. Thank You! Questions?
(The closing slide applies the operators to a "thanks-for-coming" sentence, "I want to thank you for coming!":
◦ "I want to thank you for arriving!"
◦ "I would-like to thank you for coming!"
◦ "I want to condemn you for coming!"
◦ "I want to thank you for going!")
Editor's Notes
Hello everyone, my name is Jonathan and today I'll take you through a tour of non-label-preserving augmentations: what label-preserving means and why we might not want it, their pros and cons, and my personal work (to appear in IJCAI20) on defining such operators in the textual domain.
Now lets get started
Traditional = מסורתי (Hebrew vocabulary reminder for the speaker)
But we’ll never see augmentations that change this butterfly image to…
Let’s get back to text
Punchline, augmentation in NLP is not trivial like in CV
So the only real “augmentation” from EDA is the synonym replacement.
Even easier to make unreadable sentences
Guaranteed for those who click on that link
If we start with
The label is totally different !
Next slide is the summary
Main motivation slide. Iron out message
Let’s do a quick overview of AL, as we’ll revisit the subject during the rest of the talk
That’s it for the introduction
The experiment was repeated 20 times for statistical significance.
Semantic rules = ‘make sense’
So how did we build the operators in the instance space.
For these reasons we have to make sure that our operators keep the resulting sentences legal English sentences.
What’s a knowledge-base ? We’ll get to that later
In order to find
Dep w2v was designed to exhibit these two properties and this example shows it well.
Where w2v returned words with related meanings, dep w2v returned other scientists.
This property is repeated in many cases; more details can be found in the original dep w2v paper (highly recommended).
In word2vec for example we will get topically related words such as “dc comics” for batman
Next slide is search algs’
Now lets look at how its done
I left notes to relevant papers in the
Next slide is empirical evaluation
The RNN wasn't able to generate proper sentences with only 10 training instances, e.g.:
"I am movie ."
"x this film ."
Original AL setup
The experiment was repeated 20 times for statistical significance.
We test our framework on 5 datasets: 3 sentiment analysis datasets, one subjectivity/objectivity dataset and one hate-speech and offensive language detection dataset.
Objective: "the movie begins in the past where a young boy named sam . . . (attempts to save celebi from a hunter.)"
Subjective: "I really liked the movie it was a-lot of fun"
We compared 2 search-based methods, one using hill-climbing as the search algorithm and another using beam-search. Both used uncertainty sampling as their heuristic.
As there are no other works that perform textual MQs, we chose 3 competitors that do similar augmentation/generation to compare with. First, we used an "upper-limit" method we call IDEAL; this method picks the most informative original examples from the pool.
Show with pointer what I’m talking about
As we can see the search based-methods (blue, red) are superior in almost all datasets.
Another interesting point we can see in the graphs is the squeezing of information I talked about in the example. Initially there is a lot of information to extract from the existing examples and we see high accuracy gains, but after adding a few examples most information has already been extracted and the plot stops rising (converges). This is a good sign showing that our initial intuition still holds in this case.
The experiment was repeated 20 times for statistical significance.
We can see a clear hierarchy here between uncertainty HC, random HC and S-MQ
This reinforces our hypothesis that using the more sophisticated approaches result in better instances. At the very least it results in more label changes.
The idea is very general, and might be useful in other domains where even unlabeled data is scarce.
I wanted to thank everyone who came today for the support along this far-from-simple journey. I especially wanted to thank my professor, Shaul Markovitch, who gave me amazing freedom in this work, supported me all the way, and even let me travel to South America for 3 months during my studies, something decidedly uncommon here in the faculty. So thank you all, and thank you Shaul.