Johns Hopkins University
30 May 2018
Learning with Limited Labelled Data in NLP
Multi-Task Learning and Beyond
Isabelle Augenstein
augenstein@di.ku.dk
@IAugenstein
http://isabelleaugenstein.github.io/
Research Group
2
Johannes Bjerva, Postdoc: computational typology and low-resource learning
Yova Kementchedjhieva, PhD Fellow (co-advised w. Anders Søgaard): morphological analysis of low-resource languages
Ana Gonzalez, PhD Fellow (co-advised w. Anders Søgaard): multilingual question answering for customer service bots
Mareike Hartmann, PhD Fellow (co-advised w. Anders Søgaard): detecting disinformation with multilingual stance detection
Learning with Limited Labelled Data: Why?
3
General Challenges
- Manually annotating training data is expensive
- Only a few large NLP datasets
- New tasks and domains
- Domain drift
Multilingual and Diversity Aspects
- Underrepresented languages
- Dialects
Learning with Limited Labelled Data: How?
4
- Domain Adaptation
- Weakly Supervised Learning
- Distant Supervision
- Transfer Learning
- Multi-Task Learning
- Unsupervised Learning
General Research Overview
5
- Learning with Limited Labelled Data
- Multi-Task Learning
- Semi-Supervised Learning
- Distant Supervision
- Multilingual Learning
- Computational Typology
- Information Extraction
- Stance Detection
- Fact Checking
- Representation Learning
- Question Answering
This Talk
6
Part 1: General Challenges
- Method: combining multitask and semi-supervised
learning
- Application: very similar pairwise sequence
classification tasks
Part 2: Multilingual and Diversity Aspects
- Method: unsupervised and transfer learning
- Application: predicting linguistic features of languages
Multi-task Learning of Pairwise
Sequence Classification Tasks Over
Disparate Label Spaces
Isabelle Augenstein*, Sebastian Ruder*,
Anders Søgaard
NAACL HLT 2018 (long), to appear
*equal contributions
7
Problem
8
- Different NLU tasks (e.g. stance detection, aspect-based
sentiment analysis, natural language inference)
- Limited training data for most individual tasks
- However:
- they can be modelled with same base neural model
- they are semantically related
- they have similar labels
- How to exploit synergies between those tasks?
Datasets and Tasks
Topic-based sentiment analysis:
Tweet: No power at home, sat in the dark listening to AC/DC in the hope it’ll make the electricity come back again
Topic: AC/DC
Label: positive

Target-dependent sentiment analysis:
Text: how do you like settlers of catan for the wii?
Target: wii
Label: neutral

Aspect-based sentiment analysis:
Text: For the price, you cannot eat this well in Manhattan
Aspects: restaurant prices, food quality
Label: positive
9
Stance detection:
Tweet: Be prepared - if we continue the policies of the liberal left, we will be #Greece
Target: Donald Trump
Label: favor

Fake news detection:
Document: Dino Ferrari hooked the whopper wels catfish, (...), which could be the biggest in the world.
Headline: Fisherman lands 19 STONE catfish which could be the biggest in the world to be hooked
Label: agree

Natural language inference:
Premise: Fun for only children
Hypothesis: Fun for adults and children
Label: contradiction
Multi-Task Learning
10–13
(architecture figure built up incrementally; full annotations on slide 14)
Multi-Task Learning
14
Shared hidden layers
Separate inputs for each task
Separate output layers + classification functions
Negative log-likelihood objectives
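A minimal sketch of this setup follows (not the authors' released code; the LSTM encoder and all dimensions are assumptions): a shared encoder feeds one softmax output layer per task, each trained with a negative log-likelihood objective.

```python
# A minimal sketch of hard parameter sharing (assumptions: LSTM encoder, sizes):
# a shared encoder, one output layer per task, one NLL objective per task.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        _, (h, _) = self.lstm(self.embed(token_ids))
        return h[-1]                               # (batch, hidden_dim)

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size, labels_per_task, hidden_dim=128):
        super().__init__()
        self.encoder = SharedEncoder(vocab_size, hidden_dim=hidden_dim)
        # separate output layer (classification function) for each task
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in labels_per_task])

    def forward(self, token_ids, task_id):
        return self.heads[task_id](self.encoder(token_ids))

model = MultiTaskModel(vocab_size=10000, labels_per_task=[3, 2, 4])
nll = nn.CrossEntropyLoss()                        # negative log-likelihood over a softmax
x = torch.randint(0, 10000, (8, 20))               # dummy batch for task 0
y = torch.randint(0, 3, (8,))
loss = nll(model(x, task_id=0), y)
loss.backward()
```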
Goal: Exploiting Synergies between Tasks
15
- Modelling tasks in a joint label space
- Label Transfer Network that learns to transfer labels
between tasks
- Use semi-supervised learning, trained end-to-end with
multi-task learning model
- Extensive evaluation on a set of pairwise sequence
classification tasks
Related Work
16
- Learning task similarities
- Enforce clustering of tasks (Evgeniou et al., 2005; Jacob et al., 2009)
- Induce shared prior (Yu et al., 2005; Xue et al., 2007; Daumé III,
2009)
- Learn grouping (Kang et al., 2011; Kumar and Daumé III, 2012)
- Only works for homogeneous tasks with the same label spaces
- Multi-task learning with neural networks
- Hard parameter sharing (Caruana, 1993)
- Different sharing structures (Søgaard and Goldberg, 2016)
- Private and public subspaces (Liu et al., 2017; Ruder et al., 2017)
- Training on disparate annotation sets (Chen et al., 2016; Peng et al.,
2017)
- Does not take into account similarities between label spaces
Related Work
17
- Semi-supervised learning
- Self-training, co-training, tri-training, EM, etc.
- Closest: co-forest (Li and Zhou, 2007) - each learner is improved with
unlabeled instances labeled by the ensemble consisting of all the
other learners
- Unsupervised aux tasks in MTL (Plank et al., 2016; Rei, 2017)
- Label transformations
- Use distributional information to map from a language-specific tagset
to a tagset used for other languages for cross-lingual transfer (Zhang
et al., 2012)
- Correlation analysis to transfer between tasks with disparate label
spaces (Kim et al., 2015)
- Label transformations for multi-label classification problems (Yeh et
al., 2017)
Multi-Task Learning
18
Shared hidden layers
Separate inputs for each task
Separate output layers + classification functions
Negative log-likelihood objectives
19
Best-Performing Aux Tasks
Main task | Aux tasks
Topic-2   | FNC-1, MultiNLI, Target
Topic-5   | FNC-1, MultiNLI, ABSA-L, Target
Target    | FNC-1, MultiNLI, Topic-5
Stance    | FNC-1, MultiNLI, Target
ABSA-L    | Topic-5
ABSA-R    | Topic-5, ABSA-L, Target
FNC-1     | Stance, MultiNLI, Topic-5, ABSA-R, Target
MultiNLI  | Topic-5
Trends:
• Target used by all Twitter main tasks
• Tasks with a higher number of labels (e.g. Topic-5) are used more often
• Tasks with more training data (FNC-1, MultiNLI) are used more often
Label Embedding Layer
20
Label Embedding Layer
21
Label
embedding
space
Prediction with label
compatibility function:
c(l, h) = l · h
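A minimal sketch of such an output layer (dimensions are illustrative): all labels of all tasks share one embedding matrix, and each label l is scored against the hidden representation h with the dot-product compatibility function above.

```python
# A minimal sketch of a label embedding output layer with c(l, h) = l . h.
import torch
import torch.nn as nn

class LabelEmbeddingOutput(nn.Module):
    def __init__(self, total_num_labels, hidden_dim):
        super().__init__()
        # one shared embedding matrix over the joint label space of all tasks
        self.label_emb = nn.Embedding(total_num_labels, hidden_dim)

    def forward(self, h, label_ids):
        # h: (batch, hidden_dim); label_ids: indices of the current task's labels
        L = self.label_emb(label_ids)              # (num_task_labels, hidden_dim)
        return h @ L.t()                           # compatibility scores c(l, h)

# e.g. a task whose three labels occupy indices 0..2 of the joint label space
output_layer = LabelEmbeddingOutput(total_num_labels=15, hidden_dim=128)
h = torch.randn(8, 128)
scores = output_layer(h, torch.tensor([0, 1, 2]))  # (8, 3); softmax + NLL as usual
```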
Label Embeddings
22
Label Transfer Network
Goal: learn to produce pseudo labels for target task
LTN_T = MLP([o^1, …, o^{T-1}])
o^i = ∑_{j=1}^{L_i} p_j^{T_i} · l_j
- Output label embedding o^i of task T_i: the sum of the task's label embeddings l_j, weighted by their probabilities p_j^{T_i}
- LTN: trained on labelled target task data
- Trained with a negative log-likelihood objective L_LTN to produce a pseudo-label for the target task
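A minimal sketch of the LTN (the one-hidden-layer MLP and all sizes are assumptions): each auxiliary task contributes a probability-weighted sum of its label embeddings, and an MLP over the concatenation predicts a pseudo-label for the target task.

```python
# A minimal sketch of the Label Transfer Network described above.
import torch
import torch.nn as nn

class LabelTransferNetwork(nn.Module):
    def __init__(self, num_aux_tasks, hidden_dim, num_target_labels):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_aux_tasks * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_target_labels))

    def forward(self, aux_probs, aux_label_embs):
        # aux_probs[i]:      (batch, L_i) softmax output of auxiliary task i
        # aux_label_embs[i]: (L_i, hidden_dim) label embeddings of task i
        o = [p @ l for p, l in zip(aux_probs, aux_label_embs)]  # each (batch, hidden_dim)
        return self.mlp(torch.cat(o, dim=-1))                   # target-task logits

ltn = LabelTransferNetwork(num_aux_tasks=2, hidden_dim=16, num_target_labels=3)
probs = [torch.softmax(torch.randn(4, 5), dim=-1), torch.softmax(torch.randn(4, 7), dim=-1)]
embs = [torch.randn(5, 16), torch.randn(7, 16)]
pseudo_logits = ltn(probs, embs)   # (4, 3); trained with NLL against target-task gold labels
```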
23
Semi-Supervised MTL
Goal: relabel aux task data as main task data using LTN
- LTN can be used to produce pseudo-labels for aux or
unlabelled instances
- Train the target task model on the additional pseudo-
labelled data
- Additional loss: minimise the mean squared error between the model predictions p^{T_i} and the pseudo-labels z^{T_i} produced by the LTN
24
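A minimal sketch of this additional loss term (function and variable names are illustrative, not from the paper's code):

```python
# Semi-supervised MTL loss: MSE between main-task predictions on relabelled
# auxiliary instances and the pseudo-labels produced by the LTN.
import torch
import torch.nn.functional as F

def relabelling_loss(main_logits, ltn_logits):
    p_main = F.softmax(main_logits, dim=-1)            # main-task predictions p^{T_i}
    z_ltn = F.softmax(ltn_logits, dim=-1).detach()     # pseudo-labels z^{T_i} from the LTN
    return F.mse_loss(p_main, z_ltn)

main_logits = torch.randn(4, 3, requires_grad=True)
ltn_logits = torch.randn(4, 3)
loss = relabelling_loss(main_logits, ltn_logits)       # added to the multi-task objective
```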
Label Transfer Network (w or w/o semi-supervision)
25
Relabelling
26
Overall Results
27–31
(results figures, built up incrementally)
Overall Results
- Label embeddings improve performance
- New SoA on topic-based sentiment analysis
However:
- Softmax predictions of other, even highly related tasks
are less helpful for predicting main labels than the
output layer of the main task model
- At best, learning the relabelling model alongside the main model might act as a regulariser for the main model
- Future work: use relabelling model to label unlabelled
data instead
32
Tracking Typological Traits of Uralic
Languages in Distributed Language
Representations
Johannes Bjerva, Isabelle Augenstein
IWCLUL 2018
33
From Phonology to Syntax: Unsupervised
Linguistic Typology at Different Levels with
Language Embeddings
Johannes Bjerva, Isabelle Augenstein
NAACL HLT 2018 (long), to appear
34
Linguistic Typology
35
● ‘The systematic study and comparison of language
structures’ (Velupillai, 2012)
● Long history (Herder, 1772; von der Gabelentz, 1891; …)
● Computational approaches (Dunn et al., 2011; Wälchli,
2014; Östling, 2015, ...)
Why Computational Typology?
36
● Answer linguistic research questions on large scale
● Multilingual learning
○ Language representations
○ Cross-lingual transfer
○ Few-shot or zero-shot learning
● This work:
○ Features in the World Atlas of Language Structures (WALS)
○ Computational Typology via unsupervised modelling of languages
in neural networks
Resources that exist for many languages
● Universal Dependencies (>60 languages)
● UniMorph (>50 languages)
● New Testament translations (>1,000 languages)
● Automated Similarity Judgment Program (>4,500
languages)
39
Multilingual NLP and Language Representations
● No explicit representation
○ Multilingual Word Embeddings
● Google’s “Enabling zero-shot learning” NMT trick
  ○ Language given explicitly in the input
● One-hot encodings
  ○ Languages represented as a sparse vector
● Language Embeddings
  ○ Languages represented as a distributed vector
40
(Östling and Tiedemann, 2017)
Distributed Language Representations
41
• Language Embeddings
• Analogous to Word Embeddings
• Can be learned in a neural network without supervision
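A minimal sketch of the mechanism (Östling and Tiedemann learn language embeddings inside a multilingual character-level language model; a generic multilingual tagger stands in here, and all names and sizes are assumptions): the language vector is just another trainable lookup table, concatenated to every input token and updated by backpropagation.

```python
# A minimal sketch: a language embedding learned as a by-product of training
# a shared multilingual model (illustrative architecture and sizes).
import torch
import torch.nn as nn

class MultilingualTagger(nn.Module):
    def __init__(self, vocab_size, num_languages, num_tags,
                 emb_dim=64, lang_dim=32, hidden_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.lang_emb = nn.Embedding(num_languages, lang_dim)   # the language embeddings
        self.lstm = nn.LSTM(emb_dim + lang_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: (batch,)
        toks = self.tok_emb(token_ids)
        lang = self.lang_emb(lang_id).unsqueeze(1).expand(-1, toks.size(1), -1)
        h, _ = self.lstm(torch.cat([toks, lang], dim=-1))
        return self.out(h)                                      # per-token tag scores

model = MultilingualTagger(vocab_size=5000, num_languages=100, num_tags=17)
scores = model(torch.randint(0, 5000, (2, 10)), torch.tensor([3, 42]))
# after training, model.lang_emb.weight holds the learned language embeddings
```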
Language Embeddings in Deep Neural Networks
42
Language Embeddings in Deep Neural Networks
43
1. Do language embeddings aid multilingual modelling?
2. Do language embeddings contain typological information?
Research Questions
● RQ 1: Which typological properties are encoded in task-
specific distributed language representations, and can we
predict phonological, morphological and syntactic
properties of languages using such representations?
● RQ 2: To what extent do the encoded properties change as
the representations are fine-tuned for tasks at different
linguistic levels?
● RQ 3: How are language similarities encoded in fine-tuned
language embeddings?
44
Phonological Features
45
● 20 features
● E.g. descriptions of the consonant and vowel inventories, presence of tone and stress markers
Morphological Features
46
● 41 features
● Features from the morphological and nominal chapters
● E.g. number of genders, usage of definite and indefinite articles, and reduplication
Word Order Features
47
● 56 features
● Encode ordering of subjects, objects and verbs
Experimental Setup
48
Data
● Pre-trained language embeddings (Östling and Tiedemann, 2017)
● Task-specific datasets: grapheme-to-phoneme (G2P), phonological
reconstruction (ASJP), morphological inflection (SIGMORPHON),
part-of-speech tagging (UD)
Dataset    | Class      | L_task | L_task ⋂ L_pre
G2P        | Phonology  | 311    | 102
ASJP       | Phonology  | 4664   | 824
SIGMORPHON | Morphology | 52     | 29
UD         | Syntax     | 50     | 27
Experimental Setup
49
Method
● Fine-tune language embeddings on grapheme-to-phoneme (G2P),
phonological reconstruction (ASJP), morphological inflection
(SIGMORPHON), part-of-speech tagging (UD)
○ Train supervised seq2seq models for the G2P, ASJP and SIGMORPHON tasks
○ Train a sequence labelling model for the UD task
● Predict typological properties with a kNN model
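A minimal sketch of the prediction step (random arrays stand in for the real language embeddings and WALS values; k=1 is an assumption): a k-nearest-neighbour classifier maps a language embedding to a WALS feature value.

```python
# Predicting a WALS feature value from language embeddings with kNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

train_embs = np.random.randn(80, 64)       # embeddings of languages with a known value
train_vals = np.random.randint(0, 7, 80)   # e.g. values of one WALS feature per language
test_embs = np.random.randn(20, 64)        # held-out languages (or an unseen family)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_embs, train_vals)
predicted_vals = knn.predict(test_embs)    # compared against gold WALS values
```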
Experimental Setup
50
Seq2Seq Model
Experimental Setup
51
Data Splits
(i) Randomly selected language/feature pairs from a task-related feature set:
predict task-related features given a random sample of languages
(ii) An unseen language family, evaluated on a task-related feature set:
predict the same features given a sample from which a whole language family is omitted
(iii) Randomly selected language/feature pairs from all WALS feature sets:
compare the task-specific feature encoding with a general one
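A minimal sketch of split (ii), holding out an entire language family so that its typological features must be predicted from other families only (toy records, illustrative field names):

```python
records = [
    {"lang": "Finnish",   "family": "Uralic",        "emb": [0.1, 0.3]},
    {"lang": "Hungarian", "family": "Uralic",        "emb": [0.2, 0.2]},
    {"lang": "Danish",    "family": "Indo-European", "emb": [0.9, 0.1]},
    {"lang": "Tagalog",   "family": "Austronesian",  "emb": [0.4, 0.8]},
]
held_out_family = "Uralic"
train = [r for r in records if r["family"] != held_out_family]
test = [r for r in records if r["family"] == held_out_family]
```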
Grapheme-to-Phoneme (G2P)
52
System/Features     | Random lang/feat pairs (phon. features) | Unseen lang family (phon. features) | Random lang/feat pairs (all features)
Most frequent class | *75.46% | 65.57%  | 79.9%
k-NN (pre-trained)  | 71.45%  | *86.45% | 80.39%
k-NN (fine-tuned)   | 71.66%  | 82.36%  | 79.17%
Grapheme-to-Phoneme (G2P)
53
Hypothesis: grapheme-to-phoneme is more related to diachronic
development of the writing system than it is to genealogy or phonology
Example:
- Norwegian & Danish: almost the same orthography, different phonology
- Pre-trained language embeddings should be very similar and diverge during training
- Comparison to typologically distant languages Finnish and Tagalog
Phonological Reconstruction (ASJP)
54
- Fine-tuned language embeddings do not outperform MFC
- Reason: the task is very similar across languages
- However: evaluation on an unseen language family with pre-trained embeddings outperforms MFC
-> Language embeddings encode features relevant to phonology to some extent
System/Features     | Random lang/feat pairs (phon. features) | Unseen lang family (phon. features) | Random lang/feat pairs (all features)
Most frequent class | *59.39% | 63.71%  | *58.12%
k-NN (pre-trained)  | 53.02%  | *77.44% | 51.6%
k-NN (fine-tuned)   | 53.09%  | *77.45% | 51.9%
Morphological Inflection (SIGMORPHON)
55
- Fine-tuned language embeddings outperform MFC
-> Pre-trained and fine-tuned language embeddings encode features relevant to morphology
System/Features     | Random lang/feat pairs (morph. features) | Unseen lang family (morph. features) | Random lang/feat pairs (all features)
Most frequent class | 77.98%  | 85.68%  | 84.12%
k-NN (pre-trained)  | 74.49%  | 88.83%  | 84.97%
k-NN (fine-tuned)   | *82.91% | *91.92% | 84.95%
Morphological Inflection (SIGMORPHON)
56
Example:
- Languages with the same number of cases might benefit from parameter sharing
- Feature 38A mainly encodes whether the indefinite word is distinct from the word for 'one' -> not surprising that this is not learned in morphological inflection
Part-of-Speech Tagging (UD)
57
- Improvements for all experimental settings
-> Pre-trained and fine-tuned language embeddings encode
features relevant to word order
System/Features     | Random lang/feat pairs (word order features) | Unseen lang family (word order features) | Random lang/feat pairs (all features)
Most frequent class | 67.81%  | 82.47%  | 82.93%
k-NN (pre-trained)  | 76.66%  | 92.76%  | 82.69%
k-NN (fine-tuned)   | *80.81% | *94.48% | 83.55%
Part-of-Speech Tagging (UD)
58
• Hierarchical clustering of language embeddings
• Language-modelling-based language embeddings
  • English clusters with Romance
  • Large amount of Romance vocabulary
• PoS-based language embeddings
  • English clusters with Germanic
  • Morpho-syntactically more similar
(Bjerva and Augenstein, 2018)
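A minimal sketch of such a clustering (random vectors stand in for the actual language embeddings; average linkage with cosine distance is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

languages = ["English", "German", "Danish", "French", "Spanish", "Finnish"]
embeddings = np.random.randn(len(languages), 64)      # placeholder language embeddings

Z = linkage(embeddings, method="average", metric="cosine")
tree = dendrogram(Z, labels=languages, no_plot=True)  # plot to see which languages group
# e.g. with PoS-tuned embeddings, English is expected to cluster with Germanic
```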
Conclusions
59
- Whether features are encoded depends on the target task
- Does not work for phonological tasks
- Works for morphological inflection and PoS tagging
- We can predict typological features for unseen language families with high accuracy
- Changes in the features encoded in language embeddings are relatively monotonic -> convergence towards a single optimum
- G2P task: phonological differences between otherwise similar languages (e.g. Norwegian Bokmål and Danish) are accurately encoded
Future work
60
• Improve multilingual modelling
  • E.g., share morphologically relevant parameters for morphologically similar languages
• Complete WALS using language embeddings and KBP methods
Conclusions: This Talk
61
Part 1: General Challenges
- Method: multitask and semi-supervised learning, label embeddings
- Application: very similar pairwise sequence classification tasks
- Take-away:
multitask learning 😀
+ label embeddings 😀
+ semi-supervised learning (same data) 😐
Part 2: Multilingual and Diversity Aspects
- Method: unsupervised and transfer learning
- Application: predicting linguistic features of languages
- Take-away:
Language embeddings facilitate zero-shot learning
+ encode linguistic features
+ encoding varies depending on the NLP task they are fine-tuned for
Presented Papers
Isabelle Augenstein, Sebastian Ruder, Anders Søgaard. Multi-task Learning
of Pairwise Sequence Classification Tasks Over Disparate Label Spaces.
NAACL HLT 2018 (long), to appear
Johannes Bjerva, Isabelle Augenstein. Tracking Typological Traits of Uralic
Languages in Distributed Language Representations. Fourth International
Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2018),
January 2018
Johannes Bjerva, Isabelle Augenstein. From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings. NAACL HLT 2018 (long), to appear
62
Thank you!
augenstein@di.ku.dk
@IAugenstein
63
