Johns Hopkins University
30 May 2018
Learning with Limited Labelled Data in NLP
Multi-Task Learning and Beyond
Isabelle Augenstein
augenstein@di.ku.dk
@IAugenstein
http://isabelleaugenstein.github.io/
Research Group
2
Johannes Bjerva, Postdoc: computational typology and low-resource learning
Yova Kementchedjhieva, PhD Fellow (co-advised w. Anders Søgaard): morphological analysis of low-resource languages
Ana Gonzalez, PhD Fellow (co-advised w. Anders Søgaard): multilingual question answering for customer service bots
Mareike Hartmann, PhD Fellow (co-advised w. Anders Søgaard): detecting disinformation with multilingual stance detection
Learning with Limited Labelled Data: Why?
3
General Challenges
- Manually annotating training data is expensive
- Only a few large NLP datasets
- New tasks and domains
- Domain drift
Multilingual and Diversity Aspects
- Underrepresented languages
- Dialects
Learning with Limited Labelled Data: How?
4
- Domain Adaptation
- Weakly Supervised Learning
- Distant Supervision
- Transfer Learning
- Multi-Task Learning
- Unsupervised Learning
General Research Overview
5
- Learning with Limited Labelled Data
- Multi-Task Learning
- Semi-Supervised Learning
- Distant Supervision
- Multilingual Learning
- Computational Typology
- Information Extraction
- Stance Detection
- Fact Checking
- Representation Learning
- Question Answering
This Talk
6
Part 1: General Challenges
- Method: combining multitask and semi-supervised
learning
- Application: very similar pairwise sequence
classification tasks
Part 2: Multilingual and Diversity Aspects
- Method: unsupervised and transfer learning
- Application: predicting linguistic features of languages
Multi-task Learning of Pairwise
Sequence Classification Tasks Over
Disparate Label Spaces
Isabelle Augenstein*, Sebastian Ruder*,
Anders Søgaard
NAACL HLT 2018 (long), to appear
*equal contributions
7
Problem
8
- Different NLU tasks (e.g. stance detection, aspect-based
sentiment analysis, natural language inference)
- Limited training data for most individual tasks
- However:
- they can be modelled with same base neural model
- they are semantically related
- they have similar labels
- How to exploit synergies between those tasks?
Datasets and Tasks
Topic-based sentiment analysis:
Tweet: No power at home, sat in the dark listening to AC/DC in the hope it’ll make the electricity come back again
Topic: AC/DC
Label: positive

Target-dependent sentiment analysis:
Text: how do you like settlers of catan for the wii?
Target: wii
Label: neutral

Aspect-based sentiment analysis:
Text: For the price, you cannot eat this well in Manhattan
Aspects: restaurant prices, food quality
Label: positive
9
Stance detection:
Tweet: Be prepared - if we continue the policies of the liberal left, we will be #Greece
Target: Donald Trump
Label: favor

Fake news detection:
Document: Dino Ferrari hooked the whopper wels catfish, (...), which could be the biggest in the world.
Headline: Fisherman lands 19 STONE catfish which could be the biggest in the world to be hooked
Label: agree

Natural language inference:
Premise: Fun for only children
Hypothesis: Fun for adults and children
Label: contradiction
Multi-Task Learning
10–13
(architecture figure built up incrementally; full annotations on slide 14)
Multi-Task Learning
14
Shared hidden layers
Separate inputs for each task
Separate output layers + classification functions
Negative log-likelihood objectives
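A minimal sketch of this setup follows (not the authors' released code; the LSTM encoder and all dimensions are assumptions): a shared encoder feeds one softmax output layer per task, each trained with a negative log-likelihood objective.

```python
# A minimal sketch of hard parameter sharing (assumptions: LSTM encoder, sizes):
# a shared encoder, one output layer per task, one NLL objective per task.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        _, (h, _) = self.lstm(self.embed(token_ids))
        return h[-1]                               # (batch, hidden_dim)

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size, labels_per_task, hidden_dim=128):
        super().__init__()
        self.encoder = SharedEncoder(vocab_size, hidden_dim=hidden_dim)
        # separate output layer (classification function) for each task
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in labels_per_task])

    def forward(self, token_ids, task_id):
        return self.heads[task_id](self.encoder(token_ids))

model = MultiTaskModel(vocab_size=10000, labels_per_task=[3, 2, 4])
nll = nn.CrossEntropyLoss()                        # negative log-likelihood over a softmax
x = torch.randint(0, 10000, (8, 20))               # dummy batch for task 0
y = torch.randint(0, 3, (8,))
loss = nll(model(x, task_id=0), y)
loss.backward()
```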
Goal: Exploiting Synergies between Tasks
15
- Modelling tasks in a joint label space
- Label Transfer Network that learns to transfer labels
between tasks
- Use semi-supervised learning, trained end-to-end with
multi-task learning model
- Extensive evaluation on a set of pairwise sequence
classification tasks
Related Work
16
- Learning task similarities
- Enforce clustering of tasks (Evgeniou et al., 2005; Jacob et al., 2009)
- Induce shared prior (Yu et al., 2005; Xue et al., 2007; Daumé III,
2009)
- Learn grouping (Kang et al., 2011; Kumar and Daumé III, 2012)
- Only works for homogeneous tasks with the same label spaces
- Multi-task learning with neural networks
- Hard parameter sharing (Caruana, 1993)
- Different sharing structures (Søgaard and Goldberg, 2016)
- Private and public subspaces (Liu et al., 2017; Ruder et al., 2017)
- Training on disparate annotation sets (Chen et al., 2016; Peng et al.,
2017)
- Does not take into account similarities between label spaces
Related Work
17
- Semi-supervised learning
- Self-training, co-training, tri-training, EM, etc.
- Closest: co-forest (Li and Zhou, 2007) - each learner is improved with
unlabeled instances labeled by the ensemble consisting of all the
other learners
- Unsupervised aux tasks in MTL (Plank et al., 2016; Rei, 2017)
- Label transformations
- Use distributional information to map from a language-specific tagset
to a tagset used for other languages for cross-lingual transfer (Zhang
et al., 2012)
- Correlation analysis to transfer between tasks with disparate label
spaces (Kim et al., 2015)
- Label transformations for multi-label classification problems (Yeh et
al., 2017)
Multi-Task Learning
18
Shared hidden layers
Separate inputs for each task
Separate output layers + classification functions
Negative log-likelihood objectives
19
Best-Performing Aux Tasks
Main task | Aux tasks
Topic-2   | FNC-1, MultiNLI, Target
Topic-5   | FNC-1, MultiNLI, ABSA-L, Target
Target    | FNC-1, MultiNLI, Topic-5
Stance    | FNC-1, MultiNLI, Target
ABSA-L    | Topic-5
ABSA-R    | Topic-5, ABSA-L, Target
FNC-1     | Stance, MultiNLI, Topic-5, ABSA-R, Target
MultiNLI  | Topic-5
Trends:
• Target used by all Twitter main tasks
• Tasks with a higher number of labels (e.g. Topic-5) are used more often
• Tasks with more training data (FNC-1, MultiNLI) are used more often
Label Embedding Layer
20
Label Embedding Layer
21
Label
embedding
space
Prediction with label
compatibility function:
c(l, h) = l · h
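A minimal sketch of such an output layer (dimensions are illustrative): all labels of all tasks share one embedding matrix, and each label l is scored against the hidden representation h with the dot-product compatibility function above.

```python
# A minimal sketch of a label embedding output layer with c(l, h) = l . h.
import torch
import torch.nn as nn

class LabelEmbeddingOutput(nn.Module):
    def __init__(self, total_num_labels, hidden_dim):
        super().__init__()
        # one shared embedding matrix over the joint label space of all tasks
        self.label_emb = nn.Embedding(total_num_labels, hidden_dim)

    def forward(self, h, label_ids):
        # h: (batch, hidden_dim); label_ids: indices of the current task's labels
        L = self.label_emb(label_ids)              # (num_task_labels, hidden_dim)
        return h @ L.t()                           # compatibility scores c(l, h)

# e.g. a task whose three labels occupy indices 0..2 of the joint label space
output_layer = LabelEmbeddingOutput(total_num_labels=15, hidden_dim=128)
h = torch.randn(8, 128)
scores = output_layer(h, torch.tensor([0, 1, 2]))  # (8, 3); softmax + NLL as usual
```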
Label Embeddings
22
Label Transfer Network
Goal: learn to produce pseudo labels for target task
LTN_T = MLP([o^1, …, o^{T-1}])
o^i = ∑_{j=1}^{L_i} p_j^{T_i} · l_j
- Output label embedding o^i of task T_i: the sum of the task's label embeddings l_j, weighted by their probabilities p_j^{T_i}
- LTN: trained on labelled target task data
- Trained with a negative log-likelihood objective L_LTN to produce a pseudo-label for the target task
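A minimal sketch of the LTN (the one-hidden-layer MLP and all sizes are assumptions): each auxiliary task contributes a probability-weighted sum of its label embeddings, and an MLP over the concatenation predicts a pseudo-label for the target task.

```python
# A minimal sketch of the Label Transfer Network described above.
import torch
import torch.nn as nn

class LabelTransferNetwork(nn.Module):
    def __init__(self, num_aux_tasks, hidden_dim, num_target_labels):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_aux_tasks * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_target_labels))

    def forward(self, aux_probs, aux_label_embs):
        # aux_probs[i]:      (batch, L_i) softmax output of auxiliary task i
        # aux_label_embs[i]: (L_i, hidden_dim) label embeddings of task i
        o = [p @ l for p, l in zip(aux_probs, aux_label_embs)]  # each (batch, hidden_dim)
        return self.mlp(torch.cat(o, dim=-1))                   # target-task logits

ltn = LabelTransferNetwork(num_aux_tasks=2, hidden_dim=16, num_target_labels=3)
probs = [torch.softmax(torch.randn(4, 5), dim=-1), torch.softmax(torch.randn(4, 7), dim=-1)]
embs = [torch.randn(5, 16), torch.randn(7, 16)]
pseudo_logits = ltn(probs, embs)   # (4, 3); trained with NLL against target-task gold labels
```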
23
Semi-Supervised MTL
Goal: relabel aux task data as main task data using LTN
- LTN can be used to produce pseudo-labels for aux or
unlabelled instances
- Train the target task model on the additional pseudo-
labelled data
- Additional loss: minimise the mean squared error between the model predictions p^{T_i} and the pseudo-labels z^{T_i} produced by the LTN
24
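A minimal sketch of this additional loss term (function and variable names are illustrative, not from the paper's code):

```python
# Semi-supervised MTL loss: MSE between main-task predictions on relabelled
# auxiliary instances and the pseudo-labels produced by the LTN.
import torch
import torch.nn.functional as F

def relabelling_loss(main_logits, ltn_logits):
    p_main = F.softmax(main_logits, dim=-1)            # main-task predictions p^{T_i}
    z_ltn = F.softmax(ltn_logits, dim=-1).detach()     # pseudo-labels z^{T_i} from the LTN
    return F.mse_loss(p_main, z_ltn)

main_logits = torch.randn(4, 3, requires_grad=True)
ltn_logits = torch.randn(4, 3)
loss = relabelling_loss(main_logits, ltn_logits)       # added to the multi-task objective
```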
Label Transfer Network (w or w/o semi-supervision)
25
Relabelling
26
Overall Results
27–31
(results figures, built up incrementally)
Overall Results
- Label embeddings improve performance
- New SoA on topic-based sentiment analysis
However:
- Softmax predictions of other, even highly related tasks
are less helpful for predicting main labels than the
output layer of the main task model
- At best, learning the relabelling model alongside the main model might act as a regulariser for the main model
- Future work: use relabelling model to label unlabelled
data instead
32
Tracking Typological Traits of Uralic
Languages in Distributed Language
Representations
Johannes Bjerva, Isabelle Augenstein
IWCLUL 2018
33
From Phonology to Syntax: Unsupervised
Linguistic Typology at Different Levels with
Language Embeddings
Johannes Bjerva, Isabelle Augenstein
NAACL HLT 2018 (long), to appear
34
Linguistic Typology
35
● ‘The systematic study and comparison of language
structures’ (Velupillai, 2012)
● Long history (Herder, 1772; von der Gabelentz, 1891; …)
● Computational approaches (Dunn et al., 2011; Wälchli,
2014; Östling, 2015, ...)
Why Computational Typology?
36
● Answer linguistic research questions on large scale
● Multilingual learning
○ Language representations
○ Cross-lingual transfer
○ Few-shot or zero-shot learning
● This work:
○ Features in the World Atlas of Language Structures (WALS)
○ Computational Typology via unsupervised modelling of languages
in neural networks
Resources that exist for many languages
● Universal Dependencies (>60 languages)
● UniMorph (>50 languages)
● New Testament translations (>1,000 languages)
● Automated Similarity Judgment Program (>4,500
languages)
39
Multilingual NLP and Language Representations
● No explicit representation
○ Multilingual Word Embeddings
● Google’s “Enabling zero-shot learning” NMT trick
  ○ Language given explicitly in the input
● One-hot encodings
  ○ Languages represented as a sparse vector
● Language Embeddings
  ○ Languages represented as a distributed vector
40
(Östling and Tiedemann, 2017)
Distributed Language Representations
41
• Language Embeddings
• Analogous to Word Embeddings
• Can be learned in a neural network without supervision
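A minimal sketch of the mechanism (Östling and Tiedemann learn language embeddings inside a multilingual character-level language model; a generic multilingual tagger stands in here, and all names and sizes are assumptions): the language vector is just another trainable lookup table, concatenated to every input token and updated by backpropagation.

```python
# A minimal sketch: a language embedding learned as a by-product of training
# a shared multilingual model (illustrative architecture and sizes).
import torch
import torch.nn as nn

class MultilingualTagger(nn.Module):
    def __init__(self, vocab_size, num_languages, num_tags,
                 emb_dim=64, lang_dim=32, hidden_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.lang_emb = nn.Embedding(num_languages, lang_dim)   # the language embeddings
        self.lstm = nn.LSTM(emb_dim + lang_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: (batch,)
        toks = self.tok_emb(token_ids)
        lang = self.lang_emb(lang_id).unsqueeze(1).expand(-1, toks.size(1), -1)
        h, _ = self.lstm(torch.cat([toks, lang], dim=-1))
        return self.out(h)                                      # per-token tag scores

model = MultilingualTagger(vocab_size=5000, num_languages=100, num_tags=17)
scores = model(torch.randint(0, 5000, (2, 10)), torch.tensor([3, 42]))
# after training, model.lang_emb.weight holds the learned language embeddings
```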
Language Embeddings in Deep Neural Networks
42
Language Embeddings in Deep Neural Networks
43
1. Do language embeddings aid multilingual modelling?
2. Do language embeddings contain typological information?
Research Questions
● RQ 1: Which typological properties are encoded in task-
specific distributed language representations, and can we
predict phonological, morphological and syntactic
properties of languages using such representations?
● RQ 2: To what extent do the encoded properties change as
the representations are fine-tuned for tasks at different
linguistic levels?
● RQ 3: How are language similarities encoded in fine-tuned
language embeddings?
44
Phonological Features
45
● 20 features
● E.g. descriptions of the consonant and vowel inventories, presence of tone and stress markers
Morphological Features
46
● 41 features
● Features from the morphological and nominal chapters
● E.g. number of genders, usage of definite and indefinite articles, and reduplication
Word Order Features
47
● 56 features
● Encode ordering of subjects, objects and verbs
Experimental Setup
48
Data
● Pre-trained language embeddings (Östling and Tiedemann, 2017)
● Task-specific datasets: grapheme-to-phoneme (G2P), phonological
reconstruction (ASJP), morphological inflection (SIGMORPHON),
part-of-speech tagging (UD)
Dataset    | Class      | L_task | L_task ⋂ L_pre
G2P        | Phonology  | 311    | 102
ASJP       | Phonology  | 4664   | 824
SIGMORPHON | Morphology | 52     | 29
UD         | Syntax     | 50     | 27
Experimental Setup
49
Method
● Fine-tune language embeddings on grapheme-to-phoneme (G2P),
phonological reconstruction (ASJP), morphological inflection
(SIGMORPHON), part-of-speech tagging (UD)
○ Train supervised seq2seq models for the G2P, ASJP and SIGMORPHON tasks
○ Train a sequence labelling model for the UD task
● Predict typological properties with a kNN model
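A minimal sketch of the prediction step (random arrays stand in for the real language embeddings and WALS values; k=1 is an assumption): a k-nearest-neighbour classifier maps a language embedding to a WALS feature value.

```python
# Predicting a WALS feature value from language embeddings with kNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

train_embs = np.random.randn(80, 64)       # embeddings of languages with a known value
train_vals = np.random.randint(0, 7, 80)   # e.g. values of one WALS feature per language
test_embs = np.random.randn(20, 64)        # held-out languages (or an unseen family)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_embs, train_vals)
predicted_vals = knn.predict(test_embs)    # compared against gold WALS values
```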
Experimental Setup
50
Seq2Seq Model
Experimental Setup
51
Data Splits
(i) Randomly selected language/feature pairs from a task-related feature set:
predict task-related features given a random sample of languages
(ii) An unseen language family, evaluated on a task-related feature set:
predict the same features given a sample from which a whole language family is omitted
(iii) Randomly selected language/feature pairs from all WALS feature sets:
compare the task-specific feature encoding with a general one
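A minimal sketch of split (ii), holding out an entire language family so that its typological features must be predicted from other families only (toy records, illustrative field names):

```python
records = [
    {"lang": "Finnish",   "family": "Uralic",        "emb": [0.1, 0.3]},
    {"lang": "Hungarian", "family": "Uralic",        "emb": [0.2, 0.2]},
    {"lang": "Danish",    "family": "Indo-European", "emb": [0.9, 0.1]},
    {"lang": "Tagalog",   "family": "Austronesian",  "emb": [0.4, 0.8]},
]
held_out_family = "Uralic"
train = [r for r in records if r["family"] != held_out_family]
test = [r for r in records if r["family"] == held_out_family]
```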
Grapheme-to-Phoneme (G2P)
52
System/Features     | Random lang/feat pairs (phon. features) | Unseen lang family (phon. features) | Random lang/feat pairs (all features)
Most frequent class | *75.46% | 65.57%  | 79.9%
k-NN (pre-trained)  | 71.45%  | *86.45% | 80.39%
k-NN (fine-tuned)   | 71.66%  | 82.36%  | 79.17%
Grapheme-to-Phoneme (G2P)
53
Hypothesis: grapheme-to-phoneme is more related to diachronic
development of the writing system than it is to genealogy or phonology
Example:
- Norwegian & Danish: almost the same orthography, different phonology
- Pre-trained language embeddings should be very similar and diverge during training
- Comparison to typologically distant languages Finnish and Tagalog
Phonological Reconstruction (ASJP)
54
- Fine-tuned language embeddings do not outperform MFC
- Reason: the task is very similar across languages
- However: evaluation on an unseen language family with pre-trained embeddings outperforms MFC
-> Language embeddings encode features relevant to phonology to some extent
System/Features     | Random lang/feat pairs (phon. features) | Unseen lang family (phon. features) | Random lang/feat pairs (all features)
Most frequent class | *59.39% | 63.71%  | *58.12%
k-NN (pre-trained)  | 53.02%  | *77.44% | 51.6%
k-NN (fine-tuned)   | 53.09%  | *77.45% | 51.9%
Morphological Inflection (SIGMORPHON)
55
- Fine-tuned language embeddings outperform MFC
-> Pre-trained and fine-tuned language embeddings encode features relevant to morphology
System/Features     | Random lang/feat pairs (morph. features) | Unseen lang family (morph. features) | Random lang/feat pairs (all features)
Most frequent class | 77.98%  | 85.68%  | 84.12%
k-NN (pre-trained)  | 74.49%  | 88.83%  | 84.97%
k-NN (fine-tuned)   | *82.91% | *91.92% | 84.95%
Morphological Inflection (SIGMORPHON)
56
Example:
- Languages with the same number of cases might benefit from parameter sharing
- Feature 38A mainly encodes whether the indefinite word is distinct from the word for 'one' -> not surprising that this is not learned in morphological inflection
Part-of-Speech Tagging (UD)
57
- Improvements for all experimental settings
-> Pre-trained and fine-tuned language embeddings encode
features relevant to word order
System/Features     | Random lang/feat pairs (word order features) | Unseen lang family (word order features) | Random lang/feat pairs (all features)
Most frequent class | 67.81%  | 82.47%  | 82.93%
k-NN (pre-trained)  | 76.66%  | 92.76%  | 82.69%
k-NN (fine-tuned)   | *80.81% | *94.48% | 83.55%
Part-of-Speech Tagging (UD)
58
• Hierarchical clustering of language embeddings
• Language-modelling-based language embeddings
  • English clusters with Romance
  • Large amount of Romance vocabulary
• PoS-based language embeddings
  • English clusters with Germanic
  • Morpho-syntactically more similar
(Bjerva and Augenstein, 2018)
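A minimal sketch of such a clustering (random vectors stand in for the actual language embeddings; average linkage with cosine distance is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

languages = ["English", "German", "Danish", "French", "Spanish", "Finnish"]
embeddings = np.random.randn(len(languages), 64)      # placeholder language embeddings

Z = linkage(embeddings, method="average", metric="cosine")
tree = dendrogram(Z, labels=languages, no_plot=True)  # plot to see which languages group
# e.g. with PoS-tuned embeddings, English is expected to cluster with Germanic
```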
Conclusions
59
- Whether features are encoded depends on the target task
- Does not work for phonological tasks
- Works for morphological inflection and PoS tagging
- We can predict typological features for unseen language families with high accuracy
- Changes in the features encoded in language embeddings are relatively monotonic -> convergence towards a single optimum
- G2P task: phonological differences between otherwise similar languages (e.g. Norwegian Bokmål and Danish) are accurately encoded
Future work
60
• Improve multilingual modelling
  • E.g., share morphologically relevant parameters for morphologically similar languages
• Complete WALS using language embeddings and KBP methods
Conclusions: This Talk
61
Part 1: General Challenges
- Method: multitask and semi-supervised learning, label embeddings
- Application: very similar pairwise sequence classification tasks
- Take-away:
multitask learning 😀
+ label embeddings 😀
+ semi-supervised learning (same data) 😐
Part 2: Multilingual and Diversity Aspects
- Method: unsupervised and transfer learning
- Application: predicting linguistic features of languages
- Take-away:
Language embeddings facilitate zero-shot learning
+ encode linguistic features
+ encoding varies depending on the NLP task they are fine-tuned for
Presented Papers
Isabelle Augenstein, Sebastian Ruder, Anders Søgaard. Multi-task Learning
of Pairwise Sequence Classification Tasks Over Disparate Label Spaces.
NAACL HLT 2018 (long), to appear
Johannes Bjerva, Isabelle Augenstein. Tracking Typological Traits of Uralic
Languages in Distributed Language Representations. Fourth International
Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2018),
January 2018
Johannes Bjerva, Isabelle Augenstein. From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings. NAACL HLT 2018 (long), to appear
62
Thank you!
augenstein@di.ku.dk
@IAugenstein
63
