
Learning with limited labelled data in NLP: multi-task learning and beyond

When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist to leverage other resources for training machine learning models. These resources are commonly either instances from related tasks or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks; it can therefore benefit from a larger overall number of training instances and improve its generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data, as well as labelled data for related tasks, can be utilised by transferring labels from labelled instances to unlabelled ones, essentially extending the training dataset.

In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces' [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings' [2].

[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375

==========

Bio from my website http://isabelleaugenstein.github.io/index.html:

I have been a tenure-track assistant professor at the Department of Computer Science, University of Copenhagen, since July 2017. I am affiliated with the CoAStAL NLP group and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning, with applications including information extraction, machine reading and fact checking.

Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.


Learning with limited labelled data in NLP: multi-task learning and beyond

  1. 1. Johns Hopkins University 30 May 2018 Learning with Limited Labelled Data in NLP Multi-Task Learning and Beyond Isabelle Augenstein augenstein@di.ku.dk @IAugenstein http://isabelleaugenstein.github.io/
  2. 2. Research Group 2 Johannes Bjerva Postdoc computational typology and low-resource learning Yova Kementchedjhieva PhD Fellow (co-advised w. Anders Søgaard) morphological analysis of low resource languages Ana Gonzalez PhD Fellow (co-advised w. Anders Søgaard) multilingual question answering for customer service bots Mareike Hartmann PhD Fellow (co-advised w. Anders Søgaard) detecting disinformation with multilingual stance detection
  3. 3. Learning with Limited Labelled Data: Why? 3 General Challenges - Manually annotating training data is expensive - Only few large NLP datasets - New tasks and domains - Domain drift Multilingual and Diversity Aspects - Underrepresented languages - Dialects
  4. 4. Learning with Limited Labelled Data: How? 4 - Domain Adaptation - Weakly Supervised Learning - Distant Supervision - Transfer Learning - Multi-Task Learning - Unsupervised Learning
  5. 5. General Research Overview 5 Learning with Limited Labelled Data Multi-Task Learning Semi-Supervised Learning Distant Supervision Multilingual Learning Computational Typology Information Extraction Stance Detection Fact Checking Representation Learning Question Answering
  6. 6. This Talk 6 Part 1: General Challenges - Method: combining multitask and semi-supervised learning - Application: very similar pairwise sequence classification tasks Part 2: Multilingual and Diversity Aspects - Method: unsupervised and transfer learning - Application: predicting linguistic features of languages
  7. 7. Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces Isabelle Augenstein*, Sebastian Ruder*, Anders Søgaard NAACL HLT 2018 (long), to appear *equal contributions 7
  8. 8. Problem 8 - Different NLU tasks (e.g. stance detection, aspect-based sentiment analysis, natural language inference) - Limited training data for most individual tasks - However: - they can be modelled with same base neural model - they are semantically related - they have similar labels - How to exploit synergies between those tasks?
  9. 9. Datasets and Tasks 9
    Topic-based sentiment analysis: Tweet: No power at home, sat in the dark listening to AC/DC in the hope it’ll make the electricity come back again | Topic: AC/DC | Label: positive
    Target-dependent sentiment analysis: Text: how do you like settlers of catan for the wii? | Target: wii | Label: neutral
    Aspect-based sentiment analysis: Text: For the price, you cannot eat this well in Manhattan | Aspects: restaurant prices, food quality | Label: positive
    Stance detection: Tweet: Be prepared - if we continue the policies of the liberal left, we will be #Greece | Target: Donald Trump | Label: favor
    Fake news detection: Document: Dino Ferrari hooked the whopper wels catfish, (...), which could be the biggest in the world. | Headline: Fisherman lands 19 STONE catfish which could be the biggest in the world to be hooked | Label: agree
    Natural language inference: Premise: Fun for only children | Hypothesis: Fun for adults and children | Label: contradiction
  10. 10. Multi-Task Learning 10
  11. 11. Multi-Task Learning 11 Separate inputs for each task
  12. 12. Multi-Task Learning 12 Shared hidden layers Separate inputs for each task
  13. 13. Multi-Task Learning 13 Shared hidden layers Separate inputs for each task Separate output layers + classification functions
  14. 14. Multi-Task Learning 14 Shared hidden layers Separate inputs for each task Separate output layers + classification functions Negative log-likelihood objectives
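A minimal sketch of the hard-parameter-sharing setup built up on slides 10-14, assuming PyTorch; the task names, label-space sizes and layer dimensions below are hypothetical, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        """Hidden layers shared across all tasks (hard parameter sharing)."""
        def __init__(self, input_dim=300, hidden_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )

        def forward(self, x):
            return self.net(x)

    class MultiTaskModel(nn.Module):
        """Shared encoder plus one separate output layer per task."""
        def __init__(self, encoder, label_space_sizes, hidden_dim=128):
            super().__init__()
            self.encoder = encoder
            self.heads = nn.ModuleDict({
                task: nn.Linear(hidden_dim, n_labels)
                for task, n_labels in label_space_sizes.items()
            })

        def forward(self, x, task):
            return self.heads[task](self.encoder(x))

    # Hypothetical tasks and label-space sizes, just for illustration.
    tasks = {"stance": 3, "absa": 3, "nli": 3}
    model = MultiTaskModel(SharedEncoder(), tasks)
    nll = nn.CrossEntropyLoss()  # negative log-likelihood over softmax outputs
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Each step samples one task and updates shared + task-specific parameters.
    for task, n_labels in tasks.items():
        x = torch.randn(16, 300)               # stand-in for sentence-pair encodings
        y = torch.randint(0, n_labels, (16,))  # stand-in gold labels
        loss = nll(model(x, task), y)
        optim.zero_grad()
        loss.backward()
        optim.step()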
  15. 15. Goal: Exploiting Synergies between Tasks 15 - Modelling tasks in a joint label space - Label Transfer Network that learns to transfer labels between tasks - Use semi-supervised learning, trained end-to-end with multi-task learning model - Extensive evaluation on a set of pairwise sequence classification tasks
  16. 16. Related Work 16 - Learning task similarities - Enforce clustering of tasks (Evgeniou et al., 2005; Jacob et al., 2009) - Induce shared prior (Yu et al., 2005; Xue et al., 2007; Daumé III, 2009) - Learn grouping (Kang et al., 2011; Kumar and Daumé III, 2012) - Only works for homogeneous tasks with same label spaces - Multi-task learning with neural networks - Hard parameter sharing (Caruana, 1993) - Different sharing structures (Søgaard and Goldberg, 2016) - Private and public subspaces (Liu et al., 2017; Ruder et al., 2017) - Training on disparate annotation sets (Chen et al., 2016; Peng et al., 2017) - Does not take into account similarities between label spaces
  17. 17. Related Work 17 - Semi-supervised learning - Self-training, co-training, tri-training, EM, etc. - Closest: co-forest (Li and Zhou, 2007) - each learner is improved with unlabeled instances labeled by the ensemble consisting of all the other learners - Unsupervised aux tasks in MTL (Plank et al., 2016; Rei, 2017) - Label transformations - Use distributional information to map from a language-specific tagset to a tagset used for other languages for cross-lingual transfer (Zhang et al., 2012) - Correlation analysis to transfer between tasks with disparate label spaces (Kim et al., 2015) - Label transformations for multi-label classification problems (Yeh et al., 2017)
  18. 18. Multi-Task Learning 18 Shared hidden layers Separate inputs for each task Separate output layers + classification functions Negative log-likelihood objectives
  19. 19. 19 Best-Performing Aux Tasks
    Main task | Aux tasks
    Topic-2 | FNC-1, MultiNLI, Target
    Topic-5 | FNC-1, MultiNLI, ABSA-L, Target
    Target | FNC-1, MultiNLI, Topic-5
    Stance | FNC-1, MultiNLI, Target
    ABSA-L | Topic-5
    ABSA-R | Topic-5, ABSA-L, Target
    FNC-1 | Stance, MultiNLI, Topic-5, ABSA-R, Target
    MultiNLI | Topic-5
    Trends:
    • Target is used by all Twitter main tasks
    • Tasks with a higher number of labels (e.g. Topic-5) are used more often
    • Tasks with more training data (FNC-1, MultiNLI) are used more often
  20. 20. Label Embedding Layer 20
  21. 21. Label Embedding Layer 21 Label embedding space Prediction with label compatibility function: c(l, h) = l · h
  22. 22. Label Embeddings 22
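A small illustration of prediction in a joint label embedding space with the label compatibility function c(l, h) = l · h from slide 21; the label inventory and dimensions below are made up for the example.

    import torch
    import torch.nn as nn

    hidden_dim = 128
    # Joint label embedding space: labels of all tasks live in one matrix.
    # This label inventory is hypothetical and only for illustration.
    all_labels = ["stance:favor", "stance:against", "stance:none",
                  "nli:entailment", "nli:contradiction", "nli:neutral"]
    label_emb = nn.Embedding(len(all_labels), hidden_dim)

    def predict(h, task_label_ids):
        """Score each of a task's labels with the compatibility c(l, h) = l . h,
        then apply a softmax over that task's label set."""
        l = label_emb(task_label_ids)   # (n_task_labels, hidden_dim)
        scores = h @ l.t()              # dot-product compatibility
        return torch.softmax(scores, dim=-1)

    h = torch.randn(4, hidden_dim)        # hidden states from the shared encoder
    stance_ids = torch.tensor([0, 1, 2])  # the stance labels in the joint space
    print(predict(h, stance_ids).shape)   # -> torch.Size([4, 3])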
  23. 23. Label Transfer Network 23
    Goal: learn to produce pseudo-labels for the target task
    LTN_T = MLP([o_1, …, o_{T-1}])
    o_i = Σ_{j=1}^{L_i} p_j^{T_i} · l_j
    - Output label embedding o_i of task T_i: the sum of the task's label embeddings l_j, weighted by their predicted probabilities p_j^{T_i}
    - The LTN is trained on labelled target-task data
    - It is trained with a negative log-likelihood objective L_LTN to produce a pseudo-label for the target task
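A rough sketch of the Label Transfer Network computation above, assuming PyTorch; the number of auxiliary tasks, label-space sizes and layer sizes are invented for illustration, and the negative log-likelihood training on labelled target-task data is omitted.

    import torch
    import torch.nn as nn

    hidden_dim, n_aux_tasks, n_target_labels = 128, 2, 3

    # Per-task label embedding matrices (sizes are hypothetical).
    aux_label_embs = [nn.Embedding(3, hidden_dim).weight,
                      nn.Embedding(4, hidden_dim).weight]

    # LTN: an MLP over the concatenated output label embeddings of the aux tasks.
    ltn = nn.Sequential(
        nn.Linear(n_aux_tasks * hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, n_target_labels),
    )

    def output_label_embedding(p, label_emb):
        """o_i = sum_j p_j^{T_i} * l_j: label embeddings weighted by the task's
        predicted label probabilities."""
        return p @ label_emb  # (batch, hidden_dim)

    # Stand-in softmax outputs of the two auxiliary task models for one batch.
    aux_probs = [torch.softmax(torch.randn(8, e.shape[0]), dim=-1)
                 for e in aux_label_embs]
    o = [output_label_embedding(p, e) for p, e in zip(aux_probs, aux_label_embs)]
    pseudo_logits = ltn(torch.cat(o, dim=-1))  # pseudo-label scores for the target task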
  24. 24. Semi-Supervised MTL Goal: relabel aux task data as main task data using the LTN - The LTN can be used to produce pseudo-labels for auxiliary or unlabelled instances - Train the target task model on the additional pseudo-labelled data - Additional loss: minimise the mean squared error between the model predictions p^{T_i} and the pseudo-labels z^{T_i} produced by the LTN 24
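The additional semi-supervised objective from slide 24, sketched with stand-in tensors: a mean squared error between the main model's predictions and the LTN's pseudo-labels.

    import torch
    import torch.nn.functional as F

    # Stand-in tensors: the main model's softmax predictions p on relabelled
    # auxiliary-task instances, and the LTN's pseudo-label distributions z.
    p = torch.softmax(torch.randn(8, 3), dim=-1)  # main-model predictions p^{T_i}
    z = torch.softmax(torch.randn(8, 3), dim=-1)  # LTN pseudo-labels z^{T_i}

    # Additional loss: mean squared error between predictions and pseudo-labels.
    semi_supervised_loss = F.mse_loss(p, z)

    # In training this term would be added to the supervised multi-task loss,
    # e.g. total_loss = supervised_nll + lambda_semi * semi_supervised_loss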
  25. 25. Label Transfer Network (w or w/o semi-supervision) 25
  26. 26. Relabelling 26
  27. 27. Overall Results 27
  28. 28. Overall Results 28
  29. 29. Overall Results 29
  30. 30. Overall Results 30
  31. 31. Overall Results 31
  32. 32. Overall Results - Label embeddings improve performance - New SoA on topic-based sentiment analysis However: - Softmax predictions of other, even highly related, tasks are less helpful for predicting main-task labels than the main task model's own output layer - At best, learning the relabelling model alongside the main model might act as a regulariser for the main model - Future work: use the relabelling model to label unlabelled data instead 32
  33. 33. Tracking Typological Traits of Uralic Languages in Distributed Language Representations Johannes Bjerva, Isabelle Augenstein IWCLUL 2018 33
  34. 34. From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings Johannes Bjerva, Isabelle Augenstein NAACL HLT 2018 (long), to appear 34
  35. 35. Linguistic Typology 35 ● ‘The systematic study and comparison of language structures’ (Velupillai, 2012) ● Long history (Herder, 1772; von der Gabelentz, 1891; …) ● Computational approaches (Dunn et al., 2011; Wälchli, 2014; Östling, 2015, ...)
  36. 36. Why Computational Typology? 36 ● Answer linguistic research questions on large scale ● Multilingual learning ○ Language representations ○ Cross-lingual transfer ○ Few-shot or zero-shot learning ● This work: ○ Features in the World Atlas of Language Structures (WALS) ○ Computational Typology via unsupervised modelling of languages in neural networks
  39. 39. Resources that exist for many languages ● Universal Dependencies (>60 languages) ● UniMorph (>50 languages) ● New Testament translations (>1,000 languages) ● Automated Similarity Judgment Program (>4,500 languages) 39
  40. 40. Multilingual NLP and Language Representations ● No explicit representation ○ Multilingual Word Embeddings ● Google’s “Enabling zero-shot learning” NMT trick ○ Language given explicitly in input ● One-hot encodings ○ Languages represented as a sparse vector ● Language Embeddings ○ Languages represented as a distributed vector 40 (Östling and Tiedemann, 2017)
  41. 41. Distributed Language Representations 41 • Language Embeddings • Analogous to Word Embeddings • Can be learned in a neural network without supervision
  42. 42. Language Embeddings in Deep Neural Networks 42
  43. 43. Language Embeddings in Deep Neural Networks 43 1. Do language embeddings aid multilingual modelling? 2. Do language embeddings contain typological information?
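A hedged sketch of how a language embedding can enter a multilingual model, in the spirit of Östling and Tiedemann (2017): a learned per-language vector concatenated to every token embedding. The model sizes and the tagging setup below are illustrative, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class MultilingualTagger(nn.Module):
        """A multilingual sequence labeller where each language gets a learned
        embedding, concatenated to every token embedding (hypothetical sizes)."""
        def __init__(self, vocab_size=10000, n_languages=50, n_tags=17,
                     word_dim=100, lang_dim=64, hidden_dim=128):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.lang_emb = nn.Embedding(n_languages, lang_dim)  # one vector per language
            self.lstm = nn.LSTM(word_dim + lang_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, n_tags)

        def forward(self, tokens, lang_id):
            w = self.word_emb(tokens)  # (batch, seq, word_dim)
            l = self.lang_emb(lang_id)[:, None, :].expand(-1, tokens.size(1), -1)
            h, _ = self.lstm(torch.cat([w, l], dim=-1))
            return self.out(h)

    model = MultilingualTagger()
    tags = model(torch.randint(0, 10000, (4, 12)), torch.tensor([3, 3, 17, 17]))
    # The language embedding matrix (model.lang_emb.weight) is what gets probed
    # for typological information after training.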
  44. 44. Research Questions ● RQ 1: Which typological properties are encoded in task-specific distributed language representations, and can we predict phonological, morphological and syntactic properties of languages using such representations? ● RQ 2: To what extent do the encoded properties change as the representations are fine-tuned for tasks at different linguistic levels? ● RQ 3: How are language similarities encoded in fine-tuned language embeddings? 44
  45. 45. Phonological Features 45 ● 20 features ● E.g. descriptions of the consonant and vowel inventories, presence of tone and stress markers
  46. 46. Morphological Features 46 ● 41 features ● Features from morphological and nominal chapter ● E.g. number of genders, usage of definite and indefinite articles and reduplication
  47. 47. Word Order Features 47 ● 56 features ● Encode ordering of subjects, objects and verbs
  48. 48. Experimental Setup 48
    Data
    ● Pre-trained language embeddings (Östling and Tiedemann, 2017)
    ● Task-specific datasets: grapheme-to-phoneme (G2P), phonological reconstruction (ASJP), morphological inflection (SIGMORPHON), part-of-speech tagging (UD)
    Dataset | Class | L_task | L_task ⋂ L_pre
    G2P | Phonology | 311 | 102
    ASJP | Phonology | 4664 | 824
    SIGMORPHON | Morphology | 52 | 29
    UD | Syntax | 50 | 27
  49. 49. Experimental Setup 49 Method ● Fine-tune language embeddings on grapheme-to-phoneme (G2P), phonological reconstruction (ASJP), morphological inflection (SIGMORPHON) and part-of-speech tagging (UD) ○ train supervised seq2seq models for the G2P, ASJP and SIGMORPHON tasks and ○ train a sequence labelling model for the UD task ● Predict typological properties with a kNN model
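A sketch of the kNN prediction step with scikit-learn, using random stand-in embeddings and labels rather than real WALS data; it only shows the mechanics of predicting a typological feature from language embeddings.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in data: one embedding per language and one WALS feature value per
    # language; real experiments use the actual (pre-trained or fine-tuned)
    # language embeddings and WALS annotations.
    rng = np.random.default_rng(0)
    lang_embs = rng.normal(size=(200, 64))   # one 64-d embedding per language
    wals_81a = rng.integers(0, 7, size=200)  # e.g. WALS 81A (order of S, O, V), 7 classes

    train, test = np.arange(150), np.arange(150, 200)

    # k-nearest-neighbour prediction of a typological feature from language embeddings.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(lang_embs[train], wals_81a[train])
    accuracy = knn.score(lang_embs[test], wals_81a[test])
    print(f"kNN accuracy on held-out languages: {accuracy:.2%}")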
  50. 50. Experimental Setup 50 Seq2Seq Model
  51. 51. Experimental Setup 51
    Data Splits
    (i) evaluating on randomly selected language/feature pairs from a task-related feature set: predict task-related features given a random sample of languages
    (ii) evaluating on an unseen language family from a task-related feature set: … given a sample from which a whole language family is omitted
    (iii) evaluating on randomly selected language/feature pairs from all WALS feature sets: compare the task-specific feature encoding with a general one
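For split (ii), a tiny helper with a made-up toy language inventory, showing how an entire language family can be held out for evaluation:

    import numpy as np

    def unseen_family_split(languages, families, held_out_family):
        """Split (ii) from the slide: hold out every language of one family for
        evaluation and train on the rest. Inputs are parallel lists/arrays."""
        families = np.asarray(families)
        test_mask = families == held_out_family
        return np.where(~test_mask)[0], np.where(test_mask)[0]

    # Hypothetical toy inventory, only to show the mechanics.
    langs = ["fin", "hun", "est", "deu", "eng", "spa", "ita"]
    fams  = ["Uralic", "Uralic", "Uralic", "IE", "IE", "IE", "IE"]
    train_idx, test_idx = unseen_family_split(langs, fams, "Uralic")
    print([langs[i] for i in test_idx])  # ['fin', 'hun', 'est']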
  52. 52. Grapheme-to-Phoneme (G2P) 52
    System/Features | Random lang/feat pairs from phon. features | Unseen lang family from phon. features | Random lang/feat pairs from all features
    Most frequent class | *75.46% | 65.57% | 79.9%
    k-NN (pre-trained) | 71.45% | *86.45% | 80.39%
    k-NN (fine-tuned) | 71.66% | 82.36% | 79.17%
  53. 53. Grapheme-to-Phoneme (G2P) 53 Hypothesis: grapheme-to-phoneme is more related to diachronic development of the writing system than it is to genealogy or phonology Example: - Norwegian & Danish -- almost same orthography, different phonology - Pre-trained language embeddings should be very similar and diverge during training - Comparison to typologically distant languages Finnish and Tagalog
  54. 54. Phonological Reconstruction (ASJP) 54
    - Fine-tuned language embeddings do not outperform MFC
    - Reason: the task is very similar for each language
    - However: evaluation on an unseen language family with pre-trained embeddings outperforms MFC -> language embeddings encode features that are to some extent relevant to phonology
    System/Features | Random lang/feat pairs from phon. features | Unseen lang family from phon. features | Random lang/feat pairs from all features
    Most frequent class | *59.39% | 63.71% | *58.12%
    k-NN (pre-trained) | 53.02% | *77.44% | 51.6%
    k-NN (fine-tuned) | 53.09% | *77.45% | 51.9%
  55. 55. Morphological Inflection (SIGMORPHON) 55
    - Fine-tuned language embeddings outperform MFC -> pre-trained and fine-tuned language embeddings encode features relevant to morphology
    System/Features | Random lang/feat pairs from morph. features | Unseen lang family from morph. features | Random lang/feat pairs from all features
    Most frequent class | 77.98% | 85.68% | 84.12%
    k-NN (pre-trained) | 74.49% | 88.83% | 84.97%
    k-NN (fine-tuned) | *82.91% | *91.92% | 84.95%
  56. 56. Morphological Inflection (SIGMORPHON) 56 Example: - Languages with the same number of cases might benefit from parameter sharing - Feature 38A mainly encodes whether the indefinite word is distinct from the word for 'one' -> not surprising that this is not learned in morphological inflection
  57. 57. Part-of-Speech Tagging (UD) 57
    - Improvements for all experimental settings -> pre-trained and fine-tuned language embeddings encode features relevant to word order
    System/Features | Random lang/feat pairs from word order features | Unseen lang family from word order features | Random lang/feat pairs from all features
    Most frequent class | 67.81% | 82.47% | 82.93%
    k-NN (pre-trained) | 76.66% | 92.76% | 82.69%
    k-NN (fine-tuned) | *80.81% | *94.48% | 83.55%
  58. 58. Part-of-Speech Tagging (UD) 58 • Hierarchical clustering of language embeddings • Language-modelling-based language embeddings: English clusters with the Romance languages (large amount of Romance vocabulary) • PoS-based language embeddings: English clusters with the Germanic languages (morpho-syntactically more similar) (Bjerva and Augenstein, 2018)
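A brief sketch of hierarchical clustering of language embeddings with SciPy; the embeddings and language codes below are placeholders rather than the paper's fine-tuned vectors.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Stand-in fine-tuned language embeddings; in the paper these come from
    # the PoS-tagging or language-modelling models.
    rng = np.random.default_rng(1)
    langs = ["eng", "deu", "nld", "fra", "spa", "ita"]
    embs = rng.normal(size=(len(langs), 64))

    # Hierarchical (agglomerative) clustering of the language embeddings,
    # as used on the slide to compare PoS-based and LM-based groupings.
    Z = linkage(embs, method="average", metric="cosine")
    dendrogram(Z, labels=langs, no_plot=True)  # set no_plot=False to draw the tree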
  59. 59. Conclusions 59 - Whether features are encoded depends on the target task - Does not work for phonological tasks - Works for morphological inflection and PoS tagging - We can predict typological features for unseen language families with high accuracy - Changes in the features encoded in language embeddings are relatively monotonic -> convergence towards a single optimum - G2P task: phonological differences between otherwise similar languages (e.g. Norwegian Bokmål and Danish) are accurately encoded
  60. 60. Future work 60 • Improve multilingual modelling • E.g., share morphologically relevant parameters for morphologically similar languages • Complete WALS using language embeddings and KBP methods
  61. 61. Conclusions: This Talk 61 Part 1: General Challenges - Method: multitask and semi-supervised learning, label embeddings - Application: very similar pairwise sequence classification tasks - Take-away: multi-task learning 😀 + label embeddings 😀 + semi-supervised learning (same data) 😐 Part 2: Multilingual and Diversity Aspects - Method: unsupervised and transfer learning - Application: predicting linguistic features of languages - Take-away: language embeddings facilitate zero-shot learning + encode linguistic features + the encoding varies depending on the NLP task they are fine-tuned for
  62. 62. Presented Papers Isabelle Augenstein, Sebastian Ruder, Anders Søgaard. Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces. NAACL HLT 2018 (long), to appear Johannes Bjerva, Isabelle Augenstein. Tracking Typological Traits of Uralic Languages in Distributed Language Representations. Fourth International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2018), January 2018 Johannes Bjerva, Isabelle Augenstein. Unsupervised Linguistic Typology at Different Levels with Language Embeddings. NAACL HLT 2018 (long), to appear 62
  63. 63. Thank you! augenstein@di.ku.dk @IAugenstein 63
