Neural Semi-supervised Learning under Domain Shift

An overview of my work on semi-supervised learning under domain shift using neural networks.

Neural Semi-supervised Learning under Domain Shift

  1. 1. Neural Semi-supervised Learning under
 Domain Shift Sebastian Ruder
  2. 2. ‣ Across domains
 (Ruder & Plank, EMNLP 2017, ACL 2018;
 Howard* & Ruder*, ACL 2018) ‣ Across tasks
 (Ruder et al., arXiv 2017;
 Augenstein* & Ruder* et al., NAACL 2018) ‣ Across languages
 (Ruder et al., JAIR 2018;
 Søgaard, Ruder & Vulić, ACL 2018;
 Ruder* & Cotterell* et al., EMNLP 2018;
 Kementchedjhieva, Ruder et al., CoNLL 2018) 2 Research overview: Transfer learning * equal contribution
  3. 3. ‣ Across domains
 (Ruder & Plank, EMNLP 2017, ACL 2018;
 Howard* & Ruder*, ACL 2018) 3 Research overview: Transfer learning * equal contribution Beware of non-i.i.d. data! ‣ We never know how well our models truly generalise if we just test them on data of the same distribution. ‣ CIFAR-10 classifiers don’t even generalise to CIFAR-10 (Recht et al., 2018). ‣ “A challenge to the community: we should evaluate on out-of- distribution data or on a new task.”
 - Percy Liang, DeepGen workshop, NAACL-HLT 2018 Recht, B., Roelofs, R., Schmidt, L., & Berkeley, U. C. (2018). Do CIFAR-10 Classifiers Generalize to CIFAR-10 ?
  4. 4. 4 Learning under Domain Shift Labeled source data Different task Unlabeled target data How to select the most relevant data?
  5. 5. 5 Data setting 1:
 Multiple source domains Target domain Source domains Ruder, S., & Plank, B. (2017). Learning to select data for transfer learning with Bayesian Optimization. In Proceedings of EMNLP 2017.
  6. 6. 6 Background Why select data for domain adaptation at all? Why don’t we just train on all source data? ‣ Prevent negative transfer for dissimilar domains. ‣ e.g. “electrifying” is positive in one domain, but negative in another. Existing approaches ‣ use a single similarity metric in isolation; ‣ focus on a single task.
  7. 7. Intuition ‣ Different tasks and domains require different notions of similarity. Idea ‣ Learn a data selection policy using Bayesian Optimisation. 7 Our approach
  8. 8. 8 Our approach [Diagram: training examples x_1, …, x_n are scored with the selection policy S = ϕ(x)⊤w, sorted, and the top m examples are selected.] ‣ Related: curriculum learning (Tsvetkov et al., 2016) Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
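The policy itself is just a linear scorer over hand-crafted features. Below is a minimal sketch (not the paper's code), assuming the feature matrix ϕ(x) and the policy weights w are given:

```python
import numpy as np

def select_examples(features, w, m):
    """Score each source example with the linear policy S = phi(x)^T w
    and keep the m highest-scoring ones."""
    scores = features @ w              # S_i = phi(x_i)^T w
    return np.argsort(-scores)[:m]     # indices of the top-m examples

# Hypothetical usage: 6,000 source examples, 12 similarity/diversity features
phi = np.random.randn(6000, 12)        # phi(x) for each training example
w = np.random.randn(12)                # policy weights proposed by BO
selected = select_examples(phi, w, m=2000)
```

In the actual framework, w is the weight vector proposed by Bayesian Optimisation at each iteration (next slides).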
  9. 9. ‣ Treat the objective as a black box, which we iteratively approximate. ‣ Use Bayesian Optimisation (BO) to obtain the best parameter setting.
 cf. Fang & Cohn (2017) who use RL for selecting data for active learning ‣ Sample-efficient; only need about 100-200 samples to converge. ‣ BO is typically used for hyper-parameter tuning (Snoek et al., 2012; Melis et al., 2018). ‣ Alternative: Learn latent permutation with Sinkhorn operator (Adams and Zemel, 2011; Mena et al., 2018) 9 Learning the data selection policy Fang, M., Li, Y., & Cohn, T. (2017). Learning how to Active Learn: A Deep Reinforcement Learning Approach. In Proceedings of EMNLP 2017. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of NIPS 2012. Melis, G., Dyer, C., & Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR 2018. Mena, G. E., Belanger, D., Linderman, S., & Snoek, J. (2018). Learning Latent Permutations with Gumbel-Sinkhorn Networks. In Proceedings of ICLR 2018.
  10. 10. 10 Optimisation framework [Diagram: feature extraction ϕ(X) → scoring & sorting with w_t → selected data S_t → task model training M_t on X_t → evaluation ŷ on the validation set → Bayesian Optimisation proposes w_{t+1}.]
  11. 11. 11 Bayesian Optimisation Two important choices ‣ Surrogate model: used to approximate the objective function, e.g. a Gaussian Process (GP) ‣ Acquisition function: proposes new samples and trades off exploration vs. exploitation, e.g. Expected Improvement (EI) Procedure: ‣ Sample the next weight vector w_t by optimising the acquisition function u over the GP: w_t = arg max_w u(w | D_{1:t−1}) ‣ Obtain a noisy validation score ŷ_t from the trained model ‣ Append the sample to D_{1:t} = {D_{1:t−1}, (w_t, ŷ_t)} and update the GP
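To make the loop concrete, here is a hedged sketch of Bayesian Optimisation with a GP surrogate and Expected Improvement, using scikit-learn's GaussianProcessRegressor. The black-box `validation_score` function (select data with w, train the task model, evaluate on the validation set) and the search bounds are assumptions of this sketch, not part of the slides:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_y, xi=0.01):
    """EI acquisition function: trades off exploitation (high mu) and
    exploration (high sigma) relative to the best score seen so far."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(validation_score, dim, n_iter=100, n_cand=1000, seed=0):
    """Minimal BO loop over data-selection weights w.
    `validation_score(w)` is the expensive black box: select data with w,
    train the task model, and return the (noisy) validation score."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(3, dim))                 # a few random initial points
    y = np.array([validation_score(w) for w in W])
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(W, y)                                      # update the GP surrogate
        cand = rng.uniform(-1, 1, size=(n_cand, dim))     # optimise EI by random search
        w_next = cand[np.argmax(expected_improvement(cand, gp, y.max()))]
        y_next = validation_score(w_next)                 # noisy validation score
        W, y = np.vstack([W, w_next]), np.append(y, y_next)
    return W[np.argmax(y)], y.max()
```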
  12. 12. 12 Features ‣ Treat each source example and the entire target domain as distributions P and Q based on term and topic probabilities. Similarity features: Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance. Diversity features: # word types, type-token ratio, entropy, Simpson's index, Rényi entropy, quadratic entropy.
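As an illustration of how such features can be computed, the sketch below implements one similarity feature (Jensen-Shannon divergence between term distributions) and one diversity feature (type-token ratio); the exact definitions and smoothing used in the paper may differ:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two term (or topic) distributions,
    used as a similarity feature between a source example and the target domain."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def type_token_ratio(tokens):
    """A simple diversity feature: number of word types over number of tokens."""
    return len(set(tokens)) / max(len(tokens), 1)

# Hypothetical usage with toy term counts over a shared vocabulary
example_dist = [3, 0, 1, 2]        # term counts of one source example
target_dist = [10, 5, 2, 1]        # term counts of the target domain
similarity = 1 - js_divergence(example_dist, target_dist)
```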
  13. 13. 13 Data & Tasks Three tasks: Domains: Sentiment analysis on Amazon reviews dataset (Blitzer et al., 2007) POS tagging and dependency parsing on SANCL 2012 dataset (Petrov and McDonald, 2012) Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL 2007. Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
  14. 14. 14 Sentiment Analysis Results Selecting 2,000 of 6,000 source domain examples. [Chart: accuracy (%) on Book, DVD, Electronics, and Kitchen target domains for random selection, JS divergence (examples), JS divergence (domain), similarity (topics), diversity, similarity + diversity, and all source data (6,000 examples).] ‣ Selecting relevant data is useful when domains are very different.
  15. 15. 15 POS Tagging Results Selecting 2,000 of 14-17.5k source domain examples. [Chart: accuracy (%) on Answers, Emails, Newsgroups, Reviews, Weblogs, and WSJ for JS divergence (examples), similarity (terms), diversity, similarity + diversity, and all source data.] ‣ Learned data selection outperforms static selection, but is less useful when domains are very similar.
  16. 16. 16 Dependency Parsing Results Selecting 2,000 of 14-17.5k source domain examples. [Chart: labeled attachment score (LAS) on Answers, Emails, Newsgroups, Reviews, Weblogs, and WSJ for JS divergence (examples), similarity (terms), diversity, similarity + diversity, and all source data.]
  17. 17. 17 Cross-Model Transfer Results Training a BiLSTM with the policy learned by a BiLSTM vs. by a Structured Perceptron for POS tagging. [Chart: accuracy (%) on Answers, Emails, Newsgroups, Reviews, Weblogs, and WSJ for BiLSTM (similarity + diversity) and Structured Perceptron (similarity + diversity).] ‣ The data selection policy can be learned with a cheap model and transferred to more expensive models.
  18. 18. ‣ Bayesian Optimisation is an efficient way to optimise an expensive function, e.g. order of training examples. ‣ Different domains & tasks have different notions of similarity. ‣ Preferring certain examples is mainly useful when domains are dissimilar. ‣ Diversity complements similarity. ‣ The learned policy transfers (to some extent) across models, tasks, and domains. 18 Takeaways …
  19. 19. 19 Learning under Domain Shift Labeled source data Unlabeled target data How well does SSL work with NNs?
  20. 20. 20 Data setting 2:
 Single source domain Target domain Source domain Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In Proceedings of ACL 2018.
  21. 21. ‣ State-of-the-art domain adaptation approaches ‣ leverage task-specific features ‣ evaluate on proprietary datasets or on a single benchmark ‣ Only compare against weak baselines ‣ Almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature 21 Learning under Domain Shift
  22. 22. ‣ How do classics in SSL compare to recent advances? ‣ Can we combine the best of both worlds? ‣ How well do these approaches work on out-of-distribution data? 22 Revisiting Semi-Supervised Learning Classics in a Neural World
  23. 23. • Self-training • (Co-training*) • Tri-training • Tri-training with disagreement Bootstrapping algorithms * used in concurrent work: Wu, J., Li, L., & Wang, W. Y. (2018). Reinforced Co-Training. In Proceedings of NAACL-HLT 2018.
  24. 24. 24 Self-training 1. Train a model on labeled data. 2. Use confident predictions on unlabeled data as training examples. Repeat. ‣ Drawback: error amplification. ‣ Mixed success in NLP. Some recent success in CV (Radosavovic et al., 2018). Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards Omni-Supervised Learning. In Proceedings of CVPR 2018.
  25. 25. 25 Self-training variants ‣ Calibration ‣ Output probabilities in neural networks are poorly calibrated. ‣ Throttling (Abney, 2007), i.e. selecting only the top n highest-confidence unlabeled examples, works best. ‣ Online learning ‣ Training until convergence on labeled data and then on unlabeled data works best.
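A minimal sketch of self-training with throttling, assuming an sklearn-style classifier with `fit` and `predict_proba`; the number of rounds and the top-n setting are placeholders:

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, n_top=100, n_rounds=5):
    """Self-training with throttling: after training on labeled data, repeatedly
    add only the top-n most confident pseudo-labels to the training set."""
    model.fit(X_lab, y_lab)
    for _ in range(n_rounds):
        probs = model.predict_proba(X_unlab)
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        top = np.argsort(-conf)[:n_top]                 # throttling: keep top n only
        X_lab = np.vstack([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, pseudo[top]])
        X_unlab = np.delete(X_unlab, top, axis=0)
        model.fit(X_lab, y_lab)                         # errors can amplify over rounds
    return model
```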
  26. 26. 26 Tri-training 1. Train three models on bootstrapped samples. 2. Use predictions on unlabeled data as training examples for the third model if the other two agree. [Diagram: two models predict y = 1 for x, so (x, y = 1) is added to the third model's training data.]
  27. 27. 27 Tri-training 1. Train three models on bootstrapped samples. 2. Use predictions on unlabeled data as training examples for the third model if the other two agree. 3. Final prediction: majority voting. [Diagram: the models predict y = 1, y = 1, and y = 0 for x; the majority vote is y = 1.]
  28. 28. 28 Tri-training with disagreement 1. Train three models on bootstrapped samples. 2. Use predictions on unlabeled data as training examples for the third model if the other two agree and the third model's prediction differs. [Diagram: two models predict y = 1 for x while the third predicts y = 0, so (x, y = 1) is added to the third model's training data.] ‣ Drawback: requires 3 independent models.
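For concreteness, here is a sketch of one tri-training round and the majority-vote prediction, again assuming sklearn-style models; the `disagreement` flag switches to the tri-training-with-disagreement variant described above:

```python
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def tri_train_round(models, X_lab, y_lab, X_unlab, disagreement=False):
    """One tri-training round: each model is retrained on a bootstrap sample of
    the labeled data plus the unlabeled examples on which the other two models
    agree; with disagreement=True, only examples where the model itself
    currently disagrees with that agreement are added."""
    preds = [m.predict(X_unlab) for m in models]
    updated = []
    for i in range(3):
        j, k = [x for x in range(3) if x != i]
        mask = preds[j] == preds[k]                      # the other two agree
        if disagreement:
            mask &= preds[i] != preds[j]                 # ...and model i disagrees
        Xb, yb = resample(X_lab, y_lab)                  # bootstrap sample
        X_train = np.vstack([Xb, X_unlab[mask]])
        y_train = np.concatenate([yb, preds[j][mask]])
        updated.append(clone(models[i]).fit(X_train, y_train))
    return updated

def tri_predict(models, X):
    """Final prediction: majority vote over the three models."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```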
  29. 29. 29 Tri-training hyper-parameters ‣ Sampling unlabeled data ‣ Producing predictions for all unlabeled examples is expensive ‣ Sample a number of unlabeled examples instead ‣ Confidence thresholding ‣ Not effective for classic approaches, but essential for our method
  30. 30. 30 Multi-task tri-training 1. Train one model with 3 objective functions. 2. Use predictions on unlabeled data as training examples for the third objective if the other two agree. 3. Restrict the final layers to use different representations. 4. Train the third objective function only on pseudo-labeled data to bridge the domain shift. [Diagram: two output layers predict y = 1 for x, so (x, y = 1) is used to train the third output layer.]
  31. 31. 31 Multi-task tri-training [Diagram: a shared character and word BiLSTM (Plank et al., 2016) over inputs w1, w2, w3 feeds three softmax output layers m1, m2, m3; m1 and m2 are subject to an orthogonality constraint (Bousmalis et al., 2016).] Loss: L(θ) = −∑_i ∑_{1,…,n} log P_{m_i}(y | h⃗) + γ L_orth, with L_orth = ‖W_{m1}⊤ W_{m2}‖²_F
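The orthogonality term is straightforward to implement. A small PyTorch sketch with placeholder layer sizes:

```python
import torch

def orthogonality_penalty(W_m1, W_m2):
    """L_orth = || W_m1^T W_m2 ||_F^2 (Bousmalis et al., 2016): encourages the
    output layers of m1 and m2 to use different (orthogonal) representations."""
    return torch.norm(W_m1.t() @ W_m2, p="fro") ** 2

# Hypothetical shapes: softmax weight matrices of two output layers,
# hidden_dim x n_tags (the sizes here are placeholders, not the paper's).
W_m1 = torch.randn(256, 17, requires_grad=True)
W_m2 = torch.randn(256, 17, requires_grad=True)
penalty = orthogonality_penalty(W_m1, W_m2)   # added to the task loss, weighted by gamma
```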
  32. 32. 32 Data & Tasks Two tasks: Domains: Sentiment analysis on Amazon reviews dataset (Blitzer et al., 2007) POS tagging on SANCL 2012 dataset (Petrov and McDonald, 2012)
  33. 33. 33 Sentiment Analysis Results [Chart: accuracy averaged over 4 target domains for VFAE*, DANN*, Asym*, source only, self-training, tri-training, tri-training with disagreement, and MT-Tri. * results from Saito et al. (2017)] ‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.
  34. 34. 34 POS Tagging Results Trained on 10% of the labeled data (WSJ). [Chart: accuracy averaged over 5 target domains for source (+embeds), self-training, tri-training, tri-training with disagreement, and MT-Tri.] ‣ Tri-training with disagreement works best with little data.
  35. 35. 35 POS Tagging Results Trained on the full labeled data (WSJ). [Chart: accuracy averaged over 5 target domains for TnT, Stanford*, source (+embeds), tri-training, tri-training with disagreement, and MT-Tri. * result from Schnabel & Schütze (2014)] ‣ Tri-training works best in the full data setting.
  36. 36. 36 POS Tagging Analysis Accuracy on out-of-vocabulary (OOV) tokens. [Chart: OOV accuracy and % OOV tokens on Answers, Emails, Newsgroups, Reviews, and Weblogs for Src, Tri, and MT-Tri.] ‣ Classic tri-training works best on OOV tokens. ‣ MT-Tri does worse than the source-only baseline on OOV.
  37. 37. 37 POS Tagging Analysis POS accuracy per binned log frequency. [Chart: accuracy delta vs. the source-only baseline across frequency bins 0-14 for Tri and MT-Tri.] ‣ Tri-training works best on low-frequency tokens (leftmost bins).
  38. 38. 38 POS Tagging Analysis Accuracy on unknown word-tag (UWT) tokens, i.e. very difficult cases. [Chart: UWT accuracy and UWT rate on Answers, Emails, Newsgroups, Reviews, and Weblogs for Src, Tri, MT-Tri, and FLORS*. * result from Schnabel & Schütze (2014)] ‣ No bootstrapping method works well on unknown word-tag combinations. ‣ The less lexicalized FLORS approach is superior.
  39. 39. 39 Takeaways ‣ Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis. ‣ We address the main drawback of tri-training (space & time complexity) with the proposed MT-Tri model. ‣ MT-Tri works best on sentiment, but not for POS tagging. ‣ Importance of: ‣ comparing neural methods to classics (strong baselines) ‣ evaluation on multiple tasks & domains
  40. 40. 40 Learning under Domain Shift Labeled source data Different task Unlabeled target data How can we leverage pretrained LMs?
  41. 41. 41 Data setting 3:
 Different target task Target domain
 Target task Source domain
 Source task Howard, J.*, & Ruder, S.* (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018. *: equal contribution
  42. 42. ‣ Best practice: initialise first layer with pretrained word embeddings ‣ Recent approaches (McCann et al., 2017; Peters et al., 2018): Pretrained embeddings as fixed features. Peters et al. (2018) is task-specific. ‣ Why not initialise remaining parameters? ‣ Dai and Le (2015) first proposed fine-tuning a LM. However: No pretraining. Naive fine-tuning. 42 Transfer learning for NLP status quo McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in Translation: Contextualized Word Vectors. In Proceedings of NIPS 2017. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018. Dai, A. M., & Le, Q. V. (2015). Semi-supervised Sequence Learning. In Proceedings of NIPS 2015.
  43. 43. 43 Universal Language Model Fine-tuning (ULMFiT) 3-step recipe: 1. Train a language model (LM) on general-domain data. 2. Fine-tune the LM on target data. 3. Train a classifier on labeled data on top of the LM.
  44. 44. ‣ Model: AWD-LSTM language model ‣ 3-layer LSTM ‣ Tuned dropout hyperparameters ‣ Data: WikiText-103 ‣ 103 million tokens of Wikipedia text ‣ Train for ~24 hours on a Tesla V100 ‣ Recently: deeper models, trained on more data, for longer (Radford et al., 2018) 44 Language Model Pretraining Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
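As a rough illustration of this kind of model (not the actual AWD-LSTM implementation), here is a minimal 3-layer LSTM language model in PyTorch; the real AWD-LSTM adds weight-dropped recurrent weights, tied embeddings, and further regularisers, and the sizes below are indicative only:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """A minimal 3-layer LSTM language model in the spirit of AWD-LSTM."""
    def __init__(self, vocab_size, emb_dim=400, hidden_dim=1150, dropout=0.4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=3, dropout=dropout)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):               # tokens: (seq_len, batch) of word ids
        hidden, _ = self.lstm(self.embed(tokens))
        return self.proj(hidden)             # logits over the vocabulary
```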
  45. 45. 45 Language Model Fine-tuning ‣ Discriminative fine-tuning
 Different layers capture different types of information, so they should be fine-tuned to different extents: θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l} J(θ) ‣ Slanted triangular learning rates
 The model should converge quickly to a suitable region of parameter space and then refine its parameters.
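A sketch of both techniques in PyTorch: the slanted triangular schedule and discriminative fine-tuning via per-layer parameter groups (the 2.6 decay factor is the one used in the paper; the layer list and base learning rate are placeholders):

```python
import torch

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate: increase the LR linearly for a short
    warm-up (cut_frac of all T updates), then decay it linearly."""
    cut = max(1, int(T * cut_frac))
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Discriminative fine-tuning: one parameter group per layer, with the learning
# rate of each lower layer divided by 2.6. `layers` is a stand-in for the LM's layers.
layers = [torch.nn.Linear(16, 16) for _ in range(4)]
eta_last = 0.01
param_groups = [
    {"params": layer.parameters(), "lr": eta_last / 2.6 ** (len(layers) - 1 - i)}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=eta_last)
# At update t, the per-layer rates would be rescaled with slanted_triangular_lr(t, T).
```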
  46. 46. 46 Classifier Fine-tuning ‣ Concat pooling
 Concatenate pooled representations of the hidden states to capture long document contexts: h_c = [h_T, maxpool(H), meanpool(H)] ‣ Gradual unfreezing
 Gradually unfreeze the layers starting from the last layer to prevent catastrophic forgetting. ‣ Bidirectional language model
 Pretrain both forward and backward LMs and fine-tune them independently.
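The first two ingredients are easy to show in code. A sketch assuming a hidden-state matrix H of shape (seq_len, hidden_dim) and a placeholder list of model layers:

```python
import torch

def concat_pool(H):
    """Concat pooling: h_c = [h_T, maxpool(H), meanpool(H)].
    H is the matrix of LM hidden states with shape (seq_len, hidden_dim)."""
    return torch.cat([H[-1], H.max(dim=0).values, H.mean(dim=0)], dim=-1)

def gradually_unfreeze(layers, epoch):
    """Gradual unfreezing: at epoch e only the last e + 1 layers are trainable,
    starting from the last layer, to limit catastrophic forgetting."""
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - (epoch + 1)
        for p in layer.parameters():
            p.requires_grad = trainable

pooled = concat_pool(torch.randn(200, 400))   # -> tensor of shape (1200,)
```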
  47. 47. 47 Data & Tasks
 Dataset    # classes  # examples
 TREC-6     6          5.5k
 IMDb       2          25k
 Yelp-bi    2          560k
 Yelp-full  5          650k
 AG News    4          120k
 DBpedia    14         560k
  48. 48. 48 Results Previous SOTA vs. ULMFiT, error rate (%):
 Dataset    Previous SOTA  ULMFiT
 IMDb       5.9            4.6
 TREC-6     3.9            3.6
 AG News    6.57           5.01
 DBpedia    0.84           0.80
 Yelp-bi    2.64           2.16
 Yelp-full  30.58          29.98
 ‣ ULMFiT outperforms the previous state of the art by a significant margin on many of the datasets.
  49. 49. 49 Few-shot Learning [Chart: error rate (%) vs. number of training examples (100 to 20k on IMDb, 100 to 108k on AG-News) for training from scratch, ULMFiT supervised, and ULMFiT semi-supervised.] ‣ With 100 labeled examples, ULMFiT matches the performance of training from scratch with 10x and 20x more data. ‣ With 50-100k additional unlabeled examples, it matches the performance of training with 50x and 20x more data.
  50. 50. ‣ Proposed a general approach for fine-tuning a pretrained language model. ‣ Proposed new techniques to reduce catastrophic forgetting during fine-tuning. ‣ Approach achieves new SOTA on 6 text classification tasks. ‣ Very sample-efficient. 50 Takeaways
  51. 51. ‣ In order to understand how well our models truly generalise, we need to measure their performance on out- of-distribution data. ‣ It is important to evaluate our models on different domains and tasks. ‣ Using pretrained language models is an effective way of doing transfer / semi-supervised learning (SSL). ‣ Can be complemented by “explicit” SSL. We can take lessons from traditional approaches. ‣ Dealing with stark domain differences is still a challenge and requires ways to explicitly avoid negative transfer. 51 Final Takeaways
