
Strong Baselines for Neural Semi-supervised Learning under Domain Shift

Oral presentation given at ACL 2018 about our paper Strong Baselines for Neural Semi-supervised Learning under Domain Shift (http://aclweb.org/anthology/P18-1096).


  1. 1. Strong Baselines for Neural Semi-supervised Learning under Domain Shift (Sebastian Ruder, Barbara Plank)
  2. 2. Learning under Domain Shift ‣ State-of-the-art domain adaptation approaches ‣ leverage task-specific features ‣ evaluate on proprietary datasets or on a single benchmark ‣ only compare against weak baselines ‣ Almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature
  3. 3. Revisiting Semi-Supervised Learning Classics in a Neural World ‣ How do classics in SSL compare to recent advances? ‣ Can we combine the best of both worlds? ‣ How well do these approaches work on out-of-distribution data?
  4. 4. Bootstrapping algorithms • Self-training • (Co-training) • Tri-training • Tri-training with disagreement
  5. 5. Self-training 1. Train a model on labeled data. 2. Use confident predictions on unlabeled data as training examples. Repeat. Drawback: error amplification.
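The loop above can be written down compactly. Below is a minimal sketch, assuming a scikit-learn-style classifier with predict_proba; the confidence threshold and number of rounds are illustrative assumptions, not the settings used in the talk.

import numpy as np
from sklearn.base import clone

def self_train(base_model, X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Repeatedly add confident predictions on unlabeled data as pseudo-labels."""
    X_train, y_train, X_pool = X_lab, y_lab, X_unlab
    for _ in range(rounds):
        clf = clone(base_model).fit(X_train, y_train)      # 1. train on labeled (+pseudo) data
        if len(X_pool) == 0:
            break
        probs = clf.predict_proba(X_pool)
        confident = probs.max(axis=1) >= threshold          # 2. keep only confident predictions
        if not confident.any():
            break
        pseudo_y = clf.classes_[probs[confident].argmax(axis=1)]
        X_train = np.concatenate([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, pseudo_y])
        X_pool = X_pool[~confident]                         # repeat on the remaining pool
    return clone(base_model).fit(X_train, y_train)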
  6. 6. Self-training variants ‣ Calibration ‣ Output probabilities in neural networks are poorly calibrated. ‣ Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best. ‣ Online learning ‣ Training until convergence on labeled data and then on unlabeled data works best.
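A hedged sketch of throttling: instead of the fixed probability threshold used above, keep only the n most confident pool predictions per round (n is an assumed hyper-parameter).

import numpy as np

def throttle(probs, n=100):
    """Indices of the n unlabeled examples with the highest predicted probability."""
    confidence = probs.max(axis=1)
    return np.argsort(-confidence)[:n]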
  7. 7. Tri-training 1. Train three models on bootstrapped samples. 2. If two models agree on an unlabeled example, use their prediction as a training label for the third.
  8. 8. Tri-training 1. Train three models on bootstrapped samples. 2. If two models agree on an unlabeled example, use their prediction as a training label for the third. 3. Final prediction: majority voting.
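A compact sketch of steps 1-3 above, under the same scikit-learn-style assumptions as the self-training sketch (bootstrap samples via sklearn.utils.resample; the number of rounds is illustrative).

import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def tri_train(base_model, X_lab, y_lab, X_unlab, rounds=5):
    # 1. Train three models on bootstrapped samples of the labeled data.
    models = [clone(base_model).fit(*resample(X_lab, y_lab, random_state=s)) for s in range(3)]
    for _ in range(rounds):
        preds = [m.predict(X_unlab) for m in models]
        retrained = []
        for i in range(3):
            j, k = [n for n in range(3) if n != i]
            # 2. If the other two models agree on an unlabeled example,
            #    their prediction becomes a pseudo-label for model i.
            agree = preds[j] == preds[k]
            X_i = np.concatenate([X_lab, X_unlab[agree]])
            y_i = np.concatenate([y_lab, preds[j][agree]])
            retrained.append(clone(base_model).fit(X_i, y_i))
        models = retrained
    return models

def majority_vote(models, X):
    # 3. Final prediction: majority vote over the three models.
    p1, p2, p3 = (m.predict(X) for m in models)
    return np.where((p1 == p2) | (p1 == p3), p1, p2)   # ties broken arbitrarily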
  9. 9. Tri-training with disagreement 1. Train three independent models on bootstrapped samples. 2. If two models agree on an unlabeled example and the third model's prediction differs, use their prediction as a training label for the third.
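Relative to the tri-training sketch above, only the selection condition changes: a pseudo-label is added for model i when the other two models agree and model i itself currently disagrees. A hedged one-liner:

import numpy as np

def disagreement_mask(pred_i, pred_j, pred_k):
    """Select examples where m_j and m_k agree but m_i predicts a different label."""
    return (pred_j == pred_k) & (pred_i != pred_j)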
  10. 10. Tri-training hyper-parameters ‣ Sampling unlabeled data ‣ Producing predictions for all unlabeled examples is expensive ‣ Instead, sample a fixed number of unlabeled candidates each round ‣ Confidence thresholding ‣ Not effective for the classic approaches, but essential for our method
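Both tricks are small additions to the sketches above; here is a hedged illustration (the pool size and threshold values are assumptions, not the tuned settings).

import numpy as np

def sample_candidates(X_unlab, n=10_000, seed=0):
    """Score a random subset of the unlabeled pool instead of all of it."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_unlab), size=min(n, len(X_unlab)), replace=False)
    return X_unlab[idx]

def confident_agreement(model_j, model_k, X, threshold=0.9):
    """Agreement mask that also requires the agreeing prediction to be confident."""
    pred_j, pred_k = model_j.predict(X), model_k.predict(X)
    conf_j = model_j.predict_proba(X).max(axis=1)
    return (pred_j == pred_k) & (conf_j >= threshold)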
  11. 11. Multi-task tri-training 1. Train one model with three objective functions. 2. If two objectives agree on an unlabeled example, use their prediction as a training label for the third. 3. Restrict the final layers to use different representations. 4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.
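A hedged architectural sketch in PyTorch: one shared encoder with three output layers standing in for m1, m2, m3. The talk uses a hierarchical character + word BiLSTM tagger (Plank et al., 2016); the plain word-level BiLSTM and all sizes below are simplifying assumptions.

import torch
import torch.nn as nn

class MTTri(nn.Module):
    """Shared BiLSTM encoder with three output heads m1, m2, m3."""
    def __init__(self, vocab_size=10_000, emb_dim=64, hidden=128, n_labels=17):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # three "models" sharing one representation but using separate final layers
        self.heads = nn.ModuleList([nn.Linear(2 * hidden, n_labels) for _ in range(3)])

    def forward(self, tokens):                     # tokens: (batch, seq_len) word ids
        h, _ = self.encoder(self.embed(tokens))    # h: (batch, seq_len, 2 * hidden)
        return [head(h) for head in self.heads]    # per-token logits for m1, m2, m3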
  12. 12. Multi-task tri-training [Figure: character + word BiLSTM encoder (Plank et al., 2016) with three output layers m1, m2, m3 per token; an orthogonality constraint (Bousmalis et al., 2016) is applied between m1 and m2.] Loss: L(θ) = − Σ_i Σ_{n=1..3} log P_{m_n}(y | h⃗) + γ L_orth, where L_orth = ‖W_{m1}ᵀ W_{m2}‖²_F.
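A minimal sketch of this loss, assuming the MTTri module above: the orthogonality penalty is computed on the weight matrices of the m1 and m2 heads (e.g. model.heads[0].weight and model.heads[1].weight), and γ is an assumed value, not the one tuned in the paper.

import torch
import torch.nn.functional as F

def orthogonality_penalty(w_m1, w_m2):
    """L_orth = || W_m1^T W_m2 ||_F^2, pushing m1 and m2 toward different features."""
    return torch.norm(w_m1.t() @ w_m2, p="fro") ** 2

def mt_tri_loss(logits_per_head, labels, w_m1, w_m2, gamma=0.01):
    """Negative log-likelihood summed over the three heads plus gamma * L_orth."""
    nll = sum(F.cross_entropy(logits.flatten(0, 1), labels.flatten())
              for logits in logits_per_head)
    return nll + gamma * orthogonality_penalty(w_m1, w_m2)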
  13. 13. Data & Tasks Two tasks and their domains: sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2006) and POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012).
  14. 14. Sentiment Analysis Results [Chart: accuracy averaged over the 4 target domains for VFAE*, DANN*, Asym*, source only, self-training, tri-training, tri-training with disagreement, and MT-Tri; * results from Saito et al. (2017).] ‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.
  15. 15. POS Tagging Results (trained on 10% of the labeled WSJ data) [Chart: accuracy averaged over the 5 target domains for source (+embeddings), self-training, tri-training, tri-training with disagreement, and MT-Tri.] ‣ Tri-training with disagreement works best with little data.
  16. 16. POS Tagging Results (trained on the full labeled WSJ data) [Chart: accuracy averaged over the 5 target domains for TnT, Stanford*, source (+embeddings), tri-training, tri-training with disagreement, and MT-Tri; * result from Schnabel & Schütze (2014).] ‣ Tri-training works best in the full-data setting.
  17. 17. POS Tagging Analysis [Chart: accuracy on out-of-vocabulary (OOV) tokens and OOV rate per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs) for source only, tri-training, and MT-Tri.] ‣ Classic tri-training works best on OOV tokens. ‣ MT-Tri does worse than the source-only baseline on OOV tokens.
  18. 18. POS Tagging Analysis [Chart: POS accuracy delta vs. the source-only baseline per binned log frequency (bins 0-14) for tri-training and MT-Tri.] ‣ Tri-training works best on low-frequency tokens (leftmost bins).
  19. 19. POS Tagging Analysis [Chart: accuracy on unknown word-tag (UWT) tokens and UWT rate per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs) for source only, tri-training, MT-Tri, and FLORS*; * result from Schnabel & Schütze (2014).] ‣ No bootstrapping method works well on these very difficult unknown word-tag combinations. ‣ The less lexicalized FLORS approach is superior.
  20. 20. Takeaways ‣ Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis. ‣ We address the main drawback of tri-training (space & time complexity) with the proposed MT-Tri model. ‣ MT-Tri works best on sentiment, but not on POS tagging. ‣ Importance of: ‣ comparing neural methods to classics (strong baselines) ‣ evaluating on multiple tasks & domains
