
ODSC East: Effective Transfer Learning for NLP



Presented by indico co-founder Madison May at ODSC East.

Abstract: Transfer learning, the practice of applying knowledge gained on one machine learning task to aid the solution of a second task, has seen historic success in the field of computer vision. The output representations of generic image classification models trained on ImageNet have been leveraged to build models that detect the presence of custom objects in natural images. Image classification tasks that would typically require hundreds of thousands of images can be tackled with mere dozens of training examples per class thanks to the use of these pretrained representations. The field of natural language processing, however, has seen more limited gains from transfer learning, with most approaches limited to the use of pretrained word representations. In this talk, we explore parameter- and data-efficient mechanisms for transfer learning on text, and show practical improvements on real-world tasks. In addition, we demo the use of Enso, a newly open-sourced library designed to simplify benchmarking of transfer learning methods on a variety of target tasks. Enso provides tools for the fair comparison of varied feature representations and target task models as the amount of training data made available to the target model is incrementally increased.



  1. @ODSC OPEN DATA SCIENCE CONFERENCE Boston | May 1 - 4, 2018
  2. Effective Transfer Learning for NLP Madison May
  3. Machine Learning Architect @ Indico Data Solutions Solve big problems with small data. Email: Twitter: @pragmaticml Github: @madisonmay
  4. Overview: - Deep learning and its limitations - Transfer learning primer - Practical recommendations for transfer learning - Enso + transfer learning benchmarking - Transfer learning in recent literature
  5. Deep learning and its limitations
  6. A better term for “deep learning”: “representation learning” [Figure: layer 1–3 activations of a pre-trained ImageNet model, from “Visualizing and Understanding Convolutional Networks”, Zeiler, Fergus — one feature responds to car wheels, another to faces]
  7. Representation learning in NLP: word2vec [Figure: the CBOW objective for the word2vec model]
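The CBOW objective on the slide can be sketched in plain numpy: predict the center word from the average of its context-word embeddings via a softmax over the vocabulary. The corpus, dimensions, and learning rate below are toy choices for illustration, not values from the talk.

```python
import numpy as np

np.random.seed(0)

# Toy corpus and vocabulary (purely illustrative).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension

W_in = np.random.randn(V, D) * 0.1   # input (context) embeddings
W_out = np.random.randn(D, V) * 0.1  # output (center-word) weights

def cbow_step(context_ids, center_id, lr=0.1):
    """One SGD step on the CBOW objective: softmax over the
    vocabulary given the mean of the context-word embeddings."""
    h = W_in[context_ids].mean(axis=0)   # averaged context vector, (D,)
    scores = h @ W_out                    # logits over vocabulary, (V,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    loss = -np.log(probs[center_id])      # cross-entropy on the center word
    dscores = probs.copy()
    dscores[center_id] -= 1.0
    dh = W_out @ dscores                  # gradient w.r.t. the context mean
    W_out[:, :] -= lr * np.outer(h, dscores)
    W_in[context_ids] -= lr * dh / len(context_ids)
    return loss

# Train with a one-word window on each side of the center word.
losses = []
for epoch in range(50):
    total = 0.0
    for i in range(1, len(corpus) - 1):
        context = [word_to_id[corpus[i - 1]], word_to_id[corpus[i + 1]]]
        total += cbow_step(context, word_to_id[corpus[i]])
    losses.append(total)

print(losses[0] > losses[-1])  # training loss falls as the embeddings fit
```

After training, the rows of `W_in` are the word vectors; real word2vec adds negative sampling or hierarchical softmax so this scales to large vocabularies.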
  8. Learned word2vec representations have semantic meaning “Distributed Representations of Words and Phrases and their Compositionality” Mikolov, Sutskever, et al. Advances in Neural Information Processing Systems, 3111-3119
  9. Training data requirements [Figure: performance vs. labeled training data for deep learning and traditional ML — deep learning typically needs ~10,000+ labeled examples to pull ahead]
  10. Training Time + Computational Expense
  11. Transfer learning primer
  12. Everyone has problems. Not everyone has data. Small data problems are more common than big data problems. <1k examples = small data
  13. Transfer learning: the application of knowledge gained in one context to a different context
  14. A shuffled tiger: each pixel treated as an independent feature → can tell that tigers are generally orange and black, but not much more. Independently, each pixel has little predictive value.
  15. Transfer learning: re-represent new data in terms of existing concepts [Figure: an image scored against known features — 0.8 large, 0.9 orange, 0.7 striped, 0.8 cat]
  16. In practice, learned features aren’t this interpretable. However, the relationship between input features and the target is typically simpler, and learning simpler relationships requires less data and less compute.
  17. Basic transfer learning outline: 1) Train a base model on a large, general corpus 2) Compute the base model’s representations of the input data for the target task 3) Train a lightweight model on top of the pre-trained feature representations [Diagram: a shared encoder (“featurizer”) from a source model (e.g. movie review sentiment) feeding separate custom target models for box office results, movie sentiment aspect, and movie genre prediction]
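Steps 2 and 3 of the outline reduce to a few lines of scikit-learn. The features below are simulated stand-ins: in practice they would come from running the frozen source model over the target-task texts (e.g. something like `source_model.featurize(texts)`, a hypothetical call).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 2 stand-in: 40 examples of 256-dim "pre-trained" features.
# We plant a simple linear signal so the sketch has something to learn.
X = rng.normal(size=(40, 256))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 3: train a lightweight target model on top of the frozen features.
target_model = LogisticRegression(max_iter=1000).fit(X, y)
print(target_model.score(X, y))
```

The target model here has only 257 parameters (weights plus bias), which is why it trains in seconds and fits in kilobytes, as the next slide argues.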
  18. How does transfer learning fix deep learning’s problems? Training data requirements: ● Pre-trained representations → simpler models → less training data Memory requirements: ● A single copy of the base model can fuel many transfer models ● Target models have thousands rather than millions of parameters ● Target model size is measured in KBs rather than GBs Training time requirements: ● Target model training takes seconds rather than days
  19. HBO’s Silicon Valley “Not Hotdog” app: transfer learning for computer vision in a “practical” application
  20. Transfer learning for NLP vs. transfer learning for computer vision ● More variety in types of target tasks (entity extraction, classification, sequence labeling) ● More variety in input data (source language, field-specific terminology) ● No clear “ImageNet” equivalent — a lack of large, generic, labeled corpora ● Lack of consensus on which source tasks produce good representations
  21. Practical recommendations for transfer learning
  22. The source model is the single most important variable. Keep the source model and target model well-aligned when possible: ● Source vocabulary should be aligned with target vocabulary ● Source task should be aligned with target task Good: product review sentiment → product review category Good: hotel ratings → restaurant ratings Less good: product review sentiment → biology paper classification [Diagram: source models matched to target tasks — shape ≅ vocabulary, color ≅ task type]
  23. Which source tasks produce good, general representations? ● Natural language inference ○ Are two sentences in agreement, disagreement, or neither? ● Machine translation ○ English → French ● Multi-task learning ○ Learning to solve many supervised problems at once ● Language modeling ○ Learning to model the distribution of natural language ○ Predicting the next word in a sequence given context
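The language-modeling objective in the last bullet can be shown in miniature with a bigram count model standing in for a neural LM: predict the next word from the current one. The corpus is a toy example.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ran".split()

# Count bigrams: for each word, tally which words follow it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Most likely next word under the bigram counts."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" — it follows "the" twice, "mat" only once
```

A neural language model replaces the count table with a learned function of the full context; it is the hidden representations of that function, not the next-word predictions themselves, that transfer.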
  24. Keep target models simple ● Limiting model complexity is a strong implicit regularizer ● Logistic regression goes a long way ● Use L2 regularization / dropout for additional regularization
  25. Consider second-order optimization methods ● Transfer learning necessitates simple models with few parameters because of limited training data ● L-BFGS is usually overlooked in deep learning because it scales poorly with the number of parameters and examples ● L-BFGS performs well in practice for transfer learning applications First-order methods: move a step in the direction of the gradient. Second-order methods: move to the minimum of a second-order approximation of the loss surface. [Figure: weight updates against the approximate and true loss surfaces]
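A minimal sketch of the L-BFGS recommendation using `scipy.optimize.minimize` on a logistic-regression loss, with simulated data standing in for pre-trained features. Because the target model is small, L-BFGS's curvature approximation is cheap and convergence is fast.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))         # 50 examples, 20-dim features
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(float)

def loss_and_grad(w):
    """Mean logistic loss and its gradient (jac=True expects both)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

res = minimize(loss_and_grad, np.zeros(20), jac=True, method="L-BFGS-B")
print(res.fun)  # far below the initial loss at w = 0 (~0.69)
```

Scikit-learn's `LogisticRegression` also offers `solver="lbfgs"`, so in the pipeline above the second-order method comes for free.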
  26. When comparing approaches, measure performance variance ● Limited labeled training data → limited test and validation data ● High variance across CV splits may correspond with poor generalization [Figure: model accuracy vs. training data volume, with variance bands]
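Measuring that variance is one line with scikit-learn's cross-validation helpers; report the spread alongside the mean. The data below is simulated for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))        # 60 labeled examples, 128-dim features
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
# With only ~12 examples per fold, a single split is a noisy estimate,
# so the standard deviation is as informative as the mean.
print(f"{scores.mean():.2f} +/- {scores.std():.2f}")
```

When comparing two featurizers, prefer the one whose accuracy is both higher and more stable across splits.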
  27. “Classic” machine learning problems are exaggerated at small training dataset sizes ● Ex: class imbalance can lead to degenerate models that predict only a single class — consider oversampling / undersampling ● Ex: unrepresentative datasets — small sample sizes increase the likelihood that a model will pick up on spurious correlations
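Random oversampling, the simplest remedy for the class-imbalance case, is a few lines of numpy: resample the minority class with replacement until the classes balance. The dataset below is a toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Imbalanced toy dataset: 45 negatives, 5 positives.
X = rng.normal(size=(50, 4))
y = np.array([0] * 45 + [1] * 5)

# Oversample the minority class with replacement to match the majority.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=45 - 5, replace=True)
X_bal = np.concatenate([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # [45 45]
```

Alternatives include undersampling the majority class or class-weighted losses (e.g. `class_weight="balanced"` in scikit-learn estimators).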
  28. “Feature engineering” has its place ● Modern-day “feature engineering” takes the form of model architecture decisions ● Ex: when trying to determine whether or not a job description and a resume are a good match, use the absolute difference of the two feature representations as input to the model [Diagram: job description and resume representations combined into the model input]
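The job-description/resume example is one numpy operation. The vectors below are made up for illustration; in practice each would be the shared encoder's representation of one document.

```python
import numpy as np

# Hypothetical featurizer outputs for a (job description, resume) pair.
job_vec = np.array([0.75, 0.25, 0.5, 1.0])
resume_vec = np.array([0.5, 0.25, 0.5, 0.25])

# Elementwise absolute difference: the match model sees how far apart
# the two documents are along each feature dimension.
model_input = np.abs(job_vec - resume_vec)
print(model_input)
```

A perfect match yields the zero vector, so the target model only has to learn a threshold on per-dimension distances rather than re-learn the pairing structure itself.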
  29. Introducing: Enso
  30. Enso provides a standard interface for benchmarking embeddings and transfer learning methods on NLP tasks.
  31. The need: ● Eliminate human “overfitting” of hyperparameters to values that work well for a single task ● Ensure higher-fidelity baselines ● Benchmark on many datasets to better understand where an approach is effective
  32. Enso workflow: ● Download the two dozen included datasets for benchmarking on diverse tasks ● “Featurize” all examples in the dataset via a pre-trained source model ● Train a target model using the featurized training examples as inputs ● Repeat the process for all combinations of featurizers, dataset sizes, target model architectures, etc. ● Visualize and manually inspect results
  33. > python -m > python -m enso.featurize > python -m enso.experiment > python -m enso.visualize
  34. Comparison of transfer model architectures
  35. Comparison of optimizers used
  36.
  37. Research spotlight
  38. Recent papers of note: ● “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning” by Subramanian, et al. ● “Fine-tuned Language Models for Text Classification” by Howard, Ruder ● “Deep contextualized word representations” by Peters, et al.
  39. “Deep contextualized word representations” by Peters, et al. (AllenAI) ● Language modeling is a good objective for the source model ● Many different layers of representation are useful: attend over the layers and learn to weight them on a per-task basis ● Per-token representations mean applicability to a broader range of tasks than vanilla document representations [Figure: ELMo (Embeddings from Language Models) layer weights learned on a variety of target tasks]
  40. [Diagram: shared encoder (“featurizer”) — input, hidden, hidden — with per-layer weights 0.5, 0.2, 0.3. Each colored block is a “representation” or “feature vector”; each representation is weighted and then summed to produce a feature vector of the same dimensions]
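The weighted sum of layer representations can be sketched directly. The layer outputs below are random stand-ins for a frozen bidirectional language model's activations; in ELMo the scalar layer weights and the global scale are learned per target task, but here they are fixed for the sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Three layers of representations for a 5-token sentence, dim 16
# (stand-ins for the frozen source model's per-layer outputs).
layers = rng.normal(size=(3, 5, 16))

s = softmax(np.array([0.2, 1.5, -0.3]))   # per-task layer weights (sum to 1)
gamma = 1.0                                # per-task global scale

# Weighted sum over the layer axis: same shape as any single layer.
combined = gamma * np.tensordot(s, layers, axes=1)
print(combined.shape)  # (5, 16)
```

Because the output keeps the per-token shape, the same combined representation can feed tagging, extraction, or classification heads.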
  41. Bidirectional LSTM (source: Chris Olah’s personal blog)
  42. [Diagram: source + task RNNs — a source RNN with frozen weights and a task RNN with a task-specific architecture; the input plus the forward and backward source states are combined via a learned average]
  43. Empirical Results
  44. Conclusions
  45. ● Small data problems are more common than big data problems. ● Transfer learning enables taking advantage of deep learning without massive labeled corpora. ● When in doubt, trend toward simplicity.
  46. Appendix
  47. Other resources for transfer learning on NLP tasks ● Sebastian Ruder’s blog ● Arxiv Computation and Language ● “Making neural nets uncool again”
  48. “Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning” by Subramanian, et al. ● Learning document representations using a bidirectional LSTM trained on a multi-task learning objective ● Tasks included skip-thought vectors, neural machine translation, parse tree construction, and natural language inference ● Diverse source tasks led to document representations that produced strong empirical results when applied to a dozen different target tasks [Diagram: a shared encoder feeding Task 1 and Task 2 heads]
  49. “Fine-tuned Language Models for Text Classification” by Howard, Ruder ● Outlines a “bag of tricks” for applying transfer learning to NLP ● Language modeling is an effective source task ● Fine-tune the source model rather than using a static representation ● Use a separate learning rate per layer to keep the first layers relatively static while updating the final layers more