Sebastian Ruder

Research Scientist, AYLIEN
PhD Candidate, Insight Centre
@seb_ruder | 01.03.17 | LinkedIn Tech Talk
Transfer Learning —
The Next Frontier for ML
Agenda
1. What is Transfer Learning?
2. Why Transfer Learning now?
3. Transfer Learning in practice
4. Transfer Learning for NLP
5. Our research
6. Opportunities and directions
What is Transfer Learning?
Traditional ML
Training and evaluation on the same task or domain:
a separate model (Model A, Model B) is trained per task / domain (A, B).
What is Transfer Learning?
Transfer learning
Storing knowledge gained solving one problem and applying it to a
different but related problem: a model trained on a source task / domain
supplies knowledge to a model for the target task / domain.
“Transfer learning will be the next driver of ML success.”
- Andrew Ng, NIPS 2016 keynote
Why Transfer Learning now?
Drivers of ML success in industry (Andrew Ng, NIPS 2016 keynote):
supervised learning dominates commercial success today; transfer learning
is projected to be the next driver, ahead of unsupervised and
reinforcement learning.
Why Transfer Learning now?
1. Learn very accurate input-output mappings
2. Maturity of ML models
- Computer vision (5% error on ImageNet)
- Automatic speech recognition (3x faster than typing, 20% more accurate1)
3. Large-scale deployment & adoption of ML models
- Google’s NMT System2
1 Ruan, S., Wobbrock, J. O., Liou, K., Ng, A., & Landay, J. (2016). Speech Is 3x Faster than Typing for English
and Mandarin Text Entry on Mobile Devices. arXiv preprint arXiv:1608.07323.
2 Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural
Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint
arXiv:1609.08144.
However: the huge reliance on labeled data leaves novel tasks / domains
without (labeled) data out of reach.
Transfer Learning in practice
• Train new model on features of large model trained on ImageNet3
• Train model to confuse source and target domains4
• Train model on domain-invariant representations5,6
3 Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for
recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 512–519.
4 Ganin, Y., & Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation. Proceedings of the 32nd
International Conference on Machine Learning., 37.
5 Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., & Erhan, D. (2016). Domain Separation Networks. NIPS 2016.
6 Sener, O., Song, H. O., Saxena, A., & Savarese, S. (2016). Learning Transferrable Representations for Unsupervised Domain
Adaptation. NIPS 2016.
Computer vision
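The first bullet, training a new model on the features of a large pretrained model, can be sketched in miniature. This is a toy sketch in plain Python, not the actual ImageNet setup: the "pretrained" extractor and the tiny dataset below are hypothetical stand-ins, with the features frozen and only a small logistic-regression head trained on the target task.

```python
import math

# Toy sketch of feature-based transfer: a frozen "pretrained" extractor
# (a stand-in for an ImageNet CNN's penultimate layer) plus a small
# trainable classifier head fitted on the target task only.

def frozen_features(x):
    """Fixed feature transform; never updated during target training."""
    return [x[0] + x[1], x[0] - x[1]]

def train_head(data, lr=0.1, epochs=200):
    """Fit a logistic-regression head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = frozen_features(x)
            p = 1.0 / (1.0 + math.exp(-(w[0] * f[0] + w[1] * f[1] + b)))
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [w[0] - lr * g * f[0], w[1] - lr * g * f[1]]
            b -= lr * g
    return w, b

# Tiny target-task dataset: class 1 iff the first component dominates.
data = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.2, 0.9], 0), ([0.9, 0.1], 1)]
w, b = train_head(data)
```

Freezing the extractor is the cheapest variant; fine-tuning some of its layers on the target data is the usual next step when enough labels are available.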
Transfer Learning in practice
• Progressive Neural Networks7 have access to weights from trained models
• PathNet8 learns weight paths via a genetic algorithm
7 Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., … Hadsell, R. (2016). Progressive Neural Networks. arXiv preprint arXiv:1606.04671.
8 Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., … Wierstra, D. (2017). PathNet: Evolution Channels Gradient Descent in Super Neural Networks. arXiv preprint arXiv:1701.08734.
Reinforcement learning
Transfer Learning for NLP
A (slightly) more technical definition
• Domain D = {X, P(X)}, where
- X: feature space, e.g. BOW representations
- P(X): e.g. distribution over terms in documents
• Task T = {Y, P(Y|X)}, where
- Y: label space, e.g. true/false labels
- P(Y|X): learned mapping from samples to labels
• Transfer learning: learning when D_S ≠ D_T or T_S ≠ T_T
Transfer Learning for NLP
Transfer scenarios
1. P(X_S) ≠ P(X_T): different topics, text types, etc.
2. X_S ≠ X_T: different languages.
3. P(Y_S|X_S) ≠ P(Y_T|X_T): unbalanced classes.
4. Y_S ≠ Y_T: different tasks.
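These four cases can be made concrete in a few lines. A hypothetical helper (the function and data names are mine, not from the talk) that, given toy feature spaces, label spaces, and marginal term distributions for source and target, names the applicable scenarios:

```python
# Sketch: detecting which transfer scenario applies to a source/target
# pair. Scenario 3 (P(Y_S|X_S) != P(Y_T|X_T)) needs the conditional
# label distributions and is omitted here for brevity.

def transfer_scenarios(x_s, x_t, y_s, y_t, p_xs, p_xt):
    """Return the scenarios that hold for a toy source/target pair."""
    found = []
    if set(x_s) != set(x_t):
        found.append("X_S != X_T: different languages")
    elif p_xs != p_xt:
        found.append("P(X_S) != P(X_T): different topics / text types")
    if set(y_s) != set(y_t):
        found.append("Y_S != Y_T: different tasks")
    return found

# Same English vocabulary and labels, but book reviews vs. kitchen-appliance
# reviews: only the marginal word distributions differ (domain adaptation).
result = transfer_scenarios(
    x_s=["good", "plot", "read"], x_t=["good", "plot", "read"],
    y_s=[0, 1], y_t=[0, 1],
    p_xs={"good": 0.5, "plot": 0.3, "read": 0.2},
    p_xt={"good": 0.6, "plot": 0.1, "read": 0.3},
)
```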
Transfer Learning for NLP
Current status
• Not as straightforward as in CV
- No universal deep features
• However: “simple” transfer through word embeddings is pervasive
• Long history of research on task-specific transfer, e.g. sentiment analysis
and POS tagging, leveraging NLP phenomena such as structured features,
sentiment words, etc.
• Little research on transfer between tasks
• More recently: representation-based research
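“Simple” transfer through word embeddings typically means initialising a model’s embedding layer with pretrained vectors and random-initialising the rest. A minimal sketch in plain Python, assuming the pretrained vectors are already loaded into a dict (the toy 3-dimensional vectors below stand in for real word2vec or GloVe vectors):

```python
import random

# Toy pretrained vectors; a real system would load these from a
# word2vec/GloVe file.
pretrained = {
    "good": [0.1, 0.8, 0.3],
    "bad":  [0.9, 0.2, 0.1],
}

def build_embeddings(vocab, pretrained, dim=3, seed=0):
    """Transfer pretrained vectors where available; random-init OOV words."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])  # transferred knowledge
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

emb = build_embeddings(["good", "bad", "uncoveredword"], pretrained)
```

The transferred rows carry distributional knowledge from the source corpus into the target model; whether they are then frozen or fine-tuned is a per-task choice.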
Our research
Research focus
Finding better ways to transfer knowledge to new domains, tasks, and
languages that
1. perform well in large-scale settings and real-world applications;
2. are applicable to many tasks and models.
Current focus: P(X_S) ≠ P(X_T), i.e. training and test distributions are
different.
Our research
Training and test distributions are different:
different text types, different accents/ages, different topics/categories.
A performance drop, or even collapse, is inevitable.
Our research
Transfer learning challenges in real-world applications
1. Domains are not well-defined, but fuzzy, and conflate many factors
(language, social factors, genre, topic).
2. One-to-one adaptation is rare; many source domains are generally
available.
3. Models need to be adapted frequently as conditions change, new data
becomes available, etc.
Our research
• Idea: use distillation + insights from semi-supervised learning to
transfer knowledge from a single teacher (a) or multiple teachers (b)
to a student model9.
9 Ruder, S., Ghaffari, P., & Breslin, J. G. (2017). Knowledge Adaptation: Teaching to Adapt. In arXiv preprint arXiv:1702.02052.
How to adapt from large source domains?
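The distillation ingredient works by training the student on the teacher’s temperature-softened output distribution rather than on hard labels alone. A generic sketch of the soft-target loss (plain Python; the temperature and logits are illustrative values, not details from the paper):

```python
import math

# Sketch of knowledge distillation (Hinton et al., 2015): the student
# mimics the teacher's softened class probabilities ("soft targets").

def soften(logits, temperature):
    """Softmax with temperature; higher T gives a softer distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened outputs against the teacher's."""
    teacher_p = soften(teacher_logits, temperature)
    student_p = soften(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_p, student_p))

loss = distillation_loss(student_logits=[2.0, 0.5], teacher_logits=[2.5, 0.0])
```

With multiple teachers, the soft targets can e.g. be averaged or weighted before computing the loss; the paper’s specific combination strategies are beyond this sketch.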
Our research
• Idea: take the diversity of the training data into account and select
subsets (c), rather than an entire domain (a) or individual examples (b)10.
10 Ruder, S., Ghaffari, P., & Breslin, J. G. (2017). Data Selection Strategies for Multi-Domain Sentiment Analysis. In arXiv preprint arXiv:1702.02426.
How to select data for adaptation?
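One way to realise this subset selection is to score candidate source examples by how close their term distributions are to the target domain and keep the closest ones. A sketch using Jensen-Shannon divergence as the similarity measure (a common choice in the domain-similarity literature, not necessarily the paper’s exact method; the distributions below are toy values):

```python
import math

# Sketch: select the k source examples whose term distributions are
# closest (lowest Jensen-Shannon divergence) to the target domain's.

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-mass terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_subset(source_dists, target_dist, k):
    """Keep the k most target-like source examples."""
    return sorted(source_dists, key=lambda d: js_divergence(d, target_dist))[:k]

target = [0.5, 0.3, 0.2]           # target-domain term distribution
source = [[0.5, 0.3, 0.2],         # identical to the target
          [0.1, 0.1, 0.8],         # far from the target
          [0.4, 0.4, 0.2]]         # close to the target
subset = select_subset(source, target, k=2)
```

Ranking whole subsets rather than individual examples, as in case (c) on the slide, additionally rewards diversity within the selected set.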
Our research
Opportunities and future directions
• Learn from past adaptation scenarios and generalise across domains
and tasks.
• Robust adaptation to non-English and low-resource languages.
• Adaptation for novel tasks and more sophisticated models, e.g. QA and
memory networks.
• Transfer across tasks and leveraging knowledge from related tasks.
References
Image credit
• Google Research blog post11
• Mikolov, T., Joulin, A., & Baroni, M. (2015). A Roadmap towards
Machine Intelligence. arXiv preprint arXiv:1511.08130.
• Google Research blog post12
Our papers
• Ruder, S., Ghaffari, P., & Breslin, J. G. (2017). Knowledge Adaptation:
Teaching to Adapt. arXiv preprint arXiv:1702.02052.
• Ruder, S., Ghaffari, P., & Breslin, J. G. (2017). Data Selection Strategies
for Multi-Domain Sentiment Analysis. arXiv preprint arXiv:1702.02426.
11 https://research.googleblog.com/2016/10/how-robots-can-acquire-new-skills-from.html
12 https://googleblog.blogspot.ie/2014/04/the-latest-chapter-for-self-driving-car.html
Thanks for your attention!
Questions?
