Multi-Task Learning for NLP

- Deep multi-task learning with low level tasks supervised at lower layers
https://www.aclweb.org/anthology/P/P16/P16-2038.pdf

- A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
https://arxiv.org/abs/1611.01587

Slide 1: Multi-Task Learning for NLP
- 2017/04/17, Parsing Group
- Motoki Sato
Slide 2: What is Multi-task?
- Single task: Model 1 maps Input (sentence) to POS (task 1); Model 2 maps Input (sentence) to Chunking (task 2).
- Multi-task: a single Model maps Input (sentence) to both POS (task 1) and Chunking (task 2).
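To make the contrast concrete, here is a minimal PyTorch-style sketch (illustrative only; module names and sizes are assumptions, not from the slides or the papers) of one shared sentence encoder feeding two task-specific output layers:

```python
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """One shared sentence encoder with one output layer per task."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_pos_tags, n_chunk_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared Bi-LSTM encoder over the input sentence
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Task-specific classifiers on top of the shared representation
        self.pos_head = nn.Linear(2 * hidden_dim, n_pos_tags)      # task 1: POS
        self.chunk_head = nn.Linear(2 * hidden_dim, n_chunk_tags)  # task 2: chunking

    def forward(self, word_ids):
        h, _ = self.encoder(self.embed(word_ids))   # (batch, seq_len, 2 * hidden_dim)
        return self.pos_head(h), self.chunk_head(h)
```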
Slide 3: Multi-task learning Paper (1)
- (Søgaard, 2016), ACL 2016 short paper
- Tasks:
  - POS (low-level task)
  - Chunking (high-level task)
Slide 4: Multi-task learning Paper (2)
- (Hashimoto, 2016), arXiv
- Tasks (many tasks):
  - POS, Chunking, Dependency parsing
  - Semantic relatedness, Textual entailment
Slide 5: Dataset

  Task                   (Søgaard, 2016)   (Hashimoto, 2016)
  POS                    Penn Treebank     Penn Treebank
  Chunking               Penn Treebank     Penn Treebank
  CCG                    Penn Treebank     -
  Dependency parsing     -                 Penn Treebank
  Semantic relatedness   -                 SICK
  Textual entailment     -                 SICK
Slide 6: (Søgaard, 2016)
- Tasks (input: words; predict: a tag per word):
  - POS: low-level task
  - Chunking: high-level task
  - CCG: high-level task
- Examples shown on the slide.
Slide 7: Multi-task for Vision?
- Cha Zhang, et al., "Improving Multiview Face Detection with Multi-Task Deep Convolutional Neural Networks"
- Hidden layers are shared (shared representation).
Slide 8: Multi-task for NLP?
- Collobert, et al., "Natural Language Processing (Almost) from Scratch"
- Hidden layers are shared; each task has its own individual output layer.
Slide 9: (Søgaard, 2016), outermost version
- Three stacked Bi-LSTM layers (1st, 2nd, 3rd) over the input words w0 w1 w2 w3.
- As in previous multi-task learning, the hidden layers are shared, and both the POS tags and the chunk tags are predicted from the outermost (3rd) layer.
Slide 10: (Søgaard, 2016), lower-layer version
- Same three-layer Bi-LSTM stack over w0 w1 w2 w3, with shared hidden layers.
- The POS tags (low-level task) are predicted from the 1st layer, while the chunk tags (high-level task) are predicted from the outermost (3rd) layer; a sketch of both versions follows below.
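The same idea in a minimal PyTorch-style sketch (illustrative; the layer count matches the slides, but names and hyperparameters are assumptions, not the authors' code). Setting `pos_at_lower=False` gives the outermost version of Slide 9; `True` gives the lower-layer version of Slide 10:

```python
import torch.nn as nn

class LayerSupervisedTagger(nn.Module):
    """3-layer Bi-LSTM: chunk tags always come from the top layer;
    POS tags come from either the top layer (outermost) or the 1st layer (lower)."""
    def __init__(self, vocab_size, emb_dim, hidden, n_pos, n_chunk, pos_at_lower=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.pos_head = nn.Linear(2 * hidden, n_pos)
        self.chunk_head = nn.Linear(2 * hidden, n_chunk)
        self.pos_at_lower = pos_at_lower

    def forward(self, word_ids):
        h1, _ = self.lstm1(self.embed(word_ids))  # 1st layer
        h2, _ = self.lstm2(h1)                    # 2nd layer
        h3, _ = self.lstm3(h2)                    # 3rd (outermost) layer
        pos_logits = self.pos_head(h1 if self.pos_at_lower else h3)
        chunk_logits = self.chunk_head(h3)
        return pos_logits, chunk_logits
```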
Slide 11: Experiments
- Comparison of single-task vs. multi-task setups on the low-level and high-level tasks.
- "It is consistently better to have POS supervision at the innermost rather than the outermost layer."
Slide 12: (Søgaard, 2016) Domain Adaptation
- What is domain adaptation? A model trained on a source domain (e.g., the news domain) is adapted to a target domain (e.g., the Twitter domain).
Slide 13: (Søgaard, 2016) Source Training
- Source domain: WSJ newswire.
- The three-layer Bi-LSTM is trained with POS supervision at the 1st layer and chunk supervision at the 3rd layer.
Slide 14: (Søgaard, 2016) Target Training
- Target domain: broadcast and weblogs.
- Only POS is re-trained on the target domain (at the lower layer); there is no chunking training in the target domain. A sketch of both training stages follows below.
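A rough sketch of the two training stages, reusing the illustrative model above (data loaders, loss weighting, and optimizer settings are placeholders, not the paper's setup):

```python
import torch.nn.functional as F

def train_source(model, optimizer, source_batches):
    # Source domain (WSJ newswire): supervise POS at the lower layer and chunking at the top.
    for words, pos_gold, chunk_gold in source_batches:
        pos_logits, chunk_logits = model(words)
        loss = (F.cross_entropy(pos_logits.transpose(1, 2), pos_gold)
                + F.cross_entropy(chunk_logits.transpose(1, 2), chunk_gold))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_target(model, optimizer, target_batches):
    # Target domain (broadcast, weblogs): only POS labels are used,
    # so only the lower-layer POS prediction is re-trained; no chunking loss.
    for words, pos_gold in target_batches:
        pos_logits, _ = model(words)
        loss = F.cross_entropy(pos_logits.transpose(1, 2), pos_gold)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```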
Slide 15: Domain Adaptation Experiments
- High-level task supervision in the source domain, lower-level task supervision in the target domain.
Slides 16-18: (Hashimoto, 2016) (figures from the paper; no text on these slides)
Slide 19: Training Loss for Multi-Task Learning
- In (Hashimoto, 2016), the loss for each task includes an L2-norm regularization term.
- The term penalizes the distance between the current embedding parameters and the embedding parameters saved after training the final task in the top-most layer at the previous training epoch.
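A minimal sketch of that regularization term (the name delta and its default value are my own illustrative choices, not the paper's):

```python
def loss_with_successive_reg(task_loss, embedding, embedding_prev_epoch, delta=1e-2):
    # L2 penalty keeping the shared embedding parameters close to their values
    # after the previous epoch's final task, so training one task does not
    # overwrite what the other tasks learned.
    reg = delta * ((embedding - embedding_prev_epoch) ** 2).sum()
    return task_loss + reg
```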
Slide 20: Dataset (revisited; same table as Slide 5)
- Since (Søgaard, 2016) uses the same dataset (same input) for all of its tasks, the per-task losses can simply be summed into one multi-task loss.
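Concretely, because both tag sets are annotated on the same sentences, a single batch can supply both losses; a minimal sketch with the illustrative model from above:

```python
import torch.nn.functional as F

def joint_loss(model, words, pos_gold, chunk_gold):
    # Same input sentence, two supervised outputs: simply add the task losses.
    pos_logits, chunk_logits = model(words)
    return (F.cross_entropy(pos_logits.transpose(1, 2), pos_gold)
            + F.cross_entropy(chunk_logits.transpose(1, 2), chunk_gold))
```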
Slide 21: Catastrophic Forgetting
- "Overcoming Catastrophic Forgetting in Neural Networks", James Kirkpatrick, Raia Hadsell, et al. https://arxiv.org/abs/1612.00796
- https://theneuralperspective.com/2017/04/01/overcoming-catastrophic-forgetting-in-neural-networks/
