Multi-task Learning in DNN
Bomwurzel Lior
Research seminar, computer vision
What is multi-task learning
Auxiliary tasks
Examples
Why does MTL work
Intuition needed for MTL
STL – single-task learning
• Optimize a single task, minimizing the loss for that task only
Simple thought experiment
Given 50K female and 50K male medical records:
Should you train a separate model for each gender?
Should you use gender as an additional input feature?
Simple thought experiment
Suppose you try to predict:
Ovarian / prostate cancer
Breast cancer
Heart disease
Simple thought experiment
• We don't know whether to train separate models or a combined model
• Let the neural network weights make the decision for us
• Common features can be learned in the shared hidden layer
• A feature that develops for one task can be shared with another
• Weights for features a task does not use can stay low, so the coupled tasks can effectively decouple
MTL – multi-task learning
• Optimize several related tasks, minimizing the loss over all of them
• Learn related tasks in parallel: use shared representations to leverage information from other tasks
Human inspiration for multi-task learning
• Learn several basic tasks in order to perform a difficult task
For example: driving
MTL – multi-task learning
STL: $\min_{w} \frac{1}{m}\sum_{i=1}^{m} L\big(f_{w}(x^{i}),\, Y^{i}\big) + \lambda R(w)$
MTL: $\min_{w} \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4} L\big(f_{w_j}(x^{i}),\, Y_{j}^{i}\big) + \lambda R(w_j)$
L – a loss function such as the hinge loss or squared loss
R – a regularization function such as L2 or L1
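In code, this objective is usually realized with hard parameter sharing: a shared hidden layer (trunk) feeding one small output head per task, with the per-task losses summed before backpropagation. The following is a minimal sketch only, assuming PyTorch and hypothetical shapes and data; it is illustrative, not the presenter's implementation.

import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self, in_dim, hidden_dim, task_out_dims):
        super().__init__()
        # Shared representation used by every task.
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # One head per task (the w_j in the formula above).
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, d) for d in task_out_dims)

    def forward(self, x):
        h = self.trunk(x)
        return [head(h) for head in self.heads]

# Toy setup: one input batch x^i with targets Y_j^i for four regression tasks.
model = HardSharingMTL(in_dim=16, hidden_dim=32, task_out_dims=[1, 1, 1, 1])
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)  # weight_decay plays the role of lambda*R(w)
x = torch.randn(64, 16)
targets = [torch.randn(64, 1) for _ in range(4)]

preds = model(x)
loss = sum(nn.functional.mse_loss(p, y) for p, y in zip(preds, targets))  # sum over tasks j
opt.zero_grad()
loss.backward()  # gradients from all task losses add up at the shared trunk
opt.step()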
MTL – backpropagation example
Disadvantages of individually learned tasks
• To perform several tasks, you train several times
• More resources are needed for several different networks*
• No information learned from one task can be used for another
Advantages of multi-task learning
• Get more samples from other tasks' training sets
• Decompose a complex task (hard to codify) into several simpler tasks
• The model generalizes better (it is not over-optimized for a single task)
What if I care only about one task?
Surprisingly, most real-world problems can still benefit from MTL by using auxiliary tasks.
auxiliary tasks – learn from hints
• Predict features as auxiliary tasks
Instead of recognizing complex objects such as cars or pedestrians directly, also train on edges, shapes, regions, textures, text, orientation, distance, shadows, and reflections
auxiliary tasks – learn from instances
To predict the sentiment of a sentence, use an auxiliary task that predicts whether the sentence contains a positive or negative sentiment word.
auxiliary tasks – focusing attention
• Use the auxiliary task to focus attention on parts of the image that might otherwise be ignored.
• For example, lane markings might be ignored because they don't always appear and are relatively small. If we force the model to learn them, that representation can be used for the main task.
auxiliary tasks – quantization smoothing
• Train an auxiliary task with a different quantization
• If the underlying problem is less quantized or continuous, it can be easier to learn the smoother version
• Example: for distance estimation, instead of only the labels {close, far}, also learn the real distance (see the sketch below)
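As an illustration of quantization smoothing, the sketch below assumes PyTorch and hypothetical data (the distance threshold and the auxiliary loss weight are made up): the main head predicts the coarse {close, far} labels while an auxiliary head regresses the real-valued distance, a smoother target.

import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.Sequential(nn.Linear(8, 32), nn.ReLU())
cls_head = nn.Linear(32, 2)   # main task: close / far
reg_head = nn.Linear(32, 1)   # auxiliary task: real distance

x = torch.randn(128, 8)
distance = torch.rand(128, 1) * 100.0          # hypothetical true distance in meters
coarse = (distance.squeeze(1) > 50.0).long()   # quantized label: 0 = close, 1 = far

h = trunk(x)
loss = F.cross_entropy(cls_head(h), coarse) + 0.1 * F.mse_loss(reg_head(h), distance)
loss.backward()  # the smooth auxiliary gradient also shapes the shared representation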
auxiliary tasks – use the future
• Future measurements can be used in offline learning problems.
For example, when driving, far-away objects are hard to identify; only after the car passes near them can you identify them accurately.
Sometimes you have the result only after the test; use it to train offline.
auxiliary tasks – time series prediction
• When learning a task on a short time scale, the learner may find it difficult to recognize longer-term processes, and vice versa. Training both scales on a single net can help (see the sketch below).
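A minimal sketch of this idea, assuming PyTorch and toy data (the horizons mirror the robot example in the notes and are otherwise arbitrary): one net predicts the same signal at several future horizons, so short- and long-term structure share one hidden layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

horizons = [1, 2, 4, 8]   # prediction distances, e.g. meters ahead
trunk = nn.Sequential(nn.Linear(10, 64), nn.ReLU())
heads = nn.ModuleList(nn.Linear(64, 10) for _ in horizons)

state = torch.randn(32, 10)                           # current sensed state
future = {k: torch.randn(32, 10) for k in horizons}   # hypothetical readings k steps ahead

h = trunk(state)
loss = sum(F.mse_loss(head(h), future[k]) for head, k in zip(heads, horizons))
loss.backward()  # each time scale contributes gradient to the shared layer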
auxiliary tasks – the same task from different points of view
• Use different metrics as tasks, to let your model learn something different from each loss
For example, minimize the squared loss, log loss, rank loss, or an accuracy-based loss
• You can learn the problem on several representations
For example, if it is easier to learn in polar coordinates but the application needs Cartesian coordinates
• Sometimes it helps to learn the same task multiple times
The random weights connected to each task head let it learn the feature in different ways (see the sketch below)
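A minimal sketch of the last two points, assuming PyTorch and toy data: the same target is learned twice from a shared hidden layer, once under a squared loss and once under an absolute loss, with each head starting from different random weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.Sequential(nn.Linear(6, 32), nn.ReLU())
head_mse = nn.Linear(32, 1)   # same task, squared loss
head_l1 = nn.Linear(32, 1)    # same task, absolute loss, different random init

x, y = torch.randn(64, 6), torch.randn(64, 1)
h = trunk(x)
loss = F.mse_loss(head_mse(h), y) + F.l1_loss(head_l1(h), y)
loss.backward()  # the two views of the same task regularize the shared features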
Examples – related task (hints)
Example – learn from the future, or move an input feature to the output
move input feature to output
• Features you decide not to use as inputs can instead be used as learning signals at the output (see the sketch below)
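A minimal sketch, assuming PyTorch and toy data: a feature we choose not to feed as an input is instead predicted as an extra output, so it still shapes the hidden layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

x_main = torch.randn(64, 12)   # features used as inputs
x_extra = torch.randn(64, 1)   # feature deliberately held out of the input
y = torch.randn(64, 1)         # main target

trunk = nn.Sequential(nn.Linear(12, 32), nn.ReLU())
main_head = nn.Linear(32, 1)
extra_head = nn.Linear(32, 1)  # auxiliary head predicts the held-out feature

h = trunk(x_main)
loss = F.mse_loss(main_head(h), y) + F.mse_loss(extra_head(h), x_extra)
loss.backward()  # the held-out feature acts as a learning signal instead of an input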
Examples – time series prediction
Lesson to be learned – time series prediction
• Tasks can sometimes help or interfere with each other
• The help tasks give each other can be asymmetric
• Always try different models to find the best match for your task
Why MTL works
• Several mechanisms help MTL backprop nets generalize better.
• All of them derive from the summing of error gradient terms at the hidden layer for the different tasks.
• Each, however, exploits a different relationship between tasks.
Why MTL works – representation bias
• With random weight initialization, different runs can end in different local minima
• If T and T' share a common minimum A and also have uncommon minima, training on both tasks makes it more likely that we end in the common minimum
• The opposite is also interesting: MTL nets prefer not to use hidden-layer representations that other tasks prefer not to use, so if one task has a strong bias toward an uncommon minimum, the other task may end in that uncommon minimum as well
Why MTL works – eavesdropping
• Suppose T' learns a feature F, useful to T, more easily than T would; on its own, T would only learn a complex representation of F. Once F has been learned through T', T can use the simpler representation of F.
• In the extreme case, T' can be the feature F itself.
Why MTL works – generalization
• When learning several tasks, the risk of overfitting to a specific feature decreases
• If T and T' both use F (through different weights), the only changes allowed in F are those supported by both tasks' losses; F cannot move in a direction that is good for only one task
Why MTL works – feature amplification
• We want to learn a good representation of a feature without the task's data-dependent noise.
• Since different tasks have different noise patterns, learning several tasks with a common internal feature lets the model obtain a better representation of the feature and ignore the noise.
Intuition needed for MTL
Things to take into consideration
Which auxiliary tasks will be helpful?
• Open question
• We don't have a good notion of when tasks are similar or related
• Currently we rely on the assumption that an auxiliary task should be related to the main task in some way that makes it helpful
• You must test several models and find which best fits your task
Loss functions considerations
$\min_{w} \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} L\big(f_{w_j}(x^{i}),\, Y_{j}^{i}\big) + \lambda R(w_j)$
• Some tasks are more important than others
• Some tasks are learned much more easily
• Some tasks have more data
• Some tasks have more noise
These differences suggest weighting the per-task losses rather than summing them equally (see the sketch below).
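One common way to act on these considerations is to weight the per-task losses. The sketch below assumes PyTorch; the weights and tensors are hypothetical stand-ins for real head outputs and targets.

import torch
import torch.nn.functional as F

task_weights = [1.0, 0.3, 0.1]   # hypothetical: main task first, auxiliaries down-weighted
preds = [torch.randn(32, 1, requires_grad=True) for _ in task_weights]   # stand-ins for head outputs
targets = [torch.randn(32, 1) for _ in task_weights]

loss = sum(w * F.mse_loss(p, y) for w, p, y in zip(task_weights, preds, targets))
loss.backward()  # important tasks dominate the gradient; noisy or easy tasks contribute less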
MTL in the wilderness
References
• Abu-Mostafa, Y. S., "Learning from Hints in Neural Networks," Journal of Complexity, 1990, 6(2), pp. 192–198.
• Caruana, R., "Multitask Learning: A Knowledge-Based Source of Inductive Bias," Proceedings of the Tenth International Conference on Machine Learning, 1993.
• Sebastian Ruder, "An Overview of Multi-Task Learning in Deep Neural Networks."
• ICML conferences:
Andrej Karpathy, "Multi-Task Learning in the Wilderness"
Rich Caruana, "Multi-Task Learning: Tricks of the Trade"
• Coursera:
Andrew Ng, "Multi-task learning"
Editor's Notes
1. Both E01 and W02 are affected by the same W1, B1, and B2.
2. Resources in memory and GPU (the same feature can be learned several times on different networks). *The MTL network should be big enough to train on all tasks together (again, several tasks are learned here); if it is big enough, the tasks can share weights (according to Richard Caruana).
3. More samples: if we have multiple related tasks, each with only a limited number of samples, MTL can train on the training sets of all the different tasks. How would you codify a minimal loss for driving? A hard mission if you don't split it into subtasks. At first glance a human cannot decide from one complex picture whether it is safe to start driving; he needs to examine things separately.
4. These four tasks are related: each task is defined using a common computed subfeature, the parity of bits 2 through 6. Third, on those inputs where Task 1 must compute the parity of bits 2 through 8, Task 2 does not need to compute parity, and vice versa. That is, if B1 = 0, then Task 1 = Parity(B2–B6) but Task 2 = 1 independent of the value of Parity(B2–B8). Task 3 and Task 4 are related similarly: Task 3 needs Parity(B2–B6) when B1 = 1, but Task 4 does not, etc.
5. We tested MTL on time-sequence data in a robot domain where the goal is to predict future sensory states from the current sensed state and the planned action. For example, we were interested in predicting the sonar readings and camera image that would be sensed N meters in the future, given the current sonar and camera readings, for N between 1 and 8 meters. As the robot moves, it collects a stream of sense data. We used a backprop net with four sets of outputs; each set predicts the sonar and camera image that will be sensed at a future distance. Output set 1 is the prediction for 1 meter, set 2 for 2 meters, set 3 for 4 meters, and set 4 for 8 meters. The performance of this net at each prediction distance is compared in Table 5 with separate STL nets learning to predict each distance separately. Each entry is the SSE averaged over all sense predictions. Error increases with distance, and MTL outperforms STL at all distances except 1 meter.
6. Table entries are SSE averaged over all sense predictions. The robot reads sonar and camera signals and needs to predict the readings a few meters into the future.
7. Abu-Mostafa, 1989 – if the set of candidate functions is significantly reduced by the constraint that they must satisfy an invariance property, the number of examples of F needed for the learning process decreases accordingly.
8. [Caruana, 1998] defines two tasks to be similar if they use the same features to make a decision. [Baxter, 2000] argues, only theoretically, that related tasks share a common optimal hypothesis class, i.e. have the same inductive bias. [Ben-David and Schuller, 2003] propose that two tasks are F-related if the data for both tasks can be generated from a fixed probability distribution using a set of transformations F. While this allows reasoning over tasks where different sensors collect data for the same classification problem, e.g. object recognition with data from cameras with different angles and lighting conditions, it is not applicable to tasks that do not deal with the same problem. [Xue et al., 2007] finally argue that two tasks are similar if their classification boundaries, i.e. parameter vectors, are close.
9. Early stopping usually monitors the validation loss and stops training when the model starts to overfit. With several tasks, each one trains at a different rate and overfits at a different point. Reasons: different training rates (some tasks are easier), different amounts of data, and different amounts of noise. Ideally we want all tasks to stop at the same point, or at least for features needed by task A that are learned through task B to be learned before task A stops. You should manipulate the tasks so they stop at the same place: oversample differently, use different regularization, or set task weights so they ideally overfit at the same spot.
10. Without the MTL paradigm, the budget will not stretch to training a different network for every task and camera. Use the same hidden layer for all the common features such as edges and shadows, and split the network according to the relevant tasks. Activate only the relevant part of the network for the current task; for example, for cut-in prediction you may want the predictions over a time series and only the main and narrow cameras. Tuning one feature of the net affects the other features (in terms of loss, number of samples, and specific parts of the net).