2. What is multi-task learning?
Auxiliary tasks
Examples
Why does MTL work?
Intuition needed for MTL
3. STL – single-task learning
• Optimize a single task
Minimize the loss according to this task only
4. Simple thought experiment
Given 50K female and 50K male medical records:
Should you train separate models for each gender?
Should you use gender as an additional input feature?
6. Simple thought experiment
• We don't know whether to train separate models or a combined one
• Let the neural network weights make the decision for us
• Common features can be learned in the shared hidden layer
• A feature that develops for one task can be shared with another
• Weights for features a task does not use can stay low, so loosely coupled tasks can effectively decouple
7. MTL – multi-task learning
• Optimize several related tasks
Minimize the loss according to several related tasks
• Learn related tasks in parallel
Use shared representations
Leverage information from other tasks
9. MTL – multi-task learning
STL: $\min_{w}\ \frac{1}{m}\sum_{i=1}^{m} L\left(f_w(x^i),\, Y^i\right) + \lambda R(w)$
MTL: $\min_{w_1,\dots,w_4}\ \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4} L\left(f_{w_j}(x^i),\, Y_j^i\right) + \lambda R(w_j)$
L – loss function, such as the hinge loss or square loss
R – regularization function, such as L2 or L1
(A code sketch of this objective follows below.)
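A minimal PyTorch sketch of the MTL objective above, assuming four regression tasks that share one hidden layer; the layer sizes, the square loss, and using weight decay for the λR(w) term are illustrative assumptions, not from the slides.

```python
import torch
import torch.nn as nn

n_tasks = 4  # matches the j = 1..4 sum above

class HardSharingMTL(nn.Module):
    def __init__(self, in_dim=16, hidden=64):
        super().__init__()
        # Shared hidden layer: common features develop here once
        # and are reused by every task head.
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # One small head (the per-task weights w_j) per task.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, x):
        h = self.shared(x)
        return [head(h) for head in self.heads]

model = HardSharingMTL()
loss_fn = nn.MSELoss()  # L: the square loss from the slide
# weight_decay plays the role of the lambda * R(w) term with R = L2.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

x = torch.randn(32, 16)                            # dummy batch (m = 32)
ys = [torch.randn(32, 1) for _ in range(n_tasks)]  # one label set Y_j per task

preds = model(x)
loss = sum(loss_fn(p, y) for p, y in zip(preds, ys))  # sum over tasks j
opt.zero_grad()
loss.backward()  # gradients from all tasks sum at the shared layer
opt.step()
```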
11. Disadvantages of individually learned tasks
• To perform several tasks, you must train several times
• More resources are needed for several different networks*
• No information learned from one task can be transferred to another
12. Advantages of multi-task learning
• Get more samples from the other tasks' training sets
• Decompose a complex task (hard to codify) into several simpler tasks
• The model generalizes better (it is not over-optimized for one specific task)
13. What if I care about only one task?
Surprisingly, most real-world problems can still reap the benefits of MTL by using auxiliary tasks.
14. Auxiliary tasks – learn from hints
• Predict features as auxiliary tasks
Instead of directly recognizing complex objects like cars or pedestrians, also train on edges, shapes, regions, textures, text, orientation, distance, shadows, and reflections
15. Auxiliary tasks – learn from instances
To predict the sentiment of a sentence, add an auxiliary task that predicts whether the sentence contains a positive or negative sentiment word, as sketched below.
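A minimal sketch of how such an auxiliary label could be derived for free from the training sentences; the tiny lexicon and helper name are assumptions for illustration.

```python
# Toy sentiment lexicon (an assumption; in practice use a real one).
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def has_sentiment_word(sentence: str) -> int:
    """Auxiliary label: 1 if the sentence contains any lexicon word."""
    words = set(sentence.lower().split())
    return int(bool(words & (POSITIVE | NEGATIVE)))

# Main task label: sentence-level sentiment, from the dataset.
# Auxiliary label: derived automatically, giving the shared encoder
# an extra, easier training signal.
print(has_sentiment_word("I love this movie"))  # -> 1
```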
16. Auxiliary tasks – focusing attention
• Use the auxiliary task to focus attention on parts of the image that might otherwise be ignored.
• For example, lane markings might be ignored because they do not always appear and are relatively small. If we force the model to learn them, they can serve the main task.
17. Auxiliary tasks – quantization smoothing
• Train an auxiliary task at a different quantization level
• If a less quantized, or continuous, version of the problem exists, the smoother problem can be easier to learn
• Example: for distance estimation, instead of only the labels {close, far}, also learn the real distance (see the sketch below)
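A sketch of this idea for the distance example: the coarse {close, far} label is derived from the continuous distance, and both targets train a shared trunk. The threshold, layer sizes, and loss choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

CLOSE_THRESHOLD = 10.0  # meters; an assumed cutoff for "close"

distance = torch.rand(32, 1) * 50.0                      # continuous ground truth
coarse = (distance > CLOSE_THRESHOLD).long().squeeze(1)  # 0 = close, 1 = far

trunk = nn.Sequential(nn.Linear(8, 32), nn.ReLU())
cls_head = nn.Linear(32, 2)  # main task: {close, far}
reg_head = nn.Linear(32, 1)  # auxiliary: smoother, continuous distance

x = torch.randn(32, 8)
h = trunk(x)
loss = nn.functional.cross_entropy(cls_head(h), coarse) \
     + nn.functional.mse_loss(reg_head(h), distance)
loss.backward()  # the smooth regression signal shapes the shared trunk
```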
18. Auxiliary tasks – use the future
• Future measurements can be used in offline learning problems.
For example, when driving, far objects are hard to identify; only after the car passes near them can they be identified accurately. Sometimes the results are available only after the test; use them to train offline.
19. Auxiliary tasks – time series prediction
• When learning a task on a short time scale, the learner may find it difficult to recognize the longer-term processes, and vice versa. Train both time scales on a single net.
20. Auxiliary tasks – the same task from different points of view
• Use different metrics as tasks, so your model learns something different from each loss
For example, minimize squared loss, log loss, rank loss, or an accuracy surrogate on the same task (see the sketch after this list)
• You can also learn the problem in several representations
For example, if the task is easier to learn in polar coordinates but the application needs Cartesian coordinates
• Sometimes it helps to learn the same task multiple times
The random weights connected to each task head let each copy learn the features in different ways
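A sketch of scoring one prediction with several loss functions at once; the particular mix of losses is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

pred = torch.randn(32, 1, requires_grad=True)  # stand-in for a model output
target = torch.randn(32, 1)

# Each loss penalizes different error patterns, so their gradients
# pull the shared representation in complementary directions.
loss = F.mse_loss(pred, target) + F.l1_loss(pred, target) \
     + F.smooth_l1_loss(pred, target)
loss.backward()
```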
29. Lesson to be learned – time series prediction
• Tasks can sometimes help or interfere with each other
• The help tasks give each other can be asymmetric
• Always try different models to find the best match for your task
30. Why MTL works
• Several mechanisms help MTL backprop nets generalize better.
• All of these mechanisms derive from summing the error gradient terms of the different tasks at the shared hidden layer.
• Each, however, exploits a different relationship between tasks.
31. Why MTL works – representation bias
• With random weight initialization, different runs can end in different local minima
• If T and T′ share a common minimum A along with other, uncommon minima, then training on both tasks makes it more likely that we end in the common minimum
• The opposite is also interesting: MTL tasks prefer NOT to use hidden-layer representations that other tasks prefer NOT to use, so if one task has a strong bias toward an uncommon minimum, the other task can end up in that uncommon minimum as well
32. Why MTL works – eavesdropping
• Suppose T′ can learn a feature F that is useful to T more easily than T can, while T on its own would learn only a complex representation of F. Once F is learned through T′, T can eavesdrop and use the simpler representation.
• In the extreme, T′ can be the feature F itself (learning from hints).
33. Why MTL works – generalization
• When learning several tasks, the risk of overfitting to a specific feature decreases
• If T and T′ use F differently (depending on the weights), the only changes allowed in F are those supported by both tasks' losses; F cannot drift in a direction that helps only one task
34. Why MTL works – feature amplification
• We want to learn a good representation of a feature without the data-dependent noise of any one task.
• Because different tasks have different noise patterns, learning several tasks with a common internal feature lets the model obtain a better representation of that feature and ignore the noise each task adds to it.
36. Which auxiliary tasks will be helpful?
• Open question
• We do not yet have a good notion of when tasks are similar or related
• Currently we rely on the assumption that an auxiliary task should be related to the main task in some way for it to be helpful
• You must test several models and find which one best fits your task
37. Loss function considerations
• Some tasks are more important than others
• Some tasks are learned much more easily
• Some tasks have more data
• Some tasks have more noise
$\min_{w}\ \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} L\left(f_{w_j}(x^i),\, Y_j^i\right) + \lambda R(w_j)$
These considerations suggest weighting each task's term in the summed loss, as sketched below.
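A sketch of folding per-task weights α_j into the summed loss; the weight values here are illustrative assumptions that would in practice be tuned on validation data to reflect task importance, ease of learning, data volume, and noise.

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
alpha = [1.0, 0.5, 2.0, 0.25]  # assumed per-task weights, j = 1..4

preds = [torch.randn(32, 1, requires_grad=True) for _ in range(4)]  # stand-ins
targets = [torch.randn(32, 1) for _ in range(4)]

# Weighted MTL loss: sum_j alpha_j * L(f_{w_j}(x), Y_j)
total = sum(a * loss_fn(p, y) for a, p, y in zip(alpha, preds, targets))
total.backward()
```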
40. References
• Abu-Mostafa, Y. S., "Learning from Hints in Neural Networks," Journal of Complexity, 1990, 6(2), pp. 192–198.
• Caruana, R., "Multitask Learning: A Knowledge-Based Source of Inductive Bias," Proceedings of the Tenth International Conference on Machine Learning, 1993.
• Ruder, S., "An Overview of Multi-Task Learning in Deep Neural Networks."
• ICML talks:
Karpathy, A., "Multi-Task Learning in the Wilderness."
Caruana, R., "Multi-Task Learning: Tricks of the Trade."
• Coursera:
Ng, A., "Multi-task Learning."
Editor's Notes
Both E01 and W02 are affected by the same W1, B1, and B2.
Resources in memory and on the GPU (the same feature may be learned several times across different networks).
*The MTL network should be big enough to train on all tasks together (again, several tasks are learned here).
If the network is big enough, the tasks can share weights (according to Rich Caruana).
More samples: if we have multiple related tasks, each with a limited number of samples, MTL can train on the combined training sets of all the different tasks.
How would you code a minimum loss for driving? It is a hard mission if you don't separate it into subtasks. At first glance a human cannot tell from a complex picture whether it is safe to start driving; they need to examine things separately.
These four tasks are related: each task is defined using a common computed subfeature, the parity of bits 2 through 6. Moreover, on those inputs where Task 1 must compute the parity of bits 2 through 8, Task 2 does not need to compute parity, and vice versa. That is, if B1 = 0, then Task 1 = Parity(B2–B6) but Task 2 = 1 independent of the value of Parity(B2–B8). Task 3 and Task 4 are related similarly: Task 3 needs Parity(B2–B6) when B1 = 1, but Task 4 does not, etc.
We tested MTL on time sequence data in a robot domain where the goal is to predict future sensory states from the current sensed state and the planned action. For example, we were interested in predicting the sonar readings and camera image that would be sensed N meters in the future given the current sonar and camera readings, for N between 1 and 8 meters. As the robot moves, it collects a stream of sense data.
We used a backprop net with four sets of outputs. Each set predicts the sonar and camera image that will be sensed at a future distance. Output set 1 is the prediction for 1 meter, set 2 is for 2 meters, set 3 is for 4 meters, and set 4 for 8 meters. The performance of this net on each prediction distance is compared in Table 5 with separate STL nets learning to predict each distance separately. Each entry is the SSE averaged over all sense predictions. Error increases with distance, and MTL outperforms STL at all distances except 1 meter.
The robot reads sonar and camera signals and needs to predict the readings several meters into the future.
Abu-Mostafa (1990): if the set of candidate functions is significantly reduced by the constraint that they must satisfy the invariance property, the number of examples of F needed for the learning process decreases accordingly.
[Caruana, 1998] defines two tasks to be similar if they use the same features to make a decision. [Baxter, 2000] argues, only theoretically, that related tasks share a common optimal hypothesis class, i.e., have the same inductive bias. [Ben-David and Schuller, 2003] propose that two tasks are F-related if the data for both tasks can be generated from a fixed probability distribution using a set of transformations F. While this allows reasoning over tasks where different sensors collect data for the same classification problem, e.g., object recognition with data from cameras with different angles and lighting conditions, it is not applicable to tasks that do not deal with the same problem. [Xue et al., 2007] finally argue that two tasks are similar if their classification boundaries, i.e., parameter vectors, are close.
Early stopping usually monitors the validation loss and stops training when the model starts to overfit.
Now you have several tasks, each of which trains at a different rate and overfits at a different point.
Reasons: different training rates (some tasks are easier), different amounts of data, and different noise levels.
Ideally we want all tasks to stop at the same point, or at least for features needed by task A that are learned on task B to be learned before task A finishes.
You should manipulate the tasks so they stop at the same point:
oversample differently, regularize differently, or weight the tasks so they ideally overfit at the same spot (see the sketch below).
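A sketch of one way to implement this, assuming per-task validation losses are available each epoch: a task that stops improving is dropped from the summed loss, so the shared layers keep training only on tasks that still generalize. The patience scheme and variable names are assumptions, not from the talk.

```python
n_tasks = 4
patience = 3                       # assumed: epochs without improvement
best = [float("inf")] * n_tasks    # best validation loss seen per task
stale = [0] * n_tasks              # epochs since that best
active = [True] * n_tasks          # whether task j still contributes

def update_task_schedule(val_losses):
    """val_losses: this epoch's per-task validation losses."""
    for j, v in enumerate(val_losses):
        if v < best[j]:
            best[j], stale[j] = v, 0
        else:
            stale[j] += 1
            if stale[j] >= patience:
                active[j] = False  # task j has started overfitting

# In the training loop, sum only the active tasks' losses:
#   loss = sum(L_j for j in range(n_tasks) if active[j])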
Without the MTL paradigm, the compute budget does not stretch to training a different network for every task and camera.
Use the same hidden layers for all the common features, such as edges and shadows, and split into heads according to the relevant tasks.
Activate only the relevant part of your network for the current task (a minimal sketch follows below).
For example, in cut-in prediction you may want predictions over a time series,
and you only want the Main and Narrow cameras.
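A minimal sketch of task-conditional activation on a shared trunk; the head names "cut_in" and "lane_marking" are placeholders borrowed from the examples above, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

trunk = nn.Linear(16, 32)  # shared features (edges, shadows, ...)
heads = nn.ModuleDict({
    "cut_in": nn.Linear(32, 1),
    "lane_marking": nn.Linear(32, 1),
})

def forward_task(x, task: str):
    # The shared trunk always runs; only the requested head is activated.
    return heads[task](trunk(x))

out = forward_task(torch.randn(4, 16), "cut_in")
```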
Tuning one feature of the net affects the other features (in terms of loss, number of samples, and the specific parts of the net involved).