Active learning from crowds
1. SMU Classification: Restricted
1. Yan, Y., Rosales, R., Fung, G., & Dy, J. G. (2011). Active learning from crowds. In ICML (Vol. 11, pp. 1161–1168).
2. Bi, W., Wang, L., Kwok, J. T., & Tu, Z. (2014). Learning to Predict from Crowdsourced Data. In UAI (pp. 82–91).
3. Rodrigues, F., Lourenco, M., Ribeiro, B., & Pereira, F. C. (2017). Learning Supervised Topic Models for
Classification and Regression from Crowds. IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(12), 2409–2422.
• Most research on supervised learning techniques relies on an
often overlooked assumption: that a single domain expert
can provide the required supervision
• Crowdsourcing
- Quality: a mixture of experts and non-experts; annotators have
different expertise
- Inference: Truth inference from noisy labels
- Budget: How to collect enough useful labels before running
out of budget?
• Motivation behind Crowdsourcing
- It is difficult to collect a single golden ground-truth in some
problem domains
- It is often the case that an annotator does not have the
appropriate knowledge for annotating all the data, even for a
particular domain
- In many instances, collecting annotations from multiple non-
expert annotators can be less costly than collecting
annotations just from one expert
- Collaboration and knowledge sharing are becoming more
common, and thus technology for combining multiple opinions
will become necessary
• In many learning tasks the labeled data is limited in quantity
or expensive to obtain, but the amount of unlabeled data is
large or easy to obtain
• Try to learn the most at a given cost
- Identify the most useful data point to label given the
information obtained
- Identify the most useful annotator
• Active Learning from Crowds and Extensions
- Simple Ground Truth Inference
- Learn the prediction model at the same time
- Extend some existing models to the active-learning-from-crowds
scenario
• Sometimes the annotator may not have the knowledge to
label the data accurately
- The annotation may come from the observation of the input
data, not the underlying ground truth
• Goal
- Actively collect ground truth from the worker, and learn a
prediction model
• Probabilistic Multi-Labeler Model
- N data points {x_1, x_2, …, x_N} from input space X
- The label for the i-th data point by annotator t is y_i^(t) from
label space Y
- The unknown ground truth for the i-th data point is z_i from
output space Z
- All z and some of the y are unobservable
• Model Definition
- The classifier is trained by assuming a probabilistic model
over the random variables X, Y, and Z:
  p(Y, Z | X) = ∏_{i=1}^{N} p(z_i | x_i) ∏_{t ∈ T_i} p(y_i^(t) | x_i, z_i)
where T_i is the set of annotators for the i-th data point
• Model Definition
- We could use a Gaussian model:
  p(y_i^(t) | x_i, z_i) = N(y_i^(t); z_i, σ_t(x_i))
where the variance σ_t(x_i) depends on the input x and is specific to
each annotator t
- For binary classification, the variance is a logistic function of the
input and annotator:
  σ_t(x) = (1 + exp(−w_t^T x − γ_t))^(−1)
• Model Definition
- We could use a Bernoulli model:
  p(y_i^(t) | x_i, z_i) = (1 − η_t(x_i))^|y_i^(t) − z_i| · η_t(x_i)^(1 − |y_i^(t) − z_i|)
where η_t(x) is also a logistic function of the input and the
labeler identity t
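As a minimal sketch of the Bernoulli annotator model above (all function names and numeric values are illustrative, not from the paper), η_t(x) gives the probability that annotator t agrees with the ground truth at input x:

```python
import numpy as np

def eta(x, w_t, gamma_t):
    # Per-annotator "correctness" probability: a logistic function of the
    # input x and annotator-specific parameters (w_t, gamma_t).
    return 1.0 / (1.0 + np.exp(-(w_t @ x + gamma_t)))

def bernoulli_label_likelihood(y, z, x, w_t, gamma_t):
    # p(y | x, z) = (1 - eta)^|y - z| * eta^(1 - |y - z|):
    # the annotator reports the ground truth z with probability eta(x).
    e = eta(x, w_t, gamma_t)
    d = abs(y - z)
    return (1.0 - e) ** d * e ** (1 - d)

x = np.array([1.0, -0.5])
w_t, gamma_t = np.array([2.0, 1.0]), 0.5
p_correct = bernoulli_label_likelihood(1, 1, x, w_t, gamma_t)  # y matches z
p_wrong = bernoulli_label_likelihood(0, 1, x, w_t, gamma_t)    # y contradicts z
```

Note that for a fixed (x, z) the two outcomes sum to one, as a proper conditional distribution over the binary label must.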
• Model Definition
- The Gaussian model allows assigning a lower variance to input
regions where the labeler is consistently correct, relative to
regions where the labeler is inconsistent
- The Bernoulli model assigns a higher probability of the labeler
being correct in certain input regions relative to others
• Model Definition
- Because the task is binary classification, the ground truth is
modeled with a logistic regression function:
  p(z_i = 1 | x_i) = (1 + exp(−α^T x_i − β))^(−1)
• Optimally Selecting New Training Points and Annotators
- Pick a new training point to be labeled
- Pick an appropriate labeler among all available labelers
• To find the least confident data point
- Select the candidate samples for which the probability
p(z = 1 | x) is close to 1/2
• To find the most confident annotator given a data point
- Recall the aforementioned variance formula σ_t(x);
pick the annotator with minimal variance
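A minimal sketch of both selection rules (the logistic forms follow the model definitions above; all names and numbers are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pick_least_confident(X_pool, alpha, beta):
    # Choose the unlabeled sample whose predicted p(z = 1 | x) is closest
    # to 1/2, i.e. the point the current classifier is least sure about.
    p = sigmoid(X_pool @ alpha + beta)
    return int(np.argmin(np.abs(p - 0.5)))

def pick_best_annotator(x, W, gammas):
    # Choose the annotator with the smallest variance sigma_t(x), where
    # sigma_t(x) is a logistic function of the input and annotator params.
    variances = sigmoid(W @ x + gammas)
    return int(np.argmin(variances))

X_pool = np.array([[2.0, 2.0], [0.1, -0.1], [-3.0, 1.0]])
alpha, beta = np.array([1.0, 1.0]), 0.0
i_star = pick_least_confident(X_pool, alpha, beta)   # sample nearest p = 1/2

W = np.array([[1.0, 0.0], [-2.0, 0.0]])              # two annotators
gammas = np.array([0.0, 0.0])
t_star = pick_best_annotator(X_pool[i_star], W, gammas)
```

In the full algorithm these two choices are interleaved with retraining: after the chosen annotator labels the chosen point, the model parameters are re-estimated and the selection repeats.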
• Workers’ qualities can vary drastically and lead to different
noise levels in their annotations
- The worker might not be an expert
- The worker’s default label judgement may be incorrect
- Different labeling tasks can have different difficulties
- The worker may not be dedicated to the task
• Worker’s decision process:
- If the worker is dedicated to the labeling task, or if he considers
the sample easy, the corresponding label is generated
according to his underlying decision function
- Otherwise, the label is generated based on his default labeling
judgement
• The task is a binary classification problem with:
- M workers
- N query samples
- The i-th sample x^(i) is annotated by the set of workers L_i ⊆
{1, 2, …, M}
- The annotation by the j-th worker is y_j^(i) ∈ {0, 1}
- The ground truth y*^(i) ∈ {0, 1} is generated by a logistic
regression model with parameters w*:
  p(y*^(i) = 1 | x^(i)) = σ(w*^T x^(i)), where σ(a) = (1 + exp(−a))^(−1)
• Reasons that an annotator gives an incorrect label:
1. The annotator is dedicated to the task, but his expertise is not
strong enough
Worker j’s annotation then follows a Bernoulli distribution
driven by his own decision function, y_j^(i) ~ Bernoulli(σ(w_j^T x^(i))),
where w_j is worker j’s estimation of w*
A small λ_j suggests w_j is very similar to w*
-> worker j has high accuracy
• Reasons that an annotator gives an incorrect label:
2. The annotator is not dedicated to the task; he randomly
annotates according to some default judgement
Worker j’s annotation then follows a Bernoulli distribution
y_j^(i) ~ Bernoulli(η_j), where η_j ∈ [0, 1]
• Combining the two reasons: whether the label comes from the
worker’s decision function or his default judgement depends on
his dedication and the sample’s difficulty
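The combined decision process can be simulated as below (the slide's combined formula was a figure; this is a hedged sketch where the branch condition and all names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate_annotation(x, w_j, eta_j, dedicated, rng):
    # If the worker is dedicated (or finds the sample easy), the label
    # follows the worker's own decision function sigma(w_j . x);
    # otherwise it is drawn from the default judgement eta_j.
    if dedicated:
        p = sigmoid(w_j @ x)
    else:
        p = eta_j
    return int(rng.random() < p)

rng = np.random.default_rng(0)
x = np.array([3.0, 1.0])
w_j = np.array([2.0, 0.5])   # worker j's estimate of w* (illustrative)
labels = [generate_annotation(x, w_j, eta_j=0.5, dedicated=True, rng=rng)
          for _ in range(200)]
```

Here σ(w_j·x) ≈ 0.998, so a dedicated worker labels this easy sample 1 almost every time; with `dedicated=False` the same worker would flip a fair coin (η_j = 0.5).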
• The difficulty of a sample to an annotator affects quality significantly:
- The difficulty of the i-th sample x^(i) to annotator j is δ_j^(i);
if x^(i) is difficult for j, δ_j^(i) will be close to 0
- A sample is difficult if it is close to the worker’s decision
boundary (small distance |w_j^T x^(i)| to the boundary)
- γ_j is worker j’s sensitivity to sample difficulty: a small γ_j makes
even an easy sample (with a large distance to the boundary)
seem difficult to the worker
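One plausible functional form consistent with the bullets above (δ near 0 at the boundary, saturating toward 1 far from it, damped by a small γ_j) is sketched below; the paper's exact expression may differ, so treat this as an illustration only:

```python
import numpy as np

def difficulty_weight(x, w_j, gamma_j):
    # delta close to 0 -> the sample is difficult for worker j
    # delta close to 1 -> the sample is easy for worker j
    # Samples near the decision boundary (small |w_j . x|) are difficult;
    # a small sensitivity gamma_j makes even distant samples look difficult.
    margin = abs(w_j @ x)
    return 1.0 - np.exp(-gamma_j * margin)

w_j = np.array([1.0, -1.0])
near = difficulty_weight(np.array([0.5, 0.49]), w_j, gamma_j=2.0)   # near boundary
far = difficulty_weight(np.array([3.0, -3.0]), w_j, gamma_j=2.0)    # far away
far_insensitive = difficulty_weight(np.array([3.0, -3.0]), w_j, gamma_j=0.01)
```

The third call shows the γ_j effect: the same far-from-boundary sample yields a small δ when the worker's sensitivity γ_j is small.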
[Graphical model of the generative process: the ground truth is generated
by a logistic regression with parameters w*, which is drawn from a prior;
each worker holds an estimation w_j of w*; further variables capture the
worker’s accuracy, whether the worker is dedicated to the task, the
worker’s sensitivity to task difficulty, and the difficulty of the sample
to the worker]
• Baselines
- MTL: the prediction model is the average of all workers’ models
- RY: coin flipping decides whether an annotation comes from
bias or the ground truth
- YAN: active learning from crowds
- GLAD: considers sample difficulty and workers’ expertise
- CUBAM: considers workers’ expertise and bias
- MV: majority vote
(The algorithm learns a prediction model)
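For reference, the simplest baseline (MV) infers the truth with no model at all; a minimal sketch:

```python
from collections import Counter

def majority_vote(annotations):
    # annotations: one {worker_id: label} dict per sample.
    # Returns the most common label per sample (ties broken by
    # first-seen order, per Counter.most_common semantics).
    return [Counter(a.values()).most_common(1)[0][0] for a in annotations]

annotations = [
    {"w1": 1, "w2": 1, "w3": 0},
    {"w1": 0, "w2": 0, "w3": 0},
]
inferred = majority_vote(annotations)  # [1, 0]
```

Unlike the model-based baselines above, MV weighs every worker equally and ignores both worker expertise and sample difficulty.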
• Beyond binary classification
- Multi-Label Learning from Crowds
• Level of confidence
- Active Learning from Crowds with Unsure Option
- Active Learning with Confidence-based Answers for
Crowdsourcing Labelling Tasks
• More complicated models:
- Gaussian Process Classification and Active Learning with
Multiple Annotators
- Deep Learning from Crowds
• Crowdsourcing can be very helpful when performing out-of-
sample prediction
• Existing models can be extended to the crowdsourcing scenario