Summary from the paper:
Ipeirotis P. G. et al. Repeated labeling using multiple noisy labelers // Data Mining and Knowledge Discovery. – 2014. – Vol. 28. – No. 2. – pp. 402-441.
and a short list of related papers on the topic
1. Noisy label(er)s
overview summary from
https://archive.nyu.edu/jspui/bitstream/2451/29799/2/CeDER-10-03.pdf (Ipeirotis P., Provost F. et al., 2010-09)
and other papers.
R. Kiriukhin
2. Rationale, setting and questions.
Rationale: (#CostSensitiveLearning)
● Cost = Data+Features+Labels(ground truth)
Setting: (not “Active Learning” where “cost(labels)>cost(obtaining new data)”)
● Cost of labeling: Low (Mechanical turk)
● Labels quality: Low (noisy, not ground truth)
● Cost of preparing new data, getting new unlabeled samples: High
Questions:
● Can repeated labeling help?
● What is the expected improvement?
● What is the best way to do it?
3. Dirty labels, are they even a problem?
Learning curves on the “Mushrooms” dataset show:
● Garbage in = garbage out
● Max achievable performance = F(quality)
● d(Performance)/d(DataSize) = G(quality)
Two levers follow: active learning and improving q.
4. Approach for improving q: majority voting.
Majority voting approach:
● Collect more than 1 label per sample from different labelers (j) (hence “repeated labels”).
● Labelers have a quality: Pr(yij = yi)
● Apply majority voting (yielding an “integrated quality” q)
Assumptions:
● Pr(yij = yi | xi) = Pr(yij = yi) = pj (all samples yield the same probability of mistake for a given labeler;
there are no “easy” and “hard” cases)
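A minimal simulation of this integration step, under the paper's assumption that labeler j is correct with a fixed probability pj (function names here are mine, not from the paper): each labeler returns the true label with probability p, and majority voting over several such labels yields an integrated quality q above p.

```python
import random
from collections import Counter

def noisy_label(true_label, p, rng):
    """A labeler of quality p returns the true binary label with probability p."""
    return true_label if rng.random() < p else 1 - true_label

def majority_vote(labels):
    """Integrate a multiset of binary labels into one 'hard' label."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(0)
p, n_labelers, trials = 0.7, 5, 10_000  # odd n_labelers avoids ties
correct = 0
for _ in range(trials):
    true_label = rng.randint(0, 1)
    labels = [noisy_label(true_label, p, rng) for _ in range(n_labelers)]
    correct += majority_vote(labels) == true_label
print(f"integrated quality q ≈ {correct / trials:.3f}")  # noticeably above p = 0.7
```

With p = 0.7 and 5 labelers, the theoretical integrated quality is about 0.84, illustrating how repeated labels lift quality above any single labeler.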
5. Approach for improving q: soft labels.
“Soft labels” approach:
● <same as for majority voting>
● multiplied examples (ME):
○ for every distinct yi in L(x) = {y1, …, yn}, inject a new copy of x with label yi into the training set,
with weight w(x, yi) = |{y ∈ L(x) : y = yi}| / n
Assumptions:
● <same as for majority>
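A sketch of the ME expansion for a single example, following the weight definition above (the helper name is hypothetical):

```python
from collections import Counter

def multiplied_examples(x, labels):
    """Expand one example's label multiset L(x) into weighted training copies:
    each distinct label y gets a copy of x with weight |{l in L(x): l = y}| / n."""
    n = len(labels)
    return [(x, y, count / n) for y, count in Counter(labels).items()]

# An example that received labels {1, 1, 0} from three labelers:
rows = multiplied_examples("x1", [1, 1, 0])
print(rows)  # label 1 carries weight 2/3, label 0 carries weight 1/3
```

Unlike majority voting, both copies enter the training set, so the learner sees the disagreement instead of a collapsed hard label.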
6. ● More labelers = higher quality
○ if pj > 0.5
● Marginal q improvement is not linear in the number of labelers
○ For p = 0.9 there is little benefit from going from 3 to 11 labelers
○ For p = 0.7, going from 1 to 3 labelers gives a q improvement of ≈0.1, which yields a 0.1–0.2
improvement in ROC-AUC (moving from the q = 0.7 curve to the q = 0.8 curve)
● If pj != pi, it is sometimes better to use the single best labeler
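The diminishing returns claimed above follow directly from the binomial form of the integrated quality for n independent labelers of equal quality p (odd n so ties cannot occur); a quick numerical check (the function name is mine):

```python
from math import comb

def integrated_quality(p, n):
    """Probability that a majority vote of n labelers, each independently
    correct with probability p, yields the correct binary label (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# High p: little room left to improve; moderate p: large gain from 1 -> 3 labelers.
print(integrated_quality(0.9, 3), integrated_quality(0.9, 11))  # ~0.972 vs ~0.9997
print(integrated_quality(0.7, 1), integrated_quality(0.7, 3))   # 0.7 vs ~0.784
```

So at p = 0.9 adding eight more labelers buys under 0.03 of quality, while at p = 0.7 just two extra labelers buy roughly 0.08–0.1, matching the slide's ROC-AUC observation.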
7. When does it make sense?
● When repeated-labeling should be chosen for modeling?
● How cheap (relatively speaking) does labeling have to be?
● For a given cost setting, is repeated-labeling much better or only marginally better?
● Can selectively choosing data points to label improve performance?
Empirical analysis to answer these questions:
● 8 datasets with k-fold AUC > 0.7 and a binary response
● 30% as a hold out (test set)
● 70% as a pool for unlabeled and labeled data
● “noising” of labels in labeled data modelled according to pj
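The noising step in this setup can be sketched as flipping each true binary label with probability 1 − pj (a minimal sketch; names are mine):

```python
import random

def noise_labels(labels, p, rng):
    """Simulate a labeler of quality p: keep each true binary label with
    probability p, flip it otherwise."""
    return [y if rng.random() < p else 1 - y for y in labels]

rng = random.Random(42)
true = [1] * 1000
noisy = noise_labels(true, 0.8, rng)
accuracy = sum(a == b for a, b in zip(true, noisy)) / len(true)
print(accuracy)  # close to the simulated labeler quality p = 0.8
```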
8. Decision: label/acquire
● Design choices:
○ Choice of the next sample to (re)label
○ Use “hard” label with majority voting or “soft” labels approach
● Basic strategies:
○ single labeling (SL):
■ get more examples with a single label each
○ fixed round-robin (FRR):
■ keep adding labels to a fixed number of examples, until exhausting our labeling budget
○ generalized round-robin (GRR):
■ give the next label to the example with the fewest labels
10. ● SL vs FRR:
○ under noisy labels, fixed round-robin repeated labeling (FRR) can perform better than single
labeling when there are enough training examples, i.e., after the learning curves are no longer steep
11. With cost introduced, the choice is:
● acquire a new training example for cost CU + CL (CU for the unlabeled portion, CL for the label), or
● get another label for an existing example for cost CL
Units for the x axis = data acquisition cost CD:
● ρ = CU / CL
● k = labels per sample for GRR
● CD = CU · Tr + CL · NL = ρ · CL · Tr + k · CL · Tr ∝ ρ + k
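The cost identity can be checked numerically, assuming CU = ρ·CL and NL = k·Tr under GRR (the function name is mine):

```python
def acquisition_cost(c_l, rho, tr, k):
    """Total data-acquisition cost C_D = C_U*Tr + C_L*N_L, where C_U = rho*C_L
    and N_L = k*Tr labels are bought under GRR, so C_D = (rho + k) * C_L * Tr."""
    c_u = rho * c_l
    n_l = k * tr
    return c_u * tr + c_l * n_l

print(acquisition_cost(c_l=1.0, rho=3.0, tr=100, k=5))  # (3 + 5) * 1 * 100 = 800
```

Since CL and Tr are common factors, comparisons between strategies depend only on ρ + k, which is why the plots use it as the x axis unit.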
SL vs GRR (majority voting):
● As the cost ratio ρ increases, the improvement of GRR over SL also increases: when labels are
cheap relative to new unlabeled data, a repeated-labeling strategy such as GRR gives a
significant advantage.
12. ME vs MV:
● The uncertainty-preserving repeated labeling (ME) outperforms MV in all cases, to greater or
lesser degrees
● When labeling quality is substantially higher (e.g., p = 0.8), repeated labeling is still increasingly
preferable to single labeling as ρ increases; however, we no longer see an advantage for ME
over MV
13. Decision: which sample to relabel
● With ENTROPY, most of the labeling resources are wasted, with the procedure labeling a small
set of examples very many times
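One way to see why an entropy-based selector concentrates its budget: the entropy of an example's observed label multiset depends only on the label proportions, not on how many labels it already has, so a persistently mixed example keeps scoring highest no matter how often it is relabeled (a sketch; names are mine):

```python
from math import log2
from collections import Counter

def label_entropy(labels):
    """Shannon entropy of the empirical label distribution of one example."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(label_entropy([1, 0, 1, 0]))  # maximal disagreement for binary labels
print(label_entropy([1, 1, 1, 1]))  # unanimous labels: zero entropy
```

Note that a 1-vs-1 split and a 50-vs-50 split score identically, so the selector has no reason to stop pouring labels into an intrinsically noisy example.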
14. Bayesian estimate of label uncertainty (LU):
● Bayesian estimation of the probability that the majority label ym is incorrect
○ uniform prior on the example's true positive-label probability
○ with npos positive and nneg negative labels, the posterior is Beta(npos + 1, nneg + 1)
● Uncertainty = CDF of the Beta posterior at the decision threshold, which is given by the
regularized incomplete beta function
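A sketch of the LU score for a binary example, using the binomial-sum identity for the regularized incomplete beta function with integer parameters (function names are mine, not from the paper):

```python
from math import comb

def reg_inc_beta(x, a, b):
    """Regularized incomplete beta I_x(a, b) for integer a, b >= 1,
    via the identity I_x(a, b) = sum_{j=a}^{n} C(n, j) x^j (1-x)^{n-j}, n = a+b-1."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def label_uncertainty(pos, neg, threshold=0.5):
    """With pos positive and neg negative labels and a uniform prior, the
    posterior over the true positive rate is Beta(pos+1, neg+1); the LU score
    is the posterior mass on the losing side of the decision threshold."""
    cdf = reg_inc_beta(threshold, pos + 1, neg + 1)
    return min(cdf, 1 - cdf)

print(label_uncertainty(2, 1))  # Beta(3, 2): Pr(p < 0.5) = 5/16 = 0.3125
```

A 2-vs-1 split is still quite uncertain (0.3125), while a balanced 5-vs-5 multiset is maximally uncertain (0.5), which is exactly the signal used to pick relabeling targets.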
16. NLU vs LU for separating correctly
and incorrectly labeled samples
17. Model Uncertainty (MU):
● Ignores the current multiset of labels; learns a set of models, each of which predicts the
probability of class membership
Label and model uncertainty (LMU):
● Combines the label-uncertainty and model-uncertainty scores
MUCV
● MU with H trained on k folds (cross-validation)
MUO
● (O)racle: MU with H trained on all perfect data
18. Decision: which sample to relabel
- influence on model quality
● Overall, combining label and model uncertainty (LMU and NLMU) produces the best
approaches.
19. Soft labels + selective relabeling?
● “soft-labeling is a strategy to consider in environments with high noise and when using basic
round-robin labeling strategies. When selective labeling is employed, the benefits of using
soft-labeling apparently diminish, and so far we do not have the evidence to recommend using
soft-labeling.”
20. Weighted sampling?
● “The three selective repeated-labeling strategies with deterministic selection order perform
significantly better than the ones with weighted sampling”
22. > Rizos G., Schuller B. W. Average Jane, Where Art Thou? Recent Avenues in Efficient Machine Learning Under Subjectivity Uncertainty. – 2020.
● Summary of approaches to learning optimally when the actual ‘ground truth’ may not be available.
> Fredriksson T. et al. Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies. – 2020.
● Investigates the challenges that companies experience when annotating and labeling their data.
> Raykar V. C. et al. Learning from crowds. – 2010.
● The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels.
> Whitehill J. et al. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. – 2009.
● Presents a probabilistic model and uses it to simultaneously infer the label of each image, the expertise of each labeler, and the difficulty of each image.
> Zhou D. et al. Regularized minimax conditional entropy for crowdsourcing. – 2015.
● A minimax conditional entropy principle to infer ground truth from noisy crowdsourced labels.
> Zhao L., Sukthankar G., Sukthankar R. Incremental relabeling for active learning with noisy crowdsourced annotations. – 2011.
● Most active learning strategies are myopic and sensitive to label noise, which leads to poorly trained classifiers; proposes an active learning method specifically designed to be robust to such noise.
> Dawid A. P., Skene A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. – 1979.
● A model that allows individual error-rates to be estimated for polytomous facets even when the patient's “true” response is not available; the EM algorithm is shown to provide a slow but sure way of obtaining maximum likelihood estimates of the parameters of interest.
> Chen X. et al. Pairwise ranking aggregation in a crowdsourced setting. – 2013.
● In contrast to traditional ranking aggregation methods, the approach learns about and folds into consideration the quality of contributions of each annotator.