Crowd Teaching with Imperfect Labels
1. Crowd Teaching with Imperfect Labels
Yao Zhou1, Arun Reddy Nelakurthi2, Ross Maciejewski3, Wei Fan4, Jingrui He1
1University of Illinois at Urbana Champaign, 2Samsung Research America,
3Arizona State University, 4Tencent America
2. - 2 -
The Surge of Crowdsourcing
q Need label information for training (semi-)supervised ML models
q Huge demand exists for fine-grained label information in real-world applications
o Fine-grained segmentation and localization in CV problems.
o Downstream finetuning tasks in NLP problems.
o AI assisted medical image and signal diagnosis problems.
Jacob Devlin, et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019
https://appen.com/blog/computer-vision-vs-machine-vision/
[Figures: computer vision and natural language processing examples]
3. - 3 -
Fine-Grained Crowdsourcing
q What we have and what we need:
o We are not annotating data from scratch.
o Coarsely labeled data are available but imperfect (they contain incorrect labels).
o Objectives:
• Leverage label information from amateur workers to improve label quality.
• Teach crowd workers with imperfect labels and improve their labeling expertise.
q Existing solutions
o Conventional crowdsourcing models, see [1, 2, 3]
o Crowd teaching models, see [4, 5, 6]
5. - 5 -
Fine-Grained Crowdsourcing
q Issues with existing solutions
o Conventional crowdsourcing models focus only on label quality:
• Downweight the weak workers and trust the good workers.
• Motivate workers to convey their knowledge by designing good incentive systems.
Yao Zhou, et al., MultiC2: an Optimization Framework for Learning from Task and Worker Dual Heterogeneity. SDM 2016
Nihar B. Shah, et al., Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing. NIPS 2015.
An important fact is ignored: human beings are good at learning a
concept and transferring the learned concept to similar learning tasks.
7. - 7 -
Fine-Grained Crowdsourcing
q Issues with existing solutions
o Crowd teaching models only focus on the labeling expertise of crowd workers:
• Require a perfectly labeled teaching set (e.g., hypothesis transition teaching models).
• Require a perfect target concept (e.g., iterative teaching models).
• Lack explanations for the teaching samples.
[Figures: crowd teaching with a perfect teaching set; iterative crowd teaching with a perfect target concept]
Adish Singla, et al., Near-Optimally Teaching the Crowd to Classify. ICML 2014
Yao Zhou, et al., Unlearn What You Have Learned: Adaptive Crowd Teaching with Exponentially Decayed Memory Learners, KDD 2018
8. - 8 -
Fine-Grained Crowdsourcing
q Research question
o Can we simultaneously improve both label quality and worker expertise in a unified framework?
9. - 9 -
Roadmap
q Introduction
o Problem setting
o Overview of the framework
q The proposed framework
o Learner model
o Explanation difficulty
o Teacher model
q Extensions
o Imperfect teaching with surrogate cost
o Curriculum learning with label influence
q Experiments
o Quantitative evaluation
o Qualitative evaluation
q Conclusion
11. - 11 -
Problem Setting
q Given:
o An imperfectly labeled data set (a mixture of correct and incorrect labels).
o An unlabeled data set.
o A prediction model (e.g., a classification model).
q Objective I: produce a higher-quality labeled data set that contains
o Items from the labeled set that are re-labeled and verified by workers.
o Items from the labeled set whose labels stay untouched.
o Items from the unlabeled set that receive new labels from workers.
q Objective II: use the imperfect prediction model as the teacher, teaching the
workers to label with personalized teaching sequences and visual explanations.
12. - 12 -
Overview of the Framework
q Interactions between teacher and student
o First, the teacher recommends and shows an item to the learner, who provides
an initial label.
o Second, the teacher shows its probabilistic prediction and a visual explanation;
the learner updates her label and chooses a trusted explanation.
o Third, a masked explanation is provided to the learner based on her confidence;
only high-confidence labels are recorded.
14. - 14 -
Adaptive Teaching and Learning
q Learner model
o The learners use a gradient-based learning procedure to improve their concepts
iteratively.
o Each learner has an exponentially decayed retrievability of memory.
o The memory momentum can be rewritten in closed form if its initial value is set to zero.
(Equation callouts: the learner's concept at the t-th teaching iteration; the
explanation difficulty coefficient; the learner's memory momentum at the t-th
teaching iteration; the memory decay rate; the index of the teaching item in the
t-th iteration; the learning loss of the learner.)
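The learner model above can be sketched in code. This is a minimal sketch, assuming a logistic-loss learner; the function names and the exponential form of the decay are illustrative assumptions, not the paper's exact notation:

```python
import numpy as np

def memory_retrievability(decay_rate, t, s):
    """Assumed exponentially decayed retrievability of an item shown at
    iteration s when recalled at iteration t."""
    return np.exp(-decay_rate * (t - s))

def learner_update(w, x, y, lr, beta, momentum, decay_rate):
    """One gradient-style concept update with exponentially decayed memory.
    beta is the explanation-difficulty coefficient re-scaling the step.
    Returns the new concept and the new memory momentum."""
    margin = y * np.dot(w, x)
    grad = -y * x / (1.0 + np.exp(margin))      # gradient of the logistic loss
    momentum = decay_rate * momentum + grad     # decayed accumulation of gradients
    return w - lr * beta * momentum, momentum
```

With a zero initial momentum, iterating `learner_update` reproduces the closed-form accumulation `m_t = sum_s decay^(t-s) * g_s` mentioned on the slide.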
15. - 15 -
Adaptive Teaching and Learning
q Explanation difficulty
o An item (image or text) with a smaller attention region (a small area of pixels
or a few key words) is easier to interpret.
o Explanation difficulty is defined as the entropy of the generated explanation.
o The explanation re-scaling coefficient is defined so that it is greater than 1
if an item has a non-uniform visual explanation.
[Figure: an image explained as a "domestic" cat by highlighting the facial area]
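The entropy-based difficulty can be sketched as follows. The re-scaling coefficient here uses one plausible form, maximum entropy divided by actual entropy, which is 1 for a uniform map and greater than 1 for a concentrated one; the exact coefficient used in the paper may differ:

```python
import numpy as np

def explanation_difficulty(saliency):
    """Entropy of a non-negative explanation map, normalized to a distribution."""
    p = np.asarray(saliency, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]                       # 0 * log 0 = 0 by convention
    return -np.sum(p * np.log(p))

def rescaling_coefficient(saliency):
    """Assumed re-scaling: max entropy over actual entropy, so a concentrated
    (easy-to-interpret) explanation gets a coefficient > 1."""
    n = np.asarray(saliency).size
    return np.log(n) / max(explanation_difficulty(saliency), 1e-12)
```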
16. - 16 -
q Teacher model
o The teacher aims to maximize the learner's speed of convergence by minimizing
the distance between the learner's current concept and the teacher's empirical
target concept over two consecutive iterations:
(Equation callouts: the empirical target concept from the teacher; the teacher's
cost function.)
Adaptive Teaching and Learning
17. - 17 -
q Teacher model
o In each iteration, the teacher recommends a teaching item together with its
explanation. Teaching then becomes a pool-based search problem:
o The candidate pool includes both the labeled and the unlabeled sets; thus, a
worker can re-label an existing item or give a new label to an unlabeled item.
Adaptive Teaching and Learning
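The pool-based search can be sketched as a greedy loop: simulate one learner update per candidate and pick the candidate that moves the learner's concept closest to the empirical target. This is a simplified sketch with a plain gradient step standing in for the full learner model; `simulate_step` and `teacher_select` are illustrative names:

```python
import numpy as np

def simulate_step(w, x, y, lr=0.5):
    """Simulated learner update on candidate (x, y): one logistic-loss gradient step."""
    grad = -y * x / (1.0 + np.exp(y * np.dot(w, x)))
    return w - lr * grad

def teacher_select(w_learner, w_target, pool):
    """Pool-based search: return the index of the candidate whose simulated
    update moves the learner's concept closest to the teacher's target."""
    def dist_after(item):
        x, y = item
        return np.linalg.norm(simulate_step(w_learner, x, y) - w_target)
    return min(range(len(pool)), key=lambda i: dist_after(pool[i]))
```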
19. - 19 -
q Make the teaching model better.
o Is the teacher good enough to teach if it uses the empirical target concept
learned from the imperfect data set?
ü Yes, if a surrogate cost function is used instead.
o Since the labeled data set is imperfect, the aggregated labels of items will
usually include errors. We define the class-conditional error rate of labels as:
(Equation callouts: the surrogate cost function; the original cost function;
shorthand notation.)
Adaptive Teaching and Learning
20. - 20 -
q Make the teaching model better.
o How good is this empirical target concept if the surrogate cost function is used?
§ The empirical risk is bounded, with high probability, by the risk that uses the
unknown ground-truth concept.
o Can this bound become tighter?
ü Yes, if the following condition is satisfied.
(Callouts: the error rate of learner-provided labels; the learner's confidence in
the true class label.)
Nagarajan Natarajan, et al. Learning with Noisy Labels. NIPS 2013.
Adaptive Teaching and Learning
21. - 21 -
q Make the teaching model better.
o Consequences of the condition above:
§ The error rate of labeled items will decrease as learners provide confident labels.
§ The empirical risk will gradually have a smaller upper bound and approach the
risk that uses the true target concept.
Adaptive Teaching and Learning
22. - 22 -
Crowd Teaching with Imperfect Labels
q Make the teaching model better (continued)
o Is the teaching sequence good enough to fit the natural learning trend of human
learners?
Not really! Not all items have an equal influence on the prediction model.
o Teaching should follow the principle of curriculum learning, i.e., the teaching
sequence progresses from easy to difficult.
§ Easy items have a small influence on the model since they are usually data points
with a large margin in feature space.
§ Difficult items have a large influence because changing their labels would
strongly affect the behavior of the prediction model.
23. - 23 -
q Make the teaching model better (continued)
o The teaching sequence should have increasing influence, where influence is
defined as the change in the model's prediction w.r.t. a label perturbation.
§ An item with its label is z = (x, y).
§ The item with its label perturbed is z′ = (x, y′).
§ For simplicity, we abbreviate the perturbed item as z′.
o The empirical risk minimizer after replacing a small mass ε of the original item z with z′ is:
o The influence of upweighting item z by a small mass ε is defined as:
Pang Wei Koh, Percy Liang. Understanding Black-box Predictions via Influence Functions. ICML 2017
Adaptive Teaching and Learning
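The two quantities above can be written out following Koh & Liang's notation (a hedged reconstruction; the symbols θ, H, ℓ are assumed here):

```latex
\hat{\theta}_{\epsilon, z', -z}
  = \arg\min_{\theta} \; \frac{1}{n}\sum_{i=1}^{n} \ell(z_i, \theta)
    + \epsilon\, \ell(z', \theta) - \epsilon\, \ell(z, \theta)

\left.\frac{d\hat{\theta}_{\epsilon, z', -z}}{d\epsilon}\right|_{\epsilon=0}
  = -H_{\hat{\theta}}^{-1}\left(\nabla_{\theta}\ell(z', \hat{\theta})
    - \nabla_{\theta}\ell(z, \hat{\theta})\right),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{\theta}\ell(z_i, \hat{\theta})
```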
24. - 24 -
q Make the teaching model better (continued)
o Then, the parameter change from perturbing label y is given as the difference of
influences between upweighting z′ and upweighting z:
o With some simplification (a first-order approximation and the chain rule), we
obtain the perturbed loss on ztest from changing label y:
Adaptive Teaching and Learning
25. - 25 -
q Make the teaching model better (continued)
o The influence of a label perturbation has two cases:
§ Positive influence (perturbing y from -1 to +1).
§ Negative influence (perturbing y from +1 to -1).
o The influence score of any item ztest is the absolute value of its influence:
o The overall goal of teaching becomes selecting a sequence of items that are both
effective for concept learning and of increasing influence.
(Equation callouts: the teaching item index; the teaching score from the objective
in Eq. (11); the influence score.)
Adaptive Teaching and Learning
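The signed influence and its absolute-value score can be sketched for a logistic model. This is an illustrative sketch under the standard influence-function approximation; `influence_of_flip` and `influence_score` are assumed names, and `H_inv` is the inverse Hessian of the (regularized) empirical risk:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y):
    """Gradient of the logistic loss at (x, y), labels in {-1, +1}."""
    return -y * x * sigmoid(-y * np.dot(theta, x))

def influence_of_flip(theta, H_inv, x, y, x_test, y_test):
    """Signed first-order change of the test loss when label y is flipped to -y
    (difference of upweighting (x, -y) and upweighting (x, y))."""
    g_diff = grad_loss(theta, x, -y) - grad_loss(theta, x, y)
    return -grad_loss(theta, x_test, y_test) @ H_inv @ g_diff

def influence_score(theta, H_inv, x, y, x_test, y_test):
    """Absolute value of the signed influence, as on the slide."""
    return abs(influence_of_flip(theta, H_inv, x, y, x_test, y_test))
```

Flipping the label of a training point identical to the test point always predicts an increased test loss (a positive signed influence), which is a quick sanity check on the sign convention.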
26. - 26 -
Discussion and Extensions
q A worker (learner) is only confident on one category of items
o If a worker is only confident on the positive class (i.e., she only gives
confident positive labels).
o In this case, c+1 > 0 and c-1 = 0, and Theorem 4.1 is easily satisfied. Therefore,
the updated label set still leads to a better prediction model.
q Teaching with starving prevention.
o If repeated labeling is allowed, the overall teaching score could stay high
for certain items; the low-score items would then be starved and never recommended.
o The influence intensity is updated as the entropy of item xi's label set.
§ A low-entropy label set will downgrade the influence score faster than a
high-entropy label set.
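The entropy-based starving prevention can be sketched as follows. Scaling by the label-set entropy is one plausible reading of the slide (unanimous label sets decay the score fastest); the exact update rule in the paper may differ:

```python
import math
from collections import Counter

def label_set_entropy(labels):
    """Entropy of an item's collected label multiset."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def decayed_influence(base_score, labels):
    """Sketch of starving prevention: scale the influence score by the label-set
    entropy, so items with unanimous (low-entropy) labels are downgraded fastest."""
    max_h = math.log(2)  # maximum entropy for binary labels
    return base_score * (label_set_entropy(labels) / max_h if labels else 1.0)
```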
28. - 28 -
Experiments
q Data sets
o Cat images (domestic vs. wild)
o Canidae images (domestic vs. wild)
o Text documents (comp. vs. sci.)
q Features
o Image features are taken from a fine-tuned ResNet34's penultimate layer.
o Text features are TF-IDF vectors after standard document preprocessing.
q Explanations
o Image explanations are saliency maps generated by Grad-CAM.
o Text explanations are highlighted words generated by LIME.
q Class-conditional error
o Labels are randomly flipped with a fixed error rate.
q Human learners
o 61 trials; each learner is assigned one teaching algorithm in round-robin fashion.
Ramprasaath Selvaraju, et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV 2017
Marco Ribeiro, et al., Why Should I Trust You? Explaining the Predictions of Any Classifier. KDD 2016
31. - 31 -
q Quantitative results
o Teaching gain: labeling accuracy(after teaching) – labeling accuracy(before teaching)
JEDI: interactive teaching
w/o explanation.
VADER-lite: removes
confidence gauging in the 3rd
step of VADER.
VADER: our proposed three-
step teaching model.
Yao Zhou, et al., Unlearn What You Have Learned: Adaptive Crowd Teaching with Exponentially Decayed Memory Learners, KDD 2018
Experiments
32. - 32 -
q Quantitative results
o Label retrieval rate: fraction of items with incorrect labels that have been corrected after
teaching.
Teacher: the initial prediction model.
Worker Initial: the workers before
receiving any teaching.
JEDI: interactive teaching w/o
explanation.
VADER-lite: removes confidence
gauging in the 3rd step of VADER.
VADER: our proposed three-step
teaching model.
Improvement due to
confidence gauging
Experiments
33. - 33 -
q Quantitative results
o Model performance: retrain the model after teaching, compare the retrained prediction
model (using the MMCE aggregated labels) with the teacher’s performance.
Cat and Text have improved model
performance due to their high label
retrieval rate.
Canidae shows barely any improvement
in model performance because of its low
label retrieval rate: the label-perturbation
influence is not high enough.
Dengyong Zhou, John Platt, Sumit Basu, and Yi Mao. Learning from the Wisdom of Crowds by Minimax Entropy. NIPS, 2012.
Experiments
34. - 34 -
q Qualitative results
o Results of the influence score. The top row shows high-influence Canidae images and
the bottom row shows low-influence Canidae images. Each image is described by a tuple
of its true category and an error indicator (1: label error, 0: no label error).
Experiments
35. - 35 -
Conclusion
q VADER model
ü Interactive learning and teaching.
ü Simultaneously improves label quality of data & labeling expertise of workers.
ü Does not require perfect teaching set and perfect target concept.
ü Has human interpretable explanations.
ü Provides theoretical connections between teaching and explanation.
36. - 36 -
1. Dengyong Zhou, John Platt, Sumit Basu, Yi Mao. Learning from the Wisdom of Crowds by Minimax Entropy. NIPS 2012.
2. Vikas Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Valadez, Charles Florin, Luca Bogoni, Linda Moy. Learning from Crowds. JMLR 2010.
3. Yao Zhou, Jingrui He. Crowdsourcing via Tensor Augmentation and Completion. IJCAI 2016.
4. Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, Andreas Krause. Near-Optimally Teaching the Crowd to Classify. ICML 2014.
5. Edward Johns, Oisin Mac Aodha, Gabriel J. Brostow. Becoming the Expert: Interactive Multi-Class Machine Teaching. CVPR 2015.
6. Yao Zhou, Arun Reddy Nelakurthi, Jingrui He. Unlearn What You Have Learned: Adaptive Crowd Teaching with Exponentially Decayed Memory Learners. KDD 2018.
7. Yao Zhou, Lei Ying, Jingrui He. MultiC2: An Optimization Framework for Learning from Task and Worker Dual Heterogeneity. SDM 2016.
8. Nihar B. Shah, Dengyong Zhou. Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing. NIPS 2015.
9. Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, Ambuj Tewari. Learning with Noisy Labels. NIPS 2013.
10. Pang Wei Koh, Percy Liang. Understanding Black-box Predictions via Influence Functions. ICML 2017.
11. Ramprasaath Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV 2017.
12. Marco Ribeiro, Sameer Singh, Carlos Guestrin. Why Should I Trust You? Explaining the Predictions of Any Classifier. KDD 2016.
References