Talk given at the "Wisdom of the Crowd" workshop at the AAAI 2012 Spring Symposium (http://users.wpi.edu/~soniac/WisdomOfTheCrowd/WoCSchedule.htm), based on the 2011 AAAI HCOMP paper of the same title.
On Quality Control and Machine Learning in Crowdsourcing
1. On Quality Control and Machine Learning in Crowdsourcing
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
2. Quality Control
• Many factors matter
– guidelines, experimental design, human factors,
automation, …
• Only as strong as weakest link
– automation is not a silver bullet
• Errors are not just due to lazy/stupid workers
– Even in carefully designed and managed annotation projects, uncertain cases are encountered
3. Human Factors (HF)
• Questionnaire / Survey Design
• Interface / Interaction Design
• Incentives
• Human Relations (HR): recruitment & retention
• Long-term Commitment
– rapport with co-workers
– buy-in to organizational mission & value of work
– opportunities for advancement in organization
• Oversight / Management / Organization
• Communication
4. HF Challenges & Consequences
• Not part of typical CS curriculum or expertise
– crowdsourcing disrupts prior area boundaries
• NLP, IR, ML people traditionally don’t do HCI
– now many of us dealing with such issues
• Consequences
– Errors from poor HF
– Stumbling into known problems, recreating solutions
– May see problems through limited vantage point
– May over-rely on automation
• Great opportunities for HCI collaboration
5. Minority Voice & Diversity
• Opportunity: more diversity than “experts”
• Risk: false reinforcement of majority view
when minority is ignored, lost, or eliminated
• Questions
– How to recognize when majority is wrong?
– How to recognize alternative or better truths?
– Is QC systematically eliminating diversity?
– How diverse is the crowd really?
6. Automation
• Examples
– Task Routing / Worker Selection
– Adaptive Plurality, Decomposition
– Post-hoc: Calibration, Filtering & Aggregation (see the sketch after this slide)
• Separation of concerns / middleware
– Users specify their task, and the system handles QC
– Many users lack the interest, time, skill, or risk tolerance to manage low-level QC on their own
– Critical to widespread/enterprise adoption
– Accelerate field progress
• divide problem space for different groups to work on
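A minimal Python sketch of the post-hoc pipeline above: calibration against gold questions, filtering of unreliable workers, then majority-vote aggregation. The data layout, threshold, and function names are illustrative assumptions, not any particular platform's API.

```python
from collections import Counter, defaultdict

def aggregate(labels, min_accuracy=0.6, gold=None):
    """Post-hoc QC sketch: filter workers by accuracy on gold
    questions, then take a majority vote per task.

    labels: list of (worker_id, task_id, label) triples.
    gold:   optional dict task_id -> true label, used for filtering.
    """
    # Calibration: estimate each worker's accuracy on gold tasks.
    trusted = {w for w, _, _ in labels}
    if gold:
        hits, total = Counter(), Counter()
        for w, t, y in labels:
            if t in gold:
                total[w] += 1
                hits[w] += (y == gold[t])
        # Workers never seen on gold are excluded (a strict choice
        # for this sketch).
        trusted = {w for w in total if hits[w] / total[w] >= min_accuracy}

    # Filtering + aggregation: majority vote over trusted workers only.
    votes = defaultdict(Counter)
    for w, t, y in labels:
        if w in trusted:
            votes[t][y] += 1
    return {t: c.most_common(1)[0][0] for t, c in votes.items()}

# Toy usage: worker "c" fails the gold check and is filtered out.
data = [("a", "t1", "cat"), ("b", "t1", "cat"), ("c", "t1", "dog"),
        ("a", "g1", "cat"), ("b", "g1", "cat"), ("c", "g1", "dog")]
print(aggregate(data, gold={"g1": "cat"}))  # {'t1': 'cat', 'g1': 'cat'}
```

Majority vote is the simplest aggregation rule; weighted voting or EM-style consensus (see the sketch after slide 11) can be swapped in for the final step.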
7. Automation: Questions
• Who are the workers?
• What is the labor model?
• What are affordances of the platform?
• How does that drive subsequent setup?
• Appropriate inter-annotator agreement measures for crowdwork? (see the kappa sketch below)
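For reference on the agreement question above, a sketch of Cohen's kappa, the standard chance-corrected agreement measure for two fixed annotators; the toy labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

# Two annotators who agree 80% of the time, mostly on the majority class:
print(cohens_kappa(list("AAAABAAAAB"), list("AAAABAAABA")))  # 0.375
```

Here 80% raw agreement shrinks to kappa = 0.375 once chance agreement on the majority class is removed. Note that kappa assumes the same two annotators label every item, while crowd labels come from a shifting annotator pool, which is exactly why the slide asks what measures suit crowdwork.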
8. Lessons from Traditional Annotation
• Need clear, detailed guidelines
• Cannot predict all cases in advance
• Guidelines evolve during annotation
• Humans not merely better visual, audio sensors
– e.g. imprecise directions & unforeseen examples
• Crowdsourcing Questions
– How to handle examples for which current guidelines
are ambiguous, unclear, or insufficient?
– What role do annotators play?
– How to facilitate interaction?
9. Worker Organization
• How might we organize workers for effective QC?
• Do workers participate in high-level discussions (telecommuters) or act like automata (HPU)?
• What organizational patterns might be used?
– e.g. find-verify, find-fix-verify, qualify-work (sketched below)
• How do different organizational patterns interact
with automation and other QC factors?
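A sketch of the find-fix-verify pattern named above, from Bernstein et al.'s Soylent. Here `ask_crowd` is a hypothetical stand-in for a platform call that returns n worker responses; the stage names and toy data are assumptions for illustration only.

```python
def find_fix_verify(text, ask_crowd):
    """Find-Fix-Verify sketch: one crowd stage flags problem spans,
    a second proposes fixes, and a third votes on the proposals.

    ask_crowd(stage, payload, n) is a hypothetical stand-in for a
    platform call that returns n worker responses.
    """
    spans = ask_crowd("find", {"text": text}, n=5)
    patches = {}
    # Keep only spans independently flagged by at least two workers.
    for span in {s for s in spans if spans.count(s) >= 2}:
        fixes = ask_crowd("fix", {"text": text, "span": span}, n=3)
        votes = ask_crowd("verify", {"span": span, "fixes": fixes}, n=5)
        patches[span] = max(set(votes), key=votes.count)  # plurality winner
    return patches

# Deterministic toy crowd, for illustration only.
def toy_crowd(stage, payload, n):
    return {"find":   ["sentence 2"] * 3 + ["sentence 5"] * 2,
            "fix":    ["fix A", "fix A", "fix B"],
            "verify": ["fix A"] * 4 + ["fix B"]}[stage]

print(find_fix_verify("draft text ...", toy_crowd))
```

The pattern's value for QC is the separation of stages: independent workers flag problems, and verification votes filter out low-quality fixes before they reach the requester.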
10. Impact on Machine Learning: More
• Labeled data
• Uncertain data
• Diverse data
• Specific data
• Ongoing data
• Rapid data
• Hybrid systems
• On-demand evaluation
• Datasets & Benchmarks
• Tasks
11. Open Questions
• How do cheap, plentiful, rapid labels alter how we utilize supervised vs. semi-supervised vs. unsupervised methods?
– Revisit task-specific learning curves
• Mask uncertainty via QC or model, propagate, and expose?
• How do we handle noise in active learning?
• How to best utilize a 24/7 global crowd for lifelong, continuous, never-ending learning systems?
– Sample size vs. adaptation
• Can we develop a more formal, computational
understanding of Wisdom of Crowds?
– diversity, independence, decentralization, and aggregation
• Can we better connect consensus algorithms with more general feature-based and ensemble models? (see the EM sketch below)
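As one concrete bridge between consensus algorithms and model-based aggregation, a one-coin, Dawid-Skene-style EM sketch for binary labels; the data layout, clamping, and iteration count are assumptions for illustration.

```python
from collections import defaultdict

def em_consensus(labels, n_iter=20):
    """One-coin Dawid-Skene-style EM for binary crowd labels:
    alternate between per-task label posteriors and per-worker
    accuracies, so reliable workers outweigh a raw majority vote.

    labels: list of (worker_id, task_id, label) with label in {0, 1}.
    """
    by_task = defaultdict(list)
    for w, t, y in labels:
        by_task[t].append((w, y))

    # Initialize beliefs P(true label = 1) with per-task vote rates.
    p = {t: sum(y for _, y in ws) / len(ws) for t, ws in by_task.items()}

    for _ in range(n_iter):
        # M-step: worker accuracy = expected agreement with beliefs.
        num, den = defaultdict(float), defaultdict(float)
        for w, t, y in labels:
            num[w] += p[t] if y == 1 else (1.0 - p[t])
            den[w] += 1.0
        acc = {w: min(max(num[w] / den[w], 1e-3), 1.0 - 1e-3) for w in den}

        # E-step: posterior that each task's true label is 1 (uniform prior).
        for t, ws in by_task.items():
            like1 = like0 = 1.0
            for w, y in ws:
                like1 *= acc[w] if y == 1 else 1.0 - acc[w]
                like0 *= acc[w] if y == 0 else 1.0 - acc[w]
            p[t] = like1 / (like1 + like0)

    return {t: int(p[t] >= 0.5) for t in p}, acc

# Toy data: workers "a","b" usually agree; "c" often dissents.
votes = [("a", 1, 1), ("b", 1, 1), ("c", 1, 0),
         ("a", 2, 0), ("b", 2, 0), ("c", 2, 1),
         ("a", 3, 1), ("b", 3, 0), ("c", 3, 0)]
consensus, accuracy = em_consensus(votes)
print(consensus, accuracy)
```

Unlike a raw majority vote, the E-step weights each worker by an estimated accuracy; those per-worker estimates could in turn feed the more general feature-based or ensemble models the slide asks about.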
12. Other Issues
• Hybrid systems match human-level competence
– Achievable now at a certain time/cost tradeoff, which can be navigated as a function of context and need
• Diverse labeling particularly valuable when subjective
– Traditional in-house annotators are few and not diverse
• A middle way between traditional annotation and
automated proxy metrics
– e.g. translation quality & BLEU
– More rapid than traditional annotation, more accurate
than automated metrics
• Less data re-use risks less comparable evaluations
– Enduring value of community evaluations like TREC
13. Thank You!
ir.ischool.utexas.edu/crowd
Matt Lease
ml@ischool.utexas.edu
@mattlease
• Students
– Catherine Grady (iSchool)
– Hyunjoon Jung (ECE)
– Jorn Klinger (Linguistics)
– Adriana Kovashka (CS)
– Abhimanu Kumar (CS)
– Di Liu (iSchool)
– Hohyon Ryu (iSchool)
– William Tang (CS)
– Stephen Wolfson (iSchool)
• Omar Alonso, Microsoft Bing
• Support
– John P. Commons