Crowdsourcing for Information Retrieval: From Statistics to Ethics


Revised October 27, 2013. Talk at UC Berkeley (October 21, 2013), Syracuse University (October 28, 2013).


  1. 1. Crowdsourcing for Information Retrieval: From Statistics to Ethics Matt Lease School of Information University of Texas at Austin @mattlease
  2. 2. Roadmap • Scalability Challenges in IR Evaluation (brief) • Benchmarking Statistical Consensus Methods • Task Routing via Matrix Factorization • Toward Ethical Crowdsourcing Matt Lease <> 2
  3. 3. Roadmap • Scalability Challenges in IR Evaluation (brief) • Benchmarking Statistical Consensus Methods • Task Routing via Matrix Factorization • Toward Ethical Crowdsourcing Matt Lease <> 3
  4. 4. Why Evaluation at Scale? • Evaluation should closely mirror real use conditions • The best algorithm at small scale may not be best at larger scales – Banko and Brill (2001) – Halevy et al. (2009) • IR systems should be evaluated on the scale of data which users will search in practice Matt Lease <> 4
  5. 5. Why is Evaluation at Scale Hard? • Multiple ways to evaluate; consider Cranfield – Given a document collection and set of user queries – Label documents for relevance to each query – Evaluate search algorithms on these queries & documents • Labeling data is slow/expensive/difficult • Approach 1: label less data (e.g. active learning) – Pooling, metrics robust to sparse data (e.g., BPref) – Measure only relative performance (e.g., statAP, MTC) • Approach 2: label data more efficiently – Crowdsourcing (e.g., Amazon’s Mechanical Turk) Matt Lease <> 5
  6. 6. 6
  7. 7. Crowdsourcing for IR Evaluation • Origin: Alonso et al. (SIGIR Forum 2008) – Continuing active area of research • Primary concern: ensuring reliable data – Reliable data provides the foundation for evaluation – If QA is inefficient, its overhead could erase any savings – Common strategy: ask multiple people to judge relevance, then aggregate their answers (consensus) Matt Lease <> 7
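The consensus strategy on this slide, collecting several judgments per query-document pair and aggregating them, can be sketched minimally as majority voting; the function and data names here are illustrative, not from the talk:

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Aggregate workers' relevance judgments per item by majority vote.

    labels_by_item: dict mapping item id -> list of labels from workers.
    Ties are broken arbitrarily by Counter ordering, so an odd number of
    judgments per item is preferable in practice.
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Hypothetical judgments for two query-document pairs
judgments = {
    "q1-d7": ["relevant", "relevant", "nonrelevant"],
    "q1-d9": ["nonrelevant", "nonrelevant", "nonrelevant"],
}
consensus = majority_vote(judgments)
```

The statistical consensus methods benchmarked later in the talk replace this simple vote with models of worker reliability.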
  8. 8. Roadmap • Scalability Challenges in Evaluating IR Systems • Benchmarking Statistical Consensus Methods • Task Routing via Matrix Factorization • Toward Ethical Crowdsourcing Matt Lease <> 8
  9. 9. SQUARE: A Benchmark for Research on Computing Crowd Consensus Aashish Sheshadri and M. Lease, HCOMP’13 (open source) Matt Lease <> 9
  10. 10. Background • How do we resolve disagreement among multiple people's answers to arrive at a consensus? • Simple baseline: majority voting • Long history pre-dating crowdsourcing – Dawid and Skene '79; Smyth et al. '95 – Recent focus on quality assurance with crowds • Many more methods; active research topic – Across many areas: ML, Vision, NLP, IR, DB, … Matt Lease <> 10
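A minimal sketch of EM in the spirit of the Dawid & Skene (1979) model just cited, specialized to binary labels. All names are mine; real implementations add multi-class confusion matrices, priors, and convergence checks:

```python
import numpy as np

def dawid_skene_binary(votes, n_items, n_workers, n_iters=50):
    """EM consensus: votes is a list of (item, worker, label) triples with
    label in {0, 1}; returns posterior P(true label = 1) per item."""
    # Initialize soft labels from per-item vote fractions (soft majority vote)
    n_votes = np.zeros(n_items)
    n_ones = np.zeros(n_items)
    for i, w, l in votes:
        n_votes[i] += 1
        n_ones[i] += l
    q = n_ones / np.maximum(n_votes, 1)

    for _ in range(n_iters):
        # M-step: class prior and per-worker reliability rates
        pi = np.clip(q.mean(), 1e-6, 1 - 1e-6)
        num_a = np.zeros(n_workers); den_a = np.zeros(n_workers)
        num_b = np.zeros(n_workers); den_b = np.zeros(n_workers)
        for i, w, l in votes:
            den_a[w] += q[i]
            num_a[w] += q[i] * l              # worker says 1 when truth is 1
            den_b[w] += 1 - q[i]
            num_b[w] += (1 - q[i]) * (1 - l)  # worker says 0 when truth is 0
        a = (num_a + 1) / (den_a + 2)  # Laplace smoothing avoids log(0)
        b = (num_b + 1) / (den_b + 2)
        # E-step: recompute item posteriors given worker parameters
        log1 = np.full(n_items, np.log(pi))
        log0 = np.full(n_items, np.log(1 - pi))
        for i, w, l in votes:
            log1[i] += np.log(a[w]) if l else np.log(1 - a[w])
            log0[i] += np.log(1 - b[w]) if l else np.log(b[w])
        q = 1.0 / (1.0 + np.exp(log0 - log1))
    return q
```

Unlike majority voting, the model can discount (or even invert) a consistently wrong worker's votes, since each worker's reliability is estimated jointly with the labels.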
  11. 11. Why Benchmark? • Drive field innovation by clear challenge tasks – e.g., David Tse’s FIST 2012 Keynote (Comp. Biology) • Many other things we can learn – How do methods compare? • Qualitatively & quantitatively? – What is the state-of-the-art today? – What works, what doesn’t, and why? • Where is further research most needed? – How has field progressed over time? Matt Lease <> 11
  12. 12. Consensus Methods Compared (Method = Model + Training + Inference; ordered from simplest to most complex)
      | Method | Model | Pros | Cons |
      | MV | — | Simple, fast, no training; task-independent | Most limited model; cannot be supervised; no confusion matrix |
      | ZC (Demartini '12) | Worker reliability parameters | Task-independent; can be supervised; allows priors on worker reliability & class distribution | No confusion matrix |
      | GLAD (Whitehill et al. '09) | Worker reliability & task difficulty parameters | Task-independent; can be supervised; prior on class distribution | No worker priors; classification only; space prop. to num. classes |
      | Naïve Bayes (NB) (Snow et al. '08) | = DS model, fully supervised | Supports multi-class tasks; models worker confusion; simple maximum likelihood | No worker priors; classification only; space prop. to num. classes |
      | DS (Dawid & Skene '79) | Class priors & worker confusion matrices | Supports multi-class tasks; models worker confusion; unsupervised, semi-supervised, or fully supervised | No worker priors; classification only; space prop. to num. classes |
      | RY (Raykar et al. '10) | Worker confusion, sensitivity, specificity; (optional) automatic classifier | Classifier not required; priors on worker confusion and class distribution; multi-class support; can be supervised | Automatic classifier requires feature representation; classification only |
      | CUBAM (Welinder et al. '10) | Worker reliability and confusion; annotation noise; task difficulty | Detailed model of the annotation process; can identify worker clusters; multi-class support | Complex, with many hyper-parameters; unclear how to supervise |
  13. 13. 13
  14. 14. Results: Unsupervised Accuracy [Figure: relative gain/loss vs. majority voting, roughly -15% to +15%, for DS, ZC, RY, GLAD, and CUBAM across datasets BM, HCB, SpamCF, WVSCM, WB, RTE, TEMP, WSD, AC2, HC, and ALL] 14
  15. 15. Results: Varying Supervision Matt Lease <> 15
  16. 16. Findings • Majority voting never best, rarely much worse • Each method often best for some condition – e.g., the dataset it was originally designed for • DS & RY tend to perform best (RY adds priors) • No method performs far beyond the others – Of course, contributions aren't just empirical… Matt Lease <> 16
  17. 17. Why Don’t We See Bigger Gains? • Gold is too noisy to detect improvement? – Cormack & Kolcz’09, Klebanov & Beigman’10 • Limited tasks / scenarios considered? – e.g., we exclude hybrid methods & worker filtering • Might we see greater differences from – Better benchmark tests? – Better tuning of methods? – Additional methods? • We invite community contributions! Matt Lease <> 17
  18. 18. Roadmap • Scalability Challenges in Evaluating IR Systems • Benchmarking Statistical Consensus Methods • Task Routing via Matrix Factorization • Toward Ethical Crowdsourcing Matt Lease <> 18
  19. 19. Crowdsourced Task Routing via Matrix Factorization HyunJoon Jung and M. Lease arXiv 1310.5142, under review Matt Lease <>
  20. 20. Matt Lease <> 20
  21. 21. Task Routing: Background • Selection vs. recommendation vs. assignment – Potential to improve work quality & satisfaction – task search time has latency & is uncompensated – Tradeoffs in push vs. pull, varying models • Many matching criteria one could consider – Preferences, Experience, Skills, Job constraints, … • References – Law and von Ahn, 2011 (Ch. 4) – Chilton et al., 2010 • MTurk “free” selection constrained by search interface Matt Lease <> 21
  22. 22. Matrix Factorization Approach • Collaborative filtering-based recommendation • Intuition: workers achieve similar accuracy on similar tasks – Notion is more general: e.g. preference, expertise, etc. [Figure: per-task worker-example matrices of binary responses are accumulated into a comprehensive worker-task accuracy matrix with missing entries; a worker-task relational model is tabularized, MF infers the missing values, and the best-predicted workers are selected for a target task] Matt Lease <> 22
  23. 23. Matrix Factorization • Automatically induce latent features – Task-independent • Popular due to robustness to sparsity – SVD sensitive to matrix density; PMF much more robust • Model: M workers, N tasks (M >> N), D latent dimensions: R ≈ WᵀT with worker features W ∈ R^{D×M} and task features T ∈ R^{D×N}, so R_ij ≈ W_iᵀ T_j = Σ_k W_ik T_jk (cf. rating of user i for movie j) Matt Lease <> 23
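The factorization R ≈ WᵀT on this slide can be sketched with plain SGD on the observed entries. This is an illustrative stand-in, not the paper's implementation (which compares PMF and SVD), and the hyperparameters are arbitrary:

```python
import numpy as np

def factorize(R, mask, d=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Factor a partially observed task-by-worker accuracy matrix R ~ T'W
    via SGD with L2 regularization, predicting the missing entries.
    mask[i, j] = 1 where R[i, j] was observed."""
    rng = np.random.default_rng(seed)
    n_tasks, n_workers = R.shape
    T = rng.normal(0, 0.1, (d, n_tasks))    # latent task features
    W = rng.normal(0, 0.1, (d, n_workers))  # latent worker features
    observed = list(zip(*np.nonzero(mask)))
    for _ in range(epochs):
        for i, j in observed:
            err = R[i, j] - T[:, i] @ W[:, j]
            Ti = T[:, i].copy()  # update both factors from the same point
            T[:, i] += lr * (err * W[:, j] - reg * T[:, i])
            W[:, j] += lr * (err * Ti - reg * W[:, j])
    return T.T @ W  # full predicted tasks-by-workers matrix
```

Routing then amounts to sorting the predicted column of a target task and picking the top-k workers.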
  24. 24. Datasets • 3 MTurk text tasks • Simulated data 24
  25. 25. Baselines • Random assignment – no accuracy prediction; just for task routing • Simple average – Average worker’s accuracies across past tasks • Weighted average – weight each task in average by similarity to target task • task similarity must be estimated from data Matt Lease <> 25
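The two non-random baselines above can be sketched as follows; the function names and the choice to ignore non-positive similarities are my assumptions, not the paper's:

```python
def predict_accuracy(history, sims, target):
    """Predict one worker's accuracy on a target task from past tasks.

    history: dict task -> that worker's accuracy on the task.
    sims: dict task -> estimated similarity(task, target) in [-1, 1].
    Returns (simple_average, weighted_average) predictions.
    """
    past = {t: a for t, a in history.items() if t != target}
    simple = sum(past.values()) / len(past)
    # Weight each past task by its similarity to the target task,
    # dropping tasks with non-positive similarity (an assumed cutoff).
    pos = {t: s for t, s in sims.items() if t in past and s > 0}
    weighted = (sum(past[t] * s for t, s in pos.items()) / sum(pos.values())
                if pos else simple)
    return simple, weighted
```

With uniform similarities the weighted average reduces to the simple average, which is why the baselines mainly differ when tasks are heterogeneous.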
  26. 26. Estimating Task Similarity • Define by Pearson correlation over per-task accuracies of workers who perform both – Ignore any workers doing only one of the tasks Matt Lease <> 26
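A minimal sketch of this similarity estimate; the fallback value of 0.0 for task pairs sharing fewer than two workers is my assumption:

```python
from math import sqrt

def task_similarity(acc_a, acc_b):
    """Pearson correlation of per-worker accuracies between two tasks,
    computed only over workers who performed both tasks.

    acc_a, acc_b: dicts mapping worker id -> accuracy on that task.
    """
    shared = sorted(set(acc_a) & set(acc_b))
    if len(shared) < 2:
        return 0.0  # correlation undefined; assumed fallback
    xs = [acc_a[w] for w in shared]
    ys = [acc_b[w] for w in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0
```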
  27. 27. Results: RMSE & Mean Accuracy (MTurk data) [Figures: averages over tasks for k = 1 to 20 selected workers; per-task & average results at k = 10 workers] Matt Lease <> 27
  28. 28. Findings • How does MF prediction accuracy vary given task similarity, matrix size, & matrix density? – Feasible; PMF beats SVD; more data = better… • MF task routing vs. baselines? – Much better than random; baselines fine in the most sparse conditions; MF improves beyond that Matt Lease <> 28
  29. 29. Open Questions • Other ways to infer task similarity (e.g. textual) • Under “Big Data” conditions? • When integrating target task observations? • How to better model crowd & spam? • How to address live task routing challenges? Matt Lease <> 29
  30. 30. Roadmap • Scalability Challenges in Evaluating IR Systems • Benchmarking Statistical Consensus Methods • Task Routing via Matrix Factorization • Toward Ethical Crowdsourcing Matt Lease <> 30
  31. 31. A Few Moral Dilemmas • A “fair” price for online work in a global economy? – Is it better to pay nothing (i.e., volunteers, gamification) rather than pay something small for valuable work? • Are we obligated to inform people how their participation / work products will be used? – If my IRB doesn’t require me to obtain informed consent, is there some other moral obligation to do so? • A worker finds his ID posted in a researcher’s online source code and asks that it be removed. This can’t be done without recreating the repo, which many people use. What should be done? Matt Lease <> 31
  32. 32. Mechanical Turk is Not Anonymous Matthew Lease, Jessica Hullman, Jeffrey P. Bigham, Michael S. Bernstein, Juho Kim, Walter S. Lasecki, Saeideh Bakhshi, Tanushree Mitra, and Robert C. Miller. Online: Social Science Research Network, March 6, 2013
  33. 33. Amazon profile page URLs use the same IDs as used on MTurk How do we respond when we learn we've exposed people to risk? 33
  34. 34. Ethical Crowdsourcing • Assume researchers have good intentions, and so issues of gross negligence are rare – Withholding promised pay after work is performed – Not obtaining or complying with IRB oversight • Instead, the greater challenge is how to recognize our impacts and take appropriate actions in a complex world – Educating ourselves takes time & effort – Failing to educate ourselves could cause harm to others • How can we strike a reasonable balance between complete apathy and being overly alarmist? Matt Lease <> 34
  35. 35. CACM August, 2013 Paul Hyman. Communications of the ACM, Vol. 56 No. 8, Pages 19-21, August 2013. Matt Lease <> 35
  36. 36. • • • • • Contribute to society and human well-being Avoid harm to others Be honest and trustworthy Be fair and take action not to discriminate Respect the privacy of others COMPLIANCE WITH THE CODE. As an ACM member I will – Uphold and promote the principles of this Code – Treat violations of this code as inconsistent with membership in the ACM Matt Lease <> 36
  37. 37. CS2008 Curriculum Update (ACM, IEEE) There is reasonably wide agreement that this topic of legal, social, professional and ethical issues should feature in all computing degrees. …financial and economic imperatives …Which approaches are less expensive and is this sensible? With the advent of outsourcing and off-shoring these matters become more complex and take on new dimensions …there are often related ethical issues concerning exploitation… Such matters ought to feature in courses on legal, ethical and professional practice. If ethical considerations are covered only in the standalone course and not “in context,” it will reinforce the false notion that technical processes are void of ethical issues. Thus it is important that several traditional courses include modules that analyze ethical considerations in the context of the technical subject matter… It would be explicitly against the spirit of the recommendations to have only a standalone course. Matt Lease <> 37
  38. 38. “Contribute to society and human well-being; avoid harm to others” • Do we have a moral obligation to try to ascertain conditions under which work is performed? Or the impact we have upon those performing the work? • Do we feel differently when work is performed by – Political refugees? Children? Prisoners? Disabled? • How do we know who is doing the work, or if a decision to work (for a given price) is freely made? – Does it matter why someone accepts offered work? Matt Lease <> 38
  39. 39. Matt Lease <> 39
  40. 40. Who are the workers? • A. Baio, November 2008. The Faces of Mechanical Turk. • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk • J. Ross, et al. Who are the Crowdworkers? CHI 2010. Matt Lease <> 40
  41. 41. Some Notable Prior Research • Silberman, Irani, and Ross (2010) – “How should we… conceptualize the role of these people who we ask to power our computing?” – “abstraction hides detail” – some details may be worth keeping conspicuously present (Jessica Hullman) • Irani and Silberman (2013) – “…AMT helps employers see themselves as builders of innovative technologies, rather than employers unconcerned with working conditions.” – “…human computation currently relies on worker invisibility.” • Fort, Adda, and Cohen (2011) – “…opportunities for our community to deliberately value ethics above cost savings.” 41
  42. 42. Power Asymmetry on MTurk • Mistakes happen, such as wrongly rejecting work – e.g., error by new student, software bug, poor instructions, noisy gold, etc. • How do we balance the harm caused by our mistakes to workers (our liability) vs. our cost/effort of preventing such mistakes? Matt Lease <> 42
  43. 43. Task Decomposition By minimizing context, greater task efficiency & accuracy can often be achieved in practice – e.g. “Can you name who is in this photo?” • Much research on ways to streamline work and decompose complex tasks Matt Lease <> 43
  44. 44. Context & Informed Consent • Assume we wish to obtain informed consent • Without context, consent cannot be informed – Zittrain, Ubiquitous human computing (2008) 44
  45. 45. Independent Contractors vs. Employees • Wolfson & Lease, ASIS&T’11 • Many platforms classify workers as independent contractors (piece-work, not hourly) – Legislators/courts must ultimately decide • Different work classifications yield different legal rights/protections & responsibilities – Domestic vs. international workers – Employment taxes – Litigation can both cause or redress harm • Law aside, to what extent do moral principles underlying current laws apply to online work? Matt Lease <> 45
  46. 46. Consequences of Human Computation as a Panacea where AI Falls Short • • • • The Googler who Looked at the Worst of the Internet Policing the Web’s Lurid Precincts Facebook content moderation The dirty job of keeping Facebook clean • Even linguistic annotators report stress & nightmares from reading news articles! Matt Lease <> 46
  47. 47. What about Freedom? • Crowdsourcing vision: empowering freedom – work whenever you want for whomever you want • Risk: people compelled to perform work – Chinese prisoners farming gold online – Digital sweat shops? Digital slaves? – We know relatively little today about work conditions – How might we monitor and mitigate risk/growth of crowd work inflicting harm to at-risk populations? – Traction? Human Trafficking at MSR Summit’12 Matt Lease <> 47
  48. 48. Robert Sim, MSR Summit’12 Matt Lease <> 48
  49. 49. Join the conversation! Crowdwork-ethics, by Six Silberman an informal, occasional blog for researchers interested in ethical issues in crowd work Matt Lease <> 49
  50. 50. The Future of Crowd Work, CSCW’13 Kittur, Nickerson, Bernstein, Gerber, Shaw, Zimmerman, Lease, and Horton Matt Lease <> 50
  51. 51. Additional References • Irani, Lilly C. The Ideological Work of Microwork. In preparation, draft available online. • Adda, Gilles, et al. Crowdsourcing for Language Resource Development: Critical Analysis of Amazon Mechanical Turk Overpowering Use. Proceedings of the 5th Language and Technology Conference (LTC). 2011. • Adda, Gilles, and Joseph J. Mariani. Economic, Legal and Ethical Analysis of Crowdsourcing for Speech Processing. 2013. • Harris, Christopher G., and Padmini Srinivasan. Crowdsourcing and Ethics. Security and Privacy in Social Networks. 67-83. 2013. • Harris, Christopher G. Dirty Deeds Done Dirt Cheap: A Darker Side to Crowdsourcing. IEEE 3rd Conference on Social Computing (SocialCom). 2011. • Horton, John J. The Condition of the Turking Class: Are Online Employers Fair and Honest? Economics Letters 111.1 (2011): 10-12. Matt Lease <> 51
  52. 52. Additional References (2) • Bederson, B. B., & Quinn, A. J. Web Workers Unite! Addressing Challenges of Online Laborers. In CHI 2011 Human Computation Workshop, 97-106. • Bederson, B. B., & Quinn, A. J. Participation in Human Computation. In CHI 2011 Human Computation Workshop. • Felstiner, Alek. Working the Crowd: Employment and Labor Law in the Crowdsourcing Industry. Berkeley J. Employment & Labor Law 32.1. 2011. • Felstiner, Alek. Sweatshop or Paper Route?: Child Labor Laws and In-Game Work. CrowdConf (2010). • Larson, Martha. Toward Responsible and Sustainable Crowdsourcing. Blog post + slides from Dagstuhl, September 2013. • Vili Lehdonvirta and Paul Mezier. Identity and Self-Organization in Unstructured Work. Unpublished working paper. 16 October 2013. • Zittrain, Jonathan. Minds for Sale. YouTube. Matt Lease <> 52
  53. 53. Thank You! See also: SIAM’13 Tutorial Slides: Matt Lease <> 53