Crowdsourcing: From Aggregation to Search Engine Evaluation


Talk given at University of Washington Information School, June 2, 2014.


  1. Statistical Crowdsourcing: From Aggregating Judgments to Search Engine Evaluation
     Matt Lease, School of Information, University of Texas at Austin (@mattlease)
  2. Undergraduate Mentors at UW
  3. Roadmap
     • What are Crowdsourcing & Human Computation? (slides 4-16)
       – A great research area for iSchools: something for everyone!
     • Benchmarking Statistical Consensus Methods (slides 18-26)
     • Psychometrics & Crowds for Relevance Judging (slides 28-35)
  4. Crowdsourcing
     • Jeff Howe. WIRED, June 2006
     • Rise of digital work & internet empowers a global workforce via open call solicitations
     • New application of principles from the open source movement
  5. (image slide)
  6. Amazon Mechanical Turk (MTurk)
     • Online marketplace for paid labor since 2005
     • On-demand, elastic, 24/7 global workforce
     • API integrates human labor with computation
  7. A New Scale of Labeled Data for AI
     Snow et al., EMNLP 2008
     • MTurk labels for 5 NLP tasks
     • 22K labels for only $26
     • While individual annotations are noisy, aggregated consensus labels show high agreement with expert labels (“gold”)
  8. AI + Human Computation = a new breed of hybrid intelligent systems
     PlateMate (Noronha et al., UIST’11)
  9. Social & Behavioral Sciences
     • A Guide to Behavioral Experiments on Mechanical Turk – W. Mason and S. Suri (2010). SSRN online.
     • Crowdsourcing for Human Subjects Research – L. Schmidt (CrowdConf 2010)
     • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk – Conley & Tosti-Kharas (2010). Academy of Management
     • Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
       – see also: Amazon Mechanical Turk Guide for Social Scientists
  10. August 12, 2012
  11. Ethics of Crowdsourcing?
      Paul Hyman. Communications of the ACM, Vol. 56, No. 8, pp. 19-21, August 2013.
  12. Who are the workers?
      • A. Baio, November 2008. The Faces of Mechanical Turk.
      • P. Ipeirotis. March 2010. The New Demographics of Mechanical Turk
      • J. Ross, et al. Who are the Crowdworkers? CHI 2010.
  13. (image slide)
  14. Safeguarding Participant Data
      • “What are the characteristics of MTurk workers?... the MTurk system is set up to strictly protect workers’ anonymity….”
  15. Amazon profile page URLs use the same IDs as used on MTurk!
      Lease et al., SSRN’13
  16. Crowdsourcing & the Law: Independent Contractors vs. Employees
      • Wolfson & Lease, ASIS&T’11
      • Some platforms classify online contributors as independent contractors (vs. employees)
      • While employment is legally defined (e.g., by the FLSA and past court decisions), the definition is leaky
      • It seems unlikely Congress will provide clarity
      • Class action litigation pending in the courts
  17. Roadmap
      • What are Crowdsourcing & Human Computation? (slides 4-16)
      • Benchmarking Statistical Consensus Methods (slides 18-26)
      • Psychometrics & Crowds for Relevance Judging (slides 28-35)
  18. Science of Measurement & Benchmarks
      • “If you cannot measure it, you cannot improve it.”
      • Drive field innovation via clear challenge tasks
        – e.g., David Tse’s FIST 2012 keynote (computational biology)
      • Many things we can learn:
        – What is the current state-of-the-art?
        – How do current methods compare?
        – What works, what doesn’t, and why?
        – How has the field progressed over time?
  19. Finding Consensus in Human Computation
      • For an objective labeling task, how do we resolve disagreement between responses?
      • Simple baseline: majority voting
      • Research pre-dates crowdsourcing
        – Dawid and Skene ’79; Smyth et al. ’95
      • One of the most studied problems in HCOMP
        – Laymen likely to err more than experts
        – Methods in many areas: ML, Vision, NLP, IR, DB, …
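The majority-voting baseline mentioned above takes one line of logic: pick each item's plurality label. A minimal Python sketch (the documents, labels, and function name here are invented for illustration):

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Resolve each item by plurality over its worker responses
    (ties broken by first-seen order in Counter)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Hypothetical worker responses for two documents
votes = {"doc1": ["relevant", "relevant", "not relevant"],
         "doc2": ["not relevant", "relevant", "not relevant"]}
print(majority_vote(votes))  # {'doc1': 'relevant', 'doc2': 'not relevant'}
```

Note that this weights every worker equally; the statistical methods benchmarked below instead learn per-worker reliability.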
  20. SQUARE: A Benchmark for Research on Computing Crowd Consensus
      HCOMP’13 (open source)
  21. Datasets
  22. Methods
      Include popular and/or open-source methods:
      • Majority Voting
      • Expectation-Maximization (Dawid-Skene, 1979)
      • Naïve Bayes (Snow et al., 2008)
      • GLAD (Whitehill et al., 2009)
      • ZenCrowd (Demartini et al., 2012)
      • Raykar et al. (2012)
      • CUBAM (Welinder et al., 2010)
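To give a feel for how the EM-based methods above differ from majority voting, here is a minimal sketch of Dawid-Skene-style EM. This is not the SQUARE implementation; the data, variable names, and smoothing constants are invented for illustration. It alternates between estimating item-label posteriors and per-worker confusion matrices:

```python
import numpy as np

def dawid_skene(worker_labels, n_classes, n_iter=50):
    """Minimal Dawid-Skene EM sketch: jointly estimates item-label
    posteriors, class priors, and one confusion matrix per worker."""
    items = sorted(worker_labels)
    workers = sorted({w for labs in worker_labels.values() for w in labs})
    # Initialize posteriors T from per-item vote fractions (soft majority vote)
    T = np.zeros((len(items), n_classes))
    for i, it in enumerate(items):
        for lab in worker_labels[it].values():
            T[i, lab] += 1.0
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class priors and row-normalized worker confusion matrices
        prior = T.mean(axis=0)
        conf = {w: np.full((n_classes, n_classes), 1e-6) for w in workers}
        for i, it in enumerate(items):
            for w, lab in worker_labels[it].items():
                conf[w][:, lab] += T[i]  # expected counts: true class -> observed label
        for w in workers:
            conf[w] /= conf[w].sum(axis=1, keepdims=True)
        # E-step: posterior over true labels given priors and confusions
        log_post = np.tile(np.log(prior + 1e-12), (len(items), 1))
        for i, it in enumerate(items):
            for w, lab in worker_labels[it].items():
                log_post[i] += np.log(conf[w][:, lab])
        T = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return items, T

# Toy data: item -> {worker -> label index}; w3 is noisier than w1/w2
toy_votes = {"a": {"w1": 0, "w2": 0, "w3": 0},
             "b": {"w1": 1, "w2": 1, "w3": 0},
             "c": {"w1": 1, "w2": 1, "w3": 1}}
item_ids, posteriors = dawid_skene(toy_votes, n_classes=2)
print({it: int(c) for it, c in zip(item_ids, posteriors.argmax(axis=1))})
```

Because the confusion matrices downweight unreliable workers, disagreements are resolved by estimated worker quality rather than raw vote counts.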
  23. Results: Unsupervised Accuracy
      [Chart: relative gain/loss vs. majority voting, roughly -15% to +15%, for methods DS, ZC, RY, GLAD, and CUBAM across datasets BM, HCB, SpamCF, WVSCM, WB, RTE, TEMP, WSD, AC2, HC, and ALL]
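The relative gain/loss plotted on this slide is just the accuracy difference normalized by the majority-voting baseline. A one-line sketch (the accuracy numbers here are made up, not taken from the benchmark):

```python
def relative_gain(acc_method, acc_majority):
    """Relative accuracy gain (positive) or loss (negative)
    of a consensus method vs. the majority-vote baseline."""
    return (acc_method - acc_majority) / acc_majority

# Hypothetical accuracies: method 0.88 vs. majority vote 0.80
print(f"{relative_gain(0.88, 0.80):+.1%}")  # +10.0%
```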
  24. Results: Varying Supervision
  25. Findings
      • Majority voting never best, but rarely much worse
      • No method performs far better than the others
      • Each method often best for some condition
        – e.g., the dataset the method was originally designed for
      • DS & RY tend to perform best (RY adds priors)
  26. Why Don’t We See Bigger Gains?
      • Of course, contributions aren’t just empirical…
      • Maybe gold is too noisy to detect improvement?
        – Cormack & Kolcz ’09; Klebanov & Beigman ’10
      • Might we see bigger differences from:
        – Different tasks/scenarios?
        – Better benchmark tests?
        – Different methods or tuning?
      • We invite community contributions!
  27. Roadmap
      • What are Crowdsourcing & Human Computation? (slides 4-16)
      • Benchmarking Statistical Consensus Methods (slides 18-26)
      • Psychometrics & Crowds for Relevance Judging (slides 28-35)
  28. Multidimensional Relevance Modeling via Psychometrics and Crowdsourcing
      Joint work with Yinglong Zhang, Jin Zhang, and Jacek Gwizdka
      Paper @ SIGIR 2014
  29. Background: Evaluating IR Systems
      • Classic Cranfield method (Cleverdon et al., 1966)
        – Given a document collection & set of queries
        – Judge documents for topical relevance to each query
        – Evaluate on these queries & documents
      • Problem: scaling manual data labeling is difficult
      • Idea: try crowdsourcing
        – Alonso et al. (SIGIR Forum 2008)
        – Grady & Lease, 2010
        – TREC 2011-2013 Crowdsourcing Track
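Once Cranfield-style relevance judgments exist, system evaluation reduces to scoring a ranked list against them. A minimal sketch using precision@k (the document IDs and judgments are hypothetical):

```python
def precision_at_k(ranked_docs, relevant, k):
    """Cranfield-style metric: fraction of the top-k retrieved
    documents that were judged relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

# Hypothetical judgments and system ranking for one query
relevant = {"d1", "d3"}
ranking = ["d1", "d2", "d3", "d4"]
print(precision_at_k(ranking, relevant, 2))  # 0.5
```

In practice, judgments for many queries are pooled and metrics averaged across queries; the scaling problem is that each new collection needs thousands of such judgments.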
  30. But the Problems Run Deeper
      • User relevance > simple topical relevance
        – The great divide in IR: systems-centered vs. user-centered
        – What other factors should we model, & what is their relative importance? Long history of studies, little consensus.
        – Dearth of labeled data for training/evaluating systems
      • Even trusted assessors often disagree on “simple” topical relevance judgments
        – Often attributed to subjectivity, but can we do better?
      • How do we ensure the quality of subjective data?
        – Largely unstudied in the HCOMP community to date
  31. Psychology to the Rescue!
      • A Guide to Behavioral Experiments on Mechanical Turk – W. Mason and S. Suri (2010). SSRN online.
      • Crowdsourcing for Human Subjects Research – L. Schmidt (CrowdConf 2010)
      • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk – Conley & Tosti-Kharas (2010). Academy of Management
      • Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
        – see also: Amazon Mechanical Turk Guide for Social Scientists
  32. August 12, 2012
  33. Key Ideas from Psychometrics
      • Use standard survey techniques for collecting multi-dimensional relevance judgments
        – Ask repeated, similar questions, & change polarity
      • Analyze data via Structural Equation Modeling (SEM)
        – A cousin to graphical models in statistics/AI
        – Posit questions associated with latent factors
        – Use Exploratory Factor Analysis to determine factors & question associations, then prune questions
        – Use Confirmatory Factor Analysis to assess correlations, test significance, and compare models
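The exploratory-factor-analysis step above can be illustrated with simulated survey data. This sketch is not the paper's model: the two latent factors (call them topicality and novelty), the loadings, and the reverse-keyed items are all invented. It recovers the number of factors via the common Kaiser criterion (eigenvalues of the item correlation matrix greater than 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two hypothetical latent relevance factors per respondent
topicality = rng.normal(size=n)
novelty = rng.normal(size=n)
# Six survey items: three load on each factor, plus response noise;
# one item per factor is reverse-keyed (changed polarity, as above)
X = np.column_stack([
    topicality + 0.3 * rng.normal(size=n),
    topicality + 0.3 * rng.normal(size=n),
    -topicality + 0.3 * rng.normal(size=n),  # reversed polarity
    novelty + 0.3 * rng.normal(size=n),
    novelty + 0.3 * rng.normal(size=n),
    -novelty + 0.3 * rng.normal(size=n),     # reversed polarity
])
# Eigenvalues of the item correlation matrix, largest first
R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]
n_factors = int(np.sum(eigvals > 1.0))  # Kaiser criterion
print(n_factors)  # 2
```

Confirmatory factor analysis then fixes this factor structure and tests how well it fits held-out responses; dedicated SEM tooling is used for that step in practice.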
  34. (image slide)
  35. Future Directions
      • Strong foundation for ongoing positivist research on alternative relevance factors
        – For different user groups, search scenarios, etc.
        – Need more data to support normative claims
      • Train/test operational systems for varying factors
      • Improve judging agreement by making the task more natural and/or assessing the impact of latent factors
      • Intra-subject vs. inter-subject aggregation?
      • SEM vs. graphical modeling?
      • Other methods for ensuring subjective data quality?
  36. The Future of Crowd Work, CSCW’13
      Kittur, Nickerson, Bernstein, Gerber, Shaw, Zimmerman, Lease, and Horton
  37. Thank You! Slides:
  38. (image slide)
  39. A Few Moral Dilemmas
      • A “fair” price for online work in a global economy?
        – Is it better to pay nothing (i.e., volunteers, gamification) rather than pay something small for valuable work?
      • Are we obligated to inform people how their participation / work products will be used?
        – If my IRB doesn’t require me to obtain informed consent, is there some other moral obligation to do so?
      • A worker finds his ID posted in a researcher’s online source code and asks that it be removed. This can’t be done without recreating the repo, which many people use. What should be done?
  40. Ethical Crowdsourcing
      • Assume researchers have good intentions, so issues of gross negligence are rare
        – Withholding promised pay after work is performed
        – Not obtaining or complying with IRB oversight
      • Instead, the great challenge is how to recognize our impacts and take appropriate actions in a complex world
        – Educating ourselves takes time & effort
        – Failing to educate ourselves could cause harm to others
      • How can we strike a reasonable balance between complete apathy vs. being overly alarmist?
  41. • Contribute to society and human well-being
      • Avoid harm to others
      • Be honest and trustworthy
      • Be fair and take action not to discriminate
      • Respect the privacy of others
      COMPLIANCE WITH THE CODE. As an ACM member I will:
      – Uphold and promote the principles of this Code
      – Treat violations of this Code as inconsistent with membership in the ACM
  42. CS2008 Curriculum Update (ACM, IEEE)
      There is reasonably wide agreement that this topic of legal, social, professional and ethical issues should feature in all computing degrees. …financial and economic imperatives …Which approaches are less expensive and is this sensible? With the advent of outsourcing and off-shoring these matters become more complex and take on new dimensions …there are often related ethical issues concerning exploitation… Such matters ought to feature in courses on legal, ethical and professional practice.
      If ethical considerations are covered only in the standalone course and not “in context,” it will reinforce the false notion that technical processes are void of ethical issues. Thus it is important that several traditional courses include modules that analyze ethical considerations in the context of the technical subject matter… It would be explicitly against the spirit of the recommendations to have only a standalone course.
  43. “Contribute to society and human well-being; avoid harm to others”
      • Do we have a moral obligation to try to ascertain the conditions under which work is performed? Or the impact we have upon those performing the work?
      • Do we feel differently when work is performed by:
        – Political refugees? Children? Prisoners? The disabled?
      • How do we know who is doing the work, or whether a decision to work (for a given price) is freely made?
        – Does it matter why someone accepts offered work?
  44. Some Notable Prior Research
      • Silberman, Irani, and Ross (2010)
        – “How should we… conceptualize the role of these people who we ask to power our computing?”
        – “abstraction hides detail” – some details may be worth keeping conspicuously present (Jessica Hullman)
      • Irani and Silberman (2013)
        – “…AMT helps employers see themselves as builders of innovative technologies, rather than employers unconcerned with working conditions.”
        – “…human computation currently relies on worker invisibility.”
      • Fort, Adda, and Cohen (2011)
        – “…opportunities for our community to deliberately value ethics above cost savings.”
  45. Power Asymmetry on MTurk
      • Mistakes happen, such as wrongly rejecting work
        – e.g., error by a new student, software bug, poor instructions, noisy gold, etc.
      • How do we balance the harm our mistakes cause workers (our liability) vs. our cost/effort of preventing such mistakes?
  46. Task Decomposition
      • By minimizing context, greater task efficiency & accuracy can often be achieved in practice
        – e.g., “Can you name who is in this photo?”
      • Much research on ways to streamline work and decompose complex tasks
  47. Context & Informed Consent
      • Assume we wish to obtain informed consent
      • Without context, consent cannot be informed
        – Zittrain, Ubiquitous human computing (2008)
  48. Consequences of Human Computation as a Panacea Where AI Falls Short
      • The Googler Who Looked at the Worst of the Internet
      • Policing the Web’s Lurid Precincts
      • Facebook content moderation
      • The dirty job of keeping Facebook clean
      • Even linguistic annotators report stress & nightmares from reading news articles!
  49. What About Freedom?
      • Crowdsourcing vision: empowering freedom
        – Work whenever you want, for whomever you want
      • Risk: people compelled to perform work
        – Chinese prisoners farming gold online
        – Digital sweatshops? Digital slaves?
        – We know relatively little today about work conditions
        – How might we monitor and mitigate the risk/growth of crowd work inflicting harm on at-risk populations?
        – Traction? Human Trafficking at MSR Summit’12
  50. Robert Sim, MSR Summit’12
  51. Join the conversation!
      Crowdwork-ethics, by Six Silberman: an informal, occasional blog for researchers interested in ethical issues in crowd work
  52. Additional References
      • Irani, Lilly C. The Ideological Work of Microwork. In preparation, draft available online.
      • Adda, Gilles, et al. Crowdsourcing for Language Resource Development: Critical Analysis of Amazon Mechanical Turk Overpowering Use. Proceedings of the 5th Language and Technology Conference (LTC), 2011.
      • Adda, Gilles, and Joseph J. Mariani. Economic, Legal and Ethical Analysis of Crowdsourcing for Speech Processing. 2013.
      • Harris, Christopher G., and Padmini Srinivasan. Crowdsourcing and Ethics. Security and Privacy in Social Networks, 67-83, 2013.
      • Harris, Christopher G. Dirty Deeds Done Dirt Cheap: A Darker Side to Crowdsourcing. IEEE 3rd Conference on Social Computing (SocialCom), 2011.
      • Horton, John J. The Condition of the Turking Class: Are Online Employers Fair and Honest? Economics Letters 111.1 (2011): 10-12.
  53. Additional References (2)
      • Bederson, B. B., & Quinn, A. J. Web Workers Unite! Addressing Challenges of Online Laborers. In CHI 2011 Human Computation Workshop, 97-106.
      • Bederson, B. B., & Quinn, A. J. Participation in Human Computation. In CHI 2011 Human Computation Workshop.
      • Felstiner, Alek. Working the Crowd: Employment and Labor Law in the Crowdsourcing Industry. Berkeley J. Employment & Labor Law 32.1 (2011).
      • Felstiner, Alek. Sweatshop or Paper Route?: Child Labor Laws and In-Game Work. CrowdConf (2010).
      • Larson, Martha. Toward Responsible and Sustainable Crowdsourcing. Blog post + slides from Dagstuhl, September 2013.
      • Vili Lehdonvirta and Paul Mezier. Identity and Self-Organization in Unstructured Work. Unpublished working paper, 16 October 2013.
      • Zittrain, Jonathan. Minds for Sale. YouTube.