The Search for Truth in Objective & Subjective Crowdsourcing

Talk at Carnegie Mellon University Crowdsourcing Lunch (March 4, 2015). See http://cmu-crowd.blogspot.com.

  1. The Search for Truth in Objective & Subjective Crowdsourcing
     Matt Lease
     School of Information, University of Texas at Austin
     ir.ischool.utexas.edu | @mattlease | ml@utexas.edu
  2. Roadmap
     • Two quick items
       – What’s an iSchool & why pursue graduate study there?
       – MTurk: anonymity & human subjects research
     • Finding Consensus for Objective Tasks
     • Subjective Relevance & Psychometrics
  3. “The place where people & technology meet” ~ Wobbrock et al., 2009
     www.ischools.org
  4. [image-only slide; no text]
  5. FYI: MTurk & Human Subjects Research
     • “What are the characteristics of MTurk workers?... the MTurk system is set up to strictly protect workers’ anonymity….”
  6. An MTurk worker’s ID is also their customer ID on Amazon. Public profile pages can link worker ID to name. (Lease et al., SSRN’13)
  7. Roadmap
     • Two quick items
       – What’s an iSchool & why pursue graduate study there?
       – MTurk: anonymity & human subjects research
     • Finding Consensus for Objective Tasks
     • Subjective Relevance & Psychometrics
  8. Finding Consensus in Human Computation
     • For an objective labeling task, how do we resolve disagreement between respondents?
       – e.g., majority voting, weighted voting
       – Contrast cases: subjective, polling, & ideation
     • Research pre-dates crowdsourcing (e.g., experts)
       – Dawid and Skene ’79, Smyth et al. ’95
     • One of the most studied problems in HCOMP
       – Quality control of crowd labeling via plurality
       – Methods in many areas: ML, Vision, NLP, IR, DB, …
       – With all the time & $$$ invested, what have we learned?
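
A minimal sketch (not from the talk) of the two baseline resolution rules named above, majority voting and weighted voting, where each worker’s vote is weighted by an estimated reliability; the worker IDs and weights below are made up:

    from collections import defaultdict

    def majority_vote(labels):
        """labels: list of (worker_id, label) pairs for one item."""
        counts = defaultdict(int)
        for _, label in labels:
            counts[label] += 1
        return max(counts, key=counts.get)

    def weighted_vote(labels, worker_weight):
        """worker_weight: dict of worker_id -> weight, e.g. an estimated accuracy."""
        scores = defaultdict(float)
        for worker, label in labels:
            scores[label] += worker_weight.get(worker, 1.0)
        return max(scores, key=scores.get)

    # Three hypothetical workers label one item
    votes = [("w1", "relevant"), ("w2", "relevant"), ("w3", "not relevant")]
    print(majority_vote(votes))                                       # relevant
    print(weighted_vote(votes, {"w1": 0.9, "w2": 0.6, "w3": 0.95}))   # relevant (1.5 vs 0.95)
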
  9. Value of Benchmarking
     • “If you cannot measure it, you cannot improve it.”
     • Drive field innovation by clear challenge tasks
       – e.g., David Tse’s FIST 2012 Keynote (Comp. Biology)
     • Tackling important questions
       – What is the current state-of-the-art?
       – How do current methods compare?
       – What works, what doesn’t, and why?
       – How has the field progressed over time?
 10. SQUARE: A Benchmark for Research on Computing Crowd Consensus
     @HCOMP’13
     ir.ischool.utexas.edu/square (open source)
 11. “Real” Crowdsourcing Datasets
 12. How does the crowd behave?
 13. Methods
     Includes popular and/or open-source methods
     • Task / Model / Supervision / Estimation & sparsity
     • Task-independent
       – Majority Voting
       – ZenCrowd (Demartini et al., 2012), EM-based
       – GLAD (Whitehill et al., 2009)
     • Classification-specific (confusion matrices)
       – Snow et al., 2008, Naïve Bayes
       – Dawid & Skene (1979), EM-based
       – Raykar et al. (2012)
       – CUBAM (Welinder et al., 2010)
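
Several of the methods above are EM-based, following Dawid & Skene (1979). The sketch below is a minimal binary-label version of that EM loop (illustrative only, not the SQUARE implementation): it alternates between estimating each item’s label posterior and re-estimating each worker’s confusion matrix.

    import numpy as np

    def dawid_skene_binary(votes, n_items, n_workers, n_iter=50):
        """votes: list of (item, worker, label) triples with label in {0, 1}.
        Returns the posterior probability that each item's true label is 1."""
        # Initialize item posteriors with a soft majority vote
        sums, counts = np.zeros(n_items), np.zeros(n_items)
        for i, w, l in votes:
            sums[i] += l
            counts[i] += 1
        post = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)

        for _ in range(n_iter):
            # M-step: class prior and per-worker confusion matrices (Laplace-smoothed)
            prior1 = np.clip(post.mean(), 1e-6, 1 - 1e-6)
            conf = np.ones((n_workers, 2, 2))       # conf[worker, true, observed]
            for i, w, l in votes:
                conf[w, 1, l] += post[i]
                conf[w, 0, l] += 1.0 - post[i]
            conf /= conf.sum(axis=2, keepdims=True)

            # E-step: recompute item posteriors given worker reliabilities
            log1 = np.full(n_items, np.log(prior1))
            log0 = np.full(n_items, np.log(1.0 - prior1))
            for i, w, l in votes:
                log1[i] += np.log(conf[w, 1, l])
                log0[i] += np.log(conf[w, 0, l])
            post = 1.0 / (1.0 + np.exp(log0 - log1))
        return post

    # Toy usage: 3 workers, 2 items; worker 2 disagrees on item 1
    votes = [(0, 0, 1), (0, 1, 1), (0, 2, 1), (1, 0, 0), (1, 1, 0), (1, 2, 1)]
    print(dawid_skene_binary(votes, n_items=2, n_workers=3))
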
 14. Results: Unsupervised Accuracy
     Relative effectiveness vs. majority voting
     [bar chart: accuracy of DS, ZC, RY, GLAD, and CUBAM relative to majority voting, from -15% to +15%, across datasets BM, HCB, SpamCF, WVSCM, WB, RTE, TEMP, WSD, AC2, HC, and ALL]
 15. Results: Varying Supervision
 16. Findings
     • Majority voting never best, but rarely much worse
     • No method performs far better than others
     • Each method often best for some condition
       – e.g., the original dataset the method was designed for
     • DS & RY tend to perform best (RY adds priors)
       – ZC (also EM-based) does well with injected noise
 17. Provocative: So Where’s the Progress?
     • Sure, progress is not only empirical, but…
     • Maybe gold is too noisy to detect improvement?
       – Cormack & Kolcz ’09, Klebanov & Beigman ’10
     • Might we see bigger differences from
       – Different tasks/scenarios? Larger data scales?
       – Better methods or tuning? Better benchmark tests?
       – Spammer detection and filtering?
     • We invite community contributions!
 18. Roadmap
     • Two quick items
       – What’s an iSchool & why pursue graduate study there?
       – MTurk: anonymity & human subjects research
     • Finding Consensus for Objective Tasks
     • Subjective Relevance & Psychometrics
 19. Multidimensional Relevance Modeling via Psychometrics and Crowdsourcing
     Joint work with Yinglong Zhang, Jin Zhang, Jacek Gwizdka
     Paper @ SIGIR 2014
 20. How to Evaluate a Search Engine?
     • 3 complementary approaches (with tradeoffs)
       – Log analysis (“big data”): e.g., infer relevance from clicks
       – User study: users perform controlled search task(s)
       – Annotate: 1) create a set of queries, 2) label document relevance to each, & 3) measure algorithmic effectiveness
         • Cranfield (Cleverdon et al., 1966), simplified topical relevance
     • Examples from Google
       – Video: How Google makes improvements to its search
       – Video: How does Google use human raters in web search?
       – Search Quality Rating Guidelines (November 2, 2012)
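
To make the “annotate, then measure effectiveness” step concrete, here is a small illustrative scorer (not from the talk) that evaluates one ranked result list against a set of relevance judgments using Precision@k and Average Precision; the judgments and run below are hypothetical:

    def precision_at_k(ranking, relevant, k):
        """ranking: doc ids in ranked order; relevant: set of relevant doc ids."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def average_precision(ranking, relevant):
        hits, score = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                score += hits / rank
        return score / max(len(relevant), 1)

    qrels = {"d2", "d5", "d9"}                # hypothetical relevant documents for one query
    run = ["d5", "d1", "d2", "d7", "d9"]      # hypothetical system ranking
    print(precision_at_k(run, qrels, 3))      # 2/3
    print(average_precision(run, qrels))      # (1/1 + 2/3 + 3/5) / 3, about 0.756
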
 21. Saracevic’s 1997 Salton Award address
     “…the human-centered side was often highly critical of the systems side for ignoring users... [when] results have implications for systems design & practice. Unfortunately… beyond suggestions, concrete design solutions were not delivered.
     “…the systems side by and large ignores the user side and user studies… the stance is ‘tell us what to do and we will.’ But nobody is telling...
     “Thus, there are not many interactions…”
 22. RQs: Information Retrieval
     • What is relevance?
       – What factors constitute it? Can we quantify their relative importance? How do they interact?
     • Old question, many studies, little agreement
     • Significance
       – Increase fundamental understanding of relevance
       – Foster multi-dimensional evaluation of IR systems
       – Bridge human & system-centered relevance modeling
         • Create multi-dimensional judgment data for training & eval
         • Motivate research to automatically infer underlying factors
 23. RQs: Crowdsourcing Subjective Tasks
     • How can we measure/ensure the quality of subjective judgments (especially online)?
       – Traditional, trusted personnel often disagree in judging even simplified topical relevance
       – How to distinguish valid subjectivity vs. human error?
     • Significance
       – Promote systematic study of quality assurance for subjective tasks in the HCOMP community
       – Help explain/reduce observed labeling disagreements
 24. Why Eytan Adar hates MTurk Research (CHI 2011 CHC Workshop)
     • Missing/ignoring prior work in other disciplines
       – It turns out other fields have thought (a lot) about a number of problems that show up in HCOMP!
     • And other stuff (fun read…)
 25. Social Sciences have been…
     • …collecting reliable, subjective data from online participants before “crowdsourcing” was coined
     • …inferring latent factors and relationships from noisy, observed data using powerful modeling techniques that are positivist and data-driven
     • …using MTurk to reproduce many traditional behavioral studies with university students
     Maybe we can learn something from them?
 26. Psychology to the Rescue!
     • A Guide to Behavioral Experiments on Mechanical Turk
       – W. Mason and S. Suri (2010). SSRN online.
     • Crowdsourcing for Human Subjects Research
       – L. Schmidt (CrowdConf 2010)
     • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk
       – Conley & Tosti-Kharas (2010). Academy of Management
     • Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?
       – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
       – See also: Amazon Mechanical Turk Guide for Social Scientists
 27. Key Ideas from Psychometrics
     • Use established survey techniques to collect subjective relevance judgments
       – Ask repeated, similar questions, & change polarity
     • Analyze via Structural Equation Modeling (SEM)
       – Cousin to graphical models in statistics/AI
       – Posit questions associated with latent factors
       – Use Exploratory Factor Analysis (EFA) to assess question-factor relationships & prune “bad” questions
       – Use Confirmatory Factor Analysis (CFA) to assess correlations, test significance, & compare models
 28. Collecting multi-dimensional relevance judgments
     • Participant picks one of several pre-defined topics
       – “You want to plan a one-week vacation in China”
     • Participant is assigned a Web page to judge
       – We wrote a query for each topic, submitted it to a popular search engine, and did stratified sampling of the results
     • Participant answers a set of Likert-scale questions
       – “I think the information in this page is incorrect”
       – “It’s difficult to understand the information in this page”
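
A minimal sketch of the stratified sampling mentioned above, assuming contiguous rank strata over one query’s ranked results; the strata count, sample sizes, and URLs are illustrative, not the study’s actual design:

    import random

    def stratified_sample(ranked_results, n_strata=5, per_stratum=2, seed=0):
        """Split a ranked result list into contiguous rank strata and sample a few
        pages from each, so judged pages span both high- and low-ranked results."""
        rng = random.Random(seed)
        stratum_size = max(1, len(ranked_results) // n_strata)
        sample = []
        for s in range(n_strata):
            stratum = ranked_results[s * stratum_size:(s + 1) * stratum_size]
            if stratum:
                sample.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
        return sample

    urls = ["https://example.com/result%d" % r for r in range(1, 51)]  # hypothetical results
    print(stratified_sample(urls))
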
 29. How do we ask the questions?
     • Ask 3+ questions per hypothesized dimension
       – Ask repeated, similar questions, & change polarity
       – Randomize question order (don’t group questions)
       – Over-generate questions to allow for later pruning
       – Exclude participants failing self-consistency checks
     • Survey design principles: tailor, engage, QA
       – Use clear, familiar, non-leading wording
       – Balance response scale and question polarity
       – Pre-test survey in-house, then pilot study online
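
One way to implement the self-consistency check above (an illustrative sketch, not the paper’s exact rule): reverse-code the negated item in each positive/negative question pair, then flag participants whose paired answers disagree by more than a tolerance.

    def reverse_code(score, scale_max=5):
        """Flip a response on a 1..scale_max Likert scale to the opposite polarity."""
        return scale_max + 1 - score

    def self_consistent(responses, paired_items, max_gap=2, scale_max=5):
        """responses: dict of question_id -> Likert score for one participant.
        paired_items: (positive_item, negated_item) pairs asking about the same thing."""
        for pos_q, neg_q in paired_items:
            if abs(responses[pos_q] - reverse_code(responses[neg_q], scale_max)) > max_gap:
                return False
        return True

    # Hypothetical pair: "information is correct" vs. "information is incorrect"
    pairs = [("rel_correct", "rel_incorrect")]
    print(self_consistent({"rel_correct": 5, "rel_incorrect": 1}, pairs))  # True
    print(self_consistent({"rel_correct": 5, "rel_incorrect": 5}, pairs))  # False
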
 30. What Questions might we ask?
     • What factors might determine relevance?
     • We adopt the same 5 factors from (Xu & Chen, 2006)
       – Topicality, reliability, novelty, understandability, & scope
       – Choose the same to make revised mechanics & any difference in findings maximally clear
     • Assume factors are incomplete & imperfect
       – Positivist approach: do these factors explain observed data better than other alternatives: uni-dimensional relevance or another set of factors?
 31. Structural Equation Modeling (SEM)
     • Based on Sewall Wright’s path analysis (1921)
       – A factor model is parameterized by factor loadings, covariances, & residual error terms
     • Graphical representation: path diagram
       – Observed variables in boxes
       – Latent variables in ovals
       – Directed edges denote causal relationships
       – Residual error terms implicitly assumed
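
To make that parameterization concrete: in such a factor model the observed question scores x are modeled as x = Λf + ε, so the model-implied covariance is Σ = Λ Φ Λ^T + Θ, where Λ holds the factor loadings, Φ the factor covariances, and Θ the residual error variances. A small numeric sketch with made-up values (fitting an SEM adjusts these parameters so Σ matches the sample covariance):

    import numpy as np

    # Hypothetical model: 4 observed questions loading on 2 correlated latent factors
    Lambda = np.array([[0.8, 0.0],   # q1 <- factor 1
                       [0.7, 0.0],   # q2 <- factor 1
                       [0.0, 0.9],   # q3 <- factor 2
                       [0.0, 0.6]])  # q4 <- factor 2
    Phi = np.array([[1.0, 0.4],      # factor covariance matrix (factors correlate 0.4)
                    [0.4, 1.0]])
    Theta = np.diag([0.36, 0.51, 0.19, 0.64])  # residual error variances

    # Model-implied covariance of the observed variables
    Sigma = Lambda @ Phi @ Lambda.T + Theta
    print(np.round(Sigma, 3))
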
 32. Exploratory Factor Analysis (EFA) – 1 of 2
     • Is the sample large enough for EFA?
       – Kaiser-Meyer-Olkin (KMO) Measure of Adequacy
       – Bartlett’s Test of Sphericity
     • Principal Axis Factoring (PAF) to find eigenvalues
       – Assume some large, constant # of latent factors
       – Assume each factor has a connecting edge to each question
       – Estimate factor model parameters by least-squares fit
     • Prune factors via Parallel Analysis
       – Create random data with same # of factors & questions
       – Create correlation matrix and find eigenvalues
 33. Exploratory Factor Analysis (EFA) – 2 of 2
     • Perform Parallel Analysis
       – Create random data w/ same # of factors & questions
       – Create correlation matrix and find eigenvalues
     • Create Scree Plot of Eigenvalues
     • Re-run EFA for reduced factors
     • Compute Pearson correlations
     • Discard questions with:
       – Weak factor loading
       – Strong cross-factor loading
       – Lack of logical interpretation
     • Kenny’s Rule: need >= 2 questions per factor for EFA
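
A minimal sketch of the parallel-analysis step above (a simple eigenvalue-based variant on made-up data): compare the eigenvalues of the observed correlation matrix against eigenvalues from random data of the same shape, and retain only factors that beat chance.

    import numpy as np

    def parallel_analysis(responses, n_sims=100, seed=0):
        """responses: (n_participants, n_questions) matrix of survey scores.
        Returns a suggested number of factors to retain."""
        rng = np.random.default_rng(seed)
        n, p = responses.shape
        observed = np.sort(np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False)))[::-1]

        random_eigs = np.zeros((n_sims, p))
        for s in range(n_sims):
            fake = rng.normal(size=(n, p))   # random data with the same shape
            random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(fake, rowvar=False)))[::-1]
        threshold = random_eigs.mean(axis=0)  # mean random eigenvalues (a 95th percentile also works)

        return int(np.sum(observed > threshold))  # keep factors whose eigenvalues exceed chance

    # Toy usage: 200 hypothetical participants answering 15 questions on a 1-5 scale
    data = np.random.default_rng(1).integers(1, 6, size=(200, 15)).astype(float)
    print(parallel_analysis(data))
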
 34. Question-Factor Loadings (Weights)
 35. CFA: Assess and Compare Models
     • First-order baseline model uses a single latent factor to explain observed data
     • Posited hierarchical factor model uses 5 relevance dimensions
 36. Confirmatory Factor Analysis (CFA)
     • Null model assumes observations are independent
       – Covariance between questions fixed at 0; means & covariances left free
     • Comparison stats
       – Non-Normed Fit Index (NNFI)
       – Comparative Fit Index (CFI)
       – Root Mean Square Error of Approximation (RMSEA)
       – Standardized Root Mean Square Residual (SRMR)
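
For reference, standard forms of several of these fit statistics compare the fitted model’s chi-square against the independence (null) model; the sketch below uses hypothetical chi-square values and shows one common form of RMSEA:

    import math

    def cfi(chi2_m, df_m, chi2_null, df_null):
        """Comparative Fit Index."""
        d_m = max(chi2_m - df_m, 0.0)
        d_null = max(chi2_null - df_null, 0.0)
        return 1.0 - d_m / max(d_null, d_m, 1e-12)

    def nnfi(chi2_m, df_m, chi2_null, df_null):
        """Non-Normed Fit Index (also known as the Tucker-Lewis Index)."""
        return ((chi2_null / df_null) - (chi2_m / df_m)) / ((chi2_null / df_null) - 1.0)

    def rmsea(chi2_m, df_m, n):
        """Root Mean Square Error of Approximation for sample size n."""
        return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

    # Hypothetical fitted model vs. independence (null) model, n = 300 participants
    print(round(cfi(220.0, 180, 2500.0, 210), 3))    # closer to 1.0 is better
    print(round(nnfi(220.0, 180, 2500.0, 210), 3))
    print(round(rmsea(220.0, 180, 300), 3))          # roughly 0.06 or below is commonly taken as good fit
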
 37. Contributions
     • Simple, reliable, scalable way to collect diverse (subjective), multi-dimensional judgments from online participants
       – Online survey techniques from psychometrics
       – Doesn’t require objective task, gold labels, or N+ judges
       – Helps distinguish subjectivity vs. error
     • Describe a rigorous, positivist, data-driven framework for inferring & modeling multi-dimensional relevance
       – Structural equation modeling (SEM) from psychometrics
       – Run the experiment & let the data speak for itself
     • Implemented in standard R libraries, data available online
 38. Future Directions
     • More data-driven positivist research into factors
       – Different user groups, search scenarios, devices, etc.
       – Need more data to support normative claims
     • Train/test operational systems for varying factors
       – Identify/extend detected features for each dimension
       – Personalize search results for individual preferences
     • Improve agreement by making the task more natural and/or analyzing latent factors where disagreement remains
     • Intra-subject vs. inter-subject aggregation?
       – Other methods for ensuring subjective data quality?
     • SEM vs. graphical models?
 39. Thank You!
     ir.ischool.utexas.edu
     Slides: www.slideshare.net/mattlease
