Crowdsourcing Track Overview at TREC 2011

Presented Nov. 16, 2011 at the National Institute of Standards and Technology (NIST) Text REtrieval Conference (TREC). Track organized with Gabriella Kazai, with assistance from Hyun Joon Jung.

Transcript

  • 1. Overview of the TREC 2011 Crowdsourcing Track
    Organizers: Gabriella Kazai, Microsoft Research Cambridge; Matt Lease, University of Texas at Austin
  • 3. What is Crowdsourcing?
    • A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve some goal(s)
    • Enabled by internet connectivity
    • Many related concepts
      – Collective intelligence
      – Social computing
      – People services
      – Human computation
  • 4. Why Crowdsourcing? Potential…
    • Scalability (e.g. cost, time, effort)
      – e.g. scale to greater pool sizes
    • Quality (by getting more eyes on the data)
      – More diverse judgments
      – More accurate judgments ("wisdom of crowds")
    • And more!
      – New datasets, new tasks, interaction, on-demand evaluation, hybrid search systems
  • 5. Track Goals (for Year 1)
    • Promote IR community awareness of, investigation of, and experience with crowdsourcing mechanisms and methods
    • Improve understanding of best practices
    • Establish shared, reusable benchmarks
    • Assess state-of-the-art of the field
    • Attract experience from outside the IR community
  • 6. Crowdsourcing in 2011
    • AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
    • ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
    • Crowdsourcing Technologies for Language and Cognition Studies (July 27)
    • CHI-CHC: Crowdsourcing and Human Computation (May 8)
    • CIKM: BooksOnline (Oct. 24, "crowdsourcing … online books")
    • CrowdConf 2011 – 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
    • Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
    • EC: Workshop on Social Computing and User Generated Content (June 5)
    • ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
    • Interspeech: Crowdsourcing for Speech Processing (August)
    • NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
    • SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
    • TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
    • UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
    • WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
  • 7. Two Questions, Two Tasks
    • Task 1: Assessment (human factors)
      – How can we obtain quality relevance judgments from individual (crowd) participants?
    • Task 2: Aggregation (statistics)
      – How can we derive a quality relevance judgment from multiple (crowd) judgments?
  • 8. Task 1: Assessment (human factors)
    • Measurable outcomes & potential tradeoffs
      – Quality, time, cost, & effort
    • Many possible factors
      – Incentive structures
      – Interface design
      – Instructions / guidance
      – Interaction / feedback
      – Recruitment & retention
      – …
  • 9. Task 2: Aggregation (statistics)
    • "Wisdom of crowds" computing
    • Typical assumption: noisy input labels
      – But not always (cf. Yang et al., SIGIR'10)
    • Many statistical methods have been proposed
      – Common baseline: majority vote (see the sketch below)
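
    A minimal sketch of the majority-vote baseline named on this slide. The data layout (a dict from (topic, document) pairs to lists of binary worker votes) and the tie-handling policy are illustrative assumptions, not the track's submission format:

      def majority_vote(labels):
          """Return one aggregated binary label per (topic, doc) pair."""
          consensus = {}
          for pair, votes in labels.items():
              rel = sum(votes)               # count of "relevant" (1) votes
              nonrel = len(votes) - rel      # count of "non-relevant" (0) votes
              if rel == nonrel:
                  continue                   # ties left unresolved in this sketch
              consensus[pair] = 1 if rel > nonrel else 0
          return consensus

      # Example: three workers judged one topic-document pair (IDs are made up).
      print(majority_vote({("20542", "doc-001"): [1, 1, 0]}))
      # -> {('20542', 'doc-001'): 1}
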
  • 10. Crowdsourcing, Noise & Uncertainty
    Broadly two approaches:
    1. Alchemy: turn noisy data into gold
      – Once we have gold, we can go on training and evaluating as before (separation of concerns)
      – Assume we can mostly clean it up and ignore any remaining error (even gold is rarely 100% pure)
    2. Model & propagate uncertainty
      – Let it "spill over" into training and evaluation
  • 11. Test Collection: ClueWeb09 subset
    • Collection: 19K pages rendered by Waterloo
      – Task 1: teams judge (a subset)
      – Task 2: teams aggregate judgments we provide
    • Topics: taken from past MQ and RF tracks
    • Gold: roughly 3K prior NIST judgments
      – Remaining 16K pages have no "gold" judgments
  • 12. What to Predict?
    • Teams submit classification and/or ranking labels
      – Classification supports traditional absolute relevance judging
      – Rank labels support pair-wise preference or list-wise judging
    • Classification labels in [0,1]
      – Probability of relevance (assessor/system uncertainty)
      – Simple generalization of binary relevance
      – If probabilities submitted but no ranking, rank labels induced (see the sketch below)
    • Ranking as [1..N]
      – Task 1: rank 5 documents per set
        • Same worker had to label all 5 examples in a given set (challenge)
      – Task 2: rank all documents per topic
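
    A minimal sketch of inducing rank labels in 1..N from submitted classification probabilities, per the bullet above; breaking ties by insertion order is our simplification, not a stated track rule:

      def induce_ranking(probs):
          """probs: dict mapping document id -> probability of relevance in [0, 1].
          Returns a dict mapping document id -> rank (1 = most probably relevant)."""
          ordered = sorted(probs, key=probs.get, reverse=True)
          return {doc: rank for rank, doc in enumerate(ordered, start=1)}

      print(induce_ranking({"d1": 0.9, "d2": 0.2, "d3": 0.6}))
      # -> {'d1': 1, 'd3': 2, 'd2': 3}
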
  • 13. Metrics
    • Classification
      – Binary ground truth: P, R, Accuracy, Sensitivity, LogLoss
      – Probabilistic ground truth: KL, RMSE
    • Ranking
      – Mean Average Precision (MAP)
      – Normalized Discounted Cumulative Gain (NDCG) (see the sketch below)
        • Ternary NIST judgments conflated to binary
        • Could explore mapping [0,1] consensus to ternary categories
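
    A minimal sketch of binary-gain NDCG, consistent with conflating ternary NIST judgments to binary as noted above; the log2 discount and linear gain are standard choices, not taken from the track guidelines:

      import math

      def dcg(gains):
          return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

      def ndcg(ranked_gains):
          """ranked_gains: binary relevance of documents in submitted rank order."""
          ideal = sorted(ranked_gains, reverse=True)
          return dcg(ranked_gains) / dcg(ideal) if any(ideal) else 0.0

      print(round(ndcg([1, 0, 1, 1, 0]), 3))   # -> 0.906 for this toy ranking
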
  • 14. Classification Metrics

                              Prediction
                            Rel      Non-rel
    Ground      True        TP       FN
    Truth       False       FP       TN
  • 15. Classification Metrics (cont'd)
    • Classification – Binary ground truth (cont'd)
    • Classification – Probabilistic ground truth: Root Mean Squared Error (RMSE)
    • Notes (see the sketch below)
      – To avoid log(0) = infinity, replace 0 with 10^-15
      – Revision: compute average per-example LogLoss and KL so error does not grow with sample size (particularly with varying team coverage)
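
    A minimal sketch of the per-example metric computation described in the notes above: probabilities are clipped away from 0/1 before taking logs, and LogLoss/KL are averaged over examples so totals do not grow with coverage. Function and variable names are ours; y may be binary (0/1) or probabilistic:

      import math

      EPS = 1e-15

      def clip(p):
          return min(max(p, EPS), 1.0 - EPS)

      def avg_log_loss(y_true, y_pred):
          return -sum(y * math.log(clip(p)) + (1 - y) * math.log(clip(1 - p))
                      for y, p in zip(y_true, y_pred)) / len(y_true)

      def avg_kl(y_true, y_pred):
          # Binary KL(y || p) per example; terms effectively vanish after
          # clipping where y is exactly 0 or 1.
          def term(y, p):
              y, p = clip(y), clip(p)
              return y * math.log(y / p) + (1 - y) * math.log((1 - y) / (1 - p))
          return sum(term(y, p) for y, p in zip(y_true, y_pred)) / len(y_true)

      def rmse(y_true, y_pred):
          return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

      print(round(avg_log_loss([1, 0], [0.9, 0.2]), 3))   # -> 0.164
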
  • 16. Ground Truth: Three Versions
    • Gold: NIST judgments – only available for a subset of the test collection
    • Consensus: generated by aggregating team labels (automatic) – full coverage
    • Team-based (Task 2 only): use each team's labels as truth to evaluate all other teams
      – Inspect variance in team rankings over alternative ground truths
      – Coverage varies
    Three primary evaluation conditions:
    1. Over examples having gold labels (evaluate vs. gold labels)
    2. Over examples having gold labels (evaluate vs. consensus labels)
    3. Over all examples (evaluate vs. consensus labels)
  • 17. Consensus
    • Goal: infer a single consensus label from multiple input labels
    • Methodological goals: unbiased, transparent, simple
    • Method: simple average, rounded when metrics require (see the sketch below)
      – Task 2: input = example labels from each team
      – Task 1: input = per-example average of worker labels from each team
    • Details
      – Classification labels only; no rank fusion
      – Using primary runs only
      – Task 1: each team gets 1 vote regardless of worker count (prevent bias)
      – Exclude any examples where
        • only one team submitted a label (bias)
        • consensus would yield a tie (binary metrics only)
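
    A minimal sketch of the simple-average consensus described above. It assumes each (topic, document) pair maps to one label in [0, 1] per contributing team (for Task 1, already averaged over that team's workers); singleton and tied examples are excluded per the "Details" bullets. The data layout is our assumption:

      def consensus(team_labels):
          """team_labels: dict mapping (topic, doc) -> list of per-team labels in [0, 1].
          Returns (probabilistic, rounded_binary) consensus label dicts."""
          prob, binary = {}, {}
          for pair, labels in team_labels.items():
              if len(labels) < 2:
                  continue                   # only one team labeled it: excluded
              avg = sum(labels) / len(labels)
              prob[pair] = avg
              if avg != 0.5:                 # exact ties excluded from binary metrics
                  binary[pair] = 1 if avg > 0.5 else 0
          return prob, binary
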
  • 18. How good is consensus? Compare to gold.

    Task 1: 395 gold topic-document pairs
    Labels                      ACC    PRE    REC    SPE    LL     KL     RMSE
    Probabilistic Consensus     0.69   0.74   0.79   0.57   0.71   0.23   0.38
    Rounded Binary Consensus    0.80   0.87   0.85   0.66   6.85   3.14   0.45

    Task 2: 1000 gold topic-document pairs
    Labels                      ACC    PRE    REC    SPE    LL     KL     RMSE
    Probabilistic Consensus     0.62   0.73   0.60   0.50   0.65   0.19   0.47
    Rounded Binary Consensus    0.69   0.83   0.65   0.55   10.71  2.94   0.56

    Issue: need to consider proper scoring rules
  • 19. Task 1: Assessment (Judging)
  • 20. Task 1: Data
    • Option 1: Use Waterloo-rendered pages
      – Available as images, PDFs, and plain text (+ HTML)
      – Many page images fetched from CMU server
      – Protect workers from malicious scripting
    • Option 2: Use some other format
      – Any team creating some other format was asked to provide that data or a conversion tool to others
      – Avoid comparison based on different rendering
  • 21. Task 1: Data
    • Topics: 270 (240 development, 30 test)
    • Test effort: ~2200 topic-document pairs for each team to judge
      – Shared sets: judged by all teams
        • Test: 1655 topic-document pairs (331 sets) over 20 topics
      – Assigned sets: judged by a subset of teams
        • Test: 1545 topic-document pairs (309 sets) over 15 topics in total
        • ~500 assigned to each team (~30 rel, 20 non-rel, 450 unknown)
      – Split intended to let organizers measure any worker-training effects
        • Increased track complexity, decreased useful redundancy & gold …
    • Gold: 395 topic-document pairs for test
      – Made available to teams for cross-validation (not blind)
  • 22. Task 1: Cost & Sponsorship
    • Paid crowd labor is only one form of crowdsourcing
      – Other models: directed gaming, citizen science, virtual pay
      – Incentives: socialize with others, recognition, social good, learn, etc.
    • Nonetheless, paid models continue to dominate
      – e.g. Amazon Mechanical Turk (MTurk), CrowdFlower
    • Risk: cost of crowd labor being a barrier to track participation
    • Risk mitigation: sponsorship
      – CrowdFlower: $100 free credit to interested teams
      – Amazon: ~$300 reimbursement to teams using MTurk (expected)
  • 23. Task 1: Participants
    1. Beijing University of Posts and Telecommunications (BUPT) – CrowdFlower qualification, MTurk judging
    2. Delft University of Technology – Vuurens (TUD_DMIR): MTurk
    3. Delft University of Technology & University of Iowa (GeAnn): game, recruiting via CrowdFlower
    4. Glasgow – Terrier (uogTr): MTurk
    5. Microsoft (MSRC): MTurk
    6. RMIT University (RMIT): CrowdFlower
    7. University Carlos III of Madrid (uc3m): MTurk
    8. University of Waterloo (UWaterlooMDS): in-house judging
    5 used MTurk, 3 used CrowdFlower, 1 in-house
  • 24. Task 1: Evaluation method
    • Average per-worker performance
      – Average weighted by number of labels per worker (see the sketch below)
      – Primary evaluation includes rejected work
    • Additional metric: Coverage
      – What % of examples were labeled by the team?
    • Cost & time to be self-reported by teams
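
    A minimal sketch of per-team evaluation as a per-worker average weighted by each worker's label count, per the first bullet above; worker_scores holds any per-worker quality metric (e.g., accuracy vs. gold), and all names are ours:

      def weighted_team_score(worker_scores, worker_label_counts):
          """Both arguments are dicts keyed by worker id."""
          total = sum(worker_label_counts[w] for w in worker_scores)
          weighted = sum(worker_scores[w] * worker_label_counts[w] for w in worker_scores)
          return weighted / total

      print(weighted_team_score({"w1": 0.9, "w2": 0.6}, {"w1": 300, "w2": 100}))  # -> 0.825
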
  • 25. ¼ most productive workers do ¾ of the work

    Workers      # of labels    % of labels
    Top 25%      44917          76.77%
    Top 50%      53444          91.34%
    Top 75%      56558.5        96.66%
    Total        58510          100%
  • 26. Same worker, multiple teams
    [Bar chart: number of examples labeled per anonymized worker ID]

    # of teams a worker belongs to    # of workers    avg. # of examples
    1                                 947             56.21
    2                                 35              146.65
    3                                 2               72.25
  • 27. Task 2: Aggregation
  • 28. Task 2: Data
    • Input: judgments provided by organizers
      – 19,033 topic-document pairs
      – 89,624 binary judgments from 762 workers
    • Evaluation: average per-topic performance
    • Gold: 3275 labels
      – 2275 for training (1275 relevant, 1000 non-relevant)
        • Excluded from evaluation
      – 1000 for blind test (balanced 500/500)
  • 29. Task 2: Participants
    1. Beijing University of Posts and Telecommunications (BUPT)
    2. Delft University of Technology – Vuurens (TUD_DMIR)
    3. Delft University of Technology & University of Iowa (GeAnn)
    4. Glasgow – Terrier (uogTr)
    5. Glasgow – Zuccon (qirdcsuog)
    6. LingPipe
    7. Microsoft (MSRC)
    8. University Carlos III of Madrid (uc3m)
    9. University of Texas at Austin (UTAustin)
    10. University of Waterloo (UWaterlooMDS)
  • 30. Discussion
    • Consensus labels as ground truth
      – Consensus algorithm for label generation?
      – Probabilistic or rounded binary consensus labels?
    • Proper scoring rules
    • Changes for 2012?
      – Which document collection? Request NIST judging?
      – Drop the two-task format? Pre-suppose a crowdsourced solution?
      – Broaden sponsorship? Narrow scope?
      – Additional organizer?
      – Details
        • Focus on worker-training effects
        • Treatment of rejected work
  • 31. Conclusion
    • Interesting first year of the track
      – Some insights about what worked well and less well in track design
      – Participants will tell us about the methods they developed
      – More analysis still needed for evaluation
    • Track will run again in 2012
      – Help shape it with feedback (planning session, hallway, or email)
    • Acknowledgments
      – Hyun Joon Jung (UT Austin)
      – Mark Smucker (U Waterloo)
      – Ellen Voorhees & Ian Soboroff (NIST)
    • Sponsors
      – Amazon
      – CrowdFlower