Overview of the TREC 2011
      Crowdsourcing Track



                  Organizers:
Gabriella Kazai, Microsoft Research Cambridge
  Matt Lease, University of Texas at Austin
Nov. 16, 2011   TREC 2011 Crowdsourcing Track   2
What is Crowdsourcing?
• A collection of mechanisms and associated
  methodologies for scaling and directing crowd
  activities to achieve some goal(s)
• Enabled by Internet connectivity
• Many related concepts
       – Collective intelligence
       – Social computing
       – People services
       – Human computation
Nov. 16, 2011            TREC 2011 Crowdsourcing Track   3
Why Crowdsourcing? Potential…
• Scalability (e.g. cost, time, effort)
       – e.g. scale to greater pool sizes
• Quality (by getting more eyes on the data)
       – More diverse judgments
       – More accurate judgments (“wisdom of crowds”)
• And more!
       – New datasets, new tasks, interaction, on-demand
         evaluation, hybrid search systems

Nov. 16, 2011             TREC 2011 Crowdsourcing Track    4
Track Goals (for Year 1)
• Promote IR community awareness
  of, investigation of, and experience with
  crowdsourcing mechanisms and methods
• Improve understanding of best practices
• Establish shared, reusable benchmarks
• Assess state-of-the-art of the field
• Attract experience from outside IR community

Nov. 16, 2011          TREC 2011 Crowdsourcing Track   5
Crowdsourcing in 2011
•   AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
•   ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
•   Crowdsourcing Technologies for Language and Cognition Studies (July 27)
•   CHI-CHC: Crowdsourcing and Human Computation (May 8)
•   CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
•   CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
•   Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
•   EC: Workshop on Social Computing and User Generated Content (June 5)
•   ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
•   Interspeech: Crowdsourcing for speech processing (August)
•   NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
•   SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
•   TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
•   UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
•   WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
    Nov. 16, 2011                     TREC 2011 Crowdsourcing Track                           6
Two Questions, Two Tasks
• Task 1: Assessment (human factors)
       – How can we obtain quality relevance judgments
         from individual (crowd) participants?


• Task 2: Aggregation (statistics)
       – How can we derive a quality relevance judgment
         from multiple (crowd) judgments?




Nov. 16, 2011           TREC 2011 Crowdsourcing Track     7
Task 1: Assessment (human factors)
• Measurable outcomes & potential tradeoffs
       – Quality, time, cost, & effort
• Many possible factors
       – Incentive structures
       – Interface design
       – Instructions / guidance
       – Interaction / feedback
       – Recruitment & retention
       –…
Nov. 16, 2011            TREC 2011 Crowdsourcing Track   8
Task 2: Aggregation (statistics)
• “Wisdom of crowds” computing
• Typical assumption: noisy input labels
       – But not always (cf. Yang et al., SIGIR’10)
• Many statistical methods have been proposed
       – Common baseline: majority vote




Nov. 16, 2011             TREC 2011 Crowdsourcing Track   9
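To make the majority-vote baseline above concrete, here is a minimal sketch in Python; the judgment tuple format (topic, document, worker, binary label) and the tie-breaking rule are illustrative assumptions, not the track's submission format.

```python
from collections import defaultdict

def majority_vote(judgments):
    """Aggregate binary crowd labels by simple majority vote.

    judgments: iterable of (topic_id, doc_id, worker_id, label) with label in {0, 1}.
    Returns {(topic_id, doc_id): consensus_label}.
    Ties default to relevant (1) -- an assumption for this sketch.
    """
    votes = defaultdict(list)
    for topic_id, doc_id, _worker_id, label in judgments:
        votes[(topic_id, doc_id)].append(label)

    consensus = {}
    for key, labels in votes.items():
        consensus[key] = 1 if sum(labels) * 2 >= len(labels) else 0
    return consensus

# Example: three workers label one topic-document pair.
print(majority_vote([("t1", "d1", "w1", 1), ("t1", "d1", "w2", 0), ("t1", "d1", "w3", 1)]))
# {('t1', 'd1'): 1}
```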
Crowdsourcing, Noise & Uncertainty
Broadly two approaches
1. Alchemy: turn noisy data into gold
       – Once we have gold, we can go on training and
         evaluating as before (separation of concerns)
       – Assume we can mostly clean it up and ignore any
         remaining error (even gold is rarely 100% pure)
2. Model & propagate uncertainty
       – Let it “spill over” into training and evaluation

Nov. 16, 2011             TREC 2011 Crowdsourcing Track     10
Test Collection: ClueWeb09 subset
• Collection: 19K pages rendered by Waterloo
       – Task 1: teams judge (a subset)
       – Task 2: teams aggregate judgments we provide
• Topics: taken from past Million Query (MQ) and Relevance Feedback (RF) tracks
• Gold: Roughly 3K prior NIST judgments
       – Remaining 16K pages have no “gold” judgments




Nov. 16, 2011          TREC 2011 Crowdsourcing Track    11
What to Predict?
• Teams submit classification and/or ranking labels
       – Classification supports traditional absolute relevance judging
       – Rank labels support pair-wise preference or list-wise judging
• Classification labels in [0,1]
       – Probability of relevance (assessor/system uncertainty)
       – Simple generalization of binary relevance
       – If probabilities submitted but no ranking, rank labels induced
• Ranking as [1..N]
       – Task 1: rank 5 documents per set
                • Same worker had to label all 5 examples in a given set (challenge)
       – Task 2: rank all documents per topic
Nov. 16, 2011                        TREC 2011 Crowdsourcing Track                     12
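A small sketch of how rank labels can be induced from submitted probabilities, as described above; the tie-breaking beyond the probability value itself is an assumption, since the track's exact rule is not stated on the slide.

```python
def induce_ranking(prob_labels):
    """Induce rank labels [1..N] from classification labels in [0, 1].

    prob_labels: {doc_id: probability_of_relevance} for one topic (or one set).
    Returns {doc_id: rank}, where rank 1 is the document most likely relevant.
    Ties are broken by doc_id -- an assumption for this sketch.
    """
    ordered = sorted(prob_labels.items(), key=lambda kv: (-kv[1], kv[0]))
    return {doc_id: rank for rank, (doc_id, _p) in enumerate(ordered, start=1)}

print(induce_ranking({"d1": 0.9, "d2": 0.4, "d3": 0.9}))
# {'d1': 1, 'd3': 2, 'd2': 3}
```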
Metrics
• Classification
   – Binary ground truth: P, R, Accuracy, Specificity, LogLoss
   – Probabilistic ground truth: KL, RMSE

• Ranking
   – Mean Average Precision (MAP)
   – Normalized Discounted Cumulative Gain (NDCG)
                • Ternary NIST judgments conflated to binary
                • Could explore mapping [0,1] consensus to ternary categories



Nov. 16, 2011                     TREC 2011 Crowdsourcing Track             13
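As a concrete reference for the ranking metrics, here is a sketch of average precision over one ranked list with binary relevance (MAP is the mean of this over topics); conflating graded NIST labels by treating any positive grade as relevant is an assumed convention consistent with the slide, not a statement of the track's exact mapping.

```python
def conflate_to_binary(grade):
    """Map a ternary NIST grade (e.g. 0, 1, 2) to binary relevance.
    Treating any positive grade as relevant is an assumed convention."""
    return 1 if grade > 0 else 0

def average_precision(ranked_doc_ids, binary_rel):
    """Average precision for one topic.
    ranked_doc_ids: documents in ranked order; binary_rel: {doc_id: 0/1}."""
    hits, precisions = 0, []
    for i, doc_id in enumerate(ranked_doc_ids, start=1):
        if binary_rel.get(doc_id, 0):
            hits += 1
            precisions.append(hits / i)
    total_rel = sum(binary_rel.values())
    return sum(precisions) / total_rel if total_rel else 0.0

rels = {"d1": conflate_to_binary(2), "d2": conflate_to_binary(0), "d3": conflate_to_binary(1)}
print(average_precision(["d1", "d2", "d3"], rels))  # (1/1 + 2/3) / 2 ~= 0.833
```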
Classification Metrics

                                     Prediction
                                Rel            Non-rel
   Ground Truth     True        TP             FN
                    False       FP             TN

Nov. 16, 2011     TREC 2011 Crowdsourcing Track                              14
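For reference, the standard definitions of these binary classification metrics in terms of the confusion matrix above (the usual conventions; the slide itself shows only the matrix):

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} &
\text{Recall (Sensitivity)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} &
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align}
```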
Classification Metrics (cont’d)
• Classification – Binary ground truth (cont’d)


• Classification – Probabilistic ground truth


                Root Mean Squared Error (RMSE)
• Notes
       – To avoid log(0) = infinity, replace 0 with 10^-15
       – Revision: compute average per-example logloss and KL so error does
         not grow with sample size (particularly with varying team coverage)


Nov. 16, 2011                    TREC 2011 Crowdsourcing Track                 15
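The metric formulas on this slide did not survive text extraction; below is a sketch of the standard per-example definitions consistent with the metric names and the averaging revision noted above. Here $p_i$ is the submitted probability of relevance and $g_i$ the ground-truth label (or probability) for example $i$ of $N$; these are assumed standard forms, not necessarily the slide's exact notation.

```latex
\begin{align}
\text{LogLoss} &= -\frac{1}{N}\sum_{i=1}^{N}\Big[\,g_i \log p_i + (1 - g_i)\log(1 - p_i)\Big] \\
\text{KL} &= \frac{1}{N}\sum_{i=1}^{N}\Big[\,g_i \log\frac{g_i}{p_i} + (1 - g_i)\log\frac{1 - g_i}{1 - p_i}\Big] \\
\text{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(g_i - p_i\big)^2}
\end{align}
```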
Ground Truth: Three Versions

• Gold: NIST Judgments
       – only available for a subset of the test collection
• Consensus: generated by aggregating team labels (automatic)
       – full coverage
• Team-based (Task 2 only)
       – use each team’s labels as truth to evaluate all other teams
       – Inspect variance in team rankings over alternative ground truths
       – Coverage varies

Three primary evaluation conditions
1. Over examples having gold labels (evaluate vs. gold labels)
2. Over examples having gold labels (evaluate vs. consensus labels)
3. Over all examples (evaluate vs. consensus labels)

Nov. 16, 2011                     TREC 2011 Crowdsourcing Track             16
Consensus
• Goal: Infer single consensus label from multiple input labels
• Methodological Goals: unbiased, transparent, simple
• Method: simple average, rounded when metrics require
       – Task 2: input = example labels from each team
       – Task 1: input = per-example average of worker labels from each team
• Details
       –   Classification labels only; no rank fusion
       –   Using primary runs only
       –   Task 1: each team gets 1 vote regardless of worker count (prevent bias)
       –   Exclude any examples where
                • only one team submitted a label (bias)
                • consensus would yield a tie (binary metrics only)

Nov. 16, 2011                         TREC 2011 Crowdsourcing Track             17
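A minimal sketch of the averaging scheme described above, using a hypothetical per-team label layout; the exclusion rules (single-team examples, ties under binary rounding) follow the bullets, while the data structures are assumptions for illustration.

```python
def consensus_labels(team_labels, rounded=False):
    """Average per-team classification labels in [0, 1] into consensus labels.

    team_labels: {(topic_id, doc_id): {team_id: label}}, where each label is the
    team's example label (Task 2) or the per-example average of its workers'
    labels (Task 1), so every team gets exactly one vote.
    Examples with only one contributing team are excluded; when a rounded
    binary consensus is requested, exact 0.5 averages (ties) are excluded too.
    """
    consensus = {}
    for key, labels in team_labels.items():
        if len(labels) < 2:          # only one team labeled this example: skip (bias)
            continue
        avg = sum(labels.values()) / len(labels)
        if rounded:
            if avg == 0.5:           # tie under binary rounding: skip
                continue
            consensus[key] = int(avg > 0.5)
        else:
            consensus[key] = avg
    return consensus

labels = {("t1", "d1"): {"A": 1.0, "B": 0.0, "C": 1.0}, ("t1", "d2"): {"A": 1.0}}
print(consensus_labels(labels))                # {('t1', 'd1'): 0.666...}
print(consensus_labels(labels, rounded=True))  # {('t1', 'd1'): 1}
```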
How good is consensus? Compare to gold.

 Task 1: 395 gold topic-document pairs

   Labels                       ACC     PRE     REC     SPE     LL      KL      RMSE
   Probabilistic Consensus      0.69    0.74    0.79    0.57    0.71    0.23    0.38
   Rounded Binary Consensus     0.80    0.87    0.85    0.66    6.85    3.14    0.45

 Task 2: 1000 gold topic-document pairs

   Labels                       ACC     PRE     REC     SPE     LL      KL      RMSE
   Probabilistic Consensus      0.62    0.73    0.60    0.50    0.65    0.19    0.47
   Rounded Binary Consensus     0.69    0.83    0.65    0.55    10.71   2.94    0.56

 Issue: need to consider proper scoring rules
Nov. 16, 2011                 TREC 2011 Crowdsourcing Track                     18
Task 1: Assessment
     (Judging)
Task 1: Data
• Option 1: Use Waterloo rendered pages
       – Available as images, PDFs, and plain text (+html)
       – Many page images fetched from CMU server
       – Protect workers from malicious scripting
• Option 2: Use some other format
       – Any team creating some other format was asked
         to provide that data or conversion tool to others
       – Avoid comparison based on different rendering

Nov. 16, 2011            TREC 2011 Crowdsourcing Track       20
Task 1: Data
• Topics: 270 (240 development, 30 test)
• Test Effort: ~2200 topic-document pairs for each team to judge
       – Shared sets: judged by all teams
                • Test: 1655 topic-document pairs (331 sets) over 20 topics
       – Assigned sets: judged by a subset of teams
                • Test: 1545 topic-document pairs (309 sets) over 15 topics in total
                • ~ 500 assigned to each team (~ 30 rel, 20 non-rel, 450 unknown)
       – Split intended to let organizers measure any worker-training effects
                • Increased track complexity, decreased useful redundancy & gold …

• Gold: 395 topic-document pairs for test
       – made available to teams for cross-validation (not blind)

Nov. 16, 2011                           TREC 2011 Crowdsourcing Track                  21
Task 1: Cost & Sponsorship
• Paid crowd labor only one form of crowdsourcing
       – Other models: directed gaming, citizen science, virtual pay
       – Incentives: socialize with others, recognition, social good, learn, etc.

• Nonetheless, paid models continue to dominate
       – e.g. Amazon Mechanical Turk (MTurk), CrowdFlower

• Risk: cost of crowd labor being barrier to track participation
• Risk Mitigation: sponsorship
       – CrowdFlower: $100 free credit to interested teams
       – Amazon: ~ $300 reimbursement to teams using MTurk (expected)
Nov. 16, 2011                    TREC 2011 Crowdsourcing Track                      22
Task 1: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
      –         CrowdFlower qualification, MTurk judging
2. Delft University of Technology – Vuurens (TUD_DMIR): MTurk
3. Delft University of Technology & University of Iowa (GeAnn)
      –         Game, recruit via CrowdFlower
4.     Glasgow – Terrier (uogTr): MTurk
5.     Microsoft (MSRC): MTurk
6.     RMIT University (RMIT): CrowdFlower
7.     University Carlos III of Madrid (uc3m): MTurk
8.     University of Waterloo (UWaterlooMDS): in-house judging

5 used MTurk, 3 used CrowdFlower, 1 in-house

Nov. 16, 2011                        TREC 2011 Crowdsourcing Track   23
Task 1: Evaluation method
• Average per-worker performance
       – Average weighted by number of labels per worker
       – Primary evaluation includes rejected work


• Additional metric: Coverage
       – What % of examples were labeled by the team?


• Cost & time to be self-reported by teams
Nov. 16, 2011           TREC 2011 Crowdsourcing Track   24
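A sketch of the label-weighted per-worker averaging described on this slide; the per-worker score function is left abstract (any of the metrics above plugs in here), and the data layout is assumed for illustration.

```python
def weighted_per_worker_average(worker_scores, worker_label_counts):
    """Average per-worker performance, weighted by each worker's label count.

    worker_scores: {worker_id: metric value computed over that worker's labels}
    worker_label_counts: {worker_id: number of labels contributed}
    """
    total = sum(worker_label_counts.get(w, 0) for w in worker_scores)
    if total == 0:
        return 0.0
    return sum(score * worker_label_counts.get(w, 0)
               for w, score in worker_scores.items()) / total

print(weighted_per_worker_average({"w1": 0.9, "w2": 0.5}, {"w1": 300, "w2": 100}))  # 0.8
```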
¼ most productive workers do ¾ of the work

            # of workers
                           # of labels                     % of labels

                Top 25%
                               44917                        76.77%

                Top 50%
                               53444                        91.34%

                Top 75%
                             56558.5                        96.66%

                 Total
                               58510                         100%


Nov. 16, 2011              TREC 2011 Crowdsourcing Track                 25
Same worker, multiple teams
 [Figure: number of examples labeled per anonymized worker ID (y-axis: Number of Examples, 0–2000; x-axis: Anonymized Worker ID)]

   # of teams a worker belongs to     # of workers     avg. # of examples
   1                                  947              56.21
   2                                  35               146.65
   3                                  2                72.25

                     Nov. 16, 2011            TREC 2011 Crowdsourcing Track               26
Task 2: Aggregation
Task 2: Data
• Input: judgments provided by organizers
    – 19,033 topic-document pairs
    – 89,624 binary judgments from 762 workers
• Evaluation: average per-topic performance
• Gold: 3275 labels
    – 2275 for training (1275 relevant, 1000 non-relevant)
           • Excluded from evaluation
    – 1000 for blind test (balanced 500/500)

Nov. 16, 2011               TREC 2011 Crowdsourcing Track   28
Task 2: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
2. Delft University of Technology – Vuurens (TUD_DMIR)
3. Delft University of Technology & University of Iowa (GeAnn)
4. Glasgow – Terrier (uogTr)
5. Glasgow – Zuccon (qirdcsuog)
6. LingPipe
7. Microsoft (MSRC)
8. University Carlos III of Madrid (uc3m)
9. University of Texas at Austin (UTAustin)
10. University of Waterloo (UWaterlooMDS)

Nov. 16, 2011           TREC 2011 Crowdsourcing Track            29
Discussion
• Consensus Labels as ground-truth
       – Consensus Algorithm for Label Generation?
       – Probabilistic or Rounded Binary Consensus Labels?
• Proper scoring rules
• Changes for 2012?
       –   Which document collection? Request NIST judging?
       –   Drop the two-task format? Pre-suppose crowdsourced solution?
       –   Broaden sponsorship? Narrow scope?
       –   Additional organizer?
       –   Details
                • Focus on worker training effects
                • Treatment of rejected work



Nov. 16, 2011                         TREC 2011 Crowdsourcing Track       30
Conclusion
• Interesting first year of track
       – Some insights about what worked well and less well in track design
       – Participants will tell us about methods developed
       – More analysis still needed for evaluation
• Track will run again in 2012
       – Help shape it with feedback (planning session, hallway, or email)
• Acknowledgments
       – Hyun Joon Jung (UT Austin)
       – Mark Smucker (U Waterloo)
       – Ellen Voorhees & Ian Soboroff (NIST)
• Sponsors
       – Amazon
       – CrowdFlower
Nov. 16, 2011                  TREC 2011 Crowdsourcing Track                  31
