Overview of the TREC 2011
      Crowdsourcing Track



                  Organizers:
Gabriella Kazai, Microsoft Research Cambridge
  Matt Lease, University of Texas at Austin
Nov. 16, 2011   TREC 2011 Crowdsourcing Track   2
What is Crowdsourcing?
• A collection of mechanisms and associated
  methodologies for scaling and directing crowd
  activities to achieve some goal(s)
• Enabled by Internet connectivity
• Many related concepts
       – Collective intelligence
       – Social computing
       – People services
       – Human computation
Nov. 16, 2011            TREC 2011 Crowdsourcing Track   3
Why Crowdsourcing? Potential…
• Scalability (e.g. cost, time, effort)
       – e.g. scale to greater pool sizes
• Quality (by getting more eyes on the data)
       – More diverse judgments
       – More accurate judgments (“wisdom of crowds”)
• And more!
       – New datasets, new tasks, interaction, on-demand
         evaluation, hybrid search systems

Nov. 16, 2011             TREC 2011 Crowdsourcing Track    4
Track Goals (for Year 1)
• Promote IR community awareness
  of, investigation of, and experience with
  crowdsourcing mechanisms and methods
• Improve understanding of best practices
• Establish shared, reusable benchmarks
• Assess state-of-the-art of the field
• Attract experience from outside IR community

Nov. 16, 2011          TREC 2011 Crowdsourcing Track   5
Crowdsourcing in 2011
•   AAAI-HCOMP: 3rd Human Computation Workshop (Aug. 8)
•   ACIS: Crowdsourcing, Value Co-Creation, & Digital Economy Innovation (Nov. 30 – Dec. 2)
•   Crowdsourcing Technologies for Language and Cognition Studies (July 27)
•   CHI-CHC: Crowdsourcing and Human Computation (May 8)
•   CIKM: BooksOnline (Oct. 24, “crowdsourcing … online books”)
•   CrowdConf 2011 -- 2nd Conf. on the Future of Distributed Work (Nov. 1-2)
•   Crowdsourcing: Improving … Scientific Data Through Social Networking (June 13)
•   EC: Workshop on Social Computing and User Generated Content (June 5)
•   ICWE: 2nd International Workshop on Enterprise Crowdsourcing (June 20)
•   Interspeech: Crowdsourcing for speech processing (August)
•   NIPS: Second Workshop on Computational Social Science and the Wisdom of Crowds (Dec. TBD)
•   SIGIR-CIR: Workshop on Crowdsourcing for Information Retrieval (July 28)
•   TREC-Crowd: Year 1 of TREC Crowdsourcing Track (Nov. 16-18)
•   UbiComp: 2nd Workshop on Ubiquitous Crowdsourcing (Sep. 18)
•   WSDM-CSDM: Crowdsourcing for Search and Data Mining (Feb. 9)
    Nov. 16, 2011                     TREC 2011 Crowdsourcing Track                           6
Two Questions, Two Tasks
• Task 1: Assessment (human factors)
       – How can we obtain quality relevance judgments
         from individual (crowd) participants?


• Task 2: Aggregation (statistics)
       – How can we derive a quality relevance judgment
         from multiple (crowd) judgments?




Nov. 16, 2011           TREC 2011 Crowdsourcing Track     7
Task 1: Assessment (human factors)
• Measurable outcomes & potential tradeoffs
       – Quality, time, cost, & effort
• Many possible factors
       – Incentive structures
       – Interface design
       – Instructions / guidance
       – Interaction / feedback
       – Recruitment & retention
       –…
Nov. 16, 2011            TREC 2011 Crowdsourcing Track   8
Task 2: Aggregation (statistics)
• “Wisdom of crowds” computing
• Typical assumption: noisy input labels
       – But not always (cf. Yang et al., SIGIR’10)
• Many statistical methods have been proposed
       – Common baseline: majority vote




Nov. 16, 2011             TREC 2011 Crowdsourcing Track   9
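To make the majority-vote baseline above concrete, here is a minimal sketch in Python; the judgment tuple format (topic, document, worker, binary label) and the tie-breaking rule are illustrative assumptions, not the track's submission format.

```python
from collections import defaultdict

def majority_vote(judgments):
    """Aggregate binary crowd labels by simple majority vote.

    judgments: iterable of (topic_id, doc_id, worker_id, label) with label in {0, 1}.
    Returns {(topic_id, doc_id): consensus_label}.
    Ties default to relevant (1) -- an assumption for this sketch.
    """
    votes = defaultdict(list)
    for topic_id, doc_id, _worker_id, label in judgments:
        votes[(topic_id, doc_id)].append(label)

    consensus = {}
    for key, labels in votes.items():
        consensus[key] = 1 if sum(labels) * 2 >= len(labels) else 0
    return consensus

# Example: three workers label one topic-document pair.
print(majority_vote([("t1", "d1", "w1", 1), ("t1", "d1", "w2", 0), ("t1", "d1", "w3", 1)]))
# {('t1', 'd1'): 1}
```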
Crowdsourcing, Noise & Uncertainty
Broadly two approaches
1. Alchemy: turn noisy data into gold
       – Once we have gold, we can go on training and
         evaluating as before (separation of concerns)
       – Assume we can mostly clean it up and ignore any
         remaining error (even gold is rarely 100% pure)
2. Model & propagate uncertainty
       – Let it “spill over” into training and evaluation

Nov. 16, 2011             TREC 2011 Crowdsourcing Track     10
Test Collection: ClueWeb09 subset
• Collection: 19K pages rendered by Waterloo
       – Task 1: teams judge (a subset)
       – Task 2: teams aggregate judgments we provide
• Topics: taken from past Million Query (MQ) and Relevance Feedback (RF) tracks
• Gold: Roughly 3K prior NIST judgments
       – Remaining 16K pages have no “gold” judgments




Nov. 16, 2011          TREC 2011 Crowdsourcing Track    11
What to Predict?
• Teams submit classification and/or ranking labels
       – Classification supports traditional absolute relevance judging
       – Rank labels support pair-wise preference or list-wise judging
• Classification labels in [0,1]
       – Probability of relevance (assessor/system uncertainty)
       – Simple generalization of binary relevance
       – If probabilities submitted but no ranking, rank labels induced
• Ranking as [1..N]
       – Task 1: rank 5 documents per set
                • Same worker had to label all 5 examples in a given set (challenge)
       – Task 2: rank all documents per topic
Nov. 16, 2011                        TREC 2011 Crowdsourcing Track                     12
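A small sketch of how rank labels can be induced from submitted probabilities, as described above; the tie-breaking beyond the probability value itself is an assumption, since the track's exact rule is not stated on the slide.

```python
def induce_ranking(prob_labels):
    """Induce rank labels [1..N] from classification labels in [0, 1].

    prob_labels: {doc_id: probability_of_relevance} for one topic (or one set).
    Returns {doc_id: rank}, where rank 1 is the document most likely relevant.
    Ties are broken by doc_id -- an assumption for this sketch.
    """
    ordered = sorted(prob_labels.items(), key=lambda kv: (-kv[1], kv[0]))
    return {doc_id: rank for rank, (doc_id, _p) in enumerate(ordered, start=1)}

print(induce_ranking({"d1": 0.9, "d2": 0.4, "d3": 0.9}))
# {'d1': 1, 'd3': 2, 'd2': 3}
```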
Metrics
• Classification
   – Binary ground truth: P, R, Accuracy, Specificity, LogLoss
   – Probabilistic ground truth: KL, RMSE

• Ranking
   – Mean Average Precision (MAP)
   – Normalized Discounted Cumulative Gain (NDCG)
                • Ternary NIST judgments conflated to binary
                • Could explore mapping [0,1] consensus to ternary categories



Nov. 16, 2011                     TREC 2011 Crowdsourcing Track             13
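As a concrete reference for the ranking metrics, here is a sketch of average precision over one ranked list with binary relevance (MAP is the mean of this over topics); conflating graded NIST labels by treating any positive grade as relevant is an assumed convention consistent with the slide, not a statement of the track's exact mapping.

```python
def conflate_to_binary(grade):
    """Map a ternary NIST grade (e.g. 0, 1, 2) to binary relevance.
    Treating any positive grade as relevant is an assumed convention."""
    return 1 if grade > 0 else 0

def average_precision(ranked_doc_ids, binary_rel):
    """Average precision for one topic.
    ranked_doc_ids: documents in ranked order; binary_rel: {doc_id: 0/1}."""
    hits, precisions = 0, []
    for i, doc_id in enumerate(ranked_doc_ids, start=1):
        if binary_rel.get(doc_id, 0):
            hits += 1
            precisions.append(hits / i)
    total_rel = sum(binary_rel.values())
    return sum(precisions) / total_rel if total_rel else 0.0

rels = {"d1": conflate_to_binary(2), "d2": conflate_to_binary(0), "d3": conflate_to_binary(1)}
print(average_precision(["d1", "d2", "d3"], rels))  # (1/1 + 2/3) / 2 ~= 0.833
```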
Classification Metrics

                                     Prediction
                                Rel            Non-rel
   Ground Truth     True        TP             FN
                    False       FP             TN

Nov. 16, 2011     TREC 2011 Crowdsourcing Track                              14
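For reference, the standard definitions of these binary classification metrics in terms of the confusion matrix above (the usual conventions; the slide itself shows only the matrix):

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} &
\text{Recall (Sensitivity)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} &
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align}
```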
Classification Metrics (cont’d)
• Classification – Binary ground truth (cont’d)


• Classification – Probabilistic ground truth


                Root Mean Squared Error (RMSE)
• Notes
       – To avoid log(0) = infinity, replace 0 with 10^-15
       – Revision: compute average per-example logloss and KL so error does
         not grow with sample size (particularly with varying team coverage)


Nov. 16, 2011                    TREC 2011 Crowdsourcing Track                 15
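The metric formulas on this slide did not survive text extraction; below is a sketch of the standard per-example definitions consistent with the metric names and the averaging revision noted above. Here $p_i$ is the submitted probability of relevance and $g_i$ the ground-truth label (or probability) for example $i$ of $N$; these are assumed standard forms, not necessarily the slide's exact notation.

```latex
\begin{align}
\text{LogLoss} &= -\frac{1}{N}\sum_{i=1}^{N}\Big[\,g_i \log p_i + (1 - g_i)\log(1 - p_i)\Big] \\
\text{KL} &= \frac{1}{N}\sum_{i=1}^{N}\Big[\,g_i \log\frac{g_i}{p_i} + (1 - g_i)\log\frac{1 - g_i}{1 - p_i}\Big] \\
\text{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(g_i - p_i\big)^2}
\end{align}
```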
Ground Truth: Three Versions

• Gold: NIST Judgments
       – only available for a subset of the test collection
• Consensus: generated by aggregating team labels (automatic)
       – full coverage
• Team-based (Task 2 only)
       – use each team’s labels as truth to evaluate all other teams
       – Inspect variance in team rankings over alternative ground truths
       – Coverage varies

Three primary evaluation conditions
1. Over examples having gold labels (evaluate vs. gold labels)
2. Over examples having gold labels (evaluate vs. consensus labels)
3. Over all examples (evaluate vs. consensus labels)

Nov. 16, 2011                     TREC 2011 Crowdsourcing Track             16
Consensus
• Goal: Infer single consensus label from multiple input labels
• Methodological Goals: unbiased, transparent, simple
• Method: simple average, rounded when metrics require
       – Task 2: input = example labels from each team
       – Task 1: input = per-example average of worker labels from each team
• Details
       –   Classification labels only; no rank fusion
       –   Using primary runs only
       –   Task 1: each team gets 1 vote regardless of worker count (prevent bias)
       –   Exclude any examples where
                • only one team submitted a label (bias)
                • consensus would yield a tie (binary metrics only)

Nov. 16, 2011                         TREC 2011 Crowdsourcing Track             17
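A minimal sketch of the averaging scheme described above, using a hypothetical per-team label layout; the exclusion rules (single-team examples, ties under binary rounding) follow the bullets, while the data structures are assumptions for illustration.

```python
def consensus_labels(team_labels, rounded=False):
    """Average per-team classification labels in [0, 1] into consensus labels.

    team_labels: {(topic_id, doc_id): {team_id: label}}, where each label is the
    team's example label (Task 2) or the per-example average of its workers'
    labels (Task 1), so every team gets exactly one vote.
    Examples with only one contributing team are excluded; when a rounded
    binary consensus is requested, exact 0.5 averages (ties) are excluded too.
    """
    consensus = {}
    for key, labels in team_labels.items():
        if len(labels) < 2:          # only one team labeled this example: skip (bias)
            continue
        avg = sum(labels.values()) / len(labels)
        if rounded:
            if avg == 0.5:           # tie under binary rounding: skip
                continue
            consensus[key] = int(avg > 0.5)
        else:
            consensus[key] = avg
    return consensus

labels = {("t1", "d1"): {"A": 1.0, "B": 0.0, "C": 1.0}, ("t1", "d2"): {"A": 1.0}}
print(consensus_labels(labels))                # {('t1', 'd1'): 0.666...}
print(consensus_labels(labels, rounded=True))  # {('t1', 'd1'): 1}
```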
How good is consensus? Compare to gold.

 Task 1: 395 gold topic-document pairs

   Labels                       ACC     PRE     REC     SPE     LL      KL      RMSE
   Probabilistic Consensus      0.69    0.74    0.79    0.57    0.71    0.23    0.38
   Rounded Binary Consensus     0.80    0.87    0.85    0.66    6.85    3.14    0.45

 Task 2: 1000 gold topic-document pairs

   Labels                       ACC     PRE     REC     SPE     LL      KL      RMSE
   Probabilistic Consensus      0.62    0.73    0.60    0.50    0.65    0.19    0.47
   Rounded Binary Consensus     0.69    0.83    0.65    0.55    10.71   2.94    0.56

 Issue: need to consider proper scoring rules
Nov. 16, 2011                 TREC 2011 Crowdsourcing Track                     18
Task 1: Assessment
     (Judging)
Task 1: Data
• Option 1: Use Waterloo rendered pages
       – Available as images, PDFs, and plain text (+html)
       – Many page images fetched from CMU server
       – Protect workers from malicious scripting
• Option 2: Use some other format
       – Any team creating some other format was asked
         to provide that data or conversion tool to others
       – Avoid comparison based on different rendering

Nov. 16, 2011            TREC 2011 Crowdsourcing Track       20
Task 1: Data
• Topics: 270 (240 development, 30 test)
• Test Effort: ~2200 topic-document pairs for each team to judge
       – Shared sets: judged by all teams
                • Test: 1655 topic-document pairs (331 sets) over 20 topics
       – Assigned sets: judged by a subset of teams
                • Test: 1545 topic-document pairs (309 sets) over 15 topics in total
                • ~ 500 assigned to each team (~ 30 rel, 20 non-rel, 450 unknown)
       – Split intended to let organizers measure any worker-training effects
                • Increased track complexity, decreased useful redundancy & gold …

• Gold: 395 topic-document pairs for test
       – made available to teams for cross-validation (not blind)

Nov. 16, 2011                           TREC 2011 Crowdsourcing Track                  21
Task 1: Cost & Sponsorship
• Paid crowd labor only one form of crowdsourcing
       – Other models: directed gaming, citizen science, virtual pay
       – Incentives: socialize with others, recognition, social good, learn, etc.

• Nonetheless, paid models continue to dominate
       – e.g. Amazon Mechanical Turk (MTurk), CrowdFlower

• Risk: cost of crowd labor being barrier to track participation
• Risk Mitigation: sponsorship
       – CrowdFlower: $100 free credit to interested teams
       – Amazon: ~ $300 reimbursement to teams using MTurk (expected)
Nov. 16, 2011                    TREC 2011 Crowdsourcing Track                      22
Task 1: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
      –         CrowdFlower qualification, MTurk judging
2. Delft University of Technology – Vuurens (TUD_DMIR): MTurk
3. Delft University of Technology & University of Iowa (GeAnn)
      –         Game, recruit via CrowdFlower
4.     Glasgow – Terrier (uogTr): MTurk
5.     Microsoft (MSRC): MTurk
6.     RMIT University (RMIT): CrowdFlower
7.     University Carlos III of Madrid (uc3m): MTurk
8.     University of Waterloo (UWaterlooMDS): in-house judging

5 used MTurk, 3 used CrowdFlower, 1 in-house

Nov. 16, 2011                        TREC 2011 Crowdsourcing Track   23
Task 1: Evaluation method
• Average per-worker performance
       – Average weighted by number of labels per worker
       – Primary evaluation includes rejected work


• Additional metric: Coverage
       – What % of examples were labeled by the team?


• Cost & time to be self-reported by teams
Nov. 16, 2011           TREC 2011 Crowdsourcing Track   24
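A sketch of the label-weighted per-worker averaging described on this slide; the per-worker score function is left abstract (any of the metrics above plugs in here), and the data layout is assumed for illustration.

```python
def weighted_per_worker_average(worker_scores, worker_label_counts):
    """Average per-worker performance, weighted by each worker's label count.

    worker_scores: {worker_id: metric value computed over that worker's labels}
    worker_label_counts: {worker_id: number of labels contributed}
    """
    total = sum(worker_label_counts.get(w, 0) for w in worker_scores)
    if total == 0:
        return 0.0
    return sum(score * worker_label_counts.get(w, 0)
               for w, score in worker_scores.items()) / total

print(weighted_per_worker_average({"w1": 0.9, "w2": 0.5}, {"w1": 300, "w2": 100}))  # 0.8
```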
¼ most productive workers do ¾ of the work

            # of workers
                           # of labels                     % of labels

                Top 25%
                               44917                        76.77%

                Top 50%
                               53444                        91.34%

                Top 75%
                             56558.5                        96.66%

                 Total
                               58510                         100%


Nov. 16, 2011              TREC 2011 Crowdsourcing Track                 25
Same worker, multiple teams
 [Figure: number of examples labeled per anonymized worker ID (y-axis: Number of Examples, 0–2000; x-axis: Anonymized Worker ID)]

   # of teams a worker belongs to     # of workers     avg. # of examples
   1                                  947              56.21
   2                                  35               146.65
   3                                  2                72.25

                     Nov. 16, 2011            TREC 2011 Crowdsourcing Track               26
Task 2: Aggregation
Task 2: Data
• Input: judgments provided by organizers
    – 19,033 topic-document pairs
    – 89,624 binary judgments from 762 workers
• Evaluation: average per-topic performance
• Gold: 3275 labels
    – 2275 for training (1275 relevant, 1000 non-relevant)
           • Excluded from evaluation
    – 1000 for blind test (balanced 500/500)

Nov. 16, 2011               TREC 2011 Crowdsourcing Track   28
Task 2: Participants
1. Beijing University of Posts and Telecommunications (BUPT)
2. Delft University of Technology – Vuurens (TUD_DMIR)
3. Delft University of Technology & University of Iowa (GeAnn)
4. Glasgow – Terrier (uogTr)
5. Glasgow – Zuccon (qirdcsuog)
6. LingPipe
7. Microsoft (MSRC)
8. University Carlos III of Madrid (uc3m)
9. University of Texas at Austin (UTAustin)
10. University of Waterloo (UWaterlooMDS)

Nov. 16, 2011           TREC 2011 Crowdsourcing Track            29
Discussion
• Consensus Labels as ground-truth
       – Consensus Algorithm for Label Generation?
       – Probabilistic or Rounded Binary Consensus Labels?
• Proper scoring rules
• Changes for 2012?
       –   Which document collection? Request NIST judging?
       –   Drop the two-task format? Pre-suppose crowdsourced solution?
       –   Broaden sponsorship? Narrow scope?
       –   Additional organizer?
       –   Details
                • Focus on worker training effects
                • Treatment of rejected work



Nov. 16, 2011                         TREC 2011 Crowdsourcing Track       30
Conclusion
• Interesting first year of track
       – Some insights about what worked well and less well in track design
       – Participants will tell us about methods developed
       – More analysis still needed for evaluation
• Track will run again in 2012
       – Help shape it with feedback (planning session, hallway, or email)
• Acknowledgments
       – Hyun Joon Jung (UT Austin)
       – Mark Smucker (U Waterloo)
       – Ellen Voorhees & Ian Soboroff (NIST)
• Sponsors
       – Amazon
       – CrowdFlower
Nov. 16, 2011                  TREC 2011 Crowdsourcing Track                  31
