The goal of FamilySearch.org is to help people find their ancestors. It is a freely available resource that compiles records from databases around the world. The Church of Jesus Christ of Latter-day Saints sponsors it, but anyone can use it for free.
FamilySearch Indexing’s role is to transcribe text from scanned images so it is in a machine-readable format that can be searched. This is done by hundreds of thousands of indexers, making it the world’s largest document transcription service. Documents include census records, vital records (e.g., birth, death, marriage, burial), church records (e.g., christening), military records, legal records, cemetery records, and migration records from countries around the globe.
As you can see, transcribing names from handwritten documents is not a trivial task, though a wide range of people are capable of learning to do it and no specialized equipment is needed. Nearly 400,000 contributors have transcribed records, with over 500 new volunteers signing up each day in the recent past. The challenges of transcription work make quality control mechanisms essential to the success of the project, and they also underscore the importance of understanding expertise and how it develops over time.
Documents are being scanned at an increasing rate. If we are to benefit from these new resources, we'll need to keep pace with the indexing efforts. Thus, the goals of FSI are to (a) index as many documents as possible, while (b) assuring a certain level of quality.
And there are others for more complex tasks that require coordination, such as those occurring on Wikipedia (e.g., Kittur & Kraut, 2008). Note that some of these are not mutually exclusive. Many have only been tested in research prototype projects, not at scale. And others were not designed with efficiency in mind.
The current quality control mechanism is called A-B-Arbitrate (or just A-B-ARB or "arbitration" for short). In this process, person A and person B index the document independently, and an experienced arbitrator (ARB) reviews any discrepancies between the two.
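The A-B-ARB flow can be sketched as follows. This is an illustrative sketch, not FSI's actual implementation; the field names and the `arbitrate` callback are hypothetical stand-ins for the real system and the human arbitrator.

```python
def find_discrepancies(a_record, b_record):
    """Return the fields on which transcribers A and B disagree."""
    return [field for field in a_record if a_record[field] != b_record.get(field)]

def ab_arbitrate(a_record, b_record, arbitrate):
    """Accept fields where A and B agree; route disagreements to the arbitrator."""
    final = {}
    for field in a_record:
        if a_record[field] == b_record.get(field):
            final[field] = a_record[field]  # independent agreement, accept as-is
        else:
            # only the discrepancy goes to the experienced arbitrator
            final[field] = arbitrate(field, a_record[field], b_record[field])
    return final

# Example: two independent transcriptions of one census line.
a = {"surname": "Smith", "gender": "M", "age": "34"}
b = {"surname": "Smyth", "gender": "M", "age": "34"}
# A toy arbitrator that always sides with A, for illustration only.
resolved = ab_arbitrate(a, b, lambda field, av, bv: av)
```

Note that only the disagreeing fields reach the arbitrator, which is what keeps the arbitration workload manageable.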
This is a proposed model that had not been tested until this study. The model could include arbitration (RARB), or that step could be skipped if A-R results in high enough quality on its own (see findings).
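By contrast, the peer-review flow can be sketched like this (again purely illustrative; the `review` callback stands in for a human reviewer editing A's pre-filled work):

```python
def peer_review(a_record, review):
    """R sees A's work pre-filled and edits only fields that look wrong.

    Returns the reviewed record plus the list of fields R changed,
    which is what an optional RARB step would then examine.
    """
    reviewed = dict(a_record)
    changed = []
    for field, value in a_record.items():
        new_value = review(field, value)
        if new_value != value:
            reviewed[field] = new_value
            changed.append(field)
    return reviewed, changed

# Example: the reviewer corrects a misread surname and leaves the rest alone.
a = {"surname": "Smyth", "gender": "M", "age": "34"}
fixed, changed = peer_review(a, lambda f, v: "Smith" if f == "surname" else v)
```

The key design difference from A-B-ARB is that R never transcribes from scratch, which is where the efficiency gain comes from.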
In Act I, quality is measured as agreement between independent coders. This is not true quality, but it is highly correlated with it. In Act II, quality is measured against a truth set created by a company that assured 99.9% accuracy and was independently audited by expert FSI transcribers. Efficiency is measured in terms of "active" time spent indexing (after "idle time" was removed) and keystrokes, as captured by the indexing program.
Quality (estimated based on A-B agreement)
- Measures difficulty more than actual quality
- Underestimates quality, since an experienced Arbitrator reviews all A-B disagreements
- Good at capturing differences across people, fields, and projects

Time (calculated using keystroke-logging data)
- Idle time is tracked separately, making actual time measurements more accurate
- Outliers removed
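These two measures could be computed roughly as follows. This is a hedged sketch of the idea, not FSI's actual code; in particular, the 30-second idle threshold is an assumed value for illustration.

```python
IDLE_THRESHOLD = 30.0  # seconds; assumed cutoff, gaps longer than this count as idle

def agreement_rate(a_fields, b_fields):
    """Fraction of fields on which two independent transcribers agree."""
    matches = sum(1 for a, b in zip(a_fields, b_fields) if a == b)
    return matches / len(a_fields)

def active_time(keystroke_times):
    """Total indexing time with idle gaps between keystrokes removed."""
    total = 0.0
    for prev, cur in zip(keystroke_times, keystroke_times[1:]):
        gap = cur - prev
        if gap <= IDLE_THRESHOLD:
            total += gap
    return total
```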
Notice the high variation in agreement depending on how many possible values a field has (e.g., gender has only two options, while surname has many).
This finding is likely due to the fact that most transcribers are English speakers, which suggests the need to recruit contributors who are native speakers of other languages.
Experience is based on EL(U) = round(log5(N(U))), where U represents the transcriber, N(U) is the number of images that U has transcribed, and EL(U) is the experience level of U.

Rank   Number of images transcribed
0      1
1      5
2      25
3      125
4      625
5      3,125
6      15,625
7      78,125
8      390,625
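The formula above translates directly to code (a minimal sketch; `round` here mirrors the paper's definition, and the level boundaries fall at the powers of 5 listed in the rank table above):

```python
import math

def experience_level(n_images):
    """EL(U) = round(log5(N(U))): experience level after n_images transcribed."""
    return round(math.log(n_images, 5))
```

Because the scale is logarithmic, each additional level requires roughly five times as many transcribed images as the last.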
There isn’t much improvement, since it’s an “easy” field to agree on. In other words, even novices are good.
Here there isn’t much improvement either, but the overall agreement is low. This suggests that even experts are not good, likely because of unfamiliarity with Canadian placenames given the predominantly US indexing population. Remember that expertise is based on all contributions, not just those in this category.
More experienced transcribers are much faster (up to 4 times faster) than inexperienced users. They also use fewer keystrokes (e.g., invoking help functionality, fixing mistakes…). Though not shown here, the paper shows that experienced indexers’ work also requires less time to arbitrate and fewer keystrokes. Furthermore, transcribers of the English version of the 1871 Canadian Census were 2.68 seconds faster per line than those working on the French version, even though the French version required more keystrokes. Again, this is likely because most transcribers are native English speakers.
2,000 random images, each with 50 lines of data (one row per individual) and many fields per line (e.g., surname, county of origin, gender, age). Note that this is repeated-measures data, since the same transcriber transcribes all 50 rows of an image in a “batch” and some people transcribe more than one page; we use a mixed model to account for this. Because the people performing R were new to this method and the system was not tuned to the needs of reviewers, the A-R-RARB data should be considered a baseline – i.e., a lower bound on how well A-R-RARB can do.
A new approach based on peer review instead of independent indexing would likely improve efficiency, but its effect on quality is unknown. Anecdotal evidence suggests that peer reviewing may be twice as fast as indexing from scratch.
This is likely due to the fact that most R edits fix problems – they rarely introduce new ones. However, RARB does not know who A or R is, and RARB erroneously agrees with A too often. This is why there is no gain from RARB, and in fact some small losses in quality.
There are clear gains in time for the A-R model, because reviewing takes about half as much time as transcribing from scratch.
Remember, in our study Peer Review was a new method for those performing it, and the system hadn’t been customized to support it well; with some minor improvements and training it may do as well as A-B-ARB.
QUALITY CONTROL MECHANISMS FOR CROWDSOURCING: PEER REVIEW, ARBITRATION, & EXPERTISE AT FAMILYSEARCH INDEXING
CSCW, San Antonio, TX, Feb 26, 2013
Derek Hansen, Patrick Schone, Douglas Corey, Matthew Reid, & Jake Gehring
FSI in Broader Landscape
• Crowdsourcing Project: Aggregates discrete tasks completed by volunteers who replace professionals (Howe, 2006; Doan, et al., 2011)
• Human Computation System: Humans use a computational system to work on a problem that may someday be solvable by computers (Quinn & Bederson, 2011)
• Lightweight Peer Production: Largely anonymous contributors independently completing discrete, repetitive tasks provided by authorities (Haythornthwaite, 2009)
Design Challenge: Improve efficiency without sacrificing quality
[Chart: amount of scanned documents over time]
Quality Control Mechanisms
• 9 types of quality control for human computation systems (Quinn & Bederson, 2011)
  • Redundancy
  • Multi-level review
• Find-Fix-Verify pattern (Bernstein, et al., 2010)
• Weight proposed solutions by reputation of contributor (McCann, et al., 2003)
• Peer or expert oversight (Cosley, et al., 2005)
• Tournament selection approach (Sun, et al., 2011)
A-B-Arbitrate process (A-B-ARB): Currently Used Mechanism
[Diagram: A and B index independently; ARB reviews their disagreements]
Peer review process (A-R-RARB): Proposed Mechanism
[Diagram: A indexes; R reviews A’s already-filled-in work; RARB (optional?) reviews R]
Two Act Play

Act I: Experience
What is the role of experience on quality and efficiency?
Historical data analysis using full US and Canadian Census records from 1920 and earlier

Act II: Quality Control
Is peer review or arbitration better in terms of quality and efficiency?
Field experiment using 2,000 images from the 1930 US Census data & corresponding truth set
Act I: Experience
Quality is estimated based on A-B agreement (no truth set)
Efficiency calculated using keystroke-logging data with idle time and outliers removed
Summary & Implications of Act I
Experienced workers are faster and more accurate, gains which continue even at high levels
- Focus on retention
- Encourage both novices & experts to do more
- Develop interventions to speed up experience gains (e.g., send users common mistakes made by people at their experience level)
Summary & Implications of Act I
Contextual knowledge (e.g., Canadian placenames) and specialized skills (e.g., French language fluency) are needed for some tasks
- Recruit people with existing knowledge & skills
- Provide contextual information when possible (e.g., Canadian placename prompts)
- Don’t remove context (e.g., captcha)
- Allow users to specialize?
Act II: Quality Control
A-B-ARB data from original transcribers (Feb 2011)
A-R-RARB data includes original A data and newly collected R and RARB data from people new to this method (Jan-Feb of 2012)
Truth Set data from company with independent audit by FSI experts
Statistical Test: mixed-model logistic regression (accurate or not) with random effects, controlling for expertise
Limitations
• Experience levels of R and RARB were lower than expected, though we did statistically control for this
• Original B data used in A-B-ARB for certain fields was transcribed in a non-standard manner requiring adjustment
No Need for RARB
• No gains in quality from extra arbitration of peer reviewed data (A-R = A-R-RARB)
• RARB takes some time, so better without
Quality Comparison
• Both methods were statistically better than A alone
• A-B-ARB had slightly lower error rates than A-R
• R “missed” more errors, but also introduced fewer errors
Summary & Implications of Act II
Peer Review shows considerable efficiency gains with nearly as good quality as Arbitration
- Prime reviewers to find errors (e.g., prompt them with expected # of errors on a page)
- Highlight potential problems (e.g., let A flag tough fields)
- Route difficult pages to experts
- Consider an A-R1-R2 process when high quality is critical
Summary & Implications of Act II
Reviewing reviewers isn’t always worth the time
- At least in some contexts, Find-Fix may not need Verify
Quality of different fields varies dramatically
- Use different quality control mechanisms for harder or easier fields
Integrate human and algorithmic transcription
- Use algorithms on easy fields & integrate into review process so machine learning can occur
Questions
• Derek Hansen (email@example.com)
• Patrick Schone (BoiseBound@aol.com)
• Douglas Corey (firstname.lastname@example.org)
• Matthew Reid (email@example.com)
• Jake Gehring (GehringJG@familysearch.org)