The goal of FamilySearch.org is to help people find their ancestors. It is a freely available resource that compiles records from databases around the world. The Church of Jesus Christ of Latter-day Saints sponsors it, but anyone can use it for free.
FamilySearch Indexing’s role is to transcribe text from scanned images so it is in a machine-readable format that can be searched. This is done by hundreds of thousands of indexers, making it the world’s largest document transcription service. Documents include census records, vital records (e.g., birth, death, marriage, burial), church records (e.g., christening), military records, legal records, cemetery records, and migration records from countries around the globe.
As you can see, transcribing names from handwritten documents is not a trivial task, though a wide range of people are capable of learning to do it and no specialized equipment is needed. Nearly 400,000 contributors have transcribed records, with over 500 new volunteers signing up each day in the recent past. The challenges of transcription work make quality control mechanisms essential to the success of the project, and they also underscore the importance of understanding expertise and how it develops over time.
Documents are being scanned at an increasing rate. If we are to benefit from these new resources, we'll need to keep pace with the indexing efforts. Thus, the goals of FSI are to (a) index as many documents as possible, while (b) assuring a certain level of quality.
And there are others for more complex tasks that require coordination, such as those occurring on Wikipedia (e.g., Kittur & Kraut, 2008). Note that some of these are not mutually exclusive. Many have only been tested in research prototype projects, not at scale. And others were not designed with efficiency in mind.
The current quality control mechanism is called A-B-Arbitrate (or just A-B-ARB or "arbitration" for short). In this process, person A and person B index the document independently, and an experienced arbitrator (ARB) reviews any discrepancies between the two.
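The A-B-ARB flow can be sketched as follows. This is an illustrative sketch, not FSI's actual implementation; the field names and the `arbitrate` callback are hypothetical stand-ins for the real system and the human arbitrator.

```python
def find_discrepancies(a_record, b_record):
    """Return the fields on which transcribers A and B disagree."""
    return [field for field in a_record if a_record[field] != b_record.get(field)]

def ab_arbitrate(a_record, b_record, arbitrate):
    """Accept fields where A and B agree; route disagreements to the arbitrator."""
    final = {}
    for field in a_record:
        if a_record[field] == b_record.get(field):
            final[field] = a_record[field]  # independent agreement, accept as-is
        else:
            # only the discrepancy goes to the experienced arbitrator
            final[field] = arbitrate(field, a_record[field], b_record[field])
    return final

# Example: two independent transcriptions of one census line.
a = {"surname": "Smith", "gender": "M", "age": "34"}
b = {"surname": "Smyth", "gender": "M", "age": "34"}
# A toy arbitrator that always sides with A, for illustration only.
resolved = ab_arbitrate(a, b, lambda field, av, bv: av)
```

Note that only the disagreeing fields reach the arbitrator, which is what keeps the arbitration workload manageable.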
This is a proposed model that had not been tested until this study. The model could include arbitration (RARB), or that step could be skipped if A-R results in high enough quality on its own (see findings).
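By contrast, the peer-review flow can be sketched like this (again purely illustrative; the `review` callback stands in for a human reviewer editing A's pre-filled work):

```python
def peer_review(a_record, review):
    """R sees A's work pre-filled and edits only fields that look wrong.

    Returns the reviewed record plus the list of fields R changed,
    which is what an optional RARB step would then examine.
    """
    reviewed = dict(a_record)
    changed = []
    for field, value in a_record.items():
        new_value = review(field, value)
        if new_value != value:
            reviewed[field] = new_value
            changed.append(field)
    return reviewed, changed

# Example: the reviewer corrects a misread surname and leaves the rest alone.
a = {"surname": "Smyth", "gender": "M", "age": "34"}
fixed, changed = peer_review(a, lambda f, v: "Smith" if f == "surname" else v)
```

The key design difference from A-B-ARB is that R never transcribes from scratch, which is where the efficiency gain comes from.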
In Act I, quality is measured as agreement between independent coders. This is not true quality, but it is highly correlated with it. In Act II, quality is measured against a truth set created by a company that assured 99.9% accuracy and was independently audited by expert FSI transcribers. Efficiency is measured in terms of "active" time spent indexing (after "idle time" was removed) and keystrokes, as captured by the indexing program.
Quality (estimated based on A-B agreement)
- Measures difficulty more than actual quality
- Underestimates quality, since an experienced Arbitrator reviews all A-B disagreements
- Good at capturing differences across people, fields, and projects

Time (calculated using keystroke-logging data)
- Idle time is tracked separately, making actual time measurements more accurate
- Outliers removed
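These two measures could be computed roughly as follows. This is a hedged sketch of the idea, not FSI's actual code; in particular, the 30-second idle threshold is an assumed value for illustration.

```python
IDLE_THRESHOLD = 30.0  # seconds; assumed cutoff, gaps longer than this count as idle

def agreement_rate(a_fields, b_fields):
    """Fraction of fields on which two independent transcribers agree."""
    matches = sum(1 for a, b in zip(a_fields, b_fields) if a == b)
    return matches / len(a_fields)

def active_time(keystroke_times):
    """Total indexing time with idle gaps between keystrokes removed."""
    total = 0.0
    for prev, cur in zip(keystroke_times, keystroke_times[1:]):
        gap = cur - prev
        if gap <= IDLE_THRESHOLD:
            total += gap
    return total
```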
Notice the high variation in agreement depending on how many possible values a field has (e.g., gender has only two options, while surname has many).
This finding is likely due to the fact that most transcribers are English speakers, which suggests the need to recruit contributors who are native speakers of other languages.
Experience is based on EL(U) = round(log5(N(U))), where U represents the transcriber, N(U) is the number of images that U has transcribed, and EL(U) is the experience level of U.

Rank   Number of images transcribed
0      1
1      5
2      25
3      125
4      625
5      3,125
6      15,625
7      78,125
8      390,625
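The formula above translates directly to code (a minimal sketch; `round` here mirrors the paper's definition, and the level boundaries fall at the powers of 5 listed in the rank table above):

```python
import math

def experience_level(n_images):
    """EL(U) = round(log5(N(U))): experience level after n_images transcribed."""
    return round(math.log(n_images, 5))
```

Because the scale is logarithmic, each additional level requires roughly five times as many transcribed images as the last.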
There isn’t much improvement, since it’s an “easy” field to agree on. In other words, even novices are good.
Here there isn’t much improvement either, but the overall agreement is low. This suggests that even experts are not good, likely because of unfamiliarity with Canadian placenames given the predominantly US indexing population. Remember that expertise is based on all contributions, not just those in this category.
More experienced transcribers are much faster (up to 4 times faster) than inexperienced users. They also use fewer keystrokes (e.g., invoking help functionality, fixing mistakes…). Though not shown here, the paper shows that experienced indexers’ work also requires less time to arbitrate and fewer keystrokes. Furthermore, transcribers of the English version of the 1871 Canadian Census were 2.68 seconds faster per line than those working on the French version, even though the French version required more keystrokes. Again, this is likely because most transcribers are native English speakers.
2,000 random images, each with 50 lines of data (one row per individual) and many fields per line (e.g., surname, county of origin, gender, age). Note that this is repeated-measures data, since the same transcriber transcribes all 50 rows of an image in a “batch” and some people transcribe more than one page; we use a mixed model to account for this. Because the people performing R were new to this method and the system was not tuned to the needs of reviewers, the A-R-RARB data should be considered a baseline – i.e., a lower bound on how well A-R-RARB can do.
A new approach based on peer review instead of independent indexing would likely improve efficiency, but its effect on quality is unknown. Anecdotal evidence suggests that peer reviewing may be twice as fast as indexing from scratch.
This is likely due to the fact that most R edits fix problems – they rarely introduce new ones. However, RARB does not know who A or R is, and RARB erroneously agrees with A too often. This is why there is no gain from RARB, and in fact some small losses in quality.
There are clear gains in time for the A-R model, because reviewing takes about half as much time as transcribing from scratch.
Remember, in our study Peer Review was a new method for those performing it, and the system hadn’t been customized to support it well; with some minor improvements and training it may do as well as A-B-ARB.
QUALITY CONTROL MECHANISMS FOR CROWDSOURCING: PEER REVIEW, ARBITRATION, & EXPERTISE AT FAMILYSEARCH INDEXING
CSCW, San Antonio, TX, Feb 26, 2013
Derek Hansen, Patrick Schone, Douglas Corey, Matthew Reid, & Jake Gehring
FSI in Broader Landscape
• Crowdsourcing Project: Aggregates discrete tasks completed by volunteers who replace professionals (Howe, 2006; Doan, et al., 2011)
• Human Computation System: Humans use a computational system to work on a problem that may someday be solvable by computers (Quinn & Bederson, 2011)
• Lightweight Peer Production: Largely anonymous contributors independently completing discrete, repetitive tasks provided by authorities (Haythornthwaite, 2009)
Design Challenge: Improve efficiency without sacrificing quality
[Chart: amount of scanned documents over time]
Quality Control Mechanisms
• 9 types of quality control for human computation systems (Quinn & Bederson, 2011)
  • Redundancy
  • Multi-level review
• Find-Fix-Verify pattern (Bernstein, et al., 2010)
• Weight proposed solutions by reputation of contributor (McCann, et al., 2003)
• Peer or expert oversight (Cosley, et al., 2005)
• Tournament selection approach (Sun, et al., 2011)
A-B-Arbitrate process (A-B-ARB): Currently Used Mechanism
[Diagram: A and B index independently; ARB reviews their disagreements]
Peer review process (A-R-RARB): Proposed Mechanism
[Diagram: A indexes; R reviews A’s already-filled-in work; RARB (optional?) reviews R]
Two Act Play

Act I: Experience
What is the role of experience on quality and efficiency?
Historical data analysis using full US and Canadian Census records from 1920 and earlier

Act II: Quality Control
Is peer review or arbitration better in terms of quality and efficiency?
Field experiment using 2,000 images from the 1930 US Census data & corresponding truth set
Act I: Experience
Quality is estimated based on A-B agreement (no truth set)
Efficiency calculated using keystroke-logging data with idle time and outliers removed
Summary & Implications of Act I
Experienced workers are faster and more accurate, gains which continue even at high levels
- Focus on retention
- Encourage both novices & experts to do more
- Develop interventions to speed up experience gains (e.g., send users common mistakes made by people at their experience level)
Summary & Implications of Act I
Contextual knowledge (e.g., Canadian placenames) and specialized skills (e.g., French language fluency) are needed for some tasks
- Recruit people with existing knowledge & skills
- Provide contextual information when possible (e.g., Canadian placename prompts)
- Don’t remove context (e.g., captcha)
- Allow users to specialize?
Act II: Quality Control
A-B-ARB data from original transcribers (Feb 2011)
A-R-RARB data includes original A data and newly collected R and RARB data from people new to this method (Jan-Feb of 2012)
Truth Set data from company with independent audit by FSI experts
Statistical Test: mixed-model logistic regression (accurate or not) with random effects, controlling for expertise
Limitations
• Experience levels of R and RARB were lower than expected, though we did statistically control for this
• Original B data used in A-B-ARB for certain fields was transcribed in a non-standard manner requiring adjustment
No Need for RARB
• No gains in quality from extra arbitration of peer reviewed data (A-R = A-R-RARB)
• RARB takes some time, so better without
Quality Comparison
• Both methods were statistically better than A alone
• A-B-ARB had slightly lower error rates than A-R
• R “missed” more errors, but also introduced fewer errors
Summary & Implications of Act II
Peer Review shows considerable efficiency gains with nearly as good quality as Arbitration
- Prime reviewers to find errors (e.g., prompt them with expected # of errors on a page)
- Highlight potential problems (e.g., let A flag tough fields)
- Route difficult pages to experts
- Consider an A-R1-R2 process when high quality is critical
Summary & Implications of Act II
Reviewing reviewers isn’t always worth the time
- At least in some contexts, Find-Fix may not need Verify
Quality of different fields varies dramatically
- Use different quality control mechanisms for harder or easier fields
Integrate human and algorithmic transcription
- Use algorithms on easy fields & integrate into review process so machine learning can occur
Questions
• Derek Hansen (email@example.com)
• Patrick Schone (BoiseBound@aol.com)
• Douglas Corey (firstname.lastname@example.org)
• Matthew Reid (email@example.com)
• Jake Gehring (GehringJG@familysearch.org)