The goal of FamilySearch.org is to help people find their ancestors. It is a freely available resource that compiles records from databases around the world. The Church of Jesus Christ of Latter-day Saints sponsors it, but anyone can use it for free.
FamilySearch Indexing’s role is to transcribe text from scanned images so it is in a machine-readable format that can be searched. This is done by hundreds of thousands of indexers, making it the world’s largest document transcription service. Documents include census records, vital records (e.g., birth, death, marriage, burial), church records (e.g., christening), military records, legal records, cemetery records, and migration records from countries around the globe.
As you can see, transcribing names from handwritten documents is not a trivial task, though a wide range of people are capable of learning to do it and no specialized equipment is needed. Nearly 400,000 contributors have transcribed records, with over 500 new volunteers signing up each day in the recent past. The challenges of transcription work make quality control mechanisms essential to the success of the project, and they also underscore the importance of understanding expertise and how it develops over time.
Documents are being scanned at an increasing rate. If we are to benefit from these new resources, we'll need to keep pace with the indexing efforts. Thus, the goals of FSI are to (a) index as many documents as possible, while (b) assuring a certain level of quality.
And there are others for more complex tasks that require coordination, such as those occurring on Wikipedia (e.g., Kittur & Kraut, 2008). Note that some of these are not mutually exclusive. Many have only been tested in research prototype projects, not at scale. And others were not designed with efficiency in mind.
The current quality control mechanism is called A-B-Arbitrate (or just A-B-ARB or "arbitration" for short). In this process, person A and person B index the document independently, and an experienced arbitrator (ARB) reviews any discrepancies between the two.
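The A-B-ARB flow can be sketched as follows. This is an illustrative sketch, not FSI's actual implementation; the field names and the `arbitrate` callback are hypothetical stand-ins for the real system and the human arbitrator.

```python
def find_discrepancies(a_record, b_record):
    """Return the fields on which transcribers A and B disagree."""
    return [field for field in a_record if a_record[field] != b_record.get(field)]

def ab_arbitrate(a_record, b_record, arbitrate):
    """Accept fields where A and B agree; route disagreements to the arbitrator."""
    final = {}
    for field in a_record:
        if a_record[field] == b_record.get(field):
            final[field] = a_record[field]  # independent agreement, accept as-is
        else:
            # only the discrepancy goes to the experienced arbitrator
            final[field] = arbitrate(field, a_record[field], b_record[field])
    return final

# Example: two independent transcriptions of one census line.
a = {"surname": "Smith", "gender": "M", "age": "34"}
b = {"surname": "Smyth", "gender": "M", "age": "34"}
# A toy arbitrator that always sides with A, for illustration only.
resolved = ab_arbitrate(a, b, lambda field, av, bv: av)
```

Note that only the disagreeing fields reach the arbitrator, which is what keeps the arbitration workload manageable.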
This is a proposed model that had not been tested until this study. The model could include arbitration (RARB), or that step could be skipped if A-R results in high enough quality on its own (see findings).
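By contrast, the peer-review flow can be sketched like this (again purely illustrative; the `review` callback stands in for a human reviewer editing A's pre-filled work):

```python
def peer_review(a_record, review):
    """R sees A's work pre-filled and edits only fields that look wrong.

    Returns the reviewed record plus the list of fields R changed,
    which is what an optional RARB step would then examine.
    """
    reviewed = dict(a_record)
    changed = []
    for field, value in a_record.items():
        new_value = review(field, value)
        if new_value != value:
            reviewed[field] = new_value
            changed.append(field)
    return reviewed, changed

# Example: the reviewer corrects a misread surname and leaves the rest alone.
a = {"surname": "Smyth", "gender": "M", "age": "34"}
fixed, changed = peer_review(a, lambda f, v: "Smith" if f == "surname" else v)
```

The key design difference from A-B-ARB is that R never transcribes from scratch, which is where the efficiency gain comes from.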
In Act I, quality is measured as agreement between independent coders. This is not true quality, but it is highly correlated with it. In Act II, quality is measured against a truth set created by a company that assured 99.9% accuracy and was independently audited by expert FSI transcribers. Efficiency is measured in terms of "active" time spent indexing (after "idle time" was removed) and keystrokes, as captured by the indexing program.
Quality (estimated based on A-B agreement)
- Measures difficulty more than actual quality
- Underestimates quality, since an experienced Arbitrator reviews all A-B disagreements
- Good at capturing differences across people, fields, and projects

Time (calculated using keystroke-logging data)
- Idle time is tracked separately, making actual time measurements more accurate
- Outliers removed
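These two measures could be computed roughly as follows. This is a hedged sketch of the idea, not FSI's actual code; in particular, the 30-second idle threshold is an assumed value for illustration.

```python
IDLE_THRESHOLD = 30.0  # seconds; assumed cutoff, gaps longer than this count as idle

def agreement_rate(a_fields, b_fields):
    """Fraction of fields on which two independent transcribers agree."""
    matches = sum(1 for a, b in zip(a_fields, b_fields) if a == b)
    return matches / len(a_fields)

def active_time(keystroke_times):
    """Total indexing time with idle gaps between keystrokes removed."""
    total = 0.0
    for prev, cur in zip(keystroke_times, keystroke_times[1:]):
        gap = cur - prev
        if gap <= IDLE_THRESHOLD:
            total += gap
    return total
```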
Notice the high variation in agreement depending on how many possible values a field has (e.g., gender has only two options, while surname has many).
This finding is likely due to the fact that most transcribers are English speakers, which suggests the need to recruit contributors who are native speakers of other languages.
Experience is based on EL(U) = round(log5(N(U))), where U represents the transcriber, N(U) is the number of images that U has transcribed, and EL(U) is the experience level of U.

Rank   Number of images transcribed
0      1
1      5
2      25
3      125
4      625
5      3,125
6      15,625
7      78,125
8      390,625
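The formula above translates directly to code (a minimal sketch; `round` here mirrors the paper's definition, and the level boundaries fall at the powers of 5 listed in the rank table above):

```python
import math

def experience_level(n_images):
    """EL(U) = round(log5(N(U))): experience level after n_images transcribed."""
    return round(math.log(n_images, 5))
```

Because the scale is logarithmic, each additional level requires roughly five times as many transcribed images as the last.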
There isn’t much improvement, since it’s an “easy” field to agree on. In other words, even novices are good.
Here there isn’t much improvement either, but the overall agreement is low. This suggests that even experts are not good, likely because of unfamiliarity with Canadian placenames given the predominantly US indexing population. Remember that expertise is based on all contributions, not just those in this category.
More experienced transcribers are much faster (up to 4 times faster) than inexperienced users. They also use fewer keystrokes (e.g., invoking help functionality, fixing mistakes…). Though not shown here, the paper shows that experienced indexers’ work also requires less time to arbitrate and fewer keystrokes. Furthermore, transcribers of the English version of the 1871 Canadian Census were 2.68 seconds faster per line than those working on the French version, even though the French version required more keystrokes. Again, this is likely because most transcribers are native English speakers.
2,000 random images, each with 50 lines of data (one row per individual) and many fields per line (e.g., surname, county of origin, gender, age). Note that this is repeated-measures data, since the same transcriber transcribes all 50 rows of an image in a “batch” and some people transcribe more than one page; we use a mixed model to account for this. Because the people performing R were new to this method and the system was not tuned to the needs of reviewers, the A-R-RARB data should be considered a baseline – i.e., a lower bound on how well A-R-RARB can do.
A new approach based on peer review instead of independent indexing would likely improve efficiency, but its effect on quality is unknown. Anecdotal evidence suggests that peer reviewing may be twice as fast as indexing from scratch.
This is likely due to the fact that most R edits fix problems – they rarely introduce new ones. However, RARB does not know who A or R is, and RARB erroneously agrees with A too often. This is why there is no gain from RARB, and in fact some small losses in quality.
There are clear gains in time for the A-R model, because reviewing takes about half as much time as transcribing from scratch.
Remember, in our study Peer Review was a new method for those performing it, and the system hadn’t been customized to support it well; with some minor improvements and training it may do as well as A-B-ARB.
QUALITY CONTROL MECHANISMS FOR CROWDSOURCING: PEER REVIEW, ARBITRATION, & EXPERTISE AT FAMILYSEARCH INDEXING
CSCW, San Antonio, TX, Feb 26, 2013
Derek Hansen, Patrick Schone, Douglas Corey, Matthew Reid, & Jake Gehring
FSI in Broader Landscape
• Crowdsourcing Project: Aggregates discrete tasks completed by volunteers who replace professionals (Howe, 2006; Doan, et al., 2011)
• Human Computation System: Humans use a computational system to work on a problem that may someday be solvable by computers (Quinn & Bederson, 2011)
• Lightweight Peer Production: Largely anonymous contributors independently completing discrete, repetitive tasks provided by authorities (Haythornthwaite, 2009)
Design Challenge: Improve efficiency without sacrificing quality
[Chart: amount of scanned documents over time]
Quality Control Mechanisms
• 9 types of quality control for human computation systems (Quinn & Bederson, 2011)
  • Redundancy
  • Multi-level review
• Find-Fix-Verify pattern (Bernstein, et al., 2010)
• Weight proposed solutions by reputation of contributor (McCann, et al., 2003)
• Peer or expert oversight (Cosley, et al., 2005)
• Tournament selection approach (Sun, et al., 2011)
A-B-Arbitrate process (A-B-ARB): Currently Used Mechanism
[Diagram: A and B index independently; ARB reviews their disagreements]
Peer review process (A-R-RARB): Proposed Mechanism
[Diagram: A indexes; R reviews A’s already-filled-in work; RARB (optional?) reviews R]
Two Act Play

Act I: Experience
What is the role of experience on quality and efficiency?
Historical data analysis using full US and Canadian Census records from 1920 and earlier

Act II: Quality Control
Is peer review or arbitration better in terms of quality and efficiency?
Field experiment using 2,000 images from the 1930 US Census data & corresponding truth set
Act I: Experience
Quality is estimated based on A-B agreement (no truth set)
Efficiency calculated using keystroke-logging data with idle time and outliers removed
Summary & Implications of Act I
Experienced workers are faster and more accurate, gains which continue even at high levels
- Focus on retention
- Encourage both novices & experts to do more
- Develop interventions to speed up experience gains (e.g., send users common mistakes made by people at their experience level)
Summary & Implications of Act I
Contextual knowledge (e.g., Canadian placenames) and specialized skills (e.g., French language fluency) are needed for some tasks
- Recruit people with existing knowledge & skills
- Provide contextual information when possible (e.g., Canadian placename prompts)
- Don’t remove context (e.g., captcha)
- Allow users to specialize?
Act II: Quality Control
A-B-ARB data from original transcribers (Feb 2011)
A-R-RARB data includes original A data and newly collected R and RARB data from people new to this method (Jan-Feb of 2012)
Truth Set data from company with independent audit by FSI experts
Statistical Test: mixed-model logistic regression (accurate or not) with random effects, controlling for expertise
Limitations
• Experience levels of R and RARB were lower than expected, though we did statistically control for this
• Original B data used in A-B-ARB for certain fields was transcribed in a non-standard manner requiring adjustment
No Need for RARB
• No gains in quality from extra arbitration of peer reviewed data (A-R = A-R-RARB)
• RARB takes some time, so better without
Quality Comparison
• Both methods were statistically better than A alone
• A-B-ARB had slightly lower error rates than A-R
• R “missed” more errors, but also introduced fewer errors
Summary & Implications of Act II
Peer Review shows considerable efficiency gains with nearly as good quality as Arbitration
- Prime reviewers to find errors (e.g., prompt them with expected # of errors on a page)
- Highlight potential problems (e.g., let A flag tough fields)
- Route difficult pages to experts
- Consider an A-R1-R2 process when high quality is critical
Summary & Implications of Act II
Reviewing reviewers isn’t always worth the time
- At least in some contexts, Find-Fix may not need Verify
Quality of different fields varies dramatically
- Use different quality control mechanisms for harder or easier fields
Integrate human and algorithmic transcription
- Use algorithms on easy fields & integrate into review process so machine learning can occur
Questions
• Derek Hansen (email@example.com)
• Patrick Schone (BoiseBound@aol.com)
• Douglas Corey (firstname.lastname@example.org)
• Matthew Reid (email@example.com)
• Jake Gehring (GehringJG@familysearch.org)