Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
The Landscape of
Crowdsourcing and Transcription
Ben Brumfield
Duke University
November 20, 2013
Methodological Origins
What is transcription?
Indexing

•
•
•
•
•

Structured Data
Extracting from Text
Databases for Search and Analysis
Granular Quality Control
Gamif...
Editing

•
•
•
•

Books, Diaries, Letters, Articles
Representing Text
Traditional Editorial Workflow
Digital or Print Edit...
Community Origins

•
•
•
•
•
•

Libraries and Archives
Documentary & Scholarly Editing
Genealogy
Bioinformatics & Astronom...
Libraries and Archives

•

Material:
– Hand-written letters
–

•
•
•

OCRed newspaper articles

Goal: Findability
Format: ...
Documentary & Scholarly Editing

•

Material:
– Literary drafts
–

•
•
•

Historic correspondence

Goal: High-quality edit...
Genealogy

• Material: Handwritten records
• Goal: Findability
• Format: Structured data
–

Spreadsheets

–

Proprietary d...
Bioinformatics

• Material: Specimens
• Goal: Analysis
• Format: Custom Databases
• Destination:
–

Analytic Databases

–
...
Investigative Journalism

• Material: Receipts, FOIA Responses
• Goal: Findability
• Format: Custom Databases
• Destinatio...
Free Culture

• Material: OCR and e-Texts
• Goal: Readability
• Format: Plaintext, wiki mark-up
• Destination: Digital edi...
How it works
●

Who are the volunteers?
How it works
●
●

Who are the volunteers?
Why do they volunteer?
How it works
●
●
●

Who are the volunteers?
Why do they volunteer?
What about accuracy?
OldWeather Participation
●
●
●

●

More than 1.6 million weather observations.
16,000 volunteers.
1 million log pages tran...
OldWeather Participation
●
●
●

●

More than 1.6 million weather observations.
16,000 volunteers.
1 million log pages tran...
Power-law Distribution
●

Most contributions are made by a core of wellinformed enthusiasts.

●

True regardless of projec...
Objection!
●

●

●

“Can we really believe there is a crowd out
there that is capable of producing publishable
translation...
“That's when it dawned on me: what you need for mass
volunteer projects isn't actually crowd-sourcing, but
nerd-sourcing. ...
One “Well-Informed Enthusiast”
●

In 14 days,
–
–
–

Entire diary transcribed
250 revisions to 43 pages
Two dozen footnote...
Quality Control
Quality Control
OldWeather Accuracy
●

●

Individual transcriptions are about 97%
accurate
Of 1000 transcribed logbook entries:
–
–
–

3 w...
Costs and Results
●

Harry Ransom Center Manuscript Fragments

●

Stadarchief Leuven Itinera Nova
HRC Manuscript Fragments
●

$0 capital budget
–
–

●

Images captured with a camera phone.
Crowdsourcing platform was Flic...
HRC Manuscript Fragments
60

50

40

30

20

10

0
Staff

Volunteers

Other Scholars

Unidentified

Unidentifiable
HRC Manuscript Fragments
“What is 30,000+ views, 284 twitter followers,
and 147 facebook followers worth? [...] I know
for...
HRC Manuscript Fragments
“The biggest lesson for me is that you've got to
engage your contributors more actively than I
di...
Free as in puppy!

http://www.flickr.com/photos/magnusbrath/7614518858/
Itinera Nova
Itinera Nova
●

765 registers conserved by 2 volunteers

●

486 registers photographed

●

301,201 images processed (25% b...
Itinera Nova
●

Budget for first three years
–

4 person-years staff salaries

–

Professional-quality book scanner

–

So...
More Resources
●

People
–

Mia Ridge

–

Chris Lintott (Zooniverse)

–

Micah Erwin (HRC Manuscript Fragments)

–

Meliss...
Questions?
Ben Brumfield
benwbrum@gmail.com
http://fromthepage.com/
http://tinyurl.com/TranscriptionToolGDoc
http://manusc...
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)
Upcoming SlideShare
Loading in …5
×

The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)

7,695 views

Published on

One of the most popular applications of crowdsourcing to cultural heritage is transcription. Since OCR software doesn't recognize handwriting, human volunteers are converting letters, diaries, and log books into formats that can be read, mined, searched, and used to improve collection metadata. But cultural heritage institutions aren't the only organizations working with handwritten material, and many innovations are happening within investigative journalism, citizen science, and genealogy.

This talk will present an overview of the landscape of crowdsourced transcription: where it came from, who's doing it, and the kinds of contributions their volunteers make, followed by a discussion of motivation, participation, recruitment, and quality controls.

(Video available at http://www.youtube.com/watch?v=jNrTC4Y0_dk )

Published in: Technology, Education
  • Be the first to comment

The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)

  1. 1. The Landscape of Crowdsourcing and Transcription Ben Brumfield Duke University November 20, 2013
  2. 2. Methodological Origins What is transcription?
  3. 3. Indexing • • • • • Structured Data Extracting from Text Databases for Search and Analysis Granular Quality Control Gamification
  4. 4. Editing • • • • Books, Diaries, Letters, Articles Representing Text Traditional Editorial Workflow Digital or Print Editions
  5. 5. Community Origins • • • • • • Libraries and Archives Documentary & Scholarly Editing Genealogy Bioinformatics & Astronomy Investigative Journalism Free Culture
  6. 6. Libraries and Archives • Material: – Hand-written letters – • • • OCRed newspaper articles Goal: Findability Format: Plaintext transcripts Destination: Search engines, finding aids
  7. 7. Documentary & Scholarly Editing • Material: – Literary drafts – • • • Historic correspondence Goal: High-quality editions Format: TEI or other XML Destination: Human-readable print or digital editions
  8. 8. Genealogy • Material: Handwritten records • Goal: Findability • Format: Structured data – Spreadsheets – Proprietary databases • Destination: Searchable databases
  9. 9. Bioinformatics • Material: Specimens • Goal: Analysis • Format: Custom Databases • Destination: – Analytic Databases – Scientific Journals – Museum Collection Databases
  10. 10. Investigative Journalism • Material: Receipts, FOIA Responses • Goal: Findability • Format: Custom Databases • Destination: News Articles
  11. 11. Free Culture • Material: OCR and e-Texts • Goal: Readability • Format: Plaintext, wiki mark-up • Destination: Digital editions
  12. 12. How it works ● Who are the volunteers?
  13. 13. How it works ● ● Who are the volunteers? Why do they volunteer?
  14. 14. How it works ● ● ● Who are the volunteers? Why do they volunteer? What about accuracy?
  15. 15. OldWeather Participation ● ● ● ● More than 1.6 million weather observations. 16,000 volunteers. 1 million log pages transcribed. Mean contribution of 100 transcriptions per user.
  16. 16. OldWeather Participation ● ● ● ● More than 1.6 million weather observations. 16,000 volunteers. 1 million log pages transcribed. Mean contribution of 100 transcriptions per user – but this statistic is worthless!
  17. 17. Power-law Distribution ● Most contributions are made by a core of wellinformed enthusiasts. ● True regardless of project size. ● What are the implications?
  18. 18. Objection! ● ● ● “Can we really believe there is a crowd out there that is capable of producing publishable translations?” “I suspect that for most medieval topics, the vulgus is just too indoctum to make the effort worthwhile.” “I cannot see where there is the expertise out there to make this work.”
  19. 19. “That's when it dawned on me: what you need for mass volunteer projects isn't actually crowd-sourcing, but nerd-sourcing. You need to find, among the vast number of vaguely interested, not very analytical people who look at web sites, the small number of tidy-minded obsessives who care deeply about the ethnic origins of Freddie Mercury or want to analyse statistical data for fun and no profit. And then you need to persuade these people to do as much work for you as you can. “The success of mass volunteering, therefore is going to depend heavily on the number of well-informed enthusiasts 'out there'.” Rachel Stone
  20. 20. One “Well-Informed Enthusiast” ● In 14 days, – – – Entire diary transcribed 250 revisions to 43 pages Two dozen footnotes
  21. 21. Quality Control
  22. 22. Quality Control
  23. 23. OldWeather Accuracy ● ● Individual transcriptions are about 97% accurate Of 1000 transcribed logbook entries: – – – 3 will be lost because of transcription errors 10 will be illegible At least 3 will be errors in the logs
  24. 24. Costs and Results ● Harry Ransom Center Manuscript Fragments ● Stadarchief Leuven Itinera Nova
  25. 25. HRC Manuscript Fragments ● $0 capital budget – – ● Images captured with a camera phone. Crowdsourcing platform was Flickr. Minimal staffing – 100 unpaid hours (July-October 2012) – 10-20 paid hours/week (March-August 2013)
  26. 26. HRC Manuscript Fragments 60 50 40 30 20 10 0 Staff Volunteers Other Scholars Unidentified Unidentifiable
  27. 27. HRC Manuscript Fragments “What is 30,000+ views, 284 twitter followers, and 147 facebook followers worth? [...] I know for sure that it has suddenly put the HRC on the map for a lot of medievalists who assumed this institution was not all that interested in that area.” Micah Erwin
  28. 28. HRC Manuscript Fragments “The biggest lesson for me is that you've got to engage your contributors more actively than I did.” Micah Erwin
  29. 29. Free as in puppy! http://www.flickr.com/photos/magnusbrath/7614518858/
  30. 30. Itinera Nova
  31. 31. Itinera Nova ● 765 registers conserved by 2 volunteers ● 486 registers photographed ● 301,201 images processed (25% by volunteers) ● 12,033 “acts” transcribed ● 35-40 volunteers, 1 full-time staff member
  32. 32. Itinera Nova ● Budget for first three years – 4 person-years staff salaries – Professional-quality book scanner – Software development – Tutorial development – Exhibitions – Conferences
  33. 33. More Resources ● People – Mia Ridge – Chris Lintott (Zooniverse) – Micah Erwin (HRC Manuscript Fragments) – Melissa Terras (Transcribe Bentham) – Dominic McDevitt-Parks (Wikisource) – Paul Flemons (Biodiversity Volunteer Portal)
  34. 34. Questions? Ben Brumfield benwbrum@gmail.com http://fromthepage.com/ http://tinyurl.com/TranscriptionToolGDoc http://manuscripttranscription.blogspot.com

×