The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)

One of the most popular applications of crowdsourcing to cultural heritage is transcription. Since OCR software doesn't recognize handwriting, human volunteers are converting letters, diaries, and log books into formats that can be read, mined, searched, and used to improve collection metadata. But cultural heritage institutions aren't the only organizations working with handwritten material, and many innovations are happening within investigative journalism, citizen science, and genealogy.

This talk will present an overview of the landscape of crowdsourced transcription: where it came from, who's doing it, and the kinds of contributions their volunteers make, followed by a discussion of motivation, participation, recruitment, and quality controls.

(Video available at http://www.youtube.com/watch?v=jNrTC4Y0_dk )

Speaker Notes
  • FamilySearch Indexing's approach to the Freedmen's letters: FamilySearch is good at answering "who is mentioned in this document?"; its output is an index of people. All they asked the volunteers to do was record the first three names mentioned in each letter. This is a misapplication of indexing.
  • Limitations of editing: in digital editions, what do you do with tabular data (records) kept in manuscripts? E.g. the account keeping in the Jeremiah Graves diaries: how much to pay the schoolmaster, how much he earned hiring out his slaves. It's currently represented as text, which means you can't ask questions of the data the way you can with a spreadsheet. This may change: Kathryn Tomasek has an NEH start-up grant to represent the Wheaton account books in TEI and to develop markup extensions covering financial records. This might be the right solution. (The first sketch after these notes illustrates the text-versus-structured-data difference.)
  • Inspiration: Project Gutenberg's Distributed Proofreaders.
    Wikisource: what Wikipedia is not. When I first started editing Wikipedia in 2002, there were a lot of arguments about what belonged on Wikipedia and what didn't. The full text of Darwin's Origin of Species was one great big Wikipedia article. Do primary texts belong on Wikipedia? No, so Wikisource was created (~2003) to be a repository of original sources, cut and pasted from Project Gutenberg or manually keyed by someone staring at a book.
    Anecdote: in 2005-2006 an editor of the French-language Wikisource, inspired by Distributed Proofreaders, created a side-by-side facsimile display and editing screen: a MediaWiki plugin called ProofreadPage. It was pretty rapidly extended to consume the DjVu files produced by the Internet Archive and Open Content Alliance book scanners, so it would display the page image and pre-populate the editable text with the OCR. ProofreadPage remained a tool for OCR correction only, because of the Wikisource policy that only sources previously published "on paper" were allowed. So despite being an ideal tool for manuscript transcription, it was only opened up for manuscripts last year. In a minute you'll have a lab where you get to try out ProofreadPage.
    Documentary editing: what humanities scholars are familiar with: preparing editions, from manuscript sources, for print publication. This is a separate tradition from OCR correction, but it has a lot in common with it. Examples: FromThePage, Scripto, Transcribe Bentham, T-PEN, Islandora TEI, and others. This is the origin of my own tool, FromThePage. We had inherited several diaries and wanted to produce print editions of them. I was inspired by Wikipedia's collaborative editing, version control, and the automated indexing that comes from wiki markup in the now-obscure "what links here" feature. So in 2005 I built FromThePage. You'll get a chance to try that out shortly as well.
    Genealogy: genealogists are primarily concerned with tracking down names, dates, and locations within documents. Grassroots programs build indexes that allow people to find documents that mention their ancestors. These originated with offline transcription feeding an online search database (many predate the internet): Ancestry.com, Van Papier Naar Digitaal ("from paper to digital"), and the UK-based FreeBMD, FreeCEN, and FreeREG. Many of these projects have moved online: Ancestry.com invites members to index records via their World Archives Project (a closed tool), FamilySearch Indexing is moving from a downloadable Java app to an online and mobile tool, and in March I started working with FreeUKGen to develop their online transcription tool for FreeREG and FreeCEN.
    Natural sciences:
    North American Bird Phenology Program
    Klauber field notes
    Astronomy:
    started as purely non-textual
    Galaxy Zoo -- solar flares and Planet Hunters
    Old Weather
    Ancient Lives
    What's the Score
  • When people talk about crowdsourcing, they talk about the ability to tag music and photos. New meaning can emerge when large groups of people categorize things using free-form, self-generated tags; the best discussion of this is Clay Shirky's talk "Ontology is Overrated". People call these "folksonomies": crowdsourced ontology. Tagging is similar to crowdsourced transcription, with humans looking at images and creating meaning about those images. But tagging is too limited: the data is too small and too unstructured. A genealogical indexing project that attempted to use tags would find it very difficult to represent people and families without duplication. And it is too imprecise: you're recording small pieces of information about an image without noting where on the image they are. (The second sketch after these notes contrasts flat tags with structured records.)
    But!
    Zooniverse Talk has potential, offering discussion fora around tags and the images they represent.
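
To make the tabular-data problem concrete, here is a minimal sketch in Python. The ledger entries are invented, not taken from the Graves diaries, and the record schema is hypothetical; the point is only the contrast between a running-text transcript and structured rows you can query like a spreadsheet.

```python
# A plain-text transcript preserves the words but not the table:
# you can search it, but you cannot sum a column or filter by payee.
transcript = """
Paid the school master        $2.50
Hire of hands to J. Wilson   $12.00
Paid the school master        $2.50
"""

# The same entries as structured records (invented values, purely
# illustrative) support the questions a spreadsheet could answer.
entries = [
    {"payee": "school master", "purpose": "tuition", "amount": 2.50},
    {"payee": "J. Wilson",     "purpose": "hire",    "amount": 12.00},
    {"payee": "school master", "purpose": "tuition", "amount": 2.50},
]

# "How much was paid to the schoolmaster in total?" is trivial on
# records, but unreliable to compute from running text.
total = sum(e["amount"] for e in entries if e["payee"] == "school master")
print(f"Total paid to the schoolmaster: ${total:.2f}")  # $5.00
```

A TEI encoding with financial-records extensions, as in the Wheaton account books project, would aim to capture exactly this kind of structure inside the edition itself.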
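And here is a second sketch, contrasting free-form tags with the structured records a genealogical index needs. The names, dates, and schema are all hypothetical; the point is that flat tags cannot keep two namesakes distinct or express family relationships.

```python
# Free-form tags: flat strings with no identity and no relationships.
# Two different men named John Smith collapse into a single tag, and
# nothing records who is whose father.
tags = {"john smith", "mary smith", "1843", "virginia"}

# Structured index records (hypothetical schema): each person gets an
# id, so namesakes stay distinct and relationships can be represented
# without duplication.
people = [
    {"id": 1, "name": "John Smith", "born": 1820, "children": [3]},
    {"id": 2, "name": "John Smith", "born": 1843, "children": []},
    {"id": 3, "name": "Mary Smith", "born": 1843, "children": []},
]

# "Which John Smith is Mary's father?" is answerable from the records...
father = next(p for p in people if 3 in p["children"])
print(father["id"], father["born"])  # 1 1820: the elder John Smith
# ...but unanswerable from the tag set above.
```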

Transcript

  • 1. The Landscape of Crowdsourcing and Transcription Ben Brumfield Duke University November 20, 2013
  • 2. Methodological Origins What is transcription?
  • 3. Indexing ● Structured Data ● Extracting from Text ● Databases for Search and Analysis ● Granular Quality Control ● Gamification
  • 4. Editing ● Books, Diaries, Letters, Articles ● Representing Text ● Traditional Editorial Workflow ● Digital or Print Editions
  • 5. Community Origins ● Libraries and Archives ● Documentary & Scholarly Editing ● Genealogy ● Bioinformatics & Astronomy ● Investigative Journalism ● Free Culture
  • 6. Libraries and Archives ● Material: Hand-written letters, OCRed newspaper articles ● Goal: Findability ● Format: Plaintext transcripts ● Destination: Search engines, finding aids
  • 7. Documentary & Scholarly Editing ● Material: Literary drafts, Historic correspondence ● Goal: High-quality editions ● Format: TEI or other XML ● Destination: Human-readable print or digital editions
  • 8. Genealogy • Material: Handwritten records • Goal: Findability • Format: Structured data – Spreadsheets – Proprietary databases • Destination: Searchable databases
  • 9. Bioinformatics • Material: Specimens • Goal: Analysis • Format: Custom Databases • Destination: – Analytic Databases – Scientific Journals – Museum Collection Databases
  • 10. Investigative Journalism • Material: Receipts, FOIA Responses • Goal: Findability • Format: Custom Databases • Destination: News Articles
  • 11. Free Culture • Material: OCR and e-Texts • Goal: Readability • Format: Plaintext, wiki mark-up • Destination: Digital editions
  • 12. How it works ● Who are the volunteers?
  • 13. How it works ● Who are the volunteers? ● Why do they volunteer?
  • 14. How it works ● Who are the volunteers? ● Why do they volunteer? ● What about accuracy?
  • 15. OldWeather Participation ● More than 1.6 million weather observations. ● 16,000 volunteers. ● 1 million log pages transcribed. ● Mean contribution of 100 transcriptions per user.
  • 16. OldWeather Participation ● More than 1.6 million weather observations. ● 16,000 volunteers. ● 1 million log pages transcribed. ● Mean contribution of 100 transcriptions per user – but this statistic is worthless!
  • 17. Power-law Distribution ● Most contributions are made by a core of well-informed enthusiasts. ● True regardless of project size. ● What are the implications? (A simulation sketch after this transcript shows why the mean is misleading.)
  • 18. Objection! ● "Can we really believe there is a crowd out there that is capable of producing publishable translations?" ● "I suspect that for most medieval topics, the vulgus is just too indoctum to make the effort worthwhile." ● "I cannot see where there is the expertise out there to make this work."
  • 19. “That's when it dawned on me: what you need for mass volunteer projects isn't actually crowd-sourcing, but nerd-sourcing. You need to find, among the vast number of vaguely interested, not very analytical people who look at web sites, the small number of tidy-minded obsessives who care deeply about the ethnic origins of Freddie Mercury or want to analyse statistical data for fun and no profit. And then you need to persuade these people to do as much work for you as you can. “The success of mass volunteering, therefore is going to depend heavily on the number of well-informed enthusiasts 'out there'.” Rachel Stone
  • 20. One "Well-Informed Enthusiast" ● In 14 days: – Entire diary transcribed – 250 revisions to 43 pages – Two dozen footnotes
  • 21. Quality Control
  • 22. Quality Control
  • 23. OldWeather Accuracy ● Individual transcriptions are about 97% accurate ● Of 1000 transcribed logbook entries: – 3 will be lost because of transcription errors – 10 will be illegible – At least 3 will be errors in the logs
  • 24. Costs and Results ● Harry Ransom Center Manuscript Fragments ● Stadsarchief Leuven Itinera Nova
  • 25. HRC Manuscript Fragments ● $0 capital budget: – Images captured with a camera phone. – Crowdsourcing platform was Flickr. ● Minimal staffing: – 100 unpaid hours (July-October 2012) – 10-20 paid hours/week (March-August 2013)
  • 26. HRC Manuscript Fragments [bar chart of contributions by source: Staff, Volunteers, Other Scholars, Unidentified, Unidentifiable; vertical axis 0-60]
  • 27. HRC Manuscript Fragments “What is 30,000+ views, 284 twitter followers, and 147 facebook followers worth? [...] I know for sure that it has suddenly put the HRC on the map for a lot of medievalists who assumed this institution was not all that interested in that area.” Micah Erwin
  • 28. HRC Manuscript Fragments “The biggest lesson for me is that you've got to engage your contributors more actively than I did.” Micah Erwin
  • 29. Free as in puppy! http://www.flickr.com/photos/magnusbrath/7614518858/
  • 30. Itinera Nova
  • 31. Itinera Nova ● 765 registers conserved by 2 volunteers ● 486 registers photographed ● 301,201 images processed (25% by volunteers) ● 12,033 “acts” transcribed ● 35-40 volunteers, 1 full-time staff member
  • 32. Itinera Nova ● Budget for first three years – 4 person-years staff salaries – Professional-quality book scanner – Software development – Tutorial development – Exhibitions – Conferences
  • 33. More Resources ● People – Mia Ridge – Chris Lintott (Zooniverse) – Micah Erwin (HRC Manuscript Fragments) – Melissa Terras (Transcribe Bentham) – Dominic McDevitt-Parks (Wikisource) – Paul Flemons (Biodiversity Volunteer Portal)
  • 34. Questions? Ben Brumfield benwbrum@gmail.com http://fromthepage.com/ http://tinyurl.com/TranscriptionToolGDoc http://manuscripttranscription.blogspot.com
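
The "worthless mean" of slides 15-17 deserves a worked illustration. Here is a minimal simulation, assuming a Pareto-distributed contribution count; the exponent, the random seed, and the rescaling are all made up for illustration, and only the 16,000 volunteers and the mean of 100 come from the slides.

```python
import random

random.seed(42)

# Draw per-volunteer contribution counts from a heavy-tailed Pareto
# distribution (exponent 1.2 is an arbitrary illustrative choice),
# then rescale so the mean lands at the reported 100 per user.
raw = [random.paretovariate(1.2) for _ in range(16_000)]
scale = 100 / (sum(raw) / len(raw))
contributions = sorted(x * scale for x in raw)

mean = sum(contributions) / len(contributions)
median = contributions[len(contributions) // 2]
top_share = sum(contributions[-160:]) / sum(contributions)  # top 1%

print(f"mean contribution:   {mean:7.1f}")   # 100.0 by construction
print(f"median contribution: {median:7.1f}") # far below the mean
print(f"share of work done by the top 1%: {top_share:.0%}")
```

Under a distribution like this, the typical volunteer contributes a few dozen transcriptions while a small core of well-informed enthusiasts does most of the work, which is why a mean of 100 says almost nothing about either group.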