Roots and Routes: Crowdsourced Manuscript Transcription Workshop

A three-hour workshop on crowdsourced transcription software for the University of Toronto's Roots and Routes seminar in 2012.

Published in: Technology, Education

  1. Crowdsourced Manuscript Transcription
     Ben Brumfield
     Roots and Routes 2012
  2. Not just crowdsourcing...
     ● Collaborative work
     ● Off-site solo work
     ● Private work
  3. Not just manuscripts...
     ● Maps
     ● Textiles
     ● Music
     ● Flawed OCR
  4. Not just transcription...
     ● Indexing
     ● Editing
     ● Identification
     Counting seals on Arctic ice caps.
  5. What it isn't
     We'll concentrate on web-based tools for extracting text from images, not addressing:
     ● Oral History
     ● Video
     ● Audio Transcription
     ● Image Manipulation
     ● Transcription/Facsimile Display
     Tools exist for these tasks, nevertheless.
  6. Break
     What materials are you working with outside of modern, printed books and websites?
  7. Origins (Approaches)
     Two Approaches and one Dead End
     ● Indexing
     ● Editing
     ● Tagging
  8. Indexing
     ● Structured Data
     ● Extracts from Text vs. Representing Text
     ● Databases for Search and Analysis
     ● Granular Quality Control
     ● Gamification
  9. Editing
     ● Books, Diaries, Letters, Articles
     ● Representing Text
     ● Traditional Editorial Workflow
     ● Digital or Print Editions
  10. Tagging
      ● Too small
      ● Too imprecise
  11. Origins (Traditions)
      ● OCR Correction
      ● Documentary Editing
      ● Genealogy
      ● Natural Science
      ● Astronomy
      Split this into 5 slides
  12. Online Tools
      ● Recent (none older than 2005)
      ● Influenced by origin
      ● Still pretty raw
      ● Most require tech expertise for set-up and customization
      ● All require making trade-offs
  13. Lab Session 1: Breadth
      NYPL What's on the Menu: Indexing
      Wikisource: Editing
  14. Selection Factors
      ● Source Material
      ● Transcript Purpose
      ● Organizational/Project Management Fit
      ● Financial and Technical Resources
  15. Source Material
      Evaluating your source material:
      ● Is it of interest to anyone else?
      ● Is it under copyright?
      ● Does it need restricted access?
      ● Is it composed of documents or records?
      ● Is it non-textual?
      ● How complex is the layout? How important is that layout?
  16. Purpose
      How will you be using the transcribed data?
      ● Traditional print editions
      ● Searchable online editions
      ● Do you want to use the system to analyze the text?
      ● How do you want to analyze the text?
      ● Is public engagement a goal?
      ● Should the transcripts be open?
  17. Organizational/Project Management Fit
      ● How important is traditional editorial workflow?
      ● Will you rely on volunteers? How will you motivate them?
      ● What is the duration of the project?
      ● Is there a "final version"?
      ● Is TEI a mandate?
  18. Financial and Technical Resources
      Do you have or need:
      ● System administrators to install non-hosted software?
      ● Money to pay hosting costs?
      ● Programming skills to customize a tool?
      ● Money to pay programmers for customization?
      ● Support for on-going costs to keep the site running, however small?
  19. Lab Session 2: Markup Options
      FromThePage
      TranscribeBentham
  20. Technical Questions to Answer
      ● Where are the images now?
      ● How do images get into the system?
      ● How do transcripts get out of the system?
      ● How mature is the underlying technology?
      ● How configurable is the technology?
      ● How does the system work with the public face of your project?
      ● Where does the metadata live?
      ● Who will maintain this? How long?
      ● How many sites are using this system?
  21. Wikisource
      Pro:
      ● MediaWiki plus its add-on modules (e.g. print-on-demand, export).
      ● Wikimedia community.
      ● Incredibly mature.
      Con:
      ● Wikimedia policy.
      ● Public editing.
      ● Limited mark-up.
  22. Bentham Transcription Desk
      Pro:
      ● MediaWiki is very mature.
      ● TEI Toolbar (can also be used on other systems)
      ● Deployed outside original project.
      Con:
      ● Development efforts halted.
  23. Scripto
      Pro:
      ● Team at CHNM has a great track record.
      ● Your CMS is your public face.
      ● MediaWiki is very mature.
      ● Deployed and under active development.
      Con:
      ● Your CMS handles all metadata.
      ● Mark-up is extremely limited.
  24. FromThePage
      Pro:
      ● Designed for intensive editing and indexing.
      ● Semantic mark-up and analysis.
      ● Hosting available.
      Con:
      ● Single developer (me).
      ● No TEI mark-up.
  25. Islandora TEI Editor
      Caveat: I don't know much about this tool or this team.
      ● Based on Drupal and Fedora
      ● Supports TEI via friendly interface
      ● Many Drupal-based projects considering it.
  26. T-PEN
      Caveat: I don't know much about this tool.
      ● Designed for medieval manuscripts.
      ● Supports TEI natively.
      ● Line-by-line interface.
      ● Hosted version available.
  27. Scribe
      Pro:
      ● Excellent for complex layout or non-documentary transcription.
      ● Zooniverse team is large, well-funded, experienced.
      ● Configurable.
      Con:
      ● No automated tool for loading images or viewing transcript database (yet!)
      ● No concept of image-as-a-text.
  28. Pybossa
      Caveat: I don't know much about this tool or this team.
      ● Open Knowledge Foundation's crowdsourcing task management tool.
      ● Designed for tabular data.
      ● Google Spreadsheet data entry.
      ● Extremely young.
  29. TextLab
      Caveat: I don't know much about this tool or this team.
      ● Melville Electronic Library.
      ● Direct addition of TEI tags to image.
  30. Lab Session 3: Configuration
      Scribe: Old Weather, What's the Score, Development deployments
  31. Find me
      Ben Brumfield
      benwbrum@gmail.com
      http://manuscripttranscription.blogspot.com/
      @benwbrum
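
As a rough illustration of the distinction drawn in slides 7 to 9, the sketch below contrasts the two approaches on a single made-up diary line: editing keeps the whole text as a transcript, while indexing extracts structured records from it for search and analysis. The sample text, the class and field names, and the regular expression are all assumptions for illustration; none of them comes from the tools discussed in the slides.

```python
import re
from dataclasses import dataclass

# Hypothetical transcribed line; not taken from any real manuscript.
PAGE_TEXT = "Jan 5 1921. Sold 3 hogs to Mr. Smith. Rain all day."


@dataclass
class EditedPage:
    """Editing approach: the record represents the text itself."""
    page_number: int
    transcript: str


@dataclass
class IndexRecord:
    """Indexing approach: a structured extract taken from the text."""
    page_number: int
    field: str
    value: str


def edit(page_number, text):
    # Keep the full transcript so it can feed a digital or print edition.
    return EditedPage(page_number, text)


def index(page_number, text):
    # Pull out only the pieces of interest (here, titled surnames)
    # so they can be loaded into a database for search and analysis.
    records = []
    for match in re.finditer(r"(Mr|Mrs)\. ([A-Z][a-z]+)", text):
        records.append(IndexRecord(page_number, "person", match.group(0)))
    return records


if __name__ == "__main__":
    print(edit(1, PAGE_TEXT))
    print(index(1, PAGE_TEXT))
```

The same page image can feed either model; the choice depends on whether the project needs the text represented in full or only the data extracted from it.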