Roots and Routes: Crowdsourced Manuscript Transcription Workshop


A three-hour workshop on crowdsourced transcription software, given for the University of Toronto's Roots and Routes seminar in 2012.

Published in: Technology, Education


  1. Crowdsourced Manuscript Transcription
     Ben Brumfield
     Roots and Routes 2012
  2. Not just crowdsourcing...
     ● Collaborative work
     ● Off-site solo work
     ● Private work
  3. Not just manuscripts...
     ● Maps
     ● Textiles
     ● Music
     ● Flawed OCR
  4. Not just transcription...
     ● Indexing
     ● Editing
     ● Identification
     Counting seals on Arctic ice caps.
  5. What it isn't
     We'll concentrate on web-based tools for extracting text from images, not addressing:
     ● Oral History
     ● Video
     ● Audio Transcription
     ● Image Manipulation
     ● Transcription/Facsimile Display
     Tools do exist for these tasks, however.
  6. Break
     What materials are you working with outside of modern, printed books and websites?
  7. Origins (Approaches)
     Two Approaches and one Dead End
     ● Indexing
     ● Editing
     ● Tagging
  8. Indexing
     ● Structured Data
     ● Extracts from Text vs. Representing Text
     ● Databases for Search and Analysis
     ● Granular Quality Control
     ● Gamification
  9. Editing
     ● Books, Diaries, Letters, Articles
     ● Representing Text
     ● Traditional Editorial Workflow
     ● Digital or Print Editions
  10. Tagging
      ● Too small
      ● Too imprecise
  11. Origins (Traditions)
      ● OCR Correction
      ● Documentary Editing
      ● Genealogy
      ● Natural Science
      ● Astronomy
      Split this into 5 slides
  12. Online Tools
      ● Recent (none older than 2005)
      ● Influenced by origin
      ● Still pretty raw
      ● Most require tech expertise for set-up and customization
      ● All require making trade-offs
  13. Lab Session 1: Breadth
      ● NYPL What's on the Menu: Indexing
      ● Wikisource: Editing
  14. Selection Factors
      ● Source Material
      ● Transcript Purpose
      ● Organizational/Project Management Fit
      ● Financial and Technical Resources
  15. Source Material
      Evaluating your source material:
      ● Is it of interest to anyone else?
      ● Is it under copyright?
      ● Does it need restricted access?
      ● Is it composed of documents or records?
      ● Is it non-textual?
      ● How complex is the layout? How important is that layout?
  16. Purpose
      How will you be using the transcribed data?
      ● Traditional print editions
      ● Searchable online editions
      ● Do you want to use the system to analyze the text?
      ● How do you want to analyze the text?
      ● Is public engagement a goal?
      ● Should the transcripts be open?
  17. Organizational/Project Management Fit
      ● How important is traditional editorial workflow?
      ● Will you rely on volunteers? How will you motivate them?
      ● What is the duration of the project?
      ● Is there a "final version"?
      ● Is TEI a mandate?
  18. Financial and Technical Resources
      Do you have or need:
      ● System administrators to install non-hosted software?
      ● Money to pay hosting costs?
      ● Programming skills to customize a tool?
      ● Money to pay programmers for customization?
      ● Support for ongoing costs to keep the site running, however small?
  19. Lab Session 2: Markup Options
      ● FromThePage
      ● TranscribeBentham
  20. Technical Questions to Answer
      ● Where are the images now?
      ● How do images get into the system?
      ● How do transcripts get out of the system?
      ● How mature is the underlying technology?
      ● How configurable is the technology?
      ● How does the system work with the public face of your project?
      ● Where does the metadata live?
      ● Who will maintain this? How long?
      ● How many sites are using this system?
  21. Wikisource
      Pro:
      ● MediaWiki plus its add-on modules (e.g. print-on-demand, export).
      ● Wikimedia community.
      ● Incredibly mature.
      Con:
      ● Wikimedia policy.
      ● Public editing.
      ● Limited mark-up.
  22. Bentham Transcription Desk
      Pro:
      ● MediaWiki is very mature.
      ● TEI Toolbar (can also be used on other systems).
      ● Deployed outside original project.
      Con:
      ● Development efforts halted.
  23. Scripto
      Pro:
      ● Team at CHNM has a great track record.
      ● Your CMS is your public face.
      ● MediaWiki is very mature.
      ● Deployed and under active development.
      Con:
      ● Your CMS handles all metadata.
      ● Mark-up is extremely limited.
  24. FromThePage
      Pro:
      ● Designed for intensive editing and indexing.
      ● Semantic mark-up and analysis.
      ● Hosting available.
      Con:
      ● Single developer (me).
      ● No TEI mark-up.
  25. Islandora TEI Editor
      Caveat: I don't know much about this tool or this team.
      ● Based on Drupal and Fedora.
      ● Supports TEI via friendly interface.
      ● Many Drupal-based projects considering it.
  26. T-PEN
      Caveat: I don't know much about this tool.
      ● Designed for medieval manuscripts.
      ● Supports TEI natively.
      ● Line-by-line interface.
      ● Hosted version available.
  27. Scribe
      Pro:
      ● Excellent for complex layout or non-documentary transcription.
      ● Zooniverse team is large, well-funded, experienced.
      ● Configurable.
      Con:
      ● No automated tool for loading images or viewing transcript database (yet!)
      ● No concept of image-as-a-text.
  28. Pybossa
      Caveat: I don't know much about this tool or this team.
      ● Open Knowledge Foundation's crowdsourcing task management tool.
      ● Designed for tabular data.
      ● Google Spreadsheet data entry.
      ● Extremely young.
  29. TextLab
      Caveat: I don't know much about this tool or this team.
      ● Melville Electronic Library.
      ● Direct addition of TEI tags to image.
  30. Lab Session 3: Configuration
      Scribe: Old Weather, What's the Score, development deployments
  31. Find me
      Ben Brumfield
      @benwbrum