Roots and Routes: Crowdsourced Manuscript Transcription Workshop
Crowdsourced Manuscript Transcription
Ben Brumfield
Roots and Routes 2012
Not just crowdsourcing...
● Collaborative work
● Off-site solo work
● Private work
Not just manuscripts...
● Maps
● Textiles
● Music
● Flawed OCR
Not just transcription...
● Indexing
● Editing
● Identification
Counting seals on Arctic ice caps.
What it isn't
We'll concentrate on web-based tools for extracting text from images, not addressing:
● Oral History
● Video
● Audio Transcription
● Image Manipulation
● Transcription/Facsimile Display
Tools exist for these tasks, nevertheless.
Break
What materials are you working with outside of modern, printed books and websites?
Origins (Approaches)
Two Approaches and one Dead End
● Indexing
● Editing
● Tagging
Indexing
● Structured Data
● Extracts from Text vs. Representing Text
● Databases for Search and Analysis
● Granular Quality Control
● Gamification
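One way to read "Granular Quality Control": because indexing yields discrete fields rather than running text, two independent entries for the same record can be compared field by field. The sketch below is illustrative only. The field names, sample values, and the `compare_entries` helper are assumptions, not taken from any tool discussed in this workshop.

```python
from typing import Dict, List


def compare_entries(first: Dict[str, str], second: Dict[str, str]) -> List[str]:
    """Return the names of fields on which two independent entries disagree.

    Comparison ignores case and surrounding whitespace; a field missing
    from one entry is treated as an empty string.
    """
    fields = sorted(set(first) | set(second))
    return [
        name
        for name in fields
        if first.get(name, "").strip().lower() != second.get(name, "").strip().lower()
    ]


# Hypothetical entries keyed by two volunteers from the same record image.
entry_a = {"surname": "Smith", "given_name": "John", "year": "1862"}
entry_b = {"surname": "smith ", "given_name": "Jno.", "year": "1862"}

# Only the disputed fields need review, not the whole record.
disputed = compare_entries(entry_a, entry_b)
print(disputed)
```

Only the disputed fields would then go to a reviewer, which is harder to arrange when the output is a free-text transcript.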
Editing
● Books, Diaries, Letters, Articles
● Representing Text
● Traditional Editorial Workflow
● Digital or Print Editions
Source Material
Evaluating your source material:
● Is it of interest to anyone else?
● Is it under copyright?
● Does it need restricted access?
● Is it composed of documents or records?
● Is it non-textual?
● How complex is the layout? How important is that layout?
Purpose
How will you be using the transcribed data?
● Traditional print editions
● Searchable online editions
● Do you want to use the system to analyze the text?
● How do you want to analyze the text?
● Is public engagement a goal?
● Should the transcripts be open?
Organizational/Project Management Fit
● How important is traditional editorial workflow?
● Will you rely on volunteers? How will you motivate them?
● What is the duration of the project?
● Is there a "final version"?
● Is TEI a mandate?
Financial and Technical Resources
Do you have or need:
● System administrators to install non-hosted software?
● Money to pay hosting costs?
● Programming skills to customize a tool?
● Money to pay programmers for customization?
● Support for on-going costs to keep the site running, however small?
Technical Questions to Answer
● Where are the images now?
● How do images get into the system?
● How do transcripts get out of the system?
● How mature is the underlying technology?
● How configurable is the technology?
● How does the system work with the public face of your project?
● Where does the metadata live?
● Who will maintain this? How long?
● How many sites are using this system?
Wikisource
Pro:
● MediaWiki plus its add-on modules (e.g. print-on-demand, export).
● Wikimedia community.
● Incredibly mature.
Con:
● Wikimedia policy.
● Public editing.
● Limited mark-up.
Bentham Transcription Desk
Pro:
● MediaWiki is very mature.
● TEI Toolbar (can also be used on other systems)
● Deployed outside original project.
Con:
● Development efforts halted.
Scripto
Pro:
● Team at CHNM has a great track record.
● Your CMS is your public face.
● MediaWiki is very mature.
● Deployed and under active development.
Con:
● Your CMS handles all metadata.
● Mark-up is extremely limited.
FromThePage
Pro:
● Designed for intensive editing and indexing.
● Semantic mark-up and analysis.
● Hosting available.
Con:
● Single developer (me).
● No TEI mark-up.
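To make "semantic mark-up and analysis" concrete, here is a minimal sketch that assumes a wiki-style link syntax, `[[Canonical Subject|text as written]]`, embedded in the transcript. The syntax, the sample pages, and the helper names `build_index` and `verbatim` are illustrative assumptions, not a description of the tool's actual implementation.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches [[Subject]] or [[Subject|text as written]] (assumed syntax).
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def build_index(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each canonical subject to the pages that mention it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for page_id, text in pages.items():
        for match in LINK.finditer(text):
            subject = match.group(1).strip()
            if page_id not in index[subject]:
                index[subject].append(page_id)
    return dict(index)


def verbatim(text: str) -> str:
    """Strip the mark-up, keeping the words as they appear on the page."""
    return LINK.sub(lambda m: m.group(2) or m.group(1), text)


# Hypothetical transcribed pages.
pages = {
    "p1": "Went to see [[Mary Jones|Aunt Mary]] today.",
    "p2": "[[Mary Jones]] came by with [[tobacco]].",
}

index = build_index(pages)
print(index)
print(verbatim(pages["p1"]))
```

The same marked-up transcript serves two purposes here: a reading text with the original wording, and an index that unites variant references ("Aunt Mary", "Mary Jones") under one subject.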
Islandora TEI Editor
Caveat: I don't know much about this tool or this team.
● Based on Drupal and Fedora
● Supports TEI via friendly interface
● Many Drupal-based projects considering it.
T-PEN
Caveat: I don't know much about this tool.
● Designed for medieval manuscripts.
● Supports TEI natively.
● Line-by-line interface.
● Hosted version available.
Scribe
Pro:
● Excellent for complex layout or non-documentary transcription.
● Zooniverse team is large, well-funded, experienced.
● Configurable.
Con:
● No automated tool for loading images or viewing transcript database (yet!)
● No concept of image-as-a-text.
Pybossa
Caveat: I don't know much about this tool or this team.
● Open Knowledge Foundation's crowdsourcing task management tool.
● Designed for tabular data.
● Google Spreadsheet data entry.
● Extremely young.
TextLab
Caveat: I don't know much about this tool or this team.
● Melville Electronic Library.
● Direct addition of TEI tags to image.
Lab Session 3: Configuration
Scribe Old Weather, What's the Score, Development deployments
Find me
Ben Brumfield
firstname.lastname@example.org
http://manuscripttranscription.blogspot.com/
@benwbrum