Purposeful Gaming, OCR Correction and Seed & Nursery Catalog Digitization

An online game will be developed to crowd-source the correction of OCRed content in the Biodiversity Heritage Library (BHL). Several additional content types will be digitized and added to BHL, namely seed lists, seed & nursery catalogs, and handwritten field notebooks.

  1. Purposeful Gaming, OCR Correction and Seed & Nursery Catalog Digitization
     Marty Schlabach, Food & Agriculture Librarian, Cornell University
     A Grant-Funded BHL Project
     Council on Botanical & Horticultural Libraries Annual Meeting, Richmond, VA, April 30, 2014
  2. What is Purposeful Gaming and BHL?
     • Full title: Purposeful Gaming and BHL: Engaging the Public in Improving and Enhancing Access to Digital Texts
     • National Leadership Grant for Libraries awarded to Missouri Botanical Garden in St. Louis
     • Partners include Harvard, Cornell, New York Botanical Garden
     • Funded by IMLS
     • Runs Dec 2013-Nov 2015
  3. BHL Problem Statement
     • Major challenge for digital libraries: full-text searching of scanned texts is significantly hampered by poor output from Optical Character Recognition (OCR) software.
     • Historic literature has proven to be particularly problematic because of its tendency to have varying fonts, typesetting, and layouts.
  4. 8 Primary Objectives of Purposeful Gaming and BHL
     1) digitizing horticultural catalogs
     2) transcribing field notebooks, horticultural catalogs & other digital content in BHL
     3) building a technical framework for management of digital text outputs
     4) comparing digital outputs for OCR accuracy
     5) developing and deploying a game to crowd-source OCR error correction
     6) evaluating accuracy scores from the game against ground truth pages
     7) generating an error matrix for clean-up
     8) producing a report and disseminating findings
  5. Like reCAPTCHA ...
     "reCAPTCHA offers more than just spam protection. Every time CAPTCHAs are solved, that human effort helps digitize text, annotate images, and build machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems."
  6. ... but a lot more fun!
     • http://www.digitalkoot.fi/
     • Mole Bridge: http://www.youtube.com/watch?v=Q6CZId38Hvk
     • Mole Hunt: http://www.youtube.com/watch?v=uHN5WW6yCc4
  7. BHL Content
     • Primarily Books & Journals
     • Purposeful Gaming Project adding new content types
       • Seed & Nursery Catalogs & Seed Lists
         • Test OCR correction on this content type
         • Crowd-source the transcription when needed
       • Field Notebooks
         • Handwritten, OCR virtually impossible
         • Crowd-source the transcription
  8. Seed & Nursery Catalog and Seed List Digitization
     • Seed & Nursery Catalogs
       • New York Botanical Garden
       • Cornell
     • Seed Lists
       • Missouri Botanical Garden
  9. Why Digitize Seed & Nursery Catalogs?
     • Taxonomists
       • Discover early introductions of new plants
     • Gardeners
       • Peruse old catalogs for historical availability and uses of traditional cultivars of heirloom annuals and perennials
     • Museums and botanical gardens
       • Recreate historical gardens
     • Plant breeders
       • Look for descriptions of plants with unique disease and pest resistance
     • Historians of art and illustration
       • Drawn to the striking representations of flowers, fruit & vegetables
     • Historians of printing
       • Catalogs documented changes in printing
         • Text-only broadsides & pamphlets
         • Multipage booklets with engraved illustrations
         • Colorful lithographs added
         • Photographic illustrations, b&w and later color
  10. Seed & Nursery Catalog Selection Issues
     • Evaluate copyright status
     • Assess physical condition
     • Minimize duplication among collaborators
     • Determine selection criteria
  11. New York Botanical Garden (NYBG)
     • 50,000+ catalogs in collection
     • Previous NEH grant
       • Catalog whole seed & nursery catalog collection
       • Digitize a portion
     • In IMLS project, will continue systematic digitization
     • Targeting 500 catalogs, 15,000 pages
     • In-house scanning
     • Currently available via Mertz Digital
       • http://mertzdigital.nybg.org/
     • Upload to Internet Archive (IA) & BHL
  12. Cornell University Library
     • 130,000+ catalogs in Bailey Horticultural Catalog collection
     • Digitization priorities
       • Firms that supplied grapes
       • Have identified 325+ firms
       • 1771-
       • Minimizing duplication of firms & catalogs NYBG & NAL are digitizing
     • Targeting 3,700 catalogs, 125,000 pages
     • External scan vendor Trigonix
     • Upload to Internet Archive (IA) & BHL
  13. National Agricultural Library (NAL)
     • Not a BHL member or IMLS recipient, but a collaborator
     • Already digitized 6,400+ catalogs
     • Selection priorities
       • Peter Henderson
       • Women-owned firms
       • Mid-Atlantic firms
       • Long collection runs
     • In-house scanning
     • Uploading to Internet Archive (IA)
       • https://archive.org/details/usda-nurseryandseedcatalog
     • BHL ingests non-member content from IA
  14. Purposeful Gaming: Development Deliverables
     • 140,000 digitized pages of horticultural catalogs in BHL, with a minimum of 2 OCR outputs for each page
     • Implementation of a transcription tool for BHL materials
     • 2,000 pages of transcribed field notebooks
     • Technical framework within BHL architecture for classifying, comparing and managing multiple OCR outputs for a single page (i.e., with the possibility of including corrected OCR)
     • Production of an error matrix that will allow for automated text correction on the full BHL corpus
     • Proof of concept for whether gaming and crowd-sourcing can be used to improve access to digital texts
  15. Acknowledgements
     • IMLS Purposeful Gaming grant colleagues at
       • Missouri Botanical Garden
       • New York Botanical Garden
       • Cornell University Library
       • Harvard Museum of Comparative Zoology
     • Trish Rose-Sandler, Missouri Botanical Garden
       • Plagiarized (with permission) selected slides
     • IMLS for funding
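
The slides name two text-processing steps without showing them: comparing multiple OCR outputs for a single page, and generating an error matrix against ground truth pages. The following is a minimal sketch of what those steps might look like, not the project's actual implementation. The function names `disagreements` and `error_matrix` are hypothetical, and the alignment relies on Python's standard `difflib` as a stand-in for whatever comparison method the project uses.

```python
from __future__ import annotations

from collections import Counter
from difflib import SequenceMatcher


def disagreements(ocr_a: str, ocr_b: str) -> list[tuple[str, str]]:
    """Return the word spans where two OCR outputs of the same page differ.

    Hypothetical helper: differing spans are the kind of material a
    correction game could hand to players.
    """
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def error_matrix(ocr: str, truth: str) -> Counter:
    """Count (true_char, ocr_char) substitutions against a ground-truth text.

    Hypothetical helper: only same-length replacements are counted, since
    those map cleanly onto one-to-one character confusions.
    """
    counts: Counter = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            counts.update(zip(truth[i1:i2], ocr[j1:j2]))
    return counts


if __name__ == "__main__":
    # Illustrative strings only; not taken from any BHL page.
    print(disagreements("Early Rose potatoes", "Early Kose potatoes"))
    print(error_matrix("Kose", "Rose"))
```

Counts accumulated this way over many ground truth pages would show which character confusions are frequent enough to be worth correcting automatically; insertions and deletions are ignored here to keep the sketch short.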
