Applying emerging technology to historic scientific literature


Botany 2007, Chicago, IL.

Published in: Technology, Education

  1. 1. Applying emerging technology to historic scientific literature
     Chris Freeland, Doug Holland
     Missouri Botanical Garden
  2. 2. Botany & systematics are sciences built on accumulated knowledge
     • Published literature is the foundation on which biological science is based
  3. 3. Taxonomic Literature
     • Over 250 years of systematic description of life
     • Systema Naturae (10th ed., 1758) by Carl von Linné
  4. 4. Taxonomic Literature
     The cited half-life of publications in taxonomy is longer than in any other scientific discipline * * * The decay rate is longer than in any scientific discipline
     - Macro-economic case for open access, Tom Moritz
  5. 5. How historic literature is used
  6. 6. Taxonomic Impediment
     • Specimen collections
     • Databases
     • Publications
     • Observations
     • ‘Gray’ literature
     • Index cards
     • Field notebooks
  7. 7.
     • A freely accessible, Web-based encyclopedia of digitized botanical literature, sponsored by the Missouri Botanical Garden Library
     • 650,000+ pages of text
     • 1,300 volumes, 200 titles
     • 145,000 linked protologues
     • ~10TB of data
  8. 8. Workflow Selection Preparation Post Production Publication Metadata Enhancement Digitization Conservation
  9. 9. Selection
  10. 10. Preparation
      • Review bibliographic metadata in MOBOT library catalog
        • Clean up, if needed
      • Extract MARC
        • Transform to MARCXML
        • Parse into Botanicus DB
      • Review title & determine which scanning device to use
        • Possible trip through Conservation
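The "Transform to MARCXML, parse into Botanicus DB" step can be pictured with a minimal sketch. The sample record, the `extract_subfields` helper, and the choice of field 245 (title statement) are illustrative assumptions, not the actual Botanicus loader; only the namespace and the datafield/subfield layout come from the MARCXML schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML (the MARC21 "slim" schema).
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Species plantarum /</subfield>
    <subfield code="c">Carl von Linné.</subfield>
  </datafield>
</record>"""


def extract_subfields(record_xml, tag):
    """Return {subfield code: text} for the first datafield with the given tag."""
    root = ET.fromstring(record_xml)
    field = root.find(f"marc:datafield[@tag='{tag}']", MARC_NS)
    if field is None:
        return {}
    return {
        sf.get("code"): (sf.text or "").strip()
        for sf in field.findall("marc:subfield", MARC_NS)
    }


title = extract_subfields(SAMPLE, "245")
print(title.get("a"))
```

A real loader would go on to write these values into database tables; that part is omitted because the slides do not describe the Botanicus schema.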
  11. 11. Digitization
      • 5 full-time scanners
      • 3 Indus 5002 book scanners
      • 1 Kodak i280 sheet-feed scanner
  12. 12. Post Production – Custom Apps
      • PageConvert
        • JPEG2000 (*.jp2) creation
        • Thumbnail creation
        • Moves derivative images to server
        • Updates item records to prepare for publishing
        • Runs on each scanning workstation
      • PagePublish
        • Looks for items ready to publish
        • Creates or updates page records
        • Guesses page “types”: text or illustration
        • Triggers OCR generation and PDF creation
        • Updates titles and item records to “publish ready”
        • Runs centrally
  13. 13. Post Production – Packaged Apps
      • PrimeOCR
        • 6 voting engines
        • Multi-language support
        • Character coordinates
        • Outputs ASCII text, other formats
      • LuraTech PDF Compressor
        • 2GB of TIF page images -> 30MB PDF
        • PDF/A
        • OCR (ABBYY FineReader)
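"Voting engines" suggests that several OCR engines read the same image and their outputs are reconciled. The slides do not describe how PrimeOCR does this, so the sketch below is only a plain per-character majority vote, under the assumption that the engine outputs have already been aligned to equal length. The function name and the sample readings are hypothetical.

```python
from collections import Counter


def majority_vote(readings):
    """Combine aligned OCR readings by per-character majority vote.

    Assumes every reading has the same length (already aligned). Ties go to
    the character that appears first among the readings.
    """
    if not readings:
        return ""
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("readings must be aligned to the same length")
    voted = []
    for chars in zip(*readings):
        # most_common(1) yields the most frequent character at this position.
        voted.append(Counter(chars).most_common(1)[0][0])
    return "".join(voted)


# Hypothetical outputs from three engines reading the same word.
print(majority_vote(["Quercus", "Qnercus", "Quercns"]))
```

Each engine here makes a different single-character error, so the vote recovers the intended word; real OCR output would first need alignment, since engines can insert or drop characters.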
  14. 14. Enhancement - Paginator
  15. 15. View
  16. 19. Web 2.0 Features
      • AJAX interface
        • JPEG2000 (image compression with zoom)
      • Web Services
        • uBio TaxonFinder and NameBank Taxonomic Intelligence
      • RSS feeds
        • Volumes added and news
      • Mash Ups
        • Geocoded subject headings plotted on Google Maps
      • Tag Clouds
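As one illustration of the RSS item, a "volumes added" feed could be generated along these lines. This is a sketch assuming a minimal RSS 2.0 channel; the `build_rss` helper, the titles, and the example.org URLs are placeholders, not the site's actual feed code or addresses.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, items):
    """Build a minimal RSS 2.0 document from (title, link) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Recently added volumes"
    for title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Placeholder data; not real Botanicus titles or addresses.
feed = build_rss(
    "Volumes added",
    "https://example.org/feed",
    [("Volume 1", "https://example.org/item/1")],
)
print(feed)
```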
  17. 20. 9. Page View
  18. 22.
      • Distributed taxonomic indexing
        • Public-resource computing application that identifies name-like strings in OCR text
        • Bundles of text pages sent to volunteer computers for indexing & results reporting
      • Runs as a screensaver
      • Open source framework behind SETI@Home
  19. 23. SciLINC in action…
      • TIF image from scanner
      • Converted to text via PrimeOCR
      • Name finding via bTaxonGrab
      • Extract names
      • Submit to TaxonFinder
      • SOAP response
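The name-finding step looks for name-like strings in OCR text. The regular expression below is a deliberately naive stand-in for that idea (a capitalized genus-like word followed by a lowercase epithet-like word). It is not the bTaxonGrab or TaxonFinder algorithm, and the pattern, function name, and sample sentence are all assumptions made for illustration.

```python
import re

# Naive pattern: capitalized word (genus-like) + lowercase word (epithet-like).
# The minimum lengths reduce matches on short word pairs such as "In a".
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_name_candidates(ocr_text):
    """Return candidate 'Genus epithet' strings found in OCR text."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]


sample = "Specimens of Quercus alba and Acer rubrum were collected."
print(find_name_candidates(sample))
```

A pattern like this also matches ordinary prose (for example, a sentence beginning "The plant"), which is one reason candidates would be checked against a names service rather than trusted directly.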
  20. 24. Taxonomic Impedectomy
      Prof. Newton wrote me that he is extremely excited about your digitization project. At the moment he and his graduate botany students in Kenya have access to very few resources. He spends his summer terms at Kew doing his research for the next year's teaching and writing, but he tells me that now, because of what is already on your site, he will not have to carry so much back to Kenya for his research and his students but can download and work with your resources right there.
      -- excerpt re: Botanicus from email August 2006
  21. 25. The Future
      • User Accounts
        • User defined views
        • MyBookshelf – favoriting & sharing
      • Wiki-type editing & tagging
        • Metadata enrichment
        • OCR correction by users
      • Bibliographic Intelligence
        • Improved “click through” citations
        • Citation finding & linking
      • Increased geospatial extraction and visualization
  22. 26. Biodiversity Heritage Library
      • American Museum of Natural History (New York)
      • Field Museum (Chicago)
      • Natural History Museum (London)
      • Smithsonian Institution (Washington)
      • Missouri Botanical Garden
      • New York Botanical Garden
      • Royal Botanic Gardens, Kew
      • Botany Libraries, Harvard University
      • Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
      • Marine Biological Laboratory / Woods Hole Oceanographic Institution
  23. 27. Biodiversity Heritage Library
      • Core literature pre-1923: 400,000 (80 million pages)
      • All pre-1923: 600,000-750,000 (120-150 million pages)
      • All literature: 1.4-1.6 million (280-320 million pages)
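The three estimates share one ratio: each page count equals the corresponding item count times 200. The slide does not name the unit or state an average, so the sketch below treats the first figures as volumes and 200 pages per volume as an assumption implied by the numbers.

```python
PAGES_PER_VOLUME = 200  # assumed average; implied by the slide's figures

# (low, high) item counts taken from the slide; unit assumed to be volumes.
estimates = {
    "Core literature pre-1923": (400_000, 400_000),
    "All pre-1923": (600_000, 750_000),
    "All literature": (1_400_000, 1_600_000),
}


def page_range(low, high, pages_per_volume=PAGES_PER_VOLUME):
    """Return (low, high) page totals in millions."""
    return (low * pages_per_volume / 1e6, high * pages_per_volume / 1e6)


for label, (low, high) in estimates.items():
    low_pages, high_pages = page_range(low, high)
    print(f"{label}: {low_pages:.0f}-{high_pages:.0f} million pages")
```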
  24. 28.
  25. 31. Brought to you by:
      • Andrew W. Mellon Foundation
        • 2000-2004
      • W. M. Keck Foundation
        • 2005-2007
      • Institute of Museum and Library Services (IMLS)
        • 2006-2008
  26. 32. Please comment and send questions and suggestions to:
      • [email_address]
      • [email_address]