BHL Tech Overview for BHL-Europe


Presented at BHL-Europe Kickoff Meeting.
Museum für Naturkunde, Berlin
12 May 2009

Published in: Technology, Education

  1. 1. BHL Technology Overview. Chris Freeland, Technical Director, BHL; Director of Bioinformatics, Missouri Botanical Garden
  2. 2. About BHL: Usage, History
  3. 3. Goals of BHL
       • Scan public domain biodiversity literature.
       • Negotiate rights to digitize copyrighted materials.
       • Ingest content digitized by others.
       • Provide interfaces & APIs for repository.
           • GUIs
           • Services for data mining & citation resolution
  4. 4. Now Online
       • More than:
           • 33,000 volumes
           • 13.3 million pages
       • Avg. monthly growth rate
           • 1,500 volumes
           • 600,000 pages
  5. 5. Monthly Usage Stats
       • 45,000 unique users
       • 250,000 pageviews
  6. 6. History
       • Preliminary work: MOBOT’s Botanicus
       • Funded by Keck Foundation & IMLS
       • Working demonstration of how nomenclators/databases can link into digitized scientific literature
  7. 7. Architecture
  8. 8. Distributed
       • Digitized content on Internet Archive servers in California
       • Metadata index on MOBOT servers in Missouri
       • Image server on MBL servers in Massachusetts
       • Nice, but not global
  9. 9. [Diagram labels: MOBOT; Petabox cluster, Internet Archive; Image Server, MBL]
  10. 11. Scanning Workflow
  11. 12. Scanning Operations
       • BHL uses scanning centers established by Internet Archive for mass scanning.
       • Some partner libraries also scan in-house.
       • Want to expand international footprint:
           • mirrored content
           • ingest from global data providers
       [Map: Locations of BHL/IA Scanning Centers]
  12. 13. Workflow [diagram labels: Selection, Preparation, Post Production, (Re)publication, Digitization, Conservation]
  13. 14. Open Access Data
       • Flora medica, oder, Abbildung der wichtigsten officinellen Pflanzen…[Heft 1-18]
           • Publisher: Jena, August Schmid, 1831 [i.e. 1829-1831].
       Formats: PDF, OCR, XML, JP2
  14. 16. Complexities of distributed, mass scanning [image captions: from NYBG; from Smithsonian]
  15. 17. Post Processing & Derivatives
  16. 18. Derivatives
       • JPEG2000 (JP2) images
       • OCR: ABBYY FineReader
       • PDF: LuraTech PDF Compressor
       • XML metadata
  17. 19. Name Finding via TaxonFinder
  18. 20. Name Finding in action with Taxonomic Intelligence…
       Raw image -> Converted to text via OCR -> Name finding via TaxonFinder -> Extract names -> Submit to NameBank -> SOAP response
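The flow on this slide (OCR text in, candidate names out, then verification against a name authority) can be illustrated with a toy sketch. This is not TaxonFinder and it does not call NameBank: the regular expression is a deliberately naive stand-in for name finding, and `KNOWN_NAMES` is a hypothetical local set standing in for the remote lookup. All function names are illustrative assumptions.

```python
import re

# Hypothetical stand-in for the remote NameBank lookup (assumption).
KNOWN_NAMES = {"Quercus alba", "Amanita muscaria"}

# Naive pattern (assumption): capitalized genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

def find_candidates(ocr_text):
    """Return candidate name strings in order of appearance."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]

def verify(candidates, known=KNOWN_NAMES):
    """Split candidates into (verified, unverified) lists."""
    verified = [c for c in candidates if c in known]
    unverified = [c for c in candidates if c not in known]
    return verified, unverified

page_text = ("The white oak, Quercus alba, grows near Amanita muscaria. "
             "Plate shows details.")
candidates = find_candidates(page_text)
verified, unverified = verify(candidates)
print(verified)    # names confirmed by the stand-in authority
print(unverified)  # false positives such as ordinary capitalized phrases
```

The naive pattern also picks up ordinary phrases that happen to start with a capital letter, which is one reason a verification step against an authority such as NameBank follows name finding.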
  19. 21. Name Finding Stats to date *
       • Have mined more than 42 million name string occurrences
       • More than 30 million name strings verified by NameBank
           • 1.5 million unique
       * 12 May 2009
  20. 22. Content Delivery
  21. 25. OCR error rate for names only
       A 2008 study found that, in a sample population of 3,003 names, 1,056 (35.16%) were incorrectly transcribed by OCR.
       Top OCR errors:
           Rank  Error          Rank  Error
           1     Insert Space   8     n->v
           2     Omit Space     9     l->i
           3     e->c           10    r->i
           4     u->I           11    u->ii
           5     u->n           12    h->l
           6     i->l           13    h->ii
           7     c->e           14    e->o
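The error rate on this slide follows directly from the two counts, and the substitution list suggests one way to propose corrections for a misread name. The sketch below is illustrative only: it assumes each pair reads "true character -> OCR output", it ignores the two spacing errors, and `candidate_corrections` is a hypothetical helper rather than anything BHL is stated to use.

```python
# Counts from the 2008 study quoted on the slide.
incorrect, sample = 1056, 3003
error_rate = incorrect / sample * 100
print(f"{error_rate:.2f}%")

# (true, seen_in_ocr) pairs from the slide; the direction is an assumption.
SUBSTITUTIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]

def candidate_corrections(ocr_name):
    """Undo one substitution at one position; return every resulting string."""
    results = set()
    for true, seen in SUBSTITUTIONS:
        start = ocr_name.find(seen)
        while start != -1:
            results.add(ocr_name[:start] + true + ocr_name[start + len(seen):])
            start = ocr_name.find(seen, start + 1)
    return results

print(sorted(candidate_corrections("Qucrcus")))
```

Each candidate could then be checked against a name authority, so that only strings matching a known name are accepted as corrections.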
  22. 26. Current image delivery: djatoka
       • Images stored as JPEG2000 (.jp2)
       • Decoded & delivered to browser via djatoka
           • Open source JP2 image server
           • Developed by digital librarians
           • Scalable
           • Rapid development cycle (v1.1)
           • Growing community of users
  23. 27. BHL/IA architecture
       A user requests Mushrooms of America, edible and poisonous, Plate X.
       [Diagram labels: Browser; IIPViewer; djatoka; .jp2; .jpg; IA; BHLdb; /page/1274907; pageid: 1274907; locate. Sites: St. Louis, San Francisco, Woods Hole]
  24. 30. New delivery option: IA Bookreader
       • Open source
       • Example: Flora medica
  25. 31. IA Book Viewer
  26. 32. APIs & Data Sharing
       • Name Service (Documentation)
           • REST: XML or JSON
       • Data Export (Documentation)
           • Monthly export of BHL titles, volumes, pages, names, other metadata in delimited files
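A delimited export of this kind can be read with standard tooling. The sketch below is a minimal illustration under stated assumptions: the tab delimiter, the column names (`TitleID`, `FullTitle`, `VolumeCount`), and the placeholder rows are all hypothetical, since the slide does not describe the export layout.

```python
import csv
import io

# Hypothetical sample standing in for one export file (layout is an assumption).
sample_export = (
    "TitleID\tFullTitle\tVolumeCount\n"
    "1\tExample title A\t2\n"
    "2\tExample title B\t5\n"
)

# DictReader maps each row to the header names on the first line.
rows = list(csv.DictReader(io.StringIO(sample_export), delimiter="\t"))

total_volumes = sum(int(row["VolumeCount"]) for row in rows)
titles = [row["FullTitle"] for row in rows]
print(len(rows), total_volumes, titles)
```

For a real file, `io.StringIO(sample_export)` would be replaced by an opened file handle, with the delimiter and column names adjusted to whatever the export documentation specifies.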
  27. 33. *Soon: Citation resolver via OpenURL
       • Beetle, A. A. 1977. Noteworthy grasses from Mexico V. Phytologia 37(4): 317–407.
       • &rft.jtitle=Phytologia &rft.atitle=Noteworthy+grasses+from+Mexico &rft.aulast=Beetle &rft.aufirst=A & &rft.volume=37&rft.issue=4&rft.spage=317&rft.epage=407
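The key/value pairs on this slide can be produced from a parsed citation with Python's standard library. This is a sketch, not the BHL resolver: the base URL is a placeholder (the resolver address is not given in the transcript), and only the `rft.*` keys that appear on the slide are used.

```python
from urllib.parse import urlencode

# Placeholder resolver address (assumption; not given in the transcript).
RESOLVER_BASE = "https://resolver.example.org/openurl"

# Citation fields as shown on the slide, keyed by their rft.* names.
citation = {
    "rft.jtitle": "Phytologia",
    "rft.atitle": "Noteworthy grasses from Mexico",
    "rft.aulast": "Beetle",
    "rft.aufirst": "A",
    "rft.volume": "37",
    "rft.issue": "4",
    "rft.spage": "317",
    "rft.epage": "407",
}

# urlencode escapes values and encodes spaces as "+", matching the slide.
query = urlencode(citation)
url = f"{RESOLVER_BASE}?{query}"
print(url)
```

Building the query from a dictionary keeps the escaping in one place, so titles containing spaces or punctuation do not need to be encoded by hand.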
  28. 34. Articles
  29. 41. Article repository
       • Needed a way to display these PDFs
       • Wanted to extend contribution functionality to users
       • “Safe harbor” model
           • BHL provides platform
           • Community provides content
               • Scientists, students, libraries
  30. 42. http://
       • Drupal with Biblio module
       • Multi-lingual interface
       • Customizable display, layout
       • Solr search/faceting
       • OAI & other services for discovery/sharing
  31. 47. Outreach
  32. 48. BHL Blog
       • Updates
       • Announcements
       • 1,500 users / month
  33. 49. Twitter
       • Communication tool
           • Connecting with LinkedData community, other users
           • Receiving assistance, guidance
           • FAST turnaround
  34. 50. If BHL-E is not a Research Project…
  35. 51. Technologies in hand:
       • TaxonFinder
       • djatoka
       • IA Bookreader
       • Drupal/Biblio
       • OAI-PMH
       • OpenURL
       • Fedora Commons
  36. 52. Needed:
       • Deduplication Tools
       • Storage
       • OCR
       • Markup/rekeying
       • UI/UX
       • Interface translation
       • Data synchronization
  37. 53. Thank you
       • Chris Freeland
       • 4344 Shaw Blvd.
       • St. Louis, MO 63110
       • [email_address]