BHL Tech Overview for BHL-Europe

Presented at BHL-Europe Kickoff Meeting.
Museum für Naturkunde, Berlin
12 May 2009

Published in: Technology, Education
  • BHL Tech Overview for BHL-Europe

    1. 1. BHL Technology Overview. Chris Freeland, Technical Director, BHL; Director of Bioinformatics, Missouri Botanical Garden
    2. 2. About BHL: Usage, History
    3. 3. Goals of BHL (http://www.biodiversitylibrary.org)
         • Scan public domain biodiversity literature.
         • Negotiate rights to digitize copyrighted materials.
         • Ingest content digitized by others.
         • Provide interfaces & APIs for repository.
             • GUIs
             • Services for data mining & citation resolution
    4. 4. Now Online
         • More than:
             • 33,000 volumes
             • 13.3 million pages
         • Avg. monthly growth rate
             • 1,500 volumes
             • 600,000 pages
    5. 5. Monthly Usage Stats
         • 45,000 unique users
         • 250,000 pageviews
    6. 6. History
         • Preliminary work: MOBOT’s Botanicus
             • http://www.botanicus.org
         • Funded by Keck Foundation & IMLS
         • Working demonstration of how nomenclators/databases can link into digitized scientific literature
    7. 7. Architecture
    8. 8. Distributed
         • Digitized content on Internet Archive servers in California
         • Metadata index on MOBOT servers in Missouri
         • Image server on MBL servers in Massachusetts
         • Nice, but not global
    9. 9. [Diagram labels: MOBOT; Petabox cluster, Internet Archive; Image Server, MBL]
    10. 11. Scanning Workflow
    11. 12. Scanning Operations
         • BHL uses scanning centers established by Internet Archive for mass scanning.
         • Some partner libraries also scan in-house.
         • Want to expand international footprint:
             • mirrored content
             • ingest from global data providers
         [Map: Locations of BHL/IA Scanning Centers]
    12. 13. Workflow [diagram labels, in extraction order: Selection; Preparation; Post Production; (Re)publication; Digitization; Conservation]
    13. 14. Open Access Data
         • Flora medica, oder, Abbildung der wichtigsten officinellen Pflanzen…[Heft 1-18]
             • Publisher: Jena, August Schmid, 1831 [i.e. 1829-1831].
         Available as: PDF, OCR, XML, JP2
    14. 16. Complexities of distributed, mass scanning [image captions: from NYBG; from Smithsonian]
    15. 17. Post Processing & Derivatives
    16. 18. Derivatives
         • JPEG2000 (JP2) images
         • OCR: ABBYY FineReader
         • PDF: LuraTech PDF Compressor
         • XML metadata
    17. 19. Name Finding via TaxonFinder
    18. 20. Name Finding in action with Taxonomic Intelligence…
         Raw image -> converted to text via OCR -> name finding via TaxonFinder -> extract names -> submit to NameBank -> SOAP response
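The "extract names" step in the pipeline above can be illustrated with a toy pattern matcher. This is a minimal sketch and not TaxonFinder's algorithm, which the slides do not describe; the regular expression and the function name `find_candidate_names` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: a capitalised genus followed by a lowercase epithet.
# Far cruder than TaxonFinder; it only proposes candidates.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text: str) -> list[str]:
    """Return unique 'Genus epithet' candidates in order of first appearance."""
    seen: list[str] = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if name not in seen:
            seen.append(name)
    return seen


page_text = "Plate X shows Amanita muscaria beside Agaricus campestris."
candidates = find_candidate_names(page_text)
```

A pattern like this also matches ordinary phrases such as "The mushroom", which is one reason a verification step against a name authority such as NameBank is useful after extraction.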
    19. 21. Name Finding Stats to date (12 May 2009)
         • Have mined more than 42 million name string occurrences
         • More than 30 million name strings verified by NameBank
             • 1.5 million unique
    20. 22. Content Delivery
    21. 25. OCR error rate for names only
         A 2008 study found that, in a sample of 3,003 names, 1,056 (35.16%) were incorrectly transcribed by OCR.
         http://biodiversitylibrary.blogspot.com/2008/10/evaluation-of-taxonomic-name-finding.html
         Top OCR errors:
             Rank  Error           Rank  Error
             1     Insert Space    8     n->v
             2     Omit Space      9     l->i
             3     e->c            10    r->i
             4     u->I            11    u->ii
             5     u->n            12    h->l
             6     i->l            13    h->ii
             7     c->e            14    e->o
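The error rate on this slide is simply 1,056 / 3,003. The sketch below reproduces that figure and then shows one way the substitution table could be used: undoing a single substitution to propose spellings that a name service could verify. Reading each rule as "printed character -> OCR output" is an assumption, as is the helper name `candidate_corrections`; the slides do not say BHL applies such corrections.

```python
# Figures from the 2008 study cited on the slide.
SAMPLE_SIZE = 3003
MISREAD = 1056
error_rate = MISREAD / SAMPLE_SIZE

# Character rules from the slide, read as (printed, OCR output).
# That direction is an assumption; the two space rules are left out.
OCR_ERRORS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]


def candidate_corrections(ocr_name: str) -> set[str]:
    """Undo one substitution at one position to propose spellings to verify."""
    candidates: set[str] = set()
    for printed, misread in OCR_ERRORS:
        start = ocr_name.find(misread)
        while start != -1:
            end = start + len(misread)
            candidates.add(ocr_name[:start] + printed + ocr_name[end:])
            start = ocr_name.find(misread, start + 1)
    return candidates


rate_text = f"{error_rate:.2%}"
fixes = candidate_corrections("Qucrcus")
```

Each candidate would still need checking against an authority list, since most single-character edits do not yield a real name.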
    22. 26. Current image delivery: djatoka
         • Images stored as JPEG2000 (.jp2)
         • Decoded & delivered to browser via djatoka
             • Open source JP2 image server
             • Developed by digital librarians
             • Scalable
             • Rapid development cycle (v1.1)
             • Growing community of users
    23. 27. BHL/IA architecture
         A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
         • Browser (IIPViewer) requests /page/1274907 from www.biodiversitylibrary.org (St. Louis).
         • BHLdb locates pageid 1274907 at IA (San Francisco): http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2
         • djatoka at images.biodivlibrary.org (Woods Hole) decodes the .jp2 and delivers a .jpg to the browser.
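The request flow on this slide can be sketched as a lookup followed by a djatoka resolver URL. The parameter names below follow djatoka's OpenURL `getRegion` service and should be checked against the deployed version. The resolver path `/adore-djatoka/resolver`, the in-memory lookup table standing in for BHLdb, and the function name `djatoka_url` are all assumptions. The JP2 address is kept elided exactly as it appears on the slide, so the resulting URL is illustrative and not fetchable.

```python
from urllib.parse import urlencode

# Stand-in for the BHLdb lookup (hypothetical); the URL is elided as on the slide.
PAGE_TO_JP2 = {
    1274907: "http://www.archive.org/download/mushroomsofameri00palm/.../"
             "mushroomsofameri00palm_0010.jp2",
}

# Host taken from the slide; the resolver path is an assumed default.
DJATOKA_RESOLVER = "http://images.biodivlibrary.org/adore-djatoka/resolver"


def djatoka_url(page_id: int, level: int = 3) -> str:
    """Build a getRegion request asking djatoka for a JPEG of one page."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": PAGE_TO_JP2[page_id],           # where the JP2 lives
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": "image/jpeg",               # deliver .jpg to the browser
        "svc.level": level,                       # resolution level
    }
    return f"{DJATOKA_RESOLVER}?{urlencode(params)}"


plate_x = djatoka_url(1274907)
```

Keeping the page-to-file lookup separate from URL construction mirrors the split on the slide between the metadata index and the image server.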
    24. 30. New delivery option: IA Bookreader
         • Open source
         • Example: Flora medica
             • http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
    25. 31. IA Book Viewer http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
    26. 32. APIs & Data Sharing
         • Name Service (Documentation)
             • REST: XML or JSON
         • Data Export (Documentation)
             • Monthly export of BHL titles, volumes, pages, names, other metadata in delimited files
    27. 33. *Soon: Citation resolver via OpenURL
         • Beetle, A. A. 1977. Noteworthy grasses from Mexico V. Phytologia 37(4): 317–407.
         • http://example.edu/cgi?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:article&rft.jtitle=Phytologia&rft.atitle=Noteworthy+grasses+from+Mexico&rft.aulast=Beetle&rft.aufirst=A&rft.date=1977&rft.volume=37&rft.issue=4&rft.spage=317&rft.epage=407
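The OpenURL above is a set of key/encoded-value pairs appended to a resolver address. A minimal sketch of building it from citation fields follows; the field values are copied from the example URL on the slide, `http://example.edu/cgi` is the slide's placeholder resolver, and the function name `citation_openurl` is an assumption.

```python
from urllib.parse import urlencode

# Field values copied from the example URL on the slide.
CITATION = {
    "jtitle": "Phytologia",
    "atitle": "Noteworthy grasses from Mexico",
    "aulast": "Beetle",
    "aufirst": "A",
    "date": "1977",
    "volume": "37",
    "issue": "4",
    "spage": "317",
    "epage": "407",
}


def citation_openurl(base_url: str, citation: dict[str, str]) -> str:
    """Encode an article citation as a Z39.88-2004 key/encoded-value OpenURL."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:article",
    }
    # Each citation field becomes an rft.* key describing the referent.
    params.update({f"rft.{key}": value for key, value in citation.items()})
    return f"{base_url}?{urlencode(params)}"


openurl = citation_openurl("http://example.edu/cgi", CITATION)
```

`urlencode` percent-encodes the colons and slashes in `rft_val_fmt`, so the output differs cosmetically from the slide's URL while carrying the same keys and values.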
    28. 34. Articles
    29. 41. Article repository
         • Needed a way to display these PDFs
         • Wanted to extend contribution functionality to users
         • “Safe harbor” model
             • BHL provides platform
             • Community provides content
                 • Scientists, students, libraries
    30. 42. http://cite.biodiversitylibrary.org
         • Drupal with Biblio module
         • Multi-lingual interface
         • Customizable display, layout
         • Solr search/faceting
         • OAI & other services for discovery/sharing
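For the OAI service mentioned above, a harvester would typically issue OAI-PMH `ListRecords` requests. The sketch below only builds the request URL. The verb and argument names come from the OAI-PMH protocol, while the endpoint path `/oai` on cite.biodiversitylibrary.org and the function name `list_records_url` are assumptions, since the slides give no endpoint.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not state the real OAI base URL.
OAI_ENDPOINT = "http://cite.biodiversitylibrary.org/oai"


def list_records_url(
    endpoint: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # A resumption token is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date  # harvest only records changed since this date
    return f"{endpoint}?{urlencode(params)}"


first_page = list_records_url(OAI_ENDPOINT, from_date="2009-05-01")
```

A harvester would repeat the request with each returned resumption token until the repository stops issuing one.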
    31. 47. Outreach
    32. 48. BHL Blog
         • Updates
         • Announcements
         • 1,500 users / month
    33. 49. Twitter
         • twitter.com/BioDivLibrary
         • Communication tool
             • Connecting with LinkedData community, other users
             • Receiving assistance, guidance
             • FAST turnaround
    34. 50. If BHL-E is not a Research Project…
    35. 51. Technologies in hand:
         • TaxonFinder
         • djatoka
         • IA Bookreader
         • Drupal/Biblio
         • OAI-PMH
         • OpenURL
         • Fedora Commons
    36. 52. Needed:
         • Deduplication Tools
         • Storage
         • OCR
         • Markup/rekeying
         • UI/UX
         • Interface translation
         • Data synchronization
    37. 53. Thank you
         • Chris Freeland
         • 4344 Shaw Blvd.
         • St. Louis, MO 63110
         • [email_address]
         • http://www.biodiversitylibrary.org
