
BHL Tech Overview for BHL-Europe


Presented at BHL-Europe Kickoff Meeting.
Museum für Naturkunde, Berlin
12 May 2009


    Slide 1: BHL Technology Overview
      Chris Freeland, Technical Director, BHL; Director of Bioinformatics, Missouri Botanical Garden
    Slide 2: About BHL: Usage, History
    Slide 3: Goals of BHL
      • Scan public domain biodiversity literature.
      • Negotiate rights to digitize copyrighted materials.
      • Ingest content digitized by others.
      • Provide interfaces & APIs for repository.
          • GUIs
          • Services for data mining & citation resolution
      http://www.biodiversitylibrary.org
    Slide 4: Now Online
      • More than:
          • 33,000 volumes
          • 13.3 million pages
      • Avg. monthly growth rate:
          • 1,500 volumes
          • 600,000 pages
    Slide 5: Monthly Usage Stats
      • 45,000 unique users
      • 250,000 pageviews
    Slide 6: History
      • Preliminary work: MOBOT’s Botanicus
          • http://www.botanicus.org
      • Funded by Keck Foundation & IMLS
      • Working demonstration of how nomenclators/databases can link into digitized scientific literature
    Slide 7: Architecture
    Slide 8: Distributed
      • Digitized content on Internet Archive servers in California
      • Metadata index on MOBOT servers in Missouri
      • Image server on MBL servers in Massachusetts
      • Nice, but not global
    Slide 9: [Diagram labels: MOBOT; Petabox cluster; Internet Archive; Image Server; MBL]
    Slide 11: Scanning Workflow
    Slide 12: Scanning Operations
      • BHL uses scanning centers established by Internet Archive for mass scanning.
      • Some partner libraries also scan in-house.
      • Want to expand international footprint:
          • mirrored content
          • ingest from global data providers
      [Map: Locations of BHL/IA Scanning Centers]
    Slide 13: Workflow
      [Diagram labels: Selection; Preparation; Post Production; (Re)publication; Digitization; Conservation]
    Slide 14: Open Access Data
      • Flora medica, oder, Abbildung der wichtigsten officinellen Pflanzen…[Heft 1-18]
          • Publisher: Jena, August Schmid, 1831 [i.e. 1829-1831].
      Available formats: PDF, OCR, XML, JP2
    Slide 16: Complexities of distributed, mass scanning
      [Image captions: from NYBG; from Smithsonian]
    Slide 17: Post Processing & Derivatives
    Slide 18: Derivatives
      • JPEG2000 (JP2) images
      • OCR: ABBYY FineReader
      • PDF: LuraTech PDF Compressor
      • XML metadata
    Slide 19: Name Finding via TaxonFinder
    Slide 20: Name Finding in action with Taxonomic Intelligence…
      Steps shown on the slide:
      • Raw image
      • Converted to text via OCR
      • Name finding via TaxonFinder
      • Extract names
      • Submit to NameBank
      • SOAP response
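
The pipeline on this slide (OCR text, name extraction, verification against NameBank) can be sketched in a few lines. This is only an illustration and not TaxonFinder's actual algorithm: it assumes a naive regular expression for Latin binomials, and a small in-memory set stands in for the NameBank lookup, which the slide describes as a SOAP call. The function names and sample text are hypothetical.

```python
import re

# Naive pattern for a Latin binomial: capitalized genus + lowercase epithet.
# Illustrative stand-in only; this is NOT how TaxonFinder works.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stand-in for NameBank verification (in reality a remote service).
KNOWN_NAMES = {"Quercus alba", "Amanita muscaria"}


def find_candidate_names(ocr_text):
    """Return candidate name strings found in OCR text, in order of occurrence."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]


def verify_names(candidates, known=KNOWN_NAMES):
    """Keep only the candidates that the stand-in name index recognizes."""
    return [name for name in candidates if name in known]


page_text = "The white oak, Quercus alba, grows near Amanita muscaria. Plate shows both."
candidates = find_candidate_names(page_text)
verified = verify_names(candidates)
print(candidates)
print(verified)
```

The naive pattern also picks up ordinary phrases such as "The white", which is why a verification step against an authority list matters in this kind of workflow.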
    Slide 21: Name Finding Stats to date (as of 12 May 2009)
      • Have mined more than 42 million name string occurrences
      • More than 30 million name strings verified by NameBank
          • 1.5 million unique
    Slide 22: Content Delivery
    Slide 25: OCR error rate for names only
      A 2008 study found that, in a sample of 3,003 names, 1,056 were incorrectly transcribed by OCR, an error rate of 35.16%.
      http://biodiversitylibrary.blogspot.com/2008/10/evaluation-of-taxonomic-name-finding.html
      Top OCR errors:
        Rank  Error           Rank  Error
        1     Insert Space    8     n->v
        2     Omit Space      9     l->i
        3     e->c            10    r->i
        4     u->I            11    u->ii
        5     u->n            12    h->l
        6     i->l            13    h->ii
        7     c->e            14    e->o
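
The error rate on this slide follows directly from the two counts, and the confusion list suggests one possible way to recover misread names. The sketch below is an assumption-laden illustration, not the method used in the 2008 study: it reads each pair as "correct character -> character produced by OCR", uses only a subset of the listed confusions, and checks repairs against a hypothetical one-entry name list.

```python
def ocr_error_rate(incorrect, total):
    """Percentage of sampled names that OCR transcribed incorrectly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * incorrect / total


# Subset of the slide's confusions, assumed to mean (correct, as_read_by_ocr).
CONFUSIONS = [("e", "c"), ("u", "n"), ("i", "l"), ("c", "e"),
              ("n", "v"), ("u", "ii"), ("h", "ii")]


def repair_candidates(ocr_name):
    """Yield strings obtained by undoing one confusion at one position."""
    for correct, misread in CONFUSIONS:
        start = 0
        while (pos := ocr_name.find(misread, start)) != -1:
            yield ocr_name[:pos] + correct + ocr_name[pos + len(misread):]
            start = pos + 1


rate = ocr_error_rate(1056, 3003)
print(f"{rate:.2f}%")  # 35.16%

KNOWN = {"Quercus alba"}  # hypothetical authority list
repaired = [c for c in repair_candidates("Qucrcus alba") if c in KNOWN]
print(repaired)
```

Undoing a single confusion keeps the candidate set small; names with two or more OCR errors would need a wider search or a fuzzy matcher.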
    Slide 26: Current image delivery: djatoka
      • Images stored as JPEG2000 (.jp2)
      • Decoded & delivered to browser via djatoka
          • Open source JP2 image server
          • Developed by digital librarians
          • Scalable
          • Rapid development cycle (v1.1)
          • Growing community of users
    Slide 27: BHL/IA architecture: locating a page image
      A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
      Flow shown in the diagram:
      • Browser (IIPViewer) requests /page/1274907 from www.biodiversitylibrary.org (St. Louis)
      • BHLdb lookup: pageid 1274907 -> http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2
      • IA (San Francisco) holds the .jp2
      • djatoka at images.biodivlibrary.org (Woods Hole) delivers a .jpg to the browser
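
The lookup-then-decode flow on this slide can be sketched as a function that maps a page id to a JP2 location and then builds an image request for djatoka. Several things here are assumptions rather than facts from the slides: the dictionary is a hypothetical stand-in for the BHL database, the JP2 location is a placeholder (the real path is elided on the slide), the resolver path under images.biodivlibrary.org is a guess, and the query keys follow djatoka's OpenURL-style getRegion service as commonly documented.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the BHLdb lookup (page id -> JP2 location).
# The JP2 URL is a placeholder, not a real Internet Archive path.
PAGE_TO_JP2 = {1274907: "http://example.org/scans/book_0010.jp2"}

# Assumed resolver location; only the host name appears on the slide.
DJATOKA_RESOLVER = "http://images.biodivlibrary.org/adore-djatoka/resolver"


def jpeg_url_for_page(page_id, level=3):
    """Build a djatoka request that decodes the page's JP2 into a JPEG."""
    jp2_url = PAGE_TO_JP2[page_id]  # raises KeyError for unknown pages
    query = {
        "url_ver": "Z39.88-2004",
        "rft_id": jp2_url,
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": "image/jpeg",
        "svc.level": str(level),  # resolution level to decode
    }
    return f"{DJATOKA_RESOLVER}?{urlencode(query)}"


url = jpeg_url_for_page(1274907)
print(url)
```

Keeping the id-to-file mapping in the metadata index and the decoding in a separate image server mirrors the three-site split described on the slides.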
    Slide 30: New delivery option: IA Bookreader
      • Open source
      • Example: Flora medica
          • http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
    Slide 31: IA Book Viewer
      http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
    Slide 32: APIs & Data Sharing
      • Name Service (Documentation)
          • REST: XML or JSON
      • Data Export (Documentation)
          • Monthly export of BHL titles, volumes, pages, names, other metadata in delimited files
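
A delimited export of this kind can be consumed with a standard CSV reader. The sketch below is hypothetical throughout: the slides do not give the export's column names or delimiter, so the tab delimiter, the header names, and the three sample rows are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Invented sample in an assumed tab-delimited layout; not real BHL data.
SAMPLE_EXPORT = (
    "PageID\tTitleID\tNameString\n"
    "1\t10\tQuercus alba\n"
    "2\t10\tAmanita muscaria\n"
    "3\t11\tQuercus alba\n"
)


def name_occurrences(export_file):
    """Count how often each name string occurs in a delimited export."""
    reader = csv.DictReader(export_file, delimiter="\t")
    return Counter(row["NameString"] for row in reader)


counts = name_occurrences(io.StringIO(SAMPLE_EXPORT))
print(counts.most_common())
```

With a real export, the same function would take an open file handle instead of the in-memory sample.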
    Slide 33: *Soon: Citation resolver via OpenURL
      • Beetle, A. A. 1977. Noteworthy grasses from Mexico V. Phytologia 37(4): 317–407.
      • http://example.edu/cgi?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:article&rft.jtitle=Phytologia&rft.atitle=Noteworthy+grasses+from+Mexico&rft.aulast=Beetle&rft.aufirst=A&rft.date=1977&rft.volume=37&rft.issue=4&rft.spage=317&rft.epage=407
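
The mapping from the citation above to its OpenURL query string can be reproduced with the standard library. This is a minimal sketch: `build_openurl` is a hypothetical helper name, and the resolver address is the slide's own `example.edu` placeholder rather than a real BHL endpoint.

```python
from urllib.parse import urlencode


def build_openurl(resolver, citation):
    """Encode article metadata as an OpenURL (Z39.88-2004, key/value format)."""
    query = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:article",
    }
    # Each citation field becomes an "rft." (referent) key.
    query.update({f"rft.{key}": str(value) for key, value in citation.items()})
    # Keep ':' and '/' literal so the format identifier stays readable.
    return f"{resolver}?{urlencode(query, safe=':/')}"


citation = {
    "jtitle": "Phytologia",
    "atitle": "Noteworthy grasses from Mexico",
    "aulast": "Beetle",
    "aufirst": "A",
    "date": 1977,
    "volume": 37,
    "issue": 4,
    "spage": 317,
    "epage": 407,
}
url = build_openurl("http://example.edu/cgi", citation)
print(url)
```

A resolver on the receiving side would parse these keys back into a citation and look up the matching title, volume, and start page.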
    Slide 34: Articles
    Slide 41: Article repository
      • Needed a way to display these PDFs
      • Wanted to extend contribution functionality to users
      • “Safe harbor” model
          • BHL provides platform
          • Community provides content
              • Scientists, students, libraries
    Slide 42: http://cite.biodiversitylibrary.org
      • Drupal with Biblio module
      • Multi-lingual interface
      • Customizable display, layout
      • Solr search/faceting
      • OAI & other services for discovery/sharing
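
For the OAI-based sharing mentioned on this slide, a harvester typically issues OAI-PMH `ListRecords` requests. The sketch below only builds the request URLs and makes no network calls; the base URL is a hypothetical placeholder, since the slides do not give the repository's OAI endpoint, and `oai_dc` is the metadata format that every OAI-PMH repository is required to support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When continuing a partial list, resumptionToken is an exclusive
    argument and must be sent without metadataPrefix.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the requests.
first_page = list_records_url("http://example.org/oai")
next_page = list_records_url("http://example.org/oai", resumption_token="abc123")
print(first_page)
print(next_page)
```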
    Slide 47: Outreach
    Slide 48: BHL Blog
      • Updates
      • Announcements
      • 1,500 users / month
    Slide 49: Twitter
      • twitter.com/BioDivLibrary
      • Communication tool
          • Connecting with LinkedData community, other users
          • Receiving assistance, guidance
          • FAST turnaround
    Slide 50: If BHL-E is not a Research Project…
    Slide 51: Technologies in hand:
      • TaxonFinder
      • djatoka
      • IA Bookreader
      • Drupal/Biblio
      • OAI-PMH
      • OpenURL
      • Fedora Commons
    Slide 52: Needed:
      • Deduplication Tools
      • Storage
      • OCR
      • Markup/rekeying
      • UI/UX
      • Interface translation
      • Data synchronization
    Slide 53: Thank you
      Chris Freeland
      4344 Shaw Blvd.
      St. Louis, MO 63110
      [email_address]
      http://www.biodiversitylibrary.org