BHL Tech Overview for BHL-Europe

    BHL Tech Overview for BHL-Europe - Presentation Transcript

    1. BHL Technology Overview
      • Chris Freeland
      • Technical Director, BHL
      • Director of Bioinformatics, Missouri Botanical Garden
    2. About BHL: Usage, History
    3. Goals of BHL
      • Scan public domain biodiversity literature.
      • Negotiate rights to digitize copyrighted materials.
      • Ingest content digitized by others.
      • Provide interfaces & APIs for repository.
        • GUIs
        • Services for data mining & citation resolution
      • Now online at http://www.biodiversitylibrary.org
        • More than:
          • 33,000 volumes
          • 13.3 million pages
        • Avg. monthly growth rate:
          • 1,500 volumes
          • 600,000 pages
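
The totals and growth figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above (which the slide gives as "more than" totals and monthly averages, so the results are approximate) to work out the average pages per volume for the existing corpus and for the monthly intake.

```python
# Figures quoted on the slide ("more than" totals and average monthly growth).
TOTAL_VOLUMES = 33_000
TOTAL_PAGES = 13_300_000
MONTHLY_VOLUMES = 1_500
MONTHLY_PAGES = 600_000


def pages_per_volume(pages: int, volumes: int) -> float:
    """Average number of pages per scanned volume."""
    return pages / volumes


corpus_avg = pages_per_volume(TOTAL_PAGES, TOTAL_VOLUMES)
monthly_avg = pages_per_volume(MONTHLY_PAGES, MONTHLY_VOLUMES)

if __name__ == "__main__":
    print(f"corpus:  {corpus_avg:.0f} pages per volume")
    print(f"monthly: {monthly_avg:.0f} pages per volume")
```

Both ratios come out at roughly 400 pages per volume, which suggests the monthly intake resembles the material already online.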
    4. Monthly Usage Stats
      • 45,000 unique users
      • 250,000 pageviews
    5. History
      • Preliminary work: MOBOT’s Botanicus
        • http://www.botanicus.org
      • Funded by Keck Foundation & IMLS
      • Working demonstration of how nomenclators/databases can link into digitized scientific literature
    6. Architecture
    7. Distributed
      • Digitized content on Internet Archive servers in California
      • Metadata index on MOBOT servers in Missouri
      • Image server on MBL servers in Massachusetts
      • Nice, but not global
    8. Architecture diagram: MOBOT; Petabox cluster (Internet Archive); Image Server (MBL)
    9.  
    10. Scanning Workflow
    11. Scanning Operations
      • BHL uses scanning centers established by Internet Archive for mass scanning.
      • Some partner libraries also scan in-house.
      • Want to expand international footprint:
        • mirrored content
        • ingest from global data providers
      Locations of BHL/IA Scanning Centers
    12. Workflow
      • Selection
      • Preparation
      • Post Production
      • (Re)publication
      • Digitization
      • Conservation
    13. Open Access Data
      • Flora medica, oder, Abbildung der wichtigsten officinellen Pflanzen…[Heft 1-18]
        • Publisher: Jena, August Schmid, 1831 [i.e. 1829-1831].
      • Available as: PDF, OCR, XML, JP2
    14.  
    15. Complexities of distributed, mass scanning (examples from NYBG and from Smithsonian)
    16. Post Processing & Derivatives
    17. Derivatives
      • JPEG2000 (JP2) images
      • OCR: ABBYY FineReader
      • PDF: LuraTech PDF Compressor
      • XML metadata
    18. Name Finding via TaxonFinder
    19. Name Finding in action with Taxonomic Intelligence…
      • Raw image
      • Converted to text via OCR
      • Name finding via TaxonFinder
      • Extract names
      • Submit to NameBank
      • SOAP response
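
The name-finding flow on this slide (OCR text in, candidate names out, candidates checked against NameBank) can be illustrated with a toy pipeline. This is only a sketch under loud assumptions: the regular expression is a naive stand-in and is not TaxonFinder's algorithm, and the NameBank SOAP call is replaced by a lookup in a small hypothetical in-memory set.

```python
import re

# Stand-in for NameBank: a tiny in-memory set of known names (hypothetical sample).
KNOWN_NAMES = {"Quercus alba", "Zea mays"}

# Naive pattern for a Latin binomial: capitalized genus, lowercase epithet.
# This is NOT TaxonFinder's algorithm, only an illustrative heuristic.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text: str) -> list[str]:
    """Extract strings that merely look like binomial names."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]


def verify_names(candidates: list[str], known: set[str] = KNOWN_NAMES) -> list[str]:
    """Keep only candidates found in the reference set (the NameBank stand-in)."""
    return [name for name in candidates if name in known]


if __name__ == "__main__":
    text = "The oak Quercus alba grows near fields of Zea mays in Missouri."
    candidates = find_candidate_names(text)
    print(candidates)
    print(verify_names(candidates))
```

The heuristic over-matches on purpose ("The oak" looks like a binomial to it), which is why a verification step against an authority list matters.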
    20. Name Finding Stats to date*
      • Have mined more than 42 million name string occurrences
      • More than 30 million name strings verified by NameBank
        • 1.5 million unique
      *12 May 2009
    21. Content Delivery
    22.  
    23.  
    24. OCR error rate for names only: 35.16%
      • Study in 2008 found that for a sample population of 3,003 names, 1,056 were incorrectly transcribed by OCR.
        • http://biodiversitylibrary.blogspot.com/2008/10/evaluation-of-taxonomic-name-finding.html
      • Top OCR errors, by rank:
        • 1. Insert Space
        • 2. Omit Space
        • 3. e->c
        • 4. u->I
        • 5. u->n
        • 6. i->l
        • 7. c->e
        • 8. n->v
        • 9. l->i
        • 10. r->i
        • 11. u->ii
        • 12. h->l
        • 13. h->ii
        • 14. e->o
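
The error rate on this slide follows directly from the study counts, and the list of common confusions suggests one way such errors might be undone. The sketch below is an assumption-laden illustration, not a method described in the slides: it reads each pair as "original character -> OCR output" and generates correction candidates by reversing one confusion at one position.

```python
# Character confusions from the slide, read as (original, OCR output).
# The direction of each arrow is an assumption; space errors are left out.
OCR_CONFUSIONS = [
    ("e", "c"), ("u", "I"), ("u", "n"), ("i", "l"), ("c", "e"), ("n", "v"),
    ("l", "i"), ("r", "i"), ("u", "ii"), ("h", "l"), ("h", "ii"), ("e", "o"),
]


def error_rate(incorrect: int, total: int) -> float:
    """Fraction of names that OCR transcribed incorrectly."""
    return incorrect / total


def correction_candidates(ocr_name: str) -> set[str]:
    """Undo one listed confusion at one position, for every position it occurs."""
    candidates = set()
    for original, misread in OCR_CONFUSIONS:
        start = ocr_name.find(misread)
        while start != -1:
            fixed = ocr_name[:start] + original + ocr_name[start + len(misread):]
            candidates.add(fixed)
            start = ocr_name.find(misread, start + 1)
    return candidates


if __name__ == "__main__":
    print(f"{error_rate(1056, 3003):.2%}")
    print("Quercus alba" in correction_candidates("Qnercus alba"))
```

Each candidate would still need to be checked against an authority list, since most single-character reversals produce strings that are not names.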
    25. Current image delivery: djatoka
      • Images stored as JPEG2000 (.jp2)
      • Decoded & delivered to browser via djatoka
        • Open source JP2 image server
        • Developed by digital librarians
        • Scalable
        • Rapid development cycle (v1.1)
        • Growing community of users
    26. BHL/IA architecture: image delivery via djatoka
      • A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
      • www.biodiversitylibrary.org (St. Louis): /page/1274907 is looked up as pageid: 1274907 in BHLdb to locate the image
      • IA (San Francisco): http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2
      • images.biodivlibrary.org (Woods Hole): djatoka converts .jp2 to .jpg
      • Browser: IIPViewer
    27.  
    28.  
    29. New delivery option: IA Bookreader
      • Open source
      • Example: Flora medica
        • http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
    30. IA Book Viewer http://www.us.archive.org/GnuBook/?id=floramedicaodera118diet#229
    31. APIs & Data Sharing
      • Name Service (Documentation)
        • REST: XML or JSON
      • Data Export (Documentation)
        • Monthly export of BHL titles, volumes, pages, names, other metadata in delimited files
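
A consumer of the monthly export might load one of the delimited files as follows. This is a minimal sketch: the slide does not specify the delimiter or the column names, so the tab delimiter, the `TitleID` and `ShortTitle` columns, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical sample: column names, IDs, and the tab delimiter are assumptions.
SAMPLE_EXPORT = (
    "TitleID\tShortTitle\n"
    "1\tFlora medica\n"
    "2\tMushrooms of America\n"
)


def read_export(handle, delimiter: str = "\t") -> list[dict[str, str]]:
    """Read a delimited export file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter=delimiter))


if __name__ == "__main__":
    for row in read_export(io.StringIO(SAMPLE_EXPORT)):
        print(row["TitleID"], row["ShortTitle"])
```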
    32. *Soon: Citation resolver via OpenURL
        • Beetle, A. A. 1977. Noteworthy grasses from Mexico V. Phytologia 37(4): 317–407.
        • http://example.edu/cgi?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:article&rft.jtitle=Phytologia&rft.atitle=Noteworthy+grasses+from+Mexico&rft.aulast=Beetle&rft.aufirst=A&rft.date=1977&rft.volume=37&rft.issue=4&rft.spage=317&rft.epage=407
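
The OpenURL above can be assembled mechanically from the parts of the citation. The sketch below shows one way to do that with the standard library; the `build_openurl` helper is a name introduced here for illustration, and `http://example.edu/cgi` is the placeholder resolver address from the slide, not a real service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The citation from the slide, broken into OpenURL journal-article keys.
CITATION = {
    "rft.jtitle": "Phytologia",
    "rft.atitle": "Noteworthy grasses from Mexico",
    "rft.aulast": "Beetle",
    "rft.aufirst": "A",
    "rft.date": "1977",
    "rft.volume": "37",
    "rft.issue": "4",
    "rft.spage": "317",
    "rft.epage": "407",
}


def build_openurl(resolver: str, citation: dict[str, str]) -> str:
    """Serialize a journal-article citation as an OpenURL key/value query."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:article",
        **citation,
    }
    # urlencode percent-encodes reserved characters and writes spaces as "+".
    return f"{resolver}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_openurl("http://example.edu/cgi", CITATION))
```

The output percent-encodes the colons and slashes in `rft_val_fmt`; a resolver decodes them back to the same value shown on the slide.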
    33. Articles
    34.  
    35.  
    36.  
    37.  
    38.  
    39.  
    40. Article repository
      • Needed a way to display these PDFs
      • Wanted to extend contribution functionality to users
      • “Safe harbor” model
        • BHL provides platform
        • Community provides content
          • Scientists, students, libraries
    41. http://cite.biodiversitylibrary.org
      • Drupal with Biblio module
      • Multi-lingual interface
      • Customizable display, layout
      • Solr search/faceting
      • OAI & other services for discovery/sharing
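
To show what discovery over OAI could look like from the harvesting side, the sketch below builds a standard OAI-PMH `ListRecords` request and pulls Dublin Core titles out of a response. The endpoint path and the sample response are hypothetical; the slides do not give the repository's actual OAI address, and the sample record simply reuses a title that appears elsewhere in this talk.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint: the real OAI base URL is not given in the slides.
BASE_URL = "http://cite.biodiversitylibrary.org/oai"

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (illustrative, not real harvested data).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Noteworthy grasses from Mexico V</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_titles(xml_text: str) -> list[str]:
    """Return every dc:title found in an OAI-PMH response."""
    root = ET.fromstring(xml_text.strip())
    return [el.text for el in root.iterfind(".//dc:title", NS)]


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(record_titles(SAMPLE_RESPONSE))
```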
    42.  
    43.  
    44.  
    45.  
    46. Outreach
    47. BHL Blog
      • Updates
      • Announcements
      • 1,500 users / month
    48. Twitter
      • twitter.com/BioDivLibrary
      • Communication tool
        • Connecting with LinkedData community, other users
        • Receiving assistance, guidance
        • FAST turnaround
    49. If BHL-E is not a Research Project…
    50. Technologies in hand:
      • TaxonFinder
      • djatoka
      • IA Bookreader
      • Drupal/Biblio
      • OAI-PMH
      • OpenURL
      • Fedora Commons
    51. Needed:
      • Deduplication Tools
      • Storage
      • OCR
      • Markup/rekeying
      • UI/UX
      • Interface translation
      • Data synchronization
    52. Thank you
        • Chris Freeland
        • 4344 Shaw Blvd.
        • St. Louis, MO 63110
        • [email_address]
        • http://www.biodiversitylibrary.org

    Presented at BHL-Europe Kickoff Meeting.
    Museum fü…
