BHL / EOL technology sit down

Published in: Technology

Transcript

  • 1. BHL / BIG
    2 Mar 2010
    Woods Hole
  • 2. BHL: Content & Usage
    Zea mays: http://www.eol.org/pages/1115259
    Literature: http://biodiversitylibrary.org/name/Zea_mays
  • 3. Size of BHL
    24TB & growing!
  • 4. Workflow
    Conservation
    Digitization
    Selection
    Preparation
    Post Production
    (Re)publication
  • 5.
  • 6.
  • 7. Scanning = human work
  • 8. Scan & Store: Internet Archive
    Storage in Petaboxes
    Scanning on Scribes
  • 9.
  • 10. Scanning Derivatives
    Master
    Derivatives
    XML
    JP2
    PDF
    JPG
    TXT
    DjVu
    PDF
    OCR
    JP2
    XML
  • 11. Petabox cluster
    Internet Archive
    Image Server
    Cluster
    MBL
    MOBOT
  • 12.
  • 13. CiteBank
    BHL Data Flow – Sep 2009
  • 14. Usage: 1 Jan 08 – 31 Jan 10
    Daily average
    1,026 visitors
    1,680 visits / day
    8,200 pageviews / day
  • 15. Referrers: 1 Jan 08 – 31 Jan 10
  • 16.
  • 17. BHL: App / UI / Services
  • 18. BHL Development Team
    <- Mike
    Phil ->
  • 19. Name Finding via TaxonFinder
  • 20. Name Finding in action, with Taxonomic Intelligence…
    Image from Scanner
    Converted to text via OCR
    Name finding via TaxonFinder
    Extract names
    Submit to NameBank
    SOAP response
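
The name-finding flow on this slide (OCR text in, candidate names out, candidates checked against a name registry) can be sketched roughly as follows. This is not TaxonFinder's algorithm or the NameBank API: the regular expression is a deliberately naive stand-in for name detection, and `KNOWN_NAMES` is a hypothetical in-memory substitute for a NameBank lookup.

```python
import re

# Hypothetical stand-in for a NameBank lookup; the real service is remote.
KNOWN_NAMES = {"Zea mays", "Physeter catodon"}

# Naive pattern: a capitalized word followed by a lowercase word.
# It over-matches ordinary prose, which is why verification matters.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")

def find_names(ocr_text):
    """Split candidate name strings into verified and unverified lists."""
    verified, unverified = [], []
    for match in CANDIDATE.finditer(ocr_text):
        candidate = match.group(0)
        if candidate in KNOWN_NAMES:
            verified.append(candidate)
        else:
            unverified.append(candidate)
    return verified, unverified

page = ("Specimens of Zea mays were noted. "
        "Physeter catodon appears on the next page. The whale is large.")
verified, unverified = find_names(page)
```

On this sample the naive pattern also picks up ordinary phrases such as "The whale", which end up in the unverified list.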
  • 21. Name Finding Evaluation
    Structured and performed by Qin Wei
    Ph.D. student at UIUC, working with Bryan Heidorn
    Methodology
    Scholarly volunteers manually identified scientific names on random sample of 392 pages in BHL corpus
    Compared those against OCR, then two name finding algorithms (TaxonFinder & FAT)
    Goals
    Spark discussion, set baseline for future work
  • 22. Characteristics of sample
    = 86.91%
  • 23. 35.16%
    OCR error rate for names only
    Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
    Top OCR errors
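
The headline figure follows directly from the two counts on this slide. A minimal check, using only those counts (the variable names are illustrative):

```python
total_names = 3003   # names identified by volunteers in the sample
ocr_errors = 1056    # names incorrectly transcribed by OCR

error_rate = ocr_errors / total_names
print(f"{error_rate:.2%}")  # prints 35.16%
```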
  • 24. Considerations
    Improving OCR software is out of scope
    Google’s Tesseract is the only viable open source option
    Flurry of activity in 2006-2007, quiet since
    Rekeying is expensive given size of corpus
    Will not scale
  • 25. Name finding statistics
    27.7 million pages scanned
    70.4 million name strings found
    56.2 million names verified with a NameBankID
    1.4 million unique names with a NameBankID
    3.3 million unique names *without* a NameBankID
    This is where the interesting data live!!!
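
The counts above can be turned into proportions to show how much of the corpus falls outside NameBank. This is a rough sketch: it assumes the "with" and "without" unique-name counts together make up all unique names found, which the slide does not state explicitly.

```python
name_strings_found = 70.4e6   # name strings found
verified_strings = 56.2e6     # name strings verified with a NameBankID
unique_with_id = 1.4e6        # unique names with a NameBankID
unique_without_id = 3.3e6     # unique names without a NameBankID

# Share of all found name strings that were verified.
verified_share = verified_strings / name_strings_found

# Share of unique names lacking a NameBankID (assumes the two unique
# counts partition the set of unique names).
unverified_unique_share = unique_without_id / (unique_with_id + unique_without_id)

print(f"{verified_share:.1%} of name strings verified")
print(f"{unverified_unique_share:.1%} of unique names have no NameBankID")
```

Under that assumption, roughly 80% of name strings are verified, yet about 70% of unique names have no NameBankID.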
  • 26. http://www.biodiversitylibrary.org/name/Physeter_catodon
  • 27. But where are the articles??
  • 28.
  • 29.
  • 30.
  • 31.
  • 32. PDF Generation Stats
  • 33. Mandate for new development
    display / manage articles
    meet community demands for bibliography / citation management
    build from more open source tools
  • 34. Development goals re: citations
    Create a repository for community-vetted taxonomic bibliographies.
    Ability to ingest, display, download, and index articles so that the BHL can operate as an article repository.
    Build from existing community of work around Drupal / Biblio.
    In use by collaborators
  • 35. http://www.citebank.org
  • 36. http://citebank.org/search
  • 37. http://citebank.org/node/47423
  • 38.
  • 39. Services
    OpenURL
    Facilitate links to citations: protologues, articles, references
    Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx
    Useful to Nomenclators, Reference Systems
    IPNI
    Tropicos
    Names Service
    Return all occurrences of a name throughout BHL digitized corpus
    Documentation: http://bit.ly/2e6sg9
    Access to 51 million name strings using TaxonFinder
    1.4 million unique names
    Working out a strategy for obscure species
    Algorithm improvements to detect nomenclatural & taxonomic acts
    New API
  • 40. Services: OpenURL
    http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
    http://www.tropicos.org/Name/1200408
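
A resolver link like the one on this slide can be assembled from citation parts. The sketch below uses only the parameter names visible in the example URL (`pid`, `volume`, `issue`, `spage`, `date`); the helper name `bhl_openurl` is hypothetical, and the OpenURL documentation linked on the Services slide remains the authority on which parameters are accepted.

```python
from urllib.parse import urlencode

BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"

def bhl_openurl(title_id, volume="", issue="", spage="", date=""):
    """Build a BHL OpenURL query from citation parts (hypothetical helper)."""
    params = {
        "pid": f"title:{title_id}",
        "volume": volume,
        "issue": issue,
        "spage": spage,
        "date": date,
    }
    # safe=":" keeps the colon in "title:3934" unescaped, as on the slide.
    return BHL_OPENURL + "?" + urlencode(params, safe=":")

url = bhl_openurl(3934, volume=14, spage=301, date=1879)
print(url)
```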
  • 41. Services: OpenURL Disambiguation
    Looking for:
    BHL returns:
  • 42. Services: OpenURL Results
  • 43. How?
    Tropicos maintains internal authority list of publications:
    Each protologue/reference tied to authority:
    Matched Tropicos TitleIDs to BHL TitleIDs:
    Throw citations at resolver at regular intervals & cache data in Tropicos
    http://www.tropicos.org/Publication/775
    http://www.tropicos.org/Publication/775 = http://www.biodiversitylibrary.org/title/3934
    http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
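
The approach on this slide (match publication IDs, send citations to the resolver, cache the answers) might look roughly like this. It is a sketch, not Tropicos code: the mapping table holds only the one pair from the slide, `resolve` and `fake_fetch` are hypothetical names, and the network call is replaced by a stub so the example runs offline.

```python
# Hypothetical mapping of Tropicos publication IDs to BHL title IDs;
# the single pair shown is the example from the slide.
TROPICOS_TO_BHL = {775: 3934}

_cache = {}

def resolve(tropicos_pub_id, volume, spage, date, fetch):
    """Return the cached resolver result for a citation, fetching on a miss."""
    bhl_title = TROPICOS_TO_BHL.get(tropicos_pub_id)
    if bhl_title is None:
        return None  # no matched BHL title for this publication
    key = (bhl_title, volume, spage, date)
    if key not in _cache:
        url = ("http://www.biodiversitylibrary.org/openurl?"
               f"pid=title:{bhl_title}&volume={volume}&issue=&spage={spage}&date={date}")
        _cache[key] = fetch(url)
    return _cache[key]

calls = []

def fake_fetch(url):
    """Stub standing in for an HTTP request to the resolver."""
    calls.append(url)
    return "stub result for " + url

first = resolve(775, 14, 301, 1879, fake_fetch)
second = resolve(775, 14, 301, 1879, fake_fetch)  # served from the cache
```

Passing the fetch function in as an argument keeps the caching logic testable without touching the live resolver.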
  • 44. BHL Name Services
    http://www.biodiversitylibrary.org/services/name/NameService.asmx
  • 45. Other consumers
    EarthCape Lab
    bioGUID
    BioStor
    Research projects
    BREC - NSF
    Conjecturator - NSF
    Darwin’s Library – NEH/JISC
    Hong Cui @ University of AZ - NSF
  • 46. http://bioguid.info/bhl/compare.php?name1=Physeter+catodon&name2=Physeter+macrocephalus
  • 47. Hardware / infrastructure
  • 48. <insert Phil here>
  • 49. Global BHL
  • 50. Global BHL
  • 51.
  • 52. Global BHL Nodes
    BHL-Australia
    http://ec2-75-101-224-221.compute-1.amazonaws.com/
    BHL-China
    http://bhl-china.org
    BHL-Europe
    http://biodiversitylibrary.eu
  • 53. BHL: collaboration w/ BIG
  • 54. <insert discussion here>
    Existing issues in Jira
    Taxonomic name finding enhancements
    Nomenclatural acts in web services
    Other algorithms / verification
    WoRMS data
    Improvement
    Ranking results
    Visualization
    LifeDesks
    Bibliography sharing
    Resolve to articles
  • 55. Thanks!
    Chris Freeland
    chris.freeland@mobot.org