BHL / EOL technology sit down

Published in: Technology

  1. 1. BHL / BIG <br />2 Mar 2010<br />Woods Hole<br />
  2. 2. BHL: Content & Usage<br />Zea mays: http://www.eol.org/pages/1115259<br />Literature: http://biodiversitylibrary.org/name/Zea_mays<br />
  3. 3. Size of BHL<br />24TB & growing!<br />
  4. 4. Workflow<br />Conservation<br />Digitization<br />Selection<br />Preparation<br />Post Production<br />(Re)publication<br />
  5. 5.
  6. 6.
  7. 7. Scanning = human work<br />
  8. 8. Scan & Store: Internet Archive<br />Storage in Petaboxes<br />Scanning on Scribes<br />
  9. 9.
  10. 10. Scanning Derivatives<br />Master<br />Derivatives<br />XML<br />JP2<br />PDF<br />JPG<br />TXT<br />DjVu<br />PDF<br />OCR<br />JP2<br />XML<br />
  11. 11. Petabox cluster<br />Internet Archive<br />Image Server<br />Cluster<br />MBL<br />MOBOT<br />
  12. 12.
  13. 13. CiteBank<br />BHL Data Flow – Sep 2009<br />
  14. 14. Usage: 1 Jan 08 – 31 Jan 10 <br />Daily average<br />1,026 visitors<br />1,680 visits / day<br />8,200 pageviews / day<br />
  15. 15. Referrers: 1 Jan 08 – 31 Jan 10<br />
  16. 16.
  17. 17. BHL: App / UI / Services<br />Zea mays: http://www.eol.org/pages/1115259<br />Literature: http://biodiversitylibrary.org/name/Zea_mays<br />
  18. 18. BHL Development Team<br /><- Mike<br />Phil -><br />
  19. 19. Name Finding via TaxonFinder<br />
  20. 20. Name Finding in action<br />with Taxonomic Intelligence…<br />Image from Scanner<br />Converted to text via OCR<br />Name finding via TaxonFinder<br />Extract names<br />Submit to NameBank<br />SOAP response<br />
  21. 21. Name Finding Evaluation<br />Structured and performed by Qin Wei<br />Ph.D. student at UIUC, working with Bryan Heidorn<br />Methodology<br />Scholarly volunteers manually identified scientific names on a random sample of 392 pages in the BHL corpus<br />Compared those against OCR, then two name finding algorithms (TaxonFinder & FAT)<br />Goals<br />Spark discussion, set baseline for future work<br />
  22. 22. Characteristics of sample<br />= 86.91%<br />
  23. 23. 35.16%<br />OCR error rate for names only<br />Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.<br />Top OCR errors<br />
  24. 24. Considerations<br />Improving OCR software is out of scope<br />Google’s Tesseract is only viable open source option<br />Flurry of activity in 2006-2007, quiet since<br />Rekeying is expensive given size of corpus<br />Will not scale <br />
  25. 25. Name finding statistics<br />27.7 million pages scanned<br />70.4 million name strings found<br />56.2 million names verified with a NameBankID<br />1.4 million unique names with a NameBankID<br />3.3 million unique names *without* a NameBankID<br />This is where the interesting data live!!!<br />
  26. 26. http://www.biodiversitylibrary.org/name/Physeter_catodon<br />
  27. 27. But where are the articles??<br />
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32. PDF Generation Stats<br />
  33. 33. Mandate for new development<br />display / manage articles<br />meet community demands for bibliography / citation management<br />build from more open source tools<br />
  34. 34. Development goals re: citations<br />Create a repository for community-vetted taxonomic bibliographies.<br />Ability to ingest, display, download, and index articles so that the BHL can operate as an article repository.<br />Build from existing community of work around Drupal / Biblio.<br />In use by collaborators<br />
  35. 35. http://www.citebank.org<br />
  36. 36. http://citebank.org/search<br />
  37. 37. http://citebank.org/node/47423<br />
  38. 38.
  39. 39. Services<br />OpenURL<br />Facilitate links to citations: protologues, articles, references<br />Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx<br />Useful to Nomenclators, Reference Systems<br />IPNI<br />Tropicos<br />Names Service<br />Return all occurrences of a name throughout BHL digitized corpus<br />Documentation: http://bit.ly/2e6sg9<br />Access to 51 million name strings using TaxonFinder<br />1.4 million unique names<br />Working out a strategy for obscure species<br />Algorithm improvements to detect nomenclatural & taxonomic acts<br />New API<br />
  40. 40. Services: OpenURL<br />http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879<br />http://www.tropicos.org/Name/1200408<br />
  41. 41. Services: OpenURL Disambiguation<br />Looking for:<br />BHL returns:<br />
  42. 42. Services: OpenURL Results<br />
  43. 43. How?<br />Tropicos maintains internal authority list of publications:<br />Each protologue/reference tied to authority:<br />Matched Tropicos TitleIDs to BHL TitleIDs:<br />Throw citations at resolver at regular intervals & cache data in Tropicos<br />http://www.tropicos.org/Publication/775<br />http://www.tropicos.org/Publication/775 = http://www.biodiversitylibrary.org/title/3934<br />http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879<br />
  44. 44. BHL Name Services<br />http://www.biodiversitylibrary.org/services/name/NameService.asmx<br />
  45. 45. Other consumers<br />EarthCape Lab<br />BioGuid<br />BioSTOR<br />Research projects<br />BREC - NSF<br />Conjecturator - NSF<br />Darwin’s Library – NEH/JISC<br />Hong Cui @ University of AZ - NSF<br />
  46. 46. http://bioguid.info/bhl/compare.php?name1=Physeter+catodon&name2=Physeter+macrocephalus<br />
  47. 47. Hardware / infrastructure<br />Zea mays: http://www.eol.org/pages/1115259<br />Literature: http://biodiversitylibrary.org/name/Zea_mays<br />
  48. 48. <insert Phil here><br />
  49. 49. Global BHL<br />Zea mays: http://www.eol.org/pages/1115259<br />Literature: http://biodiversitylibrary.org/name/Zea_mays<br />
  50. 50. Global BHL<br />
  51. 51.
  52. 52. Global BHL Nodes<br />BHL-Australia<br />http://ec2-75-101-224-221.compute-1.amazonaws.com/<br />BHL-China<br />http://bhl-china.org<br />BHL-Europe<br />http://biodiversitylibrary.eu<br />
  53. 53. BHL: collaboration w/ BIG<br />Zea mays: http://www.eol.org/pages/1115259<br />Literature: http://biodiversitylibrary.org/name/Zea_mays<br />
  54. 54. <insert discussion here><br />Existing issues in Jira<br />Taxonomic name finding enhancements<br />Nomenclatural acts in web services<br />Other algorithms / verification<br />WoRMS data<br />Improvement<br />Ranking results<br />Visualization<br />LifeDesks<br />Bibliography sharing<br />Resolve to articles<br />
  55. 55. Thanks!<br />Chris Freeland<br />chris.freeland@mobot.org<br />
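
Slides 40 and 43 describe how a Tropicos publication ID is matched to a BHL title ID and then turned into an OpenURL query. A minimal sketch of that link construction follows. The function name `bhl_openurl` and the dictionary `TROPICOS_TO_BHL_TITLE` are hypothetical names introduced for illustration; only the base URL, the parameter names (`pid`, `volume`, `issue`, `spage`, `date`) and the 775 = 3934 pairing come from the slides. It builds the URL string only and does not call the resolver or model the caching step.

```python
from urllib.parse import urlencode

# Base URL of the BHL OpenURL resolver, as shown on slide 40.
BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"

# Hypothetical lookup table from Tropicos publication IDs to BHL title IDs.
# The single entry mirrors the example on slide 43 (Publication/775 = title/3934).
TROPICOS_TO_BHL_TITLE = {775: 3934}


def bhl_openurl(tropicos_pub_id, volume="", issue="", spage="", date=""):
    """Build an OpenURL query string for a citation in a matched title.

    Empty fields are kept as empty parameters, matching the slide's
    example URL (which contains 'issue=' with no value).
    """
    bhl_title_id = TROPICOS_TO_BHL_TITLE[tropicos_pub_id]
    params = [
        ("pid", f"title:{bhl_title_id}"),
        ("volume", volume),
        ("issue", issue),
        ("spage", spage),
        ("date", date),
    ]
    # safe=":" keeps the colon in 'title:3934' unescaped, as in the slide.
    return BHL_OPENURL + "?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(bhl_openurl(775, volume=14, spage=301, date=1879))
```

Keeping the ID mapping in a plain lookup table reflects the approach on slide 43, where title IDs are matched once and citations are then sent to the resolver in batches.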
