GlobalNames - Canadensys - Shorthouse

Summary slides for the AntCat workshop, August 24-26, San Francisco, CA

Published in: Education, Technology, Business

  1. Global Names Recognition and Discovery (GNRD)
     • High-throughput, queue-based « skin » on multiple processes of scientific name-finding engines
       – NetiNeti: Python, machine-learning-based
       – TaxonFinder: Perl, dictionary-based
     • Inputs: any file, URL, free-form text
       – Uses Docsplit gem (Tesseract OCR as needed)
       – Can send gzip request
     • Outputs: JSON/XML
       – Scientific names & their character offsets
       – OCR text
       – Resolved names
  2. GNRD Clients & Applications
  3. 15,000 OCR’d articles, 1868-2002
     • All with DOIs
     • 158,000 unique scientific names
     • 92,000 vernaculars
     • 20,000 entities
  4. No Consistency in Search APIs
     EOL: http://eol.org/api/search/1.0.json?q=Ursus
       {
         "totalResults": 152,
         "startIndex": 1,
         "itemsPerPage": 30,
         "results": [
           {
             "id": 14349,
             "title": "Ursus",
             "link": "http://eol.org/14349?action=overview&controller=taxa",
             "content": "Ursus Linnaeus, 1758; Ursus; Ursus (genus); Ursus (genus) Linnaeus, 1758; Ursus Arctos Bruinosus"
           },
           { ... },
         ],
         "first": "http://eol.org/api/search/Ursus.json?page=1",
         "self": "http://eol.org/api/search/Ursus.json?page=1",
         "next": "http://eol.org/api/search/Ursus.json?page=2",
         "last": "http://eol.org/api/search/Ursus.json?page=6"
       }
     GBIF: http://api.gbif.org/name_usage/search?q=Ursus
       {
         offset: 0,
         limit: 20,
         endOfRecords: false,
         count: 77,
         results: [
           {
             datasetTitle: "English Wikipedia Species Pages",
             parent: "Ursidae",
             kingdom: "Animalia",
             phylum: "Chordata",
             clazz: "Mammalia",
             order: "Carnivora",
             family: "Ursidae",
             genus: "Ursus",
             scientificName: "Ursus",
             canonicalName: "Ursus",
             authorship: "",
             nameType: "WELLFORMED",
             rank: "GENUS",
             …
  5. Use Darwin Core Terms
  6. OpenURL
     • Created in the late 1990s by a Flemish librarian
     • e.g., v0.1: http://resolver.example.edu/cgi?genre=book&isbn=0836218310&title=The+Far+Side+Gallery+3
     • But no specification for response structure!!!
  7. bibJSON
       {
         "title": "Open Bibliography for Science, Technology and Medicine",
         "author": [
           {"name": "Richard Jones"},
           {"name": "Mark MacGillivray"},
           {"name": "Peter Murray-Rust"},
           {"name": "Jim Pitman"},
           {"name": "Peter Sefton"},
           {"name": "Ben O'Steen"},
           {"name": "William Waites"}
         ],
         "type": "article",
         "year": "2011",
         "journal": {"name": "Journal of Cheminformatics"},
         "link": [{"url": "http://www.jcheminf.com/content/3/1/47"}],
         "identifier": [{"type": "doi", "id": "10.1186/1758-2946-3-47"}]
       }
  8. Recommendation
     • Use DwC terms as query params for find, or ‘q’ for search
     • Use DwC terms as keys in JSON responses
     Instead of:
       http://www.antweb.org/description.do?name=claripes%20orbiculatopunctatus&genus=camponotus&rank=species&project=worldants
     Use:
       http://www.antweb.org/description.do?specificEpithet=claripes&infraspecificEpithet=orbiculatopunctatus&genus=camponotus&taxonRank=species&project=worldants
  9. Canadensys: Vascular Plants of Canada (VASCAN)
     Luc Brouillet, Peter Desmet, et al.
  10. http://data.canadensys.net/vascan
  11. http://data.canadensys.net/vascan/name/Carex%20abbreviata
  12. http://data.canadensys.net/vascan/taxon/26512
  13. http://doi.org/10.3897/phytokeys.25.3100
  14. http://creativecommons.org/publicdomain/zero/1.0/
  15. Suggestions for AntCat
      • Run literature through GNRD
      • Simplify web presence with concentration on search as the entry point
        – Index all available content
        – Present « pages » as declaration of relationships
      • Use Darwin Core terms in « find » and « search » services
      • Make a DwC-A, CC0 waiver, and data paper; publish to GBIF; make accessible to GN
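
The recommendation on slide 8 (Darwin Core terms as query parameters and as response keys) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `build_find_url` and `to_dwc` are hypothetical helpers, not part of AntWeb, EOL, or GBIF, and the key mapping is a guess based on the GBIF excerpt on slide 4. The resulting URL follows the slide's proposal and is not claimed to be a service that AntWeb actually answers.

```python
from urllib.parse import urlencode

# Assumed mapping from non-DwC keys seen in the slide 4 GBIF excerpt
# to Darwin Core term names. Keys that are already DwC terms
# (kingdom, family, genus, scientificName, ...) need no entry.
GBIF_TO_DWC = {
    "clazz": "class",
    "rank": "taxonRank",
    "authorship": "scientificNameAuthorship",
}


def build_find_url(base, **dwc_terms):
    """Build a 'find' URL whose query parameters are Darwin Core terms."""
    return base + "?" + urlencode(dwc_terms)


def to_dwc(record, mapping=GBIF_TO_DWC):
    """Rename a record's keys to Darwin Core terms; unmapped keys pass through."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Hypothetical 'find' request using the DwC terms proposed on slide 8.
url = build_find_url(
    "http://www.antweb.org/description.do",
    genus="camponotus",
    specificEpithet="claripes",
    infraspecificEpithet="orbiculatopunctatus",
    taxonRank="species",
)

# A trimmed record in the shape of the GBIF excerpt, renamed to DwC keys.
normalized = to_dwc({"clazz": "Mammalia", "genus": "Ursus", "rank": "GENUS"})
```

With a shared vocabulary on both the request and the response side, a client could reuse the same field names across providers instead of keeping one mapping table per API.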

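Slides 6 and 7 pair a request convention (OpenURL) with a response format (bibJSON). As a rough sketch of how the two could meet, the Python below turns a bibJSON record like the one on slide 7 into an OpenURL 0.1-style query. The function name `bibjson_to_openurl` is hypothetical, the choice of `atitle`, `title`, `date`, and `id` as target keys is an assumption about how one might map the fields, and the resolver address is the placeholder from slide 6.

```python
from urllib.parse import urlencode


def bibjson_to_openurl(record, resolver="http://resolver.example.edu/cgi"):
    """Turn a bibJSON record into an OpenURL 0.1-style query (illustrative only)."""
    params = {"genre": record.get("type", "article")}
    if "title" in record:
        params["atitle"] = record["title"]      # article title
    journal = record.get("journal", {}).get("name")
    if journal:
        params["title"] = journal               # journal title
    if "year" in record:
        params["date"] = record["year"]
    # Use the first DOI found among the record's identifiers, if any.
    for ident in record.get("identifier", []):
        if ident.get("type") == "doi":
            params["id"] = "doi:" + ident["id"]
            break
    return resolver + "?" + urlencode(params)


# Record trimmed from the bibJSON example on slide 7.
record = {
    "title": "Open Bibliography for Science, Technology and Medicine",
    "type": "article",
    "year": "2011",
    "journal": {"name": "Journal of Cheminformatics"},
    "identifier": [{"type": "doi", "id": "10.1186/1758-2946-3-47"}],
}
openurl = bibjson_to_openurl(record)
```

The sketch only covers the request side; as slide 6 notes, OpenURL says nothing about the response structure, which is the gap a format such as bibJSON could fill.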