GlobalNames - Canadensys - Shorthouse


Summary slides for AntCat workshop August 24-26 San Francisco, CA

Published in: Education, Technology, Business

  1. Global Names Recognition and Discovery (GNRD)
     • High throughput, queue-based « skin » on multiple processes of scientific name-finding engines
       - NetiNeti: Python, machine-learning-based
       - TaxonFinder: Perl, dictionary-based
     • Inputs: any file, URL, free-form text
       - Uses Docsplit gem (Tesseract OCR as needed)
       - Can send gzip request
     • Outputs: JSON/XML
       - Scientific names & their character offsets
       - OCR text
       - Resolved names
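As a rough illustration of why character offsets in the output are useful, the sketch below marks up found names in the source text. The field names (`scientificName`, `offsetStart`, `offsetEnd`), the end-exclusive offset convention, and the sample sentence are assumptions made for this example, not a documented GNRD response schema.

```python
# Hypothetical name-finding result; field names and the end-exclusive
# offset convention are assumptions for illustration only.
text = "Specimens of Ursus arctos and Carex abbreviata were examined."

names = [
    {"scientificName": "Ursus arctos", "offsetStart": 13, "offsetEnd": 25},
    {"scientificName": "Carex abbreviata", "offsetStart": 30, "offsetEnd": 46},
]


def tag_names(text, names):
    """Wrap each found name in square brackets using its character offsets."""
    pieces = []
    cursor = 0
    # Process names left to right so untouched text can be copied in between.
    for name in sorted(names, key=lambda n: n["offsetStart"]):
        pieces.append(text[cursor:name["offsetStart"]])
        pieces.append("[" + text[name["offsetStart"]:name["offsetEnd"]] + "]")
        cursor = name["offsetEnd"]
    pieces.append(text[cursor:])
    return "".join(pieces)


print(tag_names(text, names))
```

Because the offsets point back into the OCR text, the same approach could drive highlighting or linking without searching for the name strings again.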
  2. GNRD Clients & Applications
  3. 15,000 OCR’d articles, 1868-2002
     All with DOIs
     158,000 unique scientific names
     92,000 vernaculars
     20,000 entities
  4. No Consistency in Search APIs

     http://eol.org/api/search/1.0.json?q=Ursus

       {
         "totalResults": 152,
         "startIndex": 1,
         "itemsPerPage": 30,
         "results": [
           {
             "id": 14349,
             "title": "Ursus",
             "link": "http://eol.org/14349?action=overview&controller=taxa",
             "content": "Ursus Linnaeus, 1758; Ursus; Ursus (genus); Ursus (genus) Linnaeus, 1758; Ursus Arctos Bruinosus"
           },
           { ... }
         ],
         "first": "http://eol.org/api/search/Ursus.json?page=1",
         "self": "http://eol.org/api/search/Ursus.json?page=1",
         "next": "http://eol.org/api/search/Ursus.json?page=2",
         "last": "http://eol.org/api/search/Ursus.json?page=6"
       }

     http://api.gbif.org/name_usage/search?q=Ursus

       {
         offset: 0,
         limit: 20,
         endOfRecords: false,
         count: 77,
         results: [
           {
             datasetTitle: "English Wikipedia Species Pages",
             parent: "Ursidae",
             kingdom: "Animalia",
             phylum: "Chordata",
             clazz: "Mammalia",
             order: "Carnivora",
             family: "Ursidae",
             genus: "Ursus",
             scientificName: "Ursus",
             canonicalName: "Ursus",
             authorship: "",
             nameType: "WELLFORMED",
             rank: "GENUS",
             …
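The practical cost of this inconsistency is that every client needs one adapter per service. The sketch below shows what that might look like for the two responses on this slide. The sample dictionaries are trimmed copies of the slide content. The common output shape (`total`, `offset`, `limit`, `names`) is a hypothetical choice for illustration, and treating `startIndex` as 1-based is an assumption read off the slide.

```python
# Trimmed copies of the two responses shown on the slide.
eol_sample = {
    "totalResults": 152,
    "startIndex": 1,
    "itemsPerPage": 30,
    "results": [{"id": 14349, "title": "Ursus"}],
}

gbif_sample = {
    "offset": 0,
    "limit": 20,
    "endOfRecords": False,
    "count": 77,
    "results": [{"scientificName": "Ursus", "rank": "GENUS"}],
}


def normalize_eol(resp):
    """Map the EOL-style envelope onto a common (hypothetical) shape."""
    return {
        "total": resp["totalResults"],
        "offset": resp["startIndex"] - 1,  # assumes startIndex is 1-based
        "limit": resp["itemsPerPage"],
        "names": [r["title"] for r in resp["results"]],
    }


def normalize_gbif(resp):
    """Map the GBIF-style envelope onto the same common shape."""
    return {
        "total": resp["count"],
        "offset": resp["offset"],
        "limit": resp["limit"],
        "names": [r["scientificName"] for r in resp["results"]],
    }


print(normalize_eol(eol_sample))
print(normalize_gbif(gbif_sample))
```

If both services shared one vocabulary for paging and for record keys, neither adapter would be needed.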
  5. Use Darwin Core Terms
  6. OpenURL
     • Created in the late 1990s by a Flemish librarian
     • e.g., v0.1:
       http://resolver.example.edu/cgi?genre=book&isbn=0836218310&title=The+Far+Side+Gallery+3
     • But no specification for the response structure!
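The request side of OpenURL is simple enough to build with a standard library, which is part of its appeal. The sketch below reproduces the example link from this slide. `build_openurl` is a hypothetical helper name, and only the three keys shown on the slide are used.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver, **fields):
    """Append key/value metadata to a resolver base URL as a query string."""
    # urlencode uses '+' for spaces, matching the slide's example.
    return resolver + "?" + urlencode(fields)


url = build_openurl(
    "http://resolver.example.edu/cgi",
    genre="book",
    isbn="0836218310",
    title="The Far Side Gallery 3",
)
print(url)

# Any client can recover the same metadata from the link.
print(parse_qs(urlsplit(url).query))
```

The query is well defined, but nothing comparable tells the client what it will get back, which is the gap the slide points out.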
  7. bibJSON

       {
         "title": "Open Bibliography for Science, Technology and Medicine",
         "author": [
           {"name": "Richard Jones"},
           {"name": "Mark MacGillivray"},
           {"name": "Peter Murray-Rust"},
           {"name": "Jim Pitman"},
           {"name": "Peter Sefton"},
           {"name": "Ben O'Steen"},
           {"name": "William Waites"}
         ],
         "type": "article",
         "year": "2011",
         "journal": {"name": "Journal of Cheminformatics"},
         "link": [{"url": "http://www.jcheminf.com/content/3/1/47"}],
         "identifier": [{"type": "doi", "id": "10.1186/1758-2946-3-47"}]
       }
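Because the record above is plain JSON with predictable keys, a consumer needs very little code to use it. A minimal sketch, using the record from this slide. The helper names `identifier` and `author_names` are hypothetical.

```python
import json

# The bibJSON record from the slide.
record = json.loads("""
{
  "title": "Open Bibliography for Science, Technology and Medicine",
  "author": [
    {"name": "Richard Jones"},
    {"name": "Mark MacGillivray"},
    {"name": "Peter Murray-Rust"},
    {"name": "Jim Pitman"},
    {"name": "Peter Sefton"},
    {"name": "Ben O'Steen"},
    {"name": "William Waites"}
  ],
  "type": "article",
  "year": "2011",
  "journal": {"name": "Journal of Cheminformatics"},
  "link": [{"url": "http://www.jcheminf.com/content/3/1/47"}],
  "identifier": [{"type": "doi", "id": "10.1186/1758-2946-3-47"}]
}
""")


def identifier(record, id_type):
    """Return the first identifier of the given type, or None if absent."""
    for ident in record.get("identifier", []):
        if ident.get("type") == id_type:
            return ident.get("id")
    return None


def author_names(record):
    """Return the author names in the order they appear in the record."""
    return [author["name"] for author in record.get("author", [])]


print(identifier(record, "doi"))
print(author_names(record))
```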
  8. Recommendation
     • Use DwC terms as query params for find or ‘q’ for search
     • Use DwC terms as keys in JSON responses

     http://www.antweb.org/description.do?name=claripes%20orbiculatopunctatus&genus=camponotus&rank=species&project=worldants

     http://www.antweb.org/description.do?specificEpithet=claripes&infraspecificEpithet=orbiculatopunctatus&genus=camponotus&taxonRank=species&project=worldants
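The change between the two URLs above can be expressed as a small parameter mapping. The sketch below rewrites the first URL into the second. The mapping rules (`name` split into `specificEpithet` and `infraspecificEpithet`, `rank` renamed to `taxonRank`) are inferred from this slide's example only, and the function names are hypothetical; this is not AntWeb's implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_dwc_params(params):
    """Rename legacy query parameters to Darwin Core terms (illustrative)."""
    out = {}
    for key, value in params.items():
        if key == "name":
            # Assumes 'name' holds "epithet [infraspecific epithet]".
            parts = value.split()
            out["specificEpithet"] = parts[0]
            if len(parts) > 1:
                out["infraspecificEpithet"] = parts[1]
        elif key == "rank":
            out["taxonRank"] = value
        else:
            # 'genus' is already a DwC term; other keys pass through unchanged.
            out[key] = value
    return out


def rewrite_url(url):
    """Return the same URL with its query expressed in DwC terms."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    return urlunsplit(parts._replace(query=urlencode(to_dwc_params(params))))


legacy = (
    "http://www.antweb.org/description.do?name=claripes%20orbiculatopunctatus"
    "&genus=camponotus&rank=species&project=worldants"
)
print(rewrite_url(legacy))
```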
  9. Canadensys: Vascular Plants of Canada (VASCAN)
     Luc Brouillet, Peter Desmet, et al.
  10. http://data.canadensys.net/vascan
  11. http://data.canadensys.net/vascan/name/Carex%20abbreviata
  12. http://data.canadensys.net/vascan/taxon/26512
  13. http://doi.org/10.3897/phytokeys.25.3100
  14. http://creativecommons.org/publicdomain/zero/1.0/
  15. Suggestions for AntCat
     • Run literature through GNRD
     • Simplify web presence with concentration on search as the entry point
       - Index all available content
       - Present « pages » as declaration of relationships
     • Use Darwin Core terms in « find » and « search » services
     • Make DwC-A, CC-0 waiver, data paper & publish to GBIF, make accessible to GN