Scaling Namefinding


Published in: Technology
  • Scaling Namefinding

    1. current BHL
    2. future
    3. process: disambiguate / identify, ID lookup, reconcile
    4. process: disambiguate / identify, ID lookup, reconcile
       Mature & scalable, well defined and standardized
    5. process: disambiguate / identify, ID lookup, reconcile
       in progress, needs API & standard
    6. process: disambiguate / identify, ID lookup, reconcile
       GNI has API, needs standards
    7. current response
       <entity>
         <nameString>Abietineae</nameString>
         <namebankID>8401003</namebankID>
         <weblinks>
           <website>
             <title>Tropicos</title>
             <link></link>
             <logo></logo>
             <links>
               <link nameString="Abietineae Eichler"></link>
             </links>
           </website>
         </weblinks>
       </entity>
    8. issues
       • the TF API is doing jobs it shouldn’t do
       • Namebank is a large but outdated dataset
       • “taxonfinder” has no idea what a namebank ID actually is, it only knows strings
       • current code is completely dependent on and is not scalable
    9. why change?
       • scaling - we can run 10,000 taxonfinding processes using any algorithm that supports the standard. Super fast indexing of BHL
       • future-proofing for devs - any new namefinding tool can take advantage of the API and doesn’t need to write a webservice or API of its own
       • future-proofing for BHL - any new namefinding tool can be added with one parameter (&client=taxonfinder | &client=neti)
       • reliability - existing TF API goes down when Rod runs a screen scraping tool on
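The "one parameter" idea on this slide can be sketched as a small registry that maps a `client` value to a namefinding function. This is only an illustration under assumptions: the `FINDERS` registry, the `register` decorator, `find_names`, and the two toy engines are hypothetical names, and the toy engines are stand-ins, not the real TaxonFinder or NetiNeti algorithms.

```python
from typing import Callable, Dict, List, Tuple

# A found name: (verbatim string, start offset, end offset).
Found = Tuple[str, int, int]
Finder = Callable[[str], List[Found]]

# Hypothetical registry: maps a `client` value to a namefinding function.
FINDERS: Dict[str, Finder] = {}


def register(client: str) -> Callable[[Finder], Finder]:
    """Decorator that makes an engine available under a `client` value."""
    def wrap(fn: Finder) -> Finder:
        FINDERS[client] = fn
        return fn
    return wrap


@register("taxonfinder")
def toy_taxonfinder(text: str) -> List[Found]:
    # Stand-in only: reports every occurrence of one hard-coded name.
    name = "Mus musculus"
    found = []
    start = text.find(name)
    while start != -1:
        found.append((name, start, start + len(name)))
        start = text.find(name, start + 1)
    return found


@register("neti")
def toy_neti(text: str) -> List[Found]:
    # Stand-in only: a second engine that finds nothing.
    return []


def find_names(text: str, client: str = "taxonfinder") -> List[Found]:
    """Dispatch to whichever engine the `client` parameter selects."""
    try:
        finder = FINDERS[client]
    except KeyError:
        raise ValueError(f"unknown client: {client!r}")
    return finder(text)
```

Under this sketch, adding a new engine means registering one more function; callers only change the `client` value they pass.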
    10. new API spec
        API specs
        Request
        • input (string)
        • type (text, url)
        • format (xml=default, json)
        Response
        XML Response
        A response example that corresponds to the xml schema:
        <names xmlns="" xmlns:dwc="">
          <name>
            <verbatim>T. rotundata</verbatim>
            <dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
            <!--   0-100   -->
            <score>100</score>
            <offset start="4550" end="4573" />
          </name>
        </names>
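A rough sketch of how a client might use this spec: build a request from the three parameters, then read names, scores, and offsets out of the XML response. Several things here are assumptions, not part of the slides: the endpoint `https://example.org/find`, the function names, and the two placeholder namespace URIs in `SAMPLE` (the slide shows them empty). The parser matches on local tag names so it does not depend on whichever URIs the real service uses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/find"  # hypothetical endpoint, not from the slides


def build_request_url(input_value: str, type_: str = "text", format_: str = "xml") -> str:
    """Encode the three request parameters named in the spec."""
    if type_ not in ("text", "url"):
        raise ValueError("type must be 'text' or 'url'")
    if format_ not in ("xml", "json"):
        raise ValueError("format must be 'xml' or 'json'")
    query = urlencode({"input": input_value, "type": type_, "format": format_})
    return BASE_URL + "?" + query


def local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_names(xml_text: str) -> list:
    """Turn a <names> response into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    results = []
    for name in root:
        if local(name.tag) != "name":
            continue
        record = {}
        for child in name:
            key = local(child.tag)
            if key == "offset":
                record["start"] = int(child.get("start"))
                record["end"] = int(child.get("end"))
            elif key == "score":
                record["score"] = int(child.text)
            else:
                record[key] = child.text
        results.append(record)
    return results


# Same shape as the slide's example; both namespace URIs are placeholders.
SAMPLE = """<names xmlns="urn:example:names" xmlns:dwc="urn:example:dwc">
  <name>
    <verbatim>T. rotundata</verbatim>
    <dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
    <score>100</score>
    <offset start="4550" end="4573" />
  </name>
</names>"""

names = parse_names(SAMPLE)
```

The result is one dict per `<name>` element, which keeps the verbatim string, the expanded scientific name, the score, and the offsets together for later steps.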
    11. New API
        you give us text, we give you strings and offsets. This is the limit of what a “namefinding” tool can and should do
        separately you also need IDs.. Namebank, EOL, tropicos, gn*, GBIF...
        once you know Mus musculus is EOL ID “9872332” you don’t need to know that again. If a book on mice has 40,000 instances of Mus musculus, you need to know where they are, but not the NameBank ID 40,000 times.. (this is a scaling problem..)
        Where do we get these? GNI has 19.3m names & IDs.
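The scaling point on this slide (keep every offset, but look up each ID only once) can be sketched as two separate steps. The function names and the toy lookup table are assumptions for illustration; a real lookup would call an ID service such as GNI, whose API is not shown in the slides. The ID value is the one quoted on the slide.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A found name: (name string, start offset, end offset).
Found = Tuple[str, int, int]


def index_occurrences(found: Iterable[Found]) -> Dict[str, List[Tuple[int, int]]]:
    """Group offsets by name string: every occurrence is kept."""
    positions = defaultdict(list)
    for name, start, end in found:
        positions[name].append((start, end))
    return dict(positions)


def resolve_ids(names: Iterable[str],
                lookup: Callable[[str], Optional[str]]) -> Dict[str, Optional[str]]:
    """Call the ID lookup once per distinct name, not once per occurrence."""
    return {name: lookup(name) for name in set(names)}


# Toy stand-in for an ID service; it records each call so the count is visible.
calls: List[str] = []
TOY_IDS = {"Mus musculus": "9872332"}


def toy_lookup(name: str) -> Optional[str]:
    calls.append(name)
    return TOY_IDS.get(name)


found = [
    ("Mus musculus", 10, 22),
    ("Mus musculus", 140, 152),
    ("Mus musculus", 900, 912),
    ("Tillandsia rotundata", 4550, 4570),
]
occurrences = index_occurrences(found)
ids = resolve_ids(occurrences, toy_lookup)
```

Here four occurrences produce only two lookups, one per distinct name string, while all four offset pairs remain available in `occurrences`.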
    12. issues
        • misspellings etc need to be “reconciled”
        • this definitely isn’t the job of a name finding tool
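As an illustration of reconciliation living outside the namefinder, here is a minimal sketch that maps a possibly misspelled string to the closest entry in a list of known names. It uses generic fuzzy string matching from Python's standard `difflib` module, which is a swapped-in technique and not something the slides prescribe; the `reconcile` function and the `0.85` cutoff are assumptions.

```python
import difflib
from typing import List, Optional


def reconcile(name: str, known_names: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy reference list; a real one would come from a names index.
KNOWN = ["Mus musculus", "Tillandsia rotundata"]

fixed = reconcile("Mus musculis", KNOWN)
unknown = reconcile("Homo sapiens", KNOWN)
```

The namefinder still only reports the string it saw; a separate step like this decides which known name, if any, that string should map to.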
    13. next?
        • we could make a tool that hacks together IDs and names.. ... but that’s not dev time well spent
        • we could participate in a process to check off the latter two categories of the name finding -> ID resolution process ... yes we can
        • Let’s make a spec, build some APIs.
        • silver lining - we can start now