Advancing the International Plant Names Index (IPNI)



The "names and taxa" information space is often thought of as being composed of three layers:
  • Taxonomic concepts
  • Code-governed nomenclatural acts
  • Name occurrences
In many circumstances the distinction between these layers is blurred, leading to confusion and inefficiencies in information management. To date, IPNI has been mainly concerned with the middle layer, comprising ICBN-governed nomenclatural acts. It is formed of three key components: curated data, information services to expose this data, and dedicated editorial staff to provide nomenclatural expertise.
IPNI will be advanced from its current state to better connect to the layers above (taxonomic concepts) and below (name occurrences). This will require the expansion of data holdings, improved linkages, and the development of information services and associated workflows. These will be offered to key actors including name authors, publishers, taxonomists and managers of biodiversity information.
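
The three layers described above can be sketched as a small data model. This is only an illustration of how the layers might reference one another: the class names, field names and the `example:act:…` identifiers are all hypothetical and do not reflect IPNI's actual schema or identifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NomenclaturalAct:
    """Middle layer: a Code-governed act, the kind of record IPNI curates."""
    identifier: str                      # placeholder persistent identifier
    name: str
    authorship: str
    basionym_id: Optional[str] = None    # objective link to another act


@dataclass
class TaxonomicConcept:
    """Upper layer: a taxonomic opinion that cites acts by identifier."""
    accepted_act_id: str
    synonym_act_ids: List[str] = field(default_factory=list)
    source: str = ""


@dataclass
class NameOccurrence:
    """Lower layer: a name string as found on a specimen, report or citation."""
    verbatim: str
    act_id: Optional[str] = None         # filled in once the string is linked


def link_occurrence(occ: NameOccurrence, acts: List[NomenclaturalAct]) -> NameOccurrence:
    """Attach the identifier of the act whose name matches the string exactly."""
    for act in acts:
        if act.name == occ.verbatim.strip():
            occ.act_id = act.identifier
            break
    return occ


# Hypothetical identifiers; the name and authorship come from the PhytoKeys example.
act = NomenclaturalAct(
    identifier="example:act:2",
    name="Phyllanthus atalotrichus",
    authorship="(A.C. Sm.) W.L. Wagner & Lorence",
    basionym_id="example:act:1",
)
concept = TaxonomicConcept(accepted_act_id=act.identifier, source="hypothetical checklist")
occurrence = link_occurrence(NameOccurrence(verbatim=" Phyllanthus atalotrichus "), [act])
print(occurrence.act_id)
```

In this sketch both the concept and the occurrence point at the same act identifier, which is the kind of connection between layers that the description proposes to strengthen.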

Published in: Technology, Business
  • Electronic publication Lisd plant list example
  • Sample IPNI record
  • Standardised author
  • Standardised publication title and collation
  • Distribution
  • Type specimen information
  • Links to other IPNI records
  • Code annotation (on linked record)
  • Full record history
  • Resolvable persistent identifier (LSID), returns structured data in a standard format.
  • Could mention issues with this here: names aren’t entered until the hard copy arrives at the K / HUH library, and we estimate a 2-year time lag between publication date and entry to IPNI. Stats derived from 2004 onwards. IK editors’ discussion: could do some analysis on this with a sandwich student.
  • Spelling correction: endings, connecting vowels, OCR error fixes.
  • Orange: authors (88.6%). Green: publication titles. Author standardisation: a 1% rise requires creation of over 25,000 links. Checking is intensive, as there is often ambiguity in the non-standard, unlinked abbreviations. For example, the un-standardised string 'Henr.' was found to be Henrickson in the string "(Henr.) S.L.Welsh & Crompton" and Henrard in the string "(Henr.) Clayton".
  • These are shown as the number of epithets modified per month. July 2010 is when we did a big OCR fix.
  • Screenshot showing propagation of errors on next slide
  • OCR error corrected (mc -> rne); the error dates from IK digitisation. The old version persists in many datasets that have been derived from IPNI. Linking (via persistent identifier, as described in a later slide) would ensure that derived datasets benefit from this kind of curation.
  • Can mention the GTI work here
  • The stats page on the site contains these tables from 2004 onwards. This is data for the most recent full year (2010).
  • BUT the response to user queries has very little visibility: it is point-to-point email, only visible to participants, even though the issues discussed may be of wider relevance.
  • NN/AP: Perhaps we should add the average number of searches per day to the stats page.
  • Division of labour between nomenclature and taxonomy: IPNI handles citation of name, reference and authorship, and objective links such as combination – basionym. Checklists handle taxonomic synonymy and references supporting the assertion of concepts. Referencing datasets benefit from ongoing curation of IPNI data.
  • This string translation has higher value than a purely lexical approach because an editor has checked it. Edit distance: the number of single-character edits (insertions, deletions or substitutions) required to turn one string into another. NN 2011-07-14: here is another example which might better explain the point: Plectranthus macrophylius -> Plectranthus macrophyllus and Plectranthus microphyllus -> Plectranthus macrophyllus have the same edit distance (1 character), BUT the former is high value (checked by an editor), while the latter is programmatically derived and a much more dangerous assumption to make.
  • Data structure is 10 years old and needs re-engineering to deal with requirements. NN: moved the data structure point to the notes, as the crux is not just the data structure but the idea that we have a single data structure. Technically I’d like us to split between a top-copy version for editing and multiple (dumber, flatter) slaves to service API calls etc.; these can be hit as hard as we like without impacting the editors’ workflow. Faceting: different routes to the data.
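
One of the notes above mentions resolvable persistent identifiers (LSIDs). As a rough illustration of what such an identifier carries, the sketch below splits an LSID into its parts, following the general `urn:lsid:authority:namespace:object[:revision]` layout. The `Lsid` type, the `parse_lsid` helper and the `example.org` identifier are illustrative assumptions; this is not IPNI code, and the identifier is not a real IPNI record.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into authority, namespace, object and revision."""
    parts = text.strip().split(":")
    # Expect "urn", "lsid", three mandatory components and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority.lower(), namespace, object_id, revision)


# Placeholder identifier, for illustration only.
parsed = parse_lsid("urn:lsid:example.org:names:12345-1")
print(parsed.authority, parsed.namespace, parsed.object_id, parsed.revision)
```

A dataset that stores such identifiers alongside its own name strings could, in principle, look up the current curated record rather than relying on a copied string.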

    1. Slide 1: Advancing the International Plant Names Index (IPNI)
        Nicky Nicolson, Alan Paton, Jim Croft, James Macklin, Paul Morris, Greg Whitbread, Kanchi Gandhi
    2. Slide 2: Advancing IPNI
        • Current - where IPNI is now
        • Issues
        • Future - where we’d like to go and how to get there
    3. Slide 3: What data?
        • What data types:
            • ICBN governed nomenclatural acts
            • Standardised author list
            • Publications
        • Which groups:
            • Vascular plants
        • Which ranks:
            • Family and below
    4. Slide 13: How is data entered?
        • Data entry:
            • From literature scanning, journals received by library at Kew, Harvard, Canberra (2 years - 95%)
            • User reports of missing nomenclatural acts, usually accompanied by a link to digitised literature page (BHL)
        • How many?
            • About 7400 names entered in average year
            • About 6100 nomenclatural acts published / year
            • … of these about 2800 are tax. novs.
    5. Slide 14: How is data managed?
        • Full audit history on core objects – names / authors / publications.
        • Average 300,000 edits on name records / year
        • Standardisation effort ongoing:
            • Epithet
            • Author citation
            • Publication title
            • Collation
            • Year
    6. Slide 15: Standardisation – author and title
    7. Slide 16: Standardisation – epithet updates
    8. Slide 17: Standardisation of epithets
        • Why important
            • Main search criterion
            • Improving epithets enables other improvements in dataset e.g.:
                • basionym linkage
                • de-duplication
            • Errors propagate
    9. Slide 18: Rhus keamcyi was an OCR error for Rhus kearneyi, but the incorrect value persists in datasets derived from IPNI
    10. Slide 19: Statistics
        • Dataset can be used for trends analysis:
            • Publication rates
            • Combination rates
            • Author collaborations
        • Audit history used to determine changes in data-set over time
    12. Slide 21: As well as the data…
        • IPNI editors respond to user queries about the data, dealing with c. 50 cases / month
        • Includes an expert service on interpretation of the ICBN
        • Can provide worked examples illustrating particular articles of the code
    13. Slide 22: Why should anyone care?
        • c. 55,000 searches / day
        • BUT
        • dataset is not being used to full advantage
        • inputs not being handled efficiently:
            • limited to partnership
            • missing out on community input
        • expertise is hidden
    14. Slide 23: Future
        • Increase efficiency of input
            • provision of core data
            • annotating and linking existing data
            • solving nomenclatural problems
        • Increase output
            • usage of IPNI data
            • benefit from on-going curation effort
            • benefit from nomenclatural expertise
    15. Slide 24: Data in - contributor services
        • Pre-publication data entry
        • Batch submission of datasets
        • Annotation
        • Addition of links within dataset
        • Facilitate interpretation of nomenclatural issues
        • Accreditation – credit for helping improve the data
    16. Slide 25: Pre-publication data entry
        • Workflow currently being trialled
            • Author or publisher submits data to IPNI once article has been accepted for publication
            • Generated record suppressed until publication is effective under the code
            • But this is not yet automated!
    17. Slide 26: Electronic Publication Example - Phytokeys
        • A nomenclator of Pacific oceanic island Phyllanthus (Phyllanthaceae), including Glochidion
        • Warren L. Wagner, David H. Lorence
        • 5. Phyllanthus atalotrichus (A.C. Sm.) W.L. Wagner & Lorence, comb. nov.
        PhytoKeys 4: 67–94 (2011) doi: 10.3897/phytokeys.4.1581
    18. Slide 27: Pre-publication issues
        • Name squatting – mitigated by only entering names which are in papers accepted for publication
        • Curation of record throughout publication process
        • Electronic and effective publication – before this the record will not be visible
        • IPNI editors provide visible expert service re validity of name
    19. Slide 28: Where IPNI data are placed
        • Any name occurrence: e.g. specimens, reports, literature citation
        • Concepts
        • Standard form of name
    20. Slide 29: Data out - links
        • To concept layer:
            • embed IPNI identifiers
            • storage of factual concepts / links to concept layer
        • To name occurrence layer:
            • seed lexical reconciliation projects (e.g. GNI)
        • To allied information:
            • literature
            • types
    21. Slide 30: Links to concept layer
        • Embed IPNI identifiers in externally held names lists
            • IPNI holds curated name data, labelled with persistent identifiers.
            • Need a tool to seed IPNI identifiers into datasets (in prototype)
            • Can devolve curation of name elements in other systems to IPNI
            • Benefit from on-going curation:
                • 300,000 edits per year
            • Report on changes in name list since date
    22. Slide 31: Links to the concept layer – example: The Plant List
    23. Slide 32: Link to name occurrence layer
        • IPNI’s version history can be used to seed lexical reconciliation projects (GNI), e.g.:
            • Plectranthus macrophylius -> Plectranthus macrophyllus
        • These editorialised translations are of higher value than programmatically derived operations of the same edit distance, e.g.:
            • Plectranthus microphyllus -> Plectranthus macrophyllus
        • Standardisation tools and techniques opened up for use in allied projects
    24. Slide 33: Conclusion
        • Facilitate electronic publication - pilot registration
        • Foster larger community to support the data and automate workflows
        • Stronger links between:
            • the people who produce names
            • the places where they are published
            • the downstream users
        • Technical redevelopment
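
The slides on linking to the concept and name occurrence layers suggest seeding identifiers into external name lists and preferring editor-checked corrections over purely lexical matches. The following is a minimal sketch of that idea, not IPNI's tooling: the `reconcile` function, the two lookup tables and the `example.org` identifiers are hypothetical, and the fuzzy step uses a plain Levenshtein distance as a stand-in for whatever matching a real reconciliation project would use.

```python
from typing import Dict, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or keep
        prev = curr
    return prev[-1]


# Hypothetical curated names; identifiers are placeholders, not real IPNI LSIDs.
CURATED_NAMES: Dict[str, str] = {
    "Plectranthus macrophyllus": "urn:lsid:example.org:names:1",
    "Rhus kearneyi": "urn:lsid:example.org:names:2",
}

# Old spelling -> corrected spelling, as an editor-checked version history might record.
EDITOR_CORRECTIONS: Dict[str, str] = {
    "Plectranthus macrophylius": "Plectranthus macrophyllus",
    "Rhus keamcyi": "Rhus kearneyi",
}


def reconcile(name: str) -> Tuple[Optional[str], str]:
    """Return (identifier, evidence) for a name string from an external list."""
    if name in CURATED_NAMES:
        return CURATED_NAMES[name], "exact"
    corrected = EDITOR_CORRECTIONS.get(name)
    if corrected in CURATED_NAMES:
        return CURATED_NAMES[corrected], "editor-checked correction"
    # Lexical fallback: only accepted when exactly one curated name is one edit away.
    near = [n for n in CURATED_NAMES if edit_distance(name, n) == 1]
    if len(near) == 1:
        return CURATED_NAMES[near[0]], "unchecked fuzzy match"
    return None, "unmatched"


for candidate in ("Plectranthus macrophylius", "Plectranthus microphyllus", "Rhus keamcyi"):
    print(candidate, "->", reconcile(candidate))
```

Both Plectranthus strings are one edit away from the curated name, but only the first is backed by a recorded correction; the second is returned with an "unchecked" label so that a downstream user can decide whether to trust it. The Rhus example is several edits away from its corrected form, so in this sketch it is resolved only through the recorded correction.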