An explanation of how names data are gathered, structured, standardised and annotated, and how these data are mobilised through names services. The open challenges are credit and attribution for those who create the data, and usage metrics on the services.
Presented at the Research Data Alliance plenary 5, 9-11 March 2015, San Diego.
9. Summary
• Created structured, standardised, annotated data
• Data can be the “fuel” for names services
• How to ensure that the creators / annotators get credit?
– Attribution
– Usage metrics
16. Translate an attempt at recording a name as text to an identifier
Schinus longifolius var. paraguariensis (Hassler) F. Barkley
229196-2
20. We need to know how the data are being used...
... To ensure the workers who scan, structure, annotate the data get credit for their work, and metrics on its downstream use.
Google Analytics for services?
Editor's Notes
Data are structured, stored and presented in RDF for machine-to-machine use.
A legalistic code governs how new names are brought into being. Editors interpret the code, apply it to the nomenclatural acts that they collect, and annotate accordingly.
The web page view shows standardised data plus expert annotation.
We now have a populated dataset – data have been extracted, structured, standardised, annotated.
These data can be the fuel for names services, but how do we get attribution and credit back to those who have structured, standardised and annotated them?
Example of services run at organisation scale – primarily designed for our own researchers.
A data manipulation / cleaning program called OpenRefine is used as our working environment. From it we can make simple queries to IPNI (for nomenclature) and The Plant List, or TPL (for taxonomy).
It can read data from lots of formats, like CSV and Excel.
It looks like a spreadsheet, but it has lots of features for cleaning up messy data. The one we want is to query a reconciliation service – a special website that can be given a piece of text like a plant name, and returns an identifier, like an IPNI id.
Example – a dataset including a column of scientific plant names
Select a reconciliation service against IPNI
The text representations of scientific names are passed to the service (using JSON over HTTP), and IPNI ids, with hyperlinks to IPNI, are brought back.
The dataset is now augmented with an ID for each reconciled name.
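The exchange with a reconciliation service can be sketched roughly as follows. This is a minimal illustration, assuming the service follows the batch format of the OpenRefine reconciliation protocol (keyed queries in, candidate lists with "id", "name", "score" and "match" out). The function names are hypothetical, no network call is made, and the sample reply is hand-written for illustration: only the name and identifier come from the slide above, and the score is made up.

```python
import json
from urllib.parse import urlencode


def build_query_batch(names):
    """Give each name a key (q0, q1, ...) as the reconciliation batch format expects."""
    return {f"q{i}": {"query": name} for i, name in enumerate(names)}


def encode_form_body(batch):
    """Encode the batch as a form-encoded 'queries' parameter for an HTTP POST."""
    return urlencode({"queries": json.dumps(batch)})


def pick_matches(batch, response):
    """Return {name: identifier or None}, keeping only candidates flagged as a match."""
    matches = {}
    for key, query in batch.items():
        candidates = response.get(key, {}).get("result", [])
        matched = [c for c in candidates if c.get("match")]
        matches[query["query"]] = matched[0]["id"] if matched else None
    return matches


names = ["Schinus longifolius var. paraguariensis", "Not a plant name"]
batch = build_query_batch(names)
body = encode_form_body(batch)  # what would be sent to the service

# Hand-written stand-in for a service reply (assumption, not real output).
sample_response = {
    "q0": {
        "result": [
            {
                "id": "229196-2",
                "name": "Schinus longifolius var. paraguariensis",
                "score": 100,  # illustrative value only
                "match": True,
            }
        ]
    },
    "q1": {"result": []},
}

matches = pick_matches(batch, sample_response)
print(matches)
```

Names with no confident match come back as None, so they can be left for a person to review rather than being silently assigned an identifier.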
So we are here – passed in a name, got back an identifier.
Now to query TPL for the taxonomic status.
We’ve populated our data resources that hold scientific names with the identifiers for those names, so we can now do the distributed equivalent of a database join. We’ve a separate resource (“TPL”) which organises the scientific names into a taxonomy.
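The "distributed equivalent of a database join" can be sketched as a lookup keyed on the shared identifier. This is only an illustration: the identifiers, names, status values and the left_join helper are all hypothetical, and a plain dictionary stands in for the separate taxonomic resource, which in practice would be queried over HTTP.

```python
def left_join(rows, lookup, key, new_column):
    """Add new_column to every row by looking up row[key]; unmatched rows get None."""
    joined = []
    for row in rows:
        augmented = dict(row)  # copy so the original rows are left unchanged
        augmented[new_column] = lookup.get(row.get(key))
        joined.append(augmented)
    return joined


# Our own dataset, already carrying name identifiers (made-up values).
dataset = [
    {"name": "Example name A", "name_id": "example-1"},
    {"name": "Example name B", "name_id": "example-2"},
    {"name": "Example name C", "name_id": None},  # name was never reconciled
]

# Stand-in for the separate taxonomic resource, keyed by the same identifiers.
taxonomy = {
    "example-1": "Accepted",
    "example-2": "Synonym",
}

for row in left_join(dataset, taxonomy, "name_id", "taxonomic_status"):
    print(row)
```

Because the join is keyed on identifiers rather than on the text of the names, spelling variants in the original column do not affect which record is matched.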
Our researcher can now choose “Add columns from TPL…”.
TPL gives us a list of “properties” – things that it knows about names. Our researcher has chosen “taxonomic status”, and there’s a preview on the right.
Our dataset is now augmented with name identifiers, and we’ve used those identifiers to go to a separate resource (TPL) to get the taxonomic status.