Building a “names backbone”
Nicky Nicolson, RBG Kew
  • DEFRA-funded project, for Kew internal information management but applicable more widely. Staffed with a development team of 5 and a data improvement team of 4, plus people working on project management and business change. Names are crucial to Kew’s scientific work and day-to-day management of the collections. We have many systems which hold nomenclatural and taxonomic information.
  • Many systems, few links. Huge overlap in data and functionality. A single scientific question can be answered in multiple different ways.
  • Name occurrence layer: any informal attempt at the transcription of a name. Some name occurrences are code-governed names, eligible to appear in the next layer. The names layer holds all the objective published facts about a name: its orthography, authorship, protologue reference, type citation and objective synonymy. Concepts layer: hypotheses draw these names together to form concepts via heterotypic synonymy. Most people are interested in working with concepts. Unfortunately most people are only armed with name occurrences.
  • IPNI / IF / Zoobank
  • WSCP etc
  • Most scientific questions operate at the concept level...
  • … Fun board game for a small child, big waste of effort when we are trying to do science. We need to provide ways to allow people to better navigate between the layers, and better focus their efforts, e.g. build classifications using the same objective bases.
  • We’ve investigated using a different storage technology that stores data as graphs (structures composed of nodes and directed relationships between nodes) rather than in a relational structure. Both nodes and relationships can hold data in the form of properties. These are strongly typed, and indexed for retrieval performance. Drawbacks? It is a very different way of thinking about the data, and it needs an API to interact with the underlying storage. But the graph model gets us a long way: it’s a natural way to represent the data.
  • In the first use of a graph model, we imported some data from The Plant List. We created two nodes, each with fullName and status properties, and created an “accepted_as” link between the two to represent the fact that one name is a synonym of the other. This is quite a naïve use of the graph model, and it repeats a problem seen with the WCS data structure: the status is effectively stored in two places, explicitly in the status property on the name node and implicitly by the participation of the name node in an accepted_as relationship.
  • The second attempt was more strict about the separation of the nomenclatural information (the nodes) and the taxonomic information (the relationships between nodes). The benefit of a graph model is that information can be stored on the relationships between nodes – so we can have an “accordingTo” property on subjective relationships like “acceptedAs” and support many of these relationships to represent differing and potentially conflicting taxonomic opinions.
  • Add an attribute to indicate which classification asserts this subjective relationship. The taxonomic status of a name is inferred from its participation in a subjective taxonomic relationship. We can query the graph database for the treatment of the name “Cus bus Jones” according to WCS and see that it is accepted (it has an incoming accepted_as link). Similarly, according to WCS, the name “Aus bus (L.) K.” is a synonym as it has an outgoing accepted_as link.
  • Expand the data held on the subjective relationship to allow it to be computationally assessed
  • “Nomenclutter”
  • Building a names backbone
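
The third attempt described in these notes can be sketched in a few lines of Python: name nodes carry only nomenclatural data, each subjective link records which classification asserts it, and a name's status is inferred from the links rather than stored. This is only an illustration under stated assumptions. The class names, the `status` helper, the "unplaced" label and the second classification ("ListB", holding the opposite opinion) are hypothetical, and the notes do not say which graph database or API the project used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A name node: objective nomenclatural data only, no status property."""
    full_name: str


@dataclass(frozen=True)
class AcceptedAs:
    """A directed, subjective link; its properties live on the link itself."""
    synonym: Name        # the link goes out from the synonym ...
    accepted: Name       # ... and comes in to the accepted name
    according_to: str    # which classification asserts this opinion


class NameGraph:
    def __init__(self):
        self.links = []

    def accept(self, synonym, accepted, according_to):
        """Record one classification's opinion as a new accepted_as link."""
        self.links.append(AcceptedAs(synonym, accepted, according_to))

    def status(self, name, according_to):
        """Infer status from participation in links, per classification."""
        links = [l for l in self.links if l.according_to == according_to]
        if any(l.synonym == name for l in links):
            return "synonym"     # outgoing accepted_as link
        if any(l.accepted == name for l in links):
            return "accepted"    # incoming accepted_as link
        return "unplaced"        # hypothetical label: no opinion recorded


cus = Name("Cus bus Jones")
aus = Name("Aus bus (L.) K.")

graph = NameGraph()
graph.accept(aus, cus, according_to="WCS")
# Hypothetical second opinion, reusing the same two name nodes.
graph.accept(cus, aus, according_to="ListB")

print(graph.status(cus, "WCS"))    # accepted
print(graph.status(aus, "WCS"))    # synonym
print(graph.status(cus, "ListB"))  # synonym
```

Because the status is never stored on the node, the two conflicting opinions coexist without duplicating either name.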

    1. Building a “names backbone”. Nicky Nicolson, RBG Kew
    2. A names backbone == “an environment for the management of multiple overlapping classifications and tracking how these change over time”. Not a monolith:
        • Built on a layered view of the domain – clearly separating names and taxonomy
        • Names form the objective basis for higher layers
    3. The current situation… Many overlapping systems, few links
    4. … and what we’re aiming for: Authoritative data, reduced duplication, many more links
    5. Names backbone: a layered environment
    6. Name occurrence layer, AKA “Nomen-clutter” == any attempt at the transcription of a name
    7. Names layer. Holds objective published facts about a name:
        - Orthography
        - Authorship
        - Protologue reference
        - Type citation
        - Objective synonymy
    8. Concepts layer. Hypotheses draw names together to form concepts via heterotypic synonymy
    9. The (current) problem: Most people want to operate at concept level…
    10. The (current) problem: … but have to start right down at the lowest level
    11. The problem:
    12. Solving the problem… We need to provide ways to allow people to better navigate between the layers, and better focus their efforts – e.g. build classifications using the same objective bases. We started with a blank sheet of paper – it’s hard to get existing systems to conform to the layering that we need
    13. Drawbacks of data models used to date:
        • conflated the storage of names and concepts
        • store only a single classification
        • store only the end product of a thought process, not work in progress
        • are difficult to version
        • are difficult to query effectively (for hierarchies etc)
    14. A new (graph) model
        • Stores data as graphs – composed of nodes and directed relationships
        • Both nodes and relationships can hold data as properties
        • Supports highly interconnected data
        • Supports self-referential data
        • Optimised for queries on relationships
    15. Using a graph model to hold concept data: Attempt #1. Two nodes, with name + status properties, and an “accepted_as” link. == a naïve use of the graph model: status is stored in 2 places (explicitly in the status property, implicitly by participation in the relationship)
    16. Using a graph model to hold concept data: Attempt #2. More strict about the separation of the nomenclatural information (the nodes) and the taxonomic information (the relationships between nodes), but the link is still very sparse…
    17. Using a graph model to hold concept data: Attempt #3. Add an attribute to indicate which classification asserts this subjective relationship. Taxonomic status of a name is inferred from its participation in a subjective taxonomic relationship.
    18. Links become more interesting than the nodes. Expand the data held on the subjective relationship to allow it to be computationally assessed
    19. Multiple opinions – using the same name nodes. Reuse the name nodes to store multiple opinions using the same basic facts (name nodes)
    20. Relationships held
        Objective, e.g.:
        • Combination-basionym
        • Later_homonym
        • Alternative_name_for
        • …
        Subjective, e.g.:
        • Parent_child (taxonomic placement)
        • Synonym (heterotypic synonymy)
        • …
    21. Objective relationships “stronger” than subjective
    22. Supporting versioning. We keep all relationships; modifications to the data just mark relationships as no longer current. We can always resurrect the state of the graph == persistent identification of taxon concepts
    23. Versioning = name id + classification + state. We can always resurrect the state of the graph. Versioning enables remote curation of the data
    24. Versioning = name id + classification + state. We can always resurrect the state of the graph. Versioning enables remote curation of the data
    25. Versioning = name id + classification + state
        State 1, according to WCS: Xus yus Smith (A) = Aus bus Jones (S)
        State 2, according to WCS: Xus zus White (A) = Xus yus Smith (S) = Aus bus Jones (S)
        We can always resurrect the state of the graph. Versioning enables remote curation of the data
    26. What can be done with this kind of data model?
        • Client systems can reliably connect to a version of a concept
        • We can see how concepts change over time
        • Researchers can query the data to compare classifications and identify areas of dispute
        Longer term:
        • Examine the “computed acceptance” rules used in TPL - could these be run on the relationships in the names backbone?
    27. Building it: we first focussed on the top two layers…
    28. … but we need a way to manage the name occurrences
    29. Building the name occurrence layer:
        Populating it:
        • Seed it with authoritative set of names
        • Add the version history of these names – how were these names transcribed in the past?
        Using it:
        • Load candidate name occurrences and match them, storing metrics on the match.
        Reviewing – a “data improvement” team to:
        • Verify the matches, focussing on ambiguity (that which can’t be done computationally) == annotation
    30. Services: name occurrence layer
        - Data input / output: DwCA
        - Linking and reviewing links
        - RSS feeds to indicate activity
    31. Services: names layer
        - Data input / output: TCS
        - Propose addition / edit of names
        - RSS feeds to indicate activity
    32. Services: concepts layer
        - Data input / output: TCS
        - Create classifications using names
        - Propose addition / edit of names to names layer
        - RSS feeds
    33. The names backbone is an extensible environment:
        • Links “name occurrences” to names
        • Separates curation of names and concepts
        • Supports building concepts on the same objective basis: enables sharing and reuse of foundation data.
        • Allows many relationships to form concepts – supports multiple overlapping classifications
        • Allows distributed curation of the concepts.
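
The versioning idea from slides 22 to 25 (never delete a relationship, only mark it as no longer current, so that any earlier state can be resurrected) can be sketched in Python as below. This is a minimal illustration, not the project's implementation: the `SynonymLink` and `VersionedClassification` classes, the integer state counter, and the `valid_from` / `valid_to` fields are assumptions. The example data is the State 1 / State 2 case from slide 25.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynonymLink:
    synonym: str
    accepted: str
    according_to: str
    valid_from: int                  # state in which the link was asserted
    valid_to: Optional[int] = None   # state in which it was retired; None = current


class VersionedClassification:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.links = []

    def new_state(self, synonymy):
        """Move to a new state; `synonymy` maps synonym -> accepted name."""
        self.state += 1
        wanted = set(synonymy.items())
        for link in self.links:
            if link.valid_to is not None:
                continue                      # already retired
            pair = (link.synonym, link.accepted)
            if pair in wanted:
                wanted.discard(pair)          # unchanged opinion stays current
            else:
                link.valid_to = self.state    # retire, never delete
        for synonym, accepted in sorted(wanted):
            self.links.append(
                SynonymLink(synonym, accepted, self.name, self.state))
        return self.state

    def synonymy_at(self, state):
        """Resurrect the synonymy as it stood in the given state."""
        return {
            l.synonym: l.accepted
            for l in self.links
            if l.valid_from <= state
            and (l.valid_to is None or state < l.valid_to)
        }


wcs = VersionedClassification("WCS")
state1 = wcs.new_state({"Aus bus Jones": "Xus yus Smith"})
state2 = wcs.new_state({
    "Xus yus Smith": "Xus zus White",
    "Aus bus Jones": "Xus zus White",
})

# A concept version is addressed by name id + classification + state.
concept_v1 = ("Xus yus Smith", wcs.name, state1)

print(wcs.synonymy_at(state1))  # {'Aus bus Jones': 'Xus yus Smith'}
print(wcs.synonymy_at(state2))
```

A client system that stored `concept_v1` can still retrieve the State 1 circumscription after the classification has moved on, because the retired link is kept with its `valid_to` marker.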
