2. A names backbone
== “an environment for the management of multiple
overlapping classifications and tracking how these
change over time”
Not a monolith:
• Built on a layered view of the domain – clearly
separating names and taxonomy
• Names form the objective basis for higher layers
12. Solving the problem…
We need to provide ways to allow people to better
navigate between the layers, and better focus their
efforts – e.g. build classifications using the same
objective bases.
We started with a blank sheet of paper – it’s hard to get
existing systems to conform to the layering that we
need
13. Drawbacks of data models used to
date
• conflate the storage of names and concepts
• store only a single classification
• store only the end product of a thought process, not
work in progress
• are difficult to version
• are difficult to query effectively (for hierarchies etc)
14. A new (graph) model
• Stores data as graphs – composed of nodes and
directed relationships
• Both nodes and relationships can hold data as
properties
• Supports highly interconnected data
• Supports self-referential data
• Optimised for queries on relationships
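As an illustration of the bullets above, here is a minimal in-memory sketch of a property graph: nodes and directed relationships, both carrying properties. The class and method names (`Graph`, `add_node`, `relate`, `outgoing`, `incoming`) are assumptions made for this example and are not the API of the storage technology used in the project.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    node_id: int
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    rel_type: str
    start: int  # id of the node the relationship leaves
    end: int    # id of the node the relationship points at
    properties: dict[str, Any] = field(default_factory=dict)


class Graph:
    """Hypothetical toy graph store: not a real graph database."""

    def __init__(self) -> None:
        self.nodes: dict[int, Node] = {}
        self.relationships: list[Relationship] = []

    def add_node(self, **properties: Any) -> Node:
        node = Node(node_id=len(self.nodes) + 1, properties=properties)
        self.nodes[node.node_id] = node
        return node

    def relate(self, start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
        rel = Relationship(rel_type, start.node_id, end.node_id, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node: Node, rel_type: str) -> list[Relationship]:
        return [r for r in self.relationships
                if r.start == node.node_id and r.rel_type == rel_type]

    def incoming(self, node: Node, rel_type: str) -> list[Relationship]:
        return [r for r in self.relationships
                if r.end == node.node_id and r.rel_type == rel_type]


graph = Graph()
synonym = graph.add_node(fullName="Aus bus Jones")
accepted = graph.add_node(fullName="Xus yus Smith")
# The relationship itself carries data, here the asserting classification.
graph.relate(synonym, "accepted_as", accepted, accordingTo="WCS")
```

A real graph database would index these relationships rather than scan a list, which is where the "optimised for queries on relationships" point comes from.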
15. Using a graph model to hold
concept data: Attempt #1
Two nodes, with name
+ status properties,
and an “accepted_as”
link.
== a naïve use of the
graph model: status is
stored in 2 places
(explicitly in status
property, implicitly
by the participation
relationship)
16. Using a graph model to hold
concept data: Attempt #2
More strict about the
separation of the
nomenclatural
information (the nodes)
and the taxonomic
information (the
relationships between
nodes), but the link
is still very sparse…
17. Using a graph model to hold
concept data: Attempt #3
Add an attribute to
indicate which
classification asserts
this subjective
relationship:
Taxonomic status of a
name is inferred from
its participation
in a subjective
taxonomic relationship.
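The inference described here can be sketched in a few lines. This is an illustrative sketch only: the `AcceptedAs` tuple, the `status` function and the assumption that "Aus bus (L.) K." links directly to "Cus bus Jones" are all made up for the example. No status is stored on the name; it is derived from the direction of the link, per classification.

```python
from typing import NamedTuple, Optional


class AcceptedAs(NamedTuple):
    synonym: str       # name the link leaves (outgoing side)
    accepted: str      # name the link points at (incoming side)
    according_to: str  # classification asserting this subjective link


# Assumed example data: one link asserted by WCS.
RELATIONSHIPS = [
    AcceptedAs("Aus bus (L.) K.", "Cus bus Jones", "WCS"),
]


def status(name: str, classification: str,
           relationships: list[AcceptedAs]) -> Optional[str]:
    """Infer taxonomic status from link participation in one classification."""
    asserted = [r for r in relationships if r.according_to == classification]
    if any(r.synonym == name for r in asserted):
        return "synonym"   # has an outgoing accepted_as link
    if any(r.accepted == name for r in asserted):
        return "accepted"  # has an incoming accepted_as link
    return None            # this classification expresses no opinion here


print(status("Cus bus Jones", "WCS", RELATIONSHIPS))
print(status("Aus bus (L.) K.", "WCS", RELATIONSHIPS))
```

In this simple form an accepted name with no synonyms has no incoming link and so returns `None`; a fuller model would need some other way to record that case.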
18. Links become more interesting
than the nodes
Expand the data
held on the
subjective
relationship to allow
it to be
computationally
assessed
19. Multiple opinions – using the
same name nodes
Reuse the name
nodes to store
multiple opinions
using the same
basic facts (name
nodes)
22. Supporting versioning
We keep all relationships, modifications to the data just
mark relationships as no longer current.
We can always resurrect the state of the graph
== persistent identification of taxon concepts
23. Versioning = name id +
classification + state
We can always resurrect the state of the graph.
Versioning enables remote curation of the data
25. Versioning = name id +
classification + state
State 1, according to WCS:
Xus yus Smith (A)
= Aus bus Jones (S)
State 2, according to WCS:
Xus zus White (A)
= Xus yus Smith (S)
= Aus bus Jones (S)
We can always resurrect the state of the graph.
Versioning enables remote curation of the data
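One way to picture "keep all relationships, mark them as no longer current" is the sketch below, which replays the two WCS states from this slide. The integer state counter and the `added_in` / `retired_in` fields are assumptions made for illustration; the slides do not say how state is actually recorded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AcceptedAs:
    synonym: str
    accepted: str
    according_to: str
    added_in: int                     # state in which the link was asserted
    retired_in: Optional[int] = None  # state in which it stopped being current

    def is_current_at(self, state: int) -> bool:
        return self.added_in <= state and (
            self.retired_in is None or state < self.retired_in
        )


# State 1: Aus bus Jones is a synonym of Xus yus Smith.
links: List[AcceptedAs] = [
    AcceptedAs("Aus bus Jones", "Xus yus Smith", "WCS", added_in=1),
]

# State 2: nothing is deleted. The old link is retired and new links are added.
links[0].retired_in = 2
links.append(AcceptedAs("Xus yus Smith", "Xus zus White", "WCS", added_in=2))
links.append(AcceptedAs("Aus bus Jones", "Xus zus White", "WCS", added_in=2))


def synonyms_at(accepted: str, classification: str, state: int,
                all_links: List[AcceptedAs]) -> List[str]:
    """Resurrect the synonymy of an accepted name as it stood at a given state."""
    return sorted(
        link.synonym for link in all_links
        if link.accepted == accepted
        and link.according_to == classification
        and link.is_current_at(state)
    )


print(synonyms_at("Xus yus Smith", "WCS", 1, links))
print(synonyms_at("Xus zus White", "WCS", 2, links))
```

Under these assumptions, the triple of name, classification and state is enough to retrieve a stable view of a concept, which is what the slide title expresses.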
26. What can be done with this kind of
data model?
• Client systems can reliably connect to a version of a
concept
• We can see how concepts change over time
• Researchers can query the data to compare
classifications and identify areas of dispute
Longer term:
• Examine the “computed acceptance” rules used in
TPL - could these be run on the relationships in the
names backbone?
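As a rough sketch of the "compare classifications and identify areas of dispute" query, the example below groups accepted_as links by synonym and reports names that different classifications place under different accepted names. The second classification ("OtherList") and all the links are hypothetical placeholder data, and `disputed_names` is a name invented for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (synonym, accepted name, asserting classification); hypothetical data.
LINKS: List[Tuple[str, str, str]] = [
    ("Aus bus Jones", "Xus yus Smith", "WCS"),
    ("Aus bus Jones", "Cus dus Brown", "OtherList"),
    ("Eus fus Green", "Xus yus Smith", "WCS"),
    ("Eus fus Green", "Xus yus Smith", "OtherList"),
]


def disputed_names(links: List[Tuple[str, str, str]]) -> Dict[str, Dict[str, str]]:
    """Return names that classifications place under different accepted names."""
    placements: Dict[str, Dict[str, str]] = defaultdict(dict)
    for synonym, accepted, according_to in links:
        placements[synonym][according_to] = accepted
    return {
        name: by_classification
        for name, by_classification in placements.items()
        if len(set(by_classification.values())) > 1
    }


for name, opinions in disputed_names(LINKS).items():
    print(name, opinions)
```

Because both classifications reuse the same name nodes, the comparison reduces to grouping links by name; no name matching between the two sources is needed.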
28. … but we need a way to manage
the name occurrences
29. Building the name occurrence layer:
Populating it:
• Seed it with an authoritative set of names
• Add the version history of these names – how were
these names transcribed in the past?
Using it:
• Load candidate name occurrences and match them,
storing metrics on the match.
Reviewing – a “data improvement” team to:
• Verify the matches, focussing on ambiguity (that
which can’t be done computationally) == annotation
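The load, match and review steps might look something like the sketch below. It uses plain string similarity from Python's standard `difflib` module as a stand-in matcher; the slides do not say which matching method or metrics the project uses, and the name list, the `best_match` and `needs_review` helpers and the 0.9 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Union

# Assumed seed data: an authoritative set of names.
AUTHORITATIVE: List[str] = ["Xus yus Smith", "Aus bus Jones"]


def best_match(occurrence: str, names: List[str]) -> Dict[str, Union[str, float]]:
    """Match one name occurrence and keep the similarity score as a metric."""
    scored = [
        (SequenceMatcher(None, occurrence.lower(), name.lower()).ratio(), name)
        for name in names
    ]
    score, name = max(scored)
    return {"occurrence": occurrence, "matched_name": name, "score": round(score, 3)}


def needs_review(match: Dict[str, Union[str, float]], threshold: float = 0.9) -> bool:
    """Flag weak matches for the data improvement team (threshold is arbitrary)."""
    return match["score"] < threshold


for candidate in ["Xus yus Smith", "Xus yus Smth", "A. bus"]:
    match = best_match(candidate, AUTHORITATIVE)
    print(match, "review" if needs_review(match) else "ok")
```

Storing the score alongside the link lets reviewers sort by uncertainty and spend their time on the ambiguous cases.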
30. Services: name occurrence layer
- Data input / output: DwCA
- Linking and reviewing links
- RSS feeds to indicate activity
31. Services: names layer
- Data input / output: TCS
- Propose addition / edit of names
- RSS feeds to indicate activity
32. Services: concepts layer
- Data input / output: TCS
- Create classifications using names
- Propose addition / edit of names to names layer
- RSS feeds
33. The names backbone is an
extensible environment:
• Links “name occurrences” to names
• Separates curation of names and concepts
• Supports building concepts on the same objective
basis: enables sharing and reuse of foundation data.
• Allows many relationships to form concepts – supports
multiple overlapping classifications
• Allows distributed curation of the concepts.
Editor's Notes
DEFRA-funded project – for Kew internal information management, but more widely applicable. Staffed with a development team of 5 and a data improvement team of 4, plus people working on project management and business change. Names are crucial to Kew’s scientific work and day-to-day management of the collections. We have many systems which hold nomenclatural and taxonomic information.
Many systems, few links. Huge overlap in data and functionality. A single scientific question can be answered in multiple different ways.
Name occurrence layer – any informal attempt at the transcription of a name. Some name occurrences are code-governed names – eligible to appear in the next layer – the names layer – this holds all the objective published facts about a name – its orthography, authorship, protologue reference, type citation and objective synonymy. Concepts layer – hypotheses draw these names together to form concepts via heterotypic synonymy. Most people are interested in working with concepts. Unfortunately most people are only armed with name occurrences.
IPNI / IF / Zoobank
WSCP etc
Most scientific questions operate at the concept level...
…Fun board game for a small child, big waste of effort when we are trying to do science. We need to provide ways to allow people to better navigate between the layers, and better focus their efforts – e.g. build classifications using the same objective bases.
We’ve investigated using a different storage technology that stores data as graphs (structures composed of nodes and directed relationships between nodes) rather than in a relational structure. Both nodes and relationships can hold data in the form of properties. These are strongly typed, and indexed for retrieval performance.
Drawbacks?
• A very different way of thinking about the data
• Needs an API to interact with the underlying storage
But: the graph model gets us a long way – it’s a natural way to represent the data.
In the first use of a graph model, we imported some data from The Plant List. We created two nodes, each with fullName and status properties, and created an “accepted_as” link between the two to represent the fact that one name is a synonym of the other. This is quite a naïve use of the graph model – and it repeats a problem seen with the WCS data structure, namely that the status is effectively stored in two places – explicitly in the status property on the name node, and implicitly by the participation of the name node in an accepted_as relationship.
The second attempt was more strict about the separation of the nomenclatural information (the nodes) and the taxonomic information (the relationships between nodes). The benefit of a graph model is that information can be stored on the relationships between nodes – so we can have an “accordingTo” property on subjective relationships like “acceptedAs” and support many of these relationships to represent differing and potentially conflicting taxonomic opinions.
Add an attribute to indicate which classification asserts this subjective relationship. Taxonomic status of a name is inferred from its participation in a subjective taxonomic relationship. We can query the graph database for the treatment of the name “Cus bus Jones” according to WCS and see that it is accepted (it has an incoming accepted_as link). Similarly, according to WCS, the name “Aus bus (L.) K.” is a synonym as it has an outgoing accepted_as link.
Expand the data held on the subjective relationship to allow it to be computationally assessed