Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity data

A new power balance is needed
for trustworthy biodiversity data
Please
@taxonbytes
Nico Franz1 & Beckett W. Sterner1
With contributions by Edward Gilbert1, Andrew Johnston1,
Guanyang Zhang1, Bertram Ludäscher2 & Alan Weakley3
1 School of Life Sciences, Arizona State University
2 iSchool, University of Illinois at Urbana-Champaign
3 Herbarium, University of North Carolina at Chapel Hill
TDWG 2016 – Biodiversity Information Standards
December 09, 2016 – Instituto Tecnológico de Costa Rica (#TDWG16)
@ http://www.slideshare.net/taxonbytes/franz-sterner-tdwg-2016-new-power-balance-needed-for-trustworthy-biodiversity-data

Largely derived from doi:10.3897/rio.2.e10610

Premise: We agree that there are significant data quality issues
Aggregated Australian millipede data 'taken to the cleaners'

Aggregators respond to the charges
But this leaves open the question(s):
Who (exactly) is responsible for
how much of each particular issue?

We seem to disagree on the question of responsibility assignment(s)
Source: Belbin et al. 2013. A specialist's audit […]: An 'aggregator's' perspective. doi:10.3897/zookeys.305.5438
Page 73

Often enough, aggregators respond by:
• Acknowledging the general issues and their relevance.
• Pointing to many issues that effectively reside "with the sources".
• Calling for more collaboration across all levels; as well as new tools and
annotation options that "motivate and empower" the research community.
Source: Belbin et al. 2013. A specialist's audit […]: An 'aggregator's' perspective. doi:10.3897/zookeys.305.5438
Page 74

Thesis: For taxonomy integration, this is both wrong and self-defeating
• Many aggregators are designed to impose a single taxonomic hierarchy –
one at a time – onto all taxonomically annotated records.

• By design, these "backbones" are rarely attributable to individual (expert)
authors, but instead are newly created systematic theories that only appear
at the system level.

• Data are aggregated accordingly; yet backbone-driven modifications may
newly disrupt the original integrity of submitted data packages.
• By deflecting on responsibilities, aggregators may cause additional self-harm.
Ultimately, the power balance – as presently built in – must shift to bring
experts back into the process of licensing succinct, trustworthy data packages.

Let's re-diagnose:
What happens in dynamic,
open systems?
Charly Lewisw, CC BY-SA 3.0

Taxonomic views of a frequently revised organismal lineage
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids, "pogonias")

Snapshot of a more frequently revised organismal lineage
• Vertical sections identify taxonomic concept regions

• Colors identify lineages of taxonomic names (epithets) in use
• There is no consensus! Five incongruent schemata are used concurrently

Further diagnosis:
If incongruent taxonomies are endorsed
– locally, provisionally, and democratically –
then what is the impact for
aggregated biodiversity data?

Further diagnosis:
→ Taxonomy becomes a variable
that we need to represent,
and thereby control for
(at the system level)

The 'consensus' The 'bible'
"Controlling the taxonomic variable"
• Query: "Where do these orchid species occur?"
• Same set of 250 orchid specimens, according to 4 taxonomies.
Example: the Cleistes use case

The (formerly)
federal 'standard'
The 'best', latest
regional flora
Impact:
Name-based aggregation has created
a novel synthesis that nobody believes in
"Just bad"

The (formerly)
federal 'standard'
The 'best', latest
regional flora
"Just
bad"
Expert views
are in conflict
Solution:
Instead of aggregating
an artificial 'consensus',
…

The (formerly)
federal 'standard'
The 'best', latest
regional flora
"Just
bad"
Expert views
are reconciled
Solution:
Instead of aggregating
an artificial 'consensus',
build translation services

Challenges:
How can we redesign aggregation to yield
high-quality biodiversity data packages?
What does this mean for Darwin Core1
and how we use this aggregation standard?
1 Wieczorek et al. 2012. Darwin Core: an evolving […]. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715

Preview of solution with eight steps
• DwC is insufficient, and part of the problem

# 1: Represent only taxonomic concept labels (TCLs) 1
• Syntax (TCL): taxonomic name [author, year, page] sec. source
1 Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX
Cleistes divaricata
sec. Gregg & Catling 1993
Pogonia
sec. Brown & Wunderlin 1997
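A minimal sketch of the TCL syntax above, assuming a label can be split at its first " sec. " marker. The class name `TCL` and the helper `parse_tcl` are illustrative and not part of Darwin Core or Euler/X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCL:
    """A taxonomic concept label: a name as used according to (sec.) one source."""
    name: str    # taxonomic name, optionally with [author, year, page]
    source: str  # the treatment to which this usage of the name is attributed

    def __str__(self) -> str:
        return f"{self.name} sec. {self.source}"


def parse_tcl(label: str) -> TCL:
    """Split 'name sec. source' at the first ' sec. ' marker (an assumed convention)."""
    name, marker, source = label.partition(" sec. ")
    if not marker or not name.strip() or not source.strip():
        raise ValueError(f"not a taxonomic concept label: {label!r}")
    return TCL(name.strip(), source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = parse_tcl("Pogonia sec. Brown & Wunderlin 1997")
print(a.name, "|", a.source)
print(b)
```

A bare name such as "Cleistes divaricata" is rejected, which is the point of step 1: the source is a required part of the label.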

# 1: DwC score keeping → TCLs are optional; < 1% realized?
• TCL ~ DwC: nameAccordingTo
• SCAN: 19,722 of nearly 9 million records have TCLs (0.2%)
• Without enforced use of TCLs, the standard is less big-data-ready
"Who authors GBIF's Backbone?"
https://storify.com/taxonbytes/who-authors-gbif-s-backbone

# 2: Represent each source coherently (Parent-Child relationships)
• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]
Cleistesiopsis bifaria sec. Pans. & de Barr. 2008
is a child of
Cleistesiopsis sec. Pans. & de Barr. 2008
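The parent-child syntax can be sketched as follows, assuming each TCL is a simple (name, source) pair. The names `TCL`, `ParentChild`, and `is_child_of` are hypothetical; the only rule encoded is the one stated above, namely that both TCLs must share the same source.

```python
from typing import NamedTuple


class TCL(NamedTuple):
    name: str
    source: str


class ParentChild(NamedTuple):
    child: TCL
    parent: TCL


def is_child_of(child: TCL, parent: TCL) -> ParentChild:
    """Record a parent-child relationship; both TCLs must come from one source."""
    if child.source != parent.source:
        raise ValueError("parent and child TCLs must share the same source")
    return ParentChild(child, parent)


source = "Pans. & de Barr. 2008"
pc = is_child_of(TCL("Cleistesiopsis bifaria", source), TCL("Cleistesiopsis", source))
print(f"{pc.child.name} is a child of {pc.parent.name} (both sec. {source})")
```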

# 2: DwC score keeping → Not (adequately) represented
• PC ~ DwC: genus, family, order (etc.; higherClassification)
• However, higher-level names in DwC are not modeled as TCLs
• Taxonomic coherence of sources cannot be preserved with DwC alone
DwC record with higherClassification
(BDJ)

# 3: Do not force a single hierarchy onto all tip-level TCLs
• Syntax (PC): Tip-level TCL1 , TCL2 , etc. [where TCL1/2 = different sources]

# 3: DwC score keeping → Optional; not (ever?) practiced
• No PC ~ DwC: infra-/specificEpithet only
• Typically, a single, 'unitary' higher-level classification is represented
• Combinations of algorithmic and social practices achieve the single hierarchy
"Who authors GBIF's Backbone?"
https://storify.com/taxonbytes/who-authors-gbif-s-backbone

# 4: Link TCLs via expert-provided RCC–5 articulations
• Syntax (RCC–5): TCL1 {==, >, <, ><, !} TCL2 [where TCL1/2 = diff. sources]
• RCC–5 = Region Connection Calculus
• 14 articulations provided by: http://tinyurl.com/Weakley-Flora-2015
Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004
== (is congruent with)
Cleistesiopsis oricamporum sec. Brown & Pans. 2009
==

Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf
Region Connection Calculus (semantics: set constraints)
== < > >< !
• Two regions N, M are either:
• congruent (N == M)
• properly inclusive (N < M)
• inversely properly inclusive (N > M)
• overlapping (N >< M)
• exclusive of each other (N ! M)
• RCC–5 articulations answer the query: "can we join regions N and M?"
• Taxonomies have multiple RCC–5 alignable components: nodes (parents,
children), node-associated traits, even node-anchoring specimens
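The five relations can be illustrated with plain sets, assuming each region is given as an explicit, non-empty set of members (for example, specimen identifiers). This is only a sketch of the set semantics; it is not how Euler/X infers articulations, and the function name `rcc5` is illustrative.

```python
def rcc5(n: set, m: set) -> str:
    """Return the RCC-5 relation between two non-empty regions given as sets."""
    if not n or not m:
        raise ValueError("RCC-5 regions must be non-empty")
    if n == m:
        return "=="   # congruent
    if n < m:
        return "<"    # N is properly included in M
    if n > m:
        return ">"    # N properly includes M
    if n & m:
        return "><"   # overlapping, neither includes the other
    return "!"        # exclusive of each other


print(rcc5({1, 2}, {1, 2}))     # congruent
print(rcc5({1}, {1, 2}))        # properly inclusive
print(rcc5({1, 2}, {2, 3}))     # overlapping
print(rcc5({1}, {2}))           # exclusive
```

Exactly one of the five relations holds for any pair of non-empty regions, which is what makes an articulation a usable answer to "can we join regions N and M?".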

# 4: DwC score keeping → Not (adequately) represented
• RCC–5 ~ DwC: accepted(Scientific)Name(Usage), relationshipOfResource,
taxonomicStatus (etc.; nomenclatural relationships)
• Nomenclatural relationships are type-focused, not region-focused
• "Taxonomic Concept Schema" → yes! (however: http://www.tdwg.org/standards/117)
Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063
Example:
Milkweed butterflies

Oscillating meanings of the epithet hyalites – 1911 to 2003
Phenotypic diversity
Type-anchored name identity relations
Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063

# 5: Identify occurrence records only to TCLs
Records:
EKY39235
MTSU003611
NCSC00040204
…
Records:
BOON8098
CLEMS0061133
WILLI39399
…
Records:
GMUF-0039355
IBE006808
USCH58399
…
Records:
CONV0006268
MDKY00006482
NCU00038930
…
Records:
BRYV0023582, BRYV0023584
KHD00032030, MISS0016604
MMNS000227, NCSC00040206
USMS_000002923, USMS_000002924
VSC0053223, VSC0065528
…
Records:
ARIZ393087
DBG39049
USCH51217
…
Records:
NCU00040710
USCH96248
VSC0053218
…
Records:
CLEMS0012881
FUGR0003293
GA023130
…
Records:
BOON8100
NCSC00040210
SJNM45487
…
Records:
GA023144
LSU00012494
MISS0016608
…
Records:
IBE006810, IND-0012374, MMNS000227
Records:
NY8654
• Syntax (ID): Occurrence / organism is identified to TCL
"CLEMS0012881"
is identified to
Cleistes divaricata sec. Smith et al. 2004
[additional ID metadata]
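A minimal sketch of an occurrence-to-TCL identification, assuming that a label without a " sec. source" part should be rejected outright. The class `Identification` and its `metadata` field are hypothetical; the example record and TCL are the ones shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Identification:
    """Links one occurrence record to one TCL, never to a bare name."""
    occurrence_id: str
    tcl: str                                      # full label: 'name sec. source'
    metadata: dict = field(default_factory=dict)  # placeholder for additional ID metadata

    def __post_init__(self) -> None:
        name, marker, source = self.tcl.partition(" sec. ")
        if not (name.strip() and marker and source.strip()):
            raise ValueError(f"{self.tcl!r} is a name, not a TCL (missing 'sec. source')")


ident = Identification("CLEMS0012881", "Cleistes divaricata sec. Smith et al. 2004")
print(f'"{ident.occurrence_id}" is identified to {ident.tcl}')
```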

# 5: DwC score keeping → ID metadata optional; > 50% realized
• ID ~ DwC: Identification, (date)identified(By), identificationReference
• SCAN: 4,715,277 of nearly 9 million records have ID metadata (52.5%)
• Enforcement…still also require use of TCLs
DwC record with Identification metadata
(BDJ)

# 6: Generate comprehensive, consistent RCC–5 alignments
• Euler/X is a toolkit that infers logically consistent RCC–5 alignments

• Value added: MIR – set of Maximally Informative Relations containing
the RCC–5 articulation for every possible TCL pair → scalability
Reasoner inference

# 7: Joining occurrence-to-TCL identifications & RCC–5 alignments
Records:
BOON8098, CLEMS0061133, CONV0006268, EKY39235
GMUF-0039355, IBE006808, IBE006810, IND-0012374
MDKY00006482, MMNS000227, MTSU003611, NCSC00040204
NCU00038930, NY8654, USCH58399, WILLI39399
…
Records:
ARIZ393087, BRYV0023582, BRYV0023584, DBG39049
KHD00032030, MISS0016604, MMNS000227, NCSC00040206
USMS_000002923, USMS_000002924, VSC0053223, VSC0065528
…
Records:
BOON8100, CLEMS0012881, FUGR0003293
GA023130, GA023144, LSU00012494
MISS0016608, NCSC00040210, NCU00040710
SJNM45487, USCH96248, VSC0053218
…
• Specimen integration is fully driven by TCL-to-TCL RCC–5 signals
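The join can be sketched as a lookup from each record's TCL to its articulated counterparts. This assumes a simple translation rule that is not stated in the slides: "==" and "<" place a record in the target concept with certainty, ">" and "><" make the target only possible, and "!" excludes it. The record identifier `rec-001`, the second articulation, and the function `translate` are hypothetical; the first articulation is the one quoted under step 4.

```python
from collections import defaultdict

COASTAL = 'Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004'

# (TCL in source taxonomy, RCC-5 relation, TCL in target taxonomy)
ARTICULATIONS = [
    (COASTAL, "==", "Cleistesiopsis oricamporum sec. Brown & Pans. 2009"),
    ("Taxon X sec. Author A", ">", "Taxon X1 sec. Author B"),  # hypothetical placeholder
]

# record identifier -> TCL it was identified to (both records are hypothetical)
IDENTIFICATIONS = {"rec-001": COASTAL, "rec-002": "Taxon X sec. Author A"}


def translate(identifications, articulations):
    """Map each record to target TCLs it certainly or possibly belongs to."""
    targets = defaultdict(list)
    for tcl_a, relation, tcl_b in articulations:
        targets[tcl_a].append((relation, tcl_b))

    result = {}
    for record, tcl_a in identifications.items():
        certain, possible = [], []
        for relation, tcl_b in targets.get(tcl_a, []):
            if relation in ("==", "<"):
                certain.append(tcl_b)    # source region lies within the target
            elif relation in (">", "><"):
                possible.append(tcl_b)   # record may or may not fall in the target
            # "!" means the target is excluded, so nothing is recorded
        result[record] = {"certain": certain, "possible": possible}
    return result


translated = translate(IDENTIFICATIONS, ARTICULATIONS)
for record, outcome in translated.items():
    print(record, outcome)
```

No record is re-identified here: the original identification stays intact, and the translation is derived from the articulations alone.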

The (formerly)
federal 'standard'
The 'best', latest
regional flora
Impact:
"Please select your preference (A – D);
we can perform all translations"

# 8: "Do you trust us now?" Aggregation as a translational service
• We can now respond to queries such as:
• "Show all specimens identified to the taxonomic name Cleistes divaricata"
• Returns many records → resolves incongruent lineage of name usages

• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"
• Returns record subset → resolving only one narrowly circumscribed concept
• "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968,
yet translated into the more granular TCLs sec. Weakley 2015"
• Returns (again) many records, yet represents and contrasts two treatments,
as opposed to providing the ambiguous lineage view (above)
• "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
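The difference between the name-based and the TCL-based query can be sketched with a small in-memory list. Only the first identification comes from the slides; the records `rec-A` to `rec-C` and the functions `by_name` and `by_tcl` are hypothetical.

```python
# (record identifier, TCL the record is identified to)
IDENTIFICATIONS = [
    ("CLEMS0012881", "Cleistes divaricata sec. Smith et al. 2004"),  # from the slides
    ("rec-A", "Cleistes divaricata sec. RAB 1968"),                  # hypothetical
    ("rec-B", "Cleistes divaricata sec. Gregg & Catling 1993"),      # hypothetical
    ("rec-C", "Cleistesiopsis divaricata sec. Weakley 2015"),        # hypothetical
]


def by_name(name: str) -> list:
    """Name-based query: ignores the source, so distinct concepts are pooled."""
    return [rec for rec, tcl in IDENTIFICATIONS if tcl.partition(" sec. ")[0] == name]


def by_tcl(label: str) -> list:
    """TCL-based query: matches one name usage according to one source."""
    return [rec for rec, tcl in IDENTIFICATIONS if tcl == label]


print(by_name("Cleistes divaricata"))
print(by_tcl("Cleistesiopsis divaricata sec. Weakley 2015"))
```

The name-based query pools records identified under three different treatments, whereas the TCL-based query returns only the records tied to a single circumscription.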

Conclusion – designing trusted biodiversity data services
• The Darwin Core standard for aggregating biodiversity data:
(1) Has under-utilized options for better representing taxonomic expertise
(2) Is part of a design paradigm that undermines the plurality of expertise

• Solutions are in development that realize data aggregation via translational
services – not as disenfranchising "backbones" – and without disrupting the
formation of expert-licensed, high-quality biodiversity data packages
• All of us – not just aggregators – "own" the responsibility of designing
systems where the plurality of taxonomic expertise is fairly accommodated

Acknowledgments & links to products
• Cleistes use case: Alan Weakley (UNC)
• Euler/X toolkit: Shizhuo Yu (UC Davis)
• Other data issues, discussions: Andrew Johnston, Guanyang Zhang
• NSF DEB–1155984, DBI–1342595 (PI Franz)
• NSF IIS–118088, DBI–1147273 (PI Ludäscher)
• Euler/X code @ https://github.com/EulerProject/EulerX
• Franz et al. 2016. Two influential primate classifications logically aligned.
Systematic Biology 65(4): 561–582. Link

Interested in exploring
multi-taxonomy and/or
-phylogeny alignments?
Please contact me.
nico.franz@asu.edu
@taxonbytes
https://biokic.asu.edu/
