Franz et al. 2012. Reconciling Succeeding Classifications, ESA 2012

Reconciling succeeding taxonomic classifications

Nico M. Franz
School of Life Sciences, Arizona State University

Mingmin Chen, Shizhuo Yu, Bertram Ludäscher *
Department of Computer Science, University of California at Davis
ESA Annual Meeting 2012
November 14, 2012 – Knoxville, TN
* PI – NSF-IIS 1118088: A logic-based, provenance-aware system for merging scientific data under context and classification constraints.

Challenge – describing classification provenance beyond synonymy
Andropogon spp. in the Carolinas, from Hackel 1889 to Weakley 2005

Source: Weakley. 2005. Flora of the Carolinas, Virginia, and Georgia. Available at http://www.herbarium.unc.edu/flora.htm


Individual columns represent past classifications of Andropogon.

Individual rows represent equivalent taxonomic entities, (almost)
regardless of their name labels.
Name/synonymy relationships are not sufficiently granular to
capture this evolution of taxonomic views of Andropogon species.

Tracking classification provenance with concepts and articulations
Definition: A taxonomic concept is the underlying meaning of a scientific name as stated
by a particular author and publication. It represents the author's full-blown
view of how the name reaches out to un-/observed objects in nature.

Labeling: The abbreviation sec. for the Latin secundum, or "according to", is preceded by
the full Linnaean name and followed by the specific author and publication.

Examples: Andropogon virginicus L. sec. Radford et al. (1968) [earlier, wider concept]
Andropogon virginicus L. sec. Weakley (2005) [later, narrower concept]

Utility: Representing multiple classifications (revisions) through concepts makes it possible
to track their similarities and differences through articulations.

Source: Berendsohn. 1995. The concept of "potential taxa" in databases. Taxon 44: 207–212.
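
The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the class name `TaxonomicConcept` and its fields are assumptions made here, not part of CleanTax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonomicConcept:
    """Hypothetical record: one name as used by one author/publication."""
    name: str    # full Linnaean name, e.g. "Andropogon virginicus L."
    source: str  # author and publication, e.g. "Weakley (2005)"

    def label(self) -> str:
        # "sec." (secundum) joins the name to the source that defines it
        return f"{self.name} sec. {self.source}"


wider = TaxonomicConcept("Andropogon virginicus L.", "Radford et al. (1968)")
narrower = TaxonomicConcept("Andropogon virginicus L.", "Weakley (2005)")

print(wider.label())
print(narrower.label())
# Same name, different sources: two distinct concepts
print(wider == narrower)
```

Keeping the source inside the record is what lets two uses of the same name be compared as separate concepts.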

Five basic articulations between two concepts C1, C2 (set theory)

equivalence
proper inclusion
inverse proper inclusion
overlap
exclusion

Use of "OR" to express uncertainty.
Example: C1 == OR > C2

Source: Franz & Peet. 2009. Towards a language for mapping relationships among taxonomic concepts. Syst. Biodiv. 7: 5–20.
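
A minimal sketch of the set-theoretic reading of the five articulations, assuming each concept can be stood in for by a non-empty set of member identifiers. The function names and the sample members are hypothetical, and "proper inclusion" is read here as "C1 properly includes C2".

```python
def articulation(c1, c2) -> str:
    """Return the one basic articulation that holds between two concepts,
    each modeled (as an assumption) by a non-empty set of members."""
    c1, c2 = frozenset(c1), frozenset(c2)
    if not c1 or not c2:
        raise ValueError("concepts are assumed to be non-empty")
    if c1 == c2:
        return "equivalence"
    if c1 > c2:                      # proper superset
        return "proper inclusion"
    if c1 < c2:                      # proper subset
        return "inverse proper inclusion"
    if c1 & c2:                      # shared members, neither contains the other
        return "overlap"
    return "exclusion"


def satisfies(c1, c2, allowed) -> bool:
    """Check an uncertain articulation such as 'C1 == OR > C2',
    given as a set of allowed basic articulations."""
    return articulation(c1, c2) in allowed


# Hypothetical members, for illustration only
wider = {"s1", "s2", "s3"}
narrower = {"s1", "s2"}

print(articulation(wider, narrower))
print(satisfies(wider, narrower, {"equivalence", "proper inclusion"}))
```

For non-empty sets exactly one of the five cases applies, which is why a disjunction ("OR") is needed to express uncertainty.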

How does it work? Connecting Hackel 1889 and Small 1933
Step 1: Transcribe two concept hierarchies…
Hackel 1889 (1-12)

Small 1933 (13-16)

…and add unique IDs

Step 2: Create a table with all concept labels
Hackel 1889 (1-12)

Small 1933 (13-16)

Step 3: Create a table with corresponding parent/child relationships ('is_a')
Hackel 1889 (1-12)

Small 1933 (13-16)

Step 4: Create a table with a suitable set of articulations
Hackel 1889 (1-12)

Small 1933 (13-16)
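
Steps 2-4 could be represented as three small tables, sketched below. The ID ranges (1-12 for Hackel 1889, 13-16 for Small 1933) come from the slides. The rows, labels, column names, and the `check_tables` helper are placeholders assumed for illustration, not the actual CleanTax input.

```python
from typing import NamedTuple


class Concept(NamedTuple):       # Step 2: concept labels with unique IDs
    concept_id: int
    label: str
    taxonomy: str


class IsA(NamedTuple):           # Step 3: parent/child ('is_a') relationships
    child_id: int
    parent_id: int


class Articulation(NamedTuple):  # Step 4: articulations across taxonomies
    id1: int
    relations: frozenset         # more than one entry expresses "OR"
    id2: int


# Placeholder rows only; the real concept labels are not reproduced here
concepts = [
    Concept(1, "placeholder parent", "Hackel 1889"),
    Concept(2, "placeholder child", "Hackel 1889"),
    Concept(13, "placeholder parent", "Small 1933"),
    Concept(14, "placeholder child", "Small 1933"),
]
is_a = [IsA(2, 1), IsA(14, 13)]
articulations = [
    Articulation(1, frozenset({"=="}), 13),
    Articulation(2, frozenset({"==", ">"}), 14),
]


def check_tables(concepts, is_a, articulations):
    """Basic sanity checks: is_a stays within one taxonomy,
    articulations connect two different taxonomies."""
    taxonomy_of = {c.concept_id: c.taxonomy for c in concepts}
    problems = []
    for edge in is_a:
        child = taxonomy_of.get(edge.child_id)
        parent = taxonomy_of.get(edge.parent_id)
        if child is None or parent is None:
            problems.append(f"{edge} uses an unknown concept ID")
        elif child != parent:
            problems.append(f"{edge} crosses taxonomies")
    for art in articulations:
        left = taxonomy_of.get(art.id1)
        right = taxonomy_of.get(art.id2)
        if left is None or right is None:
            problems.append(f"{art} uses an unknown concept ID")
        elif left == right:
            problems.append(f"{art} stays within one taxonomy")
    return problems


problems = check_tables(concepts, is_a, articulations)
print(problems)
```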

Translation
Congruence

Concept hierarchies

Articulations

Technical challenges to creating articulations
Input of concept hierarchies
Lack of a server-based platform (e.g. Global Names Architecture)

Lack of user-friendly classification input / visualization tools


Input of articulations (goal: achieve a complete and consistent mapping)
Taxonomic experts will not input ∞ articulations
Taxonomic experts will miss relevant articulations ("mir")
Taxonomic experts could be uncertain of articulations ("possible worlds")
Taxonomic experts could posit logically inconsistent articulations

"CleanTax" is being developed to explore solutions to these challenges. 1

1 There is continuation/overlap with the "Exploring Taxonomic Concepts" project that focuses on character matching (DBI-1147266).

CleanTax – technical specifications
CleanTax = a set of Python programming scripts stored on bitbucket.org
(initially developed by Dave Thau; now being developed further on many fronts)
CleanTax reads in concept/articulation tables from a PostgreSQL database
CleanTax transforms the input for processing by logic reasoners; including:
Prover9 / Mace4 theorem provers – first-order logic [thorough, yet slow]
OWL / HermiT – description logic, knowledge representation [complex]
DLV System – propositional logic, answer set programming [promising!]

CleanTax assesses consistency and completeness of articulations
Output of the set of maximally informative relationships – "mir"

Report, causal explanation, interactive repair of inconsistent articulations
Calculate multiple possible worlds (if ambiguous articulations are present)
CleanTax creates multiple user-preferred views of the input and merge taxonomies
Reduced Containment Graph – RCG; and Directed Acyclic Graph – DAG

'Training' CleanTax on abstract examples

New!

Initial expert-made
set of articulations

Input

Output – raw html list of articulations ("look-up" + inferred)

Input

Output – 72 maximally informative relationships = mir

Based on the mir, all theoretically possible articulations
of the R32 lattice can be logically deduced.
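
A sketch of the lattice mentioned above, assuming "R32" denotes all 2^5 = 32 subsets (disjunctions) of the five basic articulations, including the empty set. The names `BASE`, `r32_lattice`, and `more_informative` are introduced here for illustration. The ordering used (a smaller set of alternatives is more informative) is one reading of "maximally informative", not a statement of how CleanTax computes the mir.

```python
from itertools import combinations

BASE = (
    "equivalence",
    "proper inclusion",
    "inverse proper inclusion",
    "overlap",
    "exclusion",
)


def r32_lattice():
    """All subsets of the five basic articulations (assumed meaning of R32)."""
    return [
        frozenset(combo)
        for size in range(len(BASE) + 1)
        for combo in combinations(BASE, size)
    ]


def more_informative(r1: frozenset, r2: frozenset) -> bool:
    """r1 rules out strictly more alternatives than r2."""
    return r1 < r2


lattice = r32_lattice()
print(len(lattice))

certain = frozenset({"equivalence"})
uncertain = frozenset({"equivalence", "proper inclusion"})
print(more_informative(certain, uncertain))
```

Under this reading, any weaker (larger) disjunction follows from a more informative relationship, so the weaker ones need not be listed separately.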

Abstract Example 1 – Reduced Containment Graph of the merge
Input
Blue circles – shared concepts
Black circles – unique concepts
Black solid arrows – expert input
Grey dashed arrows – deducible
Red solid arrows – newly inferred

More CleanTax training… our infamous Abstract Example 4
Example 4 – representing multiple 'possible worlds'

3/5 articulations are disjunctive (OR)

Reduced Containment Graphs of 7 'possible worlds' (combined ORs)
Example 4 – CleanTax infers 7 possible worlds (user can view / select / repair / rerun)
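
How combined ORs give rise to several candidate worlds can be sketched as a Cartesian product over the alternatives of each articulation. The concept names and articulations below are hypothetical and are not those of Example 4. The consistency check that a reasoner would apply to each candidate is not reproduced, so the count printed here is not the 7 reported above.

```python
from itertools import product


def candidate_worlds(articulations):
    """Enumerate every way of picking one alternative per articulation.
    `articulations` maps a (concept1, concept2) pair to a set of alternatives."""
    pairs = list(articulations)
    options = [sorted(articulations[pair]) for pair in pairs]
    return [dict(zip(pairs, choice)) for choice in product(*options)]


# Hypothetical input: 5 articulations, 3 of them disjunctive (OR)
example = {
    ("1.A", "2.A"): {"=="},
    ("1.B", "2.B"): {"==", ">"},
    ("1.C", "2.C"): {"<", "><"},
    ("1.D", "2.D"): {">", "><"},
    ("1.E", "2.E"): {"=="},
}

worlds = candidate_worlds(example)
print(len(worlds))  # 2 * 2 * 2 candidates before any consistency check
```

A reasoner would then discard candidates that contradict the concept hierarchies, leaving the possible worlds shown to the user.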

Asserted by expert
Implied articulations
Inferred by CleanTax
Shared concepts
Unique concepts
Reduced Containment Graphs (RCGs)

Exploring "views" of the merge - circular Euler diagrams of PW1
Table of mir

Corresponding Euler diagram (circular)
Identical information content

Correspondence of circular and Directed Acyclic Diagrams
PW1: Typical Euler circles

Euler-DAG of PW1

Identical information content

Real-life examples, I – reconciling two weevil classifications 1
Curculionoidea sec. Kuschel 1995 – concepts 117-157
Curculionoidea sec. Marvaldi & Morrone 2000 – concepts 348-372

1 Initial articulations provided by NMF.

Merge taxonomy of Kuschel 1995 / Marvaldi & Morrone 2000
CleanTax RCG – 1 newly inferred articulation + several inconsistencies

Microcerinae sec. M&M 2000 [363] are included in Brachycerinae sec. KU 1995 [148]
(yes, I missed that; Kuschel 1995 only mentions it in the text, not in the main taxon list)

Real-life examples, II – reconciling two weevil classifications
Curculionoidea sec. Crowson 1981 – concepts 1-17
Curculionoidea sec. Marvaldi & Morrone 2000 – concepts 348-372

Merge taxonomy of Crowson 1981 / Marvaldi & Morrone 2000
CleanTax RCG – 4 newly inferred articulations / does not depict overlap (><)

e.g. {Aglycyderidae [2], Allocorynidae [3], Oxycorynidae [17]} sec. Crowson 1981
are included in Belidae [353] sec. M&M 2000

Euler-DAG of the Crowson / Marvaldi & Morrone merge taxonomy
Solid lines – proper inclusion
Black solid line given
Green solid line inferred
Orange solid line explanatory
[Red solid line inconsistent]
Dashed lines - overlap
Black dashed line given
Green dashed line inferred
Orange dashed line explanatory
Red dashed line inconsistent
Concept boxes - concepts
Orange square box shared
Black square box unique
Dashed square box combined
Dashed oval box inconsistent

DAGs generate "combined concepts"
Belidae sec. MM2000
Belidae sec. Cro1981

Intersections of overlaps: "Belidae" INT(Cro/MM)
Shared - [2,3,17,357]

New naming/viewing conventions – simple merges (shared, unique) *
Input
Concept A: A; Attelabidae CR81; AttCR81 [9]
Concept B: B; Attelabidae MM00; AttMM00 [55]

Output
Concept A – Concept B: AB; Attelabidae CR81 – Attelabidae MM00; AttCR81.AttMM00

* Simple extension to three or more congruent concepts.

New naming/viewing conventions – combined merges (overlap; T1, T2)
Input
Concept A: A; Belidae CR81; BelCR81 [10]
Concept B: B; Belidae MM00; BelMM00 [353]

Euler (circles A and B)
Ab: BelCR81.belMM00
AB: BelCR81.BelMM00
aB: BelMM00.belCR81

DAG
Ab, AB, aB
New naming/viewing conventions – combined merges (overlap; T1, T2, T3)
Input
Concept A: A; Curculionidae CR81; CurCR81
Concept B: B; Curculionidae KU95; CurKU95
Concept C: C; Curculionidae s.s. MM00; CurMM00

Euler
ABc: CurCR81.CurKU95.curMM00
Abc: CurCR81.curKU95.curMM00
aBc: CurKU95.curCR81.curMM00
ABC: CurCR81.CurKU95.CurMM00
AbC: CurCR81.CurMM00.curKU95
aBC: CurKU95.CurMM00.curCR81
abC: CurMM00.curCR81.curKU95

DAG
A, B, C
Abc, ABc, aBc, AbC, ABC, aBC, abC
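
The naming convention for combined concepts appears to follow a simple pattern: included concepts keep their capitalized short name and come first, excluded concepts follow with a lower-case initial. The sketch below generates the labels on that assumption. The function `region_labels` is hypothetical and not part of CleanTax.

```python
from itertools import product


def region_labels(short_names):
    """Map region keys such as 'Ab' to dotted labels such as
    'BelCR81.belMM00' for every non-empty combination of overlapping
    concepts. Intended for a handful of concepts (letters A, B, C, ...)."""
    letters = [chr(ord("A") + i) for i in range(len(short_names))]
    labels = {}
    for membership in product([True, False], repeat=len(short_names)):
        if not any(membership):
            continue  # the region outside all concepts gets no label
        key = "".join(
            letter if inside else letter.lower()
            for letter, inside in zip(letters, membership)
        )
        included = [n for n, inside in zip(short_names, membership) if inside]
        excluded = [
            n[0].lower() + n[1:]
            for n, inside in zip(short_names, membership)
            if not inside
        ]
        labels[key] = ".".join(included + excluded)
    return labels


two = region_labels(["BelCR81", "BelMM00"])
three = region_labels(["CurCR81", "CurKU95", "CurMM00"])

print(two["Ab"])
print(three["abC"])
print(len(three))
```

For n overlapping input concepts this yields up to 2^n - 1 combined concepts (3 for T1, T2; 7 for T1, T2, T3), which matches the region counts on the slides.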

Current workflow / "usability" (CleanTax on "Lore" server, UC Davis)

Input script
Possible worlds

Visualization
Euler-DAG
Output file

Inconsistency
Repair, explanation

Interactive reduction of PWs (decision tree)

Shared, real use cases (Perelleschus) with ETC feature-based project
5 taxonomies, 48 concepts, expert articulations, plus textual feature diagnoses

Conclusions and outlook
Improvements to CleanTax will remove many of the technical challenges towards a
full-blown taxon concept approach (improved tracking of classification provenance).

Other technical challenges are being addressed (server platform, algorithmic
scalability, intensional/ostensive articulations, visualization [Euler, combined
concepts], workflow integration).
Many non-technical challenges remain (in short: transparent/consistent use).


The current approach treats concepts as a 'black box' – the input data are simple and
make no reference to type specimens, synapomorphies, diagnostic features, etc.
The "Exploring Taxonomic Concepts" project will develop tools for a balanced view.

Nevertheless, the articulations can expose deep and varied semantic links among
succeeding classifications.
CleanTax may be the first attempt to 'explain' classification provenance to logic
reasoners. This could have considerable implications for future data integration.

Acknowledgments
Shawn Bowers, Dave Thau, Alan Weakley
NSF-IIS 1118088: "III-SMALL: A logic-based, provenance-aware system for merging scientific data under
context and classification constraints"

"Euler" team, UC Davis
